A collection of notes on HTS (NGS) informatics.

CONCOCT, a tool for binning metagenome assemblies

2021-04-28: commands added


Shotgun sequencing enables the reconstruction of genomes from complex microbial communities, but since complete genomes usually cannot be recovered, the genome fragments must be binned. The paper presents CONCOCT, an algorithm that combines sequence composition with coverage across multiple samples to cluster contigs into genomes automatically. On simulated and real human gut metagenome datasets it demonstrates high recall and precision.
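CONCOCT's "genomic signature" is the k-mer (tetramer by default) frequency vector of each contig, which is combined with per-sample coverage before clustering. As a rough illustration only (the real implementation also merges reverse-complement k-mers and adds pseudocounts), a normalized tetramer profile can be computed like this:

```python
# Illustrative sketch of a k-mer "genomic signature"; NOT CONCOCT's actual code.
from itertools import product

def kmer_profile(seq, k=4):
    """Return the normalized k-mer frequency vector of a sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {km: 0 for km in kmers}
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        if km in counts:          # skip windows containing N or other symbols
            counts[km] += 1
            total += 1
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]

profile = kmer_profile("ACGTACGTACGT")   # 256-dimensional vector summing to 1
```

Each contig's composition vector, concatenated with its coverage across samples (after normalization and PCA), is what the variational Gaussian mixture model then clusters.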





Tested in a conda virtual environment on Ubuntu 16.04 (inside Docker; host OS Ubuntu 18.04).



#Bioconda (see link)

Source: GitHub

conda install -c bioconda -y concoct

# or, install into a dedicated conda environment
conda create -n concoct_env -c bioconda python=2.7 concoct
source activate concoct_env

> concoct -h

usage: concoct [-h] [--coverage_file COVERAGE_FILE]
               [--composition_file COMPOSITION_FILE] [-c CLUSTERS]
               [-k KMER_LENGTH] [-l LENGTH_THRESHOLD] [-r READ_LENGTH]
               [--total_percentage_pca TOTAL_PERCENTAGE_PCA] [-b BASENAME]
               [-s SEED] [-i ITERATIONS] [-e EPSILON] [--no_cov_normalization]
               [--no_total_coverage] [--no_original_data] [-o] [-d] [-v]

optional arguments:
  -h, --help            show this help message and exit
  --coverage_file COVERAGE_FILE
                        specify the coverage file, containing a table where
                        each row correspond to a contig, and each column
                        correspond to a sample. The values are the average
                        coverage for this contig in that sample. All values
                        are separated with tabs.
  --composition_file COMPOSITION_FILE
                        specify the composition file, containing sequences in
                        fasta format. It is named the composition file since
                        it is used to calculate the kmer composition (the
                        genomic signature) of each contig.
  -c CLUSTERS, --clusters CLUSTERS
                        specify maximal number of clusters for VGMM, default
                        400.
  -k KMER_LENGTH, --kmer_length KMER_LENGTH
                        specify kmer length, default 4.
  -l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
                        specify the sequence length threshold, contigs shorter
                        than this value will not be included. Defaults to
                        1000.
  -r READ_LENGTH, --read_length READ_LENGTH
                        specify read length for coverage, default 100
  --total_percentage_pca TOTAL_PERCENTAGE_PCA
                        The percentage of variance explained by the principal
                        components for the combined data.
  -b BASENAME, --basename BASENAME
                        Specify the basename for files or directory where
                        output will be placed. Path to existing directory or
                        basename with a trailing '/' will be interpreted as a
                        directory. If not provided, current directory will be
                        used.
  -s SEED, --seed SEED  Specify an integer to use as seed for clustering. 0
                        gives a random seed, 1 is the default seed and any
                        other positive integer can be used.
  -i ITERATIONS, --iterations ITERATIONS
                        Specify maximum number of iterations for the VBGMM.
                        Default value is 500
  -e EPSILON, --epsilon EPSILON
                        Specify the epsilon for VBGMM. Default value is 1.0e-6
  --no_cov_normalization
                        By default the coverage is normalized with regards to
                        samples, then normalized with regards of contigs and
                        finally log transformed. By setting this flag you skip
                        the normalization and only do log transform of the
                        coverage data.
  --no_total_coverage   By default, the total coverage is added as a new
                        column in the coverage data matrix, independently of
                        coverage normalization but previous to log
                        transformation. Use this tag to escape this behaviour.
  --no_original_data    By default the original data is saved to disk. For big
                        datasets, especially when a large k is used for
                        compositional data, this file can become very large.
                        Use this tag if you don't want to save the original
                        data.
  -o, --converge_out    Write convergence info to files.
  -d, --debug           Debug parameters.
  -v, --version         show program's version number and exit

> python CONCOCT/scripts/cut_up_fasta.py -h

usage: cut_up_fasta.py [-h] [-c CHUNK_SIZE] [-o OVERLAP_SIZE] [-m]
                       [-b BEDFILE]
                       contigs [contigs ...]

Cut up fasta file in non-overlapping or overlapping parts of equal length.
Optionally creates a BED-file where the cutup contigs are specified in terms
of the original contigs. This can be used as input to
concoct_coverage_table.py.

positional arguments:
  contigs               Fasta files with contigs

optional arguments:
  -h, --help            show this help message and exit
  -c CHUNK_SIZE, --chunk_size CHUNK_SIZE
                        Chunk size
  -o OVERLAP_SIZE, --overlap_size OVERLAP_SIZE
                        Overlap size
  -m, --merge_last      Concatenate final part to last contig
  -b BEDFILE, --bedfile BEDFILE
                        BEDfile to be created with exact regions of the
                        original contigs corresponding to the newly created
                        contigs

> python CONCOCT/scripts/concoct_coverage_table.py -h

usage: concoct_coverage_table.py [-h] [--samplenames SAMPLENAMES]
                                 bedfile bamfiles [bamfiles ...]

A script to generate the input coverage table for CONCOCT using a BEDFile.
Output is written to stdout. The BEDFile defines the regions used as
subcontigs for concoct. This makes it possible to get the coverage for
subcontigs without specifically mapping reads against the subcontigs. @author:
inodb, alneberg

positional arguments:
  bedfile               Contigs BEDFile with four columns representing:
                        'Contig ID, Start Position, End Position and SubContig
                        ID' respectively. The Subcontig ID is usually the same
                        as the Contig ID for contigs which are not cutup. This
                        file can be generated by the cut_up_fasta.py script.
  bamfiles              BAM files with mappings to the original contigs.

optional arguments:
  -h, --help            show this help message and exit
  --samplenames SAMPLENAMES
                        File with sample names, one line each. Should be same
                        nr of bamfiles. Default sample names used are the file
                        names of the bamfiles, excluding the file extension.



docker pull binpro/concoct_latest





1. Cut the contigs into smaller chunks (10 kbp in the command below).

git clone https://github.com/BinPro/CONCOCT.git

python CONCOCT/scripts/cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa
  • -c    Chunk size
  • -o    Overlap size
  • -m, --merge_last      Concatenate final part to last contig
  • -b    BEDfile to be created with exact regions of the original contigs corresponding to the newly created contigs
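The chunking behaviour selected above (-c 10000 -o 0 --merge_last) can be sketched in a few lines of Python; this is an illustrative re-implementation of the idea, not the script itself:

```python
# Sketch of non-overlapping chunking with --merge_last semantics:
# the trailing remainder is appended to the final chunk instead of
# being emitted as a short extra chunk.
def cut_up(seq, chunk_size=10000, merge_last=True):
    if len(seq) <= chunk_size:
        return [seq]                      # short contigs stay whole
    n = len(seq) // chunk_size            # number of full chunks
    chunks = [seq[i * chunk_size:(i + 1) * chunk_size] for i in range(n)]
    rest = seq[n * chunk_size:]
    if rest:
        if merge_last:
            chunks[-1] += rest            # concatenate final part to last chunk
        else:
            chunks.append(rest)
    return chunks

parts = cut_up("A" * 25000, chunk_size=10000)
```

With merge_last, a 25 kbp contig yields chunks of 10 kbp and 15 kbp rather than 10/10/5, avoiding very short sub-contigs.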



2. Generate the coverage depth table. Beforehand, reads must have been mapped to the original fasta (e.g. with bowtie2) and the resulting BAM files sorted and indexed.

python CONCOCT/scripts/concoct_coverage_table.py contigs_10K.bed mapping/Sample*.sorted.bam > coverage_table.tsv
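The resulting coverage_table.tsv is a plain tab-separated table with one row per sub-contig and one column per sample. A quick sanity check can be done with the standard library; the column and contig names below are made up for illustration:

```python
# Toy stand-in for coverage_table.tsv (names are hypothetical).
import csv
import io

tsv = (
    "contig\tSample1\tSample2\n"
    "contig-1.0\t12.5\t3.0\n"
    "contig-2.0\t0.0\t8.2\n"
)
rows = list(csv.DictReader(io.StringIO(tsv), delimiter="\t"))

# Mean coverage of each sub-contig across all samples.
mean_cov = {
    r["contig"]: sum(float(v) for k, v in r.items() if k != "contig") / (len(r) - 1)
    for r in rows
}
```

To use it on the real file, replace the StringIO stand-in with `open("coverage_table.tsv")`.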



3. Run CONCOCT.

concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/



4. Merge the sub-contig clustering back onto the original contigs.

python CONCOCT/scripts/merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv
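Conceptually this merge step is a majority vote: each original contig is assigned the cluster that most of its sub-contigs landed in. A toy sketch of that idea (contig names and the ".part" suffix convention are hypothetical; the real logic lives in merge_cutup_clustering.py):

```python
# Illustrative majority-vote merge of sub-contig cluster assignments.
from collections import Counter

def merge_clustering(sub_assignments):
    """sub_assignments: {"<contig>.<part>": cluster_id} -> {contig: cluster_id}."""
    votes = {}
    for sub_id, cluster in sub_assignments.items():
        original = sub_id.rsplit(".", 1)[0]   # strip the ".<part>" suffix
        votes.setdefault(original, Counter())[cluster] += 1
    return {c: counts.most_common(1)[0][0] for c, counts in votes.items()}

merged = merge_clustering({"c1.0": 5, "c1.1": 5, "c1.2": 7, "c2.0": 3})
```

Here contig c1 is assigned cluster 5 (two of its three chunks voted for it) and c2 keeps cluster 3.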



5. Extract each bin as an individual fasta file.

mkdir concoct_output/fasta_bins
python CONCOCT/scripts/extract_fasta_bins.py original_contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins
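The extraction step conceptually just groups contigs by cluster id and writes one fasta per bin. A minimal sketch (in-memory inputs and output file naming are illustrative, not the script's exact behaviour):

```python
# Illustrative bin extraction: one fasta file per cluster id.
import os
import tempfile

def extract_bins(contigs, clustering, out_dir):
    """contigs: {contig_id: sequence}; clustering: {contig_id: cluster_id}."""
    bins = {}
    for cid, cluster in clustering.items():
        bins.setdefault(cluster, []).append(cid)
    for cluster, ids in bins.items():
        path = os.path.join(out_dir, "{}.fa".format(cluster))
        with open(path, "w") as fh:
            for cid in ids:
                fh.write(">{}\n{}\n".format(cid, contigs[cid]))
    return sorted(bins)

out_dir = tempfile.mkdtemp()
clusters = extract_bins({"c1": "ACGT", "c2": "GGGG"}, {"c1": 1, "c2": 2}, out_dir)
```

After this step, each fasta in concoct_output/fasta_bins is a candidate genome bin that can be checked with a completeness/contamination tool.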






Binning metagenomic contigs by coverage and composition.

Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C

Nat Methods. 2014 Nov;11(11):1144-6