2021-04-28: commands added
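Shotgun sequencing makes it possible to reconstruct genomes from complex microbial communities, but because complete genomes cannot be reconstructed, the genome fragments have to be binned. This paper presents CONCOCT, an algorithm that combines sequence composition and coverage across multiple samples to cluster contigs into genomes automatically. It shows high recall and precision on simulated data sets as well as on a real human gut metagenome data set.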
documentation
https://concoct.readthedocs.io/en/latest/
Installation
Tested in a conda virtual environment on Ubuntu 16.04 (running in docker; host OS Ubuntu 18.04).
Dependencies
python=2.7
# see the Bioconda link
Source: GitHub
# bioconda (link)
conda install -c bioconda -y concoct
# there are many dependencies, so creating a dedicated virtual environment is the safer option
conda create -n concoct_env -c bioconda python=2.7 concoct
source activate concoct_env
# concoct -h
usage: concoct [-h] [--coverage_file COVERAGE_FILE]
[--composition_file COMPOSITION_FILE] [-c CLUSTERS]
[-k KMER_LENGTH] [-l LENGTH_THRESHOLD] [-r READ_LENGTH]
[--total_percentage_pca TOTAL_PERCENTAGE_PCA] [-b BASENAME]
[-s SEED] [-i ITERATIONS] [-e EPSILON] [--no_cov_normalization]
[--no_total_coverage] [--no_original_data] [-o] [-d] [-v]
optional arguments:
-h, --help show this help message and exit
--coverage_file COVERAGE_FILE
specify the coverage file, containing a table where
each row correspond to a contig, and each column
correspond to a sample. The values are the average
coverage for this contig in that sample. All values
are separated with tabs.
--composition_file COMPOSITION_FILE
specify the composition file, containing sequences in
fasta format. It is named the composition file since
it is used to calculate the kmer composition (the
genomic signature) of each contig.
-c CLUSTERS, --clusters CLUSTERS
specify maximal number of clusters for VGMM, default
400.
-k KMER_LENGTH, --kmer_length KMER_LENGTH
specify kmer length, default 4.
-l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
specify the sequence length threshold, contigs shorter
than this value will not be included. Defaults to
1000.
-r READ_LENGTH, --read_length READ_LENGTH
specify read length for coverage, default 100
--total_percentage_pca TOTAL_PERCENTAGE_PCA
The percentage of variance explained by the principal
components for the combined data.
-b BASENAME, --basename BASENAME
Specify the basename for files or directory where
outputwill be placed. Path to existing directory or
basenamewith a trailing '/' will be interpreted as a
directory.If not provided, current directory will be
used.
-s SEED, --seed SEED Specify an integer to use as seed for clustering. 0
gives a random seed, 1 is the default seed and any
other positive integer can be used. Other values give
ArgumentTypeError.
-i ITERATIONS, --iterations ITERATIONS
Specify maximum number of iterations for the VBGMM.
Default value is 500
-e EPSILON, --epsilon EPSILON
Specify the epsilon for VBGMM. Default value is 1.0e-6
--no_cov_normalization
By default the coverage is normalized with regards to
samples, then normalized with regards of contigs and
finally log transformed. By setting this flag you skip
the normalization and only do log transorm of the
coverage.
--no_total_coverage By default, the total coverage is added as a new
column in the coverage data matrix, independently of
coverage normalization but previous to log
transformation. Use this tag to escape this behaviour.
--no_original_data By default the original data is saved to disk. For big
datasets, especially when a large k is used for
compositional data, this file can become very large.
Use this tag if you don't want to save the original
data.
-o, --converge_out Write convergence info to files.
-d, --debug Debug parameters.
-v, --version show program's version number and exit
$ python CONCOCT/scripts/cut_up_fasta.py -h
usage: cut_up_fasta.py [-h] [-c CHUNK_SIZE] [-o OVERLAP_SIZE] [-m]
[-b BEDFILE]
contigs [contigs ...]
Cut up fasta file in non-overlapping or overlapping parts of equal length.
Optionally creates a BED-file where the cutup contigs are specified in terms
of the original contigs. This can be used as input to concoct_coverage_table.py.
positional arguments:
contigs Fasta files with contigs
optional arguments:
-h, --help show this help message and exit
-c CHUNK_SIZE, --chunk_size CHUNK_SIZE
Chunk size
-o OVERLAP_SIZE, --overlap_size OVERLAP_SIZE
Overlap size
-m, --merge_last Concatenate final part to last contig
-b BEDFILE, --bedfile BEDFILE
BEDfile to be created with exact regions of the
original contigs corresponding to the newly created
contigs
$ python CONCOCT/scripts/concoct_coverage_table.py -h
usage: concoct_coverage_table.py [-h] [--samplenames SAMPLENAMES]
bedfile bamfiles [bamfiles ...]
A script to generate the input coverage table for CONCOCT using a BEDFile.
Output is written to stdout. The BEDFile defines the regions used as
subcontigs for concoct. This makes it possible to get the coverage for
subcontigs without specifically mapping reads against the subcontigs. @author:
inodb, alneberg
positional arguments:
bedfile Contigs BEDFile with four columns representing:
'Contig ID, Start Position, End Position and SubContig
ID' respectively. The Subcontig ID is usually the same
as the Contig ID for contigs which are not cutup. This
file can be generated by the cut_up_fasta.py script.
bamfiles BAM files with mappings to the original contigs.
optional arguments:
-h, --help show this help message and exit
--samplenames SAMPLENAMES
File with sample names, one line each. Should be same
nr of bamfiles. Default sample names used are the file
names of the bamfiles, excluding the file extension.
A docker image is also available.
docker pull binpro/concoct_latest
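Usage
0. Run the assembly with megahit.
1. Cut the contigs into smaller pieces of 10 kb (with --merge_last the final piece of a contig is appended to the previous one, so it can be longer).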
git clone https://github.com/BinPro/CONCOCT.git
python CONCOCT/scripts/cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa
- -c Chunk size
- -m, --merge_last Concatenate final part to last contig
- -b BEDfile to be created with exact regions of the original contigs corresponding to the newly created contigs
- -o Overlap size
This writes contigs_10K.bed and contigs_10K.fa.
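To make the effect of -c 10000 -o 0 --merge_last more concrete, here is a minimal Python sketch of the interval arithmetic. It is not the code of cut_up_fasta.py; the function names cut_intervals and bed_lines are hypothetical, the overlap is fixed at 0, and the ".0", ".1" suffix for subcontig IDs is an assumption (the actual naming may differ between CONCOCT versions). The four BED columns follow the description in the concoct_coverage_table.py help text.

```python
def cut_intervals(length, chunk_size=10000, merge_last=True):
    """Split a contig of `length` bp into consecutive, non-overlapping
    half-open intervals (start, end), as used in BED files."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    intervals = [(start, min(start + chunk_size, length))
                 for start in range(0, length, chunk_size)]
    tail_is_short = bool(intervals) and (
        intervals[-1][1] - intervals[-1][0]) < chunk_size
    if merge_last and len(intervals) > 1 and tail_is_short:
        # Fold the short final piece into the piece before it.
        tail = intervals.pop()
        prev = intervals.pop()
        intervals.append((prev[0], tail[1]))
    return intervals


def bed_lines(contig_id, length, chunk_size=10000, merge_last=True):
    """Four BED columns: contig ID, start, end, subcontig ID.
    Uncut contigs keep their own ID; the '.<index>' suffix is an assumption."""
    intervals = cut_intervals(length, chunk_size, merge_last)
    lines = []
    for i, (start, end) in enumerate(intervals):
        name = contig_id if len(intervals) == 1 else f"{contig_id}.{i}"
        lines.append(f"{contig_id}\t{start}\t{end}\t{name}")
    return lines


# Hypothetical 25 kb contig: two pieces, the second one 15 kb long.
for line in bed_lines("k141_1", 25000):
    print(line)
```

In this sketch a contig shorter than the chunk size is left as a single piece under its original ID.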
2. Generate the coverage depth table. Beforehand, the reads have to be mapped to the original FASTA with bowtie2 to produce sorted and indexed BAM files.
python CONCOCT/scripts/concoct_coverage_table.py contigs_10K.bed mapping/Sample*.sorted.bam > coverage_table.tsv
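Before starting the long clustering step it can be worth checking the table. According to the concoct help text, the coverage file is tab-separated with one row per contig and one column per sample. The sketch below assumes a single header line whose first field labels the contig column; the function name read_coverage_table and the example values are hypothetical.

```python
import csv
import io


def read_coverage_table(handle):
    """Read a tab-separated coverage table.
    Returns (sample_names, {contig_id: [coverage per sample]})."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[1:]
    table = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"unexpected number of columns for {row[0]!r}")
        table[row[0]] = [float(value) for value in row[1:]]
    return samples, table


# Made-up example with two samples and two subcontigs.
example = (
    "contig\tsampleA\tsampleB\n"
    "k141_1.0\t12.5\t0.0\n"
    "k141_1.1\t11.0\t0.3\n"
)
samples, table = read_coverage_table(io.StringIO(example))
print(len(table), "contigs x", len(samples), "samples")
```

For a real file, pass open("coverage_table.tsv", newline="") instead of the StringIO object.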
3. Run concoct.
concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/
This is the most time-consuming step. With contigs assembled from 10 GB of gzip-compressed fastq it took about 10 hours.
4. Merge the clustering of the cut-up subcontigs back into a clustering of the original contigs.
merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv
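As far as I understand it, this merge assigns each original contig to the cluster that most of its subcontigs ended up in. The following is only a sketch of such a majority vote, not the code of merge_cutup_clustering.py. The function name merge_by_majority, the "." separator with a numeric suffix for subcontig IDs, and the example IDs are all assumptions.

```python
from collections import Counter, defaultdict


def merge_by_majority(subcontig_clusters, sep="."):
    """Map {subcontig_id: cluster} to {original_contig_id: cluster},
    taking the most frequent cluster among the pieces of each contig."""
    votes = defaultdict(Counter)
    for sub_id, cluster in subcontig_clusters.items():
        original, _, suffix = sub_id.rpartition(sep)
        if not original or not suffix.isdigit():
            # No numeric suffix: treat the ID as an uncut contig.
            # (An uncut contig whose own name ends in '.<digits>' would be
            # misread here; this sketch ignores that case.)
            original = sub_id
        votes[original][cluster] += 1
    return {contig: counter.most_common(1)[0][0]
            for contig, counter in votes.items()}


# Made-up clustering: k141_1 was cut into three pieces, k141_2 was not cut.
clustering = {"k141_1.0": 3, "k141_1.1": 3, "k141_1.2": 7, "k141_2": 5}
print(merge_by_majority(clustering))
```

Ties are not handled specially in this sketch; whichever cluster was seen first wins.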
5. Extract the bins as individual FASTA files.
mkdir concoct_output/fasta_bins
extract_fasta_bins.py original_contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins
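Conceptually, this step just groups the original contigs by the cluster they were assigned to. A minimal in-memory sketch of that grouping is shown below; it is not the code of extract_fasta_bins.py, and parse_fasta, group_by_bin and the example sequences are hypothetical.

```python
from collections import defaultdict


def parse_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines.
    The record ID is the header up to the first whitespace."""
    name, parts = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def group_by_bin(records, clustering):
    """Group (contig_id, sequence) pairs by cluster.
    Contigs missing from the clustering (e.g. too short) are skipped."""
    bins = defaultdict(list)
    for contig_id, seq in records:
        if contig_id in clustering:
            bins[clustering[contig_id]].append((contig_id, seq))
    return dict(bins)


# Made-up contigs and clustering.
fasta = [">k141_1 len=8", "ACGT", "ACGT", ">k141_2", "GGCC", ">k141_3", "TTAA"]
clustering = {"k141_1": 3, "k141_2": 5}
for bin_id, members in group_by_bin(parse_fasta(fasta), clustering).items():
    print(bin_id, [contig_id for contig_id, _ in members])
```

Writing one FASTA file per key of the returned dictionary would then give one file per bin.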
Output
Citation
Binning metagenomic contigs by coverage and composition.
Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C
Nat Methods. 2014 Nov;11(11):1144-6
Related