古代メタゲノムデータセットのde novoアセンブリは困難な作業である。シークエンシングされた古代DNA分子は、超短断片サイズと特徴的な死後損傷パターンにより、現在のツールでは理想的なアセンブリを行うことができない。CarpeDeamは、古代メタゲノムサンプルに特化して設計された、損傷を考慮した新しいde novoアセンブラである。サンプル固有の損傷パターンを統合する最尤フレームワークを利用することで、CarpeDeamは既存のアセンブラと比較して、シミュレーションデータセットと経験的データセットの両方から、より連続性の高い配列とより多くのタンパク質配列を復元する。古代メタゲノム解析のパイオニアとして、CarpeDeamは古代微生物群集の機能解析や分類学的解析に新たな可能性をもたらす。
インストール
condaを使ってインストールした。
- CarpeDeam can be installed via conda, as statically compiled Linux version or by compiling the program from source.
#conda
mamba create -n carpedeam -y
conda activate carpedeam
mamba install -c bioconda carpedeam -y
#from source
git clone https://github.com/LouisPwr/CarpeDeam
cd CarpeDeam
git submodule update --init
mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 4 && make install
export PATH="$(pwd)/bin/:$PATH"
> carpedeam -h
Ancient Metagenome Assembler.
CarpeDeam Version: 1.0.0
© Louis Kraft (loipwr3000@gmail.com)
usage: carpedeam <command> [<args>]
Main workflows for database input/output
ancient_assemble Damage aware nucleotide assembly iterative greedy overlap assembly using damage matrix and nucleotide information => CarpeDeam
nuclassemble Modified nuclassemble module from PenguiN
ancient_correction Correct deaminations
> carpedeam ancient_assemble -h
usage: carpedeam ancient_assemble <i:fast(a|q)File[.gz]> | <i:fastqFile1_1[.gz] <i:fastqFile1_2[.gz] ... <i:fastqFileN_1[.gz] <i:fastqFileN_2[.gz]> <o:fastaFile> <tmpDir> [options]
By Louis Kraft <lokraf@dtu.dk>
options: prefilter:
--alph-size TWIN Alphabet size (range 2-21) [nucl:5,aa:13]
--spaced-kmer-mode INT 0: use consecutive positions in k-mers; 1: use spaced k-mers [0]
--spaced-kmer-pattern STR User-specified spaced k-mer pattern
--mask INT Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [0]
--mask-lower-case INT Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]
--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]
--add-self-matches BOOL Artificially add entries of queries with themselves (for clustering) [0]
align:
--cov-mode INT 0: coverage of query and target
1: coverage of target
2: coverage of query
3: target seq. length has to be at least x% of query length
4: query seq. length has to be at least x% of target length
5: short seq. needs to be at least x% of the other seq. length [1]
-e DOUBLE Extend sequences if the E-value is below (range 0.0-inf) [1.000E-03]
-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]
--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]
--gap-open TWIN Gap open cost (only for clustering [5]
--gap-extend TWIN Gap extend cost (only for clustering) [2]
--zdrop INT Maximal allowed difference between score values before alignment is truncated (only for clustering) [200]
--min-seq-id TWIN Overlap sequence identity threshold (range 0.0-1.0) [nucl:0.900,aa:0.970]
--min-aln-len TWIN Minimum alignment length (range 0-INT_MAX) [0]
clust:
--cluster-mode INT 0: Set-Cover (greedy)
1: Connected component (BLASTclust)
2,3: Greedy clustering by sequence length (CDHIT) [2]
--clust-min-seq-id FLOAT Seq. id. threshold passed to linclust algorithm to reduce redundancy in assembly (range 0.0-1.0) [0.970]
--clust-min-cov FLOAT Coverage threshold passed to linclust algorithm to reduce redundancy in assembly (range 0.0-1.0) [0.990]
kmermatcher:
--kmer-per-seq INT k-mers per sequence [200]
--kmer-per-seq-scale TWIN Scale k-mer per sequence based on sequence length as kmer-per-seq val + scale x seqlen [0.200]
--adjust-kmer-len BOOL Adjust k-mer length based on specificity (only for nucleotides) [0]
--hash-shift INT Shift k-mer hash initialization [67]
--include-only-extendable BOOL Include only extendable [1]
--ignore-multi-kmer BOOL Skip k-mers occurring multiple times (>=2) [1]
-k TWIN k-mer length (0: automatically set to optimum) [nucl:20,aa:14]
misc:
--min-length INT Minimum codon number in open reading frames [45]
--max-length INT Maximum codon number in open reading frames [32734]
--forward-frames STR Comma-separated list of frames on the forward strand to be extracted [1,2,3]
--reverse-frames STR Comma-separated list of frames on the reverse strand to be extracted [1,2,3]
--translation-table INT 1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE
9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL
15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL
23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA
29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]
--translate INT Translate ORF to amino acid [0]
--use-all-table-starts BOOL Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]
--id-offset INT Numeric ids in index file are offset by this value [0]
--keep-target BOOL Keep target sequences [1]
--rescore-mode INT Rescore diagonals with:
0: Hamming distance
1: local alignment (score only)
2: local alignment
3: global alignment
4: longest alignment fulfilling window quality criterion [3]
--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]
--shuffle BOOL Shuffle input database [1]
--createdb-mode INT Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]
--chop-cycle BOOL Remove superfluous part of circular fragments (see --cycle-check) [1]
--cycle-check BOOL Check for circular sequences (avoid over extension of circular or long repeated regions) [1]
--min-contig-len INT Minimum length of assembled contig to output [500]
--contig-output-mode INT Type of contigs:
0: all
1: only extended [1]
--num-iter-reads-only INT Raw reads only: Number of assembly iterations performed on nucleotide level (ancient) [5]
--num-iterations TWIN Number of assembly total iterations performed on nucleotide level (ignore protein level for ancient) (range 1-inf) [nucl:10,aa:1]
common:
--threads INT Number of CPU-cores used (all by default) [56]
--compressed INT Write compressed output [0]
-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
--max-seq-len INT Maximum sequence length [200000]
--sub-mat TWIN Substitution matrix file [nucl:nucleotide.out,aa:blosum62.out]
--db-load-mode INT Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]
--remove-tmp-files BOOL Delete temporary files [0]
--delete-tmp-inc INT Delete temporary files incremental [0,1] [1]
--mpi-runner STR Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")
expert:
--create-lookup INT Create database lookup file (can be very large) [0]
--write-lookup INT write .lookup file containing mapping from internal id, fasta id and file number [1]
--filter-hits BOOL Filter hits by seq.id. and coverage [0]
--sort-results INT Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]
--ext-random-align FLOAT Use either: 0.8 or 0.9 (ancient) [0.850]
--excess-penalty FLOAT Use float: 0.25 to 0.5 (ancient) [0.062]
--min-ryseq-id-corr-reads FLOAT Min. RY-mer space seq. ident in correction phase. Range 0-1 (ancient) [0.990]
--min-seqid-corr-reads FLOAT Min. seq. ident. in correction phase. Range 0-1 (ancient) [0.900]
--likelihood-ratio-threshold FLOAT Min. odds ratio to accept read extension. Range 0-1 (ancient) [0.500]
--min-merge-seq-id FLOAT Min. seq. ident. of contig overlaps (ancient) (range 0.0-1.0) [0.990]
--min-seqid-corr-contigs FLOAT Min. seq. ident. for contig correction (ancient) (range 0.0-1.0) [0.900]
--ancient-damage STR Path to damage matrix (ancient)
--unsafe BOOL Maximize the contig length, but higher misassembly rate (ancient) [0]
--db-mode BOOL Input is database [0]
--min-ryseq-id FLOAT List matches above this sequence identity in RY-mer space (ancient) (range 0.0-1.0) [0.990]
--k-ancient-reads INT k-mer length read step (ancient) [20]
--k-ancient-contigs INT k-mer length contig step (ancient) [22]
references:
- Mirdita M, Steinegger M, Breitwieser F, Soding J, Levy Karin E: Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics, btab184 (2021)
> carpedeam nuclassemble -h
usage: carpedeam nuclassemble <i:fast(a|q)File[.gz]> | <i:fastqFile1_1[.gz] <i:fastqFile1_2[.gz] ... <i:fastqFileN_1[.gz] <i:fastqFileN_2[.gz]> <o:fastaFile> <tmpDir> [options]
By Louis Kraft <lokraf@dtu.dk> and Annika Jochheim <annika.jochheim@mpinat.mpg.de>
options: prefilter:
--alph-size TWIN Alphabet size (range 2-21) [5]
--spaced-kmer-mode INT 0: use consecutive positions in k-mers; 1: use spaced k-mers [0]
--spaced-kmer-pattern STR User-specified spaced k-mer pattern
--mask INT Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [0]
--mask-lower-case INT Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]
-k INT k-mer length (0: automatically set to optimum) [22]
--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]
--add-self-matches BOOL Artificially add entries of queries with themselves (for clustering) [0]
align:
--min-seq-id FLOAT Overlap sequence identity threshold (range 0.0-1.0) [0.900]
--cov-mode INT 0: coverage of query and target
1: coverage of target
2: coverage of query
3: target seq. length has to be at least x% of query length
4: query seq. length has to be at least x% of target length
5: short seq. needs to be at least x% of the other seq. length [0]
-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]
--wrapped-scoring BOOL Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]
-e DOUBLE Extend sequences if the E-value is below (range 0.0-inf) [1.000E-03]
-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]
--min-aln-len INT Minimum alignment length (range 0-INT_MAX) [0]
--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]
kmermatcher:
--kmer-per-seq INT k-mers per sequence [200]
--kmer-per-seq-scale TWIN Scale k-mer per sequence based on sequence length as kmer-per-seq val + scale x seqlen [0.200]
--adjust-kmer-len BOOL Adjust k-mer length based on specificity (only for nucleotides) [0]
--hash-shift INT Shift k-mer hash initialization [67]
--include-only-extendable BOOL Include only extendable [0]
--ignore-multi-kmer BOOL Skip k-mers occurring multiple times (>=2) [1]
profile:
--num-iterations INT Number of assembly iterations (range 1-inf) [10]
misc:
--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]
--shuffle BOOL Shuffle input database [1]
--createdb-mode INT Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]
--id-offset INT Numeric ids in index file are offset by this value [0]
--rescore-mode INT Rescore diagonals with:
0: Hamming distance
1: local alignment (score only)
2: local alignment
3: global alignment
4: longest alignment fulfilling window quality criterion [3]
--keep-target BOOL Keep target sequences [1]
--chop-cycle BOOL Remove superfluous part of circular fragments (see --cycle-check) [1]
--cycle-check BOOL Check for circular sequences (avoid over extension of circular or long repeated regions) [1]
--min-contig-len INT Minimum length of assembled contig to output [500]
--contig-output-mode INT Type of contigs:
0: all
1: only extended [1]
--num-iter-reads-only INT Raw reads only: Number of assembly iterations performed on nucleotide level (ancient) [4]
common:
--compressed INT Write compressed output [0]
-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
--sub-mat TWIN Substitution matrix file [nucl:nucleotide.out,aa:blosum62.out]
--max-seq-len INT Maximum sequence length [300000]
--threads INT Number of CPU-cores used (all by default) [56]
--db-load-mode INT Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]
--remove-tmp-files BOOL Delete temporary files [0]
--delete-tmp-inc INT Delete temporary files incremental [0,1] [1]
--mpi-runner STR Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")
expert:
--write-lookup INT write .lookup file containing mapping from internal id, fasta id and file number [1]
--filter-hits BOOL Filter hits by seq.id. and coverage [0]
--sort-results INT Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]
--ext-random-align FLOAT Use either: 0.8 or 0.9 (ancient) [0.850]
--excess-penalty FLOAT Use float: 0.25 to 0.5 (ancient) [0.062]
--min-ryseq-id-corr-reads FLOAT Min. RY-mer space seq. ident in correction phase. Range 0-1 (ancient) [0.990]
--min-seqid-corr-reads FLOAT Min. seq. ident. in correction phase. Range 0-1 (ancient) [0.900]
--likelihood-ratio-threshold FLOAT Min. odds ratio to accept read extension. Range 0-1 (ancient) [0.500]
--min-merge-seq-id FLOAT Min. seq. ident. of contig overlaps (ancient) (range 0.0-1.0) [0.990]
--min-seqid-corr-contigs FLOAT Min. seq. ident. for contig correction (ancient) (range 0.0-1.0) [0.900]
--ancient-damage STR Path to damage matrix (ancient)
--unsafe BOOL Maximize the contig length, but higher misassembly rate (ancient) [0]
--db-mode BOOL Input is database [0]
--min-ryseq-id FLOAT List matches above this sequence identity in RY-mer space (ancient) (range 0.0-1.0) [0.990]
--k-ancient-reads INT k-mer length read step (ancient) [20]
--k-ancient-contigs INT k-mer length contig step (ancient) [22]
references:
- Steinegger M, Mirdita M, Soding J: Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, 16(7), 603-606 (2019)
> carpedeam ancient_correction -h
usage: carpedeam ancient_correction <i:sequenceDB> <i:alnResult> <o:reprSeqDB> [options]
By Louis Kraft <lokraf@dtu.dk>
options: align:
--min-seq-id FLOAT List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]
misc:
--keep-target BOOL Keep target sequences [1]
--rescore-mode INT Rescore diagonals with:
0: Hamming distance
1: local alignment (score only)
2: local alignment
3: global alignment
4: longest alignment fulfilling window quality criterion [0]
common:
--max-seq-len INT Maximum sequence length [65535]
--threads INT Number of CPU-cores used (all by default) [56]
-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
expert:
--ext-random-align FLOAT Use either: 0.8 or 0.9 (ancient) [0.850]
--excess-penalty FLOAT Use float: 0.25 to 0.5 (ancient) [0.062]
--min-ryseq-id-corr-reads FLOAT Min. RY-mer space seq. ident in correction phase. Range 0-1 (ancient) [0.990]
--min-seqid-corr-reads FLOAT Min. seq. ident. in correction phase. Range 0-1 (ancient) [0.900]
--likelihood-ratio-threshold FLOAT Min. odds ratio to accept read extension. Range 0-1 (ancient) [0.500]
--min-merge-seq-id FLOAT Min. seq. ident. of contig overlaps (ancient) (range 0.0-1.0) [0.990]
--min-seqid-corr-contigs FLOAT Min. seq. ident. for contig correction (ancient) (range 0.0-1.0) [0.900]
--ancient-damage STR Path to damage matrix (ancient) []
--unsafe BOOL Maximize the contig length, but higher misassembly rate (ancient) [0]
references:
- Steinegger M, Mirdita M, Soding J: Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, 16(7), 603-606 (2019)
テストラン
carpedeam ancient_assembleコマンドは、1)入力のfastq|fasta、2)出力されるfasta、3)作業ディレクトリ,4)任意のoptions、で実行する。optioでは--ancient-damageの指定が事実上必須で、これは損傷で置換される確率を各塩基について示したタブ区切り行列ファイルとなる。3p.profと5p.profの2種類を以下のように指定する。
git clone https://github.com/LouisPwr/CarpeDeam.git
cd CarpeDeam/
carpedeam ancient_assemble example/test_data.fq.gz assembly.fasta tmp/ --ancient-damage example/dhigh
- --ancient-damage STR Path to damage matrix (ancient) []
上ではdhigh3p.profとdhigh5p.profを example/dhighと指定している。
出力例
> head assembly.fasta
引用
CarpeDeam: A De Novo Metagenome Assembler for Heavily Damaged Ancient Datasets
Louis Kraft, Johannes Söding, Martin Steinegger, Annika Jochheim, Peter Wad Sackett, Antonio Fernandez-Guerra, Gabriel Renaud
bioRxiv, Posted August 09, 2024.