macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

複雑なメタゲノムおよびメタトランススクリプトームデータをアセンブルする PenguiN

 

 メタゲノミクスは、環境およびヒトに関連する微生物群集を研究するための強力なアプローチであり、特に、それらの形成におけるウイルスの役割を研究するためのアプローチでもある。ウイルスゲノムは、高い突然変異率によるゲノムの多様性のため、メタゲノムサンプルからアセンブルすることは困難である。標準的なde Bruijnグラフアセンブラでは、このゲノムの多様性により、ループやバルジが多数存在する複雑なk-merアセンブラグラフになり、k-merサイズ以上離れたバリアントは位相を合わせる事ができないため、株やハプロタイプに解像することは困難である。一方、オーバーラップアセンブラでは、バリアントが1つのリードでカバーされている限り、バリアントの位相を合わせる事が可能である。ここでは、ショットガンメタゲノミクスから得られたウイルスDNAおよびRNAゲノムと細菌16S rRNAのstrain resolvedアセンブルのためのソフトウェアであるPenguiNを紹介する。PenguiNは、すべてのリードのオーバーラップを線形時間で網羅的に検出し、ベイズモデルによって菌株分解された拡張を選択することで、さまざまなリアルデータセットやシミュレーションショートリードデータセットから、従来の2倍以上のウイルス株ゲノムと16S rRNAをアセンブルすることができる。

 

 

レポジトリより

PenguiNは、ショートリードシークエンシングデータを塩基レベルでアセンブルするソフトウェア。第一段階として、翻訳されたタンパク質配列の情報を用いてコーディング配列をアセンブルする。第2ステップでは、非コード領域間のリンクを作成する。PenguiNの主な目的は、複雑なメタゲノムおよびメタトランススクリプトームデータセットアセンブルである。特に16S rRNA遺伝子配列と同様にウイルスゲノムのアセンブリでテストされた。PenguiNは、MegahitやSPAdesのような最新のアセンブラに比べ、3-40倍の完全なウイルスゲノムと6倍の16S rRNA配列をアセンブルすることができる。

 

インストール

#conda
mamba install -c conda-forge -c bioconda plass -y

#docker
docker pull ghcr.io/soedinglab/plass:latest

#static build with AVX2 (fastest)
wget https://mmseqs.com/plass/plass-linux-avx2.tar.gz; tar xvfz plass-linux-avx2.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

# static build with SSE4.1
wget https://mmseqs.com/plass/plass-linux-sse41.tar.gz; tar xvfz plass-linux-sse41.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

# universal build with macOS (Intel or Apple Silicon)
wget https://mmseqs.com/plass/plass-osx-universal.tar.gz; tar xvfz plass-osx-universal.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

> penguin -h

protein-guided nucleotide assembler.

 

PenguiN Version: 5.cf8933

© Annika Jochheim (annika.jochheim@mpinat.mpg.de)

 

usage: penguin <command> [<args>]

 

Main workflows for database input/output

  guided_nuclassemble Assemble nucleotide sequences by iterative greedy overlap assembly using protein and nucleotide information

  nuclassemble      Assemble nucleotide sequences by iterative greedy overlap assembly using nucleotide information only

 

> penguin guided_nuclassemble -h

usage: penguin guided_nuclassemble <i:fast(a|q)File[.gz]> | <i:fastqFile1_1[.gz] <i:fastqFile1_2[.gz] ... <i:fastqFileN_1[.gz] <i:fastqFileN_2[.gz]> <o:fastaFile> <tmpDir> [options]

 By Annika Jochheim <annika.jochheim@mpinat.mpg.de>

options: prefilter:                     

 --alph-size TWIN                Alphabet size (range 2-21) [nucl:5,aa:13]

 --spaced-kmer-mode INT          0: use consecutive positions in k-mers; 1: use spaced k-mers [0]

 --spaced-kmer-pattern STR       User-specified spaced k-mer pattern

 --mask INT                      Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [0]

 --mask-lower-case INT           Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

 --split-memory-limit BYTE       Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

 --add-self-matches BOOL         Artificially add entries of queries with themselves (for clustering) [0]

align:                         

 --cov-mode INT                  0: coverage of query and target

                                 1: coverage of target

                                 2: coverage of query

                                 3: target seq. length has to be at least x% of query length

                                 4: query seq. length has to be at least x% of target length

                                 5: short seq. needs to be at least x% of the other seq. length [1]

 -e DOUBLE                       Extend sequences if the E-value is below (range 0.0-inf) [1.000E-05]

 -a BOOL                         Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

 --seq-id-mode INT               0: alignment length 1: shorter, 2: longer sequence [0]

 --gap-open TWIN                 Gap open cost (only for clustering [5]

 --gap-extend TWIN               Gap extend cost (only for clustering) [2]

 --zdrop INT                     Maximal allowed difference between score values before alignment is truncated (only for clustering) [200]

 --min-seq-id TWIN               Overlap sequence identity threshold (range 0.0-1.0) [nucl:0.990,aa:0.970]

 --min-aln-len TWIN              Minimum alignment length (range 0-INT_MAX) [0]

clust:                         

 --cluster-mode INT              0: Set-Cover (greedy)

                                 1: Connected component (BLASTclust)

                                 2,3: Greedy clustering by sequence length (CDHIT) [2]

 --clust-min-seq-id FLOAT        Seq. id. threshold passed to linclust algorithm to reduce redundancy in assembly (range 0.0-1.0) [0.970]

 --clust-min-cov FLOAT           Coverage threshold passed to linclust algorithm to reduce redundancy in assembly (range 0.0-1.0) [0.990]

kmermatcher:                   

 --kmer-per-seq INT              k-mers per sequence [60]

 --kmer-per-seq-scale TWIN       Scale k-mer per sequence based on sequence length as kmer-per-seq val + scale x seqlen [0.100]

 --adjust-kmer-len BOOL          Adjust k-mer length based on specificity (only for nucleotides) [0]

 --hash-shift INT                Shift k-mer hash initialization [67]

 --include-only-extendable BOOL  Include only extendable [1]

 --ignore-multi-kmer BOOL        Skip k-mers occurring multiple times (>=2) [1]

 -k TWIN                         k-mer length (0: automatically set to optimum) [nucl:22,aa:14]

misc:                          

 --min-length INT                Minimum codon number in open reading frames [45]

 --max-length INT                Maximum codon number in open reading frames [32734]

 --forward-frames STR            Comma-separated list of frames on the forward strand to be extracted [1,2,3]

 --reverse-frames STR            Comma-separated list of frames on the reverse strand to be extracted [1,2,3]

 --translation-table INT         1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE

                                 9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL

                                 15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL

                                 23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA

                                  29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]

 --translate INT                 Translate ORF to amino acid [0]

 --use-all-table-starts BOOL     Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]

 --id-offset INT                 Numeric ids in index file are offset by this value [0]

 --keep-target BOOL              Keep target sequences [1]

 --rescore-mode INT              Rescore diagonals with:

                                 0: Hamming distance

                                 1: local alignment (score only)

                                 2: local alignment

                                 3: global alignment

                                 4: longest alignment fulfilling window quality criterion [3]

 --dbtype INT                    Database type 0: auto, 1: amino acid 2: nucleotides [0]

 --shuffle BOOL                  Shuffle input database [1]

 --createdb-mode INT             Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]

 --chop-cycle BOOL               Remove superfluous part of circular fragments (see --cycle-check) [1]

 --cycle-check BOOL              Check for circular sequences (avoid over extension of circular or long repeated regions)  [1]

 --min-contig-len INT            Minimum length of assembled contig to output [1000]

 --contig-output-mode INT        Type of contigs:

                                 0: all

                                 1: only extended [1]

 --num-iterations TWIN           Number of assembly iterations performed on nucleotide level,protein level (range 1-inf) [5]

common:                        

 --threads INT                   Number of CPU-cores used (all by default) [128]

 --compressed INT                Write compressed output [0]

 -v INT                          Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

 --max-seq-len INT               Maximum sequence length [200000]

 --sub-mat TWIN                  Substitution matrix file [nucl:nucleotide.out,aa:blosum62.out]

 --db-load-mode INT              Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

 --remove-tmp-files BOOL         Delete temporary files [0]

 --delete-tmp-inc INT            Delete temporary files incremental [0,1] [1]

 --mpi-runner STR                Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")

expert:                        

 --create-lookup INT             Create database lookup file (can be very large) [0]

 --write-lookup INT              write .lookup file containing mapping from internal id, fasta id and file number [1]

 --filter-hits BOOL              Filter hits by seq.id. and coverage [0]

 --sort-results INT              Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]

 --db-mode BOOL                  Input is database [0]

 

references:

 - Steinegger M, Mirdita M, Soding J: Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, 16(7), 603-606 (2019)

> penguin nuclassemble -h

usage: penguin nuclassemble <i:fast(a|q)File[.gz]> | <i:fastqFile1_1[.gz] <i:fastqFile1_2[.gz] ... <i:fastqFileN_1[.gz] <i:fastqFileN_2[.gz]> <o:fastaFile> <tmpDir> [options]

 By Annika Jochheim <annika.jochheim@mpinat.mpg.de>

options: prefilter:                     

 --alph-size TWIN                Alphabet size (range 2-21) [5]

 --spaced-kmer-mode INT          0: use consecutive positions in k-mers; 1: use spaced k-mers [0]

 --spaced-kmer-pattern STR       User-specified spaced k-mer pattern

 --mask INT                      Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [0]

 --mask-lower-case INT           Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

 -k INT                          k-mer length (0: automatically set to optimum) [22]

 --split-memory-limit BYTE       Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

 --add-self-matches BOOL         Artificially add entries of queries with themselves (for clustering) [0]

align:                         

 --min-seq-id FLOAT              Overlap sequence identity threshold (range 0.0-1.0) [0.990]

 --cov-mode INT                  0: coverage of query and target

                                 1: coverage of target

                                 2: coverage of query

                                 3: target seq. length has to be at least x% of query length

                                 4: query seq. length has to be at least x% of target length

                                 5: short seq. needs to be at least x% of the other seq. length [0]

 -c FLOAT                        List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]

 --wrapped-scoring BOOL          Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

 -e DOUBLE                       Extend sequences if the E-value is below (range 0.0-inf) [1.000E-05]

 -a BOOL                         Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

 --min-aln-len INT               Minimum alignment length (range 0-INT_MAX) [0]

 --seq-id-mode INT               0: alignment length 1: shorter, 2: longer sequence [0]

kmermatcher:                   

 --kmer-per-seq INT              k-mers per sequence [60]

 --kmer-per-seq-scale TWIN       Scale k-mer per sequence based on sequence length as kmer-per-seq val + scale x seqlen [0.100]

 --adjust-kmer-len BOOL          Adjust k-mer length based on specificity (only for nucleotides) [0]

 --hash-shift INT                Shift k-mer hash initialization [67]

 --include-only-extendable BOOL  Include only extendable [1]

 --ignore-multi-kmer BOOL        Skip k-mers occurring multiple times (>=2) [1]

profile:                       

 --num-iterations INT            Number of assembly iterations (range 1-inf) [8]

misc:                          

 --dbtype INT                    Database type 0: auto, 1: amino acid 2: nucleotides [0]

 --shuffle BOOL                  Shuffle input database [1]

 --createdb-mode INT             Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]

 --id-offset INT                 Numeric ids in index file are offset by this value [0]

 --rescore-mode INT              Rescore diagonals with:

                                 0: Hamming distance

                                 1: local alignment (score only)

                                 2: local alignment

                                 3: global alignment

                                 4: longest alignment fulfilling window quality criterion [3]

 --keep-target BOOL              Keep target sequences [1]

 --chop-cycle BOOL               Remove superfluous part of circular fragments (see --cycle-check) [1]

 --cycle-check BOOL              Check for circular sequences (avoid over extension of circular or long repeated regions)  [1]

 --min-contig-len INT            Minimum length of assembled contig to output [1000]

 --contig-output-mode INT        Type of contigs:

                                 0: all

                                 1: only extended [1]

common:                        

 --compressed INT                Write compressed output [0]

 -v INT                          Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

 --sub-mat TWIN                  Substitution matrix file [nucl:nucleotide.out,aa:blosum62.out]

 --max-seq-len INT               Maximum sequence length [200000]

 --threads INT                   Number of CPU-cores used (all by default) [128]

 --db-load-mode INT              Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

 --remove-tmp-files BOOL         Delete temporary files [0]

 --delete-tmp-inc INT            Delete temporary files incremental [0,1] [1]

 --mpi-runner STR                Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")

expert:                        

 --write-lookup INT              write .lookup file containing mapping from internal id, fasta id and file number [1]

 --filter-hits BOOL              Filter hits by seq.id. and coverage [0]

 --sort-results INT              Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]

 --db-mode BOOL                  Input is database [0]

 

references:

 - Steinegger M, Mirdita M, Soding J: Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, 16(7), 603-606 (2019)

 

 

実行方法

penguin nuclassembleをランするには、1つ以上のfastq、出力fasta名、作業ディレクトリ、(option)の順に指定する。

penguin nuclassemble R1.fq.gz R2.fq.gz output.fasta tmp
  • --threads      Number of CPU-cores used (all by default) [128]

出力例

seqkit stats -a output.fasta

 

penguin guided_nuclassembleをランするには、penguin nuclassembleで指定するファイルに加えて、ガイドするためのfasta配列ファイルを指定する必要がある。

 

引用

Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs

Annika Jochheim,  Florian A. Jochheim, Alexandra Kolodyazhnaya, Étienne Morice, Martin Steinegger,  Johannes Söding

bioRxiv, Posted March 29, 2024.

 

関連

(メタゲノム向け)高効率なプロテインレベルのアセンブリツール PLASS