複雑なメタゲノムおよびメタトランススクリプトームデータをアセンブルする PenguiN

　メタゲノミクスは、環境およびヒトに関連する微生物群集を研究するための強力なアプローチであり、特に、それらの形成におけるウイルスの役割を研究するためのアプローチでもある。ウイルスゲノムは、高い突然変異率によるゲノムの多様性のため、メタゲノムサンプルからアセンブルすることは困難である。標準的なde Bruijnグラフアセンブラでは、このゲノムの多様性により、ループやバルジが多数存在する複雑なk-merアセンブラグラフになり、k-merサイズ以上離れたバリアントは位相を合わせる事ができないため、株やハプロタイプに解像することは困難である。一方、オーバーラップアセンブラでは、バリアントが1つのリードでカバーされている限り、バリアントの位相を合わせる事が可能である。ここでは、ショットガンメタゲノミクスから得られたウイルスDNAおよびRNAゲノムと細菌16S rRNAのstrain resolvedアセンブルのためのソフトウェアであるPenguiNを紹介する。PenguiNは、すべてのリードのオーバーラップを線形時間で網羅的に検出し、ベイズモデルによって菌株分解された拡張を選択することで、さまざまなリアルデータセットやシミュレーションショートリードデータセットから、従来の2倍以上のウイルス株ゲノムと16S rRNAをアセンブルすることができる。

Penguin is our new assembler that reconstructs manyfold more accurate strain-level viral genomes and 16S rRNAs from metagenomes through a novel greedy AA/DNA-hybrid bayesian overlap extension strategy. By @AnnikaJochheim et al.
📄https://t.co/QjXHoccdsm
💾https://t.co/Oz7d5eqknT pic.twitter.com/Zru5sGoF00
— Martin Steinegger 🇺🇦 (@thesteinegger) March 31, 2024

レポジトリより

PenguiNは、ショートリードシークエンシングデータを塩基レベルでアセンブルするソフトウェア。第一段階として、翻訳されたタンパク質配列の情報を用いてコーディング配列をアセンブルする。第2ステップでは、非コード領域間のリンクを作成する。PenguiNの主な目的は、複雑なメタゲノムおよびメタトランススクリプトームデータセットのアセンブルである。特に16S rRNA遺伝子配列と同様にウイルスゲノムのアセンブリでテストされた。PenguiNは、MegahitやSPAdesのような最新のアセンブラに比べ、3-40倍の完全なウイルスゲノムと6倍の16S rRNA配列をアセンブルすることができる。

インストール

#conda
mamba install -c conda-forge -c bioconda plass -y

#docker
docker pull ghcr.io/soedinglab/plass:latest

#static build with AVX2 (fastest)
wget https://mmseqs.com/plass/plass-linux-avx2.tar.gz; tar xvfz plass-linux-avx2.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

# static build with SSE4.1
wget https://mmseqs.com/plass/plass-linux-sse41.tar.gz; tar xvfz plass-linux-sse41.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

# universal build with macOS (Intel or Apple Silicon)
wget https://mmseqs.com/plass/plass-osx-universal.tar.gz; tar xvfz plass-osx-universal.tar.gz; export PATH=$(pwd)/plass/bin/:$PATH

> penguin -h

protein-guided nucleotide assembler.

PenguiN Version: 5.cf8933

usage: penguin <command> [<args>]

Main workflows for database input/output

guided_nuclassemble Assemble nucleotide sequences by iterative greedy overlap assembly using protein and nucleotide information

nuclassemble Assemble nucleotide sequences by iterative greedy overlap assembly using nucleotide information only

> penguin guided_nuclassemble -h

usage: penguin guided_nuclassemble <i:fast(a|q)File[.gz]> | <i:fastqFile1_1[.gz] <i:fastqFile1_2[.gz] ... <i:fastqFileN_1[.gz] <i:fastqFileN_2[.gz]> <o:fastaFile> <tmpDir> [options]

By Annika Jochheim <annika.jochheim@mpinat.mpg.de>

options: prefilter:

--alph-size TWIN Alphabet size (range 2-21) [nucl:5,aa:13]

--spaced-kmer-mode INT 0: use consecutive positions in k-mers; 1: use spaced k-mers [0]

--spaced-kmer-pattern STR User-specified spaced k-mer pattern

--mask INT Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [0]

--mask-lower-case INT Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--add-self-matches BOOL Artificially add entries of queries with themselves (for clustering) [0]

align:

--cov-mode INT 0: coverage of query and target

1: coverage of target

2: coverage of query

3: target seq. length has to be at least x% of query length

4: query seq. length has to be at least x% of target length

5: short seq. needs to be at least x% of the other seq. length [1]

-e DOUBLE Extend sequences if the E-value is below (range 0.0-inf) [1.000E-05]

-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]

--gap-open TWIN Gap open cost (only for clustering [5]

--gap-extend TWIN Gap extend cost (only for clustering) [2]

--zdrop INT Maximal allowed difference between score values before alignment is truncated (only for clustering) [200]

--min-seq-id TWIN Overlap sequence identity threshold (range 0.0-1.0) [nucl:0.990,aa:0.970]

--min-aln-len TWIN Minimum alignment length (range 0-INT_MAX) [0]

clust:

--cluster-mode INT 0: Set-Cover (greedy)

1: Connected component (BLASTclust)

2,3: Greedy clustering by sequence length (CDHIT) [2]

--clust-min-seq-id FLOAT Seq. id. threshold passed to linclust algorithm to reduce redundancy in assembly (range 0.0-1.0) [0.970]

--clust-min-cov FLOAT Coverage threshold passed to linclust algorithm to reduce redundancy in assembly (range 0.0-1.0) [0.990]

kmermatcher:

--kmer-per-seq INT k-mers per sequence [60]

--kmer-per-seq-scale TWIN Scale k-mer per sequence based on sequence length as kmer-per-seq val + scale x seqlen [0.100]

--adjust-kmer-len BOOL Adjust k-mer length based on specificity (only for nucleotides) [0]

--hash-shift INT Shift k-mer hash initialization [67]

--include-only-extendable BOOL Include only extendable [1]

--ignore-multi-kmer BOOL Skip k-mers occurring multiple times (>=2) [1]

-k TWIN k-mer length (0: automatically set to optimum) [nucl:22,aa:14]

misc:

--min-length INT Minimum codon number in open reading frames [45]

--max-length INT Maximum codon number in open reading frames [32734]

--forward-frames STR Comma-separated list of frames on the forward strand to be extracted [1,2,3]

--reverse-frames STR Comma-separated list of frames on the reverse strand to be extracted [1,2,3]

--translation-table INT 1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE

9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL

15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL

23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA

29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]

--translate INT Translate ORF to amino acid [0]

--use-all-table-starts BOOL Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]

--id-offset INT Numeric ids in index file are offset by this value [0]

--keep-target BOOL Keep target sequences [1]

--rescore-mode INT Rescore diagonals with:

0: Hamming distance

1: local alignment (score only)

2: local alignment

3: global alignment

4: longest alignment fulfilling window quality criterion [3]

--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]

--shuffle BOOL Shuffle input database [1]

--createdb-mode INT Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]

--chop-cycle BOOL Remove superfluous part of circular fragments (see --cycle-check) [1]

--cycle-check BOOL Check for circular sequences (avoid over extension of circular or long repeated regions) [1]

--min-contig-len INT Minimum length of assembled contig to output [1000]

--contig-output-mode INT Type of contigs:

0: all

1: only extended [1]

--num-iterations TWIN Number of assembly iterations performed on nucleotide level,protein level (range 1-inf) [5]

common:

--threads INT Number of CPU-cores used (all by default) [128]

--compressed INT Write compressed output [0]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--max-seq-len INT Maximum sequence length [200000]

--sub-mat TWIN Substitution matrix file [nucl:nucleotide.out,aa:blosum62.out]

--db-load-mode INT Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

--remove-tmp-files BOOL Delete temporary files [0]

--delete-tmp-inc INT Delete temporary files incremental [0,1] [1]

--mpi-runner STR Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")

expert:

--create-lookup INT Create database lookup file (can be very large) [0]

--write-lookup INT write .lookup file containing mapping from internal id, fasta id and file number [1]

--filter-hits BOOL Filter hits by seq.id. and coverage [0]

--sort-results INT Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]

--db-mode BOOL Input is database [0]

references:

- Steinegger M, Mirdita M, Soding J: Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, 16(7), 603-606 (2019)

> penguin nuclassemble -h

usage: penguin nuclassemble <i:fast(a|q)File[.gz]> | <i:fastqFile1_1[.gz] <i:fastqFile1_2[.gz] ... <i:fastqFileN_1[.gz] <i:fastqFileN_2[.gz]> <o:fastaFile> <tmpDir> [options]

By Annika Jochheim <annika.jochheim@mpinat.mpg.de>

options: prefilter:

--alph-size TWIN Alphabet size (range 2-21) [5]

--spaced-kmer-mode INT 0: use consecutive positions in k-mers; 1: use spaced k-mers [0]

--spaced-kmer-pattern STR User-specified spaced k-mer pattern

--mask INT Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [0]

--mask-lower-case INT Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

-k INT k-mer length (0: automatically set to optimum) [22]

--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--add-self-matches BOOL Artificially add entries of queries with themselves (for clustering) [0]

align:

--min-seq-id FLOAT Overlap sequence identity threshold (range 0.0-1.0) [0.990]

--cov-mode INT 0: coverage of query and target

1: coverage of target

2: coverage of query

3: target seq. length has to be at least x% of query length

4: query seq. length has to be at least x% of target length

5: short seq. needs to be at least x% of the other seq. length [0]

-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]

--wrapped-scoring BOOL Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

-e DOUBLE Extend sequences if the E-value is below (range 0.0-inf) [1.000E-05]

-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

--min-aln-len INT Minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]

kmermatcher:

--kmer-per-seq INT k-mers per sequence [60]

--kmer-per-seq-scale TWIN Scale k-mer per sequence based on sequence length as kmer-per-seq val + scale x seqlen [0.100]

--adjust-kmer-len BOOL Adjust k-mer length based on specificity (only for nucleotides) [0]

--hash-shift INT Shift k-mer hash initialization [67]

--include-only-extendable BOOL Include only extendable [1]

--ignore-multi-kmer BOOL Skip k-mers occurring multiple times (>=2) [1]

profile:

--num-iterations INT Number of assembly iterations (range 1-inf) [8]

misc:

--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]

--shuffle BOOL Shuffle input database [1]

--createdb-mode INT Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]

--id-offset INT Numeric ids in index file are offset by this value [0]

--rescore-mode INT Rescore diagonals with:

0: Hamming distance

1: local alignment (score only)

2: local alignment

3: global alignment

4: longest alignment fulfilling window quality criterion [3]

--keep-target BOOL Keep target sequences [1]

--chop-cycle BOOL Remove superfluous part of circular fragments (see --cycle-check) [1]

--cycle-check BOOL Check for circular sequences (avoid over extension of circular or long repeated regions) [1]

--min-contig-len INT Minimum length of assembled contig to output [1000]

--contig-output-mode INT Type of contigs:

0: all

1: only extended [1]

common:

--compressed INT Write compressed output [0]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--sub-mat TWIN Substitution matrix file [nucl:nucleotide.out,aa:blosum62.out]

--max-seq-len INT Maximum sequence length [200000]

--threads INT Number of CPU-cores used (all by default) [128]

--db-load-mode INT Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

--remove-tmp-files BOOL Delete temporary files [0]

--delete-tmp-inc INT Delete temporary files incremental [0,1] [1]

--mpi-runner STR Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")

expert:

--write-lookup INT write .lookup file containing mapping from internal id, fasta id and file number [1]

--filter-hits BOOL Filter hits by seq.id. and coverage [0]

--sort-results INT Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]

--db-mode BOOL Input is database [0]

references:

- Steinegger M, Mirdita M, Soding J: Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nature Methods, 16(7), 603-606 (2019)

実行方法

penguin nuclassembleをランするには、１つ以上のfastq、出力fasta名、作業ディレクトリ、（option）の順に指定する。

penguin nuclassemble R1.fq.gz R2.fq.gz output.fasta tmp

--threads Number of CPU-cores used (all by default) [128]

出力例

seqkit stats -a output.fasta

penguin guided_nuclassembleをランするには、penguin nuclassembleで指定するファイルに加えて、ガイドするためのfasta配列ファイルを指定する必要がある。

引用

Strain-resolved de-novo metagenomic assembly of viral genomes and microbial 16S rRNAs

Annika Jochheim, Florian A. Jochheim, Alexandra Kolodyazhnaya, Étienne Morice, Martin Steinegger, Johannes Söding

bioRxiv, Posted March 29, 2024.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

複雑なメタゲノムおよびメタトランススクリプトームデータをアセンブルする PenguiN