SprayNPray, a user-friendly tool for classifying metagenome contigs

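Shotgun sequencing of cultured microbial isolates and individual eukaryotes (whole-genome sequencing) and of microbial communities (metagenomics) has become commonplace in biology. Sequenced samples often contain more than one species, and increasingly sophisticated software is needed to classify the sequences accurately. Many classification tools exist, but SprayNPray offers a fast, user-friendly, semi-automated approach that lets users sort contigs by the taxonomy (and other metrics) of interest. Easy installation and use, together with intuitive output that can be inspected by eye or parsed computationally, lower the barrier for biologists who are new to genome and metagenome analysis. The approach can serve as a broad-level overview, a preliminary analysis, or a supplement to other taxonomic classification or binning software. SprayNPray profiles contigs using several metrics: closest homologs from a user-specified reference database, gene density, read coverage, GC content, tetranucleotide frequency, and codon-usage bias. The output is designed to support spot-checking of metagenome-assembled genomes, identifying and removing putative contaminant contigs from isolate assemblies, identifying bacteria in eukaryotic assemblies (or vice versa), and identifying possible horizontal gene transfer events.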
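Two of the metrics listed above, GC content and tetranucleotide frequency, are simple to compute per contig. The following is a minimal sketch for illustration only; it is not taken from SprayNPray, and the function names `gc_content` and `tetranucleotide_freq` are hypothetical.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Return GC content as a percentage of unambiguous bases (A, C, G, T)."""
    counts = Counter(seq.upper())
    acgt = sum(counts[base] for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100.0 * (counts["G"] + counts["C"]) / acgt


def tetranucleotide_freq(seq):
    """Return the relative frequency of each of the 256 possible 4-mers.

    Windows containing anything other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = {"".join(p): 0 for p in product("ACGT", repeat=4)}
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        if window in counts:
            counts[window] += 1
    total = sum(counts.values())
    return {kmer: (n / total if total else 0.0) for kmer, n in counts.items()}
```

Contigs from the same genome tend to have similar values for both metrics, which is why they are useful alongside the homology-based classification.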

Installation

I installed it using the setup script (replacing the conda calls with mamba) (*1).

Dependencies

GitHub

git clone https://github.com/Arkadiy-Garber/SprayNPray.git
cd SprayNPray
bash setup.sh
# Add the repository directory to PATH
export PATH=$PATH:$PWD
# Activate the environment
conda activate sprayandpray

> ./spray-and-pray.py -h

usage: spray-and-pray.py [-h] [-g G] [-ref REF] [-bam BAM] [-out OUT] [-lvl LVL] [-t T] [--makedb [MAKEDB]] [--spades [SPADES]] [--meta [META]] [--hgt [HGT]] [--fa [FA]] [-blast BLAST] [-hits HITS] [-domain DOMAIN] [-phylum PHYLUM] [-class CLASS]

[-genus GENUS] [-species SPECIES] [-perc PERC] [-gc GC] [-GC GC] [-cov COV] [-COV COV] [-cd CD] [-CD CD] [-l L] [-L L] [-aai AAI]

************************************************************************

Developed by Arkadiy Garber; University of Montana, Biological Sciences

Please send comments and inquiries to arkadiy.garber@mso.umt.edu

************************************************************************

optional arguments:

-h, --help show this help message and exit

-g G Input bin/assembly in FASTA format

-ref REF Input reference protein database (recommended: nr). Could be FASTA file or DIAMOND database file (with extension .dmnd)

-bam BAM Input sorted BAM file with coverage info (optional)

-out OUT Basename for output files

-lvl LVL Level of the taxonomic hierarchy to include in the summary file (Domain, Phylum, Class, Genus, species)

-t T number of threads to use for DIAMOND BLAST

--makedb [MAKEDB] if the DIAMOND database does not already exist (i.e. file with extension .dmnd), and you would like the program to run diamond makedb, provide this flag

--spades [SPADES] is this a SPAdes assembly, with the original SPAdes headers? If so, then you can provide this flag, and BinBlaster will summarize using the coverage information provided in the SPAdes headers

--meta [META] contigs are from a mixed community of organisms

--hgt [HGT] provide this flag if you'd like the program to output potential HGTs into a separate file. This feature is designed for eukaryotic contigs expected to have HGTs of bacterial origin.

--fa [FA] write subset of contigs that match user-specified parameters to a separate FASTA file

-blast BLAST DIAMOND BLAST output file from previous run

-hits HITS total number of DIAMOND hits to report in DIAMOND output file (default=100)

-domain DOMAIN domain expected among hits to provided contigs, to be written to FASTA file (e.g. Bacteria, Archaea, Eukaryota)

-phylum PHYLUM phylum expected among hits to provided contigs, to be written to FASTA file (e.g. Proteobacteria). If you provide this name, please be sure to also provide the domain name via -domain

-class CLASS class name expected among hits to provided contigs, to be written to FASTA file (e.g. Gammaproteobacteria). If you provide this name, please be sure to also provide the domain and phylum names

-genus GENUS genus name expected among hits to provided contigs, to be written to FASTA file (e.g. Shewanella). If you provide this name, please be sure to also provide the domain, phylum, and class names

-species SPECIES species name expected among hits to provided contigs, to be written to FASTA file (e.g. oneidensis, coli, etc.). If you provide this name, please be sure to also provide the domain, phylum, class, and genus names

-perc PERC percentage of total hits to the contig that must be to the specified genus/species for writing to FASTA

-gc GC minimum GC-content of contigs to write to FASTA (default = 0)

-GC GC maximum GC-content of contigs to write to FASTA (default = 100)

-cov COV minimum coverage of contigs to write to FASTA (default = 0)

-COV COV maximum coverage of contigs to write to FASTA (default = 100000000)

-cd CD minimum coding density (in hits/kb) to write to FASTA (default = 0.25)

-CD CD maximum coding density (in hits/kb) to write to FASTA (default = 5)

-l L minimum length of contig to write to FASTA (default = 1000)

-L L maximum length of contig to write to FASTA (default = 100000000)

-aai AAI minimum average amino acid identity (percent) to reference proteins (default 35)

Database

To run the tool, you also need to supply a reference protein dataset. Ideally this is NCBI's RefSeq or nr database, which can be downloaded as follows. After downloading, build a DIAMOND database from it.

wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz

# Also download nr.gz.md5 and verify the checksum.
wget ftp://ftp.ncbi.nih.gov/blast/db/FASTA/nr.gz.md5

# Run diamond makedb.
diamond makedb --in nr.gz -d nr

This produces nr.dmnd.
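The checksum verification mentioned above can be sketched as follows. This is a minimal sketch, assuming the .md5 file begins with the hex digest (the usual `<digest>  <filename>` layout). The helper names `md5_of_file` and `matches_md5_file` are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(data_path, md5_path):
    """Compare a file's MD5 with the first field of a checksum file.

    Assumption: the checksum file starts with the hex digest.
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(data_path) == expected
```

For the files downloaded above, the call would be `matches_md5_file("nr.gz", "nr.gz.md5")`. Reading in chunks matters here because nr.gz is far too large to load into memory at once.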

Usage

Supply contigs or ORFs (amino acids) in FASTA format. The following command extracts Pseudomonas aeruginosa sequences (contaminant sequences are excluded).

spray-and-pray.py -g pseudomonas_crude.fa -out pseudomonas_clean.fa -genus Pseudomonas -species aeruginosa -perc 50 --fa -ref nr.faa

-g Input bin/assembly in FASTA format
-out Basename for output files
-genus genus name expected among hits to provided contigs, to be written to FASTA file (e.g. Shewanella). If you provide this name, please be sure to also provide the domain, phylum, and class names
-species species name expected among hits to provided contigs, to be written to FASTA file (e.g. oneidensis, coli, etc.). If you provide this name, please be sure to also provide the domain, phylum, class, and genus names
-perc percentage of total hits to the contig that must be to the specified genus/species for writing to FASTA
--fa write subset of contigs that match user-specified parameters to a separate FASTA file
-ref Input reference protein database (recommended: nr). Could be FASTA file or DIAMOND database file (with extension .dmnd)

The command above requires that 50% or more of the genes on a contig have their top DIAMOND hit to Pseudomonas aeruginosa. Contigs that meet these criteria are written to a new FASTA file (pseudomonas_clean.fa).
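The -perc criterion described above can be illustrated with a short sketch. This is not SprayNPray's own code: the function names, the substring matching of taxon names, and the inclusive threshold (>=) are all assumptions made for illustration.

```python
def percent_matching(top_hit_taxa, target):
    """Percentage of a contig's genes whose top hit names the target taxon."""
    if not top_hit_taxa:
        return 0.0
    matches = sum(1 for taxon in top_hit_taxa if target.lower() in taxon.lower())
    return 100.0 * matches / len(top_hit_taxa)


def passes_perc(top_hit_taxa, target, perc):
    """True if at least `perc` percent of top hits match the target taxon.

    Assumption: the threshold is inclusive.
    """
    return percent_matching(top_hit_taxa, target) >= perc


# Hypothetical contig with four genes and the organism of each top hit.
contig_hits = [
    "Pseudomonas aeruginosa",
    "Pseudomonas aeruginosa",
    "Pseudomonas aeruginosa",
    "Escherichia coli",
]
print(passes_perc(contig_hits, "Pseudomonas aeruginosa", perc=50))  # True (75% >= 50%)
```

A contig whose top hits mostly point to another organism would fail the same test and be left out of the output FASTA.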

Extract endosymbiont genomes from a Maconellicoccus hirsutus assembly.

spray-and-pray.py -g M_hirsutus_assembly.fa -out endosymbionts.fa -cd 0.5 -L 1000000 -perc 50 --fa -ref nr.faa -domain Bacteria

-cd minimum coding density (in hits/kb) to write to FASTA (default = 0.25)
-CD maximum coding density (in hits/kb) to write to FASTA (default = 5)
-l minimum length of contig to write to FASTA (default = 1000)
-L maximum length of contig to write to FASTA (default = 100000000)
-domain domain expected among hits to provided contigs, to be written to FASTA file (e.g. Bacteria, Archaea, Eukaryota)
-t number of threads to use for DIAMOND BLAST

The command above requires that a contig written to endosymbionts.fa has a gene density of at least 0.5 genes/kb, a maximum length of 1 Mb, and top DIAMOND hits to bacterial genes for 50% or more of its genes.
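How these filters combine can be sketched as below. This is an illustrative sketch, not the tool's implementation: the function and argument names are hypothetical, and the contig numbers are invented. The length and coding-density defaults mirror the help text (-l 1000, -L 100000000, -cd 0.25, -CD 5). The help text gives no default for -perc, so it is a required argument here.

```python
def coding_density(length_bp, n_hits):
    """Hits per kilobase of contig."""
    return n_hits / (length_bp / 1000.0)


def keep_contig(length_bp, n_hits, n_target_hits, perc,
                min_cd=0.25, max_cd=5.0,
                min_len=1000, max_len=100_000_000):
    """Apply length, coding-density and percent-of-hits filters together."""
    if not (min_len <= length_bp <= max_len):
        return False
    if n_hits == 0:
        return False
    if not (min_cd <= coding_density(length_bp, n_hits) <= max_cd):
        return False
    return 100.0 * n_target_hits / n_hits >= perc


# Settings from the endosymbiont example: -cd 0.5 -L 1000000 -perc 50
settings = dict(perc=50, min_cd=0.5, max_len=1_000_000)

# Invented contigs: a compact bacterial-like contig and a long host scaffold.
symbiont_like = keep_contig(200_000, n_hits=180, n_target_hits=170, **settings)
host_like = keep_contig(3_000_000, n_hits=300, n_target_hits=12, **settings)
print(symbiont_like, host_like)  # True False
```

The intuition is that bacterial genomes are gene-dense while eukaryotic host scaffolds are long and gene-sparse, so the density and length cutoffs separate the two before taxonomy is even considered.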

Identify putative HGTs in a Maconellicoccus hirsutus assembly.

spray-and-pray.py -g P_citri.fa -out putative_hgts.csv --hgt -ref nr.faa

--hgt provide this flag if you'd like the program to output potential HGTs into a separate file. This feature is designed for eukaryotic contigs expected to have HGTs of bacterial origin.

Citation

SprayNPray: user-friendly taxonomic profiling of genome and metagenome contigs

Arkadiy I Garber, Catherine R Armbruster, Stella E Lee, Vaughn S Cooper, Jennifer M Bomberger, Sean M McAllister

bioRxiv, Posted July 19, 2021.

I additionally installed three packages.

pip install numpy matplotlib scikit-learn

References

nr database Diamond