TORMES: an automated pipeline for analyzing large numbers of (pathogenic) bacterial genomes

2019 12/20: installation steps corrected

2019 12/21, 12/22: results added


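Advances in high-throughput sequencing (HTS) technologies and falling sequencing costs mean that whole genome sequencing (WGS) could replace many traditional laboratory assays and procedures. Exploiting the large volume of data generated by HTS platforms requires substantial computing skills, and this is the main bottleneck in adopting WGS as a routine laboratory technique. Presenting the huge volume of results to researchers and clinicians who have no genome sequencing expertise is also an important problem.
This paper introduces TORMES, an open-source, user-friendly command-line pipeline for WGS analysis of HTS data generated on Illumina platforms from sets of bacteria. TORMES is designed for researchers without a bioinformatics background. It automates the steps of a bioinformatics analysis: sequence quality filtering, de novo assembly, ordering of the draft genome against a reference, genome annotation, MLST, searches for antibiotic resistance and virulence genes, and gene and pangenome comparisons. All of this runs directly from raw sequencing data by following very simple instructions, without requiring an internet connection.
Summarizing large amounts of data in a form that researchers and clinicians can understand without genome sequencing expertise is recognized as a major problem in bacterial genomics (Köser et al., 2012). TORMES stores every file generated during processing, and once the analysis is complete the results are summarized in an interactive web-like report that can be revised, shared, and compared ergonomically.
The report is generated in the R environment from an automatically generated RMarkdown code file that is unique to each analysis. That file is also saved in a separate folder so that more experienced users can take the analysis further and modify the code to produce user-specific reports.
TORMES can be used with an unlimited number of samples and has been tested on hundreds of bacterial genomes from isolates of many species (including Escherichia, Salmonella, Clostridium, and Klebsiella) from a range of sources (clinical, fecal, animal, and food-related). As a rough guide, a TORMES analysis of these several hundred bacterial genomes at about 50x sequencing depth took 16 hours on a computer with 124 GB of RAM and 32 cores.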


Installation

Tested on Ubuntu 18.04.

Dependencies

ABRicate
FastTree
GNU Parallel
ImageMagick
Kraken
Megahit
mlst
Prinseq
progressiveMauve
Prokka
Quast
R
- R packages: ggtree, knitr, plotly, RColorBrewer, reshape2, rmarkdown
Roary
roary2svg.pl
Sickle
SPAdes
Trimmomatic

Additional software when working with -g/--genera Escherichia.

Additional software when working with -g/--genera Salmonella.

TORMES itself: GitHub

# TORMES is designed to be installed into its own conda environment.
wget https://anaconda.org/nmquijada/tormes-1.0/2019.04.25.180147/download/tormes-1.0.yml
conda env create -n tormes-1.0 --file tormes-1.0.yml
conda activate tormes-1.0
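
After activating the environment, it can help to confirm that the tools in the dependency list are actually on PATH. The following is a minimal sketch, not part of TORMES: `check_tools` is a hypothetical helper, and the executable names in the example call are my assumptions about how each tool is installed, so adjust them to your setup.

```shell
# Report which of the given commands are missing from PATH.
# Prints one "MISSING: <name>" line per absent command and
# returns non-zero if at least one command is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (executable names are assumptions, not taken from TORMES):
# check_tools abricate megahit mlst prokka roary sickle tormes
```

The function only looks commands up with `command -v`; it does not check versions or whether each tool runs correctly.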

> tormes -h

# tormes -h

This is TORMES version 1.0
Developed by Narciso M. Quijada <https://github.com/nmquijada/tormes>

usage: /root/.pyenv/versions/miniconda3-4.3.21/envs/tormes-1.0/bin/tormes <options>

OBLIGATORY OPTIONS:
  -m/--metadata Path to the file with the metadata regarding the samples
      The file must have an specific organization for the program to work
      If you don't have any or you would like to have an example or extra information,
      please type:
      /root/.pyenv/versions/miniconda3-4.3.21/envs/tormes-1.0/bin/tormes example-metadata
  -o/--output Path and name of the output directory

OTHER OPTIONS:
  -a/--adapter Path to the adapters file
      (default="/root/.pyenv/versions/miniconda3-4.3.21/envs/tormes-1.0/bin/../files/adapters.fa")
  --assembler Select the assembler to use. Options available: 'spades', 'megahit'
      (default='spades')
  -c/--config Path to the configuration file with the location of all dependencies
      (default="/root/.pyenv/versions/miniconda3-4.3.21/envs/tormes-1.0/bin/../files/config_file.txt")
  --citation Show citation
  --fast Faster analysis (default='0')
      ('megahit' is used as assembler and contig ordering and pangenome analysis are disabled)
  --filtering Select the software for filtering the reads.
      Options available: 'prinseq', 'sickle', 'trimmomatic'
      (default="prinseq")
  -g/--genera Type genera name to allow special analysis (default='none')
      Options available: 'Escherichia', 'Salmonella'
  -h/--help Show this help
  --min_len Minimum length to the reads to survive after filtering (default=125) <integer>
  --no_mlst Disable MLST analysis (default='0')
  --no_pangenome Disable pangenome analysis (default='0')
  -q/--quality Minimum mean phred score of the reads to survive after filtering (default=25) <integer>
  -r/--reference Type path to reference genome (fasta, gbk) (default='none')
      Reference will be used for contig ordering of the draft genome
  -t/--threads Number of threads to use (default=1) <integer>
  --title Path to a file containing the title in the project that will be used as title in the report
      Avoid using special characters. TORMES will perform a default title if this option is not used
  -v/--version Show version

For further explanation please visit: https://github.com/nmquijada/tormes

When I ran the pipeline (described below), the path to mauve.jar was wrong, so I renamed the mauve directory as the error message indicated.

Preparing the databases

tormes-setup

Setting up config_file.txt

TORMES is installed and ready to use. Enjoy!

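Done.

The setup also downloads the MiniKraken database and other data, close to 10 GB in total. A config file (link) listing the path of each tool is generated automatically as well.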
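
Because the download is close to 10 GB, checking free disk space before running the setup can save a failed run. This is a minimal sketch under my own assumptions: `has_free_space` is a hypothetical helper, and the 10 GiB threshold in the example is simply taken from the size mentioned here, not from the TORMES documentation.

```shell
# Return success if the filesystem holding directory $1 has at least
# $2 GiB available. Relies on POSIX `df -Pk`, which reports sizes in
# 1024-byte blocks with the available space in column 4.
has_free_space() {
    local dir=$1 min_gb=$2 avail_kb
    avail_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example: only run the setup when 10 GiB are free in $HOME
# (which directory to check is an assumption; use where conda is installed):
# has_free_space "$HOME" 10 && tormes-setup
```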

Usage

1. Prepare the metadata

A run takes a text file (the metadata) that lists the names and paths of the fastq files. Let's look at the example metadata first.

# Generate the example metadata text file
tormes example-metadata

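This creates a tab-delimited text file, samples_metadata.txt, in the current directory.

The first line is a comment (header), and sample information goes on the second line onward. Column 1 is the sample name, column 2 is the full path to the R1 file of the paired-end fastq, column 3 is the full path to the R2 file, and column 4 onward holds comments. There can be more than one comment column.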
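
The layout just described can be checked mechanically before starting a long run. The following is a minimal sketch, not a TORMES feature: `check_metadata` is a hypothetical helper, and the duplicate-name check is an extra sanity check of my own rather than a documented TORMES rule.

```shell
# Check a TORMES-style metadata file ($1): tab-separated, line 1 is the
# header, and every later non-empty line needs a sample name plus two
# read paths. Prints one message per problem; returns non-zero if any.
check_metadata() {
    awk -F'\t' '
        NR == 1   { next }          # line 1 is the header/comment line
        $0 == ""  { next }          # ignore completely empty lines
        NF < 3 || $1 == "" || $2 == "" || $3 == "" {
            printf "line %d: need sample name, Read1 and Read2\n", NR
            bad = 1
            next
        }
        seen[$1]++ {                # assumption: sample names should be unique
            printf "line %d: duplicate sample name %s\n", NR, $1
            bad = 1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Example:
# check_metadata samples_metadata.txt && echo "metadata looks fine"
```

The check only looks at the column layout; it does not verify that the fastq files exist.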

[Screenshot: the example metadata file, listing six samples named sample1, sample2, and so on]

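Prepare the file along the following lines. Copying and pasting this and then editing only the fastq paths, the Description column, and the last column should be enough to get a working file.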

Samples	Read1	Read2	Description	Use_as_many_descrpition_colums_as_wanted
sample1	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2018	AMR1
sample2	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2012	AMR2
sample3	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2006	AMR3
sample4	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2000	AMR4
sample5	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 1994	AMR5
sample6	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2019	AMR6
sample7	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2019	AMR7
sample8	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2019	AMR8
sample9	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2019	AMR9
sample10	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2019	AMR10
sample11	//tormes/SRRXXXXXXX_1.fq	//tormes/SRRXXXXXXX_2.fq	S.enterica isolated in 2019	AMR11
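
For many samples, the table can also be generated from a directory of reads instead of being edited by hand. This is a minimal sketch under stated assumptions: `make_metadata` is a hypothetical helper, the reads are assumed to be named `<sample>_1.fq` and `<sample>_2.fq` as in the example paths, and the description text is a placeholder to edit afterwards.

```shell
# Print a TORMES-style metadata table for every pair
# <name>_1.fq / <name>_2.fq found in directory $1.
# The file-naming pattern is an assumption; adapt the suffixes as needed.
make_metadata() {
    local dir r1 r2 name
    dir=$(cd "$1" && pwd) || return 1     # absolute path, since full paths are required
    printf 'Samples\tRead1\tRead2\tDescription\n'
    for r1 in "$dir"/*_1.fq; do
        [ -e "$r1" ] || continue          # the glob matched nothing
        r2=${r1%_1.fq}_2.fq
        if [ ! -e "$r2" ]; then
            echo "skipping $r1: no matching _2.fq" >&2
            continue
        fi
        name=$(basename "$r1" _1.fq)
        printf '%s\t%s\t%s\t%s\n' "$name" "$r1" "$r2" "no description"
    done
}

# Example:
# make_metadata /path/to/fastq_dir > input_metadata.txt
```

Unpaired R1 files are skipped with a message on stderr, so the table on stdout contains complete pairs only.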

2. Run

Run TORMES with this metadata file.

tormes --metadata input_metadata.txt --output outdir --reference ref_genome.fasta --threads 32 --genera Salmonella

Additional analyses are available only for Escherichia and Salmonella; Salmonella is specified above. Because the pipeline runs many processes, give it as many threads as you can spare.
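
To avoid hard-coding the thread count, it can be detected from the machine. This is a minimal sketch: `detect_threads` and `tormes_cmd` are hypothetical helpers, and the sketch only prints a command line built from options shown in `tormes -h` rather than running it.

```shell
# Print the number of online CPU cores, falling back to 1 if unknown.
detect_threads() {
    getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1
}

# Print (do not run) a tormes command line.
# $1 = metadata file, $2 = output directory, $3 = thread count (optional;
# defaults to the detected core count)
tormes_cmd() {
    local threads=${3:-$(detect_threads)}
    printf 'tormes --metadata %s --output %s --threads %s\n' "$1" "$2" "$threads"
}

# Example: inspect the command first, then run it yourself
# tormes_cmd input_metadata.txt outdir
```

Printing the command instead of executing it keeps the sketch safe to try; add `--reference` or `--genera` by hand when they are needed.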

When the job finishes, the output directory contains one subdirectory per analysis.

[Screenshot: contents of the output directory]

The run appeared to go smoothly, but generating report.html failed. I will add an update once this is fixed.

Update

The run now completes successfully.

Summary report

[Screenshots: 15 images of the HTML summary report]

Citation

TORMES: an automated pipeline for whole bacterial genome analysis

Quijada NM, Rodríguez-Lázaro D, Hernández M

Bioinformatics. 2019 Apr 8
