2019-04-26

メタゲノムの既知および未知バクテリアの存在量を推定single-copy phylogenetic marker genesに基づいて見積もる mOTUs2

2019 4/26 mergeエラー修正及び追記

2019 7/2 インストール追記

2019 8/6 リンク追加

2020 4/18 condaインストール追記

2020 8/24 インストール追記

　微生物は、地球上の生命や環境中の地球化学的プロセスに影響を与える、相互作用する種の複雑な共同体に住んでいる。したがって、それらが形成するコミュニティの構成を正確にプロファイルし、比較することは基本的な関心事である。微生物群集プロファイリングのための最も一般的なアプローチは、リボソームスモールサブユニットRNA遺伝子（すなわち、バクテリアおよびアーキアの16S rRNA gene）からのPCRアンプリコン配列の分類によるものである。強力ではあるが、このアプローチは、例えば、ゲノム当たりの16S rRNA遺伝子コピー数の変動（論文補足図1）、異なる種におけるPCRプライマーの効率の不均一性、ならびにこの遺伝子の異なるサブ領域の使用（ref.3）、さらに、その高レベルの配列保存は、closely relatedな生物の解像を制限する(ref.4)。

　より最近の方法は、ショットガンシークエンシング（メタゲノミクス）によって直接環境DNAをサンプリングすることで、これはそれらバイアスのいくつかを解決する。メタゲノムデータから微生物群集組成を決定するために様々な戦略が導入されてきた。１つのアプローチは、「既知の」種のpublicに入手可能な分類学的にアノテーション付けされたリファレンスゲノム配列を用いてシーケンシングリードを分類することに基づく。得られたリードの存在量分布は、個々の種の相対存在量を導き出すため、ゲノムサイズ（ref.5、6)による正規化を必要とする（論文補足図1）。全ゲノムを使用するのではなく、現在のリファレンスゲノムデータベースの分析に基づいて、クレード特異的であることが判明した遺伝子のリードカバレッジを定量するという方法もある（ref,7）。そのようなマーカー遺伝子がゲノムごとに一度だけ出現する場合、結果として生じるリードカバレッジはコピー数またはゲノム長によって正規化する必要はない。しかし、ゲノム配列に関する以前の知識に依存する方法の欠点は、ゲノム的に特徴付けられていない分類群が考慮されないままであることであり、これは種レベルの分解能で不正確な相対存在量の推定につながる可能性がある（補足図1）。

　このようなリファレンス依存の方法では見逃されている分類群は、まとめて生物学的「ダークマター」（ref.8）と呼ぶことができる。これらには、検出される可能性があるが、標準的な方法および最新のゲノムデータベースを使用して定量化することが依然として困難である生物（本明細書では「未知の」種と呼ばれる）が含まれる。この問題を克服するために、我々は以前に種レベルの解像度で微生物分類群を捕獲し定量化するためのアプローチとして、普遍的、タンパク質コーディング、single copy phylogenetic marker gene (MG)-based operational taxonomic units (mOTUs) を用いるプロファイリングツールを導入した。 mOTUは既知および未知の両方の種からのMGに基づいて構築されており、後者は既存のメタゲノムから抽出されているため、新しい微生物群集をプロファイリングするときに、より高い分類分解能および種のより正確な定量化ができる。

　この論文では、更新された機能拡張プロファイリングツールであるmOTUプロファイルバージョン2（mOTUs2）を紹介する。これは、3100以上のメタゲノムサンプルからのデータを更新されたmOTUデータベースに統合し、ヒト関連および海洋微生物種の表現を大幅に改善する。最先端の方法と比較したmOTU 2の評価は、既知の種と未知の種の両方について感度と定量精度の改善を示していた。我々（本著者ら）は他のアプローチによって見逃された種により、どのようにメタゲノムデータ構成から相対的な存在量の推定をゆがめられるかを説明する。さらに、MOTUは、非ハウスキーピング遺伝子の使用によるアーチファクト定量を回避しながら、微生物群集のメタトランスクリプトームデータのベースライン転写活性を定量することを可能にする。最後に、メタゲノム研究では、特定の微生物群集に共存し、個人と環境のサンプリング場所が異なり、経時的に安定していることが異質の微生物株集団で報告されている（ref.10-12）。メタゲノム一塩基変異プロファイリングのために全ゲノム配列を使用する効率的な代替手段として、mOTUのMGを使用してそのような株集団間の違いが推定できることを示す。

　最初に、25,000以上のシーケンスゲノムの全セット（ref.13）の中で、以前に選択されベンチマークされた40のMGを同定した。（おそらく冗長）配列の種レベルの分類群を得るために、本著者らは96.5％の配列同一性の較正済みカットオフ（ref.4）に基づいて、より多くのものを含む5232の非冗長リファレンスMG-based operational taxonomic units （ref-mOTU）にこれらのゲノムをクラスタ化した。次に、1件のバイオームあたり多数の系統的に処理されたサンプルを要件として含む研究から、3100以上のメタゲノムを集めた（論文補足データ1）。これらは、主要な人体部位（口腔、皮膚、腸および膣ref.14、15）からの1210のサンプル、さまざまな疾病コホート16-21および243の海洋水サンプル22を含むさまざまなヒト腸メタゲノム研究からの追加1693サンプルから構成されている。これらのアセンブリで予測されたMGは、マーカー遺伝子クラスター（MGC）にまとめられた。最後に、ref-mOTUと同じ包含基準（> 5 MG / mOTU）を適用して、MGCをメタゲノムmOTU（meta-mOTU）に共存量ベースでビニングする方法を考案した（論文図1a、メソッド）。メタmOTUのビニング精度を評価するために、ref-mOTUと比較した個々のMGCの分類学的な一貫性（方法）、存在量の変動、有病率およびGC含有量に関して個々のMGCを評価した（補足図2、方法）。全体として、私たちはすべてのカテゴリーで高い一致を見いだした。例えば、種レベルでは、種名の割り当てとMGに基づく配列との間の不一致が知られているにもかかわらず、97％を超えるメタmOTUがそれらの分類学的アノテーションで完全に一致すると予想される（論文補足図2a）。（以下略）

wiki（簡潔に分かりやすく説明されています）

https://github.com/motu-tool/mOTUs_v2/wiki

@alessio_mila getting ready to present the mOTUs2 profiler in the poster session at #SACMDA3 pic.twitter.com/uMNRvEvxBn
— Zeller Group (@ZellerGroup) April 1, 2019

Our new paper on community and population profiling using metagenomics is out now! #mOTUs2 https://t.co/hTmSN0OLYw
— Daniel Mende (@pangenomics) March 4, 2019

#paper Microbial abundance, activity and population genomic profiling with mOTUs2. Track 7700 species using marker genes, including a few thousand unknown branches. With Zeller and Sunagawa groups. https://t.co/K3j7wSpv5K
— Jaime Huerta-Cepas (@jhcepas) March 12, 2019

Website: https://t.co/6uX9rZbTwk
Code: https://t.co/N9KohNpafQ
Paper: https://t.co/i3oV76pPuF
for #mOTUs2
— Shinichi Sunagawa (@SunShinichi) March 5, 2019

インストール

condaで仮想環境を作ってテストした。

依存

Python 3 (or higher)
the Burrow-Wheeler Aligner v0.7.15 or higher (bwa)
SAMtools v1.5 or higher (link)
metaSNV

conda create -n motus2 python=3.7
conda activate motus2
conda install -c bioconda -y bwa samtools metasnv

本体　Github

git clone https://github.com/motu-tool/mOTUs_v2.git
cd mOTUs_v2

#データベースのダウンロードも行われる（1GBほど）
python setup.py
python test.py
export PATH=`pwd`:$PATH

#追記　condaを使って導入するなら (テスト時は失敗した)
#bioconda (link)
conda install -c bioconda -y motus

> python test.py

$ python test.py

------------------------------------------------------------------------------

| TEST MOTUS TOOL |

------------------------------------------------------------------------------

1-- ran setup.py: done

2-- Tools and versions:

- python: correct

- bwa: correct

- samtools: correct

- metaSNV: correct

3-- Taxonomy profiling test:

- Run motus (-v 1, only error messages):

- end motus call

Check resulting profile: correct

パスを通す。

export PATH=`pwd`:$PATH

> motus

$ motus

Program: motus - a tool for marker gene-based OTU (mOTU) profiling

Version: 2.1.1

Reference: Milanese et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nature Communications (2019). doi: 10.1038/s41467-019-08844-4

Usage: motus <command> [options]

Command:

-- Taxonomic profiling

profile Perform a taxonomic profiling (map_tax + calc_mgc + calc_motu)

merge Merge different taxonomic profiles to create a table

map_tax Map reads to the marker gene database

calc_mgc Aggregate reads from the same marker gene cluster (mgc)

calc_motu From a mgc abundance table, produce the mOTU abundance table

-- SNV calling

map_snv Map reads to create bam/sam file for snv calling

snv_call SNV calling using metaSNV

Type motus <command> to print the help for a specific command

>motus profile

$ motus profile

Usage: motus profile [options]

Input options:

-f FILE[,FILE] input file(s) for reads in forward orientation, fastq formatted

-r FILE[,FILE] input file(s) for reads in reverse orientation, fastq formatted

-s FILE[,FILE] input file(s) for reads without mate, fastq formatted

-n STR sample name

-i FILE[,FILE] provide sam or bam input file(s)

-m FILE provide a mgc reads count file

Output options:

-o FILE output file name [stdout]

-I FILE save the result of bwa in bam format (intermediate step) [None]

-M FILE save the mgc reads count (intermediate step) [None]

-e profile only reference species (ref_mOTUs)

-c print result as counts instead of relative abundances

-p print NCBI id

-u print the full name of the species

-q print the full rank taxonomy

-B print result in BIOM format

-C STR print result in CAMI format (BioBoxes format 0.9.1)

Values: [precision, recall, parenthesis]

-k STR taxonomic level [mOTU]

Values: [kingdom, phylum, class, order, family, genus, mOTU]

Algorithm options:

-g INT number of marker genes cutoff: 1=higher recall, 10=higher precision [3]

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-t INT number of threads [1]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

-y STR type of read counts [insert.scaled_counts]

Values: [base.coverage, insert.raw_counts, insert.scaled_counts]

> motus merge

$ motus merge

Usage: motus merge [options]

Input options:

-i STR[,STR] list of files (comma separated)

-d DIR merge all the files in the directory

Output options:

-o FILE output file name [stdout]

-B print result in BIOM format

Algorithm options:

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus map_tax

$ motus map_tax

Usage: motus map_tax [options]

Input options:

-f FILE[,FILE] input file(s) for reads in forward orientation, fastq formatted

-r FILE[,FILE] input file(s) for reads in reverse orientation, fastq formatted

-s FILE[,FILE] input file(s) for reads without mate, fastq formatted

Output options:

-o FILE output file name [stdout]

-b save the result of bwa in bam format

Algorithm options:

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-t INT number of threads [1]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus calc_mgc

$ motus calc_mgc

Usage: motus calc_mgc [options]

Input options:

-n STR sample name

-i FILE[,FILE] provide a sam or bam input file (or list of files)

Output options:

-o FILE output file name [stdout]

Algorithm options:

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

-y STR type of read counts [insert.scaled_counts]

Values: [base.coverage, insert.raw_counts, insert.scaled_counts]

> motus calc_motu

$ motus calc_motu

Usage: motus calc_motu [options]

Input options:

-n STR sample name

-i FILE provide the mgc abundance table

Output options:

-o FILE output file name [stdout]

-e profile only reference species (ref_mOTUs)

-B print result in BIOM format

-C STR print result in CAMI format (BioBoxes format 0.9.1)

Values: [precision, recall, parenthesis]

-c print result as counts instead of relative abundances

-p print NCBI id

-u print the full name of the species

-q print the full rank taxonomy

-k STR taxonomic level [mOTU]

Values: [kingdom, phylum, class, order, family, genus, mOTU]

Algorithm options:

-g INT number of marker genes cutoff: 1=higher recall, 10=higher precision [3]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus map_snv

$ motus map_snv

Usage: motus map_snv [options]

Input options:

-f FILE[,FILE] input file(s) for reads in forward orientation, fastq formatted

-r FILE[,FILE] input file(s) for reads in reverse orientation, fastq formatted

-s FILE[,FILE] input file(s) for reads without mate, fastq formatted

Output options:

-o FILE output bam file name [stdout]

Algorithm options:

-l INT min. length of alignment for the reads (number of nucleotides) [75]

-t INT number of threads [1]

-v INT verbose level: 1=error, 2=warning, 3=message, 4+=debugging [3]

> motus snv_call

$ motus snv_call

Usage: motus snv_call -d Directory -o Directory [options]

Input options:

-d DIR Call metaSNV on all bam files in the directory. [Mandatory]

-fb FLOAT Coverage breadth: minimal horizontal genome coverage percentage per sample per species. Default=80.0

-fd FLOAT Coverage depth: minimal average vertical genome coverage per sample per species. Default=5.0

-fm INT Minimum number of samples per species. Default=2

-fp FLOAT FILTERING STEP II: Required proportion of informative samples (coverage non-zero) per position. Default=0.50

-fc FLOAT FILTERING STEP II: Minimum coverage per position per sample per species. Default=5.0

-t INT Number of threads. Default=1

Output options:

-o DIR Output directory. Will fail if already exists. [Mandatory]

-K Keep all the directories produced by metaSNV. Default is to remove cov, distances, filtered, snpCaller

Algorithm options:

-v INT Verbose level: 1=error, 2=warning, 3=message, 4+=debugging. Default=3

snv_callを使うにはmetaSNVも必要。詳細はwiki参照。

実行方法

１、taxonomyプロファイリング

ペアエンドfastq（gzip圧縮にも対応）。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -n sample1 > taxonomy_profile.txt

-f input file(s) for reads in forward orientation, fastq formatted
-r input file(s) for reads in reverse orientation, fastq formatted
-n sample name
-t number of threads [1]
-l min. length of alignment for the reads (number of nucleotides) [75]

シングルエンドfastq

motus profile -s single_reads.fq -t 8 -n sample1 > taxonomy_profile.txt

-s input file(s) for reads without mate, fastq formatted

複数 fastqある場合（同じサンプル）

motus profile -t 8 -n sample1 \
 -f for_r_lane1.fq,for_r_lane2.fq \
 -r rev_r_lane1.fq,rev_r_lane2.fq \
 > taxonomy_profile.txt

または各ステップを個別に行う。

motus map_tax -s metagenomic_sample.fastq -o mapped_reads.sam
motus calc_mgc -i mapped_reads.sam -o mgc_ab_table.count
motus calc_motu -i mgc_ab_table.count > taxonomy_profile.txt
rm mapped_reads.sam mgc_ab_table.count

出力

bwaを使ってmOTUsデータベースに対してmappingされる。長さ"-l 100"でフィルタリングされて結果は出力される。bamをexportするならiオプションを使い"-I mapped.bam"のように指定する。

f:id:kazumaxneo:20190425115858j:plain

各speciesのrelative abundanceが出力される。2019年4月現在7730行ある。出力についてはwiki参照（link）。

0以外を確認（mergeしたファイルの場合、1サンプルでも0があれば消すので注意）

awk '!/0.0000000000/' taxonomy_profile.txt

出力変更

"-c"でrelative abundanceからリードカウントに変更

motus profile -f pair_1.fq -r pair_2.fq -t 8 -c > taxonomy_profile.txt

-c print result as counts instead of relative abundances

標準ではspecies出力だが、-kで階級を指定できる。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -k class  > taxonomy_profile.txt

-k taxonomic level [mOTU] Values: [kingdom, phylum, class, order, family, genus, mOTU]

"-p"でNCBI taxonomy idsを出力する。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -p > taxonomy_profile.txt

-p print NCBI id

"-q"でfull rank taxonomyを出力する。-kと併用すれば出力する下限の階級を指定できる。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -k class -q > taxonomy_profile.txt

-k taxonomic level [mOTU] Values: [kingdom, phylum, class, order, family, genus, mOTU]
-q print the full rank taxonomy

"-B"で Biom format v.1.0で出力する（BIOM詳細）。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -B > taxonomy_profile.biom

-B print result in BIOM format.

CAMI (BioBoxes) formatで出力する（wiki）。

motus profile -f pair_1.fq -r pair_2.fq -t 8 -C precision > taxonomy_profile

-C <value> print result in CAMI format (BioBoxes format 0.9.1). Values: [precision, recall, parenthesis]

他にも full species nameを出す"-u"や、メタゲノムのmeta_mOTUsに-1を足してref_mOUTsだけ出力する"-e"があります。wikiで確認して下さい。

２、複数サンプル結果のマージ

まず個別にmotus profileを実行する。

motus profile -f sample1_1.fq -r sample1_2.fq -t 8 -n sample1 > sample1.txt
motus profile -f sample2_1.fq -r sample2_2.fq -t 8 -n sample2 > sample2.txt
motus profile -f sample3_1.fq -r sample3_2.fq -t 8 -n sample3 > sample3.txt

マージする。

motus merge -i sample1.txt,sample2.txt,sample3.txt >  all_sample_profiles.txt

-o output file name [stdout]

マージしてBIOM format（JSON-derived format）出力。

motus merge -i sample1.txt,sample2.txt,sample3.txt -B > all_sample_profiles.biom

-B print result in BIOM format

引用

Microbial abundance, activity and population genomic profiling with mOTUs2

Alessio Milanese, Daniel R Mende, Lucas Paoli, Guillem Salazar, Hans-Joachim Ruscheweyh, Miguelangel Cuenca, Pascal Hingamp, Renato Alves, Paul I Costea, Luis Pedro Coelho, Thomas S. B. Schmidt, Alexandre Almeida, Alex L Mitchell, Robert D. Finn, Jaime Huerta-Cepas, Peer Bork, Georg Zeller, Shinichi Sunagawa

Nature Commun. 2019; 10: 1014

Metagenomic species profiling using universal phylogenetic marker genes
Shinichi Sunagawa, Daniel R Mende, Georg Zeller, Fernando Izquierdo-Carrasco, Simon A Berger, Jens Roat Kultima, Luis Pedro Coelho, Manimozhiyan Arumugam, Julien Tap, Henrik Bjørn Nielsen, Simon Rasmussen, Søren Brunak, Oluf Pedersen, Francisco Guarner, Willem M de Vos, Jun Wang, Junhua Li, Joël Doré, S Dusko Ehrlich, Alexandros Stamatakis & Peer Bork
Nature Methods volume 10, pages 1196–1199 (2013)