2021-06-22

MetaMarker: de novo discovery of metagenomic biological markers (regions shared across samples)

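Whole metagenome sequencing (WMS) is a new approach for studying microbial communities. Using WMS, researchers have found that the human microbiome is closely associated with a range of diseases, including colon cancer, bacterial vaginosis, diabetes, and Crohn's disease (Cho and Blaser, 2012). Differences in bacterial patterns (composition and abundance) between healthy and affected individuals can be regarded as novel biomarkers for disease onset and prognosis. When searching for metagenomic biomarkers, researchers have focused on finding strains, genes, metabolites, or pathway families that reliably distinguish two or more microbial communities. For example, Campylobacter jejuni in the human gut has been shown to be an indicator of immunoproliferative small intestinal disease (Lecuit et al., 2004), and levels of gut bacterial lipopolysaccharide have been shown to be positively associated with colorectal cancer (CRC) (Schuerer-Maly et al., 1994).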

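Although WMS has been used to discover new metagenomic biomarkers, current computational methods discard unmapped reads, which biases their results. As a common preprocessing step, these methods first map the raw reads to a reference database to generate an operational taxonomic unit (OTU) table. This step excludes unknown bacterial sequences that may not yet exist in the reference database but could be important for the disease. To address this problem, the authors developed MetaMarker, a de novo approach for discovering metagenomic biomarkers from WMS without using an OTU table. The biomarkers that MetaMarker discovers are conserved sequence fragments shared by a subgroup of bacteria. These fragments usually have a specific biological function and play an important role in the microbial community, but they may not yet have been collected in any reference database (Supplementary Fig. S1 of the paper).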

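To overcome the memory limits of processing the large number of reads in WMS samples, MetaMarker uses a two-stage approach. In the first stage, short markers are discovered by profiling the normalized k-mer abundance of the case and control groups: k-mers whose abundance differs statistically between cases and controls are combined to generate short markers. In the second stage, the short markers are used to generate long markers. Finally, the long markers are cleaned and ranked to produce the list of biomarkers (Supplementary Fig. S2 of the paper). MetaMarker takes WMS samples in FASTA format as input and produces two files as the final biomarkers (case-enriched and control-enriched). The authors applied MetaMarker to WMS data from French colorectal cancer (CRC) stool samples to search for CRC-specific metagenomic biomarkers. They demonstrated the robustness of the discovered biomarkers by validating them on independent samples from Hong Kong, Austria, Germany, and Denmark, and further showed that these biomarkers can be used to build machine-learning classifiers for predicting colorectal cancer. MetaMarker is freely available under the GPLv3 license at https://bitbucket.org/mkoohim/metamarker.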

wiki

https://bitbucket.org/mkoohim/metamarker/wiki/Home

Reproduced from GitHub

Installation

Tested on Ubuntu 18.04 (using Docker).

Dependencies

Python 2.7
biopython >1.7
numpy >1.13.1
pandas >0.20.3
scipy >1.18.1
SPAdes >3.11
USEARCH >9.0
CD-HIT >4.6.4
Bowtie2 >2.2.6
GenomeTools >1.5.8

Main program: Bitbucket

> python Step1_MakeHashTable.py -h

# python Step1_MakeHashTable.py -h
usage: Step1_MakeHashTable.py [-h] [--version] [--sample STRSAMPLE]
                              [--output STROUT] [--npz STRNPZ]
                              [--window_size WINDOWLENGTH]
                              [--step_size SAMPLINGRATE]

MakeHashTable: This program generates a hash table for each whole metagenome
sample by slide a window on reads and find the normalized abundance of each
k-mer. The default size of the window is 12 (k=12). Larger size of k need
largener RAM memory. The generated hash table will be saved az an NPZ. The
input file should be in Fasta format.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input::
  --sample STRSAMPLE    Enter the path of the folder that contain FASTA read
                        files.

Output::
  --output STROUT       Enter the path of the output directory. The final NPZ
                        format hash table will be saved into this folder.
  --npz STRNPZ          Enter the name of the output NPZ file. The default
                        name is same as the input sample file name.

Parameters::
  --window_size WINDOWLENGTH
                        Enther the length of window. The default value is 12.
                        If you increase the size of the window you need more
                        RAM to run the program. The window size should be an
                        even number.
  --step_size SAMPLINGRATE
                        Enter the sampling rate. The default value is 1. It
                        means the window move through the reads one base by
                        one base. Increasing this parameter may affect on
                        accuracy but the pipeline will run faster.

> python Step2_MakeShortMarker.py -h

# python Step2_MakeShortMarker.py -h
usage: Step2_MakeShortMarker.py [-h] [--version]
                                [--caseHash STRCASEHASHTABLES]
                                [--controlHash STRCONTROLHASHTABLES]
                                [--spade STRSPADE] [--output STROUT]
                                [--threads ITHREADS] [--test STRTESTTYPE]
                                [--pvalue STRPVALUE]

MakeShortMarker: This script process the hash tables which are made using
MakeHashTable script.Here we make a profile table for each case and control
group and extract those k-mers which are statistically different in case and
control groups. The selected k-mers will be assembled to generate the short-
markers.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input::
  --caseHash STRCASEHASHTABLES
                        Enter the path of hash tables (NPZ files) which are
                        extracted for case.
  --controlHash STRCONTROLHASHTABLES
                        Enter the path of hash tables (NPZ files) which are
                        extracted for control.

Programs::
  --spade STRSPADE      Provide the path to Spades. Default call will be
                        spades.py.

Output::
  --output STROUT       Enter the path and name of the output directory.

Parameters::
  --threads ITHREADS    Enter the number of CPUs available SPAdes. The default
                        value is 1.
  --test STRTESTTYPE    Enter type of test 'wilcoxon' or 'ttest'. By default
                        we used Wilcoxon Mann Whitney test.
  --pvalue STRPVALUE    Enter the pvalue threshold for statistical test. The
                        default value is 0.05.

> python Step3_MakeLongMarker.py -h

# python Step3_MakeLongMarker.py -h
usage: Step3_MakeLongMarker.py [-h] [--version]
                               [--short_marker SHORTMARKERSTR]
                               [--sample STRSAMPLE] [--tmp STRTMP]
                               [--output STROUT] [--spade STRSPADE]
                               [--usearch USEARCHSTR] [--gt GTSTR]
                               [--threads ITHREADS] [--id IDSTR]

MakeLongMarker: This script uses MakeShortMarker script output.We used USEARCH
to extract those reads which have at least one short-marker as substring.The
selected reads will be assembled to generate a longer marker.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input::
  --short_marker SHORTMARKERSTR
                        Enter the path of the short-marker file which is found
                        from MakeShortMarker.
  --sample STRSAMPLE    Enter the path of the folder that contain FASTA files.

Output::
  --tmp STRTMP          Enter the path and name of the tmp directory.
  --output STROUT       Enter the path and name of the output directory.

Programs::
  --spade STRSPADE      Provide the path to Spades. Default call will be
                        spades.py.
  --usearch USEARCHSTR  Enter the path of USEARCH tool. Please provide the
                        full path of usearch. i.e.
                        tools/usearch11.0.667_i86linux32
  --gt GTSTR            Enter the gt tool path.

Parameters::
  --threads ITHREADS    Enter the number of CPUs available for USEARCH and
                        SPAdes. The default value is 1.
  --id IDSTR            Enter similarity rate for the USEARCH. The default
                        value is 0.95.

> python Step4_MarkerCleaning.py -h

# python Step4_MarkerCleaning.py -h
usage: Step4_MarkerCleaning.py [-h] [--version] [--caseMarker STRINPUTCASE]
                               [--caseSize STRCASESIZE]
                               [--controlMarker STRINPUTCONTROL]
                               [--controlSize STRCONTROLSIZE]
                               [--output STROUT] [--cdhit STRCDHIT]
                               [--memory STRMEMORY] [--threads STRTHREADS]

MarkerCleaning: This script will process the MarkerExtend output files. Here
we first merge all markers into one FASTA file. We remove markers shorter than
100bp.We then used CD-Hit to cluster the markers and remove those markers
which are redundant (90% similarity) also we removed those markers which are
appeared in less than 10% of samples.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input::
  --caseMarker STRINPUTCASE
                        Enter the path of the folder that contain Case samples
                        markers.
  --caseSize STRCASESIZE
                        Enter the size of case group (number of samples in
                        case).
  --controlMarker STRINPUTCONTROL
                        Enter the path of the folder that contain Control
                        samples markers.
  --controlSize STRCONTROLSIZE
                        Enter the size of control group (number of samples in
                        control).

Output::
  --output STROUT       Enter the path and name of the output directory.

Parameters::
  --cdhit STRCDHIT      Enter the path of CD-HIT tool.
  --memory STRMEMORY    Enter the memory usage by CD-Hit. The default value is
                        40000MB
  --threads STRTHREADS  Enter the number of threads for CD-Hit. The default
                        value is 1.

> python Step5_MarkerAbundace.py -h

# python Step5_MarkerAbundace.py -h
usage: Step5_MarkerAbundace.py [-h] [--version] [--marker STRMARKER]
                               [--sample STRSAMPLE] [--tmp STRTMP]
                               [--output STROUT] [--bowtie2 STRBOWTIE2]
                               [--id DID] [--alnlength ALNLENGTH]
                               [--threads ITHREADS] [--removeTemp]

MarkerAbundance: This script find the normalized abundance of each markerin
WMS sample. The output will be saved into a CSV format file.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input::
  --marker STRMARKER    Enter the path of the marker that you want to search,
                        in FASTA format.
  --sample STRSAMPLE    Enter the path of the folder that contain reads. The
                        read files should be in Fasta format.

Output::
  --tmp STRTMP          Enter the path of the tmp directory.
  --output STROUT       Enter the path of the output directory.

Programs::
  --bowtie2 STRBOWTIE2  Provide the path to bowtie2

Parameters::
  --id DID              Enter the percent identity for the match
  --alnlength ALNLENGTH
                        Enter the minimum alignment length. The default is 50
  --threads ITHREADS    Enter the number of CPUs available for Bowtie2.
  --removeTemp          If you use this option, the temp folder will be
                        deleted from your disk.

> python Step6_MarkerRank.py -h

# python Step6_MarkerRank.py -h
usage: Step6_MarkerRank.py [-h] [--version] [--case STRCASE]
                           [--control STRCONTROL] [--marker STRMARKER]
                           [--cdhit STRCDHIT] [--output STROUT]
                           [--pvalue STRPVAL] [--filter STRFILTER]
                           [--topMarker STRTOPMARKER]
                           [--similarity STRSIMILARITY] [--memory STRMEMORY]
                           [--threads STRTHREADS]

MarkerRank: This script find those markers which are different in case and
control.Here we used wilcoxon-rank sum test to assign a p-value to each
marker.Markers with p-values less than 0.05 (default) have been selected and
then we used cd-hit to remove redundantmarkers (similarity>60%). Finally the
markers will rank based on the p-values

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Input::
  --case STRCASE        Enter the path of the marker abundance result for case
                        samples.
  --control STRCONTROL  Enter the path of the marker abundance result for
                        control samples.
  --marker STRMARKER    Enter the path of .

Parameters::
  --cdhit STRCDHIT      Enter the path of CD-HIT tool.

Output::
  --output STROUT       Enter the path of the output directory.

Parameters::
  --pvalue STRPVAL      Enter the cut-off for p-value
  --filter STRFILTER    Enter the cut-off to exclude markers with average
                        abundance score less than this value. The default
                        value is 1.
  --topMarker STRTOPMARKER
                        Enter the number of top markrs which you like to see
                        in output. The number of final marker maybe less than
                        your input value as we removeredundunt markers from
                        top markers. The default value is 500.
  --similarity STRSIMILARITY
                        Enter the cut-off for cdhit similarity.
  --memory STRMEMORY    Enter the memory usage by CD-Hit. The default value is
                        40000MB
  --threads STRTHREADS  Enter the number of threads for CD-Hit. The default
                        value is 1.

I have uploaded a Docker image.

docker pull kazumax/metamarker

#run
docker run -itv $PWD:/data/ -w /root/mkoohim-metamarker kazumax/metamarker
source ~/.profile
python Step1_MakeHashTable.py -h

Usage

1, Build a hash table for each sample

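The input is whole metagenome sequencing (WMS) reads in FASTA format. Prepare one group of samples for cases and one for controls. The tutorial uses WMS samples from a French cohort: 53 samples from colorectal carcinoma (CRC) patients and 61 from healthy individuals (the tutorial gives the link) (PubMed). Here I downloaded 10 samples with grabseqs. FASTQ is not accepted, so convert the reads to FASTA beforehand.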
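Since the scripts only accept FASTA, the conversion from FASTQ can be done with any standard tool. As a minimal sketch, assuming plain four-line FASTQ records (no wrapped sequence lines), a standard-library Python 3 function could look like this. The function name `fastq_to_fasta` is hypothetical and is not part of MetaMarker.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA; return the number of records."""
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate stray blank lines between records
        if not header.startswith("@"):
            raise ValueError("not a FASTQ header: %r" % header)
        seq = fastq_handle.readline().rstrip("\n")
        fastq_handle.readline()  # '+' separator line, ignored
        fastq_handle.readline()  # quality line, ignored
        fasta_handle.write(">" + header[1:] + "\n" + seq + "\n")
        count += 1
    return count
```

For real files, open the input and output with `open(path)` and `open(path, "w")` and pass the two handles to the function.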

Once the files are ready, specify the directory containing the FASTA files with "--sample".

python Step1_MakeHashTable.py --sample sample_case_1 --output npz_case --npz case
python Step1_MakeHashTable.py --sample sample_control_1 --output npz_control --npz ctrl

--sample Enter the path of the folder that contain FASTA read files.
--output Enter the path of the output directory. The final NPZ format hash table will be saved into this folder.

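Two NumPy matrices (a case matrix and a control matrix) are produced. Each row of a matrix gives the normalized abundance of a k-mer in the different samples (even if the directory contains several FASTA files, the output is a single file).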
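To make the idea of a sliding-window k-mer profile concrete, here is a small Python 3 sketch. It is not the MetaMarker implementation: the function name is hypothetical, normalizing by the total number of counted windows is my assumption, and the real script stores its result as an NPZ hash table. The parameters mirror the `--window_size` and `--step_size` options.

```python
from collections import Counter


def kmer_abundance(reads, window_size=12, step_size=1):
    """Return {k-mer: fraction of all counted windows} for a list of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of length window_size, moving step_size bases each time.
        for start in range(0, len(read) - window_size + 1, step_size):
            kmer = read[start:start + window_size]
            if set(kmer) <= set("ACGT"):  # skip windows containing N etc.
                counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # Assumption: normalize by the total number of counted windows.
    return {kmer: n / total for kmer, n in counts.items()}
```

A larger `step_size` visits fewer windows, which matches the help text's remark that it runs faster but may affect accuracy.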

2, Generate short-markers

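Select the k-mers whose normalized abundance differs statistically between the case and control groups, then assemble the selected k-mers with SPAdes. Specify the paths of the Step 1 output directories, npz_case/ and npz_control/.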

python Step2_MakeShortMarker.py --caseHash npz_case/ --controlHash npz_control/ --spade /software/SPAdes-3.11.1/bin --output output

The SPAdes output is then processed to extract sequences longer than k. These are called short markers.

In my run, this step stopped with an error saying that there is no difference between the case and control groups.
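The k-mer selection in this step can be illustrated with a short Python 3 sketch. This is not MetaMarker's code: it uses a hand-written Wilcoxon-Mann-Whitney test with the normal approximation, without tie or continuity correction, so the p-values are only approximate for small groups. All function names are hypothetical. Each profile maps a k-mer to its list of per-sample normalized abundances.

```python
import math


def rank_with_ties(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_p(case, control):
    """Two-sided p-value, normal approximation (simplified, see lead-in)."""
    n1, n2 = len(case), len(control)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one sample")
    ranks = rank_with_ties(list(case) + list(control))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2.0))


def select_kmers(case_profile, control_profile, pvalue=0.05):
    """Keep k-mers whose abundance differs between case and control."""
    selected = []
    for kmer in sorted(set(case_profile) & set(control_profile)):
        if mann_whitney_p(case_profile[kmer], control_profile[kmer]) < pvalue:
            selected.append(kmer)
    return selected
```

The `pvalue` default of 0.05 follows the `--pvalue` default in the help text. If no k-mer passes the threshold, the selection is empty and there is nothing to assemble.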

3, Generate long-markers

python Step3_MakeLongMarker.py --sample sample_case_1 --short_marker Short_Markers_case.fasta --output /case_long_marker --tmp /temp --threads 3 --usearch /software/usearch_tool/ --id 0.95 --spade /software/SPAdes-3.11.1/bin --gt /software/genometools-1.5.8/bin
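According to the help text, this step extracts the reads that contain at least one short marker as a substring and assembles them into longer markers. The read-recruitment idea can be sketched in Python 3 as below. This is a simplification, not what the script does internally: MetaMarker uses USEARCH with a similarity threshold (0.95 by default), whereas the sketch uses exact substring matching, and checking the reverse-complement strand is my assumption. The function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (unknown bases become N)."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp.get(base, "N") for base in reversed(seq.upper()))


def reads_with_marker(reads, short_markers):
    """Return the reads that contain at least one short marker on either strand."""
    markers = [m.upper() for m in short_markers]
    markers += [reverse_complement(m) for m in short_markers]
    return [r for r in reads if any(m in r.upper() for m in markers)]
```

The selected reads would then go to an assembler (SPAdes in the real pipeline) to produce the long markers.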

4, Cleaning long-markers

python Step4_MarkerCleaning.py --caseMarker result_case --controlMarker result_control --caseSize 53 --controlSize 61 --cdhit /software/cd-hit-4.6.4 --output output
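The help text says that cleaning removes markers shorter than 100 bp, clusters redundant markers with CD-HIT (90% similarity), and drops markers that appear in fewer than 10% of the samples. The length and prevalence filters can be sketched in Python 3 as follows. The clustering step is left out, and "the same marker" is taken to mean an identical sequence, which is a simplifying assumption. The function name is hypothetical.

```python
def clean_markers(markers_by_sample, group_size, min_length=100, min_fraction=0.10):
    """Keep markers that are long enough and present in enough samples.

    markers_by_sample: {sample name: list of marker sequences}
    group_size: number of samples in the group (cf. --caseSize / --controlSize)
    """
    seen_in = {}
    for markers in markers_by_sample.values():
        for seq in set(markers):  # count each marker once per sample
            if len(seq) >= min_length:
                seen_in[seq] = seen_in.get(seq, 0) + 1
    return sorted(
        seq for seq, n in seen_in.items()
        if n / float(group_size) >= min_fraction
    )
```

With `group_size=53`, as in the command above, a marker would need to occur in at least 6 case samples to pass the 10% filter.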

5, Calculate normalized abundance of long-markers

python Step5_MarkerAbundace.py --sample sample_case_1 --marker Selected_Markers.fasta --output /result_abundance_case --tmp /temp --threads 3 --bowtie2 /data2/microbiome/software/bowtie2-2.2.6/bin

6, Rank long-markers

python Step6_MarkerRank.py --case /result_abundance_case --control /result_abundance_control --marker Selected_Markers.fasta --cdhit /software/cd-hit-4.6.4 --output /output_ranking
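The ranking logic described in the help text (p-value cut-off, minimum average abundance, top N markers) can be sketched in Python 3 as below. It assumes the p-values have already been computed for each marker and omits the CD-HIT redundancy removal, so it only illustrates the filtering and ordering. The function name and the tuple layout are hypothetical; the defaults follow the help text (`--pvalue` 0.05, `--filter` 1, `--topMarker` 500).

```python
def rank_markers(markers, pvalue=0.05, min_abundance=1.0, top=500):
    """Rank markers by p-value after filtering.

    markers: list of (name, p_value, mean_abundance) tuples.
    Returns the names of the retained markers, most significant first.
    """
    kept = [
        m for m in markers
        if m[1] < pvalue and m[2] >= min_abundance
    ]
    kept.sort(key=lambda m: m[1])  # smallest p-value first
    return [m[0] for m in kept[:top]]
```

In the real script the final list may be shorter than `top`, because redundant markers are removed from the top markers afterwards.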

Citation

MetaMarker: a de novo pipeline for discovering novel metagenomic biomarkers
Mohamad Koohi-Moghadam, Mitesh Borad, Nhan Tran, Kristin Swanson, Lisa Boardman, Hongzhe Sun, Junwen Wang
Bioinformatics, Published: 02 March 2019