StrainPhlAn3 - macでインフォマティクス

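From the tutorial:

StrainPhlAn is a tool for resolving species at the strain level across large sample sets, based on single-nucleotide polymorphisms (SNPs) within conserved and unique species marker genes. The first step of the StrainPhlAn workflow is to run MetaPhlAn 3.0.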

http://segatalab.cibio.unitn.it/tools/strainphlan/

Tutorial

https://github.com/biobakery/MetaPhlAn/wiki/StrainPhlAn-3.0

bioBakery forum (discussion, questions, etc.)

https://forum.biobakery.org/

Running with bioBakery workflows

https://huttenhower.sph.harvard.edu/biobakery_workflows/

List of Segata Lab tools

http://segatalab.cibio.unitn.it/tools/

Installation

GitHub

StrainPhlAn 3 is bundled with MetaPhlAn 3. Installing MetaPhlAn 3 with conda or a similar tool also puts strainphlan on the PATH (reference).

git clone https://github.com/biobakery/MetaPhlAn.git

> strainphlan -h

usage: strainphlan [-h] [-d DATABASE] [-m CLADE_MARKERS]
                   [-s SAMPLES [SAMPLES ...]] [-r REFERENCES [REFERENCES ...]]
                   [-c CLADE] [-o OUTPUT_DIR] [-n NPROCS]
                   [--secondary_samples SECONDARY_SAMPLES [SECONDARY_SAMPLES ...]]
                   [--secondary_references SECONDARY_REFERENCES [SECONDARY_REFERENCES ...]]
                   [--trim_sequences TRIM_SEQUENCES]
                   [--marker_in_n_samples MARKER_IN_N_SAMPLES]
                   [--sample_with_n_markers SAMPLE_WITH_N_MARKERS]
                   [--secondary_sample_with_n_markers SECONDARY_SAMPLE_WITH_N_MARKERS]
                   [--sample_with_n_markers_after_filt SAMPLE_WITH_N_MARKERS_AFTER_FILT]
                   [--phylophlan_mode {accurate,fast}]
                   [--phylophlan_configuration PHYLOPHLAN_CONFIGURATION]
                   [--tmp TMP] [--mutation_rates] [--print_clades_only]
                   [--debug]

optional arguments:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        The input MetaPhlAn 3.0.14 database (default:
                        /home/kazu/mambaforge/envs/metaphlan3/lib/python3.7/site-packages/metaphlan/metaphlan_databases/mpa_v30_CHOCOPhlAn_201901.pkl)
  -m CLADE_MARKERS, --clade_markers CLADE_MARKERS
                        The clade markers as FASTA file (default: None)
  -s SAMPLES [SAMPLES ...], --samples SAMPLES [SAMPLES ...]
                        The reconstructed markers for each sample (default: )
  -r REFERENCES [REFERENCES ...], --references REFERENCES [REFERENCES ...]
                        The reference genomes (default: )
  -c CLADE, --clade CLADE
                        The clade to investigate (default: None)
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        The output directory (default: None)
  -n NPROCS, --nprocs NPROCS
                        The number of threads to use (default: 1)
  --secondary_samples SECONDARY_SAMPLES [SECONDARY_SAMPLES ...]
                        The reconstructed markers for each secondary sample
                        (default: )
  --secondary_references SECONDARY_REFERENCES [SECONDARY_REFERENCES ...]
                        The secondary reference genomes (default: )
  --trim_sequences TRIM_SEQUENCES
                        The number of bases to remove from both ends when
                        trimming markers (default: 50)
  --marker_in_n_samples MARKER_IN_N_SAMPLES
                        Theshold defining the minimum percentage of samples to
                        keep a marker (default: 80)
  --sample_with_n_markers SAMPLE_WITH_N_MARKERS
                        Threshold defining the minimun percentage of markers
                        to keep a sample (default: 80)
  --secondary_sample_with_n_markers SECONDARY_SAMPLE_WITH_N_MARKERS
                        Threshold defining the minimun percentage of markers
                        to keep a secondary sample (default: 80)
  --sample_with_n_markers_after_filt SAMPLE_WITH_N_MARKERS_AFTER_FILT
                        Threshold defining the minimun percentage of markers
                        to keep a sample after filtering the markers [only for
                        dev] (default: 50)
  --phylophlan_mode {accurate,fast}
                        The presets for fast or accurate phylogenetic analysis
                        (default: fast)
  --phylophlan_configuration PHYLOPHLAN_CONFIGURATION
                        The PhyloPhlAn configuration file (default: None)
  --tmp TMP             If specified, the directory where to store the
                        temporal files. (default: None)
  --mutation_rates      If specified, StrainPhlAn will produce a mutation
                        rates table for each of the aligned markers and a
                        summary table for the concatenated MSA. This operation
                        can take long time to finish (default: False)
  --print_clades_only   If specified, StrainPhlAn will only print the
                        potential clades and stop the execution (default:
                        False)
  --debug               If specified, StrainPhlAn will not remove the temporal
                        folders (default: False)

Test run

Follow the tutorial.

1. First, run MetaPhlAn 3.0 to obtain the sam.bz2 files.

# Test data (link)
git clone https://github.com/biobakery/biobakery.git
mkdir input
cp biobakery/demos/biobakery_demos/data/strainphlan2/reads/* input/

mkdir -p sams
mkdir -p bowtie2
mkdir -p profiles
for f in input/*
do
  echo "Running MetaPhlAn 3.0 on ${f}"
  # Strip the directory and the .fastq.bz2 suffix to get the sample name
  bn=$(basename "${f}" .fastq.bz2)
  metaphlan "${f}" --input_type fastq -s "sams/${bn}.sam.bz2" --bowtie2out "bowtie2/${bn}.bowtie2.bz2" --nproc 8 -o "profiles/${bn}_profile.tsv"
done

This took several tens of minutes with 8 threads.
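The output names in step 1 are derived from each input file name, so it can be worth checking that derivation on its own before launching a long run. The following is a minimal sketch, assuming the read files end in .fastq.bz2. The helpers sample_name and output_paths and the file sampleA.fastq.bz2 are hypothetical, and nothing here calls metaphlan.

```shell
# Derive a sample name from a read file path by dropping the directory
# and the .fastq.bz2 suffix (the suffix is an assumption about input naming).
sample_name() {
  basename "$1" .fastq.bz2
}

# Print the three output paths that would be passed to metaphlan for one input.
# This only prints names; it does not run metaphlan.
output_paths() (
  bn=$(sample_name "$1")
  printf '%s\n' "sams/${bn}.sam.bz2" "bowtie2/${bn}.bowtie2.bz2" "profiles/${bn}_profile.tsv"
)

# sampleA is a hypothetical file name used only for illustration.
output_paths "input/sampleA.fastq.bz2"
```

Passing the suffix to basename avoids a stray trailing dot in the sample name, which a pattern without the leading dot would leave behind.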

2. Run sample2markers.py

mkdir consensus_markers
sample2markers.py -i sams/*.sam.bz2 -o consensus_markers --nproc 8

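Passing the SAM output files to the sample2markers.py script generates marker files. Each marker file contains the consensus sequences of the unique marker genes for every species found in the sample, and is used for SNP profiling.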

consensus_markers/ contains the sample marker files (*.pkl).
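Before moving on, it can help to confirm that step 2 produced as many marker files as there were SAM files. The following is a sketch that only compares file counts, assuming one .pkl is written per .sam.bz2 input. count_matching is a hypothetical helper, not part of MetaPhlAn.

```shell
# Count the files in directory $1 whose names end with suffix $2.
count_matching() (
  n=0
  for f in "$1"/*"$2"; do
    # When nothing matches, the glob stays literal and the -e test fails.
    if [ -e "$f" ]; then
      n=$((n + 1))
    fi
  done
  echo "$n"
)

n_sams=$(count_matching sams .sam.bz2)
n_pkls=$(count_matching consensus_markers .pkl)

if [ "$n_sams" -eq "$n_pkls" ]; then
  echo "OK: ${n_pkls} marker files for ${n_sams} SAM files"
else
  echo "Mismatch: ${n_pkls} marker files for ${n_sams} SAM files"
fi
```

Comparing counts rather than names keeps the check independent of how the marker files are named.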

3. Extract the markers for one clade from the MetaPhlAn database; here, Bacteroides_caccae.

mkdir -p db_markers
extract_markers.py -c s__Bacteroides_caccae -o db_markers/

A file containing the Bacteroides_caccae markers is generated in db_markers/.

4. Run StrainPhlAn to build the multiple sequence alignment and the phylogenetic tree. Specify the sample marker files from step 2, a reference genome, and the clade marker file from step 3.

mkdir -p output
strainphlan -s consensus_markers/*.pkl -m db_markers/s__Bacteroides_caccae.fna -r Bacteroides_caccae=reference.fna -o output -n 8 -c s__Bacteroides_caccae --mutation_rates

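StrainPhlAn filters the selected clade markers based on their presence in the sample-reconstructed strains (stored in the marker files created in step 2) and in the reference genomes (if specified). The sample-reconstructed strains and the reference genomes are in turn filtered based on the presence of the selected clade markers. From the filtered markers and samples, StrainPhlAn then calls PhyloPhlAn to produce an MSA (multiple sequence alignment) and build the phylogenetic tree (from the manual).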
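To make the percentage thresholds concrete (--marker_in_n_samples and --sample_with_n_markers both default to 80 in the help text), here is a toy sketch of presence-based filtering with awk. It is an illustration under stated assumptions, not StrainPhlAn's code: the input is a hypothetical list of "sample marker" pairs, the denominators are simply the distinct samples or markers seen in that list, and the order in which the real tool applies its filters may differ. filter_by_presence is a hypothetical helper.

```shell
# Keep the items in column `key` that co-occur with at least `thr` percent
# of the distinct items in column `other`. Reads whitespace-separated
# "sample marker" pairs (one observed marker per line) from standard input.
filter_by_presence() {
  awk -v key="$1" -v other="$2" -v thr="$3" '
    {
      k = $key; o = $other
      if (!(o in all)) { all[o] = 1; total++ }
      if (!((k, o) in pair)) { pair[k, o] = 1; count[k]++ }
    }
    END { for (k in count) if (100 * count[k] / total >= thr) print k }
  ' | sort
}

# Toy data (hypothetical): 5 samples, 3 markers.
pairs='s1 m1
s1 m2
s1 m3
s2 m1
s2 m2
s2 m3
s3 m1
s3 m2
s4 m1
s4 m2
s5 m1'

# Markers present in at least 80% of samples: m1 (5/5) and m2 (4/5).
printf '%s\n' "$pairs" | filter_by_presence 2 1 80
# Samples carrying at least 80% of markers: s1 and s2 (3/3 each).
printf '%s\n' "$pairs" | filter_by_presence 1 2 80
```

In this toy table, marker m3 is dropped because only 2 of 5 samples carry it, and samples s3 to s5 are dropped because they carry fewer than 80% of the three markers.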

output/RAxML_bestTree.s__Bacteroides_caccae.StrainPhlAn3.tre is generated.
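A quick way to see which samples and references survived the filtering is to list the tip labels of the tree. The following is a minimal sketch for a plain Newick file, assuming unquoted labels; tree_tips is a hypothetical helper and the toy tree uses made-up sample names.

```shell
# Print the tip (leaf) labels of a Newick tree, one per line.
# Assumes unquoted labels that contain none of the characters ( , ) : ;
tree_tips() {
  # Split at every "(" and ",", then cut each piece at the first ")", ";" or ":".
  tr '(,' '\n\n' < "$1" | sed 's/[);:].*//' | grep -v '^$'
}

# Toy tree with hypothetical sample names, written to a temporary file.
toy_tree=$(mktemp)
printf '%s\n' '((sampleA:0.1,sampleB:0.2):0.05,sampleC:0.3);' > "$toy_tree"
tree_tips "$toy_tree"
rm -f "$toy_tree"
```

On the real output, the same function could be pointed at output/RAxML_bestTree.s__Bacteroides_caccae.StrainPhlAn3.tre to compare the tip labels with the sample IDs in the metadata file.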

5. add_metadata_tree.py (link), which is not installed by conda, can add multiple metadata files to the tree file; the result can then be visualized with GraPhlAn as a tree annotated with metadata.

add_metadata_tree.py -t output/RAxML_bestTree.s__Bacteroides_caccae.StrainPhlAn3.tre -f fastq/metadata.txt -m subjectID --string_to_remove .fastq.bz2
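Conceptually, step 5 matches each tip label to a row of the metadata table and attaches the requested field. The sketch below only illustrates that matching and is not the logic of add_metadata_tree.py itself. It assumes a tab-separated metadata file with a header row whose first column holds the sample ID, and it strips the given string only when it is a suffix of the label. tip_metadata and all values in the toy table are hypothetical.

```shell
# Look up one metadata field for a tree tip label.
# Arguments: tip label, suffix to strip, field name, metadata file.
tip_metadata() (
  label=$1; suffix=$2; field=$3; table=$4
  sample=${label%"$suffix"}   # drop the suffix if the label ends with it
  awk -F '\t' -v id="$sample" -v field="$field" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == field) col = i; next }
    col && $1 == id { print $col }
  ' "$table"
)

# Toy metadata table with hypothetical values.
meta=$(mktemp)
printf 'sampleID\tsubjectID\n' > "$meta"
printf 'sampleA\tsubject1\n' >> "$meta"
tip_metadata "sampleA.fastq.bz2" ".fastq.bz2" subjectID "$meta"
rm -f "$meta"
```

If a lookup like this returns nothing for a tip label, the label and the sample ID in the metadata file probably differ, which is the situation --string_to_remove is meant to address.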

References

Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3
Francesco Beghini, Lauren J McIver, Aitor Blanco-Miguez, Leonard Dubois, Francesco Asnicar, Sagun Maharjan, Ana Mailyan, Paolo Manghi, Matthias Scholz, Andrew Maltez Thomas, Mireia Valles-Colomer, George Weingart, Yancong Zhang, Moreno Zolfo, Curtis Huttenhower, Eric A Franzosa, and Nicola Segata

Elife. 2021 May 4;10:e65088

Microbial strain-level population structure and genetic diversity from metagenomes
Duy Tin Truong, Adrian Tett, Edoardo Pasolli, Curtis Huttenhower, Nicola Segata

Genome Res. 2017 Apr;27(4):626-638