M&Ms: a versatile software for generating mock microbial communities and their amplicon sequencing reads

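With advances in sequencing technology, many bioinformatics tools have been developed to analyze 16S rDNA sequencing data. To test these tools, it is important to simulate datasets that resemble samples from different environments. Here we introduce M&Ms, a user-friendly, open-source bioinformatics tool that creates different 16S rDNA datasets from reference sequences based on practical ecological parameters. M&Ms builds sequence libraries for "in silico" microbial communities with user-controlled richness, evenness, microdiversity, and source environment. It can generate read datasets, from simple to complex and based on realistic parameters, that can be used to develop bioinformatics software or to benchmark existing tools. The M&Ms source code is freely available at https://github.com/ggnatalia/MMs.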

wiki

https://github.com/ggnatalia/MMs/wiki

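From the manual

An M&Ms simulation proceeds in four stages.

Selection of community members
Microdiversity simulation
Assignment of a microbial abundance distribution
Simulation of sequencing data with InSilicoSeqs (Gourlé et al., 2019)

After selecting taxa and generating microdiversity, M&Ms assigns an abundance to each taxon and produces the abundance profiles and the reads. For example, the same reference files can be used to build different mock communities with one or more samples.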
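
To make the microdiversity stage more concrete, here is a minimal Python sketch of the general idea: derive variant sequences from one reference by introducing point substitutions inside a chosen region. This is an illustration only, not the M&Ms implementation. The function name mutate_region, the toy reference, and the region bounds are all hypothetical.

```python
import random

BASES = "ACGT"

def mutate_region(seq, n_mutations, start, end, rng):
    """Return a copy of seq with n_mutations point substitutions.

    Substitutions are restricted to the 0-based half-open window [start, end).
    Positions that are not A/C/G/T (for example alignment gaps) are skipped.
    """
    candidates = [i for i in range(start, end) if seq[i] in BASES]
    if n_mutations > len(candidates):
        raise ValueError("not enough mutable positions in the region")
    out = list(seq)
    for i in rng.sample(candidates, n_mutations):
        # Always pick a base different from the original one.
        out[i] = rng.choice([b for b in BASES if b != seq[i]])
    return "".join(out)

rng = random.Random(42)  # fixed seed so the sketch is reproducible
reference = "ACGTACGTACGTACGTACGT"  # toy sequence, not real 16S data
variants = [mutate_region(reference, 2, 5, 15, rng) for _ in range(3)]
for v in variants:
    print(v)
```

Each variant differs from the reference at exactly two positions inside the window, which mimics closely related strains derived from a single reference sequence.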

Installation

Tested in a conda environment on Ubuntu 18.04.

Github

mamba create -y --name MMs python=3.7
conda activate MMs
mamba install -c anaconda -c bioconda -c conda-forge -c ggnatalia mms

> makemocks.py -h

usage: makemocks.py [-h] -m MOCKNAME -o OUTPUT [-s START] [-e END]
                    [--by_region BY_REGION [BY_REGION ...]]
                    [--by_position BY_POSITION] [--region REGION] -H
                    SHANNONINDEX -N NSAMPLES -r READS [--rank RANK]
                    [-ASVsmean ASVSMEAN] [-nASVs NASVS] [-env ENVIRO]
                    [-tx TAXA [TAXA ...]] [-seqs SEQS [SEQS ...]]
                    [--minseqs MINSEQS] [-txAbund TAXAABUND [TAXAABUND ...]]
                    [--inputfile INPUTFILE] [-ref REF] [-refTax REFTAX]
                    [-refEnv REFENVIRO] [--threshold THRESHOLD] [--cpus CPUS]
                    [--force-overwrite] [--Sim {InSilicoSeqs,NanoSim}]
                    [--ISSerrormodel ISSERRORMODEL]
                    [--ISSparams ISSPARAMS [ISSPARAMS ...]]
                    [--ISSsequences_files ISSSEQUENCES_FILES [ISSSEQUENCES_FILES ...]]
                    [--ISSabundance_files ISSABUNDANCE_FILES [ISSABUNDANCE_FILES ...]]
                    [--repeat_ISS_autocomplete]
                    [--NSerrormodel {perfect,metagenome}]
                    [--NSparams NSPARAMS [NSPARAMS ...]]
                    [--repeat_NS_autocomplete] [--alpha ALPHA] [--pstr0 PSTR0]
                    [--size SIZE] [--ambiguidities AMBIGUIDITIES]
                    [--just_taxa_selection] [--repeat_previous_mock]

Process input files

optional arguments:
  -h, --help            show this help message and exit
  -m MOCKNAME, --mockName MOCKNAME
                        Mock name. Ex: mock1
  -o OUTPUT, --output OUTPUT
                        Output directory. Preferably, name of the environment.
                        Ex: Aquatic
  -s START, --start START
                        SILVA alignment reference positions-START. Default 1.
                        1-based
  -e END, --end END     SILVA alignment reference positions-END. Default
                        50000. 1-based
  --by_region BY_REGION [BY_REGION ...]
                        File with defined regions to introduce point mutations
                        or list of V1-V9 regions
  --by_position BY_POSITION
                        File with defined positions to introduce point
                        mutations
  --region REGION       Name of the studied region
  -H SHANNONINDEX, --shannonIndex SHANNONINDEX
                        ShannonIndex
  -N NSAMPLES, --nSamples NSAMPLES
                        Number of samples
  -r READS, --reads READS
                        Number of reads
  --rank RANK           Rank to subset taxa: phlyum, order, class, family,
                        genus
  -ASVsmean ASVSMEAN, --ASVsmean ASVSMEAN
                        Mean of mutant ASV per silva sequence
  -nASVs NASVS, --nASVs NASVS
                        Number of ASVs
  -env ENVIRO, --enviro ENVIRO
                        Let the user simulate an environmental mock. Look
                        refEnv for options
  -tx TAXA [TAXA ...], --taxa TAXA [TAXA ...]
                        List of taxa
  -seqs SEQS [SEQS ...], --seqs SEQS [SEQS ...]
                        List of sequences' names
  --minseqs MINSEQS     Minimun number of sequences to extract from DB
  -txAbund TAXAABUND [TAXAABUND ...], --taxaAbund TAXAABUND [TAXAABUND ...]
                        List of abundances of taxa
  --inputfile INPUTFILE
                        The user can provide an align file without creating it
                        from scratch.
  -ref REF, --ref REF   SILVA alignment reference
  -refTax REFTAX, --refTax REFTAX
                        SILVA alignment TAX reference
  -refEnv REFENVIRO, --refEnviro REFENVIRO
                        Environment reference
  --threshold THRESHOLD
                        Make & filter strains using distances. Without this
                        flag, sequences can be identical in the studied region
  --cpus CPUS           Number of threads
  --force-overwrite     Force overwrite if the output directory already exists
  --Sim {InSilicoSeqs,NanoSim}
                        Choose read simulator: InSilicoSeqs or NanoSim
  --ISSerrormodel ISSERRORMODEL
                        Mode to generate InSilicoSeqs
  --ISSparams ISSPARAMS [ISSPARAMS ...]
                        (Read length, insert size
  --ISSsequences_files ISSSEQUENCES_FILES [ISSSEQUENCES_FILES ...]
                        List of files to obtain reads: sequences <projectName>
                        <mockName>.<sampleName>.sequences16S.fasta. Same order
                        that paired abundance files
  --ISSabundance_files ISSABUNDANCE_FILES [ISSABUNDANCE_FILES ...]
                        List of files to obtain reads: abundances<projectName>
                        <mockName>.<sampleName>.abundances. Same order that
                        paired sequence files
  --repeat_ISS_autocomplete
                        If True use mock and samples directly from the
                        directory, without writing one by one the files
  --NSerrormodel {perfect,metagenome}
                        Mode to generate NanoSim
  --NSparams NSPARAMS [NSPARAMS ...]
                        (Maximum read length, Minimum read length
  --repeat_NS_autocomplete
                        If True use mock and samples directly from the
                        directory, without writing one by one the files
  --alpha ALPHA         Correlation Matrix: Probability that a coefficient is
                        zero. Larger values enforce more sparsity.
  --pstr0 PSTR0         ZINBD: Probability of structure 0
  --size SIZE           ZINBD: Size - dispersion of ZINBD
  --ambiguidities AMBIGUIDITIES
                        Number of Ns at the beginning or the end of the
                        sequence
  --just_taxa_selection
                        If True: do the selection of the sequences and stop.
  --repeat_previous_mock
                        If True: With a previous mock, repeat reads simulation
                        using another sequencing simulator.

> make_databases.py -h

usage: make_databases.py [-h] [-ref REF] [-refTax REFTAX]

Process input files

optional arguments:
  -h, --help            show this help message and exit
  -ref REF, --ref REF   Path to SILVA alignment reference
  -refTax REFTAX, --refTax REFTAX
                        Path to SILVA alignment TAX reference

Preparing the databases

An M&Ms run uses two databases: the SILVA database in mothur format, and a matrix of genera per environment. Running the make_databases.py script either downloads the SILVA database from its source or links the user's own database to M&Ms.

make_databases.py

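The databases are placed in MMs/DB.

How to run

Creating a simple mock community. -m, -o, -N, -H, and -r are required parameters. Here the number of samples is 5, the number of reads is 10,000 (total across both paired ends), the number of distinct sequences is 500, and the Shannon diversity index is 3.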

makemocks.py -o Random_Mock -m mock2 -N 5 -nASVs 500 -H 3 -r 10000 --cpus 12

-m Mock name. Ex: mock1
-o Output directory. Preferably, name of the environment. Ex: Aquatic
-N Number of samples
-r Number of reads
-nASVs Number of ASVs
-H ShannonIndex
--cpus Number of threads
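
To get a feel for what -H 3 means alongside -nASVs 500, the Shannon index can be computed from an abundance vector. The sketch below uses the standard definition H = -Σ p_i ln(p_i) with the natural logarithm. Whether M&Ms uses the same log base internally is an assumption here, and the two abundance vectors are made-up examples.

```python
import math

def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln(p)) over taxa with non-zero abundance."""
    total = sum(abundances)
    return -sum((a / total) * math.log(a / total) for a in abundances if a > 0)

# 500 ASVs with identical abundance: the maximum possible H for 500 ASVs.
even = [20] * 500
# 500 ASVs with geometrically decaying abundance: a few ASVs dominate.
skewed = [0.8 ** i for i in range(500)]

for name, community in (("even", even), ("skewed", skewed)):
    h = shannon_index(community)
    # exp(H) is the "effective number" of equally abundant ASVs.
    print(name, round(h, 2), round(math.exp(h), 1))
```

With 500 perfectly even ASVs, H equals ln(500), about 6.21. A target of H = 3 therefore implies a fairly uneven community, roughly equivalent to exp(3), about 20, equally abundant ASVs.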

When no environment is selected and no list of sequences is provided, M&Ms builds the mock microbial community at random.

Output

Random_Mock.fasta contains the 500 reference sequences.

> head Random_Mock.taxonomy

Because 500 sequences were generated, the file has 500 lines.
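
A quick way to check which taxa ended up in the mock is to tally the taxonomy file at a given rank. The sketch below assumes the file follows the mothur taxonomy layout (sequence name, a tab, then a semicolon-separated lineage), which is an assumption about Random_Mock.taxonomy rather than something confirmed here. The function name count_by_rank and the example lines are hypothetical.

```python
from collections import Counter

def count_by_rank(lines, rank_index):
    """Count sequences per taxon at the given 0-based position in the lineage."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        _name, lineage = line.split("\t", 1)
        ranks = [r for r in lineage.split(";") if r]
        # Lineages shorter than the requested rank are pooled together.
        taxon = ranks[rank_index] if rank_index < len(ranks) else "unclassified"
        counts[taxon] += 1
    return counts

# Made-up lines in the assumed format (index 5 = genus in a 6-rank lineage).
example = [
    "seq1\tBacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia-Shigella;",
    "seq2\tBacteria;Proteobacteria;Gammaproteobacteria;Enterobacterales;Enterobacteriaceae;Escherichia-Shigella;",
    "seq3\tBacteria;Firmicutes;Bacilli;Lactobacillales;Lactobacillaceae;Lactobacillus;",
    "seq4\tBacteria;Firmicutes;Bacilli;",
]
genus_counts = count_by_rank(example, 5)
print(genus_counts.most_common())

# On a real run one would pass an open file handle instead, e.g.:
# with open("Random_Mock.taxonomy") as handle:
#     print(count_by_rank(handle, 5).most_common(10))
```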

Information on the taxa selected by the user

Random_Mock.pctax.html: ASV abundance distribution

Random_Mock.16S.distances.html


Random_Mock.mock2.shannonIndex.rank.png (Shannon index)

mock2


mock2/Random_Mock.mock2.correlationMatrix_fromAbunTable.html


Random_Mock.mock2.asvsDistribution.html
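
The --pstr0 and --size options in the help text above refer to a ZINBD, read here as a zero-inflated negative binomial distribution. As a rough illustration of how such a distribution yields per-sample ASV counts with many zeros, here is a standard-library sketch. It is not the M&Ms code: the parameterization (mean, size, pstr0) and the function names are assumptions made for this example.

```python
import math
import random

def poisson(lam, rng):
    """Draw a Poisson variate with Knuth's method (fine for moderate lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def zinb(mean, size, pstr0, rng):
    """Draw one count from a zero-inflated negative binomial.

    pstr0: probability of a structural zero (the ASV is absent from the sample).
    size:  dispersion; smaller values give more overdispersed counts.
    """
    if mean <= 0 or rng.random() < pstr0:
        return 0
    # Negative binomial as a gamma-Poisson mixture:
    # gamma shape = size, gamma scale = mean / size, so E[lam] = mean.
    lam = rng.gammavariate(size, mean / size)
    return poisson(lam, rng)

rng = random.Random(0)  # fixed seed for reproducibility
# Counts of one hypothetical ASV across 8 samples.
counts = [zinb(mean=10, size=2, pstr0=0.3, rng=rng) for _ in range(8)]
print(counts)
```

Raising pstr0 makes an ASV absent from more samples, while lowering size spreads the non-zero counts more widely around the mean.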


Random_Mock/mock2/samples/ contains the FASTQ files, per-sample information, the 16S rRNA sequences, and so on.
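
To confirm that the number of simulated reads matches what was requested with -r, the FASTQ files can be counted with a few lines of Python. This sketch assumes standard four-line FASTQ records and handles both plain and gzip-compressed files. The demo file name is made up, and the exact file names M&Ms writes under samples/ are not assumed here.

```python
import gzip
import tempfile
from pathlib import Path

def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming exactly 4 lines per record."""
    path = Path(path)
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        raise ValueError(f"{path}: line count {n_lines} is not a multiple of 4")
    return n_lines // 4

# Self-contained demo on a tiny temporary FASTQ file with two reads.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo_R1.fastq"
    demo.write_text("@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n")
    n_reads = count_fastq_reads(demo)
print(n_reads)

# On a real run, something like the following would list every sample:
# for fq in sorted(Path("Random_Mock/mock2/samples").glob("*.fastq*")):
#     print(fq.name, count_fastq_reads(fq))
```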

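ONT reads can also be generated with the NanoSim simulator. If the command above has already been run, pass that directory to -o (-o Random_Mock); a mockONT directory is created inside it (-m mockONT) and the reads are written there. Specify the name of the align file located in the -o directory (--inputfile Random_Mock.align).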

makemocks.py -o Random_Mock -m mockONT -N 1 -nASVs 500 -H 3 -r 100000 --cpus 6 --Sim NanoSim --inputfile Random_Mock.align

--inputfile The user can provide an align file without creating it from scratch.

To simulate mock community samples for a specific environment, use the -env option. The following environments are supported.

List of possible environments: Aquatic.Freshwater.sediment, Aquatic.Freshwater.saline.waters.interfase, Aquatic.Freshwaters, Aquatic.Saline.waters, Aquatic.Soil.Freshwaters.interfase, Aquatic.Soil.Saline.waters.interfase, Host.associated.and.Organic.Animal.host, Host.associated.and.Organic.Gut, Host.associated.and.Organic.Oral, Host.associated.and.Organic.Organic, Host.associated.and.Organic.Other.tissue, Host.associated.and.Organic.Vagina, Other.Aerial, Other.Artificial, Other.Oil, Terrestrial.Plants, Terrestrial.Saline.soil, Terrestrial.Soil, Thermal.Geothermal, Thermal.Hydrothermal.

(From the manual)

For more practical usage, see Exploring M&Ms on the wiki.

Citation

M&Ms: a versatile software for building microbial mock communities
Natalia García-García, Javier Tamames, Fernando Puente-Sánchez
Bioinformatics, Published: 12 January 2022