メタゲノムから16Sなどのターゲットアセンブリを行う MATAM

2022/06/24 追記

Preprintより

　ショットガンのメタゲノムシーケンシングは、未知の微生物の多様性が未知のまま残っている、ヒトの微生物から土壌や海洋のサンプルまで、さまざまな用途で、未培養の微生物サンプルを研究する未曾有の機会を提供する。
　メタゲノム研究の主な目標は、微生物の多様性と生態学的構造を特徴づけることである。これはいくつかの系統発生マーカー遺伝子の1つに焦点を当てることによって達成されることが多い[ref.12]、[ref.28]。細菌集団のゴールデンスタンダードのマーカーは、Silva [ref.24]、RDP [ref.2]またはGreenGenes [ref.3]のような、キュレーションされたリファレンスデータベースで何百万もの配列が利用可能な16Sリボソーム RNA（rRNA、~1500bp）である。アンプリコンシーケンシングのような伝統的なアプローチは、マーカー配列の小部分の分析に限られる。これは、属を超えて、十分に正確な分類学的レベルで生物を同定するための強い技術的限界につながる[ref.23]。種、または株にマーカー配列をアサインするためには、1-kbあたりのエラー数がほとんどない状態で全長rRNAを回収できる必要がある。メタゲノムアセンブラは、ゲノム全体を扱うように最適化されており、非常によく似た配列を区別するのに苦労しているため、この作業には適していない[ref.27]。この点に関して、メタゲノムリードのサブセットを全長16S rRNAコンティグにアセンブリするために、最近、EMIRGE [ref.19]およびREAGO [ref.33]のようなマーカー指向のアプローチが開発され、環境サンプルのtaxonomic assignmentの精度を改善することを目指している。 EMIRGEはベイズ法を用いて16S rRNA全長配列を反復的に再構成する。 REAGOは、Infernal [ref.20]を使用してrRNAリードを同定し、suffix/prefix arrayを使用してリード間の正確なオーバーラップを検索することによってoverlap graphを構築する。しかし、そのようなツールは、低い存在種にリカバリー率で対処することも含め、いくつかの制限を示している。
　本研究では、エラー率とキメラ形成のリスクを最小限に抑えるように慎重に設計されたオーバーラップグラフの構築と利用に基づく新しい手法であるMATAMを提示する。 MATAMは、シミュレートされたデータと実際のシーケンシングデータの両方で検証された。ほぼ全長の16S rRNAを再構築することができ、シークエンシングデプスのばらつきやコミュニティの複雑さに強くなる。

インストール

配布されているdockerコンテナを使ってmac os10.13でテストした。

依存

gcc v4.9.0 or superior, (full C++11 support, <regex> included, and partial C++14 support)
C++ libraries: rt, pthread, zlib
Samtools v1.x or superior
automake, make, cmake v3.1 or superior
Python 3
pip
numpy
Apache Ant
Java SE 7 JDK. OpenJDK is ok (openjdk-7-jdk paquet on debian)
bzip2
google sparse hash library (libsparsehash-dev paquet on debian)

Hardware requirements

We recommand running MATAM with at least 10Go of free RAM. You can try running MATAM with less RAM if --max_memory is set to a lower value (eg. --max_memory 4000 for 4Go).

本体　GIthub

#conda
mamba create -n matam python=3.9 -y
conda activate matam
mamba install -c bioconda matam -y

#またはdockerイメージを使う。
docker pull bonsaiteam/matam

> docker run --rm bonsaiteam/matam matam_assembly.py

$ docker run --rm bonsaiteam/matam matam_assembly.py

CRITICAL - the following arguments are required: -i/--input_fastx

usage: matam_assembly.py [-h] -i FASTX [-d DBPATH] [-o OUTDIR] [-v] [--cpu CPU] [--max_memory MAXMEM] [--best INT]

[--evalue REAL] [--score_threshold REAL] [--straight_mode] [--coverage_threshold INT]

[--min_identity REAL] [--min_overlap_length INT] [-N INT] [-E INT] [--seed INT]

[--quorum FLOAT] [-a {SGA}] [--read_correction {no,yes,auto}] [--contigs_binning]

[--min_scaffold_length MIN_SCAFFOLD_LENGTH] [--perform_taxonomic_assignment]

[--training_model {16srrna,fungallsu,fungalits_warcup,fungalits_unite}]

[--rdp_cutoff RDP_CUTOFF] [--keep_tmp] [--debug] [--resume_from STEP] [--filter_only]

MATAM assembly

optional arguments:

-h, --help show this help message and exit

Main parameters:

-i FASTX, --input_fastx FASTX Input reads file (fasta or fastq format)

-d DBPATH, --ref_db DBPATH MATAM ref db. Default is

$MATAM_DIR/db/SILVA_128_SSURef_NR95

-o OUTDIR, --out_dir OUTDIR Output directory.Default will be

"matam_assembly"

-v, --verbose Increase verbosity

Performance:

--cpu CPU Max number of CPU to use. Default is 1 cpu

--max_memory MAXMEM Maximum memory to use (in MBi). Default is

10000 MBi

Read mapping:

--best INT Get up to --best good alignments per read.

Default is 10

--evalue REAL Max e-value to keep an alignment for. Default

is 1e-05

Alignment filtering:

--score_threshold REAL Score threshold (real between 0 and 1). Default

is 0.9

--straight_mode Use straight mode filtering. Default is

geometric mode

--coverage_threshold INT Ref coverage threshold. By default set to 0 to

desactivate filtering

Overlap-graph building:

--min_identity REAL Minimum identity of an overlap between 2 reads.

Default is 1.0

--min_overlap_length INT Minimum length of an overlap. Default is 50

Graph compaction & Components identification:

-N INT, --min_read_node INT Minimum number of read to keep a node. Default

is 1

-E INT, --min_overlap_edge INT Minimum number of overlap to keep an edge.

Default is 1

--seed INT Seed to initialize random generator. Default is

picking seed from system time

LCA labelling:

--quorum FLOAT Quorum for LCA computing. Has to be between

0.51 and 1. Default is 0.51

Contigs assembly:

-a {SGA}, --assembler {SGA} Select the assembler to be used. Default is SGA

--read_correction {no,yes,auto} Set the assembler read correction step. 'auto'

means assemble the components with read

correction enabled when the components coverage

are sufficient (20X) and assemble the other

components without read correction. Default is

auto

Scaffolding:

--contigs_binning Experimental. Perform contigs binning during

scaffolding.

--min_scaffold_length MIN_SCAFFOLD_LENGTH Filter out small scaffolds

Taxonomic assignment:

--perform_taxonomic_assignment Do the taxonomic assignment

--training_model {16srrna,fungallsu,fungalits_warcup,fungalits_unite} The training model used for taxonomic

assignment. Default is 16srrna

--rdp_cutoff RDP_CUTOFF Sequences assigned (by RDP) with a confidence

score < 0.8 (at genus level) will be tagged as

an artificial "unclassified" taxon

Advanced parameters:

--keep_tmp Do not remove tmp files

--debug Output debug infos

--resume_from STEP Try to resume from given step. Steps are:

reads_mapping, alignments_filtering,

overlap_graph_building, graph_compaction,

contigs_assembly, scaffolding,

abundance_calculation, taxonomic_assignment

--filter_only Perform only the first step of MATAM (i.e Reads

mapping against ref db with sortmerna to filter

the reads). Relevant options for this step

correspond to the "Read mapping" section.

ラン

１、データベースのダウンロード

#カレントディレクトリと/homeを共有してスタート
docker run -itv $PWD:/home bonsaiteam/matam

mkdir /home/datadir
python /matam/index_default_ssu_rrna_db.py -d /home/datadir/ --max_memory 10000

２、SSU rRNAのリカバリー。データベースと解析するFASTAを指定する。

matam_db_preprocessing.py -i ref_db.fasta -d /home/datadir \
--cpu 4 --max_memory 10000 -v

分類はRDP Classifier（link）を元に行われている。 fastqには対応していない。

引用

MATAM: reconstruction of phylogenetic marker genes from short sequencing reads in metagenomes
Pericard P, Dufresne Y, Couderc L, Blanquart S, Touzet H

Bioinformatics. 2018 Feb 15;34(4):585-591