StrainR2: accurate strain-level quantification of community abundances from metagenomes

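Synthetic microbial communities offer an opportunity to conduct reductionist research in tractable model systems. However, estimating the abundance of highly similar strains within these communities is currently unreliable. 16S rRNA gene sequencing cannot resolve abundances at the strain level, and other methods such as quantitative PCR (qPCR) lose accuracy on complex communities and are resource-intensive. The authors present StrainR2, which uses shotgun metagenomic sequencing to provide high-accuracy strain-level abundances when genomes are available for all members of a synthetic community.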

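In both in silico analyses and sequencing data from germ-free mice colonized with a synthetic fecal microbiota, StrainR2 resolves strain abundances with greater accuracy and efficiency than other tools that use shotgun metagenomic sequencing reads. For a subset of strains identified by absolute quantification, StrainR2's accuracy was shown to be comparable to that of qPCR. The software is available on GitHub and is implemented in C, R, and Bash. It is supported on Linux and macOS, and packages are available through Bioconda or as a Docker container. The source code at the time of publication is also available on figshare at doi: 10.6084/m9.figshare.29420780.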

Installation

I created an environment with mamba and tested there. When building from source, the dependencies listed below must be installed separately.

Dependencies

BBMap
fastp
GNU make (if compiling from source)
samtools
bedtools
R
R optparse
R tidyverse
zlib

GitHub

#conda (link)
mamba create -n strainr2 -c bioconda -c conda-forge strainr2
conda activate strainr2

#docker (untested)
docker pull quay.io/biocontainers/strainr2:<tag>

> PreProcessR -h

USAGE: PreProcessR -i path/to/in [OPTIONS]

PreProcessR counts the unique hashes in subcontigs for StrainR to normalize reads with.

Required Arguments:

-i/--indir path/to/genomes : path to the directory for all community genomes

Optional Arguments:

-o/--outdir path/to/out : path to your output directory [Default = StrainR2DB]

-e/--excludesize number : exclude subcontig size (minimum subcontig size) [Default = 10000]

-s/--subcontigsize number : maximum subcontig size (overrides default use of calculated smallest N50)[Default = N50]

-r/--readsize number : Size of one end of a read. e.g.: for 150bp paired end reads readsize is 150. [Default = 150]

-m/--singleend : Flag that should be used if only single end reads will be provided to StrainR2. Disabled by default.

-h/--help : Display this message

> StrainR -h

USAGE: StrainR -1 path/to/forward.fastq.gz -2 path/to/reverse.fastq.gz -r path/to/reference/directory [OPTIONS]

StrainR normalizes mapping from reads using the output from PreProcessR

Required Arguments:

-1/--forward path/to/forward.fastq.gz : path to forward reads

-r/--reference path/to/reference/directory : path to the output directory generated by PreProcessR

Optional Arguments:

-2/--reverse path/to/reverse.fastq.gz : path to reverse reads

-c/--weightedpercentile number : Weighted percentile for a strain's FUKMs to use in abundance estimation [Default = 60]

-s/--subcontigfilter number : Percentage of a strain's subcontigs that should be filtered out based on number of unique k-mers [Default = 0]

-b1/--background1 path/to/background1.fq.gz : path to forward background reads for exclusion. This is not necessary in a typical use case. [Default = None]

-b2/--background2 path/to/background2.fq.gz : path to reverse background reads for exclusion. This is not necessary in a typical use case. [Default = None]

-o/--outdir path/to/out : path to your output directory to contain normalized abundances [Default = current directory]

-p/--prefix string : Name of community (used in output files) [Default = sample]

-t/--threads number : number of threads to use when running fastp, bbmap, and samtools. Maximum is 16 [Default = 8]

-m/--mem number : gigabytes of memory to use when running bbmap [Default = 8]

-h/--help : Display this message

Test run

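(From the paper) StrainR2 consists of two steps: (i) preprocessing (PreProcessR) and (ii) normalization (StrainR). PreProcessR first builds the reference database from the community genomes. StrainR then, for each sample, maps reads uniquely to the reference genomes and converts the counts to a modified FPKM: fragments per thousand unique k-mers per million reads mapped (FUKM). FUKM is directly analogous to FPKM, except that it normalizes by the number of uniquely mappable sites rather than by total genome size.

The preprocessing step is written almost entirely in C to maximize efficiency. It first splits the contigs of every genome into pieces of similar size (subcontigs) so that assembly quality is comparable across genomes. This removes bias from assembly quality and also ensures that there are enough subcontigs for the normalization step. Reads are counted using only the unique k-mers in each subcontig, that is, k-mers that occur in no other strain (a strain-specific signature). FUKM is computed per subcontig, and the abundance of each strain is estimated from the median or a weighted percentile of those values.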
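The FUKM normalization and the per-strain summary described above can be sketched in Python. This is only a minimal sketch and not StrainR2's own code: the function names are hypothetical, and weighting each subcontig by its unique k-mer count is my assumption about what "weighted" means here. The default percentile of 60 comes from the StrainR help text.

```python
def fukm(fragments, unique_kmers, total_mapped_fragments):
    """Fragments per thousand unique k-mers per million mapped fragments."""
    if unique_kmers <= 0 or total_mapped_fragments <= 0:
        raise ValueError("unique_kmers and total_mapped_fragments must be positive")
    return fragments / (unique_kmers / 1_000) / (total_mapped_fragments / 1_000_000)


def weighted_percentile(values, weights, percentile=60):
    """Smallest value whose cumulative weight reaches the requested percentile.

    Assumption: weights are the unique k-mer counts of the subcontigs.
    """
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and of equal length")
    threshold = sum(weights) * percentile / 100
    cumulative = 0.0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= threshold:
            return value
    return max(values)


# Toy numbers, invented for illustration only.
subcontig_fukm = [fukm(50, 2000, 1_000_000), fukm(10, 500, 1_000_000)]
strain_estimate = weighted_percentile(subcontig_fukm, [2000, 500])
```

Dividing by unique k-mers rather than by length means a subcontig shared almost entirely with a sister strain contributes little signal and carries little weight in the summary.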

1. Preprocessing

Specify the directory that contains the FASTA files of the genomes to be quantified. This step only needs to be run once per reference community.

git clone https://github.com/BisanzLab/StrainR2.git
cd StrainR2/
#step1
PreProcessR -i tests/genomes/multiple_complete/ -o out

-i path to the directory for all community genomes
-o path to your output directory [Default = StrainR2DB]

The out/ directory is created.

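The output contains per-contig unique k-mer information and the BBMap index. The test data finished almost immediately, but my own set of 2,000 genomes took several tens of minutes or more.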
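To make "unique k-mer" concrete, here is a toy sketch that counts, for each genome, the k-mers found in no other genome of the community. It is an illustration under simplifying assumptions, not how PreProcessR works internally: the help text says PreProcessR counts unique hashes per subcontig, whereas this sketch compares plain strings, ignores reverse complements, and uses an arbitrary k. The function names are hypothetical.

```python
from collections import Counter


def kmers(sequence, k):
    """Set of all k-mers in one sequence (forward strand only)."""
    sequence = sequence.upper()
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmer_counts(genomes, k):
    """Map genome name -> number of k-mers that occur in that genome only.

    genomes: dict of name -> list of contig sequences.
    """
    per_genome = {
        name: set().union(*(kmers(seq, k) for seq in contigs))
        for name, contigs in genomes.items()
    }
    # Count in how many genomes each k-mer appears.
    owners = Counter()
    for kmer_set in per_genome.values():
        owners.update(kmer_set)
    return {
        name: sum(1 for kmer in kmer_set if owners[kmer] == 1)
        for name, kmer_set in per_genome.items()
    }


# Two toy "genomes" that share their first four bases.
toy = {"A": ["ACGTAC"], "B": ["ACGTTT"]}
counts = unique_kmer_counts(toy, k=3)
```

The more similar two genomes are, the fewer k-mers survive this filter, which is why very close strains leave few uniquely mappable sites to normalize by.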

2. Running StrainR

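Specify the FASTQ files and the output of step 1. Gzip-compressed FASTQ files are accepted. For single-end reads, specify only -1 (the PreProcessR help also lists a -m/--singleend flag for building the reference in that case). When the index is large, raise the memory limit given with -m (for my data the default was not enough).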

#step2
StrainR --forward tests/inputs/mock_reads_testing_R1.fastq.gz --reverse tests/inputs/mock_reads_testing_R2.fastq.gz -r out/ --prefix testrun  --outdir strain2_output -t 20 -m 32

-1 path/to/forward.fastq.gz : path to forward reads
-r path/to/reference/directory : path to the output directory generated by PreProcessR
-2 path/to/reverse.fastq.gz : path to reverse reads
-p Name of community (used in output files) [Default = sample]
-t number of threads to use when running fastp, bbmap, and samtools. Maximum is 16 [Default = 8]
-m gigabytes of memory to use when running bbmap [Default = 8]
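When running many samples, the command line can be assembled programmatically. The sketch below is a hypothetical helper of my own, not part of StrainR2. It uses only flags that appear in the help output above, and it caps the thread count at 16 because the help text gives that as the maximum. The cap is my design choice, and I have not checked how StrainR itself treats larger values.

```python
import shlex

MAX_THREADS = 16  # upper bound stated in the StrainR help text


def build_strainr_command(forward, reference, reverse=None, prefix="sample",
                          outdir=".", threads=8, mem_gb=8):
    """Return the StrainR argument list for one sample.

    Leave reverse as None for single-end reads, so that only -1 is passed.
    """
    cmd = ["StrainR", "-1", forward, "-r", reference]
    if reverse is not None:
        cmd += ["-2", reverse]
    cmd += [
        "-p", prefix,
        "-o", outdir,
        "-t", str(min(threads, MAX_THREADS)),
        "-m", str(mem_gb),
    ]
    return cmd


cmd = build_strainr_command(
    "tests/inputs/mock_reads_testing_R1.fastq.gz",
    "out/",
    reverse="tests/inputs/mock_reads_testing_R2.fastq.gz",
    prefix="testrun",
    outdir="strain2_output",
    threads=20,
    mem_gb=32,
)
print(shlex.join(cmd))  # pass cmd to subprocess.run(cmd, check=True) to execute
```

Returning a list rather than a single string keeps paths with spaces intact when the command is passed to subprocess.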

Example output

testrun.pdf

> column -t StrainR_abundances/testrun_abundance_summary.tsv |head

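Each column is explained on GitHub; FUKM is the strain abundance estimate calculated by StrainR2. If Percent_Unique or Nunique is extremely low, the strains may not be distinguishable well enough to be quantified separately, so I recommend checking these values as well when quantifying very similar genomes, for example those with an ANI of about 99%.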
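The check suggested above can be automated. The following is a minimal sketch, assuming a tab-separated table with a Percent_Unique column as named in this post. The StrainID column name, the 1.0 cutoff, and the sample rows are all hypothetical and invented for illustration, so compare them against the header of your own output before use.

```python
import csv
import io


def flag_low_uniqueness(tsv_text, min_percent_unique,
                        id_column="StrainID", unique_column="Percent_Unique"):
    """Return IDs of rows whose unique fraction is below the cutoff.

    id_column is an assumed name; unique_column follows the name used in the post.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [
        row[id_column]
        for row in reader
        if float(row[unique_column]) < min_percent_unique
    ]


# Invented example table, not real StrainR2 output.
example = (
    "StrainID\tFUKM\tPercent_Unique\n"
    "strainA\t120.5\t85.0\n"
    "strainB\t3.2\t0.4\n"
)
suspect = flag_low_uniqueness(example, min_percent_unique=1.0)
```

For a real file, read it with open(path).read() and pass the text in. Strains returned here are the ones whose FUKM deserves a second look.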

Citation

StrainR2 accurately deconvolutes strain-level abundances in synthetic microbial communities

Kerim Heber, Shuchang Tian, Daniela Betancurt-Anzola, Heejung Koo, Jordan E Bisanz

Bioinformatics, Published: 06 August 2025