2019-04-20

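Recentrifuge: removing contamination from metagenomes, comparing metagenome samples, and visualizing the results

2019 4/21  Title added

2019 4/21  Added a comment from the author, Jose Manuel Martí

2019 4/23  Title revised

2019 4/26  Typos fixed

2019  Docker link added

2019 5/9  Parameters added

2020 6/13  Tweets added

2020 6/14  conda installation added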

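Studies of microbial communities by metagenomics are becoming more common in a wide range of biological fields, including environmental, clinical, food, and forensic research [ref.1-3]. New DNA and RNA sequencing technologies are boosting these studies by dramatically reducing the cost per sequenced base. Researchers can analyze sets of sequences belonging to microbial communities from different sources and time points to uncover longitudinal (spatial or temporal) patterns in the microbiota (see Figure 1 for an example model). In shotgun metagenomic sequencing (SMS) studies, researchers extract and purify nucleic acids from each sample, sequence them, and analyze the sequences through a bioinformatics pipeline (see Figures S2 and S3 of the paper for detailed examples). With the development of nanopore sequencing, portable, affordable, real-time SMS has become a reality [ref.4].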

Contamination in metagenomics

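In low microbial biomass samples, there is very little native DNA from microbes. Library preparation and sequencing will then return sequences whose main source is contamination [ref.5, 6]. RNA sequencing, which requires additional steps, introduces further biases and artifacts [ref.7], which in low microbial biomass studies leads to severe problems of contamination and spurious taxa detection [ref.8]. The clinical metagenomics community has stressed the importance of negative controls in metagenomic workflows, and fundamental concerns have recently been raised about how to subtract contaminants from the results [ref.9].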

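From a data science perspective, this is just one example of the importance of keeping a good signal-to-noise ratio [ref.10]. When the signal (native DNA/RNA, the target of the sampling) approaches the magnitude of the noise (DNA/RNA acquired from contamination and artifacts), specific methods are needed to tell them apart.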

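The origins of contaminating sequences are diverse: they can be traced back to nucleic acid extraction kits (the kitome) [ref.11, 12], reagents and diluents [ref.13, 14], the host [ref.15], and the post-sampling environment. There, contamination can arise from different origins such as airborne particles, crossover between current samples, or DNA remaining from past sequencing runs [ref.17]. Varying amounts of DNA from these sources are sequenced together with the native microbial DNA. This can introduce severe biases in magnitudes such as abundance and coverage, especially when microbial biomass is low [ref.18]. When multiplexed sequencing uses simple indexing, read misassignment can easily exceed acceptable rates [ref.19]. Even metagenomic reference databases contain a non-negligible amount of cross-contamination [ref.15, 17, 20].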

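As for the kitome, it varies even between different lots of the same product. For example, the DNeasy PowerSoil Kit (formerly known as the PowerSoil DNA Isolation Kit), which usually yields large amounts of DNA and has been used by projects such as the Earth Microbiome Project and the Human Microbiome Project, often introduces non-negligible background contamination [ref.6]. The lower the biomass in a sample, the more important it becomes to collect negative control samples that help assess the contamination background, because without them it is almost impossible to distinguish the native microbiota of a specimen (the signal) from contamination (the noise).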

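Even assuming that native and contaminating DNA have been accurately separated, the problem of performing reliable comparisons between samples remains. In general, taxonomic classification engines assign the reads from a sequencing run to different taxonomic ranks, especially when the method uses a more conservative approach such as the lowest common ancestor (LCA) [ref.21]. LCA greatly reduces the risk of false positives, but it usually spreads the taxonomic level of the classifications from the more specific to the more general. Even when a classifier does not use the LCA strategy, it usually assigns each read a score or confidence level, which downstream applications should take into account as an estimator of the reliability of the classification.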
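To make the LCA idea concrete, here is a minimal sketch on a toy taxonomy. The taxon IDs and the tree are made up for illustration, and this is not how any particular classifier implements LCA. It only shows why a read matching two species of the same genus ends up at the genus, while a read matching more distant taxa is pushed up to a more general rank.

```python
def lineage(taxid, parent):
    """Return the path from taxid up to the root (the root is its own parent)."""
    path = [taxid]
    while parent[path[-1]] != path[-1]:
        path.append(parent[path[-1]])
    return path


def lowest_common_ancestor(taxids, parent):
    """Return the deepest taxon shared by the lineages of all taxids."""
    common = None
    for taxid in taxids:
        nodes = set(lineage(taxid, parent))
        common = nodes if common is None else common & nodes
    # Walking up from any taxid, the first shared node is the deepest one.
    for node in lineage(taxids[0], parent):
        if node in common:
            return node


# Toy taxonomy (hypothetical IDs): 1 = root, 2 = a family,
# 3 and 4 = two genera in that family, 5 and 6 = two species of genus 3.
PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}

print(lowest_common_ancestor([5, 6], PARENT))  # both species share genus 3
print(lowest_common_ancestor([5, 4], PARENT))  # only family 2 is shared
```

The second call illustrates the trade-off described above: the assignment is safer, but less specific.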

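On top of these difficulties, because the resolution at the taxonomic level differs, it is even harder to compare samples with very different DNA yields, for example low biomass samples against high biomass samples. The same kind of problem arises when samples have a similar DNA yield but completely different microbial structures, so that their minority and majority microbes are fundamentally different [ref.8]. Finally, a closely related problem arises in metagenomic bioforensic studies and environmental surveillance, where it is essential to have a method prepared to detect the slightest presence of a particular taxon and to provide quantitative results with both accuracy and precision.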

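From the beginning, the application of SMS to environmental samples gave biologists insights into microbial communities that could not be obtained from bacterial artificial chromosome (BAC) clones or 16S rRNA sequencing [ref.24, 25]. The scientific community soon highlighted the need for, and the challenges of, comparative metagenomics [ref.26, 27]. MEGAN [ref.28], one of the first metagenomic data analysis tools, offered a very basic comparison of samples in its first release, which was improved with an interactive approach in more recent versions [ref.29]. In general, metagenomic classification and assembly software is more intra-sample than inter-sample oriented [ref.30]. Several tools have tried to fill this gap. CoMet [ref.31] is a web-based tool for comparative functional profiling that combines different methods, such as multidimensional scaling and hierarchical cluster analysis, to predict functional differences in a collection of metagenomic samples.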

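Recentrifuge's approach to increasing confidence in the results of taxonomic classification engines follows two strategies. First, it takes the score level of the classifications into account at every step. Second, it uses a robust contamination removal algorithm that detects and selectively removes different kinds of contaminants, including crossovers. Recentrifuge directly supports the following high-performance taxonomic classifiers: Centrifuge [ref.7], LMAT [ref.21], CLARK [ref.39], CLARK-S [ref.40], and Kraken [ref.41]. Other classification software is supported through a generic parser. Recentrifuge's interactive interface lets researchers analyze these classification results using scored Krona-like charts. In addition to the plots for the raw samples, Recentrifuge generates four different sets of scored charts for each taxonomic level of interest: control-subtracted samples, shared taxa (with and without controls), and exclusive taxa per sample. This battery of analyses and plots enables robust comparative analysis of multiple samples in metagenomic studies, and is especially useful for low microbial biomass environments or body sites.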
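The kinds of chart sets listed above can be pictured with plain set operations. The sketch below is only a mental model: the taxon names and counts are invented, and the naive "drop everything seen in the control" rule is my simplification, not Recentrifuge's actual score-aware removal algorithm.

```python
def shared_taxa(samples):
    """Taxa present in every sample."""
    sets = [set(counts) for counts in samples.values()]
    return set.intersection(*sets)


def exclusive_taxa(samples, name):
    """Taxa present in the named sample and in no other sample."""
    others = set()
    for other, counts in samples.items():
        if other != name:
            others |= set(counts)
    return set(samples[name]) - others


def subtract_control(sample, control):
    """Naive control subtraction: drop every taxon seen in the control."""
    return {taxon: n for taxon, n in sample.items() if taxon not in control}


# Hypothetical read counts per taxon (names are placeholders).
control = {"taxonA": 40}
samples = {
    "sample1": {"taxonA": 55, "taxonB": 300, "taxonC": 12},
    "sample2": {"taxonA": 35, "taxonB": 120},
    "sample3": {"taxonA": 60, "taxonB": 90, "taxonD": 7},
}

print(sorted(shared_taxa(samples)))                    # taxa seen in all samples
print(sorted(exclusive_taxa(samples, "sample1")))      # taxa seen only in sample1
print(subtract_control(samples["sample1"], control))   # sample1 without control taxa
```

A real contaminant such as taxonA can be far more abundant in a sample than in the control, which is one reason a simple presence/absence rule like this one is too crude for real data.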

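Recentrifuge enables robust contamination removal and score-oriented comparative analysis of multiple samples, especially in low microbial biomass metagenomic studies, where contamination removal is a must. Just as it is essential to attach a statement of the associated uncertainty to any physical measurement, it is desirable to accompany any read classification with a confidence estimate for the assigned taxon. Recentrifuge reads the score that the classification software gives to each read and uses this valuable information to calculate an average confidence level for each taxon in the taxonomic tree associated with the analyzed sample. This value may also be a function of further parameters such as read quality or length, which is especially useful when read length varies widely, as in datasets generated by nanopore sequencers.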
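As a rough illustration of "an average confidence level per taxon", the sketch below averages per-read scores by assigned taxon, with an optional minimum score filter. The taxon IDs and scores are invented, and the code only looks at reads assigned directly to each taxon. How Recentrifuge propagates scores through the taxonomic tree is not covered here.

```python
from collections import defaultdict


def mean_score_per_taxon(reads, minscore=None):
    """Average the classification score of the reads assigned to each taxon.

    reads: iterable of (taxid, score) pairs, one per classified read.
    minscore: optional threshold; reads scoring below it are discarded,
    similar in spirit to a minimum score quality filter.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for taxid, score in reads:
        if minscore is not None and score < minscore:
            continue
        totals[taxid] += score
        counts[taxid] += 1
    return {taxid: totals[taxid] / counts[taxid] for taxid in totals}


# Hypothetical (taxid, score) pairs.
reads = [(101, 80.0), (101, 100.0), (102, 20.0), (102, 60.0)]

print(mean_score_per_taxon(reads))               # all reads pass
print(mean_score_per_taxon(reads, minscore=50))  # the low-scoring read is dropped
```

With the filter, taxon 102 keeps only its better read, so its average rises, which is the kind of effect to keep in mind when choosing a threshold.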

Recentrifuge’s flowchart. Reproduced from the paper.

Recentrifuge: robust comparative analysis and contamination removal for #metagenomics https://t.co/QE9iLvG5t2 … https://t.co/1mOCgmpXbG #bioRxiv Now freely available on @github and @pypi
— Jose Manuel Martí (@dyn_omics) February 18, 2019

"Negative control sequence were subtracted from patient sample reads by Recentrifuge" 👏👏👏 https://t.co/NdMWxovu4V
— Jose Manuel Martí (@dyn_omics) June 9, 2020

Happy to see our systems biology approach to Verticillium wilt of olive published in @BioMedCentral Plant Biology: Metatranscriptomic dynamics after Verticillium dahliae infection and root damage in Olea europaea: https://t.co/wEtshVtA6l
— Jose Manuel Martí (@dyn_omics) February 23, 2020

Installation

Tested on macOS 10.14 with Python 3.7.1 and on macOS 10.12 with Python 3.6.7.

Dependencies

Python 3.6 or later is required.

pandas for exporting results to CSV or TSV as extra files or for testing Recentrifuge.
openpyxl package is also required, additionally, for pandas to export results in Excel format.
matplotlib and xlrd are needed in addition to the previous packages for comprehensively testing the Recentrifuge package.

pip install pandas openpyxl xlrd matplotlib
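If you are not sure which of these optional packages are already present in your environment, a small check like the following can help. It is only a convenience sketch using the standard library. The package list is copied from the pip command above, and the helper name is mine.

```python
from importlib.util import find_spec

# Optional packages named in the pip command above.
OPTIONAL_PACKAGES = ("pandas", "openpyxl", "xlrd", "matplotlib")


def missing_packages(names=OPTIONAL_PACKAGES):
    """Return the names that cannot be imported in this environment."""
    return [name for name in names if find_spec(name) is None]


print(missing_packages())  # an empty list means everything is importable
```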

#bioconda(link)
conda install -c bioconda -y recentrifuge

Main package (GitHub)

pip install recentrifuge

> rcf -h

$ rcf -h

=-= /Users/kazuma/miniconda3/bin/rcf =-= v0.28.7 - Mar 2019 =-= by Jose Manuel Martí =-=

usage: rcf [-h] [-V] [-n PATH] [--format GENERIC_FORMAT]

(-f FILE | -g FILE | -l FILE | -r FILE | -k FILE) [-o FILE]

[-e OUTPUT_TYPE] [-c CONTROLS_NUMBER] [-s SCORING] [-y NUMBER]

[-m INT] [-x TAXID] [-i TAXID] [-a] [-z NUMBER] [-w INT]

[-u OPTION] [-t] [--nokollapse] [-d] [--sequential]

Analyze results of metagenomic taxonomic classifiers

optional arguments:

-h, --help show this help message and exit

-V, --version show program's version number and exit

input:

Define Recentrifuge input files and formats

-n PATH, --nodespath PATH

path for the nodes information files (nodes.dmp and

names.dmp from NCBI)

--format GENERIC_FORMAT

Format of the output files from a generic classifier

included with the option -g. It is a string like

"TYP:csv,TID:1,LEN:3,SCO:6,UNC:0" where valid file

TYPes are csv/tsv/ssv, and the rest of fields indicate

the number of column used (starting in 1) for the

TaxIDs assigned, the LENgth of the read, the SCOre

given to the assignment, and the taxid code used for

UNClassified reads

-f FILE, --file FILE Centrifuge output files. If a single directory is

entered, every .out file inside will be taken as a

different sample. Multiple -f is available to include

several Centrifuge samples.

-g FILE, --generic FILE

Output file from a generic classifier. It requires the

flag --format (see such option for details). Multiple

-g is available to include several generic samples.

-l FILE, --lmat FILE LMAT output dir or file prefix. If just "." is

entered, every subdirectory under the current

directory will be taken as a sample and scanned

looking for LMAT output files. Multiple -l is

available to include several samples.

-r FILE, --clark FILE

CLARK full-mode output files. If a single directory is

entered, every .csv file inside will be taken as a

different sample. Multiple -r is available to include

several CLARK, CLARK-l, and CLARK-S full-mode samples.

-k FILE, --kraken FILE

Kraken output files. If a single directory is entered,

every .krk file inside will be taken as a different

sample. Multiple -k is available to include several

Kraken (version 1 or 2) samples.

output:

Related to the Recentrifuge output files

-o FILE, --outhtml FILE

HTML output file (if not given, the filename will be

inferred from input files)

-e OUTPUT_TYPE, --extra OUTPUT_TYPE

type of extra output to be generated, and can be one

of ['FULL', 'CMPLXCRUNCHER', 'CSV', 'TSV']

tuning:

Coarse tuning of algorithm parameters

-c CONTROLS_NUMBER, --controls CONTROLS_NUMBER

this number of first samples will be treated as

negative controls; default is no controls

-s SCORING, --scoring SCORING

type of scoring to be applied, and can be one of

['SHEL', 'LENGTH', 'LOGLENGTH', 'NORMA', 'LMAT',

'CLARK_C', 'CLARK_G', 'KRAKEN', 'GENERIC']

-y NUMBER, --minscore NUMBER

minimum score/confidence of the classification of a

read to pass the quality filter; all pass by default

-m INT, --mintaxa INT

minimum taxa to avoid collapsing one level into the

parent (if not specified a value will be automatically

assigned)

-x TAXID, --exclude TAXID

NCBI taxid code to exclude a taxon and all underneath

(multiple -x is available to exclude several taxid)

-i TAXID, --include TAXID

NCBI taxid code to include a taxon and all underneath

(multiple -i is available to include several taxid);

by default, all the taxa are considered for inclusion

-a, --avoidcross avoid cross analysis

fine tuning:

Fine tuning of algorithm parameters

-z NUMBER, --ctrlminscore NUMBER

minimum score/confidence of the classification of a

read in control samples to pass the quality filter; it

defaults to "minscore"

-w INT, --ctrlmintaxa INT

minimum taxa to avoid collapsing one level into the

parent (if not specified a value will be automatically

assigned)

-u OPTION, --summary OPTION

select to "add" summary samples to other samples, or

to "only" show summary samples or to "avoid" summaries

at all

-t, --takeoutroot remove counts directly assigned to the "root" level

--nokollapse show the "cellular organisms" taxon

advanced:

Advanced modes of running

-d, --debug increase output verbosity and perform additional

checks

--sequential deactivate parallel processing

rcf - Release 0.28.7 - Mar 2019

This program is free software: you can redistribute it and/or modify

it under the terms of the GNU Affero General Public License as

published by the Free Software Foundation, either version 3 of the

License, or (at your option) any later version.

This program is distributed in the hope that it will be useful,

but WITHOUT ANY WARRANTY; without even the implied warranty of

MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

GNU Affero General Public License for more details.

You should have received a copy of the GNU Affero General Public License

along with this program. If not, see <https://www.gnu.org/licenses/>.
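The --format string in the help above is compact, so here is a sketch of how I read it. This is my own interpretation, not Recentrifuge's parser: I assume "ssv" means space-separated, that the file has no header row, and the sample rows (including the read ID and filler columns) are invented to match the example string "TYP:csv,TID:1,LEN:3,SCO:6,UNC:0".

```python
import csv
import io

# Assumed delimiters; "ssv" as space-separated is a guess.
DELIMITERS = {"csv": ",", "tsv": "\t", "ssv": " "}


def parse_format(spec):
    """Split 'TYP:csv,TID:1,...' into a dict of field name -> value."""
    fields = dict(item.split(":", 1) for item in spec.split(","))
    parsed = {"TYP": fields["TYP"], "UNC": fields["UNC"]}
    for key in ("TID", "LEN", "SCO"):
        parsed[key] = int(fields[key])  # column numbers start at 1
    return parsed


def read_generic(handle, spec):
    """Yield (taxid, length, score) per read, skipping unclassified reads."""
    fmt = parse_format(spec)
    reader = csv.reader(handle, delimiter=DELIMITERS[fmt["TYP"]])
    for row in reader:
        taxid = row[fmt["TID"] - 1]
        if taxid == fmt["UNC"]:
            continue  # taxid code used for unclassified reads
        yield taxid, int(row[fmt["LEN"] - 1]), float(row[fmt["SCO"] - 1])


# Hypothetical classifier output: taxid in column 1, length in 3, score in 6.
SPEC = "TYP:csv,TID:1,LEN:3,SCO:6,UNC:0"
DATA = "101,readA,150,x,y,0.9\n0,readB,140,x,y,0.0\n102,readC,151,x,y,0.7\n"

print(parse_format(SPEC))
print(list(read_generic(io.StringIO(DATA), SPEC)))
```

The second row has taxid 0, which matches UNC:0, so it is treated as unclassified and skipped.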

> rextract -h

$ rextract -h

=-= /Users/kazuma/miniconda3/bin/rextract =-= v0.28.7 - Mar 2019 =-= by Jose Manuel Martí =-=

usage: rextract [-h] [-V] [-d] -f FILE [-l NUMBER] [-m NUMBER] [-n PATH]

[-i TAXID] [-x TAXID] [-y NUMBER] (-q FILE | -1 FILE)

[-2 FILE]

Selectively extract reads by Centrifuge output

optional arguments:

-h, --help show this help message and exit

-V, --version show program's version number and exit

-d, --debug increase output verbosity and perform additional

checks

-f FILE, --file FILE Centrifuge output file.

-l NUMBER, --limit NUMBER

Limit of FASTQ reads to extract. Default: no limit

-m NUMBER, --maxreads NUMBER

Maximum number of FASTQ reads to search for the taxa.

Default: no maximum

-n PATH, --nodespath PATH

path for the nodes information files (nodes.dmp and

names.dmp from NCBI)

-i TAXID, --include TAXID

NCBI taxid code to include a taxon and all underneath

(multiple -i is available to include several taxid).

By default all the taxa is considered for inclusion.

-x TAXID, --exclude TAXID

NCBI taxid code to exclude a taxon and all underneath

(multiple -x is available to exclude several taxid)

-y NUMBER, --minscore NUMBER

minimum score/confidence of the classification of a

read to pass the quality filter; all pass by default

-q FILE, --fastq FILE

Single FASTQ file (no paired-ends)

-1 FILE, --mate1 FILE

Paired-ends FASTQ file for mate 1s (filename usually

includes _1)

-2 FILE, --mate2 FILE

Paired-ends FASTQ file for mate 2s (filename usually

includes _2)

rextract - Release 0.28.7 - Mar 2019

> retaxdump -h

$ retaxdump -h

usage: retaxdump [-h] [-V] [-n PATH]

Get needed taxdump files from NCBI servers

optional arguments:

-h, --help show this help message and exit

-V, --version show program's version number and exit

-n PATH, --nodespath PATH

path for the nodes information files (nodes.dmp and

names.dmp from NCBI)

retaxdump - Release 0.28.8 - Apr 2019

Addendum

It does not appear to be official, but a Docker image has also been published.

https://hub.docker.com/r/replikation/recentrifuge

docker pull replikation/recentrifuge

Preparing the database

Run the following command in your working directory.

retaxdump

Downloading taxdmp.zip from NCBI FTP... OK!

Extracting nodes.dmp... OK!

Extracting names.dmp... OK!
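For orientation, nodes.dmp is a text file in the NCBI taxdump layout, where fields are separated by a tab, a vertical bar, and a tab, and the first three fields are the taxid, the parent taxid, and the rank. The sketch below parses that layout into a dictionary. The sample lines are abbreviated and use made-up IDs, and the function name is mine.

```python
def parse_nodes(lines):
    """Map each taxid to (parent taxid, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        fields = [field.strip() for field in line.split("|")]
        taxid, parent, rank = int(fields[0]), int(fields[1]), fields[2]
        nodes[taxid] = (parent, rank)
    return nodes


# Abbreviated, illustrative lines in the nodes.dmp layout (hypothetical IDs).
SAMPLE = [
    "1\t|\t1\t|\tno rank\t|\n",
    "10\t|\t1\t|\tfamily\t|\n",
    "20\t|\t10\t|\tgenus\t|\n",
]

print(parse_nodes(SAMPLE))
```

To try it on the downloaded file, pass an open file handle instead of the sample list, for example `with open("nodes.dmp") as handle: nodes = parse_nodes(handle)`, adjusting the path to wherever retaxdump placed the file.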

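How to run

As an example, we use Centrifuge results. For the workflow from preparing the Centrifuge database to running it, see here (link). Below, only the commands that perform taxonomy assignment from FASTQ files are shown.

1. First, run Centrifuge. Here we analyze data for three samples, sample1 to sample3.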

centrifuge -x abv -1 sample1_R1.fq -2 sample1_R2.fq -p 16 --report-file sample1_report.txt -S sample1.out

centrifuge -x abv -1 sample2_R1.fq -2 sample2_R2.fq -p 16 --report-file sample2_report.txt -S sample2.out

centrifuge -x abv -1 sample3_R1.fq -2 sample3_R2.fq -p 16 --report-file sample3_report.txt -S sample3.out
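With more than a handful of samples, typing these commands by hand gets tedious. The sketch below generates the same command line per sample. It only builds and prints the commands, using the flags and file naming shown above. The helper name and its defaults are mine.

```python
import shlex


def centrifuge_command(sample, index="abv", threads=16):
    """Build the Centrifuge command used above for one paired-end sample."""
    return [
        "centrifuge", "-x", index,
        "-1", f"{sample}_R1.fq", "-2", f"{sample}_R2.fq",
        "-p", str(threads),
        "--report-file", f"{sample}_report.txt",
        "-S", f"{sample}.out",
    ]


for sample in ("sample1", "sample2", "sample3"):
    print(shlex.join(centrifuge_command(sample)))
```

To actually run a command instead of printing it, the list can be handed to `subprocess.run(cmd, check=True)`, provided Centrifuge and its index are available on the machine.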

Optional step: to run rextract, see the bottom of this post.

2. Re-analyze with Recentrifuge.

rcf -f sample1.out -f sample2.out -f sample3.out

An HTML file and an xlsx file are output. For complex metagenomic samples, even three samples can take tens of minutes.

f:id:kazumaxneo:20190419173745j:plain
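The run above has no negative controls. According to the help text, -c treats the first N samples as negative controls, so the order of the -f options matters. The sketch below builds an rcf command line with controls placed first. It uses only flags listed in the help (-f, -c, -n, -y). The helper name and the file name blank.out are hypothetical, and I have not benchmarked this call.

```python
import shlex


def rcf_command(samples, controls=(), nodespath=None, minscore=None):
    """Build an rcf command line; control files are listed before samples."""
    cmd = ["rcf"]
    if nodespath is not None:
        cmd += ["-n", nodespath]
    # -c N treats the first N samples as negative controls, so they go first.
    for path in list(controls) + list(samples):
        cmd += ["-f", path]
    if controls:
        cmd += ["-c", str(len(controls))]
    if minscore is not None:
        cmd += ["-y", str(minscore)]
    return cmd


samples = ["sample1.out", "sample2.out", "sample3.out"]
print(shlex.join(rcf_command(samples)))
print(shlex.join(rcf_command(samples, controls=["blank.out"])))
```

The first printed line reproduces the command used in step 2, and the second adds one hypothetical control file ahead of the samples.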

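3. Open the HTML file.

The results are visualized with Krona.

f:id:kazumaxneo:20190419173853j:plain

For how to read the chart, see Figure 5 of the paper (direct link).

Clicking a segment shows its abundance, tax ID, species name, and so on.

f:id:kazumaxneo:20190419174159j:plain

Double-clicking jumps to a lower rank.

f:id:kazumaxneo:20190419174316j:plain

To return to the root, repeatedly double-click the text at the center.

Samples can be switched from the top left. For example, you can display only what was detected exclusively in sample1.

f:id:kazumaxneo:20190419172806j:plain

Other parameters can also be changed interactively.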

sample1

f:id:kazumaxneo:20190419172925j:plain

sample2

f:id:kazumaxneo:20190419172935j:plain

sample3

f:id:kazumaxneo:20190419172943j:plain

Each time a sample is selected, the abundances switch with a motion like a spinning centrifuge.

shared species

f:id:kazumaxneo:20190419173038j:plain

shared genus

f:id:kazumaxneo:20190419173218j:plain

Of course, ranks above genus can also be selected.

sample1 exclusive genus

f:id:kazumaxneo:20190419173308j:plain

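Switching can also be done from the menu on the right.

f:id:kazumaxneo:20190419173640j:plain

↑ Shows the species rank and below.

The xlsx file also shows the same results as Krona (the underlying data).

f:id:kazumaxneo:20190419174812j:plain

A more concise summary is also created on another sheet.

f:id:kazumaxneo:20190419174715j:plain

With the rextract command, you can cherry-pick (*1) the Centrifuge (Kraken) results, extract only the data you need, and pass it to the rcf command.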

rextract -f S1.out -i 1117 -1 S1_R1.fastq -2 S1_R2.fastq

-i <id> NCBI taxid code to include a taxon and all underneath (multiple -i is available to include several taxid). By default all the taxa is considered for inclusion.
-x <id> NCBI taxid code to exclude a taxon and all underneath (multiple -x is available to exclude several taxid).
-y minimum score/confidence of the classification of a read to pass the quality filter; all pass by default
-q Single FASTQ file (no paired-ends)
-1 Paired-ends FASTQ file for mate 1s (filename usually includes _1)
-2 Paired-ends FASTQ file for mate 2s (filename usually includes _2)
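The phrase "a taxon and all underneath" in -i and -x means the filter works on whole subtrees of the taxonomy. Here is a toy sketch of that semantics. The tree and IDs are invented, and the rule that an exclusion wins over an inclusion is my assumption, not something the help text states.

```python
def is_under(taxid, ancestor, parent):
    """True if taxid equals ancestor or descends from it."""
    while True:
        if taxid == ancestor:
            return True
        if parent[taxid] == taxid:  # reached the root without a match
            return False
        taxid = parent[taxid]


def keep_read(taxid, parent, include=(), exclude=()):
    """Apply -i/-x style subtree filters to the taxid assigned to one read."""
    if any(is_under(taxid, x, parent) for x in exclude):
        return False  # assumption: exclusion takes precedence
    if not include:
        return True   # by default every taxon is considered for inclusion
    return any(is_under(taxid, i, parent) for i in include)


# Toy taxonomy (hypothetical IDs): 1 = root, 2 = a family,
# 3 and 4 = two genera in that family, 5 and 6 = two species of genus 3.
PARENT = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3}

print(keep_read(5, PARENT, include=[3]))  # species 5 sits under genus 3
print(keep_read(4, PARENT, include=[3]))  # genus 4 is outside that subtree
print(keep_read(5, PARENT, exclude=[3]))  # excluded along with its genus
```

In the rextract example above, -i 1117 plays the role of `include=[1117]`: reads assigned to that taxon or anything below it are the ones extracted.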