2020-02-13

最新のデータベースを使ってメタゲノムのリードのtaxonomic assignmentを行う ganon

　リファレンスおよびtaxonomyに基づくショートリードの分類は、メタゲノムの基本的なタスクである。シーケンス後に行うことができる環境サンプルからの各リードの起源の定義は、通常は量の推定、プロファイリング、およびアセンブリ前の最初のステップである。過去数年にわたり、このタスクのために多くの特別なツールが開発されてきた[ref.2、3、4、5、6]。これらのアプローチの多くはtaxonomyに基づいており[ref.7]、taxonomyからサンプルの構成をよりよく理解できる。

　パブリックリポジトリの完全なゲノムまたはドラフトゲノム配列の量は、ゲノムシーケンスの進歩、リード品質、長さ、カバレッジの改善、およびゲノムアセンブリのアルゴリズムの向上により、急速に増加している（論文図1）。さらに、多くの部分的および完全なゲノムシーケンスが、metagcnomeのアセンブリされたゲノム[ref.8、9、10]から直接得られる。これは、公共リポジトリの成長を促進する手法である。このかなりの量のリファレンスは、一般に、そのような量のデータを処理するように設計されていない現在のツールにとってかなりの難題を提起する[ref.11]。また、数百万のショートリードを分類上のターゲットにアサインするために、高い計算コストがかかる。

　リファレンスベースの分類を実行するためのデータの選択は重要なステップであり、メタゲノミクスにおける既知の問題である[ref.14]。経験則として、シーケンスが多ければ多いほど、分類は良くなる。ただし、シーケンスの完全なセットでさえ、分類ツリー全体に均等に分散されていないため、異なる分類群は異なるレベルの量と品質で表される。さらに、分類群の大部分はまだ不明であり、分類ツリーにゲノム配列またはエントリが存在しない。これには、パブリックリポジトリの最新リリースを常に最新の状態に保つためのツールが必要である。これは、大量の配列を処理する場合、簡単な作業ではない。ほとんどのツールには、独自のインデックスとデータベースを更新する機能がなく、現在、多くの分析が古いリソースで実行されている。

　たとえば、2018年初頭のRefSeq Microbialリポジトリは、現在（2019年中旬）よりも分類上の多様性が10％少なくない。 2015年6月からのさらに古いRefSeqリリースでは、今日の分類学的多様性の27％が不足している。さらに、一般的に使用されるRefSeqのサブセットである微生物のcompleteゲノムは、完全なリポジトリの利用可能な多様性の15％しかカバーしていない（2018年12月）。例として、完全な細菌ゲノム、古細菌ゲノム、およびウイルスゲノムに基づいたkrakenの[ref.15] MiniKrakenデータベース最新リリース（2017年10月18日現在）は、コミュニティ構成に関する迅速な洞察を得るのに役立つが、 2019年1月4日からの最新のRefSeqリリースで利用可能な分類学的多様性を欠いている。これらのリリースに基づくメタゲノム解析は、潜在的な潜在種を過小評価し、見落としがちである。ただし、古いリファレンスまたは「事前に作成された」インデックスの使用は依然として一般的な慣行である[ref.16]。ほとんどのメソッドはカスタムデータベースを構築できるが、更新することはできない。最新のデータを毎週または毎日更新することは、これらのインデックスを再構築するための時間要件を考えるとほとんど不可能である。

　シーケンス分類器MetaPhlAn [ref.17]とKaiju [ref.18]は、それぞれマーカー遺伝子とタンパク質配列のサブセットを選択することにより、パブリック配列リポジトリに含まれるほとんどの多様性をカバーする代替案を作成した。一方で、これらのメソッドは非常に強力であるため、インデックスサイズが小さくても、高速で正確なコミュニティプロファイルを提供する。一方、複雑な環境の全ゲノムシーケンスを分析する場合、完全なゲノムカバレッジがないため、存在量の少ない生物は簡単に見逃される。さらに、完全なゲノム配列を使用する現在の方法は、現在の利用可能なデータ量と格闘している[ref.11]。

　これらの制限を考慮して、メタゲノムの新しいリファレンスおよびtaxonomyに基づくショートリード分類ツールであるganonを開発した。 Ganonは、インターリーブブルームフィルター（IBF）[ref.19]を使用して、非常に大量の配列を検索可能なインデックスに表現する。これにより、大量のリファレンスセット（完全なRefSeqなど）のインデックス作成がより高速かつ低メモリ消費で可能になり、メタゲノムシーケンス実験全体のリード分類が改善される。Ganonは更新可能なインデックスも提供する。これにより、新しくリリースされたシーケンスを短時間で組み込むことができる。 k-merカウント補題とプログレッシブフィルタリングステップに基づく分類方法は、最先端のツールと比較した場合、感度を損なうことなく分類の精度を向上させる。Ganonは、SeqAnライブラリ[ref.20]とPythonを使用してC ++で開発された。コードはオープンソースであり、https://gitlab.com/rki_bioinformatics/ganonから無料で入手できる。

New version of genome_updater released (0.2.0), now with support for multiple snapshots, keeping a history of downloaded files over time https://t.co/iRl0HmOU4M
— Vitor C. Piro (@piro_vc) September 24, 2019

インストール

pytho3.7の仮想環境でテストした(ホストOS; ubuntu18.04LTS)。

macosでは手動導入のみ可能(Ganon runs on osx only with a manual installation)

依存

build

gcc >=7 (check gcc7 with conda)
cmake >=3
Catch2 >=2.7.0
cxxopts >=2.1.2
sdsl-lite 3.0 d6ed14
seqan 2.4.0 c308e9

run

python >=3.4
taxsbp >=0.1.1
binpacking >=1.4.1
pylca >= 1.0.0
pandas
wget
curl
tar
GNU core utilities (gawk, zcat)

本体　Github

conda create -n ganon -y
conda activate ganon

#bioconda (link)
conda install -c bioconda -c conda-forge -y ganon

#パフォーマンスが上がる手動導入も推奨されています（link）

> ganon

$ ganon

usage: ganon [-h] [-v] {build,update,classify} ...

ganon

positional arguments:

{build,update,classify}

build Build ganon database

update Update ganon database

classify Classify reads

optional arguments:

-h, --help show this help message and exit

-v, --version Show program's version number and exit.

> ganon build -h

$ ganon build -h

usage: ganon build [-h] -d db_prefix [-i ...] [-r] [-k] [-n] [-f] [-m] [-l] [-t] [--fixed-bloom-size] [--fragment-length] [--overlap-length] [--seq-info ...] [--seq-info-file]

[--taxdump-file ...] [--input-directory] [--input-extension] [--verbose]

optional arguments:

-h, --help show this help message and exit

-r , --rank Target taxonomic rank for classification [assembly,taxid,species,genus,...]. Default: species

-k , --kmer-size The k-mer size for the interleaved bloom filter. Default: 19

-n , --hash-functions

The number of hash functions for the interleaved bloom filter. Default: 3

-f , --max-fp Max. false positive rate for k-mer classification. Default: 0.05

-m , --max-bloom-size

Approx. maximum filter size in Megabytes (MB). Will estimate best --bin-length based on --kmer-size, --hash-functions and --max-fp [Mutually exclusive --fixed-bloom-size]

-l , --bin-length Maximum length (in bp) for each bin. Default: auto

-t , --threads Number of subprocesses/threads to use. Default: 2

--fixed-bloom-size Fixed size for filter in Megabytes (MB), will ignore --max-fp [Mutually exclusive --max-bloom-size]

--fragment-length Fragment length (in bp). Set to 0 to not fragment sequences. Default: --bin-length - --overlap-length

--overlap-length Fragment overlap length (in bp). Should be bigger than the read length used for classification. Default: 300

--seq-info [ [ ...]] Mode to obtain sequence information. For each sequence entry provided, ganon requires taxonomic and seq. length information. If a small number of sequences is provided

(<50000) or when --rank assembly, ganon will automatically obtain data with NCBI E-utils websevices (eutils). Offline mode will download batch files from NCBI Taxonomy

and look for taxonomic ids in the order provided. Options: [nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot], eutils (force webservices) or auto

(uses eutils or [nucl_gb nucl_wgs]). Default: auto [Mutually exclusive --seq-info-file]

--seq-info-file Pre-generated file with sequence information (seqid <tab> seq.len <tab> taxid [<tab> assembly id]) [Mutually exclusive --seq-info]

--taxdump-file [ [ ...]]

Force use of a specific version of the (taxdump.tar.gz) or (nodes.dmp names.dmp [merged.dmp]) file(s) from NCBI Taxonomy (otherwise it will be automatically downloaded)

--input-directory Directory containing input files

--input-extension Extension of files to use with --input-directory (provide it without * expansion, e.g. ".fna.gz")

--verbose Verbose mode for ganon-build

required arguments:

-d db_prefix, --db-prefix db_prefix

Database output prefix (.ibf, .map, .tax, .gnn will be created)

-i [ [ ...]], --input-files [ [ ...]]

Input reference sequence fasta files [.gz]

> ganon classify -h

$ ganon classify -h

usage: ganon classify [-h] -d [db_prefix [db_prefix ...]] [-r [reads.fq[.gz] [reads.fq[.gz] ...]]] [-p [reads.1.fq[.gz] reads.2.fq[.gz] [reads.1.fq[.gz] reads.2.fq[.gz] ...]]]

[-c [HIERARCHY_LABELS [HIERARCHY_LABELS ...]]] [-k [MIN_KMERS [MIN_KMERS ...]]] [-e [MAX_ERROR [MAX_ERROR ...]]] [-u [MAX_ERROR_UNIQUE [MAX_ERROR_UNIQUE ...]]] [-f OFFSET]

[-o OUTPUT_PREFIX] [-a] [-n] [-s] [--ranks [RANKS [RANKS ...]]] [-t THREADS] [--verbose]

optional arguments:

-h, --help show this help message and exit

-c [HIERARCHY_LABELS [HIERARCHY_LABELS ...]], --hierarchy-labels [HIERARCHY_LABELS [HIERARCHY_LABELS ...]]

Hierarchy definition, one for each database input. Can also be a string, but input will be sorted to define order (e.g. 1 1 2 3). Default: 1

-k [MIN_KMERS [MIN_KMERS ...]], --min-kmers [MIN_KMERS [MIN_KMERS ...]]

Min. percentage of k-mers matching to consider a read assigned. Single value or one per database (e.g. 0.5 0.7 1 0.25). Default: 0.25 [Mutually exclusive --max-error]

-e [MAX_ERROR [MAX_ERROR ...]], --max-error [MAX_ERROR [MAX_ERROR ...]]

Max. number of errors allowed. Single value or one per database (e.g. 3 3 4 0) [Mutually exclusive --min-kmers]

-u [MAX_ERROR_UNIQUE [MAX_ERROR_UNIQUE ...]], --max-error-unique [MAX_ERROR_UNIQUE [MAX_ERROR_UNIQUE ...]]

Max. number of errors allowed for unique assignments after filtering. Matches below this error rate will not be discarded, but assigned to a parent taxonomic level.

Single value or one per hierarchy (e.g. 0 1 2). -1 to disable. Default: -1

-f OFFSET, --offset OFFSET

Number of k-mers to skip during classification. Can speed up analysis but may reduce recall. (e.g. 1 = all k-mers, 3 = every 3rd k-mer). Default: 2

-o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX

Output prefix for .lca and .rep. Empty to output to STDOUT (only .lca will be printed)

-a, --output-all Output an additional file with all matches (.all). File can be very large.

-n, --output-unclassified

Output an additional file with unclassified read headers (.unc)

-s, --output-single When using multiple hierarchical levels, output everything in one file instead of one per hierarchy

--ranks [RANKS [RANKS ...]]

Ranks to show in the report (.tre). "all" for all identified ranks. empty for default ranks: superkingdom phylum class order family genus species species+ assembly. This

file can be re-generated with the ganon report command.

-t THREADS, --threads THREADS

Number of subprocesses/threads to use. Default: 3

--verbose Verbose mode for ganon-classify

required arguments:

-d [db_prefix [db_prefix ...]], --db-prefix [db_prefix [db_prefix ...]]

Database input prefix[es]

-r [reads.fq[.gz] [reads.fq[.gz] ...]], --single-reads [reads.fq[.gz] [reads.fq[.gz] ...]]

Multi-fastq[.gz] file[s] to classify

-p [reads.1.fq[.gz] reads.2.fq[.gz] [reads.1.fq[.gz] reads.2.fq[.gz] ...]], --paired-reads [reads.1.fq[.gz] reads.2.fq[.gz] [reads.1.fq[.gz] reads.2.fq[.gz] ...]]

Multi-fastq[.gz] pairs of file[s] to classify

> ganon update -h

$ ganon update -h

usage: ganon update [-h] -d db_prefix [-i ...] [-o] [-t] [--seq-info ...] [--seq-info-file] [--taxdump-file ...] [--input-directory] [--input-extension] [--verbose]

optional arguments:

-h, --help show this help message and exit

-o , --output-db-prefix

Output database prefix (.ibf, .map, .tax, .gnn). Default: overwrite current --db-prefix

-t , --threads Number of subprocesses/threads to use. Default: 2

--seq-info [ [ ...]] Mode to obtain sequence information. For each sequence entry provided, ganon requires taxonomic and seq. length information. If a small number of sequences is provided

(<50000) or when --rank assembly, ganon will automatically obtained data with NCBI E-utils websevices (eutils). Offline mode will download batch files from NCBI Taxonomy

and look for taxonomic ids in the order provided. Options: [nucl_gb nucl_wgs nucl_est nucl_gss pdb prot dead_nucl dead_wgs dead_prot], eutils (force webservices) or auto

(uses eutils or [nucl_gb nucl_wgs]). Default: auto [Mutually exclusive --seq-info-file]

--seq-info-file Pre-generated file with sequence information (seqid <tab> seq.len <tab> taxid [<tab> assembly id]) [Mutually exclusive --seq-info]

--taxdump-file [ [ ...]]

Force use of a specific version of the (taxdump.tar.gz) or (nodes.dmp names.dmp [merged.dmp]) file(s) from NCBI Taxonomy (otherwise it will be automatically downloaded)

--input-directory Directory containing input files

--input-extension Extension of files to use with --input-directory (provide it without * expansion, e.g. ".fna.gz")

--verbose Verbose mode for ganon-build

required arguments:

-d db_prefix, --db-prefix db_prefix

Database input prefix (.ibf, .map, .tax, .gnn)

-i [ [ ...]], --input-files [ [ ...]]

Input reference sequence fasta files [.gz]

テストラン

１、テストデータの準備とデータベースのビルド

git clone https://github.com/pirovc/ganon.git
cd ganon/
ganon build --db-prefix sample_bacteria --input-files tests/ganon-build/data/sequences/bacteria*.fasta.gz

２、分類

ganon classify --db-prefix sample_bacteria --single-reads tests/ganon-classify/data/reads/bacteria.simulated.1.fq -o sample_results

３、レポート

ganon report --db-prefix sample_bacteria --rep-file sample_results.rep --min-matches-perc 50 --ranks all --output-report new_report.tre

> cat new_report.tre

$ cat new_report.tre

unclassified - - - - - 1 1.00000

no rank 1 1 root 0 0 99 99.00000

no rank 131567 1|131567 cellular organisms 0 0 99 99.00000

superkingdom 2 1|131567|2 Bacteria 0 0 99 99.00000

no rank 1783272 1|131567|2|1783272 Terrabacteria group 0 0 57 57.00000

phylum 1239 1|131567|2|1783272|1239 Firmicutes 0 0 57 57.00000

class 91061 1|131567|2|1783272|1239|91061 Bacilli 0 0 57 57.00000

order 1385 1|131567|2|1783272|1239|91061|1385 Bacillales 0 0 57 57.00000

family 186822 1|131567|2|1783272|1239|91061|1385|186822 Paenibacillaceae 0 0 57 57.00000

genus 44249 1|131567|2|1783272|1239|91061|1385|186822|44249 Paenibacillus 0 0 57 57.00000

species 1406 1|131567|2|1783272|1239|91061|1385|186822|44249|1406 Paenibacillus polymyxa 57 57 57 57.00000

４、アップデート

ganon update --db-prefix sample_bacteria --output-db-prefix sample_bateria_virus --input-files tests/ganon-build/data/sequences/virus*.fasta.gz

Extracting accessions... 4 accessions retrieved. Done. Elapsed time: 0.01 seconds.

Retrieving sequence lengths and taxid from NCBI E-utils... Done. Elapsed time: 1.03 seconds.

Downloading taxdump...

引用

ganon: precise metagenomics classification against large and up-to-date sets of reference sequences

Vitor C. Piro, Temesgen H. Dadi, Enrico Seiler, Knut Reinert, Bernhard Y. Renard

bioRxiv preprint first posted online Aug. 31, 2018

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

最新のデータベースを使ってメタゲノムのリードのtaxonomic assignmentを行う ganon