2018-07-21

小メモリで高速にtaxonomy assignmentを行う metacache

　メタゲノム研究の例として、ヒト腸のシーケンシング解析（Korpela et al、2016）、ヒトの皮膚（Bzhalava et al、2014）、水生生態系（Bork et al、2015）、食物（Ripp et al、2014 ）、土壌（Fierer et al、2012）および空中の微生物（Barberánet al、2015）が含まれる。初期のプロジェクトは、特定のマーカー遺伝子のアンプリコンシーケンシングするだけだった。しかしながら、最近の次世代シークエンシング（NGS）技術の進歩により、サンプル中の全DNAのショットガンシーケンシングが実行可能で費用対効果の高いものとなった。生成されたNGSデータセットは通常、計算分析を困難にする大きなものである。したがって、大規模なメタゲノムショートシーケンシングリードデータを正確かつ効率的に処理するためのソフトウェアツールの設計および実装は、研究コミュニティおよび産業アプリケーションにとって非常に重要になる。
　シーケンシング解析されたサンプルの分類学的組成の分析は、多くのメタゲノム処理パイプラインにおける重要なステップである。対応するリード分類問題は、適切な分類学的ラベル（例えば、種または属）を所定のNGSシーケンシングリードに割り当てることを目的とする。この問題に対処する従来のアプローチは、リファレンスゲノムのアノテートされたデータベースを使用することによって行われてきた [例えば、 BLAST（Camacho et al、2009）またはMegaBLAST（Morgulis et al、2008））。残念なことに、膨大なリードに対処するには対応するソフトウェアツールのコンピューティングアライメントは一般に遅すぎて実行時間が長くなる。（ Husonら（2011年）またはBrady and Salzberg（2009年）を参照）。この時間のかかる手順を加速する1つの方法は、分類をマーカー遺伝子の小さなサブセットに限定することである。このアプローチに基づくプログラムには、MetaPhlAn（Truong et al、2015）（紹介）、MetaPhyler（Liu et al、2010）、mOTU（Sunagawa et al、2013）およびQIIME（Caporaso et al。、2010）が含まれる。

　最近のアライメントフリーツールはこの制限を回避でき、正確なk-merマッチングに基づいて高速な実行時間と高い精度を実現している。このアプローチでは、k-merインデックスデータ構造（またはデータベース）が前処理ステップで構築される。インデックスは、通常、リファレンスデータベース内の各ゲノムの長さkの異なる部分文字列をすべて含むハッシュテーブルに基づいている。与えられリードRは、その後、Rの中のすべてのk-mers集合を抽出し、続いてそれを事前計算されたインデックスに対して照会する検索手順によって分類される。ルックアップが単一の一致を返す場合、対応するゲノムのためのカウンターが増分される。複数のマッチの場合、マッチするゲノムの最も一般的な分類学的祖先のカウンターを使用することができる。この手続きの終わりに、Rは高得点カウンターに基づいて分類することができる。このアプローチに続く最初のプログラムの1つは、LMAT（Ames et al、2013）であった。 Kraken（Wood and Salzberg、2014）（紹介）は、LMATと比較してメモリ消費と分類速度を向上させるために分類ツリーを使用している。 CLARK（Ounit et al、2015）は、あらかじめ定義された分類学的レベルに標的特異的k-merを保存して、Krakenのメモリ量をさらに改善する。 CLARK-S（Ounit and Lonardi、2016）はCLARKのバリアントで、感度を改善するためにより高いメモリ消費とスピードの遅さ両方を犠牲にするが、spaced k-mersを使う。 Kaiju（Menzel et al、2016）（紹介）は、リファレンスゲノムのアノテーションされたタンパク質コード遺伝子に基づいてリードを分類する。最近のベンチマーク研究（Lindgreen et al、2016）では、Krakenが、14個のテスト済みツール（CLARK、LMAT、MetaPhlAn、mOTU、QIIME、MetaPhyler、およびMEGAN）の比較において、リードのアサイン精度、属、門レベルの分類速度においてベストパフォーマンスを示した。リード分類器によってアサインされた分類学的標識は、メタゲノム試料からDNA中の種の存在量を推定する基礎としても用いることができる。例えばBracken（Lu et al。、2016）は、Kraken出力を用いてベイズ尤度計算を行う豊富さ推定の正確な確率論的方法である。
　最近の他のツールでは、MetaKallisto（Schaeffer et al、2017）のような擬似アラインメント手法や、 WGSQuikr（Koslicki et al、2014）およびMetapallette（Koslicki and Falush、2016）の圧縮センシング（compressed sensing estimates）手法がある。圧縮センシング手法は、リファレンスゲノムのk-meスペクトラムプロファイルに基づき、線形システムを解くことでサンプルのk-merスペクトラムプロファイルを再構成する。

　本論文では、KrakenやCLARKよりもはるかに少ないメモリでindexデータ構造を作成し、同等の分類速度で高感度と高精度両方を達成できるMetaCacheと呼ばれる新しいリード分類ソフトウェアツールを紹介する。著者らによるパフォーマンス評価は、MetaCacheが2つの理由でk-merベースのリード分類アプローチの最先端技術を進化させることを示している。第1に、ハッシュを使用することによって、k-merサブセットのみが使用され、それによりデータインデックス構造サイズが大幅に縮小される。第2に、このアプローチは、ゲノム全体のどの位置でもなく、ローカルウインドウ内のコンテキスト認識k-merマッチのみを使用する。これにより、比較的小さなk値を効果的に適用することができ、すなわち、ランダムなマッチングを大幅に低減しながら高感度を達成することができる。たとえば、NCBI RefSeqドラフトゲノムとcompleteゲノムの合計約1,400億塩基をリファレンスデータベースとすると、MetaCacheのデータベースは62 GBのメモリしか消費しないが、クラークンとCLARKは512 GB RAMのワークステーションでそれぞれのデータベースを構築に失敗する。さらに、我々（著者ら）の実験から、利用リファレンスゲノムデータ量が増えることで、分類精度が継続的に向上することを示す。

インストール

ubuntu16.04に導入した。

本体　Github

git clone https://github.com/muellan/metacache.git 
cd metacache
make

#NCBI Refseqにアクセスして、最新のバクテリア、アーキア、ウィルスのコンプリートゲノムをダウンロードし、データベースを作成します（数十GBあって時間がかかります）
./metacache-build-refseq

デフォルトではk-mer≤16、65,535リファレンスまで対応している。変更する時はmakeを打ってビルド時にパラメータを指定する。詳細はGithubで確認してください。

> ./metacache

$ ./metacache

This program comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it

under certain conditions. See the file 'LICENSE' for details.

You need to specify one of the following modes:

help shows documentation

query classify read sequences using pre-built database

build build new database from reference sequences (usually genomes)

add add reference sequences and/or taxonomy to existing database

info show database and reference sequence properties

annotate annotate sequences with taxonomic information

EXAMPLES:

Query 'myreads.fna' against pre-built database 'refseq':

./metacache query refseq myreads.fna

Query all sequence files in folder 'test' againgst pre-built database 'refseq':

./metacache query refseq test

View documentation for all query options:

./metacache help query

View documentation on how to build databases:

./metacache help build

詳細なヘルプはtopicを指定して実行する。以下のコマンドがあり、

help
query
build
add
annotate
info

modifyのヘルプを見るなら以下のように実行する。

> ./metacache query help

$ ./metacache query help

Reading database from file 'help.db' ... FAIL

ABORT: Could not read database file 'help.db'!

kamisakBookpuro:metacache kamisakakazuma$ ./metacache help query

Documentation for MetaCache mode "query"

SYNOPSIS

metacache query <database name> [<query sequence>...] OPTIONS

DESCRIPTION

Map sequences (short reads) to their most likely origin.

You can also provide names of directories that contain

sequence files instead of single filenames. MetaCache will

search at most 10 levels down in the file hierarchy.

If no input sequence filenames or directories are given, MetaCache will

run in interactive query mode. This can be used to load the database into

memory only once and then query it multiple times with different

query options.

BASIC OPTIONS

-out <file> Redirect output to file <file>.

If not specified, output will be written to stdout.

If more than one input file was given all output

will be concatenated into one file.

-splitout <file> Generate output and statistics for each input file

separately. For each input file <in> an output file

<file>_<in> will be written.

-pairfiles Interleave paired-end reads from two consecutive files,

so that the nth read from file m and the nth read

from file m+1 will be treated as a pair.

If more than two files are provided, their names

will be sorted before processing. Thus, the order

defined by the filenames determines the pairing not

the order in which they were given in the command line.

-pairseq Two consecutive sequences (1+2, 3+4, ...) from each file

will be treated as paired-end reads.

-insertsize Maximum insert size to consider.

default: sum of lengths of the individual reads

-lowest <rank> Do not classify on ranks below <rank>.

default: sequence

-highest <rank> Do not classify on ranks above <rank>.

default: domain

-hitmin <t> Sets classification threshhold to <t>.

A read will not be classified if less than t features

from the database are found in it.

Higher values will increase precision at the expense of

lower sensitivity.

-max-cand <no> maximum number of reference taxon candidates to

consider for each query; A large value can significantly

decrease the querying speed!

default: 2

OUTPUT FORMATTING OPTIONS

-nomap Don't report classification for each individual query

sequence; show summaries only (useful for quick tests).

default: off (= print per-read mappings)

-mapped-only Don't list unclassified queries.

default: off (= print all mappings)

-taxids Print taxon ids in addition to taxon names.

default: off

-taxids-only Print taxon ids instead of taxon names.

default: off

-omit-ranks Don't print taxon ranks.

default: off (= do print taxon ranks)

-separate-cols Prints *all* mapping information (rank, taxon name, taxon ids)

in separate columns (see option "-separator").

default: off (= print rank:taxon_name in one column)

-separator <text> string that separates output fields (sequence name,

classification result, etc.)

default: "\t|\t"

-queryids Show a unique id for each query.

Note that in paired-end mode a query is a pair of two

read sequences. This option will always be activated if

"-hits-per-seq" is given.

default: off

-locations Show locations in classification candidate reference

sequences.

default: off

-lineage Report complete lineage for per-read classification

starting with the lowest rank found or allowed and

ending with the highest rank allowed. See also

options '-lowest' and '-highest'.

default: off

-no-query-params don't show query settings at the beginning of the

mapping output

default: do show query settings

-no-summary don't show result summary & mapping statistics at the

end of the mapping output

default: do show the summary

ADVANCED ANALYSIS OPTIONS

-tophits For each query, print top feature hits in database.

default: off

-allhits For each query, print all feature hits in database.

default: off

-hits-per-seq Shows a list of all hits for each reference sequence.

[<filename>] If this condensed list is all you need, you should

deactive the per-read mapping output with "-nomap".

If a valid filename is given after '-hits-per-seq',

the list will be written to a separate file.

This will activate "-queryids" and set the lowest

classification rank to "sequence".

default:off

ADVANCED OPTIONS

-threads <no> use <no> many parallel threads

default: automatic

-batch-size <no> process <no> many reads per thread;

This can be used for performance tuning.

default: automatic

-query-limit <no> Classify at max. <no> reads per file.

This can be used for quick tests.

default: off

-winlen <no> number of letters in each sampling window

default: same as for reference sequences in database

-winstride <no> distance between window starting positions

default: same as for reference sequences in database

-sketchlen <no> number of features per window

default: same as in database

-ground-truth Report correct query taxa if known.

Queries need to have either a 'taxid|<number>' entry in

their header or a sequence id that is also present in

the database.

This will decrease querying speed!

default: off

-precision Report precision & sensitivity

by comparing query taxa (ground truth) and mapped taxa.

Queries need to have either a 'taxid|<number>' entry in

their header or a sequence id that is also found in

the database.

This will decrease querying speed!

default: off

-taxon-coverage Also report true/false positives and true/false negatives.

This option will turn on '-precision'.

This will decrease querying speed!

default: off

-align Show semi-global alignment for best candidate target.

Original files of reference sequences must be available.

This will decrease querying speed!

default: off

-max-load-fac <f> maximum hash table load factor

This can be used to trade off larger memory consumption

for speed and vice versa. A lower load factor will

improve speed, a larger one will improve memory

efficiency.

default: 0.8

EXAMPLES

Query all sequences in 'myreads.fna' against pre-built database 'refseq':

./metacache query refseq myreads.fna -out results.txt

Query all sequences in multiple files against database 'refseq':

./metacache query refseq reads1.fna reads2.fna reads3.fna

Query all sequence files in folder 'test' againgst database 'refseq':

./metacache query refseq test

Query multiple files and folder contents against database 'refseq':

./metacache query refseq file1.fna folder1 file2.fna file3.fna folder2

Perform a precision test and show all ranks for each classification result:

./metacache query refseq reads.fna -precision -allranks -out results.txt

Load database in interactive query mode, then query multiple read batches

./metacache query refseq

reads1.fa reads2.fa -pairfiles -insertsize 400

reads3.fa -pairseq -insertsize 300

reads4.fa -lineage

OUTPUT FORMAT

MetaCache's default read mapping output format is:

read_header | rank:taxon_name

This will not be changed in the future to avoid breaking anyone's

pipelines. Command line options won't change in the near future for the

same reason. The following table shows some of the possible mapping

layouts with their associated command line arguments:

read mapping layout command line arguments

--------------------------------------- ---------------------------------

read_header | taxon_id -taxids-only -omit-ranks

read_header | taxon_name -omit-ranks

read_header | taxon_name(taxon_id) -taxids -omit-ranks

read_header | taxon_name | taxon_id -taxids -omit-ranks -separate-cols

read_header | rank:taxon_id -taxids-only

read_header | rank:taxon_name

read_header | rank:taxon_name(taxon_id) -taxids

read_header | rank | taxon_id -taxids-only -separate-cols

read_header | rank | taxon_name -separate-cols

read_header | rank | taxon_name | taxon_id -taxids -separate-cols

Note that the separator "<tab>|<tab>" can be changed to something else with

the command line option "-separator <text>".

Note that the default lowest taxon rank is "sequence". Sequence-level taxon

ids have negative numbers in order to not interfere with NCBI taxon ids.

Each reference sequence is added as its own taxon below the

lowest known NCBI taxon for that sequence. If you do not want to classify

at sequence-level, you can set a higher rank as lowest classification rank

with the "-lowest" command line option: "-lowest species" or

"-lowest subspecies" or "-lowest genus", etc.

実行方法

別のツール用にRefSeqをダウンロードしていたので、ここではRefSeqを使う。

手動でデータベースを構築する。

#ダウンロードしたgenome/ゲノムfna.gzを解凍
cd genome/ && gunzip * && cd ../
metacache build refseq genome/*fna

refseq.dbができる。テストした際は、Refseq全データでビルドするとエラーになったので、1Mbp以上の配列に限定して実行した。ダウンロードがパーフェクトかチェックしてないが、何らかのファイルが破損していたのかもしれない。

metacacheのコマンドは、シーケンスデータを直接指定するか、fastaやfastqが入ったディレクトリを指定して実行する。

fastaを指定して実行。

metacache query refseq input.fa -out results.txt

fastaやfastqが入ったディレクトリを指定して実行。

metacache query refseq sequence_dir/ -out results.txt

ジョブが終わると、リードごとの分類結果がテキストにまとめられて出力される。データが無ければ、InSilicoSeq（紹介）などのシミュレータを使ってテストしてみて下さい。

引用
MetaCache: context-aware classification of metagenomic reads using minhashing
Müller A, Hundt C, Hildebrandt A, Hankeln T, Schmidt B

Bioinformatics. 2017 Dec 1;33(23):3740-3748

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

小メモリで高速にtaxonomy assignmentを行う metacache