2018-06-11

メタゲノムのリアルタイム分類ツール LiveKraken

　ゲノムシーケンシングデータのリアルタイム解析は、シーケンサがまだ稼動している間にデータを分析できるため、過去数年にわたって特に注目を集めている。しかし、Minionシーケンサーをベースにしたライブ解析アプローチの可能性は、これらのデバイスのスループット・レートとシーケンスクオリティが低いため、依然として限られている。著者らは、HiLive（Lindner et al 2017）を用いて、イルミナのシーケンサーからのハイスループットシーケンシングデータのリアルタイム解析のための最初の方法を提案し、新しい分野のアプリケーションを可能にした。メタゲノム研究の場合、Kraken（Wood and Salzberg 2014）のような分類ツールも、 time-relevantなアプリケーションで使用されている。しかし、これはウエットおよびドライのラボの逐次的なパラダイムに影響され、実行の全体時間の下限はシーケンシングマシンの実行時間でセットされる。

　これらの制限に取り組むために、著者らはKrakenのコアアルゴリズムに基づくリアルタイムtaxonomy分類ツールであるLiveKrakenを提示する。シーケンサーが終了するずっと前に、確立されたツールの結果と同等の結果が得られ、シーケンシングの実行が終了したら直ちにKrakenの結果と同じ結果が保証されることを示す。 LiveKrakenはHiSeqとMiSeqシステム上でテストされており、Krakenと同じくらい堅牢で使いやすい。アプリケーションフィールドは、サンプル組成の制御、汚染の同定、またはリアルタイムでのアウトブレイクの検出にまで及ぶ可能性がある。

　もともと、Krakenは線形のワークフローを持っている（Wood and Salzberg 2014）。シーケンシングされたリードは、FASTAまたはFASTQファイルから読み取られ、続いて事前計算されたデータベースを用いて分類される。シーケンスリードは互いに独立しているため、並列に処理することができる。各リードで見つかったlowest common ancestor （LCA）分類結果は、Krakenの表形式のレポートファイルに書き込まれる。
　このワークフローを生きた分類学分類に適合させるため、HiLive（Lindner et al、2017）のアプローチと同様に、イルミナのバイナリBCLフォーマットからシーケンシングデータを読み取ることができる新しいシーケンスリーダーモジュールを実装した。 LiveKrakenを使用すると、オリジナルのKrakenと同じデータベース構造を使用して、メタゲノムのサンプル構成を継続的に分析し、改良することができる。
イルミナシーケンサは、すべてのリードをサイクルと呼ばれる並行処理で処理し、サイクルごとにすべてのリードに1ベースを追加する。各サイクルで、Basecall（BCL）ファイルはIlluminaのBaseCallsディレクトリに作成され、FASTAまたはFASTQファイルの代わりにLiveKrakenの入力として宣言される。新しいデータは、サイズ指定kの最初のk-merから始まるj個のシーケンシングサイクルのユーザー指定の間隔でBCLシーケンシングリーダーモジュールによって収集される。収集されたデータは分類器に送られ、分類器は新しいシーケンス情報で記憶された部分分類を精緻化する。Krakenの一時的なデータ構造は、LCAリスト、あいまいなヌクレオチドのリスト、およびデータベース中のk-mer発生の数など、各リードについて保存される。これは、各シーケンスリードについて見出されるLCAの数に比例して、メモリ消費の全体的な増加をもたらす。さらに、繰り返し精緻化のために非常に重要なことは、各リードが分類された位置を保持する変数が格納されることである。各精緻化ステップの後、Krakenと同じ表形式の出力が生成される。これにより、早期分類が可能になると同時に、最後のシーケンシングサイクルからデータを読み取った後の分類出力が、Krakenが生産するものと全く同じであることが保証される（論文図1a参照）。
LiveKrakenは、付属のスクリプトinstall_kraken.shを使ってKrakenと同様にインストールできる。 Gcc v4.9.2およびv 7.2.0およびブーストv 1.5.8でテストされている。さらに、Condaパッケージが利用可能である（Grüninget al、2017）。 LiveKrakenはKrakenと同じコマンドラインインターフェイスを使用している。

インストール

centos6でテストした。

本体 GitLab

https://gitlab.com/rki_bioinformatics/LiveKraken

git clone https://gitlab.com/rki_bioinformatics/LiveKraken.git
cd LiveKraken/
./install_kraken.sh $target_directory

混同がないよう、krakenと同じコマンドはlivekrakenという名前にリネームされてビルドされる。

> ./livekraken -h

$ ./livekraken -h

Usage: livekraken [options] <filename(s)>

Options:

--db NAME Name for Kraken DB

(default: none)

--threads NUM Number of threads (default: 1)

--fasta-input Input is FASTA format

--fastq-input Input is FASTQ format

--bcl-input Input is BCL format

--bcl-length NUM Number of sequencing cycles

--bcl-start NUM First analysis cycle (>31)

--bcl-spacing NUM Spacing between classification

--bcl-lanes NUM NUM .. The lanes to analyse (e.g. 1 3 4)

Default: Analyse all lanes found.

--bcl-tiles NUM NUM .. The tiles to analyse (e.g. 1101 2115 2304)

Default: 96 tiles (2 sides -> 3 swaths -> 16 tiles).

--bcl-max-tile NUM Maximum tile to analyse, in XYZZ tile format.

Default: Up to tile 2316 (for 96 tiles.)

If this option is used, --bcl-tiles is ignored.

--gzip-compressed Input is gzip compressed

--bzip2-compressed Input is bzip2 compressed

--quick Quick operation (use first hit or hits)

--min-hits NUM In quick op., number of hits req'd for classification

NOTE: this is ignored if --quick is not specified

--unclassified-out FILENAME

Print unclassified sequences to filename

--classified-out FILENAME

Print classified sequences to filename

--output FILENAME Print output to filename (default: stdout); "-" will

suppress normal output

--only-classified-output

Print no Kraken output for unclassified sequences

--preload Loads DB into memory before classification

--paired The two filenames provided are paired-end reads

--check-names Ensure each pair of reads have names that agree

with each other; ignored if --paired is not specified

--help Print this message

--version Print version information

If none of the *-input or *-compressed flags are specified, and the

file is a regular file, automatic format detection is attempted.

> ./livekraken-build -h

$ ./livekraken-build -h

Usage: livekraken-build [task option] [options]

Task options (exactly one must be selected):

--download-taxonomy Download NCBI taxonomic information

--download-library TYPE Download partial library

(TYPE = one of "bacteria", "plasmids",

"viruses", "human")

--add-to-library FILE Add FILE to library

--build Create DB from library

(requires taxonomy d/l'ed and at least one file

in library)

--rebuild Create DB from library like --build, but remove

existing non-library/taxonomy files before build

--clean Remove unneeded files from a built database

--shrink NEW_CT Shrink an existing DB to have only NEW_CT k-mers

--standard Download and create default database

--upgrade Upgrade an existing older database to use scrambled

minimizer ordering (see README for details)

--help Print this message

--version Print version information

Options:

--db NAME Kraken DB/library name (mandatory except for

--help/--version)

--threads # Number of threads (def: 1)

--new-db NAME New Kraken DB name (shrink task only; mandatory

for shrink task)

--kmer-len NUM K-mer length in bp (build/shrink tasks only;

def: 31)

--minimizer-len NUM Minimizer length in bp (build/shrink tasks only;

def: 15)

--jellyfish-hash-size STR Pass a specific hash size argument to jellyfish

when building database (build task only)

--max-db-size SIZE Shrink the DB before full build, making sure

database and index together use <= SIZE gigabytes

(build task only)

--shrink-block-offset NUM When shrinking, select the k-mer that is NUM

positions from the end of a block of k-mers

(default: 1)

--work-on-disk Perform most operations on disk rather than in

RAM (will slow down build in most cases)

>./classify #ベースコール解析用に追加されたコマンド

$ ./classify

Missing mandatory option -d

Usage: classify [options] <fasta/fastq file(s)>

Options: (*mandatory)

-a output at all classification steps

* -d filename Kraken DB filename

* -i filename Kraken DB index filename

-n filename NCBI Taxonomy nodes file

-o filename Output file for Kraken output

-t # Number of threads

-u # Thread work unit size (in bp)

-q Quick operation

-m # Minimum hit count (ignored w/o -q)

-C filename Print classified sequences

-U filename Print unclassified sequences

-l length Length of reads (BCL mode)

-s start Initial cycle to start classification (>31)

-k spacing How many cycles to wait between classification attempts.

-x tiles Tiles to analyze (can be passed multiple times). Default: all tiles

-y lanes Lanes to analyze (can be passed multiple times). Default: all lanes

-z max_tile Maximum tile number to analyse (Default: 2316). Conflicts with -x.

-f Input is in FASTQ format

-b Input is in BCL format

-c Only include classified reads in output

-M Preload database files

-h Print this message

At least one FASTA or FASTQ file or a BaseCalls (BCL) folder must be specified.

Kraken output is to standard output by default.

./visualisation/livekraken_sankey_diagram.py -h

$ ./visualisation/livekraken_sankey_diagram.py -h

usage: livekraken_sankey_diagram.py [-h] -i INFILE [-c] [-s]

[-r {species,genus,family,order}] [-t TOP]

[-o OUTPUT] [-m NAMES] [-n NODES]

This tool creates sankey plots from LiveKraken output.

optional arguments:

-h, --help show this help message and exit

-i INFILE, --infile INFILE

Used to list input files, can be used several times to

input an ordered list of files

-c, --color Used to switch from red-green to red-blue color scheme

-s, --compress Used to "compress" unclassified nodes by only keeping

a number of reads corresponding to the sum of flows

from/to nodes other than unclassified.

-r {species,genus,family,order}, --rank {species,genus,family,order}

Used to set on which level to bin the classified reads

-t TOP, --top TOP Used to determine the top x nodes to display for every

cycle (plus one node serving as bin for everyting

else)

-o OUTPUT, --output OUTPUT

Used to set the output directory path for the html and

json file

-m NAMES, --names NAMES

Used to set the path to the names.dmp for taxonomic

resolution

-n NODES, --nodes NODES

Used to set the path to the nodes.dmp for taxonomic

resolution

LiveKrakenはcondaでも導入できる（リンク）。mac osxも対応している。

ラン

--データベースの準備--

すでにkrakenを導入しておりデータベースがあるなら飛ばしてよい。

データベースがない場合、ダウンロードする必要がある。ここではdatabaseというディレクトリ名にダウンロードする。livekraken-buildの34-37行目がkraken-buildになっていたので、実行前にlivekraken-buildに書き換えた。実行する。

livekraken-build --standard --db database --threads 40

database/にRefseqの全データとtaxonomy情報がダウンロードされ、40スレッド使いビルドされる。環境によってはかなりの時間がかかる。

bacteria、viruses、plasmids、humanのいずれかだけデータベースを作るなら、以下のようになる。taxonomyファイルをダウンロードし、それから必要なデータをダウンロード&ビルドする。

#taxonomyデータ
livekraken-build --download-taxonomy --db database

#データ
livekraken-build --download-library bacteria --db database

#さらにユーザー指定のfastaを追加することもできる。数がある時は、公式のようにfindとxargsをつなぎ自動化する
livekraken-build --add-to-library chr1.fa --db database

#ビルド
livekraken-build --build --db database --threads 40

bacteriaはkraken同様、URLが変更されていてダウンロードできなかった。NCBI FTPサーバー（ftp://ftp.ncbi.nlm.nih.gov/genomes/）のarchive/old_refseq/Bacteria/all.fna.tar.gzに変更した。

”bacteria”をダウンロードするには、43行目の

wget $FTP_SERVER/genomes/Bacteria/all.fna.tar.gz

　↓

wget $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz

に修正して実行する。他のデータベースの配列を使うことも可能（kaijuの例）。wgetでダウンロードし、上に書いた"--add-to-library"フラグを立てて追加する。

データベースの内容はkrakenから変わっていない。

database.kdb: Contains the k-mer to taxon mappings
database.idx: Contains minimizer offset locations in database.kdb
taxonomy/nodes.dmp: Taxonomy tree structure + ranks
taxonomy/names.dmp: Taxonomy names

--taxonomyラベリング--

fastqを分析する。gz圧縮fastqを使う時は--gzip-compressedフラグを立てる。

livekraken --preload --threads 20 --fastq-input --paired --db database pair1.fq pair2.fq --output output

--preload Loads DB into memory before classification
--fasta-input Input is FASTA format
--fastq-input Input is FASTQ format
--paired The two filenames provided are paired-end reads
--gzip-compressed Input is gzip compressed
--bzip2-compressed Input is bzip2 compressed

学名に変換。superkingdom, kingdom, phylum, class, order, family, genus, species表記になる。

livekraken-translate --mpa-format --db database output > sequences.labels

集計レポートを作る。

livekraken-report --db database output > summary

以上の作業はKrakenから変更がない。

--realtime classification--

LiveKrakenではbasecallを直接分析、分類するclassifyコマンドが実装されている（解析の時系列は論文の図１参照 pubmed）。

サポートされているシーケンサー

Miseq
HiSeq 1500
HiSeq 2000/2500/3000/4000/X (untested)

以下のコマンドはテストしてません。注意してください。

./classify -d database/ -i database/database.idx -n database/taxonomy/nodes.dmp -b basecall_dir/ -o classify_output

-b used for BCL files, the input path should be the BaseCalls folder generated by Illumina sequencers instead of a file as in the case of FASTA/FASTQ input.
-l final number of cycles so that the software knows when it can stop waiting for additional BCL files to be generated
-s wait for this many cycles to accumulate before starting classification. Note that this should be greater than 31, which is the k-mer size used by Kraken
-k the number of cycles to wait before another incremental classification occurs
-x which tiles on the flowcell to analyze (default is all). Example: -x 1101 -x 1315 -x 2108
-y which of the lanes (usually 8 lanes) to analyze (default is all). Example: -y 1 -y 6
-a whether to output intermediate results at the incremental classification steps. This would make sense if you want to monitor the sequence composition during sequencing.
-f if input is in FASTQ format

出力は付属ツールで画像にできる。

./visualisation/livekraken_sankey_diagram.py -i classify_output

リアルタイム解析を可能にするHiLive（pubmed）も紹介したかったのですが、 HiLiveは依存ライブラリSeqAnの大量コンパイルの後のhiliveのビルドでつまづき時間を浪費しそうだったので一旦諦めました。dockerコンテナはあるようですが（私の環境でちゃんとテストするには、シーケンスを行いbasecallをリアルタイムでLinuxサーバーに送りテストする必要があり、とっても手間がかかります。臨床分野の人なら24時間シーケンスしてるのかもしれませんが...）。

ジョンズホプキンス大学 Center for Computational Biology（CCB）メタゲノム分析ツールには、下記のツールがあります。いくつかは紹介しています。

http://ccb.jhu.edu/software.shtml

Kraken

Centrifuge

Pavian

Bracken (Bayesian Reestimation of Abundance with KrakEN)

KrakenHLL

引用

LiveKraken - Real-time metagenomic classification of Illumina data.

Tausch SH, Strauch B, Andrusch A, Loka TP, Lindner MS, Nitsche A, Renard BY.

Bioinformatics. 2018 Jun 1.