StrainSeeker: detecting bacteria at the strain level

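Detecting pathogenic bacteria requires rapid identification of the bacterial pathogen. To this end, the pathogen is usually isolated and then subjected to PCR or whole-genome sequencing (WGS). One of the main goals of molecular typing is to assign pathogens to clonal groups, because lineages of the same species can have very different effects on the host. Well-known examples are strains such as Escherichia coli O157:H7 (Tu, He & Zhou, 2014) and E. coli EC958 (Petty et al., 2014). For classification, MLST (Maiden, 2006) or clone-specific markers are used (Inouye et al., 2014). Several approaches that can detect relevant mutations and alleles directly from WGS data have been developed, such as KvarQ (Steiner et al., 2014), Mykrobe (Bradley et al., 2015) and SRST2 (Inouye et al., 2014), but in most cases they require sufficient coverage and a specialized allele database (for example, Mykrobe can only be used to identify Mycobacterium tuberculosis and Staphylococcus aureus). Relying on such programs complicates the strain identification process.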

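Instead of searching for clone-specific markers, complete bacterial genomes can be used as reference sequences. Bacterial identification programs based on detecting k-mers (sequences of length k), such as Kraken (Wood & Salzberg, 2014) and CLARK (Ounit et al., 2015), use the entire RefSeq bacterial genome database and classify each read individually, so they can also handle low-coverage WGS samples. Compared with alignment-based tools such as Sigma (Ahn, Chai & Pan, 2015), k-mer-based programs have been shown to perform better, particularly in terms of running time (Lindgreen, Adair & Gardner, 2016; Peabody et al., 2015). Kraken uses the NCBI taxonomy tree: it classifies reads separately, counts the hits to each taxon on the tree, and finds the branch with the most hits (see the earlier Kraken post).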
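To make the idea of per-read k-mer classification concrete, here is a minimal sketch in Python. It is not how Kraken or CLARK are implemented (Kraken, for instance, works on a taxonomy tree rather than a flat list of references); it only shows the basic step of counting k-mer hits per reference for a single read. The reference names, sequences and the tiny k are made up for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of reference names that contain it."""
    index = {}
    for name, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(name)
    return index


def classify_read(read, index, k):
    """Return the reference with the most k-mer hits for this read, or None."""
    hits = Counter()
    for kmer in kmers(read, k):
        for name in index.get(kmer, ()):
            hits[name] += 1
    if not hits:
        return None  # no k-mer of the read is present in any reference
    return hits.most_common(1)[0][0]


# Toy references (hypothetical data, far shorter than real genomes)
references = {"refA": "ACGTACGTGGCC", "refB": "TTTTGGGGCCCC"}
index = build_index(references, k=4)
```

Because every read is scored on its own, the approach still works when only a few reads cover each genome, which is the low-coverage property mentioned above.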

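StrainSeeker assigns bacterial isolates to clones or clades directly from WGS data, based on the number of k-mers shared at different taxonomic levels. It uses a guide tree that approximates the phylogenetic relationships among the reference bacterial genomes and lineage levels, and this tree is not tied to an existing taxonomy system such as the NCBI taxonomy. This helps avoid controversies such as the one over E. coli and Shigella sp. (the guide tree must be supplied by the user).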
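The notion of k-mers shared at a given level of a guide tree can be sketched as follows. This is one plausible reading for illustration only, not StrainSeeker's actual algorithm: a node is given the k-mers that occur in every strain below it and in no strain outside it. The nested-tuple tree, the strain names and the 3-mers are all hypothetical.

```python
def leaves(node):
    """Return the leaf (strain) names under a node of a nested-tuple tree."""
    if isinstance(node, str):
        return [node]
    names = []
    for child in node:
        names.extend(leaves(child))
    return names


def node_specific_kmers(node, kmer_sets):
    """K-mers present in every strain under node and in no strain outside it."""
    inside = leaves(node)
    shared = set.intersection(*(kmer_sets[name] for name in inside))
    for name, kmer_set in kmer_sets.items():
        if name not in inside:
            shared -= kmer_set
    return shared


# Hypothetical guide tree: s1 and s2 form a clade, s3 is its sister
tree = (("s1", "s2"), "s3")
kmer_sets = {
    "s1": {"AAA", "CCC", "GGG"},
    "s2": {"AAA", "CCC", "TTT"},
    "s3": {"AAA", "ACG"},
}
```

In this toy case "CCC" marks the (s1, s2) clade, "GGG" marks strain s1 alone, and "AAA" is only informative at the root, so a sample carrying "CCC" but neither "GGG" nor "TTT" would point to the clade rather than to a specific strain.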

Manual

http://bioinfo.ut.ee/strainseeker/index.php?r=site/page&view=manual

Web tool

http://bioinfo.ut.ee/strainseeker/index.php?r=site/webtool

Installation

Installed on CentOS.

Dependencies

Perl
R

Builder

GenomeTester4 (GlistMaker, GlistCompare, GlistQuery)

Seeker

GenomeTester4 (GlistMaker, GlistCompare, GlistQuery, GDistribution)
R scripts for statistical tests

Installing GenomeTester4 was covered in an earlier post (link).
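Before running anything, it can help to confirm that the listed dependencies are reachable. The sketch below assumes the tools are expected on PATH and that R is invoked through Rscript; both are assumptions on my part, so adjust the list if your setup keeps the GenomeTester4 binaries elsewhere.

```python
import shutil

# Tool names taken from the dependency list above; "Rscript" is an assumption
REQUIRED_TOOLS = [
    "perl", "Rscript",
    "GlistMaker", "GlistCompare", "GlistQuery", "GDistribution",
]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All dependencies found.")
```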

Downloading the main scripts and helper scripts

http://bioinfo.ut.ee/strainseeker/index.php?r=site/page&view=downloadable

Place the main scripts builder.pl and seeker.pl in the same directory as the extracted helper scripts.

$ perl builder.pl -h

Usage: builder.pl -n <NWK FILE> -d <DIR OF FASTA FILES> -o <USER DEFINED DB NAME> [OPTIONAL PARAMETERS]

Options:

-h, --help - Print this help

-v, --version - Print version of the program

-n, --newick - Guide tree in newick format (same names as fasta files without suffix .fna)

-d, --dir - Directory of fasta files (.fna)

-o, --output - User defined database name

-b, --blacklist - .list file of k-mers unwanted in database (human, plasmids etc)

-w, --word - K-mer length used in database building and later searching (default 32)

-m, --min - Minimal amout of k-mers in node to be considered as subroot (default 250)

-g, --greater - Maximum times child could have more k-mers than parent (default 250)

-t, --threads - Number of cores used

-max - Maximum number of k-mers in one list (default 100000)

$ perl seeker.pl -h

Usage: seeker.pl -d <DB DIR NAME> -i <SAMPLE.fastq> [OPTIONAL PARAMETERS]

Options:

-h, --help - Print this help

-v, --version - Print version of the program

-i, none - Input file (can be multiple, each with own flag)

-o, --output - Output file name (default StrainSeeker_output)

-d, --dir - Path to database directory

-verbose - Print out more of the working process

Run

1. Build the database (requires roughly 300 GB of free disk space)

perl builder.pl -n refseq_guide_tree.nwk -d strain_fasta_directory -w 32 -b ss_blacklist_w32.list -o my_database

-n is the guide tree in Newick format, describing the relationships between given strains.
-d is a directory containing all the .fna files for strains used in the Newick file.
-b is the path to blacklist (must have the same k-mer length as parameter -w).
-w is the k-mer length.
-o is the user-defined database name.
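Since the help text says the guide tree must use the same names as the .fna files (without the suffix), a quick consistency check before a long build may save time. The sketch below is a simplified helper of my own, not part of StrainSeeker: it extracts leaf names from a Newick string with a regular expression and does not handle quoted names or names containing parentheses, commas, colons or whitespace.

```python
import re


def newick_leaf_names(newick_text):
    """Extract leaf names: tokens that directly follow '(' or ','."""
    return set(re.findall(r"(?<=[(,])\s*([^(),:;\s]+)", newick_text))


def compare_tree_and_fasta(newick_text, fasta_file_names):
    """Return (leaves without a .fna file, .fna files missing from the tree)."""
    tree_names = newick_leaf_names(newick_text)
    fasta_names = {
        name[:-len(".fna")] for name in fasta_file_names if name.endswith(".fna")
    }
    return sorted(tree_names - fasta_names), sorted(fasta_names - tree_names)
```

A possible use is to pass the contents of refseq_guide_tree.nwk together with `os.listdir("strain_fasta_directory")`; two empty lists mean the tree and the directory agree.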

Here we use an existing database. From the download page (link), download either the 32-mer database, "Database - pre-built from 4,324 NCBI RefSeq strains (w32)", or the 16-mer database, "Database - pre-built from 4,324 NCBI RefSeq strains (w16)".

For the sequence data, we use the fastq files that the authors prepared following the evaluation paper (link). They can be downloaded from this link.

Run the search.

perl seeker.pl -i sample.fastq -d ss_db_w32 -o sample_result.txt -verbose
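According to the help text, seeker.pl accepts several input files as long as each one has its own -i flag. A small wrapper like the following could assemble such a command; the function names and the example file names are hypothetical, and only flags shown in the help output above are used.

```python
import subprocess


def seeker_command(db_dir, inputs, output, verbose=False):
    """Build the seeker.pl argument list, repeating -i for every input file."""
    cmd = ["perl", "seeker.pl", "-d", db_dir]
    for path in inputs:
        cmd += ["-i", path]
    cmd += ["-o", output]
    if verbose:
        cmd.append("-verbose")
    return cmd


def run_seeker(db_dir, inputs, output, verbose=False):
    """Run seeker.pl and raise CalledProcessError if it exits with an error."""
    subprocess.run(seeker_command(db_dir, inputs, output, verbose), check=True)
```

For example, `run_seeker("ss_db_w32", ["sample_R1.fastq", "sample_R2.fastq"], "sample_result.txt", verbose=True)` would pass both files in a single call. Passing the arguments as a list avoids shell quoting problems with unusual file names.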

Checking the output (excerpt).

$ cat sample_result.txt

Sample:sample_result.txt

22.66893% KNOWN Streptococcus_mitis_B6

17.53777% KNOWN Streptococcus_pseudopneumoniae_IS7493

17.02426% KNOWN Streptococcus_pneumoniae_Hungary19A-6

10.73490% RELATED Lactobacillus_helveticus_strain_KLDS18701,Lactobacillus_kefiranofaciens_ZW3,Lactobacillus_helveticus_strain_CAUH18,Lactobacillus_helveticus_H10,Lactobacillus_helveticus_DPC_4571,Lactobacillus_helveticus_CNRZ32,Lactobacillus_helveticus_R0052,Lactobacillus_helveticus_strain_MB2-1,Lactobacillus_helveticus_H9
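For downstream processing, the result file can be read into simple records. The parser below is a sketch that assumes every result line has the three-field layout seen in the excerpt (percentage, status, comma-separated strain names); lines that do not fit, such as the "Sample:" header, are skipped. The real output may contain other line types that this does not cover.

```python
def parse_result(text):
    """Parse StrainSeeker-style result lines into a list of dicts."""
    records = []
    for line in text.splitlines():
        parts = line.split(None, 2)  # percentage, status, strain list
        if len(parts) != 3 or not parts[0].endswith("%"):
            continue  # header, blank line, or unknown layout
        percent, status, names = parts
        records.append({
            "percent": float(percent.rstrip("%")),
            "status": status,
            "strains": names.split(","),
        })
    return records


example = """Sample:sample_result.txt

22.66893% KNOWN Streptococcus_mitis_B6

10.73490% RELATED Lactobacillus_helveticus_H10,Lactobacillus_helveticus_H9
"""
records = parse_result(example)
```

With records in this form it is straightforward to filter on the status field or to sort by percentage.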

There is also a web version, although uploads are limited to 500 MB.

Web version (database of about 4,300 bacteria)

http://bioinfo.ut.ee/strainseeker/index.php?r=site/webtool

Test results

http://bioinfo.ut.ee/strainseeker/index.php?r=site/results&id=demoresults

http://bioinfo.ut.ee/strainseeker/demo/demoresults/setA1_bacarc_5feb_11min.txt

Analyzing the data above gave the following results.

[Figure: analysis results]