2018-06-02

Neptune: rapid detection of signatures in bacterial genomes of interest

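The ability to generate large amounts of sequence data cheaply and quickly has made it possible to study the whole genomes of organisms, particularly those with relatively small genomes such as bacteria. Computational biologists have historically used a wide range of bioinformatics software tools to compare small numbers of bacterial genomes and to perform basic characterization at the nucleotide, gene, and genome scales. However, there is now a need for bioinformatics software that can efficiently compare and characterize whole bacterial genomes. Several tools have emerged. Most of them focus on identifying single-nucleotide variants (SNVs) using a reference mapping approach (refs. 1 (CrossRef), 2 (PubMed) in the paper) or on distance estimation based on exact substrings (k-mers) (refs. 3-5; ref. 5 PubMed), and they scale with simple parallelization strategies. Microbial genome-wide association studies (GWAS), which analyze populations of microbial genomes to correlate genomic features with phenotypic traits, have become feasible thanks to recent mathematical methods that address issues such as long-range linkage disequilibrium and clonal population structure (ref. 6). Several software tools for bacterial GWAS that associate SNVs or k-mers with biological traits have been developed (ref. 7 PubMed). For bacterial GWAS, however, it is important to identify all modes of bacterial genomic variation, including larger-scale genomic gains and losses, especially because most bacteria engage in horizontal gene transfer to acquire novel biological traits. Scalable software that can rapidly extract large genomic loci that distinguish one population from another, while tolerating allelic variation within those loci, would be valuable for bacterial GWAS and useful for many other applications, such as the development of targeted molecular diagnostics.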

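To address this challenge, the authors looked to the field of genomic signature discovery. A signature here is defined as a sequence that can distinguish a group of sequences of interest from a background group of sequences. Signatures may lie in genic or intergenic regions and may correspond to genomic islands, phage regions, or entire operons. However, a signature need not contain functionally meaningful content; it only has to discriminate effectively between the two groups by its sequence. An effective signature discovery algorithm is highly sensitive, highly specific, and fast to compute. In practice, however, developing an algorithm with all three attributes remains difficult. Early algorithmic approaches to signature discovery were developed for the specific purpose of generating pathogen detection diagnostic assays (ref. 8). In general, these approaches exhaustively compare all sequences using alignment-based methods such as BLAST (ref. 9) to find signature regions in the inclusion group that are absent from the exclusion group. However, these approaches do not scale efficiently and focus on generating fixed-length molecular diagnostic primers.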
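As a rough illustration of the definition above, the sketch below scores a candidate signature by the fraction of inclusion genomes that contain it (sensitivity) and the fraction of exclusion genomes that lack it (specificity). It assumes exact substring matching on toy sequences; the function name and the data are hypothetical, and this is not how any of the tools discussed here is implemented.

```python
def signature_quality(signature, inclusion, exclusion):
    """Return (sensitivity, specificity) of a candidate signature.

    Assumption: a genome "contains" the signature only on an exact
    substring match, which is stricter than real tools need to be.
    """
    in_hits = sum(signature in genome for genome in inclusion)
    ex_hits = sum(signature in genome for genome in exclusion)
    sensitivity = in_hits / len(inclusion)
    specificity = 1 - ex_hits / len(exclusion)
    return sensitivity, specificity


# Toy genomes (hypothetical data).
inclusion = ["AAACGTGGTT", "CCACGTGGAA", "TTACGTGGCC"]
exclusion = ["AAATTTCCCG", "GGGACGTCCA"]

# Present in every inclusion genome and in no exclusion genome.
print(signature_quality("ACGTGG", inclusion, exclusion))
# A shorter candidate also occurs in one background genome, so it is less specific.
print(signature_quality("ACGT", inclusion, exclusion))
```

A sequence with no biological function can still score well here, which matches the point that a signature only has to separate the two groups.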

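Other, more sophisticated approaches use computationally optimized string processing: they encode fixed-size substrings from the genomes in rapidly searchable data structures and then analyze those structures for unique substrings (ref. 10). These methods are very fast and scale well, but they cannot accommodate variability in the target sequences and are artificially limited to fixed-length signatures. Target variability can be handled by grouping similar sequences using multiple sequence alignment (ref. 8) or other clustering operations (ref. 11). However, these general clustering techniques carry a high computational cost and do not scale well. Some algorithms add a data reduction step before clustering to cut unnecessary computation. For example, Insignia (ref. 10 PubMed), TOFI (ref. 12 CrossRef), and TOPSI (ref. 13 CrossRef) use efficient suffix trees to precompute exact matches within the inclusion targets and the exclusion background. Depending on the size of the background database, however, this can remain a computationally expensive operation. One interesting newer implementation is CaSSiS (ref. 11 PubMed), which approaches the signature discovery problem more thoroughly than other signature discovery pipelines. It generates signatures simultaneously for all positions in a hierarchically clustered dataset, such as a phylogenetic tree, and thereby produces candidate signatures for every possible subgroup. However, this process requires the input data to be supplied in a hierarchically clustered form such as a phylogenetic tree, which is computationally expensive to produce. In addition to the trade-off between efficiency and sensitivity, most of the signature discovery programs described so far have shortcomings that make them unsuitable for identifying common variation among genome populations: they may restrict the analysis to a single inclusion genome (ref. 12), not allow user-supplied genomes for target identification (ref. 10), or not make the software available to end users (ref. 8).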
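The fixed-length substring approach described above can be sketched with plain Python sets standing in for the "rapidly searchable data structure". This is only a toy under that assumption (per the text, tools such as Insignia use suffix trees), and the function names and sequences are hypothetical. It also shows the limitation mentioned above: a single substitution in one target removes every k-mer that overlaps it.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def unique_kmers(inclusion, exclusion, k):
    """k-mers found in every inclusion genome and in no exclusion genome."""
    shared = set.intersection(*(kmers(g, k) for g in inclusion))
    background = set().union(*(kmers(g, k) for g in exclusion))
    return shared - background


exclusion = ["TTACGTAAAA"]

# Two targets sharing the region ACGTGGC (hypothetical data).
print(sorted(unique_kmers(["TTACGTGGCA", "GGACGTGGCT"], exclusion, 4)))

# Same targets, but the second one carries one substitution inside that region.
print(sorted(unique_kmers(["TTACGTGGCA", "GGACGTTGCT"], exclusion, 4)))
```

In the second call no 4-mer survives, even though the two targets still differ at only one position in the shared region.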

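The authors designed Neptune as a system for discovering discriminating bacterial sequence signatures and for performing comparative analysis of arbitrary groups of genomic sequences in an efficient and accurate way. Neptune identifies genomic loci that are shared among a user-specified group of sequences of interest but absent from a background group. Without relying on precomputation, restrictions on targets, or slow clustering approaches, Neptune uses a reference-based, parallelized exact-match k-mer strategy for speed while tolerating inexact matches to improve sensitivity. Neptune's signature discovery is guided by a probabilistic model that makes decisions using statistical measures of confidence. Neptune is free, open-source software available at github.com/phac-nml/neptune and is broadly applicable to the rapid comparative assessment of bacterial populations.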
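To see why an exact-match k-mer strategy needs some tolerance to stay sensitive, consider the chance that a k-mer survives unchanged. Assuming each position changes independently with probability `rate` (the help output later in this post lists 0.01 as the default for --rate), a k-mer matches exactly with probability (1 - rate)^k. The sketch below evaluates this for a few k values. The independence assumption and the chosen k values are illustrative assumptions made here, not Neptune's actual model.

```python
def exact_match_probability(k, rate):
    """Probability that all k positions of a k-mer are unchanged,
    assuming independent per-position changes with probability `rate`."""
    return (1.0 - rate) ** k


# Illustrative k values only; they are not taken from Neptune's defaults.
for k in (11, 21, 31):
    print(k, round(exact_match_probability(k, 0.01), 3))
```

Longer k-mers are more specific but more likely to be broken by a single mutation or sequencing error, which is the trade-off that parameters such as --gap and --confidence appear to address.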

Figure 1. Overview. Reproduced from the paper.

Installation

Tested with Python 2.7.12 :: Anaconda 4.2.0 on macOS 10.13.

Dependencies

Source: GitHub

https://github.com/phac-nml/neptune

If you have not installed Miniconda and Bioconda yet, see the official manual (link).

Install with conda.

conda install neptune

$ neptune -h

usage: neptune-conda -i INCLUSION [INCLUSION ...] -e EXCLUSION
                     [EXCLUSION ...] -o OUTPUT

Neptune identifies signatures using an exact k-mer matching strategy. Neptune
locates sequence that is sufficiently present in many inclusion targets and
sufficiently absent from exclusion targets.

optional arguments:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit

REQUIRED:
  -i INCLUSION [INCLUSION ...], --inclusion INCLUSION [INCLUSION ...]
                        The inclusion targets in FASTA format.
  -e EXCLUSION [EXCLUSION ...], --exclusion EXCLUSION [EXCLUSION ...]
                        The exclusion targets in FASTA format.
  -o OUTPUT, --output OUTPUT
                        The directory to place all output.

KMERS:
  -k KMER, --kmer KMER  The size of the k-mers.
  --organization ORGANIZATION
                        The degree of k-mer organization in the output files.
                        This exploits the four-character alphabet of
                        nucleotides to produce several k-mer output files,
                        with all k-mers in a file beginning with the same
                        short sequence of nucleotides. This parameter
                        determines the number of nucleotides to use and will
                        produce 4^X output files, where X is the number of
                        nucleotides specified by this parameter. The number of
                        output files directly corresponds to the amount of
                        parallelization in the k-mer aggregation process.

FILTERING:
  --filter-percent FILTER-PERCENT
                        The maximum percent identity of a candidate signature
                        with an exclusion hit before discarding the signature.
                        When both the filtered percent and filtered length
                        limits are exceed, the signature is discarded.
  --filter-length FILTER-LENGTH
                        The maximum shared fractional length of an exclusion
                        target alignment with a candidate signature before
                        discarding the signature. When both the filtered
                        percent and filtered length limits are exceed, the
                        signature is discarded.
  --seed-size SEED-SIZE
                        The seed size used during alignment.

EXTRACTION:
  -r REFERENCE [REFERENCE ...], --reference REFERENCE [REFERENCE ...]
                        The FASTA reference from which to extract signatures.
  --reference-size REFERENCE-SIZE
                        The estimated total size in nucleotides of the
                        reference. This will be calculated if not specified.
  --rate RATE           The probability of a mutation or error at an arbitrary
                        position. The default value is 0.01.
  --inhits INHITS       The minimum number of inclusion targets that must
                        contain a k-mer observed in the reference to begin or
                        continue building candidate signatures. This will be
                        calculated if not specified.
  --exhits EXHITS       The maximum allowable number of exclusion targets that
                        may contain a k-mer observed in the reference before
                        terminating the construction of a candidate signature.
                        This will be calculated if not specified.
  --gap GAP             The maximum number of consecutive k-mers observed in
                        the reference during signature candidate construction
                        that fail to have enough inclusion hits before
                        terminating the construction of a candidate signature.
                        This will be calculated if not specified and is
                        determined from the size of k and the rate.
  --size SIZE           The minimum size of all reported candidate signatures.
                        Identified candidate signatures shorter than this
                        value will be discard.
  --gc-content GC-CONTENT
                        The average GC-content of all inclusion and exclusion
                        targets. This will be calculated from inclusion and
                        exclusion targets if not specified.
  --confidence CONFIDENCE
                        The statistical confidence level in decision making
                        involving probabilities when producing candidate
                        signatures.

PARALLELIZATION:
  -p PARALLELIZATION, --parallelization PARALLELIZATION
                        The number of processes to run simultaneously. Note
                        that this is only applicable when running Neptune in
                        non-DRMAA mode (default).

DRMAA:
  --drmaa               Whether or not to run Neptune in DRMAA-mode and
                        attempt to schedule jobs through DRMAA. This will
                        require setting up DRMAA in advance.
  --default-specification DEFAULT-SPECIFICATION
                        The default DRMAA parameters.
  --count-specification COUNT-SPECIFICATION
                        The DRMAA parameters specific to k-mer counting.
  --aggregate-specification AGGREGATE-SPECIFICATION
                        The DRMAA specific parameters specific to k-mer
                        aggregation.
  --extract-specification EXTRACT-SPECIFICATION
                        The DRMAA parameters specific to candidate signature
                        extraction.
  --database-specification DATABASE-SPECIFICATION
                        The DRMAA parameters specific to database construction
                        and querying.
  --filter-specification FILTER-SPECIFICATION
                        The DRMAA parameters specific to candidate signature
                        filtering.
  --consolidate-specification CONSOLIDATE-SPECIFICATION
                        The DRMAA parameters specific to filtered signature
                        consolidation.

——
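The EXTRACTION options in the help text above suggest how candidate signatures are built: walk along the reference k-mer by k-mer, start or continue a candidate while a k-mer is seen in at least --inhits inclusion targets, stop when a k-mer is seen in more than --exhits exclusion targets or when more than --gap consecutive k-mers lack inclusion support, and drop candidates shorter than --size. The sketch below is a simplified reading of those descriptions, not Neptune's implementation. It uses fixed thresholds instead of the calculated, probability-based ones, and all function names and sequences are hypothetical.

```python
def kmer_set(sequence, k):
    """Set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def extract_candidates(reference, inclusion, exclusion, k,
                       inhits, exhits, gap, size):
    """Return (start, sequence) candidates found along the reference."""
    in_sets = [kmer_set(g, k) for g in inclusion]
    ex_sets = [kmer_set(g, k) for g in exclusion]
    candidates = []
    start = last_good = None  # open candidate and its last supported k-mer
    misses = 0                # consecutive k-mers lacking inclusion support
    n_positions = len(reference) - k + 1

    # One extra iteration acts as a sentinel that closes any open candidate.
    for pos in range(n_positions + 1):
        if pos < n_positions:
            kmer = reference[pos:pos + k]
            n_in = sum(kmer in s for s in in_sets)
            n_ex = sum(kmer in s for s in ex_sets)
            too_common = n_ex > exhits
            supported = n_in >= inhits and not too_common
        else:
            too_common, supported = True, False

        if supported:
            if start is None:
                start = pos
            last_good, misses = pos, 0
            continue
        if start is None:
            continue
        misses += 1
        if too_common or misses > gap:
            sequence = reference[start:last_good + k]
            if len(sequence) >= size:
                candidates.append((start, sequence))
            start = last_good = None
            misses = 0
    return candidates


# Toy data: the second inclusion genome has one substitution in the region.
reference = "GGGGACGTCATGCATTTT"
inclusion = ["CCACGTCATGCACC", "ATACGTGATGCAAT"]
exclusion = ["GGGGTTTTGGGG"]

print(extract_candidates(reference, inclusion, exclusion,
                         k=4, inhits=2, exhits=0, gap=4, size=8))
```

With k=4 the single substitution breaks four consecutive k-mers, so a gap of 4 bridges it and the whole region is reported, while a gap of 3 splits the region into pieces that fall below the size threshold.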

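Running

To run Neptune, supply FASTA files for the target genomes (inclusion) and for the genomes to exclude (exclusion), either as individual files or as directories.

Run the test data from the GitHub repository (link). Features that are present in the FASTA files in inclusion_dir/ and absent from those in exclusion_dir/ are detected automatically, and the consolidated result is written to output/consolidated/consolidated.fasta.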

neptune -p 12 --organization 3 --inclusion inclusion_dir/ --exclusion exclusion_dir/ --output output/

-i A list of inclusion targets in FASTA format. You may list multiple file or directory locations following the parameter. Neptune will automatically include all files within directories. However, Neptune will not recurse into additional directories.
-e A list of exclusion targets in FASTA format. You may list multiple file or directory locations following the parameter. Neptune will automatically include all files within directories. However, Neptune will not recurse into additional directories.
-o The location of the output directory. If this directory exists, any files produced with existing names will be overwritten. If this directory does not exist, then it will be created.
-p The number of processes to run simultaneously. Note that this is only applicable when running Neptune in non-DRMAA mode (default).
--organization The degree of k-mer organization in the output files. This exploits the four-character alphabet of nucleotides to produce several k-mer output files, with all k-mers in a file beginning with the same short sequence of nucleotides. This parameter determines the number of nucleotides to use and will produce 4^X output files, where X is the number of nucleotides specified by this parameter. The number of output files directly corresponds to the amount of parallelization in the k-mer aggregation process.
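The --organization description above can be illustrated with a small sketch: k-mers are grouped by their first X nucleotides, giving 4^X groups that could each be handled by a separate process. Here each dictionary entry stands in for one output file. How Neptune actually names and writes these files is not shown in this post, so the function name and the data are hypothetical, and the sketch assumes k-mers contain only A, C, G, and T.

```python
from itertools import product


def prefix_buckets(kmers, organization):
    """Group k-mers by their first `organization` nucleotides."""
    # One (possibly empty) bucket per prefix: 4**organization in total.
    buckets = {"".join(p): [] for p in product("ACGT", repeat=organization)}
    for kmer in kmers:
        buckets[kmer[:organization]].append(kmer)
    return buckets


buckets = prefix_buckets(["ACGTA", "ACGGG", "TTTAC"], 3)
print(len(buckets))    # 64, i.e. 4**3, matching --organization 3 in the command above
print(buckets["ACG"])  # ['ACGTA', 'ACGGG']
```

Because every k-mer with a given prefix lands in the same bucket, counts for that prefix can be aggregated without looking at any other bucket.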

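List of output files.

[Figure: screenshot listing the output files]

Details are explained in the manual (link).

The sorted signatures from all references are consolidated and written to a single consolidated.fasta file. This filtered and sorted file is Neptune's final output.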

$ cat output/consolidated/consolidated.fasta

>1.0 score=1.0000 in=1.0000 ex=0.0000 len=103 ref=inclusion2 pos=99

TAGTCTCCAGGATTCCCGGGGCGGTTCAGATAATCTTAGCATTGACCGCCTTTATATAGAAGCTGTTATTCAAGAAGCATTTTCAAGCAGTGATGTAAGAAAA

>1.1 score=0.9979 in=0.9979 ex=0.0000 len=640 ref=inclusion2 pos=3494

CGCGGGCGATATTTTCACAGCCATTTTCAGGAGTTCAGCCATGAACGCTTATTACATTCAGGATCGTCTTGAGGCTCAGAGCTGGGAGCGTCACTACCAGCAGATCGCCCGTGAAGAGAAAGAGGCAGAACTGGCAGACACATGGAAAAAGGCCTGCCCCAGCACCTGTTTTGAATCGCTATGCATCGATCATTTGCAACGCCACGGGGCCAGCAAAAAAGCCATTACCCGTGCGTTTGATGACGATGTTGAGTTTCAGGAGCGCATGGCAGAACACATCCGGTACTGGTTAAACCATTGCTCACCACCAGGTTGATATTGATTCAGAGGTATAAAACGAATGAGTACAGCACTCGCAACGCTGGCAGGGAAGCTGGCTGAACGTGTCGGCATGGATTCTGTCGACCCACAGGAACTGATCACCACTCTTCGCCAGACGGCATTTAAAGGTGATGCCAGCGATGCGCAGTTCATCGCATTGCTGATCGTCGCCAACCAGTACGGTCTTAATCCGTGGACGAAAGAAATTTACGCCTTTCCTGATAAGCAGAACGGCATCGTTCCGGTGGTGGGCGTTGATGGCTGTCCCGTATCATCAATGAAAACCAGCAGTTTGAGGCATGGTACTTTGAGCAGGACA

>0.2 score=0.9966 in=0.9966 ex=0.0000 len=98 ref=inclusion1 pos=5209

GCGAGTTTTGCGAGATGGTGCCGGAGTTCATCGAAAAAATGGACGAGGCACTGCTGAAATTGGTTTTGTATTTGGGGAGCAATGGCGATGAAGCATCC

——

Signatures are listed in descending order of score (higher sensitivity and specificity give a higher score).
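For downstream filtering, the header fields shown above (score, in, ex, len, ref, pos) can be parsed with a few lines of Python. The sketch below assumes headers follow exactly the `key=value` layout seen in the example output; the function names are hypothetical and not part of Neptune.

```python
def parse_header(line):
    """Parse a header such as
    '>1.0 score=1.0000 in=1.0000 ex=0.0000 len=103 ref=inclusion2 pos=99'."""
    fields = line.lstrip(">").split()
    record = {"id": fields[0]}
    for field in fields[1:]:
        key, value = field.split("=", 1)
        record[key] = value
    for key in ("score", "in", "ex"):
        record[key] = float(record[key])
    for key in ("len", "pos"):
        record[key] = int(record[key])
    return record


def read_signatures(path):
    """Yield (header_dict, sequence) pairs from a FASTA file, skipping blank lines."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = parse_header(line), []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Headers copied from the example output above.
headers = [
    ">1.0 score=1.0000 in=1.0000 ex=0.0000 len=103 ref=inclusion2 pos=99",
    ">1.1 score=0.9979 in=0.9979 ex=0.0000 len=640 ref=inclusion2 pos=3494",
    ">0.2 score=0.9966 in=0.9966 ex=0.0000 len=98 ref=inclusion1 pos=5209",
]
records = [parse_header(h) for h in headers]
long_ids = [r["id"] for r in records if r["len"] >= 100]
print(long_ids)
```

With the full file, `read_signatures("output/consolidated/consolidated.fasta")` would yield the same records together with their sequences, for example to keep only signatures above a chosen score or length.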

You can also specify only particular FASTA files within a directory (manual).
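The -i/-e descriptions earlier say that both files and directories may be listed, that all files directly inside a directory are included, and that subdirectories are not entered. The sketch below mimics that behaviour to preview which files a given argument list would cover. It is a hypothetical helper written for this post, not Neptune's own code, and the temporary directory layout exists only for the demonstration.

```python
import os
import tempfile


def expand_targets(paths):
    """Expand file or directory paths into a flat list of files.

    Files directly inside a directory are included; subdirectories
    are not entered.
    """
    files = []
    for path in paths:
        if os.path.isdir(path):
            for name in sorted(os.listdir(path)):
                full = os.path.join(path, name)
                if os.path.isfile(full):
                    files.append(full)
        else:
            files.append(path)
    return files


# Demonstration with a throwaway directory containing one nested folder.
with tempfile.TemporaryDirectory() as tmp:
    inclusion_dir = os.path.join(tmp, "inclusion_dir")
    os.makedirs(os.path.join(inclusion_dir, "nested"))
    for name in ("a.fasta", "b.fasta", os.path.join("nested", "c.fasta")):
        with open(os.path.join(inclusion_dir, name), "w") as handle:
            handle.write(">seq\nACGT\n")
    found = [os.path.basename(p) for p in expand_targets([inclusion_dir])]

print(found)  # c.fasta inside nested/ is not picked up
```

Passing individual file paths instead of a directory returns just those files, which corresponds to selecting particular FASTA files as mentioned above.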

Walkthrough (about the output)

https://phac-nml.github.io/neptune/walkthrough/

Citation

Neptune: a bioinformatics tool for rapid discovery of genomic variation in bacterial populations

Eric Marinier, Rahat Zaheer, Chrystal Berry, Kelly A. Weedmark, Michael Domaratzki, Philip Mabon, Natalie C. Knox, Aleisha R. Reimer, Morag R. Graham, Linda Chui, Laura Patterson-Fortin, Jian Zhang, Franco Pagotto, Jeff Farber, Jim Mahony, Karine Seyer, Sadjia Bekal, Cécile Tremblay, Judy Isaac-Renton, Natalie Prystajecky, Jessica Chen, Peter Slade, and Gary Van Domselaar

Nucleic Acids Res. 2017 Oct 13; 45(18): e159.