ANIcalculator, a tool for computing gANI - macでインフォマティクス

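Microbes dominate the tree of life in both number and diversity, which makes their natural classification both difficult and important. In animals, a species is generally defined as a group of interbreeding organisms (the biological species concept), but this definition cannot be applied directly to assemblages of asexual organisms. As a result, microbial taxonomy adopts a polyphasic approach (refs. 2 and 3 in the paper) that integrates genotypic, phenotypic, and chemotaxonomic information about an organism and delineates microbial species based on a consensus of the available data. This polyphasic approach includes DNA-DNA hybridization (DDH), variation in G+C content, sequence comparison of selected DNA markers (including 16S rRNA), identification of specific metabolites such as fatty acids, polar lipids, cell wall composition, and exopolysaccharides, as well as morphological, biochemical, and enzymological characterization. It is not unusual for these different methods in the polyphasic toolbox to disagree on the suggested classification of the same isolate. For example, tight genotypic clustering of strains can conflict with their biochemical diversity and phenotypic variability, and vice versa. Such conflicts may point to unusual and interesting biological processes, or to experimental error. Moreover, the choice of methods and the weight assigned to each data type are left to the researcher's discretion and can vary with the organism and the scope of the study, so inconsistent classifications can arise.

This polyphasic approach, which evolved before the advent of high-throughput sequencing, has enabled prokaryotic classification so far. However, the complete genome of an organism is its ultimate genetic signature, and obtaining complete genomes is now rapid, accurate, and no longer cost-prohibitive. Whole-genome-based grouping may still fail to explain physiological variation arising from a single gene or a small subset of genes, but it maximizes the information considered when computing genomic distance, reduces bias, and keeps pace with the discovery of novel microbes.

(part omitted)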

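Whole-genome based Average Nucleotide Identity (gANI) was proposed by Konstantinidis and Tiedje as a measure of similarity between two genomes (ref. 4). Here the authors enhance this approach and develop a scalable method for computing gANI for the 86.5 million genome pairs publicly available in the Integrated Microbial Genomes (IMG) database (ref. 5). They compute gANI by calculating and averaging the nucleotide identity of orthologous genes, which are identified as bidirectional best hits (BBHs) using a rapid yet sensitive similarity search (ref. 6). In addition to gANI, they consider the fraction of orthologous genes (Alignment Fraction, AF) as a complementary measure of the genetic relatedness of a pair of genomes, based on gene content. They systematically explore and statistically characterize the relationship between gANI and AF for a large set of diverse bacterial and archaeal genomes. They demonstrate that these two measures accurately capture the genetic distance between genomes and remain robust as sampling of microbial biodiversity increases. In general, the measures correlate well with existing classifications, and they allow thresholds to be chosen that, when applied uniformly to existing taxa, could lead to consistent species assignments predictive of the level of genetic diversity within a species. (rest omitted)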
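To make the two measures concrete, here is a minimal Python sketch of how gANI and AF could be computed in one direction (genome 1 to genome 2) from a list of BBH gene pairs. The `BBH` record and the function name are hypothetical, and the length weighting (identity times alignment length, divided by the total length of the BBH genes) follows my reading of the paper rather than anything stated in the summary above, so treat it as an assumption and not as the tool's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class BBH:
    """One bidirectional-best-hit gene pair seen from genome 1 (hypothetical record)."""
    gene_length: int         # length of the genome-1 gene, in bp
    percent_identity: float  # nucleotide identity of the alignment, 0-100
    alignment_length: int    # aligned length, in bp


def gani_and_af(bbhs, total_gene_length):
    """Return (gANI, AF) for the direction genome 1 -> genome 2.

    Assumed definitions:
      gANI = sum(identity * alignment length) / sum(length of BBH genes)
      AF   = sum(length of BBH genes) / total length of all genes in genome 1
    """
    bbh_gene_length = sum(b.gene_length for b in bbhs)
    if bbh_gene_length == 0 or total_gene_length == 0:
        return 0.0, 0.0
    weighted_identity = sum(b.percent_identity * b.alignment_length for b in bbhs)
    gani = weighted_identity / bbh_gene_length
    af = bbh_gene_length / total_gene_length
    return gani, af


if __name__ == "__main__":
    # Made-up numbers, only to show the arithmetic.
    pairs = [BBH(1000, 90.0, 1000), BBH(500, 80.0, 500)]
    print(gani_and_af(pairs, total_gene_length=3000))
```

Because the two directions use different gene sets, running the same calculation from genome 2 to genome 1 generally gives slightly different values, which is why the tool reports both ANI(1->2) and ANI(2->1).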

JGI ANI home page

https://ani.jgi-psf.org/html/home.php?

Installation

To use the tool locally, download the executable (compiled for Linux) from JGI.

https://ani.jgi-psf.org/html/download.php?

(Click "accept" at the bottom of the linked page.)

> ANIcalculator --help

$ ANIcalculator --help

********************************************************************************************************************
This tool will calculate the bidirectional average nucleotide identity (gANI) between two genomes.
Required input is the full path to the fna file (nucleotide sequence of genes in fasta format) of each query genome.
Either the rRNA and tRNA genes can be excluded, or provided in a list with the -ignoreList option.
********************************************************************************************************************
Usage:ANIcalculator
-genome1fna <fna file of the first query genome> *REQUIRED*
-genome2fna <fna file of the second query genome> *REQUIRED*
-outfile <the output file> OR -stdout <output to screen> *Default: ANIcalculator.out*
-outdir <output directory> *Default: Current directory*
-ignoreList <file containing list of genes to ignore (Should include TRNA and RNA genes)>
-logfile <log file> *Default: ANIcalculator.log*
-help <prints this page>

Results are shown in tab-delimited format with following headers:
Genome1 <name of FNA file of genome1>
Genome2 <name of FNA file of genome2>
ANI(1->2) <Average nucleotide idenitity of the first genome to the second>
ANI(2->1) <Average nucleotide idenitity of the second genome to the first>
AF(1->2) <Alignment Fraction of the first genome to the second>
AF(2->1) <Alignment Fraction of the second genome to the first>

Running locally

ANIcalculator -genome1fna input1.fasta -genome2fna input2.fasta -outdir outdir
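If you want to run many pairs, the same command can be assembled from a script. The following is a rough sketch that only uses the options listed in the help text above (`-genome1fna`, `-genome2fna`, `-outfile`, `-outdir`). The function names are my own, and it assumes the `ANIcalculator` executable is on your PATH. Because the help asks for the full path of each fna file, the paths are made absolute.

```python
import subprocess
from pathlib import Path


def build_ani_command(genome1_fna, genome2_fna, outdir=None, outfile=None,
                      exe="ANIcalculator"):
    """Build the argument list for one ANIcalculator run (hypothetical helper)."""
    cmd = [
        exe,
        "-genome1fna", str(Path(genome1_fna).resolve()),  # help asks for full paths
        "-genome2fna", str(Path(genome2_fna).resolve()),
    ]
    if outfile is not None:
        cmd += ["-outfile", str(outfile)]
    if outdir is not None:
        cmd += ["-outdir", str(outdir)]
    return cmd


def run_ani(genome1_fna, genome2_fna, outdir=None, outfile=None):
    """Run ANIcalculator and raise CalledProcessError if it exits non-zero."""
    cmd = build_ani_command(genome1_fna, genome2_fna, outdir=outdir, outfile=outfile)
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Prints the command that would be executed for the run shown above.
    print(build_ani_command("input1.fasta", "input2.fasta", outdir="outdir"))
```

Passing the arguments as a list (and not as one shell string) avoids quoting problems when file names contain spaces.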

Output.

> cat outdir/ani.blast.dir/input1.fasta.input2.fasta.blout

# cat input1.fasta.input2.fasta.blout

NC_000911.1 AE005174.2 66.38 2484 835 16 2368750 2716600 35208 2230375 1.1e+03 27248.00
NC_000911.1 AE005174.2 49.47 3307 1671 11 644387 1351186 436561 4738477 1.5e+03 7825.00
NC_000911.1 AE005174.2 40.02 2431 1458 13 3367689 3566614 3495590 4924147 3.1e+02 2266.00
NC_000911.1 AE005174.2 61.02 1411 550 2 3129308 3431892 794424 2112869 4.3e+02 861.00
NC_000911.1 AE005174.2 65.86 1775 606 8 3349356 3351139 136681 138470 1.6e+03 563.00
NC_000911.1 AE005174.2 74.11 1016 263 5 2448950 2449972 4226221 4227238 1.3e+03 490.00
NC_000911.1 AE005174.2 71.15 780 225 9 3327399 3328183 4902776 4903565 8.6e+02 330.00
NC_000911.1 AE005174.2 71.43 658 188 0 1667696 1668354 4778438 4779095 7.7e+02 282.00
NC_000911.1 AE005174.2 63.48 660 241 1 3550315 3550974 3614286 3614948 5.4e+02 178.00
NC_000911.1 AE005174.2 67.34 395 129 1 1520691 1521088 1794691 1795085 4.1e+02 137.00
NC_000911.1 AE005174.2 65.18 336 117 0 2152291 2152626 3477274 3477609 2.9e+02 102.00
NC_000911.1 AE005174.2 65.45 275 95 0 1073491 1073765 4329764 4330038 2.6e+02 85.00
NC_000911.1 AE005174.2 62.96 324 120 3 990415 990744 4135533 4135859 2.6e+02 84.00
NC_000911.1 AE005174.2 63.55 299 109 0 808917 809215 755514 755812 2.5e+02 81.00
NC_000911.1 AE005174.2 65.86 249 85 0 1026588 1026837 2198237 2198485 2.4e+02 79.00
NC_000911.1 AE005174.2 60.37 323 128 1 2887886 2888209 4117005 4117330 2.2e+02 67.00
NC_000911.1 AE005174.2 64.19 215 77 0 3505100 3505315 2231875 2232089 1.7e+02 61.00
NC_000911.1 AE005174.2 79.79 94 19 0 2598253 2598347 4208942 4209035 1.5e+02 56.00
NC_000911.1 AE005174.2 85.71 70 10 1 633308 633377 4845001 4845073 1.4e+02 50.00
NC_000911.1 AE005174.2 92.98 57 4 0 320590 320647 4517677 4517733 1.4e+02 49.00
NC_000911.1 AE005174.2 80.25 81 16 0 2087795 2087875 776366 776446 1.4e+02 49.00
NC_000911.1 AE005174.2 72.38 105 29 0 139872 139977 4352722 4352826 1.3e+02 47.00
NC_000911.1 AE005174.2 73.61 72 19 0 2708477 2708548 2059415 2059486 1e+02 34.00
NC_000911.1 AE005174.2 86.96 46 6 0 3224161 3224207 4225817 4225862 96 34.00
NC_000911.1 AE005174.2 72.46 69 19 0 1082309 1082377 777015 777083 90 31.00
NC_000911.1 AE005174.2 79.17 48 10 0 2411957 2412005 4153597 4153644 79 28.00
NC_000911.1 AE005174.2 80.00 45 9 0 2791704 2791748 3737114 3737158 73 27.00
NC_000911.1 AE005174.2 68.33 60 19 0 1082308 1082368 2421426 2421485 68 22.00
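The .blout file has 12 whitespace-separated columns per hit. The format is not documented in the help text, so the sketch below guesses the first ten columns by analogy with BLAST-style tabular output (query, subject, percent identity, alignment length, mismatches, gaps, query start/end, subject start/end). The last two columns look like scores, but I am not sure what they are, so they are kept under neutral names. All field names here are assumptions for illustration.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One line of the .blout file; column meanings are guessed, not documented."""
    query: str
    subject: str
    percent_identity: float
    alignment_length: int
    mismatches: int
    gaps: int
    q_start: int
    q_end: int
    s_start: int
    s_end: int
    field11: float  # meaning unknown (some kind of score)
    field12: float  # meaning unknown (some kind of score)


def parse_blout_line(line):
    """Parse one hit line into a Hit; raise ValueError on an unexpected layout."""
    f = line.split()
    if len(f) != 12:
        raise ValueError(f"expected 12 columns, got {len(f)}")
    return Hit(f[0], f[1], float(f[2]),
               int(f[3]), int(f[4]), int(f[5]),
               int(f[6]), int(f[7]), int(f[8]), int(f[9]),
               float(f[10]), float(f[11]))


def parse_blout(lines):
    """Yield Hit records, skipping blank lines and lines starting with '#'."""
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            yield parse_blout_line(line)


if __name__ == "__main__":
    sample = "NC_000911.1 AE005174.2 92.98 57 4 0 320590 320647 4517677 4517733 1.4e+02 49.00"
    print(parse_blout_line(sample))
```

With the hits in this form it is easy to, for example, filter by `percent_identity` or `alignment_length` before looking at them in more detail.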

Results are shown in tab-delimited format with following headers:
Genome1 <name of FNA file of genome1>
Genome2 <name of FNA file of genome2>
ANI(1->2) <Average nucleotide idenitity of the first genome to the second>
ANI(2->1) <Average nucleotide idenitity of the second genome to the first>
AF(1->2) <Alignment Fraction of the first genome to the second>
AF(2->1) <Alignment Fraction of the second genome to the first>
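The tab-delimited result file can be read with a few lines of Python. This sketch takes the six columns by position in the order given above and skips any line whose numeric columns do not parse (such as a header line). The exact header spelling in the file is not shown in this post, which is why the columns are not looked up by name. The `passes_thresholds` helper is also my own addition: the default cutoffs (gANI >= 96.5, AF >= 0.6) are the values I recall the paper proposing for same-species pairs, and requiring both directions to pass is my assumption, so check the paper before relying on them.

```python
def parse_ani_result(text):
    """Parse ANIcalculator's tab-delimited result text into a list of dicts.

    Columns are taken by position: Genome1, Genome2, ANI(1->2), ANI(2->1),
    AF(1->2), AF(2->1). Lines that do not fit (e.g. the header) are skipped.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue
        try:
            ani12, ani21, af12, af21 = (float(x) for x in fields[2:6])
        except ValueError:
            continue  # non-numeric columns: most likely the header line
        rows.append({
            "genome1": fields[0], "genome2": fields[1],
            "ani_1_2": ani12, "ani_2_1": ani21,
            "af_1_2": af12, "af_2_1": af21,
        })
    return rows


def passes_thresholds(row, min_ani=96.5, min_af=0.6):
    """True if both directions meet the (assumed, adjustable) gANI and AF cutoffs."""
    return (min(row["ani_1_2"], row["ani_2_1"]) >= min_ani
            and min(row["af_1_2"], row["af_2_1"]) >= min_af)


if __name__ == "__main__":
    # Made-up values, not real ANIcalculator output.
    sample = ("GENOME1\tGENOME2\tANI(1->2)\tANI(2->1)\tAF(1->2)\tAF(2->1)\n"
              "a.fna\tb.fna\t97.1\t97.0\t0.85\t0.80\n")
    for row in parse_ani_result(sample):
        print(row["genome1"], row["genome2"], passes_thresholds(row))
```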

To compute gANI online, go to the gANI web page.

https://ani.jgi-psf.org/html/calc.php?


Output: (screenshot of the results page)

Groups

JGI MiSI - Clusters

A network visualization of the groups

https://ani.jgi-psf.org/html/clusters.php?page=cliqueGroups


Citation
Microbial species delineation using whole genome sequences
Varghese NJ, Mukherjee S, Ivanova N, Konstantinidis KT, Mavrommatis K, Kyrpides NC, Pati A

Nucleic Acids Res. 2015 Aug 18;43(14):6761-71.