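Estimating whole-genome relatedness and assigning taxonomy are important bioinformatics tasks when describing environmental or clinical microbial communities (microbiomes). Genome-wide average nucleotide identity (ANI) is widely used to estimate relatedness among closely related microbial or viral genomes at the species level, but it is not suited to more distantly related genomes. In such cases **average amino acid identity (AAI)** can be used, but existing AAI tools struggle to handle comparisons among thousands of genomes efficiently. This study presents a new tool, FastAAI. FastAAI estimates pairwise whole-genome relatedness from the 4-mers (tetramers) shared between universal proteins and completes each calculation in microseconds. This makes it up to 100,000 times (five orders of magnitude) faster than existing AAI implementations and other whole-genome comparison methods. FastAAI can also resolve relationships among distantly related genomes at the phylum level with accuracy comparable to ribosomal RNA gene phylogenies, substantially improving on known limitations of existing AAI implementations. Analysis of the resulting AAI matrices further suggests that bacterial lineages evolve mostly gradually, and that clear AAI thresholds for the class, order, and family ranks are generally hard to find. FastAAI is therefore a unique addition to the microbiome analysis toolbox, with the scalability to handle analyses of millions of genomes.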
Installation
I installed it with pip into a Python 3.9 environment.
Python >=3.6 (3.9+ recommended)
Additional Python Modules:
- numpy
- pyrodigal - https://github.com/althonos/pyrodigal/
- pyhmmer - https://github.com/althonos/pyhmmer
#PyPI
pip install FastAAI
> fastaai
fastaai -h
I couldn't find the module you specified. Please select one of the following modules:
-------------------------------------- Database Construction Options --------------------------------------
build_db | Create or add to a FastAAI database from genomes, proteins, or proteins and HMMs
merge_db | Add the contents of one FastAAI DB to another
---------------------------------------------- Query Options ----------------------------------------------
simple_query | Query a genome or protein (one or many) against an existing FastAAI database
db_query | Query the genomes in one FastAAI database against the genomes in another FastAAI database
------------------------------------------- Other Options -------------------------------------------
single_query | Query ONE query genome against ONE target genome
multi_query | Create a query DB and a target DB, then calculate query vs. target AAI
aai_index | Create a database from multiple genomes and do an all vs. all AAI index of the genomes
-----------------------------------------------------------------------------------------------------------
To select a module, enter 'FastAAI [module]' into the command line!
> fastaai build_db -h
usage: fastaai [-h] [-g GENOMES] [-p PROTEINS] [-m HMMS] [-d DB_NAME] [-o OUTPUT] [--threads THREADS] [--verbose] [--compress]
This FastAAI module allows you to create a FastAAI database from one or many genomes, proteins, or proteins and HMMs, or add these files to an existing one.
Supply genomes OR proteins OR proteins AND HMMs as inputs.
If you supply genomes, FastAAI will predict proteins from them, and HMMs will be created from those proteins
If you supply only proteins, FastAAI will create HMM files from them, searching against FastAAI's internal database
If you supply proteins AND HMMs, FastAAI will directly use them to build the database.
You cannot supply both genomes and proteins
optional arguments:
-h, --help show this help message and exit
-g GENOMES, --genomes GENOMES
A directory containing genomes in FASTA format.
-p PROTEINS, --proteins PROTEINS
A directory containing protein amino acids in FASTA format.
-m HMMS, --hmms HMMS A directory containing the results of an HMM search on a set of proteins.
-d DB_NAME, --database DB_NAME
The name of the database you wish to create or add to. The database will be created if it doesn't already exist and placed in the output directory. FastAAI_database.sqlite.db by default.
-o OUTPUT, --output OUTPUT
The directory to place the database and any protein or HMM files FastAAI creates. By default, a directory named "FastAAI" will be created in the current working directory and results will be placed there.
--threads THREADS The number of processors to use. Default 1.
--verbose Print minor updates to console. Major updates are printed regardless.
--compress Gzip compress generated proteins, HMMs. Off by default.
> fastaai merge_db -h
usage: fastaai [-h] [-d DONORS] [--donor_file DONOR_FILE] [-r RECIPIENT] [--verbose] [--threads THREADS]
This FastAAI module allows you to add the contents of one or more FastAAI databases to another.
You must have at least two already-created FastAAI databases using the build_db module before this module can be used.
Supply a comma-separated list of at least one donor database and a single recipient database.
If the recipient already exists, then genomes in all the donors will be added to the recipient.
If the recipient does not already exist, a new database will be created, and the contents of all the donors will be added to it.
Example:
FastAAI.py merge_db --donors databases/db1.db,databases/db2.db -recipient databases/db3.db --threads 3
This command will create a new database called "db3.db", merge the data in db1.db and db2.db, and then add the merged data into db3.db
Only the recipient database will be modified; the donors will be left exactly as they were before running this module.
optional arguments:
-h, --help show this help message and exit
-d DONORS, --donors DONORS
Comma-separated string of paths to one or more donor databases. The genomes FROM the donors will be added TO the recipient and the donors will be unaltered
--donor_file DONOR_FILE
File containing paths to one or more donor databases, one per line. Use EITHER this or --donors
-r RECIPIENT, --recipient RECIPIENT
Path to the recipient database. Any genomes FROM the donor database not already in the recipient will be added to this database.
--verbose Print minor updates to console. Major updates are printed regardless.
--threads THREADS The number of processors to use. Default 1.
> fastaai db_query -h
usage: fastaai [-h] [-q QUERY] [-t TARGET] [-o OUTPUT] [--output_style STYLE] [--do_stdev] [--threads THREADS] [--verbose] [--in_memory] [--store_results]
This FastAAI module takes two FastAAI databases and searches all of the genomes in the QUERY against all of the genomes in the TARGET
If you have many genomes (more than 1000), it will be faster to create the query database using FastAAI build_db,
then search it against an existing target using this module than it is to do the same thing with an SQL query.
If you give the same database as query and target, a special all vs. all search of the genomes in the database will be done.
optional arguments:
-h, --help show this help message and exit
-q QUERY, --query QUERY
Path to the query database. The genomes FROM the query will be searched against the genomes in the target database
-t TARGET, --target TARGET
Path to the target database.
-o OUTPUT, --output OUTPUT
The directory where FastAAI will place the result of this query. By default, a directory named "FastAAI" will be created in the current working directory and results will be placed there.
--output_style STYLE Either 'tsv' or 'matrix'. Matrix produces a simplified output of only AAI estimates.
--do_stdev Off by default. Calculate std. deviations on Jaccard indicies. Increases memory usage and runtime slightly. Does NOT change estimated AAI values at all.
--threads THREADS The number of processors to use. Default 1.
--verbose Print minor updates to console. Major updates are printed regardless.
--in_memory Load both databases into memory before querying. Consumes more RAM, but is faster and reduces file I/O substantially. Consider reducing number of threads
--store_results Keep partial results in memory. Only works with --in_memory. Fewer writes, but more RAM. Default off.
> fastaai simple_query -h
usage: fastaai [-h] [-g GENOMES] [-p PROTEINS] [-m HMMS] [--target TARGET] [-o OUTPUT] [--output_style STYLE] [--do_stdev] [--threads THREADS] [--verbose] [--in_memory] [--create_query_db] [--query_db_name QDB_NAME]
[--compress]
This FastAAI module takes one or many genomes, proteins, or proteins and HMMs as a QUERY and searches them against an existing FastAAI database TARGET using SQL
If you only have a few genomes - or not enough RAM to hold the entire target database in memory - this is the probably the best option for you.
To provide files, supply either a directory containing only one type of file (e.g. only genomes in FASTA format), a file containing paths to files of a type, 1 per line,
or a comma-separated list of files of a single type (no spaces)
If you provide FastAAI with genomes or only proteins (not proteins and HMMs), this FastAAI module will produce the required protein and HMM files as needed
and place them in the output directory, just like it does while building a database.
Once these inputs are ready to be queried against the database (each has both a protein and HMM file), they will be processed independently, 1 per thread at a time.
Note: Protein and HMM files generated during this query can be supplied to build a FastAAI database from proteins and HMMs using the build_db module, without redoing preprocessing.
optional arguments:
-h, --help show this help message and exit
-g GENOMES, --genomes GENOMES
Genomes in FASTA format.
-p PROTEINS, --proteins PROTEINS
Protein amino acids in FASTA format.
-m HMMS, --hmms HMMS HMM search files produced by FastAAI on a set of proteins.
--target TARGET A path to the FastAAI database you wish to use as the target
-o OUTPUT, --output OUTPUT
The directory where FastAAI will place the result of this query and any protein or HMM files it has to generate. By default, a directory named "FastAAI" will be created in the current working directory and results will be placed there.
--output_style STYLE Either 'tsv' or 'matrix'. Matrix produces a simplified output of only AAI estimates.
--do_stdev Off by default. Calculate std. deviations on Jaccard indicies. Increases memory usage and runtime slightly. Does NOT change estimated AAI values at all.
--threads THREADS The number of processors to use. Default 1.
--verbose Print minor updates to console. Major updates are printed regardless.
--in_memory Load the target database into memory before querying. Consumes more RAM, but is faster and reduces file I/O substantially.
--create_query_db Create a query database from the genomes.
--query_db_name QDB_NAME
Name the query database. This file must not already exist.
--compress Gzip compress generated proteins, HMMs. Off by default.
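Usage
1. build_db - Predicts proteins from the input set of genomes, identifies single-copy proteins (SCPs), and builds a FastAAI database, or adds to an existing one. Protein FASTA files or HMM search results can also be supplied as input.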
#test run
git clone https://github.com/cruizperez/FastAAI.git
cd FastAAI/
# Specify the directory containing the genome .fna files (.fna.gz is also recognized)
fastaai build_db --genomes FastAAI-master/example_genomes/ --threads 4 --verbose --output example_build --database my_example_db.db --compress
- -g GENOMES, --genomes GENOMES A directory containing genomes in FASTA format.
- -p PROTEINS, --proteins PROTEINS A directory containing protein amino acids in FASTA format.

The test data contain 10 genomes, and the build_db command finished in a little over ten seconds.
Example output
example_build/

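The output directory contains the subfolders predicted_proteins, hmms, database, and logs. The logs folder holds FastAAI_preprocessing_log.txt, which records the protein prediction and HMM search results for each query genome. Because the --compress option was used, the files in the predicted_proteins and hmms folders are gzip-compressed on output.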
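Since --compress leaves the predicted protein files gzip-compressed, a quick sanity check is to count the FASTA records in one of them without decompressing it to disk. The following is a minimal sketch using only the Python standard library. The helper name `count_fasta_records` and the demo file name `demo.faa.gz` are hypothetical, and the file names FastAAI actually writes are not assumed here.

```python
import gzip
import tempfile
from pathlib import Path


def count_fasta_records(path):
    """Count FASTA header lines ('>') in a plain or gzip-compressed file."""
    # Choose the opener from the file extension; both return a text handle.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Demo on a small made-up file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.faa.gz"  # hypothetical file name
    with gzip.open(demo, "wt") as out:
        out.write(">protein_1\nMKVLAAG\n>protein_2\nMTTQAPL\n")
    n_records = count_fasta_records(demo)

print(n_records)
```

To use it on real output, pass the path of a file found under example_build/predicted_proteins/.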
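2. db_query - Compares the genomes in a FastAAI database against the genomes in the same or another database and calculates AAI for each genome pair. Here the same DB is given as both query and target to run an all-versus-all AAI calculation.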
fastaai db_query --query example_build/database/my_example_db.db --target example_build/database/my_example_db.db --threads 4 --verbose --output outdir
- -q QUERY, --query QUERY Path to the query database. The genomes FROM the query will be searched against the genomes in the target database
- -t TARGET, --target TARGET Path to the target database.
- -o OUTPUT, --output OUTPUT The directory where FastAAI will place the result of this query. By default, a directory named "FastAAI" will be created in the current working directory and results will be placed there.
The calculation finished almost instantly.

Example output
outdir/

Opening one of the files:

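Each row records the following columns:
- Query genome
- Target genome
- Average Jaccard index
- Standard deviation of the Jaccard index
- Number of shared SCPs
- Number of SCPs that could be shared (the larger of the SCP counts of the query and target genomes in the pair)
- Estimated AAI
Because --do_stdev was not given to request the standard deviation calculation, the fourth column is N/A in every row.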
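The seven-column layout above can be read back into typed records for downstream filtering. The sketch below assumes tab-separated rows with exactly those seven columns in that order and no header line, which is an assumption about the file rather than something confirmed here. The field names, the helper `parse_rows`, and the sample row are all made up for illustration. The estimated AAI is kept as text because the sketch does not assume it is always a plain number.

```python
import csv
import io
from typing import NamedTuple, Optional


class AaiRow(NamedTuple):
    # Hypothetical field names for the seven columns described in the text.
    query: str
    target: str
    avg_jaccard: float
    jaccard_sd: Optional[float]  # None when the column holds "N/A"
    shared_scps: int
    possible_scps: int
    estimated_aai: str  # kept as text; format not assumed


def parse_rows(handle):
    """Parse tab-separated result rows into AaiRow records."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 7:
            continue  # skip blank or unexpected lines
        query, target, jac, sd, shared, possible, aai = fields
        rows.append(AaiRow(
            query, target, float(jac),
            None if sd == "N/A" else float(sd),
            int(shared), int(possible), aai,
        ))
    return rows


# Made-up sample row, not real FastAAI output.
sample = "genome_A\tgenome_B\t0.4123\tN/A\t110\t120\t65.2\n"
rows = parse_rows(io.StringIO(sample))
print(rows[0].query, rows[0].target, rows[0].shared_scps, rows[0].jaccard_sd)
```

For a real result file, replace the `io.StringIO` handle with `open(path, newline="")`.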
Specifying "--output_style matrix" produces matrix-format output (the default is "--output_style tsv"). Adding "--in_memory" loads both databases into memory before the query runs. This uses more RAM but speeds up processing.
fastaai db_query --query example_build/database/my_example_db.db --target example_build/database/my_example_db.db --threads 4 --verbose --output outdir --output_style matrix --in_memory
- --output_style STYLE Either 'tsv' or 'matrix'. Matrix produces a simplified output of only AAI estimates.
- --do_stdev Off by default. Calculate std. deviations on Jaccard indicies. Increases memory usage and runtime slightly. Does NOT change estimated AAI values at all.
- --threads THREADS The number of processors to use. Default 1.
- --in_memory Load both databases into memory before querying. Consumes more RAM, but is faster and reduces file I/O substantially. Consider reducing number of threads
Example output

With --output_style matrix, the result is a single matrix file that records only the AAI values.

3. merge_db - Merges two or more FastAAI databases. It can either create a new database or modify an existing one.
Other notes
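- All FastAAI queries follow essentially the same logic: select the set of SCPs shared by the two genomes, then compute the Jaccard index of the unique tetramers for each shared SCP. The unweighted mean Jaccard index is calculated over the individual SCP pairings, and this mean is converted to an estimated AAI using an equation (see the FastAAI paper).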
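The logic described above can be illustrated with a small sketch: take the SCPs present in both genomes, compute the Jaccard index of the unique tetramer sets for each, and average the values without weighting. This is not FastAAI's implementation. The function names, the dictionary-based genome representation, and the toy sequences are all assumptions made for illustration, and the final conversion to an estimated AAI is left out because the equation is given only in the paper.

```python
def tetramers(protein: str) -> set[str]:
    """Return the set of unique 4-mers in an amino-acid sequence."""
    return {protein[i:i + 4] for i in range(len(protein) - 3)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_jaccard(query: dict[str, str], target: dict[str, str]) -> tuple[float, int]:
    """Unweighted mean Jaccard index over SCPs present in both genomes.

    Each genome is a dict mapping an SCP name to its protein sequence
    (a hypothetical representation chosen for this sketch).
    Returns (mean Jaccard index, number of shared SCPs).
    """
    shared = sorted(query.keys() & target.keys())
    if not shared:
        return 0.0, 0
    scores = [jaccard(tetramers(query[s]), tetramers(target[s])) for s in shared]
    return sum(scores) / len(scores), len(shared)


# Toy genomes with made-up SCP names and sequences.
query = {"scp_a": "MKVLAAGIV", "scp_b": "MTTQAPLE"}
target = {"scp_a": "MKVLAAGLV", "scp_c": "MSSRRRRR"}

mean_j, n_shared = mean_jaccard(query, target)
print(n_shared, mean_j)  # → 1 0.5
```

Only scp_a is present in both toy genomes. Its two sequences each yield six tetramers, four of which are shared out of eight distinct ones, giving a Jaccard index of 0.5.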
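Comments
The paper uses FastAAI to examine AAI distributions at various taxonomic ranks. For well-studied phyla and similar groups, the distributions turn out to be narrower than I expected. Have a look for yourself.
Citation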
FastAAI: efficient estimation of genome average amino acid identity and phylum-level relationships using tetramers of universal proteins
Kenji Gerhardt, Carlos A Ruiz-Perez, Luis M Rodriguez-R, Chirag Jain, James M Tiedje, James R Cole, Konstantinos T Konstantinidis
Nucleic Acids Research, Volume 53, Issue 8, 8 May 2025, gkaf348
Related