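SonicParanoid2: fast, accurate, and comprehensive ortholog inference using machine learning and language models

Accurate inference of orthologous genes is a prerequisite for many genomic and evolutionary studies. SonicParanoid is one of the best-suited tools for orthology inference. However, its scalability and sensitivity are limited by time-consuming all-versus-all alignments and by proteins with complex domain architectures, respectively. This work reports an update to SonicParanoid. Using an AdaBoost classifier, the execution time of the all-versus-all alignments was reduced by up to 42% without loss of accuracy. A Doc2Vec neural network model enabled orthology inference at the domain level and increased the number of predicted orthologs by one third compared with graph-based orthology alone. In evaluations on standardized benchmark datasets and a dataset of 2,000 MAGs, SonicParanoid2 was up to 18 times faster and more scalable than other orthology inference tools, with accuracy comparable to that of established methods.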

Happy to share the #preprint of SonicParanoid2. We used ML to halve the runtime for all-vs-all alignments, and to quicky infer orthologs at domain-level using language models. https://t.co/mHhnj5m7AF
— Salvatore Cosentino (@salvocos981) May 17, 2023

http://iwasakilab.k.u-tokyo.ac.jp/sonicparanoid/

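The "Installation" section toward the bottom of the project homepage describes installation procedures, using either pip or conda/mamba, that avoid dependency problems. Here I follow the micromamba procedure described there.

Installation

Dependencies

Python 3.8 up to version 3.10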
GNU GCC compiler (version 5.0 or above)

GitLab

# Create and activate a dedicated environment, then install with mamba and pip
mamba create -n sonicparanoid
conda activate sonicparanoid
mamba install -c conda-forge python==3.9.15
# Cython, pinned to a pre-release version
pip3 install cython==3.0.0a10
# Python dependencies from conda-forge, pinned
mamba install -c conda-forge biopython==1.79 numpy==1.21.6 filetype==1.2.0 gensim==4.2.0 mypy==0.991 pandas==1.3.5 pip==23.1.2 psutil==5.9.5 scikit-learn==1.0.2 scipy==1.10.1 smart-open==6.3.0 tqdm==4.64.1 wheel==0.40.0 -y
# External alignment and clustering tools from bioconda
mamba install -c bioconda mmseqs2==13.45111 diamond==2.0.12 blast==2.12.0 mcl==14.137 -y
# SonicParanoid itself
pip3 install --no-cache-dir sonicparanoid
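
After installation it can be useful to confirm that the environment matches the dependencies listed above. The following is a minimal sketch, not part of SonicParanoid: the helper names are hypothetical, and the executable names (mmseqs, diamond, blastp, mcl) are assumed to be the ones provided by the bioconda packages installed above.

```python
import shutil
import sys

# Executable names assumed to be installed by the bioconda packages above
# (mmseqs2 -> mmseqs, blast -> blastp). Adjust if your setup differs.
REQUIRED_TOOLS = ("mmseqs", "diamond", "blastp", "mcl")


def python_supported(version=None):
    """Return True if the (major, minor) version lies in the 3.8 to 3.10 range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return (3, 8) <= major_minor <= (3, 10)


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python version supported:", python_supported())
    print("Missing tools:", missing_tools() or "none")
```

Run it inside the activated sonicparanoid environment; an empty list of missing tools suggests the mamba installs succeeded.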

> sonicparanoid -h

usage: sonicparanoid -i <INPUT_DIRECTORY> -o <OUTPUT_DIRECTORY>[options]

SonicParanoid 2.0.3

optional arguments:

-h, --help show this help message and exit

-i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY

Directory containing the proteomes (in FASTA format) of the species to be analyzed.

-o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY

The directory in which the results will be stored.

-p PROJECT_ID, --project-id PROJECT_ID

Name for the project reflecting the name of run. If not specified it will be automatically generated using the current date and time.

-sh SHARED_DIRECTORY, --shared-directory SHARED_DIRECTORY

Directory in which the alignment files are stored. If not specified it is created inside the main output directory.

-t THREADS, --threads THREADS

Maximum number of CPUs to be used. Default=4

-at, --force-all-threads

Force using all the requested threads.

-sm, --skip-multi-species

Skip the creation of multi-species ortholog groups.

-d, --debug Show debug lines. WARNING: extremely verbose

-nc, --no-compress Skip the compression of processed alignment files.

-cl COMPRESSION_LEV, --compression-lev COMPRESSION_LEV

Gzip compression level. Integer values between 1 and 9, with 9 and 1 being the highest lowest compression levels, respectively. Default=5

-m {fast,default,sensitive}, --mode {fast,default,sensitive}

SonicParanoid execution mode. The default mode is suitable for most studies. Use sensitive if the input proteomes are not closely related.

-dmnd {fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}, --diamond {fast,mid-sensitive,sensitive,more-sensitive,very-sensitive,ultra-sensitive}

Use Diamond with a custom sensitivity. This will bypass the -m (--mode) option.

-mmseqs MMSEQS, --mmseqs MMSEQS

Use MMseqs2 with a custom sensitivity (between 1.0 and 7.5). This will bypass the -m (--mode) option.

-blast, --blast Use Blastp for all-vs-all alignments. This will bypass the -m (--mode) option.

-ml MAX_LEN_DIFF, --max-len-diff MAX_LEN_DIFF

Maximum allowed length-difference-ratio between main orthologs and canditate inparalogs. Example: 0.5 means one of the two sequences could be two times longer than the other 0 means no length difference allowed; 1 means any length difference allowed. Default=0.75

-db SEQS_DBS, --seqs-dbs SEQS_DBS

The directory in which the database files created by the selectedlocal alignment tool will be stored. DEFAULT: automatically created inside the main output directory.

-idxdb, --index-db Index the MMSeqs2/Diamond databases. IMPORTANT: This will use more storage but will be slighly faster (5~10%) when processing many big proteomes with MMseqs2. The results might also be sligthy different.

-op, --output-pairs Output a text file with all the pairwise orthologous relationships.

-ka, --keep-raw-alignments

Do not delete raw MMseqs2 alignment files. NOTE: this will triple the space required for storing the results.

-bs MIN_BITSCORE, --min-bitscore MIN_BITSCORE

Consider only alignments with bitscores above min-bitscore. Increasing this value can be a good idea when comparing very closely related species. Increasing this value will reduce the number of paralogs (and orthologs) generate. WARNING: use only if you are sure of what you are doing. INFO: higher min-bitscore values reduce the execution time for all-vs-all. Default=40

-ca, --complete-aln Perform complete alignments (slower), rathen than essential ones.

-go, --graph-only Perform only graph-based orthology (skip architectures analysis).

--min-arch-merging-cov MIN_ARCH_MERGING_COV

When merging graph- and arch-based orhtologs consider only new-orthologs with a protein coverage greater or equal than this value. Default=0.75

-I INFLATION, --inflation INFLATION

Affects the granularity of ortholog groups. This value should be between 1.2 (very coarse) and 5 (fine grained clustering). Default=1.5

-ot, --overwrite-tables

This will force the re-computation of the ortholog tables. Only missing alignment files will be re-computed.

-ow, --overwrite Overwrite previous runs and execute it again. This can be useful to update a subset of the computed tables.

-rs, --remove-old-species

(EXPERIMENTAL) Remove alignments and pairwise ortholog tables related to species used in a previous run. This option should be used when updating a run in which some input proteomes were modified or removed.

-un, --update-input-names

(EXPERIMENTAL) Remove alignments and pairwise ortholog tables for an input proteome used in a previous which file name conflicts with a newly added species. This option should be used when updating a run in which some input proteomes or their file names were modified.
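
The -ml (--max-len-diff) description in the help output gives three reference points: 0.5 allows one sequence to be twice as long as the other, 0 allows no length difference, and 1 allows any difference. One reading consistent with all three is a ratio of 1 - shorter/longer. The sketch below illustrates that reading only; the function names are hypothetical and the formula is an assumption, not taken from the SonicParanoid source.

```python
def length_difference_ratio(len_a, len_b):
    """Return 1 - shorter/longer: 0.0 for equal lengths, approaching 1.0 as they diverge.

    ASSUMPTION: this formula is one interpretation of the help text,
    not SonicParanoid's documented implementation.
    """
    if len_a <= 0 or len_b <= 0:
        raise ValueError("sequence lengths must be positive")
    shorter, longer = sorted((len_a, len_b))
    return 1.0 - shorter / longer


def passes_length_filter(len_a, len_b, max_len_diff=0.75):
    """True if the pair is within the allowed length-difference ratio."""
    return length_difference_ratio(len_a, len_b) <= max_len_diff


# With 0.5, a 200-residue sequence is still accepted next to a 100-residue one,
# but a 201-residue sequence is not.
print(passes_length_filter(100, 200, max_len_diff=0.5))
print(passes_length_filter(100, 201, max_len_diff=0.5))
```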

Test run

sonicparanoid-get-test-data -o .
=> change into the directory indicated by the message

#run
sonicparanoid -i ./test_input -o ./test_output -p my_first_run -t 8

-t Maximum number of CPUs to be used. Default=4
-i Directory containing the proteomes (in FASTA format) of the species to be analyzed.
-o The directory in which the results will be stored.
-p Name for the project reflecting the name of run. If not specified it will be automatically generated using the current date and time.
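
When several runs have to be launched from a script, the same command can be assembled programmatically. This is a minimal sketch using only the -i, -o, -p, -t and -m options listed in the help output; the build_command helper is hypothetical and simply mirrors the test-run command.

```python
import shlex
import subprocess


def build_command(input_dir, output_dir, project_id=None, threads=4, mode="default"):
    """Assemble a sonicparanoid command line from options shown in the help output."""
    if mode not in ("fast", "default", "sensitive"):
        raise ValueError(f"unknown mode: {mode}")
    cmd = ["sonicparanoid", "-i", str(input_dir), "-o", str(output_dir),
           "-t", str(threads), "-m", mode]
    if project_id is not None:
        cmd += ["-p", project_id]
    return cmd


cmd = build_command("./test_input", "./test_output", "my_first_run", threads=8)
print(shlex.join(cmd))
# To actually launch the run (requires sonicparanoid on PATH):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when directory names contain spaces.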

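sonicparanoid generates ortholog pairs and ortholog groups from the input proteomes.

Output

test_output/

alignments/

The alignments directory contains the computed and processed alignment files.

orthologs_db/

orthologs_db contains the pairwise ortholog tables that are reused in each update run.

arch_orthology/

The arch_orthology directory contains most of the files related to domain-aware orthology inference (including the trained artificial neural networks).

merged_tables/

merged_tables contains the merged pairwise ortholog tables that can be reused in each update run.

runs/

This directory contains information about the run settings (run_info.txt) and the input files (species.tsv).

runs/my_first_run/ortholog_groups/

Stores the orthologs shared among the input species.

runs/my_first_run/pairwise_orthologs/

The ortholog table for each pair of proteomes (pairwise ortholog tables).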
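
The directory layout described above can be checked from a script after a run finishes. A minimal sketch, assuming the output directory and project name of the test run; run_paths is a hypothetical helper, and the directory names are the ones listed above.

```python
from pathlib import Path


def run_paths(output_dir, project_id):
    """Return the expected output directories for one run, keyed by name."""
    out = Path(output_dir)
    run_dir = out / "runs" / project_id
    return {
        "alignments": out / "alignments",
        "orthologs_db": out / "orthologs_db",
        "arch_orthology": out / "arch_orthology",
        "merged_tables": out / "merged_tables",
        "ortholog_groups": run_dir / "ortholog_groups",
        "pairwise_orthologs": run_dir / "pairwise_orthologs",
    }


# Report which of the expected directories exist on disk.
for name, path in run_paths("test_output", "my_first_run").items():
    status = "found" if path.is_dir() else "missing"
    print(f"{name}: {path} ({status})")
```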

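About the files in ortholog_groups/

ortholog_groups.tsv  Tab-separated table containing the ortholog groups
flat.ortholog_groups.tsv  Simple table listing only the gene names in each group
single_copy.ortholog_groups.tsv  Ortholog groups that have exactly one ortholog for each species in the group
not_assigned_genes.ortholog_groups.tsv  List of genes that could not be classified as orthologs
overall.stats.tsv  General statistics on the predicted ortholog groups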
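
To make the single-copy criterion concrete, the sketch below parses a small table and keeps the groups in which every species present contributes exactly one gene. The table layout is an assumption made for illustration (a group_id column, one column per species, comma-separated gene names, "*" for a species without a gene); check the header of your own flat.ortholog_groups.tsv before reusing it. All names and data here are hypothetical.

```python
import csv
import io

# Hypothetical miniature table in the ASSUMED layout:
# group_id column, one column per species, comma-separated genes, "*" when absent.
EXAMPLE = (
    "group_id\tspeciesA\tspeciesB\tspeciesC\n"
    "1\ta1\tb1\tc1\n"
    "2\ta2,a3\tb2\t*\n"
    "3\ta4\t*\tc2\n"
)


def read_groups(handle):
    """Map each group id to {species: [genes]}, skipping species with no gene."""
    groups = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        group_id = row.pop("group_id")
        groups[group_id] = {
            species: cell.split(",")
            for species, cell in row.items()
            if cell and cell != "*"
        }
    return groups


def is_single_copy(members):
    """True if every species present in the group contributes exactly one gene."""
    return len(members) > 1 and all(len(genes) == 1 for genes in members.values())


groups = read_groups(io.StringIO(EXAMPLE))
single_copy = [gid for gid, members in groups.items() if is_single_copy(members)]
print(single_copy)
```

For a real file, replace the io.StringIO object with an open file handle, for example open(path, newline="").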

Citation

SonicParanoid2: fast, accurate, and comprehensive orthology inference with machine learning and language models
Salvatore Cosentino, Wataru Iwasaki

bioRxiv, Posted May 15, 2023.