コア遺伝子の立体構造情報を使って正確な系統復元を行う Unicore

　あるクレードのほとんどのメンバーに共通するシングルコピーのコア遺伝子の解析は、系統復元やゲノムの質の評価など、生物学における重要な課題にとって重要である。コア遺伝子は従来、プロテオーム間のアミノ酸類似性の解析によって同定されてきたが、構造を用いて定義することもできる。AIによるタンパク質構造予測の精度は飛躍的に向上したが、プロテオミクス規模での完全な3次元構造モデルの取得には、まだ法外な時間がかかっている。ここでは、structural core genesを下流の系統解析に適したスケールで同定するための新しい手法であるUnicoreを紹介する。入力プロテオームにProstT5タンパク質言語モデルを適用して3Di構造ストリングを得ることにより、Unicoreは完全な3D予測に比べて実行時間を3桁以上短縮した。Foldseek clusteringを使用して、Unicoreは生物種に普遍的に存在するシングルコピー構造を特定し、Foldmasonを使用してそれらをアライメントする。これらの構造的コア遺伝子アラインメントは、下流の系統解析のためにアミノ酸情報に投影される。このアプローチが、従来のアプローチと一致する系統関係を再構築しながら、OrthoFinderよりも最大56倍速く、種の数に対して線形にスケーリングする実行時間でコア遺伝子を定義することを実証する。Unicoreはどのような分類群にも普遍的に適用可能であり、スーパーキングダムにまたがっていても適用可能である。

インストール

Github

mamba create -n unicore -y
conda activate unicore
mamba install -c bioconda unicore -y

# Linux machine with CUDA-compatible GPU
mamba install -c conda-forge pytorch-gpu -y

> unicore -v

unicore -h

Usage: unicore [COMMAND]

Commands:

createdb Create Foldseek database from amino acid sequences

cluster Cluster Foldseek database

search Search Foldseek database against reference database

profile Create core structures from Foldseek database

tree Infer phylogenetic tree from core structures

easy-core Easy core gene phylogeny workflow, from fasta files to phylogenetic tree

help Print this message or the help of the given subcommand(s)

Options:

-v, --version Print version and information

-h, --help Print help

Create Foldseek database from amino acid sequences

> unicore createdb

Usage: unicore createdb [OPTIONS] <INPUT> <OUTPUT> <MODEL>

Arguments:

<INPUT> Input directory with fasta files or a single fasta file

<OUTPUT> Output foldseek database

<MODEL> ProstT5 model

Options:

-k, --keep Keep intermediate files

-o, --overwrite Force overwrite output database

--max-len <MAX_LEN> Set maximum sequence length threshold

-g, --gpu Use GPU for foldseek createdb

--afdb-lookup Use AFDB lookup for foldseek createdb. Useful for large databases

--afdb-local <AFDB_LOCAL> Local path to the directory with AFDB lookup tables. If not exists, it will try to download from the internet

--threads <THREADS> Number of threads to use; 0 to use all [default: 0]

-v, --verbosity <VERBOSITY> Verbosity (0: quiet, 1: +errors, 2: +warnings, 3: +info, 4: +debug) [default: 3]

-h, --help Print help

> unicore cluster -h

Cluster Foldseek database

Usage: unicore cluster [OPTIONS] <INPUT> <OUTPUT> <TMP>

Arguments:

<INPUT> Input database (createdb output)

<OUTPUT> Output prefix; the result will be saved as OUTPUT.tsv

<TMP> Temp directory

Options:

-k, --keep-cluster-db

Keep intermediate Foldseek cluster database

-c, --cluster-options <CLUSTER_OPTIONS>

Arguments for foldseek options in string e.g. -c "-c 0.8" [default: "-c 0.8"]

--threads <THREADS>

Number of threads to use; 0 to use all [default: 0]

-v, --verbosity <VERBOSITY>

Verbosity (0: quiet, 1: +errors, 2: +warnings, 3: +info, 4: +debug) [default: 3]

-h, --help

Print help

> unicore search -h

Search Foldseek database against reference database

Usage: unicore search [OPTIONS] <INPUT> <TARGET> <OUTPUT> <TMP>

Arguments:

<INPUT> Input database

<TARGET> Target database to search against

<OUTPUT> Output prefix; the result will be saved as OUTPUT.m8

<TMP> Temp directory

Options:

-k, --keep-aln-db

Keep intermediate Foldseek alignment database

-s, --search-options <SEARCH_OPTIONS>

Arguments for foldseek options in string e.g. -s "-c 0.8" [default: "-c 0.8"]

--threads <THREADS>

Number of threads to use; 0 to use all [default: 0]

-v, --verbosity <VERBOSITY>

Verbosity (0: quiet, 1: +errors, 2: +warnings, 3: +info, 4: +debug) [default: 3]

-h, --help

Print help

> unicore profile -h

Create core structures from Foldseek database

Usage: unicore profile [OPTIONS] <INPUT_DB> <INPUT_TSV> <OUTPUT>

Arguments:

<INPUT_DB> Input database (createdb output)

<INPUT_TSV> Input tsv file (cluster or search output)

<OUTPUT> Output directory

Options:

-t, --threshold <THRESHOLD> Coverage threshold for core structures. [0 - 100] [default: 80]

-p, --print-copiness Generate tsv with copy number statistics

--threads <THREADS> Number of threads to use; 0 to use all [default: 0]

-v, --verbosity <VERBOSITY> Verbosity (0: quiet, 1: +errors, 2: +warnings, 3: +info, 4: +debug) [default: 3]

-h, --help Print help

> unicore tree -h

Infer phylogenetic tree from core structures

Usage: unicore tree [OPTIONS] <DB> <INPUT> <OUTPUT>

Arguments:

<DB> Input database (createdb output)

<INPUT> Input directory containing core structures (profile output)

<OUTPUT> Output directory

Options:

-a, --aligner <ALIGNER>

Multiple sequence aligner [foldmason, mafft-linsi, mafft] [default: foldmason]

-t, --tree-builder <TREE_BUILDER>

Phylogenetic tree builder [iqtree, fasttree (under development), raxml (under development)] [default: iqtree]

-o, --aligner-options <ALIGNER_OPTIONS>

Options for sequence aligner

-p, --tree-options <TREE_OPTIONS>

Options for tree builder; please adjust if using different tree method [default: "-m JTT+F+I+G -B 1000"]

-d, --threshold <THRESHOLD>

Gap threshold for multiple sequence alignment [0 - 100] [default: 50]

-c, --threads <THREADS>

Number of threads to use; 0 to use all [default: 0]

-v, --verbosity <VERBOSITY>

Verbosity (0: quiet, 1: +errors, 2: +warnings, 3: +info, 4: +debug) [default: 3]

-h, --help

Print help

> unicore easy-core -h

Easy core gene phylogeny workflow, from fasta files to phylogenetic tree

Usage: unicore easy-core [OPTIONS] <INPUT> <OUTPUT> <MODEL> <TMP>

Arguments:

<INPUT> Input directory with fasta files or a single fasta file

<OUTPUT> Output directory where all results will be saved

<MODEL> ProstT5 model

<TMP> tmp directory

Options:

-k, --keep

Keep intermediate files

-w, --overwrite

Force overwrite output database

--max-len <MAX_LEN>

Set maximum sequence length threshold

-g, --gpu

Use GPU for foldseek createdb

--afdb-lookup

Use AFDB lookup for foldseek createdb. Useful for large databases

--afdb-local <AFDB_LOCAL>

Local path to the directory with AFDB lookup tables. hidden option

-c, --cluster-options <CLUSTER_OPTIONS>

Arguments for foldseek options in string e.g. -c "-c 0.8" [default: "-c 0.8"]

-C, --core-threshold <CORE_THRESHOLD>

Coverage threshold for core structures. [0 - 100] [default: 80]

-p, --print-copiness

Generate tsv with copy number statistics

-A, --aligner <ALIGNER>

Multiple sequence aligner [foldmason, mafft-linsi, mafft] [default: foldmason]

-T, --tree-builder <TREE_BUILDER>

Phylogenetic tree builder [iqtree, fasttree (under development), raxml (under development)] [default: iqtree]

-a, --aligner-options <ALIGNER_OPTIONS>

Options for sequence aligner

-t, --tree-options <TREE_OPTIONS>

Options for tree builder; please adjust if using different tree method [default: "-m JTT+F+I+G -B 1000"]

-G, --gap-threshold <GAP_THRESHOLD>

Gap threshold for multiple sequence alignment [0 - 100] [default: 50]

--threads <THREADS>

Number of threads to use; 0 to use all [default: 0]

-v, --verbosity <VERBOSITY>

Verbosity (0: quiet, 1: +errors, 2: +warnings, 3: +info, 4: +debug) [default: 3]

-h, --help

Print help

テストラン

1、実行するには比較したいプロテオーム配列セットをディレクトリに配置する。ここではテストデータをダウンロードする。

wget https://unicore.steineggerlab.workers.dev/unicore_example.zip
unzip unicore_example.zip && rm unicore_example.zi

> ls -lt example/data/

各株のproteome.fa。レポジトリには拡張子は”.fasta”である必要があると書かれているが、”.fa”を認識した（要確認）。

２，foldseek databasesコマンドを使ってProstT5 weightsをダウンロード（ProstT5タンパク質言語モデルを用いてFASTAファイルから作られた構造データベース）

#tmpは作業ディレクトリ、weights/に保存する
foldseek databases ProstT5 weights tmp

ls -hl weights/

３，easy-coreモジュールを実行する。easy-coreモジュールは、入力プロテオームから構造コア遺伝子に基づいた系統樹を構築するまでの全ての過程を自動で進める。GPUを有効にするには-gフラグを付ける。

unicore easy-core -g example/data example/results weights tmp

-g Use GPU for foldseek createdb
-A Multiple sequence aligner [foldmason, mafft-linsi, mafft] [default: foldmason]
-T Phylogenetic tree builder [iqtree, fasttree (under development), raxml (under development)] [default: iqtree]

以下のステップがシングルコマンドで実行される。

createdb - 入力された生物種から3Di構造アルファベットデータベースを作成する。
cluster - Foldseekデータベースのクラスタリング
profile - 分類学的プロファイリングとコア遺伝子の同定
tree - 構造コア遺伝子を用いた系統推定

GPUが有効になっている場合、 foldseek createdb実行時にGPU使用率が100%に張り付く（RTX3090使用）。

出力

example/results/

- proteomeディレクトリ: 入力ファイルから解析されたプロテオーム情報

- clusterディレクトリ: foldseekクラスタリング結果(clust.tsv)

- profileディレクトリ: 分類学的プロファイリングの結果と、定義された構造コア遺伝子のメタデータ

- treeディレクトリ: 系統推論の結果

example/results/tree/iqtree.treefile: Newickフォーマットで表現された連結構造コア遺伝子ツリー。ツリーの各ノードは生物種を表し、入力プロテオームファイル名でラベルされている。

６つのゲノム（シアノバクテリア門のorderレベルで異なる６つのゲノム）のproteome.faを使ってunicore easy-coreを試した。prokkaでアノテーションした全てのproteome配列を使うとエラーが発生した。稀に存在する非常に長いORFでエラーが起きた可能性があっため（foldseek #288）、長い配列を除外してランしたところ、正常にランされた。