遺伝子ファミリーの起源を推論する GenEra

　GenEra（https://github.com/josuebarrera/GenEra）は、DIAMONDを用いたgene-family founder inference framework（遺伝子ファミリーの起源となる遺伝子の推論フレームワーク）で、ゲノム系統分類における相同性検出の失敗など、これまで指摘されてきた限界やバイアスに対処するものである。また、GenEraは計算時間を数ヶ月から数日に短縮し、あらゆるゲノムを対象としている。本著者らは、植物、動物、菌類の主要な進化的遷移における分類学的に制限された遺伝子ファミリーの出現を分析した。その結果、相同性検出の失敗が推定される遺伝子出現のパターンに与える影響は系統依存的であり、植物は動物や菌類に比べて新しい遺伝子の出現による新規性の進化が起こりやすいことが示唆された。

インストール

公開されている公式dockerイメージを使用した。

依存

DIAMOND v2.0.0 or higher
Foldseek v3.915ef7d or higher
NCBItax2lin
MCL
MMseqs2 (optional for protein-against-nucleotide sequence search)
abSENSE (optional to calculate homology detection failure probabilities)
NumPy and SciPy (needed to run abSENSE in step 4)
R alongside the libraries optparse, Bio3D, Tidyverse, and SeqinR (optional for JackHMMER reassessment)

Github

#dockerhub
docker pull josuebarrera/genera:latest

> genEra -h

# genEra -h

genEra v1.1.1 (C) Max Planck Society for the Advancement of Science

BASIC USAGE

genEra -q [query_sequences.fasta] -t [query_taxid] -b [path/to/nr] -d [path/to/taxdump]

genEra -Q [query_structure_directory] -t [query_taxid] -B [path/to/AlphaFold_DB] -d [path/to/taxdump]

DESCRIPTION

genEra is an easy-to-use, low-dependency command-line tool that

estimates the age of the earliest common ancestor of protein

coding genes though genomic phylostratigraphy.

MANDATORY ARGUMENT ALWAYS

-t NCBI Taxonomy ID of query species (search for the taxid of

your query species at https://www.ncbi.nlm.nih.gov/taxonomy)

OPTIONS FOR MANDATORY INPUT FILE

-q File with query protein sequences in FASTA format to perform

pairwise sequence alignments using DIAMOND

-Q Directory with query structural predictions in PDB format

to perform pairwise structural alignments using Foldseek

OPTIONS FOR MANDATORY DATABASE

-b Path to a locally installed database for DIAMOND (FASTA only)

-B Path to a locally installed database for Foldseek (PDB only)

-p Pre-generated DIAMOND/Foldseek table (skip step 1), with the

query genes in the first column, the bitscore in the second

to last column and the target taxid in the last column

(IMPORTANT: the query sequences must also be searched

against themselves for genEra to work properly)

OPTIONS FOR MANDATORY TAXONOMY FILE

-d Location of the uncompressed taxonomy dump from the NCBI

(ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz)

-r Raw "ncbi_lineages" file generated by ncbitax2lin

(saves time on step 2)

-c Custom "ncbi_lineages" file that is already tailored for the

query species (skip step 2). The table should be arranged so

that the taxid is in the first column and all the phylostrata

of interest are organized from the species level all the way

back to "cellular organisms"

IMPORTANT OPTIONAL ARGUMENTS

-n Number of threads to run genEra (genEra can run with a single

thread, but it is HIGHLY suggested to use as many threads as

possible) (default: 20 threads)

-o Directory to save output files (default: working directory)

-s Table with pairwise evolutionary distances (substitutions/site)

between several species in the database and the query species

(necessary to calculate homology detection failure probabilities

with abSENSE). NOTE: the query species SHOULD be included in

this table. The table should be tab-delimited and have the

following format:

query_sp_taxid 0

species_1_taxid distance_1

species_2_taxid distance_2

-a Table with additional proteins to be included in the analysis

(e.g., proteins from species that are absent from the nr).

Use this option when using FASTA sequences as input.

The table should be tab-delimited and have the following format:

/path/to/species_1.fasta taxid_1

/path/to/species_2.fasta taxid_2

/path/to/species_3.fasta taxid_3

-f Table with additional nucleotide sequences to search against

your query proteins. Particularly useful with genome assemblies

for improved orphan gene classification. Use this option when

using FASTA sequences as input. The table should be

tab-delimited and have the following format:

/path/to/genome_1.fasta taxid_1

/path/to/genome_2.fasta taxid_2

/path/to/genome_3.fasta taxid_3

-A Table with additional structural predictions to be included

in the analysis (e.g., structural predictions from species

that are absent from the AlphaFold DB). Use this option when

using PDB files as input. The table should be tab-delimited

and have the following format:

/path/to/species_directory_1 taxid_1

/path/to/species_directory_2 taxid_2

/path/to/species_directory_3 taxid_3

-i When true, prints an additional output file with the best

sequence hit responsible for the oldest gene age assignment

for each of the query genes (default: false)

FINE-TUNNING ARGUMENTS (DEFAULT IS USUALLY FINE)

-l Taxonomic representativeness threshold below which a gene will

be flagged as putative genome contamination or the product of

a horizontal gene transfer (HGT) event (default: 30)

-e E-value threshold for DIAMOND and Foldseek (default: 1e-5)

-u Additional options to feed DIAMOND or Foldseek, based on user

preferences (filtering hits by identity or query coverage)

Users should input the additional commands in quotes, using

the original arguments from DIAMOND (Example: -o "--id 30")

or from Foldseek (Example: -o "-c 0.8 --cov-mode 0")

-m Minimum percentage of matches between your query sequences

and any species within a taxonomic level to consider it useful

for the gene age assignment (i.e., filtering taxonomic levels

lacking complete genomes in the database) (default: 10)

-x Alternative path where you would like to store the temporary

files as well as the DIAMOND/Foldseek results (warning: genEra

will generate HUGE temporary files) (default: the files will

be stored in a tmp_[RAMDOMNUM]/ directory created by genEra)

-y Modify the sensitivity parameter in DIAMOND for faster

or more sensitive results in step 1 (default: sensitive)

-M Adjust the amount of prefilter made by Foldseek (i.e., the

maximum number of hits that are reported) (default: 10000)

-j When true, run an additional search step using jackhmmer

against the online Uniprot database (default: false)

(warning 1: Only available when using FASTA sequences)

(warning 2: this step requires internet connection)

-k Starting taxonomic rank of genes that will be re-analyzed

using jackhmmer (default: 2)

-h Print this help message and exit

データベースの準備

nrのDIAMOND D.Bを作成する。prot.accession2taxidとtaxdumpなども指定し、ヒット時に分類学的情報も得られるようにする。

#1 nrデータベースをダウンロード
wget ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr{.gz,.gz.md5} && md5sum -c *.md5
gunzip nr.gz

#2 prot.accession2taxid（タンパク質アクセッションとNCBI taxIDを関連づけたファイル）をダウンロード
wget ftp://ftp.ncbi.nih.gov:21/pub/taxonomy/accession2taxid/prot.accession2taxid.gz && gunzip prot.accession2taxid.gz

#3 taxdumpファイル（NCBIの生物種や分類群の識別子、階層的な関係などをまとめたファイル）をダウンロード
wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump

#4 DIAMOND DB作成（diamondのバージョンに注意。dockerのはdiamond version 2.1.6）
diamond makedb \
 --in nr \
 --db nr \
 --taxonmap prot.accession2taxid \
 --taxonnodes taxdump/nodes.dmp \
 --taxonnames taxdump/names.dmp \

データベース作成には３時間ほどかかった（xeon E5 v4 2680 dual、HDD環境）。

ファイルサイズ

DIAMONDの代わりにfoldseekのAlphaFold DB（UniProtの配列）も使うこともできる。foldseekのコマンドでダウンロードできる。ただしダウンロード途中で失敗することがあり、その場合、不完全な圧縮ファイルから残りのコマンドを実行しようとしてエラーになる。エラーになったダウンロードコマンドだけwwget に-cをつけて続きからダウンロードした方が無難。

foldseek databases Alphafold/UniProt alphafoldDB tmp

#DIAMOND　DBの方でダウンロードしてないならtaxdumpもダウンロードする
wget -N ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
mkdir -p taxdump && tar zxf taxdump.tar.gz -C ./taxdump

カレントにalphafoldDB~ができる。

実行方法

クエリとしてタンパク質の一次配列のファイルを使う方法と、構造ファイルを使う方法がある。

１、クエリとしてタンパク質配列を使う

現在のパスにクエリのfastaファイル、データベースのnr.dmnd、taxdumpディレクトリがあるとして、以下のように実行する。クエリタンパク質の生物のNCBI taxidも必要。

#ここではdockerを立ち上げてランする
docker run -itv $PWD:/data -w /data --rm josuebarrera/genera

#genEraのラン
genEra -q query_protein.faa -t <query_taxid> -b nr -d taxdump/ -n 40 -o outdir

-q File with query protein sequences in FASTA format to perform pairwise sequence alignments using DIAMOND
-t NCBI Taxonomy ID of query species (search for the taxid of your query species at https://www.ncbi.nlm.nih.gov/taxonomy)
-i When true, prints an additional output file with the best sequence hit responsible for the oldest gene age assignment for each of the query genes (default: false)
-n Number of threads to run genEra (genEra can run with a single thread, but it is HIGHLY suggested to use as many threads as possible) (default: 20 threads)
-o Directory to save output files (default: working directory)

出力例（1クエリで100分くらいかかった。docker環境、20スレッド使用）

出力について

https://github.com/josuebarrera/GenEra#output-files

２、クエリとして構造ファイルを使う

まずAF2やOmegafoldをランする。

omegafold query_sequences.fasta output_directory

出力されたPDB形式のファイルをディレクトリに配置する。圧縮されていてはならない。

genEraをランする。PDBファイルのディレクトリとAF2 DBを使う以外は１と同じ。

#dockerを立ち上げてランする
docker run -itv $PWD:/data -w /data --rm josuebarrera/genera

#genEraのラン
genEra -Q query_structure_dir/ -t <query_taxid> -B alphafoldDB -d taxdump/ -n 40 -o outdir

-Q Directory with query structural predictions in PDB format to perform pairwise structural alignments using Foldseek
-B Path to a locally installed database for Foldseek (PDB only)

オプションで、NCBIのnrにない配列、例えば新しくゲノムを決定した生物のゲノムやタンパク質、そして構造ファイルなどを加えることもできるようになっています。また、レポジトリには発展的な使い方に関する説明もあります。確認してください。

引用

Uncovering gene-family founder events during major evolutionary transitions in animals, plants and fungi using GenEra

Josué Barrera-Redondo, Jaruwatana Sodai Lotharukpong, Hajk-Georg Drost & Susana M. Coelho
Genome Biology volume 24, Article number: 54 (2023)