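TransAnnot: a fast transcriptome annotation pipeline

Annotating deeply sequenced, de novo assembled transcriptomes remains a challenge, because some state-of-the-art tools are slow, difficult to install, and hard to use. TransAnnot is a pipeline that automates transcriptome annotation at high speed and is easy to install and use. Leveraging the fast sequence searches provided by the MMseqs2 suite, TransAnnot annotates Swiss-Prot homologs, Gene Ontology terms and orthologous groups from eggNOG, and functional domains from Pfam in a single step. It also offers an option to annotate against a custom database. As input, TransAnnot accepts sequencing reads (short and long), nucleotide sequences, and amino acid sequences. In a benchmark on a test dataset of amino acid sequences, TransAnnot was 333, 284, and 18 times faster than the comparable tools EnTAP, Trinotate, and eggNOG-mapper, respectively.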

wiki

https://github.com/soedinglab/transannot/wiki

Installation

I downloaded the AVX2 binary from the releases page (Ubuntu 22.04 LTS, CPU: Xeon E5 v4). The releases also include builds for SSE4, ARM, macOS universal, and POWER processors.

GitHub

#linux AVX2 binary
wget https://github.com/soedinglab/transannot/releases/download/3-e15e316/transannot-linux-avx2.tar

#source
git clone https://github.com/soedinglab/transannot.git
cd transannot && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. ..
make -j 20
make install
export PATH=$(pwd)/bin/:$PATH
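
When it is unclear which release binary suits a machine, the CPU flags can be inspected first. The following is a minimal sketch, assuming a Linux host that exposes /proc/cpuinfo; the helper name pick_flavour and the labels it prints are hypothetical, and mapping the "SSE4" release to the sse4_1 CPU flag is an assumption.

```shell
# Suggest a build flavour from a space-separated CPU-flags string.
# pick_flavour and its output labels are hypothetical, for illustration only.
pick_flavour() {
  case " $1 " in
    *" avx2 "*)   echo "avx2" ;;
    *" sse4_1 "*) echo "sse4" ;;
    *)            echo "unsupported" ;;
  esac
}

# On Linux, read the flags of the first CPU entry if /proc/cpuinfo is readable.
if [ -r /proc/cpuinfo ]; then
  flags=$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)
  echo "suggested build: $(pick_flavour "$flags")"
fi
```

If the result is "unsupported" (for example on ARM, where /proc/cpuinfo has no flags line), building from source as shown above is the fallback.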

> transannot

TransAnnot - a fast transcriptome annotation pipeline

TransAnnot Version: 8cd2fdc032082204c552266aba186c4cc79f2328

usage: transannot <command> [<args>]

Easy workflows for plain text input/output

easytransannot Easy module for simple one-step reads assembly and transcriptome annotation

Main workflows for database input/output

assemblereads Assembly of de novo transcriptomes on protein level with PLASS

downloaddb Download protein database to run search against

User should download 3 databases: 2 profile DBs and 1 sequence DB.(see mmseqs databases)

Our recommendations are Pfam-A.full, eggNOG (profile DBs) and SwissProt (sequence DB)

annotate Run MMseqs2 searches to find homology, depending on obtained IDs get further information about transcriptome functions

createquerydb Create MMseqs database from assembled sequences (with transannot annotate or other tool)

annotatecustom Annotate using a custom, user-provided DB

An extended list of all modules can be obtained by calling 'transannot -h'.

> transannot createquerydb

usage: transannot createquerydb <i:fast[a|q]File> <o:sequenceDB> <tmpDir> [options]

options:

--threads INT Number of CPU-cores used (all by default) [56]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

examples:

MMseqs uses its own database format to avoid slowing down of the system, that is why if transcriptome is assembled not with PLASS, it is obligatory to create using MMseqs DB

Show an extended list of options by calling 'transannot createquerydb -h'.

Not enough input paths provided. 3 paths are required.

> transannot downloaddb -h

usage: transannot downloaddb <i:selection> <o:outDB> <tmpDir> [options]

By Mariia Zelenskaia mariia.zelenskaia@mpinat.mpg.de & Yazhini A. yazhini@mpinat.mpg.de

options: common:

--threads INT Number of CPU-cores used (all by default) [56]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

examples:

transannot downloaddb eggNOG outpath/eggNOGDB tmp

> transannot easytransannot

usage: transannot easytransannot <i:fast(a|q)File[.gz|bz]> | <i:fastqFile1_1[.gz]> ... <i:fastqFileN_1[.gz]> <i:targetDB> <i:targetDB> <i:targetDB> <o:outFile> <tmpDir> [options]

options:

-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]

--min-seq-id FLOAT List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.300]

--createdb-mode INT Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [1]

--compressed INT Write compressed output [0]

--threads INT Number of CPU-cores used (all by default) [56]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--simple-output BOOL Provide only query, target IDs and information from UniProt in the output file. No information about alignment (eg. sequence identity and bit score) [0]

--no-run-clust BOOL Per default there is linclust of mmseqs performed for the redundancy reduction. If you don't want it, provide this tag [0]

Show an extended list of options by calling 'transannot easytransannot -h'.

Not enough input paths provided. 6 paths are required.

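Databases

Databases such as Pfam-A.full, eggNOG, and UniProtKB/Swiss-Prot are available.

Databases that provide taxonomy information, such as GTDB and SILVA, can also be used. Note that some databases support only amino acid sequences (second column of the database list).

Download the chosen database with the transannot downloaddb command. Choose an output prefix that makes the database name obvious.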

usage: transannot downloaddb <i:selection> <o:outDB> <tmpDir> [options]

#Download eggNOG
transannot downloaddb eggNOG ./eggNOG_DB /tmp

#Pfam-A-full
transannot downloaddb Pfam-A.full ./Pfam-A_DB /tmp

#Swiss-Prot
transannot downloaddb UniProtKB/Swiss-Prot ./Swiss-Prot_DB /tmp

The database is downloaded to the specified path.
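
The three downloads above can be wrapped in one function that skips databases already present. This is only a sketch: db_prefix and download_default_dbs are hypothetical helper names, the prefixes mirror the examples above, and the existence check assumes that a file is created at the prefix path itself, as MMseqs2 databases normally do.

```shell
# Map a database selection name to the output prefix used in this post.
# db_prefix is a hypothetical helper, not part of TransAnnot.
db_prefix() {
  case "$1" in
    eggNOG)               echo "eggNOG_DB" ;;
    Pfam-A.full)          echo "Pfam-A_DB" ;;
    UniProtKB/Swiss-Prot) echo "Swiss-Prot_DB" ;;
    *) return 1 ;;
  esac
}

# Download the three recommended databases into outdir, skipping existing ones.
download_default_dbs() {
  outdir=$1
  tmpdir=$2
  for db in eggNOG Pfam-A.full UniProtKB/Swiss-Prot; do
    prefix="$outdir/$(db_prefix "$db")"
    if [ -e "$prefix" ]; then
      echo "skip $db: $prefix already exists"
    else
      transannot downloaddb "$db" "$prefix" "$tmpdir" || return 1
    fi
  done
}
```

Calling download_default_dbs . /tmp would reproduce the three commands shown above.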

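Usage

The following commands are provided.

assemblereads - de novo assembles raw sequencing reads
createquerydb - creates a database in MMseqs2 format for the query input sequences
downloaddb - downloads, in MMseqs2 format, the databases that are searched to annotate the query sequences
annotate - clusters the input sequences to reduce redundancy, then runs sequence-profile and sequence-sequence searches against the reference databases to obtain the closest homologs with annotated functions. It also maps descriptions of orthologous groups and protein families to the query sequences.
easytransannot - a convenience module that runs the complete TransAnnot workflow, from assembling the input reads through downloading the reference databases to writing the sequence annotations.
annotatecustom - annotates against a user-provided database instead of the default databases used by TransAnnot.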

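A  If you already have assembled transcripts (quality-filtered)

1. Convert the query sequences into an MMseqs2-format database. The second argument (queryDB in the command below) becomes the prefix of the output files.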

transannot createquerydb transcripts.fasta queryDB tmp

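Several files whose names start with queryDB are created. Pass this prefix in step 2.

2. Run transannot annotate to assign annotations. Specify the output of step 1, the downloaded databases, and the name of the output file.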

transannot annotate queryDB ./Pfam-A_DB ./eggNOG_DB ./Swiss-Prot_DB output.tsv tmp

Annotating 10,000 sequences took about 20 minutes (CPU: dual Xeon E5 v4).
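
Steps 1 and 2 can be chained so that annotation only starts if the query database was created successfully. A minimal sketch, assuming the three databases were downloaded to the prefixes used earlier in this post; annotate_transcripts is a hypothetical wrapper name.

```shell
# Build the query DB from a FASTA file, then annotate it against the three
# downloaded databases. annotate_transcripts is a hypothetical wrapper.
annotate_transcripts() {
  fasta=$1
  out=$2
  tmpdir=${3:-tmp}
  # "&&" stops the chain if createquerydb fails.
  transannot createquerydb "$fasta" queryDB "$tmpdir" &&
  transannot annotate queryDB ./Pfam-A_DB ./eggNOG_DB ./Swiss-Prot_DB "$out" "$tmpdir"
}
```

For example, annotate_transcripts transcripts.fasta output.tsv runs both steps with tmp as the temporary directory.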

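Example output

output.tsv

The rightmost column gives the name of the source database.

Because three databases were used, a single transcript can appear on multiple rows.

With the "--simple-output" parameter, you get a simplified output containing only the query ID, target ID, target database header, and E-value for each query sequence (see the repository).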
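
Since the source database is in the last column, a quick per-database tally of hits can be obtained with awk. This is a sketch that assumes a tab-separated file; count_by_source is a hypothetical helper name, and if the file starts with a header line, that line is counted as one extra entry under its own column name.

```shell
# Count rows per value of the last tab-separated column (the database source).
# count_by_source is a hypothetical helper, not part of TransAnnot.
count_by_source() {
  awk -F '\t' 'NF > 0 { n[$NF]++ }
       END { for (s in n) printf "%s\t%d\n", s, n[s] }' "$1" | LC_ALL=C sort
}
```

Running count_by_source output.tsv prints one line per database with the number of rows it contributed.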

B transannot easytransannot

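The transannot easytransannot command runs the whole analysis from FASTQ in a single command, including the database downloads. The PLASS assembler (introduced separately) is used for assembly. By default, all available CPUs are used.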

transannot easytransannot <inputReads.fastq> Pfam-A.full eggNOG UniProtKB/Swiss-Prot <resDB> <tmp> [options]

C transannot assemblereads

Assembles the input reads with PLASS and returns the translated protein sequences. PLASS requires reads at least 100 nt long.

transannot assemblereads <inputReads.fastq[.gz|bz]> ... <inputReads.fastq[.gz|bz]> <o: fastaFile with assembly> <o: seqDB> tmp

According to the documentation, transcriptome reads, metatranscriptome reads, and single-cell transcriptome reads can also be used.
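
Because PLASS needs reads of at least 100 nt, it can be worth checking the read lengths before assembling. The sketch below assumes an uncompressed FASTQ file with exactly four lines per record; short_read_stats is a hypothetical helper name.

```shell
# Report how many reads in a 4-line-per-record FASTQ are shorter than a
# threshold (default 100 nt). short_read_stats is a hypothetical helper.
short_read_stats() {
  awk -v min="${2:-100}" '
    NR % 4 == 2 { total++; if (length($0) < min) nshort++ }
    END { printf "%d reads, %d shorter than %d nt\n", total, nshort, min }
  ' "$1"
}
```

For a gzipped file, the reads can be piped in and "-" given as the file name, for example gzip -dc reads.fastq.gz | short_read_stats -.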

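Other notes

Before annotating with TransAnnot, it is far preferable to translate the assembly with a tool such as TransDecoder, because the search is then much faster. In that case, pass the input FASTA file of translated amino acid sequences to transannot createquerydb, and give the resulting query DB to transannot annotate as input.
By default, three databases are used: manually reviewed homologs (SwissProt), finer-grained orthologs (eggNOG), and domains (Pfam-A). Together they provide comprehensive annotation.
With annotatecustom, you can also annotate against a user-defined custom database.
The tmp folder stores temporary files. By default, all intermediate output files from the different modules are kept in this folder. To clear tmp, pass the --remove-tmp-files parameter [bool].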
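
The translate-first route mentioned in the first note could look roughly like this. It is a sketch under several assumptions: TransDecoder is installed and on PATH, TransDecoder.Predict writes its peptide file to the current directory as <input basename>.transdecoder.pep (its usual behaviour), and the databases sit at the prefixes used earlier in this post. translate_then_annotate is a hypothetical wrapper name.

```shell
# Translate transcripts with TransDecoder, then annotate the peptides.
# translate_then_annotate is a hypothetical wrapper, not part of TransAnnot.
translate_then_annotate() {
  fasta=$1
  out=$2
  tmpdir=${3:-tmp}
  # Assumed TransDecoder output name: <basename>.transdecoder.pep in the cwd.
  pep="$(basename "$fasta").transdecoder.pep"
  TransDecoder.LongOrfs -t "$fasta" &&
  TransDecoder.Predict -t "$fasta" &&
  transannot createquerydb "$pep" queryDB "$tmpdir" &&
  transannot annotate queryDB ./Pfam-A_DB ./eggNOG_DB ./Swiss-Prot_DB "$out" "$tmpdir"
}
```

Each step runs only if the previous one succeeded, so a failed ORF prediction does not leave a half-built query database behind.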