A workflow for rapid functional annotation of diverse arthropod genomes (interproscan)

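Genomic technologies are accumulating information about genes faster than ever before, and sequencing initiatives such as the Earth BioGenome Project, i5k, and the Ag100Pest Initiative are expected to accelerate this pace further. However, to use genome sequencing to improve human health and agriculture and to understand biological systems, genes must be identified and their contribution to biological outcomes understood. Several established workflows exist for assembling genome sequences and identifying genes, but generating actionable knowledge also requires understanding gene function. In addition, the functional annotation process must be easily accessible and provide information at genome scale to keep up with new sequence data. The authors report a well-defined workflow for rapid functional annotation of whole proteomes that produces Gene Ontology and pathway information. They tested the workflow on a diverse set of arthropod genomes, compared the results with common arthropod reference genomes, and found it well suited to the functional annotation of arthropod genomes. The workflow is available through the CyVerse web interface and as biocontainers that can be deployed scalably on local computing systems.
(partly omitted)
The AgBase and i5k Workspace@NAL databases serve the arthropod genomics community by providing access and curation tools for arthropod proteomes and genomes, respectively. Here, the authors report containerized workflows built to meet the need for high-throughput functional annotation of proteins from eukaryotic genome sequencing programs. The workflows were validated on 12 sequenced genomes that span a broad range of invertebrate classes and differ in assembly quality and sequencing technology. Proteins from these genomes were compared with three reference species (Drosophila melanogaster, honey bee, and a beetle) that have GO annotations based on experimental evidence. To make reuse easier, the workflows are also available on CyVerse through a user-friendly web-based interface.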

AgBase Documentation

https://agbase-docs.readthedocs.io/en/latest/interproscan/intro.html

(reproduced from the manual)

Docker images for the workflow

GOanna: assigns GO terms based on sequence homology to specialized BLAST databases. It accepts a protein FASTA file as input and outputs the results as a gene association file (GAF).

　　https://hub.docker.com/r/agbase/goanna

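interproscan: InterPro is a database that integrates protein function prediction information from the many partner resources of the InterPro consortium. InterProScan takes a FASTA file, identifies motifs and domains from the InterPro protein databases, and maps them to GO terms and pathways. The results are output as a GAF file.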

　　https://hub.docker.com/r/agbase/interproscan

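kobas: the KEGG Orthology Based Annotation System (KOBAS) assigns input proteins to known KEGG pathways. It also has a gene set enrichment function that finds genes statistically enriched in a disease or experimental condition against the background of all annotated proteins in the organism.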

　　https://hub.docker.com/r/agbase/kobas

combine_gafs: merges the GAF output from GOanna and interproscan.

　　https://hub.docker.com/r/agbase/combine_gafs

This post introduces AgBase's interproscan.

Installation

Github

#dockerhub
docker pull agbase/interproscan:5.45-80_3

> docker run --rm agbase/interproscan:5.45-80.0_1 -h

Options:

-a <ANALYSES> Optional, comma separated list of analyses. If this option

is not set, ALL analyses will be run.

-b <OUTPUT-FILE-BASE> Optional, base output filename (relative or absolute path).

Note that this option, the output directory (-d) option and

the output file name (-o) option are mutually exclusive. The

appropriate file extension for the output format(s) will be

appended automatically. By default the input file

path/name will be used.

-d <OUTPUT-DIR> Optional, output directory. Note that this option, the

output file name (-o) option and the output file base (-b) option

are mutually exclusive. The output filename(s) are the

same as the input filename, with the appropriate file

extension(s) for the output format(s) appended automatically .

-c Optional. Disables use of the precalculated match lookup

service. All match calculations will be run locally.

-C Optional. Supply the number of cpus to use.

-e Optional, excludes sites from the XML, JSON output

-f <OUTPUT-FORMATS> Optional, case-insensitive, comma separated list of output

formats. Supported formats are TSV, XML, JSON, GFF3, HTML and

SVG. Default for protein sequences are TSV, XML and

GFF3, or for nucleotide sequences GFF3 and XML.

-g Optional, switch on lookup of corresponding Gene Ontology

annotation (IMPLIES -l lookup option)

-h Optional, display help information

-i <INPUT-FILE-PATH> Optional, path to fasta file that should be loaded on

Master startup. Alternatively, in CONVERT mode, the

InterProScan 5 XML file to convert.

-l Also include lookup of corresponding InterPro

annotation in the TSV and GFF3 output formats.

-m <MINIMUM-SIZE> Optional, minimum nucleotide size of ORF to report. Will

only be considered if n is specified as a sequence type.

Please be aware of the fact that if you specify a too

short value it might be that the analysis takes a very long

time!

-o <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file name (relative or absolute

path). Note that this option, the output directory -d option

and the output file basename -b option are mutually

exclusive. If this option is given, you MUST specify a

single output format using the -f option. The output file

name will not be modified. Note that specifying an output

file name using this option OVERWRITES ANY EXISTING FILE.

-p Optional, switch on lookup of corresponding Pathway

annotation (IMPLIES -l lookup option)

-t <SEQUENCE-TYPE> Optional, the type of the input sequences (dna/rna (n)

or protein (p)). The default sequence type is protein.

-T <TEMP-DIR> Optional, specify temporary file directory (relative or

absolute path). The default location is temp/.

-v Optional, display version number

-r Optional. 'Mode' required ( -r 'cluster') to run in cluster mode. These options

are provided but have not been tested with this wrapper script. For

more information on running InterProScan in cluster mode:

https://github.com/ebi-pf-team/interproscan/wiki/ClusterMode

-R Optional. Clusterrunid (crid) required when using cluster mode.

-R unique_id

Available analyses:

TIGRFAM (XX.X) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs

SFLD (X.X) : SFLDs are protein families based on Hidden Markov Models or HMMs

ProDom (XXXX.X) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.

Hamap (XXXXXX.XX) : High-quality Automated and Manual Annotation of Microbial Proteomes

SMART (X.X) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs

CDD (X.XX) : Prediction of CDD domains in Proteins

ProSiteProfiles (XX.XXX) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

ProSitePatterns (XX.XXX) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them

SUPERFAMILY (X.XX) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.

PRINTS (XX.X) : A fingerprint is a group of conserved motifs used to characterise a protein family

PANTHER (X.X) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.

Gene3D (X.X.X) : Structural assignment for whole genes and genomes using the CATH domain structure database

PIRSF (X.XX) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Pfam (XX.X) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)

Coils (X.X) : Prediction of Coiled Coil Regions in Proteins

MobiDBLite (X.X) : Prediction of disordered domains Regions in Proteins

OPTIONS FOR XML PARSER OUTPUTS

-F <IPRS output directory> This is the output directory from InterProScan.

-D <database> Supply the database responsible for these annotations.

-x <taxon> NCBI taxon ID of the ID being annotated

-y <type> Transcript or protein

-n <biocurator> Name of the biocurator who made these annotations

-M <mapping file> Optional. Mapping file.

-B <bad seq file> Optional. Bad input sequence file.

It is also available on CyVerse.

https://de.cyverse.org/apps/agave/Interproscan-5.36.75u3/launch

Preparing the database

mkdir interproscan && cd interproscan/
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.45-80.0/alt/interproscan-data-5.45-80.0.tar.gz
tar -pxvzf interproscan-data-5.45-80.0.tar.gz

cd interproscan-5.45-80.0/data
wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/data/panther-data-14.1.tar.gz
tar -pxvzf panther-data-14.1.tar.gz

When running, specify interproscan-5.45-80.0/data by its full path.

How to run

Share the host's interproscan/ directory with /data in the container, and share the host's interproscan/interproscan-5.45-80.0/data with /opt/interproscan/data.

cd interproscan/
cp <path>/<to>/protein.faa ./ # copy the proteome into the current directory

#run
sudo docker run \
-v <path>/<to>/interproscan:/data \
-v <path>/<to>/interproscan/interproscan-5.45-80.0/data:/opt/interproscan/data \
agbase/interproscan:5.45-80.0_1 \
-d outdir_10000 \
-i /data/protein.faa \
-f tsv,json,xml,html,gff3,svg \
-g \
-p \
-c \
-n Amanda \
-x 109069 \
-D AgBase \
-l

-i input FASTA file
-d output directory name
-f desired output file formats
-g tells the tool to perform GO annotation
-p tells the tool to perform pathway annotation
-c tells the tool to compute locally and not connect to EBI
-n name of the biocurator to include in column 15 of the GAF output file
-x taxon ID of the query species to be used in column 13 of the GAF output file
-D database of the query accession to be used in column 1 of the GAF output file
-l tells the tool to include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats
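
The two volume mounts and the full-path requirement can be checked before launching the container. The following is a minimal sketch, not part of the AgBase tooling: the function name ipr_mounts is hypothetical, and the default release directory name interproscan-5.45-80.0 is assumed from the download step. It resolves the host working directory to an absolute path, confirms that the data directory exists, and prints the two -v arguments.

```shell
# Hypothetical helper (not part of the AgBase image): print the two -v
# arguments used in the docker run command, with absolute host paths.
#   $1  host working directory (the interproscan/ directory)
#   $2  release directory name (assumed default: interproscan-5.45-80.0)
ipr_mounts() {
    release=${2:-interproscan-5.45-80.0}
    # Resolve the working directory to an absolute path; fail if missing.
    abs=$(CDPATH= cd -- "$1" 2>/dev/null && pwd -P) || {
        echo "working directory not found: $1" >&2
        return 1
    }
    # The extracted member databases must be present under <release>/data.
    if [ ! -d "$abs/$release/data" ]; then
        echo "data directory not found: $abs/$release/data" >&2
        return 1
    fi
    printf '%s\n' "-v $abs:/data -v $abs/$release/data:/opt/interproscan/data"
}
```

Calling ipr_mounts "$PWD" from inside interproscan/ prints the mount arguments, which can be pasted into the docker run command or substituted with $(ipr_mounts "$PWD") as long as the paths contain no spaces.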

By default, all available CPU cores are used. Annotating 30,000 sequences took about one hour (CPU: TR3990X).

Output

protein_gaf

It follows the GAF format and can be used for GO enrichment analysis.
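
As a quick sanity check before enrichment analysis, the annotations in a GAF file can be tallied by GO aspect. This is a minimal sketch: the function name gaf_aspect_counts is hypothetical, and it relies only on the GAF convention that lines starting with "!" are comments and that column 9 holds the aspect code.

```shell
# Count GO annotations per aspect in a GAF file.
# Column 9 of GAF holds the aspect: P (biological process),
# F (molecular function) or C (cellular component).
gaf_aspect_counts() {
    awk -F '\t' '
        /^!/    { next }        # skip header/comment lines
        NF >= 9 { n[$9]++ }     # tally the aspect code
        END     { printf "P\t%d\nF\t%d\nC\t%d\n", n["P"], n["F"], n["C"] }
    ' "$1"
}
```

Pass the path of the GAF file in the output directory as the only argument. The function prints three tab-separated lines, one per aspect, with zero for any aspect that has no annotations.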

protein_acc_go_counts

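The input accessions in the leftmost column are the sequence names of the proteome used. The file contains the number of GO IDs assigned to each accession and the GO ID names. GO IDs are listed separately for BP, MF, and CC.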

protein_acc_interpro_counts

Lists the input accessions (proteome sequence names), the number of InterPro IDs for each accession, the InterPro IDs assigned to each sequence, and the InterPro ID names.

protein_acc_pathway_counts

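A table containing the input accessions (proteome sequence names), the number of pathway IDs for each accession, and the pathway names. Multiple names are separated by semicolons.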

protein_go_counts

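Lists the number and names of the sequences assigned to each GO ID. The rightmost column holds the sequence names of the proteome used. It is designed so that all genes assigned to a particular function can be identified quickly.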

protein_interpro_counts

Lists the number of input accessions (proteome sequence names) assigned to each InterPro ID, so that all genes with a particular motif can be identified quickly.

protein_pathway_counts

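Lists the number of input accessions assigned to each pathway and the accession names (proteome sequence names). It is designed so that all genes assigned to a pathway can be identified quickly.

protein.tsv collects all results in tab-separated form.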
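
Because protein.tsv is tab-separated, it can be summarized with standard tools. The sketch below counts hits per member database. It assumes that protein.tsv follows the standard InterProScan TSV layout, in which column 4 names the analysis (for example Pfam or SMART). The function name ipr_hits_per_analysis is hypothetical.

```shell
# Count InterProScan hits per member database in a TSV result file.
# Assumes the standard InterProScan TSV layout, where column 4 names the
# analysis (for example Pfam or SMART).
ipr_hits_per_analysis() {
    awk -F '\t' '
        NF >= 4 { n[$4]++ }
        END     { for (a in n) printf "%s\t%d\n", a, n[a] }
    ' "$1" | LC_ALL=C sort
}
```

Running ipr_hits_per_analysis on the protein.tsv in the output directory prints one line per analysis with its hit count, sorted by analysis name.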

Citation

Workflows for Rapid Functional Annotation of Diverse Arthropod Genomes
Surya Saha, Amanda M. Cooksey, Anna K. Childers, Monica F. Poelchau, Fiona M. McCarthy

Insects 2021, 12(8), 748