2024-02-05

InterProScan 5

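From the 2014 paper

Robust large-scale sequence analysis is a major challenge in modern genomic science, where biologists are trying to characterize many millions of sequences. Here, we describe a new Java-based architecture for the widely used protein function prediction software package InterProScan. Developments include improvements and additions to the outputs of the software and a complete reimplementation of the software framework, resulting in a flexible and stable system that can use both multiprocessor machines and conventional clusters to achieve scalable distributed data analysis. InterProScan is freely available for download from the EMBL-EBI FTP site, and the open source code is hosted at Google Code. InterProScan is distributed via FTP at ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/ and the source code is available from http://code.google.com/p/interproscan/.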

Documentation

https://interproscan-docs.readthedocs.io/en/latest/

EMBL-EBI online service

https://www.ebi.ac.uk/interpro/search/sequence/

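This post walks through the steps for running it in a local environment.

Installation

Supported routes include downloading and installing the official program bundle, using the docker image, and installing with conda. Docker was used here.

From version 5.14-53.0 onward, InterProScan's documentation and releases are hosted on GitHub.

Program and data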

#way1 docker image (official)
#data used by the image (not bundled in the image itself)
curl -O http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.66-98.0/alt/interproscan-data-5.66-98.0.tar.gz
curl -O http://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.66-98.0/alt/interproscan-data-5.66-98.0.tar.gz.md5
md5sum -c interproscan-data-5.66-98.0.tar.gz.md5
tar -pxzf interproscan-data-5.66-98.0.tar.gz
#=> After extracting, stay in the current directory and continue with the test run below

#way2
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.66-98.0/interproscan-5.66-98.0-64-bit.tar.gz
wget https://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.66-98.0/interproscan-5.66-98.0-64-bit.tar.gz.md5

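#The connection drops easily, so verify the download against the checksum
md5sum -c interproscan-5.66-98.0-64-bit.tar.gz.md5
#Extract
tar -pxvzf interproscan-5.66-98.0-*-bit.tar.gz
#Index hmm models (run from inside the extracted directory)
cd interproscan-5.66-98.0
python3 setup.py -f interproscan.properties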

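#way3 conda: the newest package is the somewhat older v5.59 (it fails because its version does not match the current lookup data on the server; using it needs a workaround such as setting the "-dp" flag)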
mamba create -n interpro -y
conda activate interpro
mamba install bioconda::interproscan -y
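
The checksum step in way1 and way2 can also be done from Python, which is handy when the download is scripted. The following is a minimal sketch, assuming the .md5 file uses the usual md5sum layout (digest first, then the file name); the function names md5_of_file and matches_md5_file are made up for this illustration.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so multi-GB archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5_file(archive, md5_file):
    """Compare an archive with a '<digest>  <filename>' style .md5 file.

    Assumption: the first whitespace-separated token is the expected digest.
    """
    expected = Path(md5_file).read_text().split()[0].lower()
    return md5_of_file(archive) == expected
```

For example, matches_md5_file("interproscan-5.66-98.0-64-bit.tar.gz", "interproscan-5.66-98.0-64-bit.tar.gz.md5") should return True for an intact download; if it returns False, download the archive again.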

> ./interproscan.sh

05/02/2024 00:20:48:738 Welcome to InterProScan-5.59-91.0

05/02/2024 00:20:48:739 Running InterProScan v5 in STANDALONE mode... on Linux

usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar interproscan-5.jar

Please give us your feedback by sending an email to interhelp@ebi.ac.uk

-appl,--applications <ANALYSES> Optional, comma separated list of analyses. If this option is not set, ALL analyses will be run.

-b,--output-file-base <OUTPUT-FILE-BASE> Optional, base output filename (relative or absolute path). Note that this option, the --output-dir (-d) option and the --outfile (-o) option are mutually exclusive. The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.

-cpu,--cpu <CPU> Optional, number of cores for inteproscan.

-d,--output-dir <OUTPUT-DIR> Optional, output directory. Note that this option, the --outfile (-o) option and the --output-file-base (-b) option are mutually exclusive. The output filename(s) are the same as the input filename, with the appropriate file extension(s) for the output format(s) appended automatically .

-dp,--disable-precalc Optional. Disables use of the precalculated match lookup service. All match calculations will be run locally.

-dra,--disable-residue-annot Optional, excludes sites from the XML, JSON output

-etra,--enable-tsv-residue-annot Optional, includes sites in TSV output

-exclappl,--excl-applications <EXC-ANALYSES> Optional, comma separated list of analyses you want to exclude.

-f,--formats <OUTPUT-FORMATS> Optional, case-insensitive, comma separated list of output formats. Supported formats are TSV, XML, JSON, and GFF3. Default for protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3 and XML.

-goterms,--goterms Optional, switch on lookup of corresponding Gene Ontology annotation (IMPLIES -iprlookup option)

-help,--help Optional, display help information

-i,--input <INPUT-FILE-PATH> Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.

-incldepappl,--incl-dep-applications <INC-DEP-ANALYSES> Optional, comma separated list of deprecated analyses that you want included. If this option is not set, deprecated analyses will not run.

-iprlookup,--iprlookup Also include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.

-ms,--minsize <MINIMUM-SIZE> Optional, minimum nucleotide size of ORF to report. Will only be considered if n is specified as a sequence type. Please be aware of the fact that if you specify a too short value it might be that the analysis takes a very long time!

-o,--outfile <EXPLICIT_OUTPUT_FILENAME> Optional explicit output file name (relative or absolute path). Note that this option, the --output-dir (-d) option and the --output-file-base (-b) option are mutually exclusive. If this option is given, you MUST specify a single output format using the -f option. The output file name will not be modified. Note that specifying an output file name using this option OVERWRITES ANY EXISTING FILE.

-pa,--pathways Optional, switch on lookup of corresponding Pathway annotation (IMPLIES -iprlookup option)

-t,--seqtype <SEQUENCE-TYPE> Optional, the type of the input sequences (dna/rna (n) or protein (p)). The default sequence type is protein.

-T,--tempdir <TEMP-DIR> Optional, specify temporary file directory (relative or absolute path). The default location is temp/.

-verbose,--verbose Optional, display more verbose log output

-version,--version Optional, display version number

-vl,--verbose-level <VERBOSE-LEVEL> Optional, display verbose log output at level specified.

-vtsv,--output-tsv-version Optional, includes a TSV version file along with any TSV output (when TSV output requested)

software itself is provided under the Apache License, Version 2.0 (http://www.apache.org/licenses/LICENSE-2.0.html).

Third party components (e.g. member database binaries and models) are subject to separate licensing - please see the individual member database websites for details.

Available analyses:

TIGRFAM (15.0) : TIGRFAMs are protein families based on hidden Markov models (HMMs).

FunFam (4.3.0) : Prediction of functional annotations for novel, uncharacterized sequences.

SFLD (4) : SFLD is a database of protein families based on hidden Markov models (HMMs).

SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotations for all proteins and genomes.

PANTHER (17.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.

Gene3D (4.3.0) : Structural assignment for whole genes and genomes using the CATH domain structure database.

Hamap (2021_04) : High-quality Automated and Manual Annotation of Microbial Proteomes.

ProSiteProfiles (2022_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.

Coils (2.2.1) : Prediction of coiled coil regions in proteins.

SMART (7.1) : SMART allows the identification and analysis of domain architectures based on hidden Markov models (HMMs).

CDD (3.18) : CDD predicts protein domains and families based on a collection of well-annotated multiple sequence alignment models.

PRINTS (42.0) : A compendium of protein fingerprints - a fingerprint is a group of conserved motifs used to characterise a protein family.

PIRSR (2021_05) : PIRSR is a database of protein families based on hidden Markov models (HMMs) and Site Rules.

ProSitePatterns (2022_01) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them.

AntiFam (7.0) : AntiFam is a resource of profile-HMMs designed to identify spurious protein predictions.

Pfam (35.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs).

MobiDBLite (2.0) : Prediction of intrinsically disordered regions in proteins.

PIRSF (3.10) : The PIRSF concept is used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:

SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp

TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model

SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp

Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl

SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp

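Test run

Here is an example run using the official docker image. Note that the image does not include the data InterProScan needs to run; that data must be downloaded separately, as described above (way1). Download and extract it following way1. After extracting, without moving into the extracted directory, run steps 1 to 3 below in order. For the protein FASTA file, use your own or the test data. The test data sits in a subdirectory of the repository, so clone the repository into the current directory (1). On the first run, pull the docker image (2), then run the InterProScan 5 docker image (3). The run in step 3 mounts the data/ subdirectory of the InterProScan 5 directory prepared in way1 (interproscan-5.66-98.0/).

#1 Get the test data (this step can be skipped if you use your own protein.fasta)
git clone https://github.com/ebi-pf-team/interproscan.git #about 260 MB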

#2 pull image 
docker pull interpro/interproscan:5.66-98.0

#3 run
mkdir temp output
cp interproscan/core/jms-implementation/support-mini-x86-32/test_proteins.fasta output/
docker run --rm -v $PWD/interproscan-5.66-98.0/data:/opt/interproscan/data -v $PWD/output:/output -v $PWD/temp:/temp interpro/interproscan:5.66-98.0 -i /output/test_proteins.fasta -d /output -T /temp --cpu 16

-i Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
-f Optional, case-insensitive, comma separated list of output formats. Supported formats are TSV, XML, JSON, and GFF3. Default for protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3 and XML.
-T Optional, specify temporary file directory (relative or absolute path). The default location is temp/.
--cpu Optional, number of cores for inteproscan.
-d Optional, output directory. Note that this option, the --outfile (-o) option and the --output-file-base (-b) option are mutually exclusive. The output filename(s) are the same as the input filename, with the appropriate file extension(s) for the output format(s) appended automatically .
-dp Optional. Disables use of the precalculated match lookup service. All match calculations will be run locally.
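
When the run in step 3 has to be repeated for several FASTA files, it can help to assemble the docker command in a script instead of retyping it. The sketch below rebuilds the same argument list as the command above; the function name build_docker_command and its parameters are hypothetical, and the directory layout (interproscan-<version>/data, output/, temp/ under one working directory) is assumed to match the steps in this post.

```python
import shlex
from pathlib import PurePosixPath


def build_docker_command(workdir, fasta_name, version="5.66-98.0", cpu=16):
    """Return the docker argument list for one InterProScan run.

    Assumes the FASTA file was copied into <workdir>/output, as in step 3.
    """
    workdir = PurePosixPath(workdir)
    data_dir = workdir / f"interproscan-{version}" / "data"
    return [
        "docker", "run", "--rm",
        # Host directories mounted into the container
        "-v", f"{data_dir}:/opt/interproscan/data",
        "-v", f"{workdir / 'output'}:/output",
        "-v", f"{workdir / 'temp'}:/temp",
        f"interpro/interproscan:{version}",
        # Options passed through to InterProScan itself
        "-i", f"/output/{fasta_name}",
        "-d", "/output",
        "-T", "/temp",
        "--cpu", str(cpu),
    ]


if __name__ == "__main__":
    # Print the command for inspection; pass the list to subprocess.run to execute it.
    print(shlex.join(build_docker_command("/path/to/workdir", "test_proteins.fasta")))
```

Keeping the arguments as a list avoids shell quoting problems when paths contain spaces. Appending "-dp" to the list would disable the lookup service for that run.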

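InterProScan 5 accepts protein sequences in FASTA format. It also has a lookup feature for speed. With it, InterProScan contacts the EBI server over the network and checks each protein for an exact match; when one is found, the precomputed result is loaded and the runtime drops. If the server sits behind a firewall and cannot reach http://www.ebi.ac.uk, either install the lookup service locally or turn the service off. Installing it locally means downloading a huge data file covering more than 500 million UniProt sequences. The database is well over 1 TB uncompressed, and the bandwidth (from Japan, at least) is narrow, so the download might take months (see the manual and the notes below). For proteins from metagenomes and similar sources, exact matches are presumably rare, so disabling the feature barely changes the runtime; for proteins from bacteria with a sequenced genome, or from closely related strains, leaving it enabled should give a speedup.

Example output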

output/

The default output formats can be changed with the -f option.

test_proteins.fasta.tsv

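Because several databases are used as sources, a single protein is reported on multiple rows, one or more per source. Column D (the fourth column) is the source database; Pfam, PANTHER, CDD and others appear there.

The TSV output format is as follows (from the manual).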

Protein accession
Sequence MD5 digest
Sequence length
Analysis
Signature accession
Signature description
Start location
Stop location
Score
Status
Date
InterPro annotations - accession
InterPro annotations - description
GO annotations with their source
Pathways annotations

That makes 15 columns in total. Besides TSV, results are also written as XML, JSON, and GFF3.

From the manual
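
The column list above maps directly onto a small parser. The following is a minimal sketch, not part of InterProScan: the key names are my own labels for the 15 columns, and the padding step assumes that rows may carry fewer than 15 fields when the optional annotation columns are absent (harmless if they are always present).

```python
from collections import Counter

# Labels for the 15 TSV columns, in the order listed in the manual.
TSV_COLUMNS = [
    "protein_accession", "sequence_md5", "sequence_length", "analysis",
    "signature_accession", "signature_description", "start", "stop",
    "score", "status", "date", "interpro_accession", "interpro_description",
    "go_annotations", "pathways",
]


def parse_tsv_line(line):
    """Split one TSV row into a dict keyed by TSV_COLUMNS."""
    fields = line.rstrip("\n").split("\t")
    # Pad short rows so that every key is always present.
    fields += [""] * (len(TSV_COLUMNS) - len(fields))
    return dict(zip(TSV_COLUMNS, fields))


def count_by_analysis(lines):
    """Count rows per source database (the fourth column)."""
    return Counter(parse_tsv_line(line)["analysis"] for line in lines if line.strip())


if __name__ == "__main__":
    with open("output/test_proteins.fasta.tsv") as handle:
        for analysis, n_rows in count_by_analysis(handle).most_common():
            print(analysis, n_rows)
```

Counting rows per analysis is a quick way to see which member databases contributed hits for a given input.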

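The GFF3 format is a flat tab-delimited file that is much richer than the TSV output format. It allows matches to be traced back to the predicted protein and then to the nucleic acid sequence (GFF3, XML, and JSON output only). It also contains a FASTA-format representation of the predicted protein sequences and their matches (http://www.sequenceontology.org/gff3.shtml documents all the columns and attributes used).
InterProScan is a computationally intensive program, and characterizing a single sequence can take several minutes. InterProScan calculates matches to InterPro signatures based solely on the submitted amino acid sequence. Two identical amino acid sequences will therefore give the same output (if the sequences differ by even one residue, however, the output may or may not be the same). For this reason, InterProScan can be sped up by precomputing the matches for sequences already found in UniProtKB. This feature is ON by default: when a sequence is submitted, InterProScan calculates the MD5 checksum of the amino acid sequence and uses that checksum to query the InterProScan 5 Lookup Service and see whether the sequence is already there (note: this means the lookup will not work if a submitted protein sequence ends with *). To turn it off, add the "--disable-precalc" or "-dp" option to the command line. You can either use the EBI-hosted instance of the lookup service (the one enabled by default) or download a copy and run it locally (the files are very large, over 1 TB uncompressed).

A run on 3,000 metagenome-derived protein sequences took about 70 minutes (3990X, 16 CPUs specified, with the -dp flag). The same dataset took about 75 minutes without -dp.

If the server can reach EBI and the input sequences are likely to be in UniProtKB already, omitting "-dp" should finish the job considerably faster.

Citation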
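
The checksum-based lookup described above explains why a trailing * matters: the digest covers the whole sequence string, so a stop symbol left at the end produces a different digest. The sketch below illustrates the idea only. How InterProScan normalizes a sequence before hashing is not stated in this post, so the whitespace removal and upper-casing here are assumptions, and the function names are made up.

```python
import hashlib


def clean_sequence(seq):
    """Remove whitespace and any trailing '*' stop symbols from a protein sequence."""
    return "".join(seq.split()).rstrip("*")


def sequence_md5(seq):
    """MD5 hex digest of the cleaned, upper-cased sequence (assumed normalization)."""
    return hashlib.md5(clean_sequence(seq).upper().encode("ascii")).hexdigest()


if __name__ == "__main__":
    with_stop = "MKVLA*"
    # The raw string including '*' hashes differently from the cleaned sequence.
    print(hashlib.md5(with_stop.encode("ascii")).hexdigest())
    print(sequence_md5(with_stop))
```

Applying something like clean_sequence to every record before running InterProScan keeps gene-prediction output, which often ends each protein with *, eligible for the lookup.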

InterProScan 5: genome-scale protein function classification
Philip Jones, David Binns, Hsin-Yu Chang, Matthew Fraser, Weizhong Li, Craig McAnulla, Hamish McWilliam, John Maslen, Alex Mitchell, Gift Nuka, Sebastien Pesseat, Antony F. Quinn, Amaia Sangrador-Vegas, Maxim Scheremetjew, Siew-Yit Yong, Rodrigo Lopez, Sarah Hunter

Bioinformatics, Volume 30, Issue 9, May 2014, Pages 1236–1240

Citing InterPro

https://interpro-documentation.readthedocs.io/en/latest/citing.html