exonerate-gff3: a fork of exonerate that adds GFF3 output support

2023/01/05: Addendum added

2023/01/13: Corrected a misinterpretation of a parameter

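Exonerate is a tool for pairwise sequence comparison. It can align DNA to cDNA (EST) and DNA to protein. Depending on the alignment model, it performs either gapped or ungapped alignment. Similar software includes Wise2 (GeneWise), UCSC's BLAT (BLAST-like alignment tool), and the recently developed miniprot. These tools were developed in very different periods, and they differ in whether they scale to whole genomes and in accuracy (for example, how accurately they predict intron-exon boundaries). When a slow tool is used to align an entire proteome against a whole genome, some workaround is needed, such as first finding approximate mappings and then computing exact alignments.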

This post introduces a fork of exonerate that supports GFF3 output. With it, you can obtain GFF3-format annotation directly from exonerate without writing a conversion script.

EMBL-EBI exonerate manual

https://www.ebi.ac.uk/about/vertebrate-genomics/software/exonerate-manual

Installation

I downloaded exonerate-2.3.0-x86_64.tar.gz from the releases page and added its bin/ directory to my PATH (Ubuntu 18).

GitHub

git clone https://github.com/hotdogee/exonerate-gff3.git
cd exonerate-gff3/
./configure
make -j8
make install

> exonerate

exonerate from exonerate version 2.3.0

Using glib version 2.26.1

Built on Jul 1 2014

Branch: unnamed branch

exonerate: A generic sequence comparison tool

Guy St.C. Slater. guy@ebi.ac.uk. 2000-2008.

Examples of use:

1. Ungapped alignment of any DNA or protein sequences:

exonerate queries.fa targets.fa

2. Gapped alignment of Mouse proteins to Fugu proteins:

exonerate --model affine:local mouse.fa fugu.fa

3. Find top 10 matches of each EST to a genome:

exonerate --model est2genome --bestn 10 est.fa genome.fa

4. Find proteins with at least a 50% match to a genome:

exonerate --model protein2genome --percent 50 p.fa g.fa

5. Perform a full Smith-Waterman-Gotoh alignment:

exonerate --model affine:local --exhaustive yes a.fa b.fa

6. Many more combinations are possible. To find out more:

exonerate --help

man exonerate

General Options:

---------------

-h --shorthelp [FALSE] <TRUE>

--help [FALSE]

-v --version [FALSE]

Sequence Input Options:

----------------------

-q --query [mandatory] <*** empty list ***>

-t --target [mandatory] <*** empty list ***>

-Q --querytype [unknown]

-T --targettype [unknown]

--querychunkid [0]

--targetchunkid [0]

--querychunktotal [0]

--targetchunktotal [0]

-V --verbose [1]

Analysis Options:

----------------

-E --exhaustive [FALSE]

-B --bigseq [FALSE]

--forcescan [none]

--saturatethreshold [0]

--customserver [NULL]

Fasta Database Options:

----------------------

--fastasuffix [.fa]

Gapped Alignment Options:

------------------------

-m --model [ungapped]

-s --score [100]

--percent [0.0]

--showalignment [TRUE]

--showsugar [FALSE]

--showcigar [FALSE]

--showvulgar [TRUE]

--showquerygff [FALSE]

--showtargetgff [FALSE]

--gff3 [FALSE]

--ryo [NULL]

-n --bestn [0]

-S --subopt [TRUE]

-g --gappedextension [TRUE]

--refine [none]

--refineboundary [32]

Viterbi algorithm options:

-------------------------

-D --dpmemory [32]

Code generation options:

-----------------------

-C --compiled [TRUE]

Heuristic Options:

-----------------

--terminalrangeint [12]

--terminalrangeext [12]

--joinrangeint [12]

--joinrangeext [12]

--spanrangeint [12]

--spanrangeext [12]

Seeded Dynamic Programming options:

----------------------------------

-x --extensionthreshold [50]

--singlepass [TRUE]

BSDP algorithm options:

----------------------

--joinfilter [0]

Sequence Options:

----------------

-A --annotation [none]

Symbol Comparison Options:

-------------------------

--softmaskquery [FALSE]

--softmasktarget [FALSE]

-d --dnasubmat [nucleic]

-p --proteinsubmat [blosum62]

Alignment Seeding Options:

-------------------------

-M --fsmmemory [64]

--forcefsm [none]

--wordjump [1]

Affine Model Options:

--------------------

-o --gapopen [-12]

-e --gapextend [-4]

--codongapopen [-18]

--codongapextend [-8]

NER Model Options:

-----------------

--minner [10]

--maxner [50000]

--neropen [-20]

Intron Modelling Options:

------------------------

--minintron [30]

--maxintron [200000]

-i --intronpenalty [-30]

Frameshift Options:

------------------

-f --frameshift [-28]

Alphabet Options:

----------------

--useaatla [TRUE]

Translation Options:

-------------------

--geneticcode [1]

HSP creation options:

--------------------

--hspfilter [0]

--useworddropoff [TRUE]

--seedrepeat [1]

--dnawordlen [12]

--proteinwordlen [6]

--codonwordlen [12]

--dnahspdropoff [30]

--proteinhspdropoff [20]

--codonhspdropoff [40]

--dnahspthreshold [75]

--proteinhspthreshold [30]

--codonhspthreshold [50]

--dnawordlimit [0]

--proteinwordlimit [4]

--codonwordlimit [4]

--geneseed [0]

--geneseedrepeat [3]

Alignment options:

-----------------

--alignmentwidth [80]

--forwardcoordinates [TRUE]

SAR Options:

-----------

--quality [0]

Splice Site Prediction Options:

------------------------------

--splice3 [primate]

--splice5 [primate]

--forcegtag [FALSE]

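Usage

Specify the genome and protein FASTA files. To write the output in GFF3 format, add --gff3. The repository gives the following example, which appears to be intended for building gene models from the proteins of a closely related species.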

exonerate -q protein.fa -t genome.fa --model protein2genome --querytype protein --targettype dna --showvulgar no --softmaskquery yes --softmasktarget yes --minintron 20 --maxintron 3000 --showalignment no --showtargetgff yes --showcigar no --geneseed 250 --score 250 --verbose 0 --gff3 yes > out.gff3
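A quick way to see what the run produced is to tally the feature types in the output. The following is a minimal sketch, assuming out.gff3 is tab-delimited with the feature type in column 3 (as the GFF3 format defines); `count_gff3_types` is a hypothetical helper name, not part of exonerate.

```shell
# Hypothetical helper: count GFF3 features by type (column 3).
# Comment lines (starting with '#') and lines with fewer than 9
# tab-separated fields are skipped.
count_gff3_types() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Usage (file name taken from the command above):
#   count_gff3_types out.gff3
```

The feature type names that appear depend on the exonerate build and model, so the sketch does not assume any particular names.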

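--bestn 1: report only the best hit for each query (default 0)
--percent 80: score percentage cutoff of 80%; keeps only alignments whose actual score is at least 80% of the maximal score obtainable for that query (this is a score ratio, not a sequence identity cutoff)
--score 250: overall score threshold of 250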
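As an illustration of how these three options combine with the protein2genome run, here is a minimal sketch. The wrapper name `run_protein2genome` and its positional arguments are assumptions made for this example; every exonerate flag in it appears in the help output or the command above, and the threshold values are just the ones discussed here, not recommended settings.

```shell
# Hypothetical wrapper: protein-to-genome alignment with GFF3 output,
# keeping only the best hit per query that passes both score filters.
#   $1 = protein FASTA, $2 = genome FASTA, $3 = output GFF3 path
run_protein2genome() {
    exonerate -q "$1" -t "$2" --model protein2genome \
        --querytype protein --targettype dna \
        --bestn 1 --percent 80 --score 250 \
        --showalignment no --showvulgar no --showcigar no \
        --showtargetgff yes --verbose 0 --gff3 yes > "$3"
}

# Usage:
#   run_protein2genome protein.fa genome.fa out.best.gff3
```

Because --percent is relative to the maximal score for each query, a strict value can drop partial but genuine hits from divergent species, so it may be worth comparing the result with and without it.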

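Several models are available for DNA-protein alignment. The example here uses the protein2genome model, which is similar to GeneWise.

Addendum

To create an alignment hit file for use with EVM (EVidenceModeler), it is less trouble not to write GFF3 from exonerate itself, and instead to convert the plain exonerate output to EVM-compatible GFF3 with the script that ships with EVM.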

exonerate -q proteome.faa -t genome.fna --model protein2genome --querytype protein --targettype dna --showvulgar no --softmaskquery yes --softmasktarget yes --minintron 20 --maxintron 10000 --showalignment no --showtargetgff yes --showcigar no --geneseed 250 --score 250 --verbose 0 --gff3 no > exonerate_output
# Convert to EVM-compatible GFF3
git clone https://github.com/EVidenceModeler/EVidenceModeler.git
perl EVidenceModeler/EvmUtils/misc/Exonerate_to_evm_gff3.pl exonerate_output > exonerate_output.gff3
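
Before passing the converted file to EVM, it can help to confirm that it is structurally sound. The following is a minimal sketch, assuming only that a valid GFF3 record has exactly 9 tab-separated columns; `check_gff3_columns` is a hypothetical helper name and is not part of EVM or exonerate.

```shell
# Hypothetical helper: fail if any non-comment, non-empty line does not
# have exactly 9 tab-separated columns. Offending lines go to stderr.
check_gff3_columns() {
    awk -F'\t' '!/^#/ && NF > 0 && NF != 9 {
                    bad++
                    printf "line %d has %d columns\n", NR, NF > "/dev/stderr"
                }
                END { exit (bad > 0) }' "$1"
}

# Usage:
#   check_gff3_columns exonerate_output.gff3 && echo "column count OK"
```

This only checks the column count; it does not validate coordinates, strands, or the attribute format that EVM expects.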

Reference

Automated generation of heuristics for biological sequence comparison
Guy St C Slater & Ewan Birney
BMC Bioinformatics volume 6, Article number: 31 (2005)