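AEGeAn Toolkit: working with GFF3 genome annotation files

From the manual

AEGeAn Toolkit began as several distinct but related efforts to build tools for managing and analyzing whole-genome gene structure annotations. AEGeAn brings these efforts together into a single library that includes not only executable programs but also a number of data structures and modules callable through a C API. The toolkit leverages the parsers, data structures, and graphics capabilities available from the GenomeTools library (http://genometools.org).

There are currently four primary tools. The manual notes that other tools are under development and will be released once they are more stable.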

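ParsEval is a program for comparing distinct sets of annotations for the same sequences.
CanonGFF3 is a program for preprocessing GFF3 data. It validates features associated with protein-coding genes.
LocusPocus is a program for computing interval loci (iLoci) from one or more sets of gene predictions. In the ParsEval paper, a locus is defined as the smallest genomic region that contains all genes overlapping any other gene within that region.
GAEVAL computes coverage and integrity scores for gene models using transcript alignments. The integrity score is a value between 0 and 1 indicating how well a gene model agrees with its associated transcript alignments: 0 corresponds to no transcript support and 1 to full transcript support. The GAEVAL program is based on a more comprehensive Perl module of the same name that was in production at PlantGDB for many years, but development of that module is no longer supported.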

Documentation

https://aegean.readthedocs.io/en/stable/

About GFF3

https://aegean.readthedocs.io/en/stable/gff3.html

Installation

GitHub

mamba create -n aegean
conda activate aegean
mamba install -c bioconda aegean -y

> parseval --help

ParsEval: comparative analysis of two alternative sources of annotation

Usage: parseval [options] reference.gff3 prediction.gff3

Basic options:
  -d|--debug:                 Print debugging messages
  -h|--help:                  Print help message and exit
  -l|--delta: INT             Extend gene loci by this many nucleotides;
                              default is 0
  -V|--verbose:               Print verbose warning messages
  -v|--version:               Print version number and exit

Output options:
  -a|--datashare: STRING      Location from which to copy shared data for
                              HTML output (if `make install' has not yet
                              been run)
  -f|--outformat: STRING      Indicate desired output format; possible
                              options: 'csv', 'text', or 'html'
                              (default='text'); in 'text' or 'csv' mode,
                              will create a single file; in 'html' mode,
                              will create a directory
  -g|--nogff3:                Do no print GFF3 output corresponding to each
                              comparison
  -o|--outfile: FILENAME      File/directory to which output will be
                              written; default is the terminal (STDOUT)
  -p|--nopng:                 In HTML output mode, skip generation of PNG
                              graphics for each gene locus
  -s|--summary:               Only print summary statistics, do not print
                              individual comparisons
  -w|--overwrite:             Force overwrite of any existing output files
  -x|--refrlabel: STRING      Optional label for reference annotations
  -y|--predlabel: STRING      Optional label for prediction annotations

Filtering options:
  -k|--makefilter             Create a default configuration file for
                              filtering reported results and quit,
                              performing no comparisons
  -r|--filterfile: STRING     Use the indicated configuration file to
                              filter reported results;
  -t|--maxtrans: INT          Maximum transcripts allowed per locus; use 0
                              to disable limit; default is 32

> gaeval --help

gaeval: calculate coverage and intergrity scores for gene models based on transcript alignments

Usage: gaeval [options] alignments.gff3 genes.gff3 [moregenes.gff3 ...]

Basic options:
  -h|--help                   print this help message and exit
  -v|--version                print version number and exit

Weights for calculating integrity score (must add up to 1.0):
  -a|--alpha: DOUBLE          introns confirmed, or % expected CDS length for
                              single-exon genes; default is 0.6
  -b|--beta: DOUBLE           exon coverage; default is 0.3
  -g|--gamma: DOUBLE          % expected 5' UTR length; default is 0.05
  -e|--epsilon: DOUBLE        % expected 3' UTR length; default is 0.05

Expected feature lengths for calculating integrity score:
  -c|--exp-cds: INT           expected CDS length (in bp); default is 400
  -5|--exp-5putr: INT         expected 5' UTR length; default is 200
  -3|--exp-3putr: INT         expected 3' UTR length; default is 100

> canon-gff3 --help

Usage: canon-gff3 [options] gff3file1 [gff3file2 ...]

Options:
  -h|--help                   print this help message and exit
  -i|--infer                  for transcript features lacking an explicitly
                              declared gene feature as a parent, create this
                              feature on-they-fly
  -o|--outfile: STRING        name of file to which GFF3 data will be
                              written; default is terminal (stdout)
  -s|--source: STRING         reset the source of each feature to the given
                              value
  -v|--version                print version number and exit

> locuspocus --help

LocusPocus: calculate locus coordinates for the given gene annotation

Usage: locuspocus [options] gff3file1 [gff3file2 gff3file3 ...]

Basic options:
  -d|--debug                  print detailed debugging messages to terminal
                              (standard error)
  -h|--help                   print this help message and exit
  -v|--version                print version number and exit

iLocus parsing:
  -l|--delta: INT             when parsing interval loci, use the following
                              delta to extend gene loci and include potential
                              regulatory regions; default is 500
  -s|--skipends               when enumerating interval loci, exclude
                              unannotated (and presumably incomplete) iLoci at
                              either end of the sequence
  -e|--endsonly               report only incomplete iLocus fragments at the
                              unannotated ends of sequences (complement of
                              --skipends)
  -y|--skipiiloci             do not report intergenic iLoci

Refinement options:
  -r|--refine                 by default genes are grouped in the same iLocus
                              if they have any overlap; 'refine' mode allows
                              for a more nuanced handling of overlapping genes
  -c|--cds                    use CDS rather than UTRs for determining gene
                              overlap; implies 'refine' mode
  -m|--minoverlap: INT        the minimum number of nucleotides two genes must
                              overlap to be grouped in the same iLocus; default
                              is 1

Output options:
  -n|--namefmt: STR           provide a printf-style format string to override
                              the default ID format for newly created loci;
                              default is 'locus%lu' (locus1, locus2, etc) for
                              loci and 'iLocus%lu' (iLocus1, iLocus2, etc) for
                              interval loci; note the format string should
                              include a single %lu specifier to be filled in
                              with a long unsigned integer value
  -i|--ilens: FILE            create a file with the lengths of each intergenic
                              iLocus
  -g|--genemap: FILE          print a mapping from each gene annotation to its
                              corresponding locus to the given file
  -o|--outfile: FILE          name of file to which results will be written;
                              default is terminal (standard output)
  -T|--retainids              retain original feature IDs from input files;
                              conflicts will arise if input contains duplicated
                              ID values
  -t|--transmap: FILE         print a mapping from each transcript annotation
                              to its corresponding locus to the given file
  -V|--verbose                include all locus subfeatures (genes, RNAs, etc)
                              in the GFF3 output; default includes only locus
                              features

Input options:
  -f|--filter: TYPE           comma-separated list of feature types to use in
                              constructing loci/iLoci; default is 'gene'
  -p|--parent: CT:PT          if a feature of type $CT exists without a parent,
                              create a parent for this feature with type $PT;
                              for example, mRNA:gene will create a gene feature
                              as a parent for any top-level mRNA feature;
                              this option can be specified multiple times
  -u|--pseudo                 correct erroneously labeled pseudogenes

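Usage

ParsEval: compares two sets of gene annotations for the same sequences.

The manual suggests two use cases.

1. You are annotating a newly assembled genome and the best parameter settings are not obvious, so you try several different settings in an exploratory way. You then use ParsEval to identify the similarities and differences between the resulting annotations.
2. You are doing a genome-wide analysis. One gene annotation is available from the consortium that sequenced the genome, and an alternative annotation is available from NCBI. You use ParsEval to compare the two and quickly pinpoint where they agree and where they differ.

To run it, specify the GFF3 file to use as the reference and the GFF3 file you want to compare against it.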

parseval ref.gff3 new_annotation.gff3

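Similarity statistics are reported at two levels of granularity. First, a report for each individual locus shows how similar the annotations are at that locus (when HTML output mode is used). Second, a single set of summary similarity statistics, aggregated over the entire data set, is reported.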
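The two levels of reporting can be illustrated with a small Python sketch. This is not ParsEval's implementation: the function name `exon_agreement`, the toy coordinates, and the choice of exact exon matching with sensitivity, precision, and F1 are all assumptions made for illustration. It simply computes the same kind of statistic once per locus and once over all loci combined.

```python
def exon_agreement(reference, prediction):
    """Compare two collections of exons by exact coordinate match (illustrative only)."""
    ref, pred = set(reference), set(prediction)
    shared = len(ref & pred)
    sensitivity = shared / len(ref) if ref else 0.0
    precision = shared / len(pred) if pred else 0.0
    total = sensitivity + precision
    f1 = 2 * sensitivity * precision / total if total else 0.0
    return {"shared": shared, "sensitivity": sensitivity,
            "precision": precision, "f1": f1}


# Hypothetical toy data: locus name -> (reference exons, predicted exons)
loci = {
    "locus1": ([(100, 200), (300, 400)], [(100, 200), (300, 450)]),
    "locus2": ([(1000, 1200)], [(1000, 1200)]),
}

# Level 1: one report per locus
per_locus = {name: exon_agreement(r, p) for name, (r, p) in loci.items()}

# Level 2: a single summary aggregated over every locus
# (exons are tagged with their locus name so coordinates cannot collide)
all_ref = [(name, exon) for name, (r, _) in loci.items() for exon in r]
all_pred = [(name, exon) for name, (_, p) in loci.items() for exon in p]
summary = exon_agreement(all_ref, all_pred)

for name, stats in per_locus.items():
    print(name, stats)
print("summary", summary)
```

In this toy data the second exon of locus1 has a different end coordinate in the two annotations, so it lowers both the locus1 statistics and the overall summary.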

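CanonGFF3: CanonGFF3 (canonical GFF3) cleans up GFF3 files. It removes all features not directly related to protein-coding genes and infers features that are not explicitly declared, such as introns and UTRs.

The input is one or more files in GFF3 format. The -o option takes the name of the output file.

canon-gff3 -o output.gff3 input.gff3

By common convention, gene structure is described using exon and CDS features, and introns, UTRs, and start/stop codon features are often not provided explicitly. These can be inferred from the other features.
The output of CanonGFF3 is a GFF3 file containing the protein-coding genes from the supplied input files. In most cases the output is more verbose than the input, because it includes features inferred from those explicitly provided in the input.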
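To make the idea of inferring undeclared features concrete, here is a minimal Python sketch that derives intron coordinates from the exons of a single transcript. It is not CanonGFF3's code; the function name `infer_introns` and the example coordinates are hypothetical. It assumes GFF3-style coordinates (1-based, inclusive) and non-overlapping exons.

```python
def infer_introns(exons):
    """Return the gaps between consecutive exons of one transcript.

    exons: iterable of (start, end) tuples, 1-based inclusive as in GFF3.
    """
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # Only emit an intron when there is at least one base between exons
        if next_start > prev_end + 1:
            introns.append((prev_end + 1, next_start - 1))
    return introns


# Hypothetical transcript with three exons, given out of order
exons = [(300, 400), (100, 200), (500, 650)]
print(infer_introns(exons))  # [(201, 299), (401, 499)]
```

UTRs could be derived in a similar way by subtracting the CDS span from the exon span, which is why a file that declares only exons and CDS features still carries the full gene structure.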

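LocusPocus: computes interval loci (iLoci) from the given gene annotations. iLoci represent distinct regions of the genome that encode one or more genes, as well as intergenic regions that contain no genes (*1).

The input is one or more files in GFF3 format. The -o option takes the name of the output file.

locuspocus -o output.gff3 input.gff3

Output
LocusPocus computes the positions of iLoci from the given gene features and writes them in GFF3 format. By default, only the iLocus features are reported, along with attributes giving the number of genes and transcripts in each locus. With the verbose option, the gene features (and their subfeatures) are reported as well.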
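The basic grouping idea can be sketched in a few lines of Python. This is a deliberately simplified illustration, not LocusPocus itself: it groups genes that overlap by at least one nucleotide and reports the gaps as intergenic regions, and it ignores the delta extension, the refinement options, and strand. The function name `simple_iloci` and the coordinates are hypothetical.

```python
def simple_iloci(genes, seq_length):
    """Partition a sequence into gene-containing and intergenic intervals.

    genes: list of (start, end) tuples, 1-based inclusive.
    Returns a list of (start, end, gene_count); gene_count 0 means intergenic.
    """
    # Step 1: merge genes that share at least one nucleotide
    merged = []
    for start, end in sorted(genes):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end, count = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end), count + 1)
        else:
            merged.append((start, end, 1))

    # Step 2: fill the gaps between merged loci with intergenic intervals
    result = []
    cursor = 1
    for start, end, count in merged:
        if start > cursor:
            result.append((cursor, start - 1, 0))
        result.append((start, end, count))
        cursor = end + 1
    if cursor <= seq_length:
        result.append((cursor, seq_length, 0))
    return result


# Hypothetical 3 kb sequence with two overlapping genes and one isolated gene
genes = [(100, 500), (450, 900), (2000, 2600)]
for locus in simple_iloci(genes, 3000):
    print(locus)
```

The per-interval gene count here plays the same role as the gene-count attribute that LocusPocus attaches to each iLocus feature in its default output.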

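GAEVAL: computes coverage and integrity scores for gene models from transcript alignments. The integrity score is a value between 0 and 1 indicating how well a gene model agrees with its associated transcript alignments: 0 corresponds to no transcript support and 1 to full transcript support.

The input to GAEVAL is two GFF3 files, one containing the gene predictions/annotations and the other containing the transcript alignments. The GFF3 specification explicitly supports several similar encoding conventions for alignment features, but GAEVAL supports only one of them.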

gaeval alignments.gff3 genes.gff3
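The gaeval help text lists four weights (alpha, beta, gamma, epsilon) that must add up to 1.0, together with expected UTR lengths. The Python sketch below shows one plausible reading of how such weights could combine into a score between 0 and 1 for a multi-exon gene. The weighted-sum form, the capping of the UTR fractions at 1.0, and the function and parameter names are assumptions inferred from the help text, not GAEVAL's actual code. The defaults are the ones printed by gaeval --help.

```python
def integrity_score(introns_confirmed, exon_coverage, utr5_len, utr3_len,
                    alpha=0.6, beta=0.3, gamma=0.05, epsilon=0.05,
                    exp_5putr=200, exp_3putr=100):
    """Weighted combination of support measures (illustrative assumption).

    introns_confirmed, exon_coverage: fractions between 0 and 1.
    utr5_len, utr3_len: observed UTR lengths in bp.
    """
    if abs(alpha + beta + gamma + epsilon - 1.0) > 1e-9:
        raise ValueError("weights must add up to 1.0")
    # Fraction of the expected UTR length, capped at 1.0 (assumed)
    utr5 = min(utr5_len / exp_5putr, 1.0)
    utr3 = min(utr3_len / exp_3putr, 1.0)
    return (alpha * introns_confirmed + beta * exon_coverage
            + gamma * utr5 + epsilon * utr3)


# Hypothetical gene model: half of the introns confirmed, 80% exon coverage,
# and UTRs at half of their expected lengths
print(round(integrity_score(0.5, 0.8, 100, 50), 3))
```

Under this reading, a model with no transcript support scores 0 and a fully supported model scores 1, matching the range described above. According to the help text, single-exon genes use the fraction of the expected CDS length in place of the confirmed-intron term.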

References

Daniel S. Standage (2010-2015). AEGeAn: an integrated toolkit for analysis and evaluation of annotated genomes, http://standage.github.io/AEGeAn.

iLoci: robust evaluation of genome content and organization for provisional and mature genome assemblies
Daniel S Standage, Tim Lai & Volker P Brendel
NAR Genom Bioinform. Published online 2022 Feb 22

ParsEval: parallel comparison and analysis of gene structure annotations
Daniel S Standage & Volker P Brendel
BMC Bioinformatics volume 13, Article number: 187 (2012)

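*1 Interval locus (iLocus)

"iLoci represent distinct regions of the genome that encode one or more genes, as well as intergenic regions that contain no genes.

iLoci provide a robust coordinate system for working with rapidly changing genome annotation data, embracing the new reality that most genome projects will never progress beyond the preliminary draft stage."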