2021-07-01

EMBLmyGFF3: a converter that facilitates genome annotation submission to the European Nucleotide Archive

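Over the past 20 years, many sequence annotation tools have been developed, making it easier to produce relatively accurate annotations for a wide range of organisms across all kingdoms of the tree of life. The Generic Feature Format (GFF) was developed to describe the features annotated within a genome. Faced with limitations of the originally published Sanger specification, GFF evolved into different flavors according to the needs of different laboratories. In 2013, the Sequence Ontology Project (http://www.sequenceontology.org; [ref.1]) proposed the GFF3 format, which "addresses the most common extensions to GFF, while preserving backward compatibility with previous formats". Since then, GFF3 has become the de facto reference format for annotation. Although its specification is well defined, GFF3 is flexible enough to hold a wide variety of information. Thanks to this flexibility, it is used by a broad range of annotation tools (e.g. MAKER [ref.2], Augustus [ref.3], Prokka [ref.4], Eugene [ref.5]) and by most genome browsers (e.g. ARTEMIS [ref.6], Webapollo [ref.7], IGV [ref.8]). That flexibility has driven adoption, but it has also caused recurring interoperability problems among the tools that use the format: in practice, there are as many flavors of GFF3 as there are tools that produce it.

One natural fate of a GFF3 file is conversion into a format that can be submitted to one of the INSDC databases. Since 2016, NCBI has offered a beta version of a process for submitting GFF3 or GTF to GenBank [ref.9]. Its documentation describes what information is expected in the GFF3 file and how the file must be formatted to be accepted by the table2asn_GFF tool, which converts a suitably formatted GFF3 into the .sqn file submitted to GenBank. Modifying a GFF3 file to meet these requirements is not trivial, and automating it can require programming skills. To ease this task, an easy-to-use bioinformatics tool called Genome Annotation Generator (GAG) was developed [ref.10]. GAG offers a straightforward, consistent way to produce a submission-ready NCBI annotation in .tbl format. This tabular format is required, together with two other files (.sbt and .fsa), by NCBI's tbl2asn tool to create the .sqn file submitted to GenBank.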

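Whereas NCBI accepts submissions as an intermediate .sqn file rather than a GenBank flat file, EBI accepts submissions as EMBL flat files. The problem is therefore how to generate an EMBL flat file from a GFF3 file. Tools such as Artemis [ref.6], seqret from EMBOSS [ref.11], and GFF3toEMBL [ref.12] have been developed for this step, but each has limitations. Among the vast number of annotation tools, GFF3toEMBL [ref.12] handles only GFF3 produced by the prokaryotic annotation tool Prokka, so annotations from other tools need a different solution. Artemis is driven through a graphical user interface, so the process cannot be automated. Seqret is designed to handle only one record at a time, which makes it awkward for genome-wide annotations. The main bottleneck is that both of these tools require a properly formatted GFF3 that uses the vocabulary INSDC expects (in columns 3 and 9), whereas annotation tools do not necessarily use that vocabulary. The EMBL format follows the INSDC definitions and accepts 52 feature types. GFF3, in contrast, mandates Sequence Ontology terms or accession numbers in column 3, and version 2.5.3 of the Sequence Ontology contains 2278 terms. Likewise, the EMBL format allows 98 qualifiers, while the corresponding attribute tags in column 9 of GFF3 are unlimited. In many cases, users therefore have to preprocess their GFF3 to fit the expected vocabulary.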

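The information contained in a GFF3 file, and the vocabulary used, vary greatly with the annotation tool that produced it. The vocabularies of the GFF3 and EMBL formats also differ in many respects. Because of these differences, it is hard to build a universal GFF3-to-EMBL converter that needs no preprocessing of the GFF3 annotation file. The challenge is to map the feature types in column 3 and the various attribute tags in column 9 of the GFF3 file correctly onto the corresponding EMBL features and qualifiers.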
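To make the two columns in question concrete, here is a minimal Python sketch that splits one GFF3 feature line and pulls out the feature type (column 3) and the attribute tags (column 9). It is not taken from EMBLmyGFF3; the function name `parse_gff3_line` and the example line are made up for illustration, and only the general GFF3 conventions (tab-separated columns, `tag=value` pairs separated by semicolons, comma-separated multiple values, percent-encoding) are assumed.

```python
from urllib.parse import unquote


def parse_gff3_line(line):
    """Split one GFF3 feature line into its nine tab-separated columns.

    Hypothetical helper for illustration; not part of EMBLmyGFF3.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attr_text = cols

    # Column 9: semicolon-separated tag=value pairs; a tag may carry
    # several comma-separated values, each possibly percent-encoded.
    attributes = {}
    for pair in attr_text.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        tag, _, value = pair.partition("=")
        attributes[tag] = [unquote(v) for v in value.split(",")]

    return {
        "seqid": seqid,
        "type": ftype,  # column 3: the feature type to be mapped
        "start": int(start),
        "end": int(end),
        "strand": strand,
        "attributes": attributes,  # column 9: tags to be mapped to qualifiers
    }


# Made-up example line in the style of an annotation tool's output.
example = (
    "chr1\tmaker\tmRNA\t100\t900\t.\t+\t.\t"
    "ID=mrna1;Parent=gene1;product=hypothetical protein"
)
record = parse_gff3_line(example)
print(record["type"], sorted(record["attributes"]))
```

Every distinct value of `type` and every distinct attribute tag found this way is something a converter has to translate into, or drop from, the EMBL vocabulary.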

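In collaboration with the European Nucleotide Archive, the authors developed EMBLmyGFF3 to solve these problems. To their knowledge, it is the only tool that can handle any kind of GFF3 file without preprocessing. Its originality lies in JSON mapping files that define the vocabulary mapping between the GFF3 and EMBL formats.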

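EMBLmyGFF3 is a robust, universal GFF3-to-EMBL converter designed for submitting genome annotations to the European Nucleotide Archive. It relies on JSON parameter files, which users can easily adjust, to map corresponding vocabulary between the GFF3 and EMBL formats. It converts GFF3 annotation files from four commonly used annotation tools: Maker, Prokka, Augustus, and Eugene.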
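The idea of a JSON-driven vocabulary mapping can be sketched as follows. This is only an illustration of the principle: the JSON layout (`target` and `remove` keys) and the function `map_feature_type` are assumptions made for this example and may not match the files that ship with EMBLmyGFF3. The real files can be copied to the working directory with `--expose_translations` and inspected there.

```python
import json

# Hypothetical mapping schema, for illustration only. Keys are Sequence
# Ontology feature types (GFF3 column 3); "target" is an INSDC/EMBL feature key.
FEATURE_MAP_JSON = """
{
  "mRNA": {"target": "mRNA"},
  "five_prime_UTR": {"target": "5'UTR"},
  "three_prime_UTR": {"target": "3'UTR"},
  "match": {"remove": true}
}
"""


def map_feature_type(gff_type, mapping):
    """Return the EMBL feature key for a GFF3 type, or None if it is
    unmapped or explicitly marked for removal (assumed semantics)."""
    entry = mapping.get(gff_type)
    if entry is None or entry.get("remove"):
        return None
    return entry["target"]


mapping = json.loads(FEATURE_MAP_JSON)
for gff_type in ["mRNA", "three_prime_UTR", "match", "some_custom_type"]:
    print(gff_type, "->", map_feature_type(gff_type, mapping))
```

Keeping the mapping in a data file rather than in code means that a GFF3 flavor with unusual feature types can be supported by editing the JSON alone.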

Installation

The code was originally written for Python 2 but now requires Python 3.6 or later. I tested it in a Python 3.8 environment.

Dependencies

Python >=3.6, biopython >=1.67,<=1.77, numpy<1.16.5 and the bcbio-gff >=0.6.4 python packages.

GitHub

#Installation with git:
mamba create -n EMBLmyGFF3 -y python=3.8
conda activate EMBLmyGFF3
git clone https://github.com/NBISweden/EMBLmyGFF3.git
cd EMBLmyGFF3/
python setup.py install

#conda (Bioconda)
mamba install -c bioconda emblmygff3 

#pip
pip install git+https://github.com/NBISweden/EMBLmyGFF3.git

> EMBLmyGFF3

$ EMBLmyGFF3 -h

usage: EMBLmyGFF3 [-h] [-a] [-c CREATED]
                  [-d {CON,PAT,EST,GSS,HTC,HTG,MGA,WGS,TSA,STS,STD}]
                  [-g ORGANELLE] [-i LOCUS_TAG] [-k KEYWORD [KEYWORD ...]]
                  [-l CLASSIFICATION]
                  [-m {genomic DNA,genomic RNA,mRNA,tRNA,rRNA,other RNA,other DNA,transcribed RNA,viral cRNA,unassigned DNA,unassigned RNA}]
                  [-o OUTPUT] [-p PROJECT_ID] [-q]
                  [-r {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}]
                  [-s SPECIES] [-t {linear,circular}] [-v]
                  [-x {PHG,ENV,FUN,HUM,INV,MAM,VRT,MUS,PLN,PRO,ROD,SYN,TGN,UNC,VRL}]
                  [-z] [--ah {One of the parameters above}] [--de DE]
                  [--ra RA [RA ...]] [--rc RC] [--rg RG] [--rl RL] [--rt RT]
                  [--rx RX] [--email EMAIL] [--expose_translations]
                  [--force_unknown_features] [--force_uncomplete_features]
                  [--interleave_genes] [--keep_duplicates]
                  [--locus_numbering_start LOCUS_NUMBERING_START]
                  [--no_progress] [--no_wrap_qualifier] [--shame]
                  [--translate]
                  [--use_attribute_value_as_locus_tag USE_ATTRIBUTE_VALUE_AS_LOCUS_TAG]
                  [--uncompressed_log] [--version VERSION] [--strain STRAIN]
                  [--environmental_sample]
                  [--isolation_source ISOLATION_SOURCE] [--isolate ISOLATE]
                  gff_file fasta

positional arguments:
  gff_file              Input gff-file.
  fasta                 Input fasta sequence.

optional arguments:
  -h, --help            show this help message and exit
  -a, --accession       Bolean. Accession number(s) for the entry. Default
                        value: XXX. The proper value is automatically filled
                        up by ENA during the submission by a unique accession
                        number they will assign. The accession number is used
                        to set up the AC line and the first token of the ID
                        line as well. Please visit [this
                        page](https://www.ebi.ac.uk/ena/submit/accession-
                        number-formats) and [this
                        one](https://www.ebi.ac.uk/ena/submit/sequence-
                        submission) to learn more about it. Activating the
                        option will set the Accession number with the fasta
                        sequence identifier.
  -c CREATED, --created CREATED
                        Creation time of the original entry. The default value
                        is the date of the day.
  -d {CON,PAT,EST,GSS,HTC,HTG,MGA,WGS,TSA,STS,STD}, --data_class {CON,PAT,EST,GSS,HTC,HTG,MGA,WGS,TSA,STS,STD}
                        Data class of the sample. Default value 'XXX'. This
                        option is used to set up the 5th token of the ID line.
  -g ORGANELLE, --organelle ORGANELLE
                        Sample organelle. No default value.
  -i LOCUS_TAG, --locus_tag LOCUS_TAG
                        Locus tag prefix used to set up the prefix of the
                        locus_tag qualifier. The locus tag has to be
                        registered at ENA prior any submission. More
                        information
                        [here](https://www.ebi.ac.uk/ena/submit/locus-tags).
  -k KEYWORD [KEYWORD ...], --keyword KEYWORD [KEYWORD ...]
                        Keywords for the entry. No default value.
  -l CLASSIFICATION, --classification CLASSIFICATION
                        Organism classification e.g 'Eukaryota; Opisthokonta;
                        Metazoa'. The default value is the classification
                        found in the NCBI taxonomy DB from the species/taxid
                        given as --species parameter. If none is found, 'Life'
                        will be the default value.
  -m {genomic DNA,genomic RNA,mRNA,tRNA,rRNA,other RNA,other DNA,transcribed RNA,viral cRNA,unassigned DNA,unassigned RNA}, --molecule_type {genomic DNA,genomic RNA,mRNA,tRNA,rRNA,other RNA,other DNA,transcribed RNA,viral cRNA,unassigned DNA,unassigned RNA}
                        Molecule type of the sample. No default value.
  -o OUTPUT, --output OUTPUT
                        Output filename.
  -p PROJECT_ID, --project_id PROJECT_ID
                        Project ID. Default is 'XXX' (This is used to set up
                        the PR line).
  -q, --quiet           Decrease verbosity.
  -r {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}, --transl_table {1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25}
                        Translation table. No default. (This is used to set up
                        the translation table qualifier transl_table of the
                        CDS features.) Please visit [NCBI genetic code](https:
                        //www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi)
                        for more information.
  -s SPECIES, --species SPECIES
                        Sample species, formatted as 'Genus species' or taxid.
                        No default. (This is used to set up the OS line.)
  -t {linear,circular}, --topology {linear,circular}
                        Sequence topology. No default. (This is used to set up
                        the Topology that is the 3rd token of the ID line.)
  -v, --verbose         Increase verbosity.
  -x {PHG,ENV,FUN,HUM,INV,MAM,VRT,MUS,PLN,PRO,ROD,SYN,TGN,UNC,VRL}, --taxonomy {PHG,ENV,FUN,HUM,INV,MAM,VRT,MUS,PLN,PRO,ROD,SYN,TGN,UNC,VRL}
                        Source taxonomy. Default value 'XXX'. This option is
                        used to set the taxonomic division within ID line (6th
                        token).
  -z, --gzip            Gzip output file.
  --ah {One of the parameters above}, --advanced_help {One of the parameters above}
                        Display advanced information of the parameter
                        specified or of all parameters if none specified.
  --de DE               Description. Default value 'XXX'.
  --ra RA [RA ...], --author RA [RA ...]
                        Author for the reference. No default value.
  --rc RC               Reference Comment. No default value.
  --rg RG               Reference Group, the working groups/consortia that
                        produced the record. Default value 'XXX'.
  --rl RL               Reference publishing location. No default value.
  --rt RT               Reference Title. No default value.
  --rx RX               Reference cross-reference. No default value
  --email EMAIL         Email used to fetch information from NCBI taxonomy
                        database. Default value 'EMBLmyGFF3@tool.org'.
  --expose_translations
                        Copy feature and attribute mapping files to the
                        working directory. They will be used as mapping files
                        instead of the default internal JSON files. You may
                        modify them as it suits you.
  --force_unknown_features
                        Force to keep feature types not accepted by EMBL. /!\
                        Option not suitable for submission purpose.
  --force_uncomplete_features
                        Force to keep features whithout all the mandatory
                        qualifiers. /!\ Option not suitable for submission
                        purpose.
  --interleave_genes    Print gene features with interleaved mRNA and CDS
                        features.
  --keep_duplicates     Do not remove duplicate features during the process.
                        /!\ Option not suitable for submission purpose.
  --locus_numbering_start LOCUS_NUMBERING_START
                        Start locus numbering with the provided value.
  --no_progress         Hide conversion progress counter.
  --no_wrap_qualifier   By default there is a line wrapping at 80 characters.
                        The cut is at the world level. Activating this option
                        will avoid the line-wrapping for the qualifiers.
  --shame               Suppress the shameless plug.
  --translate           Include translation in CDS features.
  --use_attribute_value_as_locus_tag USE_ATTRIBUTE_VALUE_AS_LOCUS_TAG
                        Use the value of the defined attribute as locus_tag.
  --uncompressed_log    Some logs can be compressed for better lisibility,
                        they won't.
  --version VERSION     Sequence version number. The default value is 1.
  --strain STRAIN       Strain from which sequence was obtained. May be needed
                        when organism belongs to Bacteria.
  --environmental_sample
                        Bolean. Identifies sequences derived by direct
                        molecular isolation from a bulk environmental DNA
                        sample with no reliable identification of the source
                        organism. May be needed when organism belongs to
                        Bacteria.
  --isolation_source ISOLATION_SOURCE
                        Describes the physical, environmental and/or local
                        geographical source of the biological sample from
                        which the sequence was derived. Mandatory when
                        environmental_sample option used.
  --isolate ISOLATE     Individual isolate from which the sequence was
                        obtained. May be needed when organism belongs to
                        Bacteria.

Usage

Specify the GFF3 file and the FASTA file.

EMBLmyGFF3 maker.gff3 maker.fa > output

If the required information is not supplied via options, the command proceeds interactively and asks for the following:

Species name (genus + specific epithet)

locus_tag

molecule_type

project ID

Topology (linear or circular)

transl_table

Once these settings have been provided, the file is converted and written out.
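For scripted runs, the same information can be passed on the command line instead of being typed at the prompts. The sketch below assembles such a command in Python using only flags that appear in the `-h` output (`--species`, `--locus_tag`, `--molecule_type`, `--project_id`, `--topology`, `--transl_table`, `--output`). The helper `build_command` and all values (species, locus tag, project ID, output name) are placeholders invented for this example. Whether these flags suppress every prompt was not verified here, so treat it as a starting point.

```python
import shlex


def build_command(gff, fasta, species, locus_tag, molecule_type,
                  project_id, topology, transl_table, output):
    """Assemble an EMBLmyGFF3 argument list (hypothetical helper)."""
    return [
        "EMBLmyGFF3", gff, fasta,
        "--species", species,
        "--locus_tag", locus_tag,
        "--molecule_type", molecule_type,
        "--project_id", project_id,
        "--topology", topology,
        "--transl_table", str(transl_table),
        "--output", output,
    ]


cmd = build_command(
    "maker.gff3", "maker.fa",
    species="Drosophila melanogaster",  # placeholder organism
    locus_tag="LOCUSTAG",               # placeholder; must be registered at ENA
    molecule_type="genomic DNA",
    project_id="PRJXXXXXXX",            # placeholder project ID
    topology="linear",
    transl_table=1,
    output="result.embl",               # placeholder output file name
)

# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually run it (requires EMBLmyGFF3 on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids quoting problems with values that contain spaces, such as `genomic DNA`.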

Citation

EMBLmyGFF3: a converter facilitating genome annotation submission to European Nucleotide Archive

Martin Norling, Niclas Jareborg & Jacques Dainat
BMC Res Notes. 2018; 11: 584
