GFF3toolkit: a toolkit for GFF3 files - macでインフォマティクス

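GFF3toolkit (https://github.com/NAL-i5K/GFF3toolkit), supported by the i5k Workspace @ NAL, provides a suite of tools for processing GFF3-format gene annotations from arthropod genome projects and their research communities. Quality-control and merge procedures are proposed together with GFF3toolkit to improve the GFF3 formatting of gene annotations. Specifically, the toolkit can sort GFF3 files, detect GFF3 format errors, merge two GFF3 files, and generate biological sequences from a GFF3 file.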

Installation

Tested in a Python 3.7 environment on Ubuntu 18.04 (run in Docker; host OS macOS 10.14).

Dependencies

Python 3.x
Perl

GitHub

pip install gff3tool

#latest version
pip install git+https://github.com/NAL-i5K/GFF3toolkit.git
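
After installation, it can be useful to confirm that the command-line entry points are actually on PATH. The following is a minimal sketch, not part of GFF3toolkit; the command names are taken from the help sections of this post, and the helper name find_missing_commands is made up for illustration.

```python
import shutil

# Command names as they appear in the help sections of this post.
GFF3TOOLKIT_COMMANDS = ["gff3_QC", "gff3_fix", "gff3_sort", "gff3_to_fasta", "gff3_merge"]


def find_missing_commands(commands=GFF3TOOLKIT_COMMANDS):
    """Return the commands that cannot be found on PATH (hypothetical helper)."""
    return [name for name in commands if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_commands()
    print("missing commands:", ", ".join(missing) if missing else "none")
```

An empty result suggests that pip placed the scripts in a directory that the shell can see.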

> gff3_QC -h

$ gff3_QC -h

usage: gff3_QC [-h] [-g GFF] [-f FASTA] [-noncg] [-i] [-n ALLOWED_NUM_OF_N] [-t [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]]] [-o OUTPUT] [-s STATISTIC] [-v]

Testing environment:

1. Python 2.7

Inputs:

1. GFF3: Specify the file name with the -g or --gff argument; Please note that this program requires gene/pseudogene and mRNA/pseudogenic_transcript to have an ID attribute in column 9.

2. fasta file: Specify the file name with the -f or --fasta argument

Outputs:

1. Error report for the input GFF3 file

* Line_num: Line numbers of the found problematic models in the input GFF3 file.

* Error_code: Error codes for the found problematic models. Please refer to lib/ERROR/ERROR.py to see the full list of Error_code and the corresponding Error_tag.

* Error_tag: Detail of the found errors for the problematic models. Please refer to lib/ERROR/ERROR.py to see the full list of Error_code and the corresponding Error_tag.

Quick start:

gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o test

gff3_QC --gff example_file/example.gff3 --fasta example_file/reference.fa --output test

optional arguments:

-h, --help show this help message and exit

-g GFF, --gff GFF Genome annotation file, gff3 format

-f FASTA, --fasta FASTA

Genome sequences, fasta format

-noncg, --noncanonical_gene

gff3 file is not formatted in the canonical gene model format.

-i, --initial_phase

Check whether initial CDS phase is 0 (default: no check)

-n ALLOWED_NUM_OF_N, --allowed_num_of_n ALLOWED_NUM_OF_N

Max number of Ns allowed in a feature, anything more will be reported as an error (default: 0)

-t [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]], --check_n_feature_types [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]]

Count the number of Ns in each feature with the type specified, multiple types may be specified, ex: -t CDS exon (default: "CDS")

-o OUTPUT, --output OUTPUT

output file name (default: report.txt)

-s STATISTIC, --statistic STATISTIC

statistic file name (default: statistic.txt)

-v, --version show program's version number and exit
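
Two points in the help text above lend themselves to a small illustration: the requirement that gene/pseudogene and mRNA/pseudogenic_transcript features carry an ID attribute in column 9, and the -n / -t options that limit the number of Ns per feature. The sketch below is a simplified stand-in written for this post, not the code gff3_QC runs; all function names and the toy data are assumptions, and coordinates are treated as 1-based and inclusive as in the GFF3 format.

```python
# Feature types that the help text says must carry an ID attribute.
REQUIRES_ID = {"gene", "pseudogene", "mRNA", "pseudogenic_transcript"}


def parse_attributes(column9):
    """Split a GFF3 attribute column (key=value;key=value) into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def feature_rows(gff_lines):
    """Yield (line_number, columns) for 9-column feature lines, skipping comments."""
    for line_number, line in enumerate(gff_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) == 9:
            yield line_number, columns


def lines_missing_id(gff_lines):
    """Line numbers of gene/transcript features that lack an ID attribute."""
    return [number for number, cols in feature_rows(gff_lines)
            if cols[2] in REQUIRES_ID and not parse_attributes(cols[8]).get("ID")]


def features_exceeding_n(gff_lines, genome, feature_types=("CDS",), allowed_n=0):
    """(line_number, n_count) for features containing more Ns than allowed_n."""
    flagged = []
    for number, cols in feature_rows(gff_lines):
        if cols[2] not in feature_types:
            continue
        start, end = int(cols[3]), int(cols[4])
        n_count = genome[cols[0]][start - 1:end].upper().count("N")
        if n_count > allowed_n:
            flagged.append((number, n_count))
    return flagged


# Toy data invented for this example.
genome = {"scaffold1": "ACGTNNACGTACGT"}
gff_lines = [
    "##gff-version 3",
    "scaffold1\t.\tgene\t1\t14\t.\t+\t.\tID=gene1",
    "scaffold1\t.\tmRNA\t1\t14\t.\t+\t.\tParent=gene1",
    "scaffold1\t.\tCDS\t3\t8\t.\t+\t0\tID=cds1;Parent=mrna1",
    "scaffold1\t.\tCDS\t9\t14\t.\t+\t0\tID=cds2;Parent=mrna1",
]
print(lines_missing_id(gff_lines))
print(features_exceeding_n(gff_lines, genome))
```

Here the mRNA on line 3 has no ID, and the CDS on line 4 overlaps two Ns, so both are reported with the default threshold of 0.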

> gff3_fix -h

$ gff3_fix -h

usage: gff3_fix [-h] [-qc_r QC_REPORT] [-g GFF] [-og OUTPUT_GFF] [-v]

Testing environment:

1. Python 3.*

Input:

1. Error report: Error report from gff3_QC.py. Specify the file name with the -qc_r or --qc_report argument;

2. GFF3: Specify the file name with the -g or --gff argument;

Output:

1. Corrected GFF3

Quick start:

gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3

optional arguments:

-h, --help show this help message and exit

-qc_r QC_REPORT, --qc_report QC_REPORT

Error report from gff3_QC.py

-g GFF, --gff GFF Genome annotation file, gff3 format

-og OUTPUT_GFF, --output_gff OUTPUT_GFF

output gff3 file name

-v, --version show program's version number and exit

> gff3_sort

$ gff3_sort

usage: gff3_sort [-h] [-g GFF_FILE] [-og OUTPUT_GFF] [-t SORT_TEMPLATE] [-i] [-v] [-r]

Sort a GFF3 file according to the order of Scaffold (seqID), coordinates on a Scaffold, and feature relationship based on sequence ontology.

Inputs:

1. GFF3 file: Specify the file name with the -g argument

Outputs:

1. Sorted GFF3 file: Specify the file name with the -og argument

Examples:

1. Specify the input, output file names and options using short arguments:

gff3_sort -g example_file/example.gff3 -og example_file/example_sorted.gff

2. Specify the input, output file names and options using long arguments:

gff3_sort --gff_file example_file/example.gff3 --output_gff example_file/example_sorted.gff

optional arguments:

-h, --help show this help message and exit

-g GFF_FILE, --gff_file GFF_FILE

GFF3 file that you would like to sort.

-og OUTPUT_GFF, --output_gff OUTPUT_GFF

Sorted GFF3 file

-t SORT_TEMPLATE, --sort_template SORT_TEMPLATE

A file that indicates the sorting order of features within a gene model

-i, --isoform_sort

Sort multi-isoform gene models by feature type (default: False)

-v, --version show program's version number and exit

-r, --reference

Sort scaffold (seqID) by order of appearance in gff3 file (default is by number)

> gff3_to_fasta -h

$ gff3_to_fasta -h

usage: gff3_to_fasta [-h] [-g GFF] [-f FASTA] [-embf] [-st SEQUENCE_TYPE] [-u [USER_DEFINED [USER_DEFINED ...]]] [-d DEFLINE] [-o OUTPUT_PREFIX] [-noQC] [-v]

Extract sequences from specific regions of genome based on gff file.

Testing environment:

1. Python 2.7

Required inputs:

1. GFF3: specify the file name with the -g argument

2. Fasta file: specify the file name with the -f argument

3. Output prefix: specify with the -o argument

Outputs:

1. Fasta formatted sequence file based on the gff3 file.

Example command:

gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences

optional arguments:

-h, --help show this help message and exit

-g GFF, --gff GFF Genome annotation file in GFF3 format

-f FASTA, --fasta FASTA

Genome sequences in FASTA format

-embf, --embedded_fasta

Specify this option if you want to extract sequence from embedded fasta.

-st SEQUENCE_TYPE, --sequence_type SEQUENCE_TYPE

Type of sequences you would like to extract:

"all" - FASTA files for all types of sequences listed below, except user_defined;

"gene" - gene sequence for each record;

"exon" - exon sequence for each record;

"pre_trans" - genomic region of a transcript model (premature transcript);

"trans" - spliced transcripts (only exons included);

"cds" - coding sequences;

"pep" - peptide sequences;

"user_defined" - specify parent and child features via the -u argument.

-u [USER_DEFINED [USER_DEFINED ...]], --user_defined [USER_DEFINED [USER_DEFINED ...]]

Specify parent and child features for fasta extraction, format: [parent feature type] [child feature type] (ex: -u mRNA CDS). Required if -st user_defined is given.

-d DEFLINE, --defline DEFLINE

Defline format in the output FASTA file:

"simple" - only ID would be shown in the defline;

"complete" - complete information of the feature would be shown in the defline.

-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX

Prefix of output file name

-noQC, --quality_control

Specify this option if you do not want to execute quality control for gff file. (default: QC is executed)

-v, --version show program's version number and exit

> gff3_merge -h

$ gff3_merge -h

usage: gff3_merge [-h] [-g1 GFF_FILE1] [-g2 GFF_FILE2] [-f FASTA] [-u1 USER_DEFINED_FILE1] [-u2 USER_DEFINED_FILE2] [-og OUTPUT_GFF] [-r REPORT_FILE] [-a] [-noAuto] [-v]

Merge two gff files of the same genome into one.

Testing environment:

1. Python 3.*

Inputs:

1. GFF3 file 1: Gff with annotations modified relative to the original gff (e.g. output from the Apollo program), specify the file name with the -g1 argument

2. GFF3 file 2: Original/Reference gff, specify the file name with the -g2 argument

3. FASTA: Genomic sequences in the FASTA format with the -f argument

Outputs:

1. Merged GFF3: Models from GFF3 file 1 replace Models from GFF3 file 2 based on their replace tag. Specify the output file name with the -og argument

2. Log report for the integration: specify the file name with the -r argument

Examples:

1. Specify the input, output file names and options using short arguments:

gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt

2. Specify the input, output file names and options using long arguments:

gff3_merge --gff_file1 example_file/new_models.gff3 --gff_file2 example_file/reference.gff3 --fasta example_file/reference.fa --output_gff merged.gff --report_file merged_report.txt

optional arguments:

-h, --help show this help message and exit

-g1 GFF_FILE1, --gff_file1 GFF_FILE1

Updated GFF3 file, such as Apollo gff

-g2 GFF_FILE2, --gff_file2 GFF_FILE2

Reference GFF3 file, such as Maker gff or OGS gff

-f FASTA, --fasta FASTA

Genomic sequences in the fasta format

-u1 USER_DEFINED_FILE1, --user_defined_file1 USER_DEFINED_FILE1

File for specifying parent and child features for fasta extraction from updated GFF3 file.

-u2 USER_DEFINED_FILE2, --user_defined_file2 USER_DEFINED_FILE2

File for specifying parent and child features for fasta extraction from reference GFF3 file.

-og OUTPUT_GFF, --output_gff OUTPUT_GFF

The merged GFF3 file

-r REPORT_FILE, --report_file REPORT_FILE

Log file for the integration

-a, --all

auto-assignment replace tags for all transcript features. (default: Only automatically assign replace tags for the transcript without replace tags)

-noAuto, --auto_assignment

Turn off the auto-assignment of replace tags, if you already have replace tags in your updated gff (default: Automatically assign replace tags and then merge the gff files)

-v, --version show program's version number and exit

Usage

The example files bundled with the repository are used here.

git clone https://github.com/NAL-i5K/GFF3toolkit.git
cd GFF3toolkit/

gff3_sort - sort a GFF3 file by scaffold order, coordinates, and the parent-child relationships of the annotations

gff3_sort -g example_file/example.gff3 -og example-sorted.gff3

-og Sorted GFF3 file
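
To make the sorting criteria concrete, here is a deliberately simplified sketch of ordering feature lines by scaffold (numerically, so that scaffold2 comes before scaffold10), then by start coordinate, then by a feature-type rank that places parents before children. It is an illustration only and is not how gff3_sort is implemented; the TYPE_RANK table and the function names are assumptions made for this example.

```python
import re

# Illustrative ranking so that parents precede children at the same start position.
TYPE_RANK = {"gene": 0, "pseudogene": 0, "mRNA": 1, "pseudogenic_transcript": 1,
             "exon": 2, "CDS": 3}


def natural_key(seqid):
    """Turn 'scaffold10' into ['scaffold', 10, ''] so digits compare as numbers."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", seqid)]


def sort_gff_lines(gff_lines):
    """Keep comment/directive lines first, then sort 9-column feature lines."""
    headers = [line for line in gff_lines if line.startswith("#")]
    features = [line.rstrip("\n").split("\t") for line in gff_lines
                if line.strip() and not line.startswith("#")]
    features = [cols for cols in features if len(cols) == 9]
    features.sort(key=lambda cols: (natural_key(cols[0]), int(cols[3]),
                                    TYPE_RANK.get(cols[2], len(TYPE_RANK))))
    return headers + ["\t".join(cols) for cols in features]


# Toy input invented for this example.
unsorted_lines = [
    "##gff-version 3",
    "scaffold10\t.\tgene\t1\t100\t.\t+\t.\tID=g3",
    "scaffold2\t.\texon\t50\t80\t.\t+\t.\tID=e1;Parent=m1",
    "scaffold2\t.\tmRNA\t50\t200\t.\t+\t.\tID=m1;Parent=g1",
    "scaffold2\t.\tgene\t50\t200\t.\t+\t.\tID=g1",
    "scaffold1\t.\tgene\t500\t900\t.\t-\t.\tID=g2",
]
for line in sort_gff_lines(unsorted_lines):
    print(line)
```

A purely coordinate-based sort like this can interleave features of overlapping genes, which is one reason a dedicated tool that follows the parent-child relationships is preferable for real annotation files.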

gff3_to_fasta - extract sequences (spliced transcripts, CDS, peptides) from specific regions of the genome

gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences

-st Type of sequences you would like to extract:

"all" - FASTA files for all types of sequences listed below, except user_defined;
"gene" - gene sequence for each record;
"exon" - exon sequence for each record;
"pre_trans" - genomic region of a transcript model (premature transcript);
"trans" - spliced transcripts (only exons included);
"cds" - coding sequences;
"pep" - peptide sequences;
"user_defined" - specify parent and child features via the -u argument.

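The difference between the "pre_trans" and "trans" sequence types listed above can be sketched in a few lines: the former is the whole genomic span of the transcript, the latter is the concatenation of its exons, reverse-complemented for minus-strand models. This is a conceptual sketch with made-up function names and toy data, not the extraction code used by gff3_to_fasta.

```python
# Translation table for complementing DNA, including N and lowercase bases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Complement each base, then reverse the string."""
    return sequence.translate(COMPLEMENT)[::-1]


def pre_transcript(scaffold_seq, exons, strand):
    """Genomic region spanning all exons (1-based, inclusive coordinates)."""
    start = min(exon_start for exon_start, _ in exons)
    end = max(exon_end for _, exon_end in exons)
    region = scaffold_seq[start - 1:end]
    return reverse_complement(region) if strand == "-" else region


def spliced_transcript(scaffold_seq, exons, strand):
    """Concatenate exons in genomic order; reverse-complement on the minus strand."""
    joined = "".join(scaffold_seq[start - 1:end] for start, end in sorted(exons))
    return reverse_complement(joined) if strand == "-" else joined


# Toy scaffold with two exons (1-3 and 7-9) separated by an intron (4-6).
scaffold = "AAACCCGGGTTT"
exons = [(7, 9), (1, 3)]
print(pre_transcript(scaffold, exons, "+"))
print(spliced_transcript(scaffold, exons, "+"))
print(spliced_transcript(scaffold, exons, "-"))
```

For the plus strand the intron (CCC) is present in the pre-transcript and absent from the spliced transcript; for the minus strand the joined exons are reverse-complemented.
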
gff3_QC - detect various types of errors in a GFF3 file

gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt -s statistic.txt

[Figure: contents of statistic.txt]

[Figure: contents of error.txt]
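
The help text describes the error report as having the fields Line_num, Error_code and Error_tag. Assuming, and this is only an assumption, that the report is tab-separated with those names in a header row, the number of occurrences per error code could be tallied as sketched below. The sample rows and the codes CODE_A and CODE_B are invented placeholders, not real gff3_QC error codes.

```python
import csv
import io
from collections import Counter


def count_error_codes(report_text):
    """Count rows per Error_code in a tab-separated report with a header row.

    The layout (tab-separated, header names) is an assumption for illustration.
    """
    reader = csv.DictReader(io.StringIO(report_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    return Counter(row["Error_code"] for row in reader)


# Invented sample; real reports will contain different codes and messages.
sample_report = (
    "Line_num\tError_code\tError_tag\n"
    "Line 12\tCODE_A\tfirst hypothetical problem\n"
    "Line 40\tCODE_B\tsecond hypothetical problem\n"
    "Line 57\tCODE_A\tfirst hypothetical problem\n"
)
for code, count in count_error_codes(sample_report).most_common():
    print(code, count)
```

To use it on a real file, read error.txt into a string first and check that the column layout matches before relying on the counts.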

gff3_fix - fix the errors detected by gff3_QC

gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3

-qc_r Error report from gff3_QC.py
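
Because gff3_fix consumes the report written by gff3_QC, the two commands are naturally run back to back. The sketch below chains them from Python using only the flags shown in this post; the function names are made up, and dry_run defaults to True so that the commands are merely printed unless the toolkit is installed and execution is explicitly requested.

```python
import shlex
import subprocess


def qc_and_fix_commands(gff, fasta, report, statistic, corrected):
    """Build the gff3_QC and gff3_fix command lines with the flags used above."""
    qc = ["gff3_QC", "-g", gff, "-f", fasta, "-o", report, "-s", statistic]
    fix = ["gff3_fix", "-qc_r", report, "-g", gff, "-og", corrected]
    return [qc, fix]


def run_pipeline(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(" ".join(shlex.quote(arg) for arg in argv))
        if not dry_run:
            # check=True stops the pipeline if a step exits with an error.
            subprocess.run(argv, check=True)


commands = qc_and_fix_commands(
    gff="example_file/example.gff3",
    fasta="example_file/reference.fa",
    report="error.txt",
    statistic="statistic.txt",
    corrected="corrected.gff3",
)
run_pipeline(commands)  # dry run: only prints the two commands
```

Passing the commands as argument lists rather than a single shell string avoids quoting problems with file names that contain spaces.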

gff3_merge - merge two GFF3 files

gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt
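
The help text says that models from GFF3 file 1 replace models from GFF3 file 2 "based on their replace tag". The sketch below shows how such a tag could be read from column 9, assuming that it is stored as an attribute named replace holding a comma-separated list of reference model IDs, with NA marking a model that replaces nothing. Both the attribute layout and the function name are assumptions made for illustration; consult the GFF3toolkit documentation for the actual convention.

```python
def replace_targets(column9):
    """Return the IDs named in a 'replace' attribute.

    None -> no replace attribute present
    []   -> 'replace=NA' (assumed here to mean a new model replacing nothing)
    """
    attributes = dict(field.split("=", 1)
                      for field in column9.strip().split(";") if "=" in field)
    value = attributes.get("replace")
    if value is None:
        return None
    if value == "NA":
        return []
    return [model_id for model_id in value.split(",") if model_id]


# Invented attribute strings for this example.
print(replace_targets("ID=new1;replace=old1,old2"))
print(replace_targets("ID=new2;replace=NA"))
print(replace_targets("ID=new3"))
```

Models without a replace attribute are the ones for which, according to the help text, gff3_merge assigns replace tags automatically unless -noAuto is given.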

Reference

Chen MM, Lin H, Chiang LM, Childers CP, Poelchau MF. The GFF3toolkit: QC and Merge Pipeline for Genome Annotation. Methods Mol Biol. 2019;1858:75-87.