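GFF3toolkit (https://github.com/NAL-i5K/GFF3toolkit), supported by the i5k Workspace @ NAL, provides a suite of tools for processing GFF3-format gene annotations from arthropod genome projects and their research communities. Quality-control and merge procedures are proposed together with GFF3toolkit to improve the GFF3 formatting of gene annotations. In particular, the toolkit can sort a GFF3 file, detect GFF3 format errors, merge two GFF3 files, and generate biological sequences from a GFF3 file.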
Installation
Tested in a Python 3.7 environment on Ubuntu 18.04 (using Docker; host OS macOS 10.14).
Dependencies
pip install gff3tool
#latest version
pip install git+https://github.com/NAL-i5K/GFF3toolkit.git
> gff3_QC -h
$ gff3_QC -h
usage: gff3_QC [-h] [-g GFF] [-f FASTA] [-noncg] [-i] [-n ALLOWED_NUM_OF_N]
[-t [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]]]
[-o OUTPUT] [-s STATISTIC] [-v]
Testing environment:
1. Python 2.7
Inputs:
1. GFF3: Specify the file name with the -g or --gff argument; Please note that this program requires gene/pseudogene and mRNA/pseudogenic_transcript to have an ID attribute in column 9.
2. fasta file: Specify the file name with the -f or --fasta argument
Outputs:
1. Error report for the input GFF3 file
* Line_num: Line numbers of the found problematic models in the input GFF3 file.
* Error_code: Error codes for the found problematic models. Please refer to lib/ERROR/ERROR.py to see the full list of Error_code and the corresponding Error_tag.
* Error_tag: Detail of the found errors for the problematic models. Please refer to lib/ERROR/ERROR.py to see the full list of Error_code and the corresponding Error_tag.
Quick start:
gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o test
or
gff3_QC --gff example_file/example.gff3 --fasta example_file/reference.fa --output test
optional arguments:
-h, --help show this help message and exit
-g GFF, --gff GFF Genome annotation file, gff3 format
-f FASTA, --fasta FASTA Genome sequences, fasta format
-noncg, --noncanonical_gene
gff3 file is not formatted in the canonical gene model
format.
-i, --initial_phase Check whether initial CDS phase is 0 (default: no
check)
-n ALLOWED_NUM_OF_N, --allowed_num_of_n ALLOWED_NUM_OF_N
Max number of Ns allowed in a feature, anything more
will be reported as an error (default: 0)
-t [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]], --check_n_feature_types [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]]
Count the number of Ns in each feature with the type
specified, multiple types may be specified, ex: -t CDS
exon (default: "CDS")
-o OUTPUT, --output OUTPUT
output file name (default: report.txt)
-s STATISTIC, --statistic STATISTIC
statistic file name (default: statistic.txt)
-v, --version show program's version number and exit
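The -n and -t options in the help text above describe a per-feature count of N bases that is compared against a threshold. The following is only a minimal sketch of that idea in plain Python, not the code gff3_QC actually runs. The function names `count_n` and `exceeds_allowed_n` are hypothetical, and coordinates are assumed to be 1-based and inclusive, as in GFF3.

```python
def count_n(genome_seq, start, end):
    """Count N bases in a feature given 1-based inclusive coordinates."""
    # GFF3 coordinates are 1-based and inclusive, Python slices are 0-based.
    return genome_seq[start - 1:end].upper().count("N")


def exceeds_allowed_n(genome_seq, start, end, allowed_num_of_n=0):
    """Return True when the feature holds more Ns than the allowed maximum."""
    # Hypothetical helper: mirrors the wording "anything more will be
    # reported as an error", with the same default of 0.
    return count_n(genome_seq, start, end) > allowed_num_of_n


if __name__ == "__main__":
    scaffold = "ACGTNNACGT"
    print(count_n(scaffold, 4, 7))
    print(exceeds_allowed_n(scaffold, 4, 7))
```

With the default threshold of 0, any feature overlapping a gap in the assembly would be flagged, which is presumably why the option exists.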
> gff3_fix -h
$ gff3_fix -h
usage: gff3_fix [-h] [-qc_r QC_REPORT] [-g GFF] [-og OUTPUT_GFF] [-v]
Testing environment:
1. Python 3.*
Input:
1. Error report: Error report from gff3_QC.py. Specify the file name with the -qc_r or --qc_report argument;
2. GFF3: Specify the file name with the -g or --gff argument;
Output:
1. Corrected GFF3
Quick start:
gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3
optional arguments:
-h, --help show this help message and exit
-qc_r QC_REPORT, --qc_report QC_REPORT
Error report from gff3_QC.py
-g GFF, --gff GFF Genome annotation file, gff3 format
-og OUTPUT_GFF, --output_gff OUTPUT_GFF
output gff3 file name
-v, --version show program's version number and exit
> gff3_sort
$ gff3_sort
usage: gff3_sort [-h] [-g GFF_FILE] [-og OUTPUT_GFF] [-t SORT_TEMPLATE] [-i]
[-v] [-r]
Sort a GFF3 file according to the order of Scaffold (seqID), coordinates on a Scaffold, and feature relationship based on sequence ontology.
Inputs:
1. GFF3 file: Specify the file name with the -g argument
Outputs:
1. Sorted GFF3 file: Specify the file name with the -og argument
Examples:
1. Specify the input, output file names and options using short arguments:
gff3_sort -g example_file/example.gff3 -og example_file/example_sorted.gff
2. Specify the input, output file names and options using long arguments:
gff3_sort --gff_file example_file/example.gff3 --output_gff example_file/example_sorted.gff
optional arguments:
-h, --help show this help message and exit
-g GFF_FILE, --gff_file GFF_FILE
GFF3 file that you would like to sort.
-og OUTPUT_GFF, --output_gff OUTPUT_GFF
Sorted GFF3 file
-t SORT_TEMPLATE, --sort_template SORT_TEMPLATE
A file that indicates the sorting order of features
within a gene model
-i, --isoform_sort Sort multi-isoform gene models by feature type
(default: False)
-v, --version show program's version number and exit
-r, --reference Sort scaffold (seqID) by order of appearance in gff3
file (default is by number)
> gff3_to_fasta -h
$ gff3_to_fasta -h
usage: gff3_to_fasta [-h] [-g GFF] [-f FASTA] [-embf] [-st SEQUENCE_TYPE]
[-u [USER_DEFINED [USER_DEFINED ...]]] [-d DEFLINE]
[-o OUTPUT_PREFIX] [-noQC] [-v]
Extract sequences from specific regions of genome based on gff file.
Testing enviroment:
1. Python 2.7
Required inputs:
1. GFF3: specify the file name with the -g argument
2. Fasta file: specify the file name with the -f argument
3. Output prefix: specify with the -o argument
Outputs:
1. Fasta formatted sequence file based on the gff3 file.
Example command:
gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences
optional arguments:
-h, --help show this help message and exit
-g GFF, --gff GFF Genome annotation file in GFF3 format
-f FASTA, --fasta FASTA Genome sequences in FASTA format
-embf, --embedded_fasta
Specify this option if you want to extract sequence from embedded fasta.
-st SEQUENCE_TYPE, --sequence_type SEQUENCE_TYPE
Type of sequences you would like to extract:
"all" - FASTA files for all types of sequences listed below, except user_defined;
"gene" - gene sequence for each record;
"exon" - exon sequence for each record;
"pre_trans" - genomic region of a transcript model (premature transcript);
"trans" - spliced transcripts (only exons included);
"cds" - coding sequences;
"pep" - peptide sequences;
"user_defined" - specify parent and child features via the -u argument.
-u [USER_DEFINED [USER_DEFINED ...]], --user_defined [USER_DEFINED [USER_DEFINED ...]]
Specify parent and child features for fasta extraction, format: [parent feature type] [child feature type] (ex: -u mRNA CDS). Required if -st user_defined is given.
-d DEFLINE, --defline DEFLINE
Defline format in the output FASTA file:
"simple" - only ID would be shown in the defline;
"complete" - complete information of the feature would be shown in the defline.
-o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
Prefix of output file name
-noQC, --quality_control
Specify this option if you do not want to execute quality control for gff file. (default: QC is executed)
-v, --version show program's version number and exit
> gff3_merge -h
$ gff3_merge -h
usage: gff3_merge [-h] [-g1 GFF_FILE1] [-g2 GFF_FILE2] [-f FASTA]
[-u1 USER_DEFINED_FILE1] [-u2 USER_DEFINED_FILE2]
[-og OUTPUT_GFF] [-r REPORT_FILE] [-a] [-noAuto] [-v]
Merge two gff files of the same genome into one.
Testing enviroment:
1. Python 3.*
Inputs:
1. GFF3 file 1: Gff with annotations modified relative to the original gff (e.g. output from the Apollo program), specify the file name with the -g1 argument
2. GFF3 file 2: Original/Reference gff, specify the file name with the -g2 argument
3. FASTA: Genomic sequences in the FASTA format with the -f argument
Outputs:
1. Merged GFF3: Models from GFF3 file 1 replace Models from GFF3 file 2 based on their replace tag. Specify the output file name with the -og argument
2. Log report for the integration: specify the file name with the -r argument
Examples:
1. Specify the input, output file names and options using short arguments:
gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt
2. Specify the input, output file names and options using long arguments:
gff3_merge --gff_file1 example_file/new_models.gff3 --gff_file2 example_file/reference.gff3 --fasta example_file/reference.fa --output_gff merged.gff --report_file merged_report.txt
optional arguments:
-h, --help show this help message and exit
-g1 GFF_FILE1, --gff_file1 GFF_FILE1
Updated GFF3 file, such as Apollo gff
-g2 GFF_FILE2, --gff_file2 GFF_FILE2
Reference GFF3 file, such as Maker gff or OGS gff
-f FASTA, --fasta FASTA Genomic sequences in the fasta format
-u1 USER_DEFINED_FILE1, --user_defined_file1 USER_DEFINED_FILE1
File for specifing parent and child features for fasta
extraction from updated GFF3 file.
-u2 USER_DEFINED_FILE2, --user_defined_file2 USER_DEFINED_FILE2
File for specifing parent and child features for fasta
extraction from reference GFF3 file.
-og OUTPUT_GFF, --output_gff OUTPUT_GFF
The merged GFF3 file
-r REPORT_FILE, --report_file REPORT_FILE
Log file for the integration
-a, --all auto-assignment replace tags for all transcript
features. (default: Only automatically assign replace
tags for the transcript without replace tags)
-noAuto, --auto_assignment
Turn off the auto-assignment of replace tags, if you
already have replace tags in your updated gff
(default: Automatically assign replace tags and then
merge the gff files)
-v, --version show program's version number and exit
Usage
The test files bundled with the repository are used here.
git clone https://github.com/NAL-i5K/GFF3toolkit.git
cd GFF3toolkit/
gff3_sort - sorts a GFF3 file by scaffold order, coordinates, and the parent-child relationships of the annotations
gff3_sort -g example_file/example.gff3 -og example-sorted.gff3
- -og Sorted GFF3 file
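To make the sorting criteria more concrete, here is a rough sketch in plain Python. It is not how gff3_sort is implemented: it orders feature lines by seqID and start coordinate, and it only approximates the parent-child ordering by placing the longer feature first when two features share a start. The function name `sort_gff3_lines` is hypothetical, and seqIDs are compared as plain strings, which is an assumption that differs from the tool's default numeric ordering.

```python
def sort_gff3_lines(lines):
    """Sort GFF3 feature lines by (seqid, start, longer feature first)."""
    # Keep directives and comments (lines starting with '#') at the top.
    header = [line for line in lines if line.startswith("#")]
    features = [line for line in lines
                if line.strip() and not line.startswith("#")]

    def sort_key(line):
        cols = line.rstrip("\n").split("\t")
        seqid, start, end = cols[0], int(cols[3]), int(cols[4])
        # Negating the end puts a parent (e.g. gene) before a child
        # (e.g. exon) that starts at the same coordinate.
        return (seqid, start, -end)

    return header + sorted(features, key=sort_key)


if __name__ == "__main__":
    example = [
        "##gff-version 3",
        "scaffold2\t.\tgene\t100\t500\t.\t+\t.\tID=g2",
        "scaffold1\t.\texon\t10\t50\t.\t+\t.\tParent=m1",
        "scaffold1\t.\tgene\t10\t90\t.\t+\t.\tID=g1",
    ]
    for line in sort_gff3_lines(example):
        print(line)
```

The real tool resolves the hierarchy from the Parent attributes and the sequence ontology, so this sketch should only be read as an illustration of the ordering.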
gff3_to_fasta - extracts sequences (spliced transcripts, CDS, peptides) from specific regions of the genome
gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences
- -st Type of sequences you would like to extract:
- "all" - FASTA files for all types of sequences listed below, except user_defined;
- "gene" - gene sequence for each record;
- "exon" - exon sequence for each record;
- "pre_trans" - genomic region of a transcript model (premature transcript);
- "trans" - spliced transcripts (only exons included);
- "cds" - coding sequences;
- "pep" - peptide sequences;
- "user_defined" - specify parent and child features via the -u argument.
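The "trans" and "cds" sequence types above are built by joining several genomic segments. The sketch below shows that general idea in plain Python under simplifying assumptions: segments are given as 1-based inclusive (start, end) pairs on a single scaffold, and a minus-strand feature is reverse-complemented after joining. The helper names are hypothetical and this is not the toolkit's own code.

```python
# Translation table for complementing DNA, including N and lower case.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spliced_sequence(genome_seq, segments, strand):
    """Join exon or CDS segments into one spliced sequence.

    segments: list of (start, end) tuples, 1-based inclusive as in GFF3.
    strand: '+' or '-'.
    """
    # Join segments in genomic order, then flip once for the minus strand.
    parts = [genome_seq[start - 1:end] for start, end in sorted(segments)]
    joined = "".join(parts)
    return reverse_complement(joined) if strand == "-" else joined


if __name__ == "__main__":
    scaffold = "AACCGGTTAA"
    print(spliced_sequence(scaffold, [(1, 2), (5, 6)], "+"))
    print(spliced_sequence(scaffold, [(1, 2), (5, 6)], "-"))
```

A "pre_trans" sequence would correspond to a single segment covering the whole transcript, so the same function applies with one (start, end) pair.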
gff3_QC - detects various types of errors in a GFF file
gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt -s statistic.txt
Output files: statistic.txt and error.txt
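The gff3_QC help text notes that gene/pseudogene and mRNA/pseudogenic_transcript features need an ID attribute in column 9. Before running the full QC, that one requirement can be checked with a few lines of plain Python. This is a minimal sketch with hypothetical function names, and it does not reproduce the tool's error codes or its report format.

```python
# Feature types that gff3_QC expects to carry an ID attribute.
REQUIRE_ID = {"gene", "pseudogene", "mRNA", "pseudogenic_transcript"}


def parse_attributes(column9):
    """Parse a GFF3 attribute column such as 'ID=g1;Name=abc' into a dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def lines_missing_id(lines):
    """Return 1-based line numbers of features that lack a required ID."""
    missing = []
    for line_num, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line; ignored in this sketch
        if cols[2] in REQUIRE_ID and "ID" not in parse_attributes(cols[8]):
            missing.append(line_num)
    return missing


if __name__ == "__main__":
    with open("example_file/example.gff3") as handle:
        print(lines_missing_id(handle))
```

The path in the usage part is the example file from the cloned repository; any GFF3 file can be substituted.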
gff3_fix - corrects the errors detected by QC
gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3
- -qc_r Error report from gff3_QC.py
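Because gff3_fix consumes the report written by gff3_QC, the two commands are naturally run back to back. The sketch below wraps them in Python using only the flags shown in this post. The function names and default file names are assumptions for illustration, and running the commands requires that gff3tool is installed and on PATH.

```python
import shutil
import subprocess


def build_qc_fix_commands(gff, fasta, report="error.txt",
                          statistic="statistic.txt", fixed="corrected.gff3"):
    """Build the gff3_QC and gff3_fix command lines as argument lists."""
    qc = ["gff3_QC", "-g", gff, "-f", fasta, "-o", report, "-s", statistic]
    fix = ["gff3_fix", "-qc_r", report, "-g", gff, "-og", fixed]
    return [qc, fix]


def run_commands(commands):
    """Run each command in order and stop at the first failure."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise FileNotFoundError(cmd[0] + " not found on PATH")
        subprocess.run(cmd, check=True)


# Example call, assuming the cloned repository is the working directory:
# run_commands(build_qc_fix_commands("example_file/example.gff3",
#                                    "example_file/reference.fa"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.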
gff3_merge - merges two GFF files
gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt
Citation
The GFF3toolkit: QC and Merge Pipeline for Genome Annotation
Methods Mol Biol. 2019;1858:75-87
Chen MM, Lin H, Chiang LM, Childers CP, Poelchau MF
Related