Informatics on Mac (macでインフォマティクス)

A collection of notes on informatics for HTS (NGS).

GFF3toolkit: a toolkit for GFF3 files

 

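GFF3toolkit (https://github.com/NAL-i5K/GFF3toolkit), supported by the i5k Workspace @ NAL, provides a suite of tools for processing GFF3-format gene annotations from arthropod genome projects and their research communities. Along with the toolkit, its authors propose quality-control and merge procedures for improving gene annotations in GFF3 format. Specifically, the toolkit can sort GFF3 files, detect GFF3 format errors, merge two GFF3 files, and generate biological sequences from a GFF3 file.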

 

 

Installation

Tested in a Python 3.7 environment on Ubuntu 18.04 (running in Docker; host OS macOS 10.14).

Dependencies

GitHub

pip install gff3tool

#latest version
pip install git+https://github.com/NAL-i5K/GFF3toolkit.git

gff3_QC -h

$ gff3_QC -h

usage: gff3_QC [-h] [-g GFF] [-f FASTA] [-noncg] [-i] [-n ALLOWED_NUM_OF_N]

               [-t [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]]]

               [-o OUTPUT] [-s STATISTIC] [-v]

 

    Testing environment:

    1. Python 2.7

 

    Inputs:

    1. GFF3: Specify the file name with the -g or --gff argument; Please note that this program requires gene/pseudogene and mRNA/pseudogenic_transcript to have an ID attribute in column 9.

    2. fasta file: Specify the file name with the -f or --fasta argument

 

    Outputs:

    1. Error report for the input GFF3 file

        * Line_num: Line numbers of the found problematic models in the input GFF3 file.

        * Error_code: Error codes for the found problematic models. Please refer to lib/ERROR/ERROR.py to see the full list of Error_code and the corresponding Error_tag.

        * Error_tag: Detail of the found errors for the problematic models. Please refer to lib/ERROR/ERROR.py to see the full list of Error_code and the corresponding Error_tag.

 

    Quick start:

    gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o test

    or

    gff3_QC --gff example_file/example.gff3 --fasta example_file/reference.fa --output test

 

optional arguments:

  -h, --help            show this help message and exit

  -g GFF, --gff GFF     Genome annotation file, gff3 format

  -f FASTA, --fasta FASTA

                        Genome sequences, fasta format

  -noncg, --noncanonical_gene

                        gff3 file is not formatted in the canonical gene model

                        format.

  -i, --initial_phase   Check whether initial CDS phase is 0 (default: no

                        check)

  -n ALLOWED_NUM_OF_N, --allowed_num_of_n ALLOWED_NUM_OF_N

                        Max number of Ns allowed in a feature, anything more

                        will be reported as an error (default: 0)

  -t [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]], --check_n_feature_types [CHECK_N_FEATURE_TYPES [CHECK_N_FEATURE_TYPES ...]]

                        Count the number of Ns in each feature with the type

                        specified, multiple types may be specified, ex: -t CDS

                        exon (default: "CDS")

  -o OUTPUT, --output OUTPUT

                        output file name (default: report.txt)

  -s STATISTIC, --statistic STATISTIC

                        statistic file name (default: statistic.txt)

  -v, --version         show program's version number and exit
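
The help text above notes that gene/pseudogene and mRNA/pseudogenic_transcript rows need an ID attribute in column 9. The following is a minimal Python sketch for checking that requirement before running gff3_QC. It is not part of GFF3toolkit: the function names `parse_attributes` and `rows_missing_id` are hypothetical, and the sketch assumes a tab-delimited, nine-column GFF3 file.

```python
# Feature types that gff3_QC expects to carry an ID attribute (from the help text).
TYPES_NEEDING_ID = {"gene", "pseudogene", "mRNA", "pseudogenic_transcript"}


def parse_attributes(column9):
    """Split a GFF3 column-9 string such as 'ID=g1;Name=abc' into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def rows_missing_id(gff3_lines):
    """Return 1-based line numbers of gene/transcript rows without an ID."""
    missing = []
    for line_num, line in enumerate(gff3_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # malformed rows are out of scope for this sketch
        attributes = parse_attributes(columns[8])
        if columns[2] in TYPES_NEEDING_ID and "ID" not in attributes:
            missing.append(line_num)
    return missing


example = [
    "##gff-version 3",
    "scaffold1\t.\tgene\t1\t100\t.\t+\t.\tID=gene1",
    "scaffold1\t.\tmRNA\t1\t100\t.\t+\t.\tParent=gene1",
]
print(rows_missing_id(example))  # the mRNA row on line 3 has no ID
```

In practice the list would come from reading the GFF3 file line by line; gff3_QC itself performs a much broader set of checks.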

gff3_fix -h

$ gff3_fix -h

usage: gff3_fix [-h] [-qc_r QC_REPORT] [-g GFF] [-og OUTPUT_GFF] [-v]

 

Testing environment:

1. Python 3.*

 

Input:

1. Error report: Error report from gff3_QC.py. Specify the file name with the -qc_r or --qc_report argument;

2. GFF3: Specify the file name with the -g or --gff argument;

 

Output:

1. Corrected GFF3

 

Quick start:

gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3

 

optional arguments:

  -h, --help            show this help message and exit

  -qc_r QC_REPORT, --qc_report QC_REPORT

                        Error report from gff3_QC.py

  -g GFF, --gff GFF     Genome annotation file, gff3 format

  -og OUTPUT_GFF, --output_gff OUTPUT_GFF

                        output gff3 file name

  -v, --version         show program's version number and exit

gff3_sort

$ gff3_sort

usage: gff3_sort [-h] [-g GFF_FILE] [-og OUTPUT_GFF] [-t SORT_TEMPLATE] [-i]

                 [-v] [-r]

 

Sort a GFF3 file according to the order of Scaffold (seqID), coordinates on a Scaffold, and feature relationship based on sequence ontology.

 

Inputs:

1. GFF3 file: Specify the file name with the -g argument

 

Outputs:

1. Sorted GFF3 file: Specify the file name with the -og argument

 

Examples:

1. Specify the input, output file names and options using short arguments:

   gff3_sort -g example_file/example.gff3 -og example_file/example_sorted.gff

2. Specify the input, output file names and options using long arguments:

   gff3_sort --gff_file example_file/example.gff3 --output_gff example_file/example_sorted.gff

 

optional arguments:

  -h, --help            show this help message and exit

  -g GFF_FILE, --gff_file GFF_FILE

                        GFF3 file that you would like to sort.

  -og OUTPUT_GFF, --output_gff OUTPUT_GFF

                        Sorted GFF3 file

  -t SORT_TEMPLATE, --sort_template SORT_TEMPLATE

                        A file that indicates the sorting order of features

                        within a gene model

  -i, --isoform_sort    Sort multi-isoform gene models by feature type

                        (default: False)

  -v, --version         show program's version number and exit

  -r, --reference       Sort scaffold (seqID) by order of appearance in gff3

                        file (default is by number)

gff3_to_fasta -h

$ gff3_to_fasta -h

usage: gff3_to_fasta [-h] [-g GFF] [-f FASTA] [-embf] [-st SEQUENCE_TYPE]

                     [-u [USER_DEFINED [USER_DEFINED ...]]] [-d DEFLINE]

                     [-o OUTPUT_PREFIX] [-noQC] [-v]

 

Extract sequences from specific regions of genome based on gff file.

Testing enviroment:

1. Python 2.7

 

Required inputs:

1. GFF3: specify the file name with the -g argument

2. Fasta file: specify the file name with the -f argument

3. Output prefix: specify with the -o argument

 

Outputs:

1. Fasta formatted sequence file based on the gff3 file.

 

Example command:

gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences

 

optional arguments:

  -h, --help            show this help message and exit

  -g GFF, --gff GFF     Genome annotation file in GFF3 format

  -f FASTA, --fasta FASTA

                        Genome sequences in FASTA format

  -embf, --embedded_fasta

                        Specify this option if you want to extract sequence from embedded fasta.

  -st SEQUENCE_TYPE, --sequence_type SEQUENCE_TYPE

                        Type of sequences you would like to extract: 

                        "all" - FASTA files for all types of sequences listed below, except user_defined;

                        "gene" - gene sequence for each record;

                        "exon" - exon sequence for each record;

                        "pre_trans" - genomic region of a transcript model (premature transcript);

                        "trans" - spliced transcripts (only exons included);

                        "cds" - coding sequences;

                        "pep" - peptide sequences;

                        "user_defined" - specify parent and child features via the -u argument.

  -u [USER_DEFINED [USER_DEFINED ...]], --user_defined [USER_DEFINED [USER_DEFINED ...]]

                        Specify parent and child features for fasta extraction, format: [parent feature type] [child feature type] (ex: -u mRNA CDS). Required if -st user_defined is given.

  -d DEFLINE, --defline DEFLINE

                        Defline format in the output FASTA file:

                        "simple" - only ID would be shown in the defline;

                        "complete" - complete information of the feature would be shown in the defline.

  -o OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX

                        Prefix of output file name

  -noQC, --quality_control

                        Specify this option if you do not want to execute quality control for gff file. (default: QC is executed)

  -v, --version         show program's version number and exit

gff3_merge -h

$ gff3_merge -h

usage: gff3_merge [-h] [-g1 GFF_FILE1] [-g2 GFF_FILE2] [-f FASTA]

                  [-u1 USER_DEFINED_FILE1] [-u2 USER_DEFINED_FILE2]

                  [-og OUTPUT_GFF] [-r REPORT_FILE] [-a] [-noAuto] [-v]

 

Merge two gff files of the same genome into one.

 

Testing enviroment:

1. Python 3.*

 

Inputs:

1. GFF3 file 1: Gff with annotations modified relative to the original gff (e.g. output from the Apollo program), specify the file name with the -g1 argument

2. GFF3 file 2: Original/Reference gff, specify the file name with the -g2 argument

3. FASTA: Genomic sequences in the FASTA format with the -f argument

 

Outputs:

1. Merged GFF3: Models from GFF3 file 1 replace Models from GFF3 file 2 based on their replace tag. Specify the output file name with the -og argument

2. Log report for the integration: specify the file name with the -r argument

 

Examples:

1. Specify the input, output file names and options using short arguments:

   gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt

2. Specify the input, output file names and options using long arguments:

   gff3_merge --gff_file1 example_file/new_models.gff3 --gff_file2 example_file/reference.gff3 --fasta example_file/reference.fa --output_gff merged.gff --report_file merged_report.txt

 

optional arguments:

  -h, --help            show this help message and exit

  -g1 GFF_FILE1, --gff_file1 GFF_FILE1

                        Updated GFF3 file, such as Apollo gff

  -g2 GFF_FILE2, --gff_file2 GFF_FILE2

                        Reference GFF3 file, such as Maker gff or OGS gff

  -f FASTA, --fasta FASTA

                        Genomic sequences in the fasta format

  -u1 USER_DEFINED_FILE1, --user_defined_file1 USER_DEFINED_FILE1

                        File for specifing parent and child features for fasta

                        extraction from updated GFF3 file.

  -u2 USER_DEFINED_FILE2, --user_defined_file2 USER_DEFINED_FILE2

                        File for specifing parent and child features for fasta

                        extraction from reference GFF3 file.

  -og OUTPUT_GFF, --output_gff OUTPUT_GFF

                        The merged GFF3 file

  -r REPORT_FILE, --report_file REPORT_FILE

                        Log file for the integration

  -a, --all             auto-assignment replace tags for all transcript

                        features. (default: Only automatically assign replace

                        tags for the transcript without replace tags)

  -noAuto, --auto_assignment

                        Turn off the auto-assignment of replace tags, if you

                        already have replace tags in your updated gff

                        (default: Automatically assign replace tags and then

                        merge the gff files)

  -v, --version         show program's version number and exit

 

 

Usage

The examples below use the test files included in the repository.

git clone https://github.com/NAL-i5K/GFF3toolkit.git
cd GFF3toolkit/

 

gff3_sort - sort a GFF3 file by scaffold order, coordinates, and the parent-child relationships of the annotations

gff3_sort -g example_file/example.gff3 -og example-sorted.gff3
  •  -og    Sorted GFF3 file
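
To make the ordering idea concrete, here is a simplified Python sketch that orders feature rows by seqID and then by coordinates. It is an illustration only, not how gff3_sort is implemented: the function names are hypothetical, seqIDs are compared as plain strings, and the parent-child ordering and sort templates that gff3_sort handles are not reproduced.

```python
def sort_key(line):
    """Order rows by seqID (column 1), then start, then longest feature first."""
    columns = line.split("\t")
    start, end = int(columns[3]), int(columns[4])
    # Negating the end puts a longer feature (e.g. a gene) before a shorter
    # one (e.g. its first exon) when both share the same start coordinate.
    return (columns[0], start, -end)


def simple_sort(gff3_lines):
    """Keep '#' lines first, then the feature rows sorted by sort_key."""
    headers = [line for line in gff3_lines if line.startswith("#")]
    features = [line for line in gff3_lines if line.strip() and not line.startswith("#")]
    return headers + sorted(features, key=sort_key)


example = [
    "##gff-version 3",
    "scaffold2\t.\tgene\t5\t50\t.\t+\t.\tID=geneC",
    "scaffold1\t.\tgene\t200\t300\t.\t-\t.\tID=geneB",
    "scaffold1\t.\tgene\t10\t90\t.\t+\t.\tID=geneA",
]
for line in simple_sort(example):
    print(line)
```

For real annotation files gff3_sort remains the right tool, since a coordinate-only sort can separate child features from their parents.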

 

gff3_to_fasta - extract sequences (spliced transcripts, CDS, peptides) from specific regions of the genome

gff3_to_fasta -g example_file/example.gff3 -f example_file/reference.fa -st all -d simple -o test_sequences
  • -st    Type of sequences you would like to extract:  
  1. "all" - FASTA files for all types of sequences listed below, except user_defined;
  2. "gene" - gene sequence for each record;
  3. "exon" - exon sequence for each record;
  4. "pre_trans" - genomic region of a transcript model (premature transcript);
  5. "trans" - spliced transcripts (only exons included);
  6. "cds" - coding sequences;
  7. "pep" - peptide sequences;
  8. "user_defined" - specify parent and child features via the -u argument. 
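
The sequence types listed above come down to slicing the genome by feature coordinates. The sketch below illustrates the idea in Python under stated assumptions: GFF3 coordinates are 1-based and inclusive, minus-strand features are returned as the reverse complement (the usual convention, assumed here rather than taken from the gff3_to_fasta source), and the function names and the in-memory `genome` dict are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_feature(genome, seqid, start, end, strand):
    """Return the sequence of one feature.

    GFF3 coordinates are 1-based and inclusive, so the Python slice
    starts at start - 1 and stops at end.
    """
    sequence = genome[seqid][start - 1:end]
    if strand == "-":
        # Reverse-complement features annotated on the minus strand.
        sequence = sequence.translate(COMPLEMENT)[::-1]
    return sequence


def spliced_transcript(genome, seqid, exons, strand):
    """Join exon sequences in transcription order (the idea behind "trans")."""
    ordered = sorted(exons, reverse=(strand == "-"))
    return "".join(extract_feature(genome, seqid, s, e, strand) for s, e in ordered)


genome = {"scaffold1": "AAACCCGGGTTT"}
print(extract_feature(genome, "scaffold1", 4, 6, "+"))
print(extract_feature(genome, "scaffold1", 1, 6, "-"))
print(spliced_transcript(genome, "scaffold1", [(1, 3), (7, 9)], "-"))
```

The "pre_trans" type corresponds to a single `extract_feature` call over the whole transcript span, while "trans" joins only the exons.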

 

gff3_QC - detect various types of errors in a GFF3 file

gff3_QC -g example_file/example.gff3 -f example_file/reference.fa -o error.txt -s statistic.txt

statistic.txt

[Screenshot: contents of statistic.txt]

error.txt

[Screenshot: contents of error.txt]
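
One of the checks that gff3_QC exposes through its -n (--allowed_num_of_n) and -t (--check_n_feature_types) options is the number of Ns inside a feature. The Python sketch below shows the idea only; it is not the gff3_QC implementation, and the function names and the tuple layout of `features` are hypothetical. The defaults (0 Ns allowed, CDS features only) follow the help text.

```python
def count_n(sequence):
    """Count ambiguous bases (N or n) in a sequence."""
    return sequence.upper().count("N")


def features_exceeding_n(genome, features, allowed_num_of_n=0, feature_types=("CDS",)):
    """Return IDs of features of the given types with more Ns than allowed.

    `features` is a list of (feature_id, seqid, type, start, end) tuples
    with 1-based inclusive coordinates, mirroring the GFF3 columns.
    """
    flagged = []
    for feature_id, seqid, feature_type, start, end in features:
        if feature_type not in feature_types:
            continue
        if count_n(genome[seqid][start - 1:end]) > allowed_num_of_n:
            flagged.append(feature_id)
    return flagged


genome = {"scaffold1": "ATGNNNCCCTAA"}
features = [
    ("cds1", "scaffold1", "CDS", 1, 6),
    ("cds2", "scaffold1", "CDS", 7, 12),
    ("exon1", "scaffold1", "exon", 1, 12),
]
print(features_exceeding_n(genome, features))
print(features_exceeding_n(genome, features, feature_types=("CDS", "exon")))
```

Raising `allowed_num_of_n` to 3 would let `cds1` pass, which mirrors how a larger -n value relaxes the check.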

 

gff3_fix - fix the errors detected by gff3_QC

gff3_fix -qc_r error.txt -g example_file/example.gff3 -og corrected.gff3
  •  -qc_r   Error report from gff3_QC.py
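
Because gff3_fix consumes the report written by gff3_QC, the two commands are naturally run back to back. The following Python sketch builds both command lines using only the flags shown in this post; the helper name `qc_then_fix_commands` is hypothetical, and the commands are printed rather than executed so the sketch runs without GFF3toolkit installed.

```python
import shlex


def qc_then_fix_commands(gff, fasta, report="error.txt", corrected="corrected.gff3"):
    """Build the gff3_QC and gff3_fix command lines as argument lists."""
    qc = ["gff3_QC", "-g", gff, "-f", fasta, "-o", report]
    fix = ["gff3_fix", "-qc_r", report, "-g", gff, "-og", corrected]
    return [qc, fix]


commands = qc_then_fix_commands("example_file/example.gff3", "example_file/reference.fa")
for command in commands:
    # Printed only; on a machine with GFF3toolkit installed, each list can be
    # passed to subprocess.run(command, check=True) instead.
    print(" ".join(shlex.quote(argument) for argument in command))
```

Passing the same report file name to both steps keeps the -o output of gff3_QC and the -qc_r input of gff3_fix in sync.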

 

gff3_merge - merge two GFF3 files

gff3_merge -g1 example_file/new_models.gff3 -g2 example_file/reference.gff3 -f example_file/reference.fa -og merged.gff -r merged_report.txt

 

 

Citation

The GFF3toolkit: QC and Merge Pipeline for Genome Annotation

Methods Mol Biol. 2019;1858:75-87
Chen MM, Lin H, Chiang LM, Childers CP, Poelchau MF

 

Related