2023-02-16

AGAT: a toolkit for GTF/GFF files

2023/02/27: added a note to the agat_sp_add_introns.pl section

From the project homepage:

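AGAT can check, fix, and add missing information (features and attributes) for any kind of GTF or GFF file, and produce complete, sorted, standard GFF3 output. Over the years it has been extended with many tools that cover just about any task on GTF/GFF files (sanitizing, converting, merging, modifying, filtering, FASTA sequence extraction, adding information, and so on). Compared with other approaches, AGAT stays robust even on the most poorly formed GTF/GFF files.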

Documentation

https://agat.readthedocs.io/en/latest/index.html

Explanation of the GTF/GFF formats

https://agat.readthedocs.io/en/latest/gxf.html#gff

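(This page is very well organized. The table further down makes the differences between format versions easy to follow. The "Extra" section at the bottom covers interpretation differences caused by the lack of standardization (for example, whether the stop codon is included in the CDS) as well as the differences between Ensembl and GENCODE GTF files, which some readers may find useful.)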

Installation

Github

#conda (link)
mamba install -c bioconda agat -y


$ agat --tools

agat_convert_bed2gff.pl

agat_convert_embl2gff.pl

agat_convert_genscan2gff.pl

agat_convert_mfannot2gff.pl

agat_convert_minimap2_bam2gff.pl

agat_convert_sp_gff2bed.pl

agat_convert_sp_gff2gtf.pl

agat_convert_sp_gff2tsv.pl

agat_convert_sp_gff2zff.pl

agat_convert_sp_gxf2gxf.pl

agat_sp_Prokka_inferNameFromAttributes.pl

agat_sp_add_attribute_shortest_exon_size.pl

agat_sp_add_attribute_shortest_intron_size.pl

agat_sp_add_introns.pl

agat_sp_add_start_and_stop.pl

agat_sp_alignment_output_style.pl

agat_sp_clipN_seqExtremities_and_fixCoordinates.pl

agat_sp_compare_two_BUSCOs.pl

agat_sp_compare_two_annotations.pl

agat_sp_complement_annotations.pl

agat_sp_ensembl_output_style.pl

agat_sp_extract_attributes.pl

agat_sp_extract_sequences.pl

agat_sp_filter_by_ORF_size.pl

agat_sp_filter_by_locus_distance.pl

agat_sp_filter_by_mrnaBlastValue.pl

agat_sp_filter_feature_by_attribute_presence.pl

agat_sp_filter_feature_by_attribute_value.pl

agat_sp_filter_feature_from_keep_list.pl

agat_sp_filter_feature_from_kill_list.pl

agat_sp_filter_gene_by_intron_numbers.pl

agat_sp_filter_gene_by_length.pl

agat_sp_filter_incomplete_gene_coding_models.pl

agat_sp_filter_record_by_coordinates.pl

agat_sp_fix_cds_phases.pl

agat_sp_fix_features_locations_duplicated.pl

agat_sp_fix_fusion.pl

agat_sp_fix_longest_ORF.pl

agat_sp_fix_overlaping_genes.pl

agat_sp_fix_small_exon_from_extremities.pl

agat_sp_flag_premature_stop_codons.pl

agat_sp_flag_short_introns.pl

agat_sp_functional_statistics.pl

agat_sp_keep_longest_isoform.pl

agat_sp_kraken_assess_liftover.pl

agat_sp_list_short_introns.pl

agat_sp_load_function_from_protein_align.pl

agat_sp_manage_IDs.pl

agat_sp_manage_UTRs.pl

agat_sp_manage_attributes.pl

agat_sp_manage_functional_annotation.pl

agat_sp_manage_introns.pl

agat_sp_merge_annotations.pl

agat_sp_prokka_fix_fragmented_gene_annotations.pl

agat_sp_sensitivity_specificity.pl

agat_sp_separate_by_record_type.pl

agat_sp_statistics.pl

agat_sp_webApollo_compliant.pl

agat_sq_add_attributes_from_tsv.pl

agat_sq_add_hash_tag.pl

agat_sq_add_locus_tag.pl

agat_sq_count_attributes.pl

agat_sq_filter_feature_from_fasta.pl

agat_sq_list_attributes.pl

agat_sq_manage_IDs.pl

agat_sq_manage_attributes.pl

agat_sq_mask.pl

agat_sq_remove_redundant_entries.pl

agat_sq_repeats_analyzer.pl

agat_sq_reverse_complement.pl

agat_sq_rfam_analyzer.pl

agat_sq_split.pl

agat_sq_stat_basic.pl

Usage

The tools are introduced in the order in which they appear on the homepage.

Every tool prints detailed help when run with "-h". Check it before running a tool.

agat_convert_bed2gff.pl - BED => GFF conversion

agat_convert_bed2gff.pl --bed infile.bed -o outfile

agat_convert_embl2gff.pl - EMBL flat file => GFF conversion

agat_convert_embl2gff.pl --embl infile.embl -o outfile

agat_convert_genscan2gff.pl - genscan file => GFF conversion

agat_convert_genscan2gff.pl --genscan infile.bed -o outfile

agat_convert_mfannot2gff.pl - MFannot "masterfile" annotation produced by the MFannot pipeline => GFF conversion

agat_convert_mfannot2gff.pl -m mfannot_file -o outfile

agat_convert_minimap2_bam2gff.pl - minimap2 output (bam or sam) => GFF conversion

#bam
agat_convert_minimap2_bam2gff.pl -i infile.bam -o outfile

#sam
agat_convert_minimap2_bam2gff.pl -i infile.sam -o outfile

agat_convert_sp_gff2bed.pl - GTF/GFF (GXF) file => BED conversion

agat_convert_sp_gff2bed.pl --gff file.gff -o outfile

agat_convert_sp_gff2gtf.pl - converts any GTF/GFF file into a "proper" GTF file

agat_convert_sp_gff2gtf.pl --gff infile.gtf -o outfile

--gtf_version version of the GTF output (1,2,2.1,2.2,2.5,3 or relax). Default 3.

With the "--gtf_version" option, the version of the output GTF can be chosen from seven types (1, 2, 2.1, 2.2, 2.5, 3, or relax).

agat_convert_sp_gff2tsv.pl - GTF/GFF file => tabular format conversion

agat_convert_sp_gff2tsv.pl -gff file.gff -o outfile

agat_convert_sp_gff2zff.pl - converts a GTF/GFF file into the zff format used by SNAP

agat_convert_sp_gff2zff.pl --gff file.gff --fasta file.fasta -o outfile

agat_convert_sp_gxf2gxf.pl - fixes and standardizes any GTF/GFF file into a complete, sorted GTF/GFF file. Input with a .gz extension is also accepted.

agat_convert_sp_gxf2gxf.pl -g infile.gff -o outfile

-g, --gff or -ref String - Input GTF/GFF file. Compressed file with .gz extension is accepted.

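It removes duplicated features, fixes duplicated IDs, adds missing ID and Parent attributes, deflates factorized attributes (features with several parents are duplicated, each copy with a unique ID), adds missing features where possible (e.g. adds exons if only CDS are described, adds UTRs if CDS and exons are described), fixes feature locations (e.g. checks that exons lie within their parent mRNA and gene features), and more.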

agat_sp_Prokka_inferNameFromAttributes.pl - for a Prokka gff annotation file, fills the Name attribute based on the <gene> attribute

agat_sp_Prokka_inferNameFromAttributes.pl -gff file.gff -o outfile

agat_sp_add_introns.pl - adds intron features to a gtf/gff file that does not have them

agat_sp_add_introns.pl --gff infile --out outFile

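Addendum

If UTR information lies outside the stop codon, the script stops with an error. Commenting out just those lines with # avoids the error. The output still contains the UTRs, with the # removed.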

agat_sp_add_start_and_stop.pl - adds start and stop codons when a CDS feature exists.

agat_sp_add_start_and_stop.pl --gff infile.gff --fasta genome.fa --out outFile

--ct, --codon or --table Codon table to use. [default 1]

The script looks at the nucleotide sequence to check whether start and stop codons are present. It also works when a start or stop codon is split across several CDS features.

agat_sp_alignment_output_style.pl - converts a regular gtf/gff annotation file into gff3 alignment format. The relationships between features are expressed as a match / match_part structure in column 3.

agat_sp_alignment_output_style.pl -g infile.gff -o outfile

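agat_sp_clipN_seqExtremities_and_fixCoordinates.pl - clips the runs of N at the extremities of sequences. Annotations on the clipped sequences are adjusted accordingly to stay consistent. Provide the GFF/GTF and the sequences containing the Ns.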

agat_sp_clipN_seqExtremities_and_fixCoordinates.pl -g infile.gff -f infile.fasta --of fixed.fasta --og fixed.gff

agat_sp_compare_two_BUSCOs.pl - compares the results of two BUSCO runs (genome mode and proteome mode) and identifies the differences.

agat_sp_compare_two_BUSCOs.pl --f1 <input_busco_dir1> --f2 <input_busco_dir2> -o output_dir

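The BUSCO results of the first run (genome mode; complete, fragmented, duplicated) are compared with those of the second run. The comparison is reported in a txt file, and the annotated complete, fragmented, and duplicated BUSCOs from the first run are extracted as gff files.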

agat_sp_compare_two_annotations.pl - compares two annotations of the same assembly.

agat_sp_compare_two_annotations.pl -gff1 infile1.gff -gff2 infile2.gff -o outFile

It provides information about gene splits and fusions between the two annotations, and outputs a table at the end. See the manual for the output format.

agat_sp_complement_annotations.pl - complements a reference annotation with other annotations. Level-1 (l1) features from addfile.gff that do not overlap l1 features of the reference annotation are added.

agat_sp_complement_annotations.pl --ref annotation_ref.gff --add addfile1.gff --add addfile2.gff --out outFile

agat_sp_ensembl_output_style.pl - takes a regular GFF3 annotation file and converts it to an Ensembl-like GFF3 format

agat_sp_ensembl_output_style.pl -g infile.gff -o outFile

agat_sp_extract_attributes.pl - extracts selected attributes from a GTF/GFF file, for all feature types or for specific ones. Column 9 of a GTF/GFF file holds the list of attributes; an attribute looks like tag=value.

agat_sp_extract_attributes.pl -gff file.gff -att locus_tag,product,name -p level2,cds,exon -o outfile

-p, -t or -l primary tag option, case insensitive, list. Allow to specied the feature types that will be handled. You can specified a specific feature by given its primary tag name (column 3) as: cds, Gene, MrNa You can specify directly all the feature of a particular level: level2=mRNA,ncRNA,tRNA,etc level3=CDS,exon,UTR,etc By default all feature are taking in account. fill the option by the value "all" will have the same behaviour.
--attribute, --att, -a attribute tag. The value of the attribute tag specified will be extracted from the feature type specified by the option -p. List of attributes must be coma separated.
--merge or -m By default the values of each attribute tag is writen in its dedicated file. To write the values of all tags in only one file use this option.
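
To make the tag=value structure of column 9 concrete, here is a minimal Python sketch that splits a GFF3 attribute column into a dictionary. It is not how AGAT itself parses files; the function name and the example line are made up for illustration, and it assumes GFF3 syntax (GTF uses a different `tag "value";` form).

```python
from urllib.parse import unquote


def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column (tag=value;tag=value) into a dict of lists."""
    attributes = {}
    for field in column9.strip().split(";"):
        if not field:
            continue  # tolerate a trailing semicolon
        tag, _, value = field.partition("=")
        # GFF3 allows comma-separated multiple values and percent-encoding
        attributes[tag.strip()] = [unquote(v) for v in value.split(",")]
    return attributes


# Hypothetical GFF3 line, invented for illustration
line = "chr1\tsrc\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1;product=kinase%2C putative"
attrs = parse_gff3_attributes(line.split("\t")[8])
print(attrs["ID"], attrs["Parent"], attrs["product"])
```

Each tag maps to a list because a single GFF3 tag may carry several comma-separated values (for example, several Parent IDs).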

agat_sp_extract_sequences.pl - writes sequences in FASTA format according to the features described in the GFF file. Any feature type can be extracted. The feature type is defined in column 3 of the GFF file.

agat_sp_extract_sequences.pl -g infile.gff -f infile.fasta

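Features that span several locations, such as UTR or CDS, are extracted chunk by chunk and then merged to form the biological feature. To extract each chunk independently, use the -split parameter.

The manual entry for this command is particularly detailed and includes many examples, so please check it.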

agat_sp_filter_by_ORF_size.pl - reads a GFF annotation file and writes two GFFs: one with the gene models that pass the test and one with the rest. The default test is ">100", so every gene model with an ORF longer than 100 amino acids passes.

agat_sp_filter_by_ORF_size.pl --gff infile.gff -o outFile

-s or --size ORF size to apply the test. Default 100.

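agat_sp_filter_by_locus_distance.pl - removes or flags loci that are too close to each other. Removing close loci matters for training on intergenic regions: if the intergenic region (the part surrounding a locus) contains part of another locus, the training of intergenic regions is biased (from the manual). Output is GFF.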

agat_sp_filter_by_locus_distance.pl -gff infile.gff -o outFile

--add or --add_flag Instead of filter the result into two output files, write only one and add the flag <low_dist> in the gff.(tag = Lvalue or tag = Rvalue where L is left and R right and the value is the distance with accordingle the left or right locus)

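agat_sp_filter_by_mrnaBlastValue.pl - removes from a gff file every sequence whose similarity to another sequence is above the threshold (only one of them is kept). This is typically useful when building a list of mRNAs for training an ab initio gene finder. A reciprocal (all-vs-all) blast must be run beforehand to produce the blastp input file.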

agat_sp_filter_by_mrnaBlastValue.pl --gff infile.gff --blast blastfile --outfile outFile

--blast The list of the all-vs-all blast file (outfmt 6, blastp)

agat_sp_filter_feature_by_attribute_presence.pl - filters features according to the presence of an attribute (column 9). If the specified attribute is present, the feature is discarded (see the manual).

agat_sp_filter_feature_by_attribute_presence.pl --gff infile.gff -a <tag> --output outfile

-p, --type or -l primary tag option, case insensitive, list. Allow to specied the feature types that will be handled. You can specified a specific feature by given its primary tag name (column 3) as: cds, Gene, MrNa You can specify directly all the feature of a particular level: level2=mRNA,ncRNA,tRNA,etc level3=CDS,exon,UTR,etc By default all feature are taking into account. fill the option by the value "all" will have the same behaviour.
--attribute, --att, -a String - Attributes tag specified will be used to filter the feature type (feature type can also be specified by the option -p). List of attribute tags must be coma separated.
--flip BOLEAN - In order to flip the test and keep features that do have the attribute and filter those without.

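agat_sp_filter_feature_by_attribute_value.pl - filters features according to an attribute value (column 9). If the attribute tag is absent, the feature is not discarded. If the attribute is present and its value passes the test, the feature is discarded (see the manual).

agat_sp_filter_feature_by_attribute_value.pl --gff infile.gff --value 1 -t "=" --output outfile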

--value Value to check in the attribute

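agat_sp_filter_feature_from_keep_list.pl - keeps records based on a keep list. If a feature's ID is in the keep list, it is kept together with all related features (the whole record is kept; a record is the set of all features linked by relationships, e.g. gene + transcript + exon + cds of the same locus).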

agat_sp_filter_feature_from_keep_list.pl --gff infile.gff --keep_list file.txt --output outfile

agat_sp_filter_feature_from_kill_list.pl - removes features based on a kill list. A feature whose ID (case-insensitive) is in the kill list is removed. Note: removing a level-1 or level-2 feature automatically removes all linked subfeatures (see the manual).

agat_sp_filter_feature_from_kill_list.pl --gff infile.gff --kill_list file.txt --output outfile

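agat_sp_filter_gene_by_intron_numbers.pl - filters genes by number of introns. Two files are created: a GFF with the genes that pass the intron-number filter and a GFF with the remaining genes. In the example here, the first file is outfile and the second is outfile_remaining.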

#genes with 10 or more introns
agat_sp_filter_gene_by_intron_numbers.pl --gff infile.gff --test ">=" --nb 10 --output outfile

-n, --nb or --number Integer - Number of introns [Default 0]
-t or --test Test to apply (>, <, =, >= or <=) If you use one of these two characters >, <, please do not forget to quote your parameter like that "<=". Else your terminal will complain. [Default "="].
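
The way --test and --nb combine can be sketched in a few lines of Python. This is only an illustration of the comparison logic under the assumption that each gene has a known intron count; the function name and the counts are hypothetical and this is not AGAT's own code.

```python
import operator

# Map the test strings accepted by --test to comparison functions
TESTS = {">": operator.gt, "<": operator.lt, "=": operator.eq,
         ">=": operator.ge, "<=": operator.le}


def split_by_intron_count(intron_counts, test="=", nb=0):
    """Return (passed, remaining) gene ID lists for a {gene_id: intron_count} dict."""
    compare = TESTS[test]
    passed = [gene for gene, count in intron_counts.items() if compare(count, nb)]
    remaining = [gene for gene in intron_counts if gene not in passed]
    return passed, remaining


# Hypothetical intron counts, invented for illustration
counts = {"geneA": 12, "geneB": 10, "geneC": 3, "geneD": 0}
passed, remaining = split_by_intron_count(counts, test=">=", nb=10)
print(passed)     # ['geneA', 'geneB']
print(remaining)  # ['geneC', 'geneD']
```

The two returned lists correspond to the two output files: the genes that pass the filter and the remaining ones.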

agat_sp_filter_gene_by_length.pl - filters level-1 features (e.g. gene, match) by length. Two GFF files are created: one with the features that pass the length filter and one with the remaining features.

#genes of 1000 bp or longer
agat_sp_filter_gene_by_length.pl --gff infile.gff --test ">=" -s 1000 --output outfile

-s or --size Integer - Gene size in pb [Default 100]

agat_sp_filter_incomplete_gene_coding_models.pl - removes incomplete gene models. An incomplete gene coding model is a gene whose cds lacks a start codon or a stop codon. The behaviour can be changed with the skip_start_check and skip_stop_check options.

agat_sp_filter_incomplete_gene_coding_models.pl --gff infile.gff --fasta genome.fa -o outfile

--skip_start_check or --sstartc Gene model must have a start codon. Activated by default.
--skip_stop_check or --sstopc Gene model must have a stop codon. Activated by default

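agat_sp_filter_record_by_coordinates.pl - filters the input to keep only the records located within the coordinates defined in an input tsv file. A record is a feature or a set of features linked by part-of relationships. By default, records that overlap the coordinates are kept. With the exclude parameter, only records fully contained within the coordinates are kept. Note: with the default parameters, an exon outside the coordinates is still kept if part of its gene overlaps the coordinates.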

agat_sp_filter_record_by_coordinates.pl --gff infile.gff --tsv coordinates.tsv --output outfile

-c, --coordinates, --tsv, -r or --ranges String - tsv file containing the coordinates. Coordinates must be one per line. Each line must contain 3 fields separated by a tabulation. Field1 is the sequence id Field2 is the start coordinate (included) Field3 is the end coordinate (included)
-e or --exclude Select only the features fully containined within the coordinates, exclude the overlapping ones.
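
The difference between the default behaviour (overlap) and --exclude (fully contained) comes down to two interval tests. The sketch below shows them on 1-based, inclusive coordinates, as described for the tsv file. It assumes each record is summarized by the span of its level-1 feature; the function names and coordinates are hypothetical, and this is not AGAT's implementation.

```python
def overlaps(start, end, range_start, range_end):
    """True if [start, end] shares at least one base with the range (inclusive)."""
    return start <= range_end and end >= range_start


def contained(start, end, range_start, range_end):
    """True if [start, end] lies entirely inside the range (inclusive)."""
    return range_start <= start and end <= range_end


def keep_record(record, ranges, exclude=False):
    """record and ranges are (seq_id, start, end) tuples."""
    test = contained if exclude else overlaps
    seq_id, start, end = record
    return any(seq_id == r_seq and test(start, end, r_start, r_end)
               for r_seq, r_start, r_end in ranges)


# Hypothetical coordinates, invented for illustration
ranges = [("chr1", 1000, 2000)]
genes = {"geneA": ("chr1", 1200, 1800),
         "geneB": ("chr1", 1900, 2500),
         "geneC": ("chr2", 1200, 1800)}
default_kept = [g for g, rec in genes.items() if keep_record(rec, ranges)]
strict_kept = [g for g, rec in genes.items() if keep_record(rec, ranges, exclude=True)]
print(default_kept)  # ['geneA', 'geneB']
print(strict_kept)   # ['geneA']
```

geneB straddles the right edge of the range, so it survives the default overlap test but is dropped under the stricter contained test.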

agat_sp_fix_cds_phases.pl - fixes the phases of the cds. Output is GFF.

agat_sp_fix_cds_phases.pl --gff infile.gff -f fasta -o outfile

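agat_sp_fix_features_locations_duplicated.pl - fixes or removes features whose locations are duplicated. This is not an error in a gtf/gff file as such, but it causes problems (after conversion) when submitting to ENA. AGAT fixes the locations by shortening the UTRs by 1 bp (parent features and exons are adjusted as well).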

agat_sp_fix_features_locations_duplicated.pl --gff infile -o outfile

Five cases are handled (see the manual).

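Case 1: isoforms with an identical exon structure. AGAT removes the duplicates, keeping the one with the longest CDS.
Case 2: l2 features (e.g. mRNA) from different gene IDs have identical exons and no CDS at all. AGAT removes one of the duplicates.
Case 3: l2 features (e.g. mRNA) from different gene IDs have identical exon and CDS structures. AGAT removes the duplicates, keeping the one with the longest CDS.
Case 4: l2 features (e.g. mRNA) from different gene IDs have an identical exon structure but different CDS structures. AGAT reshapes the UTRs and modifies the mRNA and gene locations.
Case 5: l2 features (e.g. mRNA) from different gene IDs overlap but have different exon structures. AGAT fixes the gene locations by clipping the UTRs.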

The --model option selects which of these five cases to fix. By default all five are fixed (--model 1,2,3,4,5).

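agat_sp_fix_fusion.pl - searches for other ORFs within the UTRs (UTR3, UTR5) of each gene model described in the GFF file. When an output is specified, the results are written separately: a GFF with the unmodified (intact) genes and a GFF with the modified gene models.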

agat_sp_fix_fusion.pl --gff infile.gff --fasta genome.fa -o outfile

-t or --threshold This is the minimum length of new protein predicted that will be taken in account. By default this value is 100 AA.

agat_sp_fix_longest_ORF.pl - fixes the ORFs of the gene models described in the GFF file. Fixing means replacing the original ORF (defined by the cds) when the longest ORF predicted within the mRNA is different.

agat_sp_fix_longest_ORF.pl -gff infile.gff --fasta genome.fa -o outfile

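Kinds of ORF model to fix. By default all are used (--model 1,2,3,4,5,6).

Model 1 = the original ORF is part of the new ORF, and the new ORF is longer.
Model 2 = the original ORF and the new ORF are different; the new ORF is longer and they do not overlap.
Model 3 = the original ORF and the new ORF are different; the new ORF is longer and they overlap.
Model 4 = the new ORF is shorter because the original ORF contains a stop codon.
Model 5 = the new ORF is shorter, but the original ORF has no premature stop codon. The predicted ORF may be shorter because the original ORF does not begin with a start codon, whereas the prediction is forced to have one. An ORF without a start codon may be incomplete or fragmented, for example because the start region is NNNN, because the start region is XXXX, or because the nucleotides are correct but the prediction tool did not annotate that part (e.g. incomplete evidence in an evidence-based prediction).
Model 6 = the ORFs have the same size but the frame is not correct (a shift of +1 or +2 bp changes the frame).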

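agat_sp_fix_overlaping_genes.pl - checks a GTF/GFF annotation file for cases where the CDSs of different gene features overlap. When such a case is found, the gene features are merged into one: one gene is taken as the reference and the mRNAs of the other gene are linked to it, so that they become isoforms.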

agat_sp_fix_overlaping_genes.pl -f infile -o outfile

agat_sp_fix_small_exon_from_extremities.pl - aims to extend small exons. When submitting an annotation to ENA, exons must be at least 15 nt long. Currently exons are extended only from their extremities, because otherwise there is a risk of breaking the predicted ORF.

agat_sp_fix_small_exon_from_extremities.pl -gff infile.gff --fasta genome.fa -o outfile

--size or -s Minimum exon size accepted in nucleotide. All exon below this size will be extended to this size. Default value = 15.

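agat_sp_flag_premature_stop_codons.pl - flags mRNAs that contain premature stop codons. An attribute named "pseudo" is added, whose value is the positions of all premature stop codons. A gene is flagged as a pseudogene only if all of its isoforms are pseudogenes.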

agat_sp_flag_premature_stop_codons.pl --gff infile.gff --fasta infile.fa --out outfile

agat_sp_flag_short_introns.pl - flags short introns with the <pseudo> attribute. This is useful for avoiding an ERROR when submitting data to EBI. (Typical EBI error message: ********ERROR: Intron usually expected to be at least 10 nt long. Please check the accuracy.)

agat_sp_flag_short_introns.pl --gff infile --out outfile

-intron_size or -i Minimum intron size, default 10. All genes with an intron < of this size will be flagged with the pseudo attribute (the value will be the size of the smallest intron found within the incriminated gene)

agat_sp_functional_statistics.pl - summary statistics for a GFF/GTF file

agat_sp_functional_statistics.pl --gff file.gff -o outfile

--gs or -g This option inform about the genome size in oder to compute more statistics. You can give the size in Nucleotide or directly the fasta file.


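agat_sp_keep_longest_isoform.pl - filters isoforms when they are present. For a locus where all isoforms have a CDS, the isoform with the longest CDS is kept. If some isoforms have a CDS and others do not, the isoform with the longest CDS is kept. If no isoform has a CDS, the one with the longest concatenated exons is kept.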

agat_sp_keep_longest_isoform.pl -gff file.gff -o outfile
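
The selection rule described above can be written as a short Python sketch. It only illustrates the stated rule; the data layout (a dict of CDS and exon intervals per isoform) and all names are assumptions made for this example, and how AGAT breaks ties is not covered.

```python
def total_length(intervals):
    """Summed length of 1-based, inclusive (start, end) intervals."""
    return sum(end - start + 1 for start, end in intervals)


def pick_longest_isoform(isoforms):
    """isoforms: {name: {"cds": [(start, end), ...], "exon": [(start, end), ...]}}"""
    with_cds = {name: iso for name, iso in isoforms.items() if iso["cds"]}
    if with_cds:
        # At least one isoform is coding: compare summed CDS lengths only
        return max(with_cds, key=lambda name: total_length(with_cds[name]["cds"]))
    # No isoform has a CDS: compare summed (concatenated) exon lengths
    return max(isoforms, key=lambda name: total_length(isoforms[name]["exon"]))


# Hypothetical loci, invented for illustration
coding_locus = {
    "t1": {"cds": [(100, 199), (300, 399)], "exon": [(50, 199), (300, 450)]},
    "t2": {"cds": [(100, 249)], "exon": [(1, 1000)]},
    "t3": {"cds": [], "exon": [(1, 2000)]},
}
noncoding_locus = {
    "n1": {"cds": [], "exon": [(1, 100)]},
    "n2": {"cds": [], "exon": [(1, 50), (200, 300)]},
}
print(pick_longest_isoform(coding_locus))     # t1
print(pick_longest_isoform(noncoding_locus))  # n2
```

Note that t3 has the longest exons overall but is ignored in the first locus, because isoforms with a CDS take priority.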

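agat_sp_kraken_assess_liftover.pl - takes as input a gtf produced by Kraken (a lift-over tool). It parses the kraken_mapped attribute and computes the mapped percentage of each mRNA. Genes whose mapped percentage is at or above the threshold (0 by default) are reported. The result is visualized as a plot named geneMapped_plot.pdf.

agat_sp_kraken_assess_liftover.pl --gtf infile.gtf -o outfile

Note: if the file is complete (it contains both kraken_mapped="TRUE" and kraken_mapped="FALSE" attributes), the mapped percentage is computed. Otherwise it is computed only from the features that carry the kraken_mapped="TRUE" attribute. In that case the result is almost always reported as 100%, and a warning is shown.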

agat_sp_list_short_introns.pl - lists all introns below a given size. Introns are computed on the fly from the exons.

#introns under 10 bp
agat_sp_list_short_introns.pl --gff infile -s 10 --out outFile

--size or -s Minimum intron size accepted in nucleotide. All introns under this size will be reported. Default value = 10.
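
"Computed on the fly from the exons" means that introns are simply the gaps between consecutive exons of a transcript. A minimal Python sketch of that idea, using 1-based inclusive coordinates, is shown below; the function names and exon coordinates are hypothetical and this is not AGAT's code.

```python
def introns_from_exons(exons):
    """Return intron intervals (1-based, inclusive) between consecutive exons."""
    ordered = sorted(exons)
    introns = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start > prev_end + 1:  # skip adjacent or overlapping exons
            introns.append((prev_end + 1, next_start - 1))
    return introns


def short_introns(exons, min_size=10):
    """Introns strictly shorter than min_size, mirroring 'under this size'."""
    return [(s, e) for s, e in introns_from_exons(exons) if e - s + 1 < min_size]


# Hypothetical exons of one transcript, deliberately unsorted
exons = [(350, 400), (100, 200), (206, 290)]
print(introns_from_exons(exons))  # [(201, 205), (291, 349)]
print(short_introns(exons))       # [(201, 205)]
```

The first intron is 5 bp long and is reported; the second is 59 bp long and is not.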

For intron sizes, see agat_sp_manage_introns.pl.

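agat_sp_load_function_from_protein_align.pl - checks whether protein alignments overlap gene models and, on request, loads the gene name and/or function into the gene model. The inputs are an annotation in GFF format, protein alignments in GFF format, and a protein fasta file.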

agat_sp_load_function_from_protein_align.pl -a annotation.gff --pgff protein.gff --pfasta protein.fasta -o outfile

The aligned proteins are collected and sorted by overlap score, with the best one first. See the manual for details.

agat_sp_manage_IDs.pl - takes a gff3 file as input and overwrites the values of the ID attribute. By default, IDs are built as primary_tag (column 3)-number.

agat_sp_manage_IDs.pl --gff file.gff -p level2 -p cds -p exon -o outfile

See the manual for details.

agat_sp_manage_UTRs.pl - detects genes with too many UTR exons according to a chosen threshold. If no UTR option (3, 5, 3 and 5, both) is given, the threshold is not applied.

agat_sp_manage_UTRs.pl --ref infile --three --five -p --out outFile

-3, --three or --tree_prime_utr The threshold of the option <n> will be applied on the 3'UTR.
-5, --five or --five_prime_utr The threshold of the option <n> will be applied on the 5'UTR.
--p or --plot Allows to create an histogram in pdf of UTR sizes distribution.

agat_sp_manage_attributes.pl - removes chosen attributes from chosen features. It can also create new attributes with an 'empty' value, or copy an existing attribute under a new specified tag.

agat_sp_manage_attributes.pl -gff file.gff -att locus_tag,product,name/NewName -p level2,cds,exon -o outfile

See the manual for details.

agat_sp_manage_functional_annotation.pl - takes a gff3 file as input and adds functional annotation from blast and interpro output to the corresponding features in the gff file.

agat_sp_manage_functional_annotation.pl -f infile.gff -b blast_infile --db uniprot.fasta -i interpro_infile.tsv --id ABCDEF --output outfile

From the Interpro result (.tsv) file, ontology terms such as pfam, tigr, interpro, GO, and KEGG can be loaded into the DBXREF field/attribute. From a blast against a protein database (outfmt 6), the NAME field/attribute can be set for genes and PRODUCT for mRNAs. See the manual for the formats and other details.

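agat_sp_manage_introns.pl - provides statistics about introns (longest, shortest, mean size...) and plots all intron sizes to give an overview of the intron size distribution. It also gives the value obtained after removing the longest X% of introns.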

agat_sp_manage_introns.pl --gff infile --out outFile

agat_sp_merge_annotations.pl - merges different GFF annotation files into one. It uses the AGAT parser, which handles duplicated names and fixes other oddities encountered in the files.

agat_sp_merge_annotations.pl --gff infile1 --gff infile2 --out outFile

agat_sp_prokka_fix_fragmented_gene_annotations.pl - looks for fragmented gene annotations (FRAGS) in Prokka annotations. FRAGS are two (or more) adjacent ORFs annotated as homologous to the same gene. In such cases Prokka appends an _n suffix to the gene ID.

agat_sp_prokka_fix_fragmented_gene_annotations.pl -gff infile.gff --fasta genome.fa --db prokka/prokka_bacteria_sprot.fa  -o outfolder

See the manual for details.

agat_sp_sensitivity_specificity.pl - computes sensitivity and specificity to assess the quality of an annotation against a reference (assumed to be a true, high-quality annotation).

agat_sp_sensitivity_specificity.pl --gff1 infile1.gff --gff2 infile2.gff -o outfile

See the manual for details.

agat_sp_separate_by_record_type.pl - splits the features of the input GFF file into separate files according to record type. A record is the set of all features linked by Parent/ID relationships (e.g. the features of a locus made of gene + mrna + exon + cds + utr).

agat_sp_separate_by_record_type.pl -g infile.gff -o outfolder

agat_sp_split_by_level2_feature.pl - splits the input GFF file according to the types of Level2 feature it contains.

agat_sp_split_by_level2_feature.pl -g infile.gff -o outfolder

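agat_sp_statistics.pl - a script that computes exhaustive statistics for a GTF/GFF file. Note: when isoforms are present, some computed values can look incoherent even though they are correct, for example a total mRNA length larger than the genome size.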

agat_sp_statistics.pl --gff file.gff -o outfile

--gs, -f or -g This option inform about the genome size in oder to compute more statistics. You can give the size in Nucleotide or directly the fasta file.

agat_sp_to_tabulated.pl - converts a GTF/GFF file to tabular format. The attribute tags from column 9 become column titles.

agat_sp_to_tabulated.pl -gff file.gff -o outfile

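agat_sp_webApollo_compliant.pl - removes information that is useless or problematic for WebApollo, changes some feature types so that the file loads into WebApollo without problems, and optimizes some attributes for a nicer display.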

agat_sp_webApollo_compliant.pl -g infile.gff -o outfile

agat_sq_add_attributes_from_tsv.pl - adds information from a tsv/csv file to the attributes of a GFF file.

agat_sq_add_attributes_from_tsv.pl --gff input.gff --tsv input.tsv -o output.gff3

See the manual for details.

agat_sq_add_hash_tag.pl - inserts hash-tag lines (###) into the file. This lets some tools that consume gff3 handle the chunks delimited by the ### signal as clearly independent units.

agat_sq_add_hash_tag.pl -i 1 -o output

-i or --interval Integer: 1 or 2. 1 will add #### after each new sequence (column1 of the gff), while 2 will add the ### after each group of feature (gene). By default the value is 1.

agat_sq_add_locus_tag.pl - adds a locus tag shared by all features of a record. A record is the set of all features linked by parent-child relationships (e.g. Gene, mRNA, exon, CDS).

agat_sq_add_locus_tag.pl --gff input.gff -o output

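agat_sq_filter_feature_from_fasta.pl - a kind of annotation filter by sequence name. It removes from the gff the annotation features that are not linked to a sequence in the given Fasta file. Matching between the sequence names in the fasta file and column 1 of the gff3 file is case-sensitive.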

agat_sq_filter_feature_from_fasta.pl --gff <gff_file.gff> --fasta <fasta_file.fa> [-o <output

agat_sq_list_attributes.pl - reports information about the attribute tags used in the file.

agat_sq_list_attributes.pl -gff file.gff -p level2,cds,exon -o outfile

-p, -t or -l primary tag option, case insensitive, list. Allow to specied the feature types that will be handled. You can specified a specific feature by given its primary tag name (column 3) as: cds, Gene, MrNa You can specify directly all the feature of a particular level: level2=mRNA,ncRNA,tRNA,etc level3=CDS,exon,UTR,etc By default all feature are taking in account. fill the option by the value "all" will have the same behaviour.

The list of attribute tags present in the file is reported first. The output is described as being in GFF format, which may be a mistake.

agat_sq_manage_IDs.pl - changes IDs to make them unique and updates the Parent attributes of the affected features accordingly.

agat_sq_manage_IDs.pl --gff input.gff -o output

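agat_sq_manage_attributes.pl - removes chosen attributes from chosen features. It can also create new attributes with an 'empty' value, or copy an existing attribute under a new specified tag. Attributes in a gff file have the form tag=value;tag=value and are stored in column 9.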

agat_sq_manage_attributes.pl --gff file.gff  --att locus_tag,product,name/NewName -p level2,cds,exon -o outfile

agat_sq_mask.pl - masks (hard or soft) the segments referenced in a GFF file within a FASTA file. It needs three inputs: a GFF3 file, a fasta file, and a mask method.

#softmask
agat_sq_mask.pl -g infile.gff -f infile.fasta -sm -o outfile

-sm SoftMask option =>Sequences masked will be in lowercase
-hm HardMask option => Sequences masked will be replaced by a character. By default the character used is 'n'. But you are allowed to speceify any character of your choice. To use 'z' instead of 'n' type: -hm z
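
The difference between the two mask methods is easy to see on a toy sequence. The Python sketch below lowercases the masked bases for a soft mask and replaces them with a chosen character for a hard mask. It is only an illustration with 1-based inclusive intervals; the function name and the sequence are hypothetical and this is not AGAT's code.

```python
def mask_sequence(sequence, intervals, hard_char=None):
    """Mask 1-based, inclusive intervals of a sequence.

    Soft mask (lowercase) when hard_char is None, otherwise every masked
    base is replaced by hard_char.
    """
    bases = list(sequence)
    for start, end in intervals:
        # Convert to 0-based indices and clamp to the sequence bounds
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = bases[i].lower() if hard_char is None else hard_char
    return "".join(bases)


# Hypothetical sequence and interval, invented for illustration
seq = "ACGTACGTAC"
soft = mask_sequence(seq, [(3, 6)])
hard = mask_sequence(seq, [(3, 6)], hard_char="n")
print(soft)  # ACgtacGTAC
print(hard)  # ACnnnnGTAC
```

Soft masking keeps the sequence information, so downstream tools can still read the bases, whereas hard masking discards it.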

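agat_sq_remove_redundant_entries.pl - removes redundant entries, i.e. entries with the same seq_id, primary_tag, start, stop, ID, and Parent. If both the ID and Parent attributes are absent, the feature is not removed. If only one of them is absent, "" is used in its place.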

agat_sq_remove_redundant_entries.pl -i input.gff -o output
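
The redundancy rule described above amounts to deduplicating on a six-part key. Here is a minimal Python sketch of that rule as stated in the description; the dictionary layout and the example features are assumptions made for illustration, not AGAT's internal representation.

```python
def remove_redundant(features):
    """Drop features repeating the same (seq_id, primary_tag, start, stop, ID, Parent)."""
    seen = set()
    kept = []
    for feat in features:
        feat_id = feat.get("ID")
        parent = feat.get("Parent")
        if feat_id is None and parent is None:
            kept.append(feat)  # neither attribute present: never removed
            continue
        # A single missing attribute is replaced by an empty string in the key
        key = (feat["seq_id"], feat["primary_tag"], feat["start"], feat["stop"],
               feat_id or "", parent or "")
        if key not in seen:
            seen.add(key)
            kept.append(feat)
    return kept


# Hypothetical features, invented for illustration
exon = {"seq_id": "chr1", "primary_tag": "exon", "start": 100, "stop": 200,
        "ID": "e1", "Parent": "m1"}
cds = {"seq_id": "chr1", "primary_tag": "CDS", "start": 120, "stop": 200,
       "Parent": "m1"}
bare = {"seq_id": "chr1", "primary_tag": "region", "start": 1, "stop": 5000}
features = [exon, dict(exon), cds, dict(cds), bare, dict(bare)]
cleaned = remove_redundant(features)
print([f["primary_tag"] for f in cleaned])  # ['exon', 'CDS', 'region', 'region']
```

The duplicated exon and CDS collapse to one entry each, while the two attribute-less region lines are both kept.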

agat_sq_repeats_analyzer.pl - reports the repeat annotation in tabular form from a GFF file containing repeats (feature type match or protein_match).

agat_sq_repeats_analyzer.pl -i input.gff -g <integer> -o output

-g, --genome That input is design to know the genome size in order to calculate the percentage of the genome represented by each kind of repeats. You can provide an INTEGER or the genome in fasta format. If you provide the fasta, the genome size will be calculated on the fly.

For match features, see also the agat_sp_alignment_output_style.pl script above.

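agat_sq_reverse_complement.pl - reverse-complements all the annotations of a gff that are carried by the sequences described in the fasta file. Matching between the sequence names in the fasta file and column 1 of the gff3 file is case-sensitive.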

agat_sq_reverse_complement.pl --gff input.gff --fasta fasta_file.fa -o output

(I tried it, but I do not fully understand it yet.)

agat_sq_rfam_analyzer.pl - reports the rfam-id annotation in tabular form from a gff file containing rfam results.

agat_sq_rfam_analyzer.pl -i input.gff -g <integer> -o output

See the manual for details.

agat_sq_split.pl - splits a GFF3 file into several files. By default each file contains 1000 genes together with all their associated subfeatures. The GFF3 input file must be sequential.

agat_sq_split.pl --input input.gff -o output

agat_sq_stat_basic.pl - reports basic statistics for a gtf/gff file.

agat_sq_stat_basic.pl -i input.gff -g <integer> -o output

Reference

https://github.com/NBISweden/AGAT
