GFF3sort: sorting GFF3 files correctly - macでインフォマティクス

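JBrowse, a powerful genome browser built on HTML5 and JavaScript, has been widely used since its release in 2009 [ref.1, 2]. According to its configuration documentation [ref.3], genome annotation data in GFF3 format is first converted to JSON files by the bundled script "flatfile-to-json.pl", and visual element models such as genes are then rendered from them. The main problem is that this conversion step is very time-consuming: the time is proportional to the number of feature elements in the GFF3 file. Even for a small genome such as yeast (Saccharomyces cerevisiae), the conversion takes 10 seconds. For a large, deeply annotated genome such as human, it rises to more than 15 minutes. In addition, the conversion turns a single GFF3 file into thousands of fragmented JSON files, which places a heavy burden on data backup and storage.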

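The recently released JBrowse version (v1.12.3) added support for indexed GFF3 files [ref.4]. With this strategy, the GFF3 file is compressed with bgzip and indexed with tabix [ref.5], which produces only two data files: the compressed file (.gz) and the index file (.tbi or .csi). Compared with the traditional processing protocol, the whole compression and indexing procedure finishes within a few seconds, even for large datasets such as the human genome annotation. tabix requires the GFF3 file to be sorted by chromosome and start position, which can be done with the GNU sort program or the GenomeTools [ref.6] package (see [ref.7]). For feature lines that share the same chromosome and start position, however, both tools break the tie arbitrarily and may return an order in which child features come before their parents (Fig. 1a in the paper). Such a file is still valid for tabix indexing, but it causes erroneous rendering in JBrowse [ref.8] (Fig. 1a). Neither tool currently offers an option or argument to break such ties according to the parent-child relationship. In the absence of a corresponding fix in JBrowse, an alternative sorting tool is needed to solve this problem.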
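To make the ordering problem concrete, here is a minimal Python sketch that reports features appearing before their parent. It is only an illustration and not part of GFF3sort or JBrowse: the function names and the toy records are hypothetical, and the attribute parsing assumes plain `key=value;key=value` pairs in column 9.

```python
def parse_attributes(column9):
    """Parse the GFF3 attributes column into a dict (simple key=value;... form)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key] = value
    return attrs


def children_before_parents(gff3_lines):
    """Return the IDs of features whose Parent has not been seen yet."""
    seen_ids = set()
    offenders = []
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 9:
            continue  # not a complete feature line
        attrs = parse_attributes(columns[8])
        for parent in attrs.get("Parent", "").split(","):
            if parent and parent not in seen_ids:
                offenders.append(attrs.get("ID", "(no ID)"))
                break
        if "ID" in attrs:
            seen_ids.add(attrs["ID"])
    return offenders


# Toy records (hypothetical): the exon shares chromosome and start with its mRNA.
gene = "Chr1\t.\tgene\t100\t900\t.\t+\t.\tID=gene1"
mrna = "Chr1\t.\tmRNA\t100\t900\t.\t+\t.\tID=mrna1;Parent=gene1"
exon = "Chr1\t.\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=mrna1"

good_order = [gene, mrna, exon]
bad_order = [gene, exon, mrna]  # same (chromosome, start) keys, tie broken differently

print(children_before_parents(good_order))
print(children_before_parents(bad_order))
```

Both orders are sorted by chromosome and start position, so both are acceptable to tabix; only the second one would be flagged by this check.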

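Here we introduce GFF3sort, a new tool for sorting GFF3 files for tabix indexing. Compared with GNU sort and GenomeTools, GFF3sort produces sorted output that renders correctly in JBrowse, with comparable time and memory requirements. GFF3sort is expected to be a useful tool for processing and visualizing genome annotation data.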
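GFF3sort is a script written in Perl. It stores the input GFF3 annotation data in a hash table (Fig. 1b in the paper). For each feature, the chromosome ID and the start position are used as the primary and secondary keys, respectively. Features with the same chromosome and start position are grouped into an array in the order in which they appear in the original GFF3 data. After sorting the hash table by chromosome ID and start position, GFF3sort offers two modes for ordering the features within each array: the default mode and the precise mode (Fig. 1b in the paper). In most cases, the original GFF3 annotations produced by genome annotation projects already place parent features before their children, so GFF3sort returns those feature lines in their original order; this is the default behavior. When the input order does not already place parents before children, GFF3sort rearranges them according to the parent-child topology using a sorting algorithm for directed acyclic graphs [ref.9], at the cost of somewhat more computation time.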
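The data structure and the two modes described above can be sketched in Python as follows. This is an illustrative re-implementation under stated assumptions, not the actual Perl code: the names `sort_gff3`, `order_block`, and `rec` are hypothetical, Python's standard `graphlib` (Python 3.9+) stands in for the topological sort, and chromosomes are simply sorted alphabetically.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def attributes(record):
    """Parse column 9 (key=value;key=value) of a 9-column record into a dict."""
    return dict(f.split("=", 1) for f in record[8].split(";") if "=" in f)


def order_block(block):
    """Reorder one tie block so that parents come before their children."""
    ids = defaultdict(list)  # feature ID -> indices of records carrying it
    for index, record in enumerate(block):
        feature_id = attributes(record).get("ID")
        if feature_id:
            ids[feature_id].append(index)
    graph = {}  # record index -> indices of its parents inside this block
    for index, record in enumerate(block):
        parents = attributes(record).get("Parent", "")
        graph[index] = {p for name in parents.split(",") if name
                        for p in ids.get(name, [])}
    return [block[i] for i in TopologicalSorter(graph).static_order()]


def sort_gff3(records, precise=False):
    """Sort records by chromosome and start; ties keep input order unless precise."""
    # Two-level table: chromosome -> start position -> records in input order.
    table = defaultdict(lambda: defaultdict(list))
    for record in records:
        table[record[0]][int(record[3])].append(record)
    ordered = []
    for chrom in sorted(table):
        for start in sorted(table[chrom]):
            block = table[chrom][start]
            ordered.extend(order_block(block) if precise else block)
    return ordered


def rec(chrom, kind, start, end, attrs):
    """Build a toy 9-column GFF3 record (hypothetical helper)."""
    return [chrom, ".", kind, str(start), str(end), ".", "+", ".", attrs]


records = [
    rec("Chr2", "gene", 50, 400, "ID=gene2"),
    rec("Chr1", "exon", 100, 300, "ID=exon1;Parent=mrna1"),
    rec("Chr1", "mRNA", 100, 900, "ID=mrna1;Parent=gene1"),
    rec("Chr1", "gene", 100, 900, "ID=gene1"),
]
default_types = [r[2] for r in sort_gff3(records)]
precise_types = [r[2] for r in sort_gff3(records, precise=True)]
print(default_types)
print(precise_types)
```

In this toy input the three Chr1 features all start at position 100 and arrive child-first, so the default mode keeps that order, while the precise mode moves the gene ahead of its mRNA and the mRNA ahead of its exon.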

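GFF3sort is a tool that correctly sorts GFF version 3 files, the most commonly used version of the GFF format. It also handles features that share the same chromosome and start position, which GNU sort orders arbitrarily (for example, a child exon feature can end up before its parent mRNA feature), by sorting them with the parent-child relationship taken into account.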

Installation

Tested on macOS 10.14 (Perl 5.26).

Source: GitHub

#bioconda (link)
mamba install -c bioconda -y gff3sort

> gff3sort.pl -h

NAME
    gff3sort.pl - Sort GFF3 file for tabix indexing

SYNOPSIS
    gff3sort.pl [OPTIONS] input.file.gff3 >output.sort.gff3

COMMAND-LINE OPTIONS
    These optional options could be placed either before or after the I/O
    files in the commandline

    --precise       Run in precise mode, about 2X~3X slower than the default mode.
                    Only needed to be used if your original GFF3 files have parent features
                    appearing behind their children features.

    --chr_order     Select how the chromosome IDs should be sorted. Acceptable
                    values are: alphabet, natural, original [Default: alphabet]

    --extract_FASTA If the input GFF3 file contains FASTA sequence at the end,
                    use this option to extract the FASTA sequence and place in a separate file
                    with the extention '.fasta'. By default, the FASTA sequences would be
                    discarded.

DESCRIPTION
    The tabix tool requires GFF3 files to be sorted by chromosomes and
    positions, which could be performed in the GNU sort program or the
    GenomeTools package. However, when dealing with feature lines in the same
    chromosome and position, both of the tools would sort them in an ambiguous
    way that usually results in parent features being placed behind their
    children. This would cause erroneous in some genome browsers such as
    JBrowse. GFF3sort can properly deal with the order of features that have
    the same chromosome and start position, therefore generating suitable
    results for JBrowse display.

  Precise mode
    In most situations, the original GFF3 annotations produced by genome
    annotation projects have already placed parent features before their
    children. Therefore, GFF3sort would remember their original order and
    placed them accordingly within the same chromosome and start position
    block, which is the default behavior.

    Sometimes the order in the input file has already been disturbed (for
    example, by GNU sort or GenomeTools). In this situation, GFF3sort would
    sort them according to the parent-child topology using the sorting
    algorithm of directed acyclic graph
    (https://metacpan.org/pod/Sort::Topological), which is the most precise
    behavior but 2X~3X slower than the default mode.

  The chromosome order
    In default, chromosomes are sorted alphabetly. Users can choose to sort
    naturally (see https://metacpan.org/pod/Sort::Naturally) or keep their
    original orders.

    Therefore, chromosomes "Chr7 Chr1 Chr10 Chr2 Chr1" would be sorted as:

    By alphabet (default): Chr1 Chr10 Chr2 Chr7

    By natural: Chr1 Chr2 Chr7 Chr10

    Kepp original: Chr7 Chr1 Chr10 Chr2 (Note: tabix requires continuous
    chromosome blocks. Therefore the same chromosomes such as Chr1 must be
    grouped together)

AUTHOR
    Tao Zhu <zhutao@caas.cn>

    This script is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

Usage

Specify a GFF3 file.

gff3sort.pl --chr_order alphabet input.gff3 > sort.gff3

--precise Run in precise mode, about 2X~3X slower than the default mode. Only needed to be used if your original GFF3 files have parent features appearing behind their children features.
--chr_order Select how the chromosome IDs should be sorted. Acceptable values are: alphabet, natural, original [Default: alphabet]
--extract_FASTA If the input GFF3 file contains FASTA sequence at the end, use this option to extract the FASTA sequence and place in a separate file with the extention '.fasta'. By default, the FASTA sequences would be discarded.
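The three `--chr_order` values described in the help text can be illustrated with a small Python sketch. It is not how gff3sort.pl implements them (the help text points to the Perl module Sort::Naturally for natural ordering); `natural_key` and `chromosome_order` are hypothetical names, and the natural key is a simplified approximation.

```python
import re


def natural_key(name):
    """Split 'Chr10' into text and digit runs so digit runs compare as numbers."""
    key = []
    for part in re.split(r"(\d+)", name):
        if not part:
            continue
        # Tag each piece so numbers and text are never compared directly.
        key.append((0, int(part)) if part.isdigit() else (1, part.lower()))
    return key


def chromosome_order(names, mode="alphabet"):
    """Return the unique chromosome IDs in the requested order."""
    unique = list(dict.fromkeys(names))  # drop repeats, keep first appearance
    if mode == "alphabet":
        return sorted(unique)
    if mode == "natural":
        return sorted(unique, key=natural_key)
    if mode == "original":
        return unique
    raise ValueError(f"unknown mode: {mode}")


names = ["Chr7", "Chr1", "Chr10", "Chr2", "Chr1"]
for mode in ("alphabet", "natural", "original"):
    print(mode, chromosome_order(names, mode))
```

With the chromosome list from the help text, the three modes give the same orders as listed there; in the "original" mode the repeated Chr1 is merged into its first occurrence, since tabix needs each chromosome in one contiguous block.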

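If the GFF3 file has FASTA sequences appended at the end and you want to keep them as well (they are written to a separate '.fasta' file):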

gff3sort.pl --chr_order alphabet --extract_FASTA input.gff3 > sort.gff3
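For orientation, the idea behind separating an embedded FASTA section can be sketched in Python. This assumes the FASTA block is introduced by a `##FASTA` directive line, as in the GFF3 specification; `split_fasta` and the sample text are hypothetical, and the sketch says nothing about how gff3sort.pl names or writes its output file.

```python
def split_fasta(gff3_text):
    """Split GFF3 text into (annotation lines, FASTA lines) at the ##FASTA directive."""
    annotation, fasta = [], []
    in_fasta = False
    for line in gff3_text.splitlines():
        if line.strip() == "##FASTA":
            in_fasta = True  # everything after this directive is sequence data
            continue
        (fasta if in_fasta else annotation).append(line)
    return annotation, fasta


sample = "\n".join([
    "##gff-version 3",
    "Chr1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1",
    "##FASTA",
    ">Chr1",
    "ACGTACGT",
])
annotation, fasta = split_fasta(sample)
print(len(annotation), "annotation lines;", len(fasta), "FASTA lines")
```

Only the annotation part takes part in sorting; the sequence lines are either set aside (with `--extract_FASTA`) or discarded (the default, according to the help text).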

Citation

GFF3sort: a novel tool to sort GFF3 files for tabix indexing

Tao Zhu, Chengzhen Liang, Zhigang Meng, Sandui Guo, Rui Zhang
BMC Bioinformatics volume 18, Article number: 482 (2017)