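Faster gene prediction for short reads: FragGeneScanRs

FragGeneScan is currently the most accurate and popular tool for gene prediction in short, error-prone reads, but its execution speed is insufficient for large data sets. The parallelization that was meant to solve this problem was also inefficient. An alternative implementation, FragGeneScan+, is faster, but introduced a number of bugs related to memory management, race conditions, and even output accuracy.

This paper introduces FragGeneScanRs, a fast Rust implementation of the FragGeneScan gene prediction model. Its command-line interface is backward compatible and adds extra features for more flexible usage. Its output is equivalent to that of the original FragGeneScan implementation.

Compared to the C implementation, FragGeneScanRs processes shotgun metagenomic reads up to 22 times faster on a single thread and scales better with multithreading. The Rust code of FragGeneScanRs is freely available on GitHub (https://github.com/unipept/FragGeneScanRs) under the GPL-3.0 license, together with instructions for installation, usage, and other documentation.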

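Installation

Download a binary from the releases page or install with cargo. The releases also include binaries for ARM. Here I installed it with cargo.

GitHub

# Install Rust if it is not already installed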
curl https://sh.rustup.rs -sSf | sh

# Install with cargo
cargo install frag_gene_scan_rs
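
If the command is not found after installation, the cargo binary directory may be missing from PATH. The following is a minimal sketch for checking this, assuming cargo used its default install root so that binaries were placed in ~/.cargo/bin; adjust the path if your setup differs.

```shell
# Assumption: cargo installed the binary into its default location, ~/.cargo/bin.
cargo_bin="$HOME/.cargo/bin"

# Prepend the directory to PATH only if it is not already listed.
case ":$PATH:" in
    *":$cargo_bin:"*) ;;
    *) PATH="$cargo_bin:$PATH"; export PATH ;;
esac

# Print the version if the binary can be found, otherwise report the problem.
if command -v FragGeneScanRs >/dev/null 2>&1; then
    FragGeneScanRs --version
else
    echo "FragGeneScanRs was not found on PATH" >&2
fi
```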

> FragGeneScanRs -h

FragGeneScanRs 1.1.0
Felix Van der Jeugt <felix.vanderjeugt@ugent.be>
Scalable high-throughput short-read open reading frame prediction.

USAGE:
    FragGeneScanRs [FLAGS] [OPTIONS] --training-file <train_file_name>

FLAGS:
    -f, --formatted    Format the DNA output.
    -h, --help         Prints help information
    -u, --unordered    Do not preserve record order in output (faster).
    -V, --version      Prints version information

OPTIONS:
    -a, --aa-file <aa_file>
            Output predicted proteins to this file (supersedes -o).
    -w, --complete <complete>
            The input sequence has complete genomic sequences; not short sequence reads. [default: 0]
    -g, --gff-file <gff_file>
            Output metadata to this gff formatted file (supersedes -o).
    -m, --meta-file <meta_file>
            Output metadata to this file (supersedes -o).
    -n, --nucleotide-file <nucleotide_file>
            Output predicted genes to this file (supersedes -o).
    -o, --output-prefix <output_prefix>
            Output metadata (.out and .gff), proteins (.faa) and genes (.ffn) to files with this prefix. Use 'stdout' to write the predicted proteins to standard output.
    -s, --seq-file-name <seq_file_name>
            Sequence file name including the full path. Using 'stdin' (or not suplying this argument) reads from standard input. [default: stdin]
    -p, --thread-num <thread_num>
            The number of threads used by FragGeneScan++. [default: 1]
    -t, --training-file <train_file_name>
            File name that contains model parameters; this file should be in the -r directory or one of the following:
            [complete] for complete genomic sequences or short sequence reads without sequencing error
            [sanger_5] for Sanger sequencing reads with about 0.5% error rate
            [sanger_10] for Sanger sequencing reads with about 1% error rate
            [454_5] for 454 pyrosequencing reads with about 0.5% error rate
            [454_10] for 454 pyrosequencing reads with about 1% error rate
            [454_30] for 454 pyrosequencing reads with about 3% error rate
            [illumina_1] for Illumina sequencing reads with about 0.1% error rate
            [illumina_5] for Illumina sequencing reads with about 0.5% error rate
            [illumina_10] for Illumina sequencing reads with about 1% error rate
    -r, --train-file-dir <train_file_dir>
            Full path of the directory containing the training model files.

Test run

git clone https://github.com/unipept/FragGeneScanRs.git
cd FragGeneScanRs/example/

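Predict coding sequences from 454 reads with an error rate of about 1% (-t 454_10). By default, the tool reads from standard input and writes to standard output. To read from a file instead of standard input, specify the -s option. To write to files instead of standard output, specify the -o option, in which case the GFF file and the gene (nucleotide) sequences are also written. To name individual output files, use -m meta_file, -n nucleotide_file, -a aa_file and -g gff_file (these take priority over -o).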

FragGeneScanRs -t 454_10 < NC_000913-454.fna > NC_000913-454.faa

-p The number of threads used by FragGeneScan++. [default: 1]
-s Sequence file name including the full path. Using 'stdin' (or not suplying this argument) reads from standard input. [default: stdin]
-o Output metadata (.out and .gff), proteins (.faa) and genes (.ffn) to files with this prefix. Use 'stdout' to write the predicted proteins to standard output.
-t File name that contains model parameters; this file should be in the -r directory or one of the following:
[454_10] for 454 pyrosequencing reads with about 1% error rate

The predicted protein sequences are written to the output.
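
The file-based variant described above could look roughly like the sketch below. It uses only options listed in the help text; the output prefix and the individual file names (proteins.faa, genes.gff) are hypothetical, and the commands are skipped when the tool or the example read file is not present.

```shell
reads="NC_000913-454.fna"   # example read file from the repository's example/ directory
prefix="NC_000913-454"      # hypothetical output prefix

if command -v FragGeneScanRs >/dev/null 2>&1 && [ -f "$reads" ]; then
    # Read from a file with -s and write all outputs with -o:
    # $prefix.out, $prefix.gff, $prefix.faa and $prefix.ffn.
    FragGeneScanRs -t 454_10 -s "$reads" -o "$prefix"

    # Alternatively, name only the outputs that are needed
    # (per the help text, these supersede -o).
    FragGeneScanRs -t 454_10 -s "$reads" -a proteins.faa -g genes.gff
else
    echo "FragGeneScanRs or $reads is not available; nothing was run" >&2
fi
```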

Prediction from complete sequences.

FragGeneScanRs -t complete -w 1 -o outprefix < NC_000913.fna > NC_000913.faa

-w The input sequence has complete genomic sequences; not short sequence reads. [default: 0]
-t File name that contains model parameters; this file should be in the -r directory or one of the following: [complete] for complete genomic sequences or short sequence reads without sequencing error
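
A quick sanity check after a run is to count the predicted proteins. The sketch below assumes the protein output is a regular FASTA file in which every record starts with a '>' header line; count_fasta_records is a helper name introduced here for illustration, and both candidate file names from the command above are checked because either may hold the proteins.

```shell
# Count FASTA records by counting header lines that start with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report the count for whichever protein file exists and is non-empty.
for f in NC_000913.faa outprefix.faa; do
    if [ -s "$f" ]; then
        echo "$f: $(count_fasta_records "$f") predicted proteins"
    fi
done
```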

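From the repository

The -u option can improve speed and reduce memory usage when running with multiple threads. The output is then no longer in the same order as the input (as with FGS and FGS+).
-r train_file_dir explicitly specifies the path of the directory containing the training files.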
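
A multithreaded run with these options might be sketched as follows. The thread count is taken from getconf as an assumption about what suits the machine, the output prefix "unordered" is hypothetical, and /path/to/train is a placeholder rather than a real location.

```shell
# Use the number of online CPUs if getconf can report it, otherwise fall back to 1.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
reads="NC_000913-454.fna"   # example read file from the repository's example/ directory

if command -v FragGeneScanRs >/dev/null 2>&1 && [ -f "$reads" ]; then
    # -p sets the number of threads; -u gives up input order in exchange for speed.
    FragGeneScanRs -t 454_10 -p "$threads" -u -s "$reads" -o unordered

    # To load training files from a custom directory, add -r (placeholder path):
    # FragGeneScanRs -r /path/to/train -t 454_10 -p "$threads" -s "$reads" -o custom
else
    echo "FragGeneScanRs or $reads is not available; nothing was run" >&2
fi
```

Because -u changes the record order, outputs from ordered and unordered runs should be sorted before they are compared.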

Citation

FragGeneScanRs: faster gene prediction for short reads
Felix Van der Jeugt, Peter Dawyndt & Bart Mesuere
BMC Bioinformatics volume 23, Article number: 198 (2022)
