GeneMarkS-T: predicting coding regions in eukaryotic RNA

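GeneMarkS-T is a tool that predicts protein-coding regions in RNA, using a model trained by unsupervised learning. It was built by extending GeneMarkS, which targets prokaryotes, to eukaryotes. It is reported to keep a consistent detection rate regardless of data size, so it is considered well suited to coding-region prediction in metatranscriptome analyses, where data volumes become enormous.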

Download

The Linux 64-bit version can be downloaded from this link.

GeneMark™ download
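After downloading, unpacking the archive and running a quick check might look like the sketch below. It assumes the archive is named gmst_linux_64.tar (the name that appears in the author's directory listing), that it unpacks gmst.pl into the current directory, and that perl is on the PATH. The --test flag is the "installation test" option listed in the tool's own help; its exact behavior is not described there.

```shell
#!/bin/sh
# Assumption: archive name as downloaded; adjust if yours differs.
archive="gmst_linux_64.tar"

if [ -f "$archive" ]; then
    # Unpack into the current directory.
    tar -xf "$archive"
fi

if [ -f gmst.pl ]; then
    # Run the installation test listed in the help text.
    perl gmst.pl --test
    status="installed"
else
    echo "gmst.pl not found; download and unpack $archive first" >&2
    status="missing"
fi
echo "GeneMarkS-T status: $status"
```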

> perl gmst.pl


gmst_linux_64.tar gmst.pl

[uesaka@cyano GeneMarkS-T]$ perl gmst.pl

GeneMarkS version 5.1 March 2014

Usage: gmst.pl [options] <sequence file name>

input sequence in FASTA format

Output options:

(output is in current working directory)

--output <string> output file with predicted gene coordinates by GeneMarh.hmm

and species parameters derived by GeneMarkS-T.

(default: <sequence file name>.lst)

GeneMark.hmm can be executed independently after finishing GeneMarkS training.

This method may be the preferable option in some situations, as it provides accesses to GeneMarh.hmm options.

--format <string> output coordinates of predicted genes in this format

(default: LST; supported: LST and GFF)

--fnn create file with nucleotide sequence of predicted genes

--faa create file with protein sequence of predicted genes

--clean <number> delete all temporary files

(default: 1; supported: 1 <true> and 0 <false>)

Run options:

--bins <number> number of clusters for inhomogeneous genome

(default: 0; supported: 0(automatic clustering),1,2,3)

--filter <number> keep at most one prediction per sequence

(default: 1; supported: 1 <true> and 0 <false>)

--strand <string> sequence strand to predict genes in

(default: 'both'; supported: direct, reverse and both )

--order <number> markov chain order

(default: 4; supported in range: >= 0)

--order_non <number> order for non-coding parameters

(default: 2)

--gcode <number> genetic code

(default: 1; supported: 11, 4 and 1)

--motif <number> iterative search for a sequence motif associated with CDS start

(default: 1; supported: 1 <true> and 0 <false>)

--width <number> motif width

(default: 12; supported in range: >= 3)

--prestart <number> length of sequence upstream of translation initiation site that presumably includes the motif

(default: 6; supported in range: >= 0)

--fixmotif <number> the motif is located at a fixed position with regard to the start; motif could overlap start codon

(default: 1; supported: 1 <true> and 0 <false> if this option is on, it changes the meaning of --prestart option which in this case will define the distance from start codon to motif start)

--offover <number> prohibits gene overlap

(default: 1; supported: 1 <true> and 0 <false>)

Combined output and run options:

--prok to run program on prokaryotic transcripts

(this option is the same as: --bins 1 --filter 0 --order 2 --order_non 2 --gcode 11 --width 6 --prestart 40 --fixmotif 0)

Test/developer options:

--par <file name> custom parameters for GeneMarkS

(default is selected based on gcode value: 'par_<gcode>.default' )

--gibbs <number> version of Gibbs sampler software

(default: 3; supported versions: 1 and 3 )

--test installation test

--identity <number> identity level assigned for termination of iterations

(default: 0.99; supported in range: >=0 and <= 1)

--maxitr <number> maximum number of iterations

(default: 10; supported in range: >= 1)

--verbose

--version

Run

Run it on the fragments (FASTA) produced by de novo assembly.

perl gmst.pl transcripts.fa --output output --fnn --faa --strand both --gcode 1

--fnn  create file with nucleotide sequence of predicted genes
--faa  create file with protein sequence of predicted genes
--strand <string>  sequence strand to predict genes in (default: 'both'; supported: direct, reverse and both )
--gcode <number>  genetic code (default: 1; supported: 11, 4 and 1)
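For a quick sense of how many coding regions were predicted, the FASTA headers in the protein output can be counted. This is a minimal sketch, assuming the --faa output is written as output.faa in the working directory (the exact file name is an assumption; check what the run actually produced). The helper name count_fasta_records is made up for illustration and is not part of GeneMarkS-T.

```shell
#!/bin/sh
# Count FASTA records (lines starting with '>') in a file.
# count_fasta_records is a hypothetical helper, not part of GeneMarkS-T.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Assumed output name; adjust to whatever the run actually produced.
faa="output.faa"
if [ -f "$faa" ]; then
    echo "predicted proteins: $(count_fasta_records "$faa")"
fi
```

With the default --filter 1, at most one prediction is kept per sequence, so this count should not exceed the number of input transcripts.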

For prokaryotic data, adding --prok sets several parameters at once.
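According to the help text, --prok stands for a fixed set of options. The sketch below only prints the shorthand and the expanded command lines side by side so the equivalence is easy to see; it does not run the tool. The input file name prok_transcripts.fa is a placeholder.

```shell
#!/bin/sh
# Hypothetical input file name; replace with your own FASTA.
input="prok_transcripts.fa"

# Shorthand form.
prok_short="perl gmst.pl --prok $input"

# Expanded form, using the options the help text lists for --prok.
prok_long="perl gmst.pl --bins 1 --filter 0 --order 2 --order_non 2 --gcode 11 --width 6 --prestart 40 --fixmotif 0 $input"

# Print both command lines for comparison.
printf '%s\n' "$prok_short" "$prok_long"
```

Writing the options out explicitly is useful when only some of the prokaryotic defaults should be changed, for example keeping --filter 1.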

Citation

Identification of protein coding regions in RNA transcripts.

Tang S, Lomsadze A, Borodovsky M.

Nucleic Acids Res. 2015 Jul 13;43(12):e78. doi: 10.1093/nar/gkv227. Epub 2015 Apr 13.