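GeneMarkS-T is a tool, trained by unsupervised learning, that predicts protein-coding regions in RNA transcripts. It was built by extending GeneMarkS, which targets prokaryotes, to handle eukaryotes. Because its detection rate stays roughly constant regardless of data size, it is considered well suited to coding-region prediction in metatranscriptome analyses, where the data volume becomes very large.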
Download
The Linux 64-bit version can be downloaded from this link.
> perl gmst.pl
GeneMarkS version 5.1 March 2014
Usage: gmst.pl [options] <sequence file name>
input sequence in FASTA format
Output options:
(output is in current working directory)
--output <string> output file with predicted gene coordinates by GeneMarh.hmm
and species parameters derived by GeneMarkS-T.
(default: <sequence file name>.lst)
GeneMark.hmm can be executed independently after finishing GeneMarkS training.
This method may be the preferable option in some situations, as it provides accesses to GeneMarh.hmm options.
--format <string> output coordinates of predicted genes in this format
(default: LST; supported: LST and GFF)
--fnn create file with nucleotide sequence of predicted genes
--faa create file with protein sequence of predicted genes
--clean <number> delete all temporary files
(default: 1; supported: 1 <true> and 0 <false>)
Run options:
--bins <number> number of clusters for inhomogeneous genome
(default: 0; supported: 0(automatic clustering),1,2,3)
--filter <number> keep at most one prediction per sequence
(default: 1; supported: 1 <true> and 0 <false>)
--strand <string> sequence strand to predict genes in
(default: 'both'; supported: direct, reverse and both )
--order <number> markov chain order
(default: 4; supported in range: >= 0)
--order_non <number> order for non-coding parameters
(default: 2)
--gcode <number> genetic code
(default: 1; supported: 11, 4 and 1)
--motif <number> iterative search for a sequence motif associated with CDS start
(default: 1; supported: 1 <true> and 0 <false>)
--width <number> motif width
(default: 12; supported in range: >= 3)
--prestart <number> length of sequence upstream of translation initiation site that presumably includes the motif
(default: 6; supported in range: >= 0)
--fixmotif <number> the motif is located at a fixed position with regard to the start; motif could overlap start codon
(default: 1; supported: 1 <true> and 0 <false> if this option is on, it changes the meaning of --prestart
option which in this case will define the distance from start codon to motif start)
--offover <number> prohibits gene overlap
(default: 1; supported: 1 <true> and 0 <false>)
Combined output and run options:
--prok to run program on prokaryotic transcripts
(this option is the same as: --bins 1 --filter 0 --order 2 --order_non 2 --gcode 11 --width 6 --prestart 40 --fixmotif 0)
Test/developer options:
--par <file name> custom parameters for GeneMarkS
(default is selected based on gcode value: 'par_<gcode>.default' )
--gibbs <number> version of Gibbs sampler software
(default: 3; supported versions: 1 and 3 )
--test installation test
--identity <number> identity level assigned for termination of iterations
(default: 0.99; supported in range: >=0 and <= 1)
--maxitr <number> maximum number of iterations
(default: 10; supported in range: >= 1)
--verbose
--version
Run
Run the tool on the transcript fragments (FASTA) produced by de novo assembly.
perl gmst.pl transcripts.fa --output output --fnn --faa --strand both --gcode 1
- --fnn create file with nucleotide sequence of predicted genes
- --faa create file with protein sequence of predicted genes
- --strand <string> sequence strand to predict genes in (default: 'both'; supported: direct, reverse and both )
- --gcode <number> genetic code (default: 1; supported: 11, 4 and 1)
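When the same command has to be run over many assemblies, it can be convenient to build the argument list in a script. The following is a minimal Python sketch, not part of GeneMarkS-T: `build_gmst_command` and `run_gmst` are hypothetical helper names, and the sketch assumes `perl` is on the PATH and `gmst.pl` sits in the working directory. It uses only options listed in the help text above and checks `--strand` and `--gcode` against the supported values given there.

```python
import subprocess

# Supported values copied from the gmst.pl help text.
SUPPORTED_STRANDS = {"direct", "reverse", "both"}
SUPPORTED_GCODES = {1, 4, 11}


def build_gmst_command(fasta, output, strand="both", gcode=1,
                       fnn=True, faa=True, script="gmst.pl"):
    """Return the argv list for one gmst.pl run (hypothetical helper)."""
    if strand not in SUPPORTED_STRANDS:
        raise ValueError(f"unsupported --strand value: {strand!r}")
    if gcode not in SUPPORTED_GCODES:
        raise ValueError(f"unsupported --gcode value: {gcode!r}")
    cmd = ["perl", script, fasta, "--output", output]
    if fnn:
        cmd.append("--fnn")  # also write nucleotide sequences of predicted genes
    if faa:
        cmd.append("--faa")  # also write protein sequences of predicted genes
    cmd += ["--strand", strand, "--gcode", str(gcode)]
    return cmd


def run_gmst(fasta, output, **options):
    """Run gmst.pl and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_gmst_command(fasta, output, **options), check=True)


if __name__ == "__main__":
    # Prints the same command line as the example above.
    print(" ".join(build_gmst_command("transcripts.fa", "output")))
```

Keeping command construction separate from execution makes it possible to inspect or log the exact command line before anything is run.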
For prokaryotic data, adding --prok sets several parameters at once.
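According to the help text, `--prok` is the same as `--bins 1 --filter 0 --order 2 --order_non 2 --gcode 11 --width 6 --prestart 40 --fixmotif 0`. The Python sketch below writes that preset out as a dictionary and expands it into explicit options, which may be useful when the prokaryotic settings are wanted with one value changed. `PROK_PRESET` and `expand_prok` are hypothetical names. The help text does not say whether an explicit option given alongside `--prok` takes precedence, so the sketch passes the expanded options instead of `--prok` itself.

```python
# Values taken from the --prok line of the gmst.pl help text.
PROK_PRESET = {
    "bins": 1,
    "filter": 0,
    "order": 2,
    "order_non": 2,
    "gcode": 11,
    "width": 6,
    "prestart": 40,
    "fixmotif": 0,
}


def expand_prok(**overrides):
    """Return explicit gmst.pl options equivalent to --prok (hypothetical helper).

    Keyword arguments replace individual preset values.
    """
    unknown = set(overrides) - set(PROK_PRESET)
    if unknown:
        raise KeyError(f"not part of the --prok preset: {sorted(unknown)}")
    settings = {**PROK_PRESET, **overrides}
    args = []
    for name, value in settings.items():
        args += [f"--{name}", str(value)]
    return args


if __name__ == "__main__":
    # Prokaryotic preset, but with genetic code 4 instead of 11.
    cmd = ["perl", "gmst.pl", "transcripts.fa"] + expand_prok(gcode=4)
    print(" ".join(cmd))
```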
Citation
Identification of protein coding regions in RNA transcripts.
Tang S, Lomsadze A, Borodovsky M.
Nucleic Acids Res. 2015 Jul 13;43(12):e78. doi: 10.1093/nar/gkv227. Epub 2015 Apr 13.