Informatics on Mac (macでインフォマティクス)

A collection of notes on NGS-related informatics.

GeneMarkS-T: predicting coding regions in eukaryotic RNA

 

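GeneMarkS-T is a tool that predicts protein-coding regions in RNA transcripts, using a model trained by unsupervised learning. It was built by extending GeneMarkS, which targets prokaryotes, to eukaryotes. Because its detection rate is reported to stay roughly constant regardless of data size, it is considered well suited to coding-region prediction in metatranscriptome analyses, where datasets become very large.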

 

 

Download

The Linux 64-bit version can be downloaded from this link.

GeneMark™ download

 

$ perl gmst.pl

GeneMarkS  version 5.1 March 2014

Usage: gmst.pl [options] <sequence file name>

 

input sequence in FASTA format

 

 

Output options:

(output is in current working directory)

 

--output    <string> output file with predicted gene coordinates by GeneMarh.hmm

            and species parameters derived by GeneMarkS-T.

            (default: <sequence file name>.lst)

 

            GeneMark.hmm can be executed independently after finishing GeneMarkS training.

            This method may be the preferable option in some situations, as it provides accesses to GeneMarh.hmm options.

--format    <string> output coordinates of predicted genes in this format

            (default: LST; supported: LST and GFF)

--fnn       create file with nucleotide sequence of predicted genes

--faa       create file with protein sequence of predicted genes

--clean     <number> delete all temporary files

            (default: 1; supported: 1 <true> and 0 <false>)

 

Run options:

 

--bins      <number> number of clusters for inhomogeneous genome

            (default: 0; supported: 0(automatic clustering),1,2,3)

--filter    <number> keep at most one prediction per sequence

            (default: 1; supported: 1 <true> and 0 <false>)

--strand    <string> sequence strand to predict genes in

            (default: 'both'; supported: direct, reverse and both )

--order     <number> markov chain order

            (default: 4; supported in range: >= 0)

--order_non <number> order for non-coding parameters

            (default: 2)

--gcode     <number> genetic code

            (default: 1; supported: 11, 4 and 1)

--motif     <number> iterative search for a sequence motif associated with CDS start

            (default: 1; supported: 1 <true> and 0 <false>)

--width     <number> motif width

            (default: 12; supported in range: >= 3)

--prestart  <number> length of sequence upstream of translation initiation site that presumably includes the motif

            (default: 6; supported in range: >= 0)

--fixmotif  <number> the motif is located at a fixed position with regard to the start; motif could overlap start codon

            (default: 1; supported: 1 <true> and 0 <false> if this option is on, it changes the meaning of --prestart 

            option which in this case will define the distance from start codon to motif start)

--offover   <number> prohibits gene overlap

            (default: 1; supported: 1 <true> and 0 <false>)

 

Combined output and run options:

 

--prok      to run program on prokaryotic transcripts

            (this option is the same as:  --bins 1  --filter 0  --order 2  --order_non 2  --gcode 11 --width 6  --prestart 40 --fixmotif 0)

 

 

Test/developer options:

 

--par      <file name> custom parameters for GeneMarkS

           (default is selected based on gcode value: 'par_<gcode>.default' )

--gibbs    <number> version of Gibbs sampler software

           (default: 3; supported versions: 1 and 3 ) 

--test     installation test

--identity  <number> identity level assigned for termination of iterations

            (default: 0.99; supported in range: >=0 and <= 1)

--maxitr    <number> maximum number of iterations

            (default: 10; supported in range: >= 1)

--verbose

--version

 

Run

Run the tool on the transcript fragments (FASTA) produced by de novo assembly.

perl gmst.pl transcripts.fa --output output --fnn --faa --strand both --gcode 1 
  • --fnn create file with nucleotide sequence of predicted genes
  • --faa create file with protein sequence of predicted genes
  • --strand <string> sequence strand to predict genes in (default: 'both'; supported: direct, reverse and both )
  • --gcode <number> genetic code (default: 1; supported: 11, 4 and 1)
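
When the same run has to be repeated over many assemblies, it can help to build the command line in a script. The following is a minimal sketch in Python, not part of GeneMarkS-T itself: the function names `build_gmst_command` and `run_gmst` are hypothetical, and it assumes `perl` is on the PATH and `gmst.pl` is in the working directory (or passed via `script`). The flags and their allowed values are taken from the help text above.

```python
import subprocess
from pathlib import Path


def build_gmst_command(fasta, output, script="gmst.pl", strand="both",
                       gcode=1, fnn=True, faa=True):
    """Return the argument list for a GeneMarkS-T run (nothing is executed here)."""
    # Allowed values copied from the gmst.pl help text.
    if strand not in ("direct", "reverse", "both"):
        raise ValueError(f"unsupported --strand value: {strand!r}")
    if gcode not in (1, 4, 11):
        raise ValueError(f"unsupported --gcode value: {gcode!r}")
    cmd = ["perl", str(script), str(fasta), "--output", str(output)]
    if fnn:
        cmd.append("--fnn")  # also write nucleotide sequences of predicted genes
    if faa:
        cmd.append("--faa")  # also write protein sequences of predicted genes
    cmd += ["--strand", strand, "--gcode", str(gcode)]
    return cmd


def run_gmst(fasta, output, **kwargs):
    """Run gmst.pl; raises CalledProcessError if it exits with a non-zero status."""
    if not Path(fasta).is_file():
        raise FileNotFoundError(fasta)
    return subprocess.run(build_gmst_command(fasta, output, **kwargs), check=True)
```

Calling `run_gmst("transcripts.fa", "output")` would issue the same command as shown above. Passing the arguments as a list rather than a single shell string avoids quoting problems with unusual file names.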

 

For prokaryotic data, adding --prok sets several parameters at once.
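
To make explicit what the prokaryotic shortcut stands for, the sketch below spells out the individual flags that the help text lists as equivalent to `--prok`. The values are copied from that help line; the names `PROK_EQUIVALENT` and `prok_flags` are hypothetical and only serve this illustration.

```python
# Flag values copied from the "--prok" entry of the gmst.pl help text.
PROK_EQUIVALENT = {
    "--bins": 1,
    "--filter": 0,
    "--order": 2,
    "--order_non": 2,
    "--gcode": 11,
    "--width": 6,
    "--prestart": 40,
    "--fixmotif": 0,
}


def prok_flags(explicit=False):
    """Return either the shortcut flag or its expanded form as an argument list."""
    if not explicit:
        return ["--prok"]
    flags = []
    for name, value in PROK_EQUIVALENT.items():
        flags += [name, str(value)]
    return flags
```

The expanded form is a convenient starting point when only one of the values needs changing, for example a different `--gcode`.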

 

Reference

Identification of protein coding regions in RNA transcripts.

Tang S, Lomsadze A, Borodovsky M.

Nucleic Acids Res. 2015 Jul 13;43(12):e78. doi: 10.1093/nar/gkv227. Epub 2015 Apr 13.