FragGeneScan - Informatics on Mac (macでインフォマティクス)

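Advances in next-generation sequencing technology have driven metagenomic studies, which attempt to directly sequence the entire collection of genetic material in an environmental sample (that is, the metagenome). Because metagenome assemblies are often unavailable (at the time the paper was written), identifying genes directly from short reads has become an important but challenging problem in metagenome annotation. Gene prediction methods developed for whole genomes (e.g. Glimmer), and those developed more recently for metagenomic sequences (e.g. MetaGene), show a significant drop in performance as the sequencing error rate increases or as reads become shorter. The authors developed FragGeneScan, a novel gene prediction method that combines sequencing error models and codon usage in a hidden Markov model to improve the prediction of protein-coding regions in short reads. On complete genomes, FragGeneScan performed comparably to Glimmer and MetaGene. On short reads, however, FragGeneScan consistently outperformed MetaGene (accuracy improved by 62% for 400-base reads with 1% sequencing error, and by 18% for error-free 100-base reads). When applied to metagenomes, FragGeneScan recovered substantially more genes than MetaGene predicted (more than 90% of the genes identified by homology search), as well as many novel genes that have no homologs in current protein sequence databases.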

Installation

Tested on Ubuntu in a miniconda2 4.0.5 environment.

SourceForge

#bioconda
conda install -c bioconda -y fraggenescan

Usage

Predict proteins from a genome. Specify -complete=1 and -train=complete.

run_FragGeneScan.pl -genome=contigs.fasta -out=output -complete=1 -train=complete -thread=20

-genome= sequence file name including the full path
-complete= 1 if the sequence file has complete genomic sequences. 0 if the sequence file has short sequence reads
-train= file name that contains model parameters.
    [complete] for complete genomic sequences or short sequence reads without sequencing error
    [sanger_5] for Sanger sequencing reads with about 0.5% error rate
    [sanger_10] for Sanger sequencing reads with about 1% error rate
    [454_10] for 454 pyrosequencing reads with about 1% error rate
    [454_30] for 454 pyrosequencing reads with about 3% error rate
    [illumina_5] for Illumina sequencing reads with about 0.5% error rate
    [illumina_10] for Illumina sequencing reads with about 1% error rate
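The -train value follows directly from the sequencing platform and the approximate error rate, so the choice can be scripted. The following is only a rough sketch: pick_train_model is a hypothetical helper name, not part of FragGeneScan, and it only knows the models listed above.

```shell
# pick_train_model PLATFORM ERROR_PERCENT
# Hypothetical helper (not part of FragGeneScan): prints the -train model
# name matching the option list above, or fails if no model matches.
pick_train_model() {
    platform=$1
    error_pct=$2
    case "$platform:$error_pct" in
        # Complete genomes and error-free short reads share one model.
        complete:*|*:0) echo complete ;;
        sanger:0.5)     echo sanger_5 ;;
        sanger:1)       echo sanger_10 ;;
        454:1)          echo 454_10 ;;
        454:3)          echo 454_30 ;;
        illumina:0.5)   echo illumina_5 ;;
        illumina:1)     echo illumina_10 ;;
        *)
            echo "no model listed for $platform at $error_pct% error" >&2
            return 1
            ;;
    esac
}

# Possible use (not executed here):
#   run_FragGeneScan.pl -genome=ngs.fasta -out=output -complete=0 \
#       -train="$(pick_train_model illumina 1)" -thread=20
```

Error rates that fall between the listed values are deliberately rejected rather than rounded, since which model fits better in that case is a judgment call.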

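Output files such as output.faa (predicted protein sequences), output.ffn (the corresponding nucleotide sequences) and a GFF file are written.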
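A quick sanity check after a run is to count how many genes were predicted. A minimal sketch, assuming the protein FASTA is named output.faa (adjust to whatever file the run actually wrote); count_fasta_records is a hypothetical helper name.

```shell
# count_fasta_records FILE
# Prints the number of FASTA records in FILE by counting header lines,
# i.e. lines that start with ">". Prints 0 for a file with no records.
count_fasta_records() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Possible use (assumed file name):
#   count_fasta_records output.faa
```

Comparing this count with the number of input sequences gives a rough idea of how many predictions were made per contig or read.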

Predict proteins from Illumina short reads. Specify -complete=0 and -train=illumina_5 (or illumina_10).

run_FragGeneScan.pl -genome=ngs.fasta -out=output -complete=0 -train=illumina_5 -thread=20
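The command above is given a FASTA file (ngs.fasta), while Illumina reads usually arrive as FASTQ. If a conversion is needed first, it can be sketched with awk as below. This assumes strict four-line FASTQ records with no wrapped sequence lines; fastq_to_fasta is a hypothetical helper name, and dedicated tools are a safer choice for real data.

```shell
# fastq_to_fasta FILE
# Writes FILE as FASTA to stdout. Assumes each record is exactly four lines:
# @header, sequence, +, quality. Quality values are discarded.
fastq_to_fasta() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }   # "@name" becomes ">name"
         NR % 4 == 2 { print }                     # sequence line, unchanged
        ' "$1"
}

# Possible use (assumed input file name):
#   fastq_to_fasta ngs.fastq > ngs.fasta
```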
