OrfM: a fast coding-region (ORF) detection tool for metagenomes

2019 11/29: added links, revised title

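OrfM is a tool that searches for ORFs in contigs or unassembled reads, regardless of whether stop codons are present. It was designed for metagenomes, where data volumes become enormous. It runs very fast and is reported to be several times faster than translate, getorf from the EMBOSS package, prodigal, and similar tools. For example, on an input the size of a bacterial genome it finds all ORFs in about one second. It has no mechanism for handling sequencing errors, so high-accuracy Illumina sequence data is the recommended input.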

Installation

Github

https://github.com/wwood/OrfM/releases

Download orfm-0.7.1.tar.gz and extract it.

cd orfm-x.x.x 
./configure
make 
sudo gem install rspec bio-commandeer
make check 
sudo make install 
orfm -h # check that the help message is displayed

Run

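By default, OrfM reports every ORF of 96 bp or longer (-m 96). Any open stretch of at least 96 bp is reported even if it has no start or stop codon (the value of -m must be a multiple of 3).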

orfm input.fasta >orfs.fa
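The minimum length is given in nucleotides, so the shortest peptide in the output should be that value divided by 3 (96 bp gives 32 amino acids). The following is a minimal sketch of that arithmetic plus a run with a non-default threshold. The helper orfm_min_aa is hypothetical and not part of OrfM, and input.fasta and the value 150 are placeholders.

```shell
# Hypothetical helper (not part of OrfM): convert a minimum ORF length in
# nucleotides into the corresponding minimum peptide length, rejecting
# values that are not a multiple of 3.
orfm_min_aa() {
  if [ $(( $1 % 3 )) -ne 0 ]; then
    echo "minimum length $1 is not a multiple of 3" >&2
    return 1
  fi
  echo $(( $1 / 3 ))
}

min_nt=150   # example value; the OrfM default is 96
echo "shortest expected peptide: $(orfm_min_aa "$min_nt") aa"

# Run OrfM only if it is installed and the placeholder input file exists.
if command -v orfm >/dev/null 2>&1 && [ -f input.fasta ]; then
  orfm -m "$min_nt" input.fasta > orfs.fa
fi
```

Checking the value before the run avoids passing a length that is not a multiple of 3.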

Input can be FASTA or FASTQ.

Output is amino acid sequences.

>chr_2_5_1

TIHRGHIPPQIRLIHHVIVDQTSGVDHFGNFRQPAMAR

>chr_51_6_2

PNTKKIGQRGNPPLTLPPDCQINYSPWAHSAADPPHPPRHRGPN

>chr_1_4_3

VSKSVSVSKRLMLCKISPSAKSRITSDNNFTTSKLSKLAKYQEDWAKRKSPANTATRLSNKLFTVGTFRRRSASSTTSSWTKLAVWIISVISASRRWRA

>chr_293_5_4

PQMFTIFVKMITHIPGRIAQNPRCLGTVGNERGGFKIAFFQIGF

>chr_336_6_5

NLKYRPQSGTLTPNVYHIRKDDHTHPGAHRSKPPLPWHCWQ

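In each header, chr is the header name from the input FASTA, the next number is the ORF start position, the next is the frame number, and the last is a running serial number.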
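The header layout described above can be split into columns for downstream use. This is a sketch only: the function name orfm_header_table is hypothetical, and it assumes the last three underscore-separated fields are always start, frame, and serial number. Fields are taken from the right, so underscores inside the original sequence name are preserved.

```shell
# Hypothetical helper: print name, start, frame and serial number as
# tab-separated columns for every header in an OrfM-style FASTA file.
orfm_header_table() {
  awk '/^>/ {
    # Use only the first whitespace-delimited token, minus the leading ">".
    n = split(substr($1, 2), f, "_")
    if (n < 4) next                    # not in name_start_frame_number form
    name = f[1]
    for (i = 2; i <= n - 3; i++) name = name "_" f[i]
    printf "%s\t%s\t%s\t%s\n", name, f[n-2], f[n-1], f[n]
  }' "$@"
}

# Usage: orfm_header_table orfs.fa
# Small inline demonstration; my_contig is a made-up sequence name.
printf '>chr_2_5_1\nTIHRG\n>my_contig_51_6_2\nPNTKK\n' | orfm_header_table
```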

Because all frames are checked, a considerable number of ORFs overlap in position.
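To see how the reported ORFs are spread across frames, the frame number in each header can be tallied. This is a rough sketch: orfm_frame_counts is a hypothetical name, and it assumes the frame is the second-to-last underscore-separated field of the header.

```shell
# Hypothetical helper: count ORFs per frame number in an OrfM-style FASTA file.
orfm_frame_counts() {
  awk '/^>/ {
    n = split(substr($1, 2), f, "_")
    if (n >= 4) count[f[n-1]]++        # f[n-1] is the frame field
  }
  END { for (fr in count) printf "%s\t%d\n", fr, count[fr] }' "$@" | sort -n
}

# Usage: orfm_frame_counts orfs.fa
# Inline demonstration using the five headers from the example output.
printf '>chr_2_5_1\n>chr_51_6_2\n>chr_1_4_3\n>chr_293_5_4\n>chr_336_6_5\n' | orfm_frame_counts
```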

-m  minimum number of nucleotides (not amino acids) to call an ORF on [default: 96]
-t  output nucleotide sequences of transcripts to this path [default: none]
-l  ignore the sequence of the read beyond this, useful when comparing reads with different read lengths [default: none]
-c  ID codon table for translation (see http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes for details) [default: 1]
-p  print the actual stop codons at sequence ends if encoded [default: do not]
-s  only print those ORFs in the same frame as a stop codon [default: off]