2024-01-06

タンパク質コードDNAの高感度かつエラー耐性アノテーションを行う BATH

　BATHは、タンパク質配列のデータベースまたはプロファイル隠れマルコフモデル（pHMM）へのDNAの直接アラインメントに基づく、タンパク質をコードするDNAの高感度アノテーションツールである。BATHはHMMER3コードベース上に構築されており、わかりやすい入力インターフェースと解釈しやすい出力を提供することで、pHMMに基づくアノテーションのワークフローを簡素化している。BATHはまた、フレームシフトを考慮した新しいアルゴリズムを導入し、フレームシフトを誘発する塩基の挿入や欠失（indel）を検出する。BATHは、エラーを含まない配列のアノテーションではHMMER3の精度に匹敵し、indelを含む配列のアノテーションでは、テストしたすべてのツールよりも優れた精度を示した。これらの結果から、BATHは、高いアノテーション感度が必要な場合、特に、ロングリードシークエンシングデータや偽遺伝子のように、フレームシフトエラーがタンパク質コード領域に割り込むことが予想される場合に使用すべきであることが示唆される。

DNA上で直接相同性検索を行い、ORFのコールを避け、フレームシフトを明示的にモデル化する方法がある。事実上、これはORF予測を導くために相同性を使用し、その結果、より優れた相同性の検出につながる。著者らはこの戦略を「フレームシフトを考慮した」（FA）翻訳アライメントと呼んでいる。ターゲットDNAをタンパク質に翻訳し、アミノ酸とアミノ酸をアライメントする代わりに、このアプローチは、クエリーのアミノ酸をターゲットDNAと直接比較することによってアライメントを計算し、アミノ酸とコドン間のアライメントに対して、そのコドンによってコードされるアミノ酸に対するスコアリングモデルの値に基づいてスコアを割り当てる。このフレームワークのもとでは、Pearsonらが「準コドン(quasicodon)」と呼ぶものを追加することで、フレームシフトに対処することができる。Pearsonの準コドンは、整列コドンを標準的な3ヌクレオチドの長さに制限するのではなく、2ヌクレオチドまたは4ヌクレオチドの長さにすることができる。この概念の他の実装には、1ヌクレオチドと5ヌクレオチドの長さの準コドンも含まれる。

Tutorial

https://github.com/TravisWheelerLab/BATH/blob/main/documentation/userguide/tutorial.md

インストール

ubuntu20でテストした。

Github

C library for biological sequence analysis

git clone https://github.com/TravisWheelerLab/BATH
cd BATH/
git clone https://github.com/TravisWheelerLab/easel
cd easel/
git checkout BATH
cd ..
autoconf
./configure
make -j8
sudo make install #/usr/local/bin/

> bathsearch -h

Usage: bathsearch [options] <hmm, msa, or seq file> <seqdb>

Basic options:

-h : show brief help on version and usage

Options directing output:

-o <f> : direct output to file <f>, not stdout

--tblout <f> : save parseable table of hits to file <f>

--fstblout <f> : save table of frameshift locations to file <f>

--hmmout <f> : if input is alignment(s) or sequence(s) write produced hmms to file <f>

--acc : prefer accessions over names in output

--noali : don't output alignments, so output is smaller

--notrans : don't show the translated DNA sequence in alignment

--frameline : include frame of each codon in alignment

--cigar : include alignment CIGAR string in table output (with --tblout)

--notextw : unlimit ASCII text output line width

--textw <n> : set max width of ASCII text output lines [150] (n>=150)

Options controlling translation:

--ct <n> : use alt genetic code of NCBI translation table (see end of help) [1]

-l <n> : minimum ORF length [20]

-m : ORFs must initiate with AUG only

-M : ORFs must start with allowed initiation codon

--rosalind : only translate top strand

--franklin : only translate bottom strand

Options controlling reporting and inclusion thresholds:

-E <x> : report sequences <= this E-value threshold in output [10.0] (x>0)

-T <x> : report sequences >= this score threshold in output

--incE <x> : consider sequences <= this E-value threshold as significant [0.01] (x>0)

--incT <x> : consider sequences >= this score threshold as significant

Options controlling acceleration heuristics:

--max : turn all heuristic filters off (less speed, more power)

--F1 <x> : stage 1 (MSV) threshold: promote hits w/ P <= F1 [0.02]

--F2 <x> : stage 2 (Vit) threshold: promote hits w/ P <= F2 [1e-3]

--F3 <x> : stage 3 (Fwd) threshold: promote hits w/ P <= F3 [1e-5]

--nobias : turn off composition bias filter

--nonull2 : turn off biased composition score corrections

--fsonly : send all potential hits to the frameshift aware pipeline

--nofs : send all potential hits to the non-frameshift aware pipeline

Options setting input formats:

--qformat <s> : assert query is in format <s> (can be seq or msa format)

--qsingle_seqs : force query to be read as individual sequences, even if in an msa format

--tformat <s> : assert target <seqfile> is in format <s>: no autodetection

Options handling single sequence inputs:

--singlemx : use substitution score matrix w/ single-sequence MSA-format inputs

--popen <x> : gap open probability [0.02] (0<=x<0.5)

--pextend <x> : gap extend probability [0.4] (0<=x<1)

--mx <s> : substitution score matrix choice (of some built-in matrices) [BLOSUM62]

--mxfile <f> : read substitution score matrix from file <f>

Other expert options:

-Z <x> : set database size (Megabases) to <x> for E-value calculations (x>=0)

--seed <n> : set RNG seed to <n> (if 0: one-time arbitrary seed) [42] (n>=0)

--w_beta <x> : tail mass at which window length is determined (0>=x<=1)

--w_length <n> : window length - essentially max expected hit length (x>=4)

--block_length <n> : length of blocks read from target database (threaded) (n>=50000)

--cpu <n> : number of parallel CPU workers to use for multithreads [2] (n>=0)

Available NCBI genetic code tables (for --ct <id>):

id description

--- -----------------------------------

1 Standard

2 Vertebrate mitochondrial

3 Yeast mitochondrial

4 Mold, protozoan, coelenterate mitochondrial; Mycoplasma/Spiroplasma

5 Invertebrate mitochondrial

6 Ciliate, dasycladacean, Hexamita nuclear

9 Echinoderm and flatworm mitochondrial

10 Euplotid nuclear

11 Bacterial, archaeal; and plant plastid

12 Alternative yeast

13 Ascidian mitochondrial

14 Alternative flatworm mitochondrial

16 Chlorophycean mitochondrial

21 Trematode mitochondrial

22 Scenedesmus obliquus mitochondrial

23 Thraustochytrium mitochondrial

24 Pterobranchia mitochondrial

25 Candidate Division SR1 and Gracilibacteria

> bathbuild -h

Usage: bathbuild [-options] <hmmfile_out> <msafile>

Basic options:

-h : show brief help on version and usage

-n <s> : name the HMM <s>

-o <f> : direct summary output to file <f>, not stdout

-O <f> : resave annotated, possibly modified MSA to file <f>

--ct <n> : use alt genetic code of NCBI transl table <n> [1]

--unali : input file is an unaligned sequence file

Options for selecting alphabet rather than guessing it:

--amino : input alignment is protein sequence data

--dna : input alignment is DNA sequence data

--rna : input alignment is RNA sequence data

Alternative model construction strategies:

--fast : assign cols w/ >= symfrac residues as consensus [default]

--hand : manual construction (requires reference annotation)

--symfrac <x> : sets sym fraction controlling --fast construction [0.5]

--fragthresh <x> : if L <= x*alen, tag sequence as a fragment [0.5]

Alternative relative sequence weighting strategies:

--wpb : Henikoff position-based weights [default]

--wgsc : Gerstein/Sonnhammer/Chothia tree weights

--wblosum : Henikoff simple filter weights

--wnone : don't do any relative weighting; set all to 1

--wgiven : use weights as given in MSA file

--wid <x> : for --wblosum: set identity cutoff [0.62] (0<=x<=1)

Alternative effective sequence weighting strategies:

--eent : adjust eff seq # to achieve relative entropy target [default]

--eclust : eff seq # is # of single linkage clusters

--enone : no effective seq # weighting: just use nseq

--eset <x> : set eff seq # for all models to <x>

--ere <x> : for --eent: set minimum rel entropy/position to <x>

--esigma <x> : for --eent: set sigma param to <x> [45.0]

--eid <x> : for --eclust: set fractional identity cutoff to <x> [0.62]

Alternative prior strategies:

--pnone : don't use any prior; parameters are frequencies

--plaplace : use a Laplace +1 prior

Handling single sequence inputs:

--singlemx : use substitution score matrix for single-sequence inputs

--mx <s> : substitution score matrix (built-in matrices, with --singlemx)

--mxfile <f> : read substitution score matrix from file <f> (with --singlemx)

--popen <x> : force gap open prob. (w/ --singlemx, aa default 0.02, nt 0.031)

--pextend <x> : force gap extend prob. (w/ --singlemx, aa default 0.4, nt 0.75)

Control of E-value calibration:

--EmL <n> : length of sequences for MSV Gumbel mu fit [200] (n>0)

--EmN <n> : number of sequences for MSV Gumbel mu fit [200] (n>0)

--EvL <n> : length of sequences for Viterbi Gumbel mu fit [200] (n>0)

--EvN <n> : number of sequences for Viterbi Gumbel mu fit [200] (n>0)

--EfL <n> : length of sequences for Forward exp tail tau fit [100] (n>0)

--EfN <n> : number of sequences for Forward exp tail tau fit [200] (n>0)

--Eft <x> : tail mass for Forward exponential tail tau fit [0.04] (0<x<1)

Other options:

--cpu <n> : number of parallel CPU workers for multithreads [2]

--stall : arrest after start: for attaching debugger to process

--informat <s> : assert input infile is in format <s> (no autodetect)

--seed <n> : set RNG seed to <n> (if 0: one-time arbitrary seed) [42]

--w_beta <x> : tail mass at which window length is determined

--w_length <n> : window length

--maxinsertlen <n> : pretend all inserts are length <= <n>

Available NCBI genetic code tables (for --ct <id>):

id description

--- -----------------------------------

1 Standard

2 Vertebrate mitochondrial

3 Yeast mitochondrial

4 Mold, protozoan, coelenterate mitochondrial; Mycoplasma/Spiroplasma

5 Invertebrate mitochondrial

6 Ciliate, dasycladacean, Hexamita nuclear

9 Echinoderm and flatworm mitochondrial

10 Euplotid nuclear

11 Bacterial, archaeal; and plant plastid

12 Alternative yeast

13 Ascidian mitochondrial

14 Alternative flatworm mitochondrial

16 Chlorophycean mitochondrial

21 Trematode mitochondrial

22 Scenedesmus obliquus mitochondrial

23 Thraustochytrium mitochondrial

24 Pterobranchia mitochondrial

25 Candidate Division SR1 and Gracilibacteria

> bathconvert -h

Usage: bathconvert [-options] <hmmfile_out> <hmmfile_in>

Options:

-h : show brief help on version and usage

--ct <n> : use alt genetic code of NCBI transl table <n> [1]

Usage: bathstat [-options] <hmmfile>

Options:

-h : show brief help on version and usage

# bathfetch :: retrieve profile HMM(s) from a file

# Easel 0.48 (Nov 2020)

# Freely distributed under the BSD open source license.

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

> bathfetch -h

Usage: bathfetch [options] <hmmfile_in> <key> (retrieves HMM named <key>)

Usage: bathfetch [options] -f <hmmfile_in> <keyfile> (retrieves all HMMs in <keyfile>)

Usage: bathfetch [options] --index <hmmfile_in> (indexes <hmmfile>)

Options:

-h : help; show brief info on version and usage

-f : second cmdline arg is a file of names to retrieve

-o <f> : output HMM to file <f> instead of stdout

-O : output HMM to file named <key>

--ct <n> : use alt genetic code of NCBI transl table (see below) [1]

--index : index the <hmmfile>, creating <hmmfile>.ssi

Available NCBI genetic code tables (for --ct <id>):

id description

--- -----------------------------------

1 Standard

2 Vertebrate mitochondrial

3 Yeast mitochondrial

4 Mold, protozoan, coelenterate mitochondrial; Mycoplasma/Spiroplasma

5 Invertebrate mitochondrial

6 Ciliate, dasycladacean, Hexamita nuclear

9 Echinoderm and flatworm mitochondrial

10 Euplotid nuclear

11 Bacterial, archaeal; and plant plastid

12 Alternative yeast

13 Ascidian mitochondrial

14 Alternative flatworm mitochondrial

16 Chlorophycean mitochondrial

21 Trematode mitochondrial

22 Scenedesmus obliquus mitochondrial

23 Thraustochytrium mitochondrial

24 Pterobranchia mitochondrial

25 Candidate Division SR1 and Gracilibacteria

実行方法

BATHは既存のHMMER3コードベースの上に構築されている。

1、bathbuild - マルチプル配列アラインメント(MSA)、またはアラインメントされていない配列からBATH形式のプロファイル隠れマルコフモデル(pHMM)を構築する。出力、入力の順番に指定する。

#DNA 
bathbuild --dna out.bhmm infile.aln
#RNA
bathbuild --rna out.bhmm infile.aln
#protein
bathbuild --amino out.bhmm infile.aln

#MSAされていない配列を使用するには--unaliフラグを立てる。3配列なら3つのHMM（各配列に1つずつ）ができる
bathbuild --unali out.bhmm infile.faa

-o direct summary output to file <f>, not stdout
-O resave annotated, possibly modified MSA to file <f>
--ct use alt genetic code of NCBI transl table <n> [1]
--unali input file is an unaligned sequence file
--amino input alignment is protein sequence data
--dna input alignment is DNA sequence data
--rna input alignment is RNA sequence data

out.hmmができる。デフォルトでは、真核生物の核DNAで採用されているスタンダードコードを使用する。別のコドン翻訳テーブルを使用するには"--ct"オプションを使う（チュートリアルのpractice 1参照）。

2、bathstat - BATH 形式の pHMM ファイルの統計情報を表示する。

bathstat hmmfile

出力のフィールドはbathbuildで生成されるものとほぼ同じ

3、bathconvert - HMMER3形式のpHMMファイルをBATH形式のpHMMファイルに変換する。出力、入力の順番に指定する。

bathconvert conveted.bhmm in.phmm

conveted.hmmが出力される。

bathconvertコマンドは既存のBATH形式のpHMMファイルのコドン表を変更するために使用することもできる（MDAからビルドし直すよりも早いと書かれている）。コドン表は--ctフラグで指定する。

4、bathsearch - 1つ以上のクエリタンパク質をDNA配列データベース（またはゲノム）に対して検索する。クエリはpHMMのファイル（bathbuildまたはbathconvertで作成）、または配列または配列アラインメントを含むファイルを使用する。

bathsearch -o output database.bhmm in_nucleotide.fna

#tab区切り出力
bathsearch -o output --tblout out.tbl database.bhmm in_nucleotide.fna

--ct use alt genetic code of NCBI translation table (see end of help) [1]
--frameline include frame of each codon in alignment

bathsearchは突然変異やシークエンシングエラーによってフレームシフトが生じた場合でも、タンパク質コードDNAの翻訳アノテーションを行うことができる。

出力例

File Header - #'で始まり、検索パラメーターに関する基本的な情報が示されている。

Query Header - 各クエリの要約とE値でソートされたヒットリストが含まれる。

ヘッダーにはE値、ビットスコア、バイアススコア、ヒットしたターゲット配列の名前、アライメントの開始と終了のターゲット配列位置、そのアライメント内のフレームシフトとストップコドンの数、最後にDescriptionが続く。

Annotation Lines - クエリーヘッダーに記載された各ヒットに対して有用な情報を含む注釈行。Annotation for each hit (and alignments):'という行の後に、これらの注釈行（アラインメントと同様）が表示され、最初にターゲット配列、次にe-valueでソートされる。

クエリーヘッダーと同様に、注釈行には各ヒットのスコア、バイアス、E値がリストされる。また、クエリとターゲットのアライメント開始座標と終了座標、そしてエンベロープ座標の3種類の座標が表示される。エンベロープはbathsearchが相同性を含むと特定したターゲットの領域。

Alignment - Annotation行の下にクエリとターゲットヒットのアラインメントが表示される。

少なくとも以下の5つの行が含まれる。上から下へ、 (1)クエリ行、(2)マッチ行、(3)翻訳行、(4)ターゲット行、(5)事後確率行。pHMMがコンセンサス構造や参照アノテーションを含むMSAから構築された場合、それらはクエリ行の上のCS行とRF行に分かれて表示される。(1)クエリーアミノがターゲットコドンまたは準コドンにアライメントされたマッチ、(2)クエリーアミノ酸がターゲットギャップ文字にアライメントされた欠失、(3)ターゲットコドンがクエリーギャップ文字にアライメントされた挿入。

Query Footer - 出力の最後はbathsearch内のヒットフィルタリングプロセスに関する情報を提供するフッターが表示される。

タブ区切り出力

out.tbl

出力フィールドについてはPractice13を参照。

5、bathfetch - 1つのpHMMで検索する必要があるが、ビルドしたpHMMには複数のプロファイルが含まれている場合、bathfetchを使って目的のpHMMだけを新しいファイルにコピーすることで時間を節約できる。さらに、元のファイルに多数のpHMMが含まれている場合、インデックスファイルを作成してフェッチプロセスをスピードアップできる。

#index
bathfetch --index many_proteins.bhmm

#fetch - 配列名"PTH2"
bathfetch -o PTH2.hmm many_proteins.bhmm PTH2

その他

'--frameline'フラグを使うと、各コドンや準コドンのフレームに番号をつけた行をアライメントに加えてフレームシフトや停止コドンを見つけやすくすることができる（practice14）。
'--fstblout'フラグを使用すると、全ヒットのフレームシフトと停止コドンの位置を表形式で出力することができる（practice15）。

引用

Sensitive and error-tolerant annotation of protein-coding DNA with BATH

Genevieve Krause, Walter Shands, Travis Wheeler

bioRixv, Posted January 01, 2024