BLAST-QC: sorting and filtering BLAST results

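NCBI's Basic Local Alignment Search Tool (BLAST) is the preferred utility for sequence alignment and identification in bioinformatics and genomics research. Among researchers who use NCBI's BLAST software, it is well known that analyzing the results of a large BLAST search is tedious and time-consuming. In addition, recent discussion of how parameters such as '-max_target_seqs' affect the BLAST heuristic search process has called the use of these search options into question. A stand-alone parser therefore becomes the only option for condensing these large datasets, and because few are available for download online, researchers have to write specialized software whenever they need to analyze BLAST results. The need for a streamlined, fast script that solves these problems and can be easily implemented in a variety of bioinformatics and genomics workflows was the initial motivation for developing this software.
In this study, we demonstrate the effectiveness of BLAST-QC for analyzing BLAST results and its usefulness compared with other available options. Using gene sequence data obtained from our bioinformatics workflow, we show the superior runtime of BLAST-QC compared with existing parsers developed with the commonly used BioPerl and BioPython modules, as well as with C and Java implementations of the BLAST-QC program. We discuss the '-max_target_seqs' parameter, its usage, and the controversy around it, and offer a solution by demonstrating that the software can provide the functionality this parameter was assumed to produce, along with a variety of other parsing options. We run the script on sample datasets to show the implemented functionality and to provide test cases for the program. BLAST-QC is designed to be integrated into existing software, and we confirm its effectiveness as a module in workflows and other processes.
BLAST-QC provides the community with a simple, lightweight, and portable Python script that makes quality control of BLAST results easy while avoiding the drawbacks of other options. Those drawbacks include the uncertain results of applying the -max_target_seqs parameter, and the added complexity and runtime that come from relying on the cumbersome dependencies of other options such as BioPerl or Java. BLAST-QC is ideal for the high-throughput workflows and pipelines common in bioinformatics and genomics research; the script is designed to be portable and easy to integrate into whatever type of process the user is running.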

From the GitHub repository

The Norman lab's BLAST-QC script is designed to be integrated into bioinformatics and genomics workflows, and provides options that let users modify and specify the functionality they need.

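The ability to filter the number of hits returned per query sequence
The ability to order the output by any threshold the user chooses
Thresholds for tailoring the filtered results to the user's specifications
Range values: when a range is specified, researchers can select a sequence within that range that comes with a more detailed definition. Example: if the top hit has an e-value of .00010 but an uninformative definition, and the e-value range is set to .00005, a hit with an e-value of .00015 and a more detailed definition is returned instead. The team considers this one of the most useful features: there is little value in knowing which hit matches an unknown sequence exactly if that hit carries no information, and the range avoids the problem of finding high-scoring sequences that provide no actually relevant information.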
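The range example above can be sketched in a few lines of Python. This is only my reading of the description, not the code BLAST-QC actually runs: the `Hit` class, `definition_level`, `pick_hit`, and the " >" title separator are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    definition: str  # hit description text (hypothetical field name)
    evalue: float


def definition_level(definition: str, separator: str = " >") -> int:
    """Count the titles in a definition string (the separator is an assumption)."""
    return definition.count(separator) + 1


def pick_hit(hits, erange):
    """Return the most detailed hit whose e-value lies within erange of the best one."""
    best = min(h.evalue for h in hits)
    limit = best + erange
    # Inclusive upper bound; isclose guards against floating-point rounding at the edge.
    candidates = [
        h for h in hits if h.evalue <= limit or math.isclose(h.evalue, limit)
    ]
    # Prefer more titles; break ties with the lower e-value.
    return max(candidates, key=lambda h: (definition_level(h.definition), -h.evalue))


hits = [
    Hit("hypothetical protein", 0.00010),
    Hit("DNA polymerase III subunit alpha >DNA polymerase [Example sp.]", 0.00015),
    Hit("unrelated protein >another title >a third title", 0.001),
]
chosen = pick_hit(hits, erange=0.00005)
print(chosen.evalue, chosen.definition)
```

With a range of 0 the sketch falls back to the plain top hit, which matches the idea that the range only widens the set of candidates.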

Installation

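I tested it in a virtual environment created with conda. The repository contains Python, Java, and C code; here I try the Python version, which is the one used in the usage examples on GitHub.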

GitHub

git clone https://github.com/torkian/blast-QC.git
cd blast-QC/BLAST_QC_PYTHON/

$ python BLAST-QC.py -h

usage: BLAST-QC.py [-h] [-f FILENAME] -ff {XML,tab} [-o OUTPUT] [-p PARALLEL]
                   -t {p,n} [-n NUMBER] [-e EVALUE] [-b BITSCORE]
                   [-i IDENTITY] [-d DEFINITION] [-or {e,b,i,d}] [-er ERANGE]
                   [-br BRANGE] [-ir IRANGE]

optional arguments:
  -h, --help            show this help message and exit
  -f FILENAME, --filename FILENAME
                        Specifiy the Blast XML results input file. (required)
  -ff {XML,tab}, --fileformat {XML,tab}
                        Specifiy the Blast results file format (Tabular or
                        XML). (required)
  -o OUTPUT, --output OUTPUT
                        Specify the output file base name (no extension).
                        Defaults to base name of input file.
  -p PARALLEL, --parallel PARALLEL
                        Set number of threads for parallel processing. Set to
                        1 if sequential processing is desired. (Defaults to
                        #of CPU cores avalible.)(INT value)
  -t {p,n}, --type {p,n}
                        Specify what type of BLAST you are running (Protein or
                        Nucleotide). (required)
  -n NUMBER, --number NUMBER
                        Specify the number of hits to return per query
                        sequence. Defaults to return all hits that fit input
                        threshold(s). (Int value)
  -e EVALUE, --evalue EVALUE
                        Specify an e-value threshold. (Maximum acceptable
                        evalue)(Float value)
  -b BITSCORE, --bitscore BITSCORE
                        Specify a bit-score threshold. (Minimum acceptable
                        bitscore)(Float value)
  -i IDENTITY, --identity IDENTITY
                        Specify a threshold in the percent identity of a
                        hit.(Calculated value. Not to be confused with
                        identity value) (Minimum acceptable percentage) (Float
                        value)
  -d DEFINITION, --definition DEFINITION
                        Specify a threshold in the level of definition
                        provided.This is defined by the number of line
                        separators (titles) are in the Hit definition
                        '<Hit_def>' of the XML file, or salltitles column of
                        the tabular output (must enable the salltitles column
                        in the BLAST tabular output using -outfmt "6 std
                        salltitles"). (Int value)
  -or {e,b,i,d}, --order {e,b,i,d}
                        Specify the order of the results. By lowest evalue,
                        highest bitscore, highest percent identity or most
                        detailed definition data. (default: by evalue- 'e')
                        (if ordering by definition with tabular output you
                        must enable the salltitles column in the BLAST tabular
                        output using -outfmt "6 std salltitles")
  -er ERANGE, --erange ERANGE
                        Sets a range of acceptable deviation from the lowest
                        evalue hit in which a more detailed definition would
                        be prefered. Must be ordered by evalue. (must enable
                        the salltitles column in the BLAST tabular output
                        using -outfmt "6 std salltitles" if using tabular
                        output from BLAST)
  -br BRANGE, --brange BRANGE
                        Sets a range of acceptable deviation from the highest
                        bitscore hit in which a more detailed definition would
                        be prefered. Must be ordered by bitscore. (must enable
                        the salltitles column in the BLAST tabular output
                        using -outfmt "6 std salltitles" if using tabular
                        output from BLAST)
  -ir IRANGE, --irange IRANGE
                        Sets a range of acceptable deviation from the highest
                        percent identity hit in which a more detailed
                        definition would be prefered. Must be ordered by
                        percent identity. (must enable the salltitles column
                        in the BLAST tabular output using -outfmt "6 std
                        salltitles" if using tabular output from BLAST)

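To make the -e and -b thresholds in the help text more concrete, here is a rough sketch of the same kind of filtering applied to tabular output produced with -outfmt "6 std salltitles". It is not taken from BLAST-QC; `filter_rows` and the sample rows are made up for illustration, and the column order is the standard 12-column "std" layout followed by salltitles.

```python
import csv
import io

# Column order of BLAST tabular output for -outfmt "6 std salltitles".
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore", "salltitles",
]


def filter_rows(tab_text, max_evalue=None, min_bitscore=None):
    """Keep rows at or below max_evalue and at or above min_bitscore."""
    reader = csv.DictReader(
        io.StringIO(tab_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # titles may contain quote characters
    )
    kept = []
    for row in reader:
        if max_evalue is not None and float(row["evalue"]) > max_evalue:
            continue
        if min_bitscore is not None and float(row["bitscore"]) < min_bitscore:
            continue
        kept.append(row)
    return kept


# Two made-up rows for one query.
rows = [
    ["q1", "s1", "98.5", "200", "3", "0", "1", "200", "1", "200", "1e-50", "350", "title one"],
    ["q1", "s2", "80.0", "150", "30", "0", "1", "150", "1", "150", "0.5", "40.1", "title two"],
]
tab_text = "\n".join("\t".join(r) for r in rows)

for row in filter_rows(tab_text, max_evalue=1e-5):
    print(row["qseqid"], row["sseqid"], row["evalue"], row["bitscore"])
```

The -i option is left out on purpose: the help text says percent identity is a calculated value, and I have not checked how the script computes it.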

Usage

1. Obtain the BLAST output in XML format. Select Download All => XML.

If you are running BLAST from the command line on a local machine, run the command with -outfmt 5 (*1).
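For a local run, the command could be assembled as in the sketch below. `blast_command` is a hypothetical helper, and the query, database, and output names are placeholders; only the BLAST+ flags themselves (-query, -db, -outfmt, -out) are real. The tabular variant uses the "6 std salltitles" format that the help text asks for.

```python
import shlex


def blast_command(program, query, db, out, xml=True):
    """Build a BLAST+ argument list that writes XML (5) or tabular with salltitles."""
    outfmt = "5" if xml else "6 std salltitles"
    return [program, "-query", query, "-db", db, "-outfmt", outfmt, "-out", out]


# Placeholder file and database names.
cmd = blast_command("blastn", "query.fasta", "my_db", "out.xml")
print(shlex.join(cmd))
```

To actually run it, the list can be passed to `subprocess.run(cmd, check=True)`, provided BLAST+ is installed and the database exists.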

2. Run BLAST-QC on the XML output. The example below returns only one hit per query (-n 1), ranked by lowest e-value (-or e).

#Protein
python BLAST-QC.py -f out.xml -t p -o example3.out -n 1 -or e -ff XML

#Nucleotide
python BLAST-QC.py -f out.xml -t n -o example3.out -n 1 -or e -ff XML

-f Specify the Blast XML results input file. (required)
-t {p, n} Specify what type of BLAST you are running (Protein or Nucleotide). (required)
-o Specify the output file base name (no extension). Defaults to base name of input file.
-n Specify the number of hits to return per query sequence. Defaults to return all hits that fit input threshold(s). (Int value)
-ff {XML, tab} Specify the Blast results file format (Tabular or XML). (required)
-or {e, b, i, d} Specify the order of the results. By lowest evalue, highest bitscore, highest percent identity or most detailed definition data. (default: by evalue- 'e') (if ordering by definition with tabular output you must enable the salltitles column in the BLAST tabular output using -outfmt "6 std salltitles")
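To show what "-n 1 -or e" amounts to on XML input, the sketch below reads BLAST XML with the standard library and keeps the lowest e-value hit for each query. It is an illustration, not the BLAST-QC code: `top_hits` is a hypothetical function, the sample XML is a hand-trimmed fragment, and using the best HSP of each hit as the hit's e-value is my assumption.

```python
import xml.etree.ElementTree as ET

# Hand-trimmed fragment using the element names of BLAST XML (-outfmt 5).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_def>hit A</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>1e-20</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>hit B</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>1e-50</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def top_hits(xml_text, n=1):
    """Return {query: [(hit definition, e-value), ...]} with the n lowest e-values."""
    root = ET.fromstring(xml_text)
    result = {}
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        hits = []
        for hit in iteration.iter("Hit"):
            evalues = [float(e.text) for e in hit.iter("Hsp_evalue")]
            if evalues:
                # Assumption: a hit is represented by its best (lowest) HSP e-value.
                hits.append((hit.findtext("Hit_def"), min(evalues)))
        hits.sort(key=lambda h: h[1])
        result[query] = hits[:n]
    return result


for query, hits in top_hits(SAMPLE_XML, n=1).items():
    print(query, hits)
```

For real output files, `ET.parse("out.xml").getroot()` could replace the `fromstring` call; very large result files would be better handled with `ET.iterparse`.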

Reference

BLAST-QC: automated analysis of BLAST results

Behzad Torkian, Spencer Hann, Eva Preisner, R. Sean Norman
Environmental Microbiome volume 15, Article number: 15 (2020)