複雑なクエリ表現に対応し、BAMを様々な条件でフィルタリングできる BAMQL

　Binary Alignment / Map（BAM）は、リファレンスゲノムとのアラインメント後に大量のゲノムリードデータを保持するための共通フォーマットを提供している。リードには捕捉情報が追加されており、例えばFASTAやFASTQファイルには含まれていないターゲット位置や染色体などの情報、およびメイトペアに関する情報が表示されている。

　さらなる分析のために、よくリードのサブセットのみが必要とされる。 BAMファイルを操作するための標準ソフトウェアであるSAMtoolsには、BAMファイルのフィルタリングを実行し、一致するリードを収集するオプションが用意されている[ref.1]。しかしインタフェースは表現力もなくユーザフレンドリでもない: ユーザは名前でなくビットフラグを数値で指定する必要がある。さらに、選択条件はビットフラグがマッチするかどうかによるフィルタリングに制限され、最低クオリティスコアや一致するリードグループ（@RG）などのフィルタリングは行えない。これにより、クエリの表現力が制限される。たとえば、品質の悪いリードを選択することはできない。

　汎用プログラミング言語を使用してBAMファイル操作ライブラリでファイルをフィルタリングすることができるが、サブセット操作のほとんどはad hocな作業なので、手間がかかり定型的なコードがかなり必要になる。さらに、BAM APIによって必要とされるものと、言語によって行われるチェックとの間には相違がある。たとえば、indexを使用してファイルのサブセットのみを読み取ることは可能だが、indexの要求とフィルタリング式の間で正確に一致する染色体名が必要になる。不一致が発生すると、複雑な言語混在クエリのために検出が困難なデータが漏れてしまうことがある。サブセット化プロセスを簡素化しつつ、強力なクエリを実行できるようにするため、BAM Query Language（BAMQL）を開発した。BAMQLクエリは、述語と論理結合によって指定される。述語には、BAM flags、マッピング情報、補助読リード情報、ポジションおよびシーケンスと一致するものが含まれる。たとえば、1番染色体上にあるすべてのリードとBAMファイルを出力するには、次の操作を行う。

bamql -f input.bam -o output.bam ’chr(1) & paired?’

使用可能なすべての連結子と述語の概要を表1に示す。

論文より転載。画像をクリックで論文にジャンプ。

マニュアル

http://artefacts.masella.name/bamql_queries.html

BAMQLに関するツイート

インストール

ubuntu14.04でテストした。

依存

LLVM 3.4 - 4.0, HTSlib, and libuuid（ビルドに必要）

本体　GIthub

#Ubuntu
sudo apt-add-repository ppa:boutroslab/ppa && sudo apt-get update && sudo apt-get install bamql

> bamql -h

$ bamql -h

bamql [-b] [-I] [-o accepted_reads.bam] [-O rejected_reads.bam] [-v] -f input.bam {query | -q query.bamql}

Filter a BAM/SAM file based on the provided query. For details, see the man page.

-b The input file is binary (BAM) not text (SAM).

-f The input file to read.

-I Do not use the index, even if it exists.

-o The output file for reads that pass the query.

-O The output file for reads that fail the query.

-q A file containing the query, instead of providing it on the command line.

-v Print some information along the way.

ラン

１番染色体にマッピングされているリードを抽出 (論文表１の"chr")

bamql -I -f input.bam -o passed.bam -O failed.bam  'chr(1)'

-I 　 Do not use the index, even if it exists.
-o　 The output file for reads that pass the query.
-O　 The output file for reads that fail the query.
-f　 The input file to read.

chr(glob)　Matches the chromosome name to which the read is mapped.

chr以外のヘッダー、例えばplasmidLとかなら'chr(plasmidL)'と記載する。

複数条件も、and、or、if else, elsifなど指定できる。chr1の100-10000に適切にマッピングされたペアエンドリードだけ抽出。duplicated readsは除く。

bamql -I -f input.bam -o passed.bam -O failed.bam 'proper_pair? & !duplicate? &!supplementary? & mapping_quality(0.95) & position(100,10000) & chr(1)'

! expr　Is satisfied if expr is not satisfied.
expr & expr　Is satisfied only if both operands are satisfied.
paired?　The read is paired in sequencing.
duplicate?　The read is either a PCR or an optical duplicate. That is, another read with the same sequence occurs at exactly the same position in the reference genome.
mapping_quality(probability)　Matches the read if the probability of error is less than probability. The mapping quality is approximated in the SAM format, so this will be imperfect. For clarity, setting the probability to 0 will be so stringent as to reject all reads, while setting it to 1 will be so liberal as to accept all reads.
supplementary?　The alignment is supplementary.
position(start,end)
Matches all sequences that cover the range of position from specified start to end, i.e. N matches any base.