streammd: fast, low-memory duplicate marking of SAM records using a Bloom filter

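Identifying duplicate templates is a common preprocessing step in bulk sequencing analysis. streammd closely reproduces the output of Picard MarkDuplicates while running substantially faster, and it uses far less memory than SAMBLASTER. streammd is a C++ program available under the MIT license from GitHub at https://github.com/delocalizer/streammd.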

Features (from the repository)

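Fast - with default settings, streammd is 5x faster than Picard MarkDuplicates
Small memory footprint even for large libraries - with default settings, streammd needs only 4G of memory to process 1B templates
High concordance with Picard MarkDuplicates metrics
Correct handling of soft-clipped reads
Tunable target false-positive rate
Streaming input and output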
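
To make the soft-clipping point concrete: duplicate markers in the Picard tradition compare reads by the reference position of their 5' end after extending through any clipped bases, so that two copies of the same fragment still match when one of them has been clipped by the aligner. The Python sketch below shows that calculation from the POS and CIGAR fields of a SAM record. The function names are made up for illustration, and this is not taken from streammd's source, whose actual key construction may differ.

```python
import re

# CIGAR operations that consume reference bases (per the SAM specification).
REF_CONSUMING = set("MDN=X")
# Soft and hard clips.
CLIP_OPS = set("SH")

def parse_cigar(cigar):
    """Split a CIGAR string such as '5S95M' into [(5, 'S'), (95, 'M')]."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def unclipped_five_prime(pos, cigar, reverse):
    """Return the 1-based reference coordinate of the read's 5' end,
    extended through any soft- or hard-clipped bases."""
    ops = parse_cigar(cigar)
    if not reverse:
        # Forward strand: the 5' end is the leftmost base, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in CLIP_OPS:
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so add trailing clips.
    ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
    trail = 0
    for n, op in reversed(ops):
        if op not in CLIP_OPS:
            break
        trail += n
    return pos + ref_len - 1 + trail

# An unclipped read and a soft-clipped copy of it resolve to the same 5' position.
print(unclipped_five_prime(100, "100M", False))   # 100
print(unclipped_five_prime(105, "5S95M", False))  # 100
```

Without this adjustment the second read would be keyed at position 105 and the duplicate would be missed.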

Installation

Dependencies

Requires c++17 with <charconv>
gcc >= 8.1 or clang >= 7 should work.

GitHub

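# Download streammd-4.3.0.tar.gz from the releases page
https://github.com/delocalizer/streammd/releases

# Unpack the archive (this step is not shown in the original notes), then enter the source directory;
# adjust the directory name if the archive extracts to something other than streammd/
tar -xzf streammd-4.3.0.tar.gz
cd streammd/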
./configure
make -j8
sudo make install

$ streammd --help

Usage: streammd [-h] [--input INPUT] [--output OUTPUT] [--fp-rate FP_RATE] [--mem MEM] [--allow-overcapacity] [--metrics METRICS_FILE] [--remove-duplicates] [--show-capacity] [--single] [--strip-previous]

Read a SAM file from STDIN, mark duplicates in a single pass and stream processed records to STDOUT. Input must begin with a valid SAM header followed by qname-grouped records. Default log level is 'info' — set to something else (e.g. 'debug') via SPDLOG_LEVEL environment variable.

Optional arguments:

-h, --help shows help message and exits

-v, --version prints version information and exits

--input INPUT Input file. [default: STDIN]

--output OUTPUT Output file. [default: STDOUT]

-p, --fp-rate FP_RATE The maximum acceptable marginal false-positive rate. [default: 1e-06]

-m, --mem MEM Memory allowance for the Bloom filter, e.g "4GiB". Both binary (kiB|MiB|GiB) and decimal (kB|MB|GB) formats are understood. As a result of implementation details, a value that is an exact power of 2 (512MiB, 1GiB, 2GiB etc) gives a modest processing speed advantage (~5%) over neighbouring values. [default: "4GiB"]

--allow-overcapacity Warn instead of error when Bloom filter capacity is exceeded. [default: false]

--metrics METRICS_FILE Output metrics file. [default: "streammd-metrics.json"]

--remove-duplicates Omit detected duplicates from the output.

--show-capacity Do no work, just print the capacity of the Bloom filter that would be constructed with the given --fp-rate and --mem values.

--single Accept single-ended reads as input. [default: paired-end]

--strip-previous Unset duplicate flag for any reads that have it set and are no longer considered duplicate. Only ever required if records have previously been through a duplicate marking step. [default: false]
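
The relationship between --mem, --fp-rate and the capacity reported by --show-capacity can be approximated with the textbook Bloom filter formula n = -m (ln 2)^2 / ln p, where m is the filter size in bits and p the false-positive rate at capacity. The Python sketch below parses the unit suffixes listed in the help text and applies that formula. It assumes an optimally tuned standard Bloom filter, and the function names are illustrative; streammd's own calculation may use a different number of hash functions, so treat --show-capacity as the authoritative figure.

```python
import math
import re

# Multipliers for the unit suffixes listed in the --mem help text.
UNITS = {
    "kB": 10**3, "MB": 10**6, "GB": 10**9,
    "kiB": 2**10, "MiB": 2**20, "GiB": 2**30,
}

def parse_mem(text):
    """Convert a string such as '4GiB' or '512MB' to a number of bytes."""
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*(kB|MB|GB|kiB|MiB|GiB)", text)
    if match is None:
        raise ValueError(f"unrecognised memory size: {text!r}")
    return int(float(match.group(1)) * UNITS[match.group(2)])

def textbook_capacity(mem_bytes, fp_rate):
    """Items a standard Bloom filter of this size can hold at the target
    false-positive rate, assuming the optimal number of hash functions."""
    bits = mem_bytes * 8
    return int(-bits * math.log(2) ** 2 / math.log(fp_rate))

mem = parse_mem("4GiB")
print(mem)                                  # 4294967296
print(f"{textbook_capacity(mem, 1e-6):.2e}")
```

With the defaults (4GiB, 1e-06) this estimate comes out at roughly 1.2 billion items, the same order of magnitude as the "4G for 1B templates" figure quoted in the feature list.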

Testing

Run the unit tests:

make check

This takes about 10 minutes.

Usage

streammd reads a SAM file from STDIN, marks duplicates in a single pass, and streams the processed records to STDOUT.

# Example: combining with bwa mem
bwa mem ref.fa r1.fq r2.fq | streammd > out.sam

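Notes from the repository

A 60x human WGS run with 2x150 bp paired-end sequencing consists of n≈6.00E+08 templates; the default memory setting of 4GiB is sufficient to process this at the default false-positive rate of 1.00E-06.
Input must begin with a valid SAM header, followed by qname-grouped records.
Because processing is single-pass, streammd keeps the first template it encounters as the original and marks every copy encountered afterwards as a duplicate, without distinguishing optical duplicates from PCR duplicates.
The current implementation handles SAM-format input only.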
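
The single-pass behaviour described in these notes can be illustrated with a toy Bloom filter in Python. Each template is reduced to a signature string; the first time a signature is inserted, the filter reports it as new, and any later template with the same signature is flagged as a duplicate. Everything here is a sketch under assumptions: the ToyBloom class, the signature format and the SHA-256 based hashing are invented for illustration and are not streammd's implementation.

```python
import hashlib

class ToyBloom:
    """A tiny Bloom filter: k bit positions per item in an m-bit array."""

    def __init__(self, m_bits=1 << 20, k=7):
        self.m = m_bits
        self.k = k
        self.bits = bytearray(m_bits // 8)

    def _positions(self, item):
        # Derive k positions from two 64-bit halves of one SHA-256 digest
        # (double hashing); a production tool would use a much faster hash.
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "little")
        h2 = int.from_bytes(digest[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        """Insert item; return True if it was (probably) already present."""
        seen = True
        for p in self._positions(item):
            byte, mask = p >> 3, 1 << (p & 7)
            if not self.bits[byte] & mask:
                seen = False
                self.bits[byte] |= mask
        return seen

def mark_duplicates(templates, bloom):
    """Yield (name, is_duplicate) in input order, in a single pass.
    `templates` is an iterable of (name, signature) pairs."""
    for name, signature in templates:
        yield name, bloom.add(signature)

# Hypothetical signatures built from the 5' ends and strands of both reads.
templates = [
    ("t1", "chr1:100:+|chr1:350:-"),
    ("t2", "chr1:500:+|chr1:760:-"),
    ("t3", "chr1:100:+|chr1:350:-"),  # same ends as t1
]
for name, is_dup in mark_duplicates(templates, ToyBloom()):
    print(name, is_dup)
```

Here t3 is flagged because t1 was seen first. Nothing is ever read back or revisited, which is why memory use depends on the filter size rather than on the input size, and why a small, tunable false-positive rate is the price of the approach.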

Citation

streammd: fast low-memory duplicate marking using a Bloom filter
Conrad Leonard
Bioinformatics, Volume 39, Issue 4, April 2023