DRAGENのCPUのみ使う実装 DRAGMAP - macでインフォマティクス

2021 11/27 追記

まだ初期のアルファリリースということですが、DRAGMAPを試してみました。詳しくはGATKのブログを読んで下さい。BWA-MEMとの性能比較では、興味深い結果が提示されています。

GATK blog

Introducing DRAGMAP, the new genome mapper in DRAGEN-GATK

https://gatk.broadinstitute.org/hc/en-us/articles/4410953761563-Introducing-DRAGMAP-the-new-genome-mapper-in-DRAGEN-GATK

要約

イルミナのチームは、自社開発のDRAGENハードウェアと同じ結果を得るための方法をオープンソースで提供するために、DRAGENマッパーを開発した。これが必要だったのは、オリジナルのDRAGENマッパーが、FPGA技術を用いたDRAGENハードウェアアーキテクチャ自体にエンコードされているから。
本チームはBWAをずっと愛用してきたが、速度の限界が来ている。DRAGMAPは、BWAを支えてきた基本的なアイデアをベースに、将来を見据えた方法で新しい課題に取り組むことができるよう、いくつかのイノベーションを加えた（ブログ中で性能比較している）。
多様な集団の結果を有意に改善するための最も有望なアプローチの1つは、グラフゲノムを使用することかもしれない。DRAGENの最新のハードウェアバージョンでは、グラフゲノムを使用したマッピングがすでにサポートされており、これがDRAGENのIlluminaリードを使用したPrecision FDA Truth Challenge 2での優勝に貢献した。グラフのサポートをDRAGMAPに移植するのは大変な作業なので、現在のバージョンでは対象外だが、将来のDRAGMAPバージョンでは検討される。
DRAGMAPは、BWA-MEMに代わってDRAGEN-GATKパイプラインのデフォルトのゲノムマッパーとして採用される予定。
DRAGMAPのアルファ版をGPLv3ライセンスでGithubに公開した。
現在はアルファ版のリリースであり、精度と機能的同等性を優先しているため、速度はあまり重視されていない（現時点ではBWA MEMとほぼ同等）。
さまざまなデータセットやコンピューティング環境でテストしていただき、報告された問題点やフィードバックをもとにソフトウェアを改良していく。
何か問題が発生した場合は、DRAGMAP GitHub issuesページに直接投稿して、イルミナの開発チームに連絡を取ってみてください。
2021年12月2日（木）午前10時～11時（EST）に開催されるDRAGEN-GATKのウェビナーとQ&Aセッションへの参加もお忘れなく！（ブログにzoomリンクあり）

インストール

condaを使って3990xの計算機にインストールした（ubuntu18, python 3.9）。

依存

Compilation was tested on CentOS 7

C++11 compatible compiler (e.g gcc-c++ >= 4.8.5-36.el7_6.2)
GNU make >= 3.82
Boost library : boost169-devel >= 1.69.0-1.el7
For unit tests : googletest (>= v1.6)
Hardware: x86_86, 64GB RAM minimum
OS: Centos >= 7.7

Github

#conda (link)
conda install -c bioconda dragmap

> dragen-os --help

$ dragen-os --help

dragenos -r <reference> -b <base calls> [optional arguments]

Command line options:

--Aligner.pe-orientation arg Expected paired-end

orientation: 0=FR, 1=RF, 2=FF

--Aligner.pe-stat-mean-insert arg (=0) Expected mean of the insert

size

--Aligner.pe-stat-mean-read-len arg (=0) When setting paired-end

insert stats, the expected

mean read length

--Aligner.pe-stat-quartiles-insert arg Q25, Q50, and Q75 quartiles

for the insert size

--Aligner.pe-stat-stddev-insert arg (=0) Expected standard deviation

of the insert size

--Aligner.rescue-ceil-factor arg (=3) For tuning the rescue scan

window maximum ceiling

--Aligner.rescue-sigmas arg (=0) For tuning the rescue scan

window

--Aligner.sec-aligns arg (=0) Maximum secondary

(suboptimal) alignments to

report per read

--Aligner.sec-aligns-hard arg (=0) Set to force unmapped when

not all secondary alignments

can be output

--Aligner.sec-phred-delta arg (=0) Only secondary alignments

with likelihood within this

phred of the primary

--Aligner.sec-score-delta arg (=0) Secondary aligns allowed with

pair score no more than this

far below primary

--Aligner.sw-all arg (=0) Value of 1 forces smith

waterman on all candidate

alignments

--Aligner.sw-method arg (=mengyao) Smith Waterman implementation

(dragen / mengyao default =

dragen)

--Mapper.filter-len-ratio arg (=4) Ratio for controlling seed

chain filtering

--RGID arg (=1) Read Group ID

--RGSM arg (=none) Read Group Sample

-b [ --bam-input ] arg Input BAM file

--build-hash-table arg (=0) Generate a reference/hash

table

--enable-sampling arg (=1) Automatically detect

paired-end parameters by

running a sample through the

mapper/aligner

-1 [ --fastq-file1 ] arg FASTQ file to send to card

(may be gzipped)

-2 [ --fastq-file2 ] arg Second FASTQ file with

paired-end reads (may be

gzipped)

--fastq-offset arg (=33) FASTQ quality offset value.

Set to 33 or 64

-h [ --help ] produce help message and exit

--help-defaults produce tab-delimited list of

command line options and

their default values

--help-md produce help message

pre-formatted as a markdown

file section and exit

--ht-anchor-bin-bits arg (=0) Bits defining reference bins

for anchored seed search

--ht-cost-coeff-seed-freq arg (=0.5) Cost coefficient of extended

seed frequency

--ht-cost-coeff-seed-len arg (=1) Cost coefficient of extended

seed length

--ht-cost-penalty arg (=0) Cost penalty to extend a seed

by any number of bases

--ht-cost-penalty-incr arg (=0.69999999999999996)

Cost penalty to incrementally

extend a seed another step

--ht-crc-extended arg (=0) Index of CRC polynomial for

hashing extended seeds

--ht-crc-primary arg (=0) Index of CRC polynomial for

hashing primary seeds

--ht-decoys arg Path to decoys file (FASTA

format)

--ht-dump-int-params arg (=0) Testing - dump internal

parameters

--ht-ext-rec-cost arg (=4) Cost penalty for each EXTEND

or INTERVAL record

--ht-ext-table-alloc arg (=0) 8-byte records to reserve in

extend_table.bin

(0=automatic)

--ht-mask-bed arg Bed file for base masking

--ht-max-dec-factor arg (=1) Maximum decimation factor for

seed thinning

--ht-max-ext-incr arg (=12) Maximum bases to extend a

seed by in one step

--ht-max-ext-seed-len arg (=0) Maximum extended seed length

--ht-max-multi-base-seeds arg Maximum seeds populated at

multi-base codes

--ht-max-seed-freq arg (=16) Maximum allowed frequency for

a seed match after extension

attempts

--ht-max-seed-freq-len arg (=98) Ramp from priMaxSeedFreq

reaches maxSeedFreq at this

seed length

--ht-max-table-chunks arg Maximum ~1GB thread table

chunks in memory at once

--ht-mem-limit arg (=0GB) Memory limit (hash table +

reference)

--ht-methylated arg (=0) If set to true, generate C->T

and G->A converted pair of

hashtables

--ht-min-repair-prob arg (=0.20000000000000001) Minimum probability of

success to attempt extended

seed repair

--ht-num-threads arg Worker threads for generating

hash table

--ht-override-size-check arg (=0) Override hash table size

check

--ht-pri-max-seed-freq arg (=0) Maximum frequency for a

primary seed match (0 => use

maxSeedFreq)

--ht-rand-hit-extend arg (=8) Include a random hit with

each EXTEND record of this

frequency

--ht-rand-hit-hifreq arg (=1) Include a random hit with

each HIFREQ record

--ht-ref-seed-interval arg Number of positions per

reference seed

--ht-reference arg Reference in FASTA format

--ht-repair-strategy arg (=0) Seed extension repair

strategy: 0=none, 1=best,

2=rand

--ht-seed-len arg (=21) Initial seed length to store

in hash table

--ht-size arg (=0GB) Size of hash table, units

B|KB|MB|GB

--ht-sj-size arg Reserved space for RNA

annotated SJs, units

B|KB|MB|GB

--ht-soft-seed-freq-cap arg (=12) Soft seed frequency cap for

thinning

--ht-target-seed-freq arg (=4) Target seed frequency for

seed extension

--ht-test-only arg (=0) Testing - show user

parameters, but don't do

anything

--ht-uncompress arg (=0) Uncompress hash_table.cmp to

hash_table.bin and

extend_table.bin (standalone

option)

--ht-write-hash-bin arg (=0) Write decompressed

hash_table.bin and

extend_table.bin (0/1)

--input-qname-suffix-delimiter arg (= ) Character that delimits input

qname suffixes, e.g. / for /1

--interleaved [=arg(=1)] (=0) Interleaved paired-end reads

in single bam or FASTQ

--map-only arg (=0) no real alignment, produces

alignment information based

on seed chains only

--mmap-reference arg (=0) memory-map reference data

instead of pre-loading. This

allows for quicker runs when

only a handful of reads need

to be aligned

--num-threads arg (=128) Worker threads for

mapper/aligner (default =

maximum available on system)

--output-directory arg Output directory

--output-file-prefix arg Output filename prefix

--pe-stats-interval-delay arg (=) Number of intervals of lag

between sending reads and

using resulting stats

--pe-stats-interval-memory arg (=

) Include reads from up to this

many stats intervals in

paired-end stats calculations

--pe-stats-interval-size arg (=25000) Number of reads to include in

each interval of updating

paired-end stats

--pe-stats-sample-size arg (=100000) Number of most recent pairs

to include in each update of

the paired-end stats

--preserve-map-align-order arg (=0) Preserve the order of

mapper/aligner output to

produce deterministic

results. Impacts performance

-r [ --ref-dir ] arg directory with reference and

hash tables. Must contain the

uncompressed hashtable.

--ref-load-hash-bin arg (=0) Expect to find uncompressed

hash table in the reference

directory.

--response-file arg file with more command line

arguments

-v [ --verbose ] Be talkative

-V [ --version ] print program version

実行方法

１、リファレンスfastaファイルのハッシュテーブル構築

mkdir reference
dragen-os --build-hash-table true --ht-reference ref.fasta  --output-directory reference/

ヒトゲノムアセンブリだと5分前後かかった。

２、マッピング

マッピングしてsam出力。

#paired-end
dragen-os -r reference/ -1 reads_1.fastq.gz -2 reads_2.fastq.gz >  result.sam

#single-end
dragen-os -rreference/ -1 reads_1.fastq.gz  >  result.sam

デフォルトで利用可能な全CPUコア使用される。

直接出力する。

mkdir outdir
dragen-os -r /home/data/reference/ -1 reads_1.fastq.gz -2 reads_2.fastq.gz --output-directory outdir --output-file-prefix result

3990xの計算機では、コマンド実行後しばらく経ってからseqmentation error が発生した。

f:id:kazumaxneo:20211126213726p:plain

引用

https://github.com/Illumina/DRAGMAP