コード領域のリアラインメントによってバリアントコールを改善する ABRA

2019 5/23 ABRA2追記

　indel検出を制限するアラインメントエラーおよびリファレンスバイアスを克服するために、多数のリアライメントおよびアセンブリ方法が提案されている。ショートリードのマイクロアライナーは、局所的に組み立てられたバリアントグラフへリードを局所的に再調整する（Homer and Nelson、2010）。 PindelはPattern growth approarchを使用してindelsを検出する（Ye et al、2009）。 Dindelは候補をハプロタイプ候補に再編成し、ベイズ法を使用して最大50 bpの長さのindelsを呼び出す（Albers et al、2011）。GATKのIndelRealignerは、ローカルリアライメントを介して不一致塩基の数を最小限に抑えようとしている（DePristo et al、2011）。全ゲノムde novoアセンブリアプローチには、Fermi（Li、2012）およびCortex Var（Iqbal et al、2012）が含まれる。 SOAPIndelは、ペアリードの片方だけがマッピングされている領域で、ローカルアセンブリとコールを実行する（Li et al、2012）。 Clipping Reveals STructure（CREST）は、ソフトクリップされたリードとローカルアセンブリを実行して、体系的な構造変化を識別する（Wang et al、2010）。 Targeted Iterative Graph Routing Assembler（TIGRA）は、、推定ブレークポイントからターゲットのアセンブリによってコンティグを生成する（Chen et al、2014）。さらにComplete Genomics（Carnevali et al、2012）およびFoundation Medicine（Framptonら、2013）によって独自のローカルアセンブリ方法も開発されている。

　新たに開発されたABRAというツールは、シーケンスアラインメント/マップ（SAM / BAM）ファイルを入力として受け入れ、リアライメントされたBAMファイルを出力として生成し、バリアントコールアルゴリズムやその他の下流解析の選択に柔軟性をもたらす。グローバルリアラインメントでは、アラインされていないか謝ってマップされたリードを正しいポジションに移動する。 ABRAは、元のリードのアラインメントには存在しない変異を検出し、存在する変異についてアレル頻度推定を改善する。 ABRAは、生殖細胞系および体細胞系の両方の検出を強化するために使用でき、ペアエンドおよびシングルエンドのデータでも機能する。

Githubより転載。

インストール

cent osでテストした。

依存

Building ABRA requires JDK 7, Maven and g++

本体　Github

https://github.com/mozack/abra

git clone https://github.com/mozack/abra.git 
cd abra/ 
make

ビルドに失敗したため、リリースにあるビルド済みのプログラムを使い(v. 0.97,linux 64bit)テストした。

https://github.com/mozack/abra/releases

> java -jar abra.jar

$ java -jar -Xmn4G abra.jar

Java HotSpot(TM) 64-Bit Server VM warning: MaxNewSize (4194304k) is equal to or greater than the entire heap (4194304k). A new max generation size of 4193792k will be used.

Starting 0.97 ...

Missing required input SAM/BAM file

Missing required reference

Missing required target regions

Missing required working directory

Option Description

------ -----------

--adc [Integer] Skip regions with average depth

greater than this value (default:

100000)

--aur Assemble unaligned reads (currently

disabled).

--bwa-ref BWA index prefix. Use this only if

the bwa index prefix does not match

the ref option.

--ib If specified, write intermediate data

to BAM file using the intel deflator

when available. Use this to speed

up processing.

--in Required list of input sam or bam file

(s) separated by comma

--kmer Optional assembly kmer size(delimit

with commas if multiple sizes

specified)

--lr Search for potential larger local

repeats and output to specified file

(only for multiple samples)

--mad <Integer> Regions with average depth exceeding

this value will be downsampled

(default: 250)

--mapq [Integer] Minimum mapping quality for a read to

be used in assembly and be eligible

for realignment (default: 20)

--maxn [Integer] Maximum pre-pruned nodes in regional

assembly (default: 9000)

--mbq [Integer] Minimum base quality for inclusion in

assembly. This value is compared

against the sum of base qualities

per kmer position (default: 60)

--mc-mapq [Integer] Minimum contig mapping quality

(default: 25)

--mcl [Integer] Assembly minimum contig length

(default: -1)

--mer <Double> Min edge pruning ratio. Default value

is appropriate for relatively

sensitive somatic cases. May be

increased for improved speed in

germline only cases. (default: 0.02)

--mnf <Integer> Assembly minimum node frequency

(default: 2)

--mpc [Integer] Maximum number of potential contigs

for a region (default: 5000)

--mur [Integer] Maximum number of unaligned reads to

assemble (default: 50000000)

--no-debug Throttle down debug logging

--out Required list of output sam or bam file

(s) separated by comma

--rcf <Double> Minimum read candidate fraction for

triggering assembly (default: 0.01)

--ref Genome reference location

--rna Input RNA sam or bam file (currently

disabled)

--rna-out Output RNA sam or bam file (required

if RNA input file specified)

--single Input is single end

--sv Enable Structural Variation searching

(experimental, only supported for

paired end)

--target-kmers BED-like file containing target

regions with per region kmer sizes

in 4th column

--targets BED file containing target regions

--threads <Integer> Number of threads (default: 4)

--umnf [Integer] Assembly minimum unaligned node

frequency (default: 2)

--working Working directory for intermediate

output. Must not already exist

ラン

ランには、bam以外に、target領域のbedとFASTA、bwaのindexを用意する必要がある。target領域は遺伝子のコード領域を想定している。

bwa index ref.fa

bedはBEDtoolsのsortbedを使いポジションソートしておく。

sortBed -i input.bed > sort.bed

準備できたら実行する。

java -Xmn4G -jar abra.jar --in input.bam --out output.bam \
--ref ref.fa --targets sort.bed --threads 8 --working temp_dir

--in One or more input BAMs delimited by comma
--out One or more output BAM's corresponding to the set of input BAMs
--ref BWA indexed reference genome.
--targets BED file describing target assembly regions (Usually corresponds to capture targets) with optional kmer lengths for each region. Targets must be sorted by position in ascending order within each chromosome.

--workingで指定したワーキングディレクトリは上書きできないので、ランに失敗した時はワーキングディレクトリを消してから実行する。

sortBed -i input.bed > sort.bed

準備できたら実行する。

実行後のbamのindexでエラーが出たら、一度coordinate sortし直す。

samtools sort -@ 12 realignmnet.bam > sort.bam
samtools index sort.bam

sortし直しても、リアライメントが元に戻るわけではない。

f:id:kazumaxneo:20180817133515j:plain

上半分がオリジナル、下半分がリアライメント実行後、samtoolsでsortし直した bam。102bpの挿入が起きた部位を見ている。リアライメントによりsoft clipされたリードが正しくアライメントされ、long insertが正しく表示されるようになった。

引用

ABRA: improved coding indel detection via assembly-based realignment.

Mose LE, Wilkerson MD, Hayes DN, Perou CM, Parker JS

Bioinformatics. 2014 Oct;30(19):2813-5.

追記

ABRA2も公開されています（現在はbeta）。

https://github.com/mozack/abra2

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

コード領域のリアラインメントによってバリアントコールを改善する ABRA