GraphAligner: aligning error-prone long reads to assembly graphs

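Genome graphs can represent genetic variation and sequence uncertainty. Aligning sequences to genome graphs is key to many applications, including error correction, genome assembly, and genotyping of variants in a pangenome graph. So far, however, this step has often been very slow. This work presents GraphAligner, a tool for aligning long reads to genome graphs. Compared with state-of-the-art tools, GraphAligner is 12 times faster and uses 5 times less memory, making it as efficient as aligning reads to a linear reference genome. When GraphAligner is used for error correction, it is almost 3 times more accurate and over 15 times faster than existing tools.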

Installation

Tested on Ubuntu 18.04.

GitHub

#bioconda
conda install -c bioconda graphaligner

$ GraphAligner -h

GraphAligner bioconda 1.0.12-

Mandatory parameters:
  -g [ --graph ] arg           input graph (.gfa / .vg)
  -f [ --reads ] arg           input reads (fasta or fastq, uncompressed or gzipped)
  -a [ --alignments-out ] arg  output alignment file (.gaf/.gam/.json)
  --corrected-out arg          output corrected reads file (.fa/.fa.gz)
  --corrected-clipped-out arg  output corrected clipped reads file (.fa/.fa.gz)

General parameters:
  -h [ --help ]         help message
  --version             print version
  -t [ --threads ] arg  number of threads (int) (default 1)
  --verbose             print progress messages
  --E-cutoff arg        discard alignments with E-value > arg
  --all-alignments      return all alignments instead of the best non-overlapping alignments
  --extra-heuristic     use heuristics to discard more seed hits
  --try-all-seeds       don't use heuristics to discard seed hits
  --global-alignment    force the read to be aligned end-to-end even if the alignment score is poor
  --optimal-alignment   calculate the optimal alignment (VERY SLOW)

Seeding:
  --seeds-clustersize arg                discard seed clusters with fewer than arg seeds (int)
  --seeds-extend-density arg             extend up to approximately the best (arg * sequence length) seeds (double) (-1 for all)
  --seeds-minimizer-length arg           k-mer length for minimizer seeding (int)
  --seeds-minimizer-windowsize arg       window size for minimizer seeding (int)
  --seeds-minimizer-density arg          keep approximately (arg * sequence length) least common minimizers (double) (-1 for all)
  --seeds-minimizer-ignore-frequent arg  ignore arg most frequent fraction of minimizers (double)
  --seeds-mum-count arg                  arg longest maximal unique matches fully contained in a node (int) (-1 for all)
  --seeds-mem-count arg                  arg longest maximal exact matches fully contained in a node (int) (-1 for all)
  --seeds-mxm-length arg                 minimum length for maximal unique / exact matches (int)
  --seeds-mxm-cache-prefix arg           store the mum/mem seeding index to the disk for reuse, or reuse it if it exists (filename prefix)
  -s [ --seeds-file ] arg                external seeds (.gam)
  --seeds-first-full-rows arg            no seeding, instead calculate the first arg rows fully. VERY SLOW except on tiny graphs (int)

Extension:
  -b [ --bandwidth ] arg       alignment bandwidth (int)
  -B [ --ramp-bandwidth ] arg  ramp bandwidth (int)
  -C [ --tangle-effort ] arg   tangle effort limit, higher results in slower but more accurate alignments (int) (-1 for unlimited)
  --high-memory                use slightly less CPU but a lot more memory

Preset parameters:
  -x [ --preset ] arg  Preset parameters
                       dbg - Parameters optimized for de Bruijn graphs
                       vg - Parameters optimized for variation graphs
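
The minimizer options in the Seeding section take a k-mer length and a window size. As a rough illustration of what those two numbers control, here is a minimal sketch of textbook minimizer selection in Python. It is not GraphAligner's implementation: the function name minimizers is made up for this example, the window is assumed to be counted in consecutive k-mers, and plain lexicographic order is used where real tools typically order k-mers by a hash.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) pairs picked as window minimizers.

    k is the k-mer length and w the number of consecutive k-mers per window
    (an assumption of this sketch, not taken from GraphAligner).
    In each window the lexicographically smallest k-mer is kept;
    ties are broken in favour of the leftmost position.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    chosen = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        best = min(window, key=lambda i: (kmers[i], i))
        chosen.add(best)
    return [(i, kmers[i]) for i in sorted(chosen)]


# Toy sequence, for illustration only.
for pos, kmer in minimizers("ACGTACGT", k=3, w=2):
    print(pos, kmer)
```

A larger window keeps fewer k-mers, so fewer seeds are produced; a smaller window keeps more of them.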

Test run

The inputs are the reads in fasta/fastq and the assembly graph in GFA/vg format. Here the alignments are written in GAF format; using -a aln.gam instead gives output compatible with vg.

git clone https://github.com/maickrau/GraphAligner.git
cd GraphAligner/
GraphAligner -g test/graph.gfa -f test/read.fa -a aln.gaf -x vg

-g  input graph (.gfa / .vg)
-f  input reads (fasta or fastq, uncompressed or gzipped)
-a  output alignment file (.gaf/.gam/.json)
-t  number of threads (int) (default 1)
-x  Preset parameters
    dbg - Parameters optimized for de Bruijn graphs
    vg - Parameters optimized for variation graphs
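
When the same run has to be repeated over many read files, the command can be put together from a script. The following is a minimal sketch, assuming GraphAligner is on the PATH; build_command and run are helper names invented for this example, and only the options listed above (-g, -f, -a, -x, -t) are used.

```python
import shlex
import subprocess

# Presets listed in the GraphAligner help output.
PRESETS = {"dbg", "vg"}


def build_command(graph, reads, out, preset="vg", threads=1):
    """Build the GraphAligner argument list (hypothetical helper)."""
    if preset not in PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "GraphAligner",
        "-g", graph,
        "-f", reads,
        "-a", out,
        "-x", preset,
        "-t", str(threads),
    ]


def run(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit."""
    subprocess.run(cmd, check=True)


# Print the command for the test data instead of running it.
cmd = build_command("test/graph.gfa", "test/read.fa", "aln.gaf")
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names; call run(cmd) to actually start the alignment.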

"-x vg" is the parameter preset for aligning reads to variation graphs, and "-x dbg" is the preset for aligning reads to de Bruijn graphs.
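
The aln.gaf written by the test run is a tab-separated text file. GAF has 12 mandatory columns (read name, length, start, end, strand, path, path length, path start, path end, matches, alignment block length, mapping quality) followed by optional typed tags. The following is a minimal sketch for reading it in Python; GafRecord, parse_gaf_line and read_gaf are names made up for this example, and the sample line is synthetic, not real GraphAligner output.

```python
import re
from typing import NamedTuple

# One path step: orientation ('>' forward, '<' reverse) plus node name.
STEP = re.compile(r"([<>])([^<>]+)")


class GafRecord(NamedTuple):
    read_name: str
    read_length: int
    read_start: int
    read_end: int
    strand: str
    path: str          # raw path column
    steps: list        # [(orientation, node), ...]; empty if the path is unoriented
    path_length: int
    path_start: int
    path_end: int
    matches: int
    block_length: int
    mapq: int
    tags: dict


def parse_gaf_line(line):
    """Parse one GAF line into a GafRecord (hypothetical helper)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 12:
        raise ValueError("GAF needs at least 12 columns")
    tags = {}
    for tag in f[12:]:
        key, kind, value = tag.split(":", 2)
        tags[key] = int(value) if kind == "i" else float(value) if kind == "f" else value
    return GafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5],
                     STEP.findall(f[5]), int(f[6]), int(f[7]), int(f[8]),
                     int(f[9]), int(f[10]), int(f[11]), tags)


def read_gaf(path):
    """Yield a GafRecord for every non-empty line of a GAF file."""
    with open(path) as handle:
        for line in handle:
            if line.strip():
                yield parse_gaf_line(line)


# Synthetic example line, for illustration only.
example = "read1\t100\t0\t100\t+\t>1>2<3\t150\t10\t110\t95\t100\t60\tNM:i:5"
rec = parse_gaf_line(example)
print(rec.read_name, rec.steps, rec.matches / rec.block_length)
```

Dividing the matches column by the alignment block length gives a quick per-alignment identity estimate, which is handy for filtering, e.g. for rec in read_gaf("aln.gaf").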

Citation

Bit-parallel sequence-to-graph alignment
Mikko Rautiainen, Veli Mäkinen, Tobias Marschall
Bioinformatics, Volume 35, Issue 19, 1 October 2019, Pages 3599–3607