Vulcan: improved long-read mapping for structural variant calling

2021-06-04: title corrected

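Long-read sequencing has made it possible to survey structural variation across the entire human genome at an unprecedented level. To exploit this potential, new mapping methods have appeared that focus mainly on either speed or accuracy. Widely used read mappers (minimap2 and NGMLR) implement various heuristics and scoring schemes to optimize speed or accuracy, but their performance differs across genomic regions and for particular structural variants. The authors hypothesize that constraining read mapping to a single gap penalty across distinct mutation hotspots reduces read alignment accuracy and impedes structural variant detection.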

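To test this hypothesis, the authors implemented a read-mapping pipeline called Vulcan. Vulcan uses two different gap penalty modes: it takes the normalized edit distance of reads aligned by a fast mapper such as minimap2 to identify poorly aligned reads, then realigns those reads with a more accurate but computationally more expensive long-read mapper (NGMLR). In support of the hypothesis, Vulcan improved the alignment of Oxford Nanopore Technology (ONT) long reads on both simulated and real datasets. These improvements in turn raised structural variant calling performance on human genome datasets compared with using either read mapper alone. Vulcan is the first long-read mapping framework to combine two different gap penalty modes, improving both recall and precision of structural variant calling. Vulcan is open source under the MIT license and available at https://gitlab.com/treangenlab/vulcan.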
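The read-selection idea described above can be illustrated with a small sketch. This is not Vulcan's actual code: the function names are hypothetical, and it assumes that "normalized edit distance" means edit distance divided by alignment length and that reads strictly above the percentile cut-off (90 by default, per the help text) are the ones handed to NGMLR.

```python
from typing import Dict, List, Tuple


def normalized_edit_distance(edit_distance: int, aligned_length: int) -> float:
    """Edit distance divided by alignment length (assumed normalization)."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return edit_distance / aligned_length


def percentile(values: List[float], pct: float) -> float:
    """Percentile using linear interpolation between the closest ranks."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def split_reads(
    alignments: Dict[str, Tuple[int, int]], pct: float = 90.0
) -> Tuple[List[str], List[str], float]:
    """Split reads into (kept, to_remap, cutoff).

    `alignments` maps read name -> (edit distance, alignment length),
    e.g. values taken from a first-pass mapper's output.
    """
    scores = {
        name: normalized_edit_distance(nm, length)
        for name, (nm, length) in alignments.items()
    }
    cutoff = percentile(list(scores.values()), pct)
    kept = sorted(name for name, s in scores.items() if s <= cutoff)
    to_remap = sorted(name for name, s in scores.items() if s > cutoff)
    return kept, to_remap, cutoff


if __name__ == "__main__":
    # Toy data: nine well-aligned reads and one poorly aligned read.
    toy = {f"read{i}": (10 * i, 1000) for i in range(1, 10)}
    toy["read10"] = (400, 1000)
    kept, to_remap, cutoff = split_reads(toy, pct=90.0)
    print("cutoff:", cutoff)
    print("reads to realign with the slower mapper:", to_remap)
```

In this toy input only the read with the clearly higher normalized edit distance falls above the cut-off, which mirrors the intent of realigning just the poorly aligned fraction with the more expensive mapper.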

Stoked to share a new bioinformatics tool from my 2nd year PhD student Yilei Fu (@fuyilei96) @RiceCompSci called Vulcan

Great collaboration w/ @sedlazeck & @MedhatHelmy7

A short thread on our long-read mapping pipeline that melds minimap2 & NGMLR.
https://t.co/khpArSAajR

1/n
— Todd J Treangen (@traingene) June 2, 2021

Installation

I installed it into a conda virtual environment using mamba (Python 3.8).

GitLab

mamba create -n vulcan -y
conda activate vulcan 
mamba install -c bioconda vulcan

> vulcan

vulcan: map long reads and prosper🖖, a long read mapping pipeline that melds minimap2 and NGMLR

optional arguments:
  -h, --help            show this help message and exit
  -w WORK_DIR, --work_dir WORK_DIR
                        Directory of work, store temp files, default: ./vulcan_work
  -t THREADS, --threads THREADS
                        threads, default: 1
  -p PERCENTILE [PERCENTILE ...], --percentile PERCENTILE [PERCENTILE ...]
                        percentile of cut-off, default: 90
  -f, --full            keep all temp file
  -d, --dry             only generate config
  -R, --raw_edit_distance
                        Use raw edit distance to do the cut-off
  -clr, --pacbio_clr    Input reads is pacbio CLR reads
  -hifi, --pacbio_hifi  Input reads is pacbio hifi reads
  -ont, --nanopore      Input reads is Nanopore reads
  -any, --anylongread   Don't know which kind of long read
  -hclr, --humanclr     Human pacbio CLR read
  -hhifi, --humanhifi   Human pacbio hifi reads
  -hont, --humannanopore
                        Human Nanopore reads
  -cmd, --custom_cmd    Use minimap2 and NGMLR with user's own parameter setting

Required arguments::
  -i INPUT [INPUT ...], --input INPUT [INPUT ...]
                        input read path, can accept multiple files
  -r REFERENCE, --reference REFERENCE
                        reference path
  -o OUTPUT, --output OUTPUT
                        vulcan's output's prefix, the output will be prefix_{percentile}.bam

Test run

Specify the reference genome and the sequencing reads.

git clone https://gitlab.com/treangenlab/vulcan.git
cd vulcan/
vulcan -r test/GCF_000146045.2_R64_genomic.fna -i test/test_reads.fa -w test/ -o output -t 12

-i input read path, can accept multiple files
-r reference path
-o vulcan's output's prefix, the output will be prefix_{percentile}.bam
-w Directory of work, store temp files, default: ./vulcan_work
-t threads, default: 1
-p percentile of cut-off, default: 90

A coordinate-sorted output_90.bam is written.
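For scripted runs, the command line can be assembled programmatically. The sketch below uses only the flags listed in the help text (-r, -i, -o, -w, -t, -p); the helper names are hypothetical, and the assumption that each percentile passed to -p yields its own prefix_{percentile}.bam file is inferred from the help text rather than verified.

```python
from typing import List, Sequence


def build_vulcan_command(
    reference: str,
    reads: Sequence[str],
    output_prefix: str,
    threads: int = 1,
    percentiles: Sequence[int] = (90,),
    work_dir: str = "./vulcan_work",
) -> List[str]:
    """Build an argument list for the vulcan CLI (hypothetical helper)."""
    if not reads:
        raise ValueError("at least one read file is required")
    if threads < 1:
        raise ValueError("threads must be >= 1")
    return [
        "vulcan",
        "-r", reference,
        "-i", *reads,                       # -i accepts multiple files
        "-o", output_prefix,
        "-w", work_dir,
        "-t", str(threads),
        "-p", *[str(p) for p in percentiles],
    ]


def expected_outputs(output_prefix: str, percentiles: Sequence[int] = (90,)) -> List[str]:
    """Output names following the documented prefix_{percentile}.bam pattern."""
    return [f"{output_prefix}_{p}.bam" for p in percentiles]


if __name__ == "__main__":
    cmd = build_vulcan_command(
        reference="test/GCF_000146045.2_R64_genomic.fna",
        reads=["test/test_reads.fa"],
        output_prefix="output",
        threads=12,
        work_dir="test/",
    )
    print(" ".join(cmd))
    print(expected_outputs("output"))
    # To actually run it (requires vulcan on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file paths contain spaces.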

Citation

Vulcan: Improved long-read mapping and structural variant calling via dual-mode alignment
Yilei Fu, Medhat Mahmoud, Viginesh Vaibhav Muraliraman, Fritz J. Sedlazeck, Todd J. Treangen

bioRxiv, Posted May 30, 2021.