lordFAST, a mapping tool for long reads

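High-throughput sequencing (HTS) technologies have kept evolving since their inception (Margulies et al., 2005). Single-molecule sequencing (SMS), such as Pacific Biosciences (Eid et al., 2009; Korlach et al., 2010) and Oxford Nanopore (Cherf et al., 2012; Manrao et al., 2012; Eisenstein, 2012), is a breakthrough in this evolution. Next-generation sequencing (NGS) technologies have proven their power in detecting genetic variation (1000 Genomes Project Consortium, 2010, 2012), studying disease (O'Roak et al., 2011), and building de novo genome assemblies (Gnerre et al., 2011), but computational analysis based on NGS data is still imperfect because the reads are short relative to the regions, such as repeats, that they need to span (Hormozdiari et al., 2009; Alkan et al., 2011).

The longer reads produced by SMS technologies offer a solution to many open problems in genomics. Unfortunately, SMS technologies have a higher error rate (15% versus 1%) and a different error profile (insertions and deletions rather than substitutions), which makes them hard to use in standard genome analysis pipelines. This has driven the development of new algorithms and pipelines specialized for SMS technologies that can handle very long, error-prone reads for a variety of applications: de novo assembly (Koren et al., 2013; Chin et al., 2013a; Berlin et al., 2015; Loman et al., 2015), hybrid de novo assembly (Koren et al., 2012; Goodwin et al., 2015), scaffold gap filling (English et al., 2012), genome finishing (Bashir et al., 2012; Chin et al., 2013b; Brown et al., 2014), reconstruction of GC-rich and complex regions (Shin et al., 2013; Huddleston et al., 2014; Scott and Ely, 2014), structural variation (SV) detection (Doi et al., 2014; Ummat and Bashir, 2014; Huddleston et al., 2017; Fan et al., 2017; Chaisson et al., 2015), haplotype phasing (Pendleton et al., 2015; Chaisson et al., 2017), and detection of methylation sites (Simpson et al., 2017; Rand et al., 2017).

The first step of most downstream analysis pipelines is mapping the reads to a reference genome. For Illumina short reads with a low error rate, it is usually possible to find a "long" substring that exactly matches the read's mapping locus on the reference genome. All existing short-read mappers build on this basic observation. They include (i) methods based on the BW transform / FM index (Burrows and Wheeler, 1994; Ferragina and Manzini, 2000) (Li and Durbin, 2009; Langmead and Salzberg, 2012; Li et al., 2009), (ii) methods based on substring hashing (Alkan et al., 2009; Xin et al., 2013; Hach et al., 2010, 2014; Weese et al., 2012; David et al., 2011; Lin et al.), and hybrid methods that combine the FM index with hashing (Siragusa et al., 2013; Marco-Sola et al., 2012). Unfortunately, this observation does not hold for SMS technologies: at the time the paper was written, error rates reached up to 20% for PacBio (Travers et al., 2010; Thompson and Milos, 2011) and up to 40% for Oxford Nanopore (Goodwin et al., 2015). Moreover, even when the mapping locus of a read is found correctly, it is very hard to distinguish sequencing errors from true genomic variants. Several methods exist for mapping long reads to a reference genome.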
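BLASR (Chaisson and Tesler, 2012) was the first tool designed specifically for PacBio reads. It uses a suffix array index to find all sufficiently long exact matches between the long read and the reference genome. The matches are then grouped into clusters and ranked. The top-scoring clusters, which correspond to candidate genomic locations, are passed to sparse dynamic programming (SDP) followed by banded alignment. BWA-MEM (Li, 2013) is another mapper, originally designed to align short reads and assembled contigs to a reference genome. It has been extended to map long SMS reads by adjusting its alignment parameters (with the -x pacbio or -x ont2d option). BWA-MEM finds, as initial matches, the longest exact match covering each query position, chains these matches, ranks them by chain length, and finally extends the initial matches subject to a score cutoff to obtain full alignments. Another tool, rHAT (Liu et al., 2016), is a hash-table-based mapper that uses a heuristic to estimate the approximate mapping location of each read. It does so by finding candidate mapping regions on the reference genome for the middle 1000 bp segment of the long read through an approximate k-mer counting scheme. Then, for each candidate region, a lookup table is built to find short seeds, and these seeds are chained with an SDP-based heuristic. The final alignment is formed from the selected chain. The fourth tool, GraphMap (Sovic et al., 2016), uses gapped spaced seeds and clusters them to obtain an approximate alignment. It then finds exact paths in an "alignment graph" built from short k-mers of the target, constructs alignment anchors, and refines the chain to produce the final alignment. Another tool, LAMSA (Liu et al., 2017), splits the long read into several "fragments" and uses the GEM mapper (Marco-Sola et al., 2012) to find all of their approximate matches on the reference genome. Recently, two new mappers, NGMLR (Sedlazeck et al., 2018) and Minimap2 (Li, 2018), were published. (The rest is omitted.)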
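Several of the mappers above chain exact matches (anchors) into a colinear set before aligning. The following Python sketch shows one simple way to do that with a quadratic-time dynamic program. It is an illustration only and not the algorithm of any of these tools: the function name `chain_anchors`, the `(read_pos, ref_pos, length)` anchor format, and the scoring (sum of anchor lengths, no gap penalty) are all assumptions made for this example.

```python
def chain_anchors(anchors):
    """Pick the colinear, non-overlapping chain with the largest total length.

    anchors: list of (read_pos, ref_pos, length) exact matches.
    Returns (best_score, chain) where chain is a list of anchors.
    Hypothetical helper; scoring is deliberately simplistic.
    """
    if not anchors:
        return 0, []

    order = sorted(anchors)              # sort by read position
    n = len(order)
    score = [a[2] for a in order]        # best chain score ending at anchor i
    prev = [-1] * n                      # back-pointer for traceback

    for i in range(n):
        read_i, ref_i, len_i = order[i]
        for j in range(i):
            read_j, ref_j, len_j = order[j]
            # Anchor j must end before anchor i starts on both sequences.
            fits = read_j + len_j <= read_i and ref_j + len_j <= ref_i
            if fits and score[j] + len_i > score[i]:
                score[i] = score[j] + len_i
                prev[i] = j

    best = max(range(n), key=score.__getitem__)
    chain = []
    i = best
    while i != -1:                       # walk the back-pointers
        chain.append(order[i])
        i = prev[i]
    chain.reverse()
    return score[best], chain


# Three anchors on one diagonal plus one stray anchor elsewhere on the reference.
example = [(0, 100, 14), (20, 120, 15), (40, 50, 20), (60, 160, 14)]
best_score, best_chain = chain_anchors(example)
print(best_score, best_chain)
```

A practical chainer would also penalize large gaps between neighboring anchors and would avoid the quadratic loop for reads with many anchors.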

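This paper introduces lordFAST, a new long-read mapper designed specifically for PacBio Continuous Long Reads (CLR). lordFAST is a highly efficient and sensitive aligner that tolerates the high sequencing error rate observed in CLR reads by using multiple short exact matches. On PacBio datasets, lordFAST not only maps more reads than available alternatives such as BLASR and BWA-MEM but also maps them more accurately. Because the error models are somewhat similar, lordFAST can also align reads generated by Oxford Nanopore Technology. The authors' experiments show that Minimap2 is the fastest of the mappers above. lordFAST is second in speed and achieved the highest sensitivity and precision on simulated data, in particular because it had the highest number of correctly mapped bases among all mappers tested.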
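To see why short exact matches matter at high error rates, here is a back-of-the-envelope sketch in Python. It assumes that every base is hit by an error independently with a fixed probability, which is a simplification chosen for this example and not an error model from the lordFAST paper. The function name is hypothetical. The value 14 is the default of --minAnchorLen in the help output of lordfast; the other lengths are arbitrary.

```python
def error_free_probability(k, error_rate):
    """Probability that k consecutive bases contain no sequencing error,
    assuming independent per-base errors (a simplifying assumption)."""
    return (1.0 - error_rate) ** k


# Compare a short-read-like error rate with an SMS-like error rate.
for label, rate in [("1% error", 0.01), ("15% error", 0.15)]:
    for k in (14, 30, 50):
        p = error_free_probability(k, rate)
        print(f"{label}, k={k}: P(error-free) = {p:.4f}")
```

Under this toy model the chance of an error-free stretch drops quickly with its length when the error rate is 15%, which is why a mapper for noisy reads cannot rely on one long exact match and instead has to combine many short ones.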

Installation

Tested on macOS 10.13 in an anaconda2-4.3.0 environment.

Dependencies

GCC ≥ 4.4.7
zlib

Source: GitHub

Download the source and build it, or download a prebuilt binary.

git clone https://github.com/vpc-ccg/lordfast.git
cd lordfast/
make

# In an Anaconda environment, install with conda
conda install -c bioconda lordfast

$ ./lordfast -h

lordFAST(1)                    lordfast Manual                    lordFAST(1)

NAME
       lordfast

DESCRIPTION
       lordFAST is a sensitive tool for mapping long reads with high error
       rates. lordFAST is specially designed for aligning reads from PacBio
       sequencing technology but provides the user the ability to change
       alignment parameters depending on the reads and application.

INSTALLATION
       To install lordFAST, please download the source code from
       https://github.com/vpc-ccg/lordfast
       or alternatively clone the repository by running the following command:
       $ git clone https://github.com/vpc-ccg/lordfast.git
       Now the code can be compiled easily by running "make" command line
       which builds the binary file "lordfast".
       $ cd lordfast
       $ make

SYNOPSIS
       lordfast --index FILE [OPTIONS]
       lordfast --search FILE --seq FILE [OPTIONS]

OPTIONS
   Indexing options
       -I, --index STR
              Path to the reference genome file in FASTA format which is
              supposed to be indexed. [required]

   Mapping options
       -S, --search STR
              Path to the reference genome file in FASTA format. [required]

       -s, --seq STR
              Path to the file containing read sequences in FASTA/FASTQ
              format. [required]

       -o, --out STR
              Write output to STR file rather than standard output. [stdout]

       -t, --threads INT
              Use INT number of CPU cores. Pass 0 to use all the available
              cores. [1]

   Advanced options
       -k, --minAnchorLen INT
              Minimum required length of anchors to be considered. [14]

       -n, --numMap INT
              Perform alignment for at most INT candidates. [10]

       -l, --minReadLen INT
              Do not try to map any read shorter than INT bp and report them
              as unmapped. [1000]

       -c, --anchorCount INT
              Consider INT anchoring positions on the long read. [1000]

       -m, --maxRefHit INT
              Ignore anchoring positions with more than INT reference hits.
              [1000]

       -a, --chainAlg INT
              Chaining algorithm to use. Options are "dp-n2" and "clasp".
              [dp-n2]

       --noSamHeader
              Do not print sam header in the output.

   Other options
       -h, --help
              Print this help file.

       -v, --version
              Print the version of software.

EXAMPLES
       Indexing the reference genome:
       $ ./lordfast --index gen.fa

       Mapping to the reference genome:
       $ ./lordfast --search gen.fa --seq reads.fastq > map.sam
       $ ./lordfast --search gen.fa --seq reads.fastq --threads 4 > map.sam

BUGS
       Please report the bugs through lordfast's issues page at
       https://github.com/vpc-ccg/lordfast/issues

CONTACT
       Ehsan Haghshenas (ehaghshe@sfu.ca)

       This software is released under GNU General Public License (v3.0)

lordFAST                  Last Updated: June 26, 2018             lordFAST(1)
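According to the help text, reads shorter than --minReadLen are reported as unmapped, so it can be useful to count mapped and unmapped records in the SAM output. The sketch below is a minimal Python example that relies only on the standard SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary). The function name `summarize_sam` is hypothetical, and whether lordFAST emits secondary or supplementary records is not stated in the help text, so that category is included only as a precaution.

```python
def summarize_sam(lines):
    """Count SAM records by mapping status using the FLAG column.

    lines: any iterable of SAM text lines (for example an open file).
    Header lines starting with '@' and empty lines are skipped.
    """
    counts = {"mapped": 0, "unmapped": 0, "secondary_or_supplementary": 0}
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])                 # column 2 of SAM is FLAG
        if flag & 0x4:                        # read unmapped
            counts["unmapped"] += 1
        elif flag & (0x100 | 0x800):          # secondary or supplementary
            counts["secondary_or_supplementary"] += 1
        else:                                 # primary alignment
            counts["mapped"] += 1
    return counts


# Tiny made-up records for illustration; real input would be map.sam.
demo = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
]
print(summarize_sam(demo))
```

For a real run, the function can be called on an open file, for example `with open("map.sam") as handle: print(summarize_sam(handle))`.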

Usage

index

lordfast --index input.fa

--index Path to the reference genome file in FASTA format which is supposed to be indexed

mapping

lordfast --search input.fa -s reads.fq -t 8 > map.sam

--search Path to the reference genome file in FASTA format
-t Use INT number of CPU cores. Pass 0 to use all the available cores. [1]
-s Path to the file containing read sequences in FASTA/FASTQ format
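The two steps above (index, then map) can also be driven from a script. The following Python sketch builds both command lines using only options that appear in the help output (--index, --search, --seq, --threads, --out). The helper names `build_lordfast_commands` and `run_lordfast` are hypothetical, and the sketch assumes the binary is called `lordfast` and is on PATH, as it would be after a conda install.

```python
import shutil
import subprocess


def build_lordfast_commands(reference, reads, out_sam, threads=1, binary="lordfast"):
    """Return (index_cmd, map_cmd) as argument lists for subprocess."""
    index_cmd = [binary, "--index", reference]
    map_cmd = [
        binary,
        "--search", reference,
        "--seq", reads,
        "--threads", str(threads),
        "--out", out_sam,        # write SAM to a file instead of stdout
    ]
    return index_cmd, map_cmd


def run_lordfast(reference, reads, out_sam, threads=1, binary="lordfast"):
    """Index the reference, then map the reads; raises if a step fails."""
    if shutil.which(binary) is None:
        raise FileNotFoundError(f"{binary} not found on PATH")
    for cmd in build_lordfast_commands(reference, reads, out_sam, threads, binary):
        subprocess.run(cmd, check=True)


# Only print the commands here; call run_lordfast(...) to execute them.
index_cmd, map_cmd = build_lordfast_commands("input.fa", "reads.fq", "map.sam", threads=8)
print(" ".join(index_cmd))
print(" ".join(map_cmd))
```

Passing the arguments as a list avoids shell quoting problems with file names, and --out replaces the shell redirection used in the command above.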

Citation

lordFAST: sensitive and Fast Alignment Search Tool for LOng noisy Read sequencing Data

Ehsan Haghshenas, S Cenk Sahinalp, Faraz Hach

Bioinformatics (2018) DOI: 10.1093/bioinformatics/bty544
