Pacbioのロングリードアライナー HISEA - macでインフォマティクス

　デノボゲノムアセンブリは、リファレンスゲノムを使用せずにシーケンシングリードから生物のゲノム全体を再構築する。ハイスループットのNGS技術は、微生物および真核生物ゲノムの反復領域の大部分よりもはるかに小さい、数百塩基対の短いリードを生成する。リード長よりも長いリピート領域は、ゲノムアセンブリアルゴリズムにとって重大な問題を提起する。リード長とリピート長のこの不均衡は、アセンブリアルゴリズムの複雑さと処理要件を増加させている。これがNGSデータを使用する多くのアセンブリが断片化され、不完全であり、しばしば下流の分析には役立たない理由である[論文より ref.1]。

　Pacific BiosciencesのSMRTシークエンシング技術の出現により、研究者は新しい視点でゲノムアセンブリの問題を調べるよう促された。多くのリピート領域にわたる長いリードは、かなり優れたアセンブリの作成を可能にする。 SMRT技術は、従来のNGS技術よりもバイアスが少ない[ref.2]。しかしながら、SMRTシーケンシングの2つの重要な欠点は、10-15％の高いエラー率、および高コストである。比較すると、イルミナ技術はエラーレートが100倍低く、Gbpあたりのコストの点で100倍以上安価である[ref.3]（この論文の執筆時点）。肯定的な面では、エラーはランダムであり、シーケンシングデータのカバレッジを増加させることによってアルゴリズムを訂正することが可能であることが判明した[ref.4]。このように、SMRTシークエンシングは、以前の技術で達成されたものよりも、より連続的かつ高品質のゲノムアセンブリを作製することを可能にする。

　公開されたSMRTのゲノムアセンブリパイプラインの大部分[ref.5〜7]において、crucialなステップは、all-vs-allのrawリードのアラインメントを見つけるステップである。このステップの結果は、後続のステップの処理およびアセンブリパイプラインの全体的な結果に大きな影響を与える可能性がある。したがって高感度のアライナーを使用することが不可欠である。

　著者らは既存のものよりもはるかに高感度な新しいロングリードアライナーHISEAを発表した。最近、Korenらは、 [6]は、CanuアセンブラがSPAdes [12]で生成された150xカバレッジハイブリッドアセンブリに匹敵する20xカバレッジを使用してアセンブリを生成できることを示した。また、50倍のカバレッジで最大のアセンブリcontiguityを達成できることも示している。 Korenらによって示されているように[ref.6]、Canuのパイプラインは現在最高のパフォーマンスを示す。しかしMHAPアライナー[ref.11]を使用しているので、MHAPの代わりにCauにHISEAを組み込んだ。 E.coli、S.cerevisiae、C.elegans、A.thaliana、D.melanogasterの2つのパイプライン（Canu + MHAPとCanu + HISEA）を、30倍と50倍の2つのレベルで比較した。 HISEAを使用したパイプラインは、両方のカバレッジレベルでより良いアセンブリを生成することが示された。さらに、Canx + HISEAアセンブリの30xカバレッジはCanu + MHAPの50xカバレッジと同等である。また、他のロングリードアライナーであるHISEAをBLASR [ref.8]、DALIGNER [ref.9]、GraphMap [ref.10]、MHAP [ref.11]、およびMiniMap [ref.5]と、アライナーの感度を比較している。

公式ページ

http://www.csd.uwo.ca/faculty/ilie/HISEA/

インストール

cent OSに導入した。

Github (Canu)

https://github.com/lucian-ilie/Canu_HISEA

canuをビルドすると導入される。

git clone https://github.com/lucian-ilie/Canu_HISEA.git 
cd Canu_HISEA/src/
make -j 20
cd ../Linux-amd64/bin/
./

$ ./hisea --help

USAGE: hisea [--self] [--kmerLen <int>] [--minStoreLen <int>] [--minOlapLen <int>] --ref <file/directory> --query <file/directory>

HISEA is an efficient all-vs-all long read alignment program.

It is designed and tested for SMRT sequecning technology, but

should work well for Oxford Nanopore as well.

Options:

--help no_arg Print this help message

--self no_arg Align set of reads with itself. Use only --ref option

--ref <file/dir> The name of the reference fasta/fastq file or directory

containing these files

--query <file/dir> The name of the query fasta/fastq file or directory

containing these files

--ignore <file> The file containing kmers to be ignored

--index_write <dir> The directory where index is stored

--index_read <dir> The directory from whcih index is read

--kmerLen <int> This is the kmer length used for initial hashing.

The possible values are 10-20, both inclusive

default=16

--smallkmerLen <int> This is the kmer length used during alignment extension.

The possible values are 10-20, both inclusive

default=12

--filterCount <int> This is used for initial filtering. This must be set to 1,

if index is created with split set of reads.

default=2

--threads <int> Number of threads for parallel run

default=1

--minOlapLen <int> Minimum Overlap length

default=100

--minStoreLen <int> Minimum read length. Overlap of two smaller reads is ignored

default=100

--minMatches <int> Minimum number of matches to be considered for alignment

default=3

--maxKmerHits <int> The maximum number of repeat kmers

default=10000

--errorRate <float> Error rate. Valid values are 0-1 percent

default=0.15

--maxShift <float> The value of shift to accomodate indels. Valid values

are 0-1 percent

default=0.20

パスを通しておく。

HISEAのgithubページはこちら

https://github.com/lucian-ilie/HISEA

ラン

hisea --self --kmerLen 16 --minStoreLen 100 --minOlapLen 100 --ref reerence.fa --query pacbio.fq --threads 10 > output

--self no_arg Align set of reads with itself. Use only --ref option
--kmerLen <int> This is the kmer length used for initial hashing. The possible values are 10-20, both inclusive default=16
--minOlapLen <int> Minimum Overlap length default=100
--threads <int> Number of threads for parallel run default=1
--minStoreLen <int> Minimum read length. Overlap of two smaller reads is ignored default=100
--ref <file/dir> The name of the reference fasta/fastq file or directory containing these files
--query <file/dir> The name of the query fasta/fastq file or directory containing these files
--threads <int> Number of threads for parallel run default=1