rHAT, an aligner for PacBio long reads - macでインフォマティクス

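In single-molecule real-time (SMRT) sequencing, aligning noisy long reads to a reference genome remains a costly task, and new approaches are needed to improve the efficiency and effectiveness of SMRT read alignment. The authors propose rHAT, a seed-and-extend read alignment method designed specifically for error-prone long reads. rHAT indexes the reference genome with a regional hash table (RHT), a hash-table-based index that describes the short tokens within local windows of the reference genome. In the seeding phase, rHAT uses the RHT to efficiently count short token matches between part of the read and the genome, and from those counts finds the most likely candidate sites. In the extension phase, a heuristic approach based on sparse dynamic programming (SDP) reduces the cost of aligning the read to the candidate sites. Benchmarks on real and simulated datasets from various prokaryotic and eukaryotic genomes showed that rHAT aligns SMRT reads with high throughput.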
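To make the seeding idea more concrete, here is a toy sketch of window-based k-mer hit counting. It is not rHAT's actual RHT implementation: the function names, the fixed non-overlapping windows, and the window size of 256 are all assumptions chosen for illustration. It only shows the general principle of counting short token matches per reference window and keeping the best-scoring windows as candidates.

```python
import random
from collections import Counter, defaultdict


def build_window_index(reference, k=13, window=256):
    """Map each k-mer to the set of window ids in which it starts.

    Toy stand-in for a regional hash index: windows are fixed-size and
    non-overlapping (an assumption, not rHAT's real layout).
    """
    index = defaultdict(set)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].add(start // window)
    return index


def candidate_windows(read, index, k=13, top=5):
    """Count k-mer hits per window and return the `top` best windows.

    Returns a list of (window_id, hit_count) pairs, best first.
    """
    hits = Counter()
    for i in range(len(read) - k + 1):
        for window_id in index.get(read[i:i + k], ()):
            hits[window_id] += 1
    return hits.most_common(top)


# Small demo on a random reference with a slightly "noisy" read.
rng = random.Random(0)
reference = "".join(rng.choice("ACGT") for _ in range(2000))
read = list(reference[700:900])
for pos in (50, 120):  # introduce two substitutions to mimic sequencing errors
    read[pos] = "A" if read[pos] != "A" else "C"
read = "".join(read)

index = build_window_index(reference)
print(candidate_windows(read, index))
```

The read spans two adjacent windows here, so both are expected to collect hits. A real aligner would then run the (much more expensive) extension step only on these few candidates.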

Installation

I installed it on CentOS.

GitHub

GitHub - hitbc/rHAT: Alignment tool for noisy long reads

git clone https://github.com/hitbc/rHAT.git 
cd rHAT/src/
make

> ./rHAT-indexer

$ ./rHAT-indexer
Program: rHAT-indexer
Version: 0.1.1
Contact: <ydwang@hit.edu.cn>
Usage:   rHAT-indexer [Options] <HashIndexDir> <Reference>
         <HashIndexDir>  The directory storing RHT index
         <Reference>     Sequence of reference genome, in FASTA format
Options: -k, --kmer-size <int>  the size of the k-mers extracted from reference genome for indexing [13]
         -h, --help             help

> ./rHAT-aligner

$ ./rHAT-aligner
Program: rHAT-aligner
Version: 0.1.1
Contact: <ydwang@hit.edu.cn>
Usage:   rHAT-aligner [Options] <HashIndexDir> <ReadFile> <Reference>
         <HashIndexDir>  The directory storing RHT index
         <ReadFile>      Reads file, in FASTQ/FASTA format
         <Reference>     Sequence of reference genome, in FASTA format
Options: -w, --window-hits <int>    the max allowed number of windows hitting by a k-mer [1000]
         -m, --candidates <int>     the number of candidates for extension [5]
         -k, --kmer-size <int>      the size of the k-mers for generating short token matches [13]
         -a, --match <int>          score of match for the alignments in extension phase [2]
         -b, --mismatch <int>       mismatch penalty for the alignments in extension phase [5]
         -q, --gap-open <int>       gap open penalty for the alignments in extension phase [2]
         -r, --gap-extension <int>  gap extension penalty for the alignments in extension phase [1]
         -l, --local-kmer <int>     the minimum length of the local matches used for SDP [11]
         -t, --threads <int>        number of threads [1]
         -h, --help                 help

Copy the binaries to a directory on your PATH.

Run

1. Create the index.

mkdir index  # create the output directory
rHAT-indexer -k 13 index/ input.fa

-k The size of the k-mers extracted from the reference genome for indexing [13].

2. Align the reads to the reference.

rHAT-aligner index/ long_reads.fq input.fa > output.sam
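Once a SAM file has been produced, a quick sanity check is to count how many reads were actually placed. The sketch below is a generic SAM reader, not part of rHAT; the function name `count_mapped` is hypothetical, and it assumes the aligner sets the standard SAM FLAG bits (0x4 unmapped, 0x100 secondary, 0x800 supplementary), which I have not verified for rHAT's output.

```python
def count_mapped(sam_lines):
    """Return (mapped, unmapped) counts over SAM alignment records.

    Header lines (starting with '@') and blank lines are skipped.
    Secondary (0x100) and supplementary (0x800) records are ignored so
    that each mapped read is counted once via its primary record.
    """
    mapped = unmapped = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        flag = int(line.rstrip("\n").split("\t")[1])  # FLAG is the 2nd column
        if flag & 0x4:
            unmapped += 1
        elif not flag & 0x900:
            mapped += 1
    return mapped, unmapped


# Toy records for illustration (not real rHAT output).
toy_sam = [
    "@SQ\tSN:chr1\tLN:1000",
    "r1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\t*",
    "r3\t16\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\t*",
    "r3\t2064\tchr1\t500\t60\t4M\t*\t0\t0\tACGT\t*",
]
print(count_mapped(toy_sam))
```

For a real run, the same function can be given an open file handle, for example `with open("output.sam") as fh: print(count_mapped(fh))`.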

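I built both version 0.1.0 and version 0.1.1 with make, following the makefile, but rHAT-aligner crashed with a segmentation fault in both cases. I will update this post if that gets resolved.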
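Because a crash like this can leave a truncated or empty SAM file behind, it may help to check how the process ended when scripting the run. The following is a minimal sketch, assuming a POSIX system, where Python's subprocess module reports a process killed by signal N as return code -N. The helper names `describe_exit` and `run_aligner` are hypothetical; the command line simply mirrors the usage shown in the help text above.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable status."""
    if returncode == 0:
        return "ok"
    if returncode < 0:
        # Negative values mean the child was terminated by a signal (POSIX).
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exit status {returncode}"


def run_aligner(index_dir, reads, reference, out_sam, binary="rHAT-aligner"):
    """Run the aligner, write its stdout to out_sam, and report how it ended."""
    cmd = [binary, index_dir, reads, reference]
    with open(out_sam, "w") as out:
        proc = subprocess.run(cmd, stdout=out)
    return describe_exit(proc.returncode)


# Example status strings, without running the aligner itself.
print(describe_exit(0))
print(describe_exit(-signal.SIGSEGV))
```

With the binaries on the PATH, `run_aligner("index/", "long_reads.fq", "input.fa", "output.sam")` would return "killed by SIGSEGV" for the crash described above.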

Citation

rHAT: fast alignment of noisy long reads with regional hashing.

Liu B, Guan D, Teng M, Wang Y.

Bioinformatics. 2016 Jun 1;32(11):1625-31.