2019-09-16

ロングリードRNA seqのアライナー Graphmap2

　オックスフォードナノポアテクノロジー（ONT）[ref.1]やパシフィックバイオサイエンス（PacBio）[ref.2]などの企業が達成したシーケンシングテクノロジーの進歩により、長さが10 kbpを超えるロングリードが生成される。当初、このようなロングリードのエラー率は非常に高かったが、着実に低下している。PacBioデバイスの最後の世代は、イルミナのショートリードと同等のリードを生成する[ref.3 link]。さらに、ONTは最近、新しいR10ポアを発表し、この社内での結果は有望である。 RNA-seqの分野ではショートリードが依然として主に使用されているが、アイソフォームの検出と定量化に役立つ長いリードが必要である。アルゴリズム的に、RNA-seqリードを既知の転写産物にマッピングすることは、DNAリードをマッピングすることと同等である。しかし、これらのリードを真核生物のゲノムにマッピングすることは、RNA スプライシングのためにより複雑になる。第三世代のシーケンシング技術によって生成されたロングリード用に開発されたいくつかのRNA-seqスプライス対応マッピングツールがある。これらのツールの評価は[ref.4]および[ref.5]で行われ、GMAP [ref.6]およびMinimap2 [ref.7]が最高のパフォーマンスを発揮するツールである。ただし、特にエクソンのエッジを正しく配置し、転写産物内のすべてのエクソン、特に短いエクソンを見つけるには、まだ改善の余地がある。
　ここでは、Graphmapツールの拡張バージョンであるGraphmap2を紹介する。 Graphmapは、DNAリードをリファレンスゲノムにマッピングするための非常に感度の高いツールであり、RNAリードをマッピングする機能を拡張した。この拡張バージョンは、初期バージョン[ref.8]と同じ5段階の「リードファネル」アプローチを使用し、RNAリードのマッピングに固有のアップグレードを追加する。 Graphmap2のパフォーマンスと精度は、シミュレートされたデータセットと実際のデータセットで評価され、結果は最先端のRNA マッピングツールと比較された。さらに、ロングリードを使用して、潜在的に新しいアイソフォームと遺伝子の同定について分析した。
　図1は、RNAリードをリファレンスゲノムにアラインメントする完全なプロセスを示している。元のGraphmapマッピング方法は2つの段階に分割できる。最初の段階では、短いシードを使用してリファレンスゲノム上の候補位置を見つける。2番目の段階では、Edlibを使用してリードの正確なアライメントを計算する。これは、Myersのビットベクトルアルゴリズムの高速実装である[ref.9]。ステップを含む最初のステージ：（i）領域選択、（ii）グラフマッピング、（iii）k merサブストリング（LCSk）の最長共通サブシーケンス[ref.10]はGraphmap2に保持される。入力データセットのすべてのリードについて、このステージはリードの部分とリファレンスの部分との間の近似一致のセットを生成する。これらの一致はアンカーによって表される。すべてのアンカーは、リードの開始位置と終了位置、およびその一致のリファレンスの開始位置と終了位置で構成される。 Graphmap2は、次の段階でワークフローを拡張することにより続行する：（iv）アンカーフィルタリング（ナップサック）、（v）アンカーアライメント、（vi）アライメントチューニング、（vii）エクソンのグループ化と調整。アンカーフィルター処理ステップでは、Graphmap2はナップザックアルゴリズムのバリエーションを使用して最適なアンカーセットを見つけ、次にKSW2 [ref.11]アライナーを使用してこれらのアンカー間で区分的アフィンギャップアライメントを実行し、第1フェーズのリグメントを生成する。その後、これらのアライメントは、エクソン拡張およびエクソン境界調整ステップで処理され、第一段階のアライメントの品質が向上する。改善されたアライメントは、エクソンに分割され、エクソンのグループ化と調整フェーズでさらにグループ化および変更される。エキソンは、同じ転写産物から転写されたエキソンをグループ化するために、リファレンスゲノム上の位置に基づいてグループ化される。エクソングループは、「エクソンオフセット計算」およびエクソン調整ステップで分析および変更される。エクソンオフセット計算ステップは、エクソンのすべてのグループについて、同じグループ内のすべてのエクソンの正しい開始位置と終了位置を計算する。エクソン調整ステップでは、誤ったエクソンの開始位置または終了位置が、グループ全体の以前に計算された開始位置と終了位置に一致するように調整される。エクソン調整フェーズでは、Graphmap2ツールの出力である最終アライメントが生成される。

（以下略）

Graphmap2 - first version of new splice-aware RNA mapper with fine-tuned detection of exons and their boundaries https://t.co/hcbmbQ0Sny https://t.co/T2vR8PtjHP w/ @IvanSovic @jmaricb and @NiranjanTW
— Mile Sikic (@msikic) July 31, 2019

インストール

ubuntu18.0.4でテストした（docker使用、ホストOS macos10.14）。

git clone https://github.com/lbcb-sci/graphmap2 
cd graphmap2 
make modules 
make -j 8
cd /bin/Linux-x64/

> ./graphmap2 -h

# ./graphmap2 -h

ERROR: Unknown value of 'tool' parameter. Exiting.

Usage:

./graphmap2 tool

Options

tool STR Specifies the tool to run:

align - the entire GraphMap pipeline.

owler - Overlapping With Long Erroneous Reads.

> ./graphmap2 align -h

# ./graphmap2 align -h

GraphMap - A very accurate and sensitive long-read, high error-rate sequence mapper

GraphMap Version: v0.6.1

Build date: Sep 16 2019 at 01:41:10

GraphMap (c) by Ivan Sovic, Mile Sikic and Niranjan Nagarajan

GraphMap is licensed under The MIT License.

Affiliations: Ivan Sovic (1, 3), Mile Sikic (2), Niranjan Nagarajan (3)

(1) Ruder Boskovic Institute, Zagreb, Croatia

(2) University of Zagreb, Faculty of Electrical Engineering and Computing

(3) Genome Institute of Singapore, A*STAR, Singapore

Usage:

graphmap [options] -r <reference_file> -d <reads_file> -o <output_sam_path>

Usage:

./graphmap2 [options]

Options

Input/Output options:

-r, --ref STR Path to the reference sequence (fastq or fasta).

-i, --index STR Path to the index of the reference sequence. If not specified, index is generated

in the same folder as the reference file, with .gmidx extension. For

non-parsimonious mode, secondary index .gmidxsec is also generated.

-d, --reads STR Path to the reads file.

-o, --out STR Path to the output file that will be generated.

--gtf STR Path to a General Transfer Format file. If specified, a transcriptome will be

built from the reference sequence and used for mapping. Output SAM alignments will

be in genome space (not transcriptome).

-K, --in-fmt STR Format in which to input reads. Options are:

auto - Determines the format automatically from file extension.

fastq - Loads FASTQ or FASTA files.

fasta - Loads FASTQ or FASTA files.

gfa - Graphical Fragment Assembly format.

sam - Sequence Alignment/Mapping format. [auto]

-L, --out-fmt STR Format in which to output results. Options are:

sam - Standard SAM output (in normal and '-w overlap' modes).

m5 - BLASR M5 format. [sam]

-I, --index-only - Build only the index from the given reference and exit. If not specified, index

will automatically be built if it does not exist, or loaded from file otherwise. [false]

--rebuild-index - Always rebuild index even if it already exists in given path. [false]

--auto-rebuild-index - Rebuild index only if an existing index is of an older version or corrupt. [false]

-u, --ordered - SAM alignments will be output after the processing has finished, in the order of

input reads. [false]

-B, --batch-mb INT Reads will be loaded in batches of the size specified in megabytes. Value <= 0

loads the entire file. [1024]

General-purpose pre-set options:

-x, --preset STR Pre-set parameters to increase sensitivity for different sequencing technologies.

Valid options are:

illumina - Equivalent to: '-a gotoh -w normal -M 5 -X 4 -G 8 -E 6'

overlap - Equivalent to: '-a anchor -w normal --overlapper --evalue 1e0

--ambiguity 0.50 --secondary'

sensitive - Equivalent to: '--freq-percentile 1.0 --minimizer-window 1'

Alignment options:

-a, --alg STR Specifies which algorithm should be used for alignment. Options are:

sg - Myers' bit-vector approach. Semiglobal. Edit dist. alignment.

sggotoh - Gotoh alignment with affine gaps. Semiglobal.

anchor - anchored alignment with end-to-end extension.

Uses Myers' global alignment to align between anchors.

anchorgotoh - anchored alignment with Gotoh.

Uses Gotoh global alignment to align between anchors. [anchor]

-w, --approach STR Additional alignment approaches. Changes the way alignment algorithm is applied.

Options are:

normal - Normal alignment of reads to the reference.

(Currently no other options are provided. This is a placeholder for future

features, such as cDNA mapping) [normal]

--overlapper - Perform overlapping instead of mapping. Skips self-hits if reads and reference

files contain same sequences, and outputs lenient secondary alignments. [false]

--no-self-hits - Similar to overlapper, but skips mapping of sequences with same headers. Same

sequences can be located on different paths, and their overlap still skipped. [false]

-M, --match INT Match score for the DP alignment. Ignored for Myers alignment. [5]

-X, --mismatch INT Mismatch penalty for the DP alignment. Ignored for Myers alignment. [4]

-G, --gapopen INT Gap open penalty for the DP alignment. Ignored for Myers alignment. [8]

-E, --gapext INT Gap extend penalty for the DP alignment. Ignored for Myers alignment. [6]

-z, --evalue FLT Threshold for E-value. If E-value > FLT, read will be called unmapped. If FLT <

0.0, thredhold not applied. [1e0]

-c, --mapq INT Threshold for mapping quality. If mapq < INT, read will be called unmapped. [1]

--extcigar - Use the extended CIGAR format for output alignments. [false]

--spliced - Align clusters of anchors independently and report them as separate alignments.

Does not align in-between clusters. Works only for anchored alignment modes. [false]

--split - Align clusters of anchors independently and report them as separate alignments.

Does not align in-between clusters. Works only for anchored alignment modes. [false]

--no-end2end - Disables extending of the alignments to the ends of the read. Works only for

anchored modes. [false]

--max-error FLT If an alignment has error rate (X+I+D) larger than this, it won't be taken into

account. If >= 1.0, this filter is disabled. [1.0]

--max-indel-error FLT If an alignment has indel error rate (I+D) larger than this, it won't be taken

into account. If >= 1.0, this filter is disabled. [1.0]

Algorithmic options:

-k INT Graph construction kmer size. [6]

-l INT Number of edges per vertex. [9]

-A, --minbases INT Minimum number of match bases in an anchor. [12]

-e, --error-rate FLT Approximate error rate of the input read sequences. [0.45]

-g, --max-regions INT If the final number of regions exceeds this amount, the read will be called

unmapped. If 0, value will be dynamically determined. If < 0, no limit is set. [0]

-q, --reg-reduce INT Attempt to heuristically reduce the number of regions if it exceeds this amount.

Value <= 0 disables reduction but only if param -g is not 0. If -g is 0, the value

of this parameter is set to 1/5 of maximum number of regions. [0]

-C, --circular - Reference sequence is a circular genome. [false]

-F, --ambiguity FLT All mapping positions within the given fraction of the top score will be counted

for ambiguity (mapping quality). Value of 0.0 counts only identical mappings. [0.02]

-Z, --secondary - If specified, all (secondary) alignments within (-F FLT) will be output to a

file. Otherwise, only one alignment will be output. [false]

-P, --double-index - If false, only one gapped spaced index will be used in region selection. If true,

two such indexes (with different shapes) will be used (2x memory-hungry but more

powerful for very high error rates). [false]

--min-bin-perc FLT Consider only bins with counts above FLT * max_bin, where max_bin is the count of

the top scoring bin. [0.75]

--bin-step FLT After a chunk of bins with values above FLT * max_bin is processed, check if

there is one extremely dominant region, and stop the search. [0.25]

--min-read-len INT If a read is shorter than this, it will be marked as unmapped. This value can be

lowered if the reads are known to be accurate. [80]

--minimizer-window INT Length of the window to select a minimizer from. If equal to 1, minimizers will

be turned off. [5]

--freq-percentile FLT Filer the (1.0 - value) percent of most frequent seeds in the lookup process. [0.99]

--fly-index - Index will be constructed on the fly, without storing it to disk. If it already

exists on disk, it will be loaded unless --rebuild-index is specified. [false]

--chain-indel-bw FLT Diagonal between two anchors needs to be within this band to chain them. [0.23]

Anchor chaining options:

--chain-max-dist INT Maximum distance between two anchors to chain them. [200]

--chain-min-cov INT Minimum number of bases covered by a chain to keep it. Only checked if number of

anchors in the cluster is <= '--chain-min-num-anchors'. [50]

--chain-min-num-anchors INT If number of anchors in chain is <= to this value and the chain covers <

'--chain-min-cov' bases, the chain is discarded. [2]

Other options:

-t, --threads INT Number of threads to use. If '-1', number of threads will be equal to min(24, num_cores/2). [-1]

-v, --verbose INT Verbose level. If equal to 0 nothing except strict output will be placed on stdout. [5]

-s, --start INT Ordinal number of the read from which to start processing data. [0]

-n, --numreads INT Number of reads to process per batch. Value of '-1' processes all reads. [-1]

-h, --help - View this help. [false]

Debug options:

-y, --debug-read INT ID of the read to give the detailed verbose output. [-1]

-Y, --debug-qname STR QNAME of the read to give the detailed verbose output. Has precedence over -y.

Use quotes to specify.

-b, --verbose-sam INT Helpful debug comments can be placed in SAM output lines (at the end). Comments

can be turned off by setting this parameter to 0. Different values

increase/decrease verbosity level.

0 - verbose off

1 - server mode, command line will be omitted to obfuscate paths.

2 - umm this one was skipped by accident. The same as 0.

>=3 - detailed verbose is added for each alignment, including timing measurements

and other.

4 - qnames and rnames will not be trimmed to the first space.

5 - QVs will be omitted (if available). [0]

> ./graphmap2 owler -h

# ./graphmap2 owler -h

GraphMap - A very accurate and sensitive long-read, high error-rate sequence mapper

GraphMap Version: v0.6.1

Build date: Sep 16 2019 at 01:41:10

GraphMap (c) by Ivan Sovic, Mile Sikic and Niranjan Nagarajan

GraphMap is licensed under The MIT License.

Affiliations: Ivan Sovic (1, 3), Mile Sikic (2), Niranjan Nagarajan (3)

(1) Ruder Boskovic Institute, Zagreb, Croatia

(2) University of Zagreb, Faculty of Electrical Engineering and Computing

(3) Genome Institute of Singapore, A*STAR, Singapore

Usage:

graphmap [options] -r <reference_file> -d <reads_file> -o <output_sam_path>

Usage:

./graphmap2 [options]

Options

Input/Output options:

-r, --ref STR Path to the reference sequence (fastq or fasta).

-d, --reads STR Path to the reads file.

-o, --out STR Path to the output file that will be generated.

-K, --in-fmt STR Format in which to input reads. Options are:

auto - Determines the format automatically from file extension.

fastq - Loads FASTQ or FASTA files.

fasta - Loads FASTQ or FASTA files.

gfa - Graphical Fragment Assembly format.

sam - Sequence Alignment/Mapping format. [auto]

-L, --out-fmt STR Format in which to output results. Options are:

mhap - MHAP overlap format.

paf - PAF (Minimap) overlap format. [mhap]

-I, --index-only - Build only the index from the given reference and exit. If not specified, index will

automatically be built if it does not exist, or loaded from file otherwise. [false]

--load - Loads the generated index from a file specified with the --index parameter. [false]

--store - Stores the generated index to a file specified with the --index parameter. If the

index was loaded from a file, it will not be rewritten. [false]

--rebuild-index - If the index file is not compatible with currently set parameters, automatically

rebuild the index. Otherwise, use the disk index. This parameter only works if --load

is specified. [false]

-i, --index STR Path to the index of the reference sequence. If not specified, path will be set to

the same folder as the reference file, with .owlidx extension.

Algorithmic options:

-e, --error-rate FLT Approximate error rate of the input read sequences. [0.45]

--shape STR Seed shape which will be used for indexing and querying the sequences. [111111111111111]

--minimizer-window INT Length of the window to select a minimizer from. If equal to 1, minimizers will be

turned off. [5]

--freq-percentile FLT Filer the (1.0 - value) percent of most frequent seeds in the lookup process. [0.999]

Overlap filtering:

--min-read-len INT Overlaps between sequences, where one sequence is shorter than min-read-len, will be discarded. [500]

--min-cov-bases INT Minimum number of bases covered by seeds in an overlap. [50]

--min-overlap-len INT Minimum span of overlap region on either of the two sequences. [100]

--overhang FLT Value in range [0.0, 1.0] which specifies an allowed distance of an overlap from the

beginning/ending of a sequence (e.g. (overhang * read_len)). [ 0.20]

--max-overhang INT Allowed overhang length will be upper-bound by this parameter. Essentially, overhangs

of min(max_overhang, overhang * read_len) will be allowed unless max_overhang is < 0,

in which case this threshold is turned off. [1000]

--min-perc-coverage FLT Discard overlaps which have a fraction of bases covered by seeds then specified by

this parameter. [ 0.01]

--min-seeds INT Discard overlaps with less seed hits (after filtering) than the number specified by

this parameter. [4]

Other options:

-t, --threads INT Number of threads to use. If '-1', number of threads will be equal to min(24, num_cores/2). [-1]

-v, --verbose INT Verbose level. If equal to 0 nothing except strict output will be placed on stdout. [5]

-s, --start INT Ordinal number of the read from which to start processing data. [0]

-n, --numreads INT Number of reads to process per batch. Value of '-1' processes all reads. [-1]

-h, --help - View this help. [false]

General-purpose pre-set options:

-x, --preset STR Pre-set composite parameters. Valid options are:

disk - Equivalent to: '--load --store'

no-filters - Disables all filters except minimum number of seeds and

overhang thresholds. Equivalent to:

'--min-read-len 0 --min-cov-bases 0 --min-overlap-len 0 --min-perc-coverage 0.0'

Debug options:

-b, --verbose-out INT Helpful debug comments can be placed in output lines (at the end). Comments can be

turned off by setting this parameter to 0. [0]

-y, --debug-read INT ID of the read to give the detailed verbose output. [-1]

-Y, --debug-qname STR QNAME of the read to give the detailed verbose output. Has precedence over -y. Use

quotes to specify.

実行方法

アラインメント

graphmap2 align -r reference.fa -d reads.fasta -t 8 -o output.sam

-t Number of threads to use. If '-1', number of threads will be equal to min(24, num_cores/2). [-1]

ロングリードのオーバーラップ

graphmap2 owler -r reads.fasta -d reads.fasta -o output.mhap

引用

Graphmap2 - splice-aware RNA-seq mapper for long reads

Josip Marić, Ivan Sović, Krešimir Križanović, Niranjan Nagarajan, Mile Šikić

bioRxiv preprint first posted online Jul. 30, 2019

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ロングリードRNA seqのアライナー Graphmap2