マッピングツール segemehl - macでインフォマティクス

2018 11/5 タイトル修正

　近年、短いシーケンシングリードを大きなリファレンスゲノムにアライメントさせる問題はかなりの注目を集めており、これまで様々な異なるアルゴリズムアプローチに基づく、異なる多くのアラインメントツールが発表されている。 EBIの調査では、現在、80以上の異なるマッパーが数えられている（Fonseca et al、2012）（論文執筆時点）。この競争の激しい分野ではかなり進化と進歩が見られた。初期のマッパーは、ミスマッチがないか少ないミスマッチでショートリードをアライメントさせることに制限され、挿入および欠失を伴うリードはしばしば除外された。しかし RNA-seqプロトコルの登場により、問題の複雑さがさらに増した。スプライシングである。 2つ以上のエクソンを連結するcDNAにリードをマッピングする場合、アライナーは、リードを「splitして」、その部分をリファレンスゲノムの適切なエキソンにアライメントさせる必要がある。あるいは、マッパーにジャンクション情報を提供する必要がある。今日では、ほとんどのツールは1つのスプリットしか想定しないため（論文執筆時点）、複数のエキソン - エキソンジャンクションにまたがるリードは適切にsplitアライメントされないことがある。リード長は絶えず増加しており、複数のsplitを可能にするアルゴリズムは明らかに有利である。 segemehlは、シミュレートされたデータベンチマークで非常によく機能することが示されたローカルトランジションアルゴリズムを使用してマルチスプリットアライメントを容易にする（Hoffmann et al、2014）。bisulfite処理したDNAリードのアライメント、すなわちmethyl-cytosineシーケンシングはまた、特殊なアルゴリズムおよび方法を必要とする（Chen et al、2010; Krueger and Andrews、2011; Smith et al、2009）。 segemehlの特徴であるsplit-readおよびbisulfiteは、広範なベンチマーク（Hoffmann et al、2014; Otto et al、2012）とともに最近publishされた。ツールの多様性とアルゴリズムとソフトウェアの迅速な開発には、頻繁で透明で再現性のあるベンチマークが必要である。ここで著者らは、DNA-seqおよびRNA-seqリードアライメントの性能テストの結果を示し、詳細情報を提供する。これらすべてのデータ、カスタムスクリプト、および分析を再実行する方法の詳細な説明を含む広範なelectronic supplement（http://www.bioinf.uni-leipzig.de/publications/supplements/13-008）を用意した。ベンチマークが特定の側面のみを測定したり、ツールの普遍的な優越性または劣等性を主張するために使用されないことを強調したい。すべてのreadersにこのデータを再現することを奨励し、代替ベンチマークを提示することを奨励する。

ベンチマーク

http://www.ecseq.com/support/benchmark.html

インストール

mac os10.13のanaconda2.4.3環境でテストした。

#Anacondaを使っているなら、condaで導入できる。

conda install -c bioconda segemehl

> segemehl.x

$ segemehl.x

segemehl.x: required option 'database' (d) missing

usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>] [-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>] [-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>]

[-M <n>] [-r <n>] [-S] [--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>] [-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>]

Heuristic mapping of short sequences

[INPUT]

-d, --database <file> [<file>] list of path/filename(s) of database sequence(s)

-q, --query <file> path/filename of query sequences (default:none)

-p, --mate <file> path/filename of mate pair sequences (default:none)

-i, --index <file> path/filename of db index (default:none)

-j, --index2 <file> path/filename of second db index (default:none)

-x, --generate <file> generate db index and store to disk (default:none)

-y, --generate2 <file> generate second db index and store to disk (default:none)

-B, --filebins <string> file bins with basename <string> for easier data handling (default:none)

-F, --bisulfite <n> bisulfite mapping with methylC-seq/Lister et al. (=1) or bs-seq/Cokus et al. protocol (=2) (default:0)

[GENERAL]

-m, --minsize <n> minimum size of queries (default:12)

-s, --silent shut up!

-b, --brief brief output

-c, --checkidx check index

-t, --threads <n> start <n> threads (default:1)

-o, --outfile <string> outputfile (default:none)

-u, --nomatchfilename <file> filename for unmatched reads (default:none)

[SEEDPARAMS]

-D, --differences <n> search seeds initially with <n> differences (default:1)

-J, --jump <n> search seeds with jump size <n> (0=automatic) (default:0)

-E, --evalue <double> max evalue (default:5.000000)

-w, --maxsplitevalue <double> max evalue for splits (default:50.000000)

-M, --maxinterval <n> maximum width of a suffix array interval, i.e. a query seed will be omitted if it matches more than <n> times (default:100)

-r, --maxout <n> maximum number of alignments that will be reported. If set to zero, all alignments will be reported (default:0)

-S, --splits detect split/spliced reads (default:none)

-K, --SEGEMEHL output SEGEMEHL format (needs to be selected for brief)

-V, --MEOP output MEOP field for easier variance calling in SAM (XE:Z:)

--nohead do not output header

[SEEDEXTENSIONPARAMS]

-e, --extensionscore <n> score of a match during extension (default:2)

-n, --extensionpenalty <n> penalty for a mismatch during extension (default:4)

-X, --dropoff <n> dropoff parameter for extension (default:8)

[ALIGNPARAMS]

-A, --accuracy <n> min percentage of matches per read in semi-global alignment (default:90)

-W, --minsplicecover <n> min coverage for spliced transcripts (default:80)

-U, --minfragscore <n> min score of a spliced fragment (default:18)

-Z, --minfraglen <n> min length of a spliced fragment (default:20)

-l, --splicescorescale <f> report spliced alignment with score s only if <f>*s is larger than next best spliced alignment (default:1.000000)

-H, --hitstrategy report only best scoring hits (=1) or all (=0) (default:1)

--showalign show alignments

-P, --prime5 <string> add 5' adapter (default:none)

-Q, --prime3 <string> add 3' adapter (default:none)

-R, --clipacc <n> clipping accuracy (default:70)

-T, --polyA clip polyA tail

-Y, --autoclip autoclip unknown 3prime adapter

-C, --hardclip enable hard clipping

-O, --order sorts the output by chromsome and position (might take a while!)

-I, --maxinsertsize <n> maximum size of the inserts (paired end) (default:5000)

[VERSION]

0.2.0-418 (2015-01-05 05:17:35 -0500 (Mon, 05 Jan 2015))

[BUGS]

Please report bugs to steve@bioinf.uni-leipzig.de

[REFERENCES]

SEGEMEHL is free software for non-commercial use

実行方法

１、index（*1）

segemehl.x -x index -d ref.fasta

-x generate db index and store to disk (default:none)
-d list of path/filename(s) of database sequence(s)

２、マッピング

segemehl.x -i index -d ref.fasta  -q input.fq > mapped.sam

-q path/filename of query sequences (default:none)
-p path/filename of mate pair sequences (default:none)

Split read alignmentをONにしたマッピング

segemehl.x -S -i index -d ref.fasta  -q input.fq > mapped.sam

-S detect split/spliced reads (default:none)

詳細はマニュアルを読んでください。丁寧に説明されています。

引用
Lacking alignments? The next-generation sequencing mapper segemehl revisited
Otto C, Stadler PF, Hoffmann S

Bioinformatics. 2014 Jul 1;30(13):1837-43

A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection
Hoffmann S, Otto C, Doose G, Tanzer A, Langenberger D, Christ S, Kunz M, Holdt LM, Teupser D, Hackermüller J, Stadler PF

Genome Biol. 2014 Feb 10;15(2):R34

特定のクロモソームだけindexingすることもできるが、ここでは記載しない。WGSデータを特定のクロモソームだけにマッピングさせると偽陽性が生じやすい（例えばヒトゲノムのリファレンスではデコイの配列も用意されている）。

https://software.broadinstitute.org/gatk/documentation/article?id=11010

Heng Liさんのブログ

Which human reference genome to use?

http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use