CLEVER: detecting medium-sized SVs - macでインフォマティクス

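The International HapMap Consortium (2005) and The 1000 Genomes Project Consortium (2010) provided, through globally coordinated efforts, the first systematic view of the full range of variation in the human genome, including its larger rearrangements. Strikingly, 8% of the human population carry copy number variants (CNVs) affecting regions larger than 500 kb (Itsara et al., 2009) (pubmed). The technology that made this progress possible was next-generation sequencing, together with the cost reductions and the gains in sequencing speed it brought (Bentley et al., 2008; Eid et al., 2009). However, the analysis of structural variation has not kept pace with the advances in sequencing technology, insofar as genotyping human structural variation has not yet become a routine procedure (Alkan et al., 2011). Indeed, existing datasets may contain structural variants that current methods cannot discover. (partly omitted)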

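Here we (the authors) target deletions and insertions (indels) of 20-50,000 bp. Even in non-repetitive regions of the genome, discovering indels remains difficult, especially those smaller than 500 bp (Alkan et al., 2011; Mills et al., 2011). In fact, the majority of structural variants lie in repetitive regions, which further complicates the problems caused by the resulting read-mapping ambiguity.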

(several paragraphs omitted)

CLEVER performs particularly well on deletions and insertions (indels) of 20-100 nt. In this size range it even outperforms split-read aligners.

wiki

https://bitbucket.org/tobiasmarschall/clever-toolkit/wiki/Home

Installation

Tested on CentOS 6 in a miniconda2-4.0.5 environment.

Main program: Bitbucket

# Install with conda in an Anaconda environment
conda install -c bioconda clever-toolkit
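
After installation, a quick check that the executables are on PATH can save some confusion later. The following is a minimal sketch; the helper name `check_tool` is hypothetical, and the three executable names are taken from the help text of `clever`, on the assumption that the conda package installs all of them.

```shell
#!/bin/sh
# Report whether a named executable can be found on PATH.
# check_tool is a hypothetical helper, not part of the CLEVER toolkit.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: missing"
        return 1
    fi
}

# Executable names as they appear in the clever help text.
for tool in clever clever-core postprocess-predictions; do
    check_tool "$tool" || true
done
```

If any of them is reported as missing, check that the conda environment used for the installation is the active one.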

$ clever

Usage: clever [options] <bam-file> <ref.fasta(.gz)> <result-directory>

This tool runs the whole workflow necessary to use CLEVER.

  <bam-file>          Input BAM file. All alignments for the same read (pair) must be in
                      subsequent lines. It is highly recommended to allows multiple
                      alignments per read to avoid spurious predictions.
  <ref.fasta(.gz)>    The reference genome in (gzipped) FASTA format. This is needed to
                      recompute alignment scores (AS tags). If your BAM file does have AS tags
                      such that 10^(AS/-10.0) can be interpreted as the probability of this
                      alignment being correct, use option -a to omit this step.
  <result-directory>  Directory to be created to store results in. If it already exists, abort
                      unless option -f is given.

Options:
  -h, --help            show this help message and exit
  --sorted              Input BAM file is sorted by position. Note that this
                        requires alternative alignments to be given as XA tags
                        (like produced by BWA, stampy, etc.).
  --use_xa              Interprete XA tags in input BAM file. This option
                        SHOULD be given for mappers writing XA tags like BWA
                        and stampy.
  -T THREADS            Number of threads to use (default=1).
  -f                    Delete old result and working directory first (if
                        present).
  -w WORK_DIR           Working directory (default: <result-directory>/work).
  -a                    Do not (re-)compute AS tags. If given, the argument
                        <ref.fasta(.gz)> is ignored.
  -k                    Keep working directory (default: delete directory when
                        finished).
  -r                    Take read groups into account (multi sample mode).
  -C ADD_CLEVER_PARAMS  Additional parameters to be passed to the CLEVER core
                        algorithm. Call "clever-core" without parameters for a
                        list of options.
  -P ADD_POST_PARAMS    Additional parameters for postprocessing results. Call
                        "postprocess-predictions" without parameters for a
                        list of options.
  -I                    Create a plot of internal segment size distribution
                        (=fragment size - 2x read length). Also displays the
                        estimated normal distribution (requires NumPy and
                        matplotlib).
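
Two quantities in the help text are easier to grasp with numbers: the AS tag convention 10^(AS/-10.0) and the internal segment size (fragment size - 2x read length). The sketch below only does the arithmetic. The function names `as_to_prob` and `internal_segment_size` and the input values are hypothetical and are not part of CLEVER.

```shell
#!/bin/sh
# Probability that an alignment is correct under the convention
# quoted in the help text: p = 10^(AS / -10.0).
# Computed as exp(-(AS/10) * ln 10) so it works in any POSIX awk.
as_to_prob() {
    awk -v score="$1" 'BEGIN { printf "%.4f\n", exp(-(score / 10.0) * log(10)) }'
}

# Internal segment size = fragment size - 2 x read length.
internal_segment_size() {
    echo $(( $1 - 2 * $2 ))
}

as_to_prob 0                    # prints 1.0000
as_to_prob 10                   # prints 0.1000
as_to_prob 30                   # prints 0.0010
internal_segment_size 500 100   # prints 300
```

Under this convention, an AS tag of 0 corresponds to an alignment that is certainly correct, and larger values to less likely alignments. If the AS tags in a BAM file do not follow this convention, the help text implies that option -a should not be used.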

Usage

clever --use_xa input.bam ref.fa out_dir

--use_xa Interprete XA tags in input BAM file. This option SHOULD be given for mappers writing XA tags like BWA and stampy.
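
For a position-sorted BAM file from a mapper that writes XA tags, the help text suggests adding --sorted, and -T sets the number of threads. The sketch below prints such a command line without running it. The helper name `build_clever_cmd` and the file names are hypothetical, and giving --sorted together with --use_xa is an assumption based on the option descriptions above.

```shell
#!/bin/sh
# Print (do not run) a clever command line for a position-sorted BAM
# whose alternative alignments are stored in XA tags.
# Arguments: <bam-file> <ref.fasta> <result-directory> [threads]
# build_clever_cmd is a hypothetical helper, not part of the toolkit.
build_clever_cmd() {
    echo "clever --sorted --use_xa -T ${4:-1} $1 $2 $3"
}

build_clever_cmd input.bam ref.fa out_dir 8
# prints: clever --sorted --use_xa -T 8 input.bam ref.fa out_dir
```

Printing the command first makes it easy to check the options before a long run. According to the help text, adding -f makes clever delete an existing result directory instead of aborting.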

Citation
CLEVER: clique-enumerating variant finder
Marschall T, Costa IG, Canzar S, Bauer M, Klau GW, Schliep A, Schönhuth A

Bioinformatics. 2012 Nov 15;28(22):2875-82

In 2013, the same authors also published a tool called MATE-CLEVER.

https://www.ncbi.nlm.nih.gov/pubmed/24072733