コピー数変化の検出と可視化ツール CNVkit

2020 3/22 実行例追記

2022/04/29 インストール手順修正

　コピー数変化は、ガンを含む多くの疾患の有用な診断指標である。ゲノム全体のコピー数解析のゴールドスタンダードは、 array comparative genomic hybridization（array CGH）である[論文より　ref.1,2]。より最近では、全ゲノムシーケンシングデータからコピー数の情報を得る方法が開発されている（非特許文献4参照）。臨床用途では、エキソンや疾患関連遺伝子のセットなどのシーケンス解析では、興味のある領域をより高いカバレッジでシーケンスして、バリアントコールの感受性を高めることが好まれる[ref.5]。これらのデータを解析するため、 CNVer [ref.6]、ExomeCNV [ref.7]、exomeCopy [ref.8]、CONTRA [ref.9]、CoNIFER [ref.10]、ExomeDepth [ref.11]、VarScan 2 [ref.12]、XHMM [ref.13]、ngCGH [ref.14]、EXCAVATOR [ref.15]、CANOES [ref.16]、PatternCNV [ref.17]、CODEX [ref.18]、Control-FREEC [ref.19]およびcn.MOPS [ref.20]などのコピー数を解析するツールも開発されてきた。しかしながら、これらのアプローチは、遺伝子間および通常はイントロン領域のシーケンスリードを使用せず、コピー数を推測する可能性を制限する。

　ターゲットエンリッチメントでは、標的領域はハイブリダイゼーションによりキャプチャされる。しかし、かなりの量のオフターゲットDNAがライブラリーに残っており、このDNAはシーケンスされてリードのかなりの部分を占めている。したがって、オフターゲットのリードは、全ゲノムに渡り非常に低カバレッジのシーケンスを提供する。ターゲット外のリードだけではSNVおよび他の小さなバリアントをコールする十分なカバレッジは得られないが、最近cnvOffSeq [ref.21]およびCopywriteR [ref.22]によって実証されたように、ラージスケールではコピー数推測に有用な情報を提供する。

　著者らは、ターゲットシーケンスでコピー数変化および改変の分析のための計算方法を開発した。このツールキットはCNVkitと呼ばれ、CNV検出のパイプラインを実装している。オンとオフの両方のターゲットシーケンシングリードを利用し、一連の修正を適用してコピー数のコール精度を向上させている。オンとオフのターゲット領域のビニングされたリードデプスを比較し、異なる解像度であるにもかかわらず、コピー数の類似した推定値を提供することを見出した。真のコピー数変化によって駆動されていないであろうビニングされたリードカウントの変動を補正するアルゴリズムをいくつか評価した。最後に、CNVkitメソッドと2つの競合するCNVコーラーの結果をarray CGHと比較し、CNVkitがarray CGHと最もよく一致することを確認した。

マニュアル（丁寧にまとめられています。ボリュームがあります。）

https://cnvkit.readthedocs.io/en/stable/

Biostars quwestion

https://www.biostars.org/t/CNVkit/

インストール

mac os10.13のanaconda2-4.3.0でテストした。

依存

python >=3.5

Github

#bioconda (コメントいただきました通り修正しました。)
mamba create -n cnvkit -c conda-forge -c bioconda python=3.9 cnvkit -y
conda activate cnvkit

pipでも導入できる。Dockerコンテナも準備されている。
#dockerイメージ
docker pull etal/cnvkit

> cnvkit.py -h

]$ cnvkit.py -h

usage: cnvkit.py [-h]

{batch,target,access,antitarget,autobin,coverage,reference,fix,segment,call,diagram,scatter,heatmap,breaks,genemetrics,gainloss,sex,gender,metrics,segmetrics,import-picard,import-seg,import-theta,import-rna,export,version}

...

CNVkit, a command-line toolkit for copy number analysis.

positional arguments:

Sub-commands (use with -h for more info)

batch Run the complete CNVkit pipeline on one or more BAM

files.

target Transform bait intervals into targets more suitable

for CNVkit.

access List the locations of accessible sequence regions in a

FASTA file.

antitarget Derive off-target ("antitarget") bins from target

regions.

autobin Quickly calculate reasonable bin sizes from BAM read

counts.

coverage Calculate coverage in the given regions from BAM read

depths.

reference Compile a coverage reference from the given files

(normal samples).

fix Combine target and antitarget coverages and correct

for biases. Adjust raw coverage data according to the

given reference, correct potential biases and re-

center.

segment Infer copy number segments from the given coverage

table.

call Call copy number variants from segmented log2 ratios.

diagram Draw copy number (log2 coverages, CBS calls) on

chromosomes as a diagram. If both the raw probes and

segments are given, show them side-by-side on each

chromosome (segments on the left side, probes on the

right side).

scatter Plot probe log2 coverages and segmentation calls

together.

heatmap Plot copy number for multiple samples as a heatmap.

breaks List the targeted genes in which a copy number

breakpoint occurs.

genemetrics Identify targeted genes with copy number gain or loss.

sex Guess samples' sex from the relative coverage of

chromosomes X and Y.

metrics Compute coverage deviations and other metrics for

self-evaluation.

segmetrics Compute segment-level metrics from bin-level log2

ratios.

import-picard Convert Picard CalculateHsMetrics tabular output to

CNVkit .cnn files. The input file is generated by the

PER_TARGET_COVERAGE option in the CalculateHsMetrics

script in Picard tools. If 'antitarget' is in the

input filename, the generated output filename will

have the suffix '.antitargetcoverage.cnn', otherwise

'.targetcoverage.cnn'.

import-seg Convert a SEG file to CNVkit .cns files.

import-theta Convert THetA output to a BED-like, CNVkit-like

tabular format. Equivalently, use the THetA results

file to convert CNVkit .cns segments to integer copy

number calls.

import-rna Convert a cohort of per-gene log2 ratios to CNVkit

.cnr format.

export Convert CNVkit output files to another format.

version Display this program's version.

optional arguments:

-h, --help show this help message and exit

See the online manual for details: https://cnvkit.readthedocs.io

たくさんのコマンドがある。

例えばcnvkitの全パイプラインを実行するbatchコマンドのヘルプを表示する。

> cnvkit.py batch -h

$ cnvkit.py batch -h

usage: cnvkit.py batch [-h] [-m {hybrid,amplicon,wgs}] [-y] [-c]

[--drop-low-coverage] [-p [PROCESSES]]

[--rlibpath DIRECTORY] [-n [FILES [FILES ...]]]

[-f FILENAME] [-t FILENAME] [-a FILENAME]

[--annotate FILENAME] [--short-names]

[--target-avg-size TARGET_AVG_SIZE] [-g FILENAME]

[--antitarget-avg-size ANTITARGET_AVG_SIZE]

[--antitarget-min-size ANTITARGET_MIN_SIZE]

[--output-reference FILENAME] [-r REFERENCE]

[-d DIRECTORY] [--scatter] [--diagram]

[bam_files [bam_files ...]]

positional arguments:

bam_files Mapped sequence reads (.bam)

optional arguments:

-h, --help show this help message and exit

-m {hybrid,amplicon,wgs}, --method {hybrid,amplicon,wgs}

Sequencing protocol: hybridization capture ('hybrid'),

targeted amplicon sequencing ('amplicon'), or whole

genome sequencing ('wgs'). Determines whether and how

to use antitarget bins. [Default: hybrid]

-y, --male-reference, --haploid-x-reference

Use or assume a male reference (i.e. female samples

will have +1 log-CNR of chrX; otherwise male samples

would have -1 chrX).

-c, --count-reads Get read depths by counting read midpoints within each

bin. (An alternative algorithm).

--drop-low-coverage Drop very-low-coverage bins before segmentation to

avoid false-positive deletions in poor-quality tumor

samples.

-p [PROCESSES], --processes [PROCESSES]

Number of subprocesses used to running each of the BAM

files in parallel. Without an argument, use the

maximum number of available CPUs. [Default: process

each BAM in serial]

--rlibpath DIRECTORY Path to an alternative site-library to use for R

packages.

To construct a new copy number reference:

-n [FILES [FILES ...]], --normal [FILES [FILES ...]]

Normal samples (.bam) used to construct the pooled,

paired, or flat reference. If this option is used but

no filenames are given, a "flat" reference will be

built. Otherwise, all filenames following this option

will be used.

-f FILENAME, --fasta FILENAME

Reference genome, FASTA format (e.g. UCSC hg19.fa)

-t FILENAME, --targets FILENAME

Target intervals (.bed or .list)

-a FILENAME, --antitargets FILENAME

Antitarget intervals (.bed or .list)

--annotate FILENAME Use gene models from this file to assign names to the

target regions. Format: UCSC refFlat.txt or

ensFlat.txt file (preferred), or BED, interval list,

GFF, or similar.

--short-names Reduce multi-accession bait labels to be short and

consistent.

--target-avg-size TARGET_AVG_SIZE

Average size of split target bins (results are

approximate).

-g FILENAME, --access FILENAME

Regions of accessible sequence on chromosomes (.bed),

as output by the 'access' command.

--antitarget-avg-size ANTITARGET_AVG_SIZE

Average size of antitarget bins (results are

approximate).

--antitarget-min-size ANTITARGET_MIN_SIZE

Minimum size of antitarget bins (smaller regions are

dropped).

--output-reference FILENAME

Output filename/path for the new reference file being

created. (If given, ignores the -o/--output-dir option

and will write the file to the given path. Otherwise,

"reference.cnn" will be created in the current

directory or specified output directory.)

To reuse an existing reference:

-r REFERENCE, --reference REFERENCE

Copy number reference file (.cnn).

Output options:

-d DIRECTORY, --output-dir DIRECTORY

Output directory.

--scatter Create a whole-genome copy ratio profile as a PDF

scatter plot.

--diagram Create an ideogram of copy ratios on chromosomes as a

PDF.

——

試せてませんが、galaxyやDNA Nexusでも利用できるようです。

Galaxy | (sandbox for testing) | Tool Shed

https://platform.dnanexus.com/login

実行方法

--CNVの検出--

様々なサブコマンドがあるが、ここでは全パイプラインを動かすbatchコマンドと、出力を変換するexportコマンドを紹介する（Quick startより）。

batch: Run the complete CNVkit pipeline on one or more BAM files.

パターンA:　turmor-normal

cnvkit.py batch *Tumor.bam --normal *Normal.bam \
 --targets my_baits.bed --fasta hg19.fasta \
 --access data/access-5kb-mappable.hg19.bed \
 --output-reference my_reference.cnn --output-dir example/

パターンB:　比較対象のbamがない場合、baitリファレンスを指定することでランできる。その場合、"--normal"フラグを引数なしで立てる。

cnvkit.py batch *Tumor.bam --normal -t my_baits.bed -f hg19.fasta \
 --access data/access-5kb-mappable.hg19.bed \
 --output-reference my_flat_reference.cnn -d example/

ランが終わると、出力ディレクトリにいくつかのファイルができる。

Target and antitarget bin-level coverages (.cnn)

Copy number reference profile (.cnn)

Bin-level log2 ratios (.cnr)

Segmented log2 ratios (.cns)

詳細はフォーマット解説参照。

https://cnvkit.readthedocs.io/en/stable/fileformats.html

出力をvcfに変換する。

export: Convert CNVkit output files to another format.

cnvkit.py export vcf example/sample.cns -i "SampleID" -o sample.cnv.vcf

--CNVの可視化--

Bin-level log2 ratios (.cnr)ファイルとsegmentation callsをプロットする（おそらくhumanゲノムを想定しているので注意）

scatter（リンク）: Plot probe log2 coverages and segmentation calls
together.

cnvkit.py scatter example/sample.cnr -s example/sample.cns

diagram（リンク）: Draw copy number (log2 coverages, CBS calls) on chromosomes as a diagram.

cnvkit.py diagram -s example/sample.cns example/sample.cnr

heatmap（リンク）: Plot copy number for multiple samples as a heatmap.

cnvkit.py heatmap example/*.cns

heatmap機能は縦にサンプルを並べて描画するので（リンク）、familyやpopulation解析など、一定の数のサンプルを比較する時に使えるんじゃないかと思います。

追記

コンティグごとのカバレッジを出力する。

#contig.faのindexからbedを作る。
samtools faidx contig.fa
awk '{print $1 "\t0\t" $2}' contig.fa.fai > contig.bed

#cnvkit.pyのcoverageコマンド実行
cnvkit.py coverage -o out input.bam contig.bed

出力

f:id:kazumaxneo:20200322203852p:plain