コピーナンバーバリアント（CNV）のシミュレータ SECNVs

　コピーナンバーバリアントは、表現型の変化やヒトの病気に重要な役割を果たすゲノムの複製や欠失である。全ゲノム配列や全エクソーム配列のデータを用いて、コピー数の変異を検出するソフトウェアが数多く開発されている。しかし、これらのアプリケーションから得られる結果は、あまり一致していない。コピーナンバーバリアントを含むシミュレーションデータセットは、既存および新規のコピーナンバーバリアント検出法の動作特性を包括的に比較することができる。全ゲノム配列データのコピー数バリアントやその他の構造バリアントをシミュレートするソフトウェアアプリケーションがいくつか開発されている。しかし、全ゲノム配列データにおけるコピー数バリアントを確実にシミュレートできるアプリケーションはなかった。著者らは、リファレンスゲノムからコピー数バリアントと全エクソーム配列をシミュレートするための、高速で堅牢かつカスタマイズ可能なソフトウェアアプリケーションであるSimulator of Exome Copy Number Variants (SECNVs)を開発し、テストした。SECNVsはインストールが簡単で、シミュレーションをカスタマイズするための様々なコマンドが実装されており、複数のサンプルを一度に出力することができ、1つのコマンドでリアレンジドゲノム、ショートリード、BAMファイルを出力するパイプラインが組み込まれている。SECNVsによって生成されたバリアントは、コピーナンバーバリアントの検出に一般的に使用されているツールによって、高い感度と精度で検出される。SECNVs は https://github.com/YJulyXing/SECNVs で公開されている。

インストール

mambaでpython2.7の環境を作ってテストした（ubuntu18.04）。

依存

Python 2.7. Required python packages: argparse, random, os, subprocess, math, sys, time, copy, numpy

Github

mamba create -n secnv python==2.7.17 -y
conda activate secnv
mamba install numpy

git clone https://github.com/YJulyXing/SECNVs.git

> python2 SECNVs.py -h

$ python2 SECNVs.py -h

usage: SECNVs.py [-h] -G GENOME_FILE -T TARGET_REGION_FILE

[-rN {none,gap,all}] [-eN {none,gap,all}]

[-n_gap NUM_N_FOR_GAPS] [-e_cnv TARGET_CNV_LIST]

[-e_chr TARGET_CNV_CHR] [-e_tol TARGET_CNV_TOL]

[-e_cl TARGET_CNV_LEN_FILE] [-o_cnv OUT_CNV_LIST]

[-o_chr OUT_CNV_CHR] [-o_tol OUT_CNV_TOL]

[-o_cl OUT_CNV_LEN_FILE] [-ol OVERLAP_BPS]

[-min_len CNV_MIN_LENGTH] [-max_len CNV_MAX_LENGTH]

[-min_cn MIN_COPY_NUMBER] [-max_cn MAX_COPY_NUMBER]

[-p PROPORTION_INS] [-f MIN_FLANKING_LEN]

[-ms {random,uniform,gauss}]

[-ml {random,uniform,gauss,beta,user}] [-as AS1] [-bs BS]

[-al AL] [-bl BL] [-s_r S_RATE] [-s_s S_SLACK] [-i_r I_RATE]

[-i_mlen I_MAX_LEN] [-nr NREADS] [-fs FRAG_SIZE] [-s STDEV]

[-l READ_LENGTH] [-tf TARGET_REGION_FLANK] [-pr]

[-q QUALITY_SCORE_OFFSET] [-clr CONNECT_LEN_BETWEEN_REGIONS]

[-m MODEL] [-o OUTPUT_DIR] [-rn REARRANGED_OUTPUT_NAME]

[-n NUM_SAMPLES] [-sc] [-ssr] [-sb] [-picard PATH_TO_PICARD]

[-GATK PATH_TO_GATK]

Simulator for WES or WGS data

optional arguments:

-h, --help show this help message and exit

Mandatory inputs:

-G GENOME_FILE Reference genome FASTA file

-T TARGET_REGION_FILE

Target region file

Arguments for simulating rearranged genomes:

-rN {none,gap,all} Replace sequences of "N"s by ATGC randomly? [none]

-eN {none,gap,all} Exclude sequences of "N"s for CNV simulation? [none]

-n_gap NUM_N_FOR_GAPS

Number of consecutive "N"s to be considered a gap region [50]

-e_cnv TARGET_CNV_LIST

A user-defined list of CNVs overlapping with target regions

-e_chr TARGET_CNV_CHR

Number of CNVs overlapping with target regions to be generated on each chromosome

-e_tol TARGET_CNV_TOL

Total number of CNVs overlapping with target regions to be generated across the genome (estimate)

-e_cl TARGET_CNV_LEN_FILE

User supplied file of CNV length for CNVs overlapping with target regions

-o_cnv OUT_CNV_LIST A user-defined list of CNVs outside of target regions

-o_chr OUT_CNV_CHR Number of CNVs outside of target regions to be generated on each chromosome

-o_tol OUT_CNV_TOL Total number of CNVs outside of target regions to be generated across the genome (estimate)

-o_cl OUT_CNV_LEN_FILE

User supplied file of CNV length for CNVs outside of target regions

-ol OVERLAP_BPS For each CNV overlapping with target regions, number of minimum overlapping bps [100]

-min_len CNV_MIN_LENGTH

Minimum CNV length [1000]

-max_len CNV_MAX_LENGTH

Maximum CNV length [100000]

-min_cn MIN_COPY_NUMBER

Minimum copy number for insertions [2]

-max_cn MAX_COPY_NUMBER

Maximum copy number for insertions [10]

-p PROPORTION_INS Proportion of insertions [0.5]

-f MIN_FLANKING_LEN Minimum length between each CNV [50]

-ms {random,uniform,gauss}

Distribution of CNVs [random]

-ml {random,uniform,gauss,beta,user}

Distribution of CNV length [random]

-as AS1 Mu for Gaussian CNV distribution [0]

-bs BS Sigma for Gaussian CNV distribution [1]

-al AL Mu (Gaussian distribution) or alpha (Beta distribution) for CNV length distribution [0 for Gaussian distribution, and 0.5 for Beta distribution]

-bl BL Sigma (Gaussian distribution) or beta (Beta distribution) for CNV length distribution [1 for Gaussian distribution, and 2.3 for Beta distribution]

-s_r S_RATE Rate of SNPs in target regions [0]

-s_s S_SLACK Slack region up and down stream of target regions to simulate SNPs [0]

-i_r I_RATE Rate of indels in target regions [0]

-i_mlen I_MAX_LEN The Maximum length of indels in target regions [50]. (If a deletion is equal or larger than the length of the target region it is in, the length of the deletion will be changed to (length of the target region it is in) - 1.)

Arguments for simulating short reads (fastq):

-nr NREADS Number of reads / read pairs on target regions to be generated for each genome [10000]

-fs FRAG_SIZE Mean fragment size to be generated [200]

-s STDEV Standard deviation of fragment sizes [20]

-l READ_LENGTH Read length of each short read [100]

-tf TARGET_REGION_FLANK

Length of flanking region up and down stream of target regions to be sequenced (this step take place after -clr). [0]

-pr Select if paired-end sequencing

-q QUALITY_SCORE_OFFSET

Quality score offset for short reads simulation [33]

-clr CONNECT_LEN_BETWEEN_REGIONS

Maximum length bwtween target regions to connect the target regions.

-m MODEL GemSIM error model file (.gzip, need absolute path) [Illumina_HiSeq_2500_p]

Arguments for general settings:

-o OUTPUT_DIR Output directory [simulator_output]

-rn REARRANGED_OUTPUT_NAME

Prefix of the rearranged outputs (do not include directory name) [test]

-n NUM_SAMPLES Number of test samples to be generated [1]

-sc Simulation for control genome

-ssr Simulate short reads (fastq) files

-sb Simulate BAM files

-picard PATH_TO_PICARD

Absolute path to picard

-GATK PATH_TO_GATK Absolute path to GATK

実行方法

ランするには、fasta形式のリファレンスゲノム配列、ターゲット領域のファイルを指定する。このファイルにば、染色体名（リファレンスと一致）、開始、終了のポジションが書かれていなければならない。-e_cnvターゲット領域内のCNVの発生数を指定するタブ区切りのファイルも指定する。-o_cnvではとターゲット領域外のCNVの発生数を指定する。

python2 SECNVs.py -G reference.fasta -T target.tsv -e_cnv in.tsv -o_tol 20 -eN all -o out

-G Reference genome FASTA file
-T Target region file
-e_cnv A user-defined list of CNVs overlapping with target regions
-o_tol Total number of CNVs outside of target regions to be generated across the genome (estimate)
-eN {none, gap, all} Exclude sequences of "N"s for CNV simulation? [none]