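From the SVGen documentation:
Existing simulation tools for structural variants (SVs) have limitations: some do not simulate single-nucleotide variants (SNVs), and some need an external program to generate the simulated sequencing reads used to benchmark SV caller software. To address these limitations, the authors developed SVGen, a tool that introduces various types of germline and somatic SVs and offers a wide range of options for simulating short and long reads. SVGen's output includes a BED file with the SV regions, FASTA files with the simulated genome sequences, and FASTQ files.
Manual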
http://svgen.openbioinformatics.org/en/latest/user-guide/manual/
Installation
https://github.com/WGLab/SVGen
git clone https://github.com/WGLab/SVGen/
cd SVGen
$ python insert_SNVs.py -h
usage: insert_SNVs.py [-h] --fasta_input input.fasta --fasta_output
output.fasta --freq_file FREQ_FILE --vcf_output
VCF_OUTPUT --chrom chrom
Get arguments to create single-nucleotide variants (SNVs) in a fasta file.
optional arguments:
-h, --help show this help message and exit
--fasta_input input.fasta
Input fasta file to be used as reference to receive
SNVs.
--fasta_output output.fasta
Output fasta file with random SNVs, based in
frequencies.
--freq_file FREQ_FILE
Text file SNV frequencies.
--vcf_output VCF_OUTPUT
VCF file generated with SNVs inserted.
--chrom chrom Chromosome.
$ python simulate_SV_BED.py -h
usage: simulate_SV_BED.py [-h] --chrom_lens chrom_lengths_file --output
output_bed_file --gaps gaps_file --chroms
chromosome_names [--del_lens del_lengths_file]
[--dup_lens dup_lengths_file]
[--inv_lens inv_lengths_file]
[--bal_trans_lens bal_trans_lengths_file]
[--unb_trans_lens unb_trans_lengths_file]
[--chroms_trans chroms_trans]
[--distance distance_between_SVs]
[--dist_sd distance_sd] [-v]
Get arguments to create random structural variant regions in a BED file.
optional arguments:
-h, --help show this help message and exit
--chrom_lens chrom_lengths_file
Text file with chromosome lengths.
--output output_bed_file, -o output_bed_file
BED output file.
--gaps gaps_file BED file with regions to avoid (centromeres and
telomeres).
--chroms chromosome_names
Chromosome names (range).
--del_lens del_lengths_file
Text file with deletion lengths.
--dup_lens dup_lengths_file
Text file with duplication lengths.
--inv_lens inv_lengths_file
Text file with inversion lengths.
--bal_trans_lens bal_trans_lengths_file
Text file with balanced translocation lengths.
--unb_trans_lens unb_trans_lengths_file
Text file with unbalanced translocation lengths.
--chroms_trans chroms_trans
Chromosomes from which translocations will come from.
--distance distance_between_SVs, -d distance_between_SVs
Distance between SVs in a countinuous (ungapped)
region.
--dist_sd distance_sd, -sd distance_sd
Standard deviation of distance between SVs in a
countinuous (ungapped) region.
-v, --verbose
$ python insert_SVs.py -h
usage: insert_SVs.py [-h] --fasta_input input.fasta --fasta_output
output.fasta --bed SVs.bed --chrom_lens
chrom_lengths_file --chrom chromosome_name
[--fasta_label fasta_label] [-v] [--overlap]
This program inserts structural variants from a BED file into a FASTA file.
optional arguments:
-h, --help show this help message and exit
--fasta_input input.fasta, -i input.fasta
Fasta file to be changed with SVs.
--fasta_output output.fasta, -o output.fasta
Fasta file to be created with SVs.
--bed SVs.bed BED file with SVs to be inserted.
--chrom_lens chrom_lengths_file
Text file with chromosome lengths.
--chrom chromosome_name
Chromosome.
--fasta_label fasta_label
Name to label fasta sequence.
-v, --verbose
--overlap
$ python create_reads.py -h
usage: create_reads.py [-h] --fasta_input input.fasta --output_file
output.fq|output.bam --cov coverage --read_len
avg_read_len [-pe] [--ins_rate insertion_rate]
[--del_rate deletion_rate] [--snp_rate snp_rate]
[--insert_size insert_size] [--insert_sd insert_sd]
[--alpha alpha] [--beta beta] [--read_label READ_LABEL]
[--fast_sample fast_sample] [-v]
This program inserts structural variants from a BED file into a FASTA file.
optional arguments:
-h, --help show this help message and exit
--fasta_input input.fasta, -i input.fasta
Fasta file to be changed with SVs.
--output_file output.fq|output.bam, -o output.fq|output.bam
Output file for reads. It must finish with .fq/.fastq
or .bam (paired-end option will automatically change
file names to output1.fq and output2.fq).
--cov coverage Average coverage.
--read_len avg_read_len
Average read length (length is fix for short reads).
-pe Add option to generate paired-end reads.
--ins_rate insertion_rate
Insertion error rate for reads.
--del_rate deletion_rate
Deletion error rate for reads.
--snp_rate snp_rate SNP error rate for reads.
--insert_size insert_size
Insert size for short reads.
--insert_sd insert_sd
Insert standard deviation for short reads.
--alpha alpha Alpha for beta distribution of read lengths.
--beta beta Beta for beta distribution of read lengths.
--read_label READ_LABEL
Label to add in each read.
--fast_sample fast_sample
Number of quality score strings, or sets of errors,
for reads. Setting a number for this variable will
make the process of creating quality scores faster.
-v, --verbose
Run
Download the databases. This is only needed the first time. The script downloads genome data from Ensembl and an allele frequency database from ANNOVAR.
./download_and_format_database.sh hg38
Simulating SNVs
Specify the reference and the allele frequency database downloaded in the previous step.
python insert_SNVs.py --fasta_input reference/hg38/chr22.fa --fasta_output chr22.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom22.txt --chrom 22 --vcf_output chr22.SNV.vcf
- --fasta_input Input fasta file to be used as reference to receive SNVs.
- --fasta_output Output fasta file with random SNVs, based in frequencies.
- --freq_file Text file SNV frequencies.
- --chrom Chromosome.
- --vcf_output VCF file generated with SNVs inserted.
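The positions of the inserted SNVs are recorded in the VCF file written alongside the FASTA, so the VCF can serve as a positive control (truth set).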
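Since the VCF acts as the truth set, a variant caller's output can be scored against it. The following is only a minimal sketch: it assumes the VCF uses the standard tab-separated CHROM, POS, ID, REF, ALT columns, and the two records and the `called` set are made-up illustrations, not real SVGen or caller output.

```python
import io


def read_vcf_positions(handle):
    """Return a set of (chrom, pos, ref, alt) tuples from VCF text."""
    truth = set()
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, _id, ref, alt = fields[:5]
        truth.add((chrom, int(pos), ref, alt))
    return truth


# Hypothetical two-record VCF, used only to exercise the parser.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "22\t16050075\t.\tA\tG\t.\t.\t.\n"
    "22\t16050115\t.\tG\tA\t.\t.\t.\n"
)
truth = read_vcf_positions(example)

# Hypothetical caller output: one true positive and one false positive.
called = {("22", 16050075, "A", "G"), ("22", 17000000, "C", "T")}
true_positives = truth & called
recall = len(true_positives) / len(truth)
precision = len(true_positives) / len(called)
print(recall, precision)
```

For a real run, open chr22.SNV.vcf with `open()` in place of the `io.StringIO` example.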
Simulating SVs
First, create a BED file that specifies the SV regions.
python simulate_SV_BED.py --dup_lens SV_lengths.txt --del_lens SV_lengths.txt --inv_lens SV_lengths.txt --bal_trans_lens SV_lengths.txt --unb_trans_lens SV_lengths.txt --chroms 22 --chroms_trans 1-10 --chrom_lens reference/chrom_lengths_hg38.txt --gaps reference/gaps_hg38.txt -o SVs.bed
- --dup_lens Text file with duplication lengths.
- --del_lens Text file with deletion lengths.
- --inv_lens Text file with inversion lengths.
- --bal_trans_lens Text file with balanced translocation lengths.
- --unb_trans_lens Text file with unbalanced translocation lengths.
- --chrom_lens Text file with chromosome lengths.
- --gaps BED file with regions to avoid (centromeres and telomeres).
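- --chroms Chromosome names (range).
- --chroms_trans Chromosomes from which translocations will come from.
- -o BED output file.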
A BED file recording the SV positions is written.
1 13119169 13120668 baltr 22:14529190-14530689
1 122588673 122590672 baltr 22:12135573-12137572
2 265428 269427 baltr 22:23797887-23801886
2 91518961 91558960 baltr 22:26506599-26546598
3 91665436 91765435 baltr 22:24347132-24447131
3 93817281 93823280 baltr 22:25907506-25913505
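The rows above can be read into simple records for inspection. This sketch assumes the columns are chromosome, start, end, SV type, and a partner region in `chrom:start-end` form; that reading is inferred from the sample rows, not taken from the SVGen manual, and the names `SVRecord`, `parse_sv_bed_line`, and `parse_region` are hypothetical helpers.

```python
from typing import NamedTuple


class SVRecord(NamedTuple):
    chrom: str
    start: int
    end: int
    sv_type: str
    partner: str  # fifth column kept as raw text, empty if absent


def parse_sv_bed_line(line):
    """Parse one whitespace-separated line of the SV BED file."""
    fields = line.split()
    chrom, start, end, sv_type = fields[:4]
    partner = fields[4] if len(fields) > 4 else ""
    return SVRecord(chrom, int(start), int(end), sv_type, partner)


def parse_region(region):
    """Split 'chrom:start-end' into (chrom, start, end)."""
    chrom, span = region.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)


# Two rows copied from the sample output above.
lines = [
    "1 13119169 13120668 baltr 22:14529190-14530689",
    "3 91665436 91765435 baltr 22:24347132-24447131",
]
records = [parse_sv_bed_line(line) for line in lines]
for rec in records:
    p_chrom, p_start, p_end = parse_region(rec.partner)
    print(rec.chrom, rec.end - rec.start, p_chrom, p_end - p_start)
```

In the sample rows the interval in columns 2 and 3 and the interval in column 5 span the same number of bases, which is what one would expect for a balanced translocation.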
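Next, SVs are introduced into chr22.SNV.fa. Because interchromosomal rearrangements involving other chromosomes will also be introduced into chr22, FASTA files with simulated SNVs must first be created for those other chromosomes as well. That is, run the following: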
python insert_SNVs.py --fasta_input reference/hg38/chr1.fa --fasta_output chr1.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom1.txt --chrom 1 --vcf_output chr1.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr2.fa --fasta_output chr2.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom2.txt --chrom 2 --vcf_output chr2.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr3.fa --fasta_output chr3.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom3.txt --chrom 3 --vcf_output chr3.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr4.fa --fasta_output chr4.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom4.txt --chrom 4 --vcf_output chr4.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr5.fa --fasta_output chr5.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom5.txt --chrom 5 --vcf_output chr5.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr6.fa --fasta_output chr6.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom6.txt --chrom 6 --vcf_output chr6.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr7.fa --fasta_output chr7.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom7.txt --chrom 7 --vcf_output chr7.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr8.fa --fasta_output chr8.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom8.txt --chrom 8 --vcf_output chr8.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr9.fa --fasta_output chr9.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom9.txt --chrom 9 --vcf_output chr9.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chr10.fa --fasta_output chr10.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chrom10.txt --chrom 10 --vcf_output chr10.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chrY.fa --fasta_output chrY.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chromY.txt --chrom Y --vcf_output chrY.SNV.vcf &
python insert_SNVs.py --fasta_input reference/hg38/chrX.fa --fasta_output chrX.SNV.fa --freq_file reference/hg38_AFR.sites.2015_08.chromX.txt --chrom X --vcf_output chrX.SNV.vcf &
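Each command writes an SNV-containing FASTA file for its chromosome. The trailing & runs each job in the background, so wait for all of them to finish before moving on.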
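The per-chromosome commands differ only in the chromosome name, so they can be generated in a loop. The sketch below builds the same argument lists as the commands above; `insert_snv_command` and its `population` parameter are hypothetical names introduced here, and the file layout is simply the one used in this post.

```python
import shlex


def insert_snv_command(chrom, population="AFR"):
    """Build the insert_SNVs.py argument list for one chromosome."""
    return [
        "python", "insert_SNVs.py",
        "--fasta_input", f"reference/hg38/chr{chrom}.fa",
        "--fasta_output", f"chr{chrom}.SNV.fa",
        "--freq_file",
        f"reference/hg38_{population}.sites.2015_08.chrom{chrom}.txt",
        "--chrom", str(chrom),
        "--vcf_output", f"chr{chrom}.SNV.vcf",
    ]


# Chromosomes 1-10 plus X and Y, matching the commands listed above.
chromosomes = [str(n) for n in range(1, 11)] + ["X", "Y"]
commands = [insert_snv_command(c) for c in chromosomes]
for cmd in commands:
    print(shlex.join(cmd))
```

To execute the commands rather than print them, each list can be passed to `subprocess.run(cmd, check=True)`, which runs them one at a time and stops on the first failure.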
Introduce the SVs based on the BED file.
python insert_SVs.py -i chr22.SNV.fa -o chr22.SNV.SV.fa --chrom_lens reference/chrom_lengths_hg38.txt --chrom 22 --bed SVs.bed -v
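To make concrete what introducing an SV into a sequence means, here is a toy illustration on a short string. It is not SVGen's implementation: the type labels `del`, `dup`, and `inv`, the 0-based half-open coordinates, and the function names are all assumptions made for this sketch.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(comp[base] for base in reversed(seq))


def apply_sv(seq, start, end, sv_type):
    """Apply one SV to seq over the 0-based half-open interval [start, end)."""
    segment = seq[start:end]
    if sv_type == "del":
        # Deletion: drop the segment.
        return seq[:start] + seq[end:]
    if sv_type == "dup":
        # Tandem duplication: repeat the segment right after itself.
        return seq[:end] + segment + seq[end:]
    if sv_type == "inv":
        # Inversion: replace the segment with its reverse complement.
        return seq[:start] + reverse_complement(segment) + seq[end:]
    raise ValueError(f"unsupported SV type: {sv_type}")


reference = "GATTACA"
print(apply_sv(reference, 1, 3, "del"))
print(apply_sv(reference, 1, 3, "dup"))
print(apply_sv(reference, 1, 4, "inv"))
```

Translocations are left out because they need sequence from a second chromosome, which is why the SNV FASTA files for the other chromosomes are prepared beforehand.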
Generating reads
Now that a reference containing both SNVs and SVs is ready, use it as the template to generate reads.
python create_reads.py -pe -i chr22.SNV.SV.fa -o reads.fq --cov 10 --read_len 100 --snp_rate 0.01 --del_rate 0.0001 --ins_rate 0.0001
- -pe Add option to generate paired-end reads.
- -i input.fasta
- -o output.fq|output.bam
- --cov Average coverage.
- --read_len Average read length (length is fix for short reads).
- --snp_rate SNP error rate for reads.
- --del_rate Deletion error rate for reads.
- --ins_rate Insertion error rate for reads.
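As a rough sanity check on output size, the usual relation coverage = reads × read length / template length can be inverted to estimate how many reads to expect. This is the generic textbook formula, not necessarily how create_reads.py counts reads internally, and the template length below is a hypothetical round number rather than the actual size of the simulated chromosome.

```python
import math


def reads_needed(template_length, coverage, read_length, paired_end=False):
    """Estimate reads (or read pairs) needed to reach the given coverage."""
    total_reads = math.ceil(coverage * template_length / read_length)
    if paired_end:
        # Each pair contributes two reads, so halve the count.
        return math.ceil(total_reads / 2)
    return total_reads


template_length = 50_000_000  # hypothetical template size in bases
single = reads_needed(template_length, coverage=10, read_length=100)
pairs = reads_needed(template_length, coverage=10, read_length=100,
                     paired_end=True)
print(single, pairs)
```

Comparing this estimate with the number of records in the generated FASTQ files is a quick way to confirm that the run finished with the intended coverage.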
Citation
SVGen: simulation of structural variants in next-generation sequencing data
Lima L, Yang H, Wang K.
http://svgen.openbioinformatics.org/en/latest/