Updated April 28, 2021
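Methods for calling variants from sequencing data have moved beyond single nucleotide polymorphisms (SNPs) and now target short insertions and deletions (indels), short tandem repeats (STRs), multi-nucleotide polymorphisms (MNPs), and structural variants (SVs). These different classes of variants are usually represented in the Variant Call Format (VCF) (Danecek et al., 2011), which provides a format for storing calls of many variant types produced by many different tools.
Because different sequence analysis tools often represent the same sequence variant in different ways within a VCF file, integrating and comparing variants across call sets is not trivial. Nevertheless, the effect of ambiguous variant representation on data analysis has not been demonstrated, and there are no standard guidelines for representing variants consistently.
This paper provides a formal definition of, and an algorithm for, normalization. With the authors' definition and algorithm, a variant can be represented in a clear and unique way. The authors show that existing variant calling tools often fail to represent complex variants consistently. Finally, they show how the normalization method helped when integrating the different variant call sets of the 1000 Genomes Project (1000 Genomes Project Consortium, 2012).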
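The normalization described above comes down to making every variant left-aligned and parsimonious (no redundant bases on either side). The following is a minimal Python sketch of that idea, not vt's actual code; the function name `normalize_variant`, its signature, and the toy reference sequence are assumptions made for illustration.

```python
def normalize_variant(ref_seq, pos, alleles):
    """Return (pos, alleles) left-aligned and parsimonious.

    ref_seq : reference sequence of the chromosome (string)
    pos     : 1-based position of the first base of the alleles
    alleles : [REF, ALT1, ...]; at least two alleles must differ
    """
    alleles = list(alleles)
    changed = True
    while changed:
        changed = False
        # Every allele ends with the same base: drop that base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # An allele became empty: extend all alleles one base to the left.
        if not all(alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            alleles = [ref_seq[pos - 1] + a for a in alleles]
            changed = True
    # Drop shared leading bases while every allele keeps at least one base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# Two ways of writing the same CA deletion in a toy reference
ref = "GGGCACACACAGGG"
print(normalize_variant(ref, 7, ["ACA", "A"]))  # (3, ['GCA', 'G'])
print(normalize_variant(ref, 5, ["ACA", "A"]))  # (3, ['GCA', 'G'])
```

Both inputs collapse to the same record, which is what makes duplicate removal and call set comparison possible afterwards.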
Installation
Tested on Ubuntu 16.04 in a miniconda3 4.0.5 environment.
Source: GitHub
git clone https://github.com/atks/vt.git
cd vt
make -j 8
make test
# or use conda
conda install -c bioconda -y vt
> vt
# vt
Help page on http://statgen.sph.umich.edu/wiki/Vt
Useful tools:
view view vcf/vcf.gz/bcf files
index index vcf.gz/bcf files
normalize normalize variants
decompose decompose variants
uniq drop duplicate variants
cat concatenate VCF files
paste paste VCF files
sort sort VCF files
subset subset VCF file to variants polymorphic in a sample
peek summary of variants in the vcf file
partition partition variants
multi_partition partition variants from multiple VCF files
annotate_variants annotate variants
annotate_db_rsid annotate variants with dbSNP rsid
annotate_1000g annotate variants with 1000 Genomes variants
annotate_regions annotate regions
compute_concordance compute genotype concordance between 2 call sets
compute_features compute genotype likelihood based statistics
discover discover variants
genotype genotype variants
vt can also be installed with Homebrew; see the wiki for the steps.
Test
> make test
test/test.sh
++++++++++++++++++++++
Tests for vt normalize
++++++++++++++++++++++
testing normalize
output VCF file : ok
output logs : ok
+++++++++++++++++++++++++++++++
Tests for vt decompose_blocksub
+++++++++++++++++++++++++++++++
testing decompose_blocksub of even-length blocks
output VCF file : ok
output logs : ok
testing decompose_blocksub with alignment
output VCF file : ok
output logs : ok
testing decompose_blocksub of phased even-length blocks
output VCF file : ok
output logs : ok
++++++++++++++++++++++
Tests for vt decompose
++++++++++++++++++++++
testing decompose for a triallelic variant
output VCF file : ok
output logs : ok
Passed tests : 5 / 5
Usage
view - view vcf/vcf.gz/bcf files
View a BCF file.
vt view -h input.bcf
index - index vcf.gz/bcf files
Build an index.
#bcf
vt index input.bcf
#vcf.gz
vt index input.vcf.gz
peek - Summarizes the variants in a VCF file.
Print a summary of a VCF file.
vt peek input.vcf.gz
Example output
options: input VCF file -
stats: no. of samples : 21
no. of chromosomes : 1
========== Micro variants ==========
no. of SNP : 1494
2 alleles : 1492 (1.99) [993/499]
3 alleles : 2 (1.00) [2/2]
no. of MNP : 2
2 alleles : 2 (2.00) [4/2]
no. of INDEL : 60
2 alleles : 60 (0.43) [18/42]
no. of SNP/MNP : 79
2 alleles : 79 (3.16) [60/19]
no. of SNP/INDEL : 1
3 alleles : 1 (0.00) [0/1] (inf) [1/0]
no. of micro variants : 1636
++++++ Other useful categories +++++
no. of block substitutions : 81
2 alleles : 81 (3.05) [64/21]
no. of complex substitutions : 1
3 alleles : 1 (0.00) [0/1] (inf) [1/0]
========= General summary ==========
no. of VCF records : 1636
Time elapsed: 0.04s
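In the SNP lines of the summary above, the value in parentheses matches the ratio of the two counts in brackets (993/499 ≈ 1.99). Reading that pair as transitions/transversions is an assumption here, since the column labels are not shown in the output. Under that assumption, the ratio could be reproduced with a small Python sketch like this one (the function name `ts_tv` is made up for illustration):

```python
# Transitions are purine<->purine or pyrimidine<->pyrimidine changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def ts_tv(snps):
    """snps: list of (ref_base, alt_base) pairs. Returns (ts, tv, ratio)."""
    ts = sum(1 for ref, alt in snps if (ref.upper(), alt.upper()) in TRANSITIONS)
    tv = len(snps) - ts
    # With no transversions the ratio is reported as infinity,
    # similar to the "(inf) [1/0]" entries in the summary.
    ratio = ts / tv if tv else float("inf")
    return ts, tv, ratio
```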
cat - Concatenates VCF files.
Concatenate VCF files.
#concatenates chr1.mills.bcf and chr2.mills.bcf
vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf
filter - Filters variants in a VCF file.
Filter a VCF file.
#adds a filter tag "refA" for variants whose REF column is the single base A.
vt filter in.bcf -f "REF=='A'" -d "refA"
The sections below are still being written.
sort - sort VCF files
#sorts mills.bcf and outputs to standard out in a 1000bp window.
vt sort -m local mills.bcf
#sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf
vt sort -m local -w 10000 mills.bcf -o out.bcf
#sorts an indexed mills.bcf with chromosomes not sorted in the contig order in the header
vt sort -m chrom mills.bcf -o out.bcf
#sorts mills.bcf with no assumption
vt sort mills.bcf -o out.bcf
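Sorting within a local window suits files that are almost sorted, where no record is displaced by more than the window size. How vt implements this is not covered here; the Python sketch below only illustrates the general idea with a bounded min-heap, and the name `local_sort` is hypothetical.

```python
import heapq


def local_sort(positions, window):
    """Yield positions in ascending order, assuming that no position
    appears more than `window` bp after a larger one has been read."""
    heap = []
    max_seen = None
    for pos in positions:
        heapq.heappush(heap, pos)
        max_seen = pos if max_seen is None else max(max_seen, pos)
        # Anything further than `window` behind the largest position
        # seen so far can no longer be overtaken, so emit it.
        while heap and heap[0] < max_seen - window:
            yield heapq.heappop(heap)
    # Flush what is left in the buffer.
    while heap:
        yield heapq.heappop(heap)
```

Memory use stays proportional to the number of records inside one window, which is the reason to prefer this over a full sort when the input is nearly ordered.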
normalize - normalize variants
#normalize variants and write out to dbsnp.normalized.vcf
vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf
#normalize variants, send to standard out and remove duplicates.
vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf
#read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out.
vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N"
decompose - decompose variants
#decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf
vt decompose gatk.vcf -o gatk.decomposed.vcf
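As a rough illustration of what decomposing a multiallelic record means, the Python sketch below splits the ALT column of a sites-only VCF line into one line per ALT allele. It is a simplification under stated assumptions: per-allele INFO values and genotype columns are not rewritten, and the function name `decompose_sites` is made up for this example.

```python
def decompose_sites(vcf_line):
    """Split one tab-separated VCF data line into biallelic lines.

    Only the first 8 columns (CHROM..INFO) are kept; INFO is copied
    unchanged to every output line.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    out = []
    for alt in fields[4].split(","):
        record = list(fields[:8])
        record[4] = alt  # one ALT allele per output record
        out.append("\t".join(record))
    return out
```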
uniq - drop duplicate variants
#drop duplicate variants and save output in mills.uniq.vcf
vt uniq mills.vcf -o mills.uniq.vcf
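Dropping duplicates only works reliably once variants are normalized, because two records count as duplicates when chromosome, position, REF and ALT all agree. A minimal Python sketch of that comparison follows (the helper name `drop_duplicates` and the tuple layout are assumptions; this is not vt's implementation):

```python
def drop_duplicates(records):
    """records: iterable of (chrom, pos, ref, alt) tuples.
    Keeps the first occurrence of each variant and preserves order."""
    seen = set()
    kept = []
    for record in records:
        if record not in seen:
            seen.add(record)
            kept.append(record)
    return kept
```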
paste - paste VCF files
#paste together genotypes from the CEU trio into one file.
vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf
rminfo - Removes INFO tags from a VCF file.
#removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP
vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf
filter_overlap - Tags overlapping variants in a VCF file with the FILTER flag overlap.
#adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence.
vt filter_overlap in.bcf -w 1 -o out.bcf
validate - Checks the following properties of a VCF file:
1. order
2. reference sequence consistency
#validates lobstr.bcf
vt validate lobstr.bcf
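The two properties listed above can be pictured with a short Python sketch. It is an illustration only: the function name `validate_records`, the in-memory reference dictionary, and the wording of the messages are assumptions, and the order of chromosomes relative to each other is not checked.

```python
def validate_records(records, reference):
    """records   : list of (chrom, pos, ref) in file order, pos is 1-based
    reference : dict mapping chromosome name to its sequence
    Returns a list of (chrom, pos, problem) tuples."""
    problems = []
    last_pos = {}
    for chrom, pos, ref in records:
        # 1. order: positions must not decrease within a chromosome
        if chrom in last_pos and pos < last_pos[chrom]:
            problems.append((chrom, pos, "out of order"))
        last_pos[chrom] = max(pos, last_pos.get(chrom, pos))
        # 2. reference consistency: REF must equal the reference bases
        expected = reference[chrom][pos - 1:pos - 1 + len(ref)]
        if expected.upper() != ref.upper():
            problems.append((chrom, pos, "REF mismatch"))
    return problems
```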
info2tab - Extract INFO fields to a tab delimited file
#converts in.bcf to tab format with selected INFO and FILTER fields
vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT
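Turning INFO fields into a tab-delimited row is mostly a matter of parsing the `key=value;flag` syntax of the INFO column. The Python sketch below shows one way to do that; writing flags as `1` and missing tags as `.` are choices made for this example, not documented vt behaviour, and `info_to_row` is a hypothetical name.

```python
def info_to_row(info, tags, missing="."):
    """Return the values of `tags` from a VCF INFO string as one
    tab-delimited row, in the order the tags were requested."""
    parsed = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # A tag without "=" is a flag; record it as "1" (assumption).
        parsed[key] = value if sep else "1"
    return "\t".join(parsed.get(tag, missing) for tag in tags)
```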
partition - Partition variants from two data sets.
#partitions all variants in bi1.bcf and bi2.bcf
vt partition bi1.bcf bi2.bcf
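Partitioning two call sets means counting the variants found only in the first file, in both files, and only in the second. A minimal Python sketch of that set logic follows, assuming both inputs have already been normalized so that identical variants compare equal (the function name and the dictionary keys are made up for illustration):

```python
def partition_variants(a, b):
    """a, b: iterables of (chrom, pos, ref, alt) tuples.
    Returns the sizes of the three partitions."""
    a, b = set(a), set(b)
    return {
        "A-B": len(a - b),  # only in the first call set
        "A&B": len(a & b),  # shared by both call sets
        "B-A": len(b - a),  # only in the second call set
    }
```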
multi_partition - Partitions variants found in VCF files.
#partitions variants n-ways
vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf
annotate_regions - Annotates regions in a VCF file.
#annotates the variants that overlap with coding regions.
vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region"
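Annotating variants against a BED file requires care with coordinates: VCF positions are 1-based, while BED intervals are 0-based and half-open. The Python sketch below shows the overlap test under the assumption that a variant occupies the span of its REF allele; whether vt uses exactly this definition is not confirmed here, and `overlaps_region` is a hypothetical helper.

```python
def overlaps_region(pos, ref, regions):
    """pos     : 1-based VCF position
    ref     : REF allele (its length defines the span of the variant)
    regions : list of (start, end) BED intervals, 0-based half-open"""
    start0 = pos - 1            # convert to 0-based
    end0 = start0 + len(ref)    # half-open end of the variant
    return any(start0 < r_end and r_start < end0 for r_start, r_end in regions)
```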
annotate_variants - Annotates variants in a VCF file.
#annotates the variants found in mills.vcf
vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz
compute_features - Compute features in a VCF file.
#compute features for the variants found in vt.vcf
#requires GT, PL and DP
vt compute_features vt.vcf
estimate - Compute variant based estimates.
#compute the AF and MLEAF estimates for the variants found in vt.vcf
#requires GT and PL
vt estimate -e AF,MLEAF vt.vcf
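For orientation, an allele frequency can be counted directly from hard genotype calls. The Python sketch below does only that; it is not the likelihood-based MLEAF estimate, which would need the PL field, and the function name `allele_frequency` is an assumption for this example.

```python
def allele_frequency(genotypes):
    """genotypes: list of GT strings such as "0/1", "1|1" or "./.".
    Returns the fraction of called alleles that are non-reference,
    or None when no allele is called."""
    alt = total = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing allele, not counted
            total += 1
            if allele != "0":
                alt += 1
    return alt / total if total else None
```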
profile_mendelian - Profile Mendelian errors.
#profile mendelian errors found in vt.genotypes.bcf, generate tables in the directory mendel, requires pdflatex.
vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel
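A Mendelian error is a child genotype that cannot be formed by taking one allele from each parent. The Python sketch below checks this for a single diploid, autosomal site with no missing calls; it is a simplified illustration rather than vt's profiling logic, and `is_mendelian_consistent` is a hypothetical name.

```python
def _alleles(gt):
    """Split a diploid GT string such as "0/1" or "1|0" into two alleles."""
    alleles = gt.replace("|", "/").split("/")
    if len(alleles) != 2 or "." in alleles:
        raise ValueError("expected a called diploid genotype: %r" % gt)
    return alleles


def is_mendelian_consistent(child, father, mother):
    """True if the child could inherit one allele from each parent."""
    a, b = _alleles(child)
    f = _alleles(father)
    m = _alleles(mother)
    return (a in f and b in m) or (a in m and b in f)
```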
profile_snps - Profile SNPs.
#profile snps found in 20.sites.vcf
vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20
profile_indels - Profile indels.
#profile indels found in mills.vcf
vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20
profile_vntrs - Profile VNTRs.
#profiles a set of VNTRs
vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt
profile_na12878 - Profile NA12878.
#profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20.
vt profile_na12878 vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20
discover - Discovers variants from reads in a BAM/CRAM file.
#discover variants from NA12878.bam and write to stdout
vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20
merge_candidate_variants - Merge candidate variants across samples.
#merge candidate variants from the VCFs listed in candidates.txt and output in candidate.sites.vcf
vt merge_candidate_variants candidates.txt -o candidate.sites.vcf
remove_overlap - Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap.
#annotates variants that are overlapping
vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged.vcf
annotate_indels - Annotates indels with VNTR information and adds a VNTR record.
#annotates indels from VCFs with VNTR information.
vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf
construct_probes - Construct probes for genotyping a variant.
#construct probes from candidates.sites.bcf and output to standard out
vt construct_probes candidates.sites.bcf -r ref.fa
genotype - Genotypes variants for each sample.
#genotypes variants found in candidates.sites.vcf from sample.bam
vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf
Reference
Unified representation of genetic variants
Tan A, Abecasis GR, Kang HM
Bioinformatics. 2015 Jul 1;31(13):2202-4
Related