Manipulating VCF files with vt - macでインフォマティクス

Updated April 28, 2021

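Methods for calling variants from sequencing data now target more than single nucleotide polymorphisms (SNPs): short insertions and deletions (indels), short tandem repeats (STRs), multi-nucleotide polymorphisms (MNPs), and structural variants (SVs). These different classes of variants are usually represented in the Variant Call Format (VCF) (Danecek et al., 2011), which provides a format for storing calls of diverse variant types produced by different tools.

Different sequence analysis tools often represent the same sequence variant in different ways within a VCF file, which makes it difficult to consolidate and compare variants across call sets. However, the impact of ambiguous variant representation on data analysis has not been demonstrated, and there are no standard guidelines for representing variants consistently.

This paper provides a formal definition of normalization and an algorithm for it. With the authors' definition and algorithm, each variant can be represented in an unambiguous and unique way. The authors show that existing variant calling tools often do not represent complex variants consistently. Finally, they show how the normalization method helped integrate different variant call sets in the 1000 Genomes Project (1000 Genomes Project Consortium, 2012).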

wiki

Vt - Genome Analysis Wiki

Installation

Tested on Ubuntu 16.04 in a Miniconda3 4.0.5 environment.

Source code: GitHub

git clone https://github.com/atks/vt.git
cd vt
make -j 8
make test

# or use conda
conda install -c bioconda -y vt

> vt

# vt
Help page on http://statgen.sph.umich.edu/wiki/Vt

Useful tools:
view                  view vcf/vcf.gz/bcf files
index                 index vcf.gz/bcf files
normalize             normalize variants
decompose             decompose variants
uniq                  drop duplicate variants
cat                   concatenate VCF files
paste                 paste VCF files
sort                  sort VCF files
subset                subset VCF file to variants polymorphic in a sample
peek                  summary of variants in the vcf file
partition             partition variants
multi_partition       partition variants from multiple VCF files
annotate_variants     annotate variants
annotate_db_rsid      annotate variants with dbSNP rsid
annotate_1000g        annotate variants with 1000 Genomes variants
annotate_regions      annotate regions
compute_concordance   compute genotype concordance between 2 call sets
compute_features      compute genotype likelihood based statistics
discover              discover variants
genotype              genotype variants

vt can also be installed with Homebrew; see the wiki for the steps.

Tests

> make test

test/test.sh
++++++++++++++++++++++
Tests for vt normalize
++++++++++++++++++++++
testing normalize
output VCF file : ok
output logs : ok
+++++++++++++++++++++++++++++++
Tests for vt decompose_blocksub
+++++++++++++++++++++++++++++++
testing decompose_blocksub of even-length blocks
output VCF file : ok
output logs : ok
testing decompose_blocksub with alignment
output VCF file : ok
output logs : ok
testing decompose_blocksub of phased even-length blocks
output VCF file : ok
output logs : ok
++++++++++++++++++++++
Tests for vt decompose
++++++++++++++++++++++
testing decompose for a triallelic variant
output VCF file : ok
output logs : ok

Passed tests : 5 / 5

Usage

view - view vcf/vcf.gz/bcf files

View a BCF file.

vt view -h input.bcf

index - index vcf.gz/bcf files

Create an index.

#bcf
vt index input.bcf

#vcf.gz
vt index input.vcf.gz

peek - Summarizes the variants in a VCF file.

Print a summary of a VCF file.

vt peek input.vcf.gz

Example output

options: input VCF file -

stats: no. of samples                : 21
       no. of chromosomes            : 1

       ========== Micro variants ==========
       no. of SNP                    : 1494
           2 alleles                 : 1492 (1.99) [993/499]
           3 alleles                 : 2 (1.00) [2/2]
       no. of MNP                    : 2
           2 alleles                 : 2 (2.00) [4/2]
       no. of INDEL                  : 60
           2 alleles                 : 60 (0.43) [18/42]
       no. of SNP/MNP                : 79
           2 alleles                 : 79 (3.16) [60/19]
       no. of SNP/INDEL              : 1
           3 alleles                 : 1 (0.00) [0/1] (inf) [1/0]
       no. of micro variants         : 1636

       ++++++ Other useful categories +++++
       no. of block substitutions    : 81
           2 alleles                 : 81 (3.05) [64/21]
       no. of complex substitutions  : 1
           3 alleles                 : 1 (0.00) [0/1] (inf) [1/0]

       ========= General summary ==========
       no. of VCF records            : 1636

Time elapsed: 0.04s
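In the summary above, the value in parentheses appears to be the ratio of the two bracketed counts (presumably transitions/transversions for SNPs and insertions/deletions for indels, although the output itself does not label them). The Python sketch below only checks that arithmetic on a line of the output. The function name `check_ratio` and the regular expression are illustrative assumptions, not part of vt.

```python
import re

# Matches the first "(ratio) [a/b]" group on an allele line of the peek output,
# e.g. "2 alleles : 1492 (1.99) [993/499]". The pattern is an assumption
# based on the example output shown in this post.
ALLELE_LINE = re.compile(
    r"(\d+) alleles\s*:\s*(\d+) \(([\d.]+|inf)\) \[(\d+)/(\d+)\]"
)

def check_ratio(line):
    """Return (reported_ratio, recomputed_ratio) as strings for one line."""
    m = ALLELE_LINE.search(line)
    if m is None:
        raise ValueError("not an allele summary line: " + line)
    reported = m.group(3)
    a, b = int(m.group(4)), int(m.group(5))
    # peek prints "inf" when the denominator is zero
    recomputed = "inf" if b == 0 else f"{a / b:.2f}"
    return reported, recomputed

for line in [
    "2 alleles : 1492 (1.99) [993/499]",
    "2 alleles : 60 (0.43) [18/42]",
]:
    reported, recomputed = check_ratio(line)
    print(line, "->", "consistent" if reported == recomputed else "differs")
```

Both example lines come from the output above, and the recomputed ratios agree with the reported ones.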

cat - Concatenates VCF files.

Concatenate VCF files.

#concatenates chr1.mills.bcf and chr2.mills.bcf
vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf

filter - Filters variants in a VCF file.

Filter variants in a VCF file.

#adds a filter tag "refA" for variants where the REF column is A.
vt filter in.bcf -f "REF=='A'" -d "refA"

The sections below are still in progress.

sort - sort VCF files

#sorts mills.bcf and outputs to standard out in a 1000bp window.
vt sort -m local mills.bcf
#sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf
vt sort -m local -w 10000 mills.bcf -o out.bcf
#sorts an indexed mills.bcf with chromosomes not sorted in the contig order in the header
vt sort -m chrom mills.bcf -o out.bcf
#sorts mills.bcf with no assumption
vt sort mills.bcf -o out.bcf

normalize - normalize variants

#normalize variants and write out to dbsnp.normalized.vcf
vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf

#normalize variants, send to standard out and remove duplicates.
vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf

#read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out.
vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N"
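To make concrete what "normalize" means here: in the paper, a normalized variant is left-aligned and parsimonious (no redundant bases at either end of the alleles). The following is a minimal Python sketch of that idea, written from the paper's description and not taken from vt's source. The function name `normalize`, the in-memory reference string, and the error handling are assumptions made for illustration.

```python
def normalize(ref_seq, pos, alleles):
    """Left-align and trim a variant.

    ref_seq : reference sequence as a plain string (assumption: fits in memory)
    pos     : 1-based position of the first base of alleles[0] (the REF allele)
    alleles : [REF, ALT1, ALT2, ...]
    Returns (new_pos, new_alleles).
    """
    alleles = list(alleles)
    if len(set(alleles)) < 2:
        raise ValueError("need at least two distinct alleles")

    changed = True
    while changed:
        changed = False
        # If every allele ends with the same base, drop that base.
        if all(alleles) and len({a[-1] for a in alleles}) == 1:
            alleles = [a[:-1] for a in alleles]
            changed = True
        # If an allele became empty, extend all alleles one base to the left.
        if any(len(a) == 0 for a in alleles):
            if pos == 1:
                raise ValueError("cannot extend left of the first base")
            pos -= 1
            left_base = ref_seq[pos - 1]
            alleles = [left_base + a for a in alleles]
            changed = True

    # Drop shared leading bases while every allele keeps at least one base.
    while all(len(a) >= 2 for a in alleles) and len({a[0] for a in alleles}) == 1:
        alleles = [a[1:] for a in alleles]
        pos += 1
    return pos, alleles


# A CA deletion inside a CA repeat, written two different ways,
# ends up with the same representation.
ref = "GGGCACACACAGGG"
print(normalize(ref, 9, ["ACA", "A"]))   # (3, ['GCA', 'G'])
print(normalize(ref, 2, ["GGCA", "GG"])) # (3, ['GCA', 'G'])
```

Because both inputs collapse to the same record, a subsequent duplicate-removal step such as `vt uniq` can recognize them as the same variant.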

decompose - decompose variants

#decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf
vt decompose gatk.vcf -o gatk.decomposed.vcf
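As a rough illustration of what decomposition does to a site, the Python sketch below splits one multiallelic, sites-only VCF line into one line per ALT allele. It is a simplified assumption-laden example, not vt's implementation: the function name `decompose_site` is hypothetical, and genotype columns and per-allele INFO fields, which a real tool has to rewrite, are deliberately not handled.

```python
def decompose_site(line):
    """Split a tab-separated, sites-only VCF line into biallelic lines.

    Only the ALT column (5th field) is rewritten; all other fields are
    copied unchanged. Genotypes and per-allele INFO values are out of
    scope for this sketch.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        raise ValueError("expected at least CHROM, POS, ID, REF, ALT")
    records = []
    for alt in fields[4].split(","):
        new_fields = list(fields)
        new_fields[4] = alt
        records.append("\t".join(new_fields))
    return records


site = "20\t100\t.\tA\tC,G\t.\tPASS\t."
for record in decompose_site(site):
    print(record)
```

The triallelic site in the example yields two lines, one with ALT `C` and one with ALT `G`.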

uniq - drop duplicate variants

#drop duplicate variants and save output in mills.uniq.vcf
vt uniq mills.vcf -o mills.uniq.vcf

paste - paste VCF files

#paste together genotypes from the CEU trio into one file.
vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf

rminfo - Removes INFO tags from a VCF file.

#removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP
vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf

filter_overlap - Tags overlapping variants in a VCF file with the FILTER flag overlap.

#adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence.
vt filter_overlap in.bcf -w 1 -o out.bcf

validate - Checks the following properties of a VCF file:

1. order
2. reference sequence consistency

#validates lobstr.bcf
vt validate lobstr.bcf

info2tab - Extract INFO fields to a tab delimited file

#converts in.bcf to tab format with selected INFO and FILTER fields
vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT

partition - Partition variants from two data sets.

#partitions all variants in bi1.bcf and bi2.bcf
vt partition bi1.bcf bi2.bcf

multi_partition - Partitions variants found in VCF files.

#partitions variants n-ways
vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf

annotate_regions - Annotates regions in a VCF file.

#annotates the variants that overlap with coding regions.
vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region"

annotate_variants - Annotates variants in a VCF file.

#annotates the variants found in mills.vcf
vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz

compute_features - Compute features in a VCF file.

#compute features for the variants found in vt.vcf
#requires GT, PL and DP
vt compute_features vt.vcf

estimate - Compute variant based estimates.

#compute the AF and MLEAF estimates for the variants found in vt.vcf
#requires GT and PL
vt estimate -e AF,MLEAF vt.vcf

profile_mendelian - Profile Mendelian errors.

#profile mendelian errors found in vt.genotypes.bcf, generate tables in the directory mendel, requires pdflatex.
vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel

profile_snps - Profile SNPs.

#profile snps found in 20.sites.vcf
vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20

profile_indels - Profile indels.

#profile indels found in mills.vcf
vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20

profile_vntrs - Profile VNTRs.

#profiles a set of VNTRs
vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt

profile_na12878 - Profile NA12878 calls against reference call sets.

#profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20.
vt profile_na12878 vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20

discover - Discovers variants from reads in a BAM/CRAM file.

#discover variants from NA12878.bam and write to stdout
vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20

merge_candidate_variants - Merge candidate variants across samples.

#merge candidate variants from the VCFs listed in candidates.txt and output in candidate.sites.vcf
vt merge_candidate_variants candidates.txt -o candidate.sites.vcf

remove_overlap - Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap.

#tags variants that are overlapping
vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged.vcf

annotate_indels - Annotates indels with VNTR information and adds a VNTR record.

#annotates indels from VCFs with VNTR information.
vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf

construct_probes - Construct probes for genotyping a variant.

#construct probes from candidates.sites.bcf and output to standard out
vt construct_probes candidates.sites.bcf -r ref.fa

genotype - Genotypes variants for each sample.

#genotypes variants found in candidates.sites.vcf from sample.bam
vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf

Citation
Tan A, Abecasis GR, Kang HM. Unified representation of genetic variants. Bioinformatics. 2015 Jul 1;31(13):2202-4.