Informatics on Mac

A collection of informatics notes related to HTS (NGS).

VT: a tool for manipulating VCF files

Updated 2021-04-28

 

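Variant calling from sequencing data now targets more than single nucleotide polymorphisms (SNPs): short insertions and deletions (indels), short tandem repeats (STRs), multi-nucleotide polymorphisms (MNPs), and structural variants (SVs) are also called. These different classes of variants are usually represented in the Variant Call Format (VCF) (Danecek et al., 2011), which provides a format for storing variant calls of diverse variant types produced by different tools.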

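Different sequence analysis tools often represent the same sequence variant in different ways within a VCF file, so integrating and comparing variants across call sets is not straightforward. However, the effect of ambiguous variant representation on data analysis has not been demonstrated, and there are no standard guidelines for representing variants consistently.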

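This paper provides a formal definition of normalization and an algorithm for it. With the authors' definition and algorithm, a variant can be represented in a clear and unique way. The authors show that existing variant calling tools often do not represent complex variants consistently. Finally, they show how the normalization method helped integrate the different variant call sets in the 1000 Genomes Project (1000 Genomes Project Consortium, 2012).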

 

  

wiki

Vt - Genome Analysis Wiki

 

Installation

Tested on Ubuntu 16.04 in a Miniconda3 4.0.5 environment.

Source: GitHub

git clone https://github.com/atks/vt.git
cd vt
make -j 8
make test

# or use conda
conda install -c bioconda -y vt

> vt

Help page on http://statgen.sph.umich.edu/wiki/Vt

Useful tools:
view                      view vcf/vcf.gz/bcf files
index                     index vcf.gz/bcf files
normalize                 normalize variants
decompose                 decompose variants
uniq                      drop duplicate variants
cat                       concatenate VCF files
paste                     paste VCF files
sort                      sort VCF files
subset                    subset VCF file to variants polymorphic in a sample

peek                      summary of variants in the vcf file
partition                 partition variants
multi_partition           partition variants from multiple VCF files
annotate_variants         annotate variants
annotate_db_rsid          annotate variants with dbSNP rsid
annotate_1000g            annotate variants with 1000 Genomes variants
annotate_regions          annotate regions
compute_concordance       compute genotype concordance between 2 call sets
compute_features          compute genotype likelihood based statistics

discover                  discover variants
genotype                  genotype variants

vt can also be installed with Homebrew; see the wiki for the procedure.

 

Test

> make test

test/test.sh
++++++++++++++++++++++
Tests for vt normalize
++++++++++++++++++++++
testing normalize
             output VCF file : ok
             output logs     : ok
+++++++++++++++++++++++++++++++
Tests for vt decompose_blocksub
+++++++++++++++++++++++++++++++
testing decompose_blocksub of even-length blocks
             output VCF file : ok
             output logs     : ok
testing decompose_blocksub with alignment
             output VCF file : ok
             output logs     : ok
testing decompose_blocksub of phased even-length blocks
             output VCF file : ok
             output logs     : ok
++++++++++++++++++++++
Tests for vt decompose
++++++++++++++++++++++
testing decompose for a triallelic variant
             output VCF file : ok
             output logs     : ok

Passed tests : 5 / 5

 

Usage

view - view vcf/vcf.gz/bcf files

View a BCF file.

vt view -h input.bcf 

 

index - index vcf.gz/bcf files

Create an index.

#bcf
vt index input.bcf

#vcf.gz
vt index input.vcf.gz

 

peek - Summarizes the variants in a VCF file.

Summarize a VCF file.

vt peek input.vcf.gz

Example output

options:     input VCF file            -

stats: no. of samples                     :         21
       no. of chromosomes                 :          1

       ========== Micro variants ==========

       no. of SNP                         :       1494
           2 alleles                      :            1492 (1.99) [993/499]
           3 alleles                      :               2 (1.00) [2/2]

       no. of MNP                         :          2
           2 alleles                      :               2 (2.00) [4/2]

       no. of INDEL                       :         60
           2 alleles                      :              60 (0.43) [18/42]

       no. of SNP/MNP                     :         79
           2 alleles                      :              79 (3.16) [60/19]

       no. of SNP/INDEL                   :          1
           3 alleles                      :               1 (0.00) [0/1] (inf) [1/0]

       no. of micro variants              :       1636

       ++++++ Other useful categories +++++

        no. of block substitutions        :         81
           2 alleles                      :              81 (3.05) [64/21]

        no. of complex substitutions      :          1
           3 alleles                      :               1 (0.00) [0/1] (inf) [1/0]

       ========= General summary ==========

       no. of VCF records                        :       1636

Time elapsed: 0.04s
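To make the categories in the summary above more concrete, here is a rough sketch that counts variant types from the REF and ALT columns with awk. It is not how vt classifies variants: the function name count_variant_types and the length-based rules (equal length 1 is a SNP, equal length above 1 is an MNP, unequal length is an INDEL) are simplifying assumptions for illustration, and multiallelic records are only counted, not broken down.

```shell
# Hypothetical helper (not part of vt): count coarse variant types
# from uncompressed, tab-delimited VCF records read on stdin.
count_variant_types() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }                                   # skip header lines
       $5 ~ /,/ { multi++; next }                      # more than one ALT allele
       length($4) == 1 && length($5) == 1 { snp++; next }
       length($4) == length($5) { mnp++; next }        # same length, longer than 1
       { indel++ }                                     # REF and ALT lengths differ
       END {
         printf "SNP\t%d\nMNP\t%d\nINDEL\t%d\nMULTIALLELIC\t%d\n", snp, mnp, indel, multi
       }'
}
```

For example, `zcat input.vcf.gz | count_variant_types` would print four tab-separated count lines. The numbers will generally differ from vt peek, which uses finer categories such as SNP/MNP and block substitutions.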

 

cat - Concatenates VCF files.

Concatenate VCF files.

#concatenates chr1.mills.bcf and chr2.mills.bcf
vt cat chr1.mills.bcf chr2.mills.bcf -o mills.bcf

 

filter - Filters variants in a VCF file.

Filter a VCF file.

#adds a filter tag "refA" for variants whose REF column is A.
vt filter in.bcf -f "REF=='A'" -d "refA"

 

 

The sections below are still a work in progress.

sort - sort VCF files

#sorts mills.bcf and outputs to standard out in a 1000bp window.
vt sort -m local mills.bcf
#sorts mills.bcf and locally sorts it in a 10000bp window and outputs to out.bcf
vt sort -m local -w 10000 mills.bcf -o out.bcf
#sorts an indexed mills.bcf with chromosomes not sorted in the contig order in the header
vt sort -m chrom mills.bcf -o out.bcf
#sorts mills.bcf with no assumption
vt sort mills.bcf -o out.bcf

 

normalize - normalize variants

#normalize variants and write out to dbsnp.normalized.vcf
vt normalize dbsnp.vcf -r seq.fa -o dbsnp.normalized.vcf

#normalize variants, send to standard out and remove duplicates.
vt normalize dbsnp.vcf -r seq.fa | vt uniq - -o dbsnp.normalized.uniq.vcf

#read in variants that do not contain N in the explicit alleles, normalize variants, send to standard out.
vt normalize dbsnp.vcf -r seq.fa -f "~VARIANT_CONTAINS_N"
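As a rough illustration of what normalization means for a single variant, the sketch below left-aligns and trims one biallelic variant in the style of the procedure described by Tan et al. (2015). It is not vt's implementation. The function name normalize_variant is hypothetical, the reference is passed as a plain string instead of a FASTA file, and the sketch assumes REF and ALT differ and that the variant does not need to shift past the first base of the sequence.

```shell
# Hypothetical helper (not part of vt): normalize one biallelic variant.
# usage: normalize_variant REFSEQ POS REF ALT   (POS is 1-based)
normalize_variant() {
  awk -v seq="$1" -v pos="$2" -v ref="$3" -v alt="$4" 'BEGIN {
    changed = 1
    while (changed) {
      changed = 0
      # Step 1: drop the last base while both alleles end with the same base.
      if (length(ref) > 0 && length(alt) > 0 &&
          substr(ref, length(ref)) == substr(alt, length(alt))) {
        ref = substr(ref, 1, length(ref) - 1)
        alt = substr(alt, 1, length(alt) - 1)
        changed = 1
      }
      # Step 2: if an allele became empty, extend both alleles one base to the left.
      if ((length(ref) == 0 || length(alt) == 0) && pos > 1) {
        pos--
        base = substr(seq, pos, 1)
        ref = base ref
        alt = base alt
        changed = 1
      }
    }
    # Step 3: drop shared leading bases while both alleles keep at least one base.
    while (length(ref) > 1 && length(alt) > 1 &&
           substr(ref, 1, 1) == substr(alt, 1, 1)) {
      ref = substr(ref, 2)
      alt = substr(alt, 2)
      pos++
    }
    print pos "\t" ref "\t" alt
  }'
}
```

With the toy reference GGGCACACACAGGG, `normalize_variant GGGCACACACAGGG 8 CAC C` prints position 3 with alleles GCA and G: the two-base deletion inside the CA repeat is shifted to its leftmost position. A padded SNP such as `normalize_variant GGGCACACACAGGG 4 CA CG` is trimmed to position 5 with alleles A and G.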



 

 

decompose - decompose variants

#decomposes multiallelic variants into biallelic variants and write out to gatk.decomposed.vcf
vt decompose gatk.vcf -o gatk.decomposed.vcf
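Conceptually, decomposition turns one record with several ALT alleles into one record per ALT allele. The awk sketch below shows only that core idea on uncompressed VCF text. The function name split_multiallelic is hypothetical, and the sketch leaves INFO and genotype columns untouched, which a real tool such as vt decompose has to rewrite as well.

```shell
# Hypothetical helper (not part of vt): emit one record per ALT allele.
# Only the ALT column (column 5) is rewritten; all other columns are copied.
split_multiallelic() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }              # pass header lines through unchanged
       {
         n = split($5, alts, ",")        # ALT alleles are comma-separated
         for (i = 1; i <= n; i++) {
           $5 = alts[i]
           print
         }
       }'
}
```

A record with ALT C,G therefore becomes two records, one with ALT C and one with ALT G, at the same position.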

 

uniq - drop duplicate variants

#drop duplicate variants and save output in mills.uniq.vcf
vt uniq mills.vcf -o mills.uniq.vcf
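The idea of dropping duplicate variants can be sketched with awk by keeping the first record seen for each combination of CHROM, POS, REF and ALT. This is an assumption about what counts as a duplicate, chosen for illustration; the function name drop_duplicate_variants is hypothetical and the sketch does not reproduce how vt uniq decides which record to keep.

```shell
# Hypothetical helper (not part of vt): keep the first record for each
# CHROM/POS/REF/ALT combination in tab-delimited VCF text read on stdin.
drop_duplicate_variants() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { print; next }                 # keep header lines
       !seen[$1 FS $2 FS $4 FS $5]++'       # print only the first occurrence of a key
}
```

Two records that differ only in the ID column are treated as the same variant here, so only the first one is printed.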

 

paste - paste VCF files

#paste together genotypes from the CEU trio into one file.
vt paste NA12878.mills.bcf NA12891.mills.bcf NA12892.mills.bcf -o ceu_trio.bcf

 

rminfo - Removes INFO tags from a VCF file.

#removes the INFO tags OLD_VARIANT, ENTROPY, PSCORE and COMP
vt rminfo exact.del.bcf -t OLD_VARIANT,ENTROPY,PSCORE,COMP -o rm.bcf

 

 

 

filter_overlap - Tags overlapping variants in a VCF file with the FILTER flag overlap.

#adds a filter tag "overlap" for overlapping variants within a window size of 1 based on the REF sequence.
vt filter_overlap in.bcf -w 1 -o out.bcf

 

 

validate - Checks the following properties of a VCF file:

1. order
2. reference sequence consistency

#validates lobstr.bcf
vt validate lobstr.bcf
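The first property, order, can be illustrated with a small awk check on uncompressed VCF text. This is only a sketch under simplifying assumptions: the function name check_sorted is hypothetical, and it only verifies that positions do not decrease between consecutive records on the same chromosome. It does not check the chromosome order against the header or the reference sequence consistency that vt validate also covers.

```shell
# Hypothetical helper (not part of vt): exit with status 1 if POS decreases
# between consecutive records that share the same CHROM.
check_sorted() {
  awk 'BEGIN { FS = "\t" }
       /^#/ { next }                                    # skip header lines
       $1 == chrom && $2 + 0 < pos {                    # compare with previous record
         print "unsorted at line " NR
         bad = 1
         exit
       }
       { chrom = $1; pos = $2 + 0 }                     # remember the current record
       END { exit bad + 0 }'
}
```

A possible use is `zcat input.vcf.gz | check_sorted && echo sorted`, which prints "sorted" only when no out-of-order position is found.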

 

info2tab - Extract INFO fields to a tab delimited file

#converts in.bcf to tab format with selected INFO and FILTER fields
vt info2tab in.bcf -u PASS -t EX_RL,FZ_RL,MDUST,LOBSTR,VNTRSEEK,RMSK,EX_REPEAT_TRACT

 

 

partition - Partition variants from two data sets.

#partitions all variants in bi1.bcf and bi2.bcf
vt partition bi1.bcf bi2.bcf

 

multi_partition - Partitions variants found in VCF files.

#partitions variants n-ways
vt multi_partition hc.genotypes.bcf pl.genotypes.bcf st.genotypes.bcf

 

annotate_regions - Annotates regions in a VCF file.

#annotates the variants that overlap with coding regions.
vt annotate_regions mills.vcf -b coding.bed.gz -t CDS -d "Coding region"

 

annotate_variants - Annotates variants in a VCF file.

#annotates the variants found in mills.vcf
vt annotate_variants mills.vcf -r hs37d5.fa -g gencode.v19.annotation.gtf.gz

 

compute_features - Compute features in a VCF file.

#compute features for the variants found in vt.vcf
#requires GT, PL and DP
vt compute_features vt.vcf

 

estimate - Compute variant based estimates.

#compute the AF and MLEAF estimates for the variants found in vt.vcf
#requires GT and PL
vt estimate -e AF,MLEAF vt.vcf

 

profile_mendelian - Profile Mendelian errors. 

#profile mendelian errors found in vt.genotypes.bcf, generate tables in the directory mendel, requires pdflatex.
vt profile_mendelian vt.genotypes.bcf -p trios.ped -x mendel

 

profile_snps - Profile SNPs.

#profile snps found in 20.sites.vcf
vt profile_snps -g snp.reference.txt 20.sites.vcf -r hs37d5.fa -i 20

 

profile_indels - Profile indels.

#profile indels found in mills.vcf
vt profile_indels -g indel.reference.txt mills.vcf -r hs37d5.fa -i 20

  

profile_vntrs - Profile VNTRs.

#profiles a set of VNTRs
vt profile_vntrs vntrs.sites.bcf -g vntr.reference.txt

  

profile_na12878 - Profile NA12878 calls against reference data sets.

#profile NA12878 overlap with broad knowledgebase and illumina platinum genomes for the file vt.genotypes.bcf for chromosome 20.
vt profile_na12878 vt.genotypes.bcf -g na12878.reference.txt -r hs37d5.fa -i 20

 

discover - Discovers variants from reads in a BAM/CRAM file.

#discover variants from NA12878.bam and write to stdout
vt discover -b NA12878.bam -s NA12878 -r hs37d5.fa -i 20

 

merge_candidate_variants - Merge candidate variants across samples.

#merge candidate variants from VCFs listed in candidates.txt and output in candidate.sites.vcf
vt merge_candidate_variants candidates.txt -o candidate.sites.vcf

 

remove_overlap - Removes overlapping variants in a VCF file by tagging such variants with the FILTER flag overlap.

#annotates variants that are overlapping 
vt remove_overlap in.vcf -r hs37d5.fa -o overlapped.tagged.vcf

 

annotate_indels - Annotates indels with VNTR information and adds a VNTR record. 

#annotates indels from VCFs with VNTR information.
vt annotate_indels in.vcf -r hs37d5.fa -o annotated.sites.vcf

 

construct_probes - Construct probes for genotyping a variant.

#construct probes from candidates.sites.bcf and output to standard out
vt construct_probes candidates.sites.bcf -r ref.fa

 

genotype - Genotypes variants for each sample.

#genotypes variants found in candidates.sites.vcf from sample.bam
vt genotype -r seq.fa -b sample.bam -i candidates.sites.vcf -o sample.sites.vcf

 

 

Citation
Unified representation of genetic variants
Tan A, Abecasis GR, Kang HM
Bioinformatics. 2015 Jul 1;31(13):2202-4

 

Related