VCFの要約統計を出力するRTG toolsの rtg vcfstatsコマンド

2019 8/3 コマンドのミス修正  

2020 4/19 例追加


RTG toolsはRTGコアのサブセットである。VCFファイルの高度な比較を実行するvcfevalなど、VCFファイルとシーケンスデータを扱うための便利なユーティリティが含まれている。vcfevalが特に重要なコマンドだが、ここではvcfの簡単なstatisticsを出力するrtg vcfstatsを紹介する。



Full version





miniconda3-4.3.21環境でテストした(docker使用、ホストOS ubuntu 16.0.4)


本体 Github

git clone
cd rtg-tools
ant runalltests

#bioconda (link)
conda install -c bioconda -y rtg-tools


$ rtg

Usage: rtg COMMAND [OPTION]...

       rtg RTG_MEM=16G COMMAND [OPTION]...  (e.g. to set maximum memory use to 16 GB)


Type 'rtg help COMMAND' for help on a specific command.

The following commands are available:


Data formatting:

format       convert sequence data files to SDF

sdf2fasta    convert SDF to FASTA

sdf2fastq    convert SDF to FASTQ

sdf2sam      convert SDF to SAM/BAM

fastqtrim    trim reads in FASTQ files



genomesim    generate simulated genome sequence

cgsim        generate simulated reads from a sequence

readsim      generate simulated reads from a sequence

popsim       generate a VCF containing simulated population variants

samplesim    generate a VCF containing a genotype simulated from a population

childsim     generate a VCF containing a genotype simulated as a child of two parents

denovosim    generate a VCF containing a derived genotype containing de novo variants

pedsamplesim generate simulated genotypes for all members of a pedigree

samplereplay generate the genome corresponding to a sample genotype



bgzip        compress a file using block gzip

index        create a tabix index

extract      extract data from a tabix indexed file

sdfstats     print statistics about an SDF

sdfsubset    extract a subset of an SDF into a new SDF

sdfsubseq    extract a subsequence from an SDF as text

mendelian    check a multisample VCF for Mendelian consistency

vcfstats     print statistics about variants contained within a VCF file

vcfmerge     merge single-sample VCF files into a single multi-sample VCF

vcffilter    filter records within a VCF file

vcfannotate  annotate variants within a VCF file

vcfsubset    create a VCF file containing a subset of the original columns

vcfdecompose decompose complex variants within a VCF file

vcfeval      evaluate called variants for agreement with a baseline variant set

svdecompose  split composite structural variants into a breakend representation

bndeval      evaluate called breakends for agreement with a baseline breakend set

pedfilter    filter and convert a pedigree file

pedstats     print information about a pedigree file

rocplot      plot ROC curves from vcfeval ROC data files

version      print version and license information

license      print license information for all commands

help         print this screen or help for specified command


rtg vcfstats -h

$ rtg vcfstats -h

Usage: rtg vcfstats [OPTION]... FILE+


Display statistics from a set of VCF files.


File Input/Output

      --known          only calculate statistics for known variants (Default is to ignore known/novel status)

      --novel          only calculate statistics for novel variants (Default is to ignore known/novel status)

      --sample=STRING  only calculate statistics for the specified sample (Default is to include all samples). May be specified 0 or more times

      FILE+            input VCF files from which to derive statistics or '-' to read from standard input. Must be specified 1 or more times



      --allele-lengths output variant length histogram



  -h, --help           print help on command-line flag usage







minimap2 -ax sr -R "@RG\tID:X\tLB:Y\tSM:sample1\tPL:ILLUMINA"\
-t 10 ref.fasta pair_1.fq pair_2.fq \
|elprep filter /dev/stdin filtered_sorted.bam \
--mark-duplicates --remove-duplicates --filter-mapping-quality 0 \
--clean-sam --nr-of-threads 10 --sorting-order coordinate
samtools index -@ 8 filtered_sorted.bam

#variant call ここでは3つ使う。

#1 sambamba mpileup
sambamba mpileup -t 10 filtered_sorted.bam --samtools -u -g -f ref.fasta | bcftools call -v -m -O z -o mpileup.vcf.gz

#2 freebayes
freebayes -C 5 -u -p 2 -f ref.fasta filtered_sorted.bam |gzip - > freebayes.vcf.gz

#3 Platypus
platypus callVariants --refFile=ref.fasta --bamFiles filtered_sorted.bam --nCPU 20 --minReads 5
  • -u --no-complex  Ignore complex events (freebayes)



bgzip call.vcf
tabix -p vcf call.vcf.gz



rtg vcfstats call.vcf.gz

#raw vcf
rtg vcfstats call.vcf


# rtg vcfstats minimap2_calls.vcf.gz

Location                     : minimap2_calls.vcf.gz

Failed Filters               : 0

Passed Filters               : 78

SNPs                         : 76

MNPs                         : 0

Insertions                   : 2

Deletions                    : 0

Indels                       : 0

Same as reference            : 0

SNP Transitions/Transversions: 1.60 (48/30)

Total Het/Hom ratio          : 38.00 (76/2)

SNP Het/Hom ratio            : 37.00 (74/2)

MNP Het/Hom ratio            : - (0/0)

Insertion Het/Hom ratio      : - (2/0)

Deletion Het/Hom ratio       : - (0/0)

Indel Het/Hom ratio          : - (0/0)

Insertion/Deletion ratio     : - (2/0)

Indel/SNP+MNP ratio          : 0.03 (2/76)







bcftools stats -F ref.fasta -s mpileup.vcf.gz > mpileup.vcf.gz.stats 

#create report pd
plot-vcfstats -p report/ mpileup.vcf.gz.stats

bcftools stats -F ref.fasta -s sampleA.vcf > sampleA_bcftools_stats
bcftools stats -F ref.fasta -s sampleB.vcf > sampleB_bcftools_stats
bcftools stats -F ref.fasta -s sampleC.vcf > sampleC_bcftools_stats
multiqc .