macでインフォマティクス

macでインフォマティクス

NGS関連のインフォマティクス情報についてまとめています。

VCFの要約統計を出力するRTG toolsの rtg vcfstatsコマンド

2019 8/3 コマンドのミス修正  

 

RTG toolsはRTGコアのサブセットである。VCFファイルの高度な比較を実行するvcfevalなど、VCFファイルとシーケンスデータを扱うための便利なユーティリティが含まれている。vcfevalが特に重要なコマンドだが、ここではvcfの簡単なstatisticsを出力するrtg vcfstatsを紹介する。

 

HP

https://www.realtimegenomics.com/products/rtg-tools

Full version

https://www.realtimegenomics.com/products/rtg-core

manual

https://cdn.rawgit.com/RealTimeGenomics/rtg-tools/master/installer/resources/tools/RTGOperationsManual/index.html

 

 


インストール

miniconda3-4.3.21環境でテストした(docker使用、ホストOS ubuntu 16.0.4)

依存

本体 Github

git clone https://github.com/RealTimeGenomics/rtg-tools.git
cd rtg-tools
ant runalltests

#bioconda (link)
conda install -c bioconda -y rtg-tools

rtg

$ rtg

Usage: rtg COMMAND [OPTION]...

       rtg RTG_MEM=16G COMMAND [OPTION]...  (e.g. to set maximum memory use to 16 GB)

 

Type 'rtg help COMMAND' for help on a specific command.

The following commands are available:

 

Data formatting:

format       convert sequence data files to SDF

sdf2fasta    convert SDF to FASTA

sdf2fastq    convert SDF to FASTQ

sdf2sam      convert SDF to SAM/BAM

fastqtrim    trim reads in FASTQ files

 

Simulation:

genomesim    generate simulated genome sequence

cgsim        generate simulated reads from a sequence

readsim      generate simulated reads from a sequence

popsim       generate a VCF containing simulated population variants

samplesim    generate a VCF containing a genotype simulated from a population

childsim     generate a VCF containing a genotype simulated as a child of two parents

denovosim    generate a VCF containing a derived genotype containing de novo variants

pedsamplesim generate simulated genotypes for all members of a pedigree

samplereplay generate the genome corresponding to a sample genotype

 

Utility:

bgzip        compress a file using block gzip

index        create a tabix index

extract      extract data from a tabix indexed file

sdfstats     print statistics about an SDF

sdfsubset    extract a subset of an SDF into a new SDF

sdfsubseq    extract a subsequence from an SDF as text

mendelian    check a multisample VCF for Mendelian consistency

vcfstats     print statistics about variants contained within a VCF file

vcfmerge     merge single-sample VCF files into a single multi-sample VCF

vcffilter    filter records within a VCF file

vcfannotate  annotate variants within a VCF file

vcfsubset    create a VCF file containing a subset of the original columns

vcfdecompose decompose complex variants within a VCF file

vcfeval      evaluate called variants for agreement with a baseline variant set

svdecompose  split composite structural variants into a breakend representation

bndeval      evaluate called breakends for agreement with a baseline breakend set

pedfilter    filter and convert a pedigree file

pedstats     print information about a pedigree file

rocplot      plot ROC curves from vcfeval ROC data files

version      print version and license information

license      print license information for all commands

help         print this screen or help for specified command

statsコマンド

rtg vcfstats -h

$ rtg vcfstats -h

Usage: rtg vcfstats [OPTION]... FILE+

 

Display statistics from a set of VCF files.

 

File Input/Output

      --known          only calculate statistics for known variants (Default is to ignore known/novel status)

      --novel          only calculate statistics for novel variants (Default is to ignore known/novel status)

      --sample=STRING  only calculate statistics for the specified sample (Default is to include all samples). May be specified 0 or more times

      FILE+            input VCF files from which to derive statistics or '-' to read from standard input. Must be specified 1 or more times

 

Reporting

      --allele-lengths output variant length histogram

 

Utility

  -h, --help           print help on command-line flag usage

 

 

 

 

実行方法

1、マッピング、バリアントコールしてvcfを出力

#mapping
#ここではminimap2を使う。elprep(link)に渡しフィルタリング、bam出力
minimap2 -ax sr -R "@RG\tID:X\tLB:Y\tSM:sample1\tPL:ILLUMINA"\
-t 10 ref.fasta pair_1.fq pair_2.fq \
|elprep filter /dev/stdin filtered_sorted.bam \
--mark-duplicates --remove-duplicates --filter-mapping-quality 0 \
--clean-sam --nr-of-threads 10 --sorting-order coordinate
#index
samtools index -@ 8 filtered_sorted.bam

#variant call ここでは3つ使う。

#1 sambamba mpileup
sambamba mpileup -t 10 filtered_sorted.bam --samtools -u -g -f ref.fasta | bcftools call -v -m -O z -o mpileup.vcf.gz

#2 freebayes
freebayes -C 5 -u -p 2 -f ref.fasta filtered_sorted.bam |gzip - > freebayes.vcf.gz

#3 Platypus
platypus callVariants --refFile=ref.fasta --bamFiles filtered_sorted.bam --nCPU 20 --minReads 5
  • -u --no-complex  Ignore complex events (freebayes)

 

2、前のステップで圧縮してないなら圧縮(非圧縮vcfで進めてもOK)

bgzip call.vcf
tabix -p vcf call.vcf.gz

 

3、statistics

rtg vcfstats call.vcf.gz

#raw vcf
rtg vcfstats call.vcf

出力例

# rtg vcfstats minimap2_calls.vcf.gz

Location                     : minimap2_calls.vcf.gz

Failed Filters               : 0

Passed Filters               : 78

SNPs                         : 76

MNPs                         : 0

Insertions                   : 2

Deletions                    : 0

Indels                       : 0

Same as reference            : 0

SNP Transitions/Transversions: 1.60 (48/30)

Total Het/Hom ratio          : 38.00 (76/2)

SNP Het/Hom ratio            : 37.00 (74/2)

MNP Het/Hom ratio            : - (0/0)

Insertion Het/Hom ratio      : - (2/0)

Deletion Het/Hom ratio       : - (0/0)

Indel Het/Hom ratio          : - (0/0)

Insertion/Deletion ratio     : - (2/0)

Indel/SNP+MNP ratio          : 0.03 (2/76)

 

引用

https://github.com/RealTimeGenomics/rtg-tools

 

参考

 

要約統計にはbcftoolsやvcftoolも使える。

bcftools stats -F ref.fasta -s mpileup.vcf.gz > mpileup.vcf.gz.stats 

#create report pd
plot-vcfstats -p report/ mpileup.vcf.gz.stats

 

関連