2019 8/3 コマンドのミス修正
2020 4/19 例追加
2021 10/1 追記
RTG toolsはRTGコアのサブセットである。VCFファイルの高度な比較を実行するvcfevalなど、VCFファイルとシーケンスデータを扱うための便利なユーティリティが含まれている。vcfevalが特に重要なコマンドだが、ここではvcfの簡単なstatisticsを出力するrtg vcfstatsを紹介する。
HP
https://www.realtimegenomics.com/products/rtg-tools
Full version
https://www.realtimegenomics.com/products/rtg-core
manual
インストール
miniconda3-4.3.21環境でテストした(docker使用、ホストOS ubuntu 16.0.4)
依存
本体 Github
git clone https://github.com/RealTimeGenomics/rtg-tools.git
cd rtg-tools
ant runalltests
#bioconda (link)
mamba install -c bioconda -y rtg-tools
> rtg
$ rtg
Usage: rtg COMMAND [OPTION]...
rtg RTG_MEM=16G COMMAND [OPTION]... (e.g. to set maximum memory use to 16 GB)
Type 'rtg help COMMAND' for help on a specific command.
The following commands are available:
Data formatting:
format convert sequence data files to SDF
sdf2fasta convert SDF to FASTA
sdf2fastq convert SDF to FASTQ
sdf2sam convert SDF to SAM/BAM
fastqtrim trim reads in FASTQ files
Simulation:
genomesim generate simulated genome sequence
cgsim generate simulated reads from a sequence
readsim generate simulated reads from a sequence
popsim generate a VCF containing simulated population variants
samplesim generate a VCF containing a genotype simulated from a population
childsim generate a VCF containing a genotype simulated as a child of two parents
denovosim generate a VCF containing a derived genotype containing de novo variants
pedsamplesim generate simulated genotypes for all members of a pedigree
samplereplay generate the genome corresponding to a sample genotype
Utility:
bgzip compress a file using block gzip
index create a tabix index
extract extract data from a tabix indexed file
sdfstats print statistics about an SDF
sdfsubset extract a subset of an SDF into a new SDF
sdfsubseq extract a subsequence from an SDF as text
mendelian check a multisample VCF for Mendelian consistency
vcfstats print statistics about variants contained within a VCF file
vcfmerge merge single-sample VCF files into a single multi-sample VCF
vcffilter filter records within a VCF file
vcfannotate annotate variants within a VCF file
vcfsubset create a VCF file containing a subset of the original columns
vcfdecompose decompose complex variants within a VCF file
vcfeval evaluate called variants for agreement with a baseline variant set
svdecompose split composite structural variants into a breakend representation
bndeval evaluate called breakends for agreement with a baseline breakend set
pedfilter filter and convert a pedigree file
pedstats print information about a pedigree file
rocplot plot ROC curves from vcfeval ROC data files
version print version and license information
license print license information for all commands
help print this screen or help for specified command
statsコマンド
> rtg vcfstats -h
$ rtg vcfstats -h
Usage: rtg vcfstats [OPTION]... FILE+
Display statistics from a set of VCF files.
File Input/Output
--known only calculate statistics for known variants (Default is to ignore known/novel status)
--novel only calculate statistics for novel variants (Default is to ignore known/novel status)
--sample=STRING only calculate statistics for the specified sample (Default is to include all samples). May be specified 0 or more times
FILE+ input VCF files from which to derive statistics or '-' to read from standard input. Must be specified 1 or more times
Reporting
--allele-lengths output variant length histogram
Utility
-h, --help print help on command-line flag usage
実行方法
1、マッピング、バリアントコールしてvcfを出力
#mapping
#ここではminimap2を使う。elprep(link)に渡しフィルタリング、bam出力
minimap2 -ax sr -R "@RG\tID:X\tLB:Y\tSM:sample1\tPL:ILLUMINA"\
-t 10 ref.fasta pair_1.fq pair_2.fq \
|elprep filter /dev/stdin filtered_sorted.bam \
--mark-duplicates --remove-duplicates --filter-mapping-quality 0 \
--clean-sam --nr-of-threads 10 --sorting-order coordinate
#index
samtools index -@ 8 filtered_sorted.bam
#variant call ここでは3つ使う。
#1 sambamba mpileup
sambamba mpileup -t 10 filtered_sorted.bam --samtools -u -g -f ref.fasta | bcftools call -v -m -O z -o mpileup.vcf.gz
#2 freebayes
freebayes -C 5 -u -p 2 -f ref.fasta filtered_sorted.bam |gzip - > freebayes.vcf.gz
#3 Platypus
platypus callVariants --refFile=ref.fasta --bamFiles filtered_sorted.bam --nCPU 20 --minReads 5
- -u --no-complex Ignore complex events (freebayes)
2、前のステップで圧縮してないなら圧縮(非圧縮vcfで進めてもOK)
bgzip call.vcf
tabix -p vcf call.vcf.gz
3、statistics
rtg vcfstats call.vcf.gz
#raw vcf
rtg vcfstats call.vcf
#必要な情報を取り出して結合
grep "Sample Name" merged.stats > name
grep "SNPs" merged.stats > SNPs
grep "MNPs" merged.stats > MNPs
grep "Transitions/Transversions" merged.stats > SNP_Transitions_Transversions
grep "Total Het/Hom" merged.stats > Total_Het_Hom_ratio
grep "SNP Het/Hom ratio" merged.stats > SNP_Het_Hom_ratio
paste name SNPs MNPs SNP_Transitions_Transversions Total_Het_Hom_ratio SNP_Het_Hom_ratio > table
出力例
# rtg vcfstats minimap2_calls.vcf.gz
Location : minimap2_calls.vcf.gz
Failed Filters : 0
Passed Filters : 78
SNPs : 76
MNPs : 0
Insertions : 2
Deletions : 0
Indels : 0
Same as reference : 0
SNP Transitions/Transversions: 1.60 (48/30)
Total Het/Hom ratio : 38.00 (76/2)
SNP Het/Hom ratio : 37.00 (74/2)
MNP Het/Hom ratio : - (0/0)
Insertion Het/Hom ratio : - (2/0)
Deletion Het/Hom ratio : - (0/0)
Indel Het/Hom ratio : - (0/0)
Insertion/Deletion ratio : - (2/0)
Indel/SNP+MNP ratio : 0.03 (2/76)
引用
https://github.com/RealTimeGenomics/rtg-tools
参考
要約統計にはbcftoolsやvcftoolも使える。
bcftools stats -F ref.fasta -s mpileup.vcf.gz > mpileup.vcf.gz.stats
#create report pd
plot-vcfstats -p report/ mpileup.vcf.gz.stats
#またはmultiqcと連携
bcftools stats -F ref.fasta -s sampleA.vcf > sampleA_bcftools_stats
bcftools stats -F ref.fasta -s sampleB.vcf > sampleB_bcftools_stats
bcftools stats -F ref.fasta -s sampleC.vcf > sampleC_bcftools_stats
multiqc .
関連