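From Biostars and GitHub
Several tools exist for managing and modifying VCF files, but there is still no simple, comprehensive tool that produces the most basic output needed by biologists who work without bioinformatics support. This tool takes a sorted VCF file and reports a simple table of the INFO and FORMAT fields for each SAMPLE of interest. By default (with minimal code), the INFO and FORMAT fields of all SAMPLEs are simplified. Convenient, comprehensive scripts can then be used to narrow down the fields.
VCF Simplify is aimed at users who do not program, such as biologists and teams without bioinformatics support, but users at any level can use it to extract VCF data efficiently. The output table can be produced in both "long" and "wide" formats, which makes it suitable for mining data by sample versus position. The output can be filtered further downstream with awk, or loaded into R for use with tidyr and dplyr.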
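To illustrate the difference between the "wide" and "long" layouts, here is a minimal sketch that reshapes a small wide table into long form using only the Python standard library. The `SAMPLE:TAG` column naming, the sample names, and the values are assumptions made for this illustration; the actual VCF-Simplify output may be laid out differently.

```python
import csv
import io

# Hypothetical wide table: one row per position, one column per sample/tag pair.
WIDE_TEXT = (
    "CHROM\tPOS\tsampleA:GT\tsampleB:GT\n"
    "2\t1000\t0/1\t0/0\n"
    "2\t2500\t1/1\t0/1\n"
)


def wide_to_long(text, key_columns=("CHROM", "POS")):
    """Turn every sample/tag column into its own row (long format)."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    long_rows = []
    for row in reader:
        for column, value in row.items():
            if column in key_columns:
                continue  # position keys are repeated on every long row
            sample, tag = column.split(":", 1)
            record = {key: row[key] for key in key_columns}
            record.update({"SAMPLE": sample, "TAG": tag, "VALUE": value})
            long_rows.append(record)
    return long_rows


long_rows = wide_to_long(WIDE_TEXT)
for record in long_rows:
    print(record)
```

In the long layout each row holds one value for one sample at one position, which is the shape tidyr and dplyr usually expect.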
The tool performs the following three main tasks.
- ViewVCF : Display and extract metadata from a VCF file.
- SimplifyVCF : Convert a VCF file to table or haplotype format.
- BuildVCF : Convert a table or haplotype format file back to VCF.
Biostars
https://www.biostars.org/p/309431/
Installation
I tested it in a Python 3 environment created with venv (virtualenv) on Ubuntu 18.04.
Dependencies
Python packages and modules:
- argparse (https://docs.python.org/3/library/argparse.html)
- cyvcf2 (https://github.com/brentp/cyvcf2/)
- Python3 (https://www.python.org/)
git clone https://github.com/everestial/VCF-Simplify
cd VCF-Simplify
python3 -m venv myenv
source myenv/bin/activate
pip install Cython
python3 setup.py build_ext --inplace
python3 VcfSimplify.py -h
> python3 VcfSimplify.py -h
## VCF Simplify ## : Python application for parsing VCF files.
Author: Bishwa K. Giri
Contributors: Bhuwan Aryal
Contact: bkgiri@uncg.edu, kirannbishwa01@gmail.com, bhuwanaryal19@gmail.com
usage: VCF-Simplify [-h] {ViewVCF,SimplifyVCF,BuildVCF} ...
positional arguments:
{ViewVCF,SimplifyVCF,BuildVCF}
Choose one of the following method.
ViewVCF View and extract metadata from the VCF.
SimplifyVCF Simplify VCF -> to {haplotype, table} data.
BuildVCF Build VCF <- from {haplotype, table} data.
optional arguments:
-h, --help show this help message and exit
> python3 VcfSimplify.py ViewVCF -h
## VCF Simplify ## : Python application for parsing VCF files.
Author: Bishwa K. Giri
Contributors: Bhuwan Aryal
Contact: bkgiri@uncg.edu, kirannbishwa01@gmail.com, bhuwanaryal19@gmail.com
usage: VCF-Simplify ViewVCF [-h] -inVCF INVCF [-outFile OUTFILE]
[-outType {table,json,dict} [{table,json,dict} ...]]
[-metadata METADATA [METADATA ...]]
optional arguments:
-h, --help show this help message and exit
-inVCF INVCF Sorted vcf file.
-outFile OUTFILE Name of the output file without file extension.
-outType {table,json,dict} [{table,json,dict} ...]
Space separated list of output data types.
Multiple types can be requested.
-metadata METADATA [METADATA ...]
Space separated list of metadata of interest.
Allowed values are:
VCFspec, reference, contig, samples, INFO, FORMAT, FILTER, GATKCommandLine, GVCFBlock.
Multiple choices can be requested.
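Based on the ViewVCF help text above, a command line can be assembled and checked before it is run. The following is only a sketch: the helper name `build_viewvcf_command` and the output file name are hypothetical, while the flags and allowed values are copied from the help output.

```python
import shlex

# Values copied from the ViewVCF help text.
ALLOWED_METADATA = {
    "VCFspec", "reference", "contig", "samples", "INFO",
    "FORMAT", "FILTER", "GATKCommandLine", "GVCFBlock",
}
ALLOWED_OUT_TYPES = {"table", "json", "dict"}


def build_viewvcf_command(in_vcf, out_file=None, out_types=(), metadata=()):
    """Hypothetical helper: return the argv list for a ViewVCF call."""
    bad_types = set(out_types) - ALLOWED_OUT_TYPES
    bad_meta = set(metadata) - ALLOWED_METADATA
    if bad_types or bad_meta:
        raise ValueError(f"unsupported values: {sorted(bad_types | bad_meta)}")
    cmd = ["python3", "VcfSimplify.py", "ViewVCF", "-inVCF", in_vcf]
    if out_file:
        cmd += ["-outFile", out_file]
    if out_types:
        cmd += ["-outType", *out_types]
    if metadata:
        cmd += ["-metadata", *metadata]
    return cmd


cmd = build_viewvcf_command(
    "exampleInput/input_test.vcf",
    out_file="vcf_metadata",  # hypothetical output name, no file extension
    out_types=["table", "json"],
    metadata=["samples", "INFO"],
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from inside the VCF-Simplify directory.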
> python3 VcfSimplify.py SimplifyVCF -h
## VCF Simplify ## : Python application for parsing VCF files.
Author: Bishwa K. Giri
Contributors: Bhuwan Aryal
Contact: bkgiri@uncg.edu, kirannbishwa01@gmail.com, bhuwanaryal19@gmail.com
usage: VCF-Simplify SimplifyVCF [-h] -toType {haplotype,table} -inVCF INVCF
-outFile OUTFILE
[-outHeaderName OUTHEADERNAME]
[-GTbase GTBASE [GTBASE ...]] [-PG PG]
[-PI PI] [-includeUnphased {yes,no,0,1}]
[-samples SAMPLES [SAMPLES ...]]
[-preHeader PREHEADER [PREHEADER ...]]
[-infos INFOS [INFOS ...]]
[-formats FORMATS [FORMATS ...]]
[-mode {wide,long,0,1}]
optional arguments:
-h, --help show this help message and exit
-toType {haplotype,table}
Type of the output file.
-inVCF INVCF Sorted vcf file.
-outFile OUTFILE Name of the output file.
-outHeaderName OUTHEADERNAME
Write the VCF raw METADATA HEADER to a separate output file.
Default: no output.
-GTbase GTBASE [GTBASE ...]
Write the genotype (GT, PG etc.) field as IUPAC base code.
Default = GT:numeric (i.e write 'GT' bases as numeric)
Choices : [GT:numeric, GT:iupac, PG:iupac, .....].
Multiple option can be requested for each genotype fields.
Additional arguments for "VCF To -> Haplotype":
-PG PG FORMAT tag representing the phased genotype.
Default: PG
Note: 'GT' can be used if it contains the phased genotype.
-PI PI FORMAT tag representing the unique phased haplotype block index.
Note: 'CHROM' can also be used as 'PI' if the VCF is phased chromosome or contig wide.
-includeUnphased {yes,no,0,1}
include unphased variants (genotypes) in the haplotype output.
Default: no (0) (i.e do not write unphased variants)
Additional arguments for "VCF To -> Table":
-samples SAMPLES [SAMPLES ...]
SAMPLE of interest:
Space separated name of the samples or matching sample names.
Matching prefix, suffix or string in the names can be provided too.
Choices format:
[0, sample A, sample B, prefix:XXXsample, suffix:sampleXXX, match:XXX, all]
Multiple choices can be requested.
Note: 0 = ignore all the samples; Default = all
-preHeader PREHEADER [PREHEADER ...]
Space separated header fields before the 'INFO' field.
Choices:
[0, CHROM, POS, ID, REF, ALT, QUAL, FILTER, all].
Multiple choices can be requested.
Note: 0 = ignore all the pre-header-keys; Default = all.
-infos INFOS [INFOS ...]
Space separate INFO tags of interest.
Choices :
[0, AC, AF, AN, ..., all].
Multiple choices can be requested.
Note: 0 = ignore all the INFO tags; Default = all
-formats FORMATS [FORMATS ...]
Space separate FORMAT tags of interest.
Choices : [0, GT, PG, PI, ..., all].
Multiple choices can be requested.
Note: 0 = ignore all the pre-header-keys; Default = all.
-mode {wide,long,0,1}
Structure of the output table. Default = wide (0)
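The `-samples` choices described above (exact names, `prefix:`, `suffix:`, `match:`, `0`, `all`) can be read as a simple name filter. The sketch below shows one way such a selection could be resolved; it is an interpretation of the help text, not the code used by VCF-Simplify, and the sample names are made up.

```python
def select_samples(all_samples, choices):
    """Resolve -samples style choices against the sample names of a VCF.

    Interpretation of the help text only; the real tool may behave differently.
    """
    if not choices or "all" in choices:
        return list(all_samples)  # default: keep every sample
    if "0" in choices:
        return []  # 0 = ignore all the samples
    selected = []
    for name in all_samples:
        for choice in choices:
            if choice.startswith("prefix:"):
                hit = name.startswith(choice[len("prefix:"):])
            elif choice.startswith("suffix:"):
                hit = name.endswith(choice[len("suffix:"):])
            elif choice.startswith("match:"):
                hit = choice[len("match:"):] in name
            else:
                hit = name == choice  # plain sample name
            if hit:
                selected.append(name)
                break  # do not add the same sample twice
    return selected


# Hypothetical sample names for illustration.
samples = ["ctrl_01", "ctrl_02", "case_01", "case_02x"]
print(select_samples(samples, ["prefix:ctrl"]))
print(select_samples(samples, ["case_01", "suffix:x"]))
```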
> python3 VcfSimplify.py BuildVCF -h
## VCF Simplify ## : Python application for parsing VCF files.
Author: Bishwa K. Giri
Contributors: Bhuwan Aryal
Contact: bkgiri@uncg.edu, kirannbishwa01@gmail.com, bhuwanaryal19@gmail.com
usage: VCF-Simplify BuildVCF [-h] -fromType {haplotype,table} -inFile INFILE
-outVCF OUTVCF -vcfHeader VCFHEADER
[-samples SAMPLES [SAMPLES ...]]
[-formats FORMATS [FORMATS ...]]
[-infos INFOS [INFOS ...]]
[-GTbase GTBASE [GTBASE ...]] [-haplotypeFormat]
optional arguments:
-h, --help show this help message and exit
-fromType {haplotype,table}
Type of the input file the VCF is being prepared from.
-inFile INFILE Sorted table or haplotype file.
Note:
Haplotype file should be in the format created by 'phase-Stitcher' or 'phase-Extender'.
Table file should be in the format created by 'VCF-Simplify'
Only wide?? table is supported for now.
-outVCF OUTVCF Name of the output VCF file.
-vcfHeader VCFHEADER A custom VCF header to add to the VCF file.
The VCF header should not contain the line with #CHROM ....
#CHROM ... line is auto populated while creating the VCF file.
Additional arguments for "Table To VCF":
-samples SAMPLES [SAMPLES ...]
Name of the samples -> space separated name of the samples that needs to be converted to VCF format.
Default = all
-formats FORMATS [FORMATS ...]
Name of the FORMAT tags to write -> space separated FORMAT tags name.
Default = all
-infos INFOS [INFOS ...]
Name of the INFO tags to write -> space separated INFO tags name.
Default = all
-GTbase GTBASE [GTBASE ...]
Suggest if the genotype fields (GT, PG etc.) are in IUPAC base code.
Default = GT:numeric (i.e assumes 'GT' bases are numeric)
Choices : [GT:numeric, GT:iupac, PG:iupac, .....].
Multiple option can be suggested for each genotype fields.
Additional arguments for "Haplotype To VCF":
-haplotypeFormat report which format (numeric vs. iupac) the haplotype file is in.
Default = iupac
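The help text above notes that the file passed to `-vcfHeader` should not contain the `#CHROM ...` line, since BuildVCF fills it in automatically. When a header is copied from an existing VCF, that line can be dropped first. A minimal sketch, with a hypothetical helper name and a made-up header:

```python
def strip_chrom_line(header_text):
    """Keep the '##' metadata lines and drop the '#CHROM ...' column line."""
    kept = [
        line for line in header_text.splitlines()
        if line.strip() and not line.startswith("#CHROM")
    ]
    return "\n".join(kept) + "\n" if kept else ""


# Made-up header for illustration.
raw_header = (
    "##fileformat=VCFv4.2\n"
    "##contig=<ID=2>\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\n"
)
clean_header = strip_chrom_line(raw_header)
print(clean_header, end="")
```

The cleaned text can then be written to a file and supplied to BuildVCF with `-vcfHeader`.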
I have uploaded a Docker image.
docker pull kazumax/vcfsimplify
docker run -itv $PWD:/data kazumax/vcfsimplify
cd ~/VCF-Simplify
Usage
Specify the VCF file.
python3 VcfSimplify.py SimplifyVCF -toType table -inVCF exampleInput/input_test.vcf -outFile wsimple_table.txt
Output
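The table can be filtered further downstream. As an alternative to awk, the sketch below keeps only the rows inside a position range using the Python standard library. It assumes a tab-delimited table with a header row containing `CHROM` and `POS`; the sample column name and the values are invented for the illustration.

```python
import csv
import io

# Stand-in for the file written by SimplifyVCF. The tab delimiter and the
# sample column name are assumptions made for this illustration.
TABLE_TEXT = (
    "CHROM\tPOS\tREF\tALT\tsampleA:GT\n"
    "2\t1000\tA\tG\t0/1\n"
    "2\t2500\tC\tT\t1/1\n"
    "3\t1200\tG\tA\t0/0\n"
)


def filter_region(handle, chrom, start, end):
    """Return the rows on `chrom` whose POS lies within [start, end]."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        row for row in reader
        if row["CHROM"] == chrom and start <= int(row["POS"]) <= end
    ]


hits = filter_region(io.StringIO(TABLE_TEXT), "2", 900, 2000)
print(hits)

# With a real output file:
# with open("wsimple_table.txt", newline="") as handle:
#     hits = filter_region(handle, "2", 900, 2000)
```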
Citation
Giri, B. K. (2018). VCF-simplify: Tool to build and simplify VCF (variant call format) files.