BEDOPS: converting VCF, GTF, GFF, and other formats to BED - macでインフォマティクス

2019 6/17 Addendum

2020 2/21 Title corrected

2020 3/30 Help output added

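Converting to BED is easy with an awk, Perl, or Python script, but vcf2bed from BEDOPS is convenient because it can filter and classify variants by type (SNVs, insertions, deletions) during the conversion.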

Installation

#homebrew
brew install bedops

#bioconda
conda install -c bioconda -y bedops

$ bedops

bedops
  citation: http://bioinformatics.oxfordjournals.org/content/28/14/1919.abstract
            https://doi.org/10.1093/bioinformatics/bts277
  version:  2.4.37 (typical)
  authors:  Shane Neph & Scott Kuehn

  USAGE: bedops [process-flags] <operation> <File(s)>*

      Every input file must be sorted per the sort-bed utility.
      Each operation requires a minimum number of files as shown below.
        There is no fixed maximum number of files that may be used.
      Input files must have at least the first 3 columns of the BED specification.
      The program accepts BED and Starch file formats.
      May use '-' for a file to indicate reading from standard input (BED format only).

  Process Flags:
      --chrom <chromosome>    Jump to and process data for given <chromosome> only.
      --ec                    Error check input files (slower).
      --header                Accept headers (VCF, GFF, SAM, BED, WIG) in any input file.
      --help                  Print this message and exit successfully.
      --help-<operation>      Detailed help on <operation>.
                                An example is --help-c or --help-complement
      --range L:R             Add 'L' bp to all start coordinates and 'R' bp to end
                                coordinates. Either value may be + or - to grow or
                                shrink regions. With the -e/-n operations, the first
                                (reference) file is not padded, unlike all other files.
      --range S               Pad or shrink input file(s) coordinates symmetrically by S.
                                This is shorthand for: --range -S:S.
      --version               Print program information.

  Operations: (choose one of)
      -c, --complement [-L] File1 [File]*
      -d, --difference ReferenceFile File2 [File]*
      -e, --element-of [bp | percentage] ReferenceFile File2 [File]*
               by default, -e 100% is used. 'bedops -e 1' is also popular.
      -i, --intersect File1 File2 [File]*
      -m, --merge File1 [File]*
      -n, --not-element-of [bp | percentage] ReferenceFile File2 [File]*
               by default, -n 100% is used. 'bedops -n 1' is also popular.
      -p, --partition File1 [File]*
      -s, --symmdiff File1 File2 [File]*
      -u, --everything File1 [File]*
      -w, --chop [bp] [--stagger <nt>] [-x] File1 [File]*
               by default, -w 1 is used with no staggering.

  Example: bedops --range 10 -u file1.bed

  NOTE: Only operations -e|n|u preserve all columns (no flattening)
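The help text describes `--range S` as shorthand for `--range -S:S`, that is, S is subtracted from every start and added to every end. The awk sketch below only illustrates that arithmetic on a made-up two-line BED; it is not how BEDOPS implements the flag, the helper name `pad_bed` is hypothetical, and it makes no attempt to handle starts that would become negative.

```shell
# pad_bed S: hypothetical helper that mimics the arithmetic of "--range S"
# on tab-delimited BED read from stdin (start - S, end + S).
pad_bed() {
  awk -v s="$1" 'BEGIN { FS = OFS = "\t" } { $2 = $2 - s; $3 = $3 + s; print }'
}

# Toy input: two intervals on chr1, padded by 10 bp on each side.
printf 'chr1\t100\t200\nchr1\t500\t650\n' | pad_bed 10
```

With real data the padding would be done by bedops itself, as in the `bedops --range 10 -u file1.bed` example from the help text.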

Official manual

http://bedops.readthedocs.io/en/latest/content/reference/file-management/conversion/vcf2bed.html

Usage

Convert VCF to BED.

vcf2bed < gatk.vcf > gatk.bed
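For orientation, the heart of this conversion is a coordinate shift: VCF positions are 1-based, whereas BED starts are 0-based and ends are exclusive. The awk sketch below approximates that step on a toy VCF, printing the chromosome, POS-1, POS, and the ID. It is an illustration under those assumptions, not vcf2bed itself, which also carries the remaining VCF columns through and sorts its output; the function name `vcf_to_bed4` and the sample records are made up.

```shell
# vcf_to_bed4: hypothetical helper, reads VCF on stdin and prints a 4-column BED.
vcf_to_bed4() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                  # skip VCF header lines
       { print $1, $2 - 1, $2, $3 }   # chrom, 0-based start, end, ID
  '
}

# Toy input: header lines plus two records.
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\nchr1\t101\trs1\tA\tG\nchr1\t250\t.\tAT\tA\n' | vcf_to_bed4
```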

--do-not-sort (-d) Do not sort BED output with sort-bed
--snvs (-v) Report only single nucleotide variants
--insertions (-t) Report only insertion variants
--deletions (-n) Report only deletion variants
--keep-header (-k) Preserve header section as pseudo-BED elements

Convert only SNVs to BED.

vcf2bed --snvs < gatk.vcf > gatk_snps.bed

Count the substitutions, insertions, and deletions.

vcf2bed --snvs < gatk.vcf | wc -l         #SNV
vcf2bed --insertions < gatk.vcf | wc -l   #Insertion
vcf2bed --deletions < gatk.vcf | wc -l    #Deletion
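Each of these commands reads the VCF again. As a rough cross-check, the awk sketch below tallies all three classes in a single pass by comparing the lengths of REF and ALT: single-base REF and ALT count as an SNV, a longer ALT as an insertion, and a longer REF as a deletion. That rule is an assumption made for this sketch and may not match vcf2bed's exact criteria; in particular, multi-allelic records (comma-separated ALT) are not handled. The function name `count_variant_types` and the sample records are hypothetical.

```shell
# count_variant_types: hypothetical helper, reads VCF on stdin and prints
# one tab-delimited count per class.
count_variant_types() {
  awk 'BEGIN { FS = "\t"; n_snv = n_ins = n_del = 0 }
       /^#/ { next }                                          # skip header lines
       length($4) == 1 && length($5) == 1 { n_snv++; next }   # REF and ALT both 1 base
       length($5) > length($4) { n_ins++; next }              # ALT longer than REF
       length($4) > length($5) { n_del++ }                    # REF longer than ALT
       END { printf "SNV\t%d\nInsertion\t%d\nDeletion\t%d\n", n_snv, n_ins, n_del }
  '
}

# Toy input: two SNVs, one deletion, one insertion.
printf '#CHROM\tPOS\tID\tREF\tALT\nchr1\t101\t.\tA\tG\nchr1\t250\t.\tAT\tA\nchr1\t300\t.\tC\tCTT\nchr1\t400\t.\tG\tT\n' | count_variant_types
```

If the counts disagree with the vcf2bed results on real data, the vcf2bed numbers should be treated as the reference.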

convert2bed can convert many formats to BED, including BAM, GFF, GTF, GVF, PSL, RepeatMasker (OUT), SAM, VCF, and WIG; vcf2bed is its wrapper for VCF input.

Convert GFF (GFF3) to BED.

convert2bed --input=gff < input.gff3 > output.bed

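Alternatively, use awk. The following produces a 6-column BED.

# GFF/GTF starts are 1-based and BED starts are 0-based, so 1 is subtracted from column 4
awk 'BEGIN {FS = OFS = "\t"} !/^#/ {print $1,$4-1,$5,$3,$6,$7}' input.gtf > output.bed

awk separates output fields with a space by default, but bedtools expects tab-delimited input, so the output separator (OFS) is set to a tab.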

Addendum

BED to GTF (tested on CentOS)

awk '{print $1"\t"$7"\t"$8"\t"($2+1)"\t"$3"\t"$5"\t"$6"\t"$9"\t"(substr($0, index($0,$10)))}' input.bed > output.gtf
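To make the column mapping easier to follow, the sketch below applies it to one made-up line. It assumes the BED has ten tab-delimited columns in the order chrom, start, end, ID, score, strand, source, feature, frame, attributes (the layout this command appears to expect), and it is a variant that splits on tabs so the whole attribute string is simply column 10. The function name `bed_to_gtf` and the sample line are hypothetical.

```shell
# bed_to_gtf: hypothetical helper, reads 10-column tab-delimited BED on stdin
# and prints GTF (start shifted back from 0-based to 1-based).
bed_to_gtf() {
  awk 'BEGIN { FS = OFS = "\t" }
       { print $1, $7, $8, $2 + 1, $3, $5, $6, $9, $10 }
  '
}

# Toy input: one exon whose BED interval 99-200 becomes GTF 100-200.
printf 'chr1\t99\t200\tg1\t.\t+\ttest\texon\t.\tgene_id "g1"; transcript_id "t1";\n' | bed_to_gtf
```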

To work further with the resulting BED files, use bedtools.

References

BEDOPS: high-performance genomic feature operations
Neph S, Kuehn MS, Reynolds AP, Haugen E, Thurman RE, Johnson AK, Rynes E, Maurano MT, Vierstra J, Thomas S, Sandstrom R, Humbert R, Stamatoyannopoulos JA.

Bioinformatics. 2012 Jul 15;28(14):1919-20

How To Convert Bed Format To Gtf?