2020 1/15 コマンド追記
2021 4/24 追記
2021 5/14 countコマンド修正
2021 5/16 -pについて追記(赤字部分)
GenomeDiffファイルは、サンプルで検出されたすべての変異を記述するbreseqによって出力されるテキストファイルである。 (これらは、コードの「diff」ファイルに似ており、更新されたバージョンを作成するためにテキストファイルを「パッチ」するために必要なわずかな変更を記述する。)GenomeDiffファイルには、遺伝的変異を記述するVCFファイルと同様の情報が含まれているが、 変異イベントの説明に焦点を当てている。 これらにより、(1)トランスポゾンの挿入や増幅などの構造変異に簡単にアノテーションを付けたり、(2)遺伝子アノテーション情報を含めたり、(3)進化データを完全に記述するために必要な場合がある突然変異を適用する順序を設定したりできる。 gdtoolsコマンドは、このGenomeDiffファイルに対してさまざまな機能を実行するために存在するbreseqのユーティリティコマンドである。
manual
http://gensoft.pasteur.fr/docs/breseq/0.23/gd_usage.html
GenomeDiff Format
http://gensoft.pasteur.fr/docs/breseq/0.23/gd_format.html
インストール
本体 Github
conda create -n breseq -c bioconda breseq -y
conda activate breseq
> gdtools
$ gdtools
================================================================================
breseq 0.35.0 http://barricklab.org/breseq
Active Developers: Barrick JE, Deatherage DE
Contact: <jeffrey.e.barrick@gmail.com>
breseq is free software; you can redistribute it and/or modify it under the
terms the GNU General Public License as published by the Free Software
Foundation; either version 2, or (at your option) any later version.
Copyright (c) 2008-2010 Michigan State University
Copyright (c) 2011-2017 The University of Texas at Austin
If you use breseq in your research, please cite:
Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations
in laboratory-evolved microbes from next-generation sequencing
data using breseq. Methods Mol. Biol. 1151: 165–188.
If you use structural variation (junction) predictions, please cite:
Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C.,
Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G.
(2014) Identifying structural variation in haploid microbial genomes
from short-read resequencing data using breseq. BMC Genomics 15:1039.
================================================================================
Usage: gdtools [COMMAND] [OPTIONS]
Manipulate Genome Diff (*.gd) files using the following commands.
General:
VALIDATE check formating of input files
APPLY apply mutations to a sequence
ANNOTATE (or COMPARE) annotate the effects of mutations and compare multiple samples
MUTATIONS (re)predict mutations from evidence
CHECK compare control versus test mutations
NORMALIZE normalize mutation positions and annotations
Set and Filtering Operations:
SUBTRACT remove mutations in one file from another
INTERSECT keep shared mutations in two files
UNION/MERGE combine mutations, removing duplicates
FILTER/REMOVE remove mutations matching specified conditions
MASK remove mutation predictions in masked regions
NOT-EVIDENCE remove evidence not used by any mutations
Format Conversions:
GD2VCF GD to Variant Call Format (VCF)
VCF2GD Variant Call Format(VCF) to GD
GD2GVF GD to Genome Variation Format (GVF)
GD2CIRCOS GD to Circos Data
MUMMER2MASK Create a mask GD file from MUMmer output
Analysis:
COUNT count statistics for different types of mutations
PHYLOGENY create maximum parsimony tree from mutations (requires PHYLIP)
TACC Utilities:
DOWNLOAD download reference and read files from GD header info
RUNFILE create a commands file and launcher script for use on TACC
> gdtools VALIDATE
gdtools VALIDATE -r reference.gbk input1.gd [input2.gd]
-h,--help Display detailed help message
-r,--reference <arg> File containing reference sequences in GenBank, GFF3, or FASTA format. If provided, will validate seq_ids and positions in the GD file using these. Option may be provided multiple times for
multiple files. (OPTIONAL)
-v,--verbose <arg> Verbose mode. Outputs additional information about progress. (OPTIONAL)
Validates whether the format of the input Genome Diff files is correct.
> gdtools APPLY
gdtools APPLY [ -o output.gff3 -f GFF3 ] -r reference.gbk input.gd
-h,--help Display detailed help message
-o,--output <arg> Output file name (DEFAULT=output.*)
-f,--format <arg> Output file format (Options: FASTA, GFF3) (DEFAULT=FASTA)
-r,--reference <arg> File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)
-v,--verbose Verbose mode
Input a single Genome Diff, and as many reference files
as you like. Using the Genome Diff, this will apply all
the mutations to the reference sequences, output is to
a single file that includes all the references in the
requested format. Input file is expected to only have
consensus mutations. Polymorphic mutations are ignored.
> gdtools COMPARE
gdtools COMPARE z
*** Begin ANNOTATE/COMPARE ***
gdtools ANNOTATE/COMPARE [-o annotated.html] -r reference.gbk input.1.gd [input.2.gd ... ]
-h,--help Display detailed help message
-o,--output <arg> Path to output file with added mutation data. (DEFAULT: output.*)
-r,--reference <arg> File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)
-f,--format <arg> Type of output file to generate. See options below (DEFAULT=HTML)
-a,--add-html-fields Add formatted fields that are used for generating HTML output. Only applicable to GD and JSON output formats
--ignore-pseudogenes Treat pseudogenes as valid ORFs for calling amino acid substitutions
--repeat-header <arg> In HTML mode, repeat the header line every this many rows (0=OFF) (DEFAULT=10)
-p,--phylogeny-aware Check the optional 'phylogeny_id' field when deciding if entries are equivalent
-g,--region <arg> Only show mutations that overlap this reference sequence region (e.g., REL606:64722-65312)
-e,--preserve-evidence By default evidence items with two-letter codes are removed (RA, JC, MC, ...). Supply this option to retain them. Only affects output in GD and JSON formats. This option can only be used with a
single input GD file (i.e., not in COMPARE mode).
-c,--collapse Do not show samples (columns) unless they have at least one mutation
ANNOTATE mutations in one or more GenomeDiff files. If multiple input files are provided, then also COMPARE the frequencies for identical mutations across samples.
Valid output formats:
HTML Descriptive table viewable in a web browser
GD GenomeDiff with added annotation of mutations
TSV Tab-separated values file suitable for input into R or Excel
PHYLIP Alignment file suitable for input into PHYLIP
JSON JavaScript object notation file suitable for parsing
When multiple GD files are provided, the #=TITLE metadata line in each file is used to name each sample/column. If that information is not present, then the GD file name is used (removing the *.gd suffix).
In output, frequencies of 'D' mean that this mutation occurs within a region that is deleted by a different mutation in the genome in question. Frequencies of '?' indicate that there were not enough aligned reads to call a mutation at this position in the genome in question (either for or against the mutation).
In PHYLIP output, each column in the mutation 'sequence' that is created corresponds to a unique mutational event. For SNPs the base present at that position in each genome is shown. For other types of mutations, 'A' is used for the ancestral allele (e.g., no transposon insertion), and 'T' is used for the derived allele (e.g., new transposon copy inserted). 'N' is used when it is ambiguous as to whether the mutation occurred in the lineage leading to this genome, either because the position is deleted or there are not sufficient reads aligned to a position to call a mutation (i.e., is inside an UN region).
PHYLIP output is designed to be input into the 'dnapars' program to create a phylogenetic tree.
*** End ANNOTATE/COMPARE ***
> gdtools COUNT
*** Begin COUNT ***
gdtools COUNT [-o count.csv] -r reference.gbk input.1.gd [input.2.gd ... ]
-h,--help Display detailed help message
-v,--verbose produce output for each mutation counted.
-r,--reference <arg> File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)
-o,--output <arg> path to output CSV file with count data. (DEFAULT=count.csv)
-d,--detailed-output <arg> path to optional output tab-delimited file with detailed information about all mutations (Default = OFF)
-s,--calculate-genome-size use APPLY to calculate final genome sizes
-b,--base-substitution-statistics
calculate detailed base substitution statistics
-p,--count-polymorphisms count polymorphic mutations (those with frequencies < 1). (Default = FALSE)
Counts the numbers of mutations and other statistics for each input GenomeDiff file.
In the output "small" mutations are ≤ 50 bp. "large" mutations are >50 bp
For base substitutions overlapping multiple genes, the mutation is counted as the most signficant effect it has on any of the genes it overlaps with the precedence: nonsense > nonsynonymous > synonymous.
> gdtools SUBTRACT
gdtools SUBTRACT [-o output.gd] input.gd subtract1.gd [subtract2.gd ...]
-h,--help Display detailed help message
-o,--output <arg> output GD file (DEFAULT=output.gd)
-p,--phylogeny-aware Check the optional 'phylogeny_id' field when deciding if entries are equivalent
-f,--frequency-aware Use the frequencies of mutations when performing the subtraction. Normally an input mutation is removed if it appears at any frequency in a subtracted file. In this mode its
frequency is reduced by the frequency in each subtracted file. If the resulting frequency is zero or below, then the mutation is removed.
-v,--verbose verbose mode
Creates a new Genome Diff file that contains all mutation entries that are still
present in the input file after removing mutations that are in any of the
subtracted Genome Diff files. All evidence and validation entries are
retained in the output file.
Input Genome Diff and at least one Genome Diff to subtract not provided.
> gdtools SUBTRACT
*** Begin ANNOTATE/COMPARE ***
gdtools ANNOTATE/COMPARE [-o annotated.html] -r reference.gbk input.1.gd [input.2.gd ... ]
-h,--help Display detailed help message
-o,--output <arg> Path to output file with added mutation data. (DEFAULT: output.*)
-r,--reference <arg> File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)
-f,--format <arg> Type of output file to generate. See options below (DEFAULT=HTML)
-a,--add-html-fields Add formatted fields that are used for generating HTML output. Only applicable to GD and JSON output formats
--ignore-pseudogenes Treat pseudogenes as valid ORFs for calling amino acid substitutions
--repeat-header <arg> In HTML mode, repeat the header line every this many rows (0=OFF) (DEFAULT=10)
-p,--phylogeny-aware Check the optional 'phylogeny_id' field when deciding if entries are equivalent
-g,--region <arg> Only show mutations that overlap this reference sequence region (e.g., REL606:64722-65312)
-e,--preserve-evidence By default evidence items with two-letter codes are removed (RA, JC, MC, ...). Supply this option to retain them. Only affects output in GD and JSON formats. This option can only
be used with a single input GD file (i.e., not in COMPARE mode).
-c,--collapse Do not show samples (columns) unless they have at least one mutation
ANNOTATE mutations in one or more GenomeDiff files. If multiple input files are provided, then also COMPARE the frequencies for identical mutations across samples.
Valid output formats:
HTML Descriptive table viewable in a web browser
GD GenomeDiff with added annotation of mutations
TSV Tab-separated values file suitable for input into R or Excel
PHYLIP Alignment file suitable for input into PHYLIP
JSON JavaScript object notation file suitable for parsing
When multiple GD files are provided, the #=TITLE metadata line in each file is used to name each sample/column. If that information is not present, then the GD file name is used (removing the *.gd suffix).
In output, frequencies of 'D' mean that this mutation occurs within a region that is deleted by a different mutation in the genome in question. Frequencies of '?' indicate that there were not enough aligned reads to call a mutation at this position in the genome in question (either for or against the mutation).
In PHYLIP output, each column in the mutation 'sequence' that is created corresponds to a unique mutational event. For SNPs the base present at that position in each genome is shown. For other types of mutations, 'A' is used for the ancestral allele (e.g., no transposon insertion), and 'T' is used for the derived allele (e.g., new transposon copy inserted). 'N' is used when it is ambiguous as to whether the mutation occurred in the lineage leading to this genome, either because the position is deleted or there are not sufficient reads aligned to a position to call a mutation (i.e., is inside an UN region).
> gdtools GD2VC
$ gdtools GD2VCF
================================================================================
breseq 0.35.0 http://barricklab.org/breseq
Active Developers: Barrick JE, Deatherage DE
Contact: <jeffrey.e.barrick@gmail.com>
breseq is free software; you can redistribute it and/or modify it under the
terms the GNU General Public License as published by the Free Software
Foundation; either version 2, or (at your option) any later version.
Copyright (c) 2008-2010 Michigan State University
Copyright (c) 2011-2017 The University of Texas at Austin
If you use breseq in your research, please cite:
Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations
in laboratory-evolved microbes from next-generation sequencing
data using breseq. Methods Mol. Biol. 1151: 165–188.
If you use structural variation (junction) predictions, please cite:
Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C.,
Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G.
(2014) Identifying structural variation in haploid microbial genomes
from short-read resequencing data using breseq. BMC Genomics 15:1039.
================================================================================
gdtools GD2VCF [-o output.vcf] input.gd
-h,--help Display detailed help message
-r,--reference <arg> File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)
-o,--output <arg> name of output file (DEFAULT=output.vcf)
Creates a Variant Call Format (VCF) file of mutations present in an input Genome Diff file.
VCF is a community format that can be loaded into viewers and used as input to other programs.
You must provide exactly one input Genome Diff file.
> gdtools VCF2GD
$ gdtools VCF2GD
================================================================================
breseq 0.35.0 http://barricklab.org/breseq
Active Developers: Barrick JE, Deatherage DE
Contact: <jeffrey.e.barrick@gmail.com>
breseq is free software; you can redistribute it and/or modify it under the
terms the GNU General Public License as published by the Free Software
Foundation; either version 2, or (at your option) any later version.
Copyright (c) 2008-2010 Michigan State University
Copyright (c) 2011-2017 The University of Texas at Austin
If you use breseq in your research, please cite:
Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations
in laboratory-evolved microbes from next-generation sequencing
data using breseq. Methods Mol. Biol. 1151: 165–188.
If you use structural variation (junction) predictions, please cite:
Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C.,
Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G.
(2014) Identifying structural variation in haploid microbial genomes
from short-read resequencing data using breseq. BMC Genomics 15:1039.
================================================================================
VCF2GD [-o output.gd] input.vcf
-h,--help Display detailed help message
-o,--output <arg> name of output Genome Diff file (DEFAULT=output.gd)
Creates a GD file of mutations present in an input Variant Call Format (VCF) file.
Provide a single input VCF file.
実行方法
まずbreseqをランして出力を得る。遺伝子アノテーション情報もプリントするためには、fastaではなくgenbankファイルを指定する。breseqはすべての入力リードをシングルエンドデータとして扱う。fastqは1つだけ指定してもランできるし、複数回シークエンシングしているなら、3つ以上同時に指定してランすることもできる。
ここでは2つのfastqを指定しているが、ペアエンドのロングレンジ情報は利用されない(*1)。
breseq -j 8 -o output -r ref.gb input_R1.fq input_R2.fq
breseqの解析後、output/にGenomeDiffファイルが出力される。これを使用する(変異株が複数あり結果を比較したいなら、全ての株についてbreseqのランを行う)。
出力ディレクトリにあるoutput.gdがGenomeDiffファイルとなる。
1、GenomeDiffファイルの検証。解析に使ったGenBnakファイルとoutput.gdを指定する。例えばマニュアルで情報を追加した時などに実行する。以後、output.gdをinput.gbとして記載する。
gdtools VALIDATE -r ref.gbk input.gb
出力
Running gdtools VALIDATE on 1 file(s)...
File: /Volumes/4TB-gene_new3/breseq_tutorial1/SRR098289_output/output/output.gd
FORMAT OK
::: Summary :::
Total files processed: 1
* No formatting errors found!
2、変異株のゲノム情報の出力。ここではGFF3で出力。
gdtools APPLY -f GFF3 -o output.gff3 -r ref.gbk input.gb
出力
GFF3を選ぶとgenomic featureの記述の後に変異株のゲノム配列(FASTA形式)が出力される。
変異株ゲノムから新しくランを実行できる。
breseq -j 8 -o mutated_output -r output.gff3 input_R1.fq input_R2.fq
3、複数サンプルのGenomeDiffファイルを比較して統合レポートhtmlを出力
gdtools COMPARE -o compare.html -r ref.gbk MT1.gd MT2.gd MT3.gd ...
#たくさんあるならgdファイルをリネームした上で1箇所に集めておいてワイルドカード指定
gdtools COMPARE -o compare.html -r ref.gbk gd_dir/*gd
4サンプルの出力
ここでは同じ名前のファイルoutput.gdを指定したため、output, output1, output2, output3と表記された。
ploidyオプションを付けて(”-p”とだけ指定)breseqをランした 場合、不完全バリアントも表に追加される。
注;-pを付けてbreseqをランしていない場合、不完全なバリアントは表に表示されない(そのポジションに他の株で100%のバリアントがあったとしても、100%以外は表示されない)。
TSVやJSONでも出力可能
#TSV
gdtools COMPARE -o compare.tsv -r ref.gbk -f TSV gd_dir/*gd
- -f Type of output file to generate. See options below (DEFAULT=HTML)
Valid output formats:
- HTML Descriptive table viewable in a web browser
- GD GenomeDiff with added annotation of mutations
- TSV Tab-separated values file suitable for input into R or Excel
- PHYLIP Alignment of genotypes in PHYLIP format for phylogenetic analysis
- FASTA Multi-FASTA alignment of genotypes for phylogenetic analysis
- JSON JavaScript object notation file suitable for parsing
5、変異のカウント
gdtools COUNT -o count.csv -d detail.txt -r ref.gbk MT1.gd MT2.gd MT3.gd ...
6、変異のsubtraction
input.gdからsubtract1.gdとsubtract2.gdにある変異を除き、input.gdのユニークな変異を出力。
gdtools SUBTRACT -o output.gd input.gd subtract1.gd subtract2.gd
7、変異のサマリーhtmlレポート出力
subtractしたGenomeDiffファイルから再びレポートを作成。
gdtools ANNOTATE -o annotated.html -r ref.gbk output.gd
2020 1/15 追記
8、breseqのgdファイルからvcfを出力
gdtools VCF2GD -o output.gd input.vcf
9、VCFファイルからbreseqのgdファイルを出力
gdtools GD2VCF -o output.vcf input.gd
詳細はbreseqのチュートリアル1にまとめられています。
チュートリアルは3まであります。時間がある時に読んでみてください。
補足
breseqは10Mb以下のスモールゲノム向け変異解析ツールです。ラージゲノムではおそらく機能しないので注意して下さい。
引用
Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq
Jeffrey E Barrick, Geoffrey Colburn, Daniel E Deatherage, Charles C Traverse, Matthew D Strand, Jordan J Borges, David B Knoester, Aaron Reba, Austin G Meyer
BMC Genomics volume 15, Article number: 1039 (2014)
関連
*1
1つのリードをスプリットアラインメントしてJCのコールを行なっている。論文のFootnotesセクションで、1つのシークエンスリードの2つの部分が参照ゲノム中の互いに離れた部位に一致するスプリットリードアラインメントだけを解析することで、構造変異に起因するほとんどの新しいシーケンスジャンクションを確実に予測できることが分かった、との記載がある。