macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ユニークな変異や全サンプルの変異HTMLレポート作成機能を持つbreseqのユーティリティツール gdtools

2020 1/15 コマンド追記 

2021 4/24 追記

2021 5/14 countコマンド修正

2021 5/16 -pについて追記(赤字部分)

 

GenomeDiffファイルは、サンプルで検出されたすべての変異を記述するbreseqによって出力されるテキストファイルである。 (これらは、コードの「diff」ファイルに似ており、更新されたバージョンを作成するためにテキストファイルを「パッチ」するために必要なわずかな変更を記述する。)GenomeDiffファイルには、遺伝的変異を記述するVCFファイルと同様の情報が含まれているが、 変異イベントの説明に焦点を当てている。 これらにより、(1)トランスポゾンの挿入や増幅などの構造変異に簡単にアノテーションを付けたり、(2)遺伝子アノテーション情報を含めたり、(3)進化データを完全に記述するために必要な場合がある突然変異を適用する順序を設定したりできる。 gdtoolsコマンドは、このGenomeDiffファイルに対してさまざまな機能を実行するために存在するbreseqのユーティリティコマンドである。

 

manual

http://gensoft.pasteur.fr/docs/breseq/0.23/gd_usage.html

GenomeDiff Format

http://gensoft.pasteur.fr/docs/breseq/0.23/gd_format.html

 

インストール

本体 Github

conda create -n breseq -c bioconda breseq -y
conda activate breseq

gdtools 

$ gdtools

================================================================================

breseq 0.35.0     http://barricklab.org/breseq

 

Active Developers: Barrick JE, Deatherage DE

Contact:           <jeffrey.e.barrick@gmail.com>

 

breseq is free software; you can redistribute it and/or modify it under the

terms the GNU General Public License as published by the Free Software 

Foundation; either version 2, or (at your option) any later version.

 

Copyright (c) 2008-2010 Michigan State University

Copyright (c) 2011-2017 The University of Texas at Austin

 

If you use breseq in your research, please cite:

 

  Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations

  in laboratory-evolved microbes from next-generation sequencing

  data using breseq. Methods Mol. Biol. 1151: 165–188.

 

If you use structural variation (junction) predictions, please cite:

 

  Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C.,

  Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G. 

  (2014) Identifying structural variation in haploid microbial genomes 

  from short-read resequencing data using breseq. BMC Genomics 15:1039.

================================================================================

 

Usage: gdtools [COMMAND] [OPTIONS]

 

Manipulate Genome Diff (*.gd) files using the following commands.

 

General:

    VALIDATE               check formating of input files

    APPLY                  apply mutations to a sequence

    ANNOTATE (or COMPARE)  annotate the effects of mutations and compare multiple samples

    MUTATIONS              (re)predict mutations from evidence

    CHECK                  compare control versus test mutations

    NORMALIZE              normalize mutation positions and annotations

 

Set and Filtering Operations:

    SUBTRACT               remove mutations in one file from another

    INTERSECT              keep shared mutations in two files

    UNION/MERGE            combine mutations, removing duplicates

    FILTER/REMOVE          remove mutations matching specified conditions

    MASK                   remove mutation predictions in masked regions

    NOT-EVIDENCE           remove evidence not used by any mutations

 

Format Conversions:

    GD2VCF                 GD to Variant Call Format (VCF)

    VCF2GD                 Variant Call Format(VCF) to GD

    GD2GVF                 GD to Genome Variation Format (GVF)

    GD2CIRCOS              GD to Circos Data

    MUMMER2MASK            Create a mask GD file from MUMmer output

 

Analysis:

    COUNT                  count statistics for different types of mutations

    PHYLOGENY              create maximum parsimony tree from mutations (requires PHYLIP)

 

TACC Utilities:

    DOWNLOAD               download reference and read files from GD header info

    RUNFILE                create a commands file and launcher script for use on TACC

gdtools VALIDATE

gdtools VALIDATE -r reference.gbk input1.gd [input2.gd]

  -h,--help                        Display detailed help message

  -r,--reference <arg>             File containing reference sequences in GenBank, GFF3, or FASTA format. If provided, will validate seq_ids and positions in the GD file using these.  Option may be provided multiple times for

                                   multiple files. (OPTIONAL)

  -v,--verbose <arg>               Verbose mode. Outputs additional information about progress. (OPTIONAL)

 

Validates whether the format of the input Genome Diff files is correct.

gdtools APPLY

gdtools APPLY [ -o output.gff3 -f GFF3 ] -r reference.gbk input.gd

  -h,--help                        Display detailed help message

  -o,--output <arg>                Output file name (DEFAULT=output.*)

  -f,--format <arg>                Output file format (Options: FASTA, GFF3) (DEFAULT=FASTA)

  -r,--reference <arg>             File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)

  -v,--verbose                     Verbose mode

 

Input a single Genome Diff, and as many reference files

as you like.  Using the Genome Diff, this will apply all

the mutations to the reference sequences, output is to

a single file that includes all the references in the

requested format. Input file is expected to only have

consensus mutations. Polymorphic mutations are ignored.

 

> gdtools COMPARE

gdtools COMPARE z

*** Begin ANNOTATE/COMPARE ***

 

gdtools ANNOTATE/COMPARE [-o annotated.html] -r reference.gbk input.1.gd [input.2.gd ... ]

  -h,--help                        Display detailed help message

  -o,--output <arg>                Path to output file with added mutation data. (DEFAULT: output.*)

  -r,--reference <arg>             File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)

  -f,--format <arg>                Type of output file to generate. See options below (DEFAULT=HTML)

  -a,--add-html-fields             Add formatted fields that are used for generating HTML output. Only applicable to GD and JSON output formats

  --ignore-pseudogenes             Treat pseudogenes as valid ORFs for calling amino acid substitutions

  --repeat-header <arg>            In HTML mode, repeat the header line every this many rows (0=OFF) (DEFAULT=10)

  -p,--phylogeny-aware             Check the optional 'phylogeny_id' field when deciding if entries are equivalent

  -g,--region <arg>                Only show mutations that overlap this reference sequence region (e.g., REL606:64722-65312)

  -e,--preserve-evidence           By default evidence items with two-letter codes are removed (RA, JC, MC, ...). Supply this option to retain them. Only affects output in GD and JSON formats. This option can only be used with a

                                   single input GD file (i.e., not in COMPARE mode). 

  -c,--collapse                    Do not show samples (columns) unless they have at least one mutation

 

ANNOTATE mutations in one or more GenomeDiff files. If multiple input files are provided, then also COMPARE the frequencies for identical mutations across samples.

 

Valid output formats:

  HTML    Descriptive table viewable in a web browser

  GD      GenomeDiff with added annotation of mutations

  TSV     Tab-separated values file suitable for input into R or Excel

  PHYLIP  Alignment file suitable for input into PHYLIP

  JSON  JavaScript object notation file suitable for parsing

 

When multiple GD files are provided, the #=TITLE metadata line in each file is used to name each sample/column. If that information is not present, then the GD file name is used (removing the *.gd suffix).

 

In output, frequencies of 'D' mean that this mutation occurs within a region that is deleted by a different mutation in the genome in question. Frequencies of '?' indicate that there were not enough aligned reads to call a mutation at this position in the genome in question (either for or against the mutation).

 

In PHYLIP output, each column in the mutation 'sequence' that is created corresponds to a unique mutational event. For SNPs the base present at that position in each genome is shown. For other types of mutations, 'A' is used for the ancestral allele (e.g., no transposon insertion), and 'T' is used for the derived allele (e.g., new transposon copy inserted). 'N' is used when it is ambiguous as to whether the mutation occurred in the lineage leading to this genome, either because the position is deleted or there are not sufficient reads aligned to a position to call a mutation (i.e., is inside an UN region).

 

PHYLIP output is designed to be input into the 'dnapars' program to create a phylogenetic tree.

 

 

*** End ANNOTATE/COMPARE ***

> gdtools COUNT

*** Begin COUNT ***

 

gdtools COUNT [-o count.csv] -r reference.gbk input.1.gd [input.2.gd ... ]

  -h,--help                        Display detailed help message

  -v,--verbose                     produce output for each mutation counted.

  -r,--reference <arg>             File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)

  -o,--output <arg>                path to output CSV file with count data. (DEFAULT=count.csv)

  -d,--detailed-output <arg>       path to optional output tab-delimited file with detailed information about all mutations (Default = OFF)

  -s,--calculate-genome-size       use APPLY to calculate final genome sizes

  -b,--base-substitution-statistics                                                                                                                                                                                                  

                                   calculate detailed base substitution statistics

  -p,--count-polymorphisms         count polymorphic mutations (those with frequencies < 1). (Default = FALSE)

 

Counts the numbers of mutations and other statistics for each input GenomeDiff file.

 

In the output "small" mutations are ≤ 50 bp. "large" mutations are >50 bp

For base substitutions overlapping multiple genes, the mutation is counted as the most signficant effect it has on any of the genes it overlaps with the precedence: nonsense > nonsynonymous > synonymous.

> gdtools SUBTRACT

gdtools SUBTRACT [-o output.gd] input.gd subtract1.gd [subtract2.gd ...]

  -h,--help                        Display detailed help message

  -o,--output <arg>                output GD file (DEFAULT=output.gd)

  -p,--phylogeny-aware             Check the optional 'phylogeny_id' field when deciding if entries are equivalent

  -f,--frequency-aware             Use the frequencies of mutations when performing the subtraction. Normally an input mutation is removed if it appears at any frequency in a subtracted file. In this mode its

                                   frequency is reduced by the frequency in each subtracted file. If the resulting frequency is zero or below, then the mutation is removed.

  -v,--verbose                     verbose mode

 

Creates a new Genome Diff file that contains all mutation entries that are still

present in the input file after removing mutations that are in any of the

subtracted Genome Diff files. All evidence and validation entries are

retained in the output file.

 

Input Genome Diff and at least one Genome Diff to subtract not provided.

> gdtools SUBTRACT 

*** Begin ANNOTATE/COMPARE ***

 

gdtools ANNOTATE/COMPARE [-o annotated.html] -r reference.gbk input.1.gd [input.2.gd ... ]

  -h,--help                        Display detailed help message

  -o,--output <arg>                Path to output file with added mutation data. (DEFAULT: output.*)

  -r,--reference <arg>             File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)

  -f,--format <arg>                Type of output file to generate. See options below (DEFAULT=HTML)

  -a,--add-html-fields             Add formatted fields that are used for generating HTML output. Only applicable to GD and JSON output formats

  --ignore-pseudogenes             Treat pseudogenes as valid ORFs for calling amino acid substitutions

  --repeat-header <arg>            In HTML mode, repeat the header line every this many rows (0=OFF) (DEFAULT=10)

  -p,--phylogeny-aware             Check the optional 'phylogeny_id' field when deciding if entries are equivalent

  -g,--region <arg>                Only show mutations that overlap this reference sequence region (e.g., REL606:64722-65312)

  -e,--preserve-evidence           By default evidence items with two-letter codes are removed (RA, JC, MC, ...). Supply this option to retain them. Only affects output in GD and JSON formats. This option can only

                                   be used with a single input GD file (i.e., not in COMPARE mode). 

  -c,--collapse                    Do not show samples (columns) unless they have at least one mutation

 

ANNOTATE mutations in one or more GenomeDiff files. If multiple input files are provided, then also COMPARE the frequencies for identical mutations across samples.

 

Valid output formats:

  HTML    Descriptive table viewable in a web browser

  GD      GenomeDiff with added annotation of mutations

  TSV     Tab-separated values file suitable for input into R or Excel

  PHYLIP  Alignment file suitable for input into PHYLIP

  JSON  JavaScript object notation file suitable for parsing

 

When multiple GD files are provided, the #=TITLE metadata line in each file is used to name each sample/column. If that information is not present, then the GD file name is used (removing the *.gd suffix).

 

In output, frequencies of 'D' mean that this mutation occurs within a region that is deleted by a different mutation in the genome in question. Frequencies of '?' indicate that there were not enough aligned reads to call a mutation at this position in the genome in question (either for or against the mutation).

 

In PHYLIP output, each column in the mutation 'sequence' that is created corresponds to a unique mutational event. For SNPs the base present at that position in each genome is shown. For other types of mutations, 'A' is used for the ancestral allele (e.g., no transposon insertion), and 'T' is used for the derived allele (e.g., new transposon copy inserted). 'N' is used when it is ambiguous as to whether the mutation occurred in the lineage leading to this genome, either because the position is deleted or there are not sufficient reads aligned to a position to call a mutation (i.e., is inside an UN region).

gdtools GD2VC

$ gdtools GD2VCF

================================================================================

breseq 0.35.0     http://barricklab.org/breseq

 

Active Developers: Barrick JE, Deatherage DE

Contact:           <jeffrey.e.barrick@gmail.com>

 

breseq is free software; you can redistribute it and/or modify it under the

terms the GNU General Public License as published by the Free Software 

Foundation; either version 2, or (at your option) any later version.

 

Copyright (c) 2008-2010 Michigan State University

Copyright (c) 2011-2017 The University of Texas at Austin

 

If you use breseq in your research, please cite:

 

  Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations

  in laboratory-evolved microbes from next-generation sequencing

  data using breseq. Methods Mol. Biol. 1151: 165–188.

 

If you use structural variation (junction) predictions, please cite:

 

  Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C.,

  Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G. 

  (2014) Identifying structural variation in haploid microbial genomes 

  from short-read resequencing data using breseq. BMC Genomics 15:1039.

================================================================================

 

gdtools GD2VCF [-o output.vcf] input.gd

  -h,--help                        Display detailed help message

  -r,--reference <arg>             File containing reference sequences in GenBank, GFF3, or FASTA format. Option may be provided multiple times for multiple files (REQUIRED)

  -o,--output <arg>                name of output file (DEFAULT=output.vcf)

 

Creates a Variant Call Format (VCF) file of mutations present in an input Genome Diff file.

VCF is a community format that can be loaded into viewers and used as input to other programs.

 

You must provide exactly one input Genome Diff file.

gdtools  VCF2GD

$ gdtools  VCF2GD

================================================================================

breseq 0.35.0     http://barricklab.org/breseq

 

Active Developers: Barrick JE, Deatherage DE

Contact:           <jeffrey.e.barrick@gmail.com>

 

breseq is free software; you can redistribute it and/or modify it under the

terms the GNU General Public License as published by the Free Software 

Foundation; either version 2, or (at your option) any later version.

 

Copyright (c) 2008-2010 Michigan State University

Copyright (c) 2011-2017 The University of Texas at Austin

 

If you use breseq in your research, please cite:

 

  Deatherage, D.E., Barrick, J.E. (2014) Identification of mutations

  in laboratory-evolved microbes from next-generation sequencing

  data using breseq. Methods Mol. Biol. 1151: 165–188.

 

If you use structural variation (junction) predictions, please cite:

 

  Barrick, J.E., Colburn, G., Deatherage D.E., Traverse, C.C.,

  Strand, M.D., Borges, J.J., Knoester, D.B., Reba, A., Meyer, A.G. 

  (2014) Identifying structural variation in haploid microbial genomes 

  from short-read resequencing data using breseq. BMC Genomics 15:1039.

================================================================================

 

VCF2GD [-o output.gd] input.vcf

  -h,--help                        Display detailed help message

  -o,--output <arg>                name of output Genome Diff file (DEFAULT=output.gd)

 

Creates a GD file of mutations present in an input Variant Call Format (VCF) file.

 

Provide a single input VCF file.

 

 

 

実行方法

まずbreseqをランして出力を得る。遺伝子アノテーション情報もプリントするためには、fastaではなくgenbankファイルを指定する。breseqはすべての入力リードをシングルエンドデータとして扱う。fastqは1つだけ指定してもランできるし、複数回シークエンシングしているなら、3つ以上同時に指定してランすることもできる。

ここでは2つのfastqを指定しているが、ペアエンドのロングレンジ情報は利用されない(*1)。

breseq -j 8 -o output -r ref.gb input_R1.fq input_R2.fq

 breseqの解析後、output/にGenomeDiffファイルが出力される。これを使用する(変異株が複数あり結果を比較したいなら、全ての株についてbreseqのランを行う)。

 

出力ディレクトリにあるoutput.gdがGenomeDiffファイルとなる。

f:id:kazumaxneo:20191224122138p:plain

 

1、GenomeDiffファイルの検証。解析に使ったGenBnakファイルとoutput.gdを指定する。例えばマニュアルで情報を追加した時などに実行する。以後、output.gdをinput.gbとして記載する。

gdtools VALIDATE -r ref.gbk input.gb

出力

Running gdtools VALIDATE on 1 file(s)...

 

File: /Volumes/4TB-gene_new3/breseq_tutorial1/SRR098289_output/output/output.gd

  FORMAT OK

 

::: Summary :::

  Total files processed:              1

* No formatting errors found!

 

 

2、変異株のゲノム情報の出力。ここではGFF3で出力。

gdtools APPLY -f GFF3 -o output.gff3 -r ref.gbk input.gb
  • -f <FASTA|GFF3>   Output file format (Options: FASTA, GFF3) (DEFAULT=FASTA)

出力

GFF3を選ぶとgenomic featureの記述の後に変異株のゲノム配列(FASTA形式)が出力される。

f:id:kazumaxneo:20191224123142p:plain

 

変異株ゲノムから新しくランを実行できる。

breseq -j 8 -o mutated_output -r output.gff3 input_R1.fq input_R2.fq

 

 

3、複数サンプルのGenomeDiffファイルを比較して統合レポートhtmlを出力

gdtools COMPARE -o compare.html -r ref.gbk MT1.gd MT2.gd MT3.gd ...

#たくさんあるならgdファイルをリネームした上で1箇所に集めておいてワイルドカード指定
gdtools COMPARE -o compare.html -r ref.gbk gd_dir/*gd

 4サンプルの出力

f:id:kazumaxneo:20191224123929p:plain

ここでは同じ名前のファイルoutput.gdを指定したため、output, output1, output2, output3と表記された。

f:id:kazumaxneo:20191224124015p:plain

 

ploidyオプションを付けて(”-p”とだけ指定)breseqをランした 場合、不完全バリアントも表に追加される。

f:id:kazumaxneo:20210516173615p:plain

注;-pを付けてbreseqをランしていない場合、不完全なバリアントは表に表示されない(そのポジションに他の株で100%のバリアントがあったとしても、100%以外は表示されない)。

TSVやJSONでも出力可能

#TSV
gdtools COMPARE -o compare.tsv -r ref.gbk -f TSV gd_dir/*gd
  • -f   Type of output file to generate. See options below (DEFAULT=HTML)

Valid output formats:

  • HTML Descriptive table viewable in a web browser
  • GD GenomeDiff with added annotation of mutations
  • TSV Tab-separated values file suitable for input into R or Excel
  • PHYLIP Alignment of genotypes in PHYLIP format for phylogenetic analysis
  • FASTA Multi-FASTA alignment of genotypes for phylogenetic analysis
  • JSON JavaScript object notation file suitable for parsing

 

 

5、変異のカウント

gdtools COUNT -o count.csv -d detail.txt -r ref.gbk MT1.gd MT2.gd MT3.gd ...

 

6、変異のsubtraction

input.gdからsubtract1.gdとsubtract2.gdにある変異を除き、input.gdのユニークな変異を出力。

gdtools SUBTRACT -o output.gd input.gd subtract1.gd subtract2.gd

 

7、変異のサマリーhtmlレポート出力

subtractしたGenomeDiffファイルから再びレポートを作成。

gdtools ANNOTATE -o annotated.html -r ref.gbk output.gd

f:id:kazumaxneo:20191224125953p:plain


 

2020 1/15 追記

8、breseqのgdファイルからvcfを出力

gdtools VCF2GD -o output.gd input.vcf

9、VCFファイルからbreseqのgdファイルを出力

gdtools GD2VCF -o output.vcf input.gd

 

詳細はbreseqのチュートリアル1にまとめられています。

https://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/tutorial_clones.html

チュートリアルは3まであります。時間がある時に読んでみてください。

 

補足

breseqは10Mb以下のスモールゲノム向け変異解析ツールです。ラージゲノムではおそらく機能しないので注意して下さい。

引用

Identifying structural variation in haploid microbial genomes from short-read resequencing data using breseq

Jeffrey E Barrick, Geoffrey Colburn, Daniel E Deatherage, Charles C Traverse, Matthew D Strand, Jordan J Borges, David B Knoester, Aaron Reba, Austin G Meyer
BMC Genomics volume 15, Article number: 1039 (2014) 

 

関連


*1

1つのリードをスプリットアラインメントしてJCのコールを行なっている。論文のFootnotesセクションで、1つのシークエンスリードの2つの部分が参照ゲノム中の互いに離れた部位に一致するスプリットリードアラインメントだけを解析することで、構造変異に起因するほとんどの新しいシーケンスジャンクションを確実に予測できることが分かった、との記載がある。