2020-08-14

アノテーションパイプライン BRAKER2

2020 8/14、15、10/1、10/2 追記, タイトル修正、誤字修正

2021 2/9、9/4 追記

2022 1/.29 12/23 condaインストール修正、追記

2023/01/04, 01/09.01/11 間違った説明を修正、Documentリンク修正、画像追加

2023/03/03 braker3.0の公開について

遺伝子予測の完全自動化は、次世代シーケンシングの出現以来、バイオインフォマティクスの重要な課題となっている。真核生物ゲノムアノテーションパイプラインBRAKER1では、自己学習型のGeneMark ETとAUGUSTUSを組み合わせて、トランスクリプトームデータをサポートした遺伝子座標(genes coordinates)を生成していた。ここでは、GeneMark EP+とAUGUSTUSによる外部からのタンパク質配列のゲノムアラインメントをサポートしたパイプラインBRAKER2を紹介する。このパイプラインの開発において課題となったのは、相同性はあるが進化的に遠いタンパク質から、タンパク質をコードするエクソン境界の位置を信頼できるヒントを生成することであった。同等の条件で、BRAKER2の遺伝子予測精度は、他のゲノムアノテーションパイプラインであるMAKER2よりも高いことが示された。また、大量のトランスクリプトデータに支えられたBRAKER1と比較して、タンパク質データベースの参照種との進化距離が小さい場合には、BRAKER2の方がより良い遺伝子予測精度が得られることが示された。また、その結果、BRAKER2は真核生物ゲノムの構造アノテーションを高速かつ正確に行うことができることが示された。

下に引用したMethods in Molecular Biologyの論文ではbrakerを使う流れがコマンドも含めて詳細に記述されています。確認して下さい。

User guide

https://bioinf.uni-greifswald.de/augustus/binaries/tutorial2018/BRAKER_userguide.pdf

2023/03/03 に追記。braker3の公開。

BRAKER v.3.0.0 release is now available. https://t.co/H4cENbPvTs Docker container has been updated https://t.co/D2Uok0F1Ia GeneMark-ETP is in the container. Best of luck! Team @Goemario + @katharina_hoff hope you can run BRAKER3, now!
— Katharina Hoff (@katharina_hoff) March 2, 2023

dockerhub link

インストール

dockerhubで公開されているイメージを使ってテストした。

Github

#condaで導入可能(テスト済み)　
#bioconda(link) (v1は別link) 
mamba create -n braker2 -c conda-forge -c bioconda -c defaults braker2
conda activate braker2


#GeneMarkはライセンスの関係でHPより直接ダウンロードする必要がある (link)。
#プログラム本体とキーをダウンロードして、解凍したGeneMarkのディレクトリに移動、キーを登録する。最後にパスを通す。
cd GeneMark_dir
#Install the key
zcat gm_key_64.gz > ~/.gm_key
#test run 
bash check_install.bash
#パスを通す
export GENEMARK_PATH=/your/path/to/GeneMark-ET/

ここでは公開されているdockerイメージを使う。nanozooさんが公開しているイメージを選んだ。 => hamiltonjpさんの公開しているイメージに変更（link）。こちらはGeneMarkだけ入れればエラーなく動作する。ただし、このイメージの2.14はUTR=onで使われるutrrnaseqでエラーが出るので注意（このオプションを使わないならOK）。

#このイメージが使える（足りないものあり*3）
docker pull hamiltonjp/braker2:a765b80
docker run -itv $PWD:/data hamiltonjp/braker2:a765b80

#GeneMarkはライセンスの関係でHPより直接ダウンロードする必要がある (link)。
#プログラム本体とキーをダウンロードして、解凍したGeneMarkのディレクトリに移動、キーを登録する。最後にパスを通す。
cd GeneMark_dir
#Install the key
zcat gm_key_64.gz > ~/.gm_key
#test run 
bash check_install.bash
#パスを通す
export GENEMARK_PATH=/your/path/to/GeneMark-ET/

> braker.pl -v #version 2.1.4

# braker.pl -v

Option v is ambiguous (verbosity, version)

Use of uninitialized value $epath in concatenation (.) or string at /opt/conda/bin/braker.pl line 1986.

#**********************************************************************************

# BRAKER CONFIGURATION

#**********************************************************************************

# BRAKER CALL: /opt/conda/bin/braker.pl -v

# Thu Aug 13 06:07:03 2020: braker.pl version 2.1.4

# Thu Aug 13 06:07:03 2020: Configuring of BRAKER for using external tools...

# Thu Aug 13 06:07:03 2020: Found environment variable $AUGUSTUS_CONFIG_PATH. Setting $AUGUSTUS_CONFIG_PATH to /opt/conda/config/

# Thu Aug 13 06:07:03 2020: Found environment variable $AUGUSTUS_BIN_PATH. Setting $AUGUSTUS_BIN_PATH to /opt/conda/bin/

# Thu Aug 13 06:07:03 2020: Found environment variable $AUGUSTUS_SCRIPTS_PATH. Setting $AUGUSTUS_SCRIPTS_PATH to /opt/conda/bin/

# Thu Aug 13 06:07:03 2020: Did not find environment variable $GENEMARK_PATH (either variable does not exist, or the path given in variable does not exist). Will try to set this variable in a different way, later.

# Thu Aug 13 06:07:03 2020: Trying to guess $GENEMARK_PATH from location of gmes_petap.pl executable that is available in your $PATH

#*********

# WARNING: Guessing the location of $GENEMARK_PATH failed. is not a directory!

#*********

# Thu Aug 13 06:07:03 2020: ERROR: in file /opt/conda/bin/braker.pl at line 2015

$GENEMARK_PATH not set!

There are 3 alternative ways to set this variable for

braker.pl:

a) provide command-line argument --GENEMARK_PATH=/your/path

b) use an existing environment variable $GENEMARK_PATH

for setting the environment variable, run

export GENEMARK_PATH=/your/path

in your shell. You may append this to your .bashrc or

.profile file in order to make the variable available to

all your bash sessions.

c) braker.pl can try guessing the location of

$GENEMARK_PATH from the location of a gmes_petap.pl

executable that is available in your $PATH variable.

If you try to rely on this option, you can check by

typing

which gmes_petap.pl

in your shell, whether there is a bamtools executable in

your $PATH

> braker.pl

# braker.pl

braker.pl Pipeline for predicting genes with GeneMark-ET and AUGUSTUS with

RNA-Seq

SYNOPSIS

braker.pl [OPTIONS] --genome=genome.fa --bam=rnaseq.bam

INPUT FILE OPTIONS

--genome=genome.fa fasta file with DNA sequences

--bam=rnaseq.bam bam file with spliced alignments from

RNA-Seq

--hints=hints.gff Alternatively to calling braker.pl with a

bam file, it is possible to call it with a

file that contains introns extracted from

RNA-Seq (or other data) in gff format.

This flag also allows the usage of hints

from additional extrinsic sources for gene

prediction with AUGUSTUS. To consider such

additional extrinsic information, you need

to use the flag --extrinsicCfgFiles to

specify parameters for all sources in the

hints file (including the source "E" for

intron hints from RNA-Seq).

--prot_seq=prot.fa A protein sequence file in multiple fasta

format. This file will be used to generate

protein hints for AUGUSTUS by running one

of the three alignment tools Exonerate

(--prg=exonerate), Spaln (--prg=spaln) or

GenomeThreader (--prg=gth). Default is

GenomeThreader if the tool is not

specified. Currently, hints from protein

sequences are only used in the prediction

step with AUGUSTUS.

--prot_aln=prot.aln Alignment file generated from aligning

protein sequences against the genome with

either Exonerate (--prg=exonerate), or

Spaln (--prg=spaln), or GenomeThreader

(--prg=gth).

To prepare alignment file, run Spaln2 with

the following command:

spaln -O0 ... > spalnfile

To prepare alignment file, run Exonerate

with the following command:

exonerate --model protein2genome \

--showtargetgff T ... > exfile

To prepare alignment file, run

GenomeThreader with the following command:

gth -genomic genome.fa -protein \

protein.fa -gff3out \

-skipalignmentout ... -o gthfile

A valid option prg=... must be specified

in combination with --prot_aln. Generating

tool will not be guessed.

Currently, hints from protein alignment

files are only used in the prediction step

with AUGUSTUS.

--AUGUSTUS_ab_initio output ab initio predictions by AUGUSTUS

in addition to predictions with hints by

AUGUSTUS

FREQUENTLY USED OPTIONS

--species=sname Species name. Existing species will not be

overwritten. Uses Sp_1 etc., if no species

is assigned

--softmasking Softmasking option for soft masked genome

files. (Disabled by default.)

--esmode Run GeneMark-ES (genome sequence only) and

train AUGUSTUS on long genes predicted by

GeneMark-ES. Final predictions are ab initio

--epmode Run GeneMark-EP with intron hints provided

from protein data. This mode is not

comptabile with using the aligners

GenomeThreader, Exonerate and Spaln within

braker.pl because etpmode and epmode require

a large database of proteins and such

mapping should be done outside of braker.pl

e.g. on a cluster.

--etpmode Run GeneMark-ETP with hints provided from

proteins and RNA-Seq data. This mode is not

compatible with using the aligners

GenomeThreader, Exonerate and Spaln within

braker.pl because etpmode and epmode require

a large database of proteins and such

mapping should be done outside of braker.pl

e.g. on a cluster.

--gff3 Output in GFF3 format (default is gtf

format)

--cores Specifies the maximum number of cores that

can be used during computation. Be aware:

optimize_augustus.pl will use max. 8

cores; augustus will use max. nContigs in

--genome=file cores.

--workingdir=/path/to/wd/ Set path to working directory. In the

working directory results and temporary

files are stored

--nice Execute all system calls within braker.pl

and its submodules with bash "nice"

(default nice value)

--alternatives-from-evidence=true Output alternative transcripts based on

explicit evidence from hints (default is

true).

--crf Execute CRF training for AUGUSTUS;

resulting parameters are only kept for

final predictions if they show higher

accuracy than HMM parameters.

--keepCrf keep CRF parameters even if they are not

better than HMM parameters

--UTR=on create UTR training examples from RNA-Seq

coverage data; requires options

--bam=rnaseq.bam and --softmasking.

Alternatively, if UTR parameters already

exist, training step will be skipped and

those pre-existing parameters are used.

--prg=gth|exonerate|spaln Alignment tool gth (GenomeThreader),

exonerate (Exonerate) or Spaln2

(spaln) that will be used to generate

protein alignments that will be the

basis for hints generation for gene

prediction with AUGUSTUS (if specified

in combination with --prot_seq) or that

was used to externally generate an

alignment file with the commands listed in

description of --prot_aln (if used in

combination with --prot_aln).

--gth2traingenes Generate training gene structures for

AUGUSTUS from GenomeThreader alignments.

(These genes can either be used for

training AUGUSTUS alone with

--trainFromGth; or in addition to

GeneMark-ET training genes if also a

bam-file is supplied.)

--trainFromGth No GeneMark-Training, train AUGUSTUS from

GenomeThreader alignments

--makehub Create track data hub with make_hub.py

for visualizing BRAKER results with the

UCSC GenomeBrowser

--email E-mail address for creating track data hub

--version Print version number of braker.pl

--help Print this help message

CONFIGURATION OPTIONS (TOOLS CALLED BY BRAKER)

--AUGUSTUS_CONFIG_PATH=/path/ Set path to config directory of AUGUSTUS

(if not specified as environment

variable). BRAKER1 will assume that the

directories ../bin and ../scripts of

AUGUSTUS are located relative to the

AUGUSTUS_CONFIG_PATH. If this is not the

case, please specify AUGUSTUS_BIN_PATH

(and AUGUSTUS_SCRIPTS_PATH if required).

The braker.pl commandline argument

--AUGUSTUS_CONFIG_PATH has higher priority

than the environment variable with the

same name.

--AUGUSTUS_BIN_PATH=/path/ Set path to the AUGUSTUS directory that

contains binaries, i.e. augustus and

etraining. This variable must only be set

if AUGUSTUS_CONFIG_PATH does not have

../bin and ../scripts of AUGUSTUS relative

to its location i.e. for global AUGUSTUS

installations. BRAKER1 will assume that

the directory ../scripts of AUGUSTUS is

located relative to the AUGUSTUS_BIN_PATH.

If this is not the case, please specify

--AUGUSTUS_SCRIPTS_PATH.

--AUGUSTUS_SCRIPTS_PATH=/path/ Set path to AUGUSTUS directory that

contains scripts, i.e. splitMfasta.pl.

This variable must only be set if

AUGUSTUS_CONFIG_PATH or AUGUSTUS_BIN_PATH

do not contains the ../scripts directory

of AUGUSTUS relative to their location,

i.e. for special cases of a global

AUGUSTUS installation.

--BAMTOOLS_PATH=/path/to/ Set path to bamtools (if not specified as

environment BAMTOOLS_PATH variable). Has

higher priority than the environment

variable.

--GENEMARK_PATH=/path/to/ Set path to GeneMark-ET (if not specified

as environment GENEMARK_PATH variable).

Has higher priority than environment

variable.

--SAMTOOLS_PATH=/path/to/ Optionally set path to samtools (if not

specified as environment SAMTOOLS_PATH

variable) to fix BAM files automatically,

if necessary. Has higher priority than

environment variable.

--ALIGNMENT_TOOL_PATH=/path/to/tool Set path to alignment tool

(GenomeThreader, Spaln, or Exonerate) if

not specified as environment

ALIGNMENT_TOOL_PATH variable. Has higher

priority than environment variable.

--DIAMOND_PATH=/path/to/diamond Set path to diamond, this is an alternative

to NCIB blast; you only need to specify one

out of DIAMOND_PATH or BLAST_PATH, not both.

DIAMOND is a lot faster that BLAST and yields

highly similar results for BRAKER.

--BLAST_PATH=/path/to/blastall Set path to NCBI blastall and formatdb

executables if not specified as

environment variable. Has higher priority

than environment variable.

--PYTHON3_PATH=/path/to Set path to python3 executable (if not

specified as envirnonment variable and if

executable is not in your $PATH).

--MAKEHUB_PATH=/path/to Set path to make_hub.py (if option --makehub

is used).

--CDBTOOLS_PATH=/path/to cdbfasta/cdbyank are required for running

fix_in_frame_stop_codon_genes.py. Usage of

that script can be skipped with option

'--skip_fixing_broken_genes'.

EXPERT OPTIONS

--augustus_args="--some_arg=bla" One or several command line arguments to

be passed to AUGUSTUS, if several

arguments are given, separated by

whitespace, i.e.

"--first_arg=sth --second_arg=sth".

--overwrite Overwrite existing files (except for

species parameter files)

--skipGeneMark-ES Skip GeneMark-ES and use provided

GeneMark-ES output (e.g. provided with

--geneMarkGtf=genemark.gtf)

--skipGeneMark-ET Skip GeneMark-ET and use provided

GeneMark-ET output (e.g. provided with

--geneMarkGtf=genemark.gtf)

--skipGeneMark-EP Skip GeneMark-EP and use provided

GeneMark-EP output (e.g. provided with

--geneMarkGtf=genemark.gtf)

--skipGeneMark-ETP Skip GeneMark-ETP and use provided

GeneMark-ETP output (e.g. provided with

--geneMarkGtf=genemark.gtf)

--geneMarkGtf=file.gtf If skipGeneMark-ET is used, braker will by

default look in the working directory in

folder GeneMarkET for an already existing

gtf file. Instead, you may provide such a

file from another location. If geneMarkGtf

option is set, skipGeneMark-ES/ET/EP/ETP is

automatically also set.

--rounds The number of optimization rounds used in

optimize_augustus.pl (default 5)

--skipAllTraining Skip GeneMark-EX (training and

prediction), skip AUGUSTUS training, only

runs AUGUSTUS with pre-trained and already

existing parameters (not recommended).

Hints from input are still generated.

This option automatically sets

--useexisting to true.

--useexisting Use the present config and parameter files

if they exist for 'species'

--filterOutShort It may happen that a "good" training gene,

i.e. one that has intron support from

RNA-Seq in all introns predicted by

GeneMark, is in fact too short. This flag

will discard such genes that have

supported introns and a neighboring

RNA-Seq supported intron upstream of the

start codon within the range of the

maximum CDS size of that gene and with a

multiplicity that is at least as high as

20% of the average intron multiplicity of

that gene.

--skipOptimize Skip optimize parameter step (not

recommended).

--skipGetAnnoFromFasta Skip calling the python3 script

getAnnoFastaFromJoingenes.py from the

AUGUSTUS tool suite. This script requires

python3, biopython and re (regular

expressions) to be installed. It produces

coding sequence and protein FASTA files

from AUGUSTUS gene predictions and provides

information about genes with in-frame stop

codons. If you enable this flag, these files

will not be produced and python3 and

the required modules will not be necessary

for running braker.pl.

--skip_fixing_broken_genes If you do not have python3, you can choose

to skip the fixing of stop codon including

genes (not recommended).

--fungus GeneMark-ET option: run algorithm with

branch point model (most useful for fungal

genomes)

--rnaseq2utr_args=params Expert option: pass alternative parameters

to rnaseq2utr as string, default parameters:

-r 76 -v 100 -n 15 -i 0.7 -m 0.3 -w 70

-c 100 -p 0.5

--eval=reference.gtf Reference set to evaluate predictions

against (using the eval package)

--AUGUSTUS_hints_preds=s File with AUGUSTUS hints predictions; will

use this file as basis for UTR training;

only UTR training and prediction is

performed if this option is given.

--flanking_DNA=n Size of flanking region, must only be

specified if --AUGUSTUS_hints_preds is given

(for UTR training in a separate braker.pl

run that builds on top of an existing run)

--verbosity=n 0 -> run braker.pl quiet (no log)

1 -> only log warnings

2 -> also log configuration

3 -> log all major steps

4 -> very verbose, log also small steps

--downsampling_lambda=d The distribution of introns in training

gene structures generated by GeneMark-EX

has a huge weight on single-exon and

few-exon genes. Specifying the lambda

parameter of a poisson distribution will

make braker call a script for downsampling

of training gene structures according to

their number of introns distribution, i.e.

genes with none or few exons will be

downsampled, genes with many exons will be

kept. Default value is 2.

If you want to avoid downsampling, you have

to specify 0.

--checkSoftware Only check whether all required software

is installed, no execution of BRAKER

--nocleanup Skip deletion of all files that are typically not

used in an annotation project after

running braker.pl. (For tracking any

problems with a braker.pl run, you

might want to keep these files, therefore

nocleanup can be activated.)

DEVELOPMENT OPTIONS (PROBABLY STILL DYSFUNCTIONAL)

--splice_sites=patterns list of splice site patterns for UTR

prediction; default: GTAG, extend like this:

--splice_sites=GTAG,ATAC,...

this option only affects UTR training

example generation, not gene prediction

by AUGUSTUS

--extrinsicCfgFiles=file1,file2,... Depending on the mode in which braker.pl

is executed, it may require one ore several

extrinsicCfgFiles. Don't use this option

unless you know what you are doing!

--stranded=+,-,+,-,... If UTRs are trained, i.e.~strand-specific

bam-files are supplied and coverage

information is extracted for gene prediction,

create stranded ep hints. The order of

strand specifications must correspond to the

order of bam files. Possible values are

+, -, .

If stranded data is provided, ONLY coverage

data from the stranded data is used to

generate UTR examples! Coverage data from

unstranded data is used in the prediction

step, only.

The stranded label is applied to coverage

data, only. Intron hints are generated

from all libraries treated as "unstranded"

(because splice site filtering eliminates

intron hints from the wrong strand, anyway).

--optCfgFile=ppx.cfg Optional custom config file for AUGUSTUS

for running PPX (currently not

implemented)

--grass Switch this flag on if you are using braker.pl

for predicting genes in grasses with

GeneMark-ES/ET. The flag will enable

GeneMark-ES/ET to handle GC-heterogenicity

within genes more properly.

NOTHING IMPLEMENTED FOR GRASS YET!

--transmasked_fasta=file.fa Transmasked genome FASTA file for GeneMark

(to be used instead of the regular genome

FASTA file).

--min_contig=INT Minimal contig length for GeneMark, could

for example be set to 10000 if transmasked_fasta

option is used because transmasking might

introduce many very short contigs.

--translation_table=INT Change translation table from non-standard

to something else.

DOES NOT WORK YET BECAUSE BRAKER DOESNT

SWITCH TRANSLATION TABLE FOR GENEMARK-EX, YET!

DESCRIPTION

Example:

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \

--bam=accepted_hits.bam

braker.pl [OPTIONS] --genome=genome.fa --species=speciesname \

--hints=rnaseq.gff

To run with protein data from remote species and GeneMark-EP:

braker.pl [OPTIONS] --genome=genome.fa --hints=proteinintrons.gff --epmode=1

To run with protein data from a very closely related species:

braker.pl [OPTIONS] --genome=genome.fa --prot_seq=proteins.fa --prg=gth \

--gth2traingenes --trainFromGth

テストラン

git clone https://github.com/Gaius-Augustus/BRAKER.git
cd BRAKER/example/
wget http://topaz.gatech.edu/GeneMark/Braker/RNAseq.bam

f:id:kazumaxneo:20200813165357p:plain

brakerを実行する。

#Testing BRAKER with proteins of close homoogy and RNA-Seq data
braker.pl --genome genome.fa --bam RNAseq.bam --softmasking --cores 8

#Testing BRAKER with pre-trained parameters
braker.pl --genome=genome.fa --bam RNAseq.bam --species=arabidopsis \
 --skipAllTraining --softmasking --cores 8

#Testing BRAKER with genome sequence
braker.pl --genome=genome.fa --esmode --softmasking --cores 8

#Testing BRAKER with proteins of close homology
braker.pl --genome genome.fa --prot_seq proteins.fa --prg gth \
 --trainFromGth --softmasking --cores 8

--genome fasta file with DNA sequences
--bam bam file with spliced alignments from RNA-Seq
--softmasking Softmasking option for soft masked genome files. (Disabled by default.)
--cores Specifies the maximum number of cores that can be used during computation. Be aware: optimize_augustus.pl will use max. 8 cores; augustus will use max. nContigs in --genome=file cores
--species Species name. Existing species will not be overwritten. Uses Sp_1 etc., if no species is assigned
--esmode Run GeneMark-ES (genome sequence only) and train AUGUSTUS on long genes predicted by GeneMark-ES. Final predictions are ab initio
--prot_seq A protein sequence file in multiple fasta format. This file will be used to generate protein hints for AUGUSTUS by running one of the three alignment tools Exonerate (--prg=exonerate), Spaln (--prg=spaln) or GenomeThreader (--prg=gth). Default is GenomeThreader if the tool is not specified. Currently, hints from protein sequences are only used in the prediction step with AUGUSTUS
--prg Alignment tool gth (GenomeThreader), exonerate (Exonerate) or Spaln2 (spaln) that will be used to generate protein alignments that will be the basis for hints generation for gene prediction with AUGUSTUS (if specified in combination with --prot_seq) or that was used to externally generate an alignment file with the commands listed in description of --prot_aln (if used in combination with --prot_aln)
--trainFromGth No GeneMark-Training, train AUGUSTUS fromGenomeThreader alignments

プリセット

ls /opt/conda/config//species/

f:id:kazumaxneo:20200813184848p:plain

ls -lh /opt/conda/config//species/arabidopsis/

f:id:kazumaxneo:20200813184954p:plain

出力（manual p20 outputより）

f:id:kazumaxneo:20201002224808p:plain

最終出力はマニュアルのFigure.1.1と1.2、3.1 ではAUGUSTUS.gtf（essmodeではAUGUSTUS_ab_initio.gtf）となっている。またbraker2.gtfも出力される（*1）。（出力はオプションの設定で少しだけ変わることに注意）。

実行方法

braker2のどのモードを使用するかは、利用できるデータの内容によって決まる。

その生物のRNA seq情報が利用できる場合はマニュアルの3.1.1参照
近縁のproteome情報が利用できる場合はマニュアルの3.1.3参照
近い種のproteome情報も利用できない場合はマニュアルの3.1.2参照
RNA seq情報が利用できるがデータサイズが小さい場合はマニュアルの3.1.4の1と２参照

１、その生物のRNA seq情報が利用できる

genomeとbamファイル（RNA seqデータをRNAのアライナーで（リピートマスク）ゲノムにマッピングして得る）を指定する。入力ファイルだが、GeneMaekがホワイトスペースを嫌う。入力ゲノムはあらかじめスペースがないシンプルな名前にしておく。--gff3を使うと出力がデフォルトのgtfからgff3に変更される。

braker.pl --species=yourSpecies --genome=genome.fasta \
 --bam=file1.bam,file2.bam --cores 20 --gff3

#リピートがソフトマスクされているゲノム（リピート配列が小文字に置き換えられている）
braker.pl --species=yourSpecies --genome=genome.fasta \
 --bam=file.bam --softmasking --UTR=on --cores 20 --gff3

#複数のbam
braker.pl --species=yourSpecies --genome=genome.fasta \
--bam=condition1.bam,condition2.bam,condition3.bam --softmasking --UTR=on --cores 20 --gff3

--softmasking Softmasking option for soft masked genome files. (Disabled by default.)
--UTR=on create UTR training examples from RNA-Seq coverage data; requires options --bam=rnaseq.bam and --softmasking. Alternatively, if UTR parameters already exist, training step will be skipped and those pre-existing parameters are used.

UTR parametersをつけるときはStranded RNA-Seqかどうかも重要になる。詳しくはGithub参照。

すでにbamや他のソースからgtfを得ている場合、hintフラグを立ててgtfファイルを指定する。

braker.pl --species=yourSpecies --genome=genome.fasta \
 --hints=hints1.gff,hints2.gff --cores 20 --prg=gth

２、近縁のproteome情報が利用できる

RNA seqなどの信頼性の高いエビデンスが利用できない場合、近縁なモデル生物のprotein情報を使用する。上で紹介したdockerイメージには含まれていない。使うならレポジトリからソースコードをpullしてビルドする（解説）。上で紹介したコマンドでcondaの環境を作れば、exonerateやSpaln2なども導入されてすぐ使える。

braker.pl --genome=genome.fa --prot_seq=proteins.fa --softmasking --cores 20 --gff3

# --prgフラグを立ててdiamond以外のアライナーを使う場合、あらかじめアラインメントツールをダウンロードしてパスを通しておく必要がある（*2）。ここではGenomeThreaderを指定（dockerイメージではエラーを起こした）。
braker.pl --genome=genome.fa --prot_seq=proteins.fa --softmasking --prg=gth --cores 20 --gff3

--prg=gth|exonerate|spaln Alignment tool gth (GenomeThreader), exonerate (Exonerate) or Spaln2 (spaln) that will be used to generate protein alignments that will be the basis for hints generation for gene prediction with AUGUSTUS (if specified in combination with --prot_seq) or that was used to externally generate an alignment file with the commands listed in description of --prot_aln (if used in combination with --prot_aln).
--gth2traingenes Generate training gene structures for AUGUSTUS from GenomeThreader alignments. (These genes can either be used for training AUGUSTUS alone with --trainFromGth; or in addition to GeneMark-ET training genes if also a bam-file is supplied.)
--trainFromGth No GeneMark-Training, train AUGUSTUS from GenomeThreader alignments

多くのゲノムは、BRAKER2内のGeneMark-ETとAUGUSTUSの標準パラメータで正確に予測できる遺伝子構造を持っている。しかし、いくつかのゲノムは、クレード固有の特徴、例えば真菌の特殊なbranch point modelや、標準的でないスプライスサイトパターンを持っている。カスタムオプションのいずれかがターゲット種のゲノムの遺伝子予測精度を向上させるかどうかを判断するために、オプションのセクション3.2を読んでください。

brakerのプロセスは複雑なので、ランが最後まで終わっていないのに終了していると勘違いしているという話を耳にしました。braker.logを開いて、最後の行にbraker successfully finishedのようなコメントがあるか確認するようにして下さい。また、テストランを行なって、上手く動作するか確認するようにしてください。
brakerの最終的なアノテーション結果は、どのようなエビデンスを提供したかによって変化する。braker.gff3に複数の遺伝子予測モデルが残っているのは、それらがマージされているため。たとえば、RNA seqと近縁種のproteomeを外部エビデンスとしてbrakerを実行した場合、ab initioの遺伝子予測にはRNAseqのbamで訓練されたGeneMark-ETが使用され、genemark_evidence.gtfができる。そのgtfと近縁種のproteomeをヒントに使ってAUGUSTUSが訓練され（デフォルトでは5回）、AUGUSTUSが実行される。得られたAUGUSTUS.hints.gff3とgenemarkの結果をマージしてbraker.gff3が得られる。braker.gff3が最終出力だが信頼性が高いAUGUSTUSのgff3を使うこともある。その遺伝子がどのエビデンスに由来しているか、またツールの出力は何かはGFF3に書いてある。
RNA-seqのbamとproteome両方指定し、--UTR=onをつけてETPモードでランするとエラーになった。いくつかバグがあり、パスが見えない不具合などは修復したが、braker.logにあるbraker/Ppri5_utr.job.lstで停止した（docker環境、v.21.6 ）。
file_1_file_1などの名前はAUGUSTUSとGeneMarkのコンビネーションでbrakerの予測が行なわれていることに由来している。同様にt2とかt3などは複数の転写産物があることを意味している。gは遺伝子。tは転写産物。
braker2の遺伝子モデルは品質が高いものの、タンパク質コード領域でないUTRのアノテーションは不完全なことが多い。 "--UTR=on"をつけてランする。RNA seqで発現している領域のエビデンスを使って5,3-UTRまで遺伝子モデルを伸ばしてくれる。こちらが使用されている。ちなみにconda配布最新のv2.16でも、GemoMaのランでエラーが起きることがある（--UTR=onを付けた時のみ）。原因はRNAseqデータが大きい時にjavaのメモリ割り当て量が少ないことが原因、この解答にあるようにgushr.pyの500行目でメモリ割り当て量を明示的に指定すれfixできる (感謝)。

画像下のトラックがアノテーション。その中でbraker,gff3が上、--addUTR=onをつけた時の最終結果braker.utr.gff3(gtf)が真ん中、--UTR=onをつけた時の最終結果braker.utr.gff3(gtf)が下。UTRのアノテーションが長いのは--UTR=onの方（要約統計ではprotein長には変化がないが、transcripts長やgene長は長くなる）。

brakerは他のアノテーションツールをラップして動かしているので、brakerのコードがおかしくなくても、動かしているツールの問題でランが失敗することがあります。brakerのログを読んでも解決しない場合、brakerの別のパイプライン、例えば--esmodeを動かしてジョブがうまくいくかも検討してみてください。Braker2のissueには、ランが失敗するが必ずしもbrakerの問題ではないケースが報告されています。その他、テストデータが小さすぎるとGeneMarkのトレーニングの精度が上がらずエラーを起こします。動作確認でも、数Mb以上ある配列を使うようにして下さい（推奨遺伝子600以上）。

いくつかテストしてみましたが、必ずしも最終出力のaugatusのgtfがベストとは言えない結果も出ました。パラメータ検討に加えてgenemarkとaugatusの出力の比較も行なってみて下さい。

引用

BRAKER2: Automatic Eukaryotic Genome Annotation with GeneMark-EP+ and AUGUSTUS Supported by a Protein Database

Tomas Bruna, Katharina Hoff, Mario Stanke, Alexandre Lomsadze, Mark Borodovsky

bioRxiv, Posted August 11, 2020

Whole-Genome Annotation with BRAKER
Katharina Hoff, Alexandre Lomsadze, Mark Borodovsky, Mario Stanke

Methods Mol Biol. 2019;1962:65-95

2021 2/9

BRAKER2: automatic eukaryotic genome annotation with GeneMark-EP+ and AUGUSTUS supported by a protein database
Tomáš Brůna, Katharina J Hoff, Alexandre Lomsadze, Mario Stanke, Mark Borodovsky
NAR Genomics and Bioinformatics, Volume 3, Issue 1, March 2021

brakerを引用する際は出力ディレクトリのwhat-to-cite.txt（下記）を参照する。

When publishing results of this BRAKER run, please cite the following sources:
------------------------------------------------------------------------------

Hoff, K. J., Lange, S., Lomsadze, A., Borodovsky, M., & Stanke, M. (2016). BRAKER1: unsupervised RNA-Seq-based genome annotation with GeneMark-ET and AUGUSTUS. Bioinformatics, 32(5), 767-769.

Hoff, K. J., Lomsadze, A., Borodovsky, M., & Stanke, M. (2019). Whole-genome annotation with BRAKER. In Gene Prediction (pp. 65-95). Humana, New York, NY.

Buchfink, B., Xie, C., & Huson, D. H. (2015). Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1), 59.

Stanke, M., Diekhans, M., Baertsch, R., & Haussler, D. (2008). Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics, 24(5), 637-644.

Stanke, M., Schöffmann, O., Morgenstern, B., & Waack, S. (2006). Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics, 7(1), 62.

補足

GeneMarkは最新版とキーを両方ダウンロードし、キーと合わせて使うこと。GeneMarkのランを検証するには、例えばダウンロードしたperl gmes_linux_64の外でテストランする。

perl gmes_linux_64/gmes_petap.pl --sequence BRAKER/example/genome.fa --ES --cores 8 --max_intron 100 --min_gene_in_predict 100

エラーが起きるならperlのライブラリが配列＠INCに含まれるパスに存在しない可能性がある。perlモジュールはcpanm（apt-getで導入できる）を使ってインストールすれば楽。

cpanmはーLでモジュールの導入パスを指定できる。このとき、ーLで指定したディレクトリのさらにサブディレクトリであるlib/perlに実際には入る。例えば@INCに/root/perl5/lib/perl5/が含まれ、ここにcpanmでperlモジュールをインストールしたいなら、ーLを”-L /root/perl5”と指定する。 Hash::Mergeの導入なら、

$ cpanm -L /root/perl5 Hash::Merge

と打つ。これで/root/perl5/lib/perl5/にHash/ができ、中にMerge.pmがビルドされモジュールが見えるようになる（ビルドにはmakeが必要）。

注：cpanminus実行前に環境変数$PERL5LIBを見てパスが指定されていないなら追加する。

$ echo $PERL5LIB

$PERL5LIBにパスがない。ここでは/root/perl5/lib/perl5を追加
export PERL5LIB=$HOME/perl5/lib/perl5/:$PERL5LIB;

＝＞cpanminusを実行

追記１

cpanminusなどを使ったperlモジュールのコンパイルに失敗するならgccを入れる（linux）。

$ conda install -c conda-forge/label/gcc7 gxx_linux-64

追記２

ここではv2.1.4のdockerイメージを使っているが、2020 8/15現在は2.1.5が最新なので以下のnanozooさんのdockerfileを書き換えて実行してもいい。太字が追加・修正した部分になる。

FROM continuumio/miniconda3
ENV VERSION 2.1.5