2023-01-08

Annotation pipeline MAKER

From the 2008 paper

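The authors developed MAKER, a portable and easily configurable genome annotation pipeline. Its purpose is to let researchers annotate eukaryotic genomes independently and build genome databases. MAKER identifies repeats, aligns ESTs and proteins to the genome, produces ab initio gene predictions, and automatically synthesizes these data into gene annotations with evidence-based quality indices. MAKER is also easy to train: the outputs of preliminary runs are used to automatically retrain its gene prediction algorithm, so that subsequent runs produce higher-quality gene models. MAKER's inputs are minimal, and its outputs can be used to create a GMOD database. The outputs can also be viewed in the Apollo genome browser, which makes it easy to annotate, view, and edit individual contigs and BACs without the overhead of a database. MAKER was used to annotate the genome of the planarian Schmidtea mediterranea and to create a new genome database, SmedGD. Its performance was also compared with that of other annotation pipelines. The results demonstrate that MAKER provides a simple and effective means of converting a genome sequence into a community-accessible genome database. MAKER is especially useful for emerging model organism genome projects for which extensive bioinformatics resources are not readily available.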

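From the tutorial

MAKER is an easy-to-use genome annotation pipeline designed to be usable by small research groups with little bioinformatics experience. It is also designed to scale, so it suits larger projects as well, including those at large sequencing centers.

MAKER is an annotation pipeline, not a gene prediction tool. Rather than predicting genes itself, MAKER leverages existing software tools (some of which are gene predictors) and produces the best possible gene models based on evidence alignments. (Note: although MAKER itself is not a gene predictor, it can train ab initio gene predictors internally and run gene prediction with them. See the second half of the tutorial.)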

What does MAKER do?

Identifies and masks out repeat elements
Aligns ESTs to the genome
Aligns proteins to the genome
Produces ab initio gene predictions
Synthesizes these data into final annotations
Produces evidence-based quality values for downstream annotation management

https://www.yandell-lab.org/software/maker.html

MAKER Tutorial for WGS Assembly and Annotation Winter School 2018

http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_WGS_Assembly_and_Annotation_Winter_School_2018

Installation

GitHub

#conda (link)
mamba install -c "bioconda/label/cf201901" -c conda-forge -c bioconda maker
=> Even when installed with conda, the RepeatMasker edition of the Repbase library still has to be downloaded from Repbase and put in place (*1)

#Docker image
#Here a publicly available Docker image is used (link). Follow steps 1-5.
#1 Pull the base image (link). This image is itself built from four parent images in sequence. You can also build it yourself.
docker pull chrishah/premaker-plus:18-0d9787e

#2 Clone the MAKER repository into the working directory
cd <change>/<to>/<working_dir>
git clone https://github.com/Yandell-Lab/maker

#3 Get the RepeatMasker edition of the Repbase library from Repbase (link). Registration alone is not enough to download the latest full dataset; an institutional license is required (paid; has it always been?). I do not have one, so this older copy is used here (much appreciated).
=> Unpack the tarball and later place it in RepeatMasker/Libraries/

#4 Create a Dockerfile and build (from the base image in #1)
echo -e "FROM chrishah/premaker-plus:18-0d9787e" > Dockerfile-maker-plus
docker build --network=host -t maker-plus:2.31 --file Dockerfile-maker-plus .

#5 Run the resulting image and build MAKER (the repository cloned in #2)
docker run -itv $PWD:/home -w /home maker-plus:2.31
$ cd /home/maker/src
$ perl Build.PL
$ ./Build install
$ cd ../bin
$ ./maker #add maker/bin/ to PATH if needed

> maker -h

MAKER version 3.01.03

Usage:
     maker [options] <maker_opts> <maker_bopts> <maker_exe>

Description:
     MAKER is a program that produces gene annotations in GFF3 format using
     evidence such as EST alignments and protein homology. MAKER can be used to
     produce gene annotations for new genomes as well as update annotations
     from existing genome databases.

     The three input arguments are control files that specify how MAKER should
     behave. All options for MAKER should be set in the control files, but a
     few can also be set on the command line. Command line options provide a
     convenient machanism to override commonly altered control file values.
     MAKER will automatically search for the control files in the current
     working directory if they are not specified on the command line.

     Input files listed in the control options files must be in fasta format
     unless otherwise specified. Please see MAKER documentation to learn more
     about control file configuration. MAKER will automatically try and
     locate the user control files in the current working directory if these
     arguments are not supplied when initializing MAKER.

     It is important to note that MAKER does not try and recalculated data that
     it has already calculated. For example, if you run an analysis twice on
     the same dataset you will notice that MAKER does not rerun any of the
     BLAST analyses, but instead uses the blast analyses stored from the
     previous run. To force MAKER to rerun all analyses, use the -f flag.

     MAKER also supports parallelization via MPI on computer clusters. Just
     launch MAKER via mpiexec (i.e. mpiexec -n 40 maker). MPI support must be
     configured during the MAKER installation process for this to work though

Options:
     -genome|g <file>    Overrides the genome file path in the control files
     -RM_off|R           Turns all repeat masking options off.
     -datastore/         Forcably turn on/off MAKER's two deep directory
      nodatastore        structure for output. Always on by default.
     -old_struct         Use the old directory styles (MAKER 2.26 and lower)
     -base <string>      Set the base name MAKER uses to save output files.
                         MAKER uses the input genome file name by default.
     -tries|t <integer>  Run contigs up to the specified number of tries.
     -cpus|c <integer>   Tells how many cpus to use for BLAST analysis.
                         Note: this is for BLAST and not for MPI!
     -force|f            Forces MAKER to delete old files before running again.
                         This will require all blast analyses to be rerun.
     -again|a            recaculate all annotations and output files even if no
                         settings have changed. Does not delete old analyses.
     -quiet|q            Regular quiet. Only a handlful of status messages.
     -qq                 Even more quiet. There are no status messages.
     -dsindex            Quickly generate datastore index file. Note that this
                         will not check if run settings have changed on contigs
     -nolock             Turn off file locks. May be usful on some file systems,
                         but can cause race conditions if running in parallel.
     -TMP                Specify temporary directory to use.
     -CTL                Generate empty control files in the current directory.
     -OPTS               Generates just the maker_opts.ctl file.
     -BOPTS              Generates just the maker_bopts.ctl file.
     -EXE                Generates just the maker_exe.ctl file.
     -MWAS <option>      Easy way to control mwas_server for web-based GUI
                              options: STOP
                                       START
                                       RESTART
     -version            Prints the MAKER version.
     -help|?             Prints this usage statement.

> fasta_merge

Synopsis:
     fasta_merge -d maker_datastore_index.log
     fasta_merge -o genome.all -i <fasta1> <fasta2> ...

Descriptions:
     This script will take a MAKER datastore index log file, extract all
     the relevant fasta files and create fasta files with relevant
     categories of sequence (i.e. transcript, protein, GeneMark protien,
     etc.). For this to work properly you need to be in the same directory
     as the datastore index.

Options:
     -d The location of the MAKER datastore index log.
     -o Alternate base name for the output files.
     -i A optional list of files to process along with or instead of the
        datastore.

> gff3_merge

Synopsis:
     gff3_merge -d maker_datastore_index.log
     gff3_merge -o genome.all.gff <gff3_file1> <gff3_file2> ...

Descriptions:
     This script will take a MAKER datastore index log file, extract all
     the relevant GFF3 files and combined GFF3 file. The script can also
     combine other correctly formated GFF3 files. For this to work
     properly you need to be in the same directory as the datastore index.

Options:
     -d The location of the MAKER datastore index log file.
     -o Alternate base name for the output files.
     -s Use STDOUT for output.
     -g Only write MAKER gene models to the file, and ignore evidence.
     -n Do not print fasta sequence in footer
     -l Merge legacy annotation sets (ignores already having seen
        features more than once for the same contig)

Notes on the MAKER license

Test run

A test dataset is provided in data/ of the repository.

cd maker/data/

maker/data/

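1. First, create templates of the config files. MAKER automatically looks for config files in the current directory and runs with them (without being told explicitly). For that reason, running MAKER in a separate directory for each genome is recommended.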

maker -CTL

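Four files are created.

Descriptions from the manual:

maker_exe.ctl - contains the path information for the underlying executables.
maker_bopts.ctl - contains filtering statistics for BLAST and Exonerate.
maker_opts.ctl - contains all other information for MAKER, including the location of the input genome file.

2. Edit the config files. The manual edits them with the nano editor (apt install nano).

Open maker_exe.ctl (which holds the executable paths).

If the installation steps above were followed correctly, the paths of all executables should be listed. When editing, do not put spaces on either side of "=".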
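Checking the paths by eye gets tedious, so here is a minimal Python sketch that reads control-file text and reports entries whose path is not an executable file. It assumes the file consists of `key=value #comment` lines; the function names and the example paths are made up for illustration and are not part of MAKER.

```python
import os


def parse_ctl(text):
    """Parse `key=value #comment` lines into a dict (assumed ctl layout)."""
    options = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line or "=" not in line:
            continue
        key, value = line.split("=", 1)
        options[key.strip()] = value.strip()
    return options


def missing_executables(text):
    """Return keys whose non-empty value is not an executable file."""
    missing = []
    for key, path in parse_ctl(text).items():
        if path and not (os.path.isfile(path) and os.access(path, os.X_OK)):
            missing.append(key)
    return missing


# Hypothetical excerpt; a real maker_exe.ctl lists many more programs.
example = """#-----Location of Executables Used by MAKER
makeblastdb=/no/such/dir/makeblastdb #location of makeblastdb
snap= #location of snap executable
"""
print(missing_executables(example))
```

Empty entries such as `snap=` are skipped on purpose, since an unused tool may legitimately have no path. For a real check, pass the contents of maker_exe.ctl, for example `missing_executables(open("maker_exe.ctl").read())`.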

Open maker_opts.ctl (all other information, including the input file paths).

Edit the following lines of maker_opts.ctl as shown.

genome=dpp_contig.fasta
est=dpp_est.fasta
protein=dpp_protein.fasta
est2genome=1

protein2genome=1

Everything is now ready.
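When annotating several genomes, the same edits can be scripted instead of made in nano. The sketch below is one possible way to do it in Python, assuming each option sits on its own `key=value #comment` line; `set_ctl_options` is a hypothetical helper, and the template text is a shortened stand-in for the real file.

```python
def set_ctl_options(text, updates):
    """Return control-file text with key=value set for each key in updates.

    The trailing #comment of an edited line is kept, and no spaces are
    written around "=".
    """
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        body, sep, comment = line.partition("#")
        key, eq, _old = body.partition("=")
        key = key.strip()
        if eq and key in remaining:
            new_line = f"{key}={remaining.pop(key)}"
            if sep:
                new_line += f" #{comment}"
            line = new_line
        out.append(line)
    if remaining:
        raise KeyError(f"options not found: {sorted(remaining)}")
    return "\n".join(out) + "\n"


# Shortened, illustrative template (not the full maker_opts.ctl).
template = (
    "genome= #genome sequence\n"
    "est2genome=0 #infer gene predictions directly from ESTs, 1 = yes, 0 = no\n"
)
edited = set_ctl_options(template, {"genome": "dpp_contig.fasta", "est2genome": 1})
print(edited)
```

Raising on an unknown key is deliberate: a misspelled option name would otherwise be ignored silently and MAKER would run with the default.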

3. Once the edits are done, run MAKER. (*1)

maker -q -base annotation

-q Regular quiet. Only a handful of status messages.
-qq Even more quiet. There are no status messages.
-base Set the base name MAKER uses to save output files. MAKER uses the input genome file name by default.
-R Turns all repeat masking options off.
-f Forces MAKER to delete old files before running again. This will require all blast analyses to be rerun.
-a Recalculate all annotations and output files even if no settings have changed. Does not delete old analyses.
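If MAKER is launched from a script, the options above can be assembled programmatically. This is only a sketch: `build_maker_command` and its parameter names are hypothetical, and the flags are the ones listed in the help text (`-q`, `-base`, `-R`, `-f`, `-a`, plus the `mpiexec -n` launcher for MPI builds).

```python
import shlex


def build_maker_command(base=None, quiet=False, force=False, again=False,
                        repeat_mask=True, mpi_procs=None):
    """Build an argv list for MAKER from a few commonly used options."""
    argv = []
    if mpi_procs:
        # Only meaningful if MAKER was installed with MPI support.
        argv += ["mpiexec", "-n", str(mpi_procs)]
    argv.append("maker")
    if quiet:
        argv.append("-q")
    if base:
        argv += ["-base", base]
    if not repeat_mask:
        argv.append("-R")
    if force:
        argv.append("-f")
    if again:
        argv.append("-a")
    return argv


cmd = build_maker_command(base="annotation", quiet=True)
print(shlex.join(cmd))  # maker -q -base annotation
```

To actually start the run, pass the list to `subprocess.run(cmd, check=True)` from the directory that holds the control files.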

...

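Output

dpp_contig.maker.output/ #named after the input genome file by default (with -base annotation, the dpp_contig prefix here and below becomes annotation)

The maker_opts.log, maker_exe.log, and maker_bopts.log files are logs of the control files used for this MAKER run. The mpi_blastdb directory holds the FASTA indexes and BLAST database files built from the input EST, protein, and repeat databases.

The dpp_contig_datastore directory contains a set of subfolders holding the final MAKER output for each contig in the genome FASTA file.

dpp_contig_master_datastore_index.log is the index log of the run.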

> cat dpp_contig.maker.output/dpp_contig_master_datastore_index.log

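Because the sample file contains only one contig, the log has entries for a single contig. It shows that contig-dpp-500-500 STARTED and then FINISHED without incident, and that the results for this contig are stored in dpp_contig_datastore/05/1F/contig-dpp-500-500/ (from the manual). The other possible statuses are:

FAILED - the run failed on this contig
RETRY - MAKER is retrying a contig that failed
SKIPPED_SMALL - the contig is too short to annotate (the minimum length can be set in maker_opts.ctl)
DIED_SKIPPED_PERMANENT - the contig failed even after retries (the number of retries per contig can be set in maker_opts.ctl)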
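With thousands of contigs, reading the index log with cat stops being practical. The following Python sketch tallies the last status recorded for each contig. It assumes each log line has three tab-separated columns (contig name, output directory, status) and that the start status is spelled STARTED; the example log text is written by hand to match that assumption.

```python
from collections import Counter


def final_statuses(log_text):
    """Map each contig to the last status recorded for it in the index log."""
    last = {}
    for line in log_text.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:  # skip anything that is not contig/dir/status
            continue
        contig, _directory, status = fields
        last[contig] = status  # later lines overwrite earlier ones
    return last


# Hand-written example in the assumed three-column layout.
example_log = (
    "contig-dpp-500-500\tdpp_contig_datastore/05/1F/contig-dpp-500-500/\tSTARTED\n"
    "contig-dpp-500-500\tdpp_contig_datastore/05/1F/contig-dpp-500-500/\tFINISHED\n"
)
statuses = final_statuses(example_log)
print(Counter(statuses.values()))
```

Contigs whose final status is anything other than FINISHED are the ones worth looking at, for example `[c for c, s in statuses.items() if s != "FINISHED"]`.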

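(When a genome has thousands to hundreds of thousands of contigs, keeping every result in a single directory can degrade file system performance, particularly when the directory is accessed over a network. MAKER avoids this by creating a hierarchy of nested subdirectories under the base directory and distributing each contig's results across the datastore. The master_datastore_index.log is therefore also essential for finding where the output of a given contig is stored (from the manual).)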

Let's look at the results for the single contig.

dpp_contig.maker.output/dpp_contig_datastore/05/1F/contig-dpp-500-500/

contig-dpp-500-500.gff is the annotation file in GFF3 format. contig-dpp-500-500.maker.transcripts.fasta and contig-dpp-500-500.maker.proteins.fasta hold the transcript and protein sequences of the gene annotations. The theVoid.contig-dpp-500-500 directory holds the raw output files of every program MAKER runs (BLAST, SNAP, RepeatMasker, and so on).
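A quick way to see what ended up in the GFF3 file is to count features by source (column 2) and type (column 3). The sketch below relies only on the standard nine-column GFF3 layout and stops at the `##FASTA` directive, where the embedded sequence begins. The example rows are invented for illustration and are not real MAKER output.

```python
from collections import Counter


def count_features(gff3_text):
    """Count GFF3 features by (source, type), ignoring the FASTA footer."""
    counts = Counter()
    for line in gff3_text.splitlines():
        if line.startswith("##FASTA"):
            break  # everything after this directive is sequence
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            continue
        counts[(cols[1], cols[2])] += 1
    return counts


# Invented rows; real files contain many more sources and types.
rows = [
    ("contig-dpp-500-500", "maker", "gene", "100", "900", ".", "+", ".", "ID=gene1"),
    ("contig-dpp-500-500", "maker", "mRNA", "100", "900", ".", "+", ".", "ID=mrna1;Parent=gene1"),
    ("contig-dpp-500-500", "blastx", "protein_match", "120", "400", ".", "+", ".", "ID=hit1"),
]
example = (
    "##gff-version 3\n"
    + "\n".join("\t".join(r) for r in rows)
    + "\n##FASTA\n>contig-dpp-500-500\nACGT\n"
)
for (source, ftype), n in sorted(count_features(example).items()):
    print(source, ftype, n)
```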

4. Merge the results of the contigs

cd dpp_contig.maker.output/
fasta_merge -d dpp_contig_master_datastore_index.log
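=> produces dpp_contig.all.maker.transcripts.fasta (one FASTA file per sequence category; see the fasta_merge help above)

gff3_merge -d dpp_contig_master_datastore_index.log
=> produces dpp_contig.all.gff

(Note: the test data contains only one contig)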
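Both merge commands can be wrapped in a small script. The help text says they must be run from the directory containing the datastore index, so the sketch below passes that directory as the working directory. The function names are hypothetical; only the `-d` option shown in the help output is used.

```python
import shutil
import subprocess


def merge_commands(index_log):
    """Return the argv lists for the two merge steps."""
    return [
        ["fasta_merge", "-d", index_log],
        ["gff3_merge", "-d", index_log],
    ]


def run_merges(index_log, output_dir="."):
    """Run both merges inside output_dir, failing early if a tool is missing."""
    for argv in merge_commands(index_log):
        if shutil.which(argv[0]) is None:
            raise FileNotFoundError(f"{argv[0]} is not on PATH")
        subprocess.run(argv, cwd=output_dir, check=True)


for argv in merge_commands("dpp_contig_master_datastore_index.log"):
    print(" ".join(argv))
```

A call such as `run_merges("dpp_contig_master_datastore_index.log", "dpp_contig.maker.output")` would then perform the same two steps as above.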

The following paper describes settings for a two-pass annotation with MAKER. It is a useful reference.

An improved genome assembly uncovers prolific tandem repeats in Atlantic cod | BMC Genomics | Full Text

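Note: MAKER supports MPI, and running in parallel can greatly speed it up, presumably because the genome can be split into regions of a certain size that are processed independently. Put another way, a run on a single node can take an enormous amount of time. Take care when working with a reasonably large genome (for example, 100 Mb or more).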

Citations

MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes
Brandi L Cantarel, Ian Korf, Sofia M C Robb, Genis Parra, Eric Ross, Barry Moore, Carson Holt, Alejandro Sánchez Alvarado, Mark Yandell

Genome Res. 2008 Jan;18(1):188-96

MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations
Michael S Campbell, MeiYee Law, Carson Holt, Joshua C Stein, Gaurav D Moghe, David E Hufnagel, Jikai Lei, Rujira Achawanantakun, Dian Jiao, Carolyn J Lawrence, Doreen Ware, Shin-Han Shiu, Kevin L Childs, Yanni Sun, Ning Jiang, Mark Yandell

Plant Physiol. 2014 Feb;164(2):513-24. doi: 10.1104/pp.113.230144. Epub 2013 Dec 4.

Genome Annotation and Curation Using MAKER and MAKER-P
Michael S. Campbell, Carson Holt, Barry Moore, Mark Yandell
Curr Protoc Bioinformatics. 2014 Dec 12;48:4.11.1-4.11.39

PMC archive (link)

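Notes

*1 MAKER raises an error if the Repbase library is missing. To force a run anyway, turn repeat masking off (-R) or leave model_org in maker_opts.ctl empty.

bioconda-recipes: issue

ERROR: Could not determine if RepBase is installed · Issue #16501 · bioconda/bioconda-recipes · GitHub

Alternatively, give rmlib= a repeat library file in FASTA format (current versions of RepeatMasker accept FASTA repeat files). For example, RepeatModeler searches a genome de novo for species-specific repeats and writes them in FASTA format, so it can be used for new genomes that have no repeat library (introduced here). In that case, leave "model_org=" empty and set "rmlib=" to the FASTA file of repeat consensus sequences predicted by RepeatModeler.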
