ProtHint - macでインフォマティクス

著者らは、真核生物ゲノムにおける遺伝子予測のための高速かつ高精度なアルゴリズムの作成に向けて、いくつかのステップを踏んできた。まず、ab initio gene findingを効率的に行うための自動化手法、GeneMark-ESを導入し、教師無しでパラメータを繰り返し学習させた。次に、GeneMark-ETでは、教師なし学習とShort RNA Readsのマッピングによって明らかになったイントロン位置の情報を統合する方法を提案した。GeneMark-EPは、タンパク質データベースという、シーケンシングプロジェクト開始前に入手可能な別の外部情報源を利用するツールである。GeneMark-EPは、ProtHintという新しいパイプラインにより、ゲノムに大量のタンパク質マッピングを行い、遺伝子候補のスプライスサイトや翻訳開始・停止サイトに関するヒントを抽出する。GeneMark-EPは、このヒントを用いて、モデルパラメータの推定を改善するとともに、予測される遺伝子の座標が最も信頼できるヒントと異なる場合（-EP+モード）、その座標を調整することができる。GeneMark-EP および -EP+ のテストでは、GeneMark-ES と比較して遺伝子予測精度の向上が見られ、GeneMark-EP+ は GeneMark-ET よりも高い精度を示した。遺伝子予測精度が最も顕著に向上したのは、大規模な真核生物ゲノムにおいてであることが確認された。

Githubより

ProtHintは、予測された遺伝子を参照タンパク質配列のデータベースにマッピングし、スプライシングアラインメントすることによって、目的のゲノム中のヒント（イントロン、開始コドン、停止コドンの形）を予測し、スコアリングするパイプラインである。

注；ProtHintはGeneMark-ESで使用されていて、GeneMark-ESの配布物自体にも含まれている。ProtHint pipelineの流れはGeneMark-EP+の論文のFigure.2で説明されている。

インストール

依存

Python 3.3 or higher
GeneMark-ES（リンク）
DiamondとSpaln（このディストリビューションに含まれる（動作しないなら導入が必要））
Perl 5.10 or higher is required.

The following non-Core Perl modules are required:

MCE::Mutex
threads
YAML
Math::Utils

Github

#perlライブラリの導入
cpanm MCE::Mutex threads YAML Thread::Queue Math::Utils

#GeneMark-ES
cd GeneMark_dir
#Install the key
zcat gm_key_64.gz > ~/.gm_key
#test run 
bash check_install.bash
#パスを通す
export PATH=$PATH:/your/path/to/GeneMark-ES/

#本体
git clone https://github.com/gatech-genemark/ProtHint.git
cd ProtHint/

> ./prothint.py -h

usage: prothint.py [-h] [--workdir WORKDIR] [--geneSeeds GENESEEDS] [--geneMarkGtf GENEMARKGTF] [--fungus] [--nonCanonicalSpliceSites] [--diamondPairs DIAMONDPAIRS] [--maxProteinsPerSeed MAXPROTEINSPERSEED] [--evalue EVALUE]

[--minExonScore MINEXONSCORE] [--longGene LONGGENE] [--longProtein LONGPROTEIN] [--cleanup] [--pbs] [--threads THREADS] [--version] [--prevGeneSeeds PREVGENESEEDS] [--prevSpalnGff PREVSPALNGFF]

genome.fasta proteins.fasta

ProtHint 2.6.0: Pipeline for generating genome wide footprints of homologous proteins. The set of high confidence hints is generated using default thresholds in print_high_confidence.py script. If you wish to use different filtering criteria, re-run

print_high_confidence.py script with custom thresholds.

positional arguments:

genome.fasta Input genomic sequence in FASTA format.

proteins.fasta Protein database in FASTA format.

optional arguments:

-h, --help show this help message and exit

--workdir WORKDIR Folder for results and temporary files. If not specified, current directory is used

--geneSeeds GENESEEDS

Gene prediction seeds in gtf format. If this file is provided, GeneMark-ES run is skipped.

--geneMarkGtf GENEMARKGTF

Same as --geneSeeds, for backwards compatibility

--fungus Run GeneMark-ES in fungus mode.

--nonCanonicalSpliceSites

Predict introns with non-canonical splices sites. By default, ProtHint only predicts introns with GT-AG, GC-AG, and AT-AC donor-acceptor sites.

--diamondPairs DIAMONDPAIRS

File with "seed gene-protein" hits generated by DIAMOND. If this file is provided, DIAMOND search for protein hits is skipped. The seed genes in this file must correspond to seed genes passed by "--geneSeeds" option. All pairs in

the file are used -- option "--maxProteinsPerSeed" is ignored.

--maxProteinsPerSeed MAXPROTEINSPERSEED

Maximum number of protein hits per seed gene. Increasing this number leads to increased runtime and may improve the sensitivity of hints. Decreasing has an opposite effect. Default is set to 25.

--evalue EVALUE Maximum e-value for DIAMOND alignments hits. Default = 0.001

--minExonScore MINEXONSCORE

Discard all hints inside/neighboring exons with score lower than minExonScore. Default = 25

--longGene LONGGENE Threshold for what is considered a long gene in Spaln alignment. Default = 30000

--longProtein LONGPROTEIN

Threshold for what is considered a long protein in Spaln alignment. Default = 15000

--cleanup Delete temporary files and intermediate results. Cleanup is turned off by default as it is useful to keep these files for troubleshooting and the intermediate results might be useful on their own.

--pbs Run GeneMark-ES and Spaln on pbs.

--threads THREADS Number of threads used by ES, DIAMOND, and Spaln. By default, all available threads are used.

--version show program's version number and exit

--prevGeneSeeds PREVGENESEEDS

File with gene seeds which were used in the previous iteration. Next iteration of ProtHint is only executed for --geneSeeds which differ from --prevGeneSeeds. --prevSpalnGff is required when this option is used since results from

the previous iteration are reused for seeds which do not differ (Gene ids of such hints are updated to match the new seed genes).

--prevSpalnGff PREVSPALNGFF

Scored hints from previous iteration.

> gmes_petap.pl

$ gmes_petap.pl

# -------------------

Usage: /home/kazu/Documents/ProtHint/gmes_linux_64/gmes_petap.pl [options] --sequence [filename]

GeneMark-ES Suite version 4.71_lic

Suite includes GeneMark.hmm, GeneMark-ES, GeneMark-ET and GeneMark-EP algorithms.

Input sequence/s should be in FASTA format.

Select one of the gene prediction algorithms:

To run GeneMark-ES self-training algorithm

--ES

To run GeneMark-ET with hints from transcriptome splice alignments

--ET [filename]; file with intron coordinates from RNA-Seq read splice alignment in GFF format

--et_score [number]; default 10; minimum score of intron in initiation of the ET algorithm

To run GeneMark-EP with hints from protein splice alignments

--EP

--dbep [filename]; file with protein database in FASTA format

--ep_score [number,number]; default 4,0.25; minimum score of intron in initiation of the EP algorithm

--EP [filename]; file with intron coordinates from protein splice alignment in GFF format

To run GeneMark.hmm predictions using previously derived model

--predict_with [filename]; file with species specific gene prediction parameters

To run ES, ET or EP with branch point model. This option is most useful for fungal genomes

--fungus

To run hmm, ES, ET or EP in PLUS mode (prediction with hints)

--evidence [filename]; file with hints in GFF format

To run algorithms with alternative genetic codes

--gcode [number]; default 1; supported 1 and 6/26

Output formatting options:

--format [label]; default GTF; output gene prediction in GTF of GFF3 format

--work_dir [folder name]; default current working directory .;

Masking option

--soft_mask [number] or [auto]; default auto; to indicate that lowercase letters stand for repeats;

algorithm hard masks only lowercase repeats longer than specified length

In 'auto' mode hard masking threshold is selected by algorithm based on the size of the input genome

--mask_penalty [number] or [auto]; default 0.03;

Run options

--cores [number]; default 1; to run program with multiple threads

--pbs to run on cluster with PBS support

--v verbose

Optional sequence pre-processing parameters

--max_contig [number]; default 5000000; will split input genomic sequence into contigs shorter then max_contig

--min_contig [number]; default 50000 (10000 fungi);

will ignore contigs shorter than min_contig in training

--max_gap [number]; default 5000; will split sequence at gaps longer than max_gap

Letters 'n' and 'N' are interpreted as standing within gaps

--max_mask [number]; default 5000; will split sequence at repeats longer then max_mask

Letters 'x' and 'X' are interpreted as results of hard masking of repeats

Optional parameters

--max_intron [number]; default 10000 (3000 fungi); maximum length of intron

--max_intergenic [number]; default 50000 (10000 fungi); maximum length of intergenic regions

--min_contig_in_predict [number]; default 500; minimum allowed length of contig in prediction step

--min_gene_in_predict [number]; default 300 (120 fungi); minimum allowed gene length in prediction step

--gc_donor [value]; default 0.001; transition probability to GC donor in the range 0..1;

'off' switches GC donor model OFF

--gc3 [number]; GC3 cutoff in training for grasses

Developer options

--training to run only training step of algorithm; applicable to ES, ET or EP

--prediction to run only prediction step of algorithms using species parameters from previously executed training; applicable to ES, ET or EP

--usr_cfg [filename]; use custom configuration from this file

--ini_mod [filename]; use this file with parameters for algorithm initiation

--key_bin

--debug

# -------------------

データベース

リファレンスタンパク質配列のソースとして、OrthoDBタンパク質データベースの中で対象のゲノムに関連する分類を使用することが推奨されている。

#節足動物門の場合（ほかのリンクはGithub参照）
wget https://v100.orthodb.org/download/odb10_arthropoda_fasta.tar.gz
tar xvf odb10_arthropoda_fasta.tar.gz
cat arthropoda/Rawdata/* > arthropoda_proteins.fasta

#緑色植物亜界
wget https://v100.orthodb.org/download/odb10_plants_fasta.tar.gz
tar xvf odb10_plants_fasta.tar.gz
cat plants/Rawdata/* > plants_proteins.fasta

テストラン

ゲノムのfastaファイル、proteomeファイル（上で作ったデータベースのfastaや信頼できる近縁種のproteome）を指定する。--geneSeedsをつけてGeneMark-ESからのgtfファイルも指定するとGeneMark-ESによるab initio遺伝子予測はスキップされる。このフラグを付けてないなら、初めにGeneMark-ESが実行される（パスが通っている必要がある）。結果はworkdirに保存する。

git clone https://github.com/gatech-genemark/ProtHint.git
cd ProtHint/example/
../bin/prothint.py input/genome.fasta input/proteins.fasta --geneSeeds input/genemark.gtf --workdir workdir


#plantのproteins.fastaをDBに使う。訓練も行う。
../bin/prothint.py input/genome.fasta plants_proteins.fasta --workdir workdir

--geneSeeds Gene prediction seeds in gtf format. If this file is provided, GeneMark-ES run is skipped.

ProtHintは主に2つの出力を生成する（マニュアルより）。

prothint.gff 報告されたすべてのヒント（イントロン、スタート、ストップ）を含む Gff ファイル。
evidence.gff Prothint.gff の高信頼度サブセットで、例えば GeneMark-EP Plus モードに適している。このセットは、ProtHint/bin/print_high_confidence.pyスクリプトのデフォルトの閾値を使用して生成されている。異なるフィルタリング基準を使用したい場合は、カスタムしきい値でprint_high_confidence.pyスクリプトを再実行する。
prothint_augustus.gff BRAKERとAUGUSTUSですぐに使える出力も生成する。

出力例

workdir/

続いてGeneMark-ESかEP+を実行する（Github）。

gmes_petap.pl --EP workdir/prothint.gff --evidence workdir/evidence.gff --seq genome.masked.fasta --soft_mask 1000 --verbose --cores 20 --work_dir workdir

--EP file with intron coordinates from protein splice alignment in GFF format
--ES To run GeneMark-ES self-training algorithm
--evidence file with hints in GFF format
--soft_mask [number] or [auto] default auto; to indicate that lowercase letters stand for repeats; algorithm hard masks only lowercase repeats longer than specified length In 'auto' mode hard masking threshold is selected by algorithm based on the size of the input genome
--max_contig default 5000000; will split input genomic sequence into contigs shorter then max_contig
--min_contig default 50000 (10000 fungi); will ignore contigs shorter than min_contig in training
--max_gap default 5000; will split sequence at gaps longer than max_gap Letters 'n' and 'N' are interpreted as standing within gaps
--max_mask default 5000; will split sequence at repeats longer then max_mask Letters 'x' and 'X' are interpreted as results of hard masking of repeats
--cores default 1; to run program with multiple threads
--format default GTF; output gene prediction in GTF of GFF3 format
--work_dir default current working directory .;