イントロン位置の保存性とRNA-seqを活用したホモロジーに基づく遺伝子予測を行う GeMoMa

明けましておめでとうございます。今年もよろしくお願いいたします。

今年も忙しくなりそうなので、更新できるタイミングがあれば積極的に更新していきます。

　GeMoMaは、進化的に関連するリファレンス種の遺伝子モデルを基に、対象種の遺伝子モデルを予測する、相同性に基づく遺伝子予測プログラムである。GeMoMaは、アミノ酸配列保存、イントロン位置保存、およびRNA-seqデータを利用して、タンパク質をコードする転写産物を正確に予測することができる。さらに、GeMoMaは複数のリファレンス種に基づく予測の組み合わせをサポートし、異なるリファレンス種の高品質アノテーションを対象種に移植することができる。ここでは、GeMoMa モジュールと GeMoMa パイプラインの詳細な説明と、特定の生物学的問題に対処するためのコマンドラインでの使用方法について紹介する。

Documentation

http://www.jstacs.de/index.php/GeMoMa-Docs

インストール

condaを使って導入した（ubuntu18）。

依存

For running the GeMoMa, you need the following software on your computer

Java v1.8 or later
blast or mmseqs

GeMoMa

http://www.jstacs.de/index.php/GeMoMa

#conda(link)
mamba create -n gemoma -y
conda activate gemoma
mamba install -c conda-forge -c bioconda gemoma=1.9 -y

> GeMoMa -h

Searching for the new GeMoMa updates ...

You are using the latest GeMoMa version.

This jar allows to run all parts of GeneModelMapper (GeMoMa) except the external search algorithm (e.g. tblastn).

For more information please visit http://www.jstacs.de/index.php/GeMoMa

If you have any questions, comments or bugs, please check FAQs on our homepage, our github page https://github.com/Jstacs/Jstacs/labels/GeMoMa or contact jens.keilwagen@julius-kuehn.de

If you use this tool, please cite

@article{Keilwagen:2016:GeMoMa,

author = {Keilwagen, Jens and Wenk, Michael and Erickson, Jessica L. and Schattat, Martin H. and Grau, Jan and Hartung, Frank},

title = {{Using intron position conservation for homology-based gene prediction}},

journal = {Nucleic Acids Research},

volume = {44},

number = {9},

pages = {e89-e89},

year = {2016},

month = {02},

issn = {0305-1048},

doi = {10.1093/nar/gkw092}

}

@article{Keilwagen:2018:GeMoMa_RNAseq,

author = {Keilwagen, Jens and Hartung, Frank and Paulini, Michael and Twardziok, Sven O. and Grau, Jan},

title = {Combining RNA-seq data and homology-based gene prediction for plants, animals and fungi},

journal = {BMC Bioinformatics},

year = {2018},

month = {May},

day = {30},

volume = {19},

number = {1},

pages = {189},

issn = {1471-2105},

doi = {10.1186/s12859-018-2203-5}

}

Available tools:

GeMoMaPipeline - GeMoMa pipeline

ERE - Extract RNA-seq Evidence

CheckIntrons - CheckIntrons

DenoiseIntrons - DenoiseIntrons

NRR - NCBI Reference Retriever

Extractor - Extractor

GeMoMa - GeneModelMapper

GAF - GeMoMa Annotation Filter

AnnotationFinalizer - AnnotationFinalizer

AnnotationEvidence - Annotation evidence

Attribute2Table - Attribute2Table

SyntenyChecker - Synteny checker

AddAttribute - AddAttribute

GAFComparison - GAFComparison

Analyzer - Analyzer

BUSCORecomputer - BUSCORecomputer

GFFAttributes - GFFAttributes

TranscribedCluster - Transcribed Cluster

Syntax: GeMoMa <toolname> [<parameter=value> ...]

Further info about the tools is given with

GeMoMa <toolname> info

For tests of individual tools:

GeMoMa <toolname> test [<verbose>]

Tool parameters are listed with

GeMoMa <toolname>

> GeMoMa ERE

$ GeMoMa ERE

Searching for the new GeMoMa updates ...

You are using the latest GeMoMa version.

Parameters of tool "Extract RNA-seq Evidence" (ERE, version: 1.9):

s - Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED) = FR_UNSTRANDED

The following parameter(s) can be used multiple times:

m - mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam) = null

v - ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT) = LENIENT

u - use secondary alignments (allows to filter flags in the SAM or BAM, default = true) = true

c - coverage (allows to output the coverage, default = true) = true

mmq - minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40) = 40

mc - minimum context (only introns that have evidence of at least one split read with a minimal M (=(mis)match) stretch in the cigar string larger than or equal to this value will be used, valid range = [1, 1000000], default = 1) = 1

maximumcoverage - maximum coverage (optional parameter to reduce the size of coverage output files, coverage higher than this value will be reported as this value, valid range = [1, 10000], OPTIONAL) = null

f - filter by intron mismatches (filter reads by the number of mismatches around splits, range={NO, YES}, default = NO) = NO

No parameters for selection "NO"

Parameters for selection "YES":

r - region around introns (test region of this size around introns/splits for mismatches to the genome, valid range = [0, 100], default = 10) = 10

n - number of mismatches (number of mismatches allowed in regions around introns/splits, valid range = [0, 100], default = 3) = 3

t - target genome (The target genome file (FASTA). Should be in IUPAC code, type = fasta,fas,fa,fna,fasta.gz,fas.gz,fa.gz,fna.gz) = null

e - evidence long splits (require introns to have at least this number of times the supporting reads as their length deviates from the mean split length, valid range = [0.0, 100.0], default = 0.0) = 0.0

mil - minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0) = 0

repositioning - repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL) = null

outdir - The output directory, defaults to the current working directory (.) = .

実行方法

v1.9では以下のコマンドを利用できる。

GeMoMaPipeline - GeMoMa pipeline
ERE - Extract RNA-seq Evidence
CheckIntrons - CheckIntrons
DenoiseIntrons - DenoiseIntrons
NRR - NCBI Reference Retriever
Extractor - Extractor
GeMoMa - GeneModelMapper
GAF - GeMoMa Annotation Filter
AnnotationFinalizer - AnnotationFinalizer
AnnotationEvidence - Annotation evidence
Attribute2Table - Attribute2Table
SyntenyChecker - Synteny checker
AddAttribute - AddAttribute
GAFComparison - GAFComparison
Analyzer - Analyzer
BUSCORecomputer - BUSCORecomputer
GFFAttributes - GFFAttributes
TranscribedCluster - Transcribed Cluster

GeMoMaPipelineコマンド

このコマンドはGeMoMaパイプライン全体を実行する。基本的には以下のように実行される；Extract RNA-seq evidence (ERE), DenoiseIntrons, Extractor, external search (tblastn or mmseqs), Gene Model Mapper (GeMoMa), GeMoMa Annotation Filter (GAF), AnnnotationFinalizerが基本的に動作する。マルチスレッドで、1台のマシンのすべての計算コアを利用することができるが、計算クラスタのように分散させることはできない。

アノテーションをつける自身のゲノム(t)、リファレンス種1の名前（i）とアノテーション（a）と配列（g）、外部エビデンスであるRNAseqのbamファイル（ERE.m）（r=MAPPEDの時のみ）、スレッド数(threads)、出力ディレクトリ名（outdir）を指定する。

GeMoMa GeMoMaPipeline t=genome.fa r=NO o=true \
i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome> \
GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO \
threads=20 outdir=outdir

#RNA seqのbamも指定（HISAT2などでmappingして得たbam）
GeMoMa GeMoMaPipeline t=genome.fa r=MAPPED o=true \
i=<reference_1_id> a=<reference_1_annotation> g=<reference_1_genome> \
GeMoMa.Score=ReAlign AnnotationFinalizer.r=NO \
ERE.m=map.bam \
threads=20 outdir=outdir

t target genome (Target genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)
r RNA-seq evidence (data for RNA-seq evidence, range={NO, MAPPED, EXTRACTED}, default = NO)
o output individual predictions (If *true*, returns the predictions for each reference species, default = false)
s species (data for reference species, range={own, pre-extracted}, default = own)
i ID (ID to distinguish the different reference species, OPTIONAL)
a annotation (Reference annotation file (GFF or GTF), which contains gene models annotated in the reference genome, type = gff,gff3,gtf,gff.gz,gff3.gz,gtf.gz) FILE
g genome (Reference genome file (FASTA), type = fasta,fa,fas,fna,fasta.gz,fa.gz,fas.gz,fna.gz)
outdir The output directory, defaults to the current working directory (.) STRING
ERE.m Parameters for selection "MAPPED". mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)
threads The number of threads used for the tool, defaults to 1
GeMoMa.Score Score (A flag which allows to do nothing, re-score or re-align the search results, range={Trust, ReScore, ReAlign}, default = ReAlign)
AnnotationFinalizer.r rename (allows to generate generic gene and transcripts names (cf. parameter "name attribute"), range={COMPOSED, SIMPLE, NO}, default = COMPOSED)
GeMoMa.m maximum intron length (The maximum length of an intron, default = 15000)
GeMoMa.sil static intron length (A flag which allows to switch between static intron length, which can be specified by the user and is identical for all genes, and dynamic intron length, which is based on the gene-specific maximum intron length in the reference organism plus the user given maximum intron length, default = true)

Parameters for selection "DENOISE":

DenoiseIntrons.m maximum intron length (The maximum length of an intron, default = 15000) INT
DenoiseIntrons.me minimum expression (The threshold for removing introns, valid range = [0.0, 1.0], default = 0.01) DOUBLE
DenoiseIntrons.c context (The context upstream a donor and donwstream an acceptor site that is used to determine the expression of the region, valid range = [0, 100], default = 10)

複数のリファレンスがある場合は、sとパラメータタグ i, a, g を対応する値で繰り返す。o=trueにすると個々の予測が別ファイルとして出力され、個々のステップ簡単かつ迅速に再実行できる。最大イントロン長を指定したい場合は、GeMoMa.m と GeMoMa.sil パラメータを検討する。RNA-seqデータがある場合、DenoiseIntrons のパラメータも確認する。

出力例

特定のコマンドのみ実行することもできます。EREコマンドの例を示す。

EREコマンド

マップされたRNA-seqリードからイントロンとカバレッジを抽出する。イントロンはDenoiseIntronsコマンドでノイズ除去できる可能性がある。EREの結果は予測を改善するために使用することができ、GAFコマンドでより良い遺伝子モデルを選択するのに役立つ可能性がある。また、EREの結果は、AnnotationFinalizerコマンドによるUTRの予測に使用することができる(2018年の論文のFig.1参照)。

（STARやHISAT2などで得た）sam/bamファイルを指定する。s以外のパラメータは複数回指定できる。

GeMoMa ERE s=FR_UNSTRANDED m=input.bam outdir=OUTPUT

m mapped reads file (BAM/SAM files containing the mapped reads, type = bam,sam)
s Stranded (Defines whether the reads are stranded. In case of FR_FIRST_STRAND, the first read of a read pair or the only read in case of single-end data is assumed to be located on forward strand of the cDNA, i.e., reverse to the mRNA orientation. If you are using Illumina TruSeq you should use FR_FIRST_STRAND., range={FR_UNSTRANDED, FR_FIRST_STRAND, FR_SECOND_STRAND}, default = FR_UNSTRANDED)
mil minimum intron length (introns shorter than the minimum length are discarded and considered as contiguous, valid range = [0, 1000], default = 0)
mmq minimum mapping quality (reads with a mapping quality that is lower than this value will be ignored, valid range = [0, 255], default = 40)
v ValidationStringency (Defines how strict to be when reading a SAM or BAM, beyond bare minimum validation., range={STRICT, LENIENT, SILENT}, default = LENIENT) STRING
u use secondary alignments (allows to filter flags in the SAM or BAM, default = true)
repositioning repositioning (due to limitations in BAM/SAM format huge chromosomes need to be split before mapping. This parameter allows to undo the split mapping to real chromosomes and coordinates. The repositioning file has 3 columns: split_chr_name, original_chr_name, offset_in_original_chr, type = tabular, OPTIONAL)