トランスクリプトームから主要なtrasncriptsを選抜する EvidentialGene

2019 6/19 関連論文追記

2020 2/17 依存のインストール手順更新

2025/02/14 追記

EvidentialGeneのtr2aacds.plは、de novo アセンブルツールの結果から生物学的に有用な最良のmRNAセットにクラスタリングするパイプライン。論文は準備中で不明な点もあるが、ポスターによると以下の流れで冗長なtranscirptsを減らすらしい。fastanrdbとcd-hitを使ったあと、blastを使いprimaryなtranscirptsを選抜している。

Algorithm of tr2aacds:

　0. collect input transcripts.tr, produce CDS and AA sequences, work mostly on CDS.

　1. perfect redundant removal with fastanrdb

　2. perfect fragment removal with cd-hit-est

　3. blastn, basic local align high-identity subsequences for alternate tr.

　4. classify main/alternate cds, okay & drop subsets by CDS-align, protein metrics.

　5. output sequence sets from classifier: okay-main, okay-alts, drops. See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

すでにいくつかのde novo transcriptome解析の論文で、複数のde novo アセンブルツールの結果をマージして冗長性を減らすために使われている（ref1）。

公式サイト

http://arthropods.eugenes.org/genes2/about/EvidentialGene_trassembly_pipe.html

wiki

https://sourceforge.net/p/evidentialgene/wiki/Home/

インストール

依存

fastanrdb of exonerate package, quickly reduces perfect duplicate sequences
cd-hit, cd-hit-est clusters protein or nucleotide sequences.
blastn and makeblastdb of NCBI BLAST, Basic Local Alignment Search Tool, finds regions of local similarity between sequences.

#bioconda
mamba install -c bioconda -y cd-hit
mamba install -c bioconda -y exonerate
mamba install -c bioconda -y blast

本体

The best way to get〜のftpリンクからダウンロードする。

解凍してpub/evigene/scripts/prot/tr2aacds2.plを使う。

$ perl evigene/scripts/prot/tr2aacds2.pl

EvidentialGene tr2aacds.pl VERSION 2017.12.21

convert large, redundant mRNA assembly set to best protein coding sequences,

filtering by quality of duplicates, fragments, and alternate transcripts.

See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

Usage: tr2aacds.pl -mrnaseq transcripts.fasta[.gz]

opts: -MINCDS=90 -NCPU=1 -MAXMEM=1000.Mb -[no]smallclass -logfile -tidyup -dryrun -debug

help

> perl scripts/prot/tr2aacds.pl

$ perl scripts/prot/tr2aacds.pl

EvidentialGene tr2aacds.pl VERSION 2022.04.05

convert large, redundant mRNA assembly set to best protein coding sequences,

filtering by quality of duplicates, fragments, and alternate transcripts.

See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

Usage: tr2aacds.pl -cdnaseq transcripts.fasta[.gz]

options:

-NCPU=1 -MAXMEM=2000.Mb : number of CPU, max memory in megabytes (recommended)

-MINAA=30 : minimum valid protein size (or -MINCDS=90),

set -MINAA as high in 20..120 as reasonable for species, to avoid excess tiny coding-genes

-pHeterozygosity=[0..9] : reduce percent identities for heterozygous organism sample (default 0),

this lowers alternate/paralog identity cutoffs and classes, and drops more high-identity transcripts

-logfile : write log of operations (recommended);

-species=Homo_sapiens : define species, now used for public IDs only

-ablastab=blastp_table : use blast scores to keep some transcripts (blastp traa x refdb -outfmt 7; see docs)

-aconsensus : use replication over assemblers as quality (default) or -aconsensus=no (dont use),

-runsteps=list : skip or do certain pipeline steps (nofrag|yesaadup|nostage2, see docs/code)

-strandedrna : auto|yes|no; default auto == maybe|dunno, tests cds-orient, reliable but must recalc cds if stranded

-reorient : resolve mixed strand alternates from evidence of map strand and code potentials

-noutrorf : dont look for ORF in long UTRs, from fused genes (ie 2nd longest ORF)

-debug : more output to logfile

ラン

de novo アセンブルツールの結果をマージしたfastaを入力とする。複数あるなら"cat *fa > merged.fa"などでコンカテネートしておく。

１、解析前にFASTAを修正する。protocol.ioにJared Mamrotさんが投稿されたDe novo transcriptome assembly workflowのワークフローを真似て、FASTAのヘッダーをシンプルな名前に修正する。

perl -ane 'if(/\>/){$a++;print ">Locus_$a\n"}else{print;}' input.fasta > renamed.fasta

#さらにformatを修正
pat.pl -output output.fa -input renamed.fasta

(trformat.pl :regularize IDs in fasta of Velvet,Soap,Trinity, ensure unique IDs, add prefixes for parameter sets.)

2、evigene/scripts/prot/tr2aacds2.plを使う。

perl evigene/scripts/prot/tr2aacds2.pl -mrnaseq output.fa -MINCDS=90 -NCPU=12 -logfile

okayset/のoutput.okalt.faとoutput.okay.faは同じものではなく、主要な転写配列とalternativeの転写配列になる。こちらも参考にしてください。

https://translate.google.co.jp/translate?hl=ja&sl=en&u=https://www.biostars.org/p/273551/&prev=search

引用

poster

Gene-omes built from mRNA seq not genome DNA

ref1: cd-hitより効率的。

https://sourceforge.net/p/evidentialgene/discussion/general/thread/a4f0e29f/

この論文でEvidentialGeneと比較されています。

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6105091/

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

トランスクリプトームから主要なtrasncriptsを選抜する EvidentialGene