EvidentialGene: selecting the primary transcripts from a transcriptome

2019 6/19: added a related paper

2020 2/17: updated the dependency installation steps

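EvidentialGene's tr2aacds.pl is a pipeline that clusters the output of de novo assemblers into the most biologically useful mRNA set. The paper is still in preparation and some details are unclear, but according to the poster it reduces redundant transcripts in the steps listed below: it first runs fastanrdb and cd-hit, then uses blast to select the primary transcripts.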

Algorithm of tr2aacds:

　0. collect input transcripts.tr, produce CDS and AA sequences, work mostly on CDS.

　1. perfect redundant removal with fastanrdb

　2. perfect fragment removal with cd-hit-est

　3. blastn, basic local align high-identity subsequences for alternate tr.

　4. classify main/alternate cds, okay & drop subsets by CDS-align, protein metrics.

　5. output sequence sets from classifier: okay-main, okay-alts, drops. See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html
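As a rough illustration of the idea behind step 1 only, the sketch below drops records whose sequence is exactly identical to an earlier one. It is an awk stand-in, not fastanrdb and not what tr2aacds.pl actually runs; the function name `dedup_exact` is hypothetical, and treating sequences case-insensitively is an assumption.

```shell
# Hypothetical helper: keep the first record of each exactly identical sequence.
# This only illustrates "perfect redundant removal"; it is not fastanrdb.
dedup_exact() {
    awk '
        # Print the buffered record unless the same sequence was already seen.
        function emit() {
            if (header != "" && !(toupper(seq) in seen)) {
                seen[toupper(seq)] = 1
                print header
                print seq
            }
        }
        /^>/ { emit(); header = $0; seq = ""; next }   # a new record starts
        { gsub(/[ \t\r]/, ""); seq = seq $0 }          # join wrapped sequence lines
        END { emit() }
    ' "$1"
}
```

The output is written with one sequence line per record. Fragments and near-identical alternates are not handled here; in the pipeline those belong to the later cd-hit-est and blastn steps.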

Several de novo transcriptome papers have already used it to merge the results of multiple de novo assemblers and reduce redundancy (ref1).

Official site

http://arthropods.eugenes.org/genes2/about/EvidentialGene_trassembly_pipe.html

wiki

https://sourceforge.net/p/evidentialgene/wiki/Home/

Installation

Dependencies

fastanrdb of exonerate package, quickly reduces perfect duplicate sequences
cd-hit, cd-hit-est clusters protein or nucleotide sequences.
blastn and makeblastdb of NCBI BLAST, Basic Local Alignment Search Tool, finds regions of local similarity between sequences.

# bioconda (cd-hit added here because it is listed as a dependency above)
conda install -c bioconda -y exonerate cd-hit blast
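Before running the pipeline it may be worth confirming that the tools from the dependency list are actually on PATH. A minimal sketch follows; the function name `check_tools` is hypothetical, and the tool names in the commented example are the ones named in the dependency list.

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example:
#   check_tools fastanrdb cd-hit-est blastn makeblastdb
```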

EvidentialGene itself

http://arthropods.eugenes.org/EvidentialGene/evigene/

Download it from the ftp link in the sentence that starts with "The best way to get ...".

Unpack the archive and use pub/evigene/scripts/prot/tr2aacds2.pl.

$ perl evigene/scripts/prot/tr2aacds2.pl

EvidentialGene tr2aacds.pl VERSION 2017.12.21

convert large, redundant mRNA assembly set to best protein coding sequences,

filtering by quality of duplicates, fragments, and alternate transcripts.

See http://eugenes.org/EvidentialGene/about/EvidentialGene_trassembly_pipe.html

Usage: tr2aacds.pl -mrnaseq transcripts.fasta[.gz]

opts: -MINCDS=90 -NCPU=1 -MAXMEM=1000.Mb -[no]smallclass -logfile -tidyup -dryrun -debug

Run

The input is a FASTA file that merges the results of the de novo assemblers. If there are several files, concatenate them first, e.g. with "cat *fa > merged.fa".
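When different assemblers emit the same contig names, a plain cat produces clashing IDs. The sketch below concatenates the files while prefixing each ID with the base name of its source file. The function name `merge_with_prefix` and the file names in the example are hypothetical. Writing the output to a name that the input glob does not match (merged.fa matches *fa) also keeps an earlier merged file from being picked up again on a re-run.

```shell
# Hypothetical helper: concatenate FASTA files, prefixing every ID with the
# base name (extension removed) of the file it came from.
merge_with_prefix() {
    for f in "$@"; do
        prefix=$(basename "$f")
        prefix=${prefix%.*}
        # Rewrite header lines only; sequence lines pass through unchanged.
        awk -v p="$prefix" '/^>/ { sub(/^>/, ">" p "_") } { print }' "$f"
    done
}

# Example (placeholder file names):
#   merge_with_prefix trinity.fa spades.fa > merged.fasta
```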

1. Fix the FASTA before the analysis. Following the "De novo transcriptome assembly workflow" that Jared Mamrot posted on protocols.io, rename the FASTA headers to simple names.

perl -ane 'if(/\>/){$a++;print ">Locus_$a\n"}else{print;}' input.fasta > renamed.fasta

# then regularize the format
trformat.pl -output output.fa -input renamed.fasta

(trformat.pl :regularize IDs in fasta of Velvet,Soap,Trinity, ensure unique IDs, add prefixes for parameter sets.)
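Since unique IDs matter for the following steps, a quick sanity check on the formatted FASTA can help. This is a minimal sketch; the function name `duplicate_ids` is hypothetical, and it assumes the ID is the first whitespace-delimited word of each header line.

```shell
# Hypothetical helper: print every sequence ID that occurs more than once.
# Prints nothing when all IDs are unique.
duplicate_ids() {
    awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}

# Example:
#   duplicate_ids output.fa
```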

tr2aacds2.pl is in evigene/scripts/prot/. Run it.

perl evigene/scripts/prot/tr2aacds2.pl -mrnaseq output.fa -MINCDS=90 -NCPU=12 -logfile

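In okayset/, output.okay.fa and output.okalt.fa are not the same thing: the former holds the main transcripts and the latter the alternative transcripts. The following thread is also a useful reference.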

https://translate.google.co.jp/translate?hl=ja&sl=en&u=https://www.biostars.org/p/273551/&prev=search
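To see how the input was split between main and alternative transcripts, the records in each output file can be counted. This is a minimal sketch; the function name `count_seqs` is hypothetical, and the file names in the example are the ones used in this post.

```shell
# Hypothetical helper: print "file<TAB>number of FASTA records" for each file.
count_seqs() {
    for f in "$@"; do
        # Count header lines; "n + 0" prints 0 for a file without records.
        n=$(awk '/^>/ { n++ } END { print n + 0 }' "$f")
        printf '%s\t%s\n' "$f" "$n"
    done
}

# Example:
#   count_seqs okayset/output.okay.fa okayset/output.okalt.fa
```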

References

poster

Gene-omes built from mRNA seq not genome DNA

ref1: more efficient than cd-hit.

https://sourceforge.net/p/evidentialgene/discussion/general/thread/a4f0e29f/

This paper includes a comparison with EvidentialGene.

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6105091/
