EDTA: a pipeline for de novo TE discovery and annotation

Updated 2021/11/26

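Sequencing technology and assembly algorithms have matured to the point where high-quality de novo assembly is possible even for large, repetitive genomes. Current assemblies traverse transposable elements (TEs), which makes comprehensive TE annotation possible. Many methods exist for each class of element, but their relative performance is unclear. The authors benchmarked existing programs against a curated library of rice TEs. Using the most robust programs, they built a comprehensive pipeline called the Extensive de-novo TE Annotator (EDTA), which produces a condensed TE library for annotating both structurally intact and fragmented elements. EDTA is open source and freely available at https://github.com/oushujun/EDTA.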

The Extensive de-novo TE Annotator (EDTA) has matured to v2.0.0!!! It now formally supports pan-genome TE annotation. After two years of release, it has been downloaded over 5,000 times and annotated thousands of genomes. Great thanks to the TE and genomics community and fans!!! pic.twitter.com/HkHLN5dsBA
— Shujun Ou (@SigmaFacto) November 26, 2021

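From GitHub

The EDTA package is designed to filter out false discoveries in raw TE candidates and to generate a high-quality, non-redundant TE library for whole-genome TE annotation. The initial search programs were selected by benchmarking annotation performance against a manually curated TE library for the rice genome.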

Installation

Installed with conda on Ubuntu 18.04 LTS (Docker and Singularity are also supported).

GitHub

conda create -n edta python=3.8 -y
conda activate edta
conda install -c bioconda -c conda-forge edta -y

# Docker (Docker Hub)
docker pull oushujun/edta:<tag>
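
Before starting a long run, it can help to confirm that the tool is reachable in the current shell. The following is a minimal sketch, assuming the conda environment is named edta as above. The helper name have_cmd is made up for illustration and is not part of EDTA; --check_dependencies is the option listed in the EDTA.pl help text.

```shell
# Hypothetical helper (not part of EDTA): succeeds if a command is on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd EDTA.pl; then
    # Documented in the EDTA.pl help: check dependencies and quit.
    EDTA.pl --check_dependencies
else
    # Assumption: the environment was created under the name "edta".
    echo "EDTA.pl not found; try 'conda activate edta' first" >&2
fi
```

If the script is not found, the most likely cause is simply that the environment has not been activated in this shell.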

> EDTA.pl

$ EDTA.pl

########################################################
##### Extensive de-novo TE Annotator (EDTA) v1.9.4 ####
##### Shujun Ou (shujun.ou.1@gmail.com) ####
########################################################

At least 1 parameter is required:
1) Input fasta file: --genome

This is the Extensive de-novo TE Annotator that generates a high-quality
structure-based TE library. Usage:

perl EDTA.pl [options]
    --genome [File] The genome FASTA
    --species [Rice|Maize|others] Specify the species for identification of TIR
        candidates. Default: others
    --step [all|filter|final|anno] Specify which steps you want to run EDTA.
        all: run the entire pipeline (default)
        filter: start from raw TEs to the end.
        final: start from filtered TEs to finalizing the run.
        anno: perform whole-genome annotation/analysis after
            TE library construction.
    --overwrite [0|1] If previous raw TE results are found, decide to overwrite
        (1, rerun) or not (0, default).
    --cds [File] Provide a FASTA file containing the coding sequence (no introns,
        UTRs, nor TEs) of this genome or its close relative.
    --curatedlib [File] Provided a curated library to keep consistant naming and
        classification for known TEs. TEs in this file will be
        trusted 100%, so please ONLY provide MANUALLY CURATED ones.
        This option is not mandatory. It's totally OK if no file is
        provided (default).
    --sensitive [0|1] Use RepeatModeler to identify remaining TEs (1) or not (0,
        default). This step is slow but MAY help to recover some TEs.
    --anno [0|1] Perform (1) or not perform (0, default) whole-genome TE annotation
        after TE library construction.
    --rmout [File] Provide your own homology-based TE annotation instead of using the
        EDTA library for masking. File is in RepeatMasker .out format. This
        file will be merged with the structural-based TE annotation. (--anno 1
        required). Default: use the EDTA library for annotation.
    --evaluate [0|1] Evaluate (1) classification consistency of the TE annotation.
        (--anno 1 required). Default: 0. This step is slow and does
        not change the annotation result.
    --exclude [File] Exclude bed format regions from TE annotation. Default: undef.
        (--anno 1 required).
    --force [0|1] When no confident TE candidates are found: 0, interrupt and exit
        (default); 1, use rice TEs to continue.
    --repeatmodeler [path] The directory containing RepeatModeler (default: read from ENV)
    --repeatmasker [path] The directory containing RepeatMasker (default: read from ENV)
    --check_dependencies Check if dependencies are fullfiled and quit
    --threads|-t [int] Number of theads to run this script (default: 4)
    --debug [0|1] Retain intermediate files (default: 0)
    --help|-h Display this help info
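
The --step option in the help text above suggests that a run can be restarted partway through instead of from scratch. Below is a rough sketch of a dry-run wrapper that only prints the command it would run. The function name edta_cmd and the choice to always pass --overwrite 0 are assumptions for illustration; the option names, step values, and the default of 4 threads come from the help text. Whether a restart actually works presumably depends on the intermediate files from the earlier run still being present in the working directory.

```shell
# Hypothetical wrapper (not part of EDTA): print the EDTA.pl command for a step.
# Usage: edta_cmd GENOME [STEP] [THREADS]
edta_cmd() {
    genome=$1
    step=${2:-all}      # "all" is the documented default step
    threads=${3:-4}     # 4 is the documented default thread count
    case "$step" in
        all|filter|final|anno) ;;   # the four values listed in the help text
        *) echo "unknown step: $step" >&2; return 1 ;;
    esac
    # --overwrite 0 keeps any previous raw TE results instead of recomputing them.
    echo "EDTA.pl --genome $genome --step $step --overwrite 0 --threads $threads"
}

# Dry run: show the command that would restart from the filtered TEs.
edta_cmd genome.fa final 10
```

Printing the command first makes it easy to review the options before committing to a long run.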

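Test run

With --cds, specify the coding sequences (CDS) of the same species as the genome, or of a closely related one. With --exclude, specify a BED file of the known gene annotations for the genome (to keep genes from being masked). With --curatedlib, specify a trusted TE library in FASTA format (it can be small, as long as the TEs are reliable). Here the rice TE database is used.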

git clone https://github.com/oushujun/EDTA.git
cd EDTA/test/
EDTA.pl --genome genome.fa --cds genome.cds.fa --curatedlib ../database/rice6.9.5.liban --exclude genome.exclude.bed --overwrite 1 --sensitive 1 --anno 1 --evaluate 1 --threads 10

The run ends with an error.
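
The test data ships with a ready-made genome.exclude.bed, but for another genome the gene regions often exist only as a GFF3 annotation. The sketch below shows one way such a BED file might be derived. It assumes that gene features are labeled "gene" in column 3 and that a plain three-column BED is sufficient for --exclude; the function name and the demo file are made up for illustration.

```shell
# Hypothetical helper: convert GFF3 "gene" features to 3-column BED.
# GFF3 is 1-based inclusive; BED is 0-based half-open, hence start - 1.
gff_genes_to_bed() {
    awk -F '\t' -v OFS='\t' \
        '$0 !~ /^#/ && $3 == "gene" { print $1, $4 - 1, $5 }' "$1"
}

# Tiny made-up GFF3 (one gene, one mRNA), for demonstration only.
tmp=$(mktemp -d)
printf 'chr1\tsrc\tgene\t101\t200\t.\t+\t.\tID=g1\n' > "$tmp/demo.gff3"
printf 'chr1\tsrc\tmRNA\t101\t200\t.\t+\t.\tID=t1\n' >> "$tmp/demo.gff3"

# Only the gene line is kept; the mRNA line is skipped.
gff_genes_to_bed "$tmp/demo.gff3" > "$tmp/demo.exclude.bed"
cat "$tmp/demo.exclude.bed"
```

The resulting file could then be passed as --exclude, together with --anno 1, which the help text lists as required for that option.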

Citation

Benchmarking Transposable Element Annotation Methods for Creation of a Streamlined, Comprehensive Pipeline

Shujun Ou, Weijia Su, Yi Liao, Kapeel Chougule, Doreen Ware, Thomas Peterson, Ning Jiang, Candice N. Hirsch, Matthew B. Hufford

bioRxiv, Posted June 03, 2019

Update

Benchmarking transposable element annotation methods for creation of a streamlined, comprehensive pipeline
Shujun Ou, Weijia Su, Yi Liao, Kapeel Chougule, Jireh R. A. Agda, Adam J. Hellinga, Carlos Santiago Blanco Lugo, Tyler A. Elliott, Doreen Ware, Thomas Peterson, Ning Jiang, Candice N. Hirsch & Matthew B. Hufford
Genome Biology volume 20, Article number: 275 (2019)