2020-07-03

De novoでTEを探索する RepeatModeler2

2020 7/5 ProcessRepeatsのhelp追加

2020 7/6 step3修正

2020 7/7 ProcessRepeatsのコマンドの間違いを修正

2022/04/18 追記

2023/07/24 追記

　Tree of life全体のゲノム配列決定のペースが加速しているため、 transposable elements（TE）のようなゲノム構成要素の教師なしアノテーションを改善する必要性が高まっている。TEの種類や配列は種によって大きく異なるため、自動化されたTEの発見とアノテーションは困難で時間のかかる作業となっている。重要な最初のステップは、ゲノム上に散在しているすべてのユニークなTEファミリーを表現する配列モデルを新規に同定し、正確にコンパイルすることである。ここでは、このプロセスを大幅に促進するパイプラインであるRepeatModeler2を紹介する。このプログラムは、TE発見のために最も広く使われているツールの一つであるRepeatModelerのオリジナルバージョンよりも大幅に改良されている。特に、このバージョンには、真核生物のゲノムに広く存在するが、そのサイズと配列の複雑さから自動同定が困難な完全長 long terminal repeat（LTR）レトロエレメントの構造発見のためのモジュールが組み込まれている。著者らは、多様なTEランドスケープと高品質の手動でキュレーションされたTEライブラリを持つ3つのモデル種でRepeatModeler2のベンチマークを行った。Drosophila melanogaster（ショウジョウバエ）、Danio rerio（ゼブラフィッシュ）、Oryza sativa（イネ）である。これら3つの種において、RepeatModeler2は、元のRepeatModelerと比較して、手動でキュレーションした配列と95%以上の配列同一性と配列カバレッジを持つコンセンサス配列を約3倍以上同定した。予想通り、最も改善されたのはLTRレトロエレメントだった。このように、RepeatModeler2は、真核生物のゲノム配列におけるTEの同定と研究を強化するゲノムアノテーションツールキットの貴重な追加機能となる。RepeatModeler2は、オープンライセンス（https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/）のもと、ソースコードまたはコンテナ化されたパッケージとして提供される。

RepeatModelerは、HubleyとSmitによって2008年にリリースされ、最も広く使用されているTE発見ツールの1つである（2019年11月21日現在1,462回引用）。RepeatModelerは、ゲノム全体のリピートファミリーのシードアラインメントとコンセンサス配列をde novoで構築する。しかし、RepeatModelerのオリジナルバージョンは、他の既存のTEディスカバリーソフトウェアと同様に、完全な長さのコンセンサス配列の非冗長ライブラリを生成するには不十分である。最も問題となるのは、出力ライブラリに含まれる多くの断片化された部分的に冗長な配列の中から、特定のTEファミリーに対して一意の連続したコンセンサス配列であるべきものを解決することである。この問題は、逆に、TEファミリーの分類を妨げ、ゲノム中の実際のTEファミリーの数を増加させ、ゲノムアノテーションや下流の解析を混乱させる可能性がある。LTRレトロエレメントは、その大きさ（最大20キロ塩基対[kbp]）と配列や組織の複雑さのため、自動化されたTEファミリーの同定には特に抵抗があります。しかし、これらの要素は真核生物のゲノムに広く、しばしば非常に豊富で多様である。例えば、トウモロコシのリファレンスゲノムには、ゲノムDNAの約半分を占める約20,000の異なるファミリーに分類される10万以上のLTRエレメントが存在している(ref.29)。

これらの問題に対処するために、著者らはRepeatModelerの改良版を開発した。特に、構造的特徴からゲノム中のLTRエレメントを同定するためのオプションモジュールを統合した(ref.30, 31)。3つの多様なモデル種を用いてベンチマークから、RepeatModeler2は検出感度とコンセンサス配列の品質の両面で前バージョンよりも大幅に改善されていることを示す。このオープンソースパッケージは、シングル、マルチプロセッサコンピュータ上で動作するように設計されており、インストールを容易にするために、ソースディストリビューションまたはDocker/Singularityコンテナとして提供される。（以下略）

http://www.repeatmasker.org/RepeatModeler/

インストール

condaを使って導入した。 bioconda-recipesを見るとABBlastとNINJAはcondaでは導入されないのが分かる。LTRの探索も行うにはNINJAも必要 (cluster only)。

依存
Prerequisites

Perl
RepeatMasker & Libraries
RECON - De Novo Repeat Finder
RepeatScout - De Novo Repeat Finder,
TRF - Tandem Repeat Finder
RMBlast - A modified version of NCBI Blast for use with RepeatMasker and RepeatModeler.

Optional. Additional search engine:

ABBlast

Optional. Required for running LTR structural search pipeline:

LtrHarvest - The LtrHarvest program is part of the GenomeTools suite.
Ltr_retriever - A LTR discovery post-processing and filtering tool.
MAFFT
CD-HIT
Ninja - A tool for large-scale neighbor-joining phylogeny inference and clustering. We developed and tested RepeatModeler using Ninja version "0.95-cluster_only".

本体　Github

#or Bioconda(link) ここでは高速なmambaを使う
mamba create -n repeatmodeler python=3.11 -y
conda activate repeatmodeler
mamba install -c bioconda repeatmodeler -y

> RepeatModeler -h

$ RepeatModeler -h

Unknown option: h

/Users/kazu/anaconda3/envs/repeatmodeler/share/RepeatModeler/RepeatModeler - 2.0.1

NAME

RepeatModeler - Model repetitive DNA

SYNOPSIS

RepeatModeler [-options] -database <XDF Database>

DESCRIPTION

The options are:

-h(elp)

Detailed help

-database

The name of the sequence database to run an analysis on. This is the

name that was provided to the BuildDatabase script using the "-name"

option.

-pa #

Specify the number of parallel search jobs to run. RMBlast jobs will

use 4 cores each and ABBlast jobs will use a single core each. i.e.

on a machine with 12 cores and running with RMBlast you would use

-pa 3 to fully utilize the machine.

-recoverDir <Previous Output Directory>

If a run fails in the middle of processing, it may be possible

recover some results and continue where the previous run left off.

Simply supply the output directory where the results of the failed

run were saved and the program will attempt to recover and continue

the run.

-srand #

Optionally set the seed of the random number generator to a known

value before the batches are randomly selected ( using Fisher Yates

Shuffling ). This is only useful if you need to reproduce the sample

choice between runs. This should be an integer number.

-LTRStruct

Run the LTR structural discovery pipeline ( LTR_Harvest and

LTR_retreiver ) and combine results with the RepeatScout/RECON

pipeline. [optional]

-genomeSampleSizeMax #

Optionally change the maximum bp of the genome to sample in all

rounds of RECON (default=243000000).

CONFIGURATION OVERRIDES

-mafft_dir <string>

The path to the installation of the MAFFT multiple alignment

program.

-repeatmasker_dir <string>

The path to the installation of RepeatMasker.

-trf_prgm <string>

The full path including the name for the TRF program ( 4.0.9 or

higher )

-rscout_dir <string>

The path to the installation of the RepeatScout ( 1.0.6 or higher )

de-novo repeatfinding program.

-ninja_dir <string>

The path to the installation of the Ninja phylogenetic analysis

package.

-rmblast_dir <string>

The path to the installation of the RMBLAST sequence alignment

program.

-abblast_dir <string>

The path to the installation of the ABBLAST sequence alignment

program.

-recon_dir <string>

The path to the installation of the RECON de-novo repeatfinding

program.

-genometools_dir <string>

The path to the installation of the GenomeTools package.

-cdhit_dir <string>

The path to the installation of the CD-Hit sequence clustering

package.

-ltr_retriever_dir <string>

The path to the installation of the LTR_Retriever structural LTR

analysis package.

SEE ALSO

RepeatMasker, RMBlast

AUTHOR

RepeatModeler:

Robert Hubley <rhubley@systemsbiology.org>

Arian Smit <asmit@systemsbiology.org>

LTR Pipeline Extensions:

Jullien Michelle Flynn <jmf422@cornell.edu>

> BuildDatabase -h

$ BuildDatabase -h

No query sequence file indicated

/Users/kazu/anaconda3/envs/repeatmodeler/share/RepeatModeler/BuildDatabase - 2.0.1

NAME

BuildDatabase - Format FASTA files for use with RepeatModeler

SYNOPSIS

BuildDatabase [-options] -name "mydb" <seqfile(s) in fasta format>

BuildDatabase [-options] -name "mydb"

-dir <dir containing fasta files *.fa, *.fasta,

*.fast, *.FA, *.FASTA, *.FAST, *.dna,

and *.DNA >

BuildDatabase [-options] -name "mydb"

-batch <file containing a list of fasta files>

DESCRIPTION

This is basically a wrapper around AB-Blast's and NCBI Blast's

DB formating programs. It assists in aggregating files for processing

into a single database. Source files can be specified by:

- Placing the names of the FASTA files on the command

line.

- Providing the name of a directory containing FASTA files

with the file suffixes *.fa or *.fasta.

- Providing the name of a manifest file which contains the

names of FASTA files ( fully qualified ) one per line.

NOTE: Sequence identifiers are not preserved in this database. Each

sequence is assigned a new GI ( starting from 1 ). The

translation back to the original sequence is preserved in the

*.translation file.

The options are:

-h(elp)

Detailed help

-name <database name>

The name of the database to create.

-engine <engine name>

The name of the search engine we are using. I.e abblast/wublast or

rmblast.

-dir <directory>

The name of a directory containing fasta files to be processed. The

files are recognized by their suffix. Only *.fa and *.fasta files

are processed.

-batch <file>

The name of a file which contains the names of fasta files to

process. The files names are listed one per line and should be fully

qualified.

SEE ALSO

RepeatModeler, RMBlast

AUTHOR

Robert Hubley <rhubley@systemsbiology.org>

> ProcessRepeats -h

$ ProcessRepeats -h

No cat file indicated

NAME

ProcessRepeats - Post process results from RepeatMasker and produce an

annotation file.

SYNOPSIS

ProcessRepeats [-options] <RepeatMasker *.cat file>

DESCRIPTION

The options are:

-h(elp)

Detailed help

-species <query species>

Post process RepeatMasker results run on sequence from this species.

Default is human.

-lib <libfile>

Skips most processing, does not produce a .tbl file unless the

custome library is in the ">name#class" format.

-nolow

Does not display simple repeats or low_complexity DNA in the

annotation.

-noint

Skips steps specific to interspersed repeats, saving lots of time.

-lcambig

Outputs ambiguous DNA transposon fragments using a lower case name.

All other repeats are listed in upper case. Ambiguous fragments

match multiple repeat elements and can only be called based on

flanking repeat information.

-u Creates an untouched annotation file besides the manipulated file.

-xm Creates an additional output file in cross_match format (for

parsing).

-ace

Creates an additional output file in ACeDB format.

-gff

Creates an additional Gene Feature Finding format.

-poly

Creates an output file listing only potentially polymorphic simple

repeats.

-no_id

Leaves out final column with unique number for each element (was

default).

-excln

Calculates repeat densities excluding long stretches of Ns in the

query.

-orf2

Results in sometimes negative coordinates for L1 elements; all L1

subfamilies are aligned over the ORF2 region, sometimes improving

interpretation of data.

-a Shows the alignments in a .align output file.

-maskSource <originalSeqenceFile>

Instructs ProcessRepeats to mask the sequence file using the

annotation.

-x Mask repeats with a lower case 'x'.

-xsmall

Mask repeats by making the sequence lowercase.

SEE ALSO

RepeatMasker, Crossmatch, Blast

Biology

AUTHORS

Arian Smit <asmit@systemsbiology.org>

Robert Hubley <rhubley@systemsbiology.org>

RepeatMaskerのhelpも載せておきます。

> RepeatMasker -h

$ RepeatMasker -h

RepeatMasker version open-4.0.9

Option h is ambiguous (help, html)

NAME

RepeatMasker - Mask repetitive DNA

SYNOPSIS

RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION

The options are:

-h(elp)

Detailed help

Default settings are for masking all type of repeats in a primate

sequence.

Use an alternate search engine to the default. Note: 'ncbi' and

'rmblast' are both aliases for the rmblastn search engine engine.

The generic NCBI blastn program is not sensitive enough for use with

RepeatMasker at this time.

-pa(rallel) [number]

The number of processors to use in parallel (only works for batch

files or sequences over 50 kb)

-s Slow search; 0-5% more sensitive, 2-3 times slower than default

-q Quick search; 5-10% less sensitive, 2-5 times faster than default

-qq Rush job; about 10% less sensitive, 4->10 times faster than default

(quick searches are fine under most circumstances) repeat options

-nolow

Does not mask low_complexity DNA or simple repeats

-noint

Only masks low complex/simple repeats (no interspersed repeats)

-norna

Does not mask small RNA (pseudo) genes

-alu

Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

-div [number]

Masks only those repeats < x percent diverged from consensus seq

-lib [filename]

Allows use of a custom library (e.g. from another species)

-cutoff [number]

Sets cutoff score for masking repeats when using -lib (default 225)

-species <query species>

Specify the species or clade of the input sequence. The species name

must be a valid NCBI Taxonomy Database species name and be contained

in the RepeatMasker repeat database. Some examples are:

-species human

-species mouse

-species rattus

-species "ciona savignyi"

-species arabidopsis

Other commonly used species:

mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,

danio, "ciona intestinalis" drosophila, anopheles, worm, diatoaea,

artiodactyl, arabidopsis, rice, wheat, and maize

Contamination options

-is_only

Only clips E coli insertion elements out of fasta and .qual files

-is_clip

Clips IS elements before analysis (default: IS only reported)

-no_is

Skips bacterial insertion element check

Running options

-gc [number]

Use matrices calculated for 'number' percentage background GC level

-gccalc

RepeatMasker calculates the GC content even for batch files/small

seqs

-frag [number]

Maximum sequence length masked without fragmenting (default 60000)

-nocut

Skips the steps in which repeats are excised

-noisy

Prints search engine progress report to screen (defaults to .stderr

file)

-nopost

Do not postprocess the results of the run ( i.e. call ProcessRepeats

). NOTE: This options should only be used when ProcessRepeats will

be run manually on the results.

output options

-dir [directory name]

Writes output to this directory (default is query file directory,

"-dir ." will write to current directory).

-a(lignments)

Writes alignments in .align output file

-inv

Alignments are presented in the orientation of the repeat (with

option -a)

-lcambig

Outputs ambiguous DNA transposon fragments using a lower case name.

All other repeats are listed in upper case. Ambiguous fragments

match multiple repeat elements and can only be called based on

flanking repeat information.

-small

Returns complete .masked sequence in lower case

-xsmall

Returns repetitive regions in lowercase (rest capitals) rather than

masked

-x Returns repetitive regions masked with Xs rather than Ns

-poly

Reports simple repeats that may be polymorphic (in file.poly)

-source

Includes for each annotation the HSP "evidence". Currently this

option is only available with the "-html" output format listed

below.

-html

Creates an additional output file in xhtml format.

-ace

Creates an additional output file in ACeDB format

-gff

Creates an additional Gene Feature Finding format output

-u Creates an additional annotation file not processed by

ProcessRepeats

-xm Creates an additional output file in cross_match format (for

parsing)

-no_id

Leaves out final column with unique ID for each element (was

default)

-e(xcln)

Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in

the query

SEE ALSO

Crossmatch, ProcessRepeats

AUTHORS

Arian Smit <asmit@systemsbiology.org>

Robert Hubley <rhubley@systemsbiology.org>

以下のコンテナを使えばRepeatModeler、RepeatMasker、そしてcosegの3つを利用可能。

Dockerhub

実行方法

１、データベースを作成する。

BuildDatabase -name prefix input_genome.fa

-name The name of the database to create

出力

f:id:kazumaxneo:20200702171557p:plain

２、RepeatModeler を実行する。スレッド数は以前は-paで指定したが、現在は-threads で指定する。

RepeatModeler -database prefix -threads 20

#ランタイムが最低数時間以上かかるため、Gihtubではnohup実行が強く推奨されている。nohupのため進捗logをファイルに保存する。
nohup RepeatModeler -database elephant -pa 20 >& run.out &

-database The name of the sequence database to run an analysis on. This is the name that was provided to the BuildDatabase script using the "-name" option.
-threads Specify the maximum number of threads which can be used by the program at any one time. Note that the '-pa' parameter in previous releases controlled the number of sequence batches compared in parallel using rmblastn (each running 4 threads). Therefore, if '-pa 4' was used previously the new thread parameter should be set to '-threads 16'.

様々なファイルが出力される。ディレクトリ名は実行dateが含まれるので少し長くなる（RM_88440.TueJul71311202020とか）。*2

f:id:kazumaxneo:20200703002148p:plain

tmpBlastXResults.out.bxsummary

f:id:kazumaxneo:20200703112744p:plain

シードアライメントファイル（.stk）は、Dfam互換のStockholmフォーマットで、help@dfam.org、Dfamデータベースにアップロードすることができる（マニュアルより）。

３、得られたリピートのコンセンサス配列(冗長な配列をコンセンサスにしてまとめたもの)のFASTA形式ファイルをライブラリに指定してRepeatMaskerを実行する。input_genome.faのリピートをソフトマスクする Nで置換する（*１）。

RepeatMasker -pa 20 -html -gff -small -lib outdir/tmpConsensus.fa.masked input_genome.fa

-html Creates an additional output file in xhtml format.
-gff Creates an additional Gene Feature Finding format output
-small Returns complete .masked sequence in lower case
-lib Allows use of a custom library (e.g. from another species)

出力

f:id:kazumaxneo:20200703120400p:plain

.tblがサマリーファイル

f:id:kazumaxneo:20200703120621p:plain

out.html

f:id:kazumaxneo:20200703120753p:plain

４、ProcessRepeatesを実行してリピートをソフトマスクする。ステップ3のRepeatMaskerの出力；input_genome.fa.catファイルを使うので、ステップ3のコマンドを先に実行する必要がある（YNSさんのコメントもご参照下さい）。

ProcessRepeats -maskSource input_genome.fa -xsmall -gff input_genome.fa.cat

-maskSource Instructs ProcessRepeats to mask the sequence file using the annotation.
-xsmall Mask repeats by making the sequence lowercase.
-gff Creates an additional Gene Feature Finding format output

input_genome.fa.maskedやGFFファイルなどが出力される。

追記

RepeatModelerのランでLTR探索も実行する。

RepeatModeler -database elephant -pa 20 -LTRStruct

-LTRStruct Run the LTR structural discovery pipeline ( LTR_Harvest and
LTR_retreiver ) and combine results with the RepeatScout/RECON
pipeline. [optional]

NINJAがないとランできない。

https://github.com/TravisWheelerLab/NINJA/releases/tag/0.95-cluster_only

からダウンロードしてmakeする。NINJAができるので、$NINJA_DIRを設定する。

export NINJA_DIR=<path>/<to>/NINJA-0.95-cluster_only/NINJA/

ある植物ゲノムアセンブリに適用したところ、condaで配布されているRepeatMaskerのデフォルトのリピートライブラリではゲノムの２％の領域しかマスクされなかったが、 RepeatModeler2でリピートを予測後、それをライブラリにしてRepeatMaskerをランすると34％の領域がマスクされた。

引用

RepeatModeler2 for automated genomic discovery of transposable element families
Jullien M. Flynn, Robert Hubley, Clément Goubert, Jeb Rosen, Andrew G. Clark, Cédric Feschotte, and Arian F. Smit

PNAS April 28, 2020 117 (17) 9451-9457; first published April 16, 2020