Masking repeats with RepeatMasker - macでインフォマティクス

2021 3/26 Added commands

2022/12 Updated

2023/01/08, 01/09 Additions

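RepeatMasker is a program that screens DNA sequences for interspersed repeats and low-complexity DNA sequences. Its output is a detailed annotation of the repeats present in the query sequence, together with a modified version of the query sequence in which all annotated repeats have been masked (by default, replaced with Ns). Sequence comparisons in RepeatMasker are performed by cross_match, an efficient implementation of the Smith-Waterman-Gotoh algorithm developed by Phil Green, or by WU-Blast, developed by Warren Gish. For installation instructions, see "INSTALL" in the GitHub repository. For the detailed program manual, see "repeatmasker.help".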

The Dfam database

Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, Pages D81–D89

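Repetitive DNA, especially that due to transposable elements (TEs), makes up a large fraction of many genomes. Dfam is an open access database of families of repetitive DNA elements, in which each family is represented by a multiple sequence alignment and a profile hidden Markov model (HMM). The initial release of Dfam, featured in the 2013 NAR Database Issue, contained 1143 families of repetitive elements found in humans, and was used to produce more than 100 Mb of additional annotation of TE-derived regions in the human genome, with improved speed. The paper describes recent advances, most notably the expansion to 4150 total families, including a comprehensive set of known repeat families from four new organisms (mouse, zebrafish, fly and nematode). It describes improvements to coverage and to the methods for identifying and reducing false annotation, as well as updates to the website interface. The Dfam website has moved to http://dfam.org. Seed alignments, profile HMMs, hit lists and other underlying data are available for download.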

RepeatMasker website

Download Page

Dfam (The Dfam database is an open collection of Transposable Element DNA sequence alignments, hidden Markov Models (HMMs), consensus sequences, and genome annotations.)

https://www.dfam.org/home

Installation

I installed it into a Python 3.7 virtual environment using conda.

Dependencies

GitHub

#bioconda (link)

mamba create -n repeatmasker python=3.7 -y
conda activate repeatmasker
mamba install -c bioconda -y repeatmasker

> RepeatMasker

$ RepeatMasker

RepeatMasker version 4.1.0

No query sequence file indicated

/Users/kazu/miniconda3/envs/repeatmasker/bin/RepeatMasker - 4.1.0

NAME

RepeatMasker - Mask repetitive DNA

SYNOPSIS

RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION

The options are:

-h(elp)

Detailed help

Default settings are for masking all type of repeats in a primate

sequence.

-e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]

Use an alternate search engine to the default. Note: 'ncbi' and

'rmblast' are both aliases for the rmblastn search engine engine.

The generic NCBI blastn program is not sensitive enough for use with

RepeatMasker at this time.

-pa(rallel) [number]

The number of sequence batch jobs [50kb minimum] to run in parallel.

RepeatMasker will fork off this number of parallel jobs, each

running the search engine specified. For each search engine

invocation ( where applicable ) a fixed the number of cores/threads

is used:

RMBlast 4 cores

nhmmer 2 cores

crossmatch 1 core

To estimate the number of cores a RepeatMasker run will use simply

multiply the -pa value by the number of cores the particular search

engine will use.

-s Slow search; 0-5% more sensitive, 2-3 times slower than default

-q Quick search; 5-10% less sensitive, 2-5 times faster than default

-qq Rush job; about 10% less sensitive, 4->10 times faster than default

(quick searches are fine under most circumstances) repeat options

-nolow

Does not mask low_complexity DNA or simple repeats

-noint

Only masks low complex/simple repeats (no interspersed repeats)

-norna

Does not mask small RNA (pseudo) genes

-alu

Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

-div [number]

Masks only those repeats < x percent diverged from consensus seq

-lib [filename]

Allows use of a custom library (e.g. from another species)

-cutoff [number]

Sets cutoff score for masking repeats when using -lib (default 225)

-species <query species>

Specify the species or clade of the input sequence. The species name

must be a valid NCBI Taxonomy Database species name and be contained

in the RepeatMasker repeat database. Some examples are:

-species human

-species mouse

-species rattus

-species "ciona savignyi"

-species arabidopsis

Other commonly used species:

mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,

danio, "ciona intestinalis" drosophila, anopheles, worm, diatoaea,

artiodactyl, arabidopsis, rice, wheat, and maize

Contamination options

-is_only

Only clips E coli insertion elements out of fasta and .qual files

-is_clip

Clips IS elements before analysis (default: IS only reported)

-no_is

Skips bacterial insertion element check

Running options

-gc [number]

Use matrices calculated for 'number' percentage background GC level

-gccalc

RepeatMasker calculates the GC content even for batch files/small

seqs

-frag [number]

Maximum sequence length masked without fragmenting (default 60000)

-nocut

Skips the steps in which repeats are excised

-noisy

Prints search engine progress report to screen (defaults to .stderr

file)

-nopost

Do not postprocess the results of the run ( i.e. call ProcessRepeats

). NOTE: This options should only be used when ProcessRepeats will

be run manually on the results.

output options

-dir [directory name]

Writes output to this directory (default is query file directory,

"-dir ." will write to current directory).

-a(lignments)

Writes alignments in .align output file

-inv

Alignments are presented in the orientation of the repeat (with

option -a)

-lcambig

Outputs ambiguous DNA transposon fragments using a lower case name.

All other repeats are listed in upper case. Ambiguous fragments

match multiple repeat elements and can only be called based on

flanking repeat information.

-small

Returns complete .masked sequence in lower case

-xsmall

Returns repetitive regions in lowercase (rest capitals) rather than

masked

-x Returns repetitive regions masked with Xs rather than Ns

-poly

Reports simple repeats that may be polymorphic (in file.poly)

-source

Includes for each annotation the HSP "evidence". Currently this

option is only available with the "-html" output format listed

below.

-html

Creates an additional output file in xhtml format.

-ace

Creates an additional output file in ACeDB format

-gff

Creates an additional Gene Feature Finding format output

-u Creates an additional annotation file not processed by

ProcessRepeats

-xm Creates an additional output file in cross_match format (for

parsing)

-no_id

Leaves out final column with unique ID for each element (was

default)

-e(xcln)

Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in

the query

CONFIGURATION OVERRIDES

-crossmatch_dir <string>

The path Phil Green's cross_match program ( phrap program suite ).

-rmblast_dir <string>

The path to the installation of the RMBLAST sequence alignment

program.

-libdir <string>

Path to the RepeatMasker libraries directory.

-trf_prgm <string>

The full path including the name for the TRF program.

-hmmer_dir <string>

The path to the HMMER profile HMM search software.

-abblast_dir <string>

The path to the installation of the ABBLAST sequence alignment

program.

-default_search_engine <string>

The default search engine to use

SEE ALSO

Crossmatch, ProcessRepeats

2002-2019 Copyright (C) Institute for Systems Biology 2002-2019

Developed by Arian Smit and Robert Hubley.

2000-2001 Copyright (C) Arian Smit 2000-2001.

1996-1999 Copyright (C) University of Washington, Developed by Arian

Smit, Philip Green and Colin Wilson of the University of Washington

Department of Genomics.

AUTHORS

Arian Smit <asmit@systemsbiology.org>

Robert Hubley <rhubley@systemsbiology.org>

When installed with conda (miniconda3), the Dfam database is located at:

~/miniconda3/envs/repeatmasker/share/RepeatMasker/Libraries/Dfam.hmm
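The environment prefix depends on the installer (miniconda3, mambaforge, and so on), so it can be easier to derive the directory from the active environment than to hard-code it. The following is a minimal sketch, assuming the environment has been activated so that conda has set `CONDA_PREFIX`, and assuming the same `share/RepeatMasker/Libraries` layout as the path above. The function name `rm_libdir` is hypothetical.

```shell
# Print the RepeatMasker Libraries directory of the active conda environment.
# Assumption: `conda activate repeatmasker` has been run, so CONDA_PREFIX is set.
rm_libdir() {
    if [ -z "${CONDA_PREFIX:-}" ]; then
        echo "CONDA_PREFIX is not set; activate the environment first" >&2
        return 1
    fi
    printf '%s\n' "${CONDA_PREFIX}/share/RepeatMasker/Libraries"
}
```

For example, `ls "$(rm_libdir)"` should list Dfam.hmm if the installation follows this layout.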

Usage

Specify the genome sequence and the search engine.

RepeatMasker -e hmmer -pa 20 input_genome.fasta

-e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]
Use an alternate search engine to the default. Note: 'ncbi' and
'rmblast' are both aliases for the rmblastn search engine engine.
The generic NCBI blastn program is not sensitive enough for use with
RepeatMasker at this time.
-pa(rallel) [number]
The number of sequence batch jobs [50kb minimum] to run in parallel.
RepeatMasker will fork off this number of parallel jobs, each
running the search engine specified. For each search engine
invocation ( where applicable ) a fixed the number of cores/threads
is used:

RMBlast 4 cores
nhmmer 2 cores
crossmatch 1 core

To estimate the number of cores a RepeatMasker run will use simply
multiply the -pa value by the number of cores the particular search
engine will use.
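The core estimate described in the help text is a simple multiplication, and it is worth doing before picking a -pa value on a shared machine. The sketch below hard-codes the per-engine core counts quoted above (from the version 4.1.0 help). Mapping `-e hmmer` to the "nhmmer 2 cores" entry is my assumption, and the function name `estimate_cores` is hypothetical.

```shell
# Estimate total cores for a run: -pa value x cores per search engine invocation.
# Core counts are taken from the help text: RMBlast 4, nhmmer 2, crossmatch 1.
estimate_cores() {
    pa=$1
    engine=$2
    case "$engine" in
        rmblast|ncbi) per_job=4 ;;   # 'ncbi' and 'rmblast' are aliases for rmblastn
        hmmer)        per_job=2 ;;   # assumed to correspond to the nhmmer entry
        crossmatch)   per_job=1 ;;
        *) echo "unknown engine: $engine" >&2; return 1 ;;
    esac
    echo $(( pa * per_job ))
}
```

With the command above (-e hmmer -pa 20), `estimate_cores 20 hmmer` prints 40, so the run may use roughly twice as many cores as the -pa value suggests.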

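The summary file is named after the input with .tbl appended (.fa.tbl for a .fa input). It tallies the numbers of LINEs, SINEs, LTR elements, small RNAs, and so on. The FASTA file hard-masked with N is written as .fa.masked. With -xsmall, repeats are written in lowercase (soft masking) instead of being hard-masked with N. (-small is a different option that returns the entire .masked sequence in lowercase.) To learn when to use which, the explanation in this section of the MAKER tutorial is helpful.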
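To check quickly whether a .masked file is hard-masked or soft-masked, the bases can simply be counted. This is a minimal sketch using awk. It counts lowercase a/c/g/t/n as soft-masked and uppercase N or X as hard-masked; other lowercase IUPAC codes are not counted, and uppercase N that was already in the assembly as a gap is indistinguishable from masking. The function name `mask_stats` is hypothetical.

```shell
# Report total, soft-masked (lowercase) and hard-masked (N/X) base counts of a FASTA file.
mask_stats() {
    awk '
        /^>/ { next }                      # skip header lines
        {
            total += length($0)            # count before gsub modifies the line
            soft  += gsub(/[acgtn]/, "")   # lowercase bases = soft-masked
            hard  += gsub(/[NX]/, "")      # N (default) or X (-x) = hard-masked
        }
        END { printf "total=%d soft=%d hard=%d\n", total, soft, hard }
    ' "$1"
}
```

A default run should give soft=0 for an all-uppercase input, while a run with -xsmall should give a non-zero soft count and leave the hard count at whatever Ns the input already contained.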

Adding more options:

#quick search (-q), create an HTML report, and generate a repeat annotation file in GFF format
RepeatMasker -e hmmer -q -gff -html -ace -source -a -u -dir outdir -pa 20 input_genome.fna
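The GFF file produced by -gff can be summarized with standard tools. The sketch below sums annotated lengths per sequence from columns 4 and 5 (start and end, 1-based and inclusive, as in the GFF format). Overlapping annotations are counted twice, so treat the result as an upper bound. The function name `repeat_bp_per_seq` is hypothetical, and the output path in the usage note is an assumption about how the file is named.

```shell
# Sum annotated repeat lengths per sequence from a GFF file.
# Columns: 1 = sequence name, 4 = start, 5 = end (1-based, inclusive).
repeat_bp_per_seq() {
    awk -F '\t' '
        /^#/ || NF < 5 { next }            # skip comment lines and malformed lines
        { bp[$1] += $5 - $4 + 1 }
        END { for (s in bp) printf "%s\t%d\n", s, bp[s] }
    ' "$1" | sort
}
```

For the command above this would be called as something like `repeat_bp_per_seq outdir/input_genome.fna.out.gff`, assuming the GFF is written into outdir under the input file name with .out.gff appended.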


#Explicitly specify the repeat library directory (here RepeatMasker was installed with mamba into the virtual environment repeatmasker, so the path is as follows)
RepeatMasker -e hmmer -gff -html -ace -source -a -u -dir outdir \
-libdir $HOME/mambaforge/envs/repeatmasker/share/RepeatMasker/Libraries/ \
-pa 20 input_genome.fna


#Specify a repeat FASTA file from external evidence (drop -e hmmer)
RepeatMasker -gff -html -ace -source -a -u -dir outdir \
-lib repeatmodeler.conseusus.fa.classified \
-pa 20 input_genome.fna
#Note: if both -lib and -libdir are given, only -lib is used as the library.


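#With -noint, only low-complexity/simple repeats are masked. Low-complexity repeats are sometimes masked with N separately from soft masking. Because only low-complexity repeats are searched, the run finishes quickly.
RepeatMasker -noint input_genome_softmasked.fasta
#Caution: lowercase letters revert to uppercase, so run this step first if needed.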

-poly Reports simple repeats that may be polymorphic (in file.poly)
-source Includes for each annotation the HSP "evidence". Currently this option is only available with the "-html" output format listed below.
-html Creates an additional output file in xhtml format.
-ace Creates an additional output file in ACeDB format
-gff Creates an additional Gene Feature Finding format output
-u Creates an additional annotation file not processed by ProcessRepeats
-dir [directory_name] Writes output to this directory (default is query file directory, "-dir ." will write to current directory).
-a Writes alignments in .align output file
-s Slow search; 0-5% more sensitive, 2-3 times slower than default
-q Quick search; 5-10% less sensitive, 2-5 times faster than default
-small Returns complete .masked sequence in lower case
-xsmall Returns repetitive regions in lowercase (rest capitals) rather than masked.
-noint Only masks low complex/simple repeats (no interspersed repeats)
-norna Does not mask small RNA (pseudo) genes
-libdir Path to the RepeatMasker libraries directory.

Citation

RepeatMasker
Developed by Arian Smit and Robert Hubley

http://www.repeatmasker.org

Dfam database

The Dfam database of repetitive DNA families

Robert Hubley; Robert D. Finn; Jody Clements; Sean R. Eddy; Thomas A. Jones; Weidong Bao; Arian F.A. Smit; Travis J. Wheeler
Nucleic Acids Research, Volume 44, Issue D1, 4 January 2016, Pages D81–D89

References