Using the RepeatMasker helper scripts - macでインフォマティクス

Philipp Bayer shared an example of using the RepeatMasker helper scripts to create interactive plots of the interspersed repeats (wiki) in a genome. I will try it here.

TIL: The RepeatMasker helper-scripts https://t.co/JDWpkYAzzo and https://t.co/GgRFOb2bu3 make pretty interactive plots!

(I added CACTA) pic.twitter.com/Sr6zoAsbm5
- Philipp Bayer (@PhilippBayer), March 24, 2021

Installation

RepeatMasker/utility

mamba create -n repeatmasker -y
conda activate repeatmasker
mamba install -c bioconda -y repeatmasker

git clone https://github.com/rmhubley/RepeatMasker.git
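Before moving on, it can help to confirm that the tools installed above are actually reachable. The following is a minimal sketch; the function name `missing_tools` is hypothetical and not part of RepeatMasker. It only checks whether each named command is on PATH.

```shell
# Print the name of every command in the argument list that is not on PATH.
# Prints nothing when all commands are available.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
    return 0
}

# Example (run inside the activated environment):
# missing_tools RepeatMasker perl
```

If the example call prints nothing, both RepeatMasker and perl were found in the current environment.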

> perl RepeatMasker/util/calcDivergenceFromAlign.pl

Error: One or more of the options '-a' or '-s' must be supplied!

RepeatMasker/util/calcDivergenceFromAlign.pl - $Name: $

NAME
    calcDivergenceFromAlign.pl - Recalculate/Summarize the divergences in an
    align file.

SYNOPSIS
    calcDivergenceFromAlign.pl [-version] [-s <summary_file>] [-noCpGMod]
                               [-a <new_align_file>] *.align[.gz]

DESCRIPTION
    A utility script to calculate a new divergence measure on the
    RM alignment files. Currently we only calculate the Kimura 2-Parameter
    divergence metric.

    Treat "CG" dinucleotide sites in the consensus sequence as follows:
    Two transition mutations are counted as a single transition, one
    transition is counted as 1/10 of a standard transition, and
    transversions are counted normally (as the would outside of a CpG
    site). This modification to the Kimura 2 parameter model accounts
    for the extremely high rate of mutations in at a CpG locus.

    The options are:

    -version
        Displays the version of the program

    -noCpGMod
        Do not modify the transition counts at CpG sites.

SEE ALSO

AUTHORS
    Robert Hubley <rhubley@systemsbiology.org>
    Juan Caballero <jcaballero@systemsbiology.org>

Usage

1. Run RepeatMasker. I specified hmmer as the search engine and left the other parameters at their defaults.

RepeatMasker -e hmmer -gff -html -a -dir outdir -pa 8 input_genome.fasta

-e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]
Use an alternate search engine to the default. Note: 'ncbi' and
'rmblast' are both aliases for the rmblastn search engine engine.
The generic NCBI blastn program is not sensitive enough for use with
RepeatMasker at this time.
-pa(rallel) [number]
The number of sequence batch jobs [50kb minimum] to run in parallel.
RepeatMasker will fork off this number of parallel jobs, each
running the search engine specified. For each search engine
invocation ( where applicable ) a fixed the number of cores/threads
is used:

RMBlast 4 cores
nhmmer 2 cores
crossmatch 1 core

To estimate the number of cores a RepeatMasker run will use simply
multiply the -pa value by the number of cores the particular search
engine will use.
-html Creates an additional output file in xhtml format.
-gff Creates an additional Gene Feature Finding format output
-dir Writes output to this directory (default is query file directory, "-dir ." will write to current directory).
-a Writes alignments in .align output file
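The core estimate described in the -pa help text can be sketched as a small calculation. The function names below are hypothetical; the per-engine core counts are the ones listed in the help text above (with 'ncbi' treated as an alias of rmblast, and hmmer mapped to the nhmmer figure). Engines without a listed figure are reported as unknown rather than guessed.

```shell
# Cores used per search-engine invocation, as listed in the -pa help text.
cores_per_engine() {
    case "$1" in
        rmblast|ncbi) echo 4 ;;
        hmmer)        echo 2 ;;
        crossmatch)   echo 1 ;;
        *)            echo "unknown engine: $1" >&2; return 1 ;;
    esac
}

# Estimated total cores = -pa value x cores per engine invocation.
estimate_cores() {
    pa=$1
    engine=$2
    per=$(cores_per_engine "$engine") || return 1
    echo $((pa * per))
}
```

For the command in step 1 (-pa 8 with -e hmmer), `estimate_cores 8 hmmer` gives 16, so -pa may need to be lowered on a machine with fewer cores.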

2. Specify the resulting .align file.

Run the utility script calcDivergenceFromAlign.pl, which computes a divergence measure from the alignment file, to calculate the divergence of the repeats. It uses the Kimura 2-parameter distance (link).

perl RepeatMasker/util/calcDivergenceFromAlign.pl -s out.divsum -a out.align input_genome.fasta.align
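To make the divergence measure concrete, here is a minimal sketch of the standard Kimura 2-parameter formula, K = -1/2 ln((1 - 2p - q) sqrt(1 - 2q)), where p is the proportion of transitions and q the proportion of transversions among aligned sites. The function name `k2p_distance` is hypothetical, and this sketch does not apply the CpG adjustment that the script describes in its help text, so it is not a reimplementation of calcDivergenceFromAlign.pl.

```shell
# Kimura 2-parameter distance from the proportion of transitions (p)
# and transversions (q) among aligned sites:
#   K = -1/2 * ln( (1 - 2p - q) * sqrt(1 - 2q) )
# Prints "undefined" when the logarithm's argument would not be positive.
k2p_distance() {
    LC_ALL=C awk -v p="$1" -v q="$2" 'BEGIN {
        a = 1 - 2 * p - q
        b = 1 - 2 * q
        if (a <= 0 || b <= 0) { print "undefined"; exit 1 }
        printf "%.4f\n", -0.5 * log(a * sqrt(b))
    }'
}

# Example: 10% transitions and 5% transversions.
# k2p_distance 0.10 0.05
```

The distance is always at least as large as the raw mismatch fraction p + q, because it corrects for multiple substitutions at the same site.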

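3. Create the Repeat Landscape graph from the divergence summary data generated by calcDivergenceFromAlign.pl. Use -div to specify the divergence summary file created by the calcDivergenceFromAlign.pl script, and "-g" to set the genome size used in the percentage calculations.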

perl RepeatMasker/util/createRepeatLandscape.pl -div out.divsum -g 120Mb

-g Set the genome size used in percentage calculations
-twoBit <filename> Get the genome size directly from the sequence file (excluding Ns). This option requires that the UCSC utility "twoBitInfo" is in your path.
-j Output javascript only and not a fully constructed HTML page.
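If twoBitInfo is not available, a genome size excluding Ns can also be counted directly from the FASTA file. The sketch below is one way to do that with awk; the function name `genome_size_no_n` is hypothetical, and passing the result to -g assumes that the option accepts a plain number of bases.

```shell
# Count the bases in a FASTA file, ignoring header lines (starting with ">"),
# whitespace, and N/n characters.
genome_size_no_n() {
    LC_ALL=C awk '!/^>/ { seq = $0; gsub(/[Nn \t\r]/, "", seq); total += length(seq) }
                  END   { printf "%.0f\n", total }' "$1"
}

# Example (assumes -g takes a plain base count):
# perl RepeatMasker/util/createRepeatLandscape.pl -div out.divsum \
#     -g "$(genome_size_no_n input_genome.fasta)"
```

Counting from the same FASTA that was masked keeps the percentages in the landscape plot consistent with the sequence that RepeatMasker actually analyzed.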