Philipp Bayer shared an example of using the RepeatMasker helper scripts to create an interactive plot of the repetitive sequences (wiki) interspersed throughout a genome. I will try it out.
TIL: The RepeatMasker helper-scripts https://t.co/JDWpkYAzzo and https://t.co/GgRFOb2bu3 make pretty interactive plots! (I added CACTA) pic.twitter.com/Sr6zoAsbm5
Philipp Bayer (@PhilippBayer), March 24, 2021
Installation
RepeatMasker/utility
mamba create -n repeatmasker -y
conda activate repeatmasker
mamba install -c bioconda -y repeatmasker
git clone https://github.com/rmhubley/RepeatMasker.git
> perl RepeatMasker/util/calcDivergenceFromAlign.pl
Error: One or more of the options '-a' or '-s' must be supplied!
RepeatMasker/util/calcDivergenceFromAlign.pl - $Name: $
NAME
calcDivergenceFromAlign.pl - Recalculate/Summarize the divergences in an
align file.
SYNOPSIS
calcDivergenceFromAlign.pl [-version] [-s <summary_file>] [-noCpGMod]
[-a <new_align_file>] *.align[.gz]
DESCRIPTION
A utility script to calculate a new divergence measure on the
RM alignment files. Currently we only calculate the Kimura 2-Parameter
divergence metric.
Treat "CG" dinucleotide sites in the consensus sequence as follows:
Two transition mutations are counted as a single transition, one
transition is counted as 1/10 of a standard transition, and
transversions are counted normally (as the would outside of a CpG
site). This modification to the Kimura 2 parameter model accounts
for the extremely high rate of mutations in at a CpG locus.
The options are:
-version
Displays the version of the program
-noCpGMod
Do not modify the transition counts at CpG sites.
SEE ALSO
COPYRIGHT
Copyright 2013 Robert Hubley, Institute for Systems Biology
AUTHOR
Robert Hubley <rhubley@systemsbiology.org>
> perl RepeatMasker/util/createRepeatLandscape.pl
RepeatMasker/util/createRepeatLandscape.pl - $Name: $
NAME
createRepeatLandscape.pl - Create a Repeat Landscape graph
SYNOPSIS
createRepeatLandscape.pl [-version] -div *.divsum [-t "graph title"]
[-j] -g # | -twoBit <twoBitFile>
DESCRIPTION
Create a Repeat Landscape graph using the divergence summary data
generated with the calcDivergenceFromAlign.pl script.
EXAMPLES
Older ( pre 4.0.4 ) RepeatMasker dataset:
./calcDivergenceFromAlign.pl -s example.divsum -a example_with_div.align
example.align.gz
This creates an additional file "example_with_div.align" which contains
the added Kimura divergence field after each alignment.
./createRepeatLandscape.pl -div example.divsum >
/home/user/public_html/example.html
On newer RepeatMasker dataset that already contains the Kimura divergence
line following each alignment:
./calcDivergenceFromAlign.pl -s example.divsum example.align.gz
./createRepeatLandscape.pl -div example.divsum >
/home/user/public_html/example.html
The options are:
-version
Displays the version of the program
-div <file>
The divergence summary file created with the
calcDivergenceFromAlign.pl script.
-g #
Set the genome size used in percentage calculations.
-twoBit <filename>
Get the genome size directly from the sequence file ( excluding Ns
). This option requires that the UCSC utility "twoBitInfo" is in
your path.
-j Output javascript only and not a fully constructed HTML page.
SEE ALSO
COPYRIGHT
Copyright 2013-2014 Robert Hubley, Institute for Systems Biology
AUTHORS
Robert Hubley <rhubley@systemsbiology.org>
Juan Caballero <jcaballero@systemsbiology.org>
Usage
1. Run RepeatMasker. I specified hmmer as the search engine and left the other parameters at their defaults.
RepeatMasker -e hmmer -gff -html -a -dir outdir -pa 8 input_genome.fasta
- -e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]
  Use an alternate search engine to the default. Note: 'ncbi' and 'rmblast' are both aliases for the rmblastn search engine engine. The generic NCBI blastn program is not sensitive enough for use with RepeatMasker at this time.
- -pa(rallel) [number]
  The number of sequence batch jobs [50kb minimum] to run in parallel. RepeatMasker will fork off this number of parallel jobs, each running the search engine specified. For each search engine invocation ( where applicable ) a fixed the number of cores/threads is used:
    RMBlast 4 cores
    nhmmer 2 cores
    crossmatch 1 core
  To estimate the number of cores a RepeatMasker run will use simply multiply the -pa value by the number of cores the particular search engine will use.
- -html Creates an additional output file in xhtml format.
- -gff Creates an additional Gene Feature Finding format output
- -dir Writes output to this directory (default is query file directory, "-dir ." will write to current directory).
- -a Writes alignments in .align output file
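The core estimate described for -pa is simple arithmetic, so here is a minimal sketch of it. The per-engine core counts are copied from the help text above; the function name `estimate_cores` is hypothetical, and mapping `-e hmmer` to the "nhmmer 2 cores" entry is my assumption.

```python
# Cores used per search-engine invocation, as listed in the RepeatMasker help text.
# Assumption: "-e hmmer" corresponds to the "nhmmer 2 cores" entry.
CORES_PER_ENGINE = {"rmblast": 4, "hmmer": 2, "crossmatch": 1}


def estimate_cores(pa: int, engine: str) -> int:
    """Multiply the -pa value by the number of cores one engine invocation uses."""
    if pa < 1:
        raise ValueError("-pa must be a positive integer")
    try:
        per_job = CORES_PER_ENGINE[engine.lower()]
    except KeyError:
        raise ValueError(f"no core count listed for engine {engine!r}") from None
    return pa * per_job


if __name__ == "__main__":
    # The command used in this post: -e hmmer -pa 8
    print(estimate_cores(8, "hmmer"))
```

With the settings used here (-e hmmer -pa 8), the estimate is 8 x 2 = 16 cores.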
2. Specify the resulting .align file.
Run the utility script calcDivergenceFromAlign.pl, which calculates a divergence measure from the alignment file, to compute the divergence of the repeats. It uses the Kimura 2-parameter distance (link).
perl RepeatMasker/util/calcDivergenceFromAlign.pl -s out.divsum -a out.align input_genome.fasta.align
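For reference, the standard Kimura 2-parameter distance can be written in a few lines. This is only a sketch of the textbook formula, K = -1/2 ln((1 - 2p - q) * sqrt(1 - 2q)), where p and q are the proportions of transitions and transversions. It is not the script's own code, and it does not model the CpG adjustment described in the help text. The function name `kimura_2p` and its arguments are hypothetical.

```python
import math


def kimura_2p(transitions: int, transversions: int, aligned_sites: int) -> float:
    """Textbook Kimura 2-parameter distance (no CpG adjustment).

    p = proportion of sites with a transition
    q = proportion of sites with a transversion
    K = -1/2 * ln((1 - 2p - q) * sqrt(1 - 2q))
    """
    if aligned_sites <= 0:
        raise ValueError("aligned_sites must be positive")
    p = transitions / aligned_sites
    q = transversions / aligned_sites
    a = 1.0 - 2.0 * p - q
    b = 1.0 - 2.0 * q
    if a <= 0.0 or b <= 0.0:
        # The logarithm is undefined when the sequences are too divergent.
        raise ValueError("sequences too divergent for the K2P formula")
    return -0.5 * math.log(a * math.sqrt(b))


if __name__ == "__main__":
    # 10 transitions and 5 transversions over 100 aligned sites
    print(round(kimura_2p(10, 5, 100), 4))
```

The corrected distance is always at least as large as the raw proportion of differing sites (p + q), because it accounts for multiple substitutions at the same site.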
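3. Create the Repeat Landscape graph from the divergence summary data generated by calcDivergenceFromAlign.pl. Use -div to specify the divergence summary file created by the calcDivergenceFromAlign.pl script, and "-g" to set the genome size used in the percentage calculations. In the EXAMPLES section of the help, the output is redirected to an .html file.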
perl RepeatMasker/util/createRepeatLandscape.pl -div out.divsum -g 120Mb
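The genome size is the denominator of the percentages shown in the plot. As a rough illustration, the sketch below counts the non-N bases in FASTA text, similar in spirit to what the help describes for -twoBit, and then expresses a base count as a percentage of that size. The function names are hypothetical, and I have not checked that this count matches what twoBitInfo reports (treating lowercase n as N is my assumption).

```python
from typing import Iterable


def count_non_n_bases(fasta_lines: Iterable[str]) -> int:
    """Count sequence characters other than N/n, skipping FASTA header lines."""
    total = 0
    for line in fasta_lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and headers
        total += len(line) - line.upper().count("N")
    return total


def percent_of_genome(bases: int, genome_size: int) -> float:
    """Express a number of bases as a percentage of the genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return 100.0 * bases / genome_size


if __name__ == "__main__":
    # Toy input standing in for a real FASTA file
    example = [">chr1", "ACGTNNNNACGT", ">chr2", "acgtn"]
    size = count_non_n_bases(example)
    print(size)
    print(percent_of_genome(3, size))
```

For a real genome, an open file handle can be passed instead of the list, for example `with open("input_genome.fasta") as fh: size = count_non_n_bases(fh)`.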
- -g Set the genome size used in percentage calculations
- -twoBit <filename> Get the genome size directly from the sequence file ( excluding Ns ). This option requires that the UCSC utility "twoBitInfo" is in your path.
- -j Output javascript only and not a fully constructed HTML page.
Alternatively, use "-twoBit twoBitfile" to get the genome size from the sequence file.
Example output (using the Arabidopsis thaliana genome)
Citation
RepeatMasker
Developed by Arian Smit and Robert Hubley
Related