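Informatics on the Mac (macでインフォマティクス)

A collection of notes on informatics for HTS (NGS) data.

Using the RepeatMasker helper scripts

Philipp Bayer showed an example that uses RepeatMasker's helper scripts to build an interactive plot of the interspersed repeats (wiki) in a genome. I tried it out.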

Installation

 RepeatMasker/utility

mamba create -n repeatmasker -y
conda activate repeatmasker
mamba install -c bioconda -y repeatmasker

git clone https://github.com/rmhubley/RepeatMasker.git
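Before running anything, it can help to confirm that the clone contains the two helper scripts used in this post. The following is a minimal sketch; the function name check_rm_utils is made up for illustration, and it assumes the repository was cloned into ./RepeatMasker as in the command above.

```shell
# Check that both helper scripts exist under a cloned RepeatMasker directory.
# check_rm_utils is a hypothetical helper, not part of RepeatMasker.
check_rm_utils() {
  repo="${1:-RepeatMasker}"
  rc=0
  for script in calcDivergenceFromAlign.pl createRepeatLandscape.pl; do
    if [ -f "$repo/util/$script" ]; then
      echo "found:   $repo/util/$script"
    else
      echo "missing: $repo/util/$script" >&2
      rc=1
    fi
  done
  return "$rc"
}

# Report a problem if either script is absent from the clone.
check_rm_utils RepeatMasker || echo "clone may be incomplete" >&2
```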

perl RepeatMasker/util/calcDivergenceFromAlign.pl

Error: One or more of the options '-a' or '-s' must be supplied!

RepeatMasker/util/calcDivergenceFromAlign.pl - $Name: $

NAME
    calcDivergenceFromAlign.pl - Recalculate/Summarize the divergences in an
    align file.

SYNOPSIS
    calcDivergenceFromAlign.pl [-version] [-s <summary_file>] [-noCpGMod]
                               [-a <new_align_file>] *.align[.gz]

DESCRIPTION
    A utility script to calculate a new divergence measure on the
    RM alignment files. Currently we only calculate the Kimura 2-Parameter
    divergence metric.

    Treat "CG" dinucleotide sites in the consensus sequence as follows:
    Two transition mutations are counted as a single transition, one
    transition is counted as 1/10 of a standard transition, and
    transversions are counted normally (as the would outside of a CpG
    site). This modification to the Kimura 2 parameter model accounts
    for the extremely high rate of mutations in at a CpG locus.

    The options are:

    -version
        Displays the version of the program

    -noCpGMod
        Do not modify the transition counts at CpG sites.

SEE ALSO
COPYRIGHT
    Copyright 2013 Robert Hubley, Institute for Systems Biology

AUTHOR
    Robert Hubley <rhubley@systemsbiology.org>

perl RepeatMasker/util/createRepeatLandscape.pl

RepeatMasker/util/createRepeatLandscape.pl - $Name: $

NAME
    createRepeatLandscape.pl - Create a Repeat Landscape graph

SYNOPSIS
    createRepeatLandscape.pl [-version] -div *.divsum [-t "graph title"]
                             [-j] -g # | -twoBit <twoBitFile>

DESCRIPTION
    Create a Repeat Landscape graph using the divergence summary data
    generated with the calcDivergenceFromAlign.pl script.

EXAMPLES
    Older ( pre 4.0.4 ) RepeatMasker dataset:

      ./calcDivergenceFromAlign.pl -s example.divsum -a example_with_div.align
                                   example.align.gz

    This creates an additional file "example_with_div.align" which contains
    the added Kimura divergence field after each alignment.

      ./createRepeatLandscape.pl -div example.divsum >
                                 /home/user/public_html/example.html

    On newer RepeatMasker dataset that already contains the Kimura divergence
    line following each alignment:

      ./calcDivergenceFromAlign.pl -s example.divsum example.align.gz

      ./createRepeatLandscape.pl -div example.divsum >
                                 /home/user/public_html/example.html

    The options are:

    -version
        Displays the version of the program

    -div <file>
        The divergence summary file created with the
        calcDivergenceFromAlign.pl script.

    -g #
        Set the genome size used in percentage calculations.

    -twoBit <filename>
        Get the genome size directly from the sequence file ( excluding Ns
        ). This option requires that the UCSC utility "twoBitInfo" is in
        your path.

    -j  Output javascript only and not a fully constructed HTML page.

SEE ALSO
COPYRIGHT
    Copyright 2013-2014 Robert Hubley, Institute for Systems Biology

AUTHORS
    Robert Hubley <rhubley@systemsbiology.org>

    Juan Caballero <jcaballero@systemsbiology.org>

 

Usage

1. Run RepeatMasker. I selected hmmer as the search engine and left the other parameters at their defaults.

RepeatMasker -e hmmer -gff -html -a -dir outdir -pa 8 input_genome.fasta
  • -e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]
            Use an alternate search engine to the default. Note: 'ncbi' and
            'rmblast' are both aliases for the rmblastn search engine engine.
            The generic NCBI blastn program is not sensitive enough for use with
            RepeatMasker at this time.
  • -pa(rallel) [number]
            The number of sequence batch jobs [50kb minimum] to run in parallel.
            RepeatMasker will fork off this number of parallel jobs, each
            running the search engine specified. For each search engine
            invocation ( where applicable ) a fixed the number of cores/threads
            is used:

              RMBlast     4 cores
              nhmmer      2 cores
              crossmatch  1 core

            To estimate the number of cores a RepeatMasker run will use simply
            multiply the -pa value by the number of cores the particular search
            engine will use.

  • -html   Creates an additional output file in xhtml format.

  • -gff   Creates an additional Gene Feature Finding format output

  • -dir   Writes output to this directory (default is query file directory, "-dir ." will write to current directory).

  • -a   Writes alignments in .align output file
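The core estimate described in the -pa help text is a simple multiplication. Below is a minimal shell sketch of it; the function name estimate_cores is made up for illustration, and the per-engine core counts are copied from the help text quoted above (engines without a listed count are rejected rather than guessed).

```shell
# Estimate total cores used by a RepeatMasker run: -pa value x cores per engine.
# estimate_cores is a hypothetical helper, not part of RepeatMasker.
estimate_cores() {
  pa="$1"
  engine="$2"
  case "$engine" in
    rmblast|ncbi) per_job=4 ;;   # RMBlast: 4 cores ('ncbi' is an alias per the help text)
    hmmer)        per_job=2 ;;   # nhmmer: 2 cores
    crossmatch)   per_job=1 ;;   # crossmatch: 1 core
    *) echo "no core count listed for engine: $engine" >&2; return 1 ;;
  esac
  echo $(( pa * per_job ))
}

# The run in step 1 used -e hmmer -pa 8.
estimate_cores 8 hmmer   # prints 16
```

So the command in step 1 should be expected to occupy roughly 16 cores, not 8.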

 

 

2. Specify the resulting .align file.

Run the utility script calcDivergenceFromAlign.pl, which calculates a divergence measure from the alignment file, to compute the divergence of the repeats. It uses the Kimura 2-parameter distance (link).

perl RepeatMasker/util/calcDivergenceFromAlign.pl -s out.divsum -a out.align \
input_genome.fasta.align
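For reference, the standard Kimura 2-parameter distance can be written as follows, where p is the proportion of sites showing a transition and q the proportion showing a transversion. This is the textbook form, not something read from the script's source; according to the help text, the script additionally down-weights transitions at CpG sites unless -noCpGMod is given. The values in the worked example are assumed purely for illustration.

```latex
K = -\frac{1}{2}\,\ln\!\left[(1 - 2p - q)\,\sqrt{1 - 2q}\right]

% Worked example with assumed values p = 0.10, q = 0.05:
% K = -\frac{1}{2}\,\ln\!\left(0.75 \times \sqrt{0.90}\right) \approx 0.170
```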

 

3. Create the Repeat Landscape graph from the divergence summary data produced by calcDivergenceFromAlign.pl. Pass the divergence summary file with -div, and set the genome size used in the percentage calculations with "-g".

perl RepeatMasker/util/createRepeatLandscape.pl -div out.divsum -g 120Mb
  • -g   Set the genome size used in percentage calculations

  • -twoBit <filename>   Get the genome size directly from the sequence file ( excluding Ns
    ). This option requires that the UCSC utility "twoBitInfo" is in
    your path.

  • -j   Output javascript only and not a fully constructed HTML page.
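If the genome size is not known in advance, it can be counted from the FASTA file. The sketch below rests on two assumptions: that -g expects a plain base-pair count (the help text lists it only as "-g #", and whether a suffixed value is parsed is not confirmed here), and that skipping N/n characters is a reasonable match for the "excluding Ns" behaviour described for -twoBit. The function name genome_size_no_n and the output file name landscape.html are made up for illustration; redirecting the output to an HTML file follows the EXAMPLES in the help text.

```shell
# Count bases in a FASTA file, skipping header lines and N/n characters.
# genome_size_no_n is a hypothetical helper, not part of RepeatMasker.
genome_size_no_n() {
  grep -v '^>' "$1" | tr -d 'Nn\r\n' | wc -c | tr -d ' '
}

# Hypothetical usage (landscape.html is an assumed output name):
# size="$(genome_size_no_n input_genome.fasta)"
# perl RepeatMasker/util/createRepeatLandscape.pl -div out.divsum -g "$size" > landscape.html
```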


"-twoBit twoBitfile" reads the genome size directly from the sequence file instead.

 

Example output (using the Arabidopsis thaliana genome)

[Figure: Repeat Landscape plot]

 

Citation

RepeatMasker
Developed by Arian Smit and Robert Hubley

http://www.repeatmasker.org

 

Related