Finding repeats such as transposons de novo with RepeatScout

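RepeatScout is a tool for finding repeats, such as transposons, in a genome. Once it finds a repeat seed, it extends that seed until conservation is lost, a strategy that is said to make it possible to detect long, somewhat divergent repeats that are otherwise hard to find (tandem repeats and low-complexity repeats are not targets of this method). On the data prepared by the authors, it detected more than twice as many repeats as the competing tool RECON in less than one tenth of the time. It works together with RepeatMasker: the repeat library it produces can be passed to RepeatMasker, and its filter script reads RepeatMasker output.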

Installation

Dependencies

Tandem Repeats Finder

https://tandem.bu.edu/trf/trf409.macosx.download.html

RepeatMasker

http://www.repeatmasker.org/RMDownload.html

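For Tandem Repeats Finder, download the binary and rename it to trf. Download RepeatMasker from the link above, unpack it, run "perl ./configure", and follow the prompts; that is all the installation takes. For the perl path requested along the way, use "/usr/bin/perl" on OS X, and check the paths of the other tools with which. For the final search-engine question, select one or more according to your needs.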
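As a small convenience for the configure prompts, the path lookup can be sketched as a shell function. The function name report_tools is my own, and the tool names passed to it are only examples; pass whichever executables your setup actually uses.

```shell
#!/bin/sh
# Print the resolved path of each tool, or NOT FOUND if it is not on PATH.
# report_tools is a hypothetical helper name, not part of RepeatMasker or RepeatScout.
report_tools() {
    for tool in "$@"; do
        if path=$(command -v "$tool" 2>/dev/null); then
            printf '%s\t%s\n' "$tool" "$path"
        else
            printf '%s\tNOT FOUND\n' "$tool"
        fi
    done
}

# Example call (tool names are assumptions; adjust to your search engine):
#   report_tools perl trf
```

Anything reported as NOT FOUND needs to be installed or added to PATH before answering the corresponding configure question.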

GitHub

https://github.com/mmcco/RepeatScout

It can be installed with brew, but the accompanying scripts are not installed that way, so build it yourself.

git clone https://github.com/mmcco/RepeatScout.git
cd RepeatScout
make

Add the whole folder to your PATH. Also copy Tandem Repeats Finder, renamed to trf, into this folder.
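A minimal sketch of that PATH setup follows. The directory $HOME/RepeatScout and the location ./trf are assumptions for illustration; substitute wherever you cloned the repository and saved the renamed TRF binary.

```shell
#!/bin/sh
# Hypothetical location of the cloned and built RepeatScout folder.
RS_DIR="$HOME/RepeatScout"

# Prepend a directory to PATH unless it is already listed.
prepend_path() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, nothing to do
        *) PATH="$1:$PATH" ;;
    esac
    export PATH
}

prepend_path "$RS_DIR"

# Copy the renamed TRF binary into the RepeatScout folder if it is in the
# current directory (assumed location), and make sure it is executable.
if [ -f ./trf ] && [ -d "$RS_DIR" ]; then
    cp ./trf "$RS_DIR/" && chmod +x "$RS_DIR/trf"
fi
```

To make the change permanent, the prepend_path call would go into your shell start-up file.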

> ./RepeatScout

RepeatScout Version 1.0.5

Usage:

RepeatScout -sequence <seq> -output <out> -freq <freq> -l <l> [opts]

-L # size of region to extend left or right (10000)

-match # reward for a match (+1)

-mismatch # penalty for a mismatch (-1)

-gap # penalty for a gap (-5)

-maxgap # maximum number of gaps allowed (5)

-maxoccurrences # cap on the number of sequences to align (10,000)

-maxrepeats # stop work after reporting this number of repeats (10000)

-cappenalty # cap on penalty for exiting alignment of a sequence (-20)

-tandemdist # of bases that must intervene between two l-mers for both to be counted (500)

-minthresh # stop if fewer than this number of l-mers are found in the seeding phase (3)

-minimprovement # amount that a the alignment needs to improve each step to be considered progress (3)

-stopafter # stop the alignment after this number of no-progress columns (100)

-goodlength # minimum required length for a sequence to be reported (50)

-maxentropy # entropy (complexity) threshold for an l-mer to be considered (-.7)

-v[v[v[v]]] How verbose do you want it to be? -vvvv is super-verbose.

——

> ./build_lmer_table

build_lmer_table Version 1.0.5

Usage:

build_lmer_table -l <l> -sequence <seq> -freq <output> [opts]

-tandem <d> --- tandem distance window (def: 500)

-min <#> --- smallest number of required lmers (def: 3)

-v --- output a small amount of debugging information.

> ./filter-stage-1.prl -h

FILTER-STAGE-1.PRL(1) User Contributed Perl Documentation FILTER-STAGE-1.PRL(1)

NAME

filter-stage-1.prl -- a first stage post-processing tool for

RepeatScout output.

SYNOPSIS

cat repeats.fa | filter-stage-1.prl > repeats-filtered.prl

OPTIONS

none other than "-h" (the output of which you're reading), but you will

either want trf and nseg in your PATH, or you will want to set the

environment variables TRF_COMMAND and NSEG_COMMAND to provide the

executable.

DESCRIPTION

This tool takes a repeat library, which is a Fasta-formatted sequence

file, and filters out any sequence that is deemed to be more than 50%

low-complexity by either TRF or NSEG or both. Note that one algorithm

needs to make the determination; we don't check the total number of

unique bases masked out by TRF and NSEG individually.

ENVIRONMENT VARIABLES

In order for this program to find TRF and NSEG, you need to either

place said programs in your PATH, or you need to add the environment

variables TRF_COMMAND and NSEG_COMMAND. The value of those variables

should be the path at which the respective program can be found.

perl v5.18.2 2018-04-19 FILTER-STAGE-1.PRL(1)
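The help text above names two environment variables, TRF_COMMAND and NSEG_COMMAND. One way to set them is sketched below: use the tool found on PATH if there is one, otherwise fall back to a fixed path. The helper name export_tool and the fallback paths are assumptions of mine.

```shell
#!/bin/sh
# Export VAR so that it points at an executable: prefer the one on PATH,
# otherwise use the given fallback path. export_tool is a hypothetical helper.
export_tool() {
    var=$1
    tool=$2
    fallback=$3
    path=$(command -v "$tool" 2>/dev/null) || path=$fallback
    export "$var=$path"
}

# Fallback locations are illustrative only.
export_tool TRF_COMMAND trf "$HOME/RepeatScout/trf"
export_tool NSEG_COMMAND nseg "$HOME/RepeatScout/nseg"
```

With both variables exported in the same shell session, filter-stage-1.prl should be able to locate the two programs even when they are not on PATH.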

Repeat libraries

http://bix.ucsd.edu/repeatscout/

How to run

The run is carried out in several stages.

1. Build the l-mer table: count the l-mers (here l = 14) in the input sequence and write their frequencies to a table file.

build_lmer_table -l 14 -sequence input.fasta -freq output.freq
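The value 14 above is not explained. As far as I understand the RepeatScout paper and README, the suggested l-mer size is l = ceil(log4(L) + 1), where L is the length of the input sequence; treat that rule as an assumption and check it against your copy of the documentation. The sketch below computes it with integer arithmetic to avoid floating-point rounding. The function names fasta_length and suggest_l are my own.

```shell
#!/bin/sh
# Total number of bases in a FASTA file (header lines are skipped).
fasta_length() {
    awk '!/^>/ { n += length($0) } END { printf "%.0f\n", n }' "$1"
}

# Smallest integer l with l >= log4(L) + 1 (assumed rule, see text above).
# k ends up as the smallest exponent with 4^k >= L.
suggest_l() {
    len=$1
    k=0
    p=1
    while [ "$p" -lt "$len" ]; do
        p=$((p * 4))
        k=$((k + 1))
    done
    echo $((k + 1))
}

# Example call:
#   suggest_l "$(fasta_length input.fasta)"
```

Under this rule a sequence of 4^13 = 67,108,864 bases gives l = 14, and a 100 Mb sequence gives l = 15.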

2. Run RepeatScout with that table file to produce a FASTA file of repeat sequences. Use the same -l value as in step 1.

RepeatScout -sequence input.fasta -output output_repeats.fasta -freq output.freq -l 14

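3. Remove simple repeats and other low-complexity sequences. As the help text above states, filter-stage-1.prl drops any sequence judged to be more than 50% low-complexity by TRF or NSEG. The copy-number filter (default threshold 10) is applied later, by filter-stage-2.prl.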

cat output_repeats.fasta | filter-stage-1.prl > repeats_filtered_stg1.fasta
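To see how much the filter removed, it helps to count the FASTA records before and after. A minimal sketch, with count_seqs being a helper name of my own:

```shell
#!/bin/sh
# Print each file name with the number of FASTA records it contains
# (records are counted as lines starting with ">").
count_seqs() {
    for f in "$@"; do
        printf '%s\t%s\n' "$f" "$(grep -c '^>' "$f")"
    done
}

# Example call, using the file names from the steps above:
#   count_seqs output_repeats.fasta repeats_filtered_stg1.fasta
```

If the second count is zero, the most likely cause is that trf or nseg could not be run rather than that every sequence was low-complexity.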

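4. Run RepeatMasker on the genome with the stage-1 library. Wait for it to finish, because step 5 reads its output (input.fasta.out).

RepeatMasker -pa 20 -s -lib repeats_filtered_stg1.fasta input.fasta

5. Using the RepeatMasker output from step 4, remove repeats that did not occur the required number of times (here --thresh=3).

cat repeats_filtered_stg1.fasta | filter-stage-2.prl --cat=input.fasta.out --thresh=3 > repeats_filtered_stg2.fasta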
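To get a feel for what a copy-number threshold does before choosing one, the hits per library sequence can be tallied directly from the RepeatMasker .out file. This is an illustration of the idea only, not the implementation of filter-stage-2.prl. It assumes the standard .out layout, in which data rows begin with an integer score and the matching repeat name is the tenth column; count_hits and frequent_repeats are names of my own.

```shell
#!/bin/sh
# Tally hits per repeat name in a RepeatMasker .out file.
# Header lines are skipped because their first field is not an integer.
count_hits() {
    awk '$1 ~ /^[0-9]+$/ { n[$10]++ }
         END { for (r in n) printf "%s\t%d\n", r, n[r] }' "$1" | sort
}

# List the repeat names seen at least THRESH times (second argument, default 10).
frequent_repeats() {
    count_hits "$1" | awk -F '\t' -v t="${2:-10}" '$2 >= t { print $1 }'
}

# Example calls:
#   count_hits input.fasta.out
#   frequent_repeats input.fasta.out 3
```

Comparing the output of frequent_repeats at a few thresholds shows how many families each setting would keep.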

6. Run RepeatMasker once more with the step 5 library to output the final repeat annotation.

RepeatMasker -pa 20 -s -lib repeats_filtered_stg2.fasta input.fasta
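Because the six steps pass files to one another, a mismatched file name is an easy mistake. The sketch below only prints the commands for a given genome so that the names stay consistent; it does not run anything. The function name print_pipeline, the derived file names, and the default of 4 threads are my own choices. It assumes RepeatMasker writes its .out file next to the input FASTA and that file names contain no spaces.

```shell
#!/bin/sh
# Print the six pipeline commands for one genome FASTA.
# Usage: print_pipeline GENOME.fasta L [THREADS] [THRESH]
# THREADS defaults to 4 (arbitrary); THRESH defaults to 10 (assumed to match
# the filter-stage-2.prl default).
print_pipeline() {
    genome=$1
    l=$2
    threads=${3:-4}
    thresh=${4:-10}
    base=${genome%.*}      # file name without its last extension
    cat <<EOF
build_lmer_table -l $l -sequence $genome -freq $base.freq
RepeatScout -sequence $genome -output $base.repeats.fasta -freq $base.freq -l $l
cat $base.repeats.fasta | filter-stage-1.prl > $base.stg1.fasta
RepeatMasker -pa $threads -s -lib $base.stg1.fasta $genome
cat $base.stg1.fasta | filter-stage-2.prl --cat=$genome.out --thresh=$thresh > $base.stg2.fasta
RepeatMasker -pa $threads -s -lib $base.stg2.fasta $genome
EOF
}

# Example call, reproducing the settings used above:
#   print_pipeline input.fasta 14 20 3 > run_repeatscout.sh
```

The printed script can be reviewed and then run with sh -e so that it stops at the first failing step. Note that the second RepeatMasker run overwrites the .out file from the first.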

Citation

De novo identification of repeat families in large genomes.

Price AL, Jones NC, Pevzner PA.

Bioinformatics. 2005 Jun;21 Suppl 1:i351-8.

SEQanswers

http://seqanswers.com/forums/showthread.php?t=5448
