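I accidentally posted about installing Noise Cancelling Repeat Finder twice. My apologies.
Tandem DNA repeats can be sequenced with long-read technologies, but they cannot be decoded accurately because no computational tools account for the high error rates of these technologies. Here we introduce Noise-Cancelling Repeat Finder (NCRF), which uncovers putative tandem repeats of a specified motif in the noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. NCRF was validated on simulated data, where it identified tandem repeats with motifs of various lengths and outperformed two alternative tools. On real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat, which is involved in the heat shock stress response. NCRF is implemented in C, supported by several python scripts, and available from bioconda and https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.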
Installation
Tested on macOS 10.14 in an anaconda3.7 environment, using a python2.7 virtual environment created for this purpose.
Build dependencies
- gcc or similar C compiler and linker
- python (tested with version 2.7, not likely to work with python 3)
https://github.com/makovalab-psu/NoiseCancellingRepeatFinder
#bioconda
conda create -n ncrf -y python=2.7
conda activate ncrf
conda install -c bioconda -y ncrf
#from source
git clone --branch v1.01.00 https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.git
cd NoiseCancellingRepeatFinder/
make
./make_symbolic_links.sh
> NCRF -h
NCRF-- Noise Cancelling Repeat Finder, to find tandem repeats in noisy reads
(version 1.01.02 20190429)
usage: cat <fasta> | NCRF [options]
  <fasta>               fasta file containing sequences; read from stdin
  [<name>:]<motif>      dna repeat motif to search for
                        (there can be more than one motif)
  --minmratio=<ratio>   discard alignments with a low frequency of matches;
                        ratio can be between 0 and 1 (e.g. "0.85"), or can be
                        expressed as a percentage (e.g. "85%")
  --maxnoise=<ratio>    (same as --minmratio but with 1-ratio)
  --minlength=<bp>      discard alignments that don't have long enough repeat
                        (default is 500)
  --minscore=<score>    discard alignments that don't score high enough
                        (default is zero)
  --stats=events        show match/mismatch/insert/delete counts
  --positionalevents    show match/mismatch/insert/delete counts by motif
                        position (independent of --stats=events); this may be
                        useful for detecting error non-uniformity, to separate
                        perfect repeats from imperfect
  --help=scoring        show options relating to alignment scoring
  --help=allocation     show options relating to memory allocation
  --help=other          show other, less frequently used options
The output is usually passed through a series of the ncrf_* post-processing
scripts.
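The help text above says --maxnoise is the same as --minmratio but with 1-ratio, and that a ratio can be written either as a fraction or as a percentage. A minimal sketch of that arithmetic follows; `ratio_to_maxnoise` is a hypothetical helper written for this note, not part of NCRF.

```shell
#!/bin/sh
# Hypothetical helper (not part of NCRF): take a --minmratio value written
# either as a fraction ("0.85") or as a percentage ("85%") and print the
# equivalent --maxnoise value, i.e. 1 - ratio.
ratio_to_maxnoise() {
    awk -v r="$1" 'BEGIN {
        if (r ~ /%$/) { sub(/%$/, "", r); r = r / 100 }   # "85%" -> 0.85
        printf "%.2f\n", 1 - r
    }'
}

ratio_to_maxnoise 0.85
ratio_to_maxnoise 85%
```

Both calls print 0.15, so under this reading --minmratio=85% and --maxnoise=15% would ask for the same filter.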
> ncrf_cat.py
you have to give me at least one file
usage: ncrf_cat <file1> [<file2> ...] [--markend]
  <file1>    an output file from Noise Cancelling Repeat Finder
  <file2>    another output file from Noise Cancelling Repeat Finder
  --markend  assume end-of-file markers are absent in the input, and add an
             end-of-file marker to the output
             (by default we require inputs to have proper end-of-file markers)
Concatenate several output files from Noise Cancelling Repeat Finder. This
is little more than copying the files and adding a blank line between the
files.
It can also be used to verify that the input files contain end-of-file markers
i.e. that they were not truncated when created.
> ncrf_summary.py -h
unrecognized option: -h
usage: ncrf_cat <output_from_NCRF> | ncrf_summary [options]
  --minmratio=<ratio>  discard alignments with a low frequency of matches;
                       ratio can be between 0 and 1 (e.g. "0.85"), or can be
                       expressed as a percentage (e.g. "85%")
  --maxnoise=<ratio>   (same as --minmratio but with 1-ratio)
Typical output:
#line motif seq        start end  strand seqLen querybp mRatio m    mm  i  d
1     GGAAT FAB41174_6 1568  3021 -      3352   1461    82.6%  1242 169 42 50
11    GGAAT FAB41174_2 3908  5077 -      7347   1189    82.4%  1009 125 35 55
21    GGAAT FAB41174_0 2312  3334 -      4223   1060    81.1%  881  115 26 64
...
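To see how the columns in the typical output relate to each other, the sketch below recomputes the match ratio from the event counts. It assumes mRatio = m / (m + mm + i + d). That formula is an observation that fits the three sample rows, not a definition taken from the NCRF documentation. `check_summary` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical check (not part of NCRF): recompute the match ratio from the
# m, mm, i and d columns and print it next to the reported mRatio.
# Columns: line motif seq start end strand seqLen querybp mRatio m mm i d
check_summary() {
    awk '!/^#/ {
        m = $10; mm = $11; ins = $12; del = $13
        ratio = 100 * m / (m + mm + ins + del)   # assumed formula, see above
        printf "%s %s reported=%s recomputed=%.1f%%\n", $3, $2, $9, ratio
    }'
}

# The three rows are copied from the sample ncrf_summary output.
check_summary <<'EOF'
#line motif seq start end strand seqLen querybp mRatio m mm i d
1 GGAAT FAB41174_6 1568 3021 - 3352 1461 82.6% 1242 169 42 50
11 GGAAT FAB41174_2 3908 5077 - 7347 1189 82.4% 1009 125 35 55
21 GGAAT FAB41174_0 2312 3334 - 4223 1060 81.1% 881 115 26 64
EOF
```

For these rows the recomputed values come out as 82.6%, 82.4% and 81.1%, matching the reported column.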
> ncrf_to_bed.py -h
unrecognized option: -h
usage: ncrf_cat <output_from_NCRF> | ncrf_to_bed [options]
  --minmratio=<ratio>  discard alignments with a low frequency of matches;
                       ratio can be between 0 and 1 (e.g. "0.85"), or can be
                       expressed as a percentage (e.g. "85%")
  --maxnoise=<ratio>   (same as --minmratio but with 1-ratio)
Typical output is shown below. The 6th column ("score" in the bed spec) is
the match ratio times 1000 (e.g. 826 is 82.6%).
FAB41174_065680 1568 3021 . - 826
FAB41174_029197 3908 5077 . - 824
FAB41174_005950 2312 3334 . - 811
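The "match ratio times 1000" rule can be illustrated by turning ncrf_summary-style rows into lines with the same layout as the sample above. This is only a sketch of the arithmetic, not how ncrf_to_bed.py is implemented. `summary_to_bed` is a hypothetical name, and skipping rows whose mRatio is NA is a choice made for this sketch.

```shell
#!/bin/sh
# Hypothetical converter (not ncrf_to_bed.py itself): read ncrf_summary-style
# rows and print "seq start end . strand score", where score is the match
# ratio times 1000 (82.6% -> 826).
# Columns in: line motif seq start end strand seqLen querybp mRatio m mm i d
summary_to_bed() {
    awk '
        /^#/        { next }             # skip the header line
        $9 == "NA"  { next }             # no event stats; skipped in this sketch
        {
            ratio = $9; sub(/%$/, "", ratio)      # "82.6%" -> 82.6
            score = int(ratio * 10 + 0.5)         # percent * 10 = ratio * 1000
            print $3, $4, $5, ".", $6, score
        }'
}

summary_to_bed <<'EOF'
#line motif seq start end strand seqLen querybp mRatio m mm i d
1 GGAAT FAB41174_6 1568 3021 - 3352 1461 82.6% 1242 169 42 50
11 GGAAT FAB41174_2 3908 5077 - 7347 1189 82.4% 1009 125 35 55
21 GGAAT FAB41174_0 2312 3334 - 4223 1060 81.1% 881 115 26 64
EOF
```

The scores printed for these three rows are 826, 824 and 811, the same values as in the sample bed lines.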
Test run
1. Search for repeats.
cd NoiseCancellingRepeatFinder/
cat example.fa | NCRF GGCAT > example.ncrf
> cat example.ncrf
2. Summary report
ncrf_cat.py example.ncrf | ncrf_summary.py
#line motif seq start end strand seqLen querybp mRatio m mm i d
WARNING: input alignments did not contain an event stats line
(NCRF --stats=events would create that line)
1 GGCAT NCRF_EXAMPLE 3495 4737 + 50000 1242 NA NA NA NA NA
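The NA columns come from the warning in the output: the alignments have no event stats line. Following the hint in that warning, the search can be rerun with --stats=events, which the NCRF help lists as showing match/mismatch/insert/delete counts. The sketch below assumes NCRF and the ncrf_*.py scripts are on PATH and example.fa is in the current directory. The function name and the output file name example.stats.ncrf are made up for this note.

```shell
#!/bin/sh
# Rerun the example search with --stats=events so that the summary has event
# counts to work with. run_with_event_stats and example.stats.ncrf are
# hypothetical names; the flag itself is taken from the NCRF help output.
run_with_event_stats() {
    if ! command -v NCRF >/dev/null 2>&1 || [ ! -f example.fa ]; then
        echo "NCRF or example.fa not found; skipping" >&2
        return 0
    fi
    cat example.fa | NCRF GGCAT --stats=events > example.stats.ncrf
    ncrf_cat.py example.stats.ncrf | ncrf_summary.py
}

run_with_event_stats
```

The output of this run is not reproduced here because it was not part of the original test.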
Reference
Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data
Harris RS, Cechova M, Makova KD
Bioinformatics. 2019 Nov 1;35(22):4809-4811