Noise Cancelling Repeat Finder: finding repeats in noisy long reads

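I mistakenly posted about installing Noise Cancelling Repeat Finder twice. My apologies.

Tandem DNA repeats can be sequenced with long-read technologies, but they cannot be deciphered accurately because computational tools that account for the high error rates of these technologies are lacking. Here the authors introduce Noise-Cancelling Repeat Finder (NCRF), which uncovers putative tandem repeats of a specified motif in noisy long reads produced by Pacific Biosciences and Oxford Nanopore sequencers. They validated NCRF on simulated data, identifying tandem repeats with motifs of various lengths, and showed that it outperforms two alternative tools. Using real human whole-genome sequencing data, NCRF identified long arrays of the (AATGG)n repeat, which is involved in the heat shock stress response. NCRF is implemented in C, supported by several python scripts, and is available from bioconda and at https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.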

Installation

Tested on macOS 10.14 in an anaconda3.7 environment, inside a python 2.7 virtual environment created for this purpose.

Build dependencies

gcc or similar C compiler and linker
python (tested with version 2.7, not likely to work with python 3)

Github

https://github.com/makovalab-psu/NoiseCancellingRepeatFinder

#bioconda
conda create -n ncrf -y python=2.7
conda activate ncrf
conda install -c bioconda -y ncrf

#from source
git clone --branch v1.01.00 https://github.com/makovalab-psu/NoiseCancellingRepeatFinder.git
cd NoiseCancellingRepeatFinder/
make
./make_symbolic_links.sh

> NCRF -h

$ NCRF -h

NCRF-- Noise Cancelling Repeat Finder, to find tandem repeats in noisy reads
  (version 1.01.02 20190429)
usage: cat <fasta> | NCRF [options]

  <fasta>               fasta file containing sequences; read from stdin
  [<name>:]<motif>      dna repeat motif to search for
                        (there can be more than one motif)
  --minmratio=<ratio>   discard alignments with a low frequency of matches;
                        ratio can be between 0 and 1 (e.g. "0.85"), or can be
                        expressed as a percentage (e.g. "85%")
  --maxnoise=<ratio>    (same as --minmratio but with 1-ratio)
  --minlength=<bp>      discard alignments that don't have long enough repeat
                        (default is 500)
  --minscore=<score>    discard alignments that don't score high enough
                        (default is zero)
  --stats=events        show match/mismatch/insert/delete counts
  --positionalevents    show match/mismatch/insert/delete counts by motif
                        position (independent of --stats=events); this may be
                        useful for detecting error non-uniformity, to separate
                        perfect repeats from imperfect
  --help=scoring        show options relating to alignment scoring
  --help=allocation     show options relating to memory allocation
  --help=other          show other, less frequently used options

The output is usually passed through a series of the ncrf_* post-processing
scripts.
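
The help text above says a ratio can be written either as a fraction ("0.85") or as a percentage ("85%"), and that --maxnoise is the same threshold expressed as 1 minus the ratio. The following is a minimal Python sketch of that arithmetic only. The helper names `parse_ratio` and `maxnoise_from_minmratio` are hypothetical and are not part of NCRF; NCRF's own option parsing may differ in detail.

```python
def parse_ratio(text):
    """Parse a ratio written as a fraction ("0.85") or a percentage ("85%")."""
    text = text.strip()
    if text.endswith("%"):
        value = float(text[:-1]) / 100.0
    else:
        value = float(text)
    if not 0.0 <= value <= 1.0:
        raise ValueError("ratio must be between 0 and 1: %r" % text)
    return value


def maxnoise_from_minmratio(text):
    """Return the --maxnoise value equivalent to a given --minmratio value."""
    return 1.0 - parse_ratio(text)


# Both spellings describe the same threshold, so they give the same noise limit.
print(round(maxnoise_from_minmratio("0.85"), 4))
print(round(maxnoise_from_minmratio("85%"), 4))
```

Under this reading, `--minmratio=85%` and `--maxnoise=15%` would select the same alignments.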

> ncrf_cat.py

$ ncrf_cat.py

you have to give me at least one file

usage: ncrf_cat <file1> [<file2> ...] [--markend]
  <file1>    an output file from Noise Cancelling Repeat Finder
  <file2>    another output file from Noise Cancelling Repeat Finder
  --markend  assume end-of-file markers are absent in the input, and add an
             end-of-file marker to the output
             (by default we require inputs to have proper end-of-file markers)

Concatenate several output files from Noise Cancelling Repeat Finder. This
is little more than copying the files and adding a blank line between the
files.

It can also be used to verify that the input files contain end-of-file markers
i.e. that they were not truncated when created.

> ncrf_summary.py -h

$ ncrf_summary.py -h

unrecognized option: -h

usage: ncrf_cat <output_from_NCRF> | ncrf_summary [options]
  --minmratio=<ratio>  discard alignments with a low frequency of matches;
                       ratio can be between 0 and 1 (e.g. "0.85"), or can be
                       expressed as a percentage (e.g. "85%")
  --maxnoise=<ratio>   (same as --minmratio but with 1-ratio)

Typical output:
  #line motif seq        start end  strand seqLen querybp mRatio m    mm  i  d
  1     GGAAT FAB41174_6 1568  3021 -      3352   1461    82.6%  1242 169 42 50
  11    GGAAT FAB41174_2 3908  5077 -      7347   1189    82.4%  1009 125 35 55
  21    GGAAT FAB41174_0 2312  3334 -      4223   1060    81.1%  881  115 26 64
  ...
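
In the sample rows above, the mRatio column appears to equal m / (m + mm + i + d), that is, matches divided by all alignment events. This relation is inferred from the three rows shown, not taken from the NCRF documentation. The sketch below parses that tabular summary and recomputes the ratio as a check; `parse_summary` and `match_ratio` are hypothetical helper names.

```python
SUMMARY = """\
#line motif seq start end strand seqLen querybp mRatio m mm i d
1 GGAAT FAB41174_6 1568 3021 - 3352 1461 82.6% 1242 169 42 50
11 GGAAT FAB41174_2 3908 5077 - 7347 1189 82.4% 1009 125 35 55
21 GGAAT FAB41174_0 2312 3334 - 4223 1060 81.1% 881 115 26 64
"""


def parse_summary(text):
    """Yield one dict per alignment row, keyed by the header column names."""
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if line.startswith("#"):
            header = [name.lstrip("#") for name in fields]
            continue
        if header is None or len(fields) != len(header):
            continue  # not a data row (for example, a warning message)
        yield dict(zip(header, fields))


def match_ratio(row):
    """Recompute the match ratio as m / (m + mm + i + d) (inferred formula)."""
    m, mm, i, d = (int(row[key]) for key in ("m", "mm", "i", "d"))
    return m / float(m + mm + i + d)


for row in parse_summary(SUMMARY):
    recomputed = "%.1f%%" % (100 * match_ratio(row))
    print("%s reported=%s recomputed=%s" % (row["seq"], row["mRatio"], recomputed))
```

Rows whose m, mm, i, and d columns are NA (alignments produced without --stats=events) cannot be recomputed this way and would need to be filtered out first.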

> ncrf_to_bed.py -h

$ ncrf_to_bed.py -h

unrecognized option: -h

usage: ncrf_cat <output_from_NCRF> | ncrf_to_bed [options]
  --minmratio=<ratio>  discard alignments with a low frequency of matches;
                       ratio can be between 0 and 1 (e.g. "0.85"), or can be
                       expressed as a percentage (e.g. "85%")
  --maxnoise=<ratio>   (same as --minmratio but with 1-ratio)

Typical output is shown below. The 6th column ("score" in the bed spec) is
the match ratio times 1000 (e.g. 826 is 82.6%).
  FAB41174_065680 1568 3021 . - 826
  FAB41174_029197 3908 5077 . - 824
  FAB41174_005950 2312 3334 . - 811
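
The help text states that the 6th column is the match ratio times 1000 (826 for 82.6%). The conversion in both directions can be sketched as follows. The function names are hypothetical, and whether ncrf_to_bed rounds or truncates the scaled value is an assumption here (rounding is used).

```python
def bed_score(match_ratio):
    """Scale a match ratio in [0, 1] to an integer score (ratio times 1000)."""
    return int(round(match_ratio * 1000))


def ratio_from_bed_line(line):
    """Recover the match ratio from the 6th whitespace-separated column."""
    fields = line.split()
    return int(fields[5]) / 1000.0


print(bed_score(0.826))
print(ratio_from_bed_line("FAB41174_065680 1568 3021 . - 826"))
```

This makes it easy to filter the BED lines by match ratio after the fact, for example keeping only lines whose recovered ratio is at least 0.82.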

Test run

1. Search for repeats.

cd /NoiseCancellingRepeatFinder/
cat example.fa | NCRF GGCAT > example.ncrf

> cat example.ncrf

(Screenshot: contents of example.ncrf)

2. Summary report

ncrf_cat.py example.ncrf | ncrf_summary.py

$ ncrf_cat.py example.ncrf | ncrf_summary.py

#line motif seq          start end  strand seqLen querybp mRatio m  mm i  d
WARNING: input alignments did not contain an event stats line
         (NCRF --stats=events would create that line)
1     GGCAT NCRF_EXAMPLE 3495  4737 +      50000  1242    NA     NA NA NA NA
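
The warning explains the NA columns: the alignments were produced without --stats=events, so the summary has no match/mismatch/insert/delete counts to report. Rerunning the search with that option should fill them in. Below is a hedged Python sketch of such a rerun using the standard `subprocess` module. `ncrf_command` and `run_ncrf` are hypothetical helpers, the sketch assumes the `NCRF` executable is on the PATH, and placing the option after the motif is an assumption.

```python
import subprocess


def ncrf_command(motif, stats_events=True):
    """Build the NCRF argument list; --stats=events requests the event counts."""
    argv = ["NCRF", motif]
    if stats_events:
        argv.append("--stats=events")
    return argv


def run_ncrf(fasta_path, output_path, motif):
    """Equivalent of: cat <fasta> | NCRF <motif> --stats=events > <output>."""
    with open(fasta_path, "rb") as fasta, open(output_path, "wb") as out:
        # NCRF reads the fasta from stdin and writes alignments to stdout.
        subprocess.check_call(ncrf_command(motif), stdin=fasta, stdout=out)
```

Calling `run_ncrf("example.fa", "example.ncrf", "GGCAT")` and then repeating the `ncrf_cat.py example.ncrf | ncrf_summary.py` step would be expected to give numeric values in the mRatio, m, mm, i, and d columns.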

Citation
Noise-cancelling repeat finder: uncovering tandem repeats in error-prone long-read sequencing data

Harris RS, Cechova M, Makova KD

Bioinformatics. 2019 Nov 1;35(22):4809-4811
