Kmer-SSR: searching for simple sequence repeats (SSRs) using k-mers

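Simple sequence repeats (SSRs) are short repeated regions of DNA in which at least one base is tandemly repeated many times, owing to mispairing and errors that occur during DNA replication, repair, or recombination (Levinson and Gutman, 1987). For decades, SSRs have been studied to determine the phenotypic differences caused by increases in the copy number of short repeat sequences (Kashi and King, 2006). In addition, SSRs account for quantitative genetic variation and phenotypic differences without lowering the fitness of the species (Kashi et al., 1997). SSR density varies not only between species but also between chromosomes within the same species, and it cannot be explained by evaluating the nucleotide composition of the sequence (Katti et al., 2001). Because SSRs reveal characteristic features of DNA replication, recombination, and repair, they are important for studying interactions in biological systems and for studying repeat expansion-based diseases with next-generation sequencing data (Kashi and King, 2006).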

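Many different approaches have been used to identify SSRs. The paper proposes using k-mers. The term k-mer refers to a subsequence of length k obtained from a given sequence, while k-mer decomposition refers to all possible substrings of length k that can be derived from a sequence. Uses of k-mer decomposition have previously been outlined in areas such as genome assembly and machine learning (Chikhi and Medvedev, 2014; Ghandi et al., 2014). k-mers have been used to identify similar subsequences (Han et al., 2007), but to the authors' knowledge SSR identification by k-mer decomposition had not been attempted before. Kmer-SSR can find all SSRs faster than most SSR identification algorithms. Kmer-SSR offers an option with 100% recall and 100% precision that correctly identifies every SSR of the specified lengths. To help pinpoint biologically relevant SSRs, the authors also developed several filters that make it easy to view a subset of SSRs based on user input. Combined with these filter options, Kmer-SSR identifies SSRs accurately and intuitively, more quickly and in a more user-friendly way than other SSR identification algorithms.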
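To make the two definitions above concrete, here is a minimal Python sketch of k-mer decomposition and a naive scan for perfect tandem repeats of a fixed period. It is an illustration only and not the Kmer-SSR implementation; the function names kmer_decomposition and find_perfect_repeats are hypothetical, and the scan does not reduce non-atomic units.

```python
def kmer_decomposition(seq, k):
    """Return every length-k substring of seq, ordered by start position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def find_perfect_repeats(seq, k, min_repeats=2):
    """Naively report runs where a k-mer is repeated back to back.

    Returns a list of (start, unit, repeats) tuples.
    Illustrative sketch only; not the algorithm used by Kmer-SSR.
    """
    hits = []
    i = 0
    while i + k <= len(seq):
        unit = seq[i:i + k]
        repeats = 1
        # Extend the run while the next k characters equal the unit.
        while seq[i + repeats * k:i + (repeats + 1) * k] == unit:
            repeats += 1
        if repeats >= min_repeats:
            hits.append((i, unit, repeats))
            i += repeats * k  # continue after the reported run
        else:
            i += 1
    return hits


print(kmer_decomposition("ACGTA", 3))
print(find_perfect_repeats("GGACACACTT", 2))
```

Here the decomposition of ACGTA with k = 3 yields ACG, CGT, and GTA, and the scan reports AC repeated three times starting at position 2 (0-based).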

Kmer-SSR features (from the official documentation)

Fast run time (linear, O(n), time complexity)
Memory efficient (linear, O(n), space complexity)
Finds all perfect repeats
Simple command-line interface, convenient for scripting and when running on High-Performance Computing (HPC) systems (note: no GUI provided)
Easily parsed, tab-delimited output
Runs on Linux (not Windows or Mac OS X)

Installation

GitHub

git clone https://github.com/ridgelab/Kmer-SSR.git
cd Kmer-SSR/
make
cd bin/

$ ./kmer-ssr -h

./kmer-ssr: illegal option -- h

USAGE: kmer-ssr [-a a1,..,aN] [-A] [-d] [-e] [-h] [-i file] [-l int] [-L int]
                [-n int] [-N int] [-o file] [-p p1,..,pN] [-r int] [-R int]
                [-s s1,..,sN] [-t int] [-v]

Find SSRs in FASTA sequence data

Input:
  -i in.fasta
      The input file in fasta format. All sequence characters
      will be converted to uppercase. [default: stdin]
      If your fasta file is compressed, do not use -i. Simply
      use zcat, bzcat, or a similar tool and pipe it into this
      program.

Output:
  -o out.tsv
      The output file in tab-separated value (tsv) format.
      Please see `README' column details. [default: stdout]

Algorithmic:
  -a a1,..,aN
      A comma-separated list of valid, uppercase characters
      (nucleotides). Characters not in this list will be
      ignored. [default=A,C,G,T]
  -A
      Report non-atomic SSRs (e.g., AT repeated 6 times may
      report an ATAT repeated 3 times or an ATATAT repeated
      2 times instead).
  -e
      Disable all filters and SSR validation to report every
      SSR. Similar to: -A -r 2 -R <big_number> -n 2 -N
      <big_number>. This will override any options set for
      -n, -N, -r, -R, and -s.
  -p p1,..,pN
      A comma-separated list of period sizes (i.e., kmer
      lengths). Inclusive ranges are also supported using a
      hyphen. [default=4-8]
  -l int
      Only search for SSRs in sequences with total length
      >= l [default: 100]
  -L int
      Only search for SSRs in sequences with total length
      <= L [default: 500,000,000]
  -n int
      Keep only SSRs with total length (number of
      nucleotides) >= n [default: 16]
  -N int
      Keep only SSRs with total length (number of
      nucleotides) <= N [default: 10,000]
  -r int
      Keep only SSRs that repeat >= r times [default: 2]
  -R int
      Keep only SSRs that repeat <= R times [default:
      10,000]
  -s s1,..,sN
      A comma-separated list of SSRs to search for; e.g.
      "AC,GTTA,TTCTG,CCG" or "TGA". Please note that other
      options may prevent SSRs specified with this option
      from appearing in the output. For example, if -p is
      "4-6", then an SSR with a repeating "AC" will never be
      displayed because "AC" has a period size of 2 (and, as
      it turns out, 2 is not in the range 4-6).

Misc:
  -d
      Disable the progress bar that normally prints to
      stderr. Will automatically be disabled if (a) reading
      from stdin or (b) writing to stdout without
      redirecting it to a file.
  -h
      Show this help message and exit
  -Q int
      Max size of the tasks queue [default: 1,000]
  -t int
      Number of threads [default: 1]
  -v
      Show version number and exit
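The -A description distinguishes atomic from non-atomic SSRs: AT repeated 6 times, ATAT repeated 3 times, and ATATAT repeated 2 times all describe the same stretch of sequence. The following Python sketch shows one way to reduce a repeat unit to its shortest (atomic) form. The help text does not say how Kmer-SSR does this internally, so the function names atomic_unit and to_atomic are hypothetical and this is only an illustration of the concept.

```python
def atomic_unit(unit):
    """Return the shortest prefix whose repetition reproduces unit."""
    n = len(unit)
    for size in range(1, n + 1):
        # A candidate prefix must divide the unit length evenly.
        if n % size == 0 and unit[:size] * (n // size) == unit:
            return unit[:size]
    return unit


def to_atomic(unit, repeats):
    """Rewrite (unit, repeats) in terms of the atomic unit."""
    base = atomic_unit(unit)
    return base, repeats * (len(unit) // len(base))


print(to_atomic("ATAT", 3))
print(to_atomic("ATATAT", 2))
```

Both calls describe the same 12-nucleotide repeat and reduce to the unit AT repeated 6 times.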

Move the kmer-ssr binary to a directory on your PATH.

Run

Search with default settings.

kmer-ssr -i input.fasta -o output.tsv

Search using 4 threads, period sizes 2 to 9, and 3 to 20 repeats.

kmer-ssr -t 4 -p 2-9 -r 3 -R 20 -i input.fasta -o output.tsv

-p Search for SSRs with a period size of 2, 3, 4, 5, 6, 7, 8, or 9. As examples: `AAAAAA...' and `ACGCAGTTGCACGCAGTTGC...' would not make the cutoff because `A' is shorter than 2 nucleotides and `ACGCAGTTGC' is longer than 9 nucleotides. However, `ACACAC...' and `AATCCTGGTAATCCTGGT...' would be included.
-r, -R The min (-r) and max (-R) times the repeating unit repeats. As examples: `ACAC' and `ACACACACACACACACACACACACACACACACACACACACAC' (21 `AC' units) would not make the cutoff. However, `ACACAC' and `ACACACACACACACACACACACACACACACACACACACAC' (20 `AC' units) would be included.
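The period and repeat-count rules described for -p, -r, and -R can be checked with a small Python sketch. This mirrors only the rules as stated in the text above and is not code from Kmer-SSR; the helper names parse_periods and passes_filters are hypothetical.

```python
def parse_periods(spec):
    """Expand a period spec such as "2,4-6" into the set {2, 4, 5, 6}."""
    periods = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            periods.update(range(int(low), int(high) + 1))  # inclusive range
        else:
            periods.add(int(part))
    return periods


def passes_filters(unit, repeats, periods, min_repeats, max_repeats):
    """Apply the period-size and repeat-count rules to one candidate SSR."""
    return len(unit) in periods and min_repeats <= repeats <= max_repeats


periods = parse_periods("2-9")  # as in: -p 2-9 -r 3 -R 20
for unit, repeats in [("AC", 2), ("AC", 3), ("AC", 20), ("AC", 21),
                      ("A", 10), ("AATCCTGGT", 3), ("ACGCAGTTGC", 3)]:
    print(unit, repeats, passes_filters(unit, repeats, periods, 3, 20))
```

With these settings, AC passes at 3 and 20 repeats but fails at 2 and 21, the 1-nucleotide unit A and the 10-nucleotide unit ACGCAGTTGC fail on period size, and the 9-nucleotide unit AATCCTGGT passes.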

Citation

Kmer-SSR: a fast and exhaustive SSR search algorithm.

Pickett BD, Miller JB, Ridge PG

Bioinformatics. 2017 Dec 15;33(24):3922-3928.