SPADE: detecting periodic repeat sequences in genomes

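Periodically repeating DNA and protein elements are involved in a wide range of important biological events, including genome evolution, gene regulation, protein complex formation, and immunity. Notably, the genome editing tools in current use, such as ZFNs, TALENs, and CRISPRs, are all associated with periodically repeating biomolecules of natural organisms. Despite the biological importance of periodically repeating sequences, and the expectation that new genome editing modules will be discovered from such periodic repeats, no software had been developed for high-throughput, unsupervised, global detection of such structural elements in large genomic resources. The authors developed SPADE (Search for Patterned DNA Elements), new software that exhaustively searches large genomic datasets for periodic DNA and protein repeats based on k-mer periodicity evaluation. Under the simple constraint of sequence periodicity, SPADE captured sequences related to genome editing and protein families containing repeat domains such as tetratricopeptide, ankyrin, and WD40 repeats, with better performance than other software that targets limited classes of repetitive biomolecular sequences. This suggests that SPADE can contribute to the discovery of new biological events and new genome editing modules.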
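
To give a feel for what "k-mer periodicity" can mean, here is a minimal sketch. It is not SPADE's actual algorithm, only an illustration under my own assumptions: for every k-mer it records the distance to the next occurrence of the same k-mer, and the most frequent distance is taken as the period. The function names and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_period_histogram(seq, k=10):
    """Count distances between consecutive occurrences of each k-mer."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    distances = Counter()
    for hits in positions.values():
        # Positions are appended in increasing order, so neighbours in the
        # list are consecutive occurrences of the same k-mer.
        for a, b in zip(hits, hits[1:]):
            distances[b - a] += 1
    return distances


def dominant_period(seq, k=10):
    """Return the most frequent inter-occurrence distance, or None."""
    distances = kmer_period_histogram(seq, k)
    if not distances:
        return None
    return distances.most_common(1)[0][0]


# Toy example (made up): a 12-letter unit repeated 8 times between two
# non-repetitive flanks.
unit = "ACGTTGCAAGCT"
toy = "GATTACAGGCCTA" + unit * 8 + "TTGACCGATAGGC"
print(dominant_period(toy, k=10))  # prints 12
```

In a periodic region almost every k-mer reappears exactly one period later, so the distance histogram has a sharp peak there, while non-repetitive sequence contributes almost nothing.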

SPADE is out in NAR! Congratulations Hideto! Great work and I really like the simple software concept :) We think lots of biologically important things are still hidden in periodically repeating genomic sequences. Lots to come soon. #CRISPR #genomeediting https://t.co/6fVLkjVBnm
Nozomu Yachie (@nzmyachie), October 11, 2018

Installation

Tested in a Python 2.7 environment on Ubuntu 18.04 LTS (using Docker; host: macOS 10.14).

Dependencies

BLAST+
MAFFT

pip install matplotlib==2.2.3 #(*1)
pip install seaborn==0.8.1 
pip install weblogo==3.6.0 
pip install biopython
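
Before running SPADE it can save time to check that the external tools and Python modules are actually visible from the environment. The following is a rough sketch, not part of SPADE. I am assuming that BLAST+ provides `blastn` and `blastp` on the PATH, that MAFFT provides `mafft`, and that biopython is imported as `Bio`. The import name of weblogo may differ between versions, so I left it out of the example list.

```python
import importlib

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2.7


def missing_dependencies(executables, modules):
    """Return (missing_executables, missing_modules) as two lists."""
    missing_exe = [name for name in executables if which(name) is None]
    missing_mod = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing_mod.append(name)
    return missing_exe, missing_mod


if __name__ == "__main__":
    # Executable and module names below are assumptions (see the text above).
    exe, mod = missing_dependencies(
        ["blastn", "blastp", "mafft"],
        ["matplotlib", "seaborn", "Bio"],
    )
    print("missing executables: %s" % ", ".join(exe))
    print("missing modules: %s" % ", ".join(mod))
```

Empty lists mean everything in the two lists was found. Adding `gs` to the executable list would also cover Ghostscript.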

SPADE itself: GitHub

https://github.com/yachielab/SPADE

# use it inside a Python 2.7 virtual environment
conda create -n python27 python=2.7
conda activate python27

git clone https://github.com/yachielab/SPADE
cd SPADE/
chmod u+x *.py

> python2 SPADE.py -h


SYNOPSIS

SPADE [-h] [--help] [-in input_file] [-f input_file_format] [-t sequence_type]

[-Nk kmer_size] [-Nw window_size] [-Ns kmer_score_threshold] [-Ng gap_size]

[-Nm region_margin] [-Np period_threshold] [-Nq gap_frequency_threshold]

[-Nu motif_letter_consistency] [-Nr non_consensus_length_threshold]

[-Pk kmer_size] [-Pw window_size] [-Ps kmer_score_threshold] [-Pg gap_size]

[-Pm region_margin] [-Pp period_threshold] [-Pq gap_frequency_threshold]

[-Pu motif_letter_consistency] [-Pr non_consensus_length_threshold]

[--mafft string] [--blast string] [-n num_threads] [-v string] [-d] [--delete]

[-V] [--version]

DESCRIPTION

SPADE 1.0.0

OPTIONAL ARGUMENTS

-h, --help

Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters

-V, --version

Print software version; ignore all other parameters

*** Input query options

-in <File_In>

Input file name

-f <String, Permissible values: 'genbank' 'fasta' 'auto'>

Input file type

Default = 'auto'

-t <String, Permissible values: 'nucl' 'prot' 'auto'>

Sequence type, nucleotide (nucl) or protein (prot)

Default = 'auto'

*** General screening options for nucleotide periodic repeats

-Nk <Integer>

k-mer size

Default = 10

-Nw <Integer>

Size of sliding window to calculate cumulative k-mer distribution

Default = 1000

-Ns <Integer>

Threshold for peak height of each cumulative k-mer count area

Default = 20

-Ng <Integer>

Threshold for gap size between significant k-mer count areas

Default = 200

-Nm <Integer>

Size of margin to be evaluated with each detected highly repetitive region

Default = 1000

-Np <Real>

Periodicity score threshold for each detected highly repetitive region

Default = 0.5

-Nq <Real>

Gap frequency threshold for each position of a repeat motif to be removed

Default = 0.5

-Nu <Real>

Threshold for letter consistency score at each position of a repeat motif

Default = 0.8

-Nr <Integer>

Threshold for length of non-consensus region to be removed from a repeat motif

Default = 5

*** General screening options for protein periodic repeats

-Pk <Integer>

k-mer size

Default = 3

-Pw <Integer>

Size of sliding window to calculate cumulative k-mer distribution

Default = 300

-Ps <Integer>

Threshold for peak height of each cumulative k-mer count area

Default = 6

-Pg <Integer>

Threshold for gap size between significant k-mer count areas

Default = 50

-Pm <Integer>

Size of margin to be evaluated with each detected highly repetitive region

Default = 300

-Pp <Real>

Periodicity score threshold for each detected highly repetitive region

Default = 0.3

-Pq <Real>

Gap frequency threshold for each position of a repeat motif to be removed

Default = 0.5

-Pu <Real>

Threshold for letter consistency score at each position of a repeat motif

Default = 0.8

-Pr <Integer>

Threshold for length of non-consensus region to be removed from a repeat motif

Default = 5

*** MAFFT and BLAST+ options

--mafft <'String'>

Optional arguments for MAFFT can be defined with single quotations

Default = '--auto'

For MAFFT optional arguments, see

https://mafft.cbrc.jp/alignment/software/manual/manual.html

--blastn <'String'>

Optional arguments for BLAST+ can be defined with single quotations

Default = '-strand plus -task blastn-short -penalty -2 -outfmt "6 qseqid qseq sseqid sseq pident qlen length mismatch gapopen qstart qend sstart send gaps evalue bitscore"'

--blastp <'String'>

Optional arguments for BLAST+ can be defined with single quotations

Default = '-task blastp-short -outfmt "6 qseqid qseq sseqid sseq pident qlen length mismatch gapopen qstart qend sstart send gaps evalue bitscore"'

For BLAST+ optional arguments, see

https://www.ncbi.nlm.nih.gov/books/NBK279684/

*** Other options

-n <Integer>

Number of CPU threads. If this is set to more than 1, SPADE runs multiple processes for multiple sequence entries in parallel.

Default = 1

-v <String, Permissible values: 'Y' 'N'>

Generate pdf files to visualize results for each detected repeat region

Default = Y

-d, --delete

This option deletes descendant output folders of highly repetitive regions that are detected not to contain periodic repeats
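
When SPADE is called from a script, the command line can be assembled from the options listed in the help above. The sketch below uses only flags that appear in that help text (`-in`, `-f`, `-t`, `-n`, `-v`, `-d`). The helper names `build_spade_command` and `run_spade` are my own, and calling the script as `python2 SPADE.py` from inside the cloned directory is an assumption.

```python
import subprocess


def build_spade_command(input_path, file_format="auto", seq_type="auto",
                        threads=1, make_pdf=True, delete_nonperiodic=False,
                        script="SPADE.py", python="python2"):
    """Build an argument list for SPADE from the options shown in its help."""
    if file_format not in ("genbank", "fasta", "auto"):
        raise ValueError("file_format must be 'genbank', 'fasta' or 'auto'")
    if seq_type not in ("nucl", "prot", "auto"):
        raise ValueError("seq_type must be 'nucl', 'prot' or 'auto'")
    cmd = [python, script,
           "-in", input_path,
           "-f", file_format,
           "-t", seq_type,
           "-n", str(threads),
           "-v", "Y" if make_pdf else "N"]
    if delete_nonperiodic:
        cmd.append("-d")  # drop folders of regions without periodic repeats
    return cmd


def run_spade(cmd):
    """Run the assembled command; raises CalledProcessError on failure."""
    subprocess.check_call(cmd)


if __name__ == "__main__":
    # Only prints the command; pass it to run_spade() to actually execute it.
    print(" ".join(build_spade_command(
        "GCF_000014485.1_ASM1448v1_genomic.gbff", threads=4)))
```

Passing the arguments as a list avoids shell quoting problems, which matters if you later add the quoted `--mafft` or `--blastn` option strings.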

Test run

Specify a GenBank file.

SPADE.py -in GCF_000014485.1_ASM1448v1_genomic.gbff

Output

The test file contains three sequences (a chromosome and two plasmids). One directory is written per sequence, so three directories are generated.

In each directory, a subfolder is created for each repeat.
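
Since one directory is written per sequence, the expected number of directories can be checked by counting the records in the GenBank file. The sketch below relies only on the fact that every GenBank record starts with a `LOCUS` line. The function name and the tiny in-memory example are made up for illustration.

```python
import io


def count_genbank_records(handle):
    """Count GenBank records by counting lines that start with 'LOCUS'."""
    return sum(1 for line in handle if line.startswith("LOCUS"))


# Made-up, heavily truncated GenBank text with two records.
example = io.StringIO(
    u"LOCUS       example_chr   100 bp    DNA\n"
    u"ORIGIN\n"
    u"        1 acgtacgtac\n"
    u"//\n"
    u"LOCUS       example_plasmid   50 bp    DNA\n"
    u"ORIGIN\n"
    u"        1 ttgacctagg\n"
    u"//\n"
)
print(count_genbank_records(example))  # prints 2

# For a real file:
# with open("GCF_000014485.1_ASM1448v1_genomic.gbff") as fh:
#     print(count_genbank_records(fh))
```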


[Figure: contents of a subfolder]

[Figure: periodic_repeat.pdf]

The output is described on GitHub; please check there.
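
With many repeat subfolders it is tedious to open each one by hand. As a small convenience (not part of SPADE), the sketch below walks an output directory and lists every periodic_repeat.pdf it finds. The function name is hypothetical, and `root` is whatever directory SPADE wrote its results into on your machine.

```python
import os


def find_repeat_reports(root, filename="periodic_repeat.pdf"):
    """Return sorted paths of all files named `filename` under `root`."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        if filename in filenames:
            hits.append(os.path.join(dirpath, filename))
    return sorted(hits)


# Usage (the directory name is a placeholder for your own output folder):
# for path in find_repeat_reports("path/to/spade_output"):
#     print(path)
```

Subfolders without a PDF simply do not show up in the list, for example when SPADE was run with `-v N`.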

Citation

Fast and global detection of periodic sequence repeats in large genomic resources
Hideto Mori, Daniel Evans-Yamamoto, Soh Ishiguro, Masaru Tomita, Nozomu Yachie
Nucleic Acids Research, Volume 47, Issue 2, 25 January 2019, Page e8

*1: I got an error saying Ghostscript was missing, so I downloaded a binary from https://www.ghostscript.com/download/gsdnld.html, renamed it to gs, and placed it in /usr/local/bin.

https://stackoverflow.com/questions/53091128/ghostscript-ps2pdf-not-working-correctly-from-matlab