StoatyDive - macでインフォマティクス

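The biological function of a protein is determined by its interaction partners and the mode of interaction. Studying these interactions broadens our view of cellular mechanisms such as alternative splicing and post-transcriptional regulation. Cross-linking or chromatin immunoprecipitation combined with high-throughput sequencing (CLIP-Seq, ChIP-Seq) is a way to infer these interactions. CLIP-Seq surveys all interactions between an RNA-binding protein (RBP) and its target RNAs (ref.1), and thus probes post-transcriptional regulation by RBPs. Predicting binding regions (peak calling) is a key step in the data analysis of methods such as CLIP-Seq and ChIP-Seq. Peak profiles are usually not evaluated or classified before the peaks are analyzed. Yet the peak set obtained from a peak caller can contain different peak profiles that are worth filtering to refine downstream analysis. Different peak shapes result from several biological and technical issues.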
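Jankowsky and Harris (ref.2) discuss the properties and potential problems of RNA-protein interactions. An RBP may have different binding domains or be part of a protein complex. The protein can therefore have heterogeneous binding sites with a range of affinities (mechanisms), from specific to unspecific. Factors such as the affinity of the protein for an RNA site and the concentrations of protein and RNA influence the binding specificity of the different protein binding domains. For example, they mention mRNA transport factors that are able to bind several RNAs. What they do not mention is that the protein type may also show up as different peak profiles: a helicase may have a different peak profile than a transcription factor.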
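Technical biases can also change the peak profile landscape. Errors during library preparation can lead to unspecific binding. Protocol biases, for example the PAR-CLIP bias introduced by the endonuclease and the photoactivatable nucleosides (ref.3), can likewise affect binding-site prediction. In addition, the peak caller itself may produce particular peak profiles and false positives that the user does not want in the data.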
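This raises many questions for the analysis of binding-site data. Does the protein of interest generally bind more specifically (Fig. 1a of the paper) or more unspecifically (Fig. 1b)? Can the RBP of interest have multiple binding sites? Does my experiment have quality problems, that is, do reads stem from unspecific binding because of errors in library preparation? Does my protocol introduce a bias? Does the set of peaks predicted by my chosen peak caller contain false positives? (partly omitted)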
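Here the authors present StoatyDive, a tool that evaluates and classifies peak profiles to help answer the questions above. StoatyDive performs peak shape clustering of sequencing data using the whole peak profile together with predefined features. In the paper, StoatyDive is tested on CLIP data from the eCLIP protocol for the histone stem-loop binding protein (SLBP) (ref.5). StoatyDive provides several plots and tables for assessing the different binding profiles of a protein. The tool helps to select specific and unspecific binding sites and to find peak profiles of similar shape. The authors thereby refine the peaks obtained for the SLBP data to find more specific SLBP sites. (rest omitted)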

Installation

Tested in a virtual environment under Python 3.7.4 on macOS 10.14.

Source: GitHub

# Bioconda: create a conda virtual environment and install StoatyDive into it
conda create -n stoatydive -c bioconda -y StoatyDive
conda activate stoatydive

> StoatyDive.py -h

$ StoatyDive.py
[START]
usage: StoatyDive.py [-h] [options] -a *.bed -b *.bam/*bed -c *.txt
StoatyDive.py: error: the following arguments are required: -a/--input_bed, -b/--input_bam, -c/--chr_file

$ StoatyDive.py -h
[START]
usage: StoatyDive.py [-h] [options] -a *.bed -b *.bam/*bed -c *.txt

The tool can evalute the profile of peaks. Provide the peaks you want to evalutate in bed6 format and the reads
you used for the peak detection in bed or bam format. The user obtains a distributions of the coefficient of variation (CV)
which can be used to evaluate the profile landscape. In addition, the tool generates ranked list for the peaks based
on the CV. The table hast the following columns: Chr Start End ID VC Strand bp r p Max_Norm_VC
Left_Border_Center_Difference Right_Border_Center_Difference. See StoatyDive's development page for a detailed description.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -a *.bed, --input_bed *.bed
                        Path to the peak file in bed6 format.
  -b *.bam/*.bed, --input_bam *.bam/*.bed
                        Path to the read file used for the peak calling in bed
                        or bam format.
  -c *.txt, --chr_file *.txt
                        Path to the chromosome length file.
  -o path/, --output_folder path/
                        Write results to this path. [Default: Operating Path]
  -t float, --thresh float
                        Set a normalized CV threshold to divide the peak
                        profiles into more specific (0) and more unspecific
                        (1). [Default: 1.0]
  --peak_correction     Activate peak correction. The peaks are recentered
                        (shifted) for the correct sumit.
  --max_translocate     Set this flag if you want to shift the peak profiles
                        based on the maximum value inside the profile instead
                        of a Gaussian blur translocation.
  --peak_length int     Set maximum peak length for the constant peak length.
  --max_norm_value float
                        Provide a maximum value for CV to make the normalized
                        CV plot more comparable.
  --border_penalty      Adds a penalty for non-centered peaks.
  --scale_max float     Provide a maximum value for the CV plot.
  --maxcl int           Maximal number of clusters of the kmeans clustering of
                        the peak profiles. The algorithm will be optimized,
                        i.e., the parameter is just a constraint and not
                        absolute. [Default: 15]
  -k int, --numcl int   You can forcefully set the number of cluster of peak
                        profiles.
  --sm                  Turn on the peak profile smoothing for the peak
                        profile classification. It is recommended to turn it
                        on.
  --lam float           Parameter for the peak profile classification. Set
                        lambda for the smoothing of the peak profiles. A
                        higher value (> default) will underfit. A lower value
                        (< default) will overfit. [Default: 0.3]
  --turn_off_classification
                        Turn off the peak profile classification.

Test run

git clone https://github.com/BackofenLab/StoatyDive.git
cd StoatyDive/

# example 1
StoatyDive.py -a test/broad_peaks/peaks.bed -b test/broad_peaks/reads.bed -c test/chrom_sizes.txt --peak_correction --border_penalty --turn_off_classification -o test/broad_peaks/

-a Path to the peak file in bed6 format.
-b Path to the read file used for the peak calling in bed or bam format.
-c Path to the chromosome length file.
-o Write results to this path. [Default: Operating Path]
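
The layout of the chromosome length file given with -c is not spelled out here. The following is a minimal sketch under the assumption that it is a two-column, tab-separated table of sequence name and length (the layout bedtools uses for genome files), derived from a samtools faidx index of the reference. The function name and the sample index content are hypothetical, so check the result against test/chrom_sizes.txt in the repository before relying on it.

```python
import io


def fai_to_chrom_sizes(fai_lines):
    """Keep the first two columns (name, length) of a FASTA index (.fai)."""
    sizes = []
    for line in fai_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # ignore blank lines
        fields = line.split("\t")
        # .fai columns: name, length, offset, line bases, line width
        sizes.append(f"{fields[0]}\t{int(fields[1])}")
    return sizes


# Made-up index content, for illustration only
example_fai = io.StringIO("chrA\t1000\t6\t60\t61\nchrB\t250\t1030\t60\t61\n")
sizes = fai_to_chrom_sizes(example_fai)
print("\n".join(sizes))
```

For a real reference, the same function can be applied to an open .fai file handle and the lines written to a text file that is then passed with -c.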

CV distribution plot


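This plot gives a first impression of the binding specificity of the protein of interest. It also tells you about the performance and quality of your experiment. An experiment with many unspecific binding sites will have a CV distribution close to zero. (from the GitHub README)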
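
To see why flat, unspecific profiles pile up near zero, recall that the coefficient of variation is the standard deviation divided by the mean. The sketch below computes it for two made-up coverage profiles. It assumes a population standard deviation over the per-base read counts of a peak, so StoatyDive's exact computation may differ, and the function name is hypothetical.

```python
import statistics


def coefficient_of_variation(profile):
    """CV = standard deviation / mean of a per-base coverage profile."""
    mean = statistics.fmean(profile)
    if mean == 0:
        return 0.0  # no coverage at all: define the CV as 0
    return statistics.pstdev(profile) / mean


# Made-up profiles, for illustration only
flat = [10, 10, 10, 10, 10, 10, 10]   # evenly covered, unspecific-looking
sharp = [0, 0, 1, 30, 1, 0, 0]        # one narrow summit, specific-looking

cv_flat = coefficient_of_variation(flat)
cv_sharp = coefficient_of_variation(sharp)
print(cv_flat, cv_sharp)
```

The flat profile has no variation and therefore a CV of zero, while the sharp profile gives a CV well above 1.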

Normalized CV distribution plot


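The normalized CV distribution helps to distinguish specific from unspecific sites within an experiment. The normalized CV lies in the range [0,1]: specific sites have a value of 1 and unspecific sites a value of 0.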
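
One simple way to map CVs onto [0,1] is min-max scaling within the experiment. The sketch below assumes that this is the kind of normalization meant here, and that a threshold such as the one given with -t splits peaks by comparing the normalized value against it. Both the scaling and the comparison rule are assumptions for illustration, not StoatyDive's confirmed implementation, and the function names are hypothetical.

```python
def normalize_cv(cvs):
    """Min-max scale a list of CVs to the range [0, 1]."""
    lo, hi = min(cvs), max(cvs)
    if hi == lo:
        return [0.0 for _ in cvs]  # all peaks identical: nothing to rank
    return [(cv - lo) / (hi - lo) for cv in cvs]


def split_by_threshold(norm_cvs, thresh):
    """Return indices of peaks at or above the threshold, and those below."""
    more_specific = [i for i, v in enumerate(norm_cvs) if v >= thresh]
    more_unspecific = [i for i, v in enumerate(norm_cvs) if v < thresh]
    return more_specific, more_unspecific


# Made-up CV values, for illustration only
cvs = [0.5, 1.0, 2.5]
norm = normalize_cv(cvs)
specific, unspecific = split_by_threshold(norm, 0.5)
print(norm, specific, unspecific)
```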

final_tab.bed file

The final tabular file is a ranked, tab-separated list of the predicted binding sites.
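
A table like this can be loaded with a few lines of Python. The sketch below takes the column names and their order from the help text shown earlier. It assumes one peak per line, and that a header line, if present, can be recognized by a non-numeric Start field. The sample rows are made up, and the function name is hypothetical.

```python
import csv
import io

# Column names as listed in the StoatyDive.py help text
COLUMNS = ["Chr", "Start", "End", "ID", "VC", "Strand", "bp", "r", "p",
           "Max_Norm_VC", "Left_Border_Center_Difference",
           "Right_Border_Center_Difference"]


def read_final_tab(handle):
    """Parse tab-separated rows into dicts keyed by column name."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != len(COLUMNS) or not fields[1].isdigit():
            continue  # skip blank lines and a possible header line
        row = dict(zip(COLUMNS, fields))
        row["Start"] = int(row["Start"])
        row["End"] = int(row["End"])
        row["VC"] = float(row["VC"])
        row["Max_Norm_VC"] = float(row["Max_Norm_VC"])
        rows.append(row)
    return rows


# Made-up content, for illustration only
example = io.StringIO(
    "\t".join(COLUMNS) + "\n"
    "chrA\t100\t150\tpeak_1\t2.1\t+\t50\t3\t0.4\t1.0\t0\t0\n"
    "chrA\t400\t450\tpeak_2\t0.3\t-\t50\t2\t0.5\t0.1\t0\t0\n"
)
rows = read_final_tab(example)
print([(r["ID"], r["Max_Norm_VC"]) for r in rows])
```

For a real run, pass an open handle to the final_tab.bed file in the output folder instead of the in-memory example.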

Usage

Provide the alignment BAM file, the peak file in BED6 format, and the chromosome length file.

StoatyDive.py -a peak.bed -b input.bam -c chromosome_length.txt --border_penalty --peak_correction

--border_penalty Adds a penalty for non-centered peaks.
--peak_correction Activate peak correction. The peaks are recentered (shifted) for the correct sumit.

Running with both --border_penalty and --peak_correction is recommended. --border_penalty adds a penalty for peaks that are not properly centered.
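
When many samples have to be processed, the call can be assembled in a script. The sketch below only builds the argument list, using flags that appear in the help text above. The function name and the file names are hypothetical, and the command is printed rather than executed.

```python
import shlex


def build_stoatydive_command(peaks, reads, chrom_sizes, output_folder=None,
                             border_penalty=True, peak_correction=True):
    """Assemble a StoatyDive.py argument list with the recommended flags."""
    cmd = ["StoatyDive.py", "-a", peaks, "-b", reads, "-c", chrom_sizes]
    if output_folder is not None:
        cmd += ["-o", output_folder]
    if border_penalty:
        cmd.append("--border_penalty")
    if peak_correction:
        cmd.append("--peak_correction")
    return cmd


cmd = build_stoatydive_command("peak.bed", "input.bam", "chromosome_length.txt")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the list to subprocess.run avoids shell quoting problems with file names that contain spaces.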

Citation

StoatyDive: Evaluation and Classification of Peak Profiles for Sequencing Data

Florian Heyl, Rolf Backofen
BioRxiv preprint first posted online Oct. 9, 2019