bamファイルを使ってDNA/RNAシーケンスのピーク定量やTPM計算、bigwigのcoverage trackを作成する BAMscale

2020 1/14 conda追記

2020 9/23 タイトル修正

BAMscaleは、chromatin binding（ChIP-seq）およびクロマチン状態変化（ATAC-seq、END-seq）やchromatin state changes（ATAC-seq, END-seq）、RNA seqのシーケンシングデータセットを処理するワンステップツールである。出力には、テキスト形式の正規化されたピークスコアと、データ視覚化プログラムから直接アクセス可能なスケーリングされたカバレッジトラック（BigWig）が含まれる。大きなシーケンスデータセット（〜100Gbサイズ）を数分で効果的に処理し、現在利用可能なツールより優れている。

wiki

https://github.com/ncbi/BAMscale/wiki

NCBIのオフィシャルツールの１つになっています。

https://github.com/ncbi

インストール

ubuntu16.04のminiconda2-4.0.5環境でテストした（docker使用、ホストOS ubuntu18.04）。

インストール詳細はオーサーらの準備したマニュアルを確認して下さい。

https://github.com/ncbi/BAMscale/wiki/Installation#detailed-installation-for-linux-based-os

依存

samtools (htslib)
libBigWig

本体　Github

#build
git clone https://github.com/ncbi/BAMscale.git
cd BAMscale/
make

#またはコンパイル済みバイナリをリリースからダウンロードする(linux)
https://github.com/ncbi/BAMscale/releases

#追記　bioconda(link)
conda install -c bioconda -y bamscale

> BAMscale

BAMcompare: at tool to quantify peaks, and scale sequencing data

Version: v1.0

Usage: BAMscale <command>

Commands Description

======== ===========

cov Calculate coverage of BED coordinates in BAM file(s). Outputs are raw read counts, FPKM and TPM normalized values.

scale Convert BAM files to BigWigs; scale one or multiple files to genome size or to each other.

>BAMscale cov

Calculate coverage of BED coordinates in BAM file(s)

Version: v1.0

Usage: BAMscale cov [OPTIONS] --bed <BEDFILE> --bam <BAM_1> (--bam <BAM_2> ... --bam <BAM_N>)

Output: Coverage tables (un-normalized, library-size normalized, FPKM and TPM)

Required options:

--bed|-b <file> Input BED file

--bam|-i <file> Input BAM file. This can be specified multiple times in case of multiple BAM files

Library options:

--libtype|-l <str> Sequencing type to be used. Can be: single, paired, and auto (default: autodetect)

--frag|-f <flag> Compute coverage using fragments instead of reads (default: no)

--strand|-s <flag> Reads need to have same orientation of peaks (default: unstranded)

--rstrand|-r <flag> Reads need to have reverse orientation of peaks (default: unstranded)

Sequencing coverage computation options:

--seqcov|-e <int> Compute sequencing coverage from BAM file quickly using the index (option '0'),

or count number of reads by parsing entire BAM file (slower, but more accurate; set to '1' [default])

--blacklist|-c <file> Input file with list of chromosomes to blacklist when computing coverage for normalization

--bedsubtract|-u <int> BED file with regions to subtract when computing coverage for normalization

These coordinates should not overlap so reads are not counted multiple times

Mapping options:

--mapq|-q <int> Minimum (at least) mapping quality (default: 0)

--keepdup|-d <flag> Keep duplicated reads (default: no)

--noproper|-p <flag> Do not filter un-proper alignments (default: filter)

--unmappair|-m <flag> Do not remove reads with unmapped pairs

--minfrag|-g <int> Minimum fragment size for read pairs (default: 0)

--maxfrag|-x <int> Maximum fragment size for read pairs (default: 2000)

--fragfilt|-w <int> Filter reads based on fragment size (default: no)

Output options:

--outdir|-o <str> Output directory name (default: '.')

--prefix|-n <str> Output prefix for file names (default: none)

Performance options:

--threads|-t <int> No. of threads to use (default: 1)

> BAMscale scale

Scale one or multiple BAM files

Version: v1.0

Usage: BAMscale scale [OPTIONS] --bam <BAM_1> (--bam <BAM_2> ... --bam <BAM_N>)

Output: Coverage tracks in BigWig format (un-scaled, scaled, genome scaled)

Required options:

--bam|-i <file> Input BAM file. This has to be specified at least two times.

Library options:

--libtype|-l <str> Sequencing type to be used. Can be: single, paired, and auto (default: autodetect)

--frag|-f <flag> Compute coverage using fragments instead of reads (default: no)

--fragsize|-a <int> Fragment size to be used to extend single-end library reads

Normalization, scaling and operation type:

--normtype|-y <str> Type of normalization. (default: base)

If no normalization is needed, set '--scale no' argument, the program will disregard this option.

Options:

1) reads: No. of mapped reads/fragments

2) base: Sum of per-base coverage of reads/fragments

--scale|-k <str> Method to scale samples together. (default: genome)

Options are:

1) no: no scaling, just calculate coverage

2) smallest: scale reads to smallest library (multiple-samples only)

3) genome: scale samples to 1x genome coverage (only possible with 'base' normalization type)

--operation|-r <str> Operation to perform when scaling samples. Default: scaled

Options are:

1) scaled: output scaled tracks.

2) unscaled: do not scale files in any way.

2) log2: log2 transform against first BAM file.

3) ratio: coverage ratio against first BAM file.

4) subtract: subtract coverage against first BAM file.

5) rfd: OK-seq RFD calculation

--binsize|-z <int> Size of bins for output bigWig/bedgraph generation (default: 5)

Sequencing coverage computation options:

--seqcov|-e <int> Compute sequencing coverage from BAM file. (default: '1', count reads while parsing BAM)

Options are:

1) 0: use reads in index (only if normalization is set to 'reads')

2) 1: count reads while parsing BAM(s)

WARNING: this option is only useful when 'reads' are used for normalization

--blacklist|-c <file> Input file with list of chromosomes to blacklist during scaling analysis

--bedsubtract|-u <int> BED file with regions to subtract when computing coverage for normalization

These coordinates should not overlap so reads are not counted multiple times

--smoothen|-j <int> Smoothen signal by calculating mean of N bins flanking both sides of each bin (default: 0)

If set to '0', the signal is not smoothened. To turn on specify a value greater than '0'.

For replication timing, a good value is to smoothen to 100k bases. If binSize is 100bp, this would be '1000'

--tracksmooth|-b <int> Which tracks should be smoothened when performing smoothening (default: '1' meaning only binned track).

Options are:

1) 0: Smoothen scaled and transformed tracks (log2, ratio or subtracted)

2) 1: Smoothen only the scaled sequencing track

3) 2: Smoothen only the transformed (log2, ratio or subtract) track

Mapping options:

--mapq|-q <int> Minimum (at least) mapping quality (default: 0)

--keepdup|-d <flag> Keep duplicated reads (default: no)

--noproper|-p <flag> Do not filter un-proper alignments (default: filter)

--unmappair|-m <flag> Do not remove reads with unmapped pairs

--minfrag|-g <int> Minimum fragment size for read pairs (default: 0)

--maxfrag|-x <int> Maximum fragment size for read pairs (default: 2000)

--fragfilt|-w <int> Filter reads based on fragment size (default: no)

Output options:

--outdir|-o <str> Output directory name (default: '.')

Performance options:

--threads|-t <int> No. of threads to use (default: 1)

Dockerfileが用意されている。dockerイメージをビルドして使うこともできる。

docker build -t bamscale https://raw.githubusercontent.com/pongorlorinc/BAMscale/master/Dockerfile

実行方法

ピーク定量（genomeを均一サイズのbinに分けて定量）

BAMscale cov --bed region.bed --bam input1.bam --bam input2.bam --bam input3.bam ...

出力されるもの（Githubより）

1) FPKM Normalization
2) Library Size Normalized
3) TPM Normalized
4) Raw Read Counts

が出力される。

ゲノムサイズでスケーリングした定量

BAMscale scale --bam input1.bam --bam input2.bam --bam input3.bam ...

BAMscale scaleのRNA seq向けパラメータ

BAMscale scaleを実行する際、RNA seqのbamファイルでは"--operation "フラグを設定することでより正確な定量を行える。rna、strandrna、strandrnaRの３モードが用意されている（詳細はGIhub参照）。

１、unstranded RNA seq

BAMscale scale --bam <RNAseq.bam> --operation rna

total number of aligned basesがゲノムサイズで正規化された上で定量結果の bigWigが出力される。出力はpositiveとnegativeに分けられない。

２、stranded RNA seq

BAMscale scale --bam <RNAseq.bam> --operation strandrna

positiveとnegativeに分けて bigWigが２つ出力される。

３、stranded RNA seq（negativeをminus valueとして出力）。

BAMscale scale --bam <RNAseq.bam> --operation strandrnaR

２と同じだが、negativeはminus付きで出力される。

正規化のスケールファクターをゲノムサイズから変更するには "--scale custom" と"--factor"をつける。例えば0.76倍にスケールするには"--scale custom --factor 0.76"をつけてランする。

ゲノムサイズではなく、DESeq2による正規化を行うこともできる。"scaling factors"は個別のサンプルでは計算できないため、少し手順が複雑になる。Githubのチュートは現在作成中になっている。

調べた時からコマンドが変化していることに気づきました。現在修正中です。

パラメータと定量方法については特に以下を確認してください。 END-seq、ATAC-seq、OKseq向けのチュートリアルも用意されています。マッピング手順から丁寧に説明されています。

https://github.com/ncbi/BAMscale/wiki/Peak-Quantifying

引用

BAMscale: quantification of DNA sequencing peaks and generation of scaled coverage tracks
Lorinc S. Pongor, Jacob M. Gross, Roberto Vera Alvarez, Junko Murai, Sang-Min Jang, Hongliang Zhang, Christophe Redon, Haiqing Fu, Shar-Yin Huang, Bhushan Thakur, Adrian Baris, Leonardo Marino-Ramirez, David Landsman, Mirit I. Aladjem, Yves Pommier

bioRxiv preprint first posted online Jun. 13, 2019