macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ChIP-SeqやATAC-seqのピークコーラー MACS3

 

Githubより

 シーケンサー技術の向上に伴い、クロマチン免疫沈降法とハイスループットシーケンス(ChIP-Seq)によるゲノム規模のタンパク質-DNA相互作用の研究が盛んになってきている。このようなChIP-Seqの強力な解析手法の欠如に対処するため、本著者らは転写因子結合部位を特定するためのモデルベース解析ChIP-Seq (MACS) を発表した。MACSはゲノムの複雑さの影響を捉え、濃縮されたChIP領域の重要性を評価する。MACSはシークエンスタグの位置と方向の両方の情報を組み合わせることにより、結合部位の空間分解能を向上させる。MACSは、ChIP-Seqデータ単独、あるいは特異性を高めたコントロールサンプルを用いて容易に使用することができる。さらに、一般的なピークコーラーとして、MACSは「ランダムバックグラウンドよりも有意なリードを検出できる場所」というシンプルな質問であれば、あらゆる「DNAエンリッチメントアッセイ」にも適用することができる。なお、現在のMACS3はまだアルファ版である。しかし、Github Actionを利用してCI(Continous Integration)を実施し、メインブランチが特定の関数やサブコマンドのユニットテストに合格し、正しい出力を再現していることを確認している。今後、さらに新機能を追加していく予定です。

 

Call peaks

MACS/callpeak.md at master · macs3-project/MACS · GitHub

 

インストール

condaで環境を作り、pipを使ってmacs3を導入した(ubuntu18)。

本体 Github

mamba create -n macs python=3.10
conda activate macs
#v3
#macs3(link)
mamba install -c maximinio macs3 -y
#pip(link)
pip install MACS3

#v2
#macs2(link)
mamba install -c maximinio macs2 -y
#pip(link)
pip install MACS2

#v1
#macs(link)
mamba install -c maximinio macs -y
#pip(link)
pip install MACS

> macs3 -h

usage: macs3 [-h] [--version] {callpeak,bdgpeakcall,bdgbroadcall,bdgcmp,bdgopt,cmbreps,bdgdiff,filterdup,predictd,pileup,randsample,refinepeak,callvar} ...

 

macs3 -- Model-based Analysis for ChIP-Sequencing

 

positional arguments:

  {callpeak,bdgpeakcall,bdgbroadcall,bdgcmp,bdgopt,cmbreps,bdgdiff,filterdup,predictd,pileup,randsample,refinepeak,callvar}

    callpeak            Main MACS3 Function: Call peaks from alignment results.

    bdgpeakcall         Call peaks from bedGraph output. Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are accpetable.

    bdgbroadcall        Call broad peaks from bedGraph output. Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are accpetable.

    bdgcmp              Deduct noise by comparing two signal tracks in bedGraph. Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are accpetable.

    bdgopt              Operations on score column of bedGraph file. Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are accpetable.

    cmbreps             Combine BEDGraphs of scores from replicates. Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are accpetable.

    bdgdiff             Differential peak detection based on paired four bedgraph files. Note: All regions on the same chromosome in the bedGraph file should be continuous so only bedGraph files from MACS3 are accpetable.

    filterdup           Remove duplicate reads at the same position, then save the rest alignments to BED or BEDPE file. If you use '--keep-dup all option', this script can be utilized to convert any acceptable format into BED or BEDPE format.

    predictd            Predict d or fragment size from alignment results. In case of PE data, report the average insertion/fragment size from all pairs. *Will NOT filter duplicates*

    pileup              Pileup aligned reads with a given extension size (fragment size or d in MACS language). Note there will be no step for duplicate reads filtering or sequencing depth scaling, so you may need to do certain pre/post-processing.

    randsample          Randomly sample number/percentage of total reads.

    refinepeak          (Experimental) Take raw reads alignment, refine peak summits and give scores measuring balance of waston/crick tags. Inspired by SPP.

    callvar             Call variants in given peak regions from the alignment BAM files.

 

optional arguments:

  -h, --help            show this help message and exit

  --version             show program's version number and exit

 

For command line options of each command, type: macs3 COMMAND -h

 

> macs3 callpeak -h

usage: macs3 callpeak [-h] -t TFILE [TFILE ...] [-c [CFILE [CFILE ...]]] [-f {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}] [-g GSIZE] [-s TSIZE] [--keep-dup KEEPDUPLICATES] [--outdir OUTDIR] [-n NAME] [-B] [--verbose VERBOSE] [--trackline]

                      [--SPMR] [--nomodel] [--shift SHIFT] [--extsize EXTSIZE] [--bw BW] [--d-min D_MIN] [-m MFOLD MFOLD] [--fix-bimodal] [-q QVALUE | -p PVALUE] [--scale-to {large,small}] [--down-sample] [--seed SEED] [--tempdir TEMPDIR] [--nolambda]

                      [--slocal SMALLLOCAL] [--llocal LARGELOCAL] [--max-gap MAXGAP] [--min-length MINLEN] [--broad] [--broad-cutoff BROADCUTOFF] [--cutoff-analysis] [--call-summits] [--fe-cutoff FECUTOFF] [--buffer-size BUFFER_SIZE] [--to-large] [--ratio RATIO]

 

optional arguments:

  -h, --help            show this help message and exit

 

Input files arguments:

  -t TFILE [TFILE ...], --treatment TFILE [TFILE ...]

                        ChIP-seq treatment file. If multiple files are given as '-t A B C', then they will all be read and pooled together. REQUIRED.

  -c [CFILE [CFILE ...]], --control [CFILE [CFILE ...]]

                        Control file. If multiple files are given as '-c A B C', they will be pooled to estimate ChIP-seq background noise.

  -f {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}, --format {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}

                        Format of tag file, "AUTO", "BED" or "ELAND" or "ELANDMULTI" or "ELANDEXPORT" or "SAM" or "BAM" or "BOWTIE" or "BAMPE" or "BEDPE". The default AUTO option will let MACS decide which format (except for BAMPE and BEDPE which should be

                        implicitly set) the file is. Please check the definition in README. Please note that if the format is set as BAMPE or BEDPE, MACS3 will call its special Paired-end mode to call peaks by piling up the actual ChIPed fragments defined by both

                        aligned ends, instead of predicting the fragment size first and extending reads. Also please note that the BEDPE only contains three columns, and is NOT the same BEDPE format used by BEDTOOLS. DEFAULT: "AUTO"

  -g GSIZE, --gsize GSIZE

                        Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default:hs

  -s TSIZE, --tsize TSIZE

                        Tag size/read length. This will override the auto detected tag size. DEFAULT: Not set

  --keep-dup KEEPDUPLICATES

                        It controls the behavior towards duplicate tags at the exact same location -- the same coordination and the same strand. The 'auto' option makes MACS calculate the maximum tags at the exact same location based on binomal distribution using

                        1e-5 as pvalue cutoff; and the 'all' option keeps every tags. If an integer is given, at most this number of tags will be kept at the same location. Note, if you've used samtools or picard to flag reads as 'PCR/Optical duplicate' in bit

                        1024, MACS3 will still read them although the reads may be decided by MACS3 as duplicate later. If you plan to rely on samtools/picard/any other tool to filter duplicates, please remove those duplicate reads and save a new alignment file

                        then ask MACS3 to keep all by '--keep-dup all'. The default is to keep one tag at the same location. Default: 1

 

Output arguments:

  --outdir OUTDIR       If specified all output files will be written to that directory. Default: the current working directory

  -n NAME, --name NAME  Experiment name, which will be used to generate output file names. DEFAULT: "NA"

  -B, --bdg             Whether or not to save extended fragment pileup, and local lambda tracks (two files) at every bp into a bedGraph file. DEFAULT: False

  --verbose VERBOSE     Set verbose level of runtime message. 0: only show critical message, 1: show additional warning message, 2: show process information, 3: show debug messages. DEFAULT:2

  --trackline           Tells MACS to include trackline with bedGraph files. To include this trackline while displaying bedGraph at UCSC genome browser, can show name and description of the file as well. However my suggestion is to convert bedGraph to bigWig, then

                        show the smaller and faster binary bigWig file at UCSC genome browser, as well as downstream analysis. Require -B to be set. Default: Not include trackline.

  --SPMR                If True, MACS will SAVE signal per million reads for fragment pileup profiles. It won't interfere with computing pvalue/qvalue during peak calling, since internally MACS3 keeps using the raw pileup and scaling factors between larger and

                        smaller dataset to calculate statistics measurements. If you plan to use the signal output in bedGraph to call peaks using bdgcmp and bdgpeakcall, you shouldn't use this option because you will end up with different results. However, this

                        option is recommended for displaying normalized pileup tracks across many datasets. Require -B to be set. Default: False

 

Shifting model arguments:

  --nomodel             Whether or not to build the shifting model. If True, MACS will not build model. by default it means shifting size = 100, try to set extsize to change it. It's highly recommended that while you have many datasets to process and you plan to

                        compare different conditions, aka differential calling, use both 'nomodel' and 'extsize' to make signal files from different datasets comparable. DEFAULT: False

  --shift SHIFT         (NOT the legacy --shiftsize option!) The arbitrary shift in bp. Use discretion while setting it other than default value. When NOMODEL is set, MACS will use this value to move cutting ends (5') towards 5'->3' direction then apply EXTSIZE to

                        extend them to fragments. When this value is negative, ends will be moved toward 3'->5' direction. Recommended to keep it as default 0 for ChIP-Seq datasets, or -1 * half of EXTSIZE together with EXTSIZE option for detecting enriched cutting

                        loci such as certain DNAseI-Seq datasets. Note, you can't set values other than 0 if format is BAMPE or BEDPE for paired-end data. DEFAULT: 0.

  --extsize EXTSIZE     The arbitrary extension size in bp. When nomodel is true, MACS will use this value as fragment size to extend each read towards 3' end, then pile them up. It's exactly twice the number of obsolete SHIFTSIZE. In previous language, each read

                        is moved 5'->3' direction to middle of fragment by 1/2 d, then extended to both direction with 1/2 d. This is equivalent to say each read is extended towards 5'->3' into a d size fragment. DEFAULT: 200. EXTSIZE and SHIFT can be combined when

                        necessary. Check SHIFT option.

  --bw BW               Band width for picking regions to compute fragment size. This value is only used while building the shifting model. Tweaking this is not recommended. DEFAULT: 300

  --d-min D_MIN         Minimum fragment size in basepair. Any predicted fragment size less than this will be excluded. DEFAULT: 20

  -m MFOLD MFOLD, --mfold MFOLD MFOLD

                        Select the regions within MFOLD range of high-confidence enrichment ratio against background to build model. Fold-enrichment in regions must be lower than upper limit, and higher than the lower limit. Use as "-m 10 30". This setting is only

                        used while building the shifting model. Tweaking it is not recommended. DEFAULT:5 50

  --fix-bimodal         Whether turn on the auto pair model process. If set, when MACS failed to build paired model, it will use the nomodel settings, the --exsize parameter to extend each tags towards 3' direction. Not to use this automate fixation is a default

                        behavior now. DEFAULT: False

 

Peak calling arguments:

  -q QVALUE, --qvalue QVALUE

                        Minimum FDR (q-value) cutoff for peak detection. DEFAULT: 0.05. -q, and -p are mutually exclusive.

  -p PVALUE, --pvalue PVALUE

                        Pvalue cutoff for peak detection. DEFAULT: not set. -q, and -p are mutually exclusive. If pvalue cutoff is set, qvalue will not be calculated and reported as -1 in the final .xls file.

  --scale-to {large,small}

                        When set to 'small', scale the larger sample up to the smaller sample. When set to 'larger', scale the smaller sample up to the bigger sample. By default, scale to 'small'. This option replaces the obsolete '--to-large' option. The default

                        behavior is recommended since it will lead to less significant p/q-values in general but more specific results. Keep in mind that scaling down will influence control/input sample more. DEFAULT: 'small', the choice is either 'small' or

                        'large'.

  --down-sample         When set, random sampling method will scale down the bigger sample. By default, MACS uses linear scaling. Warning: This option will make your result unstable and irreproducible since each time, random reads would be selected. Consider to use

                        'randsample' script instead. <not implmented>If used together with --SPMR, 1 million unique reads will be randomly picked.</not implemented> Caution: due to the implementation, the final number of selected reads may not be as you expected!

                        DEFAULT: False

  --seed SEED           Set the random seed while down sampling data. Must be a non-negative integer in order to be effective. DEFAULT: not set

  --tempdir TEMPDIR     Optional directory to store temp files. DEFAULT: /tmp

  --nolambda            If True, MACS will use fixed background lambda as local lambda for every peak region. Normally, MACS calculates a dynamic local lambda to reflect the local bias due to the potential chromatin accessibility.

  --slocal SMALLLOCAL   The small nearby region in basepairs to calculate dynamic lambda. This is used to capture the bias near the peak summit region. Invalid if there is no control data. If you set this to 0, MACS will skip slocal lambda calculation. *Note* that

                        MACS will always perform a d-size local lambda calculation while the control data is available. The final local bias would be the maximum of the lambda value from d, slocal, and llocal size windows. While control is not available, d and

                        slocal lambda won't be considered. DEFAULT: 1000

  --llocal LARGELOCAL   The large nearby region in basepairs to calculate dynamic lambda. This is used to capture the surround bias. If you set this to 0, MACS will skip llocal lambda calculation. *Note* that MACS will always perform a d-size local lambda

                        calculation while the control data is available. The final local bias would be the maximum of the lambda value from d, slocal, and llocal size windows. While control is not available, d and slocal lambda won't be considered. DEFAULT: 10000.

  --max-gap MAXGAP      Maximum gap between significant sites to cluster them together. The DEFAULT value is the detected read length/tag size.

  --min-length MINLEN   Minimum length of a peak. The DEFAULT value is the predicted fragment size d. Note, if you set a value smaller than the fragment size, it may have NO effect on the result. For BROAD peak calling, try to set a large value such as 500bps. You

                        can also use '--cutoff-analysis' option with default setting, and check the column 'avelpeak' under different cutoff values to decide a reasonable minlen value.

  --broad               If set, MACS will try to call broad peaks using the --broad-cutoff setting. Please tweak '--broad-cutoff' setting to control the peak calling behavior. At the meantime, either -q or -p cutoff will be used to define regions with 'stronger

                        enrichment' inside of broad peaks. The maximum gap is expanded to 4 * MAXGAP (--max-gap parameter). As a result, MACS will output a 'gappedPeak' and a 'broadPeak' file instead of 'narrowPeak' file. Note, a broad peak will be reported even if

                        there is no 'stronger enrichment' inside. DEFAULT: False

  --broad-cutoff BROADCUTOFF

                        Cutoff for broad region. This option is not available unless --broad is set. If -p is set, this is a pvalue cutoff, otherwise, it's a qvalue cutoff. Please note that in broad peakcalling mode, MACS3 uses this setting to control the overall

                        peak calling behavior, then uses -q or -p setting to define regions inside broad region as 'stronger' enrichment. DEFAULT: 0.1

  --cutoff-analysis     While set, MACS3 will analyze number or total length of peaks that can be called by different p-value cutoff then output a summary table to help user decide a better cutoff. The table will be saved in NAME_cutoff_analysis.txt file. Note,

                        minlen and maxgap may affect the results. WARNING: May take ~30 folds longer time to finish. The result can be useful for users to decide a reasonable cutoff value. DEFAULT: False

 

Post-processing options:

  --call-summits        If set, MACS will use a more sophisticated signal processing approach to find subpeak summits in each enriched peak region. DEFAULT: False

  --fe-cutoff FECUTOFF  When set, the value will be used to filter out peaks with low fold-enrichment. Note, MACS3 adds one as pseudocount while calculating fold-enrichment. By default, it is set as 1 so there is no filtering. DEFAULT: 1.0

 

Other options:

  --buffer-size BUFFER_SIZE

                        Buffer size for incrementally increasing internal array size to store reads alignment information. In most cases, you don't have to change this parameter. However, if there are large number of chromosomes/contigs/scaffolds in your alignment,

                        it's recommended to specify a smaller buffer size in order to decrease memory usage (but it will take longer time to read alignment files). Minimum memory requested for reading an alignment file is about # of CHROMOSOME * BUFFER_SIZE * 8

                        Bytes. DEFAULT: 100000

 

Obsolete options:

  --to-large            Obsolete option. Please use '--scale-to large' instead.

  --ratio RATIO         Obsolete option. Originally designed to normalize treatment and control with customized ratio, now it won't have any effect.

 

Examples:

1. Peak calling for regular TF ChIP-seq:

    $ macs3 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01

2. Broad peak calling on Histone Mark ChIP-seq:

    $ macs3 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1

3. Peak calling on ATAC-seq (paired-end mode):

    $ macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01

4. Peak calling on ATAC-seq ( focusing on insertion sites, and using single-end mode):

    $ macs3 callpeak -f BAM -t ATAC.bam -g hs -n test -B -q 0.01 --shift -50 --extension 100

 

 

実行方法

 treatmentとcontrolのbamファイルを指定する。

#通常のピークコール例
macs3 callpeak -t ChIP.bam -c Control.bam -f BAM -g hs -n test -B -q 0.01

#ヒストンマーキングChIP-seqのブロードピークコール例
macs3 callpeak -t ChIP.bam -c Control.bam --broad -g hs --broad-cutoff 0.1

#ATAC-seq (paired-end mode)でのピークコール例
macs3 callpeak -f BAMPE -t ATAC.bam -g hs -n test -B -q 0.01
  • -t    ChIP-seq treatment file. If multiple files are given as '-t A B C', then they will all be read and pooled together. REQUIRED.
  • -c    Control file. If multiple files are given as '-c A B C', they will be pooled to estimate ChIP-seq background noise.
      -f  {AUTO,BAM,SAM,BED,ELAND,ELANDMULTI,ELANDEXPORT,BOWTIE,BAMPE,BEDPE}   Format of tag file, "AUTO", "BED" or "ELAND" or "ELANDMULTI" or "ELANDEXPORT" or "SAM" or "BAM" or "BOWTIE" or "BAMPE" or "BEDPE". The default AUTO option will let MACS decide which format (except for BAMPE and BEDPE which should be implicitly set) the file is. Please check the definition in README. Please note that if the format is set as BAMPE or BEDPE, MACS3 will call its special Paired-end mode to call peaks by piling up the actual ChIPed fragments defined by bot aligned ends, instead of predicting the fragment size first and extending reads. Also please note that the BEDPE only contains three columns, and is NOT the same BEDPE format used by BEDTOOLS. DEFAULT: "AUTO"
  • -g      Effective genome size. It can be 1.0e+9 or 1000000000, or shortcuts:'hs' for human (2.7e9), 'mm' for mouse (1.87e9), 'ce' for C. elegans (9e7) and 'dm' for fruitfly (1.2e8), Default:hs
  • -n     Experiment name, which will be used to generate output file names. DEFAULT: "NA"
  •  -B     Whether or not to save extended fragment pileup, and local lambda tracks (two files) at every bp into a bedGraph file. DEFAULT: False
  • --verbose   Set verbose level of runtime message. 0: only show critical message, 1: show additional warning message, 2: show process information, 3: show debug messages. DEFAULT:2
  • -q    Minimum FDR (q-value) cutoff for peak detection. DEFAULT: 0.05. -q, and -p are mutually exclusive.
  • -p     Pvalue cutoff for peak detection. DEFAULT: not set. -q, and -p are mutually exclusive. If pvalue cutoff is set, qvalue will not be calculated and reported as -1 in the final .xls file.

出力例

 

_peaks.xlsはピーク情報を含む表形式のファイル。

ピークの開始位置、ピークの終了位置、ピーク領域の長さ、ピーク頂上の絶対位置、ピーク頂上部分でのパイルアップの高さ、ピーク頂上の -log10(p-value)、局所ラムダを持つランダムなポアソン分布に対する、このピーク山頂のfold enrichment、ピーク頂上の -log10(q-value)、name情報となる。座標はBED形式と異なり、”1”ベース。-broadを指定すると、broad peak callingモードではpeak summitが呼ばれないため、XLSのpileup, p-value, q-value, fold changeはピーク領域全体の平均値となる(マニュアルより)。

 

_peaks.narrowPeak

BED6+4形式のファイルで、ピーク位置とpeak summit、p-value、q-valueが書かれている。UCSCのゲノムブラウザに直接読み込むことができる。

 

_summits.bed

BEDフォーマットで、各ピークのsummit位置が書かれている。

 

_peaks.broadPeak

narrowPeakと同様のBED6+3フォーマットだが、10列目のピーク頂上のアノテーションは無い。このファイルおよびgappedPeakファイルは、-broadが有効な場合のみ利用可能(今回は出力されていない)。

_peaks.gappedPeak

BED12+3形式でbroad regionとnarrow peakの両方が含まれる。UCSCのゲノムブラウザに直接読み込むことができる。

 

MAC3では現在12種類のコマンドが用意されています。各サブコマンドのhelpページも用意されているので確認して下さい。

引用

Use model-based Analysis of ChIP-Seq (MACS) to analyze short reads generated by sequencing protein-DNA interactions in embryonic stem cells
Tao Liu

Methods Mol Biol. 2014;1150:81-95

 

Identifying ChIP-seq enrichment using MACS
Jianxing Feng, Tao Liu, Bo Qin, Yong Zhang, Xiaole Shirley Liu

Nat Protoc. 2012 Sep;7(9):1728-40

 

関連