macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

BBtoolsを使い各クロモソームのカバレッジを計算する

 

様々な方法があるが、正確に出すのは意外に難しい(例えばsamtools mpileupは0カバレッジをカウントしない)。ここではBBtoolsのpileup.shを使い各クロモソームのカバレッジを個別に計算するコマンドを紹介する。

 

追記

2/26 コマンド修正 

 

インストール

 BBtoolsはbrewで導入できます。

brew install BBtools

>  pileup.sh -h

$ pileup.sh -h

 

Written by Brian Bushnell

Last modified December 5, 2016

 

Description:  Calculates per-scaffold coverage information from an unsorted sam or bam file.

 

Usage:        pileup.sh in=<input> out=<output>

 

 

Input Parameters:

in=<file>           The input sam file; this is the only required parameter.

ref=<file>          Scans a reference fasta for per-scaffold GC counts, not otherwise needed.

fastaorf=<file>     An optional fasta file with ORF header information in PRODIGAL's output format.  Must also specify 'outorf'.

unpigz=t            Decompress with pigz for faster decompression.

 

Output Parameters:

out=<file>          (covstats) Per-scaffold coverage info.

rpkm=<file>         Per-scaffold RPKM/FPKM counts.

twocolumn=f         Change to true to print only ID and Avg_fold instead of all 6 columns.

countgc=t           Enable/disable counting of read GC content.

outorf=<file>       Per-orf coverage info to this file (only if 'fastaorf' is specified).

outsam=<file>       Print the input sam stream to this file (or stdout).  Useful for piping data.

hist=<file>         Histogram of # occurrences of each depth level.

basecov=<file>      Coverage per base location.

bincov=<file>       Binned coverage per location (one line per X bases).

binsize=1000        Binsize for binned coverage output.

keepshortbins=t     (ksb) Keep residual bins shorter than binsize.

normcov=<file>      Normalized coverage by normalized location (X lines per scaffold).

normcovo=<file>     Overall normalized coverage by normalized location (X lines for the entire assembly).

normb=-1            If positive, use a fixed number of bins per scaffold; affects 'normcov' and 'normcovo'.

normc=f             Normalize coverage to fraction of max per scaffold; affects 'normcov' and 'normcovo'.

delta=f             Only print base coverage lines when the coverage differs from the previous base.

nzo=f               Only print scaffolds with nonzero coverage.

concise=f           Write 'basecov' in a more concise format.

header=t            (hdr) Include headers in output files.

headerpound=t       (#) Prepend header lines with '#' symbol.

stdev=t             Calculate coverage standard deviation.

covminscaf=0        (minscaf) Don't print coverage for scaffolds shorter than this.

covwindow=0         Calculate how many bases are in windows of this size with

                    low average coverage.  Produces an extra stats column.

covwindowavg=5      Average coverage below this will be classified as low.

 

Processing Parameters:

strandedcov=f       Track coverage for plus and minus strand independently.

startcov=f          Only track start positions of reads.

secondary=t         Use secondary alignments, if present.

softclip=f          Include soft-clipped bases in coverage.

minmapq=0           (minq) Ignore alignments with mapq below this.

physical=f          (physcov) Calculate physical coverage for paired reads.  This includes the unsequenced bases.

tlen=t              Track physical coverage from the tlen field rather than recalculating it.

arrays=auto         Set to t/f to manually force the use of coverage arrays.  Arrays and bitsets are mutually exclusive.

bitsets=auto        Set to t/f to manually force the use of coverage bitsets.

32bit=f             Set to true if you need per-base coverage over 64k; does not affect per-scaffold coverage precision.

                    This option will double RAM usage (when calculating per-base coverage).

delcoverage=t       (delcov) Count bases covered by deletions as covered.

                    True is faster than false.

samstreamer=t       (ss) Load reads multithreaded to increase speed.

 

Java Parameters:

-Xmx                This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                    -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

 

Output format (tab-delimited):

ID, Avg_fold, Length, Ref_GC, Covered_percent, Covered_bases, Plus_reads, Minus_reads, Read_GC, Median_fold, Std_Dev

 

ID:                Scaffold ID

Length:            Scaffold length

Ref_GC:            GC ratio of reference

Avg_fold:          Average fold coverage of this scaffold

Covered_percent:   Percent of scaffold with any coverage (only if arrays or bitsets are used)

Covered_bases:     Number of bases with any coverage (only if arrays or bitsets are used)

Plus_reads:        Number of reads mapped to plus strand

Minus_reads:       Number of reads mapped to minus strand

Read_GC:           Average GC ratio of reads mapped to this scaffold

Median_fold:       Median fold coverage of this scaffold (only if arrays are used)

Std_Dev:           Standard deviation of coverage (only if arrays are used)

 

Notes:

 

Only supports SAM format for reads and FASTA for reference (though either may be gzipped).

Sorting is not needed, so output may be streamed directly from a mapping program.

Requires approximately 1 bit per reference base plus 100 bytes per scaffold (even if no reference is specified).

This script will attempt to autodetect and correctly specify the -Xmx parameter to use all memory on the target node.

If this fails with a message including 'Error: Could not create the Java Virtual Machine.', then...

Please decrease the -Xmx parameter.  It should be set to around 85% of the available memory.

For example, -Xmx20g needs around 23 GB of virtual (and physical) memory when qsubbed.

If the program fails with a message including 'java.lang.OutOfMemoryError:', then...

-Xmx needs to be increased, which probably also means it needs to be qsubbed with a higher memory allocation.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

ラン

各クロモソームのカバレッジを計算する。.samだけでなく.bamも使用できる。

pileup.sh in=input.sam ref=ref.fa out=coverage.txt

出力を確認する。

f:id:kazumaxneo:20180224200815j:plain

2カラム出力

pileup.sh in=input.sam ref=ref.fa out=coverage.txt twocolumn=t

出力を整形表示。

 > column -t coverage.txt

#ID  Avg_fold

1    20.8505

2    21.0910

3    34.3087

4    22.9013

5    35.0498

6    143.9705

7    129.3246

8    101.3382

他にもたくさんオプションがあり、rpkmなども計算できるようです。

 

BBtoolsは以前紹介しています。

 

引用

BBMap (aligner for DNA/RNAseq) is now open-source and available for download.

 

Biostars

Question: Tools To Calculate Average Coverage For A Bam File?

https://www.biostars.org/p/5165/