様々な方法があるが、正確に出すのは意外に難しい(例えばsamtools mpileupは0カバレッジをカウントしない)。ここではBBtoolsのpileup.shを使い各クロモソームのカバレッジを個別に計算するコマンドを紹介する。
追記
2/26 コマンド修正
インストール
BBtoolsはbrewで導入できます。
brew install BBtools
> pileup.sh -h
$ pileup.sh -h
Written by Brian Bushnell
Last modified December 5, 2016
Description: Calculates per-scaffold coverage information from an unsorted sam or bam file.
Usage: pileup.sh in=<input> out=<output>
Input Parameters:
in=<file> The input sam file; this is the only required parameter.
ref=<file> Scans a reference fasta for per-scaffold GC counts, not otherwise needed.
fastaorf=<file> An optional fasta file with ORF header information in PRODIGAL's output format. Must also specify 'outorf'.
unpigz=t Decompress with pigz for faster decompression.
Output Parameters:
out=<file> (covstats) Per-scaffold coverage info.
rpkm=<file> Per-scaffold RPKM/FPKM counts.
twocolumn=f Change to true to print only ID and Avg_fold instead of all 6 columns.
countgc=t Enable/disable counting of read GC content.
outorf=<file> Per-orf coverage info to this file (only if 'fastaorf' is specified).
outsam=<file> Print the input sam stream to this file (or stdout). Useful for piping data.
hist=<file> Histogram of # occurrences of each depth level.
basecov=<file> Coverage per base location.
bincov=<file> Binned coverage per location (one line per X bases).
binsize=1000 Binsize for binned coverage output.
keepshortbins=t (ksb) Keep residual bins shorter than binsize.
normcov=<file> Normalized coverage by normalized location (X lines per scaffold).
normcovo=<file> Overall normalized coverage by normalized location (X lines for the entire assembly).
normb=-1 If positive, use a fixed number of bins per scaffold; affects 'normcov' and 'normcovo'.
normc=f Normalize coverage to fraction of max per scaffold; affects 'normcov' and 'normcovo'.
delta=f Only print base coverage lines when the coverage differs from the previous base.
nzo=f Only print scaffolds with nonzero coverage.
concise=f Write 'basecov' in a more concise format.
header=t (hdr) Include headers in output files.
headerpound=t (#) Prepend header lines with '#' symbol.
stdev=t Calculate coverage standard deviation.
covminscaf=0 (minscaf) Don't print coverage for scaffolds shorter than this.
covwindow=0 Calculate how many bases are in windows of this size with
low average coverage. Produces an extra stats column.
covwindowavg=5 Average coverage below this will be classified as low.
Processing Parameters:
strandedcov=f Track coverage for plus and minus strand independently.
startcov=f Only track start positions of reads.
secondary=t Use secondary alignments, if present.
softclip=f Include soft-clipped bases in coverage.
minmapq=0 (minq) Ignore alignments with mapq below this.
physical=f (physcov) Calculate physical coverage for paired reads. This includes the unsequenced bases.
tlen=t Track physical coverage from the tlen field rather than recalculating it.
arrays=auto Set to t/f to manually force the use of coverage arrays. Arrays and bitsets are mutually exclusive.
bitsets=auto Set to t/f to manually force the use of coverage bitsets.
32bit=f Set to true if you need per-base coverage over 64k; does not affect per-scaffold coverage precision.
This option will double RAM usage (when calculating per-base coverage).
delcoverage=t (delcov) Count bases covered by deletions as covered.
True is faster than false.
samstreamer=t (ss) Load reads multithreaded to increase speed.
Java Parameters:
-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
Output format (tab-delimited):
ID, Avg_fold, Length, Ref_GC, Covered_percent, Covered_bases, Plus_reads, Minus_reads, Read_GC, Median_fold, Std_Dev
ID: Scaffold ID
Length: Scaffold length
Avg_fold: Average fold coverage of this scaffold
Covered_percent: Percent of scaffold with any coverage (only if arrays or bitsets are used)
Covered_bases: Number of bases with any coverage (only if arrays or bitsets are used)
Plus_reads: Number of reads mapped to plus strand
Minus_reads: Number of reads mapped to minus strand
Read_GC: Average GC ratio of reads mapped to this scaffold
Median_fold: Median fold coverage of this scaffold (only if arrays are used)
Std_Dev: Standard deviation of coverage (only if arrays are used)
Notes:
Only supports SAM format for reads and FASTA for reference (though either may be gzipped).
Sorting is not needed, so output may be streamed directly from a mapping program.
Requires approximately 1 bit per reference base plus 100 bytes per scaffold (even if no reference is specified).
This script will attempt to autodetect and correctly specify the -Xmx parameter to use all memory on the target node.
If this fails with a message including 'Error: Could not create the Java Virtual Machine.', then...
Please decrease the -Xmx parameter. It should be set to around 85% of the available memory.
For example, -Xmx20g needs around 23 GB of virtual (and physical) memory when qsubbed.
If the program fails with a message including 'java.lang.OutOfMemoryError:', then...
-Xmx needs to be increased, which probably also means it needs to be qsubbed with a higher memory allocation.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
ラン
各クロモソームのカバレッジを計算する。.samだけでなく.bamも使用できる。
pileup.sh in=input.sam ref=ref.fa out=coverage.txt
出力を確認する。
2カラム出力
pileup.sh in=input.sam ref=ref.fa out=coverage.txt twocolumn=t
出力を整形表示。
> column -t coverage.txt
#ID Avg_fold
1 20.8505
2 21.0910
3 34.3087
4 22.9013
5 35.0498
6 143.9705
7 129.3246
8 101.3382
他にもたくさんオプションがあり、rpkmなども計算できるようです。
BBtoolsは以前紹介しています。
引用
BBMap (aligner for DNA/RNAseq) is now open-source and available for download.
Biostars
Question: Tools To Calculate Average Coverage For A Bam File?
https://www.biostars.org/p/5165/