bam/samのカバレッジなどを計算する pysamstats

2020 3/1 インストール手順修正

pysamstatsはsamのstatisticsを出力できるツール。pileup出力のほか、一定のbinサイズでの出力もできる。ライブラリとしての活用も視野に設計されている。

インストール

anaconda3.7環境でテストした（macos10.14）。

本体　Github

#ここでは仮想環境に導入する
mamba create -n pysamstats -y
conda activate pysamstats
mamba install -c bioconda pysamstats -y

それ以外の環境ならpipでバージョン指定インストールする。

pip install pysam==0.11.2.2
pip install pysamstats

> pysamstats -h

$ pysamstats -h

Usage: pysamstats [options] FILE

Calculate statistics against genome positions based on sequence alignments

from a SAM or BAM file and print them to stdout.

Options:

-h, --help show this help message and exit

-t TYPE, --type=TYPE Type of statistics to print, one of: alignment_binned,

baseq, baseq_ext, baseq_ext_strand, baseq_strand,

coverage, coverage_binned, coverage_ext,

coverage_ext_binned, coverage_ext_strand, coverage_gc,

coverage_strand, mapq, mapq_binned, mapq_strand, tlen,

tlen_binned, tlen_strand, variation, variation_strand.

-c CHROMOSOME, --chromosome=CHROMOSOME

Chromosome name.

-s START, --start=START

Start position (1-based).

-e END, --end=END End position (1-based).

-z, --zero-based Use zero-based coordinates (default is false, i.e.,

use one-based coords).

-u, --truncate Truncate pileup-based stats so no records are emitted

outside the specified range.

-S STEPPER, --stepper=STEPPER

Stepper to provide to underlying pysam call. Options

are:"all" (default): all reads are returned, except

where flags BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,

BAM_FDUP set; "nofilter" applies no filter to returned

reads; "samtools": filter & read processing as in

_csamtools_ pileup. This requires a fasta file. For

complete details see the pysam documentation.

-d, --pad Pad pileup-based stats so a record is emitted for

every position (default is only covered positions).

-D MAX_DEPTH, --max-depth=MAX_DEPTH

Maximum read depth permitted in pileup-based

statistics. The default limit is 8000.

-f FASTA, --fasta=FASTA

Reference sequence file, only required for some

statistics.

-o, --omit-header Omit header row from output.

-p N, --progress=N Report progress every N rows.

--window-size=N Size of window for binned statistics (default is 300).

--window-offset=N Window offset to use for deciding which genome

position to report binned statistics against. The

default is 150, i.e., the middle of 300bp window.

--format=FORMAT Output format, one of {tsv, csv, hdf5} (defaults to

tsv). N.B., hdf5 requires PyTables to be installed.

--output=OUTPUT Path to output file. If not provided, write to stdout.

--fields=FIELDS Comma-separated list of fields to output (defaults to

all fields).

--hdf5-group=HDF5_GROUP

Name of HDF5 group to write to (defaults to the root

group).

--hdf5-dataset=HDF5_DATASET

Name of HDF5 dataset to create (defaults to "data").

--hdf5-complib=HDF5_COMPLIB

HDF5 compression library (defaults to zlib).

--hdf5-complevel=HDF5_COMPLEVEL

HDF5 compression level (defaults to 5).

--hdf5-chunksize=HDF5_CHUNKSIZE

Size of chunks in number of bytes (defaults to 2**20).

--min-mapq=MIN_MAPQ Only reads with mapping quality equal to or greater

than this value will be counted (0 by default).

--min-baseq=MIN_BASEQ

Only reads with base quality equal to or greater than

this value will be counted (0 by default). Only

applies to pileup-based statistics.

--no-dup Don't count reads flagged as duplicate.

--no-del Don't count reads aligned with a deletion at the given

position. Only applies to pileup-based statistics.

Pileup-based statistics types (each row has statistics over reads in a pileup column):

* coverage - Number of reads aligned to each genome position

(total and properly paired).

* coverage_strand - As coverage but with forward/reverse strand counts.

* coverage_ext - Various additional coverage metrics, including

coverage for reads not properly paired (mate

unmapped, mate on other chromosome, ...).

* coverage_ext_strand - As coverage_ext but with forward/reverse strand counts.

* coverage_gc - As coverage but also includes a column for %GC.

* variation - Numbers of matches, mismatches, deletions,

insertions, etc.

* variation_strand - As variation but with forward/reverse strand counts.

* tlen - Insert size statistics.

* tlen_strand - As tlen but with statistics by forward/reverse strand.

* mapq - Mapping quality statistics.

* mapq_strand - As mapq but with statistics by forward/reverse strand.

* baseq - Base quality statistics.

* baseq_strand - As baseq but with statistics by forward/reverse strand.

* baseq_ext - Extended base quality statistics, including qualities

of bases matching and mismatching reference.

* baseq_ext_strand - As baseq_ext but with statistics by forward/reverse strand.

Binned statistics types (each row has statistics over reads aligned starting within a genome window):

* coverage_binned - As coverage but binned.

* coverage_ext_binned - As coverage_ext but binned.

* mapq_binned - Similar to mapq but binned.

* alignment_binned - Aggregated counts from cigar strings.

* tlen_binned - As tlen but binned.

Examples:

pysamstats --type coverage example.bam > example.coverage.txt

pysamstats --type coverage --chromosome Pf3D7_v3_01 --start 100000 --end 200000 example.bam > example.coverage.txt

Version: 1.1.2 (pysam 0.15.3)

ラン

1、pileup - １塩基解像度で出力する。

coverage Number of reads aligned to each genome position (total and properly paired).

カバレッジ出力。

pysamstats --type coverage input.bam > output

coverage_ext Various additional coverage metrics, including coverage for reads not properly paired (mate unmapped, mate on other chromosome, ...).

他のカバレッジ評価基準を出力。

pysamstats --type coverage_ext input.bam > output

出力（２行だけ表示）。

chrom pos reads_all reads_pp reads_mate_unmapped reads_mate_other_chr reads_mate_same_strand reads_faceaway reads_softclipped reads_duplicate

chr1 1 8 8 0 0 0 6 2 0

chr1 2 8 8 0 0 0 6 2 0

coverage_gc - As coverage but also includes a column for %GC.

coverageコマンド+GC。

pysamstats --type coverage_gc input.bam --fasta ref.fa > output

chrom pos gc reads_all reads_pp

chr 1 58 8 8

chr 2 58 8 8

chr 3 57 8 8

variation - Numbers of matches, mismatches, deletions, insertions, etc.

ミスマッチやindelなどを含むリードのカバレッジも出力される。

pysamstats --type variation input.bam --fasta ref.fa > output

f:id:kazumaxneo:20180627222814p:plain

tlen - Insert size statistics.

インサートサイズ統計データ。

pysamstats --type tlen input.bam > output

mapq - Mapping quality statistics.

マッピングクオリティ統計データ。

pysamstats --type mapq input.bam > output

baseq - Base quality statistics.

ベースクオリティ統計データ。

pysamstats --type baseq input.bam > output

baseq_ext - Extended base quality statistics, including qualities of bases matching and mismatching reference.

拡張ベースクオリティ統計データ。

pysamstats --type baseq_ext input.bam > output

baseq_ext - Extended base quality statistics, including qualities of bases matching and mismatching reference.

拡張ベースクオリティ統計データ。

pysamstats --type baseq_ext input.bam > output

各コマンドには、フォワードマッピングとリバースマッピングを別々に出力する_strandコマンドもある。例えばmapqコマンドをフォワードとリバースそれぞれ出力するmapq_strandがある。

２、一定のwindowサイズ(bin)出力

coverage_binned - As coverage but binned.

binサイズで出力する以外はcoverageコマンドと同じ。ただしリファレンスは必要。

pysamstats --type coverage_binned input.bam --fasta ref.fa > output

他に以下のコマンドがある。binnサイズで出力する以外、上で紹介してきたコマンド群と同じ。

coverage_ext_binned - As coverage_ext but binned.
mapq_binned - Similar to mapq but binned.
alignment_binned - Aggregated counts from cigar strings.
tlen_binned - As tlen but binned.

例

オプションを付けることで、計算するポジション等を限定できる。

chr1の100000-200000のカバレッジを出力する。

pysamstats --type coverage --chromosome chr1 --start 100000 --end 200000 example.bam > example.coverage.txt

--chromosome Chromosome name.
--start Start position (1-based)
--end End position (1-based).

chr1のカバレッジをbinサイズ100で出力する。MAPQが1以上を対象とする。duplicationはカウントしない。進捗を100行ごとにSTDOUTに出力する。

pysamstats --type coverage_binned --chromosome chr1 --window-size 100 --window-offset 50 --min-mapq 1 --progress 100 --no-dup example.bam > example.coverage.txt

--window-size Size of window for binned statistics (default is 300).
--window-offset Window offset to use for deciding which genome position to report binned statistics against. The default is 150, i.e., the middle of 300bp window.
--min-mapq Only reads with mapping quality equal to or greater than this value will be counted (0 by default).
--progress Report progress every N rows.
--no-dup Don't count reads flagged as duplicate.

importして使うこともできます。下記はGithubページに載っているコードで、pysamでbamを開き、pysamstatsでカバレッジ分析を行い、matplotlibでplotするものです。chr1の1-10000を対象にします。

coverage.py

import pysam 
import pysamstats 
import matplotlib.pyplot as plt

mybam = pysam.AlignmentFile('path/to/input.bam') 
a = pysamstats.load_coverage(mybam, chrom='chr1', start=1, end=10000) 
plt.plot(a.pos, a.reads_all) 
plt.show()

#実行

python coverage.py

matplotlibのpyplot出力。

f:id:kazumaxneo:20180628011431p:plain