2020 3/1 インストール手順修正
pysamstatsはsamのstatisticsを出力できるツール。pileup出力のほか、一定のbinサイズでの出力もできる。ライブラリとしての活用も視野に設計されている。
インストール
anaconda3.7環境でテストした(macos10.14)。
本体 Github
#ここでは仮想環境に導入する
mamba create -n pysamstats -y
conda activate pysamstats
mamba install -c bioconda pysamstats -y
それ以外の環境ならpipでバージョン指定インストールする。
pip install pysam==0.11.2.2
pip install pysamstats
> pysamstats -h
$ pysamstats -h
Usage: pysamstats [options] FILE
Calculate statistics against genome positions based on sequence alignments
from a SAM or BAM file and print them to stdout.
Options:
-h, --help show this help message and exit
-t TYPE, --type=TYPE Type of statistics to print, one of: alignment_binned,
baseq, baseq_ext, baseq_ext_strand, baseq_strand,
coverage, coverage_binned, coverage_ext,
coverage_ext_binned, coverage_ext_strand, coverage_gc,
coverage_strand, mapq, mapq_binned, mapq_strand, tlen,
tlen_binned, tlen_strand, variation, variation_strand.
-c CHROMOSOME, --chromosome=CHROMOSOME
Chromosome name.
-s START, --start=START
Start position (1-based).
-e END, --end=END End position (1-based).
-z, --zero-based Use zero-based coordinates (default is false, i.e.,
use one-based coords).
-u, --truncate Truncate pileup-based stats so no records are emitted
outside the specified range.
-S STEPPER, --stepper=STEPPER
Stepper to provide to underlying pysam call. Options
are:"all" (default): all reads are returned, except
where flags BAM_FUNMAP, BAM_FSECONDARY, BAM_FQCFAIL,
BAM_FDUP set; "nofilter" applies no filter to returned
reads; "samtools": filter & read processing as in
_csamtools_ pileup. This requires a fasta file. For
complete details see the pysam documentation.
-d, --pad Pad pileup-based stats so a record is emitted for
every position (default is only covered positions).
-D MAX_DEPTH, --max-depth=MAX_DEPTH
Maximum read depth permitted in pileup-based
statistics. The default limit is 8000.
Reference sequence file, only required for some
statistics.
-o, --omit-header Omit header row from output.
-p N, --progress=N Report progress every N rows.
--window-size=N Size of window for binned statistics (default is 300).
--window-offset=N Window offset to use for deciding which genome
position to report binned statistics against. The
default is 150, i.e., the middle of 300bp window.
--format=FORMAT Output format, one of {tsv, csv, hdf5} (defaults to
tsv). N.B., hdf5 requires PyTables to be installed.
--output=OUTPUT Path to output file. If not provided, write to stdout.
--fields=FIELDS Comma-separated list of fields to output (defaults to
all fields).
--hdf5-group=HDF5_GROUP
Name of HDF5 group to write to (defaults to the root
group).
--hdf5-dataset=HDF5_DATASET
Name of HDF5 dataset to create (defaults to "data").
--hdf5-complib=HDF5_COMPLIB
HDF5 compression library (defaults to zlib).
--hdf5-complevel=HDF5_COMPLEVEL
HDF5 compression level (defaults to 5).
--hdf5-chunksize=HDF5_CHUNKSIZE
Size of chunks in number of bytes (defaults to 2**20).
--min-mapq=MIN_MAPQ Only reads with mapping quality equal to or greater
than this value will be counted (0 by default).
--min-baseq=MIN_BASEQ
Only reads with base quality equal to or greater than
this value will be counted (0 by default). Only
applies to pileup-based statistics.
--no-dup Don't count reads flagged as duplicate.
--no-del Don't count reads aligned with a deletion at the given
position. Only applies to pileup-based statistics.
Pileup-based statistics types (each row has statistics over reads in a pileup column):
* coverage - Number of reads aligned to each genome position
(total and properly paired).
* coverage_strand - As coverage but with forward/reverse strand counts.
* coverage_ext - Various additional coverage metrics, including
coverage for reads not properly paired (mate
unmapped, mate on other chromosome, ...).
* coverage_ext_strand - As coverage_ext but with forward/reverse strand counts.
* coverage_gc - As coverage but also includes a column for %GC.
* variation - Numbers of matches, mismatches, deletions,
insertions, etc.
* variation_strand - As variation but with forward/reverse strand counts.
* tlen - Insert size statistics.
* tlen_strand - As tlen but with statistics by forward/reverse strand.
* mapq - Mapping quality statistics.
* mapq_strand - As mapq but with statistics by forward/reverse strand.
* baseq - Base quality statistics.
* baseq_strand - As baseq but with statistics by forward/reverse strand.
* baseq_ext - Extended base quality statistics, including qualities
of bases matching and mismatching reference.
* baseq_ext_strand - As baseq_ext but with statistics by forward/reverse strand.
Binned statistics types (each row has statistics over reads aligned starting within a genome window):
* coverage_binned - As coverage but binned.
* coverage_ext_binned - As coverage_ext but binned.
* mapq_binned - Similar to mapq but binned.
* alignment_binned - Aggregated counts from cigar strings.
* tlen_binned - As tlen but binned.
Examples:
pysamstats --type coverage example.bam > example.coverage.txt
pysamstats --type coverage --chromosome Pf3D7_v3_01 --start 100000 --end 200000 example.bam > example.coverage.txt
Version: 1.1.2 (pysam 0.15.3)
ラン
1、pileup - 1塩基解像度で出力する。
coverage Number of reads aligned to each genome position (total and properly paired).
カバレッジ出力。
pysamstats --type coverage input.bam > output
coverage_ext Various additional coverage metrics, including coverage for reads not properly paired (mate unmapped, mate on other chromosome, ...).
他のカバレッジ評価基準を出力。
pysamstats --type coverage_ext input.bam > output
出力(2行だけ表示)。
chrom pos reads_all reads_pp reads_mate_unmapped reads_mate_other_chr reads_mate_same_strand reads_faceaway reads_softclipped reads_duplicate
chr1 1 8 8 0 0 0 6 2 0
chr1 2 8 8 0 0 0 6 2 0
coverage_gc - As coverage but also includes a column for %GC.
coverageコマンド+GC。
pysamstats --type coverage_gc input.bam --fasta ref.fa > output
chrom pos gc reads_all reads_pp
chr 1 58 8 8
chr 2 58 8 8
chr 3 57 8 8
variation - Numbers of matches, mismatches, deletions, insertions, etc.
ミスマッチやindelなどを含むリードのカバレッジも出力される。
pysamstats --type variation input.bam --fasta ref.fa > output
tlen - Insert size statistics.
インサートサイズ統計データ。
pysamstats --type tlen input.bam > output
mapq - Mapping quality statistics.
マッピングクオリティ統計データ。
pysamstats --type mapq input.bam > output
baseq - Base quality statistics.
ベースクオリティ統計データ。
pysamstats --type baseq input.bam > output
baseq_ext - Extended base quality statistics, including qualities of bases matching and mismatching reference.
拡張ベースクオリティ統計データ。
pysamstats --type baseq_ext input.bam > output
baseq_ext - Extended base quality statistics, including qualities of bases matching and mismatching reference.
拡張ベースクオリティ統計データ。
pysamstats --type baseq_ext input.bam > output
各コマンドには、フォワードマッピングとリバースマッピングを別々に出力する_strandコマンドもある。例えばmapqコマンドをフォワードとリバースそれぞれ出力するmapq_strandがある。
2、一定のwindowサイズ(bin)出力
coverage_binned - As coverage but binned.
binサイズで出力する以外はcoverageコマンドと同じ。ただしリファレンスは必要。
pysamstats --type coverage_binned input.bam --fasta ref.fa > output
他に以下のコマンドがある。binnサイズで出力する以外、上で紹介してきたコマンド群と同じ。
- coverage_ext_binned - As coverage_ext but binned.
- mapq_binned - Similar to mapq but binned.
- alignment_binned - Aggregated counts from cigar strings.
- tlen_binned - As tlen but binned.
例
オプションを付けることで、計算するポジション等を限定できる。
chr1の100000-200000のカバレッジを出力する。
pysamstats --type coverage --chromosome chr1 --start 100000 --end 200000 example.bam > example.coverage.txt
- --chromosome Chromosome name.
- --start Start position (1-based)
- --end End position (1-based).
chr1のカバレッジをbinサイズ100で出力する。MAPQが1以上を対象とする。duplicationはカウントしない。進捗を100行ごとにSTDOUTに出力する。
pysamstats --type coverage_binned --chromosome chr1 --window-size 100 --window-offset 50 --min-mapq 1 --progress 100 --no-dup example.bam > example.coverage.txt
- --window-size Size of window for binned statistics (default is 300).
- --window-offset Window offset to use for deciding which genome position to report binned statistics against. The default is 150, i.e., the middle of 300bp window.
- --min-mapq Only reads with mapping quality equal to or greater than this value will be counted (0 by default).
- --progress Report progress every N rows.
- --no-dup Don't count reads flagged as duplicate.
importして使うこともできます。下記はGithubページに載っているコードで、pysamでbamを開き、pysamstatsでカバレッジ分析を行い、matplotlibでplotするものです。chr1の1-10000を対象にします。
coverage.py
import pysam
import pysamstats
import matplotlib.pyplot as plt
mybam = pysam.AlignmentFile('path/to/input.bam')
a = pysamstats.load_coverage(mybam, chrom='chr1', start=1, end=10000)
plt.plot(a.pos, a.reads_all)
plt.show()
#実行
python coverage.py
matplotlibのpyplot出力。
グラフ描画は、好みに応じてseabornなども使えばいいんじゃないでしょうか?
引用