macでインフォマティクス

macでインフォマティクス

NGS関連のインフォマティクス情報についてまとめています。

bam/samのカバレッジなどを計算する pysamstats

 

pysamstatsはsamのstatisticsを出力できるツール。pileup出力のほか、一定のbinサイズでの出力もできる。ライブラリとしての活用も視野に設計されている。

 

インストール

本体 Github

Anacondaなどの環境下なら、condaで導入する。

conda install -c bioconda pysamstats

それ以外の環境ならpipでバージョン指定インストールする。

pip install pysam==0.11.2.2
pip install pysamstats

pysamstats -h

$ pysamstats -h

Usage: pysamstats [options] FILE

 

Calculate statistics against genome positions based on sequence alignments

from a SAM or BAM file and print them to stdout.

 

Options:

  -h, --help            show this help message and exit

  -t TYPE, --type=TYPE  Type of statistics to print, one of: alignment_binned,

                        baseq, baseq_ext, baseq_ext_strand, baseq_strand,

                        coverage, coverage_binned, coverage_ext,

                        coverage_ext_binned, coverage_ext_strand, coverage_gc,

                        coverage_strand, mapq, mapq_binned, mapq_strand, tlen,

                        tlen_binned, tlen_strand, variation, variation_strand.

  -c CHROMOSOME, --chromosome=CHROMOSOME

                        Chromosome name.

  -s START, --start=START

                        Start position (1-based).

  -e END, --end=END     End position (1-based).

  -z, --zero-based      Use zero-based coordinates (default is false, i.e.,

                        use one-based coords).

  -u, --truncate        Truncate pileup-based stats so no records are emitted

                        outside the specified range.

  -d, --pad             Pad pileup-based stats so a record is emitted for

                        every position (default is only covered positions).

  -D MAX_DEPTH, --max-depth=MAX_DEPTH

                        Maximum read depth permitted in pileup-based

                        statistics. The default limit is 8000.

  -f FASTA, --fasta=FASTA

                        Reference sequence file, only required for some

                        statistics.

  -o, --omit-header     Omit header row from output.

  -p N, --progress=N    Report progress every N rows.

  --window-size=N       Size of window for binned statistics (default is 300).

  --window-offset=N     Window offset to use for deciding which genome

                        position to report binned statistics against. The

                        default is 150, i.e., the middle of 300bp window.

  --format=FORMAT       Output format, one of {tsv, csv, hdf5} (defaults to

                        tsv). N.B., hdf5 requires PyTables to be installed.

  --output=OUTPUT       Path to output file. If not provided, write to stdout.

  --fields=FIELDS       Comma-separated list of fields to output (defaults to

                        all fields).

  --hdf5-group=HDF5_GROUP

                        Name of HDF5 group to write to (defaults to the root

                        group).

  --hdf5-dataset=HDF5_DATASET

                        Name of HDF5 dataset to create (defaults to "data").

  --hdf5-complib=HDF5_COMPLIB

                        HDF5 compression library (defaults to zlib).

  --hdf5-complevel=HDF5_COMPLEVEL

                        HDF5 compression level (defaults to 5).

  --hdf5-chunksize=HDF5_CHUNKSIZE

                        Size of chunks in number of bytes (defaults to 2**20).

  --min-mapq=MIN_MAPQ   Only reads with mapping quality equal to or greater

                        than this value will be counted (0 by default).

  --min-baseq=MIN_BASEQ

                        Only reads with base quality equal to or greater than

                        this value will be counted (0 by default). Only

                        applies to pileup-based statistics.

  --no-dup              Don't count reads flagged as duplicate.

  --no-del              Don't count reads aligned with a deletion at the given

                        position. Only applies to pileup-based statistics.

 

Pileup-based statistics types (each row has statistics over reads in a pileup column):

 

    * coverage            - Number of reads aligned to each genome position

                            (total and properly paired).

    * coverage_strand     - As coverage but with forward/reverse strand counts.

    * coverage_ext        - Various additional coverage metrics, including

                            coverage for reads not properly paired (mate

                            unmapped, mate on other chromosome, ...).

    * coverage_ext_strand - As coverage_ext but with forward/reverse strand counts.

    * coverage_gc         - As coverage but also includes a column for %GC.

    * variation           - Numbers of matches, mismatches, deletions,

                            insertions, etc.

    * variation_strand    - As variation but with forward/reverse strand counts.

    * tlen                - Insert size statistics.

    * tlen_strand         - As tlen but with statistics by forward/reverse strand.

    * mapq                - Mapping quality statistics.

    * mapq_strand         - As mapq but with statistics by forward/reverse strand.

    * baseq               - Base quality statistics.

    * baseq_strand        - As baseq but with statistics by forward/reverse strand.

    * baseq_ext           - Extended base quality statistics, including qualities

                            of bases matching and mismatching reference.

    * baseq_ext_strand    - As baseq_ext but with statistics by forward/reverse strand.

 

Binned statistics types (each row has statistics over reads aligned starting within a genome window):

 

    * coverage_binned     - As coverage but binned.

    * coverage_ext_binned - As coverage_ext but binned.

    * mapq_binned         - Similar to mapq but binned.

    * alignment_binned    - Aggregated counts from cigar strings.

    * tlen_binned         - As tlen but binned.

 

Examples:

 

    pysamstats --type coverage example.bam > example.coverage.txt

    pysamstats --type coverage --chromosome Pf3D7_v3_01 --start 100000 --end 200000 example.bam > example.coverage.txt

 

Version: 1.0.1 (pysam 0.14.1)

 

 

 

ラン

1、pileupベースのツール。1塩基解像度で出力する。

coverage   Number of reads aligned to each genome position (total and properly paired).

カバレッジ出力。

pysamstats --type coverage input.bam > output

coverage_ext   Various additional coverage metrics, including coverage for reads not properly paired (mate unmapped, mate on other chromosome, ...).

他のカバレッジ評価基準を出力。

pysamstats --type coverage_ext input.bam > output

 出力(2行だけ表示)。

chrom        pos      reads_all  reads_pp  reads_mate_unmapped  reads_mate_other_chr  reads_mate_same_strand  reads_faceaway  reads_softclipped  reads_duplicate

chr1  1        8          8         0                    0                     0                       6               2                  0

chr1  2        8          8         0                    0                     0                       6               2                  0

coverage_gc - As coverage but also includes a column for %GC.

coverageコマンド+GC

pysamstats --type coverage_gc input.bam --fasta ref.fa > output

chrom        pos  gc  reads_all  reads_pp

chr          1    58  8          8

chr          2    58  8          8

chr          3    57  8          8

variation - Numbers of matches, mismatches, deletions, insertions, etc.

ミスマッチやindelなどを含むリードのカバレッジも出力される。 

pysamstats --type variation input.bam --fasta ref.fa > output

f:id:kazumaxneo:20180627222814p:plain

tlen - Insert size statistics.

インサートサイズ統計データ。

pysamstats --type tlen input.bam > output

mapq - Mapping quality statistics.

マッピングクオリティ統計データ。

pysamstats --type mapq input.bam > output

baseq - Base quality statistics.

ベースクオリティ統計データ。

pysamstats --type baseq input.bam > output

baseq_ext - Extended base quality statistics, including qualities of bases matching and mismatching reference.

拡張ベースクオリティ統計データ。

pysamstats --type baseq_ext input.bam > output

baseq_ext - Extended base quality statistics, including qualities of bases matching and mismatching reference.

拡張ベースクオリティ統計データ。

pysamstats --type baseq_ext input.bam > output

各コマンドには、フォワードマッピングとリバースマッピングを別々に出力する_strandコマンドもある。例えばmapqコマンドをフォワードとリバースそれぞれ出力するmapq_strandがある。 

 

 2、一定のwindowサイズで出力する。計算時間も早くなる。

coverage_binned - As coverage but binned.

binサイズで出力する以外はcoverageコマンドと同じ。ただしリファレンスは必要。

pysamstats --type coverage_binned input.bam --fasta ref.fa > output

他に以下のコマンドがある。binnサイズで出力する以外、上で紹介してきたコマンド群と同じ。

  • coverage_ext_binned - As coverage_ext but binned.
  • mapq_binned - Similar to mapq but binned.
  • alignment_binned - Aggregated counts from cigar strings.
  • tlen_binned - As tlen but binned.

 

 

オプションを付けることで、計算するポジション等を限定できる。

chr1の100000-200000のカバレッジを出力する。

pysamstats --type coverage --chromosome chr1 --start 100000 --end 200000 example.bam > example.coverage.txt
  • --chromosome   Chromosome name.
  • --start    Start position (1-based)
  • --end      End position (1-based).

 chr1のカバレッジをbinサイズ100で出力する。MAPQが1以上を対象とする。duplicationはカウントしない。進捗を100行ごとにSTDOUTに出力する。

pysamstats --type coverage_binned --chromosome chr1 --window-size 100 --window-offset 50 --min-mapq 1 --progress 100 --no-dup example.bam > example.coverage.txt
  • --window-size   Size of window for binned statistics (default is 300). 
  •  --window-offset    Window offset to use for deciding which genome position to report binned statistics against. The default is 150, i.e., the middle of 300bp window.
  • --min-mapq    Only reads with mapping quality equal to or greater than this value will be counted (0 by default).
  • --progress   Report progress every N rows.
  • --no-dup    Don't count reads flagged as duplicate.

 

importして使うこともできます。下記はGithubページに載っているコードで、pysamでbamを開き、pysamstatsでカバレッジ分析を行い、matplotlibでplotするものです。chr1の1-10000を対象にします。

coverage.py

import pysam 
import pysamstats
import matplotlib.pyplot as plt

mybam = pysam.AlignmentFile('path/to/input.bam')
a = pysamstats.load_coverage(mybam, chrom='chr1', start=1, end=10000)
plt.plot(a.pos, a.reads_all)
plt.show()

#実行

python coverage.py

 

matplotlibのpyplot出力。

f:id:kazumaxneo:20180628011431p:plain

グラフ描画は、好みに応じてseabornなども使えばいいんじゃないでしょうか?

 

引用

https://github.com/alimanfoo/pysamstats