高速なbam/samの解析ツール Sambamba - macでインフォマティクス

2018 8/20 mpileupコマンドの謝り修正

2019 2/26 condaインストール追記

2021 6/2 help更新

Sambambaはsam、bam、cramの処理ツール。D言語で構築されている。フォーマットを変えたり、フィルタリングすることができる。SAMToolsやPicard-toolsの一部機能と重複するが、Sambambaは並列化に対応しており、SAMToolsより高速に動作するとされる。特にmpileupなどは並列化の恩恵が大きいと思われる。分割して並列処理していることで、今まで数日かかっていた処理が数時間で終わる。本ツールはヒトゲノムの高速なバリアント解析ツールspeedseqにも使われている（紹介）。

公式サイト

http://lomereiter.github.io/sambamba/

SAMtoolsとの比較

https://github.com/biod/sambamba/wiki/Comparison-with-samtools

Github

https://github.com/biod/sambamba

コマンド

https://github.com/biod/sambamba/wiki/Command-line-tools

フィルタリングの仕方

https://github.com/biod/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax

インストール

依存

samtools
bcftools

本体

Githubでバイナリが配布されている。brewやcondaでインストールできる。

#anacondaを使っているなら
conda install -c bioconda -y sambamba

brew install sambamba
sambanba view  #動作確認。例えばviewのヘルプを見る。

> sambamba view -h

# sambamba view -h

Usage: sambamba-view [options] <input.bam | input.sam> [region1 [...]]

Options: -F, --filter=FILTER

set custom filter for alignments

--num-filter=NUMFILTER

filter flag bits; 'i1/i2' corresponds to -f i1 -F i2 samtools arguments;

either of the numbers can be omitted

-f, --format=sam|bam|cram|json

specify which format to use for output (default is SAM)

-h, --with-header

print header before reads (always done for BAM output)

-H, --header

output only header to stdout (if format=bam, the header is printed as SAM)

-I, --reference-info

output to stdout only reference names and lengths in JSON

-L, --regions=FILENAME

output only reads overlapping one of regions from the BED file

-c, --count

output to stdout only count of matching records, hHI are ignored

-v, --valid

output only valid alignments

-S, --sam-input

specify that input is in SAM format

-C, --cram-input

specify that input is in CRAM format

-T, --ref-filename=FASTA

specify reference for writing CRAM

-p, --show-progress

show progressbar in STDERR (works only for BAM files with no regions specified)

-l, --compression-level

specify compression level (from 0 to 9, works only for BAM output)

-o, --output-filename

specify output filename

-t, --nthreads=NTHREADS

maximum number of threads to use

-s, --subsample=FRACTION

subsample reads (read pairs)

--subsampling-seed=SEED

set seed for subsampling

> sambamba sort

# sambamba sort

Usage: sambamba-sort [options] <input.bam>

Options: -m, --memory-limit=LIMIT

approximate total memory limit for all threads (by default 2GB)

--tmpdir=TMPDIR

directory for storing intermediate files; default is system directory for temporary files

-o, --out=OUTPUTFILE

output file name; if not provided, the result is written to a file with .sorted.bam extension

-n, --sort-by-name

sort by read name instead of coordinate (lexicographical order)

-N, --natural-sort

sort by read name instead of coordinate (so-called 'natural' sort as in samtools)

-l, --compression-level=COMPRESSION_LEVEL

level of compression for sorted BAM, from 0 to 9

-u, --uncompressed-chunks

write sorted chunks as uncompressed BAM (default is writing with compression level 1), that might be faster in some cases but uses more disk space

-p, --show-progress

show progressbar in STDERR

-t, --nthreads=NTHREADS

use specified number of threads

-F, --filter=FILTER

keep only reads that satisfy FILTER

> sambamba mpileup

# sambamba mpileup

usage: sambamba-pileup [options] input.bam [input2.bam [...]]

[--samtools <samtools mpileup args>]

[--bcftools <bcftools call args>]

This subcommand relies on external tools and acts as a multi-core

implementation of samtools and bcftools.

Therefore, the following tools should be present in $PATH:

* samtools

* bcftools (when used)

If --samtools is skipped, samtools mpileup is called with default arguments

If --bcftools is used without parameters, samtools is called with

switch '-gu' and bcftools is called as 'bcftools view -'

If --bcftools is skipped, bcftools is not called

Sambamba splits input BAM files into chunks and feeds them

to samtools mpileup and, optionally, bcftools in parallel.

The chunks are slightly overlapping so that variant calling

should not be impacted by these manipulations. The obtained results

from the multiple processes are combined as ordered output.

Sambamba options:

-L, --regions=FILENAME

provide BED file with regions

(no need to duplicate it in samtools args);

all input files must be indexed

-o, --output-filename=<STDOUT>

specify output filename

--tmpdir=TMPDIR

directory for temporary files

-t, --nthreads=NTHREADS

maximum number of threads to use

-b, --buffer-size=64_000_000

chunk size (in bytes)

-B, --output-buffer-size=512_000_000

output buffer size (in bytes)

Sambamba paths:

sambamba-pileup: failed to locate samtools executable in PATH

> sambamba merge

# sambamba merge

Usage: sambamba-merge [options] <output.bam> <input1.bam> <input2.bam> [...]

Options: -t, --nthreads=NTHREADS

number of threads to use for compression/decompression

-l, --compression-level=COMPRESSION_LEVEL

level of compression for merged BAM file, number from 0 to 9

-H, --header

output merged header to stdout in SAM format, other options are ignored; mainly for debug purposes

-p, --show-progress

show progress bar in STDERR

-F, --filter=FILTER

keep only reads that satisfy FILTER

> sambamba index

# sambamba index

Usage: sambamba-index [OPTIONS] <input.bam|input.cram> [output_file]

Creates index for a BAM or CRAM file

Options: -t, --nthreads=NTHREADS

number of threads to use for decompression

-p, --show-progress

show progress bar in STDERR

-c, --check-bins

check that bins are set correctly

-C, --cram-input

specify that input is in CRAM format

> sambamba markup

# sambamba markup

sambamba 0.6.6

Usage: sambamba [command] [args...]

Available commands: 'view', 'index', 'merge', 'sort',

'flagstat', 'slice', 'markdup', 'depth', 'mpileup'

To get help on a particular command, just call it without args.

Leave bug reports and feature requests at

https://github.com/lomereiter/sambamba/issues

> sambamba depth -h

# sambamba depth -h

Usage: sambamba-depth region|window|base [options] input.bam [input2.bam [...]]

All BAM files must be coordinate-sorted and indexed.

The tool has three modes: base, region, and window,

each name means per which unit to print the statistics.

Common options:

-F, --filter=FILTER

set custom filter for alignments; the default value is

'mapping_quality > 0 and not duplicate and not failed_quality_control'

-o, --output-file=FILENAME

output filename (by default /dev/stdout)

-t, --nthreads=NTHREADS

maximum number of threads to use

-c, --min-coverage=MINCOVERAGE

minimum mean coverage for output (default: 0 for region/window, 1 for base)

-C, --max-coverage=MAXCOVERAGE

maximum mean coverage for output

-q, --min-base-quality=QUAL

don't count bases with lower base quality

--combined

output combined statistics for all samples

-a, --annotate

add additional column of y/n instead of

skipping records not satisfying the criteria

-m, --fix-mate-overlaps

detect overlaps of mate reads and handle them on per-base basis

base subcommand options:

-L, --regions=FILENAME|REGION

list or regions of interest or a single region in form chr:beg-end (optional)

-z, --report-zero-coverage (DEPRECATED, use --min-coverage=0 instead)

don't skip zero coverage bases

region subcommand options:

-L, --regions=FILENAME|REGION

list or regions of interest or a single region in form chr:beg-end (required)

-T, --cov-threshold=COVTHRESHOLD

multiple thresholds can be provided,

for each one an extra column will be added,

the percentage of bases in the region

where coverage is more than this value

window subcommand options:

-w, --window-size=WINDOWSIZE

breadth of the window, in bp (required)

--overlap=OVERLAP

overlap of successive windows, in bp (default is 0)

-T, --cov-threshold=COVTHRESHOLD

same meaning as in 'region' subcommand

ラン

view（リンク）　特定の染色体や特定の領域にアライメントされたリードだけ抽出したり、数を数える。

bamの基本情報

sambamba view --reference-info input.bam

-I, --reference-info　 output to stdout only reference names and lengths in JSON

samの基本情報

sambamba view -S --reference-info input.bam

-S　specify that input is in SAM format

bamのマッピングクオリティ≥50のリード数

sambamba view -c -F "mapping_quality >= 50" input.bam

-c　output to stdout only count of matching records, hHI are ignored
-F　set custom filter for alignments

bamのリファレンス"chr19"に適切にアライメントされるリードだけsam出力。（カウントなら-cをつけ、-o以下を消す）

sambamba view -F "proper_pair" input.bam chr19 -t 8 -f sam -o output.sam

-o　specify output filename
-t　maximum number of threads to use
-f　specify which format to use for output (default is SAM)

samのリファレンス"chr1"に適切にアライメントされるリードだけbam出力。

sambamba view -S -F "proper_pair" input.sam chr1 -t 8 -f bam -h -p -o output.bam

-h　print header before reads (always done for BAM output)
-p　show progressbar in STDERR

"proper_pair"の割合を知りたければ、下のflagstatが便利です。特定の領域にオーバーラップしたリードを数えるなら、sliceの方が高速と書かれています（sliceリンク）。

sort（リンク）　ソート。

指定がなければindexが自動でつくcoordinate-sorted。

sambamba sort input.bam -o sorted.bam -t 20 -p -l 6

-o　output file name; if not provided, the result is written to a file with .sorted.bam extension
-t　use specified number of threads
-p　show progressbar in STDERR
-l　level of compression for sorted BAM, from 0 to 9（指定がなければ"6"くらいのサイズになる）

名前でソート。(indexは付けられない)

sambamba sort input.bam -o sorted.bam -t 20 -p -l 5

-o　output file name; if not provided, the result is written to

メモリが潤沢にあるなら、-mでメモリの最大使用量をあげると、I/Oが改善され高速化されるようです。でHDD使ってるサーバー等なら効果が大きそうです（未検討）。

flagstat（リンク）　bamの情報表示。

bamの情報表示。

sambamba flagstat input.bam -t 8 -p

-t　use NTHREADS for decompression
-p　show progressbar in STDERR

sambamba flagstat input.bam

509280 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

538 + 0 supplementary

0 + 0 duplicates

456058 + 0 mapped (89.55%:N/A)

508742 + 0 paired in sequencing

254371 + 0 read1

254371 + 0 read2

451458 + 0 properly paired (88.74%:N/A)

454670 + 0 with itself and mate mapped

850 + 0 singletons (0.17%:N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

表示された情報については上のflagstatのリンク先で確認してください。bamのBGZFフォーマットしか対応していないのでsamはviewで変換してから使ってください。

index（リンク）　bamのindexファイル作成

sambamba index input.bam -t 8 -p

-t　use NTHREADS for decompression
-p　show progressbar in STDERR

input.bam.baiができる。

merge（リンク）　bamのマージ。

圧縮率最大の設定で、12スレッド使い3つのbamをマージ。

sambamba merge merge.bam input1.bam input2.bam nput3.bam -t 12 -p -l 9

-t　Specify number of threads to use for compression/decompression tasks.
-l　Specify compression level of output file as a number from 0 to 9
-p　 Show progressbar in STDERR.

(If --bcftools is used without parameters, bcftools is called as 'bcftools view -Ov' )

mpileup（リンク）　samtoolsのmpileupのparallelバージョン。

12スレッド使い、pileupを実行。samtoolsの条件は--samtoolsをつけて書く。bcftoolsの条件は--bcftoolsをつけて書く。

sambamba mpileup input.bam -t 12 --samtools -f ref.fa > output

-t　Specify number of threads to use.
-p　 Show progressbar in STDERR.
-o　specify output filename

If --bcftools is used without parameters, bcftools is called as 'bcftools view -Ov'

bcftools

-O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]

引用

Sambamba: fast processing of NGS alignment formats.

Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P.

Bioinformatics. 2015 Jun 15;31(12):2032-4. doi: 10.1093/bioinformatics/btv098. Epub 2015 Feb 19.