2018 8/20 mpileupコマンドの謝り修正
2019 2/26 condaインストール追記
2021 6/2 help更新
Sambambaはsam、bam、cramの処理ツール。D言語で構築されている。フォーマットを変えたり、フィルタリングすることができる。SAMToolsやPicard-toolsの一部機能と重複するが、Sambambaは並列化に対応しており、SAMToolsより高速に動作するとされる。特にmpileupなどは並列化の恩恵が大きいと思われる。分割して並列処理していることで、今まで数日かかっていた処理が数時間で終わる。本ツールはヒトゲノムの高速なバリアント解析ツールspeedseqにも使われている(紹介)。
公式サイト
http://lomereiter.github.io/sambamba/
SAMtoolsとの比較
https://github.com/biod/sambamba/wiki/Comparison-with-samtools
https://github.com/biod/sambamba
コマンド
https://github.com/biod/sambamba/wiki/Command-line-tools
フィルタリングの仕方
https://github.com/biod/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax
インストール
依存
- samtools
- bcftools
本体
Githubでバイナリが配布されている。brewやcondaでインストールできる。
#anacondaを使っているなら
conda install -c bioconda -y sambamba
brew install sambamba
sambanba view #動作確認。例えばviewのヘルプを見る。
> sambamba view -h
# sambamba view -h
Usage: sambamba-view [options] <input.bam | input.sam> [region1 [...]]
Options: -F, --filter=FILTER
set custom filter for alignments
--num-filter=NUMFILTER
filter flag bits; 'i1/i2' corresponds to -f i1 -F i2 samtools arguments;
either of the numbers can be omitted
-f, --format=sam|bam|cram|json
specify which format to use for output (default is SAM)
-h, --with-header
print header before reads (always done for BAM output)
-H, --header
output only header to stdout (if format=bam, the header is printed as SAM)
-I, --reference-info
output to stdout only reference names and lengths in JSON
-L, --regions=FILENAME
output only reads overlapping one of regions from the BED file
-c, --count
output to stdout only count of matching records, hHI are ignored
-v, --valid
output only valid alignments
-S, --sam-input
specify that input is in SAM format
-C, --cram-input
specify that input is in CRAM format
-T, --ref-filename=FASTA
specify reference for writing CRAM
-p, --show-progress
show progressbar in STDERR (works only for BAM files with no regions specified)
-l, --compression-level
specify compression level (from 0 to 9, works only for BAM output)
-o, --output-filename
specify output filename
-t, --nthreads=NTHREADS
maximum number of threads to use
-s, --subsample=FRACTION
subsample reads (read pairs)
--subsampling-seed=SEED
set seed for subsampling
> sambamba sort
# sambamba sort
Usage: sambamba-sort [options] <input.bam>
Options: -m, --memory-limit=LIMIT
approximate total memory limit for all threads (by default 2GB)
--tmpdir=TMPDIR
directory for storing intermediate files; default is system directory for temporary files
-o, --out=OUTPUTFILE
output file name; if not provided, the result is written to a file with .sorted.bam extension
-n, --sort-by-name
sort by read name instead of coordinate (lexicographical order)
-N, --natural-sort
sort by read name instead of coordinate (so-called 'natural' sort as in samtools)
-l, --compression-level=COMPRESSION_LEVEL
level of compression for sorted BAM, from 0 to 9
-u, --uncompressed-chunks
write sorted chunks as uncompressed BAM (default is writing with compression level 1), that might be faster in some cases but uses more disk space
-p, --show-progress
show progressbar in STDERR
-t, --nthreads=NTHREADS
use specified number of threads
-F, --filter=FILTER
keep only reads that satisfy FILTER
> sambamba mpileup
# sambamba mpileup
usage: sambamba-pileup [options] input.bam [input2.bam [...]]
[--samtools <samtools mpileup args>]
[--bcftools <bcftools call args>]
This subcommand relies on external tools and acts as a multi-core
implementation of samtools and bcftools.
Therefore, the following tools should be present in $PATH:
* samtools
* bcftools (when used)
If --samtools is skipped, samtools mpileup is called with default arguments
If --bcftools is used without parameters, samtools is called with
switch '-gu' and bcftools is called as 'bcftools view -'
If --bcftools is skipped, bcftools is not called
Sambamba splits input BAM files into chunks and feeds them
to samtools mpileup and, optionally, bcftools in parallel.
The chunks are slightly overlapping so that variant calling
should not be impacted by these manipulations. The obtained results
from the multiple processes are combined as ordered output.
Sambamba options:
-L, --regions=FILENAME
provide BED file with regions
(no need to duplicate it in samtools args);
all input files must be indexed
-o, --output-filename=<STDOUT>
specify output filename
--tmpdir=TMPDIR
directory for temporary files
-t, --nthreads=NTHREADS
maximum number of threads to use
-b, --buffer-size=64_000_000
chunk size (in bytes)
-B, --output-buffer-size=512_000_000
output buffer size (in bytes)
Sambamba paths:
sambamba-pileup: failed to locate samtools executable in PATH
> sambamba merge
# sambamba merge
Usage: sambamba-merge [options] <output.bam> <input1.bam> <input2.bam> [...]
Options: -t, --nthreads=NTHREADS
number of threads to use for compression/decompression
-l, --compression-level=COMPRESSION_LEVEL
level of compression for merged BAM file, number from 0 to 9
-H, --header
output merged header to stdout in SAM format, other options are ignored; mainly for debug purposes
-p, --show-progress
show progress bar in STDERR
-F, --filter=FILTER
keep only reads that satisfy FILTER
> sambamba index
# sambamba index
Usage: sambamba-index [OPTIONS] <input.bam|input.cram> [output_file]
Creates index for a BAM or CRAM file
Options: -t, --nthreads=NTHREADS
number of threads to use for decompression
-p, --show-progress
show progress bar in STDERR
-c, --check-bins
check that bins are set correctly
-C, --cram-input
specify that input is in CRAM format
> sambamba markup
# sambamba markup
sambamba 0.6.6
Usage: sambamba [command] [args...]
Available commands: 'view', 'index', 'merge', 'sort',
'flagstat', 'slice', 'markdup', 'depth', 'mpileup'
To get help on a particular command, just call it without args.
Leave bug reports and feature requests at
https://github.com/lomereiter/sambamba/issues
> sambamba depth -h
# sambamba depth -h
Usage: sambamba-depth region|window|base [options] input.bam [input2.bam [...]]
All BAM files must be coordinate-sorted and indexed.
The tool has three modes: base, region, and window,
each name means per which unit to print the statistics.
Common options:
-F, --filter=FILTER
set custom filter for alignments; the default value is
'mapping_quality > 0 and not duplicate and not failed_quality_control'
-o, --output-file=FILENAME
output filename (by default /dev/stdout)
-t, --nthreads=NTHREADS
maximum number of threads to use
-c, --min-coverage=MINCOVERAGE
minimum mean coverage for output (default: 0 for region/window, 1 for base)
-C, --max-coverage=MAXCOVERAGE
maximum mean coverage for output
-q, --min-base-quality=QUAL
don't count bases with lower base quality
--combined
output combined statistics for all samples
-a, --annotate
add additional column of y/n instead of
skipping records not satisfying the criteria
-m, --fix-mate-overlaps
detect overlaps of mate reads and handle them on per-base basis
base subcommand options:
-L, --regions=FILENAME|REGION
list or regions of interest or a single region in form chr:beg-end (optional)
-z, --report-zero-coverage (DEPRECATED, use --min-coverage=0 instead)
don't skip zero coverage bases
region subcommand options:
-L, --regions=FILENAME|REGION
list or regions of interest or a single region in form chr:beg-end (required)
-T, --cov-threshold=COVTHRESHOLD
multiple thresholds can be provided,
for each one an extra column will be added,
the percentage of bases in the region
where coverage is more than this value
window subcommand options:
-w, --window-size=WINDOWSIZE
breadth of the window, in bp (required)
--overlap=OVERLAP
overlap of successive windows, in bp (default is 0)
-T, --cov-threshold=COVTHRESHOLD
same meaning as in 'region' subcommand
ラン
view(リンク) 特定の染色体や特定の領域にアライメントされたリードだけ抽出したり、数を数える。
bamの基本情報
sambamba view --reference-info input.bam
- -I, --reference-info output to stdout only reference names and lengths in JSON
samの基本情報
sambamba view -S --reference-info input.bam
- -S specify that input is in SAM format
bamのマッピングクオリティ≥50のリード数
sambamba view -c -F "mapping_quality >= 50" input.bam
- -c output to stdout only count of matching records, hHI are ignored
- -F set custom filter for alignments
bamのリファレンス"chr19"に適切にアライメントされるリードだけsam出力。(カウントなら-cをつけ、-o以下を消す)
sambamba view -F "proper_pair" input.bam chr19 -t 8 -f sam -o output.sam
- -o specify output filename
- -t maximum number of threads to use
- -f specify which format to use for output (default is SAM)
samのリファレンス"chr1"に適切にアライメントされるリードだけbam出力。
sambamba view -S -F "proper_pair" input.sam chr1 -t 8 -f bam -h -p -o output.bam
- -h print header before reads (always done for BAM output)
- -p show progressbar in STDERR
"proper_pair"の割合を知りたければ、下のflagstatが便利です。 特定の領域にオーバーラップしたリードを数えるなら、sliceの方が高速と書かれています(sliceリンク)。
sort(リンク) ソート。
指定がなければindexが自動でつくcoordinate-sorted。
sambamba sort input.bam -o sorted.bam -t 20 -p -l 6
- -o output file name; if not provided, the result is written to a file with .sorted.bam extension
- -t use specified number of threads
- -p show progressbar in STDERR
- -l level of compression for sorted BAM, from 0 to 9(指定がなければ"6"くらいのサイズになる)
名前でソート。(indexは付けられない)
sambamba sort input.bam -o sorted.bam -t 20 -p -l 5
- -o output file name; if not provided, the result is written to
メモリが潤沢にあるなら、-mでメモリの最大使用量をあげると、I/Oが改善され高速化されるようです。でHDD使ってるサーバー等なら効果が大きそうです(未検討)。
flagstat(リンク) bamの情報表示。
bamの情報表示。
sambamba flagstat input.bam -t 8 -p
- -t use NTHREADS for decompression
- -p show progressbar in STDERR
sambamba flagstat input.bam
509280 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
538 + 0 supplementary
0 + 0 duplicates
456058 + 0 mapped (89.55%:N/A)
508742 + 0 paired in sequencing
254371 + 0 read1
254371 + 0 read2
451458 + 0 properly paired (88.74%:N/A)
454670 + 0 with itself and mate mapped
850 + 0 singletons (0.17%:N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
表示された情報については上のflagstatのリンク先で確認してください。bamのBGZFフォーマットしか対応していないのでsamはviewで変換してから使ってください。
index(リンク) bamのindexファイル作成
sambamba index input.bam -t 8 -p
- -t use NTHREADS for decompression
- -p show progressbar in STDERR
input.bam.baiができる。
merge(リンク) bamのマージ。
圧縮率最大の設定で、12スレッド使い3つのbamをマージ。
sambamba merge merge.bam input1.bam input2.bam nput3.bam -t 12 -p -l 9
- -t Specify number of threads to use for compression/decompression tasks.
- -l Specify compression level of output file as a number from 0 to 9
- -p Show progressbar in STDERR.
(If --bcftools is used without parameters, bcftools is called as 'bcftools view -Ov' )
mpileup(リンク) samtoolsのmpileupのparallelバージョン。
12スレッド使い、pileupを実行。samtoolsの条件は--samtoolsをつけて書く。bcftoolsの条件は--bcftoolsをつけて書く。
sambamba mpileup input.bam -t 12 --samtools -f ref.fa > output
- -t Specify number of threads to use.
- -p Show progressbar in STDERR.
- -o specify output filename
If --bcftools is used without parameters, bcftools is called as 'bcftools view -Ov'
bcftools
- -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
引用
Sambamba: fast processing of NGS alignment formats.
Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P.
Bioinformatics. 2015 Jun 15;31(12):2032-4. doi: 10.1093/bioinformatics/btv098. Epub 2015 Feb 19.