macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

高速なbam/samの解析ツール Sambamba

2018 8/20 mpileupコマンドの謝り修正

2019 2/26 condaインストール追記

2021 6/2 help更新

 

 

Sambambaはsam、bam、cramの処理ツール。D言語で構築されている。フォーマットを変えたり、フィルタリングすることができる。SAMToolsやPicard-toolsの一部機能と重複するが、Sambambaは並列化に対応しており、SAMToolsより高速に動作するとされる。特にmpileupなどは並列化の恩恵が大きいと思われる。分割して並列処理していることで、今まで数日かかっていた処理が数時間で終わる。本ツールはヒトゲノムの高速なバリアント解析ツールspeedseqにも使われている(紹介)。  

 

公式サイト

http://lomereiter.github.io/sambamba/

f:id:kazumaxneo:20180731173113j:plain

SAMtoolsとの比較

https://github.com/biod/sambamba/wiki/Comparison-with-samtools

 Github

https://github.com/biod/sambamba 

コマンド

https://github.com/biod/sambamba/wiki/Command-line-tools

フィルタリングの仕方

https://github.com/biod/sambamba/wiki/%5Bsambamba-view%5D-Filter-expression-syntax

 

インストール

依存

  • samtools
  • bcftools

本体

Githubでバイナリが配布されている。brewやcondaでインストールできる。

#anacondaを使っているなら
conda install -c bioconda -y sambamba

brew install sambamba
sambanba view #動作確認。例えばviewのヘルプを見る。

> sambamba view -h

# sambamba view -h

Usage: sambamba-view [options] <input.bam | input.sam> [region1 [...]]

 

Options: -F, --filter=FILTER

                    set custom filter for alignments

         --num-filter=NUMFILTER

                    filter flag bits; 'i1/i2' corresponds to -f i1 -F i2 samtools arguments;

                    either of the numbers can be omitted

         -f, --format=sam|bam|cram|json

                    specify which format to use for output (default is SAM)

         -h, --with-header

                    print header before reads (always done for BAM output)

         -H, --header

                    output only header to stdout (if format=bam, the header is printed as SAM)

         -I, --reference-info

                    output to stdout only reference names and lengths in JSON

         -L, --regions=FILENAME

                    output only reads overlapping one of regions from the BED file

         -c, --count

                    output to stdout only count of matching records, hHI are ignored

         -v, --valid

                    output only valid alignments

         -S, --sam-input

                    specify that input is in SAM format

         -C, --cram-input

                    specify that input is in CRAM format

         -T, --ref-filename=FASTA

                    specify reference for writing CRAM

         -p, --show-progress

                    show progressbar in STDERR (works only for BAM files with no regions specified)

         -l, --compression-level

                    specify compression level (from 0 to 9, works only for BAM output)

         -o, --output-filename

                    specify output filename

         -t, --nthreads=NTHREADS

                    maximum number of threads to use

         -s, --subsample=FRACTION

                    subsample reads (read pairs)

         --subsampling-seed=SEED

                    set seed for subsampling

> sambamba sort 

# sambamba sort 

Usage: sambamba-sort [options] <input.bam>

 

Options: -m, --memory-limit=LIMIT

               approximate total memory limit for all threads (by default 2GB)

         --tmpdir=TMPDIR

               directory for storing intermediate files; default is system directory for temporary files

         -o, --out=OUTPUTFILE

               output file name; if not provided, the result is written to a file with .sorted.bam extension

         -n, --sort-by-name

               sort by read name instead of coordinate (lexicographical order)

         -N, --natural-sort

               sort by read name instead of coordinate (so-called 'natural' sort as in samtools)

         -l, --compression-level=COMPRESSION_LEVEL

               level of compression for sorted BAM, from 0 to 9

         -u, --uncompressed-chunks

               write sorted chunks as uncompressed BAM (default is writing with compression level 1), that might be faster in some cases but uses more disk space

         -p, --show-progress

               show progressbar in STDERR

         -t, --nthreads=NTHREADS

               use specified number of threads

         -F, --filter=FILTER

               keep only reads that satisfy FILTER

sambamba mpileup

# sambamba mpileup

usage: sambamba-pileup [options] input.bam [input2.bam [...]]

                       [--samtools <samtools mpileup args>]

                       [--bcftools <bcftools call args>]

 

This subcommand relies on external tools and acts as a multi-core

implementation of samtools and bcftools.

 

Therefore, the following tools should be present in $PATH:

    * samtools

    * bcftools (when used)

 

If --samtools is skipped, samtools mpileup is called with default arguments

If --bcftools is used without parameters, samtools is called with

     switch '-gu' and bcftools is called as 'bcftools view -'

If --bcftools is skipped, bcftools is not called

 

Sambamba splits input BAM files into chunks and feeds them

to samtools mpileup and, optionally, bcftools in parallel.

The chunks are slightly overlapping so that variant calling

should not be impacted by these manipulations. The obtained results

from the multiple processes are combined as ordered output.

 

Sambamba options:

         -L, --regions=FILENAME

                    provide BED file with regions

                    (no need to duplicate it in samtools args);

                    all input files must be indexed

         -o, --output-filename=<STDOUT>

                    specify output filename

         --tmpdir=TMPDIR

                    directory for temporary files

         -t, --nthreads=NTHREADS

                    maximum number of threads to use

         -b, --buffer-size=64_000_000

                    chunk size (in bytes)

         -B, --output-buffer-size=512_000_000

                    output buffer size (in bytes)

 

Sambamba paths:

 

sambamba-pileup: failed to locate samtools executable in PATH

sambamba merge

# sambamba merge

Usage: sambamba-merge [options] <output.bam> <input1.bam> <input2.bam> [...]

 

Options: -t, --nthreads=NTHREADS

               number of threads to use for compression/decompression

         -l, --compression-level=COMPRESSION_LEVEL

               level of compression for merged BAM file, number from 0 to 9

         -H, --header

               output merged header to stdout in SAM format, other options are ignored; mainly for debug purposes

         -p, --show-progress

               show progress bar in STDERR

         -F, --filter=FILTER

               keep only reads that satisfy FILTER

sambamba index

# sambamba index

Usage: sambamba-index [OPTIONS] <input.bam|input.cram> [output_file]

 

Creates index for a BAM or CRAM file

 

Options: -t, --nthreads=NTHREADS

               number of threads to use for decompression

         -p, --show-progress

               show progress bar in STDERR

         -c, --check-bins

               check that bins are set correctly

         -C, --cram-input

               specify that input is in CRAM format

sambamba markup

# sambamba markup

sambamba 0.6.6

 

Usage: sambamba [command] [args...]

 

    Available commands: 'view', 'index', 'merge', 'sort',

                        'flagstat', 'slice', 'markdup', 'depth', 'mpileup'

    To get help on a particular command, just call it without args.

 

Leave bug reports and feature requests at

https://github.com/lomereiter/sambamba/issues

sambamba depth -h

# sambamba depth -h

Usage: sambamba-depth region|window|base [options] input.bam  [input2.bam [...]]

 

          All BAM files must be coordinate-sorted and indexed.

 

          The tool has three modes: base, region, and window,

          each name means per which unit to print the statistics.

 

Common options:

         -F, --filter=FILTER

                    set custom filter for alignments; the default value is

                    'mapping_quality > 0 and not duplicate and not failed_quality_control'

         -o, --output-file=FILENAME

                    output filename (by default /dev/stdout)

         -t, --nthreads=NTHREADS

                    maximum number of threads to use

         -c, --min-coverage=MINCOVERAGE

                    minimum mean coverage for output (default: 0 for region/window, 1 for base)

         -C, --max-coverage=MAXCOVERAGE

                    maximum mean coverage for output

         -q, --min-base-quality=QUAL

                    don't count bases with lower base quality

         --combined

                    output combined statistics for all samples

         -a, --annotate

                    add additional column of y/n instead of

                    skipping records not satisfying the criteria

         -m, --fix-mate-overlaps

                    detect overlaps of mate reads and handle them on per-base basis

base subcommand options:

         -L, --regions=FILENAME|REGION

                    list or regions of interest or a single region in form chr:beg-end (optional)

         -z, --report-zero-coverage (DEPRECATED, use --min-coverage=0 instead)

                    don't skip zero coverage bases

region subcommand options:

         -L, --regions=FILENAME|REGION

                    list or regions of interest or a single region in form chr:beg-end (required)

         -T, --cov-threshold=COVTHRESHOLD

                    multiple thresholds can be provided,

                    for each one an extra column will be added,

                    the percentage of bases in the region

                    where coverage is more than this value

window subcommand options:

         -w, --window-size=WINDOWSIZE

                    breadth of the window, in bp (required)

         --overlap=OVERLAP

                    overlap of successive windows, in bp (default is 0)

         -T, --cov-threshold=COVTHRESHOLD

                    same meaning as in 'region' subcommand

 

 

 

ラン

viewリンク 特定の染色体や特定の領域にアライメントされたリードだけ抽出したり、数を数える。

bamの基本情報

sambamba view --reference-info input.bam
  •  -I, --reference-info  output to stdout only reference names and lengths in JSON 

samの基本情報

sambamba view -S --reference-info input.bam
  • -S specify that input is in SAM format

bamのマッピングクオリティ≥50のリード数

sambamba view -c -F "mapping_quality >= 50" input.bam
  • -c output to stdout only count of matching records, hHI are ignored  
  • -F set custom filter for alignments 

bamのリファレンス"chr19"に適切にアライメントされるリードだけsam出力。(カウントなら-cをつけ、-o以下を消す)

sambamba view -F "proper_pair" input.bam chr19 -t 8 -f sam -o output.sam
  • -o specify output filename
  • -t maximum number of threads to use
  • -f specify which format to use for output (default is SAM)

samのリファレンス"chr1"に適切にアライメントされるリードだけbam出力。

sambamba view -S -F "proper_pair" input.sam chr1 -t 8 -f bam -h -p -o output.bam
  • -h print header before reads (always done for BAM output)
  • -p show progressbar in STDERR

"proper_pair"の割合を知りたければ、下のflagstatが便利です。 特定の領域にオーバーラップしたリードを数えるなら、sliceの方が高速と書かれています(sliceリンク)。

 

 

sortリンク ソート。

指定がなければindexが自動でつくcoordinate-sorted。

sambamba sort input.bam -o sorted.bam -t 20 -p -l 6
  • -o output file name; if not provided, the result is written to a file with .sorted.bam extension
  • -t use specified number of threads
  • -p show progressbar in STDERR
  • -l level of compression for sorted BAM, from 0 to 9(指定がなければ"6"くらいのサイズになる)

名前でソート。(indexは付けられない)

sambamba sort input.bam -o sorted.bam -t 20 -p -l 5 
  • -o output file name; if not provided, the result is written to

メモリが潤沢にあるなら、-mでメモリの最大使用量をあげると、I/Oが改善され高速化されるようです。でHDD使ってるサーバー等なら効果が大きそうです(未検討)。

 

 

flagstatリンク bamの情報表示。

bamの情報表示。

sambamba flagstat input.bam -t 8 -p
  • -t use NTHREADS for decompression
  • -p show progressbar in STDERR

sambamba flagstat input.bam

509280 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

538 + 0 supplementary

0 + 0 duplicates

456058 + 0 mapped (89.55%:N/A)

508742 + 0 paired in sequencing

254371 + 0 read1

254371 + 0 read2

451458 + 0 properly paired (88.74%:N/A)

454670 + 0 with itself and mate mapped

850 + 0 singletons (0.17%:N/A)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

表示された情報については上のflagstatのリンク先で確認してください。bamのBGZFフォーマットしか対応していないのでsamはviewで変換してから使ってください。 

 

 

indexリンク bamのindexファイル作成

sambamba index input.bam -t 8 -p
  • -t use NTHREADS for decompression
  • -p show progressbar in STDERR

 input.bam.baiができる。

 

 

mergeリンク bamのマージ。

圧縮率最大の設定で、12スレッド使い3つのbamをマージ。

sambamba merge merge.bam input1.bam input2.bam nput3.bam -t 12 -p -l 9
  • -t Specify number of threads to use for compression/decompression tasks.
  • -l Specify compression level of output file as a number from 0 to 9
  • -p  Show progressbar in STDERR.

(If --bcftools is used without parameters, bcftools is called as 'bcftools view -Ov' )

 

 

mpileupリンク samtoolsのmpileupのparallelバージョン。

12スレッド使い、pileupを実行。samtoolsの条件は--samtoolsをつけて書く。bcftoolsの条件は--bcftoolsをつけて書く。

sambamba mpileup input.bam -t 12 --samtools -f ref.fa > output
  • -t Specify number of threads to use.
  • -p  Show progressbar in STDERR.
  • -o specify output filename

If --bcftools is used without parameters, bcftools is called as 'bcftools view -Ov'

bcftools

  • -O, --output-type <b|u|z|v> b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]

 

 

引用

Sambamba: fast processing of NGS alignment formats.

Tarasov A, Vilella AJ, Cuppen E, Nijman IJ, Prins P.

Bioinformatics. 2015 Jun 15;31(12):2032-4. doi: 10.1093/bioinformatics/btv098. Epub 2015 Feb 19.