2019 1/14 Fixed commands
2020 4/17 Added help output
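samblaster is a tool that marks duplicate reads in a SAM file and can write split-alignment reads and discordant read pairs, both of which are indicators of structural variation, to separate files. Sorting the data out at the SAM stage makes downstream analyses that rely on discordant read pairs and split-alignment reads, such as large indel detection, dramatically lighter. The tool is also used in speedseq, a fast variant analysis tool for human genomes (introduced separately). The paper was published in 2014.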
Installation
git clone https://github.com/GregoryFaust/samblaster.git
cd samblaster
make
cp samblaster /usr/local/bin/.
# With an Anaconda environment (Linux only)
conda install -c bioconda -y samblaster
$ samblaster -h
samblaster: Version 0.1.25
Author: Greg Faust (gf4ea@virginia.edu)
Tool to mark duplicates and optionally output split reads and/or discordant pairs.
Input sam file must contain sequence header and be grouped by read ids (QNAME).
Input typicallly contains paired-end data, although singleton data is allowed with --ignoreUnmated option.
Output will be all alignments in the same order as input, with duplicates marked with FLAG 0x400.
Usage:
For use as a post process on an aligner (eg. bwa mem):
bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam
bwa mem -M <idxbase> samp.r1.fq samp.r2.fq | samblaster -M [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam
For use with a pre-existing bam file to pull split, discordant and/or unmapped reads without marking duplicates:
samtools view -h samp.bam | samblaster -a [-e] [-d samp.disc.sam] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null
For use with a bam file of singleton long reads to pull split and/or unmapped reads with/without marking duplicates:
samtools view -h samp.bam | samblaster --ignoreUnmated [-e] --maxReadLength 100000 [-s samp.split.sam] [-u samp.umc.fasta] | samtools view -Sb - > samp.out.bam
samtools view -h samp.bam | samblaster --ignoreUnmated -a [-e] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null
Input/Output Options:
-i --input FILE Input sam file [stdin].
-o --output FILE Output sam file for all input alignments [stdout].
-d --discordantFile FILE Output discordant read pairs to this file. [no discordant file output]
-s --splitterFile FILE Output split reads to this file abiding by paramaters below. [no splitter file output]
-u --unmappedFile FILE Output unmapped/clipped reads as FASTQ to this file abiding by parameters below. [no unmapped file output].
Requires soft clipping in input file. Will output FASTQ if QUAL information available, otherwise FASTA.
Other Options:
-a --acceptDupMarks Accept duplicate marks already in input file instead of looking for duplicates in the input.
-e --excludeDups Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.
-r --removeDups Remove duplicates reads from all output files. (Implies --excludeDups).
--addMateTags Add MC and MQ tags to all output paired-end SAM lines.
--ignoreUnmated Suppress abort on unmated alignments. Use only when sure input is read-id grouped,
and either paired-end alignments have been filtered or the input file contains singleton reads.
-M Run in compatibility mode; both 0x100 and 0x800 are considered chimeric. Similar to BWA MEM -M option.
--maxReadLength INT Maximum allowed length of the SEQ/QUAL string in the input file. [500]
Primarily useful for marking duplicates in files containing singleton long reads.
--maxSplitCount INT Maximum number of split alignments for a read to be included in splitter file. [2]
--maxUnmappedBases INT Maximum number of un-aligned bases between two alignments to be included in splitter file. [50]
--minIndelSize INT Minimum structural variant feature size for split alignments to be included in splitter file. [50]
--minNonOverlap INT Minimum non-overlaping base pairs between two alignments for a read to be included in splitter file. [20]
--minClipSize INT Minumum number of bases a mapped read must be clipped to be included in unmapped file. [20]
-q --quiet Output fewer statistics.
Criteria for calling duplicates, discordant reads, and split reads (found toward the bottom of the page).
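The help text above states that the input SAM must be grouped by read id (QNAME). A minimal sketch of a pre-check is shown here; `check_qname_grouped` is a hypothetical helper written for this note, not part of samblaster, and it keeps every read id in memory, so it is only meant for spot checks on modest inputs.

```shell
# check_qname_grouped: hypothetical helper, not part of samblaster.
# Reads SAM text on stdin. Exits 0 if every QNAME forms one contiguous block;
# otherwise prints the first offending read id and exits 1.
check_qname_grouped() {
  awk -F'\t' '
    BEGIN { bad = 0 }
    /^@/ { next }                      # skip SAM header lines
    $1 != prev {
      # a read id that reappears after another id means the file is not grouped
      if ($1 in seen) { print "not grouped: " $1; bad = 1; exit }
      seen[$1] = 1
      prev = $1
    }
    END { exit bad }
  '
}
```

For an existing BAM this could be run as `samtools view -h samp.bam | check_qname_grouped` before piping the same stream into samblaster. Output piped straight from bwa mem should already be grouped, so the check is mainly of interest for files that have been sorted or filtered.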
Usage
Mark duplicates during the bwa mem alignment step.
bwa index -a is input.fa
bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster | samtools sort -@ 12 -O BAM - > samp.sorted.bam
The command above builds a sorted BAM from the fastq files through bwa mem => samblaster => samtools sort. samblaster receives the SAM stream from bwa mem, marks the duplicate reads, and passes the result on to samtools sort.
For this run, samblaster printed the following message.
samblaster: Marked 1874 of 339039 (0.55%) read ids as duplicates using 13344768k memory in 1.056S CPU seconds and 45S wall time.
1874 read ids (0.55%) were judged to be duplicates.
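When many samples are processed, it can be handy to pull the duplicate counts out of that message. The sketch below is a hypothetical helper whose pattern is inferred only from the single message shown above; other samblaster versions may word the line differently.

```shell
# parse_samblaster_summary: hypothetical helper, not part of samblaster.
# Reads log text on stdin and prints "<duplicates> <total> <percent>"
# for each line that matches the summary message format shown above.
parse_samblaster_summary() {
  sed -n 's/^samblaster: Marked \([0-9][0-9]*\) of \([0-9][0-9]*\) .*read ids as duplicates.*/\1 \2/p' |
    awk '{ printf "%d %d %.2f\n", $1, $2, ($2 > 0 ? 100 * $1 / $2 : 0) }'
}
```

Assuming the message is written to stderr, it could be captured by adding `2> samblaster.log` after `samblaster` in the pipeline (the log file name is arbitrary) and then running `parse_samblaster_summary < samblaster.log`.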
------------------------------------------------------------------------------------------------
The examples below write SAM output.
Write discordant reads and split reads to separate files.
bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -e -d samp.disc.sam -s samp.split.sam > output.sam
- -e Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.
- -d FILE Output discordant read pairs to this file. [no discordant file output]
- -s FILE Output split reads to this file abiding by paramaters below. [no splitter file output]
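A quick sanity check after this step is to see how many alignment records ended up in each extracted file. The following is a minimal sketch; `count_sam_records` is a hypothetical helper, and it simply counts non-header lines.

```shell
# count_sam_records: hypothetical helper, not part of samblaster.
# For each SAM file given, prints "<file><TAB><n>", where n is the number of
# alignment lines (header lines starting with "@" are not counted).
count_sam_records() {
  for f in "$@"; do
    n=$(awk '!/^@/ { n++ } END { print n + 0 }' "$f")
    printf '%s\t%s\n' "$f" "$n"
  done
}
```

With the file names used above, this would be called as `count_sam_records samp.disc.sam samp.split.sam`. A count of 0 for both files may simply mean the sample has no such reads, but it is worth checking that the options were passed as intended.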
Remove duplicate reads from all outputs.
bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -r -e -d samp.disc.sam -s samp.split.sam > output.sam
- -r Remove duplicates reads from all output files.
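According to the help text, duplicates are marked with FLAG 0x400, so one way to confirm that -r did its job is to count the records that still carry that bit. A minimal sketch follows; `count_dup_flagged` is a hypothetical helper, and it tests the bit with plain arithmetic because not every awk provides bitwise functions.

```shell
# count_dup_flagged: hypothetical helper, not part of samblaster.
# Reads SAM text on stdin and prints "<flagged> <total>", where <flagged> is the
# number of alignment records whose FLAG (column 2) has the 0x400 (1024) bit set.
count_dup_flagged() {
  awk -F'\t' '
    /^@/ { next }                                   # skip SAM header lines
    { total++; if (int($2 / 1024) % 2 == 1) dup++ } # test bit 0x400 arithmetically
    END { printf "%d %d\n", dup, total }
  '
}
```

Running `count_dup_flagged < output.sam` on the output of the -r command above should report 0 flagged records. Note that this counts alignment records, whereas samblaster's own summary counts read ids, so on an output produced without -r the two numbers need not match.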
Note: when bwa mem is run with -M (mark shorter split hits as secondary), run samblaster with -M as well.
- -M Compatibility mode (details below); both FLAG 0x100 and 0x800 denote supplemental (chimeric). Similar to bwa mem -M option.
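If it is unclear whether an existing alignment was produced with bwa mem -M, tallying the 0x100 (secondary) and 0x800 (supplementary) bits gives a rough hint. This is only a heuristic sketch with a hypothetical helper name: secondary records can also come from other aligner settings, so the result should be read together with the original command line when that is available.

```shell
# tally_split_flags: hypothetical helper, not part of samblaster.
# Reads SAM text on stdin and prints how many alignment records have
# FLAG bit 0x100 (256, secondary) and bit 0x800 (2048, supplementary) set.
tally_split_flags() {
  awk -F'\t' '
    /^@/ { next }                                       # skip SAM header lines
    {
      if (int($2 / 256) % 2 == 1) secondary++           # bit 0x100
      if (int($2 / 2048) % 2 == 1) supplementary++      # bit 0x800
    }
    END { printf "secondary=%d supplementary=%d\n", secondary, supplementary }
  '
}
```

For example, `samtools view -h samp.bam | tally_split_flags` showing only secondary records and no supplementary ones would be consistent with a -M run, in which case passing -M to samblaster as described above would be the matching choice.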
Reference
SAMBLASTER: fast duplicate marking and structural variant read extraction
Gregory G. Faust, Ira M. Hall
Bioinformatics. 2014 Sep 1; 30(17): 2503–2505.