Tagging duplicate reads with samblaster - macでインフォマティクス

2019 1/14 Fixed commands

2020 4/17 Added help output

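samblaster is a tool that tags duplicate reads in a SAM file and can write split-alignment reads and discordant read pairs, both of which signal structural variation, to separate files. Separating these reads at the SAM stage substantially reduces the amount of data that downstream analyses built on discordant read pairs and split-alignment reads, such as large indel detection, have to process. The tool is also used in speedseq, a fast variant analysis pipeline for human genomes (introduced in a separate post). The paper was published in 2014.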

Installation

GitHub

git clone https://github.com/GregoryFaust/samblaster.git
cd samblaster 
make 
cp samblaster /usr/local/bin/.

# In an Anaconda environment (Linux only)
conda install -c bioconda -y samblaster

$ samblaster -h

samblaster: Version 0.1.25

Author: Greg Faust (gf4ea@virginia.edu)

Tool to mark duplicates and optionally output split reads and/or discordant pairs.

Input sam file must contain sequence header and be grouped by read ids (QNAME).

Input typicallly contains paired-end data, although singleton data is allowed with --ignoreUnmated option.

Output will be all alignments in the same order as input, with duplicates marked with FLAG 0x400.

Usage:

For use as a post process on an aligner (eg. bwa mem):

bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam

bwa mem -M <idxbase> samp.r1.fq samp.r2.fq | samblaster -M [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam

For use with a pre-existing bam file to pull split, discordant and/or unmapped reads without marking duplicates:

samtools view -h samp.bam | samblaster -a [-e] [-d samp.disc.sam] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null

For use with a bam file of singleton long reads to pull split and/or unmapped reads with/without marking duplicates:

samtools view -h samp.bam | samblaster --ignoreUnmated [-e] --maxReadLength 100000 [-s samp.split.sam] [-u samp.umc.fasta] | samtools view -Sb - > samp.out.bam

samtools view -h samp.bam | samblaster --ignoreUnmated -a [-e] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null

Input/Output Options:

-i --input FILE Input sam file [stdin].

-o --output FILE Output sam file for all input alignments [stdout].

-d --discordantFile FILE Output discordant read pairs to this file. [no discordant file output]

-s --splitterFile FILE Output split reads to this file abiding by paramaters below. [no splitter file output]

-u --unmappedFile FILE Output unmapped/clipped reads as FASTQ to this file abiding by parameters below. [no unmapped file output].

Requires soft clipping in input file. Will output FASTQ if QUAL information available, otherwise FASTA.

Other Options:

-a --acceptDupMarks Accept duplicate marks already in input file instead of looking for duplicates in the input.

-e --excludeDups Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.

-r --removeDups Remove duplicates reads from all output files. (Implies --excludeDups).

--addMateTags Add MC and MQ tags to all output paired-end SAM lines.

--ignoreUnmated Suppress abort on unmated alignments. Use only when sure input is read-id grouped,

and either paired-end alignments have been filtered or the input file contains singleton reads.

-M Run in compatibility mode; both 0x100 and 0x800 are considered chimeric. Similar to BWA MEM -M option.

--maxReadLength INT Maximum allowed length of the SEQ/QUAL string in the input file. [500]

Primarily useful for marking duplicates in files containing singleton long reads.

--maxSplitCount INT Maximum number of split alignments for a read to be included in splitter file. [2]

--maxUnmappedBases INT Maximum number of un-aligned bases between two alignments to be included in splitter file. [50]

--minIndelSize INT Minimum structural variant feature size for split alignments to be included in splitter file. [50]

--minNonOverlap INT Minimum non-overlaping base pairs between two alignments for a read to be included in splitter file. [20]

--minClipSize INT Minumum number of bases a mapped read must be clipped to be included in unmapped file. [20]

-q --quiet Output fewer statistics.

The criteria used to call duplicates, discordant reads, and split reads are described toward the bottom of the README linked below.

GitHub - GregoryFaust/samblaster: samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.
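The help text above states that the input must be grouped by read ID (QNAME). A coordinate-sorted BAM does not satisfy this. The following is a minimal sketch, not part of samblaster, that checks whether each QNAME in a SAM file forms one contiguous run. The function name `is_qname_grouped` and the toy records (QNAME and FLAG columns only) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if every QNAME forms one contiguous run, 1 otherwise.
# It keeps every read ID in memory, so try it on a subset of a large file first.
is_qname_grouped() {
    awk -F'\t' '
        /^@/ { next }                          # skip SAM header lines
        $1 != prev {                           # a new run of read IDs starts here
            if ($1 in seen) { bad = 1; exit }  # this ID already appeared in an earlier run
            seen[$1] = 1
            prev = $1
        }
        END { exit bad + 0 }
    ' "$1"
}

# Toy input: header line plus QNAME and FLAG columns only (not a complete SAM file).
tmp=$(mktemp)
printf '@HD\tVN:1.6\nr1\t99\nr1\t147\nr2\t99\nr2\t147\n' > "$tmp"
if is_qname_grouped "$tmp"; then
    echo "grouped by QNAME"
else
    echo "not grouped by QNAME"
fi
rm -f "$tmp"
```

If the check fails for an existing BAM, name-sorting it first with `samtools sort -n` before piping it through `samtools view -h` into samblaster should restore the grouping.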

Usage

Tag duplicates as part of the bwa mem alignment step.

bwa index -a is input.fa
bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster | samtools sort -@ 12 -O BAM - > samp.sorted.bam

The command above builds a BAM from FASTQ through the chain bwa mem => samblaster => samtools sort. samblaster receives the SAM stream from bwa mem, tags the duplicate reads, and passes the result on to samtools sort.

This run printed the following message.

samblaster: Marked 1874 of 339039 (0.55%) read ids as duplicates using 13344768k memory in 1.056S CPU seconds and 45S wall time.

1,874 of the 339,039 read IDs (0.55%) were called duplicates.
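The help text says duplicates are marked with FLAG 0x400, so the marks can be checked in the output. Below is a minimal sketch that counts records with that bit set in a SAM file, using plain awk. The function name `count_dup_records` and the toy records are assumptions for illustration. samblaster's summary counts read IDs, whereas the FLAG is set per alignment record, so the two numbers need not match.

```shell
#!/bin/sh
# Hypothetical helper: count SAM records whose FLAG (column 2) has bit 0x400 (1024) set.
count_dup_records() {
    awk -F'\t' '
        /^@/ { next }                      # skip SAM header lines
        int($2 / 1024) % 2 == 1 { n++ }    # bit 0x400 is set
        END { print n + 0 }
    ' "$1"
}

# Toy input: QNAME and FLAG columns only (not a complete SAM file).
# 1123 = 99 + 1024 and 1171 = 147 + 1024, i.e. a pair marked as duplicate.
tmp=$(mktemp)
printf '@HD\tVN:1.6\nr1\t99\nr1\t147\nr2\t1123\nr2\t1171\n' > "$tmp"
count_dup_records "$tmp"    # prints 2
rm -f "$tmp"

# On a BAM file, samtools can do the same count directly:
#   samtools view -c -f 0x400 samp.sorted.bam
```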

------------------------------------------------------------------------------------------------

The commands from here on write SAM output.

Write discordant reads and split reads to separate files.

bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -e -d samp.disc.sam -s samp.split.sam > output.sam

-e  Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.
-d  FILE Output discordant read pairs to this file. [no discordant file output]
-s  FILE Output split reads to this file abiding by paramaters below. [no splitter file output]
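After a run with -d and -s, it can help to see how much landed in each file. The sketch below counts alignment records and distinct read IDs in a SAM file. The function name `summarize_sam` and the toy records are assumptions for illustration, and the file names in the usage comment are the ones from the command above.

```shell
#!/bin/sh
# Hypothetical helper: report alignment records and distinct read IDs in a SAM file.
summarize_sam() {
    awk -F'\t' '
        /^@/ { next }                                   # skip SAM header lines
        { n++; if (!($1 in ids)) { ids[$1] = 1; u++ } } # count records and new read IDs
        END { printf "%d records, %d read IDs\n", n, u }
    ' "$1"
}

# Toy input: QNAME and FLAG columns only (not a complete SAM file).
tmp=$(mktemp)
printf '@HD\tVN:1.6\nr1\t97\nr1\t145\nr2\t65\n' > "$tmp"
summarize_sam "$tmp"    # prints "3 records, 2 read IDs"
rm -f "$tmp"

# Usage on the real outputs:
#   summarize_sam samp.disc.sam
#   summarize_sam samp.split.sam
```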

Remove duplicate reads from all output.

bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -r -e -d samp.disc.sam -s samp.split.sam > output.sam

-r  Remove duplicates reads from all output files.

Note: if bwa mem is run with -M (mark shorter split hits as secondary), run samblaster with -M as well.

-M  Compatibility mode (details below); both FLAG 0x100 and 0x800 denote supplemental (chimeric). Similar to bwa mem -M option.
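When it is unclear whether an existing SAM file was produced with bwa mem -M, the FLAG column gives a hint: with -M the shorter split hits carry the secondary bit (0x100), without it they carry the supplementary bit (0x800). The sketch below tallies both bits. The function name `count_chimeric_flags` and the toy records are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: count records with the secondary (0x100 = 256) and
# supplementary (0x800 = 2048) bits set in the FLAG column of a SAM file.
count_chimeric_flags() {
    awk -F'\t' '
        /^@/ { next }                        # skip SAM header lines
        int($2 / 256) % 2 == 1  { sec++ }    # bit 0x100 is set
        int($2 / 2048) % 2 == 1 { sup++ }    # bit 0x800 is set
        END { printf "secondary(0x100)=%d supplementary(0x800)=%d\n", sec, sup }
    ' "$1"
}

# Toy input: QNAME and FLAG columns only (not a complete SAM file).
# 2147 = 99 + 2048, a supplementary alignment as emitted without -M.
tmp=$(mktemp)
printf '@HD\tVN:1.6\nr1\t99\nr1\t147\nr1\t2147\n' > "$tmp"
count_chimeric_flags "$tmp"    # prints "secondary(0x100)=0 supplementary(0x800)=1"
rm -f "$tmp"
```

A file showing only secondary split hits suggests the alignment used -M, in which case samblaster should be given -M too.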

Citation

SAMBLASTER: fast duplicate marking and structural variant read extraction

Gregory G. Faust, Ira M. Hall

Bioinformatics. 2014 Sep 1; 30(17): 2503–2505.