Informatics on Mac (macでインフォマティクス)

A collection of notes on NGS-related informatics.

samblaster: tagging duplicate reads

2019 1/14: commands corrected

2020 4/17: help output added

 

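samblaster is a tool that tags duplicate reads in a SAM file and can also write split-alignment reads and discordant read pairs, both of which are indicators of structural variation, to separate files. Sorting these reads out at the SAM stage makes downstream steps such as large indel detection from discordant read pairs and split-alignment reads considerably lighter. The tool is also used in speedseq, a fast variant analysis tool for human genomes. The paper was published in 2014.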

 

Installation

GitHub

git clone https://github.com/GregoryFaust/samblaster.git
cd samblaster
make
cp samblaster /usr/local/bin/.

# With Anaconda (Linux only)
conda install -c bioconda -y samblaster

samblaster -h

$ samblaster -h

samblaster: Version 0.1.25

Author: Greg Faust (gf4ea@virginia.edu)

Tool to mark duplicates and optionally output split reads and/or discordant pairs.

Input sam file must contain sequence header and be grouped by read ids (QNAME).

Input typicallly contains paired-end data, although singleton data is allowed with --ignoreUnmated option.

Output will be all alignments in the same order as input, with duplicates marked with FLAG 0x400.

 

Usage:

For use as a post process on an aligner (eg. bwa mem):

     bwa mem <idxbase> samp.r1.fq samp.r2.fq | samblaster [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam

     bwa mem -M <idxbase> samp.r1.fq samp.r2.fq | samblaster -M [-e] [-d samp.disc.sam] [-s samp.split.sam] | samtools view -Sb - > samp.out.bam

For use with a pre-existing bam file to pull split, discordant and/or unmapped reads without marking duplicates:

     samtools view -h samp.bam | samblaster -a [-e] [-d samp.disc.sam] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null

For use with a bam file of singleton long reads to pull split and/or unmapped reads with/without marking duplicates:

     samtools view -h samp.bam | samblaster --ignoreUnmated [-e] --maxReadLength 100000 [-s samp.split.sam] [-u samp.umc.fasta] | samtools view -Sb - > samp.out.bam

     samtools view -h samp.bam | samblaster --ignoreUnmated -a [-e] [-s samp.split.sam] [-u samp.umc.fasta] -o /dev/null

Input/Output Options:

-i --input           FILE Input sam file [stdin].

-o --output          FILE Output sam file for all input alignments [stdout].

-d --discordantFile  FILE Output discordant read pairs to this file. [no discordant file output]

-s --splitterFile    FILE Output split reads to this file abiding by paramaters below. [no splitter file output]

-u --unmappedFile    FILE Output unmapped/clipped reads as FASTQ to this file abiding by parameters below. [no unmapped file output].

                          Requires soft clipping in input file.  Will output FASTQ if QUAL information available, otherwise FASTA.

 

Other Options:

-a --acceptDupMarks       Accept duplicate marks already in input file instead of looking for duplicates in the input.

-e --excludeDups          Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.

-r --removeDups           Remove duplicates reads from all output files. (Implies --excludeDups).

   --addMateTags          Add MC and MQ tags to all output paired-end SAM lines.

   --ignoreUnmated        Suppress abort on unmated alignments. Use only when sure input is read-id grouped,

                          and either paired-end alignments have been filtered or the input file contains singleton reads.

-M                        Run in compatibility mode; both 0x100 and 0x800 are considered chimeric. Similar to BWA MEM -M option.

   --maxReadLength    INT Maximum allowed length of the SEQ/QUAL string in the input file. [500]

                          Primarily useful for marking duplicates in files containing singleton long reads.

   --maxSplitCount    INT Maximum number of split alignments for a read to be included in splitter file. [2]

   --maxUnmappedBases INT Maximum number of un-aligned bases between two alignments to be included in splitter file. [50]

   --minIndelSize     INT Minimum structural variant feature size for split alignments to be included in splitter file. [50]

   --minNonOverlap    INT Minimum non-overlaping base pairs between two alignments for a read to be included in splitter file. [20]

   --minClipSize      INT Minumum number of bases a mapped read must be clipped to be included in unmapped file. [20]

-q --quiet                Output fewer statistics.
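The help text above requires input to be grouped by read ID (QNAME), which a coordinate-sorted BAM generally is not. The following is a minimal sketch of a grouping check, not part of samblaster itself: `is_qname_grouped` is a hypothetical helper name, and the two-column records are made-up toy data rather than complete SAM lines. It reports failure as soon as a read ID reappears after a different read ID has been seen.

```shell
# Hypothetical helper: exit status 0 if all records sharing a QNAME are adjacent.
# Reads SAM text on stdin; header lines starting with "@" are skipped.
is_qname_grouped() {
  awk -F'\t' '
    /^@/ { next }
    $1 != prev {
      if ($1 in seen) { bad = 1; exit }   # this read ID already appeared earlier
      seen[$1] = 1
      prev = $1
    }
    END { exit bad + 0 }'
}

# Toy input (made up): QNAME and FLAG columns only.
grouped=$(printf 'r1\t99\nr1\t147\nr2\t99\nr2\t147\n' | is_qname_grouped && echo yes || echo no)
sorted=$(printf 'r1\t99\nr2\t99\nr1\t147\nr2\t147\n' | is_qname_grouped && echo yes || echo no)
echo "name-grouped input: $grouped, interleaved input: $sorted"
```

With a real file the same function could read from `samtools view -h samp.bam`. It keeps every read ID in memory, so it is better suited to a quick check on a subset than to a full human-genome BAM.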

 

The criteria for calling duplicates, discordant reads, and split reads are described toward the bottom of the following page.

GitHub - GregoryFaust/samblaster: samblaster: a tool to mark duplicates and extract discordant and split reads from sam files.

 

Usage

Tag duplicates as part of the bwa mem alignment step.

bwa index -a is input.fa
bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster | samtools sort -@ 12 -O BAM - > samp.sorted.bam

The command above builds a BAM from FASTQ by piping bwa mem => samblaster => samtools sort. samblaster receives the SAM stream from bwa mem, tags the duplicate reads, and passes the result on to samtools sort.
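According to the help text, duplicates are marked with FLAG 0x400. As a rough way to see what that tagging looks like in SAM text, the sketch below counts distinct read IDs whose FLAG has that bit set. The records are made-up toy data truncated to four columns, and the variable names are only for illustration.

```shell
# Toy SAM records (hypothetical, truncated to QNAME, FLAG, RNAME, POS).
# r2 carries FLAG 1123 and 1171, i.e. 99 and 147 with 0x400 (1024) added.
sam=$(printf 'r1\t99\tchr1\t100\nr1\t147\tchr1\t300\nr2\t1123\tchr1\t100\nr2\t1171\tchr1\t300\nr3\t0\tchr1\t500\n')

# Count distinct read IDs with bit 0x400 set in column 2.
# int(flag / 1024) % 2 tests the bit without relying on gawk's and().
dup_ids=$(printf '%s\n' "$sam" | awk -F'\t' '
  /^@/ { next }                               # skip header lines
  int($2 / 1024) % 2 == 1 { seen[$1] = 1 }    # remember duplicate read IDs
  END { n = 0; for (id in seen) n++; print n }')

echo "duplicate read ids: $dup_ids"
```

On a real BAM the same awk filter could be fed from `samtools view samp.sorted.bam`. Counting read IDs rather than records keeps the number comparable to what samblaster reports, since a duplicate pair contributes more than one record.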

In this run, the following message was printed.

samblaster: Marked 1874 of 339039 (0.55%) read ids as duplicates using 13344768k memory in 1.056S CPU seconds and 45S wall time.

1874 read IDs were flagged as duplicates.
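When many samples are processed, it can be handy to pull these numbers out of the log line. The sketch below parses the message shown above and recomputes the percentage. It assumes the message keeps exactly this word order, which is only confirmed here for the version 0.1.25 output quoted in this post.

```shell
# The summary line as printed in this run (samblaster writes it to stderr).
msg='samblaster: Marked 1874 of 339039 (0.55%) read ids as duplicates using 13344768k memory in 1.056S CPU seconds and 45S wall time.'

# Fields 3 and 5 are the marked and total read-ID counts in this message layout.
marked=$(printf '%s\n' "$msg" | awk '{ print $3 }')
total=$(printf '%s\n' "$msg" | awk '{ print $5 }')

# Recompute the duplicate rate as a percentage with two decimals.
rate=$(awk -v m="$marked" -v t="$total" 'BEGIN { printf "%.2f", 100 * m / t }')

echo "marked=$marked total=$total rate=${rate}%"
```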

 

------------------------------------------------------------------------------------------------

The examples below write SAM output.

Write discordant reads and split reads to separate files.

bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -e -d samp.disc.sam -s samp.split.sam > output.sam
  • -e Exclude reads marked as duplicates from discordant, splitter, and/or unmapped file.
  • -d FILE Output discordant read pairs to this file. [no discordant file output]
  • -s FILE Output split reads to this file abiding by parameters below. [no splitter file output]

Remove duplicate reads from all output.

bwa mem -t 12 -R "@RG\tID:X\tLB:Y\tSM:Z\tPL:ILLUMINA" input.fa *.fastq | samblaster -r -e -d samp.disc.sam -s samp.split.sam > output.sam
  •  -r Remove duplicates reads from all output files.  

  

 

Note: when bwa mem is run with -M (mark shorter split hits as secondary), run samblaster with -M as well.

  • -M  Compatibility mode (details below); both FLAG 0x100 and 0x800 denote supplemental (chimeric). Similar to bwa mem -M option. 
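To make the two flags concrete: 0x100 is the secondary-alignment bit and 0x800 is the supplementary-alignment bit in the SAM FLAG field, and with bwa mem -M the shorter split hits carry 0x100 instead of 0x800. The sketch below decodes which of the two bits a FLAG value has. `classify_flag` is a hypothetical helper, and the FLAG values are made-up examples.

```shell
# Hypothetical helper: report whether a FLAG marks a primary, secondary (0x100),
# or supplementary (0x800) alignment.
classify_flag() {
  awk -v f="$1" 'BEGIN {
    sec = int(f / 256) % 2      # bit 0x100: secondary alignment
    sup = int(f / 2048) % 2     # bit 0x800: supplementary alignment
    if (sup)      print "supplementary"
    else if (sec) print "secondary"
    else          print "primary"
  }'
}

# Example FLAG values: 99, 99 + 0x100 = 355, 99 + 0x800 = 2147.
for flag in 99 355 2147; do
  echo "$flag: $(classify_flag "$flag")"
done
```

Checking a few records of the bwa mem output this way shows whether split hits are arriving as 0x100 or 0x800, which is what decides whether samblaster needs -M.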

Reference

SAMBLASTER: fast duplicate marking and structural variant read extraction

Gregory G. Faust, Ira M. Hall

Bioinformatics. 2014 Sep 1; 30(17): 2503–2505.