macでインフォマティクス


Notes on informatics for HTS (NGS) data.

BBtools, a multifunctional NGS analysis toolkit - part 2

2019 6/13 Title corrected

2020 7/24 bbduk commands added

 

Continuing from part 1, this post introduces the remaining BBtools commands:

  • Reformat - format conversion and quality trimming.
  • Repair - repairs paired-read files whose read order is broken.
  • Stats - reports basic assembly statistics.
  • BBDuk - quality trimming, adapter trimming, and filtering.
  • BBMask - masks simple repeats and highly conserved regions.
  • CalcUniqueness - counts unique reads.

     

 

 

Reformat - format conversion and quality trimming

Several similar tools exist, but Reformat stands out for its low memory usage.

The examples below all assume interleaved fastq. For paired reads in two files, write:

reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>

 

・Convert fastq to fasta.

reformat.sh in=reads1.fq out=reads1.fasta

 

・Convert Phred quality scores from the common ASCII+33 offset to ASCII+64.

reformat.sh in=reads.fq out=reads.fq qin=33 qout=64
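As a quick sanity check of what this conversion does to each quality character, the offset arithmetic can be reproduced with plain shell tools (q=30 is just an arbitrary example score, not part of reformat.sh):

```shell
# Encode Phred quality 30 at both ASCII offsets (illustration only)
q=30
c33=$(printf "\\$(printf '%03o' $((q + 33)))")   # Sanger, offset 33
c64=$(printf "\\$(printf '%03o' $((q + 64)))")   # old Illumina, offset 64
echo "$c33 $c64"   # the same score, two different characters
```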

 

・Convert fastq to fasta + qual.

reformat.sh in=reads.fq out=reads.fa qfout=reads.qual

 

・Split interleaved paired reads into two files.

reformat.sh in=interlace.fq out1=read1.fq out2=read2.fq

 

・Combine two paired-read files into one interleaved file.

reformat.sh in1=read1.fq in2=read2.fq out=reads.fq

 

・Wrap sequence lines at 70 characters (same effect as the fold command).

reformat.sh in=reads.fa out=wrapped.fa fastawrap=70
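Since this wrapping behaves like the standard fold command, the result is easy to picture. The snippet below (plain POSIX tools, no BBtools required) wraps a 150-base dummy sequence the same way:

```shell
# Generate one 150-base sequence line and wrap it at 70 columns
awk 'BEGIN { for (i = 0; i < 150; i++) printf "A"; print "" }' | fold -w 70
# produces two 70-character lines followed by one 10-character line
```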

 

 

・Verify that a paired-read file is correctly interleaved.

reformat.sh in=reads.fq vint
  • verifyinterleaved=f (vint) sets 'vpair' to true and 'interleaved' to true.

 

・Add the pair identifiers /1 and /2 to paired-read files that lack them.

reformat.sh in1=pair1.fq in2=pair2.fq out1=renamed1.fq out2=renamed2.fq addslash=t int=f
  • addslash=f Append ' /1' and ' /2' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved. 
  • int=f (interleaved) Determines whether INPUT file is considered interleaved.

 

・Reverse-complement reads.

reformat.sh in=reads.fq out=out.fq rcomp
  • rcomp=f (rc) Reverse-complement reads.

 

・Convert lowercase bases to uppercase.

reformat.sh in=reads.fq out=out.fq tuc
  • touppercase=t (tuc) Convert input bases to upper-case; otherwise, lower-case will not match.

 

・Make all read names unique.

reformat.sh in=reads.fq out=out.fq uniquenames
  • uniquenames=f Make duplicate names unique by appending _<number>. 

Duplicate names are made unique by appending "_number". Note that this example, unlike the others, uses a large amount of memory.

 

・Clamp quality scores to a given range.

reformat.sh in=reads.fq out=out.fq mincalledquality=2 maxcalledquality=41
  • maxcalledquality=41 Quality scores capped at this upper bound.
  • mincalledquality=2 Quality scores of ACGT bases will be capped at lower bound.

Useful when processing by some other tool has corrupted the quality scores.

 

 

 

Repair - repairs paired-read files whose read order is broken

Repairs fastq files in which the forward and reverse reads are out of sync. For example, trimming paired reads separately with the fastx toolkit produces fastq files whose forward and reverse read counts no longer match. Ideally you would redo quality trimming from the raw reads with a pair-aware tool, but sometimes the original fastq is no longer available. This command repairs such files.

 

Accepts fasta, fastq, and sam input.

・Repair a pair of read files.

repair.sh -Xmx20g in1=broken1.fq in2=broken2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq repair
  • in=<file> The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
  • in2=<file> Use this if 2nd read of pairs are in a different file.
  • out=<file> The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
  • out2=<file> Use this to write 2nd read of pairs to a different file.
  • outs=<file> (outsingle) Write singleton reads here.
  • -Xmx20g This will be passed to Java to set memory usage, overriding the program's automatic memory detection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

 

 

 

Stats - reports basic assembly statistics

・Analyze contigs.fasta.

stats.sh in=contigs.fa gc=gc.txt gcformat=4
  • in=file Specify the input fasta file, or stdin.
  • gc=file Writes ACGTN content per scaffold to a file.
  • gcformat=<0-4>  Select GC output format; default 1.

  -  gcformat=0: (no base content info printed)

  -  gcformat=1: name length A C G T N GC

  -  gcformat=2: name GC

  -  gcformat=4: name length GC

  Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

Basic statistics such as N50 are printed.

user$ stats.sh in=contigs.fa gc=gc.txt gcformat=4

A C G T N IUPAC Other GC GC_stdev

0.2841 0.2276 0.2531 0.2352 0.0000 0.0000 0.0000 0.4807 0.0583

 

Main genome scaffold total:         42

Main genome contig total:           42

Main genome scaffold sequence total: 0.004 MB

Main genome contig sequence total:  0.004 MB  0.000% gap

Main genome scaffold N/L50:         9/131

Main genome contig N/L50:           9/131

Main genome scaffold N/L90:         42/41

Main genome contig N/L90:           42/41

Max scaffold length:                352

Max contig length:                  352

Number of scaffolds > 50 KB:        0

% main genome in scaffolds > 50 KB: 0.00%

 

 

Minimum Number        Number        Total         Total         Scaffold

Scaffold of            of            Scaffold      Contig        Contig  

Length  Scaffolds     Contigs       Length        Length        Coverage

-------- -------------- -------------- -------------- -------------- --------

    All             42             42         3,520         3,520 100.00%

     50             19             19         2,547         2,547 100.00%

    100             11             11         2,009         2,009 100.00%

    250              1             1           352           352 100.00%

The GC content is written to a separate file.

user$ head -5 gc.txt 

#Name                Length GC

NODE_1_length_332_cov_90.186745 352 0.5114

NODE_4_length_46_cov_61.913044   66 0.4242

NODE_6_length_127_cov_8.811024  147 0.3469

NODE_7_length_127_cov_14.677165 147 0.2993

 

 

 

 

 

BBDuk - quality trimming, adapter trimming, filtering, masking, and more

BBDuk performs quality trimming and kmer-based filtering of reads suspected to be contaminants. It can also filter on many other criteria, such as GC content, length, and entropy.

For paired reads, write something like:

bbduk.sh in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq

BBDuk always keeps paired reads in order; for correct behavior, both reads of a pair must be processed together.

 

・Trim 3' adapters.

bbduk.sh in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
  • in=<file> Main input. in=stdin.fq will pipe from stdin.
  • out=<file> (outnonmatch) Write reads here that do not contain kmers matching the database. 'out=stdout.fq' will pipe to standard out.
  • ref=<file,file> Comma-delimited list of reference files.
  • tpe=f (trimpairsevenly) When kmer right-trimming, trim both reads to the minimum length of either.
  • hammingdistance=0 (hdist) Maximum Hamming distance for ref kmers (subs only). Memory use is proportional to (3*K)^hdist.
  • tbo=f (trimbyoverlap) Trim adapters based on where paired reads overlap.
  • mink=0 Look for shorter kmers at read tips down to this length, when k-trimming or masking. 0 means disabled. Enabling this will disable maskmiddle.
  • ktrim=f Trim reads to remove bases matching reference kmers. Values:

   f (don't trim),

   r (trim to the right),

   l (trim to the left)

ktrim=r performs right trimming (3' adapters) and ktrim=l performs left trimming (5' adapters). k sets the kmer length used to search for adapters (set it to the adapter length or shorter), and mink sets the minimum kmer length used at read tips. hdist sets the number of mismatches tolerated when matching reference kmers.

 

 

・Trim 3' regions with quality below 10.

bbduk.sh in=reads.fq out=clean.fq qtrim=r trimq=10
  • trimq=6 Regions with average quality BELOW this will be trimmed.
  • qtrim=f Trim read ends to remove bases with quality below trimq. Performed AFTER looking for kmers. Values:

    rl (trim both ends), 

    f (neither end), 

    r (right end only), 

    l (left end only),

    w (sliding window).

qtrim=r trims quality from the 3' end; qtrim=rl trims both ends. Any value other than qtrim=f enables trimming, which is performed only after all kmer-based operations have completed.

 

・Force-trim the first 10 bases at the 5' end and everything after base 139 at the 3' end.

bbduk.sh in=reads.fq out=clean.fq ftl=10 ftr=139
  •  forcetrimleft=0 (ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
  • forcetrimright=0 (ftr) If positive, trim bases to the right of this position (exclusive, 0-based).

 

・Force-trim the 3' end to a multiple of a fixed length.

bbduk.sh in=reads.fq out=clean.fq ftm=5
  •  forcetrimmod=0 (ftm) If positive, right-trim length to be equal to zero, modulo this number.

The read length is taken modulo 5 and the remainder of 1 base is trimmed. Illumina read lengths generally come in multiples of 5 (50, 75, 150, 300, and so on), but reads carry one extra base, and that final base tends to have very low quality; dividing by 5 and trimming the remainder of 1 removes it.
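The arithmetic is easy to verify by hand. This plain-shell sketch (readlen=151 is a hypothetical read length, not a BBDuk parameter) mirrors what ftm=5 computes:

```shell
# ftm=5: right-trim the read so its length becomes 0 modulo 5
readlen=151                      # e.g. a 151-cycle Illumina run
ftm=5
trim=$(( readlen % ftm ))        # bases removed from the 3' end
newlen=$(( readlen - trim ))     # final read length
echo "trim=$trim newlen=$newlen"
```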

 

・Filter out (discard) reads whose average quality is below 10.

bbduk.sh in=reads.fq out=clean.fq maq=10
  • minavgquality=0 (maq) Reads with average quality (after trimming) below this will be discarded.

 

・Trim regions with average quality below 10, then discard reads that end up shorter than 100 bp.

bbduk.sh in=reads.fq out=clean.fq qtrim=r trimq=10 ml=100
  • trimq=6 Regions with average quality BELOW this will be trimmed. 
  • minlength=10 (ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded if both are shorter.

 

・Kmer filtering.

bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt

#paired-end (matched and unmatched pairs stay synchronized)
bbduk.sh in1=pair_1.fq in2=pair_2.fq \
out1=unmatched_1.fq out2=unmatched_2.fq \
outm1=matched_1.fq outm2=matched_2.fq \
ref=phix.fa k=31 hdist=1 stats=stats.txt
  • outm=<file> (outmatch) Write reads here that contain kmers matching the database.
  • ref=<file,file> Comma-delimited list of reference files.
  • hammingdistance=0 (hdist) Maximum Hamming distance for ref kmers (subs only). Memory use is proportional to (3*K)^hdist.

Reads are searched against the reference with 31-mers, and any read matching with up to one mismatch (hdist=1) is removed. The example above screens against the Illumina spike-in (the PhiX bacteriophage, 5.5 kb). For paired-end input, both the matched and unmatched outputs stay synchronized.

It can also be used to remove microbe-derived reads before assembling an animal or plant genome.


 

・Kmer masking 1.

bbduk.sh in=ecoli.fa out=ecoli_masked.fa ref=salmonella.fa k=25 ktrim=N

Regions matching the Salmonella genome over 25-mers are masked with Ns. To mask with X instead, use ktrim=X.

 

・Kmer-based low-complexity filtering 2: specify an entropy threshold.

bbduk.sh in=reads.fq out=filtered.fq entropy=0.5 entropywindow=50 entropyk=5
  • entropy=-1 Set between 0 and 1 to filter reads with entropy below that value. Higher is more stringent.
  • entropywindow=50 Calculate entropy using a sliding window of this length.
  • entropyk=5 Calculate entropy using kmers of this length.

Reads with average entropy below 0.5 are removed. Entropy is judged from base complexity: a sequence such as AAAAAAAAAAAAAA has entropy 0, and the more random the sequence, the closer to 1. Simple repeats should therefore also be eliminated.
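BBDuk's exact entropy formula is not reproduced here, but the intuition (fewer distinct kmers in a window means lower entropy) can be sketched with a hypothetical count_kmers helper built on awk:

```shell
# Count distinct 5-mers in one sequence line (a rough proxy for entropy)
count_kmers() {
  awk -v k=5 '{
    n = 0
    for (i = 1; i <= length($0) - k + 1; i++) {
      m = substr($0, i, k)
      if (!(m in seen)) { seen[m] = 1; n++ }
    }
    print n
  }'
}
echo "AAAAAAAAAAAAAAAAAAAA" | count_kmers   # homopolymer: every 5-mer is identical
echo "ACGTGCTAGCTAGGATCCAA" | count_kmers   # mixed sequence: many distinct 5-mers
```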

 

・Quality recalibration

bbduk.sh in=reads.fq out=recalibrated.fq recalibrate sam=mapped.sam
  • recalibrate=f (recal) Recalibrate quality scores. Requires calibration matrices generated by CalcTrueQuality.
  • sam=<file,file> If recalibration is desired, and matrices have not already been generated, BBDuk will create them from the sam file.

Outputs a fastq whose base quality scores have been recalibrated based on the sam file. The sam cigar strings must use = (match) and X (mismatch); M does not work, so realign with BBMap or similar first. If no reference is available, the recommended approach is to align against your assembled contigs.

 

・Histogram generation

bbduk.sh in=reads.fq bhist=bhist.txt qhist=qhist.txt gchist=gchist.txt aqhist=aqhist.txt lhist=lhist.txt gcbins=auto 
  • gcbins=100 Number gchist bins. Set to 'auto' to use read length. 

Outputs histogram data for statistics such as base frequency, quality score, GC content, average quality score, and read length. BBMap offers the same options.

 

・Barcode and chastity filtering

bbduk.sh in=reads.fq out=filtered.fq barcodes=ACGTT,GGCCG barcodefilter chastityfilter 
  • barcodes= Comma-delimited list of barcodes or files of barcodes.

Header-parsing parameters - these require Illumina headers:

  • chastityfilter=f (cf) Discard reads with id containing ' 1:Y:' or ' 2:Y:'.
  • barcodefilter=f Remove reads with unexpected barcodes if barcodes is set, or barcodes containing 'N' otherwise. A barcode must be the last part of the read header. 

 

・Cardinality estimation

bbduk.sh in=reads.fq loglog loglogk=31 
  • cardinality=f (loglog) Count unique kmers using the LogLog algorithm.
  • loglogk=31 Use this kmer length for counting.

Counts the number of unique 31-mers.

 

・Matching degenerate sequences such as primers

bbduk.sh in=reads.fq out=matching.fq literal=ACGTTNNNNNGTC copyundefined k=13 mm=f 
  • literal=<seq,seq> Comma-delimited list of literal reference sequences.
  • copyundefined=f (cu) Process non-AGCT IUPAC reference bases by making all possible unambiguous copies. Intended for short motifs or adapter barcodes, as time/memory use is exponential.

Searches for degenerate sequences. This can be used, for example, to exhaustively find the variable regions of 16S metagenomic reads.

 

 

 

BBMask - masks simple repeats and highly conserved regions

Used, for instance, to pre-mask simple repeats so they do not cause false-positive mappings. Shannon entropy is reportedly used to decide whether a sequence is low-complexity. To remove such reads by filtering rather than masking, use BBDuk.

・Mask regions whose entropy (information content) is 0.7 or below.

bbmask.sh in=ref.fa out=masked.fa entropy=0.7
  • entropy=0.70 (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything. 

It is difficult to determine the right threshold precisely, but at entropy=0.7 only about 70 bp of the E. coli genome is masked, and even the copy- and repeat-rich human genome is reportedly only 0.7% masked. Raising this to entropy=0.9 masks about 8 kb of E. coli and 7% of the human genome.

 

 

 

CalcUniqueness - counts unique reads

A kmer is taken from the start of each read and checked for uniqueness; if the kmer has been seen before, the read is judged non-unique. The fraction of unique reads in a dataset can be used to estimate whether sequencing coverage is saturating.

 

 

・Plot a histogram of unique read counts.

bbcountunique.sh in=reads.fq out=histogram.txt

The fraction of unique reads is reported in bins of 25,000 reads.

Kmers must match exactly to be counted as duplicates, so on low-quality data the apparent fraction of unique reads is inflated; keep this caveat in mind.

 

Several other commands are available as well; see the official site for details.

BBTools User Guide - DOE Joint Genome Institute

 

 

 

References

BBMap (aligner for DNA/RNAseq) is now open-source and available for download. - SEQanswers

 

BBtools, a multifunctional NGS analysis toolkit - part 1

2018 9/5 bbmerge-auto.sh corrected

2019 5/14 BBnorm comments corrected, parameters corrected, help output added

2019 6/13 Title corrected, 6/19 part 3 added

2020 1/29 conda install and memory-usage options added, 11/6 typos fixed

2023/02/02 Updated

 

 

BBtools is a multifunctional NGS analysis toolkit provided by the JGI in the United States. It was reportedly open-sourced in 2014, and a paper is said to be in preparation. It offers distinctive commands such as the aligner BBMap, BBMerge for merging paired reads (even non-overlapping ones), and BBnorm for outputting error-corrected fastq. This post introduces the following commands:

  • BBMap - alignment to a reference.
  • BBMerge - merging paired reads; reads can also be extended by adding kmers.
  • BBnorm - coverage normalization.
  • Dedupe - removal of duplicate contigs.

Part 2.

Part 3.

 

Official page

http://jgi.doe.gov/data-and-tools/bbtools/

User guide

http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/

Download

https://sourceforge.net/projects/bbmap/

#homebrew
brew install BBtools
#bioconda (link)
conda install -c bioconda -y bbmap

Unless told otherwise, BBtools commands use every core available.
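To see what that means on your machine, and to cap it explicitly, the core count can be queried portably; the threads= flag and -Xmx option appear in the help output below (the file names in the comment are placeholders):

```shell
# Portable core count (works on both macOS and Linux)
cores=$(getconf _NPROCESSORS_ONLN)
echo "detected $cores cores"
# Cap a BBtools run at 4 threads and 8 GB of RAM:
#   bbmap.sh in=reads.fq out=mapped.sam threads=4 -Xmx8g
```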

$ bbmap.sh

 

BBMap

Written by Brian Bushnell, from Dec. 2010 - present

Last modified December 20, 2017

 

Description:  Fast and accurate splice-aware read aligner.

Please read bbmap/docs/guides/BBMapGuide.txt for more information.

 

To index:     bbmap.sh ref=<reference fasta>

To map:       bbmap.sh in=<reads> out=<output sam>

To map without writing an index:

    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

 

in=stdin will accept reads from standard in, and out=stdout will write to 

standard out, but file extensions are still needed to specify the format of the 

input and output files e.g. in=stdin.fa.gz will read gzipped fasta from 

standard in; out=stdout.sam.gz will write gzipped sam.

 

Indexing Parameters (required when building the index):

nodisk=f                Set to true to build index in memory and write nothing 

                        to disk except output.

ref=<file>              Specify the reference sequence.  Only do this ONCE, 

                        when building the index (unless using 'nodisk').

build=1                 If multiple references are indexed in the same directory,

                        each needs a unique numeric ID (unless using 'nodisk').

k=13                    Kmer length, range 8-15.  Longer is faster but uses 

                        more memory.  Shorter is more sensitive.

                        If indexing and mapping are done in two steps, K should

                        be specified each time.

path=<.>                Specify the location to write the index, if you don't 

                        want it in the current working directory.

usemodulo=f             Throw away ~80% of kmers based on remainder modulo a 

                        number (reduces RAM by 50% and sensitivity slightly).

                        Should be enabled both when building the index AND 

                        when mapping.

rebuild=f               Force a rebuild of the index (ref= should be set).

 

Input Parameters:

build=1                 Designate index to use.  Corresponds to the number 

                        specified when building the index.

in=<file>               Primary reads input; required parameter.

in2=<file>              For paired reads in two files.

interleaved=auto        True forces paired/interleaved input; false forces 

                        single-ended mapping. If not specified, interleaved 

                        status will be autodetected from read names.

fastareadlen=500        Break up FASTA reads longer than this.  Max is 500 for

                        BBMap and 6000 for BBMapPacBio.  Only works for FASTA

                        input (use 'maxlen' for FASTQ input).  The default for

                        bbmap.sh is 500, and for mapPacBio.sh is 6000.

unpigz=f                Spawn a pigz (parallel gzip) process for faster 

                        decompression than using Java.  

                        Requires pigz to be installed.

touppercase=t           (tuc) Convert lowercase letters in reads to upper case 

                        (otherwise they will not match the reference).

 

Sampling Parameters:

 

reads=-1                Set to a positive number N to only process the first N

                        reads (or pairs), then quit.  -1 means use all reads.

samplerate=1            Set to a number from 0 to 1 to randomly select that

                        fraction of reads for mapping. 1 uses all reads.

skipreads=0             Set to a number N to skip the first N reads (or pairs), 

                        then map the rest.

 

Mapping Parameters:

fast=f                  This flag is a macro which sets other paramters to run 

                        faster, at reduced sensitivity.  Bad for RNA-seq.

slow=f                  This flag is a macro which sets other paramters to run 

                        slower, at greater sensitivity.  'vslow' is even slower.

maxindel=16000          Don't look for indels longer than this. Lower is faster.

                        Set to >=100k for RNAseq with long introns like mammals.

strictmaxindel=f        When enabled, do not allow indels longer than 'maxindel'.

                        By default these are not sought, but may be found anyway.

tipsearch=100           Look this far for read-end deletions with anchors

                        shorter than K, using brute force.

minid=0.76              Approximate minimum alignment identity to look for. 

                        Higher is faster and less sensitive.

minhits=1               Minimum number of seed hits required for candidate sites.

                        Higher is faster.

local=f                 Set to true to use local, rather than global, alignments.

                        This will soft-clip ugly ends of poor alignments.

perfectmode=f           Allow only perfect mappings when set to true (very fast).

semiperfectmode=f       Allow only perfect and semiperfect (perfect except for 

                        N's in the reference) mappings.

threads=auto            (t) Set to number of threads desired.  By default, uses 

                        all cores available.

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 

                        multiple top-scoring mapping locations).

                            best    (use the first best site)

                            toss    (consider unmapped)

                            random  (select one top-scoring site randomly)

                            all     (retain all top-scoring sites)

samestrandpairs=f       (ssp) Specify whether paired reads should map to the

                        same strand or opposite strands.

requirecorrectstrand=t  (rcs) Forbid pairing of reads without correct strand 

                        orientation.  Set to false for long-mate-pair libraries.

killbadpairs=f          (kbp) If a read pair is mapped with an inappropriate

                        insert size or orientation, the read with the lower  

                        mapping quality is marked unmapped.

pairedonly=f            (po) Treat unpaired reads as unmapped.  Thus they will 

                        be sent to 'outu' but not 'outm'.

rcomp=f                 Reverse complement both reads prior to mapping (for LMP

                        outward-facing libraries).

rcompmate=f             Reverse complement read2 prior to mapping.

pairlen=32000           Set max allowed distance between paired reads.  

                        (insert size)=(pairlen)+(read1 length)+(read2 length)

rescuedist=1200         Don't try to rescue paired reads if avg. insert size

                        greater than this.  Lower is faster.

rescuemismatches=32     Maximum mismatches allowed in a rescued read.  Lower

                        is faster.

averagepairdist=100     (apd) Initial average distance between paired reads.

                        Varies dynamically; does not need to be specified.

deterministic=f         Run in deterministic mode.  In this case it is good

                        to set averagepairdist.  BBMap is deterministic

                        without this flag if using single-ended reads,

                        or run singlethreaded.

bandwidthratio=0        (bwr) If above zero, restrict alignment band to this 

                        fraction of read length.  Faster but less accurate.

bandwidth=0             (bw) Set the bandwidth directly.

                        fraction of read length.  Faster but less accurate.

usejni=f                (jni) Do alignments faster, in C code.  Requires 

                        compiling the C code; details are in /jni/README.txt.

maxsites2=800           Don't analyze (or print) more than this many alignments 

                        per read.

ignorefrequentkmers=t   (ifk) Discard low-information kmers that occur often.

excludefraction=0.03    (ef) Fraction of kmers to ignore.  For example, 0.03

                        will ignore the most common 3% of kmers.

greedy=t                Use a greedy algorithm to discard the least-useful

                        kmers on a per-read basis.

kfilter=0               If positive, potential mapping sites must have at

                        least this many consecutive exact matches.

 

 

Quality and Trimming Parameters:

qin=auto                Set to 33 or 64 to specify input quality value ASCII

                        offset. 33 is Sanger, 64 is old Solexa.

qout=auto               Set to 33 or 64 to specify output quality value ASCII 

                        offset (only if output format is fastq).

qtrim=f                 Quality-trim ends before mapping.  Options are: 

                        'f' (false), 'l' (left), 'r' (right), and 'lr' (both).

untrim=f                Undo trimming after mapping.  Untrimmed bases will be 

                        soft-clipped in cigar strings.

trimq=6                 Trim regions with average quality below this 

                        (phred algorithm).

mintrimlength=60        (mintl) Don't trim reads to be shorter than this.

fakefastaquality=-1     (ffq) Set to a positive number 1-50 to generate fake

                        quality strings for fasta input reads.

ignorebadquality=f      (ibq) Keep going, rather than crashing, if a read has 

                        out-of-range quality values.

usequality=t            Use quality scores when determining which read kmers

                        to use as seeds.

minaveragequality=0     (maq) Do not map reads with average quality below this.

maqb=0                  If positive, calculate maq from this many initial bases.

 

Output Parameters:

 

out=<file>              Write all reads to this file.

outu=<file>             Write only unmapped reads to this file.  Does not 

                        include unmapped paired reads with a mapped mate.

outm=<file>             Write only mapped reads to this file.  Includes 

                        unmapped paired reads with a mapped mate.

mappedonly=f            If true, treats 'out' like 'outm'.

bamscript=<file>        (bs) Write a shell script to <file> that will turn 

                        the sam output into a sorted, indexed bam file.

ordered=f               Set to true to output reads in same order as input.  

                        Slower and uses more memory.

overwrite=f             (ow) Allow process to overwrite existing files.

secondary=f             Print secondary alignments.

sssr=0.95               (secondarysitescoreratio) Print only secondary alignments

                        with score of at least this fraction of primary.

ssao=f                  (secondarysiteasambiguousonly) Only print secondary 

                        alignments for ambiguously-mapped reads.

maxsites=5              Maximum number of total alignments to print per read.

                        Only relevant when secondary=t.

quickmatch=f            Generate cigar strings more quickly.

trimreaddescriptions=f  (trd) Truncate read and ref names at the first whitespace,

                        assuming that the remainder is a comment or description.

ziplevel=2              (zl) Compression level for zip or gzip output.

pigz=f                  Spawn a pigz (parallel gzip) process for faster 

                        compression than Java.  Requires pigz to be installed.

machineout=f            Set to true to output statistics in machine-friendly 

                        'key=value' format.

printunmappedcount=f    Print the total number of unmapped reads and bases.

                        If input is paired, the number will be of pairs

                        for which both reads are unmapped.

showprogress=0          If positive, print a '.' every X reads.

showprogress2=0         If positive, print the number of seconds since the

                        last progress update (instead of a '.').

renamebyinsert=f        Renames reads based on their mapped insert size.

 

Post-Filtering Parameters:

idfilter=0              Independant of minid; sets exact minimum identity 

                        allowed for alignments to be printed.  Range 0 to 1.

subfilter=-1            Ban alignments with more than this many substitutions.

insfilter=-1            Ban alignments with more than this many insertions.

delfilter=-1            Ban alignments with more than this many deletions.

indelfilter=-1          Ban alignments with more than this many indels.

editfilter=-1           Ban alignments with more than this many edits.

inslenfilter=-1         Ban alignments with an insertion longer than this.

dellenfilter=-1         Ban alignments with a deletion longer than this.

nfilter=-1              Ban alignments with more than this many ns.  This 

                        includes nocall, noref, and off scaffold ends.

 

Sam flags and settings:

noheader=f              Disable generation of header lines.

sam=1.4                 Set to 1.4 to write Sam version 1.4 cigar strings, 

                        with = and X, or 1.3 to use M.

saa=t                   (secondaryalignmentasterisks) Use asterisks instead of

                        bases for sam secondary alignments.

cigar=t                 Set to 'f' to skip generation of cigar strings (faster).

keepnames=f             Keep original names of paired reads, rather than 

                        ensuring both reads have the same name.

intronlen=999999999     Set to a lower number like 10 to change 'D' to 'N' in 

                        cigar strings for deletions of at least that length.

rgid=                   Set readgroup ID.  All other readgroup fields 

                        can be set similarly, with the flag rgXX=

mdtag=f                 Write MD tags.

nhtag=f                 Write NH tags.

xmtag=f                 Write XM tags (may only work correctly with ambig=all).

amtag=f                 Write AM tags.

nmtag=f                 Write NM tags.

xstag=f                 Set to 'xs=fs', 'xs=ss', or 'xs=us' to write XS tags 

                        for RNAseq using firststrand, secondstrand, or 

                        unstranded libraries.  Needed by Cufflinks.  

                        JGI mainly uses 'firststrand'.

stoptag=f               Write a tag indicating read stop location, prefixed by YS:i:

lengthtag=f             Write a tag indicating (query,ref) alignment lengths, 

                        prefixed by YL:Z:

idtag=f                 Write a tag indicating percent identity, prefixed by YI:f:

inserttag=f             Write a tag indicating insert size, prefixed by X8:Z:

scoretag=f              Write a tag indicating BBMap's raw score, prefixed by YR:i:

timetag=f               Write a tag indicating this read's mapping time, prefixed by X0:i:

boundstag=f             Write a tag indicating whether either read in the pair

                        goes off the end of the reference, prefixed by XB:Z:

notags=f                Turn off all optional tags.

 

Histogram and statistics output parameters:

scafstats=<file>        Statistics on how many reads mapped to which scaffold.

refstats=<file>         Statistics on how many reads mapped to which reference

                        file; only for BBSplit.

sortscafs=t             Sort scaffolds or references by read count.

bhist=<file>            Base composition histogram by position.

qhist=<file>            Quality histogram by position.

aqhist=<file>           Histogram of average read quality.

bqhist=<file>           Quality histogram designed for box plots.

lhist=<file>            Read length histogram.

ihist=<file>            Write histogram of insert sizes (for paired reads).

ehist=<file>            Errors-per-read histogram.

qahist=<file>           Quality accuracy histogram of error rates versus 

                        quality score.

indelhist=<file>        Indel length histogram.

mhist=<file>            Histogram of match, sub, del, and ins rates by 

                        read location.

gchist=<file>           Read GC content histogram.

gcbins=100              Number gchist bins.  Set to 'auto' to use read length.

gcpairs=t               Use average GC of paired reads.

idhist=<file>           Histogram of read count versus percent identity.

idbins=100              Number idhist bins.  Set to 'auto' to use read length.

statsfile=stderr        Mapping statistics are printed here.

 

Coverage output parameters (these may reduce speed and use more RAM):

covstats=<file>         Per-scaffold coverage info.

rpkm=<file>             Per-scaffold RPKM/FPKM counts.

covhist=<file>          Histogram of # occurrences of each depth level.

basecov=<file>          Coverage per base location.

bincov=<file>           Print binned coverage per location (one line per X bases).

covbinsize=1000         Set the binsize for binned coverage output.

nzo=t                   Only print scaffolds with nonzero coverage.

twocolumn=f             Change to true to print only ID and Avg_fold instead of 

                        all 6 columns to the 'out=' file.

32bit=f                 Set to true if you need per-base coverage over 64k.

strandedcov=f           Track coverage for plus and minus strand independently.

startcov=f              Only track start positions of reads.

secondarycov=t          Include coverage of secondary alignments.

physcov=f               Calculate physical coverage for paired reads.

                        This includes the unsequenced bases.

delcoverage=t           (delcov) Count bases covered by deletions as covered.

                        True is faster than false.

covk=0                  If positive, calculate kmer coverage statistics.

 

Java Parameters:

-Xmx                    This will be passed to Java to set memory usage, 

                        overriding the program's automatic memory detection.

                        -Xmx20g will specify 20 gigs of RAM, and -Xmx800m 

                        will specify 800 megs.  The max is typically 85% of 

                        physical memory.  The human genome requires around 24g,

                        or 12g with the 'usemodulo' flag.  The index uses 

                        roughly 6 bytes per reference base.

-eoom                   This flag will cause the process to exit if an 

                        out-of-memory exception occurs.  Requires Java 8u92+.

-da                     Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter 

any problems, or post at: http://seqanswers.com/forums/showthread.php?t=41057
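
The coverage parameters listed above belong to the BBMap family of mappers. Assuming the help tail above is from bbmap.sh, a minimal sketch combining several of the coverage outputs in one run (all filenames are placeholders):

```shell
# Map reads and emit per-scaffold stats, per-base coverage, and binned coverage
# (ref.fa / reads.fq / output names are placeholders)
bbmap.sh ref=ref.fa in=reads.fq out=mapped.sam \
    covstats=covstats.txt basecov=basecov.txt \
    bincov=bincov.txt covbinsize=1000 nzo=t
```

Note the help's warning that these outputs may reduce speed and increase RAM use, so request only the files you need.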

 

> bbnorm.sh 

$ bbnorm.sh 

 

Written by Brian Bushnell

Last modified October 19, 2017

 

Description:  Normalizes read depth based on kmer counts.

Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram.

However, Tadpole has superior error-correction to BBNorm.

Please read bbmap/docs/guides/BBNormGuide.txt for more information.

 

Usage:     bbnorm.sh in=<input> out=<reads to keep> outt=<reads to toss> hist=<histogram output>

 

Input parameters:

in=null             Primary input.  Use in2 for paired reads in a second file

in2=null            Second input file for paired reads in two files

extra=null          Additional files to use for input (generating hash table) but not for output

fastareadlen=2^31   Break up FASTA reads longer than this.  Can be useful when processing scaffolded genomes

tablereads=-1       Use at most this many reads when building the hashtable (-1 means all)

kmersample=1        Process every nth kmer, and skip the rest

readsample=1        Process every nth read, and skip the rest

interleaved=auto    May be set to true or false to override autodetection of the input file as paired interleaved.

qin=auto            ASCII offset for input quality.  May be 33 (Sanger), 64 (Illumina), or auto.

 

Output parameters:

out=<file>          File for normalized or corrected reads.  Use out2 for paired reads in a second file

outt=<file>         (outtoss) File for reads that were excluded from primary output

reads=-1            Only process this number of reads, then quit (-1 means all)

sampleoutput=t      Use sampling on output as well as input (not used if sample rates are 1)

keepall=f           Set to true to keep all reads (e.g. if you just want error correction).

zerobin=f           Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms.

                    Default is false, to prevent confusion about how there can be 0-count kmers.

                    The reason is that based on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter.

tmpdir=/var/folders/f2/r9ydwd5d5vxgfzz5h48grzgc0000gn/T/      This will specify a directory for temp files (only needed for multipass runs).  If null, they will be written to the output directory.

usetempdir=t        Allows enabling/disabling of temporary directory; if disabled, temp files will be written to the output directory.

qout=auto           ASCII offset for output quality.  May be 33 (Sanger), 64 (Illumina), or auto (same as input).

rename=f            Rename reads based on their kmer depth.

 

Hashing parameters:

k=31                Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)

bits=32             Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32.  Maximum kmer depth recorded is 2^cbits.  Automatically reduced to 16 in 2-pass.

                    Large values decrease accuracy for a fixed amount of memory, so use the lowest number you can that will still capture highest-depth kmers.

hashes=3            Number of times each kmer is hashed and stored.  Higher is slower.

                    Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.

prefilter=f         True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable.  The prefilter is more memory-efficient because it uses 2-bit cells.

prehashes=2         Number of hashes for prefilter.

prefilterbits=2     (pbits) Bits per cell in prefilter.

prefiltersize=0.35  Fraction of memory to allocate to prefilter.

buildpasses=1       More passes can sometimes increase accuracy by iteratively removing low-depth kmers

minq=6              Ignore kmers containing bases with quality below this

minprob=0.5         Ignore kmers with overall probability of correctness below this

threads=auto        (t) Spawn exactly X hashing threads (default is number of logical processors).  Total active threads may exceed X due to I/O threads.

rdk=t               (removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair, even if that kmer occurs more than once.

 

Normalization parameters:

fixspikes=f         (fs) Do a slower, high-precision bloom filter lookup of kmers that appear to have an abnormally high depth due to collisions.

target=100          (tgt) Target normalization depth.  NOTE:  All depth parameters control kmer depth, not read depth.

                    For kmer depth Dk, read depth Dr, read length R, and kmer size K:  Dr=Dk*(R/(R-K+1))

maxdepth=-1         (max) Reads will not be downsampled when below this depth, even if they are above the target depth.            

mindepth=5          (min) Kmers with depth below this number will not be included when calculating the depth of a read.

minkmers=15         (mgkpr) Reads must have at least this many kmers over min depth to be retained.  Aka 'mingoodkmersperread'.

percentile=54.0     (dp) Read depth is by default inferred from the 54th percentile of kmer depth, but this may be changed to any number 1-100.

uselowerdepth=t     (uld) For pairs, use the depth of the lower read as the depth proxy.

deterministic=t     (dr) Generate random numbers deterministically to ensure identical output between multiple runs.  May decrease speed with a huge number of threads.

passes=2            (p) 1 pass is the basic mode.  2 passes (default) allows greater accuracy, error detection, and better control of output depth.

 

Error detection parameters:

hdp=90.0            (highdepthpercentile) Position in sorted kmer depth array used as proxy of a read's high kmer depth.

ldp=25.0            (lowdepthpercentile) Position in sorted kmer depth array used as proxy of a read's low kmer depth.

tossbadreads=f      (tbr) Throw away reads detected as containing errors.

requirebothbad=f    (rbb) Only toss bad pairs if both reads are bad.

errordetectratio=125   (edr) Reads with a ratio of at least this much between their high and low depth kmers will be classified as error reads.

highthresh=12       (ht) Threshold for high kmer.  Kmers at or above this are considered non-error.

lowthresh=3         (lt) Threshold for low kmer.  Kmers at this and below are always considered errors.

 

Error correction parameters:

ecc=f               Set to true to correct errors.  NOTE: Tadpole is now preferred for ecc as it does a better job.

ecclimit=3          Correct up to this many errors per read.  If more are detected, the read will remain unchanged.

errorcorrectratio=140  (ecr) Adjacent kmers with a depth ratio of at least this much between them will be classified as an error.

echighthresh=22     (echt) Threshold for high kmer.  A kmer at this or above may be considered non-error.

eclowthresh=2       (eclt) Threshold for low kmer.  Kmers at this and below are considered errors.

eccmaxqual=127      Do not correct bases with quality above this value.

aec=f               (aggressiveErrorCorrection) Sets more aggressive values of ecr=100, ecclimit=7, echt=16, eclt=3.

cec=f               (conservativeErrorCorrection) Sets more conservative values of ecr=180, ecclimit=2, echt=30, eclt=1, sl=4, pl=4.

meo=f               (markErrorsOnly) Marks errors by reducing quality value of suspected errors; does not correct anything.

mue=t               (markUncorrectableErrors) Marks errors only on uncorrectable reads; requires 'ecc=t'.

overlap=f           (ecco) Error correct by read overlap.

 

Depth binning parameters:

lowbindepth=10      (lbd) Cutoff for low depth bin.

highbindepth=80     (hbd) Cutoff for high depth bin.

outlow=<file>       Pairs in which both reads have a median below lbd go into this file.

outhigh=<file>      Pairs in which both reads have a median above hbd go into this file.

outmid=<file>       All other pairs go into this file.

 

Histogram parameters:

hist=<file>         Specify a file to write the input kmer depth histogram.

histout=<file>      Specify a file to write the output kmer depth histogram.

histcol=3           (histogramcolumns) Number of histogram columns, 2 or 3.

pzc=f               (printzerocoverage) Print lines in the histogram with zero coverage.

histlen=1048576     Max kmer depth displayed in histogram.  Also affects statistics displayed, but does not affect normalization.

 

Peak calling parameters:

peaks=<file>        Write the peaks to this file.  Default is stdout.

minHeight=2         (h) Ignore peaks shorter than this.

minVolume=2         (v) Ignore peaks with less area than this.

minWidth=2          (w) Ignore peaks narrower than this.

minPeak=2           (minp) Ignore peaks with an X-value below this.

maxPeak=BIG         (maxp) Ignore peaks with an X-value above this.

maxPeakCount=8      (maxpc) Print up to this many peaks (prioritizing height).

 

Java Parameters:

-Xmx                This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                    -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

-eoom               This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.

-da                 Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
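
As a concrete illustration of the parameters above, a typical normalization run (filenames are placeholders), followed by the kmer-depth-to-read-depth conversion given in the 'target' description:

```shell
# Normalize to ~100x kmer depth, ignoring kmers below depth 5,
# and write the input kmer depth histogram (filenames are placeholders)
bbnorm.sh in=reads.fq out=normalized.fq target=100 mindepth=5 hist=khist.txt

# Read depth implied by a kmer depth target: Dr = Dk * (R / (R - K + 1))
# e.g. Dk=100, R=150, K=31  ->  100 * 150/120 = 125x read depth
awk 'BEGIN{Dk=100; R=150; K=31; printf "%.1f\n", Dk*(R/(R-K+1))}'
```

So with default k=31 and 150 bp reads, target=100 corresponds to roughly 125x read depth, which is worth keeping in mind when choosing a target.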

 

> mapPacBio.sh 

$ mapPacBio.sh

 

BBMap

Written by Brian Bushnell, from Dec. 2010 - present

Last modified December 20, 2017

 

Description:  Fast and accurate splice-aware read aligner.

Please read bbmap/docs/guides/BBMapGuide.txt for more information.

 

To index:     bbmap.sh ref=<reference fasta>

To map:       bbmap.sh in=<reads> out=<output sam>

To map without writing an index:

    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

 

in=stdin will accept reads from standard in, and out=stdout will write to 

standard out, but file extensions are still needed to specify the format of the 

input and output files e.g. in=stdin.fa.gz will read gzipped fasta from 

standard in; out=stdout.sam.gz will write gzipped sam.

 

Indexing Parameters (required when building the index):

nodisk=f                Set to true to build index in memory and write nothing 

                        to disk except output.

ref=<file>              Specify the reference sequence.  Only do this ONCE, 

                        when building the index (unless using 'nodisk').

build=1                 If multiple references are indexed in the same directory,

                        each needs a unique numeric ID (unless using 'nodisk').

k=13                    Kmer length, range 8-15.  Longer is faster but uses 

                        more memory.  Shorter is more sensitive.

                        If indexing and mapping are done in two steps, K should

                        be specified each time.

path=<.>                Specify the location to write the index, if you don't 

                        want it in the current working directory.

usemodulo=f             Throw away ~80% of kmers based on remainder modulo a 

                        number (reduces RAM by 50% and sensitivity slightly).

                        Should be enabled both when building the index AND 

                        when mapping.

rebuild=f               Force a rebuild of the index (ref= should be set).

 

Input Parameters:

build=1                 Designate index to use.  Corresponds to the number 

                        specified when building the index.

in=<file>               Primary reads input; required parameter.

in2=<file>              For paired reads in two files.

interleaved=auto        True forces paired/interleaved input; false forces 

                        single-ended mapping. If not specified, interleaved 

                        status will be autodetected from read names.

fastareadlen=500        Break up FASTA reads longer than this.  Max is 500 for

                        BBMap and 6000 for BBMapPacBio.  Only works for FASTA

                        input (use 'maxlen' for FASTQ input).  The default for

                        bbmap.sh is 500, and for mapPacBio.sh is 6000.

unpigz=f                Spawn a pigz (parallel gzip) process for faster 

                        decompression than using Java.  

                        Requires pigz to be installed.

touppercase=t           (tuc) Convert lowercase letters in reads to upper case 

                        (otherwise they will not match the reference).

 

Sampling Parameters:

 

reads=-1                Set to a positive number N to only process the first N

                        reads (or pairs), then quit.  -1 means use all reads.

samplerate=1            Set to a number from 0 to 1 to randomly select that

                        fraction of reads for mapping. 1 uses all reads.

skipreads=0             Set to a number N to skip the first N reads (or pairs), 

                        then map the rest.

 

Mapping Parameters:

fast=f                  This flag is a macro which sets other parameters to run 

                        faster, at reduced sensitivity.  Bad for RNA-seq.

slow=f                  This flag is a macro which sets other parameters to run 

                        slower, at greater sensitivity.  'vslow' is even slower.

maxindel=16000          Don't look for indels longer than this. Lower is faster.

                        Set to >=100k for RNAseq with long introns like mammals.

strictmaxindel=f        When enabled, do not allow indels longer than 'maxindel'.

                        By default these are not sought, but may be found anyway.

tipsearch=100           Look this far for read-end deletions with anchors

                        shorter than K, using brute force.

minid=0.76              Approximate minimum alignment identity to look for. 

                        Higher is faster and less sensitive.

minhits=1               Minimum number of seed hits required for candidate sites.

                        Higher is faster.

local=f                 Set to true to use local, rather than global, alignments.

                        This will soft-clip ugly ends of poor alignments.

perfectmode=f           Allow only perfect mappings when set to true (very fast).

semiperfectmode=f       Allow only perfect and semiperfect (perfect except for 

                        N's in the reference) mappings.

threads=auto            (t) Set to number of threads desired.  By default, uses 

                        all cores available.

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 

                        multiple top-scoring mapping locations).

                            best    (use the first best site)

                            toss    (consider unmapped)

                            random  (select one top-scoring site randomly)

                            all     (retain all top-scoring sites)

samestrandpairs=f       (ssp) Specify whether paired reads should map to the

                        same strand or opposite strands.

requirecorrectstrand=t  (rcs) Forbid pairing of reads without correct strand 

                        orientation.  Set to false for long-mate-pair libraries.

killbadpairs=f          (kbp) If a read pair is mapped with an inappropriate

                        insert size or orientation, the read with the lower  

                        mapping quality is marked unmapped.

pairedonly=f            (po) Treat unpaired reads as unmapped.  Thus they will 

                        be sent to 'outu' but not 'outm'.

rcomp=f                 Reverse complement both reads prior to mapping (for LMP

                        outward-facing libraries).

rcompmate=f             Reverse complement read2 prior to mapping.

pairlen=32000           Set max allowed distance between paired reads.  

                        (insert size)=(pairlen)+(read1 length)+(read2 length)

rescuedist=1200         Don't try to rescue paired reads if avg. insert size

                        greater than this.  Lower is faster.

rescuemismatches=32     Maximum mismatches allowed in a rescued read.  Lower

                        is faster.

averagepairdist=100     (apd) Initial average distance between paired reads.

                        Varies dynamically; does not need to be specified.

deterministic=f         Run in deterministic mode.  In this case it is good

                        to set averagepairdist.  BBMap is deterministic

                        without this flag if using single-ended reads,

                        or run singlethreaded.

bandwidthratio=0        (bwr) If above zero, restrict alignment band to this 

                        fraction of read length.  Faster but less accurate.

bandwidth=0             (bw) Set the bandwidth directly.

usejni=f                (jni) Do alignments faster, in C code.  Requires 

                        compiling the C code; details are in /jni/README.txt.

maxsites2=800           Don't analyze (or print) more than this many alignments 

                        per read.

ignorefrequentkmers=t   (ifk) Discard low-information kmers that occur often.

excludefraction=0.03    (ef) Fraction of kmers to ignore.  For example, 0.03

                        will ignore the most common 3% of kmers.

greedy=t                Use a greedy algorithm to discard the least-useful

                        kmers on a per-read basis.

kfilter=0               If positive, potential mapping sites must have at

                        least this many consecutive exact matches.

 

 

Quality and Trimming Parameters:

qin=auto                Set to 33 or 64 to specify input quality value ASCII

                        offset. 33 is Sanger, 64 is old Solexa.

qout=auto               Set to 33 or 64 to specify output quality value ASCII 

                        offset (only if output format is fastq).

qtrim=f                 Quality-trim ends before mapping.  Options are: 

                        'f' (false), 'l' (left), 'r' (right), and 'lr' (both).

untrim=f                Undo trimming after mapping.  Untrimmed bases will be 

                        soft-clipped in cigar strings.

trimq=6                 Trim regions with average quality below this 

                        (phred algorithm).

mintrimlength=60        (mintl) Don't trim reads to be shorter than this.

fakefastaquality=-1     (ffq) Set to a positive number 1-50 to generate fake

                        quality strings for fasta input reads.

ignorebadquality=f      (ibq) Keep going, rather than crashing, if a read has 

                        out-of-range quality values.

usequality=t            Use quality scores when determining which read kmers

                        to use as seeds.

minaveragequality=0     (maq) Do not map reads with average quality below this.

maqb=0                  If positive, calculate maq from this many initial bases.

 

Output Parameters:

 

out=<file>              Write all reads to this file.

outu=<file>             Write only unmapped reads to this file.  Does not 

                        include unmapped paired reads with a mapped mate.

outm=<file>             Write only mapped reads to this file.  Includes 

                        unmapped paired reads with a mapped mate.

mappedonly=f            If true, treats 'out' like 'outm'.

bamscript=<file>        (bs) Write a shell script to <file> that will turn 

                        the sam output into a sorted, indexed bam file.

ordered=f               Set to true to output reads in same order as input.  

                        Slower and uses more memory.

overwrite=f             (ow) Allow process to overwrite existing files.

secondary=f             Print secondary alignments.

sssr=0.95               (secondarysitescoreratio) Print only secondary alignments

                        with score of at least this fraction of primary.

ssao=f                  (secondarysiteasambiguousonly) Only print secondary 

                        alignments for ambiguously-mapped reads.

maxsites=5              Maximum number of total alignments to print per read.

                        Only relevant when secondary=t.

quickmatch=f            Generate cigar strings more quickly.

trimreaddescriptions=f  (trd) Truncate read and ref names at the first whitespace,

                        assuming that the remainder is a comment or description.

ziplevel=2              (zl) Compression level for zip or gzip output.

pigz=f                  Spawn a pigz (parallel gzip) process for faster 

                        compression than Java.  Requires pigz to be installed.

machineout=f            Set to true to output statistics in machine-friendly 

                        'key=value' format.

printunmappedcount=f    Print the total number of unmapped reads and bases.

                        If input is paired, the number will be of pairs

                        for which both reads are unmapped.

showprogress=0          If positive, print a '.' every X reads.

showprogress2=0         If positive, print the number of seconds since the

                        last progress update (instead of a '.').

renamebyinsert=f        Renames reads based on their mapped insert size.

 

Post-Filtering Parameters:

idfilter=0              Independent of minid; sets exact minimum identity 

                        allowed for alignments to be printed.  Range 0 to 1.

subfilter=-1            Ban alignments with more than this many substitutions.

insfilter=-1            Ban alignments with more than this many insertions.

delfilter=-1            Ban alignments with more than this many deletions.

indelfilter=-1          Ban alignments with more than this many indels.

editfilter=-1           Ban alignments with more than this many edits.

inslenfilter=-1         Ban alignments with an insertion longer than this.

dellenfilter=-1         Ban alignments with a deletion longer than this.

nfilter=-1              Ban alignments with more than this many ns.  This 

                        includes nocall, noref, and off scaffold ends.

 

Sam flags and settings:

noheader=f              Disable generation of header lines.

sam=1.4                 Set to 1.4 to write Sam version 1.4 cigar strings, 

                        with = and X, or 1.3 to use M.

saa=t                   (secondaryalignmentasterisks) Use asterisks instead of

                        bases for sam secondary alignments.

cigar=t                 Set to 'f' to skip generation of cigar strings (faster).

keepnames=f             Keep original names of paired reads, rather than 

                        ensuring both reads have the same name.

intronlen=999999999     Set to a lower number like 10 to change 'D' to 'N' in 

                        cigar strings for deletions of at least that length.

rgid=                   Set readgroup ID.  All other readgroup fields 

                        can be set similarly, with the flag rgXX=

mdtag=f                 Write MD tags.

nhtag=f                 Write NH tags.

xmtag=f                 Write XM tags (may only work correctly with ambig=all).

amtag=f                 Write AM tags.

nmtag=f                 Write NM tags.

xstag=f                 Set to 'xs=fs', 'xs=ss', or 'xs=us' to write XS tags 

                        for RNAseq using firststrand, secondstrand, or 

                        unstranded libraries.  Needed by Cufflinks.  

                        JGI mainly uses 'firststrand'.

stoptag=f               Write a tag indicating read stop location, prefixed by YS:i:

lengthtag=f             Write a tag indicating (query,ref) alignment lengths, 

                        prefixed by YL:Z:

idtag=f                 Write a tag indicating percent identity, prefixed by YI:f:

inserttag=f             Write a tag indicating insert size, prefixed by X8:Z:

scoretag=f              Write a tag indicating BBMap's raw score, prefixed by YR:i:

timetag=f               Write a tag indicating this read's mapping time, prefixed by X0:i:

boundstag=f             Write a tag indicating whether either read in the pair

                        goes off the end of the reference, prefixed by XB:Z:

notags=f                Turn off all optional tags.

 

Histogram and statistics output parameters:

scafstats=<file>        Statistics on how many reads mapped to which scaffold.

refstats=<file>         Statistics on how many reads mapped to which reference

                        file; only for BBSplit.

sortscafs=t             Sort scaffolds or references by read count.

bhist=<file>            Base composition histogram by position.

qhist=<file>            Quality histogram by position.

aqhist=<file>           Histogram of average read quality.

bqhist=<file>           Quality histogram designed for box plots.

lhist=<file>            Read length histogram.

ihist=<file>            Write histogram of insert sizes (for paired reads).

ehist=<file>            Errors-per-read histogram.

qahist=<file>           Quality accuracy histogram of error rates versus 

                        quality score.

indelhist=<file>        Indel length histogram.

mhist=<file>            Histogram of match, sub, del, and ins rates by 

                        read location.

gchist=<file>           Read GC content histogram.

gcbins=100              Number of gchist bins.  Set to 'auto' to use read length.

gcpairs=t               Use average GC of paired reads.

idhist=<file>           Histogram of read count versus percent identity.

idbins=100              Number of idhist bins.  Set to 'auto' to use read length.

statsfile=stderr        Mapping statistics are printed here.

 

Coverage output parameters (these may reduce speed and use more RAM):

covstats=<file>         Per-scaffold coverage info.

rpkm=<file>             Per-scaffold RPKM/FPKM counts.

covhist=<file>          Histogram of # occurrences of each depth level.

basecov=<file>          Coverage per base location.

bincov=<file>           Print binned coverage per location (one line per X bases).

covbinsize=1000         Set the binsize for binned coverage output.

nzo=t                   Only print scaffolds with nonzero coverage.

twocolumn=f             Change to true to print only ID and Avg_fold instead of 

                        all 6 columns to the 'out=' file.

32bit=f                 Set to true if you need per-base coverage over 64k.

strandedcov=f           Track coverage for plus and minus strand independently.

startcov=f              Only track start positions of reads.

secondarycov=t          Include coverage of secondary alignments.

physcov=f               Calculate physical coverage for paired reads.

                        This includes the unsequenced bases.

delcoverage=t           (delcov) Count bases covered by deletions as covered.

                        True is faster than false.

covk=0                  If positive, calculate kmer coverage statistics.

 

Java Parameters:

-Xmx                    This will be passed to Java to set memory usage, 

                        overriding the program's automatic memory detection.

                        -Xmx20g will specify 20 gigs of RAM, and -Xmx800m 

                        will specify 800 megs.  The max is typically 85% of 

                        physical memory.  The human genome requires around 24g,

                        or 12g with the 'usemodulo' flag.  The index uses 

                        roughly 6 bytes per reference base.

-eoom                   This flag will cause the process to exit if an 

                        out-of-memory exception occurs.  Requires Java 8u92+.

-da                     Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter 

any problems, or post at: http://seqanswers.com/forums/showthread.php?t=41057
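
mapPacBio.sh takes the same flags as bbmap.sh but is tuned for long, high-error reads (note the larger fastareadlen default of 6000). A minimal sketch (filenames are placeholders):

```shell
# Map PacBio reads without writing an index to disk
# (ref.fa / pacbio.fq / mapped.sam are placeholders)
mapPacBio.sh ref=ref.fa in=pacbio.fq out=mapped.sam nodisk
```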

 

> stats.sh

$ stats.sh

 

Written by Brian Bushnell

Last modified December 7, 2017

 

Description:  Generates basic assembly statistics such as scaffold count, 

N50, L50, GC content, gap percent, etc.  For multiple files, please use

statswrapper.sh.  Works with fasta and fastq only (gzipped is fine).

Please read bbmap/docs/guides/StatsGuide.txt for more information.

 

Usage:        stats.sh in=<file>

 

Parameters:

in=file         Specify the input fasta file, or stdin.

gc=file         Writes ACGTN content per scaffold to a file.

gchist=file     Filename to output scaffold gc content histogram.

shist=file      Filename to output cumulative scaffold length histogram.

gcbins=200      Number of bins for gc histogram.

n=10            Number of contiguous Ns to signify a break between contigs.

k=13            Estimate memory usage of BBMap with this kmer length.

minscaf=0       Ignore scaffolds shorter than this.

phs=f           (printheaderstats) Set to true to print total size of headers.

n90=t           (printn90) Print the N/L90 metrics.

extended=f      Print additional metrics such as N/L90 and log sum.

logoffset=1000  Minimum length for calculating log sum.

logbase=2       Log base for calculating log sum.

pdl=f           (printduplicatelines) Set to true to print lines in the 

                scaffold size table where the counts did not change.

n_=t            This flag will prefix the terms 'contigs' and 'scaffolds'

                with 'n_' in formats 3-6.

addname=f       Adds a column for input file name, for formats 3-6.

 

format=<0-7>    Format of the stats information; default 1.

format=0 prints no assembly stats.

format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.

format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.

format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.

format=4 is like 3 but with scaffold data only.

format=5 is like 3 but with contig data only.

format=6 is like 3 but the header starts with a #.

format=7 is like 1 but only prints contig info.

 

gcformat=<0-4>  Select GC output format; default 1.

gcformat=0: (no base content info printed)

gcformat=1: name length A C G T N GC

gcformat=2: name GC

gcformat=4: name length GC

Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
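
Two example invocations of the parameters above (assembly filename is a placeholder):

```shell
# Machine-parsable assembly stats (whole numbers, no commas)
stats.sh in=assembly.fa format=2

# Per-scaffold GC content written to a separate file: name, length, GC
stats.sh in=assembly.fa gc=gc.txt gcformat=4
```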

 

> bbmerge.sh 

$ bbmerge.sh

 

Written by Brian Bushnell and Jonathan Rood

Last modified March 13, 2017

 

Description:  Merges paired reads into single reads by overlap detection.

With sufficient coverage, can also merge nonoverlapping reads by kmer extension.

Kmer mode requires much more memory, and should be used with the bbmerge-auto.sh script.

Please read bbmap/docs/guides/BBMergeGuide.txt for more information.

 

Usage for interleaved files: bbmerge.sh in=<reads> out=<merged reads> outu=<unmerged reads>

Usage for paired files:     bbmerge.sh in1=<read1> in2=<read2> out=<merged reads> outu1=<unmerged1> outu2=<unmerged2>

 

Input may be stdin or a file, fasta or fastq, raw or gzipped.

 

Input parameters:

in=null              Primary input. 'in2' will specify a second file.

interleaved=auto     May be set to true or false to override autodetection of

                     whether the input file is interleaved.

reads=-1             Quit after this many read pairs (-1 means all).

 

Output parameters:

out=<file>           File for merged reads. 'out2' will specify a second file.

outu=<file>          File for unmerged reads. 'outu2' will specify a second file.

outinsert=<file>     (outi) File to write read names and insert sizes.

outadapter=<file>    (outa) File to write consensus adapter sequences.

outc=<file>          File to write input read kmer cardinality estimate.

ihist=<file>         (hist) Insert length histogram output file.

nzo=t                Only print histogram bins with nonzero values.

showhiststats=t      Print extra header lines with statistical information.

ziplevel=2           Set to 1 (lowest) through 9 (max) to change compression

                     level; lower compression is faster.

ordered=f            Output reads in same order as input.

mix=f                Output both the merged (or mergable) and unmerged reads

                     in the same file (out=).  Useful for ecco mode.

 

Trimming/Filtering parameters:

qtrim=f              Trim read ends to remove bases with quality below minq.

                     Trims BEFORE merging.

                     Values: t (trim both ends), 

                             f (neither end), 

                             r (right end only), 

                             l (left end only).

qtrim2=f             May be specified instead of qtrim to perform trimming 

                     only if merging is unsuccessful, then retry merging.

trimq=10             Trim quality threshold.  This may be a comma-delimited

                     list (ascending) to try multiple values.

minlength=1          (ml) Reads shorter than this after trimming, but before

                     merging, will be discarded. Pairs will be discarded only

                     if both are shorter.

maxlength=-1         Reads with longer insert sizes will be discarded.

tbo=f                (trimbyoverlap) Trim overlapping reads to remove 

                     rightmost (3') non-overlapping portion, instead of joining.

minavgquality=0      (maq) Reads with average quality below this, after 

                     trimming, will not be attempted to be merged.

maxexpectederrors=0  (mee) If positive, reads with more combined expected 

                     errors than this will not be attempted to be merged.

forcetrimleft=0      (ftl) If nonzero, trim left bases of the read to 

                     this position (exclusive, 0-based).

forcetrimright=0     (ftr) If nonzero, trim right bases of the read 

                     after this position (exclusive, 0-based).

forcetrimright2=0    (ftr2) If positive, trim this many bases on the right end.

forcetrimmod=5       (ftm) If positive, trim length to be equal to 

                     zero modulo this number.

ooi=f                Output only incorrectly merged reads, for testing.

trimpolya=t          Trim trailing poly-A tail from adapter output.  Only 

                     affects outadapter.  This also trims poly-A followed

                     by poly-G, which occurs on NextSeq.

 

Processing Parameters:

usejni=f             (jni) Do overlapping in C code, which is faster.  Requires

                     compiling the C code; details are in /jni/README.txt.

merge=t              Create merged reads.  If set to false, you can still 

                     generate an insert histogram.

ecco=f               Error-correct the overlapping part, but don't merge.

trimnonoverlapping=f (tno) Trim all non-overlapping portions, leaving only

                     consensus sequence.  By default, only sequence to the 

                     right of the overlap (adapter sequence) is trimmed.

useoverlap=t         Attempt to find the insert size using read overlap.

mininsert=35         Minimum insert size to merge reads.

mininsert0=35        Insert sizes less than this will not be considered.

                     Must be less than or equal to mininsert.

minoverlap=12        Minimum number of overlapping bases to allow merging.

minoverlap0=8        Overlaps shorter than this will not be considered.

                     Must be less than or equal to minoverlap.

minq=9               Ignore bases with quality below this.

maxq=41              Cap output quality scores at this.

entropy=t            Increase the minimum overlap requirement for low-

                     complexity reads.

efilter=6            Ban overlaps with over this many times the expected 

                     number of errors.  Lower is more strict. -1 disables.

pfilter=0.00004      Ban improbable overlaps.  Higher is more strict. 0 will

                     disable the filter; 1 will allow only perfect overlaps.

kfilter=0            Ban overlaps that create kmers with count below

                     this value (0 disables).  If this is used minprob should

                     probably be set to 0.  Requires good coverage.

ouq=f                Calculate best overlap using quality values.

owq=t                Calculate best overlap without using quality values.

usequality=t         If disabled, quality values are completely ignored,

                     both for overlap detection and filtering.  May be useful

                     for data with inaccurate quality values.

iupacton=f           (itn) Change ambiguous IUPAC symbols to N.

adapter=             Specify the adapter sequences used for these reads, if

                     known; this can be a fasta file or a literal sequence.

                     Read 1 and 2 can have adapters specified independently

                     with the adapter1 and adapter2 flags.  adapter=default

                     will use a list of common adapter sequences.

 

Ratio Mode: 

ratiomode=t          Score overlaps based on the ratio of matching to 

                     mismatching bases.

maxratio=0.09        Max error rate; higher increases merge rate.

ratiomargin=5.5      Lower increases merge rate; min is 1.

ratiooffset=0.55     Lower increases merge rate; min is 0.

ratiominoverlapreduction=3  This is the difference between minoverlap in 

                     flat mode and minoverlap in ratio mode; generally, 

                     minoverlap should be lower in ratio mode.

minsecondratio=0.1   Cutoff for second-best overlap ratio.

 

Flat Mode: 

flatmode=f           Score overlaps based on the total number of mismatching

                     bases only.

margin=2             The best overlap must have at least 'margin' fewer 

                     mismatches than the second best.

mismatches=3         Do not allow more than this many mismatches.

requireratiomatch=f  (rrm) Require the answer from flat mode and ratio mode

                     to agree, reducing false positives if both are enabled.

trimonfailure=t      (tof) If detecting insert size by overlap fails,

                     the reads will be trimmed and this will be re-attempted.

 

 

*** Ratio Mode and Flat Mode may be used alone or simultaneously. ***

*** Ratio Mode is usually more accurate and is the default mode. ***

 

 

Strictness (these are mutually exclusive macros that set other parameters):

strict=f             Decrease false positive rate and merging rate.

verystrict=f         (vstrict) Greatly decrease FP and merging rate.

ultrastrict=f        (ustrict) Decrease FP and merging rate even more.

maxstrict=f          (xstrict) Maximally decrease FP and merging rate.

loose=f              Increase false positive rate and merging rate.

veryloose=f          (vloose) Greatly increase FP and merging rate.

ultraloose=f         (uloose) Increase FP and merging rate even more.

maxloose=f           (xloose) Maximally increase FP and merging rate.

fast=f               Fastest possible mode; less accurate.

 

Tadpole Parameters (for read extension and error-correction):

k=31                 Kmer length.  31 (or less) is fastest and uses the least

                     memory, but higher values may be more accurate.  

                     60 tends to work well for 150bp reads.

extend=0             Extend reads to the right this much before merging.

                     Requires sufficient (>5x) kmer coverage.

extend2=0            Extend reads this much only after a failed merge attempt,

                     or in rem/rsem mode.

iterations=1         (ei) Iteratively attempt to extend by extend2 distance

                     and merge up to this many times.

rem=f                (requireextensionmatch) Do not merge if the predicted

                     insert size differs before and after extension.

                     However, if only the extended reads overlap, then that

                     insert will be used.  Requires setting extend2.

rsem=f               (requirestrictextensionmatch) Similar to rem but stricter.

                     Reads will only merge if the predicted insert size before

                     and after extension match.  Requires setting extend2.

ecctadpole=f         (ecct) If reads fail to merge, error-correct with Tadpole

                     and try again.  This happens prior to extend2.

reassemble=t         If ecct is enabled, use Tadpole's reassemble mode for 

                     error correction.  Alternatives are pincer and tail.

removedeadends       (shave) Remove kmers leading to dead ends.

removebubbles        (rinse) Remove kmers in error bubbles.

mindepthseed=3       (mds) Minimum kmer depth to begin extension.

mindepthextend=2     (mde) Minimum kmer depth to continue extension.

branchmult1=20       Min ratio of 1st to 2nd-greatest path depth at high depth.

branchmult2=3        Min ratio of 1st to 2nd-greatest path depth at low depth.

branchlower=3        Max value of 2nd-greatest path depth to be considered low.

ibb=t                Ignore backward branches when extending.

extra=<file>         A file or comma-delimited list of files of reads to use

                     for kmer counting, but not for merging or output.

prealloc=f           Pre-allocate memory rather than dynamically growing; 

                     faster and more memory-efficient for large datasets.  

                     A float fraction (0-1) may be specified, default 1.

prefilter=0          If set to a positive integer, use a countmin sketch to 

                     ignore kmers with depth of that value or lower, to

                     reduce memory usage.

minprob=0.5          Ignore kmers with overall probability of correctness 

                     below this, to reduce memory usage.

minapproxoverlap=26  For rem mode, do not merge reads if the extended reads

                     indicate that the raw reads should have overlapped by

                     at least this much, but no overlap was found.

 

Java Parameters:

-Xmx                 This will be passed to Java to set memory usage, 

                     overriding the program's automatic memory detection.

                     For example, -Xmx400m will specify 400 MB RAM.

-eoom                This flag will cause the process to exit if an 

                     out-of-memory exception occurs.  Requires Java 8u92+.

-da                  Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

dedupe.sh

$ dedupe.sh

 

Written by Brian Bushnell and Jonathan Rood

Last modified November 20, 2017

 

Description:  Accepts one or more files containing sets of sequences (reads or scaffolds).

Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.

Can also find overlapping sequences and group them into clusters.

Please read bbmap/docs/guides/DedupeGuide.txt for more information.

 

Usage:     dedupe.sh in=<file or stdin> out=<file or stdout>

 

An example of running Dedupe for clustering short reads:

dedupe.sh in=x.fq am=f ac=f fo c pc rnc=f mcs=4 mo=100 s=1 pto cc qin=33 csf=stats.txt pattern=cluster_%.fq dot=graph.dot

 

Input may be fasta or fastq, compressed or uncompressed.

Output may be stdout or a file.  With no output parameter, data will be written to stdout.

If 'out=null', there will be no output, but statistics will still be printed.

You can also use 'dedupe <infile> <outfile>' without the 'in=' and 'out='.

 

I/O parameters:

in=<file,file>        A single file or a comma-delimited list of files.

out=<file>            Destination for all output contigs.

pattern=<file>        Clusters will be written to individual files, where the '%' symbol in the pattern is replaced by cluster number.

outd=<file>           Optional; removed duplicates will go here.

csf=<file>            (clusterstatsfile) Write a list of cluster names and sizes.

dot=<file>            (graph) Write a graph in dot format.  Requires 'fo' and 'pc' flags.

threads=auto          (t) Set number of threads to use; default is number of logical processors.

overwrite=t           (ow) Set to false to force the program to abort rather than overwrite an existing file.

showspeed=t           (ss) Set to 'f' to suppress display of processing speed.

minscaf=0             (ms) Ignore contigs/scaffolds shorter than this.

interleaved=auto      If true, forces fastq input to be paired and interleaved.

ziplevel=2            Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

 

Output format parameters:

storename=t           (sn) Store scaffold names (set false to save memory).

#addpairnum=f         Add .1 and .2 to numeric id of read1 and read2.

storequality=t        (sq) Store quality values for fastq assemblies (set false to save memory).

uniquenames=t         (un) Ensure all output scaffolds have unique names.  Uses more memory.

numbergraphnodes=t    (ngn) Label dot graph nodes with read numbers rather than read names.

sort=f                Sort output (otherwise it will be random).  Options:

                         length:  Sort by length

                         quality: Sort by quality

                         name:    Sort by name

                         id:      Sort by input order

ascending=f           Sort in ascending order.

ordered=f             Output sequences in input order.  Equivalent to sort=id ascending.

renameclusters=f      (rnc) Rename contigs to indicate which cluster they are in.

printlengthinedges=f  (ple) Print the length of contigs in edges.

 

Processing parameters:

absorbrc=t            (arc) Absorb reverse-complements as well as normal orientation.

absorbmatch=t         (am) Absorb exact matches of contigs.

absorbcontainment=t   (ac) Absorb full containments of contigs.

#absorboverlap=f      (ao) Absorb (merge) non-contained overlaps of contigs (TODO).

findoverlap=f         (fo) Find overlaps between contigs (containments and non-containments).  Necessary for clustering.

uniqueonly=f          (uo) If true, all copies of duplicate reads will be discarded, rather than keeping 1.

rmn=f                 (requirematchingnames) If true, both names and sequence must match.

usejni=f              (jni) Do alignments in C code, which is faster, if an edit distance is allowed.

                      This will require compiling the C code; details are in /jni/README.txt.

 

Subset parameters:

subsetcount=1         (sstc) Number of subsets used to process the data; higher uses less memory.

subset=0              (sst) Only process reads whose ((ID%subsetcount)==subset).

 

Clustering parameters:

cluster=f             (c) Group overlapping contigs into clusters.

pto=f                 (preventtransitiveoverlaps) Do not look for new edges between nodes in the same cluster.

minclustersize=1      (mcs) Do not output clusters smaller than this.

pbr=f                 (pickbestrepresentative) Only output the single highest-quality read per cluster.

 

Cluster postprocessing parameters:

processclusters=f     (pc) Run the cluster processing phase, which performs the selected operations in this category.

                      For example, pc AND cc must be enabled to perform cc.

fixmultijoins=t       (fmj) Remove redundant overlaps between the same two contigs.

removecycles=t        (rc) Remove all cycles so clusters form trees.

cc=t                  (canonicizeclusters) Flip contigs so clusters have a single orientation.

fcc=f                 (fixcanoncontradictions) Truncate graph at nodes with canonization disputes.

foc=f                 (fixoffsetcontradictions) Truncate graph at nodes with offset disputes.

mst=f                 (maxspanningtree) Remove cyclic edges, leaving only the longest edges that form a tree.

 

Overlap Detection Parameters

exact=t               (ex) Only allow exact symbol matches.  When false, an 'N' will match any symbol.

touppercase=t         (tuc) Convert input bases to upper-case; otherwise, lower-case will not match.

maxsubs=0             (s) Allow up to this many mismatches (substitutions only, no indels).  May be set higher than maxedits.

maxedits=0            (e) Allow up to this many edits (subs or indels).  Higher is slower.

minidentity=100       (mid) Absorb contained sequences with percent identity of at least this (includes indels).

minlengthpercent=0    (mlp) Smaller contig must be at least this percent of larger contig's length to be absorbed.

minoverlappercent=0   (mop) Overlap must be at least this percent of smaller contig's length to cluster and merge.

minoverlap=200        (mo) Overlap must be at least this long to cluster and merge.

depthratio=0          (dr) When non-zero, overlaps will only be formed between reads with a depth ratio of at most this.

                      Should be above 1.  Depth is determined by parsing the read names; this information can be added

                      by running KmerNormalize (khist.sh, bbnorm.sh, or ecc.sh) with the flag 'rename'

k=31                  Seed length used for finding containments and overlaps.  Anything shorter than k will not be found.

numaffixmaps=1        (nam) Number of prefixes/suffixes to index per contig. Higher is more sensitive, if edits are allowed.

hashns=f              Set to true to search for matches using kmers containing Ns.  Can lead to extreme slowdown in some cases.

#ignoreaffix1=f       (ia1) Ignore first affix (for testing).

#storesuffix=f        (ss) Store suffix as well as prefix.  Automatically set to true when doing inexact matches.

 

Other Parameters

qtrim=f               Set to qtrim=rl to trim leading and trailing Ns.

trimq=6               Quality trim level.

forcetrimleft=-1      (ftl) If positive, trim bases to the left of this position (exclusive, 0-based).

forcetrimright=-1     (ftr) If positive, trim bases to the right of this position (exclusive, 0-based).

 

Note on Proteins / Amino Acids

Dedupe supports amino acid space via the 'amino' flag.  This also changes the default kmer length to 10.

In amino acid mode, all flags related to canonicity and reverse-complementation are disabled,

and nam (numaffixmaps) is currently limited to 2 per tip.

 

Java Parameters:

-Xmx                  This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                      -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

-eoom                 This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.

-da                   Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

 

 

Run

BBMap (link) - aligning reads to a reference

PowerPoint slides

BBMap is an aligner that breaks reads into k-mers and performs alignment and extension from k-mer-sized seeds (default k=13). It handles not only short reads but also long reads; the manual lists support for Illumina, 454, Sanger, Ion Torrent, PacBio, and Nanopore reads.

・Build the index and align paired reads (index => mapping => alignment). Memory capped at 20 GB.

bbmap.sh -Xmx20g in1=reads1.fq in2=reads2.fq out=mapped.sam ref=reference.fa
  • -Xmx   This will be passed to Java to set memory usage,  overriding the program's automatic memory detection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m  will specify 800 megs.  The max is typically 85% of physical memory.  The human genome requires around 24g,  or 12g with the 'usemodulo' flag.  The index uses roughly 6 bytes per reference base.

An index directory is created and the reads are aligned to the reference. By default, all cores on the machine are used (specify explicitly with "t=").

 

bbmap.sh only handles reads up to 700 bp. For PacBio alignment, use the dedicated command, which supports reads up to 6 kb.

・Align long reads.

mapPacBio.sh in=reads.fq out=mapped.sam ref=reference.fa

#Output statistics such as mapping rate (coverage) and GC against the reference. Memory capped at 100 GB
mapPacBio.sh in=reads.fq out=mapped.sam ref=reference.fa covstats=mapping_stats.txt nodisk=true -Xmx100g
  • nodisk=f  Set to true to build index in memory and write nothing to disk except output

Reads longer than 6 kb are automatically split, renamed, and then aligned. This command is also recommended for Nanopore data.

 

・Output various statistics.

bbmap.sh in1=reads1.fq in2=reads2.fq bhist=bhist.txt qhist=qhist.txt aqhist=aqhist.txt lhist=lhist.txt ihist=ihist.txt ehist=ehist.txt qahist=qahist.txt indelhist=indelhist.txt mhist=mhist.txt gchist=gchist.txt idhist=idhist.txt scafstats=scafstats.txt
  • covstats=<file>      Per-scaffold coverage info.
  • rpkm=<file>            Per-scaffold RPKM/FPKM counts.
  • covhist=<file>         Histogram of # occurrences of each depth level.
  • basecov=<file>       Coverage per base location.
  • bincov=<file>          Print binned coverage per location (one line per X bases).
  • covbinsize=1000    Set the binsize for binned coverage.

Output like the following is printed to stdout.

  ------------------   Results   ------------------   

 

Genome:                1

Key Length:            13

Max Indel:             16000

Minimum Score Ratio:  0.56

Mapping Mode:         normal

Reads Used:           2647748 (397162200 bases)

 

Mapping:          70.725 seconds.

Reads/sec:       37437.34

kBases/sec:      5615.60

 

 

Pairing data:   pct reads num reads pct bases   num bases

 

mated pairs:      99.9427%   1323116  99.9427%    396934800

bad pairs:         0.0178%       236   0.0178%        70800

insert size avg:   601.02

insert 25th %:     566.00

insert median:     600.00

insert 75th %:     634.00

insert std dev:     52.83

insert mode:       602

 

 

Read 1 data:      pct reads num reads pct bases   num bases

 

mapped:           99.9677%   1323447  99.9677%    198517050

unambiguous:      98.0062%   1297479  98.0062%    194621850

ambiguous:         1.9615%     25968   1.9615%      3895200

low-Q discards:    0.0000%         0   0.0000%            0

 

perfect best site: 38.6619%    511835  38.6619%     76775250

semiperfect site: 38.6630%    511849  38.6630%     76777350

rescued:           1.2452%     16485

 

Match Rate:            NA        NA  99.3118%    197215279

Error Rate:       61.3256%    811612   0.6882%      1366602

Sub Rate:         60.9806%    807046   0.6527%      1296069

Del Rate:          0.5644%      7469   0.0326%        64831

Ins Rate:          0.3730%      4937   0.0029%         5702

N Rate:            0.0000%         0   0.0000%            0

 

 

Read 2 data:      pct reads num reads pct bases   num bases

 

mapped:           99.9702%   1323479  99.9702%    198521850

unambiguous:      97.9981%   1297371  97.9981%    194605650

ambiguous:         1.9721%     26108   1.9721%      3916200

low-Q discards:    0.0000%         0   0.0000%            0

 

perfect best site: 38.6839%    512126  38.6839%     76818900

semiperfect site: 38.6856%    512148  38.6856%     76822200

rescued:           1.1880%     15728

 

Match Rate:            NA        NA  99.3145%    197221255

Error Rate:       61.3046%    811353   0.6855%      1361264

Sub Rate:         60.9506%    806669   0.6523%      1295263

Del Rate:          0.5679%      7516   0.0306%        60669

Ins Rate:          0.3843%      5086   0.0027%         5332

N Rate:            0.0000%         0   0.0000%            0

 

Total time:     74.777 seconds.

In addition, the requested analysis files (histograms) are written.


 

・Output basic per-contig statistics for a genome assembly, such as GC content, average read depth, and mapping coverage.

bbmap.sh ref=scaffolds.ref.fa nodisk in1=R1.fq in2=R2.fq covstats=mapping.stats threads=20

mapping.stats 

 

 

Note 1

Index construction is said to use 6 bytes of memory per reference base. The human genome requires about 24 GB (12 GB with the usemodulo flag).
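The 6-bytes-per-base rule of thumb can be turned into a quick estimate with awk (a sketch; the genome size below is an assumed round number, and real usage adds JVM and hashing overhead on top):

```shell
# Rough BBMap index memory estimate: ~6 bytes per reference base.
# 3.1 Gb is an assumed approximation of the human genome size.
bases=3100000000
awk -v b="$bases" 'BEGIN{printf "%.1f GB\n", b*6/1e9}'
# prints "18.6 GB" -- before overhead, consistent with the ~24 GB quoted above
```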

・Memory usage at the default k=13 can be checked with the following command.

stats.sh in=reference.fa k=13

user$ stats.sh in=../testgenome.fasta k=13

A C G T N IUPAC Other GC GC_stdev

0.2629 0.2364 0.2372 0.2635 0.0000 0.0000 0.0000 0.4736 0.0241

 

Main genome scaffold total:         8

Main genome contig total:           8

Main genome scaffold sequence total: 3.957 MB

Main genome contig sequence total:  3.957 MB  0.000% gap

Main genome scaffold N/L50:         1/3.573 MB

Main genome contig N/L50:           1/3.573 MB

Main genome scaffold N/L90:         1/3.573 MB

Main genome contig N/L90:           1/3.573 MB

Max scaffold length:                3.573 MB

Max contig length:                  3.573 MB

Number of scaffolds > 50 KB:        4

% main genome in scaffolds > 50 KB: 98.63%

 

 

Minimum Number        Number        Total         Total         Scaffold

Scaffold of            of            Scaffold      Contig        Contig  

Length  Scaffolds     Contigs       Length        Length        Coverage

-------- -------------- -------------- -------------- -------------- --------

    All              8             8     3,956,956     3,956,956 100.00%

   1 KB              8             8     3,956,956     3,956,956 100.00%

 2.5 KB              6             6     3,952,233     3,952,233 100.00%

   5 KB              6             6     3,952,233     3,952,233 100.00%

  10 KB              5             5     3,947,019     3,947,019 100.00%

  25 KB              5             5     3,947,019     3,947,019 100.00%

  50 KB              4             4     3,902,676     3,902,676 100.00%

 100 KB              4             4     3,902,676     3,902,676 100.00%

 250 KB              1             1     3,573,470     3,573,470 100.00%

 500 KB              1             1     3,573,470     3,573,470 100.00%

   1 MB              1             1     3,573,470     3,573,470 100.00%

 2.5 MB              1             1     3,573,470     3,573,470 100.00%

 

BBMap minimum memory estimate at k=13:     -Xmx890m (at least 1000 MB physical RAM)

For the bacterial genome used here, the estimate came to 1 GB of physical RAM.

 

Note 2

BBMap supports split alignment across indel sites. This is governed by the maxindel flag, which defaults to maxindel=16000. For RNA-seq where large introns are expected, a larger value such as maxindel=100000 is recommended.
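As a hedged example of the recommendation above (file names are placeholders; maxindel is the flag discussed in the text), an RNA-seq run might look like:

```shell
# RNA-seq mapping with a larger indel allowance so that reads
# spanning large introns can be split-aligned.
# in1/in2/ref/out file names are placeholders.
bbmap.sh in1=rna_1.fq in2=rna_2.fq out=mapped.sam ref=genome.fa maxindel=100000
```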

 

 

 

 

BBMerge (link) - merging paired reads.

BBMerge merges overlapping paired-end reads into single, longer reads.

bbmerge.sh in1=reads1.fq in2=reads2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq ihist=ihist.txt
  • in=null Primary input. 'in2' will specify a second file.
  • out=<file>  File for merged reads. 'out2' will specify a second file.
  • outu=<file> File for unmerged reads. 'outu2' will specify a second file.
  • ihist=<file> (hist) Insert length histogram output file.

The insert-size histogram is written to ihist.txt. For strict merging, add pfilter=1 ("pfilter=1" is the strictest possible setting, which bans merges in which there are any mismatches in the overlap region), or add verystrict (greatly decrease FP and merging rate). The author suggests verystrict for data that will be assembled, and loose (increase false positive rate and merging rate) if you simply want to merge as much as possible.
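Assuming ihist.txt is tab-delimited with '#'-prefixed header lines followed by insert-size/count rows (the layout BBMerge normally writes; verify on your own output), the mean insert size can be recomputed from it:

```shell
# Weighted mean insert size from an insert-size histogram:
# column 1 = insert size, column 2 = read-pair count,
# lines starting with '#' are headers and are skipped.
awk -F'\t' '!/^#/{s += $1*$2; n += $2} END{if (n) printf "%.1f\n", s/n}' ihist.txt
```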

 

 

・Error-correct overlapping paired-end reads

bbmerge.sh in=interlace.fq out=corrected.fq ecco mix
  • mix=f Output both the merged (or mergable) and unmerged reads in the same file (out=). Useful for ecco mode.
  • ecco=f Error-correct the overlapping part, but don't merge.

Here the input paired reads are interleaved. Running this command outputs a fastq in which, within the overlap region, positions where the bases agree get boosted quality scores, while positions where they disagree have the base call switched to the higher-quality base. A merged fastq itself is not output. Positions with equal quality but disagreeing bases become N.

 

 

・Use the k-mer graph to extend unmerged reads and retry merging.

bbmerge-auto.sh in1=reads1.fq in2=reads2.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt ecct extend2=20 iterations=5
  • extend2=0 Extend reads this much only after a failed merge attempt, or in rem/rsem mode.
  • iterations=1 (ei) Iteratively attempt to extend by extend2 distance and merge up to this many times.

With the command above, merging is attempted first; if it fails, the reads are error-corrected and merging is retried; if it still fails, 20 bp from the k-mer graph are appended and merging is attempted again, repeating the 20-bp extension up to 5 times. Attempts where the k-mer graph hits a dead end or a branch are aborted. Testing with 300-bp x 2 paired reads produced single reads of up to 794 bp.

 

 

・Find adapter sequences from paired-end reads whose insert size is shorter than the read length.

bbmerge.sh in1=reads1.fq in2=reads2.fq outa=adapters.fa

Suppose the adapter found is GATCGGAAGAGCACACGTCTGAACTCCAGTC. Remove this adapter sequence with the following command.

bbmerge.sh in1=reads1.fq in2=reads2.fq out=merged.fq adapter1=GATCGGAAGAGCACACGTCTGAACTCCAGTC

 

If the reads have sufficient depth and adapters have been removed, the merge command works correctly. Longer merged reads mean longer k-mers can be chosen for assembly. On the other hand, the graph may be joined incorrectly, which produces artifact sequences (especially around rare indel variants).

 

 

BBnorm (link) - coverage normalization

A tool that normalizes deeply covered regions to slim the data down and make analysis easier. It is useful for data with highly variable coverage, such as single-cell and viral sequencing or metagenomes. It can also serve as digital normalization to reduce data volume (link, around the middle of the page), for example when there are more reads than the available compute resources can handle.

・Normalize to k-mer coverage 100

bbnorm.sh target=100 min=5 in1=reads1.fq in2=reads2.fq \
out=normalized_1.fq out2=normalized_2.fq

#single-end, k=31, mindepth=2
bbnorm.sh target=100 min=2 k=31 in1=reads.fq.gz out=normalized.fq.gz
  • in=<file> (in1) Input file, or comma-delimited list of files.
  • in2=<file> Optional second file for paired reads.
  • k=31 Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)
  • target=100 (tgt) Target normalization depth. NOTE: All depth parameters control kmer depth, not read depth.
  • mindepth=5 (min) Kmers with depth below this number will not be included when calculating the depth of a read.
  • maxdepth=-1 (max) Reads will not be downsampled when below this depth, even if they are above the target depth.
  • minkmers=15 (mgkpr) Reads must have at least this many kmers over min depth to be retained. Aka 'mingoodkmersperread'.
  • out=<file> File for normalized or corrected reads. Use out2 for paired reads in a second file

With the command above, reads with k-mer depth of 5 or below (sequencing errors or rare variants) are discarded, and the read count is adjusted so that k-mer coverage becomes 100.
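Conceptually, normalization to a target depth keeps each read with probability target/depth; the sketch below (a simplification of BBNorm's actual two-pass algorithm, shown only to illustrate the scaling) prints the keep probability at a few depths:

```shell
# Simplified normalization math: a read at k-mer depth d is kept
# with probability min(1, target/d).  BBNorm itself is 2-pass and
# more sophisticated; this only illustrates the downsampling ratio.
awk -v target=100 'BEGIN{
  for (d = 50; d <= 800; d *= 2)
    printf "depth %4d -> keep prob %.2f\n", d, (d < target ? 1 : target/d)
}'
```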

Addendum: this also works with long reads if sequencing quality is high. If applied to high-error reads, many reads will have no matching k-mers and are likely to be lost. If you still want to use it, either error-correct first (subtle sequence differences may be lost), use a slightly smaller k, or remove the low-end cutoff with min=0 or the like.

 

 

・Error correction

ecc.sh in1=reads1.fq in2=reads2.fq out=corrected.fq ecc=t keepall passes=1 bits=16 prefilter
  • ecc=f Set to true to correct errors.
  • keepall=f Set to true to keep all reads (e.g. if you just want error correction).
  • passes=2 (p) 1 pass is the basic mode. 2 passes (default) allows greater accuracy, error detection, better contol of output depth.
  • bits=32 Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits. Automatically reduced to 16 in 2-pass.
  • prefilter=f True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable. The prefilter is more memory-efficient because it uses 2-bit cells.

The error-corrected fastq is written as corrected.fq. No trimming or removal of reads with rare k-mers is performed at all.

 

 

・Generate a k-mer histogram

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt

Besides the histogram data, a summary file is written that lists, among other things, the peaks of the k-mer frequency distribution.

user$ cat peaks.txt 

#k 31

#unique_kmers 32597468

#main_peak 14

#genome_size 6316924

#haploid_genome_size 6316924

#fold_coverage 14

#haploid_fold_coverage 14

#ploidy 1

#percent_repeat 36.469

#start center stop max volume

6 14 35 338679 4013227

41 51 121 3023 69382

823 1556 1569 10 1546

1569 1614 1624 21 605

1624 1637 1667 12 455

1667 1683 1705 9 340

1705 1730 1748 13 388

1748 1768 2257 3 1902

3198 4299 4410 4 3095

4410 4501 4523 2 288

4523 4542 5077 2 1112

 

Note that BBNorm must not be used for expression analysis (RNA-seq, ChIP-seq) or rare-variant discovery: doing so can introduce severe bias. If you need to reduce such a dataset, use a random-subsampling tool instead of normalization.
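For uniform subsampling, seqtk sample with a fixed seed is the usual choice. As a dependency-free sketch of the idea, the awk below keeps every other 4-line fastq record, shown on an inline three-read fastq; crude deterministic thinning, but it does not distort depth the way normalization does:

```shell
# Keep every 2nd fastq record (4 lines per read); for real work prefer
# seeded random sampling, e.g. `seqtk sample -s100 reads.fq 0.5`.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' \
| awk 'int((NR-1)/4) % 2 == 0'
```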

seqtk and seqkit have been covered on this blog as fastq random-subsampling tools.

 

 

 

Dedupeリンク)-duplicateしたcontigの除去。

k-mer-based de Bruijn graph assemblers rarely produce duplicated contigs, but overlap-layout assemblers sometimes do; this command was designed to remove the duplicated contigs from such assemblies. It is also said to be useful for curating databases such as NCBI RefSeq, which contain large numbers of redundant sequences that produce many near-identical BLAST hits.

 


・Removing duplicate sequences.

dedupe.sh in=X.fa out=Z.fa outd=duplicates.fa
  • outd=&lt;file&gt; File to write the removed duplicate sequences to.

Specifying outd also writes the removed duplicates to a separate file.

 

・Removing duplicate sequences while allowing up to 5 mismatches and 2 edits.

dedupe.sh in=X.fa out=Y.fa s=5 e=2

With only s=5 specified, duplicate sequences are searched for allowing up to 5 base mismatches (substitutions only). With only e=2 specified, up to 2 edits of any kind are allowed (e.g. 2 substitutions, or 1 insertion plus 1 substitution).
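To make the distinction concrete: s counts substitution-only mismatches at aligned positions (a Hamming distance), while e also covers indels. A toy Hamming-distance calculation (not Dedupe's code):

```shell
# Hamming distance between two equal-length sequences: the
# substitution-only mismatches that Dedupe's s= threshold allows.
awk 'BEGIN { a="ACGTACGT"; b="ACCTACGA"; d=0; for (i=1; i<=length(a); i++) if (substr(a,i,1) != substr(b,i,1)) d++; print d }'
```

Here the two 8-mers differ at two positions, so they would be absorbed with s=2 (or e=2) but not with s=1.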

 

 

 

・Removing sequences that are at least 99% identical.

dedupe.sh in=X.fa out=Y.fa minidentity=99

Indels are not tolerated here; to account for indels as well, use the e flag.

 

 

・Removing sequences identical over at least 200 bp, allowing up to 1 mismatch.

dedupe.sh in=X.fq pattern=cluster%.fq ac=f am=f s=1 mo=200 c pc csf=stats.txt outbest=best.fq fo mcs=3 cc dot=graph.dot

This is aimed less at duplicate removal than at amplicon sequencing data. Similar sequences are clustered and written out, and at the same time the sequence with the best length and quality in each cluster is written to a separate file.

 

・Clustering full-length PacBio 16S sequencing data

1. Remove reads that did not cover the full length.

reformat.sh in=reads_of_insert.fastq out=filtered.fq minlen=1420 maxlen=1640 maq=20 qin=33 
  • minlength=0 (ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded only if both are shorter.
  • maxlength=0 If nonzero, reads longer than this after trimming will be discarded.
  • minavgquality=0 (maq) Reads with average quality (after trimming) below this will be discarded.
  • qin=auto ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

Reads shorter than 1420 bp and longer than 1640 bp (artifacts) are removed, as are reads with average quality below 20.
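The length-filtering logic can be sketched with awk over 4-line fastq records (thresholds shrunk to fit the toy inline input; reformat.sh additionally handles quality, pairs, and compressed files):

```shell
# Keep only fastq records whose sequence length is within [4, 6]
# (a stand-in for reformat.sh's minlen/maxlen, on a toy 3-read file).
printf '@a\nACG\n+\nIII\n@b\nACGTA\n+\nIIIII\n@c\nACGTACGT\n+\nIIIIIIII\n' \
| awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0} NR%4==0{ if (length(s) >= 4 && length(s) <= 6) print h"\n"s"\n"p"\n"$0 }'
```

Only the middle read (5 bp) survives the window.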

 

2. Cluster sequences that are at least 99% identical

dedupe.sh in=filtered.fq csf=stats_e26.txt outbest=best_e26.fq qin=33 usejni=t am=f ac=f fo c rnc=f mcs=3 k=27 mo=1420 ow cc pto nam=4 e=26 pattern=cluster_%.fq dot=graph.dot
  • usejni=f (jni) Do alignments in C code, which is faster, if an edit distance is allowed. This will require compiling the C code; details are in /jni/README.txt.
  • absorbmatch=t (am) Absorb exact matches of contigs.
  • absorbcontainment=t (ac) Absorb full containments of contigs.
  • findoverlap=f (fo) Find overlaps between contigs (containments and non-containments). Necessary for clustering.
  • renameclusters=f (rnc) Rename contigs to indicate which cluster they are in.
  • ow=f (overwrite) Overwrites files that already exist.
  • cc=t (canonicizeclusters) Flip contigs so clusters have a single orientation.
  • pto=f (preventtransitiveoverlaps) Do not look for new edges between nodes in the same cluster.
  • minclustersize=1 (mcs) Do not output clusters smaller than this.
  • cluster=f (c) Group overlapping contigs into clusters.
  • numaffixmaps=1 (nam) Number of prefixes/suffixes to index per contig. Higher is more sensitive, if edits are allowed.
  • maxedits=0 (e) Allow up to this many edits (subs or indels). Higher is slower.

Using 27-mer seeds with 4 affixes indexed per contig (nam=4), sequences that are at least 99% identical (e=26) and overlap over at least 1420 bases are clustered.

 

 

There are many commands, so this is split across two posts.

 

 

References

BBMap (aligner for DNA/RNAseq) is now open-source and available for download. - SEQanswers

 

BBMerge - Accurate paired shotgun read merging via overlap.

Bushnell B, Rood J, Singer E

PLoS One. 2017 Oct 26;12(10):e0185056. doi: 10.1371/journal.pone.0185056. eCollection 2017. 

 

Biostars: downsampling

https://www.biostars.org/p/168178/

 

 

kraken: fast taxonomy assignment for metagenomic data

2018 10/6 Title corrected

2018 11/17 Added a quick test

2019 4/12 Added docker link

2019 10/15 Added links

 

kraken, published in 2014, is a classification method for metagenomic data. It decomposes fasta or fastq input into k-mers and matches them against a pre-built database. It is claimed to be up to 909 times faster than MegaBLAST while maintaining accuracy comparable to BLAST. A similar tool, MetaPhlAn, also derives taxonomy information from metagenomic data; kraken is reported to be up to 11 times faster than MetaPhlAn as well.

 

公式ページ

https://ccb.jhu.edu/software/kraken/

Manual

http://ccb.jhu.edu/software/kraken/MANUAL.html

 

Installation

Installed here on CentOS.

Running kraken requires Jellyfish version 1. Download it from the official site and build it.

wget http://www.cbcb.umd.edu/software/jellyfish/jellyfish-1.1.11.tar.gz
tar zxvf jellyfish-1.1.11.tar.gz
cd jellyfish-1.1.11/
./configure
make
sudo make install

Add bin/jellyfish to your PATH.

 

kraken itself can be installed with brew.

brew install kraken

Or:

git clone  https://github.com/DerrickWood/kraken.git
cd kraken/
./install_kraken.sh $KRAKEN_DIR # specify the installation directory

If you are using Anaconda, install with conda.

#v1.1
conda install -c bioconda -y kraken==1.1

#v2 (bioconda)
conda install -c bioconda -y kraken2

 

The NCBI RefSeq URLs have changed, so the sequences can no longer be downloaded with wget (as of testing in summer 2017). Download the script separately and patch it.

wget https://github.com/DerrickWood/kraken/blob/master/scripts/download_genomic_library.sh

Open download_genomic_library.sh and change line 43 as follows.

wget $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz

Then change line 59 as follows.

wget $FTP_SERVER/genomes/archive/old_refseq/Plasmids/plasmids.all.fna.tar.gz

 

Open the kraken-build script installed by brew:

vi /usr/local/bin/kraken-build # open with vi

Change line 262 to the path of the patched download_genomic_library.sh. Here it was modified as follows.

exec "/Volumes/3TB3/kraken_database/download_genomic_library.sh", $type;

Save and close.

 

Update

Docker image

https://hub.docker.com/r/staphb/kraken/

 

Run

Prepare the database (first time only)

 ↓

Run

 ↓

Summarize

The analysis proceeds in these three stages.

 

--Preparing the database--

Download the database. Here the directory is named database.

kraken-build --standard --db database

All RefSeq data and taxonomy information are downloaded into database/.

 

Build the database. This takes a long time, so use as many cores as you have.

kraken-build --standard --threads 22 --db database

Depending on the environment, this can take a very long time.

Check the disk usage.

user]$ du -sh database/

161G database/

When run in August 2017, the database came to 161 GB.

Once it is ready, run the classification.

 

--Assigning taxonomy labels--

The default input is fasta, but fastq is supported with the --fastq-input flag, and gzip or bzip2 input with --gzip-compressed or --bzip2-compressed.

Run with a fasta input.

kraken --preload --threads 20 --fasta-input --db database input.fa > output
  • --db Name for Kraken DB (default: none)
  • --threads Number of threads (default: 1)
  • --fasta-input Input is FASTA format
  • --preload Loads DB into memory before classification

Adding --preload loads the database into RAM first, which speeds things up. database is the database directory name you chose.

  

Run with paired-end fastq.

kraken --preload --threads 20 --fastq-input --paired --check-names --db database R1.fq R2.fq > output
  • --fastq-input Input is FASTQ format 
  • --paired The two filenames provided are paired-end reads
  • --check-names Ensure each pair of reads have names that agree with each other; ignored if --paired is not specified
  • --gzip-compressed Input is gzip compressed
  • --bzip2-compressed Input is bzip2 compressed
  • --quick Quick operation (use first hit or hits)

kraken merges each read pair into one sequence (joined with NNN) before the k-mer search. During the search, k-mers containing anything other than ATGC are skipped. Using paired reads is therefore reported to raise accuracy by about 3% over single reads.
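The pair-merging idea can be sketched as joining each mate pair with an N spacer so that no k-mer spans the junction (k-mers containing non-ACGT bases are skipped). A toy stand-in for what kraken does internally, not its actual code:

```shell
# Join R1/R2 mates (here a tab-separated pair) with 'NNN' so that no
# k-mer crosses the junction; N-containing k-mers are ignored.
printf 'ACGTACGT\tTTGGTTGG\n' | awk -F'\t' '{ print $1 "NNN" $2 }'
```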

  

The output looks like this.

bin]$ head output

C M02077:58:000000000-APPA6:1:1101:15651:1463 1148 603 0:1 1148:16 0:31 1148:83 0:140 A:31 0:16 A:31 1148:4 0:220

C M02077:58:000000000-APPA6:1:1101:17046:1522 1148 601 0:1 1148:11 0:31 1148:61 0:31 1148:82 0:54 A:31 1148:11 0:5 A:31 1148:5 0:217

C M02077:58:000000000-APPA6:1:1101:15487:1536 1148 603 1148:14 0:72 1148:22 0:163 A:31 0:16 A:31 0:224

C M02077:58:000000000-APPA6:1:1101:17442:1545 1148 603 0:1 1148:113 0:31 1148:44 0:31 1148:16 0:35 A:31 1148:16 A:31 1148:36 0:188

C M02077:58:000000000-APPA6:1:1101:22458:1603 1148 601 0:16 1148:16 0:44 1148:14 0:181 A:31 0:1 1148:7 0:8 A:31 0:222

C M02077:58:000000000-APPA6:1:1101:19281:1604 1148 602 0:1 1148:64 0:31 1148:24 0:31 1148:28 0:31 1148:10 0:50 A:31 1148:16 A:31 1148:31 0:193

C M02077:58:000000000-APPA6:1:1101:23186:1616 1148 601 0:1 1148:168 0:31 1148:5 0:66 A:31 1148:16 A:31 0:5 1148:21 0:196

C M02077:58:000000000-APPA6:1:1101:18130:1621 1148 603 1148:213 0:58 A:31 1148:16 A:31 1148:81 0:31 1148:31 0:81

C M02077:58:000000000-APPA6:1:1101:9815:1622 1148 599 1148:208 0:62 A:31 1148:16 A:31 1148:46 0:175

C M02077:58:000000000-APPA6:1:1101:21784:1636 1148 602 0:1 1148:130 0:31 1148:83 0:26 A:31 1148:16 A:31 1148:33 0:190

Column 1: C or U, whether the read was classified (C) or not (U).

Column 2: sequence ID (the header from the input).

Column 3: taxonomy ID (wiki)

Column 4: read length

Column 5: see the manual excerpt below.

 

  1. A space-delimited list indicating the LCA mapping of each k-mer in the sequence. For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
    • the first 13 k-mers mapped to taxonomy ID #562
    • the next 4 k-mers mapped to taxonomy ID #561
    • the next 31 k-mers contained an ambiguous nucleotide
    • the next k-mer was not in the database
    • the last 3 k-mers mapped to taxonomy ID #562
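Column 5 is easy to aggregate yourself. Summing the k-mers per taxonomy ID for the manual's example string, with plain shell tools (not a kraken utility):

```shell
# Sum k-mers per taxonomy ID in a kraken column-5 string.
# 'A' = ambiguous k-mers, '0' = k-mers absent from the database.
echo "562:13 561:4 A:31 0:1 562:3" \
| tr ' ' '\n' \
| awk -F: '{ sum[$1] += $2 } END { for (t in sum) print t, sum[t] }' \
| LC_ALL=C sort
```

For the example string this yields 16 k-mers for taxid 562, 4 for 561, 31 ambiguous, and 1 unmatched.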

 

--Converting to taxonomy names--

A command is provided to convert taxonomy IDs into taxonomy names. Run it as follows.

kraken-translate --db database output > sequences.labels

Output

f:id:kazumaxneo:20170830114628j:plain

Since the input is data from a pure bacterial culture, the same organism appears over and over.

 

With --mpa-format, only superkingdom, kingdom, phylum, class, order, family, genus, and species are reported.

kraken-translate --mpa-format --db database output > sequences.labels

f:id:kazumaxneo:20170830115002j:plain

 

--Summarizing--

Create the summary report.

kraken-report --db database output > summary

Check the output.

head -20 summary 

  0.71 2036 2036 U 0 unclassified

 99.29 285920 3 - 1 root

 99.29 285916 1 - 131567   cellular organisms

 99.29 285915 65 D 2     Bacteria

 98.87 284692 36 - 1783272       Terrabacteria group

 98.55 283771 0 - 1798711         Cyanobacteria/Melainabacteria group

 98.55 283771 3 P 1117           Cyanobacteria

 98.54 283759 2 O 1890424             Synechococcales

 98.54 283753 0 F 1890428               Merismopediaceae

 98.54 283753 0 G 1142                 Synechocystis

 98.54 283753 283695 S 1148                   Synechocystis sp. PCC 6803

  0.02 57 57 - 1080228                     Synechocystis sp. PCC 6803 substr. GT-I

  0.00 1 1 - 1080229                     Synechocystis sp. PCC 6803 substr. PCC-N

  0.00 3 0 F 1890436               Pseudanabaenaceae

  0.00 3 0 G 1152                 Pseudanabaena

  0.00 3 3 S 82654                   Pseudanabaena sp. PCC 7367

  0.00 1 0 F 1890426               Synechococcaceae

  0.00 1 0 G 1129                 Synechococcus

  0.00 1 1 S 110662                   Synechococcus sp. CC9605

  0.00 7 1 - 1301283             Oscillatoriophycideae

Column 1: percentage of reads covered by this clade (relative to all reads)

Column 2: number of reads covered by this clade

Column 3: number of reads assigned directly to this taxon

Column 4: rank code (see the explanation below)

Column 5: NCBI taxonomy ID

Column 6: scientific name

  1. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
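Rows of a given rank are easy to pull out with the rank-code column. A sketch that keeps only species-level (S) lines, assuming the tab-delimited kraken-report layout, shown here on an inline two-line sample:

```shell
# Keep only species-level (rank code S) lines of a kraken-report file;
# the rank code is column 4 in the tab-delimited report.
printf '99.29\t285915\t65\tD\t2\tBacteria\n98.54\t283753\t283695\tS\t1148\tSynechocystis sp. PCC 6803\n' \
| awk -F'\t' '$4 == "S"'
```

In practice you would pipe the real summary file in instead of printf.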

 

 

A command is also provided that outputs a MetaPhlAn-style tab-delimited file.

kraken-mpa-report --db database output > summary_tab 

bin]$ head -10 summary_tab 

d__Viruses 1

d__Bacteria 285915

d__Bacteria|p__Proteobacteria 419

d__Bacteria|p__Fusobacteria 5

d__Bacteria|p__Deferribacteres 2

d__Bacteria|p__Spirochaetes 7

d__Bacteria|p__Synergistetes 1

d__Bacteria|p__Bacteroidetes 724

d__Bacteria|p__Firmicutes 625

d__Bacteria|p__Deinococcus-Thermus 1

 

 

You can also add your own fasta files to build a custom database; see the official manual, which walks through the custom-database workflow in detail.

 

 

 

Test

I tested with lake-water metagenome data from the SRA. A 15 GB x 2 dataset with 40 threads took about 30 minutes. The head of the kraken-report output is pasted below.

bin]$ head -30 summary2

 74.03 29617719 29617719 U 0 unclassified

 25.97 10391739 6719 - 1 root

 25.91 10364486 1126 - 131567   cellular organisms

 25.90 10361130 51629 D 2     Bacteria

 25.37 10150686 3934 - 1783272       Terrabacteria group

 25.22 10090850 0 - 1798711         Cyanobacteria/Melainabacteria group

 25.22 10090850 129569 P 1117           Cyanobacteria

 22.78 9114469 8754 - 1301283             Oscillatoriophycideae

 22.59 9036657 21480 O 1150               Oscillatoriales

 21.70 8681472 0 F 1892254                 Oscillatoriaceae

 21.70 8681472 934 G 1158                   Oscillatoria

 21.59 8639339 0 S 482564                     Oscillatoria nigro-viridis

 21.59 8639339 8639339 - 179408                       Oscillatoria nigro-viridis PCC 7112

  0.10 41199 0 S 118323                     Oscillatoria acuminata

  0.10 41199 41199 - 56110                       Oscillatoria acuminata PCC 6304

  0.39 155174 0 F 1892249                 Cyanothecaceae

  0.39 155174 23122 G 43988                   Cyanothece

  0.15 59279 59279 S 497965                     Cyanothece sp. PCC 7822

  0.12 46609 46609 S 65393                     Cyanothece sp. PCC 7424

  0.03 12335 12335 S 43989                     Cyanothece sp. ATCC 51142

  0.01 5589 5589 S 395962                     Cyanothece sp. PCC 8802

  0.01 4347 4347 S 395961                     Cyanothece sp. PCC 7425

  0.01 3893 3893 S 41431                     Cyanothece sp. PCC 8801

  0.29 114192 578 F 1892252                 Microcoleaceae

  0.13 53893 0 G 35823                   Arthrospira

  0.13 53893 0 S 118562                     Arthrospira platensis

  0.13 53893 53893 - 696747                       Arthrospira platensis NIES-39

  0.09 36770 0 G 44471                   Microcoleus

  0.09 36770 36770 S 1173027                     Microcoleus sp. PCC 7113

  0.06 22951 0 G 1205                   Trichodesmium

74% of the reads were unclassified, but a variety of freshwater cyanobacteria and other taxa were hit.

 

Update

See also:

Update

Visualization tool

 

Update

A quick test

I built an in silico mock metagenome community dataset for testing, mixing ten genomes at the following proportions.

f:id:kazumaxneo:20181120110800j:plain

Visualized with Krona

f:id:kazumaxneo:20181120110927j:plain

Rickettsia canadensis, spiked in at only about 2%, was still detected (*1). The quantification for every species came out at reasonable values.

 

This is also helpful.

Update

Building a Kraken database with new FTP structure and no GI numbers

http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/

 

References

Kraken: ultrafast metagenomic sequence classification using exact alignments

Derrick E Wood and Steven L Salzberg

Genome Biology 2014 15:R46

 

On errors during download:

https://github.com/DerrickWood/kraken/issues/40

 

*1

The read count was 500,000 x 2 and the file size 313 MB x 2, which is very small for a metagenome.

 

Related

Analyzing kraken output

Using kraken

Multi-sample


Update 2023/03/04

"Using Kraken 2 and MetaPhlAn 3 to classify reads in human-derived and environmental metagenomes, we found large discrepancies in both the proportion of reads classified and the number of species identified. We then used a range of simulated and mock samples to examine which tool gives the classification closest to the true composition of a metagenomic sample, and how the combination of tool, parameter, and database choices affects the classification obtained."

https://pubmed.ncbi.nlm.nih.gov/36867161/

 

 

qrqc: generating fastq quality reports


qrqc is an R package that reports read quality and base-composition distributions. It supports fastq with reads of 1000 bp or shorter. FastQC is another tool used for the same purpose.

 

 

Installation

In R:

## try http:// if https:// URLs are not supported 
source("https://bioconductor.org/biocLite.R")
biocLite("qrqc")

http://bioconductor.org/packages/devel/bioc/html/qrqc.html

 

Manual

browseVignettes("qrqc")

http://127.0.0.1:20080/library/qrqc/doc/qrqc.pdf

 

 

Run

In R:

library(qrqc)
fq <- readSeqFile("input.fastq") # load the fastq, randomly sampling 10% of the reads

 

Check the information for the loaded fastq.

fq

> fq

Quality Information for: input.fastq

 261774 sequences, 233067 unique with 0.10 being sampled

 mean quality: 36.101300

 min sequence length: 35

 max sequence length: 301

 

Generate the report.

makeReport(fq)

An HTML report is written into a folder.

 

f:id:kazumaxneo:20170828213815j:plain

f:id:kazumaxneo:20170828213823j:plain

f:id:kazumaxneo:20170828213820j:plain

f:id:kazumaxneo:20170828213833j:plain

f:id:kazumaxneo:20170828213840j:plain

f:id:kazumaxneo:20170828213846j:plain

The TruSeq NNN.. sequence is still present.

 

 

 

k-mer distribution.

kmerKLPlot(fq)

f:id:kazumaxneo:20170828214631j:plain

 

GC

gcPlot(fq) + geom_hline(yintercept=0.5, color="purple") 

f:id:kazumaxneo:20170828215149j:plain 

 

 

引用

 bioinformatics

qrqc | FASTQ のクオリティチェック

 

Scythe: adapter trimming with a Bayesian approach

 

Scythe is an adapter-trimming tool for fastq files. When fragments shorter than the read length come out of library prep, reads with adapter sequence at the 3' end get sequenced. Small RNA sequencing makes this easy to picture: with a small-RNA library sequenced at 100 bp, adapters appear at various positions on the 3' side. Even in whole-genome sequencing, such short fragments are sequenced at low frequency. If these reads survive adapter trimming, they can harm assembly, alignment, and other downstream analyses, so trimming the 3' side is also advisable.

 

Scythe compares two likelihood models, adapter present versus absent, and uses similarity to the supplied adapter sequence to identify adapter-contaminated reads and remove the adapter from the 3' end. The GitHub page notes that a paper is currently being written.

  

Running quality trimming before Scythe can partially remove the 3' adapter and keep Scythe's analysis from working properly, so it is recommended to remove 3' adapters with Scythe first. Any 5' adapters Scythe misses (adapter-only artifact reads and the like) should then be removed completely with Tagdust (discarding the whole read when an adapter is found), and only then should quality trimming be run. Finally, the Scythe developer recommends checking the quality-score distribution with FastQC or qrqc and judging from the report whether the trimming was adequate (if it was insufficient, trim further, for example by specifying a length).

 

Scythe (adapter trimming 1)

Tagdust (adapter trimming 2)

quality trimming

fastqc (assessment)

 

 

Github

https://github.com/vsbuffalo/scythe

 

Installation

Unpack the source and run make.

make all

Put scythe on your PATH, for example by copying or symlinking it into a directory already on the PATH.

 

Run

Run with the Illumina adapter file bundled in scythe-master/ specified; the input fastq file (input.fastq) comes last.

scythe -a illumina_adapters.fa -o trimmed_sequences.fastq input.fastq 
  • -o output trimmed sequences file (default: stdout)
  • -m   matches file (default: no output)
  • -q quality type, either illumina, solexa, or sanger (default: sanger)
  • -M filter sequences less than or equal to this length (default: 35)
  • -t add a tag to the header indicating Scythe cut a sequence (default: off)
  • -n smallest contaminant to consider (default: 5)

Information on quality schemes:

・phred PHRED quality scores (e.g. from Roche 454). ASCII with no offset, range: [4, 60].

・sanger Sanger are PHRED ASCII qualities with an offset of 33, range: [0, 93]. From NCBI SRA, or Illumina pipeline 1.8+.

・solexa Solexa (also very early Illumina - pipeline < 1.3). ASCII offset of 64, range: [-5, 62]. Uses a different quality-to-probabilities conversion than other schemes.

・illumina Illumina output from pipeline versions between 1.3 and 1.7. ASCII offset of 64, range: [0, 62]

 

trimmed_sequences.fastq is produced. Analyzing 1,000,000 reads sampled from a MiSeq run, adapters were found in 17,789 reads, and trimming reduced the total size from 290,466,519 bp to 290,124,196 bp.
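As a quick sanity check on those numbers, the removed bases and the average trim per adapter-containing read work out as follows:

```shell
# 290,466,519 - 290,124,196 bp removed across 17,789 adapter-containing
# reads, i.e. roughly a 19 bp trim per affected read.
awk 'BEGIN { removed = 290466519 - 290124196; printf "%d bp removed, %.1f bp per trimmed read\n", removed, removed / 17789 }'
```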

 

 

References

What is the strategy to deal with adapter contamination inside the reads but not at either of the ends (3' or 5')?

https://www.researchgate.net/post/What_is_the_strategy_to_deal_with_adapter_contamination_inside_the_reads_but_not_at_either_of_the_ends_3_or_5

Recycler: detecting plasmids from assembly graphs

2018 1/9 Added conda installation

 

Recycler is a method for detecting plasmid sequences from assembly graph paths. A plasmid's subgraph is expected to be independent of the rest of the genome graph: a circular path through a small number of nodes with uniform coverage. Recycler searches the graph for plasmids using these cues. The paper was published in 2016.

 

Installation

Dependencies

Python 2.7+

These can be installed with pip.

 

Recommended for generating inputs

The recommended tools can be installed with brew.

 

Github

https://github.com/Shamir-Lab/Recycler

Installing Recycler

git clone https://github.com/rozovr/Recycler.git 
cd Recycler
python setup.py install --user

# or use conda
conda install -c bioconda -y recycler

 

Run

Running Recycler requires the fastg file from a SPAdes assembly and a bam file built from the paired-end fastq (aligned with bwa mem). The bam can be prepared as follows.

make_fasta_from_fastg.py -g assembly_graph.fastg # produces assembly_graph.nodes.fasta
bwa index assembly_graph.nodes.fasta
bwa mem -t 8 assembly_graph.nodes.fasta R1.fastq.gz R2.fastq.gz | samtools view -buS -@ 8 - > reads_pe.bam
samtools view -@ 8 -bF 0x0800 reads_pe.bam > reads_pe_primary.bam
samtools sort -@ 8 reads_pe_primary.bam > reads_pe_primary.sort.bam
samtools index reads_pe_primary.sort.bam

reads_pe_primary.sort.bam and its bai index are used.

  

recycle.py -g assembly_graph.fastg -k 55 -b reads_pe_primary.sort.bam -i True
  •  -g assembly graph FASTG file to process: for spades 3.5, before_rr.fastg; for spades 3.6+, assembly_graph.fastg
  • -k integer reflecting maximum k value used by the assembler
  • -b BAM file resulting from aligning reads to contigs file, filtering for best matches

When the analysis finishes, a fasta file named ~.cycs.fasta is written.

 

Other options:

  • -l minimum length required for reporting [default: 1000]
  • -m coefficient of variation used for pre-selection [default: 0.5, higher--> less restrictive]
  • -i True or False value reflecting whether data sequenced was an isolated strain
  • -o provide a specific output directory by default results will be written to the directory the FASTG file is currently in.

 

Testing with sequencing data from a bacterium carrying seven plasmids of 5 kb to 100 kb, fasta was output for only the three plasmids of about 5 kb. The recovered plasmids have copy numbers several times that of the chromosome and simple structures. The unrecovered plasmids, by contrast, are thought to form complex graphs containing regions partially homologous to the genome and to have copy numbers about equal to the chromosome. Their nodes may therefore not exist independently of the genome's nodes, which presumably affected the detection.

 

  

References

Recycler: an algorithm for detecting plasmids from de novo assembly graphs

Roye Rozov, Aya Brown Kav, David Bogumil, Naama Shterzer, Eran Halperin, Itzhak Mizrahi, and Ron Shamir

Bioinformatics. 2017 Feb 15; 33(4): 475–482. 

 

NxTrim: removing junction adapters from the Nextera Mate Pair protocol

 

Illumina also sells a mate-pair sequencing kit. In this protocol, the genome is tagged with the Nextera transposon and self-ligated to bring distant sequences together, leaving a junction sequence in the middle (the red sequence in Figure 1). NxTrim is a dedicated tool for removing that junction sequence. Because the trimmed region is kept to a minimum, it is claimed to improve assembly accuracy. The paper was published in 2015.

 

f:id:kazumaxneo:20170825105709j:plain

Figure 1: Library preparation workflow, from the official PDF manual (link)

 

 

Installation

Can be installed with brew.

brew install NxTrim

 

 

Run

Download the sample fastq from GitHub and run.

wget https://github.com/sequencing/NxTrim/tree/master/example/sample_R1.fastq.gz
wget https://github.com/sequencing/NxTrim/tree/master/example/sample_R2.fastq.gz
nxtrim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -O sample

Four files are produced; their descriptions are pasted below.

  • mp: read pairs that are large insert-size mate-pairs, both mates will be reverse-complemented by nxtrim (from RF to FR) unless --rf commandline option is used
  • pe: read pairs that are short insert-size paired-end reads due to the junction adapter occurring early in a read
  • se: single reads (reads having no R1 or R2 counterpart)
  • unknown: a library of read-pairs that are mostly large-insert mate-pair, but possibly contain a small proportion of paired end contaminants

  

 

References

NxTrim: optimized trimming of Illumina mate pair reads

Jared O’Connell, Ole Schulz-Trieglaff, Emma Carlson, Matthew M. Hims, Niall A. Gormley, Anthony J. Cox

Bioinformatics. 2015 Jun 15;31(12):2035-7