macでインフォマティクス


Notes on informatics for HTS (NGS) data.

BBtools, a multifunctional NGS analysis toolkit - part 2

2019 6/13 Title corrected

2020 7/24 bbduk commands added

 

Continuing from part 1, this post introduces the remaining BBtools commands:

  • Reformat - format conversion and quality trimming.
  • Repair - repairs paired-read files whose read order is broken.
  • Stats - reports basic assembly statistics.
  • BBDuk - quality trimming, adapter trimming, and filtering.
  • BBMask - masks simple repeats and highly conserved regions.
  • CalcUniqueness - counts unique reads.

     

 

 

Reformat - format conversion and quality trimming

Several similar tools exist, but Reformat stands out for its low memory usage.

The examples below all assume interleaved fastq. For paired reads in two files, write:

reformat.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2>

 

・Convert fastq to fasta.

reformat.sh in=reads1.fq out=reads1.fasta

 

・Convert Phred quality scores from the common ASCII+33 offset to ASCII+64.

reformat.sh in=reads.fq out=reads.fq qin=33 qout=64
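As a quick sanity check of what this conversion does to each quality character, the offset arithmetic can be reproduced with plain shell tools (q=30 is just an arbitrary example score, not part of reformat.sh):

```shell
# Encode Phred quality 30 at both ASCII offsets (illustration only)
q=30
c33=$(printf "\\$(printf '%03o' $((q + 33)))")   # Sanger, offset 33
c64=$(printf "\\$(printf '%03o' $((q + 64)))")   # old Illumina, offset 64
echo "$c33 $c64"   # the same score, two different characters
```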

 

・Convert fastq to fasta + qual.

reformat.sh in=reads.fq out=reads.fa qfout=reads.qual

 

・Split interleaved paired reads into two files.

reformat.sh in=interlace.fq out1=read1.fq out2=read2.fq

 

・Combine two paired-read files into one interleaved file.

reformat.sh in1=read1.fq in2=read2.fq out=reads.fq

 

・Wrap sequence lines at 70 characters (same effect as the fold command).

reformat.sh in=reads.fa out=wrapped.fa fastawrap=70
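Since this wrapping behaves like the standard fold command, the result is easy to picture. The snippet below (plain POSIX tools, no BBtools required) wraps a 150-base dummy sequence the same way:

```shell
# Generate one 150-base sequence line and wrap it at 70 columns
awk 'BEGIN { for (i = 0; i < 150; i++) printf "A"; print "" }' | fold -w 70
# produces two 70-character lines followed by one 10-character line
```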

 

 

・Verify that a paired-read file is correctly interleaved.

reformat.sh in=reads.fq vint
  • verifyinterleaved=f (vint) sets 'vpair' to true and 'interleaved' to true.

 

・Add the pair identifiers /1 and /2 to paired-read files that lack them.

reformat.sh in1=pair1.fq in2=pair2.fq out1=renamed1.fq out2=renamed2.fq addslash=t int=f
  • addslash=f Append ' /1' and ' /2' to read names, if not already present. Please include the flag 'int=t' if the reads are interleaved. 
  • int=f (interleaved) Determines whether INPUT file is considered interleaved.

 

・Reverse-complement reads.

reformat.sh in=reads.fq out=out.fq rcomp
  • rcomp=f (rc) Reverse-complement reads.

 

・Convert lowercase bases to uppercase.

reformat.sh in=reads.fq out=out.fq tuc
  • touppercase=t (tuc) Convert input bases to upper-case; otherwise, lower-case will not match.

 

・Make all read names unique.

reformat.sh in=reads.fq out=out.fq uniquenames
  • uniquenames=f Make duplicate names unique by appending _<number>. 

Duplicate names are made unique by appending "_number". Note that this example, unlike the others, uses a large amount of memory.

 

・Clamp quality scores to a given range.

reformat.sh in=reads.fq out=out.fq mincalledquality=2 maxcalledquality=41
  • maxcalledquality=41 Quality scores capped at this upper bound.
  • mincalledquality=2 Quality scores of ACGT bases will be capped at lower bound.

Useful when processing by some other tool has corrupted the quality scores.

 

 

 

Repair - repairs paired-read files whose read order is broken

Repairs fastq files in which the forward and reverse reads are out of sync. For example, trimming paired reads separately with the fastx toolkit produces fastq files whose forward and reverse read counts no longer match. Ideally you would redo quality trimming from the raw reads with a pair-aware tool, but sometimes the original fastq is no longer available. This command repairs such files.

 

Accepts fasta, fastq, and sam input.

・Repair a pair of read files.

repair.sh -Xmx20g in1=broken1.fq in2=broken2.fq out1=fixed1.fq out2=fixed2.fq outs=singletons.fq repair
  • in=<file> The 'in=' flag is needed if the input file is not the first parameter. 'in=stdin' will pipe from standard in.
  • in2=<file> Use this if 2nd read of pairs are in a different file.
  • out=<file> The 'out=' flag is needed if the output file is not the second parameter. 'out=stdout' will pipe to standard out.
  • out2=<file> Use this to write 2nd read of pairs to a different file.
  • outs=<file> (outsingle) Write singleton reads here.
  • -Xmx20g This will be passed to Java to set memory usage, overriding the program's automatic memory detection. -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

 

 

 

Stats - reports basic assembly statistics

・Analyze contigs.fasta.

stats.sh in=contigs.fa gc=gc.txt gcformat=4
  • in=file Specify the input fasta file, or stdin.
  • gc=file Writes ACGTN content per scaffold to a file.
  • gcformat=<0-4>  Select GC output format; default 1.

  -  gcformat=0: (no base content info printed)

  -  gcformat=1: name length A C G T N GC

  -  gcformat=2: name GC

  -  gcformat=4: name length GC

  Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

Basic statistics such as N50 are printed.

user$ stats.sh in=contigs.fa gc=gc.txt gcformat=4

A C G T N IUPAC Other GC GC_stdev

0.2841 0.2276 0.2531 0.2352 0.0000 0.0000 0.0000 0.4807 0.0583

 

Main genome scaffold total:         42

Main genome contig total:           42

Main genome scaffold sequence total: 0.004 MB

Main genome contig sequence total:  0.004 MB  0.000% gap

Main genome scaffold N/L50:         9/131

Main genome contig N/L50:           9/131

Main genome scaffold N/L90:         42/41

Main genome contig N/L90:           42/41

Max scaffold length:                352

Max contig length:                  352

Number of scaffolds > 50 KB:        0

% main genome in scaffolds > 50 KB: 0.00%

 

 

Minimum Number        Number        Total         Total         Scaffold

Scaffold of            of            Scaffold      Contig        Contig  

Length  Scaffolds     Contigs       Length        Length        Coverage

-------- -------------- -------------- -------------- -------------- --------

    All             42             42         3,520         3,520 100.00%

     50             19             19         2,547         2,547 100.00%

    100             11             11         2,009         2,009 100.00%

    250              1             1           352           352 100.00%

The GC content is written to a separate file.

user$ head -5 gc.txt 

#Name                Length GC

NODE_1_length_332_cov_90.186745 352 0.5114

NODE_4_length_46_cov_61.913044   66 0.4242

NODE_6_length_127_cov_8.811024  147 0.3469

NODE_7_length_127_cov_14.677165 147 0.2993

 

 

 

 

 

BBDuk - quality trimming, adapter trimming, filtering, masking, and more

BBDuk performs quality trimming and kmer-based filtering of reads suspected to be contaminants. It can also filter on many other criteria, such as GC content, length, and entropy.

For paired reads, write something like:

bbduk.sh in1=read1.fq in2=read2.fq out1=clean1.fq out2=clean2.fq

BBDuk always keeps paired reads in order; for correct behavior, both reads of a pair must be processed together.

 

・Trim 3' adapters.

bbduk.sh in=reads.fq out=clean.fq ref=adapters.fa ktrim=r k=23 mink=11 hdist=1 tpe tbo
  • in=<file> Main input. in=stdin.fq will pipe from stdin.
  • out=<file> (outnonmatch) Write reads here that do not contain kmers matching the database. 'out=stdout.fq' will pipe to standard out.
  • ref=<file,file> Comma-delimited list of reference files.
  • tpe=f (trimpairsevenly) When kmer right-trimming, trim both reads to the minimum length of either.
  • hammingdistance=0 (hdist) Maximum Hamming distance for ref kmers (subs only). Memory use is proportional to (3*K)^hdist.
  • tbo=f (trimbyoverlap) Trim adapters based on where paired reads overlap.
  • mink=0 Look for shorter kmers at read tips down to this length, when k-trimming or masking. 0 means disabled. Enabling this will disable maskmiddle.
  • ktrim=f Trim reads to remove bases matching reference kmers. Values:

   f (don't trim),

   r (trim to the right),

   l (trim to the left)

ktrim=r performs right trimming (3' adapters) and ktrim=l performs left trimming (5' adapters). k sets the kmer length used to search for adapters (set it to the adapter length or shorter), and mink sets the minimum kmer length used at read tips. hdist sets the number of mismatches tolerated when matching reference kmers.

 

 

・Trim 3' regions with quality below 10.

bbduk.sh in=reads.fq out=clean.fq qtrim=r trimq=10
  • trimq=6 Regions with average quality BELOW this will be trimmed.
  • qtrim=f Trim read ends to remove bases with quality below trimq. Performed AFTER looking for kmers. Values:

    rl (trim both ends), 

    f (neither end), 

    r (right end only), 

    l (left end only),

    w (sliding window).

qtrim=r trims quality from the 3' end; qtrim=rl trims both ends. Any value other than qtrim=f enables trimming, which is performed only after all kmer-based operations have completed.

 

・Force-trim the first 10 bases at the 5' end and everything after base 139 at the 3' end.

bbduk.sh in=reads.fq out=clean.fq ftl=10 ftr=139
  •  forcetrimleft=0 (ftl) If positive, trim bases to the left of this position (exclusive, 0-based).
  • forcetrimright=0 (ftr) If positive, trim bases to the right of this position (exclusive, 0-based).

 

・Force-trim the 3' end to a multiple of a fixed length.

bbduk.sh in=reads.fq out=clean.fq ftm=5
  •  forcetrimmod=0 (ftm) If positive, right-trim length to be equal to zero, modulo this number.

The read length is taken modulo 5 and the remainder of 1 base is trimmed. Illumina read lengths generally come in multiples of 5 (50, 75, 150, 300, and so on), but reads carry one extra base, and that final base tends to have very low quality; dividing by 5 and trimming the remainder of 1 removes it.
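The arithmetic is easy to verify by hand. This plain-shell sketch (readlen=151 is a hypothetical read length, not a BBDuk parameter) mirrors what ftm=5 computes:

```shell
# ftm=5: right-trim the read so its length becomes 0 modulo 5
readlen=151                      # e.g. a 151-cycle Illumina run
ftm=5
trim=$(( readlen % ftm ))        # bases removed from the 3' end
newlen=$(( readlen - trim ))     # final read length
echo "trim=$trim newlen=$newlen"
```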

 

・Filter out (discard) reads whose average quality is below 10.

bbduk.sh in=reads.fq out=clean.fq maq=10
  • minavgquality=0 (maq) Reads with average quality (after trimming) below this will be discarded.

 

・Trim regions with average quality below 10, then discard reads that end up shorter than 100 bp.

bbduk.sh in=reads.fq out=clean.fq qtrim=r trimq=10 ml=100
  • trimq=6 Regions with average quality BELOW this will be trimmed. 
  • minlength=10 (ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded if both are shorter.

 

・Kmer filtering.

bbduk.sh in=reads.fq out=unmatched.fq outm=matched.fq ref=phix.fa k=31 hdist=1 stats=stats.txt

#paired-end (matched and unmatched pairs stay synchronized)
bbduk.sh in1=pair_1.fq in2=pair_2.fq \
out1=unmatched_1.fq out2=unmatched_2.fq \
outm1=matched_1.fq outm2=matched_2.fq \
ref=phix.fa k=31 hdist=1 stats=stats.txt
  • outm=<file> (outmatch) Write reads here that contain kmers matching the database.
  • ref=<file,file> Comma-delimited list of reference files.
  • hammingdistance=0 (hdist) Maximum Hamming distance for ref kmers (subs only). Memory use is proportional to (3*K)^hdist.

Reads are searched against the reference with 31-mers, and any read matching with up to one mismatch (hdist=1) is removed. The example above screens against the Illumina spike-in (the PhiX bacteriophage, 5.5 kb). For paired-end input, both the matched and unmatched outputs stay synchronized.

It can also be used to remove microbe-derived reads before assembling an animal or plant genome.


 

・Kmer masking 1.

bbduk.sh in=ecoli.fa out=ecoli_masked.fa ref=salmonella.fa k=25 ktrim=N

Regions matching the Salmonella genome over 25-mers are masked with Ns. To mask with X instead, use ktrim=X.

 

・Kmer-based low-complexity filtering 2: specify an entropy threshold.

bbduk.sh in=reads.fq out=filtered.fq entropy=0.5 entropywindow=50 entropyk=5
  • entropy=-1 Set between 0 and 1 to filter reads with entropy below that value. Higher is more stringent.
  • entropywindow=50 Calculate entropy using a sliding window of this length.
  • entropyk=5 Calculate entropy using kmers of this length.

Reads with average entropy below 0.5 are removed. Entropy is judged from base complexity: a sequence such as AAAAAAAAAAAAAA has entropy 0, and the more random the sequence, the closer to 1. Simple repeats should therefore also be eliminated.
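BBDuk's exact entropy formula is not reproduced here, but the intuition (fewer distinct kmers in a window means lower entropy) can be sketched with a hypothetical count_kmers helper built on awk:

```shell
# Count distinct 5-mers in one sequence line (a rough proxy for entropy)
count_kmers() {
  awk -v k=5 '{
    n = 0
    for (i = 1; i <= length($0) - k + 1; i++) {
      m = substr($0, i, k)
      if (!(m in seen)) { seen[m] = 1; n++ }
    }
    print n
  }'
}
echo "AAAAAAAAAAAAAAAAAAAA" | count_kmers   # homopolymer: every 5-mer is identical
echo "ACGTGCTAGCTAGGATCCAA" | count_kmers   # mixed sequence: many distinct 5-mers
```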

 

・Quality recalibration

bbduk.sh in=reads.fq out=recalibrated.fq recalibrate sam=mapped.sam
  • recalibrate=f (recal) Recalibrate quality scores. Requires calibration matrices generated by CalcTrueQuality.
  • sam=<file,file> If recalibration is desired, and matrices have not already been generated, BBDuk will create them from the sam file.

Outputs a fastq whose base quality scores have been recalibrated based on the sam file. The sam cigar strings must use = (match) and X (mismatch); M does not work, so realign with BBMap or similar first. If no reference is available, the recommended approach is to align against your assembled contigs.

 

・Histogram generation

bbduk.sh in=reads.fq bhist=bhist.txt qhist=qhist.txt gchist=gchist.txt aqhist=aqhist.txt lhist=lhist.txt gcbins=auto 
  • gcbins=100 Number gchist bins. Set to 'auto' to use read length. 

Outputs histogram data for statistics such as base frequency, quality score, GC content, average quality score, and read length. BBMap offers the same options.

 

・Barcode and chastity filtering

bbduk.sh in=reads.fq out=filtered.fq barcodes=ACGTT,GGCCG barcodefilter chastityfilter 
  • barcodes= Comma-delimited list of barcodes or files of barcodes.

Header-parsing parameters - these require Illumina headers:

  • chastityfilter=f (cf) Discard reads with id containing ' 1:Y:' or ' 2:Y:'.
  • barcodefilter=f Remove reads with unexpected barcodes if barcodes is set, or barcodes containing 'N' otherwise. A barcode must be the last part of the read header. 

 

・Cardinality estimation

bbduk.sh in=reads.fq loglog loglogk=31 
  • cardinality=f (loglog) Count unique kmers using the LogLog algorithm.
  • loglogk=31 Use this kmer length for counting.

Counts the number of unique 31-mers.

 

・Matching degenerate sequences such as primers

bbduk.sh in=reads.fq out=matching.fq literal=ACGTTNNNNNGTC copyundefined k=13 mm=f 
  • literal=<seq,seq> Comma-delimited list of literal reference sequences.
  • copyundefined=f (cu) Process non-AGCT IUPAC reference bases by making all possible unambiguous copies. Intended for short motifs or adapter barcodes, as time/memory use is exponential.

Searches for degenerate sequences. This can be used, for example, to exhaustively find the variable regions of 16S metagenomic reads.

 

 

 

BBMask - masks simple repeats and highly conserved regions

Used, for instance, to pre-mask simple repeats so they do not cause false-positive mappings. Shannon entropy is reportedly used to decide whether a sequence is low-complexity. To remove such reads by filtering rather than masking, use BBDuk.

・Mask regions whose entropy (information content) is 0.7 or below.

bbmask.sh in=ref.fa out=masked.fa entropy=0.7
  • entropy=0.70 (e) Mask windows with entropy under this value (0-1). 0.0001 will mask only homopolymers and 1 will mask everything. 

It is difficult to determine the right threshold precisely, but at entropy=0.7 only about 70 bp of the E. coli genome is masked, and even the copy- and repeat-rich human genome is reportedly only 0.7% masked. Raising this to entropy=0.9 masks about 8 kb of E. coli and 7% of the human genome.

 

 

 

CalcUniqueness - counts unique reads

A kmer is taken from the start of each read and checked for uniqueness; if the kmer has been seen before, the read is judged non-unique. The fraction of unique reads in a dataset can be used to estimate whether sequencing coverage is saturating.

 

 

・Plot a histogram of unique read counts.

bbcountunique.sh in=reads.fq out=histogram.txt

The fraction of unique reads is reported in bins of 25,000 reads.

Kmers must match exactly to be counted as duplicates, so on low-quality data the apparent fraction of unique reads is inflated; keep this caveat in mind.

 

Several other commands are available as well; see the official site for details.

BBTools User Guide - DOE Joint Genome Institute

 

 

 

References

BBMap (aligner for DNA/RNAseq) is now open-source and available for download. - SEQanswers

 

BBtools, a multifunctional NGS analysis toolkit - part 1

2018 9/5 bbmerge-auto.sh corrected

2019 5/14 BBnorm comments corrected, parameters corrected, help output added

2019 6/13 Title corrected, 6/19 part 3 added

2020 1/29 conda install and memory-usage options added, 11/6 typos fixed

2023/02/02 Updated

 

 

BBtools is a multifunctional NGS analysis toolkit provided by the JGI in the United States. It was reportedly open-sourced in 2014, and a paper is said to be in preparation. It offers distinctive commands such as the aligner BBMap, BBMerge for merging paired reads (even non-overlapping ones), and BBnorm for outputting error-corrected fastq. This post introduces the following commands:

  • BBMap - alignment to a reference.
  • BBMerge - merging paired reads; reads can also be extended by adding kmers.
  • BBnorm - coverage normalization.
  • Dedupe - removal of duplicate contigs.

Part 2.

Part 3.

 

Official page

http://jgi.doe.gov/data-and-tools/bbtools/

User guide

http://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/

Download

https://sourceforge.net/projects/bbmap/

#homebrew
brew install BBtools
#bioconda (link)
conda install -c bioconda -y bbmap

Unless told otherwise, BBtools commands use every core available.
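To see what that means on your machine, and to cap it explicitly, the core count can be queried portably; the threads= flag and -Xmx option appear in the help output below (the file names in the comment are placeholders):

```shell
# Portable core count (works on both macOS and Linux)
cores=$(getconf _NPROCESSORS_ONLN)
echo "detected $cores cores"
# Cap a BBtools run at 4 threads and 8 GB of RAM:
#   bbmap.sh in=reads.fq out=mapped.sam threads=4 -Xmx8g
```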

$ bbmap.sh

 

BBMap

Written by Brian Bushnell, from Dec. 2010 - present

Last modified December 20, 2017

 

Description:  Fast and accurate splice-aware read aligner.

Please read bbmap/docs/guides/BBMapGuide.txt for more information.

 

To index:     bbmap.sh ref=<reference fasta>

To map:       bbmap.sh in=<reads> out=<output sam>

To map without writing an index:

    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

 

in=stdin will accept reads from standard in, and out=stdout will write to 

standard out, but file extensions are still needed to specify the format of the 

input and output files e.g. in=stdin.fa.gz will read gzipped fasta from 

standard in; out=stdout.sam.gz will write gzipped sam.

 

Indexing Parameters (required when building the index):

nodisk=f                Set to true to build index in memory and write nothing 

                        to disk except output.

ref=<file>              Specify the reference sequence.  Only do this ONCE, 

                        when building the index (unless using 'nodisk').

build=1                 If multiple references are indexed in the same directory,

                        each needs a unique numeric ID (unless using 'nodisk').

k=13                    Kmer length, range 8-15.  Longer is faster but uses 

                        more memory.  Shorter is more sensitive.

                        If indexing and mapping are done in two steps, K should

                        be specified each time.

path=<.>                Specify the location to write the index, if you don't 

                        want it in the current working directory.

usemodulo=f             Throw away ~80% of kmers based on remainder modulo a 

                        number (reduces RAM by 50% and sensitivity slightly).

                        Should be enabled both when building the index AND 

                        when mapping.

rebuild=f               Force a rebuild of the index (ref= should be set).

 

Input Parameters:

build=1                 Designate index to use.  Corresponds to the number 

                        specified when building the index.

in=<file>               Primary reads input; required parameter.

in2=<file>              For paired reads in two files.

interleaved=auto        True forces paired/interleaved input; false forces 

                        single-ended mapping. If not specified, interleaved 

                        status will be autodetected from read names.

fastareadlen=500        Break up FASTA reads longer than this.  Max is 500 for

                        BBMap and 6000 for BBMapPacBio.  Only works for FASTA

                        input (use 'maxlen' for FASTQ input).  The default for

                        bbmap.sh is 500, and for mapPacBio.sh is 6000.

unpigz=f                Spawn a pigz (parallel gzip) process for faster 

                        decompression than using Java.  

                        Requires pigz to be installed.

touppercase=t           (tuc) Convert lowercase letters in reads to upper case 

                        (otherwise they will not match the reference).

 

Sampling Parameters:

 

reads=-1                Set to a positive number N to only process the first N

                        reads (or pairs), then quit.  -1 means use all reads.

samplerate=1            Set to a number from 0 to 1 to randomly select that

                        fraction of reads for mapping. 1 uses all reads.

skipreads=0             Set to a number N to skip the first N reads (or pairs), 

                        then map the rest.

 

Mapping Parameters:

fast=f                  This flag is a macro which sets other paramters to run 

                        faster, at reduced sensitivity.  Bad for RNA-seq.

slow=f                  This flag is a macro which sets other paramters to run 

                        slower, at greater sensitivity.  'vslow' is even slower.

maxindel=16000          Don't look for indels longer than this. Lower is faster.

                        Set to >=100k for RNAseq with long introns like mammals.

strictmaxindel=f        When enabled, do not allow indels longer than 'maxindel'.

                        By default these are not sought, but may be found anyway.

tipsearch=100           Look this far for read-end deletions with anchors

                        shorter than K, using brute force.

minid=0.76              Approximate minimum alignment identity to look for. 

                        Higher is faster and less sensitive.

minhits=1               Minimum number of seed hits required for candidate sites.

                        Higher is faster.

local=f                 Set to true to use local, rather than global, alignments.

                        This will soft-clip ugly ends of poor alignments.

perfectmode=f           Allow only perfect mappings when set to true (very fast).

semiperfectmode=f       Allow only perfect and semiperfect (perfect except for 

                        N's in the reference) mappings.

threads=auto            (t) Set to number of threads desired.  By default, uses 

                        all cores available.

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 

                        multiple top-scoring mapping locations).

                            best    (use the first best site)

                            toss    (consider unmapped)

                            random  (select one top-scoring site randomly)

                            all     (retain all top-scoring sites)

samestrandpairs=f       (ssp) Specify whether paired reads should map to the

                        same strand or opposite strands.

requirecorrectstrand=t  (rcs) Forbid pairing of reads without correct strand 

                        orientation.  Set to false for long-mate-pair libraries.

killbadpairs=f          (kbp) If a read pair is mapped with an inappropriate

                        insert size or orientation, the read with the lower  

                        mapping quality is marked unmapped.

pairedonly=f            (po) Treat unpaired reads as unmapped.  Thus they will 

                        be sent to 'outu' but not 'outm'.

rcomp=f                 Reverse complement both reads prior to mapping (for LMP

                        outward-facing libraries).

rcompmate=f             Reverse complement read2 prior to mapping.

pairlen=32000           Set max allowed distance between paired reads.  

                        (insert size)=(pairlen)+(read1 length)+(read2 length)

rescuedist=1200         Don't try to rescue paired reads if avg. insert size

                        greater than this.  Lower is faster.

rescuemismatches=32     Maximum mismatches allowed in a rescued read.  Lower

                        is faster.

averagepairdist=100     (apd) Initial average distance between paired reads.

                        Varies dynamically; does not need to be specified.

deterministic=f         Run in deterministic mode.  In this case it is good

                        to set averagepairdist.  BBMap is deterministic

                        without this flag if using single-ended reads,

                        or run singlethreaded.

bandwidthratio=0        (bwr) If above zero, restrict alignment band to this 

                        fraction of read length.  Faster but less accurate.

bandwidth=0             (bw) Set the bandwidth directly.

                        fraction of read length.  Faster but less accurate.

usejni=f                (jni) Do alignments faster, in C code.  Requires 

                        compiling the C code; details are in /jni/README.txt.

maxsites2=800           Don't analyze (or print) more than this many alignments 

                        per read.

ignorefrequentkmers=t   (ifk) Discard low-information kmers that occur often.

excludefraction=0.03    (ef) Fraction of kmers to ignore.  For example, 0.03

                        will ignore the most common 3% of kmers.

greedy=t                Use a greedy algorithm to discard the least-useful

                        kmers on a per-read basis.

kfilter=0               If positive, potential mapping sites must have at

                        least this many consecutive exact matches.

 

 

Quality and Trimming Parameters:

qin=auto                Set to 33 or 64 to specify input quality value ASCII

                        offset. 33 is Sanger, 64 is old Solexa.

qout=auto               Set to 33 or 64 to specify output quality value ASCII 

                        offset (only if output format is fastq).

qtrim=f                 Quality-trim ends before mapping.  Options are: 

                        'f' (false), 'l' (left), 'r' (right), and 'lr' (both).

untrim=f                Undo trimming after mapping.  Untrimmed bases will be 

                        soft-clipped in cigar strings.

trimq=6                 Trim regions with average quality below this 

                        (phred algorithm).

mintrimlength=60        (mintl) Don't trim reads to be shorter than this.

fakefastaquality=-1     (ffq) Set to a positive number 1-50 to generate fake

                        quality strings for fasta input reads.

ignorebadquality=f      (ibq) Keep going, rather than crashing, if a read has 

                        out-of-range quality values.

usequality=t            Use quality scores when determining which read kmers

                        to use as seeds.

minaveragequality=0     (maq) Do not map reads with average quality below this.

maqb=0                  If positive, calculate maq from this many initial bases.

 

Output Parameters:

 

out=<file>              Write all reads to this file.

outu=<file>             Write only unmapped reads to this file.  Does not 

                        include unmapped paired reads with a mapped mate.

outm=<file>             Write only mapped reads to this file.  Includes 

                        unmapped paired reads with a mapped mate.

mappedonly=f            If true, treats 'out' like 'outm'.

bamscript=<file>        (bs) Write a shell script to <file> that will turn 

                        the sam output into a sorted, indexed bam file.

ordered=f               Set to true to output reads in same order as input.  

                        Slower and uses more memory.

overwrite=f             (ow) Allow process to overwrite existing files.

secondary=f             Print secondary alignments.

sssr=0.95               (secondarysitescoreratio) Print only secondary alignments

                        with score of at least this fraction of primary.

ssao=f                  (secondarysiteasambiguousonly) Only print secondary 

                        alignments for ambiguously-mapped reads.

maxsites=5              Maximum number of total alignments to print per read.

                        Only relevant when secondary=t.

quickmatch=f            Generate cigar strings more quickly.

trimreaddescriptions=f  (trd) Truncate read and ref names at the first whitespace,

                        assuming that the remainder is a comment or description.

ziplevel=2              (zl) Compression level for zip or gzip output.

pigz=f                  Spawn a pigz (parallel gzip) process for faster 

                        compression than Java.  Requires pigz to be installed.

machineout=f            Set to true to output statistics in machine-friendly 

                        'key=value' format.

printunmappedcount=f    Print the total number of unmapped reads and bases.

                        If input is paired, the number will be of pairs

                        for which both reads are unmapped.

showprogress=0          If positive, print a '.' every X reads.

showprogress2=0         If positive, print the number of seconds since the

                        last progress update (instead of a '.').

renamebyinsert=f        Renames reads based on their mapped insert size.

 

Post-Filtering Parameters:

idfilter=0              Independant of minid; sets exact minimum identity 

                        allowed for alignments to be printed.  Range 0 to 1.

subfilter=-1            Ban alignments with more than this many substitutions.

insfilter=-1            Ban alignments with more than this many insertions.

delfilter=-1            Ban alignments with more than this many deletions.

indelfilter=-1          Ban alignments with more than this many indels.

editfilter=-1           Ban alignments with more than this many edits.

inslenfilter=-1         Ban alignments with an insertion longer than this.

dellenfilter=-1         Ban alignments with a deletion longer than this.

nfilter=-1              Ban alignments with more than this many ns.  This 

                        includes nocall, noref, and off scaffold ends.

 

Sam flags and settings:

noheader=f              Disable generation of header lines.

sam=1.4                 Set to 1.4 to write Sam version 1.4 cigar strings, 

                        with = and X, or 1.3 to use M.

saa=t                   (secondaryalignmentasterisks) Use asterisks instead of

                        bases for sam secondary alignments.

cigar=t                 Set to 'f' to skip generation of cigar strings (faster).

keepnames=f             Keep original names of paired reads, rather than 

                        ensuring both reads have the same name.

intronlen=999999999     Set to a lower number like 10 to change 'D' to 'N' in 

                        cigar strings for deletions of at least that length.

rgid=                   Set readgroup ID.  All other readgroup fields 

                        can be set similarly, with the flag rgXX=

mdtag=f                 Write MD tags.

nhtag=f                 Write NH tags.

xmtag=f                 Write XM tags (may only work correctly with ambig=all).

amtag=f                 Write AM tags.

nmtag=f                 Write NM tags.

xstag=f                 Set to 'xs=fs', 'xs=ss', or 'xs=us' to write XS tags 

                        for RNAseq using firststrand, secondstrand, or 

                        unstranded libraries.  Needed by Cufflinks.  

                        JGI mainly uses 'firststrand'.

stoptag=f               Write a tag indicating read stop location, prefixed by YS:i:

lengthtag=f             Write a tag indicating (query,ref) alignment lengths, 

                        prefixed by YL:Z:

idtag=f                 Write a tag indicating percent identity, prefixed by YI:f:

inserttag=f             Write a tag indicating insert size, prefixed by X8:Z:

scoretag=f              Write a tag indicating BBMap's raw score, prefixed by YR:i:

timetag=f               Write a tag indicating this read's mapping time, prefixed by X0:i:

boundstag=f             Write a tag indicating whether either read in the pair

                        goes off the end of the reference, prefixed by XB:Z:

notags=f                Turn off all optional tags.

 

Histogram and statistics output parameters:

scafstats=<file>        Statistics on how many reads mapped to which scaffold.

refstats=<file>         Statistics on how many reads mapped to which reference

                        file; only for BBSplit.

sortscafs=t             Sort scaffolds or references by read count.

bhist=<file>            Base composition histogram by position.

qhist=<file>            Quality histogram by position.

aqhist=<file>           Histogram of average read quality.

bqhist=<file>           Quality histogram designed for box plots.

lhist=<file>            Read length histogram.

ihist=<file>            Write histogram of insert sizes (for paired reads).

ehist=<file>            Errors-per-read histogram.

qahist=<file>           Quality accuracy histogram of error rates versus 

                        quality score.

indelhist=<file>        Indel length histogram.

mhist=<file>            Histogram of match, sub, del, and ins rates by 

                        read location.

gchist=<file>           Read GC content histogram.

gcbins=100              Number gchist bins.  Set to 'auto' to use read length.

gcpairs=t               Use average GC of paired reads.

idhist=<file>           Histogram of read count versus percent identity.

idbins=100              Number idhist bins.  Set to 'auto' to use read length.

statsfile=stderr        Mapping statistics are printed here.

 

Coverage output parameters (these may reduce speed and use more RAM):

covstats=<file>         Per-scaffold coverage info.

rpkm=<file>             Per-scaffold RPKM/FPKM counts.

covhist=<file>          Histogram of # occurrences of each depth level.

basecov=<file>          Coverage per base location.

bincov=<file>           Print binned coverage per location (one line per X bases).

covbinsize=1000         Set the binsize for binned coverage output.

nzo=t                   Only print scaffolds with nonzero coverage.

twocolumn=f             Change to true to print only ID and Avg_fold instead of 

                        all 6 columns to the 'out=' file.

32bit=f                 Set to true if you need per-base coverage over 64k.

strandedcov=f           Track coverage for plus and minus strand independently.

startcov=f              Only track start positions of reads.

secondarycov=t          Include coverage of secondary alignments.

physcov=f               Calculate physical coverage for paired reads.

                        This includes the unsequenced bases.

delcoverage=t           (delcov) Count bases covered by deletions as covered.

                        True is faster than false.

covk=0                  If positive, calculate kmer coverage statistics.

 

Java Parameters:

-Xmx                    This will be passed to Java to set memory usage, 

                        overriding the program's automatic memory detection.

                        -Xmx20g will specify 20 gigs of RAM, and -Xmx800m 

                        will specify 800 megs.  The max is typically 85% of 

                        physical memory.  The human genome requires around 24g,

                        or 12g with the 'usemodulo' flag.  The index uses 

                        roughly 6 bytes per reference base.

-eoom                   This flag will cause the process to exit if an 

                        out-of-memory exception occurs.  Requires Java 8u92+.

-da                     Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter 

any problems, or post at: http://seqanswers.com/forums/showthread.php?t=41057
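
The coverage parameters listed above belong to the BBMap family of mappers. Assuming the help tail above is from bbmap.sh, a minimal sketch combining several of the coverage outputs in one run (all filenames are placeholders):

```shell
# Map reads and emit per-scaffold stats, per-base coverage, and binned coverage
# (ref.fa / reads.fq / output names are placeholders)
bbmap.sh ref=ref.fa in=reads.fq out=mapped.sam \
    covstats=covstats.txt basecov=basecov.txt \
    bincov=bincov.txt covbinsize=1000 nzo=t
```

Note the help's warning that these outputs may reduce speed and increase RAM use, so request only the files you need.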

 

> bbnorm.sh 

$ bbnorm.sh 

 

Written by Brian Bushnell

Last modified October 19, 2017

 

Description:  Normalizes read depth based on kmer counts.

Can also error-correct, bin reads by kmer depth, and generate a kmer depth histogram.

However, Tadpole has superior error-correction to BBNorm.

Please read bbmap/docs/guides/BBNormGuide.txt for more information.

 

Usage:     bbnorm.sh in=<input> out=<reads to keep> outt=<reads to toss> hist=<histogram output>

 

Input parameters:

in=null             Primary input.  Use in2 for paired reads in a second file

in2=null            Second input file for paired reads in two files

extra=null          Additional files to use for input (generating hash table) but not for output

fastareadlen=2^31   Break up FASTA reads longer than this.  Can be useful when processing scaffolded genomes

tablereads=-1       Use at most this many reads when building the hashtable (-1 means all)

kmersample=1        Process every nth kmer, and skip the rest

readsample=1        Process every nth read, and skip the rest

interleaved=auto    May be set to true or false to override autodetection of the input file as paired interleaved.

qin=auto            ASCII offset for input quality.  May be 33 (Sanger), 64 (Illumina), or auto.

 

Output parameters:

out=<file>          File for normalized or corrected reads.  Use out2 for paired reads in a second file

outt=<file>         (outtoss) File for reads that were excluded from primary output

reads=-1            Only process this number of reads, then quit (-1 means all)

sampleoutput=t      Use sampling on output as well as input (not used if sample rates are 1)

keepall=f           Set to true to keep all reads (e.g. if you just want error correction).

zerobin=f           Set to true if you want kmers with a count of 0 to go in the 0 bin instead of the 1 bin in histograms.

                    Default is false, to prevent confusion about how there can be 0-count kmers.

                    The reason is that based on the 'minq' and 'minprob' settings, some kmers may be excluded from the bloom filter.

tmpdir=/var/folders/f2/r9ydwd5d5vxgfzz5h48grzgc0000gn/T/      This will specify a directory for temp files (only needed for multipass runs).  If null, they will be written to the output directory.

usetempdir=t        Allows enabling/disabling of temporary directory; if disabled, temp files will be written to the output directory.

qout=auto           ASCII offset for output quality.  May be 33 (Sanger), 64 (Illumina), or auto (same as input).

rename=f            Rename reads based on their kmer depth.

 

Hashing parameters:

k=31                Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)

bits=32             Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32.  Maximum kmer depth recorded is 2^cbits.  Automatically reduced to 16 in 2-pass.

                    Large values decrease accuracy for a fixed amount of memory, so use the lowest number you can that will still capture highest-depth kmers.

hashes=3            Number of times each kmer is hashed and stored.  Higher is slower.

                    Higher is MORE accurate if there is enough memory, and LESS accurate if there is not enough memory.

prefilter=f         True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable.  The prefilter is more memory-efficient because it uses 2-bit cells.

prehashes=2         Number of hashes for prefilter.

prefilterbits=2     (pbits) Bits per cell in prefilter.

prefiltersize=0.35  Fraction of memory to allocate to prefilter.

buildpasses=1       More passes can sometimes increase accuracy by iteratively removing low-depth kmers

minq=6              Ignore kmers containing bases with quality below this

minprob=0.5         Ignore kmers with overall probability of correctness below this

threads=auto        (t) Spawn exactly X hashing threads (default is number of logical processors).  Total active threads may exceed X due to I/O threads.

rdk=t               (removeduplicatekmers) When true, a kmer's count will only be incremented once per read pair, even if that kmer occurs more than once.

 

Normalization parameters:

fixspikes=f         (fs) Do a slower, high-precision bloom filter lookup of kmers that appear to have an abnormally high depth due to collisions.

target=100          (tgt) Target normalization depth.  NOTE:  All depth parameters control kmer depth, not read depth.

                    For kmer depth Dk, read depth Dr, read length R, and kmer size K:  Dr=Dk*(R/(R-K+1))

maxdepth=-1         (max) Reads will not be downsampled when below this depth, even if they are above the target depth.            

mindepth=5          (min) Kmers with depth below this number will not be included when calculating the depth of a read.

minkmers=15         (mgkpr) Reads must have at least this many kmers over min depth to be retained.  Aka 'mingoodkmersperread'.

percentile=54.0     (dp) Read depth is by default inferred from the 54th percentile of kmer depth, but this may be changed to any number 1-100.

uselowerdepth=t     (uld) For pairs, use the depth of the lower read as the depth proxy.

deterministic=t     (dr) Generate random numbers deterministically to ensure identical output between multiple runs.  May decrease speed with a huge number of threads.

passes=2            (p) 1 pass is the basic mode.  2 passes (default) allows greater accuracy, error detection, and better control of output depth.

 

Error detection parameters:

hdp=90.0            (highdepthpercentile) Position in sorted kmer depth array used as proxy of a read's high kmer depth.

ldp=25.0            (lowdepthpercentile) Position in sorted kmer depth array used as proxy of a read's low kmer depth.

tossbadreads=f      (tbr) Throw away reads detected as containing errors.

requirebothbad=f    (rbb) Only toss bad pairs if both reads are bad.

errordetectratio=125   (edr) Reads with a ratio of at least this much between their high and low depth kmers will be classified as error reads.

highthresh=12       (ht) Threshold for high kmer.  Kmers at or above this are considered non-error.

lowthresh=3         (lt) Threshold for low kmer.  Kmers at this and below are always considered errors.

 

Error correction parameters:

ecc=f               Set to true to correct errors.  NOTE: Tadpole is now preferred for ecc as it does a better job.

ecclimit=3          Correct up to this many errors per read.  If more are detected, the read will remain unchanged.

errorcorrectratio=140  (ecr) Adjacent kmers with a depth ratio of at least this much between them will be classified as an error.

echighthresh=22     (echt) Threshold for high kmer.  A kmer at this or above may be considered non-error.

eclowthresh=2       (eclt) Threshold for low kmer.  Kmers at this and below are considered errors.

eccmaxqual=127      Do not correct bases with quality above this value.

aec=f               (aggressiveErrorCorrection) Sets more aggressive values of ecr=100, ecclimit=7, echt=16, eclt=3.

cec=f               (conservativeErrorCorrection) Sets more conservative values of ecr=180, ecclimit=2, echt=30, eclt=1, sl=4, pl=4.

meo=f               (markErrorsOnly) Marks errors by reducing quality value of suspected errors; does not correct anything.

mue=t               (markUncorrectableErrors) Marks errors only on uncorrectable reads; requires 'ecc=t'.

overlap=f           (ecco) Error correct by read overlap.

 

Depth binning parameters:

lowbindepth=10      (lbd) Cutoff for low depth bin.

highbindepth=80     (hbd) Cutoff for high depth bin.

outlow=<file>       Pairs in which both reads have a median below lbd go into this file.

outhigh=<file>      Pairs in which both reads have a median above hbd go into this file.

outmid=<file>       All other pairs go into this file.

 

Histogram parameters:

hist=<file>         Specify a file to write the input kmer depth histogram.

histout=<file>      Specify a file to write the output kmer depth histogram.

histcol=3           (histogramcolumns) Number of histogram columns, 2 or 3.

pzc=f               (printzerocoverage) Print lines in the histogram with zero coverage.

histlen=1048576     Max kmer depth displayed in histogram.  Also affects statistics displayed, but does not affect normalization.

 

Peak calling parameters:

peaks=<file>        Write the peaks to this file.  Default is stdout.

minHeight=2         (h) Ignore peaks shorter than this.

minVolume=2         (v) Ignore peaks with less area than this.

minWidth=2          (w) Ignore peaks narrower than this.

minPeak=2           (minp) Ignore peaks with an X-value below this.

maxPeak=BIG         (maxp) Ignore peaks with an X-value above this.

maxPeakCount=8      (maxpc) Print up to this many peaks (prioritizing height).

 

Java Parameters:

-Xmx                This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                    -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

-eoom               This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.

-da                 Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
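
As a concrete illustration of the parameters above, a typical normalization run (filenames are placeholders), followed by the kmer-depth-to-read-depth conversion given in the 'target' description:

```shell
# Normalize to ~100x kmer depth, ignoring kmers below depth 5,
# and write the input kmer depth histogram (filenames are placeholders)
bbnorm.sh in=reads.fq out=normalized.fq target=100 mindepth=5 hist=khist.txt

# Read depth implied by a kmer depth target: Dr = Dk * (R / (R - K + 1))
# e.g. Dk=100, R=150, K=31  ->  100 * 150/120 = 125x read depth
awk 'BEGIN{Dk=100; R=150; K=31; printf "%.1f\n", Dk*(R/(R-K+1))}'
```

So with default k=31 and 150 bp reads, target=100 corresponds to roughly 125x read depth, which is worth keeping in mind when choosing a target.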

 

> mapPacBio.sh 

$ mapPacBio.sh

 

BBMap

Written by Brian Bushnell, from Dec. 2010 - present

Last modified December 20, 2017

 

Description:  Fast and accurate splice-aware read aligner.

Please read bbmap/docs/guides/BBMapGuide.txt for more information.

 

To index:     bbmap.sh ref=<reference fasta>

To map:       bbmap.sh in=<reads> out=<output sam>

To map without writing an index:

    bbmap.sh ref=<reference fasta> in=<reads> out=<output sam> nodisk

 

in=stdin will accept reads from standard in, and out=stdout will write to 

standard out, but file extensions are still needed to specify the format of the 

input and output files e.g. in=stdin.fa.gz will read gzipped fasta from 

standard in; out=stdout.sam.gz will write gzipped sam.

 

Indexing Parameters (required when building the index):

nodisk=f                Set to true to build index in memory and write nothing 

                        to disk except output.

ref=<file>              Specify the reference sequence.  Only do this ONCE, 

                        when building the index (unless using 'nodisk').

build=1                 If multiple references are indexed in the same directory,

                        each needs a unique numeric ID (unless using 'nodisk').

k=13                    Kmer length, range 8-15.  Longer is faster but uses 

                        more memory.  Shorter is more sensitive.

                        If indexing and mapping are done in two steps, K should

                        be specified each time.

path=<.>                Specify the location to write the index, if you don't 

                        want it in the current working directory.

usemodulo=f             Throw away ~80% of kmers based on remainder modulo a 

                        number (reduces RAM by 50% and sensitivity slightly).

                        Should be enabled both when building the index AND 

                        when mapping.

rebuild=f               Force a rebuild of the index (ref= should be set).

 

Input Parameters:

build=1                 Designate index to use.  Corresponds to the number 

                        specified when building the index.

in=<file>               Primary reads input; required parameter.

in2=<file>              For paired reads in two files.

interleaved=auto        True forces paired/interleaved input; false forces 

                        single-ended mapping. If not specified, interleaved 

                        status will be autodetected from read names.

fastareadlen=500        Break up FASTA reads longer than this.  Max is 500 for

                        BBMap and 6000 for BBMapPacBio.  Only works for FASTA

                        input (use 'maxlen' for FASTQ input).  The default for

                        bbmap.sh is 500, and for mapPacBio.sh is 6000.

unpigz=f                Spawn a pigz (parallel gzip) process for faster 

                        decompression than using Java.  

                        Requires pigz to be installed.

touppercase=t           (tuc) Convert lowercase letters in reads to upper case 

                        (otherwise they will not match the reference).

 

Sampling Parameters:

 

reads=-1                Set to a positive number N to only process the first N

                        reads (or pairs), then quit.  -1 means use all reads.

samplerate=1            Set to a number from 0 to 1 to randomly select that

                        fraction of reads for mapping. 1 uses all reads.

skipreads=0             Set to a number N to skip the first N reads (or pairs), 

                        then map the rest.

 

Mapping Parameters:

fast=f                  This flag is a macro which sets other parameters to run 

                        faster, at reduced sensitivity.  Bad for RNA-seq.

slow=f                  This flag is a macro which sets other parameters to run 

                        slower, at greater sensitivity.  'vslow' is even slower.

maxindel=16000          Don't look for indels longer than this. Lower is faster.

                        Set to >=100k for RNAseq with long introns like mammals.

strictmaxindel=f        When enabled, do not allow indels longer than 'maxindel'.

                        By default these are not sought, but may be found anyway.

tipsearch=100           Look this far for read-end deletions with anchors

                        shorter than K, using brute force.

minid=0.76              Approximate minimum alignment identity to look for. 

                        Higher is faster and less sensitive.

minhits=1               Minimum number of seed hits required for candidate sites.

                        Higher is faster.

local=f                 Set to true to use local, rather than global, alignments.

                        This will soft-clip ugly ends of poor alignments.

perfectmode=f           Allow only perfect mappings when set to true (very fast).

semiperfectmode=f       Allow only perfect and semiperfect (perfect except for 

                        N's in the reference) mappings.

threads=auto            (t) Set to number of threads desired.  By default, uses 

                        all cores available.

ambiguous=best          (ambig) Set behavior on ambiguously-mapped reads (with 

                        multiple top-scoring mapping locations).

                            best    (use the first best site)

                            toss    (consider unmapped)

                            random  (select one top-scoring site randomly)

                            all     (retain all top-scoring sites)

samestrandpairs=f       (ssp) Specify whether paired reads should map to the

                        same strand or opposite strands.

requirecorrectstrand=t  (rcs) Forbid pairing of reads without correct strand 

                        orientation.  Set to false for long-mate-pair libraries.

killbadpairs=f          (kbp) If a read pair is mapped with an inappropriate

                        insert size or orientation, the read with the lower  

                        mapping quality is marked unmapped.

pairedonly=f            (po) Treat unpaired reads as unmapped.  Thus they will 

                        be sent to 'outu' but not 'outm'.

rcomp=f                 Reverse complement both reads prior to mapping (for LMP

                        outward-facing libraries).

rcompmate=f             Reverse complement read2 prior to mapping.

pairlen=32000           Set max allowed distance between paired reads.  

                        (insert size)=(pairlen)+(read1 length)+(read2 length)

rescuedist=1200         Don't try to rescue paired reads if avg. insert size

                        greater than this.  Lower is faster.

rescuemismatches=32     Maximum mismatches allowed in a rescued read.  Lower

                        is faster.

averagepairdist=100     (apd) Initial average distance between paired reads.

                        Varies dynamically; does not need to be specified.

deterministic=f         Run in deterministic mode.  In this case it is good

                        to set averagepairdist.  BBMap is deterministic

                        without this flag if using single-ended reads,

                        or run singlethreaded.

bandwidthratio=0        (bwr) If above zero, restrict alignment band to this 

                        fraction of read length.  Faster but less accurate.

bandwidth=0             (bw) Set the bandwidth directly.

usejni=f                (jni) Do alignments faster, in C code.  Requires 

                        compiling the C code; details are in /jni/README.txt.

maxsites2=800           Don't analyze (or print) more than this many alignments 

                        per read.

ignorefrequentkmers=t   (ifk) Discard low-information kmers that occur often.

excludefraction=0.03    (ef) Fraction of kmers to ignore.  For example, 0.03

                        will ignore the most common 3% of kmers.

greedy=t                Use a greedy algorithm to discard the least-useful

                        kmers on a per-read basis.

kfilter=0               If positive, potential mapping sites must have at

                        least this many consecutive exact matches.

 

 

Quality and Trimming Parameters:

qin=auto                Set to 33 or 64 to specify input quality value ASCII

                        offset. 33 is Sanger, 64 is old Solexa.

qout=auto               Set to 33 or 64 to specify output quality value ASCII 

                        offset (only if output format is fastq).

qtrim=f                 Quality-trim ends before mapping.  Options are: 

                        'f' (false), 'l' (left), 'r' (right), and 'lr' (both).

untrim=f                Undo trimming after mapping.  Untrimmed bases will be 

                        soft-clipped in cigar strings.

trimq=6                 Trim regions with average quality below this 

                        (phred algorithm).

mintrimlength=60        (mintl) Don't trim reads to be shorter than this.

fakefastaquality=-1     (ffq) Set to a positive number 1-50 to generate fake

                        quality strings for fasta input reads.

ignorebadquality=f      (ibq) Keep going, rather than crashing, if a read has 

                        out-of-range quality values.

usequality=t            Use quality scores when determining which read kmers

                        to use as seeds.

minaveragequality=0     (maq) Do not map reads with average quality below this.

maqb=0                  If positive, calculate maq from this many initial bases.

 

Output Parameters:

 

out=<file>              Write all reads to this file.

outu=<file>             Write only unmapped reads to this file.  Does not 

                        include unmapped paired reads with a mapped mate.

outm=<file>             Write only mapped reads to this file.  Includes 

                        unmapped paired reads with a mapped mate.

mappedonly=f            If true, treats 'out' like 'outm'.

bamscript=<file>        (bs) Write a shell script to <file> that will turn 

                        the sam output into a sorted, indexed bam file.

ordered=f               Set to true to output reads in same order as input.  

                        Slower and uses more memory.

overwrite=f             (ow) Allow process to overwrite existing files.

secondary=f             Print secondary alignments.

sssr=0.95               (secondarysitescoreratio) Print only secondary alignments

                        with score of at least this fraction of primary.

ssao=f                  (secondarysiteasambiguousonly) Only print secondary 

                        alignments for ambiguously-mapped reads.

maxsites=5              Maximum number of total alignments to print per read.

                        Only relevant when secondary=t.

quickmatch=f            Generate cigar strings more quickly.

trimreaddescriptions=f  (trd) Truncate read and ref names at the first whitespace,

                        assuming that the remainder is a comment or description.

ziplevel=2              (zl) Compression level for zip or gzip output.

pigz=f                  Spawn a pigz (parallel gzip) process for faster 

                        compression than Java.  Requires pigz to be installed.

machineout=f            Set to true to output statistics in machine-friendly 

                        'key=value' format.

printunmappedcount=f    Print the total number of unmapped reads and bases.

                        If input is paired, the number will be of pairs

                        for which both reads are unmapped.

showprogress=0          If positive, print a '.' every X reads.

showprogress2=0         If positive, print the number of seconds since the

                        last progress update (instead of a '.').

renamebyinsert=f        Renames reads based on their mapped insert size.

 

Post-Filtering Parameters:

idfilter=0              Independent of minid; sets exact minimum identity 

                        allowed for alignments to be printed.  Range 0 to 1.

subfilter=-1            Ban alignments with more than this many substitutions.

insfilter=-1            Ban alignments with more than this many insertions.

delfilter=-1            Ban alignments with more than this many deletions.

indelfilter=-1          Ban alignments with more than this many indels.

editfilter=-1           Ban alignments with more than this many edits.

inslenfilter=-1         Ban alignments with an insertion longer than this.

dellenfilter=-1         Ban alignments with a deletion longer than this.

nfilter=-1              Ban alignments with more than this many ns.  This 

                        includes nocall, noref, and off scaffold ends.

 

Sam flags and settings:

noheader=f              Disable generation of header lines.

sam=1.4                 Set to 1.4 to write Sam version 1.4 cigar strings, 

                        with = and X, or 1.3 to use M.

saa=t                   (secondaryalignmentasterisks) Use asterisks instead of

                        bases for sam secondary alignments.

cigar=t                 Set to 'f' to skip generation of cigar strings (faster).

keepnames=f             Keep original names of paired reads, rather than 

                        ensuring both reads have the same name.

intronlen=999999999     Set to a lower number like 10 to change 'D' to 'N' in 

                        cigar strings for deletions of at least that length.

rgid=                   Set readgroup ID.  All other readgroup fields 

                        can be set similarly, with the flag rgXX=

mdtag=f                 Write MD tags.

nhtag=f                 Write NH tags.

xmtag=f                 Write XM tags (may only work correctly with ambig=all).

amtag=f                 Write AM tags.

nmtag=f                 Write NM tags.

xstag=f                 Set to 'xs=fs', 'xs=ss', or 'xs=us' to write XS tags 

                        for RNAseq using firststrand, secondstrand, or 

                        unstranded libraries.  Needed by Cufflinks.  

                        JGI mainly uses 'firststrand'.

stoptag=f               Write a tag indicating read stop location, prefixed by YS:i:

lengthtag=f             Write a tag indicating (query,ref) alignment lengths, 

                        prefixed by YL:Z:

idtag=f                 Write a tag indicating percent identity, prefixed by YI:f:

inserttag=f             Write a tag indicating insert size, prefixed by X8:Z:

scoretag=f              Write a tag indicating BBMap's raw score, prefixed by YR:i:

timetag=f               Write a tag indicating this read's mapping time, prefixed by X0:i:

boundstag=f             Write a tag indicating whether either read in the pair

                        goes off the end of the reference, prefixed by XB:Z:

notags=f                Turn off all optional tags.

 

Histogram and statistics output parameters:

scafstats=<file>        Statistics on how many reads mapped to which scaffold.

refstats=<file>         Statistics on how many reads mapped to which reference

                        file; only for BBSplit.

sortscafs=t             Sort scaffolds or references by read count.

bhist=<file>            Base composition histogram by position.

qhist=<file>            Quality histogram by position.

aqhist=<file>           Histogram of average read quality.

bqhist=<file>           Quality histogram designed for box plots.

lhist=<file>            Read length histogram.

ihist=<file>            Write histogram of insert sizes (for paired reads).

ehist=<file>            Errors-per-read histogram.

qahist=<file>           Quality accuracy histogram of error rates versus 

                        quality score.

indelhist=<file>        Indel length histogram.

mhist=<file>            Histogram of match, sub, del, and ins rates by 

                        read location.

gchist=<file>           Read GC content histogram.

gcbins=100              Number of gchist bins.  Set to 'auto' to use read length.

gcpairs=t               Use average GC of paired reads.

idhist=<file>           Histogram of read count versus percent identity.

idbins=100              Number of idhist bins.  Set to 'auto' to use read length.

statsfile=stderr        Mapping statistics are printed here.

 

Coverage output parameters (these may reduce speed and use more RAM):

covstats=<file>         Per-scaffold coverage info.

rpkm=<file>             Per-scaffold RPKM/FPKM counts.

covhist=<file>          Histogram of # occurrences of each depth level.

basecov=<file>          Coverage per base location.

bincov=<file>           Print binned coverage per location (one line per X bases).

covbinsize=1000         Set the binsize for binned coverage output.

nzo=t                   Only print scaffolds with nonzero coverage.

twocolumn=f             Change to true to print only ID and Avg_fold instead of 

                        all 6 columns to the 'out=' file.

32bit=f                 Set to true if you need per-base coverage over 64k.

strandedcov=f           Track coverage for plus and minus strand independently.

startcov=f              Only track start positions of reads.

secondarycov=t          Include coverage of secondary alignments.

physcov=f               Calculate physical coverage for paired reads.

                        This includes the unsequenced bases.

delcoverage=t           (delcov) Count bases covered by deletions as covered.

                        True is faster than false.

covk=0                  If positive, calculate kmer coverage statistics.

 

Java Parameters:

-Xmx                    This will be passed to Java to set memory usage, 

                        overriding the program's automatic memory detection.

                        -Xmx20g will specify 20 gigs of RAM, and -Xmx800m 

                        will specify 800 megs.  The max is typically 85% of 

                        physical memory.  The human genome requires around 24g,

                        or 12g with the 'usemodulo' flag.  The index uses 

                        roughly 6 bytes per reference base.

-eoom                   This flag will cause the process to exit if an 

                        out-of-memory exception occurs.  Requires Java 8u92+.

-da                     Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter 

any problems, or post at: http://seqanswers.com/forums/showthread.php?t=41057
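
mapPacBio.sh takes the same flags as bbmap.sh but is tuned for long, high-error reads (note the larger fastareadlen default of 6000). A minimal sketch (filenames are placeholders):

```shell
# Map PacBio reads without writing an index to disk
# (ref.fa / pacbio.fq / mapped.sam are placeholders)
mapPacBio.sh ref=ref.fa in=pacbio.fq out=mapped.sam nodisk
```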

 

> stats.sh

$ stats.sh

 

Written by Brian Bushnell

Last modified December 7, 2017

 

Description:  Generates basic assembly statistics such as scaffold count, 

N50, L50, GC content, gap percent, etc.  For multiple files, please use

statswrapper.sh.  Works with fasta and fastq only (gzipped is fine).

Please read bbmap/docs/guides/StatsGuide.txt for more information.

 

Usage:        stats.sh in=<file>

 

Parameters:

in=file         Specify the input fasta file, or stdin.

gc=file         Writes ACGTN content per scaffold to a file.

gchist=file     Filename to output scaffold gc content histogram.

shist=file      Filename to output cumulative scaffold length histogram.

gcbins=200      Number of bins for gc histogram.

n=10            Number of contiguous Ns to signify a break between contigs.

k=13            Estimate memory usage of BBMap with this kmer length.

minscaf=0       Ignore scaffolds shorter than this.

phs=f           (printheaderstats) Set to true to print total size of headers.

n90=t           (printn90) Print the N/L90 metrics.

extended=f      Print additional metrics such as N/L90 and log sum.

logoffset=1000  Minimum length for calculating log sum.

logbase=2       Log base for calculating log sum.

pdl=f           (printduplicatelines) Set to true to print lines in the 

                scaffold size table where the counts did not change.

n_=t            This flag will prefix the terms 'contigs' and 'scaffolds'

                with 'n_' in formats 3-6.

addname=f       Adds a column for input file name, for formats 3-6.

 

format=<0-7>    Format of the stats information; default 1.

format=0 prints no assembly stats.

format=1 uses variable units like MB and KB, and is designed for compatibility with existing tools.

format=2 uses only whole numbers of bases, with no commas in numbers, and is designed for machine parsing.

format=3 outputs stats in 2 rows of tab-delimited columns: a header row and a data row.

format=4 is like 3 but with scaffold data only.

format=5 is like 3 but with contig data only.

format=6 is like 3 but the header starts with a #.

format=7 is like 1 but only prints contig info.

 

gcformat=<0-4>  Select GC output format; default 1.

gcformat=0: (no base content info printed)

gcformat=1: name length A C G T N GC

gcformat=2: name GC

gcformat=4: name length GC

Note that in gcformat 1, A+C+G+T=1 even when N is nonzero.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
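
Two example invocations of the parameters above (assembly filename is a placeholder):

```shell
# Machine-parsable assembly stats (whole numbers, no commas)
stats.sh in=assembly.fa format=2

# Per-scaffold GC content written to a separate file: name, length, GC
stats.sh in=assembly.fa gc=gc.txt gcformat=4
```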

 

> bbmerge.sh 

$ bbmerge.sh

 

Written by Brian Bushnell and Jonathan Rood

Last modified March 13, 2017

 

Description:  Merges paired reads into single reads by overlap detection.

With sufficient coverage, can also merge nonoverlapping reads by kmer extension.

Kmer mode requires much more memory, and should be used with the bbmerge-auto.sh script.

Please read bbmap/docs/guides/BBMergeGuide.txt for more information.

 

Usage for interleaved files: bbmerge.sh in=<reads> out=<merged reads> outu=<unmerged reads>

Usage for paired files:     bbmerge.sh in1=<read1> in2=<read2> out=<merged reads> outu1=<unmerged1> outu2=<unmerged2>

 

Input may be stdin or a file, fasta or fastq, raw or gzipped.

 

Input parameters:

in=null              Primary input. 'in2' will specify a second file.

interleaved=auto     May be set to true or false to override autodetection of

                     whether the input file is interleaved.

reads=-1             Quit after this many read pairs (-1 means all).

 

Output parameters:

out=<file>           File for merged reads. 'out2' will specify a second file.

outu=<file>          File for unmerged reads. 'outu2' will specify a second file.

outinsert=<file>     (outi) File to write read names and insert sizes.

outadapter=<file>    (outa) File to write consensus adapter sequences.

outc=<file>          File to write input read kmer cardinality estimate.

ihist=<file>         (hist) Insert length histogram output file.

nzo=t                Only print histogram bins with nonzero values.

showhiststats=t      Print extra header lines with statistical information.

ziplevel=2           Set to 1 (lowest) through 9 (max) to change compression

                     level; lower compression is faster.

ordered=f            Output reads in same order as input.

mix=f                Output both the merged (or mergable) and unmerged reads

                     in the same file (out=).  Useful for ecco mode.

 

Trimming/Filtering parameters:

qtrim=f              Trim read ends to remove bases with quality below minq.

                     Trims BEFORE merging.

                     Values: t (trim both ends), 

                             f (neither end), 

                             r (right end only), 

                             l (left end only).

qtrim2=f             May be specified instead of qtrim to perform trimming 

                     only if merging is unsuccessful, then retry merging.

trimq=10             Trim quality threshold.  This may be a comma-delimited

                     list (ascending) to try multiple values.

minlength=1          (ml) Reads shorter than this after trimming, but before

                     merging, will be discarded. Pairs will be discarded only

                     if both are shorter.

maxlength=-1         Reads with longer insert sizes will be discarded.

tbo=f                (trimbyoverlap) Trim overlapping reads to remove 

                     rightmost (3') non-overlapping portion, instead of joining.

minavgquality=0      (maq) Reads with average quality below this, after 

                     trimming, will not be attempted to be merged.

maxexpectederrors=0  (mee) If positive, reads with more combined expected 

                     errors than this will not be attempted to be merged.

forcetrimleft=0      (ftl) If nonzero, trim left bases of the read to 

                     this position (exclusive, 0-based).

forcetrimright=0     (ftr) If nonzero, trim right bases of the read 

                     after this position (exclusive, 0-based).

forcetrimright2=0    (ftr2) If positive, trim this many bases on the right end.

forcetrimmod=5       (ftm) If positive, trim length to be equal to 

                     zero modulo this number.

ooi=f                Output only incorrectly merged reads, for testing.

trimpolya=t          Trim trailing poly-A tail from adapter output.  Only 

                     affects outadapter.  This also trims poly-A followed

                     by poly-G, which occurs on NextSeq.

 

Processing Parameters:

usejni=f             (jni) Do overlapping in C code, which is faster.  Requires

                     compiling the C code; details are in /jni/README.txt.

merge=t              Create merged reads.  If set to false, you can still 

                     generate an insert histogram.

ecco=f               Error-correct the overlapping part, but don't merge.

trimnonoverlapping=f (tno) Trim all non-overlapping portions, leaving only

                     consensus sequence.  By default, only sequence to the 

                     right of the overlap (adapter sequence) is trimmed.

useoverlap=t         Attempt to find the insert size using read overlap.

mininsert=35         Minimum insert size to merge reads.

mininsert0=35        Insert sizes less than this will not be considered.

                     Must be less than or equal to mininsert.

minoverlap=12        Minimum number of overlapping bases to allow merging.

minoverlap0=8        Overlaps shorter than this will not be considered.

                     Must be less than or equal to minoverlap.

minq=9               Ignore bases with quality below this.

maxq=41              Cap output quality scores at this.

entropy=t            Increase the minimum overlap requirement for low-

                     complexity reads.

efilter=6            Ban overlaps with over this many times the expected 

                     number of errors.  Lower is more strict. -1 disables.

pfilter=0.00004      Ban improbable overlaps.  Higher is more strict. 0 will

                     disable the filter; 1 will allow only perfect overlaps.

kfilter=0            Ban overlaps that create kmers with count below

                     this value (0 disables).  If this is used minprob should

                     probably be set to 0.  Requires good coverage.

ouq=f                Calculate best overlap using quality values.

owq=t                Calculate best overlap without using quality values.

usequality=t         If disabled, quality values are completely ignored,

                     both for overlap detection and filtering.  May be useful

                     for data with inaccurate quality values.

iupacton=f           (itn) Change ambiguous IUPAC symbols to N.

adapter=             Specify the adapter sequences used for these reads, if

                     known; this can be a fasta file or a literal sequence.

                     Read 1 and 2 can have adapters specified independently

                     with the adapter1 and adapter2 flags.  adapter=default

                     will use a list of common adapter sequences.

 

Ratio Mode: 

ratiomode=t          Score overlaps based on the ratio of matching to 

                     mismatching bases.

maxratio=0.09        Max error rate; higher increases merge rate.

ratiomargin=5.5      Lower increases merge rate; min is 1.

ratiooffset=0.55     Lower increases merge rate; min is 0.

ratiominoverlapreduction=3  This is the difference between minoverlap in 

                     flat mode and minoverlap in ratio mode; generally, 

                     minoverlap should be lower in ratio mode.

minsecondratio=0.1   Cutoff for second-best overlap ratio.

 

Flat Mode: 

flatmode=f           Score overlaps based on the total number of mismatching

                     bases only.

margin=2             The best overlap must have at least 'margin' fewer 

                     mismatches than the second best.

mismatches=3         Do not allow more than this many mismatches.

requireratiomatch=f  (rrm) Require the answer from flat mode and ratio mode

                     to agree, reducing false positives if both are enabled.

trimonfailure=t      (tof) If detecting insert size by overlap fails,

                     the reads will be trimmed and this will be re-attempted.

 

 

*** Ratio Mode and Flat Mode may be used alone or simultaneously. ***

*** Ratio Mode is usually more accurate and is the default mode. ***

 

 

Strictness (these are mutually exclusive macros that set other parameters):

strict=f             Decrease false positive rate and merging rate.

verystrict=f         (vstrict) Greatly decrease FP and merging rate.

ultrastrict=f        (ustrict) Decrease FP and merging rate even more.

maxstrict=f          (xstrict) Maximally decrease FP and merging rate.

loose=f              Increase false positive rate and merging rate.

veryloose=f          (vloose) Greatly increase FP and merging rate.

ultraloose=f         (uloose) Increase FP and merging rate even more.

maxloose=f           (xloose) Maximally increase FP and merging rate.

fast=f               Fastest possible mode; less accurate.

 

Tadpole Parameters (for read extension and error-correction):

k=31                 Kmer length.  31 (or less) is fastest and uses the least

                     memory, but higher values may be more accurate.  

                     60 tends to work well for 150bp reads.

extend=0             Extend reads to the right this much before merging.

                     Requires sufficient (>5x) kmer coverage.

extend2=0            Extend reads this much only after a failed merge attempt,

                     or in rem/rsem mode.

iterations=1         (ei) Iteratively attempt to extend by extend2 distance

                     and merge up to this many times.

rem=f                (requireextensionmatch) Do not merge if the predicted

                     insert size differs before and after extension.

                     However, if only the extended reads overlap, then that

                     insert will be used.  Requires setting extend2.

rsem=f               (requirestrictextensionmatch) Similar to rem but stricter.

                     Reads will only merge if the predicted insert size before

                     and after extension match.  Requires setting extend2.

ecctadpole=f         (ecct) If reads fail to merge, error-correct with Tadpole

                     and try again.  This happens prior to extend2.

reassemble=t         If ecct is enabled, use Tadpole's reassemble mode for 

                     error correction.  Alternatives are pincer and tail.

removedeadends       (shave) Remove kmers leading to dead ends.

removebubbles        (rinse) Remove kmers in error bubbles.

mindepthseed=3       (mds) Minimum kmer depth to begin extension.

mindepthextend=2     (mde) Minimum kmer depth to continue extension.

branchmult1=20       Min ratio of 1st to 2nd-greatest path depth at high depth.

branchmult2=3        Min ratio of 1st to 2nd-greatest path depth at low depth.

branchlower=3        Max value of 2nd-greatest path depth to be considered low.

ibb=t                Ignore backward branches when extending.

extra=<file>         A file or comma-delimited list of files of reads to use

                     for kmer counting, but not for merging or output.

prealloc=f           Pre-allocate memory rather than dynamically growing; 

                     faster and more memory-efficient for large datasets.  

                     A float fraction (0-1) may be specified, default 1.

prefilter=0          If set to a positive integer, use a countmin sketch to 

                     ignore kmers with depth of that value or lower, to

                     reduce memory usage.

minprob=0.5          Ignore kmers with overall probability of correctness 

                     below this, to reduce memory usage.

minapproxoverlap=26  For rem mode, do not merge reads if the extended reads

                     indicate that the raw reads should have overlapped by

                     at least this much, but no overlap was found.

 

Java Parameters:

-Xmx                 This will be passed to Java to set memory usage, 

                     overriding the program's automatic memory detection.

                     For example, -Xmx400m will specify 400 MB RAM.

-eoom                This flag will cause the process to exit if an 

                     out-of-memory exception occurs.  Requires Java 8u92+.

-da                  Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

dedupe.sh

$ dedupe.sh

 

Written by Brian Bushnell and Jonathan Rood

Last modified November 20, 2017

 

Description:  Accepts one or more files containing sets of sequences (reads or scaffolds).

Removes duplicate sequences, which may be specified to be exact matches, subsequences, or sequences within some percent identity.

Can also find overlapping sequences and group them into clusters.

Please read bbmap/docs/guides/DedupeGuide.txt for more information.

 

Usage:     dedupe.sh in=<file or stdin> out=<file or stdout>

 

An example of running Dedupe for clustering short reads:

dedupe.sh in=x.fq am=f ac=f fo c pc rnc=f mcs=4 mo=100 s=1 pto cc qin=33 csf=stats.txt pattern=cluster_%.fq dot=graph.dot

 

Input may be fasta or fastq, compressed or uncompressed.

Output may be stdout or a file.  With no output parameter, data will be written to stdout.

If 'out=null', there will be no output, but statistics will still be printed.

You can also use 'dedupe <infile> <outfile>' without the 'in=' and 'out='.

 

I/O parameters:

in=<file,file>        A single file or a comma-delimited list of files.

out=<file>            Destination for all output contigs.

pattern=<file>        Clusters will be written to individual files, where the '%' symbol in the pattern is replaced by cluster number.

outd=<file>           Optional; removed duplicates will go here.

csf=<file>            (clusterstatsfile) Write a list of cluster names and sizes.

dot=<file>            (graph) Write a graph in dot format.  Requires 'fo' and 'pc' flags.

threads=auto          (t) Set number of threads to use; default is number of logical processors.

overwrite=t           (ow) Set to false to force the program to abort rather than overwrite an existing file.

showspeed=t           (ss) Set to 'f' to suppress display of processing speed.

minscaf=0             (ms) Ignore contigs/scaffolds shorter than this.

interleaved=auto      If true, forces fastq input to be paired and interleaved.

ziplevel=2            Set to 1 (lowest) through 9 (max) to change compression level; lower compression is faster.

 

Output format parameters:

storename=t           (sn) Store scaffold names (set false to save memory).

#addpairnum=f         Add .1 and .2 to numeric id of read1 and read2.

storequality=t        (sq) Store quality values for fastq assemblies (set false to save memory).

uniquenames=t         (un) Ensure all output scaffolds have unique names.  Uses more memory.

numbergraphnodes=t    (ngn) Label dot graph nodes with read numbers rather than read names.

sort=f                Sort output (otherwise it will be random).  Options:

                         length:  Sort by length

                         quality: Sort by quality

                         name:    Sort by name

                         id:      Sort by input order

ascending=f           Sort in ascending order.

ordered=f             Output sequences in input order.  Equivalent to sort=id ascending.

renameclusters=f      (rnc) Rename contigs to indicate which cluster they are in.

printlengthinedges=f  (ple) Print the length of contigs in edges.

 

Processing parameters:

absorbrc=t            (arc) Absorb reverse-complements as well as normal orientation.

absorbmatch=t         (am) Absorb exact matches of contigs.

absorbcontainment=t   (ac) Absorb full containments of contigs.

#absorboverlap=f      (ao) Absorb (merge) non-contained overlaps of contigs (TODO).

findoverlap=f         (fo) Find overlaps between contigs (containments and non-containments).  Necessary for clustering.

uniqueonly=f          (uo) If true, all copies of duplicate reads will be discarded, rather than keeping 1.

rmn=f                 (requirematchingnames) If true, both names and sequence must match.

usejni=f              (jni) Do alignments in C code, which is faster, if an edit distance is allowed.

                      This will require compiling the C code; details are in /jni/README.txt.

 

Subset parameters:

subsetcount=1         (sstc) Number of subsets used to process the data; higher uses less memory.

subset=0              (sst) Only process reads whose ((ID%subsetcount)==subset).

 

Clustering parameters:

cluster=f             (c) Group overlapping contigs into clusters.

pto=f                 (preventtransitiveoverlaps) Do not look for new edges between nodes in the same cluster.

minclustersize=1      (mcs) Do not output clusters smaller than this.

pbr=f                 (pickbestrepresentative) Only output the single highest-quality read per cluster.

 

Cluster postprocessing parameters:

processclusters=f     (pc) Run the cluster processing phase, which performs the selected operations in this category.

                      For example, pc AND cc must be enabled to perform cc.

fixmultijoins=t       (fmj) Remove redundant overlaps between the same two contigs.

removecycles=t        (rc) Remove all cycles so clusters form trees.

cc=t                  (canonicizeclusters) Flip contigs so clusters have a single orientation.

fcc=f                 (fixcanoncontradictions) Truncate graph at nodes with canonization disputes.

foc=f                 (fixoffsetcontradictions) Truncate graph at nodes with offset disputes.

mst=f                 (maxspanningtree) Remove cyclic edges, leaving only the longest edges that form a tree.

 

Overlap Detection Parameters

exact=t               (ex) Only allow exact symbol matches.  When false, an 'N' will match any symbol.

touppercase=t         (tuc) Convert input bases to upper-case; otherwise, lower-case will not match.

maxsubs=0             (s) Allow up to this many mismatches (substitutions only, no indels).  May be set higher than maxedits.

maxedits=0            (e) Allow up to this many edits (subs or indels).  Higher is slower.

minidentity=100       (mid) Absorb contained sequences with percent identity of at least this (includes indels).

minlengthpercent=0    (mlp) Smaller contig must be at least this percent of larger contig's length to be absorbed.

minoverlappercent=0   (mop) Overlap must be at least this percent of smaller contig's length to cluster and merge.

minoverlap=200        (mo) Overlap must be at least this long to cluster and merge.

depthratio=0          (dr) When non-zero, overlaps will only be formed between reads with a depth ratio of at most this.

                      Should be above 1.  Depth is determined by parsing the read names; this information can be added

                      by running KmerNormalize (khist.sh, bbnorm.sh, or ecc.sh) with the flag 'rename'

k=31                  Seed length used for finding containments and overlaps.  Anything shorter than k will not be found.

numaffixmaps=1        (nam) Number of prefixes/suffixes to index per contig. Higher is more sensitive, if edits are allowed.

hashns=f              Set to true to search for matches using kmers containing Ns.  Can lead to extreme slowdown in some cases.

#ignoreaffix1=f       (ia1) Ignore first affix (for testing).

#storesuffix=f        (ss) Store suffix as well as prefix.  Automatically set to true when doing inexact matches.

 

Other Parameters

qtrim=f               Set to qtrim=rl to trim leading and trailing Ns.

trimq=6               Quality trim level.

forcetrimleft=-1      (ftl) If positive, trim bases to the left of this position (exclusive, 0-based).

forcetrimright=-1     (ftr) If positive, trim bases to the right of this position (exclusive, 0-based).

 

Note on Proteins / Amino Acids

Dedupe supports amino acid space via the 'amino' flag.  This also changes the default kmer length to 10.

In amino acid mode, all flags related to canonicity and reverse-complementation are disabled,

and nam (numaffixmaps) is currently limited to 2 per tip.

 

Java Parameters:

-Xmx                  This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                      -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

-eoom                 This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.

-da                   Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

 

 

Run

BBMap (link) - aligning reads to a reference

PowerPoint slides

BBMap is an aligner that breaks reads into k-mers and performs alignment and extension from k-mer-sized seeds (default k=13). It handles not only short reads but also long reads; the manual lists support for Illumina, 454, Sanger, Ion Torrent, PacBio, and Nanopore reads.

・Build the index and align paired reads (index => mapping => alignment). Memory capped at 20 GB.

bbmap.sh -Xmx20g in1=reads1.fq in2=reads2.fq out=mapped.sam ref=reference.fa
  • -Xmx   This will be passed to Java to set memory usage,  overriding the program's automatic memory detection. -Xmx20g will specify 20 gigs of RAM, and -Xmx800m  will specify 800 megs.  The max is typically 85% of physical memory.  The human genome requires around 24g,  or 12g with the 'usemodulo' flag.  The index uses roughly 6 bytes per reference base.

An index directory is created and the reads are aligned to the reference. By default, all cores on the machine are used (specify explicitly with "t=").

 

bbmap.sh only handles reads up to 700 bp. For PacBio alignment, use the dedicated command, which supports reads up to 6 kb.

・Align long reads.

mapPacBio.sh in=reads.fq out=mapped.sam ref=reference.fa

#Output statistics such as mapping rate (coverage) and GC against the reference. Memory capped at 100 GB
mapPacBio.sh in=reads.fq out=mapped.sam ref=reference.fa covstats=mapping_stats.txt nodisk=true -Xmx100g
  • nodisk=f  Set to true to build index in memory and write nothing to disk except output

Reads longer than 6 kb are automatically split, renamed, and then aligned. This command is also recommended for Nanopore data.

 

・Output various statistics.

bbmap.sh in1=reads1.fq in2=reads2.fq bhist=bhist.txt qhist=qhist.txt aqhist=aqhist.txt lhist=lhist.txt ihist=ihist.txt ehist=ehist.txt qahist=qahist.txt indelhist=indelhist.txt mhist=mhist.txt gchist=gchist.txt idhist=idhist.txt scafstats=scafstats.txt
  • covstats=<file>      Per-scaffold coverage info.
  • rpkm=<file>            Per-scaffold RPKM/FPKM counts.
  • covhist=<file>         Histogram of # occurrences of each depth level.
  • basecov=<file>       Coverage per base location.
  • bincov=<file>          Print binned coverage per location (one line per X bases).
  • covbinsize=1000    Set the binsize for binned coverage.

Output like the following is printed to stdout.

  ------------------   Results   ------------------   

 

Genome:                1

Key Length:            13

Max Indel:             16000

Minimum Score Ratio:  0.56

Mapping Mode:         normal

Reads Used:           2647748 (397162200 bases)

 

Mapping:          70.725 seconds.

Reads/sec:       37437.34

kBases/sec:      5615.60

 

 

Pairing data:   pct reads num reads pct bases   num bases

 

mated pairs:      99.9427%   1323116  99.9427%    396934800

bad pairs:         0.0178%       236   0.0178%        70800

insert size avg:   601.02

insert 25th %:     566.00

insert median:     600.00

insert 75th %:     634.00

insert std dev:     52.83

insert mode:       602

 

 

Read 1 data:      pct reads num reads pct bases   num bases

 

mapped:           99.9677%   1323447  99.9677%    198517050

unambiguous:      98.0062%   1297479  98.0062%    194621850

ambiguous:         1.9615%     25968   1.9615%      3895200

low-Q discards:    0.0000%         0   0.0000%            0

 

perfect best site: 38.6619%    511835  38.6619%     76775250

semiperfect site: 38.6630%    511849  38.6630%     76777350

rescued:           1.2452%     16485

 

Match Rate:            NA        NA  99.3118%    197215279

Error Rate:       61.3256%    811612   0.6882%      1366602

Sub Rate:         60.9806%    807046   0.6527%      1296069

Del Rate:          0.5644%      7469   0.0326%        64831

Ins Rate:          0.3730%      4937   0.0029%         5702

N Rate:            0.0000%         0   0.0000%            0

 

 

Read 2 data:      pct reads num reads pct bases   num bases

 

mapped:           99.9702%   1323479  99.9702%    198521850

unambiguous:      97.9981%   1297371  97.9981%    194605650

ambiguous:         1.9721%     26108   1.9721%      3916200

low-Q discards:    0.0000%         0   0.0000%            0

 

perfect best site: 38.6839%    512126  38.6839%     76818900

semiperfect site: 38.6856%    512148  38.6856%     76822200

rescued:           1.1880%     15728

 

Match Rate:            NA        NA  99.3145%    197221255

Error Rate:       61.3046%    811353   0.6855%      1361264

Sub Rate:         60.9506%    806669   0.6523%      1295263

Del Rate:          0.5679%      7516   0.0306%        60669

Ins Rate:          0.3843%      5086   0.0027%         5332

N Rate:            0.0000%         0   0.0000%            0

 

Total time:     74.777 seconds.

In addition, the requested analysis files (histograms) are written.


 

・Output basic per-contig statistics for a genome assembly, such as GC content, average read depth, and mapping coverage.

bbmap.sh ref=scaffolds.ref.fa nodisk in1=R1.fq in2=R2.fq covstats=mapping.stats threads=20

mapping.stats 

 

 

Note 1

Index construction is said to use 6 bytes of memory per reference base. The human genome requires about 24 GB (12 GB with the usemodulo flag).
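The 6-bytes-per-base rule of thumb can be turned into a quick estimate with awk (a sketch; the genome size below is an assumed round number, and real usage adds JVM and hashing overhead on top):

```shell
# Rough BBMap index memory estimate: ~6 bytes per reference base.
# 3.1 Gb is an assumed approximation of the human genome size.
bases=3100000000
awk -v b="$bases" 'BEGIN{printf "%.1f GB\n", b*6/1e9}'
# prints "18.6 GB" -- before overhead, consistent with the ~24 GB quoted above
```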

・Memory usage at the default k=13 can be checked with the following command.

stats.sh in=reference.fa k=13

user$ stats.sh in=../testgenome.fasta k=13

A C G T N IUPAC Other GC GC_stdev

0.2629 0.2364 0.2372 0.2635 0.0000 0.0000 0.0000 0.4736 0.0241

 

Main genome scaffold total:         8

Main genome contig total:           8

Main genome scaffold sequence total: 3.957 MB

Main genome contig sequence total:  3.957 MB  0.000% gap

Main genome scaffold N/L50:         1/3.573 MB

Main genome contig N/L50:           1/3.573 MB

Main genome scaffold N/L90:         1/3.573 MB

Main genome contig N/L90:           1/3.573 MB

Max scaffold length:                3.573 MB

Max contig length:                  3.573 MB

Number of scaffolds > 50 KB:        4

% main genome in scaffolds > 50 KB: 98.63%

 

 

Minimum Number        Number        Total         Total         Scaffold

Scaffold of            of            Scaffold      Contig        Contig  

Length  Scaffolds     Contigs       Length        Length        Coverage

-------- -------------- -------------- -------------- -------------- --------

    All              8             8     3,956,956     3,956,956 100.00%

   1 KB              8             8     3,956,956     3,956,956 100.00%

 2.5 KB              6             6     3,952,233     3,952,233 100.00%

   5 KB              6             6     3,952,233     3,952,233 100.00%

  10 KB              5             5     3,947,019     3,947,019 100.00%

  25 KB              5             5     3,947,019     3,947,019 100.00%

  50 KB              4             4     3,902,676     3,902,676 100.00%

 100 KB              4             4     3,902,676     3,902,676 100.00%

 250 KB              1             1     3,573,470     3,573,470 100.00%

 500 KB              1             1     3,573,470     3,573,470 100.00%

   1 MB              1             1     3,573,470     3,573,470 100.00%

 2.5 MB              1             1     3,573,470     3,573,470 100.00%

 

BBMap minimum memory estimate at k=13:     -Xmx890m (at least 1000 MB physical RAM)

For the bacterial genome used here, the estimate came to 1 GB of physical RAM.

 

Note 2

BBMap supports split alignment across indel sites. This is governed by the maxindel flag, which defaults to maxindel=16000. For RNA-seq where large introns are expected, a larger value such as maxindel=100000 is recommended.
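As a hedged example of the recommendation above (file names are placeholders; maxindel is the flag discussed in the text), an RNA-seq run might look like:

```shell
# RNA-seq mapping with a larger indel allowance so that reads
# spanning large introns can be split-aligned.
# in1/in2/ref/out file names are placeholders.
bbmap.sh in1=rna_1.fq in2=rna_2.fq out=mapped.sam ref=genome.fa maxindel=100000
```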

 

 

 

 

BBMerge (link) - merging paired reads.

BBMerge merges overlapping paired-end reads into single, longer reads.

bbmerge.sh in1=reads1.fq in2=reads2.fq out=merged.fq outu1=unmerged1.fq outu2=unmerged2.fq ihist=ihist.txt
  • in=null Primary input. 'in2' will specify a second file.
  • out=<file>  File for merged reads. 'out2' will specify a second file.
  • outu=<file> File for unmerged reads. 'outu2' will specify a second file.
  • ihist=<file> (hist) Insert length histogram output file.

The insert-size histogram is written to ihist.txt. For strict merging, add pfilter=1 ("pfilter=1" is the strictest possible setting, which bans merges in which there are any mismatches in the overlap region), or add verystrict (greatly decrease FP and merging rate). The author suggests verystrict for data that will be assembled, and loose (increase false positive rate and merging rate) if you simply want to merge as much as possible.
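Assuming ihist.txt is tab-delimited with '#'-prefixed header lines followed by insert-size/count rows (the layout BBMerge normally writes; verify on your own output), the mean insert size can be recomputed from it:

```shell
# Weighted mean insert size from an insert-size histogram:
# column 1 = insert size, column 2 = read-pair count,
# lines starting with '#' are headers and are skipped.
awk -F'\t' '!/^#/{s += $1*$2; n += $2} END{if (n) printf "%.1f\n", s/n}' ihist.txt
```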

 

 

・Error-correct overlapping paired-end reads

bbmerge.sh in=interlace.fq out=corrected.fq ecco mix
  • mix=f Output both the merged (or mergable) and unmerged reads in the same file (out=). Useful for ecco mode.
  • ecco=f Error-correct the overlapping part, but don't merge.

Here the input paired reads are interleaved. Running this command outputs a fastq in which, within the overlap region, positions where the bases agree get boosted quality scores, while positions where they disagree have the base call switched to the higher-quality base. A merged fastq itself is not output. Positions with equal quality but disagreeing bases become N.

 

 

・Use the k-mer graph to extend unmerged reads and retry merging.

bbmerge-auto.sh in1=reads1.fq in2=reads2.fq out=merged.fq outu=unmerged.fq ihist=ihist.txt ecct extend2=20 iterations=5
  • extend2=0 Extend reads this much only after a failed merge attempt, or in rem/rsem mode.
  • iterations=1 (ei) Iteratively attempt to extend by extend2 distance and merge up to this many times.

With the command above, merging is attempted first; if it fails, the reads are error-corrected and merging is retried; if it still fails, 20 bp from the k-mer graph are appended and merging is attempted again, repeating the 20-bp extension up to 5 times. Attempts where the k-mer graph hits a dead end or a branch are aborted. Testing with 300-bp x 2 paired reads produced single reads of up to 794 bp.

 

 

・Find adapter sequences from paired-end reads whose insert size is shorter than the read length.

bbmerge.sh in1=reads1.fq in2=reads2.fq outa=adapters.fa

Suppose the adapter found is GATCGGAAGAGCACACGTCTGAACTCCAGTC. Remove this adapter sequence with the following command.

bbmerge.sh in1=reads1.fq in2=reads2.fq out=merged.fq adapter1=GATCGGAAGAGCACACGTCTGAACTCCAGTC

 

If the reads have sufficient depth and adapters have been removed, the merge command works correctly. Longer merged reads mean longer k-mers can be chosen for assembly. On the other hand, the graph may be joined incorrectly, which produces artifact sequences (especially around rare indel variants).

 

 

BBnorm (link) - coverage normalization

A tool that normalizes deeply covered regions to slim the data down and make analysis easier. It is useful for data with highly variable coverage, such as single-cell and viral sequencing or metagenomes. It can also serve as digital normalization to reduce data volume (link, around the middle of the page), for example when there are more reads than the available compute resources can handle.

・Normalize to k-mer coverage 100

bbnorm.sh target=100 min=5 in1=reads1.fq in2=reads2.fq \
out=normalized_1.fq out2=normalized_2.fq

#single-end, k=31, mindepth=2
bbnorm.sh target=100 min=2 k=31 in1=reads.fq.gz out=normalized.fq.gz
  • in=<file> (in1) Input file, or comma-delimited list of files.
  • in2=<file> Optional second file for paired reads.
  • k=31 Kmer length (values under 32 are most efficient, but arbitrarily high values are supported)
  • target=100 (tgt) Target normalization depth. NOTE: All depth parameters control kmer depth, not read depth.
  • mindepth=5 (min) Kmers with depth below this number will not be included when calculating the depth of a read.
  • maxdepth=-1 (max) Reads will not be downsampled when below this depth, even if they are above the target depth.
  • minkmers=15 (mgkpr) Reads must have at least this many kmers over min depth to be retained. Aka 'mingoodkmersperread'.
  • out=<file> File for normalized or corrected reads. Use out2 for paired reads in a second file

With the command above, reads with k-mer depth of 5 or below (sequencing errors or rare variants) are discarded, and the read count is adjusted so that k-mer coverage becomes 100.
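Conceptually, normalization to a target depth keeps each read with probability target/depth; the sketch below (a simplification of BBNorm's actual two-pass algorithm, shown only to illustrate the scaling) prints the keep probability at a few depths:

```shell
# Simplified normalization math: a read at k-mer depth d is kept
# with probability min(1, target/d).  BBNorm itself is 2-pass and
# more sophisticated; this only illustrates the downsampling ratio.
awk -v target=100 'BEGIN{
  for (d = 50; d <= 800; d *= 2)
    printf "depth %4d -> keep prob %.2f\n", d, (d < target ? 1 : target/d)
}'
```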

Addendum: this also works with long reads if sequencing quality is high. If applied to high-error reads, many reads will have no matching k-mers and are likely to be lost. If you still want to use it, either error-correct first (subtle sequence differences may be lost), use a slightly smaller k, or remove the low-end cutoff with min=0 or the like.

 

 

・Error correction

ecc.sh in1=reads1.fq in2=reads2.fq out=corrected.fq ecc=t keepall passes=1 bits=16 prefilter
  • ecc=f Set to true to correct errors.
  • keepall=f Set to true to keep all reads (e.g. if you just want error correction).
  • passes=2 (p) 1 pass is the basic mode. 2 passes (default) allows greater accuracy, error detection, better contol of output depth.
  • bits=32 Bits per cell in bloom filter; must be 2, 4, 8, 16, or 32. Maximum kmer depth recorded is 2^cbits. Automatically reduced to 16 in 2-pass.
  • prefilter=f True is slower, but generally more accurate; filters out low-depth kmers from the main hashtable. The prefilter is more memory-efficient because it uses 2-bit cells.

The error-corrected fastq is written as corrected.fq. No trimming or removal of reads with rare k-mers is performed at all.

 

 

・Generate a k-mer histogram

khist.sh in=reads.fq khist=khist.txt peaks=peaks.txt

Besides the histogram data, a summary file is written that lists, among other things, the peaks of the k-mer frequency distribution.

user$ cat peaks.txt 

#k 31

#unique_kmers 32597468

#main_peak 14

#genome_size 6316924

#haploid_genome_size 6316924

#fold_coverage 14

#haploid_fold_coverage 14

#ploidy 1

#percent_repeat 36.469

#start center stop max volume

6 14 35 338679 4013227

41 51 121 3023 69382

823 1556 1569 10 1546

1569 1614 1624 21 605

1624 1637 1667 12 455

1667 1683 1705 9 340

1705 1730 1748 13 388

1748 1768 2257 3 1902

3198 4299 4410 4 3095

4410 4501 4523 2 288

4523 4542 5077 2 1112

 

Note that BBNorm must not be used for expression analysis (RNA-seq, ChIP-seq) or rare-variant discovery: doing so can introduce severe bias. If you need to reduce such a dataset, use a random-subsampling tool instead of normalization.
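For uniform subsampling, seqtk sample with a fixed seed is the usual choice. As a dependency-free sketch of the idea, the awk below keeps every other 4-line fastq record, shown on an inline three-read fastq; crude deterministic thinning, but it does not distort depth the way normalization does:

```shell
# Keep every 2nd fastq record (4 lines per read); for real work prefer
# seeded random sampling, e.g. `seqtk sample -s100 reads.fq 0.5`.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nIIII\n@r3\nGGGG\n+\nIIII\n' \
| awk 'int((NR-1)/4) % 2 == 0'
```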

seqtk and seqkit have been covered on this blog as fastq random-subsampling tools.

 

 

 

Dedupeリンク)-duplicateしたcontigの除去。

k-mer-based de Bruijn graph assemblers rarely produce duplicated contigs, but overlap-layout assemblers sometimes do; this command was designed to remove the duplicated contigs from such assemblies. It is also said to be useful for curating databases such as NCBI RefSeq, which contain large numbers of redundant sequences that produce many near-identical BLAST hits.

 


・Removing duplicate sequences.

dedupe.sh in=X.fa out=Z.fa outd=duplicates.fa
  • outd=&lt;file&gt; File to write the removed duplicate sequences to.

Specifying outd also writes the removed duplicates to a separate file.

 

・Removing duplicate sequences while allowing up to 5 mismatches and 2 edits.

dedupe.sh in=X.fa out=Y.fa s=5 e=2

With only s=5 specified, duplicate sequences are searched for allowing up to 5 base mismatches (substitutions only). With only e=2 specified, up to 2 edits of any kind are allowed (e.g. 2 substitutions, or 1 insertion plus 1 substitution).
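To make the distinction concrete: s counts substitution-only mismatches at aligned positions (a Hamming distance), while e also covers indels. A toy Hamming-distance calculation (not Dedupe's code):

```shell
# Hamming distance between two equal-length sequences: the
# substitution-only mismatches that Dedupe's s= threshold allows.
awk 'BEGIN { a="ACGTACGT"; b="ACCTACGA"; d=0; for (i=1; i<=length(a); i++) if (substr(a,i,1) != substr(b,i,1)) d++; print d }'
```

Here the two 8-mers differ at two positions, so they would be absorbed with s=2 (or e=2) but not with s=1.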

 

 

 

・Removing sequences that are at least 99% identical.

dedupe.sh in=X.fa out=Y.fa minidentity=99

Indels are not tolerated here; to account for indels as well, use the e flag.

 

 

・Removing sequences identical over at least 200 bp, allowing up to 1 mismatch.

dedupe.sh in=X.fq pattern=cluster%.fq ac=f am=f s=1 mo=200 c pc csf=stats.txt outbest=best.fq fo mcs=3 cc dot=graph.dot

This is aimed less at duplicate removal than at amplicon sequencing data. Similar sequences are clustered and written out, and at the same time the sequence with the best length and quality in each cluster is written to a separate file.

 

・Clustering full-length PacBio 16S sequencing data

1. Remove reads that did not cover the full length.

reformat.sh in=reads_of_insert.fastq out=filtered.fq minlen=1420 maxlen=1640 maq=20 qin=33 
  • minlength=0 (ml) Reads shorter than this after trimming will be discarded. Pairs will be discarded only if both are shorter.
  • maxlength=0 If nonzero, reads longer than this after trimming will be discarded.
  • minavgquality=0 (maq) Reads with average quality (after trimming) below this will be discarded.
  • qin=auto ASCII offset for input quality. May be 33 (Sanger), 64 (Illumina), or auto.

Reads shorter than 1420 bp and longer than 1640 bp (artifacts) are removed, as are reads with average quality below 20.
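The length-filtering logic can be sketched with awk over 4-line fastq records (thresholds shrunk to fit the toy inline input; reformat.sh additionally handles quality, pairs, and compressed files):

```shell
# Keep only fastq records whose sequence length is within [4, 6]
# (a stand-in for reformat.sh's minlen/maxlen, on a toy 3-read file).
printf '@a\nACG\n+\nIII\n@b\nACGTA\n+\nIIIII\n@c\nACGTACGT\n+\nIIIIIIII\n' \
| awk 'NR%4==1{h=$0} NR%4==2{s=$0} NR%4==3{p=$0} NR%4==0{ if (length(s) >= 4 && length(s) <= 6) print h"\n"s"\n"p"\n"$0 }'
```

Only the middle read (5 bp) survives the window.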

 

2. Cluster sequences that are at least 99% identical

dedupe.sh in=filtered.fq csf=stats_e26.txt outbest=best_e26.fq qin=33 usejni=t am=f ac=f fo c rnc=f mcs=3 k=27 mo=1420 ow cc pto nam=4 e=26 pattern=cluster_%.fq dot=graph.dot
  • usejni=f (jni) Do alignments in C code, which is faster, if an edit distance is allowed. This will require compiling the C code; details are in /jni/README.txt.
  • absorbmatch=t (am) Absorb exact matches of contigs.
  • absorbcontainment=t (ac) Absorb full containments of contigs.
  • findoverlap=f (fo) Find overlaps between contigs (containments and non-containments). Necessary for clustering.
  • renameclusters=f (rnc) Rename contigs to indicate which cluster they are in.
  • ow=f (overwrite) Overwrites files that already exist.
  • cc=t (canonicizeclusters) Flip contigs so clusters have a single orientation.
  • pto=f (preventtransitiveoverlaps) Do not look for new edges between nodes in the same cluster.
  • minclustersize=1 (mcs) Do not output clusters smaller than this.
  • cluster=f (c) Group overlapping contigs into clusters.
  • numaffixmaps=1 (nam) Number of prefixes/suffixes to index per contig. Higher is more sensitive, if edits are allowed.
  • maxedits=0 (e) Allow up to this many edits (subs or indels). Higher is slower.

Using 27-mer seeds with 4 affixes indexed per contig (nam=4), sequences that are at least 99% identical (e=26) and overlap over at least 1420 bases are clustered.

 

 

There are many commands, so this is split across two posts.

 

 

References

BBMap (aligner for DNA/RNAseq) is now open-source and available for download. - SEQanswers

 

BBMerge - Accurate paired shotgun read merging via overlap.

Bushnell B, Rood J, Singer E

PLoS One. 2017 Oct 26;12(10):e0185056. doi: 10.1371/journal.pone.0185056. eCollection 2017. 

 

Biostars: downsampling

https://www.biostars.org/p/168178/

 

 

kraken: fast taxonomy assignment for metagenomic data

2018 10/6 Title corrected

2018 11/17 Added a quick test

2019 4/12 Added docker link

2019 10/15 Added links

 

kraken, published in 2014, is a classification method for metagenomic data. It decomposes fasta or fastq input into k-mers and matches them against a pre-built database. It is claimed to be up to 909 times faster than MegaBLAST while maintaining accuracy comparable to BLAST. A similar tool, MetaPhlAn, also derives taxonomy information from metagenomic data; kraken is reported to be up to 11 times faster than MetaPhlAn as well.

 

公式ページ

https://ccb.jhu.edu/software/kraken/

Manual

http://ccb.jhu.edu/software/kraken/MANUAL.html

 

Installation

Installed here on CentOS.

Running kraken requires Jellyfish version 1. Download it from the official site and build it.

wget http://www.cbcb.umd.edu/software/jellyfish/jellyfish-1.1.11.tar.gz
tar zxvf jellyfish-1.1.11.tar.gz
cd jellyfish-1.1.11/
./configure
make
sudo make install

Add bin/jellyfish to your PATH.

 

kraken itself can be installed with brew.

brew install kraken

Or:

git clone  https://github.com/DerrickWood/kraken.git
cd kraken/
./install_kraken.sh $KRAKEN_DIR # specify the installation directory

If you are using Anaconda, install with conda.

#v1.1
conda install -c bioconda -y kraken==1.1

#v2 (bioconda)
conda install -c bioconda -y kraken2

 

The NCBI RefSeq URLs have changed, so the sequences can no longer be downloaded with wget (as of testing in summer 2017). Download the script separately and patch it.

wget https://github.com/DerrickWood/kraken/blob/master/scripts/download_genomic_library.sh

Open download_genomic_library.sh and change line 43 as follows.

wget $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz

Then change line 59 as follows.

wget $FTP_SERVER/genomes/archive/old_refseq/Plasmids/plasmids.all.fna.tar.gz

 

Open the kraken-build script installed by brew:

vi /usr/local/bin/kraken-build # open with vi

Change line 262 to the path of the patched download_genomic_library.sh. Here it was modified as follows.

exec "/Volumes/3TB3/kraken_database/download_genomic_library.sh", $type;

Save and close.

 

Update

Docker image

https://hub.docker.com/r/staphb/kraken/

 

Run

Prepare the database (first time only)

 ↓

Run

 ↓

Summarize

The analysis proceeds in these three stages.

 

--Preparing the database--

Download the database. Here the directory is named database.

kraken-build --standard --db database

All RefSeq data and taxonomy information are downloaded into database/.

 

Build the database. This takes a long time, so use as many cores as you have.

kraken-build --standard --threads 22 --db database

Depending on the environment, this can take a very long time.

Check the disk usage.

user]$ du -sh database/

161G database/

When run in August 2017, the database came to 161 GB.

Once it is ready, run the classification.

 

--Assigning taxonomy labels--

The default input is fasta, but fastq is supported with the --fastq-input flag, and gzip or bzip2 input with --gzip-compressed or --bzip2-compressed.

Run with a fasta input.

kraken --preload --threads 20 --fasta-input --db database input.fa > output
  • --db Name for Kraken DB (default: none)
  • --threads Number of threads (default: 1)
  • --fasta-input Input is FASTA format
  • --preload Loads DB into memory before classification

Adding --preload loads the database into RAM first, which speeds things up. database is the database directory name you chose.

  

Run with paired-end fastq.

kraken --preload --threads 20 --fastq-input --paired --check-names --db database R1.fq R2.fq > output
  • --fastq-input Input is FASTQ format 
  • --paired The two filenames provided are paired-end reads
  • --check-names Ensure each pair of reads have names that agree with each other; ignored if --paired is not specified
  • --gzip-compressed Input is gzip compressed
  • --bzip2-compressed Input is bzip2 compressed
  • --quick Quick operation (use first hit or hits)

kraken merges each read pair into one sequence (joined with NNN) before the k-mer search. During the search, k-mers containing anything other than ATGC are skipped. Using paired reads is therefore reported to raise accuracy by about 3% over single reads.
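The pair-merging idea can be sketched as joining each mate pair with an N spacer so that no k-mer spans the junction (k-mers containing non-ACGT bases are skipped). A toy stand-in for what kraken does internally, not its actual code:

```shell
# Join R1/R2 mates (here a tab-separated pair) with 'NNN' so that no
# k-mer crosses the junction; N-containing k-mers are ignored.
printf 'ACGTACGT\tTTGGTTGG\n' | awk -F'\t' '{ print $1 "NNN" $2 }'
```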

  

The output looks like this.

bin]$ head output

C M02077:58:000000000-APPA6:1:1101:15651:1463 1148 603 0:1 1148:16 0:31 1148:83 0:140 A:31 0:16 A:31 1148:4 0:220

C M02077:58:000000000-APPA6:1:1101:17046:1522 1148 601 0:1 1148:11 0:31 1148:61 0:31 1148:82 0:54 A:31 1148:11 0:5 A:31 1148:5 0:217

C M02077:58:000000000-APPA6:1:1101:15487:1536 1148 603 1148:14 0:72 1148:22 0:163 A:31 0:16 A:31 0:224

C M02077:58:000000000-APPA6:1:1101:17442:1545 1148 603 0:1 1148:113 0:31 1148:44 0:31 1148:16 0:35 A:31 1148:16 A:31 1148:36 0:188

C M02077:58:000000000-APPA6:1:1101:22458:1603 1148 601 0:16 1148:16 0:44 1148:14 0:181 A:31 0:1 1148:7 0:8 A:31 0:222

C M02077:58:000000000-APPA6:1:1101:19281:1604 1148 602 0:1 1148:64 0:31 1148:24 0:31 1148:28 0:31 1148:10 0:50 A:31 1148:16 A:31 1148:31 0:193

C M02077:58:000000000-APPA6:1:1101:23186:1616 1148 601 0:1 1148:168 0:31 1148:5 0:66 A:31 1148:16 A:31 0:5 1148:21 0:196

C M02077:58:000000000-APPA6:1:1101:18130:1621 1148 603 1148:213 0:58 A:31 1148:16 A:31 1148:81 0:31 1148:31 0:81

C M02077:58:000000000-APPA6:1:1101:9815:1622 1148 599 1148:208 0:62 A:31 1148:16 A:31 1148:46 0:175

C M02077:58:000000000-APPA6:1:1101:21784:1636 1148 602 0:1 1148:130 0:31 1148:83 0:26 A:31 1148:16 A:31 1148:33 0:190

Column 1: C or U, whether the read was classified (C) or not (U).

Column 2: sequence ID (the header from the input).

Column 3: taxonomy ID (wiki)

Column 4: read length

Column 5: see the manual excerpt below.

 

  1. A space-delimited list indicating the LCA mapping of each k-mer in the sequence. For example, "562:13 561:4 A:31 0:1 562:3" would indicate that:
    • the first 13 k-mers mapped to taxonomy ID #562
    • the next 4 k-mers mapped to taxonomy ID #561
    • the next 31 k-mers contained an ambiguous nucleotide
    • the next k-mer was not in the database
    • the last 3 k-mers mapped to taxonomy ID #562
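Column 5 is easy to aggregate yourself. Summing the k-mers per taxonomy ID for the manual's example string, with plain shell tools (not a kraken utility):

```shell
# Sum k-mers per taxonomy ID in a kraken column-5 string.
# 'A' = ambiguous k-mers, '0' = k-mers absent from the database.
echo "562:13 561:4 A:31 0:1 562:3" \
| tr ' ' '\n' \
| awk -F: '{ sum[$1] += $2 } END { for (t in sum) print t, sum[t] }' \
| LC_ALL=C sort
```

For the example string this yields 16 k-mers for taxid 562, 4 for 561, 31 ambiguous, and 1 unmatched.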

 

--Converting to taxonomy names--

A command is provided to convert taxonomy IDs into taxonomy names. Run it as follows.

kraken-translate --db database output > sequences.labels

Output

f:id:kazumaxneo:20170830114628j:plain

Since the input is data from a pure bacterial culture, the same organism appears over and over.

 

With --mpa-format, only superkingdom, kingdom, phylum, class, order, family, genus, and species are reported.

kraken-translate --mpa-format --db database output > sequences.labels

f:id:kazumaxneo:20170830115002j:plain

 

--Summarizing--

Create the summary report.

kraken-report --db database output > summary

Check the output.

head -20 summary 

  0.71 2036 2036 U 0 unclassified

 99.29 285920 3 - 1 root

 99.29 285916 1 - 131567   cellular organisms

 99.29 285915 65 D 2     Bacteria

 98.87 284692 36 - 1783272       Terrabacteria group

 98.55 283771 0 - 1798711         Cyanobacteria/Melainabacteria group

 98.55 283771 3 P 1117           Cyanobacteria

 98.54 283759 2 O 1890424             Synechococcales

 98.54 283753 0 F 1890428               Merismopediaceae

 98.54 283753 0 G 1142                 Synechocystis

 98.54 283753 283695 S 1148                   Synechocystis sp. PCC 6803

  0.02 57 57 - 1080228                     Synechocystis sp. PCC 6803 substr. GT-I

  0.00 1 1 - 1080229                     Synechocystis sp. PCC 6803 substr. PCC-N

  0.00 3 0 F 1890436               Pseudanabaenaceae

  0.00 3 0 G 1152                 Pseudanabaena

  0.00 3 3 S 82654                   Pseudanabaena sp. PCC 7367

  0.00 1 0 F 1890426               Synechococcaceae

  0.00 1 0 G 1129                 Synechococcus

  0.00 1 1 S 110662                   Synechococcus sp. CC9605

  0.00 7 1 - 1301283             Oscillatoriophycideae

Column 1: percentage of reads covered by this clade (relative to all reads)

Column 2: number of reads covered by this clade

Column 3: number of reads assigned directly to this taxon

Column 4: rank code (see the explanation below)

Column 5: NCBI taxonomy ID

Column 6: scientific name

  1. A rank code, indicating (U)nclassified, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. All other ranks are simply '-'.
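Rows of a given rank are easy to pull out with the rank-code column. A sketch that keeps only species-level (S) lines, assuming the tab-delimited kraken-report layout, shown here on an inline two-line sample:

```shell
# Keep only species-level (rank code S) lines of a kraken-report file;
# the rank code is column 4 in the tab-delimited report.
printf '99.29\t285915\t65\tD\t2\tBacteria\n98.54\t283753\t283695\tS\t1148\tSynechocystis sp. PCC 6803\n' \
| awk -F'\t' '$4 == "S"'
```

In practice you would pipe the real summary file in instead of printf.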

 

 

A command is also provided that outputs a MetaPhlAn-style tab-delimited file.

kraken-mpa-report --db database output > summary_tab 

bin]$ head -10 summary_tab 

d__Viruses 1

d__Bacteria 285915

d__Bacteria|p__Proteobacteria 419

d__Bacteria|p__Fusobacteria 5

d__Bacteria|p__Deferribacteres 2

d__Bacteria|p__Spirochaetes 7

d__Bacteria|p__Synergistetes 1

d__Bacteria|p__Bacteroidetes 724

d__Bacteria|p__Firmicutes 625

d__Bacteria|p__Deinococcus-Thermus 1

 

 

You can also add your own fasta files to build a custom database; see the official manual, which walks through the custom-database workflow in detail.

 

 

 

Test

I tested with lake-water metagenome data from the SRA. A 15 GB x 2 dataset with 40 threads took about 30 minutes. The head of the kraken-report output is pasted below.

bin]$ head -30 summary2

 74.03 29617719 29617719 U 0 unclassified

 25.97 10391739 6719 - 1 root

 25.91 10364486 1126 - 131567   cellular organisms

 25.90 10361130 51629 D 2     Bacteria

 25.37 10150686 3934 - 1783272       Terrabacteria group

 25.22 10090850 0 - 1798711         Cyanobacteria/Melainabacteria group

 25.22 10090850 129569 P 1117           Cyanobacteria

 22.78 9114469 8754 - 1301283             Oscillatoriophycideae

 22.59 9036657 21480 O 1150               Oscillatoriales

 21.70 8681472 0 F 1892254                 Oscillatoriaceae

 21.70 8681472 934 G 1158                   Oscillatoria

 21.59 8639339 0 S 482564                     Oscillatoria nigro-viridis

 21.59 8639339 8639339 - 179408                       Oscillatoria nigro-viridis PCC 7112

  0.10 41199 0 S 118323                     Oscillatoria acuminata

  0.10 41199 41199 - 56110                       Oscillatoria acuminata PCC 6304

  0.39 155174 0 F 1892249                 Cyanothecaceae

  0.39 155174 23122 G 43988                   Cyanothece

  0.15 59279 59279 S 497965                     Cyanothece sp. PCC 7822

  0.12 46609 46609 S 65393                     Cyanothece sp. PCC 7424

  0.03 12335 12335 S 43989                     Cyanothece sp. ATCC 51142

  0.01 5589 5589 S 395962                     Cyanothece sp. PCC 8802

  0.01 4347 4347 S 395961                     Cyanothece sp. PCC 7425

  0.01 3893 3893 S 41431                     Cyanothece sp. PCC 8801

  0.29 114192 578 F 1892252                 Microcoleaceae

  0.13 53893 0 G 35823                   Arthrospira

  0.13 53893 0 S 118562                     Arthrospira platensis

  0.13 53893 53893 - 696747                       Arthrospira platensis NIES-39

  0.09 36770 0 G 44471                   Microcoleus

  0.09 36770 36770 S 1173027                     Microcoleus sp. PCC 7113

  0.06 22951 0 G 1205                   Trichodesmium

74% of the reads were unclassified, but a variety of freshwater cyanobacteria and other taxa were hit.

 

Update

See also:

Update

Visualization tool

 

Update

A quick test

I built an in silico mock metagenome community dataset for testing, mixing ten genomes at the following proportions.

f:id:kazumaxneo:20181120110800j:plain

Visualized with Krona

f:id:kazumaxneo:20181120110927j:plain

Rickettsia canadensis, spiked in at only about 2%, was still detected (*1). The quantification for every species came out at reasonable values.

 

This is also helpful.

Update

Building a Kraken database with new FTP structure and no GI numbers

http://www.opiniomics.org/building-a-kraken-database-with-new-ftp-structure-and-no-gi-numbers/

 

References

Kraken: ultrafast metagenomic sequence classification using exact alignments

Derrick E Wood and Steven L Salzberg

Genome Biology 2014 15:R46

 

On errors during download:

https://github.com/DerrickWood/kraken/issues/40

 

*1

The read count was 500,000 x 2 and the file size 313 MB x 2, which is very small for a metagenome.

 

Related

Analyzing kraken output

Using kraken

Multi-sample


Update 2023/03/04

"Using Kraken 2 and MetaPhlAn 3 to classify reads in human-derived and environmental metagenomes, we found large discrepancies in both the proportion of reads classified and the number of species identified. We then used a range of simulated and mock samples to examine which tool gives the classification closest to the true composition of a metagenomic sample, and how the combination of tool, parameter, and database choices affects the classification obtained."

https://pubmed.ncbi.nlm.nih.gov/36867161/

 

 

qrqc: generating fastq quality reports


qrqc is an R package that reports read quality and base-composition distributions. It supports fastq with reads of 1000 bp or shorter. FastQC is another tool used for the same purpose.

 

 

Installation

In R:

## try http:// if https:// URLs are not supported 
source("https://bioconductor.org/biocLite.R")
biocLite("qrqc")

http://bioconductor.org/packages/devel/bioc/html/qrqc.html

 

Manual

browseVignettes("qrqc")

http://127.0.0.1:20080/library/qrqc/doc/qrqc.pdf

 

 

Run

In R:

library(qrqc)
fq <- readSeqFile("input.fastq") # load the fastq, randomly sampling 10% of the reads

 

Check the information for the loaded fastq.

fq

> fq

Quality Information for: input.fastq

 261774 sequences, 233067 unique with 0.10 being sampled

 mean quality: 36.101300

 min sequence length: 35

 max sequence length: 301

 

Generate the report.

makeReport(fq)

An HTML report is written into a folder.

 

f:id:kazumaxneo:20170828213815j:plain

f:id:kazumaxneo:20170828213823j:plain

f:id:kazumaxneo:20170828213820j:plain

f:id:kazumaxneo:20170828213833j:plain

f:id:kazumaxneo:20170828213840j:plain

f:id:kazumaxneo:20170828213846j:plain

The TruSeq NNN.. sequence is still present.

 

 

 

k-mer distribution.

kmerKLPlot(fq)

f:id:kazumaxneo:20170828214631j:plain

 

GC

gcPlot(fq) + geom_hline(yintercept=0.5, color="purple") 

f:id:kazumaxneo:20170828215149j:plain 

 

 

引用

 bioinformatics

qrqc | FASTQ のクオリティチェック

 

Scythe: adapter trimming with a Bayesian approach

 

Scythe is an adapter-trimming tool for fastq files. When fragments shorter than the read length come out of library prep, reads with adapter sequence at the 3' end get sequenced. Small RNA sequencing makes this easy to picture: with a small-RNA library sequenced at 100 bp, adapters appear at various positions on the 3' side. Even in whole-genome sequencing, such short fragments are sequenced at low frequency. If these reads survive adapter trimming, they can harm assembly, alignment, and other downstream analyses, so trimming the 3' side is also advisable.

 

Scythe compares two likelihood models, adapter present versus absent, and uses similarity to the supplied adapter sequence to identify adapter-contaminated reads and remove the adapter from the 3' end. The GitHub page notes that a paper is currently being written.

  

Running quality trimming before Scythe can partially remove the 3' adapter and keep Scythe's analysis from working properly, so it is recommended to remove 3' adapters with Scythe first. Any 5' adapters Scythe misses (adapter-only artifact reads and the like) should then be removed completely with Tagdust (discarding the whole read when an adapter is found), and only then should quality trimming be run. Finally, the Scythe developer recommends checking the quality-score distribution with FastQC or qrqc and judging from the report whether the trimming was adequate (if it was insufficient, trim further, for example by specifying a length).

 

Scythe (adapter trimming 1)

Tagdust (adapter trimming 2)

quality trimming

fastqc (assessment)

 

 

Github

https://github.com/vsbuffalo/scythe

 

Installation

Unpack the source and run make.

make all

Put scythe on your PATH, for example by copying or symlinking it into a directory already on the PATH.

 

Run

Run with the Illumina adapter file bundled in scythe-master/ specified; the input fastq file (input.fastq) comes last.

scythe -a illumina_adapters.fa -o trimmed_sequences.fastq input.fastq 
  • -o output trimmed sequences file (default: stdout)
  • -m   matches file (default: no output)
  • -q quality type, either illumina, solexa, or sanger (default: sanger)
  • -M filter sequences less than or equal to this length (default: 35)
  • -t add a tag to the header indicating Scythe cut a sequence (default: off)
  • -n smallest contaminant to consider (default: 5)

Information on quality schemes:

・phred PHRED quality scores (e.g. from Roche 454). ASCII with no offset, range: [4, 60].

・sanger Sanger are PHRED ASCII qualities with an offset of 33, range: [0, 93]. From NCBI SRA, or Illumina pipeline 1.8+.

・solexa Solexa (also very early Illumina - pipeline < 1.3). ASCII offset of 64, range: [-5, 62]. Uses a different quality-to-probabilities conversion than other schemes.

・illumina Illumina output from pipeline versions between 1.3 and 1.7. ASCII offset of 64, range: [0, 62]

 

trimmed_sequences.fastq is produced. Analyzing 1,000,000 reads sampled from a MiSeq run, adapters were found in 17,789 reads, and trimming reduced the total size from 290,466,519 bp to 290,124,196 bp.
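As a quick sanity check on those numbers, the removed bases and the average trim per adapter-containing read work out as follows:

```shell
# 290,466,519 - 290,124,196 bp removed across 17,789 adapter-containing
# reads, i.e. roughly a 19 bp trim per affected read.
awk 'BEGIN { removed = 290466519 - 290124196; printf "%d bp removed, %.1f bp per trimmed read\n", removed, removed / 17789 }'
```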

 

 

References

What is the strategy to deal with adapter contamination inside the reads but not at either of the ends (3' or 5')?

https://www.researchgate.net/post/What_is_the_strategy_to_deal_with_adapter_contamination_inside_the_reads_but_not_at_either_of_the_ends_3_or_5

Recycler: detecting plasmids from assembly graphs

2018 1/9 Added conda installation

 

Recycler is a method for detecting plasmid sequences from assembly graph paths. A plasmid's subgraph is expected to be independent of the rest of the genome graph: a circular path through a small number of nodes with uniform coverage. Recycler searches the graph for plasmids using these cues. The paper was published in 2016.

 

Installation

Dependencies

Python 2.7+

These can be installed with pip.

 

Recommended for generating inputs

The recommended tools can be installed with brew.

 

Github

https://github.com/Shamir-Lab/Recycler

Installing Recycler

git clone https://github.com/rozovr/Recycler.git 
cd Recycler
python setup.py install --user

# or use conda
conda install -c bioconda -y recycler

 

Run

Running Recycler requires the fastg file from a SPAdes assembly and a bam file built from the paired-end fastq (aligned with bwa mem). The bam can be prepared as follows.

make_fasta_from_fastg.py -g assembly_graph.fastg # produces assembly_graph.nodes.fasta
bwa index assembly_graph.nodes.fasta
bwa mem -t 8 assembly_graph.nodes.fasta R1.fastq.gz R2.fastq.gz | samtools view -buS -@ 8 - > reads_pe.bam
samtools view -@ 8 -bF 0x0800 reads_pe.bam > reads_pe_primary.bam
samtools sort -@ 8 reads_pe_primary.bam > reads_pe_primary.sort.bam
samtools index reads_pe_primary.sort.bam

reads_pe_primary.sort.bam and its bai index are used.

  

recycle.py -g assembly_graph.fastg -k 55 -b reads_pe_primary.sort.bam -i True
  •  -g assembly graph FASTG file to process: for spades 3.5, before_rr.fastg; for spades 3.6+, assembly_graph.fastg
  • -k integer reflecting maximum k value used by the assembler
  • -b BAM file resulting from aligning reads to contigs file, filtering for best matches

When the analysis finishes, a fasta file named ~.cycs.fasta is written.

 

Other options:

  • -l minimum length required for reporting [default: 1000]
  • -m coefficient of variation used for pre-selection [default: 0.5, higher--> less restrictive]
  • -i True or False value reflecting whether data sequenced was an isolated strain
  • -o provide a specific output directory by default results will be written to the directory the FASTG file is currently in.

 

Testing with sequencing data from a bacterium carrying seven plasmids of 5 kb to 100 kb, fasta was output for only the three plasmids of about 5 kb. The recovered plasmids have copy numbers several times that of the chromosome and simple structures. The unrecovered plasmids, by contrast, are thought to form complex graphs containing regions partially homologous to the genome and to have copy numbers about equal to the chromosome. Their nodes may therefore not exist independently of the genome's nodes, which presumably affected the detection.

 

  

References

Recycler: an algorithm for detecting plasmids from de novo assembly graphs

Roye Rozov, Aya Brown Kav, David Bogumil, Naama Shterzer, Eran Halperin, Itzhak Mizrahi, and Ron Shamir

Bioinformatics. 2017 Feb 15; 33(4): 475–482. 

 

NxTrim: removing junction adapters from the Nextera Mate Pair protocol

 

Illumina also sells a mate-pair sequencing kit. In this protocol, the genome is tagged with the Nextera transposon and self-ligated to bring distant sequences together, leaving a junction sequence in the middle (the red sequence in Figure 1). NxTrim is a dedicated tool for removing that junction sequence. Because the trimmed region is kept to a minimum, it is claimed to improve assembly accuracy. The paper was published in 2015.

 

f:id:kazumaxneo:20170825105709j:plain

Figure 1: Library preparation workflow, from the official PDF manual (link)

 

 

Installation

Can be installed with brew.

brew install NxTrim

 

 

Run

Download the sample fastq from GitHub and run.

wget https://github.com/sequencing/NxTrim/tree/master/example/sample_R1.fastq.gz
wget https://github.com/sequencing/NxTrim/tree/master/example/sample_R2.fastq.gz
nxtrim -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -O sample

Four files are produced; their descriptions are pasted below.

  • mp: read pairs that are large insert-size mate-pairs, both mates will be reverse-complemented by nxtrim (from RF to FR) unless --rf commandline option is used
  • pe: read pairs that are short insert-size paired-end reads due to the junction adapter occurring early in a read
  • se: single reads (reads having no R1 or R2 counterpart)
  • unknown: a library of read-pairs that are mostly large-insert mate-pair, but possibly contain a small proportion of paired end contaminants

  

 

References

NxTrim: optimized trimming of Illumina mate pair reads

Jared O’Connell, Ole Schulz-Trieglaff, Emma Carlson, Matthew M. Hims, Niall A. Gormley, Anthony J. Cox

Bioinformatics. 2015 Jun 15;31(12):2035-7