2018-05-09

MinHashを使いfasta / fastqから生物種を高速推定する BBSketch

2019 6/13 追記

2019 7/18 インストール追記

2020 7/7 コマンド追記、help 更新

2020 7/9 文章追記

以前このブログで紹介したBBtoolsに、Minhashアルゴリズム（リンク）を使ってわずか数秒でゲノムなどの大きな配列を比較し、トップヒットを返してくれる機能が実装されている。Biostarsに使い方が載せてあったので、紹介しておきます。

MinHash Sketchは、大きな文字列や集合を素早く比較する方法である。ゲノミクスでは、次のような使い方ができる。

1) ゲノム内のすべてのkmerを集める。

2) それらにハッシュ関数を適用する。

3) 10000個の最小のハッシュコードを保持し、このセットを「スケッチ」と呼ぶ。

これを複数のゲノムに対して行うと、2つのゲノムがどれだけ似ているかをアラインメントを介して計算するよりもはるかに速く計算できる。例えば、何かをアセンブルした場合、それをスケッチして、数秒でntやRefSeqにあるすべてのもののスケッチと比較することができる。トップヒットしたスケッチが、そのハッシュコードの90%を大腸菌と共有していることを示している場合、それは大腸菌とほぼ90%のkmer同一性を持っていることを意味し、したがって大腸菌であることを意味する。

sendsketch.sh in=contigs.fa
これはアセンブリのスケッチを作成し、JGIのtaxonomyサーバへのHTTP接続を開き、スケッチを送信する。サーバはそれをすべてのntのスケッチと比較し、トップヒットを返す。

BBtools

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/

NGSでのMinHashの活用例（MASH）

http://mash.readthedocs.io/en/latest/distances.html

インストール

BBtoolsはbrewで導入できる。

#bioconda
mamba install -c bioconda bbmap -y

#homebrew
brew install BBtools

バージョンが古いと機能自体がないかもしれない。また、サーバーのアップデートに伴い、バージョンアップを要求される場合がある。

upgradeする。

#bioconda
mamba update bbmap

#homebrew
brew upgrade BBtools

> sendsketch.sh

Written by Brian Bushnell

Last modified December 19, 2019

Description: Compares query sketches to reference sketches hosted on a

remote server via the Internet. The input can be sketches made by sketch.sh,

or fasta/fastq files from which SendSketch will generate sketches.

Only sketches will sent, not sequences.

Please read bbmap/docs/guides/BBSketchGuide.txt for more information.

Usage:

sendsketch.sh in=file

To change nucleotide servers, add the server name, e.g.:

sendsketch.sh in=file nt

For the protein server with nucleotide input:

sendsketch.sh in=file protein

for the protein server with amino input:

sendsketch.sh in=file amino protein

Standard parameters:

in=<file> Sketch or fasta file to compare.

out=stdout Comparison output. Can be set to a file instead.

outsketch= Optional, to write the sketch to a file.

local=f For local files, have the server load the sketches.

Allows use of whitelists; recommended for Silva.

Local can only be used when the client and server access

the same filesystem - e.g., Genepool and Cori.

address= Address of remote server. Default address:

https://refseq-sketch.jgi-psf.org/sketch

You can also specify these abbreviations:

nt: nt server

refseq: Refseq server

silva: Silva server

protein: RefSeq prokaryotic amino acid sketches

img: IMG server (Not Yet Available)

mito: RefSeq mitochondrial server (NYA)

fungi: RefSeq fungi sketches (NYA)

Using an abbreviation automatically sets the address,

the blacklist, and k.

aws=f Set aws=t to use the aws servers instead of NERSC.

When, for example, NERSC (or the whole SF Bay area) is down.

Sketch-making parameters:

mode=single Possible modes, for fasta input:

single: Generate one sketch per file.

sequence: Generate one sketch per sequence.

k=31 Kmer length, 1-32. This is automatic and does not need to

be set for JGI servers, only for locally-hosted servers.

samplerate=1 Set to a lower value to sample a fraction of input reads.

For raw reads (rather than an assembly), 1-3x coverage

gives best results, by reducing error kmers. Somewhat

higher is better for high-error-rate data like PacBio.

minkeycount=1 Ignore kmers that occur fewer times than this. Values

over 1 can be used with raw reads to avoid error kmers.

minprob=0.0001 Ignore kmers below this probability of correctness.

minqual=0 Ignore kmers spanning bases below this quality.

entropy=0.66 Ignore sequence with entropy below this value.

merge=f Merge paired reads prior to sketching.

amino=f Use amino acid mode. Input should be amino acids.

translate=f Call genes and translate to proteins. Input should be

nucleotides. Designed for prokaryotes.

sixframes=f Translate all 6 frames instead of predicting genes.

ssu=t Scan for and retain full-length SSU sequence.

printssusequence=f Print the query SSU sequence (JSON mode only).

refid= Instead of a query file, specify a reference sketch by name

or taxid; e.g. refid=h.sapiens or refid=9606.

Size parameters:

size=10000 Desired size of sketches (if not using autosize).

mgf=0.01 (maxfraction) Max fraction of genomic kmers to use.

minsize=100 Do not generate sketches for genomes smaller than this.

autosize=t Use flexible sizing instead of fixed-length. This is

nonlinear; a human sketch is only ~6x a bacterial sketch.

sizemult=1 Multiply the autosized size of sketches by this factor.

Normally a bacterial-size genome will get a sketch size

of around 10000; if autosizefactor=2, it would be ~20000.

density= If this flag is set (to a number between 0 and 1),

autosize and sizemult are ignored, and this fraction of

genomic kmers are used. For example, at density=0.001,

a 4.5Mbp bacteria will get a 4500-kmer sketch.

sketchheapfactor=4 If minkeycount>1, temporarily track this many kmers until

counts are known and low-count kmers are discarded.

Taxonomy and filtering parameters:

level=2 Only report the best record per taxa at this level.

Either level names or numbers may be used.

0: disabled

1: subspecies

2: species

3: genus

...etc

include= Restrict output to organisms in these clades.

May be a comma-delimited list of names or NCBI TaxIDs.

includelevel=0 Promote the include list to this taxonomic level.

For example, include=h.sapiens includelevel=phylum

would only include organisms in the same phylum as human.

includestring= Only report records whose name contains this string.

exclude= Ignore organisms in these clades.

May be a comma-delimited list of names or NCBI TaxIDs.

excludelevel=0 Promote the exclude list to this taxonomic level.

For example, exclude=h.sapiens excludelevel=phylum

would exclude all organisms in the same phylum as human.

excludestring= Do not records whose name contains this string.

banunclassified=f Ignore organisms descending from nodes like

'unclassified Bacteria'

banvirus=f Ignore viruses.

requiressu=f Ignore records without SSUs.

minrefsize=0 Ignore ref sketches smaller than this (unique kmers).

minrefsizebases=0 Ignore ref sketches smaller than this (total base pairs).

Output format:

format=2 2: Default format with, per query, one query header line;

one column header line; and one reference line per hit.

3: One line per hit, with columns query, reference, ANI,

and sizeRatio.

4: JSON (format=json also works).

5: Constellation (format=constellation also works).

usetaxidname=f For format 3, print the taxID in the name column.

usetaxname for format 3, print the taxonomic name in the name column.

useimgname For format 3, print the img ID in the name column.

d3=f Output in JSON format, with a tree for visualization.

Output columns (for format=2):

printall=f Enable all output columns.

printani=t (ani) Print average nucleotide identity estimate.

completeness=t Genome completeness estimate.

score=f Score (used for sorting the output).

printmatches=t Number of kmer matches to reference.

printlength=f Number of kmers compared.

printtaxid=t NCBI taxID.

printimg=f IMG identifier (only for IMG data).

printgbases=f Number of genomic bases.

printgkmers=f Number of genomic kmers.

printgsize=t Estimated number of unique genomic kmers.

printgseqs=t Number of sequences (scaffolds/reads).

printtaxname=t Name associated with this taxID.

printname0=f (pn0) Original seqeuence name.

printqfname=t Query filename.

printrfname=f Reference filename.

printtaxa=f Full taxonomy of each record.

printcontam=t Print contamination estimate, and factor contaminant kmers

into calculations. Kmers are considered contaminant if

present in some ref sketch but not the current one.

printunique=t Number of matches unique to this reference.

printunique2=f Number of matches unique to this reference's taxa.

printunique3=f Number of query kmers unique to this reference's taxa,

regardless of whether they are in this reference sketch.

printnohit=f Number of kmers that don't hit anything.

printrefhits=f Average number of ref sketches hit by shared kmers.

printgc=f GC content.

printucontam=f Contam hits that hit exactly one reference sketch.

printcontam2=f Print contamination estimate using only kmer hits

to unrelated taxa.

contamlevel=species Taxonomic level to use for contam2/unique2/unique3.

NOTE: unique2/unique3/contam2/refhits require an index.

printdepth=f (depth) Print average depth of sketch kmers; intended

for shotgun read input.

printdepth2=f (depth2) Print depth compensating for genomic repeats.

Requires reference sketches to be generated with depth.

actualdepth=t If this is false, the raw average count is printed.

If true, the raw average (observed depth) is converted

to estimated actual depth (including uncovered areas).

printvolume=f (volume) Product of average depth and matches.

printca=f Print common ancestor, if query taxID is known.

printcal=f Print common ancestor tax level, if query taxID is known.

recordsperlevel=0 If query TaxID is known, and this is positive, print at

most this many records per common ancestor level.

Sorting:

sortbyscore=t Default sort order is by score.

sortbydepth=f Include depth as a factor in sort order.

sortbydepth2=f Include depth2 as a factor in sort order.

sortbyvolume=f Include volume as a factor in sort order.

sortbykid=f Sort strictly by KID.

sortbyani=f Sort strictly by ANI/AAI/WKID.

sortbyhits=f Sort strictly by the number of kmer hits.

Other output parameters:

minhits=3 (hits) Only report records with at least this many hits.

minani=0 (ani) Only report records with at least this ANI (0-1).

minwkid=0.0001 (wkid) Only report records with at least this WKID (0-1).

anifromwkid=t Calculate ani from wkid. If false, use kid.

minbases=0 Ignore ref sketches of sequences shortert than this.

minsizeratio=0 Don't compare sketches if the smaller genome is less than

this fraction of the size of the larger.

records=20 Report at most this many best-matching records.

color=family Color records at the family level. color=f will disable.

Colors work in most terminals but may cause odd characters

to appear in text editors. So, color defaults to f if

writing to a file.

intersect=f Print sketch intersections. delta=f is suggested.

Metadata parameters (optional, for the query sketch header):

taxid=-1 Set the NCBI taxid.

imgid=-1 Set the IMG id.

spid=-1 Set the sequencing project id (JGI-specific).

name= Set the name (taxname).

name0= Set name0 (normally the first sequence header).

fname= Set fname (normally the file name).

meta_= Set an arbitrary metadata field.

For example, meta_Month=March.

Other parameters:

requiredmeta= (rmeta) Required optional metadata values. For example:

rmeta=subunit:ssu,source:silva

bannedmeta= (bmeta) Forbidden optional metadata values.

Java Parameters:

-Xmx This will set Java's memory usage, overriding autodetection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.

The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory

exception occurs. Requires Java 8u92+.

-da Disable assertions.

For more detailed information, please read /bbmap/docs/guides/BBSketchGuide.txt.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

ラン

データベースに問い合わせるので、コマンド実行時はネットに繋がっている必要がある。

de novoアセンブリして得たFASTAをJGIのtaxonomy serverに問い合わせ、top hitを返す。

sendsketch.sh in=contigs.fa

Schizosaccharomyces_pombeのゲノムを指定してみる。

sendsketch.sh in=contigs.fa

sketchファイルで結果は問い合わされ、数秒で結果が返ってくる。

$ sendsketch.sh in=Schizosaccharomyces_pombe.ASM294v2_chr1.fa

Max memory cannot be determined. Attempting to use 3200 MB.

If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command,

or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.

Adding /usr/local/Cellar/bbtools/37.77/resources/blacklist_refseq_species_300.sketch to blacklist.

Loaded 1 sketch in 0.897 seconds.

Query: Schizosaccharomyces_pombe.ASM294v2_chr1.fa DB: RefSeq SketchLen: 11783 Seqs: 1 Bases: 5579033 gSize: 5438526

WKID KID ANI Complt Contam Matches Unique noHit TaxID gSize gSeqs taxName

100.00% 44.56% 100.00% 44.56% 0.00% 6890 6876 0 4896 12075791 5 Schizosaccharomyces pombe

KID (or kmer identity) is simply the percent of the kmers that matched
WKID is the weighted kmer identity, which has been adjusted to compensate for genome size.
matches is the number of matching kmers between the query and reference sketch

マッチしたk-merの数やゲノムサイズで正規化した割合のほか、ANIも出力される。RefseqのS.pombeのゲノムと100%合致した。複数見つかると、この例（リンク）のように出力される。プロテインの検索ならamino=tをつける。

NGSデータも高速に分析できる。例えばilluminaシーケンスデータ（fastq）の先頭100,000リードを分析する。

sendsketch.sh in=reads.fq reads=100k

10万リードなら実行後5秒ほどで結果が返ってくる。

生のPacBioはエラー率が高いので、感度を上げるためにデフォルトのスケッチサイズを大きくする。

sendsketch.sh in=pacbio_reads.fq reads=400k

genusレベルで結果を出力

sendsketch.sh level=3 in=HTS_reads.fq

level=2 Only report the best record per taxa at this level. Either level names or numbers may be used.
0: disabled
1: subspecies
2: species
3: genus
...etc

データベースをSILVAに変える。種レベルで問い合わせる。

sendsketch.sh in=16Samplicon.fastq level=3 address=silva

address= Address of remote server. Default address is sketch. You can also specify these abbreviations:
   nt: nt server
   refseq: Refseq server
   silva: Silva server
   protein: RefSeq prokaryotic amino acid sketches
   img: IMG server (Not Yet Available)
   mito: RefSeq mitochondrial server (NYA)
   fungi: RefSeq fungi sketches (NYA)

データベースをproteinに変える。属レベルで問い合わせる。

sendsketch.sh in=proteins.faa level=2 address=protein

address= Address of remote server. Default address is sketch. You can also

ロングリードのfastqを指定してエラーが出る場合

=> fastaに変換してから読み込む。

seqkit fq2fa ONT.fq > ONT.fa
sendsketch.sh in=HONT.fa reads=100k

目的によってはblastの置き換えになるツールです。コンタミネーションの簡易チェックにも使えると思います。ぜひ導入して使ってみてください。

追記

bamやsamでも動作します。

引用

Tool: BBSketch - A Tool for Rapid Sequence Comparison

BBSketch - A Tool for Rapid Sequence Comparison

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

MinHashを使いfasta / fastqから生物種を高速推定する BBSketch