macでインフォマティクス

macでインフォマティクス

NGS関連のインフォマティクス情報についてまとめています。

MinHashを使いfasta / fastqから生物種を高速推定する BBSketch

 

以前このブログで紹介したBBtoolsに、いつのまにか、Minhashアルゴリズムリンク)を使ってわずか数秒でゲノムなどの大きな配列を比較し、トップヒットを返してくれる機能が実装されている。Biostarsに使い方が載せてあったので、紹介しておきます。

 

BBtools

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/

NGSでのMinHashの活用例(MASH)

http://mash.readthedocs.io/en/latest/distances.html

 

インストール

BBtoolsはbrewで導入できる。

brew install BBtools

バージョンが古いと機能自体がないかもしれない。その場合はBBtoolsをアップデートする。

brew upgrade BBtools

 > sendsketch.sh

$ sendsketch.sh

 

Written by Brian Bushnell

Last modified December 7, 2017

 

Description:  Compares query sketches to reference sketches hosted on a 

remote server via the Internet.  The input can be sketches made by sketch.sh,

or fasta/fastq files from which SendSketch will generate sketches.  

Only sketches will sent, not sequences.

 

Please read bbmap/docs/guides/BBSketchGuide.txt for more information.

 

Usage:  sendsketch.sh in=file

 

Standard parameters:

in=<file>           Sketch or fasta file to compare.

out=stdout          Comparison output.  Can be set to a file instead.

outsketch=          Optional, to write the sketch to a file.

local=f             For local files, have the server load the sketches.

                    Allows use of whitelists; recommended for Silva.

                    Local can only be used when the client and server access 

                    the same filesystem - e.g., Genepool and Cori.

address=            Address of remote server.  Default address:

                    https://refseq-sketch.jgi-psf.org/sketch

                    You can also specify these abbreviations:

                       nt:      nt server

                       refseq:  Refseq server

                       silva:   Silva server

                       img:     IMG server (not yet available)

                    Using an abbreviation automatically sets the address, 

                    the blacklist, and k.

blacklist=<file>    Ignore keys in this sketch file.  Additionaly, there are

                    built-in blacklists that can be specified:

                       nt:      Blacklist for nt

                       refseq:  Blacklist for Refseq

                       silva:   Blacklist for Silva

                       img:     Blacklist for IMG

amino=f             Use amino acid mode.

 

Sketch-making parameters:

mode=single         Possible modes, for fasta input:

                       single: Generate one sketch per file.

                       sequence: Generate one sketch per sequence.

size=10000          Size of sketches to generate, if autosize=f.

                    For raw PacBio data, 100k is suggested.

maxfraction=0.01    (mgf) Max fraction of genomic kmers to use.

autosize=t          Use flexible sizing instead of fixed-length.

sizemult=1          Multiply the default size of sketches by this factor.

k=31                Kmer length, 1-32.  To maximize sensitivity and 

                    specificity, dual kmer lengths may be used:  k=31,24

                    Dual kmers are fastest if the shorter is a multiple 

                    of 4.  Query and reference k must match.

                    Currently, JGI-hosted nt and RefSeq use k=31,24

                    while JGI-hosted Silva uses k=31, but this is set for you.

keyfraction=0.2     Only consider this upper fraction of keyspace.

samplerate=1        Set to a lower value to sample a fraction of input reads.

                    For raw reads (rather than an assembly), 1-3x coverage

                    gives best results, by reducing error kmers.  Somewhat

                    higher is better for high-error-rate data like PacBio.

minkeycount=1       Ignore kmers that occur fewer times than this.  Values

                    over 1 can be used with raw reads to avoid error kmers.

sketchheapfactor=4  If minkeycount>1, temporarily track this many kmers until

                    counts are known and low-count kmers are discarded.

minprob=0.0001      Ignore kmers below this probability of correctness.

minqual=0           Ignore kmers spanning bases below this quality.

entropy=0.66        Ignore sequence with entropy below this value.

merge=f             Merge paired reads prior to sketching.

 

Taxonomy-related parameters:

tree=<file>         Specify a TaxTree file.  On Genepool, use tree=auto.

                    Only necessary for use with printtaxa and level.

                    Assumes comparisons are done against reference sketches

                    with known taxonomy information.

level=2             Only report the best record per taxa at this level.

                    Either level names or numbers may be used.

                       -1: disabled

                        1: subspecies

                        2: species

                        3: genus

                       ...etc

taxfilter=          List of NCBI taxIDs to filter from output.

taxfilterlevel=0    Filter everything sharing an ancestor at this level.

taxfilterinclude=t  Include only filtered records. 'f' will discard them.

 

Output format:

format=2            2: Default format with, per query, one query header line;

                       one column header line; and one reference line per hit.

                    3: One line per hit, with columns query, reference, ANI,

                       and sizeRatio.

usetaxidname=f      For format 3, print the taxID in the name column.

usetaxname          for format 3, print the taxonomic name in the name column.

useimgname          For format 3, print the img ID in the name column.

 

Output columns (for format=2):

printall=f          Enable all output columns.

printani=t          (ani) Print average nucleotide identity estimate.

completeness=t      Genome completeness estimate.

score=f             Score (used for sorting the output).

printmatches=t      Number of kmer matches to reference.

printlength=f       Number of kmers compared.

printtaxid=t        NCBI taxID.

printimg=f          IMG identifier (only for IMG data).

printgbases=f       Number of genomic bases.

printgkmers=f       Number of genomic kmers.

printgsize=t        Estimated number of unique genomic kmers.

printgseqs=t        Number of sequences (scaffolds/reads).

printtaxname=t      Name associated with this taxID.

printname0=t        (pn0) Original seqeuence name.

printfname=t        Query filename.

printtaxa=f         Full taxonomy of each record.

printcontam=t       Print contamination estimate, and factor contaminant kmers

                    into calculations.  Kmers are considered contaminant if

                    present in some ref sketch but not the current one.

printunique=t       Number of matches unique to this reference.

printunique2=f      Number of matches unique to this reference's taxa.

printunique3=f      Number of query kmers unique to this reference's taxa,

                    regardless of whether they are in this reference sketch.

printnohit=t        Number of kmers that don't hit anything.

printucontam=f      Contam hits that hit exactly one reference sketch.

printcontam2=f      Print contamination estimate using only kmer hits

                    to unrelated taxa.

contamlevel=species Taxonomic level to use for contam2/unique2/unique3.

NOTE: unique2/unique3/contam2 require an index.

 

printdepth=f        (depth) Print average depth of sketch kmers; intended

                    for shotgun read input.

printdepth2=f       (depth2) Print depth compensating for genomic repeats.

                    Requires reference sketches to be generated with depth.

actualdepth=t       If this is false, the raw average count is printed.

                    If true, the raw average (observed depth) is converted 

                    to estimated actual depth (including uncovered areas).

printvolume=f       (volume) Product of average depth and matches.

 

Sorting:

sortbyscore=t       Default sort order is by score.

sortbydepth=f       Include depth as a factor in sort order.

sortbydepth2=f      Include depth2 as a factor in sort order.

sortbyvolume=f      Include volume as a factor in sort order.

 

Other output parameters:

minhits=3           (hits) Only report records with at least this many hits.

minani=0            (ani) Only report records with at least this ANI (0-1).

minwkid=0.0001      (wkid) Only report records with at least this WKID (0-1).

anifromwkid=t       Calculate ani from wkid.  If false, use kid.

minbases=0          Ignore ref sketches of sequences shortert than this.

minsizeratio=0      Don't compare sketches if the smaller genome is less than

                    this fraction of the size of the larger.

records=20          Report at most this many best-matching records.

color=family        Color records at the family level.  color=f will disable.

                    Colors work in most terminals but may cause odd characters

                    to appear in text editors.  So, color defaults to f if 

                    writing to a file.

format=2            Available formats:

                      1: Original format; deprecated.

                      2: Customizable with different columns.

                      3: 3-column format for alltoall: query, ref, ANI.

intersect=f         Print sketch intersections.  delta=f is suggested.

 

Metadata parameters (optional, for the query sketch header):

taxid=-1            Set the NCBI taxid.

imgid=-1            Set the IMG id.

spid=-1             Set the sequencing project id (JGI-specific).

name=               Set the name (taxname).

name0=              Set name0 (normally the first sequence header).

fname=              Set fname (normally the file name).

meta_=              Set an arbitrary metadata field.

                    For example, meta_Month=March.

 

Other parameters:

requiredmeta=       (rmeta) Required optional metadata values.  For example:

                    rmeta=subunit:ssu,source:silva

bannedmeta=         (bmeta) Forbidden optional metadata values.

 

Java Parameters:

-Xmx                This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                    -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

-eoom               This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.

-da                 Disable assertions.

 

For more detailed information, please read /bbmap/docs/guides/BBSketchGuide.txt.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

 

ラン

データベースに問い合わせるので、コマンド実行時はネットに繋がっている必要がある。

 

de novoアセンブリして得たFASTAをJGIのtaxonomy serverに問い合わせ、top hitを返す。

sendsketch.sh in=contigs.fa

 

Schizosaccharomyces_pombeのゲノムを指定してみる。

sendsketch.sh in=contigs.fa

sketchファイルで結果は問い合わされ、 数秒で結果が返ってくる。

$ sendsketch.sh in=Schizosaccharomyces_pombe.ASM294v2_chr1.fa 

Max memory cannot be determined.  Attempting to use 3200 MB.

If this fails, please add the -Xmx flag (e.g. -Xmx24g) to your command, 

or run this program qsubbed or from a qlogin session on Genepool, or set ulimit to an appropriate value.

Adding /usr/local/Cellar/bbtools/37.77/resources/blacklist_refseq_species_300.sketch to blacklist.

Loaded 1 sketch in 0.897 seconds.

 

Query: Schizosaccharomyces_pombe.ASM294v2_chr1.fa DB: RefSeq SketchLen: 11783 Seqs: 1 Bases: 5579033 gSize: 5438526

WKID KID ANI Complt Contam Matches Unique noHit TaxID gSize gSeqs taxName

100.00% 44.56% 100.00% 44.56% 0.00% 6890 6876 0 4896 12075791 5 Schizosaccharomyces pombe

  • KID (or kmer identity) is simply the percent of the kmers that matched
  • WKID is the weighted kmer identity, which has been adjusted to compensate for genome size.
  • matches is the number of matching kmers between the query and reference sketch

マッチしたk-merの数やゲノムサイズで正規化した割合のほか、ANIも出力される。RefseqのS.pombeのゲノムと100%合致した。複数見つかると、この例(リンク)のように出力される。プロテインの検索ならamino=tをつける。

 

NGSデータも高速に分析できる。例えばilluminaシーケンスデータ(fastq)の先頭100,000リードを分析する。

sendsketch.sh in=reads.fq reads=100k

10万リードなら実行後5秒ほどで結果が返ってくる。

 

SRAのERR024951分析結果。

WKID KID ANI Complt Contam Matches Unique noHit TaxID gSize gSeqs taxName

96.98% 10.51% 99.89% 100.00% 11.81% 2693 0 19893 1328378 5525669 10 Klebsiella pneumoniae MGH 46

89.50% 9.99% 99.60% 100.00% 12.34% 2558 0 19893 1714265 5622728 77 Klebsiella sp. KGM-IMP216

76.76% 8.37% 99.04% 100.00% 13.96% 2143 0 19893 749535 5551073 181 Klebsiella sp. MS 92-3

75.45% 7.94% 98.98% 100.00% 14.39% 2034 0 19893 1581108 5357324 136 Klebsiella sp. HMSC16A12

74.33% 7.96% 98.93% 100.00% 14.37% 2038 0 19893 1734033 5446858 96 Klebsiella sp. AA405

76.50% 7.82% 99.03% 100.00% 14.51% 2002 0 19893 672443 5178890 50 Klebsiella sp. S1

73.74% 7.80% 98.90% 100.00% 14.53% 1997 0 19893 665944 5392639 79 Klebsiella sp. 4_1_44FAA

10.75% 7.88% 92.21% 64.81% 11.60% 2252 1 16262 571 69447295 6331 Klebsiella oxytoca

12.90% 6.53% 92.82% 45.70% 9.71% 2134 78 12803 573 100548885 235290 Klebsiella pneumoniae

11.23% 6.48% 92.36% 100.00% 15.85% 1660 10 19893 244366 29454296 9171 Klebsiella variicola

30.23% 3.39% 95.74% 100.00% 18.94% 867 1 19893 1581110 5708662 58 Klebsiella sp. HMSC16C06

9.65% 6.77% 91.85% 61.22% 12.70% 1967 4 15827 28901 72649107 129637 Salmonella enterica

30.26% 3.16% 95.75% 100.00% 19.17% 810 0 19893 1182695 5400686 4 Klebsiella sp. KTE92

27.92% 3.28% 95.47% 100.00% 19.05% 841 0 19893 1581064 5909378 158 Klebsiella sp. HMSC25G12

28.93% 3.17% 95.59% 100.00% 19.16% 813 0 19893 2057803 5497888 187 Klebsiella sp. T11

29.37% 3.14% 95.64% 100.00% 19.19% 803 0 19893 469608 5414594 19 Klebsiella sp. 1_1_55

8.60% 6.59% 91.47% 66.12% 13.74% 1855 1 16749 615 66301158 13410 Serratia marcescens

28.03% 3.13% 95.48% 100.00% 19.20% 802 0 19893 2054603 5722692 365 Klebsiella sp. E-Nf3

28.29% 3.12% 95.51% 100.00% 19.21% 798 0 19893 1140114 5534298 57 Klebsiella sp. D5A

27.26% 3.14% 95.38% 100.00% 19.19% 804 0 19893 2054602 5816040 103 Klebsiella sp. F-Nf9

 

 

目的によってはblastの置き換えになるツールです。コンタミネーションの簡易チェックにも使えると思います。しかし、本当にBBtoolsはなんでもできますね。汎用性が高い上に、高速かつ堅牢というすごいツールです。ぜひ導入してみてください。

 

追記

bamやsamでも動作します。

 

引用

Tool: BBSketch - A Tool for Rapid Sequence Comparison

BBSketch - A Tool for Rapid Sequence Comparison