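2019 7/5 fixed installation error, added tweet
2020 1/5 added tweet; 2/4 added tweet; 2/20 corrected commands; 2/27 updated help, corrected commands; 5/5 added tweet
2022/04/15 added command examples
sourmash is a toolbox for creating, comparing, and manipulating MinHash sketches of genomic data. A MinHash sketch provides a lightweight way to store "signatures" of large DNA or RNA sequence collections and to compare or search them using the Jaccard index. MinHash sketches can be used to identify samples, find similar samples, identify datasets with shared sequences, and build phylogenetic trees (Ondov et al., 2015). sourmash provides a command-line script, a Python library, and a CPython module for MinHash sketches.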
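To make the idea of a MinHash sketch and a Jaccard estimate concrete, here is a minimal pure-Python sketch. It is not sourmash's implementation: the function names are hypothetical, SHA-1 stands in for the MurmurHash that sourmash uses, and strand handling (canonical k-mers) is omitted for brevity.

```python
import hashlib


def kmers(seq, k):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (SHA-1 here is an illustrative stand-in)."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch(seq, k=5, num=20):
    """Bottom-num MinHash sketch: keep only the num smallest hash values."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:num]


def jaccard_exact(a, b):
    """Exact Jaccard index of two sets."""
    return len(a & b) / len(a | b)


def jaccard_estimate(sk_a, sk_b, num=20):
    """Estimate the Jaccard index from two bottom-num sketches.

    Take the num smallest hashes of the merged sketches and count how many
    of them are present in both inputs.
    """
    merged = sorted(set(sk_a) | set(sk_b))[:num]
    shared = set(sk_a) & set(sk_b)
    return sum(1 for h in merged if h in shared) / len(merged)


a = "ACGTACGTTGCA"
b = "ACGTACGTAGCA"
print("exact   :", jaccard_exact(kmers(a, 5), kmers(b, 5)))
print("estimate:", jaccard_estimate(sketch(a), sketch(b)))
```

With sequences this short, every hash fits in the sketch, so the estimate coincides with the exact value. On real genomes the sketch is far smaller than the k-mer set and the estimate is approximate.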
Introduction to sourmash
https://sourmash.readthedocs.io/en/latest/
Documentation
https://sourmash.readthedocs.io/en/latest/tutorials.html
Using sourmash: a practical guide
https://github.com/dib-lab/sourmash/blob/master/doc/using-sourmash-a-guide.md
A sourmash tutorial (the most thorough explanation)
2017-dibsi-metagenomics/sourmash.md at master · dib-lab/2017-dibsi-metagenomics · GitHub
New sourmash, 2.2.0!
— Luiz Irber (@luizirber) October 1, 2019
New features: Parallelized compare and optimized compute for 10x data by @pranuvemuri!
PyPI:
$ pip install sourmash
Bioconda:
$ conda install -c conda-forge -c bioconda sourmash
More info: https://t.co/aFpFsJh6ic
sourmash 3.0.0 released!
— Luiz Irber (@luizirber) January 4, 2020
Officially migrated from C++ to #RustLang 🎉
Release notes: https://t.co/DjJQG9LI9w
PyPI: https://t.co/tO5xLeCeGg
$ pip install sourmash==3.0.0
Bioconda:
$ conda install -c conda-forge -c bioconda sourmash=3.0.0
My favorite? `sourmash compare` is even faster now, we went from 95s in 2.3.1 to 5s in the 3.x series.
— Luiz Irber (@luizirber) January 16, 2020
Next release: working on improving `sourmash compute`
(This graph is from https://t.co/gFRTXXCU7H, where we track sourmash performance) pic.twitter.com/NeZFyVpvBR
sourmash 3.2.0 released!
— Luiz Irber (@luizirber) January 27, 2020
First #rustlang code @ctb ever wrote + awesome perf optimizations by @kloetzl!
Release notes: https://t.co/C8HEGhfgGs
PyPI: https://t.co/KYKBGqzJt8
$ pip install sourmash==3.2.0
Bioconda:
$ conda install -c conda-forge -c bioconda sourmash=3.2.0
2020 5/5 v3.3
sourmash 3.3.0 released!
— Titus Brown (@ctitusbrown) May 4, 2020
search and gather on databases are much faster!
Release notes: https://t.co/fEUmjlgnTj
PyPI: https://t.co/8f6cCVcluB
$ pip install sourmash==3.3.0
Bioconda:
$ conda install -c conda-forge -c bioconda sourmash=3.3.0
7/25
sourmash 3.4.1 released!
— Titus Brown (@ctitusbrown) July 24, 2020
UX improvements for clustering output and taxonomic classification; minor bug fixes.
More info: https://t.co/0LZFcWQ1Pn
PyPI: https://t.co/AQ4NGmwx49
% pip install sourmash==3.4.1
Bioconda:
% conda install -c conda-forge -c bioconda sourmash=3.4.1
Installation
Tested on macOS 10.14 in an anaconda3-4.3.30 environment.
Main repository: GitHub
#Install in an Anaconda environment. If no version is specified, version 2 gets installed. (link)
mamba install -c bioconda sourmash -y
> sourmash -h
$ sourmash -h
== This is sourmash version 3.2.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
Compute, compare, manipulate, and analyze MinHash sketches of DNA sequences.
Usage instructions:
Basic operations
sourmash compare --help compare genomes
sourmash compute --help compute genome signatures
sourmash gather --help search a metagenome signature for multiple non-
overlapping matches
sourmash index --help index signatures for rapid search
sourmash info --help display sourmash version and other information
sourmash plot --help plot distance matrix made by 'compare'
sourmash search --help search a signature against a list of signatures
Taxonomic operations
sourmash lca --help
Manipulate signature files
sourmash sig --help
sourmash signature --help
Operations on storage
sourmash storage --help
Options:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-q, --quiet don't print citation information
> sourmash search -h
$ sourmash search -h
== This is sourmash version 3.2.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
usage: search [-h] [--traverse-directory] [-q] [--threshold T]
[--save-matches FILE] [--best-only] [-n N] [--containment]
[--ignore-abundance] [--scaled FLOAT] [-o FILE] [-k K]
[--protein] [--no-protein] [--dayhoff] [--no-dayhoff] [--hp]
[--no-hp] [--dna] [--no-dna]
query databases [databases ...]
positional arguments:
query query signature
databases signatures/SBTs to search
optional arguments:
-h, --help show this help message and exit
--traverse-directory search all signatures underneath directories
-q, --quiet suppress non-error output
--threshold T minimum threshold for reporting matches; default=0.08
--save-matches FILE output matching signatures to the specified file
--best-only report only the best match (with greater speed)
-n N, --num-results N
number of results to report
--containment evaluate containment rather than similarity
--ignore-abundance do NOT use k-mer abundances if present; note: has no
effect if --containment is specified
--scaled FLOAT downsample query to this scaled factor (yields greater
speed)
-o FILE, --output FILE
output CSV containing matches to this file
-k K, --ksize K k-mer size; default=31
--protein choose a protein signature; by default, a nucleotide
signature is used
--no-protein do not choose a protein signature
--dayhoff build Dayhoff-encoded amino acid signatures
--no-dayhoff do not build Dayhoff-encoded amino acid signatures
--hp, --hydrophobic-polar
build hydrophobic-polar-encoded amino acid signatures
--no-hp, --no-hydrophobic-polar
do not build hydrophobic-polar-encoded amino acid
signatures
--dna, --rna choose a nucleotide signature (default: True)
--no-dna, --no-rna do not choose a nucleotide signature
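The difference between the default similarity and the `--containment` option can be illustrated on plain k-mer sets. The following is a minimal sketch with hypothetical function names and a made-up sequence. It works on exact sets rather than sketches, and the 0.08 cutoff simply mirrors the `--threshold` default shown in the help above.

```python
def kmers(seq, k=4):
    """Return the set of all substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Similarity: shared k-mers divided by all k-mers seen in either set."""
    return len(a & b) / len(a | b)


def containment(query, subject):
    """Containment: fraction of the query's k-mers that occur in the subject."""
    return len(query & subject) / len(query)


# Made-up example: the query is a fragment cut out of a longer "genome".
genome = "ACGTTGCAAGGCTTAACCGGATCGATTACG"
fragment = genome[5:20]

q, s = kmers(fragment), kmers(genome)
threshold = 0.08  # same value as the documented --threshold default

for name, value in (("similarity", jaccard(q, s)), ("containment", containment(q, s))):
    status = "reported" if value >= threshold else "filtered out"
    print(f"{name}: {value:.2f} ({status})")
```

Because the fragment lies entirely inside the genome, its containment is 1.0 while the Jaccard similarity is lower. This is why containment tends to be the more informative measure when the query is much smaller than the database entry.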
> sourmash compute -h
$ sourmash compute -h
== This is sourmash version 3.2.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
usage: compute [-h] [-k KSIZES] [-n NUM_HASHES] [--track-abundance]
[--scaled SCALED] [--protein] [--no-protein] [--dayhoff]
[--no-dayhoff] [--hp] [--no-hp] [--dna] [--no-dna]
[--input-is-protein] [--seed SEED] [--input-is-10x]
[--count-valid-reads COUNT_VALID_READS]
[--write-barcode-meta-csv WRITE_BARCODE_META_CSV]
[-p PROCESSES] [--save-fastas SAVE_FASTAS]
[--line-count LINE_COUNT] [--rename-10x-barcodes FILE]
[--barcodes-file FILE] [-f] [-o OUTPUT] [--singleton]
[--merge FILE] [--name-from-first] [--randomize] [-q]
[--check-sequence] [--license LICENSE]
filenames [filenames ...]
Required arguments:
filenames file(s) of sequences
Miscellaneous options:
-h, --help show this help message and exit
-q, --quiet suppress non-error output
--check-sequence complain if input sequence is invalid
--license LICENSE signature license. Currently only CC0 is supported.
Sketching options:
-k KSIZES, --ksizes KSIZES
comma-separated list of k-mer sizes; default=21,31,51
-n NUM_HASHES, --num-hashes NUM_HASHES
number of hashes to use in each sketch; default=500
--track-abundance track k-mer abundances in the generated signature
--scaled SCALED choose number of hashes as 1 in FRACTION of input
k-mers
--protein choose a protein signature; by default, a nucleotide
signature is used
--no-protein do not choose a protein signature
--dayhoff build Dayhoff-encoded amino acid signatures
--no-dayhoff do not build Dayhoff-encoded amino acid signatures
--hp, --hydrophobic-polar
build hydrophobic-polar-encoded amino acid signatures
--no-hp, --no-hydrophobic-polar
do not build hydrophobic-polar-encoded amino acid
signatures
--dna, --rna choose a nucleotide signature (default: True)
--no-dna, --no-rna do not choose a nucleotide signature
--input-is-protein Consume protein sequences - no translation needed.
--seed SEED seed used by MurmurHash; default=42
10x options:
--input-is-10x input is 10x single cell output folder
--count-valid-reads COUNT_VALID_READS
a barcode is only considered a valid barcode read and
its signature is written if number of umis are greater
than count-valid-reads. It is used to weed out cell
barcodes with few umis that might have been due to
false rna enzyme reactions
--write-barcode-meta-csv WRITE_BARCODE_META_CSV
for each of the unique barcodes, Write to a given
path, number of reads and number of umis per barcode.
-p PROCESSES, --processes PROCESSES
number of processes to use for reading 10x bam file
--save-fastas SAVE_FASTAS
save merged fastas for all the unique barcodes to
{CELL_BARCODE}.fasta in the absolute path given by
this flag; by default, fastas are not saved
--line-count LINE_COUNT
line count for each bam shard
--rename-10x-barcodes FILE
Tab-separated file mapping 10x barcode name to new
name, e.g. with channel or cell annotation label
--barcodes-file FILE Barcodes file if the input is unfiltered 10x bam file
File handling options:
-f, --force recompute signatures even if the file exists
-o OUTPUT, --output OUTPUT
output computed signatures to this file
--singleton compute a signature for each sequence record
individually
--merge FILE, --name FILE
merge all input files into one signature file with the
specified name
--name-from-first name the signature generated from each file after the
first record in the file
--randomize shuffle the list of input filenames randomly
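The `-n/--num-hashes` and `--scaled` options select hashes in two different ways: a fixed number of the smallest hashes, or every hash that falls below a cutoff. The sketch below illustrates the distinction under stated assumptions: the function names are hypothetical, SHA-1 is a stand-in for MurmurHash, the sequences are random, and the cutoff is taken as 2^64 divided by the scaled value.

```python
import hashlib
import random

MAX_HASH = 2 ** 64  # hashes are treated as 64-bit unsigned integers


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (illustrative stand-in hash)."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def sketch_num(kmer_set, num=500):
    """Fixed-size sketch: the num smallest hashes, whatever the input size."""
    return sorted(hash_kmer(km) for km in kmer_set)[:num]


def sketch_scaled(kmer_set, scaled=1000):
    """Scaled sketch: every hash below MAX_HASH / scaled, so size grows with input."""
    cutoff = MAX_HASH // scaled
    return sorted(h for h in (hash_kmer(km) for km in kmer_set) if h < cutoff)


rng = random.Random(42)


def random_kmers(length, k=21):
    """k-mers of a random DNA sequence (made-up test data)."""
    seq = "".join(rng.choice("ACGT") for _ in range(length))
    return {seq[i:i + k] for i in range(length - k + 1)}


small = random_kmers(5_000)
large = small | random_kmers(20_000)  # superset of the small k-mer set

for name, ks in (("small", small), ("large", large)):
    print(name, "num:", len(sketch_num(ks)), "scaled:", len(sketch_scaled(ks, scaled=100)))
```

The fixed-size sketch stays at 500 hashes for both inputs, whereas the scaled sketch grows with the number of distinct k-mers. The scaled sketch of a subset is also always a subset of the scaled sketch of the whole, which is the property that makes containment-style comparisons between datasets of very different sizes possible.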
> sourmash compare -h
$ sourmash compare -h
== This is sourmash version 3.2.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
usage: compare [-h] [-q] [-k K] [--protein] [--no-protein] [--dayhoff]
[--no-dayhoff] [--hp] [--no-hp] [--dna] [--no-dna] [-o F]
[--ignore-abundance] [--traverse-directory] [-f] [--csv F]
[-p N]
signatures [signatures ...]
positional arguments:
signatures list of signatures to compare
optional arguments:
-h, --help show this help message and exit
-q, --quiet suppress non-error output
-k K, --ksize K k-mer size; default=31
--protein choose a protein signature; by default, a nucleotide
signature is used
--no-protein do not choose a protein signature
--dayhoff build Dayhoff-encoded amino acid signatures
--no-dayhoff do not build Dayhoff-encoded amino acid signatures
--hp, --hydrophobic-polar
build hydrophobic-polar-encoded amino acid signatures
--no-hp, --no-hydrophobic-polar
do not build hydrophobic-polar-encoded amino acid
signatures
--dna, --rna choose a nucleotide signature (default: True)
--no-dna, --no-rna do not choose a nucleotide signature
-o F, --output F file to which output will be written; default is
terminal (standard output)
--ignore-abundance do NOT use k-mer abundances even if present
--traverse-directory compare all signatures underneath directories
-f, --force continue past errors in file loading
--csv F write matrix to specified file in CSV format (with
column headers)
-p N, --processes N Number of processes to use to calculate similarity
> sourmash plot -h
$ sourmash plot -h
== This is sourmash version 3.2.2. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==
usage: plot [-h] [--pdf] [--labels] [--labeltext LABELTEXT] [--indices]
[--vmin VMIN] [--vmax VMAX] [--subsample N] [--subsample-seed S]
[-f] [--output-dir DIR]
distances
positional arguments:
distances output from "sourmash compare"
optional arguments:
-h, --help show this help message and exit
--pdf output PDF; default is PNG
--labels show sample labels on dendrogram/matrix
--labeltext LABELTEXT
filename containing list of labels (overrides
signature names)
--indices show sample indices but not labels
--vmin VMIN lower limit of heatmap scale; default=0.000000
--vmax VMAX upper limit of heatmap scale; default=1.000000
--subsample N randomly downsample to this many samples, max
--subsample-seed S random seed for --subsample; default=1
-f, --force forcibly plot non-distance matrices
--output-dir DIR directory for output plots
Usage
Only a basic workflow built around sourmash compute is shown here, followed by compare and plot. The fast-GeP test genomes are used (GitHub).
cd fast-GeP-master/Examples/E.faecalis/input_files/
Run sourmash
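#fastq: compute signatures (sourmash 3.x; default k-mer sizes are 21,31,51)
sourmash compute *.fq.gz
#compare all signatures at k=31; save the matrix to "output" and a CSV copy to matrix.csv
sourmash compare *.sig -o output -k 31 --csv matrix.csv
#plot the comparison as PDF with sample labels
sourmash plot --pdf --labels output
#fasta: sourmash 4.0 and later use "sketch" in place of "compute"
sourmash sketch dna *.fasta
sourmash compare *.sig -o output -k 31 --csv matrix.csv
sourmash plot --pdf --labels output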
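The matrix.csv written by `sourmash compare --csv` can also be inspected without plotting. The following is a minimal sketch that assumes the layout is one header row of sample names followed by one row of similarity values per sample, in line with the "with column headers" note in the help text. The sample names and values below are invented for illustration and are not results from the fast-GeP data.

```python
import csv
import io


def load_matrix(handle):
    """Read a header row of labels followed by rows of similarity values."""
    reader = csv.reader(handle)
    labels = next(reader)
    rows = [[float(x) for x in row] for row in reader if row]
    return labels, rows


def most_similar_pair(labels, rows):
    """Return (label_i, label_j, similarity) for the highest off-diagonal value."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if best is None or rows[i][j] > best[2]:
                best = (labels[i], labels[j], rows[i][j])
    return best


# Hypothetical file content standing in for matrix.csv.
EXAMPLE = """sampleA,sampleB,sampleC
1.0,0.82,0.10
0.82,1.0,0.12
0.10,0.12,1.0
"""

labels, rows = load_matrix(io.StringIO(EXAMPLE))
print(most_similar_pair(labels, rows))
```

To run it on a real result, replace the `io.StringIO(EXAMPLE)` argument with a file opened via `open("matrix.csv", newline="")`.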
Output
Citation
sourmash: a library for MinHash sketching of DNA
C. Titus Brown, Luiz Irber
Journal of Open Source Software, 1(5), 27
Large-scale sequence comparisons with sourmash.
Pierce NT, Irber L, Reiter T, Brooks P, Brown CT
F1000Res. 2019 Jul 4;8:1006
Related tools