macでインフォマティクス


A collection of informatics notes on HTS (NGS).

sourmash: fast comparison of genomic DNA / protein using MinHash

2019 7/5 fixed installation error, added tweet

2020 1/5 added tweet, 2/4 added tweet, 2/20 fixed command, 2/27 updated help, fixed command, 5/5 added tweet

2022/04/15 added command examples

 

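sourmash is a toolbox for creating, comparing, and manipulating MinHash sketches of genomic data. MinHash sketches provide a lightweight way to store "signatures" of large DNA or RNA sequence collections and to compare or search them using the Jaccard index. They can be used to identify samples, find similar samples, identify datasets with shared sequences, and build phylogenetic trees (Ondov et al., 2015). sourmash provides a command-line script, a Python library, and a CPython module for MinHash sketches.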
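To make the idea of comparing sketches with the Jaccard index concrete, here is a minimal standard-library sketch of a bottom-k MinHash. It is not sourmash's implementation: the function names are hypothetical, SHA-1 stands in for MurmurHash, and k-mers are not canonicalized (no reverse complement handling).

```python
import hashlib


def kmers(seq, k):
    """Return the set of all k-mers in seq (no reverse-complement handling)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer; SHA-1 is a stand-in for MurmurHash."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def bottom_sketch(seq, k=5, num_hashes=50):
    """Keep only the num_hashes smallest hash values as the sketch."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:num_hashes]


def jaccard_estimate(sketch_a, sketch_b, num_hashes=50):
    """Estimate the Jaccard index from two bottom-k sketches."""
    merged = sorted(set(sketch_a) | set(sketch_b))[:num_hashes]
    if not merged:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for h in merged if h in shared) / len(merged)


def jaccard_exact(seq_a, seq_b, k=5):
    """Exact Jaccard index of the two k-mer sets, for comparison."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    a = "ACGTTGCAAGGCTTA"
    b = "ACGTTGCAAGGCTTC"
    print(jaccard_exact(a, b))
    print(jaccard_estimate(bottom_sketch(a), bottom_sketch(b)))
```

For sequences this short the sketch holds every k-mer, so the estimate matches the exact value; with real genomes the sketch is far smaller than the k-mer set and the estimate is approximate.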

 

Introduction to sourmash

https://sourmash.readthedocs.io/en/latest/

document

https://sourmash.readthedocs.io/en/latest/tutorials.html

Using sourmash: a practical guide

https://github.com/dib-lab/sourmash/blob/master/doc/using-sourmash-a-guide.md

A sourmash tutorial (the most thorough explanation)

2017-dibsi-metagenomics/sourmash.md at master · dib-lab/2017-dibsi-metagenomics · GitHub

 

 

2020 5/5 v3.3

 7/25

 

 

Installation

Tested in an anaconda3-4.3.30 environment on macOS 10.14.

Main repository: GitHub

# Install in an Anaconda environment. Without pinning a version, version 2 gets installed. (link)
mamba install -c bioconda sourmash -y

> sourmash -h

$ sourmash -h

 

== This is sourmash version 3.2.2. ==

== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

 

Compute, compare, manipulate, and analyze MinHash sketches of DNA sequences.

 

Usage instructions:

    Basic operations

        sourmash compare --help   compare genomes

        sourmash compute --help   compute genome signatures

        sourmash gather --help    search a metagenome signature for multiple non-

                                  overlapping matches

        sourmash index --help     index signatures for rapid search

        sourmash info --help      display sourmash version and other information

        sourmash plot --help      plot distance matrix made by 'compare'

        sourmash search --help    search a signature against a list of signatures

 

    Taxonomic operations

        sourmash lca --help

 

    Manipulate signature files

        sourmash sig --help

        sourmash signature --help

 

    Operations on storage

        sourmash storage --help

 

Options:

  -h, --help     show this help message and exit

  -v, --version  show program's version number and exit

  -q, --quiet    don't print citation information

> sourmash search -h

$ sourmash search -h

 

== This is sourmash version 3.2.2. ==

== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

 

usage:  search [-h] [--traverse-directory] [-q] [--threshold T]

               [--save-matches FILE] [--best-only] [-n N] [--containment]

               [--ignore-abundance] [--scaled FLOAT] [-o FILE] [-k K]

               [--protein] [--no-protein] [--dayhoff] [--no-dayhoff] [--hp]

               [--no-hp] [--dna] [--no-dna]

               query databases [databases ...]

 

positional arguments:

  query                 query signature

  databases             signatures/SBTs to search

 

optional arguments:

  -h, --help            show this help message and exit

  --traverse-directory  search all signatures underneath directories

  -q, --quiet           suppress non-error output

  --threshold T         minimum threshold for reporting matches; default=0.08

  --save-matches FILE   output matching signatures to the specified file

  --best-only           report only the best match (with greater speed)

  -n N, --num-results N

                        number of results to report

  --containment         evaluate containment rather than similarity

  --ignore-abundance    do NOT use k-mer abundances if present; note: has no

                        effect if --containment is specified

  --scaled FLOAT        downsample query to this scaled factor (yields greater

                        speed)

  -o FILE, --output FILE

                        output CSV containing matches to this file

  -k K, --ksize K       k-mer size; default=31

  --protein             choose a protein signature; by default, a nucleotide

                        signature is used

  --no-protein          do not choose a protein signature

  --dayhoff             build Dayhoff-encoded amino acid signatures

  --no-dayhoff          do not build Dayhoff-encoded amino acid signatures

  --hp, --hydrophobic-polar

                        build hydrophobic-polar-encoded amino acid signatures

  --no-hp, --no-hydrophobic-polar

                        do not build hydrophobic-polar-encoded amino acid

                        signatures

  --dna, --rna          choose a nucleotide signature (default: True)

  --no-dna, --no-rna    do not choose a nucleotide signature

> sourmash compute -h

$ sourmash compute -h

 

== This is sourmash version 3.2.2. ==

== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

 

usage:  compute [-h] [-k KSIZES] [-n NUM_HASHES] [--track-abundance]

                [--scaled SCALED] [--protein] [--no-protein] [--dayhoff]

                [--no-dayhoff] [--hp] [--no-hp] [--dna] [--no-dna]

                [--input-is-protein] [--seed SEED] [--input-is-10x]

                [--count-valid-reads COUNT_VALID_READS]

                [--write-barcode-meta-csv WRITE_BARCODE_META_CSV]

                [-p PROCESSES] [--save-fastas SAVE_FASTAS]

                [--line-count LINE_COUNT] [--rename-10x-barcodes FILE]

                [--barcodes-file FILE] [-f] [-o OUTPUT] [--singleton]

                [--merge FILE] [--name-from-first] [--randomize] [-q]

                [--check-sequence] [--license LICENSE]

                filenames [filenames ...]

 

Required arguments:

  filenames             file(s) of sequences

 

Miscellaneous options:

  -h, --help            show this help message and exit

  -q, --quiet           suppress non-error output

  --check-sequence      complain if input sequence is invalid

  --license LICENSE     signature license. Currently only CC0 is supported.

 

Sketching options:

  -k KSIZES, --ksizes KSIZES

                        comma-separated list of k-mer sizes; default=21,31,51

  -n NUM_HASHES, --num-hashes NUM_HASHES

                        number of hashes to use in each sketch; default=500

  --track-abundance     track k-mer abundances in the generated signature

  --scaled SCALED       choose number of hashes as 1 in FRACTION of input

                        k-mers

  --protein             choose a protein signature; by default, a nucleotide

                        signature is used

  --no-protein          do not choose a protein signature

  --dayhoff             build Dayhoff-encoded amino acid signatures

  --no-dayhoff          do not build Dayhoff-encoded amino acid signatures

  --hp, --hydrophobic-polar

                        build hydrophobic-polar-encoded amino acid signatures

  --no-hp, --no-hydrophobic-polar

                        do not build hydrophobic-polar-encoded amino acid

                        signatures

  --dna, --rna          choose a nucleotide signature (default: True)

  --no-dna, --no-rna    do not choose a nucleotide signature

  --input-is-protein    Consume protein sequences - no translation needed.

  --seed SEED           seed used by MurmurHash; default=42

 

10x options:

  --input-is-10x        input is 10x single cell output folder

  --count-valid-reads COUNT_VALID_READS

                        a barcode is only considered a valid barcode read and

                        its signature is written if number of umis are greater

                        than count-valid-reads. It is used to weed out cell

                        barcodes with few umis that might have been due to

                        false rna enzyme reactions

  --write-barcode-meta-csv WRITE_BARCODE_META_CSV

                        for each of the unique barcodes, Write to a given

                        path, number of reads and number of umis per barcode.

  -p PROCESSES, --processes PROCESSES

                        number of processes to use for reading 10x bam file

  --save-fastas SAVE_FASTAS

                        save merged fastas for all the unique barcodes to

                        {CELL_BARCODE}.fasta in the absolute path given by

                        this flag; by default, fastas are not saved

  --line-count LINE_COUNT

                        line count for each bam shard

  --rename-10x-barcodes FILE

                        Tab-separated file mapping 10x barcode name to new

                        name, e.g. with channel or cell annotation label

  --barcodes-file FILE  Barcodes file if the input is unfiltered 10x bam file

 

File handling options:

  -f, --force           recompute signatures even if the file exists

  -o OUTPUT, --output OUTPUT

                        output computed signatures to this file

  --singleton           compute a signature for each sequence record

                        individually

  --merge FILE, --name FILE

                        merge all input files into one signature file with the

                        specified name

  --name-from-first     name the signature generated from each file after the

                        first record in the file

  --randomize           shuffle the list of input filenames randomly
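The --num-hashes and --scaled options listed above select two different kinds of sketch: a fixed number of smallest hashes, or every hash that falls below a threshold covering roughly 1/SCALED of the hash space. The following is a rough standard-library illustration of the scaled variant and of a containment calculation, not sourmash's own code; the function names are hypothetical and SHA-1 again stands in for MurmurHash.

```python
import hashlib

HASH_SPACE = 2 ** 64  # size of the 64-bit hash space used in this sketch


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer; SHA-1 is a stand-in for MurmurHash."""
    digest = hashlib.sha1(kmer.encode()).digest()
    return int.from_bytes(digest[:8], "big")


def scaled_sketch(seq, k=21, scaled=1000):
    """Keep every k-mer hash below HASH_SPACE / scaled (about 1 in `scaled`)."""
    threshold = HASH_SPACE // scaled
    hashes = (hash_kmer(seq[i:i + k]) for i in range(len(seq) - k + 1))
    return {h for h in hashes if h < threshold}


def containment(query, subject):
    """Fraction of the query sketch that is also present in the subject sketch."""
    return len(query & subject) / len(query) if query else 0.0


if __name__ == "__main__":
    genome = "ACGTTGCAAGGCTTACCGGATATTCGCGA"
    fragment = genome[5:20]
    # scaled=1 keeps every hash, which makes this toy example deterministic.
    full = scaled_sketch(genome, k=5, scaled=1)
    frag = scaled_sketch(fragment, k=5, scaled=1)
    print(containment(frag, full))
```

Because a scaled sketch grows with the input, a small query can be checked for containment in a much larger dataset, which a fixed-size sketch handles poorly.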

> sourmash compare -h

$ sourmash compare -h

 

== This is sourmash version 3.2.2. ==

== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

 

usage:  compare [-h] [-q] [-k K] [--protein] [--no-protein] [--dayhoff]

                [--no-dayhoff] [--hp] [--no-hp] [--dna] [--no-dna] [-o F]

                [--ignore-abundance] [--traverse-directory] [-f] [--csv F]

                [-p N]

                signatures [signatures ...]

 

positional arguments:

  signatures            list of signatures to compare

 

optional arguments:

  -h, --help            show this help message and exit

  -q, --quiet           suppress non-error output

  -k K, --ksize K       k-mer size; default=31

  --protein             choose a protein signature; by default, a nucleotide

                        signature is used

  --no-protein          do not choose a protein signature

  --dayhoff             build Dayhoff-encoded amino acid signatures

  --no-dayhoff          do not build Dayhoff-encoded amino acid signatures

  --hp, --hydrophobic-polar

                        build hydrophobic-polar-encoded amino acid signatures

  --no-hp, --no-hydrophobic-polar

                        do not build hydrophobic-polar-encoded amino acid

                        signatures

  --dna, --rna          choose a nucleotide signature (default: True)

  --no-dna, --no-rna    do not choose a nucleotide signature

  -o F, --output F      file to which output will be written; default is

                        terminal (standard output)

  --ignore-abundance    do NOT use k-mer abundances even if present

  --traverse-directory  compare all signatures underneath directories

  -f, --force           continue past errors in file loading

  --csv F               write matrix to specified file in CSV format (with

                        column headers)

  -p N, --processes N   Number of processes to use to calculate similarity

> sourmash plot -h

$ sourmash plot -h

 

== This is sourmash version 3.2.2. ==

== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

 

usage:  plot [-h] [--pdf] [--labels] [--labeltext LABELTEXT] [--indices]

             [--vmin VMIN] [--vmax VMAX] [--subsample N] [--subsample-seed S]

             [-f] [--output-dir DIR]

             distances

 

positional arguments:

  distances             output from "sourmash compare"

 

optional arguments:

  -h, --help            show this help message and exit

  --pdf                 output PDF; default is PNG

  --labels              show sample labels on dendrogram/matrix

  --labeltext LABELTEXT

                        filename containing list of labels (overrides

                        signature names)

  --indices             show sample indices but not labels

  --vmin VMIN           lower limit of heatmap scale; default=0.000000

  --vmax VMAX           upper limit of heatmap scale; default=1.000000

  --subsample N         randomly downsample to this many samples, max

  --subsample-seed S    random seed for --subsample; default=1

  -f, --force           forcibly plot non-distance matrices

  --output-dir DIR      directory for output plots

 

 

Usage

Only the basic workflow starting from sourmash compute is covered here, using the test genomes from fast-GeP (GitHub).

cd fast-GeP-master/Examples/E.faecalis/input_files/

Run sourmash

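# FASTQ input: build signatures, compare them at k=31, then plot the matrix
sourmash compute *.fq.gz
sourmash compare *.sig -o output -k 31 --csv matrix.csv
sourmash plot --pdf --labels output

# FASTA input: "sketch dna" is the newer replacement for "compute" and is not available in the v3.2.2 shown above
sourmash sketch dna *fasta
sourmash compare *.sig -o output -k 31 --csv matrix.csv
sourmash plot --pdf --labels output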
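The matrix.csv written by --csv can also be inspected directly. The sketch below assumes the file holds one header row of signature names followed by a square matrix of similarity values with no row labels; that layout is an assumption to check against your own output, and the function names are hypothetical.

```python
import csv
import io


def load_similarity_csv(handle):
    """Read a header row of labels followed by rows of float similarities."""
    reader = csv.reader(handle)
    labels = next(reader)
    matrix = [[float(x) for x in row] for row in reader if row]
    return labels, matrix


def most_similar_pair(labels, matrix):
    """Return (label_a, label_b, similarity) for the best off-diagonal pair."""
    best = None
    for i in range(len(labels)):
        for j in range(i + 1, len(labels)):
            if best is None or matrix[i][j] > best[2]:
                best = (labels[i], labels[j], matrix[i][j])
    return best


if __name__ == "__main__":
    # Toy data standing in for matrix.csv; replace with open("matrix.csv").
    toy = io.StringIO("a,b,c\n1.0,0.2,0.9\n0.2,1.0,0.1\n0.9,0.1,1.0\n")
    labels, matrix = load_similarity_csv(toy)
    print(most_similar_pair(labels, matrix))
```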

 

Output


 

Citation

sourmash: a library for MinHash sketching of DNA

C. Titus Brown, Luiz Irber

Journal of Open Source Software, 1(5), 27

 


Large-scale sequence comparisons with sourmash.

Pierce NT, Irber L, Reiter T, Brooks P, Brown CT

F1000Res. 2019 Jul 4;8:1006

 

Related tools