2023-08-01

Spacedust: identifying gene clusters conserved across multiple genomes

gene cluster

From the repository

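Spacedust is a modular toolkit for identifying gene clusters conserved across multiple genomes, based on homology and conservation of gene neighborhood. It uses Foldseek's fast and sensitive structure comparison and MMseqs2's homology search. It introduces a new approach: sets of homologous hits between genomes are aggregated, and an agglomerative hierarchical clustering algorithm identifies clusters of hits whose gene neighborhood is conserved between each pair of genomes. Spacedust is open-source software implemented in C++ under the GPLv3 license and runs on Linux and macOS. It is designed to run efficiently on multiple cores.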
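To make the idea of grouping hits by neighborhood concrete, here is a toy sketch in shell and awk. It is not Spacedust's algorithm. It assumes each hit is reduced to the index of the hit gene on one target genome, and it applies single-linkage agglomerative clustering with a distance cutoff, which in one dimension is the same as sorting the indices and splitting wherever the gap exceeds the cutoff. The function name `group_hits`, the input format, and the `max_gap` parameter are all hypothetical.

```shell
#!/usr/bin/env bash
# Toy illustration only, NOT Spacedust's implementation.
# Input (stdin): one gene index per line (position of a hit gene on a genome).
# Output: one cluster per line, as space-separated gene indices.
group_hits() {
  local max_gap="$1"   # hypothetical cutoff: largest index gap allowed inside a cluster
  sort -n | awk -v gap="$max_gap" '
    NR == 1          { line = $1; prev = $1; next }              # first hit opens a cluster
    $1 - prev <= gap { line = line " " $1; prev = $1; next }     # close enough: extend cluster
                     { print line; line = $1; prev = $1 }        # gap too large: start a new one
    END              { if (NR > 0) print line }                  # flush the last cluster
  '
}

# Hits at gene indices 3, 4, 6 and 20, 21, grouped with a maximum gap of 2
printf '%s\n' 20 4 3 21 6 | group_hits 2
```

With the sample input this prints two clusters, `3 4 6` and `20 21`. A real tool also has to check that the same neighborhood is conserved on the query side and assess significance, which this sketch ignores.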

Installation

Tested on a Mac mini (2018).

Dependencies

Foldseek is required for structure comparison.

GitHub

# static Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/spacedust/spacedust-linux-avx2.tar.gz; tar xvzf spacedust-linux-avx2.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

# static Linux SSE4.1 build (check using: cat /proc/cpuinfo | grep sse4_1)
wget https://mmseqs.com/spacedust/spacedust-linux-sse41.tar.gz; tar xvzf spacedust-linux-sse41.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

# static macOS build (universal binary with SSE4.1/AVX2/M1 NEON)
wget https://mmseqs.com/spacedust/spacedust-osx-universal.tar.gz; tar xvzf spacedust-osx-universal.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

# Other precompiled binaries

https://mmseqs.com/spacedust/
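The choice between the three static builds can be scripted. The sketch below picks an archive name from the OS and the CPU flags and only prints the matching wget command instead of downloading anything. `pick_build` is a hypothetical helper, and the archive names are the ones listed above.

```shell
#!/usr/bin/env bash
# pick_build OS FLAGS: print the archive name of a suitable static build.
# Hypothetical helper; OS is the output of `uname -s`, FLAGS the CPU flag list.
pick_build() {
  local os="$1"
  local flags="$2"
  case "$os" in
    Darwin) echo "spacedust-osx-universal.tar.gz" ;;
    Linux)
      # Pad with spaces so that only whole flag names match
      case " $flags " in
        *" avx2 "*)   echo "spacedust-linux-avx2.tar.gz" ;;
        *" sse4_1 "*) echo "spacedust-linux-sse41.tar.gz" ;;
        *) echo "no matching static build for this CPU" >&2; return 1 ;;
      esac ;;
    *) echo "unsupported OS: $os" >&2; return 1 ;;
  esac
}

os="$(uname -s)"
flags=""
if [ -r /proc/cpuinfo ]; then
  # First "flags" line of /proc/cpuinfo, text after the colon
  flags="$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)"
fi
if archive="$(pick_build "$os" "$flags")"; then
  echo "wget https://mmseqs.com/spacedust/$archive"
fi
```

On machines that match none of the cases, such as Linux on ARM, the script prints a message and leaves the choice to the reader, who can check the other precompiled binaries at the URL above.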

> spacedust

Spacedust is a tool to discover conserved gene clusters between any pairs of contig/genomes

spacedust Version: c2d6f0de4efbe33d65825957cd59e5507091c619

usage: spacedust <command> [<args>]

Main workflows for database input/output

createsetdb Create sequence set database from FASTA (and GFF3) input of contigs/genomes

aa2foldseek Map a sequence DB to reference foldseek DB

clusterdb Build a searchable cluster database from sequence DB or foldseek structure DB

clustersearch Find clusters of colocalized hits between any query-target sequence/profile set database

An extended list of all modules can be obtained by calling 'spacedust -h'.
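As a rough picture of how these workflows chain together, the sketch below builds a query and a target set database and then searches one against the other. All file and database names are placeholders. The createsetdb argument order follows its usage line, but the clustersearch argument order is an assumption that is not shown in this post, so confirm it with `spacedust clustersearch -h`. The script defaults to a dry run that only prints the commands.

```shell
#!/usr/bin/env bash
# Dry-run by default: commands are printed, not executed.
# Set DRY_RUN=0 to actually run them (requires spacedust on PATH and real input files).
DRY_RUN="${DRY_RUN:-1}"

# run CMD ARGS...: hypothetical wrapper that prints or executes a command
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# createsetdb: <fasta files...> <setDB> <tmpDir>
run spacedust createsetdb queryGenome.fna querySetDB tmp
run spacedust createsetdb targetGenome1.fna targetGenome2.fna targetSetDB tmp

# ASSUMPTION: clustersearch takes <querySetDB> <targetSetDB> <result> <tmpDir>
run spacedust clustersearch querySetDB targetSetDB result.tsv tmp
```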

> spacedust createsetdb -h

usage: spacedust createsetdb <i:fastaFile1[.gz|bz2]> ... <i:fastaFileN[.gz|bz2]> <o:setDB> <tmpDir> [options]

By Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de> & Milot Mirdita <milot@mirdita.de>

options:

misc:

--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]

--shuffle BOOL Shuffle input database [0]

--createdb-mode INT Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]

--id-offset INT Numeric ids in index file are offset by this value [0]

--min-length INT Minimum codon number in open reading frames [30]

--max-length INT Maximum codon number in open reading frames [32734]

--max-gaps INT Maximum number of codons with gaps or unknown residues before an open reading frame is rejected [2147483647]

--contig-start-mode INT Contig start can be 0: incomplete, 1: complete, 2: both [2]

--contig-end-mode INT Contig end can be 0: incomplete, 1: complete, 2: both [2]

--orf-start-mode INT Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle) [1]

--forward-frames STR Comma-separated list of frames on the forward strand to be extracted [1,2,3]

--reverse-frames STR Comma-separated list of frames on the reverse strand to be extracted [1,2,3]

--translation-table INT 1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE

9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL

15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL

23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA

29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]

--translate INT Translate ORF to amino acid [0]

--use-all-table-starts BOOL Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]

--add-orf-stop BOOL Add stop codon '*' at complete start and end [0]

--gff-type STR Comma separated list of feature types in the GFF file to select

--stat STR One of: linecount, mean, min, max, doolittle, charges, seqlen, firstline

--tsv BOOL Return output in TSV format [0]

--gff-dir STR Path to gff dir file

common:

--compressed INT Write compressed output [0]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--threads INT Number of CPU-cores used (all by default) [12]

expert:

--write-lookup INT write .lookup file containing mapping from internal id, fasta id and file number [1]

--create-lookup INT Create database lookup file (can be very large) [0]

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
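A typical createsetdb call combines a few of these options. The sketch below only assembles and prints the command line. `createsetdb_cmd` is a hypothetical helper, the file names are placeholders, and it is an assumption that `--gff-dir` expects a text file listing the GFF3 paths. Every flag used appears in the help text above.

```shell
#!/usr/bin/env bash
# createsetdb_cmd SETDB TMPDIR GFF_LIST FASTA...
# Hypothetical helper: prints a createsetdb command line (does not run it).
# Pass "" as GFF_LIST to omit the GFF options. Paths with spaces are not handled.
createsetdb_cmd() {
  local setdb="$1"
  local tmpdir="$2"
  local gff_list="$3"
  shift 3
  # Positional arguments first, in the order given by the usage line
  local cmd="spacedust createsetdb $* $setdb $tmpdir"
  if [ -n "$gff_list" ]; then
    # ASSUMPTION: GFF_LIST is a text file with the paths of the GFF3 files
    cmd="$cmd --gff-dir $gff_list --gff-type CDS"
  fi
  # Translation table 11 is PROKARYOTE in the list above
  cmd="$cmd --translation-table 11 --threads 4"
  printf '%s\n' "$cmd"
}

createsetdb_cmd setDB tmp gff_paths.txt genomeA.fna genomeB.fna
```

Printing the command first makes it easy to check the positional order before running it on large inputs.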

> spacedust aa2foldseek -h

usage: spacedust aa2foldseek <i:inputDB> <i:targetDB> <tmpDir> [options]

By Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de> & Milot Mirdita <milot@mirdita.de>

options:

prefilter:

--seed-sub-mat TWIN Substitution matrix file for k-mer generation [aa:VTML80.out,nucl:nucleotide.out]

-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

-k INT k-mer length (0: automatically set to optimum) [0]

--k-score TWIN k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

--alph-size TWIN Alphabet size (range 2-21) [aa:21,nucl:5]

--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [10]

--split INT Split input into N equally distributed chunks. 0: set the best split automatically [0]

--split-mode INT 0: split target db; 1: split query db; 2: auto, depending on main memory [2]

--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--comp-bias-corr INT Correct for locally biased amino acid composition (range 0-1) [1]

--comp-bias-corr-scale FLOAT Correct for locally biased amino acid composition (range 0-1) [1.000]

--diag-score BOOL Use ungapped diagonal scoring during prefilter [1]

--exact-kmer-matching INT Extract only exact k-mers for matching (range 0-1) [1]

--mask INT Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [1]

--mask-prob FLOAT Mask sequences is probablity is above threshold [0.900]

--mask-lower-case INT Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

--min-ungapped-score INT Accept only matches with ungapped alignment score above threshold [15]

--add-self-matches BOOL Artificially add entries of queries with themselves (for clustering) [0]

--spaced-kmer-mode INT 0: use consecutive positions in k-mers; 1: use spaced k-mers [1]

--spaced-kmer-pattern STR User-specified spaced k-mer pattern

--local-tmp STR Path where some of the temporary files will be created

align:

-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.900]

--cov-mode INT 0: coverage of query and target

1: coverage of target

2: coverage of query

3: target seq. length has to be at least x% of query length

4: query seq. length has to be at least x% of target length

5: short seq. needs to be at least x% of the other seq. length [0]

-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

--alignment-mode INT How to compute the alignment:

0: automatic

1: only score and end_pos

2: also start_pos and cov

3: also seq.id

4: only ungapped alignment [0]

--alignment-output-mode INT How to compute the alignment:

0: automatic

1: only score and end_pos

2: also start_pos and cov

3: also seq.id

4: only ungapped alignment

5: score only (output) cluster format [0]

--wrapped-scoring BOOL Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

-e DOUBLE List matches below this E-value (range 0.0-inf) [1.000E-03]

--min-seq-id FLOAT List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.900]

--min-aln-len INT Minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT Show up to this many alternative alignments [0]

--max-rejected INT Maximum rejected alignments before alignment calculation for a query is stopped [2147483647]

--max-accept INT Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]

--score-bias FLOAT Score bias when computing SW alignment (in bits) [0.000]

--realign BOOL Compute more conservative, shorter alignments (scores and E-values not changed) [0]

--realign-score-bias FLOAT Additional bias when computing realignment [-0.200]

--realign-max-seqs INT Maximum number of results to return in realignment [2147483647]

--corr-score-weight FLOAT Weight of backtrace correlation score that is added to the alignment score [0.000]

--gap-open TWIN Gap open cost [aa:11,nucl:5]

--gap-extend TWIN Gap extension cost [aa:1,nucl:2]

--zdrop INT Maximal allowed difference between score values before alignment is truncated (nucleotide alignment only) [40]

profile:

--pca Pseudo count admixture strength

--pcb Pseudo counts: Neff at half of maximum admixture (range 0.0-inf)

misc:

--taxon-list STR Taxonomy ID, possibly multiple values separated by ','

--stat STR One of: linecount, mean, min, max, doolittle, charges, seqlen, firstline

--tsv BOOL Return output in TSV format [0]

common:

--sub-mat TWIN Substitution matrix file [aa:blosum62.out,nucl:nucleotide.out]

--max-seq-len INT Maximum sequence length [65535]

--db-load-mode INT Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

--threads INT Number of CPU-cores used (all by default) [12]

--compressed INT Write compressed output [0]

-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--remove-tmp-files BOOL Delete temporary files [0]

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)
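A minimal aa2foldseek call, sketched under a few assumptions: `setDB` and `refFoldseekDB` are placeholder names for databases that already exist, `need_tool` is a hypothetical helper, and the check for a `foldseek` binary is only a precaution based on the dependency note above. The thresholds are the defaults from the help text, written out so they are easy to change.

```shell
#!/usr/bin/env bash
# need_tool NAME: hypothetical helper, succeeds only if NAME is on PATH
need_tool() {
  command -v "$1" >/dev/null 2>&1
}

if need_tool spacedust && need_tool foldseek; then
  # Argument order from the usage line: <inputDB> <targetDB> <tmpDir>
  # -c, --min-seq-id and -e are set to their documented defaults
  spacedust aa2foldseek setDB refFoldseekDB tmp -c 0.9 --min-seq-id 0.9 -e 0.001
else
  echo "spacedust and/or foldseek not found on PATH; skipping aa2foldseek" >&2
fi
```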

> spacedust clusterdb -h

usage: spacedust clusterdb <i:inputDB> <tmpDir> [options]

By Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de> & Milot Mirdita <milot@mirdita.de>

options:

prefilter:

--seed-sub-mat TWIN Substitution matrix file for k-mer generation [aa:VTML80.out,nucl:nucleotide.out]

-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

-k INT k-mer length (0: automatically set to optimum) [0]

--k-score TWIN k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

--alph-size TWIN Alphabet size (range 2-21) [aa:21,nucl:5]

--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [300]

--split INT Split input into N equally distributed chunks. 0: set the best split automatically [0]

--split-mode INT 0: split target db; 1: split query db; 2: auto, depending on main memory [2]

--split-memory-limit BYTE Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--comp-bias-corr INT Correct for locally biased amino acid composition (range 0-1) [1]

--comp-bias-corr-scale FLOAT Correct for locally biased amino acid composition (range 0-1) [1.000]

--diag-score BOOL Use ungapped diagonal scoring during prefilter [1]

--exact-kmer-matching INT Extract only exact k-mers for matching (range 0-1) [0]