macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

複数のゲノム間で保存された遺伝子クラスターを同定する Spacedust

2024/10/10 追記

 

 メタゲノミクスは、環境やヒトに関連するマイクロバイオーム研究に革命をもたらした。しかし、生物学的プロセスや分子機能が知られているタンパク質の数は限られており、これが大きなボトルネックとなっている。原核生物やウイルスでは、同じ生物学的プロセスに関与する遺伝子を、保存された遺伝子クラスターとして共局在させておくことが進化上好ましい。逆に、遺伝子近傍の保存は機能的関連を示す。Spacedustは、保存された遺伝子クラスターを系統的にde novoで発見するためのツールである。相同タンパク質のマッチを見つけるために、Foldseekとの高速で高感度な構造比較を使用する。部分的に保存されたクラスターは、新しいクラスタリングと次数保存のP値を用いて検出される。1,308の細菌ゲノムの全対全分析でSpacedustの感度を実証し、420万遺伝子の58%を含む72,843の保存遺伝子クラスターを同定した。また、特殊なツールによってアノテーションされた抗ウイルス防御系クラスターの95%を回復した。Spacedustの高い感度とスピードは、配列決定された膨大な数の細菌、古細菌、ウイルスゲノムの大規模アノテーションを促進するだろう。

 

レポジトリより

Spacedustは、相同性と遺伝子近傍の保存性に基づいて、複数のゲノム間で保存された遺伝子クラスターを同定するためのモジュール型ツールキットである。Foldseekの高速かつ高感度な構造比較とMMseqs2の相同性検索機能を利用している。ゲノム間の相同ヒットの集合を集約し、agglomerativeな階層クラスタリングアルゴリズムを用いて、各ゲノム間で保存された遺伝子近傍を持つヒットのクラスタを同定するという新しいアプローチを導入している。SpacedustはC++で実装されたGPLv3ライセンスのオープンソースソフトウェアで、LinuxmacOSで利用できる。マルチコアで効率的に動作するように設計されている。

 

インストール

mac mini2018でテストした。

依存

構造比較を行うためにはFoldseekが必要。

Github

# static Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/spacedust/spacedust-linux-avx2.tar.gz; tar xvzf spacedust-linux-avx2.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

# static Linux SSE4.1 build (check using: cat /proc/cpuinfo | grep sse4_1)
wget https://mmseqs.com/spacedust/spacedust-linux-sse41.tar.gz; tar xvzf spacedust-linux-sse41.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

# static macOS build (universal binary with SSE4.1/AVX2/M1 NEON)
wget https://mmseqs.com/spacedust/spacedust-osx-universal.tar.gz; tar xvzf spacedust-osx-universal.tar.gz; export PATH=$(pwd)/spacedust/bin/:$PATH

#Other precompiled binaries

https://mmseqs.com/spacedust/ 

> spacedust -h

$ spacedust -h

Spacedust is a tool to discover conserved gene clusters between any pairs of contig/genomes

 

spacedust Version: 16b020301be952232d6eb2eaa2cd2ad0933d68b0

© Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de>

 

usage: spacedust <command> [<args>]

 

Main workflows for database input/output

  createsetdb       Create sequence set database from FASTA (and GFF3) input of contigs/genomes

  aa2foldseek       Map a sequence DB to reference foldseek DB

  clusterdb         Build a searchable cluster database from sequence DB or foldseek structure DB

  clustersearch     Find clusters of colocalized hits between any query-target sequence/profile set database

 

Special-purpose utilities

  besthitbyset      For each set of sequences compute the best element and update p-value

  combinehits       Group hits and compute a combined E-value for each query-target set pair

  summarizeresults  Summarize results on clustered hits

  clusterhits       Find clusters of hits by agglomerative hierarchical clustering and compute their clustering and ordering P-values

>spacedust createsetdb -h

$ spacedust createsetdb -h

usage: spacedust createsetdb <i:fastaFile1[.gz|bz2]> ... <i:fastaFileN[.gz|bz2]> <o:setDB> <tmpDir> [options]

By Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de> & Milot Mirdita <milot@mirdita.de>

options: misc:                       

--dbtype INT                 Database type 0: auto, 1: amino acid 2: nucleotides [0]

--shuffle BOOL               Shuffle input database [0]

--createdb-mode INT          Createdb mode 0: copy data, 1: soft link data and write new index (works only with single line fasta/q) [0]

--id-offset INT              Numeric ids in index file are offset by this value [0]

--min-length INT             Minimum codon number in open reading frames [30]

--max-length INT             Maximum codon number in open reading frames [32734]

--max-gaps INT               Maximum number of codons with gaps or unknown residues before an open reading frame is rejected [2147483647]

--contig-start-mode INT      Contig start can be 0: incomplete, 1: complete, 2: both [2]

--contig-end-mode INT        Contig end can be 0: incomplete, 1: complete, 2: both [2]

--orf-start-mode INT         Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle) [1]

--forward-frames STR         Comma-separated list of frames on the forward strand to be extracted [1,2,3]

--reverse-frames STR         Comma-separated list of frames on the reverse strand to be extracted [1,2,3]

--translation-table INT      1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE

                              9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL

                              15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL

                              23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA

                               29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]

--translate INT              Translate ORF to amino acid [0]

--use-all-table-starts BOOL  Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]

--add-orf-stop BOOL          Add stop codon '*' at complete start and end [0]

--gff-type STR               Comma separated list of feature types in the GFF file to select

--stat STR                   One of: linecount, mean, min, max, doolittle, charges, seqlen, firstline

--tsv BOOL                   Return output in TSV format [0]

--gff-dir STR                Path to gff dir file

common:                     

--compressed INT             Write compressed output [0]

-v INT                       Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--threads INT                Number of CPU-cores used (all by default) [128]

expert:                     

--write-lookup INT           write .lookup file containing mapping from internal id, fasta id and file number [1]

--create-lookup INT          Create database lookup file (can be very large) [0]

 

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

> spacedust aa2foldseek -h

$ spacedust aa2foldseek -h

usage: spacedust aa2foldseek <i:inputDB> <i:targetDB> <tmpDir> [options]

By Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de> & Milot Mirdita <milot@mirdita.de>

options: prefilter:                   

--seed-sub-mat TWIN           Substitution matrix file for k-mer generation [aa:VTML80.out,nucl:nucleotide.out]

-s FLOAT                      Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

-k INT                        k-mer length (0: automatically set to optimum) [0]

--k-score TWIN                k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

--alph-size TWIN              Alphabet size (range 2-21) [aa:21,nucl:5]

--max-seqs INT                Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [10]

--split INT                   Split input into N equally distributed chunks. 0: set the best split automatically [0]

--split-mode INT              0: split target db; 1: split query db; 2: auto, depending on main memory [2]

--split-memory-limit BYTE     Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--comp-bias-corr INT          Correct for locally biased amino acid composition (range 0-1) [1]

--comp-bias-corr-scale FLOAT  Correct for locally biased amino acid composition (range 0-1) [1.000]

--diag-score BOOL             Use ungapped diagonal scoring during prefilter [1]

--exact-kmer-matching INT     Extract only exact k-mers for matching (range 0-1) [1]

--mask INT                    Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [1]

--mask-prob FLOAT             Mask sequences is probablity is above threshold [0.900]

--mask-lower-case INT         Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

--min-ungapped-score INT      Accept only matches with ungapped alignment score above threshold [15]

--add-self-matches BOOL       Artificially add entries of queries with themselves (for clustering) [0]

--spaced-kmer-mode INT        0: use consecutive positions in k-mers; 1: use spaced k-mers [1]

--spaced-kmer-pattern STR     User-specified spaced k-mer pattern

--local-tmp STR               Path where some of the temporary files will be created

align:                       

-c FLOAT                      List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.900]

--cov-mode INT                0: coverage of query and target

                               1: coverage of target

                               2: coverage of query

                               3: target seq. length has to be at least x% of query length

                               4: query seq. length has to be at least x% of target length

                               5: short seq. needs to be at least x% of the other seq. length [0]

-a BOOL                       Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

--alignment-mode INT          How to compute the alignment:

                               0: automatic

                               1: only score and end_pos

                               2: also start_pos and cov

                               3: also seq.id

                               4: only ungapped alignment [0]

--alignment-output-mode INT   How to compute the alignment:

                               0: automatic

                               1: only score and end_pos

                               2: also start_pos and cov

                               3: also seq.id

                               4: only ungapped alignment

                               5: score only (output) cluster format [0]

--wrapped-scoring BOOL        Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

-e DOUBLE                     List matches below this E-value (range 0.0-inf) [1.000E-03]

--min-seq-id FLOAT            List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.900]

--min-aln-len INT             Minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT             0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT                 Show up to this many alternative alignments [0]

--max-rejected INT            Maximum rejected alignments before alignment calculation for a query is stopped [2147483647]

--max-accept INT              Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]

--score-bias FLOAT            Score bias when computing SW alignment (in bits) [0.000]

--realign BOOL                Compute more conservative, shorter alignments (scores and E-values not changed) [0]

--realign-score-bias FLOAT    Additional bias when computing realignment [-0.200]

--realign-max-seqs INT        Maximum number of results to return in realignment [2147483647]

--corr-score-weight FLOAT     Weight of backtrace correlation score that is added to the alignment score [0.000]

--gap-open TWIN               Gap open cost [aa:11,nucl:5]

--gap-extend TWIN             Gap extension cost [aa:1,nucl:2]

--zdrop INT                   Maximal allowed difference between score values before alignment is truncated  (nucleotide alignment only) [40]

profile:                     

--pca                         Pseudo count admixture strength

--pcb                         Pseudo counts: Neff at half of maximum admixture (range 0.0-inf)

misc:                        

--taxon-list STR              Taxonomy ID, possibly multiple values separated by ','

--stat STR                    One of: linecount, mean, min, max, doolittle, charges, seqlen, firstline

--tsv BOOL                    Return output in TSV format [0]

common:                      

--sub-mat TWIN                Substitution matrix file [aa:blosum62.out,nucl:nucleotide.out]

--max-seq-len INT             Maximum sequence length [65535]

--db-load-mode INT            Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

--threads INT                 Number of CPU-cores used (all by default) [128]

--compressed INT              Write compressed output [0]

-v INT                        Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--remove-tmp-files BOOL       Delete temporary files [0]

 

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

> spacedust clusterdb -h

$ spacedust clusterdb -h

usage: spacedust clusterdb <i:inputDB> <tmpDir> [options]

By Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de> & Milot Mirdita <milot@mirdita.de>

options: prefilter:                      

--seed-sub-mat TWIN              Substitution matrix file for k-mer generation [aa:VTML80.out,nucl:nucleotide.out]

-s FLOAT                         Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]

-k INT                           k-mer length (0: automatically set to optimum) [0]

--k-score TWIN                   k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

--alph-size TWIN                 Alphabet size (range 2-21) [aa:21,nucl:5]

--max-seqs INT                   Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [300]

--split INT                      Split input into N equally distributed chunks. 0: set the best split automatically [0]

--split-mode INT                 0: split target db; 1: split query db; 2: auto, depending on main memory [2]

--split-memory-limit BYTE        Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--comp-bias-corr INT             Correct for locally biased amino acid composition (range 0-1) [1]

--comp-bias-corr-scale FLOAT     Correct for locally biased amino acid composition (range 0-1) [1.000]

--diag-score BOOL                Use ungapped diagonal scoring during prefilter [1]

--exact-kmer-matching INT        Extract only exact k-mers for matching (range 0-1) [0]

--mask INT                       Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [1]

--mask-prob FLOAT                Mask sequences is probablity is above threshold [0.900]

--mask-lower-case INT            Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

--min-ungapped-score INT         Accept only matches with ungapped alignment score above threshold [15]

--add-self-matches BOOL          Artificially add entries of queries with themselves (for clustering) [0]

--spaced-kmer-mode INT           0: use consecutive positions in k-mers; 1: use spaced k-mers [1]

--spaced-kmer-pattern STR        User-specified spaced k-mer pattern

--local-tmp STR                  Path where some of the temporary files will be created

align:                          

-c FLOAT                         List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.800]

--cov-mode INT                   0: coverage of query and target

                                  1: coverage of target

                                  2: coverage of query

                                  3: target seq. length has to be at least x% of query length

                                  4: query seq. length has to be at least x% of target length

                                  5: short seq. needs to be at least x% of the other seq. length [0]

-a BOOL                          Add backtrace string (convert to alignments with mmseqs convertalis module) [0]

--alignment-mode INT             How to compute the alignment:

                                  0: automatic

                                  1: only score and end_pos

                                  2: also start_pos and cov

                                  3: also seq.id

                                  4: only ungapped alignment [0]

--alignment-output-mode INT      How to compute the alignment:

                                  0: automatic

                                  1: only score and end_pos

                                  2: also start_pos and cov

                                  3: also seq.id

                                  4: only ungapped alignment

                                  5: score only (output) cluster format [0]

--wrapped-scoring BOOL           Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

-e DOUBLE                        List matches below this E-value (range 0.0-inf) [1.000E-03]

--min-seq-id FLOAT               List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.700]

--min-aln-len INT                Minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT                0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT                    Show up to this many alternative alignments [0]

--max-rejected INT               Maximum rejected alignments before alignment calculation for a query is stopped [2147483647]

--max-accept INT                 Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]

--score-bias FLOAT               Score bias when computing SW alignment (in bits) [0.000]

--realign BOOL                   Compute more conservative, shorter alignments (scores and E-values not changed) [0]

--realign-score-bias FLOAT       Additional bias when computing realignment [-0.200]

--realign-max-seqs INT           Maximum number of results to return in realignment [2147483647]

--corr-score-weight FLOAT        Weight of backtrace correlation score that is added to the alignment score [0.000]

--gap-open TWIN                  Gap open cost [aa:11,nucl:5]

--gap-extend TWIN                Gap extension cost [aa:1,nucl:2]

--zdrop INT                      Maximal allowed difference between score values before alignment is truncated  (nucleotide alignment only) [40]

clust:                          

--cluster-mode INT               0: Set-Cover (greedy)

                                  1: Connected component (BLASTclust)

                                  2,3: Greedy clustering by sequence length (CDHIT) [0]

--max-iterations INT             Maximum depth of breadth first search in connected component clustering [1000]

--similarity-type INT            Type of score used for clustering. 1: alignment score 2: sequence identity [2]

--single-step-clustering BOOL    Switch from cascaded to simple clustering workflow [0]

--cluster-steps INT              Cascaded clustering steps from 1 to -s [3]

--cluster-reassign BOOL          Cascaded clustering can cluster sequence that do not fulfill the clustering criteria.

                                  Cluster reassignment corrects these errors [0]

kmermatcher:                    

--weights STR                    Weights used for cluster priorization

--cluster-weight-threshold FLOAT Weight threshold used for cluster priorization [0.900]

--kmer-per-seq INT               k-mers per sequence [21]

--kmer-per-seq-scale TWIN        Scale k-mer per sequence based on sequence length as kmer-per-seq val + scale x seqlen [aa:0.000,nucl:0.200]

--adjust-kmer-len BOOL           Adjust k-mer length based on specificity (only for nucleotides) [0]

--hash-shift INT                 Shift k-mer hash initialization [67]

--include-only-extendable BOOL   Include only extendable [0]

--ignore-multi-kmer BOOL         Skip k-mers occurring multiple times (>=2) [0]

profile:                        

--pca                            Pseudo count admixture strength

--pcb                            Pseudo counts: Neff at half of maximum admixture (range 0.0-inf)

misc:                           

--taxon-list STR                 Taxonomy ID, possibly multiple values separated by ','

--rescore-mode INT               Rescore diagonals with:

                                  0: Hamming distance

                                  1: local alignment (score only)

                                  2: local alignment

                                  3: global alignment

                                  4: longest alignment fulfilling window quality criterion [0]

--search-mode INT                0: sequence search with MMseqs2, 1: structure comparison with Foldseek [0]

common:                         

--sub-mat TWIN                   Substitution matrix file [aa:blosum62.out,nucl:nucleotide.out]

--max-seq-len INT                Maximum sequence length [65535]

--db-load-mode INT               Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

--threads INT                    Number of CPU-cores used (all by default) [128]

--compressed INT                 Write compressed output [0]

-v INT                           Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--remove-tmp-files BOOL          Delete temporary files [0]

--force-reuse BOOL               Reuse tmp filse in tmp/latest folder ignoring parameters and version changes [0]

--mpi-runner STR                 Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")

expert:                         

--filter-hits BOOL               Filter hits by seq.id. and coverage [0]

--sort-results INT               Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]

 

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

> spacedust clustersearch -h

$ spacedust clustersearch -h

usage: spacedust clustersearch <i:querySetDB> <i:targetSetDB> <o:output[.tsv]> <tmpDir> [options]

By Ruoshi Zhang <ruoshi.zhang@mpinat.mpg.de> & Milot Mirdita <milot@mirdita.de>

options: prefilter:                      

--comp-bias-corr INT             Correct for locally biased amino acid composition (range 0-1) [1]

--comp-bias-corr-scale FLOAT     Correct for locally biased amino acid composition (range 0-1) [1.000]

--add-self-matches BOOL          Artificially add entries of queries with themselves (for clustering) [0]

--seed-sub-mat TWIN              Substitution matrix file for k-mer generation [aa:VTML80.out,nucl:nucleotide.out]

-s FLOAT                         Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [5.700]

-k INT                           k-mer length (0: automatically set to optimum) [0]

--k-score TWIN                   k-mer threshold for generating similar k-mer lists [seq:2147483647,prof:2147483647]

--alph-size TWIN                 Alphabet size (range 2-21) [aa:21,nucl:5]

--max-seqs INT                   Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [300]

--split INT                      Split input into N equally distributed chunks. 0: set the best split automatically [0]

--split-mode INT                 0: split target db; 1: split query db; 2: auto, depending on main memory [2]

--split-memory-limit BYTE        Set max memory per split. E.g. 800B, 5K, 10M, 1G. Default (0) to all available system memory [0]

--diag-score BOOL                Use ungapped diagonal scoring during prefilter [1]

--exact-kmer-matching INT        Extract only exact k-mers for matching (range 0-1) [0]

--mask INT                       Mask sequences in k-mer stage: 0: w/o low complexity masking, 1: with low complexity masking [1]

--mask-prob FLOAT                Mask sequences is probablity is above threshold [0.900]

--mask-lower-case INT            Lowercase letters will be excluded from k-mer search 0: include region, 1: exclude region [0]

--min-ungapped-score INT         Accept only matches with ungapped alignment score above threshold [15]

--spaced-kmer-mode INT           0: use consecutive positions in k-mers; 1: use spaced k-mers [1]

--spaced-kmer-pattern STR        User-specified spaced k-mer pattern

--local-tmp STR                  Path where some of the temporary files will be created

--disk-space-limit BYTE          Set max disk space to use for reverse profile searches. E.g. 800B, 5K, 10M, 1G. Default (0) to all available disk space in the temp folder [0]

align:                          

-a BOOL                          Add backtrace string (convert to alignments with mmseqs convertalis module) [1]

--alignment-mode INT             How to compute the alignment:

                                  0: automatic

                                  1: only score and end_pos

                                  2: also start_pos and cov

                                  3: also seq.id

                                  4: only ungapped alignment [2]

--alignment-output-mode INT      How to compute the alignment:

                                  0: automatic

                                  1: only score and end_pos

                                  2: also start_pos and cov

                                  3: also seq.id

                                  4: only ungapped alignment

                                  5: score only (output) cluster format [0]

--wrapped-scoring BOOL           Double the (nucleotide) query sequence during the scoring process to allow wrapped diagonal scoring around end and start [0]

-e DOUBLE                        List matches below this E-value (range 0.0-inf) [1.000E+01]

--min-seq-id FLOAT               List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]

--min-aln-len INT                Minimum alignment length (range 0-INT_MAX) [30]

--seq-id-mode INT                0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT                    Show up to this many alternative alignments [0]

-c FLOAT                         List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.800]

--cov-mode INT                   0: coverage of query and target

                                  1: coverage of target

                                  2: coverage of query

                                  3: target seq. length has to be at least x% of query length

                                  4: query seq. length has to be at least x% of target length

                                  5: short seq. needs to be at least x% of the other seq. length [2]

--max-rejected INT               Maximum rejected alignments before alignment calculation for a query is stopped [2147483647]

--max-accept INT                 Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]

--score-bias FLOAT               Score bias when computing SW alignment (in bits) [0.000]

--realign BOOL                   Compute more conservative, shorter alignments (scores and E-values not changed) [0]

--realign-score-bias FLOAT       Additional bias when computing realignment [-0.200]

--realign-max-seqs INT           Maximum number of results to return in realignment [2147483647]

--corr-score-weight FLOAT        Weight of backtrace correlation score that is added to the alignment score [0.000]

--gap-open TWIN                  Gap open cost [aa:11,nucl:5]

--gap-extend TWIN                Gap extension cost [aa:1,nucl:2]

--zdrop INT                      Maximal allowed difference between score values before alignment is truncated  (nucleotide alignment only) [40]

--gap-pc INT                     Pseudo count for calculating position-specific gap opening penalties [10]

--exhaustive-search-filter INT   Filter result during search: 0: do not filter, 1: filter [0]

profile:                        

--pca                            Pseudo count admixture strength

--pcb                            Pseudo counts: Neff at half of maximum admixture (range 0.0-inf)

--mask-profile INT               Mask query sequence of profile using tantan [0,1] [1]

--e-profile DOUBLE               Include sequences matches with < E-value thr. into the profile (>=0.0) [1.000E-03]

--wg BOOL                        Use global sequence weighting for profile calculation [0]

--filter-msa INT                 Filter msa: 0: do not filter, 1: filter [1]

--filter-min-enable INT          Only filter MSAs with more than N sequences, 0 always filters [0]

--max-seq-id FLOAT               Reduce redundancy of output MSA using max. pairwise sequence identity [0.0,1.0] [0.900]

--qid STR                        Reduce diversity of output MSAs using min.seq. identity with query sequences [0.0,1.0]

                                  Alternatively, can be a list of multiple thresholds:

                                  E.g.: 0.15,0.30,0.50 to defines filter buckets of ]0.15-0.30] and ]0.30-0.50] [0.0]

--qsc FLOAT                      Reduce diversity of output MSAs using min. score per aligned residue with query sequences [-50.0,100.0] [-20.000]

--cov FLOAT                      Filter output MSAs using min. fraction of query residues covered by matched sequences [0.0,1.0] [0.000]

--diff INT                       Filter MSAs by selecting most diverse set of sequences, keeping at least this many seqs in each MSA block of length 50 [1000]

--pseudo-cnt-mode INT            use 0: substitution-matrix or 1: context-specific pseudocounts [0]

--num-iterations INT             Number of iterative profile search iterations [1]

--exhaustive-search BOOL         For bigger profile DB, run iteratively the search by greedily swapping the search results [0]

--lca-search BOOL                Efficient search for LCA candidates [0]

misc:                           

--taxon-list STR                 Taxonomy ID, possibly multiple values separated by ','

--rescore-mode INT               Rescore diagonals with:

                                  0: Hamming distance

                                  1: local alignment (score only)

                                  2: local alignment

                                  3: global alignment

                                  4: longest alignment fulfilling window quality criterion [0]

--allow-deletion BOOL            Allow deletions in a MSA [0]

--min-length INT                 Minimum codon number in open reading frames [30]

--max-length INT                 Maximum codon number in open reading frames [32734]

--max-gaps INT                   Maximum number of codons with gaps or unknown residues before an open reading frame is rejected [2147483647]

--contig-start-mode INT          Contig start can be 0: incomplete, 1: complete, 2: both [2]

--contig-end-mode INT            Contig end can be 0: incomplete, 1: complete, 2: both [2]

--orf-start-mode INT             Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle) [1]

--forward-frames STR             Comma-separated list of frames on the forward strand to be extracted [1,2,3]

--reverse-frames STR             Comma-separated list of frames on the reverse strand to be extracted [1,2,3]

--translation-table INT          1) CANONICAL, 2) VERT_MITOCHONDRIAL, 3) YEAST_MITOCHONDRIAL, 4) MOLD_MITOCHONDRIAL, 5) INVERT_MITOCHONDRIAL, 6) CILIATE

                                  9) FLATWORM_MITOCHONDRIAL, 10) EUPLOTID, 11) PROKARYOTE, 12) ALT_YEAST, 13) ASCIDIAN_MITOCHONDRIAL, 14) ALT_FLATWORM_MITOCHONDRIAL

                                  15) BLEPHARISMA, 16) CHLOROPHYCEAN_MITOCHONDRIAL, 21) TREMATODE_MITOCHONDRIAL, 22) SCENEDESMUS_MITOCHONDRIAL

                                  23) THRAUSTOCHYTRIUM_MITOCHONDRIAL, 24) PTEROBRANCHIA_MITOCHONDRIAL, 25) GRACILIBACTERIA, 26) PACHYSOLEN, 27) KARYORELICT, 28) CONDYLOSTOMA

                                   29) MESODINIUM, 30) PERTRICH, 31) BLASTOCRITHIDIA [1]

--translate INT                  Translate ORF to amino acid [0]

--use-all-table-starts BOOL      Use all alternatives for a start codon in the genetic table, if false - only ATG (AUG) [0]

--id-offset INT                  Numeric ids in index file are offset by this value [0]

--add-orf-stop BOOL              Add stop codon '*' at complete start and end [0]

--sequence-overlap INT           Overlap between sequences [0]

--sequence-split-mode INT        Sequence split mode 0: copy data, 1: soft link data and write new index, [1]

--headers-split-mode INT         Header split mode: 0: split position, 1: original header [0]

--search-type INT                Search type 0: auto 1: amino acid, 2: translated, 3: nucleotide, 4: translated nucleotide alignment [0]

--start-sens FLOAT               Start sensitivity [4.000]

--sens-steps INT                 Number of search steps performed from --start-sens to -s [1]

--simple-best-hit BOOL           Update the p-value by a single best hit, or by best and second best hits [1]

--suboptimal-hits INT            Include sub-optimal hits of query sequence up to a factor of its E-value. 0: only include one best hit [0]

--alpha FLOAT                    Set alpha for combining p-values during aggregation [1.000]

--aggregation-mode INT           Combined P-values computed from 0: multi-hit, 1: minimum of all P-values, 2: product-of-P-values, 3: truncated product [0]

--filter-self-match BOOL         Remove hits between the same set. 0: do not filter, 1: filter [0]

--multihit-pval FLOAT            Multihit P-value threshold for cluster match output [0.010]

--cluster-pval FLOAT             Clustering and Ordering P-value threshold for cluster match output [0.010]

--max-gene-gap INT               Maximum number of genes allowed between two clusters to merge [3]

--cluster-size INT               Minimum number of genes to define cluster [2]

--cluster-use-weight BOOL        Use weighting factor to calculate cluster match score [0]

--profile-cluster-search BOOL    Perform profile(target)-sequence searches in clustersearch [0]

--search-mode INT                0: sequence search with MMseqs2, 1: structure comparison with Foldseek [0]

common:                         

--sub-mat TWIN                   Substitution matrix file [aa:blosum62.out,nucl:nucleotide.out]

--max-seq-len INT                Maximum sequence length [65535]

--db-load-mode INT               Database preload mode 0: auto, 1: fread, 2: mmap, 3: mmap+touch [0]

--threads INT                    Number of CPU-cores used (all by default) [128]

--compressed INT                 Write compressed output [0]

-v INT                           Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]

--mpi-runner STR                 Use MPI on compute cluster with this MPI command (e.g. "mpirun -np 42")

--force-reuse BOOL               Reuse tmp filse in tmp/latest folder ignoring parameters and version changes [0]

--remove-tmp-files BOOL          Delete temporary files [0]

expert:                         

--filter-hits BOOL               Filter hits by seq.id. and coverage [0]

--sort-results INT               Sort results: 0: no sorting, 1: sort by E-value (Alignment) or seq.id. (Hamming) [0]

--create-lookup INT              Create database lookup file (can be very large) [0]

--chain-alignments INT           Chain overlapping alignments [0]

--merge-query INT                Combine ORFs/split sequences to a single entry [1]

--strand INT                     Strand selection only works for DNA/DNA search 0: reverse, 1: forward, 2: both [1]

--db-output BOOL                 Return a result DB instead of a text file [1]

 

references:

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11), 1026-1028 (2017)

 

 

実行方法

ランするには、以下のいずれかが必要。(1)ゲノムまたはコンティグ配列と、それとペアのGFF3(.gff3)フォーマットファイル(タンパク質コード配列(CDS)を結合したもの(Prodigalを実行する)(例;uvig_120081.fnaとuvig_120081.gff)。(2)Prodigalのヘッダーフォーマットを持つタンパク質配列のFASTAファイル(.faaまたは.faa.gz)。

 

 

1、入力ゲノムのデータベース setDBの作成

検索の前に、FASTAファイルに含まれるクエリ配列やターゲット配列をcreatesetdbを呼び出してデータベース形式に変換する必要がある。このコマンドは、まずGFF3またはヘッダから抽出したマッピングされたストランドとゲノム座標を含む配列DBを作成し、最後に関連するメタデータを生成する。塩基配列の入力には、引数 --gff-dir に GFF3 ファイルへのパスを指定する。"tmpFolder"を作業ディレクトリとして(ディレクトリが存在しない時は自動で作成)、データベースsetDBの作成。

git clone https://github.com/soedinglab/spacedust.git
cd spacedust/examples/
chmod a-x *gff
#1データベースとする配列のDB作成
ls *gff > GFFlist #GFFリストの作成
spacedust createsetdb genome1.fna [...genomeN.fna] targetDB tmpFolder --gff-dir GFFlist --gff-type CDS

#2 クエリ配列のDB作成。ここではexample/の.faaファイルをクエリとする。2つ以上のfaaファイルも可。
spacedust createsetdb NC_000915.faa queryDB tmpFolder
  • --gff-type    Comma separated list of feature types in the GFF file to select
  • --gff-dir     Path to gff dir file

出力例

 

Foldseekデータベース

Foldseekを使ってタンパク質の構造検索を行うためには、リファレンスFoldseekデータベースをあらかじめダウンロードする。それからspacedust aa2foldseekコマンドを使ってデータベースに変換する。

# Download reference FoldseekDB
foldseek databases Alphafold/UniProt refFoldseekDB tmpFolder

# Convert to structure sequence DB
spacedust aa2foldseek setDB refFoldseekDB tmpFolder

 

2、Spacedustの実行

Spacedustは、まず複数のゲノムに由来するタンパク質コード遺伝子の2つのセット間の全相同性検索/構造比較を行い、次に遺伝子近傍の保存に基づいて相同ヒットのクラスターを見つける。Foldseekを使った構造比較を行うには"--search-mode 1"を指定する。aa2foldseekで構造マップできた配列はFoldseekで検索し、残りはMMseqs2で検索する。MMseqs2とFoldseekの反復検索は、"--num-iterations <INT>"を設定することで、より繊細な検索を行うことができる。

# Search querySetDB against targetSetDB (using MMseqs)
spacedust clustersearch queryDB targetDB result.tsv tmpFolder

# Search querySetDB against targetSetDB turned into profile
spacedust clustersearch queryDB targetDB result.tsv tmpFolder --profile-cluster-search

# Iterative cluster search (like PSI-BLAST) with 2 iterations
spacedust clustersearch queryDB targetDB result.tsv tmpFolder --num-iterations 2

# Search querySetDB against targetSetDB (using Foldseek and MMseqs)
spacedust clustersearch queryDB targetDB result.tsv tmpFolder --search-mode 1
  • -threads     Number of CPU-cores used (all by default) [12]
  • --num-iterations      Number of iterative profile search iterations [1]
  • --rescore-mode   Rescore diagonals with:
                                    0: Hamming distance
                                    1: local alignment (score only)
                                    2: local alignment
                                    3: global alignment
                                    4: longest alignment fulfilling window quality criterion [0]

 

完了するとタブ区切りのテキストファイル(.tsv)が出力される。報告されるクラスターは、1行のサマリー行に続き、クエリーゲノムとターゲットゲノムのペアワイズヒットごとに1行ずつ、複数行で構成される。

excelで開いた。

#で始まるのがサマリー行。

#: a unique cluster ID, query genome accession, target genome accession, cluster match P-value (joint P-value of clustering and ordering), multihit P-value and number of hits in the cluster.

>で始まるのは、MMseqs2のアライメント結果のような形式で、クラスタの個々のメンバーヒット(すなわち、クエリにアライメントされた1つの標的配列)を記述する。

>: query protein ID, target protein ID, besthit P-value, sequence identity, pairwise E-value, query protein start, end and length, target protein start, end and length, alnCigar

 

コメント

間違って二回紹介してしまいましたので片方の記事は削除しました。ご迷惑をお掛けしました。

引用

De novo discovery of conserved gene clusters in microbial genomes with Spacedust

Ruoshi Zhang,  Milot Mirdita,  Johannes Söding

bioRxiv, Posted October 03, 2024.

 

GitHub - soedinglab/spacedust: Discovery of conserved gene clusters in multiple genomes

 

関連