タンパク質配列データベースをクラスタリングするDiamondのclusterコマンド (DIAMOND DeepClust)

2023/03/02 プレプリント追記、タイトル修正

wikiより

　Diamondは、CD-HITやUClustと同様に、ユーザーが定義したクラスタリング基準に基づいて配列をクラスタリングし、セントロイドまたは代表配列のセットを見つけ、クラスタリング基準対セントロイドが満たされるように各入力配列をセントロイドのクラスタに割り当てる。クラスタリング基準は、ローカルアライメントの配列カバレッジと配列の同一性によって定義される。

wiki

https://github.com/bbuchfink/diamond/wiki/Clustering

インストール

Github

#新しい機能のため、2023年２月現在はまだ、パッケージマネージャで配布されているようなディストリビューションでは利用できない。リリースから最新のバージョンをダウンロードする。

https://github.com/bbuchfink/diamond/releases/tag/v2.1.1

>diamond help #diamond v2.1.1.155

diamond v2.1.1.155 (C) Max Planck Society for the Advancement of Science

Documentation, support and updates available at http://www.diamondsearch.org

Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

Syntax: diamond COMMAND [OPTIONS]

Commands:

makedb Build DIAMOND database from a FASTA file

prepdb Prepare BLAST or FASTA database for use with Diamond

blastp Align amino acid query sequences against a protein reference database

blastx Align DNA query sequences against a protein reference database

cluster Cluster protein sequences

realign Realign clustered sequences against their centroids

recluster Recompute clustering to fix errors

reassign Reassign clustered sequences to the closest centroid

view View DIAMOND alignment archive (DAA) formatted file

help Produce help message

version Display version information

getseq Retrieve sequences from a DIAMOND database file

dbinfo Print information about a DIAMOND database file

test Run regression tests

makeidx Make database index

greedy-vertex-cover Compute greedy vertex cover

General options:

--threads (-p) number of CPU threads

--db (-d) database file

--out (-o) output file

--verbose (-v) verbose console output

--log enable debug log

--quiet disable console output

--header Use header lines in tabular output format

(0/simple/verbose).

Makedb options:

--in input reference file in FASTA format

--taxonmap protein accession to taxid mapping file

--taxonnodes taxonomy nodes.dmp from NCBI

--taxonnames taxonomy names.dmp from NCBI

Aligner/Clustering options:

--evalue (-e) maximum e-value to report alignments (default=0.001)

--tmpdir (-t) directory for temporary files

--comp-based-stats composition based statistics mode (0-4)

--masking masking algorithm (none, seg, tantan=default)

--soft-masking soft masking

--motif-masking softmask abundant motifs (0/1)

--approx-id minimum approx. identity% to report an

alignment/to cluster sequences

--ext Extension mode (banded-fast/banded-slow/full)

--memory-limit (-M) Memory limit in GB (default = 16G)

Aligner options:

--query (-q) input query file

--strand query strands to search (both/minus/plus)

--un file for unaligned queries

--al file or aligned queries

--unfmt format of unaligned query file (fasta/fastq)

--alfmt format of aligned query file (fasta/fastq)

--unal report unaligned queries (0=no, 1=yes)

--max-target-seqs (-k) maximum number of target sequences to report

alignments for (default=25)

--top report alignments within this percentage

range of top alignment score (overrides --max-target-seqs)

--max-hsps maximum number of HSPs per target sequence to

report for each query (default=1)

--range-culling restrict hit culling to overlapping query ranges

--compress compression for output files (0=none, 1=gzip, zstd)

--min-score minimum bit score to report alignments

(overrides e-value setting)

--id minimum identity% to report an alignment

--query-cover minimum query cover% to report an alignment

--subject-cover minimum subject cover% to report an alignment

--faster enable faster mode

--fast enable fast mode

--mid-sensitive enable mid-sensitive mode

--sensitive enable sensitive mode)

--more-sensitive enable more sensitive mode

--very-sensitive enable very sensitive mode

--ultra-sensitive enable ultra sensitive mode

--swipe exhaustive alignment against all database sequences

--iterate iterated search with increasing sensitivity

--global-ranking (-g) number of targets for global ranking

--block-size (-b) sequence block size in billions of letters

(default=2.0)

--index-chunks (-c) number of chunks for index processing (default=4)

--parallel-tmpdir directory for temporary files used by multiprocessing

--gapopen gap open penalty

--gapextend gap extension penalty

--matrix score matrix for protein alignment (default=BLOSUM62)

--custom-matrix file containing custom scoring matrix

--frameshift (-F) frame shift penalty (default=disabled)

--long-reads short for --range-culling --top 10 -F 15

--query-gencode genetic code to use to translate query (see

user manual)

--salltitles include full subject titles in DAA file

--sallseqid include all subject ids in DAA file

--no-self-hits suppress reporting of identical self hits

--taxonlist restrict search to list of taxon ids (comma-separated)

--taxon-exclude exclude list of taxon ids (comma-separated)

--seqidlist filter the database by list of accessions

--skip-missing-seqids ignore accessions missing in the database

Output format options:

--outfmt (-f) output format

0 = BLAST pairwise

5 = BLAST XML

6 = BLAST tabular

100 = DIAMOND alignment archive (DAA)

101 = SAM

Value 6 may be followed by a space-separated list of these keywords:

qseqid means Query Seq - id

qlen means Query sequence length

sseqid means Subject Seq - id

sallseqid means All subject Seq - id(s), separated by a ';'

slen means Subject sequence length

qstart means Start of alignment in query

qend means End of alignment in query

sstart means Start of alignment in subject

send means End of alignment in subject

qseq means Aligned part of query sequence

qseq_translated means Aligned part of query sequence (translated)

full_qseq means Query sequence

full_qseq_mate means Query sequence of the mate

sseq means Aligned part of subject sequence

full_sseq means Subject sequence

evalue means Expect value

bitscore means Bit score

corrected_bitscore means Bit score corrected for edge effects

score means Raw score

length means Alignment length

pident means Percentage of identical matches

approx_pident means Approximate percentage of identical matches

nident means Number of identical matches

mismatch means Number of mismatches

positive means Number of positive - scoring matches

gapopen means Number of gap openings

gaps means Total number of gaps

ppos means Percentage of positive - scoring matches

qframe means Query frame

btop means Blast traceback operations(BTOP)

cigar means CIGAR string

staxids means unique Subject Taxonomy ID(s), separated by a ';'

(in numerical order)

sscinames means unique Subject Scientific Name(s), separated by a ';'

sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

skingdoms means unique Subject Kingdom(s), separated by a ';'

sphylums means unique Subject Phylum(s), separated by a ';'

stitle means Subject Title

salltitles means All Subject Title(s), separated by a '<>'

qcovhsp means Query Coverage Per HSP

scovhsp means Subject Coverage Per HSP

qtitle means Query title

qqual means Query quality values for the aligned part of the query

full_qqual means Query quality values

qstrand means Query strand

Default: qseqid sseqid pident length mismatch gapopen qstart qend

sstart send evalue bitscore

Clustering options:

--member-cover Minimum coverage% of the cluster member

sequence (default=80.0)

--cluster-steps Clustering steps

--centroid-out Output file for centroids (greedy vertex

cover workflow)

Cluster input options:

--clusters Clustering input file mapping sequences to centroids

--edges Input file for greedy vertex cover

--edge-format Edge format for greedy vertex cover (default/triplet)

Advanced options:

--algo Seed search algorithm

(0=double-indexed/1=query-indexed/ctg=contiguous-seed)

--bin number of query bins for seed search

--min-orf (-l) ignore translated sequences without an open

reading frame of at least this length

--seed-cut cutoff for seed complexity

--freq-masking mask seeds based on frequency

--freq-sd number of standard deviations for ignoring

frequent seeds

--id2 minimum number of identities for stage 1 hit

--xdrop (-x) xdrop for ungapped alignment

--gapped-filter-evalue E-value threshold for gapped filter (auto)

--band band for dynamic programming computation

--shapes (-s) number of seed shapes (default=all available)

--shape-mask seed shapes

--multiprocessing enable distributed-memory parallel processing

--mp-init initialize multiprocessing run

--mp-recover enable continuation of interrupted multiprocessing run

--mp-query-chunk process only a single query chunk as specified

--ext-chunk-size chunk size for adaptive ranking (default=auto)

--no-ranking disable ranking heuristic

--culling-overlap minimum range overlap with higher scoring hit

to delete a hit (default=50%)

--taxon-k maximum number of targets to report per species

--range-cover percentage of query range to be covered for

range culling (default=50%)

--dbsize effective database size (in letters)

--no-auto-append disable auto appending of DAA and DMND file extensions

--xml-blord-format Use gnl|BL_ORD_ID| style format in XML output

--stop-match-score Set the match score of stop codons against each other.

--tantan-minMaskProb minimum repeat probability for masking (default=0.9)

--file-buffer-size file buffer size in bytes (default=67108864)

--no-unlink Do not unlink temporary files.

--target-indexed Enable target-indexed mode

--ignore-warnings Ignore warnings

View options:

--daa (-a) DIAMOND alignment archive (DAA) file

--forwardonly only show alignments of forward strand

Getseq options:

--seq Space-separated list of sequence numbers to display.

Online documentation at http://www.diamondsearch.org

実行方法

タンパク質配列の入力データベース（fastaファイルかdiamond D.B）をクラスタリングする。

diamond cluster -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G

--database | -d The input sequence database. Supported formats are FASTA and DIAMOND (.dmnd) format
--out | -o This is a 2-column tabular file with the centroid accession as the first column and the member sequence accession as the second column. More elaborate output can be retrieved using the realign workflow.
--header Enable a header line in the output file.
--approx-id The identity cutoff for the clustering (in %)
--memory-limit | -M Set a memory limit for the diamond process
--member-cover The minimum coverage of the cluster member sequence by the centroid (in %). This is a unidirectional coverage i.e. a minimum coverage of the centroid is not required. The default is 80%.
--threads | -p
--verbose | -v
--log
--quiet
--tmpdir | -t

出力は２カラム形式で、1列目がセントロイドのアクセッション名、2列目がメンバー配列のアクセッション名となります。他にもrealign, recluster, reassign, greedy vertex coverコマンドがあります。wikiでも少しだけ説明されています。確認してください。

wikiより

--approx-id # クラスタリングのためのIDカットオフ(単位:%)。性能上の理由から、この設定は、アラインメント中の実際の同一性の数ではなく、ビットスコアから線形回帰して得られる近似的な配列同一性を参照していることに注意。diamond clusterの場合は50%、diamond deepclustの場合は0%がデフォルト値。

2023/033/02

更新されてDIAMOND DeepClustという名前になったようです。プレプリントも出ています。

https://www.biorxiv.org/content/10.1101/2023.01.24.525373v1

引用

Sensitive protein alignments at tree-of-life scale using DIAMOND

Benjamin Buchfink, Klaus Reuter & Hajk-Georg Drost
Nature Methods volume 18, pages 366–368 (2021)

2023/03/02

Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust
Benjamin Buchfink, Haim Ashkenazy, Klaus Reuter, John A. Kennedy, ProfileHajk-Georg Drost

bioRxiv, Posted January 25, 2023.