macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

タンパク質配列データベースをクラスタリングするDiamondのclusterコマンド (DIAMOND DeepClust)

2023/03/02 プレプリント追記、タイトル修正

 

wikiより

 Diamondは、CD-HITやUClustと同様に、ユーザーが定義したクラスタリング基準に基づいて配列をクラスタリングし、セントロイドまたは代表配列のセットを見つけ、クラスタリング基準対セントロイドが満たされるように各入力配列をセントロイドのクラスタに割り当てる。クラスタリング基準は、ローカルアライメントの配列カバレッジと配列の同一性によって定義される。

 

wiki

https://github.com/bbuchfink/diamond/wiki/Clustering

 

インストール

Github

#新しい機能のため、2023年2月現在はまだ、パッケージマネージャで配布されているようなディストリビューションでは利用できない。リリースから最新のバージョンをダウンロードする。

https://github.com/bbuchfink/diamond/releases/tag/v2.1.1

>diamond help  #diamond v2.1.1.155

diamond v2.1.1.155 (C) Max Planck Society for the Advancement of Science

Documentation, support and updates available at http://www.diamondsearch.org

Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

 

Syntax: diamond COMMAND [OPTIONS]

 

Commands:

makedb    Build DIAMOND database from a FASTA file

prepdb    Prepare BLAST or FASTA database for use with Diamond

blastp    Align amino acid query sequences against a protein reference database

blastx    Align DNA query sequences against a protein reference database

cluster    Cluster protein sequences

realign    Realign clustered sequences against their centroids

recluster    Recompute clustering to fix errors

reassign    Reassign clustered sequences to the closest centroid

view    View DIAMOND alignment archive (DAA) formatted file

help    Produce help message

version    Display version information

getseq    Retrieve sequences from a DIAMOND database file

dbinfo    Print information about a DIAMOND database file

test    Run regression tests

makeidx    Make database index

greedy-vertex-cover    Compute greedy vertex cover

 

General options:

--threads (-p)           number of CPU threads

--db (-d)                database file

--out (-o)               output file

--verbose (-v)           verbose console output

--log                    enable debug log

--quiet                  disable console output

--header                 Use header lines in tabular output format

(0/simple/verbose).

 

Makedb options:

--in                     input reference file in FASTA format

--taxonmap               protein accession to taxid mapping file

--taxonnodes             taxonomy nodes.dmp from NCBI

--taxonnames             taxonomy names.dmp from NCBI

 

Aligner/Clustering options:

--evalue (-e)            maximum e-value to report alignments (default=0.001)

--tmpdir (-t)            directory for temporary files

--comp-based-stats       composition based statistics mode (0-4)

--masking                masking algorithm (none, seg, tantan=default)

--soft-masking           soft masking

--motif-masking          softmask abundant motifs (0/1)

--approx-id              minimum approx. identity% to report an

alignment/to cluster sequences

--ext                    Extension mode (banded-fast/banded-slow/full)

--memory-limit (-M)      Memory limit in GB (default = 16G)

 

Aligner options:

--query (-q)             input query file

--strand                 query strands to search (both/minus/plus)

--un                     file for unaligned queries

--al                     file or aligned queries

--unfmt                  format of unaligned query file (fasta/fastq)

--alfmt                  format of aligned query file (fasta/fastq)

--unal                   report unaligned queries (0=no, 1=yes)

--max-target-seqs (-k)   maximum number of target sequences to report

alignments for (default=25)

--top                    report alignments within this percentage

range of top alignment score (overrides --max-target-seqs)

--max-hsps               maximum number of HSPs per target sequence to

report for each query (default=1)

--range-culling          restrict hit culling to overlapping query ranges

--compress               compression for output files (0=none, 1=gzip, zstd)

--min-score              minimum bit score to report alignments

(overrides e-value setting)

--id                     minimum identity% to report an alignment

--query-cover            minimum query cover% to report an alignment

--subject-cover          minimum subject cover% to report an alignment

--faster                 enable faster mode

--fast                   enable fast mode

--mid-sensitive          enable mid-sensitive mode

--sensitive              enable sensitive mode)

--more-sensitive         enable more sensitive mode

--very-sensitive         enable very sensitive mode

--ultra-sensitive        enable ultra sensitive mode

--swipe                  exhaustive alignment against all database sequences

--iterate                iterated search with increasing sensitivity

--global-ranking (-g)    number of targets for global ranking

--block-size (-b)        sequence block size in billions of letters

(default=2.0)

--index-chunks (-c)      number of chunks for index processing (default=4)

--parallel-tmpdir        directory for temporary files used by multiprocessing

--gapopen                gap open penalty

--gapextend              gap extension penalty

--matrix                 score matrix for protein alignment (default=BLOSUM62)

--custom-matrix          file containing custom scoring matrix

--frameshift (-F)        frame shift penalty (default=disabled)

--long-reads             short for --range-culling --top 10 -F 15

--query-gencode          genetic code to use to translate query (see

user manual)

--salltitles             include full subject titles in DAA file

--sallseqid              include all subject ids in DAA file

--no-self-hits           suppress reporting of identical self hits

--taxonlist              restrict search to list of taxon ids (comma-separated)

--taxon-exclude          exclude list of taxon ids (comma-separated)

--seqidlist              filter the database by list of accessions

--skip-missing-seqids    ignore accessions missing in the database

 

Output format options:

--outfmt (-f)            output format

    0   = BLAST pairwise

    5   = BLAST XML

    6   = BLAST tabular

    100 = DIAMOND alignment archive (DAA)

    101 = SAM

 

    Value 6 may be followed by a space-separated list of these keywords:

 

    qseqid means Query Seq - id

    qlen means Query sequence length

    sseqid means Subject Seq - id

    sallseqid means All subject Seq - id(s), separated by a ';'

    slen means Subject sequence length

    qstart means Start of alignment in query

    qend means End of alignment in query

    sstart means Start of alignment in subject

    send means End of alignment in subject

    qseq means Aligned part of query sequence

    qseq_translated means Aligned part of query sequence (translated)

    full_qseq means Query sequence

    full_qseq_mate means Query sequence of the mate

    sseq means Aligned part of subject sequence

    full_sseq means Subject sequence

    evalue means Expect value

    bitscore means Bit score

    corrected_bitscore means Bit score corrected for edge effects

    score means Raw score

    length means Alignment length

    pident means Percentage of identical matches

    approx_pident means Approximate percentage of identical matches

    nident means Number of identical matches

    mismatch means Number of mismatches

    positive means Number of positive - scoring matches

    gapopen means Number of gap openings

    gaps means Total number of gaps

    ppos means Percentage of positive - scoring matches

    qframe means Query frame

    btop means Blast traceback operations(BTOP)

    cigar means CIGAR string

    staxids means unique Subject Taxonomy ID(s), separated by a ';'

(in numerical order)

    sscinames means unique Subject Scientific Name(s), separated by a ';'

    sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

    skingdoms means unique Subject Kingdom(s), separated by a ';'

    sphylums means unique Subject Phylum(s), separated by a ';'

    stitle means Subject Title

    salltitles means All Subject Title(s), separated by a '<>'

    qcovhsp means Query Coverage Per HSP

    scovhsp means Subject Coverage Per HSP

    qtitle means Query title

    qqual means Query quality values for the aligned part of the query

    full_qqual means Query quality values

    qstrand means Query strand

 

    Default: qseqid sseqid pident length mismatch gapopen qstart qend

sstart send evalue bitscore

 

Clustering options:

--member-cover           Minimum coverage% of the cluster member

sequence (default=80.0)

--cluster-steps          Clustering steps

--centroid-out           Output file for centroids (greedy vertex

cover workflow)

 

Cluster input options:

--clusters               Clustering input file mapping sequences to centroids

--edges                  Input file for greedy vertex cover

--edge-format            Edge format for greedy vertex cover (default/triplet)

 

Advanced options:

--algo                   Seed search algorithm

(0=double-indexed/1=query-indexed/ctg=contiguous-seed)

--bin                    number of query bins for seed search

--min-orf (-l)           ignore translated sequences without an open

reading frame of at least this length

--seed-cut               cutoff for seed complexity

--freq-masking           mask seeds based on frequency

--freq-sd                number of standard deviations for ignoring

frequent seeds

--id2                    minimum number of identities for stage 1 hit

--xdrop (-x)             xdrop for ungapped alignment

--gapped-filter-evalue   E-value threshold for gapped filter (auto)

--band                   band for dynamic programming computation

--shapes (-s)            number of seed shapes (default=all available)

--shape-mask             seed shapes

--multiprocessing        enable distributed-memory parallel processing

--mp-init                initialize multiprocessing run

--mp-recover             enable continuation of interrupted multiprocessing run

--mp-query-chunk         process only a single query chunk as specified

--ext-chunk-size         chunk size for adaptive ranking (default=auto)

--no-ranking             disable ranking heuristic

--culling-overlap        minimum range overlap with higher scoring hit

to delete a hit (default=50%)

--taxon-k                maximum number of targets to report per species

--range-cover            percentage of query range to be covered for

range culling (default=50%)

--dbsize                 effective database size (in letters)

--no-auto-append         disable auto appending of DAA and DMND file extensions

--xml-blord-format       Use gnl|BL_ORD_ID| style format in XML output

--stop-match-score       Set the match score of stop codons against each other.

--tantan-minMaskProb     minimum repeat probability for masking (default=0.9)

--file-buffer-size       file buffer size in bytes (default=67108864)

--no-unlink              Do not unlink temporary files.

--target-indexed         Enable target-indexed mode

--ignore-warnings        Ignore warnings

 

View options:

--daa (-a)               DIAMOND alignment archive (DAA) file

--forwardonly            only show alignments of forward strand

 

Getseq options:

--seq                    Space-separated list of sequence numbers to display.

 

Online documentation at http://www.diamondsearch.org

 

 

実行方法

タンパク質配列の入力データベース(fastaファイルかdiamond D.B)をクラスタリングする。

diamond cluster -d INPUT_FILE -o OUTPUT_FILE --approx-id 30 -M 64G
  • --database | -d    The input sequence database. Supported formats are FASTA and DIAMOND (.dmnd) format

  • --out | -o  This is a 2-column tabular file with the centroid accession as the first column and the member sequence accession as the second column. More elaborate output can be retrieved using the realign workflow.

  • --header   Enable a header line in the output file.

  • --approx-id   The identity cutoff for the clustering (in %)

  • --memory-limit | -M   Set a memory limit for the diamond process 
  • --member-cover   The minimum coverage of the cluster member sequence by the centroid (in %). This is a unidirectional coverage i.e. a minimum coverage of the centroid is not required. The default is 80%.
  • --threads | -p
  • --verbose | -v
  • --log
  • --quiet
  • --tmpdir | -t

出力は2カラム形式で、1列目がセントロイドのアクセッション名、2列目がメンバー配列のアクセッション名となります。他にもrealign, recluster, reassign, greedy vertex coverコマンドがあります。wikiでも少しだけ説明されています。確認してください。

 

wikiより

  • --approx-id # クラスタリングのためのIDカットオフ(単位:%)。性能上の理由から、この設定は、アラインメント中の実際の同一性の数ではなく、ビットスコアから線形回帰して得られる近似的な配列同一性を参照していることに注意。diamond clusterの場合は50%、diamond deepclustの場合は0%がデフォルト値。

 

2023/033/02

更新されてDIAMOND DeepClustという名前になったようです。プレプリントも出ています。

https://www.biorxiv.org/content/10.1101/2023.01.24.525373v1

 

引用

Sensitive protein alignments at tree-of-life scale using DIAMOND

Benjamin Buchfink, Klaus Reuter & Hajk-Georg Drost 
Nature Methods volume 18, pages 366–368 (2021)

 

2023/03/02

Sensitive clustering of protein sequences at tree-of-life scale using DIAMOND DeepClust
Benjamin Buchfink,  Haim Ashkenazy,  Klaus Reuter, John A. Kennedy,   ProfileHajk-Georg Drost

bioRxiv, Posted January 25, 2023.

 

関連