BLASTとコンパチブルで高速なホモロジー検索ツール Diamond

2019 1/20 help追加、コマンド追記, 6/9 -コマンド例から-max-target-seqs削除, 7/19 追記

2021 2/13 ツイート追記

2022/04/07 インストール追記、07/22 例追記、help更新

2025/02/26 インストール追記

Diamondはindexのつけ方を工夫することでBLASTXの解析速度を加速できるツール。blastと同等の機能を持つが、論文ではblastより最大20000倍高速化できると主張されている。特にクエリー配列が非常に多い場合に高速とされる。2015年にnature methodsに論文が発表された。

2021 2/13

DIAMOND v2.0.7 now supports full-matrix Smith Waterman extensions (vectorized using the SWIPE algorithm) and the new extended taxonomy mapping file from NCBI. https://t.co/YtVTQlDicf
— Benjamin Buchfink (@bbuchfink) 2021年2月13日

3/11

DIAMOND v2.0.8 now supports directly using BLAST databases, database reformatting no longer needed. https://t.co/YtVTQlDicf
— Benjamin Buchfink (@bbuchfink) 2021年3月10日

マニュアル

manual

ppt

https://www.donarmstrong.com/ld/dmnd2015/diamond_presentation_2015.pdf

インストール

Github

condaやbrewを使って導入できる。

#bioconda(link)
mamba install -c bioconda diamond

#brewでも導入可能(試していません)
brew install diamond

#binary 
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.11/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz

> diamond help #v2.0.13

$ diamond --help

diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science

Documentation, support and updates available at http://www.diamondsearch.org

Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

Syntax: diamond COMMAND [OPTIONS]

Commands:

makedb Build DIAMOND database from a FASTA file

blastp Align amino acid query sequences against a protein reference database

blastx Align DNA query sequences against a protein reference database

view View DIAMOND alignment archive (DAA) formatted file

help Produce help message

version Display version information

getseq Retrieve sequences from a DIAMOND database file

dbinfo Print information about a DIAMOND database file

test Run regression tests

makeidx Make database index

General options:

--threads (-p) number of CPU threads

--db (-d) database file

--out (-o) output file

--outfmt (-f) output format

0 = BLAST pairwise

5 = BLAST XML

6 = BLAST tabular

100 = DIAMOND alignment archive (DAA)

101 = SAM

Value 6 may be followed by a space-separated list of these keywords:

qseqid means Query Seq - id

qlen means Query sequence length

sseqid means Subject Seq - id

sallseqid means All subject Seq - id(s), separated by a ';'

slen means Subject sequence length

qstart means Start of alignment in query

qend means End of alignment in query

sstart means Start of alignment in subject

send means End of alignment in subject

qseq means Aligned part of query sequence

qseq_translated means Aligned part of query sequence (translated)

full_qseq means Query sequence

full_qseq_mate means Query sequence of the mate

sseq means Aligned part of subject sequence

full_sseq means Subject sequence

evalue means Expect value

bitscore means Bit score

score means Raw score

length means Alignment length

pident means Percentage of identical matches

nident means Number of identical matches

mismatch means Number of mismatches

positive means Number of positive - scoring matches

gapopen means Number of gap openings

gaps means Total number of gaps

ppos means Percentage of positive - scoring matches

qframe means Query frame

btop means Blast traceback operations(BTOP)

cigar means CIGAR string

staxids means unique Subject Taxonomy ID(s), separated by a ';'

(in numerical order)

sscinames means unique Subject Scientific Name(s), separated by a ';'

sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

skingdoms means unique Subject Kingdom(s), separated by a ';'

sphylums means unique Subject Phylum(s), separated by a ';'

stitle means Subject Title

salltitles means All Subject Title(s), separated by a '<>'

qcovhsp means Query Coverage Per HSP

scovhsp means Subject Coverage Per HSP

qtitle means Query title

qqual means Query quality values for the aligned part of the query

full_qqual means Query quality values

qstrand means Query strand

Default: qseqid sseqid pident length mismatch gapopen qstart qend

sstart send evalue bitscore

--verbose (-v) verbose console output

--log enable debug log

--quiet disable console output

--header Write header lines to blast tabular format.

Makedb options:

--in input reference file in FASTA format

--taxonmap protein accession to taxid mapping file

--taxonnodes taxonomy nodes.dmp from NCBI

--taxonnames taxonomy names.dmp from NCBI

Aligner options:

--query (-q) input query file

--strand query strands to search (both/minus/plus)

--un file for unaligned queries

--al file or aligned queries

--unfmt format of unaligned query file (fasta/fastq)

--alfmt format of aligned query file (fasta/fastq)

--unal report unaligned queries (0=no, 1=yes)

--max-target-seqs (-k) maximum number of target sequences to report

alignments for (default=25)

--top report alignments within this percentage

range of top alignment score (overrides --max-target-seqs)

--max-hsps maximum number of HSPs per target sequence to

report for each query (default=1)

--range-culling restrict hit culling to overlapping query ranges

--compress compression for output files (0=none, 1=gzip, zstd)

--evalue (-e) maximum e-value to report alignments (default=0.001)

--min-score minimum bit score to report alignments

(overrides e-value setting)

--id minimum identity% to report an alignment

--query-cover minimum query cover% to report an alignment

--subject-cover minimum subject cover% to report an alignment

--fast enable fast mode

--mid-sensitive enable mid-sensitive mode

--sensitive enable sensitive mode)

--more-sensitive enable more sensitive mode

--very-sensitive enable very sensitive mode

--ultra-sensitive enable ultra sensitive mode

--iterate iterated search with increasing sensitivity

--global-ranking (-g) number of targets for global ranking

--block-size (-b) sequence block size in billions of letters

(default=2.0)

--index-chunks (-c) number of chunks for index processing (default=4)

--tmpdir (-t) directory for temporary files

--parallel-tmpdir directory for temporary files used by multiprocessing

--gapopen gap open penalty

--gapextend gap extension penalty

--frameshift (-F) frame shift penalty (default=disabled)

--long-reads short for --range-culling --top 10 -F 15

--matrix score matrix for protein alignment (default=BLOSUM62)

--custom-matrix file containing custom scoring matrix

--comp-based-stats composition based statistics mode (0-4)

--masking masking algorithm (none, seg, tantan=default)

--query-gencode genetic code to use to translate query (see

user manual)

--salltitles include full subject titles in DAA file

--sallseqid include all subject ids in DAA file

--no-self-hits suppress reporting of identical self hits

--taxonlist restrict search to list of taxon ids (comma-separated)

--taxon-exclude exclude list of taxon ids (comma-separated)

--seqidlist filter the database by list of accessions

--skip-missing-seqids ignore accessions missing in the database

Advanced options:

--algo Seed search algorithm

(0=double-indexed/1=query-indexed/ctg=contiguous-seed)

--bin number of query bins for seed search

--min-orf (-l) ignore translated sequences without an open

reading frame of at least this length

--seed-cut cutoff for seed complexity

--freq-masking mask seeds based on frequency

--freq-sd number of standard deviations for ignoring

frequent seeds

--motif-masking softmask abundant motifs (0/1)

--id2 minimum number of identities for stage 1 hit

--xdrop (-x) xdrop for ungapped alignment

--gapped-filter-evalue E-value threshold for gapped filter (auto)

--band band for dynamic programming computation

--shapes (-s) number of seed shapes (default=all available)

--shape-mask seed shapes

--multiprocessing enable distributed-memory parallel processing

--mp-init initialize multiprocessing run

--mp-recover enable continuation of interrupted multiprocessing run

--mp-query-chunk process only a single query chunk as specified

--ext-chunk-size chunk size for adaptive ranking (default=auto)

--no-ranking disable ranking heuristic

--ext Extension mode (banded-fast/banded-slow/full)

--culling-overlap minimum range overlap with higher scoring hit

to delete a hit (default=50%)

--taxon-k maximum number of targets to report per species

--range-cover percentage of query range to be covered for

range culling (default=50%)

--dbsize effective database size (in letters)

--no-auto-append disable auto appending of DAA and DMND file extensions

--xml-blord-format Use gnl|BL_ORD_ID| style format in XML output

--stop-match-score Set the match score of stop codons against each other.

--tantan-minMaskProb minimum repeat probability for masking (default=0.9)

--file-buffer-size file buffer size in bytes (default=67108864)

--memory-limit (-M) Memory limit for extension stage in GB

--no-unlink Do not unlink temporary files.

--target-indexed Enable target-indexed mode

--ignore-warnings Ignore warnings

View options:

--daa (-a) DIAMOND alignment archive (DAA) file

--forwardonly only show alignments of forward strand

Getseq options:

--seq Space-separated list of sequence numbers to display.

Online documentation at http://www.diamondsearch.org

ラン

はじめにデータベースとなるアミノ酸配列のindexファイルを作成する。

diamond makedb --in input.faa -d nr

--in Path to the input protein reference database file in FASTA format (may be gzip compressed). If this parameter is omitted, the input will be read from stdin
--taxonmap Path to mapping file that maps NCBI protein accession numbers to taxon ids (gzip com- pressed). This parameter is optional and needs to be supplied in order to provide taxon- omy features. The file can be downloaded from NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/ taxonomy/accession2taxid/prot.accession2taxid.gz.
--taxonnodes Path to the nodes.dmp file from the NCBI taxonomy. This parameter is optional and needs to be supplied in order to provide taxonomy features. The file is contained within this archive downloadable at NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip.

blastxでホモロジー検索を行う。inputは塩基配列。

diamond blastx -d nr -q query.fna -o matches.m8

--threads Number of CPU threads. By default, the program will auto-detect and use all available virtual cores on the machine.

出力はタブ区切り形式である。

user$ head matches.m8

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813330|ref|YP_007449782.1| 100.0 323 0 0 1 969 1 323 1.8e-179 622.1

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813441|ref|YP_007449893.1| 33.0 233 142 4 82 747 34 263 8.8e-25 108.2

Diamondの検出閾値はblastのdefaultの検出閾値よりずっと低いため、stringencyはblastより高くなっている。また、defaultのパラメータはショートリード向けの設定のため、クエリ配列が長い場合、--sensitiveや--more-sensitiveをつけることが推奨されている。

マニュアルに書いてあるが、タブ出力するには--outfmt 6をつける。さらに以下のような指定を行うことで、出力項目を好きに設定できる。

qseqid Query Seq - id
qlen Query sequence length
sseqid Subject Seq - id
sallseqid All subject Seq - id(s), separated by a ’;’ slen Subject sequence length
qstart Start of alignment in query
qend End of alignment in query
sstart Start of alignment in subject
send End of alignment in subject
qseq Aligned part of query sequence
sseq Aligned part of subject sequence
full sseq Full subject sequence
evalue Expect value
bitscore Bit score
score Raw score
length Alignment length
pident Percentage of identical matches
nident Number of identical matches
mismatch Number of mismatches
positive Number of positive - scoring matches
gapopen Number of gap openings
gaps Total number of gaps
ppos Percentage of positive - scoring matches
qframe Query frame
btop Blast traceback operations(BTOP)
staxids Unique Subject Taxonomy ID(s), separated by a ’;’ (in numerical order). This field requires setting the --taxonmap parameter for makedb.
salltitles All Subject Title(s), separated by a ’<>’
qcovhsp Query Coverage Per HSP
qtitle Query title

デフォルトでは

"qseqid sseqid pident length mismatch

gapopen qstart qend sstart send evalue bitscore"が出力されるようになっている。

例えばdiamondでblastxサーチを行う。tabularでqseqid、sseqid、evalueのみ出力する。max target seqはデフォルト25なので増やす。

diamond blastx --query input.fa \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 qseqid sseqid evalue \
 --sensitive \
 --max-target-seqs 1000 \
 > blast.out

追記

E-value ≤1E-100、amino acid identity ≥ 80%、minimum length (amino acid) ≥ 300 a.a。

diamond blastx --query input.fa \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 \
 --min-orf 300 --id 80 --evalue 1e-100 \
 > blast.out

感度を上げるには新しく導入された--very-sensitiveか--ultra-sensitiveを使う（引用２参照）。max target seqはデフォルト25しかないので増やす。

diamond blastx --query input.fa \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --evalue 1e-1 --very-sensitive \
 --max-target-seqs 100000 \
 --outfmt 6 \
 > blast.out

all versus all

proteins.faaの総当たり比較。部分ヒットを避けるパラメータ設定を使用する（引用）。

diamond makedb --in proteins.faa --db protein_db
diamond blastp --query proteins.faa --db protein_db --out blastp.tsv \
--outfmt 6 --evalue 1e-5 --max-target-seqs 10000 --query-cover 50 \
--subject-cover 50

出力例

f:id:kazumaxneo:20220407231300p:plain

引用

１

Fast and sensitive protein alignment using DIAMONDFast and sensitive protein alignment using DIAMOND
Benjamin Buchfink, Chao Xie & Daniel H

Nature Methods 12, 59–60 (2015) doi:10.1038/nmeth.3176

PDF

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

２

Sensitive protein alignments at tree-of-life scale using DIAMOND

Benjamin Buchfink, Klaus Reuter & Hajk-Georg Drost
Nature Methods volume 18, pages 366–368 (2021)

追記

Question about max-target-seqs option #29

Question about max-target-seqs option · Issue #29 · bbuchfink/diamond · GitHub

KMCのホームページで、diamondとKMCの連携について提案があります。詳細はKMCのHPからスクリプトをダウンロードして確認してください。

http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download

追記

論文図１で色々なシーケンサー由来のリードを使ってタンパク質と相同性検索した時の処理時間と感度が比較されています。データによっては2万倍以上高速化しています。

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

追記

ローカルデータベースのダウンロードについては、こちらが参考になります。

GitHub - josuebarrera/GenEra: genEra is an easy-to-use, low-dependency command-line tool that estimates the age of the last common ancestor of protein-coding gene families.

追記

AC-DIAMOND

ID convert

大雑把に調べる。

NCBIのblast DBを使う。

2021 4/8

DIAMOND v2 is here! Check out this paper from @bbuchfink & @HajkDrost in our Department. Instead of "we blasted gazillions of genomes, which took several days" it will now be "we diamonded gazillions of genomes during the coffee break." https://t.co/THmCxR2aeo
— Weigel Lab 🌱 aka WeigelWorld 🌱 (@PlantEvolution) 2021年4月7日

We introduce two new sensitivity modes: -very-sensitive and -ultra-sensitive allowing users to match the alignment sensitivity levels of BLAST while maintaining superior computational speed up to 360x. #Bioinformatics #Genomics #Phylogenomics pic.twitter.com/4IvXfiDT7y
— Hajk-Georg Drost (@HajkDrost) 2021年4月7日

Together with an optimized HPC and cloud-computing infrastructure, DIAMOND can now scale with the demands of ongoing bulk-sequencing efforts and exponentially growing genome assembly databases to facilitate massive comparative genomics efforts. #ERGA pic.twitter.com/zPkuHooama
— Hajk-Georg Drost (@HajkDrost) 2021年4月7日

2024/10/24

For evolutionary transcriptomics analyses, gene age inference computations would require several months of BLASTP runs. Since this is not practical, we had to invest into replacing BLASTP and develop DIAMOND2 to achieve up to 10,000x speedup.

👉https://t.co/9vPAZzg6F2 pic.twitter.com/5sCXgLtVxt
— Hajk-Georg Drost (@HajkDrost) October 23, 2024

We then developed a shell-script around DIAMOND2 (named GenEra) which pipelines and refines the gene age inference step to be able to generate gene age maps for any species of interest (for us brown algae).

👉https://t.co/AKUspVyAtQ pic.twitter.com/HWO7F986p4
— Hajk-Georg Drost (@HajkDrost) October 23, 2024

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

BLASTとコンパチブルで高速なホモロジー検索ツール Diamond