2019 1/20 help追加 、コマンド追記, 6/9 -コマンド例から-max-target-seqs削除, 7/19 追記
2021 2/13 ツイート追記
2022/04/07 インストール追記、07/22 例追記、help更新
Diamondはindexのつけ方を工夫することでBLASTXの解析速度を加速できるツール。blastと同等の機能を持つが、論文ではblastより最大20000倍高速化できると主張されている。特にクエリー配列が非常に多い場合に高速とされる。2015年にnature methodsに論文が発表された。
2021 2/13
DIAMOND v2.0.7 now supports full-matrix Smith Waterman extensions (vectorized using the SWIPE algorithm) and the new extended taxonomy mapping file from NCBI. https://t.co/YtVTQlDicf
— Benjamin Buchfink (@bbuchfink) 2021年2月13日
3/11
DIAMOND v2.0.8 now supports directly using BLAST databases, database reformatting no longer needed. https://t.co/YtVTQlDicf
— Benjamin Buchfink (@bbuchfink) 2021年3月10日
マニュアル
ppt
https://www.donarmstrong.com/ld/dmnd2015/diamond_presentation_2015.pdf
インストール
condaやbrewを使って導入できる。
#bioconda(link)
mamba install -c bioconda diamond
#brewでも導入可能(試していません)
brew install diamond
> diamond help #v2.0.13
$ diamond --help
diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
Syntax: diamond COMMAND [OPTIONS]
Commands:
makedb Build DIAMOND database from a FASTA file
blastp Align amino acid query sequences against a protein reference database
blastx Align DNA query sequences against a protein reference database
view View DIAMOND alignment archive (DAA) formatted file
help Produce help message
version Display version information
getseq Retrieve sequences from a DIAMOND database file
dbinfo Print information about a DIAMOND database file
test Run regression tests
makeidx Make database index
General options:
--threads (-p) number of CPU threads
--db (-d) database file
--out (-o) output file
--outfmt (-f) output format
0 = BLAST pairwise
5 = BLAST XML
6 = BLAST tabular
100 = DIAMOND alignment archive (DAA)
101 = SAM
Value 6 may be followed by a space-separated list of these keywords:
qseqid means Query Seq - id
qlen means Query sequence length
sseqid means Subject Seq - id
sallseqid means All subject Seq - id(s), separated by a ';'
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
qseq_translated means Aligned part of query sequence (translated)
full_qseq means Query sequence
full_qseq_mate means Query sequence of the mate
sseq means Aligned part of subject sequence
full_sseq means Subject sequence
evalue means Expect value
bitscore means Bit score
score means Raw score
length means Alignment length
pident means Percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive - scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive - scoring matches
qframe means Query frame
btop means Blast traceback operations(BTOP)
cigar means CIGAR string
staxids means unique Subject Taxonomy ID(s), separated by a ';'
(in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
skingdoms means unique Subject Kingdom(s), separated by a ';'
sphylums means unique Subject Phylum(s), separated by a ';'
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
qcovhsp means Query Coverage Per HSP
scovhsp means Subject Coverage Per HSP
qtitle means Query title
qqual means Query quality values for the aligned part of the query
full_qqual means Query quality values
qstrand means Query strand
Default: qseqid sseqid pident length mismatch gapopen qstart qend
sstart send evalue bitscore
--verbose (-v) verbose console output
--log enable debug log
--quiet disable console output
--header Write header lines to blast tabular format.
Makedb options:
--in input reference file in FASTA format
--taxonmap protein accession to taxid mapping file
--taxonnodes taxonomy nodes.dmp from NCBI
--taxonnames taxonomy names.dmp from NCBI
Aligner options:
--query (-q) input query file
--strand query strands to search (both/minus/plus)
--un file for unaligned queries
--al file or aligned queries
--unfmt format of unaligned query file (fasta/fastq)
--alfmt format of aligned query file (fasta/fastq)
--unal report unaligned queries (0=no, 1=yes)
--max-target-seqs (-k) maximum number of target sequences to report
alignments for (default=25)
--top report alignments within this percentage
range of top alignment score (overrides --max-target-seqs)
--max-hsps maximum number of HSPs per target sequence to
report for each query (default=1)
--range-culling restrict hit culling to overlapping query ranges
--compress compression for output files (0=none, 1=gzip, zstd)
--evalue (-e) maximum e-value to report alignments (default=0.001)
--min-score minimum bit score to report alignments
(overrides e-value setting)
--id minimum identity% to report an alignment
--query-cover minimum query cover% to report an alignment
--subject-cover minimum subject cover% to report an alignment
--fast enable fast mode
--mid-sensitive enable mid-sensitive mode
--sensitive enable sensitive mode)
--more-sensitive enable more sensitive mode
--very-sensitive enable very sensitive mode
--ultra-sensitive enable ultra sensitive mode
--iterate iterated search with increasing sensitivity
--global-ranking (-g) number of targets for global ranking
--block-size (-b) sequence block size in billions of letters
(default=2.0)
--index-chunks (-c) number of chunks for index processing (default=4)
--tmpdir (-t) directory for temporary files
--parallel-tmpdir directory for temporary files used by multiprocessing
--gapopen gap open penalty
--gapextend gap extension penalty
--frameshift (-F) frame shift penalty (default=disabled)
--long-reads short for --range-culling --top 10 -F 15
--matrix score matrix for protein alignment (default=BLOSUM62)
--custom-matrix file containing custom scoring matrix
--comp-based-stats composition based statistics mode (0-4)
--masking masking algorithm (none, seg, tantan=default)
--query-gencode genetic code to use to translate query (see
user manual)
--salltitles include full subject titles in DAA file
--sallseqid include all subject ids in DAA file
--no-self-hits suppress reporting of identical self hits
--taxonlist restrict search to list of taxon ids (comma-separated)
--taxon-exclude exclude list of taxon ids (comma-separated)
--seqidlist filter the database by list of accessions
--skip-missing-seqids ignore accessions missing in the database
Advanced options:
--algo Seed search algorithm
(0=double-indexed/1=query-indexed/ctg=contiguous-seed)
--bin number of query bins for seed search
--min-orf (-l) ignore translated sequences without an open
reading frame of at least this length
--seed-cut cutoff for seed complexity
--freq-masking mask seeds based on frequency
--freq-sd number of standard deviations for ignoring
frequent seeds
--motif-masking softmask abundant motifs (0/1)
--id2 minimum number of identities for stage 1 hit
--xdrop (-x) xdrop for ungapped alignment
--gapped-filter-evalue E-value threshold for gapped filter (auto)
--band band for dynamic programming computation
--shapes (-s) number of seed shapes (default=all available)
--shape-mask seed shapes
--multiprocessing enable distributed-memory parallel processing
--mp-init initialize multiprocessing run
--mp-recover enable continuation of interrupted multiprocessing run
--mp-query-chunk process only a single query chunk as specified
--ext-chunk-size chunk size for adaptive ranking (default=auto)
--no-ranking disable ranking heuristic
--ext Extension mode (banded-fast/banded-slow/full)
--culling-overlap minimum range overlap with higher scoring hit
to delete a hit (default=50%)
--taxon-k maximum number of targets to report per species
--range-cover percentage of query range to be covered for
range culling (default=50%)
--dbsize effective database size (in letters)
--no-auto-append disable auto appending of DAA and DMND file extensions
--xml-blord-format Use gnl|BL_ORD_ID| style format in XML output
--stop-match-score Set the match score of stop codons against each other.
--tantan-minMaskProb minimum repeat probability for masking (default=0.9)
--file-buffer-size file buffer size in bytes (default=67108864)
--memory-limit (-M) Memory limit for extension stage in GB
--no-unlink Do not unlink temporary files.
--target-indexed Enable target-indexed mode
--ignore-warnings Ignore warnings
View options:
--daa (-a) DIAMOND alignment archive (DAA) file
--forwardonly only show alignments of forward strand
Getseq options:
--seq Space-separated list of sequence numbers to display.
Online documentation at http://www.diamondsearch.org
ラン
はじめにデータベースとなるアミノ酸配列のindexファイルを作成する。
diamond makedb --in input.faa -d nr
- --in Path to the input protein reference database file in FASTA format (may be gzip compressed). If this parameter is omitted, the input will be read from stdin
-
--taxonmap Path to mapping file that maps NCBI protein accession numbers to taxon ids (gzip com- pressed). This parameter is optional and needs to be supplied in order to provide taxon- omy features. The file can be downloaded from NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/ taxonomy/accession2taxid/prot.accession2taxid.gz.
- --taxonnodes Path to the nodes.dmp file from the NCBI taxonomy. This parameter is optional and needs to be supplied in order to provide taxonomy features. The file is contained within this archive downloadable at NCBI: ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdmp.zip.
diamond blastx -d nr -q query.fna -o matches.m8
-
--threads Number of CPU threads. By default, the program will auto-detect and use all available virtual cores on the machine.
出力はタブ区切り形式である。
user$ head matches.m8
gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813330|ref|YP_007449782.1| 100.0 323 0 0 1 969 1 323 1.8e-179 622.1
gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813441|ref|YP_007449893.1| 33.0 233 142 4 82 747 34 263 8.8e-25 108.2
Diamondの検出閾値はblastのdefaultの検出閾値よりずっと低いため、stringencyはblastより高くなっている。また、defaultのパラメータはショートリード向けの設定のため、クエリ配列が長い場合、--sensitiveや--more-sensitiveをつけることが推奨されている。
マニュアルに書いてあるが、タブ出力するには--outfmt 6をつける。さらに以下のような指定を行うことで、出力項目を好きに設定できる。
- qseqid Query Seq - id
- qlen Query sequence length
- sseqid Subject Seq - id
- sallseqid All subject Seq - id(s), separated by a ’;’ slen Subject sequence length
- qstart Start of alignment in query
- qend End of alignment in query
- sstart Start of alignment in subject
- send End of alignment in subject
- qseq Aligned part of query sequence
- sseq Aligned part of subject sequence
- full sseq Full subject sequence
- evalue Expect value
- bitscore Bit score
- score Raw score
- length Alignment length
- pident Percentage of identical matches
- nident Number of identical matches
- mismatch Number of mismatches
- positive Number of positive - scoring matches
- gapopen Number of gap openings
- gaps Total number of gaps
- ppos Percentage of positive - scoring matches
- qframe Query frame
- btop Blast traceback operations(BTOP)
- staxids Unique Subject Taxonomy ID(s), separated by a ’;’ (in numerical order). This field requires setting the --taxonmap parameter for makedb.
- salltitles All Subject Title(s), separated by a ’<>’
- qcovhsp Query Coverage Per HSP
- qtitle Query title
デフォルトでは
"qseqid sseqid pident length mismatch
gapopen qstart qend sstart send evalue bitscore"が出力されるようになっている。
例えばdiamondでblastxサーチを行う。tabularでqseqid、sseqid、evalueのみ出力する。max target seqはデフォルト25なので増やす。
diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 qseqid sseqid evalue \
--sensitive \
--max-target-seqs 1000 \
> blast.out
追記
E-value ≤1E-100、amino acid identity ≥ 80%、minimum length (amino acid) ≥ 300 a.a。
diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 \
--min-orf 300 --id 80 --evalue 1e-100 \
> blast.out
感度を上げるには新しく導入された--very-sensitiveか--ultra-sensitiveを使う(引用2参照)。max target seqはデフォルト25しかないので増やす。
diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--evalue 1e-1 --very-sensitive \
--max-target-seqs 100000 \
--outfmt 6 \
> blast.out
all versus all
proteins.faaの総当たり比較。部分ヒットを避けるパラメータ設定を使用する(引用)。
diamond makedb --in proteins.faa --db protein_db
diamond blastp --query proteins.faa --db protein_db --out blastp.tsv \
--outfmt 6 --evalue 1e-5 --max-target-seqs 10000 --query-cover 50 \
--subject-cover 50
出力例
引用
1
Fast and sensitive protein alignment using DIAMONDFast and sensitive protein alignment using DIAMOND
Benjamin Buchfink, Chao Xie & Daniel H
Nature Methods 12, 59–60 (2015) doi:10.1038/nmeth.3176
https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf
2
Sensitive protein alignments at tree-of-life scale using DIAMOND
Benjamin Buchfink, Klaus Reuter & Hajk-Georg Drost
Nature Methods volume 18, pages 366–368 (2021)
追記
Question about max-target-seqs option #29
Question about max-target-seqs option · Issue #29 · bbuchfink/diamond · GitHub
KMCのホームページで、diamondとKMCの連携について提案があります。詳細はKMCのHPからスクリプトをダウンロードして確認してください。
http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download
追記
論文図1で色々なシーケンサー由来のリードを使ってタンパク質と相同性検索した時の処理時間と感度が比較されています。データによっては2万倍以上高速化しています。
https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf
追記
ローカルデータベースのダウンロードについては、こちらが参考になります。
追記
AC-DIAMOND
追記
ID convert
大雑把に調べる。
NCBIのblast DBを使う。
2021 4/8
DIAMOND v2 is here! Check out this paper from @bbuchfink & @HajkDrost in our Department. Instead of "we blasted gazillions of genomes, which took several days" it will now be "we diamonded gazillions of genomes during the coffee break." https://t.co/THmCxR2aeo
— Weigel Lab 🌱 aka WeigelWorld 🌱 (@PlantEvolution) 2021年4月7日
We introduce two new sensitivity modes: -very-sensitive and -ultra-sensitive allowing users to match the alignment sensitivity levels of BLAST while maintaining superior computational speed up to 360x. #Bioinformatics #Genomics #Phylogenomics pic.twitter.com/4IvXfiDT7y
— Hajk-Georg Drost (@HajkDrost) 2021年4月7日
Together with an optimized HPC and cloud-computing infrastructure, DIAMOND can now scale with the demands of ongoing bulk-sequencing efforts and exponentially growing genome assembly databases to facilitate massive comparative genomics efforts. #ERGA pic.twitter.com/zPkuHooama
— Hajk-Georg Drost (@HajkDrost) 2021年4月7日