DIAMOND v2.0.10
https://github.com/bbuchfink/diamond/discussions/478
-
DIAMONDは一般的なC++コードとしてコンパイルされ、ハードウェアアーキテクチャに対する特別な要件はないが、Intel/AMD x86-64プラットフォームのSSEおよびAVX命令セットが利用可能であれば、そのプラットフォーム上でかなり高速に実行される。Microsoft Windowsだけでなく、POSIX互換のOS(Linux、FreeBSD、macOS)上でも動作する。
-
より良いパフォーマンスのために高メモリのサーバを推奨するが、標準的なデスクトップコンピューターやラップトップでも実行できる。
-
コンパイル済みのバイナリは、Linux、macOS(Bioconda経由)、FreeBSD(pkg経由)、Windows用にダウンロードできる。最高のパフォーマンスを得るためには、ターゲットシステム上でソフトウェアをソースからネイティブにコンパイルすることが推奨される。
https://github.com/bbuchfink/diamond/wiki
インストール
#linux
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.6/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
./dimoand --help
> diamond --help #v2.1.6.160
$ diamond --help
diamond v2.1.6.160 (C) Max Planck Society for the Advancement of Science
Documentation, support and updates available at http://www.diamondsearch.org
Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)
Syntax: diamond COMMAND [OPTIONS]
Commands:
makedb Build DIAMOND database from a FASTA file
prepdb Prepare BLAST or FASTA database for use with Diamond
blastp Align amino acid query sequences against a protein reference database
blastx Align DNA query sequences against a protein reference database
cluster Cluster protein sequences
linclust Cluster protein sequences in linear time
realign Realign clustered sequences against their centroids
recluster Recompute clustering to fix errors
reassign Reassign clustered sequences to the closest centroid
view View DIAMOND alignment archive (DAA) formatted file
merge-daa Merge DAA files
help Produce help message
version Display version information
getseq Retrieve sequences from a DIAMOND database file
dbinfo Print information about a DIAMOND database file
test Run regression tests
makeidx Make database index
greedy-vertex-cover Compute greedy vertex cover
General options:
--threads (-p) number of CPU threads
--db (-d) database file
--out (-o) output file
--verbose (-v) verbose console output
--log enable debug log
--quiet disable console output
--header Use header lines in tabular output format (0/simple/verbose).
Makedb options:
--in input reference file in FASTA format
--taxonmap protein accession to taxid mapping file
--taxonnodes taxonomy nodes.dmp from NCBI
--taxonnames taxonomy names.dmp from NCBI
Aligner/Clustering options:
--evalue (-e) maximum e-value to report alignments (default=0.001)
--tmpdir (-t) directory for temporary files
--comp-based-stats composition based statistics mode (0-4)
--masking masking algorithm (none, seg, tantan=default)
--soft-masking soft masking
--motif-masking softmask abundant motifs (0/1)
--approx-id minimum approx. identity% to report an alignment/to cluster sequences
--ext Extension mode (banded-fast/banded-slow/full)
--memory-limit (-M) Memory limit in GB (default = 16G)
Aligner options:
--query (-q) input query file
--strand query strands to search (both/minus/plus)
--un file for unaligned queries
--al file or aligned queries
--unfmt format of unaligned query file (fasta/fastq)
--alfmt format of aligned query file (fasta/fastq)
--unal report unaligned queries (0=no, 1=yes)
--max-target-seqs (-k) maximum number of target sequences to report alignments for (default=25)
--top report alignments within this percentage range of top alignment score (overrides --max-target-seqs)
--max-hsps maximum number of HSPs per target sequence to report for each query (default=1)
--range-culling restrict hit culling to overlapping query ranges
--compress compression for output files (0=none, 1=gzip, zstd)
--min-score minimum bit score to report alignments (overrides e-value setting)
--id minimum identity% to report an alignment
--query-cover minimum query cover% to report an alignment
--subject-cover minimum subject cover% to report an alignment
--faster enable faster mode
--fast enable fast mode
--mid-sensitive enable mid-sensitive mode
--sensitive enable sensitive mode)
--more-sensitive enable more sensitive mode
--very-sensitive enable very sensitive mode
--ultra-sensitive enable ultra sensitive mode
--swipe exhaustive alignment against all database sequences
--iterate iterated search with increasing sensitivity
--global-ranking (-g) number of targets for global ranking
--block-size (-b) sequence block size in billions of letters (default=2.0)
--index-chunks (-c) number of chunks for index processing (default=4)
--parallel-tmpdir directory for temporary files used by multiprocessing
--gapopen gap open penalty
--gapextend gap extension penalty
--matrix score matrix for protein alignment (default=BLOSUM62)
--custom-matrix file containing custom scoring matrix
--frameshift (-F) frame shift penalty (default=disabled)
--long-reads short for --range-culling --top 10 -F 15
--query-gencode genetic code to use to translate query (see user manual)
--salltitles include full subject titles in DAA file
--sallseqid include all subject ids in DAA file
--no-self-hits suppress reporting of identical self hits
--taxonlist restrict search to list of taxon ids (comma-separated)
--taxon-exclude exclude list of taxon ids (comma-separated)
--seqidlist filter the database by list of accessions
--skip-missing-seqids ignore accessions missing in the database
Output format options:
--outfmt (-f) output format
0 = BLAST pairwise
5 = BLAST XML
6 = BLAST tabular
100 = DIAMOND alignment archive (DAA)
101 = SAM
102 = Taxonomic classification
103 = PAF
Value 6 may be followed by a space-separated list of these keywords:
qseqid means Query Seq - id
qlen means Query sequence length
sseqid means Subject Seq - id
sallseqid means All subject Seq - id(s), separated by a ';'
slen means Subject sequence length
qstart means Start of alignment in query
qend means End of alignment in query
sstart means Start of alignment in subject
send means End of alignment in subject
qseq means Aligned part of query sequence
qseq_translated means Aligned part of query sequence (translated)
full_qseq means Query sequence
full_qseq_mate means Query sequence of the mate
sseq means Aligned part of subject sequence
full_sseq means Subject sequence
evalue means Expect value
bitscore means Bit score
corrected_bitscore means Bit score corrected for edge effects
score means Raw score
length means Alignment length
pident means Percentage of identical matches
approx_pident means Approximate percentage of identical matches
nident means Number of identical matches
mismatch means Number of mismatches
positive means Number of positive - scoring matches
gapopen means Number of gap openings
gaps means Total number of gaps
ppos means Percentage of positive - scoring matches
qframe means Query frame
btop means Blast traceback operations(BTOP)
cigar means CIGAR string
staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)
sscinames means unique Subject Scientific Name(s), separated by a ';'
sskingdoms means unique Subject Super Kingdom(s), separated by a ';'
skingdoms means unique Subject Kingdom(s), separated by a ';'
sphylums means unique Subject Phylum(s), separated by a ';'
stitle means Subject Title
salltitles means All Subject Title(s), separated by a '<>'
qcovhsp means Query Coverage Per HSP
scovhsp means Subject Coverage Per HSP
qtitle means Query title
qqual means Query quality values for the aligned part of the query
full_qqual means Query quality values
qstrand means Query strand
Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
Clustering options:
--member-cover Minimum coverage% of the cluster member sequence (default=80.0)
--cluster-steps Clustering steps
--centroid-out Output file for centroids (greedy vertex cover workflow)
Cluster input options:
--clusters Clustering input file mapping sequences to centroids
--edges Input file for greedy vertex cover
--edge-format Edge format for greedy vertex cover (default/triplet)
Advanced options:
--algo Seed search algorithm (0=double-indexed/1=query-indexed/ctg=contiguous-seed)
--bin number of query bins for seed search
--min-orf (-l) ignore translated sequences without an open reading frame of at least this length
--seed-cut cutoff for seed complexity
--freq-masking mask seeds based on frequency
--freq-sd number of standard deviations for ignoring frequent seeds
--id2 minimum number of identities for stage 1 hit
--linsearch only consider seed hits against longest target for identical seeds
--lin-stage1 only consider seed hits against longest query for identical seeds
--xdrop (-x) xdrop for ungapped alignment
--gapped-filter-evalue E-value threshold for gapped filter (auto)
--band band for dynamic programming computation
--shapes (-s) number of seed shapes (default=all available)
--shape-mask seed shapes
--multiprocessing enable distributed-memory parallel processing
--mp-init initialize multiprocessing run
--mp-recover enable continuation of interrupted multiprocessing run
--mp-query-chunk process only a single query chunk as specified
--ext-chunk-size chunk size for adaptive ranking (default=auto)
--no-ranking disable ranking heuristic
--culling-overlap minimum range overlap with higher scoring hit to delete a hit (default=50%)
--taxon-k maximum number of targets to report per species
--range-cover percentage of query range to be covered for range culling (default=50%)
--dbsize effective database size (in letters)
--no-auto-append disable auto appending of DAA and DMND file extensions
--xml-blord-format Use gnl|BL_ORD_ID| style format in XML output
--stop-match-score Set the match score of stop codons against each other.
--tantan-minMaskProb minimum repeat probability for masking (default=0.9)
--file-buffer-size file buffer size in bytes (default=67108864)
--no-unlink Do not unlink temporary files.
--target-indexed Enable target-indexed mode
--ignore-warnings Ignore warnings
View options:
--daa (-a) DIAMOND alignment archive (DAA) file
--forwardonly only show alignments of forward strand
Getseq options:
--seq Space-separated list of sequence numbers to display.
Online documentation at http://www.diamondsearch.org
実行方法
ここでは以前ダウンロードしたnrデータベース(手順)を指定する。
diamond prepdb -d nr-database/nr
blastのnrデータベースに以下のようなファイルができる。
2022年秋にダウンロードしたnrデータベースを指定した時は1時間ほどかかった。作成したデータベースを指定し、あとは通常通りDiamondをランする。
さらにBLASTデータベースにも直接対応している。 (manual)
https://github.com/bbuchfink/diamond/issues/439
”BLASTデータベースの使用はデフォルトでは含まれてはいないので、NCBIツールキットのライブラリに対してリンクすることで有効にする必要がある。オペレーティングシステムによって提供される共有ライブラリを使用するか、セルフコンパイルされたライブラリを使用することによって行う。コンパイルにはzstdライブラリも必要(上記参照)”。
マニュアルの通りに進める。
sudo apt install ncbi-blast+
# change the BLAST version if needed
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.9.0/ncbi-blast-2.9.0+-src.tar.gz
tar xzf ncbi-blast-2.9.0+-src.tar.gz
cd ncbi-blast-2.9.0+-src/c++
./configure --prefix=$HOME/BLAST2.9 --without-debug --without-exe --without-boost --without-gui
make -j20 #20スレッドで10分くらいかかる
make install #(apt-get install cpio)
cd ../../
#Compiling from source (GCC4.8.1以降、CMake2.6以降、libpthread とzlibも必要)
wget https://github.com/bbuchfink/diamond/archive/v2.1.6.tar.gz
tar xzf v2.1.6.tar.gz
mkdir diamond-2.1.6/bin
cd diamond-2.1.6/bin
cmake -DBLAST_INCLUDE_DIR=$HOME/BLAST2.9/include/ncbi-tools++ ..
make -j8
sudo make install
引用
https://github.com/bbuchfink/diamond
参考
nr database Diamond
https://www.biostars.org/p/366342/