macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

diamondでBLASTのデータベースを使えるようにするdiamond prepdbコマンド

 

DIAMOND v2.0.10 

https://github.com/bbuchfink/diamond/discussions/478

 

  • DIAMONDは一般的なC++コードとしてコンパイルされ、ハードウェアアーキテクチャに対する特別な要件はないが、Intel/AMD x86-64プラットフォームのSSEおよびAVX命令セットが利用可能であれば、そのプラットフォーム上でかなり高速に実行される。Microsoft Windowsだけでなく、POSIX互換のOS(LinuxFreeBSDmacOS)上でも動作する。

  • より良いパフォーマンスのために高メモリのサーバを推奨するが、標準的なデスクトップコンピューターやラップトップでも実行できる。

  • コンパイル済みのバイナリは、LinuxmacOS(Bioconda経由)、FreeBSD(pkg経由)、Windows用にダウンロードできる。最高のパフォーマンスを得るためには、ターゲットシステム上でソフトウェアをソースからネイティブにコンパイルすることが推奨される。

 

wiki

https://github.com/bbuchfink/diamond/wiki

 

インストール

Github

#linux
wget http://github.com/bbuchfink/diamond/releases/download/v2.1.6/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
./dimoand --help

> diamond --help  #v2.1.6.160

$ diamond --help

diamond v2.1.6.160 (C) Max Planck Society for the Advancement of Science

Documentation, support and updates available at http://www.diamondsearch.org

Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

 

Syntax: diamond COMMAND [OPTIONS]

 

Commands:

makedb                   Build DIAMOND database from a FASTA file

prepdb                   Prepare BLAST or FASTA database for use with Diamond

blastp                   Align amino acid query sequences against a protein reference database

blastx                   Align DNA query sequences against a protein reference database

cluster                  Cluster protein sequences

linclust                 Cluster protein sequences in linear time

realign                  Realign clustered sequences against their centroids

recluster                Recompute clustering to fix errors

reassign                 Reassign clustered sequences to the closest centroid

view                     View DIAMOND alignment archive (DAA) formatted file

merge-daa                Merge DAA files

help                     Produce help message

version                  Display version information

getseq                   Retrieve sequences from a DIAMOND database file

dbinfo                   Print information about a DIAMOND database file

test                     Run regression tests

makeidx                  Make database index

greedy-vertex-cover      Compute greedy vertex cover

 

General options:

--threads (-p)           number of CPU threads

--db (-d)                database file

--out (-o)               output file

--verbose (-v)           verbose console output

--log                    enable debug log

--quiet                  disable console output

--header                 Use header lines in tabular output format (0/simple/verbose).

 

Makedb options:

--in                     input reference file in FASTA format

--taxonmap               protein accession to taxid mapping file

--taxonnodes             taxonomy nodes.dmp from NCBI

--taxonnames             taxonomy names.dmp from NCBI

 

Aligner/Clustering options:

--evalue (-e)            maximum e-value to report alignments (default=0.001)

--tmpdir (-t)            directory for temporary files

--comp-based-stats       composition based statistics mode (0-4)

--masking                masking algorithm (none, seg, tantan=default)

--soft-masking           soft masking

--motif-masking          softmask abundant motifs (0/1)

--approx-id              minimum approx. identity% to report an alignment/to cluster sequences

--ext                    Extension mode (banded-fast/banded-slow/full)

--memory-limit (-M)      Memory limit in GB (default = 16G)

 

Aligner options:

--query (-q)             input query file

--strand                 query strands to search (both/minus/plus)

--un                     file for unaligned queries

--al                     file or aligned queries

--unfmt                  format of unaligned query file (fasta/fastq)

--alfmt                  format of aligned query file (fasta/fastq)

--unal                   report unaligned queries (0=no, 1=yes)

--max-target-seqs (-k)   maximum number of target sequences to report alignments for (default=25)

--top                    report alignments within this percentage range of top alignment score (overrides --max-target-seqs)

--max-hsps               maximum number of HSPs per target sequence to report for each query (default=1)

--range-culling          restrict hit culling to overlapping query ranges

--compress               compression for output files (0=none, 1=gzip, zstd)

--min-score              minimum bit score to report alignments (overrides e-value setting)

--id                     minimum identity% to report an alignment

--query-cover            minimum query cover% to report an alignment

--subject-cover          minimum subject cover% to report an alignment

--faster                 enable faster mode

--fast                   enable fast mode

--mid-sensitive          enable mid-sensitive mode

--sensitive              enable sensitive mode)

--more-sensitive         enable more sensitive mode

--very-sensitive         enable very sensitive mode

--ultra-sensitive        enable ultra sensitive mode

--swipe                  exhaustive alignment against all database sequences

--iterate                iterated search with increasing sensitivity

--global-ranking (-g)    number of targets for global ranking

--block-size (-b)        sequence block size in billions of letters (default=2.0)

--index-chunks (-c)      number of chunks for index processing (default=4)

--parallel-tmpdir        directory for temporary files used by multiprocessing

--gapopen                gap open penalty

--gapextend              gap extension penalty

--matrix                 score matrix for protein alignment (default=BLOSUM62)

--custom-matrix          file containing custom scoring matrix

--frameshift (-F)        frame shift penalty (default=disabled)

--long-reads             short for --range-culling --top 10 -F 15

--query-gencode          genetic code to use to translate query (see user manual)

--salltitles             include full subject titles in DAA file

--sallseqid              include all subject ids in DAA file

--no-self-hits           suppress reporting of identical self hits

--taxonlist              restrict search to list of taxon ids (comma-separated)

--taxon-exclude          exclude list of taxon ids (comma-separated)

--seqidlist              filter the database by list of accessions

--skip-missing-seqids    ignore accessions missing in the database

 

Output format options:

--outfmt (-f)            output format

0   = BLAST pairwise

5   = BLAST XML

6   = BLAST tabular

100 = DIAMOND alignment archive (DAA)

101 = SAM

102 = Taxonomic classification

103 = PAF

 

Value 6 may be followed by a space-separated list of these keywords:

 

qseqid means Query Seq - id

qlen means Query sequence length

sseqid means Subject Seq - id

sallseqid means All subject Seq - id(s), separated by a ';'

slen means Subject sequence length

qstart means Start of alignment in query

qend means End of alignment in query

sstart means Start of alignment in subject

send means End of alignment in subject

qseq means Aligned part of query sequence

qseq_translated means Aligned part of query sequence (translated)

full_qseq means Query sequence

full_qseq_mate means Query sequence of the mate

sseq means Aligned part of subject sequence

full_sseq means Subject sequence

evalue means Expect value

bitscore means Bit score

corrected_bitscore means Bit score corrected for edge effects

score means Raw score

length means Alignment length

pident means Percentage of identical matches

approx_pident means Approximate percentage of identical matches

nident means Number of identical matches

mismatch means Number of mismatches

positive means Number of positive - scoring matches

gapopen means Number of gap openings

gaps means Total number of gaps

ppos means Percentage of positive - scoring matches

qframe means Query frame

btop means Blast traceback operations(BTOP)

cigar means CIGAR string

staxids means unique Subject Taxonomy ID(s), separated by a ';' (in numerical order)

sscinames means unique Subject Scientific Name(s), separated by a ';'

sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

skingdoms means unique Subject Kingdom(s), separated by a ';'

sphylums means unique Subject Phylum(s), separated by a ';'

stitle means Subject Title

salltitles means All Subject Title(s), separated by a '<>'

qcovhsp means Query Coverage Per HSP

scovhsp means Subject Coverage Per HSP

qtitle means Query title

qqual means Query quality values for the aligned part of the query

full_qqual means Query quality values

qstrand means Query strand

 

Default: qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore

 

Clustering options:

--member-cover           Minimum coverage% of the cluster member sequence (default=80.0)

--cluster-steps          Clustering steps

--centroid-out           Output file for centroids (greedy vertex cover workflow)

 

Cluster input options:

--clusters               Clustering input file mapping sequences to centroids

--edges                  Input file for greedy vertex cover

--edge-format            Edge format for greedy vertex cover (default/triplet)

 

Advanced options:

--algo                   Seed search algorithm (0=double-indexed/1=query-indexed/ctg=contiguous-seed)

--bin                    number of query bins for seed search

--min-orf (-l)           ignore translated sequences without an open reading frame of at least this length

--seed-cut               cutoff for seed complexity

--freq-masking           mask seeds based on frequency

--freq-sd                number of standard deviations for ignoring frequent seeds

--id2                    minimum number of identities for stage 1 hit

--linsearch              only consider seed hits against longest target for identical seeds

--lin-stage1             only consider seed hits against longest query for identical seeds

--xdrop (-x)             xdrop for ungapped alignment

--gapped-filter-evalue   E-value threshold for gapped filter (auto)

--band                   band for dynamic programming computation

--shapes (-s)            number of seed shapes (default=all available)

--shape-mask             seed shapes

--multiprocessing        enable distributed-memory parallel processing

--mp-init                initialize multiprocessing run

--mp-recover             enable continuation of interrupted multiprocessing run

--mp-query-chunk         process only a single query chunk as specified

--ext-chunk-size         chunk size for adaptive ranking (default=auto)

--no-ranking             disable ranking heuristic

--culling-overlap        minimum range overlap with higher scoring hit to delete a hit (default=50%)

--taxon-k                maximum number of targets to report per species

--range-cover            percentage of query range to be covered for range culling (default=50%)

--dbsize                 effective database size (in letters)

--no-auto-append         disable auto appending of DAA and DMND file extensions

--xml-blord-format       Use gnl|BL_ORD_ID| style format in XML output

--stop-match-score       Set the match score of stop codons against each other.

--tantan-minMaskProb     minimum repeat probability for masking (default=0.9)

--file-buffer-size       file buffer size in bytes (default=67108864)

--no-unlink              Do not unlink temporary files.

--target-indexed         Enable target-indexed mode

--ignore-warnings        Ignore warnings

 

View options:

--daa (-a)               DIAMOND alignment archive (DAA) file

--forwardonly            only show alignments of forward strand

 

Getseq options:

--seq                    Space-separated list of sequence numbers to display.

 

Online documentation at http://www.diamondsearch.org

 

 

実行方法

ここでは以前ダウンロードしたnrデータベース(手順)を指定する。

diamond prepdb -d nr-database/nr

blastのnrデータベースに以下のようなファイルができる。

2022年秋にダウンロードしたnrデータベースを指定した時は1時間ほどかかった。作成したデータベースを指定し、あとは通常通りDiamondをランする。

 

 

さらにBLASTデータベースにも直接対応している。 (manual)

https://github.com/bbuchfink/diamond/issues/439

”BLASTデータベースの使用はデフォルトでは含まれてはいないので、NCBIツールキットのライブラリに対してリンクすることで有効にする必要がある。オペレーティングシステムによって提供される共有ライブラリを使用するか、セルフコンパイルされたライブラリを使用することによって行う。コンパイルにはzstdライブラリも必要(上記参照)”。

マニュアルの通りに進める。

sudo apt install ncbi-blast+
# change the BLAST version if needed
wget https://ftp.ncbi.nlm.nih.gov/blast/executables/blast+/2.9.0/ncbi-blast-2.9.0+-src.tar.gz
tar xzf ncbi-blast-2.9.0+-src.tar.gz
cd ncbi-blast-2.9.0+-src/c++
./configure --prefix=$HOME/BLAST2.9 --without-debug --without-exe --without-boost --without-gui
make -j20 #20スレッドで10分くらいかかる
make install #(apt-get install cpio)
cd ../../

#Compiling from source (GCC4.8.1以降、CMake2.6以降、libpthread とzlibも必要)
wget https://github.com/bbuchfink/diamond/archive/v2.1.6.tar.gz
tar xzf v2.1.6.tar.gz
mkdir diamond-2.1.6/bin
cd diamond-2.1.6/bin
cmake -DBLAST_INCLUDE_DIR=$HOME/BLAST2.9/include/ncbi-tools++ ..
make -j8
sudo make install

 

引用

https://github.com/bbuchfink/diamond

 

参考

nr database Diamond

https://www.biostars.org/p/366342/