macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

BLASTとコンパチブルで高速なホモロジー検索ツール Diamond

2019 1/20 help追加 、コマンド追記,  6/9 -コマンド例から-max-target-seqs削除, 7/19 追記

2021 2/13  ツイート追記

2022/04/07 インストール追記、07/22 例追記、help更新

  

Diamondはindexのつけ方を工夫することでBLASTXの解析速度を加速できるツール。blastと同等の機能を持つが、論文ではblastより最大20000倍高速化できると主張されている。特にクエリー配列が非常に多い場合に高速とされる。2015年にnature methodsに論文が発表された。

 

2021 2/13

3/11

 

 

  マニュアル

manual

ppt

https://www.donarmstrong.com/ld/dmnd2015/diamond_presentation_2015.pdf

インストール

Github

condaやbrewを使って導入できる。

#bioconda(link)
mamba install -c bioconda diamond=2.0.14

#brewでも導入可能(試していません)
brew install diamond

 > diamond help #v2.0.13

$ diamond --help

diamond v2.0.13.151 (C) Max Planck Society for the Advancement of Science

Documentation, support and updates available at http://www.diamondsearch.org

Please cite: http://dx.doi.org/10.1038/s41592-021-01101-x Nature Methods (2021)

 

Syntax: diamond COMMAND [OPTIONS]

 

Commands:

makedb    Build DIAMOND database from a FASTA file

blastp    Align amino acid query sequences against a protein reference database

blastx    Align DNA query sequences against a protein reference database

view    View DIAMOND alignment archive (DAA) formatted file

help    Produce help message

version    Display version information

getseq    Retrieve sequences from a DIAMOND database file

dbinfo    Print information about a DIAMOND database file

test    Run regression tests

makeidx    Make database index

 

General options:

--threads (-p)           number of CPU threads

--db (-d)                database file

--out (-o)               output file

--outfmt (-f)            output format

    0   = BLAST pairwise

    5   = BLAST XML

    6   = BLAST tabular

    100 = DIAMOND alignment archive (DAA)

    101 = SAM

 

    Value 6 may be followed by a space-separated list of these keywords:

 

    qseqid means Query Seq - id

    qlen means Query sequence length

    sseqid means Subject Seq - id

    sallseqid means All subject Seq - id(s), separated by a ';'

    slen means Subject sequence length

    qstart means Start of alignment in query

    qend means End of alignment in query

    sstart means Start of alignment in subject

    send means End of alignment in subject

    qseq means Aligned part of query sequence

    qseq_translated means Aligned part of query sequence (translated)

    full_qseq means Query sequence

    full_qseq_mate means Query sequence of the mate

    sseq means Aligned part of subject sequence

    full_sseq means Subject sequence

    evalue means Expect value

    bitscore means Bit score

    score means Raw score

    length means Alignment length

    pident means Percentage of identical matches

    nident means Number of identical matches

    mismatch means Number of mismatches

    positive means Number of positive - scoring matches

    gapopen means Number of gap openings

    gaps means Total number of gaps

    ppos means Percentage of positive - scoring matches

    qframe means Query frame

    btop means Blast traceback operations(BTOP)

    cigar means CIGAR string

    staxids means unique Subject Taxonomy ID(s), separated by a ';'

(in numerical order)

    sscinames means unique Subject Scientific Name(s), separated by a ';'

    sskingdoms means unique Subject Super Kingdom(s), separated by a ';'

    skingdoms means unique Subject Kingdom(s), separated by a ';'

    sphylums means unique Subject Phylum(s), separated by a ';'

    stitle means Subject Title

    salltitles means All Subject Title(s), separated by a '<>'

    qcovhsp means Query Coverage Per HSP

    scovhsp means Subject Coverage Per HSP

    qtitle means Query title

    qqual means Query quality values for the aligned part of the query

    full_qqual means Query quality values

    qstrand means Query strand

 

    Default: qseqid sseqid pident length mismatch gapopen qstart qend

sstart send evalue bitscore

--verbose (-v)           verbose console output

--log                    enable debug log

--quiet                  disable console output

--header                 Write header lines to blast tabular format.

 

Makedb options:

--in                     input reference file in FASTA format

--taxonmap               protein accession to taxid mapping file

--taxonnodes             taxonomy nodes.dmp from NCBI

--taxonnames             taxonomy names.dmp from NCBI

 

Aligner options:

--query (-q)             input query file

--strand                 query strands to search (both/minus/plus)

--un                     file for unaligned queries

--al                     file or aligned queries

--unfmt                  format of unaligned query file (fasta/fastq)

--alfmt                  format of aligned query file (fasta/fastq)

--unal                   report unaligned queries (0=no, 1=yes)

--max-target-seqs (-k)   maximum number of target sequences to report

alignments for (default=25)

--top                    report alignments within this percentage

range of top alignment score (overrides --max-target-seqs)

--max-hsps               maximum number of HSPs per target sequence to

report for each query (default=1)

--range-culling          restrict hit culling to overlapping query ranges

--compress               compression for output files (0=none, 1=gzip, zstd)

--evalue (-e)            maximum e-value to report alignments (default=0.001)

--min-score              minimum bit score to report alignments

(overrides e-value setting)

--id                     minimum identity% to report an alignment

--query-cover            minimum query cover% to report an alignment

--subject-cover          minimum subject cover% to report an alignment

--fast                   enable fast mode

--mid-sensitive          enable mid-sensitive mode

--sensitive              enable sensitive mode)

--more-sensitive         enable more sensitive mode

--very-sensitive         enable very sensitive mode

--ultra-sensitive        enable ultra sensitive mode

--iterate                iterated search with increasing sensitivity

--global-ranking (-g)    number of targets for global ranking

--block-size (-b)        sequence block size in billions of letters

(default=2.0)

--index-chunks (-c)      number of chunks for index processing (default=4)

--tmpdir (-t)            directory for temporary files

--parallel-tmpdir        directory for temporary files used by multiprocessing

--gapopen                gap open penalty

--gapextend              gap extension penalty

--frameshift (-F)        frame shift penalty (default=disabled)

--long-reads             short for --range-culling --top 10 -F 15

--matrix                 score matrix for protein alignment (default=BLOSUM62)

--custom-matrix          file containing custom scoring matrix

--comp-based-stats       composition based statistics mode (0-4)

--masking                masking algorithm (none, seg, tantan=default)

--query-gencode          genetic code to use to translate query (see

user manual)

--salltitles             include full subject titles in DAA file

--sallseqid              include all subject ids in DAA file

--no-self-hits           suppress reporting of identical self hits

--taxonlist              restrict search to list of taxon ids (comma-separated)

--taxon-exclude          exclude list of taxon ids (comma-separated)

--seqidlist              filter the database by list of accessions

--skip-missing-seqids    ignore accessions missing in the database

 

Advanced options:

--algo                   Seed search algorithm

(0=double-indexed/1=query-indexed/ctg=contiguous-seed)

--bin                    number of query bins for seed search

--min-orf (-l)           ignore translated sequences without an open

reading frame of at least this length

--seed-cut               cutoff for seed complexity

--freq-masking           mask seeds based on frequency

--freq-sd                number of standard deviations for ignoring

frequent seeds

--motif-masking          softmask abundant motifs (0/1)

--id2                    minimum number of identities for stage 1 hit

--xdrop (-x)             xdrop for ungapped alignment

--gapped-filter-evalue   E-value threshold for gapped filter (auto)

--band                   band for dynamic programming computation

--shapes (-s)            number of seed shapes (default=all available)

--shape-mask             seed shapes

--multiprocessing        enable distributed-memory parallel processing

--mp-init                initialize multiprocessing run

--mp-recover             enable continuation of interrupted multiprocessing run

--mp-query-chunk         process only a single query chunk as specified

--ext-chunk-size         chunk size for adaptive ranking (default=auto)

--no-ranking             disable ranking heuristic

--ext                    Extension mode (banded-fast/banded-slow/full)

--culling-overlap        minimum range overlap with higher scoring hit

to delete a hit (default=50%)

--taxon-k                maximum number of targets to report per species

--range-cover            percentage of query range to be covered for

range culling (default=50%)

--dbsize                 effective database size (in letters)

--no-auto-append         disable auto appending of DAA and DMND file extensions

--xml-blord-format       Use gnl|BL_ORD_ID| style format in XML output

--stop-match-score       Set the match score of stop codons against each other.

--tantan-minMaskProb     minimum repeat probability for masking (default=0.9)

--file-buffer-size       file buffer size in bytes (default=67108864)

--memory-limit (-M)      Memory limit for extension stage in GB

--no-unlink              Do not unlink temporary files.

--target-indexed         Enable target-indexed mode

--ignore-warnings        Ignore warnings

 

View options:

--daa (-a)               DIAMOND alignment archive (DAA) file

--forwardonly            only show alignments of forward strand

 

Getseq options:

--seq                    Space-separated list of sequence numbers to display.

 

Online documentation at http://www.diamondsearch.org

 

 

ラン

はじめにデータベースとなるアミノ酸配列のindexファイルを作成する。

diamond makedb --in input.faa -d nr

 

 

blastxでホモロジー検索を行う。inputは塩基配列

diamond blastx -d nr -q query.fna -o matches.m8 
  • --threads   Number of CPU threads. By default, the program will auto-detect and use all available virtual cores on the machine.

 出力はタブ区切り形式である。

 user$ head matches.m8 

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813330|ref|YP_007449782.1| 100.0 323 0 0 1 969 1 323 1.8e-179 622.1

gi|451813329|ref|NC_020286.1|:3569362-3569561,1-772 gi|451813441|ref|YP_007449893.1| 33.0 233 142 4 82 747 34 263 8.8e-25 108.2

 Diamondの検出閾値はblastのdefaultの検出閾値よりずっと低いため、stringencyはblastより高くなっている。また、defaultのパラメータはショートリード向けの設定のため、クエリ配列が長い場合、--sensitiveや--more-sensitiveをつけることが推奨されている。

 

マニュアルに書いてあるが、タブ出力するには--outfmt 6をつける。さらに以下のような指定を行うことで、出力項目を好きに設定できる。

  • qseqid Query Seq - id
  • qlen Query sequence length
  • sseqid Subject Seq - id
  • sallseqid All subject Seq - id(s), separated by a ’;’ slen Subject sequence length
  • qstart Start of alignment in query
  • qend End of alignment in query
  • sstart Start of alignment in subject
  • send End of alignment in subject
  • qseq Aligned part of query sequence
  • sseq Aligned part of subject sequence
  • full sseq Full subject sequence
  • evalue Expect value
  • bitscore Bit score
  • score Raw score
  • length Alignment length
  • pident Percentage of identical matches
  • nident Number of identical matches
  • mismatch Number of mismatches
  • positive Number of positive - scoring matches
  • gapopen Number of gap openings
  • gaps Total number of gaps
  • ppos Percentage of positive - scoring matches
  • qframe Query frame
  • btop Blast traceback operations(BTOP)
  • staxids Unique Subject Taxonomy ID(s), separated by a ’;’ (in numerical order). This field requires setting the --taxonmap parameter for makedb.
  • salltitles All Subject Title(s), separated by a ’<>’
  • qcovhsp Query Coverage Per HSP
  • qtitle Query title

デフォルトでは

"qseqid sseqid pident length mismatch

gapopen qstart qend sstart send evalue bitscore"が出力されるようになっている。

例えばdiamondでblastxサーチを行う。tabularでqseqid、sseqid、evalueのみ出力する。max target seqはデフォルト25なので増やす。

diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 qseqid sseqid evalue \
--sensitive \
--max-target-seqs 1000 \
> blast.out

 

追記

E-value ≤1E-100、amino acid identity ≥ 80%、minimum length (amino acid) ≥ 300 a.a。

diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--outfmt 6 \
--min-orf 300 --id 80 --evalue 1e-100 \
> blast.out

感度を上げるには新しく導入された--very-sensitive--ultra-sensitiveを使う(引用2参照)。max target seqはデフォルト25しかないので増やす。

diamond blastx --query input.fa \
--db uniprot_ref_proteomes.diamond.dmnd \
--evalue 1e-1 --very-sensitive \
--max-target-seqs 100000 \
--outfmt 6 \
> blast.out

all versus all

proteins.faaの総当たり比較。部分ヒットを避けるパラメータ設定を使用する(引用)。

diamond makedb --in proteins.faa --db protein_db
diamond blastp --query proteins.faa --db protein_db --out blastp.tsv \
--outfmt 6 --evalue 1e-5 --max-target-seqs 10000 --query-cover 50 \
--subject-cover 50

出力例

f:id:kazumaxneo:20220407231300p:plain

 

引用

Fast and sensitive protein alignment using DIAMONDFast and sensitive protein alignment using DIAMOND
Benjamin Buchfink, Chao Xie & Daniel H

Nature Methods 12, 59–60 (2015) doi:10.1038/nmeth.3176

PDF

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

 

Sensitive protein alignments at tree-of-life scale using DIAMOND

Benjamin Buchfink, Klaus Reuter & Hajk-Georg Drost 
Nature Methods volume 18, pages 366–368 (2021)

 

 

 

追記

Question about max-target-seqs option #29

Question about max-target-seqs option · Issue #29 · bbuchfink/diamond · GitHub

 

KMCのホームページで、diamondとKMCの連携について提案があります。詳細はKMCのHPからスクリプトをダウンロードして確認してください。

http://sun.aei.polsl.pl/REFRESH/index.php?page=projects&project=kmc&subpage=download

 

追記

 論文図1で色々なシーケンサー由来のリードを使ってタンパク質と相同性検索した時の処理時間と感度が比較されています。データによっては2万倍以上高速化しています。

https://lemosbioinfo.files.wordpress.com/2016/11/nmeth-3176.pdf

 

追記

ローカルデータベースのダウンロードについては、こちらが参考になります。

GitHub - josuebarrera/GenEra: genEra is an easy-to-use, low-dependency command-line tool that estimates the age of the last common ancestor of protein-coding gene families.

 

追記

AC-DIAMOND

 

追記

ID convert


大雑把に調べる。



NCBIのblast DBを使う。


 

2021 4/8