超高速でスケーラブルなゲノム検索を行う gsearch

　ゲノム検索や分類は、データベース（参照ゲノム）に最もマッチするゲノムを見つけることが一般的であるが、利用可能なデータベースゲノムの数が増加していることや、従来の手法が大規模なデータベースに対してうまくスケールしないという事実により、ますます困難になってきている。ゲノム距離を推定するためのk-merハッシュベースの確率的データ構造（ProbMinHash、SuperMinHash、Densified MinHash、SetSketch）と、グラフベースの最近傍探索アルゴリズム（Hierarchical Navigable Small World Graphs、HNSW）を組み合わせることにより、本著者らは新しいデータ構造を作成し、関連するコンピュータプログラムGSearchを開発した。例えば、GSearchは、8,000のクエリゲノムを、利用可能なすべての微生物ゲノムまたはウイルスゲノム（それぞれn = 約318 000または約3 000 000）と照合し、そのベストマッチを数分以内に検索することができる。注目すべきは、GSearchはO(log(N))の時間複雑度を持ち、データベース分割戦略に基づいて数十億のゲノムでも十分にスケールすることである。さらに、GSearchはクエリーゲノムの新規性の程度に応じて3段階の検索戦略を実装し、特異性と感度を最大化する。したがって、GSearchはゲノム検索や分類を必要とするマイクロバイオーム研究の主要なボトルネックを解決する。

インストール

ubuntu22.04LTSで環境を作って導入した。レポジトリではmacos向けにhomebrewによるインストール方法も説明されている。

Github

#conda(link)
mamba create -n gsearch -y
conda  activate gsearch
mamba install -c bioconda gsearch

> gsearch -h

$ gsearch -h

************** initializing logger *****************

Approximate nearest neighbour search for genomes based on MinHash-like metrics

Usage: gsearch [OPTIONS] [COMMAND]

Commands:

tohnsw Build HNSW graph database from a collection of database genomes based on MinHash-like metric

add Add new genome files to a pre-built HNSW graph database

request Request nearest neighbors of query genomes against a pre-built HNSW graph database/index

ann Approximate Nearest Neighbor Embedding using UMAP-like algorithm

help Print this message or the help of the given subcommand(s)

Options:

--pio <pio> Parallel IO processing

--nbthreads <nbthreads> nb thread for sketching

-h, --help Print help

-V, --version Print version

> gsearch tohnsw -h

$ gsearch tohnsw -h

************** initializing logger *****************

Build HNSW graph database from a collection of database genomes based on MinHash-like metric

Usage: gsearch tohnsw [OPTIONS] --dir <hnsw_dir> --kmer <kmer_size> --sketch <sketch_size> --nbng <neighbours> --algo <sketch_algo>

Options:

-d, --dir <hnsw_dir> directory for storing database genomes

-k, --kmer <kmer_size> k-mer size to use

-s, --sketch <sketch_size> sketch size of minhash to use

-n, --nbng <neighbours> Maximum allowed number of neighbors (M) in HNSW

--ef <ef> ef_construct in HNSW

--algo <sketch_algo> specifiy the algorithm to use for sketching: prob, super/super2, hll or optdens/revoptdens

--aa --aa Specificy amino acid processing, require no value

--block --block : sketching is done concatenating sequences

-h, --help Print help

> gsearch add -h

$ gsearch add -h

************** initializing logger *****************

Add new genome files to a pre-built HNSW graph database

Usage: gsearch add --hnsw <hnsw_dir> --new <newdata_dir>

Options:

-b, --hnsw <hnsw_dir> set the name of directory containing already constructed hnsw data

-n, --new <newdata_dir> set directory containing new data

-h, --help Print help

> gsearch request -h

$ gsearch request -h

************** initializing logger *****************

Request nearest neighbors of query genomes against a pre-built HNSW graph database/index

Usage: gsearch request --hnsw <DATADIR> --nbanswers <nb_answers> --query <request_dir>

Options:

-b, --hnsw <DATADIR> directory contains pre-built database files

-n, --nbanswers <nb_answers> Sets the number of neighbors for the query

-r, --query <request_dir> Sets the directory of request genomes

-h, --help Print help

> gsearch ann -h

$ gsearch ann -h

************** initializing logger *****************

Approximate Nearest Neighbor Embedding using UMAP-like algorithm

Usage: gsearch ann [OPTIONS] --hnsw <hnsw_dir>

Options:

-b, --hnsw <hnsw_dir> directory containing hnsw

-s, --stats to get stats on nb neighbours

-e, --embed --embed to do an embedding

-h, --help Print help

> superani -h

$ superani -h

************** initializing logger *****************

Computing average nucleotide identity between reference and query genomes via sparse kmer chaining or Open Syncmer with Densified MinHash

Usage: superani --ql <FILE> --rl <FILE> --output <FILE>

Options:

-q, --ql <FILE> A file containing a list of query genome paths (.gz supported)

-r, --rl <FILE> A file containing a list of reference genome paths (.gz supported)

-o, --output <FILE> Output file to write results

-h, --help Print help

-V, --version Print version

> superaai -h

$ superaai -h

************** initializing logger *****************

Compute Average Amino Acid Identity (AAI) via FracMinHash/Sourmash for genomes

Usage: superaai [OPTIONS] --ql <FILE> --rl <FILE> --output <FILE>

Options:

-q, --ql <FILE> File containing list of query protein paths (.faa format, .gz supported)

-r, --rl <FILE> File containing list of reference protein paths (.faa format, .gz supported)

-o, --output <FILE> Output file to write results

-k, --kmer <INT> K-mer size for MinHash calculation [default: 7]

-l, --scaled <INT> Scaled factor for MinHash calculation [default: 100]

-s, --sketch <INT> Sketch size for MinHash (number of hashes) [default: 5120]

-h, --help Print help

-V, --version Print version

実行方法

１、新規indexの作成

gsearch tohnsw -d fasta_dir/ -k 21 -n 1 -s 10 --algo prob

-d directory for storing database genomes
-k k-mer size to use
-s sketch size of minhash to use
-n Maximum allowed number of neighbors (M) in HNSW
--algo specifiy the algorithm to use for sketching: prob, super/super2, hll or optdens/revoptdens

２、既知indexに追加

gsearch add --hnsw <hnsw_dir> --new <newdata_dir>

３

gsearch request --hnsw <DATADIR> --nbanswers <nb_answers> --query <request_dir>

”クエリーゲノムパス、データベースゲノムパス（距離によるランク付け）、距離の3つのカラムを持つ表形式のファイル（gsearch.neighbers.txt）がカレントディレクトリのディスクに保存される”。

距離はANIまたはAAIに変換できる。

4、superani - ANIの計算（MinHash 推定）

入力はクエリゲノムのパスとリフレッシュゲノムのパス。

superani -q name.txt -r name.txt -o out

-q A file containing a list of query genome paths (.gz supported)
-r A file containing a list of reference genome paths (.gz supported)
-o Output file to write results

出力は各クエリと各参照ゲノム間の ANI となる。

出力例

５，superaai - AAIの計算

superani -q name.txt -r name.txt -o out

-q File containing list of query protein paths (.faa format, .gz supported)
-r File containing list of reference protein paths (.faa format, .gz supported)
-o Output file to write results

引用

GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs

Jianshu Zhao, Jean Pierre Both, Luis M Rodriguez-R, Konstantinos T Konstantinidis

Nucleic Acids Research, Volume 52, Issue 16, 9 September 2024, Page e74,

GSearch: Ultra-Fast and Scalable Microbial Genome Search by Combining K-mer Hashing with Hierarchical Navigable Small World Graphs

Jianshu Zhao, Jean Pierre Both, Luis M. Rodriguez-R, Konstantinos T. Konstantinidis

bioRxiv, Posted November 13, 2023.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

超高速でスケーラブルなゲノム検索を行う gsearch