COBS index - macでインフォマティクス

From GitHub

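COBS (COmpact Bit-sliced Signature index) is a cross between an inverted index and Bloom filters. Its goal is to index the k-mers of DNA samples or the q-grams of text documents, and to answer approximate pattern matching queries against the corpus with a user-chosen coverage threshold. Query results may contain some false positives, but their number drops exponentially with the query length, at a rate governed by the false positive rate chosen when the index is built. COBS' compact yet simple data structure outperforms other indexes in construction time and query performance, with Mantis by Pandey et al. in second place. Unlike Mantis and other prior work, however, COBS does not need the complete index in RAM, and is therefore designed to scale to larger document collections.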
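To make the "inverted index crossed with Bloom filters" idea concrete, here is a minimal toy sketch in Python. It is not COBS's implementation: the class name `ToyBitSlicedIndex`, the SHA-256 based hash, and the filter size are all assumptions chosen for illustration. Each document gets one Bloom filter, and the filters are stored row by row so that a single row lookup answers "which documents have this bit set".

```python
import hashlib

def kmers(seq, k):
    """Return every k-length substring of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def positions(term, num_bits, num_hashes=1):
    """Map a term to bit positions (illustrative hash, not the one COBS uses)."""
    out = []
    for seed in range(num_hashes):
        digest = hashlib.sha256(f"{seed}:{term}".encode()).digest()
        out.append(int.from_bytes(digest[:8], "big") % num_bits)
    return out

class ToyBitSlicedIndex:
    """One Bloom filter per document, stored row-wise:
    bit d of rows[pos] is set if document d set position pos in its filter."""

    def __init__(self, docs, k, num_bits, num_hashes=1):
        self.k, self.num_bits, self.num_hashes = k, num_bits, num_hashes
        self.names = list(docs)
        self.rows = [0] * num_bits  # each int is a bit vector over documents
        for d, name in enumerate(self.names):
            for term in set(kmers(docs[name], k)):
                for pos in positions(term, num_bits, num_hashes):
                    self.rows[pos] |= 1 << d

    def query(self, pattern, threshold=0.8):
        """Return {document: hits} for documents matching at least
        `threshold` (a fraction) of the query k-mers."""
        terms = kmers(pattern, self.k)
        if not terms:
            return {}
        counts = [0] * len(self.names)
        for term in terms:
            hit = ~0  # start with all documents, AND away the misses
            for pos in positions(term, self.num_bits, self.num_hashes):
                hit &= self.rows[pos]
            for d in range(len(self.names)):
                counts[d] += (hit >> d) & 1
        need = threshold * len(terms)
        return {n: c for n, c in zip(self.names, counts) if c >= need}

docs = {"doc_a": "ACGTACGTTAGC", "doc_b": "TTTTGGGGCCCC"}
index = ToyBitSlicedIndex(docs, k=4, num_bits=1024)
result = index.query("ACGTACGT")
print(result)
```

A document that really contains every query k-mer is always reported (Bloom filters have no false negatives). Other documents can appear only through hash collisions, which is where the false positives come from.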

Documentation

https://cobs.readthedocs.io/en/latest/#

COBS can read FASTA files (*.fa, *.fasta, *.fa.gz, *.fasta.gz), FASTQ files (*.fq, *.fastq, *.fq.gz, *.fastq.gz), "Multi-FASTA" and "Multi-FASTQ" files (*.mfasta, *.mfastq), McCortex files (*.ctx), and text files (*.txt). Each file type is parsed into q-grams or k-mers in a slightly different way.
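As an illustration of what "parsed into k-mers" means for DNA input, the sketch below extracts canonical k-mers from a sequence. It assumes the common convention that the canonical form is the lexicographically smaller of a k-mer and its reverse complement, and it skips windows containing characters other than A, C, G, T. Both choices are assumptions for this example and may differ from what COBS does internally.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_kmers(seq, k=31):
    """Yield the canonical form of each k-mer in seq.

    Assumption: canonical = min(k-mer, reverse complement).
    Windows with non-ACGT characters (e.g. N) are skipped.
    """
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            yield min(kmer, reverse_complement(kmer))

example = list(canonical_kmers("ACGTT", k=3))
print(example)
```

With canonicalization, a read and its reverse complement produce the same set of terms, so a query matches regardless of strand. The --no-canonicalize option shown in the help output further down turns this off.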

Installation

GitHub

git clone --recursive https://github.com/bingmann/cobs.git
mkdir cobs/build
cd cobs/build
cmake ..
make -j4

> src/cobs
(Co)mpact (B)it-Sliced (S)ignature Index for Genome Search

Usage: src/cobs <subtool> ...

Available subtools:
  doc-list                   read a list of documents and print the list
  doc-dump                   read a list of documents and dump their contents
  classic-construct          constructs a classic index from the documents in <in_dir>
  classic-construct-random   constructs a classic index with random content
  compact-construct          creates the folders used for further construction
  compact-construct-combine  combines the classic indices in <in_dir> to form a compact index
  query                      query an index
  print-parameters           calculates index parameters
  print-kmers                print all canonical kmers from <query>
  benchmark-fpr              run benchmark and false positive measurement
  generate-queries           select queries randomly from documents

See https://panthema.net/cobs for more information on COBS.

> cobs compact-construct -h
Usage: cobs compact-construct [options] <input> <out_file>
Parameters:
  input                      path to the input directory or file
  out_file                   path to the output .cobs_compact index file
Options:
  -C, --clobber              erase output directory if it exists
  --continue                 continue in existing output directory
  -f, --false-positive-rate  false positive rate, default: 0.300000
  --file-type                "list" to read a file list, or filter documents by
                             file type (any, text, cortex, fasta, fastq, etc)
  --keep-temporary           keep temporary files during construction
  -m, --memory               memory in bytes to use, default: 201.307 Gi
  --no-canonicalize          don't canonicalize DNA k-mers, default: false
  -h, --num-hashes           number of hash functions, default: 1
  -p, --page-size            the page size of the compact the index, default:
                             sqrt(#documents)
  -k, --term-size            term size (k-mer size), default: 31
  -T, --threads              number of threads to use, default: max cores
  --tmp-path                 directory for intermediate index files, default:
                             out_file + ".tmp")
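The --false-positive-rate and --num-hashes options together determine how many bits a Bloom filter needs per indexed term. The sketch below uses the standard Bloom filter approximation p = (1 - e^(-h*n/m))^h, solved for the filter size m. Whether COBS sizes its filters with exactly this formula is an assumption here, and the function names are made up for the example.

```python
import math

def bloom_filter_bits(num_terms, false_positive_rate=0.3, num_hashes=1):
    """Bits needed so that a filter holding num_terms terms has roughly
    the requested false positive rate (standard approximation)."""
    ratio = -num_hashes / math.log(1.0 - false_positive_rate ** (1.0 / num_hashes))
    return math.ceil(num_terms * ratio)

def expected_fpr(num_terms, num_bits, num_hashes=1):
    """False positive rate predicted by the same approximation."""
    return (1.0 - math.exp(-num_hashes * num_terms / num_bits)) ** num_hashes

bits = bloom_filter_bits(1_000_000)  # defaults from the help text: 0.3, 1 hash
print(bits, bits / 1_000_000, expected_fpr(1_000_000, bits))
```

With the defaults of one hash function and a rate of 0.3, this comes to roughly 2.8 bits per term, which shows why a seemingly high per-term false positive rate is attractive: the index stays small, and the threshold over many k-mers in a query filters out most spurious hits.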

> cobs query -h
Usage: cobs query [options] [query]
Parameters:
  query                      the text sequence to search for
Options:
  -f, --file                 query (fasta) file to process
  -i, --index                path to index file(s)
  -l, --limit                number of results to return, default: all
  --load-complete            load complete index into RAM for batch queries
  -T, --threads              number of threads to use, default: max cores
  -t, --threshold            threshold in percentage of terms in query matching,
                             default: 0.8

Test run

1. Indexing

Build a COBS index from the seven FASTA files placed in fasta/.

src/cobs compact-construct tests/data/fasta/ example.cobs_compact

example.cobs_compact is written as output.

2. Query an index

Run a query against the index.

src/cobs query -i example.cobs_compact AGTCAACGCTAAGGCATTTCCCCCCTGCCTCCTGCCTGCTGCCAAGCCCT

# Query with sequences from a FASTA file
src/cobs query -i example.cobs_compact -f query.fa

-f query (fasta) file to process
-i path to index file(s)
-t threshold in percentage of terms in query matching, default: 0.8
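The interaction between the threshold, the k-mer size, and the per-term false positive rate can be worked through with a little arithmetic. The sketch below assumes that a query of length L yields L - k + 1 k-mers, that the threshold is a fraction of those k-mers (0.8 means 80%), and that each k-mer produces a false hit independently with probability p. The independence assumption and the rounding rule are simplifications for illustration, not a statement about COBS's exact behavior.

```python
import math

def required_hits(query_len, k=31, threshold=0.8):
    """Return (number of k-mers in the query, hits needed to pass the threshold)."""
    num_terms = max(query_len - k + 1, 0)
    need = math.ceil(threshold * num_terms - 1e-9)  # guard against float noise
    return num_terms, need

def false_match_probability(num_terms, need, p=0.3):
    """Binomial tail: chance that an unrelated document still reaches
    `need` hits when each k-mer falsely hits with probability p."""
    return sum(
        math.comb(num_terms, i) * p ** i * (1 - p) ** (num_terms - i)
        for i in range(need, num_terms + 1)
    )

for length in (31, 50, 100):
    n, need = required_hits(length)
    print(length, n, need, false_match_probability(n, need))
```

For a single 31-mer the chance of a spurious match is simply p, but it falls off quickly as the query gets longer, which is the exponential decrease mentioned in the introduction.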

Information on the matching sequences is returned.

A Multi-FASTA or Multi-FASTQ file is parsed as many documents: each sequence in the file is treated as a separate document in the COBS index.
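The "one sequence, one document" rule for Multi-FASTA input can be pictured with a small parser. This is only a sketch: the function name `split_multi_fasta` is hypothetical, and using the first word of the header line as the document name is an assumption, not necessarily how COBS names documents.

```python
def split_multi_fasta(text):
    """Split Multi-FASTA text into {record name: sequence}.

    Assumption: the document name is the first word after '>'.
    Records with duplicate names overwrite earlier ones in this sketch.
    """
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else f"record_{len(parts)}"
            parts[name] = []
        elif name is not None:
            parts[name].append(line)  # sequence lines may wrap
    return {n: "".join(chunks) for n, chunks in parts.items()}

sample = ">s1 first record\nACGT\nAC\n>s2\nTTTT\n"
documents = split_multi_fasta(sample)
print(documents)
```

In contrast, an ordinary FASTA file would contribute all of its sequences to a single document, so the choice of file extension changes the granularity of the query results.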

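A paper was recently published that built genome assemblies of consistent quality (639,981 high-quality assemblies) from all of the paired-end sequencing data for bacterial genomes submitted to the ENA (link). COBS indexes were used for sequence search in that work, which is what got me interested in this implementation. The COBS index released with the paper (link) was close to 900 GB, though, so I did not download it.

Citation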

COBS: a Compact Bit-Sliced Signature Index
Timo Bingmann, Phelim Bradley, Florian Gauger, Zamin Iqbal

arXiv, [Submitted on 23 May 2019 (v1), last revised 26 Jul 2019 (this version, v2)]