Minhashをメタゲノム解析へ応用する CMash - macでインフォマティクス

　Minhashは、2つの集合の類似性をJaccard指数（集合の和に対する交点の大きさの比として定義される）の観点から推定する確率的な手法である。この手法は、対象となる集合の大きさが似ている場合に最も優れた性能を発揮し、集合の大きさが大きく異なる場合には性能が大幅に低下することを実証した。本論文では、大きさの異なる集合のJaccard indexを推定するのに適した、「containment minhash approach」と呼ばれる新しい効率的な手法を紹介する。これは、高速なメンバーシップクエリのための別の確率的手法（特にBloomフィルタ）を利用することで実現している。containment Minhash法の推定誤差の確率を導出し、古典的なMinhash法を大幅に改善することを示す。また、時間的・空間的な複雑さの点でも大きな改善が見られる。応用例として、この手法を用いてメタゲノムデータセット中の生物の有無を検出し、非常に小さく、存在量の少ない微生物の存在を検出できることを示す。

インストール

Github

mamba install -c bioconda cmash -y

> MakeDNADatabase.py -h

usage: MakeDNADatabase.py [-h] [-p PRIME] [-t THREADS] [-n NUM_HASHES]

[-k K_SIZE] [-i]

in_file out_file

This script creates training/reference sketches for each FASTA/Q file listed

in the input file.

positional arguments:

in_file Input file: file containing (absolute) file names of

training genomes.

out_file Output training database/reference file (in HDF5

format)

optional arguments:

-h, --help show this help message and exit

-p PRIME, --prime PRIME

Prime (for modding hashes) (default: 9999999999971)

-t THREADS, --threads THREADS

Number of threads to use (default: 128)

-n NUM_HASHES, --num_hashes NUM_HASHES

Number of hashes to use. (default: 500)

-k K_SIZE, --k_size K_SIZE

K-mer size (default: 21)

-i, --intersect_nodegraph

Optional flag to export Nodegraph file (bloom filter)

containing all k-mers in the training database. Saved

in same location as out_file. This is to be used with

QueryDNADatabase.py (default: False)

> MakeNodeGraph.py -h

usage: MakeNodeGraph.py [-h] [-fp FP_RATE] [-i INTERSECT_NODEGRAPH]

[-k K_SIZE] [-t THREADS]

in_file out_dir

This script will create node graph for a given k-mer size and query file (can

be used as input to QueryDNADatabase.py)

positional arguments:

in_file Input file: FASTQ/A file (can be gzipped).

out_dir Output directory

optional arguments:

-h, --help show this help message and exit

-fp FP_RATE, --fp_rate FP_RATE

False positive rate. (default: 0.0001)

-i INTERSECT_NODEGRAPH, --intersect_nodegraph INTERSECT_NODEGRAPH

Location of Node Graph. Will only insert query k-mers

in bloom filter if they appear anywhere in the

training database. Note that the Jaccard estimates

will now be J(query intersect union_i training_i,

training_i) instead of J(query, training_i), but will

use significantly less space (unfortunately will also

disable threading). (default: None)

-k K_SIZE, --k_size K_SIZE

K-mer size (default: 21)

-t THREADS, --threads THREADS

Number of threads to use (default: 128)

> QueryDNADatabase.py -h

usage: QueryDNADatabase.py [-h] [-t THREADS] [-f] [-fp FP_RATE]

[-ct CONTAINMENT_THRESHOLD] [-c CONFIDENCE]

[-ng NODE_GRAPH] [-b] [-i]

in_file training_data out_csv

This script creates a CSV file of similarity indicies between the input file

and each of the sketches in the training/reference file.

positional arguments:

in_file Input file: FASTQ/A file (can be gzipped).

training_data Training/reference data (HDF5 file created by

MakeTrainingDatabase.py)

out_csv Output CSV file

optional arguments:

-h, --help show this help message and exit

-t THREADS, --threads THREADS

Number of threads to use (default: 128)

-f, --force Force creation of new NodeGraph. (default: False)

-fp FP_RATE, --fp_rate FP_RATE

False positive rate. (default: 0.0001)

-ct CONTAINMENT_THRESHOLD, --containment_threshold CONTAINMENT_THRESHOLD

Only return results with containment index above this

value (default: 0.02)

-c CONFIDENCE, --confidence CONFIDENCE

Desired probability that all results were returned

with containment index above threshold [-ct] (default:

0.95)

-ng NODE_GRAPH, --node_graph NODE_GRAPH

NodeGraph/bloom filter location. Used if it exists; if

not, one will be created and put in the same directory

as the specified output CSV file. (default: None)

-b, --base_name Flag to indicate that only the base names (not the

full path) should be saved in the output CSV file

(default: False)

-i, --intersect_nodegraph

Option to only insert query k-mers in bloom filter if

they appear anywhere in the training database. Note

that the Jaccard estimates will now be J(query

intersect union_i training_i, training_i) instead of

J(query, training_i), but will use significantly less

space. (default: False)

実行方法

リファレンス／トレーニングデータベースを作成し、ブルームフィルターを形成し、データベースに問い合わせるという流れで使う。

１、fasta/fastqのフルパスを指定した1行ずつ指定したファイルを指定してデータベースを作成。

MakeDNADatabase.py list.txt TrainingDatabase.h5

２、（任意）ブルームフィルタを作成

MakeNodeGraph.py TrainingDatabase.h5 outdir

３、クエリファイルMetagenome.faとデータベースとのJaccard indexの推定値を得る

QueryDNADatabase.py Metagenome.fa TrainingDatabase.h5 output.csv

引用

Improving MinHash via the containment index with applications to metagenomic analysis
David Koslickia, Hooman Zabetib

https://doi.org/10.1016/j.amc.2019.02.018