RabbitTClust: fast clustering analysis of millions of bacterial genomes with MinHash sketches

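This post introduces RabbitTClust, a fast and memory-efficient genome clustering tool based on sketch-based distance estimation. It combines dimensionality reduction techniques with streaming and parallelization on modern multi-core platforms, which makes it possible to process large datasets efficiently. On a 128-core workstation it can cluster 113,674 complete bacterial genome sequences (RefSeq: 455 GB in FASTA format) in under 6 minutes, and 1,009,738 GenBank assembled bacterial genomes (4.0 TB in FASTA format) in only 34 minutes. The authors also identified 1,269 redundant genomes (identical nucleotide content) among the RefSeq bacterial genomes.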
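To make "sketch-based distance estimation" concrete, here is a minimal Python sketch of the general MinHash idea: keep only the smallest k-mer hashes of each genome, estimate the Jaccard similarity from those, and convert it to a Mash-style distance. The function names, the BLAKE2b hash, and the defaults `k=21` and `sketch_size=1000` are assumptions made for illustration, and reverse-complement (canonical) k-mers are ignored for brevity. It is not RabbitTClust's actual implementation.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (BLAKE2b is an arbitrary choice)."""
    return {
        int.from_bytes(
            hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest(), "big"
        )
        for i in range(len(seq) - k + 1)
    }


def minhash_sketch(seq, k=21, sketch_size=1000):
    """Bottom-s sketch: keep only the sketch_size smallest distinct k-mer hashes."""
    return sorted(kmer_hashes(seq, k))[:sketch_size]


def jaccard_estimate(sketch_a, sketch_b, sketch_size=1000):
    """Estimate Jaccard similarity from the smallest hashes of the merged sketches."""
    set_a, set_b = set(sketch_a), set(sketch_b)
    merged = sorted(set_a | set_b)[:sketch_size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in set_a and h in set_b)
    return shared / len(merged)


def mash_distance(jaccard, k=21):
    """Mash-style distance: -ln(2j / (1 + j)) / k, capped at 1.0 when j is 0."""
    if jaccard <= 0.0:
        return 1.0
    return max(0.0, -math.log(2.0 * jaccard / (1.0 + jaccard)) / k)
```

Two genomes would then be compared with `mash_distance(jaccard_estimate(minhash_sketch(a), minhash_sketch(b)))`. Because only the fixed-size sketches are compared, the cost per pair does not grow with genome length.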

Installation

Build dependencies

cmake v.3.0 or later
c++14
zlib

GitHub

git clone --recursive https://github.com/RabbitBio/RabbitTClust.git
cd RabbitTClust
./install.sh

> ./clust-mst

usage: clust-mst [-h] [-l] [-t] <int> [-d] <double> -F <string> [-i] <string> [-o] <string>

usage: clust-mst [-h] [-f] [-E] [-d] <double> [-i] <string> <string> [-o] <string>

usage: clust-greedy [-h] [-l] [-t] <int> [-d] <double> [-F] <string> [-i] <string> [-o] <string>

usage: clust-greedy [-h] [-f] [-d] <double> [-i] <string> <string> [-o] <string>

-h : this help message

-m <int> : set the filter minimum genome length (minLen), genome with total length less the minLen will be ignore, for both clust-mst and clust-greedy

-k <int> : set kmer size, automatically calculate the kmer size without -k option, for both clust-mst and clust-greedy

-s <int> : set sketch size, default 1000, for both clust-mst and clust-greedy

-c <int> : set sampling ratio to compute viriable sketchSize, sketchSize = genomeSize/samplingRatio, only support with MinHash sketch function of clust-greedy

-d <double> : set the distance threshold, default 0.05 for both clust-mst and clust-greedy

-t <int> : set the thread number, default take full usage of platform cores number, for both clust-mst and clust-greedy

-l : input is a file list, not a single gneome file. Lines in the input file list specify paths to genome files, one per line, for both clust-mst and clust greedy

-i <string> : path of original input genome file or file list, input as the intermediate files should be used with option -f or -E

-f : two input files, genomeInfo and MSTInfo files for clust-mst; genomeInfo and sketchInfo files for clust-greedy

-E : two input files, genomeInfo and sketchInfo for clust-mst

-o <string> : path of output file, for both clust-mst and clust-greedy

-F <string> : set the sketch function, including MinHash and KSSD, default MinHash, for both clust-mst and clust-greedy

-e : not save the intermediate files generated from the origin genome file, such as the GenomeInfo, MSTInfo, and SketchInfo files, for both clust-mst and clust-greedy

> ./clust-greedy

clust-greedy prints the same usage message as clust-mst shown above.

Usage

RabbitTClust supports two algorithms to cover different scenarios: classic single-linkage hierarchical clustering (clust-mst) and greedy incremental clustering (clust-greedy).
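The difference between the two strategies can be sketched on a small distance matrix. The code below is an illustrative assumption of how each approach works in general, not RabbitTClust's code: single-linkage clusters are obtained by building a minimum spanning tree and cutting edges longer than the threshold, while the greedy incremental method assigns each genome to the first existing representative within the threshold. All function names are hypothetical.

```python
from itertools import combinations


class UnionFind:
    """Disjoint-set structure used to track connected components."""

    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        return True


def mst_edges(n, dist):
    """Kruskal's algorithm on the complete graph; returns MST edges (d, i, j)."""
    uf = UnionFind(n)
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    return [(d, i, j) for d, i, j in edges if uf.union(i, j)]


def single_linkage(n, dist, threshold):
    """Cut MST edges longer than threshold; remaining components are the clusters."""
    uf = UnionFind(n)
    for d, i, j in mst_edges(n, dist):
        if d <= threshold:
            uf.union(i, j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(uf.find(i), []).append(i)
    return sorted(clusters.values())


def greedy_incremental(order, dist, threshold):
    """Each item joins the first representative within threshold, else founds a cluster."""
    clusters = {}  # representative index -> member indices
    for i in order:
        for rep in clusters:
            if dist[rep][i] <= threshold:
                clusters[rep].append(i)
                break
        else:
            clusters[i] = [i]
    return list(clusters.values())
```

With a chain such as d(0,1) = d(1,2) = 0.04 and d(0,2) = 0.08 at a threshold of 0.05, single linkage merges all three genomes through the intermediate one, whereas the greedy method keeps genome 2 separate because it is too far from representative 0. The MST only needs to be built once and can then be cut at different thresholds.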

To use a single genome file as input (the sequences contained in that one file are compared with each other):

clust-mst -i bacteria.fna -o bacteria.mst.clust

clust-greedy -i bacteria.fna -o bacteria.greedy.clust

clust-mst writes its result to bacteria.mst.clust.

To cluster multiple genome files, provide a list file containing the path to each genome file and add the "-l" option.

ls <path>/<to>/genome*.fna > bact_refseq.list
clust-mst -l -i bact_refseq.list -o bact_refseq.mst.clust

clust-greedy -l -i bact_genbank.list -o bact_genbank.greedy.clust

-l : the input is a file list, not a single genome file. Each line of the list gives the path to one genome file. This applies to both clust-mst and clust-greedy.
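If you would rather build the list file from a script, the following Python sketch does the same job: it collects the genome files in a directory, writes one path per line, and assembles the command line. The helper names and the `*.fna` pattern are assumptions for illustration. Only the `-l`, `-i`, and `-o` options come from the help message above.

```python
from pathlib import Path


def write_genome_list(genome_dir, list_path, pattern="*.fna"):
    """Write the paths of matching genome files to list_path, one per line."""
    paths = sorted(Path(genome_dir).glob(pattern))
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


def clust_command(tool, list_path, output_path):
    """Build the argument list for clust-mst or clust-greedy in file-list mode (-l)."""
    return [tool, "-l", "-i", str(list_path), "-o", str(output_path)]
```

The returned argument list can be passed to `subprocess.run(..., check=True)`, provided the `clust-mst` or `clust-greedy` binary is reachable (for example via its full path in the build directory).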

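When multiple genomes are given, the output is in a CD-HIT-like tab-delimited format.

The genomes belonging to each cluster are reported. Here, cluster0 contains six genomes. The columns, from left to right, are:
1. Local index within the cluster
2. Global index of the genome
3. Genome size
4. Genome file name (including the genome assembly accession number)
5. Sequence name (the first sequence in the genome file)
6. Sequence comment (the remainder of the line)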
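The six columns above can be read back into Python with a small parser. This is a sketch under stated assumptions: it assumes member lines are tab-delimited with the six fields in the order listed, and that any other non-empty line starts a new cluster. The exact header text and the format of the size field are not confirmed here, so the size is kept as a string. The function name and dictionary keys are hypothetical.

```python
def parse_clusters(text):
    """Parse cluster output into a list of clusters, each a list of member dicts.

    Assumption: a line with at least five tab-separated fields is a member line;
    any other non-empty line is treated as a cluster header.
    """
    clusters = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        fields = stripped.split("\t")
        if len(fields) < 5:
            clusters.append([])  # header line: start a new cluster
            continue
        if not clusters:
            clusters.append([])  # tolerate output without a leading header
        clusters[-1].append({
            "local_index": int(fields[0]),
            "global_index": int(fields[1]),
            "genome_size": fields[2],  # kept as text; format is not assumed
            "file_name": fields[3],
            "sequence_name": fields[4],
            "comment": "\t".join(fields[5:]),
        })
    return clusters
```

For example, `[len(c) for c in parse_clusters(open("bact_refseq.mst.clust").read())]` would give the size of each cluster, if the file follows the assumed layout.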

When I analyzed roughly 4,000 bacterial genomes with clust-mst, the run took only a few seconds.

Unrelated to the research itself, but the rabbit illustration in the repository is cute.

Citation

RabbitTClust: enabling fast clustering analysis of millions bacteria genomes with MinHash sketches

Xiaoming Xu, Zekun Yin, Lifeng Yan, Hao Zhang, Borui Xu, Yanjie Wei, Beifang Niu, Bertil Schmidt, Weiguo Liu

bioRxiv, Posted November 10, 2022.

RabbitTClust: enabling fast clustering analysis of millions of bacteria genomes with MinHash sketches
Xiaoming Xu, Zekun Yin, Lifeng Yan, Hao Zhang, Borui Xu, Yanjie Wei, Beifang Niu, Bertil Schmidt & Weiguo Liu
Genome Biology volume 24, Article number: 121 (2023)