LorBin: efficient binning of metagenomic long reads with multiscale adaptive clustering and evaluation

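Long-read sequencing has transformed metagenomics and improved the quality of metagenome-assembled genomes (MAGs). However, current binning methods still struggle to identify unknown species and to handle imbalanced species distributions. This paper introduces LorBin, an unsupervised binner designed specifically to reconstruct MAGs from natural microbiomes. LorBin uses two-stage multiscale adaptive DBSCAN and BIRCH clustering, combined with an evaluation decision model based on single-copy genes, to maximize MAG recovery. LorBin outperforms six competing binners on both simulated and real microbiomes, including oral, gut, and marine samples. Compared with state-of-the-art binning methods, LorBin generated 15 to 189% more high-quality MAGs with high serendipity and identified 2.4 to 17 times more novel taxa. Taken together, LorBin is a promising long-read metagenomic binner for accessing species-rich samples that contain unknown taxa, and it is efficient at retrieving more complete genomes from imbalanced natural microbiomes.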
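The abstract mentions an evaluation step based on single-copy genes. As a rough illustration of that general idea (not LorBin's actual decision model), the sketch below estimates the completeness and contamination of one bin from the single-copy marker genes found in it. The function name `estimate_quality`, the marker names, and both formulas are simplified assumptions made for this example.

```python
from collections import Counter


def estimate_quality(marker_hits, marker_set):
    """Estimate (completeness, contamination) of one bin, in percent.

    marker_hits: marker gene names found in the bin, one entry per hit.
    marker_set:  single-copy marker genes expected exactly once per genome.
    Both formulas are simplified assumptions, not LorBin's model.
    """
    if not marker_set:
        raise ValueError("marker_set must not be empty")
    # Count hits per marker, ignoring genes outside the marker set.
    counts = Counter(hit for hit in marker_hits if hit in marker_set)
    n_markers = len(marker_set)
    # Completeness: share of expected markers seen at least once.
    completeness = 100.0 * len(counts) / n_markers
    # Contamination: copies beyond the first, relative to the set size.
    extra_copies = sum(c - 1 for c in counts.values())
    contamination = 100.0 * extra_copies / n_markers
    return completeness, contamination


if __name__ == "__main__":
    markers = {"rpsB", "rplA", "rpoB", "gyrA"}  # hypothetical marker set
    hits = ["rpsB", "rplA", "rplA", "rpoB"]     # rplA is found twice
    print(estimate_quality(hits, markers))
```

A bin with many markers present once and few duplicates would score as complete and clean under this scheme; duplicated markers hint that contigs from more than one genome were merged.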

Installation

I installed it on Ubuntu 21 using the YAML file provided by the authors.

Dependencies

biopython=1.78
pytorch=1.11.0
setuptools=65.5.0
torchvision=0.12.0
torchaudio=0.11.0
numpy=1.23.3
pip=22.2.2
pandas=2.2.2
scikit-learn=1.1.2
scipy=1.13.1
joblib=1.4.2
Sequence processing tool:
minimap2=2.24-r1122
samtools=1.15.1
hmmer=3.1b2
prodigal=2.6.3
bedtools=2.26.0

GitHub

git clone https://github.com/LorMeBioAI/LorBin.git
cd LorBin/
mamba env create -f lorbin_env.yaml
conda activate lorbin_env
pip install dist/lorbin-0.1.0.tar.gz

# Or install the dependencies individually (CPU only)
mamba create -n lorbin_env python=3.10
conda activate lorbin_env
mamba install -c conda-forge \
 python=3.10 numpy=1.23.5 scipy=1.9.3 scikit-learn=1.1.2 pandas=1.5.3 \
 joblib pytorch biopython
# bedtools is also required
mamba install -c bioconda bedtools -y

# LorBin itself
git clone https://github.com/LorMeBioAI/LorBin.git
cd LorBin/
pip install dist/lorbin-0.1.0.tar.gz

> LorBin -h

usage: LorBin [-h] ...

options:
  -h, --help      show this help message and exit

LorBin subcommands:
  bin             Bin contigs using one command.
  generate_data   Generate sequence features(kmer and abundance) as trainning data.
  concat          Create the input FASTA file for LorBin, input should be at least two FASTA files,
                  each from a sample-specific assembly, resulting FASTA can be binsplit with separator '-'
  train           Train model and get embedding.
  cluster         Bin contigs using two-stage clustering algorithm based on embedded features file.

> LorBin bin -h

usage: LorBin bin [-h] -o OUTPUT -fa FASTA [--bin_length BIN_LENGTH] -b BAM [BAM ...] [--num_process NUM_PROCESS] [--evaluation EVALUATION] [-a AKEEP] [--multi] [--epoch EPOCH]
                  [--batch_size BATCH_SIZE] [--batchsteps BATCHSTEPS [BATCHSTEPS ...]] [--lrate LRATE] [--cuda]

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output directory (will be created if non-existent)
  -fa FASTA, --fasta FASTA
                        Path to the input fasta file.
  --bin_length BIN_LENGTH
                        Minimum bin size in bps (Default: 80000)
  -b BAM [BAM ...], --bam BAM [BAM ...]
                        Path to the input BAM(.bam) file.
  --num_process NUM_PROCESS
                        Number of threads used (default: 10)
  --evaluation EVALUATION
                        Evaluation model used(no_markers, markers110, markers35, default: nomarkers
  -a AKEEP, --akeep AKEEP
                        The cut-off parameters of re-clustering decision model(0~1, default:0.6)
  --multi               Cluster uses more samples
  --epoch EPOCH, -n EPOCH
                        training epoch (default: 300)
  --batch_size BATCH_SIZE
                        batch size (default: 128)
  --batchsteps BATCHSTEPS [BATCHSTEPS ...]
                        batchsteps (default: 30 100)
  --lrate LRATE, -l LRATE
                        learning rate (default: 0.001)
  --cuda                whether use cuda

> LorBin generate_data -h

usage: LorBin generate_data [-h] -o OUTPUT -fa FASTA [--bin_length BIN_LENGTH] -b BAM [BAM ...] [--num_process NUM_PROCESS]

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output directory (will be created if non-existent)
  -fa FASTA, --fasta FASTA
                        Path to the input fasta file.
  --bin_length BIN_LENGTH
                        Minimum bin size in bps (Default: 80000)
  -b BAM [BAM ...], --bam BAM [BAM ...]
                        Path to the input BAM(.bam) file.
  --num_process NUM_PROCESS
                        Number of threads used (default: 10)

> LorBin concat -h

usage: LorBin concat [-h] -fa FASTA [FASTA ...] -o OUTPUT

options:
  -h, --help            show this help message and exit
  -fa FASTA [FASTA ...], --fasta FASTA [FASTA ...]
                        The path to input FASTA files
  -o OUTPUT, --output OUTPUT
                        The path to output FASTA file

> LorBin train -h

usage: LorBin train [-h] --data DATA -o OUTPUT [--epoch EPOCH] [--lrate LRATE] [--batch_size BATCH_SIZE] [--batchsteps BATCHSTEPS [BATCHSTEPS ...]] [--cuda]

options:
  -h, --help            show this help message and exit
  --data DATA           The path of training data
  -o OUTPUT, --output OUTPUT
                        Output directory (will be created if non-existent)
  --epoch EPOCH, -n EPOCH
                        training epoch (default: 300)
  --lrate LRATE, -l LRATE
                        learning rate (default: 0.001)
  --batch_size BATCH_SIZE
                        batch size (default: 64)
  --batchsteps BATCHSTEPS [BATCHSTEPS ...]
                        batchseteps (default: 30, 60, 120)
  --cuda                whether use cuda

> LorBin cluster -h

usage: LorBin cluster [-h] -o OUTPUT -fa FASTA [--bin_length BIN_LENGTH] [--evaluation EVALUATION] [-a AKEEP] [--multi] [--embeddingdir EMBEDDINGDIR] [--data DATA]
                      [--num_process NUM_PROCESS] [--cuda] [--batch_size BATCH_SIZE] [--epoch EPOCH] [--lrate LRATE] [--batchsteps BATCHSTEPS [BATCHSTEPS ...]]

options:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Output directory (will be created if non-existent)
  -fa FASTA, --fasta FASTA
                        Path to the input fasta file.
  --bin_length BIN_LENGTH
                        Minimum bin size in bps (Default: 80000)
  --evaluation EVALUATION
                        Evaluation model used(no_markers, markers110, markers35, default: nomarkers
  -a AKEEP, --akeep AKEEP
                        The cut-off parameters of re-clustering decision model(0~1, default:0.6)
  --multi               Cluster uses more samples
  --embeddingdir EMBEDDINGDIR, -e EMBEDDINGDIR
                        The path of embedding csv file used in clustering
  --data DATA           The path of training data
  --num_process NUM_PROCESS
                        Number of threads used (default: 10)
  --cuda                whether use cuda
  --batch_size BATCH_SIZE
                        batch size (default: 64)
  --epoch EPOCH, -n EPOCH
                        training epoch (default: 300)
  --lrate LRATE, -l LRATE
                        learning rate (default: 0.001)
  --batchsteps BATCHSTEPS [BATCHSTEPS ...]
                        batchseteps (default: 30, 60, 120)

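Test run

Use the test data published on Zenodo: contigs assembled with hifiasm and the corresponding BAM file. Download and extract the archive.

https://zenodo.org/records/13883404

It is a fairly small metagenome assembly of 167 Mb.

Specify the FASTA file and its BAM file.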

# Extract the archive
tar -zxvf testdata.tar.gz
# Run LorBin bin
LorBin bin --fa CRR451057.hifiasm.fna -b CRR451057.sorted.bam -o test_o

-o Output directory (will be created if non-existent)
-fa Path to the input fasta file.
--bin_length Minimum bin size in bps (Default: 80000)
-b Path to the input BAM(.bam) file.
--num_process Number of threads used (default: 10)

On CPU it took about 10 minutes (9900X, 64 GB of memory).

Output

The cluster assignment, length, feature vector, and other information for each contig are saved.

output_bins/
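For a quick overview of the result, the bins can be summarized with a short script. This is a minimal sketch under two assumptions that should be checked against the actual output: that the bins are written as FASTA files into a directory such as `test_o/output_bins/`, and that they use one of the extensions listed in `FASTA_SUFFIXES`. The function names are made up for this example.

```python
from pathlib import Path

FASTA_SUFFIXES = {".fa", ".fna", ".fasta"}  # assumed bin file extensions


def fasta_stats(path):
    """Return (number of contigs, total bases) for one FASTA file."""
    n_contigs, n_bases = 0, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                n_contigs += 1      # header line starts a new contig
            else:
                n_bases += len(line)  # sequence line
    return n_contigs, n_bases


def summarize_bins(bin_dir):
    """Map each bin file name in bin_dir to its (contigs, bases) tuple."""
    return {
        p.name: fasta_stats(p)
        for p in sorted(Path(bin_dir).iterdir())
        if p.is_file() and p.suffix in FASTA_SUFFIXES
    }
```

Calling `summarize_bins("test_o/output_bins")` would then return one entry per bin, which makes it easy to see how many bins pass the `--bin_length` threshold and how large they are.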

How to run

1. To bin assemblies from multiple samples together, first concatenate them with the LorBin concat command.

LorBin concat -fa test1.fna test2.fna -o test.fna
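Conceptually, this step merges the per-sample assemblies into one FASTA while keeping track of which sample each contig came from. The help text says the result can be split with the separator '-', so the sketch below prefixes every contig name with a sample name and '-'. The exact naming scheme that LorBin uses is an assumption here, and the function names are made up for illustration; use `LorBin concat` for real runs.

```python
import io

SEPARATOR = "-"  # separator named in the LorBin concat help text


def concat_fasta(samples, out_handle):
    """Write all samples into one FASTA, renaming contigs to <sample>-<contig>.

    samples: dict mapping a sample name to an open FASTA handle.
    The <sample>-<contig> naming is an assumption, not confirmed LorBin behavior.
    """
    for sample, handle in samples.items():
        for line in handle:
            if line.startswith(">"):
                # Keep only the contig ID, dropping any description text.
                fields = line[1:].split()
                contig = fields[0] if fields else ""
                out_handle.write(f">{sample}{SEPARATOR}{contig}\n")
            else:
                out_handle.write(line if line.endswith("\n") else line + "\n")


def split_name(name):
    """Recover (sample, contig) from a concatenated contig name."""
    sample, contig = name.split(SEPARATOR, 1)
    return sample, contig


if __name__ == "__main__":
    out = io.StringIO()
    concat_fasta(
        {
            "test1": io.StringIO(">ctg1 len=4\nACGT\n"),
            "test2": io.StringIO(">ctg1\nGGCC\n"),
        },
        out,
    )
    print(out.getvalue(), end="")
```

Note that `split_name` splits at the first separator, so in this sketch sample names must not contain '-' themselves.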

2. Create the BAM files. The repository uses minimap2, with the concatenated FASTA specified as the reference.

# test1 sample
minimap2 -a test/test.fna test/test1_raw.fq | samtools view -h -b -S | samtools view -b -F 4 | samtools sort -@ 20 > test1.mapped.sorted.bam

# test2 sample
minimap2 -a test/test.fna test/test2_raw.fq | samtools view -h -b -S | samtools view -b -F 4 | samtools sort -@ 20 > test2.mapped.sorted.bam

3. Run LorBin bin. Add --multi when using multiple BAM files.

LorBin bin -o outputdir -fa test.fna -b test1.mapped.sorted.bam test2.mapped.sorted.bam --multi

--multi Cluster uses more samples
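When several datasets have to be processed, it can help to assemble the command line in a script. The sketch below only builds an argument list from options that appear in the `LorBin bin -h` output; the helper name `build_bin_command` and the file names are illustrative, and the actual call to LorBin is left commented out.

```python
import subprocess


def build_bin_command(fasta, bams, outdir, threads=10, multi=None):
    """Return the argument list for `LorBin bin`.

    --multi is added automatically when more than one BAM is given,
    unless `multi` is set explicitly to True or False.
    """
    if not bams:
        raise ValueError("at least one BAM file is required")
    if multi is None:
        multi = len(bams) > 1
    cmd = ["LorBin", "bin", "-o", outdir, "-fa", fasta, "-b", *bams,
           "--num_process", str(threads)]
    if multi:
        cmd.append("--multi")
    return cmd


if __name__ == "__main__":
    cmd = build_bin_command(
        "test.fna",
        ["test1.mapped.sorted.bam", "test2.mapped.sorted.bam"],
        "outputdir",
    )
    print(" ".join(cmd))
    # Uncomment to actually run LorBin (it must be on PATH):
    # subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names, and `check=True` would make a failed LorBin run raise an exception instead of passing silently.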

Citation

LorBin: efficient binning of long-read metagenomes by multiscale adaptive clustering and evaluation

Wei Xue, Zuo Liu, Yaozhong Zhang, Waseem Raza, Yarong Li, Li Jiang, Ye Tao, Jun Qian, Jousset Alexandre, Fang-Jie Zhao, Yangchun Xu, Fritz Sedlazeck, Qirong Shen, Gaofei Jiang & Zhong Wei

Nature Communications volume 16, Article number: 9353 (2025)

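The paper uses the word serendipity, which is uncommon in binning evaluation. Starting from its original meaning of an unexpected, fortunate discovery, it is defined there as a metric of how many bins the evaluated binner recovered on its own that the binners it was compared with failed to reconstruct (see the paper's Methods; a 2014 paper is cited).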
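Read that way, the metric can be illustrated with simple set operations. The sketch below counts the genomes recovered by one binner but by none of the others. It assumes that bins from different binners have already been matched to shared genome identifiers; the function name, the binner names, and the example data are hypothetical, and the paper's Methods section remains the reference for the exact definition.

```python
def serendipity(recovered, target):
    """Return (count, sorted IDs) of genomes recovered only by `target`.

    recovered: dict mapping a binner name to the set of genome IDs
               for which that binner produced a bin.
    """
    # Union of everything recovered by the binners being compared against.
    others = set().union(
        *(ids for name, ids in recovered.items() if name != target)
    )
    unique = recovered[target] - others
    return len(unique), sorted(unique)


if __name__ == "__main__":
    # Hypothetical example data, not results from the paper.
    recovered = {
        "target_binner": {"g1", "g2", "g3", "g4"},
        "binner_A": {"g1", "g2"},
        "binner_B": {"g2", "g3"},
    }
    print(serendipity(recovered, "target_binner"))
```

In this toy example only `g4` is recovered exclusively by the target binner, so it is the single genome that counts toward the metric.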