SemiBin: better binning by incorporating reference genome information through semi-supervised deep learning

2022/02/09: added tweet

2022/02/27: addendum

2022/03/01: help output updated

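Metagenomic binning is the step in building metagenome-assembled genomes (MAGs) in which sequences predicted to originate from the same genome are automatically grouped together. The most widely used binning methods are reference-independent: they operate de novo and can recover genomes from clades that have never been sampled before. However, these methods do not make use of the knowledge in existing databases. SemiBin is an open-source tool that uses neural networks to implement a semi-supervised approach. That is, SemiBin exploits the information in reference genomes while retaining the ability to bin genomes that lie outside the reference dataset. On simulated and real microbiome datasets from three different environments (human gut, dog gut, and marine microbiomes), SemiBin outperforms existing state-of-the-art binning methods. SemiBin returns more high-quality bins with greater taxonomic diversity, including more distinct genera and species. SemiBin is available as open-source software at https://github.com/BigDataBiology/SemiBin/.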

https://semibin.readthedocs.io/en/latest/

Usage

https://semibin.readthedocs.io/en/latest/usage/

2022/02/09

New release of SemiBin: v0.6

Big changes:

- Prebuilt models for 10 different habitats (based on GMGCv1 data), up from 3 in last release

- A 'global' model which is a decent solution if no habitat-specific model is available
— Big Data Biology Lab - hiring postdocs! (@BigDataBiology) February 8, 2022

Installation

I created a Python 3.7 environment with mamba and tested SemiBin there.

Dependencies

SemiBin runs on Python 3.7-3.9.

GitHub

# conda / mamba (Bioconda package)
mamba create -n SemiBin python=3.7 -y
conda activate SemiBin
mamba install -c bioconda semibin -y
# PyTorch build with CUDA 10.2 from the pytorch-lts channel (for GPU use)
mamba install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts -y
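As a quick sanity check after installation, something like the following could be used. It relies only on the check_install subcommand that appears in the SemiBin help output and on the stated Python 3.7-3.9 requirement. The function name python_supported is a hypothetical helper introduced here for illustration, not part of SemiBin.

```shell
#!/usr/bin/env bash
# Sketch: confirm the interpreter is in the supported 3.7-3.9 range,
# then let SemiBin check its own dependencies.

# python_supported is a hypothetical helper (not part of SemiBin).
# It takes a "major.minor" string and succeeds only for 3.7, 3.8 or 3.9.
python_supported() {
    case "$1" in
        3.7|3.8|3.9) return 0 ;;
        *) return 1 ;;
    esac
}

# Only run the checks when SemiBin is actually on PATH.
if command -v SemiBin >/dev/null 2>&1; then
    pyver=$(python -c 'import sys; print("%d.%d" % sys.version_info[:2])')
    if python_supported "$pyver"; then
        echo "Python $pyver is in the supported range"
    else
        echo "Python $pyver is outside the supported 3.7-3.9 range" >&2
    fi
    SemiBin check_install
fi
```

Run it inside the activated SemiBin environment so that python and SemiBin resolve to the versions installed there.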

> SemiBin -h

Semi-supervised siamese neural network for metagenomic binning

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Print the version number

SemiBin subcommands:
    single_easy_bin     Bin contigs (single or co-assembly) using one command.
    multi_easy_bin      Bin contigs (multi-sample mode) using one command.
    generate_cannot_links (predict_taxonomy)
                        Run the contig annotation using mmseqs with GTDB
                        reference genome and generate cannot-link file used in
                        the semi-supervised deep learning model training. This
                        will download the GTDB database if not downloaded
                        before.
    generate_sequence_features_single (generate_sequence_features_single)
                        Generate sequence features (kmer and abundance) as
                        training data for semi-supervised deep learning model
                        training (single or co-assembly mode). This will
                        produce the data.csv and data_split.csv files.
    generate_sequence_features_multi (generate_sequence_features_multi)
                        Generate sequence features (kmer and abundance) as
                        training data for semi-supervised deep learning model
                        training (multi-sample mode). This will produce the
                        data.csv and data_split.csv files.
    download_GTDB       Download GTDB reference genomes.
    check_install       Check required dependencies.
    train               Train the model.
    bin                 Group the contigs into bins.

> SemiBin single_easy_bin -h

usage: SemiBin single_easy_bin [-h] [-p] [--environment ENVIRONMENT] -i
                               CONTIG_FASTA -o OUTPUT [-m MIN_LEN]
                               [--ratio RATIO] -b [BAMS [BAMS ...]] [-r]
                               [--cannot-name]
                               [--taxonomy-annotation-table TAXONOMY_TSV]
                               [--minfasta-kbs] [--no-recluster] [--recluster]
                               [--epoches EPOCHES] [--batch-size BATCHSIZE]
                               [--max-edges MAX_EDGES] [--max-node MAX_NODE]
                               [--random-seed RANDOM_SEED]
                               [--ml-threshold ML_THRESHOLD]

optional arguments:
  -h, --help            show this help message and exit
  -p , --processes , -t , --threads
                        Number of CPUs used (pass the value 0 to use all CPUs,
                        default: 0)
  --environment ENVIRONMENT
                        Environment for the built-in model (available choices:
                        human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_
                        gut/pig_gut/built_environment/wastewater/global).
  -i CONTIG_FASTA, --input-fasta CONTIG_FASTA
                        Path to the input fasta file.
  -o OUTPUT, --output OUTPUT
                        Output directory (will be created if non-existent)
  -m MIN_LEN, --min-len MIN_LEN
                        Minimal length for contigs in binning. If you use
                        SemiBin with multi steps and you use this parameter,
                        please use this parameter consistently with all
                        subcommands. (Default: SemiBin chooses 1000bp or
                        2500bp according the ratio of the number of base pairs
                        of contigs between 1000-2500bp).
  --ratio RATIO         If the ratio of the number of base pairs of contigs
                        between 1000-2500 bp smaller than this value, the
                        minimal length will be set as 1000bp, otherwise
                        2500bp. Note that setting `--min-length/-m` overrules
                        this parameter. If you use SemiBin with multi steps
                        and you use this parameter, please use this parameter
                        consistently with all subcommands. (Default: 0.05)
  -b [BAMS [BAMS ...]], --input-bam [BAMS [BAMS ...]]
                        Path to the input BAM file. If using multiple samples,
                        you can input multiple files.
  -r , --reference-db-data-dir , --reference-db
                        GTDB reference storage path. (Default:
                        $HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB).If not set
                        --reference-db and SemiBin cannot find GTDB in
                        $HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB, SemiBin will
                        download GTDB (Note that >100GB of disk space are
                        required).
  --cannot-name         Name for the cannot-link file(default: cannot).
  --taxonomy-annotation-table TAXONOMY_TSV
                        Pre-computed mmseqs2 format taxonomy TSV file to
                        bypass mmseqs2 GTDB annotation [advanced]
  --minfasta-kbs        minimum bin size in Kbps (Default: 200).
  --no-recluster        Do not recluster bins.
  --recluster           [Deprecated] Does nothing (current default is to
                        perform clustering)
  --epoches EPOCHES     Number of epoches used in the training process
                        (Default: 20).
  --batch-size BATCHSIZE
                        Batch size used in the training process (Default:
                        2048).
  --max-edges MAX_EDGES
                        The maximum number of edges that can be connected to
                        one contig (Default: 200).
  --max-node MAX_NODE   Fraction of contigs that considered to be binned
                        (should be between 0 and 1; default: 1).
  --random-seed RANDOM_SEED
                        Random seed. Set it to a fixed value to reproduce
                        results across runs. The default is that the seed is
                        set by the system and .
  --ml-threshold ML_THRESHOLD
                        Length threshold for generating must-link constraints.
                        (By default, the threshold is calculated from the
                        contig, and the default minimum value is 4,000 bp)
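The default choice between 1000 bp and 2500 bp described under --min-len and --ratio can be illustrated with a small awk sketch. This is not SemiBin's own code: choose_min_len is a hypothetical helper, and it assumes that the ratio is the number of base pairs in contigs of 1000-2500 bp divided by the total number of base pairs in the FASTA file, which the help text does not spell out.

```shell
#!/usr/bin/env bash
# Sketch of the --ratio rule from the help text (assumed interpretation).
# choose_min_len is a hypothetical helper, not a SemiBin command.
# Usage: choose_min_len contigs.fasta [ratio]
choose_min_len() {
    awk -v ratio="${2:-0.05}" '
        # Add the contig that just ended to the running totals.
        function add_contig() {
            if (len > 0) {
                total_bp += len
                if (len >= 1000 && len <= 2500) short_bp += len
            }
            len = 0
        }
        /^>/ { add_contig(); next }                       # header: close previous contig
        { gsub(/[ \t\r]/, ""); len += length($0) }        # sequence line: count bases
        END {
            add_contig()
            # Few short contigs: keep them (1000). Otherwise use 2500.
            # An empty input also falls through to 2500 in this sketch.
            if (total_bp > 0 && short_bp / total_bp < ratio) print 1000
            else print 2500
        }
    ' "$1"
}
```

Calling choose_min_len input.fasta prints the minimum length this rule would select with the default ratio of 0.05; a second argument overrides the ratio. Passing -m to SemiBin overrules the rule entirely, as the help text notes.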

> SemiBin multi_easy_bin -h

usage: SemiBin multi_easy_bin [-h] [-p] -i CONTIG_FASTA -o OUTPUT [-m MIN_LEN]
                              [--ratio RATIO] -b [BAMS [BAMS ...]] [-r]
                              [--minfasta-kbs] [--no-recluster] [--recluster]
                              [--epoches EPOCHES] [--batch-size BATCHSIZE]
                              [--max-edges MAX_EDGES] [--max-node MAX_NODE]
                              [-s] [--random-seed RANDOM_SEED]
                              [--ml-threshold ML_THRESHOLD]

optional arguments:
  -h, --help            show this help message and exit
  -p , --processes , -t , --threads
                        Number of CPUs used (pass the value 0 to use all CPUs,
                        default: 0)
  -i CONTIG_FASTA, --input-fasta CONTIG_FASTA
                        Path to the input fasta file.
  -o OUTPUT, --output OUTPUT
                        Output directory (will be created if non-existent)
  -m MIN_LEN, --min-len MIN_LEN
                        Minimal length for contigs in binning. If you use
                        SemiBin with multi steps and you use this parameter,
                        please use this parameter consistently with all
                        subcommands. (Default: SemiBin chooses 1000bp or
                        2500bp according the ratio of the number of base pairs
                        of contigs between 1000-2500bp).
  --ratio RATIO         If the ratio of the number of base pairs of contigs
                        between 1000-2500 bp smaller than this value, the
                        minimal length will be set as 1000bp, otherwise
                        2500bp. Note that setting `--min-length/-m` overrules
                        this parameter. If you use SemiBin with multi steps
                        and you use this parameter, please use this parameter
                        consistently with all subcommands. (Default: 0.05)
  -b [BAMS [BAMS ...]], --input-bam [BAMS [BAMS ...]]
                        Path to the input BAM file. If using multiple samples,
                        you can input multiple files.
  -r , --reference-db-data-dir , --reference-db
                        GTDB reference storage path. (Default:
                        $HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB).If not set
                        --reference-db and SemiBin cannot find GTDB in
                        $HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB, SemiBin will
                        download GTDB (Note that >100GB of disk space are
                        required).
  --minfasta-kbs        minimum bin size in Kbps (Default: 200).
  --no-recluster        Do not recluster bins.
  --recluster           [Deprecated] Does nothing (current default is to
                        perform clustering)
  --epoches EPOCHES     Number of epoches used in the training process
                        (Default: 20).
  --batch-size BATCHSIZE
                        Batch size used in the training process (Default:
                        2048).
  --max-edges MAX_EDGES
                        The maximum number of edges that can be connected to
                        one contig (Default: 200).
  --max-node MAX_NODE   Fraction of contigs that considered to be binned
                        (should be between 0 and 1; default: 1).
  -s , --separator      Used when multiple samples binning to separate sample
                        name and contig name (Default is :).
  --random-seed RANDOM_SEED
                        Random seed. Set it to a fixed value to reproduce
                        results across runs. The default is that the seed is
                        set by the system and .
  --ml-threshold ML_THRESHOLD
                        Length threshold for generating must-link constraints.
                        (By default, the threshold is calculated from the
                        contig, and the default minimum value is 4,000 bp)

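How to run

SemiBin supports binning of single-sample assemblies, co-assemblies, and multi-sample assemblies. For single-sample assemblies and co-assemblies, the workflow is: 1) generate data.csv and data_split.csv (used for training) for every sample, 2) train a model for each sample, and 3) bin the contigs of each sample with the model trained on that same sample. Commands that run all of these steps at once are provided, and those are used here. For multi-sample binning, the inputs are the contigs concatenated from multiple samples and the BAM files from those samples.

Easy single/co-assembly binning mode

Run SemiBin in the easy single/co-assembly binning mode. Specify the assembled FASTA file and the BAM file obtained by mapping reads to it. Here the test data is used. Without --environment, training takes a very long time (about 6 hours on this test data). See below for details on --environment.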

git clone https://github.com/BigDataBiology/SemiBin.git
cd SemiBin/test/bin_data/
SemiBin single_easy_bin -i input.fasta -b input.sorted.bam -o output -r <PATH>/<to>/GTDB --environment global

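On the first run, the GTDB database is downloaded to $HOME/.cache/SemiBin/mmseqs2-GTDB/. Use the -r option to change this location. The GTDB database is probably downloaded from here, so I downloaded this .tar.gz file manually, extracted it, and passed its location with -r.

Output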
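Before the first run it can help to check where GTDB will be looked up and whether there is room for the download. The sketch below uses only the default path and the ">100GB" figure quoted in the help text. resolve_gtdb_dir, has_free_gb, and the GTDB_DIR environment variable are hypothetical names introduced for illustration, not part of SemiBin.

```shell
#!/usr/bin/env bash
# Sketch: decide which directory to pass to -r and check free disk space.
# resolve_gtdb_dir and has_free_gb are hypothetical helpers (not part of SemiBin).

# Print the GTDB directory: the argument if given, otherwise the default
# cache location quoted in the SemiBin help text.
resolve_gtdb_dir() {
    if [ -n "${1:-}" ]; then
        printf '%s\n' "$1"
    else
        printf '%s\n' "$HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB"
    fi
}

# has_free_gb DIR GB: succeed when the filesystem holding DIR has at least
# GB gigabytes available (df -Pk reports available space in 1 KiB blocks).
has_free_gb() {
    avail_kb=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $(( $2 * 1024 * 1024 )) ]
}

# GTDB_DIR is an assumed, optional environment variable for a manual download.
gtdb_dir=$(resolve_gtdb_dir "${GTDB_DIR:-}")
if [ -d "$gtdb_dir" ]; then
    echo "GTDB found at $gtdb_dir (pass it with -r)"
elif ! has_free_gb "$HOME" 100; then
    echo "GTDB not found and less than 100 GB free under $HOME" >&2
fi
```

If the database was downloaded and extracted manually, setting GTDB_DIR to that directory makes the script report the path to hand to -r.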

output/

(screenshot of the output/ directory)

output/output_bin/

(screenshot of the output/output_bin/ directory)

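About --environment

Passing one of the habitat names listed below to --environment (for example human_gut, dog_gut, or ocean) makes SemiBin use the corresponding built-in model. According to the repository, this greatly reduces the time needed for contig annotation and model training while still giving very good results, so its use is recommended.

The following environments are supported (from the manual):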

human_gut
dog_gut
ocean
soil
cat_gut
human_oral
mouse_gut
pig_gut
built_environment
wastewater
global (anything not listed above)
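In a script that processes samples from several habitats, the value passed to --environment could be validated against this list, falling back to global when there is no habitat-specific model. pick_environment is a hypothetical helper written for this post, not a SemiBin feature.

```shell
#!/usr/bin/env bash
# Sketch: map a habitat label to a valid --environment value.
# pick_environment is a hypothetical helper (not part of SemiBin).
pick_environment() {
    case "$1" in
        human_gut|dog_gut|ocean|soil|cat_gut|human_oral|mouse_gut|pig_gut|built_environment|wastewater|global)
            # A built-in model exists for this habitat.
            printf '%s\n' "$1" ;;
        *)
            # No habitat-specific model: use the global model.
            printf 'global\n' ;;
    esac
}

# Example use (commented out because it needs SemiBin and the input files):
# SemiBin single_easy_bin -i input.fasta -b input.sorted.bam -o output \
#     --environment "$(pick_environment soil)"
```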

Easy multi-sample binning mode

Multi-sample binning mode. Specify multiple BAM files. There is no environment option in this mode.

SemiBin multi_easy_bin -i contig_whole.fna -b *.bam -o output
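The combined FASTA for this mode needs contig names that carry the sample name, joined by the separator described under -s (default ":"). One way to build such a file from per-sample assemblies is sketched below. concat_samples is a hypothetical helper, and taking the sample name from the file name (everything before the last dot) is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Sketch: prefix every contig header with "<sample>:" and concatenate.
# concat_samples is a hypothetical helper (not part of SemiBin).
# Usage: concat_samples S1.fasta S2.fasta ... > contig_whole.fna
concat_samples() {
    for fa in "$@"; do
        sample=$(basename "$fa")
        sample=${sample%.*}            # strip the last extension, e.g. S1.fasta -> S1
        # Rewrite ">contig" to ">sample:contig"; sequence lines pass through.
        # Sample names containing "&" would need escaping for awk's sub().
        awk -v s="$sample" '/^>/ { sub(/^>/, ">" s ":") } { print }' "$fa"
    done
}
```

With a different separator, the ":" in the awk program would have to change to match the value given to -s.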

Citation

SemiBin: Incorporating information from reference genomes with semi-supervised deep learning leads to better metagenomic assembled genomes (MAGs)
Shaojun Pan, Chengkai Zhu, Xing-Ming Zhao, Luis Pedro Coelho

bioRxiv, Posted August 16, 2021