2022/02/09: tweet added
2022/02/27: addendum
2022/03/01: help output updated
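Metagenomic binning is the step in constructing metagenome-assembled genomes (MAGs) in which sequences predicted to derive from the same genome are automatically grouped together. The most widely used binning methods are reference-independent, operate de novo, and can recover genomes from clades that have never been sampled before. However, these methods do not exploit the knowledge held in existing databases. SemiBin is an open-source tool that uses neural networks to implement a semi-supervised approach: it uses information from reference genomes while retaining the ability to bin genomes that lie outside the reference dataset. On simulated and real microbiome datasets from three different environments (human gut, dog gut, and marine microbiomes), SemiBin outperforms existing state-of-the-art binning methods. It returns more high-quality bins with greater taxonomic diversity, including more distinct genera and species. SemiBin is available as open-source software at https://github.com/BigDataBiology/SemiBin/.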
New release of SemiBin: v0.6
Big changes:
- Prebuilt models for 10 different habitats (based on GMGCv1 data), up from 3 in last release
- A 'global' model which is a decent solution if no habitat-specific model is available
I created a Python 3.7 environment with mamba and tested SemiBin in it.
- SemiBin runs on Python 3.7-3.9.
#conda
mamba create -n SemiBin python=3.7 -y
conda activate SemiBin
mamba install -c bioconda semibin -y
mamba install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch-lts -y
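After installation, it can be worth confirming that the command is on PATH and that its dependencies resolve. The sketch below only wraps the `-v` flag and the `check_install` subcommand that appear in the help output further down; the function name `verify_semibin` is my own and is not part of SemiBin.

```shell
# Hypothetical helper (not part of SemiBin): fail early if the install is broken.
verify_semibin() {
    # Is the executable reachable in the currently activated environment?
    if ! command -v SemiBin >/dev/null 2>&1; then
        echo "SemiBin not found on PATH (is the conda env activated?)" >&2
        return 1
    fi
    # Print the version, then let SemiBin check its required dependencies.
    SemiBin -v && SemiBin check_install
}

# Usage: verify_semibin || echo "fix the environment before binning"
```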
> SemiBin -h
Semi-supervised siamese neural network for metagenomic binning
optional arguments:
-h, --help show this help message and exit
-v, --version Print the version number
SemiBin subcommands:
single_easy_bin Bin contigs (single or co-assembly) using one command.
multi_easy_bin Bin contigs (multi-sample mode) using one command.
generate_cannot_links (predict_taxonomy)
Run the contig annotation using mmseqs with GTDB
reference genome and generate cannot-link file used in
the semi-supervised deep learning model training. This
will download the GTDB database if not downloaded
generate_sequence_features_single (generate_sequence_features_single)
Generate sequence features (kmer and abundance) as
training data for semi-supervised deep learning model
training (single or co-assembly mode). This will
produce the data.csv and data_split.csv files.
generate_sequence_features_multi (generate_sequence_features_multi)
Generate sequence features (kmer and abundance) as
training data for semi-supervised deep learning model
training (multi-sample mode). This will produce the
data.csv and data_split.csv files.
download_GTDB Download GTDB reference genomes.
check_install Check required dependencies.
train Train the model.
bin Group the contigs into bins.
> SemiBin single_easy_bin -h
usage: SemiBin single_easy_bin [-h] [-p] [--environment ENVIRONMENT] -i
CONTIG_FASTA -o OUTPUT [-m MIN_LEN]
[--ratio RATIO] -b [BAMS [BAMS ...]] [-r]
[--taxonomy-annotation-table TAXONOMY_TSV]
[--minfasta-kbs] [--no-recluster] [--recluster]
[--epoches EPOCHES] [--batch-size BATCHSIZE]
[--max-edges MAX_EDGES] [--max-node MAX_NODE]
[--random-seed RANDOM_SEED]
[--ml-threshold ML_THRESHOLD]
optional arguments:
-h, --help show this help message and exit
-p , --processes , -t , --threads
Number of CPUs used (pass the value 0 to use all CPUs,
default: 0)
--environment ENVIRONMENT
Environment for the built-in model (available choices:
Path to the input fasta file.
-o OUTPUT, --output OUTPUT
Output directory (will be created if non-existent)
-m MIN_LEN, --min-len MIN_LEN
Minimal length for contigs in binning. If you use
SemiBin with multi steps and you use this parameter,
please use this parameter consistently with all
subcommands. (Default: SemiBin chooses 1000bp or
2500bp according the ratio of the number of base pairs
of contigs between 1000-2500bp).
--ratio RATIO If the ratio of the number of base pairs of contigs
between 1000-2500 bp smaller than this value, the
minimal length will be set as 1000bp, otherwise
2500bp. Note that setting `--min-length/-m` overrules
this parameter. If you use SemiBin with multi steps
and you use this parameter, please use this parameter
consistently with all subcommands. (Default: 0.05)
-b [BAMS [BAMS ...]], --input-bam [BAMS [BAMS ...]]
Path to the input BAM file. If using multiple samples,
you can input multiple files.
-r , --reference-db-data-dir , --reference-db
GTDB reference storage path. (Default:
$HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB).If not set
--reference-db and SemiBin cannot find GTDB in
$HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB, SemiBin will
download GTDB (Note that >100GB of disk space are
--cannot-name Name for the cannot-link file(default: cannot).
--taxonomy-annotation-table TAXONOMY_TSV
Pre-computed mmseqs2 format taxonomy TSV file to
bypass mmseqs2 GTDB annotation [advanced]
--minfasta-kbs minimum bin size in Kbps (Default: 200).
--no-recluster Do not recluster bins.
--recluster [Deprecated] Does nothing (current default is to
perform clustering)
--epoches EPOCHES Number of epoches used in the training process
(Default: 20).
--batch-size BATCHSIZE
Batch size used in the training process (Default:
--max-edges MAX_EDGES
The maximum number of edges that can be connected to
one contig (Default: 200).
--max-node MAX_NODE Fraction of contigs that considered to be binned
(should be between 0 and 1; default: 1).
--random-seed RANDOM_SEED
Random seed. Set it to a fixed value to reproduce
results across runs. The default is that the seed is
set by the system and .
--ml-threshold ML_THRESHOLD
Length threshold for generating must-link constraints.
(By default, the threshold is calculated from the
contig, and the default minimum value is 4,000 bp)
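The `-m` and `--ratio` help text above describes how the default minimum contig length is chosen. As a rough illustration only, the awk sketch below applies that rule outside SemiBin. It assumes the ratio is taken over the total base pairs of all contigs in the file and that the 1000-2500 bp bounds are inclusive (the help text states neither), and `choose_min_len` is a hypothetical name, not a SemiBin command.

```shell
# Hypothetical helper: print 1000 or 2500 following the --ratio rule in the help text.
# Arguments: FASTA path, optional ratio (defaults to 0.05 as in the help text).
choose_min_len() {
    fasta="$1"
    ratio="${2:-0.05}"
    awk -v ratio="$ratio" '
        # Add the contig that just ended to the running totals.
        function add_contig() {
            total += len
            if (len >= 1000 && len <= 2500) short_bp += len
            len = 0
        }
        /^>/ { add_contig(); next }     # a header line closes the previous contig
        { len += length($0) }           # sequence lines may be wrapped
        END {
            add_contig()
            if (total == 0) { print "no sequence data" > "/dev/stderr"; exit 1 }
            # Few base pairs in short contigs: keep them (1000); otherwise use 2500.
            if (short_bp / total < ratio) print 1000; else print 2500
        }' "$fasta"
}

# Usage: choose_min_len input.fasta
```

This is only meant to make the rule concrete; SemiBin applies its own check internally, so the value printed here is not guaranteed to match what SemiBin picks.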
> SemiBin multi_easy_bin -h
usage: SemiBin multi_easy_bin [-h] [-p] -i CONTIG_FASTA -o OUTPUT [-m MIN_LEN]
[--ratio RATIO] -b [BAMS [BAMS ...]] [-r]
[--minfasta-kbs] [--no-recluster] [--recluster]
[--epoches EPOCHES] [--batch-size BATCHSIZE]
[--max-edges MAX_EDGES] [--max-node MAX_NODE]
[-s] [--random-seed RANDOM_SEED]
[--ml-threshold ML_THRESHOLD]
optional arguments:
-h, --help show this help message and exit
-p , --processes , -t , --threads
Number of CPUs used (pass the value 0 to use all CPUs,
default: 0)
Path to the input fasta file.
-o OUTPUT, --output OUTPUT
Output directory (will be created if non-existent)
-m MIN_LEN, --min-len MIN_LEN
Minimal length for contigs in binning. If you use
SemiBin with multi steps and you use this parameter,
please use this parameter consistently with all
subcommands. (Default: SemiBin chooses 1000bp or
2500bp according the ratio of the number of base pairs
of contigs between 1000-2500bp).
--ratio RATIO If the ratio of the number of base pairs of contigs
between 1000-2500 bp smaller than this value, the
minimal length will be set as 1000bp, otherwise
2500bp. Note that setting `--min-length/-m` overrules
this parameter. If you use SemiBin with multi steps
and you use this parameter, please use this parameter
consistently with all subcommands. (Default: 0.05)
-b [BAMS [BAMS ...]], --input-bam [BAMS [BAMS ...]]
Path to the input BAM file. If using multiple samples,
you can input multiple files.
-r , --reference-db-data-dir , --reference-db
GTDB reference storage path. (Default:
$HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB).If not set
--reference-db and SemiBin cannot find GTDB in
$HOME/.cache/SemiBin/mmseqs2-GTDB/GTDB, SemiBin will
download GTDB (Note that >100GB of disk space are
--minfasta-kbs minimum bin size in Kbps (Default: 200).
--no-recluster Do not recluster bins.
--recluster [Deprecated] Does nothing (current default is to
perform clustering)
--epoches EPOCHES Number of epoches used in the training process
(Default: 20).
--batch-size BATCHSIZE
Batch size used in the training process (Default:
--max-edges MAX_EDGES
The maximum number of edges that can be connected to
one contig (Default: 200).
--max-node MAX_NODE Fraction of contigs that considered to be binned
(should be between 0 and 1; default: 1).
-s , --separator Used when multiple samples binning to separate sample
name and contig name (Default is :).
--random-seed RANDOM_SEED
Random seed. Set it to a fixed value to reproduce
results across runs. The default is that the seed is
set by the system and .
--ml-threshold ML_THRESHOLD
Length threshold for generating must-link constraints.
(By default, the threshold is calculated from the
contig, and the default minimum value is 4,000 bp)
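SemiBin works with single-sample assemblies, co-assemblies, and multi-sample assemblies. For single-sample and co-assembly binning, the workflow is: 1) generate data.csv and data_split.csv (used for training) for every sample, 2) train a model for each sample, and 3) bin each sample's contigs with the model trained on that same sample. A single command that runs all of these steps at once is provided, so that is what I use here. For multi-sample binning, the inputs are the contigs combined across samples and the BAM files from the individual samples.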
Easy single/co-assembly binning mode
git clone https://github.com/BigDataBiology/SemiBin.git
cd SemiBin/test/bin_data/
SemiBin single_easy_bin -i input.fasta -b input.sorted.bam -o output -r <PATH>/<to>/GTDB --environment global
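Passing one of the names listed below to --environment selects one of the built-in (pretrained) models. According to the repository, this greatly reduces the time spent on contig annotation and model training while still giving very good results, and its use is recommended.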
- human_gut
- dog_gut
- ocean
- soil
- cat_gut
- human_oral
- mouse_gut
- pig_gut
- built_environment
- wastewater
- global (for environments not listed above)
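When several samples are binned independently in single-sample mode with a built-in model, the same command is repeated once per sample. The loop below is a sketch under assumptions: inputs are named `<sample>.fasta` and `<sample>.sorted.bam` (my naming convention, not SemiBin's), `run_single_bins` and the `SEMIBIN` override variable are hypothetical, and only flags shown in the help output above are used.

```shell
# Hypothetical wrapper: run single_easy_bin once per sample with a built-in model.
# Set SEMIBIN=echo to print the commands instead of running them (dry run).
run_single_bins() {
    env_name="$1"   # one of the environment names listed above, e.g. human_gut
    shift
    for fasta in "$@"; do
        sample="${fasta%.fasta}"   # strip the assumed .fasta suffix
        "${SEMIBIN:-SemiBin}" single_easy_bin \
            -i "$fasta" \
            -b "${sample}.sorted.bam" \
            -o "${sample}_output" \
            --environment "$env_name" || return 1
    done
}

# Usage: run_single_bins human_gut S1.fasta S2.fasta
```

Stopping at the first failing sample (`|| return 1`) is a design choice of this sketch; dropping it would let the loop continue with the remaining samples.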
Easy multi-sample binning mode
SemiBin multi_easy_bin -i contig_whole.fna -b *.bam -o output
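The combined contig file for multi-sample mode needs contig names that encode the sample, joined with the separator given by `-s` (default `:`). One possible way to build such a file is sketched below; `prefix_contigs` and the per-sample file names are placeholders of mine. As far as I understand, the BAM files should then come from mapping each sample's reads against the combined file.

```shell
# Hypothetical helper: prefix every FASTA header with "<sample><separator>".
# Arguments: sample name, FASTA path, optional separator (default ":").
prefix_contigs() {
    sample="$1"
    fasta="$2"
    sep="${3:-:}"
    awk -v p="${sample}${sep}" '
        /^>/ { print ">" p substr($0, 2); next }   # rewrite header lines
        { print }                                  # copy sequence lines unchanged
    ' "$fasta"
}

# Usage (file names are illustrative):
#   prefix_contigs S1 S1.contigs.fna  > contig_whole.fna
#   prefix_contigs S2 S2.contigs.fna >> contig_whole.fna
```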
