SemiBin2 - macでインフォマティクス

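2023/07/10 Typo fixes

2024/04/19 Added tutorial link

Metagenomic binning, which reconstructs metagenome-assembled genomes (MAGs) from environmental samples, is widely used in large-scale metagenomic studies. The recently proposed semi-supervised binning method SemiBin achieved state-of-the-art binning results in several environments. However, it requires contig annotation, which is computationally costly and can introduce bias.
Here, we propose SemiBin2, which uses self-supervised learning to learn feature embeddings from contigs. On simulated and real datasets, we show that self-supervised learning achieves better results than the semi-supervised learning used in SemiBin1 and that SemiBin2 outperforms other state-of-the-art binners. Compared with SemiBin1, SemiBin2 reconstructs 8.3-21.5% more high-quality bins and requires only 25% of the running time and 11% of the peak memory usage on real short-read sequencing samples. To extend SemiBin2 to long-read data, we also propose an ensemble-based DBSCAN clustering algorithm, which yields 13.1-26.3% more high-quality genomes than the second-best binner for long-read data. SemiBin2 is available as open-source software at https://github.com/BigDataBiology/SemiBin/, and the analysis scripts used in the study are available at https://github.com/BigDataBiology/SemiBin2_benchmark.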

Documentation

https://semibin.readthedocs.io/en/latest/usage/

Tutorial

Installation

Since version 1.5 (officially SemiBin2 beta), installing the SemiBin package installs two scripts: SemiBin and SemiBin2. (The SemiBin2 functionality had already been available since version 1.4.)

Dependencies

MMseqs2
Bedtools, Hmmer
Prodigal
(optionally) FragGeneScan
Samtools

GitHub

mamba create -n SemiBin -y
conda activate SemiBin
mamba install -c conda-forge -c bioconda semibin -y

# For the GPU version, also install the following
mamba install -c pytorch-lts pytorch torchvision torchaudio cudatoolkit=10.2
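
After installation, it can be worth confirming that the external tools are reachable before starting a long run. SemiBin provides a check_install subcommand for this (it appears in the help output in this post). The sketch below is an independent, minimal PATH check, not part of SemiBin. check_tools is a hypothetical helper, and the executable names in the commented usage line (mmseqs, bedtools, hmmsearch, prodigal, samtools) are assumptions about how the dependencies are usually installed.

```shell
#!/bin/sh
# check_tools: hypothetical helper that reports which commands are missing from PATH.
# Prints one "missing: NAME" line per absent tool and returns non-zero if any is absent.
check_tools() {
    rc=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            rc=1
        fi
    done
    return "$rc"
}

# Executable names below are assumptions; adjust them to your installation.
# check_tools mmseqs bedtools hmmsearch prodigal samtools && SemiBin2 check_install
```

If anything is reported as missing, installing it into the same conda environment is probably the simplest fix.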

> SemiBin2

usage: SemiBin2 [-h] [-v] [--verbose | --quiet] ...

Neural network-based binning of metagenomic contigs

options:

-h, --help show this help message and exit

-v, -V, --version Print the version number

--verbose Verbose output (default: False)

--quiet, -q Quiet output (default: False)

SemiBin subcommands:

single_easy_bin Bin contigs (single or co-assembly) using one command.

multi_easy_bin Bin contigs (multi-sample mode) using one command.

generate_cannot_links (predict_taxonomy)

Run the contig annotation using mmseqs with GTDB reference genome and generate cannot-link file used in the semi-supervised deep learning model training. This will download the GTDB database if not downloaded before.

generate_sequence_features_single

Generate sequence features (kmer and abundance) as training data for (semi/self)-supervised deep learning model training (single or co-assembly mode). This will produce the data.csv and data_split.csv files.

generate_sequence_features_multi (generate_sequence_features_multi)

Generate sequence features (kmer and abundance) as training data for (semi/self)-supervised deep learning model training (multi-sample mode). This will produce the data.csv and data_split.csv files.

download_GTDB Download GTDB reference genomes.

check_install Check whether required dependencies are present.

concatenate_fasta concatenate fasta files for multi-sample binning

train_semi Train the model.

train_self Train the model with self-supervised learning

bin Group the contigs into bins.

bin_long Group the contigs from long reads into bins.

For more information, see https://semibin.readthedocs.io/en/latest/subcommands/

> SemiBin

usage: SemiBin [-h] [-v] [--verbose | --quiet] ...

Neural network-based binning of metagenomic contigs

options:

-h, --help show this help message and exit

-v, -V, --version Print the version number

--verbose Verbose output (default: False)

--quiet, -q Quiet output (default: False)

SemiBin subcommands:

single_easy_bin Bin contigs (single or co-assembly) using one command.

multi_easy_bin Bin contigs (multi-sample mode) using one command.

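generate_cannot_links (predict_taxonomy)

Run the contig annotation using mmseqs with GTDB reference genome and generate cannot-link file used in the semi-supervised deep learning model training. This will download the GTDB database if not downloaded before.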

generate_sequence_features_single

generate_sequence_features_multi (generate_sequence_features_multi)

Generate sequence features (kmer and abundance) as training data for (semi/self)-supervised deep learning model training (multi-sample mode). This will produce the data.csv and data_split.csv files.

download_GTDB Download GTDB reference genomes.

check_install Check whether required dependencies are present.

concatenate_fasta concatenate fasta files for multi-sample binning

train Train the model.

train_self Train the model with self-supervised learning

bin Group the contigs into bins.

bin_long Group the contigs from long reads into bins.

For more information, see https://semibin.readthedocs.io/en/latest/subcommands/

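Usage

You can either use a pre-trained model or train a model on your own data. To use a pre-trained model, specify one of the built-in models with "--environment" (introduction). This approach uses little memory and runs fast. Training a new model on your data is said to improve accuracy, but it used to require much more time and memory (up to 10 times as much). The newer self-supervised learning mode, supported since version 1.3, no longer requires annotating contigs with MMseqs2. Benchmarks show that it is faster than the earlier training procedure, uses far less peak memory, and produces more high-quality bins. To use self-supervised learning mode, pass the "--self-supervised" option (in SemiBin2 this is the default training mode).

1. single_easy_bin

Specify the metagenome assembly FASTA file and one BAM file. To use self-supervised learning mode, drop "--environment" and pass the --self-supervised option instead.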

#pre-trained human gut
SemiBin single_easy_bin -i contig.fa -b S1.sorted.bam -o output --environment human_gut

#self-supervised (in CPU mode, all available CPUs are used for training by default)
SemiBin single_easy_bin -i contig.fa -b S1.sorted.bam -o output --self-supervised

-i Path to the input fasta file.
-o Output directory (will be created if non-existent)
--environment Environment for the built-in model (available choices: human_gut/dog_gut/ocean/soil/cat_gut/human_oral/mouse_gut/pig_gut/built_environment/wastewater/chicken_caecum/global).
--self-supervised Train the model with self-supervised learning.
--sequencing-type sequencing type in [short_read/long_read], Default: short_read.
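
When comparing the pre-trained and self-supervised modes on the same assembly, as done in this post, it helps to keep the two command lines identical apart from the mode option and the output directory. The sketch below only prints the commands and does not run SemiBin. build_semibin_cmd is a hypothetical helper, and the output directory names are placeholders; the flags themselves are the ones used in the commands above.

```shell
#!/bin/sh
# build_semibin_cmd: hypothetical helper that prints (does not run) a
# single_easy_bin command line. MODE is either "self" for self-supervised
# training or the name of a built-in environment such as human_gut.
build_semibin_cmd() {
    contigs=$1; bam=$2; outdir=$3; mode=$4
    if [ "$mode" = "self" ]; then
        mode_opt="--self-supervised"
    else
        mode_opt="--environment $mode"
    fi
    echo "SemiBin single_easy_bin -i $contigs -b $bam -o $outdir $mode_opt"
}

# Print one command per mode, each writing to its own output directory.
build_semibin_cmd contig.fa S1.sorted.bam out_gut human_gut
build_semibin_cmd contig.fa S1.sorted.bam out_self self
```

Writing each run to a separate output directory keeps the two sets of bins from overwriting each other.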

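Example output - "--environment human_gut"

Example output - "--self-supervised"

With the data used for testing, self-supervised mode yielded 8 more bins than the built-in model.

2. multi_easy_bin

Specify the metagenome assembly FASTA file and multiple BAM files. There is no environment option.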

SemiBin multi_easy_bin -i contig_whole.fa -b *.sorted.bam -o output --self-supervised
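
In multi-sample mode, the combined FASTA file has to carry the sample of origin for each contig. A quick sanity check before binning is to count contigs per sample from the FASTA headers. The sketch below assumes headers of the form ">SAMPLE:CONTIG"; the ":" separator is an assumption (it is believed to be the default used by concatenate_fasta, but verify this against your version). count_contigs_per_sample is a hypothetical helper.

```shell
#!/bin/sh
# count_contigs_per_sample: hypothetical helper. Reads a FASTA file whose
# headers look like ">SAMPLE:CONTIG" (the ':' separator is an assumption)
# and prints "SAMPLE<TAB>NUMBER_OF_CONTIGS", sorted by sample name.
count_contigs_per_sample() {
    awk -F: '/^>/ { sub(/^>/, "", $1); n[$1]++ }
             END  { for (s in n) printf "%s\t%d\n", s, n[s] }' "$1" | sort
}

# Usage (file name taken from the command above):
# count_contigs_per_sample contig_whole.fa
```

The number of samples listed should match the number of BAM files passed with -b. A header without the separator shows up as its own "sample", which makes naming problems easy to spot.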

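From the manual (what is new in SemiBin2)

SemiBin and SemiBin2 provide the same functionality, but their interfaces differ slightly. The exact SemiBin2 interface is still considered unstable and will be frozen when version 2.0 is released.
The default training mode is now "--self-supervised".
To train in semi-supervised mode, you must use the train_semi subcommand (there is no train subcommand, so you have to specify either train_semi or train_self). To use semi-supervised learning, pass "--semi-supervised" (deprecated) (introduction).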
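Some arguments that were previously deprecated have been removed completely:
- --recluster: reclustering is the default, so this option already did nothing
- --mode: use --train-from-many; to use semi-supervised learning, use --semi-supervised (which is also deprecated)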
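If you use the easy workflows in SemiBin2, they will probably work exactly as before (only with better results, obtained faster).
Output bins are now always placed in the output_bins/ subdirectory of the output directory given with -o (unless you explicitly ask for the pre-reclustering bins to be written with the --write-pre-reclustering-bins option).
--write-pre-reclustering-bins defaults to False.
By default, bins are stored gzip-compressed under the file name SemiBin_{label}.fa.gz.
Output file names are now anvi'o compatible (the default value of --tag-output is SemiBin).
The default ORF finder is the fast-naive internal ORF finder.
For long reads, use the --sequencing-type=long_read option.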
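
Because the bins are written as gzip-compressed FASTA files, a quick per-bin summary needs a decompression step. The sketch below prints the number of contigs and the total number of bases for every *.fa.gz file in a directory. summarize_bins is a hypothetical helper and is not part of SemiBin; pass it the bin subdirectory inside the directory given to -o.

```shell
#!/bin/sh
# summarize_bins: hypothetical helper. For every gzip-compressed FASTA file
# (*.fa.gz) in the given directory, prints "FILE<TAB>CONTIGS<TAB>TOTAL_BASES".
summarize_bins() {
    for f in "$1"/*.fa.gz; do
        [ -e "$f" ] || continue   # skip the literal pattern when nothing matches
        gzip -dc "$f" | awk -v name="$(basename "$f")" '
            /^>/ { contigs++; next }          # header line: one more contig
                 { bases += length($0) }      # sequence line: add its length
            END  { printf "%s\t%d\t%d\n", name, contigs, bases }'
    done
}

# Usage: pass the bin subdirectory inside the directory given to -o.
# summarize_bins path/to/bin_directory
```

Counting the output lines (for example by piping to wc -l) gives the number of bins, which is a simple way to compare runs such as the pre-trained and self-supervised results.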

Citation

SemiBin2: self-supervised contrastive learning leads to better MAGs for short- and long-read sequencing
Shaojun Pan, Xing-Ming Zhao, Luis Pedro Coelho
Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i21–i2