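CoCoNet: an efficient deep learning tool for binning viral metagenomes

Metagenomics has the potential to characterize microbial communities and to unravel the complex associations between the microbiome and biological processes. Assembly is one of the most critical steps in a metagenomics experiment: it transforms overlapping DNA sequencing reads into sufficiently accurate representations of the community's genomes. The process is computationally difficult and commonly leaves genomes fragmented across many contigs. Computational binning methods mitigate this fragmentation by partitioning contigs into bins that represent the community's genomes, based on the contigs' sequence composition, abundance, and chromosome organization. Existing binning methods are tuned mainly for bacterial genomes and do not perform well on viral metagenomes.
CoCoNet is a new binning method for viral metagenomes. It leverages the flexibility and effectiveness of deep learning to model the co-occurrence of contigs belonging to the same viral genome, and provides a rigorous framework for binning viral contigs. The results show that CoCoNet substantially outperforms existing binning methods on viral datasets.
CoCoNet is implemented in Python and can be downloaded from PyPI (https://pypi.org/). The source code is on GitHub (https://github.com/Puumanamana/CoCoNet), and the documentation is at https://coconet.readthedocs.io/en/latest/index.html. Running CoCoNet does not require extensive resources. For example, binning 100k contigs took about 4 hours on 10 Intel CPU cores (2.4 GHz), with a memory peak of 27 GB (see Supplementary Fig. S9 of the paper). Processing larger datasets, however, requires running CoCoNet on a server with a large amount of RAM.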

Documentation

https://coconet.readthedocs.io/#

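Hyperparameters (from the documentation)
CoCoNet's default parameters are set to favor bin homogeneity. Depending on the goal of your study, however, you may prefer to favor completeness instead. Four main parameters can be tuned to improve bin completeness at the expense of lower homogeneity: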

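Fragment length: --fragment-length
Minimum contig prevalence, i.e. the number of samples in which the contig occurs: --min-prevalence
Minimum number of matches θ between two contigs for them to be joined by an edge in the contig-contig graph: --theta
Minimum edge density γ required for a cluster to be considered a coherent bin: --gamma2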
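To make the role of θ more concrete, here is a minimal sketch of a voting rule over fragment pairs. It is only an illustration of the idea, not CoCoNet's actual code: the function name `contig_edge` is hypothetical, and the `vote_threshold` value of 0.5 is an assumption (the help output lists no default for `--vote-threshold`).

```python
def contig_edge(pair_probs, vote_threshold=0.5, theta=0.8):
    """Decide whether two contigs should be linked in the contig-contig graph.

    pair_probs: one probability (0..1) per fragment pair, i.e. the predicted
    chance that a fragment of contig A and a fragment of contig B come from
    the same genome. Hypothetical helper, for illustration only.
    """
    if not pair_probs:
        raise ValueError("need at least one fragment pair")
    # A fragment pair counts as a "match" when it passes the vote threshold
    # (0.5 is an assumed value, not a CoCoNet default).
    matches = sum(1 for p in pair_probs if p >= vote_threshold)
    match_fraction = matches / len(pair_probs)
    # The two contigs are connected only if enough fragment pairs match.
    return match_fraction >= theta, match_fraction
```

With θ = 0.8, nine matching pairs out of ten link the two contigs, while seven out of ten do not. Lowering θ to 0.7 would accept the second case as well, which is the sense in which a smaller θ relaxes the binning.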

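Lowering θ or γ (80% and 75% by default, respectively, according to this description; the help output below, from the version installed here, lists 0.4 as the default for --gamma2) reduces the stringency of the binning. This improves completeness for viruses whose k-mer or coverage patterns vary widely, at the possible cost of lower homogeneity. Similarly, increasing the fragment length minimizes the variability of the k-mer and coverage distributions between contigs of the same species, which in turn improves completeness. However, a high fragment-length (or prevalence) threshold also means that more contigs are assigned to singleton bins simply because they were not long enough to be processed. Raising the minimum prevalence selects contigs that are present more widely across samples. Conversely, moving these parameters in the opposite direction (raising θ and γ, or lowering the fragment length and the minimum prevalence) yields more homogeneous but less complete bins. Several other parameters may also be worth tuning; however, their effects have not yet been thoroughly evaluated, and their current values were chosen empirically.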
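The effect of the length and prevalence thresholds can be sketched as a simple filter. This is an illustration under assumptions, not CoCoNet's implementation: `split_by_eligibility` is a hypothetical helper, and a contig is counted as "present" in a sample when its mean coverage there is above zero, which may differ from the criterion CoCoNet actually uses. The default values 2048 and 2 are the ones shown for `--min-ctg-len` and `--min-prevalence` in the help output.

```python
def split_by_eligibility(contigs, min_ctg_len=2048, min_prevalence=2):
    """Split contigs into those that pass the thresholds and those that do not.

    contigs: dict mapping contig name -> (length in bp, list of mean
    coverage values, one per sample). Hypothetical input layout.
    """
    kept, skipped = [], []
    for name, (length, coverage) in contigs.items():
        # Assumed definition of prevalence: samples with non-zero coverage.
        prevalence = sum(1 for c in coverage if c > 0)
        if length >= min_ctg_len and prevalence >= min_prevalence:
            kept.append(name)
        else:
            skipped.append(name)
    return kept, skipped


example = {
    "contig_a": (5000, [3.0, 1.2, 0.0]),  # long enough, seen in 2 samples
    "contig_b": (1500, [2.0, 2.0, 2.0]),  # too short
    "contig_c": (4000, [0.0, 5.0, 0.0]),  # seen in only 1 sample
}
kept, skipped = split_by_eligibility(example)
```

Raising either threshold moves more contigs into the skipped group, which matches the trade-off described above.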

Installation

I created a Python 3.7 virtual environment with conda and installed CoCoNet with pip. CUDA is also required.

Dependencies

CoCoNet was tested on both macOS and Ubuntu 18.04. To install and run CoCoNet, you will need:

python (>=3.5, recommended: 3.7)
pip3, the python package manager or the conda installer.

GitHub

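mamba create -n coconet -y python=3.7
conda activate coconet
# Note: --user installs into ~/.local rather than into the conda environment;
# drop the flag to keep the install inside the environment.
pip3 install --user numpy
pip3 install --user coconet-binning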

# Docker (Docker Hub)
docker pull nakor/coconet:1.1.0

> coconet -h

$ coconet -h

/home/kazu/.local/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)

return torch._C._cuda_getDeviceCount() > 0

usage: coconet [-h] [--version] {preprocess,learn,cluster,run} ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

action:
  {preprocess,learn,cluster,run}
    preprocess          Preprocess data
    learn               Train neural network on input data
    cluster             Bin contigs using neural network
    run                 Run complete workflow (recommended)

> coconet run -h

$ coconet run -h

usage: coconet run [-h] [--fasta FASTA] [--h5 H5] [--output OUTPUT]
                   [-t THREADS] [--debug] [--quiet] [--silent] [--continue]
                   [--bam BAM [BAM ...]] [--min-ctg-len MIN_CTG_LEN]
                   [--min-prevalence MIN_PREVALENCE]
                   [--min-mapping-quality MIN_MAPPING_QUALITY]
                   [--min-aln-coverage MIN_ALN_COVERAGE] [--flag FLAG]
                   [--tlen-range TLEN_RANGE TLEN_RANGE]
                   [--min-dtr-size MIN_DTR_SIZE]
                   [--fragment-step FRAGMENT_STEP] [--test-ratio TEST_RATIO]
                   [--n-train N_TRAIN] [--n-test N_TEST]
                   [--learning-rate LEARNING_RATE] [--batch-size BATCH_SIZE]
                   [--test-batch TEST_BATCH] [--patience PATIENCE]
                   [--load-batch LOAD_BATCH]
                   [--compo-neurons COMPO_NEURONS COMPO_NEURONS]
                   [--cover-neurons COVER_NEURONS COVER_NEURONS]
                   [--cover-filters COVER_FILTERS]
                   [--cover-kernel COVER_KERNEL] [--cover-stride COVER_STRIDE]
                   [--merge-neurons MERGE_NEURONS] [-k KMER] [--no-rc]
                   [--wsize WSIZE] [--wstep WSTEP] [--n-frags N_FRAGS]
                   [--max-neighbors MAX_NEIGHBORS]
                   [--vote-threshold VOTE_THRESHOLD]
                   [--algorithm {leiden,spectral}] [--theta THETA]
                   [--gamma1 GAMMA1] [--gamma2 GAMMA2]
                   [--n-clusters N_CLUSTERS] [--recruit-small-contigs]
                   [--fragment-length FRAGMENT_LENGTH]
                   [--features {coverage,composition} [{coverage,composition} ...]]

optional arguments:
  -h, --help            show this help message and exit
  --fasta FASTA         Path to your assembly file (fasta formatted) (default:
                        None)
  --h5 H5               Experimental: coverage in hdf5 format (keys are
                        contigs, values are (sample, contig_len) ndarrays
                        (default: None)
  --output OUTPUT       Path to output directory (default: output)
  -t THREADS, --threads THREADS
                        Number of threads (default: 5)
  --debug               Print debugging statements (default: 20)
  --quiet               Less verbose (default: None)
  --silent              Only error messages (default: None)
  --continue            Start from last checkpoint. The output directory needs
                        to be the same. (default: False)
  --bam BAM [BAM ...]   List of paths to your coverage files (bam formatted)
                        (default: None)
  --min-ctg-len MIN_CTG_LEN
                        Minimum contig length (default: 2048)
  --min-prevalence MIN_PREVALENCE
                        Minimum contig prevalence for binning. Contig with
                        less that value are filtered out. (default: 2)
  --min-mapping-quality MIN_MAPPING_QUALITY
                        Minimum alignment quality (default: 30)
  --min-aln-coverage MIN_ALN_COVERAGE
                        Discard alignments with less than 50% aligned
                        nucleotides
  --flag FLAG           SAM flag for filtering (same as samtools "-F" option)
                        (default: 3596)
  --tlen-range TLEN_RANGE TLEN_RANGE
                        Only allow for paired alignments with spacing within
                        this range (default: None)
  --min-dtr-size MIN_DTR_SIZE
                        Minimum size of DTR to flag complete contigs (default:
                        10)
  --fragment-step FRAGMENT_STEP
                        Fragments spacing (default: 128)
  --test-ratio TEST_RATIO
                        Ratio for train / test split (default: 0.1)
  --n-train N_TRAIN     Maximum number of training examples (default: 4000000)
  --n-test N_TEST       Number of test examples (default: 10000)
  --learning-rate LEARNING_RATE
                        Learning rate for gradient descent (default: 0.001)
  --batch-size BATCH_SIZE
                        Batch size for training (default: 256)
  --test-batch TEST_BATCH
                        Run test every 400 batches
  --patience PATIENCE   Early stopping if test accuracy does not improve for 5
                        consecutive tests
  --load-batch LOAD_BATCH
                        Number of coverage batch to load in memory. Consider
                        lowering this value if your RAM is limited. (default:
                        100)
  --compo-neurons COMPO_NEURONS COMPO_NEURONS
                        Number of neurons for the composition dense layers
                        (x2) (default: [64, 32])
  --cover-neurons COVER_NEURONS COVER_NEURONS
                        Number of neurons for the coverage dense layers (x2)
                        (default: [64, 32])
  --cover-filters COVER_FILTERS
                        Number of filters for convolution layer of coverage
                        network. (default: 16)
  --cover-kernel COVER_KERNEL
                        Kernel size for convolution layer of coverage network.
                        (default: 4)
  --cover-stride COVER_STRIDE
                        Convolution stride for convolution layer of coverage
                        network. (default: 2)
  --merge-neurons MERGE_NEURONS
                        Number of neurons for the merging layer (x1) (default:
                        32)
  -k KMER, --kmer KMER  k-mer size for composition vector (default: 4)
  --no-rc               Do not add the reverse complement k-mer occurrences to
                        the composition vector. (default: False)
  --wsize WSIZE         Smoothing window size for coverage vector (default:
                        64)
  --wstep WSTEP         Subsampling step for coverage vector (default: 32)
  --n-frags N_FRAGS     Number of fragments to split the contigs for the
                        clustering phase (default: 30)
  --max-neighbors MAX_NEIGHBORS
                        Maximum number of neighbors to consider to compute the
                        adjacency matrix. (default: 250)
  --vote-threshold VOTE_THRESHOLD
                        When this parameter is not set, contig-contig edges
                        are computed by summing the probability between all
                        pairwise fragments between them.Otherwise, adopt a
                        voting strategy and sets a hard-threshold on the
                        probabilityfrom each pairwise comparison. (default:
                        None)
  --algorithm {leiden,spectral}
                        Algorithm for clustering the contig-contig graph.
                        Note: the number of cluster is required if "spectral"
                        is chosen. (default: leiden)
  --theta THETA         (leiden) Minimum percent of edges between two contigs
                        to form an edge between them (default: 0.8)
  --gamma1 GAMMA1       (leiden) CPM optimization value for the first run of
                        the Leiden clustering (default: 0.3)
  --gamma2 GAMMA2       (leiden) CPM optimization value for the second run of
                        the Leiden clustering (default: 0.4)
  --n-clusters N_CLUSTERS
                        (spectral clustering) Maximum number of clusters
                        (default: None)
  --recruit-small-contigs
                        Salvage short contigs (<2048) (default: False)
  --fragment-length FRAGMENT_LENGTH
                        Length of contig fragments in bp. Default is half the
                        minimum contig length. (default: -1)
  --features {coverage,composition} [{coverage,composition} ...]
                        Features for binning (composition, coverage, or both)
                        (default: ['coverage', 'composition'])
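The `-k` and `--no-rc` options above refer to a k-mer composition vector. As a rough illustration of what such a vector is, the sketch below counts k-mers on a fragment and, by default, on its reverse complement as well, then normalizes the counts to frequencies. The function names are hypothetical and the normalization is an assumption; how CoCoNet builds and scales its own vector is not described in this post.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def composition_vector(seq, k=4, add_rc=True):
    """Return k-mer frequencies of seq, in lexicographic k-mer order.

    With add_rc=True, k-mers from the reverse complement strand are counted
    too, so a fragment and its reverse complement get the same vector.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    strands = [seq.upper()]
    if add_rc:
        strands.append(reverse_complement(seq.upper()))
    for strand in strands:
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            if kmer in counts:  # skip k-mers containing N or other symbols
                counts[kmer] += 1
    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]
```

With k = 4 the vector has 4^4 = 256 entries. Counting both strands is useful because a contig may have been assembled in either orientation.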

Testing the installation

git clone https://github.com/Puumanamana/CoCoNet
cd CoCoNet
make test

How to run

Specify the FASTA and BAM files.

coconet run --fasta scaffolds.fasta --bam cov/*.bam --output binning_results
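If you want to try several hyperparameter settings, it can be convenient to assemble the command from a script. The sketch below builds the argument list for the `run` action using only flags that appear in the help output (`--fasta`, `--bam`, `--output`, `--theta`, `--gamma2`, `--fragment-length`, `--min-prevalence`). The helper name `build_coconet_command` is hypothetical, and the values in the example are arbitrary, not recommendations.

```python
import shlex


def build_coconet_command(fasta, bam_files, output, theta=None, gamma2=None,
                          fragment_length=None, min_prevalence=None):
    """Build the argument list for `coconet run` (hypothetical helper).

    Optional hyperparameters are only added when a value is given, so that
    CoCoNet's own defaults apply otherwise.
    """
    if not bam_files:
        raise ValueError("at least one BAM file is required")
    cmd = ["coconet", "run", "--fasta", fasta,
           "--bam", *bam_files, "--output", output]
    optional = {
        "--theta": theta,
        "--gamma2": gamma2,
        "--fragment-length": fragment_length,
        "--min-prevalence": min_prevalence,
    }
    for flag, value in optional.items():
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


# Example: a slightly less stringent theta, everything else left at default.
cmd = build_coconet_command("scaffolds.fasta",
                            ["cov/sample1.bam", "cov/sample2.bam"],
                            "binning_results", theta=0.7)
command_line = " ".join(shlex.quote(arg) for arg in cmd)
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, and `command_line` is the equivalent string for pasting into a shell.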

The documentation also covers preprocessing, assembly, and an example of how to create the BAM files. Please take a look.

https://coconet.readthedocs.io/example-workflow.html

Citation

CoCoNet: an efficient deep learning tool for viral metagenome binning
Cédric G Arisdakessian, Olivia D Nigro, Grieg F Steward, Guylaine Poisson, Mahdi Belcaid
Bioinformatics, Published: 05 April 2021