CoCoNet is implemented in Python and can be downloaded from PyPI (https://pypi.org/). The source code is on GitHub (https://github.com/Puumanamana/CoCoNet) and the documentation is at https://coconet.readthedocs.io/en/latest/index.html. Running CoCoNet does not require enormous resources: for example, binning 100k contigs took about 4 hours on 10 Intel CPU cores (2.4 GHz), with a peak memory of 27 GB (see Supplementary Fig. S9 of the paper). For larger datasets, CoCoNet needs to run on a server with a large amount of RAM.
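Hyperparameters (from the documentation)
- Fragment length: --fragment-length
- Minimum contig prevalence, i.e. the number of samples in which a contig appears: --min-prevalence
- Minimum number of matches θ between two contigs for them to be joined by an edge in the contig-contig graph (the help text expresses this as a fraction, default 0.8): --theta
- Minimum edge density γ required for a cluster to be considered a coherent bin: --gamma2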
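To make the --min-prevalence and --theta entries more concrete, here is a minimal Python sketch. It is not CoCoNet's own code. Two assumptions are baked in: a contig "appears" in a sample when its coverage there is non-zero, and θ is the fraction of fragment-pair comparisons that must match. The function names are hypothetical and exist only for this illustration.

```python
from typing import Dict, Sequence


def prevalence(coverage_per_sample: Sequence[float]) -> int:
    """Count the samples in which a contig has non-zero coverage (assumed definition)."""
    return sum(1 for cov in coverage_per_sample if cov > 0)


def filter_by_prevalence(
    coverage: Dict[str, Sequence[float]], min_prevalence: int = 2
) -> Dict[str, Sequence[float]]:
    """Keep only contigs seen in at least `min_prevalence` samples."""
    return {
        name: cov
        for name, cov in coverage.items()
        if prevalence(cov) >= min_prevalence
    }


def has_edge(fragment_matches: Sequence[bool], theta: float = 0.8) -> bool:
    """Decide whether two contigs are linked, given per-fragment-pair match calls.

    `fragment_matches` holds one boolean per compared fragment pair
    (hypothetical input; how CoCoNet produces these calls is not shown here).
    """
    if not fragment_matches:
        return False
    return sum(fragment_matches) / len(fragment_matches) >= theta


if __name__ == "__main__":
    # Toy coverage table: one value per sample (made-up numbers).
    coverage = {"contig_1": [12.0, 0.0, 3.5], "contig_2": [0.0, 0.0, 7.1]}
    print(sorted(filter_by_prevalence(coverage, min_prevalence=2)))
    print(has_edge([True] * 9 + [False], theta=0.8))
```

With the toy table, contig_2 is dropped because it is covered in only one sample, and the two contigs in the second call are linked because 9 of 10 fragment pairs match.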
CoCoNet was tested on both MacOS and Ubuntu 18.04. To install and run CoCoNet, you will need:
mamba create -n coconet -y python=3.7
conda activate coconet
pip3 install --user numpy
pip3 install --user coconet-binning
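Since the commands above use pip3 install --user, the coconet entry point can end up in the per-user script directory rather than inside the conda environment, and that directory is not always on PATH. The short Python check below is one way to find out where it landed. It is a sketch that assumes a Linux or macOS layout, where user scripts go to the bin directory under the user base; find_coconet is a hypothetical helper name.

```python
import os
import shutil
import site
from typing import Optional


def find_coconet() -> Optional[str]:
    """Return the path of the `coconet` executable, or None if it cannot be found."""
    # First choice: whatever the shell would pick up from PATH.
    found = shutil.which("coconet")
    if found:
        return found
    # Fallback: the per-user script directory used by `pip install --user`
    # on Linux and macOS (assumption: <user base>/bin).
    candidate = os.path.join(site.getuserbase(), "bin", "coconet")
    if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
        return candidate
    return None


if __name__ == "__main__":
    path = find_coconet()
    if path:
        print("coconet found at", path)
    else:
        print("coconet not found; check that the user bin directory is on PATH")
```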
#docker (hub)
docker pull nakor/coconet:1.1.0
> coconet -h
$ coconet -h
/home/kazu/.local/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
usage: coconet [-h] [--version] {preprocess,learn,cluster,run} ...
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
preprocess Preprocess data
learn Train neural network on input data
cluster Bin contigs using neural network
run Run complete workflow (recommended)
> coconet run -h
$ coconet run -h
/home/kazu/.local/lib/python3.7/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: The NVIDIA driver on your system is too old (found version 9010). Please update your GPU driver by downloading and installing a new version from the URL: http://www.nvidia.com/Download/index.aspx Alternatively, go to: https://pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver. (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0
usage: coconet run [-h] [--fasta FASTA] [--h5 H5] [--output OUTPUT]
[-t THREADS] [--debug] [--quiet] [--silent] [--continue]
[--bam BAM [BAM ...]] [--min-ctg-len MIN_CTG_LEN]
[--min-prevalence MIN_PREVALENCE]
[--min-mapping-quality MIN_MAPPING_QUALITY]
[--min-aln-coverage MIN_ALN_COVERAGE] [--flag FLAG]
[--tlen-range TLEN_RANGE TLEN_RANGE]
[--min-dtr-size MIN_DTR_SIZE]
[--fragment-step FRAGMENT_STEP] [--test-ratio TEST_RATIO]
[--n-train N_TRAIN] [--n-test N_TEST]
[--learning-rate LEARNING_RATE] [--batch-size BATCH_SIZE]
[--test-batch TEST_BATCH] [--patience PATIENCE]
[--load-batch LOAD_BATCH]
[--cover-filters COVER_FILTERS]
[--cover-kernel COVER_KERNEL] [--cover-stride COVER_STRIDE]
[--merge-neurons MERGE_NEURONS] [-k KMER] [--no-rc]
[--wsize WSIZE] [--wstep WSTEP] [--n-frags N_FRAGS]
[--max-neighbors MAX_NEIGHBORS]
[--vote-threshold VOTE_THRESHOLD]
[--algorithm {leiden,spectral}] [--theta THETA]
[--gamma1 GAMMA1] [--gamma2 GAMMA2]
[--n-clusters N_CLUSTERS] [--recruit-small-contigs]
[--fragment-length FRAGMENT_LENGTH]
[--features {coverage,composition} [{coverage,composition} ...]]
optional arguments:
-h, --help show this help message and exit
--fasta FASTA Path to your assembly file (fasta formatted) (default:
--h5 H5 Experimental: coverage in hdf5 format (keys are
contigs, values are (sample, contig_len) ndarrays
(default: None)
--output OUTPUT Path to output directory (default: output)
-t THREADS, --threads THREADS
Number of threads (default: 5)
--debug Print debugging statements (default: 20)
--quiet Less verbose (default: None)
--silent Only error messages (default: None)
--continue Start from last checkpoint. The output directory needs
to be the same. (default: False)
--bam BAM [BAM ...] List of paths to your coverage files (bam formatted)
(default: None)
--min-ctg-len MIN_CTG_LEN
Minimum contig length (default: 2048)
--min-prevalence MIN_PREVALENCE
Minimum contig prevalence for binning. Contig with
less that value are filtered out. (default: 2)
--min-mapping-quality MIN_MAPPING_QUALITY
Minimum alignment quality (default: 30)
--min-aln-coverage MIN_ALN_COVERAGE
Discard alignments with less than 50% aligned
--flag FLAG SAM flag for filtering (same as samtools "-F" option)
(default: 3596)
--tlen-range TLEN_RANGE TLEN_RANGE
Only allow for paired alignments with spacing within
this range (default: None)
--min-dtr-size MIN_DTR_SIZE
Minimum size of DTR to flag complete contigs (default:
--fragment-step FRAGMENT_STEP
Fragments spacing (default: 128)
--test-ratio TEST_RATIO
Ratio for train / test split (default: 0.1)
--n-train N_TRAIN Maximum number of training examples (default: 4000000)
--n-test N_TEST Number of test examples (default: 10000)
--learning-rate LEARNING_RATE
Learning rate for gradient descent (default: 0.001)
--batch-size BATCH_SIZE
Batch size for training (default: 256)
--test-batch TEST_BATCH
Run test every 400 batches
--patience PATIENCE Early stopping if test accuracy does not improve for 5
consecutive tests
--load-batch LOAD_BATCH
Number of coverage batch to load in memory. Consider
lowering this value if your RAM is limited. (default:
Number of neurons for the composition dense layers
(x2) (default: [64, 32])
Number of neurons for the coverage dense layers (x2)
(default: [64, 32])
--cover-filters COVER_FILTERS
Number of filters for convolution layer of coverage
network. (default: 16)
--cover-kernel COVER_KERNEL
Kernel size for convolution layer of coverage network.
(default: 4)
--cover-stride COVER_STRIDE
Convolution stride for convolution layer of coverage
network. (default: 2)
--merge-neurons MERGE_NEURONS
Number of neurons for the merging layer (x1) (default:
-k KMER, --kmer KMER k-mer size for composition vector (default: 4)
--no-rc Do not add the reverse complement k-mer occurrences to
the composition vector. (default: False)
--wsize WSIZE Smoothing window size for coverage vector (default:
--wstep WSTEP Subsampling step for coverage vector (default: 32)
--n-frags N_FRAGS Number of fragments to split the contigs for the
clustering phase (default: 30)
--max-neighbors MAX_NEIGHBORS
Maximum number of neighbors to consider to compute the
adjacency matrix. (default: 250)
--vote-threshold VOTE_THRESHOLD
When this parameter is not set, contig-contig edges
are computed by summing the probability between all
pairwise fragments between them.Otherwise, adopt a
voting strategy and sets a hard-threshold on the
probabilityfrom each pairwise comparison. (default:
--algorithm {leiden,spectral}
Algorithm for clustering the contig-contig graph.
Note: the number of cluster is required if "spectral"
is chosen. (default: leiden)
--theta THETA (leiden) Minimum percent of edges between two contigs
to form an edge between them (default: 0.8)
--gamma1 GAMMA1 (leiden) CPM optimization value for the first run of
the Leiden clustering (default: 0.3)
--gamma2 GAMMA2 (leiden) CPM optimization value for the second run of
the Leiden clustering (default: 0.4)
--n-clusters N_CLUSTERS
(spectral clustering) Maximum number of clusters
(default: None)
--recruit-small-contigs
Salvage short contigs (<2048) (default: False)
--fragment-length FRAGMENT_LENGTH
Length of contig fragments in bp. Default is half the
minimum contig length. (default: -1)
--features {coverage,composition} [{coverage,composition} ...]
Features for binning (composition, coverage, or both)
(default: ['coverage', 'composition'])
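Two of the defaults in the help text interact: --fragment-length defaults to -1, meaning half of --min-ctg-len, and --fragment-step sets the spacing between fragments. The Python sketch below works through that arithmetic. It assumes fragments are taken as a simple sliding window over the contig, which is an illustration and not necessarily how CoCoNet enumerates them; both function names are hypothetical.

```python
from typing import List


def resolve_fragment_length(fragment_length: int = -1, min_ctg_len: int = 2048) -> int:
    """Apply the documented default: half the minimum contig length when unset (-1)."""
    return min_ctg_len // 2 if fragment_length < 0 else fragment_length


def fragment_starts(contig_len: int, fragment_length: int, fragment_step: int = 128) -> List[int]:
    """Start positions of fragments taken every `fragment_step` bp (assumed sliding window)."""
    if fragment_length > contig_len:
        return []
    return list(range(0, contig_len - fragment_length + 1, fragment_step))


if __name__ == "__main__":
    frag_len = resolve_fragment_length()           # 1024 with the default --min-ctg-len
    starts = fragment_starts(2048, frag_len)       # shortest contig that passes the filter
    print(frag_len, len(starts), starts[0], starts[-1])   # prints: 1024 9 0 1024
```

Under these assumptions, a contig of exactly 2048 bp yields 9 fragments of 1024 bp; longer contigs yield proportionally more.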
git clone https://github.com/Puumanamana/CoCoNet
cd CoCoNet
make test
coconet run --fasta scaffolds.fasta --bam cov/*.bam --output binning_results
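For scripted pipelines, the coconet run workflow can also be launched from Python. The sketch below only assembles the command from flags that appear in the help output (--fasta, --bam, --output, --threads) and prints it; it runs CoCoNet only when dry_run is set to False. The helper names, the file names, and the thread count are placeholders, not part of CoCoNet.

```python
import glob
import shlex
import subprocess
from typing import List, Sequence


def build_coconet_run(
    fasta: str,
    bams: Sequence[str],
    output: str = "output",
    threads: int = 5,
    extra: Sequence[str] = (),
) -> List[str]:
    """Assemble a `coconet run` command line as an argument list."""
    if not bams:
        raise ValueError("at least one BAM file is required")
    cmd = ["coconet", "run", "--fasta", fasta, "--bam", *bams,
           "--output", output, "--threads", str(threads)]
    cmd.extend(extra)  # e.g. ["--min-ctg-len", "2048"]
    return cmd


def run_coconet(cmd: Sequence[str], dry_run: bool = True) -> None:
    """Print the command, and execute it only when dry_run is False."""
    print(" ".join(shlex.quote(part) for part in cmd))
    if not dry_run:
        subprocess.run(list(cmd), check=True)


if __name__ == "__main__":
    bam_files = sorted(glob.glob("cov/*.bam"))  # same pattern as the shell example
    if bam_files:
        run_coconet(build_coconet_run("scaffolds.fasta", bam_files,
                                      output="binning_results", threads=10))
    else:
        print("no BAM files found under cov/")
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and keeping dry_run as the default makes it easy to inspect the command before committing several hours of compute.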
CoCoNet: an efficient deep learning tool for viral metagenome binning
Cédric G Arisdakessian, Olivia D Nigro, Grieg F Steward, Guylaine Poisson, Mahdi Belcaid
Bioinformatics, Published: 05 April 2021