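Viruses are abundant in every environment on Earth and infect all cellular life. Even so, they remain something of a black box for genome scientists: their genetic diversity exceeds that of all other life forms combined, their genomes are often overlooked in sequencing datasets, and no function can be inferred for most of their genes. For these reasons, scientists need robust, high-performance, documented, and extensible tools that analyze sequence data with high sensitivity and specificity, so that viral genomes can be discovered and their genes annotated even when they diverge strongly from known reference genomes. This paper presents Cenote-Taker 3, a command-line tool that processes genome and/or metagenome assemblies, with modules for virus discovery, prophage extraction, and annotation of genes and other genetic features. In benchmarks, Cenote-Taker 3 outperforms most tools in virus gene annotation in both speed (wall time) and accuracy. In virus discovery benchmarks, it performs favorably compared with geNomad, and the two tools produce complementary results. Cenote-Taker 3 is freely available on Bioconda, and its open-source code is maintained on GitHub (https://github.com/mtisza1/Cenote-Taker3).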
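From the GitHub repository
Cenote-Taker 3 is a virus bioinformatics tool that scales from individual genome sequences to large metagenome assemblies and provides the following:
1. Identification of sequences containing genes specific to viruses (virus hallmark genes)
2. Annotation of virus sequences, including:
---a) adaptive ORF calling
---b) a large catalog of HMMs from virus gene families for functional annotation
---c) hierarchical taxonomy assignment based on hallmark genes
---d) mmseqs2-based CDD database search
---e) tabular (.tsv) and interactive genome map (.gbf) outputs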
Cenote-Taker 3 is also very fast, and on large datasets it is much faster than Cenote-Taker 2. It is also faster than comparable annotation with pharokka, with more functional annotation of viral genes.
Installation
Tested on a MacBook Air (M1).
Versions used in test installations
- mamba 1.5.8
- conda 24.7.1
# macos
mamba create --platform osx-64 -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.4.1 mmseqs2=15.6f452
conda activate ct3_env
#linux
mamba create -n ct3_env -c conda-forge -c bioconda cenote-taker3=3.4.1
conda activate ct3_env
# or install from source
git clone https://github.com/mtisza1/Cenote-Taker3.git
cd Cenote-Taker3
mamba env create -f environment/ct3_env.yaml
conda activate ct3_env
pip install .
> cenotetaker3 -h
this script dir: /Users/kazu_1/mambaforge/envs/ct3_env/lib/python3.12/site-packages/cenote
usage: cenotetaker3 [-h] -c ORIGINAL_CONTIGS -r RUN_TITLE -p PROPHAGE [-t CPU] [--version] [-am ANNOTATION_MODE] [-wd C_WORKDIR]
[--template_file TEMPLATE_FILE] [--reads READS [READS ...]] [--minimum_length_circular CIRC_LENGTH_CUTOFF]
[--minimum_length_linear LINEAR_LENGTH_CUTOFF] [-db {virion,rdrp,dnarep} [{virion,rdrp,dnarep} ...]]
[--lin_minimum_hallmark_genes LIN_MINIMUM_DOMAINS] [--circ_minimum_hallmark_genes CIRC_MINIMUM_DOMAINS] [-hh {none,hhblits,hhsearch}]
[--caller {prodigal-gv,prodigal,phanotate,adaptive}] [--isolation_source ISOLATION_SOURCE] [--collection_date COLLECTION_DATE]
[--metagenome_type METAGENOME_TYPE] [--srr_number SRR_NUMBER] [--srx_number SRX_NUMBER] [--biosample BIOSAMPLE] [--bioproject BIOPROJECT]
[--assembler ASSEMBLER] [--molecule_type MOLECULE_TYPE] [--data_source DATA_SOURCE] [--cenote-dbs C_DBS] [--hmmscan_dbs HMM_DBS]
[--wrap WRAP] [--genbank GENBANK] [--taxdb {refseq,hallmark}] [--seqtech SEQTECH] [--max_dtr_assess MAXDTR] [--circ-file CIRCF]
Cenote-Taker 3 is a pipeline for virus discovery and thorough annotation of viral contigs and genomes. Visit https://github.com/mtisza1/Cenote-Taker3 for
help. Version 3.4.1
options:
-h, --help show this help message and exit
REQUIRED ARGUMENTS for Cenote-Taker 3 :
-c ORIGINAL_CONTIGS, --contigs ORIGINAL_CONTIGS
Contig file with .fasta extension in fasta format. Each header must be unique before the first space character
-r RUN_TITLE, --run_title RUN_TITLE
Name of this run. A directory of this name will be created. Must be unique from older runs or older run will be renamed. Must be less
than 18 characters, using ONLY letters, numbers and underscores (_)
-p PROPHAGE, --prune_prophage PROPHAGE
True or False. Attempt to identify and remove flanking chromosomal regions from non-circular contigs with viral hallmarks (True is
highly recommended for sequenced material not enriched for viruses. Virus-enriched samples probably should be False (you might check
enrichment with ViromeQC). Also, please use False if --lin_minimum_hallmark_genes is set to 0)
OPTIONAL ARGUMENTS for Cenote-Taker 3. See https://www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html#ModifiersPage for more information on GenBank metadata fields:
-t CPU, --cpu CPU Default: 8 -- Example: 32 -- Number of CPUs available for Cenote-Taker 3.
--version show program's version number and exit
-am ANNOTATION_MODE, --annotation_mode ANNOTATION_MODE
Default: False -- Annotate sequences only (skip discovery). Only use if you believe each provided sequence is viral
-wd C_WORKDIR, --working_directory C_WORKDIR
Default: /Users/kazu_1/Downloads -- Set working directory with absolute or relative path. run directory will be created within.
--template_file TEMPLATE_FILE
Template file with some metadata. Real one required for GenBank submission. Takes a couple minutes to generate:
https://submit.ncbi.nlm.nih.gov/genbank/template/submission/
--reads READS [READS ...]
read file(s) in .fastq format. You can specify more than one separated by a space
--minimum_length_circular CIRC_LENGTH_CUTOFF
Default: 1000 -- Minimum length of contigs to be checked for circularity. Bare minimun is 1000 nts
--minimum_length_linear LINEAR_LENGTH_CUTOFF
Default: 1000 -- Minimum length of non-circualr contigs to be checked for viral hallmark genes.
-db {virion,rdrp,dnarep} [{virion,rdrp,dnarep} ...], --virus_domain_db {virion,rdrp,dnarep} [{virion,rdrp,dnarep} ...]
default: virion rdrp -- Hits to which domain types should count as hallmark genes? 'virion' database: genes encoding virion
structural proteins, packaging proteins, or capsid maturation proteins (DNA and RNA genomes) with LOWEST false discovery rate. 'rdrp'
database: For RNA virus-derived RNA-dependent RNA polymerase. 'dnarep' database: replication genes of DNA viruses. mostly useful for
small DNA viruses, e.g. CRESS viruses
--lin_minimum_hallmark_genes LIN_MINIMUM_DOMAINS
Default: 1 -- Number of detected viral hallmark genes on a non-circular contig to be considered viral and recieve full annotation.
'2' might be more suitable, yielding a false positive rate near 0.
--circ_minimum_hallmark_genes CIRC_MINIMUM_DOMAINS
Default:1 -- Number of detected viral hallmark genes on a circular contig to be considered viral and recieve full annotation. For
samples physically enriched for virus particles, '0' can be used, but please treat circular contigs without known viral domains
cautiously. For unenriched samples, '1' might be more suitable.
-hh {none,hhblits,hhsearch}, --hhsuite_tool {none,hhblits,hhsearch}
default: none -- hhblits: query any of PDB, pfam, and CDD (depending on what is installed) to annotate ORFs escaping identification
via upstream methods. hhsearch: a more sensitive tool, will query PDB, pfam, and CDD (depending on what is installed) to annotate
ORFs. (WARNING: hhsearch takes much, much longer than hhblits and can extend the duration of the run many times over. Do not use on
large input contig files). 'none': forgoes annotation of ORFs with hhsuite. Fastest way to complete a run.
--caller {prodigal-gv,prodigal,phanotate,adaptive}
ORF caller for viruses. default: prodigal-gv prodigal-gv: prodigal-gv only (prodigal with extra models for unusual viruses) (meta
mode) prodigal: prodigal classic only (meta mode). phanotate: phanotate only Note: phanotate takes longer than prodigal,
exponentially so for LONG input contigs. adaptive: will choose based on preliminary taxonomy call (phages = phanotate, others =
prodigal-gv)
--isolation_source ISOLATION_SOURCE
Default: unknown -- Describes the local geographical source of the organism from which the sequence was derived
--collection_date COLLECTION_DATE
Default: unknown -- Date of collection. this format: 01-Jan-2019, i.e. DD-Mmm-YYYY
--metagenome_type METAGENOME_TYPE
Default: unknown -- a.k.a. metagenome_source
--srr_number SRR_NUMBER
Default: unknown -- For read data on SRA, run number, usually beginning with 'SRR' or 'ERR'
--srx_number SRX_NUMBER
Default: unknown -- For read data on SRA, experiment number, usually beginning with 'SRX' or 'ERX'
--biosample BIOSAMPLE
Default: unknown -- For read data on SRA, sample number, usually beginning with 'SAMN' or 'SAMEA' or 'SRS'
--bioproject BIOPROJECT
Default: unknown -- For read data on SRA, project number, usually beginning with 'PRJNA' or 'PRJEB'
--assembler ASSEMBLER
Default: unknown_assembler -- Assembler used to generate contigs, if applicable. Specify version of assembler software, if possible.
--molecule_type MOLECULE_TYPE
Default: DNA -- viable options are DNA - OR - RNA
--data_source DATA_SOURCE
default: original -- original data is not taken from other researchers' public or private database. 'tpa_assembly': data is taken
from other researchers' public or private database. Please be sure to specify SRA metadata.
--cenote-dbs C_DBS DB path. If not set here, Cenote-Taker looks for environmental variable CENOTE_DBS. Then, if this variable is unset, DB path is
assumed to be /Users/kazu_1/mambaforge/envs/ct3_env/lib/python3.12
--hmmscan_dbs HMM_DBS
HMMscan DB version. looks in cenote_db_path/hmmscan_DBs/
--wrap WRAP Default: True -- Wrap/rotate DTR/circular contigs so the start codon of an ORF is the first nucleotide in the contig/genome
--genbank GENBANK Default: True -- Make GenBank files (.gbf, .sqn, .fsa, .tbl, .cmt, etc)?
--taxdb {refseq,hallmark}
Default: hallmark -- Which taxonomy database to use, just refseq virus OR virus hallmark genes from nr virus containing genus,
family, and class taxonomy labels and clustered at 90 percent AAI plus all hallmark genes from refseq virus
--seqtech SEQTECH Default: Illumina -- Which sequencing technology produced the reads? Common options: Illumina, Nanopore, PacBio, Onso, Aviti
--max_dtr_assess MAXDTR
Default: 1000000 -- maximum sequence length to assess DTRs. Extra long contigs with DTRs are likely to be bacterial chromosomes, not
virus genomes.
--circ-file CIRCF Provide a file with the names of contigs (header line sans ">") you believe are circular, one per line. If using this option, CT3
will treat listed contigs as circular but will not search for DTRs in sequences. Useful for long read assembly outputs (flye,
myloasm) that report circular contigs. --max_dtr_assess will still be considered.
> get_ct3_dbs -h
usage: get_ct3_dbs [-h] -o C_DBS [--hmm HMM_DB] [--refseq_tax REFSEQ_TAX] [--hallmark_tax HALLMARK_TAX] [--mmseqs_cdd MMSEQS_CDD] [--domain_list DOM_LIST]
[--hhCDD HHCDD] [--hhPFAM HHPFAM] [--hhPDB HHPDB]
Update and/or download databases associated with Cenote-Taker 3. HMM (hmmer) databases: updated January 10th, 2024. RefSeq Virus taxonomy DB compiled July
31, 2023. hallmark taxonomy database added March 19th, 2024
options:
-h, --help show this help message and exit
REQUIRED ARGUMENTS:
-o C_DBS output directory when database will be downloaded
Use options to pick databases to update.:
--hmm HMM_DB Default: False -- choose: True -or- False
--refseq_tax REFSEQ_TAX
Default: False -- choose: True -or- False
--hallmark_tax HALLMARK_TAX
Default: False -- choose: True -or- False
--mmseqs_cdd MMSEQS_CDD
Default: False -- choose: True -or- False
--domain_list DOM_LIST
Default: False -- choose: True -or- False
--hhCDD HHCDD Default: False -- choose: True -or- False
--hhPFAM HHPFAM Default: False -- choose: True -or- False
--hhPDB HHPDB Default: False -- choose: True -or- False
Databases
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T --mmseqs_cdd T --domain_list T
=> creates the ct3_DBs/ directory given with "-o"
# Set the CENOTE_DBS environment variable to the database path
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
conda deactivate
conda activate ct3_env

Alternatively, include the hhsuite DBs as well (optional but recommended).
get_ct3_dbs -o ct3_DBs --hmm T --hallmark_tax T --refseq_tax T \
--mmseqs_cdd T --domain_list T --hhCDD T --hhPFAM T --hhPDB T
# Set the CENOTE_DBS environment variable to the database path
conda env config vars set CENOTE_DBS=/path/to/ct3_DBs
conda deactivate
conda activate ct3_env
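
According to the help text, the database path is taken from --cenote-dbs first, then from the CENOTE_DBS environment variable, and only then from a built-in default. The following is a minimal Python sketch of that precedence, which can be handy for checking which directory a run would use before launching it. The function name `resolve_cenote_dbs` and its `default` argument are hypothetical and are not part of Cenote-Taker 3.

```python
import os
from typing import Mapping, Optional


def resolve_cenote_dbs(cli_value: Optional[str],
                       env: Mapping[str, str] = os.environ,
                       default: str = "") -> str:
    """Pick the DB directory in the order described in the help text.

    Hypothetical helper: 1) the --cenote-dbs value, 2) the CENOTE_DBS
    variable, 3) a caller-supplied default (the tool's real default
    depends on where it is installed).
    """
    if cli_value:
        return cli_value
    if env.get("CENOTE_DBS"):
        return env["CENOTE_DBS"]
    return default


if __name__ == "__main__":
    path = resolve_cenote_dbs(None)
    # An empty result suggests the environment was not re-activated
    # after running 'conda env config vars set'.
    print(path or "CENOTE_DBS is not set in this environment")
```

If this prints the "not set" message right after `conda env config vars set`, deactivating and re-activating the environment as shown above is the likely fix.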

DB footprint (from the repository). Note that PDB70 is quite large.
Test run
git clone https://github.com/mtisza1/Cenote-Taker3.git
cd Cenote-Taker3/test_data/
cenotetaker3 -c testcontigs_DNA_ct2.fasta -r test_ct3 -p T
this script dir: /Users/kazu_1/mambaforge/envs/ct3_env/lib/python3.12/site-packages/cenote
000000000000000000000000000000
000000000000000000000000000000
0000000000 ^^^^^^^^ 0000000000
0000000 ^^^^^^^^^^^^^^ 0000000
00000 ^^^^^ CENOTE ^^^^^ 00000
00000 ^^^^^ TAKER! ^^^^^ 00000
00000 ^^^^^^^^^^^^^^^^^^ 00000
0000000 ^^^^^^^^^^^^^^ 0000000
0000000000 ^^^^^^^^ 0000000000
000000000000000000000000000000
000000000000000000000000000000
FASTA checked.
test_ct3
time update: configuring run directory 09-10-25---05:46:11
@@@@@@@@@@@@@@@@@@@@@@@@@
Your specified arguments:
Cenote-Taker version: 3.4.1
original contigs: testcontigs_DNA_ct2.fasta
title of this run: test_ct3
output directory: /Users/kazu_1/Downloads/Cenote-Taker3-main/test_data/test_ct3
Prune prophages? True
CPUs used for run: 8
Annotation only? False
minimum circular contig length: 1000
minimum linear contig length: 1000
virus hallmark type(s) to count: virion rdrp
min. viral hallmarks for linear: 1
min. viral hallmarks for circular: 1
Wrap contigs? True
HMM db version v3.1.1
ORF Caller: prodigal-gv
Cenote DBs directory: /Users/kazu_1/ct3_DBs
Cenote scripts directory: /Users/kazu_1/mambaforge/envs/ct3_env/lib/python3.12/site-packages/cenote
Template file: /Users/kazu_1/mambaforge/envs/ct3_env/lib/python3.12/site-packages/cenote/dummy_template.sbt
read file(s): none
HHsuite tool: none
Taxonomy DB: ct3_hallmark.taxDB
Sequencing Technology: Illumina
Max seq length to assess DTRs: 1000000
File with circular contig IDs: not given
On a MacBook Air with 8 GB of memory, the test run took about 100 minutes.
Example output

(For a description of the output files, see the Output Files section of the repository.)
Usage
Identify and annotate the virome in a metagenome assembly.
cenotetaker3 -c metagenome.fna -r my_meta_ct3 -p T
- -c Contig file with .fasta extension in fasta format. Each header must be unique before the first space character
- -r Name of this run. A directory of this name will be created. Must be unique from older runs or older run will be renamed. Must be less than 18 characters, using ONLY letters, numbers and underscores (_)
- -t Number of CPUs available for Cenote-Taker 3. Default: 8.
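
The constraints on -c and -r above (contig IDs unique before the first space, run title under 18 characters using only letters, numbers, and underscores) are easy to check before starting a long run. Below is a minimal Python sketch of such a pre-flight check. The helper names are hypothetical, and restricting "letters" to ASCII is an assumption; Cenote-Taker 3 performs its own checks, which may differ.

```python
import re
from collections import Counter
from typing import Iterable, List

# Assumption: "letters, numbers and underscores" is read as ASCII only.
RUN_TITLE_RE = re.compile(r"[A-Za-z0-9_]+")


def valid_run_title(title: str) -> bool:
    """True if the title has fewer than 18 characters and only allowed characters."""
    return len(title) < 18 and RUN_TITLE_RE.fullmatch(title) is not None


def duplicate_contig_ids(fasta_lines: Iterable[str]) -> List[str]:
    """Return contig IDs (header text before the first space) seen more than once."""
    ids = [
        line[1:].split(" ", 1)[0].strip()
        for line in fasta_lines
        if line.startswith(">")
    ]
    return sorted(name for name, count in Counter(ids).items() if count > 1)
```

For a real file, `duplicate_contig_ids(open("metagenome.fna"))` streams the headers line by line; an empty list means every ID is unique.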
Recommended settings for microbial genomes
cenotetaker3 -c my_metagenome.fna -r my_meta_ct3 -p T --lin_minimum_hallmark_genes 2
- --lin_minimum_hallmark_genes Default: 1 -- Number of detected viral hallmark genes on a non-circular contig to be considered viral and receive full annotation. '2' might be more suitable, yielding a false positive rate near 0.
- -p True or False. Attempt to identify and remove flanking chromosomal regions from non-circular contigs with viral hallmarks (True is highly recommended for sequenced material not enriched for viruses. Virus-enriched samples probably should be False (you might check enrichment with ViromeQC). Also, please use False if --lin_minimum_hallmark_genes is set to 0)
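
The help text ties two options together: -p should be False whenever --lin_minimum_hallmark_genes is 0. When runs are launched from a script, that rule can be enforced while assembling the argument list. This is a minimal sketch; `build_ct3_command` is a hypothetical wrapper, and it only uses flags and values that appear in the help output above.

```python
from typing import List


def build_ct3_command(contigs: str, run_title: str, prune_prophage: bool,
                      lin_min_hallmarks: int = 1) -> List[str]:
    """Assemble a cenotetaker3 argument list (hypothetical wrapper)."""
    if lin_min_hallmarks < 0:
        raise ValueError("hallmark gene threshold cannot be negative")
    # The help text asks for -p False when --lin_minimum_hallmark_genes is 0.
    if lin_min_hallmarks == 0 and prune_prophage:
        raise ValueError("use -p False when --lin_minimum_hallmark_genes is 0")
    return [
        "cenotetaker3",
        "-c", contigs,
        "-r", run_title,
        "-p", "True" if prune_prophage else "False",
        "--lin_minimum_hallmark_genes", str(lin_min_hallmarks),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` inside the activated ct3_env environment.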
Force the use of Prodigal (the default is prodigal-gv)
cenotetaker3 -c my_metagenome.fna -r my_meta_ct3pr -p T --caller prodigal
- --caller {prodigal-gv,prodigal,phanotate,adaptive} ORF caller for viruses. default: prodigal-gv prodigal-gv: prodigal-gv only (prodigal with extra models for unusual viruses) (meta mode) prodigal: prodigal classic only (meta mode). phanotate: phanotate only Note: phanotate takes longer than prodigal, exponentially so for LONG input contigs. adaptive: will choose based on preliminary taxonomy call (phages = phanotate, others = prodigal-gv)
Compute read coverage
cenotetaker3 -c my_metagenome.fna -r my_meta_ct3 -p T --reads my_reads/*fastq
- --reads read file(s) in .fastq format. You can specify more than one separated by a space
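
In the command above, the shell expands my_reads/*fastq into separate, space-separated file names before cenotetaker3 sees them. When the command is launched from Python without a shell, that expansion does not happen, so the pattern has to be expanded explicitly. A minimal sketch follows; `reads_arguments` is a hypothetical helper.

```python
import glob
from typing import List


def reads_arguments(pattern: str) -> List[str]:
    """Expand a glob pattern into '--reads file1 file2 ...' (sorted for reproducibility)."""
    files = sorted(glob.glob(pattern))
    if not files:
        # Fail early instead of passing a literal, unexpanded pattern to the tool.
        raise FileNotFoundError(f"no read files match {pattern!r}")
    return ["--reads", *files]
```

The result can be appended to the rest of the argument list, for example `["cenotetaker3", "-c", "my_metagenome.fna", "-r", "my_meta_ct3", "-p", "True"] + reads_arguments("my_reads/*fastq")`.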
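Other notes (from GitHub)
- Ideas for downstream analysis include CheckV for estimating viral genome completeness, BACPHLIP for predicting phage lifestyle (complete or near-complete phage genomes only), vConTACT3 for genome clustering and taxonomy, and iPHoP for host prediction of prokaryotic viruses.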
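Use cases for Cenote-Taker 3
- Discovering viral contigs in metagenomic data
- Annotating viral sequences without highly similar annotated reference sequences
- Discovering prophages (or proviruses) in microbial genomes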
Citation
Cenote-Taker 3 for Fast and Accurate Virus Discovery and Annotation of the Virome
Michael J. Tisza, Joseph F. Petrosino, Sara J. Javornik Cregeen
bioRxiv, Posted August 24, 2025.