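Magnetosome gene clusters (MGCs), which are responsible for magnetosome biosynthesis and organization in magnetotactic bacteria (MTB), are key to deciphering the mechanisms and evolutionary origin of bacterial magnetoreception, organelle biogenesis, and intracellular biomineralization. Here we report the development of MagCluster, a standalone Python tool for efficiently exploring MGCs in large-scale (meta)genomic data.
From the GitHub repository:
MagCluster is a tool for identifying, annotating, and visualizing magnetosome gene clusters (MGCs) in the genomes of magnetotactic bacteria (MTB). It exploits the physical clustering of magnetosome genes on the chromosome to identify MGCs that are difficult to identify accurately from sequence identity alone.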
Installation
I created a conda environment and tested the tool there (Ubuntu 18).
#conda
mamba create -n magcluster -y
conda activate magcluster
mamba install -c bioconda magcluster -y
#pip
pip install magcluster
> magcluster -h
usage: magcluster [options]
Magnetosome Gene Clusters Analyzer
optional arguments:
-h, --help show this help message and exit
-v, --version show magcluster version number and exit
Options:
{prokka,mgc_screen,clinker}
prokka Genome annotation with Prokka
mgc_screen Magnetosome gene cluster screening
clinker Magnetosome gene cluster mapping with Clinker
General usage
-------------
Genome annotation:
$ magcluster prokka genome.fasta
MGCs screening:
$ magcluster mgc_screen genbank.gbk
MGCs alignment and mapping:
$ magcluster clinker genbank_mgc.gbk
Runjia, 2021
> magcluster prokka -h
usage: magcluster [options] prokka [-h] [--quiet] [--debug] [--outdir OUTDIR]
[--prefix PREFIX] [--force] [--addgenes]
[--addmrna] [--locustag LOCUSTAG]
[--increment INCREMENT] [--gffver GFFVER]
[--compliant] [--centre CENTRE]
[--accver ACCVER] [--genus GENUS]
[--species SPECIES] [--strain STRAIN]
[--plasmid PLASMID] [--kingdom KINGDOM]
[--gcode GCODE] [--gram GRAM] [--usegenus]
[--proteins PROTEINS] [--hmms HMMS]
[--metagenome] [--rawproduct]
[--cdsrnaolap] [--cpus CPUS] [--fast]
[--noanno] [--mincontiglen MINCONTIGLEN]
[--evalue EVALUE] [--rfam] [--norrna]
[--notrna] [--rnammer]
fafile [fafile ...]
positional arguments:
fafile Genome files need to be annotated
optional arguments:
-h, --help show this help message and exit
General:
--quiet No screen output (default OFF)
--debug Debug mode: keep all temporary files (default OFF)
Outputs:
--outdir OUTDIR Output folder [auto] (default 'XXX_annotation')
--prefix PREFIX Filename output prefix [auto] (default "XXX_")
--force Force overwriting existing output folder (default OFF)
--addgenes Add 'gene' features for each 'CDS' feature (default
OFF)
--addmrna Add 'mRNA' features for each 'CDS' feature (default
OFF)
--locustag LOCUSTAG Locus tag prefix (default 'PROKKA')
--increment INCREMENT
Locus tag counter increment (default '1')
--gffver GFFVER GFF version (default '3')
--compliant Force Genbank/ENA/DDJB compliance: --genes
--mincontiglen 200 --centre XXX (default OFF)
XXX (default OFF):
--centre CENTRE Sequencing centre ID. (default '')
--accver ACCVER Version to put in Genbank file (default '1')
Organism details:
--genus GENUS Genus name (default 'Genus')
--species SPECIES Species name (default 'species')
--strain STRAIN Strain name (default 'strain')
--plasmid PLASMID Plasmid name or identifier (default '')
Annotations:
--kingdom KINGDOM Annotation mode: Archaea|Bacteria|Viruses (default
'Bacteria')
--gcode GCODE Genetic code / Translation table (set if --kingdom is
set) (default '0')
--gram GRAM Gram: -/neg +/pos (default '')
--usegenus Use genus-specific BLAST databases (needs --genus)
(default OFF)
--proteins PROTEINS Fasta file of trusted proteins to first annotate from
(default "Magnetosome_protein_data.fasta")
--hmms HMMS Trusted HMM to first annotate from (default '')
--metagenome Improve gene predictions for highly fragmented genomes
(default OFF)
--rawproduct Do not clean up /product annotation (default OFF)
--cdsrnaolap Allow [tr]RNA to overlap CDS (default OFF)
Computation:
--cpus CPUS Number of CPUs to use [0=all] (default '8')
--fast Fast mode - skip CDS /product searching (default OFF)
--noanno For CDS just set /product="unannotated protein"
(default OFF)
--mincontiglen MINCONTIGLEN
Minimum contig size [NCBI needs 200] (default '1')
--evalue EVALUE Similarity e-value cut-off (default '1e-06')
--rfam Enable searching for ncRNAs with Infernal+Rfam (SLOW!)
(default '0')
--norrna Don't run rRNA search (default OFF)
--notrna Don't run tRNA search (default OFF)
--rnammer Prefer RNAmmer over Barrnap for rRNA prediction
(default OFF)
> magcluster mgc_screen -h
usage: magcluster [options] mgc_screen [-h] [-l CONTIGLENGTH] [-w WINDOWSIZE]
[-th THRESHOLD] [-o OUTDIR]
gbkfile [gbkfile ...]
positional arguments:
gbkfile .gbk/.gbf files to be analyzed. Multiple files or
files-containing folder is acceptable.
optional arguments:
-h, --help show this help message and exit
-l CONTIGLENGTH, --contiglength CONTIGLENGTH
The minimum length of contigs to be considered
(default '2,000 bp')
-w WINDOWSIZE, --windowsize WINDOWSIZE
The length of MGCs screening window (default '10,000
bp')
-th THRESHOLD, --threshold THRESHOLD
The minimum number of magnetosome genes in a given
contig and a given length of screening window (default
'3')
-o OUTDIR, --outdir OUTDIR
Output folder (default 'mgc_screen')
> magcluster clinker -h
usage: magcluster [options] clinker [-h] [-r RANGES [RANGES ...]] [-na]
[-i IDENTITY] [-j JOBS] [-s SESSION]
[-ji JSON_INDENT] [-f] [-o OUTPUT]
[-p [PLOT]] [-dl DELIMITER] [-dc DECIMALS]
[-hl] [-ha] [-ufo]
[gbkfiles [gbkfiles ...]]
optional arguments:
-h, --help show this help message and exit
Input options:
gbkfiles Gene cluster GenBank files
-r RANGES [RANGES ...], --ranges RANGES [RANGES ...]
Scaffold extraction ranges. If a range is specified,
only features within the range will be extracted from
the scaffold. Ranges should be formatted like:
scaffold:start-end (e.g. scaffold_1:15000-40000)
Alignment options:
-na, --no_align Do not align clusters
-i IDENTITY, --identity IDENTITY
Minimum alignment sequence identity [default: 0.3]
-j JOBS, --jobs JOBS Number of alignments to run in parallel (0 to use the
number of CPUs) [default: 0]
Output options:
-s SESSION, --session SESSION
Path to clinker session
-ji JSON_INDENT, --json_indent JSON_INDENT
Number of spaces to indent JSON [default: none]
-f, --force Overwrite previous output file
-o OUTPUT, --output OUTPUT
Save alignments to file
-p [PLOT], --plot [PLOT]
Plot cluster alignments using clustermap.js. If a path
is given, clinker will generate a portable HTML file
at that path. Otherwise, the plot will be served
dynamically using Python's HTTP server.
-dl DELIMITER, --delimiter DELIMITER
Character to delimit output by [default: human
readable]
-dc DECIMALS, --decimals DECIMALS
Number of decimal places in output [default: 2]
-hl, --hide_link_headers
Hide alignment column headers
-ha, --hide_aln_headers
Hide alignment cluster name headers
Visualisation options:
-ufo, --use_file_order
Display clusters in order of input files
Usage
MagCluster consists of three modules for batch processing of MGCs.
1. Genome annotation with Prokka
#Specify two genomes
magcluster prokka --evalue 1e-05 genome1.fasta genome2.fasta
#Or specify a folder of genomes
magcluster prokka --evalue 1e-05 ./MTB_genomes_folder
- --cpus Number of CPUs to use [0=all] (default '8')
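A Prokka output directory is created for each genome.
2. MGC screening with mgc_screen
This step searches the GenBank files for contigs/scaffolds that contain MGCs and reports them as MGC candidates. Contigs shorter than 2,000 bp (-l 2000) and contigs with fewer than 3 magnetosome genes within the screening window (-th 3) are discarded.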
magcluster mgc_screen --threshold 3 --contiglength 2000 --windowsize 10000 genome1_annotation/genome1.gbk genome2_annotation/genome2.gbk
#Or specify a folder of GenBank files
magcluster mgc_screen --threshold 3 --contiglength 2000 --windowsize 10000 ./gbkfiles_folder
- -th, --threshold The minimum number of magnetosome genes in a given contig and a given length of screening window (default '3')
- -l, --contiglength The minimum length of contigs to be considered (default '2,000 bp')
- -w, --windowsize The length of MGCs screening window (default '10,000 bp')
- -o, --outdir Output folder (default 'mgc_screen')
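The filtering rule behind these three options can be illustrated with a small sketch. This is not MagCluster's actual implementation: the function name `passes_mgc_screen`, the use of gene start coordinates, and the exact window boundary handling are assumptions made for illustration. It only shows how a minimum contig length, a window size, and a gene-count threshold could combine into a keep/discard decision.

```python
from typing import Sequence


def passes_mgc_screen(
    contig_length: int,
    magnetosome_gene_starts: Sequence[int],
    min_contig_length: int = 2000,   # mirrors -l / --contiglength
    window_size: int = 10000,        # mirrors -w / --windowsize
    threshold: int = 3,              # mirrors -th / --threshold
) -> bool:
    """Return True if a contig would be kept under the assumed rules."""
    # Rule 1: contigs shorter than the minimum length are discarded.
    if contig_length < min_contig_length:
        return False

    starts = sorted(magnetosome_gene_starts)
    left = 0
    for right, pos in enumerate(starts):
        # Move the left edge until the genes span less than window_size bp.
        while pos - starts[left] >= window_size:
            left += 1
        # Rule 2: keep the contig once any window holds enough genes.
        if right - left + 1 >= threshold:
            return True
    return False


# Three genes within 8 kb on a 50 kb contig: kept.
print(passes_mgc_screen(50_000, [1_000, 4_000, 9_000]))   # True
# Third gene lies far outside any 10 kb window: discarded.
print(passes_mgc_screen(50_000, [1_000, 4_000, 20_000]))  # False
```

Raising `window_size` or lowering `threshold` makes the screen more permissive, which may help with fragmented metagenome-assembled genomes at the cost of more false candidates.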
3. MGC visualization with clinker
magcluster clinker -p MGC_align.html mgc_screen/*.gbk
- -p, --plot Plot cluster alignments using clustermap.js. If a path is given, clinker will generate a portable HTML file at that path. Otherwise, the plot will be served dynamically using Python's HTTP server.
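The three steps can also be chained from Python. The following is a minimal sketch, not part of MagCluster: the helper names (`annotation_gbk`, `build_commands`, `run_pipeline`) are hypothetical, and the `<name>_annotation/<name>.gbk` layout is assumed from the example paths used in step 2 above. Only flags shown in the help output are used.

```python
import glob
import subprocess
from pathlib import Path
from typing import List, Sequence, Tuple


def annotation_gbk(genome: str) -> str:
    """Guess the GenBank path written by step 1 (assumed layout)."""
    stem = Path(genome).stem
    return f"{stem}_annotation/{stem}.gbk"


def build_commands(genomes: Sequence[str]) -> Tuple[List[str], List[str]]:
    """Build the argument lists for the prokka and mgc_screen steps."""
    prokka = ["magcluster", "prokka", "--evalue", "1e-05", *genomes]
    screen = [
        "magcluster", "mgc_screen",
        "--threshold", "3",
        "--contiglength", "2000",
        "--windowsize", "10000",
        *[annotation_gbk(g) for g in genomes],
    ]
    return prokka, screen


def run_pipeline(genomes: Sequence[str], plot: str = "MGC_align.html") -> None:
    """Run annotation, screening, and visualization in order."""
    prokka, screen = build_commands(genomes)
    subprocess.run(prokka, check=True)
    subprocess.run(screen, check=True)
    # No shell is involved, so the mgc_screen/*.gbk glob is expanded here.
    clusters = sorted(glob.glob("mgc_screen/*.gbk"))
    subprocess.run(["magcluster", "clinker", "-p", plot, *clusters], check=True)
```

Calling `run_pipeline(["genome1.fasta", "genome2.fasta"])` would run the same three commands shown in this section. `check=True` stops the pipeline as soon as one step exits with an error, so a failed annotation does not feed an empty input into the screening step.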
Citation
MagCluster: a Tool for Identification, Annotation, and Visualization of Magnetosome Gene Clusters
Runjia Ji, Wensi Zhang, Yongxin Pan, Wei Lin
Microbiol Resour Announc. 2022 Jan 20;11(1)
Related