UBCG2: a phylogenetic analysis pipeline using an up-to-date bacterial core gene set

2021/6/3: typos corrected

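Reconstructing phylogenetic trees has become a routine and essential task for elucidating the evolutionary relationships among bacterial species. The most widely used approach concatenates single-copy core genes that are universally present across the bacterial domain. The authors previously developed a bioinformatics pipeline called Up-to-date Bacterial Core Genes (UBCG), using core genes extracted from 1,429 species covering 28 phyla. In this study they present a revised bacterial core gene set, UBCG2, selected from a broader collection of genome sequences (3,508 species spanning 43 phyla). UBCG2 consists of 81 genes in nine COG (Clusters of Orthologous Groups of proteins) functional categories. The new gene set and the complete pipeline are available at http://leb.snu.ac.kr/ubcg2.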

(text omitted)

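To identify the core genes of the 3,508 bacterial species spanning 43 phyla, the presence ratio (PR) and single-copy ratio (SR) of each candidate gene were calculated with the hmmscan program using trusted cutoffs (Table 1 of the paper). The trusted cutoff is not a single fixed value; a different cutoff was chosen for each gene. For comparison, when a fixed cutoff of 10e-5 was applied to all genes in the HMM-based search, PR increased and SR decreased for most genes.
A core gene was defined as a gene whose PR and SR are both at least 95% under the trusted cutoff. This strict criterion yielded 81 bacterial core genes, 11 fewer than in the previous UBCG (92 genes; Na et al., 2018).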
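The selection rule above (PR and SR both at least 95%) can be illustrated with a small filter. This is only a sketch, not the authors' code: the tab-separated input layout (gene, PR in %, SR in %), the gene names, and the function name `select_core_genes` are all assumptions made for this example, and the numbers are made up.

```shell
# Hypothetical helper: reads "gene<TAB>PR<TAB>SR" lines on stdin and prints
# the genes whose PR and SR are both >= the threshold (default 95).
select_core_genes() {
  awk -F '\t' -v min="${1:-95}" '$2 >= min && $3 >= min { print $1 }'
}

# Made-up example values: only geneA passes both criteria.
printf 'geneA\t99.1\t98.7\ngeneB\t99.5\t80.2\ngeneC\t90.0\t99.0\n' | select_core_genes 95
```

With a fixed cutoff that inflates PR but lowers SR, a gene such as the hypothetical geneB above would be rejected by the SR criterion even though it is found in almost every genome.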

From the project website

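The UBCG2 pipeline extracts domain-specific core genes using the corresponding HMM profiles. The UBCG profile extracted from each genome is stored in a single file in .ucg format. The UBCGtree pipeline performs phylogenetic analysis using a set of .ucg files from different species. The UBCG2 pipeline automatically aligns each gene, concatenates and filters the alignments, calculates the GSI (gene support index), and builds a phylogenetic tree of the species.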

http://leb.snu.ac.kr/ubcg2

UBCG v2 Gene list

http://leb.snu.ac.kr/ubcg2/genes/

Installation

Download the archive containing the executables (.jar) from the project website.

Dependencies

Java RE 8+
Prodigal v2.6.x
HMMER 3.x
MAFFT v7.4x
RAxML v8.2.x
FastTree 2.1.x

mamba create -n UBCG2 python=3.8 -y
conda activate UBCG2
mamba install -c bioconda prodigal -y
mamba install -c bioconda hmmer -y
mamba install -c bioconda mafft -y
mamba install -c bioconda raxml -y
mamba install -c bioconda fasttree -y

# Install Java as well if it is missing
# For OpenJDK
conda install -c conda-forge openjdk -y
# For the JDK
conda install -c bioconda java-jdk -y

> java -jar UBCG2.jar -h

-------------------------------

UBCG ver2.0 [Feb, 2021]

-------------------------------

This is a part of pipeline for prokaryotic phylogenomics using core gene sequences.

The core gene sets are pre-defined using complete genome sequences.

The UBCG finds the core genes from a contig file or a CDS (coding sequences) file.

If you want more information, please visit www.leb.snu.ac.kr/ubcg2

The external programs that are used in the UBCG2 should be installed.

Paths of the programs must be written in the 'programPath' file.

Basic options

-h : show usage and options (--help)

ex) java -jar UBCG2.jar -i fasta/contigs.fasta -ucg_dir path -label e.coli -hmm hmm/ubcg_v2.hmm

Extracting the gene sequences, which are included in ubcg_v2.hmm profile, from a contig file and save as an ucg file in the designated path.

ex) java -jar UBCG2.jar -p cds_protein.fasta -ucg_dir path -label e.coli -hmm hmm/ubcg_v2.hmm

Extracting gene sequences from a protein CDS file. Only protein sequences are extracted.

ex) java -jar UBCG2.jar -p cds_protein.fasta -n cds_dna.fasta -ucg_dir path -label e.coli -hmm hmm/uacg.hmm

Extracting the gene sequences (gene set in uacg.hmm) from nucleotide/protein CDS files. In this case, the target genome is an archaea.

Mandatory

-i <String> : contig file (fasta format)

-p <String> : CDS protein file (fasta format; -i or this option must be entered)

-n <String> : CDS nucleotide file (this is optional)

-ucg_dir <String> : directory to save a ucg file

-label <String> : label of the genome sequence

-hmm <String> : profile hmm for the core gene set (ubcg.hmm for bacteria, uacg.hmm for archaea)

Optional

-g <Integer>: translation table for translation (-g parameter in Prodigal)

use this option when you use other genetic code

most bacterial species use the 11 table

(Default : 11, the bacterial and archaeal code)

-t <Integer>: use multi-threads

(Default : 1)

Metadata of genome (optional)

-taxon_name <String> : name of species

-strain_name<String> : name of strain

-type : add this option if the strain is a type strain of species or subspecies

-acc <String> : accession of genome sequence. NCBI assembly accession is usually used for public data

-uid <Integer>: unique id. If it is not designated, automatically generated

-taxonomy <String> : taxonomy of the species

-targ_taxon <String> : target taxon

The program's root directory contains a "programPath" file that stores the locations of the external software tools.

> cat programPath

prodigal=prodigal

hmmsearch=hmmsearch

mafft=mafft

fasttree=FastTree

raxml=raxmlHPC-PTHREADS

If you get an error message saying that programPath cannot be found, copy the programPath file to the path named in the message.
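Before running the pipeline it can help to confirm that every tool named in programPath actually resolves on this machine. The function below is a minimal sketch under the assumption that the file holds one `name=command` pair per line, as in the listing above; `check_program_path` is a name made up for this example and is not part of UBCG2.

```shell
# Hypothetical helper: report whether each command listed in a programPath-style
# file can be found (on PATH or as an executable path). Returns 1 if any is missing.
# Usage: check_program_path programPath
check_program_path() {
  file=${1:-programPath}
  missing=0
  while IFS='=' read -r name cmd || [ -n "$name" ]; do
    [ -n "$name" ] || continue            # skip blank lines
    if command -v "$cmd" >/dev/null 2>&1; then
      echo "ok       $name -> $cmd"
    else
      echo "MISSING  $name -> $cmd"
      missing=1
    fi
  done < "$file"
  return "$missing"
}
```

If a tool is reported as missing (for example because the RAxML binary has a different name in your environment), edit the right-hand side of that line in programPath to the command name or full path that exists.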

Usage

1. Extract the gene sequences from a genome and save them as a ucg file.

#genome
java -jar UBCG2.jar -i contigs.fasta -ucg_dir outdir -label test -hmm hmm/ubcg_v2.hmm

#protein
java -jar UBCG2.jar -p cds_protein.fasta -ucg_dir outdir -label test -hmm hmm/ubcg_v2.hmm

-i contig file (fasta format)
-p CDS protein file (fasta format; -i or this option must be entered)

Run the command above once for each genome (or proteome).
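Running the extraction once per genome is easy to script. The sketch below only prints the commands (a dry run). It assumes the contig files sit in a directory called `genomes/` with a `.fasta` extension, that the file name without its extension is a suitable label, and that file names contain no whitespace; the directory name and the function `ubcg2_extract_cmd` are made up for this example.

```shell
# Hypothetical helper: print the UBCG2 extraction command for one contig file.
# $1: contig FASTA, $2: directory for the .ucg files
ubcg2_extract_cmd() {
  label=$(basename "$1" .fasta)           # label = file name without .fasta
  printf 'java -jar UBCG2.jar -i %s -ucg_dir %s -label %s -hmm hmm/ubcg_v2.hmm\n' \
    "$1" "$2" "$label"
}

# Dry run: print one command per genome found in genomes/.
for fasta in genomes/*.fasta; do
  [ -e "$fasta" ] || continue             # glob matched nothing
  ubcg2_extract_cmd "$fasta" outdir
done
```

Once the printed commands look right, the same loop can be piped to `sh` to execute them.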

Output (six genomes in this example)

[Figure: screenshot of the output]

2. Build the UBCG tree from the ucg files

java -jar UBCGtree.jar align -ucg_dir outdir -run_id test1 -leaf uid,label

Output

[Figure: screenshot of the output]

output/test1/

[Figure: screenshot of output/test1/]

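The output contains the MSA file for each core gene and the concatenated MSA file, the tree file for each core gene, and the tree file built from the concatenated MSA.

For details of each step, see the paper and the project website.

References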

UBCG2: Up-to-date bacterial core genes and pipeline for phylogenomic analysis

Jihyeon Kim, Seong-In Na, Dongwook Kim & Jongsik Chun
Journal of Microbiology volume 59, pages 609–615 (2021)

UBCG: Up-to-date bacterial core gene set and pipeline for phylogenomic tree reconstruction

Seong-In Na, Yeong Ouk Kim, Seok-Hwan Yoon, Sung-Min Ha, Inwoo Baek, Jongsik Chun
J Microbiol. 2018 Apr;56(4):280-285
