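UFCG pipeline: automated phylogenetic analysis of fungi using ITS and core protein-coding genes

The UFCG pipeline automates phylogenetic analysis of fungi based on ITS and core proteins. This post briefly walks through how to use it.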

UFCG is a database&pipeline for fungi phylogenomics. Our db contains 61 marker genes, 20 widely used & 41 novel core-genes derived from 1.5k genomes. The pipeline automatically build a trees from DNA, RNA or AA sequence inputs.
📄https://t.co/zrZy0DiAM3
🌐https://t.co/IQGeSGzkQj pic.twitter.com/avCv4TzaoH
— Martin Steinegger 🇺🇦 (@thesteinegger) August 17, 2022

Manual

https://ufcg.steineggerlab.com/ufcg/manual

Installation

Dependencies

Java Runtime Environment, version 8 or later

GitHub

Installed on Ubuntu 18.

mamba create -n ufcg -c bioconda -c conda-forge openjdk=8 augustus mmseqs2 mafft iqtree
conda activate ufcg
wget -O UFCG.zip https://github.com/endixk/ufcg/releases/latest/download/UFCG.zip
unzip UFCG.zip && cd UFCG
java -jar UFCG.jar
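
Before the first run it can help to confirm that the tools installed above are actually on PATH. The following is a minimal sketch and not part of UFCG: `missing_tools` is a hypothetical helper, and the executable names in the example call (`mmseqs` for the mmseqs2 package, `iqtree` for the iqtree package) are assumptions about what the conda packages install.

```shell
# Hypothetical helper (not part of UFCG): report which commands are not on PATH.
# Prints each missing name and returns 1 if any is missing, 0 otherwise.
missing_tools() {
    mt_status=0
    for mt_tool in "$@"; do
        if ! command -v "$mt_tool" >/dev/null 2>&1; then
            echo "missing: $mt_tool"
            mt_status=1
        fi
    done
    return "$mt_status"
}

# Example call inside the ufcg environment (executable names are assumptions):
# missing_tools java augustus mmseqs mafft iqtree
```

If the function prints nothing, every listed command was found.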

# Docker
docker pull endix1029/ufcg:latest
docker run -it endix1029/ufcg:latest
cd UFCG
java -jar UFCG.jar

$ java -jar UFCG.jar

   __  __ _____ _____ _____
  / / / // ___// ___// ___/
 / / / // /_  / /   / / __
/ /_/ // __/ / /___/ /_/ /
\____//_/    \____/\____/  v1.0

USAGE : java -jar UFCG.jar <module> [...]

Available Modules
 Module         Description
 profile        Extract UFCG profile from genome
 profile-rna    Extract UFCG profile from RNA-seq transcriptome
 profile-pro    Extract UFCG profile from proteome
 train          Train and generate sequence model
 align          Produce sequence alignments from UFCG profiles
 tree           Build maximum likelihood tree with UFCG profiles
 prune          Rebuild UFCG tree or single gene trees

Miscellaneous
 Argument       Description
 --info         Print program information
 --core         Print core gene list

General options
 Argument       Description
 -h, --help     Print this manual
 -v, --verbose  Make program verbose
 --nocolor      Remove ANSI escapes from standard output
 --notime       Remove timestamp in front of the prompt string
 --developer    Activate developer mode (For testing or debugging)

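Usage

To use the UFCG pipeline you need FASTA files of the genomes you want to place in the phylogeny and, optionally, a metadata file. The metadata file can hold seven entries that describe the taxonomic label of each genome.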

Simple version: meta_simple.tsv

The simple version lists filename, label, and accession.
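
For the simple version, a starting file can be generated from the genome directory and then edited by hand. This is only a sketch under stated assumptions: `make_simple_meta` is a hypothetical helper, the file is assumed to be tab-separated with a header row in the order filename, label, accession, and the label and accession values it writes are placeholders taken from the file name.

```shell
# Hypothetical helper (not part of UFCG): print a meta_simple.tsv skeleton
# with one row per regular file in a genome directory.
# Assumptions: tab-separated columns filename, label, accession with a header
# row; label and accession are placeholders to be edited by hand.
make_simple_meta() {
    printf 'filename\tlabel\taccession\n'
    for msm_path in "$1"/*; do
        [ -f "$msm_path" ] || continue
        msm_file=${msm_path##*/}    # strip the directory part
        msm_stem=${msm_file%.*}     # strip the last extension
        printf '%s\t%s\t%s\n' "$msm_file" "$msm_stem" "$msm_stem"
    done
}

# Example: make_simple_meta seq > meta_simple.tsv
```

Writing the skeleton from the directory listing keeps the filename column in step with the files that are actually present.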

Full version: meta_full.tsv

The full version additionally lists fields such as taxon_name; see the manual for the full list.

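Place the genome FASTA files in a directory that contains only genome assembly files, and use a single file extension throughout (.fa, .fna, .fasta, ...). The list of files must match the filename entries in the first column of the metadata. Note that an error is raised if a genome is listed in the metadata but its file is missing, or if a file is present but not listed in the metadata.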
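
Because any mismatch between the metadata and the directory stops the run, it may be worth comparing the two beforehand. The sketch below is not part of UFCG: `check_meta` is a hypothetical helper, and it assumes the metadata is a TSV with one header row and that file names contain no tabs.

```shell
# Hypothetical helper (not part of UFCG): compare the first metadata column
# with the entries present in the genome directory.
# Assumptions: the TSV has one header row; file names contain no tabs.
# Returns 0 when both lists match, 1 otherwise.
check_meta() {
    cm_tmp=$(mktemp -d) || return 2
    # File names from column 1, skipping the header row and empty lines.
    tail -n +2 "$1" | cut -f1 | sed '/^$/d' | sort -u > "$cm_tmp/meta"
    # Entries actually present in the genome directory.
    ls -1 "$2" | sort -u > "$cm_tmp/dir"
    cm_only_meta=$(comm -23 "$cm_tmp/meta" "$cm_tmp/dir")
    cm_only_dir=$(comm -13 "$cm_tmp/meta" "$cm_tmp/dir")
    rm -rf "$cm_tmp"
    if [ -n "$cm_only_meta" ]; then
        printf 'in metadata only:\n%s\n' "$cm_only_meta"
    fi
    if [ -n "$cm_only_dir" ]; then
        printf 'in directory only:\n%s\n' "$cm_only_dir"
    fi
    [ -z "$cm_only_meta" ] && [ -z "$cm_only_dir" ]
}

# Example: check_meta meta_simple.tsv seq
```

The function prints nothing when the two lists agree, and otherwise names the entries found on only one side.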

1. Run interactively with the profile command.

 java -jar UFCG.jar profile -u

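The run failed with an error while GCA_000697725.1.fna, one of the files in seq, was included; after removing that file the run completed. At the end, one .ucg file per genome is written to the specified output directory.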
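
Since one .ucg file per genome is expected, a quick count can show whether any genome was skipped. This is a rough sketch, not part of UFCG: `profiles_complete` is a hypothetical helper that only compares counts, because the naming of the .ucg files is not assumed here.

```shell
# Hypothetical helper (not part of UFCG): check that the output directory
# holds as many .ucg files as there are genome files in the input directory.
# Hidden files are ignored; only the counts are compared, not the names.
profiles_complete() {
    pc_genomes=$(find "$1" -maxdepth 1 -type f ! -name '.*' | wc -l | tr -d ' ')
    pc_profiles=$(find "$2" -maxdepth 1 -type f -name '*.ucg' | wc -l | tr -d ' ')
    echo "genomes: $pc_genomes, profiles: $pc_profiles"
    [ "$pc_genomes" -eq "$pc_profiles" ]
}

# Example: profiles_complete seq profile_results
```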

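2. Use the tree command for phylogenetic inference. Specify the output directory from step 1 and the metadata column to use for the leaf names in the output tree file.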

java -jar UFCG.jar tree -i profile_results/ -l label

-i Locate the path of the input .ucg profiles to align and infer tree
-l Name the leaves of the phylogenetic tree from the metadata

The following metadata columns are supported (from the manual). The example above uses label.

uid : Include unique integer ID
acc : Include accession number
label : Include full label
taxon : Include taxon name
strain : Include strain name
taxonomy : Include taxonomic relationship
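
To avoid a typo in the leaf option, the value can be checked against the list above before the tree command is launched. The sketch below is not part of UFCG: `valid_leaf` and `tree_command` are hypothetical helpers, and they use only the -i and -l flags shown in this post.

```shell
# Hypothetical helper (not part of UFCG): accept only the leaf options
# listed in this post.
valid_leaf() {
    case "$1" in
        uid|acc|label|taxon|strain|taxonomy) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the tree command for a profile directory ($1) and leaf option ($2).
# Nothing is executed here; run the printed line to build the tree.
tree_command() {
    if ! valid_leaf "$2"; then
        echo "unsupported leaf option: $2" >&2
        return 1
    fi
    echo "java -jar UFCG.jar tree -i $1 -l $2"
}

# Example: tree_command profile_results/ taxon
```

Printing the command instead of running it keeps the sketch usable for a dry run on a machine without Java.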

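Example output

The sequences included in seq/ of this repository can be used as phylogenetic marker protein sequences. These are the core proteins defined by UFCG.

In addition, the UFCG website offers downloads of the protein MSAs and HMM profiles of the core genes from more than 1,000 fungi with valid taxonomy (the important point is that you do not have to collect them by hand). By adding the protein sequences of your own genome to the downloaded MSAs, you can quickly set up a phylogenetic analysis spanning more than 1,000 fungi, although the computation itself takes longer. This approach is useful when you do not know the lineage of a fungus whose genome you have sequenced. The UFCG pipeline introduced here, on the other hand, is useful when you already have a list of genomes at hand.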

Reference

UFCG: database of universal fungal core genes and pipeline for genome-wide phylogenetic analysis of fungi
Dongwook Kim, Cameron L.M. Gilchrist, Jongsik Chun, Martin Steinegger

bioRxiv, Posted August 17, 2022.