MicrobeAnnotator: comprehensive annotation of microbial genomes

2020 9/5 Revised

2020 9/7 Fixed typos, added output

2023/07/04 Added paper citation

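High-throughput sequencing has increased the number of microbial genomes available from isolates, single cells, and metagenomes. Analyzing and comparing these genomes requires fast and comprehensive annotation pipelines. Several approaches to genome annotation exist, but they are typically not designed for easy incorporation into analysis pipelines, cannot combine results from multiple annotation databases, and do not provide easy-to-use summaries of metabolic reconstruction in a high-throughput mode. Here, MicrobeAnnotator, a fully automated pipeline for the comprehensive annotation of microbial genomes, is presented. MicrobeAnnotator is implemented in Python and is freely available under the open-source Artistic Licence 2.0 from https://github.com/cruizperez/MicrobeAnnotator.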

Installation

Following the authors' instructions, I created a conda virtual environment and tested the tool there (OS: Ubuntu 18.04 LTS).

Dependencies

Programs:

Aspera Connect
KofamScan
HMMER >= 3.1
Ruby >= 2.5
GNU Parallel
One of:
Blast >= 2.2
Diamond >= 0.9
Sword >= 1.0.4

Python Modules:

matplotlib
seaborn >= 0.10.1
pandas
argparse
pathlib
shutil
subprocess
gzip
biopython
sqlite3
urllib
pywget

Github

# Install the dependencies
mamba create -n microbeannotator -c conda-forge -c bioconda python=3.7 blast hmmer ruby=2.5.1 parallel diamond sword seaborn biopython pywget -y
mamba activate microbeannotator

# If you also want to install kofamscan through conda
mamba install -c bioconda -y kofamscan

Aspera Connect and KofamScan are also required.

Aspera Connect

# Installed into /home/<user>/.aspera/connect/bin
wget https://download.asperasoft.com/download/sw/connect/3.9.8/ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.tar.gz
tar xvfz ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.tar.gz
bash ibm-aspera-connect-3.9.8.176272-linux-g2.12-64.sh
# Add it to PATH as well if needed. For example, to write /home/kazu/.aspera/connect/bin into .bashrc:
#echo 'export PATH=$PATH:/home/kazu/.aspera/connect/bin' >> ~/.bashrc && source ~/.bashrc

KofamScan (6 GB)

mkdir kofamscan 
cd kofamscan 
wget ftp://ftp.genome.jp/pub/db/kofam/ko_list.gz 
wget ftp://ftp.genome.jp/pub/db/kofam/profiles.tar.gz 
wget ftp://ftp.genome.jp/pub/tools/kofam_scan/kofam_scan-1.3.0.tar.gz 

#Decompress and untar: 
gunzip ko_list.gz
tar xvfz profiles.tar.gz
tar xvfz kofam_scan-1.3.0.tar.gz
cd kofam_scan-1.3.0
cp config-template.yml config.yml
export PATH=$PWD:$PATH
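
Before moving on, it can help to confirm that the required executables are actually reachable. The following is a minimal Python sketch, not part of MicrobeAnnotator. The executable names (ascp for Aspera Connect, exec_annotation for KofamScan, hmmsearch for HMMER, and diamond as the chosen search tool) are assumptions based on the dependency list above; adjust them if you use blast or sword instead.

```python
import shutil

# Assumed executable names for the dependencies listed above.
REQUIRED_TOOLS = ["ascp", "exec_annotation", "hmmsearch", "ruby", "parallel", "diamond"]


def find_missing(tools, which=shutil.which):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_TOOLS)
    if missing:
        print("Not on PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

Anything reported as missing either needs to be added to PATH or passed explicitly later through options such as --kofam_bin or --method_bin.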

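In the config file config.yml, change the path in "# profile: /path/to/your/profile/db" to your own database path. => /home/<user>/kofamscan/profiles/prokaryote

Likewise, change the path in "ko_list: /path/to/your/kolist/file" to your own database path. => ko_list: /home/<user>/kofamscan/ko_list

I placed the files under /home/kazu/Document/, so I edited the file as follows. These lines are commented out by default, so uncomment them.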

[Screenshot: config.yml after editing]

The screenshot above also sets the paths to hmmsearch and parallel, but this is unnecessary if they are already on PATH.
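
The same edit to config.yml can be scripted. This is a minimal sketch under the assumption that the template contains one "key: value" line per setting, possibly prefixed with "#"; the helper name set_config_value is hypothetical and only plain text replacement is done (no YAML library).

```python
import re


def set_config_value(config_text, key, value):
    """Set `key: value`, replacing a commented-out or existing entry if present."""
    pattern = re.compile(rf"^#?[ \t]*{re.escape(key)}[ \t]*:.*$", re.MULTILINE)
    new_line = f"{key}: {value}"
    if pattern.search(config_text):
        # Replace only the first matching line; a function avoids escape handling.
        return pattern.sub(lambda match: new_line, config_text, count=1)
    # Key not found: append it at the end of the file.
    return config_text.rstrip("\n") + "\n" + new_line + "\n"


if __name__ == "__main__":
    template = (
        "# profile: /path/to/your/profile/db\n"
        "# ko_list: /path/to/your/kolist/file\n"
    )
    text = set_config_value(template, "profile", "/home/<user>/kofamscan/profiles/prokaryote")
    text = set_config_value(text, "ko_list", "/home/<user>/kofamscan/ko_list")
    print(text)
```

Replace the <user> placeholder with your own paths before writing the result back to config.yml.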

Test kofamscan. The command is exec_annotation, so:

./exec_annotation -h

Usage: exec_annotation [options] <query>
  <query>                    FASTA formatted query sequence file
  -o <file>                  File to output the result [stdout]
  -p, --profile <path>       Profile HMM database
  -k, --ko-list <file>       KO information file
  --cpu <num>                Number of CPU to use [1]
  -c, --config <file>        Config file
  --tmp-dir <dir>            Temporary directory [./tmp]
  -E, --e-value <e_value>    Largest E-value required of the hits
  -T, --threshold-scale <scale>
                             The score thresholds will be multiplied by this value
  -f, --format <format>      Format of the output [detail]
      detail:          Detail for each hits (including hits below threshold)
      detail-tsv:      Tab separeted values for detail format
      mapper:          KEGG Mapper compatible format
      mapper-one-line: Similar to mapper, but all hit KOs are listed in one line
  --[no-]report-unannotated  Sequence name will be shown even if no KOs are assigned
                             Default is true when format=mapper or mapper-all,
                             false when format=detail
  --create-alignment         Create domain annotation files for each sequence
                             They will be located in the tmp directory
                             Incompatible with -r
  -r, --reannotate           Skip hmmsearch
                             Incompatible with --create-alignment
  --keep-tabular             Neither create tabular.txt nor delete K number files
                             By default, all K number files will be combined into
                             a tabular.txt and delete them
  --keep-output              Neither create output.txt nor delete K number files
                             By default, all K number files will be combined into
                             a output.txt and delete them
                             Must be with --create-alignment
  -h, --help                 Show this message and exit

Now run it for real. Run kofamscan from the extracted KofamScan directory, specifying prokaryote.hal as the profile.

./exec_annotation -o output --cpu 20 -p ../profiles/prokaryote.hal -k ../ko_list your_genome.fa

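If config.yml is set up, the profile does not need to be specified. Run with all CPUs (note that, per the help above, --cpu defaults to 1, so set the CPU count explicitly).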
exec_annotation -o output your_genome.fa
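
If the run is made with -f mapper, the result is easy to post-process. The sketch below assumes the mapper layout is one sequence per line, with the sequence name followed by a tab and a KO identifier, and with unannotated sequences shown as a name only; that layout is an assumption here, so check it against your own output file. The function name parse_mapper is hypothetical.

```python
def parse_mapper(text):
    """Map each sequence name to the list of KO identifiers assigned to it."""
    assignments = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, *kos = line.split("\t")
        # A sequence may appear on several lines, one per assigned KO.
        assignments.setdefault(name, []).extend(ko for ko in kos if ko)
    return assignments


if __name__ == "__main__":
    example = "gene1\tK00001\ngene2\ngene1\tK00002\n"
    for name, kos in parse_mapper(example).items():
        print(name, kos if kos else "unannotated")
```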

Everything is ready. Finally, fetch MicrobeAnnotator itself.

git clone https://github.com/cruizperez/MicrobeAnnotator.git
cd MicrobeAnnotator/

$ ./MicrobeAnnotator_DB_Builder -h
usage: MicrobeAnnotator_DB_Builder [-h] -d DIRECTORY -m METHOD [-t THREADS]
                                   [--bin_path BIN_PATH] [--step STEP]
                                   [--light]

This script build the search databases required by MicrobeAnnotator
Usage: ./MicrobeAnnotator_DB_Builder -f [output_file folder]
Global mandatory parameters: -f [output_file folder]
Optional Database Parameters: See ./MicrobeAnnotator_DB_Builder -h

optional arguments:
  -h, --help            show this help message and exit
  -d DIRECTORY, --dir DIRECTORY
                        Directory where database will be created.
  -m METHOD, --method METHOD
                        Search (and db creation) method, one of blast, diamond or sword
  -t THREADS, --threads THREADS
                        Threads to use (when possible). By default 1.
  --bin_path BIN_PATH   Path to binary folder for selected method. By defaul assumes the program is in path.
  --step STEP           Step to start with. 1.Download data, 2.Parse annotation data, 3.Building SQLite DB, 4.Build protein DBs. Default 1.
  --light               Use only KOfamscan and swissprot databases. By default also builds refseq and trembl.

$ ./MicrobeAnnotator -h
usage: MicrobeAnnotator [-h] [-i INPUT_LIST [INPUT_LIST ...]] [-l FILE_LIST]
                        -o OUTPUT_DIR -d DATABASE_FOLDER -m METHOD
                        [--kofam_bin KOFAM_BIN] [--method_bin METHOD_BIN]
                        [--id_perc ID_PERC] [--bitscore BITSCORE]
                        [--evalue EVALUE] [--aln_percent ALN_PERCENT]
                        [--cluster CLUSTER] [--filename PLOT_FILENAME]
                        [-t THREADS] [-p PROCESSES] [--light] [--full]

MicrobeAnnotator parses protein fasta files and annotates them
using several databases in an iterative fashion and summarizes the findings
using KEGG modules based on KO numbers associated with best database matches.
Usage: ./MicrobeAnnotator -i [protein file] -o [output folder] -d [MicrobeAnnotator db folder]
-m [search method]
Global mandatory parameters: -i [protein file] -o [output folder] -d [MicrobeAnnotator db folder]
-m [search method]
Optional Database Parameters: See ./MicrobeAnnotator -h

optional arguments:
  -h, --help            show this help message and exit

Mandatory i/o options.:
  -i INPUT_LIST [INPUT_LIST ...], --input INPUT_LIST [INPUT_LIST ...]
                        Space-separated list of protein files to parse. Use -i OR -l.
  -l FILE_LIST, --list FILE_LIST
                        File with list of inputs. Use -i OR -l.
  -o OUTPUT_DIR, --outdir OUTPUT_DIR
                        Directory to store results.
  -d DATABASE_FOLDER, --database DATABASE_FOLDER
                        Directory where MicrobeAnnotator databases are located.

Options for search process.:
  -m METHOD, --method METHOD
                        Method used to create databases and to perform seaches. One of "blast", "diamond" or "sword".
  --kofam_bin KOFAM_BIN
                        Directory where KOFamscan binaries are located. By default assumes it is in PATH.
  --method_bin METHOD_BIN
                        Directory where KOFamscan binaries are located. By default assumes it is in PATH.
  --id_perc ID_PERC     Minimum identity percentage to retain a hit. By default 40.
  --bitscore BITSCORE   Minimum bitscore to retain a hit. By default 80.
  --evalue EVALUE       Maximum evalue to retain a hit. By default 0.01.
  --aln_percent ALN_PERCENT
                        Minimum percentage of query covered by hit alignment. By default 70.

Summary abnd plotting options.:
  --cluster CLUSTER     Cluster genomes and/or modules. Select "cols" for genomes, "rows" for modules, or "both".
                        By default, no clustering
  --filename PLOT_FILENAME
                        Prefix for output summary tables and plots. By default "metabolic_summary"

Miscellaneous options.:
  -t THREADS, --threads THREADS
                        Threads to use per processed file, i.e. (per protein file). By default 1.
  -p PROCESSES, --processes PROCESSES
                        Number of processes to launch, i.e. number of protein files to process simultaneously.
                        Note this is different from threads. For more information see the README. By default 1.
  --light               Use only KOfamscan and swissprot databases. By default also uses refseq and
                        trembl (only use if you built both using "MicrobeAnnotator_DB_Builder").
  --full                Do not perform the iterative annotation but search all proteins against all databases
                        (Increases computation time).
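
To make the four search thresholds in the help text concrete, here is a small sketch of how a hit might be checked against the stated defaults (identity 40, bitscore 80, e-value 0.01, query coverage 70). This is an illustration only, not MicrobeAnnotator's actual code; the Hit class, its field names, and the choice to treat the thresholds as inclusive are assumptions.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    identity: float     # percent identity of the alignment
    bitscore: float
    evalue: float
    query_cover: float  # percent of the query covered by the alignment


def passes_filters(hit, id_perc=40.0, bitscore=80.0, evalue=0.01, aln_percent=70.0):
    """Return True if the hit meets all four thresholds (treated as inclusive here)."""
    return (
        hit.identity >= id_perc
        and hit.bitscore >= bitscore
        and hit.evalue <= evalue
        and hit.query_cover >= aln_percent
    )


if __name__ == "__main__":
    print(passes_filters(Hit(identity=45.0, bitscore=120.0, evalue=1e-10, query_cover=85.0)))
    print(passes_filters(Hit(identity=35.0, bitscore=120.0, evalue=1e-10, query_cover=85.0)))
```

A hit must satisfy every condition at once, so a high bitscore does not rescue an alignment that covers too little of the query.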

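Preparing the database

Build the database. There is a full version (requires about 230 GB of space) and a light version. The light version depends on fewer databases (run the builder with the --light flag). blast, diamond, and sword are supported; here I build a diamond database, using 40 threads.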

MicrobeAnnotator_DB_Builder -d MicrobeAnnotator_DB -m diamond -t 40 --light

[Screenshot: database build]

The light-version database is ready. I also tried the full version several times, but the connection dropped partway through the database download.
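
Because the full build needs roughly 230 GB, it is worth checking free disk space before starting a long download. A minimal sketch, assuming 1 GB is counted as 10**9 bytes and that the database will be created on the filesystem holding the current directory; the helper name has_room is hypothetical.

```python
import shutil

REQUIRED_GB_FULL = 230  # approximate figure quoted above for the full database


def has_room(free_bytes, required_gb=REQUIRED_GB_FULL):
    """Return True if free_bytes covers required_gb (1 GB taken as 10**9 bytes)."""
    return free_bytes >= required_gb * 10**9


if __name__ == "__main__":
    target_dir = "."  # assumption: MicrobeAnnotator_DB will be created here
    free = shutil.disk_usage(target_dir).free
    print(f"Free: {free / 10**9:.0f} GB; enough for the full build: {has_room(free)}")
```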

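How to run

Specify the input FASTA files and the database (note that the help text above describes the inputs as protein FASTA files). If there are multiple genomes, list them separated by spaces or use a wildcard.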

MicrobeAnnotator -i genome1.fa genome2.fa genome3.fa -d MicrobeAnnotator_DB -o output_dir -m diamond -p 3 -t 10

-p refers to the number of protein files to be processed simultaneously, e.g., -p 3 will process three protein files at the same time.
-t refers to the number of processors to use per protein file. For example, -t 5 will use five processors for each protein file.
-m [blast, diamond, sword] the search method you intend to use.
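
Since -p and -t multiply, the total number of cores in use is about p × t. The sketch below shows one simple way to choose both values for a given machine; it is a heuristic of my own choosing (one process per file where possible, remaining cores split evenly), not a recommendation from the MicrobeAnnotator authors.

```python
def plan_parallelism(n_files, total_cores):
    """Pick -p (files at once) and -t (threads per file) so that p * t <= total_cores."""
    if n_files < 1 or total_cores < 1:
        raise ValueError("n_files and total_cores must both be at least 1")
    processes = min(n_files, total_cores)  # never launch more processes than files or cores
    threads = total_cores // processes     # split the cores evenly across processes
    return processes, threads


if __name__ == "__main__":
    p, t = plan_parallelism(n_files=3, total_cores=30)
    print(f"-p {p} -t {t}")
```

With three input files and 30 cores this gives -p 3 -t 10, the same setting used in the command above.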

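The run fails with an error at the hmmsearch step.

=> The cause was that the genome I used had a two-line layout: a header line followed by the whole DNA sequence on a single line. After wrapping the sequence lines at a suitable width, the run worked.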
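
The re-wrapping can be done with any FASTA tool. As one option, here is a minimal Python sketch using only the standard library. The width of 60 characters is an assumption (the text above only says a suitable width), and the function name wrap_fasta is hypothetical.

```python
def wrap_fasta(text, width=60):
    """Re-wrap every sequence in FASTA text so that no sequence line exceeds width."""
    out = []
    seq = []

    def flush():
        # Emit the sequence collected so far in chunks of `width` characters.
        joined = "".join(seq)
        for start in range(0, len(joined), width):
            out.append(joined[start:start + width])
        seq.clear()

    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            flush()
            out.append(line)
        else:
            seq.append(line)
    flush()
    return "\n".join(out) + "\n" if out else ""


if __name__ == "__main__":
    print(wrap_fasta(">seq1\nACGTACGTAC\n", width=4), end="")
```

To apply it to a file, read the file contents, pass them through wrap_fasta, and write the result to a new file rather than overwriting the original.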

Output (--light)

[Screenshot] output_dir/annotation_results/genome.fasta.annotations

[Screenshot]

The computation took a full two days. There may be a step somewhere that spins without making progress.

Citation

MicrobeAnnotator: a user-friendly, comprehensive microbial genome annotation pipeline

Carlos A. Ruiz-Perez, Roth E. Conrad, Konstantinos T. Konstantinidis

bioRxiv, Posted July 21, 2020

Addendum

MicrobeAnnotator: a user-friendly, comprehensive functional annotation pipeline for microbial genomes
Carlos A. Ruiz-Perez, Roth E. Conrad & Konstantinos T. Konstantinidis
BMC Bioinformatics volume 22, Article number: 11 (2021)