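Converting RefSeq accession IDs to full taxonomy with PYlogeny

PYlogeny is a simple tool, built around ETE3 and BioPython's Eutils, that converts accession numbers into taxonomy IDs and their associated lineage information. It currently supports RefSeq accession IDs.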
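To make "taxonomy ID and associated lineage" concrete, here is a minimal sketch of how a taxonomy ID can be expanded into a ranked lineage with ETE3's NCBITaxa class. It is an illustration, not PYlogeny's actual code: the function name full_lineage and the demo function are hypothetical, and the demo assumes ete3 is installed and a local taxonomy database is available.

```python
def full_lineage(ncbi, taxid):
    """Return [(rank, name), ...] ordered from the root down to taxid.

    `ncbi` is expected to behave like ete3.NCBITaxa.
    """
    lineage = ncbi.get_lineage(taxid)           # ordered list of ancestor taxids
    ranks = ncbi.get_rank(lineage)              # {taxid: rank}
    names = ncbi.get_taxid_translator(lineage)  # {taxid: scientific name}
    return [(ranks[t], names[t]) for t in lineage]


def demo():
    """Print the lineage of Escherichia coli (taxid 562). Not run automatically."""
    from ete3 import NCBITaxa  # builds the local database on first use

    ncbi = NCBITaxa()
    for rank, name in full_lineage(ncbi, 562):
        print(rank, name, sep="\t")
```

Passing the NCBITaxa object in as an argument keeps the lookup logic separate from the database setup, which can take several minutes the first time.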

Installation

conda create -n PYlogeny python=3.6 -y
conda activate PYlogeny
pip install biopython ete3 six

#PYlogeny
git clone https://github.com/jrjhealey/PYlogeny.git
cd PYlogeny/

$ python PYlogeny.py -h

usage: PYlogeny.py [-h] [-i INFILE] [-o OUTFILE] [-d DATABASE] -e EMAIL
                   [--version] [-v] [-s SQL] [-u]

Create a taxonomic breakdown for a list of accession numbers.

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        A one-per-line file of accession numbers.
  -o OUTFILE, --outfile OUTFILE
                        Output tabular file (default STDOUT).
  -d DATABASE, --database DATABASE
                        What database to search for the accessions in (if you know it),
                        the script will attempt to use their format to guess otherwise.
  -e EMAIL, --email EMAIL
                        Email to use with Eutils/Entrez.
  --version             show program's version number and exit
  -v, --verbose         Increase verbosity/logging. (-v or -vv)
  -s SQL, --sql SQL     Location to store the ETE3 database. Default is in ~/.etetoolkit/ .
                        If you specify a different location to the last instance, a
                        new copy of the database will have to be downloaded regardless.
  -u, --update          Update the local copy of the TaxID database.
                        (False by default, but should be done on a frequent basis).

Given an input list of accession numbers, create a table describing the taxonomic
memberships of those accession numbers.

The first time you run this program, and any time -u|--update is used, the
taxon dump will be made and an SQL database created. This takes several minutes.
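The help text says that, without -d, the script tries to guess the database from the accession format. How PYlogeny does this is not shown here, so the following is only a sketch of one plausible approach: the two-letter RefSeq prefix is mapped to an Entrez database name. The function name guess_database and the grouping of prefixes are assumptions made for illustration.

```python
import re

# RefSeq two-letter prefixes grouped by molecule type. Mapping them onto
# Entrez database names is this sketch's assumption, not PYlogeny's logic.
PROTEIN_PREFIXES = {"AP", "NP", "WP", "XP", "YP"}
NUCLEOTIDE_PREFIXES = {"AC", "NC", "NG", "NM", "NR", "NT", "NW", "NZ", "XM", "XR"}

# Two capital letters, an underscore, an alphanumeric body, optional ".version".
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_[A-Z0-9]+(\.\d+)?$")


def guess_database(accession):
    """Return "protein", "nucleotide", or None if the format is not recognised."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix in PROTEIN_PREFIXES:
        return "protein"
    if prefix in NUCLEOTIDE_PREFIXES:
        return "nucleotide"
    return None
```

Under this sketch, a WP_ accession such as those in the test file would be routed to the protein database, while anything that does not look like a RefSeq accession returns None so the caller can fall back to an explicit -d value.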

Preparing the database

Specify your email address. The NCBI taxdump is fetched and parsed (*1), and an SQLite database is built.

python PYlogeny.py -u -e <your_mail_address>

The NCBI taxdump is updated often, so the local database should be refreshed frequently.
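One way to decide when a refresh is due is to look at the age of the local SQLite file. The sketch below assumes the database sits at ETE3's usual default location, ~/.etetoolkit/taxa.sqlite (the file name is an assumption; the help text only names the directory), and the 30-day threshold and the function name database_is_stale are arbitrary choices for illustration.

```python
import os
import time

# Assumed default location of the ETE3 taxonomy database.
DEFAULT_DB = os.path.expanduser("~/.etetoolkit/taxa.sqlite")


def database_is_stale(path=DEFAULT_DB, max_age_days=30):
    """Return True if the database file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True
    age_seconds = time.time() - os.path.getmtime(path)
    return age_seconds > max_age_days * 86400  # 86400 seconds per day
```

If this returns True, rerunning PYlogeny.py with -u rebuilds the database from a fresh taxdump.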

Test run

Input file

$ cat tests/10accs.txt

WP_041379885.1
WP_058588699.1
WP_105398703.1
WP_112878141.1
WP_110086472.1
WP_036780220.1
WP_065389666.1
WP_036813449.1
WP_110091204.1
WP_113043119.1
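For readers who want to prepare or check such a list programmatically, here is a small sketch that reads a one-per-line accession file while skipping blank lines and repeated entries. The function name read_accessions is hypothetical, and this is not taken from PYlogeny itself.

```python
def read_accessions(path):
    """Return accessions from a one-per-line file, in order, without blanks or repeats."""
    seen = set()
    accessions = []
    with open(path) as handle:
        for line in handle:
            accession = line.strip()
            if accession and accession not in seen:
                seen.add(accession)
                accessions.append(accession)
    return accessions
```

Removing repeats before a run keeps the number of Entrez queries down when the list was assembled from several sources.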

Run it. The email address must be given on every run; a dummy address also works.

python PYlogeny.py -e <your_mail_address> -i tests/10accs.txt -o out.csv
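The email is needed because the lookup goes through NCBI's E-utilities, which ask clients to identify themselves. As a rough sketch of what such a lookup can look like with BioPython (not PYlogeny's actual implementation), the code below requests ESummary records for a batch of accessions and extracts the taxonomy IDs. The function names are hypothetical, and the record keys "AccessionVersion" and "TaxId" are assumed to be present in the parsed summaries for sequence databases.

```python
def taxids_from_summaries(summaries):
    """Map accession.version -> taxid from parsed ESummary records.

    Assumes each record exposes "AccessionVersion" and "TaxId" keys.
    """
    return {str(doc["AccessionVersion"]): int(doc["TaxId"]) for doc in summaries}


def fetch_summaries(accessions, email, database="protein"):
    """Query NCBI ESummary for a batch of accessions (needs network access)."""
    from Bio import Entrez  # imported lazily so the helper above has no dependency

    Entrez.email = email  # identifies the client to NCBI
    handle = Entrez.esummary(db=database, id=",".join(accessions))
    try:
        return Entrez.read(handle)
    finally:
        handle.close()
```

A call such as taxids_from_summaries(fetch_summaries(accessions, email)) would then give the taxonomy IDs that a lineage lookup needs as input.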

The output, opened in Excel:

[Figure: screenshot of out.csv opened in Excel]

NCBI also provides a feature called Batch Entrez.

https://www.ncbi.nlm.nih.gov/sites/batchentrez
