dammit: an annotation tool for de novo transcriptomes

2020 1/31 Fixed typos

2020 2/1 Fixed commands

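dammit is a simple de novo transcriptome annotator. It grew out of the observation that every individual piece of the annotation process already exists, yet the existing solutions are either overly complicated or depend on unnecessary non-free software.

dammit is free and open source, and it is built around a free and open-source ecosystem. Programs that the author does not consider suitable are therefore avoided as dependencies. That can mean programs with non-free licenses, or programs that are very difficult to install and configure. The authors regard accessibility as part of openness.

dammit automates the functional annotation of de novo transcriptomes. It is designed to run inside a Python virtual environment, so users can run jobs with minimal effort. It has been public for some time, but before Bioconda support the installation procedure was cumbersome and getting it to run correctly took real effort. Bioconda support was added recently, and it can now be installed easily with conda.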

http://www.camillescott.org/dammit/

Annotating de novo transcriptomes with dammit

https://angus.readthedocs.io/en/2017/dammit_annotation.html

dammit uses the following databases (quoted from the project homepage).

Pfam-A

Pfam-A is a collection of protein domain profiles for use with profile hidden markov model programs like hmmer. These searches are moderately fast and very sensitive, and the Pfam database is very well curated. Pfam is used during TransDecoder’s ORF finding and for annotation assignment.

Rfam

Rfam is a collection of RNA covariance models for use with programs like Infernal. Covariance models describe RNA secondary structure, and Rfam is a curated database of non-coding RNAs.

OrthoDB

OrthoDB is a curated database of orthologous genes. It attempts to classify proteins from all major groups of eukaryotes and trace them back to their ancestral ortholog.

BUSCO

BUSCO databases are collections of “core” genes for major domains of life. They are used with an accompanying BUSCO program which assesses the completeness of a genome, transcriptome, or list of genes. There are multiple BUSCO databases, and which one you use depends on your particular organism (*1).

uniref90

uniref is a curated collection of most known proteins, clustered at a 90% similarity threshold. This database is comprehensive, and thus quite enormous. dammit does not include it by default due to its size, but it can be installed and used with the --full flag.

Installation

Installed on Ubuntu 16.04.

Dependencies

dammit, for now, is officially supported on GNU/Linux systems via bioconda. macOS support will be available via bioconda soon.

Source code: GitHub

#Run in an Anaconda environment
conda create -y -n dammit python=3 #create an environment that includes Python 3 itself
conda activate dammit #activate the environment
conda install -y -c bioconda -c conda-forge dammit #install dammit

#dockerhub (link) (untested)
docker pull pypi/dammit
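
After installing, it can be worth confirming that the executable is actually on PATH before starting the long database download. The following is only a minimal sketch: `have_tool` is a hypothetical helper name I made up for illustration, not something dammit provides. The `--version` flag is the one listed in dammit's own help output.

```shell
#!/bin/sh
# have_tool is a hypothetical helper name, not part of dammit.
# It returns 0 when the given command is found on PATH, non-zero otherwise.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool dammit; then
    # --version appears in the output of "dammit -h".
    dammit --version
else
    echo "dammit not found; activate the conda environment first" >&2
fi
```

If the message is printed, running `conda activate dammit` in the current shell is the first thing to try.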

> dammit -h

$ dammit -h
usage: dammit [-h] [--debug] [--version] {migrate,databases,annotate} ...

# dammit: a tool for easy de novo transcriptome annotation

optional arguments:
  -h, --help            show this help message and exit
  --debug
  --version             show program's version number and exit

dammit subcommands:
  {migrate,databases,annotate}
    databases           Check for databases and optionally download and
                        prepare them for use. By default, only check their
                        status.
    annotate            The main annotation pipeline. Calculates assembly
                        stats; runs BUSCO; runs LAST against OrthoDB (and
                        optionally uniref90), HMMER against Pfam, Inferal
                        against Rfam, and Conditional Reciprocal Best-hit
                        Blast against user databases; and aggregates all
                        results in a properly formatted GFF3 file.

——

Preparing the databases

#Full download (as of June 2018, nine databases were reported as out of date)
dammit databases --install

#Lightweight version (OrthoDB, uniref, Pfam and Rfam are not installed)
dammit databases --install --quick

#Install a BUSCO database (link); for plants the group name is embryophyta
dammit databases --install --busco-group embryophyta

On a Mac, the databases were saved to .dammit/databases in the home directory.
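
Because the full database set is large, it helps to check where the databases ended up and how much disk they occupy. A minimal sketch follows. The default path is the one observed above. `DAMMIT_DB_DIR` is, as far as I recall from the dammit documentation, the environment variable that overrides that location, so treat it as an assumption. `resolve_db_dir` is a hypothetical helper name.

```shell
#!/bin/sh
# resolve_db_dir is a hypothetical helper, not part of dammit.
# Assumption: DAMMIT_DB_DIR overrides the default database location;
# otherwise the databases live in $HOME/.dammit/databases.
resolve_db_dir() {
    printf '%s\n' "${DAMMIT_DB_DIR:-$HOME/.dammit/databases}"
}

db_dir=$(resolve_db_dir)
if [ -d "$db_dir" ]; then
    du -sh "$db_dir"   # total size of the installed databases
    ls "$db_dir"       # files and directories that were downloaded
else
    echo "no database directory at $db_dir yet"
fi
```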

Usage

dammit annotate transcriptome.fa --n_threads 12 --busco-group \
eukaryota -o output_dir

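By default a directory named after the FASTA file is created and the results are written there; when -o is given, as above, the specified directory is used instead.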
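
The help text says the pipeline aggregates all results into a GFF3 file, so a quick way to get an overview of a finished run is to count records per feature type. This is a rough sketch rather than anything dammit ships: `gff3_type_counts` is a hypothetical helper, and the file path is a placeholder to be replaced with the GFF3 file actually found in your output directory.

```shell
#!/bin/sh
# gff3_type_counts is a hypothetical helper, not part of dammit.
# It counts records per feature type (column 3 of a 9-column GFF3 line),
# skipping comment and directive lines that start with '#'.
gff3_type_counts() {
    awk -F '\t' '!/^#/ && NF >= 9 { n[$3]++ }
                 END { for (t in n) printf "%s\t%d\n", t, n[t] }' "$1" | sort
}

# Placeholder path: substitute the GFF3 file from your own output directory.
gff="output_dir/annotation.gff3"
if [ -f "$gff" ]; then
    gff3_type_counts "$gff"
else
    echo "no GFF3 file at $gff" >&2
fi
```

Changing `$3` to `$2` would count by the source column instead, which may be more informative if the pipeline records the originating tool there.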

With a large number of sequences, a run with the "--full" flag takes a considerable amount of time, so plan accordingly.
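
Since run time scales with the number of sequences, counting the FASTA records beforehand gives a rough idea of what to expect. A minimal sketch, where `count_seqs` is a hypothetical helper and `transcriptome.fa` is the same example file name used in the command above:

```shell
#!/bin/sh
# count_seqs is a hypothetical helper, not part of dammit.
# It counts FASTA records by counting header lines that start with '>'.
count_seqs() {
    grep -c '^>' "$1"
}

fasta="transcriptome.fa"
if [ -f "$fasta" ]; then
    echo "$fasta: $(count_seqs "$fasta") sequences"
else
    echo "input not found: $fasta" >&2
fi
```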

Citation

GitHub - camillescott/dammit: just annotate it, dammit!

Notes

The following BUSCO databases are available.

dammit databases: error: argument --busco-group: invalid choice: 'plant' (choose from 'tenericutes', 'dikarya', 'pezizomycotina', 'enterobacteriales', 'euarchontoglires', 'basidiomycota', 'firmicutes', 'sordariomyceta', 'bacteria', 'saccharomycetales', 'metazoa', 'proteobacteria', 'eurotiomycetes', 'ascomycota', 'cyanobacteria', 'betaproteobacteria', 'vertebrata', 'gammaproteobacteria', 'lactobacillales', 'aves', 'mammalia', 'actinopterygii', 'insecta', 'microsporidia', 'laurasiatheria', 'deltaepsilonsub', 'saccharomyceta', 'fungi', 'hymenoptera', 'bacteroidetes', 'alveolata_stramenophiles', 'protists', 'actinobacteria', 'spirochaetes', 'clostridia', 'bacillales', 'tetrapoda', 'rhizobiales', 'arthropoda', 'endopterygota', 'eukaryota', 'embryophyta', 'diptera', 'nematoda')

Before using one of them, run

"dammit databases --install --busco-group xxx"

first, replacing xxx with the group name.
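
To avoid the "invalid choice" error shown above, a group name can be checked against the list before calling dammit. This is a small sketch under that assumption: the names are copied from the error message in this post (a different dammit version may accept a different set), and `is_busco_group` is a hypothetical helper.

```shell
#!/bin/sh
# Group names copied from the error message above; other dammit versions
# may accept a different set.
BUSCO_GROUPS="tenericutes dikarya pezizomycotina enterobacteriales
euarchontoglires basidiomycota firmicutes sordariomyceta bacteria
saccharomycetales metazoa proteobacteria eurotiomycetes ascomycota
cyanobacteria betaproteobacteria vertebrata gammaproteobacteria
lactobacillales aves mammalia actinopterygii insecta microsporidia
laurasiatheria deltaepsilonsub saccharomyceta fungi hymenoptera
bacteroidetes alveolata_stramenophiles protists actinobacteria
spirochaetes clostridia bacillales tetrapoda rhizobiales arthropoda
endopterygota eukaryota embryophyta diptera nematoda"

# is_busco_group is a hypothetical helper, not part of dammit.
# It returns 0 when the argument exactly matches one of the names above.
is_busco_group() {
    for g in $BUSCO_GROUPS; do
        [ "$g" = "$1" ] && return 0
    done
    return 1
}

group="embryophyta"
if is_busco_group "$group"; then
    echo "ok: $group"
else
    echo "unknown BUSCO group: $group" >&2
fi
```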