Finding IS elements in genomes with ISEScan - macでインフォマティクス

2021 8/7: command corrected

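ISEScan is a Python pipeline for identifying IS (Insertion Sequence) elements in genomes. It can report either complete IS elements only, or both complete and partial IS elements. When it is used to identify IS elements in metagenome assemblies, reporting both complete and partial elements may be worthwhile. By default, ISEScan reports both complete and partial IS elements.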

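ISEScan is developed in Python 3. It 1) scans the genome (or metagenome) in FASTA format, 2) predicts/translates the genome into a proteome (using FragGeneScan), 3) searches pre-built pHMMs (profile hidden Markov models) of transposases against the proteome (two files shipped with ISEScan, cluster.faa.hmm and cluster.single.faa), 4) extends the identified transposase genes into complete IS (Insertion Sequence) elements based on features common to known IS elements reported in the literature and databases, and 5) finally reports the identified IS elements in several result files (for example, a file listing the IS elements, a file containing the IS element sequences in FASTA format, and an annotation file in GFF3 format).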

Installation

Tested on Ubuntu 18.04 LTS.

GitHub

# conda; the faster mamba is used here (anaconda)
mamba install -c bioconda isescan -y

> isescan.py -h

usage: isescan [-h] [--version] [--removeShortIS] [--no-FragGeneScan] --seqfile SEQFILE --output OUTPUT [--nthread NTHREAD]

ISEScan is a python pipeline to identify Insertion Sequence elements (both complete and incomplete IS elements) in genom. A typical invocation would be:

python3 isescan.py seqfile proteome hmm

- If you want isescan to report only complete IS elements, you need to set command line option --removeShortIS.

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

--removeShortIS Remove incomplete (partial) IS elements which include IS element with length < 400 or single copy IS element without perfect TIR.

--no-FragGeneScan Use the annotated protein sequences in NCBI GenBank file (.gbk which must be in the same folder with genome sequence file), instead of the protein sequences predicted/translated by FragGeneScan. (Experimental feature!)

--seqfile SEQFILE Sequence file in fasta format, '' by default

--output OUTPUT Output directory, 'results' by default

--nthread NTHREAD Number of CPU cores used for FragGeneScan and hmmer, 1 by default.

Usage

Specify the genome FASTA file.

isescan.py --seqfile NC_012624.fna --output results --nthread 8

--nthread number of CPU cores used for FragGeneScan and hmmer. By default one will be used.
--seqfile Sequence file in fasta format, '' by default
--output Output directory, 'results' by default

Output

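xxx.fna.sum: summary of the IS copies in each IS family
xxx.fna.raw: details of the IS copies, one copy per line
xxx.fna.gff: list of each IS copy and its TIRs, GFF3 format
xxx.fna.is.fna: DNA sequence of each IS copy, FASTA format
xxx.fna.orf.fna: DNA sequences of the Tpase (transposase) genes in each IS copy, FASTA format
xxx.fna.orf.faa: amino acid sequences of the Tpases in each IS copy, FASTA format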
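
Since the .gff file is in GFF3 format, a quick tally of its records can serve as a sanity check on a run. The sketch below counts records by feature type (column 3 of any GFF3 file). count_gff_types is a hypothetical helper name, and the path in the usage line is a placeholder rather than an actual ISEScan output path.

```shell
# Hypothetical helper: count GFF3 records per feature type (column 3).
# Comment/directive lines (starting with '#') and lines with fewer than
# nine tab-separated columns are skipped.
count_gff_types() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ }
                END { for (t in n) print t "\t" n[t] }' "$1" | sort
}

# Usage (placeholder path):
#   count_gff_types path/to/xxx.fna.gff
```

Each output line is a feature type followed by a tab and its count, so the number of IS copies can be compared against the .sum file by eye.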

The repository includes an example that uses xargs to process multiple genomes one after another; see it for details.
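
A minimal sketch of that kind of batch run follows; it is not the repository's own command. isescan_batch and the DRY_RUN variable are hypothetical names introduced for illustration, and only the isescan.py options already listed in the help output are used. The function reads FASTA paths from standard input, one per line, and lets xargs start isescan.py once per path.

```shell
# Hypothetical helper: read FASTA paths (one per line) from stdin and run
# isescan.py on each of them, one genome at a time.
#   $1: output directory (default: results)
#   $2: number of threads passed to --nthread (default: 1)
# Set DRY_RUN=1 to print the commands instead of executing them.
isescan_batch() {
    local outdir="${1:-results}" threads="${2:-1}"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        xargs -I{} echo isescan.py --seqfile {} --output "$outdir" --nthread "$threads"
    else
        xargs -I{} isescan.py --seqfile {} --output "$outdir" --nthread "$threads"
    fi
}

# Usage:
#   ls *.fna | DRY_RUN=1 isescan_batch results 8   # check the commands first
#   ls *.fna | isescan_batch results 8             # then run them
```

Because xargs -I starts one command per input line, the genomes are processed sequentially; --nthread still controls how many cores each individual run uses.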

Citation

ISEScan: automated identification of insertion sequence elements in prokaryotic genomes
Zhiqun Xie, Haixu Tang
Bioinformatics, Volume 33, Issue 21, 01 November 2017, Pages 3340–3347