Tiara: a deep-learning-based tool for classifying eukaryotic sequences

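With a large number of metagenomic datasets becoming available, eukaryotic metagenomics has emerged as a new challenge. Properly classifying eukaryotic nuclear and organellar genomes is an essential step toward a better understanding of eukaryotic diversity.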

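The authors developed Tiara, a deep-learning-based approach for identifying eukaryotic sequences in metagenomic datasets. Tiara uses a two-stage classification process: the first stage separates the nuclear and organellar eukaryotic fractions, and the second stage divides the organellar sequences into plastid and mitochondrial sequences. On the test dataset, Tiara performed similarly to EukRep for prokaryote classification and outperformed it for eukaryote classification, with a shorter computation time. In tests on real data, Tiara performed better than EukRep both on a small dataset representing the microbiome of a eukaryotic cell and on a large dataset from the pelagic zone of the oceans. Tiara is also the only available tool that correctly classifies organellar sequences, which was confirmed by the recovery of nearly complete plastid and mitochondrial genomes from the test data and from real metagenomic data.

Tiara is implemented in Python 3.8, is available at https://github.com/ibe-uw/tiara, and has been tested on Unix-based systems. It is released under the open-source MIT license, and the documentation is at https://ibe-uw.github.io/tiara. Version 1.0.1 of Tiara was used for all benchmarks.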
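The control flow of the two-stage process described above can be sketched in a few lines of Python. This is only an illustration, not Tiara's actual code: the function names `pick` and `classify_two_stage` are hypothetical, the per-class probabilities are assumed to come from some already trained model, and the 0.65 cutoffs are the defaults listed in the help text. The sketch does not model the real tool's additional "prokarya" class.

```python
from typing import Dict, Optional, Tuple


def pick(probs: Dict[str, float], cutoff: float) -> str:
    """Return the most probable class, or 'unknown' if it is below the cutoff."""
    label, p = max(probs.items(), key=lambda kv: kv[1])
    return label if p >= cutoff else "unknown"


def classify_two_stage(
    stage1: Dict[str, float],
    stage2: Dict[str, float],
    cutoffs: Tuple[float, float] = (0.65, 0.65),
) -> Tuple[str, Optional[str]]:
    """Hypothetical two-stage decision.

    stage1: class probabilities from a first-stage model.
    stage2: mitochondria/plastid probabilities; only consulted when the
            first stage labels the sequence as 'organelle'.
    """
    first = pick(stage1, cutoffs[0])
    if first != "organelle":
        return first, None  # the second stage is skipped
    return first, pick(stage2, cutoffs[1])


print(classify_two_stage(
    {"bacteria": 0.1, "eukarya": 0.1, "organelle": 0.8},
    {"mitochondria": 0.7, "plastid": 0.3},
))
```

The point of the design is that the second model only ever sees sequences the first model already considers organellar, so each model solves a narrower problem.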

Installation

Dependencies

Python >= 3.7
numpy, biopython, torch, skorch, tqdm, joblib, numba

GitHub

# pip
pip install tiara

# GitHub (latest developer version)
git clone https://github.com/ibe-uw/tiara.git
cd tiara
python setup.py install

$ tiara
usage: tiara [-h] -i input [-o output] [-m MIN_LEN] [--first_stage_kmer FIRST_STAGE_KMER] [--second_stage_kmer SECOND_STAGE_KMER] [-p cutoff [cutoff ...]] [--to_fasta class [class ...]] [-t THREADS] [--probabilities] [-v] [--gzip]

tiara - a deep-learning-based approach for identification of eukaryotic sequences
in the metagenomic data powered by PyTorch.

The sequences are classified in two stages:
- In the first stage, the sequences are classified to classes:
  archaea, bacteria, prokarya, eukarya, organelle and unknown.
- In the second stage, the sequences labeled as organelle in the first stage
  are classified to either mitochondria, plastid or unknown.

optional arguments:
  -h, --help            show this help message and exit
  -i input, --input input
                        A path to a fasta file.
  -o output, --output output
                        A path to output file. If not provided, the result is printed to stdout.
  -m MIN_LEN, --min_len MIN_LEN
                        Minimum length of a sequence. Sequences shorter than min_len are discarded.
                        Default: 3000.
  --first_stage_kmer FIRST_STAGE_KMER, --k1 FIRST_STAGE_KMER
                        k-mer length used in the first stage of classification. Default: 6.
  --second_stage_kmer SECOND_STAGE_KMER, --k2 SECOND_STAGE_KMER
                        k-mer length used in the second stage of classification. Default: 7.
  -p cutoff [cutoff ...], --prob_cutoff cutoff [cutoff ...]
                        Probability threshold needed for classification to a class.
                        If two floats are provided, the first is used in a first stage, the second in the second stage
                        Default: [0.65, 0.65].
  --to_fasta class [class ...], --tf class [class ...]
                        Write sequences to fasta files specified in the arguments to this option.
                        The arguments are: mit - mitochondria, pla - plastid, bac - bacteria,
                        arc - archaea, euk - eukarya, unk - unknown, pro - prokarya,
                        all - all classes present in input fasta (to separate fasta files).
  -t THREADS, --threads THREADS
                        Number of threads used.
  --probabilities, --pr
                        Whether to write probabilities of individual classes for each sequence to the output.
  -v, --verbose         Whether to display some additional messages and progress bar during classification.
  --gzip, --gz          Whether to gzip results or not.
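The --first_stage_kmer and --second_stage_kmer options set the k-mer length used to describe each sequence. As a rough illustration of what a k-mer representation looks like, the sketch below turns a DNA sequence into a vector of k-mer frequencies. It is an assumption-laden example, not Tiara's feature extraction: the function name `kmer_frequencies` is hypothetical, and the real tool may encode sequences differently (for instance in how it treats ambiguous bases or reverse complements).

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_frequencies(seq: str, k: int = 6) -> List[float]:
    """Return the relative frequency of every possible k-mer over ACGT.

    The result has 4**k entries, ordered as itertools.product("ACGT", repeat=k).
    Windows containing characters other than A, C, G or T are skipped.
    """
    seq = seq.upper()
    alphabet = set("ACGT")
    counts: Counter = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    total = sum(counts.values())
    return [
        counts["".join(p)] / total if total else 0.0
        for p in product("ACGT", repeat=k)
    ]


vector = kmer_frequencies("ACGTACGTNACGT", k=2)
print(len(vector))  # 4**2 entries, one per possible 2-mer
```

With k = 6 the vector has 4096 entries and with k = 7 it has 16384, which gives a feel for how the k-mer length trades resolution against input size.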

Usage

Specify a FASTA file of metagenomic assembly sequences as input.

tiara -i sample_input.fasta -o out.txt

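Sequences in the FASTA file should be at least 3000 bases long, which is the default value of -m/--min_len; shorter sequences are discarded. Classifying sequences shorter than 1000 base pairs is not recommended (*1).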
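Before running Tiara it can be useful to check how many contigs would survive the length filter. The following is a minimal sketch using only the standard library; `read_fasta` and `long_enough` are hypothetical helper names, and the 3000-base default simply mirrors the --min_len default shown in the help text.

```python
from typing import Iterator, List, Tuple


def read_fasta(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a plain-text FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def long_enough(path: str, min_len: int = 3000) -> List[Tuple[str, str]]:
    """Keep only records whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in read_fasta(path) if len(s) >= min_len]


# Example (file name taken from the command above):
# kept = long_enough("sample_input.fasta")
# print(len(kept), "sequences are at least 3000 bases long")
```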

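The run produces two files: out.txt, a tab-separated file containing the sequence ID (FASTA header), the first-stage classification, and the second-stage classification; and log_out.txt, which contains the model parameters and a summary of the classification results.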
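Because out.txt is tab-separated, it is easy to post-process. The sketch below counts how many sequences fall into each class. It rests on assumptions that should be checked against a real output file: that the first row is a header, that the columns are ordered ID, first-stage class, second-stage class, and that sequences without a second-stage result carry an empty field or a placeholder such as "n/a". The function name `tally_classes` is hypothetical.

```python
import csv
from collections import Counter
from typing import Tuple


def tally_classes(path: str) -> Tuple[Counter, Counter]:
    """Count first-stage and second-stage labels in a Tiara-style TSV file."""
    first, second = Counter(), Counter()
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        next(reader, None)  # assumed header row
        for row in reader:
            if len(row) < 2:
                continue  # skip blank or malformed lines
            first[row[1]] += 1
            # Assumed placeholders for "no second-stage result".
            if len(row) > 2 and row[2] not in ("", "n/a"):
                second[row[2]] += 1
    return first, second


# Example:
# first, second = tally_classes("out.txt")
# print(first.most_common())
```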

Example output (from a run that used -o tiara.txt)

Classification done.
First iteration statistics:
archaea: 71
bacteria: 17161
eukarya: 42
organelle: 7
prokarya: 66
unknown: 11
Second iteration statistics:
mitochondrion: 5
plastid: 2

Output saved to tiara.txt.
Log file saved to log_tiara.txt.
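The summary above can also be read programmatically, for example to confirm that the second-stage counts add up to the number of sequences labeled organelle in the first stage. This is a small sketch that parses text laid out exactly like the example; `parse_summary` is a hypothetical helper, and the layout of the summary may differ between Tiara versions.

```python
from typing import Dict

EXAMPLE_SUMMARY = """\
First iteration statistics:
archaea: 71
bacteria: 17161
eukarya: 42
organelle: 7
prokarya: 66
unknown: 11
Second iteration statistics:
mitochondrion: 5
plastid: 2
"""


def parse_summary(text: str) -> Dict[str, Dict[str, int]]:
    """Parse 'First/Second iteration statistics' blocks into nested dicts."""
    stats: Dict[str, Dict[str, int]] = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.endswith("statistics:"):
            section = line.split()[0].lower()  # "first" or "second"
            stats[section] = {}
        elif section and ":" in line:
            name, _, value = line.partition(":")
            if value.strip().isdigit():
                stats[section][name.strip()] = int(value.strip())
    return stats


stats = parse_summary(EXAMPLE_SUMMARY)
# Every organelle sequence from stage one should be accounted for in stage two.
print(sum(stats["second"].values()) == stats["first"]["organelle"])
```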

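To write out plastid sequences, eukaryotic sequences, and other classes as FASTA files, use the --to_fasta option (for example, --to_fasta pla euk). See the repository documentation for details.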
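When Tiara is called from a pipeline, the command line can be assembled in Python. The sketch below only builds and prints the command; it uses flags that appear in the help text (-i, -o, --to_fasta, -t, --probabilities), while the helper name `tiara_command` and its default values are hypothetical choices made for this example.

```python
import shlex
from typing import List, Sequence

# Class codes accepted by --to_fasta according to the help text.
VALID_CLASSES = {"mit", "pla", "bac", "arc", "euk", "unk", "pro", "all"}


def tiara_command(
    fasta: str,
    output: str,
    classes: Sequence[str] = ("pla", "euk"),
    threads: int = 4,
    probabilities: bool = False,
) -> List[str]:
    """Build an argument list for a Tiara run that also writes FASTA files."""
    unknown = set(classes) - VALID_CLASSES
    if unknown:
        raise ValueError(f"unsupported --to_fasta classes: {sorted(unknown)}")
    cmd = ["tiara", "-i", fasta, "-o", output, "-t", str(threads)]
    if classes:
        cmd += ["--to_fasta", *classes]
    if probabilities:
        cmd.append("--probabilities")
    return cmd


cmd = tiara_command("sample_input.fasta", "out.txt")
print(shlex.join(cmd))
# To actually run it (requires Tiara to be installed):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when file names contain spaces.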

Citation

Tiara: deep learning-based classification system for eukaryotic sequences
Michał Karlicki, Stanisław Antonowicz, Anna Karnkowska
Bioinformatics, Published: 27 September 2021
