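isONform: reference-free transcriptome reconstruction from ONT reads

Advances in long-read transcriptome sequencing make it possible to sequence transcripts in full, which greatly improves our ability to study transcription. A widely used long-read transcriptome sequencing technology is Oxford Nanopore Technologies (ONT), whose cost-effective sequencing and high throughput have the potential to characterize the transcriptome of a cell. However, because of transcript variability and sequencing errors, long cDNA reads need considerable bioinformatic processing before they yield isoform predictions. Several genome- and annotation-based methods exist for predicting transcripts, but they require high-quality genomes and annotations and are limited by the accuracy of long-read splice aligners. In addition, highly heterogeneous gene families may not be well represented by a reference genome and would benefit from reference-free analysis. Reference-free methods that predict transcripts from ONT data, such as RATTLE, exist, but their sensitivity is not comparable to that of reference-based approaches. The authors present isONform, a high-sensitivity algorithm for constructing isoforms from ONT cDNA sequencing data. The algorithm is based on iterative bubble popping on gene graphs built from fuzzy seeds taken from the reads. Using simulated, synthetic, and biological ONT cDNA data, they show that isONform has substantially higher sensitivity than RATTLE. On biological data, isONform's predictions are substantially more consistent with those of the annotation-based method StringTie2 than RATTLE's predictions are. The authors believe isONform can be used both for isoform construction in organisms without well-annotated genomes and as an orthogonal method for verifying the predictions of reference-based methods.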

Installation

GitHub

https://github.com/aljpetri/isONform

mamba create -n isonform python=3.10 -y
conda activate isonform
pip install isONcorrect
mamba install -c bioconda spoa -y
mamba install networkx -y
pip install parasail

# Main program
git clone https://github.com/aljpetri/isONform.git
cd isONform/
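
Before running the tool, it can help to confirm that the dependencies installed above are visible from the active environment. The following is a minimal sketch, not part of isONform. The helper name `missing_dependencies` is hypothetical, and it assumes that the import names are `networkx` and `parasail` and that the bioconda spoa package provides an executable called `spoa`.

```python
import importlib.util
import shutil


def missing_dependencies(modules, executables):
    """Return the names of Python modules and executables that cannot be found."""
    # find_spec returns None when a top-level module is not importable.
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which returns None when an executable is not on PATH.
    missing += [e for e in executables if shutil.which(e) is None]
    return missing


# Assumed names (see the note above); adjust them to your installation.
print(missing_dependencies(["networkx", "parasail"], ["spoa"]))
```

An empty list means everything in the two lists was found.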

> python isONform_parallel.py

usage: isONform_parallel.py [-h] [--version] [--fastq_folder FASTQ_FOLDER] [--t NR_CORES] [--k K] [--w W] [--xmin XMIN] [--xmax XMAX] [--exact_instance_limit EXACT_INSTANCE_LIMIT] [--keep_old] [--set_w_dynamically] [--max_seqs MAX_SEQS]
                            [--split_wrt_batches] [--clustered] [--outfolder OUTFOLDER] [--delta_len DELTA_LEN] [--delta DELTA] [--max_seqs_to_spoa MAX_SEQS_TO_SPOA] [--verbose] [--iso_abundance ISO_ABUNDANCE]
                            [--delta_iso_len_3 DELTA_ISO_LEN_3] [--delta_iso_len_5 DELTA_ISO_LEN_5] [--tmpdir TMPDIR] [--write_fastq]

De novo reconstruction of long-read transcriptome reads

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --fastq_folder FASTQ_FOLDER
                        Path to input fastq folder with reads in clusters (default: False)
  --t NR_CORES          Number of cores allocated for clustering (default: 8)
  --k K                 Kmer size (default: 20)
  --w W                 Window size (default: 31)
  --xmin XMIN           Lower interval length (default: 18)
  --xmax XMAX           Upper interval length (default: 80)
  --exact_instance_limit EXACT_INSTANCE_LIMIT
                        Do exact correction for clusters under this threshold (default: 50)
  --keep_old            Do not recompute previous results if corrected_reads.fq is found and has the smae number of reads as input file (i.e., is complete). (default: False)
  --set_w_dynamically   Set w = k + max(2*k, floor(cluster_size/1000)). (default: False)
  --max_seqs MAX_SEQS   Maximum number of seqs to correct at a time (in case of large clusters). (default: 1000)
  --split_wrt_batches   Process reads per batch (of max_seqs sequences) instead of per cluster. Significantly decrease runtime when few very large clusters are less than the number of cores used. (default: False)
  --clustered           Indicates whether we use the output of isONclust (i.e. we have uncorrected data) (default: False)
  --outfolder OUTFOLDER
                        Outfolder with all corrected reads. (default: None)
  --delta_len DELTA_LEN
                        Maximum length difference between two reads intervals for which they would still be merged (default: 5)
  --delta DELTA         diversity rate used to compare sequences (default: 0.1)
  --max_seqs_to_spoa MAX_SEQS_TO_SPOA
                        Maximum number of seqs to spoa (default: 200)
  --verbose             Print various developer stats. (default: False)
  --iso_abundance ISO_ABUNDANCE
                        Cutoff parameter: abundance of reads that have to support an isoform to show in results (default: 5)
  --delta_iso_len_3 DELTA_ISO_LEN_3
                        Cutoff parameter: maximum length difference at 3prime end, for which subisoforms are still merged into longer isoforms (default: 30)
  --delta_iso_len_5 DELTA_ISO_LEN_5
                        Cutoff parameter: maximum length difference at 5prime end, for which subisoforms are still merged into longer isoforms (default: 50)
  --tmpdir TMPDIR       OPTIONAL PARAMETER: Absolute path to custom folder in which to store temporary files. If tmpdir is not specified, isONform will attempt to write the temporary files into the tmp folder on your system. It is advised to only use this parameter if the symlinking does not work on your system. (default: None)
  --write_fastq         Indicates that we want to ouptut the final output (transcriptome) as fastq file (New standard: fasta) (default: False)
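
The formula given for --set_w_dynamically in the help text, w = k + max(2*k, floor(cluster_size/1000)), is easy to evaluate by hand. The sketch below only restates that formula in Python to show how the window size grows with cluster size. The function name `dynamic_window` is hypothetical, and the code is not taken from the isONform source.

```python
def dynamic_window(k, cluster_size):
    """Evaluate w = k + max(2*k, floor(cluster_size/1000)) from the help text."""
    # Integer division equals floor() for non-negative cluster sizes.
    return k + max(2 * k, cluster_size // 1000)


# With the default k = 20, a small cluster is governed by the 2*k term ...
print(dynamic_window(20, 500))      # 60
# ... and the cluster-size term takes over above 2*k*1000 = 40,000 reads.
print(dynamic_window(20, 100_000))  # 120
```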


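Usage

The input to the algorithm is reads that have been clustered and error-corrected with isONclust and isONcorrect (Sahlin and Medvedev 2021). isONform generates isoforms from these clustered, error-corrected long reads. To do so, it builds a graph using the networkx API and applies several simplification strategies, such as bubble popping and node merging.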
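
To give a feel for what collapsing a bubble means, here is a toy sketch in plain Python. It is not isONform's implementation: it works on whole read paths rather than on a networkx graph, and the function name `pop_bubbles`, the `(label, length)` node representation, and the merge rule are all assumptions made for illustration. Two paths are treated as the two sides of a bubble when they start and end at the same node and their total lengths differ by at most `delta_len`. The less supported side is then folded into the better supported one.

```python
from collections import Counter


def pop_bubbles(read_paths, delta_len=5):
    """Merge alternative paths with shared end nodes and near-equal length.

    read_paths: list of paths, each a tuple of (node_label, length) pairs.
    Returns a Counter mapping each surviving path to its supporting read count.
    """
    support = Counter(read_paths)
    merged = Counter()
    # Visit well-supported paths first so weaker variants fold into them.
    for path, count in support.most_common():
        total = sum(length for _, length in path)
        target = path
        for kept in merged:
            same_ends = kept[0][0] == path[0][0] and kept[-1][0] == path[-1][0]
            kept_total = sum(length for _, length in kept)
            if same_ends and abs(kept_total - total) <= delta_len:
                target = kept  # treat `path` as a noisy copy of `kept`
                break
        merged[target] += count
    return merged


# Three reads follow path A, one read follows a slightly longer variant B,
# and two reads follow C, which carries an extra 200-base node.
A = (("s", 10), ("e1", 100), ("t", 10))
B = (("s", 10), ("e1b", 103), ("t", 10))
C = (("s", 10), ("e1", 100), ("e2", 200), ("t", 10))
result = pop_bubbles([A, A, A, B, C, C])
print(result[A], result[C], B in result)  # 4 2 False
```

In this toy data the variant B is absorbed into A because the two differ by only 3 bases, while C survives as a separate candidate isoform because it is 200 bases longer.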

python isONform_parallel.py --fastq_folder <path>/<to>/<input_reads_dir> --t 12 --outfolder outdir --split_wrt_batches

* Specify the fastq folder as an absolute path.

--fastq_folder Path to input fastq folder with reads in clusters (default: False)
--t Number of cores allocated for clustering (default: 8)
--outfolder Outfolder with all corrected reads. (default: None)
--split_wrt_batches Process reads per batch (of max_seqs sequences) instead of per cluster. Significantly decrease runtime when few very large clusters are less than the number of cores used. (default: False)
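
Because the fastq folder should be given as an absolute path, it can be convenient to assemble the command in a small script. The sketch below is one possible way to do that, using only the flags shown above. The helper name `build_isonform_command` and the folder names `clusters` and `outdir` are hypothetical, and the script path assumes the command is run from inside the cloned isONform directory.

```python
import os
import shlex


def build_isonform_command(fastq_folder, outfolder, cores=8,
                           split_wrt_batches=False,
                           script="isONform_parallel.py"):
    """Build the argument list for isONform_parallel.py."""
    cmd = [
        "python", script,
        # Resolve relative paths so the tool always receives an absolute path.
        "--fastq_folder", os.path.abspath(fastq_folder),
        "--t", str(cores),
        "--outfolder", outfolder,
    ]
    if split_wrt_batches:
        cmd.append("--split_wrt_batches")
    return cmd


cmd = build_isonform_command("clusters", "outdir", cores=12,
                             split_wrt_batches=True)
print(shlex.join(cmd))
# To actually launch the run:
#   import subprocess
#   subprocess.run(cmd, check=True)
```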

isONform writes three output files: transcriptome.fasta, mapping.txt, and support.txt.
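
A quick way to inspect transcriptome.fasta is to count the reconstructed sequences and look at their lengths. The sketch below is a generic FASTA reader in plain Python, not something shipped with isONform. The function name `read_fasta` is hypothetical, and the demo runs on a small temporary file with made-up sequences so that it does not depend on a real run. The layouts of mapping.txt and support.txt are not described here, so they are left out.

```python
import os
import tempfile


def read_fasta(path):
    """Return a dict mapping each sequence name to its sequence."""
    records = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Use the first word of the header as the sequence name.
                header = line[1:].split()
                name = header[0] if header else ""
                records[name] = []
            elif name is not None:
                records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Demo on a made-up file; point read_fasta at your own transcriptome.fasta instead.
with tempfile.TemporaryDirectory() as tmp:
    demo = os.path.join(tmp, "demo.fasta")
    with open(demo, "w") as out:
        out.write(">seq1 example\nACGT\nAC\n>seq2\nGGG\n")
    isoforms = read_fasta(demo)
    print(len(isoforms), {n: len(s) for n, s in isoforms.items()})
```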

A script that runs isONclust, isONcorrect, and isONform in sequence is also provided (see the repository).

Citation

isONform: reference-free transcriptome reconstruction from Oxford Nanopore data
Alexander J Petri, Kristoffer Sahlin
Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i222–i231