Trans-NanoSim: an ONT long-read simulator that also supports metagenome and RNA-seq data

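　Compared with second-generation sequencers, third-generation single-molecule RNA sequencers offer an unprecedented advantage: the long reads they generate make isoform-level transcript characterization easier. In particular, the Oxford Nanopore Technologies (ONT) sequencing platforms have become more popular in recent years because they are relatively inexpensive and portable compared with other third-generation sequencing technologies. To support the development of analysis tools that exploit this technology, simulated data provide a cost-effective solution with ground truth. However, no nanopore sequence simulator targeting transcriptome data has been available so far.
The authors introduce Trans-NanoSim, a tool that simulates reads with technical and transcriptome-specific features learned from nanopore RNA-sequencing data. They comprehensively benchmarked Trans-NanoSim on direct RNA and complementary DNA (cDNA) datasets describing human and mouse transcriptomes. Through comparison with other nanopore read simulators, they show the unique advantage and robustness of Trans-NanoSim in capturing the characteristics of nanopore cDNA and direct RNA reads. Trans-NanoSim and its pre-trained models are freely accessible at https://github.com/bcgsc/NanoSim.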

From the GitHub repository

NanoSim version 2 (v2.0.0) uses minimap2 as the default aligner for aligning long genomic ONT reads to the reference genome. This speeds up the alignment step and shortens NanoSim's overall runtime. It also uses the Python package HTSeq to read SAM alignment files efficiently.

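NanoSim (v2.5) can simulate ONT transcriptome reads (cDNA / direct RNA) as well as genomic reads. It also models features of the library preparation protocol in use, such as intron retention (IR) events in cDNA and direct RNA reads. It has stand-alone modes for profiling transcript expression patterns and for detecting IR events in custom datasets. The homopolymer simulation option, which simulates homopolymer expansion and contraction events with respect to the chosen basecaller, has also been improved. A multiprocessing option shortens the runtime of large library simulations.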
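To make "homopolymer expansion and contraction" concrete, here is a toy sketch. It is not NanoSim's model: the function name `perturb_homopolymers`, the run-length threshold, and the uniform length shift are all assumptions chosen for illustration only.

```python
import random
import re


def perturb_homopolymers(seq, base="A", min_run=5, max_shift=2, rng=None):
    """Randomly lengthen or shorten runs of `base` that are at least `min_run` long.

    Toy model (assumption): the new run length is the old length plus a
    uniform integer in [-max_shift, max_shift], never shorter than 1.
    """
    rng = rng or random.Random()
    # Matches runs such as AAAAA, AAAAAA, ... for base="A" and min_run=5.
    pattern = re.compile(f"{re.escape(base)}{{{min_run},}}")

    def resize(match):
        new_len = max(1, len(match.group()) + rng.randint(-max_shift, max_shift))
        return base * new_len

    return pattern.sub(resize, seq)


rng = random.Random(1)
original = "CGT" + "A" * 8 + "CGT" + "AAA" + "G"
perturbed = perturb_homopolymers(original, rng=rng)
print(original)
print(perturbed)
```

Only the 8-base run is touched; the 3-base run stays as it is because it is shorter than `min_run`.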

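NanoSim (v2.6) can simulate ONT reads in fastq format. Base quality information is simulated with truncated log-normal distributions learned separately for match, mismatch, insertion, deletion, and unaligned bases, each for different basecallers and read types.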
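As a rough illustration of drawing base qualities from a truncated distribution, the sketch below samples from a log-normal by rejection and encodes the result as a Phred+33 string. The parameters (`mu`, `sigma`, and the 1 to 40 window) are made-up values, not ones learned by NanoSim, and the function names are hypothetical.

```python
import math
import random


def sample_truncated_lognormal(mu, sigma, low, high, rng, max_tries=10_000):
    """Draw one log-normal(mu, sigma) value restricted to [low, high] by rejection."""
    for _ in range(max_tries):
        value = rng.lognormvariate(mu, sigma)
        if low <= value <= high:
            return value
    raise RuntimeError("truncation window too narrow for these parameters")


def simulate_qualities(n_bases, mu, sigma, rng, low=1, high=40):
    """Return a FASTQ quality string (Phred+33) with one character per base."""
    scores = [
        round(sample_truncated_lognormal(mu, sigma, low, high, rng))
        for _ in range(n_bases)
    ]
    return "".join(chr(score + 33) for score in scores)


rng = random.Random(0)
# Assumed parameters: median quality around 12, moderate spread.
qual = simulate_qualities(20, mu=math.log(12), sigma=0.4, rng=rng)
print(qual)
```

In a per-category scheme like the one described above, one would keep a separate (`mu`, `sigma`) pair for each base category (match, mismatch, insertion, and so on) and pick the pair according to the simulated error type.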

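NanoSim (v3.0) can simulate ONT metagenome reads. It quantifies metagenome abundance and profiles chimeric reads in the characterization stage, and simulates both features in the simulation stage. Chimeric read simulation is also available in genome mode. Some pre-trained models have been re-trained because of compatibility issues.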

Installation

Dependencies

Python packages:

HTSeq (Tested with version 0.11.2)
joblib (Tested with version 0.14.1)
numpy (Tested with version 1.17.2)
pybedtools (Tested with version 0.8.2)
pysam (Tested with version 0.13 or above)
scikit-learn (Tested with version 0.21.3)
scipy (Tested with version 1.4.1)
six (Tested with version 1.16.0)

External programs:

minimap2 (Tested with versions 2.10, 2.17, 2.18)
LAST (Tested with versions 581 and 916)
samtools (Tested with version 1.12)
GenomeTools (Tested with version 1.6.1)

NanoSim itself (GitHub)

mamba create -n nanosim -y
conda activate nanosim
mamba install -c bioconda nanosim -y
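After installing, a quick check that the scripts and external programs are on `PATH` can save a failed run later. The following is a minimal sketch; the tool list is an assumption based on the commands and dependencies named in this post, so adjust it to your setup (for example, add LAST's executables if you plan to use that aligner).

```python
import shutil

# Assumed list: the two NanoSim entry points used in this post plus
# two of the external programs from the dependency list.
REQUIRED_TOOLS = ["read_analysis.py", "simulator.py", "minimap2", "samtools"]


def find_tools(names):
    """Map each executable name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names):
    """Return the names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All tools found.")
```

Run it inside the activated `nanosim` environment so that the environment's `bin` directory is on `PATH`.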

> read_analysis.py -h

usage: read_analysis.py [-h] [-v]
                        {genome,transcriptome,metagenome,quantify,detect_ir}
                        ...

Read characterization step
-----------------------------------------------------------
Given raw ONT reads, reference genome, metagenome, and/or
transcriptome, learn read features and output error profiles

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

subcommands:
  There are five modes in read_analysis.
  For detailed usage of each mode:
      read_analysis.py mode -h
  -------------------------------------------------------

  {genome,transcriptome,metagenome,quantify,detect_ir}
    genome              Run the simulator on genome mode
    transcriptome       Run the simulator on transcriptome mode
    metagenome          Run the simulator on metagenome mode
    quantify            Quantify transcriptome expression or metagenome
                        abundance
    detect_ir           Detect Intron Retention events using the alignment
                        file

> simulator.py -h

usage: simulator.py [-h] [-v] {genome,transcriptome,metagenome} ...

Simulation step
-----------------------------------------------------------
Given error profiles, reference genome, metagenome,
and/or transcriptome, simulate ONT DNA or RNA reads

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit

subcommands:
  There are two modes in read_analysis.
  For detailed usage of each mode:
      simulator.py mode -h
  -------------------------------------------------------

  {genome,transcriptome,metagenome}
                        You may run the simulator on genome, transcriptome, or
                        metagenome mode.
    genome              Run the simulator on genome mode
    transcriptome       Run the simulator on transcriptome mode
    metagenome          Run the simulator on metagenome mode

$ read_analysis.py -v

NanoSim 3.0.0

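Usage

1. The characterization stage runs in one of five modes: genome, transcriptome, metagenome, quantify, and detect_ir. Several pre-trained models are included in the repository. If you use one of them, you can skip the characterization in step 1 and run the simulation in step 2 directly.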

genome mode
To simulate ONT genomic reads, run the characterization stage in "genome" mode. It takes a reference genome and a training read set in FASTA or FASTQ format as input, and aligns the reads to the reference genome with minimap2 (default) or the LAST aligner.

read_analysis.py genome -i ONT_reads.fq.gz -rg ref.fasta -t 20

-i Input read for training
-rg Reference genome, not required if genome alignment file is provided
-a {minimap2,LAST} The aligner to be used, minimap2 or LAST (Default = minimap2)
-t Number of threads for alignment and model fitting (Default = 1)
-o The location and prefix of outputting profiles (Default = training)
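When the characterization step is driven from a script, building the argument list in one place keeps the options consistent. The sketch below only assembles the command from the flags listed above; the function name `build_genome_characterization` is hypothetical, and the file names are the placeholders used in this post.

```python
import shlex


def build_genome_characterization(reads, ref_genome, aligner="minimap2",
                                  threads=1, prefix="training"):
    """Assemble the `read_analysis.py genome` command as an argument list."""
    if aligner not in ("minimap2", "LAST"):
        raise ValueError(f"aligner must be 'minimap2' or 'LAST', got {aligner!r}")
    return [
        "read_analysis.py", "genome",
        "-i", reads,          # training reads (FASTA or FASTQ)
        "-rg", ref_genome,    # reference genome
        "-a", aligner,
        "-t", str(threads),
        "-o", prefix,         # location and prefix of the output profiles
    ]


cmd = build_genome_characterization("ONT_reads.fq.gz", "ref.fasta", threads=20)
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string to `subprocess.run` avoids shell quoting problems with paths that contain spaces.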

Output

Files with the prefix "training" are written.

[Figure: screenshot of the output files]

transcriptome mode

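To simulate ONT transcriptome reads (cDNA / direct RNA), run the characterization stage in transcriptome mode. It takes a reference transcriptome, a reference genome, and a training read set in FASTA or FASTQ format as input, and aligns the reads to the reference sequences with minimap2 (default) or the LAST aligner.

read_analysis.py transcriptome -i ONT_reads.fq.gz -rg ref.fasta -rt ref_transcriptome.fasta -annot annotation.gtf -t 20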

-rg Reference genome
-rt Reference Transcriptome
-annot Annotation file in Ensembl GTF/GFF format, required for intron retention detection
-i Input read for training
-a {minimap2, LAST} The aligner to be used: minimap2 or LAST (Default = minimap2)

The other modes are metagenome, quantify, and detect_ir.

2. The simulation stage runs in genome, transcriptome, or metagenome mode. A run requires a reference genome and a read profile.

genome mode

Simulate 20,000 reads.

#linear genome (multiple reference sequences allowed)
simulator.py genome -dna_type linear -rg ref.fasta -c training -t 20 -n 20000

#circular genome (multiple reference sequences not allowed)
simulator.py genome -dna_type circular -rg ref.fasta -c training -t 20 -n 20000

-rg Input reference genome
-c Location and prefix of error profiles generated from characterization step (Default = training)
-o Output location and prefix for simulated reads (Default = simulated)
-n Number of reads to be simulated (Default = 20000)
-max The maximum length for simulated reads (Default = Infinity)
-min The minimum length for simulated reads (Default = 50)
-dna_type {linear, circular} Specify the dna type: circular OR linear (Default = linear)
--fastq Output fastq files instead of fasta file
-t Number of threads for simulation (Default = 1)
-b {albacore, guppy, guppy-flipflop} Simulate homopolymers with respect to chosen basecaller: albacore, guppy, or guppy-flipflop
-s Proportion of sense sequences, a value between 0 and 1. Overrides the value profiled in the characterization stage
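Several of these options constrain each other (for example, `-min` must not exceed `-max`, and `-s` must lie between 0 and 1). A small wrapper can catch such mistakes before a long run starts. This is only a sketch built from the flags listed above; the function name `build_genome_simulation` is hypothetical.

```python
import math
import shlex


def build_genome_simulation(ref_genome, profile="training", n_reads=20000,
                            dna_type="linear", min_len=50, max_len=math.inf,
                            sense=None, fastq=False, threads=1):
    """Validate the options and assemble the `simulator.py genome` command."""
    if dna_type not in ("linear", "circular"):
        raise ValueError("dna_type must be 'linear' or 'circular'")
    if n_reads <= 0:
        raise ValueError("n_reads must be positive")
    if min_len > max_len:
        raise ValueError("min_len must not exceed max_len")
    if sense is not None and not 0 <= sense <= 1:
        raise ValueError("sense must be between 0 and 1")

    cmd = [
        "simulator.py", "genome",
        "-dna_type", dna_type,
        "-rg", ref_genome,
        "-c", profile,
        "-n", str(n_reads),
        "-min", str(min_len),
        "-t", str(threads),
    ]
    if max_len != math.inf:       # leave the default (no upper limit) alone
        cmd += ["-max", str(max_len)]
    if sense is not None:         # only override the profiled value on request
        cmd += ["-s", str(sense)]
    if fastq:
        cmd.append("--fastq")
    return cmd


cmd = build_genome_simulation("ref.fasta", dna_type="circular", threads=20)
print(shlex.join(cmd))
```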

Output

[Figure: screenshot of the output files]
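A quick sanity check on the simulated reads is to look at their length distribution. The sketch below parses FASTA text and reports the read count, mean length, and N50 using only the standard library. The example records are made up; for real data you would open the simulated FASTA file instead (its exact name depends on the `-o` prefix and is not shown here).

```python
import io


def read_lengths(handle):
    """Yield the sequence length of each FASTA record in a text handle."""
    length = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if length is not None:
                yield length
            length = 0
        elif line and length is not None:
            length += len(line)
    if length is not None:
        yield length


def n50(lengths):
    """Return the length L such that reads of length >= L cover half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for value in ordered:
        running += value
        if running >= half:
            return value
    return 0


# Made-up records standing in for a simulated FASTA file.
example = io.StringIO(">read_1\nACGTACGT\n>read_2\nACGT\nAC\n>read_3\nAC\n")
lengths = list(read_lengths(example))
print(len(lengths), sum(lengths) / len(lengths), n50(lengths))
```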

transcriptome mode

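Simulate 20,000 reads. To include intron structure, both the reference genome and the transcriptome sequences (cDNA) must be specified. An expression profile is also required.

#transcriptome (reference transcriptome plus reference genome)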
simulator.py transcriptome -rt Mus_musculus.GRCm38.cdna.all.fa -rg Mus_musculus.GRCm38.dna.primary_assembly.fa -c mouse_cdna -e abundance.tsv -n 20000

-rt Input reference transcriptome
-rg Input reference genome, required if intron retention simulation is on
-e Expression profile in the specified format as described in README
-c Location and prefix of error profiles generated from characterization step (Default = training)
-o Output location and prefix for simulated reads (Default = simulated)
-n Number of reads to be simulated (Default = 20000)
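The `-e` expression profile is a tab-separated file. The sketch below writes one from a dictionary of per-transcript counts. Two things here are assumptions to verify against the NanoSim README before use: the column layout (`target_id`, `est_counts`, `tpm`) and the simplified per-million scaling without length normalization. The transcript IDs are placeholders.

```python
import csv
import io


def write_expression_profile(counts, handle):
    """Write a TSV with the assumed columns target_id, est_counts, tpm.

    Simplification (assumption): tpm is count / total * 1e6, with no
    transcript-length normalization.
    """
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("total count must be positive")
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["target_id", "est_counts", "tpm"])
    for transcript_id, count in counts.items():
        writer.writerow([
            transcript_id,
            f"{count:.4f}",
            f"{count / total * 1e6:.4f}",
        ])


buffer = io.StringIO()
write_expression_profile({"tx1": 30, "tx2": 10}, buffer)  # placeholder IDs
print(buffer.getvalue())
```

The output of the `quantify` mode can serve as a template for the real format, which is safer than relying on the layout assumed here.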

There is also a metagenome mode.

Citation

Trans-NanoSim characterizes and simulates nanopore RNA-sequencing data
Saber Hafezqorani, Chen Yang, Theodora Lo, Ka Ming Nip, René L Warren, Inanc Birol

GigaScience, Volume 9, Issue 6, June 2020