DeepSimulator1.5: improved simulation accuracy and speed

2020 2/1: added title, added text, fixed typos

2020 2/2: fixed typos

2020 3/9: fixed command

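　Nanopore sequencing is one of the major third-generation sequencing technologies. Many computational tools have been developed to facilitate the processing and analysis of Nanopore data. The authors previously developed DeepSimulator1.0 (DS1.0), the first Nanopore sequencing simulator to generate both raw electrical signals and reads. Although DS1.0 can produce high-quality reads, for some sequences the simulated raw signal diverges substantially from the real signal. Moreover, Nanopore sequencing technology has evolved considerably since DS1.0 was released, so DS1.0 needs to be updated to accommodate these changes.
Here the authors propose DeepSimulator1.5 (DS1.5), in which all three modules (the sequence generator, the signal generator, and the basecaller) have been substantially updated from DS1.0. For the sequence generator, the sample read length distribution was updated to reflect the characteristics of the latest real reads. For the signal generator, the core of DeepSimulator, one more pore model was added: a context-independent pore model, which is much faster than the previous context-dependent one. In addition, a low-pass filter that post-processes the pore model signal was added to make the generated signal more similar to the real signal. For the basecaller, support was added for Guppy, the latest official basecaller, which supports both GPU and CPU. Furthermore, multiple optimizations related to multiprocessing control, memory, and storage management have been implemented, making DS1.5 a much more user-friendly and lightweight simulator than DS1.0. The main program and data are available at https://github.com/lykaust15/DeepSimulator.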

wiki

https://github.com/lykaust15/DeepSimulator/wiki/Parameters-of-DS1.5

Installation

Tested in a Miniconda2 4.0.5 environment on Ubuntu 16.04 (using Docker; host OS macOS 10.14). I also tried it on an older Ubuntu machine (12.04) where Anaconda2 is available.

Dependencies

Anaconda2 (https://www.anaconda.com/distribution/) or Miniconda2 (https://conda.io/miniconda.html).

Main program (GitHub)

git clone https://github.com/lykaust15/DeepSimulator.git
cd ./DeepSimulator/
./install.sh

> ./deep_simulator.sh

$ ./deep_simulator.sh

DeepSimulator v1.5 [Sep-26-2019]

A Deep Learning based Nanopore simulator which can simulate the process of Nanopore sequencing.

USAGE: ./deep_simulator.sh <-i input_genome> [-o out_root] [-D multi_fasta] [-c CPU_num] [-S random_seed] [-B basecaller]

[-n read_num] [-K coverage] [-l read_len_mean] [-C cirular_genome] [-m sample_mode]

[-M simulator] [-e event_std] [-u tune_sampling] [-O out_align] [-G sig_out]

[-f filter_freq] [-s signal_std] [-P perfect] [-H home]

Options:

***** required arguments *****

-i input_genome : input genome in FASTA format.

***** optional arguments (main) *****

-o out_root : Default output would the current directory. [default = './${input_name}_DeepSimu']

-c CPU_num : Number of processors. [default = 8]

-S random_seed : Random seed for controling the simulation process. [default = 0]

0 for a random seed. Use other number for a fixed seed for reproducibility.

-B basecaller : Choose from the following basecaller for the basecalling process. [default = 1]

1: guppy_gpu, 2: guppy_cpu, 3: albacore.

***** optional arguments (read-level) *****

-n read_num : The number of reads need to be simulated. [default = 100]

Set -1 to simulate the whole input sequence without cut (not suitable for genome-level).

-D multi_fasta : Whether the input fasta contains multi discontinuous sequences. [default = 1, separate different sequences]

Set 0 to concatenate different sequences.

-K coverage : This parameter is converted to number of read in the program. [default = 0]

If both K and n are given, we use the larger one.

-l read_len_mean : This parameter is used to control the read length mean. [default=8000]

-C cirular_genome : 0 for linear genome and 1 for circular genome. [default = 0]

-m sample_mode : Choose from the following distribution for the read length. [default = 3]

1: beta_distribution, 2: alpha_distribution, 3: mixed_gamma_dis.

***** optional arguments (event-signal) *****

-M simulator : Choose context-dependent(0) or context-independent(1) simulator to generate event. [default = 1]

-e event_std : Set the standard deviation (std) of the random noise of the event. [default = 1.0]

-u tune_sampling : Tuning sampling rate to around eight for each event. [default = 1 to tune]

Here eight is determined by 4000/450, where 4KHz is the signal sampling frequency,

and 450 is the bases per second to pass the nanopore.

-O out_align : Output ground-truth warping path between simulated signal and event. [default = 0 NOT to output]

-G out_signal : Output simulated signal in txt format. [default = 0 NOT to output]

***** optional arguments (signal-signal) *****

-f filter_freq : Set the frequency for the low-pass filter. [default = 950]

[hint]: a higher frequency value would result in better base-calling accuracy.

-s signal_std : Set the standard deviation (std) of the random noise of the signal. [default = 1.0]

[hint]: tune event_std, filter_freq and signal_std to simulate different sequencing qualities.

-P perfect : 0 for normal mode (with length repeat and random noise). [default = 0]

1 for perfect pore model (without 'event length repeat' and 'signal random noise').

2 for generating almost perfect reads without any randomness in signals (equal to -e 0 -f 0 -s 0).

***** home directory *****

-H home : Home directory of DeepSimulator. [default = 'current directory']
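
The "eight" in the -u tune_sampling description above is simply the sampling frequency divided by the translocation speed. A quick check of that arithmetic follows; the helper name samples_per_base is my own and is not part of DeepSimulator.

```shell
# Hypothetical helper: signal samples recorded per base.
# $1 = sampling frequency in Hz, $2 = bases per second passing the pore.
samples_per_base() {
    awk -v hz="$1" -v bps="$2" 'BEGIN { printf "%.2f\n", hz / bps }'
}

samples_per_base 4000 450   # prints 8.89
```

The exact ratio is closer to 8.9, which the help text describes as "around eight".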

A Docker image is also available on Docker Hub.

#HP (link)
docker pull shkao/deepsimulator:1.5
docker run --rm -itv $PWD:/data shkao/deepsimulator:1.5
> /opt/DeepSimulator/deep_simulator.sh

Usage

Generate 10,000 reads using the test genome as the template, and basecall them with the CPU version of Guppy.

./deep_simulator.sh -i example/artificial_human_chr22.fasta -n 10000 -B 2

-B basecaller : Choose from the following basecaller for the basecalling process. [default = 1]
1: guppy_gpu, 2: guppy_cpu, 3: albacore.
-n read_num : The number of reads need to be simulated. [default = 100]
Set -1 to simulate the whole input sequence without cut (not suitable for genome-level).
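
According to the help text, a non-zero -S value fixes the random seed, which helps when a run needs to be repeated. Below is a minimal sketch that only assembles and prints such a command, so it can be checked before it is launched. build_ds_cmd is a hypothetical helper of mine, the seed 42 is arbitrary, and only flags listed in the help output are used.

```shell
# Hypothetical helper: assemble a DeepSimulator command line from a few values.
# Uses only flags from the help output: -i, -n, -S and -B.
build_ds_cmd() {
    genome=$1   # input genome in FASTA format (-i)
    reads=$2    # number of reads to simulate (-n)
    seed=$3     # non-zero value = fixed seed for reproducibility (-S)
    # -B 2 selects guppy_cpu, as in the run above.
    printf './deep_simulator.sh -i %s -n %s -S %s -B 2\n' "$genome" "$reads" "$seed"
}

# Print the command first; run it by hand once it looks right.
build_ds_cmd example/artificial_human_chr22.fasta 10000 42
```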

Test (preset model)

Simulated with v1.0

[Figure: simulation result with v1.0]

v1.5

[Figure: simulation result with v1.5]

Note that the vertical axes of the two plots do not use the same units (*1).

Addendum

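Watch your free disk space when running this. Because raw fast5 files are written, generating enough reads to cover a whole genome at a given depth takes a lot of space. When I generated 100,000 reads, the fast5 directory alone came to about 80 GB.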
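
To get a rough idea before starting a run, the figure above (about 80 GB for 100,000 reads) can be scaled linearly. The sketch below does that, and also converts a target coverage into a read count with the usual genome_length × coverage / mean_read_length approximation. Both helpers are hypothetical, the conversion is not necessarily what DeepSimulator does internally for -K, and actual sizes will vary with read length.

```shell
# Hypothetical helpers; the numbers are rough estimates, not DeepSimulator output.

# Reads needed for a target coverage: genome_length * coverage / mean_read_length,
# rounded up. This common approximation is assumed here for illustration.
reads_for_coverage() {
    awk -v g="$1" -v c="$2" -v l="$3" 'BEGIN {
        r = g * c / l
        n = int(r)
        if (n < r) n++
        printf "%d\n", n
    }'
}

# fast5 size in GB, scaled linearly from the observation above
# (100,000 reads took about 80 GB).
estimate_fast5_gb() {
    awk -v n="$1" 'BEGIN { printf "%.1f\n", n * 80 / 100000 }'
}

# Example: 5 Mb genome, 30x coverage, 8 kb mean read length.
reads=$(reads_for_coverage 5000000 30 8000)
echo "reads: $reads"
echo "estimated fast5 size: $(estimate_fast5_gb "$reads") GB"

# Free space of the current directory, in 1 KB blocks, for comparison.
df -Pk .
```

For this example the estimate is 18750 reads and about 15 GB of fast5 files.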

Citation

DeepSimulator1.5: a more powerful, quicker and lighter simulator for Nanopore sequencing
Yu Li, Sheng Wang, Chongwei Bi, Zhaowen Qiu, Mo Li, Xin Gao

Bioinformatics, Published: 08 January 2020

v1.0

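*1: When I simulated the same number of reads under the same conditions as with v1.0, v1.5 finished in about one tenth of the time (even though the fast5 directory after the run was more than 1.5 times larger).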