ロングリードのシミュレータ PBSIM2 - macでインフォマティクス

　近年、PacBioやOxford Nanoporeなどのハイスループットのロングリードシーケンサーが登場し、ショートリードシーケンサーに比べてエラーの多いロングリードが生成されるようになった。リードのエラー率の高さに加えて、エラーの不均一性は、ロングリードを用いた様々なダウンストリーム解析の難しさにつながる。ロングリードのエラーパターンを特徴づけ、それをシミュレートする有用なシミュレーターが多く開発されている。しかし、エラーの非一様性のシミュレーションにはまだ改善の余地がある。
　ここでは、ロングリードシーケンサのリードにおけるエラーの特徴を捉えるために、品質スコアの生成モデルを紹介する。このモデルでは、因子化情報量規準と呼ばれる最新のモデル選択法を用いた隠れマルコフモデルが利用されている。開発したシミュレータを様々な角度から評価した結果、本シミュレータは実際のリードと一致するリードをうまくシミュレートしていることがわかった。PBSIM2のソースコードは、https://github.com/yukiteruono/pbsim2 から自由に入手できる。

PBSIMはPacBioのCLRのリードやNanoporeのリードをシミュレートする。

インストール

ubuntu18.04LTSのpython3.9でmambaを使って導入・テストした。

Github

#bioconda (link)
mamba install -c bioconda pbsim2 -y

#from source
git clone https://github.com/yukiteruono/pbsim2.git
cd pbsim2/
./configure
make
sudo make install

> pbsim2

USAGE: pbsim [options] <reference>

<reference> FASTA format file (text file only).

[general options]

--prefix prefix of output files (sd).

--id-prefix prefix of read ID (S).

--depth depth of coverage (20.0).

--length-min minimum length (100).

--length-max maximum length (1000000).

--difference-ratio ratio of differences (6:50:54).

(substitution:insertion:deletion)

Each value must be 0-1000, e.g. 1000:1:0 is OK.

Note that the above default value is for PacBio, and

for Nanopore we recommend increasing substitution rate

and decreasing insertion rate, such as 23:31:46.

--seed for a pseudorandom number generator (Unix time).

[options of sampling-based simulation]

--sample-fastq FASTQ format file to sample (text file only).

--sample-profile-id sample-fastq (filtered) profile ID.

when using --sample-fastq, profile is stored.

'sample_profile_<ID>.fastq', and

'sample_profile_<ID>.stats' are created.

when not using --sample-fastq, profile is re-used.

Note that when profile is used, --length-min,max,

--accuracy-min,max would be the same as the profile.

--accuracy-min minimum accuracy (0.75).

--accuracy-max maximum accuracy (1.00).

[options of model-based simulation].

--hmm_model HMM model of quality code.

--length-mean mean of length model (9000.0).

--length-sd standard deviation of length model (7000.0).

--accuracy-mean mean of accuracy model (0.85).

テストラン

モデルベースのシミュレーション。カバレッジデプスは20とする。

pbsim --depth 20 --hmm_model data/P6C4.model sample/sample.fasta

logや中間ファイルなどが出力される。"sd_0001.maf"は、MAFフォーマットの参照配列と模擬リードの間のアラインメントファイル。シミュレートされたfastqは"sd_0001.fastq"というファイル名になっている。出力だが、参照する配列がmulti-FASTAファイルである場合、シミュレーションリードの出力ファイルはFASTAごとに作成される。

サンプリングベースのシミュレーション

pbsim --depth 20 --sample-fastq sample/sample.fastq sample/sample.fasta

サンプリングベースのシミュレーションでは、リードの長さとクオリティスコアは、実際のリードデータセット"sample/sample.fastq"からランダムにサンプリングされたリードのものと同じになる。

同じリードデータセットを使って、カバレッジデプスの異なる複数のシミュレーションデータを作成することもできます。レポジトリを確認して下さい。

引用

PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores
Yukiteru Ono, Kiyoshi Asai, Michiaki Hamada
Bioinformatics, Volume 37, Issue 5, 1 March 2021, Pages 589–595