PBSIM3: a simulator for all types of PacBio and ONT long reads

2024/02/12 typo fixes; 11/03 command fixes; 2024/12/29 additions

2025/01/15 additions

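Long-read sequencers such as those from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have improved in read length and accuracy, opening up research that was not possible before. Many tools and algorithms have been developed to analyze long reads, and the rapid progress of PacBio and ONT has further accelerated that development. Alongside high-throughput sequencing technologies and their analysis tools, many read simulators have been developed and put to effective use. PBSIM is one of the popular long-read simulators. In this study, the authors developed PBSIM3 with three new features: error models for long reads, multi-pass sequencing for high-fidelity read simulation, and transcriptome sequencing simulation. PBSIM3 can therefore meet a wide range of long-read simulation requirements.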

Installation

GitHub

#from source
git clone https://github.com/yukiteruono/pbsim3.git
cd pbsim3/
./configure
make
sudo make install

#conda
mamba install bioconda::pbsim3 -y

> ./pbsim

USAGE: pbsim [options]

[general options]

--prefix prefix of output files (sd).

--id-prefix prefix of read ID (S).

--seed for a pseudorandom number generator (Unix time).

[options for whole genome sequencing]

--strategy wgs

--genome FASTA format file (text file only).

--depth depth of coverage (20.0).

--length-min minimum length (100).

--length-max maximum length (1000000).

[options for transcriptome sequencing]

--strategy trans

--transcript original format file.

--length-min minimum length (100).

--length-max maximum length (1000000).

[options for template sequencing]

--strategy templ

--template FASTA format file (text file only).

[options for quality score model]

--method qshmm

--qshmm quality score model.

--length-mean mean length (9000.0).

--length-sd standard deviation of length (7000.0).

--accuracy-mean mean accuracy (0.85).

--pass-num number of sequencing passes (1).

--difference-ratio difference (error) ratio (6:55:39).

(substitution:insertion:deletion)

Each value must be 0-1000, e.g. 1000:1:0 is OK.

Note that the above default value is for PacBio RS II;

22:45:33 for PacBio Sequel and 39:24:36 for ONT are

recommended.

--hp-del-bias bias intensity of deletion in homopolymer (1).

The option specifies the deletion rate at 10-mer, where

the deletion rate at 1-mer is 1. The bias intensity from

1-mer to 10-mer is proportional to the length of the

homopolymer.

[options for error model]

--method errhmm

--errhmm error model.

--length-mean mean length (9000.0).

--length-sd standard deviation of length (7000.0).

--accuracy-mean mean accuracy (0.85).

--pass-num number of sequencing passes (1).

[options for sample-based method]

Note that the method can only be used for wgs strategy.

--sample FASTQ format file to sample (text file only).

--sample-profile-id sample (filtered) profile ID.

When using --sample, profile is stored;

'sample_profile_<ID>.fastq', and

'sample_profile_<ID>.stats' are created.

When not using --sample, profile is re-used.

Note that when profile is used, --length-min,max,

--accuracy-min,max would be the same as the profile.

--accuracy-min minimum accuracy (0.75).

--accuracy-max maximum accuracy (1.00).

--difference-ratio difference (error) ratio (6:55:39).

(substitution:insertion:deletion)

Each value must be 0-1000, e.g. 1000:1:0 is OK.

Note that the above default value is for PacBio RS II;

22:45:33 for PacBio Sequel and 39:24:36 for ONT are

recommended.

--hp-del-bias bias intensity of deletion in homopolymer (1).

The option specifies the deletion rate at 10-mer, where

the deletion rate at 1-mer is 1. The bias intensity from

1-mer to 10-mer is proportional to the length of the

homopolymer.

Note: the version is not printed, so take care not to confuse it with the older pbsim.

Usage

PBSIM3 can simulate WGS and transcriptome sequencing (TS) for PacBio RS II CLR, PacBio Sequel CLR, PacBio Sequel HiFi, and ONT reads.
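
Since no version string is available, one rough way to tell the binaries apart is to look for options that appear in the help output above. The sketch below assumes that `--strategy` and `--errhmm` are listed only by PBSIM3 and not by older pbsim releases; `looks_like_pbsim3` is a hypothetical helper name, not part of PBSIM3.

```shell
# Return success if the help text on stdin looks like PBSIM3.
# Assumption: only PBSIM3 lists both --strategy and --errhmm in its help.
looks_like_pbsim3() {
    help_text=$(cat)
    printf '%s\n' "$help_text" | grep -q -e '--strategy' &&
    printf '%s\n' "$help_text" | grep -q -e '--errhmm'
}
```

It could be used as `pbsim 2>&1 | looks_like_pbsim3 && echo "looks like PBSIM3"`. The `2>&1` is there because I have not checked whether the help goes to stdout or stderr.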

WGS

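Errors are generated by an FIC-HMM built from real reads. ERRHMM-RSII.model, specified below, is an error model built from PacBio RS II reads. Two other error models are provided: ERRHMM-SEQUEL.model, built from PacBio Sequel reads, and ERRHMM-ONT.model, built from ONT reads.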

cd pbsim3/
pbsim --strategy wgs --method errhmm --errhmm data/ERRHMM-RSII.model --depth 20 --genome sample/sample.fasta
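
To switch platforms without retyping model names, a small lookup can help. This is only a sketch: `errhmm_model_for` is a hypothetical helper, and it assumes all three model files sit in the repository's data/ directory next to ERRHMM-RSII.model.

```shell
# Map a platform name to an error model path.
# Assumption: ERRHMM-SEQUEL.model and ERRHMM-ONT.model live in data/,
# in the same place as ERRHMM-RSII.model.
errhmm_model_for() {
    case "$1" in
        rsii)   echo "data/ERRHMM-RSII.model" ;;
        sequel) echo "data/ERRHMM-SEQUEL.model" ;;
        ont)    echo "data/ERRHMM-ONT.model" ;;
        *)      echo "unknown platform: $1" >&2; return 1 ;;
    esac
}
```

For example: `pbsim --strategy wgs --method errhmm --errhmm "$(errhmm_model_for ont)" --depth 20 --genome sample/sample.fasta`.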

As with pbsim2, FASTQ and MAF files are written separately for each reference sequence. The reference sequence of each contig is also written out.

Simulating multi-pass sequencing: set --pass-num to 2 or more.
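
A quick sanity check on a simulated FASTQ is to count the reads and compute the mean length, which can then be compared with --depth and --length-mean. The sketch below assumes a standard 4-lines-per-record FASTQ; `fastq_stats` is a hypothetical helper, and the file name sd_0001.fastq in the usage line is an assumption based on the default prefix sd.

```shell
# Print read count, total bases, and mean read length for a FASTQ file.
# Assumption: each record is exactly 4 lines (no wrapped sequence lines).
fastq_stats() {
    awk 'NR % 4 == 2 { n++; bases += length($0) }   # line 2 of each record is the sequence
         END {
             if (n == 0) { print "reads=0 bases=0 mean_len=0.0"; exit }
             # %.0f avoids integer overflow for large base counts in some awk builds
             printf "reads=%d bases=%.0f mean_len=%.1f\n", n, bases, bases / n
         }' "$1"
}
```

Usage would be something like `fastq_stats sd_0001.fastq`.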

pbsim --strategy wgs --method qshmm --qshmm data/QSHMM-RSII.model --depth 20 --genome sample/sample.fasta --pass-num 10

--pass-num number of sequencing passes (1).

The results are output as MAF and SAM files. Convert the SAM to BAM, then run the CCS tool (introduced separately) to produce a consensus FASTQ (CCS reads) (#10).

samtools view -bS sd_0001.sam > sd_0001.bam
ccs sd_0001.bam sd_0001.fastq.gz
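
With several reference sequences there will be one SAM per sequence, so the two commands above need repeating. Here is a minimal dry-run sketch that only prints the commands for every *.sam in a directory. `print_ccs_commands` is a hypothetical helper, and it assumes paths contain no spaces or shell metacharacters.

```shell
# Print (do not run) the samtools and ccs commands for each SAM file in a directory.
# Assumption: file paths contain no spaces or shell metacharacters.
print_ccs_commands() {
    dir=${1:-.}
    for sam in "$dir"/*.sam; do
        [ -e "$sam" ] || continue          # glob matched nothing
        base=${sam%.sam}                   # strip the .sam extension
        printf 'samtools view -bS %s > %s.bam\n' "$sam" "$base"
        printf 'ccs %s.bam %s.fastq.gz\n' "$base" "$base"
    done
}
```

After reviewing the printed commands, they could be run with `print_ccs_commands . | sh`.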

The repository contains several examples; take a look.

Other notes

Reads simulated with the error model all have the quality code "!".

Citation
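
Downstream tools that filter on base quality may discard such reads, so it can be worth checking a FASTQ before using it. A minimal sketch, assuming a 4-lines-per-record FASTQ; `all_quals_are_bang` is a hypothetical helper name.

```shell
# Succeed only if every quality line in the FASTQ consists solely of "!".
# Assumption: each record is exactly 4 lines, so quality is every 4th line.
all_quals_are_bang() {
    awk 'NR % 4 == 0 && $0 !~ /^!*$/ { bad = 1; exit }   # stop at the first other character
         END { exit bad + 0 }' "$1"
}
```

For example: `all_quals_are_bang reads.fastq && echo "no usable quality values"`, where reads.fastq stands for any simulated FASTQ.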

PBSIM3: a simulator for all types of PacBio and ONT long reads
Yukiteru Ono, Michiaki Hamada, Kiyoshi Asai
NAR Genomics and Bioinformatics, Volume 4, Issue 4, December 2022