PacBioのロングリードのシミュレーター SimLoRD

2019 2/9 インストール手順修正

2019 7/23 コマンド修正

SMRT（single molecule real time）シーケンシングのような第3世代シークエンシング技術は、第2世代の方法よりもかなり長いリードを出力可能なため、ますます使用されててきている。 SMRTのエラー特性は従来の技術と根本的に異なる。基本エラー率はより高い（10-15％）が、偏りがなく均一に分布していると考えらる（論文より　Eid et al、2009）SMRTデータまたはハイブリッドデータからの配列分析のためにより多くのバイオインフォマティクスアプリケーションが開発されるにつれ、ゲノムアセンブリ、SNPコール、構造変異検出は、SMRT技術の詳細を考慮したシミュレーターの恩恵を受けるだろう。 454、IlluminaおよびSOLiDのリード（Huang et al、2012）用のARTなどの第2世代テクノロジの既存シミュレータは、これを実行できない。

SMRTリード用に設計されたシミュレータはほとんどない（論文執筆時点）。 PBSAS（Ono et al、2013）、FASTQSim（Shcherbina、2014）、BLASRパッケージ（Chaisson and Tesler、2012）を参照。（一部略）PBSIMは、リファレンスとシミュレートされたリードの間にSAMフォーマットのアラインメントを提供しない。 FASTQSimは、リードの分析とシミュレーションの両方を行うための汎用ツールである。特に、SMRTシミュレーションのために事前設定されたパラメータを使うが、既存のデータセットをそのプロパティに関して分析し、トレーニングデータに応じてシミュレートすることも可能になっている。しかし、シミュレーションのマッピング情報やアラインメント情報を提供することができず、速度がかなり遅くなる（N. crassaで8700リードをシミュレートするのに30コアでに90分要した）。シミュレートされた長さ/クオリティ分布はデータ（論文　図1および補足）とよくは一致せず、パラメータを直接変更することは困難である。

　既存のソリューションを改善するために、著者らは、SimLoRDと呼ばれる新しいロングリードシミュレータを開発した。これは、テクニカルスペックが変更されたときに使いやすく、簡単に再構成できる。デフォルト値は、SMRT技術の現在の状態（2016年3月）に従って現実的なシミュレーション結果を提供する（図1参照）。

マニュアル

genomeinformatics / simlord — Bitbucket

全オプション

https://bitbucket.org/genomeinformatics/simlord/wiki/Home

ダウンロード

https://bitbucket.org/genomeinformatics/simlord/

インストール

依存

python3
scipy
numpy
pysam
dinopy

#anaconda環境ならcondaで導入できる (mac, linux)。
conda install -c bioconda -y simlord

> simlord -h

$ simlord -h

usage: simlord [-h] [--version]

(--read-reference PATH | --generate-reference GC LENGTH)

[--save-reference PATH] [--num-reads INT | --coverage INT]

[--chi2-params-s PAR PAR PAR PAR PAR]

[--chi2-params-n PAR PAR PAR] [--max-passes INT]

[--sqrt-params PAR PAR] [--norm-params PAR PAR]

[--probability-threshold FLOAT] [--prob-ins FLOAT]

[--prob-del FLOAT] [--prob-sub FLOAT] [--min-readlength INT]

[--lognorm-readlength [PARAMETER [PARAMETER ...]] |

--fixed-readlength INT | --sample-readlength-from-fastq PATH

[PATH ...] | --sample-readlength-from-text PATH]

[--sam-output SAM_OUTPUT] [--no-sam] [--without-ns]

[--uniform-chromosome-probability]

OUTPUT_PREFIX

SimLoRD v1.0.2 -- SimLoRD is a read simulator for long reads from third

generation sequencing and is currently focused on the Pacific Biosciences SMRT

error model.

positional arguments:

OUTPUT_PREFIX Save the simulated reads as a fastq-file at

OUTPUT_PREFIX.fastq

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

--read-reference PATH, -rr PATH

Read a reference from PATH to sample reads from

--generate-reference GC LENGTH, -gr GC LENGTH

Generate a random reference with given GC-content and

given length

--save-reference PATH, -sr PATH

Save the random reference as fasta-file at given PATH.

By default, save at output path with

'_reference.fasta' appended.

--num-reads INT, -n INT

Number of reads to simulate (1000).

--coverage INT, -c INT

Desired read coverage.

--chi2-params-s PAR PAR PAR PAR PAR, -xs PAR PAR PAR PAR PAR

Parameters for the curve determining the parameter

scale for the chi^2 distribution: m,b, z, c, a for

'm*x + b' if x <= z and 'c * x^-a' if x > z (default=

(0.01214, -5.12, 675, 48303.0732881,

1.4691051212330266))

--chi2-params-n PAR PAR PAR, -xn PAR PAR PAR

Parameters for the function determining the parameter

n for the chi^2 distribution: m, b, z for 'm*x + b' if

x < z and 'm*z + b' for x >=z (default=

(0.00189237136, 2.5394497, 5500)).

--max-passes INT, -mp INT

Maximal number of passes for one molecule (default=

40).

--sqrt-params PAR PAR, -sq PAR PAR

Parameters for the sqare root function for the quality

increase: a, b for 'sqrt(x+a) - b' (default= (0.5,

0.2247))

--norm-params PAR PAR, -nd PAR PAR

Parameters for normal distributed noise added to

quality increase sqare root function (default= (0,

0.2))

--probability-threshold FLOAT, -t FLOAT

Upper bound for the modified total error probability

(default= 0.2)

--prob-ins FLOAT, -pi FLOAT

Probability for insertions for reads with one pass.

(default= 0.11)

--prob-del FLOAT, -pd FLOAT

Probability for deletions for reads with one pass.

(default= 0.04)

--prob-sub FLOAT, -ps FLOAT

Probability for substitutions for reads with one pass.

(default= 0.01)

--min-readlength INT, -mr INT

Minium read length (default= 50) for lognormal

distribution

--lognorm-readlength [PARAMETER [PARAMETER ...]], -ln [PARAMETER [PARAMETER ...]]

Parameters for lognormal read length distribution:

(sigma, loc, scale), empty for defaults

--fixed-readlength INT, -fl INT

Fixed read length for all reads.

--sample-readlength-from-fastq PATH [PATH ...], -sf PATH [PATH ...]

Sample read length from a fastq-file at PATH

containing reads.

--sample-readlength-from-text PATH, -st PATH

Sample read length from a text file (one length per

line).

--sam-output SAM_OUTPUT, -so SAM_OUTPUT

Save the alignments in a sam-file at SAM_OUTPUT. By

default, use OUTPUT_PREFIX.sam.

--no-sam Do not calculate the alignment and write a sam file.

--without-ns Skip regions containing Ns and sample reads only from

parts completly without Ns.

--uniform-chromosome-probability

Sample chromosomes for reads equally distributed

instead of weighted by their length. (Was default

behaviour up to version 1.0.1)

実行方法

デフォルト条件のラン。1000リード発生させる。

simlord --read-reference ref.fasta -n 1000 output

--read-reference　Read a reference from PATH to sample reads from
-n　 Number of reads to simulate (1000).

output.fastqとoutput.samが出力される。

5000-bp固定で10000リード発生させる。エラー率は12% insertion、12% deletion、 2% substitutionとする。samは作らないなら"--no-sam"をつける。

simlord --read-reference ref.fasta -n 10000 -fl 5000 -pi 0.12 -pd 0.12 -ps 0.02 output

-fl　Fixed read length for all reads.
--prob-ins　Probability for insertions for reads with one pass.
--prob-del　Probability for deletions for reads with one pass.
--prob-sub　Probability for substitutions for reads with one pass. (default= 0.01)

その他、リファレンスを使わずランダムな配列を発生させるオプションもあります。

引用

SimLoRD: Simulation of Long Read Data.

Stöcker BK, Köster J, Rahmann S.

Bioinformatics. 2016 Sep 1;32(17):2704-6.