リアルデータに忠実なショートリードをシミュレートする ReSeq

2021 2/20 論文引用、condaによるインストール追記

2021 3/1 追記

2021 5/25 エラーについて追記

　ハイスループットのシーケンシングデータでは、生データから科学的な結果に至るまでのデータ処理において、計算ツール間の性能比較は、情報に基づいた意思決定を行うために不可欠である。シミュレーションは手法比較の重要な要素だが、標準的なIlluminaのゲノムDNAシークエンシングの場合、シミュレーションは単純化されすぎていることが多く、ほとんどのツールでは楽観的な結果になってしまう。
　ReSeqは、実際のデータから重要な成分を抽出して再現することで、合成データの信頼性を向上させる。主な改良点は、システマティックなエラー、フラグメントベースのカバレッジモデル、2次元のマージンに基づくサンプリングマトリクスの推定値を含めることである。これらの改良により、元のk-merスペクトルをより良く表現し、より忠実な性能評価が可能になった。ReSeqとそのすべてのコードはhttps://github.com/schmeing/ReSeqから入手できる。

GC bias, flanking bias, strand-specific systematic errors, etc. .. big parameter space.

Here are k-mer profiles across various datasets for real data + synthetic data from various existing simulators and ReSeq (@StephanSchmeing's new tool). pic.twitter.com/JduuX3XNR8
— Mark Robinson (@markrobinsonca) 2020年7月18日

"ReSeq simulates realistic Illumina high-throughput sequencing data" is finally published: https://t.co/7JMqMKdiW5
It has 3 more datasets, some comparisons additionally on all datasets and some clarifications/restructuring compared to the preprint.
— Schmeing (@StephanSchmeing) 2021年2月22日

インストール

swigの導入でコンフリクトが起きたため、ubuntu18.04のdockerイメージを使ってテストした (*1)。

ビルド依存

Github参照

Github

git clone https://github.com/schmeing/ReSeq.git
cd ReSeq
mkdir build
cd build
cmake ..
make
sudo make install

#2021 22/20 conda (link) （ここでは高速なmamba使用）
mamba install -c bioconda reseq -y

> ./reseq

# ./reseq

Program: reseq (REal SEQuence replicator)

Version: 1.0

Contact: Stephan Schmeing <stephan.schmeing@uzh.ch>

Usage: reseq <command> [options]

Commands:

getRefSeqBias stores reference sequence bias from reseq statistic files in tsv format

illuminaPE simulates illumina paired-end data

replaceN replaces N's in reference

test tests the program

> ./reseq getRefSeqBias

# ./reseq getRefSeqBias

>>> 26-07-20 18:53:13: info: Running ReSeq version 1.0 in getRefSeqBias mode

!!! int main(int, char**) [/home/ReSeq/reseq/main.cpp:470]: error: ref option is mandatory.

Usage: reseq getRefSeqBias -r <ref.fa> -s <stats.reseq> -o <out.tsv>

General:

-h [ --help ] Prints help information and exits

-j [ --threads ] arg (=0) Number of threads used (0=auto)

--verbosity arg (=4) Sets the level of verbosity (4=everything,

0=nothing)

--version Prints version info and exits

getRefSeqBias:

-o [ --output ] arg Output file for the reference sequence bias (tsv

format)

-r [ --ref ] arg Reference sequences in fasta format (gz and bz2

supported)

-s [ --stats ] arg Reseq statistics file to extract reference sequence

bias

> ./reseq illuminaPE

# ./reseq illuminaPE

>>> 26-07-20 18:53:37: info: Running ReSeq version 1.0 in illuminaPE mode

>>> 26-07-20 18:53:37: info: Detected 56 cores to be used.

!!! int main(int, char**) [/home/ReSeq/reseq/main.cpp:690]: error: refIn or refSim option mandatory.

Usage: reseq illuminaPE -b <file.bam> -r <ref.fa> -1 <file1.fq> -2 <file2.fq> [options]

General:

-h [ --help ] Prints help information and exits

-j [ --threads ] arg (=0) Number of threads used (0=auto)

--verbosity arg (=4) Sets the level of verbosity

(4=everything, 0=nothing)

--version Prints version info and exits

Stats:

--adapterFile arg Fasta file with adapter sequences

[(AutoDetect)]

--adapterMatrix arg 0/1 matrix with valid adapter pairing

(first read in rows, second read in

columns) [(AutoDetect)]

-b [ --bamIn ] arg Position sorted bam/sam file with reads

mapped to refIn

--binSizeBiasFit arg (=100000000) Reference sequences large then this are

split for bias fitting to limit memory

consumption

--maxFragLen arg (=2000) Maximum fragment length to include

pairs into statistics

--minMapQ arg (=10) Minimum mapping quality to include

pairs into statistics

--noBias Do not perform bias fit. Results in

uniform coverage if simulated from

--noTiles Ignore tiles for the statistics

[default]

-r [ --refIn ] arg Reference sequences in fasta format (gz

and bz2 supported)

--statsOnly Only generate the statistics

-s [ --statsIn ] arg Skips statistics generation and reads

directly from stats file

-S [ --statsOut ] arg Stores the real data statistics for

reuse in given file [<bamIn>.reseq]

--tiles Use tiles for the statistics

-v [ --vcfIn ] arg Ignore all positions with a listed

variant for stats generation

Probabilities:

--ipfIterations arg (=200) Maximum number of iterations for

iterative proportional fitting

--ipfPrecision arg (=5) Iterative proportional fitting

procedure stops after reaching this

precision (%)

-p [ --probabilitiesIn ] arg Loads last estimated probabilities and

continues from there if precision is

not met [<statsIn>.ipf]

-P [ --probabilitiesOut ] arg Stores the probabilities estimated by

iterative proportional fitting

[<probabilitiesIn>]

--stopAfterEstimation Stop after estimating the probabilities

Simulation:

-1 [ --firstReadsOut ] arg Writes the simulated first reads into

this file [reseq-R1.fq]

-2 [ --secondReadsOut ] arg Writes the simulated second reads into

this file [reseq-R2.fq]

-c [ --coverage ] arg (=0) Approximate average read depth

simulated (0 = Corrected original

coverage)

--errorMutliplier arg (=1) Divides the original probability of

correct base calls(no substitution

error) by this value and renormalizes

--methylation arg Extended bed graph file specifying

methylation for regions. Multiple score

columns for individual alleles are

possible, but must match with vcfSim.

C->T conversions for 1-specified value

in region.

--numReads arg (=0) Approximate number of read pairs

simulated (0 = Use <coverage>)

--readSysError arg Read systematic errors from file in

fastq format (seq=dominant error,

qual=error percentage)

--recordBaseIdentifier arg (=ReseqRead)

Base Identifier for the simulated fastq

records, followed by a count and other

information about the read

--refBias arg Way to select the reference biases for

simulation (keep [from refIn]/no

[biases]/draw [with replacement from

original biases]/file) [keep/no]

--refBiasFile arg File to read reference biases from: One

sequence per file (identifier bias)

-R [ --refSim ] arg Reference sequences in fasta format to

simulate from [<refIn>]

--seed arg Seed used for simulation, if none is

given random seed will be used

-V [ --vcfSim ] arg Defines genotypes to simulate alleles

or populations

--writeSysError arg Write the randomly drawn systematic

errors to file in fastq format

(seq=dominant error, qual=error

percentage)

> ./reseq replaceN

# ./reseq replaceN

>>> 26-07-20 18:54:09: info: Running ReSeq version 1.0 in replaceN mode

>>> 26-07-20 18:54:09: info: Detected 56 cores to be used.

!!! int main(int, char**) [/home/ReSeq/reseq/main.cpp:555]: error: refIn option is mandatory.

Usage: reseq replaceN -r <refIn.fa> -R <refSim.fa> [options]

General:

-h [ --help ] Prints help information and exits

-j [ --threads ] arg (=0) Number of threads used (0=auto)

--verbosity arg (=4) Sets the level of verbosity (4=everything,

0=nothing)

--version Prints version info and exits

ReplaceN:

-r [ --refIn ] arg Reference sequences in fasta format (gz and bz2

supported)

-R [ --refSim ] arg File to where reference sequences in fasta format

with N's randomly replace should be written to

--seed arg Seed used for replacing N, if none is given random

seed will be used

テスト

> ./reseq test

f:id:kazumaxneo:20200726185715p:plain

実行方法

１、リアルデータをリファレンスゲノムにマッピングする。

bowtie2-build my_reference.fa my_reference
bowtie2 -p 32 -X 2000 -x my_reference -1 my_data_1.fq -2 my_data_2.fq | samtools sort -m 10G -@ 4 -T _tmp -o my_mappings.bam -

2、ReSeq illuminaPEを実行する。

reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq

-1 Writes the simulated first reads into this file [reseq-R1.fq]
-2 Writes the simulated second reads into this file [reseq-R2.fq]
-j Number of threads used (0=auto)

エラーが起きる。ランできるようになったら追記します。

=>20210301に condaを使ってubuntu18.04の計算機に導入し直した。問題なくランできた。

引用

ReSeq simulates realistic Illumina high-throughput sequencing data

Stephan Schmeing, Mark D. Robinson

bioRxiv, Posted July 17, 2020

2021 2/20

ReSeq simulates realistic Illumina high-throughput sequencing data

Stephan Schmeing & Mark D. Robinson
Genome Biology volume 22, Article number: 67 (2021)

新規ubuntuイメージを立ち上げ、依存ライブラリを導入後ビルドした。

cd data_dir
docker run -itv $PWD:/data ubuntu

apt update && apt install swig build-essential zlib1g-dev python-dev git cmake libboost-all-dev

#clone from the ReSeq repository
git clone https://github.com/schmeing/ReSeq.git
cd ReSeq
mkdir build
cd build
cmake ..
make

dockerhubのイメージ ( ランできることは確認しています。bowtie2はホスト側でランして下さい)

https://hub.docker.com/r/cristaniguti/reseq/tags?page=1&ordering=last_updated

docker pull cristaniguti/reseq:latest

error

!!! T& reseq::utilities::at(seqan::String<TString>&, size_t) [with T = seqan::SimpleType<unsigned char, seqan::Dna_>; size_t = long unsigned int] [/ReSeq/reseq/utilities.hpp:132]: error: Called position 9. String size is only 9.

が起きたので、--adapterFile TruSeq_singleをつけてランした（参考）。

#step1
reseq illuminaPE -j 12 -r ref.fa -b my_mappings.bam --statsOnly --adapterFile TruSeq_singlemy_simulated_data_1.fq -2 my_simulated_data_2.fq

#step2
reseq illuminaPE -j 12 -s my_mappings.bam.reseq --stopAfterEstimation

#step3
reseq illuminaPE -j 12 -R ref.fa -s my_mappings.bam.reseq --ipfIterations 0 -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq