2021 2/20 論文引用、condaによるインストール追記
2021 3/1 追記
2021 5/25 エラーについて追記
ハイスループットのシーケンシングデータでは、生データから科学的な結果に至るまでのデータ処理において、計算ツール間の性能比較は、情報に基づいた意思決定を行うために不可欠である。シミュレーションは手法比較の重要な要素だが、標準的なIlluminaのゲノムDNAシークエンシングの場合、シミュレーションは単純化されすぎていることが多く、ほとんどのツールでは楽観的な結果になってしまう。
ReSeqは、実際のデータから重要な成分を抽出して再現することで、合成データの信頼性を向上させる。主な改良点は、システマティックなエラー、フラグメントベースのカバレッジモデル、2次元のマージンに基づくサンプリングマトリクスの推定値を含めることである。これらの改良により、元のk-merスペクトルをより良く表現し、より忠実な性能評価が可能になった。ReSeqとそのすべてのコードはhttps://github.com/schmeing/ReSeqから入手できる。
GC bias, flanking bias, strand-specific systematic errors, etc. .. big parameter space.
— Mark Robinson (@markrobinsonca) 2020年7月18日
Here are k-mer profiles across various datasets for real data + synthetic data from various existing simulators and ReSeq (@StephanSchmeing's new tool). pic.twitter.com/JduuX3XNR8
"ReSeq simulates realistic Illumina high-throughput sequencing data" is finally published: https://t.co/7JMqMKdiW5
— Schmeing (@StephanSchmeing) 2021年2月22日
It has 3 more datasets, some comparisons additionally on all datasets and some clarifications/restructuring compared to the preprint.
インストール
swigの導入でコンフリクトが起きたため、ubuntu18.04のdockerイメージを使ってテストした (*1)。
ビルド依存
- Github参照
git clone https://github.com/schmeing/ReSeq.git
cd ReSeq
mkdir build
cd build
cmake ..
make
sudo make install
#2021 22/20 conda (link) (ここでは高速なmamba使用)
mamba install -c bioconda reseq -y
> ./reseq
# ./reseq
Program: reseq (REal SEQuence replicator)
Version: 1.0
Contact: Stephan Schmeing <stephan.schmeing@uzh.ch>
Usage: reseq <command> [options]
Commands:
getRefSeqBias stores reference sequence bias from reseq statistic files in tsv format
illuminaPE simulates illumina paired-end data
replaceN replaces N's in reference
test tests the program
> ./reseq getRefSeqBias
# ./reseq getRefSeqBias
>>> 26-07-20 18:53:13: info: Running ReSeq version 1.0 in getRefSeqBias mode
!!! int main(int, char**) [/home/ReSeq/reseq/main.cpp:470]: error: ref option is mandatory.
Usage: reseq getRefSeqBias -r <ref.fa> -s <stats.reseq> -o <out.tsv>
General:
-h [ --help ] Prints help information and exits
-j [ --threads ] arg (=0) Number of threads used (0=auto)
--verbosity arg (=4) Sets the level of verbosity (4=everything,
0=nothing)
--version Prints version info and exits
getRefSeqBias:
-o [ --output ] arg Output file for the reference sequence bias (tsv
format)
-r [ --ref ] arg Reference sequences in fasta format (gz and bz2
supported)
-s [ --stats ] arg Reseq statistics file to extract reference sequence
bias
> ./reseq illuminaPE
# ./reseq illuminaPE
>>> 26-07-20 18:53:37: info: Running ReSeq version 1.0 in illuminaPE mode
>>> 26-07-20 18:53:37: info: Detected 56 cores to be used.
!!! int main(int, char**) [/home/ReSeq/reseq/main.cpp:690]: error: refIn or refSim option mandatory.
Usage: reseq illuminaPE -b <file.bam> -r <ref.fa> -1 <file1.fq> -2 <file2.fq> [options]
General:
-h [ --help ] Prints help information and exits
-j [ --threads ] arg (=0) Number of threads used (0=auto)
--verbosity arg (=4) Sets the level of verbosity
(4=everything, 0=nothing)
--version Prints version info and exits
Stats:
--adapterFile arg Fasta file with adapter sequences
[(AutoDetect)]
--adapterMatrix arg 0/1 matrix with valid adapter pairing
(first read in rows, second read in
columns) [(AutoDetect)]
-b [ --bamIn ] arg Position sorted bam/sam file with reads
mapped to refIn
--binSizeBiasFit arg (=100000000) Reference sequences large then this are
split for bias fitting to limit memory
consumption
--maxFragLen arg (=2000) Maximum fragment length to include
pairs into statistics
--minMapQ arg (=10) Minimum mapping quality to include
pairs into statistics
--noBias Do not perform bias fit. Results in
uniform coverage if simulated from
--noTiles Ignore tiles for the statistics
[default]
-r [ --refIn ] arg Reference sequences in fasta format (gz
and bz2 supported)
--statsOnly Only generate the statistics
-s [ --statsIn ] arg Skips statistics generation and reads
directly from stats file
-S [ --statsOut ] arg Stores the real data statistics for
reuse in given file [<bamIn>.reseq]
--tiles Use tiles for the statistics
-v [ --vcfIn ] arg Ignore all positions with a listed
variant for stats generation
Probabilities:
--ipfIterations arg (=200) Maximum number of iterations for
iterative proportional fitting
--ipfPrecision arg (=5) Iterative proportional fitting
procedure stops after reaching this
precision (%)
-p [ --probabilitiesIn ] arg Loads last estimated probabilities and
continues from there if precision is
not met [<statsIn>.ipf]
-P [ --probabilitiesOut ] arg Stores the probabilities estimated by
iterative proportional fitting
[<probabilitiesIn>]
--stopAfterEstimation Stop after estimating the probabilities
Simulation:
-1 [ --firstReadsOut ] arg Writes the simulated first reads into
this file [reseq-R1.fq]
-2 [ --secondReadsOut ] arg Writes the simulated second reads into
this file [reseq-R2.fq]
-c [ --coverage ] arg (=0) Approximate average read depth
simulated (0 = Corrected original
coverage)
--errorMutliplier arg (=1) Divides the original probability of
correct base calls(no substitution
error) by this value and renormalizes
--methylation arg Extended bed graph file specifying
methylation for regions. Multiple score
columns for individual alleles are
possible, but must match with vcfSim.
C->T conversions for 1-specified value
in region.
--numReads arg (=0) Approximate number of read pairs
simulated (0 = Use <coverage>)
--readSysError arg Read systematic errors from file in
fastq format (seq=dominant error,
qual=error percentage)
--recordBaseIdentifier arg (=ReseqRead)
Base Identifier for the simulated fastq
records, followed by a count and other
information about the read
--refBias arg Way to select the reference biases for
simulation (keep [from refIn]/no
[biases]/draw [with replacement from
original biases]/file) [keep/no]
--refBiasFile arg File to read reference biases from: One
sequence per file (identifier bias)
-R [ --refSim ] arg Reference sequences in fasta format to
simulate from [<refIn>]
--seed arg Seed used for simulation, if none is
given random seed will be used
-V [ --vcfSim ] arg Defines genotypes to simulate alleles
or populations
--writeSysError arg Write the randomly drawn systematic
errors to file in fastq format
(seq=dominant error, qual=error
percentage)
> ./reseq replaceN
# ./reseq replaceN
>>> 26-07-20 18:54:09: info: Running ReSeq version 1.0 in replaceN mode
>>> 26-07-20 18:54:09: info: Detected 56 cores to be used.
!!! int main(int, char**) [/home/ReSeq/reseq/main.cpp:555]: error: refIn option is mandatory.
Usage: reseq replaceN -r <refIn.fa> -R <refSim.fa> [options]
General:
-h [ --help ] Prints help information and exits
-j [ --threads ] arg (=0) Number of threads used (0=auto)
--verbosity arg (=4) Sets the level of verbosity (4=everything,
0=nothing)
--version Prints version info and exits
ReplaceN:
-r [ --refIn ] arg Reference sequences in fasta format (gz and bz2
supported)
-R [ --refSim ] arg File to where reference sequences in fasta format
with N's randomly replace should be written to
--seed arg Seed used for replacing N, if none is given random
seed will be used
テスト
> ./reseq test
実行方法
1、リアルデータをリファレンスゲノムにマッピングする。
bowtie2-build my_reference.fa my_reference
bowtie2 -p 32 -X 2000 -x my_reference -1 my_data_1.fq -2 my_data_2.fq | samtools sort -m 10G -@ 4 -T _tmp -o my_mappings.bam -
2、ReSeq illuminaPEを実行する。
reseq illuminaPE -j 32 -r my_reference.fa -b my_mappings.bam -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq
- -1 Writes the simulated first reads into this file [reseq-R1.fq]
- -2 Writes the simulated second reads into this file [reseq-R2.fq]
- -j Number of threads used (0=auto)
エラーが起きる。ランできるようになったら追記します。
=>20210301に condaを使ってubuntu18.04の計算機に導入し直した。問題なくランできた。
引用
ReSeq simulates realistic Illumina high-throughput sequencing data
Stephan Schmeing, Mark D. Robinson
bioRxiv, Posted July 17, 2020
2021 2/20
ReSeq simulates realistic Illumina high-throughput sequencing data
Stephan Schmeing & Mark D. Robinson
Genome Biology volume 22, Article number: 67 (2021)
*1
新規ubuntuイメージを立ち上げ、依存ライブラリを導入後ビルドした。
cd data_dir
docker run -itv $PWD:/data ubuntu
apt update && apt install swig build-essential zlib1g-dev python-dev git cmake libboost-all-dev
#clone from the ReSeq repository
git clone https://github.com/schmeing/ReSeq.git
cd ReSeq
mkdir build
cd build
cmake ..
make
dockerhubのイメージ ( ランできることは確認しています。bowtie2はホスト側でランして下さい)
https://hub.docker.com/r/cristaniguti/reseq/tags?page=1&ordering=last_updated
docker pull cristaniguti/reseq:latest
error
!!! T& reseq::utilities::at(seqan::String<TString>&, size_t) [with T = seqan::SimpleType<unsigned char, seqan::Dna_>; size_t = long unsigned int] [/ReSeq/reseq/utilities.hpp:132]: error: Called position 9. String size is only 9.
が起きたので、--adapterFile TruSeq_singleをつけてランした(参考)。
#step1
reseq illuminaPE -j 12 -r ref.fa -b my_mappings.bam --statsOnly --adapterFile TruSeq_singlemy_simulated_data_1.fq -2 my_simulated_data_2.fq
#step2
reseq illuminaPE -j 12 -s my_mappings.bam.reseq --stopAfterEstimation
#step3
reseq illuminaPE -j 12 -R ref.fa -s my_mappings.bam.reseq --ipfIterations 0 -1 my_simulated_data_1.fq -2 my_simulated_data_2.fq