NanoSim, a simulator for ONT reads - macでインフォマティクス

NanoSim is a simulator for Oxford Nanopore long reads, published in 2017. It builds a profile from ONT reads supplied by the user and generates simulated long reads based on that profile.

Installation

Dependencies

Python packages:

git clone https://github.com/bcgsc/NanoSim.git
cd NanoSim/

How to run

A run consists of two steps. The first step builds a model from the ONT sequencing data you specify.

./read_analysis.py -i ONT.fasta -r reference.fa

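LAST is used to align the ONT reads to the reference genome. Because this alignment is what the error profile is estimated from, do not use contigs assembled from the ONT reads themselves as the reference.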

Several files named ref_genome~ and training~ are created in the current directory (when -o is not specified).
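If you want to drive this first step from a script, something like the following might do. It is only a rough sketch: it assembles the command line from the flags mentioned in this post (-i, -r, -o) and nothing else. The helper name build_read_analysis_cmd and the file names are made up for illustration, and the actual call is left commented out because it needs NanoSim and LAST to be installed.

```python
import subprocess


def build_read_analysis_cmd(reads, reference, prefix=None,
                            script="./read_analysis.py"):
    """Assemble the argv list for the profiling step (hypothetical helper)."""
    cmd = [script, "-i", reads, "-r", reference]
    if prefix is not None:
        # -o sets the output prefix; it is optional, as in the post.
        cmd += ["-o", prefix]
    return cmd


cmd = build_read_analysis_cmd("ONT.fasta", "reference.fa")
print(" ".join(cmd))

# Uncomment to actually run the step (requires NanoSim and LAST):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than one string avoids shell quoting problems when file names contain spaces.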

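The authors provide profiles and sequence data for yeast and for E. coli ONT reads sequenced in 1D and 2D (both R7 and R9 are available). If you have no ONT reads of your own, use these. They are hosted on the authors' FTP server.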

wget 'ftp://ftp.bcgsc.ca/supplementary/NanoSim/yeast*' # for example, download the yeast data

yeast_2D.fasta

yeast_S288C_ref.fa

yeast_profile.zip

These three files are downloaded. yeast_2D.fasta contains the ONT reads, and yeast_S288C_ref.fa is the reference genome.
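Before building a profile it can be handy to check what the read FASTA actually contains. Below is a minimal sketch in plain Python that counts the reads and summarizes their lengths. The function names fasta_lengths and summarize are my own, and the demo uses a tiny in-memory FASTA rather than the real yeast_2D.fasta.

```python
import io


def fasta_lengths(handle):
    """Yield (name, length) for each record in a FASTA file handle."""
    name, length = None, 0
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, length
            parts = line[1:].split()
            name, length = (parts[0] if parts else ""), 0
        elif name is not None:
            # Sequence lines may be wrapped, so lengths are accumulated.
            length += len(line)
    if name is not None:
        yield name, length


def summarize(handle):
    """Return read count plus min, max and mean read length."""
    lengths = [n for _, n in fasta_lengths(handle)]
    if not lengths:
        return {"reads": 0, "min": 0, "max": 0, "mean": 0.0}
    return {
        "reads": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": sum(lengths) / len(lengths),
    }


# Tiny in-memory example; for real data use e.g. open("yeast_2D.fasta").
demo = io.StringIO(">r1 some description\nACGT\nAC\n>r2\nGGG\n")
stats = summarize(demo)
print(stats)
```

The minimum and maximum lengths give a rough idea of sensible values for --min_len and --max_len in the simulation step.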

Step two: simulating the sequences. Run the simulator with the training~ profiles created in the first step.

./simulator.py linear -r reference.fa -c training

-r　reference genome in fasta file, specify path and file name
--max_len　Maximum read length, default = Inf
--min_len　Minimum read length, default = 50
--perfect　Output perfect reads, no mutations, default = False
--KmerBias　prohibits homopolymers with length >= 6 bases in output reads, can be omitted
-o　The prefix of output file, default = 'simulated'
-n　Number of generated reads, default = 20,000 reads
-c　the prefix of training set profiles, same as the output prefix in read_analysis.py, default = training
circular | linear　Do not choose 'circular' when there is more than one sequence in the reference
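The rule about 'circular' is easy to forget, so here is a rough sketch of a guard that counts the sequences in the reference before the simulator command is assembled. Only options listed above (-r, -c, -n and the circular/linear command) are used. The helper names count_fasta_records, check_mode and build_simulator_cmd are hypothetical, and the demo reference is an in-memory FASTA with two made-up sequences.

```python
import io


def count_fasta_records(handle):
    """Count the header lines (">") in a FASTA file handle."""
    return sum(1 for line in handle if line.startswith(">"))


def check_mode(mode, n_records):
    """Reject 'circular' when the reference holds more than one sequence."""
    if mode not in ("linear", "circular"):
        raise ValueError("mode must be 'linear' or 'circular'")
    if mode == "circular" and n_records > 1:
        raise ValueError("'circular' needs a single-sequence reference")


def build_simulator_cmd(mode, reference, profile="training", n_reads=None,
                        script="./simulator.py"):
    """Assemble the argv list for the simulation step (hypothetical helper)."""
    cmd = [script, mode, "-r", reference, "-c", profile]
    if n_reads is not None:
        cmd += ["-n", str(n_reads)]
    return cmd


# Two-sequence demo reference; for real data use open("reference.fa").
demo_ref = io.StringIO(">chrI\nACGT\n>chrII\nGGCC\n")
n_records = count_fasta_records(demo_ref)

mode = "linear"
check_mode(mode, n_records)
sim_cmd = build_simulator_cmd(mode, "reference.fa", n_reads=1000)
print(" ".join(sim_cmd))
```

The guard only blocks the invalid combination. Whether a single-sequence reference really is circular is still something you have to decide yourself.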