簡易なメタゲノムもシミュレートできるfastqのシミュレータ GemSIM

GemSIMは汎用フォーマットのSAMおよびFASTQ（IlluminaおよびRoche454を含む）と互換性のあるシングルエンドまたはペアエンドのリードを生成できるNGSのシミュレータ。ユーザーが比率を指定することで、簡単なメタゲノムのシミュレートを行うこともできる。

f:id:kazumaxneo:20180222161211j:plain

PDFマニュアルより転載。

ダウンロード

依存

python2.7

SourceForge（PDFマニュアルも含まれる）

https://sourceforge.net/projects/gemsim/

解凍して中に入る。

python GemReads.py -h #動作確認

$ python GemReads.py -h

########################################################################

# GemSIM - Generic Error Model based SIMulator of N.G. sequencing data #

########################################################################

GemReads.py:

Takes a reference genome, an empirical error model, and a haplotype file

listing SNP locations and frequencies, and creates a simulated data set

of random reads, as would be produced by a next-gen sequencing run.

Output is in fastq format, suitable for input into popular alignment

software.

Options:

-h prints these instructions.

-r reference genome, in fasta format.

-R Only for metagenome projects. Directory containing references.

-a Only for metagenome projects. Species-abundance file.

-n number of reads to produce. For paired end reads, number of pairs.

-g haplotype file, specifying location and frequency of snps.

-G directory of haplotype files (metagenomics mode only).

-l length of reads. Integer value, or -l d for empirical distribution.

-m error model file *_single.gzip or *_paired.gzip.

-c use this flag if you wish to draw reads from a circular genome.

-q quality score offset. Usually 33 or 64 (see manual).

-o output file name prefix.

-u Mean fragment length for paired end reads. -u d for empirical.

-s standard deviation for fragment length. Use only with -u and -p.

-p use only to create paired end reads.

python GemErr.py -h #動作確認

$ python GemErr.py -h

########################################################################

# GemSIM - Generic Error Model based SIMulator of N.G. sequencing data #

########################################################################

GemErr.py:

Takes a sam file and catalogues all the mismatches, insertions, and deletions

to create an error model for a particular sequencing run. Known true SNP

positions may be excluded.

Options:

-h prints these instructions.

-r read length. Set to LONGEST read in dataset.

-f reference genome in fasta format

-s input file in sam format.

-n desired output filename prefix.

-c specifies reference genome is circular. Otherwise assumed linear.

-i use only every ith read for model (optional, must be odd).

-m maximum indel size (optional, default=4).

-p use only if your data contains paired end reads.

-k minimum k-mer frequency in reference. (Default=0)

-e comma separated list of reference positions to exclude e.g. 293, 342

ラン

以下のエラーモデルが用意されている。

ill100v4: Illumina GA IIx with Illumina Sequencing Kit v4 chemistry (single and pair)
ill100v5: Illumina GA IIx with TrueSeq SBS Kit v5‐GA (single and pair)
r454ti_s: Roche/454 Titanium

chemistry v4のエラーモデルを使い、シングルエンドの100-bpリードを10万発生させる（モデルは_sの方を使う）。

python GemReads.py -r reference.fasta -n 100000 -l 100 -m models/ill100v4_s.gzip -q 64 -o read

-r eference genome, in fasta format.
-n number of reads to produce. For paired end reads, number of pairs.
-l length of reads. Integer value, or -l d for empirical distribution.
-m error model file *_single.gzip or *_paired.gzip.
-q quality score offset. Usually 33 or 64 (see manual).
-o output file name prefix.

進捗のログが出ないので分かりにくいが、終わるとカレントにout_single.fastqができる。

chemistry v4のエラーモデルを使い、経験則的なリード長とインサートサイズで200万ペア（400万リード）発生させる。（モデルは_pの方を使う）

python GemReads.py -r ecoli.fasta -n 2000000 -l d -m models/ill100v4_p.gzip -q 64 -p -u d -o uesaka

-s standard deviation for fragment length. Use only with -u and -p.
-p use only to create paired end reads.

終わるとカレントにout_single.fastqができる。

-cをつけると環状ゲノムのシミュレートになります（末端がない状態でリードを作るので、末端で擬似的な欠失が検出されなくなる）

メタゲノムデータのシミュレートもできるが、確率分布の種類等を指定して比率を決める機能はないので、ユーザーが指定する必要がある。ref/にあるbac1とbac2のメタゲノムデータを発生させるとする。

f:id:kazumaxneo:20180222165624j:plain

タブで区切った以下のようなテキストファイルを準備する。

bac1.fna<tab>160

bac2.fna<tab>40

f:id:kazumaxneo:20180222165747j:plain

絵で見るなら上のような感じ。　比率は".8"と".2"と書いても良い。

abund.txtとして保存。

454のTitaniumのエラープロファイルでメタゲノムリードを100万発生させる。ref/にあるfnaを指定している。

python GemReads.py –R ref/ –a abund.txt –n 10000000 –l d ‐m models/r454ti_s.gzip –c –q 33 –o meta

自分で用意したsamからエラーモデルを作る（paired-end）。

python GemErr.py ‐r 301 –f ref.fasta –s aligned.sam -n Miseq -c -i 5 -p

-r read length. Set to LONGEST read in dataset.
-f reference genome in fasta format
-s input file in sam format.
-n desired output filename prefix.
-c specifies reference genome is circular. Otherwise assumed linear.
-i use only every ith read for model (optional, must be odd).
-p use only if your data contains paired end reads.

終わると、エラーモデルのファイルができ、GemStats.pyでレポートを出せます。これらの解析例はPDFマニュアルで確認してください。マニュアル後半に例があります。

このツールはエラー報告が出ないようなので、出力はコマンドで確認して、中身が正　しいかチェックして進めてください。

補足

GemSIMを使ったGUIのプログラムの１つにFunctionSIMがあります。GemSIMは特定の遺伝子群に限定して、メタゲノムの配列をシミュレートできるようです。

引用

GemSIM: general, error-model based simulator of next-generation sequencing data.

McElroy KE1, Luciani F, Thomas T.

BMC Genomics. 2012 Feb 15;13:74.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

簡易なメタゲノムもシミュレートできるfastqのシミュレータ GemSIM