Badread: a simulator for error-prone long reads

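DNA sequencing platforms aim to determine the sequence of nucleotides (A, C, G and T) in a DNA sample. Illumina sequencers have been the dominant technology for most of the past decade, but these platforms generate relatively short fragments of sequence, or "reads" (roughly 100 to 300 nucleotides long). In contrast, Oxford Nanopore Technologies (ONT) and Pacific Biosciences (PacBio) manufacture "long-read" sequencers that can generate sequence fragments of tens of thousands of nucleotides or more. Long reads from these platforms can be very useful for genome assembly and other bioinformatics analyses. However, the stochastic nature of measurement at the single-molecule scale means that ONT and PacBio reads are "noisy".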
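Much research has been done in this area over the past few years. One way to evaluate new methods is read simulation: generating fake sequencing reads. Compared with using real sequencing data, this approach is faster and cheaper, and it allows more tests to be run.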

Badread is a tool for simulating long reads. It can imitate many of the problems encountered in real long-read sequencing, including chimeras, low-quality regions and systematic basecalling errors.

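Badread is not primarily trying to imitate real long reads. Rather, its aim is to give users control over the quality of the simulated reads. The author created Badread to test tools that take long reads as input. With Badread, you can generate reads with different kinds of problems and see what effect they have.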

Installation

Tested on Ubuntu 18.04 in an environment created with mamba (a fast implementation of conda).

GitHub

#Install from source
git clone https://github.com/rrwick/Badread.git
pip3 install ./Badread
badread --help

#conda
mamba create -n badread -y
conda activate badread
mamba install -c bioconda -y badread
mamba install -c bioconda edlib # required, but not installed by the badread conda recipe (.yaml)
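
Since edlib had to be installed separately here, it can be worth confirming that Python can import it before starting a simulation. The following is only a minimal sketch: `have_py_module` is a hypothetical helper name (not part of Badread), and it assumes that the `python3` on the PATH is the interpreter of the activated environment.

```shell
# Return success if python3 can import the given module.
# have_py_module is a hypothetical helper, not part of Badread.
have_py_module() {
    python3 -c "import $1" >/dev/null 2>&1
}

if have_py_module edlib; then
    echo "edlib: OK"
else
    echo "edlib: missing (see the install command above)"
fi
```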

> badread -h

usage: badread [-h] [--version] {simulate,error_model,qscore_model,plot} ...

Badread: a long read simulator that can imitate many types of read problems

Commands:

{simulate,error_model,qscore_model,plot}

simulate: generate fake long reads

error_model: build a Badread error model

qscore_model: build a Badread qscore model

plot: view read identities over a sliding window

Help:

-h, --help Show this help message and exit

--version Show program's version number and exit

> badread simulate

usage: badread simulate --reference REFERENCE --quantity QUANTITY [--length LENGTH] [--identity IDENTITY] [--error_model ERROR_MODEL] [--qscore_model QSCORE_MODEL] [--seed SEED]

[--start_adapter START_ADAPTER] [--end_adapter END_ADAPTER] [--start_adapter_seq START_ADAPTER_SEQ] [--end_adapter_seq END_ADAPTER_SEQ] [--junk_reads JUNK_READS]

[--random_reads RANDOM_READS] [--chimeras CHIMERAS] [--glitches GLITCHES] [--small_plasmid_bias] [-h] [--version]

Generate fake long reads

Required arguments:

--reference REFERENCE Reference FASTA file (can be gzipped)

--quantity QUANTITY Either an absolute value (e.g. 250M) or a relative depth (e.g. 25x)

Simulation parameters:

Length and identity and error distributions

--length LENGTH Fragment length distribution (mean and stdev, default: 15000,13000)

--identity IDENTITY Sequencing identity distribution (mean, max and stdev, default: 87.5,97.5,5)

--error_model ERROR_MODEL Can be "nanopore2018", "nanopore2020", "pacbio2016", "random" or a model filename (default: nanopore2020)

--qscore_model QSCORE_MODEL Can be "nanopore2018", "nanopore2020", "pacbio2016", "random", "ideal" or a model filename (default: nanopore2020)

--seed SEED Random number generator seed for deterministic output (default: different output each time)

Adapters:

Controls adapter sequences on the start and end of reads

--start_adapter START_ADAPTER Adapter parameters for read starts (rate and amount, default: 90,60)

--end_adapter END_ADAPTER Adapter parameters for read ends (rate and amount, default: 50,20)

--start_adapter_seq START_ADAPTER_SEQ Adapter sequence for read starts (default: AATGTACTTCGTTCAGTTACGTATTGCT)

--end_adapter_seq END_ADAPTER_SEQ Adapter sequence for read ends (default: GCAATACGTAACTGAACGAAGT)

Problems:

Ways reads can go wrong

--junk_reads JUNK_READS This percentage of reads will be low-complexity junk (default: 1)

--random_reads RANDOM_READS This percentage of reads will be random sequence (default: 1)

--chimeras CHIMERAS Percentage at which separate fragments join together (default: 1)

--glitches GLITCHES Read glitch parameters (rate, size and skip, default: 10000,25,25)

--small_plasmid_bias If set, then small circular plasmids are lost when the fragment length is too high (default: small plasmids are included regardless of fragment length)

Other:

-h, --help Show this help message and exit

--version Show program's version number and exit
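
As an illustration of how the options listed above combine, the sketch below wraps a "worse than default" simulation in a small shell function. The wrapper name `simulate_bad_reads`, the file names and every parameter value are assumptions chosen for illustration only; the flags themselves are those shown in the help text.

```shell
# simulate_bad_reads is a hypothetical wrapper name, not part of Badread.
# Usage: simulate_bad_reads REFERENCE QUANTITY   (FASTQ is written to stdout)
# All values below are illustrative; compare them with the defaults in the help text.
simulate_bad_reads() {
    badread simulate \
        --reference "$1" \
        --quantity "$2" \
        --identity 80,92,6 \
        --chimeras 10 \
        --junk_reads 5 \
        --random_reads 5 \
        --seed 42
}

# Run only when badread and the (assumed) reference file are both present.
if command -v badread >/dev/null 2>&1 && [ -f ref.fasta ]; then
    simulate_bad_reads ref.fasta 50x | gzip > bad_reads.fastq.gz
fi
```

Fixing `--seed` makes the output deterministic, which helps when comparing how a downstream tool responds to a change in a single parameter.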

Usage

Simulate ONT long reads at 50x genome depth using the built-in default settings

badread simulate --reference ref.fasta --quantity 50x \
 | gzip > reads.fastq.gz

--reference Reference FASTA file (can be gzipped)
--quantity Either an absolute value (e.g. 250M) or a relative depth (e.g. 25x)
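
To get a feel for what a relative depth such as 50x means in bases, the sketch below sums the sequence lengths in a FASTA file and multiplies by the depth. It assumes that depth is simply the total reference length times the multiplier; Badread's own accounting may differ. The helper names `genome_size` and `bases_for_depth` are hypothetical, and a gzipped reference would need to be decompressed first.

```shell
# Sum the sequence lengths in an uncompressed FASTA file (headers start with ">").
# genome_size and bases_for_depth are hypothetical helper names.
genome_size() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); n += length($0) } END { print n + 0 }' "$1"
}

# Number of bases implied by a relative depth, e.g. 50 for "50x".
bases_for_depth() {
    echo $(( $(genome_size "$1") * $2 ))
}

# Tiny demonstration with a two-sequence FASTA written to a temporary file.
tmp=$(mktemp)
printf '>chr1\nACGTACGT\nACGT\n>plasmid\nGGCC\n' > "$tmp"
genome_size "$tmp"          # prints 16
bases_for_depth "$tmp" 50   # prints 800
rm -f "$tmp"
```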

Several other example configurations are available; see the README on GitHub.

Citation

Badread: simulation of error-prone long reads
Ryan R Wick

Journal of Open Source Software. 2019;4(36):1316

