Boquila: an NGS read simulator that eliminates read nucleotide bias in sequence analysis

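Sequence content is heterogeneous across a genome. As a result, genome-wide NGS reads that are biased toward particular nucleotide profiles are affected by the uneven nucleotide distribution of the genome itself. Boquila generates sequences that mimic the nucleotide profile of the real reads, and these simulated reads can be used to correct the nucleotide-based bias in the genome-wide distribution of NGS reads. Boquila can be configured to generate reads only from specified regions of the reference genome. It can also use input DNA sequencing data to correct bias caused by copy number variation in the genome. Boquila uses standard file formats for its input and output, so it is easy to integrate into workflows for high-throughput sequencing applications.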
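To make "nucleotide profile" concrete, here is a minimal Python sketch that computes per-position base frequencies from a set of reads. It is only an illustration of the idea, not Boquila's actual implementation; the function name `position_profile` and the toy reads are invented for this example.

```python
from collections import Counter

def position_profile(reads):
    """Return, for each read position, the fraction of A/C/G/T observed there.

    Hypothetical helper for illustration only; bases other than A, C, G, T
    (for example N) are ignored.
    """
    if not reads:
        return []
    length = max(len(r) for r in reads)
    profile = []
    for i in range(length):
        # Count only reads long enough to cover position i.
        counts = Counter(r[i] for r in reads if i < len(r) and r[i] in "ACGT")
        total = sum(counts.values())
        profile.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return profile

# Toy reads, invented for illustration.
reads = ["ACGT", "ACGA", "TCGT", "ACTT"]
profile = position_profile(reads)
for pos, freqs in enumerate(profile):
    print(pos, freqs)
```

A simulator that mimics the real reads would aim to produce reads whose profile is close to this one, position by position.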

Examples

https://github.com/CompGenomeLab/boquila/tree/main/examples

Installation

GitHub

cargo install --branch main --git https://github.com/CompGenomeLab/boquila.git boquila

$ boquila --help

boquila 0.6.0
Generate NGS reads with same nucleotide distribution as input file
Generated reads will be written to stdout
By default input and output format is FASTQ

USAGE:
    boquila [OPTIONS] <src>

ARGS:
    <src>    Model file

OPTIONS:
        --bed <FILE>        File name in which the simulated reads will be saved in BED format
        --fasta             Change input and output format to FASTA
    -h, --help              Print help information
        --inseq <FILE>      Input sequencing reads to be used instead of reference genome
        --inseqFasta        Change the input sequencing format to FASTA
        --kmer <INT>        Kmer size to be used while calculating frequency [default: 1]
        --ref <FILE>        Reference FASTA
        --regions <FILE>    RON formatted file containing genomic regions that generated reads will
                            be selected from
        --seed <INT>        Random number seed. If not provided system's default source of entropy
                            will be used instead.
        --sens <INT>        Sensitivity of selected reads.
                            If some positions are predominated by specific nucleotides, increasing
                            this value can make simulated reads more similar to input reads.
                            However runtime will also increase linearly. [default: 2] [possible
                            values: 1, 2, 3, 4, 5]
    -V, --version           Print version information
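The `--kmer` option sets the k-mer size used when calculating frequencies. The help text does not say exactly how Boquila counts them, so the following Python sketch only shows the general idea of a k-mer frequency table, assuming a simple sliding window over each read. The function name `kmer_frequencies` and the toy reads are made up for illustration.

```python
from collections import Counter

def kmer_frequencies(reads, k=1):
    """Return the relative frequency of every k-mer seen in the reads.

    Hypothetical helper: slides a window of width k over each read.
    This is an assumption about what "k-mer frequency" means here,
    not a description of Boquila's internals.
    """
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}

# Toy reads, invented for illustration.
toy_reads = ["ACGT", "ACGA"]
mono = kmer_frequencies(toy_reads, k=1)  # single-nucleotide frequencies
di = kmer_frequencies(toy_reads, k=2)    # dinucleotide frequencies
print(mono)
print(di)
```

With k=1 the table has at most four entries, while larger k captures context between neighboring bases at the cost of a larger table (up to 4^k entries).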

Usage

Specify the real reads (uncompressed FASTQ), the reference genome in FASTA format, and a region file for the reference.

boquila input_reads.fq --ref ref_genome.fa --regions GRCh38.ron > out.fq

Output (input: SRR5125157.fastq; output: SRR5125157_sim.fastq)

[Figure: screenshot of the output]
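One way to check how closely the simulated reads follow the real ones is to compare their per-position base frequencies. The Python sketch below is one possible check, not part of Boquila; the helper names are invented, it assumes plain 4-line FASTQ records, and it uses tiny in-memory FASTQ strings so that it runs without any files.

```python
import io
from collections import Counter

def read_fastq_sequences(handle):
    """Yield the sequence line of each record, assuming 4 lines per record."""
    for i, line in enumerate(handle):
        if i % 4 == 1:
            yield line.strip().upper()

def position_profile(reads):
    """Per-position A/C/G/T fractions (hypothetical helper, ignores N)."""
    if not reads:
        return []
    profile = []
    for i in range(max(len(r) for r in reads)):
        counts = Counter(r[i] for r in reads if i < len(r) and r[i] in "ACGT")
        total = sum(counts.values())
        profile.append({b: (counts[b] / total if total else 0.0) for b in "ACGT"})
    return profile

def max_profile_difference(reads_a, reads_b):
    """Largest absolute frequency gap over shared positions and bases."""
    pa, pb = position_profile(reads_a), position_profile(reads_b)
    n = min(len(pa), len(pb))
    return max(
        (abs(pa[i][b] - pb[i][b]) for i in range(n) for b in "ACGT"),
        default=0.0,
    )

# Toy FASTQ content, invented for illustration.
real_fq = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nIIII\n")
sim_fq = io.StringIO("@s1\nACGT\n+\nIIII\n@s2\nTCGA\n+\nIIII\n")

real = list(read_fastq_sequences(real_fq))
sim = list(read_fastq_sequences(sim_fq))
diff = max_profile_difference(real, sim)
print(f"max per-position frequency difference: {diff}")
```

For actual data, the `io.StringIO` objects could be replaced with `open("SRR5125157.fastq")` and `open("SRR5125157_sim.fastq")`; a value near 0 would suggest that the two read sets have similar nucleotide profiles.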

Reference

Boquila: NGS read simulator to eliminate read nucleotide bias in sequence analysis

Umit Akkose, Ogun Adebali

bioRxiv, Posted March 30, 2022
