機械学習によって適したパラメータ設定でNGSのデータの自動クオリティフィルタリングを行う seqQscorer

次世代シーケンシング（NGS）データファイルの品質管理は、必要ではあるが複雑な作業である。この問題を解決するために、一般的なNGSの品質特徴を統計的に特徴づけ、ツリーベースの分類アルゴリズムと深層学習分類アルゴリズムを用いた新しい品質管理手法を開発した。予測モデルは、内部および外部の機能的ゲノミクスデータセットで検証されており、ある程度は未見の種のデータにも一般化可能である。統計的ガイドラインと予測モデルは、NGSデータの利用者が品質問題をよりよく理解し、自動品質管理を行うための貴重なリソースとなる。ガイドラインとソフトウェアは https://github.com/salbrec/seqQscorer から入手できる。

Githubより

seqQscorerはpythonで実装されており、品質の統計やレポートの要約（品質の特徴）を入力として、入力されたNGSサンプルが低品質である確率を計算する。この確率は、教師付き機械学習の分類モデルを用いて計算される。品質特徴量はFastQおよびBAMファイルから得られ、Genome Biology誌に掲載された論文で詳細に説明されている。

RAWフィーチャーはFastQCで得られたもので、基本的にはそのレポートサマリーを含んでいます。MAPフィーチャーは、Bowtie2アライメントからのマッピング統計です。このアラインメントから、LOCおよびTSSというフィーチャーセットが導き出されます。このフィーチャーセットは、特定の機能を持つゲノム領域におけるリードの分布と、TSS位置に近いリードの分布を表しています。後者のフィーチャーセットには、Bioconductor社のパッケージであるChIPseekerとChIPpeakAnnoが使用されています。

インストール

macos10.14でdocker imageを作ってテストした。

Github

docker pull salbrec/seqqdocker

> python deriveFeatureSets.py --help

# python deriveFeatureSets.py --help

usage: deriveFeatureSets.py [-h] --fastq1 FASTQ1 [--fastq2 FASTQ2] --btidx BTIDX [--outdir OUTDIR] [--cores CORES] [--fastqc {1,2}] [--assembly {GRCh38,GRCm38}] [--gtf GTF] [--name NAME]

seqQscorer Preprocessing - derives feature sets needed leveraged by seqQscorer

optional arguments:

-h, --help show this help message and exit

--fastq1 FASTQ1, -f1 FASTQ1

Input fastq file. Either the fastq file for a single-end sample or the fastq file for read 1 of a paired-end sample. Using this default destination "./feature_sets/", all feature sets are computed and saved.

The file names define the sampleID while the file endings define the feature sets RAW, MAP, LOC and TSS.

--fastq2 FASTQ2, -f2 FASTQ2

In case of a paired-end sample, the fastq file for read 2. When the preprocessing is applied on paired-end samples, FastQC is applied to the read 1 by default. Also the sampleID for the output files are named

by the file name of read1. These two aspects can be cahnged by using --fastqc respectively --name.

--btidx BTIDX, -ix BTIDX

Filename prefix for Bowtie2 Genome Index (minus trailing .X.bt2).

--outdir OUTDIR, -o OUTDIR

Output directory. Default: "./feature_sets/"

--cores CORES, -c CORES

Defines the number of processors (CPUs) to be used by bowtie2 and samtools. (decreases runtime)

--fastqc {1,2}, -f {1,2}

The fastq on which FastQC is applied on. Can optionally be selected for paired-end samples.

--assembly {GRCh38,GRCm38}, -a {GRCh38,GRCm38}

Species assembly needed to define the gene structure / annotation used by the bioconductor functions. (has to be consistent with the species used in for Bowtie2)

--gtf GTF, -g GTF File path for a gtf file to be used to get the LOC and TSS features. (--assembly will be ignored then)

--name NAME, -n NAME By default the output files are named by the file name of --fastq1. In order to change this to a custom name, use this option.

> python seqQscorer.py --help

# python seqQscorer.py --help

usage: seqQscorer.py [-h] --indir INDIR [--species {generic,human,mouse}] [--assay {generic,ChIP-seq,DNase-seq,RNA-seq}] [--runtype {generic,single-end,paired-end}] [--model MODEL] [--noRAW] [--noMAP] [--noLOC] [--noTSS] [--noFS]

[--bestCalib] [--peaktype {narrow,broad}] [--probOut PROBOUT] [--compOut COMPOUT] [--inputOut INPUTOUT] [--noVerbose] [--seed SEED] [--sampleID SAMPLEID]

seqQscorer - A machine learning application for quality assessment of NGS data

optional arguments:

-h, --help show this help message and exit

--indir INDIR, -i INDIR

Input directory containing the feature set files. The feature set files are perfectly fomated by the script "deriveFeatures.py": the file names (until the ".") define the sample ID while the file endings

define the corresponding feature set RAW, MAP, LOC, and TSS. By default seqQscorer applies the machine learning model to all samples from the given directory within milliseconds. However, it can be restricted

to one sample using --sampleID.

--species {generic,human,mouse}, -s {generic,human,mouse}

Species specifying the model used.

--assay {generic,ChIP-seq,DNase-seq,RNA-seq}, -a {generic,ChIP-seq,DNase-seq,RNA-seq}

Assay specifying the model used.

--runtype {generic,single-end,paired-end}, -r {generic,single-end,paired-end}

Run-Type specifying the model used.

--model MODEL, -m MODEL

Path to a serialized model, trained on own data. If used, the parameters --species, --assay, and --runtype have no impact on the classification model.

--noRAW Ignore all RAW features.

--noMAP Ignore all MAP features.

--noLOC Ignore all LOC features.

--noTSS Ignore all TSS features.

--noFS Switch off feature selection. (has only an impact if the best performance was achieved with chi2 or RFE)

--bestCalib Classifier setting is used that achieved the lowest brier score, hence the best calibration of the probabilities.

--peaktype {narrow,broad}, -pt {narrow,broad}

Optionally specify the peak-type for ChIP-seq data.

--probOut PROBOUT, -po PROBOUT

To specify an output file for the probabilities. Output will be tab-separated.

--compOut COMPOUT, -co COMPOUT

To specify an out file for the comprehensive output. Output will be kind of tab-separated.

--inputOut INPUTOUT, -io INPUTOUT

To specify an out file that will contain the parsed input. Output will be tab-separated.

--noVerbose, -nv Turn off verboseness, without being quiet.

--seed SEED, -rs SEED

Some classifiers apply randomization. Use --seed to make results reproducible. By default the seed 1 is used, set it to -1 if using a seed is not desired. For K-nearest neighbor and Naive Bayes the seed has no

impact.

--sampleID SAMPLEID, -id SAMPLEID

Restrict application of seqQscorer to only one sample defined by the ID.

テストラン

ここではオーサーらが提供しているdocker imageを使ってランする。

cd <your>/<fastq_dir>
#カレントと/homeをシェアして立ち上げる
docker run -i -t -v ${PWD}:/home/ salbrec/seqqdocker /bin/bash

ランにはアセンブリに一致するBowtieインデックスが必要。Bowtie2の公式Webページからindexをダウンロードする。

cd ./utils/genome_index
wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip GRCh38_noalt_as.zip

seqQscorerを実行する。デフォルトでは、一般的な分類モデルが学習され、品質確率の計算に使用される（最大のデータセットで学習されている最も信頼性の高いモデル）。

#paired-end
python deriveFeatureSets.py --fastq1 /var/examples/paired/ENCFF310LVJ.fastq.gz --fastq2 /var/examples/paired/ENCFF410LTA_r2.fastq.gz --cores 4 --btidx ./utils/genome
_index/GRCh38_noalt_as/GRCh38_noalt_as --assembly GRCm38 --outdir ./mouse_pe/

#single-end
python deriveFeatureSets.py --fastq1 /var/examples/single/ENCFF165NJF.fastq.gz --btidx ./utils/genome_index/GRCh38_noalt_as/GRCh38_noalt_as --assembly GRCh38

mouse_pe

f:id:kazumaxneo:20210318093835p:plain