pycoQC: a quality control tool for ONT data that outputs interactive reports

2020-07-21: Fixed double spaces in the commands.

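Nanopore sequencing of nucleic acids took nearly 30 years to develop and is now firmly established as an alternative to sequencing by synthesis (Deamer, Akeson, & Branton, 2016). Oxford Nanopore Technologies (ONT) released the first commercial nanopore device for DNA sequencing in 2014 and has continued to improve the technology since (Jain, Olsen, Paten, & Akeson, 2016). Although read accuracy is only around 90%, ONT technology can sequence very long molecules and generate data in real time. It can also sequence RNA directly and detect modified bases (Garalde et al.).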
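The electrical signals acquired by the array of nanopores are stored in HDF5 format, with one file (called FAST5) per sequenced molecule. The signal is then converted into a nucleic acid sequence by basecalling software. Several options exist, but the best performers in terms of read accuracy are Albacore and Guppy, both developed and maintained by ONT (Wick, Judd, & Holt, 2018). Both can generate FASTQ files, FAST5 files containing basecalling information, and a text summary file. ONT recently published best-practice guidelines for quality control analysis of sequencing runs (Oxford Nanopore Technologies, 2019), but did not provide a turnkey solution for exploring the quality of sequencing data in depth.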
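The text summary file mentioned above is a tab-separated table with one row per read, so it can be inspected directly before running any QC tool. Below is a minimal sketch, using only the Python standard library, that computes the read count, total bases and N50 from such a table. The column name `sequence_length_template` follows the layout of Guppy's sequencing_summary.txt and is an assumption here (check the header of your own file); `n50` and `summary_stats` are hypothetical helpers and not part of pycoQC.

```python
import csv
import io

def n50(lengths):
    """Return the length L such that reads of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input

def summary_stats(handle, len_col="sequence_length_template"):
    """Compute basic read statistics from an open tab-separated summary handle.

    len_col is an assumed column name; adjust it to match your file's header.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    lengths = [int(row[len_col]) for row in reader]
    return {"reads": len(lengths), "bases": sum(lengths), "n50": n50(lengths)}

# Tiny made-up table standing in for a real sequencing_summary.txt
example = (
    "read_id\tsequence_length_template\tmean_qscore_template\n"
    "r1\t1000\t9.5\n"
    "r2\t500\t6.1\n"
    "r3\t250\t11.0\n"
)
stats = summary_stats(io.StringIO(example))
print(stats)
```

For a real run, the `io.StringIO` handle would be replaced by `open("sequencing_summary.txt")`.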
Here we present pycoQC, a new tool that generates interactive quality control metrics and plots from basecalled nanopore reads or from the summary files generated by Albacore and Guppy. Other open-source alternatives exist, such as Nanoplot (De Coster, D'Hert, Schultz, Cruts, & Van Broeckhoven, 2018), MinionQC (Lanfear, Schalamun, Kainer, Wang, & Schwessinger, 2018), and toulligQC (Laffay, Ferrato-Berberian, Jourdren, Lemoine, & Le Crom, 2018), but pycoQC has several novel features.

Documentation

https://a-slide.github.io/pycoQC/

Using it in a Jupyter notebook

jupyter API usage - pycoQC

Installation

Installed with pip on macOS 10.14.

Dependencies

numpy>=1.13
scipy>=1.1
pandas>=0.23
plotly>=3.4
jinja2>=2.10
h5py>=2.8.0
tqdm>=4.23

GitHub

# Can be installed with pip.
pip install pycoQC

# To install into a conda virtual environment
conda install pip
pip install pycoQC

# bioconda (untested)
conda create -n pycoQC python=3.6 -y
conda activate pycoQC
conda install -c bioconda -y pycoqc

#development version (unstable)
pip install --index-url https://test.pypi.org/simple/ pycoQC -U

> pycoQC

$ pycoQC

ERROR: `--summary_file` is a required argument

usage: pycoQC [-h] [--version]
              [--summary_file [SUMMARY_FILE [SUMMARY_FILE ...]]]
              [--barcode_file [BARCODE_FILE [BARCODE_FILE ...]]]
              [--bam_file [BAM_FILE [BAM_FILE ...]]]
              [--html_outfile HTML_OUTFILE] [--json_outfile JSON_OUTFILE]
              [--min_pass_qual MIN_PASS_QUAL] [--min_pass_len MIN_PASS_LEN]
              [--filter_calibration] [--filter_duplicated]
              [--min_barcode_percent MIN_BARCODE_PERCENT]
              [--report_title REPORT_TITLE] [--template_file TEMPLATE_FILE]
              [--config_file CONFIG_FILE] [--sample SAMPLE] [--default_config]
              [-v | -q]

pycoQC computes metrics and generates interactive QC plots from the sequencing summary
report generated by Oxford Nanopore technologies basecallers
* Minimal usage
    pycoQC -f sequencing_summary.txt -o pycoQC_output.html
* Including Guppy barcoding file + html output + json output
    pycoQC -f sequencing_summary.txt -b barcoding_sequencing.txt -o pycoQC_output.html -j pycoQC_output.json
* Including Bam file + html output
    pycoQC -f sequencing_summary.txt -a alignment.bam -o pycoQC_output.html

optional arguments:
  -h, --help     show this help message and exit
  --version      show program's version number and exit
  -v, --verbose  Increase verbosity
  -q, --quiet    Reduce verbosity

Input/output options:
  --summary_file [SUMMARY_FILE [SUMMARY_FILE ...]], -f [SUMMARY_FILE [SUMMARY_FILE ...]]
                        Path to a sequencing_summary generated by Albacore
                        1.0.0 + (read_fast5_basecaller.py) / Guppy 2.1.3+
                        (guppy_basecaller). One can also pass multiple space
                        separated file paths or a UNIX style regex matching
                        multiple files (Required)
  --barcode_file [BARCODE_FILE [BARCODE_FILE ...]], -b [BARCODE_FILE [BARCODE_FILE ...]]
                        Path to the barcode_file generated by Guppy 2.1.3+
                        (guppy_barcoder) or Deepbinner 0.2.0+. This is not a
                        required file. One can also pass multiple space
                        separated file paths or a UNIX style regex matching
                        multiple files (optional)
  --bam_file [BAM_FILE [BAM_FILE ...]], -a [BAM_FILE [BAM_FILE ...]]
                        Path to a Bam file corresponding to reads in the
                        summary_file. Preferably aligned with Minimap2 One can
                        also pass multiple space separated file paths or a
                        UNIX style regex matching multiple files (optional)
  --html_outfile HTML_OUTFILE, -o HTML_OUTFILE
                        Path to an output html file report (required if
                        json_outfile not given)
  --json_outfile JSON_OUTFILE, -j JSON_OUTFILE
                        Path to an output json file report (required if
                        html_outfile not given)

Filtering options:
  --min_pass_qual MIN_PASS_QUAL
                        Minimum quality to consider a read as 'pass' (default:
  --min_pass_len MIN_PASS_LEN
                        Minimum read length to consider a read as 'pass'
                        (default: 0)
  --filter_calibration  If given, reads flagged as calibration strand by the
                        basecaller are removed (default: False)
  --filter_duplicated   If given, duplicated read_ids are removed but the
                        first occurence is kept (Guppy sometimes outputs the
                        same read multiple times) (default: False)
  --min_barcode_percent MIN_BARCODE_PERCENT
                        Minimal percent of total reads to retain barcode
                        label. If below, the barcode value is set as
                        `unclassified` (default: 0.1)

HTML report options:
  --report_title REPORT_TITLE
                        Title to use in the html report (default: PycoQC
                        report)
  --template_file TEMPLATE_FILE
                        Jinja2 html template for the html report (default: )
  --config_file CONFIG_FILE
                        Path to a JSON configuration file for the html report.
                        If not provided, looks for it in ~/.pycoQC and
                        ~/.config/pycoQC/config. If it's still not found,
                        falls back to default parameters. The first level keys
                        are the names of the plots to be included. The second
                        level keys are the parameters to pass to each plotting
                        function (default: )")

Other options:
  --sample SAMPLE       If not None a n number of reads will be randomly
                        selected instead of the entire dataset for ploting
                        function (deterministic sampling) (default: 100000)
  --default_config, -d  Print default configuration file. Can be used to
                        generate a template JSON file (default: False)
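Two of the filtering options in the help text are easier to picture with a small example: `--filter_duplicated` keeps only the first occurrence of each read_id, and `--min_barcode_percent` relabels barcodes that account for less than the given percentage of reads as `unclassified`. The following is a rough sketch of that behaviour as the help text describes it, not pycoQC's actual implementation; `drop_duplicates`, `relabel_rare_barcodes` and the dictionary keys are hypothetical names chosen for illustration.

```python
from collections import Counter

def drop_duplicates(reads):
    """Keep only the first occurrence of each read_id, preserving input order."""
    seen = set()
    kept = []
    for read in reads:
        if read["read_id"] not in seen:
            seen.add(read["read_id"])
            kept.append(read)
    return kept

def relabel_rare_barcodes(reads, min_barcode_percent=0.1):
    """Set the barcode to 'unclassified' when it covers less than
    min_barcode_percent percent of all reads."""
    counts = Counter(read["barcode"] for read in reads)
    total = len(reads)
    relabelled = []
    for read in reads:
        percent = 100.0 * counts[read["barcode"]] / total
        barcode = read["barcode"] if percent >= min_barcode_percent else "unclassified"
        relabelled.append({**read, "barcode": barcode})
    return relabelled

# Made-up reads: 18 for barcode01, 1 for barcode02, plus one duplicated read_id
reads = (
    [{"read_id": f"a{i}", "barcode": "barcode01"} for i in range(18)]
    + [{"read_id": "b0", "barcode": "barcode02"}]
    + [{"read_id": "a0", "barcode": "barcode01"}]
)
unique = drop_duplicates(reads)
relabelled = relabel_rare_barcodes(unique, min_barcode_percent=10)
barcode_counts = Counter(read["barcode"] for read in relabelled)
print(len(unique), dict(barcode_counts))
```

With a threshold of 10, barcode02 holds 1 of 19 reads (about 5%), so it is folded into `unclassified`, while barcode01 is kept.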

Usage

Specify the sequencing_summary.txt file output by Albacore or Guppy.

pycoQC \
 -f Guppy-2.1.3_basecall-1D-RNA_sequencing_summary.txt.gz \
 -o Guppy-2.1.3_basecall-1D_RNA.html \
 -j Guppy-2.1.3_basecall-1D_RNA.json \
 --min_pass_qual 6 \
 --min_pass_len 100 \
 --filter_calibration \
 --min_barcode_percent 10 \
 --quiet
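In the command above, `--min_pass_qual 6` and `--min_pass_len 100` define which reads are treated as 'pass'. As a rough illustration of what those two thresholds mean, the sketch below counts the reads in a tab-separated summary that satisfy both. It assumes the Guppy-style column names `mean_qscore_template` and `sequence_length_template`, and it assumes the thresholds are inclusive; neither is confirmed by the help text, and `count_pass` is a hypothetical helper rather than anything pycoQC exposes.

```python
import csv
import io

def count_pass(handle, min_pass_qual=6.0, min_pass_len=100):
    """Count reads meeting both thresholds in a tab-separated summary handle.

    Column names and inclusive comparisons are assumptions for illustration.
    """
    n_pass = n_total = 0
    for row in csv.DictReader(handle, delimiter="\t"):
        n_total += 1
        qual = float(row["mean_qscore_template"])
        length = int(row["sequence_length_template"])
        if qual >= min_pass_qual and length >= min_pass_len:
            n_pass += 1
    return n_pass, n_total

# Tiny made-up table standing in for a real sequencing summary
example = (
    "read_id\tsequence_length_template\tmean_qscore_template\n"
    "r1\t1200\t9.5\n"   # passes both thresholds
    "r2\t80\t10.2\n"    # too short
    "r3\t900\t5.4\n"    # quality too low
    "r4\t100\t6.0\n"    # exactly on both thresholds, counted as pass here
)
n_pass, n_total = count_pass(io.StringIO(example))
print(f"{n_pass}/{n_total} reads pass")
```

For a gzipped summary like the one passed to `-f` above, the handle could come from `gzip.open(path, "rt")` in the standard library.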

Output

[Figure: pycoQC output]