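Evaluating long-read sequencing data with NanoPack2

Growing cohort sizes in long-read sequencing projects call for more efficient software for quality assessment and processing of sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. The paper describes new tools for summarizing experiments, filtering datasets, and visualizing phased alignment results, as well as updates to the NanoPack software suite.
The cramino, chopper, kyber, and phasius tools are written in Rust and are available as executable binaries that require neither installation nor dependency management. Binaries built on musl are available for broad compatibility. NanoPlot and NanoComp are written in Python3. Links to the individual tools and their documentation are at https://github.com/wdecoster/nanopack. All tools are compatible with Linux, Mac OS, and the MS Windows Subsystem for Linux, and are released under the MIT license. The repositories include test data, the tools are continuously tested using GitHub Actions, and they can be installed with the conda package manager.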

Installation

"The python scripts are written and tested for Python >= 3.6. With pip install nanopack all python tools can be installed simultaneously, but using a conda environment is encouraged. For the rust tools binaries can be downloaded from the releases on the respective GitHub repositories, as well as installation through conda."

Installation and usage

Here I create a Python 3.10 environment with conda and install the tools into it.

mamba create -n NanoPack python=3.10 -y
conda activate NanoPack

#To leave the environment, deactivate it.
conda deactivate

1. NanoPlot (GitHub)

Produces many plots that help analyze reads, starting from reads (fastq), alignments (bam), or albacore summary files.

mamba create -n nanoplot -y
conda activate nanoplot
mamba install -c bioconda nanoplot -y

> NanoPlot -h

usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw] [--huge] [-o OUTDIR] [--no_static] [-p PREFIX] [--tsv_stats] [--info_in_report] [--maxlength N] [--minlength N] [--drop_outliers] [--downsample N] [--loglength] [--percentqual] [--alength] [--minqual N]

[--runtime_until N] [--readtype {1D,2D,1D2}] [--barcoded] [--no_supplementary] [-c COLOR] [-cm COLORMAP] [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]] [--plots [{kde,hex,dot} ...]] [--legacy [{kde,dot,hex} ...]] [--listcolors] [--listcolormaps]

[--no-N50] [--N50] [--title TITLE] [--font_scale FONT_SCALE] [--dpi DPI] [--hide_stats]

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

General options:

-h, --help show the help and exit

-v, --version Print version and exit.

-t, --threads THREADS

Set the allowed number of threads to be used by the script

--verbose Write log messages also to terminal.

--store Store the extracted data in a pickle file for future plotting.

--raw Store the extracted data in tab separated file.

--huge Input data is one very large file.

-o, --outdir OUTDIR Specify directory in which output has to be created.

--no_static Do not make static (png) plots.

-p, --prefix PREFIX Specify an optional prefix to be used for the output files.

--tsv_stats Output the stats file as a properly formatted TSV.

--info_in_report Add NanoPlot run info in the report.

Options for filtering or transforming input prior to plotting:

--maxlength N Hide reads longer than length specified.

--minlength N Hide reads shorter than length specified.

--drop_outliers Drop outlier reads with extreme long length.

--downsample N Reduce dataset to N reads by random sampling.

--loglength Additionally show logarithmic scaling of lengths in plots.

--percentqual Use qualities as theoretical percent identities.

--alength Use aligned read lengths rather than sequenced length (bam mode)

--minqual N Drop reads with an average quality lower than specified.

--runtime_until N Only take the N first hours of a run

--readtype {1D,2D,1D2}

Which read type to extract information about from summary. Options are 1D, 2D, 1D2

--barcoded Use if you want to split the summary file by barcode

--no_supplementary Use if you want to remove supplementary alignments

Options for customizing the plots created:

-c, --color COLOR Specify a valid matplotlib color for the plots

-cm, --colormap COLORMAP

Specify a valid matplotlib colormap for the heatmap

-f, --format [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]

Specify the output format of the plots, which are in addition to the html files

--plots [{kde,hex,dot} ...]

Specify which bivariate plots have to be made.

--legacy [{kde,dot,hex} ...]

Specify which bivariate plots have to be made (legacy mode).

--listcolors List the colors which are available for plotting and exit.

--listcolormaps List the colors which are available for plotting and exit.

--no-N50 Hide the N50 mark in the read length histogram

--N50 Show the N50 mark in the read length histogram

--title TITLE Add a title to all plots, requires quoting if using spaces

--font_scale FONT_SCALE

Scale the font of the plots by a factor

--dpi DPI Set the dpi for saving images

--hide_stats Not adding Pearson R stats in some bivariate plots

Input data sources, one of these is required.:

--fastq file [file ...]

Data is in one or more default fastq file(s).

--fasta file [file ...]

Data is in one or more fasta file(s).

--fastq_rich file [file ...]

Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy with additional information concerning channel and time.

--fastq_minimal file [file ...]

Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy with additional information concerning channel and time. Is extracted swiftly without elaborate checks.

--summary file [file ...]

Data is in one or more summary file(s) generated by albacore or guppy.

--bam file [file ...]

Data is in one or more sorted bam file(s).

--ubam file [file ...]

Data is in one or more unmapped bam file(s).

--cram file [file ...]

Data is in one or more sorted cram file(s).

--pickle pickle Data is a pickle file stored earlier.

--feather file [file ...]

Data is in one or more feather file(s).

EXAMPLES:

NanoPlot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed

NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot

NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000

Specify either the summary.txt output file produced by Guppy, albacore, or MinKNOW basecalling, or fastq or bam files.

#Specify summary.txt
NanoPlot --summary sequencing_summary.txt --loglength -o outdir

#Specify fastq. Multiple files are allowed.
NanoPlot -t 12 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots dot -o outdir

#Specify bam. Multiple files are allowed. Downsample to 10,000 reads before the analysis.
NanoPlot -t 12 --color yellow --bam alignment1.bam alignment2.bam --downsample 10000 -o outdir

--downsample N Reduce dataset to N reads by random sampling.

An HTML report is written to the output directory.
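To make the N50 mark and the --downsample option more concrete, here is a minimal Python sketch of both ideas. It is not NanoPlot's code: the function names (n50, downsample, summarize) are hypothetical, and the N50 definition used is the common one (the length L such that reads of length L or longer contain at least half of all bases).

```python
import random
from statistics import mean, median


def n50(lengths):
    """Return the read length at which the cumulative sum of lengths,
    taken from longest to shortest, first reaches half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def downsample(lengths, n, seed=0):
    """Randomly keep n reads without replacement (all reads if n is larger).
    The fixed seed is only there to make this illustration reproducible."""
    items = list(lengths)
    if n >= len(items):
        return items
    return random.Random(seed).sample(items, n)


def summarize(lengths):
    """Collect a few of the summary numbers a read-length report usually shows."""
    return {
        "reads": len(lengths),
        "bases": sum(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "n50": n50(lengths),
    }
```

For example, summarize([2, 3, 4, 5, 6]) reports 20 bases in total and an N50 of 5, because the two longest reads (6 + 5 = 11) already cover more than half of the bases.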

2. NanoComp (GitHub)

Compares multiple runs in terms of read length and quality.

pip install NanoComp

> NanoComp -h

usage: NanoComp [-h] [-v] [-t THREADS] [-o OUTDIR] [-p PREFIX] [--verbose] [--raw] [--store] [--tsv_stats] [--make_no_static] [--readtype {1D,2D,1D2}] [--maxlength N] [--minlength N] [--barcoded] [--split_runs TSV_FILE] [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]]

[-n names [names ...]] [-c colors [colors ...]] [--plot {violin,box,ridge,false}] [--title TITLE] [--dpi DPI]

Compares long read sequencing datasets.

General options:

-h, --help show the help and exit

-v, --version Print version and exit.

-t, --threads THREADS

Set the allowed number of threads to be used by the script

-o, --outdir OUTDIR Specify directory in which output has to be created.

-p, --prefix PREFIX Specify an optional prefix to be used for the output files.

--verbose Write log messages also to terminal.

--raw Store the extracted data in tab separated file.

--store Store the extracted data in a pickle file for future plotting.

--tsv_stats Output the stats file as a properly formatted TSV.

--make_no_static Do not make static (png) plots.

Options for filtering or transforming input prior to plotting:

--readtype {1D,2D,1D2}

Which read type to extract information about from summary. Options are 1D, 2D, 1D2

--maxlength N Drop reads longer than length specified.

--minlength N Drop reads shorter than length specified.

--barcoded Barcoded experiment in summary format, splitting per barcode.

--split_runs TSV_FILE

File: Split the summary on run IDs and use names in tsv file. Mandatory header fields are 'NAME' and 'RUN_ID'.

Options for customizing the plots created:

-f, --format [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]

Specify the output format of the plots, which are in addition to the html files

-n, --names names [names ...]

Specify the names to be used for the datasets

-c, --colors colors [colors ...]

Specify the colors to be used for the datasets

--plot {violin,box,ridge,false}

Which plot type to use: 'box', 'violin' (default), 'ridge' (joyplot) or 'false' (no plots)

--title TITLE Add a title to all plots, requires quoting if using spaces

--dpi DPI Set the dpi for saving images (deprecated)

Input data sources, one of these is required.:

--fasta file [file ...]

Data is in (compressed) fasta format.

--fastq files [files ...]

Data is in (compressed) fastq format.

--fastq_rich file [file ...]

Data is in one or more fastq file(s) generated by MinKNOW or guppy with

additional information concerning channel and time.

--summary files [files ...]

Data is in (compressed) summary files generated by guppy.

--bam files [files ...]

Data is in sorted bam files.

--ubam file [file ...]

Data is in one or more unmapped bam file(s).

--cram file [file ...]

Data is in one or more sorted cram file(s).

--pickle file [file ...]

Data is in one or more pickle file(s) from using NanoComp/NanoPlot.

--feather file [file ...]

Data is in one or more feather file(s).

EXAMPLES:

NanoComp --bam alignment1.bam alignment2.bam --outdir compare-runs

NanoComp --fastq reads1.fastq.gz reads2.fastq.gz reads3.fastq.gz --names run1 run2 run3

Specify the fastq or bam files you want to compare.

#Specify fastq
NanoComp --fastq ONT1.fastq.gz ONT2.fastq.gz ONT3.fastq.gz --names run1 run2 run3 --outdir outdir

#Specify bam
NanoComp --bam alignment1.bam alignment2.bam alignment3.bam --outdir outdir

--plot {violin,box,ridge,false} Which plot type to use: 'box', 'violin' (default), 'ridge' (joyplot) or 'false' (no plots)

Example output (part of the report)
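The kind of per-run comparison shown in the report can be sketched in a few lines of Python. This is only an illustration, not NanoComp's implementation: read_lengths and compare_runs are hypothetical names, and the parser assumes plain four-line FASTQ records (no wrapped sequence lines) that are already decompressed.

```python
import io
from statistics import median


def read_lengths(handle):
    """Yield the sequence length of each record in a four-line FASTQ stream."""
    for i, line in enumerate(handle):
        if i % 4 == 1:  # second line of every record holds the sequence
            yield len(line.strip())


def compare_runs(runs):
    """Map each run name to (number of reads, median read length).

    `runs` maps a run name to an iterable of FASTQ lines, for example an
    open file handle."""
    table = {}
    for name, handle in runs.items():
        lengths = list(read_lengths(handle))
        table[name] = (len(lengths), median(lengths) if lengths else 0)
    return table


# Tiny in-memory demo standing in for two runs.
demo = {
    "run1": io.StringIO("@a\nACGT\n+\n!!!!\n@b\nACGTAC\n+\n!!!!!!\n"),
    "run2": io.StringIO("@c\nAC\n+\n!!\n"),
}
result = compare_runs(demo)
```

With real data, the run names play the same role as the labels given to --names, and each value would come from an opened fastq file rather than an in-memory string.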

3. NanoQC (GitHub)

Investigates nucleotide composition and quality distribution at the ends of reads.

mamba install -c bioconda nanoQC -y

> nanoQC -h

usage: nanoQC [-h] [-v] [-o OUTDIR] [--rna] [-l MINLEN] fastq

Investigate nucleotide composition and base quality.

positional arguments:

fastq Reads data in fastq.gz format.

options:

-h, --help show this help message and exit

-v, --version Print version and exit.

-o OUTDIR, --outdir OUTDIR

Specify directory in which output has to be created.

--rna Fastq is from direct RNA-seq and contains U nucleotides.

-l MINLEN, --minlen MINLEN

Filters the reads on a minimal length of the given range. Also plots the given length/2 of the begin and end of the reads.

Specify a fastq file.

nanoQC ONT.fastq.gz -o OUTDIR

Example output
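What "nucleotide composition at the read ends" means can be shown with a small Python sketch. It is not nanoQC's code; end_composition and its window parameter are hypothetical, and skipping reads shorter than twice the window only loosely mirrors the idea behind -l/--minlen, which plots half of the given length from each end.

```python
from collections import Counter


def end_composition(seqs, window=3):
    """Count bases per position over the first and last `window` bases.

    Returns two lists of Counters: one entry per position at the start of
    the reads, and one entry per position at the end of the reads."""
    head = [Counter() for _ in range(window)]
    tail = [Counter() for _ in range(window)]
    for seq in seqs:
        if len(seq) < 2 * window:
            continue  # too short to give non-overlapping start and end windows
        for i in range(window):
            head[i][seq[i]] += 1
            tail[i][seq[len(seq) - window + i]] += 1
    return head, tail
```

A strong bias toward one base at a fixed position near the start or end, compared with the overall composition, is the sort of signal that suggests adapter remnants worth trimming.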

4. Cramino (GitHub)

A Rust replacement for NanoStat that summarizes BAM or CRAM files faster.

mamba install -c bioconda cramino -y

> cramino -h

cramino 0.9.7

Wouter De Coster decosterwouter@gmail.com

Tool to extract QC metrics from cram or bam

USAGE:

cramino [OPTIONS] <INPUT>

ARGS:

<INPUT> cram or bam file to check

OPTIONS:

-t, --threads <THREADS>

Number of parallel decompression threads to use [default: 4]

-m, --min-read-len <MIN_READ_LEN>

Minimal length of read to be considered [default: 0]

--hist

If histograms have to be generated

--checksum

If a checksum has to be calculated

--arrow <ARROW>

Write data to a feather format

--karyotype

Provide normalized number of reads per chromosome

--phased

Calculate metrics for phased reads

-h, --help

Print help information

-V, --version

Print version information

Specify a CRAM or bam file.

cramino -t 8 input.bam

Example output

Very fast. Analyzing a 1 GB long-read bam file stored on an HDD finished in a few seconds.

5. chopper (GitHub)

A Rust implementation that combines NanoLyse and NanoFilt, performing filtering, trimming, and contaminant removal faster.

mamba install -c bioconda chopper -y

> chopper -h

chopper 0.5.0

wdecoster <decosterwouter@gmail.com>

Filtering and trimming of fastq files. Reads on stdin and writes to stdout.

USAGE:

chopper [OPTIONS]

OPTIONS:

-q, --quality <MINQUAL> Sets a minimum Phred average quality score [default: 0]

--maxqual <MAXQUAL> Sets a maximum Phred average quality score [default: 1000]

-l, --minlength <MINLENGTH> Sets a minimum read length [default: 1]

--maxlength <MAXLENGTH> Sets a maximum read length [default: 2147483647]

--headcrop <HEADCROP> Trim N nucleotides from the start of a read [default: 0]

--tailcrop <TAILCROP> Trim N nucleotides from the end of a read [default: 0]

-t, --threads <THREADS> Use N parallel threads [default: 4]

-c, --contam <CONTAM> Filter contaminants against a fasta

-h, --help Print help information

-V, --version Print version information

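Provide long-read fastq on standard input; the filtered reads are written to standard output.

#An ONT example. Keep reads with average quality of at least 10 and length of at least 1000 bp.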
gunzip -c reads.fastq.gz | chopper -q 10 -l 1000 -t 12 | gzip > filtered_reads.fastq.gz

-t Use N parallel threads [default: 4]
-c Filter contaminants against a fasta
-l Sets a minimum read length [default: 1]
-q Sets a minimum Phred average quality score [default: 0]
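The -q and -l filters can be illustrated with a short Python sketch. This is not chopper's code: mean_phred, keep, and filter_fastq are hypothetical names, the input is assumed to be four-line FASTQ with Phred+33 qualities, and the average is assumed to be taken over error probabilities rather than over the raw Phred scores, which gives a lower value than a plain arithmetic mean whenever qualities vary.

```python
import math


def mean_phred(qual, offset=33):
    """Average quality: convert each Phred score to an error probability,
    average the probabilities, then convert back to the Phred scale."""
    probs = [10 ** (-(ord(c) - offset) / 10) for c in qual]
    return -10 * math.log10(sum(probs) / len(probs))


def keep(seq, qual, minqual=10, minlength=1000):
    """True if the read passes both the length and the quality threshold."""
    return len(seq) >= minlength and mean_phred(qual) >= minqual


def filter_fastq(lines, minqual=10, minlength=1000):
    """Yield the four lines of every record that passes the filters."""
    it = iter(lines)
    # zip over the same iterator four times groups lines into records
    for header, seq, plus, qual in zip(it, it, it, it):
        if keep(seq.strip(), qual.strip(), minqual, minlength):
            yield from (header, seq, plus, qual)
```

Used on a stream, something like sys.stdout.writelines(filter_fastq(sys.stdin)) would mimic the stdin-to-stdout behavior of the pipeline above, only far more slowly than the Rust tool.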

6. phasius (GitHub)

A Rust tool that creates plots showing how well reads were phased. Install it with cargo or download a binary from the releases.

cargo install phasius

7. kyber (GitHub)

A tool that quickly creates a 600x600 pixel heatmap image of log-transformed read length against accuracy from BAM/CRAM files. Use it for a quick first assessment of reads.
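The binning behind such a heatmap can be sketched in Python. This is not kyber's implementation: the heatmap function is hypothetical, and the axis ranges (log10 length from 0 to 6, accuracy from 50% to 100%) are assumptions chosen only for illustration.

```python
import math


def heatmap(points, size=600, max_log_len=6.0, min_acc=50.0):
    """Bin (read_length, percent_accuracy) pairs into a size x size grid.

    Columns follow log10(read length) from 0 to max_log_len; rows follow
    accuracy from min_acc to 100. Each cell holds a read count."""
    grid = [[0] * size for _ in range(size)]
    for length, acc in points:
        if length < 1 or acc < min_acc:
            continue  # outside the plotted range
        x = min(int(math.log10(length) / max_log_len * size), size - 1)
        y = min(int((acc - min_acc) / (100.0 - min_acc) * size), size - 1)
        grid[y][x] += 1
    return grid
```

The counts in the grid would then be mapped to pixel colors; log-scaling the length axis keeps both short and very long reads visible in the same fixed-size image.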

Other notes

NanoStat, NanoFilt, and NanoLyse, which were introduced on this blog previously, are now on the deprecated list, so keep that in mind.

Reference

NanoPack2: Population scale evaluation of long-read sequencing data
Wouter De Coster, Rosa Rademakers
Bioinformatics, Published: 12 May 2023

