ショートリードおよびロングリードシーケンスデータのアダプター残存など包括的な品質チェックを行う Sequali

2025/03/19 追記

　シーケンスデータの品質管理は、多くのシーケンスワークフローの最初のステップである。ショートリードおよびロングリードシーケンス技術には、品質管理に関して多くの共通点がある。品質管理プログラムはいくつか存在するが、両方のテクノロジーに適した機能セットを持つものはない。Oxford Nanopore Technologiesシーケンス向けの品質管理プログラムには、アダプター検索、過剰発現配列解析、重複解析などの重要な機能が欠けている。Sequaliは、ショートリードとロングリードの両方のシーケンス技術に対応したシーケンス品質管理を提供するために開発された。アダプター検索、過剰発現配列解析、重複解析などの機能を備え、FASTQおよびuBAM入力をサポートしている。ショートリードとロングリードの両方のシーケンステクノロジーにおいて、同等のシーケンス品質管理プログラムよりも大幅に高速である。

SequaliはC拡張を用いたオープンソースのPythonアプリケーションであり、AGPL-3.0ライセンスのもと、https://github.com/rhpvorderman/sequaliで利用できる。各リリースのソースコードはzenodo: https://zenodo.org/doi/10.5281/zenodo.10822485 にアーカイブされている。

特徴（レポジトリより）

MultiQC バージョン 1.22 以降の MultiQC サポート。
メモリフットプリントが小さく、インストールサイズが小さく、実行時間が速い。
2コア（デフォルト）で実行した場合、Sequaliに必要なメモリは通常2GB未満、実行時間は3～30分。
一目で配列の質を判断できる有益なグラフ。
21bpの配列断片を用いた過剰発現解析。過剰発現配列はNCBIのunivecデータベースと照合される。
ファイルシステムの重複推定にも使用されるフィンガープリントサブサンプリング技術を用いて重複率を推定。
シングルリードデータに対して、6つのilluminaアダプター配列と17のnanoporeアダプター配列をチェックする。
ペアエンドリードデータのオーバーラップ解析によるアダプターの決定
ペアリードデータのインサートサイズメトリックス
イルミナリードのタイルごとの品質プロット
ナノポアリードのチャンネルおよびその他のプロット
FASTQおよびunaligned BAM（リードペア情報は無視される）に対応。
guppyでコールされた配列については、ナノポア特有のデータのプロットが追加される。
doradoによって提供されるBAMデータについては、追加のnanoporeプロットが提供される。
アライメントされたBAMファイルでは、samtools fastqのデータ処理方法と同様に、secondaryリードおよびsupplementaryリードは無視される。

Documentation

https://sequali.readthedocs.io/en/latest/

インストール

Github

#pip(PyPI)
pip install sequali

#conda(bioconda)
mamba install -c conda-forge -c bioconda sequali -y

> sequali -h

$ sequali -h

usage: sequali [-h] [--json JSON] [--html HTML] [--outdir OUTDIR] [--adapter-file ADAPTER_FILE] [--overrepresentation-threshold-fraction FRACTION] [--overrepresentation-min-threshold THRESHOLD] [--overrepresentation-max-threshold THRESHOLD] [--overrepresentation-max-unique-fragments N]

[--overrepresentation-fragment-length LENGTH] [--overrepresentation-sample-every DIVISOR] [--duplication-max-stored-fingerprints N] [--fingerprint-front-length LENGTH] [--fingerprint-back-length LENGTH] [--fingerprint-front-offset LENGTH] [--fingerprint-back-offset LENGTH] [-t THREADS]

[--version]

INPUT [INPUT_REVERSE]

Create a quality metrics report for sequencing data.

positional arguments:

INPUT Input FASTQ or uBAM file. The format is autodetected and compressed formats are supported.

INPUT_REVERSE Second FASTQ file for Illumina paired-end reads.

options:

-h, --help show this help message and exit

--json JSON JSON output file. default: '<input>.json'.

--html HTML HTML output file. default: '<input>.html'.

--outdir OUTDIR, --dir OUTDIR

Output directory for the report files. default: current working directory.

--adapter-file ADAPTER_FILE

File with adapters to search for. See default file for formatting. Default: /home/kazu/mambaforge/lib/python3.10/site-packages/sequali/adapters/adapter_list.tsv.

--overrepresentation-threshold-fraction FRACTION

At what fraction a sequence is determined to be overrepresented. The threshold is calculated as fraction times the number of sampled sequences. Default: 0.001 (1 in 1,000).

--overrepresentation-min-threshold THRESHOLD

The minimum amount of occurrences for a sequence to be considered overrepresented, regardless of the bound set by the threshold fraction. Useful for smaller files. Default: 100.

--overrepresentation-max-threshold THRESHOLD

The amount of occurrences for a sequence to be considered overrepresented, regardless of the bound set by the threshold fraction. Useful for very large files. Default: unlimited.

--overrepresentation-max-unique-fragments N

The maximum amount of unique fragments to store. Larger amounts increase the sensitivity of finding overrepresented sequences at the cost of increasing memory usage. Default: 5,000,000.

--overrepresentation-fragment-length LENGTH

The length of the fragments to sample. The maximum is 31. Default: 21.

--overrepresentation-sample-every DIVISOR

How often a read should be sampled. More samples leads to better precision, lower speed, and also towards more bias towards the beginning of the file as the fragment store gets filled up with more sequences from the beginning. Default: 1 in 8.

--duplication-max-stored-fingerprints N

Determines how many fingerprints are maximally stored to estimate the duplication rate. More fingerprints leads to a more accurate estimate, but also more memory usage. Default: 1,000,000.

--fingerprint-front-length LENGTH

Set the number of bases to be taken for the deduplication fingerprint from the front of the sequence. Default: 8.

--fingerprint-back-length LENGTH

Set the number of bases to be taken for the deduplication fingerprint from the back of the sequence. Default: 8.

--fingerprint-front-offset LENGTH

Set the offset for the front part of the deduplication fingerprint. Useful for avoiding adapter sequences. Default: 64 for single end, 0 for paired sequences.

--fingerprint-back-offset LENGTH

Set the offset for the back part of the deduplication fingerprint. Useful for avoiding adapter sequences. Default: 64 for single end, 0 for paired sequences.

-t THREADS, --threads THREADS

Number of threads to use. If greater than one an additional thread for gzip decompression will be used. Default: 2.

--version show program's version number and exit

実行方法

入力リードの指定は必須。ほかはオプション。

#シングルエンド
sequali --outdir output input.fastq.gz

#ペアエンド
sequali --outdir output R1.fastq.gz R2.fastq.gz

--json JSON output file. default: '<input>.json'.
--html HTML output file. default: '<input>.html'.
--outdir Output directory for the report files. default: current working directory.

unaligned BAMも分析できる。ペアエンドには対応していないので、ナノポアロングリードのベースコールしたbamファイルの分析に適している。

sequali dorado_basecalled.bam

690MB のショートリードの分析は5秒程度で完了した。シングルエンドでもペアエンドでもランタイムは変わらなかった（ペアごとにスレッドが割り振られて並列に分析されるため）。

（ロングリードでも基本的な統計の表示形式は変わらないが、guppy.bamやdorado.bam向け項目もある）

その他（レポジトリより）

Sequaliはデフォルトで、圧縮された入力ファイルごとに1つのスレッドを使用し、読み取り処理に1つのスレッドを使用するため、通常は2つのコアを使用する。
スレッド数を 2 以上にしても効果はない。解凍がボトルネックになるため、Sequali にマルチスレッドのフルパワーを持ち込んでも、コードの複雑さを増やすという不釣り合いな高いコストがかかる一方で有用性は限られる。

Sequaliはfastqのクリーニングツールではなく、リード品質を分析するためのツールです。特に残存アダプターやそれ以外の過剰に出現する配列の分析に長けています。こちらのページで解説されています。

https://sequali.readthedocs.io/en/latest/#module-option-explanations

レポジトリでは、cutadaptでアダプターを除外する前後でSequaliを実行し、残存アダプターが除けたことを確認する使い方を提案しています。

引用

Sequali: efficient and comprehensive quality control of short- and long-read sequencing data
Ruben H P Vorderman
Bioinformatics Advances, Volume 5, Issue 1, 2025, vbaf010