アンプリコンシーケンシングのアセンブルとクオリティフィルタリングツール moira

　16Sリボソーム RNAなどのマーカー遺伝子のハイスループットシークエンシングは、複雑な微生物群集の分析においてこれまで到達できなかった詳細なレベルの分析を可能にするため、微生物生態の研究者にとって非常に貴重なツールとなっている。 Roche 454、Illumina、IonTorrentシーケンサーなどのプラットフォームを手頃なコスト（ref.1〜5）で徹底的に特性評価し比較するために多くの研究が行われてきたが、希少な生物圏の構造や組成を分析するために非常に高い収率を利用する研究もある。ただし、シーケンス成果を正しく評価することが代表的な結果を得るために重要になる。（一部略）

最も一般的なソフトウェアツールとパッケージには、推奨パイプラインでOTUへのシーケンスクラスタリングが含まれている（ref,8〜15、いくつかの分子生物学パイプラインの比較については（15）を参照）。分布ベースのクラスタリング（ref.16）やクラスタリングフリーのアプローチ（ref.17）など、従来のクラスタリングに代わるものが最近提案されている。これらの新しい方法は、subpopulationレベルの研究に特に適しているが、中レベルから高存在度のシーケンスにのみ有効で、populationレベルのアルファまたはベータ多様性研究には適していない（ref.17）。さらに、それらがありそうな誤ったシーケンスを除去し、動的情報に基づいてsubpopulationsを解決することができるとしても、それらはrawリードの前処理でのフィルタリングステップの品質頼っている（ref.17）。

　Amplicon denoising（18, 19）は、Roche 454パイロシーケンシングリードをフィルタリングするための広範な方法であり、Iontorrentデータにも適用できる。シーケンスではなくflowgramsに対して機能する。これにより、パイロシーケンシングおよびIontorrentシーケンスに特徴的なホモポリマーエラーのより自然なモデリングが可能になる。ただし、これはプラットフォーム固有であり、計算コストが高くなる。

　Illuminaシステムでは、mothur（ref.20）、QIIME（ref.21）、およびUPARSE（ref.15）の著者が異なる解決策を提案しているため、クオリティフィルタリングに対するコンセンサスアプローチはない。これらのヒューリスティックなアプローチはすべて、それぞれのパイプラインの一部として公開されているが、著者らの知る限りでは、それらは互いに完全には比較されていない。

　マーカー - 遺伝子配列の分析にクオリティスコアを組み込むための厳密な方法が欠如していることが、偽の多様性検索を減らす厳格なフィルタリングを提唱するいくつかの著者を導いた（ref.3）。しかしながら、過度に厳格なフィルタリングをは、感度の望ましくない喪失をもたらし、観察された分類学的分布に影響を与える（ef.21）。したがって、これらの問題を克服する正確なアルゴリズムが望ましい。

　ここでは、簡単な統計的アプローチを使用することによって、関連するクオリティスコアを有する任意のシーケンスのエラー率分布を決定することができるPoisson binomial filtering（PBF）法を提示して検証する。特定のベースコールが誤っている確率を表すフレッドクオリティスコアは、すべてのシーケンスプラットフォームの生の出力から導き出すことができる。単一の塩基を読むことはコインを投げることにたとえることができる：ベースは正しいか間違っている、そして両方の可能性はそのクオリティスコアから決定することができる。実際、ある塩基に存在するエラーの数はベルヌーイ分布、すなわち単一試行の二項分布に従う。潜在的に等しくないエラー確率を有するヌクレオチド配列について、その配列がｋ個より多くのエラーを累積した推定確率を得るためにそれらの関連ベルヌーイ確率変数を合計する。 mothur、QIIME、またはUPARSEなどの主流の分子生物学パイプラインに含まれるフィルタリング手法と比較すると、Poisson binomial filteringはマーカー遺伝子配列をフィルタリングするための最も正確なアルゴリズムであることが証明された。さらに、PBFは単純な統計的原則に基づいており、入力としてフレッドクオリティスコアのみを必要とするため、クオリティスコアが真のエラー確率の許容可能な予測値であることと、エラーは独立していると仮定すれば、シーケンスプラットフォームに関係なく堅牢に機能することが期待される。最後に、このアルゴリズムは計算効率が良く、シーケンス数に比例してスケーリングし、メモリーフットプリントが低いため、パフォーマンスの低いデスクトップ環境でも役に立つ。

moiraに関するツイート

インストール

mac os10.12のpython2.7.10環境でテストした。

依存

python2.7

本体 Github

#pipで導入できる
pip install moira

> moira.py -h

$ moira.py -h

usage: moira.py [-h] [-ff FORWARD_FASTA] [-fq FORWARD_QUAL]

[-rf REVERSE_FASTA] [-rq REVERSE_QUAL] [-ffq FORWARD_FASTQ]

[-rfq REVERSE_FASTQ] [-l RELABEL] [-o {fasta,fastq}]

[-pi {mothur,USEARCH}] [-op OUTPUT_PREFIX] [-oc {none,gz,bz2}]

[-p PROCESSORS] [--paired] [-fo FASTQ_OFFSET] [--only_contig]

[--silent] [--nowarnings] [--doc] [-m MATCH] [-x MISMATCH]

[-g GAP] [--trim_overlap] [-i INSERT] [-d DELTAQ]

[-q {best,sum,posterior}] [-z QSCORE_CAP] [-c COLLAPSE]

[-t TRUNCATE] [-mo MIN_OVERLAP]

[-e {poisson_binomial,poisson_binomial_py,poisson,bootstrap}]

[-n {disallow,ignore,treat_as_errors}] [-r]

[-u UNCERT | -me MAXERRORS] [-a ALPHA] [-b BOOTSTRAP]

Perform quality filtering on a set of sequences.

optional arguments:

-h, --help show this help message and exit

General options:

-ff FORWARD_FASTA, --forward_fasta FORWARD_FASTA

Forward fasta file (can be gzip or bzip2 compressed).

-fq FORWARD_QUAL, --forward_qual FORWARD_QUAL

Forward qual file (can be gzip or bzip2 compressed).

-rf REVERSE_FASTA, --reverse_fasta REVERSE_FASTA

Reverse fasta file (can be gzip or bzip2 compressed).

-rq REVERSE_QUAL, --reverse_qual REVERSE_QUAL

Reverse qual file (can be gzip or bzip2 compressed).

-ffq FORWARD_FASTQ, --forward_fastq FORWARD_FASTQ

Forward fastq file (can be gzip or bzip2 compressed).

-rfq REVERSE_FASTQ, --reverse_fastq REVERSE_FASTQ

Forward fastq file (can be gzip or bzip2 compressed).

-l RELABEL, --relabel RELABEL

Generate sequential labels for the ordered sequences,

with the specified string at the beginning.

-o {fasta,fastq}, --output_format {fasta,fastq}

Output format: fasta with qual (and mothur name file

if --collapse) or fastq.

-pi {mothur,USEARCH}, --pipeline {mothur,USEARCH}

Make the output format compatible with the indicated

analysis pipeline.

-op OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX

Prefix for the output files

-oc {none,gz,bz2}, --output_compression {none,gz,bz2}

Prefix for the output files

-p PROCESSORS, --processors PROCESSORS

Number of processors to be used.

--paired Assemble paired-end reads and perform quality control

on the resulting contig.

-fo FASTQ_OFFSET, --fastq_offset FASTQ_OFFSET

--only_contig Assemble contigs but don't perform quality control.

--silent Do not print welcome, progress and goodbye messages.

Warnings will still be printed.

--nowarnings Do not print warning messages.

--doc Print full documentation.

Contig construction options:

-m MATCH, --match MATCH

Needleman-Wunsch aligner match score.

-x MISMATCH, --mismatch MISMATCH

Needleman-Wunsch aligner mismatch penalty.

-g GAP, --gap GAP Needleman-Wunsch aligner gap penalty.

--trim_overlap Trim the contig to the overlapping region.

-i INSERT, --insert INSERT

Contig constructor insert threshold.

-d DELTAQ, --deltaq DELTAQ

Contig constructor mismatch correction deltaq

threshold.

-q {best,sum,posterior}, --consensus_qscore {best,sum,posterior}

Contig constructor consensus qscore: always report

best qscore, sum qscores for matching bases, report

posterior qscores.

-z QSCORE_CAP, --qscore_cap QSCORE_CAP

Maximum consensus quality score reported by the contig

constructor. Use 0 for removing the qscore cap.

Sequence filtering options:

-c COLLAPSE, --collapse COLLAPSE

Collapse identical sequences before quality control.

-t TRUNCATE, --truncate TRUNCATE

Truncate sequences to a fixed length before quality

control. Discard smaller sequences.

-mo MIN_OVERLAP, --min_overlap MIN_OVERLAP

Discard contigs with less than the specified overlap

length.

-e {poisson_binomial,poisson_binomial_py,poisson,bootstrap}, --error_calc {poisson_binomial,poisson_binomial_py,poisson,bootstrap}

Error calculation method.

-n {disallow,ignore,treat_as_errors}, --ambigs {disallow,ignore,treat_as_errors}

Treatment of ambiguities: remove sequences with

ambigs, ignore ambigs, treat ambigs as errors.

-r, --round Round down the predicted number of errors to their

nearest integer prior to filtering.

-u UNCERT, --uncert UNCERT

Maximum allowed uncertainty (errors / sequence

length).

-me MAXERRORS, --maxerrors MAXERRORS

Maximum allowed errors per sequence.

-a ALPHA, --alpha ALPHA

Alpha cutoff value for the error distributions.

-b BOOTSTRAP, --bootstrap BOOTSTRAP

Number of replicates to use with the bootstrap method

実行方法

ペアエンドfastq（can be compressed with gzip or bzip2.）をアセンブルしてクオリティフィルタリング。fastq出力する。

moira.py --forward_fastq=pair1.fq --reverse_fastq=pair2.fq --paired --consensus_qscore posterior --output_format fastq --processors 8

--forward_fastq Forward fastq file (can be gzip or bzip2 compressed)

--reverse_fastq Reverse fastq file (can be gzip or bzip2 compressed)
--paired Assemble paired-end reads and perform quality control on the resulting contig
--output_format {fasta, fastq}
--consensus_qscore {best, sum, posterior} ntig constructor consensus qscore: always report best qscore, sum qscores for matching bases, report posterior qscores
--pipeline {mothur, USEARCH} Make the output format compatible with the indicated analysis pipeline.
--processors (default 1) number of processes to use
--maxerrors Maximum allowed errors per sequence
--alpha Alpha cutoff value for the error distributions
--truncate Truncate sequences to a fixed length before quality control. Discard smaller sequences

アセンブルされクオリティフィルタリングされたfastqと、エラーやgap数をリードごとに記載したレポートファイルが出力される。

f:id:kazumaxneo:20190126213600j:plain

イルミナ以外のシーケンシングデータで、fastaとクオリティファイルが別々になっているデータセットでは、個別にフラグを立てて指定します。詳細はGithubで確認して下さい。

ペアエンドfastqをアセンブル。クオリティフィルタリングは実行せずcontig.fastaのみ出力する。

moira.py --forward_fastq=pair1.fq --reverse_fastq=pair2.fq --paired --only_contig

--only_contig Assemble contigs but don't perform quality control

出力ファイル

f:id:kazumaxneo:20190126214106j:plain

引用

A novel conceptual approach to read-filtering in high-throughput amplicon sequencing studies

Fernando Puente-Sánchez, Jacobo Aguirre, Víctor Parro

Nucleic Acids Res. 2016 Feb 29; 44(4): e40

参考

例えばこちらのpreprintの前処理に使われています。

https://www.biorxiv.org/content/biorxiv/early/2019/01/25/530022.full.pdf

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

アンプリコンシーケンシングのアセンブルとクオリティフィルタリングツール moira