サンプルのコンタミネーションを見積もる Mash Screen

2019 11/5 論文追加

　シーケンシング技術がスループットを高めそしてコストを下げ続けるにつれて、シーケンシングされたゲノムのデータベース（例えばNCBI RefSeq [ref.1]）は指数関数的成長を続け、それらに対する検索をさらに複雑にしている[ref.2、3]。さらに、rawシーケンスデータであるNCBIのSequence Read Archive（SRA）[ref.4]はさらに急速に成長しており、表示されているゲノムおよびメタゲノムをアセンブリしてキュレーションする能力を上回っている[ref.5、6]。これらの傾向は、典型的には短い（〜21-bp）、一定の長さkについてゲノムの部分配列をオーバーラップして操作する、アライメントフリー分析法を普及させた。この分野における著者らの以前の研究はMash [ref.7]を導入した。これはMinHash次元圧縮[ref.8]を用いて全ゲノムのk-merセットを数百または数千の値のスケッチに圧縮し、ゲノムまたはメタゲノムの距離の極めて迅速な推定を可能にする。単離したゲノムまたは高レベルのメタゲノム相関には便利だが、Mashはメタゲノムの構成要素の分析には適していない。これは、MinHashingの一般的なアプリケーションが包含よりも近似に近いためである。この研究では、封じ込めの問題に取り組み、ゲノムまたはプロテオームがメタゲノム内でどの程度うまく表されているかを測定するための新しい方法、Mash Screenを示す。これは、迅速なコンタミネーションスクリーニングから、利用可能なすべてのメタゲノム全体にわたる特定の細菌およびウイルス株の追跡までの範囲の用途を有する。

（複数段落省略）

我々は公共のSRAメタゲノムからの新規ポリオーマウイルス種の同定と集合について述べる。

解説

https://genomeinformatics.github.io/mash-screen/

"Mash Screen: High-throughput sequence containment estimation for genome discovery" is published! Two years after I blogged about it, and 8 months after the preprint. I thank @BrianOndov for his patience and the reviewers for some key corrections https://t.co/boAWvvpnn5
— Adam Phillippy (@aphillippy) November 5, 2019

MASHは以前v2.0を紹介しており、オーサーらの手順に従ってメタゲノムの分析の流れもまとめていますが、重要なツールなのでメタゲノムやコンタミチェックの手順に絞って改めて紹介します。

インストール

Github

リリースよりバイナリをダウンロードするか、パッケージマネージャで導入する。

conda install -c bioconda -y mash==2.1

> mash

$ mash

Mash version 2.1

Type 'mash --license' for license and copyright information.

Usage:

mash <command> [options] [arguments ...]

Commands:

bounds Print a table of Mash error bounds.

dist Estimate the distance of query sequences to references.

info Display information about sketch files.

paste Create a single sketch file from multiple sketch files.

screen Determine whether query sequences are within a larger pool of

sequences.

sketch Create sketches (reduced representations for fast operations).

triangle Estimate a lower-triangular distance matrix.

> mash screen

$ mash screen

Version: 2.1

Usage:

mash screen [options] <queries>.msh <pool> [<pool>] ...

Description:

Determine how well query sequences are contained within a pool of sequences.

The queries must be formatted as a single Mash sketch file (.msh), created

with the `mash sketch` command. The <pool> files can be contigs or reads, in

fasta or fastq, gzipped or not, and "-" can be given for <pool> to read from

standard input. The <pool> sequences are assumed to be nucleotides, and will

be 6-frame translated if the <queries> are amino acids. The output fields are

[identity, shared-hashes, median-multiplicity, p-value, query-ID,

query-comment], where median-multiplicity is computed for shared hashes, based

on the number of observations of those hashes within the pool.

Options:

Option Description (range) [default]

-h Help

-p <int> Parallelism. This many threads will be spawned for processing. [1]

-w Winner-takes-all strategy for identity estimates. After counting

hashes for each query, hashes that appear in multiple queries will

be removed from all except the one with the best identity (ties

broken by larger query), and other identities will be reduced. This

removes output redundancy, providing a rough compositional outline.

...Output...

-i <num> Minimum identity to report. Inclusive unless set to zero, in which

case only identities greater than zero (i.e. with at least one

shared hash) will be reported. Set to -1 to output everything.

(-1-1) [0]

-v <num> Maximum p-value to report. (0-1) [1.0]

mash triangleコマンドも新しく実装されていますね。

~ GIthubより ~

A new triangle command computes a lower triangular
distance matrix in relaxed Phylip format. This streamlines all-pairs distance
commands and avoids computational redundancy.）

----------------------------------------------------------------------------------------

データベースの準備

https://mash.readthedocs.io/en/latest/tutorials.html よりNCBI RefSeqゲノムのpresketchされたアーカイブ"refseq.genomes.k21.s1000.msh direct link"をダウンロードする。plasmidの"refseq.genomes+plasmids.k21.s1000.msh refseq.plasmids.k21.s1000.msh direct link "も必要ならダウンロードする（direct link）。

または、MASH Documentからもダウンロードできる（合体版）。こちらには、SRA metagenomesとRefSeqのsketchファイルも用意されている。

https://mash.readthedocs.io/en/latest/data.html

実行方法

通常sketchコマンドはlow k-merを捨てて実行するが、screenコマンドはコンタミも含めてfastqファイル内にある全てのゲノムを検出したい時に使う。fastqをそのまま引数に指定する。

mash screen -p 4 refseq.genomes.k21s1000.msh pair1.fq pair2.fq > screen.tab
sort -gr screen.tab | head

-p Parallelism. This many threads will be spawned for processing. [1]
-w Winner-takes-all strategy for identity estimates. After counting hashes for each query, hashes that appear in multiple queries will be removed from all except the one with the best identity (ties broken by larger query), and other identities will be reduced. This removes output redundancy, providing a rough compositional outline

トップヒットに特定の種の様々な株がヒットしてくる場合、winner take all戦略のフラグ（-w）を立てることで回避できる。

出力

f:id:kazumaxneo:20190305160613j:plain

引用

Mash Screen: High-throughput sequence containment estimation for genome discovery

Brian D Ondov, Gabriel J Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B Buck, Adam M Phillippy

bioRxiv preprint first posted online Mar. 1, 2019

追記

Mash Screen: high-throughput sequence containment estimation for genome discovery

Brian D. Ondov, Gabriel J. Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B. Buck & Adam M. Phillippy
Genome Biology volume 20, Article number: 232 (2019)

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

サンプルのコンタミネーションを見積もる Mash Screen