2019-05-19

メタゲノムアセンブリを評価する MetaQUAST

2019 5/27 追記

2021 6/15, 6/22 コマンド修正

メタゲノミクスは、環境サンプルから直接採取した遺伝物質を研究する。 NGS技術は、クローニングなしに少量の生物からDNAを抽出しショートリードシーケンシングすることを可能にする。しかし、そのような実験で作成されたデータは膨大でノイズが多く、そして豊富さと相同性が大きく異なる何千もの種からのDNA断片を含む傾向がある。これらの課題は、メタゲノムアセンブリの新たな計算上の問題をもたらし、それに続いて標準的なベンチマーク手順を必要とする（Boisvert et al、2012； Peng et al、2012； Haider et al、2014）。

　既存のほとんどのアセンブリ評価手法は、メタゲノムと連携するようには設計されていない。しかしながら、アセンブリに関してリードの尤度をカウントする方法（Clark e t al、2013； Ghodsi et al、2013）、または保存された普遍的なシングルコピー遺伝子の含量を決定する方法（Parks et al、2015； Simao et al、2005）が存在する。残念ながら、どれもclosely relatedなリファレンスゲノムへのコンティグアラインメントを使用していない。この論文では、QUASTのメタゲノム改良版MetaQUASTを紹介する（Gurevich et al、2013）。 QUASTは、与えられたclosely relatedなリファレンスゲノムとのアラインメントに基づいてエラーを検出し、さらにN50のようなコンティグ統計およびユーザー提供のリファレンス配列がなくても構成種の概観を与える遺伝子量をプロットする。メタゲノムアセンブリに対処するために、MetaQUASTはいくつか新しい機能を追加している：（i）無制限数のリファレンスゲノムを使用する能力、（ii）自動化された種含有量検出、（iii）キメラコンティグの検出。

　既知の種含有量を持つよく研究されたメタゲノムデータセット（Qin et al、2010）またはmockリードセット（Boisvert et al、2012; Namiki et al、2012）がある。それらは、リファレンスゲノムへのアラインメントに基づいてアセンブリ方法を評価するためにMetaQUASTと共に使用することができる。複数リファレンスパイプラインは、4つの主要なステップで構成されている（論文補足図S1）。

１、すべてのリファレンスゲノムは1つのファイルに結合される（結合リファレンス）。 QUASTは、結合リファレンスversus 入力アセンブリで評価される。 1つだけではなくすべてのあいまいなアラインメントも報告するように強制する。closely relatedな種を含むメタゲノムデータセットのために、すべてのあいまいなアラインメントは不可欠である。
２、生成されたアラインメントに基づいて、全てのコンティグをグループに分割する。複数ゲノムにマッピングされたコンティグは、すべてのマッチンググループに入る。アラインメントされていないコンティグは1つのextraグループに入る。
３、次に、各入力リファレンスに対してQUASTが別々に起動する。未アラインメントグループは、入力リファレンスなしで処理される。
４、最後に、すべてのQUAST実行の結果が要約レポートと視覚化してまとめられる。ユーザーは、各分析の詳細なフルQUAST出力と、データセット全体の結果の概観の両方を見ることができる。
スタンダードな品質評価metrics（N50、genome fractionなど）に加え、2つの測定基準を追加した。

　I nterspecies translocationsの数：隣接配列が異なるリファレンスにアラインメントする。

　誤ってアセンブリされた可能性があるコンティグの数：アラインメントサイズによらず未知のゲノムとの種間転座を含むコンティグ数。
GeneMarkSを使用する通常のQUASTとは対照的に、MetaQUASTはメタゲノム用に開発されたMetaGeneMarkを遺伝子予測に使用する（Zhu et al、2010）。

デノボ評価

ほとんどの実験的メタゲノミクス研究は、リファレンス情報が利用できないde novoアセンブリで動作する。 MetaQUASTが入力リファレンス配列または種リストなしで実行されると、それは種の内容を識別しそして自動的にリファレンス配列を引き出すことを試みる。このアルゴリズムは、研究者が主に微生物群集に興味を持っているという仮定の下で機能するので、検索は細菌と古細菌に限定されることに注意する。

　ワークフロー（論文補足図Ｓ２参照）は、BLASTn（Camacho et al、2009）を適用して、SILVAデータベースから16S rRNA配列にコンティグをアラインメントさせることから開始する（Quast et al、2012）。ほとんど全ての微生物種に存在する16Sサブユニットは高度に保存された配列であるが、生物を分類群に分類するにも役立つ超可変領域も含む。検出された各種について、最高のスコアを持つ1つの株がアセンブリに残る。

　上位50の生物がNCBIに対して照会され、それぞれの種について最も断片化されていない配列がダウンロードされる。生物間のrRNAオペロンのコピー数の違い、および16S遺伝子のゲノム内の不均一性に関する既知の問題のために、ダウンロードされたゲノム配列のいくつかは評価中のアセンブリで表現できない可能性がある。 MetaQUASTは、10％未満のコンティグカバレッジフラクションですべてのアセンブリに対してゲノムを除去することによって、偽陽性をフィルタリングしようとする。全ての配列が非常に低いゲノムfractionを有する特別な場合には、リストはフィルタリングされない。

　結果として、本発明者らは、おそらくアセンブリ配列によって表される一組のゲノムを得る。これらのシーケンスを使って（セクション2.1のように）MetaQUASTを起動し、通常のリファレンスベースの解析の場合と同じ出力ファイルを生成する。

私たち（著者ら）のアプローチは正確さと時間/メモリ消費の間の妥協である。より正確な結果を得るためには、MGTAXA（Williamson et al、2012）、または正確なリードアライメントに基づく方法、例えばKraken（Wood and Salzberg、2014）またはCLARK（Ounit et al、2015）を使う。非常に正確な結果を得るには、全NCBI-nrデータベースに対するBLASTx（Altschul et al、1990）検索によって得ることができる。得られた種名リストは、プレーンテキスト形式でMetaQUASTに送ることができ、NCBIデータベースから指定された配列をダウンロードしてそれらをリファンレンスに基づく評価に使用することができる。（以下略）

manual

http://quast.bioinf.spbau.ru/manual.html

インストール

condaを使ってpython3.9環境に導入した。

依存

QUAST can be run on Linux (64-bit and 32-bit with slightly limited functionality) or macOS (OS X).

Python2 (2.5 or higher) or Python3 (3.3 or higher)
GCC 4.7 or higher
Perl 5.6.0 or higher
GNU make and ar
zlib development files

In addition, QUAST submodules require:

Time::HiRes perl module for GeneMark-ES (gene prediction in eukaryotes)
Java 1.8 or later for GRIDSS (SV detection)
R for GRIDSS (SV detection)

Github

#bioconda
#仮想環境を作って導入する
mamba create -n quast
conda activate quast
mamba install -c bioconda -y quast


#docker (非公式イメージだが、コンソーシアムで使用されており信頼性が高い、Link)
docker pull staphb/quast:latest

> metaquast.py -h

$ metaquast.py -h

MetaQUAST: Quality Assessment Tool for Metagenome Assemblies

Version: 5.0.0

Usage: python /Users/kazuma/.pyenv/versions/miniconda3-4.3.14/bin/metaquast.py [options] <files_with_contigs>

Options:

-o --output-dir <dirname> Directory to store all result files [default: quast_results/results_<datetime>]

-r <filename,filename,...> Comma-separated list of reference genomes or directory with reference genomes

--references-list <filename> Text file with list of reference genome names for downloading from NCBI

-g --features [type:]<filename> File with genomic feature coordinates in the references (GFF, BED, NCBI or TXT)

Optional 'type' can be specified for extracting only a specific feature type from GFF

-m --min-contig <int> Lower threshold for contig length [default: 500]

-t --threads <int> Maximum number of threads [default: 25% of CPUs]

Advanced options:

-s --split-scaffolds Split assemblies by continuous fragments of N's and add such "contigs" to the comparison

-l --labels "label, label, ..." Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes

-L Take assembly names from their parent directory names

-e --eukaryote Genome is eukaryotic (primarily affects gene prediction)

--fungus Genome is fungal (primarily affects gene prediction)

--large Use optimal parameters for evaluation of large genomes

In particular, imposes '-e -m 3000 -i 500 -x 7000' (can be overridden manually)

-k --k-mer-stats Compute k-mer-based quality metrics (recommended for large genomes)

This may significantly increase memory and time consumption on large genomes

--k-mer-size Size of k used in --k-mer-stats [default: 101]

--circos Draw Circos plot

-f --gene-finding Predict genes using MetaGeneMark

--glimmer Use GlimmerHMM for gene prediction (instead of the default finder, see above)

--gene-thresholds <int,int,...> Comma-separated list of threshold lengths of genes to search with Gene Finding module

[default: 0,300,1500,3000]

--rna-finding Predict ribosomal RNA genes using Barrnap

-b --conserved-genes-finding Count conserved orthologs using BUSCO (only on Linux)

--operons <filename> File with operon coordinates in the reference (GFF, BED, NCBI or TXT)

--max-ref-number <int> Maximum number of references (per each assembly) to download after looking in SILVA database.

Set 0 for not looking in SILVA at all [default: 50]

--blast-db <filename> Custom BLAST database (.nsq file). By default, MetaQUAST searches references in SILVA database

--use-input-ref-order Use provided order of references in MetaQUAST summary plots (default order: by the best average value)

--contig-thresholds <int,int,...> Comma-separated list of contig length thresholds [default: 0,1000,5000,10000,25000,50000]

-u --use-all-alignments Compute genome fraction, # genes, # operons in QUAST v1.* style.

By default, QUAST filters Minimap's alignments to keep only best ones

-i --min-alignment <int> The minimum alignment length [default: 65]

--min-identity <float> The minimum alignment identity (80.0, 100.0) [default: 95.0]

-a --ambiguity-usage <none|one|all> Use none, one, or all alignments of a contig when all of them

are almost equally good (see --ambiguity-score) [default: one]

--ambiguity-score <float> Score S for defining equally good alignments of a single contig. All alignments are sorted

by decreasing LEN * IDY% value. All alignments with LEN * IDY% < S * best(LEN * IDY%) are

discarded. S should be between 0.8 and 1.0 [default: 0.99]

--unique-mapping Disable --ambiguity-usage=all for the combined reference run,

i.e. use user-specified or default ('one') value of --ambiguity-usage

--strict-NA Break contigs in any misassembly event when compute NAx and NGAx.

By default, QUAST breaks contigs only by extensive misassemblies (not local ones)

-x --extensive-mis-size <int> Lower threshold for extensive misassembly size. All relocations with inconsistency

less than extensive-mis-size are counted as local misassemblies [default: 1000]

--scaffold-gap-max-size <int> Max allowed scaffold gap length difference. All relocations with inconsistency

less than scaffold-gap-size are counted as scaffold gap misassemblies [default: 10000]

--unaligned-part-size <int> Lower threshold for detecting partially unaligned contigs. Such contig should have

at least one unaligned fragment >= the threshold [default: 500]

--skip-unaligned-mis-contigs Do not distinguish contigs with >= 50% unaligned bases as a separate group

By default, QUAST does not count misassemblies in them

--fragmented Reference genome may be fragmented into small pieces (e.g. scaffolded reference)

--fragmented-max-indent <int> Mark translocation as fake if both alignments are located no further than N bases

from the ends of the reference fragments [default: 85]

Requires --fragmented option

--upper-bound-assembly Simulate upper bound assembly based on the reference genome and reads

--upper-bound-min-con <int> Minimal number of 'connecting reads' needed for joining upper bound contigs into a scaffold

[default: 2 for mate-pairs and 1 for long reads]

--est-insert-size <int> Use provided insert size in upper bound assembly simulation [default: auto detect from reads or 255]

--plots-format <str> Save plots in specified format [default: pdf].

Supported formats: emf, eps, pdf, png, ps, raw, rgba, svg, svgz

--memory-efficient Run everything using one thread, separately per each assembly.

This may significantly reduce memory consumption on large genomes

--space-efficient Create only reports and plots files. Aux files including .stdout, .stderr, .coords will not be created.

This may significantly reduce space consumption on large genomes. Icarus viewers also will not be built

-1 --pe1 <filename> File with forward paired-end reads (in FASTQ format, may be gzipped)

-2 --pe2 <filename> File with reverse paired-end reads (in FASTQ format, may be gzipped)

--pe12 <filename> File with interlaced forward and reverse paired-end reads. (in FASTQ format, may be gzipped)

--mp1 <filename> File with forward mate-pair reads (in FASTQ format, may be gzipped)

--mp2 <filename> File with reverse mate-pair reads (in FASTQ format, may be gzipped)

--mp12 <filename> File with interlaced forward and reverse mate-pair reads (in FASTQ format, may be gzipped)

--single <filename> File with unpaired short reads (in FASTQ format, may be gzipped)

--pacbio <filename> File with PacBio reads (in FASTQ format, may be gzipped)

--nanopore <filename> File with Oxford Nanopore reads (in FASTQ format, may be gzipped)

--ref-sam <filename> SAM alignment file obtained by aligning reads to reference genome file

--ref-bam <filename> BAM alignment file obtained by aligning reads to reference genome file

--sam <filename,filename,...> Comma-separated list of SAM alignment files obtained by aligning reads to assemblies

(use the same order as for files with contigs)

--bam <filename,filename,...> Comma-separated list of BAM alignment files obtained by aligning reads to assemblies

(use the same order as for files with contigs)

Reads (or SAM/BAM file) are used for structural variation detection and

coverage histogram building in Icarus

--sv-bedpe <filename> File with structural variations (in BEDPE format)

Speedup options:

--no-check Do not check and correct input fasta files. Use at your own risk (see manual)

--no-plots Do not draw plots

--no-html Do not build html reports and Icarus viewers

--no-icarus Do not build Icarus viewers

--no-snps Do not report SNPs (may significantly reduce memory consumption on large genomes)

--no-gc Do not compute GC% and GC-distribution

--no-sv Do not run structural variation detection (make sense only if reads are specified)

--no-gzip Do not compress large output files

--no-read-stats Do not align reads to assemblies

Reads will be aligned to reference and used for coverage analysis,

upper bound assembly simulation, and structural variation detection.

Use this option if you do not need read statistics for assemblies.

--fast A combination of all speedup options except --no-check

Other:

--silent Do not print detailed information about each step to stdout (log file is not affected)

--test Run MetaQUAST on the data from the test_data folder, output to quast_test_output

--test-no-ref Run MetaQUAST without references on the data from the test_data folder, output to quast_test_output.

MetaQUAST will download SILVA 16S rRNA database (~170 Mb) for searching reference genomes

Internet connection is required

-h --help Print full usage message

-v --version Print version

2019年5月現在、最新版はv5.0.2。ロングリードを使うオプションも存在する。

データベースの準備

quast-download-gridss 
quast-download-silva 
quast-download-busco

実行方法

mock communityなどでリファレンスが存在する場合、アセンブリした配列とリファレンスを指定する。複数ある場合、リファレンスはカンマで繋ぐ。

metaquast.py contigs_1.fasta contigs_2.fasta contigs_3.fasta \
 -r reference_1,reference_2,reference_3 \
 -l asembly1,assembly2,assembly3 \
 -1 reads1.fq.gz -2 reads2.fq.gz \
 -o output -t 40 --gene-finding

-o Directory to store all result files [default: quast_results/results_<datetime>]
-r Comma-separated list of reference genomes or directory with reference genomes
-l Names of assemblies to use in reports, comma-separated. If contain spaces, use quotes
--gene-finding Predict genes using MetaGeneMark
--glimmer Use GlimmerHMM for gene prediction (instead of the default finder, see above)

*NOTICE: Genes are not predicted by default. Use --gene-finding or --glimmer option to enable it.

リファレンスを指定しない場合、16S rRNAの分析結果をもとにリファレンス群がダウンロードされ、比較が行われる。

ペアエンドfastqも指定（マッピング済みならbamも使える）。時間がかかるsv callはなし。

metaquast.py contigs_1 contigs_2 contigs_3 \
 -1 reads1.fq.gz -2 reads2.fq.gz \
 -o output -t 40 --gene-finding --no-sv

Heng Liさんも指摘されてますが、リファレンスが異なる場合、SVがある部位がミスアセンブリと判定されてしまい、評価が悪くなる可能性があります。

引用
MetaQUAST: evaluation of metagenome assemblies
Mikheenko A, Saveliev V, Gurevich A

Bioinformatics. 2016 Apr 1;32(7):1088-90

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

メタゲノムアセンブリを評価する MetaQUAST