metaFlye: assembling long-read metagenomes

2019 5/28 typo fix; 8/20 typo fix

2020 9/23 help output updated; 10/6 paper citation added

2024/04/06 minor command fixes

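Assembling bacterial genomes from single-molecule long reads (generated by Pacific Biosciences or Oxford Nanopore sequencers) has substantially improved the contiguity of the assembled genomes compared with short-read assembly. In contrast, early long-read metagenomic studies reported lower yields and shorter read lengths than isolate bacterial assemblies. This made it difficult to produce high-quality assemblies and suggested that sample preparation protocols had to be optimized before long reads could be used in metagenomic studies (Tsai et al., 2016; Driscoll et al., 2017). However, recent improvements in high-molecular-weight DNA extraction make it possible to sequence complex metagenomes at high coverage and with longer reads (Moss and Bhatt, 2018; Bertrand et al., 2018; Somerville et al., 2018; Nicholls et al., 2019). These improved protocols have already been used to sequence complex bacterial communities (Bickhart et al., 2018; Stewart et al., 2018).

Several long-read assemblers (Chin et al., 2016; Li, 2016; Koren et al., 2017; Kamath et al., 2017; Kolmogorov et al., 2019; Ruan and Li, 2019) have already been applied to metagenomic datasets, but none of them was designed specifically for metagenome assembly. This is unfortunate, because long-read metagenome assembly can greatly increase contiguity over short-read assembly and has the potential to address the inherent limitations of short-read assemblies, such as resolution (Goltsman et al., 2018), detection of horizontal gene transfer (Guo et al., 2018), and sequencing of novel plasmids and viruses (Arredondo-Alonso et al., 2017; Paez-Espino et al., 2016).

Compared with assembling isolated bacteria, metagenome assembly poses additional computational challenges because of the highly non-uniform coverage of the species and strains that make up a sample and because of long intra-genomic and inter-genomic repeats (Li et al., 2015; Nurk et al., 2017). Reconstructing plasmids and viruses is also difficult (Antipov et al., 2019; Wick and Holt, 2019). The authors recently developed Flye, a fast long-read genome assembler, and showed that it produces accurate and contiguous assemblies (Kolmogorov et al., 2019). In 2019, Wick and Holt benchmarked Flye on a variety of bacterial datasets and demonstrated that it improves on state-of-the-art long-read assemblers.

The authors benchmark the metagenome assembler metaFlye on both mock and real bacterial communities and demonstrate that it produces high-quality assemblies.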

The metaFlye paper is finally out! Metagenome assembly is hard, but long reads make it a bit more feasible. We show that presence of closely-related strains and species is the main challenge. metaFlye has various graph simplification steps to address it. https://t.co/75yZsnZrS2
— Mikhail Kolmogorov (@fenderglass) October 5, 2020

metaFlye preprint is out! We describe how to apply repeat graphs to assemble microbial communities with uneven species abundance and very deep coverage. We also evaluate different algorithms using mock metagenomes and real cow rumen data. https://t.co/8tBjtSnXnC
— Mikhail Kolmogorov (@fenderglass) May 15, 2019

Manual

https://github.com/kazumaxneo/Flye/blob/flye/docs/USAGE.md

Installation

Tested in a miniconda2-4.0.5 environment on macOS 10.13.

Flye is available for Linux and MacOS platforms.

Dependencies

C++ compiler with C++11 support (GCC 4.8+ / Clang 3.3+ / Apple Clang 5.0+)
GNU make
Python 2.7
Git
Core OS development headers (zlib, etc)

Flye package includes some third-party software:

libcuckoo
intervaltree
lemon
minimap2
Graphviz (optional)

#Install Graphviz if you want to visualize the assembly graph with it
sudo apt install graphviz
#In an Anaconda environment it can be installed with conda
mamba install -y -c bioconda graphviz

Main program: GitHub

git clone https://github.com/fenderglass/Flye 
cd Flye 
python setup.py build

#Bioconda
mamba install -c bioconda -y flye==2.9

> flye -h

usage: flye (--pacbio-raw | --pacbio-corr | --pacbio-hifi | --nano-raw |
             --nano-corr | --subassemblies) file1 [file_2 ...]
             --out-dir PATH
             [--genome-size SIZE] [--threads int] [--iterations int]
             [--meta] [--plasmids] [--trestle] [--polish-target]
             [--keep-haplotypes] [--debug] [--version] [--help]
             [--resume] [--resume-from] [--stop-after] [--min-overlap SIZE]

Assembly of long reads with repeat graphs

optional arguments:
  -h, --help            show this help message and exit
  --pacbio-raw path [path ...]
                        PacBio raw reads
  --pacbio-corr path [path ...]
                        PacBio corrected reads
  --pacbio-hifi path [path ...]
                        PacBio HiFi reads
  --nano-raw path [path ...]
                        ONT raw reads
  --nano-corr path [path ...]
                        ONT corrected reads
  --subassemblies path [path ...]
                        high-quality contigs input
  -g size, --genome-size size
                        estimated genome size (for example, 5m or 2.6g)
  -o path, --out-dir path
                        Output directory
  -t int, --threads int
                        number of parallel threads [1]
  -i int, --iterations int
                        number of polishing iterations [1]
  -m int, --min-overlap int
                        minimum overlap between reads [auto]
  --asm-coverage int    reduced coverage for initial disjointig assembly [not
                        set]
  --plasmids            rescue short unassembled plasmids
  --meta                metagenome / uneven coverage mode
  --keep-haplotypes     do not collapse alternative haplotypes
  --trestle             enable Trestle [disabled]
  --polish-target path  run polisher on the target sequence
  --resume              resume from the last completed stage
  --resume-from stage_name
                        resume from a custom stage
  --stop-after stage_name
                        stop after the specified stage completed
  --debug               enable debug output
  -v, --version         show program's version number and exit

Input reads can be in FASTA or FASTQ format, uncompressed
or compressed with gz. Currently, PacBio (raw, corrected, HiFi)
and ONT reads (raw, corrected) are supported. Expected error rates are
<30% for raw, <3% for corrected, and <1% for HiFi. Note that Flye
was primarily developed to run on raw reads. Additionally, the
--subassemblies option performs a consensus assembly of multiple
sets of high-quality contigs. You may specify multiple
files with reads (separated by spaces). Mixing different read
types is not yet supported. The --meta option enables the mode
for metagenome/uneven coverage assembly.

Genome size estimate is no longer a required option. You
need to provide an estimate if using --asm-coverage option.

To reduce memory consumption for large genome assemblies,
you can use a subset of the longest reads for initial disjointig
assembly by specifying --asm-coverage and --genome-size options. Typically,
40x coverage is enough to produce good disjointigs.

You can run Flye polisher as a standalone tool using
--polish-target option.
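The help text relates --asm-coverage to a genome size estimate, so it can be useful to know roughly how much coverage a read set provides. The following is a minimal sketch, not something Flye ships: the function name estimate_coverage is hypothetical, it assumes a standard 4-line-per-record FASTQ (plain or gzip-compressed), and it expects the genome size as a plain number of bases (60000000 rather than 60m).

```shell
# Hypothetical helper, not part of Flye.
# Usage: estimate_coverage READS.fq[.gz] GENOME_SIZE_BP
# Assumes 4-line FASTQ records; the sequence is line 2 of each record.
estimate_coverage() {
    ec_reads="$1"
    ec_genome_bp="$2"
    if [ -z "$ec_reads" ] || [ -z "$ec_genome_bp" ]; then
        echo "usage: estimate_coverage READS.fq[.gz] GENOME_SIZE_BP" >&2
        return 1
    fi
    # Decompress on the fly when the file name ends in .gz.
    case "$ec_reads" in
        *.gz) gzip -dc "$ec_reads" ;;
        *)    cat "$ec_reads" ;;
    esac | awk -v g="$ec_genome_bp" '
        NR % 4 == 2 { bases += length($0) }   # sum the sequence lengths
        END {
            if (g <= 0) { print "genome size must be positive" > "/dev/stderr"; exit 1 }
            printf "bases=%.0f coverage=%.1fx\n", bases, bases / g
        }'
}
```

For a metagenome the "genome size" is only a rough total over all community members, so the resulting number is an approximation at best.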

Usage

Run flye with "--meta". Here the nanopore raw reads are specified with "--nano-raw". Add "--plasmids" if needed. "--genome-size" is no longer required as of v2.8.

#ONT
flye --nano-raw ONT.fq --out-dir outdir --threads 40 --meta --plasmids

#pacbio-raw
flye --pacbio-raw CLR.fq --out-dir outdir --threads 40 --meta --plasmids

--plasmids rescue short unassembled plasmids
--meta metagenome / uneven coverage mode
--pacbio-raw PacBio raw reads
--pacbio-corr PacBio corrected reads
--nano-raw ONT raw reads
--nano-corr ONT corrected reads
--subassemblies high-quality contigs input
-g estimated genome size (for example, 5m or 2.6g)
-o Output directory
-t number of parallel threads [1]
-i number of polishing iterations [1]
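When several samples with different read types have to be assembled, it can help to generate the command line first and review it before running. The following is a small sketch under my own assumptions: metaflye_cmd is a hypothetical helper (not part of Flye) that only prints a command using the flags shown above, and it does not quote paths, so file names containing spaces are not handled.

```shell
# Hypothetical helper, not part of Flye: print (do not run) a metaFlye command.
# Usage: metaflye_cmd READ_TYPE READS OUTDIR THREADS [EXTRA_FLAGS...]
metaflye_cmd() {
    if [ "$#" -lt 4 ]; then
        echo "usage: metaflye_cmd READ_TYPE READS OUTDIR THREADS [EXTRA_FLAGS...]" >&2
        return 1
    fi
    mf_type="$1"; mf_reads="$2"; mf_out="$3"; mf_threads="$4"
    shift 4
    # Accept only the read-type flags that appear in the flye help output.
    case "$mf_type" in
        pacbio-raw|pacbio-corr|pacbio-hifi|nano-raw|nano-corr) ;;
        *) echo "unsupported read type: $mf_type" >&2; return 1 ;;
    esac
    mf_cmd="flye --$mf_type $mf_reads --out-dir $mf_out --threads $mf_threads --meta"
    # Append any extra flags (for example --plasmids) unchanged.
    if [ "$#" -gt 0 ]; then
        mf_cmd="$mf_cmd $*"
    fi
    echo "$mf_cmd"
}

# Prints the ONT command from this section without executing it.
metaflye_cmd nano-raw ONT.fq outdir 40 --plasmids
```

Printing instead of executing keeps the helper safe to try, and the printed line can be pasted into the terminal once it looks right.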

Test run

Let's run it on the GridION mock bacterial community data.

https://github.com/LomanLab/mockcommunity

wget https://nanopore.s3.climb.ac.uk/Zymo-GridION-EVEN-BB-SN.fq.gz

time (/usr/bin/time -v flye --nano-raw Zymo-GridION-EVEN-BB-SN.fq.gz --out-dir out_nano --threads 40 --meta --plasmids -g 60m)

The run took 563 min, with a peak memory of 135 GB (*1).
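Runtime and peak memory can be read from the report that GNU time prints with -v. The sketch below assumes the report was saved to a file with GNU time's -o option. The function name summarize_time_log and the file name time.log are my own choices, and it relies on the "Maximum resident set size (kbytes)" and "Elapsed (wall clock) time" lines of GNU time's output (the BSD time shipped with macOS does not produce this format).

```shell
# Hypothetical helper: summarize a GNU time -v report.
# Example of producing the report (not executed here):
#   /usr/bin/time -v -o time.log flye --nano-raw reads.fq.gz --out-dir out_nano --threads 40 --meta
# Usage: summarize_time_log time.log
summarize_time_log() {
    awk -F': ' '
        # Wall-clock time is the last field after splitting on ": ".
        /Elapsed \(wall clock\) time/ { print "elapsed=" $NF }
        # GNU time reports peak RSS in kbytes; convert to GB (1024^2 kbytes).
        /Maximum resident set size \(kbytes\)/ { printf "peak_mem_gb=%.1f\n", $2 / 1024 / 1024 }
    ' "$1"
}
```

The elapsed value is printed in the h:mm:ss or m:ss form that GNU time uses, without conversion to minutes.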

The assembly graph was visualized with Bandage.

[Figure: Bandage view of the metaFlye assembly graph]
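Besides the graph, Flye writes a tab-separated assembly_info.txt into the output directory, which gives a quick numeric overview of the contigs. The sketch below is a rough summary under stated assumptions: summarize_assembly_info is a hypothetical name, the column layout (sequence name, length, coverage, circularity in columns 1 to 4, header line starting with #) is what I expect from Flye's output, and circular contigs are assumed to be marked with Y (or + in older releases). Check the header of your own file before relying on it.

```shell
# Hypothetical helper: summarize Flye's assembly_info.txt.
# Assumed columns: 1=seq_name, 2=length, 3=coverage, 4=circular flag.
# Usage: summarize_assembly_info out_nano/assembly_info.txt
summarize_assembly_info() {
    awk -F'\t' '
        /^#/ { next }                       # skip the header line
        {
            n++
            total += $2
            if ($2 > max) max = $2
            if ($4 == "Y" || $4 == "+") circ++   # assumed circular markers
        }
        END {
            printf "contigs=%d total_bp=%.0f longest_bp=%.0f circular=%d\n", n, total, max, circ
        }
    ' "$1"
}
```

The circular count is a convenient first hint at complete genomes and plasmids, which can then be inspected in Bandage.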

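Hybrid assembly that also uses short reads is planned for the future. Short-read-based assembly with SPAdes or Unicycler cannot make full use of the long-read information, so the authors state that they plan to write a program that assembles the long reads with Flye, assembles the short reads with metaSPAdes, and merges the two at the end.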

Citation

metaFlye: scalable long-read metagenome assembly using repeat graphs

Mikhail Kolmogorov, Mikhail Rayko, Jeffrey Yuan, Evgeny Polevikov, Pavel Pevzner
bioRxiv preprint first posted online May 15, 2019

2020 10/6

metaFlye: scalable long-read metagenome assembly using repeat graphs

Mikhail Kolmogorov, Derek M. Bickhart, Bahar Behsaz, Alexey Gurevich, Mikhail Rayko, Sung Bong Shin, Kristen Kuhn, Jeffrey Yuan, Evgeny Polevikov, Timothy P. L. Smith, Pavel A. Pevzner
Nature Methods, Published: 05 October 2020

*1 Run on a machine with one Xeon Scalable Platinum 8180 and 512 GB of memory.