2019-02-12

メタゲノムデータからrRNAをターゲットアセンブリし、系統アサイン、定量、比較する phyloFlash

2019 5/9 インストール追記

2020 6/11 インストール方法修正

2020 6/16 trusted contigのコマンド追記

2020 10/28 論文追記

2023/05/10 ツイート追記

　ショットガンメタゲノミクスは、微生物群集の機能を調査し、それらの系統または分類学的な構成を決定するための強力なツールである（Preprintより　ref.1、2）。プライマーバイアス（ref.3）やキメラ配列（ref.4、5）など、PCRベースのアンプリコンメソッドに関連するバイアスやアーティファクトを回避できる。ショットガンシークエンシングは、単一のマーカーではなくゲノム全体からサンプリングするため、より多くの情報も得られる。ショットガンメタゲノムライブラリーの構成をプロファイルするために様々なアプローチが適用されてきた。一般的な選択はローカルアライメントであり、これには、単一遺伝子内の特定の領域（ref.6、7）、保存されたシングルコピー遺伝子の集合（ref.8、9）、または全ゲノムデータ（10、11）が含まれる。他の研究者は、リードを分類するためにk-merマッチングなどのアラインメントフリーの方法を使用し（ref.12–14）、分類を超えて特定の遺伝子のターゲットアセンブリを実行するツールもある（ref.15、16）。分類学的または機能的プロファイリングに使用される方法に関係なく、結果は利用可能なリファレンスデータによって制限される。例えば、ターゲットの生物またはそれらのclose relativesはまだデータベースに表示されていない可能性があり、遺伝子水平伝播は、特に原核生物において相反する系統発生シグナルをもたらす可能性がある（ref.17、18）。
　分子生態学において、rRNAスモールサブユニット遺伝子（SSU rRNA）は最も重要なマーカーである。なぜなら、それは配列を実際の細胞にリンクさせるために容易に使用することができるからである。SSU rRNAは、リボソームから高コピーで入手でき、そして既に確立された分子プロービング技術であるfluorescence in situ hybridization （FISH）を通してアクセスすることができる（ref.19、20）。イメージングベース分析における価値と共に、水平伝播の速度が遅いため、SSU rRNA遺伝子は系統発生的多様性の観点から最も優れたサンプリングマーカーとなっている（ref.21）。現在のメタゲノム解析の進歩（22）では、メタゲノムアセンブリから微生物ゲノムのドラフトを自動的に抽出 (“binned”) することができるが、SSU rRNA遺伝子は同じ本質的な役割を持ち、系統学、イメージング、および実験的検証に不可欠になっている。しかし、自動ビニングの進歩にもかかわらず（ref.23）、ほとんどのmetagenome assembled genomes (MAGs) は、SSU rRNA遺伝子の断片さえ含んでいない、また全遺伝子は言うまでもなく含んでいない（ref.1、24）。
　理想的には、既存の膨大なSSU rRNA遺伝子のナレッジベースを（メタ）ゲノミクスプロジェクトで活用して、いくつかの異なる結果を得たいと考えている：アセンブリなしの分類学的プロファイリング（ref.6、7）、系統学のための全長配列のターゲットアセンブリと、それよるプローブ設計（ref.15、16、25、26）、およびSSU rRNA配列を完全なゲノムへリンクさせる（ref.27）。これらの各目的のために、それぞれ独自の長所と短所を持つ別々のソフトウェアツールがすでに開発されている（ref.16）（MATAM紹介）。ただし、これらは同じ問題の最終的には異なる側面であるため、一緒に検討する必要がある。
　ここではphyloFlash、メタゲノムからSSU rRNAの迅速なプロファイリングとターゲットアセンブリのためのオープンソースパイプラインについて説明する。我々（著者ら）は分類学的プロファイリングとターゲットアセンブリの異なるステップにおけるトレードオフを評価するためにパイプラインを使用し、リファレンスデータベースの作成から始まり、ターゲットリードの抽出とアセンブリ、そして最後にアセンブリされた配列をゲノムビンにリンクする。シミュレートされたリードデータとリアル環境メタゲノムの両方を使用して、rRNA専用の方法と、汎用的なリードマッパーおよびアセンブラのパフォーマンスを比較する。また公開されたデータを再分析することで、メタゲノムの構成を評価、比較するためにパイプラインがどのように使用できるか示す。最後に、我々（著者ら）は、ゲノム中心のメタゲノムと実験的分子生態学との間の橋渡しをするために、メタゲノムからのゲノムビンがどのようにSSU rRNA配列にリンクされ得るかを示す。

What does phyloFlash do? (Githubより)

Summarize taxonomic diversity of a metagenome/transcriptome library from SSU rRNA read affiliations
Assemble/reconstruct full-length SSU rRNA sequences suitable for phylogenetic analysis
Quick comparison of multiple samples by their taxonomic composition using a heatmap

Figure 1. Flowchart of the phyloFlash pipeline. Preprintより転載

マニュアル（インストール方法、使い方、出力の説明）

2023/05/10

easier setup for phyloFlash: you can now download pre-formatted SILVA databases from Zenodo (https://t.co/kIewtdwLBm) and start screening your metagenome libraries right away!
Bioconda recipe also tweaked to fix some env solver issueshttps://t.co/ZOQ1Isqein @GruberVodicka
— Brandon Seah (@kbseah) May 8, 2023

インストール

mac os10.14の miniconda3-4.3.30とminiconda2-4.0.5環境で動作確認した。

依存

To use phyloFlash you will need a GNU/Linux system with Perl, R and Python installed. (OS X is for the brave, we have not tested this!)

Perl >= 5.13.2
EMIRGE and its dependencies
BBmap
Vsearch
SPAdes
Bedtools
Mafft
Barrnap (customized version is provided with phyloFlash)

Optional: SortMeRNA v2.1b, if you want to use it as alternative to BBmap
These tools need to be in your $PATH environment variable, so that phyloFlash can find them.

In addition, you will need R and the following R packages for plotting if you use the phyloFlash_compare.pl script for comparing multiple samples:

ggdendro
gtable
reshape2
ggplot2
optparse

本体　Github

2019年2月現在のバージョンは3.3 beta1

#Anacondaを使っているならcondaで導入できる

conda create --name PHYLOFLASHenv python=3.5
conda activate PHYLOFLASHenv
conda install -c bioconda -y phyloflash

#if you want to use SortMeRNA option
conda install -c bioconda -y sortmerna=2.1b

> phyloFlash.pl -help

$ phyloFlash.pl -help

Usage:

phyloFlash.pl [OPTIONS] -lib *name* -read1 <file> -read2 <file>

phyloFlash.pl -help

phyloFlash.pl -man

phyloFlash.pl -check_env

> phyloFlash.pl -man

NAME

phyloFlash - A script to rapidly estimate the phylogenetic composition of

an Illumina (meta)genomic dataset and reconstruct SSU rRNA genes

SYNOPSIS

phyloFlash.pl [OPTIONS] -lib name -read1 <file> -read2 <file>

phyloFlash.pl -help

phyloFlash.pl -man

phyloFlash.pl -check_env

DESCRIPTION

This tool rapidly approximates the phylogenetic composition of a

(meta)genomic library by mapping reads to a reference SSU rRNA database,

and reconstructing full-length SSU rRNA sequences. The pipeline is

intended for Illumina paired- or single-end HiSeq and MiSeq reads.

ARGUMENTS

INPUT

-lib NAME

Library NAME to use for output file. The name must be one word

comprising only letters, numbers and "_" or "-" (no whitespace or

other punctuation).

-read1 FILE

Forward reads in FASTA or FASTQ formats. May be compressed with

Gzip (.gz extension). If interleaved reads are provided, please

use --interleaved flag in addition for paired-end processing.

-read2 FILE

File containing reverse reads. If this option is omitted,

phyloFlash will run in experimental single-end mode.

-check_env

Invokes checking of working environment and dependencies without

data input. Use to test setup.

-help Print brief help.

-man Show manual.

-outfiles

Show detailed list of output and temporary files and exit.

-version

Report version

LOCAL SETTINGS

-CPUs N Set the number of threads to use. Defaults to all available CPU

cores.

-crlf Use CRLF as line terminator in CSV output (to become RFC4180

compliant).

-decimalcomma

Use decimal comma instead of decimal point to fix locale problems.

Default: Off

-dbhome DIR

Directory containing phyloFlash reference databases. Use

phyloFlash_makedb.pl to create an appropriate directory.

ANALYSIS TOOLS

-emirge Turn on EMIRGE reconstruction of SSU sequences

Default: Off ("-noemirge")

-skip_spades

Turn off SPAdes assembly of SSU sequences

Default: No (use SPAdes by default)

-sortmerna

Use Sortmerna instead of BBmap to extract SSU rRNA reads. Insert

size and %id to reference statistics will not be available.

Default: No

INPUT AND ANALYSIS OPTIONS

-interleaved

Interleaved readfile with R1 and R2 in a single file at read1

-readlength N

Sets the expected readlength. Always use this if your read length

differs from 100 (the default). Must be within 50..500.

-readlimit N

Limits processing to the first N reads in each input file. Use

this for transcriptomes with a lot of rRNA reads (use values

<1000000).

Default: unlimited

-amplimit N

Sets the limit of SSU read pairs to switch from emirge.py to

emirge_amplicon.py. This feature is not reliable as

emirge_amplicon.py has been problematic to run (use values

>100000).

Default: 500000

-id N Minimum allowed identity for BBmap read mapping process in %. Must

be within 50..98. Set this to a lower value for very divergent

taxa

Default: 70

-evalue_sortmerna N

E-value cutoff to use for SortMeRNA (only if -sortmerna option

chosen). In scientific exponent notation (see below)

Default: 1e-09

-clusterid N

Identity threshold for clustering with vsearch in %. Must be

within 50..100.

Default: 97

-taxlevel N

Level in the taxonomy string to summarize read counts per taxon.

Numeric and 1-based (i.e. "1" corresponds to "Domain").

Default: 4 ("Order")

-tophit Report taxonomic summary based on the best hit per read only.

Otherwise, the consensus of all ambiguous mappings of a given read

will be used to assign its taxonomy. (This was the default

behavior before v3.2)

Default: No (use consensus)

-maxinsert N

Maximum insert size allowed for paired end read mapping. Must be

within 0..1200.

Default: 1200

-sc Turn on single cell MDA data mode for SPAdes assembly of SSU

sequences

-trusted FILE

User-supplied Fasta file of trusted contigs containing SSU rRNA

sequences. The SSU sequences will be extracted with Barrnap, and

the input read files will be screened against these extracted

"trusted" SSU sequences

-poscov Use Nhmmer to find positional coverage of reads across Barrnap's

HMM model of the 16S and 18S rRNA genes from a subsample of reads,

as an estimate of coverage evenness.

Default: Off ("-noposcov")

-everything

Turn on all the optional analyses and output options. Options

without defaults and any local settings must still be specified.

Equivalent to "-emirge -poscov -treemap -zip -log"

-almosteverything

Like -everything except without running EMIRGE.

OUTPUT OPTIONS

-html Generate output in HTML format.

Default: On. (Turn off with "-nohtml")

-treemap

Draw interactive treemap of taxonomic classification in

html-formatted report. This uses Google Visualization API, which

requires an internet connection, requires that you agree to their

and is not open source, although it is free to use.

Default: Off ("-notreemap")

-zip Compress output into a tar.gz archive file

Default: Off ("-nozip")

-keeptmp

Keep temporary/intermediate files

Default: No ("-nokeeptmp")

-log Write status messages printed to STDERR also to a log file

Default: Off ("-nolog")

COPYRIGHT AND LICENSE

Elmar A. Pruesse <elmar.pruesse@ucdenver.edu> with help from Brandon Seah

<kbseah@mpi-bremen.de>

LICENSE

This program is free software: you can redistribute it and/or modify it

under the terms of the GNU General Public License as published by the Free

Software Foundation, either version 3 of the License, or (at your option)

any later version.

This program is distributed in the hope that it will be useful, but

WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY

or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License

for more details. You should have received a copy of the GNU General

Public License along with this program. If not, see

<http://www.gnu.org/licenses/>.

> phyloFlash_makedb.pl --help

$ phyloFlash_makedb.pl --help

Usage:

### download databases via FTP

phyloFlash_makedb.pl --remote

### use local copies

phyloFlash_makedb.pl --silva_file path/to/silva_db --univec_file

path/to/univec_db

## Get help

phyloFlash_makedb.pl --help

phyloFlash_makedb.pl --man

Arguments:

Input Files:

--remote

Download databases via FTP

--silva_file path/to/silva_db

Path to local copy of SILVA database file. Ignored if --remote

flag is used.

This should be the Fasta-formatted SILVA SSURef file, clustered

at 99% identity, with SILVA taxonomy strings in file header, and

sequences truncated to SSU gene boundaries. The file name should

be in the form

*SILVA_[Release]_SSURef_Nr99_tax_silva_trunc.fasta.gz*

--univec_file path/to/univec_db

Path to local copy of Univec database file. Ignored if --remote

flag is used.

Optional Tools:

--emirge

Index database with Bowtie v1 for Emirge. Requires

*bowtie-build* to be in path.

Default: Yes (turn off with --noemirge)

--sortmerna

Index database for Sortmerna, if you wish to use it as an

alternative to BBmap for extracting rRNA reads from the read

file. Requires *indexdb_rna* to be in path.

Default: No (--nosortmerna).

Configuration and Help:

--keep Do not delete intermediary files

--overwrite

Overwrite files if already present. Files are not overwritten by

default, allowing you to restart the DB build process if it was

interrupted (but you will have to do find and delete corrupted

files manually).

Default: No ("--nooverwrite")

--CPUs *N*

Number of processors to use

Default: All available

--mem *N*

Memory limit (in Gb) for indexing tools. At least 10 is

recommended.

Default: 10

--log *FILE*

Write phyloFlash_makedb.pl log to a file.

Default: None

--check_env

Check that required dependencies are available in your path. If

specifying optional tools *--sortmerna* and *--emirge*, put the

*--check_env* argument at the end of the command.

--help Help message

--man Full manual page

--version

Report version

> phyloFlash_heatmap.R

# phyloFlash_heatmap.R

Loading required package: optparse

Usage: ./phyloFlash_heatmap.R [options] [files]

Generates a heatmap plot from multiple phyloFlash result sets. For more control,

source this file from R.

Files:

A list of files and/or directories that will be searched

for phyloFlash results.

Options:

-v, --verbose

Be more talkative

-q, --quiet

Be less talkative

-d, --debug

Show debug messages

-n MIN-NTU-COUNT, --min-ntu-count=MIN-NTU-COUNT

Sum NTUs with less counts in pseudo NTU "Other". Default 50.

--no-split

Do not split heatmap

-t SPLIT-REGEX, --split-regex=SPLIT-REGEX

Split heatmap using this regex on taxa. Multiple regex can be

specified comma separated. Default 'Eukaryota'

-l, --long-taxnames

Do not shorten taxa names to last two groups

-a, --absolute

Do not scale columns to percentages

-m CLUSTER-SAMPLES, --cluster-samples=CLUSTER-SAMPLES

Use this method for clustering/sorting samples. Can be:

alpha, ward.D, single, complete, average, mcquitty, median, centroid, or custom.

Default is ward.D.

-M CLUSTER-TAXA, --cluster-taxa=CLUSTER-TAXA

Use this method for clustering/sorting taxa. Can be:

alpha, ward.D, single, complete, average, mcquitty, median or centroid.

Default is ward.D

-r ROWS, --rows=ROWS

Component rows, in order, to render (separated by commas).

Valid terms are: tree, map, chao and labels.

Default is tree,map,chao,labels.

-c COLS, --cols=COLS

Component columns, in order, to render (separated by commas).

Valid terms are: labels, map and tree.

Default is labels,map,tree.

--colors=COLORS

Colors for heatmaps. Default is steelblue,indianred,green,orange.

-o OUT, --out=OUT

Name of output file. Must end in .png or .pdf. Default is out.png.

--aa=AA

Type of anti-aliasing to use for PNG output. Can be one of default,

none, gray, or subpixel. Default is gray.

-s OUT-SIZE, --out-size=OUT-SIZE

Size of output graphic in pixels (e.g. 100x100). Assumes 72 DPI for

PDF. Using "auto" for a dimension will attempt to guess at suitable

size. Default autoXauto

--library-name-from-file

Use thee filename to derive library name instead of parsing ...report.csv

--custom-distance-matrix-sample=CUSTOM-DISTANCE-MATRIX-SAMPLE

Import custom distance matrix for samples instead of calculating

from abundance matrix

-h, --help

Show this help message and exit

データベース作成

初回のみ

# Install reference database (takes some time)
phyloFlash_makedb.pl --remote

--remote Download databases via FTP

ラップトップで実行すると数時間かかった。

依存とデータベースのチェック

phyloFlash.pl -check_env

This is phyloFlash v3.3b1

[15:31:06] Current operating system darwin

[15:31:06] Checking for required tools.

[15:31:06] Using grep found at "/usr/bin/grep".

[15:31:06] Using bbmap found at

...

[15:31:06] All required tools found.

[15:31:06] Using dbhome './132'

O.K

テストラン

ペアエンドfastqを指定する。gzip圧縮（.fastq.gz）にも対応している。

git clone https://github.com/HRGV/phyloFlash.git
cd phyloFlash/
phyloFlash.pl -lib TEST -read1 test_files/test_F.fq.gz -read2 test_files/test_R.fq.gz

--CPUs Number of processors to use Default: All available
-lib Library NAME to use for output file. The name must be one word
comprising only letters, numbers and "_" or "-" (no whitespace or
other punctuation)
-read1 Forward reads in FASTA or FASTQ formats. May be compressed with Gzip (.gz extension). If interleaved reads are provided, please use --interleaved flag in addition for paired-end processing
-read2 File containing reverse reads. If this option is omitted, phyloFlash will run in experimental single-end mode
-maxinsert Maximum insert size allowed for paired end read mapping. Must be between 0 and 1200. Default: 1200.

感度は以下のパラメータで設定する。

-id Minimum % identity of reads to map against reference database. Must be between 50 and 98. Set to a lower value for very divergent taxa. Default: 70.
-clusterid identity threshold for reference sequence clustering step. Must be between 50 and 100. Default: 97.

出力（html）

f:id:kazumaxneo:20190210072758p:plain

f:id:kazumaxneo:20190210072800p:plain

interleaveのfastq。リード長は150-bpとする。CPU使用数は8までに制限する。

phyloFlash.pl -lib TEST -read1 reads.fq.gz -interleaved -readlength 150 -CPUs 8

-readlength Sets the expected readlength. Always use this if your read length differs from 100 (the default). Must be within 50..500
-interleaved Interleaved readfile with R1 and R2 in a single file at read1
--CPUs Number of processors to use Default: All available

EMIRGEも走らせ、16S rRNAを再構成する。logも出力する。全出力はzip圧縮する。

phyloFlash.pl -lib TEST -read1 reads_F.fq.gz -read2 reads_R.fq.gz -emirge -zip -log

--emirge Index database with Bowtie v1 for Emirge. Requires *bowtie-build* to be in path Default: Yes (turn off with --noemirge)

SPAdesとEMIRGEも両方走らせ、optional outputも全て出力する。

phyloFlash.pl -lib TEST -read1 reads_F.fq.gz -read2 reads_R.fq.gz -zip -log -everything

-everything Turn on all the optional analyses and output options. Options without defaults and any local settings must still be specified. Equivalent to -emirge -poscov -treemap -zip -log
-almosteverything Like -everything except without running EMIRGE
-treemap Draw interactive treemap of taxonomic classification in html-formatted report. This uses Google Visualization API, which requires that you agree to their terms of service (see https://developers.google.com/chart/terms), and is not open source, although it is free to use.

信頼できるSSU RNA配列データセットをユーザーが提供する。

phyloFlash.pl -lib TEST -read1 pair_1.fq.gz -read2 pair_2.fq.gz 
 -trusted silva-euk-28s.fasta

-trusted User-supplied Fasta file of trusted contigs containing SSU rRNA sequences. The SSU sequences will be extracted with Barrnap, and the input read files will be screened against these extracted "trusted" SSU sequences

出力

f:id:kazumaxneo:20190210074326p:plain

f:id:kazumaxneo:20190210074329p:plain

f:id:kazumaxneo:20190210074335p:plain

メタゲノムシーケンシングへの適用

汚水メタゲノムデータの分析に使ってみた。

phyloFlash.pl -lib TEST -read1 ERR2607605_1.fq.gz -read2 ERR2607605_2.fq.gz -zip -log -almosteverything

8GBx2（27.4M）のfastqの分析にかかった時間は3分40秒くらいだった（*1）。通常のメタゲノム分析パイプラインの処理時間と比較すると圧倒的に短い時間で処理している。

出力

f:id:kazumaxneo:20190210212934j:plain

f:id:kazumaxneo:20190210212937j:plain

複数出力の比較。各ランの定量結果の.csvファイルを指定する。

phyloFlash.pl -lib saample1 -read1 sample1_F.fq.gz -read2 sample1_R.fq.gz -almosteverything -zip
phyloFlash.pl -lib saample2 -read1 sample2_F.fq.gz -read2 sample2_R.fq.gz -almosteverything -zip
phyloFlash.pl -lib saample3 -read1 sample3_F.fq.gz -read2 sample3_R.fq.gz -almosteverything -zip

#出力を解凍し、それぞれのディレクトリのphyloFlash.NTUabundance.csvとphyloFlash.report.csvをカレントにコピーする。
#サンプル間比較実行
phyloFlash_heatmap.R -o output.pdf -n 50 --library-name-from-file sample*csv

#ワイルドカードを使わず各々指定
phyloFlash_heatmap.R -o output.pdf -n 50 --library-name-from-file \
saample1.phyloFlash.NTUabundance.csv \
saample2.phyloFlash.NTUabundance.csv \
saample3.phyloFlash.NTUabundance.csv

output.pdf

f:id:kazumaxneo:20190210230847j:plain

他のスクリプトも便利そうです。

ENA_phyloFlash.pl - Automatically download read files from ENA given a read accession number, and run phyloFlash on them
phyloFlash_fastgFishing.pl - Given a metagenomic assembly graph in Fastg format, identify SSU rRNA sequences and extract contigs connected to them. Optionally compare to phyloFlash results from the same library.

引用

phyloFlash – Rapid SSU rRNA profiling and targeted assembly from metagenomes

Harald R. Gruber-Vodicka, Brandon K. B. Seah, Elmar Pruesse

bioRxiv preprint first posted online Jan. 17, 2019

2020 10/28

phyloFlash: Rapid Small-Subunit rRNA Profiling and Targeted Assembly from Metagenomes
Harald R. Gruber-Vodicka, Brandon K. B. Seah, Elmar Pruesse

DOI: 10.1128/mSystems.00920-20