rRNAを再構成するEMIRGE - macでインフォマティクス

2020 6/11 インストール手順追記

2020 6/16 emirge_makedb.py help追加

　微生物群集構成の特徴付けは、系統発生マーカー遺伝子、最も一般的にはリボソームスモールサブユニットRNA（SSU rRNA）遺伝子[ref.1]を用いて行われることが最も多い。伝統的に、rRNA配列は増幅、クローニング、およびサンガーシーケンシングによって決定されてきた。さらに最近では（一部略）

　各方法には限界があるが、これらのハイスループットアプローチは広く採用されており、幅広い環境からの微生物群集構成の新しい理解をもたらした[ref.8、11、12]。

　メタゲノムデータで直接SSU遺伝子を検索すると、PCRおよびプライマーのバイアスが回避される[ref.16、17]。例えば、異常な16S rRNA遺伝子配列を持つ新規の深く分岐した古細菌がメタゲノムシーケンシングによって最近検出された[ref.18]。これらの異なる配列は、標準的なSSUプライマーを用いた増幅に頼った方法では回収されなかった。

報告されているほとんどのメタゲノムシーケンシングは、サンガーまたは454シーケンシング技術を使用している。これらのデータセットにおけるclosely rellatedな生物のためのrRNA遺伝子配列は共集合する。その結果、コミュニティのメンバーを代表するものではなく、本当のレベルの多様性があいまいになるような複合シーケンスが発生する。これらの問題は、より短いシークエンスリードを使用すると悪化する。さらに、ｋ-merに基づく方法は、SSU遺伝子内に見いだされるような、種間配列同一性が高い領域の近くの新規アセンブリさらに混乱させる可能性がある。

de novoアセンブリの代替手段は、利用可能であればリードをリファレンスシーケンスにマッピングすることである。いくつかの高速でメモリ効率の良いマッピングプログラムが利用可能であり、それらはすべてアライメントを検索しながら様々なレベルのエラーを許容する[ref.19-21]。コミュニティ構成が前もって知られていて、すべてのリファレンスSSUシーケンスが利用可能であるならば、種の豊富さの定量は理論的にリードマッピング問題に帰着することができる。しかし、環境サンプルの組成が事前にわかっていることはめったにない。 SSU遺伝子における高い配列の保存性は、closely relatedな株間で多くのあいまいなリードアサインを生じ、マッピング戦略を混乱させる。

ここでは、微生物コミュニティに存在する完全長SSU配列を正確に再構成する、EMアルゴリズム[ref.22]に基づく、新規の反復マッピング法を報告する。expectation maximization iterative reconstruction of genes from the environment （EMIRGE）と呼ばれるこの方法は、入力として、rawショートリードおよびキュレートされたSSU配列データベースをとる。いくつかの繰り返しリードマッピングサイクルが完了し、その間にマッピングされたリードによって徐々に最も可能性の高いコンセンサス配列が発見され修正される。このアルゴリズムは、確率論的に記述された完全長のSSUシーケンス、およびそのコミュニティにおける相対的存在量の測定値を生成する。このバイオインフォマティックアプローチは、複雑さが大幅に異なる、浅い微生物集団と深くサンプリングされた微生物集団の両方に適用できる。

インストール

mac os10.14のminiconda2環境でテストした。

依存

python (tested with version 2.6), with the following packages installed:
BioPython
Cython
pysam
scipy / numpy
usearch (www.drive5.com/usearch/ -- tested with usearch version 6.0.203; versions earlier than this are incompatible).
samtools (http://samtools.sourceforge.net/ -- tested with verison 0.1.18)
bowtie (http://bowtie-bio.sourceforge.net/index.shtml -- tested with version 0.12.7 and 0.12.8)
vsearch (https://github.com/torognes/vsearch -- optional, but required if you use emirge_makedb.py)

本体 Github

#bioconda(link)
conda install -c bioconda -y emirge

> emirge.py --help

$ emirge.py --help

Usage: emirge.py DIR <required_parameters> [options]

This version of EMIRGE (emirge.py) attempts to reconstruct rRNA SSU genes from

Illumina metagenomic data.

DIR is the working directory to process data in.

Use --help to see a list of required and optional arguments

Additional information:

https://groups.google.com/group/emirge-users

https://github.com/csmiller/EMIRGE/wiki

If you use EMIRGE in your work, please cite these manuscripts, as appropriate.

Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF (2011)

EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data.

Genome biology 12: R44. doi:10.1186/gb-2011-12-5-r44.

Miller CS, Handley KM, Wrighton KC, Frischkorn KR, Thomas BC, Banfield JF (2013)

Short-Read Assembly of Full-Length 16S Amplicons Reveals Bacterial Diversity in Subsurface Sediments.

PloS one 8: e56018. doi:10.1371/journal.pone.0056018.

Options:

-h, --help show this help message and exit

Required flags:

These flags are all required to run EMIRGE, and may be supplied in any

order.

-1 reads_1.fastq[.gz]

path to fastq file with \1 (forward) reads from

paired-end sequencing run, or all reads from single-

end sequencing run. File may optionally be gzipped.

EMIRGE expects ASCII-offset of 64 for quality scores.

(Note that running EMIRGE with single-end reads is

largely untested. Please let me know how it works for

you.)

-f FASTA_DB, --fasta_db=FASTA_DB

path to fasta file of candidate SSU sequences

-b BOWTIE_DB, --bowtie_db=BOWTIE_DB

precomputed bowtie index of candidate SSU sequences

(path to appropriate prefix; see --fasta_db)

-l MAX_READ_LENGTH, --max_read_length=MAX_READ_LENGTH

length of longest read in input data.

Required flags for paired-end reads:

These flags are required to run EMIRGE when you have paired-end reads

(the standard way of running EMIRGE), and may be supplied in any

order.

-2 reads_2.fastq path to fastq file with \2 (reverse) reads from

paired-end run. File must be unzipped for mapper.

EMIRGE expects ASCII-offset of 64 for quality scores.

-i INSERT_MEAN, --insert_mean=INSERT_MEAN

insert size distribution mean.

-s INSERT_STDDEV, --insert_stddev=INSERT_STDDEV

insert size distribution standard deviation.

Optional parameters:

Defaults should normally be fine for these options in order to run

EMIRGE

-n ITERATIONS, --iterations=ITERATIONS

Number of iterations to perform. It may be necessary

to use more iterations for more complex samples

(default=40)

-a PROCESSORS, --processors=PROCESSORS

Number of processors to use in the mapping steps. You

probably want to raise this if you have the

processors. (default: 1)

-m MAPPING, --mapping=MAPPING

path to precomputed initial mapping (bam file). If

not provided, and initial mapping will be run for you.

-p SNP_FRACTION_THRESH, --snp_fraction_thresh=SNP_FRACTION_THRESH

If fraction of variants in a candidate sequence

exceeds this threhold, then split the candidate into

two sequences for next iteration. See also

--variant_fraction_thresh. (default: 0.04)

-v VARIANT_FRACTION_THRESH, --variant_fraction_thresh=VARIANT_FRACTION_THRESH

minimum probability of second most probable base at a

site required in order to call site a variant. See

also --snp_fraction_thresh. (default: 0.1)

-j JOIN_THRESHOLD, --join_threshold=JOIN_THRESHOLD

If two candidate sequences share >= this fractional

identity over their bases with mapped reads, then

merge the two sequences into one for the next

iteration. (default: 0.97; valid range: [0.95, 1.0] )

-c MIN_DEPTH, --min_depth=MIN_DEPTH

minimum average read depth below which a candidate

sequence is discarded for next iteration(default: 3)

--nice_mapping=NICE_MAPPING

If set, during mapping phase, the mapper will be

"niced" by the Linux kernel with this value (default:

no nice)

--phred33 Illumina quality values in fastq files are the (fastq

standard) ascii offset of Phred+33. This is the new

default for Illumina pipeline >= 1.8. DEFAULT is still

to assume that quality scores are Phred+64

-e SAVE_EVERY, --save_every=SAVE_EVERY

every SAVE_EVERY iterations, save some information

about the program's state. This is solely for

debugging information, and is NOT required to resume a

run (see --resume_from below). (default=none)

Resuming iterations:

These options allow you to resume iterations from a previously

completed EMIRGE iteration. This requires that directories for the

iteration to resume from and the previous iteration both be present.

It is STRONGLY recommended that other options set on the command line

be identical to the original run. Note that EMIRGE does not check

this for you!

-r RESUME_FROM, --resume_from=RESUME_FROM

Resume iterations from COMPLETED iteration specified.

Requires that the iteration and previous iteration

fully completed, i.e. a priors file, bam file, and

fasta file are all present in the iteration directory.

> emirge_rename_fasta.py -h

$ emirge_rename_fasta.py -h

Usage: emirge_rename_fasta.py [options] <iter.DIR> > renamed.fasta

emirge_rename_fasta.py rewrites an emirge fasta file to include proper

sequence names and prior probabilities (abundance estimates) in the

record headers, and sorts the sequences from most to least abundant

iter.DIR is one of the iteration directories created by emirge (for

example: emirge_working_dir/iter.40). If no iter.DIR is given,

emirge_rename_fasta.py assumes that iter.DIR is the current working

directory.

Note that, with default options, bases with no read support are

labeled 'N', and terminal N's are trimmed

Options:

-h, --help show this help message and exit

-p PROB_MIN, --prob_min=PROB_MIN

Only include sequences in output with prior

probability above PROB_MIN (Default: include all

sequences)

-r RECORD_PREFIX, --record_prefix=RECORD_PREFIX

Add the specified prefix to each fasta record title

-n, --no_N Don't change bases with no read support to N.

Caution: these bases are not supported by reads in the

input data, but will usually be from a closely related

sequence.

-t, --no_trim_N Don't trim off N bases with no read support from ends

of sequences. Ignored if --no_N is also passed

> emirge_makedb.py -h

$ emirge_makedb.py -h

Usage: emirge_makedb.py [OPTIONS]

emirge_makedb.py creates a reference database and the necessay indices for use by

EMIRGE from an rRNA reference database. Without extra parameters, emirge_makedb.py

will 1) download the most recent SILVA SSU database, 2) filter it by sequence

length, 3) cluster at 97% sequence identity, 4) replace ambiguous bases

with random characters and 5) create a bowtie index.

Requires vsearch executable can be found in path for clustering.

https://github.com/torognes/vsearch

Requires bowtie-build (from bowtie version 1) can be found in path

Options:

-h, --help show this help message and exit

-g [SSU|LSU], --gene=[SSU|LSU]

build database from this gene (SSU=Small Subunit rRNA;

LSU=Large Subunit rRNA) default = SSU

-p THREADS, --threads=THREADS

number of threads to use for vsearch clustering of

database (default = use all available)

-t DIR, --tmpdir=DIR working directory for temporary files (default = /tmp)

-r N, --release=N SILVA release number (default: current SILVA release)

-m LEN, --min-len=LEN

minimum reference sequence length (default = 1200)

-M LEN, --max-len=LEN

maximum reference sequence length (default = 2000)

-i FLOAT, --id=FLOAT Cluster at this fractional identity level (default =

0.97)

-k, --keep keep intermediary files (default: do not keep)

-V FILE, --vsearch=FILE

path to vsearch binary (default: look in $PATH)

-B FILE, --bowtie-build=FILE

path to bowtie-build binary (default: look in $PATH)

--silva-license-accepted

I have read and accepted the SILVA license.

(

vsearch、usearchにパスが通っていないとこのヘルプメッセージは表示されない。usearchはフリーソフトではないため、作者サイトからアカデミック23bit版（無料）をダウンロードする必要がある。usearchにリネームし、パスの通ったディレクトリに移動する。

データベースの準備

SSU

#SSU
emirge_makedb.py -g SSU

#LSU
emirge_makedb.py -g LSU

https://www.arb-silva.de/no_cache/download/archive/

最新の138にはFASTA形式が含まれるが、古いのはレガシーなARB形式。LSUはARB形式しかない？

実行方法

１、EMIRGE実行

emirge.py -1 pair1.fq -2 pair2.fq -l 301 -f FASTA_DB/ -b BOWTIE_DB/ -i 500 -s 50

-1 path to fastq file with \1 (forward) reads from paired-end sequencing run, or all reads from single- end sequencing run. File may optionally be gzipped. EMIRGE expects ASCII-offset of 64 for quality scores. (Note that running EMIRGE with single-end reads is largely untested. Please let me know how it works for you.)
-f path to fasta file of candidate SSU sequences
-b precomputed bowtie index of candidate SSU sequences (path to appropriate prefix; see --fasta_db)
-l MAX_READ_LENGTH length of longest read in input data.
-2 reads_2.fastq path to fastq file with \2 (reverse) reads from paired-end run. File must be unzipped for mapper. EMIRGE expects ASCII-offset of 64 for quality scores.
-i insert size distribution mean.
-s insert size distribution standard deviation.

２、リネーム

emirge_rename_fasta.py iter.40 > renamed.fasta

引用

EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data

Christopher S Miller, Brett J Baker, Brian C Thomas, Steven W Singer, Jillian F Banfield

Genome Biol. 2011; 12(5):