2020 6/11 インストール手順追記
2020 6/16 emirge_makedb.py help追加
微生物群集構成の特徴付けは、系統発生マーカー遺伝子、最も一般的にはリボソームスモールサブユニットRNA(SSU rRNA)遺伝子[ref.1]を用いて行われることが最も多い。伝統的に、rRNA配列は増幅、クローニング、およびサンガーシーケンシングによって決定されてきた。さらに最近では(一部略)
各方法には限界があるが、これらのハイスループットアプローチは広く採用されており、幅広い環境からの微生物群集構成の新しい理解をもたらした[ref.8、11、12]。
メタゲノムデータで直接SSU遺伝子を検索すると、PCRおよびプライマーのバイアスが回避される[ref.16、17]。例えば、異常な16S rRNA遺伝子配列を持つ新規の深く分岐した古細菌がメタゲノムシーケンシングによって最近検出された[ref.18]。これらの異なる配列は、標準的なSSUプライマーを用いた増幅に頼った方法では回収されなかった。
報告されているほとんどのメタゲノムシーケンシングは、サンガーまたは454シーケンシング技術を使用している。これらのデータセットにおけるclosely rellatedな生物のためのrRNA遺伝子配列は共集合する。その結果、コミュニティのメンバーを代表するものではなく、本当のレベルの多様性があいまいになるような複合シーケンスが発生する。これらの問題は、より短いシークエンスリードを使用すると悪化する。さらに、k-merに基づく方法は、SSU遺伝子内に見いだされるような、種間配列同一性が高い領域の近くの新規アセンブリさらに混乱させる可能性がある。
de novoアセンブリの代替手段は、利用可能であればリードをリファレンスシーケンスにマッピングすることである。いくつかの高速でメモリ効率の良いマッピングプログラムが利用可能であり、それらはすべてアライメントを検索しながら様々なレベルのエラーを許容する[ref.19-21]。コミュニティ構成が前もって知られていて、すべてのリファレンスSSUシーケンスが利用可能であるならば、種の豊富さの定量は理論的にリードマッピング問題に帰着することができる。しかし、環境サンプルの組成が事前にわかっていることはめったにない。 SSU遺伝子における高い配列の保存性は、closely relatedな株間で多くのあいまいなリードアサインを生じ、マッピング戦略を混乱させる。
ここでは、微生物コミュニティに存在する完全長SSU配列を正確に再構成する、EMアルゴリズム[ref.22]に基づく、新規の反復マッピング法を報告する。expectation maximization iterative reconstruction of genes from the environment (EMIRGE)と呼ばれるこの方法は、入力として、rawショートリードおよびキュレートされたSSU配列データベースをとる。いくつかの繰り返しリードマッピングサイクルが完了し、その間にマッピングされたリードによって徐々に最も可能性の高いコンセンサス配列が発見され修正される。このアルゴリズムは、確率論的に記述された完全長のSSUシーケンス、およびそのコミュニティにおける相対的存在量の測定値を生成する。このバイオインフォマティックアプローチは、複雑さが大幅に異なる、浅い微生物集団と深くサンプリングされた微生物集団の両方に適用できる。
インストール
mac os10.14のminiconda2環境でテストした。
依存
- python (tested with version 2.6), with the following packages installed:
- BioPython
- Cython
- pysam
- scipy / numpy
- usearch (www.drive5.com/usearch/ -- tested with usearch version 6.0.203; versions earlier than this are incompatible).
- samtools (http://samtools.sourceforge.net/ -- tested with verison 0.1.18)
- bowtie (http://bowtie-bio.sourceforge.net/index.shtml -- tested with version 0.12.7 and 0.12.8)
- vsearch (https://github.com/torognes/vsearch -- optional, but required if you use emirge_makedb.py)
本体 Github
#bioconda(link)
conda install -c bioconda -y emirge
> emirge.py --help
$ emirge.py --help
Usage: emirge.py DIR <required_parameters> [options]
This version of EMIRGE (emirge.py) attempts to reconstruct rRNA SSU genes from
Illumina metagenomic data.
DIR is the working directory to process data in.
Use --help to see a list of required and optional arguments
Additional information:
https://groups.google.com/group/emirge-users
https://github.com/csmiller/EMIRGE/wiki
If you use EMIRGE in your work, please cite these manuscripts, as appropriate.
Miller CS, Baker BJ, Thomas BC, Singer SW, Banfield JF (2011)
EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data.
Genome biology 12: R44. doi:10.1186/gb-2011-12-5-r44.
Miller CS, Handley KM, Wrighton KC, Frischkorn KR, Thomas BC, Banfield JF (2013)
Short-Read Assembly of Full-Length 16S Amplicons Reveals Bacterial Diversity in Subsurface Sediments.
PloS one 8: e56018. doi:10.1371/journal.pone.0056018.
Options:
-h, --help show this help message and exit
Required flags:
These flags are all required to run EMIRGE, and may be supplied in any
order.
-1 reads_1.fastq[.gz]
path to fastq file with \1 (forward) reads from
paired-end sequencing run, or all reads from single-
end sequencing run. File may optionally be gzipped.
EMIRGE expects ASCII-offset of 64 for quality scores.
(Note that running EMIRGE with single-end reads is
largely untested. Please let me know how it works for
you.)
-f FASTA_DB, --fasta_db=FASTA_DB
path to fasta file of candidate SSU sequences
-b BOWTIE_DB, --bowtie_db=BOWTIE_DB
precomputed bowtie index of candidate SSU sequences
(path to appropriate prefix; see --fasta_db)
-l MAX_READ_LENGTH, --max_read_length=MAX_READ_LENGTH
length of longest read in input data.
Required flags for paired-end reads:
These flags are required to run EMIRGE when you have paired-end reads
(the standard way of running EMIRGE), and may be supplied in any
order.
-2 reads_2.fastq path to fastq file with \2 (reverse) reads from
paired-end run. File must be unzipped for mapper.
EMIRGE expects ASCII-offset of 64 for quality scores.
-i INSERT_MEAN, --insert_mean=INSERT_MEAN
insert size distribution mean.
-s INSERT_STDDEV, --insert_stddev=INSERT_STDDEV
insert size distribution standard deviation.
Optional parameters:
Defaults should normally be fine for these options in order to run
EMIRGE
-n ITERATIONS, --iterations=ITERATIONS
Number of iterations to perform. It may be necessary
to use more iterations for more complex samples
(default=40)
-a PROCESSORS, --processors=PROCESSORS
Number of processors to use in the mapping steps. You
probably want to raise this if you have the
processors. (default: 1)
-m MAPPING, --mapping=MAPPING
path to precomputed initial mapping (bam file). If
not provided, and initial mapping will be run for you.
-p SNP_FRACTION_THRESH, --snp_fraction_thresh=SNP_FRACTION_THRESH
If fraction of variants in a candidate sequence
exceeds this threhold, then split the candidate into
two sequences for next iteration. See also
--variant_fraction_thresh. (default: 0.04)
-v VARIANT_FRACTION_THRESH, --variant_fraction_thresh=VARIANT_FRACTION_THRESH
minimum probability of second most probable base at a
site required in order to call site a variant. See
also --snp_fraction_thresh. (default: 0.1)
-j JOIN_THRESHOLD, --join_threshold=JOIN_THRESHOLD
If two candidate sequences share >= this fractional
identity over their bases with mapped reads, then
merge the two sequences into one for the next
iteration. (default: 0.97; valid range: [0.95, 1.0] )
-c MIN_DEPTH, --min_depth=MIN_DEPTH
minimum average read depth below which a candidate
sequence is discarded for next iteration(default: 3)
--nice_mapping=NICE_MAPPING
If set, during mapping phase, the mapper will be
"niced" by the Linux kernel with this value (default:
no nice)
--phred33 Illumina quality values in fastq files are the (fastq
standard) ascii offset of Phred+33. This is the new
default for Illumina pipeline >= 1.8. DEFAULT is still
to assume that quality scores are Phred+64
-e SAVE_EVERY, --save_every=SAVE_EVERY
every SAVE_EVERY iterations, save some information
about the program's state. This is solely for
debugging information, and is NOT required to resume a
run (see --resume_from below). (default=none)
Resuming iterations:
These options allow you to resume iterations from a previously
completed EMIRGE iteration. This requires that directories for the
iteration to resume from and the previous iteration both be present.
It is STRONGLY recommended that other options set on the command line
be identical to the original run. Note that EMIRGE does not check
this for you!
-r RESUME_FROM, --resume_from=RESUME_FROM
Resume iterations from COMPLETED iteration specified.
Requires that the iteration and previous iteration
fully completed, i.e. a priors file, bam file, and
fasta file are all present in the iteration directory.
> emirge_rename_fasta.py -h
$ emirge_rename_fasta.py -h
Usage: emirge_rename_fasta.py [options] <iter.DIR> > renamed.fasta
emirge_rename_fasta.py rewrites an emirge fasta file to include proper
sequence names and prior probabilities (abundance estimates) in the
record headers, and sorts the sequences from most to least abundant
iter.DIR is one of the iteration directories created by emirge (for
example: emirge_working_dir/iter.40). If no iter.DIR is given,
emirge_rename_fasta.py assumes that iter.DIR is the current working
directory.
Note that, with default options, bases with no read support are
labeled 'N', and terminal N's are trimmed
Options:
-h, --help show this help message and exit
-p PROB_MIN, --prob_min=PROB_MIN
Only include sequences in output with prior
probability above PROB_MIN (Default: include all
sequences)
-r RECORD_PREFIX, --record_prefix=RECORD_PREFIX
Add the specified prefix to each fasta record title
-n, --no_N Don't change bases with no read support to N.
Caution: these bases are not supported by reads in the
input data, but will usually be from a closely related
sequence.
-t, --no_trim_N Don't trim off N bases with no read support from ends
of sequences. Ignored if --no_N is also passed
> emirge_makedb.py -h
$ emirge_makedb.py -h
Usage: emirge_makedb.py [OPTIONS]
emirge_makedb.py creates a reference database and the necessay indices for use by
EMIRGE from an rRNA reference database. Without extra parameters, emirge_makedb.py
will 1) download the most recent SILVA SSU database, 2) filter it by sequence
length, 3) cluster at 97% sequence identity, 4) replace ambiguous bases
with random characters and 5) create a bowtie index.
Requires vsearch executable can be found in path for clustering.
https://github.com/torognes/vsearch
Requires bowtie-build (from bowtie version 1) can be found in path
Options:
-h, --help show this help message and exit
-g [SSU|LSU], --gene=[SSU|LSU]
build database from this gene (SSU=Small Subunit rRNA;
LSU=Large Subunit rRNA) default = SSU
-p THREADS, --threads=THREADS
number of threads to use for vsearch clustering of
database (default = use all available)
-t DIR, --tmpdir=DIR working directory for temporary files (default = /tmp)
-r N, --release=N SILVA release number (default: current SILVA release)
-m LEN, --min-len=LEN
minimum reference sequence length (default = 1200)
-M LEN, --max-len=LEN
maximum reference sequence length (default = 2000)
-i FLOAT, --id=FLOAT Cluster at this fractional identity level (default =
0.97)
-k, --keep keep intermediary files (default: do not keep)
-V FILE, --vsearch=FILE
path to vsearch binary (default: look in $PATH)
-B FILE, --bowtie-build=FILE
path to bowtie-build binary (default: look in $PATH)
--silva-license-accepted
I have read and accepted the SILVA license.
(
vsearch、usearchにパスが通っていないとこのヘルプメッセージは表示されない。usearchはフリーソフトではないため、作者サイトからアカデミック23bit版(無料)をダウンロードする必要がある。usearchにリネームし、パスの通ったディレクトリに移動する。
データベースの準備
SSU
#SSU
emirge_makedb.py -g SSU
#LSU
emirge_makedb.py -g LSU
https://www.arb-silva.de/no_cache/download/archive/
最新の138にはFASTA形式が含まれるが、古いのはレガシーなARB形式。LSUはARB形式しかない?
実行方法
1、EMIRGE実行
emirge.py -1 pair1.fq -2 pair2.fq -l 301 -f FASTA_DB/ -b BOWTIE_DB/ -i 500 -s 50
- -1 path to fastq file with \1 (forward) reads from paired-end sequencing run, or all reads from single- end sequencing run. File may optionally be gzipped. EMIRGE expects ASCII-offset of 64 for quality scores. (Note that running EMIRGE with single-end reads is largely untested. Please let me know how it works for you.)
- -f path to fasta file of candidate SSU sequences
- -b precomputed bowtie index of candidate SSU sequences (path to appropriate prefix; see --fasta_db)
- -l MAX_READ_LENGTH length of longest read in input data.
- -2 reads_2.fastq path to fastq file with \2 (reverse) reads from paired-end run. File must be unzipped for mapper. EMIRGE expects ASCII-offset of 64 for quality scores.
- -i insert size distribution mean.
- -s insert size distribution standard deviation.
2、リネーム
emirge_rename_fasta.py iter.40 > renamed.fasta
引用
EMIRGE: reconstruction of full-length ribosomal genes from microbial community short read sequencing data
Christopher S Miller, Brett J Baker, Brian C Thomas, Steven W Singer, Jillian F Banfield
Genome Biol. 2011; 12(5):