2023/04/21 追記
アンプリコンシークエンスは、マイクロバイオームのプロファイリングにおいて確立されたコスト効率の高い手法である。しかし、このデータを処理するための多くのツールは、大きなデータセットを処理するためにバイオインフォマティクスのスキルと高い計算能力の両方を必要とする。さらに、ロングリードのアンプリコンデータを解析できるツールは数少ない。このギャップを埋めるために、LotuS2 (Less OTU Scripts 2)パイプラインを開発し、ユーザーフレンドリーでリソースに優しい、生のアンプリコン配列の多用途な分析を可能にした。
LotuS2 では、6 種類の配列クラスタリングアルゴリズムと豊富な前処理・後処理オプションにより、パラメータを完全に調整できる専門家から、異なるシナリオのためのデフォルトが提供されている初心者まで、柔軟にデータ解析を行うことができる。3つの独立した腸と土壌のデータセットをベンチマークしたところ、LotuS2は他のパイプラインと比較して平均29倍速く、しかも技術的な複製サンプルのアルファおよびベータ多様性をより良く再現することができた。さらに、既知の分類群構成を持つ模擬コミュニティをベンチマークした結果、LotuS2は、他のパイプラインと比較して、より高い割合で正しく属と種を復元した(それぞれ、98%と57%)ことが示された。ASV/OTUレベルでは、LotuS2の精度およびFスコアが最も高く、16S配列が正しく再構成された割合も高かった。
LotuS2は、軽量でユーザーフレンドリーなパイプラインであり、高速、高精度、合理的である。高いデータ利用率と信頼性により、数分でハイスループットなマイクロバイオーム解析が可能になる。LotuS2は、GitHub、conda、またはGalaxyウェブインタフェースから入手可能で、http://lotus2.earlham.ac.uk/ にドキュメントが掲載されている。
Documentation
Downloads
http://lotus2.earlham.ac.uk/main.php?site=downloads
Galaxy版も用意されている。
I'm so happy to write my first tweet of the year about the pre-print of our pipeline LotuS2! @Falk_tw @NicolaSoranzo @duncan_ng @TheQuadram @EarlhamInst
— Ezgi Ozkurt (@ezgi_evol) January 4, 2022
LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis https://t.co/xjuBkt4iK1
インストール
依存
- USEARCH ver 7, 8, 9 or 10 (link)
#conda (bioconda)
mamba create -n lotus2 -y
conda activate lotus2
mamba install -c bioconda lotus2 -y
#ダウンロードしたUSEARCH実行形式イナリのパスを指定する
lotus2 -link_usearch <path>/<to>/usearch
#もしくは対話式で進めるインストールスクリプトを使う。どのデータベースを使うかも聞かれる。
git clone https://github.com/hildebra/lotus2.git
cd lotus2/
perl autoInstall.pl
> lotus2
Lotus2 usage:
lotus2 -i <input fasta|fastq|dir> -o <output_dir> -m/-map <mapping_file>
Minimal example (from lotus2 dir):
./lotus2 -i Example/ -m Example/miSeqMap.sm.txt -o myTestRun -CL vsearch
#### OPTIONS ####
Basic Options:
-i <file|dir> In case that fastqFile or fnaFile and qualFile were
specified in the mapping file, this has to be the
directory with input files
-m|-map <file> Mapping file
-o <dir> Warning: The output directory will be completely
removed at the beginning of the LotuS2 run. Please
ensure this is a new directory or contains only
disposable data!
Workflow Options:
-forwardPrimer <string>
give the forward primer used to amplify DNA region
(e.g. 16S primer fwd)
-keepOfftargets <0|1> (0)?!?: keep offtarget hits against offtargetDB in
output fasta and otu matrix. (Default 0)
-keepTmpFiles <0|1> (1) save extra tmp files like chimeric OTUs or the
raw blast output in extra dir. (0) do not save these.
(Default: 0)
-keepUnclassified <0|1>
(1) Includes unclassified OTUs (i.e. no match in
RDP/Blast database) in OTU and taxa abundance matrix
calculations. (0) does not take these OTUs into
account. (Default: 1)
-offtargetDB <file> Remove likely contaminant OTUs/ASVs based on
alignment to provided fasta. This option is useful for
low-bacterial biomass samples, to remove possible host
genome contaminations (e.g. human/mouse genome)
-redoTaxOnly <0|1> (1) Only redo the taxonomic assignments (useful for
replacing a DB used on a finished lotus run). (0)
Normal lotus run. (Default: 0)
-reversePrimer <string>
give the reverse primer used to amplify DNA region
(e.g. 16S primer rev)
-saveDemultiplex <0|1|2|3>
(1) Saves all demultiplexed reads (unfiltered) in the
[outputdir]/demultiplexed folder, for easier data
upload. (2) Only saves quality filtered demultiplexed
reads and continues LotuS2 run subsequently. (3) Saves
demultiplexed, filtered reads into a single fq, with
sample ID in fastq/a header. (0) No demultiplexed
reads are saved. (Default: 0)
-taxOnly <file> Skip most of the lotus pipeline and only run a
taxonomic classification on a fasta file. E.g. lotus2
-taxOnly <some16S.fna> -refDB SLV
-tolerateCorruptFq <0|1>
(1) Continue reading fastq files, even if single
entries are incomplete (e.g. half of qual values
missing). (0) Abort lotus run, if fastq file is
corrupt. (Default: 0)
-useVsearch <0|1> (0) Use usearch for internal tasks such as remapping
reads on OTUs, chimera checks. (1) will use vsearch
for these tasks. This option is independent of the -CL
UPARSE/UNOISE option, and -taxAligner assignment
usearch/vsearch options. (Default: 0)
-verbosity <0-3> Level of verbosity from printing all program calls
and program output (3) to not even printing errors
(0). (Default: 1)
Clustering Options:
-CL|-clustering <uparse|swarm|cdhit|unoise|dada2>
Sequence clustering algorithm: (1) UPARSE, (2) swarm,
(3) cd-hit, (6) unoise3, (7) dada2. Short keyword or
number can be used to indicate clustering (Default:
uparse)
-chim_skew <num> Skew in chimeric fragment abundance (uchime option).
(Default: 2)
-count_chimeras<T/F> Add chimeras to count up OTUs/ASVs. (Default: F)
-deactivateChimeraCheck <0|1|2|3>
(0) do OTU chimera checks. (1) no chimera check at
all. (2) Deactivate deNovo chimera check. (3)
Deactivate ref based chimera check. (Default: 0)
-derepMin<num> Minimum size of dereplicated clustered, one form of
noise removal. Can also have a more complex syntax,
see examples. (Default: 1)
-endRem <string> DNA sequence, usually reverse primer or reverse
adaptor; all sequence beyond this point will be
removed from OTUs. This is redundant with the
"ReversePrimer" option from the mapping file, but
gives more control (e.g. there is a problem with
adaptors in the OTU output). (Default: "")
-id <0-1> Clustering threshold for OTUs. (Default: 0.97)
-readOverlap <num> The maximum number of basepairs that two reads are
overlapping. (Default: 300)
-swarm_distance <1,2,3,..>
Clustering distance for OTUs when using swarm
clustering. (Default: 1)
-xtalk <0|1> (1) check for crosstalk. Note that this requires in
most cases 64bit usearch. (Default: 0)
Taxonomy Options:
-ITSx <0|1> (1) use ITSx to only retain OTUs fitting to ITS1/ITS2
hmm models; (0) deactivate. (Default: 1)
-LCA_cover <0-1> Min horizontal coverage of an OTU sequence against
ref DB. (Default: 0.5)
-LCA_frac <0-1> Min fraction of database hits at taxlevel, with
identical taxonomy. (Default: 0.9)
-amplicon_type <SSU|LSU|ITS|ITS1|ITS2>
(SSU) small subunit (16S/18S), (LSU) large subunit
(23S/28S) or internal transcribed spacer
(ITS|ITS1|ITS2). (Default: SSU)
-buildPhylo <0,1,2,> (0) do not build OTU phylogeny; (1) use fasttree2;
(2) use IQ-TREE 2. (Default: 1)
-greengenesSpecies <0|1>
(1) Create greengenes output labels instead of OTU
(to be used with greengenes specific programs such as
BugBase). (Default: 0)
-itsx_partial <0-100> Parameters for ITSx to extract partial (%) ITS
regions as well. (0) deactivate. (Default: 0)
-lulu <0|1> (1) use LULU (https://github.com/tobiasgf/lulu) to
merge OTUs based on their occurrence. (Default: 1)
-rdp_thr <0-1> Confidence thresshold for RDP.(Default: 0.8)
-refDB <SLV|GG|HITdb|PR2|UNITE|beetax>
(SLV) Silva LSU (23/28S) or SSU (16/18S), (GG
greengenes (only SSU available), (HITdb) (SSU, human
gut specific), (PR2) LSU spezialized on Ocean
environmentas, (UNITE) ITS fungi specific, (beetax)
bee gut specific database and tax names. \nDecide
which reference DB will be used for a similarity based
taxonomy annotation. Databases can be combined, with
the first having the highest prioirty. E.g. "PR2,SLV"
would first use PR2 to assign OTUs and all unaasigned
OTUs would be searched for with SILVA, given that
\"-amplicon_type LSU\" was set. Can also be a custom
fasta formatted database: in this case provide the
path to the fasta file as well as the path to the
taxonomy for the sequences using -tax4refDB. See also
online help on how to create a custom DB. (Default:
SLV)
-tax4refDB <file> In conjunction with a custom fasta file provided to
argument -refDB, this file contains for each fasta
entry in the reference DB a taxonomic annotation
string, with the same number of taxonomic levels for
each, tab separated.
-taxAligner <0|blast|lambda|utax|vsearch|usearch>
Previously doBlast. (0) deactivated (just use RDP);
(1) or (blast) use Blast; (2) or (lambda) use LAMBDA
to search against a 16S reference database for
taxonomic profiling of OTUs; (3) or (utax): use UTAX
with custom databases; (4) or (vsearch) use VSEARCH to
align OTUs to custom databases; (5) or (usearch) use
USEARCH to align OTUs to custom databases. (Default:
0)
-taxExcludeGrep <string>
Exclude taxonomic group, these OTUs will be assigned
as unknown instead. E.g. -taxExcludeGrep
Chloroplast|Mitochondria (Default: )
-tax_group <bacteria|fungi>
(bacteria) bacterial 16S rDNA annnotation, (fungi)
fungal 18S/23S/ITS annotation. (Default: bacteria)
-useBestBlastHitOnly <0|1>
(1) do not use LCA (lowest common ancestor) to
determine most likely taxonomic level (not
recommended), instead just use the best blast hit. (0)
LCA algorithm. (Default: 0)
-utax_thr <0-1> Confidence thresshold for UTAX. (Default: 0.8)
Further Options:
-barcode|-MID <file> Filepath to fastq formated file with barcodes (this
is a processed mi/hiSeq format). The complementary
option in a mapping file would be the column
"MIDfqFile". (Default: "")
-c <file> LotuS.cfg, config file with program paths. (Default:
<LotuS2_dir>/lOTUs.cfg)
-p <454/miSeq/hiSeq/PacBio>
sequencing platform: PacBio, 454, miSeq or hiSeq.
(Default: miSeq)
-q <file> .qual file associated to fasta file. This is an old
format that was replaced by fastq format and is rarely
used nowadays. (Default: "")
-s|sdmopt <file> SDM option file, defaults to "configs/sdm_miSeq.txt"
in current dir. (Default: miSeq)
-tmp|-tmpDir <dir> temporary directory used to save intermediate
results. (Default: <outputDir>/tmpDir)
-t|-threads <num> number of threads to be used. (Default: 1)
Other uses of pipeline (quits after execution):
-check_map <file> Mapping_file: only checks mapping file and exists.
-create_map <file> mapping_file: creates a new mapping file at location,
based on already demultiplexed input (-i) dir. E.g.
lotus2 -create_map mymap.txt -i
/home/dir_with_demultiplex_fastq
-link_usearch <file> Provide the absolute path to your local usearch
binary file, this will be installed to be useable with
LotuS2 in the future.
-v Print LotuS2 version
テストラン
最小設定のラン。
git clone https://github.com/hildebra/lotus2.git
cd lotus2/
lotus2 -i Example/ -m Example/miSeqMap.sm.txt -o myTestRun
biocondaで配布されているversion2.16では途中でSegmentation faultが起きる。ランできるようになったら追記します。
galaxyでは問題なくラン出来た。
出力例
2023/04/21追記
v2.25では問題なくランできた(上に書いた方法で導入)。
OUT.fna
OTU.txt
higherLvl
引用
LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis
Ezgi Özkurt, Joachim Fritscher, Nicola Soranzo, Duncan Y. K. Ng, Robert P. Davey, Mohammad Bahram, Falk Hildebrand
bioRxiv, Posted December 24, 2021
関連