超高速で高精度なアンプリコンシークエンス解析ツール LotuS2

2023/04/21 追記

　アンプリコンシークエンスは、マイクロバイオームのプロファイリングにおいて確立されたコスト効率の高い手法である。しかし、このデータを処理するための多くのツールは、大きなデータセットを処理するためにバイオインフォマティクスのスキルと高い計算能力の両方を必要とする。さらに、ロングリードのアンプリコンデータを解析できるツールは数少ない。このギャップを埋めるために、LotuS2 (Less OTU Scripts 2)パイプラインを開発し、ユーザーフレンドリーでリソースに優しい、生のアンプリコン配列の多用途な分析を可能にした。

　LotuS2 では、6 種類の配列クラスタリングアルゴリズムと豊富な前処理・後処理オプションにより、パラメータを完全に調整できる専門家から、異なるシナリオのためのデフォルトが提供されている初心者まで、柔軟にデータ解析を行うことができる。3つの独立した腸と土壌のデータセットをベンチマークしたところ、LotuS2は他のパイプラインと比較して平均29倍速く、しかも技術的な複製サンプルのアルファおよびベータ多様性をより良く再現することができた。さらに、既知の分類群構成を持つ模擬コミュニティをベンチマークした結果、LotuS2は、他のパイプラインと比較して、より高い割合で正しく属と種を復元した（それぞれ、98％と57％）ことが示された。ASV/OTUレベルでは、LotuS2の精度およびFスコアが最も高く、16S配列が正しく再構成された割合も高かった。

　LotuS2は、軽量でユーザーフレンドリーなパイプラインであり、高速、高精度、合理的である。高いデータ利用率と信頼性により、数分でハイスループットなマイクロバイオーム解析が可能になる。LotuS2は、GitHub、conda、またはGalaxyウェブインタフェースから入手可能で、http://lotus2.earlham.ac.uk/ にドキュメントが掲載されている。

Documentation

http://lotus2.earlham.ac.uk/

Downloads

http://lotus2.earlham.ac.uk/main.php?site=downloads

Galaxy版も用意されている。

https://usegalaxy.eu

f:id:kazumaxneo:20220110023827p:plain

I'm so happy to write my first tweet of the year about the pre-print of our pipeline LotuS2! @Falk_tw @NicolaSoranzo @duncan_ng @TheQuadram @EarlhamInst
LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis https://t.co/xjuBkt4iK1
— Ezgi Ozkurt (@ezgi_evol) January 4, 2022

インストール

依存

USEARCH ver 7, 8, 9 or 10 (link)

Github

#conda (bioconda)
mamba create -n lotus2 -y
conda activate lotus2
mamba install -c bioconda lotus2 -y
#ダウンロードしたUSEARCH実行形式イナリのパスを指定する
lotus2 -link_usearch <path>/<to>/usearch

#もしくは対話式で進めるインストールスクリプトを使う。どのデータベースを使うかも聞かれる。
git clone https://github.com/hildebra/lotus2.git
cd lotus2/
perl autoInstall.pl

> lotus2

Lotus2 usage:

lotus2 -i <input fasta|fastq|dir> -o <output_dir> -m/-map <mapping_file>

Minimal example (from lotus2 dir):

./lotus2 -i Example/ -m Example/miSeqMap.sm.txt -o myTestRun -CL vsearch

#### OPTIONS ####

Basic Options:

-i <file|dir> In case that fastqFile or fnaFile and qualFile were

specified in the mapping file, this has to be the

directory with input files

-m|-map <file> Mapping file

-o <dir> Warning: The output directory will be completely

removed at the beginning of the LotuS2 run. Please

ensure this is a new directory or contains only

disposable data!

Workflow Options:

-forwardPrimer <string>

give the forward primer used to amplify DNA region

(e.g. 16S primer fwd)

-keepOfftargets <0|1> (0)?!?: keep offtarget hits against offtargetDB in

output fasta and otu matrix. (Default 0)

-keepTmpFiles <0|1> (1) save extra tmp files like chimeric OTUs or the

raw blast output in extra dir. (0) do not save these.

(Default: 0)

-keepUnclassified <0|1>

(1) Includes unclassified OTUs (i.e. no match in

RDP/Blast database) in OTU and taxa abundance matrix

calculations. (0) does not take these OTUs into

account. (Default: 1)

-offtargetDB <file> Remove likely contaminant OTUs/ASVs based on

alignment to provided fasta. This option is useful for

low-bacterial biomass samples, to remove possible host

genome contaminations (e.g. human/mouse genome)

-redoTaxOnly <0|1> (1) Only redo the taxonomic assignments (useful for

replacing a DB used on a finished lotus run). (0)

Normal lotus run. (Default: 0)

-reversePrimer <string>

give the reverse primer used to amplify DNA region

(e.g. 16S primer rev)

-saveDemultiplex <0|1|2|3>

(1) Saves all demultiplexed reads (unfiltered) in the

[outputdir]/demultiplexed folder, for easier data

upload. (2) Only saves quality filtered demultiplexed

reads and continues LotuS2 run subsequently. (3) Saves

demultiplexed, filtered reads into a single fq, with

sample ID in fastq/a header. (0) No demultiplexed

reads are saved. (Default: 0)

-taxOnly <file> Skip most of the lotus pipeline and only run a

taxonomic classification on a fasta file. E.g. lotus2

-taxOnly <some16S.fna> -refDB SLV

-tolerateCorruptFq <0|1>

(1) Continue reading fastq files, even if single

entries are incomplete (e.g. half of qual values

missing). (0) Abort lotus run, if fastq file is

corrupt. (Default: 0)

-useVsearch <0|1> (0) Use usearch for internal tasks such as remapping

reads on OTUs, chimera checks. (1) will use vsearch

for these tasks. This option is independent of the -CL

UPARSE/UNOISE option, and -taxAligner assignment

usearch/vsearch options. (Default: 0)

-verbosity <0-3> Level of verbosity from printing all program calls

and program output (3) to not even printing errors

(0). (Default: 1)

Clustering Options:

Sequence clustering algorithm: (1) UPARSE, (2) swarm,

(3) cd-hit, (6) unoise3, (7) dada2. Short keyword or

number can be used to indicate clustering (Default:

uparse)

-chim_skew <num> Skew in chimeric fragment abundance (uchime option).

(Default: 2)

-count_chimeras<T/F> Add chimeras to count up OTUs/ASVs. (Default: F)

-deactivateChimeraCheck <0|1|2|3>

(0) do OTU chimera checks. (1) no chimera check at

all. (2) Deactivate deNovo chimera check. (3)

Deactivate ref based chimera check. (Default: 0)

-derepMin<num> Minimum size of dereplicated clustered, one form of

noise removal. Can also have a more complex syntax,

see examples. (Default: 1)

-endRem <string> DNA sequence, usually reverse primer or reverse

adaptor; all sequence beyond this point will be

removed from OTUs. This is redundant with the

"ReversePrimer" option from the mapping file, but

gives more control (e.g. there is a problem with

adaptors in the OTU output). (Default: "")

-id <0-1> Clustering threshold for OTUs. (Default: 0.97)

-readOverlap <num> The maximum number of basepairs that two reads are

overlapping. (Default: 300)

-swarm_distance <1,2,3,..>

Clustering distance for OTUs when using swarm

clustering. (Default: 1)

-xtalk <0|1> (1) check for crosstalk. Note that this requires in

most cases 64bit usearch. (Default: 0)

Taxonomy Options:

-ITSx <0|1> (1) use ITSx to only retain OTUs fitting to ITS1/ITS2

hmm models; (0) deactivate. (Default: 1)

-LCA_cover <0-1> Min horizontal coverage of an OTU sequence against

ref DB. (Default: 0.5)

-LCA_frac <0-1> Min fraction of database hits at taxlevel, with

identical taxonomy. (Default: 0.9)

-amplicon_type <SSU|LSU|ITS|ITS1|ITS2>

(SSU) small subunit (16S/18S), (LSU) large subunit

(23S/28S) or internal transcribed spacer

(ITS|ITS1|ITS2). (Default: SSU)

-buildPhylo <0,1,2,> (0) do not build OTU phylogeny; (1) use fasttree2;

(2) use IQ-TREE 2. (Default: 1)

-greengenesSpecies <0|1>

(1) Create greengenes output labels instead of OTU

(to be used with greengenes specific programs such as

BugBase). (Default: 0)

-itsx_partial <0-100> Parameters for ITSx to extract partial (%) ITS

regions as well. (0) deactivate. (Default: 0)

-lulu <0|1> (1) use LULU (https://github.com/tobiasgf/lulu) to

merge OTUs based on their occurrence. (Default: 1)

-rdp_thr <0-1> Confidence thresshold for RDP.(Default: 0.8)

(SLV) Silva LSU (23/28S) or SSU (16/18S), (GG

greengenes (only SSU available), (HITdb) (SSU, human

gut specific), (PR2) LSU spezialized on Ocean

environmentas, (UNITE) ITS fungi specific, (beetax)

bee gut specific database and tax names. \nDecide

which reference DB will be used for a similarity based

taxonomy annotation. Databases can be combined, with

the first having the highest prioirty. E.g. "PR2,SLV"

would first use PR2 to assign OTUs and all unaasigned

OTUs would be searched for with SILVA, given that

\"-amplicon_type LSU\" was set. Can also be a custom

fasta formatted database: in this case provide the

path to the fasta file as well as the path to the

taxonomy for the sequences using -tax4refDB. See also

online help on how to create a custom DB. (Default:

SLV)

-tax4refDB <file> In conjunction with a custom fasta file provided to

argument -refDB, this file contains for each fasta

entry in the reference DB a taxonomic annotation

string, with the same number of taxonomic levels for

each, tab separated.

Previously doBlast. (0) deactivated (just use RDP);

(1) or (blast) use Blast; (2) or (lambda) use LAMBDA

to search against a 16S reference database for

taxonomic profiling of OTUs; (3) or (utax): use UTAX

with custom databases; (4) or (vsearch) use VSEARCH to

align OTUs to custom databases; (5) or (usearch) use

USEARCH to align OTUs to custom databases. (Default:

-taxExcludeGrep <string>

Exclude taxonomic group, these OTUs will be assigned

as unknown instead. E.g. -taxExcludeGrep

Chloroplast|Mitochondria (Default: )

-tax_group <bacteria|fungi>

(bacteria) bacterial 16S rDNA annnotation, (fungi)

fungal 18S/23S/ITS annotation. (Default: bacteria)

-useBestBlastHitOnly <0|1>

(1) do not use LCA (lowest common ancestor) to

determine most likely taxonomic level (not

recommended), instead just use the best blast hit. (0)

LCA algorithm. (Default: 0)

-utax_thr <0-1> Confidence thresshold for UTAX. (Default: 0.8)

Further Options:

-barcode|-MID <file> Filepath to fastq formated file with barcodes (this

is a processed mi/hiSeq format). The complementary

option in a mapping file would be the column

"MIDfqFile". (Default: "")

-c <file> LotuS.cfg, config file with program paths. (Default:

<LotuS2_dir>/lOTUs.cfg)

-p <454/miSeq/hiSeq/PacBio>

sequencing platform: PacBio, 454, miSeq or hiSeq.

(Default: miSeq)

-q <file> .qual file associated to fasta file. This is an old

format that was replaced by fastq format and is rarely

used nowadays. (Default: "")

-s|sdmopt <file> SDM option file, defaults to "configs/sdm_miSeq.txt"

in current dir. (Default: miSeq)

-tmp|-tmpDir <dir> temporary directory used to save intermediate

results. (Default: <outputDir>/tmpDir)

-t|-threads <num> number of threads to be used. (Default: 1)

Other uses of pipeline (quits after execution):

-check_map <file> Mapping_file: only checks mapping file and exists.

-create_map <file> mapping_file: creates a new mapping file at location,

based on already demultiplexed input (-i) dir. E.g.

lotus2 -create_map mymap.txt -i

/home/dir_with_demultiplex_fastq

-link_usearch <file> Provide the absolute path to your local usearch

binary file, this will be installed to be useable with

LotuS2 in the future.

-v Print LotuS2 version

テストラン

最小設定のラン。

git clone https://github.com/hildebra/lotus2.git
cd lotus2/
lotus2 -i Example/ -m Example/miSeqMap.sm.txt -o myTestRun

biocondaで配布されているversion2.16では途中でSegmentation faultが起きる。ランできるようになったら追記します。

galaxyでは問題なくラン出来た。

出力例

f:id:kazumaxneo:20220113002409p:plain

2023/04/21追記

v2.25では問題なくランできた（上に書いた方法で導入）。

OUT.fna

OTU.txt

higherLvl

引用

LotuS2: An ultrafast and highly accurate tool for amplicon sequencing analysis
Ezgi Özkurt, Joachim Fritscher, Nicola Soranzo, Duncan Y. K. Ng, Robert P. Davey, Mohammad Bahram, Falk Hildebrand

bioRxiv, Posted December 24, 2021