2019-08-08

多機能な配列処理ツール VSEARCH（USEARCHの代替）

2019 8/9 説明の誤り修正

2019 9/15 両鎖クラスタリングのコメント追加

　Rockström et al. (2009) とSteffen et al. (2015) は、生物多様性の損失を人類の短期生存のための主要な脅威として提示した。シークエンシング技術における最近の進歩は、深海熱水孔から南極湖、そして熱帯林からシベリア草原までの大規模な環境遺伝的多様性の研究を可能にした（Gilbert、Jansson＆Knight、2014 pubmed）。最近の臨床研究は、私たちの体のミクロバイオームと人間の健康にとっての日常環境の重要性を示している（Human Microbiome Project Consortium、2012）。通常、普遍的なマーカー（例えば、16S rRNA、ITS、COI）に焦点を当てて、これらのターゲットメタゲノミクス研究は何百万もの配列を生み出し、そして、それらの生態学的解釈を容易にするためのオープンソース、高速かつ記憶効率の高いツールを必要とする。

　マイクロバイオーム分析のためにいくつかのパイプラインが開発されており、その中でもmothur（Schloss et al。、2009）、QIIME（Caporaso et al。、2010）、およびUPARSE（Edgar、2013）が最も人気がある。 QIIMEとUPARSEはどちらもRobert C. Edgarによって設計および実装されているUSEARCH（Edgar、2010）に基づいており、http://drive5.com/usearch/で入手できる。 USEARCHはFASTQとFASTAファイルを操作し分析するための多数のコマンドとオプションを提供する。ただし、USEARCHのソースコードは一般には公開されておらず、アルゴリズムの詳細は初歩的にしか説明されておらず、メモリ制限のある32ビットバージョンのみが学術用途に無料で入手できる。

　オープンソースソリューションの存在はエンドユーザーにとって有益であり、研究活動を活性化することができると私たちは信じている。このため、私たち（著者ら）はUSEARCHに代わる高品質のオープンソース代替ツールを提供することを約束した。これはメモリの制限なしにユーザーが自由に利用できる。 VSEARCHには、一般的に使用されているほとんどのUSEARCH機能が含まれており、さらに開発を進めると機能が追加される可能性がある。論文ではVSEARCH実装の詳細について説明する。結果のスピードと質に関してその性能を評価するために、著者らはいくつかの最も重要な機能（検索、クラスタリング、キメラ検出とサブサンプリング）を評価し、それらをUSEARCHと比較した。 VSEARCHはUSEARCHの結果と同等かそれ以上の結果をもたらすことが分かった。

環境的または臨床的分子多様性研究は、キメラについてチェックする必要がある、複製しない、マスキングする、分類する、検索する、クラスタ化する、またはリファレンス配列と比較する必要がある大量のアンプリコン（例えば、SSU −rRNA配列）を生成する。 vsearchの目的は、最適化されたアルゴリズム実装を使用し、現代のコンピュータの可能性を最大限に引き出すことで、これらのタスクを実行するためのオールインワンオープンソースツールを提供し、高速で正確なデータ処理を提供することである。
ヌクレオチド配列の比較はvsearchの中核である。比較をスピードアップするために、vsearchは、2003年以降のx86-64 CPUのストリーミングSIMD拡張命令（SSE2）を利用して、非常に高速なNeedleman-Wunschアルゴリズムを実装している。 SSE2命令が利用できない場合、vsearchはエラーメッセージを出して終了する。 Power 8 CPUでは、AltiVec / VSX / VMX命令を使用する。メモリ使用量はシーケンスの長さに応じて急速に増加する。たとえば、長さ1 kbの2つのシーケンスを比較すると、スレッドあたり8 MBのメモリが必要で、2つの10 kbシーケンスを比較すると、スレッドごとに800 MBのメモリが必要になる。

DocumentationはGithubにあるリンクからダウンロードできる（direct link）。

インストール

mac os10.14向け64bitバイナリをダウンロードし、テストした。

VSEARCH binaries are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) and Windows (64-bit, version 7 or higher), as well as for 64-bit little-endian POWER8 (ppc64le) and 64-bit ARMv8 systems (aarch64) running GNU/Linux.

ビルド依存（if support for gzip and bzip2 compressed FASTA and FASTQ input files is needed）

libz (zlib library) (zlib.h header file) (optional)
libbz2 (bzip2lib library) (bzlib.h header file) (optional)

本体

本体　GIthub

#bioconda
conda install -c bioconda vsearch

> vsearch

$ vsearch

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores

https://github.com/torognes/vsearch

For help, please enter: vsearch --help

For further details, please see the manual.

Example commands:

vsearch --allpairs_global FILENAME --id 0.5 --alnout FILENAME

vsearch --cluster_fast FILENAME --id 0.97 --centroids FILENAME

vsearch --cluster_size FILENAME --id 0.97 --centroids FILENAME

vsearch --cluster_smallmem FILENAME --usersort --id 0.97 --centroids FILENAME

vsearch --derep_fulllength FILENAME --output FILENAME

vsearch --derep_prefix FILENAME --output FILENAME

vsearch --fastq_chars FILENAME

vsearch --fastq_convert FILENAME --fastqout FILENAME --fastq_ascii 64

vsearch --fastq_eestats FILENAME --output FILENAME

vsearch --fastq_eestats2 FILENAME --output FILENAME

vsearch --fastq_mergepairs FILENAME --reverse FILENAME --fastqout FILENAME

vsearch --fastq_stats FILENAME --log FILENAME

vsearch --fastx_filter FILENAME --fastaout FILENAME --fastq_trunclen 100

vsearch --fastx_mask FILENAME --fastaout FILENAME

vsearch --fastx_revcomp FILENAME --fastqout FILENAME

vsearch --fastx_subsample FILENAME --fastaout FILENAME --sample_pct 1

vsearch --rereplicate FILENAME --output FILENAME

vsearch --search_exact FILENAME --db FILENAME --alnout FILENAME

vsearch --shuffle FILENAME --output FILENAME

vsearch --sortbylength FILENAME --output FILENAME

vsearch --sortbysize FILENAME --output FILENAME

vsearch --uchime_denovo FILENAME --nonchimeras FILENAME

vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME

vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME

> vsearch -h

$ vsearch -h

vsearch v2.6.2_macos_x86_64, 96.0GB RAM, 24 cores

https://github.com/torognes/vsearch

Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)

VSEARCH: a versatile open source tool for metagenomics

PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584

Usage: vsearch [OPTIONS]

General options

--bzip2_decompress decompress input with bzip2 (required if pipe)

--fasta_width INT width of FASTA seq lines, 0 for no wrap (80)

--gzip_decompress decompress input with gzip (required if pipe)

--help | -h display help information

--log FILENAME write messages, timing and memory info to file

--maxseqlength INT maximum sequence length (50000)

--minseqlength INT min seq length (clust/derep/search: 32, other:1)

--no_progress do not show progress indicator

--notrunclabels do not truncate labels at first space

--quiet output just warnings and fatal errors to stderr

--threads INT number of threads to use, zero for all cores (0)

--version | -v display version information

Chimera detection

--uchime_denovo FILENAME detect chimeras de novo

--uchime_ref FILENAME detect chimeras using a reference database

Data

--db FILENAME reference database for --uchime_ref

Parameters

--abskew REAL min abundance ratio of parent vs chimera (2.0)

--dn REAL 'no' vote pseudo-count (1.4)

--mindiffs INT minimum number of differences in segment (3)

--mindiv REAL minimum divergence from closest parent (0.8)

--minh REAL minimum score (0.28)

--sizein propagate abundance annotation from input

--self exclude identical labels for --uchime_ref

--selfid exclude identical sequences for --uchime_ref

--xn REAL 'no' vote weight (8.0)

Output

--alignwidth INT width of alignment in uchimealn output (80)

--borderline FILENAME output borderline chimeric sequences to file

--chimeras FILENAME output chimeric sequences to file

--fasta_score include chimera score in fasta output

--nonchimeras FILENAME output non-chimeric sequences to file

--relabel STRING relabel nonchimeras with this prefix string

--relabel_keep keep the old label after the new when relabelling

--relabel_md5 relabel with md5 digest of normalized sequence

--relabel_sha1 relabel with sha1 digest of normalized sequence

--sizeout include abundance information when relabelling

--uchimealns FILENAME output chimera alignments to file

--uchimeout FILENAME output to chimera info to tab-separated file

--uchimeout5 make output compatible with uchime version 5

--xsize strip abundance information in output

Clustering

--cluster_fast FILENAME cluster sequences after sorting by length

--cluster_size FILENAME cluster sequences after sorting by abundance

--cluster_smallmem FILENAME cluster already sorted sequences (see -usersort)

Parameters (most searching options also apply)

--cons_truncate do not ignore terminal gaps in MSA for consensus

--id REAL reject if identity lower, accepted values: 0-1.0

--iddef INT id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)

--qmask none|dust|soft mask seqs with dust, soft or no method (dust)

--sizein propagate abundance annotation from input

--strand plus|both cluster using plus or both strands (plus)

--usersort indicate sequences not pre-sorted by length

Output

--biomout FILENAME filename for OTU table output in biom 1.0 format

--centroids FILENAME output centroid sequences to FASTA file

--clusterout_id add cluster id info to consout and profile files

--clusterout_sort order msaout, consout, profile by decr abundance

--clusters STRING output each cluster to a separate FASTA file

--consout FILENAME output cluster consensus sequences to FASTA file

--mothur_shared_out FN filename for OTU table output in mothur format

--msaout FILENAME output multiple seq. alignments to FASTA file

--otutabout FILENAME filename for OTU table output in classic format

--profile FILENAME output sequence profile of each cluster to file

--relabel STRING relabel centroids with this prefix string

--relabel_keep keep the old label after the new when relabelling

--relabel_md5 relabel with md5 digest of normalized sequence

--relabel_sha1 relabel with sha1 digest of normalized sequence

--sizeorder sort accepted centroids by abundance (AGC)

--sizeout write cluster abundances to centroid file

--uc FILENAME specify filename for UCLUST-like output

--xsize strip abundance information in output

Dereplication and rereplication

--derep_fulllength FILENAME dereplicate sequences in the given FASTA file

--derep_prefix FILENAME dereplicate sequences in file based on prefixes

--rereplicate FILENAME rereplicate sequences in the given FASTA file

Parameters

--maxuniquesize INT maximum abundance for output from dereplication

--minuniquesize INT minimum abundance for output from dereplication

--sizein propagate abundance annotation from input

--strand plus|both dereplicate plus or both strands (plus)

Output

--output FILENAME output FASTA file

--relabel STRING relabel with this prefix string

--relabel_keep keep the old label after the new when relabelling

--relabel_md5 relabel with md5 digest of normalized sequence

--relabel_sha1 relabel with sha1 digest of normalized sequence

--sizeout write abundance annotation to output

--topn INT output only n most abundant sequences after derep

--uc FILENAME filename for UCLUST-like dereplication output

--xsize strip abundance information in derep output

FASTQ format conversion

--fastq_convert FILENAME convert between FASTQ file formats

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)

--fastq_asciiout INT FASTQ output quality score ASCII base char (33)

--fastq_qmax INT maximum base quality value for FASTQ input (41)

--fastq_qmaxout INT maximum base quality value for FASTQ output (41)

--fastq_qmin INT minimum base quality value for FASTQ input (0)

--fastq_qminout INT minimum base quality value for FASTQ output (0)

Output

--fastqout FILENAME FASTQ output filename for converted sequences

FASTQ format detection and quality analysis

--fastq_chars FILENAME analyse FASTQ file for version and quality range

Parameters

--fastq_tail INT min length of tails to count for fastq_chars (4)

FASTQ quality statistics

--fastq_stats FILENAME report statistics on FASTQ file

--fastq_eestats FILENAME quality score and expected error statistics

--fastq_eestats2 FILENAME expected error and length cutoff statistics

Parameters

--ee_cutoffs REAL,... fastq_eestats2 expected error cutoffs (0.5,1,2)

--fastq_ascii INT FASTQ input quality score ASCII base char (33)

--fastq_qmax INT maximum base quality value for FASTQ input (41)

--fastq_qmin INT minimum base quality value for FASTQ input (0)

--length_cutoffs INT,INT,INT fastq_eestats2 length (min,max,incr) (50,*,50)

Output

--log FILENAME output file for fastq_stats statistics

--output FILENAME output file for fastq_eestats(2) statistics

Filtering

--fastx_filter FILENAME filter and truncate sequences in FASTA/FASTQ file

--fastq_filter FILENAME filter and truncate sequences in FASTQ file

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)

--fastq_maxee REAL maximum expected error value for filter

--fastq_maxee_rate REAL maximum expected error rate for filter

--fastq_maxlen INT maximum length of sequence for filter

--fastq_maxns INT maximum number of N's for filter

--fastq_minlen INT minimum length of sequence for filter

--fastq_qmax INT maximum base quality value for FASTQ input (41)

--fastq_qmin INT minimum base quality value for FASTQ input (0)

--fastq_stripleft INT bases on the left to delete

--fastq_stripright INT bases on the right to delete

--fastq_truncee REAL maximum total expected error for truncation

--fastq_trunclen INT truncate reads to length INT (discard if shorter)

--fastq_trunclen_keep INT truncate reads to length INT (keep if shorter)

--fastq_truncqual INT minimum base quality value for truncation

--maxsize INT maximum abundance

--minsize INT minimum abundance

Output

--eeout include expected errors in output

--fastaout FILENAME FASTA output filename for passed sequences

--fastaout_discarded FNAME FASTA filename for discarded sequences

--fastqout FILENAME FASTQ output filename for passed sequences

--fastqout_discarded FNAME FASTQ filename for discarded sequences

--relabel STRING relabel filtered sequences with given prefix

--relabel_keep keep the old label after the new when relabelling

--relabel_md5 relabel filtered sequences with md5 digest

--relabel_sha1 relabel filtered sequences with sha1 digest

--sizeout include abundance information when relabelling

--xsize strip abundance information in output

Masking (new)

--fastx_mask FILENAME mask sequences in the given FASTA or FASTQ file

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)

--fastq_qmax INT maximum base quality value for FASTQ input (41)

--fastq_qmin INT minimum base quality value for FASTQ input (0)

--hardmask mask by replacing with N instead of lower case

--max_unmasked_pct max unmasked % of sequences to keep (100.0)

--min_unmasked_pct min unmasked % of sequences to keep (0.0)

--qmask none|dust|soft mask seqs with dust, soft or no method (dust)

Output

--fastaout FILENAME output to specified FASTA file

--fastqout FILENAME output to specified FASTQ file

Masking (old)

--maskfasta FILENAME mask sequences in the given FASTA file

Parameters

--hardmask mask by replacing with N instead of lower case

--qmask none|dust|soft mask seqs with dust, soft or no method (dust)

Output

--output FILENAME output to specified FASTA file

Paired-end reads merging

--fastq_mergepairs FILENAME merge paired-end reads into one sequence

Data

--reverse FILENAME specify FASTQ file with reverse reads

Parameters

--fastq_allowmergestagger Allow merging of staggered reads

--fastq_ascii INT FASTQ input quality score ASCII base char (33)

--fastq_maxdiffs INT maximum number of different bases in overlap (10)

--fastq_maxee REAL maximum expected error value for merged sequence

--fastq_maxmergelen maximum length of entire merged sequence

--fastq_maxns INT maximum number of N's

--fastq_minlen INT minimum input read length after truncation (1)

--fastq_minmergelen minimum length of entire merged sequence

--fastq_minovlen minimum length of overlap between reads (10)

--fastq_nostagger disallow merging of staggered reads (default)

--fastq_qmax INT maximum base quality value for FASTQ input (41)

--fastq_qmaxout INT maximum base quality value for FASTQ output (41)

--fastq_qmin INT minimum base quality value for FASTQ input (0)

--fastq_qminout INT minimum base quality value for FASTQ output (0)

--fastq_truncqual INT base quality value for truncation

Output

--eetabbedout FILENAME output error statistics to specified file

--fastaout FILENAME FASTA output filename for merged sequences

--fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences

--fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences

--fastq_eeout include expected errors in FASTQ output

--fastqout FILENAME FASTQ output filename for merged sequences

--fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences

--fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences

--label_suffix suffix to append to label of merged sequences

Pairwise alignment

--allpairs_global FILENAME perform global alignment of all sequence pairs

Output (most searching options also apply)

--alnout FILENAME filename for human-readable alignment output

--acceptall output all pairwise alignments

Reverse complementation

--fastx_revcomp FILENAME Reverse-complement seqs in FASTA or FASTQ file

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)

--fastq_qmax INT maximum base quality value for FASTQ input (41)

--fastq_qmin INT minimum base quality value for FASTQ input (0)

Output

--fastaout FILENAME FASTA output filename

--fastqout FILENAME FASTQ output filename

--label_suffix STRING Label to append to identifier in the output

Searching

--search_exact FILENAME filename of queries for exact match search

--usearch_global FILENAME filename of queries for global alignment search

Data

--db FILENAME name of UDB or FASTA database for search

Parameters

--dbmask none|dust|soft mask db with dust, soft or no method (dust)

--fulldp full dynamic programming alignment (always on)

--gapext STRING penalties for gap extension (2I/1E)

--gapopen STRING penalties for gap opening (20I/2E)

--hardmask mask by replacing with N instead of lower case

--id REAL reject if identity lower

--iddef INT id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)

--idprefix INT reject if first n nucleotides do not match

--idsuffix INT reject if last n nucleotides do not match

--leftjust reject if terminal gaps at alignment left end

--match INT score for match (2)

--maxaccepts INT number of hits to accept and show per strand (1)

--maxdiffs INT reject if more substitutions or indels

--maxgaps INT reject if more indels

--maxhits INT maximum number of hits to show (unlimited)

--maxid REAL reject if identity higher

--maxqsize INT reject if query abundance larger

--maxqt REAL reject if query/target length ratio higher

--maxrejects INT number of non-matching hits to consider (32)

--maxsizeratio REAL reject if query/target abundance ratio higher

--maxsl REAL reject if shorter/longer length ratio higher

--maxsubs INT reject if more substitutions

--mid REAL reject if percent identity lower, ignoring gaps

--mincols INT reject if alignment length shorter

--minqt REAL reject if query/target length ratio lower

--minsizeratio REAL reject if query/target abundance ratio lower

--minsl REAL reject if shorter/longer length ratio lower

--mintsize INT reject if target abundance lower

--minwordmatches INT minimum number of word matches required (12)

--mismatch INT score for mismatch (-4)

--pattern STRING option is ignored

--qmask none|dust|soft mask query with dust, soft or no method (dust)

--query_cov REAL reject if fraction of query seq. aligned lower

--rightjust reject if terminal gaps at alignment right end

--sizein propagate abundance annotation from input

--self reject if labels identical

--selfid reject if sequences identical

--slots INT option is ignored

--strand plus|both search plus or both strands (plus)

--target_cov REAL reject if fraction of target seq. aligned lower

--weak_id REAL include aligned hits with >= id; continue search

--wordlength INT length of words for database index 3-15 (8)

Output

--alnout FILENAME filename for human-readable alignment output

--biomout FILENAME filename for OTU table output in biom 1.0 format

--blast6out FILENAME filename for blast-like tab-separated output

--dbmatched FILENAME FASTA file for matching database sequences

--dbnotmatched FILENAME FASTA file for non-matching database sequences

--fastapairs FILENAME FASTA file with pairs of query and target

--matched FILENAME FASTA file for matching query sequences

--mothur_shared_out FN filename for OTU table output in mothur format

--notmatched FILENAME FASTA file for non-matching query sequences

--otutabout FILENAME filename for OTU table output in classic format

--output_no_hits output non-matching queries to output files

--rowlen INT width of alignment lines in alnout output (64)

--samheader include a header in the SAM output file

--samout FILENAME filename for SAM format output

--sizeout write abundance annotation to dbmatched file

--top_hits_only output only hits with identity equal to the best

--uc FILENAME filename for UCLUST-like output

--uc_allhits show all, not just top hit with uc output

--userfields STRING fields to output in userout file

--userout FILENAME filename for user-defined tab-separated output

Shuffling and sorting

--shuffle FILENAME shuffle order of sequences in FASTA file randomly

--sortbylength FILENAME sort sequences by length in given FASTA file

--sortbysize FILENAME abundance sort sequences in given FASTA file

Parameters

--maxsize INT maximum abundance for sortbysize

--minsize INT minimum abundance for sortbysize

--randseed INT seed for PRNG, zero to use random data source (0)

--sizein propagate abundance annotation from input

Output

--output FILENAME output to specified FASTA file

--relabel STRING relabel sequences with this prefix string

--relabel_keep keep the old label after the new when relabelling

--relabel_md5 relabel with md5 digest of normalized sequence

--relabel_sha1 relabel with sha1 digest of normalized sequence

--sizeout include abundance information when relabelling

--topn INT output just first n sequences

--xsize strip abundance information in output

Subsampling

--fastx_subsample FILENAME subsample sequences from given FASTA/FASTQ file

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)

--fastq_qmax INT maximum base quality value for FASTQ input (41)

--fastq_qmin INT minimum base quality value for FASTQ input (0)

--randseed INT seed for PRNG, zero to use random data source (0)

--sample_pct REAL sampling percentage between 0.0 and 100.0

--sample_size INT sampling size

--sizein consider abundance info from input, do not ignore

Output

--fastaout FILENAME output subsampled sequences to FASTA file

--fastaout_discarded FILE output non-subsampled sequences to FASTA file

--fastqout FILENAME output subsampled sequences to FASTQ file

--fastqout_discarded output non-subsampled sequences to FASTQ file

--relabel STRING relabel sequences with this prefix string

--relabel_keep keep the old label after the new when relabelling

--relabel_md5 relabel with md5 digest of normalized sequence

--relabel_sha1 relabel with sha1 digest of normalized sequence

--sizeout update abundance information in output

--xsize strip abundance information in output

UDB files

--makeudb_usearch FILENAME make UDB file from given FASTA file

--udb2fasta FILENAME output FASTA file from given UDB file

--udbinfo FILENAME show information about UDB file

--udbstats FILENAME report statistics about indexed words in UDB file

Parameters

--dbmask none|dust|soft mask db with dust, soft or no method (dust)

--hardmask mask by replacing with N instead of lower case

--wordlength INT length of words for database index 3-15 (8)

Output

--output FILENAME UDB or FASTA output file

実行方法

１、Searching

--search_exact filename of queries for exact match search
--usearch_global filename of queries for global alignment search

クエリをデータベースに問い合わせ、合致する配列を探す。fastaを使う。

クエリと90%以上合致する配列を出力
vsearch --usearch_global queries.fa --db database.fa --id 0.9 --alnout alnout

Parameters

--dbmask none|dust|soft mask db with dust, soft or no method (dust)
--fulldp full dynamic programming alignment (always on)
--gapext STRING penalties for gap extension (2I/1E)
--gapopen STRING penalties for gap opening (20I/2E)
--hardmask mask by replacing with N instead of lower case
--id REAL reject if identity lower
--iddef INT id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)
--idprefix INT reject if first n nucleotides do not match
--idsuffix INT reject if last n nucleotides do not match
--leftjust reject if terminal gaps at alignment left end
--match INT score for match (2)
--maxaccepts INT number of hits to accept and show per strand (1)
--maxdiffs INT reject if more substitutions or indels
--maxgaps INT reject if more indels
--maxhits INT maximum number of hits to show (unlimited)
--maxid REAL reject if identity higher
--maxqsize INT reject if query abundance larger
--maxqt REAL reject if query/target length ratio higher
--maxrejects INT number of non-matching hits to consider (32)
--maxsizeratio REAL reject if query/target abundance ratio higher
--maxsl REAL reject if shorter/longer length ratio higher
--maxsubs INT reject if more substitutions
--mid REAL reject if percent identity lower, ignoring gaps
--mincols INT reject if alignment length shorter
--minqt REAL reject if query/target length ratio lower
--minsizeratio REAL reject if query/target abundance ratio lower
--minsl REAL reject if shorter/longer length ratio lower
--mintsize INT reject if target abundance lower
--minwordmatches INT minimum number of word matches required (12)
--mismatch INT score for mismatch (-4)
--pattern STRING option is ignored
--qmask none|dust|soft mask query with dust, soft or no method (dust)
--query_cov REAL reject if fraction of query seq. aligned lower
--rightjust reject if terminal gaps at alignment right end
--sizein propagate abundance annotation from input
--self reject if labels identical
--selfid reject if sequences identical
--slots INT option is ignored
--strand plus|both search plus or both strands (plus)
--target_cov REAL reject if fraction of target seq. aligned lower
--weak_id REAL include aligned hits with >= id; continue search
--wordlength INT length of words for database index 3-15 (8)

Output

--alnout FILENAME filename for human-readable alignment output
--biomout FILENAME filename for OTU table output in biom 1.0 format
--blast6out FILENAME filename for blast-like tab-separated output
--dbmatched FILENAME FASTA file for matching database sequences
--dbnotmatched FILENAME FASTA file for non-matching database sequences
--fastapairs FILENAME FASTA file with pairs of query and target
--matched FILENAME FASTA file for matching query sequences
--mothur_shared_out FN filename for OTU table output in mothur format
--notmatched FILENAME FASTA file for non-matching query sequences
--otutabout FILENAME filename for OTU table output in classic format
--output_no_hits output non-matching queries to output files
--rowlen INT width of alignment lines in alnout output (64)
--samheader include a header in the SAM output file
--samout FILENAME filename for SAM format output
--sizeout write abundance annotation to dbmatched file
--top_hits_only output only hits with identity equal to the best
--uc FILENAME filename for UCLUST-like output
--uc_allhits show all, not just top hit with uc output
--userfields STRING fields to output in userout file
--userout FILENAME filename for user-defined tab-separated output

２、Clustering

クラスタリング

--cluster_fast cluster sequences after sorting by length
--cluster_size cluster sequences after sorting by abundance
--cluster_smallmem cluster already sorted sequences (see -usersort)
Parameters (most searching options also apply)

vsearch --cluster_fast input.fa --id 0.97 --centroids output.fa

Parameters (most searching options also apply)

--cons_truncate do not ignore terminal gaps in MSA for consensus
--id REAL reject if identity lower, accepted values: 0-1.0
--iddef INT id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)
--qmask none|dust|soft mask seqs with dust, soft or no method (dust)
--sizein propagate abundance annotation from input
--strand plus|both cluster using plus or both strands (plus)
--usersort indicate sequences not pre-sorted by length

Output

--biomout FILENAME filename for OTU table output in biom 1.0 format
--centroids FILENAME output centroid sequences to FASTA file
--clusterout_id add cluster id info to consout and profile files
--clusterout_sort order msaout, consout, profile by decr abundance
--clusters STRING output each cluster to a separate FASTA file
--consout FILENAME output cluster consensus sequences to FASTA file
--mothur_shared_out FN filename for OTU table output in mothur format
--msaout FILENAME output multiple seq. alignments to FASTA file
--otutabout FILENAME filename for OTU table output in classic format
--profile FILENAME output sequence profile of each cluster to file
--relabel STRING relabel centroids with this prefix string
--relabel_keep keep the old label after the new when relabelling
--relabel_md5 relabel with md5 digest of normalized sequence
--relabel_sha1 relabel with sha1 digest of normalized sequence
--sizeorder sort accepted centroids by abundance (AGC)
--sizeout write cluster abundances to centroid file
--uc FILENAME specify filename for UCLUST-like output
--xsize strip abundance information in output

--usearch_globalと同様、多様な出力フォーマットに対応している。

defaultではplus strandのみクラスタリングするので、同じ配列でもreverse compなら別配列扱いになる（"--strand plus"）。両鎖クラスタリングするなら"--strand both"にする。（#379より）

３、Dereplication and rereplication

--derep_fulllength dereplicate sequences in the given FASTA file
--derep_prefix dereplicate sequences in file based on prefixes
--rereplicate rereplicate sequences in the given FASTA file

vsearch --derep_fulllength input.fa --output output.fa

Parameters

--maxuniquesize INT maximum abundance for output from dereplication
--minuniquesize INT minimum abundance for output from dereplication
--sizein propagate abundance annotation from input
--strand plus|both dereplicate plus or both strands (plus)

Output

--output FILENAME output FASTA file
--relabel STRING relabel with this prefix string
--relabel_keep keep the old label after the new when relabelling
--relabel_md5 relabel with md5 digest of normalized sequence
--relabel_sha1 relabel with sha1 digest of normalized sequence
--sizeout write abundance annotation to output
--topn INT output only n most abundant sequences after derep
--uc FILENAME filename for UCLUST-like dereplication output
--xsize strip abundance information in derep output

４、Chimera detection

--uchime_denovo detect chimeras de novo
--uchime_ref detect chimeras using a reference database

fasta/fastqを指定する。

#de novo
vsearch --uchime_denovo input.fq --nonchimeras output.fq

#database使用
vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME

Data

--db FILENAME reference database for --uchime_ref

Parameters

--abskew REAL min abundance ratio of parent vs chimera (2.0)
--dn REAL 'no' vote pseudo-count (1.4)
--mindiffs INT minimum number of differences in segment (3)
--mindiv REAL minimum divergence from closest parent (0.8)
--minh REAL minimum score (0.28)
--sizein propagate abundance annotation from input
--self exclude identical labels for --uchime_ref
--selfid exclude identical sequences for --uchime_ref
--xn REAL 'no' vote weight (8.0)

Output

--alignwidth INT width of alignment in uchimealn output (80)
--borderline FILENAME output borderline chimeric sequences to file
--chimeras FILENAME output chimeric sequences to file
--fasta_score include chimera score in fasta output
--nonchimeras FILENAME output non-chimeric sequences to file
--relabel STRING relabel nonchimeras with this prefix string
--relabel_keep keep the old label after the new when relabelling
--relabel_md5 relabel with md5 digest of normalized sequence
--relabel_sha1 relabel with sha1 digest of normalized sequence
--sizeout include abundance information when relabelling
--uchimealns FILENAME output chimera alignments to file
--uchimeout FILENAME output to chimera info to tab-separated file
--uchimeout5 make output compatible with uchime version 5
--xsize strip abundance information in output

５、merge

--fastq_mergepairs merge paired-end reads into one sequence

#merge (fq/fq.gz) 
vsearch --fastq_mergepairs pair_1.fq --reverse pair_2.fq --fastqout output.fq

Parameters

--fastq_allowmergestagger Allow merging of staggered reads
--fastq_ascii INT FASTQ input quality score ASCII base char (33)
--fastq_maxdiffs INT maximum number of different bases in overlap (10)
--fastq_maxee REAL maximum expected error value for merged sequence
--fastq_maxmergelen maximum length of entire merged sequence
--fastq_maxns INT maximum number of N's
--fastq_minlen INT minimum input read length after truncation (1)
--fastq_minmergelen minimum length of entire merged sequence
--fastq_minovlen minimum length of overlap between reads (10)
--fastq_nostagger disallow merging of staggered reads (default)
--fastq_qmax INT maximum base quality value for FASTQ input (41)
--fastq_qmaxout INT maximum base quality value for FASTQ output (41)
--fastq_qmin INT minimum base quality value for FASTQ input (0)
--fastq_qminout INT minimum base quality value for FASTQ output (0)
--fastq_truncqual INT base quality value for truncation

Output

--eetabbedout FILENAME output error statistics to specified file
--fastaout FILENAME FASTA output filename for merged sequences
--fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences
--fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences
--fastq_eeout include expected errors in FASTQ output
--fastqout FILENAME FASTQ output filename for merged sequences
--fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences
--fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences
--label_suffix suffix to append to label of merged sequences

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores

https://github.com/torognes/vsearch

Merging reads 100%

28630 Pairs

27041 Merged (94.4%)

1589 Not merged (5.6%)

Pairs that failed merging due to various reasons:

41 too few kmers found on same diagonal

35 potential tandem repeat

841 too many differences

670 alignment score too low, or score drop to high

1 overlap too short

1 staggered read pairs

Statistics of all reads:

300.01 Mean read length

Statistics of merged reads:

447.48 Mean fragment length

12.87 Standard deviation of fragment length

0.34 Mean expected error in forward sequences

1.19 Mean expected error in reverse sequences

0.25 Mean expected error in merged sequences

0.24 Mean observed errors in merged region of forward sequences

0.73 Mean observed errors in merged region of reverse sequences

0.97 Mean observed errors in merged region

６、Masking

--fastx_mask mask sequences in the given FASTA or FASTQ file

vsearch --fastx_mask input.fa --fastaout output.fa

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)
--fastq_qmax INT maximum base quality value for FASTQ input (41)
--fastq_qmin INT minimum base quality value for FASTQ input (0)
--hardmask mask by replacing with N instead of lower case
--max_unmasked_pct max unmasked % of sequences to keep (100.0)
--min_unmasked_pct min unmasked % of sequences to keep (0.0)
--qmask none|dust|soft mask seqs with dust, soft or no method (dust)

Output

--fastaout FILENAME output to specified FASTA file
--fastqout FILENAME output to specified FASTQ file

６、Pairwise alignment

--allpairs_global perform global alignment of all sequence pairs

Perform optimal global pairwise alignments of all vs. all fasta sequences contained in filename. This command is multi-threaded.

処理後の配列のペアワイズアラインメント（1000配列以上でも動作する）

vsearch --allpairs_global input.fa --id 0.5 --alnout output.fa

--alnout FILENAME filename for human-readable alignment output
--acceptall output all pairwise alignments
--id REAL reject if identity lower

大きな配列セットを使うと出力ファイルが巨大になることに注意してください。

７、Shuffling and sorting

--shuffle shuffle order of sequences in FASTA file randomly
--sortbylength sort sequences by length in given FASTA file
--sortbysize abundance sort sequences in given FASTA file

fasta/fastqを指定する。

vsearch --sortbylength input.fa --output output.fa

Parameters

--maxsize INT maximum abundance for sortbysize
--minsize INT minimum abundance for sortbysize
--randseed INT seed for PRNG, zero to use random data source (0)
--sizein propagate abundance annotation from input

Output

--output FILENAME output to specified FASTA file
--relabel STRING relabel sequences with this prefix string
--relabel_keep keep the old label after the new when relabelling
--relabel_md5 relabel with md5 digest of normalized sequence
--relabel_sha1 relabel with sha1 digest of normalized sequence
--sizeout include abundance information when relabelling
--topn INT output just first n sequences
--xsize strip abundance information in output

９、Subsampling

--fastx_subsample subsample sequences from given FASTA/FASTQ file

fasta/fastqを指定する。

#10percentサブサンプリング
vsearch --fastx_subsample input.fa --fastaout output.fa --sample_pct 10

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)
--fastq_qmax INT maximum base quality value for FASTQ input (41)
--fastq_qmin INT minimum base quality value for FASTQ input (0)
--randseed INT seed for PRNG, zero to use random data source (0)
--sample_pct REAL sampling percentage between 0.0 and 100.0
--sample_size INT sampling size
--sizein consider abundance info from input, do not ignore

Output

--fastaout FILENAME output subsampled sequences to FASTA file
--fastaout_discarded FILE output non-subsampled sequences to FASTA file
--fastqout FILENAME output subsampled sequences to FASTQ file
--fastqout_discarded output non-subsampled sequences to FASTQ file
--relabel STRING relabel sequences with this prefix string
--relabel_keep keep the old label after the new when relabelling
--relabel_md5 relabel with md5 digest of normalized sequence
--relabel_sha1 relabel with sha1 digest of normalized sequence
--sizeout update abundance information in output
--xsize strip abundance information in output

１０、Reverse complementation

--fastx_revcomp Reverse-complement seqs in FASTA or FASTQ file

fasta/fastqを指定する。

vsearch --fastx_revcomp input.fq --fastqout output.fq

Parameters

--fastq_allowmergestagger Allow merging of staggered reads
--fastq_ascii INT FASTQ input quality score ASCII base char (33)
--fastq_maxdiffs INT maximum number of different bases in overlap (10)
--fastq_maxee REAL maximum expected error value for merged sequence
--fastq_maxmergelen maximum length of entire merged sequence
--fastq_maxns INT maximum number of N's
--fastq_minlen INT minimum input read length after truncation (1)
--fastq_minmergelen minimum length of entire merged sequence
--fastq_minovlen minimum length of overlap between reads (10)
--fastq_nostagger disallow merging of staggered reads (default)
--fastq_qmax INT maximum base quality value for FASTQ input (41)
--fastq_qmaxout INT maximum base quality value for FASTQ output (41)
--fastq_qmin INT minimum base quality value for FASTQ input (0)
--fastq_qminout INT minimum base quality value for FASTQ output (0)
--fastq_truncqual INT base quality value for truncation

Output

--eetabbedout FILENAME output error statistics to specified file
--fastaout FILENAME FASTA output filename for merged sequences
--fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences
--fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences
--fastq_eeout include expected errors in FASTQ output
--fastqout FILENAME FASTQ output filename for merged sequences
--fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences
--fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences
--label_suffix suffix to append to label of merged sequences

１１、Filtering

--fastx_filter filter and truncate sequences in FASTA/FASTQ file
--fastq_filter filter and truncate sequences in FASTQ file

fasta/fastqを指定する。

#100bp以下をdiscard
vsearch --fastx_filter input.fq --fastaout output.fq --fastq_trunclen 100

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)
--fastq_maxee REAL maximum expected error value for filter
--fastq_maxee_rate REAL maximum expected error rate for filter
--fastq_maxlen INT maximum length of sequence for filter
--fastq_maxns INT maximum number of N's for filter
--fastq_minlen INT minimum length of sequence for filter
--fastq_qmax INT maximum base quality value for FASTQ input (41)
--fastq_qmin INT minimum base quality value for FASTQ input (0)
--fastq_stripleft INT bases on the left to delete
--fastq_stripright INT bases on the right to delete
--fastq_truncee REAL maximum total expected error for truncation
--fastq_trunclen INT truncate reads to length INT (discard if shorter)
--fastq_trunclen_keep INT truncate reads to length INT (keep if shorter)
--fastq_truncqual INT minimum base quality value for truncation
--maxsize INT maximum abundance
--minsize INT minimum abundance

Output

--eeout include expected errors in output
--fastaout FILENAME FASTA output filename for passed sequences
--fastaout_discarded FNAME FASTA filename for discarded sequences
--fastqout FILENAME FASTQ output filename for passed sequences
--fastqout_discarded FNAME FASTQ filename for discarded sequences
--relabel STRING relabel filtered sequences with given prefix
--relabel_keep keep the old label after the new when relabelling
--relabel_md5 relabel filtered sequences with md5 digest
--relabel_sha1 relabel filtered sequences with sha1 digest
--sizeout include abundance information when relabelling
--xsize strip abundance information in output

１２、format conversion

--fastq_convert convert between FASTQ file formats

fastqを指定する。

vsearch --fastq_convert input.fq --fastqout output.fq  --fastq_ascii 64

Parameters

--fastq_ascii INT FASTQ input quality score ASCII base char (33)
--fastq_asciiout INT FASTQ output quality score ASCII base char (33)
--fastq_qmax INT maximum base quality value for FASTQ input (41)
--fastq_qmaxout INT maximum base quality value for FASTQ output (41)
--fastq_qmin INT minimum base quality value for FASTQ input (0)
--fastq_qminout INT minimum base quality value for FASTQ output (0)

Output

--fastqout FILENAME FASTQ output filename for converted sequences

１３、FASTA/FASTQ file statistics:

--fastq_chars analyse FASTQ file for version and quality range Parameters

--fastq_stats report statistics on FASTQ file
--fastq_eestats quality score and expected error statistics
--fastq_eestats2 expected error and length cutoff statistics

#fq/fq.gz
vsearch --fastq_chars input.fq
vsearch --fastq_eestats input.fq --output report
vsearch --fastq_eestats2 input.fq --output report

> vsearch --fastq_chars input.fq

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores

https://github.com/torognes/vsearch

Reading FASTQ file 100%

Read 28630 sequences.

Qmin 35, QMax 71, Range 37

Guess: -fastq_qmin 2 -fastq_qmax 38 -fastq_ascii 33

Guess: Original Sanger format (phred+33)

Letter N Freq MaxRun

------ ---------- ------ ------

A 2128003 24.8% 19

C 1948219 22.7% 6

G 2881517 33.6% 86

N 491 0.0% 34 Q=#

T 1622142 18.9% 13

Char ASCII Freq Tails

---- ----- ------ ----------

'#' 35 0.0% 14

'(' 40 0.0% 0

')' 41 0.1% 1

以下略

> vsearch --fastq_eestats2 input.fq --output report

> cat report

$ cat report

28630 reads, max len 301, avg 299.7

Length MaxEE 0.50 MaxEE 1.00 MaxEE 2.00

------ ---------------- ---------------- ----------------

50 28434( 99.3%) 28535( 99.7%) 28605( 99.9%)

100 27882( 97.4%) 28303( 98.9%) 28451( 99.4%)

150 27174( 94.9%) 27820( 97.2%) 28252( 98.7%)

200 26408( 92.2%) 27276( 95.3%) 27946( 97.6%)

250 25238( 88.2%) 26425( 92.3%) 27376( 95.6%)

300 23470( 82.0%) 25209( 88.1%) 26571( 92.8%)

引用

VSEARCH: a versatile open source tool for metagenomics
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, Frédéric Mahé

PeerJ. 2016; 4: e2584

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

多機能な配列処理ツール VSEARCH（USEARCHの代替）