macでインフォマティクス

macでインフォマティクス

NGS関連のインフォマティクス情報についてまとめています。

多機能な配列処理ツール VSEARCH(USEARCHの代替)

2019 8/9 説明の誤り修正

2019 9/15 両鎖クラスタリングのコメント追加

 

 Rockström et al. (2009) とSteffen et al. (2015) は、生物多様性の損失を人類の短期生存のための主要な脅威として提示した。シークエンシング技術における最近の進歩は、深海熱水孔から南極湖、そして熱帯林からシベリア草原までの大規模な環境遺伝的多様性の研究を可能にした(Gilbert、Jansson&Knight、2014 pubmed)。最近の臨床研究は、私たちの体のミクロバイオームと人間の健康にとっての日常環境の重要性を示している(Human Microbiome Project Consortium、2012)。通常、普遍的なマーカー(例えば、16S rRNA、ITS、COI)に焦点を当てて、これらのターゲットメタゲノミクス研究は何百万もの配列を生み出し、そして、それらの生態学的解釈を容易にするためのオープンソース、高速かつ記憶効率の高いツールを必要とする。

 マイクロバイオーム分析のためにいくつかのパイプラインが開発されており、その中でもmothur(Schloss et al。、2009)、QIIME(Caporaso et al。、2010)、およびUPARSE(Edgar、2013)が最も人気がある。 QIIMEとUPARSEはどちらもRobert C. Edgarによって設計および実装されているUSEARCH(Edgar、2010)に基づいており、http://drive5.com/usearch/で入手できる。 USEARCHはFASTQとFASTAファイルを操作し分析するための多数のコマンドとオプションを提供する。ただし、USEARCHのソースコードは一般には公開されておらず、アルゴリズムの詳細は初歩的にしか説明されておらず、メモリ制限のある32ビットバージョンのみが学術用途に無料で入手できる。

 オープンソースソリューションの存在はエンドユーザーにとって有益であり、研究活動を活性化することができると私たちは信じている。このため、私たち(著者ら)はUSEARCHに代わる高品質のオープンソース代替ツールを提供することを約束した。これはメモリの制限なしにユーザーが自由に利用できる。 VSEARCHには、一般的に使用されているほとんどのUSEARCH機能が含まれており、さらに開発を進めると機能が追加される可能性がある。論文ではVSEARCH実装の詳細について説明する。結果のスピードと質に関してその性能を評価するために、著者らはいくつかの最も重要な機能(検索、クラスタリング、キメラ検出とサブサンプリング)を評価し、それらをUSEARCHと比較した。 VSEARCHはUSEARCHの結果と同等かそれ以上の結果をもたらすことが分かった。

 

環境的または臨床的分子多様性研究は、キメラについてチェックする必要がある、複製しない、マスキングする、分類する、検索する、クラスタ化する、またはリファレンス配列と比較する必要がある大量のアンプリコン(例えば、SSU −rRNA配列)を生成する。 vsearchの目的は、最適化されたアルゴリズム実装を使用し、現代のコンピュータの可能性を最大限に引き出すことで、これらのタスクを実行するためのオールインワンオープンソースツールを提供し、高速で正確なデータ処理を提供することである。
ヌクレオチド配列の比較はvsearchの中核である。比較をスピードアップするために、vsearchは、2003年以降のx86-64 CPUのストリーミングSIMD拡張命令(SSE2)を利用して、非常に高速なNeedleman-Wunschアルゴリズムを実装している。 SSE2命令が利用できない場合、vsearchはエラーメッセージを出して終了する。 Power 8 CPUでは、AltiVec / VSX / VMX命令を使用する。メモリ使用量はシーケンスの長さに応じて急速に増加する。たとえば、長さ1 kbの2つのシーケンスを比較すると、スレッドあたり8 MBのメモリが必要で、2つの10 kbシーケンスを比較すると、スレッドごとに800 MBのメモリが必要になる。

 

DocumentationはGithubにあるリンクからダウンロードできる(direct link)。

 

インストール

mac os10.14向け64bitバイナリをダウンロードし、テストした。

VSEARCH binaries are provided for x86-64 systems running GNU/Linux, macOS (version 10.7 or higher) and Windows (64-bit, version 7 or higher), as well as for 64-bit little-endian POWER8 (ppc64le) and 64-bit ARMv8 systems (aarch64) running GNU/Linux

ビルド依存(if support for gzip and bzip2 compressed FASTA and FASTQ input files is needed)

  • libz (zlib library) (zlib.h header file) (optional)
  • libbz2 (bzip2lib library) (bzlib.h header file) (optional)

本体

本体 GIthub

#bioconda
conda install -c bioconda vsearch

> vsearch

$ vsearch 

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores

https://github.com/torognes/vsearch

 

For help, please enter: vsearch --help

 

For further details, please see the manual.

 

Example commands:

 

vsearch --allpairs_global FILENAME --id 0.5 --alnout FILENAME

vsearch --cluster_fast FILENAME --id 0.97 --centroids FILENAME

vsearch --cluster_size FILENAME --id 0.97 --centroids FILENAME

vsearch --cluster_smallmem FILENAME --usersort --id 0.97 --centroids FILENAME

vsearch --derep_fulllength FILENAME --output FILENAME

vsearch --derep_prefix FILENAME --output FILENAME

vsearch --fastq_chars FILENAME

vsearch --fastq_convert FILENAME --fastqout FILENAME --fastq_ascii 64

vsearch --fastq_eestats FILENAME --output FILENAME

vsearch --fastq_eestats2 FILENAME --output FILENAME

vsearch --fastq_mergepairs FILENAME --reverse FILENAME --fastqout FILENAME

vsearch --fastq_stats FILENAME --log FILENAME

vsearch --fastx_filter FILENAME --fastaout FILENAME --fastq_trunclen 100

vsearch --fastx_mask FILENAME --fastaout FILENAME

vsearch --fastx_revcomp FILENAME --fastqout FILENAME

vsearch --fastx_subsample FILENAME --fastaout FILENAME --sample_pct 1

vsearch --rereplicate FILENAME --output FILENAME

vsearch --search_exact FILENAME --db FILENAME --alnout FILENAME

vsearch --shuffle FILENAME --output FILENAME

vsearch --sortbylength FILENAME --output FILENAME

vsearch --sortbysize FILENAME --output FILENAME

vsearch --uchime_denovo FILENAME --nonchimeras FILENAME

vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME

vsearch --usearch_global FILENAME --db FILENAME --id 0.97 --alnout FILENAME

 

> vsearch -h

$ vsearch -h

vsearch v2.6.2_macos_x86_64, 96.0GB RAM, 24 cores

https://github.com/torognes/vsearch

 

Rognes T, Flouri T, Nichols B, Quince C, Mahe F (2016)

VSEARCH: a versatile open source tool for metagenomics

PeerJ 4:e2584 doi: 10.7717/peerj.2584 https://doi.org/10.7717/peerj.2584

 

Usage: vsearch [OPTIONS]

 

General options

  --bzip2_decompress          decompress input with bzip2 (required if pipe)

  --fasta_width INT           width of FASTA seq lines, 0 for no wrap (80)

  --gzip_decompress           decompress input with gzip (required if pipe)

  --help | -h                 display help information

  --log FILENAME              write messages, timing and memory info to file

  --maxseqlength INT          maximum sequence length (50000)

  --minseqlength INT          min seq length (clust/derep/search: 32, other:1)

  --no_progress               do not show progress indicator

  --notrunclabels             do not truncate labels at first space

  --quiet                     output just warnings and fatal errors to stderr

  --threads INT               number of threads to use, zero for all cores (0)

  --version | -v              display version information

 

Chimera detection

  --uchime_denovo FILENAME    detect chimeras de novo

  --uchime_ref FILENAME       detect chimeras using a reference database

 Data

  --db FILENAME               reference database for --uchime_ref

 Parameters

  --abskew REAL               min abundance ratio of parent vs chimera (2.0)

  --dn REAL                   'no' vote pseudo-count (1.4)

  --mindiffs INT              minimum number of differences in segment (3)

  --mindiv REAL               minimum divergence from closest parent (0.8)

  --minh REAL                 minimum score (0.28)

  --sizein                    propagate abundance annotation from input

  --self                      exclude identical labels for --uchime_ref

  --selfid                    exclude identical sequences for --uchime_ref

  --xn REAL                   'no' vote weight (8.0)

 Output

  --alignwidth INT            width of alignment in uchimealn output (80)

  --borderline FILENAME       output borderline chimeric sequences to file

  --chimeras FILENAME         output chimeric sequences to file

  --fasta_score               include chimera score in fasta output

  --nonchimeras FILENAME      output non-chimeric sequences to file

  --relabel STRING            relabel nonchimeras with this prefix string

  --relabel_keep              keep the old label after the new when relabelling

  --relabel_md5               relabel with md5 digest of normalized sequence

  --relabel_sha1              relabel with sha1 digest of normalized sequence

  --sizeout                   include abundance information when relabelling

  --uchimealns FILENAME       output chimera alignments to file

  --uchimeout FILENAME        output to chimera info to tab-separated file

  --uchimeout5                make output compatible with uchime version 5

  --xsize                     strip abundance information in output

 

Clustering

  --cluster_fast FILENAME     cluster sequences after sorting by length

  --cluster_size FILENAME     cluster sequences after sorting by abundance

  --cluster_smallmem FILENAME cluster already sorted sequences (see -usersort)

 Parameters (most searching options also apply)

  --cons_truncate             do not ignore terminal gaps in MSA for consensus

  --id REAL                   reject if identity lower, accepted values: 0-1.0

  --iddef INT                 id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)

  --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)

  --sizein                    propagate abundance annotation from input

  --strand plus|both          cluster using plus or both strands (plus)

  --usersort                  indicate sequences not pre-sorted by length

 Output

  --biomout FILENAME          filename for OTU table output in biom 1.0 format

  --centroids FILENAME        output centroid sequences to FASTA file

  --clusterout_id             add cluster id info to consout and profile files

  --clusterout_sort           order msaout, consout, profile by decr abundance

  --clusters STRING           output each cluster to a separate FASTA file

  --consout FILENAME          output cluster consensus sequences to FASTA file

  --mothur_shared_out FN      filename for OTU table output in mothur format

  --msaout FILENAME           output multiple seq. alignments to FASTA file

  --otutabout FILENAME        filename for OTU table output in classic format

  --profile FILENAME          output sequence profile of each cluster to file

  --relabel STRING            relabel centroids with this prefix string

  --relabel_keep              keep the old label after the new when relabelling

  --relabel_md5               relabel with md5 digest of normalized sequence

  --relabel_sha1              relabel with sha1 digest of normalized sequence

  --sizeorder                 sort accepted centroids by abundance (AGC)

  --sizeout                   write cluster abundances to centroid file

  --uc FILENAME               specify filename for UCLUST-like output

  --xsize                     strip abundance information in output

 

Dereplication and rereplication

  --derep_fulllength FILENAME dereplicate sequences in the given FASTA file

  --derep_prefix FILENAME     dereplicate sequences in file based on prefixes

  --rereplicate FILENAME      rereplicate sequences in the given FASTA file

 Parameters

  --maxuniquesize INT         maximum abundance for output from dereplication

  --minuniquesize INT         minimum abundance for output from dereplication

  --sizein                    propagate abundance annotation from input

  --strand plus|both          dereplicate plus or both strands (plus)

 Output

  --output FILENAME           output FASTA file

  --relabel STRING            relabel with this prefix string

  --relabel_keep              keep the old label after the new when relabelling

  --relabel_md5               relabel with md5 digest of normalized sequence

  --relabel_sha1              relabel with sha1 digest of normalized sequence

  --sizeout                   write abundance annotation to output

  --topn INT                  output only n most abundant sequences after derep

  --uc FILENAME               filename for UCLUST-like dereplication output

  --xsize                     strip abundance information in derep output

 

FASTQ format conversion

  --fastq_convert FILENAME    convert between FASTQ file formats

 Parameters

  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)

  --fastq_asciiout INT        FASTQ output quality score ASCII base char (33)

  --fastq_qmax INT            maximum base quality value for FASTQ input (41)

  --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)

  --fastq_qmin INT            minimum base quality value for FASTQ input (0)

  --fastq_qminout INT         minimum base quality value for FASTQ output (0)

 Output

  --fastqout FILENAME         FASTQ output filename for converted sequences

 

FASTQ format detection and quality analysis

  --fastq_chars FILENAME      analyse FASTQ file for version and quality range

 Parameters

  --fastq_tail INT            min length of tails to count for fastq_chars (4)

 

FASTQ quality statistics

  --fastq_stats FILENAME      report statistics on FASTQ file

  --fastq_eestats FILENAME    quality score and expected error statistics

  --fastq_eestats2 FILENAME   expected error and length cutoff statistics

 Parameters

  --ee_cutoffs REAL,...       fastq_eestats2 expected error cutoffs (0.5,1,2)

  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)

  --fastq_qmax INT            maximum base quality value for FASTQ input (41)

  --fastq_qmin INT            minimum base quality value for FASTQ input (0)

  --length_cutoffs INT,INT,INT fastq_eestats2 length (min,max,incr) (50,*,50)

 Output

  --log FILENAME              output file for fastq_stats statistics

  --output FILENAME           output file for fastq_eestats(2) statistics

 

Filtering

  --fastx_filter FILENAME     filter and truncate sequences in FASTA/FASTQ file

  --fastq_filter FILENAME     filter and truncate sequences in FASTQ file

 Parameters

  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)

  --fastq_maxee REAL          maximum expected error value for filter

  --fastq_maxee_rate REAL     maximum expected error rate for filter

  --fastq_maxlen INT          maximum length of sequence for filter

  --fastq_maxns INT           maximum number of N's for filter

  --fastq_minlen INT          minimum length of sequence for filter

  --fastq_qmax INT            maximum base quality value for FASTQ input (41)

  --fastq_qmin INT            minimum base quality value for FASTQ input (0)

  --fastq_stripleft INT       bases on the left to delete

  --fastq_stripright INT      bases on the right to delete

  --fastq_truncee REAL        maximum total expected error for truncation

  --fastq_trunclen INT        truncate reads to length INT (discard if shorter)

  --fastq_trunclen_keep INT   truncate reads to length INT (keep if shorter)

  --fastq_truncqual INT       minimum base quality value for truncation

  --maxsize INT               maximum abundance

  --minsize INT               minimum abundance

 Output

  --eeout                     include expected errors in output

  --fastaout FILENAME         FASTA output filename for passed sequences

  --fastaout_discarded FNAME  FASTA filename for discarded sequences

  --fastqout FILENAME         FASTQ output filename for passed sequences

  --fastqout_discarded FNAME  FASTQ filename for discarded sequences

  --relabel STRING            relabel filtered sequences with given prefix

  --relabel_keep              keep the old label after the new when relabelling

  --relabel_md5               relabel filtered sequences with md5 digest

  --relabel_sha1              relabel filtered sequences with sha1 digest

  --sizeout                   include abundance information when relabelling

  --xsize                     strip abundance information in output

 

Masking (new)

  --fastx_mask FILENAME       mask sequences in the given FASTA or FASTQ file

 Parameters

  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)

  --fastq_qmax INT            maximum base quality value for FASTQ input (41)

  --fastq_qmin INT            minimum base quality value for FASTQ input (0)

  --hardmask                  mask by replacing with N instead of lower case

  --max_unmasked_pct          max unmasked % of sequences to keep (100.0)

  --min_unmasked_pct          min unmasked % of sequences to keep (0.0)

  --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)

 Output

  --fastaout FILENAME         output to specified FASTA file

  --fastqout FILENAME         output to specified FASTQ file

 

Masking (old)

  --maskfasta FILENAME        mask sequences in the given FASTA file

 Parameters

  --hardmask                  mask by replacing with N instead of lower case

  --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)

 Output

  --output FILENAME           output to specified FASTA file

 

Paired-end reads merging

  --fastq_mergepairs FILENAME merge paired-end reads into one sequence

 Data

  --reverse FILENAME          specify FASTQ file with reverse reads

 Parameters

  --fastq_allowmergestagger   Allow merging of staggered reads

  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)

  --fastq_maxdiffs INT        maximum number of different bases in overlap (10)

  --fastq_maxee REAL          maximum expected error value for merged sequence

  --fastq_maxmergelen         maximum length of entire merged sequence

  --fastq_maxns INT           maximum number of N's

  --fastq_minlen INT          minimum input read length after truncation (1)

  --fastq_minmergelen         minimum length of entire merged sequence

  --fastq_minovlen            minimum length of overlap between reads (10)

  --fastq_nostagger           disallow merging of staggered reads (default)

  --fastq_qmax INT            maximum base quality value for FASTQ input (41)

  --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)

  --fastq_qmin INT            minimum base quality value for FASTQ input (0)

  --fastq_qminout INT         minimum base quality value for FASTQ output (0)

  --fastq_truncqual INT       base quality value for truncation

 Output

  --eetabbedout FILENAME      output error statistics to specified file

  --fastaout FILENAME         FASTA output filename for merged sequences

  --fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences

  --fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences

  --fastq_eeout               include expected errors in FASTQ output

  --fastqout FILENAME         FASTQ output filename for merged sequences

  --fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences

  --fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences

  --label_suffix              suffix to append to label of merged sequences

 

Pairwise alignment

  --allpairs_global FILENAME  perform global alignment of all sequence pairs

 Output (most searching options also apply)

  --alnout FILENAME           filename for human-readable alignment output

  --acceptall                 output all pairwise alignments

 

Reverse complementation

  --fastx_revcomp FILENAME    Reverse-complement seqs in FASTA or FASTQ file

 Parameters

  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)

  --fastq_qmax INT            maximum base quality value for FASTQ input (41)

  --fastq_qmin INT            minimum base quality value for FASTQ input (0)

 Output

  --fastaout FILENAME         FASTA output filename

  --fastqout FILENAME         FASTQ output filename

  --label_suffix STRING       Label to append to identifier in the output

 

Searching

  --search_exact FILENAME     filename of queries for exact match search

  --usearch_global FILENAME   filename of queries for global alignment search

 Data

  --db FILENAME               name of UDB or FASTA database for search

 Parameters

  --dbmask none|dust|soft     mask db with dust, soft or no method (dust)

  --fulldp                    full dynamic programming alignment (always on)

  --gapext STRING             penalties for gap extension (2I/1E)

  --gapopen STRING            penalties for gap opening (20I/2E)

  --hardmask                  mask by replacing with N instead of lower case

  --id REAL                   reject if identity lower

  --iddef INT                 id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)

  --idprefix INT              reject if first n nucleotides do not match

  --idsuffix INT              reject if last n nucleotides do not match

  --leftjust                  reject if terminal gaps at alignment left end

  --match INT                 score for match (2)

  --maxaccepts INT            number of hits to accept and show per strand (1)

  --maxdiffs INT              reject if more substitutions or indels

  --maxgaps INT               reject if more indels

  --maxhits INT               maximum number of hits to show (unlimited)

  --maxid REAL                reject if identity higher

  --maxqsize INT              reject if query abundance larger

  --maxqt REAL                reject if query/target length ratio higher

  --maxrejects INT            number of non-matching hits to consider (32)

  --maxsizeratio REAL         reject if query/target abundance ratio higher

  --maxsl REAL                reject if shorter/longer length ratio higher

  --maxsubs INT               reject if more substitutions

  --mid REAL                  reject if percent identity lower, ignoring gaps

  --mincols INT               reject if alignment length shorter

  --minqt REAL                reject if query/target length ratio lower

  --minsizeratio REAL         reject if query/target abundance ratio lower

  --minsl REAL                reject if shorter/longer length ratio lower

  --mintsize INT              reject if target abundance lower

  --minwordmatches INT        minimum number of word matches required (12)

  --mismatch INT              score for mismatch (-4)

  --pattern STRING            option is ignored

  --qmask none|dust|soft      mask query with dust, soft or no method (dust)

  --query_cov REAL            reject if fraction of query seq. aligned lower

  --rightjust                 reject if terminal gaps at alignment right end

  --sizein                    propagate abundance annotation from input

  --self                      reject if labels identical

  --selfid                    reject if sequences identical

  --slots INT                 option is ignored

  --strand plus|both          search plus or both strands (plus)

  --target_cov REAL           reject if fraction of target seq. aligned lower

  --weak_id REAL              include aligned hits with >= id; continue search

  --wordlength INT            length of words for database index 3-15 (8)

 Output

  --alnout FILENAME           filename for human-readable alignment output

  --biomout FILENAME          filename for OTU table output in biom 1.0 format

  --blast6out FILENAME        filename for blast-like tab-separated output

  --dbmatched FILENAME        FASTA file for matching database sequences

  --dbnotmatched FILENAME     FASTA file for non-matching database sequences

  --fastapairs FILENAME       FASTA file with pairs of query and target

  --matched FILENAME          FASTA file for matching query sequences

  --mothur_shared_out FN      filename for OTU table output in mothur format

  --notmatched FILENAME       FASTA file for non-matching query sequences

  --otutabout FILENAME        filename for OTU table output in classic format

  --output_no_hits            output non-matching queries to output files

  --rowlen INT                width of alignment lines in alnout output (64)

  --samheader                 include a header in the SAM output file

  --samout FILENAME           filename for SAM format output

  --sizeout                   write abundance annotation to dbmatched file

  --top_hits_only             output only hits with identity equal to the best

  --uc FILENAME               filename for UCLUST-like output

  --uc_allhits                show all, not just top hit with uc output

  --userfields STRING         fields to output in userout file

  --userout FILENAME          filename for user-defined tab-separated output

 

Shuffling and sorting

  --shuffle FILENAME          shuffle order of sequences in FASTA file randomly

  --sortbylength FILENAME     sort sequences by length in given FASTA file

  --sortbysize FILENAME       abundance sort sequences in given FASTA file

 Parameters

  --maxsize INT               maximum abundance for sortbysize

  --minsize INT               minimum abundance for sortbysize

  --randseed INT              seed for PRNG, zero to use random data source (0)

  --sizein                    propagate abundance annotation from input

 Output

  --output FILENAME           output to specified FASTA file

  --relabel STRING            relabel sequences with this prefix string

  --relabel_keep              keep the old label after the new when relabelling

  --relabel_md5               relabel with md5 digest of normalized sequence

  --relabel_sha1              relabel with sha1 digest of normalized sequence

  --sizeout                   include abundance information when relabelling

  --topn INT                  output just first n sequences

  --xsize                     strip abundance information in output

 

Subsampling

  --fastx_subsample FILENAME  subsample sequences from given FASTA/FASTQ file

 Parameters

  --fastq_ascii INT           FASTQ input quality score ASCII base char (33)

  --fastq_qmax INT            maximum base quality value for FASTQ input (41)

  --fastq_qmin INT            minimum base quality value for FASTQ input (0)

  --randseed INT              seed for PRNG, zero to use random data source (0)

  --sample_pct REAL           sampling percentage between 0.0 and 100.0

  --sample_size INT           sampling size

  --sizein                    consider abundance info from input, do not ignore

 Output

  --fastaout FILENAME         output subsampled sequences to FASTA file

  --fastaout_discarded FILE   output non-subsampled sequences to FASTA file

  --fastqout FILENAME         output subsampled sequences to FASTQ file

  --fastqout_discarded        output non-subsampled sequences to FASTQ file

  --relabel STRING            relabel sequences with this prefix string

  --relabel_keep              keep the old label after the new when relabelling

  --relabel_md5               relabel with md5 digest of normalized sequence

  --relabel_sha1              relabel with sha1 digest of normalized sequence

  --sizeout                   update abundance information in output

  --xsize                     strip abundance information in output

 

UDB files

  --makeudb_usearch FILENAME  make UDB file from given FASTA file

  --udb2fasta FILENAME        output FASTA file from given UDB file

  --udbinfo FILENAME          show information about UDB file

  --udbstats FILENAME         report statistics about indexed words in UDB file

 Parameters

  --dbmask none|dust|soft     mask db with dust, soft or no method (dust)

  --hardmask                  mask by replacing with N instead of lower case

  --wordlength INT            length of words for database index 3-15 (8)

 Output

  --output FILENAME           UDB or FASTA output file

 

 

実行方法

  1、Searching

--search_exact       filename of queries for exact match search
--usearch_global    filename of queries for global alignment search

クエリをデータベースに問い合わせ、合致する配列を探す。fastaを使う。

クエリと90%以上合致する配列を出力
vsearch --usearch_global queries.fa --db database.fa --id 0.9 --alnout alnout

Parameters

  •   --dbmask none|dust|soft     mask db with dust, soft or no method (dust)
  •   --fulldp                    full dynamic programming alignment (always on)
  •   --gapext STRING             penalties for gap extension (2I/1E)
  •   --gapopen STRING            penalties for gap opening (20I/2E)
  •   --hardmask                  mask by replacing with N instead of lower case
  •   --id REAL                   reject if identity lower
  •   --iddef INT                 id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)
  •   --idprefix INT              reject if first n nucleotides do not match
  •   --idsuffix INT              reject if last n nucleotides do not match
  •   --leftjust                  reject if terminal gaps at alignment left end
  •   --match INT                 score for match (2)
  •   --maxaccepts INT            number of hits to accept and show per strand (1)
  •   --maxdiffs INT              reject if more substitutions or indels
  •   --maxgaps INT               reject if more indels
  •   --maxhits INT               maximum number of hits to show (unlimited)
  •   --maxid REAL                reject if identity higher
  •   --maxqsize INT              reject if query abundance larger
  •   --maxqt REAL                reject if query/target length ratio higher
  •   --maxrejects INT            number of non-matching hits to consider (32)
  •   --maxsizeratio REAL         reject if query/target abundance ratio higher
  •   --maxsl REAL                reject if shorter/longer length ratio higher
  •   --maxsubs INT               reject if more substitutions
  •   --mid REAL                  reject if percent identity lower, ignoring gaps
  •   --mincols INT               reject if alignment length shorter
  •   --minqt REAL                reject if query/target length ratio lower
  •   --minsizeratio REAL         reject if query/target abundance ratio lower
  •   --minsl REAL                reject if shorter/longer length ratio lower
  •   --mintsize INT              reject if target abundance lower
  •   --minwordmatches INT        minimum number of word matches required (12)
  •   --mismatch INT              score for mismatch (-4)
  •   --pattern STRING            option is ignored
  •   --qmask none|dust|soft      mask query with dust, soft or no method (dust)
  •   --query_cov REAL            reject if fraction of query seq. aligned lower
  •   --rightjust                 reject if terminal gaps at alignment right end
  •   --sizein                    propagate abundance annotation from input
  •   --self                      reject if labels identical
  •   --selfid                    reject if sequences identical
  •   --slots INT                 option is ignored
  •   --strand plus|both          search plus or both strands (plus)
  •   --target_cov REAL           reject if fraction of target seq. aligned lower
  •   --weak_id REAL              include aligned hits with >= id; continue search
  •   --wordlength INT            length of words for database index 3-15 (8)

 Output

  •   --alnout FILENAME           filename for human-readable alignment output
  •   --biomout FILENAME          filename for OTU table output in biom 1.0 format
  •   --blast6out FILENAME        filename for blast-like tab-separated output
  •   --dbmatched FILENAME        FASTA file for matching database sequences
  •   --dbnotmatched FILENAME     FASTA file for non-matching database sequences
  •   --fastapairs FILENAME       FASTA file with pairs of query and target
  •   --matched FILENAME          FASTA file for matching query sequences
  •   --mothur_shared_out FN      filename for OTU table output in mothur format
  •   --notmatched FILENAME       FASTA file for non-matching query sequences
  •   --otutabout FILENAME        filename for OTU table output in classic format
  •   --output_no_hits            output non-matching queries to output files
  •   --rowlen INT                width of alignment lines in alnout output (64)
  •   --samheader                 include a header in the SAM output file
  •   --samout FILENAME           filename for SAM format output
  •   --sizeout                   write abundance annotation to dbmatched file
  •   --top_hits_only             output only hits with identity equal to the best
  •   --uc FILENAME               filename for UCLUST-like output
  •   --uc_allhits                show all, not just top hit with uc output
  •   --userfields STRING         fields to output in userout file
  •   --userout FILENAME          filename for user-defined tab-separated output

出力フォーマットは--alnout | --biomout | --blast6out | --mothur_shared_out | --otutabout | --samout | --uc | --useroutから選ぶ。上では人が認識可能なアラインメントファイルで出力している。

 

 

2、Clustering

クラスタリング

  --cluster_fast    cluster sequences after sorting by length
  --cluster_size    cluster sequences after sorting by abundance
  --cluster_smallmem   cluster already sorted sequences (see -usersort)
 Parameters (most searching options also apply)

vsearch --cluster_fast input.fa --id 0.97 --centroids output.fa

Parameters (most searching options also apply)

  •   --cons_truncate             do not ignore terminal gaps in MSA for consensus
  •   --id REAL                   reject if identity lower, accepted values: 0-1.0
  •   --iddef INT                 id definition, 0-4=CD-HIT,all,int,MBL,BLAST (2)
  •   --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)
  •   --sizein                    propagate abundance annotation from input
  •   --strand plus|both          cluster using plus or both strands (plus)
  •   --usersort                  indicate sequences not pre-sorted by length

 Output

  •   --biomout FILENAME          filename for OTU table output in biom 1.0 format
  •   --centroids FILENAME        output centroid sequences to FASTA file
  •   --clusterout_id             add cluster id info to consout and profile files
  •   --clusterout_sort           order msaout, consout, profile by decr abundance
  •   --clusters STRING           output each cluster to a separate FASTA file
  •   --consout FILENAME          output cluster consensus sequences to FASTA file
  •   --mothur_shared_out FN      filename for OTU table output in mothur format
  •   --msaout FILENAME           output multiple seq. alignments to FASTA file
  •   --otutabout FILENAME        filename for OTU table output in classic format
  •   --profile FILENAME          output sequence profile of each cluster to file
  •   --relabel STRING            relabel centroids with this prefix string
  •   --relabel_keep              keep the old label after the new when relabelling
  •   --relabel_md5               relabel with md5 digest of normalized sequence
  •   --relabel_sha1              relabel with sha1 digest of normalized sequence
  •   --sizeorder                 sort accepted centroids by abundance (AGC)
  •   --sizeout                   write cluster abundances to centroid file
  •   --uc FILENAME               specify filename for UCLUST-like output
  •   --xsize                     strip abundance information in output

--usearch_globalと同様、多様な出力フォーマットに対応している。

 defaultではplus strandのみクラスタリングするので、同じ配列でもreverse compなら別配列扱いになる("--strand plus")。両鎖クラスタリングするなら"--strand both"にする。(#379より)

 

3、Dereplication and rereplication

--derep_fulllength   dereplicate sequences in the given FASTA file
--derep_prefix         dereplicate sequences in file based on prefixes
--rereplicate            rereplicate sequences in the given FASTA file

vsearch --derep_fulllength input.fa --output output.fa

Parameters

  •   --maxuniquesize INT         maximum abundance for output from dereplication
  •   --minuniquesize INT         minimum abundance for output from dereplication
  •   --sizein                    propagate abundance annotation from input
  •   --strand plus|both          dereplicate plus or both strands (plus)

 Output

  •   --output FILENAME           output FASTA file
  •   --relabel STRING            relabel with this prefix string
  •   --relabel_keep              keep the old label after the new when relabelling
  •   --relabel_md5               relabel with md5 digest of normalized sequence
  •   --relabel_sha1              relabel with sha1 digest of normalized sequence
  •   --sizeout                   write abundance annotation to output
  •   --topn INT                  output only n most abundant sequences after derep
  •   --uc FILENAME               filename for UCLUST-like dereplication output
  •   --xsize                     strip abundance information in derep output

 

4、Chimera detection

--uchime_denovo    detect chimeras de novo
--uchime_ref       detect chimeras using a reference database

fasta/fastqを指定する。

#de novo
vsearch --uchime_denovo input.fq --nonchimeras output.fq

#database使用
vsearch --uchime_ref FILENAME --db FILENAME --nonchimeras FILENAME

Data

  •   --db FILENAME               reference database for --uchime_ref

 Parameters

  •   --abskew REAL               min abundance ratio of parent vs chimera (2.0)
  •   --dn REAL                   'no' vote pseudo-count (1.4)
  •   --mindiffs INT              minimum number of differences in segment (3)
  •   --mindiv REAL               minimum divergence from closest parent (0.8)
  •   --minh REAL                 minimum score (0.28)
  •   --sizein                    propagate abundance annotation from input
  •   --self                      exclude identical labels for --uchime_ref
  •   --selfid                    exclude identical sequences for --uchime_ref
  •   --xn REAL                   'no' vote weight (8.0)

 Output

  •   --alignwidth INT            width of alignment in uchimealn output (80)
  •   --borderline FILENAME       output borderline chimeric sequences to file
  •   --chimeras FILENAME         output chimeric sequences to file
  •   --fasta_score               include chimera score in fasta output
  •   --nonchimeras FILENAME      output non-chimeric sequences to file
  •   --relabel STRING            relabel nonchimeras with this prefix string
  •   --relabel_keep              keep the old label after the new when relabelling
  •   --relabel_md5               relabel with md5 digest of normalized sequence
  •   --relabel_sha1              relabel with sha1 digest of normalized sequence
  •   --sizeout                   include abundance information when relabelling
  •   --uchimealns FILENAME       output chimera alignments to file
  •   --uchimeout FILENAME        output to chimera info to tab-separated file
  •   --uchimeout5                make output compatible with uchime version 5
  •   --xsize                     strip abundance information in output

 

 

 

5、merge

--fastq_mergepairs    merge paired-end reads into one sequence

#merge (fq/fq.gz) 
vsearch --fastq_mergepairs pair_1.fq --reverse pair_2.fq --fastqout output.fq

Parameters

  •   --fastq_allowmergestagger   Allow merging of staggered reads
  •   --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  •   --fastq_maxdiffs INT        maximum number of different bases in overlap (10)
  •   --fastq_maxee REAL          maximum expected error value for merged sequence
  •   --fastq_maxmergelen         maximum length of entire merged sequence
  •   --fastq_maxns INT           maximum number of N's
  •   --fastq_minlen INT          minimum input read length after truncation (1)
  •   --fastq_minmergelen         minimum length of entire merged sequence
  •   --fastq_minovlen            minimum length of overlap between reads (10)
  •   --fastq_nostagger           disallow merging of staggered reads (default)
  •   --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  •   --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)
  •   --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  •   --fastq_qminout INT         minimum base quality value for FASTQ output (0)
  •   --fastq_truncqual INT       base quality value for truncation

 Output

  •   --eetabbedout FILENAME      output error statistics to specified file
  •   --fastaout FILENAME         FASTA output filename for merged sequences
  •   --fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences
  •   --fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences
  •   --fastq_eeout               include expected errors in FASTQ output
  •   --fastqout FILENAME         FASTQ output filename for merged sequences
  •   --fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences
  •   --fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences
  •   --label_suffix              suffix to append to label of merged sequences

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores

https://github.com/torognes/vsearch

 

Merging reads 100%  

     28630  Pairs

     27041  Merged (94.4%)

      1589  Not merged (5.6%)

 

Pairs that failed merging due to various reasons:

        41  too few kmers found on same diagonal

        35  potential tandem repeat

       841  too many differences

       670  alignment score too low, or score drop to high

         1  overlap too short

         1  staggered read pairs

 

Statistics of all reads:

    300.01  Mean read length

 

Statistics of merged reads:

    447.48  Mean fragment length

     12.87  Standard deviation of fragment length

      0.34  Mean expected error in forward sequences

      1.19  Mean expected error in reverse sequences

      0.25  Mean expected error in merged sequences

      0.24  Mean observed errors in merged region of forward sequences

      0.73  Mean observed errors in merged region of reverse sequences

      0.97  Mean observed errors in merged region

 

 

6、Masking

--fastx_mask     mask sequences in the given FASTA or FASTQ file

vsearch --fastx_mask input.fa --fastaout output.fa

 Parameters

  •   --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  •   --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  •   --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  •   --hardmask                  mask by replacing with N instead of lower case
  •   --max_unmasked_pct          max unmasked % of sequences to keep (100.0)
  •   --min_unmasked_pct          min unmasked % of sequences to keep (0.0)
  •   --qmask none|dust|soft      mask seqs with dust, soft or no method (dust)

 Output

  •   --fastaout FILENAME         output to specified FASTA file
  •   --fastqout FILENAME         output to specified FASTQ file

 

 

6、Pairwise alignment

--allpairs_global    perform global alignment of all sequence pairs

Perform optimal global pairwise alignments of all vs. all fasta sequences contained in filename. This command is multi-threaded.

処理後の配列のペアワイズアラインメント(1000配列以上でも動作する)

vsearch --allpairs_global input.fa --id 0.5 --alnout output.fa
  • --alnout FILENAME           filename for human-readable alignment output
  • --acceptall                 output all pairwise alignments
  • --id REAL                   reject if identity lower

 大きな配列セットを使うと出力ファイルが巨大になることに注意してください。

 

 

7、Shuffling and sorting

--shuffle              shuffle order of sequences in FASTA file randomly
--sortbylength    sort sequences by length in given FASTA file
--sortbysize        abundance sort sequences in given FASTA file

fasta/fastqを指定する。

vsearch --sortbylength input.fa --output output.fa

 Parameters

  •   --maxsize INT               maximum abundance for sortbysize
  •   --minsize INT               minimum abundance for sortbysize
  •   --randseed INT              seed for PRNG, zero to use random data source (0)
  •   --sizein                    propagate abundance annotation from input

 Output

  •   --output FILENAME           output to specified FASTA file
  •   --relabel STRING            relabel sequences with this prefix string
  •   --relabel_keep              keep the old label after the new when relabelling
  •   --relabel_md5               relabel with md5 digest of normalized sequence
  •   --relabel_sha1              relabel with sha1 digest of normalized sequence
  •   --sizeout                   include abundance information when relabelling
  •   --topn INT                  output just first n sequences
  •   --xsize                     strip abundance information in output

 

 

9、Subsampling

--fastx_subsample    subsample sequences from given FASTA/FASTQ file

fasta/fastqを指定する。

#10percentサブサンプリング
vsearch --fastx_subsample input.fa --fastaout output.fa --sample_pct 10

 Parameters

  •   --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  •   --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  •   --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  •   --randseed INT              seed for PRNG, zero to use random data source (0)
  •   --sample_pct REAL           sampling percentage between 0.0 and 100.0
  •   --sample_size INT           sampling size
  •   --sizein                    consider abundance info from input, do not ignore

 Output

  •   --fastaout FILENAME         output subsampled sequences to FASTA file
  •   --fastaout_discarded FILE   output non-subsampled sequences to FASTA file
  •   --fastqout FILENAME         output subsampled sequences to FASTQ file
  •   --fastqout_discarded        output non-subsampled sequences to FASTQ file
  •   --relabel STRING            relabel sequences with this prefix string
  •   --relabel_keep              keep the old label after the new when relabelling
  •   --relabel_md5               relabel with md5 digest of normalized sequence
  •   --relabel_sha1              relabel with sha1 digest of normalized sequence
  •   --sizeout                   update abundance information in output
  •   --xsize                     strip abundance information in output

 

 

 

10、Reverse complementation

--fastx_revcomp    Reverse-complement seqs in FASTA or FASTQ file

fasta/fastqを指定する。

vsearch --fastx_revcomp input.fq --fastqout output.fq

Parameters

  •   --fastq_allowmergestagger   Allow merging of staggered reads
  •   --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  •   --fastq_maxdiffs INT        maximum number of different bases in overlap (10)
  •   --fastq_maxee REAL          maximum expected error value for merged sequence
  •   --fastq_maxmergelen         maximum length of entire merged sequence
  •   --fastq_maxns INT           maximum number of N's
  •   --fastq_minlen INT          minimum input read length after truncation (1)
  •   --fastq_minmergelen         minimum length of entire merged sequence
  •   --fastq_minovlen            minimum length of overlap between reads (10)
  •   --fastq_nostagger           disallow merging of staggered reads (default)
  •   --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  •   --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)
  •   --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  •   --fastq_qminout INT         minimum base quality value for FASTQ output (0)
  •   --fastq_truncqual INT       base quality value for truncation

 Output

  •   --eetabbedout FILENAME      output error statistics to specified file
  •   --fastaout FILENAME         FASTA output filename for merged sequences
  •   --fastaout_notmerged_fwd FN FASTA filename for non-merged forward sequences
  •   --fastaout_notmerged_rev FN FASTA filename for non-merged reverse sequences
  •   --fastq_eeout               include expected errors in FASTQ output
  •   --fastqout FILENAME         FASTQ output filename for merged sequences
  •   --fastqout_notmerged_fwd FN FASTQ filename for non-merged forward sequences
  •   --fastqout_notmerged_rev FN FASTQ filename for non-merged reverse sequences
  •   --label_suffix              suffix to append to label of merged sequences

 

11、Filtering

--fastx_filter     filter and truncate sequences in FASTA/FASTQ file
--fastq_filter     filter and truncate sequences in FASTQ file

fasta/fastqを指定する。

#100bp以下をdiscard
vsearch --fastx_filter input.fq --fastaout output.fq --fastq_trunclen 100

Parameters

  •   --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  •   --fastq_maxee REAL          maximum expected error value for filter
  •   --fastq_maxee_rate REAL     maximum expected error rate for filter
  •   --fastq_maxlen INT          maximum length of sequence for filter
  •   --fastq_maxns INT           maximum number of N's for filter
  •   --fastq_minlen INT          minimum length of sequence for filter
  •   --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  •   --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  •   --fastq_stripleft INT       bases on the left to delete
  •   --fastq_stripright INT      bases on the right to delete
  •   --fastq_truncee REAL        maximum total expected error for truncation
  •   --fastq_trunclen INT        truncate reads to length INT (discard if shorter)
  •   --fastq_trunclen_keep INT   truncate reads to length INT (keep if shorter)
  •   --fastq_truncqual INT       minimum base quality value for truncation
  •   --maxsize INT               maximum abundance
  •   --minsize INT               minimum abundance

 Output

  •   --eeout                     include expected errors in output
  •   --fastaout FILENAME         FASTA output filename for passed sequences
  •   --fastaout_discarded FNAME  FASTA filename for discarded sequences
  •   --fastqout FILENAME         FASTQ output filename for passed sequences
  •   --fastqout_discarded FNAME  FASTQ filename for discarded sequences
  •   --relabel STRING            relabel filtered sequences with given prefix
  •   --relabel_keep              keep the old label after the new when relabelling
  •   --relabel_md5               relabel filtered sequences with md5 digest
  •   --relabel_sha1              relabel filtered sequences with sha1 digest
  •   --sizeout                   include abundance information when relabelling
  •   --xsize                     strip abundance information in output

 

 

12、format conversion 

--fastq_convert    convert between FASTQ file formats

fastqを指定する。

vsearch --fastq_convert input.fq --fastqout output.fq  --fastq_ascii 64

Parameters

  •   --fastq_ascii INT           FASTQ input quality score ASCII base char (33)
  •   --fastq_asciiout INT        FASTQ output quality score ASCII base char (33)
  •   --fastq_qmax INT            maximum base quality value for FASTQ input (41)
  •   --fastq_qmaxout INT         maximum base quality value for FASTQ output (41)
  •   --fastq_qmin INT            minimum base quality value for FASTQ input (0)
  •   --fastq_qminout INT         minimum base quality value for FASTQ output (0)

 Output

  •  --fastqout FILENAME         FASTQ output filename for converted sequences

 

 

13、FASTA/FASTQ file statistics:

--fastq_chars        analyse FASTQ file for version and quality range Parameters

--fastq_stats         report statistics on FASTQ file
--fastq_eestats     quality score and expected error statistics
--fastq_eestats2   expected error and length cutoff statistics

#fq/fq.gz
vsearch --fastq_chars input.fq
vsearch --fastq_eestats input.fq --output report
vsearch --fastq_eestats2 input.fq --output report

> vsearch --fastq_chars input.fq

vsearch v2.7.0_macos_x86_64, 16.0GB RAM, 8 cores

https://github.com/torognes/vsearch

 

Reading FASTQ file 100% 

Read 28630 sequences.

Qmin 35, QMax 71, Range 37

Guess: -fastq_qmin 2 -fastq_qmax 38 -fastq_ascii 33

Guess: Original Sanger format (phred+33)

 

Letter          N   Freq MaxRun

------ ---------- ------ ------

     A    2128003  24.8%     19

     C    1948219  22.7%      6

     G    2881517  33.6%     86

     N        491   0.0%     34  Q=#

     T    1622142  18.9%     13

 

Char  ASCII    Freq       Tails

----  -----  ------  ----------

'#'     35    0.0%          14

'('     40    0.0%           0

')'     41    0.1%           1

以下略

> vsearch --fastq_eestats2 input.fq --output report

cat report

$ cat report

28630 reads, max len 301, avg 299.7

 

Length         MaxEE 0.50         MaxEE 1.00         MaxEE 2.00

------   ----------------   ----------------   ----------------

    50      28434( 99.3%)      28535( 99.7%)      28605( 99.9%)

   100      27882( 97.4%)      28303( 98.9%)      28451( 99.4%)

   150      27174( 94.9%)      27820( 97.2%)      28252( 98.7%)

   200      26408( 92.2%)      27276( 95.3%)      27946( 97.6%)

   250      25238( 88.2%)      26425( 92.3%)      27376( 95.6%)

   300      23470( 82.0%)      25209( 88.1%)      26571( 92.8%)

 

引用

VSEARCH: a versatile open source tool for metagenomics
Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, Frédéric Mahé

PeerJ. 2016; 4: e2584