2019 9/8 インストール追記
2013年のペーパーより
ハイスループットシーケンシング(HTS)は、シーケンシングデータの急速な成長率をもたらした。 著者らのラボでは、毎日テラバイトのデータを生成している。 これは通常、バリアントコーラー、定量およびアセンブリなどの一般的なタスクに使用する前に、さまざまな方法でデータ出力を「クリーン」に処理する必要がある。
HTSに関連する2つの一般的なタスクは、アダプタートリミングとペアエンドのマージである。 私たち( 著者ら)はこれらの共通の課題に対処するためにExpression Analysis、Inc.で2つのツールを開発した。 これらのプログラムの名前はfastq-mcfとfastq-joinである。 これらのツールの性能を、リソース効率と有効性の両面から、同様のオープンソースユーティリティと比較した。
https://help.rc.ufl.edu/doc/EA-Utils
公式HP
ea-utils by ExpressionAnalysis
インストール
#bioconda (link)
conda install -c bioconda -y ea-utils
#homebrew
brew tap brewsci/science #してない人だけ
brew install ea-utils
> fastq-mcf
$ fastq-mcf
Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]
Version: 1.04.803
Detects levels of adapter presence, computes likelihoods and
locations (start, end) of the adapters. Removes the adapter
sequences from the fastq file(s).
Stats go to stderr, unless -o is specified.
Specify -0 to turn off all default settings
If you specify multiple 'paired-end' inputs, then a -o option is
required for each. IE: -o read1.clip.q -o read2.clip.fq
Options:
-h This help
-o FIL Output file (stats to stdout)
-O N Only output the first N records (all)
-s N.N Log scale for adapter minimum-length-match (2.2)
-t N % occurance threshold before adapter clipping (0.25)
-m N Minimum clip length, overrides scaled auto (1)
-p N Maximum adapter difference percentage (10)
-l N Minimum remaining sequence length (19)
-L N Maximum remaining sequence length (none)
-D N Remove duplicate reads : Read_1 has an identical N bases (0)
-k N sKew percentage-less-than causing cycle removal (2)
-x N 'N' (Bad read) percentage causing cycle removal (20)
-q N quality threshold causing base removal (10)
-w N window-size for quality trimming (1)
-H remove >95% homopolymer reads (no)
-X remove low complexity reads (no)
-0 Set all default parameters to zero/do nothing
-U|u Force disable/enable Illumina PF filtering (auto)
-P N Phred-scale (auto)
-R Don't remove N's from the fronts/ends of reads
-n Don't clip, just output what would be done
-K Only keep clipped reads
-S Save all discarded reads to '.skip' files
-C N Number of reads to use for subsampling (300k)
-d Output lots of random debugging stuff
Quality adjustment options:
--cycle-adjust CYC,AMT Adjust cycle CYC (negative = offset from end) by amount AMT
--phred-adjust SCORE,AMT Adjust score SCORE by amount AMT
--phred-adjust-max SCORE Adjust scores > SCORE to SCOTE
Filtering options*:
--[mate-]qual-mean NUM Minimum mean quality score
--[mate-]qual-gt NUM,THR At least NUM quals > THR
--[mate-]max-ns NUM Maxmium N-calls in a read (can be a %)
--[mate-]min-len NUM Minimum remaining length (same as -l)
--homopolymer-pct PCT Homopolymer filter percent (95)
--lowcomplex-pct PCT Complexity filter percent (95)
--keep-clipped Only keep clipped (same as -K)
--max-output-reads N Only output first N records (same as -O)
If mate- prefix is used, then applies to second non-barcode read only
Adapter files are 'fasta' formatted:
Specify n/a to turn off adapter clipping, and just use filters
Increasing the scale makes recognition-lengths longer, a scale
of 100 will force full-length recognition of adapters.
Adapter sequences with _5p in their label will match 'end's,
and sequences with _3p in their label will match 'start's,
otherwise the 'end' is auto-determined.
Skew is when one cycle is poor, 'skewed' toward a particular base.
If any nucleotide is less than the skew percentage, then the
whole cycle is removed. Disable for methyl-seq, etc.
Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done
for miRNA, amplicon and other low-complexity situations!)
Duplicate read filtering is appropriate for assembly tasks, and
never when read length < expected coverage. -D 50 will use
4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.
*Quality filters are evaluated after clipping/trimming
Homopolymer filtering is a subset of low-complexity, but will not
be separately tracked unless both are turned on.
> fastq-multx
$ fastq-multx
Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...
Version: 1.02.772
Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.
Output file can be n/a to discard the corresponding data (use this for the barcode read)
Barcodes file (-B) looks like this:
<id1> <sequence1>
<id2> <sequence2> ...
Default is to guess the -bol or -eol based on clear stats.
If -g is used, then it's parameter is an index lane, and frequently occuring sequences are used.
If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.
Grouped barcodes file (-l or -L) looks like this:
<id1> <sequence1> <group1>
<id1> <sequence1> <group1>
<id2> <sequence2> <group2>...
Mated reads, if supplied, are kept in-sync
Options:
-o FIL1 Output files (one per input, required)
-g SEQFIL Determine barcodes from the indexed read SEQFIL
-l BCFIL Determine barcodes from any read, using BCFIL as a master list
-L BCFIL Determine barcodes from <read1.fq>, using BCFIL as a master list
-B BCFIL Use barcodes from BCFIL, no determination step, codes in <read1.fq>
-H Use barcodes from illumina's header, instead of a read
-b Force beginning of line (5') for barcode matching
-e Force end of line (3') for batcode matching
-t NUM Divide threshold for auto-determine by factor NUM (1), > 1 = more sensitive
-G NAME Use group(s) matching NAME only
-x Don't trim barcodes off before writing out destination
-n Don't execute, just print likely barcode list
-v C Verify that mated id's match up to character C (Use ' ' for illumina)
-m N Allow up to N mismatches, as long as they are unique (1)
-d N Require a minimum distance of N between the best and next best (2)
-q N Require a minimum phred quality of N to accept a barcode base (0)
> fastq-join
$ fastq-join
Usage: fastq-join [options] <read1.fq> <read2.fq> [mate.fq] -o <read.%.fq>
Joins two paired-end reads on the overlapping ends.
Options:
-o FIL See 'Output' below
-v C Verifies that the 2 files probe id's match up to char C
use ' ' (space) for Illumina reads
-p N N-percent maximum difference (8)
-m N N-minimum overlap (6)
-r FIL Verbose stitch length report
-R No reverse complement
-x Allow insert < read length
Output:
You can supply 3 -o arguments, for un1, un2, join files, or one
argument as a file name template. The suffix 'un1, un2, or join' is
appended to the file, or they replace a %-character if present.
If a 'mate' input file is present (barcode read), then the files
'un3' and 'join2' are also created.
> varcall
$ varcall
Specify -s for stats only, or -v to do variant calling
Usage: varcall <-s|-v> <-f REF> [options] bam1 [bam2...]
Version: 0.95.794 (BETA)
Either outputs summry stats for the list of files, or performs variant calling
Options (later options override earlier):
-s Calculate statistics
-v|version Calculate variants bases on supplied parameters (see -S)
-f Reference fasta (required if using bams, ignored otherwise)
-m Min locii depth (1)
-a Min allele depth (2)
-p Min allele pct by quality (0)
-q Min qual (3)
-Q Min mapping quality (0)
-b Min pct balance (strand/total) (0)
-D FLOAT Max duplicate read fraction (depth/length per position) (1)
-d FLOAT Minimum diversity (CV from optimal depth) (0.25)
-G FLOAT Minimum agreement (Weighted CV of positional variation) (0.25)
-0 Zero out all filters, set e-value filter to 1, report everything
-B If running from a BAM, turn off BAQ correction (false)
-R Homopolymer repeat indel filtering (8)
-e FLOAT Alpha filter to use, requires -l or -S (.05)
-g FLOAT Global minimum error rate (default: assume phred is ok)
-l INT Number of locii in total pileup used for bonferroni (1 mil)
-x CHR:POS Output this pos only, then quit
-S FILE Read in statistics and params from a previous run with -s (do this!)
-A ANNOT Calculate in-target stats using the annotation file (requires -o)
-o PREFIX Output prefix (works with -s or -v)
-F files List of file types to output (var, varsum, eav, vcf)
Extended Options
--pcr-annot BED Only include reads adhering to the expected amplicons
--stranded TYPE Can be FR (the default), FF, FR. Used with pcr-annot
--diversity|d FLOAT Alias for -d
--agreement|G FLOAT Alias for -G
--no-indels Ignore all indels
Input files
Files must be sorted bam files with bai index files available. Alternatively,
a single pileup file can be supplied.
Output files
Varcalls go to stdout. Stats go to stdout, or stderr if varcalling too
If an output prefix is used, files are created as follows:
PREFIX.var Variant calls in tab delimited 'varcall' format
PREFIX.eav Variant calls in tab delimited 'ea-var' format
PREFIX.cse Variant calls in tab delimited 'varprowl' format
PREFIX.vcf Variant calls, in vcf format
PREFIX.varsum Summary of variant calls
PREFIX.tgt.var On-target version of .var
PREFIX.tgt.cse On-target version of .cse
PREFIX.tgt.varsum On-target version of .varsum
Stats Output:
Contains mean, median, quartile information for depth, base quality, read len,
mapping quality, indel levels. Also estimates parameters suitable for
variant calls, and can be passed directly to this program for variant calls
If an output prefix is used, files are created as follows:
PREFIX.stats Stats output
PREFIX.noise Non-reference, non-homozygous allele summary
PREFIX.xnoise Like noise, but with context-specific rates
Filtering Details:
>sam-stats -h
$ sam-stats -h
Usage: sam-stats [options] [file1] [file2...filen]
Version: 1.38.763
Produces lots of easily digested statistics for the files listed
Options (default in parens):
-D Keep track of multiple alignments
-O PREFIX Output prefix enabling extended output (see below)
-R FIL Coverage/RNA output (coverage, 3' bias, etc, implies -A)
-A Report all chr sigs, even if there are more than 1000
-b INT Number of reads to sample for per-base stats (1M)
-S INT Size of ascii-signature (30)
-x FIL File extension for handling multiple files (stats)
-M Only overwrite if newer (requires -x, or multiple files)
-B Input is bam, don't bother looking at magic
-z Don't fail when zero entries in sam
OUTPUT:
If one file is specified, then the output is to standard out. If
multiple files are specified, or if the -x option is supplied,
the output file is <filename>.<ext>. Default extension is 'stats'.
Complete Stats:
<STATS> : mean, max, stdev, median, Q1 (25 percentile), Q3
reads : # of entries in the sam file, might not be # reads
phred : phred scale used
bsize : # reads used for qual stats
mapped reads : number of aligned reads (unique probe id sequences)
mapped bases : total of the lengths of the aligned reads
forward : number of forward-aligned reads
reverse : number of reverse-aligned reads
snp rate : mismatched bases / total bases (snv rate)
ins rate : insert bases / total bases
del rate : deleted bases / total bases
pct mismatch : percent of reads that have mismatches
pct align : percent of reads that aligned
len <STATS> : read length stats, ignored if fixed-length
mapq <STATS> : stats for mapping qualities
insert <STATS> : stats for insert sizes
%<CHR> : percentage of mapped bases per chr, followed by a signature
Subsampled stats (1M reads max):
base qual <STATS> : stats for base qualities
%A,%T,%C,%G : base percentages
Meaning of the per-chromosome signature:
A ascii-histogram of mapped reads by chromosome position.
It is only output if the original SAM/BAM has a header. The values
are the log2 of the # of mapped reads at each position + ascii '0'.
Extended output mode produces a set of files:
.stats : primary output
.fastx : fastx-toolkit compatible output
.rcov : per-reference counts & coverage
.xdist : mismatch distribution
.ldist : length distribution (if applicable)
.mqdist : mapping quality distribution
> fastq-stats -h
$ fastq-stats -h
Usage: fastq-stats [options] <fastq-file>
Version: 1.01 $Id: fastq-stats.cpp 652 2013-09-17 17:40:32Z earonesty $
Produces lots of easily digested statistics for the files listed
Options
-c cyclemax: max cycles for which following quality stats are produced [35]
-w INT window: max window size for generating duplicate read statistics [2000000]
-d debug: prints out debug statements
-D don't do duplicate read statistics
-s INT number of top duplicate reads to display
-x FIL output fastx statistics (requires an output filename)
-b FIL output base breakdown by per phred quality at every cycle.
It sets cylemax to longest read length
-L FIL Output length counts
The following data are printed to stdout:
reads : #reads in the fastq file
len : read length. mean and stdev are provided for variable read lengths
phred : phred scale used
window-size : Number of reads used to generate duplicate read statistics
cycle-max : Number of bases to assess for duplicity
dups : Number of reads that are duplicates
%dup : Pct reads that are duplcate
unique-dup seq : Number sequences that are duplicated
min dup count : Smallest duplicate tally for any duplicate sequence
dup seq <rank> <count> <sequence>
: Lists top 10 most frequent duplicate reads along with count mean and stdev
qual : Base Quality min, max and mean
%A,%T,%C,%G : base percentages
total bases : total number of bases
ラン
fastq-mcf(リンク): アダプター除去を行うツール。低クオリティ末端の除去、Nの除去などの機能も持つ。
fastq-mcf <options> adapters.fa input.fq
fastq-multx(リンク): Demultiplexing。
1、シングル。例えば
mock_a ACCC
salt_a CGTA
mock_b GAGT
salty_b CGGT
というバーコード配列ファイルbarcodes.filを用意し、以下のように打つ。ここでは入力データはgz圧縮されている。末端のアダプター配列に従い分割される。
fastq-multx -B barcodes.fil seq.fastq.gz -o %.fq.gz
- -B Use barcodes from BCFIL, no determination step, codes in <read1.fq>
- -o Output files (one per input, required)
double indexingのdemultiplexingは公式の例(リンク)を参考にしてください。
fastq-join(リンク):オーバーラップするfastqのマージ。
fastq-join read1.fq read2.fq -o read.%.fq
マージに成功した read.join.fq、失敗したread.un1.fq、read.un2.fqが出力される。
varcall(リンク):pileupからのバリアントコール(他のツールよりシンプル)。
fastq-join read1.fq read2.fq -o read.%.fq
sam-stats(リンク):bam/samのstatistics。
sam-stats input.bam
出力 (ONT long read使用)
reads 1141
version 1.38.763
mapped reads 1141
pct align 100.000000
mapped bases 4954696
pct forward 0.000
phred 33
forward 0
reverse 1141
len max 13407
len mean 4342.4154
len stdev 3041.4011
mapq mean 59.5425
mapq stdev 4.2102
mapq Q1 60.00
mapq median 60.00
mapq Q3 60.00
snp rate 0.064468
ins rate 0.036831
del rate 0.043816
pct mismatch 100.0000
base qual mean 7.9896
base qual stdev 3.9029
%A 29.1159
%C 23.2673
%G 26.6539
%T 20.9629
%segment1 68.497341 D@A@@@AA@@@?@@????=>===<=:;;00
%segment2 31.502659 @@?@@@@@??@?>@@@???>?>>>>>==<8
num ref seqs 2
num ref aligned 2
fastq-stats:fastqのstatistics。
fastq-stats input.fq
$ fastq-stats single.fq
reads 657350
len 250
len mean 250.0000
len stdev 0.0000
len min 250
phred 33
window-size 657350
cycle-max 35
dups 6
%dup 0.0017
unique-dup seq 6
min dup count 2
dup seq 1 1 CATGGTTAGCCAAATCCACATTGGCTTCTGTTTCG
dup seq 2 1 TGGTCAAAAAAAGGAAAAAAGGTTTATGGAGATAG
dup seq 3 1 AATTCACGATACAAAATTTGCAGGCGGCCCGCCAA
dup seq 4 1 TGGTGAACTAGTTCAATCTTTACAAAGTGAGATTA
dup seq 5 1 GAGCATTGCGGGCCGCCTTGACGGCGCTGGGGGAT
dup seq 6 1 TCTATTTAACCATCTGTATATCGAGGCTTTTCCTA
dup mean 2.0000
dup stddev 0.0000
qual min 2
qual max 38
qual mean 29.1634
qual stdev 13.5108
%A 25.9917
%C 23.9989
%G 24.0037
%T 26.0057
%N 0.0000
total bases 169337500
引用
Command-line tools for processing biological sequencing data
Erik Aronesty (2011). ea-utils
http://code.google.com/p/ea-utils
Comparison of Sequencing Utility Programs
Erik Aronesty (2013). TOBioiJ
DOI:10.2174/1875036201307010001
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.844.5897&rep=rep1&type=pdf