2018-06-29

bam, fastqのユーティリティツール EA-Utils

2019 9/8　インストール追記

2013年のペーパーより

　ハイスループットシーケンシング（HTS）は、シーケンシングデータの急速な成長率をもたらした。著者らのラボでは、毎日テラバイトのデータを生成している。これは通常、バリアントコーラー、定量およびアセンブリなどの一般的なタスクに使用する前に、さまざまな方法でデータ出力を「クリーン」に処理する必要がある。
　HTSに関連する2つの一般的なタスクは、アダプタートリミングとペアエンドのマージである。私たち（著者ら）はこれらの共通の課題に対処するためにExpression Analysis、Inc.で2つのツールを開発した。これらのプログラムの名前はfastq-mcfとfastq-joinである。これらのツールの性能を、リソース効率と有効性の両面から、同様のオープンソースユーティリティと比較した。

wiki

https://help.rc.ufl.edu/doc/EA-Utils

公式HP

ea-utils by ExpressionAnalysis

インストール

Github

#bioconda (link)
conda install -c bioconda -y ea-utils

#homebrew
brew tap brewsci/science #してない人だけ
brew install ea-utils

> fastq-mcf

$ fastq-mcf

Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...]

Version: 1.04.803

Detects levels of adapter presence, computes likelihoods and

locations (start, end) of the adapters. Removes the adapter

sequences from the fastq file(s).

Stats go to stderr, unless -o is specified.

Specify -0 to turn off all default settings

If you specify multiple 'paired-end' inputs, then a -o option is

required for each. IE: -o read1.clip.q -o read2.clip.fq

Options:

-h This help

-o FIL Output file (stats to stdout)

-O N Only output the first N records (all)

-s N.N Log scale for adapter minimum-length-match (2.2)

-t N % occurance threshold before adapter clipping (0.25)

-m N Minimum clip length, overrides scaled auto (1)

-p N Maximum adapter difference percentage (10)

-l N Minimum remaining sequence length (19)

-L N Maximum remaining sequence length (none)

-D N Remove duplicate reads : Read_1 has an identical N bases (0)

-k N sKew percentage-less-than causing cycle removal (2)

-x N 'N' (Bad read) percentage causing cycle removal (20)

-q N quality threshold causing base removal (10)

-w N window-size for quality trimming (1)

-H remove >95% homopolymer reads (no)

-X remove low complexity reads (no)

-0 Set all default parameters to zero/do nothing

-U|u Force disable/enable Illumina PF filtering (auto)

-P N Phred-scale (auto)

-R Don't remove N's from the fronts/ends of reads

-n Don't clip, just output what would be done

-K Only keep clipped reads

-S Save all discarded reads to '.skip' files

-C N Number of reads to use for subsampling (300k)

-d Output lots of random debugging stuff

Quality adjustment options:

--cycle-adjust CYC,AMT Adjust cycle CYC (negative = offset from end) by amount AMT

--phred-adjust SCORE,AMT Adjust score SCORE by amount AMT

--phred-adjust-max SCORE Adjust scores > SCORE to SCOTE

Filtering options*:

--[mate-]qual-mean NUM Minimum mean quality score

--[mate-]qual-gt NUM,THR At least NUM quals > THR

--[mate-]max-ns NUM Maxmium N-calls in a read (can be a %)

--[mate-]min-len NUM Minimum remaining length (same as -l)

--homopolymer-pct PCT Homopolymer filter percent (95)

--lowcomplex-pct PCT Complexity filter percent (95)

--keep-clipped Only keep clipped (same as -K)

--max-output-reads N Only output first N records (same as -O)

If mate- prefix is used, then applies to second non-barcode read only

Adapter files are 'fasta' formatted:

Specify n/a to turn off adapter clipping, and just use filters

Increasing the scale makes recognition-lengths longer, a scale

of 100 will force full-length recognition of adapters.

Adapter sequences with _5p in their label will match 'end's,

and sequences with _3p in their label will match 'start's,

otherwise the 'end' is auto-determined.

Skew is when one cycle is poor, 'skewed' toward a particular base.

If any nucleotide is less than the skew percentage, then the

whole cycle is removed. Disable for methyl-seq, etc.

Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done

for miRNA, amplicon and other low-complexity situations!)

Duplicate read filtering is appropriate for assembly tasks, and

never when read length < expected coverage. -D 50 will use

4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.

*Quality filters are evaluated after clipping/trimming

Homopolymer filtering is a subset of low-complexity, but will not

be separately tracked unless both are turned on.

> fastq-multx

$ fastq-multx

Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...

Version: 1.02.772

Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.

Output file can be n/a to discard the corresponding data (use this for the barcode read)

Barcodes file (-B) looks like this:

<id2> <sequence2> ...

Default is to guess the -bol or -eol based on clear stats.

If -g is used, then it's parameter is an index lane, and frequently occuring sequences are used.

If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.

Grouped barcodes file (-l or -L) looks like this:

<id2> <sequence2> <group2>...

Mated reads, if supplied, are kept in-sync

Options:

-o FIL1 Output files (one per input, required)

-g SEQFIL Determine barcodes from the indexed read SEQFIL

-l BCFIL Determine barcodes from any read, using BCFIL as a master list

-L BCFIL Determine barcodes from <read1.fq>, using BCFIL as a master list

-B BCFIL Use barcodes from BCFIL, no determination step, codes in <read1.fq>

-H Use barcodes from illumina's header, instead of a read

-b Force beginning of line (5') for barcode matching

-e Force end of line (3') for batcode matching

-t NUM Divide threshold for auto-determine by factor NUM (1), > 1 = more sensitive

-G NAME Use group(s) matching NAME only

-x Don't trim barcodes off before writing out destination

-n Don't execute, just print likely barcode list

-v C Verify that mated id's match up to character C (Use ' ' for illumina)

-m N Allow up to N mismatches, as long as they are unique (1)

-d N Require a minimum distance of N between the best and next best (2)

-q N Require a minimum phred quality of N to accept a barcode base (0)

> fastq-join

$ fastq-join

Usage: fastq-join [options] <read1.fq> <read2.fq> [mate.fq] -o <read.%.fq>

Joins two paired-end reads on the overlapping ends.

Options:

-o FIL See 'Output' below

-v C Verifies that the 2 files probe id's match up to char C

use ' ' (space) for Illumina reads

-p N N-percent maximum difference (8)

-m N N-minimum overlap (6)

-r FIL Verbose stitch length report

-R No reverse complement

-x Allow insert < read length

Output:

You can supply 3 -o arguments, for un1, un2, join files, or one

argument as a file name template. The suffix 'un1, un2, or join' is

appended to the file, or they replace a %-character if present.

If a 'mate' input file is present (barcode read), then the files

'un3' and 'join2' are also created.

> varcall

$ varcall

Specify -s for stats only, or -v to do variant calling

Usage: varcall <-s|-v> <-f REF> [options] bam1 [bam2...]

Version: 0.95.794 (BETA)

Either outputs summry stats for the list of files, or performs variant calling

Options (later options override earlier):

-s Calculate statistics

-v|version Calculate variants bases on supplied parameters (see -S)

-f Reference fasta (required if using bams, ignored otherwise)

-m Min locii depth (1)

-a Min allele depth (2)

-p Min allele pct by quality (0)

-q Min qual (3)

-Q Min mapping quality (0)

-b Min pct balance (strand/total) (0)

-D FLOAT Max duplicate read fraction (depth/length per position) (1)

-d FLOAT Minimum diversity (CV from optimal depth) (0.25)

-G FLOAT Minimum agreement (Weighted CV of positional variation) (0.25)

-0 Zero out all filters, set e-value filter to 1, report everything

-B If running from a BAM, turn off BAQ correction (false)

-R Homopolymer repeat indel filtering (8)

-e FLOAT Alpha filter to use, requires -l or -S (.05)

-g FLOAT Global minimum error rate (default: assume phred is ok)

-l INT Number of locii in total pileup used for bonferroni (1 mil)

-x CHR:POS Output this pos only, then quit

-S FILE Read in statistics and params from a previous run with -s (do this!)

-A ANNOT Calculate in-target stats using the annotation file (requires -o)

-o PREFIX Output prefix (works with -s or -v)

-F files List of file types to output (var, varsum, eav, vcf)

Extended Options

--pcr-annot BED Only include reads adhering to the expected amplicons

--stranded TYPE Can be FR (the default), FF, FR. Used with pcr-annot

--diversity|d FLOAT Alias for -d

--agreement|G FLOAT Alias for -G

--no-indels Ignore all indels

Input files

Files must be sorted bam files with bai index files available. Alternatively,

a single pileup file can be supplied.

Output files

Varcalls go to stdout. Stats go to stdout, or stderr if varcalling too

If an output prefix is used, files are created as follows:

PREFIX.var Variant calls in tab delimited 'varcall' format

PREFIX.eav Variant calls in tab delimited 'ea-var' format

PREFIX.cse Variant calls in tab delimited 'varprowl' format

PREFIX.vcf Variant calls, in vcf format

PREFIX.varsum Summary of variant calls

PREFIX.tgt.var On-target version of .var

PREFIX.tgt.cse On-target version of .cse

PREFIX.tgt.varsum On-target version of .varsum

Stats Output:

Contains mean, median, quartile information for depth, base quality, read len,

mapping quality, indel levels. Also estimates parameters suitable for

variant calls, and can be passed directly to this program for variant calls

If an output prefix is used, files are created as follows:

PREFIX.stats Stats output

PREFIX.noise Non-reference, non-homozygous allele summary

PREFIX.xnoise Like noise, but with context-specific rates

Filtering Details:

>sam-stats -h

$ sam-stats -h

Usage: sam-stats [options] [file1] [file2...filen]

Version: 1.38.763

Produces lots of easily digested statistics for the files listed

Options (default in parens):

-D Keep track of multiple alignments

-O PREFIX Output prefix enabling extended output (see below)

-R FIL Coverage/RNA output (coverage, 3' bias, etc, implies -A)

-A Report all chr sigs, even if there are more than 1000

-b INT Number of reads to sample for per-base stats (1M)

-S INT Size of ascii-signature (30)

-x FIL File extension for handling multiple files (stats)

-M Only overwrite if newer (requires -x, or multiple files)

-B Input is bam, don't bother looking at magic

-z Don't fail when zero entries in sam

OUTPUT:

If one file is specified, then the output is to standard out. If

multiple files are specified, or if the -x option is supplied,

the output file is <filename>.<ext>. Default extension is 'stats'.

Complete Stats:

<STATS> : mean, max, stdev, median, Q1 (25 percentile), Q3

reads : # of entries in the sam file, might not be # reads

phred : phred scale used

bsize : # reads used for qual stats

mapped reads : number of aligned reads (unique probe id sequences)

mapped bases : total of the lengths of the aligned reads

forward : number of forward-aligned reads

reverse : number of reverse-aligned reads

snp rate : mismatched bases / total bases (snv rate)

ins rate : insert bases / total bases

del rate : deleted bases / total bases

pct mismatch : percent of reads that have mismatches

pct align : percent of reads that aligned

len <STATS> : read length stats, ignored if fixed-length

mapq <STATS> : stats for mapping qualities

insert <STATS> : stats for insert sizes

%<CHR> : percentage of mapped bases per chr, followed by a signature

Subsampled stats (1M reads max):

base qual <STATS> : stats for base qualities

%A,%T,%C,%G : base percentages

Meaning of the per-chromosome signature:

A ascii-histogram of mapped reads by chromosome position.

It is only output if the original SAM/BAM has a header. The values

are the log2 of the # of mapped reads at each position + ascii '0'.

Extended output mode produces a set of files:

.stats : primary output

.fastx : fastx-toolkit compatible output

.rcov : per-reference counts & coverage

.xdist : mismatch distribution

.ldist : length distribution (if applicable)

.mqdist : mapping quality distribution

> fastq-stats -h

$ fastq-stats -h

Usage: fastq-stats [options] <fastq-file>

Version: 1.01 $Id: fastq-stats.cpp 652 2013-09-17 17:40:32Z earonesty $

Produces lots of easily digested statistics for the files listed

Options

-c cyclemax: max cycles for which following quality stats are produced [35]

-w INT window: max window size for generating duplicate read statistics [2000000]

-d debug: prints out debug statements

-D don't do duplicate read statistics

-s INT number of top duplicate reads to display

-x FIL output fastx statistics (requires an output filename)

-b FIL output base breakdown by per phred quality at every cycle.

It sets cylemax to longest read length

-L FIL Output length counts

The following data are printed to stdout:

reads : #reads in the fastq file

len : read length. mean and stdev are provided for variable read lengths

phred : phred scale used

window-size : Number of reads used to generate duplicate read statistics

cycle-max : Number of bases to assess for duplicity

dups : Number of reads that are duplicates

%dup : Pct reads that are duplcate

unique-dup seq : Number sequences that are duplicated

min dup count : Smallest duplicate tally for any duplicate sequence

dup seq <rank> <count> <sequence>

: Lists top 10 most frequent duplicate reads along with count mean and stdev

qual : Base Quality min, max and mean

%A,%T,%C,%G : base percentages

total bases : total number of bases

ラン

fastq-mcf（リンク）：アダプター除去を行うツール。低クオリティ末端の除去、Nの除去などの機能も持つ。

fastq-mcf <options> adapters.fa input.fq

fastq-multx（リンク）： Demultiplexing。

１、シングル。例えば

mock_a ACCC

salt_a CGTA

mock_b GAGT

salty_b CGGT

というバーコード配列ファイルbarcodes.filを用意し、以下のように打つ。ここでは入力データはgz圧縮されている。末端のアダプター配列に従い分割される。

fastq-multx -B barcodes.fil seq.fastq.gz -o %.fq.gz

-B Use barcodes from BCFIL, no determination step, codes in <read1.fq>
-o Output files (one per input, required)

double indexingのdemultiplexingは公式の例（リンク）を参考にしてください。

fastq-join（リンク）：オーバーラップするfastqのマージ。

fastq-join read1.fq read2.fq -o read.%.fq

マージに成功した read.join.fq、失敗したread.un1.fq、read.un2.fqが出力される。

varcall（リンク）：pileupからのバリアントコール（他のツールよりシンプル）。

fastq-join read1.fq read2.fq -o read.%.fq

sam-stats（リンク）：bam/samのstatistics。

sam-stats input.bam

出力 (ONT long read使用)

version 1.38.763

mapped reads 1141

pct align 100.000000

mapped bases 4954696

pct forward 0.000

phred 33

forward 0

reverse 1141

len max 13407

len mean 4342.4154

len stdev 3041.4011

mapq mean 59.5425

mapq stdev 4.2102

mapq Q1 60.00

mapq median 60.00

mapq Q3 60.00

snp rate 0.064468

ins rate 0.036831

del rate 0.043816

pct mismatch 100.0000

base qual mean 7.9896

base qual stdev 3.9029

%A 29.1159

%C 23.2673

%G 26.6539

%T 20.9629

%segment1 68.497341 D@A@@@AA@@@?@@????=>===<=:;;00

%segment2 31.502659 @@?@@@@@??@?>@@@???>?>>>>>==<8

num ref seqs 2

num ref aligned 2

fastq-stats：fastqのstatistics。

fastq-stats input.fq

$ fastq-stats single.fq

len 250

len mean 250.0000

len stdev 0.0000

len min 250

phred 33

window-size 657350

cycle-max 35

dups 6

%dup 0.0017

unique-dup seq 6

min dup count 2

dup seq 1 1 CATGGTTAGCCAAATCCACATTGGCTTCTGTTTCG

dup seq 2 1 TGGTCAAAAAAAGGAAAAAAGGTTTATGGAGATAG

dup seq 3 1 AATTCACGATACAAAATTTGCAGGCGGCCCGCCAA

dup seq 4 1 TGGTGAACTAGTTCAATCTTTACAAAGTGAGATTA

dup seq 5 1 GAGCATTGCGGGCCGCCTTGACGGCGCTGGGGGAT

dup seq 6 1 TCTATTTAACCATCTGTATATCGAGGCTTTTCCTA

dup mean 2.0000

dup stddev 0.0000

qual min 2

qual max 38

qual mean 29.1634

qual stdev 13.5108

%A 25.9917

%C 23.9989

%G 24.0037

%T 26.0057

%N 0.0000

total bases 169337500

引用

Command-line tools for processing biological sequencing data

Erik Aronesty (2011). ea-utils

http://code.google.com/p/ea-utils

Comparison of Sequencing Utility Programs

Erik Aronesty (2013). TOBioiJ

DOI:10.2174/1875036201307010001

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.844.5897&rep=rep1&type=pdf

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

bam, fastqのユーティリティツール EA-Utils