macでインフォマティクス

macでインフォマティクス

NGS関連のインフォマティクス情報についてまとめています。

bam, fastqのユーティリティツール EA-Utils

2013年のペーパーより

 ハイスループットシーケンシング(HTS)は、シーケンシングデータの急速な成長率をもたらした。 著者らのラボでは、毎日テラバイトのデータを生成している。 これは通常、バリアントコーラー、定量およびアセンブリなどの一般的なタスクに使用する前に、さまざまな方法でデータ出力を「クリーン」に処理する必要がある。
 HTSに関連する2つの一般的なタスクは、アダプタートリミングとペアエンドのマージである。 私たち( 著者ら)はこれらの共通の課題に対処するためにExpression Analysis、Inc.で2つのツールを開発した。 これらのプログラムの名前はfastq-mcfとfastq-joinである。 これらのツールの性能を、リソース効率と有効性の両面から、同様のオープンソースユーティリティと比較した。

 

wiki

https://help.rc.ufl.edu/doc/EA-Utils

公式HP

ea-utils by ExpressionAnalysis

インストール

Github

brew tap brewsci/science #してない人だけ
brew install ea-utils

 fastq-mcf

$ fastq-mcf

Usage: fastq-mcf [options] <adapters.fa> <reads.fq> [mates1.fq ...] 

Version: 1.04.803

 

Detects levels of adapter presence, computes likelihoods and

locations (start, end) of the adapters.   Removes the adapter

sequences from the fastq file(s).

 

Stats go to stderr, unless -o is specified.

 

Specify -0 to turn off all default settings

 

If you specify multiple 'paired-end' inputs, then a -o option is

required for each.  IE: -o read1.clip.q -o read2.clip.fq

 

Options:

    -h       This help

    -o FIL   Output file (stats to stdout)

    -O N     Only output the first N records (all)

    -s N.N   Log scale for adapter minimum-length-match (2.2)

    -t N     % occurance threshold before adapter clipping (0.25)

    -m N     Minimum clip length, overrides scaled auto (1)

    -p N     Maximum adapter difference percentage (10)

    -l N     Minimum remaining sequence length (19)

    -L N     Maximum remaining sequence length (none)

    -D N     Remove duplicate reads : Read_1 has an identical N bases (0)

    -k N     sKew percentage-less-than causing cycle removal (2)

    -x N     'N' (Bad read) percentage causing cycle removal (20)

    -q N     quality threshold causing base removal (10)

    -w N     window-size for quality trimming (1)

    -H       remove >95% homopolymer reads (no)

    -X       remove low complexity reads (no)

    -0       Set all default parameters to zero/do nothing

    -U|u     Force disable/enable Illumina PF filtering (auto)

    -P N     Phred-scale (auto)

    -R       Don't remove N's from the fronts/ends of reads

    -n       Don't clip, just output what would be done

    -K       Only keep clipped reads

    -S       Save all discarded reads to '.skip' files

    -C N     Number of reads to use for subsampling (300k)

    -d       Output lots of random debugging stuff

 

Quality adjustment options:

    --cycle-adjust      CYC,AMT   Adjust cycle CYC (negative = offset from end) by amount AMT

    --phred-adjust      SCORE,AMT Adjust score SCORE by amount AMT

    --phred-adjust-max  SCORE     Adjust scores > SCORE to SCOTE

 

Filtering options*:

    --[mate-]qual-mean  NUM       Minimum mean quality score

    --[mate-]qual-gt    NUM,THR   At least NUM quals > THR

    --[mate-]max-ns     NUM       Maxmium N-calls in a read (can be a %)

    --[mate-]min-len    NUM       Minimum remaining length (same as -l)

    --homopolymer-pct   PCT       Homopolymer filter percent (95)

    --lowcomplex-pct    PCT       Complexity filter percent (95)

    --keep-clipped                Only keep clipped (same as -K)

    --max-output-reads   N        Only output first N records (same as -O)

 

If mate- prefix is used, then applies to second non-barcode read only

 

Adapter files are 'fasta' formatted:

 

Specify n/a to turn off adapter clipping, and just use filters

 

Increasing the scale makes recognition-lengths longer, a scale

of 100 will force full-length recognition of adapters.

 

Adapter sequences with _5p in their label will match 'end's,

and sequences with _3p in their label will match 'start's,

otherwise the 'end' is auto-determined.

 

Skew is when one cycle is poor, 'skewed' toward a particular base.

If any nucleotide is less than the skew percentage, then the

whole cycle is removed.  Disable for methyl-seq, etc.

 

Set the skew (-k) or N-pct (-x) to 0 to turn it off (should be done

for miRNA, amplicon and other low-complexity situations!)

 

Duplicate read filtering is appropriate for assembly tasks, and

never when read length < expected coverage.  -D 50 will use

4.5GB RAM on 100m DNA reads - be careful. Great for RNA assembly.

 

*Quality filters are evaluated after clipping/trimming

 

Homopolymer filtering is a subset of low-complexity, but will not

be separately tracked unless both are turned on.

fastq-multx

$ fastq-multx

Usage: fastq-multx [-g|-l|-B] <barcodes.fil> <read1.fq> -o r1.%.fq [mate.fq -o r2.%.fq] ...

Version: 1.02.772

 

Output files must contain a '%' sign which is replaced with the barcode id in the barcodes file.

Output file can be n/a to discard the corresponding data (use this for the barcode read)

 

Barcodes file (-B) looks like this:

 

<id1> <sequence1>

<id2> <sequence2> ...

 

Default is to guess the -bol or -eol based on clear stats.

 

If -g is used, then it's parameter is an index lane, and frequently occuring sequences are used.

 

If -l is used then all barcodes in the file are tried, and the *group* with the *most* matches is chosen.

 

Grouped barcodes file (-l or -L) looks like this:

 

<id1> <sequence1> <group1>

<id1> <sequence1> <group1>

<id2> <sequence2> <group2>...

 

Mated reads, if supplied, are kept in-sync

 

Options:

 

-o FIL1     Output files (one per input, required)

-g SEQFIL   Determine barcodes from the indexed read SEQFIL

-l BCFIL    Determine barcodes from any read, using BCFIL as a master list

-L BCFIL    Determine barcodes from <read1.fq>, using BCFIL as a master list

-B BCFIL    Use barcodes from BCFIL, no determination step, codes in <read1.fq>

-H          Use barcodes from illumina's header, instead of a read

-b          Force beginning of line (5') for barcode matching

-e          Force end of line (3') for batcode matching

-t NUM      Divide threshold for auto-determine by factor NUM (1), > 1 = more sensitive

-G NAME     Use group(s) matching NAME only

-x          Don't trim barcodes off before writing out destination

-n          Don't execute, just print likely barcode list

-v C        Verify that mated id's match up to character C (Use ' ' for illumina)

-m N        Allow up to N mismatches, as long as they are unique (1)

-d N        Require a minimum distance of N between the best and next best (2)

-q N        Require a minimum phred quality of N to accept a barcode base (0)

fastq-join

$ fastq-join

Usage: fastq-join [options] <read1.fq> <read2.fq> [mate.fq] -o <read.%.fq>

 

Joins two paired-end reads on the overlapping ends.

 

Options:

 

-o FIL     See 'Output' below

-v C       Verifies that the 2 files probe id's match up to char C

            use ' ' (space) for Illumina reads

-p N       N-percent maximum difference (8)

-m N       N-minimum overlap (6)

-r FIL     Verbose stitch length report

-R         No reverse complement

-x         Allow insert < read length

 

Output: 

 

  You can supply 3 -o arguments, for un1, un2, join files, or one 

argument as a file name template.  The suffix 'un1, un2, or join' is 

appended to the file, or they replace a %-character if present.

 

  If a 'mate' input file is present (barcode read), then the files

'un3' and 'join2' are also created.

 

varcall

$ varcall

Specify -s for stats only, or -v to do variant calling

 

Usage: varcall <-s|-v> <-f REF> [options] bam1 [bam2...]

Version: 0.95.794 (BETA)

 

Either outputs summry stats for the list of files, or performs variant calling

 

Options (later options override earlier):

 

-s          Calculate statistics

-v|version  Calculate variants bases on supplied parameters (see -S)

-f          Reference fasta (required if using bams, ignored otherwise)

-m          Min locii depth (1)

-a          Min allele depth (2)

-p          Min allele pct by quality (0)

-q          Min qual (3)

-Q          Min mapping quality (0)

-b          Min pct balance (strand/total) (0)

-D FLOAT    Max duplicate read fraction (depth/length per position) (1)

-d FLOAT    Minimum diversity (CV from optimal depth) (0.25)

-G FLOAT    Minimum agreement (Weighted CV of positional variation) (0.25)

-0          Zero out all filters, set e-value filter to 1, report everything

-B          If running from a BAM, turn off BAQ correction (false)

-R          Homopolymer repeat indel filtering (8)

-e FLOAT    Alpha filter to use, requires -l or -S (.05)

-g FLOAT    Global minimum error rate (default: assume phred is ok)

-l INT      Number of locii in total pileup used for bonferroni (1 mil)

-x CHR:POS  Output this pos only, then quit

-S FILE     Read in statistics and params from a previous run with -s (do this!)

-A ANNOT    Calculate in-target stats using the annotation file (requires -o)

-o PREFIX   Output prefix (works with -s or -v)

-F files    List of file types to output (var, varsum, eav, vcf)

 

Extended Options

 

--pcr-annot   BED      Only include reads adhering to the expected amplicons

--stranded    TYPE     Can be FR (the default), FF, FR.  Used with pcr-annot

--diversity|d FLOAT    Alias for -d

--agreement|G FLOAT    Alias for -G

--no-indels            Ignore all indels

 

Input files

 

Files must be sorted bam files with bai index files available.  Alternatively,

a single pileup file can be supplied.

 

Output files

 

Varcalls go to stdout.  Stats go to stdout, or stderr if varcalling too

 

If an output prefix is used, files are created as follows:

   PREFIX.var         Variant calls in tab delimited 'varcall' format

   PREFIX.eav         Variant calls in tab delimited 'ea-var' format

   PREFIX.cse         Variant calls in tab delimited 'varprowl' format

   PREFIX.vcf         Variant calls, in vcf format

   PREFIX.varsum      Summary of variant calls

   PREFIX.tgt.var     On-target version of .var

   PREFIX.tgt.cse     On-target version of .cse

   PREFIX.tgt.varsum  On-target version of .varsum

 

Stats Output:

 

Contains mean, median, quartile information for depth, base quality, read len,

mapping quality, indel levels. Also estimates parameters suitable for

variant calls, and can be passed directly to this program for variant calls

 

If an output prefix is used, files are created as follows:

 

   PREFIX.stats       Stats output

   PREFIX.noise       Non-reference, non-homozygous allele summary

   PREFIX.xnoise      Like noise, but with context-specific rates

 

Filtering Details:

>sam-stats -h

$ sam-stats -h

Usage: sam-stats [options] [file1] [file2...filen]

Version: 1.38.763

 

Produces lots of easily digested statistics for the files listed

 

Options (default in parens):

 

-D             Keep track of multiple alignments

-O PREFIX      Output prefix enabling extended output (see below)

-R FIL         Coverage/RNA output (coverage, 3' bias, etc, implies -A)

-A             Report all chr sigs, even if there are more than 1000

-b INT         Number of reads to sample for per-base stats (1M)

-S INT         Size of ascii-signature (30)

-x FIL         File extension for handling multiple files (stats)

-M             Only overwrite if newer (requires -x, or multiple files)

-B             Input is bam, don't bother looking at magic

-z             Don't fail when zero entries in sam

 

OUTPUT:

 

If one file is specified, then the output is to standard out.  If

multiple files are specified, or if the -x option is supplied,

the output file is <filename>.<ext>.  Default extension is 'stats'.

 

Complete Stats:

 

  <STATS>           : mean, max, stdev, median, Q1 (25 percentile), Q3

  reads             : # of entries in the sam file, might not be # reads

  phred             : phred scale used

  bsize             : # reads used for qual stats

  mapped reads      : number of aligned reads (unique probe id sequences)

  mapped bases      : total of the lengths of the aligned reads

  forward           : number of forward-aligned reads

  reverse           : number of reverse-aligned reads

  snp rate          : mismatched bases / total bases (snv rate)

  ins rate          : insert bases / total bases

  del rate          : deleted bases / total bases

  pct mismatch      : percent of reads that have mismatches

  pct align         : percent of reads that aligned

  len <STATS>       : read length stats, ignored if fixed-length

  mapq <STATS>      : stats for mapping qualities

  insert <STATS>    : stats for insert sizes

  %<CHR>           : percentage of mapped bases per chr, followed by a signature

 

Subsampled stats (1M reads max):

  base qual <STATS> : stats for base qualities

  %A,%T,%C,%G       : base percentages

 

Meaning of the per-chromosome signature:

  A ascii-histogram of mapped reads by chromosome position.

  It is only output if the original SAM/BAM has a header. The values

  are the log2 of the # of mapped reads at each position + ascii '0'.

 

Extended output mode produces a set of files:

  .stats           : primary output

  .fastx           : fastx-toolkit compatible output

  .rcov            : per-reference counts & coverage

  .xdist           : mismatch distribution

  .ldist           : length distribution (if applicable)

  .mqdist          : mapping quality distribution

 

fastq-stats -h

$ fastq-stats -h

 

Usage: fastq-stats [options] <fastq-file>

 

Version: 1.01 $Id: fastq-stats.cpp 652 2013-09-17 17:40:32Z earonesty $

 

Produces lots of easily digested statistics for the files listed

 

Options

 

-c     cyclemax: max cycles for which following quality stats are produced [35]

-w INT window: max window size for generating duplicate read statistics [2000000]

-d     debug: prints out debug statements

-D     don't do duplicate read statistics

-s INT number of top duplicate reads to display

-x FIL output fastx statistics (requires an output filename)

-b FIL output base breakdown by per phred quality at every cycle.

       It sets cylemax to longest read length

-L FIL Output length counts 

 

 

The following data are printed to stdout:

 

  reads : #reads in the fastq file

  len                 : read length. mean and stdev are provided for variable read lengths

  phred : phred scale used

  window-size : Number of reads used to generate duplicate read statistics

  cycle-max : Number of bases to assess for duplicity

  dups : Number of reads that are duplicates

  %dup : Pct reads that are duplcate

  unique-dup seq : Number sequences that are duplicated

  min dup count : Smallest duplicate tally for any duplicate sequence

  dup seq <rank> <count> <sequence> 

  : Lists top 10 most frequent duplicate reads along with count mean and stdev

  qual : Base Quality min, max and mean

  %A,%T,%C,%G : base percentages

  total bases : total number of bases

 

 

ラン 

fastq-mcfリンク): アダプター除去を行うツール。低クオリティ末端の除去、Nの除去などの機能も持つ。

fastq-mcf <options> adapters.fa input.fq

 

fastq-multxリンク): Demultiplexing。

1、シングル。例えば

mock_a ACCC

salt_a CGTA

mock_b GAGT

salty_b CGGT

というバーコード配列ファイルbarcodes.filを用意し、以下のように打つ。ここでは入力データはgz圧縮されている。末端のアダプター配列に従い分割される。

fastq-multx -B barcodes.fil seq.fastq.gz -o %.fq.gz
  • -B    Use barcodes from BCFIL, no determination step, codes in <read1.fq>
  • -o    Output files (one per input, required)

double indexingのdemultiplexingは公式の例(リンク)を参考にしてください。

 

fastq-joinリンク):オーバーラップするfastqのマージ。

fastq-join read1.fq read2.fq -o read.%.fq 

マージに成功した read.join.fq、失敗したread.un1.fq、read.un2.fqが出力される。

 

varcallリンク):pileupからのバリアントコール(他のツールよりシンプル)。

fastq-join read1.fq read2.fq -o read.%.fq 

 

sam-statsリンク):bam/samのstatistics。

sam-stats input.bam

 

fastq-stats:fastqのstatistics。

fastq-stats input.fq

$ fastq-stats single.fq 

reads 657350

len 250

len mean 250.0000

len stdev 0.0000

len min 250

phred 33

window-size 657350

cycle-max 35

dups 6

%dup 0.0017

unique-dup seq 6

min dup count 2

dup seq 1 1 CATGGTTAGCCAAATCCACATTGGCTTCTGTTTCG

dup seq 2 1 TGGTCAAAAAAAGGAAAAAAGGTTTATGGAGATAG

dup seq 3 1 AATTCACGATACAAAATTTGCAGGCGGCCCGCCAA

dup seq 4 1 TGGTGAACTAGTTCAATCTTTACAAAGTGAGATTA

dup seq 5 1 GAGCATTGCGGGCCGCCTTGACGGCGCTGGGGGAT

dup seq 6 1 TCTATTTAACCATCTGTATATCGAGGCTTTTCCTA

dup mean 2.0000

dup stddev 0.0000

qual min 2

qual max 38

qual mean 29.1634

qual stdev 13.5108

%A 25.9917

%C 23.9989

%G 24.0037

%T 26.0057

%N 0.0000

total bases 169337500

 

引用

Command-line tools for processing biological sequencing data

Erik Aronesty (2011). ea-utils

http://code.google.com/p/ea-utils

 

Comparison of Sequencing Utility Programs

Erik Aronesty (2013). TOBioiJ

DOI:10.2174/1875036201307010001

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.844.5897&rep=rep1&type=pdf