2019-06-18

多機能なNGS分析ツール BBtools　其の３BBMap追加コマンド

BBMapの追加コマンドについて紹介します。

BBMap Guide

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/

callvariants.sh

Introducing CallVariants, a new variant caller in #BBMap! CallVariants is 81x faster than mpilup+bcftools, and very, very accurate.
— Brian Bushnell (@BBToolsBio) December 15, 2016

Do you like #GATK?

BBTools' CallVariants is 2000 times faster. Also, it does not place adjacent deletions and insertions, like GATK.

I'm not saying GATK is garbage, but I'm also not saying that. You can't run GATK without incurring adjacent insertions and deletions.
— Brian Bushnell (@BBToolsBio) June 13, 2019

bbcms.sh

New in #BBTools:https://t.co/H04qsEfnrO - error correction utility for metagenomes. Tadpole is more accurate, but BBCMS uses the same algorithm and never runs out of memory (since it uses a lossy data structure). Use Tadpole when you can, and BBCMS if it runs out of memory.
— Brian Bushnell (@BBToolsBio) June 13, 2019

callgenes.sh

New in #BBTools- https://t.co/VyzMcc0Wda, a new prokaryotic gene-caller. Did the world need a new prok gene caller? No! But, I needed a super-fast caller for internal use by sendsketch.

"https://t.co/D9wMtd802X in=assembly.fa protein"

You can even use it with raw reads!
— Brian Bushnell (@BBToolsBio) June 13, 2019

It's actually really good, though:https://t.co/VyzMcc0Wda in=assembly.fa out=genes.gff outa=proteins.faa

By default it uses models for hundreds of bacteria and archaea. You can increase the accuracy (if you know the clade) by making your own model with https://t.co/gnqK0US6UD.
— Brian Bushnell (@BBToolsBio) June 13, 2019

インストール

bbmapを導入する。

#bioconda (link)
conda install -c bioconda -y bbmap

バリアントコール

> callvariants.sh

$ callvariants.sh

Written by Brian Bushnell

Last modified October 12, 2017

Description: Calls variants from sam or bam input.

In default mode, all input files are combined and treated as a single sample.

In multisample mode, each file is treated as an individual sample,

and gets its own column in the VCF file. Unless overridden, input file

names are used as sample names.

Please read bbmap/docs/guides/CallVariantsGuide.txt for more information,

or bbmap/pipelines/variantPipeline.sh for a usage example.

Usage: callvariants.sh in=<file,file,...> ref=<file> vcf=<file>

Input may be sorted or unsorted.

The reference should be fasta.

I/O parameters:

in=<file> Input; may be one file or multiple comma-delimited files.

list=<file> Optional text file containing one input file per line.

Use list or in, but not both.

out=<file> Output variant list in native format. If the name ends

with .vcf then it will be vcf format.

vcf=<file> Output variant list in vcf format.

ref=<file> Reference fasta. Required to display ref alleles.

Variant calling wil be more accurate with the reference.

shist=<file> (scorehist) Output for variant score histogram.

zhist=<file> (zygosityhist) Output for zygosity histogram.

overwrite=f (ow) Set to false to force the program to abort rather than

overwrite an existing file.

extended=t Print additional variant statistics columns.

sample= Optional comma-delimited list of sample names.

multisample=f (multi) Set to true if there are multiple sam/bam files,

and each should be tracked as an individual sample.

vcf0= Optional comma-delimited list of per-sample outputs.

Only used in multisample mode.

bgzip=f Use bgzip for gzip compression.

Processing Parameters:

prefilter=f Use a Bloom filter to exclude variants seen fewer than

minreads times. Doubles the runtime but greatly reduces

memory usage. The results are identical.

samstreamer=t (ss) Load reads multithreaded to increase speed.

coverage=t (cc) Calculate coverage, to better call variants.

ploidy=1 Set the organism's ploidy.

rarity=1.0 Penalize the quality of variants with allele fraction

lower than this. For example, if you are interested in

4% frequency variants, you could set both rarity and

minallelefraction to 0.04. This is affected by ploidy -

a variant with frequency indicating at least one copy

is never penalized.

covpenalty=0.8 (lowcoveragepenalty) A lower penalty will increase the

scores of low-coverage variants, and is useful for

low-coverage datasets.

useidentity=t Include average read identity in score calculation.

usepairing=t Include pairing rate in score calculation.

usebias=t Include strand bias in score calculation.

useedist=t Include read-end distance in score calculation.

homopolymer=t Penalize scores of substitutions matching adjacent bases.

nscan=t Consider the distance of a variant from contig ends when

calculating strand bias.

32bit=f Set to true to allow coverage tracking over depth 65535.

strandedcov=t Tracks per-strand ref coverage to print the DP4 field.

Requires more memory when enabled.

callsub=t Call substitutions.

calldel=t Call deletions.

callins=t Call insertions.

nopassdot=f Use . as genotype for variations failing the filter.

Trimming parameters:

border=5 Trim at least this many bases on both ends of reads.

qtrim=r Quality-trim reads on this end

r: right, l: left, rl: both, f: don't quality-trim.

trimq=10 Quality-trim bases below this score.

Realignment parameters:

realign=f Realign all reads with more than a couple mismatches.

Decreases speed. Recommended for aligners other than BBMap.

unclip=f Convert clip symbols from exceeding the ends of the

realignment zone into matches and substitutitions.

repadding=70 Pad alignment by this much on each end. Typically,

longer is more accurate for long indels, but greatly

reduces speed.

rerows=602 Use this many rows maximum for realignment. Reads longer

than this cannot be realigned.

recols=2000 Reads may not be aligned to reference seqments longer

than this. Needs to be at least read length plus

max deletion length plus twice padding.

msa= Select the aligner. Options:

MultiStateAligner11ts: Default.

MultiStateAligner9PacBio: Use for PacBio reads, or for

Illumina reads mapped to PacBio/Nanopore reads.

Sam-filtering parameters:

minpos= Ignore alignments not overlapping this range.

maxpos= Ignore alignments not overlapping this range.

minreadmapq=4 Ignore alignments with lower mapq.

contigs= Comma-delimited list of contig names to include. These

should have no spaces, or underscores instead of spaces.

secondary=f Include secondary alignments.

supplimentary=f Include supplimentary alignments.

invert=f Invert sam filters.

Variant-Calling Cutoffs:

minreads=2 Ignore variants seen in fewer reads.

minqualitymax=15 Ignore variants with lower max base quality.

minedistmax=20 Ignore variants with lower max distance from read ends.

minmapqmax=0 Ignore variants with lower max mapq.

minidmax=0 Ignore variants with lower max read identity.

minpairingrate=0.1 Ignore variants with lower pairing rate.

minstrandratio=0.1 Ignore variants with lower plus/minus strand ratio.

minquality=12.0 Ignore variants with lower average base quality.

minedist=10.0 Ignore variants with lower average distance from ends.

minavgmapq=0.0 Ignore variants with lower average mapq.

minallelefraction=0.1 Ignore variants with lower allele fraction. This

should be adjusted for high ploidies.

minid=0 Ignore variants with lower average read identity.

minscore=20.0 Ignore variants with lower Phred-scaled score.

clearfilters Reset all filters to zero. Filter flags placed after

the clearfilters flag will still be applied.

There are additionally max filters for score, quality, mapq, allelefraction,

and identity.

Java Parameters:

-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

ORF予測

>callgenes.sh

$ callgenes.sh

Written by Brian Bushnell

Last modified December 19, 2018

Description: Finds orfs and calls genes in unspliced prokaryotes.

This includes bacteria, archaea, viruses, and mitochondria.

Can also predict 16S, 23S, 5S, and tRNAs.

Usage: callgenes.sh in=contigs.fa out=calls.gff outa=aminos.faa

File parameters:

in=<file> A fasta file.

out=<file> Output gff file.

outa=<file> Amino acid output.

model=<file> A pgm file or comma-delimited list.

If unspecified a default model will be used.

compareto= Optional reference gff file to compare with the gene calls.

Other parameters:

minlen=60 Don't call genes shorter than this.

trd=f (trimreaddescription) Set to true to trim read headers after

the first whitespace. Necessary for IGV.

merge=f For paired reads, merge before calling.

detranslate=f Output canonical nucleotide sequences instead of amino acids.

recode=f Re-encode nucleotide sequences over called genes, leaving

non-coding regions unchanged.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

> analyzegenes.sh

$ analyzegenes.sh

Written by Brian Bushnell

Last modified September 27, 2018

Description: Generates a prokaryotic gene model (.pkm) for gene calling.

Input is fasta and gff files.

The .pkm file may be used by CallGenes.

Usage: analyzegenes.sh in=x.fa gff=x.gff out=x.pgm

File parameters:

in=<file> A fasta file or comma-delimited list of fasta files.

gff=<file> A gff file or comma-delimited list. This is optional;

if present, it must match the number of fasta files.

If absent, a fasta file 'foo.fasta' will imply the

presence of 'foo.gff'.

out=<file> Output pgm file.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

メタゲノムエラーコレクション

> bbcms.sh

r$ bbcms.sh

Written by Brian Bushnell

Last modified July 24, 2018

Description: Error corrects reads and/or filters by depth, storing

kmer counts in a count-min sketch (a Bloom filter variant).

This uses a fixed amount of memory. The error-correction algorithm is taken

from Tadpole; with plenty of memory, the behavior is almost identical to

Tadpole. As the number of unique kmers in a dataset increases, the accuracy

decreases such that it will make fewer corrections. It is still capable

of making useful corrections far past the point where Tadpole would crash

by running out of memory, even with the prefilter flag. But if there is

sufficient memory to use Tadpole, then Tadpole is more desirable.

Because accuracy declines with an increasing number of unique kmers, it can

be useful with very large datasets to run this in 2 passes, with the first

pass for filtering only using a 2-bit filter with the flags tossjunk=t and

ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a

4-bit filter for the actual error correction.

Usage: bbcms.sh in=<input file> out=<output> outb=<reads failing filters>

Example of use in error correction:

bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge

Example of use in depth filtering:

bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4

Error correction and depth filtering can be done simultaneously.

File parameters:

in=<file> Primary input, or read 1 input.

in2=<file> Read 2 input if reads are in two files.

out=<file> Primary read output.

out2=<file> Read 2 output if reads are in two files.

outb=<file> (outbad/outlow) Output for reads failing mincount.

outb2=<file> (outbad2/outlow2) Read 2 output if reads are in two files.

extra=<file> Additional comma-delimited files for generating kmer counts.

ref=<file> If ref is set, then only files in the ref list will be used

for kmer counts, and the input files will NOT be used for

counts; they will just be filtered or corrected.

overwrite=t (ow) Set to false to force the program to abort rather than

overwrite an existing file.

Hashing parameters:

k=31 Kmer length, currently 1-31.

hashes=3 Number of hashes per kmer. Higher generally reduces

false positives at the expense of speed; rapidly

diminishing returns above 4.

minprob=0.5 Ignore kmers with probability of being correct below this.

memmult=1.0 Fraction of free memory to use for Bloom filter. 1.0 should

generally work; if the program crashes with an out of memory

error, set this lower. You may be able to increase accuracy

by setting it slightly higher.

cells= Option to set the number of cells manually. By default this

will be autoset to use all available memory. The only reason

to set this is to ensure deterministic output.

seed=0 This will change the hash function used.

Depth filtering parameters:

mincount=0 If positive, reads with kmer counts below mincount will

be discarded (sent to outb).

hcf=1.0 (highcountfraction) Fraction of kmers that must be at least

mincount to pass.

requireboth=t Require both reads in a pair to pass in order to go to out.

When true, if either read has a count below mincount, both

reads in the pair will go to outb. When false, reads will

only go to outb if both fail.

tossjunk=f Remove reads or pairs with outermost kmer depth below 2.

(Suggested params for huge metagenomes: mincount=2 hcf=0.4 tossjunk=t)

Error correction parameters:

ecc=t Perform error correction.

bits= Bits used to store kmer counts; max count is 2^bits-1.

Supports 2, 4, 8, 16, or 32. 16 is best for high-depth data;

2 or 4 are for huge, low-depth metagenomes that saturate the

bloom filter otherwise. Generally 4 bits is recommended for

error-correction and 2 bits is recommended for filtering only.

ecco=f Error-correct paired reads by overlap prior to kmer-counting.

merge=t Merge paired reads by overlap prior to kmer-counting, and

again prior to correction. Output will still be unmerged.

smooth=3 Remove spikes from kmer counts due to hash collisions.

The number is the max width of peaks to be smoothed; range is

0-3 (3 is most aggressive; 0 disables smoothing).

This also affects tossjunk.

Java Parameters:

-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

実行方法

1、callvariants.sh

bam/samからバリアントコールを行う。

1-1、リファレンスにペアエンドfastqをmappingしてbamかsamを得る。

bbmap.sh ref=ref.fa in1=pair_1.fq in2=pair_2.fq out=output.sam nodisk
samtools sort -@ 12 -O BAM output.sam > sort.bam && samtools index -@ 12 sort.bam

indexディレクトリが作られ、リードがリファレンスにアライメントされる。defaultだとマシンの全コアが計算に使用される（明示するなら"t="）。それからsortコマンドでcoordinate sortedされたbamを得る。

1-2、入力のbam/samとリファレンスを指定してcallvariantsを実行する。ここでは倍数性(ploidy)2で実行。

callvariants.sh ploidy=2 in=sort.bam ref=ref.fa vcf=output.vcf

callsub=t Call substitutions
calldel=t Call deletions
callins=t Call insertions
ploidy=1 Set the organism's ploidy
in=<file> Input; may be one file or multiple comma-delimited files
vcf=<file> Output variant list in vcf format

バクテリアゲノムだと数秒でランは終わる。 vcf=を指定時の出力はVCF v4.2に従う。

2、callgenes.sh

contigやゲノム配列からprokaryotesの遺伝子配列を予測する。ここでは予測した遺伝子のアミノ酸配列も出力している。

callgenes.sh in=ref.fa out=output outa=amnoacids.fa

NGSのリードから実行することもできる。

callgenes.sh in=pair_1.fq in2=pair_2.fq out=output outa=amnoacids.fa

リードそれぞれから予測するため、出力は相応に大きくなることに注意。

3、bbcms.sh

メタゲノムシーケンシングリードのエラーコレクション。

入力のペアエンドfastqと出力のペアエンドfastqをそれぞれ指定する。

bbcms.sh in=pair_1.fq in2=pair_2.fq \
 out=error_corrected_1.fq.gz out2=error_corrected_2.fq.gz \
 outb=failing_mincount_1.fq.gz outb2=failing_mincount_2.fq.gz