BBMapの追加コマンドについて紹介します。
BBMap Guide
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/
callvariants.sh
Introducing CallVariants, a new variant caller in #BBMap! CallVariants is 81x faster than mpilup+bcftools, and very, very accurate.
— Brian Bushnell (@BBToolsBio) December 15, 2016
Do you like #GATK?
— Brian Bushnell (@BBToolsBio) June 13, 2019
BBTools' CallVariants is 2000 times faster. Also, it does not place adjacent deletions and insertions, like GATK.
I'm not saying GATK is garbage, but I'm also not saying that. You can't run GATK without incurring adjacent insertions and deletions.
bbcms.sh
New in #BBTools:https://t.co/H04qsEfnrO - error correction utility for metagenomes. Tadpole is more accurate, but BBCMS uses the same algorithm and never runs out of memory (since it uses a lossy data structure). Use Tadpole when you can, and BBCMS if it runs out of memory.
— Brian Bushnell (@BBToolsBio) June 13, 2019
callgenes.sh
New in #BBTools- https://t.co/VyzMcc0Wda, a new prokaryotic gene-caller. Did the world need a new prok gene caller? No! But, I needed a super-fast caller for internal use by sendsketch.
— Brian Bushnell (@BBToolsBio) June 13, 2019
"https://t.co/D9wMtd802X in=assembly.fa protein"
You can even use it with raw reads!
It's actually really good, though:https://t.co/VyzMcc0Wda in=assembly.fa out=genes.gff outa=proteins.faa
— Brian Bushnell (@BBToolsBio) June 13, 2019
By default it uses models for hundreds of bacteria and archaea. You can increase the accuracy (if you know the clade) by making your own model with https://t.co/gnqK0US6UD.
インストール
bbmapを導入する。
#bioconda (link)
conda install -c bioconda -y bbmap
バリアントコール
> callvariants.sh
$ callvariants.sh
Written by Brian Bushnell
Last modified October 12, 2017
Description: Calls variants from sam or bam input.
In default mode, all input files are combined and treated as a single sample.
In multisample mode, each file is treated as an individual sample,
and gets its own column in the VCF file. Unless overridden, input file
names are used as sample names.
Please read bbmap/docs/guides/CallVariantsGuide.txt for more information,
or bbmap/pipelines/variantPipeline.sh for a usage example.
Usage: callvariants.sh in=<file,file,...> ref=<file> vcf=<file>
Input may be sorted or unsorted.
The reference should be fasta.
I/O parameters:
in=<file> Input; may be one file or multiple comma-delimited files.
list=<file> Optional text file containing one input file per line.
Use list or in, but not both.
out=<file> Output variant list in native format. If the name ends
with .vcf then it will be vcf format.
vcf=<file> Output variant list in vcf format.
ref=<file> Reference fasta. Required to display ref alleles.
Variant calling wil be more accurate with the reference.
shist=<file> (scorehist) Output for variant score histogram.
zhist=<file> (zygosityhist) Output for zygosity histogram.
overwrite=f (ow) Set to false to force the program to abort rather than
overwrite an existing file.
extended=t Print additional variant statistics columns.
sample= Optional comma-delimited list of sample names.
multisample=f (multi) Set to true if there are multiple sam/bam files,
and each should be tracked as an individual sample.
vcf0= Optional comma-delimited list of per-sample outputs.
Only used in multisample mode.
bgzip=f Use bgzip for gzip compression.
Processing Parameters:
prefilter=f Use a Bloom filter to exclude variants seen fewer than
minreads times. Doubles the runtime but greatly reduces
memory usage. The results are identical.
samstreamer=t (ss) Load reads multithreaded to increase speed.
coverage=t (cc) Calculate coverage, to better call variants.
ploidy=1 Set the organism's ploidy.
rarity=1.0 Penalize the quality of variants with allele fraction
lower than this. For example, if you are interested in
4% frequency variants, you could set both rarity and
minallelefraction to 0.04. This is affected by ploidy -
a variant with frequency indicating at least one copy
is never penalized.
covpenalty=0.8 (lowcoveragepenalty) A lower penalty will increase the
scores of low-coverage variants, and is useful for
low-coverage datasets.
useidentity=t Include average read identity in score calculation.
usepairing=t Include pairing rate in score calculation.
usebias=t Include strand bias in score calculation.
useedist=t Include read-end distance in score calculation.
homopolymer=t Penalize scores of substitutions matching adjacent bases.
nscan=t Consider the distance of a variant from contig ends when
calculating strand bias.
32bit=f Set to true to allow coverage tracking over depth 65535.
strandedcov=t Tracks per-strand ref coverage to print the DP4 field.
Requires more memory when enabled.
callsub=t Call substitutions.
calldel=t Call deletions.
callins=t Call insertions.
nopassdot=f Use . as genotype for variations failing the filter.
Trimming parameters:
border=5 Trim at least this many bases on both ends of reads.
qtrim=r Quality-trim reads on this end
r: right, l: left, rl: both, f: don't quality-trim.
trimq=10 Quality-trim bases below this score.
Realignment parameters:
realign=f Realign all reads with more than a couple mismatches.
Decreases speed. Recommended for aligners other than BBMap.
unclip=f Convert clip symbols from exceeding the ends of the
realignment zone into matches and substitutitions.
repadding=70 Pad alignment by this much on each end. Typically,
longer is more accurate for long indels, but greatly
reduces speed.
rerows=602 Use this many rows maximum for realignment. Reads longer
than this cannot be realigned.
recols=2000 Reads may not be aligned to reference seqments longer
than this. Needs to be at least read length plus
max deletion length plus twice padding.
msa= Select the aligner. Options:
MultiStateAligner11ts: Default.
MultiStateAligner9PacBio: Use for PacBio reads, or for
Illumina reads mapped to PacBio/Nanopore reads.
Sam-filtering parameters:
minpos= Ignore alignments not overlapping this range.
maxpos= Ignore alignments not overlapping this range.
minreadmapq=4 Ignore alignments with lower mapq.
contigs= Comma-delimited list of contig names to include. These
should have no spaces, or underscores instead of spaces.
secondary=f Include secondary alignments.
supplimentary=f Include supplimentary alignments.
invert=f Invert sam filters.
Variant-Calling Cutoffs:
minreads=2 Ignore variants seen in fewer reads.
minqualitymax=15 Ignore variants with lower max base quality.
minedistmax=20 Ignore variants with lower max distance from read ends.
minmapqmax=0 Ignore variants with lower max mapq.
minidmax=0 Ignore variants with lower max read identity.
minpairingrate=0.1 Ignore variants with lower pairing rate.
minstrandratio=0.1 Ignore variants with lower plus/minus strand ratio.
minquality=12.0 Ignore variants with lower average base quality.
minedist=10.0 Ignore variants with lower average distance from ends.
minavgmapq=0.0 Ignore variants with lower average mapq.
minallelefraction=0.1 Ignore variants with lower allele fraction. This
should be adjusted for high ploidies.
minid=0 Ignore variants with lower average read identity.
minscore=20.0 Ignore variants with lower Phred-scaled score.
clearfilters Reset all filters to zero. Filter flags placed after
the clearfilters flag will still be applied.
There are additionally max filters for score, quality, mapq, allelefraction,
and identity.
Java Parameters:
-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da Disable assertions.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
ORF予測
>callgenes.sh
$ callgenes.sh
Written by Brian Bushnell
Last modified December 19, 2018
Description: Finds orfs and calls genes in unspliced prokaryotes.
This includes bacteria, archaea, viruses, and mitochondria.
Can also predict 16S, 23S, 5S, and tRNAs.
Usage: callgenes.sh in=contigs.fa out=calls.gff outa=aminos.faa
File parameters:
in=<file> A fasta file.
out=<file> Output gff file.
outa=<file> Amino acid output.
model=<file> A pgm file or comma-delimited list.
If unspecified a default model will be used.
compareto= Optional reference gff file to compare with the gene calls.
Other parameters:
minlen=60 Don't call genes shorter than this.
trd=f (trimreaddescription) Set to true to trim read headers after
the first whitespace. Necessary for IGV.
merge=f For paired reads, merge before calling.
detranslate=f Output canonical nucleotide sequences instead of amino acids.
recode=f Re-encode nucleotide sequences over called genes, leaving
non-coding regions unchanged.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
> analyzegenes.sh
$ analyzegenes.sh
Written by Brian Bushnell
Last modified September 27, 2018
Description: Generates a prokaryotic gene model (.pkm) for gene calling.
Input is fasta and gff files.
The .pkm file may be used by CallGenes.
Usage: analyzegenes.sh in=x.fa gff=x.gff out=x.pgm
File parameters:
in=<file> A fasta file or comma-delimited list of fasta files.
gff=<file> A gff file or comma-delimited list. This is optional;
if present, it must match the number of fasta files.
If absent, a fasta file 'foo.fasta' will imply the
presence of 'foo.gff'.
out=<file> Output pgm file.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
メタゲノムエラーコレクション
> bbcms.sh
r$ bbcms.sh
Written by Brian Bushnell
Last modified July 24, 2018
Description: Error corrects reads and/or filters by depth, storing
kmer counts in a count-min sketch (a Bloom filter variant).
This uses a fixed amount of memory. The error-correction algorithm is taken
from Tadpole; with plenty of memory, the behavior is almost identical to
Tadpole. As the number of unique kmers in a dataset increases, the accuracy
decreases such that it will make fewer corrections. It is still capable
of making useful corrections far past the point where Tadpole would crash
by running out of memory, even with the prefilter flag. But if there is
sufficient memory to use Tadpole, then Tadpole is more desirable.
Because accuracy declines with an increasing number of unique kmers, it can
be useful with very large datasets to run this in 2 passes, with the first
pass for filtering only using a 2-bit filter with the flags tossjunk=t and
ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a
4-bit filter for the actual error correction.
Usage: bbcms.sh in=<input file> out=<output> outb=<reads failing filters>
Example of use in error correction:
bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge
Example of use in depth filtering:
bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4
Error correction and depth filtering can be done simultaneously.
File parameters:
in=<file> Primary input, or read 1 input.
in2=<file> Read 2 input if reads are in two files.
out=<file> Primary read output.
out2=<file> Read 2 output if reads are in two files.
outb=<file> (outbad/outlow) Output for reads failing mincount.
outb2=<file> (outbad2/outlow2) Read 2 output if reads are in two files.
extra=<file> Additional comma-delimited files for generating kmer counts.
ref=<file> If ref is set, then only files in the ref list will be used
for kmer counts, and the input files will NOT be used for
counts; they will just be filtered or corrected.
overwrite=t (ow) Set to false to force the program to abort rather than
overwrite an existing file.
Hashing parameters:
k=31 Kmer length, currently 1-31.
hashes=3 Number of hashes per kmer. Higher generally reduces
false positives at the expense of speed; rapidly
diminishing returns above 4.
minprob=0.5 Ignore kmers with probability of being correct below this.
memmult=1.0 Fraction of free memory to use for Bloom filter. 1.0 should
generally work; if the program crashes with an out of memory
error, set this lower. You may be able to increase accuracy
by setting it slightly higher.
cells= Option to set the number of cells manually. By default this
will be autoset to use all available memory. The only reason
to set this is to ensure deterministic output.
seed=0 This will change the hash function used.
Depth filtering parameters:
mincount=0 If positive, reads with kmer counts below mincount will
be discarded (sent to outb).
hcf=1.0 (highcountfraction) Fraction of kmers that must be at least
mincount to pass.
requireboth=t Require both reads in a pair to pass in order to go to out.
When true, if either read has a count below mincount, both
reads in the pair will go to outb. When false, reads will
only go to outb if both fail.
tossjunk=f Remove reads or pairs with outermost kmer depth below 2.
(Suggested params for huge metagenomes: mincount=2 hcf=0.4 tossjunk=t)
Error correction parameters:
ecc=t Perform error correction.
bits= Bits used to store kmer counts; max count is 2^bits-1.
Supports 2, 4, 8, 16, or 32. 16 is best for high-depth data;
2 or 4 are for huge, low-depth metagenomes that saturate the
bloom filter otherwise. Generally 4 bits is recommended for
error-correction and 2 bits is recommended for filtering only.
ecco=f Error-correct paired reads by overlap prior to kmer-counting.
merge=t Merge paired reads by overlap prior to kmer-counting, and
again prior to correction. Output will still be unmerged.
smooth=3 Remove spikes from kmer counts due to hash collisions.
The number is the max width of peaks to be smoothed; range is
0-3 (3 is most aggressive; 0 disables smoothing).
This also affects tossjunk.
Java Parameters:
-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.
-da Disable assertions.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
実行方法
1、callvariants.sh
bam/samからバリアントコールを行う。
1-1、リファレンスにペアエンドfastqをmappingしてbamかsamを得る。
bbmap.sh ref=ref.fa in1=pair_1.fq in2=pair_2.fq out=output.sam nodisk
samtools sort -@ 12 -O BAM output.sam > sort.bam && samtools index -@ 12 sort.bam
indexディレクトリが作られ、リードがリファレンスにアライメントされる。defaultだとマシンの全コアが計算に使用される(明示するなら"t=")。それからsortコマンドでcoordinate sortedされたbamを得る。
1-2、入力のbam/samとリファレンスを指定してcallvariantsを実行する。ここでは倍数性(ploidy)2で実行。
callvariants.sh ploidy=2 in=sort.bam ref=ref.fa vcf=output.vcf
- callsub=t Call substitutions
- calldel=t Call deletions
- callins=t Call insertions
- ploidy=1 Set the organism's ploidy
- in=<file> Input; may be one file or multiple comma-delimited files
- vcf=<file> Output variant list in vcf format
バクテリアゲノムだと数秒でランは終わる。 vcf=を指定時の出力はVCF v4.2に従う。
2、callgenes.sh
contigやゲノム配列からprokaryotesの遺伝子配列を予測する。ここでは予測した遺伝子のアミノ酸配列も出力している。
callgenes.sh in=ref.fa out=output outa=amnoacids.fa
NGSのリードから実行することもできる。
callgenes.sh in=pair_1.fq in2=pair_2.fq out=output outa=amnoacids.fa
リードそれぞれから予測するため、出力は相応に大きくなることに注意。
3、bbcms.sh
メタゲノムシーケンシングリードのエラーコレクション。
入力のペアエンドfastqと出力のペアエンドfastqをそれぞれ指定する。
bbcms.sh in=pair_1.fq in2=pair_2.fq \
out=error_corrected_1.fq.gz out2=error_corrected_2.fq.gz \
outb=failing_mincount_1.fq.gz outb2=failing_mincount_2.fq.gz
- in=<file> Primary input, or read 1 input.
- in2=<file> Read 2 input if reads are in two files.
- out=<file> Primary read output.
- out2=<file> Read 2 output if reads are in two files.
- outb=<file> Output for reads failing mincount.
- outb2=<file> Read 2 output if reads are in two files.
参考
Question: bbmap callvariants - how to add sample names, how to get 0/0 alleles back?
関連