macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

多機能なNGS分析ツール BBtools 其の3BBMap追加コマンド

 

BBMapの追加コマンドについて紹介します。

 

BBMap Guide

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/

 

callvariants.sh

bbcms.sh

callgenes.sh

 

 

インストール

bbmapを導入する。

#bioconda (link)
conda install -c bioconda -y bbmap

バリアントコール

> callvariants.sh

$ callvariants.sh

 

Written by Brian Bushnell

Last modified October 12, 2017

 

Description:  Calls variants from sam or bam input.

In default mode, all input files are combined and treated as a single sample.

In multisample mode, each file is treated as an individual sample,

and gets its own column in the VCF file.  Unless overridden, input file

names are used as sample names.

Please read bbmap/docs/guides/CallVariantsGuide.txt for more information,

or bbmap/pipelines/variantPipeline.sh for a usage example.

 

Usage:  callvariants.sh in=<file,file,...> ref=<file> vcf=<file>

 

Input may be sorted or unsorted.

The reference should be fasta.

 

I/O parameters:

in=<file>       Input; may be one file or multiple comma-delimited files.

list=<file>     Optional text file containing one input file per line.

                Use list or in, but not both.

out=<file>      Output variant list in native format.  If the name ends

                with .vcf then it will be vcf format.

vcf=<file>      Output variant list in vcf format.

ref=<file>      Reference fasta.  Required to display ref alleles.

                Variant calling wil be more accurate with the reference.

shist=<file>    (scorehist) Output for variant score histogram.

zhist=<file>    (zygosityhist) Output for zygosity histogram.

overwrite=f     (ow) Set to false to force the program to abort rather than

                overwrite an existing file.

extended=t      Print additional variant statistics columns.

sample=         Optional comma-delimited list of sample names.

multisample=f   (multi) Set to true if there are multiple sam/bam files,

                and each should be tracked as an individual sample.

vcf0=           Optional comma-delimited list of per-sample outputs.

                Only used in multisample mode.

bgzip=f         Use bgzip for gzip compression.

 

Processing Parameters:

prefilter=f     Use a Bloom filter to exclude variants seen fewer than

                minreads times.  Doubles the runtime but greatly reduces

                memory usage.  The results are identical.

samstreamer=t   (ss) Load reads multithreaded to increase speed.

coverage=t      (cc) Calculate coverage, to better call variants.

ploidy=1        Set the organism's ploidy.

rarity=1.0      Penalize the quality of variants with allele fraction 

                lower than this.  For example, if you are interested in

                4% frequency variants, you could set both rarity and

                minallelefraction to 0.04.  This is affected by ploidy - 

                a variant with frequency indicating at least one copy

                is never penalized.

covpenalty=0.8  (lowcoveragepenalty) A lower penalty will increase the 

                scores of low-coverage variants, and is useful for 

                low-coverage datasets.

useidentity=t   Include average read identity in score calculation.

usepairing=t    Include pairing rate in score calculation.

usebias=t       Include strand bias in score calculation.

useedist=t      Include read-end distance in score calculation.

homopolymer=t   Penalize scores of substitutions matching adjacent bases.

nscan=t         Consider the distance of a variant from contig ends when 

                calculating strand bias.

32bit=f         Set to true to allow coverage tracking over depth 65535.

strandedcov=t   Tracks per-strand ref coverage to print the DP4 field.

                Requires more memory when enabled.

callsub=t       Call substitutions.

calldel=t       Call deletions.

callins=t       Call insertions.

nopassdot=f     Use . as genotype for variations failing the filter.

 

Trimming parameters:

border=5        Trim at least this many bases on both ends of reads.

qtrim=r         Quality-trim reads on this end

                   r: right, l: left, rl: both, f: don't quality-trim.

trimq=10        Quality-trim bases below this score.

 

Realignment parameters:

realign=f       Realign all reads with more than a couple mismatches.

                Decreases speed.  Recommended for aligners other than BBMap.

unclip=f        Convert clip symbols from exceeding the ends of the

                realignment zone into matches and substitutitions.

repadding=70    Pad alignment by this much on each end.  Typically,

                longer is more accurate for long indels, but greatly

                reduces speed.

rerows=602      Use this many rows maximum for realignment.  Reads longer

                than this cannot be realigned.

recols=2000     Reads may not be aligned to reference seqments longer 

                than this.  Needs to be at least read length plus

                max deletion length plus twice padding.

msa=            Select the aligner.  Options:

                   MultiStateAligner11ts:     Default.

                   MultiStateAligner9PacBio:  Use for PacBio reads, or for

                   Illumina reads mapped to PacBio/Nanopore reads.

 

Sam-filtering parameters:

minpos=         Ignore alignments not overlapping this range.

maxpos=         Ignore alignments not overlapping this range.

minreadmapq=4   Ignore alignments with lower mapq.

contigs=        Comma-delimited list of contig names to include. These 

                should have no spaces, or underscores instead of spaces.

secondary=f     Include secondary alignments.

supplimentary=f Include supplimentary alignments.

invert=f        Invert sam filters.

 

Variant-Calling Cutoffs:

minreads=2              Ignore variants seen in fewer reads.

minqualitymax=15        Ignore variants with lower max base quality.

minedistmax=20          Ignore variants with lower max distance from read ends.

minmapqmax=0            Ignore variants with lower max mapq.

minidmax=0              Ignore variants with lower max read identity.

minpairingrate=0.1      Ignore variants with lower pairing rate.

minstrandratio=0.1      Ignore variants with lower plus/minus strand ratio.

minquality=12.0         Ignore variants with lower average base quality.

minedist=10.0           Ignore variants with lower average distance from ends.

minavgmapq=0.0          Ignore variants with lower average mapq.

minallelefraction=0.1   Ignore variants with lower allele fraction.  This

                        should be adjusted for high ploidies.

minid=0                 Ignore variants with lower average read identity.

minscore=20.0           Ignore variants with lower Phred-scaled score.

clearfilters            Reset all filters to zero.  Filter flags placed after

                        the clearfilters flag will still be applied.

 

There are additionally max filters for score, quality, mapq, allelefraction,

and identity.

 

Java Parameters:

-Xmx            This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

-eoom               This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.

-da                 Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

ORF予測

>callgenes.sh

$ callgenes.sh 

 

Written by Brian Bushnell

Last modified December 19, 2018

 

Description:  Finds orfs and calls genes in unspliced prokaryotes.

This includes bacteria, archaea, viruses, and mitochondria.

Can also predict 16S, 23S, 5S, and tRNAs.

 

Usage:  callgenes.sh in=contigs.fa out=calls.gff outa=aminos.faa

 

File parameters:

in=<file>       A fasta file.

out=<file>      Output gff file.

outa=<file>     Amino acid output.

model=<file>    A pgm file or comma-delimited list.

                If unspecified a default model will be used.

compareto=      Optional reference gff file to compare with the gene calls.

 

Other parameters:

minlen=60       Don't call genes shorter than this.

trd=f           (trimreaddescription) Set to true to trim read headers after

                the first whitespace.  Necessary for IGV.

merge=f         For paired reads, merge before calling.

detranslate=f   Output canonical nucleotide sequences instead of amino acids.

recode=f        Re-encode nucleotide sequences over called genes, leaving

                non-coding regions unchanged.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

> analyzegenes.sh

$ analyzegenes.sh

 

Written by Brian Bushnell

Last modified September 27, 2018

 

Description:  Generates a prokaryotic gene model (.pkm) for gene calling.

Input is fasta and gff files.

The .pkm file may be used by CallGenes.

 

Usage:  analyzegenes.sh in=x.fa gff=x.gff out=x.pgm

 

File parameters:

in=<file>       A fasta file or comma-delimited list of fasta files.

gff=<file>      A gff file or comma-delimited list.  This is optional;

                if present, it must match the number of fasta files.

                If absent, a fasta file 'foo.fasta' will imply the

                presence of 'foo.gff'.

out=<file>      Output pgm file.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

メタゲノムエラーコレクション

> bbcms.sh

r$ bbcms.sh 

 

Written by Brian Bushnell

Last modified July 24, 2018

 

Description:  Error corrects reads and/or filters by depth, storing

kmer counts in a count-min sketch (a Bloom filter variant).

This uses a fixed amount of memory.  The error-correction algorithm is taken

from Tadpole; with plenty of memory, the behavior is almost identical to 

Tadpole.  As the number of unique kmers in a dataset increases, the accuracy 

decreases such that it will make fewer corrections.  It is still capable

of making useful corrections far past the point where Tadpole would crash

by running out of memory, even with the prefilter flag.  But if there is

sufficient memory to use Tadpole, then Tadpole is more desirable.

 

Because accuracy declines with an increasing number of unique kmers, it can

be useful with very large datasets to run this in 2 passes, with the first 

pass for filtering only using a 2-bit filter with the flags tossjunk=t and 

ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a 

4-bit filter for the actual error correction.

 

Usage:  bbcms.sh in=<input file> out=<output> outb=<reads failing filters>

 

Example of use in error correction:

bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge

 

Example of use in depth filtering:

bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4

 

Error correction and depth filtering can be done simultaneously.

 

File parameters:

in=<file>       Primary input, or read 1 input.

in2=<file>      Read 2 input if reads are in two files.

out=<file>      Primary read output.

out2=<file>     Read 2 output if reads are in two files.

outb=<file>     (outbad/outlow) Output for reads failing mincount.

outb2=<file>    (outbad2/outlow2) Read 2 output if reads are in two files.

extra=<file>    Additional comma-delimited files for generating kmer counts.

ref=<file>      If ref is set, then only files in the ref list will be used

                for kmer counts, and the input files will NOT be used for

                counts; they will just be filtered or corrected.

overwrite=t     (ow) Set to false to force the program to abort rather than

                overwrite an existing file.

 

Hashing parameters:

k=31            Kmer length, currently 1-31.

hashes=3        Number of hashes per kmer.  Higher generally reduces 

                false positives at the expense of speed; rapidly

                diminishing returns above 4.

minprob=0.5     Ignore kmers with probability of being correct below this.

memmult=1.0     Fraction of free memory to use for Bloom filter.  1.0 should

                generally work; if the program crashes with an out of memory

                error, set this lower.  You may be able to increase accuracy

                by setting it slightly higher.

cells=          Option to set the number of cells manually.  By default this

                will be autoset to use all available memory.  The only reason

                to set this is to ensure deterministic output.

seed=0          This will change the hash function used.

 

Depth filtering parameters:

mincount=0      If positive, reads with kmer counts below mincount will

                be discarded (sent to outb).

hcf=1.0         (highcountfraction) Fraction of kmers that must be at least

                mincount to pass.

requireboth=t   Require both reads in a pair to pass in order to go to out.

                When true, if either read has a count below mincount, both

                reads in the pair will go to outb.  When false, reads will

                only go to outb if both fail.

tossjunk=f      Remove reads or pairs with outermost kmer depth below 2.

(Suggested params for huge metagenomes: mincount=2 hcf=0.4 tossjunk=t)

 

Error correction parameters:

ecc=t           Perform error correction.

bits=           Bits used to store kmer counts; max count is 2^bits-1.

                Supports 2, 4, 8, 16, or 32.  16 is best for high-depth data;

                2 or 4 are for huge, low-depth metagenomes that saturate the 

                bloom filter otherwise.  Generally 4 bits is recommended for 

                error-correction and 2 bits is recommended for filtering only.

ecco=f          Error-correct paired reads by overlap prior to kmer-counting.

merge=t         Merge paired reads by overlap prior to kmer-counting, and 

                again prior to correction.  Output will still be unmerged.

smooth=3        Remove spikes from kmer counts due to hash collisions.

                The number is the max width of peaks to be smoothed; range is

                0-3 (3 is most aggressive; 0 disables smoothing).

                This also affects tossjunk.

                

 

Java Parameters:

-Xmx            This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

                -Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.  The max is typically 85% of physical memory.

-eoom           This flag will cause the process to exit if an out-of-memory exception occurs.  Requires Java 8u92+.

-da             Disable assertions.

 

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

 

 

 

実行方法

1、callvariants.sh

bam/samからバリアントコールを行う。

1-1、リファレンスにペアエンドfastqをmappingしてbamかsamを得る。

bbmap.sh ref=ref.fa in1=pair_1.fq in2=pair_2.fq out=output.sam nodisk
samtools sort -@ 12 -O BAM output.sam > sort.bam && samtools index -@ 12 sort.bam

indexディレクトリが作られ、リードがリファレンスにアライメントされる。defaultだとマシンの全コアが計算に使用される(明示するなら"t=")。それからsortコマンドでcoordinate sortedされたbamを得る。 

 

1-2、入力のbam/samとリファレンスを指定してcallvariantsを実行する。ここでは倍数性(ploidy)2で実行。

callvariants.sh ploidy=2 in=sort.bam ref=ref.fa vcf=output.vcf
  • callsub=t        Call substitutions
  • calldel=t         Call deletions
  • callins=t         Call insertions
  • ploidy=1         Set the organism's ploidy
  • in=<file>         Input; may be one file or multiple comma-delimited files
  • vcf=<file>       Output variant list in vcf format

バクテリアゲノムだと数秒でランは終わる。 vcf=を指定時の出力はVCF v4.2に従う。

 

2、callgenes.sh

contigやゲノム配列からprokaryotesの遺伝子配列を予測する。ここでは予測した遺伝子のアミノ酸配列も出力している。

callgenes.sh in=ref.fa out=output outa=amnoacids.fa

NGSのリードから実行することもできる。

callgenes.sh in=pair_1.fq in2=pair_2.fq out=output outa=amnoacids.fa

リードそれぞれから予測するため、出力は相応に大きくなることに注意。

 

 

3、bbcms.sh 

メタゲノムシーケンシングリードのエラーコレクション。

入力のペアエンドfastqと出力のペアエンドfastqをそれぞれ指定する。

bbcms.sh in=pair_1.fq in2=pair_2.fq \
out=error_corrected_1.fq.gz out2=error_corrected_2.fq.gz \
outb=failing_mincount_1.fq.gz outb2=failing_mincount_2.fq.gz
  • in=<file>        Primary input, or read 1 input.
  • in2=<file>      Read 2 input if reads are in two files.
  • out=<file>      Primary read output.
  • out2=<file>    Read 2 output if reads are in two files.
  • outb=<file>    Output for reads failing mincount.
  • outb2=<file>  Read 2 output if reads are in two files.

 

参考

Question: bbmap callvariants - how to add sample names, how to get 0/0 alleles back?

http://callvariants.sh

 

関連