From the homepage
https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/clumpify-guide/
Clumpify is a tool for rapidly grouping overlapping reads into clumps. This allows files to compress better, accelerates overlap-based assembly, and speeds up cache-sensitive applications such as mapping. Clumpify can also generate consensus sequences from these clusters, although this is currently rudimentary. Clusters are not guaranteed to overlap; rather, they are guaranteed to share kmers, which means they are likely to overlap. It is designed for accurate data, that is, Illumina, Ion Torrent, or error-corrected PacBio. Even Illumina data may need quality trimming or error correction.
Tool: Introducing Clumpify: Create 30% Smaller, Faster Gzipped Fastq Files. And remove duplicates.
https://www.biostars.org/p/225338/
Clumpify in the BBMap package is a little different from other tools in that it does not actually change the data at all; it simply reorders the data to maximize gzip compression. The output therefore remains a fully compatible gzipped fastq file, and Clumpify has no effect on downstream analyses other than making them faster. Clumpify works on a principle similar to why a sorted bam file is smaller than an unsorted one: reads are reordered, with pairs kept together, so that reads with similar sequences end up near each other, which makes gzip compression more efficient. Unlike bam sorting, it requires no mapping or reference, and except in very rare cases it can be done with an arbitrarily small amount of memory.
Speedups can be expected for I/O-limited processes; in tests, SPAdes and MEGAHIT assemblies using Clumpified input saved more time than it took to run Clumpify. Purely CPU-limited processes such as mapping usually benefit little in terms of speed.
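To see the effect directly, one can compare compressed sizes before and after clumping (a minimal sketch; reads.fq.gz is a placeholder file name):
#reorder reads, then compare the compressed file sizes
clumpify.sh in=reads.fq.gz out=clumped.fq.gz reorder
ls -lh reads.fq.gz clumped.fq.gz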
Installation
mamba install -c bioconda bbmap
$ clumpify.sh
Written by Brian Bushnell
Last modified October 30, 2019
Description: Sorts sequences to put similar reads near each other.
Can be used for increased compression or error correction.
Please read bbmap/docs/guides/ClumpifyGuide.txt for more information.
Usage: clumpify.sh in=<file> out=<file> reorder
Input may be fasta or fastq, compressed or uncompressed. Cannot accept sam.
Parameters and their defaults:
in=<file> Input file.
in2=<file> Optional input for read 2 of twin paired files.
out=<file> Output file. May not be standard out.
out2=<file> Optional output for read 2 of twin paired files.
groups=auto Use this many intermediate files (to save memory).
1 group is fastest. Auto will estimate the number
of groups needed based on the file size, so it should
not ever run out of memory.
lowcomplexity=f For compressed low-complexity libraries such as RNA-seq,
this will more conservatively estimate how much memory
is needed to automatically decide the number of groups.
rcomp=f Give read clumps the same orientation to increase
compression. Should be disabled for paired reads.
overwrite=f (ow) Set to false to force the program to abort rather
than overwrite an existing file.
qin=auto Auto-detect input quality encoding. May be set to:
33: ASCII-33 (Sanger) encoding.
64: ASCII-64 (old Illumina) encoding.
All modern sequence is encoded as ASCII-33.
qout=auto Use input quality encoding as output quality encoding.
changequality=f (cq) If true, fix broken quality scores such as Ns with
Q>0. Default is false to ensure lossless compression.
fastawrap=70 Set to a higher number like 4000 for longer lines in
fasta format, which increases compression.
Compression parameters:
ziplevel=6 (zl) Gzip compression level (1-11). Higher is slower.
Level 11 is only available if pigz is installed and is
extremely slow to compress, but faster to decompress.
Naming the output file to *.bz2 will use bzip2 instead of
gzip for ~9% additional compression, which requires
bzip2, pbzip2, or lbzip2 in the path.
blocksize=128 Size of blocks for pigz, in kb. Higher gives slightly
better compression.
shortname=f Make the names as short as possible. 'shortname=shrink'
will shorten the names where possible, but retain the
flowcell and barcode information.
reorder=f Reorder clumps for additional compression. Only valid
when groups=1, passes=1, and ecc=f. Possible modes:
f: Do not reorder clumps.
c: Reorder using consensus reads. Uses additional
time and memory.
p: Reorder using pair information. Requires paired
reads. Yields the highest compression.
a: Automatically choose between 'c' and 'p'. The
flag reorder with no argument will set 'reorder=a'.
quantize=f Bin the quality scores, like NextSeq. This greatly
increases compression, but information is lost.
Temp file parameters:
compresstemp=auto (ct) Gzip temporary files. By default temp files will be
compressed if the output file is compressed.
deletetemp=t Delete temporary files.
deleteinput=f Delete input upon successful completion.
usetmpdir=f Use tmpdir for temp files.
tmpdir= By default, this is the environment variable TMPDIR.
Hashing parameters:
k=31 Use kmers of this length (1-31). Shorter kmers may
increase compression, but 31 is recommended for error
correction.
mincount=0 Don't use pivot kmers with count less than this.
Setting mincount=2 can increase compression.
Increases time and memory usage.
seed=1 Random number generator seed for hashing.
Set to a negative number to use a random seed.
hashes=4 Use this many masks when hashing. 0 uses raw kmers.
Often hashes=0 increases compression, but it should
not be used with error-correction.
border=1 Do not use kmers within this many bases of read ends.
Deduplication parameters:
dedupe=f Remove duplicate reads. For pairs, both must match.
By default, deduplication does not occur.
If dedupe and markduplicates are both false, none of
the other duplicate-related flags will have any effect.
markduplicates=f Don't remove; just append ' duplicate' to the name.
allduplicates=f Mark or remove all copies of duplicates, instead of
keeping the highest-quality copy.
addcount=f Append the number of copies to the read name.
Mutually exclusive with markduplicates or allduplicates.
subs=2 (s) Maximum substitutions allowed between duplicates.
subrate=0.0 (dsr) If set, the number of substitutions allowed will be
max(subs, subrate*min(length1, length2)) for 2 sequences.
allowns=t No-called bases will not be considered substitutions.
scanlimit=5 (scan) Continue for this many reads after encountering a
nonduplicate. Improves detection of inexact duplicates.
containment=f Allow containments (where one sequence is shorter).
affix=f For containments, require one sequence to be an affix
(prefix or suffix) of the other.
optical=f If true, mark or remove optical duplicates only.
This means they are Illumina reads within a certain
distance on the flowcell. Normal Illumina names needed.
Also for tile-edge and well duplicates.
dupedist=40 (dist) Max distance to consider for optical duplicates.
Higher removes more duplicates but is more likely to
remove PCR rather than optical duplicates.
This is platform-specific; recommendations:
NextSeq 40 (and spany=t)
HiSeq 1T 40
HiSeq 2500 40
HiSeq 3k/4k 2500
Novaseq 12000
spany=f Allow reads to be considered optical duplicates if they
are on different tiles, but are within dupedist in the
y-axis. Should only be enabled when looking for
tile-edge duplicates (as in NextSeq).
spanx=f Like spany, but for the x-axis. Not necessary
for NextSeq.
spantiles=f Set both spanx and spany.
adjacent=f Limit tile-spanning to adjacent tiles (those with
consecutive numbers).
*** Thus, for NextSeq, the recommended deduplication flags are: ***
dedupe optical spany adjacent
Pairing/ordering parameters (for use with error-correction):
unpair=f For paired reads, clump all of them rather than just
read 1. Destroys pairing. Without this flag, for paired
reads, only read 1 will be error-corrected.
repair=f After clumping and error-correction, restore pairing.
If groups>1 this will sort by name which will destroy
clump ordering; with a single group, clumping will
be retained.
Error-correction parameters:
ecc=f Error-correct reads. Requires multiple passes for
complete correction.
ecco=f Error-correct paired reads via overlap before clumping.
passes=1 Use this many error-correction passes. 6 passes are
suggested.
consensus=f Output consensus sequence instead of clumps.
Advanced error-correction parameters:
mincc=4 (mincountcorrect) Do not correct to alleles occurring less
often than this.
minss=4 (minsizesplit) Do not split into new clumps smaller than
this.
minsfs=0.17 (minsizefractionsplit) Do not split on pivot alleles in
areas with local depth less than this fraction of clump size.
minsfc=0.20 (minsizefractioncorrect) Do not correct in areas with local
depth less than this.
minr=30.0 (minratio) Correct to the consensus if the ratio of the
consensus allele to second-most-common allele is >=minr,
for high depth. Actual ratio used is:
min(minr, minro+minorCount*minrm+quality*minrqm).
minro=1.9 (minratiooffset) Base ratio.
minrm=1.8 (minratiomult) Ratio multiplier for secondary allele count.
minrqm=0.08 (minratioqmult) Ratio multiplier for base quality.
minqr=2.8 (minqratio) Do not correct bases when cq*minqr>rqsum.
minaqr=0.70 (minaqratio) Do not correct bases when cq*minaqr>5+rqavg.
minid=0.97 (minidentity) Do not correct reads with identity to
consensus less than this.
maxqadjust=0 Adjust quality scores by at most maxqadjust per pass.
maxqi=-1 (maxqualityincorrect) Do not correct bases with quality
above this (if positive).
maxci=-1 (maxcountincorrect) Do not correct alleles with count
above this (if positive).
findcorrelations=t Look for correlated SNPs in clumps to split into alleles.
maxcorrelations=12 Maximum number of eligible SNPs per clump to consider for
correlations. Increasing this number can reduce false-
positive corrections at the possible expense of speed.
Java Parameters:
-Xmx This will set Java's memory usage, overriding autodetection.
-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.
The max is typically 85% of physical memory.
-eoom This flag will cause the process to exit if an
out-of-memory exception occurs. Requires Java 8u92+.
-da Disable assertions.
Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.
How to run
Paired-end, interleaved, and single-end reads can all be used (an interleaved example is sketched below). Specify the input fastq and the output fastq.
#single
clumpify.sh in=reads.fq.gz out=clumped.fq.gz
#pair
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=clumped1.fq.gz out2=clumped2.fq.gz reorder
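For an interleaved file, something like the following should work; note that interleaved=t is the general BBTools stream flag and is not listed in the help above, so treat it as an assumption (pairing is normally auto-detected from read names):
#interleaved pairs in a single file (interleaved=t is assumed, not shown in the help above)
clumpify.sh in=interleaved.fq.gz out=clumped.fq.gz interleaved=t reorder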
- groups=auto Use this many intermediate files (to save memory).
1 group is fastest. Auto will estimate the number of groups needed based on the file size, so it should not ever run out of memory.
Use 16 temporary files when running Clumpify:
clumpify.sh in=reads.fq.gz out=clumped.fq.gz groups=16
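If memory needs to be capped explicitly, the Java Parameters section of the help above says -Xmx overrides autodetection; a sketch combining it with groups (20g is an arbitrary value):
#use 16 temporary files and cap the Java heap at 20 GB
clumpify.sh in=reads.fq.gz out=clumped.fq.gz groups=16 -Xmx20g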
Maximal compression of sequence data
rename.sh in=reads.fq out=renamed.fq prefix=x
bbmerge.sh in=renamed.fq out=merged.fq mix
clumpify.sh in=merged.fq out=clumped.fa.gz zl=9 pigz fastawrap=100000
Strip the read names, merge reads where possible, and then clumpify everything. The output is in fasta format (as long as no read is 100 kbp or longer). If quality values must be preserved, output to 'clumped.fq.gz' instead.
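The help above also documents deduplication and error correction; the sketches below use only flags shown there, with placeholder file names:
#remove duplicates (for pairs, both reads must match)
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=deduped1.fq.gz out2=deduped2.fq.gz dedupe
#remove optical and tile-edge duplicates on NextSeq, using the flags the help recommends
clumpify.sh in1=r1.fq.gz in2=r2.fq.gz out1=deduped1.fq.gz out2=deduped2.fq.gz dedupe optical spany adjacent
#error-correct with the suggested 6 passes; unpair/repair keep pairing intact
clumpify.sh in=reads.fq.gz out=ecc.fq.gz ecc passes=6 unpair repair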
Clumpify is actually used as a preprocessing step in several analysis pipelines; have a look at them.
Citation
Clumpify Guide - DOE Joint Genome Institute