2020/8/5: fixed typos
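The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, such as de novo assemblers, rely on an in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require more than 30 GB of memory.
This study proposes a new encoding of the de Bruijn graph. The encoding is based on a Bloom filter, with an additional structure that removes critical false positives. Minia, the assembly software that implements this structure, performed a complete de novo assembly of human genome short reads in 23 hours using 5.7 GB of memory.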
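The idea in the abstract can be illustrated with a small sketch. This is not Minia's implementation. It is a toy version under stated assumptions: k-mers are plain strings, reverse complements are ignored, and the exact node set is held in memory while the critical false positives are collected (the class and function names below are hypothetical). It shows why storing only the false positives that are adjacent to true nodes is enough to make neighbor queries exact during traversal.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not tuned)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.n_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            (self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item)
        )


def neighbors(kmer):
    """Yield the 8 candidate neighbors (4 successors, 4 predecessors)."""
    for base in "ACGT":
        yield kmer[1:] + base
        yield base + kmer[:-1]


def build(kmers, n_bits, n_hashes):
    """Return (bloom, critical_fp) for a set of true k-mers.

    Assumption: the exact set fits in memory here; this is only for the sketch.
    """
    true_nodes = set(kmers)
    bloom = BloomFilter(n_bits, n_hashes)
    for kmer in true_nodes:
        bloom.add(kmer)
    # Critical false positives: neighbors of true nodes that the Bloom filter
    # accepts although they are not true nodes.
    critical_fp = {
        nb
        for kmer in true_nodes
        for nb in neighbors(kmer)
        if nb in bloom and nb not in true_nodes
    }
    return bloom, critical_fp


def graph_neighbors(kmer, bloom, critical_fp):
    """Exact neighbors of a true node, using only the filter and the cFP set."""
    return sorted(
        {nb for nb in neighbors(kmer) if nb in bloom and nb not in critical_fp}
    )
```

A deliberately undersized filter (for example `build(kmers, n_bits=16, n_hashes=2)`) produces many false positives, yet `graph_neighbors` still returns exactly the true neighbors of any true node, because every false positive reachable from a true node is listed in `critical_fp`.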
Installation
Build dependencies
- CMake 2.6+; see http://www.cmake.org/cmake/resources/software.html
- C++11 compiler; (g++ version>=4.7 (Linux), clang version>=4.3 (Mac OSX))
#bioconda(link)
conda install -c bioconda minia -y
#building from source
git clone --recursive https://github.com/GATB/minia.git
cd minia/
sh INSTALL
> minia
$ minia
Minia 3, git commit b0715310a
Specifiy -in
[minia options]
[assembly options]
-in (1 arg) : input reads (fasta/fastq/compressed) or hdf5 file [default '']
-keep-isolated (0 arg) : keep short (<= max(2k, 150 bp)) isolated output sequences
-traversal (1 arg) : traversal type ('contig', 'unitig') [default 'contig']
-fasta-line (1 arg) : number of nucleotides per line in fasta output (0 means one line) [default '0']
[graph simplifications options]
-no-bulge-removal (0 arg) : ask to not perform bulge removal
-no-tip-removal (0 arg) : ask to not perform tip removal
-no-ec-removal (0 arg) : ask to not perform erroneous connection removal
-tip-len-topo-kmult (1 arg) : remove all tips of length <= k * X bp [default '2.500000']
-tip-len-rctc-kmult (1 arg) : remove tips that pass coverage criteria, of length <= k * X bp [default '10.000000']
-tip-rctc-cutoff (1 arg) : tip relative coverage coefficient: mean coverage of neighbors > X * tip coverage [default '2.000000']
-bulge-len-kmult (1 arg) : bulges shorter than k*X bp are candidate to be removed [default '3.000000']
-bulge-len-kadd (1 arg) : bulges shorter than k+X bp are candidate to be removed [default '100']
-bulge-altpath-kadd (1 arg) : explore up to k+X nodes to find alternative path [default '50']
-bulge-altpath-covmult (1 arg) : bulges of coverage <= X*cov_altpath will be removed [default '1.100000']
-ec-len-kmult (1 arg) : EC shorter than k*X bp are candidates to be removed [default '9.000000']
-ec-rctc-cutoff (1 arg) : EC relative coverage coefficient (similar in spirit as tip) [default '4.000000']
-no-mphf (0 arg) : don't construct the MPHF
[kmer count options]
-kmer-size (1 arg) : size of a kmer [default '31']
-abundance-min (1 arg) : min abundance threshold for solid kmers [default '2']
-abundance-max (1 arg) : max abundance threshold for solid kmers [default '2147483647']
-abundance-min-threshold (1 arg) : min abundance hard threshold (only used when min abundance is "auto") [default '2']
-histo-max (1 arg) : max number of values in kmers histogram [default '10000']
-solidity-kind (1 arg) : way to compute counts of several files (sum, min, max, one, all, custom) [default 'sum']
-solidity-custom (1 arg) : when solidity-kind is custom, specifies list of files where kmer must be present [default '']
-max-memory (1 arg) : max memory (in MBytes) [default '5000']
-max-disk (1 arg) : max disk (in MBytes) [default '0']
-solid-kmers-out (1 arg) : output file for solid kmers (only when constructing a graph) [default '']
-out (1 arg) : output file [default '']
-out-dir (1 arg) : output directory [default '.']
-out-tmp (1 arg) : output directory for temporary files [default '.']
-out-compress (1 arg) : h5 compression level (0:none, 9:best) [default '0']
-storage-type (1 arg) : storage type of kmer counts ('hdf5' or 'file') [default 'hdf5']
-histo2D (1 arg) : compute the 2D histogram (with first file = genome, remaining files = reads) [default '0']
-histo (1 arg) : output the kmer abundance histogram [default '0']
[kmer count, advanced performance tweaks options]
-minimizer-type (1 arg) : minimizer type (0=lexi, 1=freq) [default '0']
-minimizer-size (1 arg) : size of a minimizer [default '10']
-repartition-type (1 arg) : minimizer repartition (0=unordered, 1=ordered) [default '0']
[bloom options]
-bloom (1 arg) : bloom type ('basic', 'cache', 'neighbor') [default 'neighbor']
-debloom (1 arg) : debloom type ('none', 'original' or 'cascading') [default 'cascading']
-debloom-impl (1 arg) : debloom impl ('basic', 'minimizer') [default 'minimizer']
[branching options]
-branching-nodes (1 arg) : branching type ('none' or 'stored') [default 'stored']
-topology-stats (1 arg) : topological information level (0 for none) [default '0']
[general options]
-config-only (0 arg) : dump config only
-nb-cores (1 arg) : number of cores [default '0']
-all-abundance-counts (0 arg) : output all k-mer abundance counts instead of mean
-edge-km (1 arg) : edge km representation [default '0']
-verbose (1 arg) : verbosity level [default '1']
-integer-precision (1 arg) : integers precision (0 for optimized value) [default '0']
[debug options]
-redo-bcalm (0 arg) : debug function, redo the bcalm algo
-skip-bcalm (0 arg) : same, but skip bcalm
-redo-bglue (0 arg) : same, but redo bglue (needs debug_keep_glue_files=true in source code)
-skip-bglue (0 arg) : same, but skip bglue
-redo-links (0 arg) : same, but redo links
-skip-links (0 arg) : same, but skip links
-nb-glue-partitions (1 arg) : number of glue partitions (automatically calculated by default) [default '0']
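Several graph simplification options above are expressed relative to k ("k * X bp" or "k+X bp"). As a reading aid, the following sketch evaluates those expressions for a given k, using the defaults printed in the help text. The function name and dictionary keys are hypothetical, and the sketch only computes the individual values; how Minia combines them internally is not covered by the help output.

```python
def simplification_lengths(
    k,
    tip_len_topo_kmult=2.5,
    tip_len_rctc_kmult=10.0,
    bulge_len_kmult=3.0,
    bulge_len_kadd=100,
    bulge_altpath_kadd=50,
    ec_len_kmult=9.0,
):
    """Evaluate the k-relative expressions from the help text (defaults as printed)."""
    return {
        "tip_len_topo_bp": k * tip_len_topo_kmult,    # -tip-len-topo-kmult
        "tip_len_rctc_bp": k * tip_len_rctc_kmult,    # -tip-len-rctc-kmult
        "bulge_len_kmult_bp": k * bulge_len_kmult,    # -bulge-len-kmult
        "bulge_len_kadd_bp": k + bulge_len_kadd,      # -bulge-len-kadd
        "bulge_altpath_nodes": k + bulge_altpath_kadd,  # -bulge-altpath-kadd
        "ec_len_bp": k * ec_len_kmult,                # -ec-len-kmult
    }


if __name__ == "__main__":
    for name, value in simplification_lengths(31).items():
        print(f"{name}: {value}")
```

With the default k of 31, the topological tip length limit, for example, evaluates to 2.5 * 31 = 77.5 bp.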
Usage
Specify the input FASTQ file.
minia -in input.fq -kmer-size 31 -out minia_assembly
- -in input reads (fasta/fastq/compressed) or hdf5 file [default '']
- -kmer-size size of a kmer [default '31']
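To make the roles of -kmer-size and -abundance-min more concrete, here is a minimal sketch of selecting "solid" k-mers from reads. It is not Minia's k-mer counter. It assumes that a k-mer and its reverse complement are counted together under the lexicographically smaller form, that k-mers containing characters other than A, C, G, T are skipped, and that a k-mer is solid when its count is at least the threshold. The function names are hypothetical.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement (assumed convention)."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def solid_kmers(reads, kmer_size=31, abundance_min=2):
    """Return canonical k-mers seen at least abundance_min times across reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - kmer_size + 1):
            kmer = read[i:i + kmer_size]
            # Skip k-mers with ambiguous bases such as N (assumption of this sketch).
            if set(kmer) <= set("ACGT"):
                counts[canonical(kmer)] += 1
    return {kmer for kmer, n in counts.items() if n >= abundance_min}
```

Raising the threshold discards low-count k-mers, which are more likely to come from sequencing errors, at the cost of losing genuinely low-coverage regions.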
Citation
Space-efficient and exact de Bruijn graph representation based on a Bloom filter
Rayan Chikhi, Guillaume Rizk
Algorithms Mol Biol. 2013 Sep 16;8(1):22
Related
References