Minia, a de Bruijn graph assembler - macでインフォマティクス

2020 8/5: typo corrections

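The de Bruijn graph data structure is widely used in next-generation sequencing (NGS). Many programs, such as de novo assemblers, rely on an in-memory representation of this graph. However, current techniques for representing the de Bruijn graph of a human genome require more than 30 GB of memory.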

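This work proposes a new encoding of the de Bruijn graph. The encoding is based on a Bloom filter, with an additional structure that removes the critical false positives. Minia, the assembly software that implements this structure, completed a full de novo assembly of human genome short reads in 23 hours using 5.7 GB of memory.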
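To make the idea concrete, here is a minimal Python sketch of a Bloom filter combined with an explicit set of critical false positives (false positives that are neighbors of real nodes). It is only an illustration under simplifying assumptions, not Minia's implementation: the names `BloomFilter`, `neighbors`, `build_graph` and `is_node` are made up for this example, reverse complements are ignored, and the true k-mers are held in an ordinary in-memory set while the structure is built.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter over strings (illustrative, not optimized)."""

    def __init__(self, n_bits, n_hashes):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        # Derive n_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "little") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p >> 3] |= 1 << (p & 7)

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))


def neighbors(kmer):
    """All 8 possible graph neighbors of a k-mer (4 successors, 4 predecessors)."""
    successors = [kmer[1:] + base for base in "ACGT"]
    predecessors = [base + kmer[:-1] for base in "ACGT"]
    return successors + predecessors


def build_graph(kmers, n_bits=64, n_hashes=2):
    """Return (bloom, cfp): the Bloom filter plus the critical false positives."""
    true_nodes = set(kmers)
    bloom = BloomFilter(n_bits, n_hashes)
    for kmer in true_nodes:
        bloom.add(kmer)
    # Critical false positives: neighbors of true nodes that the Bloom filter
    # wrongly reports as present.
    cfp = {
        nb
        for kmer in true_nodes
        for nb in neighbors(kmer)
        if nb in bloom and nb not in true_nodes
    }
    return bloom, cfp


def is_node(kmer, bloom, cfp):
    """Membership test that is exact for true nodes and for their neighbors."""
    return kmer in bloom and kmer not in cfp
```

Because a traversal that starts from a real node only ever queries neighbors of real nodes, removing just these critical false positives is enough to make the traversal exact, even though the Bloom filter itself still answers wrongly for unrelated k-mers.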

Installation

Build dependencies

CMake 2.6+; see http://www.cmake.org/cmake/resources/software.html
C++11 compiler; (g++ version>=4.7 (Linux), clang version>=4.3 (Mac OSX))

GitHub

#bioconda
conda install -c bioconda minia -y

#building from source
git clone --recursive https://github.com/GATB/minia.git
cd minia/
sh INSTALL

$ minia

Minia 3, git commit b0715310a

Specifiy -in

[minia options]

[assembly options]

-in (1 arg) : input reads (fasta/fastq/compressed) or hdf5 file [default '']

-keep-isolated (0 arg) : keep short (<= max(2k, 150 bp)) isolated output sequences

-traversal (1 arg) : traversal type ('contig', 'unitig') [default 'contig']

-fasta-line (1 arg) : number of nucleotides per line in fasta output (0 means one line) [default '0']

[graph simplifications options]

-no-bulge-removal (0 arg) : ask to not perform bulge removal

-no-tip-removal (0 arg) : ask to not perform tip removal

-no-ec-removal (0 arg) : ask to not perform erroneous connection removal

-tip-len-topo-kmult (1 arg) : remove all tips of length <= k * X bp [default '2.500000']

-tip-len-rctc-kmult (1 arg) : remove tips that pass coverage criteria, of length <= k * X bp [default '10.000000']

-tip-rctc-cutoff (1 arg) : tip relative coverage coefficient: mean coverage of neighbors > X * tip coverage [default '2.000000']

-bulge-len-kmult (1 arg) : bulges shorter than k*X bp are candidate to be removed [default '3.000000']

-bulge-len-kadd (1 arg) : bulges shorter than k+X bp are candidate to be removed [default '100']

-bulge-altpath-kadd (1 arg) : explore up to k+X nodes to find alternative path [default '50']

-bulge-altpath-covmult (1 arg) : bulges of coverage <= X*cov_altpath will be removed [default '1.100000']

-ec-len-kmult (1 arg) : EC shorter than k*X bp are candidates to be removed [default '9.000000']

-ec-rctc-cutoff (1 arg) : EC relative coverage coefficient (similar in spirit as tip) [default '4.000000']

-no-mphf (0 arg) : don't construct the MPHF

[kmer count options]

-kmer-size (1 arg) : size of a kmer [default '31']

-abundance-min (1 arg) : min abundance threshold for solid kmers [default '2']

-abundance-max (1 arg) : max abundance threshold for solid kmers [default '2147483647']

-abundance-min-threshold (1 arg) : min abundance hard threshold (only used when min abundance is "auto") [default '2']

-histo-max (1 arg) : max number of values in kmers histogram [default '10000']

-solidity-kind (1 arg) : way to compute counts of several files (sum, min, max, one, all, custom) [default 'sum']

-solidity-custom (1 arg) : when solidity-kind is custom, specifies list of files where kmer must be present [default '']

-max-memory (1 arg) : max memory (in MBytes) [default '5000']

-max-disk (1 arg) : max disk (in MBytes) [default '0']

-solid-kmers-out (1 arg) : output file for solid kmers (only when constructing a graph) [default '']

-out (1 arg) : output file [default '']

-out-dir (1 arg) : output directory [default '.']

-out-tmp (1 arg) : output directory for temporary files [default '.']

-out-compress (1 arg) : h5 compression level (0:none, 9:best) [default '0']

-storage-type (1 arg) : storage type of kmer counts ('hdf5' or 'file') [default 'hdf5']

-histo2D (1 arg) : compute the 2D histogram (with first file = genome, remaining files = reads) [default '0']

-histo (1 arg) : output the kmer abundance histogram [default '0']

[kmer count, advanced performance tweaks options]

-minimizer-type (1 arg) : minimizer type (0=lexi, 1=freq) [default '0']

-minimizer-size (1 arg) : size of a minimizer [default '10']

-repartition-type (1 arg) : minimizer repartition (0=unordered, 1=ordered) [default '0']

[bloom options]

-bloom (1 arg) : bloom type ('basic', 'cache', 'neighbor') [default 'neighbor']

-debloom (1 arg) : debloom type ('none', 'original' or 'cascading') [default 'cascading']

-debloom-impl (1 arg) : debloom impl ('basic', 'minimizer') [default 'minimizer']

[branching options]

-branching-nodes (1 arg) : branching type ('none' or 'stored') [default 'stored']

-topology-stats (1 arg) : topological information level (0 for none) [default '0']

[general options]

-config-only (0 arg) : dump config only

-nb-cores (1 arg) : number of cores [default '0']

-all-abundance-counts (0 arg) : output all k-mer abundance counts instead of mean

-edge-km (1 arg) : edge km representation [default '0']

-verbose (1 arg) : verbosity level [default '1']

-integer-precision (1 arg) : integers precision (0 for optimized value) [default '0']

[debug options]

-redo-bcalm (0 arg) : debug function, redo the bcalm algo

-skip-bcalm (0 arg) : same, but skip bcalm

-redo-bglue (0 arg) : same, but redo bglue (needs debug_keep_glue_files=true in source code)

-skip-bglue (0 arg) : same, but skip bglue

-redo-links (0 arg) : same, but redo links

-skip-links (0 arg) : same, but skip links

-nb-glue-partitions (1 arg) : number of glue partitions (automatically calculated by default) [default '0']
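The -kmer-size and -abundance-min options above decide which k-mers count as "solid" and therefore become nodes of the graph. The following Python sketch shows the idea on a toy scale. It assumes k-mers are counted in canonical form (the lexicographically smaller of a k-mer and its reverse complement), as k-mer counters commonly do; the function names are hypothetical and this is not how Minia counts k-mers internally.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
_VALID_BASES = set("ACGT")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def solid_kmers(reads, k=31, abundance_min=2):
    """Count canonical k-mers and keep those seen at least abundance_min times."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= _VALID_BASES:
                counts[canonical(kmer)] += 1
    return {kmer: n for kmer, n in counts.items() if n >= abundance_min}
```

For example, `solid_kmers(["ACGTA", "ACGTT"], k=4)` keeps only `ACGT`, which occurs in both reads; the k-mers seen once are treated as likely sequencing errors and dropped.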

Usage

Specify the input FASTQ file.

minia -in input.fq -kmer-size 31 -out minia_assembly

-in input reads (fasta/fastq/compressed) or hdf5 file [default '']
-kmer-size size of a kmer [default '31']
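After a run, a quick sanity check is to look at the contig length distribution. The Python sketch below computes contig lengths and N50 from FASTA lines. It is a generic helper, not part of Minia, and the function names are made up for this example. The output file name used in the note after the code (`minia_assembly.contigs.fa`) is an assumption based on the -out prefix above, so check what your run actually wrote.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = 0
    in_record = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A header closes the previous record (if any) and opens a new one.
            if in_record:
                lengths.append(current)
            current = 0
            in_record = True
        elif line and in_record:
            current += len(line)
    if in_record:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

To use it on real output, open the contigs file and pass the file handle, for example `n50(fasta_lengths(open("minia_assembly.contigs.fa")))`.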

Reference
Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Rayan Chikhi, Guillaume Rizk

Algorithms Mol Biol. 2013 Sep 16;8(1):22