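The number of genomes (including metagenomes) is increasing at an accelerating pace. In the near future, it may be necessary to estimate pairwise distances among millions of genomes. Even with cloud computing, few software tools can perform such estimation. The multi-threaded software BinDash can perform it using only a typical personal laptop. BinDash implements b-bit one-permutation rolling MinHash with optimal densification, which are existing data-mining techniques. According to the authors' evaluation, BinDash empirically outperforms state-of-the-art software in precision, compression ratio, memory usage, and runtime. The evaluation compared all bacterial genomes in RefSeq on a Dell Inspiron 15 7559 notebook. BinDash is released under the Apache 2.0 license at https://github.com/zhaoxiaofei/BinDash.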
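To make the terms in the summary concrete, here is a minimal Python sketch of one-permutation MinHash with b-bit truncation and a densification step for empty bins. It is an illustration under assumptions, not BinDash's C++ implementation: the function names, the use of BLAKE2b as the hash, the donor-bin rule, and the simple collision correction in `jaccard` are all choices made for this example, and the rolling-hash part is left out.

```python
import hashlib


def _hash64(data: bytes) -> int:
    """64-bit hash of a byte string (BLAKE2b is an arbitrary choice here)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def kmer_hashes(seq, k=21):
    """Yield one hash per canonical k-mer (strand is not preserved)."""
    comp = str.maketrans("ACGT", "TGCA")
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        revcomp = kmer.translate(comp)[::-1]
        yield _hash64(min(kmer, revcomp).encode())


def sketch(seq, k=21, nbins=2048, bbits=14):
    """One hash function, nbins buckets, one minimum per bucket."""
    bins = [None] * nbins
    for h in kmer_hashes(seq, k):
        idx, val = h % nbins, h // nbins
        if bins[idx] is None or val < bins[idx]:
            bins[idx] = val
    if all(v is None for v in bins):
        raise ValueError("sequence is shorter than k; nothing to sketch")

    # Densification: an empty bin borrows from a pseudo-randomly chosen
    # bin, retrying until it hits one that was filled by the data. The
    # donor choice depends only on (bin index, attempt), so two genomes
    # fill their empty bins in a comparable way.
    dense = list(bins)
    for i in range(nbins):
        attempt = 0
        while dense[i] is None:
            attempt += 1
            donor = _hash64(f"{i}:{attempt}".encode()) % nbins
            dense[i] = bins[donor]

    # b-bit MinHash: keep only the lowest bbits of every minimum.
    mask = (1 << bbits) - 1
    return [v & mask for v in dense]


def jaccard(a, b, bbits=14):
    """Estimate the Jaccard index from two sketches of equal length."""
    raw = sum(x == y for x, y in zip(a, b)) / len(a)
    chance = 2.0 ** -bbits  # probability that unrelated b-bit values agree
    return max(0.0, (raw - chance) / (1.0 - chance))
```

With the default `nbins=2048` and `bbits=14`, the sketch mirrors the default `--sketchsize64=32` and `--bbits=14` settings shown in the help text further down; whether the estimates match BinDash's numerically is not something this example claims.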
Installation
Build dependencies
git clone https://github.com/zhaoxiaofei/bindash.git
cd bindash/
mkdir release && cd release
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j
> ./bindash
# ./bindash
Usage:
./bindash <commmand> [options] [arguments ...]
Commands:
sketch: reduce mutiple genomes into one sketch.
A genome corresponds to a input sequence file.
A sketch consists of a set of output files.
dist: estimate distance (and relevant statistics) between
genomes in query sketch and genomes in target-sketch.
Query and target sketches are generated by the sketch command.
exact: estimate distance (and relevant statistics) between
genomes corresponding to input files.
Notes:
To see command-specific usage, please enter
./bindash command --help
To see version information, please enter
./bindash --version
The format for options is --NAME=VALUE
> ./bindash sketch --help
# ./bindash sketch --help
Running ./bindash commit 78e0d46-clean
Usage: ./bindash sketch [options] [arguments ...]
Arguments:
Zero or more filenames. If zero filenames, then read from each line in listfname.
Each filename specifies a path to a sequence file.
Options with [default values]:
--help : Show this help message.
--listfname: Name of the file associating consecutive sequences to genomes (including metagenomes and pangenomes).
Each line of this file has the following format:
"Path-to-a-sequence-file(F) <TAB> [genome-name(G) <TAB> number-of-consecutive-sequences(N) ...]".
If only F is provided, then use F as G and let N be the number of sequences in N [-]
--nthreads : This many threads will be spawned for processing. [1]
--minhashtype : Type of minhash.
-1 means perfect hash function for nucleotides where 5^(kmerlen) < 2^63.
0 means one hash-function with multiple min-values.
1 means multiple hash-functions and one min-value per function.
2 means one hash-function with partitionned buckets. [2]
--bbits : Number of bits kept as in b-bits minhash. [14]
--kmerlen : K-mer length used to generate minhash values. [21]
--sketchsize64 : Sketch size divided by 64, or equivalently,
the number of sets (each consisting of 64 minhash values) per genome). [32]
--isstrandpreserved : Preserve strand, which means ignore reverse complement. [false]
--iscasepreserved : Preserve case, which means the lowercase and uppercase versions of the
same letter are treated as two different letters. [false]
--randseed : Seed to provide to the hash function. [41].
--outfname : Name of the file containing sketches as output [sketch-at-time-1584151157 (time-dependent)].
Notes:
"-" (without quotes) means stdin.
For general usage, please enter
./bindash --help
The following is an example of options: --nthreads=8
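The numeric options above lend themselves to a quick sanity check. The following Python sketch works out how many minhash values a genome gets from `--sketchsize64`, how many bytes those values would need at `--bbits` bits each, and the largest `--kmerlen` that satisfies the 5^(kmerlen) < 2^63 condition for `--minhashtype=-1`. The byte count assumes tightly packed values with no file header, which is an assumption of this example and not a statement about BinDash's on-disk format; the function names are hypothetical.

```python
def sketch_values(sketchsize64=32):
    """Number of minhash values per genome (sketch size = sketchsize64 * 64)."""
    return sketchsize64 * 64


def sketch_bytes(sketchsize64=32, bbits=14):
    """Bytes needed for the values alone, assuming tight bit packing.

    64 * bbits is always a multiple of 8, so the division is exact.
    """
    return sketch_values(sketchsize64) * bbits // 8


def max_perfect_hash_kmerlen():
    """Largest k with 5**k < 2**63 (the condition for --minhashtype=-1)."""
    k = 0
    while 5 ** (k + 1) < 2 ** 63:
        k += 1
    return k
```

With the defaults, `sketch_values()` is 2048, so lowering `--bbits` or `--sketchsize64` shrinks the per-genome footprint proportionally under this packing assumption.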
> ./bindash dist --help
# ./bindash dist --help
Running ./bindash commit 78e0d46-clean
Usage: ./bindash dist [options] query-sketch [target-sketch]
Query-sketch and target-sketch: sketches used as query and target.
Sketches are generated by "./bindash sketch" (without quotes).
If target-sketch is omitted, then query-sketch is used as both query and target.
Options:
--help : Show this help message.
--ithres : If intersection(A, B) has less than this number of elements, then set the intersection to empty set so that the resulting Jaccard-index is zero. [2]
--mthres : Only results with at most this mutation distance are reported [2.5]
--nneighbors : Only this number of best-hit results per query are reported.
If this value is zero then report all. [0].
--nthreads : This many threads will be spawned for processing. [1]
--outfname : The output file comparing the query and target sketches.
The ouput file contains the following tab-separated fields per result line:
query-sketch, target-sketch, mutation-distance, p-value, and jaccard-index. [-].
--pthres : only results with at most this p-value are reported. [1.0001]
Note:
If target-sketch is omitted and --nneighbors is zero,
then distance from genome A to genome B is the same as distance from B to A.
In this case, only one record is reported per set of two genomes due to reflectivity.
"-" (without quotes) means stdout.
For general usage, please enter
./bindash --help
The following is an example of options: --mthres=0.2
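Since the help text lists the output as five tab-separated fields per line, post-processing can be sketched in a few lines of Python. The record type and function names below are hypothetical, and the Jaccard field is kept as a raw string because the help text does not say how it is rendered.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class DistRecord(NamedTuple):
    query: str
    target: str
    mutation_distance: float
    p_value: float
    jaccard_index: str  # kept verbatim; exact formatting is not assumed


def parse_dist(lines: Iterable[str]) -> Iterator[DistRecord]:
    """Parse result lines from `bindash dist` into records.

    Lines that do not have exactly five tab-separated fields are skipped.
    """
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != 5:
            continue
        query, target, dist, pval, jaccard = row
        yield DistRecord(query, target, float(dist), float(pval), jaccard)


def closest_hits(records: Iterable[DistRecord], mthres: float):
    """Keep records at or below a mutation distance, nearest first."""
    kept = [r for r in records if r.mutation_distance <= mthres]
    return sorted(kept, key=lambda r: r.mutation_distance)
```

A typical use would be `closest_hits(parse_dist(open("dist.tsv")), mthres=0.05)`, where `dist.tsv` is a hypothetical file written with `--outfname=dist.tsv`. The same filtering can also be done by BinDash itself through `--mthres` and `--nneighbors`.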
Usage
1. Create the sketch files
bindash sketch --outfname=genomeA.sketch genomeA.fasta
bindash sketch --outfname=genomeB.sketch genomeB.fasta
bindash sketch --outfname=genomeC.sketch genomeC.fasta
2. Compare the sketch files
bindash dist genomeA.sketch genomeB.sketch
bindash dist genomeA.sketch genomeC.sketch
See the GitHub repository for a description of the output.
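According to the help text above, `sketch` can also place several genomes in one sketch (reading paths from `--listfname` when no filenames are given), and `dist` compares a sketch with itself when the target is omitted. The Python sketch below only builds the corresponding command lines in the documented `--NAME=VALUE` form; the helper names and the file names in the usage note are hypothetical, and nothing here is run against a real BinDash binary.

```python
def listfile_text(fasta_paths):
    """One path per line; with only the path given, the help text says
    the path itself is used as the genome name."""
    return "".join(f"{path}\n" for path in fasta_paths)


def sketch_cmd(listfname, outfname, nthreads=1, exe="bindash"):
    """Command line that sketches every genome listed in listfname."""
    return [
        exe, "sketch",
        f"--listfname={listfname}",
        f"--outfname={outfname}",
        f"--nthreads={nthreads}",
    ]


def dist_cmd(query, target=None, outfname=None, nthreads=1, exe="bindash"):
    """Command line for dist; omitting target compares query with itself."""
    cmd = [exe, "dist", f"--nthreads={nthreads}"]
    if outfname is not None:
        cmd.append(f"--outfname={outfname}")
    cmd.append(query)
    if target is not None:
        cmd.append(target)
    return cmd
```

For example, after writing `listfile_text(["genomeA.fasta", "genomeB.fasta", "genomeC.fasta"])` to a file such as `genomes.txt`, the lists returned by `sketch_cmd("genomes.txt", "genomes.sketch")` and `dist_cmd("genomes.sketch")` could be passed to `subprocess.run(cmd, check=True)` to get an all-against-all comparison in two steps.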
Citation
BinDash, software for fast genome distance estimation on a typical personal laptop
XiaoFei Zhao
Bioinformatics, Volume 35, Issue 4, 15 February 2019, Pages 671–673