A collection of notes on informatics for HTS (NGS).


2020 5/25: added docker link, added help output, revised unclear sentences





murasaki genome



2020 5/26: addendum

docker pull biocontainers/murasaki:v1.68.6-8b1-deb_cv1

$ docker run --rm -it biocontainers/murasaki:v1.68.6-8b1-deb_cv1 murasaki -h

Usage: murasaki -p<patternSpec> [options] seq1 [seq2 [seq3 ... ]]


  --pattern|-p   = seed pattern (eg. 11101001010011011).

                    using the format [<w>:<l>] automatically generates a

                    random pattern of weight <w> and length <l>

  --directory|-d = output directory (default: output)

  --name|-n      = alignment name (default: test)

  --quickhash|-q = specify a hashing function:

                    0 - adaptive with S-boxes

                    1 - don't pack bits to make hash (use first word only)

                    2 - naively use the first hashbits worth of pattern

                    3 - adaptivevely find a good hash (default)

                    **experimental CryptoPP hashes**

                    4 - MD5

                    5 - SHA1

                    6 - Whirlpool

                    7 - CRC-32

                    8 - Adler-32

  --hashbits|-b  = use n bit hashes (for n's of 1 to WORDSIZE. default 26)

  --hashtype|-t  = select hash table data structure to use:

                    OpenHash  - open sub-word packing of hashbits

                    EcoHash   - chained sub-word packing of hashbits (default)

                    ArrayHash - malloc/realloc (fast but fragmenty)

                    MSetHash  - memory exorbanant, almost pointless.

  --probing      = 0 - linear, 1 - quadratic (default)

  --hitfilter|-h = minimum number of hits to be outputted as an anchor

                   (default 1)

  --histogram|-H = histogram computation level: (-H alone implies -H1)

                    0 - no histogram (default)

                    1 - basic bucketsize/bucketcount histogram data

                    2 - bucket-based scores to anchors.detils

                    3 - perbucket count data

                    4 - perbucket + perpattern count data

  --repeatmask|-r= skip repeat masked data (ie: lowercase atgc)

  --seedfilter|-f= skip seeds that occur more than N times

  --hashfilter|-m= like --seedfilter but works on hash keys instead of

                   seeds. May cause some collateral damage to otherwise

                   unique seeds, but it's faster. Also non-sequence-specific

                   so more like at best 1/N the tolerance of seedfilter.

  --rseed|-s     = random number seed for non-deterministic algorithms

                   (ie: the adative hash-finding). If you're doing any

                   performance comparisons, it's probably imperative that you

                   use the same seed for each run of the same settings.

                   Default is obtained from time() (ie: seconds since 1970).

  --skipfwd|-F   = Skip forward facing matches

  --skiprev|-R   = Skip reverse facing matches

  --skip1to1|-1  = Skip matches along the 1:1 line (good for comparing to self)

  --hashonly|-Q  = Hash Only. no anchors. just statistics.

  --hashskip|-S  = Hashes every n bases. (Default is 1. ie all)

                   Not supplying any argument increments the skip amount by 1.

  --hashCache|-c = Caches hash tables in the directory. (default: cache/)

  --join|-j      = Join anchors within n bases of eachother (default: 0)

                   Specifying a negative n implies -n*patternLength

  --bitscore|-B  = toggles compututation of a bitscore for all anchors

                   (default is on)

  --memory|-M    = set the target amount of total memory

                    (either in gb or as % total memory)

  --seedterms|-T = toggles retention of seed terms (defaults to off)

                    (these are necessary for computing TF-IDF scores)

  --sectime|-e   = always display times in seconds

  --repeatmap|-i = toggles keeping of a repeat map when --mergefilter

                   is used (defaults to yes).

  --mergefilter|-Y = filter out matches which would would cause more than N

                     many anchors to be generated from 1 seed (default -Y100).

                     Use -Y0 to disable.

  --scorefilter    = set a minimum ungapped score for seeds

  --tfidf|-k       = perform accurate tfidf scoring from within murasaki

                     (requires extra memory at anchor generation time)

  --reverseotf|-o  = generate reverse complement on the fly (defaults to on)

  --rifts|-/       = allow anchors to skip N sequences (default 0)

  --islands|-%     = same as --rifts=S-N (where S is number of seqs)

  --fuzzyextend|-z = enable (default) or disable fuzzy extension of hits

  --fuzzyextendlosslimit|-Z = set the cutoff at which to stop extending

                      fuzzy hits (ie. the BLAST X parameter).

  --gappedanchors  = use gapped (yes) or ungapped (no (default)) anchors.

  --scorebyminimumpair = do anchor scoring by minimum pair when appropriate

                     (default). Alternative is mean (somewhat illogical, but

                     theoretically faster).

  --binaryseq      = enable (default) or disable binary sequence read/write


Adaptive has function related:

  --hasherFairEntropy = use more balanced entropy estimation (default: yes)

  --hasherCorrelationAdjust = adjust entropy estimates for nearby sources

        assuming some correlation (default: yes)

  --hasherTargetGACycles = GA cycle cutoff

  --hasherEntropyAgro = how aggressive to be about pursuing maximum

        entropy hash functions (takes a real. default is 1).

...and of course

  --verbose|-v   = increases verbosity

  --version|-V   = prints version information and quits

  --help|-?      = prints this help message and quits


Platform information:

Wordsize: 64 bits

sizeof(word): 8 bytes

Total Memory: 22.97 GB

Available Memory: 23.93 GB (104.20%)


Murasaki version 1.68.6 (LARGESEQ, CRYPTOPP)
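
According to the help text above, -p takes either an explicit 0/1 seed pattern or a <w>:<l> pair (weight and length). As a rough sketch for relating the two forms, the helper below reports the weight and length of an explicit pattern. The name pattern_spec is made up for this note and is not part of murasaki, and it assumes that "weight" means the number of 1 positions, as is usual for spaced seeds.

```shell
#!/bin/sh
# Hypothetical helper (not part of murasaki): print "<weight>:<length>"
# for an explicit 0/1 seed pattern, assuming weight = number of 1s.
pattern_spec() {
    pattern=$1
    # Reject anything other than a non-empty string of 0s and 1s.
    case $pattern in
        ''|*[!01]*) echo "not a 0/1 pattern: $pattern" >&2; return 1 ;;
    esac
    length=${#pattern}
    # Drop the 0s; what remains is one character per 1 position.
    ones=$(printf '%s' "$pattern" | tr -d '0')
    weight=${#ones}
    echo "$weight:$length"
}

# The example pattern from the help text; prints 10:17
pattern_spec 11101001010011011
```

So the example pattern in the help text has weight 10 and length 17, whereas -p 19:26 asks murasaki to generate a random pattern of weight 19 and length 26.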








./murasaki -p 19:26 -d output -n sample_name 7942.gbk GT-S.gbk 7002.gbk Leptolyngbya_dg5.gbk Nostoc_sp.PCC7120.gbk

# If using the docker image
docker run --rm -itv $PWD:/data/ -w /data biocontainers/murasaki:v1.68.6-8b1-deb_cv1 murasaki A.fasta B.fasta C.fasta -p 19:26 -d output
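
When the same docker invocation is reused for different sets of genomes, it can help to assemble the command in one place. The following is only a sketch: the function name murasaki_docker_cmd and its argument order are invented for this note, and it uses just the -p, -d and -n options listed in the help output. It prints the command instead of running it, so it can be checked before anything is executed.

```shell
#!/bin/sh
# Hypothetical wrapper (illustrative names only): build the docker command
# shown above for any set of input files in the current directory.
IMAGE=biocontainers/murasaki:v1.68.6-8b1-deb_cv1

murasaki_docker_cmd() {
    # usage: murasaki_docker_cmd <pattern> <outdir> <name> <seq1> [seq2 ...]
    if [ "$#" -lt 4 ]; then
        echo "usage: murasaki_docker_cmd <pattern> <outdir> <name> <seq>..." >&2
        return 1
    fi
    pattern=$1; outdir=$2; name=$3
    shift 3
    # Print the command rather than running it, so it can be inspected first.
    # The current directory is mounted at /data, as in the example above.
    echo "docker run --rm -v $PWD:/data -w /data $IMAGE murasaki -p $pattern -d $outdir -n $name $*"
}

murasaki_docker_cmd 19:26 output sample_name A.fasta B.fasta C.fasta
```

Once the printed command looks right, it can be copied into the terminal or piped to sh. File names or directories containing spaces would need extra quoting, which this sketch does not handle.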










Yasunori Osana @ University of the Ryukyus







Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes

Popendorf K, Tsuyoshi H, Osana Y, Sakakibara Y

PLoS One. 2010 Sep 24;5(9):e12651