macでインフォマティクス (Informatics on Mac)


A collection of informatics notes related to HTS (NGS).

Genome comparison with murasaki, and GMV for viewing the results

2020 5/25: added docker link, added help output, revised unclear sentences

 

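murasaki is a tool that rapidly searches for homologous regions across multiple genomes, and GMV is the viewer for inspecting the comparison results. Each region is drawn in its own color, so structural changes such as genome rearrangements are easy to see.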

 

Official site

murasaki genome

 

To install murasaki, please refer to this person's blog post.


2020 5/26 addendum

# Docker Hub (link)
docker pull biocontainers/murasaki:v1.68.6-8b1-deb_cv1

$ docker run --rm -it biocontainers/murasaki:v1.68.6-8b1-deb_cv1 murasaki -h

Usage: murasaki -p<patternSpec> [options] seq1 [seq2 [seq3 ... ]]

Options:

  --pattern|-p   = seed pattern (eg. 11101001010011011).

                    using the format [<w>:<l>] automatically generates a

                    random pattern of weight <w> and length <l>

  --directory|-d = output directory (default: output)

  --name|-n      = alignment name (default: test)

  --quickhash|-q = specify a hashing function:

                    0 - adaptive with S-boxes

                    1 - don't pack bits to make hash (use first word only)

                    2 - naively use the first hashbits worth of pattern

                    3 - adaptivevely find a good hash (default)

                    **experimental CryptoPP hashes**

                    4 - MD5

                    5 - SHA1

                    6 - Whirlpool

                    7 - CRC-32

                    8 - Adler-32

  --hashbits|-b  = use n bit hashes (for n's of 1 to WORDSIZE. default 26)

  --hashtype|-t  = select hash table data structure to use:

                    OpenHash  - open sub-word packing of hashbits

                    EcoHash   - chained sub-word packing of hashbits (default)

                    ArrayHash - malloc/realloc (fast but fragmenty)

                    MSetHash  - memory exorbanant, almost pointless.

  --probing      = 0 - linear, 1 - quadratic (default)

  --hitfilter|-h = minimum number of hits to be outputted as an anchor

                   (default 1)

  --histogram|-H = histogram computation level: (-H alone implies -H1)

                    0 - no histogram (default)

                    1 - basic bucketsize/bucketcount histogram data

                    2 - bucket-based scores to anchors.detils

                    3 - perbucket count data

                    4 - perbucket + perpattern count data

  --repeatmask|-r= skip repeat masked data (ie: lowercase atgc)

  --seedfilter|-f= skip seeds that occur more than N times

  --hashfilter|-m= like --seedfilter but works on hash keys instead of

                   seeds. May cause some collateral damage to otherwise

                   unique seeds, but it's faster. Also non-sequence-specific

                   so more like at best 1/N the tolerance of seedfilter.

  --rseed|-s     = random number seed for non-deterministic algorithms

                   (ie: the adative hash-finding). If you're doing any

                   performance comparisons, it's probably imperative that you

                   use the same seed for each run of the same settings.

                   Default is obtained from time() (ie: seconds since 1970).

  --skipfwd|-F   = Skip forward facing matches

  --skiprev|-R   = Skip reverse facing matches

  --skip1to1|-1  = Skip matches along the 1:1 line (good for comparing to self)

  --hashonly|-Q  = Hash Only. no anchors. just statistics.

  --hashskip|-S  = Hashes every n bases. (Default is 1. ie all)

                   Not supplying any argument increments the skip amount by 1.

  --hashCache|-c = Caches hash tables in the directory. (default: cache/)

  --join|-j      = Join anchors within n bases of eachother (default: 0)

                   Specifying a negative n implies -n*patternLength

  --bitscore|-B  = toggles compututation of a bitscore for all anchors

                   (default is on)

  --memory|-M    = set the target amount of total memory

                    (either in gb or as % total memory)

  --seedterms|-T = toggles retention of seed terms (defaults to off)

                    (these are necessary for computing TF-IDF scores)

  --sectime|-e   = always display times in seconds

  --repeatmap|-i = toggles keeping of a repeat map when --mergefilter

                   is used (defaults to yes).

  --mergefilter|-Y = filter out matches which would would cause more than N

                     many anchors to be generated from 1 seed (default -Y100).

                     Use -Y0 to disable.

  --scorefilter    = set a minimum ungapped score for seeds

  --tfidf|-k       = perform accurate tfidf scoring from within murasaki

                     (requires extra memory at anchor generation time)

  --reverseotf|-o  = generate reverse complement on the fly (defaults to on)

  --rifts|-/       = allow anchors to skip N sequences (default 0)

  --islands|-%     = same as --rifts=S-N (where S is number of seqs)

  --fuzzyextend|-z = enable (default) or disable fuzzy extension of hits

  --fuzzyextendlosslimit|-Z = set the cutoff at which to stop extending

                      fuzzy hits (ie. the BLAST X parameter).

  --gappedanchors  = use gapped (yes) or ungapped (no (default)) anchors.

  --scorebyminimumpair = do anchor scoring by minimum pair when appropriate

                     (default). Alternative is mean (somewhat illogical, but

                     theoretically faster).

  --binaryseq      = enable (default) or disable binary sequence read/write

 

Adaptive has function related:

  --hasherFairEntropy = use more balanced entropy estimation (default: yes)

  --hasherCorrelationAdjust = adjust entropy estimates for nearby sources

        assuming some correlation (default: yes)

  --hasherTargetGACycles = GA cycle cutoff

  --hasherEntropyAgro = how aggressive to be about pursuing maximum

        entropy hash functions (takes a real. default is 1).

...and of course

  --verbose|-v   = increases verbosity

  --version|-V   = prints version information and quits

  --help|-?      = prints this help message and quits

 

Platform information:

Wordsize: 64 bits

sizeof(word): 8 bytes

Total Memory: 22.97 GB

Available Memory: 23.93 GB (104.20%)

 

Murasaki version 1.68.6 (LARGESEQ, CRYPTOPP)
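
According to the help text above, -p takes either an explicit seed pattern (e.g. 11101001010011011) or a [<w>:<l>] spec that generates a random pattern of weight <w> and length <l>. The following is a minimal sketch for reading such a pattern, assuming the usual spaced-seed convention in which 1 marks a position that must match, 0 is a wildcard, and the weight is the number of 1s. The function names pattern_length and pattern_weight are hypothetical helpers for illustration and are not part of murasaki.

```shell
#!/bin/sh
# Illustration only: inspect a seed pattern string like the one in the help text.
# Assumption: weight = number of 1s, length = total number of positions.

# Print the total number of positions in the pattern.
pattern_length() {
    printf '%s\n' "${#1}"
}

# Print the number of 1s in the pattern.
pattern_weight() {
    ones=$(printf '%s' "$1" | tr -cd '1')
    printf '%s\n' "${#ones}"
}

pattern_length 11101001010011011   # prints 17
pattern_weight 11101001010011011   # prints 10
```

Under that reading, the help text's example pattern corresponds to a weight of 10 over a length of 17, and -p 19:26 asks for a random pattern with 19 ones spread over 26 positions.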

 

 

 

Run

Specify FASTA or GenBank files of genomes or plasmids as input.

 

Comparing multiple gbk files (five in this example).

./murasaki -p 19:26 -d output -n sample_name 7942.gbk GT-S.gbk 7002.gbk Leptolyngbya_dg5.gbk Nostoc_sp.PCC7120.gbk

# If using the docker image
docker run --rm -it -v "$PWD":/data -w /data biocontainers/murasaki:v1.68.6-8b1-deb_cv1 murasaki A.fasta B.fasta C.fasta -p 19:26 -d output
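
When the same comparison is repeated with different inputs, it can be convenient to assemble the docker command from its parts. The sketch below only prints the command line and does not run anything. The function name murasaki_docker_cmd and its argument order are hypothetical; the image tag and the -p, -d and -n options are taken from the commands and help text in this post.

```shell
#!/bin/sh
# Hypothetical helper: print a docker command that runs murasaki on the
# given sequence files, mounting the current directory at /data.
# Note: file names containing spaces are not quoted in the printed command.

IMAGE="biocontainers/murasaki:v1.68.6-8b1-deb_cv1"

murasaki_docker_cmd() {
    # $1 = pattern spec, $2 = output directory, $3 = alignment name,
    # remaining arguments = sequence files (FASTA or GenBank)
    if [ "$#" -lt 4 ]; then
        echo "usage: murasaki_docker_cmd PATTERN OUTDIR NAME SEQ1 [SEQ2 ...]" >&2
        return 1
    fi
    pattern=$1
    outdir=$2
    name=$3
    shift 3
    # "$PWD" is printed literally so it is expanded when the command is run.
    printf 'docker run --rm -v "$PWD":/data -w /data %s murasaki -p %s -d %s -n %s' \
        "$IMAGE" "$pattern" "$outdir" "$name"
    for f in "$@"; do
        printf ' %s' "$f"
    done
    printf '\n'
}

murasaki_docker_cmd 19:26 output sample_name A.fasta B.fasta C.fasta
```

Review the printed line and paste it into the terminal to run it. The options are placed before the sequence files here, following the order of the Usage line in the help output.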

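Loading the output directory (output) into GMV and rendering it produces a figure like the following.

[Figure: GMV rendering of the murasaki output]

Comparison of four very closely related cyanobacteria. ANI values are shown at the right edge.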

 

 

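This is roughly what a run looks like.

In the video, two GenBank files are selected, murasaki is used to search for homologous regions, and the resulting output folder is displayed in GMV.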

 

Algorithm description and tutorial by the authors

Yasunori Osana @ University of the Ryukyus

I will also link this one.

Yasunori Osana @ University of the Ryukyus

 

 

 

Citation

Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes

Popendorf K, Tsuyoshi H, Osana Y, Sakakibara Y

PLoS One. 2010 Sep 24;5(9):e12651