macでインフォマティクス (Informatics on Mac)

A collection of notes on informatics for HTS (NGS).

GenMap: computing genome mappability

 

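Computing the uniqueness of k-mers at each position of a genome while allowing for up to e mismatches is computationally challenging. It is nevertheless essential for many biological applications, such as designing guide RNAs for CRISPR experiments. More formally, the uniqueness, or (k, e)-mappability, can be described for each position as the reciprocal of how often the k-mer at that position occurs in the genome approximately, i.e., with up to e mismatches.
This work presents GenMap, a fast method for computing the (k, e)-mappability. The mappability algorithm is extended so that it can also be computed across multiple genomes. By identifying approximate k-mers that are unique to one genome or present in all genomes, this makes it possible to compute marker sequences and to find candidates for probe design. GenMap supports several output formats, including binary output, wig files, and bed files, as well as csv files that export the locations of the approximate k-mers for each genomic position.
GenMap can be installed via bioconda. Binaries and the C++ source code are available at https://github.com/cpockrandt/genmap.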
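
The (k, e)-mappability definition above can be made concrete with a small brute-force sketch. This is not how GenMap computes it (GenMap uses an index and is far faster); it only illustrates the definition under two assumptions: mismatches are counted as Hamming distance, and only the forward strand is searched, which corresponds to running GenMap with -nc. The function names are made up for this illustration.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def brute_force_mappability(genome, k, e):
    """Return the (k, e)-mappability for every k-mer start position.

    Frequency = number of k-mers in the genome within Hamming distance e
    (the k-mer itself included); mappability = 1 / frequency.
    Forward strand only; quadratic time, so for toy inputs only.
    """
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    values = []
    for query in kmers:
        frequency = sum(1 for target in kmers if hamming(query, target) <= e)
        values.append(1.0 / frequency)
    return values


if __name__ == "__main__":
    toy = "ACGTACGA"
    print(brute_force_mappability(toy, 4, 0))
    print(brute_force_mappability(toy, 4, 1))
```

In the toy sequence, ACGT and ACGA differ by a single base, so both positions are unique with e = 0 but are counted twice once one mismatch is allowed.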

 

wiki

https://github.com/cpockrandt/genmap/wiki

 

Installation

See GitHub for the prebuilt binaries and for building from source.

# bioconda
conda install -c bioconda genmap -y

$ genmap index -h
GenMap index
============

SYNOPSIS

DESCRIPTION
    GenMap is a tool for fast and exact computation of genome mappability and can also be used for multiple genomes,
    e.g., to search for marker sequences.

    Detailed information is available in the wiki: <https://github.com/cpockrandt/genmap/wiki>

    Index creation. Only supports DNA and RNA (A, C, G, T/U, N). Other characters will be converted to N.

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,
          0, OFF, FALSE, F, and NO. Default: 1.
    --version
          Display version information.
    --copyright
          Display long copyright information.
    -F, --fasta-file INPUT_FILE
          Path to the fasta file. Valid filetypes are: .fsa, .fna, .fastq, .fasta, .fas, and .fa.
    -FD, --fasta-directory INPUT_FILE
          Path to the directory of fasta files (indexes all .fsa .fna .fastq .fasta .fas and .fa files in there, not
          including subdirectories).
    -I, --index OUTPUT_FILE
          Path to the index.
    -A, --algorithm STRING
          Algorithm for suffix array construction (needed for the FM index). One of radix and skew. Default: skew.
    -S, --sampling INTEGER
          Sampling rate of suffix array In range [1..64]. Default: 10.
    -v, --verbose
          Outputs some additional information on the constructed index.

VERSION
    Last update: May  7 2020
    GenMap index version: 1.2.0
    SeqAn version: 2.4.1

LEGAL
    GenMap index Copyright: 2019 Christopher Pockrandt, released under the 3-clause-BSD; 2016-2019 Knut Reinert and Freie Universität Berlin, released under the 3-clause-BSD
    SeqAn Copyright: 2006-2015 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
    In your academic works please cite: Pockrandt et al (2019). GenMap: Fast and Exact Computation of Genome Mappability.
    doi: https://doi.org/10.1101/611160
    For full copyright and/or warranty information see --copyright.

$ genmap map -h
GenMap map
==========

SYNOPSIS

DESCRIPTION
    GenMap is a tool for fast and exact computation of genome mappability and can also be used for multiple genomes,
    e.g., to search for marker sequences.

    Detailed information is available in the wiki: <https://github.com/cpockrandt/genmap/wiki>

    Tool for computing the mappability/frequency on nucleotide sequences. It supports multi-fasta files with DNA or
    RNA alphabets (A, C, G, T/U, N). Frequency is the absolute number of occurrences, mappability is the inverse,
    i.e., 1 / frequency-value.

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,
          0, OFF, FALSE, F, and NO. Default: 1.
    --version
          Display version information.
    --copyright
          Display long copyright information.
    -I, --index INPUT_FILE
          Path to the index
    -O, --output OUTPUT_FILE
          Path to output directory (or path to filename if only a single fasta files has been indexed)
    -E, --errors INTEGER
          Number of errors
    -K, --length INTEGER
          Length of k-mers
    -S, --selection OUTPUT_FILE
          Path to a bed file (3 columns: chromosome, start, end) with selected coordinates to compute the mappability
          (e.g., exon coordinates)
    -nc, --no-reverse-complement
          Searches the k-mers *NOT* on the reverse strand.
    -ep, --exclude-pseudo
          Mappability only counts the number of fasta files that contain the k-mer, not the total number of
          occurrences (i.e., neglects so called- pseudo genes / sequences). This has no effect on the csv output.
    -fs, --frequency-small
          Stores frequencies using 8 bit per value (max. value 255) instead of the mappbility using a float per value
          (32 bit). Applies to all formats (raw, txt, wig, bedgraph).
    -fl, --frequency-large
          Stores frequencies using 16 bit per value (max. value 65535) instead of the mappbility using a float per
          value (32 bit). Applies to all formats (raw, txt, wig, bedgraph).
    -r, --raw
          Output raw files, i.e., the binary format of std::vector<T> with T = float, uint8_t or uint16_t (depending
          on whether -fs or -fl is set). For each fasta file that was indexed a separate file is created. File type is
          .map, .freq8 or .freq16.
    -t, --txt
          Output human readable text files, i.e., the mappability respectively frequency values separated by spaces
          (depending on whether -fs or -fl is set). For each fasta file that was indexed a separate txt file is
          created. WARNING: This output is significantly larger than raw files.
    -w, --wig
          Output wig files, e.g., for adding a custom feature track to genome browsers. For each fasta file that was
          indexed a separate wig file and chrom.size file is created.
    -bg, --bedgraph
          Output bedgraph files. For each fasta file that was indexed a separate bedgraph-file is created.
    -d, --csv
          Output a detailed csv file reporting the locations of each k-mer (WARNING: This will produce large files and
          makes computing the mappability significantly slower).
    -m, --memory-mapping
          Turns memory-mapping on, i.e. the index is not loaded into RAM but accessed directly from secondary-memory.
          This may increase the overall running time, but do NOT use it if the index lies on network storage.
    -T, --threads INTEGER
          Number of threads Default: 12.
    -v, --verbose
          Outputs some additional information.

VERSION
    Last update: May  7 2020
    GenMap map version: 1.2.0
    SeqAn version: 2.4.1

LEGAL
    GenMap map Copyright: 2019 Christopher Pockrandt, released under the 3-clause-BSD; 2016-2019 Knut Reinert and Freie Universität Berlin, released under the 3-clause-BSD
    SeqAn Copyright: 2006-2015 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
    In your academic works please cite: Pockrandt et al (2019). GenMap: Fast and Exact Computation of Genome Mappability.
    doi: https://doi.org/10.1101/611160
    For full copyright and/or warranty information see --copyright.

 

 

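Usage

1. Indexing

There are two algorithms for index construction: one uses RAM (radix) and the other uses secondary memory (skew). Because radix is comparison-based and considerably slower on repetitive data, skew (the default in this version) is recommended. Multiple genomes can also be indexed, for example by passing a directory of FASTA files with -FD (see the wiki).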

genmap index -F genome.fasta -I index_folder
  • -F     Path to the fasta file. Valid filetypes are: .fsa, .fna, .fastq, .fasta, .fas, and .fa.
  • -A    Algorithm for suffix array construction (needed for the FM index). One of radix and skew. Default: skew.
  • -I      Path to the index
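
As background on why an index is built before mappability is computed: once the suffixes of a sequence are sorted, all exact occurrences of a k-mer sit next to each other and can be counted with a binary search. The sketch below uses a naive sort-based suffix array purely for illustration. It is not the radix or skew construction that GenMap offers, it leaves out the FM index layer and suffix array sampling entirely, and it handles exact matches only. The function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_suffix_array(text):
    """Naive suffix array: start positions sorted by the suffix they begin.

    Fine for toy inputs; real tools use linear-time constructions instead.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def count_exact_occurrences(text, suffix_array, pattern):
    """Count exact occurrences of pattern using the sorted suffix order."""
    # Truncating sorted suffixes to len(pattern) keeps them in sorted order,
    # so the matches form one contiguous block found by binary search.
    prefixes = [text[i:i + len(pattern)] for i in suffix_array]
    return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)


if __name__ == "__main__":
    toy = "ACGTACGA"
    sa = build_suffix_array(toy)
    print(sa)
    print(count_exact_occurrences(toy, sa, "ACG"))
```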

Output

[Figure: screenshot of the indexing output]

 

2. Mappability

genmap map -K 30 -E 2 -I index_folder -O out -t -w -bg
  • -K    Length of k-mers
  • -E    Number of errors
  • -I     Path to the index
  • -O    Path to output directory (or path to filename if only a single fasta files has been indexed)
  • -t     Output human readable text files, i.e., the mappability respectively frequency values separated by spaces(depending on whether -fs or -fl is set). For each fasta file that was indexed a separate txt file is created. WARNING: This output is significantly larger than raw files.
  • -w   Output wig files, e.g., for adding a custom feature track to genome browsers. For each fasta file that was indexed a separate wig file and chrom.size file is created.
  • -bg    Output bedgraph files. For each fasta file that was indexed a separate bedgraph-file is created.

Output (single genome)

[Figure: screenshot of the output files]

 

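I ran this on the Arabidopsis thaliana genome and loaded the wig and bedgraph files into IGV. Shown here is chr1; mappability drops markedly in the centromeric region.

[Figure: IGV view of the chr1 mappability tracks]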
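
Regions of low mappability like the one seen in IGV can also be pulled out of the bedgraph output with a few lines of Python. The sketch below assumes the standard four-column bedGraph layout (chromosome, start, end, value, whitespace-separated) and merges adjacent intervals whose value falls below a threshold. The threshold of 0.5 and the function name are arbitrary choices for illustration, and the sample lines are made-up values, not real GenMap output.

```python
def low_mappability_regions(lines, threshold=0.5):
    """Merge adjacent bedGraph intervals whose value is below threshold.

    lines: iterable of bedGraph text lines (chrom, start, end, value).
    Returns a list of (chrom, start, end) tuples.
    """
    regions = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and optional header lines.
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        if float(value) >= threshold:
            continue
        start, end = int(start), int(end)
        if regions and regions[-1][0] == chrom and regions[-1][2] == start:
            # Extend the previous region when this interval touches it.
            regions[-1] = (chrom, regions[-1][1], end)
        else:
            regions.append((chrom, start, end))
    return regions


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    sample = [
        "chr1\t0\t10\t1.0",
        "chr1\t10\t20\t0.25",
        "chr1\t20\t30\t0.1",
        "chr1\t30\t40\t1",
        "chr2\t0\t5\t0.333",
    ]
    print(low_mappability_regions(sample))
```

To use it on real output, pass an open file handle for the bedgraph file written by genmap map in place of the sample list; the exact file name depends on what was given to -O.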

Reference

GenMap: Ultra-fast Computation of Genome Mappability
Christopher Pockrandt, Mai Alzamel, Costas S Iliopoulos, Knut Reinert
Bioinformatics, Published: 04 April 2020