GenMap: a tool for examining genome mappability

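Computing the uniqueness of the k-mer at each position of a genome while allowing up to e mismatches is computationally hard. It is nevertheless essential for many biological applications, such as designing guide RNAs for CRISPR experiments. More formally, the uniqueness, or (k, e)-mappability, of a position is the reciprocal of the number of times the k-mer starting there occurs in the genome, where an occurrence may differ by up to e mismatches.
This work presents GenMap, a fast method for computing (k, e)-mappability. The mappability algorithm is extended so that it can also be computed across multiple genomes. By identifying approximate k-mers that are unique to one genome or present in all genomes, this makes it possible to compute marker sequences and to find candidates for probe design. GenMap supports several output formats: binary output, wig files, and bed files, as well as csv files that export the locations of the approximate k-mers for each genomic position.
GenMap can be installed via bioconda. Binaries and the C++ source code are available at https://github.com/cpockrandt/genmap.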
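
To make the definition of (k, e)-mappability concrete, here is a minimal brute-force sketch in Python. It is not GenMap's algorithm (GenMap uses an FM index and is far faster); it only restates the definition. The function names and the toy sequence are made up for illustration, and the sketch assumes a single sequence searched on the forward strand only.

```python
def hamming_within(a: str, b: str, e: int) -> bool:
    """Return True if equal-length strings a and b differ in at most e positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > e:
                return False
    return True


def naive_mappability(genome: str, k: int, e: int) -> list[float]:
    """Brute-force (k, e)-mappability on the forward strand only.

    For each start position i, count the positions j whose k-mer matches
    the k-mer at i with at most e mismatches (j == i always counts), and
    return the reciprocal of that count.
    """
    n_kmers = len(genome) - k + 1
    kmers = [genome[i:i + k] for i in range(n_kmers)]
    result = []
    for query in kmers:
        frequency = sum(hamming_within(query, other, e) for other in kmers)
        result.append(1.0 / frequency)
    return result


if __name__ == "__main__":
    toy = "ACGTACGTTT"  # made-up sequence, for illustration only
    print(naive_mappability(toy, k=4, e=0))
    print(naive_mappability(toy, k=4, e=1))
```

A value of 1.0 means the k-mer is unique under the chosen error budget; smaller values mean it has more approximate occurrences. The quadratic running time of this sketch is the reason a dedicated tool is needed for real genomes.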

wiki

https://github.com/cpockrandt/genmap/wiki

Installation

See GitHub for binary releases and for building from source.

#bioconda
conda install -c bioconda genmap -y

$ genmap index -h
GenMap index
============

SYNOPSIS

DESCRIPTION
    GenMap is a tool for fast and exact computation of genome mappability and can also be used for multiple genomes, e.g., to search for marker sequences.

    Detailed information is available in the wiki: <https://github.com/cpockrandt/genmap/wiki>

    Index creation. Only supports DNA and RNA (A, C, G, T/U, N). Other characters will be converted to N.

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO. Default: 1.
    --version
          Display version information.
    --copyright
          Display long copyright information.
    -F, --fasta-file INPUT_FILE
          Path to the fasta file. Valid filetypes are: .fsa, .fna, .fastq, .fasta, .fas, and .fa.
    -FD, --fasta-directory INPUT_FILE
          Path to the directory of fasta files (indexes all .fsa .fna .fastq .fasta .fas and .fa files in there, not including subdirectories).
    -I, --index OUTPUT_FILE
          Path to the index.
    -A, --algorithm STRING
          Algorithm for suffix array construction (needed for the FM index). One of radix and skew. Default: skew.
    -S, --sampling INTEGER
          Sampling rate of suffix array In range [1..64]. Default: 10.
    -v, --verbose
          Outputs some additional information on the constructed index.

VERSION
    Last update: May 7 2020
    GenMap index version: 1.2.0
    SeqAn version: 2.4.1

LEGAL
    GenMap index Copyright: 2019 Christopher Pockrandt, released under the 3-clause-BSD; 2016-2019 Knut Reinert and Freie Universität Berlin, released under the 3-clause-BSD
    SeqAn Copyright: 2006-2015 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
    In your academic works please cite: Pockrandt et al (2019). GenMap: Fast and Exact Computation of Genome Mappability.
    doi: https://doi.org/10.1101/611160
    For full copyright and/or warranty information see --copyright.

$ genmap map -h
GenMap map
==========

SYNOPSIS

DESCRIPTION
    GenMap is a tool for fast and exact computation of genome mappability and can also be used for multiple genomes, e.g., to search for marker sequences.

    Detailed information is available in the wiki: <https://github.com/cpockrandt/genmap/wiki>

    Tool for computing the mappability/frequency on nucleotide sequences. It supports multi-fasta files with DNA or RNA alphabets (A, C, G, T/U, N). Frequency is the absolute number of occurrences, mappability is the inverse, i.e., 1 / frequency-value.

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES, 0, OFF, FALSE, F, and NO. Default: 1.
    --version
          Display version information.
    --copyright
          Display long copyright information.
    -I, --index INPUT_FILE
          Path to the index
    -O, --output OUTPUT_FILE
          Path to output directory (or path to filename if only a single fasta files has been indexed)
    -E, --errors INTEGER
          Number of errors
    -K, --length INTEGER
          Length of k-mers
    -S, --selection OUTPUT_FILE
          Path to a bed file (3 columns: chromosome, start, end) with selected coordinates to compute the mappability (e.g., exon coordinates)
    -nc, --no-reverse-complement
          Searches the k-mers *NOT* on the reverse strand.
    -ep, --exclude-pseudo
          Mappability only counts the number of fasta files that contain the k-mer, not the total number of occurrences (i.e., neglects so called- pseudo genes / sequences). This has no effect on the csv output.
    -fs, --frequency-small
          Stores frequencies using 8 bit per value (max. value 255) instead of the mappbility using a float per value (32 bit). Applies to all formats (raw, txt, wig, bedgraph).
    -fl, --frequency-large
          Stores frequencies using 16 bit per value (max. value 65535) instead of the mappbility using a float per value (32 bit). Applies to all formats (raw, txt, wig, bedgraph).
    -r, --raw
          Output raw files, i.e., the binary format of std::vector<T> with T = float, uint8_t or uint16_t (depending on whether -fs or -fl is set). For each fasta file that was indexed a separate file is created. File type is .map, .freq8 or .freq16.
    -t, --txt
          Output human readable text files, i.e., the mappability respectively frequency values separated by spaces (depending on whether -fs or -fl is set). For each fasta file that was indexed a separate txt file is created. WARNING: This output is significantly larger than raw files.
    -w, --wig
          Output wig files, e.g., for adding a custom feature track to genome browsers. For each fasta file that was indexed a separate wig file and chrom.size file is created.
    -bg, --bedgraph
          Output bedgraph files. For each fasta file that was indexed a separate bedgraph-file is created.
    -d, --csv
          Output a detailed csv file reporting the locations of each k-mer (WARNING: This will produce large files and makes computing the mappability significantly slower).
    -m, --memory-mapping
          Turns memory-mapping on, i.e. the index is not loaded into RAM but accessed directly from secondary-memory. This may increase the overall running time, but do NOT use it if the index lies on network storage.
    -T, --threads INTEGER
          Number of threads Default: 12.
    -v, --verbose
          Outputs some additional information.

VERSION
    Last update: May 7 2020
    GenMap map version: 1.2.0
    SeqAn version: 2.4.1

LEGAL
    GenMap map Copyright: 2019 Christopher Pockrandt, released under the 3-clause-BSD; 2016-2019 Knut Reinert and Freie Universität Berlin, released under the 3-clause-BSD
    SeqAn Copyright: 2006-2015 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
    In your academic works please cite: Pockrandt et al (2019). GenMap: Fast and Exact Computation of Genome Mappability.
    doi: https://doi.org/10.1101/611160
    For full copyright and/or warranty information see --copyright.

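Usage

1. Indexing

There are two algorithms for index construction: one works in RAM (radix) and the other uses secondary memory (skew). Because radix is comparison-based and considerably slower on repetitive data, skew (the default) is recommended. Multiple genomes can also be indexed (see the wiki).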

genmap index -F genome.fasta -I index_folder

-F Path to the fasta file. Valid filetypes are: .fsa, .fna, .fastq, .fasta, .fas, and .fa.
-A Algorithm for suffix array construction (needed for the FM index). One of radix and skew. Default: skew.
-I Path to the index

Output

[Figure: screenshot of the output]

2. Mappability

genmap map -K 30 -E 2 -I index_folder -O out -t -w -bg

-K Length of k-mers
-E Number of errors
-I Path to the index
-O Path to output directory (or path to filename if only a single fasta files has been indexed)
-t Output human readable text files, i.e., the mappability respectively frequency values separated by spaces(depending on whether -fs or -fl is set). For each fasta file that was indexed a separate txt file is created. WARNING: This output is significantly larger than raw files.
-w Output wig files, e.g., for adding a custom feature track to genome browsers. For each fasta file that was indexed a separate wig file and chrom.size file is created.
-bg Output bedgraph files. For each fasta file that was indexed a separate bedgraph-file is created.
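
For scripted runs, the two steps above can be chained from Python. The following is a minimal sketch that only assembles the same commands shown in this post (flags -F, -I, -K, -E, -O, -t, -w, -bg); the function names are hypothetical helpers, and `run_pipeline` assumes that `genmap` is installed and on the PATH.

```python
import shutil
import subprocess


def index_command(fasta: str, index_dir: str) -> list[str]:
    """Argument list for `genmap index`, using only -F and -I."""
    return ["genmap", "index", "-F", fasta, "-I", index_dir]


def map_command(index_dir: str, out: str, k: int, e: int,
                formats: tuple[str, ...] = ("-t", "-w", "-bg")) -> list[str]:
    """Argument list for `genmap map` with the -K, -E, -I and -O options."""
    return ["genmap", "map", "-K", str(k), "-E", str(e),
            "-I", index_dir, "-O", out, *formats]


def run_pipeline(fasta: str, index_dir: str, out: str, k: int, e: int) -> None:
    """Run indexing, then mappability; raise if genmap is missing or a step fails."""
    if shutil.which("genmap") is None:
        raise RuntimeError("genmap not found on PATH")
    for cmd in (index_command(fasta, index_dir),
                map_command(index_dir, out, k, e)):
        subprocess.run(cmd, check=True)  # check=True raises on a non-zero exit


if __name__ == "__main__":
    # Print the commands instead of running them, so this works without genmap.
    print(" ".join(index_command("genome.fasta", "index_folder")))
    print(" ".join(map_command("index_folder", "out", k=30, e=2)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.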

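Output (single genome)

[Figure: screenshot of the output files]

Mappability was computed for the Arabidopsis thaliana genome, and the wig and bedgraph files were loaded into IGV. The view here shows chr1; mappability drops sharply in the centromeric region.

[Figure: IGV view of chr1 with the wig and bedgraph tracks]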
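
Besides inspecting the tracks in IGV, regions of low mappability such as the centromere can be listed programmatically. The sketch below assumes the standard four-column bedGraph layout (chromosome, start, end, value, whitespace-separated); the function names, the threshold, and the sample rows are made up for illustration and are not GenMap output.

```python
from typing import Iterable, Iterator


def parse_bedgraph(lines: Iterable[str]) -> Iterator[tuple[str, int, int, float]]:
    """Yield (chrom, start, end, value) from bedgraph lines, skipping headers."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        yield chrom, int(start), int(end), float(value)


def low_mappability_regions(records, threshold: float) -> list[tuple[str, int, int]]:
    """Merge directly adjacent intervals whose value is below threshold."""
    merged: list[tuple[str, int, int]] = []
    for chrom, start, end, value in records:
        if value >= threshold:
            continue
        if merged and merged[-1][0] == chrom and merged[-1][2] == start:
            # Extend the previous region instead of starting a new one.
            merged[-1] = (chrom, merged[-1][1], end)
        else:
            merged.append((chrom, start, end))
    return merged


if __name__ == "__main__":
    # Made-up rows in bedGraph layout, for illustration only.
    sample = [
        "chr1 0 100 1.0",
        "chr1 100 150 0.5",
        "chr1 150 180 0.25",
        "chr1 180 300 1.0",
        "chr1 300 320 0.1",
    ]
    for region in low_mappability_regions(parse_bedgraph(sample), threshold=0.6):
        print(*region, sep="\t")
```

To apply it to real output, open the bedgraph file written by `genmap map -bg` and pass the file object to `parse_bedgraph` in place of the sample list.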

Reference

GenMap: Ultra-fast Computation of Genome Mappability
Christopher Pockrandt, Mai Alzamel, Costas S Iliopoulos, Knut Reinert
Bioinformatics, Published: 04 April 2020

macでインフォマティクス

A collection of informatics notes related to HTS (NGS).