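Computing the uniqueness of the k-mer at every position of a genome while allowing up to e mismatches is computationally hard. It is nevertheless essential for many biological applications, such as designing guide RNAs for CRISPR experiments. More formally, the uniqueness, or (k, e)-mappability, is defined for each position as the reciprocal of the number of times the k-mer starting there occurs in the genome when up to e mismatches are allowed.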
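To make the definition concrete, here is a minimal brute-force sketch in Python. It is not how GenMap computes mappability (GenMap works on an FM index); it simply counts, for each position, the k-mers within Hamming distance e and takes the reciprocal. The function names are made up for illustration, and only the forward strand is searched, whereas GenMap also searches the reverse complement unless `-nc` is given.

```python
def hamming_within(a: str, b: str, e: int) -> bool:
    """Return True if equal-length strings a and b differ in at most e positions."""
    mismatches = 0
    for x, y in zip(a, b):
        if x != y:
            mismatches += 1
            if mismatches > e:
                return False
    return True


def naive_mappability(genome: str, k: int, e: int) -> list[float]:
    """Brute-force (k, e)-mappability, forward strand only (illustrative).

    For each start position i, count the k-mers in the genome that match
    genome[i:i+k] with at most e mismatches (the k-mer itself included)
    and return the reciprocal of that count.
    """
    n = len(genome) - k + 1
    kmers = [genome[i:i + k] for i in range(n)]
    result = []
    for query in kmers:
        frequency = sum(hamming_within(query, other, e) for other in kmers)
        result.append(1.0 / frequency)
    return result


# Toy sequence: ACGT and ACGA differ by one base, so both drop to 0.5 when e = 1.
print(naive_mappability("ACGTACGA", k=4, e=0))
print(naive_mappability("ACGTACGA", k=4, e=1))
```

The quadratic running time makes this usable only on toy inputs, which is exactly the problem an index-based tool addresses.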
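This work presents GenMap, a fast method for computing (k, e)-mappability. The mappability algorithm is extended so that it can also be computed across multiple genomes. By identifying approximate k-mers that are unique to one genome or present in all genomes, this makes it possible to compute marker sequences and to find candidates for probe design. GenMap supports several output formats, including binary output, wig files, and bed files, as well as csv files that export the locations of the approximate k-mer occurrences for each genomic position.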
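The multi-genome idea can be sketched the same brute-force way: for each k-mer of one genome, count how many genomes contain it with at most e mismatches. A count of 1 means the k-mer is specific to that genome (a marker candidate); a count equal to the number of genomes means it is shared by all. This is only an illustration with hypothetical function names and toy sequences, forward strand only, and is not GenMap's algorithm.

```python
def occurs_in(kmer: str, genome: str, e: int) -> bool:
    """Return True if kmer matches some window of genome with at most e mismatches."""
    k = len(kmer)
    for i in range(len(genome) - k + 1):
        mismatches = sum(1 for x, y in zip(kmer, genome[i:i + k]) if x != y)
        if mismatches <= e:
            return True
    return False


def genomes_containing(genomes: dict[str, str], name: str, k: int, e: int) -> list[int]:
    """For each k-mer of genomes[name], count the genomes that contain it.

    Every genome is counted at most once, no matter how many occurrences it has.
    """
    target = genomes[name]
    counts = []
    for i in range(len(target) - k + 1):
        kmer = target[i:i + k]
        counts.append(sum(occurs_in(kmer, seq, e) for seq in genomes.values()))
    return counts


toy = {"a": "ACGTTT", "b": "ACGTCC"}  # made-up sequences
counts = genomes_containing(toy, "a", k=4, e=0)
print(counts)  # ACGT is shared by both genomes; CGTT and GTTT occur only in "a"
unique_positions = [i for i, c in enumerate(counts) if c == 1]
print(unique_positions)
```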
GenMap can be installed via bioconda. Binaries and the C++ source code are available at https://github.com/cpockrandt/genmap.
https://github.com/cpockrandt/genmap/wiki
Installation
See GitHub for binary releases and for building from source.
# bioconda
conda install -c bioconda genmap -y
$ genmap index -h
GenMap index
============
SYNOPSIS
DESCRIPTION
GenMap is a tool for fast and exact computation of genome mappability and can also be used for multiple genomes,
e.g., to search for marker sequences.
Detailed information is available in the wiki: <https://github.com/cpockrandt/genmap/wiki>
Index creation. Only supports DNA and RNA (A, C, G, T/U, N). Other characters will be converted to N.
OPTIONS
-h, --help
Display the help message.
--version-check BOOL
Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,
0, OFF, FALSE, F, and NO. Default: 1.
--version
Display version information.
--copyright
Display long copyright information.
-F, --fasta-file INPUT_FILE
Path to the fasta file. Valid filetypes are: .fsa, .fna, .fastq, .fasta, .fas, and .fa.
-FD, --fasta-directory INPUT_FILE
Path to the directory of fasta files (indexes all .fsa .fna .fastq .fasta .fas and .fa files in there, not
including subdirectories).
-I, --index OUTPUT_FILE
Path to the index.
-A, --algorithm STRING
Algorithm for suffix array construction (needed for the FM index). One of radix and skew. Default: skew.
-S, --sampling INTEGER
Sampling rate of suffix array In range [1..64]. Default: 10.
-v, --verbose
Outputs some additional information on the constructed index.
VERSION
Last update: May 7 2020
GenMap index version: 1.2.0
SeqAn version: 2.4.1
LEGAL
GenMap index Copyright: 2019 Christopher Pockrandt, released under the 3-clause-BSD; 2016-2019 Knut Reinert and Freie Universität Berlin, released under the 3-clause-BSD
SeqAn Copyright: 2006-2015 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
In your academic works please cite: Pockrandt et al (2019). GenMap: Fast and Exact Computation of Genome Mappability.
doi: https://doi.org/10.1101/611160
For full copyright and/or warranty information see --copyright.
$ genmap map -h
GenMap map
==========
SYNOPSIS
DESCRIPTION
GenMap is a tool for fast and exact computation of genome mappability and can also be used for multiple genomes,
e.g., to search for marker sequences.
Detailed information is available in the wiki: <https://github.com/cpockrandt/genmap/wiki>
Tool for computing the mappability/frequency on nucleotide sequences. It supports multi-fasta files with DNA or
RNA alphabets (A, C, G, T/U, N). Frequency is the absolute number of occurrences, mappability is the inverse,
i.e., 1 / frequency-value.
OPTIONS
-h, --help
Display the help message.
--version-check BOOL
Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,
0, OFF, FALSE, F, and NO. Default: 1.
--version
Display version information.
--copyright
Display long copyright information.
-I, --index INPUT_FILE
Path to the index
-O, --output OUTPUT_FILE
Path to output directory (or path to filename if only a single fasta files has been indexed)
-E, --errors INTEGER
Number of errors
-K, --length INTEGER
Length of k-mers
-S, --selection OUTPUT_FILE
Path to a bed file (3 columns: chromosome, start, end) with selected coordinates to compute the mappability
(e.g., exon coordinates)
-nc, --no-reverse-complement
Searches the k-mers *NOT* on the reverse strand.
-ep, --exclude-pseudo
Mappability only counts the number of fasta files that contain the k-mer, not the total number of
occurrences (i.e., neglects so called- pseudo genes / sequences). This has no effect on the csv output.
-fs, --frequency-small
Stores frequencies using 8 bit per value (max. value 255) instead of the mappbility using a float per value
(32 bit). Applies to all formats (raw, txt, wig, bedgraph).
-fl, --frequency-large
Stores frequencies using 16 bit per value (max. value 65535) instead of the mappbility using a float per
value (32 bit). Applies to all formats (raw, txt, wig, bedgraph).
-r, --raw
Output raw files, i.e., the binary format of std::vector<T> with T = float, uint8_t or uint16_t (depending
on whether -fs or -fl is set). For each fasta file that was indexed a separate file is created. File type is
.map, .freq8 or .freq16.
-t, --txt
Output human readable text files, i.e., the mappability respectively frequency values separated by spaces
(depending on whether -fs or -fl is set). For each fasta file that was indexed a separate txt file is
created. WARNING: This output is significantly larger than raw files.
-w, --wig
Output wig files, e.g., for adding a custom feature track to genome browsers. For each fasta file that was
indexed a separate wig file and chrom.size file is created.
-bg, --bedgraph
Output bedgraph files. For each fasta file that was indexed a separate bedgraph-file is created.
-d, --csv
Output a detailed csv file reporting the locations of each k-mer (WARNING: This will produce large files and
makes computing the mappability significantly slower).
-m, --memory-mapping
Turns memory-mapping on, i.e. the index is not loaded into RAM but accessed directly from secondary-memory.
This may increase the overall running time, but do NOT use it if the index lies on network storage.
-T, --threads INTEGER
Number of threads Default: 12.
-v, --verbose
Outputs some additional information.
VERSION
Last update: May 7 2020
GenMap map version: 1.2.0
SeqAn version: 2.4.1
LEGAL
GenMap map Copyright: 2019 Christopher Pockrandt, released under the 3-clause-BSD; 2016-2019 Knut Reinert and Freie Universität Berlin, released under the 3-clause-BSD
SeqAn Copyright: 2006-2015 Knut Reinert, FU-Berlin; released under the 3-clause BSDL.
In your academic works please cite: Pockrandt et al (2019). GenMap: Fast and Exact Computation of Genome Mappability.
doi: https://doi.org/10.1101/611160
For full copyright and/or warranty information see --copyright.
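The help text above describes the `-r` output as the binary format of `std::vector<T>` with `T` = float, uint8_t or uint16_t, stored as `.map`, `.freq8` or `.freq16`. A minimal Python sketch for loading such a file follows. It assumes the file is a headerless sequence of packed values in the machine's native byte order; that layout is my assumption and is not stated in the help, so check it against a small genome first. The helper names are hypothetical.

```python
import array
import os
import tempfile

# Extension -> array typecode: float32, uint8, uint16 (assumed layout, see above).
TYPECODES = {".map": "f", ".freq8": "B", ".freq16": "H"}


def read_raw_values(path: str) -> array.array:
    """Read a headerless file of packed values; the element type follows the extension."""
    ext = os.path.splitext(path)[1]
    values = array.array(TYPECODES[ext])
    n_items = os.path.getsize(path) // values.itemsize
    with open(path, "rb") as handle:
        values.fromfile(handle, n_items)
    return values


def frequency_to_mappability(frequencies) -> list[float]:
    """Mappability is 1 / frequency; a frequency of 0 is mapped to 0.0 as a guard."""
    return [1.0 / f if f > 0 else 0.0 for f in frequencies]


# Self-contained demo: write three 16-bit frequencies, then read them back.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.freq16")
    with open(demo_path, "wb") as out:
        array.array("H", [1, 2, 4]).tofile(out)
    freqs = read_raw_values(demo_path)
    print(list(freqs), frequency_to_mappability(freqs))
```

With NumPy installed, `numpy.fromfile(path, dtype=numpy.float32)` would do the same job under the same layout assumption.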
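Usage
1. Indexing
Two algorithms are available for index construction: radix, which works in RAM, and skew, which uses secondary memory. Radix is comparison-based and considerably slower on repetitive data, so using skew is recommended (it is also the default in the help output above). Multiple genomes can be indexed as well, for example with the -FD option listed above (see the wiki).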
genmap index -F genome.fasta -I index_folder
- -F Path to the fasta file. Valid filetypes are: .fsa, .fna, .fastq, .fasta, .fas, and .fa.
- -A Algorithm for suffix array construction (needed for the FM index). One of radix and skew. Default: skew.
- -I Path to the index
Output
2. Mappability
genmap map -K 30 -E 2 -I index_folder -O out -t -w -bg
- -K Length of k-mers
- -E Number of errors
- -I Path to the index
- -O Path to output directory (or path to filename if only a single fasta files has been indexed)
- -t Output human readable text files, i.e., the mappability respectively frequency values separated by spaces (depending on whether -fs or -fl is set). For each fasta file that was indexed a separate txt file is created. WARNING: This output is significantly larger than raw files.
- -w Output wig files, e.g., for adding a custom feature track to genome browsers. For each fasta file that was indexed a separate wig file and chrom.size file is created.
- -bg Output bedgraph files. For each fasta file that was indexed a separate bedgraph-file is created.
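Once the bedgraph output exists, a quick per-chromosome summary is often a useful sanity check. The sketch below assumes the standard four-column bedGraph layout (chromosome, start, end, value) and computes a length-weighted mean of the value column. The function name and the demo data are made up; real GenMap output was not used here.

```python
import io
from collections import defaultdict


def mean_mappability(lines) -> dict[str, float]:
    """Length-weighted mean of the value column per chromosome.

    lines is any iterable of bedGraph text lines; header lines starting
    with "track", "browser" or "#" are skipped.
    """
    total = defaultdict(float)
    covered = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()[:4]
        length = int(end) - int(start)
        total[chrom] += length * float(value)
        covered[chrom] += length
    return {chrom: total[chrom] / covered[chrom] for chrom in total if covered[chrom] > 0}


# Made-up demo data; with a real file, pass an open file handle instead.
demo = io.StringIO("chr1\t0\t10\t1.0\nchr1\t10\t20\t0.5\nchr2\t0\t4\t0.25\n")
print(mean_mappability(demo))
```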
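Output (single genome)
I ran GenMap on the Arabidopsis thaliana genome and loaded the wig and bedgraph files into IGV. The view here shows chr1. Mappability clearly drops in the centromeric region.
Citation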
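Rather than spotting such dips by eye in IGV, low-mappability stretches can also be listed programmatically. The sketch below takes bedGraph-style intervals, already parsed into tuples, and merges adjacent intervals whose value is below a threshold. The threshold, the minimum length, and the function name are arbitrary choices for illustration, and the demo intervals are made up.

```python
def low_mappability_regions(intervals, threshold=0.5, min_length=1):
    """Merge adjacent intervals with value < threshold into regions.

    intervals: iterable of (chrom, start, end, value), sorted by chromosome
    and start. Returns (chrom, start, end) tuples at least min_length long.
    """
    merged = []
    for chrom, start, end, value in intervals:
        if value >= threshold:
            continue
        # Extend the previous region only if this interval touches it directly.
        if merged and merged[-1][0] == chrom and merged[-1][2] == start:
            merged[-1][2] = end
        else:
            merged.append([chrom, start, end])
    return [tuple(region) for region in merged if region[2] - region[1] >= min_length]


demo = [
    ("chr1", 0, 10, 1.0),
    ("chr1", 10, 20, 0.2),
    ("chr1", 20, 35, 0.1),
    ("chr1", 35, 40, 1.0),
    ("chr1", 40, 42, 0.3),
]
print(low_mappability_regions(demo, threshold=0.5, min_length=5))
```

In the demo, the two touching low intervals merge into one region and the short 2 bp dip is filtered out by `min_length`.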
GenMap: Ultra-fast Computation of Genome Mappability
Christopher Pockrandt, Mai Alzamel, Costas S Iliopoulos, Knut Reinert
Bioinformatics, Published: 04 April 2020