CompMap: speeding up read alignment when many similar references are available

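The number of available references varies greatly between species. For example, when a single fastq dataset is aligned against many references to identify the strain behind an outbreak, hundreds to tens of thousands of references may be available, and alignment then accounts for more than 99% of the computation, which is very inefficient. CompMap is a method that shortens computation time when multiple references are available by compressing the references before alignment. It builds a representative FASTA from multiple FASTA files and aligns against that, so when the references are very similar it can dramatically reduce both computation time and memory usage. It can also be used when the references differ to some extent (see "3 CASE STUDIES" in the paper).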

Installation

Dependencies

BWA or Bowtie

Download the source code from the official site. A manual is also available there.

http://csse.szu.edu.cn/staff/zhuzx/CompMap/#Install

make
make clean

$ ./compMAP comp -h

comp Usage:

compMAP comp <genome_dir> [ option ] <output>

the genomes & the output in the FASTA format.

OPTIONS:

-L INT

Set the minimum length of valid repeat sequences to identify in database sequences. (Default: 1000)

-e FLO

Set the mismatch tolerance rate in local alignment. (Default: 0.05)

-N INT

Set the size of prospecting window in local alignment. When a mismatch is detected, the program looks beyond for N more bases. If there are more than N/2 mismatches within these bases, then the matching goes the other direction or terminates. (Default: 10)

-k INT

Set the length of a kmers used in locate local alignment. (Default: 8)

-P <str1, str2, ... str10>

Set the kmer prefixes. The max number of prefixes is 10.

For example, 'compMAP comp db_dir -P CG, AA, TA' assign three prefixes namely "CG","AC", and "TA". (Default: "CG").

-R INT <filename1, filename2, filename3...> [<@listfiles>]

Assign the reference sequences for database compression.

0: randomly select a reference;

1: select the longest sequence as reference;

2: provide the file names of reference sequences. (Default: 1)

For example,'compMAP comp db_dir -R 2 db1.fasta, db2.fasta ref.fa'

assigns "db1.fasta" and "db2.fasta" as the reference.

'compMAP comp db_dir -R 2 @listfiles.txt ref.fa'

provides the names of reference sequences in a text file "listfiles.txt" which could contain the following contents

db1.fasta

db2.fasta

db3.fasta

...

-h : print the help!

Eg:

compMAP comp db_dir ref.fa

compMAP comp db_dir -R 2 db1.fasta ref.fa

compMAP comp db_dir -L 200 -E 0.03 -N 16 -K 10 -P CG, AT -R 2 @listfiles.txt ref.fa

output file: ref.fa ref.fa.mat ref.fa.log
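The -R 2 option above accepts a list file such as @listfiles.txt. A minimal sketch of generating one from the reference directory follows. The function name make_listfile is hypothetical, and writing bare file names (without the directory) is an assumption based on the "db1.fasta" style shown in the help text, so check the manual if compMAP does not find the files.

```shell
# Hypothetical helper: write the names of the FASTA files in a directory
# to a list file, one per line, in the style shown in the help text.
# Assumption: compMAP expects bare file names relative to <genome_dir>.
make_listfile() {
    db_dir="$1"
    out="$2"
    : > "$out"                       # create or truncate the list file
    for f in "$db_dir"/*.fasta "$db_dir"/*.fa; do
        [ -e "$f" ] || continue      # skip patterns that matched nothing
        basename "$f" >> "$out"
    done
}
```

It could then be used as: make_listfile db_dir listfiles.txt, followed by compMAP comp db_dir -R 2 @listfiles.txt ref.fa.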

$ ./compMAP map -h

map Usage:

compMAP map [ option ] <aln.sam> <ref.fa>

"aln.sam" is the output of the align tool such as BWA and bowtie; "ref.fa" is the output of the previous command.

Make sure "ref.fa.mat" and "ref.fa.log" are stored in the same directory with "ref.fa".

OPTIONS:

-F

Used to speed up the running of the program if fixed-length short reads are input.

-D

Display the information of the mapped positions including the number of successful, failed and junction alignments.

-h : print the help!

E.g:

compMAP map -D aln.sam ref.fa

compMAP map -F aln.sam ref.fa

Add the directory containing compMAP to your PATH.
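One way to do this in a POSIX shell is sketched below. The directory $HOME/CompMap is only an assumed location for the built binary, so replace it with wherever you ran make.

```shell
# Assumed location of the directory where make produced the compMAP binary.
COMPMAP_DIR="$HOME/CompMap"

# Prepend it to PATH for the current shell session.
export PATH="$COMPMAP_DIR:$PATH"

# Report if the shell still cannot find the binary.
command -v compMAP >/dev/null 2>&1 || echo "compMAP not found in $COMPMAP_DIR" >&2
```

To make the change permanent, the export line can be added to your shell startup file.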

Run

Assume that db_dir/ contains the reference FASTA files (four files here).

Use the comp command of compMAP to build ref.fa, a representative of the references.

compMAP comp db_dir ref.fa

This writes ref.fa, together with ref.fa.mat and ref.fa.log.
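According to the help text, the map step needs ref.fa.mat and ref.fa.log in the same directory as ref.fa. A small sanity check before moving on might look like this. The function name check_comp_outputs is hypothetical and not part of CompMap.

```shell
# Hypothetical helper: verify that the three files the help text lists
# as output of "compMAP comp" exist next to each other.
check_comp_outputs() {
    ref="$1"
    status=0
    for f in "$ref" "$ref.mat" "$ref.log"; do
        if [ ! -e "$f" ]; then
            echo "missing: $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

Calling check_comp_outputs ref.fa returns a non-zero status and names the missing file if any of the three is absent.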

Align the fastq reads to this ref.fa with bwa (the tool has also been validated with bowtie).

bwa index ref.fa 
bwa mem ref.fa reads.fq > aln.sam

This produces aln.sam.

Restore the alignments against the original references.

compMAP map aln.sam ref.fa

-F  Used to speed up the running of the program if fixed-length short reads are input.
-D  Display the information of the mapped positions including the number of successful, failed and junction alignments.
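The three steps above (comp, bwa, map) can be chained in a single shell function, sketched below. The function name run_compmap and its argument order are hypothetical. Only the compMAP and bwa invocations shown earlier are used, with default options.

```shell
# Hypothetical wrapper around the steps shown above.
# Arguments: <db_dir> <reads.fq> <ref.fa> <aln.sam>
run_compmap() {
    db_dir="$1"
    reads="$2"
    ref="$3"
    sam="$4"

    # 1. Compress the references into one representative FASTA.
    compMAP comp "$db_dir" "$ref" || return 1

    # 2. Index the representative FASTA and align the reads to it.
    bwa index "$ref" || return 1
    bwa mem "$ref" "$reads" > "$sam" || return 1

    # 3. Map the alignments back onto the original references.
    compMAP map "$sam" "$ref"
}
```

For the example in this post the call would be run_compmap db_dir reads.fq ref.fa aln.sam. Each step stops the function if the previous one fails, so a broken index does not silently produce an empty SAM.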

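Counting the lines of the SAM files confirms that the restored SAM has grown relative to aln.sam by a factor equal to the number of FASTA files that were compressed.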
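A simple way to make this comparison is to count only the alignment records, ignoring the header lines that start with "@". The helper below is a minimal sketch. The name count_sam_records is hypothetical, and you need to supply the name of the file that compMAP map wrote.

```shell
# Hypothetical helper: count alignment records in a SAM file,
# skipping header lines (which begin with '@').
count_sam_records() {
    awk '!/^@/ { n++ } END { print n + 0 }' "$1"
}
```

Running count_sam_records aln.sam and then the same command on the restored SAM gives the two numbers to compare. Excluding headers matters here because the restored file is expected to carry more @SQ lines than aln.sam, which would otherwise distort a plain wc -l comparison.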

Reference

CompMap: a reference-based compression program to speed up read mapping to related reference sequences.

Zhu Z, Li L, Zhang Y, Yang Y, Yang X.

Bioinformatics. 2015 Feb 1;31(3):426-8.