MetaMaps, a tool for mapping metagenomic reads - macでインフォマティクス

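Metagenomic sequence classification should be fast, accurate and information-rich. Emerging long-read sequencing technologies promise to improve the balance between these factors, but most existing methods were designed for short reads. MetaMaps is a new method, developed specifically for long reads, that combines the accuracy of slower alignment-based methods with the scalability of faster k-mer-based methods. Using an approximate mapping algorithm, it can map a long-read metagenome to a comprehensive RefSeq database with more than 12,000 genomes in under 30 GB of RAM on a laptop computer. By integrating these mappings with a probabilistic scoring scheme and EM-based estimation of sample composition, MetaMaps achieves more than 95% accuracy for species-level read assignment and r² > 0.98 for the estimation of sample composition, on both simulated and real data. Uniquely, MetaMaps outputs mapping locations and qualities for all classified reads, enabling functional studies (e.g. gene presence/absence) and the detection of novel species not present in the current database. MetaMaps is implemented in C++/Perl and is freely available from https://github.com/DiltheyLab/MetaMaps (GPL v3).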

Installation

Tested on Ubuntu 18.04 LTS.

Dependencies

Boost

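MetaMaps requires the Boost libraries, so install them first. Here I downloaded version 1.68.0 from the Boost website (the version was chosen for compatibility with another tool).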

wget https://sourceforge.net/projects/boost/files/boost/1.68.0/boost_1_68_0.tar.gz
tar -zxvf boost_1_68_0.tar.gz
cd boost_1_68_0

# Install under /usr/local/include/boost here.
mkdir /usr/local/include/boost
./bootstrap.sh
./b2 install --prefix=/usr/local/include/boost/
# Add the Boost library directory to the library search path
export LD_LIBRARY_PATH=/usr/local/include/boost/lib:$LD_LIBRARY_PATH
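As a quick sanity check after the install, the Boost version can be read back from the installed header. This is a minimal sketch, assuming b2 placed the headers under the prefix's include/ directory (so version.hpp would sit at /usr/local/include/boost/include/boost/version.hpp); the helper name boost_lib_version is hypothetical.

```shell
# Hypothetical helper: print the BOOST_LIB_VERSION string (e.g. 1_68)
# found in a Boost version.hpp file.
boost_lib_version() {
    # version.hpp contains a line of the form: #define BOOST_LIB_VERSION "1_68"
    sed -n 's/^#define BOOST_LIB_VERSION "\([^"]*\)".*/\1/p' "$1"
}

# Assumed header location, given --prefix=/usr/local/include/boost/ as above.
HEADER=/usr/local/include/boost/include/boost/version.hpp
if [ -r "$HEADER" ]; then
    echo "Boost version: $(boost_lib_version "$HEADER")"
else
    echo "version.hpp not found at $HEADER" >&2
fi
```

If the header is found elsewhere on your system, point HEADER at that path instead.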

MetaMaps itself (GitHub)

git clone https://github.com/DiltheyLab/MetaMaps.git
cd MetaMaps/
./bootstrap.sh
./configure --with-boost=/usr/local/include/boost
make metamaps

> ./metamaps

# ./metamaps

MetaMaps v 0.1

Simultaneous metagenomic classification and mapping.

Usage:

./metamaps mapDirectly|classify|mapAgainstIndex|index

Parameters:

./metamaps COMMAND -h for help

> ./metamaps mapDirectly -h

# ./metamaps mapDirectly -h

Available options

-----------------

-h, --help

Print this help page

-r <value>, --reference <value>

an input reference file (fasta/fastq)[.gz]

-k <value>, --kmer <value>

kmer size <= 16 [default 16 (DNA)]

-p <value>, --pval <value>

p-value cutoff, used to determine window/sketch sizes [default e-03]

--maxmemory <value>, --mm <value>

maximum memory, in GB [default e-03]

-w <value>, --window <value>

window size [default : computed using pvalue cutoff]

P-value is not considered if a window value is provided. Lower window size implies denser sketch

-m <value>, --minReadLen <value>

minimum read length to map [default : 1000]

--perc_identity <value>, --pi <value>

threshold for identity [default : 80]

-t <value>, --threads <value>

count of threads for parallel execution [default : 1]

-q <value>, --query <value>

an input query file (fasta/fastq)[.gz]

--all

report all the mapping locations for a read, default is to consider few

best ones

-o <value>, --output <value>

output file

> ./metamaps classify -h

# ./metamaps classify -h

Available options

-----------------

-h, --help

Print this help page

--DB <value> [required]

Path to DB

--mappings <value> [required]

Path to mappings file

--minreads <value>

Minimum number of reads per contig to be considered for fitting identity

and length for the 'Unknown' functionality

-t <value>, --threads <value>

count of threads for parallel execution [default : 1]

Preparing the database

The authors use their prebuilt miniSeq+H database (~8G compressed, microbial genomes and the human reference genome) as the example, so I downloaded it (the GitHub page links to Dropbox).

#decompress
tar zxvf miniSeq+H.tar.gz

This creates the databases/ directory.

Usage

1. Mapping

Specify the metagenome FASTQ and the database to map against.

metamaps mapDirectly --all -r databases/miniSeq+H/DB.fa -q input.fastq -t 16 -o classification_results

--all report all the mapping locations for a read, default is to consider few best ones
-q an input query file (fasta/fastq)[.gz]
-o output file
-t count of threads for parallel execution [default : 1]
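The help output above lists a default minimum read length of 1000 for mapDirectly (-m), so it can be useful to know how many reads clear that threshold before mapping. A minimal sketch, assuming an uncompressed FASTQ with exactly four lines per record; the function name count_long_reads is hypothetical, and input.fastq is the query file from the command above.

```shell
# Hypothetical helper: count reads of at least MIN bases
# in a FASTQ file with four lines per record.
count_long_reads() {
    fastq=$1
    min=${2:-1000}
    awk -v min="$min" '
        # Line 2 of every 4-line record is the sequence.
        NR % 4 == 2 { total++; if (length($0) >= min) kept++ }
        END { printf "%d of %d reads are >= %d bp\n", kept, total, min }
    ' "$fastq"
}

# Only runs if the example query file is present.
if [ -r input.fastq ]; then
    count_long_reads input.fastq 1000
fi
```

For gzipped input, decompress to a temporary file first, since this sketch reads plain text only.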

Mapping uses a fair amount of memory, so add --maxmemory as needed (a value of about 70% of physical memory is recommended).
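One way to arrive at a --maxmemory value is to take 70% of the machine's physical memory. The sketch below assumes Linux (it reads MemTotal from /proc/meminfo, which is reported in kB) and rounds down to whole gigabytes; the helper name maxmem_gb is hypothetical, and the script only prints a suggested command rather than running it.

```shell
# Hypothetical helper: 70% of a memory size given in kB,
# expressed in whole GB (rounded down).
maxmem_gb() {
    echo $(( $1 * 70 / 100 / 1024 / 1024 ))
}

if [ -r /proc/meminfo ]; then
    # MemTotal is the total usable RAM in kB.
    total_kb=$(awk '/^MemTotal:/ { print $2 }' /proc/meminfo)
    echo "metamaps mapDirectly --all -r databases/miniSeq+H/DB.fa -q input.fastq -t 16 -o classification_results --maxmemory $(maxmem_gb "$total_kb")"
fi
```

On machines with very little RAM the rounded value can reach 0, so check the printed number before using it.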

2. Classification

Classify the mapped reads.

metamaps classify --mappings classification_results --DB databases/miniSeq+H
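The two steps can be chained in a small wrapper. This is only a sketch built from the commands above: the function name run_metamaps, the METAMAPS variable (which lets the binary path be overridden, e.g. ./metamaps) and the up-front file checks are assumptions of this example, not part of MetaMaps.

```shell
# Hypothetical wrapper around the two commands shown above.
# Usage: run_metamaps READS.fastq DB_DIR OUT_PREFIX [THREADS]
run_metamaps() {
    reads=$1; db=$2; out=$3; threads=${4:-1}
    bin=${METAMAPS:-metamaps}

    # Fail early if the inputs are missing.
    [ -r "$reads" ] || { echo "cannot read $reads" >&2; return 1; }
    [ -r "$db/DB.fa" ] || { echo "no DB.fa in $db" >&2; return 1; }

    # Step 1: mapping. Step 2: classification, run only if mapping succeeded.
    "$bin" mapDirectly --all -r "$db/DB.fa" -q "$reads" -t "$threads" -o "$out" &&
    "$bin" classify --mappings "$out" --DB "$db"
}

# Example: run_metamaps input.fastq databases/miniSeq+H classification_results 16
```

Chaining with && means classification is skipped when mapping exits with an error, which avoids classifying a partial mappings file.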

References

MetaMaps – Strain-level metagenomic assignment and compositional estimation for long reads

Alexander Dilthey, Chirag Jain, Sergey Koren, Adam M. Phillippy

bioRxiv preprint first posted online Jul. 20, 2018

Addendum

Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps
Alexander T. Dilthey, Chirag Jain, Sergey Koren & Adam M. Phillippy
Nature Communications volume 10, Article number: 3066 (2019)