Miniprot: fast, intron (gap)-aware alignment of protein sequences to a genome

#2024/03/08 Added a note on the v0.13 release (bug fix for stop codon handling)

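Miniprot aligns protein sequences to a genome with affine gap penalties, splicing, and frameshifts. Its main purpose is to annotate protein-coding genes in a new species using genes from other, known species. Miniprot is functionally similar to GeneWise (an EMBL-EBI service) and Exonerate, but it can map proteins to whole genomes and is much faster at the residue alignment step.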
Miniprot is not optimized for mapping distant homologs, because distant homologs are less informative for gene annotation. Nonetheless, the seeding parameters can be tuned for higher sensitivity at the cost of performance.

Manual

https://lh3.github.io/miniprot/miniprot.html

Miniprot v0.13 released with support of non-standard translation tables, minor improvement to junction accuracy and a couple of bug fixes. https://t.co/a5CcWkLA9J
— Heng Li (@lh3lh3) March 6, 2024

Miniprot, a new mapper for aligning proteins to genomes with splicing and frameshift. Can be used for annotating new genomes. ~20k mouse proteins to the human genome in 5 mins over 16 threads. Still WIP. Feedback welcomed. https://t.co/ILAvmF3VBO
— Heng Li (@lh3lh3) September 10, 2022

Installation

Installed with conda on Ubuntu 18.

Github

#conda(link)
mamba install -c bioconda miniprot -y

#from source
git clone https://github.com/lh3/miniprot
cd miniprot && make

> miniprot

Usage: miniprot [options] <ref.fa> <query.faa> [...]

Options:

Indexing:

-k INT k-mer size [6]

-s INT submer size (density: 1/(2*(k-s)+1)) [4]

-b INT bits per block [8]

-d FILE save index to FILE []

Mapping:

-S no splicing (applying -G1k -J1k -e1k)

-c NUM max k-mer occurrence [50000]

-G NUM max intron size [200k]

-w FLOAT weight of log gap penalty [0.75]

-n NUM minimum number of syncmers in a chain [5]

-m NUM min chaining score [0]

-l INT k-mer size for the second round of chaining [5]

-e NUM max extension for 2nd round of chaining and alignment [10k]

-p FLOAT min secondary-to-primary score ratio [0.7]

-N NUM consider at most INT secondary alignments [50]

Alignment:

-O INT gap open penalty [11]

-E INT gap extension (a k-long gap costs O+k*E) [1]

-J INT intron open penalty [31]

-C INT penalty for non-canonical splicing [11]

-F INT penalty for frameshifts or in-frame stop codons [17]

-B INT end bonus [5]

Input/output:

-t INT number of threads [4]

--gff output in the GFF3 format

-P STR prefix for IDs in GFF3 [MP]

-u print unmapped query proteins

--outn=NUM output up to min{NUM,-N} alignments per query [100]

-K NUM query batch size [2M]
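
The two formulas in the help text above can be checked with shell arithmetic. This is a minimal sketch assuming the defaults shown (-k 6, -s 4, -O 11, -E 1); the function names are illustrative and not part of miniprot. Note that in the gap formula the "k" is the gap length, not the k-mer size.

```shell
# Denominator of the sampling density 1/(2*(k-s)+1) from the -k/-s help text.
density_denominator() {
    k="$1"; s="$2"
    echo $(( 2 * (k - s) + 1 ))
}

# Cost of a gap of length len under the affine model O + len*E (-O/-E help text).
gap_cost() {
    open="$1"; ext="$2"; len="$3"
    echo $(( open + len * ext ))
}

echo "density: 1/$(density_denominator 6 4)"      # defaults -k 6, -s 4 -> density: 1/5
echo "cost of a 3-long gap: $(gap_cost 11 1 3)"   # defaults -O 11, -E 1 -> 14
```

With the defaults, the density works out to 1/5, and each extra gap position adds 1 on top of the opening penalty of 11.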

Usage

Specify the genome sequence (or a prebuilt index file) and the protein sequences. The example below uses 8 threads.

miniprot -t8 ref-file protein.faa > output.paf

-t number of threads [4]
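
A quick way to inspect a result such as output.paf is to compute how much of each query protein was aligned. The sketch below relies only on the generic PAF column layout (1 = query name, 2 = query length, 3 = query start, 4 = query end, 6 = target name); paf_query_coverage is a hypothetical helper name, not a miniprot feature.

```shell
# Print query name, target name, and aligned fraction of the query
# for every alignment line in a PAF file.
paf_query_coverage() {
    awk -F'\t' '
        # Require the 12 mandatory PAF columns and a positive query length.
        NF >= 12 && $2 > 0 {
            printf "%s\t%s\t%.3f\n", $1, $6, ($4 - $3) / $2
        }
    ' "$1"
}

# Example call (assumes output.paf exists):
# paf_query_coverage output.paf | sort -k3,3nr | head
```

Sorting on the third column, as in the commented call, lists the most completely aligned proteins first.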

Building the miniprot index is slow and memory-intensive, so building the index in advance is recommended. The output is in the protein PAF format, which is described in the manual.
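
The index-once, map-many pattern can be wrapped in a small function. This is a sketch, not an official workflow: it uses only the -t and -d options listed in the help text above, and index_and_map as well as the file names ref.fna, ref.mpi, protein.faa, and output.paf are placeholders.

```shell
# Build the index only if it is missing, then map proteins against it.
index_and_map() {
    ref_fasta="$1"; index="$2"; proteins="$3"; out_paf="$4"; threads="${5:-8}"

    # -d FILE saves the index; skip this step when the file already exists.
    if [ ! -s "$index" ]; then
        miniprot -t"$threads" -d "$index" "$ref_fasta"
    fi

    # The index is passed in place of the reference FASTA; PAF goes to stdout.
    miniprot -t"$threads" "$index" "$proteins" > "$out_paf"
}

# Example call (uncomment to run; requires miniprot on PATH):
# index_and_map ref.fna ref.mpi protein.faa output.paf 8
```

Reusing the same index file across runs avoids paying the indexing cost for every protein set.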

The --gff option switches the output to GFF3 format.

miniprot -t8 -d ref.mpi ref.fna
miniprot -t8 --gff ref.mpi protein.faa > out.gff

--gff output in the GFF3 format
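
For a quick overview of a GFF3 file such as out.gff, the feature types in column 3 can be tallied with awk. The sketch relies only on the generic GFF3 layout (tab-separated, nine columns, comment lines starting with #); gff_feature_counts is a hypothetical helper name, and which feature types actually appear depends on miniprot's output.

```shell
# Count GFF3 records per feature type (column 3), ignoring comment lines.
gff_feature_counts() {
    awk -F'\t' '
        /^#/ { next }               # skip directives and comments
        NF >= 9 { n[$3]++ }         # tally well-formed records by type
        END { for (t in n) printf "%s\t%d\n", t, n[t] }
    ' "$1" | LC_ALL=C sort
}

# Example call (assumes out.gff exists):
# gff_feature_counts out.gff
```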

Aligning all Arabidopsis thaliana protein sequences to the Arabidopsis thaliana reference genome took 21 seconds (TR3990x, 20 threads, no pre-indexing).

References

https://github.com/lh3/miniprot
