Oxford Nanoporeリードのアセンブリ MiniasmとNanopolish

2019 4/4 ヘルプ追記

2019 6/21 文章修正

2019 7/17 コメント追加

2019 7/26　追記

2019 10/14追記

2019 11/5 コマンドに-t <NUM> 追加の修正

2020 3/30 関連ツール追記

MiniasmはPacbioのロングリードやナノポアのロングリードのアセンブルツールで2015年に論文が発表された (ref.1)。アルゴリズムはオーバーラップ法になる。アセンブル時間が非常に短いのが特徴で、ナノポアリードのアセンブルの比較ペーパーでは、競合アセンブラが数時間かけるデータを２分で終えると書かれている(ref.2)。

インストール

GitHub

GitHubのダウンロードリンクからソースをダウンロードすることもできるが、brewやcondaでワンライナーインストールもできる。

#bioconda
conda install -y -c bioconda minimap
conda install -y -c bioconda miniasm

#homebrew
brew install miniasm

>miniasm

$ miniasm

Usage: miniasm [options] <in.paf>

Options:

Pre-selection:

-m INT min match length [100]

-i FLOAT min identity [0.05]

-s INT min span [2000]

-c INT min coverage [3]

Overlap:

-o INT min overlap [same as -s]

-h INT max over hang length [1000]

-I FLOAT min end-to-end match ratio [0.8]

Layout:

-g INT max gap differences between reads for trans-reduction [1000]

-d INT max distance for bubble popping [50000]

-e INT small unitig threshold [4]

-f FILE read sequences

-n INT rounds of short overlap removal [3]

-r FLOAT[,FLOAT]

max and min overlap drop ratio [0.7,0.5]

-F FLOAT aggressive overlap drop ratio in the end [0.8]

Miscellaneous:

-p STR output information: bed, paf, sg or ug [ug]

-b both directions of an arc are present in input

-1 skip 1-pass read selection

-2 skip 2-pass read selection

-V print version number

See miniasm.1 for detailed description of the command-line options.

> minimap

$ minimap

Usage: minimap [options] <target.fa> [query.fa] [...]

Options:

Indexing:

-k INT k-mer size [15]

-w INT minizer window size [{-k}*2/3]

-I NUM split index for every ~NUM input bases [4G]

-d FILE dump index to FILE

-l the 1st argument is a index file (overriding -k, -w and -I)

Mapping:

-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.001]

-r INT bandwidth [500]

-m FLOAT merge two chains if FLOAT fraction of minimizers are shared [0.50]

-c INT retain a mapping if it consists of >=INT minimizers [4]

-L INT min matching length [40]

-g INT split a mapping if there is a gap longer than INT [10000]

-T INT SDUST threshold; 0 to disable SDUST [0]

-S skip self and dual mappings

-O drop isolated hits before chaining (EXPERIMENTAL)

-P filtering potential repeats after mapping (EXPERIMENTAL)

-x STR preset (recommended to be applied before other options)

ava10k: -Sw5 -L100 -m0 (PacBio/ONT all-vs-all read mapping)

Input/Output:

-t INT number of threads [3]

-V show version number

See minimap.1 for detailed description of the command-line options.

> minimap2

r$ minimap2

Usage: minimap2 [options] <target.fa>|<target.idx> [query.fa] [...]

Options:

Indexing:

-H use homopolymer-compressed k-mer (preferrable for PacBio)

-k INT k-mer size (no larger than 28) [15]

-w INT minizer window size [10]

-I NUM split index for every ~NUM input bases [4G]

-d FILE dump index to FILE

Mapping:

-f FLOAT filter out top FLOAT fraction of repetitive minimizers [0.0002]

-g NUM stop chain enlongation if there are no minimizers in INT-bp [5000]

-G NUM max intron length (effective with -xsplice; changing -r) [200k]

-F NUM max fragment length (effective with -xsr or in the fragment mode) [800]

-r NUM bandwidth used in chaining and DP-based alignment [500]

-n INT minimal number of minimizers on a chain [3]

-m INT minimal chaining score (matching bases minus log gap penalty) [40]

-X skip self and dual mappings (for the all-vs-all mode)

-p FLOAT min secondary-to-primary score ratio [0.8]

-N INT retain at most INT secondary alignments [5]

Alignment:

-A INT matching score [2]

-B INT mismatch penalty [4]

-O INT[,INT] gap open penalty [4,24]

-E INT[,INT] gap extension penalty; a k-long gap costs min{O1+k*E1,O2+k*E2} [2,1]

-z INT[,INT] Z-drop score and inversion Z-drop score [400,200]

-s INT minimal peak DP alignment score [80]

-u CHAR how to find GT-AG. f:transcript strand, b:both strands, n:don't match GT-AG [n]

Input/Output:

-a output in the SAM format (PAF by default)

-Q don't output base quality in SAM

-L write CIGAR with >65535 ops at the CG tag

-R STR SAM read group line in a format like '@RG\tID:foo\tSM:bar'

-c output CIGAR in PAF

--cs[=STR] output the cs tag; STR is 'short' (if absent) or 'long' [none]

--MD output the MD tag

--eqx write =/X CIGAR operators

-Y use soft clipping for supplementary alignments

-t INT number of threads [3]

-K NUM minibatch size for mapping [500M]

--version show version number

Preset:

-x STR preset (always applied before other options; see minimap2.1 for details)

- map-pb/map-ont: PacBio/Nanopore vs reference mapping

- ava-pb/ava-ont: PacBio/Nanopore read overlap

- asm5/asm10/asm20: asm-to-ref mapping, for ~0.1/1/5% sequence divergence

- splice: long-read spliced alignment

- sr: genomic short-read mapping

See `man ./minimap2.1' for detailed description of command-line options.

テストデータ

オーサーたちの準備したtestデータのダウンロード

wget -O- http://www.cbcb.umd.edu/software/PBcR/data/selfSampleData.tar.gz | tar zxf -

解凍したディレクトリは以下のようになっている。

user $ ls -lh selfSampleData

total 272M

-rw-r--r-- 1 uesaka user 267M Feb 21 2015 pacbio_filtered.fastq

-rwxr-xr-x 1 uesaka user 169 Feb 6 2015 pacbio.spec

-rw-r----- 1 uesaka user 3.1K Feb 28 2015 README

-rw-r--r-- 1 uesaka user 4.5M Mar 20 2013 reference.fasta

実行方法

１、overlap

minimap -Sw5 -L100 -m0 -t8 pacbio_filtered.fastq pacbio_filtered.fastq | gzip -1 > reads.paf.gz

-S　skip self and dual mappings

-L　INT min matching length [40]

-m　FLOAT merge two chains if FLOAT fraction of minimizers are shared [0.50]

-t　INT number of threads [3]

２、layout

miniasm -f pacbio_filtered.fastq reads.paf.gz > reads.gfa

-f　FILE read sequences

GFAファイルを開く。

>less -S reads.gfa

f:id:kazumaxneo:20170622221043j:plain

最後にawkとsedを使ってGFAからFASTAに変換する。

awk '/^S/{print ">"$2"\n"$3}' reads.gfa | fold > reads.fa

先頭がSの行がアセンブルされたunitigで、その２フィールド目と３フィールド目をfastaの>と改行をつけながらfoldに渡し、改行を加えて出力している。

GFA=> FASTAにはany2fastaのようなツールを使ってもよい。

追記

miinmap2を使う。ONTのリードの場合

minimap2 -x ava-ont -t 12 long-read.fq long-read.fq > ont.paf 
miniasm -f long-read.fq ont.paf > assembly.gfa 
awk '/^S/{print ">"$2"\n"$3}' assembly.gfa | fold > assembly.fasta

アセンブルのエラーをNanopolishで修復する。Nanopolishは使い方に癖があるツールなので、以下のエントリでーNanopolishを単独で説明している。

OLC（参考PDF）のtagをつけているが、miniasmにはconsensus callのステップがないことに注意する。エラー率はアセンブリに使ったリードと同じになる。

3020 03/30追記

all versus allリードマッピングにfpaのフィルタリングを取り込む。

引用

１、Minimap and miniasm: fast mapping and de novo assembly for noisy long sequences

Heng Li

Bioinformatics (2016) 32 (14): 2103-2110. DOI:

https://academic.oup.com/bioinformatics/article/32/14/2103/1742895/Minimap-and-miniasm-fast-mapping-and-de-novo

２、Comparison of bacterial genome assembly software for MinION data and their applicability to medical microbiology

Kim Judge, Martin Hunt, Sandra Reuter, Alan Tracey, Michael A. Quail, Julian Parkhill, Sharon J. Peacock

01 September 2016, Microbial Genomics , 2016 2, doi: 10.1099/mgen.0.000085