Nanoporeのロングリードのアセンブリを中心としたツールキット Pomoxis

　Pomoxisは、ナノポアシーケンスに合わせて調整された基本的なバイオインフォマティクスツールのセット。ドラフトアセンブリの生成分析に関するツールが含まれている。これらのツールの多くは、Oxford Nanopore Technologiesの研究データ分析グループによって使用されている。

Document

Pomoxis - bioinformatics tools for nanopore research — Pomoxis 0.2.2 documentation

特徴（documentより）

Wraps third party tools with known good default parameters and methods of use.
Creates an isolated environment with all third-party tools.
Can be installed with conda.
Streamlines common short analysis chains.
Includes a nanopore read simulator.
Server/client components for minimap2 and bwa.
Integrates into katuali for performing more complex analysis pipelines.
Open source (Mozilla Public License 2.0).

インストール

依存

Installation from source

gcc-4.9
g++-4.9
zlib1g-dev
libncurses5-dev
python3-all-dev
libhdf5-dev
libatlas-base-dev
libopenblas-base
libopenblas-dev
libbz2-dev
liblzma-dev
libffi-dev
make
python-virtualenv
cmake (for racon)
wget (for fetching modules from github)
bzip2 (for extracting those modules)

requires the user to provide several binaries:

minimap2
miniasm
racon
samtools
bcftools
seqkit

本体　Github

#virtualenvを使用
virtualenv pomoxis --python=python3 --prompt "(pomoxis) "
. pomoxis/bin/activate
pip install git+https://github.com/rrwick/Porechop
pip install pomoxis

#またはvirtualenvを使わずpipで導入
pip install git+https://github.com/rrwick/Porechop
pip install pomoxis

> mini_align

$ mini_align

mini_align [-h] -r <reference> -i <fastq>

Align fastq/a formatted reads to a genome using minimap2.

-h show this help text.

-r reference, should be a fasta file. If correspondng minimap indices

do not exist they will be created. (required).

-i fastq/a input reads (required).

-I split index every ~NUM input bases (default: 16G, this is larger

than the usual minimap2 default).

-a aggressively extend gaps (sets -A1 -B2 -O2 -E1 for minimap2).

-P filter to only primary alignments (i.e. run samtools view -F 2308.

Deprecated: this filter is now default and can be disabled with -A.)

-A do not filter alignments to primary alignments, output all.

-n sort bam by read name.

-c chunk size. Input reads/contigs will be broken into chunks

prior to alignment.

-t alignment threads (default: 1).

-p output file prefix (default: reads).

-m fill MD tag.

-i and -r must be specified.

> mini_assemble

$ mini_assemble

mini_assemble [-h] -i <fastq>

Assemble fastq/fasta formatted reads and perform POA consensus.

-h show this help text.

-i fastx input reads (required).

-r reference fasta for reference-guided consensus (instead of de novo assembly)

-o output folder (default: assm).

-p output file prefix (default: reads).

-t number of minimap and racon threads (default: 1).

-m number of racon rounds (default: 4).

-n number of racon shuffles (default: 1. If not 1, should be >10).

-w racon window length (default: 500).

-k keep intermediate files (default: delete).

-K minimap's -K option (default: 500M).

-c trim adapters from reads prior to everything else.

-e error correct longest e% of reads prior to assembly, or an estimated coverage (e.g. 2x).

For an estimated coverage, the -l option must be a fastx or a length (e.g. 4.8mb).

-l Reference length, either as a number (e.g. 4.8mb) or fastx from which length will be calculated.

-x log all commands before running.

-i must be specified.

> assess_assembly

$ assess_assembly

assess_assembly [-h] -r <reference> -i <fastq>

Calculate accuracy statistics for an assembly.

-h show this help text.

-r reference, should be a fasta file. If correspondng bwa indices

do not exist they will be created. (required).

-i fastq/a input assembly (required).

-c chunk size. Input reads/contigs will be broken into chunks

prior to alignment, 0 will not chunk (default 100000).

-C catalogue errors.

-t alignment threads (default: 1).

-p output file prefix (default: assm).

-T trim consensus to primary alignments of truth to assembly.

-b .bed file of reference regions to assess.

-i and -r must be specified.

> simulate_calls -h

$ simulate_calls -h

usage: simulate_calls [-h] [--mu MU] [--sigma SIGMA] [--noise NOISE]

[--threads THREADS]

fasta ncalls

Simulate basecalls with scrappy.

positional arguments:

fasta Source sequence file.

ncalls Number of basecalls to produce.

optional arguments:

-h, --help show this help message and exit

--mu MU mean fragment length.

--sigma SIGMA stdv fragment length.

--noise NOISE Additional Gaussian noise on signal.

--threads THREADS number of worker threads.

> subsample_bam

$ subsample_bam

usage: subsample bam to uniform or proportional depth [-h] [-o OUTPUT_PREFIX]

[-r REGIONS [REGIONS ...]]

[-p PROFILE]

[-O {fwd,rev}]

[-t THREADS]

[-q QUALITY]

[-a ACCURACY]

[-c COVERAGE]

[--any_fail | --all_fail]

[-x PATIENCE]

[-s STRIDE] [-P]

[-S SEED]

bam depth [depth ...]

subsample bam to uniform or proportional depth: error: the following arguments are required: bam, depth

実行方法

1、mini_align

マッピング（minimap2）。fastqはgzip圧縮（.gz）にも対応している。

mini_align -t 20 -r ref.fa -i input.fq

-r reference, should be a fasta file. If correspondng minimap indices do not exist they will be created. (required)
-t alignment threads (default: 1)
-i fastq/a input reads (required)

出力はcoordinate sortされたbamとbam.baiになる。

２、mini_assemble

de novo アセンブリ（miniasmとminimap）とコンセンサスコール（racon x 4 times）

mini_assemble -t 20 -e 10 -l 4.7m -m 4 -i input.fq

-i fastx input reads (required)
-t number of minimap and racon threads (default: 1)
-c trim adapters from reads prior to everything else.
-e error correct longest e% of reads prior to assembly, or an estimated coverage (e.g. 2x). For an estimated coverage, the -l option must be a fastx or a length (e.g. 4.8mb)
-l Reference length, either as a number (e.g. 4.8mb) or fastx from which length will be calculated

注: テスト時はpolish off（-m 0）にしないと出力がゼロになった。polishが失敗しているにもかかわらず作業ファイル（draft assembly）を消しているからかもしれない。

リファレンスガイドアセンブリとコンセンサスコール（racon）

mini_assemble -t 20 -e 10 -l 4.7m -m 4 -r ref.fa -i input.fq

-r reference fasta for reference-guided consensus (instead of de novo assembly)

未テスト

３、assess_assembly

アセンブリ評価

assess_assembly -t 20 -r assembly.fa -i input.fq

# Percentage Errors

name mean q10 q50 q90

err_ont 16.926% 13.544% 16.310% 20.197%

err_bal 17.925% 14.164% 17.221% 21.615%

iden 6.706% 5.190% 6.558% 8.406%

del 5.699% 4.222% 5.305% 7.134%

ins 5.903% 4.259% 5.470% 7.339%

# Q Scores

name mean q10 q50 q90

err_ont 7.71 8.68 7.88 6.95

err_bal 7.47 8.49 7.64 6.65

iden 11.74 12.85 11.83 10.75

del 12.44 13.74 12.75 11.47

ins 12.29 13.71 12.62 11.34

標準ではSTDOUT出力される。

４、simulate_calls

oxford nanoporeのロングリードのシミュレーション

refe.faを鋳型にして10000リード発生させる。平均リード長は10000とする。

assess_assembly --threads 20 --mu 10000 refe.fa 10000

--mu mean fragment length

重たいので使えるだけスレッド指定した方が良いです。

出力

f:id:kazumaxneo:20190305144844j:plain

内部で実際に動いているツールとパラメータについては、それぞれの実行用bash スクリプトを開いて確認して下さい。

引用

Pomoxis - bioinformatics tools for nanopore research

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

Nanoporeのロングリードのアセンブリを中心としたツールキット Pomoxis