Consensus calling with medaka

2020 3/23 Fixed a mistake in a command

2020 3/24 Added explanation

2020 10/10 Added tweet

Documentation

We've release v1.1.2 of @nanopore's medaka software. Updates include: consensus model for Guppy 4.0.11, a true ploidy-1 variant caller, doesn't break contigs at unpolished regions, diploid calling option to produce candidate variants for DeepVariant, binaries for ARM.
- Chris Wright (@chrisnrg), October 7, 2020

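Features

Requires only basecalled data (.fasta or .fastq)
Improved accuracy over graph-based methods (such as Racon)
50x faster than Nanopolish (it can run on GPUs)
Additional functionality for implementing and training bespoke correction networks
Open source (Mozilla Public License 2.0)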

Installation

medaka runs on Linux. Here I installed it on Ubuntu 18.04 LTS using bioconda (inside Docker, without a GPU).

gcc
zlib1g-dev
libbz2-dev
liblzma-dev
libffi-dev
libncurses5-dev
make
wget
python3-all-dev
python-virtualenv
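
On Ubuntu, the packages listed above can be installed in one step with apt-get. This is a minimal sketch, assuming a Debian/Ubuntu system and root privileges (for example inside the Docker container used here). The helper name `install_listed_packages` and the `DRY_RUN` variable are hypothetical and not part of medaka.

```shell
#!/bin/sh
# Packages listed above (Ubuntu 18.04 package names).
MEDAKA_DEPS="gcc zlib1g-dev libbz2-dev liblzma-dev libffi-dev libncurses5-dev make wget python3-all-dev python-virtualenv"

# install_listed_packages: hypothetical helper, not part of medaka.
# With DRY_RUN=1 it only prints the install command instead of running it.
install_listed_packages() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "apt-get install -y $MEDAKA_DEPS"
    else
        # Word splitting of $MEDAKA_DEPS is intentional here.
        apt-get update && apt-get install -y $MEDAKA_DEPS
    fi
}

# Usage (as root):
#   install_listed_packages
```

Setting `DRY_RUN=1` before calling the function shows the command without touching the system.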

Main program: GitHub

#bioconda (link)
#Here medaka is installed into a virtual environment named medaka-env.
conda create -n medaka-env -y
conda activate medaka-env
conda install -c conda-forge -c bioconda -y medaka

#pip
pip install medaka
#To enable the use of GPU resources it is necessary to install the
#tensorflow-gpu package. In outline this can be achieved with:
pip uninstall tensorflow
pip install tensorflow-gpu
#Note that the tensorflow-gpu package is compiled against a specific version of the NVIDIA CUDA library; users are directed to the tensorflow installation pages for further information.
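
After either installation route, it is worth confirming that the executables are actually on PATH before going further. This is a minimal sketch. `require_tool` is a hypothetical helper; the only medaka-specific details it relies on are the `medaka` and `medaka_consensus` command names and the `--version` flag that appear in the help output in this post.

```shell
#!/bin/sh
# require_tool: hypothetical helper that reports whether a command is on PATH.
# Prints "found: NAME" and returns 0 if it is, otherwise returns 1.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1" >&2
        return 1
    fi
}

# Usage, inside the activated medaka-env environment:
#   require_tool medaka && require_tool medaka_consensus && medaka --version
```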

> medaka -h

$ medaka -h

usage: medaka [-h] [--version]
              {compress_bam,features,train,consensus,smolecule,consensus_from_features,fastrle,stitch,variant,snp,methylation,tools}
              ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

subcommands:
  valid commands

  {compress_bam,features,train,consensus,smolecule,consensus_from_features,fastrle,stitch,variant,snp,methylation,tools}
                        additional help
    compress_bam        Compress an alignment into RLE form.
    features            Create features for inference.
    train               Train a model from features.
    consensus           Run inference from a trained model and alignments.
    smolecule           Create consensus sequences from single-molecule reads.
    consensus_from_features
                        Run inference from a trained model on existing
                        features.
    fastrle             Create run-length encoded fastq (lengths in quality
                        track).
    stitch              Stitch together output from medaka consensus into
                        final output.
    variant             Decode probabilities to VCF.
    snp                 Decode probabilities to SNPs.
    methylation         methylation subcommand.
    tools               tools subcommand.

Various subcommands are available, but only consensus is covered here.

> medaka consensus -h

$ medaka consensus -h

usage: medaka consensus [-h] [--debug | --quiet] [--batch_size BATCH_SIZE]
                        [--regions REGIONS [REGIONS ...]]
                        [--chunk_len CHUNK_LEN] [--chunk_ovlp CHUNK_OVLP]
                        [--model MODEL] [--disable_cudnn] [--threads THREADS]
                        [--check_output] [--save_features]
                        [--tag_name TAG_NAME] [--tag_value TAG_VALUE]
                        [--tag_keep_missing]
                        bam output

positional arguments:
  bam                   Input alignments.
  output                Output file.

optional arguments:
  -h, --help            show this help message and exit
  --debug               Verbose logging of debug information. (default: 20)
  --quiet               Minimal logging; warnings only). (default: 20)
  --batch_size BATCH_SIZE
                        Inference batch size. (default: 100)
  --regions REGIONS [REGIONS ...]
                        Genomic regions to analyse, or a bed file. (default:
                        None)
  --chunk_len CHUNK_LEN
                        Chunk length of samples. (default: 10000)
  --chunk_ovlp CHUNK_OVLP
                        Overlap of chunks. (default: 1000)
  --model MODEL         Model definition, default is equivalent to
                        r941_min_high_g344. {r941_min_fast_g303,
                        r941_min_high_g303, r941_min_high_g330,
                        r941_min_high_g344, r941_prom_fast_g303,
                        r941_prom_high_g303, r941_prom_high_g344,
                        r941_prom_high_g330, r10_min_high_g303,
                        r10_min_high_g340, r103_min_high_g345,
                        r941_prom_snp_g303, r941_prom_variant_g303,
                        r941_min_high_g340_rle} (default:
                        /Users/kazu/anaconda3/envs/medaka-
                        env/lib/python3.6/site-
                        packages/medaka/data/r941_min_high_g344_model.hdf5)
  --disable_cudnn       Disable use of cuDNN model layers. (default: False)
  --threads THREADS     Number of threads used by inference. (default: 1)
  --check_output        Verify integrity of output file after inference.
                        (default: False)
  --save_features       Save features with consensus probabilities. (default:
                        False)

filter tag:
  Filtering alignments by an integer valued tag.

  --tag_name TAG_NAME   Two-letter tag name. (default: None)
  --tag_value TAG_VALUE
                        Value of tag. (default: None)
  --tag_keep_missing    Keep alignments when tag is missing. (default: False)
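
The `medaka consensus` subcommand is the inference step and works on a BAM of reads aligned to the draft, so it can be run per contig with `--regions` and the results merged afterwards with `medaka stitch`. The following is a rough sketch under several assumptions: `calls_to_draft.bam` is a placeholder for a sorted, indexed BAM that already exists, the contig names are placeholders, and the `medaka stitch` argument order (input .hdf files followed by the output FASTA) is my assumption for this medaka version, so check `medaka stitch -h` before using it. The function names `run` and `consensus_by_region` are hypothetical.

```shell
#!/bin/sh
# run: print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# consensus_by_region BAM MODEL THREADS CONTIG...
# Runs 'medaka consensus' once per contig, then merges the .hdf outputs.
consensus_by_region() {
    bam=$1; model=$2; threads=$3
    shift 3
    hdfs=""
    for contig in "$@"; do
        hdf="${contig}.hdf"
        # --model, --threads and --regions are listed in 'medaka consensus -h'.
        # --regions accepts several values, so it is placed last to keep it
        # from swallowing the positional arguments.
        run medaka consensus "$bam" "$hdf" \
            --model "$model" --threads "$threads" --regions "$contig" || return 1
        hdfs="$hdfs $hdf"
    done
    # Assumption: stitch takes the .hdf inputs followed by the output FASTA.
    # Word splitting of $hdfs is intentional (contig names without spaces).
    run medaka stitch $hdfs consensus.fasta
}

# Usage:
#   consensus_by_region calls_to_draft.bam r941_min_high_g344 2 contig1 contig2
```

With `DRY_RUN=1` the function only prints the commands, which is a convenient way to review them before a long run.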

> medaka_consensus -h

$ medaka_consensus -h

medaka 0.11.5
------------

Assembly polishing via neural networks. The input assembly should be
preprocessed with racon.

medaka_consensus [-h] -i <fastx>

    -h  show this help text.
    -i  fastx input basecalls (required).
    -d  fasta input assembly (required).
    -o  output folder (default: medaka).
    -m  medaka model, (default: r941_min_high_g344).
        Available: r941_min_fast_g303, r941_min_high_g303, r941_min_high_g330, r941_min_high_g344, r941_prom_fast_g303, r941_prom_high_g303, r941_prom_high_g344, r941_prom_high_g330, r10_min_high_g303, r10_min_high_g340, r103_min_high_g345, r941_prom_snp_g303, r941_prom_variant_g303, r941_min_high_g340_rle.
        Alternatively a .hdf file from 'medaka train'.
    -f  Force overwrite of outputs (default will reuse existing outputs).
    -t  number of threads with which to create features (default: 1).
    -b  batchsize, controls memory use (default: 100).

How to run

The input is a raw de novo assembly produced with canu or miniasm+racon. What Oxford Nanopore expects is an assembly that has already been polished with racon.
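
Since medaka expects a racon-polished draft, one round of racon can be run first if the assembly has not been through it. This is a sketch that assumes minimap2 and racon are installed; they are separate tools and not part of medaka. The racon scoring options (`-m 8 -x -6 -g -8 -w 500`) are the ones I recall the medaka documentation recommending for its models, so check them against the documentation for your version. The file names and the function name `racon_round` are placeholders.

```shell
#!/bin/sh
# racon_round READS DRAFT OUT [THREADS]
# One round of racon polishing: map reads to the draft, then polish.
# With DRY_RUN=1 the commands are only printed.
racon_round() {
    reads=$1; draft=$2; out=$3; threads=${4:-1}
    paf="${out%.*}.paf"   # overlaps file named after the output
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "minimap2 -x map-ont -t $threads $draft $reads > $paf"
        echo "racon -m 8 -x -6 -g -8 -w 500 -t $threads $reads $paf $draft > $out"
        return 0
    fi
    # Overlaps of the reads against the draft, in PAF format.
    minimap2 -x map-ont -t "$threads" "$draft" "$reads" > "$paf" || return 1
    # racon argument order: <sequences> <overlaps> <target sequences>.
    racon -m 8 -x -6 -g -8 -w 500 -t "$threads" "$reads" "$paf" "$draft" > "$out"
}

# Usage:
#   racon_round basecalled.fa draft-assembly.fa racon.fa 4
```

The resulting racon.fa would then be passed to medaka_consensus with -d.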

medaka_consensus -i basecalled.fa -d draft-assembly.fa -o output

-i fastx input basecalls (required).
-d fasta input assembly (required).
-o output folder (default: medaka).
-m Model definition (default: r941_min_high_g344_model.hdf5)

The results are written to the specified directory.
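
A small wrapper can make runs easier to repeat by checking the inputs first and setting the thread count and model explicitly. This is a sketch that uses only options listed in `medaka_consensus -h` (-i, -d, -o, -t, -m). The function name `polish_with_medaka` is hypothetical, and the expectation that the polished sequence ends up as `consensus.fasta` inside the output folder is an assumption to verify for your version.

```shell
#!/bin/sh
# polish_with_medaka READS DRAFT OUTDIR [THREADS] [MODEL]
polish_with_medaka() {
    reads=$1; draft=$2; outdir=$3
    threads=${4:-1}
    # Default model as reported by 'medaka_consensus -h' (medaka 0.11.5).
    model=${5:-r941_min_high_g344}

    # Fail early on missing or empty inputs rather than part-way through.
    for f in "$reads" "$draft"; do
        if [ ! -s "$f" ]; then
            echo "error: input missing or empty: $f" >&2
            return 1
        fi
    done

    set -- medaka_consensus -i "$reads" -d "$draft" -o "$outdir" -t "$threads" -m "$model"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"          # only show the command
    else
        # Assumption: the polished assembly is written to consensus.fasta.
        "$@" && echo "consensus expected at: $outdir/consensus.fasta"
    fi
}

# Usage:
#   polish_with_medaka basecalled.fa draft-assembly.fa output 4
```

Choosing the model with -m matters because the available models are tied to a flow cell type and basecaller version, as the names in the help output suggest.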

References

medaka/README.md at master · nanoporetech/medaka · GitHub
