PEPPER: polishing ONT long-read assemblies

2021-12-24: tweet added

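P.E.P.P.E.R. is a deep neural network based polisher designed to work with Oxford Nanopore sequencing technology. It uses a recurrent neural network (RNN) based encoder-decoder model to call a consensus sequence from summary statistics of each genomic position. It performs local realignment using SSW, and it is a standalone module that does not require prior polishing with other tools (e.g. racon).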

Released PEPPER-Margin-DeepVariant r0.7

This release outperforms older versions and existing callers for @nanopore R9.4.1 Guppy 5 "Sup" and R10.4 Q20 data.

Pub: https://t.co/AS2buYWIMA
Free link: https://t.co/of6ZppO5C8 https://t.co/zFPdxtTquI

🧵 on methods + results
[1/10] pic.twitter.com/Q4Aln7H33V
— Kishwar (@kishwarshafin) December 22, 2021

PEPPER v0.1 is now available for polishing @nanopore assembly. Don't only look at the base-quality after polishing, look at frameshifts and transcriptome completeness too.

This framework is integral to PEPPER-DeepVariant work. https://t.co/oP1mKaG64O @BenedictPaten @mitenjain
— Kishwar (@kishwarshafin) October 9, 2020

PEPPER_variant_calling

pepper/PEPPER_variant_calling.md at r0.1 · kishwarshafin/pepper · GitHub

Installation

Installed with pip inside a Docker environment (Ubuntu 18.04 LTS base image).

GitHub

#pip
python3 -m pip install pepper-polish 

#docker (CPU based)
docker run -it --ipc=host --user=`id -u`:`id -g` --cpus="16" \
-v </directory/with/inputs_outputs>:/data kishwars/pepper:latest \
pepper --help

#docker (CPU based): check the PyTorch configuration
docker run --rm -it --ipc=host kishwars/pepper:latest pepper torch_stat

#docker (GPU based)
# CHECK GPU STATE: 
nvidia-docker run -it --ipc=host kishwars/pepper:latest pepper torch_stat 
# RUN PEPPER 
nvidia-docker run -it --ipc=host --user=`id -u`:`id -g` --cpus="16" \
-v </directory/with/inputs_outputs>:/data kishwars/pepper:latest \
pepper --help

> pepper -h

# pepper -h

usage: pepper [-h] [--version]
              {polish,make_images,call_consensus,stitch,download_models,torch_stat,version}
              ...

PEPPER is a RNN based polisher for polishing ONT-based assemblies. It works in three steps:
1) make_images: This module takes alignment file and coverts them to HDF5 files containing summary statistics.
2) call_consensus: This module takes the summary images and a trained neural network and generates predictions per base.
3) stitch: This module takes the inference files as input and stitches them to generate a polished assembly.

positional arguments:
  {polish,make_images,call_consensus,stitch,download_models,torch_stat,version}
    polish           Run the polishing pipeline. This will run make images-> inference -> stitch one after another.
                     The outputs of each step can be run separately using
                     the appropriate sub-command.
    make_images      Generate images that encode summary statistics of reads aligned to an assembly.
    call_consensus   Perform inference on generated images using a trained model.
    stitch           Stitch the polished genome to generate a contiguous polished assembly.
    download_models  Download available models.
    torch_stat       See PyTorch configuration.
    version          Show program version.

optional arguments:
  -h, --help         show this help message and exit
  --version          Show version.

> pepper polish -h

# pepper polish -h

usage: pepper polish [-h] -b BAM -f FASTA -m MODEL_PATH -o OUTPUT_FILE
                     [-t THREADS] [-r REGION] [-bs BATCH_SIZE] [-g]
                     [-d_ids DEVICE_IDS] [-w NUM_WORKERS]

optional arguments:
  -h, --help            show this help message and exit
  -b BAM, --bam BAM     BAM file containing mapping between reads and the
                        draft assembly.
  -f FASTA, --fasta FASTA
                        FASTA file containing the draft assembly.
  -m MODEL_PATH, --model_path MODEL_PATH
                        Path to a trained model.
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Path to output file with an expected prefix (i.e. -o
                        ./outputs/polished_genome)
  -t THREADS, --threads THREADS
                        Number of threads to use. Default is 5.
  -r REGION, --region REGION
                        Region in [contig_name:start-end] format
  -bs BATCH_SIZE, --batch_size BATCH_SIZE
                        Batch size for testing, default is 100. Suggested
                        values: 256/512/1024.
  -g, --gpu             If set then PyTorch will use GPUs for inference. CUDA
                        required.
  -d_ids DEVICE_IDS, --device_ids DEVICE_IDS
                        List of gpu device ids to use for inference. Only used
                        in distributed setting. Example usage: --device_ids
                        0,1,2 (this will create three callers in id 'cuda:0,
                        cuda:1 and cuda:2' If none then it will use all
                        available devices.
  -w NUM_WORKERS, --num_workers NUM_WORKERS
                        Number of workers for loading images. Default is 4.
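The help text describes -r/--region only as a string in contig_name:start-end format. As a minimal sketch, the helper below builds such a string and rejects obviously malformed coordinates before they reach pepper. The function name make_region and the contig name contig_1 are made up for illustration and are not part of PEPPER, and the help text does not say whether coordinates are 0-based or 1-based, so that is left open here.

```shell
# Hypothetical helper (not part of PEPPER): build a REGION string in the
# contig_name:start-end form listed under -r/--region in `pepper polish -h`.
make_region() {
    local contig="$1" start="$2" end="$3" value
    if [ -z "$contig" ]; then
        echo "contig name must not be empty" >&2
        return 1
    fi
    # Accept only non-negative integers for the two coordinates.
    for value in "$start" "$end"; do
        case "$value" in
            ''|*[!0-9]*)
                echo "start and end must be non-negative integers" >&2
                return 1
                ;;
        esac
    done
    if [ "$start" -gt "$end" ]; then
        echo "start must not be greater than end" >&2
        return 1
    fi
    printf '%s:%s-%s\n' "$contig" "$start" "$end"
}

# Example with a made-up contig name; the printed string is what would be
# passed to --region.
make_region contig_1 1 50000
```

The printed value can then be supplied as, for example, --region "$(make_region contig_1 1 50000)" to restrict polishing to one interval.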

Usage

CPU version

pepper polish \
--bam draft_assembly.bam \
--fasta draft_assembly.fasta \
--model_path <path/to/pepper/models/XXX.pkl> \
--output_file output_file_prefix \
--threads 20 \
--batch_size 128

GPU version

pepper polish \
--bam draft_assembly.bam \
--fasta draft_assembly.fasta \
--model_path <path/to/pepper/models/XXX.pkl> \
--output_file output_file_prefix \
--threads 20 \
--batch_size 512 \
--gpu \
--num_workers <num_workers>
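The CPU and GPU invocations differ only in the batch size and the GPU-related flags, so the command line can be assembled by one small function. The following is a sketch under that assumption: build_pepper_polish_cmd is a hypothetical name, only flags listed in `pepper polish -h` are used, and the defaults (5 threads, batch size 100) are the ones stated in that help text. The function only prints the command so it can be reviewed; it does not run pepper.

```shell
# Hypothetical helper (not part of PEPPER): print a `pepper polish` command
# assembled from the options documented in `pepper polish -h`.
# Arguments: BAM FASTA MODEL_PATH OUTPUT_PREFIX [THREADS] [BATCH_SIZE] [yes|no for GPU]
build_pepper_polish_cmd() {
    local bam="$1" fasta="$2" model="$3" prefix="$4"
    local threads="${5:-5}" batch_size="${6:-100}" use_gpu="${7:-no}"
    local cmd
    cmd="pepper polish --bam $bam --fasta $fasta --model_path $model"
    cmd="$cmd --output_file $prefix --threads $threads --batch_size $batch_size"
    # --gpu is a plain switch in the help text, so it is appended only on request.
    if [ "$use_gpu" = "yes" ]; then
        cmd="$cmd --gpu"
    fi
    printf '%s\n' "$cmd"
}

# CPU-style command; the model path is kept as the placeholder used above.
build_pepper_polish_cmd draft_assembly.bam draft_assembly.fasta \
    '<path/to/pepper/models/XXX.pkl>' output_file_prefix 20 128 no
```

Passing yes as the last argument and a larger batch size yields the GPU variant. Paths containing spaces would need extra quoting, which this sketch does not handle.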

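For the full flow from assembly through mapping to polishing, see "Polishing Microbial genome assemblies with PEPPER".

Variant calling workflow with PEPPER-DeepVariant

(A haplotype-aware variant calling pipeline for ONT is being developed in collaboration with the DeepVariant group.)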
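The --bam input must contain the reads mapped back to the draft assembly. The linked guide is the authoritative description of that step. As a rough sketch only, assuming minimap2 and samtools are used for the mapping (neither tool is named in this post) and with hypothetical file names, the commands could be generated like this. The function prints the commands rather than running them.

```shell
# Hypothetical helper: print mapping commands that would produce a sorted,
# indexed BAM for `pepper polish --bam`. minimap2 and samtools are an
# assumption here; check the linked PEPPER guide for the recommended steps.
print_mapping_cmds() {
    local draft="$1" reads="$2" bam="$3" threads="${4:-4}"
    # Map ONT reads to the draft and sort the alignments by coordinate.
    printf 'minimap2 -ax map-ont -t %s %s %s | samtools sort -@ %s -o %s -\n' \
        "$threads" "$draft" "$reads" "$threads" "$bam"
    # Index the sorted BAM.
    printf 'samtools index -@ %s %s\n' "$threads" "$bam"
}

# File names below are placeholders.
print_mapping_cmds draft_assembly.fasta reads.fastq draft_assembly.bam 20
```

The resulting BAM and the same draft FASTA are then the --bam and --fasta inputs of the polish command.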

pepper/PEPPER_variant_calling.md at r0.1 · kishwarshafin/pepper · GitHub

Citation

GitHub - kishwarshafin/pepper: P.E.P.P.E.R. : Program for Evaluating Patterns in Pileups of Erroneous Reads

Reference

https://nanoporetech.com/sites/default/files/s3/literature/snv-calling-and-phasing-workflow.pdf