Homopolish: correcting systematic errors in Nanopore assemblies

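Nanopore sequencing is widely used to reconstruct microbial genomes. Because its error rate is relatively high, errors in the assembled genome are usually corrected with neural networks trained on Nanopore reads. Systematic errors, however, typically remain uncorrected. This paper designs a model trained on homologous sequences to correct Nanopore systematic errors. The resulting program, Homopolish, outperforms Medaka and HELEN on bacterial, viral, fungal, and metagenomic datasets. When combined with Medaka/HELEN, genome quality can exceed Q50 on R9.4 flow cells. The authors show that Nanopore-only sequencing can produce high-quality microbial genomes sufficient for downstream analysis.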
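As a rough guide to what Q50 means: consensus quality is usually reported on the Phred scale, Q = -10 log10(error rate), so Q50 corresponds to about one error per 100,000 bases. The following is a minimal sketch of that conversion. The function names and the example genome size are illustrative and are not taken from the paper.

```python
import math


def phred_q(errors: int, total_bases: int) -> float:
    """Phred-scaled quality: Q = -10 * log10(errors / total_bases)."""
    if total_bases <= 0:
        raise ValueError("total_bases must be positive")
    if errors <= 0:
        return math.inf  # no observed errors: Q is unbounded
    return -10.0 * math.log10(errors / total_bases)


def errors_per_genome(q: float, genome_size: int) -> float:
    """Expected number of errors in a genome of the given size at quality q."""
    return genome_size * 10 ** (-q / 10.0)
```

For example, a 5 Mb bacterial genome at Q50 would be expected to carry about 50 errors, compared with about 5,000 at Q30.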

Installation

On macOS 10.14, I created the environment with mamba from the environment.yml provided by the authors.

GitHub

git clone https://github.com/ythuang0522/homopolish.git
cd homopolish
mamba env create -f environment.yml
conda activate homopolish

$ python3 homopolish.py -h
usage: homopolish.py [-h] [-v] {polish,train,make_train_data} ...

Homopolish is a SVM based polisher for polishing ONT-based assemblies.
1) polish: Run the polishing pipeline.
2) train: Train your own SVM model.
3) make_train_data: Make training data with reference genome.

positional arguments:
  {polish,train,make_train_data}
    polish              Run the polishing pipeline.
    train               Train your own SVM model.
    make_train_data     Make training data with reference genome.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         Show version.

$ python3 homopolish.py polish -h
usage: homopolish.py polish [-h] -m MODEL_PATH -a ASSEMBLY
                            (-s SKETCH_PATH | -g GENUS | -l LOCAL_DB_PATH)
                            [-t THREADS] [-o OUTPUT_DIR]
                            [--minimap_args MINIMAP_ARGS]
                            [--mash_threshold MASH_THRESHOLD]
                            [--download_contig_nums DOWNLOAD_CONTIG_NUMS] [-d]
                            [--mash_screen] [--meta]

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_PATH, --model_path MODEL_PATH
                        [REQUIRED] Path to a trained model (pkl file). Please
                        see our github page to see options.
  -a ASSEMBLY, --assembly ASSEMBLY
                        [REQUIRED] Path to a assembly genome.
  -s SKETCH_PATH, --sketch_path SKETCH_PATH
                        Path to a mash sketch file.
  -g GENUS, --genus GENUS
                        Genus name
  -l LOCAL_DB_PATH, --local_DB_path LOCAL_DB_PATH
                        Path to your local DB
  -t THREADS, --threads THREADS
                        Number of threads to use. [1]
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to the output directory. [output]
  --minimap_args MINIMAP_ARGS
                        Minimap2 -x argument. [asm5]
  --mash_threshold MASH_THRESHOLD
                        Mash output threshold. [0.95]
  --download_contig_nums DOWNLOAD_CONTIG_NUMS
                        How much contig to download from NCBI. [20]
  -d, --debug           Keep the information of every contig after mash, such
                        as homologous sequences and its identity infomation.
                        [no]
  --mash_screen         Use mash screen. [mash dist]
  --meta                Your assembly genome is metagenome. [no]

$ python3 homopolish.py train -h
usage: homopolish.py train [-h] -d DATAFRAME_DIR [-o OUTPUT_DIR]
                           [-p OUTPUT_PREFIX] [-t THREADS]

optional arguments:
  -h, --help            show this help message and exit
  -d DATAFRAME_DIR, --dataframe_dir DATAFRAME_DIR
                        [REQUIRED] Path to a directory for alignment
                        dataframe.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to the output directory. [output]
  -p OUTPUT_PREFIX, --output_prefix OUTPUT_PREFIX
                        Prefix for the train model. [train]
  -t THREADS, --threads THREADS
                        Number of threads to use. [1]

$ python3 homopolish.py make_train_data -h
usage: homopolish.py make_train_data [-h] -r REFERENCE -a ASSEMBLY
                                     (-s SKETCH_PATH | -g GENUS | -l LOCAL_DB_PATH)
                                     [-t THREADS] [-o OUTPUT_DIR]
                                     [--minimap_args MINIMAP_ARGS]
                                     [--mash_threshold MASH_THRESHOLD]
                                     [--download_contig_nums DOWNLOAD_CONTIG_NUMS]
                                     [-d] [--mash_screen] [--meta]

optional arguments:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        [REQUIRED] True reference aligned to assembly genome.
                        Include labels in output.
  -a ASSEMBLY, --assembly ASSEMBLY
                        [REQUIRED] Path to a assembly genome.
  -s SKETCH_PATH, --sketch_path SKETCH_PATH
                        Path to a mash sketch file.
  -g GENUS, --genus GENUS
                        Genus name
  -l LOCAL_DB_PATH, --local_DB_path LOCAL_DB_PATH
                        Path to your local DB
  -t THREADS, --threads THREADS
                        Number of threads to use. [1]
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Path to the output directory. [output]
  --minimap_args MINIMAP_ARGS
                        Minimap2 -x argument. [asm5]
  --mash_threshold MASH_THRESHOLD
                        Mash output threshold. [0.95]
  --download_contig_nums DOWNLOAD_CONTIG_NUMS
                        How much contig to download from NCBI. [20]
  -d, --debug           Keep the information of every contig after mash, such
                        as homologous sequences and its identity infomation.
                        [no]
  --mash_screen         Use mash screen. [mash dist]
  --meta                Your assembly genome is metagenome. [no]

Preparing the database

Mash sketches are provided for viruses, bacteria, and fungi. Here I download the bacterial sketch.

wget http://bioinfo.cs.ccu.edu.tw/bioinfo/mash_sketches/bacteria.msh.gz
gunzip bacteria.msh.gz

Usage

Specify the sketch database and the FASTA file of the draft genome assembly. If the sample was sequenced on an R9.4 flow cell, specify -m R9.4.pkl.

python3 homopolish.py polish -a input_genome.fasta -s bacteria.msh -m R9.4.pkl -o outdir

Closely related genomes are downloaded, and then polishing is run.

$ python3 homopolish.py polish -a racon.fasta -s bacteria.msh -m R9.4.pkl -o outdir2
[2021/04/30 13:09] INFO: RUN-ID: contig_1
[2021/04/30 13:09] INFO: Stage: Select closely-related genomes
TIME Select closely-related genomes: 0 MINS 5 SECS.
[2021/04/30 13:09] INFO: Stage: Download closely-related genomes
INFO: 18 homologous sequence need to download:
Downloaded GCF_001578645.1_ASM157864v1_genomic.fna.gz
Downloaded GCF_001647615.1_ASM164761v1_genomic.fna.gz
Downloaded GCF_001580195.1_ASM158019v1_genomic.fna.gz
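To illustrate the selection stage, here is a minimal sketch of how closely related genomes could be picked from `mash dist` output, which is tab-separated with the reference ID in the first column and the Mash distance in the third. This is not Homopolish's own code. It assumes the --mash_threshold value is read as a minimum identity approximated by 1 - distance, and that --download_contig_nums caps the number of hits. The function name and parameter names are hypothetical.

```python
from typing import List, Tuple


def select_related(mash_dist_lines: List[str],
                   min_identity: float = 0.95,
                   max_genomes: int = 20) -> List[Tuple[str, float]]:
    """Pick the closest references from `mash dist` tab-separated output."""
    hits = []
    for line in mash_dist_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        reference_id, distance = fields[0], float(fields[2])
        # Assumption: identity is approximated as 1 - Mash distance.
        identity = 1.0 - distance
        if identity >= min_identity:
            hits.append((reference_id, identity))
    # Closest genomes first, then keep at most max_genomes of them.
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits[:max_genomes]
```

With the defaults shown in the help text (0.95 and 20), a run can end up with fewer than 20 genomes when not enough references pass the threshold.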

When the run finishes, the polished FASTA file is written to outdir/.

Homopolish focuses only on removing systematic indel errors, so it should be run after Racon or Medaka.
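When polishing many assemblies, the command can be assembled in a small script. The sketch below builds the argument list using only flags that appear in the polish help text, and enforces the rule that exactly one of -s, -g, or -l is given. The function names are hypothetical, and it assumes the script is run from the cloned homopolish directory with the conda environment active.

```python
import subprocess
from typing import List, Optional


def build_polish_command(assembly: str, model: str, output_dir: str = "output",
                         sketch: Optional[str] = None,
                         genus: Optional[str] = None,
                         local_db: Optional[str] = None,
                         threads: int = 1) -> List[str]:
    """Build the argv list for `homopolish.py polish`."""
    sources = [("-s", sketch), ("-g", genus), ("-l", local_db)]
    chosen = [(flag, value) for flag, value in sources if value is not None]
    # The help text marks -s, -g, and -l as mutually exclusive and required.
    if len(chosen) != 1:
        raise ValueError("give exactly one of sketch, genus, or local_db")
    flag, value = chosen[0]
    return ["python3", "homopolish.py", "polish",
            "-a", assembly, "-m", model, flag, value,
            "-t", str(threads), "-o", output_dir]


def run_polish(**kwargs) -> None:
    """Run Homopolish; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_polish_command(**kwargs), check=True)
```

Calling run_polish(assembly="input_genome.fasta", model="R9.4.pkl", sketch="bacteria.msh", output_dir="outdir") would run the same polishing step as the command shown earlier, with the thread count passed explicitly.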

Citation

Homopolish: a method for the removal of systematic errors in nanopore sequencing by homologous polishing
Yao-Ting Huang, Po-Yu Liu & Pei-Wen Shih
Genome Biology volume 22, Article number: 95 (2021)
