2023-12-21

Hybracter: complete and accurate bacterial genome assembly

2023/12/23: fixed typos

2024/02/11: added help output; 05/09: added the published paper

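Improvements in the accuracy and availability of long-read sequencing mean that complete bacterial genomes are now routinely reconstructed using hybrid (i.e. short- and long-read) assembly approaches. Complete genomes allow a deeper understanding of bacterial evolution and of genomic variation beyond single nucleotide variants (SNVs). They are also important for identifying plasmids, which often carry medically significant antimicrobial resistance (AMR) genes. However, small plasmids are often missed or misassembled by long-read assembly algorithms. Hybracter allows fast, automatic and scalable recovery of near-perfect complete bacterial genomes using a long-read-first assembly approach. Hybracter was compared with existing automated hybrid assembly tools using a diverse panel of samples with manually curated ground-truth reference genomes. Hybracter is shown to be more accurate and faster than Unicycler, the existing gold-standard automated hybrid assembler. Hybracter with long reads only is also shown to be comparable to hybrid methods in recovering small plasmids.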

Documentation

https://hybracter.readthedocs.io/en/latest/install/

Here is Hybracter, our tool for generating automated hybrid & long-read-only bacterial genome assemblies. @rrwick @viji112 @BhavyaPapudeshi @BeardyMcFace @JuddLmj @Gh72938

Github: https://t.co/Warxeuyojb
Docs: https://t.co/DmPI0NKoMu
Preprint: https://t.co/FjZA9Wrepy

🧵1/11
- George Bouras (@GB13Faithless), December 14, 2023

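Why use Hybracter? (from the repository)

If you want long-read-only or hybrid bacterial isolate genome assemblies that are as automated as possible.
If, like the author, you love Unicycler but want something faster and more accurate.
If you need to assemble many (e.g. 10 or more) bacterial isolates as efficiently as possible.
If you want all the information about the assembly pipeline, such as whether polishing improved the genome, whether the assembly is complete, and how many plasmids were assembled.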

Installation

Github

#conda
mamba create -n hybracter python=3
conda activate hybracter
mamba install -c conda-forge -c bioconda hybracter -y

#pip
pip install hybracter

hybracter install

> hybracter --help

hybracter version 0.4.1

 _           _                    _
| |__  _   _| |__  _ __ __ _  ___| |_ ___ _ __
| '_ \| | | | '_ \| '__/ _` |/ __| __/ _ \ '__|
| | | | |_| | |_) | | | (_| | (__| ||  __/ |
|_| |_|\__, |_.__/|_|  \__,_|\___|\__\___|_|
       |___/

Usage: hybracter [OPTIONS] COMMAND [ARGS]...

For more options, run: hybracter command --help

Options:

-h, --help Show this message and exit.

Commands:

install Downloads and installs the plassembler database

hybrid Run hybracter with hybrid long and paired end short reads

hybrid-single Run hybracter hybrid on 1 isolate

long Run hybracter with only long reads

long-single Run hybracter long on 1 isolate

test-hybrid Test hybracter hybrid

test-long Test hybracter long

config Copy the system default config file

citation Print the citation(s) for hybracter

version Print the version for hybracter

> hybracter hybrid -h

hybracter version 0.4.1

Usage: hybracter hybrid [OPTIONS] [SNAKE_ARGS]...

Run hybracter with hybrid long and paired end short reads

Options:

-i, --input TEXT Input csv [required]

--no_pypolca Do not use pypolca to polish assemblies with

short reads

-o, --output PATH Output directory [default: hybracter_out]

--configfile TEXT Custom config file [default:

(outputDir)/config.yaml]

-t, --threads INTEGER Number of threads to use [default: 1]

--min_length INTEGER min read length for long reads

--min_quality INTEGER min read quality for long reads

--skip_qc Do not run porechop, filtlong and fastp to

QC the reads

-d, --databases PATH Plassembler Databases directory.

--medakaModel

Medaka Model. [default:

r1041_e82_400bps_sup_v4.2.0]

Flye Assembly Parameter [default: --nano-

hq]

--contaminants PATH Contaminants FASTA file to map long

readsagainst to filter out. Choose

--contaminants lambda to filter out phage

lambda long reads.

--dnaapler_custom_db PATH Custom amino acid FASTA file of sequences to

be used as a database with dnaapler custom.

--no_medaka Do not polish the long read assembly with

Medaka.

--logic [best|last] Hybracter logic to select best assembly. Use

--best to pick best assembly based on ALE

(hybrid) or pyrodigal mean length (long).

Use --last to pick the last polishing round

regardless. [default: best]

--use-conda / --no-use-conda Use conda for Snakemake rules [default:

use-conda]

--conda-prefix PATH Custom conda env directory

--snake-default TEXT Customise Snakemake runtime args [default:

--rerun-incomplete, --printshellcmds,

--nolock, --show-failed-logs, --conda-

frontend mamba]

-h, --help Show this message and exit.

CLUSTER EXECUTION:

hybracter hybrid ... --profile [profile]

For information on Snakemake profiles see:

https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles

RUN EXAMPLES:

Required: hybracter hybrid --input [file]

Specify output directory: hybracter hybrid ... --output [directory]

Specify threads: hybracter hybrid ... --threads [threads]

Disable conda: hybracter hybrid ... --no-use-conda

Change defaults: hybracter hybrid ... --snake-default="-k --nolock"

Add Snakemake args: hybracter hybrid ... --dry-run --keep-going --touch

Specify targets: hybracter hybrid ... all print_targets

Available targets:

all Run everything (default)

print_targets List available targets

> hybracter hybrid-single -h

hybracter version 0.4.1

Usage: hybracter hybrid-single [OPTIONS] [SNAKE_ARGS]...

Run hybracter hybrid on 1 isolate

Options:

-l, --longreads TEXT FASTQ file of longreads [required]

-1, --short_one TEXT R1 FASTQ file of paired end short reads

[required]

-2, --short_two TEXT R2 FASTQ file of paired end short reads

[required]

-s, --sample TEXT Sample name. [default: sample]

-c, --chromosome INTEGER Approximate lower-bound chromosome length

(in base pairs). [default: 1000000]

--no_pypolca Do not use pypolca to polish assemblies with

short reads

-o, --output PATH Output directory [default: hybracter_out]

--configfile TEXT Custom config file [default:

(outputDir)/config.yaml]

-t, --threads INTEGER Number of threads to use [default: 1]

--min_length INTEGER min read length for long reads

--min_quality INTEGER min read quality for long reads

--skip_qc Do not run porechop, filtlong and fastp to

QC the reads

-d, --databases PATH Plassembler Databases directory.

--medakaModel

Medaka Model. [default:

r1041_e82_400bps_sup_v4.2.0]

Flye Assembly Parameter [default: --nano-

hq]

--contaminants PATH Contaminants FASTA file to map long

readsagainst to filter out. Choose

--contaminants lambda to filter out phage

lambda long reads.

--dnaapler_custom_db PATH Custom amino acid FASTA file of sequences to

be used as a database with dnaapler custom.

--no_medaka Do not polish the long read assembly with

Medaka.

--logic [best|last] Hybracter logic to select best assembly. Use

--best to pick best assembly based on ALE

(hybrid) or pyrodigal mean length (long).

Use --last to pick the last polishing round

regardless. [default: best]

--use-conda / --no-use-conda Use conda for Snakemake rules [default:

use-conda]

--conda-prefix PATH Custom conda env directory

--snake-default TEXT Customise Snakemake runtime args [default:

--rerun-incomplete, --printshellcmds,

--nolock, --show-failed-logs, --conda-

frontend mamba]

-h, --help Show this message and exit.

CLUSTER EXECUTION:

hybracter hybrid-single ... --profile [profile]

For information on Snakemake profiles see:

https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles

RUN EXAMPLES:

Required: hybracter hybrid-single -l [FASTQ file of longreads]

Required: hybracter hybrid-single -1 [R1 FASTQ file of

paired end short reads]

Required: hybracter hybrid-single -2 [R2 FASTQ file of

paired end short reads]

Specify output directory: hybracter hybrid-single ... --output [directory]

Specify threads: hybracter hybrid-single ... --threads [threads]

Disable conda: hybracter hybrid-single ... --no-use-conda

Change defaults: hybracter hybrid-single ... --snake-default="-k --nolock"

Add Snakemake args: hybracter hybrid-single ... --dry-run

--keep-going --touch

Specify targets: hybracter hybrid-single ... all print_targets

Available targets:

all Run everything (default)

print_targets List available targets

> hybracter long -h

hybracter version 0.4.1

Usage: hybracter long [OPTIONS] [SNAKE_ARGS]...

Run hybracter with only long reads

Options:

-i, --input TEXT Input csv [required]

-o, --output PATH Output directory [default: hybracter_out]

--configfile TEXT Custom config file [default:

(outputDir)/config.yaml]

-t, --threads INTEGER Number of threads to use [default: 1]

--min_length INTEGER min read length for long reads

--min_quality INTEGER min read quality for long reads

--skip_qc Do not run porechop, filtlong and fastp to

QC the reads

-d, --databases PATH Plassembler Databases directory.

--medakaModel

Medaka Model. [default:

r1041_e82_400bps_sup_v4.2.0]

Flye Assembly Parameter [default: --nano-

hq]

--contaminants PATH Contaminants FASTA file to map long

readsagainst to filter out. Choose

--contaminants lambda to filter out phage

lambda long reads.

--dnaapler_custom_db PATH Custom amino acid FASTA file of sequences to

be used as a database with dnaapler custom.

--no_medaka Do not polish the long read assembly with

Medaka.

--logic [best|last] Hybracter logic to select best assembly. Use

--best to pick best assembly based on ALE

(hybrid) or pyrodigal mean length (long).

Use --last to pick the last polishing round

regardless. [default: best]

--use-conda / --no-use-conda Use conda for Snakemake rules [default:

use-conda]

--conda-prefix PATH Custom conda env directory

--snake-default TEXT Customise Snakemake runtime args [default:

--rerun-incomplete, --printshellcmds,

--nolock, --show-failed-logs, --conda-

frontend mamba]

-h, --help Show this message and exit.

CLUSTER EXECUTION:

hybracter hybrid ... --profile [profile]

For information on Snakemake profiles see:

https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles

RUN EXAMPLES:

Required: hybracter long --input [file]

Specify output directory: hybracter long ... --output [directory]

Specify threads: hybracter long ... --threads [threads]

Disable conda: hybracter long ... --no-use-conda

Change defaults: hybracter long ... --snake-default="-k --nolock"

Add Snakemake args: hybracter long ... --dry-run --keep-going --touch

Specify targets: hybracter long ... all print_targets

Available targets:

all Run everything (default)

print_targets List available targets

> hybracter long-single -h

hybracter version 0.4.1

Usage: hybracter long-single [OPTIONS] [SNAKE_ARGS]...

Run hybracter long on 1 isolate

Options:

-l, --longreads TEXT FASTQ file of longreads [required]

-s, --sample TEXT Sample name. [default: sample]

-c, --chromosome INTEGER FApproximate lower-bound chromosome length

(in base pairs). [default: 1000000]

-o, --output PATH Output directory [default: hybracter_out]

--configfile TEXT Custom config file [default:

(outputDir)/config.yaml]

-t, --threads INTEGER Number of threads to use [default: 1]

--min_length INTEGER min read length for long reads

--min_quality INTEGER min read quality for long reads

--skip_qc Do not run porechop, filtlong and fastp to

QC the reads

-d, --databases PATH Plassembler Databases directory.

--medakaModel

Medaka Model. [default:

r1041_e82_400bps_sup_v4.2.0]

Flye Assembly Parameter [default: --nano-

hq]

--contaminants PATH Contaminants FASTA file to map long

readsagainst to filter out. Choose

--contaminants lambda to filter out phage

lambda long reads.

--dnaapler_custom_db PATH Custom amino acid FASTA file of sequences to

be used as a database with dnaapler custom.

--no_medaka Do not polish the long read assembly with

Medaka.

--logic [best|last] Hybracter logic to select best assembly. Use

--best to pick best assembly based on ALE

(hybrid) or pyrodigal mean length (long).

Use --last to pick the last polishing round

regardless. [default: best]

--use-conda / --no-use-conda Use conda for Snakemake rules [default:

use-conda]

--conda-prefix PATH Custom conda env directory

--snake-default TEXT Customise Snakemake runtime args [default:

--rerun-incomplete, --printshellcmds,

--nolock, --show-failed-logs, --conda-

frontend mamba]

-h, --help Show this message and exit.

CLUSTER EXECUTION:

hybracter long-single ... --profile [profile]

For information on Snakemake profiles see:

https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles

RUN EXAMPLES:

Required: hybracter long-single -l [FASTQ file of longreads]

Specify output directory: hybracter long-single ... --output [directory]

Specify threads: hybracter long-single ... --threads [threads]

Disable conda: hybracter long-single ... --no-use-conda

Change defaults: hybracter long-single ... --snake-default="-k --nolock"

Add Snakemake args: hybracter long-single ... --dry-run --keep-going --touch

Specify targets: hybracter long-single ... all print_targets

Available targets:

all Run everything (default)

print_targets List available targets

> hybracter citation

hybracter version 0.4.1

Please cite hybracter in your paper using:

Bouras, G. (2023). Hybracter: a modern hybrid and long-only bacterial

assembly pipeline for many isolates.

Please consider also citing these dependencies (especially my own

tools Plassembler and Dnaapler :) ):

Plassembler:

https://doi.org/10.1093/bioinformatics/btad409

Dnaapler:

https://github.com/gbouras13/dnaapler

Snaketool:

https://doi.org/10.31219/osf.io/8w5j3

Wick et al's Assembling the perfect bacterial genome paper (provided

the intellectual framework for hybracter):

https://doi.org/10.1371/journal.pcbi.1010905

Trimnami:

https://github.com/beardymcjohnface/Trimnami

Filtlong:

https://github.com/rrwick/Filtlong

Porechop and Porechop_abi:

https://doi.org/10.1093/bioadv/vbac085

https://github.com/rrwick/Porechop

fastp:

https://doi.org/10.1093/bioinformatics/bty560

Flye:

https://doi.org/10.1038/s41587-019-0072-8

ALE:

https://doi.org/10.1093/bioinformatics/bts723

Medaka:

https://github.com/nanoporetech/medaka

Pyrodigal:

https://doi.org/10.21105/joss.04296

Polypolish:

https://doi.org/10.1371/journal.pcbi.1009802

POLCA:

https://doi.org/10.1093/bioinformatics/btt476

Snakemake:

https://doi.org/10.12688/f1000research.29032.1

Database

The database can be installed with the following command. This step is required.

hybracter install

Test run

Test data is downloaded and the pipeline runs automatically. It takes at least several tens of minutes to finish.

hybracter test-hybrid --threads 8
hybracter test-long --threads 8

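Usage

1. hybracter hybrid-single - assembles a single genome from long reads and paired-end short reads. It takes parameters similar to Unicycler's. The example below assumes a genome of about 5 Mb and uses 20 threads. Note that -c is an approximate lower bound on chromosome length rather than the exact genome size.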

hybracter hybrid-single -l long.fq.gz -1 R1.fq.gz -2 R2.fq.gz -s sample_name -c 5000000 -o outdir -t 20

-c Approximate lower-bound chromosome length (in base pairs). [default: 1000000]
-t Number of threads to use [default: 1]
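
Because the help text describes -c as an approximate lower bound on chromosome length, one way to choose it is to take a fraction of the expected chromosome size. The following is only a minimal sketch: the helper name lower_bound_len and the 80% default are assumptions made here for illustration, not values taken from the Hybracter documentation.

```shell
# Hypothetical helper: print a conservative value for -c from an expected
# chromosome size in bp. The 80% default is an assumption for illustration.
lower_bound_len() {
    expected_bp=$1
    percent=${2:-80}
    # Integer arithmetic: expected size scaled by the chosen percentage.
    echo $(( expected_bp * percent / 100 ))
}

# Example: a ~5 Mb chromosome with the default 80%.
lower_bound_len 5000000    # prints 4000000
```

The result could then be passed on the command line as, for example, -c "$(lower_bound_len 5000000)".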

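2. hybracter hybrid - assembles multiple genomes from long reads and paired-end short reads.

A 5-column input CSV file must be specified with --input. Column 1 is the sample name for the isolate, column 2 the long-read FASTQ file, column 3 the minimum chromosome length for that sample, column 4 the R1 short-read FASTQ, and column 5 the R2 short-read FASTQ.

Example from the repository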

s_aureus_sample1,sample1_long_read.fastq.gz,2500000,sample1_SR_R1.fastq.gz,sample1_SR_R2.fastq.gz
p_aeruginosa_sample2,sample2_long_read.fastq.gz,5500000,sample2_SR_R1.fastq.gz,sample2_SR_R2.fastq.gz

hybracter hybrid -i input.csv -o outdir -t 12
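
With many isolates, writing the 5-column CSV by hand becomes error-prone. The sketch below generates the rows from a directory of reads. It assumes a hypothetical file naming convention (<sample>_long.fastq.gz, <sample>_R1.fastq.gz, <sample>_R2.fastq.gz) and applies a single minimum chromosome length to every sample; the function name make_hybrid_csv is likewise made up for this example.

```shell
# Hypothetical naming convention (an assumption for this sketch):
#   <sample>_long.fastq.gz, <sample>_R1.fastq.gz, <sample>_R2.fastq.gz
# Prints one 5-column CSV row per sample that has all three files.
make_hybrid_csv() (
    reads_dir=$1
    min_chrom_len=$2   # one lower bound applied to every sample (simplification)
    for long_fq in "$reads_dir"/*_long.fastq.gz; do
        [ -e "$long_fq" ] || continue   # no matches: the glob stays literal
        sample=$(basename "$long_fq" _long.fastq.gz)
        r1="$reads_dir/${sample}_R1.fastq.gz"
        r2="$reads_dir/${sample}_R2.fastq.gz"
        if [ -f "$r1" ] && [ -f "$r2" ]; then
            printf '%s,%s,%s,%s,%s\n' "$sample" "$long_fq" "$min_chrom_len" "$r1" "$r2"
        else
            echo "skipping $sample: short reads not found" >&2
        fi
    done
)

# Usage: make_hybrid_csv reads/ 2500000 > input.csv
```

If a batch mixes species with different genome sizes, as in the repository example, column 3 would need to be edited per sample afterwards.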

3. hybracter long-single - assembles a single genome from long reads only.

No input CSV is needed; the long-read FASTQ file is passed directly with -l.

hybracter long-single -l long.fq.gz -s sample_name -c <chromosome size> -o outdir -t 12

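4. hybracter long - assembles multiple genomes from long reads only.

A 3-column input CSV file (sample name, long-read FASTQ file, minimum chromosome length) must be specified with --input. No other input is required.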

hybracter long -i input.csv -o outdir -t 12
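
If a 5-column hybrid CSV already exists, the input for hybracter long can be derived from it, since the long-read layout corresponds to its first three columns (sample name, long-read FASTQ, minimum chromosome length). A minimal sketch, assuming no commas occur in sample names or paths; the function name hybrid_to_long_csv is made up for this example.

```shell
# Keep columns 1-3 (sample name, long-read FASTQ, minimum chromosome length)
# of a 5-column hybrid CSV to obtain a 3-column long-read CSV.
hybrid_to_long_csv() {
    cut -d, -f1-3 "$1"
}

# Usage: hybrid_to_long_csv input.csv > long_input.csv
```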

5. hybracter install - downloads and installs the required Plassembler database.

hybracter install -d databases_dir
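
When the database is installed into a custom directory, the help output suggests that the same directory is then passed to the assembly commands with -d/--databases. The sketch below keeps the location in one variable and reuses it. The wrapper run and the variable name HYBRACTER_DB are assumptions made for this example; by default the wrapper only prints the commands, so nothing is executed unless DRY_RUN=0 is set.

```shell
# Hypothetical wrapper: print each command, and only execute it when
# DRY_RUN is set to 0. Defaults to printing so the sketch is safe to run.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

db_dir=${HYBRACTER_DB:-databases_dir}   # HYBRACTER_DB is an assumed variable name

# Install the Plassembler database once, then point later runs at it.
run hybracter install -d "$db_dir"
run hybracter long -i input.csv -o outdir -t 12 -d "$db_dir"
```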

Example output (from run 3)

FINAL_OUTPUT/

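sample_name_final.fasta - the final assembly for that sample
sample_name_chromosome.fasta - the final chromosome assembly only for that sample
sample_name_plasmid.fasta - the final plasmid assembly only for that sample. An empty file means that no plasmids were found in that sample.

hybracter_summary.tsv - summary statistics for the assemblies

sample_name_per_contig_stats.tsv

Contig name, length, GC%, and whether each contig is circular or linear.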
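
Since an empty plasmid FASTA indicates that no plasmids were found, a quick check after a run is to count the records in each output file. A minimal sketch; the helper name count_contigs is made up for this example, and the file path in the usage comment simply follows the listing above.

```shell
# Count FASTA records (header lines starting with '>') in a file.
# grep -c exits non-zero when the count is 0, so `|| true` keeps the
# function usable under `set -e`.
count_contigs() {
    grep -c '^>' "$1" || true
}

# Usage:
#   n=$(count_contigs FINAL_OUTPUT/sample_name_plasmid.fasta)
#   [ "$n" -eq 0 ] && echo "no plasmids recovered"
```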

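Other notes (from the repository)

hybracter is designed for assembling bacterial isolate genomes with a long-read-first assembly approach using Oxford Nanopore Technologies (ONT) reads. It scales massively using the parallel power of HPC clusters and Snakemake profiles. Matched paired-end short reads can optionally be used for polishing.
hybracter is automated, scalable and fast, and requires no expertise in bioinformatics or microbial genomics.
Thanks to its Snakemake and Snaketool implementation, hybracter has the advantage of scaling across multiple samples. If you have access to a cluster, hybracter is therefore likely to be faster than tools such as dragonflye.
hybracter is largely based on Ryan Wick's magnificent tutorial and the associated paper.
hybracter differs from Unicycler in that it adds targeted plasmid assembly with Plassembler, contig reorientation with dnaapler, and extra steps for polishing and summary statistics.
Because hybracter uses Plassembler, it produces more accurate plasmid assemblies.
hybracter automatically suggests whether an assembly is 'complete' or 'incomplete'.
hybracter assesses each polishing step and selects the genome that appears to be of the best quality.
If you want the best possible (manual) bacterial assembly for a single isolate, Trycycler is recommended.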

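Impressions

My impression is that this is an automated, very easy-to-use assembler that adapts flexibly to a range of use cases. Accuracy is discussed in the paper. Because it is designed to scale to many samples, I would definitely like to use it when aiming for complete assemblies of many bacteria in future. I have only assembled a handful of genomes so far, but the runtime seems far shorter than Unicycler's.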

Citation

Hybracter: Enabling Scalable, Automated, Complete and Accurate Bacterial Genome Assemblies
George Bouras, Ghais Houtak, Ryan R. Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M. Judd, Anna E. Sheppard, Robert A. Edwards, Sarah Vreugde

bioRxiv, Posted December 13, 2023

Hybracter: enabling scalable, automated, complete and accurate bacterial genome assemblies Open Access

George Bouras, Ghais Houtak, Ryan R. Wick, Vijini Mallawaarachchi, Michael J. Roach, Bhavya Papudeshi, Louise M. Judd, Anna E. Sheppard, Robert A. Edwards and Sarah Vreugde

Published: 08 May 2024 https://doi.org/10.1099/mgen.0.001244