2019-07-14

アセンブリグラフからプラスミドを検出する HyAsP

　プラスミドはバクテリアで一般的なextra-chromosomalのDNA分子である。プラスミドは、それらの長さ（それらはchromosomeよりはるかに短い傾向がある）、コピー数（プラスミドは細胞内に複数のコピーで存在する場合がある）およびGC含有量などの様々な特徴においてchromosomeとは異なる。それらは水平遺伝子導入において、そしてそれ故、ビルレンス因子および抗生物質耐性の伝達において重要な役割を果たす（Carattoli、2013; Dolejska and Papagiannitsis、2018）。それ故、細菌サンプル中のプラスミドの同定は、薬剤耐性細菌の増殖に対する緩和戦略にとって重要である。

　全ゲノムシーケンシング（ WGS）データを使用する方法に焦点を合わせて、最近プラスミド検出を行う種々のアプローチが探求されている。最近のプラスミド検出とアセンブリのレビューは (Arredondo-Alonso et al., 2017; Laczny et al., 2017; Orlek et al., 2017)を参照する。既存のツールは、de-novoメソッドとリファレンスベースのメソッドという2種類のアプローチに区別できる。 Recycler（Rozov et al、2017）およびPlasmidSPAdes（Antipov et al、2016）は前者の方法群に属する。Recyclerは、リードデプスおよび長さの特徴に基づいてアセンブリグラフのサイクルを繰り返し剥がすことによってプラスミドを予測する。PlasmidSPAdesは、プラスミドのリードデプスがchromosomeと異なると仮定する。それはchromosomeのリードデプスを推定し、可能性のあるchromosomeのコンティグを取り除き、そして結果として得られる縮小アセンブリグラフの連結成分からプラスミドを予測する。 MOB-recon（Robertson and Nash、2018）は最近のリファレンスベースの方法である。それはリファレンスプラスミドのデータベースならびに既知のレプリコンおよびrelaxasesのコレクションを使用する。コンティグは、リファレンスデータベースに対してマッピングされ、そして推定のプラスミド単位にグループ分けされ、それらはレプリコンまたはrelaxasesなしでそれらの単位を捨てることによってさらに精製される。

　このAPPLICATIONS NOTEでは、WGSアセンブリ内のプラスミドコンティグを識別し、それらを個々のプラスミドにまとめるためのハイブリッドツールで、リファレンスベースの方法とde-novoの方法を組み合わせた新しいツール、HyAsP（プラスミド用ハイブリッドアセンブラ）を紹介する。既知のプラスミド遺伝子とリードデプスやGC含有量などの特性を考慮して、 66のゲノムを含むデータセットについて、HyAsPの予測品質をPlasmidSPAdesおよびMOB-reconと比較し、HyAsPが一般に両方のツールよりも優れていることを示す。
　HyAsPは、既知のプラスミド遺伝子からの情報、リードデプス、およびGC含有量を使用して、アセンブリグラフからプラスミドコンティグを検出、ビニング、およびアセンブリするためのgreedy algorithmである。アルゴリズム（アルゴリズム1の擬似コードのワークフローを含む）および実験についての概要を論文の以下に示す。完全な技術詳細は補足資料に記載されている。（以下略）

インストール

ubuntu16.04にて、condaの仮想環境を作ってテストした（conda create -n hyasp python=3.5.4）。

依存

HyAsP was developed and tested with the following software dependencies:

Python (python, version 3.5.4; packages: Bio, math, numpy, os, pandas, random, subprocess, sys)
BLAST+ (makeblastdb tblastn and blastn; version 2.6.0)

FastQC (fastqc, version 0.11.5)

sickle (sickle, version 1.33)
cutadapt (cutadapt, version 1.16)
Trim Galore (trim_galore, version 0.4.5_dev)
Unicycler (unicycler, version 0.4.5)

　　　SPAdes (spades.py, version 3.12.0)
　　　Racon (racon, version 1.3.0)
　　　Pilon (pilon-1.22.jar, version 1.22)
　　　SAMtools (samtools, version 1.5)
　　　Bowtie 2 (bowtie2 and bowtie2-build; version 2.3.3.1)
　　　Java (java, version 1.8.0_121)

#blast bioconda (link)
conda install -c bioconda -y blast==2.6.0

#fastqc bioconda (link)
conda install -c bioconda -y fastqc==0.11.5

#sickle bioconada (link)
conda install -c bioconda -y sickle

#cutadapt bioconada (link)
conda install -c bioconda -y cutadapt==1.16

#trim galore bioconada (link)
conda install -c bioconda -y trim-galore==0.4.5

#unicycler bioconada (link)
conda install -c bioconda -y unicycler

本体　GIthub

git clone https://github.com/cchauve/hyasp.git
cd hyasp
python setup.py sdist
pip install dist/HyAsP-1.0.0.tar.gz

> hyasp -h

$ hyasp -h

usage: hyasp [-h] {find,create,map,filter} ...

positional arguments:

{find,create,map,filter}

task to be performed

find find plasmids in the assembly graph

create create gene database from collection of plasmids

map map genes to contigs

filter filter gene-contig mapping

optional arguments:

-h, --help show this help message and exit

> hyasp map -h

$ hyasp map -h

usage: hyasp map [-h] (-f FROM_FASTA | -g FROM_GFA) [-c] [-v]

[--makeblastdb MAKEBLASTDB] [--blastn BLASTN]

genes_file mapping_file

positional arguments:

genes_file FASTA file of plasmid genes to be mapped

mapping_file output gene-contig mapping

optional arguments:

-h, --help show this help message and exit

-f FROM_FASTA, --from_fasta FROM_FASTA

file containing the contigs (in FASTA format) to which

the genes should be matched

-g FROM_GFA, --from_gfa FROM_GFA

file containing the contigs (as part of assembly graph

in GFA format) to which the genes should be matched

-c, --clean remove temporary files when mapping is done

-v, --verbose print more information

--makeblastdb MAKEBLASTDB

path to makeblastdb executable

--blastn BLASTN path to blastn executable

> hyasp filter -h

$ hyasp filter -h

usage: hyasp filter [-h] [-i IDENTITY_THRESHOLD] [-l LENGTH_THRESHOLD] [-f]

[-v]

genes_file mapping filtered_mapping

positional arguments:

genes_file FASTA file of plasmid genes used to create the mapping

mapping gene-contig mapping (BLAST output (default format 6)

to be filtered

filtered_mapping output file containing the filtered gene-contig

mapping

optional arguments:

-h, --help show this help message and exit

-i IDENTITY_THRESHOLD, --identity_threshold IDENTITY_THRESHOLD

minimum identity in a hit to keep it

-l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD

minimum fraction of query that has to be matched to

keep a hit

-f, --find_fragmented

if set, search for fragmented hits, i.e. several short

high-quality hits that together satisfy the length

threshold

-v, --verbose print more information

> hyasp find -h

$ hyasp find -h

usage: hyasp find [-h] [-g MIN_GENE_DENSITY] [-k MIN_SEED_GENE_DENSITY]

[-l MIN_LENGTH] [-L MAX_LENGTH] [-r MIN_READ_DEPTH]

[-d MIN_PLASMID_READ_DEPTH] [-G MAX_GC_DIFF]

[-c MAX_INTERMEDIATE_CONTIGS] [-n MAX_INTERMEDIATE_NT]

[-s MAX_SCORE] [-w SCORE_WEIGHTS] [-q] [-o OVERLAP_ENDS]

[-b BINNING] [-f FANOUT | -p] [-u] [-N] [-v]

assembly_graph genes_file gene_contig_mapping output_dir

positional arguments:

assembly_graph assembly graph (GFA format)

genes_file genes database (FASTA format)

gene_contig_mapping gene-contig mapping (BLAST output, default format 6)

output_dir output directory

optional arguments:

-h, --help show this help message and exit

-g MIN_GENE_DENSITY, --min_gene_density MIN_GENE_DENSITY

minimum gene density of putative plasmids (plasmids

with lower gene density are marked as questionable)

-k MIN_SEED_GENE_DENSITY, --min_seed_gene_density MIN_SEED_GENE_DENSITY

minimum gene density of a contig to be considered as a

seed

-l MIN_LENGTH, --min_length MIN_LENGTH

minimum length of putative plasmids (shorter plasmids

are marked as questionable)

-L MAX_LENGTH, --max_length MAX_LENGTH

maximum length of putative plasmids (longer plasmids

are marked as questionable)

-r MIN_READ_DEPTH, --min_read_depth MIN_READ_DEPTH

minimum read depth of contigs to be used in plasmids

-d MIN_PLASMID_READ_DEPTH, --min_plasmid_read_depth MIN_PLASMID_READ_DEPTH

minimum read depth of putative plasmids (plasmids with

lower read depth are marked as questionable)

-G MAX_GC_DIFF, --max_gc_diff MAX_GC_DIFF

maximum difference in GC content between contig to be

added and current plasmid

-c MAX_INTERMEDIATE_CONTIGS, --max_intermediate_contigs MAX_INTERMEDIATE_CONTIGS

maximum number of contigs between gene-containing

contigs in a plasmid

-n MAX_INTERMEDIATE_NT, --max_intermediate_nt MAX_INTERMEDIATE_NT

maximum sum of lengths of contigs between gene-

containing contigs in a plasmid

-s MAX_SCORE, --max_score MAX_SCORE

maximum score for eligible extensions

-w SCORE_WEIGHTS, --score_weights SCORE_WEIGHTS

weights of the score components (comma-separated list

of elements <component>=<weight>)

-q, --keep_subplasmids

do not mark plasmids contained by others as

questionable

-o OVERLAP_ENDS, --overlap_ends OVERLAP_ENDS

minimum overlap between ends of plasmid sequence for

circularisation

-b BINNING, --binning BINNING

number standard deviations the read depth and GC

content of plasmids are allowed to differ from the

centre of their bin (binning is activated by setting

the parameter to any value unequal NaN)

-f FANOUT, --fanout FANOUT

maximum number of predecessors / successors of any

contig in a plasmid (resp. contig collection)

-p, --probabilistic activates probabilistic extensions

-u, --use_median determine average read depth of a plasmid using the

median instead of the mean

-N, --use_node_based activates node-based extensions

-v, --verbose print more information

(hyasp) kazu@kazu:~/Document/hyasp/example/example_files$

>hyasp_pipeline -h

$ hyasp_pipeline -h

usage: hyasp_pipeline [-h] [-1 FIRST_SHORT_READS] [-2 SECOND_SHORT_READS]

[-s SINGLE_SHORT_READS] [-l LONG_READS]

[-u UNICYCLER_MODE] [-f] [-n]

[--read_qual_threshold READ_QUAL_THRESHOLD]

[--min_read_length MIN_READ_LENGTH]

[--unicycler_verbosity UNICYCLER_VERBOSITY]

[--unicycler_keep UNICYCLER_KEEP]

[--identity_threshold IDENTITY_THRESHOLD]

[--length_threshold LENGTH_THRESHOLD]

[--min_gene_density MIN_GENE_DENSITY]

[--min_seed_gene_density MIN_SEED_GENE_DENSITY]

[--min_length MIN_LENGTH] [--max_length MAX_LENGTH]

[--min_read_depth MIN_READ_DEPTH]

[--min_plasmid_read_depth MIN_PLASMID_READ_DEPTH]

[--max_gc_diff MAX_GC_DIFF]

[--max_intermediate_contigs MAX_INTERMEDIATE_CONTIGS]

[--max_intermediate_nt MAX_INTERMEDIATE_NT]

[--max_score MAX_SCORE] [--score_weights SCORE_WEIGHTS]

[--keep_subplasmids] [--overlap_ends OVERLAP_ENDS]

[--binning BINNING] [--fanout FANOUT] [--probabilistic]

[--use_median] [--use_node_based] [--verbose]

[--fastqc FASTQC] [--sickle SICKLE]

[--cutadapt CUTADAPT] [--trim_galore TRIM_GALORE]

[--unicycler UNICYCLER] [--spades SPADES]

[--racon RACON] [--pilon PILON] [--samtools SAMTOOLS]

[--bowtie2 BOWTIE2] [--bowtie2_build BOWTIE2_BUILD]

[--java JAVA] [--makeblastdb MAKEBLASTDB]

[--tblastn TBLASTN] [--blastn BLASTN]

analysis_dir genes_file

positional arguments:

analysis_dir directory into which all data and outputs are stored

genes_file (plasmid) genes to map to contigs

optional arguments:

-h, --help show this help message and exit

-1 FIRST_SHORT_READS, --first_short_reads FIRST_SHORT_READS

FASTQ file of first paired reads (if any)

-2 SECOND_SHORT_READS, --second_short_reads SECOND_SHORT_READS

FASTQ file of second paired reads (if any)

-s SINGLE_SHORT_READS, --single_short_reads SINGLE_SHORT_READS

FASTQ file of single / unpaired reads (if any)

-l LONG_READS, --long_reads LONG_READS

FASTQ file of long reads (if any)

-u UNICYCLER_MODE, --unicycler_mode UNICYCLER_MODE

Unicycler assembly mode (conservative, normal or bold)

-f, --use_fastqc if set, FASTQ reads are analysed with FastQC before

and after preprocessing

-n, --no_preprocessing

if set, the FASTQ reads are not preprocessed before

the assembly

--read_qual_threshold READ_QUAL_THRESHOLD

threshold for trimming low-quality ends

--min_read_length MIN_READ_LENGTH

minimum length of reads after quality / adapter

trimming

--unicycler_verbosity UNICYCLER_VERBOSITY

Unicycler level of stdout and log file information (0

- 3)

--unicycler_keep UNICYCLER_KEEP

Unicycler level of file retention (0 - 3)

--identity_threshold IDENTITY_THRESHOLD

filtering the gene-contig mapping: minimum identity in

a hit to keep it

--length_threshold LENGTH_THRESHOLD

filtering the gene-contig mapping: minimum fraction of

query that has to be matched to keep a hit

--min_gene_density MIN_GENE_DENSITY

greedy algorithm: minimum gene density of predicted

plasmids (are marked as questionable otherwise)

--min_seed_gene_density MIN_SEED_GENE_DENSITY

greedy algorithm: minimum gene density of a contig to

be considered as a seed

--min_length MIN_LENGTH

greedy algorithm: minimum length of predicted plasmid

(shorter ones are marked as questionable)

--max_length MAX_LENGTH

greedy algorithm: maximum length of predicted plasmid

(longer ones are marked as questionable)

--min_read_depth MIN_READ_DEPTH

greedy algorithm: minimum read depth of contig to be

used in plasmids

--min_plasmid_read_depth MIN_PLASMID_READ_DEPTH

greedy algorithm: minimum read depth of predicted

plasmids (are marked as questionable otherwise)

--max_gc_diff MAX_GC_DIFF

greedy algorithm: maximum difference in GC content

between contig to be added and current plasmid

--max_intermediate_contigs MAX_INTERMEDIATE_CONTIGS

greedy algorithm: maximum number of contigs between

gene-containing contigs in a plasmid

--max_intermediate_nt MAX_INTERMEDIATE_NT

greedy algorithm: maximum sum of lengths of contigs

between gene-containing contigs in a plasmid

--max_score MAX_SCORE

greedy algorithm: maximum score for potential

extensions

--score_weights SCORE_WEIGHTS

greedy algorithm: weights of the score components

(comma-separated list of elements

<component>=<weight>)

--keep_subplasmids greedy algorithm: mark plasmids contained by others as

questionable

--overlap_ends OVERLAP_ENDS

greedy algorithm: minimum overlap between ends of

plasmid sequence for circularisation

--binning BINNING greedy algorithm: setting a threshold value activates

plasmid binning

--fanout FANOUT greedy algorithm: maximum degree of branching of

contigs (per end)

--probabilistic greedy algorithm: choose extension probabilistically

--use_median greedy algorithm: determine average read depth of a

plasmid using the median instead of the mean

--use_node_based greedy algorithm: activates node-based extensions

--verbose greedy algorithm: prints detailed information during

the extension process

--fastqc FASTQC path to FastQC executable

--sickle SICKLE path to sickle executable

--cutadapt CUTADAPT path to cutadapt executable

--trim_galore TRIM_GALORE

path to Trim Galore executable

--unicycler UNICYCLER

path to Unicycler executable

--spades SPADES path to SPAdes executable

--racon RACON path to Racon executable

--pilon PILON path to Pilon executable

--samtools SAMTOOLS path to SAMtools executable

--bowtie2 BOWTIE2 path to Bowtie 2 executable

--bowtie2_build BOWTIE2_BUILD

path to bowtie2_build executable

--java JAVA path to Java executable

--makeblastdb MAKEBLASTDB

path to makeblastdb executable

--tblastn TBLASTN path to tblastn executable

--blastn BLASTN path to blastn executable

データベースの準備

cd databases 
hyasp create ncbi_database_genes.fasta -p plasmids.csv \
 -b ncbi_blacklist.txt -d -l 500 -m 100 -t GenBank \
 -r 2015-12-19T00:00:00Z

ncbi_database_genes.fastaができる。これをラン時に指定する（以下はgenes.fastaと記載）。

実行方法

１、アセンブリグラフ（GFA）からplasmidを探索。データベースncbi_database_genes.fastaとアセンブリグラフのGFAを指定する。

hyasp map genes.fasta gcm.csv -g assembly.gfa

結果はblastテキスト出力ライクな形式になる。

２、必要に応じて出力をフィルタリングする。

hyasp filter genes.fasta gcm.csv filtered_gcm.csv \
 --identity_threshold 0.95 \
 --length_threshold 0.95

--identity_threshold identity of hits retained in the mapping (default: 0.95).
--length_threshold Minimum fraction of query (gene) that has be matched to keep a hit (default: 0.95).

３、プラスミド配列検出。１の出力か２のフィルタリングした出力を指定する。

hyasp find assembly.gfa genes.fasta gcm.csv output_dir

上に載せたヘルプにあるように、感度調整のパラメータが多いので注意する。

出力

> ls -alth

-rw-rw-r-- 1 kazu kazu 4.2M 7月 13 22:44 tagged_assembly.gfa

drwxrwxr-x 2 kazu kazu 4.0K 7月 13 22:44 .

-rw-rw-r-- 1 kazu kazu 50K 7月 13 22:44 questionable_plasmid_contigs.fasta

-rw-rw-r-- 1 kazu kazu 89K 7月 13 22:44 putative_plasmid_contigs.fasta

-rw-rw-r-- 1 kazu kazu 90K 7月 13 22:44 putative_plasmids.fasta

-rw-rw-r-- 1 kazu kazu 50K 7月 13 22:44 questionable_plasmids.fasta

-rw-rw-r-- 1 kazu kazu 270 7月 13 22:44 contig_chains.csv

推定piasmid配列FASTAの他に、元のGFAファイルに対して推定プラスミドコンティグの色およびラベル情報が追加されたGFAファイルが出力される。

f:id:kazumaxneo:20190713225617p:plain

少なくとも１つの推定プラスミド中に存在するコンティグは青色がアサインされ、１つまたは複数の疑わしいプラスミド中に存在すると水色がアサインされる。

NGSのリードを指定して、rawリードから直接実行することもできる。

hyasp_pipeline output_dir ncbi_database_genes.fasta \
 -1 pair_1.fq -2 pair_2.fq

-1 FASTQ file of first paired reads (if any)
-2 FASTQ file of second paired reads (if any)
-s FASTQ file of single / unpaired reads (if any)
-l FASTQ file of long reads (if any)

シングルエンドショートリードとロングリードを指定。

hyasp_pipeline output_dir ncbi_database_genes.fasta \
 -s short_reads.fastq -l long_reads.fastq

テスト時はunicyclerを使ったアセンブリのステップでエラーになって結果は確認できなかった。

引用

HyAsP, a greedy tool for plasmids identification
Robert Müller, Cedric Chauve
Bioinformatics, Published: 22 May 2019

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

アセンブリグラフからプラスミドを検出する HyAsP