プラスミドはバクテリアで一般的なextra-chromosomalのDNA分子である。プラスミドは、それらの長さ(それらはchromosomeよりはるかに短い傾向がある)、コピー数(プラスミドは細胞内に複数のコピーで存在する場合がある)およびGC含有量などの様々な特徴においてchromosomeとは異なる。それらは水平遺伝子導入において、そしてそれ故、ビルレンス因子および抗生物質耐性の伝達において重要な役割を果たす(Carattoli、2013; Dolejska and Papagiannitsis、2018)。それ故、細菌サンプル中のプラスミドの同定は、薬剤耐性細菌の増殖に対する緩和戦略にとって重要である。
全ゲノムシーケンシング( WGS)データを使用する方法に焦点を合わせて、最近プラスミド検出を行う種々のアプローチが探求されている。最近のプラスミド検出とアセンブリのレビューは (Arredondo-Alonso et al., 2017; Laczny et al., 2017; Orlek et al., 2017)を参照する。既存のツールは、de-novoメソッドとリファレンスベースのメソッドという2種類のアプローチに区別できる。 Recycler(Rozov et al、2017)およびPlasmidSPAdes(Antipov et al、2016)は前者の方法群に属する。Recyclerは、リードデプスおよび長さの特徴に基づいてアセンブリグラフのサイクルを繰り返し剥がすことによってプラスミドを予測する。PlasmidSPAdesは、プラスミドのリードデプスがchromosomeと異なると仮定する。それはchromosomeのリードデプスを推定し、可能性のあるchromosomeのコンティグを取り除き、そして結果として得られる縮小アセンブリグラフの連結成分からプラスミドを予測する。 MOB-recon(Robertson and Nash、2018)は最近のリファレンスベースの方法である。それはリファレンスプラスミドのデータベースならびに既知のレプリコンおよびrelaxasesのコレクションを使用する。コンティグは、リファレンスデータベースに対してマッピングされ、そして推定のプラスミド単位にグループ分けされ、それらはレプリコンまたはrelaxasesなしでそれらの単位を捨てることによってさらに精製される。
このAPPLICATIONS NOTEでは、WGSアセンブリ内のプラスミドコンティグを識別し、それらを個々のプラスミドにまとめるためのハイブリッドツールで、リファレンスベースの方法とde-novoの方法を組み合わせた新しいツール、HyAsP(プラスミド用ハイブリッドアセンブラ)を紹介する。既知のプラスミド遺伝子とリードデプスやGC含有量などの特性を考慮して、 66のゲノムを含むデータセットについて、HyAsPの予測品質をPlasmidSPAdesおよびMOB-reconと比較し、HyAsPが一般に両方のツールよりも優れていることを示す。
HyAsPは、既知のプラスミド遺伝子からの情報、リードデプス、およびGC含有量を使用して、アセンブリグラフからプラスミドコンティグを検出、ビニング、およびアセンブリするためのgreedy algorithmである。アルゴリズム(アルゴリズム1の擬似コードのワークフローを含む)および実験についての概要を論文の以下に示す。完全な技術詳細は補足資料に記載されている。(以下略)
インストール
ubuntu16.04にて、condaの仮想環境を作ってテストした(conda create -n hyasp python=3.5.4)。
依存
HyAsP was developed and tested with the following software dependencies:
- Python (python, version 3.5.4; packages: Bio, math, numpy, os, pandas, random, subprocess, sys)
- BLAST+ (makeblastdb tblastn and blastn; version 2.6.0)
FastQC (fastqc, version 0.11.5)
- sickle (sickle, version 1.33)
- cutadapt (cutadapt, version 1.16)
- Trim Galore (trim_galore, version 0.4.5_dev)
- Unicycler (unicycler, version 0.4.5)
SPAdes (spades.py, version 3.12.0)
Racon (racon, version 1.3.0)
Pilon (pilon-1.22.jar, version 1.22)
SAMtools (samtools, version 1.5)
Bowtie 2 (bowtie2 and bowtie2-build; version 2.3.3.1)
Java (java, version 1.8.0_121)
#blast bioconda (link)
conda install -c bioconda -y blast==2.6.0
#fastqc bioconda (link)
conda install -c bioconda -y fastqc==0.11.5
#sickle bioconada (link)
conda install -c bioconda -y sickle
#cutadapt bioconada (link)
conda install -c bioconda -y cutadapt==1.16
#trim galore bioconada (link)
conda install -c bioconda -y trim-galore==0.4.5
#unicycler bioconada (link)
conda install -c bioconda -y unicycler
本体 GIthub
git clone https://github.com/cchauve/hyasp.git
cd hyasp
python setup.py sdist
pip install dist/HyAsP-1.0.0.tar.gz
> hyasp -h
$ hyasp -h
usage: hyasp [-h] {find,create,map,filter} ...
positional arguments:
{find,create,map,filter}
task to be performed
find find plasmids in the assembly graph
create create gene database from collection of plasmids
map map genes to contigs
filter filter gene-contig mapping
optional arguments:
-h, --help show this help message and exit
> hyasp map -h
$ hyasp map -h
usage: hyasp map [-h] (-f FROM_FASTA | -g FROM_GFA) [-c] [-v]
[--makeblastdb MAKEBLASTDB] [--blastn BLASTN]
genes_file mapping_file
positional arguments:
genes_file FASTA file of plasmid genes to be mapped
mapping_file output gene-contig mapping
optional arguments:
-h, --help show this help message and exit
-f FROM_FASTA, --from_fasta FROM_FASTA
file containing the contigs (in FASTA format) to which
the genes should be matched
-g FROM_GFA, --from_gfa FROM_GFA
file containing the contigs (as part of assembly graph
in GFA format) to which the genes should be matched
-c, --clean remove temporary files when mapping is done
-v, --verbose print more information
--makeblastdb MAKEBLASTDB
path to makeblastdb executable
--blastn BLASTN path to blastn executable
> hyasp filter -h
$ hyasp filter -h
usage: hyasp filter [-h] [-i IDENTITY_THRESHOLD] [-l LENGTH_THRESHOLD] [-f]
[-v]
genes_file mapping filtered_mapping
positional arguments:
genes_file FASTA file of plasmid genes used to create the mapping
mapping gene-contig mapping (BLAST output (default format 6)
to be filtered
filtered_mapping output file containing the filtered gene-contig
mapping
optional arguments:
-h, --help show this help message and exit
-i IDENTITY_THRESHOLD, --identity_threshold IDENTITY_THRESHOLD
minimum identity in a hit to keep it
-l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD
minimum fraction of query that has to be matched to
keep a hit
-f, --find_fragmented
if set, search for fragmented hits, i.e. several short
high-quality hits that together satisfy the length
threshold
-v, --verbose print more information
> hyasp find -h
$ hyasp find -h
usage: hyasp find [-h] [-g MIN_GENE_DENSITY] [-k MIN_SEED_GENE_DENSITY]
[-l MIN_LENGTH] [-L MAX_LENGTH] [-r MIN_READ_DEPTH]
[-d MIN_PLASMID_READ_DEPTH] [-G MAX_GC_DIFF]
[-c MAX_INTERMEDIATE_CONTIGS] [-n MAX_INTERMEDIATE_NT]
[-s MAX_SCORE] [-w SCORE_WEIGHTS] [-q] [-o OVERLAP_ENDS]
[-b BINNING] [-f FANOUT | -p] [-u] [-N] [-v]
assembly_graph genes_file gene_contig_mapping output_dir
positional arguments:
assembly_graph assembly graph (GFA format)
genes_file genes database (FASTA format)
gene_contig_mapping gene-contig mapping (BLAST output, default format 6)
output_dir output directory
optional arguments:
-h, --help show this help message and exit
-g MIN_GENE_DENSITY, --min_gene_density MIN_GENE_DENSITY
minimum gene density of putative plasmids (plasmids
with lower gene density are marked as questionable)
-k MIN_SEED_GENE_DENSITY, --min_seed_gene_density MIN_SEED_GENE_DENSITY
minimum gene density of a contig to be considered as a
seed
-l MIN_LENGTH, --min_length MIN_LENGTH
minimum length of putative plasmids (shorter plasmids
are marked as questionable)
-L MAX_LENGTH, --max_length MAX_LENGTH
maximum length of putative plasmids (longer plasmids
are marked as questionable)
-r MIN_READ_DEPTH, --min_read_depth MIN_READ_DEPTH
minimum read depth of contigs to be used in plasmids
-d MIN_PLASMID_READ_DEPTH, --min_plasmid_read_depth MIN_PLASMID_READ_DEPTH
minimum read depth of putative plasmids (plasmids with
lower read depth are marked as questionable)
-G MAX_GC_DIFF, --max_gc_diff MAX_GC_DIFF
maximum difference in GC content between contig to be
added and current plasmid
-c MAX_INTERMEDIATE_CONTIGS, --max_intermediate_contigs MAX_INTERMEDIATE_CONTIGS
maximum number of contigs between gene-containing
contigs in a plasmid
-n MAX_INTERMEDIATE_NT, --max_intermediate_nt MAX_INTERMEDIATE_NT
maximum sum of lengths of contigs between gene-
containing contigs in a plasmid
-s MAX_SCORE, --max_score MAX_SCORE
maximum score for eligible extensions
-w SCORE_WEIGHTS, --score_weights SCORE_WEIGHTS
weights of the score components (comma-separated list
of elements <component>=<weight>)
-q, --keep_subplasmids
do not mark plasmids contained by others as
questionable
-o OVERLAP_ENDS, --overlap_ends OVERLAP_ENDS
minimum overlap between ends of plasmid sequence for
circularisation
-b BINNING, --binning BINNING
number standard deviations the read depth and GC
content of plasmids are allowed to differ from the
centre of their bin (binning is activated by setting
the parameter to any value unequal NaN)
-f FANOUT, --fanout FANOUT
maximum number of predecessors / successors of any
contig in a plasmid (resp. contig collection)
-p, --probabilistic activates probabilistic extensions
-u, --use_median determine average read depth of a plasmid using the
median instead of the mean
-N, --use_node_based activates node-based extensions
-v, --verbose print more information
(hyasp) kazu@kazu:~/Document/hyasp/example/example_files$
>hyasp_pipeline -h
$ hyasp_pipeline -h
usage: hyasp_pipeline [-h] [-1 FIRST_SHORT_READS] [-2 SECOND_SHORT_READS]
[-s SINGLE_SHORT_READS] [-l LONG_READS]
[-u UNICYCLER_MODE] [-f] [-n]
[--read_qual_threshold READ_QUAL_THRESHOLD]
[--min_read_length MIN_READ_LENGTH]
[--unicycler_verbosity UNICYCLER_VERBOSITY]
[--unicycler_keep UNICYCLER_KEEP]
[--identity_threshold IDENTITY_THRESHOLD]
[--length_threshold LENGTH_THRESHOLD]
[--min_gene_density MIN_GENE_DENSITY]
[--min_seed_gene_density MIN_SEED_GENE_DENSITY]
[--min_length MIN_LENGTH] [--max_length MAX_LENGTH]
[--min_read_depth MIN_READ_DEPTH]
[--min_plasmid_read_depth MIN_PLASMID_READ_DEPTH]
[--max_gc_diff MAX_GC_DIFF]
[--max_intermediate_contigs MAX_INTERMEDIATE_CONTIGS]
[--max_intermediate_nt MAX_INTERMEDIATE_NT]
[--max_score MAX_SCORE] [--score_weights SCORE_WEIGHTS]
[--keep_subplasmids] [--overlap_ends OVERLAP_ENDS]
[--binning BINNING] [--fanout FANOUT] [--probabilistic]
[--use_median] [--use_node_based] [--verbose]
[--fastqc FASTQC] [--sickle SICKLE]
[--cutadapt CUTADAPT] [--trim_galore TRIM_GALORE]
[--unicycler UNICYCLER] [--spades SPADES]
[--racon RACON] [--pilon PILON] [--samtools SAMTOOLS]
[--bowtie2 BOWTIE2] [--bowtie2_build BOWTIE2_BUILD]
[--java JAVA] [--makeblastdb MAKEBLASTDB]
[--tblastn TBLASTN] [--blastn BLASTN]
analysis_dir genes_file
positional arguments:
analysis_dir directory into which all data and outputs are stored
genes_file (plasmid) genes to map to contigs
optional arguments:
-h, --help show this help message and exit
-1 FIRST_SHORT_READS, --first_short_reads FIRST_SHORT_READS
FASTQ file of first paired reads (if any)
-2 SECOND_SHORT_READS, --second_short_reads SECOND_SHORT_READS
FASTQ file of second paired reads (if any)
-s SINGLE_SHORT_READS, --single_short_reads SINGLE_SHORT_READS
FASTQ file of single / unpaired reads (if any)
-l LONG_READS, --long_reads LONG_READS
FASTQ file of long reads (if any)
-u UNICYCLER_MODE, --unicycler_mode UNICYCLER_MODE
Unicycler assembly mode (conservative, normal or bold)
-f, --use_fastqc if set, FASTQ reads are analysed with FastQC before
and after preprocessing
-n, --no_preprocessing
if set, the FASTQ reads are not preprocessed before
the assembly
--read_qual_threshold READ_QUAL_THRESHOLD
threshold for trimming low-quality ends
--min_read_length MIN_READ_LENGTH
minimum length of reads after quality / adapter
trimming
--unicycler_verbosity UNICYCLER_VERBOSITY
Unicycler level of stdout and log file information (0
- 3)
--unicycler_keep UNICYCLER_KEEP
Unicycler level of file retention (0 - 3)
--identity_threshold IDENTITY_THRESHOLD
filtering the gene-contig mapping: minimum identity in
a hit to keep it
--length_threshold LENGTH_THRESHOLD
filtering the gene-contig mapping: minimum fraction of
query that has to be matched to keep a hit
--min_gene_density MIN_GENE_DENSITY
greedy algorithm: minimum gene density of predicted
plasmids (are marked as questionable otherwise)
--min_seed_gene_density MIN_SEED_GENE_DENSITY
greedy algorithm: minimum gene density of a contig to
be considered as a seed
--min_length MIN_LENGTH
greedy algorithm: minimum length of predicted plasmid
(shorter ones are marked as questionable)
--max_length MAX_LENGTH
greedy algorithm: maximum length of predicted plasmid
(longer ones are marked as questionable)
--min_read_depth MIN_READ_DEPTH
greedy algorithm: minimum read depth of contig to be
used in plasmids
--min_plasmid_read_depth MIN_PLASMID_READ_DEPTH
greedy algorithm: minimum read depth of predicted
plasmids (are marked as questionable otherwise)
--max_gc_diff MAX_GC_DIFF
greedy algorithm: maximum difference in GC content
between contig to be added and current plasmid
--max_intermediate_contigs MAX_INTERMEDIATE_CONTIGS
greedy algorithm: maximum number of contigs between
gene-containing contigs in a plasmid
--max_intermediate_nt MAX_INTERMEDIATE_NT
greedy algorithm: maximum sum of lengths of contigs
between gene-containing contigs in a plasmid
--max_score MAX_SCORE
greedy algorithm: maximum score for potential
extensions
--score_weights SCORE_WEIGHTS
greedy algorithm: weights of the score components
(comma-separated list of elements
<component>=<weight>)
--keep_subplasmids greedy algorithm: mark plasmids contained by others as
questionable
--overlap_ends OVERLAP_ENDS
greedy algorithm: minimum overlap between ends of
plasmid sequence for circularisation
--binning BINNING greedy algorithm: setting a threshold value activates
plasmid binning
--fanout FANOUT greedy algorithm: maximum degree of branching of
contigs (per end)
--probabilistic greedy algorithm: choose extension probabilistically
--use_median greedy algorithm: determine average read depth of a
plasmid using the median instead of the mean
--use_node_based greedy algorithm: activates node-based extensions
--verbose greedy algorithm: prints detailed information during
the extension process
--fastqc FASTQC path to FastQC executable
--sickle SICKLE path to sickle executable
--cutadapt CUTADAPT path to cutadapt executable
--trim_galore TRIM_GALORE
path to Trim Galore executable
--unicycler UNICYCLER
path to Unicycler executable
--spades SPADES path to SPAdes executable
--racon RACON path to Racon executable
--pilon PILON path to Pilon executable
--samtools SAMTOOLS path to SAMtools executable
--bowtie2 BOWTIE2 path to Bowtie 2 executable
--bowtie2_build BOWTIE2_BUILD
path to bowtie2_build executable
--java JAVA path to Java executable
--makeblastdb MAKEBLASTDB
path to makeblastdb executable
--tblastn TBLASTN path to tblastn executable
--blastn BLASTN path to blastn executable
データベースの準備
cd databases
hyasp create ncbi_database_genes.fasta -p plasmids.csv \
-b ncbi_blacklist.txt -d -l 500 -m 100 -t GenBank \
-r 2015-12-19T00:00:00Z
ncbi_database_genes.fastaができる。これをラン時に指定する(以下はgenes.fastaと記載)。
実行方法
1、アセンブリグラフ(GFA)からplasmidを探索。データベースncbi_database_genes.fastaとアセンブリグラフのGFAを指定する。
hyasp map genes.fasta gcm.csv -g assembly.gfa
結果はblastテキスト出力ライクな形式になる。
2、必要に応じて出力をフィルタリングする。
hyasp filter genes.fasta gcm.csv filtered_gcm.csv \
--identity_threshold 0.95 \
--length_threshold 0.95
- --identity_threshold identity of hits retained in the mapping (default: 0.95).
- --length_threshold Minimum fraction of query (gene) that has be matched to keep a hit (default: 0.95).
3、プラスミド配列検出。1の出力か2のフィルタリングした出力を指定する。
hyasp find assembly.gfa genes.fasta gcm.csv output_dir
上に載せたヘルプにあるように、感度調整のパラメータが多いので注意する。
出力
> ls -alth
-rw-rw-r-- 1 kazu kazu 4.2M 7月 13 22:44 tagged_assembly.gfa
drwxrwxr-x 2 kazu kazu 4.0K 7月 13 22:44 .
-rw-rw-r-- 1 kazu kazu 50K 7月 13 22:44 questionable_plasmid_contigs.fasta
-rw-rw-r-- 1 kazu kazu 89K 7月 13 22:44 putative_plasmid_contigs.fasta
-rw-rw-r-- 1 kazu kazu 90K 7月 13 22:44 putative_plasmids.fasta
-rw-rw-r-- 1 kazu kazu 50K 7月 13 22:44 questionable_plasmids.fasta
-rw-rw-r-- 1 kazu kazu 270 7月 13 22:44 contig_chains.csv
推定piasmid配列FASTAの他に、元のGFAファイルに対して推定プラスミドコンティグの色およびラベル情報が追加されたGFAファイルが出力される。
少なくとも1つの推定プラスミド中に存在するコンティグは青色がアサインされ、1つまたは複数の疑わしいプラスミド中に存在すると水色がアサインされる。
NGSのリードを指定して、rawリードから直接実行することもできる。
hyasp_pipeline output_dir ncbi_database_genes.fasta \
-1 pair_1.fq -2 pair_2.fq
- -1 FASTQ file of first paired reads (if any)
- -2 FASTQ file of second paired reads (if any)
- -s FASTQ file of single / unpaired reads (if any)
- -l FASTQ file of long reads (if any)
シングルエンドショートリードとロングリードを指定。
hyasp_pipeline output_dir ncbi_database_genes.fasta \
-s short_reads.fastq -l long_reads.fastq
テスト時はunicyclerを使ったアセンブリのステップでエラーになって結果は確認できなかった。
引用
HyAsP, a greedy tool for plasmids identification
Robert Müller, Cedric Chauve
Bioinformatics, Published: 22 May 2019