2020-03-17

Bactopia: an automated pipeline that uses compute resources efficiently to rapidly analyze many closely related bacterial genomes

2020 3/17 Added parameters, fixed commands, revised title

2020 3/18 Added notes

2020 5/11 Added explanation

2020 8/13 Added paper

2020 12/9 Added tweet

2021 2/24 Revised to the updated commands

2021 10/7 Added tweet

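Sequencing bacterial genomes with Illumina technology has become so routine that data are often generated faster than they can be conveniently analyzed. The authors created a new series of pipelines called Bactopia, built with the Nextflow workflow software, to provide efficient comparative genomic analysis for a bacterial species or genus. Bactopia consists of three parts: a dataset setup step (Bactopia Datasets; BaDs), in which a series of customizable datasets is created for the species of interest; the Bactopia Analysis Pipeline (BaAP), which performs quality control, genome assembly, and several other functions depending on the available datasets, and writes the processed data to a structured directory layout; and a series of Bactopia Tools (BaTs), which perform specific post-processing on some or all of the processed data. BaTs include pan-genome analysis, calculation of average nucleotide identity between samples, extraction and profiling of 16S genes, and taxonomic classification using highly conserved genes. The number of BaTs is expected to grow in the future to cover specific applications. As a demonstration, the authors analyzed 1,664 public Lactobacillus genomes, focusing on L. crispatus, a common species of the human vaginal microbiome. Bactopia is an open-source system that scales from projects as small as a single bacterial genome to projects with thousands, and it offers considerable flexibility in choosing comparison datasets and downstream analysis options. The Bactopia code is available at https://www.github.com/bactopia/bactopia.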

Bactopia Datasets provides a framework for including many existing public and private datasets in an analysis. The process of downloading, building, and (or) configuring these datasets for Bactopia is automated.

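The Bactopia Analysis Pipeline is built with Nextflow. Input FASTQs (local, or available from SRA / ENA) can be put through quality control, assembly, annotation, mapping to references, variant calling, minmer sketch queries, BLAST alignments, insertion site prediction, typing, and more. The Bactopia Analysis Pipeline automatically selects which analyses to include based on the available Bactopia Datasets.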

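Bactopia Tools are a set of independent workflows for comparative analysis. The comparative analyses include summary reports, pan-genome analysis, and phylogenetic tree construction. Because Bactopia's output structure is predictable, you can choose which samples to include for processing by a Bactopia Tool.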

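Bactopia was inspired by Staphopia, a workflow released by the same authors that targets Staphylococcus aureus genomes. Using what was learned from Staphopia and from user feedback, Bactopia was developed from scratch with usability, portability, and speed in mind from the start.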

V2 release

Bactopia v2, now includes more than 100 tests, that are testing more than 7000 outputs.

Pytest (modeled after nf-core/modules) is running on @github Actions, using real data from https://t.co/EmWtMhVXt6

Thank you @tdread_emory, for providing the resources to run all these tests
- Robert A. Petit III, PhD (@rpetit3), December 6, 2021

Bactopia users: if you ever find yourself feeling this way with Bactopia please don't hesitate to reach out!

Sometimes things break and it can be frustrating! I've also tried my best to build thorough docs, but there's always room for improvement. https://t.co/TSJ8GMHCAL https://t.co/gCr3PZywV3
- Robert A. Petit III, PhD (@rpetit3), October 2, 2021

Check out https://t.co/JAZtbHuLG6 to get started using Bactopia.

Please feel free to reach out if you have any questions! https://t.co/WG3HqzokiP
- Robert A. Petit III (@rpetit3), August 12, 2020

Latest preprint with @rpetit3 describes our Bactopia pipeline - a software for flexible analysis of 1 - 1000s of bacterial genomes starting from FASTQ. Example analysis of 1600+ Lactobacillus genomes, focusing on L. crispatus, with only a few commands https://t.co/SlkWbClxhZ pic.twitter.com/U6ie8XTChR
- Timothy Read PhD (@tdread_emory), March 11, 2020

New blog post: Using AWS Batch to process 67,000 genomes with Bactopia.

In 5 days(!), I processed all publicly available Staphylococcus aureus sequencing projects. This post outlines how it was done and the costs to do so. https://t.co/09R7Ac0bl1 #asmngs #bioinformatics #aws
- Robert A. Petit III (@rpetit3), December 8, 2020

Bactopia v1.5.4 is now available!

Release Details: https://t.co/3RI7Dp6rgT

Learn more about Bactopia at:
- https://t.co/lADbrUD6MN (docs)
- https://t.co/0aLt5LIE8W (ePoster)
- Robert A. Petit III (@rpetit3), December 18, 2020

Document

Installation - Bactopia

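Installation

I created a conda virtual environment and tested Bactopia in it, but errors occurred (tested on macOS and Ubuntu 18.04). I therefore tested with the Docker image provided by the authors.

Source: GitHub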

#bioconda (link)
conda create -y -n bactopia -c conda-forge -c bioconda bactopia
conda activate bactopia

#docker image (link)
docker pull bactopia/bactopia
#run with the host's current directory shared as /data inside the container
docker run --rm -it -v "$PWD":/data bactopia/bactopia
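Since the conda route failed in the test above while the Docker image worked, it can help to check which route is usable on a machine before starting. The following is a minimal sketch: the function name `pick_runner` is hypothetical, and it only checks whether an executable is on PATH, not whether Bactopia actually runs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which install route looks usable.
# It only checks that an executable is on PATH; it does not run Bactopia.
pick_runner() {
    if command -v bactopia >/dev/null 2>&1; then
        echo "conda"      # a bactopia executable is on PATH (e.g. an active conda env)
    elif command -v docker >/dev/null 2>&1; then
        echo "docker"     # fall back to the bactopia/bactopia image
    else
        echo "none"
    fi
}

pick_runner
```

If it prints `docker`, the `docker pull` and `docker run` commands above are the route to try.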

> bactopia --help

$ bactopia --help

N E X T F L O W ~ version 20.01.0

Launching `/Users/kazu/anaconda3/envs/bactopia/share/bactopia-1.3.0/main.nf` [curious_brown] - revision: a8ccad600f

bactopia v1.3.0

Required Parameters:

### For Procesessing Multiple Samples

--fastqs STR An input file containing the sample name and

absolute paths to FASTQs to process

### For Processing A Single Sample

--R1 STR First set of reads for paired end in compressed (gzip)

FASTQ format

--R2 STR Second set of reads for paired end in compressed (gzip)

FASTQ format

--SE STR Single end set of reads in compressed (gzip) FASTQ format

--sample STR The name of the input sequences

### For Downloading from ENA

--accessions An input file containing ENA/SRA experiement accessions to

be processed

--accession A single ENA/SRA Experiment accession to be processed

Dataset Parameters:

--datasets DIR The path to available datasets that have

already been set up

--species STR Determines which species-specific dataset to

use for the input sequencing

Optional Parameters:

--coverage INT Reduce samples to a given coverage

Default: 100x

--genome_size INT Expected genome size (bp) for all samples, a value of '0'

will disable read error correction and read subsampling.

Special values (requires --species):

'min': uses minimum completed genome size of species

'median': uses median completed genome size of species

'mean': uses mean completed genome size of species

'max': uses max completed genome size of species

Default: Mash estimate

--outdir DIR Directory to write results to

Default: .

--max_time INT The maximum number of minutes a job should run before being halted.

Default: 120 minutes

--max_memory INT The maximum amount of memory (Gb) allowed to a single process.

Default: 32 Gb

--cpus INT Number of processors made available to a single

process.

Default: 4

Nextflow Related Parameters:

--infodir DIR Directory to write Nextflow summary files to

Default: ./bactopia-info

--condadir DIR Directory to Nextflow should use for Conda environments

Default: Bactopia's Nextflow directory

--nfdir Print directory Nextflow has pulled Bactopia to

--overwrite Nextflow will overwrite existing output files.

Default: false

--conatainerPath Path to Singularity containers to be used by the 'slurm'

profile.

Default: /opt/bactopia/singularity

--sleep_time After reading datases, the amount of time (seconds) Nextflow

will wait before execution.

Default: 5 seconds

--publish_mode Set Nextflow's method for publishing output files. Allowed methods are:

'copy' (default) Copies the output files into the published directory.

'copyNoFollow' Copies the output files into the published directory

without following symlinks ie. copies the links themselves.

'link' Creates a hard link in the published directory for each

process output file.

'rellink' Creates a relative symbolic link in the published directory

for each process output file.

'symlink' Creates an absolute symbolic link in the published directory

for each process output file.

Default: copy

--force Nextflow will overwrite existing output files.

Default: false

Useful Parameters:

--available_datasets Print a list of available datasets found based

on location given by "--datasets"

--example_fastqs Print example of expected input for FASTQs file

--check_fastqs Verify "--fastqs" produces the expected inputs

--compress Compress (gzip) select outputs (e.g. annotation, variant calls)

to reduce overall storage footprint.

--keep_all_files Keeps all analysis files created. By default, intermediate

files are removed. This will not affect the ability

to resume Nextflow runs, and only occurs at the end

of the process.

--dry_run Mimics workflow execution, to help prevent errors realated to

conda envs being built in parallel. Only useful on new

installs of Bactopia.

--version Print workflow version information

--help Show this message and exit

--help_all Show a complete list of adjustable parameters

> bactopia datasets --help

$ bactopia datasets --help

usage: bactopia datasets [-h] [--ariba STR] [--species STR] [--skip_prokka]

[--include_genus] [--identity FLOAT]

[--overlap FLOAT] [--max_memory INT] [--fast_cluster]

[--skip_minmer] [--skip_plsdb] [--cpus INT]

[--clear_cache] [--force] [--force_ariba]

[--force_mlst] [--force_prokka] [--force_minmer]

[--force_plsdb] [--keep_files] [--list_datasets]

[--depends] [--version] [--verbose] [--silent]

OUTPUT_DIRECTORY

bactopia datasets (v1.3.0) - Setup public datasets for Bactopia

positional arguments:

OUTPUT_DIRECTORY Directory to write output.

optional arguments:

-h, --help show this help message and exit

Ariba Reference Datasets:

--ariba STR Setup Ariba datasets for a given reference or a list of

references in a text file. (Default: card,vfdb_core)

Bacterial Species:

--species STR Download available MLST schemas and completed genomes for

a given species or a list of species in a text file.

Custom Prokka Protein FASTA:

--skip_prokka Skip creation of a Prokka formatted fasta for each species

--include_genus Include all genus members in the Prokka proteins FASTA

--identity FLOAT CD-HIT (-c) sequence identity threshold. (Default: 0.9)

--overlap FLOAT CD-HIT (-s) length difference cutoff. (Default: 0.8)

--max_memory INT CD-HIT (-M) memory limit (in MB). (Default: unlimited

--fast_cluster Use CD-HIT's (-g 0) fast clustering algorithm, instead of

the accurate but slow algorithm.

Minmer Datasets:

--skip_minmer Skip download of pre-computed minmer datasets (mash,

sourmash)

PLSDB (Plasmid) BLAST/Sketch:

--skip_plsdb Skip download of pre-computed PLSDB datbases (blast, mash)

Helpful Options:

--cpus INT Number of cpus to use. (Default: 1)

--clear_cache Remove any existing cache.

--force Forcibly overwrite existing datasets.

--force_ariba Forcibly overwrite existing Ariba datasets.

--force_mlst Forcibly overwrite existing MLST datasets.

--force_prokka Forcibly overwrite existing Prokka datasets.

--force_minmer Forcibly overwrite existing minmer datasets.

--force_plsdb Forcibly overwrite existing PLSDB datasets.

--keep_files Keep all downloaded and intermediate files.

--list_datasets List Ariba reference datasets and MLST schemas available

for setup.

--depends Verify dependencies are installed.

Adjust Verbosity:

--version show program's version number and exit

--verbose Print debug related text.

--silent Only critical errors will be printed.

example usage:

bactopia datasets outdir

bactopia datasets outdir --ariba 'card'

bactopia datasets outdir --species 'Staphylococcus aureus' --include_genus

> bactopia prepare --help

$ bactopia prepare --help

usage: bactopia prepare [-h] [-e STR] [-s STR] [--pattern STR] [--version] STR

bactopia prepare (v1.3.0) - Read a directory and prepare a FOFN of FASTQs

positional arguments:

STR Directory where FASTQ files are stored

optional arguments:

-h, --help show this help message and exit

-e STR, --ext STR Extension of the FASTQs. Default: .fastq.gz

-s STR, --sep STR Split FASTQ name on the last occurrence of the separator.

Default: _

--pattern STR Glob pattern to match FASTQs. Default: *.fastq.gz

--version show program's version number and exit

> bactopia search --help

$ bactopia search --help

usage: bactopia search [-h] [--exact_taxon] [--outdir OUTPUT_DIRECTORY]

[--prefix PREFIX] [--limit INT] [--version]

STR

bactopia search (v1.3.0) - Search ENA for associated WGS samples

positional arguments:

STR Taxon ID or Study accession

optional arguments:

-h, --help show this help message and exit

--exact_taxon Exclude Taxon ID descendents.

--outdir OUTPUT_DIRECTORY

Directory to write output. (Default: .)

--prefix PREFIX Prefix to use for output file names. (Default: ena)

--limit INT Maximum number of results to return. (Default:

1000000)

--version show program's version number and exit

example usage:

bactopia search PRJNA480016 --limit 20

bactopia search 1280 --exact_taxon --limit 20'

bactopia search "staphylococcus aureus" --limit 20

> bactopia tools

$ bactopia tools

bactopia tools (v1.3.0) - A suite of comparative analyses for Bactopia outputs

Available Tools:

fastani Pairwise average nucleotide identity

gtdb Identify marker genes and assign taxonomic classifications

phyloflash 16s assembly, alignment and tree

roary Pan-genome with optional core-genome tree.

summary A report summarizing Bactopia project

Building the datasets

Various datasets are available (description). Here the basic datasets are used.

bactopia datasets datasets/
#newer versions
bactopia datasets

This sets up the Ariba datasets (CARD and vfdb_core), the RefSeq Mash sketch, the GenBank Sourmash signatures, and PLSDB in the newly created datasets directory.

Output

[Screenshot of the output]

Usage

Specify the sequencing reads and the datasets.

#paired-end
bactopia --R1 R1.fastq.gz --R2 R2.fastq.gz --sample SAMPLE_NAME \
 --datasets datasets/ --outdir OUTDIR

#single-end
bactopia --SE SAMPLE.fastq.gz --sample SAMPLE --datasets datasets/ --outdir OUTDIR

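#multiple samples (paired-end files are automatically written tab-separated on one line; check the output by eye once; files with the fastq.gz extension are recognized)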
bactopia prepare MY-FASTQS/ > fastqs.txt
bactopia --fastqs fastqs.txt --datasets datasets --outdir OUTDIR

--datasets The path to available datasets that have already been set up
--sample The name of the input sequences
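The choice between the paired-end and single-end forms above can be sketched as a small dry-run helper. This is an illustration only: the function name `bactopia_cmd` is hypothetical, it prints a command rather than running it, and it uses only the flags listed in the help text (`--R1`, `--R2`, `--SE`, `--sample`, `--datasets`, `--outdir`).

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: prints the bactopia command for one sample
# instead of running it. Paths are printed as given (no shell quoting),
# so this is for display and review, not for eval.
# usage: bactopia_cmd SAMPLE DATASETS OUTDIR R1 [R2]
bactopia_cmd() {
    local sample=$1 datasets=$2 outdir=$3 r1=$4 r2=${5:-}
    if [ -n "$r2" ]; then
        # two read files given: paired-end form
        printf 'bactopia --R1 %s --R2 %s --sample %s --datasets %s --outdir %s\n' \
            "$r1" "$r2" "$sample" "$datasets" "$outdir"
    else
        # one read file given: single-end form
        printf 'bactopia --SE %s --sample %s --datasets %s --outdir %s\n' \
            "$r1" "$sample" "$datasets" "$outdir"
    fi
}

bactopia_cmd SAMPLE_NAME datasets/ OUTDIR R1.fastq.gz R2.fastq.gz
bactopia_cmd SAMPLE datasets/ OUTDIR SAMPLE.fastq.gz
```

Printing the command first makes it easy to review the sample name and paths before committing to a run that takes tens of minutes.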

Once the run starts, progress is displayed in the terminal.

[Screenshot: progress display in the terminal]

When a process finishes, it is marked with a check (✔).

executor > local (21)

[95/b57cd1] process > gather_fastqs [100%] 1 of 1 ✔

[73/d15465] process > fastq_status [100%] 1 of 1 ✔

[c5/f5cdc8] process > estimate_genome_size [100%] 1 of 1 ✔

[2e/d1894d] process > qc_reads [100%] 1 of 1 ✔

[91/8525ff] process > qc_original_summary [100%] 1 of 1 ✔

[76/1bffa8] process > qc_final_summary [100%] 1 of 1 ✔

[46/fb4103] process > assemble_genome [100%] 1 of 1 ✔

[53/d3e2ed] process > make_blastdb [100%] 1 of 1 ✔

[46/7c5e66] process > annotate_genome [100%] 1 of 1 ✔

[76/e06817] process > count_31mers [100%] 1 of 1 ✔

[- ] process > sequence_type -

[bd/aa733b] process > ariba_analysis [100%] 2 of 2 ✔

[da/e89cf4] process > minmer_sketch [100%] 1 of 1 ✔

[80/5d8161] process > minmer_query [100%] 5 of 5 ✔

[- ] process > call_variants -

[- ] process > download_references -

[- ] process > call_variants_auto -

[39/cf5629] process > update_antimicrobial_resistance [100%] 1 of 1 ✔

[2e/28348a] process > antimicrobial_resistance [100%] 1 of 1 ✔

[- ] process > insertion_sequences -

[72/41fc25] process > plasmid_blast [100%] 1 of 1 ✔

[- ] process > blast_query -

[- ] process > mapping_query -

Completed at: 17-Mar-2020 03:06:49

Duration : 22m 24s

CPU hours : 2.5

Succeeded : 21

The run finished.

Output
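In the progress listing above, several processes show `-` instead of a completion count, meaning they never started, presumably because no matching dataset was available for them. A minimal sketch for pulling those names out of a saved copy of the progress lines follows; the function name `skipped_processes` is hypothetical, and the parsing assumes the line layout shown in this run.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list processes that Nextflow never started,
# i.e. progress lines of the form "[- ] process > NAME -".
# Reads progress lines on stdin and prints one process name per line.
skipped_processes() {
    awk '$1 == "[-" && $3 == "process" && $4 == ">" { print $5 }'
}

# Demo on a few lines copied from the run above
skipped_processes <<'EOF'
[46/fb4103] process > assemble_genome [100%] 1 of 1 ✔
[- ] process > sequence_type -
[bd/aa733b] process > ariba_analysis [100%] 2 of 2 ✔
[- ] process > call_variants -
EOF
```

For the demo input this prints `sequence_type` and `call_variants`, one per line.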

bactopia-info/

The Nextflow workflow report can also be checked.

[Screenshot: Nextflow workflow report]

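The output for each sample is saved in a subdirectory specified by the user.

SAMPLE_NAME/

[Screenshot: contents of the sample output directory]

The outputs include the assembly, annotation, k-mer based species prediction, AMR prediction, and more. See the Document for details.

When there are multiple samples, specify a list file. For files named xxx_R1.fastq.gz and xxx_R2.fastq.gz: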

#create the list
bactopia prepare -e .fastq.gz FASTQ-DIR/ > fastq-list.txt

#run
bactopia --fastqs fastq-list.txt --datasets datasets --outdir OUTDIR --cpus 20

-e Extension of the FASTQs. Default: .fastq.gz
-s Split FASTQ name on the last occurrence of the separator. Default: _
--pattern Glob pattern to match FASTQs. Default: *.fastq.gz
--fastqs An input file containing the sample name and absolute paths to FASTQs to process
--max_memory The maximum amount of memory (Gb) allowed to a single process. Default: 32 Gb
--cpus Number of processors made available to a single process. Default: 4
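Before building the list, it may be worth confirming that every forward read file has its mate, since a missing R2 is easy to overlook in a large directory. The sketch below is independent of Bactopia: the function name `missing_mates` is hypothetical, and the `_R1.fastq.gz` / `_R2.fastq.gz` naming is an assumption taken from the example above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: for every *_R1.fastq.gz in a directory,
# print the sample name if the matching *_R2.fastq.gz is missing.
# The _R1/_R2 naming is an assumption taken from the example above.
missing_mates() {
    local dir=$1 r1 r2
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                  # glob matched nothing
        r2=${r1%_R1.fastq.gz}_R2.fastq.gz         # expected mate file
        [ -e "$r2" ] || basename "$r1" _R1.fastq.gz
    done
}

# Demo with empty placeholder files in a temporary directory
demo=$(mktemp -d)
touch "$demo/a_R1.fastq.gz" "$demo/a_R2.fastq.gz" "$demo/b_R1.fastq.gz"
missing_mates "$demo"    # prints: b
rm -r "$demo"
```

An empty result means every R1 file has a mate; it says nothing about whether the files themselves are valid FASTQ.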

Compute resources are used quite efficiently.

[Screenshot: resource usage during the run]

With multiple samples, results are saved in one subfolder per sample.

[Screenshot: per-sample output subfolders]

Analyzing 60 samples of roughly 50x data from 5 Mb bacterial genomes took only about 2 hours 40 minutes (*1).

Sequencing reads from ENA or SRA can also be analyzed.
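As a rough worked figure, the timing above corresponds to an average of under three minutes of wall-clock time per sample. The helper name `minutes_per_sample` is hypothetical, and the result is an average over samples processed in parallel, not the runtime of any single sample.

```shell
#!/usr/bin/env bash
# Hypothetical helper: average wall-clock minutes per sample.
# usage: minutes_per_sample TOTAL_MINUTES N_SAMPLES
minutes_per_sample() {
    awk -v total="$1" -v n="$2" 'BEGIN { printf "%.1f\n", total / n }'
}

# 2 h 40 min for 60 samples
minutes_per_sample $((2 * 60 + 40)) 60   # prints: 2.7
```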

# Single ENA/SRA Experiment
bactopia --accession SRX000000 --datasets datasets --outdir OUTDIR

# Multiple ENA/SRA Experiments
bactopia search "staphylococcus aureus" > accessions.txt
bactopia --accessions accessions.txt --datasets datasets --outdir ${OUTDIR}
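The help text states that `--accession` and `--accessions` expect ENA/SRA Experiment accessions, so a quick check of the accession file before launching may catch run or study accessions that slipped in. The sketch below is an assumption-laden illustration: the function name `bad_accessions` is hypothetical, and it assumes one accession per line, which may not match the file you have.

```shell
#!/usr/bin/env bash
# Hypothetical filter: print lines that do NOT look like ENA/SRA
# Experiment accessions (SRX, ERX, or DRX followed by digits).
# Assumes one accession per line, which may not match your file.
bad_accessions() {
    grep -Ev '^[SED]RX[0-9]+$' "$1"
}

# Demo: SRR999999 is a run accession, not an experiment accession
demo=$(mktemp)
printf '%s\n' SRX000000 ERX123456 SRR999999 > "$demo"
bad_accessions "$demo"   # prints: SRR999999
rm "$demo"
```

No output means every line matched the Experiment accession pattern; `grep` then exits with status 1, which is expected here.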