2020-03-19

臨床環境の病原性バクテリアを素早くジェノタイピングする biohansel

　BioHanselは、全ゲノムシーケンス（WGS）データで系統学的に有益な１塩基多型（SNP）（canonical SNPsとも呼ばれる）を識別することにより、細菌分離株の高解像度のジェノタイピングを実行する。このアプリケーションは、高速k-merマッチングアルゴリズムを使用して、病原体WGSデータを階層構造のスキーマに含まれるcanonical SNPsにマッピングし、検出されたSNPプロファイルに基づいて遺伝子型を割り当てる。適度なコンピューティングリソースを使用して、BioHanselは生のシーケンスリードまたはアセンブルされたコンティグから分離株を数秒で効率的にタイプし、監視、診断にWGS手法を適用したい公衆衛生、食品安全、環境、農業当局、および研究プログラムによる使用を魅力的にする。 BioHanselは現在、4つの一般的なサルモネラ血清型（Typhi、Typhimurium、Enteritidis、およびHeidelberg）の標準SNPジェノタイピングスキーマと、Mycobacterium tuberculosisのスキーマを提供している。ユーザーは、他の生物のジェノタイピング用に独自のスキーマを提供することもできる。 BioHanselの品質保証システムは、ジェノタイピング結果の妥当性を評価し、低品質のデータ、汚染されたデータセット、および誤認された生物を特定できる。 BioHanselは、製品のリコールなどの公衆衛生を目的としたサーベイランス、ソース属性、リスク評価、診断、および迅速なスクリーニングをサポートすることを目的としている。 BioHanselは、PyPI、Conda、およびGalaxyワークフローマネージャーで利用可能なパッケージを備えたオープンソースアプリケーションである。要約すると、BioHanselは、標準的なコンピューティングハードウェアでシーケンシングリードまたはアセンブルされたコンティグから、細菌ゲノムの効率的、迅速、正確、高解像度の分類を実行する。 BioHanselは、感染症の監視、診断、アウトブレイクの調査と対応の最前線で、一般的な研究ツールとしてだけでなく、完全に運用可能なWGSワークフローでの使用にも適している。　BioHanselユーザーガイドは、https://bio-hansel.readthedocs.io/en/readthedocs/から入手できる。補足資料はhttps://github.com/phac-nml/biohansel-manuscript-supplementary-dataから入手できる。

BioHansel is available on GitHub and can be installed via Conda, PyPI or GitHub.

BioHansel is also available on Galaxy. https://t.co/tLXUT7XQ0K
— Daniel Kein 🇨🇦 (@danielkein) 2020年1月12日

Documentation

https://bio-hansel.readthedocs.io/en/readthedocs/user-docs/usage.html

インストール

mac os10.14のanaconda環境でテストした。

biohansel has been confirmed to work on Mac OSX (versions 10.13.5 Beta and 10.12.6) when installed with Conda.

依存

Python (>=v3.6)

numpy >=1.12.1
pandas >=0.20.1
pyahocorasick >=1.1.6
attrs

本体　Github

#bioconda (link)
conda create -n biohansel -y
conda activate biohansel
conda install -c bioconda -y bio_hansel

#pip
pip install bio_hansel
#latest master branch version directly from Github
pip install git+https://github.com/phac-nml/biohansel.git@master

> hansel -h

$ hansel -h

usage: hansel [-h] [-s SCHEME] [--scheme-name SCHEME_NAME] [-M SCHEME_METADATA] [-p forward_reads reverse_reads] [-i fasta_path genome_name] [-D INPUT_DIRECTORY] [-o OUTPUT_SUMMARY] [-O OUTPUT_KMER_RESULTS]

[-S OUTPUT_SIMPLE_SUMMARY] [--force] [--json] [--min-kmer-freq MIN_KMER_FREQ] [--max-kmer-freq MAX_KMER_FREQ] [--low-cov-depth-freq LOW_COV_DEPTH_FREQ] [--max-missing-kmers MAX_MISSING_KMERS]

[--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS] [--low-cov-warning LOW_COV_WARNING] [--max-intermediate-kmers MAX_INTERMEDIATE_KMERS] [--max-degenerate-kmers MAX_DEGENERATE_KMERS] [-t THREADS] [-v] [-V]

[F [F ...]]

Subtype microbial genomes using SNV targeting k-mer subtyping schemes.

Includes schemes for Salmonella enterica spp. enterica serovar Heidelberg, Enteritidis, Typhi, and Typhimurium subtyping. Also includes a Mycobacterium tuberculosis scheme called 'tb_lineage'.

Developed by Geneviève Labbé, Peter Kruczkiewicz, Philip Mabon, James Robertson, Justin Schonfeld, Daniel Kein, Marisa A. Rankin, Matthew Gopez, Darian Hole, David Son, Natalie Knox, Chad R. Laing, Kyrylo Bessonov, Eduardo Taboada, Catherine Yoshida, Roger P. Johnson, Gary Van Domselaar and John H.E. Nash.

positional arguments:

F Input genome FASTA/FASTQ files (can be Gzipped)

optional arguments:

-h, --help show this help message and exit

-s SCHEME, --scheme SCHEME

Scheme to use for subtyping (built-in: "heidelberg", "enteritidis", "typhi", "typhimurium", "tb_lineage"; OR user-specified: /path/to/user/scheme)

--scheme-name SCHEME_NAME

Custom user-specified SNP substyping scheme name

-M SCHEME_METADATA, --scheme-metadata SCHEME_METADATA

Scheme subtype metadata table (tab-delimited file with ".tsv" or ".tab" extension or CSV with ".csv" extension format accepted; MUST contain column called "subtype")

-p forward_reads reverse_reads, --paired-reads forward_reads reverse_reads

FASTQ paired-end reads

-i fasta_path genome_name, --input-fasta-genome-name fasta_path genome_name

input fasta file path AND genome name

-D INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY

directory of input fasta files (.fasta|.fa|.fna) or FASTQ files (paired FASTQ should have same basename with "_\d\.(fastq|fq)" postfix to be automatically paired) (files can be Gzipped)

-o OUTPUT_SUMMARY, --output-summary OUTPUT_SUMMARY

Subtyping summary output path (tab-delimited)

-O OUTPUT_KMER_RESULTS, --output-kmer-results OUTPUT_KMER_RESULTS

Subtyping kmer matching output path (tab-delimited)

-S OUTPUT_SIMPLE_SUMMARY, --output-simple-summary OUTPUT_SIMPLE_SUMMARY

Subtyping simple summary output path

--force Force existing output files to be overwritten

--json Output JSON representation of output files

--min-kmer-freq MIN_KMER_FREQ

Min k-mer freq/coverage

--max-kmer-freq MAX_KMER_FREQ

Max k-mer freq/coverage

--low-cov-depth-freq LOW_COV_DEPTH_FREQ

Frequencies below this coverage are considered low coverage

--max-missing-kmers MAX_MISSING_KMERS

Decimal proportion of maximum allowable missing kmers before being considered an error. (0.0 - 1.0)

--min-ambiguous-kmers MIN_AMBIGUOUS_KMERS

Minimum number of missing kmers to be considered an ambiguous result

--low-cov-warning LOW_COV_WARNING

Overall kmer coverage below this value will trigger a low coverage warning

--max-intermediate-kmers MAX_INTERMEDIATE_KMERS

Decimal proportion of maximum allowable missing kmers to be considered an intermediate subtype. (0.0 - 1.0)

--max-degenerate-kmers MAX_DEGENERATE_KMERS

Maximum number of scheme k-mers allowed before quitting with a usage warning. Default is 100000

-t THREADS, --threads THREADS

Number of parallel threads to run analysis (default=1)

-v, --verbose Logging verbosity level (-v == show warnings; -vvv == show debug info)

-V, --version show program's version number and exit

実行方法

fastqを指定する。

hansel -s heidelberg -t 8 -o results.tab -O match_results.tab -p SRR5646583_forward.fastqsanger SRR5646583_reverse.fastqsanger

-s Scheme to use for subtyping (built-in: "heidelberg", "enteritidis", "typhi", "typhimurium", "tb_lineage"; OR user-specified: /path/to/user/scheme)
-v, --verbose Logging verbosity level (-v == show warnings; -vvv == show debug info)
-t Number of parallel threads to run analysis (default=1)
-p forward_reads reverse_reads

match_results.tab

f:id:kazumaxneo:20200318223919p:plain

match_results.tab

f:id:kazumaxneo:20200318224006p:plain

ディレクトリの全fastqを分析する。

hansel -s heidelberg -vv --threads <n_cpu> -o results.tab -O match_results.tab -D /path/to/fastqs/

引用

Rapid and accurate SNP genotyping of clonal bacterial pathogens with BioHansel

Geneviève Labbé, James Robertson, Peter Kruczkiewicz, Marisa Rankin, Matthew Gopez, Chad R. Laing, Philip Mabon, Kim Ziebell, Aleisha R. Reimer, Lorelee Tschetter, Gary Van Domselaar, Sadjia Bekal, Kimberley A. MacDonald, Linda Hoang, Linda Chui, Danielle Daignault, Durda Slavic, Frank Pollari, E. Jane Parmley, Elissa Giang, Lok Kan Lee, Jonathan Moffat, Joanne MacKinnon, Roger Johnson, John H.E. Nash. [Manuscript in preparation]

bioRxiv, Posted January 11, 2020

2020-03-18

タンデムリピートなどのミスアセンブリを分析する TandemQUAST

centromere 2020 human genome Preprint misassembly Long Terminal Repeat retrotransposons (LTR-RTs) repetitive sequences

　タンデムリピートは、不均等なクロスオーバーによってしばしば生成される複数の連続するほぼ同一のシーケンスによって形成される（Smith、1976）。初期のDNAシーケンスプロジェクトで、タンデムリピートが真核生物のゲノムに豊富にあることが明らかになった（Yunis and Yasmineh、1971; Bacolla et al、2008）。タンデムリピートの最近の研究は、さまざまな細胞プロセスにおける役割を明らかにし、タンデムリピートの変異が遺伝的障害につながる可能性があることを示した（McFarland et al、2015; Giunta and Funabiki、2017; Song et al、2018; Black et al、 2018）。
　広く研究されたショートタンデムリピート（Willems et al、2014; Gymrek et al、2016; Saini et al、2018）と、長さが数万から数百万のウルトラロングタンデムリピート（ETR）を区別する。 ETRはアセンブリが難しいため、それらの大部分は他の種はもちろんのこと、ヒトゲノムにおいてさえもアセンブリされないままである。セントロメアとペリセントロメアには、いくつかの最長のETRが含まれている。これは、ヒトゲノムの約3％を占め、メガベース長の領域に及ぶ（Miga、2019）。それらは、これまでに配列を決定するすべての試みを回避したヒトゲノムの「ダークマター」であり、リファレンスヒトゲノムの最大のギャップである（Hayden et al、2013; Miga et al、2019）。
　パシフィックバイオサイエンス（PacBio）やオックスフォードナノポアテクノロジー（ONT）などのロングリード技術の出現により、全ゲノムシーケンスの状況が大きく変わった。ロングリードアセンブラ（Chin et al、2016; Koren et al、2017; Kolmogorov et al、2019; Ruan and Li、2019）およびロングリードとショートリードを組み合わせたハイブリッドアセンブラ（Antipov et al、 2016; Zimin et al、2017）は、ショートリードアセンブリと比較して、アセンブリされたゲノムの連続性を大幅に向上した。さらに、ロングリードは、ヒトの動原体を再構築するための半手動アプローチの成功に貢献した（Jain et al、2018a; Miga et al、2019）。 Flyeアセンブラは、ロングリードにまたがるブリッジされたタンデムリピート、およびロングリードにまたがらないいくつかの非ブリッジタンデムリピートを正常に解決する（Kolmogorov et al、2019）。 centroFlyeアセンブラ（BzikadzeおよびPevzner、2019）は、セントロメアなどの非ブリッジETRを自動的にアセンブルするように設計された。
　ETRアセンブリのさまざまな代替戦略と、これらのアセンブリのベンチマークのグラウンドトゥルースがないため、品質評価の問題が生じる。同様の問題は、GAGE（Salzberg et al、2011）やQUAST（Gurevich et al、2013; Mikheenko et al、2018）、metaQUAST（Mikheenko et al、2016）およびrnaQUAST（Bushmanova et al、2016）などのショートリードのゲノムアセンブリ評価ツールと特化した品質評価によって対処されている。ただし、これらのツールは既知のリファレンスに基づいている。ETRの分析にはアセンブリ品質を評価するリファレンスフリーのアプローチが必要なため、ETRの分析には適用できない。同時に、既存のリファレンスフリーツールは、ペアエンドリードアラインメントまたは遺伝子コンテンツの分析に基づいている（Hunt et al、2013; Clark et al、2013; Ghodsi et al、2013;Simãoet al、2015）。

　既存のアセンブリ品質評価ツールは、リードをアセンブリに正確にマッピングするアライナーに依存している（Li and Durbin、2009; Langmead et al、2009; Li、2016; Li、2018）。ただし、ベンチマークでは、これらのツールがETRで失敗することが多いことが明らかになっている。たとえば、minimap2（Li、2018）は、特にアセンブリエラーのある地域で、ETRへの一部のリードのミスアラインメントをもたらす。そのため、エラーが発生しやすいロングリードをETRに効率的にマップするtandemMapperツールを開発した。 TandemMapperはtandemQUASTの開発を可能にするだけでなく、より正確なリードマッピングとその後のポリッシングによるETRアセンブリの改善にもつながった。
　ETRアセンブリの品質を評価する最初の試みはセントロメア特異的であり（Bzikadze and Pevzner、2019）、ETRアセンブリの一般的な品質評価ツールには至っていない。セントロメアの種および染色体固有の性質により、他の種のETRへの同じアプローチの適用が妨げられる。ただし、セントロメア組織の共通の原則を利用して、ETR用の汎用アセンブリ評価ツールを開発できる。
　霊長類のセントロメアは、レトロトランスポゾンリピートとATリッチαサテライト、171 bpモノマーに基づくDNAリピートで構成されている（Manuelidis and Wu、1978）。ヒトと多くの霊長類では、連続したモノマーがタンデムに並んで高次リピート（HOR）ユニットに配置されている（Willard and Waye、1987a）。 HOR内のモノマーの数とその順序は、染色体に固有である。たとえば、染色体X（DXZ1と呼ばれる）の HORは、12のモノマーで構成されている（Willard and Waye、1987b）。これらの12のモノマーは、先祖代々の五量体サテライトリピートABCDEから進化したもので、C1D 1E 1A1B 1C 2D 2E 2A 2B 2C 3D 3.として表すことができる。（一部略）

　ここでは、リードをETRにマッピングするた目のツールであるtandemMapperと、ETRアセンブリを評価および改善するためのツールであるtandemQUASTを紹介する。タンデムマッパーとその後のポリッシングを使用して、centroFlye（Bzikadze and Pevzner、2019）とキュレートされたセミマニュアルアプローチ（Miga et al、2019）の両方によって生成されたヒトセントロメアXのアセンブリを変更した。（一部略）
TandemMapperとtandemQUASTは、https://github.com/ablab/tandemQUASTのGitHubでコマンドラインユーティリティとして自由に利用できるオープンソースソフトウェアである。

ここでは TandemQUASTのみ紹介します。

インストール

macos10.14にてcondaの仮想環境を作成してテストした。

本体　Github

https://github.com/ablab/TandemTools

#依存が多いので、ここではcondaで作成した仮想環境に導入する。
conda create -n tandemtools -y
conda activate tandemtools
git clone https://github.com/ablab/TandemTools.git
cd TandemTools/
conda install --file requirements.txt -y

#jellyfishも必要
conda install -c bioconda -y jellyfish

> ./tandemquast.py --help

$ ./tandemquast.py --help

Usage: tandemquast.py [OPTIONS] [ASSEMBLY_FNAMES]...

Options:

--nano PATH File with ONT reads

--pacbio PATH File with PacBio CLR reads

-o PATH Output folder [required]

-t INTEGER Threads

-m PATH Monomer sequence

-l TEXT Comma separated list of assembly labels

--hifi PATH File with PacBio HiFi reads

--only-polish Run polishing only

-f, --no-reuse Do not reuse old files

--help Show this message and exit.

> ./tandemmapper.py

$ ./tandemmapper.py

Usage: tandemmapper.py [OPTIONS] [ASSEMBLY_FNAMES]...

Try 'tandemmapper.py --help' for help.

Error: Missing option '-o'.

テストラン

ロングリードのfastqとアセンブリのFASTAファイルを指定する。

 ./tandemquast.py --nano test_data/simulated_reads.fasta -o test test_data/simulated_polished.fa test_data/simulated_del.fasta

--nano File with Oxford Nanopore reads used for ETR assembly
--pacbio File with PacBio CLR reads used for ETR assembly
-o Folder to store all result files

report/が最終出力。simulated_delとsimulated_polishedの比較。

f:id:kazumaxneo:20200317220604p:plain

discordance_simulated-polished_vs_simulated-del.png

f:id:kazumaxneo:20200318080308p:plain

simulated-polished_vs_simulated-del.png

f:id:kazumaxneo:20200318080320p:plain

simulated-del_kmer_analysis.png

f:id:kazumaxneo:20200318080332p:plain

simulated-polished_kmer_analysis.png

f:id:kazumaxneo:20200318080557p:plain

simulated-del_bp_analysis.png

f:id:kazumaxneo:20200318080402p:plain

simulated-del_bp_analysis.png

f:id:kazumaxneo:20200318080443p:plain

simulated-del_coverage.png

f:id:kazumaxneo:20200318080512p:plain

simulated-polished_coverage.png

f:id:kazumaxneo:20200318080525p:plain

レポートの見方は論文を読んで確認して下さい。

tandemquastは数十万から数百万塩基のウルトラロングタンデムリピート向けに開発されており、より短いタンデムリピートには使わないことが強く推奨されています。注意してください。

引用

TandemMapper and TandemQUAST: mapping long reads and assessing/improving assembly quality in extra-long tandem repeats

Alla Mikheenko, Andrey V. Bzikadze, Alexey Gurevich, Karen H. Miga, Pavel A. Pevzner

2020-03-17

計算リソースを効率的に使って多数のよく似たバクテリアゲノムを素早く分析する自動化されたパイプライン Bactopia

2020 Preprint automated pipeline automatic pipeline bacteria docker AMR 生物種の推定 (taxonomic profiling) taxonomic assignment annotation benchmark bacterial annotation SNV small indel MInHash 高速なツール pan-genome ANI nextflow mSystems

2020 3/17 パラメータ追記、コマンド修正、タイトル修正

2020 3/18 追記

2020 5/11 説明追加

2020 8/13 論文追記

2020 12/9 ツイート追加

2021 2/24アップデートされたコマンドに修正

2021 10/7 ツイート追加

　イルミナのテクノロジーを使用した細菌ゲノムのシーケンシングは、多くの場合、扱いやすい分析手法よりも速くデータが生成される手順になっている。 Nextflowワークフローソフトウェアを使用して構築されたBactopiaと呼ばれる新しいシリーズのパイプラインを作成し、細菌種または属の効率的な比較ゲノム解析手法を提供する。 Bactopiaは、対象の種に対して一連のカスタマイズ可能なデータセットが作成されるデータセットセットアップステップ（Bactopia Datasets; BaDs）で構成されている; the Bactopia Analysis Pipeline (BaAP)、これは品質管理、ゲノムアセンブリ、および利用可能なデータセットに基づいていくつかの他の機能を実行し、処理されたデータを構造化されたディレクトリ形式で出力する。また、処理されたデータの一部またはすべてに対して特定の後処理を実行する一連のBactopiaツール（BaT）を出力する。 BaTには、パンゲノム解析、サンプル間の平均ヌクレオチド同一性の計算、16S遺伝子の抽出とプロファイリング、高度に保存された遺伝子を使用した分類学的分類が含まれる。 BaTの数は、将来、特定のアプリケーションを満たすために増加することが予想される。デモンストレーションとして、ヒトの膣マイクロバイオームの共通種であるL. crispatusに焦点を合わせ、1,664の公開されたラクトバチルスゲノムの分析を行った。 Bactopiaは、1つのバクテリアゲノムの小さなプロジェクトから数千ものプロジェクトまで拡張できるオープンソースシステムであり、比較データセットとダウンストリーム解析のオプションを選択する際の柔軟性を高めている。 Bactopiaコードは、https://www.github.com/bactopia/bactopiaからアクセスできる。

　Bactopia Datasetsは、多くの既存のパブリックデータセットとプライベートデータセットを分析に含めるためのフレームワークを提供する。これらのデータセットをBactopia用にダウンロード、構築、および（または）構成するプロセスは自動化されている。

　Bactopia Analysis PipelineはNextflowで構築され、入力FASTQ（ローカルまたはSRA / ENAから入手可能）は、品質管理、アセンブリ、アノテーション、リファレンスへのマッピング、バリアントコール、マイナースケッチクエリ、BLASTアラインメント、挿入部位予測、タイピング、などを行うことができる。 Bactopia Analysis Pipelineは、利用可能なBactopiaデータセットに基づいて、含める分析を自動的に選択する。

Bactopia Toolsは、比較分析のための独立したワークフローセットである。比較分析には、要約レポート、パンゲノム、または系統樹構築が含まれる。 Bactopiaの予測可能な出力構造を使用すると、Bactopia Toolで処理するために含めるサンプルを選択できる。

Bactopiaは、黄色ブドウ球菌ゲノムをターゲットとする我々（著者ら）がリリースしたワークフローであるStaphopiaに触発された。 Staphopiaとユーザーフィードバックから学んだことを使用して、Bactopiaは最初から使いやすさ、移植性、速度を念頭に置いてゼロから開発された。

V2リリース

Bactopia v2, now includes more than 100 tests, that are testing more than 7000 outputs.

Pytest (modeled after nf-core/modules) is running on @github Actions, using real data from https://t.co/EmWtMhVXt6

Thank you @tdread_emory, for providing the resources to run all these tests
— Robert A. Petit III, PhD (@rpetit3) 2021年12月6日

Bactopia users: if you ever find yourself feeling this way with Bactopia please don't hesitate to reach out!

Sometimes things break and it can be frustrating! I've also tried my best to build thorough docs, but there's always room for improvement.https://t.co/TSJ8GMHCAL https://t.co/gCr3PZywV3
— Robert A. Petit III, PhD (@rpetit3) 2021年10月2日

Check out https://t.co/JAZtbHuLG6 to get started using Bactopia.

Please feel free to reach out if you have any questions! https://t.co/WG3HqzokiP
— Robert A. Petit III (@rpetit3) 2020年8月12日

Latest preprint with @rpetit3 describes our Bactopia pipeline - a software for flexible analysis of 1 - 1000s of bacterial genomes starting from FASTQ. Example analysis of 1600+ Lactobacillus genomes, focusing on L. crispatus, with only a few commands https://t.co/SlkWbClxhZ pic.twitter.com/U6ie8XTChR
— Timothy Read PhD (@tdread_emory) 2020年3月11日

New blog post: Using AWS Batch to process 67,000 genomes with Bactopia.

In 5 days(!), I processed all publicly available Staphylococcus aureus sequencing projects. This post outlines how it was done and the costs to do so.https://t.co/09R7Ac0bl1 #asmngs #bioinformatics #aws
— Robert A. Petit III (@rpetit3) 2020年12月8日

Bactopia v1.5.4 is now available!

Release Details: https://t.co/3RI7Dp6rgT

Learn more about Bactopia at:
- https://t.co/lADbrUD6MN (docs)
- https://t.co/0aLt5LIE8W (ePoster)
— Robert A. Petit III (@rpetit3) 2020年12月18日

Document

Installation - Bactopia

インストール

condaの仮想環境を作ってテストしたがエラーが起きた（macosとubuntu18.04でテスト）。オーサーらが提供しているdockerイメージを使ってテストした。

本体　Github

#bioconda (link)
conda create -y -n bactopia -c conda-forge -c bioconda bactopia
conda activate bactopia

#dockerイメージ (link)
docker pull bactopia/bactopia
#ホストのカレントと仮想環境の/dataをシェアしてrun
docker run --rm -itv $PWD:/data bactopia/bactopia

> bactopia --help

$ bactopia --help

N E X T F L O W ~ version 20.01.0

Launching `/Users/kazu/anaconda3/envs/bactopia/share/bactopia-1.3.0/main.nf` [curious_brown] - revision: a8ccad600f

bactopia v1.3.0

Required Parameters:

### For Procesessing Multiple Samples

--fastqs STR An input file containing the sample name and

absolute paths to FASTQs to process

### For Processing A Single Sample

--R1 STR First set of reads for paired end in compressed (gzip)

FASTQ format

--R2 STR Second set of reads for paired end in compressed (gzip)

FASTQ format

--SE STR Single end set of reads in compressed (gzip) FASTQ format

--sample STR The name of the input sequences

### For Downloading from ENA

--accessions An input file containing ENA/SRA experiement accessions to

be processed

--accession A single ENA/SRA Experiment accession to be processed

Dataset Parameters:

--datasets DIR The path to available datasets that have

already been set up

--species STR Determines which species-specific dataset to

use for the input sequencing

Optional Parameters:

--coverage INT Reduce samples to a given coverage

Default: 100x

--genome_size INT Expected genome size (bp) for all samples, a value of '0'

will disable read error correction and read subsampling.

Special values (requires --species):

'min': uses minimum completed genome size of species

'median': uses median completed genome size of species

'mean': uses mean completed genome size of species

'max': uses max completed genome size of species

Default: Mash estimate

--outdir DIR Directory to write results to

Default: .

--max_time INT The maximum number of minutes a job should run before being halted.

Default: 120 minutes

--max_memory INT The maximum amount of memory (Gb) allowed to a single process.

Default: 32 Gb

--cpus INT Number of processors made available to a single

process.

Default: 4

Nextflow Related Parameters:

--infodir DIR Directory to write Nextflow summary files to

Default: ./bactopia-info

--condadir DIR Directory to Nextflow should use for Conda environments

Default: Bactopia's Nextflow directory

--nfdir Print directory Nextflow has pulled Bactopia to

--overwrite Nextflow will overwrite existing output files.

Default: false

--conatainerPath Path to Singularity containers to be used by the 'slurm'

profile.

Default: /opt/bactopia/singularity

--sleep_time After reading datases, the amount of time (seconds) Nextflow

will wait before execution.

Default: 5 seconds

--publish_mode Set Nextflow's method for publishing output files. Allowed methods are:

'copy' (default) Copies the output files into the published directory.

'copyNoFollow' Copies the output files into the published directory

without following symlinks ie. copies the links themselves.

'link' Creates a hard link in the published directory for each

process output file.

'rellink' Creates a relative symbolic link in the published directory

for each process output file.

'symlink' Creates an absolute symbolic link in the published directory

for each process output file.

Default: copy

--force Nextflow will overwrite existing output files.

Default: false

Useful Parameters:

--available_datasets Print a list of available datasets found based

on location given by "--datasets"

--example_fastqs Print example of expected input for FASTQs file

--check_fastqs Verify "--fastqs" produces the expected inputs

--compress Compress (gzip) select outputs (e.g. annotation, variant calls)

to reduce overall storage footprint.

--keep_all_files Keeps all analysis files created. By default, intermediate

files are removed. This will not affect the ability

to resume Nextflow runs, and only occurs at the end

of the process.

--dry_run Mimics workflow execution, to help prevent errors realated to

conda envs being built in parallel. Only useful on new

installs of Bactopia.

--version Print workflow version information

--help Show this message and exit

--help_all Show a complete list of adjustable parameters

> bactopia datasets --help

$ bactopia datasets --help

usage: bactopia datasets [-h] [--ariba STR] [--species STR] [--skip_prokka]

[--include_genus] [--identity FLOAT]

[--overlap FLOAT] [--max_memory INT] [--fast_cluster]

[--skip_minmer] [--skip_plsdb] [--cpus INT]

[--clear_cache] [--force] [--force_ariba]

[--force_mlst] [--force_prokka] [--force_minmer]

[--force_plsdb] [--keep_files] [--list_datasets]

[--depends] [--version] [--verbose] [--silent]

OUTPUT_DIRECTORY

bactopia datasets (v1.3.0) - Setup public datasets for Bactopia

positional arguments:

OUTPUT_DIRECTORY Directory to write output.

optional arguments:

-h, --help show this help message and exit

Ariba Reference Datasets:

--ariba STR Setup Ariba datasets for a given reference or a list of

references in a text file. (Default: card,vfdb_core)

Bacterial Species:

--species STR Download available MLST schemas and completed genomes for

a given species or a list of species in a text file.

Custom Prokka Protein FASTA:

--skip_prokka Skip creation of a Prokka formatted fasta for each species

--include_genus Include all genus members in the Prokka proteins FASTA

--identity FLOAT CD-HIT (-c) sequence identity threshold. (Default: 0.9)

--overlap FLOAT CD-HIT (-s) length difference cutoff. (Default: 0.8)

--max_memory INT CD-HIT (-M) memory limit (in MB). (Default: unlimited

--fast_cluster Use CD-HIT's (-g 0) fast clustering algorithm, instead of

the accurate but slow algorithm.

Minmer Datasets:

--skip_minmer Skip download of pre-computed minmer datasets (mash,

sourmash)

PLSDB (Plasmid) BLAST/Sketch:

--skip_plsdb Skip download of pre-computed PLSDB datbases (blast, mash)

Helpful Options:

--cpus INT Number of cpus to use. (Default: 1)

--clear_cache Remove any existing cache.

--force Forcibly overwrite existing datasets.

--force_ariba Forcibly overwrite existing Ariba datasets.

--force_mlst Forcibly overwrite existing MLST datasets.

--force_prokka Forcibly overwrite existing Prokka datasets.

--force_minmer Forcibly overwrite existing minmer datasets.

--force_plsdb Forcibly overwrite existing PLSDB datasets.

--keep_files Keep all downloaded and intermediate files.

--list_datasets List Ariba reference datasets and MLST schemas available

for setup.

--depends Verify dependencies are installed.

Adjust Verbosity:

--version show program's version number and exit

--verbose Print debug related text.

--silent Only critical errors will be printed.

example usage:

bactopia datasets outdir

bactopia datasets outdir --ariba 'card'

bactopia datasets outdir --species 'Staphylococcus aureus' --include_genus

> bactopia prepare --help

$ bactopia prepare --help

usage: bactopia prepare [-h] [-e STR] [-s STR] [--pattern STR] [--version] STR

bactopia prepare (v1.3.0) - Read a directory and prepare a FOFN of FASTQs

positional arguments:

STR Directory where FASTQ files are stored

optional arguments:

-h, --help show this help message and exit

-e STR, --ext STR Extension of the FASTQs. Default: .fastq.gz

-s STR, --sep STR Split FASTQ name on the last occurrence of the separator.

Default: _

--pattern STR Glob pattern to match FASTQs. Default: *.fastq.gz

--version show program's version number and exit

> bactopia search --help

$ bactopia search --help

usage: bactopia search [-h] [--exact_taxon] [--outdir OUTPUT_DIRECTORY]

[--prefix PREFIX] [--limit INT] [--version]

STR

bactopia search (v1.3.0) - Search ENA for associated WGS samples

positional arguments:

STR Taxon ID or Study accession

optional arguments:

-h, --help show this help message and exit

--exact_taxon Exclude Taxon ID descendents.

--outdir OUTPUT_DIRECTORY

Directory to write output. (Default: .)

--prefix PREFIX Prefix to use for output file names. (Default: ena)

--limit INT Maximum number of results to return. (Default:

1000000)

--version show program's version number and exit

example usage:

bactopia search PRJNA480016 --limit 20

bactopia search 1280 --exact_taxon --limit 20'

bactopia search "staphylococcus aureus" --limit 20

> bactopia tools

$ bactopia tools

bactopia tools (v1.3.0) - A suite of comparative analyses for Bactopia outputs

Available Tools:

fastani Pairwise average nucleotide identity

gtdb Identify marker genes and assign taxonomic classifications

phyloflash 16s assembly, alignment and tree

roary Pan-genome with optional core-genome tree.

summary A report summarizing Bactopia project

データベース作成

様々なデータベースが利用できる（説明）。ここでは基本のデータセットを使う。

bactopia datasets datasets/
#新しいバージョン
bactopia datasets

これにより、Aribaデータセット（CARDおよびvfdb_core）、RefSeq Mash sketch、GenBank Sourmash Signatures、およびPLSDBが作成されたdatasetsディレクトリにセットアップされる。

出力

f:id:kazumaxneo:20200317100838p:plain

実行方法

シークエンシングリードとデータベースを指定する。

#paired-end
bactopia --R1 R1.fastq.gz --R2 R2.fastq.gz --sample SAMPLE_NAME \
 --datasets datasets/ --outdir OUTDIR

#single-end
bactopia --SE SAMPLE.fastq.gz --sample SAMPLE --datasets datasets/ --outdir OUTDIR

#複数サンプル（ペアエンドなら自動で1行にタブ区切り表示される。出力を目視で一度確認すること、拡張子はfastq.gzが認識される）
bactopia prepare MY-FASTQS/ > fastqs.txt
bactopia --fastqs fastqs.txt --datasets datasets --outdir OUTDIR

--datasets The path to available datasets that have already been set up
--sample The name of the input sequences

ランが始まると端末に’進捗が表示される。

f:id:kazumaxneo:20200317114604p:plain

そのプロセスが終わるとチェック✔がつく。

executor > local (21)

[95/b57cd1] process > gather_fastqs [100%] 1 of 1 ✔

[73/d15465] process > fastq_status [100%] 1 of 1 ✔

[c5/f5cdc8] process > estimate_genome_size [100%] 1 of 1 ✔

[2e/d1894d] process > qc_reads [100%] 1 of 1 ✔

[91/8525ff] process > qc_original_summary [100%] 1 of 1 ✔

[76/1bffa8] process > qc_final_summary [100%] 1 of 1 ✔

[46/fb4103] process > assemble_genome [100%] 1 of 1 ✔

[53/d3e2ed] process > make_blastdb [100%] 1 of 1 ✔

[46/7c5e66] process > annotate_genome [100%] 1 of 1 ✔

[76/e06817] process > count_31mers [100%] 1 of 1 ✔

[- ] process > sequence_type -

[bd/aa733b] process > ariba_analysis [100%] 2 of 2 ✔

[da/e89cf4] process > minmer_sketch [100%] 1 of 1 ✔

[80/5d8161] process > minmer_query [100%] 5 of 5 ✔

[- ] process > call_variants -

[- ] process > download_references -

[- ] process > call_variants_auto -

[39/cf5629] process > update_antimicrobial_resistance [100%] 1 of 1 ✔

[2e/28348a] process > antimicrobial_resistance [100%] 1 of 1 ✔

[- ] process > insertion_sequences -

[72/41fc25] process > plasmid_blast [100%] 1 of 1 ✔

[- ] process > blast_query -

[- ] process > mapping_query -

Completed at: 17-Mar-2020 03:06:49

Duration : 22m 24s

CPU hours : 2.5

Succeeded : 21

終了した。

出力

bactopia-info/

Nextflow workflow reportも確認できる。

f:id:kazumaxneo:20200317125519p:plain

サンプルの出力は、ユーザーが指定したサブディレクトリに保存される。

SAMPLE_NAME/

f:id:kazumaxneo:20200317125603p:plain

アセンブリ、アノテーション、k-merを使った種の予測、AMR予測、など。詳細はDocumentを確認して下さい。

複数サンプルある場合、リストファイルを指定する。xxx_R1.fastq.gzとxxx_R2.fastq.gzなら

#list作成
bactopia prepare -e .fastq.gz FASTQ-DIR/ > fastq-list.txt

#実行
bactopia --fastqs fastq-list.txt --datasets datasets --outdir OUTDIR  --cpus 20

-e Extension of the FASTQs. Default: .fastq.gz
-s Split FASTQ name on the last occurrence of the separator. Default: _
--pattern Glob pattern to match FASTQs. Default: *.fastq.gz
--fastqs An input file containing the sample name and absolute paths to FASTQs to process
--max_memory The maximum amount of memory (Gb) allowed to a single process. Default: 32 Gb
--cpus Number of processors made available to a single

計算リソースはかなり効率的に使われる。

f:id:kazumaxneo:20200318091613p:plain

複数サンプルある場合、サンプルごとにサブフォルダに保存されていく。

f:id:kazumaxneo:20200318091412p:plain

5Mbバクテリアx５0のデータ60サンプルの解析時間はわずか２時間40分程度だった（*1）。

ENAやSRAのシーケンシングリードを分析する。

# Single ENA/SRA Experiment 
bactopia --accession SRX000000 --dataset datasets --outdir OUTDIR 

# Multiple ENA/SRA Experiments 
bactopia search "staphylococcus aureus" > accessions.txt bactopia --accessions accessions.txt --dataset datasets --outdir ${OUTDIR}

--accession A single ENA/SRA Experiment accession to be processed

bactopia-report.html

f:id:kazumaxneo:20200317125323p:plain

f:id:kazumaxneo:20200317125353p:plain
f:id:kazumaxneo:20200317125243p:plain

bactopia-timeline.html

f:id:kazumaxneo:20200317125247p:plain

引用

Bactopia: a flexible pipeline for complete analysis of bacterial genomes

Robert A. Petit III, Timothy D. Read

Posted March 02, 2020

2020 8/13

Bactopia: a Flexible Pipeline for Complete Analysis of Bacterial Genomes
Robert A. Petit III, Timothy D. Read

mSystems. 5 (2020)

TR3990x, 128GBメモリ環境。

2020-03-16

スモールゲノムからラージゲノムまで対応した高速かつ精度の高いハイブリッドアセンブラ HASLR

2020 Preprint 高速なツール hybrid assembly human genome

　オックスフォード・ナノポア・テクノロジーズやパシフィック・バイオサイエンスなどのプラットフォームからの第三世代シーケンシング技術は、より連続したアセンブリを構築し、ゲノムを完全に再構築する道を開いた。これらのテクノロジーで生成されより長いたリード長は、短距離から中距離のリピートの課題を克服する手段を提供した。現在、正確なロングリードアセンブラは計算コストが高く、一方で高速なメソッドはそれほど正確ではない。したがって、大小のゲノムを再構築するための高速かつ正確なツールに対するニーズは未だに満たされていない。最近の第3世代シーケンスの進歩にもかかわらず、研究者は多くの分析タスクに対して第2世代のリードを生成する傾向がある。ここでは、第2世代と第3世代の両方のシーケンシングリードを使用して正確なゲノムアセンブリを効率的に生成するハイブリッドアセンブラであるHASLRを紹介する。我々（本著者ら）の実験は、HASLRが最速のアセンブラであるだけでなく、他のテスト済みのアセンブラと比較して、すべてのサンプルでミスアセンブリの数が最も少ないことも示している。さらに、連続性と精度の観点から、生成されたアセンブリは、ほとんどのサンプルで他のツールと同等である。HASLRは、https://github.com/vpc-ccg/haslrからオープンソースのツールとして利用できる。

HASLRへの入力は、ゲノムサイズの推定と、同じサンプル由来のロングリード（LR）のセットとショートリード（SR）のセットである。 HASLRは、All versus allのLRアラインメントを実行することなく、ゲノムを迅速にアセンブルする新しいアプローチを使用してアセンブリを実行する。 HASLRのコアは、最初に、効率的なSRアセンブラを使用したSRからのコンティグのアセンブルと、次に、LRを使用したゲノムのバックボーンを表すコンティグの配列を見つけることである。

ラップトップでも軽快に動作するゲノム比較ツール bindash

2019 Bioinformatics MInHash genetic distance 高速なツール

　ゲノム（メタゲノムを含む）の数は加速的に増加している。近い将来、数百万のゲノム間のペアワイズ距離を推定する必要があるかもしれない。クラウドコンピューティングを使用しても、そのような推定を実行できるソフトウェアはほとんどない。マルチスレッドソフトウェアBinDashは、典型的な個人用ラップトップのみを使用してこのような推定を実行できる。 BinDashは、既存のデータマイニング手法である最適な高密度化を使用して、bビット1順列ローリングMinHashを実装した。 BinDashは、評価によると、精度、圧縮率、メモリ使用量、実行時間の点で、最先端のソフトウェアよりも経験的に優れている。評価は、Dell Inspiron 157 559ノートブックを使用して、RefSeqのすべての細菌ゲノムに関する比較を実行した。BinDashは、https://github.com/zhaoxiaofei/BinDashのApache 2.0ライセンスに基づいてリリースされる。

インストール

ビルド依存

any C++ compiler supporting the C++11 standard
CMake version 2.6 or plus
zlib

本体　Github

git clone https://github.com/zhaoxiaofei/bindash.git
cd bindash/
mkdir release && cd release
cmake -DCMAKE_BUILD_TYPE=Release .. 
make -j

> ./bindash

# ./bindash

Usage:

./bindash <commmand> [options] [arguments ...]

Commands:

sketch: reduce mutiple genomes into one sketch.

A genome corresponds to a input sequence file.

A sketch consists of a set of output files.

dist: estimate distance (and relevant statistics) between

genomes in query sketch and genomes in target-sketch.

Query and target sketches are generated by the sketch command.

exact: estimate distance (and relevant statistics) between

genomes corresponding to input files.

Notes:

To see command-specific usage, please enter

./bindash command --help

To see version information, please enter

./bindash --version

The format for options is --NAME=VALUE

> ./bindash sketch --help

# ./bindash sketch --help

Running ./bindash commit 78e0d46-clean

Usage: ./bindash sketch [options] [arguments ...]

Arguments:

Zero or more filenames. If zero filenames, then read from each line in listfname.

Each filename specifies a path to a sequence file.

Options with [default values]:

--help : Show this help message.

--listfname: Name of the file associating consecutive sequences to genomes (including metagenomes and pangenomes).

Each line of this file has the following format:

"Path-to-a-sequence-file(F) <TAB> [genome-name(G) <TAB> number-of-consecutive-sequences(N) ...]".

If only F is provided, then use F as G and let N be the number of sequences in N [-]

--nthreads : This many threads will be spawned for processing. [1]

--minhashtype : Type of minhash.

-1 means perfect hash function for nucleotides where 5^(kmerlen) < 2^63.

0 means one hash-function with multiple min-values.

1 means multiple hash-functions and one min-value per function.

2 means one hash-function with partitionned buckets. [2]

--bbits : Number of bits kept as in b-bits minhash. [14]

--kmerlen : K-mer length used to generate minhash values. [21]

--sketchsize64 : Sketch size divided by 64, or equivalently,

the number of sets (each consisting of 64 minhash values) per genome). [32]

--isstrandpreserved : Preserve strand, which means ignore reverse complement. [false]

--iscasepreserved : Preserve case, which means the lowercase and uppercase versions of the

same letter are treated as two different letters. [false]

--randseed : Seed to provide to the hash function. [41].

--outfname : Name of the file containing sketches as output [sketch-at-time-1584151157 (time-dependent)].

Notes:

"-" (without quotes) means stdin.

For general usage, please enter

./bindash --help

The following is an example of options: --nthreads=8

> ./bindash dist --help

# ./bindash dist --help

Running ./bindash commit 78e0d46-clean

Usage: ./bindash dist [options] query-sketch [target-sketch]

Query-sketch and target-sketch: sketches used as query and target.

Sketches are generated by "./bindash sketch" (without quotes).

If target-sketch is omitted, then query-sketch is used as both query and target.

Options:

--help : Show this help message.

--ithres : If intersection(A, B) has less than this number of elements, then set the intersection to empty set so that the resulting Jaccard-index is zero. [2]

--mthres : Only results with at most this mutation distance are reported [2.5]

--nneighbors : Only this number of best-hit results per query are reported.

If this value is zero then report all. [0].

--nthreads : This many threads will be spawned for processing. [1]

--outfname : The output file comparing the query and target sketches.

The ouput file contains the following tab-separated fields per result line:

query-sketch, target-sketch, mutation-distance, p-value, and jaccard-index. [-].

--pthres : only results with at most this p-value are reported. [1.0001]

Note:

If target-sketch is omitted and --nneighbors is zero,

then distance from genome A to genome B is the same as distance from B to A.

In this case, only one record is reported per set of two genomes due to reflectivity.

"-" (without quotes) means stdout.

For general usage, please enter

./bindash --help

The following is an example of options: --mthres=0.2

実行方法

１、sketchファイルの作成

bindash sketch --outfname=genomeA.sketch genomeA.fasta
bindash sketch --outfname=genomeB.sketch genomeB.fasta
bindash sketch --outfname=genomeC.sketch genomeC.fasta

２、sketchファイルの比較

bindash dist genomeA.sketch genomeB.sketch genomeC.sketch

出力についてはGihubで確認して下さい。

引用

BinDash, software for fast genome distance estimation on a typical personal laptop
XiaoFei Zhao
Bioinformatics, Volume 35, Issue 4, 15 February 2019, Pages 671–673

2020-03-14

（small eukaryotes）ゲノムアセンブリがchromosome levelに達しているかどうかを評価する Tapestry

evaluation tool assembly 結果の視覚化 (visualization) telomere small eukaryotes G3: Genes, Genomes, Genetics 2018

2022/12/30 追記

ゲノムには、複製、転座、大きな逆位、倍数性変異などの複雑な機能が含まれている可能性があり、アセンブリやアセンブリの検証が困難になる場合がある。John Daveyが開発したTapestryと呼ばれるツールを使用すると、小さく、ほぼ完全な真核生物ゲノム（50 Mb未満、100コンティグ未満）を視覚的に検証できる。タペストリーの入力は、ゲノムアセンブリ、fastqリード、およびテロメア配列である。これらの情報からアセンブリの要約統計とHTMLレポートを生成する。

London Calling 2019

John Davey - Tapestry: assessing small eukaryotic genome assemblies with long-reads

John DaveyはTapestryを使用するデモを行い、ジャンクコンティグを削除し、ハプロタイプコンティグを特定し、最終アセンブリをfasta形式でエクスポートすることを示した。講演の締めくくりに、ナノポアシーケンスにより以前は見ることができなかったゲノムの新機能を発見できるようになったと述べた（London Calling 2019の資料より）。

インストール

依存

Linux or macOS

Python 3.6 or later
minimap2
samtools
Python packages:
biopython
intervaltree
jinja2
networkx
numpy
pandas
plumbum
pysam
scikit-learn >= 0.20
sqlalchemy
tqdm

本体　Github

#bioconda (link)
mamba create -n tapestry -y 
conda activate tapestry 
mamba install -c conda-forge -c bioconda -y tapestry

> weave

$ weave

usage: weave [-h] -a ASSEMBLY -r READS [-d DEPTH] [-l LENGTH]

[-t TELOMERE [TELOMERE ...]] [-w WINDOWSIZE] [-n] [-o OUTPUT]

[-c CORES] [-v]

weave: assess quality of one genome assembly

optional arguments:

-h, --help show this help message and exit

-a ASSEMBLY, --assembly ASSEMBLY

filename of assembly in FASTA format (required)

-r READS, --reads READS

filename of long reads in FASTQ format (required; must

be gzipped)

-d DEPTH, --depth DEPTH

genome coverage to subsample from FASTQ file (default

50)

-l LENGTH, --length LENGTH

minimum read length to retain when subsampling

(default 10000)

-t TELOMERE [TELOMERE ...], --telomere TELOMERE [TELOMERE ...]

telomere sequence to search for

-w WINDOWSIZE, --windowsize WINDOWSIZE

window size for ploidy calculations (default 10000)

-n, --noreadoutput do not output read alignments in report (default

False)

-o OUTPUT, --output OUTPUT

directory to write output, default weave_output

-c CORES, --cores CORES

number of parallel cores to use (default 1)

-v, --version report version number and exit

> clean -h

$ clean -h

usage: clean [-h] -a ASSEMBLY -c CSV [-o OUTPUT]

clean: filter and order assembly from list of contigs

optional arguments:

-h, --help show this help message and exit

-a ASSEMBLY, --assembly ASSEMBLY

filename of assembly in FASTA format

-c CSV, --csv CSV Tapestry CSV output

-o OUTPUT, --output OUTPUT

filename of output contigs, default

filtered_assembly.fasta

実行方法

手順や結果の解釈については上の18分程度のプレゼンテーション動画で全て説明されています。わかりやすく説明しているので是非ご覧になって下さい。ここではコマンドの手順だけ記載します。

１、コンティグ、シーケンスリード、そしてテロメア配列を指定してアセンブリレポートを出力する。

weave -a assembly.fasta -r reads.fastq.gz -t TGATGA -o assembly -c 8

ランが終わるとhtmlのレポートができる。動画の手順でフィルタリングし、右上のボタンからCSVファイルをダウンロードする。

f:id:kazumaxneo:20200311103009p:plain

contigの緑色の濃さは、リードカバレッジに基づいた各領域の倍数性の推定値を反映している。テロメアが見つかった場合、各コンティグ末端に赤い丸として表示される。赤色の円の不濃さは見つかったテロメアの数を反映している。

２、ダウンロードしたCSVとwaeveで使ったアセンブリのFASTAファイルを指定、以下のコマンドを実行する。

clean -a assembly.fasta -c assembly_filtered.csv

選択した配列のみ含むアセンブリ配列のmulti-fastaが出力される。

出力の詳細はGithubで確認して下さい。詳しく説明されています。

引用

GitHub - johnomics/tapestry: Validate and edit small eukaryotic genome assemblies

Chromosomal assembly of the nuclear genome of the endosymbiont-bearing trypanosomatid Angomonas deanei
John W Davey, Carolina M C Catta-Preta, Sally James, Sarah Forrester, Maria Cristina M Motta, Peter D Ashton, Jeremy C Mottram
G3 Genes|Genomes|Genetics, Volume 11, Issue 1, January 2021, jkaa018

2020-03-13

時系列データなどの連続データを視覚化するwebサービス PlotTwist

time series dataset 結果の視覚化 (visualization) web tool 2020 PLOS Biology heatmap shiny

2020 3/13 誤字修正、補足説明追加

　要約だけでなく、個々の測定からの実際のデータを表示することで、実験結果の理解が深まり、より完全な全体像が得られることは長年にわたって強調されてきた[ref.1-3]。これは、ダイナミクスを研究するための時間依存データと同様に、静的条件下で取得したデータにも当てはまる[ref.4]。静的な離散条件（関連wiki）の場合、ドットプロットはデータをグラフィカルに描写するのに適した方法である。実際のデータを表示するオプションを含むいくつかのオンラインデータ視覚化ツールが報告されている[ref.4–8]。

　連続データを視覚化するツールが報告されている[ref.9,10]が、これらには一般的な目的があり、実験データを表示するためによく使用されるいくつかの機能がない。科学実験からの連続データを視覚化するための専用のオンラインツールは（著者らの知る限り）不足している。これは驚くべきことである。特に生物学に関連する動的システムを理解するには、時間をかけて実験する必要があるためである。さらに、時系列実験はhigh contentになる可能性があり[ref.11]、複数の異なる摂動（wiki）の適用が含まれる場合がある。時系列データの複雑さは、データを迅速に検査し、さまざまな視覚化にアクセスするためのツールを必要とする。その次に、実験中に適用された治療法/摂動を明確に伝えることが重要である。タイムラプス実験に加えて、他の多くの実験戦略が連続データを提供する。例には、長さ、濃度、または波長の測定が含まれる。これらのタイプのデータは、測定された変数を定量的で連続的なスケールで視覚化するプロットで描かれることがよくある。

　理想的には、連続データ用のデータ視覚化ツールが無料で利用可能であり、オープンソースであり、出版準備の整った高品質のグラフを生成する。個別データ用のこのようなWebツールを以前に報告し[ref.6]、コミュニティからの熱烈な反応を与えられて、連続データ用のWebツールを生成するように動機付けられた。 Webツールの焦点は時系列データにあるが、通常は連続的なデータに使用できる。

　生成した新しいWebツールはPlotTwistと呼ばれ、設定された時間の条件のインジケーターを使用したタイムラプス実験からのデータをプロットする。この無料のオープンソースアプリは、連続データの迅速な視覚化と注釈付けを容易にするために開発された。 PlotTwistでグラフを作成するにはコーディングスキルは必要ない。ggplot2パッケージにより最先端のデータ視覚化を生成する。したがって、PlotTwistはデータの検査と出版品質の視覚化の生成に適していると考えられる。

Do you have some continuous data you want to plot?
Give this a try:https://t.co/4I34lr2nKG https://t.co/54lbPgRC2T
— Joαchim Goedhαrt (@joachimgoedhart) 2020年3月6日

I want to compare #PlotTwist with other software out there. What's your fav. software for plotting data from time-series?
(I used KaleidaGraph in the past ⬇️; https://t.co/3VD9sH3DKz) pic.twitter.com/3SU1BifnSi
— Joαchim Goedhαrt (@joachimgoedhart) 2019年9月2日

Converting excellent spreadsheets to tidy data

https://thenode.biologists.com/converting-excellent-spreadsheets-tidy-data/education/

Github

データの準備

PlotTwistで使用可能なデータは、カンマ区切りのCSVかはexcelのXLS形式のファイルである。いずれの場合も、フォーマットはwide data formatかtidy data formatで記述されていなければならない。この2つのフォーマットについて簡単に解説する。

１、wide data format

表１はwide data formatの例である。時系列データをexcelなどの表計算ソフトでまとめる場合、作図のしやすさもあってこのwide data format形式（スプレッドシート形式）でまとめることが多い。いわゆるヒトにとって見やすいデータ構造である。このwide data format形式のファイルをPlotTwistのサイトにアップロードした場合、1行目がヘッダーとして機能し、列の名前に使用される（表１ではcell_01からcell_03）。１列目の2行目以降は時間として働く。1行目のサンプル/条件名にスペースが含まれている場合、アンダースコアに変換されて読み込まれる。

表１、wide data format

f:id:kazumaxneo:20200312203448p:plain

（表１はPlotTwistのwide data formatのデモデータより転載）

データがwide data formatでPlotTwistにアップロードされた場合、データはtidy data formatに変換され、条件「id」を識別する列が追加される。この変換はユーザーには表示されないが、ggplot2でのプロットには必要である。

２、tidy data format

表2はtidy data format（整然データフォーマット）の例である。tidy data formatでは、時間（Time）、各サンプル/条件の識別子（sample）、測定値（Value）、および条件（id）をそれぞれの列に記載する。このファイルをPlotTwistにアップロード後、どの列がどの要素なのかを指定していく。このtidy data formatはggplot2で作画する際にお馴染みのデータフォーマットである。

表２、tidy data format

f:id:kazumaxneo:20200312213038p:plain

　tidy data formatではサンプルの時間ごとに繰り返し値をプロットしているため、縦に冗長で人の目にとっては見にくいデータ構造にも見えますが、観測値と変数が行列に1つずつ記載されている"整然な"データ構造です。欠損値があった場合にもエラーが起きなくなり、データ分析において堅牢なデータ構造と言えます（わかりやすい解説）（もう１つ；Tidy data）。

　論文図1にwide formatとtidy data formatの説明があります。確認して下さい ( link) 。また、複数のワイドデータセットが同時に（CSVまたはXLS形式で）アップロードされる場合、これらはそれぞれ異なる条件として扱われます。論文には補足ムービー（S1 movie）があり、こちらには、複数ファイルのアップロードがどのように行われるかを示されています。合わせて確認して下さい（movie S1 ダイレクトリンク）。

webサービス

https://huygens.science.uva.nl/PlotTwist/ にアクセスする。

Type of fileを選んでからCSV、またはexcelのXLS形式ファイルをアップロードする。またはウィンドウ内にペーストする。

f:id:kazumaxneo:20200313000411p:plain

表が読み込まれる。

f:id:kazumaxneo:20200313001146p:plain

読み込まれるデータがtydy data formatの場合は、These data are Tydyにチェックを付ける。

f:id:kazumaxneo:20200313001248p:plain

上でも説明したように、時間の列、観測値の列などを指定していく。

f:id:kazumaxneo:20200313002436p:plain

Data normalizationボタンにチェックを付けると、比較を容易にするためのデータ正規化オプションが利用できる。

f:id:kazumaxneo:20200313000243p:plain

値の最大/最小値での除算、ユーザーが指定したベースライン値での除算、Z-score、差をベースラインで除算し、最大値で除算して0〜1のスケールなどできる。

正規化後のデータは左下のボタンからダウンロードできる。

f:id:kazumaxneo:20200313110415p:plain

プロット
Plotのタブに移るとデータが可視化される。デフォルトのグラフでは、x軸に沿って時間が表示され、y軸に測定値が表示される。

f:id:kazumaxneo:20200313110547p:plain

こちらはwide data formatの例になる。

Data as でフォーマットの切り替えができる。　初期はLines（折れ線グラフ）になっており、各サンプルのすべての座標が線で結んで表示される。

f:id:kazumaxneo:20200313110721p:plain

Dotsに変更

f:id:kazumaxneo:20200313110723p:plain

Heatmapに変更

f:id:kazumaxneo:20200313110728p:plain

Small multiplesに変更すると系列ごとに異なるグラフで表示される。

f:id:kazumaxneo:20200313111257p:plain

LInesに戻してThe plot thickensにチェックをつけた。

f:id:kazumaxneo:20200313111357p:plain

show the meanにチェックをつけた。

f:id:kazumaxneo:20200313111114p:plain

show the meanを外してShow the 95% CIにチェックをつけた。

f:id:kazumaxneo:20200313111155p:plain

平均値および/または平均の95％信頼区間(Confidence interval, CI)は、個々のサンプルの線の上に透明なレイヤーとして描画される。95％の信頼区間により、「視覚的推論」による差異条件の比較が可能になる（論文より）。

x軸を0-300にした。

f:id:kazumaxneo:20200313112251p:plain

y軸を0-2にした。

f:id:kazumaxneo:20200313112531p:plain

log scale

f:id:kazumaxneo:20200313112639p:plain

Remove gridlines

f:id:kazumaxneo:20200313112731p:plain

Use color for the data

f:id:kazumaxneo:20200313112804p:plain

Use color for the stats

f:id:kazumaxneo:20200313112827p:plain

LInesに切り替え、Use color for the dataにチェックを付け、色はTol; mutedにした。さらにVisibility of the dataのゲージを動かして線を0.8まで太くした。Visibility of the statisticsは１にした。

f:id:kazumaxneo:20200313113102p:plain

さらにUse color for the statsにチェックを付け、色はTol; brightにした。

f:id:kazumaxneo:20200313113242p:plain

色覚異常に優しいカラーパレットも利用できる。パレットは7〜10色で構成され、その正確な構成は論文図3Aに示されている。複数の条件のデータが提供されている場合、各条件の統計は異なる色で表示できるようになっている（論文より）。

HeightとWidthをそれぞれ720と900にした。

f:id:kazumaxneo:20200313113410p:plain

タイトルをつけ、x軸とy軸の名前を変更した。さらに軸のフォントサイズを大きくした。

f:id:kazumaxneo:20200313114034p:plain

Add legendにチェックを付け、さらにLegend titleのウィンドウに書き込んでレジェンド名を付けた。

f:id:kazumaxneo:20200313114239p:plain

Add legendのチェックを外し、Add labels to objectsにチェックを付けた。

f:id:kazumaxneo:20200313114413p:plain

クラスタリング

プロットタブの右にあるクラスタリングタブでは、類似性に基づいてデータがグループ化されて視覚化される。時系列データでのクラスタリングの適用は比較的新しいため、どのクラスタリング手法が最良の生物学的洞察を提供するかは不明だが、クラスター分析を容易にするため、PlotTwistにはいくつかの教師なしの方法が実装されている（論文より）。

初期条件ではユークリッド距離でのウォード法になっている。

f:id:kazumaxneo:20200313115320p:plain

非階層的なk-meansなども利用できる。

クラスター分析の重要な１つの側面は、適切な数のクラスターを見つけることである。クラスタリング方法とクラスター番号に関するガイダンスを提供するために、PlotTwist ではクラスター検証インデックス（CVI）を計算できるようになっている[ref.13]。特定の数のクラスターの数を返す3つの異なるCVIが実装されている（論文より）。

数値が大きいほど、クラスタリングが向上する。 CVIはデータを要約する統計であり、それに応じて処理する必要があることに注意する。

ユークリッド距離でのウォード法にするクラスタリングでは、Calinski Harabasz indexは9.93になった。

f:id:kazumaxneo:20200313121853p:plain

CVIに応じて最適なクラスターの数は、基礎となる生物学を必ずしも反映していない（詳細は論文を読んで下さい）。

クラスター数を2にから６にして、widesを1200にした。

f:id:kazumaxneo:20200313115943p:plain

Show contributions per conditionにチェックを付けてそれぞれのクラスターへの分配比率を表示した。

f:id:kazumaxneo:20200313120118p:plain

作成した図は左上のボタンからPDF、またはPNGとしてダウンロードできる。

f:id:kazumaxneo:20200313122501p:plain

引用

PlotTwist: A web app for plotting and annotating continuous data
Joachim Goedhart

PLoS Biol. 2020 Jan 13;18(1):e3000581

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

臨床環境の病原性バクテリアを素早くジェノタイピングする biohansel

タンデムリピートなどのミスアセンブリを分析する TandemQUAST

計算リソースを効率的に使って多数のよく似たバクテリアゲノムを素早く分析する自動化されたパイプライン Bactopia

スモールゲノムからラージゲノムまで対応した高速かつ精度の高いハイブリッドアセンブラ HASLR

ラップトップでも軽快に動作するゲノム比較ツール bindash

（small eukaryotes）ゲノムアセンブリがchromosome levelに達しているかどうかを評価する Tapestry

時系列データなどの連続データを視覚化するwebサービス PlotTwist