ntSynt-viz: visualizing synteny patterns across multiple large genomes with a single command

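In recent years the number of chromosome-scale genome assemblies has grown explosively, greatly expanding what comparative genomics can do by detecting synteny across multiple genomes. Existing tools can detect synteny blocks among multiple genomes, but their output is text-based, which makes it hard to explore large-scale synteny patterns intuitively. To gain important biological insights from the synteny block data these utilities produce, an interpretable, informative, and easy-to-use synteny visualization tool is essential. Here the authors introduce ntSynt-viz, a command-line tool that automatically sorts, normalizes, and plots synteny blocks from multiple genomes. They show that, when assessing synteny for 14 human and 9 hoverfly genomes, ntSynt-viz produces clearer and more interpretable chromosome-painting ribbon plots than the state-of-the-art tool NGenomeSyn. The authors expect ntSynt-viz to provide important insights into large-scale synteny patterns between divergent genomes, and thereby to advance research on key evolutionary questions. ntSynt-viz is freely available on GitHub (https://github.com/bcgsc/ntsynt-viz).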

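With a single command, ntSynt-viz uses synteny block mappings to provide several important new features: 1) ordering of the input chromosomes based on structural similarity, 2) strand normalization of the input chromosomes relative to a target genome, and 3) top-to-bottom ordering of the genomes using synteny-based distance estimates. In addition, it uses gggenomes to combine chromosome-painting-inspired colouring with ribbon plots, which further improves the interpretability of the output image. ntSynt-viz works directly on synteny blocks computed by ntSynt, but it can also handle synteny blocks from other tools. The authors expect that the aesthetics ntSynt-viz offers will enable rich biological insights into the evolutionary history of diverse species and populations.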
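To make feature 2 (strand normalization) more concrete, here is a minimal sketch of one way such a decision could be made. It is not taken from the ntSynt-viz source: the function name `chromosomes_to_flip`, the tuple layout, and the "flip when most of the aligned length is reversed" rule are all assumptions for illustration.

```python
from collections import defaultdict

def chromosomes_to_flip(blocks, target):
    """Return the (genome, chrom) pairs that are mostly reversed relative to target.

    blocks: list of (block_id, genome, chrom, start, end, strand) tuples
            (hypothetical layout, not the tool's internal representation).
    """
    # Strand of each synteny block in the target genome
    target_strand = {b[0]: b[5] for b in blocks if b[1] == target}
    # Signed length per chromosome: positive if oriented like the target, negative otherwise
    score = defaultdict(int)
    for block_id, genome, chrom, start, end, strand in blocks:
        if genome == target or block_id not in target_strand:
            continue
        length = end - start
        same_orientation = strand == target_strand[block_id]
        score[(genome, chrom)] += length if same_orientation else -length
    return {key for key, value in score.items() if value < 0}

# Made-up blocks: chr1 of genome B is mostly reversed relative to genome A
blocks = [
    (1, "A", "chr1", 0, 100, "+"), (1, "B", "chr1", 0, 100, "-"),
    (2, "A", "chr1", 100, 150, "+"), (2, "B", "chr1", 100, 150, "+"),
    (3, "A", "chr2", 0, 80, "+"), (3, "B", "chr2", 0, 80, "+"),
]
flipped = chromosomes_to_flip(blocks, "A")
```

In this toy input, 100 bases of B's chr1 are reversed and 50 are not, so only that chromosome would be reverse-complemented before plotting.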

wiki

https://github.com/bcgsc/ntSynt-viz/wiki

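Installation

The tool is a mix of R packages, Python packages, snakemake, and so on. To keep it from getting tangled up with the system R path, I created a conda virtual environment together with Python and then installed R into it with conda, separately from the system R, which avoids dependency conflicts (running conda deactivate leaves the environment, so the system is not affected; the repository instructions themselves do not set up a virtual environment).

Dependencies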

python 3.8+
intervaltree
snakemake
quicktree
R packages:
gggenomes
treeio
ggpubr
ggtree
tidytree
phytools
dplyr
tidyr
argparse
scales
stringr

GitHub

mamba create -n ntSynt-viz python=3.11 -y
conda activate ntSynt-viz
# R (4.4.2 was installed when no version was specified)
mamba install -c conda-forge r-base -y
which R

# Remaining dependencies from conda-forge and bioconda
mamba install --yes -c conda-forge -c bioconda quicktree snakemake intervaltree  bioconductor-treeio r-ggpubr bioconductor-ggtree r-phytools r-dplyr r-argparse r-scales r-stringr

# R package that could not be installed with conda (installed from CRAN here)
R -e 'install.packages(c("gggenomes"), repos = "https://cran.r-project.org")'

# ntSynt-viz itself
wget https://github.com/bcgsc/ntSynt-viz/releases/download/v1.0.0/ntSynt-viz-1.0.0.tar.gz
tar xvzf ntSynt-viz-1.0.0.tar.gz
rm ntSynt-viz-1.0.0.tar.gz
export PATH=${PWD}/ntSynt-viz-1.0.0/bin:$PATH
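
After the steps above, a quick sanity check is to confirm that the expected executables are visible on PATH inside the activated environment. The following is only a sketch: the list of executable names is my assumption based on the dependency list, and `missing_tools` is a hypothetical helper, not part of ntSynt-viz.

```python
import shutil

# Executables the installation steps are expected to put on PATH (assumed names)
REQUIRED = ["ntsynt_viz.py", "snakemake", "quicktree", "Rscript"]

def missing_tools(tools=REQUIRED, which=shutil.which):
    """Return the tools from the list that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    print("missing:", ", ".join(missing) if missing else "none")
```

The `which` parameter is injectable only so the helper can be exercised without the tools installed.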

> ntsynt_viz.py -h

usage: ntsynt_viz.py [-h] --blocks BLOCKS --fais FAIS [FAIS ...] [--name_conversion NAME_CONVERSION] [--tree TREE] [--target-genome TARGET_GENOME] [--normalize] [--indel INDEL] [--length LENGTH] [--seq_length SEQ_LENGTH] [--keep KEEP [KEEP ...]] [--centromeres CENTROMERES] [--haplotypes HAPLOTYPES] [--prefix PREFIX] [--format {png,pdf}] [--scale SCALE] [--height HEIGHT] [--width WIDTH] [--no-arrow] [--ribbon_adjust RIBBON_ADJUST] [-f] [-n] [-v]

Visualizing multi-genome synteny

options:
  -h, --help
      show this help message and exit

required arguments:
  --blocks BLOCKS
      ntSynt-formatted synteny blocks TSV
  --fais FAIS [FAIS ...]
      FAI files for all input genomes. Can be a list or a file with one FAI path per line.

main plot formatting arguments:
  --name_conversion NAME_CONVERSION
      TSV for converting names in the blocks TSV (old -> new). IMPORTANT: new names cannot have spaces. If you want to have spaces in the final ribbon plot, use the special character '_'. All underscores in the new name will be converted to spaces.
  --tree TREE
      User-input tree file in newick format. If specified, this tree will be plotted next to the output ribbon plot, and used for ordering the assemblies. The names in the newick file must match the new names if --name_conversion is specified, or the genome file names in the synteny blocks input file otherwise. If not specified, the synteny blocks will be used to estimate pairwise distances for the genome ordering and associated tree.
  --target-genome TARGET_GENOME
      Target genome. If specified, this genome will be at the top of the ribbon plot, with ribbons coloured based on its chromosomes and (if applicable) other chromosomes normalized to it. If not specified, the top genome will be arbitrary.
  --normalize
      Normalize strand of chromosomes relative to the target (top) genome in the ribbon plots
  --centromeres CENTROMERES
      TSV file with centromere positions. Must have the headers: bin_id,seq_id,start,end. bin_id must match the new names from --name_conversion or the genome names if --name_conversion is not specified. seq_id is the chromosome name.
  --haplotypes HAPLOTYPES
      File listing haplotype assembly names: TSV, maternal/paternal assembly file names separated by tabs.
  --no-arrow
      Only used with --normalize; do not draw arrows indicating reverse-complementation

block filtering arguments:
  --indel INDEL
      Indel size threshold [50000]
  --length LENGTH
      Minimum synteny block length [100000]
  --seq_length SEQ_LENGTH
      Minimum sequence length [500000]
  --keep KEEP [KEEP ...]
      List of genome_name:chromosome to show in visualization. All chromosomes with links to the specified chromosomes will also be shown.

output arguments:
  --prefix PREFIX
      Prefix for output files [ntSynt-viz_ribbon-plot]
  --format {png,pdf}
      Output format of plot [png]
  --scale SCALE
      Length of scale bar in bases [100e6]
  --height HEIGHT
      Height of plot in cm [20]
  --width WIDTH
      Width of plot in cm [50]
  --ribbon_adjust RIBBON_ADJUST
      Ratio for adjusting spacing beside ribbon plot. Increase if ribbon plot labels are cut off, and decrease to reduce the white space to the left of the ribbon plot [0.1]

execution arguments:
  -f, --force
      Force a re-run of the entire pipeline
  -n
      Dry-run for snakemake pipeline
  -v, --version
      show program's version number and exit

Usage

Running the tool requires a synteny block file in ntSynt format.

ntSynt synteny_blocks.tsv

https://github.com/bcgsc/ntsynt?tab=readme-ov-file#output-files

1. Running ntSynt (for reference)

ntSynt -d 5 assembly1.fa assembly2.fa assembly3.fa

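This writes <prefix>.synteny_blocks.tsv.

A demo .synteny_blocks.tsv file is provided in the repository.

The file has eight columns. A TSV produced by a tool other than ntSynt can also be used, as long as it follows the same format.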
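
For anyone preparing such a TSV from another tool, a small reader can help verify the eight-column layout. This is a sketch under assumptions: the column names and their order below are my reading of the ntSynt README and should be checked against it, the file is assumed to have no header line, and the example rows are made up.

```python
import csv
import io

# Assumed column order of the synteny blocks TSV (verify against the ntSynt README)
COLUMNS = ["block_id", "genome", "chrom", "start", "end",
           "strand", "num_minimizers", "discontinuity"]

def read_blocks(handle):
    """Yield one dict per row of an eight-column, headerless synteny blocks TSV."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(row)}")
        record = dict(zip(COLUMNS, row))
        record["start"], record["end"] = int(record["start"]), int(record["end"])
        yield record

# Made-up rows for illustration only
example = io.StringIO(
    "1\tgenomeA.fa\tchr1\t0\t250000\t+\t120\tNA\n"
    "1\tgenomeB.fa\tchr1\t1000\t240000\t-\t120\tNA\n"
)
blocks = list(read_blocks(example))
lengths = [b["end"] - b["start"] for b in blocks]
```

Block lengths computed this way can be compared with the --length threshold (default 100000) to see which blocks would survive filtering.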

2. Running ntsynt_viz

You need the blocks TSV and a file listing the paths to the FAI files, one path per line (each FAI file can be created with samtools faidx genome.fasta).

fais.tsv

Once everything is ready, run it.

ntsynt_viz.py --blocks ntSynt.synteny_blocks.tsv --fais fais.tsv --prefix prefix
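
If there are many genomes, the FAI list and the command line can be generated rather than typed. The sketch below only uses the flags shown above (--blocks, --fais, --prefix); the helper names and the assembly file names are hypothetical.

```python
import shlex

def fai_list_text(fai_paths):
    """Return the contents of a --fais list file: one FAI path per line."""
    return "".join(f"{path}\n" for path in fai_paths)

def build_command(blocks, fais, prefix):
    """Return the ntsynt_viz.py invocation as an argument list."""
    return ["ntsynt_viz.py", "--blocks", str(blocks),
            "--fais", str(fais), "--prefix", prefix]

# Hypothetical FAI files, e.g. produced by `samtools faidx` on each assembly
fais = ["assembly1.fa.fai", "assembly2.fa.fai", "assembly3.fa.fai"]
list_text = fai_list_text(fais)  # write this to fais.tsv before running
cmd = build_command("ntSynt.synteny_blocks.tsv", "fais.tsv", "prefix")
print(shlex.join(cmd))
```

Keeping the command as a list makes it easy to pass to `subprocess.run(cmd, check=True)` without shell-quoting problems.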

Test run

If a tree file in Newick format is given with the --tree option, the tree is plotted next to the ribbon plot using ggtree.

cd ntSynt-viz-1.0.0/tests/
ntsynt_viz.py --blocks great-apes.ntSynt.synteny_blocks.tsv --fais fais.tsv --tree great-apes.mt-tree.nwk --name_conversion great-apes.name-conversions.tsv --normalize --prefix great-apes_ribbon-plots --ribbon_adjust 0.14 --scale 1e9

--blocks ntSynt-formatted synteny blocks TSV
--fais FAI files for all input genomes. Can be a list or a file with one FAI path per line.
--name_conversion TSV for converting names in the blocks TSV (old -> new). IMPORTANT: new names cannot have spaces. If you want to have spaces in the final ribbon plot, use the special character '_'. All underscores in the new name will be converted to spaces.
--tree User-input tree file in newick format. If specified, this tree will be plotted next to the output ribbon plot, and used for ordering the assemblies. The names in the newick file must match the new names if --name_conversion is specified, or the genome file names in the synteny blocks input file otherwise. If not specified, the synteny blocks will be used to estimate pairwise distances for the genome ordering and associated tree.
--normalize Normalize strand of chromosomes relative to the target (top) genome in the ribbon plots
--prefix Prefix for output files [ntSynt-viz_ribbon-plot]
--ribbon_adjust Ratio for adjusting spacing beside ribbon plot. Increase if ribbon plot labels are cut off, and decrease to reduce the white space to the left of the ribbon plot [0.1]
--scale Length of scale bar in bases [100e6]
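
The --name_conversion rules (no spaces in new names, underscores shown as spaces) can be checked before running the pipeline. This is a sketch, assuming a two-column old/new layout for the TSV, which is how I read "old -> new" in the help text; the helper names and the assembly file names are hypothetical.

```python
def name_conversion_tsv(mapping):
    """Build a two-column (old, new) TSV; new names must not contain spaces."""
    lines = []
    for old, new in mapping.items():
        if " " in new:
            raise ValueError(f"new name contains a space: {new!r}")
        lines.append(f"{old}\t{new}")
    return "\n".join(lines) + "\n"

def plot_label(new_name):
    """Label as described in the help text: underscores become spaces."""
    return new_name.replace("_", " ")

# Hypothetical assembly file names mapped to display names
mapping = {"human.fa": "Homo_sapiens", "chimp.fa": "Pan_troglodytes"}
tsv_text = name_conversion_tsv(mapping)
labels = [plot_label(name) for name in mapping.values()]
```

Note that names passed to --target-genome or used in a --tree file should be the underscore form, not the label with spaces.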

The snakemake pipeline finished in under a minute.

Output file

great-apes_ribbon-plots_ribbon-plot_tree.png

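The ribbons are coloured according to the chromosomes of the top (target) genome in the plot. The chromosomes themselves are coloured by painting each chromosome segment covered by a synteny block in the same colour scheme as the ribbons. Regions of a chromosome that are not covered by synteny blocks are shown in grey (from the paper).

Creating a ribbon plot without an input cladogram, skipping the chromosome strand normalization step, and specifying the target genome (top position):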

ntsynt_viz.py --blocks great-apes.ntSynt.synteny_blocks.tsv --fais fais.tsv  --name_conversion great-apes.name-conversions.tsv  --prefix great-apes_ribbon-plots_no-tree --ribbon_adjust 0.15 --scale 1e9 --target-genome Homo_sapiens

Output file

great-apes_ribbon-plots_no-tree_ribbon-plot.png

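From the paper

A BED-like file with gap or centromere coordinates can be provided; these regions are plotted as black segments on the chromosomes (--centromeres).
Plot size (--width, --height), scale bar length (--scale), output format (--format png/pdf), minimum length of chromosomes to display (--seq_length), conversion from file names to display names on the plot (--name_conversion), the subset of chromosomes to plot (--keep), and minimum synteny block length (--length) can all be customized.
When different haplotypes of a genome are compared in the ribbon plot, use --haplotypes to specify their relationships.
All ntSynt-viz parameters and input file formats are described in Supplementary Table S1.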
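
As an illustration of the --centromeres input, the help text says the file is a TSV with the headers bin_id, seq_id, start, end. The sketch below renders such a table; the helper name is hypothetical, the coordinates are made up, and Supplementary Table S1 remains the authoritative format description.

```python
import csv
import io

HEADERS = ["bin_id", "seq_id", "start", "end"]  # headers named in the help text

def centromeres_tsv(regions):
    """Render (bin_id, seq_id, start, end) tuples as a tab-separated table."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADERS)
    for bin_id, seq_id, start, end in regions:
        if not 0 <= start < end:
            raise ValueError(f"bad interval for {bin_id}:{seq_id}: {start}-{end}")
        writer.writerow([bin_id, seq_id, start, end])
    return out.getvalue()

# Made-up coordinates, for illustration only
centromere_text = centromeres_tsv(
    [("Homo_sapiens", "chr1", 121_700_000, 125_100_000)]
)
```

Here bin_id uses the converted genome name, as the help text requires when --name_conversion is given.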

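With ntSynt, synteny visualization relied on external tools such as gggenomes and the workflow was cumbersome; ntSynt-viz manages the process with snakemake and automates it, so visualization takes just two commands (the ntSynt run and ntSynt-viz). As the introduction of the paper also emphasizes, I feel this has become a very user-friendly pipeline.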

Citation

ntSynt-viz: Visualizing synteny patterns across multiple genomes

Lauren Coombe, Rene L Warren, Inanc Birol

bioRxiv, Posted January 16, 2025.