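Pangenome graphs provide a complete representation of the mutual alignment of a collection of genomes. This model makes it possible to study the whole genomic diversity of a population, including structurally complex regions. However, analyzing hundreds of gigabase-scale genomes with pangenome graphs is difficult because existing tools do not support it well. Fast, versatile software is therefore needed to ask advanced questions of such data efficiently.
ODGI is a new suite of tools that implements scalable algorithms and represents DNA pangenome graphs efficiently in memory in the form of variation graphs. ODGI supports pre-built graphs in the Graphical Fragment Assembly format. It includes tools for detecting complex regions, extracting pangenomic loci, removing artifacts, exploratory analysis, manipulation, validation, and visualization. Its fast parallel execution facilitates routine pangenomic tasks as well as pipelines that can quickly answer complex biological questions about gigabase-scale pangenome graphs.
ODGI is published as free software under the MIT open source license. The source code can be downloaded from https://github.com/pangenome/odgi and the documentation is available at https://odgi.readthedocs.io. ODGI can be installed via Bioconda https://bioconda.github.io/recipes/odgi/README.html or GNU Guix https://github.com/pangenome/odgi/blob/master/guix.scm.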
Documentation
Welcome to the odgi documentation! — odgi c522690 documentation
Quick Start
https://odgi.readthedocs.io/en/latest/rst/quick_start.html#quick-start
Practical Graphical Pangenomics
Installation
mamba install -c bioconda odgi -y
#docker
docker pull pangenome/odgi
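After installing, it can be handy to confirm from a script that the binary is reachable. The following is a minimal sketch, assuming `odgi` is on `PATH`; the helper name `odgi_version` is hypothetical, and the only odgi call used is the `version` subcommand listed in the command overview.

```python
import shutil
import subprocess


def odgi_version(executable="odgi"):
    """Return the version string reported by `odgi version`, or None if not installed."""
    path = shutil.which(executable)  # look the binary up on PATH
    if path is None:
        return None
    result = subprocess.run(
        [path, "version"], capture_output=True, text=True, check=False
    )
    # Fall back to stderr in case the build reports the version there.
    return (result.stdout or result.stderr).strip()


version = odgi_version()
print("odgi not found on PATH" if version is None else f"odgi version: {version}")
```

If the helper returns `None`, the conda environment is probably not activated, or the tool is only available inside the Docker image.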
> odgi
odgi: optimized dynamic genome/graph implementation, version v0.8.0
usage: odgi <command> [options]
Overview of available commands:
-- bin Binning of pangenome sequence and path information in the graph.
-- break Break cycles in the graph and drop its paths.
-- build Construct a dynamic succinct variation graph in ODGI format from a GFAv1.
-- chop Divide nodes into smaller pieces preserving node topology and order.
-- cover Cover the graph with paths.
-- crush Crush runs of N.
-- degree Describe the graph in terms of node degree.
-- depth Find the depth of a graph as defined by query criteria.
-- draw Draw previously-determined 2D layouts of the graph with diverse annotations.
-- explode Breaks a graph into connected components storing each component in its own file.
-- extract Extract subgraphs or parts of a graph defined by query criteria.
-- flatten Generate linearizations of a graph.
-- flip Flip path orientations to match the graph.
-- groom Harmonize node orientations.
-- heaps Path pangenome coverage permutations.
-- inject Inject BED annotations as paths.
-- kmers Display and characterize the kmer space of a graph.
-- layout Establish 2D layouts of the graph using path-guided stochastic gradient descent.
-- matrix Write the graph topology in sparse matrix format.
-- normalize Compact unitigs and simplify redundant furcations.
-- overlap Find the paths touched by given input paths.
-- panpos Get the pangenome position of a given path and nucleotide position (1-based).
-- pathindex Create a path index for a given graph.
-- paths Interrogate the embedded paths of a graph.
-- pav Presence/absence variants (PAVs).
-- position Find, translate, and liftover graph and path positions between graphs.
-- priv Differentially private sampling of graph subpaths.
-- procbed Procrustes-BED: adjust BED to match subpaths in graph.
-- prune Remove parts of the graph.
-- server Start a basic HTTP server to lift coordinates between path and pangenomic positions.
-- sort Apply different kind of sorting algorithms to a graph.
-- squeeze Squeezes multiple graphs in ODGI format into the same file in ODGI format.
-- stats Metrics describing a variation graph and its path relationship.
-- stepindex Generate a step index and access the position of each step of each path once.
-- tips Identifying break point positions relative to given references.
-- unchop Merge unitigs into a single node preserving the node order.
-- unitig Output unitigs of the graph.
-- untangle Project paths into reference-relative, to decompose paralogy relationships.
-- validate Validate a graph checking if the paths are consistent with the graph topology.
-- version Print the version of ODGI to stdout.
-- view Project a graph into other formats.
-- viz Visualize a variation graph in 1D.
> odgi build
odgi build {OPTIONS}
Construct a dynamic succinct variation graph in ODGI format from a GFAv1.
OPTIONS:
[ MANDATORY OPTIONS ]
-g[FILE], --gfa=[FILE] GFAv1 FILE containing the nodes, edges
and paths to build a dynamic succinct
variation graph from.
-o[FILE], --out=[FILE] Write the dynamic succinct variation
graph to this *FILE*. A file ending
with *.og* is recommended.
[ Graph Sorting ]
-O, --optimize Compact the graph id space into a
dense integer range.
-s, --sort Apply a general topological sort to
the graph and order the node ids
accordingly. A bidirected adaptation
of Kahn’s topological sort (1962) is
used, which can handle components with
no heads or tails. Here, both heads
and tails are taken into account.
[ Threading ]
-t[N], --threads=[N] Number of threads to use for parallel
operations.
[ Processing Information ]
-P, --progress Write the current progress to stderr.
-d, --debug Verbosely print graph information to
stderr. This includes the maximum
node_id, the minimum node_id, the
handle to node_id mapping, the deleted
nodes and the path metadata.
[ Program Information ]
-h, --help Print a help message for odgi build.
> odgi stats -i
Flag 'i' requires an argument but received none
odgi stats {OPTIONS}
Metrics describing a variation graph and its path relationship.
OPTIONS:
[ MANDATORY OPTIONS ]
-i[FILE], --idx=[FILE] Load the succinct variation graph in
ODGI format from this *FILE*. The file
name usually ends with *.og*. It also
accepts GFAv1, but the on-the-fly
conversion to the ODGI format requires
additional time!
[ Summary Options ]
-S, --summarize Summarize the graph properties and
dimensions. Print to stdout the
#nucleotides, #nodes, #edges, #paths,
#steps in a tab-delimited format.
-W, --weak-connected-components Shows the properties of the weakly
connected components.
-L, --self-loops Number of nodes with a self-loop.
-N, --nondeterministic-edges Show nondeterministic edges (those
that extend to the same next base).
-b, --base-content Describe the base content of the
graph. Print to stdout the #A, #C, #G
and #T in a tab-delimited format.
-D[STRING], --delim=[STRING] The part of each path name before this
delimiter is a group identifier, which
when specified will ensure that odgi
stats collects the summary information
per group and not per path.
-f, --file-size Show the file size in bytes.
-a[DELIM,POS],
--pangenome-sequence-class-counts=[DELIM,POS]
Show counted pangenome sequence class
counts of all samples. Classes are
Private (only one sample visiting the
node), Core (all samples visiting the
node), and Shell (not Core or
Private). The given String determines
how to find the sample name in the
path names: DELIM,POS. Split the whole
path name by DELIM and access the
actual sample name at POS of the split
result. If the full path name is the
sample name, select a DELIM that is
not in the path names and set POS to
0. If -m,--multiqc was set, this
OPTION has to be set implicitly.
[ Sorting Goodness Eval Options ]
-c[FILE], --coords-in=[FILE] Load the 2D layout coordinates in
binary layout format from this *FILE*.
The file name usually ends with
*.lay*. The sorting goodness
evaluation will then be performed for
this *FILE*. When the layout
coordinates are provided, the mean
links length and the sum path nodes
distances statistics are evaluated in
2D, else in 1D. Such a file can be
generated with *odgi layout*.
-l, --mean-links-length Calculate the mean links length. This
metric is path-guided and computable
in 1D and 2D.
-g, --no-gap-links Don’t penalize gap links in the mean
links length. A gap link is a link
which connects two nodes that are
consecutive in the linear pangenomic
order. This option is specifiable only
to compute the mean links length in
1D.
-s, --sum-path-nodes-distances Calculate the sum of path nodes
distances. This metric is path-guided
and computable in 1D and 2D. For each
path, it iterates from node to node,
summing their distances, and
normalizing by the path length. In 1D,
if a link goes back in the linearized
viewpoint of the graph, this is
penalized (adding 3 times its length
in the sum).
-d,
--penalize-different-orientation If a link connects two nodes which
have different orientations, this is
penalized (adding 2 times its length
in the sum).
-p, --path-statistics Display the statistics (mean links
length or sum path nodes distances)
for each path.
-w, --weighted-feedback-arc Compute the sum of weigths of all
feedback arcs, i.e. backward pointing
edges the statistics (the weight is
the number of times the edge is
traversed by paths).
-j, --weighted-reversing-join Compute the sum of weigths of all
reversing joins, i.e. edges joining
two in- or two out-sides (the weight
is the number of times the edge is
traversed by paths).
[ IO Format Options ]
-m, --multiqc Setting this option prints all!
statistics in YAML format instead of
pseudo TSV to stdout. This includes
*-S,--summarize*,
*-W,--weak-connected-components*,
*-L,--self-loops*,
*-b,--base-content*,
*-l,--mean-links-length*,
*-g,--no-gap-links*,
*-s,--sum-path-nodes-distances*,
*-f,--file-size*, and
*-d,--penalize-different-orientation*.
*-p,path-statistics* is still
optional. Not applicable to
*-N,--nondeterministic-edges*.
Overwrites all other given OPTIONs!
The output is perfectly curated for
the ODGI MultiQC module.
-y, --yaml Setting this option prints all
selected statistics in YAML format
instead of pseudo TSV to stdout.
[ Processing Information ]
-t[N], --threads=[N] Number of threads to use for parallel
operations.
[ Processing Information ]
-P, --progress Write the current progress to stderr.
[ Program Information ]
-h, --help Print a help message for odgi stats.
> odgi paths -h
odgi paths {OPTIONS}
Interrogate the embedded paths of a graph. Does not print anything to stdout
by default!
OPTIONS:
[ MANDATORY ARGUMENTS ]
-i[FILE], --idx=[FILE] Load the succinct variation graph in
ODGI format from this *FILE*. The file
name usually ends with *.og*. It also
accepts GFAv1, but the on-the-fly
conversion to the ODGI format requires
additional time!
[ Path Investigation Options ]
-O[FILE], --overlaps=[FILE] Read in the path grouping *FILE* to
generate the overlap statistics from.
The file must be tab-delimited. The
first column lists a grouping and the
second the path itself. Each line has
one path entry. For each group the
pairwise overlap statistics for each
pairing will be calculated and printed
to stdout.
-L, --list-paths Print the paths in the graph to
stdout. Each path is printed in its
own line.
-l, --list-path-start-end If -L,--list-paths was specified, this
additionally prints the start and end
positions of each path in additional,
tab-delimited coloumns.
-D[CHAR], --delim=[CHAR] The part of each path name before this
delimiter CHAR is a group identifier.
This parameter should only be set in
combination with **-H, --haplotypes**.
Prints an additional, first column
**group.name** to stdout.
-H, --haplotypes Print to stdout the paths in an
approximate binary haplotype matrix
based on the graph’s sort order. The
output is tab-delimited: *path.name*,
*path.length*, *path.step.count*,
*node.1*, *node.2*, *node.n*. Each
path entry is printed in its own line.
-d, --distance Provides a sparse distance matrix for
paths. If **-D, --delim** is set, it
will be path groups distances. Each
line prints in a tab-delimited format
to stdout: *path.a*, *path.b*,
*path.a.length*, *path.b.length*,
*intersection*, *jaccard*,
*euclidean*.
-f, --fasta Print paths in FASTA format to stdout.
One line for the FASTA header, another
line for the whole sequence.
[ Path Modification Options ]
-K[FILE], --keep-paths=[FILE] Keep paths listed (by line) in *FILE*.
-X[FILE], --drop-paths=[FILE] Drop paths listed (by line) in *FILE*.
-o[FILE], --out=[FILE] Write the dynamic succinct variation
graph to this file (e.g. *.og*).
[ Threading ]
-t[N], --threads=[N] Number of threads to use for parallel
operations.
[ Processing Information ]
-P, --progress Write the current progress to stderr.
[ Program Information ]
-h, --help Print a help message for odgi paths.
> odgi viz
odgi viz {OPTIONS}
Visualize a variation graph in 1D.
OPTIONS:
[ MANDATORY OPTIONS ]
-i[FILE], --idx=[FILE] Load the succinct variation graph in
ODGI format from this *FILE*. The file
name usually ends with *.og*. It also
accepts GFAv1, but the on-the-fly
conversion to the ODGI format requires
additional time!
-o[FILE], --out=[FILE] Write the visualization in PNG format
to this *FILE*.
[ Visualization Options ]
-x[N], --width=[N] Set the width in pixels of the output
image (default: 1500).
-y[N], --height=[N] Set the height in pixels of the output
image (default: 500).
-a[N], --path-height=[N] The height in pixels for a path.
-X[N], --path-x-padding=[N] The padding in pixels on the x-axis
for a path.
-n, --no-path-borders Don't show path borders.
-b, --black-path-borders Draw path borders in black (default is
white).
-R, --pack-paths Pack all paths rather than displaying
a single path per row.
-L[FLOAT],
--link-path-pieces=[FLOAT] Show thin links of this relative width
to connect path pieces.
-A[STRING],
--alignment-prefix=[STRING] Apply alignment related visual motifs
to paths which have this name prefix.
It affects the [**-S, --show-strand**]
and [**-d, –change-darkness**]
options.
-S, --show-strand Use red and blue coloring to display
forward and reverse alignments. This
parameter can be set in combination
with [**-A,
–alignment-prefix**=*STRING*].
-z,
--color-by-mean-inversion-rate Change the color respect to the node
strandness (black for forward, red for
reverse); in binned mode (**-b,
--binned-mode**), change the color
respect to the mean inversion rate of
the path for each bin, from black (no
inversions) to red (bin mean inversion
rate equals to 1).
-N, --color-by-uncalled-bases Change the color with respect to the
uncalled bases of the path for each
bin, from black (no uncalled bases) to
green (all uncalled bases).
-s[CHAR],
--color-by-prefix=[CHAR] Color paths by their names looking at
the prefix before the given character
CHAR.
-M[FILE], --prefix-merges=[FILE] Merge paths beginning with prefixes
listed (one per line) in *FILE*.
-I[PREFIX],
--ignore-prefix=[PREFIX] Ignore paths starting with the given
*PREFIX*.
[ Intervals Selection Options ]
-r[STRING], --path-range=[STRING] Nucleotide range to visualize:
``STRING=[PATH:]start-end``.
``\*-end`` for ``[0,end]``;
``start-*`` for
``[start,pangenome_length]``. If no
PATH is specified, the nucleotide
positions refer to the pangenome’s
sequence (i.e., the sequence obtained
arranging all the graph’s node from
left to right).
[ Path Selection Options ]
-p[FILE],
--paths-to-display=[FILE] List of paths to display in the
specified order; the file must contain
one path name per line and a subset of
all paths can be specified.
[ Path Names Viz Options ]
-H, --hide-path-names Hide the path names on the left of the
generated image.
-C, --color-path-names-background Color path names background with the
same color as paths.
-c[N],
--max-num-of-characters=[N] Maximum number of characters to
display for each path name (max 128
characters). The default value is *the
length of the longest path name* (up
to 32 characters).
[ Binned Mode Options ]
-w[bp], --bin-width=[bp] The bin width specifies the size of
each bin in the binned mode. If it is
not specified, the bin width is
calculated from the width in pixels of
the output image.r
-m, --color-by-mean-depth Change the color with respect to the
mean coverage of the path for each
bin, using the colorbrewer palette
specified in -B --colorbrewer-palette
-B[SCHEME:N],
--colorbrewer-palette=[SCHEME:N] Use the colorbrewer palette specified
by the given SCHEME, with the number
of levels N. Specifiy 'show' to see
available palettes.
-G, --no-grey-depth Use the colorbrewer palette for <0.5x
and ~1x coverage bins. By default,
these bins are light and neutral grey.
[ Gradient Mode Options ]
-d, --change-darkness Change the color darkness based on
nucleotide position in the path. When
it is used in binned mode, the mean
inversion rate of the bin node is
considered to set the color gradient
starting position: when this rate is
greater than 0.5, the bin is
considered inverted, and the color
gradient starts from the right-end of
the bin. This parameter can be set in
combination with [**-A,
–alignment-prefix**=*STRING*].
-l, --longest-path Use the longest path length to change
the color darkness.
-u, --white-to-black Change the color darkness from white
(for the first nucleotide position) to
black (for the last nucleotide
position).
[ Compressed Mode Options ]
-O, --compressed-mode Compress the view vertically,
summarizing the path coverage across
all paths displaying the information
using only one path 'COMPRESSED_MODE'.
A heatmap color-coding from
https://colorbrewer2.org/#type=diverging&scheme=RdBu&n=11
is used. Alternatively, one can enter
a colorbrewer palette via -B,
--colorbrewer-palette.
[ Threading ]
-t[N], --threads=[N] Number of threads to use for parallel
operations.
[ Processing Information ]
-P, --progress Write the current progress to stderr.
[ Program Information ]
-h, --help Print a help message for odgi viz.
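Usage
odgi provides a wide range of tools, from graph manipulation, layout, locus extraction, and graph statistics to graph visualization, validation, and liftover of gene annotations.
odgi build converts a GFA (v1) graph into odgi's binary, node-centric encoding format.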
git clone https://github.com/pangenome/odgi.git
cd odgi/test/
odgi build -g DRB1-3123.gfa -P -o DRB1-3123.og
# If you already know the graph is not optimized and want to optimize it
odgi build -g in.gfa -P --optimize -o out.og
- -g GFAv1 FILE containing the nodes, edges and paths to build a dynamic succinct variation graph from.
- -o Write the dynamic succinct variation graph to this *FILE*. A file ending with *.og* is recommended.
- -O, --optimize Compact the graph id space into a dense integer range.
- -P Write the current progress to stderr.
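The constructed graph represents all the information in the input GFA graph without loss (from the manual). Many odgi commands expect a graph in the odgi binary format, so convert the GFA with odgi build first (according to the help text, some commands also accept GFAv1 directly, but the on-the-fly conversion takes additional time).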
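To make concrete what odgi build reads, here is a minimal sketch that counts the GFAv1 record types (S = segment/node, L = link/edge, P = path) in a toy graph. The toy GFA text and the function name `summarize_gfa` are made up for illustration; this is not how odgi parses GFA internally.

```python
from collections import Counter

# Hypothetical toy GFAv1 content (tab-separated), not taken from the odgi test data.
TOY_GFA = "\n".join([
    "H\tVN:Z:1.0",
    "S\t1\tACGT",
    "S\t2\tG",
    "S\t3\tTTA",
    "L\t1\t+\t2\t+\t0M",
    "L\t2\t+\t3\t+\t0M",
    "L\t1\t+\t3\t+\t0M",
    "P\tsampleA\t1+,2+,3+\t*",
    "P\tsampleB\t1+,3+\t*",
])


def summarize_gfa(text):
    """Count nodes, edges, and paths, and sum the node sequence lengths."""
    counts = Counter()
    total_bp = 0
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        counts[fields[0]] += 1
        # Segment lines carry the node sequence in the third column ("*" means absent).
        if fields[0] == "S" and fields[2] != "*":
            total_bp += len(fields[2])
    return {
        "length": total_bp,
        "nodes": counts["S"],
        "edges": counts["L"],
        "paths": counts["P"],
    }


summary = summarize_gfa(TOY_GFA)
print(summary)
```

These are the same quantities that odgi stats -S reports for a built graph, so a quick count like this can serve as a sanity check on small inputs.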
Basic statistics
Use odgi stats.
odgi stats -i DRB1-3123.og -S | column -t
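The summary shows that the nodes contain 21997 nucleotides in total, and that the graph has 4955 nodes, 6777 edges, and 12 paths. odgi stats can also write its output for MultiQC (-m, --multiqc).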
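The summary is easy to consume from a script. The sketch below assumes the output consists of one header line starting with `#` followed by one line of values; the exact column names are an assumption, and the sample text only reuses the four numbers quoted above. The function name `parse_stats_summary` is hypothetical.

```python
def parse_stats_summary(text):
    """Turn a two-line header/value summary into a dict of integers."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # Split on any whitespace so both raw TSV and `column -t` output are handled.
    header = lines[0].lstrip("#").split()
    values = [int(v) for v in lines[1].split()]
    return dict(zip(header, values))


# Assumed layout; the values are the ones reported in the text above.
SAMPLE = "#length\tnodes\tedges\tpaths\n21997\t4955\t6777\t12\n"

stats = parse_stats_summary(SAMPLE)
mean_node_length = stats["length"] / stats["nodes"]
edges_per_node = stats["edges"] / stats["nodes"]
print(stats)
print(f"mean node length: {mean_node_length:.2f} bp")
print(f"edges per node:   {edges_per_node:.2f}")
```

Derived ratios such as the mean node length give a rough feel for how finely the graph is fragmented.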
List the path names in the pangenome
Use odgi paths.
odgi paths -i DRB1-3123.og -L
- -i Load the succinct variation graph in ODGI format from this *FILE*. The file name usually ends with *.og*. It also accepts GFAv1, but the on-the-fly conversion to the ODGI format requires additional time!
Extract path sequences in FASTA format
Run odgi paths with -f, which prints the embedded paths as FASTA records to stdout.
odgi paths -i DRB1-3123.og -f > paths.fasta
- -f Print paths in FASTA format to stdout. One line for the FASTA header, another line for the whole sequence.
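Once the paths are written as FASTA, a few lines of Python are enough to check the sequence length of each path. This is a minimal sketch with a made-up two-record FASTA string; the function name `read_fasta` and the sample names are hypothetical, and in practice the text would be read from paths.fasta.

```python
def read_fasta(text):
    """Parse FASTA text into a dict mapping record name to sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Also handles sequences wrapped over several lines.
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


# Hypothetical toy content standing in for the real paths.fasta.
TOY_FASTA = ">sampleA\nACGTGTTA\n>sampleB\nACGTTTA\n"

lengths = {name: len(seq) for name, seq in read_fasta(TOY_FASTA).items()}
for name, length in lengths.items():
    print(f"{name}\t{length}")
```

Comparing these lengths with the original assemblies is a simple way to confirm that the graph reproduces each input sequence.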
Visualize the pangenome graph
Use odgi viz.
odgi viz -i DRB1-3123.og -o DRB1-3123.png -x 500
- -i Load the succinct variation graph in ODGI format from this *FILE*. The file name usually ends with *.og*. It also accepts GFAv1, but the on-the-fly conversion to the ODGI format requires additional time!
- -o Write the visualization in PNG format to this *FILE*
- -x Set the width in pixels of the output image (default: 1500).
- -w The bin width specifies the size of each bin in the binned mode. If it is not specified, the bin width is calculated from the width in pixels of the output image.
DRB1-3123.png
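odgi viz renders the pangenome graph as a linear, one-dimensional view. The graph nodes are arranged from left to right. Path names are shown on the left, and the black lines beneath the paths are the links, which represent the graph topology. The colored bars show the relationship between each path and the pangenome sequence as a binary matrix. Paths can also be shaded with a gradient according to position along the path, or colored by node orientation (strandedness). -w sets the bin width used in binned mode.
Eukaryotic genomes are characterized by repetitive sequences, which can give rise to complex regions in a pangenome graph. To identify such regions, the depth within the graph can be analyzed (from the manual).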
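The binary matrix behind this kind of 1D view can be sketched in a few lines. The example assumes a toy graph whose nodes are already sorted left to right; all names and values are hypothetical, and the binning only illustrates the idea rather than reproducing the rendering code in odgi viz.

```python
def binned_presence(node_order, node_lengths, paths, bin_width):
    """Return, per path, a 0/1 row marking which pangenome bins the path touches."""
    # Offset of each node in the linearized pangenome sequence.
    offsets = {}
    position = 0
    for node in node_order:
        offsets[node] = position
        position += node_lengths[node]
    n_bins = -(-position // bin_width)  # ceiling division

    matrix = {}
    for name, visited in paths.items():
        row = [0] * n_bins
        for node in visited:
            start = offsets[node]
            end = start + node_lengths[node] - 1
            for b in range(start // bin_width, end // bin_width + 1):
                row[b] = 1
        matrix[name] = row
    return matrix


# Toy graph: three nodes in sorted order; sampleB skips node 2.
node_order = [1, 2, 3]
node_lengths = {1: 4, 2: 2, 3: 2}
paths = {"sampleA": [1, 2, 3], "sampleB": [1, 3]}

matrix = binned_presence(node_order, node_lengths, paths, bin_width=2)
for name, row in matrix.items():
    print(name, "".join("#" if v else "." for v in row))
```

A larger bin width merges neighboring nodes into one column, which is why small variants can disappear from the picture when the image is narrow.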
odgi viz -i LPA.og -o LPA.bm.png -x 500 -bm
- -b Draw path borders in black (default is white).
- -m Change the color with respect to the mean coverage of the path for each bin, using the colorbrewer palette specified in -B --colorbrewer-palette
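The notion of depth used to spot repeat-rich regions can be illustrated as the number of times the paths traverse each node. The sketch below uses a toy graph in which one path loops over node 2. The names and the flagging rule (depth greater than the number of paths) are assumptions for illustration, not the criteria used by odgi depth or odgi viz.

```python
from collections import Counter


def node_depth(paths):
    """Count how many path steps visit each node, across all paths."""
    depth = Counter()
    for steps in paths.values():
        depth.update(steps)
    return depth


# Toy paths: sampleA traverses node 2 three times (a tandem repeat).
paths = {
    "sampleA": [1, 2, 2, 2, 3],
    "sampleB": [1, 2, 3],
}

depth = node_depth(paths)
# Assumed heuristic: a node visited more often than there are paths
# must be traversed repeatedly by at least one path.
repeat_candidates = sorted(n for n, d in depth.items() if d > len(paths))

for node in sorted(depth):
    print(f"node {node}: depth {depth[node]}")
print("candidate repeat nodes:", repeat_candidates)
```

In the binned visualization, averaging such per-node counts over each bin is what a mean-depth coloring conveys.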
There are many other commands. To be continued in the next post.
Citation
ODGI: understanding pangenome graphs
Andrea Guarracino, Simon Heumos, Sven Nahnsen, Pjotr Prins, Erik Garrison
Bioinformatics, Volume 38, Issue 13, 1 July 2022, Pages 3319–3326