The ODGI extract command - macでインフォマティクス

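The ODGI extract command pulls a locus of interest out of a pangenome graph and writes it as a subgraph. The tutorial applies extract to a pangenome variation graph of one specific locus, built from 13 contigs taken from 7 haploid human genome assemblies (6 individuals plus the chm13 cell line), and walks through how to examine variants that differ between individuals.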

Extract selected loci

https://odgi.readthedocs.io/en/latest/rst/tutorials/extract_selected_loci.html

help

> odgi extract -h

  odgi extract {OPTIONS}

    Extract subgraphs or parts of a graph defined by query criteria.

  OPTIONS:

      [ MANDATORY OPTIONS ]
        -i[FILE], --idx=[FILE]            Load the succinct variation graph in
                                          ODGI format from this *FILE*. The file
                                          name usually ends with *.og*. It also
                                          accepts GFAv1, but the on-the-fly
                                          conversion to the ODGI format requires
                                          additional time!
        -d[N],
        --max-distance-subpaths=[N]       Maximum distance between subpaths
                                          allowed for merging them. It reduces
                                          the fragmentation of unspecified paths
                                          in the input path ranges. Set 0 to
                                          disable it.
        -e[N],
        --max-merging-iterations=[N]      Maximum number of iterations in
                                          attempting to merge close subpaths. It
                                          stops early if during an iteration no
                                          subpaths were merged [default: 3].
      [ Graph Files IO ]
        -o[FILE], --out=[FILE]            Store all subgraphs in this FILE. The
                                          file name usually ends with *.og*.
      [ Extract Options ]
        -s, --split-subgraphs             Instead of writing the target
                                          subgraphs into a single graph, write
                                          one subgraph per given target to a
                                          separate file named path:start-end.og
                                          (0-based coordinates).
        -I, --inverse                     Extract the parts of the graph that do
                                          not meet the query criteria.
        -n[ID], --node=[ID]               A single node ID from which to begin
                                          our traversal.
        -l[FILE], --node-list=[FILE]      A file with one node id per line. The
                                          node specified will be extracted from
                                          the input graph.
        -c[N], --context-steps=[N]        The number of steps (nodes) away from
                                          our initial subgraph that we should
                                          collect [default: 0 (disabled)]
        -L[N], --context-bases=[N]        The number of bases away from our
                                          initial subgraph that we should
                                          collect [default: 0 (disabled)]
        -r[STRING], --path-range=[STRING] Find the node(s) in the specified path
                                          range TARGET=path[:pos1[-pos2]]
                                          (0-based coordinates).
        -b[FILE], --bed-file=[FILE]       Find the node(s) in the path range(s)
                                          specified in the given BED FILE.
        -q[STRING],
        --pangenomic-range=[STRING]       Find the node(s) in the specified
                                          pangenomic range pos1-pos2 (0-based
                                          coordinates). The nucleotide positions
                                          refer to the pangenome’s sequence
                                          (i.e., the sequence obtained arranging
                                          all the graph’s node from left to
                                          right).
        -E, --full-range                  Collects all nodes in the sorted order
                                          of the graph in the min and max
                                          positions touched by the given path
                                          ranges. This ensures that all the
                                          paths of the subgraph are not split by
                                          node, but that the nodes are laced
                                          together again. Comparable to **-R,
                                          --lace-paths=FILE**, but specifically
                                          for all paths in the resulting
                                          subgraph. Be careful to use it with
                                          very complex graphs.
        -p[FILE],
        --paths-to-extract=[FILE]         List of paths to keep in the extracted
                                          graph. The FILE must contain one path
                                          name per line and a subset of all
                                          paths can be specified. Paths
                                          specified in the input path ranges
                                          (with -r/--path-range and/or
                                          -b/--bed-file) will be kept in any
                                          case.
        -R[FILE], --lace-paths=[FILE]     List of paths to fully retain in the
                                          extracted graph. Must contain one path
                                          name per line and a subset of all
                                          paths can be specified.
      [ Threading ]
        -t[N], --threads=[N]              Number of threads to use for parallel
                                          operations.
      [ Processing Information ]
        -P, --progress                    Print information about the operations
                                          and the progress to stderr.
      [ Program Information ]
        -h, --help                        Print a help message for odgi extract.
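As a rough illustration of the -r/--path-range option listed in the help, the sketch below builds a TARGET string in the path:pos1-pos2 form and passes it to odgi extract together with -L. The coordinates 1000 and 2000, the output name LPA.region.og, and the helper name make_target are all made up for this example, and the odgi call is skipped unless both odgi and LPA.og (built in step 1 below) are present.

```shell
# Build a TARGET string in the path:pos1-pos2 form described for -r/--path-range.
# make_target is a hypothetical helper, not part of odgi.
make_target() {
  printf '%s:%s-%s\n' "$1" "$2" "$3"
}

# Hypothetical 0-based range on the reference contig.
target=$(make_target chm13__LPA__tig00000001 1000 2000)
echo "$target"

# Run the extraction only when odgi and the graph are actually available.
if command -v odgi >/dev/null 2>&1 && [ -f LPA.og ]; then
  # -L 100 asks for 100 bases of context around the range; -d 0 disables subpath merging.
  odgi extract -i LPA.og -r "$target" -L 100 -d 0 -o LPA.region.og
fi
```

Only flags that appear in the help text are used here; whether this particular range is meaningful for the LPA graph has not been checked.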

The steps below follow the tutorial.

1. Get the GFA and convert it

git clone https://github.com/pangenome/odgi.git
cd odgi/
# Convert the GFA1 graph to the odgi binary format
odgi build -g test/LPA.gfa -o LPA.og

2. Check the path names

odgi paths -i LPA.og -L
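The path names in this graph appear to follow a SAMPLE__LOCUS__CONTIG pattern (for example HG02572__LPA__tig00000001), so the listing can be summarized per sample. The sketch below assumes that pattern holds for every path; count_contigs_per_sample is a helper name made up for this illustration.

```shell
# Count contigs per sample from a list of path names on stdin.
# Assumes names look like SAMPLE__LOCUS__CONTIG (an assumption based on
# names such as HG02572__LPA__tig00000001).
count_contigs_per_sample() {
  awk -F'__' '{ n[$1]++ } END { for (s in n) print s "\t" n[s] }' | sort
}

# Use the real graph only when odgi and LPA.og are available.
if command -v odgi >/dev/null 2>&1 && [ -f LPA.og ]; then
  odgi paths -i LPA.og -L | count_contigs_per_sample
fi
```

Each output line is a sample name and the number of its contigs, which gives a quick check against the 13 contigs mentioned at the top of this post.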

3. The tutorial focuses on the variants at this locus (LPA) in HG02572. First, inspect the VCF (variants called against chm13__LPA__tig00000001).

gzip -dc test/LPA.chm13__LPA__tig00000001.vcf.gz | grep -v '^##' - | head -n 9 | cut -f 1-9,16,17 | column -t

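The ID field of the VCF lists the nodes involved in each variant. The symbol in front of each node ID gives the orientation in which that node is traversed: > means forward, < means reverse.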
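To make the ID notation concrete, the following sketch splits an ID string into one node per line with its orientation. The example value >21>23 is an assumed ID, chosen to match the nodes discussed in this post rather than copied from the VCF, and parse_vcf_id is a hypothetical helper name.

```shell
# Split a VCF ID such as ">21>23" into "node<TAB>orientation" lines.
parse_vcf_id() {
  printf '%s\n' "$1" | grep -oE '[<>][0-9]+' | while read -r step; do
    case "$step" in
      '>'*) printf '%s\tforward\n' "${step#?}" ;;  # ">" marks a forward traversal
      '<'*) printf '%s\treverse\n' "${step#?}" ;;  # "<" marks a reverse traversal
    esac
  done
}

parse_vcf_id '>21>23'
```

The node IDs obtained this way are what can be handed to odgi extract with -n.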

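According to the VCF, the variant at position 1050 (an insertion of T) is present only in HG02572__LPA__tig00000001. To extract the subgraph that contains this insertion, take the node with ID 23 from the VCF and pass it with -n 23.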

odgi extract -i LPA.og -n 23 -c 1 -o LPA.21_23_G_GT.og -d 0

#stats
odgi stats -i LPA.21_23_G_GT.og -S
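The statement that the insertion is carried by a single contig can be checked mechanically. The sketch below lists VCF records for which exactly one sample column has a non-reference call. It assumes haploid genotypes whose GT value is the first colon-separated subfield ("0", "1", or "."); that layout is an assumption about this VCF, and private_variants is a hypothetical helper name.

```shell
# Print POS, ID, REF, ALT and the carrier for variants seen in exactly one sample.
# Reads an uncompressed VCF on stdin. Assumes haploid GT as the first subfield.
private_variants() {
  awk -F'\t' '
    /^##/ { next }                                   # skip meta lines
    /^#CHROM/ { for (i = 10; i <= NF; i++) name[i] = $i; next }
    {
      carriers = 0; who = ""
      for (i = 10; i <= NF; i++) {
        split($i, f, ":")
        if (f[1] != "0" && f[1] != ".") { carriers++; who = name[i] }
      }
      if (carriers == 1) print $2 "\t" $3 "\t" $4 "\t" $5 "\t" who
    }'
}

# Apply it to the tutorial VCF when the file is present.
if [ -f test/LPA.chm13__LPA__tig00000001.vcf.gz ]; then
  gzip -dc test/LPA.chm13__LPA__tig00000001.vcf.gz | private_variants
fi
```

Missing calls (".") are treated as non-carriers here, which is a simplification.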

4. Check the path names in the subgraph

odgi paths -i LPA.21_23_G_GT.og -L

The subgraph contains the contig that was used as the reference and the two HG02572 contigs.

5. Write the subgraph out as GFA so that it can be visualized.

odgi view -i LPA.21_23_G_GT.og -g > LPA.21_23_G_GT.gfa

Visualize with Bandage

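(From the manual) The image shows the topology of the graph; each colored rectangle represents a node. Three paths support the nodes with ID 21 and ID 23, while only one path supports the node with ID 22. Node 22 is the graph representation of the extra nucleotide T inserted in the HG02572__LPA__tig00000001 contig.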
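The per-node path support described above can also be read from the GFA without opening Bandage. The sketch below counts, for each node, how many GFA1 P lines traverse it. It assumes the exported file stores paths as P lines with a comma-separated list of oriented node IDs (such as 21+,22+,23+); node_support is a hypothetical helper name.

```shell
# For each node, count how many paths (GFA1 P lines) pass through it.
# Usage: node_support FILE.gfa
node_support() {
  awk -F'\t' '
    $1 == "P" {
      n = split($3, steps, ",")
      split("", seen)                      # reset the per-path set
      for (i = 1; i <= n; i++) {
        id = steps[i]
        sub(/[+-]$/, "", id)               # drop the orientation sign
        if (!(id in seen)) { seen[id] = 1; count[id]++ }
      }
    }
    END { for (id in count) print id "\t" count[id] }
  ' "$1" | sort -n
}

# Apply it to the exported subgraph when the file is present.
if [ -f LPA.21_23_G_GT.gfa ]; then
  node_support LPA.21_23_G_GT.gfa
fi
```

A node visited more than once by the same path is counted once for that path, so the numbers are paths per node rather than steps per node.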

Reference

Extract selected loci — odgi c522690 documentation