PanGenome Graphを構築する PGGB - macでインフォマティクス

Githubより

pggb は入力配列から pangenome variation graph を作成する。pangenome variation graph は一般的な多重配列アライメントの一種で、ある生物種または生物群におけるゲノム要素の完全な集合をモデル化したものです。パンゲノムは、配列グラフの一種である変異グラフの形で効率的にコード化され、グラフ自体にパスとして線形の配列が埋め込まれます。pangenome variation graphによりゲノム間の配列バリエーションを理解することができます。pggb はこのようなグラフを生成します。pggbは入力配列のall-to-allアライメント（wfmash）、 graph induction（seqwish）、段階的正規化（smoothxg、gfaffix）を用いてこのようなグラフを生成します。グラフ構築後、pggbはグラフの診断目的の視覚化(odgi)を行う。pggb の出力は GFAv1 形式で、 vg や odgi ツールキットなどの多くのゲノムグラフやパンゲノムツールの入力として使用することができます。
pggbは、ヒトパンゲノムリファレンスコンソーシアム（HPRC）において、ドラフトヒトパンゲノムからグラフを構築する方法として、大規模なテストが行われています。

Quick Start

https://pggb.readthedocs.io/en/latest/rst/quick_start.html

インストール

Github

#conda
mamba create -n pggb python=3 -y
conda activate pggb
mamba install -c conda-forge -c bioconda pggb -y

#docker
docker pull ghcr.io/pangenome/pggb:latest

> pggb

usage: pggb -i <input-fasta> -n <n-mappings> [options]

options:

[wfmash]

-i, --input-fasta FILE input FASTA/FASTQ file

-s, --segment-length N segment length for mapping [default: 5k]

-l, --block-length N minimum block length filter for mapping [default: 5*segment-length]

-N, --no-split disable splitting of input sequences during mapping [enabled by default]

-p, --map-pct-id PCT percent identity for mapping/alignment [default: 90]

-n, --n-mappings N number of mappings to retain for each segment

-x, --sparse-map N keep this fraction of mappings ('auto' for giant component heuristic) [default: 1.0]

-K, --mash-kmer N kmer size for mapping [default: 19]

-F, --mash-kmer-thres N ignore the top % most-frequent kmers [default: 0.001]

-Y, --exclude-delim C skip mappings between sequences with the same name prefix before

the given delimiter character [default: all-vs-all and !self]

[seqwish]

-k, --min-match-len N filter exact matches below this length [default: 19]

-f, --sparse-factor N keep this randomly selected fraction of input matches [default: no sparsification]

-B, --transclose-batch number of bp to use for transitive closure batch [default: 10000000]

[smoothxg]

-H, --n-haps N number of haplotypes, if different than that set with -n [default: n-mappings]

-j, --path-jump-max maximum path jump to include in block [default: 0]

-e, --edge-jump-max N maximum edge jump before breaking [default: 0 / off]

-G, --poa-length-target N,M target sequence length for POA, one per pass [default: 700,900,1100]

-P, --poa-params PARAMS score parameters for POA in the form of match,mismatch,gap1,ext1,gap2,ext2

may also be given as presets: asm5, asm10, asm15, asm20

[default: 1,19,39,3,81,1 = asm5]

-O, --poa-padding N pad each end of each sequence in POA with N*(mean_seq_len) bp [default: 0.001]

-b, --run-abpoa run abPOA [default: SPOA]

-z, --global-poa run the POA in global mode [default: local mode]

-d, --pad-max-depth N depth/haplotype at which we don't pad the POA problem [default: 100]

-M, --write-maf write MAF output representing merged POA blocks [default: off]

-Q, --consensus-prefix P use this prefix for consensus path names [default: Consensus_]

[odgi]

-v, --skip-viz don't render visualizations of the graph in 1D and 2D [default: make them]

-S, --stats generate statistics of the seqwish and smoothxg graph [default: off]

[vg]

-V, --vcf-spec SPEC specify a set of VCFs to produce with SPEC = REF:DELIM[:LEN][,REF:DELIM:[LEN]]*

the paths matching ^REF are used as a reference, while the sample haplotypes

are derived from path names, e.g. when DELIM=# and with '-V chm13:#',

a path named HG002#1#ctg would be assigned to sample HG002 phase 1.

If LEN is specified and greater than 0, the VCFs are decomposed, filtering

sites whose max allele length is greater than LEN. [default: off]

[multiqc]

-m, --multiqc generate MultiQC report of graphs' statistics and visualizations,

automatically runs odgi stats [default: off]

[general]

-o, --output-dir PATH output directory

-D, --temp-dir PATH directory for temporary files

-a, --input-paf FILE input PAF file; the wfmash alignment step is skipped

-r, --resume do not overwrite existing outputs in the given directory

[default: start pipeline from scratch]

-t, --threads N number of compute threads to use in parallel steps

-T, --poa-threads N number of compute threads to use during POA (set lower if you OOM during smoothing)

-A, --keep-temp-files keep intermediate graphs

-Z, --compress compress alignment (.paf), graph (.gfa, .og), and MSA (.maf) outputs with pigz,

and variant (.vcf) outputs with bgzip

--version display the version of pggb

-h, --help this text

Use wfmash, seqwish, smoothxg, odgi, gfaffix, and vg to build, project and display a pangenome graph.

実行方法

１，ゲノムを１つのFASTAファイルにまとめて圧縮し、indexをつける。

cat genome.fasta > in.fa
bgzip -@ 3 in.fa
samtools faidx in.fa.gz

全ゲノムアセンブリの場合、異なる染色体に分割して、それぞれの染色体ごとに実行することも推奨されている。

２、例えば9つのハプロタイプを含むin.faから、5kbマッチで最小で90%以上の同一性でグラフを構築するには以下のように実行する。16スレッド指定する。また、-V でリファレンス gi|568815561:# を基準としてvariant をコールする（VCFが出力される）。-MでMAF形式の多重配列アライメントも保存される。

pggb -i in.fa -o output -n 9 -t 16 -p 90 -s 5k -V 'gi|568815561:#'

-i input FASTA/FASTQ file
-s segment length for mapping [default: 5k]
-p percent identity for mapping/alignment [default: 90]
-n number of mappings to retain for each segment
-t number of compute threads to use in parallel steps
-V specify a set of VCFs to produce with SPEC = REF:DELIM[:LEN][,REF:DELIM:[LEN]]*