macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

バクテリアのヌクレオチド分解能パンゲノムツール Pandora

 

 新しいパンゲノムグラフ構造であるpandoraと、バクテリアのパンゲノム全体のバリアントを同定するアルゴリズムを紹介する。バクテリアの適応性の多くは付属ゲノムに依存しているため、コアゲノムだけのSNPを解析する方法では満足のいく結果が得られない。Pandoraは、シーケンスされたゲノムをリファレンスのリコンビナントとして近似し、新しいバリエーションを検出し、複数のサンプルをパンジェノタイプする。578の大腸菌ゲノムからなるリファレンスグラフを用いて、20の多様な分離株を比較した。Pandoraは、単一のリファレンスベースのツールよりも多くの希少なSNPを回復し、最も近いRefSeqリファレンスを選ぶよりも有意に優れており、リファレンスバイアスなしに多様なサンプルを分析するための安定したフレームワークを提供する。

 

usage

https://github.com/rmcolq/pandora/wiki/Usage

Toy example

https://github.com/rmcolq/pandora/tree/master/example

 

レポジトリより

Pandoraは、パンゲノムリファレンスグラフ(PanRG)を用いたバクテリアゲノム解析のためのツールです。1つまたは複数のサンプルについて、遺伝子の有無の検出やSNP、indel、より長いバリアントのジェノタイピングが可能です。PandoraはIlluminaまたはNanoporeのデータで動作します。

 

インストール

Github

#conda (link)
mamba create -n pandra -y
conda activate pandra
mamba install -c bioconda pandora -y

> pandora --help

# pandora --help

Pandora: Pan-genome inference and genotyping with long noisy or short accurate reads.

Usage: pandora [OPTIONS] SUBCOMMAND

 

Options:

  -h,--help                   Print this help message and exit

  -V,--version                

 

Subcommands:

  index                       Index population reference graph (PRG) sequences.

  map                         Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and optionally genotype variants.

  compare                     Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.

  discover                    Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.

  walk                        Outputs a path through the nodes in a PRG corresponding to the either an input sequence (if it exists) or the top/bottom path

  seq2path                    For each sequence, return the path through the PRG

  get_vcf_ref                 Outputs a fasta suitable for use as the VCF reference using input sequences

  random                      Outputs a fasta of random paths through the PRGs

  merge_index                 Allows multiple indices to be merged (no compatibility check)

> pandora index --help

# pandora index --help

Index population reference graph (PRG) sequences.

Usage: pandora index [OPTIONS] <PRG>

 

Positionals:

  <PRG> FILE [required]       PRG to index (in fasta format)

 

Options:

  -h,--help                   Print this help message and exit

  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]

  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

  -t,--threads INT            Maximum number of threads to use [default: 1]

  -o,--outfile FILE           Filename for the index [default: <PRG>.kXX.wXX.idx]

  -v                          Verbosity of logging. Repeat for increased verbosity

 

> pandora map --help

# pandora map --help

Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and optionally genotype variants.

Usage: pandora map [OPTIONS] <TARGET> <QUERY>

 

Positionals:

  <TARGET> FILE [required]    An indexed PRG file (in fasta format)

  <QUERY> FILE [required]     Fast{a,q} file containing reads to quasi-map

 

Options:

  -h,--help                   Print this help message and exit

  -v                          Verbosity of logging. Repeat for increased verbosity

 

Indexing:

  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]

  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

 

Input/Output:

  -o,--outdir DIR             Directory to write output files to [default: "pandora"]

  -t,--threads INT            Maximum number of threads to use [default: 1]

  --vcf-refs FILE             Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in <TARGET> and the same name

  --kg                        Save kmer graphs with forward and reverse coverage annotations for found loci

  --loci-vcf                  Save a VCF file for each found loci

  -C,--comparison-paths       Save a fasta file for a random selection of paths through loci

  -M,--mapped-reads           Save a fasta file for each loci containing read parts which overlapped it

 

Parameter Estimation:

  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]

  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]

  --bin                       Use binomial model for kmer coverages [default: negative binomial]

 

Mapping:

  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]

  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

 

Preset:

  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

 

Filtering:

  --clean                     Add a step to clean and detangle the pangraph

  --max-covg INT              Maximum coverage of reads to accept [default: 300]

 

Consensus/Variant Calling:

  --genotype                  Add extra step to carefully genotype sites.

  --snps                      When genotyping, only include SNP sites

  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]

 

Genotyping:

  --local Needs: --genotype   (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.

  -a INT                      Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]

  -s INT                      The minimum required total coverage for a site when genotyping [default: 0]

  -D INT                      Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]

  -F INT                      Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]

  -E,--gt-error-rate FLOAT    When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]

  -G,--gt-conf INT            Minimum genotype confidence (GT_CONF) required to make a call [default: 1]

> pandora compare --help

# pandora compare --help

Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.

Usage: pandora compare [OPTIONS] <TARGET> <QUERY_IDX>

 

Positionals:

  <TARGET> FILE [required]    An indexed PRG file (in fasta format)

  <QUERY_IDX> FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample

 

Options:

  -h,--help                   Print this help message and exit

  -v                          Verbosity of logging. Repeat for increased verbosity

 

Indexing:

  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]

  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

 

Input/Output:

  -o,--outdir DIR             Directory to write output files to [default: "pandora"]

  -t,--threads INT            Maximum number of threads to use [default: 1]

  --vcf-refs FILE             Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in <TARGET> and the same name

  --loci-vcf                  Save a VCF file for each found loci

 

Parameter Estimation:

  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]

  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]

  --bin                       Use binomial model for kmer coverages [default: negative binomial]

 

Mapping:

  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]

  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

 

Preset:

  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

 

Filtering:

  --clean                     Add a step to clean and detangle the pangraph

  --max-covg INT              Maximum coverage of reads to accept [default: 300]

 

Consensus/Variant Calling:

  --genotype                  Add extra step to carefully genotype sites.

  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]

 

Genotyping:

  --local Needs: --genotype   (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.

  -a INT                      Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]

  -s INT                      The minimum required total coverage for a site when genotyping [default: 0]

  -D INT                      Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]

  -F INT                      Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]

  -E,--gt-error-rate FLOAT    When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]

  -G,--gt-conf INT            Minimum genotype confidence (GT_CONF) required to make a call [default: 1]

 

> pandora discover --help

# pandora discover --help

Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.

Usage: pandora discover [OPTIONS] <TARGET> <QUERY_IDX>

 

Positionals:

  <TARGET> FILE [required]    An indexed PRG file (in fasta format)

  <QUERY_IDX> FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample

 

Options:

  -h,--help                   Print this help message and exit

  --discover-k INT:[0-32)     K-mer size to use when discovering novel variants [default: 15]

  --max-ins INT               Max. insertion size for novel variants. Warning: setting too long may impair performance [default: 15]

  --covg-threshold INT        Positions with coverage less than this will be tagged for variant discovery [default: 3]

  -l INT                      Min. length of consecutive positions below coverage threshold to trigger variant discovery [default: 1]

  -L INT                      Max. length of consecutive positions below coverage threshold to trigger variant discovery [default: 30]

  -d,--merge INT              Merge candidate variant intervals within distance [default: 15]

  -N INT                      Maximum number of candidate variants allowed for a candidate region [default: 25]

  --min-dbg-dp INT            Minimum node/kmer depth in the de Bruijn graph used for discovering variants [default: 2]

  -v                          Verbosity of logging. Repeat for increased verbosity

 

Indexing:

  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]

  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

 

Input/Output:

  -o,--outdir DIR             Directory to write output files to [default: "pandora_discover"]

  -t,--threads INT            Maximum number of threads to use [default: 1]

  --kg                        Save kmer graphs with forward and reverse coverage annotations for found loci

  -M,--mapped-reads           Save a fasta file for each loci containing read parts which overlapped it

 

Parameter Estimation:

  -e,--error-rate FLOAT       Estimated error rate for reads [default: 0.11]

  -g,--genome-size STR/INT    Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]

  --bin                       Use binomial model for kmer coverages [default: negative binomial]

 

Mapping:

  -m,--max-diff INT           Maximum distance (bp) between consecutive hits within a cluster [default: 250]

  -c,--min-cluster-size INT   Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]

 

Preset:

  -I,--illumina               Reads are from Illumina. Alters error rate used and adjusts for shorter reads

 

Filtering:

  --clean                     Add a step to clean and detangle the pangraph

  --clean-dbg                 Clean the local assembly de Bruijn graph

  --max-covg INT              Maximum coverage of reads to accept [default: 600]

 

Consensus/Variant Calling:

  --kmer-avg INT              Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]

> pandora walk --help

# pandora walk --help

Outputs a path through the nodes in a PRG corresponding to the either an input sequence (if it exists) or the top/bottom path

Usage: pandora walk [OPTIONS] <PRG>

 

Positionals:

  <PRG> FILE [required]       A PRG file (in fasta format)

 

Options:

  -h,--help                   Print this help message and exit

  -i,--input FILE Excludes: --top --bottom

                              Fast{a,q} of sequences to output paths through the PRG for

  -T,--top Excludes: --input --bottom

                              Output the top path through each local PRG

  -B,--bottom Excludes: --input --top

                              Output the bottom path through each local PRG

 

> pandora seq2path --help

# pandora seq2path --help

For each sequence, return the path through the PRG

Usage: pandora seq2path [OPTIONS] <PRG>

 

Positionals:

  <PRG> FILE [required]       PRG to index (in fasta format)

 

Options:

  -h,--help                   Print this help message and exit

  -i,--input FILE Excludes: --top --bottom

                              Fast{a,q} of sequences to output paths through the PRG for

  -w INT                      Window size for (w,k)-minimizers (must be <=k) [default: 14]

  -k INT                      K-mer size for (w,k)-minimizers [default: 15]

  -T,--top Excludes: --input --bottom

                              Output the top path through each local PRG

  -B,--bottom Excludes: --input --top

                              Output the bottom path through each local PRG

  --flag Needs: --input       output success/fail rather than the node path

  -v                          Verbosity of logging. Repeat for increased verbosity

> pandora get_vcf_ref --help

# pandora get_vcf_ref --help

Outputs a fasta suitable for use as the VCF reference using input sequences

Usage: pandora get_vcf_ref [OPTIONS] <PRG> [<QUERY>]

 

Positionals:

  <PRG> FILE [required]       PRG to index (in fasta format)

  <QUERY> FILE                Fast{a,q} file of sequences to retrive the PRG reference for

 

Options:

  -h,--help                   Print this help message and exit

  -z,--compress               Compress the output with gzip

  -v                          Verbosity of logging. Repeat for increased verbosity

> pandora random --help

# pandora random --help

Outputs a fasta of random paths through the PRGs

Usage: pandora random [OPTIONS] <PRG>

 

Positionals:

  <PRG> FILE [required]       PRG to generate random paths from

 

Options:

  -h,--help                   Print this help message and exit

  -n INT                      Number of paths to output [default: 1]

  -z,--compress               Compress the output with gzip

  -v                          Verbosity of logging. Repeat for increased verbosity

 

> pandora merge_index --help

# pandora merge_index --help

Allows multiple indices to be merged (no compatibility check)

Usage: pandora merge_index [OPTIONS] <IDX>...

 

Positionals:

  <IDX> FILES [required]      Indices to merge

 

Options:

  -h,--help                   Print this help message and exit

  -o,--outfile FILE           Filename for merged index [default: "merged_index.idx"]

  -v                          Verbosity of logging. Repeat for increased verbosity

必要であればMultiple Sequence AlignmentからPandoraやGramtoolsに入力するPRGを作成するmake_prgもインストールする。

Github

git clone https://github.com/rmcolq/make_prg.git
cd make_prg
python -m pip install .
#MAFFT
mamba install -c bioconda -y mafft

> make_prg --help

# make_prg --help

usage: make_prg <subcommand> <options>

 

Subcommand entrypoint

 

optional arguments:

  -h, --help     show this help message and exit

  -V, --version  show program's version number and exit

  -v, --verbose  Run with high verbosity (debug level logging)

 

Available subcommands:

  

    from_msa     Make PRG from multiple sequence alignment

Nextflowを使って多数のデータを効率的に扱えるバージョンも用意されている(レポジトリの説明参照)。

 

 

実行方法

対象となる遺伝子やゲノム領域ごとに 1つのエントリを持つ、グラフの fastaファイルが必要。持っていないなら、各グラフのための多重配列アラインメントが必要になる。多くの種の異種遺伝子クラスターを表すMSAのコンパイル済みコレクションは、ここからダウンロードできる。

https://pangenome.org

(panXサイトの下のcore|all gene alignments)

make_prgコマンドを使ってグラフを作成することもできる。

 

テストラン

git clone https://github.com/rmcolq/pandora.git
cd pandora/example/
./run_pandora.sh

出力

out/

f:id:kazumaxneo:20220310005956p:plain

上のシェルスクリプトではまずグラフファイルがダウンロードされ、 pandora indexコマンドでindexingされる。それからpandora compareコマンドを使ってグラフにilluminaのシークエンシングリードを Quasi-mappingして、pangenome matrix とVCFファイルのコールを行う。さらにpangenome discoverコマンドでもDe novoバリアントのコールを行う。 make_prg updateコマンドで最初のグラフをアップデートしてVCF で使用するリファレンスを(全ての)サンプルにできるだけ近いものを推論する。それから再びpandora compareコマンドを使ってバリアントコールを行っている。

 

感想

1つの種の多数の株を使ったバリアント解析では、選んだ1つのリファレンス(代表)にマッピングして差異を検出します。この時、選んだリファレンスが近いか遠いかによってバリアントコールにバイアスが発生します。これは、一般に1つの種の細菌ゲノムでも、全ての種に共通しているコア遺伝子(ハウスキーピング遺伝子が含まれる)と稀にしか見つからないアクセサリ遺伝子に分かれ、選んだリファレンスによって、稀にしか見つからないアクセサリ遺伝子のマッピングができるかどうか変わってしまうためです(コア遺伝子の存在率をグラフにプロットすると、典型的には論文の図1のようにU字型のプロットが観察されます。詳細は論文のintroを読んで下さい)。アクセサリ遺伝子は多くの細菌ゲノムに見つかるため、どのリファレンスを選んでも、単独であるかぎりこの問題は解決しないと考えられます。この問題を、著者らは論文中でハード・リファレンス・バイアスと呼んでいます。Pandoraのアプローチは有用だと思います。

引用

Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs
Rachel M. Colquhoun, Michael B. Hall, Leandro Lima, Leah W. Roberts, Kerri M. Malone, Martin Hunt, Brice Letcher, Jane Hawkey, Sophie George, Louise Pankhurst & Zamin Iqbal 
Genome Biology volume 22, Article number: 267 (2021)

 

https://github.com/iqbal-lab-org/make_prg

 

関連