新しいパンゲノムグラフ構造であるpandoraと、バクテリアのパンゲノム全体のバリアントを同定するアルゴリズムを紹介する。バクテリアの適応性の多くは付属ゲノムに依存しているため、コアゲノムだけのSNPを解析する方法では満足のいく結果が得られない。Pandoraは、シーケンスされたゲノムをリファレンスのリコンビナントとして近似し、新しいバリエーションを検出し、複数のサンプルをパンジェノタイプする。578の大腸菌ゲノムからなるリファレンスグラフを用いて、20の多様な分離株を比較した。Pandoraは、単一のリファレンスベースのツールよりも多くの希少なSNPを回復し、最も近いRefSeqリファレンスを選ぶよりも有意に優れており、リファレンスバイアスなしに多様なサンプルを分析するための安定したフレームワークを提供する。
usage
https://github.com/rmcolq/pandora/wiki/Usage
Toy example
https://github.com/rmcolq/pandora/tree/master/example
レポジトリより
Pandoraは、パンゲノムリファレンスグラフ(PanRG)を用いたバクテリアゲノム解析のためのツールです。1つまたは複数のサンプルについて、遺伝子の有無の検出やSNP、indel、より長いバリアントのジェノタイピングが可能です。PandoraはIlluminaまたはNanoporeのデータで動作します。
インストール
#conda (link)
mamba create -n pandra -y
conda activate pandra
mamba install -c bioconda pandora -y
> pandora --help
# pandora --help
Pandora: Pan-genome inference and genotyping with long noisy or short accurate reads.
Usage: pandora [OPTIONS] SUBCOMMAND
Options:
-h,--help Print this help message and exit
-V,--version
Subcommands:
index Index population reference graph (PRG) sequences.
map Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and optionally genotype variants.
compare Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.
discover Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.
walk Outputs a path through the nodes in a PRG corresponding to the either an input sequence (if it exists) or the top/bottom path
seq2path For each sequence, return the path through the PRG
get_vcf_ref Outputs a fasta suitable for use as the VCF reference using input sequences
random Outputs a fasta of random paths through the PRGs
merge_index Allows multiple indices to be merged (no compatibility check)
> pandora index --help
# pandora index --help
Index population reference graph (PRG) sequences.
Usage: pandora index [OPTIONS] <PRG>
Positionals:
<PRG> FILE [required] PRG to index (in fasta format)
Options:
-h,--help Print this help message and exit
-w INT Window size for (w,k)-minimizers (must be <=k) [default: 14]
-k INT K-mer size for (w,k)-minimizers [default: 15]
-t,--threads INT Maximum number of threads to use [default: 1]
-o,--outfile FILE Filename for the index [default: <PRG>.kXX.wXX.idx]
-v Verbosity of logging. Repeat for increased verbosity
> pandora map --help
# pandora map --help
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample, and optionally genotype variants.
Usage: pandora map [OPTIONS] <TARGET> <QUERY>
Positionals:
<TARGET> FILE [required] An indexed PRG file (in fasta format)
<QUERY> FILE [required] Fast{a,q} file containing reads to quasi-map
Options:
-h,--help Print this help message and exit
-v Verbosity of logging. Repeat for increased verbosity
Indexing:
-w INT Window size for (w,k)-minimizers (must be <=k) [default: 14]
-k INT K-mer size for (w,k)-minimizers [default: 15]
Input/Output:
-o,--outdir DIR Directory to write output files to [default: "pandora"]
-t,--threads INT Maximum number of threads to use [default: 1]
--vcf-refs FILE Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in <TARGET> and the same name
--kg Save kmer graphs with forward and reverse coverage annotations for found loci
--loci-vcf Save a VCF file for each found loci
-C,--comparison-paths Save a fasta file for a random selection of paths through loci
-M,--mapped-reads Save a fasta file for each loci containing read parts which overlapped it
Parameter Estimation:
-e,--error-rate FLOAT Estimated error rate for reads [default: 0.11]
-g,--genome-size STR/INT Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
--bin Use binomial model for kmer coverages [default: negative binomial]
Mapping:
-m,--max-diff INT Maximum distance (bp) between consecutive hits within a cluster [default: 250]
-c,--min-cluster-size INT Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]
Preset:
-I,--illumina Reads are from Illumina. Alters error rate used and adjusts for shorter reads
Filtering:
--clean Add a step to clean and detangle the pangraph
--max-covg INT Maximum coverage of reads to accept [default: 300]
Consensus/Variant Calling:
--genotype Add extra step to carefully genotype sites.
--snps When genotyping, only include SNP sites
--kmer-avg INT Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]
Genotyping:
--local Needs: --genotype (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.
-a INT Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]
-s INT The minimum required total coverage for a site when genotyping [default: 0]
-D INT Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]
-F INT Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]
-E,--gt-error-rate FLOAT When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]
-G,--gt-conf INT Minimum genotype confidence (GT_CONF) required to make a call [default: 1]
> pandora compare --help
# pandora compare --help
Quasi-map reads from multiple samples to an indexed PRG, infer the sequence of present loci in each sample, and call variants between the samples.
Usage: pandora compare [OPTIONS] <TARGET> <QUERY_IDX>
Positionals:
<TARGET> FILE [required] An indexed PRG file (in fasta format)
<QUERY_IDX> FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample
Options:
-h,--help Print this help message and exit
-v Verbosity of logging. Repeat for increased verbosity
Indexing:
-w INT Window size for (w,k)-minimizers (must be <=k) [default: 14]
-k INT K-mer size for (w,k)-minimizers [default: 15]
Input/Output:
-o,--outdir DIR Directory to write output files to [default: "pandora"]
-t,--threads INT Maximum number of threads to use [default: 1]
--vcf-refs FILE Fasta file with a reference sequence to use for each loci. The sequence MUST have a perfect match in <TARGET> and the same name
--loci-vcf Save a VCF file for each found loci
Parameter Estimation:
-e,--error-rate FLOAT Estimated error rate for reads [default: 0.11]
-g,--genome-size STR/INT Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
--bin Use binomial model for kmer coverages [default: negative binomial]
Mapping:
-m,--max-diff INT Maximum distance (bp) between consecutive hits within a cluster [default: 250]
-c,--min-cluster-size INT Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]
Preset:
-I,--illumina Reads are from Illumina. Alters error rate used and adjusts for shorter reads
Filtering:
--clean Add a step to clean and detangle the pangraph
--max-covg INT Maximum coverage of reads to accept [default: 300]
Consensus/Variant Calling:
--genotype Add extra step to carefully genotype sites.
--kmer-avg INT Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]
Genotyping:
--local Needs: --genotype (Intended for developers) Use coverage-oriented (local) genotyping instead of the default ML path-oriented (global) approach.
-a INT Hard threshold for the minimum allele coverage allowed when genotyping [default: 0]
-s INT The minimum required total coverage for a site when genotyping [default: 0]
-D INT Minimum difference in coverage on a site required between the first and second maximum likelihood path [default: 0]
-F INT Minimum allele coverage, as a fraction of the expected coverage, allowed when genotyping [default: 0]
-E,--gt-error-rate FLOAT When genotyping, assume that coverage on alternative alleles arises as a result of an error process with rate -E. [default: 0.01]
-G,--gt-conf INT Minimum genotype confidence (GT_CONF) required to make a call [default: 1]
> pandora discover --help
# pandora discover --help
Quasi-map reads to an indexed PRG, infer the sequence of present loci in the sample and discover novel variants.
Usage: pandora discover [OPTIONS] <TARGET> <QUERY_IDX>
Positionals:
<TARGET> FILE [required] An indexed PRG file (in fasta format)
<QUERY_IDX> FILE [required] A tab-delimited file where each line is a sample identifier followed by the path to the fast{a,q} of reads for that sample
Options:
-h,--help Print this help message and exit
--discover-k INT:[0-32) K-mer size to use when discovering novel variants [default: 15]
--max-ins INT Max. insertion size for novel variants. Warning: setting too long may impair performance [default: 15]
--covg-threshold INT Positions with coverage less than this will be tagged for variant discovery [default: 3]
-l INT Min. length of consecutive positions below coverage threshold to trigger variant discovery [default: 1]
-L INT Max. length of consecutive positions below coverage threshold to trigger variant discovery [default: 30]
-d,--merge INT Merge candidate variant intervals within distance [default: 15]
-N INT Maximum number of candidate variants allowed for a candidate region [default: 25]
--min-dbg-dp INT Minimum node/kmer depth in the de Bruijn graph used for discovering variants [default: 2]
-v Verbosity of logging. Repeat for increased verbosity
Indexing:
-w INT Window size for (w,k)-minimizers (must be <=k) [default: 14]
-k INT K-mer size for (w,k)-minimizers [default: 15]
Input/Output:
-o,--outdir DIR Directory to write output files to [default: "pandora_discover"]
-t,--threads INT Maximum number of threads to use [default: 1]
--kg Save kmer graphs with forward and reverse coverage annotations for found loci
-M,--mapped-reads Save a fasta file for each loci containing read parts which overlapped it
Parameter Estimation:
-e,--error-rate FLOAT Estimated error rate for reads [default: 0.11]
-g,--genome-size STR/INT Estimated length of the genome - used for coverage estimation. Can pass string such as 4.4m, 100k etc. [default: 5000000]
--bin Use binomial model for kmer coverages [default: negative binomial]
Mapping:
-m,--max-diff INT Maximum distance (bp) between consecutive hits within a cluster [default: 250]
-c,--min-cluster-size INT Minimum size of a cluster of hits between a read and a loci to consider the loci present [default: 10]
Preset:
-I,--illumina Reads are from Illumina. Alters error rate used and adjusts for shorter reads
Filtering:
--clean Add a step to clean and detangle the pangraph
--clean-dbg Clean the local assembly de Bruijn graph
--max-covg INT Maximum coverage of reads to accept [default: 600]
Consensus/Variant Calling:
--kmer-avg INT Maximum number of kmers to average over when selecting the maximum likelihood path [default: 100]
> pandora walk --help
# pandora walk --help
Outputs a path through the nodes in a PRG corresponding to the either an input sequence (if it exists) or the top/bottom path
Usage: pandora walk [OPTIONS] <PRG>
Positionals:
<PRG> FILE [required] A PRG file (in fasta format)
Options:
-h,--help Print this help message and exit
-i,--input FILE Excludes: --top --bottom
Fast{a,q} of sequences to output paths through the PRG for
-T,--top Excludes: --input --bottom
Output the top path through each local PRG
-B,--bottom Excludes: --input --top
Output the bottom path through each local PRG
> pandora seq2path --help
# pandora seq2path --help
For each sequence, return the path through the PRG
Usage: pandora seq2path [OPTIONS] <PRG>
Positionals:
<PRG> FILE [required] PRG to index (in fasta format)
Options:
-h,--help Print this help message and exit
-i,--input FILE Excludes: --top --bottom
Fast{a,q} of sequences to output paths through the PRG for
-w INT Window size for (w,k)-minimizers (must be <=k) [default: 14]
-k INT K-mer size for (w,k)-minimizers [default: 15]
-T,--top Excludes: --input --bottom
Output the top path through each local PRG
-B,--bottom Excludes: --input --top
Output the bottom path through each local PRG
--flag Needs: --input output success/fail rather than the node path
-v Verbosity of logging. Repeat for increased verbosity
> pandora get_vcf_ref --help
# pandora get_vcf_ref --help
Outputs a fasta suitable for use as the VCF reference using input sequences
Usage: pandora get_vcf_ref [OPTIONS] <PRG> [<QUERY>]
Positionals:
<PRG> FILE [required] PRG to index (in fasta format)
<QUERY> FILE Fast{a,q} file of sequences to retrive the PRG reference for
Options:
-h,--help Print this help message and exit
-z,--compress Compress the output with gzip
-v Verbosity of logging. Repeat for increased verbosity
> pandora random --help
# pandora random --help
Outputs a fasta of random paths through the PRGs
Usage: pandora random [OPTIONS] <PRG>
Positionals:
<PRG> FILE [required] PRG to generate random paths from
Options:
-h,--help Print this help message and exit
-n INT Number of paths to output [default: 1]
-z,--compress Compress the output with gzip
-v Verbosity of logging. Repeat for increased verbosity
> pandora merge_index --help
# pandora merge_index --help
Allows multiple indices to be merged (no compatibility check)
Usage: pandora merge_index [OPTIONS] <IDX>...
Positionals:
<IDX> FILES [required] Indices to merge
Options:
-h,--help Print this help message and exit
-o,--outfile FILE Filename for merged index [default: "merged_index.idx"]
-v Verbosity of logging. Repeat for increased verbosity
必要であればMultiple Sequence AlignmentからPandoraやGramtoolsに入力するPRGを作成するmake_prgもインストールする。
git clone https://github.com/rmcolq/make_prg.git
cd make_prg
python -m pip install .
#MAFFT
mamba install -c bioconda -y mafft
> make_prg --help
# make_prg --help
usage: make_prg <subcommand> <options>
Subcommand entrypoint
optional arguments:
-h, --help show this help message and exit
-V, --version show program's version number and exit
-v, --verbose Run with high verbosity (debug level logging)
Available subcommands:
from_msa Make PRG from multiple sequence alignment
Nextflowを使って多数のデータを効率的に扱えるバージョンも用意されている(レポジトリの説明参照)。
実行方法
対象となる遺伝子やゲノム領域ごとに 1つのエントリを持つ、グラフの fastaファイルが必要。持っていないなら、各グラフのための多重配列アラインメントが必要になる。多くの種の異種遺伝子クラスターを表すMSAのコンパイル済みコレクションは、ここからダウンロードできる。
(panXサイトの下のcore|all gene alignments)
make_prgコマンドを使ってグラフを作成することもできる。
テストラン
git clone https://github.com/rmcolq/pandora.git
cd pandora/example/
./run_pandora.sh
出力
out/
上のシェルスクリプトではまずグラフファイルがダウンロードされ、 pandora indexコマンドでindexingされる。それからpandora compareコマンドを使ってグラフにilluminaのシークエンシングリードを Quasi-mappingして、pangenome matrix とVCFファイルのコールを行う。さらにpangenome discoverコマンドでもDe novoバリアントのコールを行う。 make_prg updateコマンドで最初のグラフをアップデートしてVCF で使用するリファレンスを(全ての)サンプルにできるだけ近いものを推論する。それから再びpandora compareコマンドを使ってバリアントコールを行っている。
感想
1つの種の多数の株を使ったバリアント解析では、選んだ1つのリファレンス(代表)にマッピングして差異を検出します。この時、選んだリファレンスが近いか遠いかによってバリアントコールにバイアスが発生します。これは、一般に1つの種の細菌ゲノムでも、全ての種に共通しているコア遺伝子(ハウスキーピング遺伝子が含まれる)と稀にしか見つからないアクセサリ遺伝子に分かれ、選んだリファレンスによって、稀にしか見つからないアクセサリ遺伝子のマッピングができるかどうか変わってしまうためです(コア遺伝子の存在率をグラフにプロットすると、典型的には論文の図1のようにU字型のプロットが観察されます。詳細は論文のintroを読んで下さい)。アクセサリ遺伝子は多くの細菌ゲノムに見つかるため、どのリファレンスを選んでも、単独であるかぎりこの問題は解決しないと考えられます。この問題を、著者らは論文中でハード・リファレンス・バイアスと呼んでいます。Pandoraのアプローチは有用だと思います。
引用
Pandora: nucleotide-resolution bacterial pan-genomics with reference graphs
Rachel M. Colquhoun, Michael B. Hall, Leandro Lima, Leah W. Roberts, Kerri M. Malone, Martin Hunt, Brice Letcher, Jane Hawkey, Sophie George, Louise Pankhurst & Zamin Iqbal
Genome Biology volume 22, Article number: 267 (2021)
https://github.com/iqbal-lab-org/make_prg
関連