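(From the repository)
Galah aims to be a more scalable method for dereplicating metagenome assembled genomes (MAGs). That is, it clusters microbial genomes by average nucleotide identity (ANI) and selects one member of each cluster as its representative.
Because it uses greedy clustering, Galah can dereplicate genomes faster than tools such as dRep, particularly when many closely related genomes (ANI of 95% or higher) are present. The resulting cluster representatives have two properties. With the ANI threshold set to 99%: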
- Each representative is <99% ANI to every other representative.
- Every member is >=99% ANI to its representative.
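The two properties follow directly from how a greedy representative clustering works. The block below is a minimal Python sketch of that idea, not Galah's actual implementation (Galah is written in Rust and additionally preclusters with dashing or finch before calling FastANI). The function name `greedy_cluster`, the `lookup` helper, and the toy ANI values are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List


def greedy_cluster(
    genomes: List[str],
    ani: Callable[[str, str], float],
    threshold: float = 99.0,
) -> Dict[str, List[str]]:
    """Greedy representative clustering (illustrative sketch only).

    Genomes are visited in the order given. A genome joins the first
    existing representative it matches at >= threshold ANI; if it matches
    none, it becomes a new representative.
    """
    clusters: Dict[str, List[str]] = {}
    for genome in genomes:
        for representative in clusters:
            if ani(genome, representative) >= threshold:
                clusters[representative].append(genome)
                break
        else:
            # Below the threshold against every representative so far,
            # so representatives stay < threshold ANI from one another.
            clusters[genome] = [genome]
    return clusters


# Hypothetical pairwise ANI values (percent) for four toy genomes.
toy_ani = {
    frozenset(("A", "B")): 99.5,
    frozenset(("A", "C")): 96.0,
    frozenset(("A", "D")): 95.8,
    frozenset(("B", "C")): 96.2,
    frozenset(("B", "D")): 96.0,
    frozenset(("C", "D")): 99.1,
}


def lookup(a: str, b: str) -> float:
    """Return the toy ANI for an unordered pair of genomes."""
    return toy_ani[frozenset((a, b))]


clusters = greedy_cluster(["A", "B", "C", "D"], lookup, threshold=99.0)
print(clusters)
```

With these toy values, A and C end up as representatives (96.0% ANI to each other), while B and D join them at 99.5% and 99.1% respectively.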
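When CheckM genome qualities are supplied, the clusters have one further property.
- Each representative genome has a better quality score than the other members of its cluster.
Each genome is assigned a quality score using the formula completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000, a reduced version of the quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8.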
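As a worked example of the formula above, the following Python sketch computes the score for two made-up genomes. The function name `parks2020_reduced` mirrors the formula name shown in the help text, but the function itself and the input values are hypothetical and are not taken from Galah's code.

```python
def parks2020_reduced(
    completeness: float,
    contamination: float,
    num_contigs: int,
    num_ambiguous_bases: int,
) -> float:
    """Quality score following the reduced Parks et al. 2020 formula."""
    return (
        completeness
        - 5 * contamination
        - 5 * num_contigs / 100
        - 5 * num_ambiguous_bases / 100000
    )


# Hypothetical genomes: (completeness %, contamination %, contigs, ambiguous bases)
genome_a = parks2020_reduced(95.0, 1.0, 200, 10000)
genome_b = parks2020_reduced(98.0, 4.0, 50, 0)

print(genome_a)  # 95 - 5 - 10 - 0.5 = 79.5
print(genome_b)  # 98 - 20 - 2.5 - 0 = 75.5
```

Here genome_a scores higher even though it is less complete, because contamination is penalised five times as heavily as completeness is rewarded.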
Manual
https://wwood.github.io/galah/galah-cluster.html
Installation
Dependencies
- Dashing v0.4.0 https://github.com/dnbaker/dashing
- FastANI v1.31 https://github.com/ParBLiSS/FastANI
#conda
mamba create -n galah -y
conda activate galah
mamba install -c bioconda galah -y
#cargo
cargo install galah
> galah -h
galah 0.3.1
Ben J. Woodcroft <benjwoodcroft near gmail.com>
Metagenome assembled genome (MAG) dereplicator / clusterer
USAGE:
galah [FLAGS] [SUBCOMMAND]
FLAGS:
-h, --help Prints help information
-q, --quiet Unless there is an error, do not print logging information
-V, --version Prints version information
-v, --verbose Print extra debug logging information
SUBCOMMANDS:
cluster Cluster FASTA files by average nucleotide identity
cluster-validate Verify clustering results
help Prints this message or the help of the given subcommand(s)
> galah cluster --full-help
GALAH(CLUSTER) GALAH(CLUSTER)
NAME
galah cluster - Cluster genome FASTA files by average nucleotide identity (version 0.3.1)
SYNOPSIS
galah cluster <GENOME_INPUTS> <OUTPUT_ARGUMENTS>
DESCRIPTION
This cluster mode dereplicates genomes, choosing a subset of the input genomes as representatives. Required inputs are (1) a genome definition, and (2) an output format definition.
The source code for this program can be found at https://github.com/wwood/galah or https://github.com/wwood/coverm
GENOME INPUT
-f, --genome-fasta-files PATH ..
Path(s) to FASTA files of each genome e.g. pathA/genome1.fna pathB/genome2.fa.
-d, --genome-fasta-directory PATH
Directory containing FASTA files of each genome.
-x, --genome-fasta-extension EXT
File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]
--genome-fasta-list PATH
File containing FASTA file paths, one per line.
FILTERING PARAMETERS
--checkm-tab-table PATH
CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..) for defining genome quality, which is used both for filtering and to rank genomes during clustering.
--genome-info PATH
dRep style genome info table for defining quality. Used like --checkm-tab-table.
--min-completeness FLOAT
Ignore genomes with less completeness than this percentage. [default: not set]
--max-contamination FLOAT
Ignore genomes with more contamination than this percentage. [default: not set]
CLUSTERING PARAMETERS
--ani FLOAT
Overall ANI level to dereplicate at with FastANI. [default: 99]
--min-aligned-fraction FLOAT
Min aligned fraction of two genomes for clustering. [default: 50]
--fragment-length FLOAT
Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]
--quality-formula FORMULA
Scoring function for genome quality [default: Parks2020_reduced]. One of:
formula description
──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Parks2020_reduced (default) A quality formula described in Parks et. al. 2020 https://doi.org/10.1038/s41587-020-0501-8 (Supplementary Table 19) but only including those scoring criteria that can be calculated from the sequence without homology searching: completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000
completeness-4contamination completeness-4*contamination
completeness-5contamination completeness-5*contamination
dRep completeness-5*contamination+contamination*(strain_heterogeneity/100)+0.5*log10(N50)
--precluster-ani FLOAT
Require at least this dashing-derived ANI for preclustering and to avoid FastANI on distant lineages within preclusters. [default: 95]
--precluster-method NAME
method of calculating rough ANI for dereplication. 'dashing' for HyperLogLog, 'finch' for finch MinHash. [default: dashing]
OUTPUT
--output-cluster-definition PATH
Output a file of representative<TAB>member lines.
--output-representative-fasta-directory PATH
Symlink representative genomes into this directory.
--output-representative-fasta-directory-copy PATH
Copy representative genomes into this directory.
--output-representative-list PATH
Print newline separated list of paths to representatives into this file.
GENERAL PARAMETERS
-t, --threads INT
Number of threads. [default: 1]
-v, --verbose
Print extra debugging information
-q, --quiet
Unless there is an error, do not print log messages
-h, --help
Output a short usage message.
--full-help
Output a full help message and display in 'man'.
--full-help-roff
Output a full help message in raw ROFF format for conversion to other formats.
EXIT STATUS
0 Successful program execution.
1 Unsuccessful program execution.
101 The program panicked.
AUTHOR
Ben J. Woodcroft, Centre for Microbiome Research, Queensland University of Technology <benjwoodcroft near gmail.com>
Usage
Cluster genome assemblies at a 99% ANI cutoff. Raw FASTA files are supported as input.
galah cluster --genome-fasta-directory genome_dir --genome-fasta-extension fa --output-cluster-definition clusters.tsv --output-representative-fasta-directory-copy outdir --threads 20 --ani 99
- --output-cluster-definition Output a file of representative<TAB>member lines
- --genome-fasta-directory Directory containing FASTA files of each genome
- --genome-fasta-extension File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]
- --output-representative-fasta-directory-copy Copy representative genomes into this directory
- --threads Number of threads. [default: 1]
- --ani Overall ANI level to dereplicate at with FastANI. [default: 99]
- --min-aligned-fraction Min aligned fraction of two genomes for clustering. [default: 50]
- --fragment-length Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]
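The file written by --output-cluster-definition consists of representative<TAB>member lines, so it can be loaded into a dictionary with a few lines of Python. The sketch below is illustrative only: the function name `read_cluster_definition` and the example paths are hypothetical, and it assumes that each representative also appears as a member of its own cluster.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def read_cluster_definition(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group representative<TAB>member lines by representative."""
    clusters: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        representative, member = line.split("\t")
        clusters[representative].append(member)
    return dict(clusters)


# Hypothetical file contents, used here instead of a real clusters.tsv.
example_lines = [
    "genome_dir/a.fa\tgenome_dir/a.fa\n",
    "genome_dir/a.fa\tgenome_dir/b.fa\n",
    "genome_dir/c.fa\tgenome_dir/c.fa\n",
]

parsed = read_cluster_definition(example_lines)
for representative, members in parsed.items():
    print(representative, len(members))

# With a real output file:
# with open("clusters.tsv") as handle:
#     parsed = read_cluster_definition(handle)
```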
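quality formula (from the repository)
If CheckM results are also supplied, the representative of each cluster is guaranteed to have a better quality score, derived from the CheckM results, than the other members.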
galah cluster --genome-fasta-directory genome_dir --genome-fasta-extension fa --output-cluster-definition clusters.tsv --output-representative-fasta-directory-copy outdir --threads 20 --ani 99 --checkm-tab-table checkM.tsv
- --checkm-tab-table CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..) for defining genome quality, which is used both for filtering and to rank genomes during clustering
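When the same run is launched from a pipeline script, it can help to assemble the command as an argument list. The following Python sketch rebuilds the command above using only flags that appear in the help text. The helper name `build_galah_command` is hypothetical, and the call that would actually run Galah is left as a comment because it requires galah to be installed.

```python
import shlex
from typing import List, Optional


def build_galah_command(
    genome_dir: str,
    extension: str,
    cluster_tsv: str,
    representative_dir: str,
    threads: int = 1,
    ani: float = 99,
    checkm_tab_table: Optional[str] = None,
) -> List[str]:
    """Assemble a `galah cluster` argument list (illustrative helper)."""
    cmd = [
        "galah", "cluster",
        "--genome-fasta-directory", genome_dir,
        "--genome-fasta-extension", extension,
        "--output-cluster-definition", cluster_tsv,
        "--output-representative-fasta-directory-copy", representative_dir,
        "--threads", str(threads),
        "--ani", str(ani),
    ]
    if checkm_tab_table is not None:
        # Optional: rank and filter genomes using CheckM quality estimates.
        cmd += ["--checkm-tab-table", checkm_tab_table]
    return cmd


cmd = build_galah_command(
    "genome_dir", "fa", "clusters.tsv", "outdir",
    threads=20, ani=99, checkm_tab_table="checkM.tsv",
)
print(shlex.join(cmd))

# To execute it (requires galah on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```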
Citation
GitHub - wwood/galah: More scalable dereplication for metagenome assembled genomes
Related