galah: clustering (meta)genome assemblies by ANI

2024/03/21: added notes on installing v0.4.0 and on CheckM2

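(From the repository)

Galah aims to be a more scalable method for dereplicating metagenome-assembled genomes (MAGs). That is, it clusters microbial genomes by average nucleotide identity (ANI) and selects one member of each cluster as its representative.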

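Galah uses greedy clustering to speed up genome dereplication relative to tools such as dRep, particularly when there are many closely related genomes (ANI of 95% or higher). The resulting cluster representatives have two properties. With the ANI threshold set to 99%:

Each representative has <99% ANI to every other representative.
Every member has >=99% ANI to its representative.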
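
The two properties above follow naturally from a greedy pass over the genomes. The following is a minimal sketch of that idea, not galah's actual implementation (which preclusters and computes ANI itself). The function name greedy_cluster, the toy genome names, the precomputed ANI table, and the processing order are all assumptions made for illustration.

```shell
# Toy input: genomes in the order they should be considered (preferred first),
# and a precomputed pairwise ANI table. All names and values are made up.
order_file=$(mktemp)
ani_file=$(mktemp)
printf 'A\nB\nC\nD\n' > "$order_file"
printf 'A\tB\t99.5\nA\tC\t97.0\nB\tC\t97.2\nC\tD\t99.1\nA\tD\t96.8\nB\tD\t96.9\n' > "$ani_file"

# greedy_cluster THRESHOLD ORDER_FILE ANI_FILE  (hypothetical helper)
#   ORDER_FILE: one genome name per line
#   ANI_FILE:   genome1<TAB>genome2<TAB>ANI; missing pairs count as below threshold
# Prints representative<TAB>member lines.
greedy_cluster() {
    awk -F'\t' -v t="$1" '
        # First file: load the ANI table symmetrically.
        NR == FNR { ani[$1, $2] = $3; ani[$2, $1] = $3; next }
        # Second file: assign each genome to the first representative within threshold.
        {
            g = $1; rep = ""
            for (i = 1; i <= nrep; i++)
                if (((reps[i], g) in ani) && ani[reps[i], g] + 0 >= t + 0) { rep = reps[i]; break }
            # No representative is close enough, so g founds a new cluster.
            if (rep == "") { reps[++nrep] = g; rep = g }
            print rep "\t" g
        }' "$3" "$2"
}

greedy_cluster 99 "$order_file" "$ani_file"
```

With this toy table, A and C become representatives (97.0% ANI to each other), B joins A (99.5%), and D joins C (99.1%). Because a genome only founds a new cluster when it is below the threshold to every existing representative, representatives stay below 99% ANI to one another while every member is at or above 99% to its own representative.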

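If CheckM genome qualities are supplied, the clusters have one further property.

Each representative genome has a better quality score than the other members of its cluster. Each genome is assigned a quality score using the formula completeness-5*contamination-5*num_contigs/100-5*num_ambiguous_bases/100000, which is reduced from the quality formula described in Parks et al. 2020 https://doi.org/10.1038/s41587-020-0501-8.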
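
To get a feel for how the terms of the formula trade off, it can be evaluated by hand. The helper below is a hypothetical illustration (quality_score is not a galah command), and the input values are invented.

```shell
# quality_score COMPLETENESS CONTAMINATION NUM_CONTIGS NUM_AMBIGUOUS_BASES
# Hypothetical helper: evaluates the quality formula quoted above.
quality_score() {
    awk -v c="$1" -v x="$2" -v n="$3" -v a="$4" \
        'BEGIN { printf "%.2f\n", c - 5*x - 5*n/100 - 5*a/100000 }'
}

quality_score 95 2 50 0        # prints 82.50
quality_score 98 6 400 20000   # prints 47.00
```

The second genome is more complete, but its contamination and fragmentation penalties outweigh that, so the first genome would be preferred as a representative under this score.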

Manual

https://wwood.github.io/galah/galah-cluster.html

Installation

Dependencies

Dashing v0.4.0 https://github.com/dnbaker/dashing
FastANI v1.31 https://github.com/ParBLiSS/FastANI

GitHub

#conda
mamba create -n galah -y
conda activate galah
#v0.4.0
mamba install -c bioconda galah=0.4.0 -y


#cargo
cargo install galah

> galah -h
Metagenome assembled genome (MAG) dereplicator / clusterer

Usage: galah [OPTIONS] [COMMAND]

Commands:
  cluster-validate  Verify clustering results
  cluster           Cluster FASTA files by average nucleotide identity
  help              Print this message or the help of the given subcommand(s)

Options:
  -v, --verbose  Print extra debug logging information
      --quiet    Unless there is an error, do not print logging information
  -h, --help     Print help
  -V, --version  Print version

> galah cluster -h
galah cluster

Cluster (dereplicate) genomes

Example: Dereplicate at 95% (after pre-clustering at 90%) a directory of .fna
FASTA files and create a new directory of symlinked FASTA files of
representatives:

  galah cluster --genome-fasta-directory input_genomes/
    --output-representative-fasta-directory output_directory/

Example: Dereplicate a set of genomes with paths specified in genomes.txt at
95% ANI, after a preclustering at 90% using the MinHash finch method, and
output the cluster definition to clusters.tsv:

  galah cluster --ani 95 --precluster-ani 90 --precluster-method finch
    --genome-fasta-list genomes.txt
    --output-cluster-definition clusters.tsv

See galah cluster --full-help for further options and further detail.

> galah cluster-validate -h
Verify clustering results

Usage: galah cluster-validate [OPTIONS] --cluster-file <cluster-file>

Options:
      --cluster-file <cluster-file>
          Output of 'cluster' subcommand
      --ani <ani>
          ANI to validate against [default: 99]
      --min-aligned-fraction <min-aligned-fraction>
          Min aligned fraction of two genomes for clustering [default: 50]
  -t, --threads <threads>
          [default: 1]
  -v, --verbose
          Print extra debug logging information
      --quiet
          Unless there is an error, do not print logging information
  -h, --help
          Print help

Usage

Cluster genome assemblies at a 99% ANI cutoff. Galah reads raw FASTA files directly.

galah cluster --genome-fasta-directory genome_dir --genome-fasta-extension fa --output-cluster-definition clusters.tsv --output-representative-fasta-directory-copy outdir --threads 20 --ani 99

--output-cluster-definition Output a file of representative<TAB>member lines
--genome-fasta-directory Directory containing FASTA files of each genome
--genome-fasta-extension File extension of genomes in the directory specified with -d/--genome-fasta-directory. [default: fna]
--output-representative-fasta-directory-copy Copy representative genomes into this directory
--threads Number of threads. [default: 1]
--ani Overall ANI level to dereplicate at with FastANI. [default: 99]
--min-aligned-fraction Min aligned fraction of two genomes for clustering. [default: 50]
--fragment-length Length of fragment used in FastANI calculation (i.e. --fragLen). [default: 3000]
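
Since the cluster definition is a plain file of representative<TAB>member lines, cluster sizes can be tallied with standard tools. This is a minimal sketch on a made-up example file. The helper name cluster_sizes is hypothetical, and the example assumes each representative also appears as a member of its own cluster, so the count is simply the number of lines per representative.

```shell
# Build a small example in the representative<TAB>member layout.
# File names are invented for illustration.
example_clusters=$(mktemp)
printf 'a.fa\ta.fa\na.fa\tb.fa\nc.fa\tc.fa\n' > "$example_clusters"

# cluster_sizes FILE  (hypothetical helper)
# Prints representative<TAB>line_count, sorted by representative name.
cluster_sizes() {
    awk -F'\t' '{ n[$1]++ } END { for (r in n) printf "%s\t%d\n", r, n[r] }' "$1" | sort
}

cluster_sizes "$example_clusters"
```

Pointing cluster_sizes at a real clusters.tsv gives a quick view of how many genomes collapsed into each representative.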

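quality formula (from the repository)

If CheckM results are also supplied, the representative of each cluster is guaranteed to have a better CheckM-based quality score than the other members.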

galah cluster --genome-fasta-directory genome_dir --genome-fasta-extension fa --output-cluster-definition clusters.tsv --output-representative-fasta-directory-copy outdir --threads 20 --ani 99 --checkm-tab-table checkM.tsv

# Addendum: using CheckM2 (v0.4.0 and later)
galah cluster --genome-fasta-directory genome_dir --genome-fasta-extension fa --output-cluster-definition clusters.tsv --output-representative-fasta-directory-copy outdir --threads 20 --ani 99 --checkm2-quality-report quality_report.tsv

--checkm-tab-table CheckM tab table (i.e. the output of checkm .. --tab_table -f PATH ..) for defining genome quality, which is used both for filtering and to rank genomes during clustering
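
Before passing a quality table, it can be worth checking that every genome in the input directory has a row in it. The sketch below does this under two assumptions: that the first column of the table holds the genome name, and that this name equals the FASTA file name without its extension. The helper name missing_quality and the example files are hypothetical.

```shell
# Toy input: two genome files and a quality table that only covers one of them.
demo_dir=$(mktemp -d)
: > "$demo_dir/g1.fa"
: > "$demo_dir/g2.fa"
printf 'Name\tCompleteness\tContamination\ng1\t95.0\t1.2\n' > "$demo_dir/quality_report.tsv"

# missing_quality GENOME_DIR EXTENSION QUALITY_TSV  (hypothetical helper)
# Prints genome names (file name minus extension) that have no row in the table.
missing_quality() {
    for f in "$1"/*."$2"; do
        [ -e "$f" ] || continue
        name=$(basename "$f" ".$2")
        # Skip the header line, then look for an exact match in column 1.
        awk -F'\t' -v n="$name" 'NR > 1 && $1 == n { found = 1 } END { exit !found }' "$3" \
            || printf '%s\n' "$name"
    done
}

missing_quality "$demo_dir" fa "$demo_dir/quality_report.tsv"
```

In this toy case only g2 is reported, since g1 has a row in the table.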

Reference

GitHub - wwood/galah: More scalable dereplication for metagenome assembled genomes