vclust: ultrafast, accurate sequence alignment, ANI calculation, and clustering of viral genomes

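Viromics produces millions of viral genomes and fragments every year, overwhelming traditional sequence comparison methods. Vclust is a new approach that determines average nucleotide identity (ANI) by Lempel-Ziv parsing and clusters viral genomes using thresholds endorsed by authoritative viral genomics and taxonomy consortia. Vclust showed better accuracy and efficiency than existing tools, clustering millions of viral genomes in a few hours on a mid-range workstation.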

Web server

https://afproject.org/vclust/

To use it, specify a FASTA file on your local machine.

example output

Installation

GitHub

git clone --recurse-submodules https://github.com/refresh-bio/vclust
cd vclust && make -j

> ./vclust.py -h

usage: vclust.py [-v] [-h] {prefilter,align,cluster,info} ...

vclust.py v.1.2.7: calculate ANI and cluster virus (meta)genome sequences

positional arguments:

{prefilter,align,cluster,info}

prefilter Prefilter genome pairs for alignment

align Align genome sequences and calculate ANI metrics

cluster Cluster genomes based on ANI thresholds

info Show information about the tool and its dependencies

options:

-v, --version Display the tool's version and exit

-h, --help Show this help message and exit

> ./vclust.py prefilter -h

usage: vclust.py prefilter -i <file> -o <file> [-k <int>] [--min-kmers <int>] [--min-ident <float>] [--batch-size <int>] [--kmers-fraction <float>] [--max-seqs <int>] [--keep_temp] [--bin <file>] [--bin-fasta <file>] [-t <int>] [-v] [-h]

required arguments:

-i, --in <file> Input FASTA file or directory with FASTA files

-o, --out <file> Output filename

options:

-k, --k <int> Size of k-mer for Kmer-db [25]

--min-kmers <int> Filter genome pairs based on minimum number of shared k-mers [10]

--min-ident <float> Filter genome pairs based on minimum sequence identity of the shorter sequence (0-1) [0.7]

--batch-size <int> Process a multifasta file in smaller batches of n FASTA sequences. This option reduces memory at the expense of speed. By default, no batch [0]

--kmers-fraction <float> Fraction of k-mers to analyze for each genome (0-1). A lower value reduces RAM usage and speeds up processing (affects sensitivity) [1.0]

--max-seqs <int> Maximum number of sequences allowed to pass the prefilter per query. Only the sequences with the highest identity to the query are reported. This option reduces RAM usage and speeds up processing (affects sensitivity). By default, all sequences that pass the prefilter are reported [0]

--keep_temp Keep temporary Kmer-db files [False]

--bin <file> Path to the Kmer-db binary [/home/kazu/Documents/vclust/bin/kmer-db]

--bin-fasta <file> Path to the multi-fasta-split binary [/home/kazu/Documents/vclust/bin/multi-fasta-split]

-t, --threads <int> Number of threads (all by default) [64]

-v, --verbose Show Kmer-db progress

-h, --help Show this help message and exit

> ./vclust.py align -h

usage: vclust.py align -i <file> -o <file> [--filter <file>] [--filter-threshold <float>] [--outfmt <str>] [--out-aln <file>] [--out-ani <float>] [--out-tani <float>] [--out-gani <float>] [--out-qcov <float>] [--out-rcov <float>] [--bin <file>] [--mal <int>] [--msl <int>] [--mrd <int>] [--mqd <int>] [--reg <int>] [--aw <int>] [--am <int>] [--ar <int>] [-t <int>] [-v] [-h]

required arguments:

-i, --in <file> Input FASTA file or directory with FASTA files

-o, --out <file> Output filename

options:

--filter <file> Path to filter file (output of prefilter)

--filter-threshold <float> Align genome pairs above the threshold (0-1) [0]

--outfmt <str> Output format [standard]

choices: lite,standard,complete

--out-aln <file> Write alignments to the specified tsv file (optional).

--out-ani <float> Min. ANI to output (0-1) [0]

--out-tani <float> Min. tANI to output (0-1) [0]

--out-gani <float> Min. gANI to output (0-1) [0]

--out-qcov <float> Min. query coverage (aligned fraction) to output (0-1) [0]

--out-rcov <float> Min. reference coverage (aligned fraction) to output (0-1) [0]

--bin <file> Path to the LZ-ANI binary [/home/kazu/Documents/vclust/bin/lz-ani]

--mal <int> Min. anchor length [11]

--msl <int> Min. seed length [7]

--mrd <int> Max. dist. between approx. matches in reference [40]

--mqd <int> Max. dist. between approx. matches in query [40]

--reg <int> Min. considered region length [35]

--aw <int> Approx. window length [15]

--am <int> Max. no. of mismatches in approx. window [7]

--ar <int> Min. length of run ending approx. extension [3]

-t, --threads <int> Number of threads (all by default) [64]

-v, --verbose Show LZ-ANI progress

-h, --help Show this help message and exit

> ./vclust.py cluster -h

usage: vclust.py cluster -i <file> -o <file> --ids <file> [-r] [--algorithm <str>] [--metric <str>] [--tani <float>] [--gani <float>] [--ani <float>] [--qcov <float>] [--rcov <float>] [--len_ratio <float>] [--num_alns <int>] [--leiden-resolution <float>] [--leiden-beta <float>] [--leiden-iterations <int>] [--bin <file>] [-v] [-h]

required arguments:

-i, --in <file> Input file with ANI metrics (tsv)

-o, --out <file> Output filename

--ids <file> Input file with sequence identifiers (tsv)

options:

-r, --out-repr Output a representative genome for each cluster instead of numerical cluster identifiers. The representative genome is selected as the one with the longest sequence. [False]

--algorithm <str> Clustering algorithm [single]

* single: Single-linkage (connected component)

* complete: Complete-linkage

* uclust: UCLUST

* cd-hit: Greedy incremental

* set-cover: Greedy set-cover (MMseqs2)

* leiden: Leiden algorithm

--metric <str> Similarity metric for clustering [tani]

choices: tani,gani,ani

--tani <float> Min. total ANI (0-1) [0]

--gani <float> Min. global ANI (0-1) [0]

--ani <float> Min. ANI (0-1) [0]

--qcov <float> Min. query coverage/aligned fraction (0-1) [0]

--rcov <float> Min. reference coverage/aligned fraction (0-1) [0]

--len_ratio <float> Min. length ratio between shorter and longer sequence (0-1) [0]

--num_alns <int> Max. number of local alignments between two genomes; 0 means all genome pairs are allowed. [0]

--leiden-resolution <float> Resolution parameter for the Leiden algorithm [0.7]

--leiden-beta <float> Beta parameter for the Leiden algorithm [0.01]

--leiden-iterations <int> Number of iterations for the Leiden algorithm [2]

--bin <file> Path to the Clusty binary [/home/kazu/Documents/vclust/bin/clusty]

-v, --verbose Show Clusty progress

-h, --help Show this help message and exit

> ./vclust.py info

Vclust 1.2.7

kmer-db ok

multi-fasta-split ok

lz-ani ok

clusty ok

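Test run

Three subcommands are provided, one per step: prefiltering, alignment with ANI calculation, and clustering.

1. vclust.py prefilter

Specify a multi-FASTA file of viral genomes. The prefilter subcommand selects the pairs of similar genome sequences, so that only those pairs are aligned in the next step.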

cd vclust
./vclust.py prefilter -i example/multifasta.fna -o fltr.txt
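To make the prefilter idea concrete, here is a minimal sketch of filtering genome pairs by shared k-mers. It is not how Kmer-db works internally: the function names are hypothetical, and the identity value is a simplifying assumption (the fraction of the shorter genome's distinct k-mers that are shared). Only the parameter names and defaults (k = 25, min-kmers = 10, min-ident = 0.7) come from the help text above.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter_pairs(genomes, k=25, min_kmers=10, min_ident=0.7):
    """Return (id_a, id_b, shared, ident) for genome pairs passing both cutoffs.

    genomes: dict mapping sequence id -> nucleotide string.
    ident is a crude proxy (an assumption of this sketch): shared k-mers
    divided by the number of distinct k-mers in the smaller k-mer set.
    """
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    passed = []
    for a, b in combinations(sorted(sets), 2):
        shared = len(sets[a] & sets[b])
        shorter = min(len(sets[a]), len(sets[b]))
        if shorter == 0:
            continue  # sequence shorter than k: nothing to compare
        ident = shared / shorter
        if shared >= min_kmers and ident >= min_ident:
            passed.append((a, b, shared, ident))
    return passed
```

Pairs that fail either cutoff are never aligned, which is where the time savings come from; lowering the cutoffs trades speed for sensitivity.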

2. vclust.py align

Align the similar genome pairs and calculate pairwise ANI. Pass fltr.txt, the output file from step 1, with --filter.

./vclust.py align -i example/multifasta.fna -o ani.tsv --filter fltr.txt

--filter Path to filter file (output of prefilter)

ani.tsv is written, together with ani.ids.tsv, the sequence identifier file that step 3 takes via --ids.

> column -t ani.tsv
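The metrics reported by the align step (ANI, gANI, tANI, and query/reference coverage) can be illustrated with a small sketch. The definitions below are my reading of the vclust documentation and should be treated as assumptions: ANI is identical bases over aligned length, gANI is identical bases over genome length, coverage is aligned length over genome length, and tANI combines both directions over the summed genome lengths. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DirectedAlignment:
    """Summed local-alignment statistics for one query -> reference direction."""
    identical: int  # identical nucleotides across all local alignments
    aligned: int    # total length of the local alignments
    query_len: int  # full length of the query genome


def ani(d):
    """Identity within the aligned regions only."""
    return d.identical / d.aligned if d.aligned else 0.0


def gani(d):
    """Identity relative to the whole query genome."""
    return d.identical / d.query_len


def coverage(d):
    """Aligned fraction of the query genome."""
    return d.aligned / d.query_len


def tani(ab, ba):
    """Symmetric identity: both directions over the summed genome lengths."""
    return (ab.identical + ba.identical) / (ab.query_len + ba.query_len)
```

Under these definitions gANI equals ANI multiplied by coverage, so a pair can have a high ANI but a low gANI when only a small part of the genome aligns.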

3. vclust.py cluster

Cluster the genome sequences based on the chosen ANI metric and its minimum threshold.

./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95

--metric Similarity metric for clustering [tani]
choices: tani, gani, ani
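The default algorithm, single-linkage (connected component), can be sketched with a union-find structure: any two genomes linked by a pair at or above the threshold end up in the same cluster. This is an illustration only, not Clusty's implementation; the function name and input layout are hypothetical.

```python
def single_linkage(ids, pairs, min_ani=0.95):
    """Group ids into connected components of the 'similar enough' graph.

    ids:   iterable of sequence identifiers
    pairs: iterable of (id_a, id_b, ani_value) tuples
    Returns a sorted list of sorted clusters.
    """
    parent = {x: x for x in ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, value in pairs:
        if value >= min_ani:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    clusters = {}
    for x in parent:
        clusters.setdefault(find(x), []).append(x)
    return sorted(sorted(members) for members in clusters.values())
```

Note the chaining behaviour: if A-B and B-C pass the threshold, A and C share a cluster even when A-C itself was never above it. The other algorithms listed in the help output (complete-linkage, greedy approaches, Leiden) exist partly to limit that effect.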

> column -t clusters.tsv
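The three steps can be chained from a script. The sketch below only assembles the same command lines used in this post and runs them in order; the helper names, the `workdir` argument, and the default path to `vclust.py` are assumptions for illustration.

```python
import subprocess
from pathlib import Path


def build_commands(fasta, workdir=".", metric="ani", threshold=0.95,
                   vclust="./vclust.py"):
    """Return the prefilter, align, and cluster commands as argument lists."""
    out = Path(workdir)
    fltr = str(out / "fltr.txt")
    ani = str(out / "ani.tsv")
    ids = str(out / "ani.ids.tsv")  # passed to cluster via --ids
    clusters = str(out / "clusters.tsv")
    return [
        [vclust, "prefilter", "-i", fasta, "-o", fltr],
        [vclust, "align", "-i", fasta, "-o", ani, "--filter", fltr],
        [vclust, "cluster", "-i", ani, "-o", clusters, "--ids", ids,
         "--metric", metric, f"--{metric}", str(threshold)],
    ]


def run_pipeline(fasta, **kwargs):
    """Run the three steps in order, stopping at the first failure."""
    for cmd in build_commands(fasta, **kwargs):
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline("example/multifasta.fna")` from inside the cloned vclust directory would reproduce the test run above. Because `--metric` and the threshold flag share the same name (tani, gani, or ani), switching metrics only requires changing the `metric` argument.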

Citation

Ultrafast and accurate sequence alignment and clustering of viral genomes

Andrzej Zielezinski, Adam Gudyś, Jakub Barylski, Krzysztof Siminski, Piotr Rozwalak, Bas E. Dutilh, Sebastian Deorowicz

bioRxiv, Posted July 02, 2024.
