macでインフォマティクス (Informatics on the Mac)

A collection of notes on informatics for HTS (NGS).

vclust: ultrafast, highly accurate sequence alignment, ANI calculation, and clustering of viral genomes

 

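Viromics produces millions of viral genomes and fragments every year, overwhelming conventional sequence comparison methods. Vclust is a new approach that determines average nucleotide identity (ANI) by Lempel-Ziv parsing and clusters viral genomes using thresholds endorsed by authoritative viral genomics and taxonomy consortia. Compared with existing tools, Vclust showed superior accuracy and efficiency, clustering millions of viral genomes within a few hours on a mid-range workstation.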

 

Web server

https://afproject.org/vclust/

To use it, specify a local FASTA file.

example output

 

Installation

GitHub

git clone --recurse-submodules https://github.com/refresh-bio/vclust
cd vclust && make -j

> ./vclust.py -h

 

usage: vclust.py [-v] [-h] {prefilter,align,cluster,info} ...

 

vclust.py v.1.2.7: calculate ANI and cluster virus (meta)genome sequences

 

positional arguments:

  {prefilter,align,cluster,info}

    prefilter           Prefilter genome pairs for alignment

    align               Align genome sequences and calculate ANI metrics

    cluster             Cluster genomes based on ANI thresholds

    info                Show information about the tool and its dependencies

 

options:

  -v, --version         Display the tool's version and exit

  -h, --help            Show this help message and exit

 

> ./vclust.py prefilter -h

usage: vclust.py prefilter -i <file> -o <file> [-k <int>] [--min-kmers <int>] [--min-ident <float>] [--batch-size <int>] [--kmers-fraction <float>] [--max-seqs <int>] [--keep_temp] [--bin <file>] [--bin-fasta <file>] [-t <int>] [-v] [-h]

 

required arguments:

  -i, --in <file>           Input FASTA file or directory with FASTA files

  -o, --out <file>          Output filename

 

options:

  -k, --k <int>             Size of k-mer for Kmer-db [25]

  --min-kmers <int>         Filter genome pairs based on minimum number of shared k-mers [10]

  --min-ident <float>       Filter genome pairs based on minimum sequence identity of the shorter sequence (0-1) [0.7]

  --batch-size <int>        Process a multifasta file in smaller batches of n FASTA sequences. This option reduces memory at the expense of speed. By default, no batch [0]

  --kmers-fraction <float>  Fraction of k-mers to analyze for each genome (0-1). A lower value reduces RAM usage and speeds up processing (affects sensitivity) [1.0]

  --max-seqs <int>          Maximum number of sequences allowed to pass the prefilter per query. Only the sequences with the highest identity to the query are reported. This option reduces RAM usage and speeds up processing (affects sensitivity). By default, all sequences that pass the prefilter are

                            reported [0]

  --keep_temp               Keep temporary Kmer-db files [False]

  --bin <file>              Path to the Kmer-db binary [/home/kazu/Documents/vclust/bin/kmer-db]

  --bin-fasta <file>        Path to the multi-fasta-split binary [/home/kazu/Documents/vclust/bin/multi-fasta-split]

  -t, --threads <int>       Number of threads (all by default) [64]

  -v, --verbose             Show Kmer-db progress

  -h, --help                Show this help message and exit

 

> ./vclust.py align -h

usage: vclust.py align -i <file> -o <file> [--filter <file>] [--filter-threshold <float>] [--outfmt <str>] [--out-aln <file>] [--out-ani <float>] [--out-tani <float>] [--out-gani <float>] [--out-qcov <float>] [--out-rcov <float>] [--bin <file>] [--mal <int>] [--msl <int>] [--mrd <int>] [--mqd <int>]

                       [--reg <int>] [--aw <int>] [--am <int>] [--ar <int>] [-t <int>] [-v] [-h]

 

required arguments:

  -i, --in <file>             Input FASTA file or directory with FASTA files

  -o, --out <file>            Output filename

 

options:

  --filter <file>             Path to filter file (output of prefilter)

  --filter-threshold <float>  Align genome pairs above the threshold (0-1) [0]

  --outfmt <str>              Output format [standard]

                              choices: lite,standard,complete

  --out-aln <file>            Write alignments to the specified tsv file (optional).

  --out-ani <float>           Min. ANI to output (0-1) [0]

  --out-tani <float>          Min. tANI to output (0-1) [0]

  --out-gani <float>          Min. gANI to output (0-1) [0]

  --out-qcov <float>          Min. query coverage (aligned fraction) to output (0-1) [0]

  --out-rcov <float>          Min. reference coverage (aligned fraction) to output (0-1) [0]

  --bin <file>                Path to the LZ-ANI binary [/home/kazu/Documents/vclust/bin/lz-ani]

  --mal <int>                 Min. anchor length [11]

  --msl <int>                 Min. seed length [7]

  --mrd <int>                 Max. dist. between approx. matches in reference [40]

  --mqd <int>                 Max. dist. between approx. matches in query [40]

  --reg <int>                 Min. considered region length [35]

  --aw <int>                  Approx. window length [15]

  --am <int>                  Max. no. of mismatches in approx. window [7]

  --ar <int>                  Min. length of run ending approx. extension [3]

  -t, --threads <int>         Number of threads (all by default) [64]

  -v, --verbose               Show LZ-ANI progress

  -h, --help                  Show this help message and exit

 

> ./vclust.py cluster -h

usage: vclust.py cluster -i <file> -o <file> --ids <file> [-r] [--algorithm <str>] [--metric <str>] [--tani <float>] [--gani <float>] [--ani <float>] [--qcov <float>] [--rcov <float>] [--len_ratio <float>] [--num_alns <int>] [--leiden-resolution <float>] [--leiden-beta <float>] [--leiden-iterations <int>]

                         [--bin <file>] [-v] [-h]

 

required arguments:

  -i, --in <file>              Input file with ANI metrics (tsv)

  -o, --out <file>             Output filename

  --ids <file>                 Input file with sequence identifiers (tsv)

 

options:

  -r, --out-repr               Output a representative genome for each cluster instead of numerical cluster identifiers. The representative genome is selected as the one with the longest sequence. [False]

  --algorithm <str>            Clustering algorithm [single]

                               * single: Single-linkage (connected component)

                               * complete: Complete-linkage

                               * uclust: UCLUST

                               * cd-hit: Greedy incremental

                               * set-cover: Greedy set-cover (MMseqs2)

                               * leiden: Leiden algorithm

  --metric <str>               Similarity metric for clustering [tani]

                               choices: tani,gani,ani

  --tani <float>               Min. total ANI (0-1) [0]

  --gani <float>               Min. global ANI (0-1) [0]

  --ani <float>                Min. ANI (0-1) [0]

  --qcov <float>               Min. query coverage/aligned fraction (0-1) [0]

  --rcov <float>               Min. reference coverage/aligned fraction (0-1) [0]

  --len_ratio <float>          Min. length ratio between shorter and longer sequence (0-1) [0]

  --num_alns <int>             Max. number of local alignments between two genomes; 0 means all genome pairs are allowed. [0]

  --leiden-resolution <float>  Resolution parameter for the Leiden algorithm [0.7]

  --leiden-beta <float>        Beta parameter for the Leiden algorithm [0.01]

  --leiden-iterations <int>    Number of iterations for the Leiden algorithm [2]

  --bin <file>                 Path to the Clusty binary [/home/kazu/Documents/vclust/bin/clusty]

  -v, --verbose                Show Clusty progress

  -h, --help                   Show this help message and exit

 

> ./vclust.py info 

Vclust               1.2.7

kmer-db              ok

multi-fasta-split    ok

lz-ani               ok

clusty               ok

 

 

 

Test run

The workflow has three steps, each with its own subcommand: prefiltering, alignment with ANI calculation, and clustering.
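The three steps can also be chained from a script. The following is a minimal sketch, not part of vclust itself: `build_vclust_commands` is a hypothetical helper name, and the file names (fltr.txt, ani.tsv, ani.ids.tsv, clusters.tsv) are simply the ones used in this post. Only flags that appear in the help output above are used.

```python
import shlex


def build_vclust_commands(fasta, metric="ani", threshold=0.95,
                          vclust="./vclust.py"):
    """Return the prefilter, align and cluster commands as argument lists.

    Hypothetical helper for illustration; file names follow this post.
    """
    return [
        # Step 1: prefilter similar genome pairs
        [vclust, "prefilter", "-i", fasta, "-o", "fltr.txt"],
        # Step 2: align the prefiltered pairs and compute ANI metrics
        [vclust, "align", "-i", fasta, "-o", "ani.tsv",
         "--filter", "fltr.txt"],
        # Step 3: cluster on the chosen metric and minimum threshold
        [vclust, "cluster", "-i", "ani.tsv", "-o", "clusters.tsv",
         "--ids", "ani.ids.tsv",
         "--metric", metric, f"--{metric}", str(threshold)],
    ]


if __name__ == "__main__":
    for cmd in build_vclust_commands("example/multifasta.fna"):
        # Print only; to execute, pass cmd to subprocess.run(cmd, check=True)
        print(shlex.join(cmd))
```

Building argument lists rather than one shell string avoids quoting problems when a FASTA path contains spaces.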

 

1. vclust.py prefilter

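Specify a multi-FASTA file of viral genomes. The prefilter subcommand selects pairs of similar genome sequences (based on shared k-mers) so that only those pairs are aligned in the next step.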

cd vclust
./vclust.py prefilter -i example/multifasta.fna -o fltr.txt
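To give a feel for what a k-mer prefilter does, here is a toy sketch. It is not how Kmer-db is implemented: it holds every k-mer in memory, ignores reverse complements, and the identity estimate (shared k-mers divided by the k-mers of the shorter genome) is a simplification chosen for this example. The function names are hypothetical; the default values mirror the `-k`, `--min-kmers` and `--min-ident` defaults shown in the help.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of distinct k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def prefilter_pairs(genomes, k=25, min_kmers=10, min_ident=0.7):
    """Keep genome pairs that share enough k-mers.

    genomes: dict mapping sequence name -> sequence string.
    Returns (name_a, name_b, shared_kmers, estimated_identity) tuples.
    """
    sets = {name: kmers(seq, k) for name, seq in genomes.items()}
    kept = []
    for a, b in combinations(sorted(genomes), 2):
        shared = len(sets[a] & sets[b])
        shorter = min(len(sets[a]), len(sets[b]))
        if shorter == 0:
            continue  # sequence shorter than k, nothing to compare
        ident = shared / shorter  # crude estimate, an assumption of this sketch
        if shared >= min_kmers and ident >= min_ident:
            kept.append((a, b, shared, round(ident, 3)))
    return kept


if __name__ == "__main__":
    toy = {
        "g1": "ACGTACGTTAGC",
        "g2": "ACGTACGTTAGG",
        "g3": "TTTTTTTTTTTT",
    }
    # Small k and thresholds because the toy sequences are tiny
    print(prefilter_pairs(toy, k=4, min_kmers=3, min_ident=0.5))
```

Only the g1/g2 pair survives here; g3 shares no k-mers with the others, so it would never reach the alignment step.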

 

2. vclust.py align

Align the similar genome pairs and calculate pairwise ANI. Pass the output file from step 1, "fltr.txt", with --filter.

./vclust.py align -i example/multifasta.fna -o ani.tsv --filter fltr.txt
  • --filter    Path to filter file (output of prefilter)

ani.tsv is written, along with ani.ids.tsv, which step 3 takes via --ids.
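The metric names in the help output (ANI, gANI, tANI, qcov, rcov) are easier to tell apart with numbers. The sketch below uses the definitions as I understand them: ANI is identical bases over aligned length, gANI is identical bases over genome length, tANI pools both alignment directions over the summed genome lengths, and coverage is aligned length over genome length. Treat these formulas as assumptions and check them against the vclust documentation; the function and argument names are hypothetical.

```python
def ani_metrics(ident_qr, aligned_qr, ident_rq, len_q, len_r):
    """Compute ANI-style metrics for one query/reference pair.

    ident_qr   identical bases in query->reference local alignments
    aligned_qr aligned query bases in those alignments
    ident_rq   identical bases in reference->query local alignments
    len_q      query genome length
    len_r      reference genome length
    """
    return {
        # identity within the aligned regions only
        "ani": ident_qr / aligned_qr if aligned_qr else 0.0,
        # identity relative to the whole query genome
        "gani": ident_qr / len_q,
        # both directions pooled over both genome lengths
        "tani": (ident_qr + ident_rq) / (len_q + len_r),
        # fraction of the query covered by alignments
        "qcov": aligned_qr / len_q,
    }


if __name__ == "__main__":
    m = ani_metrics(ident_qr=900, aligned_qr=1000, ident_rq=880,
                    len_q=2000, len_r=2500)
    for name, value in m.items():
        print(f"{name}\t{value:.4f}")
```

In this made-up case the aligned regions are 90% identical, but they cover only half of the query, so gANI and tANI are much lower than ANI. This is why the choice of `--metric` in the clustering step matters.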

> column -t ani.tsv
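For further processing, the table can be read with Python's csv module instead of eyeballing it. This is a sketch under assumptions: it presumes a tab-separated file with a header row containing `query`, `reference` and `ani` columns, so check the header of your own ani.tsv first. The rows in the example are made up for illustration and are not real vclust output.

```python
import csv
import io


def read_ani_pairs(handle, min_ani=0.95):
    """Return (query, reference, ani) tuples with ani >= min_ani.

    Assumes a TSV with header columns 'query', 'reference' and 'ani'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    pairs = []
    for row in reader:
        ani = float(row["ani"])
        if ani >= min_ani:
            pairs.append((row["query"], row["reference"], ani))
    return pairs


if __name__ == "__main__":
    # Made-up rows; with a real file use: open("ani.tsv", newline="")
    demo = io.StringIO(
        "query\treference\tani\n"
        "phageA\tphageB\t0.97\n"
        "phageA\tphageC\t0.81\n"
    )
    for query, reference, ani in read_ani_pairs(demo):
        print(query, reference, ani)
```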

 

3. vclust.py cluster

Cluster the genome sequences based on the chosen ANI metric and its minimum threshold.

./vclust.py cluster -i ani.tsv -o clusters.tsv --ids ani.ids.tsv --metric ani --ani 0.95
  • --metric    Similarity metric for clustering [tani]
                      choices: tani, gani, ani
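The default algorithm is described in the help as single-linkage (connected component): any two genomes joined by a chain of pairs above the threshold end up in the same cluster. The following union-find sketch illustrates that idea only. It is not Clusty's implementation, and the function name and the example pairs are made up.

```python
def single_linkage(ids, pairs, threshold=0.95):
    """Group genomes into connected components.

    ids:   iterable of genome identifiers
    pairs: iterable of (genome_a, genome_b, similarity) tuples
    Pairs with similarity >= threshold are treated as edges.
    """
    ids = list(ids)
    parent = {g: g for g in ids}

    def find(x):
        # Follow parent links to the root, shortening the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, sim in pairs:
        if sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a  # merge the two components

    clusters = {}
    for g in ids:
        clusters.setdefault(find(g), []).append(g)
    # Largest clusters first, ties broken by member names
    return sorted(clusters.values(), key=lambda c: (-len(c), c))


if __name__ == "__main__":
    genomes = ["A", "B", "C", "D"]
    made_up_pairs = [("A", "B", 0.97), ("B", "C", 0.96), ("C", "D", 0.80)]
    print(single_linkage(genomes, made_up_pairs))
```

Note that A and C land in the same cluster even though no A/C pair is listed: they are linked through B. Complete-linkage, also offered by `--algorithm`, would not allow that.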

> column -t clusters.tsv

 

Reference

Ultrafast and accurate sequence alignment and clustering of viral genomes

Andrzej Zielezinski, Adam Gudyś, Jakub Barylski, Krzysztof Siminski, Piotr Rozwalak, Bas E. Dutilh, Sebastian Deorowicz

bioRxiv, Posted July 02, 2024.