Informatics on Mac (macでインフォマティクス)

A collection of notes on informatics for HTS (NGS).

skani: ultra-fast and robust ANI comparison of MAGs

 

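Sequence comparison tools for metagenome-assembled genomes (MAGs) struggle with high-volume or low-quality data. The authors present skani (https://github.com/bluenote-1577/skani), a method that determines average nucleotide identity (ANI) using sparse approximate alignments. For fragmented, incomplete MAGs, skani outperforms FastANI in both accuracy and speed (more than 20x faster). skani can query genomes against more than 65,000 prokaryotic genomes in seconds using 6 GB of memory, enabling higher-resolution insights into large, noisy metagenomic datasets.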

 

 

basic usage guide

skani basic usage guide · bluenote-1577/skani Wiki · GitHub

 

Installation

Build dependencies

  • The Rust programming language and its associated tools such as cargo are required and assumed to be in PATH.
  • A C compiler (e.g. GCC)
  • make

GitHub

git clone https://github.com/bluenote-1577/skani
cd skani
# If default rust install directory is ~/.cargo
cargo install --path . --root ~/.cargo

#conda
mamba install -c bioconda skani

#binary
wget https://github.com/bluenote-1577/skani/releases/download/latest/skani
chmod +x skani
./skani -h

> skani 

skani 0.3.0

fast, robust ANI calculation and database searching for metagenomic contigs and assemblies.

 

Quick ANI calculation:

skani dist genome1.fa genome2.fa

 

Memory-efficient database search:

skani sketch genomes/* -o database; skani search -d database query1.fa query2.fa ...

 

All-to-all comparison:

skani triangle genomes/*

 

USAGE:

    skani <SUBCOMMAND>

 

OPTIONS:

    -h, --help       Print help information

    -V, --version    Print version information

 

SUBCOMMANDS:

    dist        Compute ANI for queries against references fasta files or pre-computed sketch

                    files. Usage: skani dist query.fa ref1.fa ref2.fa ... or use -q/--ql and -r/--rl

                    options

    search      Search queries against a large pre-sketched database of reference genomes in a

                    memory efficient manner. Usage: skani search -d sketch_folder query1.fa

                    query2.fa ...

    sketch      Sketch (index) genomes. Usage: skani sketch genome1.fa genome2.fa ... -o

                    new_sketch_folder

    triangle    Compute a lower triangular ANI/AF matrix. Usage: skani triangle genome1.fa

                    genome2.fa genome3.fa ...

> skani dist -h

skani-dist 

Compute ANI for queries against references fasta files or pre-computed sketch files. Usage: skani

dist query.fa ref1.fa ref2.fa ... or use -q/--ql and -r/--rl options

 

USAGE:

    skani dist [OPTIONS] <QUERY|-q <QUERIES>...|--ql <QUERY_LIST>> [--] [REFERENCE]...

 

OPTIONS:

    -h, --help          Print help information

    -t <THREADS>        Number of threads [default: 3]

 

INPUTS:

    -q <QUERIES>...              Query fasta(s) or sketch(es)

        --qi                     Use individual sequences for the QUERY in a multi-line fasta

        --ql <QUERY_LIST>        File with each line containing one fasta/sketch file

    -r <REFERENCES>...           Reference fasta(s) or sketch(es)

        --ri                     Use individual sequences for the REFERENCE in a multi-line fasta

        --rl <REFERENCE_LIST>    File with each line containing one fasta/sketch file

    <QUERY>                  Query fasta or sketch

    <REFERENCE>...           Reference fasta(s) or sketch(es)

 

OUTPUT:

    -o <OUTPUT>                        Output file name; rewrites file by default [default: output

                                       to stdout]

        --min-af <MIN_AF>              Only output ANI values where one genome has aligned fraction

                                       > than this value. [default: 15]

        --both-min-af <BOTH_MIN_AF>    Only output ANI values where both genomes have aligned

                                       fraction > than this value. [default: disabled]

        --ci                           Output [5%,95%] ANI confidence intervals using percentile

                                       bootstrap on the putative ANI distribution

        --detailed                     Print additional info including contig N50s and more

    -n <N>                             Max number of results to show for each query. [default:

                                       unlimited]

        --short-header                 Only display the first part of contig names (before first

                                       whitespace)

 

PRESETS:

        --fast             Faster skani mode; 2x faster and less memory. Less accurate AF and less

                           accurate ANI for distant genomes, but works ok for high N50 and > 95%

                           ANI. Alias for -c 200

        --medium           Medium skani mode; 2x slower and more memory. More accurate AF and more

                           accurate ANI for moderately fragmented assemblies (< 10kb N50). Alias for

                           -c 70

        --slow             Slower skani mode; 4x slower and more memory. Gives much more accurate AF

                           for distant genomes. More accurate ANI for VERY fragmented assemblies (<

                           3kb N50), but less accurate ANI otherwise. Alias for -c 30

        --small-genomes    Mode for small genomes such as viruses or plasmids (< 20 kb). Can be much

                           faster for large data, but is slower/less accurate on bacterial-sized

                           genomes. Alias for: -c 30 -m 200 --faster-small

 

ALGORITHM PARAMETERS:

    -c <C>                   Compression factor (k-mer subsampling rate). [default: 125]

        --faster-small       Filter genomes with < 20 marker k-mers more aggressively. Much faster

                             for many small genomes but may miss some comparisons

    -m <MARKER_C>            Marker k-mer compression factor. Markers are used for filtering.

                             Consider decreasing to ~200-300 if working with small genomes (e.g.

                             plasmids or viruses). [default: 1000]

        --median             Estimate median identity instead of average (mean) identity

        --no-learned-ani     Disable regression model for ANI prediction. [default: learned ANI used

                             for c >= 70 and >= 150,000 bases aligned and not on individual contigs]

        --no-marker-index    Do not use hash-table inverted index for faster ANI filtering.

                             [default: load index if > 100 query files or using the --qi option]

        --robust             Estimate mean after trimming off 10%/90% quantiles

    -s <S>                   Screen out pairs with *approximately* < % identity using k-mer

                             sketching. [default: 80]

 

MISC:

        --trace    Trace level verbosity

    -v, --debug    Debug level verbosity

> skani search -h

skani-search 

Search queries against a large pre-sketched database of reference genomes in a memory efficient

manner. Usage: skani search -d sketch_folder query1.fa query2.fa ...

 

USAGE:

    skani search [OPTIONS] -d <DATABASE> <QUERY|-q <QUERIES>...|--ql <QUERY_LIST>> [--]

 

OPTIONS:

    -h, --help          Print help information

    -t <THREADS>        Number of threads [default: 3]

 

INPUTS:

    -d <DATABASE>            Output folder from `skani sketch`

    -q <QUERIES>...          Query fasta(s) or sketch(es)

        --qi                 Use individual sequences for the QUERY in a multi-line fasta

        --ql <QUERY_LIST>    File with each line containing one fasta/sketch file

    <QUERY>...           Query fasta(s) or sketch(es)

 

OUTPUT:

    -o <OUTPUT>                        Output file name; rewrites file by default [default: output

                                       to stdout]

        --both-min-af <BOTH_MIN_AF>    Only output ANI values where both genomes have aligned

                                       fraction > than this value. [default: disabled]

        --ci                           Output [5%,95%] ANI confidence intervals using percentile

                                       bootstrap on the putative ANI distribution

        --detailed                     Print additional info including contig N50s and more

        --min-af <MIN_AF>              Only output ANI values where one genome has aligned fraction

                                       > than this value. [default: 15]

    -n <N>                             Max number of results to show for each query. [default:

                                       unlimited]

        --short-header                 Only display the first part of contig names (before first

                                       whitespace)

 

ALGORITHM PARAMETERS:

        --keep-refs          Keep reference sketches in memory if the sketch passes the marker

                             filter. Takes more memory but is much faster when querying many similar

                             sequences

        --median             Estimate median identity instead of average (mean) identity

        --no-learned-ani     Disable regression model for ANI prediction. [default: learned ANI used

                             for c >= 70 and >= 150,000 bases aligned and not on individual contigs]

        --no-marker-index    Do not use hash-table inverted index for faster ANI filtering.

                             [default: load index if > 100 query files or using the --qi option]

        --robust             Estimate mean after trimming off 10%/90% quantiles

    -s <S>                   Screen out pairs with *approximately* < % identity using k-mer

                             sketching. [default: 80]

 

MISC:

        --trace    Trace level verbosity

    -v, --debug    Debug level verbosity

> skani triangle -h

skani-triangle 

Compute a lower triangular ANI/AF matrix. Usage: skani triangle genome1.fa genome2.fa genome3.fa ...

 

USAGE:

    skani triangle [OPTIONS] <-l <FASTA_LIST>|FASTA_FILES>

 

OPTIONS:

    -h, --help          Print help information

    -t <THREADS>        Number of threads [default: 3]

 

INPUTS:

    -i                      Use individual sequences instead the entire file for multi-fastas

    -l <FASTA_LIST>         File with each line containing one fasta/sketch file

    <FASTA_FILES>...    Fasta(s) or sketch(es)

 

OUTPUT:

    -o <OUTPUT>                        Output file name; rewrites file by default [default: output

                                       to stdout]

        --both-min-af <BOTH_MIN_AF>    Only output ANI values where both genomes have aligned

                                       fraction > than this value. [default: disabled]

        --ci                           Output [5%,95%] ANI confidence intervals using percentile

                                       bootstrap on the putative ANI distribution. Only works with

                                       --sparse or -E

        --detailed                     Print additional info including contig N50s and more

        --diagonal                     Output the diagonal of the ANI matrix (i.e. self-self

                                       comparisons) for both dense and sparse matrices

        --distance                     Output 100 - ANI instead of ANI, creating a distance instead

                                       of a similarity matrix. No effect if using --sparse or -E

    -E, --sparse                       Output comparisons in a row-by-row form (i.e. sparse matrix)

                                       in the same form as `skani dist`. Only pairs with aligned

                                       fraction > --min-af are output

        --full-matrix                  Output full matrix instead of lower-triangular matrix

        --min-af <MIN_AF>              Only output ANI values where one genome has aligned fraction

                                       > than this value. [default: 15]

        --short-header                 Only display the first part of contig names (before first

                                       whitespace)

 

PRESETS:

        --fast             Faster skani mode; 2x faster and less memory. Less accurate AF and less

                           accurate ANI for distant genomes, but works ok for high N50 and > 95%

                           ANI. Alias for -c 200

        --medium           Medium skani mode; 2x slower and more memory. More accurate AF and more

                           accurate ANI for moderately fragmented assemblies (< 10kb N50). Alias for

                           -c 70

        --slow             Slower skani mode; 4x slower and more memory. Gives much more accurate AF

                           for distant genomes. More accurate ANI for VERY fragmented assemblies (<

                           3kb N50), but less accurate ANI otherwise. Alias for -c 30

        --small-genomes    Mode for small genomes such as viruses or plasmids (< 20 kb). Can be much

                           faster for large data, but is slower/less accurate on bacterial-sized

                           genomes. Alias for: -c 30 -m 200 --faster-small

 

ALGORITHM PARAMETERS:

    -c <C>                  Compression factor (k-mer subsampling rate). [default: 125]

        --faster-small      Filter genomes with < 20 marker k-mers more aggressively. Much faster

                            for many small genomes but may miss some comparisons

    -m <MARKER_C>           Marker k-mer compression factor. Markers are used for filtering.

                            Consider decreasing to ~200-300 if working with small genomes (e.g.

                            plasmids or viruses). [default: 1000]

        --median            Estimate median identity instead of average (mean) identity

        --no-learned-ani    Disable regression model for ANI prediction. [default: learned ANI used

                            for c >= 70 and >= 150,000 bases aligned and not on individual contigs]

        --robust            Estimate mean after trimming off 10%/90% quantiles

    -s <S>                  Screen out pairs with *approximately* < % identity using k-mer

                            sketching. [default: 80]

 

MISC:

        --trace    Trace level verbosity

    -v, --debug    Debug level verbosity

> skani sketch -h

skani-sketch 

Sketch (index) genomes. Usage: skani sketch genome1.fa genome2.fa ... -o new_sketch_folder

 

USAGE:

    skani sketch [OPTIONS] -o <OUTPUT> <FASTA_FILES|-l <FASTA_LIST>>

 

OPTIONS:

    -h, --help          Print help information

    -t <THREADS>        Number of threads [default: 3]

 

INPUT/OUTPUT:

    -o <OUTPUT>                Output folder where sketch files are placed

    -i                         Use individual sequences instead the entire file for multi-fastas

    -l <FASTA_LIST>            File with each line containing one fasta/sketch file

        --separate-sketches    Create separate .sketch files instead of consolidated database

                               format. DOES NOT WORK WITH -i

    <FASTA_FILES>...       fastas to sketch

 

PRESETS:

        --fast      Faster skani mode; 2x faster and less memory. Less accurate AF and less accurate

                    ANI for distant genomes, but works ok for high N50 and > 95% ANI. Alias for -c

                    200

        --medium    Medium skani mode; 2x slower and more memory. More accurate AF and more accurate

                    ANI for moderately fragmented assemblies (< 10kb N50). Alias for -c 70

        --slow      Slower skani mode; 4x slower and more memory. Gives much more accurate AF for

                    distant genomes. More accurate ANI for VERY fragmented assemblies (< 3kb N50),

                    but less accurate ANI otherwise. Alias for -c 30

 

SKETCH PARAMETERS:

    -c <C>               Compression factor (k-mer subsampling rate). [default: 125]

    -m <MARKER_C>        Marker k-mer compression factor. Markers are used for filtering. Consider

                         decreasing to ~200-300 if working with small genomes (e.g. plasmids or

                         viruses). [default: 1000]

 

MISC:

        --trace    Trace level verbosity

    -v, --debug    Debug level verbosity

 (Mind the version: sketch files created with an older version are not recognized.)

 

Usage

To compare two genomes, pass the two FASTA files one after the other (the order does not affect the result).

git clone https://github.com/bluenote-1577/skani.git
cd skani/test_files/
skani dist e.coli-EC590.fasta e.coli-K12.fasta 

Example output

ANI 99.39, Alignment fraction  91.89
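
For scripting, the table that `skani dist` prints can be read programmatically. The sketch below assumes a tab-separated layout with the header names `Ref_file`, `Query_file`, `ANI`, `Align_fraction_ref` and `Align_fraction_query`; these names are an assumption, so check them against the header your skani version actually prints. The sample row uses placeholder file names and made-up numbers, not real output.

```python
import csv
import io

# Hypothetical sample in the ASSUMED `skani dist` layout (tab-separated, one header row).
# File names and values are placeholders, not real skani output.
SAMPLE = (
    "Ref_file\tQuery_file\tANI\tAlign_fraction_ref\tAlign_fraction_query\n"
    "refA.fa\tqueryA.fa\t99.00\t90.00\t88.00\n"
)


def read_dist(handle):
    """Return one dict per comparison, with ANI/AF converted to floats."""
    rows = []
    for rec in csv.DictReader(handle, delimiter="\t"):
        rows.append(
            {
                "ref": rec["Ref_file"],
                "query": rec["Query_file"],
                "ani": float(rec["ANI"]),
                "af_ref": float(rec["Align_fraction_ref"]),
                "af_query": float(rec["Align_fraction_query"]),
            }
        )
    return rows


rows = read_dist(io.StringIO(SAMPLE))
for r in rows:
    print(f"{r['query']} vs {r['ref']}: ANI {r['ani']:.2f}, AF(query) {r['af_query']:.2f}")
```

With a real run, replace `io.StringIO(SAMPLE)` with `open("results.txt")` after writing the table with `-o results.txt`.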

 

Compare multiple genomes and write the results to results.txt.

skani dist -t 3 -q e.coli-W.fasta e.coli-K12.fasta -r e.coli-EC590.fasta e.coli-W.fasta -o results.txt

# Wildcards are also supported
skani dist -t 3 -q genome*fa -r e.coli-EC590.fasta -o results.txt
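
When there are too many files for a comfortable command line, the help text above lists `--ql` and `--rl`, each taking a file with one FASTA/sketch path per line. A minimal sketch of preparing such list files and assembling the command follows; the helper names `write_list` and `dist_command` are hypothetical, and the command is only built, not executed.

```python
import tempfile
from pathlib import Path


def write_list(paths, list_path):
    """Write one path per line, the format the help text describes for --ql/--rl."""
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return str(list_path)


def dist_command(query_list, ref_list, output, threads=3):
    """Assemble (but do not run) a `skani dist` command line as an argv list."""
    return ["skani", "dist", "-t", str(threads),
            "--ql", query_list, "--rl", ref_list, "-o", output]


with tempfile.TemporaryDirectory() as tmp:
    ql = write_list(["e.coli-W.fasta", "e.coli-K12.fasta"], Path(tmp) / "queries.txt")
    rl = write_list(["e.coli-EC590.fasta", "e.coli-W.fasta"], Path(tmp) / "refs.txt")
    cmd = dist_command(ql, rl, "results.txt")
    listed = Path(ql).read_text().splitlines()
    print(" ".join(cmd))
```

Once skani is on PATH, the list can be handed to `subprocess.run(cmd, check=True)`.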

[Figure: example output]

 

Build a database and search against it. Specify the database directory with -d.

skani sketch fasta/* -o database
# => creates database/

# Search the queries against the database
skani search query1.fa query2.fa -d database

[Figure: example output]

 

Build an ANI matrix for all genome pairs in a folder. Adding -E switches to sparse output (one pair per row). Combine it with --min-af to write out only meaningful ANI values.

skani triangle fasta/* > skani_ani_matrix.txt
  • -E    Output comparisons in a row-by-row form (i.e. sparse matrix) in the same form as `skani dist`. Only pairs with aligned fraction > --min-af are output 
  • --full-matrix     Output full matrix instead of lower-triangular matrix
  • --min-af <MIN_AF>   Only output ANI values where one genome has aligned fraction > than this value. [default: 15]
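
The default lower-triangular output is awkward to use directly in other tools, so it can be expanded into a full symmetric matrix. The sketch below assumes a PHYLIP-style layout: the first line gives the number of genomes, and row i holds a name followed by i tab-separated values. That layout is an assumption, so compare it with your own skani_ani_matrix.txt first. The sample names and values are made up.

```python
def read_lower_triangle(text):
    """Parse an ASSUMED PHYLIP-style lower-triangular matrix.

    Returns (names, full) where full is a symmetric list of lists.
    Diagonal cells stay None because they are not present in this layout
    (a diagonal written with --diagonal would be ignored by this parser).
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0])
    names = []
    full = [[None] * n for _ in range(n)]
    for i, line in enumerate(lines[1:n + 1]):
        fields = line.split("\t")
        names.append(fields[0])
        # Row i carries values against genomes 0..i-1.
        for j, value in enumerate(fields[1:i + 1]):
            full[i][j] = full[j][i] = float(value)
    return names, full


# Hypothetical sample, not real skani output.
SAMPLE = "3\ngenomeA.fa\ngenomeB.fa\t99.10\ngenomeC.fa\t0.00\t97.50\n"

names, full = read_lower_triangle(SAMPLE)
for name, row in zip(names, full):
    print(name, row)
```

Running `skani triangle` with `--full-matrix` avoids the need for this step when only the square matrix is wanted.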

[Figure: example output]

 

A script is provided to visualize the matrix file as a clustered heatmap (requires python3, seaborn, scipy/numpy, and matplotlib).

python scripts/clustermap_triangle.py skani_ani_matrix.txt 

 

Pre-sketched databases for GTDB R226 can be downloaded and used directly as the sketched database.

wget http://faust.compbio.cs.cmu.edu/skani-files/skani_gtdb_r226-v0.3.tar.gz
tar -zxvf skani_gtdb_r226-v0.3.tar.gz
skani search my_genome.fa -d skani_gtdb_r226-v0.3 -o results.tsv

The tar.gz file is 37 GB and occupies 57 GB of disk space once extracted. A query with a single genome took 22 seconds (Xeon E5-2680 v4).

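Results are sorted by pairwise ANI in descending order. The GTDB representative database contains only one representative genome per species, so limiting the output to the top hit with the -n option listed in the help above (e.g. -n 1) may be sufficient; that is enough when the goal is to find the species matching the query.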
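
If the full result table has already been written, the best hit per query can also be picked afterwards. A minimal sketch, assuming the same tab-separated header names (`Ref_file`, `Query_file`, `ANI`) as `skani dist`-style output; the reference names and ANI values in the sample are placeholders, not real GTDB hits.

```python
import csv
import io

# Hypothetical search results in the ASSUMED column layout; values are made up.
SAMPLE = (
    "Ref_file\tQuery_file\tANI\tAlign_fraction_ref\tAlign_fraction_query\n"
    "rep_genome_1.fna\tquery1.fa\t98.70\t85.00\t90.00\n"
    "rep_genome_2.fna\tquery1.fa\t91.20\t60.00\t62.00\n"
    "rep_genome_3.fna\tquery2.fa\t84.50\t30.00\t28.00\n"
    "rep_genome_1.fna\tquery2.fa\t96.10\t80.00\t79.00\n"
)


def top_hits(handle):
    """Map each query file to its (reference, ANI) pair with the highest ANI."""
    best = {}
    for rec in csv.DictReader(handle, delimiter="\t"):
        query = rec["Query_file"]
        ani = float(rec["ANI"])
        # Do not rely on the input order; keep the maximum explicitly.
        if query not in best or ani > best[query][1]:
            best[query] = (rec["Ref_file"], ani)
    return best


best = top_hits(io.StringIO(SAMPLE))
for query, (ref, ani) in best.items():
    print(f"{query}\t{ref}\t{ani:.2f}")
```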

 

Other notes

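  • skani uses an approximate mapping method without base-level alignment to compute average nucleotide identity (ANI) and aligned fraction (AF) for DNA sequences (contigs/MAGs/genomes) with ANI > ~80%.
  • Pure sketching methods (e.g. Mash) can underestimate ANI for incomplete MAGs, whereas skani stays accurate even for incomplete, medium-quality metagenome-assembled genomes (MAGs).
  • Indexing/sketching is about 3x faster than Mash, and querying is about 25x faster than FastANI (though slower than Mash).
  • Efficient database search: a query against a preprocessed database of more than 65,000 prokaryotic genomes takes seconds with a single processor and about 6 GB of RAM. Building the database from genome sequences takes from a few minutes to about an hour.
  • Note that old .sketch files can no longer be used in v0.3.0.
  • In v0.3.0, a single consolidated database is created by default instead of individual .sketch files. The previous behavior can be reproduced with the --separate-sketches option.
  • -c (the compression factor, i.e. the k-mer subsampling rate) tunes accuracy and sensitivity. The default is -c 125.
  • --slow (-c 30): for small, fragmented, or low-ANI genomes. --medium (-c 70): for moderately fragmented or more distant genomes; it may slightly underestimate ANI for high-quality genomes. --fast (-c 200): fast with low memory use; suited to N50 of 10 kb or more and ANI > 95%, and a reasonable choice when AF is not needed.
  • The --small-genomes option is provided for viruses and small plasmids. It sets -c 30 -m 200 --faster-small in one go. For viruses of 3 kb or more, consider something like -m 150.
  • The -s option sets the approximate identity threshold used for screening. The default is 80%; raise -s to restrict comparisons to closely related genomes only.
  • With default settings, results are reliable for ANI of 82% and above (provided AF > 15%).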
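
The preset and reliability notes above can be condensed into a small helper. This is only one reading of those rules of thumb, not an official skani recommendation; the function names, the argument names, and the exact order of the checks are assumptions made for illustration.

```python
def suggest_preset(n50_bp, genome_size_bp, expected_ani=None, need_af=True):
    """Suggest a skani preset flag from the rules of thumb in the notes above.

    Thresholds follow the help text (< 20 kb genomes, < 3 kb and < 10 kb N50);
    the ordering of the checks is an assumption of this sketch.
    Returns "" when the default (-c 125) seems appropriate.
    """
    if genome_size_bp < 20_000:
        return "--small-genomes"
    if n50_bp < 3_000:
        return "--slow"
    if n50_bp < 10_000:
        return "--medium"
    if expected_ani is not None and expected_ani > 95 and not need_af:
        return "--fast"
    return ""


def is_reliable(ani, af_ref, af_query, min_ani=82.0, min_af=15.0):
    """Apply the 'ANI >= 82% with AF > 15%' guideline for default settings.

    AF is checked on the larger of the two values, mirroring how --min-af
    is described (one genome must exceed the threshold).
    """
    return ani >= min_ani and max(af_ref, af_query) > min_af


print(suggest_preset(2_000, 4_000_000))
print(suggest_preset(50_000, 4_000_000, expected_ani=98, need_af=False))
print(is_reliable(90.0, 40.0, 10.0))
```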

Citation

Fast and robust metagenomic sequence comparison through sparse chaining with skani

Jim Shaw & Yun William Yu 

Nature Methods volume 20, pages 1661–1665 (2023)

 

Related