AGC: a compact representation of assembled genomes

2025/3/26: added social media posts

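A high-quality sequence assembly is the ultimate representation of an individual's complete genetic information. Several ongoing pangenome projects are producing collections of high-quality assemblies for various species. Presented here is a method that stores sequenced genomes in two to three orders of magnitude less space and makes it simple and fast to extract any contig or part of one.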

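Rapidly advancing long-read sequencing technologies, such as those from Pacific Biosciences and Oxford Nanopore, have made haplotype-resolved assembly of haploid and diploid genomes routine. Researchers have also begun sequencing and de novo assembling multiple samples from the same species [ref. 7, 11, 2, 10, 5]. For example, the Human Pangenome Reference Consortium (HPRC) has released 94 haplotype assemblies and plans to produce a further 600 over the next few years [ref. 11]. These haplotype assemblies encode not only small variants but also complex structural variation such as segmental duplications and centromeres, which makes a thorough study of genomic sequence variation possible for the first time. At present, collections of similar genomes are compressed with general-purpose tools such as gzip. Despite the high similarity among the genomes, these tools achieve only about a fourfold compression ratio. NAF [ref. 9], HRCM [ref. 14], and MBGC [ref. 6] are among the few compression tools that exploit this similarity and handle de novo assemblies. However, they aim only to reduce transfer and archiving costs and cannot extract individual sequences without decompressing the whole archive. As a result, users must still keep uncompressed data for everyday analysis, which severely limits the practical use of these tools.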

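The paper introduces AGC (Assembled Genomes Compressor), a highly efficient compression method for collections of assembled genome sequences of the same species. A compressed collection can easily be extended with new samples. AGC gives fast access to a requested contig or sample without decompressing the other sequences. The tool is implemented as a command-line application, and the data can also be accessed through programming libraries for C, C++, and Python.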

From GitHub

Assembled Genomes Compressor (AGC) is a tool designed to compress collections of de novo assembled genomes. It can be used for many kinds of datasets, from short genomes (viruses) to long ones (human).

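The compression ratio is especially high for high-quality genomes. For example, 96 haplotype sequences from the Human Pangenome Project (47 samples), the GRCh 38 reference, and the CHM13 v.1.1 assembly, about 290 Gb in total, were reduced to less than 1.5 GB. agc can extract a single sample or contig in a few seconds, so the compressed samples remain easy to access. Compression is also fast: on an AMD TR 3990X-based machine (using 32 threads), compressing the HPP collection took about 12 minutes.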

AGC 3.2 (assembled genome compressor) has been released. Better speed, better ratio (at least for bacteria genomes), optional low-memory decompression. https://t.co/vQVN3K5IP9
— Sebastian Deorowicz (@sdeorowicz) November 25, 2024

Working with collections of assembled genomes? If you need a compact storage (e.g., 2 orders of mag. smaller than FASTA) with fast access to any contig or its part take a look. Many thanks to @lh3lh3 for collaboration! https://t.co/RNCdWYxQJP
— Sebastian Deorowicz (@sdeorowicz) April 8, 2022

This is how similar sequences from a pangenome project should be stored and accessed in future. Try it if you have the type of data. @sdeorowicz's algorithm and implementation are so impressive. Another pleasant collaboration with him. https://t.co/5eZXAE6t1E
— Heng Li (@lh3lh3) April 8, 2022

Installation

Tested in a conda environment on Ubuntu 18.

The release contains a set of precompiled binaries for Windows, Linux, and OS X.

Github

#conda(link)
mamba create -n agc -y
conda activate agc
mamba install -c bioconda agc -y

> agc

AGC (Assembled Genomes Compressor) v. 2.0 [build 20220405.1]

Usage: agc <command> [options]

Command:

create - create archive from FASTA files

append - add FASTA files to existing archive

getcol - extract all samples from archive

getset - extract sample from archive

getctg - extract contig from archive

listset - list sample names in archive

listctg - list sample and contig names in archive

info - show some statistics of the compressed data

Note: run agc <command> to see command-specific options

> agc create

AGC (Assembled Genomes Compressor) v. 2.0 [build 20220405.1]

Usage: agc create [options] <ref.fa> [<in1.fa> ...] > <out.agc>

Options:

-a - adaptive mode (default: false)

-b <int> - batch size (default: 50; min: 1; max: 1000000000)

-c - concatenated genomes in a single file (default: false)

-d - do not store cmd-line (default: true)

-i <file_name> - file with FASTA file names (alterantive to listing file names explicitely in command line)

-k <int> - k-mer length(default: 31; min: 17; max: 32)

-l <int> - min. match length (default: 20; min: 15; max: 32)

-o <file_name> - output to file (default: output is sent to stdout)

-s <int> - expected segment size (default: 60000; min: 100; max: 1000000)

-t <int> - no of threads (default: 64; min: 1; max: 128)

-v <int> - verbosity level (default: 0; min: 0; max: 2)

> agc getcol

AGC (Assembled Genomes Compressor) v. 2.0 [build 20220405.1]

Usage: agc getcol [options] <in.agc> > <out.fa>

Options:

-l <int> - line length (default: 80; min: 40; max: 2000000000)

-o <output_path> - output to files at path (default: output is sent to stdout)

-t <int> - no of threads (default: 64; min: 1; max: 128)

-v <int> - verbosity level (default: 0; min: 0; max: 2)

Usage

Compress three genomes into a collection. Gzip-compressed sequences are also supported.

agc create -t 16 ref.fa in1.fa in2.fa > col.agc

-t no of threads (default: 64; min: 1; max: 128)
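When a collection holds many genomes, naming every FASTA on the command line gets unwieldy. The `-i` option in the help above takes a file of FASTA names instead. A minimal sketch, assuming the reference is still passed as the positional argument (as in the usage line); `create_from_list` and the file names in the example call are hypothetical, and only the `-t`, `-i`, and `-o` flags come from the help text:

```shell
# Sketch: build an AGC archive from a list file instead of naming
# every sample FASTA on the command line.
# create_from_list is a hypothetical helper, not part of AGC.
create_from_list() {
    ref=$1      # reference FASTA (positional argument of agc create)
    out=$2      # archive to write (-o)
    list=$3     # text file that will hold one FASTA path per line (-i)
    shift 3
    # The remaining arguments are the sample FASTA files.
    printf '%s\n' "$@" > "$list"
    agc create -t 16 -i "$list" -o "$out" "$ref"
}

# Example call (file names are placeholders):
# create_from_list ref.fa col.agc fn.txt in1.fa in2.fa
```

Keeping the list in a file also leaves a record of which inputs went into the archive.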

Add genomes to a collection.

agc append in.agc in3.fa in4.fa > out.agc
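Because `append` writes a new archive to stdout rather than modifying its input, updating a collection in place takes two steps. A minimal sketch that writes to a temporary file and replaces the original only if `agc append` succeeds; `append_in_place` is a hypothetical helper, not part of AGC:

```shell
# Sketch: append genomes and replace the archive only on success.
# append_in_place is a hypothetical wrapper around `agc append`.
append_in_place() {
    archive=$1
    shift
    tmp="$archive.tmp.$$"   # temporary output next to the archive
    if agc append "$archive" "$@" > "$tmp"; then
        # Success: the new archive takes the place of the old one.
        mv "$tmp" "$archive"
    else
        # Failure: keep the original archive and discard partial output.
        rm -f "$tmp"
        return 1
    fi
}

# Example call (file names are placeholders):
# append_in_place col.agc in3.fa in4.fa
```

Writing to a separate file first means an interrupted run cannot leave a truncated archive behind.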

Decompress the entire collection.

agc getcol col.agc > out.fa
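Decompressing everything is rarely necessary, since the point of AGC is random access. The `getctg` command listed in the help extracts individual contigs. A minimal sketch, assuming the `contig@sample` and `contig@sample:from-to` query forms described in the AGC README (run `agc getctg` to confirm the exact syntax and coordinate convention for your version); `contig_query`, `extract_region`, and all names in the example call are hypothetical:

```shell
# Sketch: build a getctg query string from its parts.
# Assumed query forms: contig, contig@sample, contig@sample:from-to.
contig_query() {
    contig=$1
    sample=${2:-}
    from=${3:-}
    to=${4:-}
    query=$contig
    # Qualify the contig with a sample name when one is given.
    if [ -n "$sample" ]; then
        query="$query@$sample"
    fi
    # Add a coordinate range only when both ends are given.
    if [ -n "$from" ] && [ -n "$to" ]; then
        query="$query:$from-$to"
    fi
    printf '%s\n' "$query"
}

# Extract one contig (or part of it) from an archive to stdout.
extract_region() {
    archive=$1
    shift
    agc getctg "$archive" "$(contig_query "$@")"
}

# Example call (names and coordinates are placeholders):
# extract_region col.agc chr1 genome1 1000 2000 > region.fa
```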

List the genomes in a collection.

#genome names
agc listset col.agc

#contig names
agc listctg in.agc genome1 genome2
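The sample listing combines naturally with `getset` to split an archive back into one FASTA per sample. A minimal sketch, assuming `agc listset` prints one sample name per line and `agc getset <archive> <sample>` writes that sample's FASTA to stdout; `extract_each_sample` and the output directory are hypothetical:

```shell
# Sketch: write every sample of an archive to its own FASTA file.
# extract_each_sample is a hypothetical helper, not part of AGC.
extract_each_sample() {
    archive=$1
    outdir=$2
    mkdir -p "$outdir" || return 1
    agc listset "$archive" | while IFS= read -r sample; do
        # Skip blank lines in the listing.
        [ -n "$sample" ] || continue
        # Redirect stdin so getset cannot consume the sample list.
        agc getset "$archive" "$sample" < /dev/null > "$outdir/$sample.fa"
    done
}

# Example call (paths are placeholders):
# extract_each_sample col.agc extracted
```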

Show information about a collection.

agc info col.agc

No. samples : 1

k-mer length : 31

Min. match length: 20

Batch size : 50

Command lines:

: agc create -t 16

genome/Homo_sapiens.GRCh38.dna.primary_assembly.fa

From GitHub

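Input FASTA files may optionally be gzipped, but for performance reasons an uncompressed reference FASTA file is recommended.
If all samples are supplied in a single file (concatenated genomes mode), the reference must be supplied in a separate file.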
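Concatenated genomes mode corresponds to the `-c` flag in the `agc create` help. A minimal sketch, assuming the reference comes first and the multi-sample FASTA second, as in the usage line; `create_concatenated` and the file names are hypothetical:

```shell
# Sketch: create an archive from a reference plus one FASTA that
# holds all samples (concatenated genomes mode, -c).
# create_concatenated is a hypothetical helper, not part of AGC.
create_concatenated() {
    ref=$1       # reference FASTA, kept in its own file
    samples=$2   # single FASTA containing all samples
    out=$3       # archive to write (-o)
    # The reference must not be the same file as the samples.
    if [ "$ref" = "$samples" ]; then
        echo "reference and samples must be separate files" >&2
        return 1
    fi
    agc create -c -o "$out" "$ref" "$samples"
}

# Example call (file names are placeholders):
# create_concatenated ref.fa samples.fa col.agc
```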

The parameters let you trade compressed size against decompression time. The effect of the most important options is explained on GitHub.
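The tunable parameters appear in the `agc create` help shown earlier: k-mer length (`-k`), minimum match length (`-l`), expected segment size (`-s`), and batch size (`-b`). A minimal sketch that makes them explicit, falling back to the defaults printed by the help. Which direction each parameter shifts the size/speed balance is covered in the GitHub documentation and is not assumed here; `create_tuned` and the `AGC_*` variables are hypothetical names:

```shell
# Sketch: run agc create with the main tuning parameters spelled out.
# Defaults mirror the values printed by `agc create` (v. 2.0 help).
# create_tuned and the AGC_* variables are hypothetical.
create_tuned() {
    out=$1
    shift
    # Remaining arguments: reference FASTA followed by sample FASTAs.
    agc create \
        -k "${AGC_K:-31}" \
        -l "${AGC_L:-20}" \
        -s "${AGC_S:-60000}" \
        -b "${AGC_B:-50}" \
        -o "$out" "$@"
}

# Example call with a larger expected segment size (placeholder value):
# AGC_S=100000 create_tuned col.agc ref.fa in1.fa in2.fa
```

Comparing `agc info` output and archive sizes across a few settings on a subset of the data is a cheap way to pick values before compressing the full collection.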

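The GitHub page gives a range of usage examples. If you are looking for an efficient compressed representation for collections of large genomes (such as human), it is worth a look.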

Citation

AGC: Compact representation of assembled genomes
Sebastian Deorowicz, Agnieszka Danek, Heng Li

bioRxiv, Posted April 07, 2022