fairy: a fast approximate coverage calculation method for multi-sample binning

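Metagenomic binning is the clustering of contigs that belong to the same genome, and it is a key step in recovering metagenome-assembled genomes (MAGs). Contigs are grouped by exploiting read coverage patterns that are consistent across a genome. In standard pipelines, however, computing coverage requires aligning all reads from multiple samples, which is a major computational bottleneck.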

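The authors present fairy (https://github.com/bluenote-1577/fairy), an approximate coverage calculation method for metagenomic binning. fairy is a fast, k-mer-based, alignment-free method. For multi-sample binning, fairy is more than 250 times faster than read alignment while remaining accurate enough for binning. fairy is compatible with several existing binners on host-associated and non-host-associated datasets. With MetaBAT2, fairy recovers 98.5% of the MAGs with >50% completeness and <5% contamination relative to alignment with BWA. Notably, multi-sample binning with fairy is always better than single-sample binning with BWA (on average, more than 1.5 times as many >50% complete MAGs) while also being faster. On a public sediment metagenome project, the authors show that multi-sample binning recovers higher-quality Asgard archaea MAGs than single-sample binning, and that fairy's results are indistinguishable from those of read alignment. fairy is a new tool for computing multi-sample coverage for binning approximately and quickly, addressing a long-standing problem.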

Introduction

https://github.com/bluenote-1577/fairy/wiki/Introduction-to-fairy

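From the repository

fairy is run before the binning step of a metagenomic analysis. It offers the following.

Computes coverage 100 to 1000 times faster than read alignment (e.g. BWA)
Yields comparable bins (relative to read alignment) for short reads and nanopore reads
Output formats compatible with MetaBAT2, MaxBin2, and other binners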

Installation

Built on Ubuntu 22.

Build dependencies

Rust (version > 1.63) and associated tools such as cargo are required and assumed to be in PATH.
A C compiler (e.g. GCC)
make
cmake
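
Before building from source, it can help to confirm that the tools listed above are on PATH. The following is a minimal sketch, not part of fairy itself: the function name `missing_tools` is hypothetical, and `cc` is used as a stand-in for "a C compiler" (your system may expose `gcc` or `clang` instead).

```shell
#!/bin/sh
# Print the name of every given tool that is NOT found on PATH.
# The tool list passed below mirrors the build dependencies above;
# "cc" is an assumption standing in for "a C compiler".
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
        fi
    done
}

# Report what is missing before running `cargo install --path .`
missing="$(missing_tools cargo rustc cc make cmake)"
if [ -n "$missing" ]; then
    echo "missing build tools:" $missing
else
    echo "all build tools found"
fi
```

The check only tests for presence on PATH; it does not verify that the Rust version is newer than 1.63.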

GitHub

https://github.com/bluenote-1577/fairy

mamba install -c bioconda fairy

#build from source
git clone https://github.com/bluenote-1577/fairy
cd fairy
cargo install --path . 
fairy -h

> fairy -h

fairy 0.5.4

Approximate metagenomic coverage calculation for contigs.

## index paired-end reads

fairy sketch -1 a_1.fq b_1.fq -2 a_2.fq b_2.fq -d sketches

## coverage matrix output

fairy coverage -t 30 sketches/*.bcsp contigs1.fa -o coverage_matrix.tsv

USAGE:

fairy <SUBCOMMAND>

OPTIONS:

-h, --help Print help information

-V, --version Print version information

SUBCOMMANDS:

coverage Extremely fast species-level coverage calculation by k-mer sketching

sketch Sketch (index) reads. Each sample.fq -> sample.bcsp

> fairy coverage -h

fairy-coverage

Extremely fast species-level coverage calculation by k-mer sketching

USAGE:

fairy coverage [OPTIONS] [FILES]...

OPTIONS:

--debug

Debug output

-h, --help

Print help information

-s, --sample-threads <SAMPLE_THREADS>

Number of samples to be processed concurrently. Default: (# of total threads / 2) + 1

-t <THREADS>

Number of threads [default: 3]

--trace

Trace output (caution: very verbose)

INPUT:

-l, --list <FILE_LIST> Newline delimited file of file inputs

<FILES>... Pre-sketched *.bcsp files and raw fasta/gzip contig files

ALGORITHM:

-m, --minimum-ani <MINIMUM_ANI>

Minimum adjusted ANI to consider (0-100) for coverage calculation. Default is 95. Don't lower this unless you know what you're doing

-M, --min-number-kmers <MIN_NUMBER_KMERS>

Exclude genomes with less than this number of sampled k-mers [default: 8]

SKETCHING:

-c <C>

Subsampling rate. Does nothing for pre-sketched files [default: 50]

-k <K>

Value of k. Only k = 21, 31 are currently supported. Does nothing for pre-sketched files [default: 31]

--min-spacing <MIN_SPACING_KMER>

Minimum spacing between selected k-mers on the contigs. [default: 30]

OUTPUT:

--maxbin-format

Remove contig length, average depth, and variance columns. (default: MetaBAT2 format with variances)

-o, --output-file <OUT_FILE_NAME>

Output to this file instead of stdout

> fairy sketch -h

fairy-sketch

Sketch (index) reads. Each sample.fq -> sample.bcsp

USAGE:

fairy sketch [OPTIONS]

OPTIONS:

--debug Debug output

-h, --help Print help information

-t <THREADS> Number of threads [default: 3]

--trace Trace output (caution: very verbose)

OUTPUT:

-d, --sample-output-directory <SAMPLE_OUTPUT_DIR>

Output directory for sample sketches [default: ./]

--lS <LIST_SAMPLE_NAMES>

Newline delimited file; read sketches are renamed to given sample names

-S, --sample-names <SAMPLE_NAMES>...

Read sketches are renamed to given sample names as opposed to using the read file name

SINGLE-END INPUT:

-r, --reads <READS>... Single-end fasta/fastq reads

PAIRED-END INPUT:

-1, --first-pairs <FIRST_PAIR>...

First pairs for paired end reads

-2, --second-pairs <SECOND_PAIR>...

Second pairs for paired end reads

--l1 <LIST_FIRST_PAIR>

Newline delimited file; inputs are first pair of PE reads

--l2 <LIST_SECOND_PAIR>

Newline delimited file; inputs are second pair of PE reads

ALGORITHM:

-c <C> Subsampling rate [default: 50]

-k <K> Value of k. Only k = 21, 31 are currently supported [default: 31]

How to run

1. Index reads

#short reads
fairy sketch -1 *_1.fastq.gz -2 *_2.fastq.gz -d sketch_dir

#long reads
fairy sketch -r long_reads.fq -d sketch_dir
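
With many samples, the `-1 *_1.fastq.gz -2 *_2.fastq.gz` globs rely on both patterns expanding in the same sample order. The sketch below writes explicit file lists for the `--l1` and `--l2` options shown in the help output and checks that the two lists line up. It assumes that mates are paired by position, as the help example suggests; the function name `make_pair_lists` and the output file names are hypothetical.

```shell
#!/bin/sh
# Write sorted lists of R1 and R2 files so that line i of each list
# refers to the same sample. The *_1.fastq.gz / *_2.fastq.gz naming
# follows the example above; adjust the patterns for your data.
make_pair_lists() {
    reads_dir=$1
    out_prefix=$2
    ls "$reads_dir"/*_1.fastq.gz 2>/dev/null | sort > "${out_prefix}_r1.txt"
    ls "$reads_dir"/*_2.fastq.gz 2>/dev/null | sort > "${out_prefix}_r2.txt"
    # Fail if no R1 files were found.
    [ -s "${out_prefix}_r1.txt" ] || return 1
    # Fail if the sample stems of the two lists differ (missing mate).
    stems1=$(sed 's/_1\.fastq\.gz$//' "${out_prefix}_r1.txt")
    stems2=$(sed 's/_2\.fastq\.gz$//' "${out_prefix}_r2.txt")
    [ "$stems1" = "$stems2" ]
}

# Usage (commented out because it needs real reads and fairy on PATH):
# make_pair_lists reads pairs && \
#   fairy sketch --l1 pairs_r1.txt --l2 pairs_r2.txt -d sketch_dir
```

The function returns a non-zero status when a mate file is missing, so the `&&` in the usage line prevents sketching with misaligned lists.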

2. Coverage calculation

fairy coverage sketch_dir/*.bcsp contigs1.fa -t 10 -o coverage1.tsv
fairy coverage sketch_dir/*.bcsp contigs2.fa -t 10 -o coverage2.tsv

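The resulting coverage files (coverage1.tsv, coverage2.tsv) can be passed directly to MetaBAT2. For MaxBin2, add the --maxbin-format option when running fairy coverage.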
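
A quick sanity check on a coverage table is to confirm that it contains as many samples as were sketched. The sketch below assumes the default output follows the MetaBAT2 depth layout (contig name, contig length, average depth, then one depth and one variance column per sample) and that the first line is a header. The function name `samples_in_coverage`, the column names, and the numbers in the toy table are made up for illustration.

```shell
#!/bin/sh
# Infer the number of samples from a MetaBAT2-style depth table.
# Assumed layout (verify against your own file):
#   contig name, contig length, average depth, then depth + variance per sample.
samples_in_coverage() {
    awk -F '\t' 'NR == 1 { n = (NF - 3) / 2; print n; exit }' "$1"
}

# Tiny made-up table with two samples, for illustration only.
tmp=$(mktemp)
printf 'contigName\tcontigLen\ttotalAvgDepth\ts1\ts1-var\ts2\ts2-var\n' > "$tmp"
printf 'contig_1\t5000\t12.5\t10.0\t2.1\t15.0\t3.4\n' >> "$tmp"
samples_in_coverage "$tmp"

# With real output, the table would then go to MetaBAT2, e.g.:
# metabat2 -i contigs1.fa -a coverage1.tsv -o bins1/bin
```

If the reported count does not match the number of .bcsp files passed to fairy coverage, inspect the header before binning.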

From the repository and the paper

Do not use fairy for single-sample binning. Also, do not use fairy with PacBio HiFi reads.
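Multi-sample coverage is far better than single-sample coverage for binning, and it has been shown to produce better MAGs, with improvements that even quality-control software such as CheckM may fail to detect [10].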
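Coverage is usually computed by aligning reads to contigs. For a project with n samples and n assemblies, computing multi-sample coverage naively requires aligning every sample to every assembly, for a total of n² read alignments. As the number of samples grows, this quadratic scaling becomes prohibitively slow.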
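
To make the quadratic growth concrete, the sketch below prints the number of read alignments in the naive all-vs-all approach next to the number of read sketches that need to be built when each sample is sketched once and reused, as in the workflow shown earlier. The function names are hypothetical. Note that each `fairy coverage` run still compares every sketch against one assembly, so the number of sketch-to-assembly comparisons remains n × n; the point is only that the expensive per-sample step is done n times rather than n² times.

```shell
#!/bin/sh
# Read alignments needed when every sample is aligned to every assembly.
naive_alignments() {
    echo $(( $1 * $1 ))
}

# Read sketches needed when each sample is sketched once and reused
# for every assembly (one .bcsp file per sample).
sketches_needed() {
    echo "$1"
}

for n in 10 50 100; do
    echo "n=$n alignments=$(naive_alignments "$n") sketches=$(sketches_needed "$n")"
done
```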
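Co-assembly, in which all reads are assembled together into a single set of contigs, is a potential solution, but it is memory-intensive and can struggle with similar strains [13]. An alternative is split-binning [4], which concatenates all contigs from a set of assemblies and aligns reads against the concatenated set. This is faster, but it still consumes a fair amount of time and memory. As a result, many large-scale studies still use single-sample binning.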