高効率なカバレッジ計算ツール BamToCov - macでインフォマティクス

2022/02/25 論文引用

　多くのゲノミクスアプリケーションでは、リファレンスのヌクレオチドカバレッジを計算したり、リファレンス領域に何本のリードがマッピングされているかをカウントしたりする必要がある。本発表では、BamToCovを紹介する。このツールは、メモリ効率の良いアルゴリズムに依存し、カスタムパイプラインに柔軟に統合できるように設計された、迅速かつ柔軟なカバレッジ計算のためのツールスイートである。このツール群は、ソートされたBAMファイルやCRAMファイルを処理し、様々なフィルタリングアプローチを用いてカバレッジ情報を抽出することができる。

　BamToCovツールは、既存のツールとは異なり、最小限のメモリで、ワークフローに容易に統合でき、ストランドに特化したカバレッジ解析ができるように開発されている。独自のカバレッジ計算アルゴリズムにより、ロングリードのアラインメント解析に最適になっている。プログラムとそのドキュメントは、https://github.com/telatin/bamtocov で自由に利用することができる。

　アライメントファイル（BAM形式）からカバレッジ情報を抽出するツールは、すでにいくつか存在する。Samtools (Li et al., 2009), Bedtools (Quinlan, 2014), Sambamba (Tarasov et al., 2015) 。そして新しく、より機能豊富なMosdepth (Pedersen and Quinlan, 2018b) とMegaDepth (Wilks et al., 2021)がある。既存のツールの共通の限界は、mate-pairs ライブラリを使用してアセンブリの完全性を決定する際に重要な物理的カバレッジを計算できないことである。また、鎖ごとのカバレッジを分離することができない。ある位置がフォワードリードのみ、あるいはリバースリードのみによってカバーされている場合、それはおそらくミスアライメントに起因する。これらの制限を解決するために、Covtobed (Birolo and Telatin, 2020)を開発した。これは、コンピュータプログラミングのUNIX哲学に触発され、入出力ストリームをサポートする単一タスクに焦点を当てたシンプルかつ効率的なC++プログラムである。ここでは、Nim言語で記述されたBamToCovプログラムとその補助ユーティリティを紹介する。これは、入力ストリームを読み込む機能を維持しながら、インターバルターゲット、新しい出力フォーマット、カバレッジ統計、複数のBAMファイルをサポートする新しい機能を備えたCovtobedのコアアルゴリズムを用いてカバレッジ計算を行い、全体的にパフォーマンスの向上（すなわち、より小さなメモリフットプリントと最大3倍の速度向上）を達成している。

Documentation

https://telatin.github.io/bamtocov/

Wig format

https://telatin.github.io/bamtocov/notes/wig.html

BamToCov supports #targets (in BED, GTF and GFF3 formats, as microbiologists and their GFF3 files have been neglected for too long!) and output in #WIG format as well.

📦 https://t.co/ssK2A59kUu
— andrea telatin (@telatin) November 18, 2021

ペアエンドライブラリを使用する場合、物理カバレッジも計算することができる。物理カバレッジとは、ロングインサートのペアエンド（メイトペア）で-->... <--のようにリードはカバーしていない領域（...）が発生するが、この領域もカウント対象としたカバレッジの事（プレプリント図１））

特徴

UNIX哲学の入力ストリームをサポートに対応しており、bamのインデックスは必要ない
ストランドバイアスをチェックするために、ストランドごとのカバレッジを計算可能
少ないメモリ使用量
CRAMファイルをネイティブにサポート
ロングリードのアラインメントにも対応（Table.1）
高速（MegaDepthに次ぐ速度）

インストール

Github

mamba create -n bamtocov
conda activate bamtocov
mamba install -y -c bioconda bamtocov

> bamtocov -h

BamToCov 2.3.0

Usage: bamtocov [options] [<BAM>]...

Arguments:

<BAM> the alignment file for which to calculate depth (default: STDIN)

Core options:

-p, --physical Calculate physical coverage

-s, --stranded Report coverage separate by strand

-q, --quantize <breaks> Comma separated list of breaks for quantized output

-w, --wig <SPAN> Output in WIG format (using fixed <SPAN>), 0 will print in BED format [default: 0]

--op <func> How to summarize coverage for each WIG span (mean/min/max) [default: max]

-o, --report <TXT> Output coverage report

--skip-output Do not output per-base coverage

--report-low <min> Report coverage for bases with coverage < min [default: 0]

Target files:

-r, --regions <bed> Target file in BED or GFF3/GTF format (detected with the extension)

-t, --gff-type <feat> GFF feature type to parse [default: CDS]

-i, --gff-id <ID> GFF identifier [default: ID]

--gff-separator <sep> GFF attributes separator [default: ;]

--gff Force GFF input (otherwise assumed by extension .gff)

BAM reading options:

-T, --threads <threads> BAM decompression threads [default: 0]

-F, --flag <FLAG> Exclude reads with any of the bits in FLAG set [default: 1796]

-Q, --mapq <mapq> Mapping quality threshold [default: 0]

Other options:

--debug Enable diagnostics

-h, --help Show help

> bamtocounts -h

$ bamtocounts -h

BamToCounts 2.3.0

Usage: bamtocounts [options] <Target> <BAM-or-CRAM>...

Arguments:

<Target> the BED (or GFF) file containing regions in which to count reads

<BAM-or-CRAM> the alignment file for which to calculate depth

Options:

-T, --threads <threads> BAM decompression threads [default: 0]

-r, --fasta <fasta> FASTA file for use with CRAM files [default: ].

-F, --flag <FLAG> Exclude reads with any of the bits in FLAG set [default: 1796]

-Q, --mapq <mapq> Mapping quality threshold [default: 0]

-g, --gff Force GFF for input (otherwise autodetected by .gff extension)

-t, --type <feat> GFF feature type to parse [default: CDS]

-i, --id <ID> GFF identifier [default: ID]

-n, --rpkm Add a RPKM column

-l, --norm-len Add a counts/length column (after RPKM when both used)

--header Print header

--debug Enable diagnostics

-h, --help Show help

> covtotarget -h

covToTarget 2.3.0

Usage: covtotarget [options] <Target> [<covtobed-output>]

Arguments:

<Target> the BED (or GFF) file containing regions in which to count reads

<covtobed-output> covtobed output, or STDIN if not provided

Options:

-g, --gff Force GFF for input (otherwise autodetected by .gff extension)

-t, --type <feat> GFF feature type to parse [default: CDS]

-i, --id <ID> GFF identifier [default: ID]

-l, --norm-len Normalize by gene length

-b, --bed-output Output format is BED-like (default is feature_name [tab] counts)

-h, --help Show help

実行方法

bamtocov - BAMファイルを解析してBED形式のカバレッジファイルを出力

bamを指定する。

インデックスがなくてもソートされたBAMファイルを読み込むことができる。
bamtocov input.bam > coverage.bed

#物理カバレッジ
bamtocov -p input.bam > coverage.bed

-p Calculate physical coverage

wigファイルフォーマットで出力する。

bamtocov --wig 200 input.bam > coverage.wig

-w Output in WIG format (using fixed <SPAN>), 0 will print in BED format [default: 0]

strandedを指定すると、forwardとreverseそれぞれのカバレッジが計算される。

bamtocov --stranded input.bam > coverage.bed

-s Report coverage separate by strand

5列のBEDライクなファイルが出力される。4列目がforward strand coverage、5列目がreverse strand coverage。

BamToCounts - BAMファイル中のターゲットのリード数をカウント

ターゲット領域のbedを指定する。

BamToCounts target.bed input.bam  > coverage.txt

BamCountsRefs - 複数のBAMファイル（同じ参照配列を持つ）からカウントテーブルを出力

wigファイル出力

bamcountrefs --tag "Chr1" input1.bam input2.bam

covToTarget - ターゲット（BED または GFF3 フォーマットのアノテーションファイル）と covtobed 1.0 の出力を基に、フィーチャーごとのカバレッジレポートを作成

covtobed input/mini.bam | covtotarget input/mini.bed > output/counts.tsv

シミュレートされたbamを生成するコマンドも用意されています。ドキュメントを確認して下さい。

引用

BamToCov: an efficient toolkit for sequence coverage calculations
Giovanni Birolo, Andrea Telatin

bioRxiv, Posted November 17, 2021

2022/02/24

BamToCov, an efficient toolkit for sequence coverage calculations
Giovanni Birolo, Andrea Telatin Author Notes
Bioinformatics, Published: 23 February 2022