TieBrush: an efficient method for aggregating and summarizing mapped reads across large RNA-seq datasets

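The ability to summarize sequencing data programmatically and inspect it visually is essential to genome analysis, but currently available methods do not scale to large numbers of samples. In particular, visually comparing the transcriptional landscape across thousands of RNA-seq samples is limited by available computational resources and can be overwhelmed by the sheer size of the data. This work introduces TieBrush, a software package designed to process very large sequencing datasets (RNA, whole-genome, exome, etc.) into a form that can be inspected quickly, both visually and computationally. TieBrush can also be used as a data-aggregation method for downstream computational analysis, and it is compatible with most software tools that take aligned reads as input.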

TieBrush is provided as a C++ package under the MIT license. Precompiled binaries, source code, and example data are available on GitHub (https://github.com/alevar/tiebrush).

From the GitHub repository

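TieBrush summarizes and filters read alignments from multiple sequencing samples (taken as sorted BAM files). The utility merges/collapses "duplicate" read alignments (same position, same CIGAR string) across multiple sequencing samples (multiple input BAM files), adding custom SAM tags that record the "alignment multiplicity" count (how many times the same alignment is seen across all input data) and the "sample count" (how many samples show that same alignment). The goal is to produce a composite BAM file that multiplexes read alignments from many sequencing samples and paints a comprehensive "background" picture of read alignments and their counts across many samples.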
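To make the collapsing idea concrete, here is a minimal Python sketch. It is not TieBrush's implementation: alignments are reduced to hypothetical `(chrom, pos, cigar)` tuples, and the two counts are kept in a plain dictionary instead of SAM tags.

```python
from collections import defaultdict

def collapse(samples):
    """Collapse identical alignments across samples.

    samples: dict mapping a sample name to a list of (chrom, pos, cigar) tuples.
    Returns a dict mapping each distinct alignment to
    (alignment multiplicity, sample count).
    """
    groups = defaultdict(lambda: {"multiplicity": 0, "samples": set()})
    for name, alignments in samples.items():
        for key in alignments:
            groups[key]["multiplicity"] += 1   # every occurrence counts
            groups[key]["samples"].add(name)   # each sample counts once
    return {
        key: (g["multiplicity"], len(g["samples"]))
        for key, g in sorted(groups.items())
    }

# Hypothetical toy input: two samples, three distinct read records in s0.
samples = {
    "s0": [("chr8", 100, "50M"), ("chr8", 100, "50M"), ("chr8", 200, "20M100N30M")],
    "s1": [("chr8", 100, "50M")],
}
collapsed = collapse(samples)
for (chrom, pos, cigar), (mult, n_samples) in collapsed.items():
    print(chrom, pos, cigar, mult, n_samples)
```

In this toy input the `50M` alignment at position 100 is seen three times in two samples, so it collapses to one record with multiplicity 3 and sample count 2.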

Installation

Tested on Ubuntu 18 (in Docker; cmake v3.10.2, gcc v7.4.0).

Build dependencies

GCC ≥ 4.8, LLVM/Clang ≥ 3.8
CMake ≥ 2.8

GitHub

git clone https://github.com/alevar/tiebrush.git --recursive
cd tiebrush/
cmake -DCMAKE_BUILD_TYPE=Release .
make -j8
make install

> ./tiebrush

TieBrush v0.0.6
==================
Summarize and filter read alignments from multiple sequencing samples (taken as sorted SAM/BAM/CRAM files). This utility aims to merge/collapse "duplicate" read alignments across multiple sequencing samples (inputs), adding custom SAM tags in order to keep track of the "alignment multiplicity" count (how many times the same alignment is seen across all input data) and "sample count" (how many samples show that same alignment).
==================

usage: tiebrush [-h] -o OUTPUT [-L|-P|-E] [-S] [-M] [-N max_NH_value] [-Q min_mapping_quality] [-F FLAGS] ...

Input arguments:
  ...               input alignment files can be provided as a space-delimited
                    list of filenames or as a text file containing a list of
                    filenames, one per line

Required arguments:
  -o                File for BAM output

Optional arguments:
  -h,--help         Show this help message and exit
  --version         Show the program version and exit
  -L,--full         If enabled, only reads with the same CIGAR
                    and MD strings will be grouped and collapsed.
                    By default, TieBrush will consider the CIGAR
                    string only when grouping reads
                    Only one of -L, -P or -E options can be enabled
  -P,--clip         If enabled, reads will be grouped by clipped
                    CIGAR string. In this mode 5S10M5S and 3S10M3S
                    CIGAR strings will be grouped if the coordinates
                    of the matching substring (10M) are the same
                    between reads
  -E,--exon         If enabled, reads will be grouped if their exon
                    boundaries are the same. This option discards
                    any structural variants contained in mapped
                    substrings of the read and only considers start
                    and end coordinates of each non-splicing segment
                    of the CIGAR string
  -S,--keep-supp    If enabled, supplementary alignments will be
                    included in the collapsed groups of reads.
                    By default, TieBrush removes any mappings
                    not listed as primary (0x100). Note, that if enabled,
                    each supplementary mapping will count as a separate read
  -M,--keep-unmap   If enabled, unmapped reads will be retained (uncollapsed)
                    in the output. By default, TieBrush removes any
                    unmapped reads
  -N                Maximum NH score of the reads to retain
  -Q                Minimum mapping quality of the reads to retain
  -F                Bits in SAM flag to use in read comparison. Only reads that
                    have specified flags will be merged together (default: 0)

Error: no input provided!
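The `-P` and `-E` grouping modes described in the help text can be illustrated with a small Python sketch. This is an interpretation of the help text, not TieBrush's code: `clip_key` and `exon_key` are hypothetical helper names, positions are treated as the coordinate of the first aligned base, and the exon mode is assumed to ignore insertions and deletions inside a segment.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]

def clip_key(pos, cigar):
    """Grouping key in the spirit of -P: drop soft/hard clips from the CIGAR."""
    kept = [(n, op) for n, op in parse_cigar(cigar) if op not in "SH"]
    return pos, "".join(f"{n}{op}" for n, op in kept)

def exon_key(pos, cigar):
    """Grouping key in the spirit of -E: (start, end) of each non-spliced
    segment on the reference, half-open, ignoring indels inside a segment."""
    segments = []
    start = cursor = pos
    for n, op in parse_cigar(cigar):
        if op in "MD=X":          # operations that consume the reference
            cursor += n
        elif op == "N":           # intron: close this segment, start the next
            segments.append((start, cursor))
            cursor += n
            start = cursor
        # I, S, H and P do not consume the reference and are skipped
    segments.append((start, cursor))
    return tuple(segments)

# The example from the help text: both reads share the same 10M block.
print(clip_key(100, "5S10M5S") == clip_key(100, "3S10M3S"))
# Two spliced reads with different indels but identical exon boundaries.
print(exon_key(100, "10M2I5M100N20M") == exon_key(100, "15M100N10M1D9M"))
```

Under these assumptions both comparisons are true, so each pair of reads would fall into one collapsed group in the corresponding mode.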

> ./tiecov -h

TieCov v0.0.6
==================
The TieCov utility can take the output file produced by TieBrush and generate the following auxiliary files:
  1. BedGraph file with the coverage data
  2. Junction BED file
  3. a heatmap BED that uses color intensity to represent the number of samples that contain each position
==================

usage: tiecov [-s out.sample] [-c out.coverage] [-j out.junctions] [-W] input

Input arguments (required):
  input             alignment file in SAM/BAM/CRAM format

Optional arguments (at least one of -s/-c/-j must be specified):
  -h,--help         Show this help message and exit
  --version         Show program version and exit
  -s                BedGraph file with an estimate of the number of samples
                    which contain alignments for each interval.
  -c                BedGraph (or BedWig with '-W') file with coverage
                    for all mapped bases.
  -j                BED file with coverage of all splice-junctions
                    in the input file.
  -W                save coverage in BigWig format. Default output
                    is in Bed format
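As a rough illustration of what a coverage track over collapsed alignments looks like, the following Python sketch builds BedGraph-style rows in which each collapsed alignment contributes its multiplicity instead of 1. It is a sketch under stated assumptions, not tiecov's algorithm: the input is a hypothetical list of `(start, end, multiplicity)` intervals with 0-based, half-open coordinates, and spliced reads would have to be passed as one interval per exon segment.

```python
from collections import Counter

def bedgraph(chrom, alignments):
    """Return BedGraph rows (chrom, start, end, depth) for weighted intervals.

    alignments: iterable of (start, end, multiplicity), 0-based half-open.
    """
    delta = Counter()
    for start, end, mult in alignments:
        delta[start] += mult   # depth rises where an interval begins
        delta[end] -= mult     # and falls where it ends
    rows, depth, prev = [], 0, None
    # Positions whose net change is zero do not start a new row.
    for pos in sorted(p for p, d in delta.items() if d != 0):
        if prev is not None and depth > 0:
            rows.append((chrom, prev, pos, depth))
        depth += delta[pos]
        prev = pos
    return rows

# Hypothetical collapsed records: one seen 3 times, one seen once.
rows = bedgraph("chr8", [(100, 150, 3), (120, 170, 1)])
for row in rows:
    print(*row, sep="\t")
```

The sweep over start and end positions keeps the work proportional to the number of collapsed records, which is the point of summarizing before computing tracks.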

Tiewrap is a utility script provided to make running TieBrush on large datasets a bit easier.

> ./tiewrap.py -h

usage: tiewrap.py [-h] -o OUTPUT [-L] [-P] [-E] [-S] [-M] [-N MAX_NH]
                  [-Q MIN_MAP_QUAL] [-F FLAGS] [-t THREADS] [-b BATCH_SIZE]
                  ...

Help Page

positional arguments:
  input                 Input can be provided as a space-delimited list of
                        filenames or as a textfile containing a list of
                        filenames one per each line.

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        File for BAM output.
  -L, --full            If enabled, only reads with the same CIGAR and MD
                        strings will be grouped and collapsed. By default,
                        TieBrush will consider the CIGAR string only when
                        grouping reads.
  -P, --clip            If enabled, reads will be grouped by clipped CIGAR
                        string. In this mode 5S10M5S and 3S10M3S cigar strings
                        will be grouped if the coordinates of the matching
                        substring (10M) are the same between reads.
  -E, --exon            If enabled, reads will be grouped if their exon
                        boundaries are the same. This option discards any
                        structural variants contained in mapped substrings of
                        the read and only considers start and end coordinates
                        of each non-splicing segment of the CIGAR string.
  -S, --keep-supp       If enabled, supplementary alignments will be included
                        in the collapsed groups of reads. By default, TieBrush
                        removes any mappings not listed as primary (0x100).
                        Note, that if enabled, each supplementary mapping will
                        count as a separate read.
  -M, --keep-unmap      If enabled, unmapped reads will be retained
                        (uncollapsed) in the output. By default, TieBrush
                        removes any unmapped reads.
  -N MAX_NH, --max-nh MAX_NH
                        Maximum NH score of the reads to retain.
  -Q MIN_MAP_QUAL, --min-map-qual MIN_MAP_QUAL
                        Minimum mapping quality of the reads to retain.
  -F FLAGS, --flags FLAGS
                        Bits in SAM flag to use in read comparison. Only reads
                        that have specified flags will be merged together
                        (default: 0)
  -t THREADS, --threads THREADS
                        Number of threads to use.
  -b BATCH_SIZE, --batch-size BATCH_SIZE
                        Number of input files to process in a batch on each thread.
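The batching idea behind `-b` can be sketched in Python as below. This only plans command lines and runs nothing. It is not tiewrap's code: the intermediate file names are hypothetical, and the final step assumes that TieBrush output can itself be given to TieBrush as input, which the help text above does not state.

```python
def batches(files, batch_size):
    """Split a list of input files into consecutive batches."""
    return [files[i:i + batch_size] for i in range(0, len(files), batch_size)]

def plan(files, output, batch_size):
    """Return tiebrush command lines: one per batch, then one final merge.

    Intermediate names such as '<output>.batch0.bam' are made up for this sketch.
    """
    cmds, partials = [], []
    for i, batch in enumerate(batches(files, batch_size)):
        partial = f"{output}.batch{i}.bam"
        partials.append(partial)
        cmds.append(["tiebrush", "-o", partial, *batch])
    # Assumption: collapsed batch outputs can be collapsed again into one file.
    cmds.append(["tiebrush", "-o", output, *partials])
    return cmds

files = [f"s{i}.bam" for i in range(5)]   # hypothetical input names
cmds = plan(files, "all.bam", batch_size=2)
for cmd in cmds:
    print(" ".join(cmd))
```

The per-batch commands are independent of each other, so they are the natural unit to spread across the threads given with `-t`.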

Test run

The dataset contains mapped simulated reads (BAM) from two loci (NEFL and SLC25A3) known to differ in expression and splicing across several tissues. The data are split into two tissues (t1 and t2), each consisting of 10 samples.

1. Running tiebrush

Specify the output BAM name first, followed by the input BAMs. Separate multiple input BAM files with spaces.

cd example/
../tiebrush -o t1/t1.bam t1/t1s0.bam t1/t1s1.bam t1/t1s2.bam t1/t1s3.bam t1/t1s4.bam t1/t1s5.bam t1/t1s6.bam t1/t1s7.bam t1/t1s8.bam t1/t1s9.bam

../tiebrush -o t2/t2.bam t2/t2s0.bam t2/t2s1.bam t2/t2s2.bam t2/t2s3.bam t2/t2s4.bam t2/t2s5.bam t2/t2s6.bam t2/t2s7.bam t2/t2s8.bam t2/t2s9.bam
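According to the help text, the inputs can also be given as a text file with one filename per line, which is easier to manage than a long command line. A minimal Python sketch for generating such a list for the example data follows; the helper name `write_input_list` and the list file name `t1.list` are made up for illustration, and the sketch writes to a temporary directory and only prints the resulting command.

```python
from pathlib import Path
import tempfile

def write_input_list(list_path, tissue, n_samples):
    """Write one BAM path per line, following the example's naming scheme."""
    names = [f"{tissue}/{tissue}s{i}.bam" for i in range(n_samples)]
    Path(list_path).write_text("\n".join(names) + "\n")
    return names

with tempfile.TemporaryDirectory() as tmp:
    list_path = Path(tmp) / "t1.list"          # hypothetical list file name
    names = write_input_list(list_path, "t1", 10)
    written = list_path.read_text().splitlines()
    # Command that would be run from example/; it is printed, not executed.
    command = ["../tiebrush", "-o", "t1/t1.bam", str(list_path)]
    print(" ".join(command))
```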

2. Running tiecov

tiecov -s t1/t1.sample -c t1/tb.coverage -j t1/tb.junctions t1/t1.bam
tiecov -s t2/t2.sample -c t2/tb.coverage -j t2/tb.junctions t2/t2.bam

Output (t1)

[Image: tiecov output for t1]

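When finished, tiecov produces several summary files for each tissue. These files can be viewed and examined in a genome browser such as IGV, which supports the BAM (collapsed alignments), BED (junction track), and BedGraph (sample and coverage tracks) formats (see Figures 1 and 2 at the link). They can be used to examine, for example, the fraction of samples in a large dataset that contain reads in a given region, or genes in which only specific exons are expressed differently between tissues.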

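The merged BAM is considerably smaller. Merging three BAM files of roughly 4 GB each produced a file of about 800 MB. To display the resulting BAM or coverage tracks in IGV, use the same reference sequence as the original alignments.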

Reference

TieBrush: an efficient method for aggregating and summarizing mapped reads across large datasets
Ales Varabyou, Geo Pertea, Christopher Pockrandt, Mihaela Pertea
Bioinformatics, Volume 37, Issue 20, 15 October 2021, Pages 3650–3651