詳細なリードカウント情報を出力する bam-readcount

2021/111/8 インストール方法追記

シングルエンドのデータをターゲットとしている。ペアエンドは独立してカウントされる。

インストール

mac os 10.13でテストした。

依存

git
cmake 2.8.3+ (cmake.org)

mac osにはcmakeは入ってません。brew install cmakeで入れるのが手っ取り早いです。

git clone https://github.com/genome/bam-readcount.git
cd bam-readcount/
mkdir build 
cd build
cmake ..
make
cd bin/

#conda
conda create -n bamreadcount python=3.7 -y
conda activate bamreadcount
conda install -c bioconda bam-readcount

> bam-readcount

$ bam-readcount

Usage: bam-readcount [OPTIONS] <bam_file> [region]

Generate metrics for bam_file at single nucleotide positions.

Example: bam-readcount -f ref.fa some.bam

Available options:

-h [ --help ] produce this message

-v [ --version ] output the version number

-q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used

for counting.

-b [ --min-base-quality ] arg (=0) minimum base quality at a position to

use the read for counting.

-d [ --max-count ] arg (=10000000) max depth to avoid excessive memory

usage.

-l [ --site-list ] arg file containing a list of regions to

report readcounts within.

-f [ --reference-fasta ] arg reference sequence in the fasta format.

-D [ --print-individual-mapq ] arg report the mapping qualities as a comma

separated list.

-p [ --per-library ] report results by library.

-w [ --max-warnings ] arg maximum number of warnings of each type

to emit. -1 gives an unlimited number.

-i [ --insertion-centric ] generate indel centric readcounts.

Reads containing insertions will not be

included in per-base counts

ラン

MAPQ≥1以上のリードを対象にカバレッジをカウント。

bam-readcount -f ref.fa -b 1 input.bam  > output

bedで指定した領域を対象にカバレッジをカウント。エラーメッセージは表示しない。

bam-readcount -f ref.fa -w 0 -l inout.bed input.bam > output

出力。

head -n 1 output

chr1 100 A 10 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:10:60.00:37.10:60.00:6:4:0.60:0.00:0.00:6:0.41:287.90:0.49 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00

ch1のポジション100、リファレンスはA、デプスは10、残りの部分は

base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end

となっている。このカラムはGithubで説明されている通り

base → the base that all reads following in this field contain at the reported position i.e. C
count → the number of reads containing the base
avg_mapping_quality → the mean mapping quality of reads containing the base
avg_basequality → the mean base quality for these reads
avg_se_mapping_quality → mean single ended mapping quality
num_plus_strand → number of reads on the plus/forward strand
num_minus_strand → number of reads on the minus/reverse strand
avg_pos_as_fraction → average position on the read as a fraction (calculated with respect to the length after clipping). This value is normalized to the center of the read (bases occurring strictly at the center of the read have a value of 1, those occurring strictly at the ends should approach a value of 0)
avg_num_mismatches_as_fraction → average number of mismatches on these reads per base
avg_sum_mismatch_qualities → average sum of the base qualities of mismatches in the reads
num_q2_containing_reads → number of reads with q2 runs at the 3’ end
avg_distance_to_q2_start_in_q2_reads → average distance of position (as fraction of unclipped read length) to the start of the q2 run
avg_clipped_length → average clipped read length of reads
avg_distance_to_effective_3p_end → average distance to the 3’ prime end of the read (as fraction of unclipped read length)