2021/111/8 インストール方法追記
シングルエンドのデータをターゲットとしている。 ペアエンドは独立してカウントされる。
インストール
mac os 10.13でテストした。
依存
- git
- cmake 2.8.3+ (cmake.org)
mac osにはcmakeは入ってません。brew install cmakeで入れるのが手っ取り早いです。
git clone https://github.com/genome/bam-readcount.git
cd bam-readcount/
mkdir build
cd build
cmake ..
make
cd bin/
#conda
conda create -n bamreadcount python=3.7 -y
conda activate bamreadcount
conda install -c bioconda bam-readcount
> bam-readcount
$ bam-readcount
Usage: bam-readcount [OPTIONS] <bam_file> [region]
Generate metrics for bam_file at single nucleotide positions.
Example: bam-readcount -f ref.fa some.bam
Available options:
-h [ --help ] produce this message
-v [ --version ] output the version number
-q [ --min-mapping-quality ] arg (=0) minimum mapping quality of reads used
for counting.
-b [ --min-base-quality ] arg (=0) minimum base quality at a position to
use the read for counting.
-d [ --max-count ] arg (=10000000) max depth to avoid excessive memory
usage.
-l [ --site-list ] arg file containing a list of regions to
report readcounts within.
-f [ --reference-fasta ] arg reference sequence in the fasta format.
-D [ --print-individual-mapq ] arg report the mapping qualities as a comma
separated list.
-p [ --per-library ] report results by library.
-w [ --max-warnings ] arg maximum number of warnings of each type
to emit. -1 gives an unlimited number.
-i [ --insertion-centric ] generate indel centric readcounts.
Reads containing insertions will not be
included in per-base counts
ラン
MAPQ≥1以上のリードを対象にカバレッジをカウント。
bam-readcount -f ref.fa -b 1 input.bam > output
bedで指定した領域を対象にカバレッジをカウント。エラーメッセージは表示しない。
bam-readcount -f ref.fa -w 0 -l inout.bed input.bam > output
出力。
head -n 1 output
chr1 100 A 10 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:10:60.00:37.10:60.00:6:4:0.60:0.00:0.00:6:0.41:287.90:0.49 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00
ch1のポジション100、リファレンスはA、デプスは10、 残りの部分は
base:count:avg_mapping_quality:avg_basequality:avg_se_mapping_quality:num_plus_strand:num_minus_strand:avg_pos_as_fraction:avg_num_mismatches_as_fraction:avg_sum_mismatch_qualities:num_q2_containing_reads:avg_distance_to_q2_start_in_q2_reads:avg_clipped_length:avg_distance_to_effective_3p_end
となっている。このカラムはGithubで説明されている通り
- base → the base that all reads following in this field contain at the reported position i.e. C
- count → the number of reads containing the base
- avg_mapping_quality → the mean mapping quality of reads containing the base
- avg_basequality → the mean base quality for these reads
- avg_se_mapping_quality → mean single ended mapping quality
- num_plus_strand → number of reads on the plus/forward strand
- num_minus_strand → number of reads on the minus/reverse strand
- avg_pos_as_fraction → average position on the read as a fraction (calculated with respect to the length after clipping). This value is normalized to the center of the read (bases occurring strictly at the center of the read have a value of 1, those occurring strictly at the ends should approach a value of 0)
- avg_num_mismatches_as_fraction → average number of mismatches on these reads per base
- avg_sum_mismatch_qualities → average sum of the base qualities of mismatches in the reads
- num_q2_containing_reads → number of reads with q2 runs at the 3’ end
- avg_distance_to_q2_start_in_q2_reads → average distance of position (as fraction of unclipped read length) to the start of the q2 run
- avg_clipped_length → average clipped read length of reads
- avg_distance_to_effective_3p_end → average distance to the 3’ prime end of the read (as fraction of unclipped read length)
となっている。
bam中に複数ライブラリがあってライブラリごとに出力したい場合、"-p"をつけて実行する。
引用
GitHub - genome/bam-readcount: count DNA sequence reads in BAM files
https://genome.sph.umich.edu/wiki/Bam_read_count
Biostars