FASTQ、BED、BAMを操作するNGSUtilsその１ bamutils

2020 4/17 インストール追記

NGSUtilsは、FASTQ、BED、BAM形式のファイルなどを操作するためのツール。 Mac OS XおよびLinuxで動作する。コマンドが多いので３回に分けて紹介する。１回目はbamを操作するbamutils。

インストール

公式ページ

NGSUtils - Tools for next-generation sequencing analysis

webマニュアル

http://ngsutils.org/installation/

依存

Mac OSX or Linux operating system (tested on RHEL5 and RHEL6)
Python 2.6+ (including development packages)
Make

上記に含まれていないが、cent OSに導入した（python 2.7.2でビルドした）。

git clone git://github.com/ngsutils/ngsutils.git
cd ngsutils/
make  #依存がインストールされる（詳細はwebマニュアル参照）

#bioconda (link) (失敗した)
conda create -n ngsutils -y
conda activate ngsutils
conda install -c bioconda -y ngsutils

ラン

keepbest（リンク） best mappingを抽出。

bamutils keepbest input.bam output.bam

best（リンク） best mappingのみ出力

bamutils best input.bam output.bam

export（リンク） 指定した情報を全リードから出力

bamutils export -name input.bam > output #リード名を出力

bamutils export -ref input.bam > output #mappingされたクロモソーム名を出力

-name Read name
-ref Mapped reference (chrom)
-pos Mapped position (1-based)
-strand Mapped strand (+/-)
-cigar CIGAR alignment string
-flags Mapping flags (base 10 number)
-seq Sequence
-qual Quality sequence
-mapq MAPQ score
-nextref Next mapped reference (paired-end)
-nextpos Next mapped position (paired-end)
-tlen Template length (paired-end)
-tag:tag_name Any tag

　　　For example: -tag:AS -tag:NH Outputs the alignment score and the edit distance

-tag:* Outputs all tags in SAM format

expressed（リンク） マッピングされた領域をBED形式で出力。（samtools indexでbam.baiファイルを準備しておく必要がある）

bamutils expressed input.bam > output.bed

-ns Ignore strandedness when creating regions (default: false)
-uniq Only use unique starting positions when performing counts (default: false)
-dist <N> minimum distance required between regions - they will be merged if w/in this distance (default: 10)
-mincount <N> The minimum number of reads required in a region (default: 2)

filter（リンク） 指定した条件を満たすリードだけbamで出力。

#chr1にマッピングされており、リード長が100以上でペアエンドが-> <-の向きに両方マップングされているリードだけbamで出力。
bamutils filter input.bam output.bam -minlen 100 -properpair -includeref chr1

-minlen val Remove reads that are smaller than {val}
-maxlen val Remove reads that are larger than {val}
-mapped Keep only mapped reads
-unmapped Keep only unmapped reads
-properpair Keep only properly paired reads (both mapped, correct orientation, flag set in BAM)
-noproperpair Keep only not-properly paired reads

他にも様々な条件を指定できる。

junctioncount（リンク） 指定した領域をspanしてマッピングされているリードをカウント

bamutils junctioncount input.bam chr1:5000-5500

merge（リンク） bamをマージ（best matchを選択してくる）。

bamutils merge output.bam input1.bam input2.bam

同じリードが複数ヶ所にアライメントされている場合、ベストマッチを選択してくる。

pair（リンク）別々にマッピングされているペアエンドについて、指定条件内でペアを再度調べる。

bamutils pair output.bam input1.bam input2.bam

-size <low-high> The minimum/maximum insert size to accept. By default, this will attempt to minimize the distance between reads, upto the lower-bound. Any pair over the upper bound will be discarded. Note: for RNA, because it is impossible to detect junctions that are between the reads, this should be a very big range (ex: 50-1000000) Default: 50-10000
-tag <VAL> Tag to use to determine from which file reads will be taken. (must be type :i or :f) You may have more than one of these, in which case they will be sorted in order. You can add a +/- at the end of the name to signify sort order (asc/desc). Default: AS+, NM-

pcrdup（リンク）ペアエンドのPCR duplicationを調べる（たまたま同じ位置で始まっているペアもPCR duplicationとみなす）。

bamutils pcrdup input.bam -count

peakheight（リンク）指定したbedファイルの領域のpeak positionを調べ、bedに追記する。

bamutils peakheight input.bam input.bed

maximum peak size、total unique reads、 total coverage (sum of coverage over each base)、 coverage density (total coverage / length)が追記される。

removevlipping（リンク）hard-clipされた領域を捨てる。

bamutils removeclipping input.bam output.bam

renamepair（リンク）一部アライナーはペアエンドのリード名に/1や/2をつけてbsmを作り、これが下流の解析でエラーになる。これを消す（ペアエンドは同じリード名になる）。

bamutils renamepair input.bam output.bam

split（リンク）bamを分割する。

#refereneで分割
bamutils split -ref input.bam output

#リード数100000ずつ分割
bamutils split -n 100000 input.bam output

stats（リンク）bamを分析。

bamutils stats input.bam

$ bamutils stats input.bam

Calculating Read stats...

Done! (0:36)

out.bam

Mapped: 1083286

Unmapped: 0

Flag distribution

[0x001] Multiple fragments 1083286 100.00%

[0x002] All fragments aligned 888871 82.05%

[0x008] Next unmapped 58635 5.41%

[0x010] Reverse complimented 542299 50.06%

[0x020] Next reverse complimented 539537 49.81%

[0x040] First fragment 1083286 100.00%

[0x800] Supplementary 1 0.00%

Template length: 568.90 +/- 824.04

Reference counts count

1 1083286

tag（リンク）tagをbamに追加する。

bamutils tag -xs input.bam output.bam

-xs Add the XS:A tag for +/- strandedness (req'd by Cufflinks)
-suffix suff A suffix to add to each read name
-tag tag Add an arbitrary tag (ex: -tag XX:Z:test)
-orig-ref tag Add a new tag with the original reference name (For example, in a region-based BAM will be converted to standard coordinates)

check（リンク）破損のチェック。