シンプルなfastq、sam、bamの分析ツール fastqp

fastqpはシンプルなNGSのシーケンスデータ（fastq、sam、bam）評価ツール。

インストール

mac os 10.13 python2.7.14環境に導入した。

依存

Tested on Python 2.7, and 3.4
Tested on Mac OS 10.10 and Linux 2.6.18
Numpy, Scipy, and Matplotlib
samtools

本体 Github

https://github.com/mdshw5/fastqp

Matt ShirleyさんはBiostarsでも度々返信されていますね。

pipで導入できます。

pip install fastqp

> fastqp -h

$ fastqp -h

usage: fastqp [-h] [-q] [-s BINSIZE] [-a NAME] [-n NREADS] [-p BASE_PROBS]

[-k {2,3,4,5,6,7}] [-o OUTPUT] [-e TEXT] [-t {fastq,gz,sam,bam}]

[-ll LEFTLIMIT] [-rl RIGHTLIMIT] [-mq MEDIAN_QUAL]

[--aligned-only | --unaligned-only] [-d]

input

simple NGS read quality assessment using Python

positional arguments:

input input file (one of .sam, .bam, .fq, or .fastq(.gz) or

stdin (-))

optional arguments:

-h, --help show this help message and exit

-q, --quiet do not print any messages (default: False)

-s BINSIZE, --binsize BINSIZE

number of reads to bin for sampling (default: auto)

-a NAME, --name NAME sample name identifier for text and graphics output

(default: input file name)

-n NREADS, --nreads NREADS

number of reads sample from input (default: 2000000)

-p BASE_PROBS, --base-probs BASE_PROBS

probabilites for observing A,T,C,G,N in reads

(default: 0.25,0.25,0.25,0.25,0.1)

-k {2,3,4,5,6,7}, --kmer {2,3,4,5,6,7}

length of kmer for over-repesented kmer counts

(default: 5)

-o OUTPUT, --output OUTPUT

base name for output figures (default: fastqp_figures)

-e TEXT, --text TEXT file name for text output (default: -)

-t {fastq,gz,sam,bam}, --type {fastq,gz,sam,bam}

file type (default: auto)

-ll LEFTLIMIT, --leftlimit LEFTLIMIT

leftmost cycle limit (default: 1)

-rl RIGHTLIMIT, --rightlimit RIGHTLIMIT

rightmost cycle limit (-1 for none) (default: -1)

-mq MEDIAN_QUAL, --median-qual MEDIAN_QUAL

median quality threshold for failing QC (default: 30)

--aligned-only only aligned reads (default: False)

--unaligned-only only unaligned reads (default: False)

-d, --count-duplicates

calculate sequence duplication rate (default: False)

ラン

.sam, .bam, .fq, or .fastq(.gz)に対応している。

fastqの分析。inputファイルだけ指定すればランできる。

fastqp input.fastq.gz

k mer overrepresentation testで急激な変動があれば、標準出力される。

There were 11,745,147 reads in the file. Analysis finished in 00:01:57.

KmerWarning: kmer TCGGA has a non-uniform profile (slope = 2.125013564839935, p = 8.46578475199142e-35).

KmerWarning: kmer ATCGG has a non-uniform profile (slope = 2.1653621812262616, p = 1.8476822870673151e-34).

KmerWarning: kmer GGAAG has a non-uniform profile (slope = 2.095957677699403, p = 2.5659919959063345e-34).

KmerWarning: kmer GATCG has a non-uniform profile (slope = 2.140138361367336, p = 4.629573409568249e-34).

KmerWarning: kmer CGGAA has a non-uniform profile (slope = 2.1136394465545307, p = 1.4142951997190637e-33).

KmerWarning: kmer AAGAG has a non-uniform profile (slope = 2.0164744981009224, p = 1.0203562990197863e-32).

KmerWarning: kmer AGATC has a non-uniform profile (slope = 2.020788117200217, p = 2.2505527710710567e-30).

出力ディレクトリはgz圧縮される。解凍して中を開く。複数のmetricsで分析が行われている。

f:id:kazumaxneo:20180622090028p:plain

f:id:kazumaxneo:20180622090032p:plain

f:id:kazumaxneo:20180622090036p:plain

f:id:kazumaxneo:20180622090039p:plain

f:id:kazumaxneo:20180622090042p:plain

f:id:kazumaxneo:20180622090046p:plain

f:id:kazumaxneo:20180622090049p:plain

f:id:kazumaxneo:20180622090052p:plain

f:id:kazumaxneo:20180622090056p:plain

Miseqのシーケンスデータで、3'末端付近のクオリティ落ち込みが激しい。

bam、samも分析項目は同じですが、mappingされたリードだけ対象にするようなオプションがあります。解析に時間がかかるようなら、-nでサンプリングするリード数を減らしてみてください（デフォルトは200万リード）。いわゆる大数の法則で、数千~数万リードもサンプリングすれば、真の分布は見えてくるしフィッティングもまともになってくると思います（確かめてませんが）。

実装がシンプルなので、pythonを中心にNGSの分析ツールを作りたい人にもお手本にもなるんじゃないでしょうか？ライブラリNumPy（ナンパイ）、SciPy（サイパイ）で高速計算し（参考ページ後半）、matplotlibでグラフ出力する流れです。

引用

GitHub - mdshw5/fastqp: Simple FASTQ quality assessment using Python