2018-12-03

RNA seqシーケンシングデータの包括的な前処理ツール FastqPuri

2018 12/3 図差し替え

2019 6/18 condaインストール追記

2019 6/21 コマンド追記

2019 10/23引用追記

2020 1/7 インストール追記

2020 12/9 my docker imageのラン手順修正

　RNA-seq実験から正確な結果を得るには、前処理ステップでのクオリティチェック（QC）とシーケンスデータのフィルタリングが重要になる。ワークフローは通常、次のように進行する。最初にシーケンスクオリティチェックを行い、続いてアダプターおよび低クオリティの塩基を除去するシーケンスフィルター処理を行う。次に、他の生物からの汚染配列を除去し、最後に、シーケンスデータが現在受け入れ可能であることを確認するために別のクオリティ管理作業を実行する。
　シーケンスデータのクオリティチェック、フィルタリング、及びトリミングを行うツールは存在するが、RNA-seqデータで一般的に使用されるすべての前処理ステップを網羅する適切かつ包括的なツールはない。 RNA-seqデータにはFastQC [ref.1]が広く使用されているが、ゲノムデータ用に設計されているため、品質チェックモジュールのいくつかはRNA-seqデータに適していない（例：重複レベル、GC含量）。 RSeQC [ref.16]（紹介）とRNA-SeQC [ref.8]はRNA-seqデータ用に書かれていたが、入力としてアライメントファイル（BAM）のみ使用され、アライメントフリーのリードカウンタであるkallisto（紹介）、salmon（紹介）と組み合わせて使うには適していない[ref.14]。 AfterQC [ref.5]（紹介）はクオリティ管理とグローバルクオリティフィルタリングを実行するが、特にRNA-seqデータには対応していない。また、AfterQCのstrandバイアス検出およびオーバーラップペア解析はRNA-seqデータには有効ではなく、コンタミネーションフィルタリングは含まれていない。 AfterQCは、クオリティスコアに基づいた自動フィルタリング機能も制限されている。すなわち、グローバルトリムにより各リードから一定数の塩基を削除することのみできる。 RNA-QC-Chain [ref.18]は、RNA-seqデータに包括的なクオリティ管理を提供すると主張しているが、rawリード（fastq）データの有益なグラフィックス表現がなく、rRNA汚染のみをフィルタリングすることができる。
　さらに、リードのマッピングはRNA-seqデータ解析で最も時間のかかるステップだったが、アライメントフリーのトランスクリプタカウンタが導入されたため、現在、クオリティ管理とフィルタリングは時間のかかるボトルネックとなっている。 FastqPuriは、すべてのRNA-seqワークフローで必要とされるこれらの第1ステップのための自動化された最も効率的なインプリメンテーションを提供する。これには、一般的なクオリティ管理だけでなく、低クオリティ塩基のフィルタリング、Ns、残ったアダプター、および汚染生物由来のリードが含まれる。本ソフトウェアは、シングルまたはペアエンドシーケンシングの圧縮されていないfastqファイルと圧縮されたfastqファイルの両方を処理し、1サンプルあたりのクオリティレポートとデータセット内のすべてのサンプルに対する要約レポートを生成し、優れた診断プロットを提供する。

Figure 1: Workflow for preprocessing fastq files with FastqPuri. （Preprintより転載）

Table 1

f:id:kazumaxneo:20181201213740p:plain

Preprintより転載

インストール

mac os10.14のminiconda2-4.0.5環境でテストした。

依存

OS: Linux (clang, gcc), Mac OS (clang, gcc), OpenBSD (clang)
cmake (at least version 2.8),
a C compiler supporting the c11 standard (change the compiler flags otherwise),
pandoc (optional, see documentation in PANDOC.md),
Rscript (optional),
Following R packages installed (optional):pheatmap
knitr
rmarkdown

apt install pandoc

#Rライブラリ
install_github("raivokolde/pheatmap")
install.packages('knitr', dependencies = TRUE)
install.packages('rmarkdown', dependencies = TRUE)

本体 Github

git clone https://github.com/jengelmann/FastqPuri.git
cd FastqPuri/

#R/bin/にあるRscriptのパスを指定
cmake -H. -Bbuild/ -DRSCRIPT=/usr/lib/R/bin/Rscript
cd build/
make
cd ../bin/

追記
#bioconda (link)
mamba create -n fastqpuri -y python=3.8
conda activate fastqpuri
mamba install -c bioconda -y fastqpuri

> ls FastqPuri/bin/

CMakeFiles Qreport Sreport makeBloom makeTree trimFilter trimFilterPE

#ヘルプ

> Qreport

$ ./Qreport

Usage: ./Qreport -i <INPUT_FILE.fq> -l <READ_LENGTH>

-o <OUTPUT_FILE> [-t <NUMBER_OF_TILES>] [-q <MINQ>]

[-n <#_QUALITY_VALUES>] [-f <FILTER_STATUS>]

[-0 <ZEROQ>] [-Q <low-Qs>]

Reads in a fq file (gz, bz2, z formats also accepted) and creates a

quality report (html file) along with the necessary data to create it

stored in binary format.

Options:

-v Prints package version.

-h Prints help dialog.

-i Input file [*fq|*fq.gz|*fq.bz2]. Mandatory option.

-l Read length. Length of the reads. Mandatory option.

-o Output file prefix (with NO extension). Mandatory option.

-t Number of tiles. Optional (default 96).

-q Minimum quality allowed. Optional (default 27).

-n Number of different quality values allowed. Optional (default 46).

-f Filter status: 0 original file, 1 file filtered with trimFilter,

2 file filtered with another tool. Optional (default 0).

-0 ASCII value for quality score 0. Optional (default 33).

-Q quality values for low quality proportion plot. Optional (default 27,33,37),

Format is either <int>[,<int>]* or <min-int>:<max-int>.

File: /Users/user/local/FastqPuri/src/init_Qreport.c, line: 79

> Sreport

$ ./Sreport

Usage: ./Sreport -i <INPUT_FOLDER> -t <Q|F|P> -o <OUTPUT_FILE>

Uses all *bin files found in a folder (output of Qreport|trimFilter|trimFilterPE)

and generates a summary report in html format (of Qreport|trimFilter|trimFilterPE).

Options:

-v Prints package version.

-h Prints help dialog.

-i Input folder containing *bin data (output from Qreport). Mandatory option.

-t {Q,F,P} Type of report to be generated: 'Q' for quality summary

report, 'F' for filter summary report (single-end reads), and

'P' for filter summary report (paired-end reads)

data filter summary report. Mandatory option,

-o Output file (with NO extension). Mandatory option.

File: /Users/user/local/FastqPuri/src/init_Sreport.c, line: 69

——

> makeTree

$ ./makeTree

Usage: ./makeTree -f|--fasta <FASTA_INPUT> -l|--depth <DEPTH> -o, --output <OUTPUT_FILE>

Reads a *fa file, constructs a tree of depth DEPTH and saves it

compressed in OUTPUT_FILE.

Options:

-v, --version Prints package version.

-h, --help Prints help dialog.

-f, --fasta Fasta input file. Mandatory option.

-l, --depth depth of the tree structure. Mandatory option.

-o, --output Output file. If the extension is not *gz, it is added. Mandatory option.

File: /Users/user/local/FastqPuri/src/init_makeTree.c, line: 68

——

> makeBloom

$ ./makeBloom

Usage: ./makeBloom --fasta <FASTA_INPUT> --output <FILTERFILE> --kmersize [KMERSIZE]

(--fal_pos_rate [p] | --hashNum [HASHNUM] | --bfsizeBits [SIZEBITS])

Options:

-v, --version Prints package version.

-h, --help Prints help dialog.

-f, --fasta Fasta input file. Mandatory option.

-o, --output Output file, with NO extension. Mandatory option.

-k, --kmersize kmer size, number or elements. Optional(default = 25)

-p, --fal_pos_rate false positive rate. Optional (default = 0.05)

-g, --hashNum number of hash functions used. Optional (default

value computed from the false positive rate).

-m, --bfsizeBits size of the filter in bits. It will be forced to be

a multiple of 8. Optional (default value computed

from the false positive rate).

NOTE: the options -p, -g, -m are mutually exclusive. The program

will give an error if more than one of them are passed as input.

It is recommended to pass the false positive rate and let the

program compute the other variables (excepting singular situations)

If none of them are passed, the false positive rate will default

to 0.05 and the other variables will be computed respecting this

requirement. See documentation and supplementary material for

more details.

File: /Users/user/local/FastqPuri/src/init_makeBloom.c, line: 81

> ./trimFilter

$ trimFilter

Usage: trimFilter --ifq <INPUT_FILE.fq> --length <READ_LENGTH>

--output [O_PREFIX] --gzip [y|n]

--adapter [<ADAPTERS.fa>:<mismatches>:<score>]

--method [TREE|BLOOM]

(--idx [<INDEX_FILE>:<score>:<lmer_len>] |

--ifa [<INPUT.fa>:<score>:[lmer_len]])

--minL [MINL] --minQ [MINQ] --zeroQ [ZEROQ]

(--percent [percent] | --global [n1:n2])

--trimN [NO|ALL|ENDS|STRIP]

Reads in a fq file (gz, bz2, z formats also accepted) and removes:

* low quality reads,

* reads containing N base callings,

* reads representing contaminations, belonging to sequences in INPUT.fa

Outputs 4 [O_PREFIX]_fq.gz files containing: "good" reads, discarded

low Q reads, discarded reads containing N's, discarded contaminations.

Options:

-v, --version prints package version.

-h, --help prints help dialog.

-f, --ifq fastq input file [*fq|*fq.gz|*fq.bz2], mandatory option.

-l, --length read length: length of the reads, mandatory option.

-o, --output output prefix (with path), optional (default ./out).

-z, --gzip gzip output files: yes or no (default yes)

-A, --adapter adapter input. Three fields separated by colons:

<ADAPTERS.fa>: fasta file containing adapters,

<mismatches>: maximum mismatch count allowed,

<score>: score threshold for the aligner.

-x, --idx index input file. To be included with methods to remove.

contaminations (TREE, BLOOM). 3 fields separated by colons:

<INDEX_FILE>: output of makeTree, makeBloom,

<score>: score threshold to accept a match [0,1],

[lmer_len]: the length of the lmers to be

looked for in the reads [1,READ_LENGTH].

-a, --ifa fasta input file of potential contaminations.

To be included only with method TREE

(it excludes the option --idx). Otherwise, an

index file has to be precomputed and given as parameter

(see option --idx). 3 fields separated by colons:

<INPUT.fa>: fasta input file [*fa|*fa.gz|*fa.bz2],

<score>: score threshold to accept a match [0,1],

<lmer_len>: depth of the tree: [1,READ_LENGTH].

Corresponds to the length of the lmers to be

looked for in the reads.

-C, --method method used to look for contaminations:

TREE: uses a 4-ary tree. Index file optional,

BLOOM: uses a bloom filter. Index file mandatory.

-Q, --trimQ NO: does nothing to low quality reads (default),

ALL: removes all reads containing at least one low

quality nucleotide.

ENDS: trims the ends of the read if their quality is

below the threshold -q,

FRAC: discards a read if the fraction of bases with

low quality scores (below -q) is over 5 percent

or a user defined percentage (-p).

ENDSFRAC: trims the ends and then discards the read if

there are more low quality nucleotides than

allowed by the option -p.

GLOBAL: removes n1 bases on the left and n2 on the

right, specified in -g.

All reads are discarded if they are shorter than MINL

(specified with -m or --minL).

-m, --minL minimum length allowed for a read before it is discarded

(default 25).

-q, --minQ minimum quality allowed (int), optional (default 27).

-0, --zeroQ value of ASCII character representing zero quality (int), optional (default 33).

-p, --percent percentage of low quality bases tolerated before

discarding a read (default 5),

-g, --global required option if --trimQ GLOBAL is passed. Two int,

n1:n2, have to be passed specifying the number of bases

to be globally cut from the left and right, respectively.

-N, --trimN NO: does nothing to reads containing N's,

ALL: removes all reads containing N's,

ENDS: trims ends of reads with N's,

STRIPS: looks for the largest substring with no N's.

All reads are discarded if they are shorter than the

sequence length specified by -m/--minL.

File: /FastqPuri/src/init_trimFilter.c, line: 132

——

> ./trimFilterPE

$ ./trimFilterPE

Usage: trimFilterPE --ifq <INPUT1.fq>:<INPUT2.fq> --length <READ_LENGTH>

--output [O_PREFIX] --gzip [y|n]

--adapter [<AD1.fa>:<AD2.fa>:<mismatches>:<score>]

--method [TREE|BLOOM]

(--idx [<INDEX_FILE>:<score>:<lmer_len>] |

--ifa [<INPUT.fa>:<score>:[lmer_len]])

--minL [MINL] --minQ [MINQ] --zeroQ [ZEROQ]

(--percent [percent] | --global [n1:n2])

--trimN [NO|ALL|ENDS|STRIP]

Reads in paired end fq files (gz, bz2, z formats also accepted) and removes:

* low quality reads,

* reads containing N base callings,

* reads representing contaminations, belonging to sequences in INPUT.fa.

Outputs 8 [O_PREFIX][1|2]_fq.gz files containing: "good" reads, discarded

low Q reads, discarded reads containing N's, discarded contaminations.

Options:

-v, --version prints package version.

-h, --help prints help dialog.

-f, --ifq 2 fastq input files [*fq|*fq.gz|*fq.bz2] separated by

colons, mandatory option.

-l, --length read length: length of the reads, mandatory option.

-o, --output output prefix (with path), optional (default ./out).

-z, --gzip gzip output files: yes or no (default yes)

-A, --adapter adapter input. Four fields separated by colons:

<AD1.fa>: fasta file containing adapters,

<AD2.fa>: fasta file containing adapters,

<mismatches>: maximum mismatch count allowed,

<score>: score threshold for the aligner.

-x, --idx index input file. To be included with any methods to remove.

contaminations (TREE, BLOOM). 3 fields separated by colons:

<INDEX_FILE>: output of makeTree, makeBloom,

<score>: score threshold to accept a match [0,1],

[lmer_len]: the length of the lmers to be

looked for in the reads [1,READ_LENGTH].

-a, --ifa fasta input file of potential contaminations.

To be included only with method TREE

(it excludes the option --idx). Otherwise, an

index file has to be precomputed and given as parameter

(see option --idx). 3 fields separated by colons:

<INPUT.fa>: fasta input file [*fa|*fa.gz|*fa.bz2],

<score>: score threshold to accept a match [0,1],

<lmer_len>: depth of the tree: [1,READ_LENGTH].

Corresponds to the length of the lmers to be

looked for in the reads.

-C, --method method used to look for contaminations:

TREE: uses a 4-ary tree. Index file optional,

BLOOM: uses a bloom filter. Index file mandatory.

-Q, --trimQ NO: does nothing to low quality reads (default),

ALL: removes all reads containing at least one low

quality nucleotide.

ENDS: trims the ends of the read if their quality is

below the threshold -q,

FRAC: discards a read if the fraction of bases with

low quality scores (below -q) is over 5 percent

or a user defined percentage (-p).

ENDSFRAC: trims the ends and then discards the read if

there are more low quality nucleotides than

allowed by the option -p.

GLOBAL: removes n1 cycles on the left and n2 on the

right, specified in -g.

All reads are discarded if they are shorter than MINL

(specified with -m or --minL).

-m, --minL minimum length allowed for a read before it is discarded

(default 25).

-q, --minQ minimum quality allowed (int), optional (default 27).

-0, --zeroQ value of ASCII character representing zero quality (int), optional (default 33)

-p, --percent percentage of low quality bases tolerated before

discarding a read (default 5),

-g, --global required option if --trimQ GLOBAL is passed. Two int,

n1:n2, have to be passed specifying the number of bases

to be globally cut from the left and right, respectively.

-N, --trimN NO: does nothing to reads containing N's,

ALL: removes all reads containing N's,

ENDS: trims ends of reads with N's,

STRIPS: looks for the largest substring with no N's.

All reads are discarded if they are shorter than the

sequence length specified by -m/--minL.

File: /Users/user/local/FastqPuri/src/init_trimFilterDS.c, line: 136

ビルド手順が煩雑なので、すぐ使えるdockerイメージを上げておきます。(2018/12/01にdocker hubにpush)

#version1.0
docker pull kazumax/fastqpuri

例えばイメージの/dataとホストのカレントパスをシェアして実行。リード長maxは151とする
docker run --rm -itv $PWD:/data/ kazumax/fastqpuri 
source /etc/profile
Qreport -i input1.fq -l 151 -o sample1

追記

上に書いたようにcondaで導入できるようになっています。

実行方法

1、クオリティレポート出力。ここでは３サンプルを順番に分析している。

mkdir out_dir
#single
Qreport -i sample1.fq.gz -l 100 -f 0 -o out_dir/sample1
Qreport -i sample2.fq.gz -l 100 -f 0 -o out_dir/sample2
Qreport -i sample3.fq.gz -l 100 -f 0 -o out_dir/sample3


#paired-end
Qreport -i pair_1.fq.z -i pair_1.fq.gz -l 100 -f 0 -o out_dir/sample

-i Input file [*fq|*fq.gz|*fq.bz2]. Mandatory option
-o Output file prefix (with NO extension). Mandatory option.
-f Filter status: 0 original file, 1 file filtered with trimFilter, 2 file filtered with another tool. Optional (default 0)
-l Read length. Length of the reads. Mandatory option

ランが終わると、次のステップで使うバイナリファイル(.bin)と、それぞれのサンプルのレポートファイル（プレーンテキスト+ html）が出力される。

sample1.html

f:id:kazumaxneo:20181201135627j:plain

f:id:kazumaxneo:20181201135630j:plain

f:id:kazumaxneo:20181201135632j:plain

レーン内のタイルごとのlow qualityリードの割合を可視化した図も出力されている。（*1）

２、sapmle1-3のサマリーレポート出力。ディレクトリ内にある全ての.binファイルが対象となる。

Sreport -i out_dir/ -t Q -o raw_report

-t {Q,F,P} Type of report to be generated: 'Q' for quality summary report, 'F' for filter summary report (single-end reads), and 'P' for filter summary report (paired-end reads) data filter summary report. Mandatory option,

フィルタリング前は"-t Q"を選ぶ。

summmary.htmlが出力される。

f:id:kazumaxneo:20181201190600p:plain

フローセル１レーン内のタイルナンバー、low qualityのリードの割合、Nsを持つリード数の割合だけ要約した表として出力される。表の下にはクオリティ分布が視覚化される。この結果を元に、どのようなトリミングを行うかを考える。

Optional step

汚染配列がある場合 (*2)

makeBloom -o contamination -f contami_genome.fa -p 0.05 -k 25

-f Fasta input file. Mandatory option
-p false positive rate. Optional (default = 0.05)
-k number or elements. Optional(default = 25)

contamination.bfとcontamination.txtが出力される。次のステップでcontamination.bfを指定して汚染配列を別出力する（"--idx contamination.bf:0.05"）。

３、クオリティトリミング。ここからは１サンプルの例だけ示す。

#single-end
trimFilter --ifq single.fq --length 100 -q 27 -Q ENDS -z no --output trimmed_

#paired-end
trimFilterPE --ifq pair1.fq:pair2.fq --length 100 -q 27 -Q ENDS -z no --output trimmed_


#paired-end: low quality, Ns、contammination、read length（≤25）のフィルタリング実行
trimFilterPE --ifq pair1.fq:pair2.fq
--idx contamination.bf:0.05 
--length 100 
-N ENDS 
-Q ENDS 
-m 25 
-q 20 
-z no 
--output trimmed_

-a fasta input file of potential contaminations. To be included only with method TREE
-m minimum length allowed for a read before it is discarded (25).
-q minimum quality allowed (int), optional (default 27).
-N, --trimN

         NO: does nothing to reads containing N's,
        ALL: removes all reads containing N's,
         ENDS: trims ends of reads with N's,
       STRIPS: looks for the largest substring with no N's.
         All reads are discarded if they are shorter than MINL (specified with -m).

-Q, --trimQ

         NO: does nothing to low quality reads (default),
         ALL: removes all reads containing at least one low quality nucleotide.
         ENDS: trims the ends of the read if their quality is below the threshold -q,
         FRAC: discards a read if the fraction of bases with low quality scores (below -q) is over 5 percent or a user defined percentage (-p).
         ENDSFRAC: trims the ends and then discards the read if there are more low quality nucleotides than allowed by the option -p.
         GLOBAL: removes n1 bases on the left and n2 on the
right, specified in -g.
         All reads are discarded if they are shorter than MINL (specified with -m).

good qualityとbad qualityのfastqが別々に出力される。また、sumaryの.binファイルが出力される。Nsのフィルターやコンタミネーションのフィルターもつけていた場合、それらのファイルも別々に出力される。

quality trimmingのみ実行後 (q<27)

f:id:kazumaxneo:20181203114118j:plain

quality trimming実行前

f:id:kazumaxneo:20181203114215j:plain

flowcellの１レーン内のタイルごとのクオリティ

quality trimmingのみ実行後 (q<27)

f:id:kazumaxneo:20181203113945j:plain

quality trimming実行前

f:id:kazumaxneo:20181203114013j:plain

リード数

$ seqkist stats *fastq

f:id:kazumaxneo:20181201154134j:plain

使用したデータでは、トータルサイズは3%減少。

４、最後にフィルタリング結果のサマリーレポートを出力する。ディレクトリ内にあるsummaryの.binファイル（3のステップで出力される）が自動で認識される。

#paired-end
Sreport -i /data/ -t P -o filtering_report

#single-end
Sreport -i /data/ -t F -o filtering_report

-t {Q,F,P} Type of report to be generated: 'Q' for quality summary report, 'F' for filter summary report (single-end reads), and 'P' for filter summary report (paired-end reads) data filter summary report.