QC、エラー修復、トリミング、レポート作成を自動実行する AfterQC

AfterQCはfastqのフィルタリング、トリミング、エラー修復、およびクオリティチェックを全て自動で行なってくれるツールである。エラー修復はオーバーラップするペアードエンドリードのクオリティを比較して実行される。2017年に論文が発表された。

インストール

Github

git clone https://github.com/OpenGene/AfterQC.git

$ after.py -h

Usage: Automatic Filtering, Trimming, Error Removing and Quality Control for Illumina fastq data

Simplest usage:

cd to the folder containing your fastq data, run <python after.py>

Options:

--version show program's version number and exit

-h, --help show this help message and exit

-1 READ1_FILE, --read1_file=READ1_FILE

file name of read1, required. If input_dir is

specified, then this arg is ignored.

-2 READ2_FILE, --read2_file=READ2_FILE

file name of read2, if paired. If input_dir is

specified, then this arg is ignored.

-7 INDEX1_FILE, --index1_file=INDEX1_FILE

file name of 7' index. If input_dir is specified, then

this arg is ignored.

-5 INDEX2_FILE, --index2_file=INDEX2_FILE

file name of 5' index. If input_dir is specified, then

this arg is ignored.

-d INPUT_DIR, --input_dir=INPUT_DIR

the input dir to process automatically. If read1_file

are input_dir are not specified, then current dir (.)

is specified to input_dir

-g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER

the folder to store good reads, by default it is named

'good', in the current directory

-b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER

the folder to store bad reads, by default it is named

'bad', in the same folder as good_output_folder

-r REPORT_OUTPUT_FOLDER, --report_output_folder=REPORT_OUTPUT_FOLDER

the folder to store QC reports, by default it is named

'QC', in the same folder as good_output_folder

--read1_flag=READ1_FLAG

specify the name flag of read1, default is R1, which

means a file with name *R1* is read1 file

--read2_flag=READ2_FLAG

specify the name flag of read2, default is R2, which

means a file with name *R2* is read2 file

--index1_flag=INDEX1_FLAG

specify the name flag of index1, default is I1, which

means a file with name *I1* is index2 file

--index2_flag=INDEX2_FLAG

specify the name flag of index2, default is I2, which

means a file with name *I2* is index2 file

-f TRIM_FRONT, --trim_front=TRIM_FRONT

number of bases to be trimmed in the head of read. -1

means auto detect

-t TRIM_TAIL, --trim_tail=TRIM_TAIL

number of bases to be trimmed in the tail of read. -1

means auto detect

--trim_pair_same=TRIM_PAIR_SAME

use same trimming configuration for read1 and read2 to

keep their sequence length identical, default is true

-q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED

the quality value that a base is qualifyed. Default 15

means phred base quality >=Q15 is qualified.

-u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT

if exists more than unqualified_base_limit bases that

quality is lower than qualified quality, then this

read/pair is bad. Default is 60

-p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT

if exists one polyX(polyG means GGGGGGGGG...), and its

length is >= poly_size_limit, then this read/pair is

bad. Default is 35

-a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY

the count of allowed mismatches when detection polyX.

Default 2 means allow 2 mismatches for polyX detection

-n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT

if exists more than maxn bases have N, then this

read/pair is bad. Default is 5

-s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ

if the trimmed read is shorter than seq_len_req, then

this read/pair is bad. Default is 35

--debubble specify whether apply debubble algorithm to remove the

reads in the bubbles. Default is False

--debubble_dir=DEBUBBLE_DIR

specify the folder to store output of debubble

algorithm, default is debubble

--draw=DRAW specify whether draw the pictures or not, when use

debubble or QC. Default is on

--barcode=BARCODE specify whether deal with barcode sequencing files,

default is on, which means all files with barcode_flag

in filename will be treated as barcode sequencing

files

--barcode_length=BARCODE_LENGTH

specify the designed length of barcode

--barcode_flag=BARCODE_FLAG

specify the name flag of a barcoded file, default is

barcode, which means a file with name *barcode* is a

barcoded file

--barcode_verify=BARCODE_VERIFY

specify the verify sequence of a barcode which is

adjunct to the barcode

--store_overlap=STORE_OVERLAP

specify whether store only overlapped bases of the

good reads

--overlap_output_folder=OVERLAP_OUTPUT_FOLDER

the folder to store only overlapped bases of the good

reads

--qc_only if qconly is true, only QC result will be output, this

can be much fast

--qc_sample=QC_SAMPLE

sample up to qc_sample reads when do QC, 0 means

sample all reads. Default is 200,000

--qc_kmer=QC_KMER specify the kmer length for KMER statistics for QC,

default is 8

--no_correction disable base correction for mismatched base pairs in

overlapped areas

--mask_mismatch set the qual num to 0 for mismatched base pairs in

overlapped areas to mask them out

--no_overlap disable overlap analysis (usually much faster with

this option)

——

ラン

シングルエンド

python after.py -1 R1.fq

ペアードエンド

python after.py -1 R1.fq -2 R2.fq

fastqのディレクトリを指定してラン（fastqを全て解析）。

python after.py -d input/

ラン中にlogがほとんど出ないので不安になるが、200MBx2のfastqで10分くらいかかるのでいじらず待つ。解析が終わると、ランしたパスに３つのディレクトリができる。

f:id:kazumaxneo:20170917002342j:plain

上では２組のペアードエンドfastqフォルダを解析している。badにはトリミングをパスしなかったリードが収納されており、goodにトリミングされて最低限の長さを満たしたリードが収納されている。ペアリードの順番が崩れないように、goodにはペアリード両方についてPASSしたリードだけ出力されている。

Q>30でトリミングし、25-bp以下になったリードは破棄する。

python after.py -1 R1.fq -2 R2.fq -q 30 -s 25

フィルタリングオプション

-f <int>　number of bases to be trimmed in the head of read. -1 means auto detect.
-t <int>　number of bases to be trimmed in the tail of read. -1 means auto detect.
-q <int>　the quality value that a base is qualifyed. Default 20 means base quality >=Q20 is qualified.
-u <int>　if exists more than unqualified_base_limit bases that quality is lower than qualified quality, then this read/pair is bad. Default 0 means do not filter reads by low quality base count.
-p <int>　POLY_SIZE_LIMIT if exists one polyX(polyG means GGGGGGGGG...), and its length is >= poly_size_limit, then this read/pair is bad. Default is 35.
-a <int>　the count of allowed mismatches when evaluating poly_X. Default 5 means disallow any mismatches.
-n <int>　if exists more than maxn bases have N, then this read/pair is bad. Default is 5.
-s <int>　if the trimmed read is shorter than seq_len_req, then this read/pair is bad. Default is 35.

バブル除去オプション（一般のシーケンスには非推奨）

--debubble　enable debubble algorithm to remove the reads in the bubbles. Default is False

バーコードオプション

--barcode　specify whether deal with barcode sequencing files, default is on

クオリティチェックオプション

--qc_only　enable this option, only QC result will be output, this can be much faster
--qc_kmer <int>　specify the kmer length for KMER statistics for QC, default is 8

QC/には分析結果がhtmlで出力される。html5で描画されており、編集可能である。

f:id:kazumaxneo:20170917004054j:plain