AfterQCはfastqのフィルタリング、トリミング、エラー修復、およびクオリティチェックを全て自動で行なってくれるツールである。エラー修復はオーバーラップするペアードエンドリードのクオリティを比較して実行される。2017年に論文が発表された。
インストール
git clone https://github.com/OpenGene/AfterQC.git
$ after.py -h
Usage: Automatic Filtering, Trimming, Error Removing and Quality Control for Illumina fastq data
Simplest usage:
cd to the folder containing your fastq data, run <python after.py>
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-1 READ1_FILE, --read1_file=READ1_FILE
file name of read1, required. If input_dir is
specified, then this arg is ignored.
-2 READ2_FILE, --read2_file=READ2_FILE
file name of read2, if paired. If input_dir is
specified, then this arg is ignored.
-7 INDEX1_FILE, --index1_file=INDEX1_FILE
file name of 7' index. If input_dir is specified, then
this arg is ignored.
-5 INDEX2_FILE, --index2_file=INDEX2_FILE
file name of 5' index. If input_dir is specified, then
this arg is ignored.
-d INPUT_DIR, --input_dir=INPUT_DIR
the input dir to process automatically. If read1_file
are input_dir are not specified, then current dir (.)
is specified to input_dir
-g GOOD_OUTPUT_FOLDER, --good_output_folder=GOOD_OUTPUT_FOLDER
the folder to store good reads, by default it is named
'good', in the current directory
-b BAD_OUTPUT_FOLDER, --bad_output_folder=BAD_OUTPUT_FOLDER
the folder to store bad reads, by default it is named
'bad', in the same folder as good_output_folder
-r REPORT_OUTPUT_FOLDER, --report_output_folder=REPORT_OUTPUT_FOLDER
the folder to store QC reports, by default it is named
'QC', in the same folder as good_output_folder
--read1_flag=READ1_FLAG
specify the name flag of read1, default is R1, which
means a file with name *R1* is read1 file
--read2_flag=READ2_FLAG
specify the name flag of read2, default is R2, which
means a file with name *R2* is read2 file
--index1_flag=INDEX1_FLAG
specify the name flag of index1, default is I1, which
means a file with name *I1* is index2 file
--index2_flag=INDEX2_FLAG
specify the name flag of index2, default is I2, which
means a file with name *I2* is index2 file
-f TRIM_FRONT, --trim_front=TRIM_FRONT
number of bases to be trimmed in the head of read. -1
means auto detect
-t TRIM_TAIL, --trim_tail=TRIM_TAIL
number of bases to be trimmed in the tail of read. -1
means auto detect
--trim_pair_same=TRIM_PAIR_SAME
use same trimming configuration for read1 and read2 to
keep their sequence length identical, default is true
-q QUALIFIED_QUALITY_PHRED, --qualified_quality_phred=QUALIFIED_QUALITY_PHRED
the quality value that a base is qualifyed. Default 15
means phred base quality >=Q15 is qualified.
-u UNQUALIFIED_BASE_LIMIT, --unqualified_base_limit=UNQUALIFIED_BASE_LIMIT
if exists more than unqualified_base_limit bases that
quality is lower than qualified quality, then this
read/pair is bad. Default is 60
-p POLY_SIZE_LIMIT, --poly_size_limit=POLY_SIZE_LIMIT
if exists one polyX(polyG means GGGGGGGGG...), and its
length is >= poly_size_limit, then this read/pair is
bad. Default is 35
-a ALLOW_MISMATCH_IN_POLY, --allow_mismatch_in_poly=ALLOW_MISMATCH_IN_POLY
the count of allowed mismatches when detection polyX.
Default 2 means allow 2 mismatches for polyX detection
-n N_BASE_LIMIT, --n_base_limit=N_BASE_LIMIT
if exists more than maxn bases have N, then this
read/pair is bad. Default is 5
-s SEQ_LEN_REQ, --seq_len_req=SEQ_LEN_REQ
if the trimmed read is shorter than seq_len_req, then
this read/pair is bad. Default is 35
--debubble specify whether apply debubble algorithm to remove the
reads in the bubbles. Default is False
--debubble_dir=DEBUBBLE_DIR
specify the folder to store output of debubble
algorithm, default is debubble
--draw=DRAW specify whether draw the pictures or not, when use
debubble or QC. Default is on
--barcode=BARCODE specify whether deal with barcode sequencing files,
default is on, which means all files with barcode_flag
in filename will be treated as barcode sequencing
files
--barcode_length=BARCODE_LENGTH
specify the designed length of barcode
--barcode_flag=BARCODE_FLAG
specify the name flag of a barcoded file, default is
barcode, which means a file with name *barcode* is a
barcoded file
--barcode_verify=BARCODE_VERIFY
specify the verify sequence of a barcode which is
adjunct to the barcode
--store_overlap=STORE_OVERLAP
specify whether store only overlapped bases of the
good reads
--overlap_output_folder=OVERLAP_OUTPUT_FOLDER
the folder to store only overlapped bases of the good
reads
--qc_only if qconly is true, only QC result will be output, this
can be much fast
--qc_sample=QC_SAMPLE
sample up to qc_sample reads when do QC, 0 means
sample all reads. Default is 200,000
--qc_kmer=QC_KMER specify the kmer length for KMER statistics for QC,
default is 8
--no_correction disable base correction for mismatched base pairs in
overlapped areas
--mask_mismatch set the qual num to 0 for mismatched base pairs in
overlapped areas to mask them out
--no_overlap disable overlap analysis (usually much faster with
this option)
——
ラン
シングルエンド
python after.py -1 R1.fq
ペアードエンド
python after.py -1 R1.fq -2 R2.fq
fastqのディレクトリを指定してラン(fastqを全て解析)。
python after.py -d input/
ラン中にlogがほとんど出ないので不安になるが、200MBx2のfastqで10分くらいかかるのでいじらず待つ。解析が終わると、ランしたパスに3つのディレクトリができる。
上では2組のペアードエンドfastqフォルダを解析している。badにはトリミングをパスしなかったリードが収納されており、goodにトリミングされて最低限の長さを満たしたリードが収納されている。ペアリードの順番が崩れないように、goodにはペアリード両方についてPASSしたリードだけ出力されている。
Q>30でトリミングし、25-bp以下になったリードは破棄する。
python after.py -1 R1.fq -2 R2.fq -q 30 -s 25
フィルタリングオプション
- -f <int> number of bases to be trimmed in the head of read. -1 means auto detect.
- -t <int> number of bases to be trimmed in the tail of read. -1 means auto detect.
- -q <int> the quality value that a base is qualifyed. Default 20 means base quality >=Q20 is qualified.
- -u <int> if exists more than unqualified_base_limit bases that quality is lower than qualified quality, then this read/pair is bad. Default 0 means do not filter reads by low quality base count.
- -p <int> POLY_SIZE_LIMIT if exists one polyX(polyG means GGGGGGGGG...), and its length is >= poly_size_limit, then this read/pair is bad. Default is 35.
- -a <int> the count of allowed mismatches when evaluating poly_X. Default 5 means disallow any mismatches.
- -n <int> if exists more than maxn bases have N, then this read/pair is bad. Default is 5.
- -s <int> if the trimmed read is shorter than seq_len_req, then this read/pair is bad. Default is 35.
バブル除去オプション(一般のシーケンスには非推奨)
- --debubble enable debubble algorithm to remove the reads in the bubbles. Default is False
バーコードオプション
- --barcode specify whether deal with barcode sequencing files, default is on
クオリティチェックオプション
- --qc_only enable this option, only QC result will be output, this can be much faster
- --qc_kmer <int> specify the kmer length for KMER statistics for QC, default is 8
QC/には分析結果がhtmlで出力される。html5で描画されており、編集可能である。
AfterQCの 動作はfastqcとBBtools、scytheなどのツールを合わせたようなものになっている。動作はやや遅いため、後継のfastpの論文も発表されている。
引用
AfterQC: automatic filtering, trimming, error removing and quality control for fastq data.
Shifu Chen, Tanxiao Huang, Yanqing Zhou, Yue Han, Mingyan Xu and Jia Gu.
BMC Bioinformatics 2017 18(Suppl 3):80