macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ショートリードとロングリード両方に対応した高速なクオリティフィルタリングツール RabbitQC

2020 8/19 追記

 

 現代のシーケンシング技術は、生物学や医学の多くの分野で革命を起こし続けている。生成されたデータセットはエラーが発生しやすいため、下流のアプリケーションでは通常、FASTQファイルを前処理するための品質管理手法が必要となる。しかし、このタスクのための既存のツールは、現在のところ、計算プラットフォームの機能を十分に活用できず、実行時間が遅くなっている。

 ここでは、最新のハードウェアをフルに活用できるFASTQファイル用の非常に高速な統合品質管理ツールであるRabbitQCを紹介する。このツールには様々な操作が含まれており、様々なシーケンス技術(Illumina、Oxford Nanopore、PacBio)をサポートしている。RabbitQCは、他の最先端のツールと比較して、1桁から2桁のスピードアップを実現している。C +++ のソースとバイナリは https://github.com/ZekunYin/RabbitQC で入手可能である。

 

インストール

ubuntu18.04LTSでテストした(CPU:xeon E5 v4 2680 dual)。エラーになったので、ビルド済みのパッケージをwindows 10 proのpowershellでテストした。

Github

git clone https://github.com/ZekunYin/RabbitQC.git
cd RabbitQC/
make

> ./rabbit_qc

# ./rabbit_qc 

rabbit_qc: an ultra-fast all-in-one FASTQ preprocessor

version 0.0.1

4 CPUs detected

usage: ./rabbit_qc [options] ... 

options:

  -i, --in1                            read1 input file name (string [=])

  -o, --out1                           read1 output file name (string [=])

  -I, --in2                            read2 input file name (string [=])

  -O, --out2                           read2 output file name (string [=])

  -6, --phred64                        indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)

  -z, --compression                    compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])

      --stdin                          input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.

      --stdout                         stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end input. Disabled by default.

      --interleaved_in                 indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.

      --reads_to_process               specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])

      --dont_overwrite                 don't overwrite existing files. Overwritting is allowed by default.

  -V, --verbose                        output verbose log information (i.e. when every 1M reads are processed).

  -A, --disable_adapter_trimming       adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled

  -a, --adapter_sequence               the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])

      --adapter_sequence_r2            the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])

      --detect_adapter_for_pe          by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.

  -f, --trim_front1                    trimming how many bases in front for read1, default is 0 (int [=0])

  -t, --trim_tail1                     trimming how many bases in tail for read1, default is 0 (int [=0])

  -b, --max_len1                       if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])

  -F, --trim_front2                    trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])

  -T, --trim_tail2                     trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])

  -B, --max_len2                       if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])

  -g, --trim_poly_g                    force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

      --poly_g_min_len                 the minimum length to detect polyG in the read tail. 10 by default. (int [=10])

  -G, --disable_trim_poly_g            disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

  -x, --trim_poly_x                    enable polyX trimming in 3' ends.

      --poly_x_min_len                 the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

  -5, --cut_by_quality5                enable per read cutting by quality in front (5'), default is disabled (WARNING: this will interfere deduplication for both PE/SE data)

  -3, --cut_by_quality3                enable per read cutting by quality in tail (3'), default is disabled (WARNING: this will interfere deduplication for SE data)

  -W, --cut_window_size                the size of the sliding window for sliding window trimming, default is 4 (int [=4])

  -M, --cut_mean_quality               the bases in the sliding window with mean quality below cutting_quality will be cut, default is Q20 (int [=20])

  -Q, --disable_quality_filtering      quality filtering is enabled by default. If this option is specified, quality filtering is disabled

  -q, --qualified_quality_phred        the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])

  -u, --unqualified_percent_limit      how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])

  -n, --n_base_limit                   if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])

  -L, --disable_length_filtering       length filtering is enabled by default. If this option is specified, length filtering is disabled

  -l, --length_required                reads shorter than length_required will be discarded, default is 15. (int [=15])

      --length_limit                   reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])

  -y, --low_complexity_filter          enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).

  -Y, --complexity_threshold           the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])

      --filter_by_index1               specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])

      --filter_by_index2               specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])

      --filter_by_index_threshold      the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])

  -c, --correction                     enable base correction in overlapped regions (only for PE data), default is disabled

      --overlap_len_require            the minimum length of the overlapped region for overlap analysis based adapter trimming and correction. 30 by default. (int [=30])

      --overlap_diff_limit             the maximum difference of the overlapped region for overlap analysis based adapter trimming and correction. 5 by default. (int [=5])

  -U, --umi                            enable unique molecular identifier (UMI) preprocessing

      --umi_loc                        specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])

      --umi_len                        if the UMI is in read1/read2, its length should be provided (int [=0])

      --umi_prefix                     if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])

      --umi_skip                       if the UMI is in read1/read2, rabbit_qc can skip several bases following UMI, default is 0 (int [=0])

  -p, --overrepresentation_analysis    enable overrepresented sequence analysis.

  -P, --overrepresentation_sampling    one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])

  -j, --json                           the json format report file name (string [=RabbitQC.json])

  -h, --html                           the html format report file name (string [=RabbitQC.html])

  -R, --report_title                   should be quoted with ' or ", default is "rabbitQC report" (string [=RabbitQC report])

  -w, --thread                         worker thread number, default is [max CPU cores - 2] (int [=2])

  -s, --split                          split output by limiting total split file number with this option (2~999), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (int [=0])

  -S, --split_by_lines                 split output by limiting lines of each file with this option(>=1000), a sequential number prefix will be added to output name ( 0001.out.fq, 0002.out.fq...), disabled by default (long [=0])

  -d, --split_prefix_digits            the digits for the sequential number padding (1~10), default is 4, so the filename will be padded as 0001.xxx, 0 to disable padding (int [=4])

  -D, --TGS                            whether process third generation sequenceing data

  -m, --minlen                         Minimum length of reads to be included in the plots (int [=200])

  -?, --help                           print this message

 

 

 

テストラン

cd testdata/
../rabbit_qc -w 4 -i R1.fq -I R2.fq -o out.R1.fq.gz -O out.R2.fq.gz

#long read
../rabbit_qc -w 4 -D -i in.fq

エラーになる。=> windows用のリリースだと動作する。

出力

クオリティトリミングされたリードと、fastpと同様のレポート.htmlとjsonファイルが出力される。

f:id:kazumaxneo:20200819193150p:plain

 

 

動作は非常に高速。150Mbx2のfastq処理には3秒しかからなかった(*1)。

引用
RabbitQC: High-speed scalable quality control for sequencing data

Zekun Yin, Hao Zhang, Meiyang Liu, Wen Zhang, Honglei Song, Haidong Lan, Yanjie Wei, Beifang Niu, Bertil Schmidt, Weiguo Liu

Bioinformatics. 2020 Aug 13

 

 

 *1

xeon E5 v4 2680 dusl 56スレッド指定時

 

関連