Ktrim: preprocessing Illumina and BGI sequencing reads

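Next-generation sequencing (NGS) data frequently suffer from low-quality cycles and adapter contamination, so reads must be preprocessed before downstream analysis. As the throughput and read length of modern sequencers keep growing, preprocessing has become a bottleneck in data analysis because the performance of current tools is insufficient. An ultra-fast and accurate adapter- and quality-trimming tool for sequencing data is therefore urgently needed.
This study presents Ktrim. Its key features are built-in support for the adapters of common library preparation kits, support for user-supplied adapter sequences, support for both paired-end and single-end data, and support for parallelization to speed up the analysis. Ktrim is roughly 2 to 18 times faster than current tools and also showed high accuracy when applied to the test datasets. Ktrim could thus serve as a valuable and efficient tool for preprocessing short-read NGS data.
The source code and the scripts for reproducing the results described in the paper are freely available at https://github.com/hellosunking/Ktrim/ under the GPL v3 license.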

Installation

Tested on Ubuntu 18.04.

GitHub

git clone https://github.com/hellosunking/Ktrim.git
cd Ktrim/
make clean
make
make install # requires root privileges; alternatively, add bin/ to your PATH

> ktrim -h

# ./ktrim -h

Usage: Ktrim [options] -1/-U Read1.fq [ -2 Read2.fq ] -o out.prefix
Author : Kun Sun (sunkun@szbl.ac.cn)
Version: 1.1.0 (Feb 2020)

Ktrim is designed to perform adapter- and quality-trimming of FASTQ files.

Compulsory parameters:

  -1/-U Read1.fq  Specify the path to the files containing read 1
                  If your data is Paired-end, use '-1' and specify read 2 files using '-2' option
                  Note that if '-U' is used, specification of '-2' is invalid
                  If you have multiple files for your sample, use ',' to separate them
  -o out.prefix   Specify the prefix of the output files
                  Note that output files include trimmed reads in FASTQ format and statistics

Optional parameters:

  -2 Read2.fq     Specify the path to the file containing read 2
                  Use this parameter if your data is generated in paired-end mode
                  If you have multiple files for your sample, use ',' to separate them
                  and make sure that all the files are well paired in '-1' and '-2' options
  -t threads      Specify how many threads should be used (default: 1, single-thread)
                  You can set '-t' to 0 to use all threads (automatically detected)
  -p phred-base   Specify the baseline of the phred score (default: 33)
  -q score        The minimum quality score to keep the cycle (default: 20)
                  Note that 20 means 1% error rate, 30 means 0.1% error rate in Phred
                  Phred 33 ('!') and Phred 64 ('@') are the most widely used scoring system
                  Quality scores start from 35 ('#') in the FASTQ files is also common
  -s size         Minimum read size to be kept after trimming (default: 36)
  -k kit          Specify the sequencing kit to use built-in adapters
                  Currently supports 'Illumina' (default), 'Nextera', 'Transposase' and 'BGI'
  -a sequence     Specify the adapter sequence in read 1
  -b sequence     Specify the adapter sequence in read 2
                  If '-a' is set while '-b' is not, I will assume that read 1 and 2 use same adapter
                  Note that '-k' option has a higher priority (when set, '-a'/'-b' will be ignored)
  -m proportion   Set the proportion of mismatches allowed during index and sequence comparison
                  Default: 0.125 (i.e., 1/8 of compared base pairs)
  -h/--help       Show this help information and quit
  -v/--version    Show the software version and quit

Please refer to README.md file for more information (e.g., setting adapters).

Ktrim: extra-fast and accurate adapter- and quality-trimmer.

Add bin/ to your PATH.

Usage

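Ktrim ships with the adapter sequences used by the Illumina TruSeq kits, the Nextera kits, the Nextera transposase adapters, and the BGI sequencing kits; a built-in set is selected with the '-k' option. Custom adapter sequences can also be supplied with the '-a' (read 1) and '-b' (read 2; omit it when the adapter is the same as for read 1) options.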
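To make the '-a' and '-m' options more concrete, here is a minimal Python sketch of 3' adapter trimming with a mismatch proportion. It is a naive left-to-right scan written only for illustration and is not Ktrim's actual algorithm. The function names, the `min_overlap` parameter, and the sample adapter string are all assumptions, and how Ktrim rounds the allowed number of mismatches is not stated in the help text.

```python
def count_mismatches(read: str, adapter: str, pos: int) -> tuple[int, int]:
    """Compare the adapter to the read starting at `pos`.

    Returns (mismatches, compared_length). The adapter may run past
    the 3' end of the read, in which case only the overlap is compared.
    """
    overlap = min(len(adapter), len(read) - pos)
    mismatches = sum(1 for i in range(overlap) if read[pos + i] != adapter[i])
    return mismatches, overlap


def trim_adapter(read: str, adapter: str,
                 max_mismatch_ratio: float = 0.125,
                 min_overlap: int = 8) -> str:
    """Cut the read at the first position where the adapter matches.

    `max_mismatch_ratio` mirrors the idea behind '-m' (default 0.125,
    i.e. 1/8 of the compared bases); `min_overlap` is a hypothetical
    guard against spurious short matches and is not a Ktrim option.
    """
    for pos in range(len(read) - min_overlap + 1):
        mismatches, overlap = count_mismatches(read, adapter, pos)
        if mismatches <= overlap * max_mismatch_ratio:
            return read[:pos]
    return read  # no adapter found: keep the read unchanged


# Sample input only; the adapter string is not taken from Ktrim itself.
ADAPTER = "AGATCGGAAGAGC"
insert = "TTCCTTCCTTCCTTCC"
print(trim_adapter(insert + ADAPTER, ADAPTER))          # TTCCTTCCTTCCTTCC
print(trim_adapter(insert + "AGATCGGTAGAGC", ADAPTER))  # one mismatch is tolerated
```

With 13 compared bases and a ratio of 0.125, this sketch tolerates one mismatch (1 <= 1.625) and rejects two.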

Specify the FASTQ files. The examples below use 8 threads; set '-t 0' to use all available threads.

#paired-end
ktrim -1 pair_1.fq -2 pair_2.fq -o out.prefix -t 8

#single-end
ktrim -U single.fq -o out.prefix -t 8

-1 Specify the path to the files containing read 1
-2 Specify the path to the file containing read 2
-o out.prefix Specify the prefix of the output files
-t threads Specify how many threads should be used (default: 1, single-thread). You can set '-t' to 0 to use all threads (automatically detected).

When the run finishes, a log and the preprocessed FASTQ files are written out.

Multiple FASTQ files.

mkdir outdir
ktrim -t 0 -p 35 -q 30 -s 36 -o outdir/out \
 -1 lane1_1.fq,lane2_1.fq,lane3_1.fq \
 -2 lane1_2.fq,lane2_2.fq,lane3_2.fq \
 -a READ1_ADAPTER_SEQUENCE -b READ2_ADAPTER_SEQUENCE
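When there are many lanes, the comma-separated '-1' and '-2' lists can be assembled programmatically so that the two lists stay paired in the same order, as the help text requires. The following Python sketch only builds the argument list; the helper name `build_ktrim_command` and the lane file names are illustrative assumptions, and only flags shown in the help output are used.

```python
import shlex


def build_ktrim_command(read1_files, read2_files, out_prefix,
                        threads=0, extra_args=()):
    """Build a paired-end ktrim argument list from per-lane file lists."""
    if len(read1_files) != len(read2_files):
        raise ValueError("read 1 and read 2 file lists must have the same length")
    cmd = [
        "ktrim",
        "-1", ",".join(read1_files),   # multiple files are separated by ','
        "-2", ",".join(read2_files),   # same order as the '-1' list
        "-o", out_prefix,
        "-t", str(threads),            # 0 lets ktrim detect the thread count
    ]
    cmd.extend(extra_args)
    return cmd


lanes = ["lane1", "lane2", "lane3"]  # hypothetical lane names
cmd = build_ktrim_command(
    [f"{lane}_1.fq" for lane in lanes],
    [f"{lane}_2.fq" for lane in lanes],
    "outdir/out",
    extra_args=["-q", "30", "-s", "36"],
)
print(shlex.join(cmd))  # shows the command line without running it
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once `ktrim` is on the PATH and the output directory exists.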

-p Specify the baseline of the phred score (default: 33)
-q The minimum quality score to keep the cycle (default: 20). Note that 20 means 1% error rate, 30 means 0.1% error rate in Phred. Phred 33 ('!') and Phred 64 ('@') are the most widely used scoring systems. Quality scores starting from 35 ('#') in the FASTQ files are also common.
-s Minimum read size to be kept after trimming (default: 36)
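The relationship between the '-p' baseline, the '-q' threshold, and the error rates quoted above follows from the standard Phred definition, Q = -10 * log10(P). The short Python sketch below decodes a FASTQ quality string and converts scores to error probabilities; the function names are illustrative and are not part of Ktrim.

```python
def decode_qualities(quality_string: str, phred_base: int = 33) -> list[int]:
    """Convert a FASTQ quality string to Phred scores.

    `phred_base` plays the role of '-p': it is the ASCII code that
    represents a score of 0 (33 for '!', 64 for '@', 35 for '#').
    """
    return [ord(ch) - phred_base for ch in quality_string]


def phred_to_error_probability(score: float) -> float:
    """Standard Phred relation: P = 10 ** (-Q / 10)."""
    return 10 ** (-score / 10)


# '!' = ASCII 33, '5' = ASCII 53, 'I' = ASCII 73
print(decode_qualities("!5I"))                  # [0, 20, 40] with base 33
print(decode_qualities("@Th", phred_base=64))   # [0, 20, 40] with base 64
print(phred_to_error_probability(20))           # about 0.01, i.e. 1%
print(phred_to_error_probability(30))           # about 0.001, i.e. 0.1%
```

Decoding the same file with the wrong baseline shifts every score by a constant, which is why '-p' should match how the FASTQ file was encoded.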

Citation

Ktrim: an extra-fast and accurate adapter- and quality-trimmer for sequencing data
Kun Sun
Bioinformatics, btaa171, Published: 11 March 2020