UrQt, an unsupervised quality trimming tool - macでインフォマティクス

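Unreliable nucleotides can increase the number of false negatives and false positives in downstream analyses, or generate erroneous k-mers in de novo assembly, which complicates the assembly and can cause misassemblies [from the paper, ref. 4]. To remove unreliable nucleotides and use only informative ones, most NGS data analyses start with a QC step.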

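The classical QC strategy for dealing with low-quality nucleotides is to visualize per-nucleotide quality with a tool such as FastQC [ref. 6], remove an arbitrary number of nucleotides from the 5' and 3' ends of the reads with a tool such as FASTX-Toolkit [ref. 5], and then discard every read in which a given percentage of the length falls below a specified phred score. More recent approaches modify nucleotides to remove low-frequency polymorphisms (error correction). These approaches often use k-mer sequences to correct low-frequency motifs based on the most frequent ones. However, they require high sequencing coverage (15x for Quake, 100x for ALLPATHS-LG) and cannot be applied to non-uniform sequencing experiments such as RNA-seq.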
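The classical strategy described above can be sketched in a few lines of Python. This is only an illustration of the idea, not what FASTX-Toolkit does internally; the function name `classic_qc` and its parameters (`head`, `tail`, `min_phred`, `min_percent`) are hypothetical names chosen for this example.

```python
def classic_qc(seq, quals, head=0, tail=0, min_phred=20, min_percent=80.0):
    """Clip a fixed number of bases, then filter the read by quality.

    seq   : nucleotide string
    quals : list of integer phred scores, one per base
    Returns the clipped (seq, quals), or None if the read is discarded.
    """
    # Step 1: remove an arbitrary, fixed number of bases from both ends.
    end = max(head, len(seq) - tail)
    seq, quals = seq[head:end], quals[head:end]
    if not seq:
        return None

    # Step 2: keep the read only if enough of its bases reach min_phred.
    passing = sum(1 for q in quals if q >= min_phred)
    if 100.0 * passing / len(quals) < min_percent:
        return None
    return seq, quals
```

The weakness of this approach is visible in the signature: `head` and `tail` are the same for every read, regardless of where the quality of a given read actually drops.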

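UrQt implements an unsupervised segmentation algorithm that finds the optimal trimming cut point in each read by maximum likelihood. It requires no data-dependent parameters and takes advantage of modern multi-core architectures, which makes it a useful tool to include in analysis pipelines.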
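As a rough illustration of what finding a cut point by maximum likelihood can look like, the sketch below scores every possible cut in a read under a simple two-segment model and keeps the best one. It is not UrQt's actual model: the mapping from phred score to probability (`1 - 2 ** (-q / t)`, so that a score equal to the threshold `t` gives 0.5), the function name `best_tail_cut`, and the restriction to tail trimming are all assumptions made for this example.

```python
import math


def best_tail_cut(quals, t=20, eps=1e-6):
    """Return k such that quals[:k] is kept and quals[k:] is trimmed.

    Assumed toy model: each base is "good" with probability
    p = 1 - 2 ** (-q / t). The read is a good segment followed by a bad
    segment, and k maximizes the log-likelihood of that split.
    """
    # Clamp probabilities away from 0 and 1 so the logarithms stay finite.
    p = [min(max(1.0 - 2.0 ** (-q / t), eps), 1.0 - eps) for q in quals]
    log_good = [math.log(x) for x in p]
    log_bad = [math.log(1.0 - x) for x in p]

    # Start with k = 0 (everything in the bad segment), then move the cut
    # one base to the right at a time, updating the score incrementally.
    score = sum(log_bad)
    best_k, best_score = 0, score
    for k in range(1, len(quals) + 1):
        score += log_good[k - 1] - log_bad[k - 1]
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

Unlike a fixed clip length, the cut position here depends on each read's own quality profile, which is the point of the segmentation approach.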

Official site

https://lbbe.univ-lyon1.fr/-UrQt-.html?lang=fr

Manual

https://lbbe.univ-lyon1.fr/Documentation-5173.html?lang=fr

Installation

git clone https://github.com/l-modolo/UrQt
cd UrQt/
make
./UrQt

$ ./UrQt

UrQt.1.0.18

Argument must be defined.

Usage: ./UrQt --in <input.fastq> --out <output.fastq>

--in input fastq file

--out output fastq file

Optional:

--inpair input fastq file for paired end data

--outpair output fastq file for paired end data, empty read in one file will be removed in both

--phred <number> [33 = Sanger (ASCII 33 to 126), 64 = Illumina 1.3 (ASCII 64 to 126), 59 = Solexa/Illumina 1.0 (ASCII 59 to 126)] (default: 33)

Trimming option:

--t <number> minimum phred score for a ``good quality'' (default: 20)

--N <character> polyN to trim (default: QC trimming)

--max_head_trim <number> maximum number of nucleotide trimmed at the head of the reads (default: read length)

--max_tail_trim <number> maximum number of nucleotide trimmed at the tail of the reads (default: read length)

--min_read_size <number> remove all reads smaller than this size after the trimming step (default: 0)

--pos <head|tail|both> (expected position of trimmed sequence in the read) (default: both)

--r no removing of empty reads (100% of bases trimmed) (default: the empty reads are removed from the output)

--min_QC_length <double> if present with --min_QC_phred the minimum percentage of base with min_QC_phred necessary to keep a read (default: without QC percentage for a length)

--min_QC_phred <int> if present with --min_QC_length, the minimum phred score on min_QC_length percent of the base necessary to keep a read (default: without QC percentage for a length)

Estimation :

--s <number> number of reads to sample to compute the fixe proportion of the 4 different nucleotides (default: proportion computed in the partitioning of each reads)

--S if present the proportion of the 4 different nucleotides is set to 1/4 (default: proportion computed in the partitioning of each reads)

Other:

--v verbose

--gz gziped output

--m <number> number of thread to use

--buffer <buffer> max number of reads in memory

Copy the binary to a directory on your PATH.
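The `--phred` option in the help output refers to the ASCII offset used to encode quality scores in the FASTQ quality line. A minimal sketch of that decoding, with a hypothetical helper name `decode_qualities`, might look like this:

```python
def decode_qualities(quality_line, offset=33):
    """Convert a FASTQ quality string into a list of integer scores.

    offset is the ASCII offset of the encoding, e.g. 33 or 64.
    Raises ValueError if a character falls below the offset, which
    usually means the wrong encoding was assumed.
    """
    scores = [ord(ch) - offset for ch in quality_line]
    if any(s < 0 for s in scores):
        raise ValueError("quality character below offset %d" % offset)
    return scores
```

For example, `decode_qualities("II5!")` gives scores of 40, 40, 20 and 0 with the default offset of 33, while the same line decoded with offset 64 raises an error because `5` and `!` lie below ASCII 64.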

Run

Single-end. Gzipped files are also supported.

UrQt --in single.fastq.gz --gz --m 8 --out trimmed.fastq.gz

--in input fastq file
--out output fastq file
--t <number> minimum phred score for a ``good quality'' (default: 20)
--gz gziped output
--m <number> number of thread to use

Paired-end

UrQt --in pair1.fq --inpair pair2.fq --out pair1_trimmed.fastq --outpair pair2_trimmed.fastq
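According to the help text, a read that ends up empty in one file is removed from both files, so the two outputs stay in sync. The idea can be sketched as follows; this is an illustration only, not UrQt's code, and `drop_incomplete_pairs` and the `(name, seq, qual)` tuple layout are assumptions for this example.

```python
def drop_incomplete_pairs(reads1, reads2):
    """Keep only pairs in which both trimmed mates are non-empty.

    reads1, reads2 : lists of (name, seq, qual) tuples in the same order.
    Returns two lists of equal length, still in matching order.
    """
    kept1, kept2 = [], []
    for r1, r2 in zip(reads1, reads2):
        # A mate trimmed down to an empty sequence removes the whole pair.
        if r1[1] and r2[1]:
            kept1.append(r1)
            kept2.append(r2)
    return kept1, kept2
```

Keeping both files the same length and in the same order matters because most downstream aligners and assemblers match mates by position in the file.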

before (641,426 reads, mean length 294 bp)


after (641,413 reads, mean length 216 bp) (--t 20)


For comparison, here is the result with sickle (see the sickle introduction post) (quality ≥ 20)

after (599,641 reads, mean length 187.8 bp)

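Read counts and mean lengths like the ones above can be recomputed from any FASTQ file. The sketch below is one way to do it, assuming plain four-line FASTQ records (no wrapped sequence lines); the function name `fastq_stats` is hypothetical and this is not necessarily how the numbers in this post were obtained.

```python
def fastq_stats(lines):
    """Return (number_of_reads, mean_read_length) for four-line FASTQ records.

    lines : any iterable of text lines, e.g. an open file handle.
    """
    n_reads = 0
    total_bases = 0
    for i, line in enumerate(lines):
        # In a four-line record, the second line holds the sequence.
        if i % 4 == 1:
            n_reads += 1
            total_bases += len(line.strip())
    mean_length = total_bases / n_reads if n_reads else 0.0
    return n_reads, mean_length
```

To use it on a file, pass an open text handle, for example one from `open("trimmed.fastq")` or, for gzipped output, `gzip.open("trimmed.fastq.gz", "rt")`.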

Citation

UrQt: an efficient software for the Unsupervised Quality trimming of NGS data.

Modolo L, Lerat E

BMC Bioinformatics. 2015 Apr 29;16:137.