ターゲットアンプリコンシーケンシングのプライマーを除く cutPrimers

　リードからのプライマーの切断は、ターゲットアンプリコンのNGSデータを処理する上で重要なステップである。既存のツールは、リードから１つまたはいくつかのプライマー／アダプター配列を切断し、そして出現するそれらすべてを除去するように適合されている。また、既存のツールのほとんどは、kmerを使用しており、プライマーの一部または研究された遺伝子配列を有するプライマーのみを切断することができる。このため、そのようなプログラムを使用すると、誤ったトリミング、カバレッジの縮小、および誤検出増加につながる。本著者らはリードから任意の数のプライマーを正確に切断するcutPrimersという新しいツールを開発した。 BRCA1 / 2遺伝子の研究中に得られたシークエンスリードを用いて、cutPrimersをcutadapt、AlienTrimmer、およびBBDukと比較した。それらすべてが、cutPrimerでトリミングされたリードと比較して、少なくとも2つのアンプリコンのカバレッジが許容できないレベル（<30リード）に減少するようにリードをトリミングした。同時に、TrimmomaticとAlienTrimmerはすべてのプライマー配列をカットしたため、残りのリードの長さは予想以上に短くなった。

インストール

ubuntu16.04でテストした。

依存

Linux

cutPrimers works on the Python3+ and requires the following packages:

Biopython - you can install it with: sudo apt-get install python3-biopython or download it from http://biopython.org/wiki/Download and install it locally with python3 setup.py install --user
regex - you can install it with: sudo apt-get install python3-regex or download it from https://pypi.python.org/pypi/regex/ and install it locally with python3 setup.py install --user
argparse - you can install it with: sudo pip3 install argparse or download it from https://pypi.python.org/pypi/argparse and install it locally with python3 setup.py install --user
pysam - you can install it with: sudo pip3 install pysam or download it from https://pypi.org/project/pysam/. With pysam you need to install cython.

本体　Github

git clone https://github.com/aakechin/cutPrimers.git
cd cutPrimers/

> python3 cutPrimers.py -h

# python3 cutPrimers.py -h

usage: cutPrimers.py [-h] [--readsFile_r1 READSFILE1]

[--readsFile_r2 READSFILE2] [--bam-file BAMFILE]

[--coordinates-file COORDSFILE]

[--out-bam-file OUTBAMFILE]

[--out-untrimmed-bam-file OUTUNTRIMMEDBAMFILE]

[--minimal-read-length MINREADLEN] [--hard-clipping]

--primersFileR1_5 PRIMERSFILER1_5

[--primersFileR2_5 PRIMERSFILER2_5]

[--primersFileR1_3 PRIMERSFILER1_3]

[--primersFileR2_3 PRIMERSFILER2_3]

[--trimmedReadsR1 TRIMMEDREADSR1]

[--trimmedReadsR2 TRIMMEDREADSR2]

[--untrimmedReadsR1 UNTRIMMEDREADSR1]

[--untrimmedReadsR2 UNTRIMMEDREADSR2]

[--primersStatistics PRIMERSSTATISTICS]

[--error-number ERRNUMBER]

[--primer-location-buffer PRIMERLOCBUF]

[--min-primer3-length MINPRIMER3LEN] [--primer3-absent]

[--identify-dimers IDIMER] [--threads THREADS]

This script cuts primers from reads sequences

optional arguments:

-h, --help show this help message and exit

--readsFile_r1 READSFILE1, -r1 READSFILE1

file with R1 reads of one sample

--readsFile_r2 READSFILE2, -r2 READSFILE2

file with R2 reads of one sample

--bam-file BAMFILE, -bam BAMFILE

BAM-file in which you want to cut primers from reads

--coordinates-file COORDSFILE, -coord COORDSFILE

file with coordinates of amplicons in the BED-format

(without column names and locations of primers):

chromosome | start | end. It is necessary for cutting

primer sequences from BAM-file. Its order should be

the same as for files with primer sequences

--out-bam-file OUTBAMFILE, -outbam OUTBAMFILE

name of file for output BAM-file with reads

--out-untrimmed-bam-file OUTUNTRIMMEDBAMFILE, -outbam2 OUTUNTRIMMEDBAMFILE

name of file for output BAM-file with untrimmed reads.

It is not required. If you do not use this parameter,

all untrimmed reads will be lost

--minimal-read-length MINREADLEN, -minlen MINREADLEN

minimal length of read after trimming. Default: 10

--hard-clipping, -hard

use this parameter, if you want to trim reads wuith

hard clipping. By default, primer sequences are

trimmed with soft-clipping

--primersFileR1_5 PRIMERSFILER1_5, -pr15 PRIMERSFILER1_5

fasta-file with sequences of primers on the 5'-end of

R1 reads

--primersFileR2_5 PRIMERSFILER2_5, -pr25 PRIMERSFILER2_5

fasta-file with sequences of primers on the 5'-end of

R2 reads. Do not use this parameter if you have

single-end reads

--primersFileR1_3 PRIMERSFILER1_3, -pr13 PRIMERSFILER1_3

fasta-file with sequences of primers on the 3'-end of

R1 reads. It is not required. But if it is determined,

-pr23 is necessary

--primersFileR2_3 PRIMERSFILER2_3, -pr23 PRIMERSFILER2_3

fasta-file with sequences of primers on the 3'-end of

R2 reads

--trimmedReadsR1 TRIMMEDREADSR1, -tr1 TRIMMEDREADSR1

name of file for trimmed R1 reads

--trimmedReadsR2 TRIMMEDREADSR2, -tr2 TRIMMEDREADSR2

name of file for trimmed R2 reads

--untrimmedReadsR1 UNTRIMMEDREADSR1, -utr1 UNTRIMMEDREADSR1

name of file for untrimmed R1 reads. If you want to

write reads that has not been trimmed to the same file

as trimmed reads, type the same name

--untrimmedReadsR2 UNTRIMMEDREADSR2, -utr2 UNTRIMMEDREADSR2

name of file for untrimmed R2 reads. If you want to

write reads that has not been trimmed to the same file

as trimmed reads, type the same name

--primersStatistics PRIMERSSTATISTICS, -stat PRIMERSSTATISTICS

name of file for statistics of errors in primers. This

works only for paired-end reads with primers at 3'-

and 5'-ends

--error-number ERRNUMBER, -err ERRNUMBER

number of errors (substitutions, insertions,

deletions) that allowed during searching primer

sequence in a read sequence. Default: 5

--primer-location-buffer PRIMERLOCBUF, -plb PRIMERLOCBUF

Buffer of primer location in the read from the start

or end of read. If this value is zero, than cutPrimers

will search for primer sequence in the region of the

longest primer length. Default: 10

--min-primer3-length MINPRIMER3LEN, -primer3len MINPRIMER3LEN

Minimal length of primer on the 3'-end to trim. Use

this parameter, if you are ready to trim only part of

primer sequence of the 3'-end of read

--primer3-absent, -primer3

if primer at the 3'-end may be absent, use this

parameter

--identify-dimers IDIMER, -idimer IDIMER

use this parameter if you want to get statistics of

homo- and heterodimer formation. Choose file to which

statistics of primer-dimers will be written. This

parameter may slightly decrease the speed of analysis

--threads THREADS, -t THREADS

number of threads

テストラン

入力のfastqを指定する。

cd cutPrimers/
python3 cutPrimers.py \
 -r1 example/1_S1_L001_R1_001.fastq.gz \
 -r2 example/1_S1_L001_R2_001.fastq.gz \
 -pr15 example/primers_R1_5.fa \
 -pr25 example/primers_R2_5.fa \
 -pr13 example/primers_R1_3.fa \
 -pr23 example/primers_R2_3.fa \
 -tr1 example/1_r1_trimmed.fastq.gz \
 -tr2 example/1_r2_trimmed.fastq.gz \
 -utr1 example/1_r1_untrimmed.fastq.gz \
 -utr2 example/1_r2_untrimmed.fastq.gz \
-t 2

-r1 file with R1 reads of one sample
-r2 file with R2 reads of one sample
-pr15 fasta-file with sequences of primers on the 5'-end of R1 reads
-pr25 fasta-file with sequences of primers on the 5'-end of R2 reads. Do not use this parameter if you have single-end reads
-pr13 fasta-file with sequences of primers on the 3'-end of R1 reads. It is not required. But if it is determined, -pr23 is necessary
-pr23 fasta-file with sequences of primers on the 3'-end of R2 reads
-tr1 name of file for trimmed R1 reads
-tr2 name of file for trimmed R2 reads
-utr1 name of file for untrimmed R1 reads. If you want to write reads that has not been trimmed to the same file as trimmed reads, type the same name
-utr2 name of file for untrimmed R2 reads. If you want to write reads that has not been trimmed to the same file as trimmed reads, type the same name
-t number of threads

出力のtrimmed/untrimmed fastqはgzip圧縮される。

bamを指定してprimerを除くこともできます。helpを確認して下さい。

引用

cutPrimers: A New Tool for Accurate Cutting of Primers from Reads of Targeted Next Generation Sequencing
Kechin A, Boyarskikh U, Kel A, Filipenko M

J Comput Biol. 2017 Nov;24(11):1138-1143