リードからのプライマーの切断は、ターゲットアンプリコンのNGSデータを処理する上で重要なステップである。既存のツールは、リードから1つまたはいくつかのプライマー/アダプター配列を切断し、そして出現するそれらすべてを除去するように適合されている。また、既存のツールのほとんどは、kmerを使用しており、プライマーの一部または研究された遺伝子配列を有するプライマーのみを切断することができる。このため、そのようなプログラムを使用すると、誤ったトリミング、カバレッジの縮小、および誤検出増加につながる。本著者らはリードから任意の数のプライマーを正確に切断するcutPrimersという新しいツールを開発した。 BRCA1 / 2遺伝子の研究中に得られたシークエンスリードを用いて、cutPrimersをcutadapt、AlienTrimmer、およびBBDukと比較した。それらすべてが、cutPrimerでトリミングされたリードと比較して、少なくとも2つのアンプリコンのカバレッジが許容できないレベル(<30リード)に減少するようにリードをトリミングした。同時に、TrimmomaticとAlienTrimmerはすべてのプライマー配列をカットしたため、残りのリードの長さは予想以上に短くなった。
インストール
ubuntu16.04でテストした。
依存
cutPrimers works on the Python3+ and requires the following packages:
- Biopython - you can install it with:
sudo apt-get install python3-biopython
or download it from http://biopython.org/wiki/Download and install it locally withpython3 setup.py install --user
- regex - you can install it with:
sudo apt-get install python3-regex
or download it from https://pypi.python.org/pypi/regex/ and install it locally withpython3 setup.py install --user
- argparse - you can install it with:
sudo pip3 install argparse
or download it from https://pypi.python.org/pypi/argparse and install it locally withpython3 setup.py install --user
- pysam - you can install it with:
sudo pip3 install pysam
or download it from https://pypi.org/project/pysam/. With pysam you need to install cython.
本体 Github
git clone https://github.com/aakechin/cutPrimers.git
cd cutPrimers/
> python3 cutPrimers.py -h
# python3 cutPrimers.py -h
usage: cutPrimers.py [-h] [--readsFile_r1 READSFILE1]
[--readsFile_r2 READSFILE2] [--bam-file BAMFILE]
[--coordinates-file COORDSFILE]
[--out-bam-file OUTBAMFILE]
[--out-untrimmed-bam-file OUTUNTRIMMEDBAMFILE]
[--minimal-read-length MINREADLEN] [--hard-clipping]
--primersFileR1_5 PRIMERSFILER1_5
[--primersFileR2_5 PRIMERSFILER2_5]
[--primersFileR1_3 PRIMERSFILER1_3]
[--primersFileR2_3 PRIMERSFILER2_3]
[--trimmedReadsR1 TRIMMEDREADSR1]
[--trimmedReadsR2 TRIMMEDREADSR2]
[--untrimmedReadsR1 UNTRIMMEDREADSR1]
[--untrimmedReadsR2 UNTRIMMEDREADSR2]
[--primersStatistics PRIMERSSTATISTICS]
[--error-number ERRNUMBER]
[--primer-location-buffer PRIMERLOCBUF]
[--min-primer3-length MINPRIMER3LEN] [--primer3-absent]
[--identify-dimers IDIMER] [--threads THREADS]
This script cuts primers from reads sequences
optional arguments:
-h, --help show this help message and exit
--readsFile_r1 READSFILE1, -r1 READSFILE1
file with R1 reads of one sample
--readsFile_r2 READSFILE2, -r2 READSFILE2
file with R2 reads of one sample
--bam-file BAMFILE, -bam BAMFILE
BAM-file in which you want to cut primers from reads
--coordinates-file COORDSFILE, -coord COORDSFILE
file with coordinates of amplicons in the BED-format
(without column names and locations of primers):
chromosome | start | end. It is necessary for cutting
primer sequences from BAM-file. Its order should be
the same as for files with primer sequences
--out-bam-file OUTBAMFILE, -outbam OUTBAMFILE
name of file for output BAM-file with reads
--out-untrimmed-bam-file OUTUNTRIMMEDBAMFILE, -outbam2 OUTUNTRIMMEDBAMFILE
name of file for output BAM-file with untrimmed reads.
It is not required. If you do not use this parameter,
all untrimmed reads will be lost
--minimal-read-length MINREADLEN, -minlen MINREADLEN
minimal length of read after trimming. Default: 10
--hard-clipping, -hard
use this parameter, if you want to trim reads wuith
hard clipping. By default, primer sequences are
trimmed with soft-clipping
--primersFileR1_5 PRIMERSFILER1_5, -pr15 PRIMERSFILER1_5
fasta-file with sequences of primers on the 5'-end of
R1 reads
--primersFileR2_5 PRIMERSFILER2_5, -pr25 PRIMERSFILER2_5
fasta-file with sequences of primers on the 5'-end of
R2 reads. Do not use this parameter if you have
single-end reads
--primersFileR1_3 PRIMERSFILER1_3, -pr13 PRIMERSFILER1_3
fasta-file with sequences of primers on the 3'-end of
R1 reads. It is not required. But if it is determined,
-pr23 is necessary
--primersFileR2_3 PRIMERSFILER2_3, -pr23 PRIMERSFILER2_3
fasta-file with sequences of primers on the 3'-end of
R2 reads
--trimmedReadsR1 TRIMMEDREADSR1, -tr1 TRIMMEDREADSR1
name of file for trimmed R1 reads
--trimmedReadsR2 TRIMMEDREADSR2, -tr2 TRIMMEDREADSR2
name of file for trimmed R2 reads
--untrimmedReadsR1 UNTRIMMEDREADSR1, -utr1 UNTRIMMEDREADSR1
name of file for untrimmed R1 reads. If you want to
write reads that has not been trimmed to the same file
as trimmed reads, type the same name
--untrimmedReadsR2 UNTRIMMEDREADSR2, -utr2 UNTRIMMEDREADSR2
name of file for untrimmed R2 reads. If you want to
write reads that has not been trimmed to the same file
as trimmed reads, type the same name
--primersStatistics PRIMERSSTATISTICS, -stat PRIMERSSTATISTICS
name of file for statistics of errors in primers. This
works only for paired-end reads with primers at 3'-
and 5'-ends
--error-number ERRNUMBER, -err ERRNUMBER
number of errors (substitutions, insertions,
deletions) that allowed during searching primer
sequence in a read sequence. Default: 5
--primer-location-buffer PRIMERLOCBUF, -plb PRIMERLOCBUF
Buffer of primer location in the read from the start
or end of read. If this value is zero, than cutPrimers
will search for primer sequence in the region of the
longest primer length. Default: 10
--min-primer3-length MINPRIMER3LEN, -primer3len MINPRIMER3LEN
Minimal length of primer on the 3'-end to trim. Use
this parameter, if you are ready to trim only part of
primer sequence of the 3'-end of read
--primer3-absent, -primer3
if primer at the 3'-end may be absent, use this
parameter
--identify-dimers IDIMER, -idimer IDIMER
use this parameter if you want to get statistics of
homo- and heterodimer formation. Choose file to which
statistics of primer-dimers will be written. This
parameter may slightly decrease the speed of analysis
--threads THREADS, -t THREADS
number of threads
テストラン
入力のfastqを指定する。
cd cutPrimers/
python3 cutPrimers.py \
-r1 example/1_S1_L001_R1_001.fastq.gz \
-r2 example/1_S1_L001_R2_001.fastq.gz \
-pr15 example/primers_R1_5.fa \
-pr25 example/primers_R2_5.fa \
-pr13 example/primers_R1_3.fa \
-pr23 example/primers_R2_3.fa \
-tr1 example/1_r1_trimmed.fastq.gz \
-tr2 example/1_r2_trimmed.fastq.gz \
-utr1 example/1_r1_untrimmed.fastq.gz \
-utr2 example/1_r2_untrimmed.fastq.gz \
-t 2
- -r1 file with R1 reads of one sample
- -r2 file with R2 reads of one sample
- -pr15 fasta-file with sequences of primers on the 5'-end of R1 reads
- -pr25 fasta-file with sequences of primers on the 5'-end of R2 reads. Do not use this parameter if you have single-end reads
- -pr13 fasta-file with sequences of primers on the 3'-end of R1 reads. It is not required. But if it is determined, -pr23 is necessary
- -pr23 fasta-file with sequences of primers on the 3'-end of R2 reads
- -tr1 name of file for trimmed R1 reads
- -tr2 name of file for trimmed R2 reads
- -utr1 name of file for untrimmed R1 reads. If you want to write reads that has not been trimmed to the same file as trimmed reads, type the same name
- -utr2 name of file for untrimmed R2 reads. If you want to write reads that has not been trimmed to the same file as trimmed reads, type the same name
- -t number of threads
出力のtrimmed/untrimmed fastqはgzip圧縮される。
bamを指定してprimerを除くこともできます。helpを確認して下さい。
引用
cutPrimers: A New Tool for Accurate Cutting of Primers from Reads of Targeted Next Generation Sequencing
Kechin A, Boyarskikh U, Kel A, Filipenko M
J Comput Biol. 2017 Nov;24(11):1138-1143