Informatics on Mac

A collection of notes on informatics for HTS (NGS).

ClipAndMerge: adapter trimming, quality trimming, and paired-end merging in a single step

 

ClipAndMerge is a tool published on GitHub by Alexander Peltzer that performs adapter trimming, quality trimming, and paired-end read merging in one pass. A single command produces merged FASTQ output.

 

Installation

Tested on macOS 10.14 in a miniconda3-4.2.12 environment.

Source code: GitHub (https://github.com/apeltzer/ClipAndMerge)

# Install with conda in an Anaconda environment
conda install -c bioconda -y ClipAndMerge
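
After installation, it can help to confirm that the ClipAndMerge launcher is actually on the PATH before calling it. The following is a minimal sketch, not part of the tool itself; the helper name have_cmd is hypothetical and was introduced only for this illustration.

```shell
# Report whether a command is available on PATH (hypothetical helper).
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd ClipAndMerge; then
  echo "ClipAndMerge found at: $(command -v ClipAndMerge)"
else
  echo "ClipAndMerge not found; check that the conda environment is activated" >&2
fi
```

If the command is not found, activating the conda environment used for the installation is the first thing to check.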

$ ClipAndMerge -h

ClipAndMerge (v. 1.7.8)
Integrative Transcriptomics
University of Tübingen

Author: Günter Jäger

This tool clips adapters from fastq sequences and merges overlapping regions from forward and reverse reads.
Input sequences are accepted in fastq, or in gzipped fastq format.

Option "-in1" is required
java -jar ClipAndMerge.jar [options...]

  Example: java -jar ClipAndMerge.jar -in1 STRING

 -discardBadReads                    : Discard reads after merging that do not fulfill the quality criteria. (default:
                                       false)
 -e DOUBLE                           : Error rate for merging forward and reverse reads. A value of 0.05 means that 5%
                                       mismatches are allowed in the overlap region. (default: 0.05)
 -f FORWARD_ADAPTER_STRING           : Forward reads adapter sequence. (default: AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC)
 -h                                  : Display this help page and exit. (default: true)
 -in1 STRING[]                       : Forward reads input file(s) in fastq(.gz) file format.
 -in2 STRING[]                       : Reverse reads input file(s) in fastq(.gz) file format.
 -l INTEGER                          : Discard sequences shorter than this number of nucleotides after adapter
                                       clipping. (default: 25)
 -lastBase INTEGER                   : Reads are trimmed from the 3' end until given value is reached. Trimming is not
                                       performed if read is already <= given value. If this option is given the
                                       '-trim3p' option is disregarded! Given value sould be 1-based! (default:
                                       2147483647)
 -log LOG_FILE_STRING                : Write log messages to a file instead of the standard error stream.
 -m INTEGER                          : Require a minimum adapter alignment length. If less nucleotides align with the
                                       adapter, the sequences are not clipped. (default: 8)
 -maxParallelReads NUM_READS_INTEGER : Maximal number of reads, that can be processed in parallel. This number largely
                                       depends on the processing system settings! Only change it if you know what you
                                       are doing! (default: 1000)
 -minQualBadReads INTEGER            : Minimal base quality for keeping bad reads. If 0 is specified, then all reads
                                       are kept. (default: 0)
 -n                                  : Discard sequences with unknown (N) nucleotides. Default is to keep such
                                       sequences. (default: false)
 -no_clip_stats                      : Disable the display of clipping statistics. (default: false)
 -no_clipping                        : Skip adapter clipping. Only read merging is performed! (This is only recommended
                                       if every forward and reverse read has a corresponding partner in the other
                                       respective fastq-file! Otherwise merging can not be performed correctly.
                                       (default: false)
 -no_merging                         : Skip read merging for paired-end sequencing data! Only adapter clipping is
                                       performed. This parameter is not needed for single-end data. (default: false)
 -no_qbMM                            : Do not perform quality based mismatch calculation for merging. Default is to
                                       take quality scores into account. (default: false)
 -o OUTPUT_FILE_STRING               : Output file. If no file is provided, output will be written to System.out. If
                                       file ends with '.gz', output will be gzipped.
 -p INTEGER                          : Minimal number of nucleotides that have to overlap in order to merge the forward
                                       and reverse read. (default: 10)
 -q INTEGER                          : Minimum base quality for quality trimming. (default: 20)
 -qo INTEGER                         : Phred Score offset. (default: 33)
 -qt                                 : Enable quality trimming for non-merged reads. (default: true)
 -qualFreqBadReads DOUBLE            : Percentage of reads that have to fulfill minimal base quality criterion.
                                       (default: 0.9)
 -r REVERSE_ADAPTER_STRING           : Reverse reads adapter sequence. (default: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA)
 -rm_no_partner                      : Remove reads with no pairing partner after adapter clipping. (default: false)
 -timeEstimation                     : Perform remaining time estimation. Note: this can take long for large gzipped
                                       input files. (default: false)
 -trim3p INTEGER                     : Trim N nucleotides from the 3' end of each read. This step is performed after
                                       adapter clipping. Reverse reads are not reverse trancriped before trimming.
                                       (default: 0)
 -trim5p INTEGER                     : Trim N nucleotides from the 5' end of each read. This step is performed after
                                       adapter clipping. Reverse reads are not reverse transcriped before trimming.
                                       (default: 0)
 -u FORWARD_FILE REVERSE_FILE        : Write unmerged forward and reverse reads to extra files. Unmerged forward reads
                                       are written to the file 'FORWARD_FILE'. Unmerged reverse reads are written to
                                       the file 'REVERSE_FILE', i.e. the regular output file then only contains merged
                                       reads!
                                       Attention: If the option '-rm_no_partner' is not selected the two given output
                                       files also contain forward/reverse reads with no pairing partner!
                                       If filenames end with '.gz' gzipped output is produced!
 -verbose                            : Print additional processing information (default: false)

——
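
The -q and -qo options in the help text are tied to how FASTQ encodes base qualities: each quality character's ASCII code minus the Phred offset gives the score. The sketch below illustrates that standard encoding only; it is not taken from ClipAndMerge's source, and the function name phred_score is hypothetical.

```shell
# Convert one FASTQ quality character to a Phred score.
# usage: phred_score CHAR [OFFSET]   (OFFSET defaults to 33, the -qo default)
phred_score() {
  offset=${2:-33}
  # printf with a leading quote yields the character's numeric (ASCII) code
  code=$(printf '%d' "'$1")
  echo $((code - offset))
}

phred_score 'I'   # ASCII 73 - 33 = 40
phred_score '5'   # ASCII 53 - 33 = 20, equal to the default -q value
phred_score '#'   # ASCII 35 - 33 = 2
```

For data in the older offset-64 encoding, the same arithmetic applies with 64 as the second argument, which corresponds to passing -qo 64 to the tool.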

 

 

Usage

Run it by specifying the paired-end reads and the adapter sequences.

ClipAndMerge -verbose -l 25 -p 10 -q 20 -e 0.05 \
-f AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC \
-r AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTA \
-in1 pair_1.fastq -in2 pair_2.fastq \
-o merged.fq.gz
  • -in1   Forward reads input file(s) in fastq(.gz) file format.
  • -in2   Reverse reads input file(s) in fastq(.gz) file format.
  • -o      Output file
  • -p      Minimal number of nucleotides that have to overlap in order to merge the forward and reverse read. (default: 10)
  • -q      Minimum base quality for quality trimming. (default: 20)
  • -e      Error rate for merging forward and reverse reads. A value of 0.05 means that 5% mismatches are allowed in the overlap region. (default: 0.05)
  • -l       Discard sequences shorter than this number of nucleotides after adapter clipping. (default: 25)
  • -verbose   Print additional processing information (default: false)
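
The help text also lists -u, which writes unmerged forward and reverse reads to separate files so that the regular output contains only merged reads, and -log, which sends log messages to a file. The following is a minimal sketch of a per-sample wrapper using those two options. The function name clip_and_merge_sample, the DRY_RUN variable, and the file naming scheme (<sample>_1.fastq, <sample>_2.fastq, and the output names) are all assumptions made for illustration. By default it only prints the command it would run.

```shell
# Print (default) or run a ClipAndMerge command for one sample.
# Assumed, hypothetical naming: <sample>_1.fastq / <sample>_2.fastq as input.
# Set DRY_RUN=0 to execute the command instead of printing it.
clip_and_merge_sample() {
  sample=$1
  # Rebuild the positional parameters as the full command line
  set -- ClipAndMerge \
    -in1 "${sample}_1.fastq" -in2 "${sample}_2.fastq" \
    -o "${sample}.merged.fq.gz" \
    -u "${sample}.unmerged_1.fq.gz" "${sample}.unmerged_2.fq.gz" \
    -log "${sample}.clipandmerge.log"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Loop over sample prefixes; "pair" matches the file names used above.
for s in pair; do
  clip_and_merge_sample "$s"
done
```

As the help text notes, unless -rm_no_partner is also given, the two files passed to -u will additionally contain reads that have no pairing partner.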

 

Reference

https://github.com/apeltzer/ClipAndMerge