Informatics on Mac

A collection of notes on informatics for HTS (NGS).

AdapterRemoval v2: trimming barcodes and adapters

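DNA extracted from fossil-like samples and from the bones and teeth of ancient humans is heavily fragmented, and sequencing of such material is becoming more common. As a result, there is a growing amount of sequencing data in which reads are contaminated by adapters at both the 5' and 3' ends. AdapterRemoval is a demultiplexing and adapter-trimming tool with a flexible parameter set, so trimming can be tuned to a wide range of conditions. It has supported both single-end and paired-end data since its first release. Version 2 adds support for the SSE1 and SSE2 instruction sets, reading of gzip- and bzip2-compressed files, parallel processing, interleaved FASTQ input and output, and merging of overlapping reads.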

 

 

 

Installation

GitHub

 

https://github.com/MikkelSchubert/adapterremoval

Download the source archive and build it.

wget -O adapterremoval-2.1.7.tar.gz https://github.com/MikkelSchubert/adapterremoval/archive/v2.1.7.tar.gz
tar xvzf adapterremoval-2.1.7.tar.gz
cd adapterremoval-2.1.7
make  # build
sudo make install  # install

 

> adapterremoval

$ AdapterRemoval

AdapterRemoval ver. 2.1.7

 

This program searches for and removes remnant adapter sequences from

your read data.  The program can analyze both single end and paired end

data.  For detailed explanation of the parameters, please refer to the

man page.  For comments, suggestions  and feedback please contact Stinus

Lindgreen (stinus@binf.ku.dk) and Mikkel Schubert (MikkelSch@gmail.com).

 

If you use the program, please cite the paper:

    Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid

    adapter trimming, identification, and read merging.

    BMC Research Notes, 12;9(1):88.

 

    http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2

 

 

Arguments:                           Description:

  --help                             Display this message.

  --version                          Print the version string.

 

  --file1 FILE                       Input file containing mate 1 reads or single-ended reads [REQUIRED].

  --file2 FILE                       Input file containing mate 2 reads [OPTIONAL].

 

FASTQ OPTIONS:

  --qualitybase BASE                 Quality base used to encode Phred scores in input; either 33, 64, or solexa

                                       [current: 33].

  --qualitybase-output BASE          Quality base used to encode Phred scores in output; either 33, 64, or solexa. By

                                       default, reads will be written in the same format as the that specified using

                                       --qualitybase.

  --qualitymax BASE                  Specifies the maximum Phred score expected in input files, and used when writing

                                       output. ASCII encoded values are limited to the characters '!' (ASCII = 33) to

                                       '~' (ASCII = 126), meaning that possible scores are 0 - 93 with offset 33, and

                                       0 - 62 for offset 64 and Solexa scores [default: 41].

  --mate-separator CHAR              Character separating the mate number (1 or 2) from the read name in FASTQ

                                       records [default: '/'].

  --interleaved                      This option enables both the --interleaved-input option and the

                                       --interleaved-output option [current: off].

  --interleaved-input                The (single) input file provided contains both the mate 1 and mate 2 reads, one

                                       pair after the other, with one mate 1 reads followed by one mate 2 read. This

                                       option is implied by the --interleaved option [current: off].

  --interleaved-output               If set, trimmed paired-end reads are written to a single file containing mate 1

                                       and mate 2 reads, one pair after the other. This option is implied by the

                                       --interleaved option [current: off].

 

OUTPUT FILES:

  --basename BASENAME                Default prefix for all output files for which no filename was explicitly set

                                       [current: your_output].

  --settings FILE                    Output file containing information on the parameters used in the run as well as

                                       overall statistics on the reads after trimming [default: BASENAME.settings]

  --output1 FILE                     Output file containing trimmed mate1 reads [default: BASENAME.pair1.truncated

                                       (PE), BASENAME.truncated (SE), or BASENAME.paired.truncated (interleaved PE)]

  --output2 FILE                     Output file containing trimmed mate 2 reads [default: BASENAME.pair2.truncated

                                       (only used in PE mode, but not if --interleaved-output is enabled)]

  --singleton FILE                   Output file to which containing paired reads for which the mate has been

                                       discarded [default: BASENAME.singleton.truncated]

  --outputcollapsed FILE             If --collapsed is set, contains overlapping mate-pairs which have been merged

                                       into a single read (PE mode) or reads for which the adapter was identified by

                                       a minimum overlap, indicating that the entire template molecule is present.

                                       This does not include which have subsequently been trimmed due to low-quality

                                       or ambiguous nucleotides [default: BASENAME.collapsed]

  --outputcollapsedtruncated FILE    Collapsed reads (see --outputcollapsed) which were trimmed due the presence of

                                       low-quality or ambiguous nucleotides [default: BASENAME.collapsed.truncated]

  --discarded FILE                   Contains reads discarded due to the --minlength, --maxlength or --maxns options

                                       [default: BASENAME.discarded]

 

OUTPUT COMPRESSION:

  --gzip                             Enable gzip compression [current: off]

  --gzip-level LEVEL                 Compression level, 0 - 9 [current: 6]

  --bzip2                            Enable bzip2 compression [current: off]

  --bzip2-level LEVEL                Compression level, 0 - 9 [current: 9]

 

TRIMMING SETTINGS:

  --adapter1 SEQUENCE                Adapter sequence expected to be found in mate 1 reads [current:

                                       AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG].

  --adapter2 SEQUENCE                Adapter sequence expected to be found in mate 2 reads [current:

                                       AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT].

  --adapter-list FILENAME            Read table of white-space separated adapters pairs, used as if the first column

                                       was supplied to --adapter1, and the second column was supplied to --adapter2;

                                       only the first adapter in each pair is required SE trimming mode [current:

                                       <not set>].

 

  --mm MISMATCH_RATE                 Max error-rate when aligning reads and/or adapters. If > 1, the max error-rate

                                       is set to 1 / MISMATCH_RATE; if < 0, the defaults are used, otherwise the

                                       user-supplied value is used directly. [defaults: 1/3 for trimming; 1/10 when

                                       identifing adapters].

  --maxns MAX                        Reads containing more ambiguous bases (N) than this number after trimming are

                                       discarded [current: 1000].

  --shift N                          Consider alignments where up to N nucleotides are missing from the 5' termini

                                       [current: 2].

 

  --trimns                           If set, trim ambiguous bases (N) at 5'/3' termini [current: off]

  --trimqualities                    If set, trim bases at 5'/3' termini with quality scores <= to --minquality value

                                       [current: off]

  --minquality PHRED                 Inclusive minimum; see --trimqualities for details [current: 2]

  --minlength LENGTH                 Reads shorter than this length are discarded following trimming [current: 15].

  --maxlength LENGTH                 Reads longer than this length are discarded following trimming [current:

                                       4294967295].

  --collapse                         When set, paired ended read alignments of --minalignmentlength or more bases are

                                       combined into a single consensus sequence, representing the complete insert,

                                       and written to either basename.collapsed or basename.collapsed.truncated (if

                                       trimmed due to low-quality bases following collapse); for single-ended reads,

                                       putative complete inserts are identified as having at least

                                       --minalignmentlength bases overlap with the adapter sequence, and are written

                                       to the the same files [current: off].

  --minalignmentlength LENGTH        If --collapse is set, paired reads must overlap at least this number of bases to

                                       be collapsed, and single-ended reads must overlap at least this number of

                                       bases with the adapter to be considered complete template molecules [current:

                                       11].

  --minadapteroverlap LENGTH         In single-end mode, reads are only trimmed if the overlap between read and the

                                       adapter is at least X bases long, not counting ambiguous nucleotides (N); this

                                       is independant of the --minalignmentlength when using --collapse, allowing a

                                       conservative selection of putative complete inserts while ensuring that all

                                       possible adapter contamination is trimmed [current: 0].

 

DEMULTIPLEXING:

  --barcode-list FILENAME            List of barcodes or barcode pairs for single or double-indexed demultiplexing.

                                       Note that both indexes should be specified for both single-end and paired-end

                                       trimming, if double-indexed multiplexing was used, in order to ensure that the

                                       demultiplexed reads can be trimmed correctly [current: <not set>].

  --barcode-mm N                     Maximum number of mismatches allowed when counting mismatches in both the mate 1

                                       and the mate 2 barcode for paired reads.

  --barcode-mm-r1 N                  Maximum number of mismatches allowed for the mate 1 barcode; if not set, this

                                       value is equal to the '--barcode-mm' value; cannot be higher than the

                                       '--barcode-mm value'.

  --barcode-mm-r2 N                  Maximum number of mismatches allowed for the mate 2 barcode; if not set, this

                                       value is equal to the '--barcode-mm' value; cannot be higher than the

                                       '--barcode-mm value'.

 

MISC:

  --identify-adapters                Attempt to identify the adapter pair of PE reads, by searching for overlapping

                                       reads [current: off].

  --seed SEED                        Sets the RNG seed used when choosing between bases with equal Phred scores when

                                       collapsing. Note that runs are not deterministic if more than one thread is

                                       used. If not specified, a seed is generated using the current time.

  --threads THREADS                  Maximum number of threads [current: 1]
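
The --mm rule in the help text above is easy to misread, so the following minimal shell sketch shows how a user-supplied value would be interpreted according to that description. The function name effective_mm is hypothetical and not part of AdapterRemoval; the sketch only restates the documented rule and does not call the tool.

```shell
# effective_mm VALUE
# Prints the max error rate that the help text describes for --mm VALUE.
# Hypothetical helper for illustration; it does not call AdapterRemoval.
effective_mm() {
    awk -v v="$1" 'BEGIN {
        if (v > 1)      printf "%.6f\n", 1 / v   # e.g. 3 means 1/3
        else if (v < 0) print "default"          # 1/3 trimming, 1/10 adapter identification
        else            printf "%.6f\n", v       # used directly
    }'
}

effective_mm 3      # prints 0.333333
effective_mm 0.05   # prints 0.050000
effective_mm -1     # prints default
```

In other words, `--mm 3` and `--mm 0.333` should request roughly the same error rate, while a negative value falls back to the built-in defaults.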

 

Running

Trim Ns and low-quality regions from a single-end FASTQ file and write gzip-compressed output.

AdapterRemoval --file1 single.fq --basename output_single --trimns --trimqualities --gzip --threads 4
  • --file1 FILE   Input file containing mate 1 reads or single-ended reads [REQUIRED].
  • --basename BASENAME   Default prefix for all output files for which no filename was explicitly set [current: your_output].
  • --trimns   If set, trim ambiguous bases (N) at 5'/3' termini [current: off]
  • --trimqualities   If set, trim bases at 5'/3' termini with quality scores <= to --minquality value [current: off]
  • --minquality PHRED   Inclusive minimum; see --trimqualities for details [current: 2]
  • --gzip   Enable gzip compression [current: off]
  • --threads THREADS   Maximum number of threads [current: 1]
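
As a variation on the command above, the sketch below sets the trimming thresholds explicitly instead of relying on the defaults (--minquality 2, --minlength 15). The values 20 and 30 are illustrative assumptions, not recommendations from the tool's authors. The wrapper function trim_single_end and its DRY_RUN switch are hypothetical conveniences; only the AdapterRemoval flags themselves come from the help text.

```shell
# trim_single_end INPUT BASENAME
# Hypothetical wrapper around the single-end command shown above.
# If DRY_RUN is set to a non-empty value, the command is printed, not executed.
trim_single_end() {
    ${DRY_RUN:+echo} AdapterRemoval --file1 "$1" --basename "$2" \
        --trimns --trimqualities --minquality 20 --minlength 30 \
        --gzip --threads 4
}

# Preview the command without running it.
DRY_RUN=1
trim_single_end single.fq output_single
```

Running `unset DRY_RUN` before calling the function executes AdapterRemoval for real, provided it is installed and on the PATH.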

 

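Trim Ns and low-quality regions from paired-end FASTQ files. Pairs that overlap by 11 bases or more (the default --minalignmentlength) are merged and written as a single read.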

AdapterRemoval --file1 pair_1.fq --file2 pair_2.fq --basename output_paired --trimns --trimqualities --collapse --threads 4
  • --file1 FILE   Input file containing mate 1 reads or single-ended reads [REQUIRED].
  • --file2 FILE   Input file containing mate 2 reads [OPTIONAL].
  • --collapse   When set, paired ended read alignments of --minalignmentlength or more bases are combined into a single consensus sequence, representing the complete insert, and written to either basename.collapsed or basename.collapsed.truncated (if trimmed due to low-quality bases following collapse); for single-ended reads, putative complete inserts are identified as having at least --minalignmentlength bases overlap with the adapter sequence, and are written to the same files [current: off].
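
To make it easier to see which files belong to the paired-end run, here is a small sketch that lists the default output names for a given --basename in paired-end mode with --collapse, following the defaults quoted in the help text. The function name pe_default_outputs is hypothetical, and the sketch assumes that neither output compression nor --interleaved-output is enabled, since those options may change the names.

```shell
# pe_default_outputs BASENAME
# Prints the default paired-end output file names described in the help text.
# Hypothetical helper; it only builds names and does not run AdapterRemoval.
pe_default_outputs() {
    for suffix in settings pair1.truncated pair2.truncated \
                  singleton.truncated collapsed collapsed.truncated discarded; do
        printf '%s.%s\n' "$1" "$suffix"
    done
}

pe_default_outputs output_paired
```

For the command above, merged reads would therefore be expected in output_paired.collapsed and output_paired.collapsed.truncated, with the run summary in output_paired.settings.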

 

 

References

AdapterRemoval v2: rapid adapter trimming, identification, and read merging.

Schubert M, Lindgreen S, Orlando L.

BMC Res Notes. 2016 Feb 12;9:88.

 

AdapterRemoval: easy cleaning of next-generation sequencing reads.

Lindgreen S.

BMC Res Notes. 2012 Jul 2;5:337.