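It is increasingly common to extract and sequence fragmented DNA from fossil-like samples and from the bones and teeth of ancient humans, and as a result more and more sequencing data is contaminated with adapter sequence at both the 5' and 3' ends. AdapterRemoval is a demultiplexing and adapter-trimming tool that supports SSE1 and SSE2 instructions and offers a flexible set of parameters for controlling how reads are trimmed. It has handled both single-end and paired-end data since its first release; version 2 added the SIMD support mentioned above, along with support for gzip- and bzip2-compressed files, parallel processing, interleaved FASTQ input, and merging of overlapping reads.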
Installation
https://github.com/MikkelSchubert/adapterremoval
Download the source archive, then build and install it.
wget -O adapterremoval-2.1.7.tar.gz https://github.com/MikkelSchubert/adapterremoval/archive/v2.1.7.tar.gz
tar xvzf adapterremoval-2.1.7.tar.gz
cd adapterremoval-2.1.7
make # build
sudo make install # install
Display the help message.
$ AdapterRemoval
AdapterRemoval ver. 2.1.7
This program searches for and removes remnant adapter sequences from
your read data. The program can analyze both single end and paired end
data. For detailed explanation of the parameters, please refer to the
man page. For comments, suggestions and feedback please contact Stinus
Lindgreen (stinus@binf.ku.dk) and Mikkel Schubert (MikkelSch@gmail.com).
If you use the program, please cite the paper:
Schubert, Lindgreen, and Orlando (2016). AdapterRemoval v2: rapid
adapter trimming, identification, and read merging.
BMC Research Notes, 12;9(1):88.
http://bmcresnotes.biomedcentral.com/articles/10.1186/s13104-016-1900-2
Arguments: Description:
--help Display this message.
--version Print the version string.
--file1 FILE Input file containing mate 1 reads or single-ended reads [REQUIRED].
--file2 FILE Input file containing mate 2 reads [OPTIONAL].
FASTQ OPTIONS:
--qualitybase BASE Quality base used to encode Phred scores in input; either 33, 64, or solexa
[current: 33].
--qualitybase-output BASE Quality base used to encode Phred scores in output; either 33, 64, or solexa. By
default, reads will be written in the same format as the that specified using
--qualitybase.
--qualitymax BASE Specifies the maximum Phred score expected in input files, and used when writing
output. ASCII encoded values are limited to the characters '!' (ASCII = 33) to
'~' (ASCII = 126), meaning that possible scores are 0 - 93 with offset 33, and
0 - 62 for offset 64 and Solexa scores [default: 41].
--mate-separator CHAR Character separating the mate number (1 or 2) from the read name in FASTQ
records [default: '/'].
--interleaved This option enables both the --interleaved-input option and the
--interleaved-output option [current: off].
--interleaved-input The (single) input file provided contains both the mate 1 and mate 2 reads, one
pair after the other, with one mate 1 reads followed by one mate 2 read. This
option is implied by the --interleaved option [current: off].
--interleaved-output If set, trimmed paired-end reads are written to a single file containing mate 1
and mate 2 reads, one pair after the other. This option is implied by the
--interleaved option [current: off].
OUTPUT FILES:
--basename BASENAME Default prefix for all output files for which no filename was explicitly set
[current: your_output].
--settings FILE Output file containing information on the parameters used in the run as well as
overall statistics on the reads after trimming [default: BASENAME.settings]
--output1 FILE Output file containing trimmed mate1 reads [default: BASENAME.pair1.truncated
(PE), BASENAME.truncated (SE), or BASENAME.paired.truncated (interleaved PE)]
--output2 FILE Output file containing trimmed mate 2 reads [default: BASENAME.pair2.truncated
(only used in PE mode, but not if --interleaved-output is enabled)]
--singleton FILE Output file to which containing paired reads for which the mate has been
discarded [default: BASENAME.singleton.truncated]
--outputcollapsed FILE If --collapsed is set, contains overlapping mate-pairs which have been merged
into a single read (PE mode) or reads for which the adapter was identified by
a minimum overlap, indicating that the entire template molecule is present.
This does not include which have subsequently been trimmed due to low-quality
or ambiguous nucleotides [default: BASENAME.collapsed]
--outputcollapsedtruncated FILE Collapsed reads (see --outputcollapsed) which were trimmed due the presence of
low-quality or ambiguous nucleotides [default: BASENAME.collapsed.truncated]
--discarded FILE Contains reads discarded due to the --minlength, --maxlength or --maxns options
[default: BASENAME.discarded]
OUTPUT COMPRESSION:
--gzip Enable gzip compression [current: off]
--gzip-level LEVEL Compression level, 0 - 9 [current: 6]
--bzip2 Enable bzip2 compression [current: off]
--bzip2-level LEVEL Compression level, 0 - 9 [current: 9]
TRIMMING SETTINGS:
--adapter1 SEQUENCE Adapter sequence expected to be found in mate 1 reads [current:
AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG].
--adapter2 SEQUENCE Adapter sequence expected to be found in mate 2 reads [current:
AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT].
--adapter-list FILENAME Read table of white-space separated adapters pairs, used as if the first column
was supplied to --adapter1, and the second column was supplied to --adapter2;
only the first adapter in each pair is required SE trimming mode [current:
<not set>].
--mm MISMATCH_RATE Max error-rate when aligning reads and/or adapters. If > 1, the max error-rate
is set to 1 / MISMATCH_RATE; if < 0, the defaults are used, otherwise the
user-supplied value is used directly. [defaults: 1/3 for trimming; 1/10 when
identifing adapters].
--maxns MAX Reads containing more ambiguous bases (N) than this number after trimming are
discarded [current: 1000].
--shift N Consider alignments where up to N nucleotides are missing from the 5' termini
[current: 2].
--trimns If set, trim ambiguous bases (N) at 5'/3' termini [current: off]
--trimqualities If set, trim bases at 5'/3' termini with quality scores <= to --minquality value
[current: off]
--minquality PHRED Inclusive minimum; see --trimqualities for details [current: 2]
--minlength LENGTH Reads shorter than this length are discarded following trimming [current: 15].
--maxlength LENGTH Reads longer than this length are discarded following trimming [current:
4294967295].
--collapse When set, paired ended read alignments of --minalignmentlength or more bases are
combined into a single consensus sequence, representing the complete insert,
and written to either basename.collapsed or basename.collapsed.truncated (if
trimmed due to low-quality bases following collapse); for single-ended reads,
putative complete inserts are identified as having at least
--minalignmentlength bases overlap with the adapter sequence, and are written
to the the same files [current: off].
--minalignmentlength LENGTH If --collapse is set, paired reads must overlap at least this number of bases to
be collapsed, and single-ended reads must overlap at least this number of
bases with the adapter to be considered complete template molecules [current:
11].
--minadapteroverlap LENGTH In single-end mode, reads are only trimmed if the overlap between read and the
adapter is at least X bases long, not counting ambiguous nucleotides (N); this
is independant of the --minalignmentlength when using --collapse, allowing a
conservative selection of putative complete inserts while ensuring that all
possible adapter contamination is trimmed [current: 0].
DEMULTIPLEXING:
--barcode-list FILENAME List of barcodes or barcode pairs for single or double-indexed demultiplexing.
Note that both indexes should be specified for both single-end and paired-end
trimming, if double-indexed multiplexing was used, in order to ensure that the
demultiplexed reads can be trimmed correctly [current: <not set>].
--barcode-mm N Maximum number of mismatches allowed when counting mismatches in both the mate 1
and the mate 2 barcode for paired reads.
--barcode-mm-r1 N Maximum number of mismatches allowed for the mate 1 barcode; if not set, this
value is equal to the '--barcode-mm' value; cannot be higher than the
'--barcode-mm value'.
--barcode-mm-r2 N Maximum number of mismatches allowed for the mate 2 barcode; if not set, this
value is equal to the '--barcode-mm' value; cannot be higher than the
'--barcode-mm value'.
MISC:
--identify-adapters Attempt to identify the adapter pair of PE reads, by searching for overlapping
reads [current: off].
--seed SEED Sets the RNG seed used when choosing between bases with equal Phred scores when
collapsing. Note that runs are not deterministic if more than one thread is
used. If not specified, a seed is generated using the current time.
--threads THREADS Maximum number of threads [current: 1]
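The --mm rule in the help text above is easy to misread, so here is a small worked sketch of how a given value would be interpreted according to that description. The helper name effective_mm and its second argument are hypothetical and exist only for this illustration; the sketch only restates the help text and is not taken from the AdapterRemoval source.

```shell
#!/bin/sh
# effective_mm VALUE [trim|identify]
# Hypothetical helper: prints the error rate that the --mm help text describes.
effective_mm() {
    awk -v v="$1" -v mode="${2:-trim}" 'BEGIN {
        if (v > 1)
            rate = 1 / v                                  # e.g. --mm 3 means 1/3
        else if (v < 0)
            rate = (mode == "identify") ? 1 / 10 : 1 / 3  # defaults quoted in the help
        else
            rate = v                                      # value is used directly
        printf "%.4f\n", rate
    }'
}

effective_mm 3            # integer form
effective_mm 0.05         # fractional form
effective_mm -1 identify  # negative value falls back to the default
```

So, going by the help text, --mm 3 and --mm 0.3333 should request roughly the same maximum error rate.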
Usage
Trim Ns and low-quality regions from a single-end FASTQ file and write gzip-compressed output.
AdapterRemoval --file1 single.fq --basename output_single --trimns --trimqualities --gzip --threads 4
- --file1 FILE Input file containing mate 1 reads or single-ended reads [REQUIRED].
- --basename BASENAME Default prefix for all output files for which no filename was explicitly set [current: your_output].
- --trimns If set, trim ambiguous bases (N) at 5'/3' termini [current: off]
- --trimqualities If set, trim bases at 5'/3' termini with quality scores <= to --minquality value [current: off]
- --minquality PHRED Inclusive minimum; see --trimqualities for details [current: 2]
- --gzip Enable gzip compression [current: off]
- --threads THREADS Maximum number of threads [current: 1]
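When there are several single-end files, the command above can be wrapped in a loop. The following is a minimal sketch, assuming the inputs sit in the current directory with a .fq or .fastq extension and that the output basename should be the file name without that extension. The function name trim_single and the DRY_RUN variable are hypothetical. By default it only prints the commands; set DRY_RUN=0 to actually run them.

```shell
#!/bin/sh
# trim_single FILE
# Builds the single-end command from this post for one input file.
trim_single() {
    fq=$1
    base=$(basename "$fq")
    base=${base%.fq}        # strip a trailing .fq, if present
    base=${base%.fastq}     # strip a trailing .fastq, if present
    set -- AdapterRemoval --file1 "$fq" --basename "$base" \
        --trimns --trimqualities --gzip --threads 4
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"           # dry run: print the command only
    else
        "$@"                # run AdapterRemoval
    fi
}

for fq in ./*.fq ./*.fastq; do
    [ -e "$fq" ] || continue   # skip patterns that matched nothing
    trim_single "$fq"
done
```

Deriving the basename from the input name keeps each sample's output files apart, since every default output name starts with BASENAME.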
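Trim Ns and low-quality regions from paired-end FASTQ files. Pairs that overlap by 11 bases or more (the --minalignmentlength default) are merged and written out as a single read.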
AdapterRemoval --file1 pair_1.fq --file2 pair_2.fq --basename output_paired --trimns --trimqualities --collapse --threads 4
- --file1 FILE Input file containing mate 1 reads or single-ended reads [REQUIRED].
- --file2 FILE Input file containing mate 2 reads [OPTIONAL].
- --collapse When set, paired ended read alignments of --minalignmentlength or more bases are combined into a single consensus sequence, representing the complete insert, and written to either basename.collapsed or basename.collapsed.truncated (if trimmed due to low-quality bases following collapse); for single-ended reads, putative complete inserts are identified as having at least --minalignmentlength bases overlap with the adapter sequence, and are written to the same files [current: off].
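After the paired-end run it can be handy to check which output files were produced. The sketch below lists the default file names given in the OUTPUT FILES section of the help text for a paired-end run with --collapse and no compression, and reports which of them exist. The function name expected_pe_outputs is hypothetical, and whether every file is created on every run is an assumption, so treat "missing" as something to look into rather than as an error.

```shell
#!/bin/sh
# expected_pe_outputs BASENAME
# Prints the default output names listed in the help text, one per line.
expected_pe_outputs() {
    for suffix in settings pair1.truncated pair2.truncated \
        singleton.truncated collapsed collapsed.truncated discarded; do
        printf '%s.%s\n' "$1" "$suffix"
    done
}

# Report which of the expected files are present in the current directory.
expected_pe_outputs output_paired | while read -r f; do
    if [ -e "$f" ]; then
        echo "found    $f"
    else
        echo "missing  $f"
    fi
done
```

The merged reads end up in output_paired.collapsed and output_paired.collapsed.truncated, while pairs that could not be merged remain in the pair1 and pair2 files.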
References
AdapterRemoval v2: rapid adapter trimming, identification, and read merging.
Schubert M, Lindgreen S, Orlando L.
BMC Res Notes. 2016 Feb 12;9:88.
AdapterRemoval: easy cleaning of next-generation sequencing reads.
Lindgreen S.
BMC Res Notes. 2012 Jul 2;5:337.