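Atropos is an NGS adapter-trimming tool developed as a fork of Cutadapt. It supports parallel execution and runs fast. It is described as more sensitive than Cutadapt (adapter matching takes mismatches into account), and it also has trimming modes for miRNA and bisulfite-seq data. It offers a wide range of features, including modes for estimating the error rate and detecting adapter sequences, QC report output, SAM/BAM input, and trimming directly from an SRA accession. Color space is also supported. Merging of overlapping reads is reportedly going to be supported as well (the help output for version 1.1.10 below lists it as experimental).
Manual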
https://atropos.readthedocs.io/en/latest/
Installation
Dependencies
- Optional python libraries
- pytest (for running unit tests)
- progressbar2 or tqdm (progressbar support)
- pysam (SAM/BAM input)
- khmer 2.0+ (for detecting low-frequency adapter contamination)
- jinja2 (for user-defined report formats)
- ngstream (for SRA streaming), which requires ngs
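These are all Python libraries, so Atropos can be installed with pip, but the author recommends setting up the environment with conda. In an environment where conda is available:
conda install -c bioconda atropos
atropos -help # check that it runs
Docker images are also provided; see the GitHub repository for details.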
https://github.com/jdidion/atropos
Help
$ atropos -help
usage:
atropos trim -a ADAPTER [options] [-o output.fastq] -se input.fastq
atropos trim -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq -pe1 in1.fastq -pe2 in2.fastq
Atropos version 1.1.10
Trim adapters and low-quality bases, and perform other NGS preprocessing. This
command provides most of Atropos' functionality.
Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard
characters are supported. The reverse complement is *not* automatically
searched. All reads from input.fastq will be written to output.fastq with the
adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter
sequences can be given (use further -a options), but only the best-matching
adapter will be removed.
Input may also be in FASTA, SAM, or BAM format. Compressed input and output is
supported and auto-detected from the file name (.gz, .xz, .bz2). Use the file
name '-' for standard input/output. Without the -o option, output is sent to
standard output.
optional arguments:
-h, --help show this help message and exit
--debug Print debugging information. (no)
--progress {bar,msg} Show progress. bar = show progress bar; msg = show a
status message. (no)
--quiet Print only error messages. (no)
--log-level {DEBUG,INFO,WARN,ERROR}
Logging level. (ERROR when --quiet else INFO)
--log-file FILE File to write logging info. (stdout)
Input:
-pe1 FILE1, --input1 FILE1
The first (and possibly only) input file.
-pe2 FILE2, --input2 FILE2
The second input file.
-l FILE, --interleaved-input FILE
Interleaved input file.
-se FILE, --single-input FILE
A single-end read file.
--single-input-read {1,2}
When treating an interleaved FASTQ or paired-end
SAM/BAM file as single-end, this option specifies
which of the two reads to process. (both reads used)
-sq FILE, --single-quals FILE
A single-end qual file.
-sra ACCN, --sra-accession ACCN
Accesstion to stream from SRA (requires optional NGS
dependency to be installed).
-f {fasta,fastq,sra-fastq,sam,bam}, --format {fasta,fastq,sra-fastq,sam,bam}
Input file format. Ignored when reading csfasta/qual
files. (auto-detect from file name extension)
-Q QUALITY_BASE, --quality-base QUALITY_BASE
Assume that quality values in FASTQ are encoded as
ascii(quality + QUALITY_BASE). This needs to be set to
64 for some old Illumina FASTQ files. (33)
-c, --colorspace Enable colorspace mode: Also trim the color that is
adjacent to the found adapter. (no)
--max-reads N Maximum number of reads/pairs to process (no max)
--subsample PROB Subsample a fraction of reads. (no)
--subsample-seed SEED
The seed to use for the pseudorandom number generator.
Usingthe same seed will result in the same subsampling
of reads.
--batch-size SIZE Number of records to process in each batch. (1000)
-D ID, --sample-id ID
Optional sample ID. Added to the summary output.
Finding adapters:
Parameters -a, -g, -b specify adapters to be removed from each read (or
from the first read in a pair if data is paired). If specified multiple
times, only the best matching adapter is trimmed (but see the --times
option). When the special notation 'file:FILE' is used, adapter sequences
are read from the given FASTA file. When the --adapter-file option is
used, adapters can be specified by name rather than sequence.
-a ADAPTER, --adapter ADAPTER
Sequence of an adapter ligated to the 3' end (paired
data: of the first read). The adapter and subsequent
bases are trimmed. If a '$' character is appended
('anchoring'), the adapter is only found if it is a
suffix of the read. (none)
-g ADAPTER, --front ADAPTER
Sequence of an adapter ligated to the 5' end (paired
data: of the first read). The adapter and any
preceding bases are trimmed. Partial matches at the 5'
end are allowed. If a '^' character is prepended
('anchoring'), the adapter is only found if it is a
prefix of the read. (none)
-b ADAPTER, --anywhere ADAPTER
Sequence of an adapter that may be ligated to the 5'
or 3' end (paired data: of the first read). Both types
of matches as described under -a und -g are allowed.
If the first base of the read is part of the match,
the behavior is as with -g, otherwise as with -a. This
option is mostly for rescuing failed library
preparations - do not use if you know which end your
adapter was ligated to! (none)
-F KNOWN_ADAPTERS_FILE, --known-adapters-file KNOWN_ADAPTERS_FILE
Path or URL of a FASTA file containing adapter
sequences.
--no-default-adapters
Don't fetch the default adapter list (which is
currently stored in GitHub).
--adapter-cache-file ADAPTER_CACHE_FILE
File where adapter sequences will be cached, unless
--no-cache-adapters is set.
--no-cache-adapters Don't cache adapters list as '.adapters' in the
working directory.
--no-trim Match and redirect reads to output/untrimmed-output as
usual, but do not remove adapters. (no)
--mask-adapter Mask adapters with 'N' characters instead of trimming
them. (no)
Expected GC content of sequences.
--aligner {adapter,insert}
Which alignment algorithm to use for identifying
adapters. Currently, you can choose between the semi-
global alignment strategy used in Cutdapt ('adapter')
or the more accurate insert-based alignment algorithm
('insert'). Note that insert-based alignment can only
be used with paired-end reads containing 3' adapters.
New algorithms are being implemented and the default
is likely to change. (adapter)
-e ERROR_RATE, --error-rate ERROR_RATE
Maximum allowed error rate for adapter match (no. of
errors divided by the length of the matching region).
(0.1)
--indel-cost COST Integer cost of insertions and deletions during
adapter match. Substitutions always have a cost of 1.
(1)
--no-indels Allow only mismatches in alignments. (allow both
mismatches and indels)
-n COUNT, --times COUNT
Remove up to COUNT adapters from each read. (1)
--match-read-wildcards
Interpret IUPAC wildcards in reads. (no)
-N, --no-match-adapter-wildcards
Do not interpret IUPAC wildcards in adapters. (no)
-O MINLENGTH, --overlap MINLENGTH
If the overlap between the read and the adapter is
shorter than MINLENGTH, the read is not modified.
Reduces the no. of bases trimmed due to random adapter
matches. (3)
--adapter-max-rmp PROB
If no minimum overlap (-O) is specified, then adapters
are only matched when the probabilty of observing k
out of n matching bases is <= PROB. (1E-6)
--insert-max-rmp PROB
Overlapping inserts only match when the probablity of
observing k of n matching bases is <= PROB. (1E-6)
--insert-match-error-rate INSERT_MATCH_ERROR_RATE
Maximum allowed error rate for insert match (no. of
errors divided by the length of the matching region).
(0.2)
--insert-match-adapter-error-rate INSERT_MATCH_ADAPTER_ERROR_RATE
Maximum allowed error rate for matching adapters after
successful insert match (no. of errors divided by the
length of the matching region). (0.2)
-R, --merge-overlapping
Merge read pairs that overlap into a single sequence.
This is experimental. (no)
--merge-min-overlap MERGE_MIN_OVERLAP
The minimum overlap between reads required for
merging. If this number is (0,1.0], it specifies the
minimum length as the fraction of the length of the
*shorter* read in the pair; otherwise it specifies the
minimum number of overlapping base pairs (with an
absolute minimum of 2 bp). (0.9)
--merge-error-rate MERGE_ERROR_RATE
The maximum error rate for merging. (0.2)
--correct-mismatches {liberal,conservative,N}
How to handle mismatches while aligning/merging.
'Liberal' and 'conservative' error correction both
involve setting the base to the one with the best
quality. They differ only when the qualities are equal
-- liberal means set it to the base from the read with
the overall best median base quality, while
conservative means to leave it unchanged. 'N' means to
set the base to N. If exactly one base is ambiguous,
the non-ambiguous base is always used. (no error
correction)
Additional read modifications:
--op-order OP_ORDER The order in which trimming operations are be applied.
This is a string of 1-5 of the following characters: A
= adapter trimming; C = cutting (unconditional); G =
NextSeq trimming; Q = quality trimming; W = overwrite
poor quality reads. The default is 'WCGQA' to maintain
compatibility with Cutadapt; however, this is likely
to change to 'GAWCQ' in the near future.
-u LENGTH, --cut LENGTH
Remove bases from each read (first read only if
paired). If LENGTH is positive, remove bases from the
beginning. If LENGTH is negative, remove bases from
the end. Can be used twice if LENGTHs have different
signs. (no)
-q [5'CUTOFF,]3'CUTOFF, --quality-cutoff [5'CUTOFF,]3'CUTOFF
Trim low-quality bases from 5' and/or 3' ends of each
read before adapter removal. Applied to both reads if
data is paired. If one value is given, only the 3' end
is trimmed. If two comma-separated cutoffs are given,
the 5' end is trimmed with the first cutoff, the 3'
end with the second. (no)
-i LENGTH, --cut-min LENGTH
Similar to -u, except that cutting is done AFTER
adapter trimming, and only if a minimum of LENGTH
bases was not already removed. (no)
--nextseq-trim 3'CUTOFF
NextSeq-specific quality trimming (each read). Trims
also dark cycles appearing as high-quality G bases
(EXPERIMENTAL). (no)
--trim-n Trim N's on ends of reads. (no)
-x PREFIX, --prefix PREFIX
Add this prefix to read names. Use {name} to insert
the name of the matching adapter. (no)
-y SUFFIX, --suffix SUFFIX
Add this suffix to read names; can also include
{name}. (no)
--strip-suffix STRIP_SUFFIX
Remove this suffix from read names if present. Can be
given multiple times. (no)
--length-tag TAG Search for TAG followed by a decimal number in the
description field of the read. Replace the decimal
number with the correct length of the trimmed read.
For example, use --length-tag 'length=' to correct
fields like 'length=123'. (no)
Filtering of processed reads:
--discard-trimmed, --discard
Discard reads that contain an adapter. Also use -O to
avoid discarding too many randomly matching reads!
(no)
--discard-untrimmed, --trimmed-only
Discard reads that do not contain the adapter. (no)
-m LENGTH, --minimum-length LENGTH
Discard trimmed reads that are shorter than LENGTH.
Reads that are too short even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted. (0)
-M LENGTH, --maximum-length LENGTH
Discard trimmed reads that are longer than LENGTH.
Reads that are too long even before adapter removal
are also discarded. In colorspace, an initial primer
is not counted. (no limit)
--max-n COUNT Discard reads with too many N bases. If COUNT is an
integer, it is treated as the absolute number of N
bases. If it is between 0 and 1, it is treated as the
proportion of N's allowed in a read. (no)
Output:
-o FILE, --output FILE
Write trimmed reads to FILE. FASTQ or FASTA format is
chosen depending on input. The summary report is sent
to standard output. Use '{name}' in FILE to
demultiplex reads into multiple files. (write to
standard output)
--info-file FILE Write information about each read and its adapter
matches into FILE. See the documentation for the file
format. (no)
-r FILE, --rest-file FILE
When the adapter matches in the middle of a read,
write the rest (after the adapter) into FILE. (no)
--wildcard-file FILE When the adapter has N bases (wildcards), write
adapter bases matching wildcard positions to FILE.
When there are indels in the alignment, this will
often not be accurate. (no)
--too-short-output FILE
Write reads that are too short (according to length
specified by -m) to FILE. (no - too short reads are
discarded)
--too-long-output FILE
Write reads that are too long (according to length
specified by -M) to FILE. (no - too long reads are
discarded)
--untrimmed-output FILE
Write reads that do not contain the adapter to FILE.
(no - untrimmed reads are written to default output)
--merged-output FILE Write reads that have been merged to this file.
(merged reads are discarded)
--report-file FILE Write report to file rather than stdout/stderr. (no)
--report-formats [FORMAT [FORMAT ...]]
Report type(s) to generate. If multiple, '--report-
file' is treated as a prefix and the appropriate
extensions are appended. If unspecified, the format is
guessed from the file extension. Supported formats
are: txt (legacy text format), json, yaml, pickle. See
the documentation for a full description of the
structured output (json/yaml/pickle formats).
--stats [STATS [STATS ...]]
Which read-level statistics to compute. Can be 'none'
(default), 'pre': only compute pre-trimming stats;
'post': only compute post-trimming stats; or 'both'.
The keyword can be followed by ':' and then additional
configuration parameters. E.g. 'pre:tiles' means to
also collect tile-level statistics (Illumina data
only), and 'pre:tiles=<regexp>' means to use the
specified regular expression to extract key portions
of read names to collect the tile statistics.
Colorspace options:
-d, --double-encode Double-encode colors (map 0,1,2,3,4 to A,C,G,T,N).
(no)
-t, --trim-primer Trim primer base and the first color (which is the
transition to the first nucleotide). (no)
--strip-f3 Strip the _F3 suffix of read names. (no)
--maq, --bwa MAQ- and BWA-compatible colorspace output. This
enables -c, -d, -t, --strip-f3 and -y '/1'. (no)
--no-zero-cap Do not change negative quality values to zero in
colorspace data. By default, they are since many tools
have problems with negative qualities. (no)
-z, --zero-cap Change negative quality values to zero. This is
enabled by default when -c/--colorspace is also
enabled. Use the above option to disable it. (no)
Paired-end options:
The -A/-G/-B/-U/-I options work like their -a/-b/-g/-u/-i counterparts,
but are applied to the second read in each pair.
-A ADAPTER 3' adapter to be removed from second read in a pair.
(no)
-G ADAPTER 5' adapter to be removed from second read in a pair.
(no)
-B ADAPTER 5'/3 adapter to be removed from second read in a pair.
(no)
-U LENGTH Remove LENGTH bases from second read in a pair (see
--cut). (no)
-I LENGTH, --cut-min2 LENGTH
Similar to -U, except that cutting is done AFTER
adapter trimming, and only if a minimum of LENGTH
bases was not already removed (see --cut-min). (no)
-w LOWQ,HIGHQ,WINDOW, --overwrite-low-quality LOWQ,HIGHQ,WINDOW
When one read has mean quality < LOWQ and the other
read has mean quality >= HIGHQ over the first WINDOW
bases, overwrite the worse read with the better read.
-p FILE, --paired-output FILE
Write second read in a pair to FILE. (no)
-L FILE, --interleaved-output FILE
Write output to interleaved file.
--pair-filter (any|both)
Which of the reads in a paired-end read have to match
the filtering criterion in order for it to be
filtered. (any)
--untrimmed-paired-output FILE
Write second read in a pair to this FILE when no
adapter was found in the first read. Use this option
together with --untrimmed-output when trimming paired-
end reads. (no - output to same file as trimmed reads)
--too-short-paired-output FILE
Write second read in a pair to this file if pair is
too short. Use together with --too-short-output. (no -
too short reads are discarded)
--too-long-paired-output FILE
Write second read in a pair to this file if pair is
too long. Use together with --too-long-output. (no -
too long reads are discarded)
Method-specific options:
--bisulfite METHOD Set default option values for bisulfite-treated data.
The argument specifies the type of bisulfite library
(rrbs, non-directional, non-directional-rrbs, truseq,
epignome, or swift) or custom parameters for trimming:
'<read1>[;<read2>]' where trimming parameters for each
read are: '<5' cut>,<3' cut>,<include trimmed>,<only
trimmed>' where 'include trimmed' is 1 or 0 for
whether or not the bases already trimmed during/prior
to adapter trimming should be counted towards the
total bases to be cut and 'only trimmed' is 1 or 0 for
whether or not only trimmed reads should be further
cut. (no)
--mirna Set default option values for miRNA data. (no)
Parallel (multi-core) options:
-T THREADS, --threads THREADS
Number of threads to use for read trimming. Set to 0
to use max available threads. (Do not use
multithreading)
--no-writer-process Do not use a writer process; instead, each worker
thread writes its own output to a file with a '.N'
suffix. (no)
--preserve-order Preserve order of reads in input files (ignored if
--no-writer-process is set). (no)
--process-timeout SECONDS
Number of seconds process should wait before
escalating messages to ERROR level. (60)
--read-queue-size SIZE
Size of queue for batches of reads to be processed.
(THREADS * 100)
--result-queue-size SIZE
Size of queue for batches of results to be written.
(THREADS * 100)
--compression {worker,writer}
Where data compression should be performed. Defaults
to 'writer' if system-level compression can be used
and (1 < threads < 8), otherwise defaults to 'worker'.
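The -Q/--quality-base entry in the help above says that FASTQ qualities are encoded as ascii(quality + QUALITY_BASE). As a minimal sketch of what that means (the function name `decode_qualities` is hypothetical and is not part of Atropos), the scores can be recovered like this:

```python
def decode_qualities(quality_string, quality_base=33):
    """Convert a FASTQ quality string into integer quality scores.

    Each character encodes one score as chr(quality + quality_base),
    so subtracting the base from the character code recovers the score.
    """
    return [ord(char) - quality_base for char in quality_string]


# Default encoding (base 33).
print(decode_qualities("II5#"))      # [40, 40, 20, 2]
# Old Illumina files that need -Q 64.
print(decode_qualities("hhT", 64))   # [40, 40, 20]
```

If the decoded scores of a file look implausibly high with base 33, that is a hint the file may use base 64.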
Running
Detect adapter sequences.
atropos detect -se single.fq -o output
- -pe1 The first (and possibly only) input file.
- -pe2 The second input file.
- -l Interleaved input file.
- -se A single-end read file.
- -o File in which to write the summary of detected
- -sra <ACCN> Accession to stream from SRA (requires optional NGS dependency to be installed).
Remove a 3' adapter from single-end reads. Reads shorter than 10 bases after trimming are discarded.
atropos --threads 8 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTTA -se single.fastq --minimum-length 10 -o output.fq -e 0.1
- -a <ADAPTER> Sequence of an adapter ligated to the 3' end (paired data: of the first read). The adapter and subsequent bases are trimmed. If a '$' character is appended ('anchoring'), the adapter is only found if it is a suffix of the read. (none)
- --threads <INT> Number of threads to use for read trimming. Set to 0 to use max available threads. (Do not use multithreading)
- -e <FLOAT> Maximum allowed error rate for adapter match (no. of errors divided by the length of the matching region). (0.1)
- --indel-cost <INT> Integer cost of insertions and deletions during adapter match. Substitutions always have a cost of 1. (1)
- --no-indels Allow only mismatches in alignments. (allow both mismatches and indels)
- --minimum-length <LENGTH> Discard trimmed reads that are shorter than LENGTH. Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted. (0)
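To make the -e and --minimum-length descriptions above concrete, here is a small sketch of the two checks as the help text words them. The function names are hypothetical, and this only illustrates the stated definitions; it is not Atropos's actual implementation.

```python
def match_is_accepted(errors, match_length, max_error_rate=0.1):
    """Accept an adapter match if errors / match_length <= max_error_rate."""
    if match_length <= 0:
        return False
    return errors / match_length <= max_error_rate


def passes_minimum_length(trimmed_read, minimum_length=10):
    """Keep a trimmed read unless it is shorter than minimum_length."""
    return len(trimmed_read) >= minimum_length


# With -e 0.1, one error is tolerated in a 10-base match but not in a 9-base match.
print(match_is_accepted(errors=1, match_length=10))   # True
print(match_is_accepted(errors=1, match_length=9))    # False
# With --minimum-length 10, a 10-base read is kept and a 9-base read is dropped.
print(passes_minimum_length("ACGTACGTAC"))            # True
print(passes_minimum_length("ACGTACGTA"))             # False
```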
Remove a 5' adapter from single-end reads.
atropos --threads 8 -g AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTTA -se single.fastq -o output.fq
- -g <ADAPTER> Sequence of an adapter ligated to the 5' end (paired data: of the first read). The adapter and any preceding bases are trimmed. Partial matches at the 5' end are allowed. If a '^' character is prepended ('anchoring'), the adapter is only found if it is a prefix of the read. (none)
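The difference between a plain and an anchored ('^') 5' adapter can be sketched as follows. This is a deliberately simplified illustration with a hypothetical function `trim_front`: it uses exact string matching only, whereas the real tool is error-tolerant and also allows partial matches at the 5' end.

```python
def trim_front(read, adapter):
    """Remove a 5' adapter and everything before it (exact matches only).

    A leading '^' anchors the adapter: it is then only removed when it is
    a prefix of the read.
    """
    if adapter.startswith("^"):
        anchored = adapter[1:]
        if read.startswith(anchored):
            return read[len(anchored):]
        return read
    position = read.find(adapter)
    if position == -1:
        return read  # adapter not found, read is left unchanged
    return read[position + len(adapter):]


print(trim_front("TTTACGTGGGG", "ACGT"))    # GGGG (adapter and preceding bases removed)
print(trim_front("TTTACGTGGGG", "^ACGT"))   # TTTACGTGGGG (not a prefix, so untouched)
print(trim_front("ACGTGGGG", "^ACGT"))      # GGGG
```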
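If the adapter may be ligated to either the 5' or the 3' end, use -b instead of -a or -g. According to the help, this option is mostly for rescuing failed library preparations and should not be used when you know which end the adapter was ligated to.
Remove the 3' adapters from both reads of a pair, using 8 threads.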
atropos --threads 8 --aligner insert \
-a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTTA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
-pe1 input1.fastq -pe2 input2.fastq -o output1.fq -p output2.fq
- -A 3' adapter to be removed from second read in a pair. (no)
- --aligner <adapter|insert> Which alignment algorithm to use for identifying adapters. Currently, you can choose between the semi-global alignment strategy used in Cutadapt ('adapter') or the more accurate insert-based alignment algorithm ('insert'). Note that insert-based alignment can only be used with paired-end reads containing 3' adapters. New algorithms are being implemented and the default is likely to change. (adapter)
- -p <FILE> Write second read in a pair to FILE. (no)
- --no-trim Match and redirect reads to output/untrimmed-output as usual, but do not remove adapters. (no)
- --mask-adapter Mask adapters with 'N' characters instead of trimming them. (no)
- --times <COUNT> Remove up to COUNT adapters from each read. (1)
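The two output files are written so that read pairs stay in sync. (To also keep the input order when multithreading, the help lists --preserve-order.)
As with single-end reads, to remove 5' adapters, use -g and -G instead of -a and -A and give the adapter sequences (or read them from a file).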
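When many samples need the same paired-end trimming, it can be convenient to build the command line in a script. The sketch below uses only flags that appear in the help output above and follows its usage line (`atropos trim ...`). The helper name `build_paired_trim_command` and the file names are hypothetical.

```python
import shlex


def build_paired_trim_command(adapter1, adapter2, in1, in2, out1, out2,
                              threads=8, aligner="insert"):
    """Return the argument list for a paired-end 3' adapter trimming run."""
    return [
        "atropos", "trim",
        "--threads", str(threads),
        "--aligner", aligner,
        "-a", adapter1,   # 3' adapter of the first read
        "-A", adapter2,   # 3' adapter of the second read
        "-pe1", in1, "-pe2", in2,
        "-o", out1, "-p", out2,
    ]


command = build_paired_trim_command(
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTTA",
    "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT",
    "input1.fastq", "input2.fastq", "output1.fq", "output2.fq",
)
# Print the command for inspection; it is not executed here.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once Atropos is installed. Keeping the arguments as a list avoids shell quoting problems with file names.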
Trim poly-A tails.
atropos -a "A{100}" -o output.fastq -se input.fastq
Runs of A shorter than 100 bases are removed as well.
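The "A{100}" notation stands for the letter A repeated 100 times. Because a 3' adapter may match only partially at the end of a read, shorter tails are also caught. The sketch below illustrates both ideas with hypothetical helper functions. It assumes the brace notation is expanded as in Cutadapt, uses exact matching only, and borrows the default of 3 from the -O option as a minimum tail length, so it is a simplification rather than Atropos's actual matching logic.

```python
import re


def expand_repeats(adapter):
    """Expand 'X{n}' into the letter X repeated n times."""
    return re.sub(
        r"([A-Za-z])\{(\d+)\}",
        lambda match: match.group(1) * int(match.group(2)),
        adapter,
    )


def trim_poly_a_tail(read, min_overlap=3):
    """Strip a trailing run of A's if it is at least min_overlap long."""
    stripped = read.rstrip("A")
    tail_length = len(read) - len(stripped)
    return stripped if tail_length >= min_overlap else read


print(len(expand_repeats("A{100}")))     # 100
print(trim_poly_a_tail("ACGTAAAAAA"))    # ACGT
print(trim_poly_a_tail("ACGTAA"))        # ACGTAA (tail shorter than 3, left as is)
```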
Many other options are available, such as trimming low-quality bases before adapter removal and merging paired-end reads. There are also options for BAM/SAM input, SOLiD, and NextSeq data. See the manual for details.
https://atropos.readthedocs.io/en/latest/
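As one example of how these options are parameterized, the help text for --merge-min-overlap says that a value in (0, 1.0] is a fraction of the shorter read's length, while any other value is a number of base pairs, with an absolute minimum of 2 bp. A rough sketch of that rule follows. The function name `required_overlap` is hypothetical, and both the rounding up of the fractional case and applying the 2 bp floor to it are assumptions, since the help does not spell them out.

```python
import math


def required_overlap(merge_min_overlap, read1_length, read2_length):
    """Translate a --merge-min-overlap value into a number of bases."""
    if 0 < merge_min_overlap <= 1.0:
        # Fraction of the shorter read (rounding up is an assumption).
        shorter = min(read1_length, read2_length)
        required = math.ceil(merge_min_overlap * shorter)
    else:
        # Absolute number of overlapping bases.
        required = int(merge_min_overlap)
    # Absolute minimum of 2 bp (assumed here to apply in both cases).
    return max(2, required)


print(required_overlap(0.5, 100, 80))   # 40
print(required_overlap(30, 100, 80))    # 30
print(required_overlap(1, 100, 80))     # 80
```

Note that a value of exactly 1 falls in the fractional range, so it means the full length of the shorter read rather than 1 bp.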
Citation
Atropos: specific, sensitive, and speedy trimming of sequencing reads.
Didion JP, Martin M, Collins FS.
PeerJ. 2017 Aug 30;5:e3720. doi: 10.7717/peerj.3720. eCollection 2017.