macでインフォマティクス

A collection of notes on NGS-related informatics.

Atropos: an adapter trimming tool with parallel processing support

 

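Atropos is an NGS adapter trimming tool developed as a fork of Cutadapt. It supports parallel execution and runs fast. It is more sensitive than Cutadapt (it takes mismatches into account) and includes trimming modes for miRNA and bisulfite-seq data. It offers a wide range of features, including modes for estimating the error rate and detecting adapter sequences, QC report output, SAM/BAM input, and trimming directly from an SRA accession. Colorspace data is also supported. Merging of overlapping reads is apparently planned as well; the help output for version 1.1.10 shown later in this post already lists an experimental -R/--merge-overlapping option.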

  

Manual

https://atropos.readthedocs.io/en/latest/

 

Installation

Dependencies

    • Python 3.3+ (python 2.x is NOT supported)
      • note: we have identified a possible bug in python 3.4.2 that causes random segmentation faults. We think this mainly affects unit testing (and thus specifically test on 3.4.3). If you encounter this bug, we recommend upgrading to a newer python version.
    • Cython 0.25.2+ (pip install Cython)
  • Optional python libraries
    • pytest (for running unit tests)
    • progressbar2 or tqdm (progressbar support)
    • pysam (SAM/BAM input)
    • khmer 2.0+ (for detecting low-frequency adapter contamination)
    • jinja2 (for user-defined report formats)
    • ngstream (for SRA streaming), which requires ngs

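The dependencies are all Python libraries, so Atropos can be installed with pip, but the authors recommend setting up the environment with conda. In an environment where conda is available: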

conda install -c bioconda atropos

atropos -help # check that it runs

Docker images are also provided; see the GitHub repository for details.

Github

https://github.com/jdidion/atropos

 

Help

$ atropos -help

usage: 

atropos trim -a ADAPTER [options] [-o output.fastq] -se input.fastq

atropos trim -a ADAPT1 -A ADAPT2 [options] -o out1.fastq -p out2.fastq -pe1 in1.fastq -pe2 in2.fastq

 

Atropos version 1.1.10

 

Trim adapters and low-quality bases, and perform other NGS preprocessing. This

command provides most of Atropos' functionality.

 

Replace "ADAPTER" with the actual sequence of your 3' adapter. IUPAC wildcard

characters are supported. The reverse complement is *not* automatically

searched. All reads from input.fastq will be written to output.fastq with the

adapter sequence removed. Adapter matching is error-tolerant. Multiple adapter

sequences can be given (use further -a options), but only the best-matching

adapter will be removed.

 

Input may also be in FASTA, SAM, or BAM format. Compressed input and output is

supported and auto-detected from the file name (.gz, .xz, .bz2). Use the file

name '-' for standard input/output. Without the -o option, output is sent to

standard output.

 

optional arguments:

  -h, --help            show this help message and exit

  --debug               Print debugging information. (no)

  --progress {bar,msg}  Show progress. bar = show progress bar; msg = show a

                        status message. (no)

  --quiet               Print only error messages. (no)

  --log-level {DEBUG,INFO,WARN,ERROR}

                        Logging level. (ERROR when --quiet else INFO)

  --log-file FILE       File to write logging info. (stdout)

 

Input:

  -pe1 FILE1, --input1 FILE1

                        The first (and possibly only) input file.

  -pe2 FILE2, --input2 FILE2

                        The second input file.

  -l FILE, --interleaved-input FILE

                        Interleaved input file.

  -se FILE, --single-input FILE

                        A single-end read file.

  --single-input-read {1,2}

                        When treating an interleaved FASTQ or paired-end

                        SAM/BAM file as single-end, this option specifies

                        which of the two reads to process. (both reads used)

  -sq FILE, --single-quals FILE

                        A single-end qual file.

  -sra ACCN, --sra-accession ACCN

                        Accession to stream from SRA (requires optional NGS

                        dependency to be installed).

  -f {fasta,fastq,sra-fastq,sam,bam}, --format {fasta,fastq,sra-fastq,sam,bam}

                        Input file format. Ignored when reading csfasta/qual

                        files. (auto-detect from file name extension)

  -Q QUALITY_BASE, --quality-base QUALITY_BASE

                        Assume that quality values in FASTQ are encoded as

                        ascii(quality + QUALITY_BASE). This needs to be set to

                        64 for some old Illumina FASTQ files. (33)

  -c, --colorspace      Enable colorspace mode: Also trim the color that is

                        adjacent to the found adapter. (no)

  --max-reads N         Maximum number of reads/pairs to process (no max)

  --subsample PROB      Subsample a fraction of reads. (no)

  --subsample-seed SEED

                        The seed to use for the pseudorandom number generator.

                        Using the same seed will result in the same subsampling

                        of reads.

  --batch-size SIZE     Number of records to process in each batch. (1000)

  -D ID, --sample-id ID

                        Optional sample ID. Added to the summary output.

 

Finding adapters:

  Parameters -a, -g, -b specify adapters to be removed from each read (or

  from the first read in a pair if data is paired). If specified multiple

  times, only the best matching adapter is trimmed (but see the --times

  option). When the special notation 'file:FILE' is used, adapter sequences

  are read from the given FASTA file. When the --adapter-file option is

  used, adapters can be specified by name rather than sequence.

 

  -a ADAPTER, --adapter ADAPTER

                        Sequence of an adapter ligated to the 3' end (paired

                        data: of the first read). The adapter and subsequent

                        bases are trimmed. If a '$' character is appended

                        ('anchoring'), the adapter is only found if it is a

                        suffix of the read. (none)

  -g ADAPTER, --front ADAPTER

                        Sequence of an adapter ligated to the 5' end (paired

                        data: of the first read). The adapter and any

                        preceding bases are trimmed. Partial matches at the 5'

                        end are allowed. If a '^' character is prepended

                        ('anchoring'), the adapter is only found if it is a

                        prefix of the read. (none)

  -b ADAPTER, --anywhere ADAPTER

                        Sequence of an adapter that may be ligated to the 5'

                        or 3' end (paired data: of the first read). Both types

                        of matches as described under -a and -g are allowed.

                        If the first base of the read is part of the match,

                        the behavior is as with -g, otherwise as with -a. This

                        option is mostly for rescuing failed library

                        preparations - do not use if you know which end your

                        adapter was ligated to! (none)

  -F KNOWN_ADAPTERS_FILE, --known-adapters-file KNOWN_ADAPTERS_FILE

                        Path or URL of a FASTA file containing adapter

                        sequences.

  --no-default-adapters

                        Don't fetch the default adapter list (which is

                        currently stored in GitHub).

  --adapter-cache-file ADAPTER_CACHE_FILE

                        File where adapter sequences will be cached, unless

                        --no-cache-adapters is set.

  --no-cache-adapters   Don't cache adapters list as '.adapters' in the

                        working directory.

  --no-trim             Match and redirect reads to output/untrimmed-output as

                        usual, but do not remove adapters. (no)

  --mask-adapter        Mask adapters with 'N' characters instead of trimming

                        them. (no)

  --gc-content GC_CONTENT

                        Expected GC content of sequences.

  --aligner {adapter,insert}

                        Which alignment algorithm to use for identifying

                        adapters. Currently, you can choose between the semi-

                        global alignment strategy used in Cutadapt ('adapter')

                        or the more accurate insert-based alignment algorithm

                        ('insert'). Note that insert-based alignment can only

                        be used with paired-end reads containing 3' adapters.

                        New algorithms are being implemented and the default

                        is likely to change. (adapter)

  -e ERROR_RATE, --error-rate ERROR_RATE

                        Maximum allowed error rate for adapter match (no. of

                        errors divided by the length of the matching region).

                        (0.1)

  --indel-cost COST     Integer cost of insertions and deletions during

                        adapter match. Substitutions always have a cost of 1.

                        (1)

  --no-indels           Allow only mismatches in alignments. (allow both

                        mismatches and indels)

  -n COUNT, --times COUNT

                        Remove up to COUNT adapters from each read. (1)

  --match-read-wildcards

                        Interpret IUPAC wildcards in reads. (no)

  -N, --no-match-adapter-wildcards

                        Do not interpret IUPAC wildcards in adapters. (no)

  -O MINLENGTH, --overlap MINLENGTH

                        If the overlap between the read and the adapter is

                        shorter than MINLENGTH, the read is not modified.

                        Reduces the no. of bases trimmed due to random adapter

                        matches. (3)

  --adapter-max-rmp PROB

                        If no minimum overlap (-O) is specified, then adapters

                        are only matched when the probability of observing k

                        out of n matching bases is <= PROB. (1E-6)

  --insert-max-rmp PROB

                        Overlapping inserts only match when the probability of

                        observing k of n matching bases is <= PROB. (1E-6)

  --insert-match-error-rate INSERT_MATCH_ERROR_RATE

                        Maximum allowed error rate for insert match (no. of

                        errors divided by the length of the matching region).

                        (0.2)

  --insert-match-adapter-error-rate INSERT_MATCH_ADAPTER_ERROR_RATE

                        Maximum allowed error rate for matching adapters after

                        successful insert match (no. of errors divided by the

                        length of the matching region). (0.2)

  -R, --merge-overlapping

                        Merge read pairs that overlap into a single sequence.

                        This is experimental. (no)

  --merge-min-overlap MERGE_MIN_OVERLAP

                        The minimum overlap between reads required for

                        merging. If this number is (0,1.0], it specifies the

                        minimum length as the fraction of the length of the

                        *shorter* read in the pair; otherwise it specifies the

                        minimum number of overlapping base pairs (with an

                        absolute minimum of 2 bp). (0.9)

  --merge-error-rate MERGE_ERROR_RATE

                        The maximum error rate for merging. (0.2)

  --correct-mismatches {liberal,conservative,N}

                        How to handle mismatches while aligning/merging.

                        'Liberal' and 'conservative' error correction both

                        involve setting the base to the one with the best

                        quality. They differ only when the qualities are equal

                        -- liberal means set it to the base from the read with

                        the overall best median base quality, while

                        conservative means to leave it unchanged. 'N' means to

                        set the base to N. If exactly one base is ambiguous,

                        the non-ambiguous base is always used. (no error

                        correction)

 

Additional read modifications:

  --op-order OP_ORDER   The order in which trimming operations are applied.

                        This is a string of 1-5 of the following characters: A

                        = adapter trimming; C = cutting (unconditional); G =

                        NextSeq trimming; Q = quality trimming; W = overwrite

                        poor quality reads. The default is 'WCGQA' to maintain

                        compatibility with Cutadapt; however, this is likely

                        to change to 'GAWCQ' in the near future.

  -u LENGTH, --cut LENGTH

                        Remove bases from each read (first read only if

                        paired). If LENGTH is positive, remove bases from the

                        beginning. If LENGTH is negative, remove bases from

                        the end. Can be used twice if LENGTHs have different

                        signs. (no)

  -q [5'CUTOFF,]3'CUTOFF, --quality-cutoff [5'CUTOFF,]3'CUTOFF

                        Trim low-quality bases from 5' and/or 3' ends of each

                        read before adapter removal. Applied to both reads if

                        data is paired. If one value is given, only the 3' end

                        is trimmed. If two comma-separated cutoffs are given,

                        the 5' end is trimmed with the first cutoff, the 3'

                        end with the second. (no)

  -i LENGTH, --cut-min LENGTH

                        Similar to -u, except that cutting is done AFTER

                        adapter trimming, and only if a minimum of LENGTH

                        bases was not already removed. (no)

  --nextseq-trim 3'CUTOFF

                        NextSeq-specific quality trimming (each read). Trims

                        also dark cycles appearing as high-quality G bases

                        (EXPERIMENTAL). (no)

  --trim-n              Trim N's on ends of reads. (no)

  -x PREFIX, --prefix PREFIX

                        Add this prefix to read names. Use {name} to insert

                        the name of the matching adapter. (no)

  -y SUFFIX, --suffix SUFFIX

                        Add this suffix to read names; can also include

                        {name}. (no)

  --strip-suffix STRIP_SUFFIX

                        Remove this suffix from read names if present. Can be

                        given multiple times. (no)

  --length-tag TAG      Search for TAG followed by a decimal number in the

                        description field of the read. Replace the decimal

                        number with the correct length of the trimmed read.

                        For example, use --length-tag 'length=' to correct

                        fields like 'length=123'. (no)

 

Filtering of processed reads:

  --discard-trimmed, --discard

                        Discard reads that contain an adapter. Also use -O to

                        avoid discarding too many randomly matching reads!

                        (no)

  --discard-untrimmed, --trimmed-only

                        Discard reads that do not contain the adapter. (no)

  -m LENGTH, --minimum-length LENGTH

                        Discard trimmed reads that are shorter than LENGTH.

                        Reads that are too short even before adapter removal

                        are also discarded. In colorspace, an initial primer

                        is not counted. (0)

  -M LENGTH, --maximum-length LENGTH

                        Discard trimmed reads that are longer than LENGTH.

                        Reads that are too long even before adapter removal

                        are also discarded. In colorspace, an initial primer

                        is not counted. (no limit)

  --max-n COUNT         Discard reads with too many N bases. If COUNT is an

                        integer, it is treated as the absolute number of N

                        bases. If it is between 0 and 1, it is treated as the

                        proportion of N's allowed in a read. (no)

 

Output:

  -o FILE, --output FILE

                        Write trimmed reads to FILE. FASTQ or FASTA format is

                        chosen depending on input. The summary report is sent

                        to standard output. Use '{name}' in FILE to

                        demultiplex reads into multiple files. (write to

                        standard output)

  --info-file FILE      Write information about each read and its adapter

                        matches into FILE. See the documentation for the file

                        format. (no)

  -r FILE, --rest-file FILE

                        When the adapter matches in the middle of a read,

                        write the rest (after the adapter) into FILE. (no)

  --wildcard-file FILE  When the adapter has N bases (wildcards), write

                        adapter bases matching wildcard positions to FILE.

                        When there are indels in the alignment, this will

                        often not be accurate. (no)

  --too-short-output FILE

                        Write reads that are too short (according to length

                        specified by -m) to FILE. (no - too short reads are

                        discarded)

  --too-long-output FILE

                        Write reads that are too long (according to length

                        specified by -M) to FILE. (no - too long reads are

                        discarded)

  --untrimmed-output FILE

                        Write reads that do not contain the adapter to FILE.

                        (no - untrimmed reads are written to default output)

  --merged-output FILE  Write reads that have been merged to this file.

                        (merged reads are discarded)

  --report-file FILE    Write report to file rather than stdout/stderr. (no)

  --report-formats [FORMAT [FORMAT ...]]

                        Report type(s) to generate. If multiple, '--report-

                        file' is treated as a prefix and the appropriate

                        extensions are appended. If unspecified, the format is

                        guessed from the file extension. Supported formats

                        are: txt (legacy text format), json, yaml, pickle. See

                        the documentation for a full description of the

                        structured output (json/yaml/pickle formats).

  --stats [STATS [STATS ...]]

                        Which read-level statistics to compute. Can be 'none'

                        (default), 'pre': only compute pre-trimming stats;

                        'post': only compute post-trimming stats; or 'both'.

                        The keyword can be followed by ':' and then additional

                        configuration parameters. E.g. 'pre:tiles' means to

                        also collect tile-level statistics (Illumina data

                        only), and 'pre:tiles=<regexp>' means to use the

                        specified regular expression to extract key portions

                        of read names to collect the tile statistics.

 

Colorspace options:

  -d, --double-encode   Double-encode colors (map 0,1,2,3,4 to A,C,G,T,N).

                        (no)

  -t, --trim-primer     Trim primer base and the first color (which is the

                        transition to the first nucleotide). (no)

  --strip-f3            Strip the _F3 suffix of read names. (no)

  --maq, --bwa          MAQ- and BWA-compatible colorspace output. This

                        enables -c, -d, -t, --strip-f3 and -y '/1'. (no)

  --no-zero-cap         Do not change negative quality values to zero in

                        colorspace data. By default, they are since many tools

                        have problems with negative qualities. (no)

  -z, --zero-cap        Change negative quality values to zero. This is

                        enabled by default when -c/--colorspace is also

                        enabled. Use the above option to disable it. (no)

 

Paired-end options:

  The -A/-G/-B/-U/-I options work like their -a/-b/-g/-u/-i counterparts,

  but are applied to the second read in each pair.

 

  -A ADAPTER            3' adapter to be removed from second read in a pair.

                        (no)

  -G ADAPTER            5' adapter to be removed from second read in a pair.

                        (no)

  -B ADAPTER            5'/3' adapter to be removed from second read in a pair.

                        (no)

  -U LENGTH             Remove LENGTH bases from second read in a pair (see

                        --cut). (no)

  -I LENGTH, --cut-min2 LENGTH

                        Similar to -U, except that cutting is done AFTER

                        adapter trimming, and only if a minimum of LENGTH

                        bases was not already removed (see --cut-min). (no)

  -w LOWQ,HIGHQ,WINDOW, --overwrite-low-quality LOWQ,HIGHQ,WINDOW

                        When one read has mean quality < LOWQ and the other

                        read has mean quality >= HIGHQ over the first WINDOW

                        bases, overwrite the worse read with the better read.

  -p FILE, --paired-output FILE

                        Write second read in a pair to FILE. (no)

  -L FILE, --interleaved-output FILE

                        Write output to interleaved file.

  --pair-filter (any|both)

                        Which of the reads in a paired-end read have to match

                        the filtering criterion in order for it to be

                        filtered. (any)

  --untrimmed-paired-output FILE

                        Write second read in a pair to this FILE when no

                        adapter was found in the first read. Use this option

                        together with --untrimmed-output when trimming paired-

                        end reads. (no - output to same file as trimmed reads)

  --too-short-paired-output FILE

                        Write second read in a pair to this file if pair is

                        too short. Use together with --too-short-output. (no -

                        too short reads are discarded)

  --too-long-paired-output FILE

                        Write second read in a pair to this file if pair is

                        too long. Use together with --too-long-output. (no -

                        too long reads are discarded)

 

Method-specific options:

  --bisulfite METHOD    Set default option values for bisulfite-treated data.

                        The argument specifies the type of bisulfite library

                        (rrbs, non-directional, non-directional-rrbs, truseq,

                        epignome, or swift) or custom parameters for trimming:

                        '<read1>[;<read2>]' where trimming parameters for each

                        read are: '<5' cut>,<3' cut>,<include trimmed>,<only

                        trimmed>' where 'include trimmed' is 1 or 0 for

                        whether or not the bases already trimmed during/prior

                        to adapter trimming should be counted towards the

                        total bases to be cut and 'only trimmed' is 1 or 0 for

                        whether or not only trimmed reads should be further

                        cut. (no)

  --mirna               Set default option values for miRNA data. (no)

 

Parallel (multi-core) options:

  -T THREADS, --threads THREADS

                        Number of threads to use for read trimming. Set to 0

                        to use max available threads. (Do not use

                        multithreading)

  --no-writer-process   Do not use a writer process; instead, each worker

                        thread writes its own output to a file with a '.N'

                        suffix. (no)

  --preserve-order      Preserve order of reads in input files (ignored if

                        --no-writer-process is set). (no)

  --process-timeout SECONDS

                        Number of seconds process should wait before

                        escalating messages to ERROR level. (60)

  --read-queue-size SIZE

                        Size of queue for batches of reads to be processed.

                        (THREADS * 100)

  --result-queue-size SIZE

                        Size of queue for batches of results to be written.

                        (THREADS * 100)

  --compression {worker,writer}

                        Where data compression should be performed. Defaults

                        to 'writer' if system-level compression can be used

                        and (1 < threads < 8), otherwise defaults to 'worker'.

 

 

 

Run

Detect adapter sequences.

atropos detect -se single.fq -o output
  • -pe1 The first (and possibly only) input file.
  • -pe2 The second input file.
  • -l Interleaved input file.
  • -se A single-end read file.
  • -o File in which to write the summary of detected adapters.
  • -sra <ACCN> Accession to stream from SRA (requires optional NGS dependency to be installed).

 

Remove the 3' adapter from single-end reads. Reads shorter than 10 bases after trimming are discarded.

atropos --threads 8 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTTA -se single.fastq --minimum-length 10 -o output.fq -e 0.1
  • -a <ADAPTER> Sequence of an adapter ligated to the 3' end (paired data: of the first read). The adapter and subsequent bases are trimmed. If a '$' character is appended ('anchoring'), the adapter is only found if it is a suffix of the read. (none)
  • --threads <INT> Number of threads to use for read trimming. Set to 0 to use max available threads. (Do not use multithreading)
  • -e <FLOAT> Maximum allowed error rate for adapter match (no. of errors divided by the length of the matching region). (0.1)
  • --indel-cost <INT> Integer cost of insertions and deletions during adapter match. Substitutions always have a cost of 1. (1)
  • --no-indels Allow only mismatches in alignments. (allow both mismatches and indels)
  • --minimum-length <INT> Discard trimmed reads that are shorter than LENGTH. Reads that are too short even before adapter removal are also discarded. In colorspace, an initial primer is not counted. (0)
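
The -e and --minimum-length rules quoted above can be read as two simple checks. The following is a minimal Python sketch of those two rules as the help text words them (errors divided by the length of the matching region, and discarding reads shorter than LENGTH). The function names match_allowed and keep_read are made up for illustration, and this is not Atropos's actual code.

```python
def match_allowed(errors: int, match_length: int, max_error_rate: float = 0.1) -> bool:
    """Accept an adapter match if errors / match_length <= max_error_rate (-e)."""
    if match_length <= 0:
        return False
    return errors / match_length <= max_error_rate


def keep_read(trimmed_length: int, minimum_length: int = 0) -> bool:
    """--minimum-length discards reads that are shorter than LENGTH."""
    return trimmed_length >= minimum_length


# With -e 0.1, a 20-base match tolerates 2 errors; a 9-base match tolerates none.
print(match_allowed(2, 20))  # True
print(match_allowed(1, 9))   # False
# With --minimum-length 10, a 10-base read is kept and a 9-base read is dropped.
print(keep_read(10, 10))     # True
print(keep_read(9, 10))      # False
```

Because the threshold is a ratio, short partial matches near the end of a read must be nearly exact, while longer matches can absorb a few sequencing errors.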

 

Remove the 5' adapter from single-end reads.

atropos --threads 8 -g AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTTA -se single.fastq -o output.fq
  •  -g <ADAPTER> Sequence of an adapter ligated to the 5' end (paired data: of the first read). The adapter and any preceding bases are trimmed. Partial matches at the 5' end are allowed. If a '^' character is prepended ('anchoring'), the adapter is only found if it is a prefix of the read. (none)

To trim an adapter that may be on either the 3' or the 5' side, use -b in place of -a or -g.
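
The difference between -a, -g and -b is easiest to see on a toy read. The sketch below is a deliberately simplified Python illustration that uses exact, full-length string matches only; the real tool is error-tolerant and also accepts partial matches, so this is not how Atropos is implemented. The function names and the short adapter AGATCGG are assumptions chosen for the example.

```python
def trim_3prime(read: str, adapter: str) -> str:
    """-a style: drop the adapter and all subsequent bases."""
    pos = read.find(adapter)
    return read if pos == -1 else read[:pos]


def trim_5prime(read: str, adapter: str) -> str:
    """-g style: drop the adapter and any preceding bases."""
    pos = read.find(adapter)
    return read if pos == -1 else read[pos + len(adapter):]


def trim_anywhere(read: str, adapter: str) -> str:
    """-b style: act like -g if the match includes the first base, else like -a."""
    pos = read.find(adapter)
    if pos == -1:
        return read
    if pos == 0:
        return read[len(adapter):]
    return read[:pos]


adapter = "AGATCGG"  # shortened adapter, for illustration only
print(trim_3prime("TTGACCAGATCGGCC", adapter))    # TTGACC
print(trim_5prime("GGAGATCGGTTGACC", adapter))    # TTGACC
print(trim_anywhere("AGATCGGTTGACC", adapter))    # TTGACC
print(trim_anywhere("TTGACCAGATCGGCC", adapter))  # TTGACC
```

As the help text warns, -b is mainly meant for rescuing failed library preparations, so prefer -a or -g when the ligation end is known.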

 

 

Remove the 3' adapters from each read of a pair, using 8 threads.

atropos --threads 8 --aligner insert \
 -a AGATCGGAAGAGCACACGTCTGAACTCCAGTCACGAGTTA -A AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT \
 -pe1 input1.fastq -pe2 input2.fastq -o output1.fq -p output2.fq
  • -A 3' adapter to be removed from second read in a pair. (no)
  • --aligner {adapter,insert} Which alignment algorithm to use for identifying adapters. Currently, you can choose between the semi-global alignment strategy used in Cutadapt ('adapter') or the more accurate insert-based alignment algorithm ('insert'). Note that insert-based alignment can only be used with paired-end reads containing 3' adapters. New algorithms are being implemented and the default is likely to change. (adapter)
  • -p <FILE> Write second read in a pair to FILE. (no)
  • --no-trim Match and redirect reads to output/untrimmed-output as usual, but do not remove adapters. (no)
  • --mask-adapter Mask adapters with 'N' characters instead of trimming them. (no)
  • --times <COUNT> Remove up to COUNT adapters from each read. (1)

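The two output files are written so that read pairs stay in the same order.

As with single-end reads, to remove 5' adapters use -g and -G in place of -a and -A and specify the adapter sequences (or read them from a file).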
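
If you want to confirm that the paired output files are still in sync, a quick check of the read names is enough. The following Python sketch is a hypothetical helper, not part of Atropos; it assumes uncompressed FASTQ with exactly four lines per record and read names that differ at most by a trailing /1 or /2.

```python
import os
import tempfile
from itertools import zip_longest


def read_names(fastq_path):
    """Yield read names: header up to the first whitespace, without /1 or /2."""
    with open(fastq_path) as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 0:  # header line of each 4-line record
                name = line.rstrip("\n")[1:].split(None, 1)[0]
                if name.endswith(("/1", "/2")):
                    name = name[:-2]
                yield name


def pairs_in_sync(fastq1, fastq2):
    """Return True if both files list the same read names in the same order."""
    for name1, name2 in zip_longest(read_names(fastq1), read_names(fastq2)):
        if name1 != name2:  # also catches files of different length (None)
            return False
    return True


# Small self-contained demo with two synthetic records per file.
with tempfile.TemporaryDirectory() as tmp:
    out1 = os.path.join(tmp, "output1.fq")
    out2 = os.path.join(tmp, "output2.fq")
    with open(out1, "w") as fh:
        fh.write("@read1/1\nACGT\n+\nIIII\n@read2/1\nGGCC\n+\nIIII\n")
    with open(out2, "w") as fh:
        fh.write("@read1/2\nTTGA\n+\nIIII\n@read2/2\nCCAA\n+\nIIII\n")
    print(pairs_in_sync(out1, out2))  # True
```

On real data you would call pairs_in_sync("output1.fq", "output2.fq") on the files produced by the command above.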

 

Remove poly-A tails.

atropos -a "A{100}" -o output.fastq -se input.fastq

Poly-A tails shorter than 100 bases are also removed.
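
Why a 100-base adapter also removes shorter tails can be sketched in a few lines: the A{100} notation stands for 100 consecutive A's, and a 3' adapter may match partially at the end of the read. The Python sketch below is an illustration under stated assumptions, not Atropos's algorithm: it uses exact matching only, applies the minimum overlap of 3 given as the -O default in the help text, and ignores the --adapter-max-rmp probability threshold, so real results may differ. The function names are hypothetical.

```python
import re


def expand_repeats(adapter: str) -> str:
    """Expand 'A{100}' style notation into the literal sequence."""
    return re.sub(
        r"([A-Za-z])\{(\d+)\}",
        lambda m: m.group(1) * int(m.group(2)),
        adapter,
    )


def trim_poly_a(read: str, adapter: str = "A{100}", min_overlap: int = 3) -> str:
    """Trim a full adapter match, or a read suffix that equals an adapter prefix."""
    literal = expand_repeats(adapter)
    pos = read.find(literal)
    if pos != -1:
        return read[:pos]
    longest = min(len(read), len(literal))
    # Try the longest possible partial match first, down to the minimum overlap.
    for overlap in range(longest, max(min_overlap, 1) - 1, -1):
        if read.endswith(literal[:overlap]):
            return read[:-overlap]
    return read


print(len(expand_repeats("A{100}")))  # 100
print(trim_poly_a("ACGTACGTAAAAAA"))  # ACGTACGT
print(trim_poly_a("ACGTACGTAA"))      # ACGTACGTAA (tail shorter than the overlap)
```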

 

Many other options are available, such as trimming low-quality bases before adapter removal and merging paired-end reads. There are also options for BAM/SAM input, SOLiD, and NextSeq data. See the manual for details.

https://atropos.readthedocs.io/en/latest/
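
As one illustration of the quality trimming mentioned above (-q), the sketch below implements the BWA-style 3' trimming procedure that Cutadapt documents: subtract the cutoff from every quality, accumulate the sum from the 3' end, and cut where that sum is smallest. Since Atropos is a Cutadapt fork, it is assumed here, not verified, that its -q option behaves the same way. The function name is hypothetical, and qualities are decoded as ascii(quality + QUALITY_BASE) with the default base of 33 from the help text.

```python
def quality_trim_3prime(sequence, qualities, cutoff, quality_base=33):
    """Trim low-quality bases from the 3' end (BWA-style running sum)."""
    scores = [ord(char) - quality_base for char in qualities]
    running = 0
    best_sum = 0
    cut_at = len(sequence)  # default: keep the whole read
    for i in range(len(scores) - 1, -1, -1):
        running += scores[i] - cutoff
        if running > 0:
            break  # the remaining 5' part is good enough, stop scanning
        if running < best_sum:
            best_sum = running
            cut_at = i
    return sequence[:cut_at], qualities[:cut_at]


# Worked example: qualities drop sharply after the fourth base.
demo_scores = [42, 40, 26, 27, 8, 7, 11, 4, 2, 3]
demo_quals = "".join(chr(q + 33) for q in demo_scores)
trimmed_seq, trimmed_quals = quality_trim_3prime("ACGTACGTAC", demo_quals, cutoff=10)
print(trimmed_seq)  # ACGT
```

Note that the base with quality 11 is still removed even though it is above the cutoff, because it is surrounded by low-quality bases; this is the intended behaviour of the running-sum approach.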

 

Citation

Atropos: specific, sensitive, and speedy trimming of sequencing reads.

Didion JP, Martin M, Collins FS.

PeerJ. 2017 Aug 30;5:e3720. doi: 10.7717/peerj.3720. eCollection 2017.