macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ペアエンドfastqをマージする flash2

 

 DNAシーケンシング技術の急速な低下に伴い、デノボ全ゲノムシーケンシング(WGS)プロジェクトは新しいゲノムについて非常に深いカバレッジを生み出している。しかし、これらの技術による高いカバレッジとゲノムアセンブリアルゴリズム(Gnerre et al、2011; Li et al、2010)の劇的な改善を持ってしても、次世代技術によって生み出されるリード長の短さは、 WGSデータからゲノムを再構築する上で重要な障壁となっている。リード長が増加すると、ゲノムアセンブリの品質に重大かつ積極的な影響がある。

 現在のいくつかのプロトコルは、DNA断片ライブラリーの両端から配列を生成する。フラグメントがシーケンスできる長さの2倍よりも短い場合、結果として得られるペアエンドリードにはオーバーラップが存在することになる。たとえば、プロジェクトでは100 bpのリード、175 bpのライブラリを使用する場合がある。このタイプのライブラリは、アセンブリでこれらのリードを使用する前に、ペアエンドのリードの末端同士をオーバーラップおよびマージすることによってリード長を延長する機会が提供される。

 この論文では、FLASH(Fast Length Adjustment of SHort reads)を紹介する。これは、ペアエンドリード間の正確なオーバーラップを検出し、それらをつなぎ合わせてリードを拡張するソフトウェアツールである。著者らはシミュレーションで100万の100 bpペアエンドリードを発生させ、ツールの正確性をテストし、信頼性が高いと判断した。次に、実際のデータを使用して、このマージ手順がアセンブリの品質に及ぼす影響を測定した。また、FLASHの速度と精度をSHERA(Rodrigue et al、2010)と比較した。

 リードのエラー率が1%以下の場合、FLASHはペアリードの99%を正しく処理できる。エラー率が2%の場合、FLASHはペアリードの98%以上を処理する(デフォルトパラメータ)。より積極的なパラメータ(すなわち、-x 0.35)では、エラー率は5%でもFLASHはペアリードの90%を正しく処理する。

 

Limitations of FLASH include:
- FLASH cannot merge paired-end reads that do not overlap.
- FLASH is not designed for data that has a significant amount of indel
errors (such as Sanger sequencing data). It is best suited for Illumina
data.

flash2 Githubより

 

インストール

mac os10.14の miniconda3-4.0.5環境でテストした。

本体 Github

#Anaconda環境でcondaを使い導入
conda install -c bioconda -y flash2

> flash2 -h

$ flash2 -h

Usage: flash [OPTIONS] MATES_1.FASTQ MATES_2.FASTQ

       flash [OPTIONS] --interleaved-input (MATES.FASTQ | -)

       flash [OPTIONS] --tab-delimited-input (MATES.TAB | -)

 

----------------------------------------------------------------------------

                                 DESCRIPTION                                

----------------------------------------------------------------------------

 

FLASH (Fast Length Adjustment of SHort reads) is an accurate and fast tool

to merge paired-end reads that were generated from DNA fragments whose

lengths are shorter than twice the length of reads.  Merged read pairs result

in unpaired longer reads, which are generally more desired in genome

assembly and genome analysis processes.

 

Briefly, the FLASH algorithm considers all possible overlaps at or above a

minimum length between the reads in a pair and chooses the overlap that

results in the lowest mismatch density (proportion of mismatched bases in

the overlapped region).  Ties between multiple overlaps are broken by

considering quality scores at mismatch sites.  When building the merged

sequence, FLASH computes a consensus sequence in the overlapped region.

More details can be found in the original publication

(http://bioinformatics.oxfordjournals.org/content/27/21/2957.full).

 

Limitations of FLASH include:

   - FLASH cannot merge paired-end reads that do not overlap.

   - FLASH is not designed for data that has a significant amount of indel

     errors (such as Sanger sequencing data).  It is best suited for Illumina

     data.

 

----------------------------------------------------------------------------

                               MANDATORY INPUT

----------------------------------------------------------------------------

 

The most common input to FLASH is two FASTQ files containing read 1 and read 2

of each mate pair, respectively, in the same order.

 

Alternatively, you may provide one FASTQ file, which may be standard input,

containing paired-end reads in either interleaved FASTQ (see the

--interleaved-input option) or tab-delimited (see the --tab-delimited-input

option) format.  In all cases, gzip compressed input is autodetected.  Also,

in all cases, the PHRED offset is, by default, assumed to be 33; use the

--phred-offset option to change it.

 

----------------------------------------------------------------------------

                                   OUTPUT

----------------------------------------------------------------------------

 

The default output of FLASH consists of the following files:

 

   - out.extendedFrags.fastq      The merged reads.

   - out.notCombined_1.fastq      Read 1 of mate pairs that were not merged.

   - out.notCombined_2.fastq      Read 2 of mate pairs that were not merged.

   - out.hist                     Numeric histogram of merged read lengths.

   - out.histogram                Visual histogram of merged read lengths.

 

FLASH also logs informational messages to standard output.  These can also be

redirected to a file, as in the following example:

 

  $ flash reads_1.fq reads_2.fq 2>&1 | tee flash.log

 

In addition, FLASH supports several features affecting the output:

 

   - Writing the merged reads directly to standard output (--to-stdout)

   - Writing gzip compressed output files (-z) or using an external

     compression program (--compress-prog)

   - Writing the uncombined read pairs in interleaved FASTQ format

     (--interleaved-output)

   - Writing all output reads to a single file in tab-delimited format

     (--tab-delimited-output)

 

----------------------------------------------------------------------------

                                   OPTIONS

----------------------------------------------------------------------------

 

  -Q, --quality-cutoff=NUM The cut off number for the quality score

                          corresponding wtih the percent cutoff.  Default:

                          2.

  -C, --percent-cutoff=NUM   The cutoff percentage for each read that will

                          be discarded if it falls below -Q option. (0-100)  Default:

                          50.

  -D, --no-discard      This turns off the discard logic Default: false

 

  -m, --min-overlap=NUM   The minimum required overlap length between two

                          reads to provide a confident overlap. Default 10bp.

 

  -M, --max-overlap=NUM   Maximum overlap length expected in approximately

                          90% of read pairs.  It is by default set to 65bp,

                          which works well for 100bp reads generated from a

                          180bp library, assuming a normal distribution of

                          fragment lengths.  Overlaps longer than the maximum

                          overlap parameter are still considered as good

                          overlaps, but the mismatch density (explained below)

                          is calculated over the first max_overlap bases in

                          the overlapped region rather than the entire

                          overlap.  Default: 65bp, or calculated from the

                          specified read length, fragment length, and fragment

                          length standard deviation.

 

  -e, --min-overlap-outie=NUM   The minimum required overlap length between two

                          reads to provide a confident overlap in an outie scenario.

                          Default: 35bp.

 

  -x, --max-mismatch-density=NUM

                          Maximum allowed ratio between the number of

                          mismatched base pairs and the overlap length.

                          Two reads will not be combined with a given overlap

                          if that overlap results in a mismatched base density

                          higher than this value.  Note: Any occurence of an

                          'N' in either read is ignored and not counted

                          towards the mismatches or overlap length.  Our

                          experimental results suggest that higher values of

                          the maximum mismatch density yield larger

                          numbers of correctly merged read pairs but at

                          the expense of higher numbers of incorrectly

                          merged read pairs.  Default: 0.25.

 

  -O, --allow-outies      Also try combining read pairs in the "outie"

                          orientation, e.g.

 

                               Read 1: <-----------

                               Read 2:       ------------>

 

                          as opposed to only the "innie" orientation, e.g.

 

                               Read 1:       <------------

                               Read 2: ----------->

 

                          FLASH uses the same parameters when trying each

                          orientation.  If a read pair can be combined in

                          both "innie" and "outie" orientations, the

                          better-fitting one will be chosen using the same

                          scoring algorithm that FLASH normally uses.

 

                          This option also causes extra .innie and .outie

                          histogram files to be produced.

 

  -p, --phred-offset=OFFSET

                          The smallest ASCII value of the characters used to

                          represent quality values of bases in FASTQ files.

                          It should be set to either 33, which corresponds

                          to the later Illumina platforms and Sanger

                          platforms, or 64, which corresponds to the

                          earlier Illumina platforms.  Default: 33.

 

  -r, --read-len=LEN

  -f, --fragment-len=LEN

  -s, --fragment-len-stddev=LEN

                          Average read length, fragment length, and fragment

                          standard deviation.  These are convenience parameters

                          only, as they are only used for calculating the

                          maximum overlap (--max-overlap) parameter.

                          The maximum overlap is calculated as the overlap of

                          average-length reads from an average-size fragment

                          plus 2.5 times the fragment length standard

                          deviation.  The default values are -r 100, -f 180,

                          and -s 18, so this works out to a maximum overlap of

                          65 bp.  If --max-overlap is specified, then the

                          specified value overrides the calculated value.

 

                          If you do not know the standard deviation of the

                          fragment library, you can probably assume that the

                          standard deviation is 10% of the average fragment

                          length.

 

  --cap-mismatch-quals    Cap quality scores assigned at mismatch locations

                          to 2.  This was the default behavior in FLASH v1.2.7

                          and earlier.  Later versions will instead calculate

                          such scores as max(|q1 - q2|, 2); that is, the

                          absolute value of the difference in quality scores,

                          but at least 2.  Essentially, the new behavior

                          prevents a low quality base call that is likely a

                          sequencing error from significantly bringing down

                          the quality of a high quality, likely correct base

                          call.

 

  --interleaved-input     Instead of requiring files MATES_1.FASTQ and

                          MATES_2.FASTQ, allow a single file MATES.FASTQ that

                          has the paired-end reads interleaved.  Specify "-"

                          to read from standard input.

 

  --interleaved-output    Write the uncombined pairs in interleaved FASTQ

                          format.

 

  -I, --interleaved       Equivalent to specifying both --interleaved-input

                          and --interleaved-output.

 

  -Ti, --tab-delimited-input

                          Assume the input is in tab-delimited format

                          rather than FASTQ, in the format described below in

                          '--tab-delimited-output'.  In this mode you should

                          provide a single input file, each line of which must

                          contain either a read pair (5 fields) or a single

                          read (3 fields).  FLASH will try to combine the read

                          pairs.  Single reads will be written to the output

                          file as-is if also using --tab-delimited-output;

                          otherwise they will be ignored.  Note that you may

                          specify "-" as the input file to read the

                          tab-delimited data from standard input.

 

  -To, --tab-delimited-output

                          Write output in tab-delimited format (not FASTQ).

                          Each line will contain either a combined pair in the

                          format 'tag <tab> seq <tab> qual' or an uncombined

                          pair in the format 'tag <tab> seq_1 <tab> qual_1

                          <tab> seq_2 <tab> qual_2'.

 

  -o, --output-prefix=PREFIX

                          Prefix of output files.  Default: "out".

 

  -d, --output-directory=DIR

                          Path to directory for output files.  Default:

                          current working directory.

 

  -c, --to-stdout         Write the combined reads to standard output.  In

                          this mode, with FASTQ output (the default) the

                          uncombined reads are discarded.  With tab-delimited

                          output, uncombined reads are included in the

                          tab-delimited data written to standard output.

                          In both cases, histogram files are not written,

                          and informational messages are sent to standard

                          error rather than to standard output.

 

  -z, --compress          Compress the output files directly with zlib,

                          using the gzip container format.  Similar to

                          specifying --compress-prog=gzip and --suffix=gz,

                          but may be slightly faster.

 

  --compress-prog=PROG    Pipe the output through the compression program

                          PROG, which will be called as `PROG -c -',

                          plus any arguments specified by --compress-prog-args.

                          PROG must read uncompressed data from standard input

                          and write compressed data to standard output when

                          invoked as noted above.

                          Examples: gzip, bzip2, xz, pigz.

 

  --compress-prog-args=ARGS

                          A string of additional arguments that will be passed

                          to the compression program if one is specified with

                          --compress-prog=PROG.  (The arguments '-c -' are

                          still passed in addition to explicitly specified

                          arguments.)

 

  --suffix=SUFFIX, --output-suffix=SUFFIX

                          Use SUFFIX as the suffix of the output files

                          after ".fastq".  A dot before the suffix is assumed,

                          unless an empty suffix is provided.  Default:

                          nothing; or 'gz' if -z is specified; or PROG if

                          --compress-prog=PROG is specified.

 

  -t, --threads=NTHREADS  Set the number of worker threads.  This is in

                          addition to the I/O threads.  Default: number of

                          processors.  Note: if you need FLASH's output to

                          appear deterministically or in the same order as

                          the original reads, you must specify -t 1

                          (--threads=1).

 

  -q, --quiet             Do not print informational messages.

 

  -h, --help              Display this help and exit.

 

  -v, --version           Display version.

 

Run `flash2 --help | less' to prevent this text from scrolling by.

 

実行方法 

ペアエンドのfastq

flash2 pair1.fq pair2.fq -t 8

 

interleaveのfastq

flash2 --interleaved-input interleave.fq -t 8

 

リード長のヒストグラムファイル、マージされたリードのfastq、マージされなかったリードのfastqがそれぞれ出力される。

f:id:kazumaxneo:20181201102535j:plain

 

-O (--allow-outies)をつけることで、外向きのペアエンドfastqのマージも可能

flash2 --allow-outies pair1.fq pair2.fq -t 8

  -O, --allow-outies      Also try combining read pairs in the "outie"

                          orientation, e.g.

 

                               Read 1: <-----------

                               Read 2:       ------------>

 

                          as opposed to only the "innie" orientation, e.g.

 

                               Read 1:       <------------

                               Read 2: ----------->

 

                          FLASH uses the same parameters when trying each

                          orientation.  If a read pair can be combined in

                          both "innie" and "outie" orientations, the

                          better-fitting one will be chosen using the same

                          scoring algorithm that FLASH normally uses.

 

                          This option also causes extra .innie and .outie

                          histogram files to be produced.

 

引用

FLASH: fast length adjustment of short reads to improve genome assemblies

Tanja Magoč, Steven L. Salzberg
Bioinformatics, Volume 27, Issue 21, 1 November 2011, Pages 2957–2963,