macでインフォマティクス


Notes on informatics topics related to HTS (NGS) data analysis.

Reducing the size of sequencing data with diginorm

 2019 5/14: added help output and parameter notes

 

 "digital normalization"という名で発表されたこの手法は、k-merを指標にリードを間引いて、データサイズを軽量化する方法論。データサイズが大きすぎてアセンブルできないサンプルの軽量化に使えるとされる。トリミングターゲットは、低/高のk-merカバレッジのリードになる。k-merのカバレッジが中央値から極端に少ないのはシーケンスエラーと考えられ、k-merのカバレッジが極端に多いにはリピートと考えられる。それらを削って軽量化する。

 However, low-k-mer-coverage sequences may be rare variants rather than errors, and high-coverage sequences may be near-identical copies in a polyploid genome rather than simple repeats; both risk being discarded together with genuine error k-mers. The paper's author discusses this point on his site (Why you shouldn't use digital normalization). If you are interested, use the method only after understanding these trade-offs.

  The method is also said to be applicable to de novo transcriptome data and has been used in published de novo transcriptome studies.
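The two sides of this trimming map onto two khmer scripts whose help output is reproduced below: normalize-by-median.py handles the high-coverage side by discarding reads whose median k-mer coverage already exceeds a cutoff (-C), while filter-abund.py handles the low-abundance side by trimming k-mers seen fewer than -C times. A minimal sketch, assuming a hypothetical input reads.fq and the same table-size settings used later in this post:

# discard reads whose median 20-mer coverage is already above 20
normalize-by-median.py -k 20 -C 20 -N 4 -x 5e8 reads.fq
# build a countgraph, then trim 20-mers seen fewer than 2 times from the survivors
load-into-counting.py -k 20 -x 5e8 graph reads.fq.keep
filter-abund.py -C 2 graph reads.fq.keep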

 

diginorm official site

http://ged.msu.edu/papers/2012-diginorm/

The analysis uses khmer.

http://khmer.readthedocs.io/en/v2.0/user/scripts.html#scripts-diginorm 

Tutorial

A tutorial in basic digital normalization — ANGUS 2.0 documentation

Installing the tools needed for the run

khmer and screed are required.

git clone git://github.com/ged-lab/screed.git 
git clone git://github.com/ged-lab/khmer.git

cd screed
python setup.py install
cd ../khmer
make test
export PYTHONPATH=/root/khmer/python

 khmer and screed can also be installed with pip.
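For a pip-based install, something along these lines should work (both packages are published on PyPI; check the versions printed by the scripts against the help output below):

pip install screed khmer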

On choosing the best k-mer size


# normalize-by-median.py -h

 

|| This is the script normalize-by-median.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: normalize-by-median.py [--version] [--info] [-h] [-k KSIZE]

                              [-U UNIQUE_KMERS] [--fp-rate FP_RATE]

                              [-M MAX_MEMORY_USAGE] [--small-count] [-q]

                              [-C CUTOFF] [-p] [--force_single]

                              [-u unpaired_reads_filename] [-s filename]

                              [-R report_filename]

                              [--report-frequency report_frequency] [-f]

                              [-o filename] [-l filename] [--gzip | --bzip]

                              input_sequence_filename

                              [input_sequence_filename ...]

 

Do digital normalization (remove mostly redundant sequences)

 

positional arguments:

  input_sequence_filename

                        Input FAST[AQ] sequence filename.

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -k KSIZE, --ksize KSIZE

                        k-mer size to use (default: 32)

  -U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS

                        approximate number of unique kmers in the input set

                        (default: 0)

  --fp-rate FP_RATE     Override the automatic FP rate setting for the current

                        script (default: None)

  -M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE

                        maximum amount of memory to use for data structure

                        (default: None)

  --small-count         Reduce memory usage by using a smaller counter for

                        individual kmers. (default: False)

  -q, --quiet

  -C CUTOFF, --cutoff CUTOFF

                        when the median k-mer coverage level is above this

                        number the read is not kept. (default: 20)

  -p, --paired          require that all sequences be properly paired

                        (default: False)

  --force_single        treat all sequences as single-ended/unpaired (default:

                        False)

  -u unpaired_reads_filename, --unpaired-reads unpaired_reads_filename

                        include a file of unpaired reads to which -p/--paired

                        does not apply. (default: None)

  -s filename, --savegraph filename

                        save the k-mer countgraph to disk after all reads are

                        loaded. (default: None)

  -R report_filename, --report report_filename

                        write progress report to report_filename (default:

                        None)

  --report-frequency report_frequency

                        report progress every report_frequency reads (default:

                        100000)

  -f, --force           continue past file reading errors (default: False)

  -o filename, --output filename

                        only output a single file with the specified filename;

                        use a single dash "-" to specify that output should go

                        to STDOUT (the terminal) (default: None)

  -l filename, --loadgraph filename

                        load a precomputed k-mer graph from disk (default:

                        None)

  --gzip                Compress output using gzip (default: False)

  --bzip                Compress output using bzip2 (default: False)

 

Discard sequences based on whether or not their median k-mer abundance lies

above a specified cutoff. Kept sequences will be placed in <fileN>.keep.

 

By default, paired end reads will be considered together; if either read should

be kept, both will be kept. (This keeps both reads from a fragment, and helps

with retention of repeats.) Unpaired reads are treated individually.

 

If `-p`/`--paired` is set, then proper pairing is required and the script will

exit on unpaired reads, although `--unpaired-reads` can be used to supply a

file of orphan reads to be read after the paired reads.

 

`--force_single` will ignore all pairing information and treat reads

individually.

 

With `-s`/`--savegraph`, the k-mer countgraph will be saved to the specified

file after all sequences have been processed. `-l`/`--loadgraph` will load the

specified k-mer countgraph before processing the specified files.  Note that

these graphs are are in the same format as those produced by `load-into-

counting.py` and consumed by `abundance-dist.py`.

 

To append reads to an output file (rather than overwriting it), send output to

STDOUT with `--output -` and use UNIX file redirection syntax (`>>`) to append

to the file.

 

Example:

 

    normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa

 

Example:

 

    normalize-by-median.py -p -k 17 \

    tests/test-data/test-abund-read-paired.fa

 

Example:

 

    normalize-by-median.py -p -k 17 -o - tests/test-data/paired.fq \

    >> appended-output.fq

 

Example:

 

    normalize-by-median.py -k 17 -f tests/test-data/test-error-reads.fq \

    tests/test-data/test-fastq-reads.fq

 

Example:

 

    normalize-by-median.py -k 17 -s test.ct \

    tests/test-data/test-abund-read-2.fa \

    tests/test-data/test-fastq-reads.fq


# filter-abund.py -h

 

|| This is the script filter-abund.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||   * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: filter-abund.py [--version] [--info] [-h] [-T THREADS] [-C CUTOFF] [-V]

                       [-Z NORMALIZE_TO] [-o optional_output_filename] [-f]

                       [-q] [--gzip | --bzip]

                       input_count_graph_filename input_sequence_filename

                       [input_sequence_filename ...]

 

Trim sequences at a minimum k-mer abundance.

 

positional arguments:

  input_count_graph_filename

                        The input k-mer countgraph filename

  input_sequence_filename

                        Input FAST[AQ] sequence filename

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -T THREADS, --threads THREADS

                        Number of simultaneous threads to execute (default: 1)

  -C CUTOFF, --cutoff CUTOFF

                        Trim at k-mers below this abundance. (default: 2)

  -V, --variable-coverage

                        Only trim low-abundance k-mers from sequences that

                        have high coverage. (default: False)

  -Z NORMALIZE_TO, --normalize-to NORMALIZE_TO

                        Base the variable-coverage cutoff on this median k-mer

                        abundance. (default: 20)

  -o optional_output_filename, --output optional_output_filename

                        Output the trimmed sequences into a single file with

                        the given filename instead of creating a new file for

                        each input file. (default: None)

  -f, --force           Overwrite output file if it exists (default: False)

  -q, --quiet

  --gzip                Compress output using gzip (default: False)

  --bzip                Compress output using bzip2 (default: False)

 

Trimmed sequences will be placed in "${input_sequence_filename}.abundfilt" for

each input sequence file. If the input sequences are from RNAseq or metagenome

sequencing then `--variable-coverage` should be used.

 

Example:

 

    load-into-counting.py -k 20 -x 5e7 countgraph data/100k-filtered.fa

    filter-abund.py -C 2 countgraph data/100k-filtered.fa


# load-into-counting.py -h

 

|| This is the script load-into-counting.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||   * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271

||   * A. Döring et al. https://doi.org:80/10.1186/1471-2105-9-11

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: load-into-counting.py [--version] [--info] [-h] [-k KSIZE]

                             [-U UNIQUE_KMERS] [--fp-rate FP_RATE]

                             [-M MAX_MEMORY_USAGE] [--small-count]

                             [-T THREADS] [-b] [-s FORMAT] [-f] [-q]

                             output_countgraph_filename

                             input_sequence_filename

                             [input_sequence_filename ...]

 

Build a k-mer countgraph from the given sequences.

 

positional arguments:

  output_countgraph_filename

                        The name of the file to write the k-mer countgraph to.

  input_sequence_filename

                        The names of one or more FAST[AQ] input sequence

                        files.

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -k KSIZE, --ksize KSIZE

                        k-mer size to use (default: 32)

  -U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS

                        approximate number of unique kmers in the input set

                        (default: 0)

  --fp-rate FP_RATE     Override the automatic FP rate setting for the current

                        script (default: None)

  -M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE

                        maximum amount of memory to use for data structure

                        (default: None)

  --small-count         Reduce memory usage by using a smaller counter for

                        individual kmers. (default: False)

  -T THREADS, --threads THREADS

                        Number of simultaneous threads to execute (default: 1)

  -b, --no-bigcount     The default behaviour is to count past 255 using

                        bigcount. This flag turns bigcount off, limiting

                        counts to 255. (default: True)

  -s FORMAT, --summary-info FORMAT

                        What format should the machine readable run summary be

                        in? (`json` or `tsv`, disabled by default) (default:

                        None)

  -f, --force           Overwrite output file if it exists (default: False)

  -q, --quiet

 

Note: with `-b`/`--no-bigcount` the output will be the exact size of the k-mer

countgraph and this script will use a constant amount of memory. In exchange

k-mer counts will stop at 255. The memory usage of this script with `-b` will

be about 1.15x the product of the `-x` and `-N` numbers.

 

Example:

 

    load-into-counting.py -k 20 -x 5e7 out data/100k-filtered.fa

 

Multiple threads can be used to accelerate the process, if you have extra cores

to spare.

 

Example:

 

    load-into-counting.py -k 20 -x 5e7 -T 4 out data/100k-filtered.fa


# split-paired-reads.py -h

 

|| This is the script split-paired-reads.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: split-paired-reads.py [--version] [--info] [-h] [-d output_directory]

                             [-0 output_orphaned] [-1 output_first]

                             [-2 output_second] [-f] [--gzip | --bzip]

                             [infile]

 

Split interleaved reads into two files, left and right.

 

positional arguments:

  infile

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -d output_directory, --output-dir output_directory

                        Output split reads to specified directory. Creates

                        directory if necessary (default: )

  -0 output_orphaned, --output-orphaned output_orphaned

                        Allow "orphaned" reads and extract them to this file

                        (default: None)

  -1 output_first, --output-first output_first

                        Output "left" reads to this file (default: None)

  -2 output_second, --output-second output_second

                        Output "right" reads to this file (default: None)

  -f, --force           Overwrite output file if it exists (default: False)

  --gzip                Compress output using gzip (default: False)

  --bzip                Compress output using bzip2 (default: False)

 

Some programs want paired-end read input in the One True Format, which is

interleaved; other programs want input in the Insanely Bad Format, with left-

and right- reads separated. This reformats the former to the latter.

 

The directory into which the left- and right- reads are output may be specified

using `-d`/`--output-dir`. This directory will be created if it does not

already exist.

 

Alternatively, you can specify the filenames directly with `-1`/`--output-

first` and `-2`/`--output-second`, which will override the `-d`/`--output-dir`

setting on a file-specific basis.

 

`-0`/'--output-orphans` will allow broken-paired format, and orphaned reads

will be saved separately, to the specified file.

 

Example:

 

    split-paired-reads.py tests/test-data/paired.fq

 

Example:

 

    split-paired-reads.py -0 reads-output-file tests/test-data/paired.fq

 

Example:

 

    split-paired-reads.py -1 reads.1 -2 reads.2 tests/test-data/paired.fq

 

Run

Transcriptome data

 Perform 1-pass normalization.

First, merge the reads into an interleaved paired.fq.

seqtk mergepe R1.fq R2.fq > paired.fq

 

 Normalize the reads based on the k-mer coverage distribution.

normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq
  • -C <CUTOFF>    when the median k-mer coverage level is above this number the read is not kept. (default: 20)
  • -k <KSIZE>    k-mer size to use (default: 32)

On the values of C and k (same link as above).
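normalize-by-median.py writes the surviving reads to paired.fq.keep, still interleaved and with some pairs broken. If the downstream assembler expects separate R1/R2 files, the output can be de-interleaved with split-paired-reads.py (its help is reproduced above); the file names here are hypothetical:

# split the interleaved survivors back into left/right reads plus orphans
split-paired-reads.py -0 orphan.fq -1 norm_R1.fq -2 norm_R2.fq paired.fq.keep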

 

 

Genome data

Perform 3-pass normalization.

First, merge the reads into an interleaved paired.fq (seqtk (link)).

seqtk mergepe R1.fq R2.fq > paired.fq

Build the k-mer countgraph (load-into-counting.py).

load-into-counting.py -k 20 -x 5e8 -T 24 graph paired.fq 
  • -k k-mer size to use (default: 32)
  • -T Number of simultaneous threads to execute
  • -x upper bound on tablesize to use; overrides --max-memory-usage/-M.

 

1-pass: normalize the reads based on the coverage distribution.

normalize-by-median.py ; Discard sequences based on whether or not their median k-mer abundance lies above a specified cutoff.

normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq
  • -k k-mer size to use (default: 32)
  • -C when the median k-mer coverage level is above this number the read is not kept. (default: 20)
  • -N number of tables to use in k-mer countgraph
  • -x upper bound on tablesize to use; overrides --max-memory-usage/-M.

paired.fq.keep is output. With the data tested here, reads dropped from 287,956 to 273,788 (about a 5% cut). The output is no longer properly paired; the pairs will be fixed up later.

 2-pass: now trim low-coverage k-mers from the reads.

filter-abund.py ; Trim sequences at a minimum k-mer abundance.

filter-abund.py -T 24 -C 2 graph paired.fq.keep
  • -C CUTOFF, --cutoff CUTOFF Trim at k-mers below this abundance. (default: 2)
  • -T Number of simultaneous threads to execute (default: 1)

paired.fq.keep.abundfilt is output. Roughly 60% of the reads were trimmed.

 3-pass: finally, normalize once more with -C 5.

normalize-by-median.py -C 5 -k 20 -N 4 -x 1e8 paired.fq.keep.abundfilt

paired.fq.keep.abundfilt.keep is output.

Finally, split the reads into pairs and singletons (orphans).

split-paired-reads.py -0 orphan -1 trimmed_R1.fq -2 trimmed_R2.fq paired.fq.keep.abundfilt.keep
  • -0 output_orphaned Allow "orphaned" reads and extract them to this file (default: None)
  • -1 Output "left" reads to this file (default: None)
  • -2 Output "right" reads to this file (default: None)

 R1 and R2 end up with the same number of lines. Orphaned (unpaired) reads accounted for nearly 20% of the total.
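Putting the whole 3-pass genome workflow together, the steps above can be strung into a single shell script. This is just a sketch that repeats the commands and parameters used in this post; adjust -k, the cutoffs, and the table sizes (-N/-x, or -M in newer khmer) to your data and memory budget:

#!/bin/bash
set -euo pipefail

# 0) interleave the pairs
seqtk mergepe R1.fq R2.fq > paired.fq

# build the k-mer countgraph used by filter-abund.py
load-into-counting.py -k 20 -x 5e8 -T 24 graph paired.fq

# 1-pass: digital normalization to a median coverage of 20  -> paired.fq.keep
normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq

# 2-pass: trim k-mers below abundance 2                      -> paired.fq.keep.abundfilt
filter-abund.py -T 24 -C 2 graph paired.fq.keep

# 3-pass: normalize again, more aggressively                 -> paired.fq.keep.abundfilt.keep
normalize-by-median.py -C 5 -k 20 -N 4 -x 1e8 paired.fq.keep.abundfilt

# split back into R1/R2 plus orphans
split-paired-reads.py -0 orphan -1 trimmed_R1.fq -2 trimmed_R2.fq paired.fq.keep.abundfilt.keep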

 

 

Compute k-mer frequencies with KMC (KMC, a fast k-mer counting tool).

 ↓

Plot a histogram of 20-mer frequencies. A somewhat error-rich fastq is used here.
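As a rough sketch of how such a histogram might be computed with KMC (based on my reading of the KMC documentation; check kmc and kmc_tools help for your version, and note the file names here are hypothetical):

mkdir -p tmp
# count 20-mers in the reads; -ci1 keeps singleton k-mers so the error peak stays visible
kmc -k20 -ci1 reads.fq reads_k20 tmp
# dump the frequency histogram to a text file for plotting
kmc_tools transform reads_k20 histogram k20_hist.txt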

Before diginorm

f:id:kazumaxneo:20180221164944j:plain

Low-frequency 20-mers (errors, low coverage, rare variants, contamination) are extremely abundant.

 ↓

After diginorm

f:id:kazumaxneo:20180221163327j:plain

The low-frequency 20-mers are gone.

 

 

It may be worth running error correction before diginorm. karect also ships with tools for evaluating recall and precision. Error correction tool: karect

Below is the histogram of the "before" fastq after error correction with karect.

f:id:kazumaxneo:20180221163254j:plain 

There are still many, but compared with "before" the number of low-frequency k-mers has dropped considerably. The idea is to repair what can be repaired before normalization, so that those reads are not thrown away.
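As a sketch, a karect run on the raw pairs before diginorm might look like the following (option names follow the karect README as I recall it; -celltype and -matchtype in particular are assumptions to be checked against your data and karect's own help):

# correct both mates; corrected files are written next to the inputs (assumed celltype/matchtype)
karect -correct -inputfile=R1.fq -inputfile=R2.fq -celltype=diploid -matchtype=hamming -threads=12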

 

 

Citation

A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data

C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom

arXiv:1203.4802 (2012)

 

Addendum

BBNorm is recommended.
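A sketch of the equivalent normalization with BBNorm (bbnorm.sh ships with BBTools; the target and min values mirror the example in the BBNorm documentation and are not tuned for any particular dataset):

# normalize paired reads to a target depth of 100, discarding reads below depth 5
bbnorm.sh in=R1.fq in2=R2.fq out=norm_R1.fq out2=norm_R2.fq target=100 min=5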