macでインフォマティクス


Notes on informatics topics related to HTS (NGS) data analysis.

Reducing the size of sequencing data with diginorm

 2019 5/14: added help output and parameter notes

 

 "digital normalization"という名で発表されたこの手法は、k-merを指標にリードを間引いて、データサイズを軽量化する方法論。データサイズが大きすぎてアセンブルできないサンプルの軽量化に使えるとされる。トリミングターゲットは、低/高のk-merカバレッジのリードになる。k-merのカバレッジが中央値から極端に少ないのはシーケンスエラーと考えられ、k-merのカバレッジが極端に多いにはリピートと考えられる。それらを削って軽量化する。

 However, low-k-mer-coverage sequences may be rare variants rather than errors, and high-coverage sequences may be near-identical copies in a polyploid genome rather than simple repeats; both risk being discarded together with genuine error k-mers. The paper's author discusses this point on his site (Why you shouldn't use digital normalization). If you are interested, use the method only after understanding these trade-offs.

  The method is also said to be applicable to de novo transcriptome data and has been used in published de novo transcriptome studies.
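The two sides of this trimming map onto two khmer scripts whose help output is reproduced below: normalize-by-median.py handles the high-coverage side by discarding reads whose median k-mer coverage already exceeds a cutoff (-C), while filter-abund.py handles the low-abundance side by trimming k-mers seen fewer than -C times. A minimal sketch, assuming a hypothetical input reads.fq and the same table-size settings used later in this post:

# discard reads whose median 20-mer coverage is already above 20
normalize-by-median.py -k 20 -C 20 -N 4 -x 5e8 reads.fq
# build a countgraph, then trim 20-mers seen fewer than 2 times from the survivors
load-into-counting.py -k 20 -x 5e8 graph reads.fq.keep
filter-abund.py -C 2 graph reads.fq.keep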

 

diginorm official site

http://ged.msu.edu/papers/2012-diginorm/

The analysis uses khmer.

http://khmer.readthedocs.io/en/v2.0/user/scripts.html#scripts-diginorm 

Tutorial

A tutorial in basic digital normalization — ANGUS 2.0 documentation

Installing the tools needed for the run

khmer and screed are required.

git clone git://github.com/ged-lab/screed.git 
git clone git://github.com/ged-lab/khmer.git

cd screed
python setup.py install
cd ../khmer
make test
export PYTHONPATH=/root/khmer/python

 khmer and screed can also be installed with pip.
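For a pip-based install, something along these lines should work (both packages are published on PyPI; check the versions printed by the scripts against the help output below):

pip install screed khmer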

On choosing the best k-mer size


# normalize-by-median.py -h

 

|| This is the script normalize-by-median.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||   * CT Brown et al., arXiv:1203.4802 [q-bio.GN]

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: normalize-by-median.py [--version] [--info] [-h] [-k KSIZE]

                              [-U UNIQUE_KMERS] [--fp-rate FP_RATE]

                              [-M MAX_MEMORY_USAGE] [--small-count] [-q]

                              [-C CUTOFF] [-p] [--force_single]

                              [-u unpaired_reads_filename] [-s filename]

                              [-R report_filename]

                              [--report-frequency report_frequency] [-f]

                              [-o filename] [-l filename] [--gzip | --bzip]

                              input_sequence_filename

                              [input_sequence_filename ...]

 

Do digital normalization (remove mostly redundant sequences)

 

positional arguments:

  input_sequence_filename

                        Input FAST[AQ] sequence filename.

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -k KSIZE, --ksize KSIZE

                        k-mer size to use (default: 32)

  -U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS

                        approximate number of unique kmers in the input set

                        (default: 0)

  --fp-rate FP_RATE     Override the automatic FP rate setting for the current

                        script (default: None)

  -M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE

                        maximum amount of memory to use for data structure

                        (default: None)

  --small-count         Reduce memory usage by using a smaller counter for

                        individual kmers. (default: False)

  -q, --quiet

  -C CUTOFF, --cutoff CUTOFF

                        when the median k-mer coverage level is above this

                        number the read is not kept. (default: 20)

  -p, --paired          require that all sequences be properly paired

                        (default: False)

  --force_single        treat all sequences as single-ended/unpaired (default:

                        False)

  -u unpaired_reads_filename, --unpaired-reads unpaired_reads_filename

                        include a file of unpaired reads to which -p/--paired

                        does not apply. (default: None)

  -s filename, --savegraph filename

                        save the k-mer countgraph to disk after all reads are

                        loaded. (default: None)

  -R report_filename, --report report_filename

                        write progress report to report_filename (default:

                        None)

  --report-frequency report_frequency

                        report progress every report_frequency reads (default:

                        100000)

  -f, --force           continue past file reading errors (default: False)

  -o filename, --output filename

                        only output a single file with the specified filename;

                        use a single dash "-" to specify that output should go

                        to STDOUT (the terminal) (default: None)

  -l filename, --loadgraph filename

                        load a precomputed k-mer graph from disk (default:

                        None)

  --gzip                Compress output using gzip (default: False)

  --bzip                Compress output using bzip2 (default: False)

 

Discard sequences based on whether or not their median k-mer abundance lies

above a specified cutoff. Kept sequences will be placed in <fileN>.keep.

 

By default, paired end reads will be considered together; if either read should

be kept, both will be kept. (This keeps both reads from a fragment, and helps

with retention of repeats.) Unpaired reads are treated individually.

 

If `-p`/`--paired` is set, then proper pairing is required and the script will

exit on unpaired reads, although `--unpaired-reads` can be used to supply a

file of orphan reads to be read after the paired reads.

 

`--force_single` will ignore all pairing information and treat reads

individually.

 

With `-s`/`--savegraph`, the k-mer countgraph will be saved to the specified

file after all sequences have been processed. `-l`/`--loadgraph` will load the

specified k-mer countgraph before processing the specified files.  Note that

these graphs are are in the same format as those produced by `load-into-

counting.py` and consumed by `abundance-dist.py`.

 

To append reads to an output file (rather than overwriting it), send output to

STDOUT with `--output -` and use UNIX file redirection syntax (`>>`) to append

to the file.

 

Example:

 

    normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa

 

Example:

 

    normalize-by-median.py -p -k 17 \

    tests/test-data/test-abund-read-paired.fa

 

Example:

 

    normalize-by-median.py -p -k 17 -o - tests/test-data/paired.fq \

    >> appended-output.fq

 

Example:

 

    normalize-by-median.py -k 17 -f tests/test-data/test-error-reads.fq \

    tests/test-data/test-fastq-reads.fq

 

Example:

 

    normalize-by-median.py -k 17 -s test.ct \

    tests/test-data/test-abund-read-2.fa \

    tests/test-data/test-fastq-reads.fq


# filter-abund.py -h

 

|| This is the script filter-abund.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||   * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: filter-abund.py [--version] [--info] [-h] [-T THREADS] [-C CUTOFF] [-V]

                       [-Z NORMALIZE_TO] [-o optional_output_filename] [-f]

                       [-q] [--gzip | --bzip]

                       input_count_graph_filename input_sequence_filename

                       [input_sequence_filename ...]

 

Trim sequences at a minimum k-mer abundance.

 

positional arguments:

  input_count_graph_filename

                        The input k-mer countgraph filename

  input_sequence_filename

                        Input FAST[AQ] sequence filename

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -T THREADS, --threads THREADS

                        Number of simultaneous threads to execute (default: 1)

  -C CUTOFF, --cutoff CUTOFF

                        Trim at k-mers below this abundance. (default: 2)

  -V, --variable-coverage

                        Only trim low-abundance k-mers from sequences that

                        have high coverage. (default: False)

  -Z NORMALIZE_TO, --normalize-to NORMALIZE_TO

                        Base the variable-coverage cutoff on this median k-mer

                        abundance. (default: 20)

  -o optional_output_filename, --output optional_output_filename

                        Output the trimmed sequences into a single file with

                        the given filename instead of creating a new file for

                        each input file. (default: None)

  -f, --force           Overwrite output file if it exists (default: False)

  -q, --quiet

  --gzip                Compress output using gzip (default: False)

  --bzip                Compress output using bzip2 (default: False)

 

Trimmed sequences will be placed in "${input_sequence_filename}.abundfilt" for

each input sequence file. If the input sequences are from RNAseq or metagenome

sequencing then `--variable-coverage` should be used.

 

Example:

 

    load-into-counting.py -k 20 -x 5e7 countgraph data/100k-filtered.fa

    filter-abund.py -C 2 countgraph data/100k-filtered.fa


# load-into-counting.py -h

 

|| This is the script load-into-counting.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||   * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271

||   * A. Döring et al. https://doi.org:80/10.1186/1471-2105-9-11

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: load-into-counting.py [--version] [--info] [-h] [-k KSIZE]

                             [-U UNIQUE_KMERS] [--fp-rate FP_RATE]

                             [-M MAX_MEMORY_USAGE] [--small-count]

                             [-T THREADS] [-b] [-s FORMAT] [-f] [-q]

                             output_countgraph_filename

                             input_sequence_filename

                             [input_sequence_filename ...]

 

Build a k-mer countgraph from the given sequences.

 

positional arguments:

  output_countgraph_filename

                        The name of the file to write the k-mer countgraph to.

  input_sequence_filename

                        The names of one or more FAST[AQ] input sequence

                        files.

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -k KSIZE, --ksize KSIZE

                        k-mer size to use (default: 32)

  -U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS

                        approximate number of unique kmers in the input set

                        (default: 0)

  --fp-rate FP_RATE     Override the automatic FP rate setting for the current

                        script (default: None)

  -M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE

                        maximum amount of memory to use for data structure

                        (default: None)

  --small-count         Reduce memory usage by using a smaller counter for

                        individual kmers. (default: False)

  -T THREADS, --threads THREADS

                        Number of simultaneous threads to execute (default: 1)

  -b, --no-bigcount     The default behaviour is to count past 255 using

                        bigcount. This flag turns bigcount off, limiting

                        counts to 255. (default: True)

  -s FORMAT, --summary-info FORMAT

                        What format should the machine readable run summary be

                        in? (`json` or `tsv`, disabled by default) (default:

                        None)

  -f, --force           Overwrite output file if it exists (default: False)

  -q, --quiet

 

Note: with `-b`/`--no-bigcount` the output will be the exact size of the k-mer

countgraph and this script will use a constant amount of memory. In exchange

k-mer counts will stop at 255. The memory usage of this script with `-b` will

be about 1.15x the product of the `-x` and `-N` numbers.

 

Example:

 

    load-into-counting.py -k 20 -x 5e7 out data/100k-filtered.fa

 

Multiple threads can be used to accelerate the process, if you have extra cores

to spare.

 

Example:

 

    load-into-counting.py -k 20 -x 5e7 -T 4 out data/100k-filtered.fa


# split-paired-reads.py -h

 

|| This is the script split-paired-reads.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

||

|| If you use this script in a publication, please cite EACH of the following:

||

||   * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

||

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

 

usage: split-paired-reads.py [--version] [--info] [-h] [-d output_directory]

                             [-0 output_orphaned] [-1 output_first]

                             [-2 output_second] [-f] [--gzip | --bzip]

                             [infile]

 

Split interleaved reads into two files, left and right.

 

positional arguments:

  infile

 

optional arguments:

  --version             show program's version number and exit

  --info                print citation information

  -h, --help            show this help message and exit

  -d output_directory, --output-dir output_directory

                        Output split reads to specified directory. Creates

                        directory if necessary (default: )

  -0 output_orphaned, --output-orphaned output_orphaned

                        Allow "orphaned" reads and extract them to this file

                        (default: None)

  -1 output_first, --output-first output_first

                        Output "left" reads to this file (default: None)

  -2 output_second, --output-second output_second

                        Output "right" reads to this file (default: None)

  -f, --force           Overwrite output file if it exists (default: False)

  --gzip                Compress output using gzip (default: False)

  --bzip                Compress output using bzip2 (default: False)

 

Some programs want paired-end read input in the One True Format, which is

interleaved; other programs want input in the Insanely Bad Format, with left-

and right- reads separated. This reformats the former to the latter.

 

The directory into which the left- and right- reads are output may be specified

using `-d`/`--output-dir`. This directory will be created if it does not

already exist.

 

Alternatively, you can specify the filenames directly with `-1`/`--output-

first` and `-2`/`--output-second`, which will override the `-d`/`--output-dir`

setting on a file-specific basis.

 

`-0`/'--output-orphans` will allow broken-paired format, and orphaned reads

will be saved separately, to the specified file.

 

Example:

 

    split-paired-reads.py tests/test-data/paired.fq

 

Example:

 

    split-paired-reads.py -0 reads-output-file tests/test-data/paired.fq

 

Example:

 

    split-paired-reads.py -1 reads.1 -2 reads.2 tests/test-data/paired.fq

 

Run

Transcriptome data

 Perform 1-pass normalization.

First, merge the reads into an interleaved paired.fq.

seqtk mergepe R1.fq R2.fq > paired.fq

 

 Normalize the reads based on the k-mer coverage distribution.

normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq
  • -C <CUTOFF>    when the median k-mer coverage level is above this number the read is not kept. (default: 20)
  • -k <KSIZE>    k-mer size to use (default: 32)

On the values of C and k (same link as above).
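normalize-by-median.py writes the surviving reads to paired.fq.keep, still interleaved and with some pairs broken. If the downstream assembler expects separate R1/R2 files, the output can be de-interleaved with split-paired-reads.py (its help is reproduced above); the file names here are hypothetical:

# split the interleaved survivors back into left/right reads plus orphans
split-paired-reads.py -0 orphan.fq -1 norm_R1.fq -2 norm_R2.fq paired.fq.keep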

 

 

Genome data

Perform 3-pass normalization.

First, merge the reads into an interleaved paired.fq (seqtk (link)).

seqtk mergepe R1.fq R2.fq > paired.fq

Build the k-mer countgraph (load-into-counting.py).

load-into-counting.py -k 20 -x 5e8 -T 24 graph paired.fq 
  • -k k-mer size to use (default: 32)
  • -T Number of simultaneous threads to execute
  • -x upper bound on tablesize to use; overrides --max-memory-usage/-M.

 

1-pass: normalize the reads based on the coverage distribution.

normalize-by-median.py ; Discard sequences based on whether or not their median k-mer abundance lies above a specified cutoff.

normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq
  • -k k-mer size to use (default: 32)
  • -C when the median k-mer coverage level is above this number the read is not kept. (default: 20)
  • -N number of tables to use in k-mer countgraph
  • -x upper bound on tablesize to use; overrides --max-memory-usage/-M.

paired.fq.keep is output. With the data tested here, reads dropped from 287,956 to 273,788 (about a 5% cut). The output is no longer properly paired; the pairs will be fixed up later.

 2-pass: now trim low-coverage k-mers from the reads.

filter-abund.py ; Trim sequences at a minimum k-mer abundance.

filter-abund.py -T 24 -C 2 graph paired.fq.keep
  • -C CUTOFF, --cutoff CUTOFF Trim at k-mers below this abundance. (default: 2)
  • -T Number of simultaneous threads to execute (default: 1)

paired.fq.keep.abundfilt is output. Roughly 60% of the reads were trimmed.

 3-pass: finally, normalize once more with -C 5.

normalize-by-median.py -C 5 -k 20 -N 4 -x 1e8 paired.fq.keep.abundfilt

paired.fq.keep.abundfilt.keep is output.

Finally, split the reads into pairs and singletons (orphans).

split-paired-reads.py -0 orphan -1 trimmed_R1.fq -2 trimmed_R2.fq paired.fq.keep.abundfilt.keep
  • -0 output_orphaned Allow "orphaned" reads and extract them to this file (default: None)
  • -1 Output "left" reads to this file (default: None)
  • -2 Output "right" reads to this file (default: None)

 R1 and R2 end up with the same number of lines. Orphaned (unpaired) reads accounted for nearly 20% of the total.
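Putting the whole 3-pass genome workflow together, the steps above can be strung into a single shell script. This is just a sketch that repeats the commands and parameters used in this post; adjust -k, the cutoffs, and the table sizes (-N/-x, or -M in newer khmer) to your data and memory budget:

#!/bin/bash
set -euo pipefail

# 0) interleave the pairs
seqtk mergepe R1.fq R2.fq > paired.fq

# build the k-mer countgraph used by filter-abund.py
load-into-counting.py -k 20 -x 5e8 -T 24 graph paired.fq

# 1-pass: digital normalization to a median coverage of 20  -> paired.fq.keep
normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq

# 2-pass: trim k-mers below abundance 2                      -> paired.fq.keep.abundfilt
filter-abund.py -T 24 -C 2 graph paired.fq.keep

# 3-pass: normalize again, more aggressively                 -> paired.fq.keep.abundfilt.keep
normalize-by-median.py -C 5 -k 20 -N 4 -x 1e8 paired.fq.keep.abundfilt

# split back into R1/R2 plus orphans
split-paired-reads.py -0 orphan -1 trimmed_R1.fq -2 trimmed_R2.fq paired.fq.keep.abundfilt.keep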

 

 

Compute k-mer frequencies with KMC (KMC, a fast k-mer counting tool).

 ↓

Plot a histogram of 20-mer frequencies. A somewhat error-rich fastq is used here.
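As a rough sketch of how such a histogram might be computed with KMC (based on my reading of the KMC documentation; check kmc and kmc_tools help for your version, and note the file names here are hypothetical):

mkdir -p tmp
# count 20-mers in the reads; -ci1 keeps singleton k-mers so the error peak stays visible
kmc -k20 -ci1 reads.fq reads_k20 tmp
# dump the frequency histogram to a text file for plotting
kmc_tools transform reads_k20 histogram k20_hist.txt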

Before diginorm

f:id:kazumaxneo:20180221164944j:plain

Low-frequency 20-mers (errors, low coverage, rare variants, contamination) are extremely abundant.

 ↓

After diginorm

f:id:kazumaxneo:20180221163327j:plain

The low-frequency 20-mers are gone.

 

 

It may be worth running error correction before diginorm. karect also ships with tools for evaluating recall and precision. Error correction tool: karect

Below is the histogram of the "before" fastq after error correction with karect.

f:id:kazumaxneo:20180221163254j:plain 

There are still many, but compared with "before" the number of low-frequency k-mers has dropped considerably. The idea is to repair what can be repaired before normalization, so that those reads are not thrown away.
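As a sketch, a karect run on the raw pairs before diginorm might look like the following (option names follow the karect README as I recall it; -celltype and -matchtype in particular are assumptions to be checked against your data and karect's own help):

# correct both mates; corrected files are written next to the inputs (assumed celltype/matchtype)
karect -correct -inputfile=R1.fq -inputfile=R2.fq -celltype=diploid -matchtype=hamming -threads=12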

 

 

Citation

A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data

C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom

arXiv:1203.4802 (2012)

 

Addendum

BBNorm is recommended.
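A sketch of the equivalent normalization with BBNorm (bbnorm.sh ships with BBTools; the target and min values mirror the example in the BBNorm documentation and are not tuned for any particular dataset):

# normalize paired reads to a target depth of 100, discarding reads below depth 5
bbnorm.sh in=R1.fq in2=R2.fq out=norm_R1.fq out2=norm_R2.fq target=100 min=5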