2019 5/14: added help output and parameter notes.
This method, published under the name "digital normalization", thins out reads based on k-mer abundance to reduce data size. It is intended to lighten samples that are too large to assemble. The trimming targets are reads with low or high k-mer coverage: reads whose k-mer coverage is far below the median are likely sequencing errors, while reads whose k-mer coverage is extremely high are likely repeats. Removing both shrinks the data.
However, a low-k-mer-coverage sequence may be a rare variant rather than an error, and a high-coverage sequence may be a set of extremely similar sequences in a polyploid genome rather than a simple repeat. There is a risk of discarding these together with the error k-mers. The paper's author discusses this point on his site ("Why you shouldn't use digital normalization"). If you are interested, use the method after understanding these caveats.
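The core rule can be sketched in a few lines of Python: keep a read only while the median abundance of its k-mers, counted over the reads kept so far, stays below a cutoff. This is a toy illustration of the principle, not khmer's exact streaming implementation:

```python
from collections import defaultdict
from statistics import median

def digital_normalization(reads, k=20, cutoff=20):
    """Toy diginorm: keep a read only while the median abundance of its
    k-mers, counted over the reads kept so far, is below the cutoff."""
    counts = defaultdict(int)
    kept = []
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not kmers:
            continue
        if median(counts[km] for km in kmers) < cutoff:
            kept.append(read)
            for km in kmers:
                counts[km] += 1
    return kept

# Redundant copies of the same read stop being kept once its k-mers
# reach the cutoff: 30 identical reads collapse to 20.
reads = ["ACGTACGTACGTACGTACGTACGT"] * 30
print(len(digital_normalization(reads)))  # 20
```

A rare read whose k-mers have never been seen always has median abundance 0, so it is kept; this is why rare variants survive unless they share k-mers with abundant sequence.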
The method can reportedly also be applied to de novo transcriptome data, and it has a track record of use in published de novo transcriptome studies.
diginorm official site
http://ged.msu.edu/papers/2012-diginorm/
The analysis uses khmer.
http://khmer.readthedocs.io/en/v2.0/user/scripts.html#scripts-diginorm
A tutorial in basic digital normalization — ANGUS 2.0 documentation
Installing the required tools
khmer and screed are required.
git clone git://github.com/ged-lab/screed.git
git clone git://github.com/ged-lab/khmer.git
cd screed
python setup.py install
cd ../khmer
make test
export PYTHONPATH=/root/khmer/python
khmer and screed can also be installed with pip.
On choosing the best k-mer size
> normalize-by-median.py -h
|| This is the script normalize-by-median.py in khmer.
|| You are running khmer version 3.0.0a3
|| You are also using screed version 1.0
||
|| If you use this script in a publication, please cite EACH of the following:
||
|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1
|| * CT Brown et al., arXiv:1203.4802 [q-bio.GN]
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.
usage: normalize-by-median.py [--version] [--info] [-h] [-k KSIZE]
[-U UNIQUE_KMERS] [--fp-rate FP_RATE]
[-M MAX_MEMORY_USAGE] [--small-count] [-q]
[-C CUTOFF] [-p] [--force_single]
[-u unpaired_reads_filename] [-s filename]
[-R report_filename]
[--report-frequency report_frequency] [-f]
[-o filename] [-l filename] [--gzip | --bzip]
input_sequence_filename
[input_sequence_filename ...]
Do digital normalization (remove mostly redundant sequences)
positional arguments:
input_sequence_filename
Input FAST[AQ] sequence filename.
optional arguments:
--version show program's version number and exit
--info print citation information
-h, --help show this help message and exit
-k KSIZE, --ksize KSIZE
k-mer size to use (default: 32)
-U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS
approximate number of unique kmers in the input set
(default: 0)
--fp-rate FP_RATE Override the automatic FP rate setting for the current
script (default: None)
-M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE
maximum amount of memory to use for data structure
(default: None)
--small-count Reduce memory usage by using a smaller counter for
individual kmers. (default: False)
-q, --quiet
-C CUTOFF, --cutoff CUTOFF
when the median k-mer coverage level is above this
number the read is not kept. (default: 20)
-p, --paired require that all sequences be properly paired
(default: False)
--force_single treat all sequences as single-ended/unpaired (default:
False)
-u unpaired_reads_filename, --unpaired-reads unpaired_reads_filename
include a file of unpaired reads to which -p/--paired
does not apply. (default: None)
-s filename, --savegraph filename
save the k-mer countgraph to disk after all reads are
loaded. (default: None)
-R report_filename, --report report_filename
write progress report to report_filename (default:
None)
--report-frequency report_frequency
report progress every report_frequency reads (default:
100000)
-f, --force continue past file reading errors (default: False)
-o filename, --output filename
only output a single file with the specified filename;
use a single dash "-" to specify that output should go
to STDOUT (the terminal) (default: None)
-l filename, --loadgraph filename
load a precomputed k-mer graph from disk (default:
None)
--gzip Compress output using gzip (default: False)
--bzip Compress output using bzip2 (default: False)
Discard sequences based on whether or not their median k-mer abundance lies
above a specified cutoff. Kept sequences will be placed in <fileN>.keep.
By default, paired end reads will be considered together; if either read should
be kept, both will be kept. (This keeps both reads from a fragment, and helps
with retention of repeats.) Unpaired reads are treated individually.
If `-p`/`--paired` is set, then proper pairing is required and the script will
exit on unpaired reads, although `--unpaired-reads` can be used to supply a
file of orphan reads to be read after the paired reads.
`--force_single` will ignore all pairing information and treat reads
individually.
With `-s`/`--savegraph`, the k-mer countgraph will be saved to the specified
file after all sequences have been processed. `-l`/`--loadgraph` will load the
specified k-mer countgraph before processing the specified files. Note that
these graphs are are in the same format as those produced by `load-into-
counting.py` and consumed by `abundance-dist.py`.
To append reads to an output file (rather than overwriting it), send output to
STDOUT with `--output -` and use UNIX file redirection syntax (`>>`) to append
to the file.
Example:
normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa
Example:
normalize-by-median.py -p -k 17 \
tests/test-data/test-abund-read-paired.fa
Example:
normalize-by-median.py -p -k 17 -o - tests/test-data/paired.fq \
>> appended-output.fq
Example:
normalize-by-median.py -k 17 -f tests/test-data/test-error-reads.fq \
tests/test-data/test-fastq-reads.fq
Example:
normalize-by-median.py -k 17 -s test.ct \
tests/test-data/test-abund-read-2.fa \
tests/test-data/test-fastq-reads.fq
> filter-abund.py -h
|| This is the script filter-abund.py in khmer.
|| You are running khmer version 3.0.0a3
|| You are also using screed version 1.0
||
|| If you use this script in a publication, please cite EACH of the following:
||
|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1
|| * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.
usage: filter-abund.py [--version] [--info] [-h] [-T THREADS] [-C CUTOFF] [-V]
[-Z NORMALIZE_TO] [-o optional_output_filename] [-f]
[-q] [--gzip | --bzip]
input_count_graph_filename input_sequence_filename
[input_sequence_filename ...]
Trim sequences at a minimum k-mer abundance.
positional arguments:
input_count_graph_filename
The input k-mer countgraph filename
input_sequence_filename
Input FAST[AQ] sequence filename
optional arguments:
--version show program's version number and exit
--info print citation information
-h, --help show this help message and exit
-T THREADS, --threads THREADS
Number of simultaneous threads to execute (default: 1)
-C CUTOFF, --cutoff CUTOFF
Trim at k-mers below this abundance. (default: 2)
-V, --variable-coverage
Only trim low-abundance k-mers from sequences that
have high coverage. (default: False)
-Z NORMALIZE_TO, --normalize-to NORMALIZE_TO
Base the variable-coverage cutoff on this median k-mer
abundance. (default: 20)
-o optional_output_filename, --output optional_output_filename
Output the trimmed sequences into a single file with
the given filename instead of creating a new file for
each input file. (default: None)
-f, --force Overwrite output file if it exists (default: False)
-q, --quiet
--gzip Compress output using gzip (default: False)
--bzip Compress output using bzip2 (default: False)
Trimmed sequences will be placed in "${input_sequence_filename}.abundfilt" for
each input sequence file. If the input sequences are from RNAseq or metagenome
sequencing then `--variable-coverage` should be used.
Example:
load-into-counting.py -k 20 -x 5e7 countgraph data/100k-filtered.fa
filter-abund.py -C 2 countgraph data/100k-filtered.fa
> load-into-counting.py -h
|| This is the script load-into-counting.py in khmer.
|| You are running khmer version 3.0.0a3
|| You are also using screed version 1.0
||
|| If you use this script in a publication, please cite EACH of the following:
||
|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1
|| * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271
|| * A. Döring et al. https://doi.org:80/10.1186/1471-2105-9-11
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.
usage: load-into-counting.py [--version] [--info] [-h] [-k KSIZE]
[-U UNIQUE_KMERS] [--fp-rate FP_RATE]
[-M MAX_MEMORY_USAGE] [--small-count]
[-T THREADS] [-b] [-s FORMAT] [-f] [-q]
output_countgraph_filename
input_sequence_filename
[input_sequence_filename ...]
Build a k-mer countgraph from the given sequences.
positional arguments:
output_countgraph_filename
The name of the file to write the k-mer countgraph to.
input_sequence_filename
The names of one or more FAST[AQ] input sequence
files.
optional arguments:
--version show program's version number and exit
--info print citation information
-h, --help show this help message and exit
-k KSIZE, --ksize KSIZE
k-mer size to use (default: 32)
-U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS
approximate number of unique kmers in the input set
(default: 0)
--fp-rate FP_RATE Override the automatic FP rate setting for the current
script (default: None)
-M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE
maximum amount of memory to use for data structure
(default: None)
--small-count Reduce memory usage by using a smaller counter for
individual kmers. (default: False)
-T THREADS, --threads THREADS
Number of simultaneous threads to execute (default: 1)
-b, --no-bigcount The default behaviour is to count past 255 using
bigcount. This flag turns bigcount off, limiting
counts to 255. (default: True)
-s FORMAT, --summary-info FORMAT
What format should the machine readable run summary be
in? (`json` or `tsv`, disabled by default) (default:
None)
-f, --force Overwrite output file if it exists (default: False)
-q, --quiet
Note: with `-b`/`--no-bigcount` the output will be the exact size of the k-mer
countgraph and this script will use a constant amount of memory. In exchange
k-mer counts will stop at 255. The memory usage of this script with `-b` will
be about 1.15x the product of the `-x` and `-N` numbers.
Example:
load-into-counting.py -k 20 -x 5e7 out data/100k-filtered.fa
Multiple threads can be used to accelerate the process, if you have extra cores
to spare.
Example:
load-into-counting.py -k 20 -x 5e7 -T 4 out data/100k-filtered.fa
> split-paired-reads.py -h
|| This is the script split-paired-reads.py in khmer.
|| You are running khmer version 3.0.0a3
|| You are also using screed version 1.0
||
|| If you use this script in a publication, please cite EACH of the following:
||
|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1
||
|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.
usage: split-paired-reads.py [--version] [--info] [-h] [-d output_directory]
[-0 output_orphaned] [-1 output_first]
[-2 output_second] [-f] [--gzip | --bzip]
[infile]
Split interleaved reads into two files, left and right.
positional arguments:
infile
optional arguments:
--version show program's version number and exit
--info print citation information
-h, --help show this help message and exit
-d output_directory, --output-dir output_directory
Output split reads to specified directory. Creates
directory if necessary (default: )
-0 output_orphaned, --output-orphaned output_orphaned
Allow "orphaned" reads and extract them to this file
(default: None)
-1 output_first, --output-first output_first
Output "left" reads to this file (default: None)
-2 output_second, --output-second output_second
Output "right" reads to this file (default: None)
-f, --force Overwrite output file if it exists (default: False)
--gzip Compress output using gzip (default: False)
--bzip Compress output using bzip2 (default: False)
Some programs want paired-end read input in the One True Format, which is
interleaved; other programs want input in the Insanely Bad Format, with left-
and right- reads separated. This reformats the former to the latter.
The directory into which the left- and right- reads are output may be specified
using `-d`/`--output-dir`. This directory will be created if it does not
already exist.
Alternatively, you can specify the filenames directly with `-1`/`--output-
first` and `-2`/`--output-second`, which will override the `-d`/`--output-dir`
setting on a file-specific basis.
`-0`/'--output-orphans` will allow broken-paired format, and orphaned reads
will be saved separately, to the specified file.
Example:
split-paired-reads.py tests/test-data/paired.fq
Example:
split-paired-reads.py -0 reads-output-file tests/test-data/paired.fq
Example:
split-paired-reads.py -1 reads.1 -2 reads.2 tests/test-data/paired.fq
Run
Transcriptome data
Perform 1-pass normalization.
First, merge the reads into an interleaved paired.fq.
seqtk mergepe R1.fq R2.fq > paired.fq
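What `seqtk mergepe` does here is interleave the two mate files record by record, which is the paired format khmer's scripts expect. A minimal sketch of the interleaving itself (assumes 4-line FASTQ records with mates in matching order):

```python
def interleave_fastq(r1_lines, r2_lines):
    """Interleave two FASTQ files record by record (4 lines per record),
    as seqtk mergepe does for mate files in matching order."""
    out = []
    for i in range(0, len(r1_lines), 4):
        out.extend(r1_lines[i:i + 4])  # R1 record
        out.extend(r2_lines[i:i + 4])  # its R2 mate, immediately after
    return out

r1 = ["@read1/1", "ACGT", "+", "IIII"]
r2 = ["@read1/2", "TGCA", "+", "IIII"]
print(interleave_fastq(r1, r2))
```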
Trim reads based on the k-mer abundance distribution.
normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq
- -C <CUTOFF> when the median k-mer coverage level is above this number the read is not kept. (default: 20)
- -k <KSIZE> k-mer size to use (default: 32)
On choosing the C and k values, see the tutorial linked above.
Genome data
Perform 3-pass normalization.
First, merge the reads into an interleaved paired.fq with seqtk.
seqtk mergepe R1.fq R2.fq > paired.fq
Build a k-mer countgraph (load-into-counting.py).
load-into-counting.py -k 20 -x 5e8 -T 24 graph paired.fq
- -k k-mer size to use (default: 32)
- -T Number of simultaneous threads to execute
- -x upper bound on tablesize to use; overrides --max-memory-usage/-M.
1-pass: trim reads based on the abundance distribution.
normalize-by-median.py: discards sequences based on whether their median k-mer abundance lies above a specified cutoff.
normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq
- -k k-mer size to use (default: 32)
- -C when the median k-mer coverage level is above this number the read is not kept. (default: 20)
- -N number of tables to use in k-mer countgraph
- -x upper bound on tablesize to use; overrides --max-memory-usage/-M.
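Per the note in the load-into-counting.py help above, memory use is roughly 1.15 × the product of the table size (-x) and the number of tables (-N). For the -N 4 -x 5e8 used in these commands:

```python
# Rough memory estimate for khmer's count tables, per the note in the
# load-into-counting.py help above: ~1.15 x (-x) x (-N) bytes.
x = 5e8   # -x, upper bound on table size
N = 4     # -N, number of tables
mem_bytes = 1.15 * x * N
print(f"{mem_bytes / 1e9:.1f} GB")  # 2.3 GB
```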
paired.fq.keep is output. With the data I tried, reads dropped from 287,956 to 273,788 (about a 5% cut). The output reads are no longer evenly paired; they will be re-paired later.
2-pass: next, trim the low-coverage stretches of the reads.
filter-abund.py: trims sequences at a minimum k-mer abundance.
filter-abund.py -T 24 -C 2 graph paired.fq.keep
- -C CUTOFF, --cutoff CUTOFF Trim at k-mers below this abundance. (default: 2)
- -T Number of simultaneous threads to execute (default: 1)
paired.fq.keep.abundfilt is output. Roughly 60% of the reads were trimmed.
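filter-abund truncates a read at its first k-mer whose abundance in the countgraph falls below the cutoff. A toy sketch of that trimming rule (the dict-based counts and the function name are illustrative, not khmer's API):

```python
def trim_low_abundance(read, counts, k=20, cutoff=2):
    """Toy version of the filter-abund rule: truncate a read just before
    its first k-mer whose abundance falls below the cutoff."""
    for i in range(len(read) - k + 1):
        if counts.get(read[i:i + k], 0) < cutoff:
            return read[:i + k - 1]  # keep up to the last solid base
    return read

# Only the leading 20-mer is abundant; the read is cut where coverage drops.
counts = {"ACGTACGTACGTACGTACGT": 5}
print(trim_low_abundance("ACGTACGTACGTACGTACGTTT", counts))
```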
3-pass: finally, trim once more with -C 5.
normalize-by-median.py -C 5 -k 20 -N 4 -x 1e8 paired.fq.keep.abundfilt
paired.fq.keep.abundfilt.keep is output.
Finally, split the reads into paired reads and singleton (orphan) reads.
split-paired-reads.py -0 orphan -1 trimmed_R1.fq -2 trimmed_R2.fq paired.fq.keep.abundfilt.keep
- -0 output_orphaned Allow "orphaned" reads and extract them to this file (default: None)
- -1 Output "left" reads to this file (default: None)
- -2 Output "right" reads to this file (default: None)
R1 and R2 end up with the same number of lines. Unpaired orphan reads accounted for nearly 20% of the total.
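A minimal sketch of the pair/orphan split on an interleaved stream; pairing here is decided by the read-name stem before /1 and /2 (khmer's real name handling is more general):

```python
def split_pairs(records):
    """Split interleaved (name, seq) records into R1, R2, and orphans,
    pairing consecutive records whose names share the same stem."""
    r1, r2, orphans = [], [], []
    i = 0
    while i < len(records):
        if (i + 1 < len(records)
                and records[i][0].split('/')[0] == records[i + 1][0].split('/')[0]):
            r1.append(records[i])
            r2.append(records[i + 1])
            i += 2
        else:
            orphans.append(records[i])  # mate was removed upstream
            i += 1
    return r1, r2, orphans

recs = [("a/1", "ACGT"), ("a/2", "TGCA"), ("b/1", "GGGG"),
        ("c/1", "AAAA"), ("c/2", "TTTT")]
r1, r2, orphans = split_pairs(recs)
print(len(r1), len(r2), len(orphans))  # 2 2 1
```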
Compute k-mer frequencies with KMC, a fast k-mer counting tool.
Histograms of 20-mer frequencies, drawn from a somewhat error-rich fastq:
Before diginorm: low-frequency 20-mers (errors, low coverage, rare variants, contamination) are very abundant.
After diginorm: the low-frequency 20-mers have disappeared.
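Such a histogram plots, for each abundance value, how many distinct k-mers occur at that abundance. A toy version of the computation (KMC does the same thing far faster on real data):

```python
from collections import Counter

def kmer_histogram(reads, k=20):
    """Histogram of k-mer abundances: abundance -> number of distinct k-mers.
    Error k-mers pile up at abundance 1, as in the 'before diginorm' plot."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return Counter(counts.values())

# Three copies of a read plus one copy with a terminal error:
reads = ["ACGTACGTACGTACGTACGTA"] * 3 + ["ACGTACGTACGTACGTACGTT"]
print(kmer_histogram(reads))
```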
It may also be worth running error correction before diginorm. karect ships with recall/precision evaluation tools as well. Error-correction tool: karect
Below is the histogram of the "before" fastq after error correction with karect.
Low-frequency k-mers are still present, but far fewer than before correction. The idea is to repair reads that can be repaired, rather than letting them be discarded.
Citation
C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom. A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data. arXiv:1203.4802 (2012).
Addendum
BBnorm is recommended.