アライメントツールはエラーやバラツキを処理するように設計されているが、リファレンスとは大幅に異なるシーケンスリードを確実に正しい場所に割り当てることはできない。アラインメントの信頼性が低いと、バリアントコールの信頼性が低くなり、真のバリアントと偽のコールを区別することが困難になる。一部の領域ではリードがクリップされたり、まったくマッピングされないことすらある。その結果、アライメントのみに依存するバリアントコールアルゴリズムは、これらのイベントを特徴付けることができない。
バクテリアゲノムは小さく、比較的単純だが、そのバラツキのために最も難しいターゲットの1つである。水平遺伝子伝達(HGT)は、リファレンスと比較してサンプルを有意に変化させる1つのメカニズムであり、しばしば薬剤耐性または病原性の増加をもたらす(Ochman et al、2000)。バクテリアタイピングおよびサーベイランス法は、既知の対立遺伝子(Li et al、2016)または複数のリファレンス配列(Li et al、2012)のデータベース、全ゲノムアセンブリ(Thomsen et al、2016)を必要とすることが多い。これらのアプローチは、分析を実行したり、対立遺伝子データベースの維持に必要な時間を増加させる。理想的な方法は、単一のリファレンスで細菌の多様性を扱うことで、リファレンスを変更することなく統一的な文脈でゲノムの変更を行うことができるだろう。
一般的に使用されるいくつかの代替方法があるが、アライメントおよびバリアントコールプロトコルよりも能力が制限されているか、または計算リソースを消費している。Monoploidの生物ではde novoアセンブリの手法でバリアントをコールできるが、コンティグの状態ではカバレッジが"1"に減少するため、ミスコールを簡単に修正することはできなくなる(Olson et al。、2015)。典型的にアセンブリで使用されるde Bruijnグラフを構築するには、数ギガバイトのメモリまたはコンピューティングクラスタが必要になる可能性がある(Li et al、2010)。 Cortex(Iqbal et al、2012)は、de novoアセンブリアプローチの限界を克服するために設計されたが、依然として全ゲノムにわたって組み立てられたde Brujin graphに依存しています。 Platypus(Rimmer et al、2014)はローカルアセンブリを実行するが、バリアントの軌跡が先験的に分からない場合は、リードのマッピングとローカルアセンブリのオーバーヘッドが発生する。
1つの可能なアプローチは、k-merスペクトルのセット内に含まれる情報をシーケンスデータにわたって使用することである。 K-mersは数値で表すことができ、シーケンスアライメントに依存せず、k-merの計数方法は高速である。今日まで、このようなアプローチ(Gardner and Slezak、2010; Gardner et al。、2013)では、まばらな間隔のSNPが修正されているが、高密度SNPおよびlarge insertionは、リファレンス配列とは非常に異なる多くのk-merを作製するので、そのようなアプローチがどのように効率的に機能するかはすぐには分からない。
著者らは、サンプルシーケンスデータに対してk-mer頻度に厳密に依存するアルゴリズムを構築し、リファレンス配列中のk-merを順序付けしてバリアントをコールする。この方法は、まず、期待されるk-mersの頻度分布の中断を使用して基準を超えるバリエーションの領域を見つけ、その領域に隣接する変更されていないk-mersから開始することによって、k-mer空間による検索を大幅に簡素化する。このアプローチを使用して、Kestrelがどのようにしてゲノムの改変領域を再構築し、高密度変異の領域を解決できるかを示す。
インストール
mac os10.12でテストした。
依存
- Java 1.7 or later
https://github.com/paudano/kestrel
Github リリースにlinux向けバイナリとPDFマニュアルがある。バイナリはosxでも動作する。
https://github.com/paudano/kestrel/releases/tag/1.0.1
> ./kestrel -h
$ ./kestrel -h
kestrel -f <INPUT_FORMAT> -i <INTERVAL_FILE> -k <KSIZE> -m <OUT_FORMAT>
-o <OUT_FILE> -p <OUT_FILE> -r <REF_SEQUENCE> -s[<SAMPLE_NAME>]
-w <WEIGHT_VEC> --alpha=<ALPHA> --ambiregions --ambivar --anchorboth
--autoflank --byreference --byregion --charset=<CHARSET> --countrev
--decaymin=<MIN_PROPORTION> --diffq=<QUANTILE> --filespersample=<N_FILES>
--flank=<LENGTH> --free --hapfmt=<OUT_FORMAT> --lib=<LIB_FILE>
--liburl=<LIB_URL> --logfile=<LOG_FILE> --loglevel=<LOG_LEVEL> --logstderr
--logstdout --maxalignstates=<STATES> --maxhapstates=<STATES> --memcount
--mincount=<MIN_COUNT> --mindiff=<COUNT_DIFF> --minmask=<MIN_MASK>
--minsize=<MIN_SIZE> --noambigregions --noambivar --noanchorboth --nocountrev
--nofree --nomemcount --norevregion --normikc --normrefdesc --noseqfilter
--peakscan=<LENGTH> --quality=<SEQ_FILTER> --revregion --rmikc --rmrefdesc
--scanlimitfactor=<FACTOR> --seqfilter=<SEQ_FILTER> --stdout --temploc=<TEMP>
--varfilter=<FILTER_SPEC> SEQ_FILE [SEQ_FILE...]
kestrel -h[<TOPIC>]
-f --format <INPUT_FORMAT> (Default = auto)
Set the input sequence format type. This option determines how the format
files are read. This option may be set multiple times when reading files
with different formats. See "count -hreader" for a full list of readers.
-h --help [<TOPIC>]
Get command-line help. If "TOPIC" is specified, then help on a specific
topic is shown. Try "--help=topics" (or "-htopics") for a list of topics
-i --interval <INTERVAL_FILE>
Reads a file of intervals defining the regions over the reference sequences
where variants should be detected. If no intervals are specified, variants
are detected over the full length of each reference sequence. The file type
is determined by the file name, such as "intervals.bed". BED files are
supported by Kestrel, and others may be added.
-k --ksize <KSIZE> (Default = 31)
Size of k-mers sequence data is translated to during analysis. If unsure,
use the default value. If the sequencing error rate is very high, or if the
reference is very short, a small (e.g. a single short gene), then a smaller
k-mer size, such as 21, may be useful if the defalt value does not produce
meaningful results.
-m --outfmt <OUT_FORMAT> (Default = vcf)
Set output format.
-o --out <OUT_FILE>
Set output file name.
-p --hapout <OUT_FILE>
Set haplotype output file name.
-r --ref <REF_SEQUENCE>
Add reference sequences variants will be called against. This can be any
file that Kestrel can read. The format and character-set options apply to
reference sequences, but not filters.
-s --sample [<SAMPLE_NAME>]
Set the name of the sample that the next sample files are assigned to. If
the argument (SAMPLE_NAME) is given, the name of the sample is set to this
name. If the argument is not given, then the sample name is assigned from
the name of the first file after this option. Any files on the command-line
appearing before this option are assigned to a sample and will not be part
of this sample. If --filespersample was used on the command-line before this
option, it is reset and files are no longer automatically grouped.
-w --weight <WEIGHT_VEC> (Default = (10.0, -10.0, -40.0, -4.0, 0.0))
Set the alignment weights as a comma-separated list of values. The order of
weights is match, mismatch, gap-open, gap-extend, and initial score. If
values are blank or the list has fewer than 5 elements, the missing values
are assigned their default weight. Each value is a floating-point number,
and it may be represented in exponential form (e.g. 1.0e2) or as an integer
in hexadecimal or octal format. Optionally, the list may be surrounded by
parenthesis or braces (angle, square, or curly).
--alpha <ALPHA> (Default = 0.80)
Set the exponential decay alpha, which controls how quickly the recovery
threshold declines to its minimum value (see --decaymin) in an active region
search. Alpha is defined as the rate of decay for every k bases. At k bases
from the left anchor, the threshold will have declined to alpha * range. At
every k bases, the threshold will continue to decline at this rate.
--ambiregions (Default)
Allow active regions to cover ambiguous bases, such as N.
--ambivar (Default)
Allow variants over ambiguous bases, such as N.
--anchorboth (Default)
Both ends of an active region (region with variants) must be bordered by
unaltered k-mers or variants will not be called in it. This option may miss
variants near the ends of a reference sequence, but it forces stronger
evidence for the variants that are called.
--autoflank (Default)
When extracting intervals from reference sequences, some bases are extracted
on both sides of the interval whenever possible. This gives Kestrel more
bases for active region detection, but it does not otherwise affect variant
calls. This option tells Kestrel to pick the flank by multiplying 3.50 with
the k-mer size.
--byreference (Default)
If variant call regions were defined, variant call locations are still
relative to the reference sequence and not the region.
--byregion
If variant call regions were defined, variant call locations are relative to
the region and not the reference sequence.
--charset <CHARSET> (Default = UTF-8)
Character set encoding of input files. This option specifies the character
set of all files following it. The default, "UTF-8", properly handles ASCII
files, which is a safe assumption for most files. Latin-1 files with values
greater than 127 will not be properly parsed.
--countrev (Default)
Count reverse complement k-mers in region statistics. This should be set for
most sequencing protocols.
--decaymin <MIN_PROPORTION> (Default = 0.55)
Set the minimum value (asymptotic lower bound) of the exponential decay
function used in active region detection as a proportion of the anchor k-mer
count. If this value is 0.0, k-mer count recovery threshold may decline to
1. If this value is 1.0, the decay function is not used and the detector
falls back to finding a k-mer with a count within the difference threshold
of the anchor k-mer count.
--diffq <QUANTILE> (Default = 0.9000)
If set to a value greater than 0.0, then the k-mer count difference between
two k-mers that triggers a correction attempt is found dynamically. The
difference in k-mer counts between each pair of neighboring k-mers over an
uncorrected reference region is found, and this quantile of is computed over
those differences. For example, a value of 0.85 means that at most 15% (100%
- 85%) of the k-mer count differences will be high enough. If this computed
value is less than the minimum k-mer count difference (--mindiff), then that
minimum is the difference threshold. This value may not be 1.0 or greater,
and it may not be negative. If 0.0, the minimum count difference is always
the minimum threshold (--mindiff).
--filespersample <N_FILES> (Default = 0)
Set the number of input files per sample. For example, reading paired-end
FASTQ files (2 files per sample) can be simplified by setting this value to
2. Alternatively, samples can be separated by multiple -s (--sample)
arguments. The default value, 0, will not automatically group input files.
If -s (--sample) is read on the command-line, this value is set back to 0.
Any sequence files found on the command-line before this option are assigned
to a group.
--flank <LENGTH> (Default = k * 3.50)
When extracting intervals from reference sequences, this many bases are
extracted on both sides of the interval whenever possible. This gives
Kestrel more bases for active region detection, but it does not otherwise
affect variant calls. Set to 0 to disable flanks. If this option is not set,
Kestrel will determine the appropriate length of flank by multiplying 3.50
with the k-mer size.
--free
Free resources between processing samples. This may reduce the memory
footprint of Kestrel, but it may force expensive resources to be recreated
and impact performance.
--hapfmt <OUT_FORMAT> (Default = sam)
Set haplotype output format. Ignored if a haplotype output file is not set.
--lib <LIB_FILE>
Load a library file. Kestrel can accept external components, and they must
be packaged on a JAR file.
--liburl <LIB_URL>
Load a library by its URL. Kestrel can accept external components, and they
must be packaged on a JAR file. This option can access JAR files on the
local system or stored anywhere the program can access and that can be
represented as a URL.
--logfile <LOG_FILE> (Default = <STDERR>)
Set log file name.
--loglevel <LOG_LEVEL> (Default = WARN)
Set the log level. Valid levels are ALL, TRACE, DEBUG, INFO, WARN, ERROR,
and OFF.
--logstderr
Write log messages to standard error instead of a file. Unless redirected,
this output is written to the the screen.
--logstdout
Write log messages to standard output instead of a file. Unless redirected,
this output is written to the the screen.
--maxalignstates <STATES> (Default = 15)
Set the maximum number of alignment states. When haplotype assembly branches
into more than one possible sequence, the state of one is saved while
another is built. When the maximum number of saved states reaches this
value, the least likely one is discarded.
--maxhapstates <STATES> (Default = 15)
Set the maximum number of haplotypes for an active region. Alignments can
generate more than one haplotype, and with noisy sequence data or
paralogues, many haplotypes may be found. This options limits the amount of
memory that can be consumed in these cases.
--memcount
K-mer counts from each sample will be stored in memory. This option assumes
that samples are relatively small or the machine has enough memory to handle
the counts. Note that the JVM might need to be run with additional memory
(-Xmx option) to support this option.
--mincount <MIN_COUNT> (Default = 5)
Set the minimum k-mer count for processing samples. K-mers with a count less
than this value will be discarded. Sequence read errors produce many
erroneous k-mers, and this slows the process of variant calling
significantly.
--mindiff <COUNT_DIFF> (Default = 5)
Set the minimum k-mer count difference for identifying active regions. When
the count between neighboring k-mer counts is this or greater, Kestrel will
treat it as a region where a variant may occur.
--minmask <MIN_MASK> (Default = 0)
Size of k-mer minimizers or 0 to disable processing by minimizers. The
minimizer of a k-mer is determined by taking all sub-k-mers of a given size
(set by this option) from a k-mer and its reverse complement and choosing
the lesser of the sub-k-mers. Sub-k-mers are XORed with this mask while
comparing them, but the minimizer is not XORed (it is still a sub-k-mer of
the original k-mer). This option can be used to break up large minimizer
groups due to low-complexity k-mers when minimizers are used.
--minsize <MIN_SIZE> (Default = 15)
Minimizers group k-mers in the indexed k-mer count (IKC) file generated by
Kestrel when reading sequences, and this parameter controls the size of the
minimizer.
--noambigregions
An active region may not span any base that is not A, C, G, T, or U.
--noambivar
A variant may not span any base that is not A, C, G, T, or U.
--noanchorboth
An active region (region with variants) must be bordered on at least one
side by an unaltered k-mers, but it may extend to the end of the sequence.
This will allow Kestrel to find variants less than a k-mer from the ends,
but the evidence supporting these variants is weaker.
--nocountrev
Do not include the reverse complement of k-mers in read depth estimates. If
all sequence reads are in the same orientation as the reference, then this
option should be used.
--nofree (Default)
Retain resources between samples. This may use more memory, but it will
avoid re-creating expensive resources between samples.
--nomemcount (Default)
K-mer counts for each sample are offloaded to an indexed k-mer count file.
This option reduces the memory demand of Kestrel.
--norevregion
When set, regions variants are called on are always in the same orientation
as the reference sequence. The stranded-ness of defined intervals is
ignored.
--normikc
Do not remove the indexed k-mer count (IKC) file for each sample.
--normrefdesc
Use the full sequence name as it appears in the reference sequence file.
FASTA files often include a description after the sequence name, and with
this option, it becomes part of the full sequence name. If using an interval
file, the full sequence name and description must match the sequence file.
--noseqfilter (Default)
Turn off sequence filtering for all files following this option. If
--seqfilter or --quality was specified, this option disables sequence
filtering. These options together make it possible to specify filtering for
some files and disable filtering for others.
--peakscan <LENGTH> (Default = 7)
Reference regions with sequence homology in other regions of the genome may
contain k-mers with artificially high frequencies from adding counts for
k-mers that appear in both regions. This causes a peak in the k-mer
frequencies over the reference, and it can trigger an erroneous
active-region scan for variants. When encountering a difference, Kestrel
will scan forward this number of k-mers looking for a peak in the k-mer
frequencies. If the frequencies drop back down to the original range, the
active-region scan is not performed. This keeps Kestrel from erroneously
searching large regions of the reference. Setting this value to 0 disables
peak detection.
--quality <SEQ_FILTER>
This option is an alias for "seqfilter"
--revregion
When set, reverse complement reference regions that occur on the negative
strand. Only itervals defined with on the negative strand are altered.
--rmikc (Default)
Remove the indexed k-mer count (IKC) for each sample after kestrel runs.
--rmrefdesc (Default)
When set, remove the description from reference sequence names. The
descirption is everything that occurs after the first whitespace character.
FASTA files often have a sequence name and a long description separated by
whitespace. This option ensures that the sequence name matches in the FASTA
and an interval file, if used.
--scanlimitfactor <FACTOR> (Default = 5.0)
Set a limit on how long an active region may be. This is computed by
multiplying the k-mer size by this factor and adding the maximum length of a
gap. The computed limit will be adjusted so that active regions are at least
large enough to capture a SNP in cases where the maximum gap length is 0.
Setting this to a low value or "min" will set the limit so that it is just
large enough to catch SNPs and deletions, but it will miss large deletions
if another variant is within the k-mer size window. Setting this to a high
value or "max" lifts the restrictions on active region lengths, and this may
cause the program to take an excessive amount of time and memory trying to
solve arbitrarily long active regions.
--seqfilter <SEQ_FILTER>
Filter sequences as they are read and before k-mers are extracted from them.
Some sequence readers can filter or alter reads at runtime. The most common
filter is a quality filter where low-quality bases are removed. The filter
specification is a filter name followed by a colon (:) and arguments to the
filter. If a filter name is not specified, then the "sanger" quality filter
is assumed. For example, "sanger:10" and "10" will filter k-mers with any
base quality score less than 10. The sequence filter specification is set
for all files appearing on the command-line after this option. To turn off
filtering once it has been set, files following --noseqfilter will have no
filter specification.
--stdout (Default)
Write output to standard output instead of a file. Unless redirected, this
output is written to the the screen.
--temploc <TEMP> (Default = Output location)
The location where segments are offloaded. This argument must be a directory
or the location for a new directory. Parent directories will be created as
needed.
--varfilter <FILTER_SPEC>
Add a variant filter specification. The argument should be the name of the
filter, a colon, and the filter arguments. The correct filter is loaded by
name and the filter arguments are passed to it.
SEQ_FILE
Input sequence file.
kestrel version: 1.0.1
ラン
リファレンスとfastqを指定して実行する。
kestrel -r ref.fa -k 31 -o variant.vcf input.fq
- -k Size of k-mers sequence data is translated to during analysis (Default = 31)
- -r Reference sequences variants will be called against.
ペアエンド情報は使わないので、ペアエンドシーケンスデータを使うならマージしてから使う。
引用
Mapping-free variant calling using haplotype reconstruction from k-mer frequencies
Audano PA, Ravishankar S, Vannberg FO
Bioinformatics. 2018 May 15;34(10):1659-1665.