2019 2/15 動画リンク追加
2020 4/19 インストール追記, help更新
2020 4/29 論文アブスト追記
RNA-seqリードからトランスクリプト の豊富さを定量化する軽量な手法であるSalmonを紹介する。Salmonは、新しい二相並列推論アルゴリズムと特徴量の多いバイアスモデルを組み合わせ、超高速なリードマッピングを実現している。この手法は、フラグメントGC含有量のバイアスを補正する初めてのトランスクリプトームワイドな定量手法であり、ここで実証したように、アバンダンス推定の精度とその後の差分発現解析の感度が大幅に向上している。
salmonは豊富なbiasモデルを取り込み、高速、高精度、堅牢なRNAseqの発現定量を行う方法論。 kallistoやeXpressと比べて、同じFDRで2倍以上精度が高い(DEG判定された遺伝子が2倍以上少ない=false positiveが少ない)というデータを出している。
Supplementary Figure 1より転載。
salomonは単独で「アラインメント」と「定量」の両方を行う、すなわち、indexがついたtranscriptsのリファレンスとFASTQを入力として受け取り、中間のアラインメントファイルを生成せずに直接定量を実行する。結果かなりの時間とスペースを節約することができる。1例をあげると、 6億のリード (75bp, paired-end)をわずか23分で定量できるとされる( 30 スレッド使用時)。
公式サイト
Overview – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data
quick start guide
Getting Started – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data
Document
http://salmon.readthedocs.io/en/latest/
Accurate, Fast, and Model-Aware Transcript Expression Quantification with Salmon
インストール
conda, brewで導入できる。
#bioconda (link)
mamba install -c bioconda -y salmon
#homebrew
brew install salmon
Githubのリリースにはビルド済みのバイナリもあります(リンク)。さらにdockerイメージも用意されています。
> salmon
$ salmon
salmon v1.2.0
Usage: salmon -h|--help or
salmon -v|--version or
salmon -c|--cite or
salmon [--no-version-check] <COMMAND> [-h | options]
Commands:
index Create a salmon index
quant Quantify a sample
alevin single cell analysis
swim Perform super-secret operation
quantmerge Merge multiple quantifications into a single file
> salmon quant --help-alignment
$ salmon quant --help-alignment
Version Info: This is the most recent version of salmon.
Quant
==========
Perform dual-phase, alignment-based estimation of
transcript abundance from RNA-seq reads
salmon quant options:
alignment input options:
--discardOrphans [alignment-based mode only] : Discard
orphan alignments in the input . If
this flag is passed, then only paired
alignments will be considered toward
quantification estimates. The default
behavior is to consider orphan
alignments if no valid paired mappings
exist.
-l [ --libType ] arg Format string describing the library
type
-a [ --alignments ] arg input alignment (BAM) file(s).
-e [ --eqclasses ] arg input salmon weighted equivalence class
file.
-t [ --targets ] arg FASTA format file containing target
transcripts.
basic options:
-v [ --version ] print version string
-h [ --help ] produce help message
-o [ --output ] arg Output quantification directory.
--seqBias Perform sequence-specific bias
correction.
--gcBias [beta for single-end reads] Perform
fragment GC bias correction.
--posBias Perform positional bias correction.
-p [ --threads ] arg (=8) The number of threads to use
concurrently.
--incompatPrior arg (=0) This option sets the prior probability
that an alignment that disagrees with
the specified library type (--libType)
results from the true fragment origin.
Setting this to 0 specifies that
alignments that disagree with the
library type should be "impossible",
while setting it to 1 says that
alignments that disagree with the
library type are no less likely than
those that do
-g [ --geneMap ] arg File containing a mapping of
transcripts to genes. If this file is
provided salmon will output both
quant.sf and quant.genes.sf files,
where the latter contains aggregated
gene-level abundance estimates. The
transcript to gene mapping should be
provided as either a GTF file, or a in
a simple tab-delimited format where
each line contains the name of a
transcript and the gene to which it
belongs separated by a tab. The
extension of the file is used to
determine how the file should be
parsed. Files ending in '.gtf', '.gff'
or '.gff3' are assumed to be in GTF
format; files with any other extension
are assumed to be in the simple format.
In GTF / GFF format, the
"transcript_id" is assumed to contain
the transcript identifier and the
"gene_id" is assumed to contain the
corresponding gene identifier.
--auxTargetFile arg A file containing a list of "auxiliary"
targets. These are valid targets
(i.e., not decoys) to which fragments
are allowed to map and be assigned, and
which will be quantified, but for which
auxiliary models like sequence-specific
and fragment-GC bias correction should
not be applied.
--meta If you're using Salmon on a metagenomic
dataset, consider setting this flag to
disable parts of the abundance
estimation model that make less sense
for metagenomic data.
alignment-specific options:
--noErrorModel Turn off the alignment error model,
which takes into account the the
observed frequency of different types
of mismatches / indels when computing
the likelihood of a given alignment.
Turning this off can speed up
alignment-based salmon, but can harm
quantification accuracy.
--numErrorBins arg (=6) The number of bins into which to divide
each read when learning and applying
the error model. For example, a value
of 10 would mean that effectively, a
separate error model is leared and
applied to each 10th of the read, while
a value of 3 would mean that a separate
error model is applied to the read
beginning (first third), middle (second
third) and end (final third).
-s [ --sampleOut ] Write a "postSample.bam" file in the
output directory that will sample the
input alignments according to the
estimated transcript abundances. If
you're going to perform downstream
analysis of the alignments with tools
which don't, themselves, take fragment
assignment ambiguity into account, you
should use this output.
-u [ --sampleUnaligned ] In addition to sampling the aligned
reads, also write the un-aligned reads
to "postSample.bam".
--gencode This flag will expect the input
transcript fasta to be in GENCODE
format, and will split the transcript
name at the first '|' character. These
reduced names will be used in the
output and when looking for these
transcripts in a gene to transcript
GTF.
--scoreExp arg (=1) The factor by which sub-optimal
alignment scores are downweighted to
produce a probability. If the best
alignment score for the current read is
S, and the score for a particular
alignment is w, then the probability
will be computed porportional to exp( -
scoreExp * (S-w) ). NOTE: This flag
only has an effect if you are parsing
alignments produced by salmon itself
(i.e. pufferfish or RapMap in
selective-alignment mode).
--mappingCacheMemoryLimit arg (=2000000)
If the file contained fewer than this
many mapped reads, then just keep the
data in memory for subsequent rounds of
inference. Obviously, this value should
not be too large if you wish to keep a
low memory usage, but setting it large
enough to accommodate all of the mapped
read can substantially speed up
inference on "small" files that contain
only a few million reads.
advanced options:
--alternativeInitMode [Experimental]: Use an alternative
strategy (rather than simple
interpolation between) the online and
uniform abundance estimates to
initalize the EM / VBEM algorithm.
--auxDir arg (=aux_info) The sub-directory of the quantification
directory where auxiliary information
e.g. bootstraps, bias parameters, etc.
will be written.
--skipQuant Skip performing the actual transcript
quantification (including any Gibbs
sampling or bootstrapping).
--dumpEq Dump the simple equivalence class
counts that were computed during
mapping or alignment.
-d [ --dumpEqWeights ] Dump conditional probabilities
associated with transcripts when
equivalence class information is being
dumped to file. Note, this will dump
the factorization that is actually used
by salmon's offline phase for
inference. If you are using
range-factorized equivalence classes
(the default) then the same transcript
set may appear multiple times with
different associated conditional
probabilities.
--minAssignedFrags arg (=10) The minimum number of fragments that
must be assigned to the transcriptome
for quantification to proceed.
--reduceGCMemory If this option is selected, a more
memory efficient (but slightly slower)
representation is used to compute
fragment GC content. Enabling this will
reduce memory usage, but can also
reduce speed. However, the results
themselves will remain the same.
--biasSpeedSamp arg (=5) The value at which the fragment length
PMF is down-sampled when evaluating
sequence-specific & GC fragment bias.
Larger values speed up effective length
correction, but may decrease the
fidelity of bias modeling results.
--fldMax arg (=1000) The maximum fragment length to consider
when building the empirical
distribution
--fldMean arg (=250) The mean used in the fragment length
distribution prior
--fldSD arg (=25) The standard deviation used in the
fragment length distribution prior
-f [ --forgettingFactor ] arg (=0.65000000000000002)
The forgetting factor used in the
online learning schedule. A smaller
value results in quicker learning, but
higher variance and may be unstable. A
larger value results in slower learning
but may be more stable. Value should
be in the interval (0.5, 1.0].
--initUniform initialize the offline inference with
uniform parameters, rather than seeding
with online parameters.
--maxOccsPerHit arg (=1000) When collecting "hits" (MEMs), hits
having more than maxOccsPerHit
occurrences won't be considered.
-w [ --maxReadOcc ] arg (=200) Reads "mapping" to more than this many
places won't be considered.
--noLengthCorrection [experimental] : Entirely disables
length correction when estimating the
abundance of transcripts. This option
can be used with protocols where one
expects that fragments derive from
their underlying targets without regard
to that target's length (e.g. QuantSeq)
--noEffectiveLengthCorrection Disables effective length correction
when computing the probability that a
fragment was generated from a
transcript. If this flag is passed in,
the fragment length distribution is not
taken into account when computing this
probability.
--noSingleFragProb Disables the estimation of an
associated fragment length probability
for single-end reads or for orphaned
mappings in paired-end libraries. The
default behavior is to consider the
probability of all possible fragment
lengths associated with the retained
mapping. Enabling this flag (i.e.
turning this default behavior off) will
simply not attempt to estimate a
fragment length probability in such
cases.
--noFragLengthDist [experimental] : Don't consider
concordance with the learned fragment
length distribution when trying to
determine the probability that a
fragment has originated from a
specified location. Normally,
Fragments with unlikely lengths will be
assigned a smaller relative probability
than those with more likely lengths.
When this flag is passed in, the
observed fragment length has no effect
on that fragment's a priori
probability.
--noBiasLengthThreshold [experimental] : If this option is
enabled, then no (lower) threshold will
be set on how short bias correction can
make effective lengths. This can
increase the precision of bias
correction, but harm robustness. The
default correction applies a threshold.
--numBiasSamples arg (=2000000) Number of fragment mappings to use when
learning the sequence-specific bias
model.
--numAuxModelSamples arg (=5000000) The first <numAuxModelSamples> are used
to train the auxiliary model parameters
(e.g. fragment length distribution,
bias, etc.). After ther first
<numAuxModelSamples> observations the
auxiliary model parameters will be
assumed to have converged and will be
fixed.
--numPreAuxModelSamples arg (=5000) The first <numPreAuxModelSamples> will
have their assignment likelihoods and
contributions to the transcript
abundances computed without applying
any auxiliary models. The purpose of
ignoring the auxiliary models for the
first <numPreAuxModelSamples>
observations is to avoid applying these
models before thier parameters have
been learned sufficiently well.
--useEM Use the traditional EM algorithm for
optimization in the batch passes.
--useVBOpt Use the Variational Bayesian EM
[default]
--rangeFactorizationBins arg (=4) Factorizes the likelihood used in
quantification by adopting a new notion
of equivalence classes based on the
conditional probabilities with which
fragments are generated from different
transcripts. This is a more
fine-grained factorization than the
normal rich equivalence classes. The
default value (4) corresponds to the
default used in Zakeri et al. 2017
(doi: 10.1093/bioinformatics/btx262),
and larger values imply a more
fine-grained factorization. If range
factorization is enabled, a common
value to select for this parameter is
4. A value of 0 signifies the use of
basic rich equivalence classes.
--numGibbsSamples arg (=0) Number of Gibbs sampling rounds to
perform.
--noGammaDraw This switch will disable drawing
transcript fractions from a Gamma
distribution during Gibbs sampling. In
this case the sampler does not account
for shot-noise, but only assignment
ambiguity
--numBootstraps arg (=0) Number of bootstrap samples to
generate. Note: This is mutually
exclusive with Gibbs sampling.
--bootstrapReproject This switch will learn the parameter
distribution from the bootstrapped
counts for each sample, but will
reproject those parameters onto the
original equivalence class counts.
--thinningFactor arg (=16) Number of steps to discard for every
sample kept from the Gibbs chain. The
larger this number, the less chance
that subsequent samples are
auto-correlated, but the slower
sampling becomes.
-q [ --quiet ] Be quiet while doing quantification
(don't write informative output to the
console unless something goes wrong).
--perTranscriptPrior The prior (either the default or the
argument provided via --vbPrior) will
be interpreted as a transcript-level
prior (i.e. each transcript will be
given a prior read count of this value)
--perNucleotidePrior The prior (either the default or the
argument provided via --vbPrior) will
be interpreted as a nucleotide-level
prior (i.e. each nucleotide will be
given a prior read count of this value)
--sigDigits arg (=3) The number of significant digits to
write when outputting the
EffectiveLength and NumReads columns
--vbPrior arg (=0.01) The prior that will be used in the VBEM
algorithm. This is interpreted as a
per-transcript prior, unless the
--perNucleotidePrior flag is also
given. If the --perNucleotidePrior
flag is given, this is used as a
nucleotide-level prior. If the default
is used, it will be divided by 1000
before being used as a nucleotide-level
prior, i.e. the default per-nucleotide
prior will be 1e-5.
--writeOrphanLinks Write the transcripts that are linked
by orphaned reads.
--writeUnmappedNames Write the names of un-mapped reads to
the file unmapped_names.txt in the
auxiliary directory.
ラン
transcriptsのfastaにindexをつける。
salmon index -p 2 -t transcripts.fa.gz -i ref_index
- -p [ --threads ] arg (=2) Number of threads to use (only used for computing bias features)
- -t [ --transcripts ] arg Transcript fasta file.
- -k [ --kmerLen ] arg (=31) The size of k-mers that should be used for the quasi index.
- -i [ --index ] arg Salmon index.
定量。
salmon quant -i ref_index -l A -1 pair1.fastq.gz -2 pair2.fastq.gz -p 8 -o output/sample1
output/sample1/の中に複数のファイルができる。quant.sfが定量結果のファイルとなる。
> head quant.sf |column -t
$ head quant.sf |column -t
Name Length EffectiveLength TPM NumReads
NODE_1_length_23927_cov_361.684814_g0_i0 23927 23716.8 258.816 19183
NODE_2_length_23311_cov_54.219328_g1_i0 23311 23100.8 32.8009 2368
NODE_3_length_21707_cov_72.753024_g2_i0 21707 21496.8 52.4408 3523
NODE_4_length_21289_cov_42.848023_g3_i0 21289 21078.8 32.9264 2169
NODE_5_length_21214_cov_88.072195_g4_i0 21214 21003.8 69.6986 4575
NODE_6_length_21116_cov_110.811411_g5_i0 21116 20905.8 70.9743 4637
NODE_7_length_20592_cov_75.743027_g6_i0 20592 20381.8 20.7355 1320.77
NODE_8_length_20497_cov_76.082942_g6_i1 20497 20286.8 42.181 2674.23
NODE_9_length_20233_cov_124.431480_g7_i0 20233 20022.8 41.9024 2622
引用
Salmon provides fast and bias-aware quantification of transcript expression
Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry & Carl Kingsford
Nature Methods 14, 417–419 (2017)
RNA seq Blog
http://www.rna-seqblog.com/salmon-fast-and-bias-aware-quantification-of-transcript-expression/
Scallopとsalmonの組み合わせ
ゲノムからの定量(2022年3月現在は初期バージョン)