2018-01-21

超高速にRNA seqのリードカウント（定量）を行う salmon

2019 2/15 動画リンク追加

2020 4/19 インストール追記, help更新

2020 4/29 論文アブスト追記

RNA-seqリードからトランスクリプトの豊富さを定量化する軽量な手法であるSalmonを紹介する。Salmonは、新しい二相並列推論アルゴリズムと特徴量の多いバイアスモデルを組み合わせ、超高速なリードマッピングを実現している。この手法は、フラグメントGC含有量のバイアスを補正する初めてのトランスクリプトームワイドな定量手法であり、ここで実証したように、アバンダンス推定の精度とその後の差分発現解析の感度が大幅に向上している。

salmonは豊富なbiasモデルを取り込み、高速、高精度、堅牢なRNAseqの発現定量を行う方法論。 kallistoやeXpressと比べて、同じFDRで2倍以上精度が高い（DEG判定された遺伝子が２倍以上少ない=false positiveが少ない）というデータを出している。

f:id:kazumaxneo:20180121202042j:plain

Supplementary Figure 1より転載。

salomonは単独で「アラインメント」と「定量」の両方を行う、すなわち、indexがついたtranscriptsのリファレンスとFASTQを入力として受け取り、中間のアラインメントファイルを生成せずに直接定量を実行する。結果かなりの時間とスペースを節約することができる。１例をあげると、 6億のリード (75bp, paired-end)をわずか23分で定量できるとされる（ 30 スレッド使用時）。

公式サイト

Overview – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data

quick start guide

Getting Started – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data

Document

http://salmon.readthedocs.io/en/latest/

Accurate, Fast, and Model-Aware Transcript Expression Quantification with Salmon

インストール

Github

conda, brewで導入できる。

#bioconda (link)
mamba install -c bioconda -y salmon

#homebrew
brew install salmon

Githubのリリースにはビルド済みのバイナリもあります（リンク）。さらにdockerイメージも用意されています。

> salmon

$ salmon

salmon v1.2.0

Usage: salmon -h|--help or

salmon -v|--version or

salmon -c|--cite or

salmon [--no-version-check] <COMMAND> [-h | options]

Commands:

index Create a salmon index

quant Quantify a sample

alevin single cell analysis

swim Perform super-secret operation

quantmerge Merge multiple quantifications into a single file

> salmon quant --help-alignment

$ salmon quant --help-alignment

Version Info: This is the most recent version of salmon.

Quant

==========

Perform dual-phase, alignment-based estimation of

transcript abundance from RNA-seq reads

salmon quant options:

alignment input options:

--discardOrphans [alignment-based mode only] : Discard

orphan alignments in the input . If

this flag is passed, then only paired

alignments will be considered toward

quantification estimates. The default

behavior is to consider orphan

alignments if no valid paired mappings

exist.

-l [ --libType ] arg Format string describing the library

type

-a [ --alignments ] arg input alignment (BAM) file(s).

-e [ --eqclasses ] arg input salmon weighted equivalence class

file.

-t [ --targets ] arg FASTA format file containing target

transcripts.

basic options:

-v [ --version ] print version string

-h [ --help ] produce help message

-o [ --output ] arg Output quantification directory.

--seqBias Perform sequence-specific bias

correction.

--gcBias [beta for single-end reads] Perform

fragment GC bias correction.

--posBias Perform positional bias correction.

-p [ --threads ] arg (=8) The number of threads to use

concurrently.

--incompatPrior arg (=0) This option sets the prior probability

that an alignment that disagrees with

the specified library type (--libType)

results from the true fragment origin.

Setting this to 0 specifies that

alignments that disagree with the

library type should be "impossible",

while setting it to 1 says that

alignments that disagree with the

library type are no less likely than

those that do

-g [ --geneMap ] arg File containing a mapping of

transcripts to genes. If this file is

provided salmon will output both

quant.sf and quant.genes.sf files,

where the latter contains aggregated

gene-level abundance estimates. The

transcript to gene mapping should be

provided as either a GTF file, or a in

a simple tab-delimited format where

each line contains the name of a

transcript and the gene to which it

belongs separated by a tab. The

extension of the file is used to

determine how the file should be

parsed. Files ending in '.gtf', '.gff'

or '.gff3' are assumed to be in GTF

format; files with any other extension

are assumed to be in the simple format.

In GTF / GFF format, the

"transcript_id" is assumed to contain

the transcript identifier and the

"gene_id" is assumed to contain the

corresponding gene identifier.

--auxTargetFile arg A file containing a list of "auxiliary"

targets. These are valid targets

(i.e., not decoys) to which fragments

are allowed to map and be assigned, and

which will be quantified, but for which

auxiliary models like sequence-specific

and fragment-GC bias correction should

not be applied.

--meta If you're using Salmon on a metagenomic

dataset, consider setting this flag to

disable parts of the abundance

estimation model that make less sense

for metagenomic data.

alignment-specific options:

--noErrorModel Turn off the alignment error model,

which takes into account the the

observed frequency of different types

of mismatches / indels when computing

the likelihood of a given alignment.

Turning this off can speed up

alignment-based salmon, but can harm

quantification accuracy.

--numErrorBins arg (=6) The number of bins into which to divide

each read when learning and applying

the error model. For example, a value

of 10 would mean that effectively, a

separate error model is leared and

applied to each 10th of the read, while

a value of 3 would mean that a separate

error model is applied to the read

beginning (first third), middle (second

third) and end (final third).

-s [ --sampleOut ] Write a "postSample.bam" file in the

output directory that will sample the

input alignments according to the

estimated transcript abundances. If

you're going to perform downstream

analysis of the alignments with tools

which don't, themselves, take fragment

assignment ambiguity into account, you

should use this output.

-u [ --sampleUnaligned ] In addition to sampling the aligned

reads, also write the un-aligned reads

to "postSample.bam".

--gencode This flag will expect the input

transcript fasta to be in GENCODE

format, and will split the transcript

name at the first '|' character. These

reduced names will be used in the

output and when looking for these

transcripts in a gene to transcript

GTF.

--scoreExp arg (=1) The factor by which sub-optimal

alignment scores are downweighted to

produce a probability. If the best

alignment score for the current read is

S, and the score for a particular

alignment is w, then the probability

will be computed porportional to exp( -

scoreExp * (S-w) ). NOTE: This flag

only has an effect if you are parsing

alignments produced by salmon itself

(i.e. pufferfish or RapMap in

selective-alignment mode).

--mappingCacheMemoryLimit arg (=2000000)

If the file contained fewer than this

many mapped reads, then just keep the

data in memory for subsequent rounds of

inference. Obviously, this value should

not be too large if you wish to keep a

low memory usage, but setting it large

enough to accommodate all of the mapped

read can substantially speed up

inference on "small" files that contain

only a few million reads.

advanced options:

--alternativeInitMode [Experimental]: Use an alternative

strategy (rather than simple

interpolation between) the online and

uniform abundance estimates to

initalize the EM / VBEM algorithm.

--auxDir arg (=aux_info) The sub-directory of the quantification

directory where auxiliary information

e.g. bootstraps, bias parameters, etc.

will be written.

--skipQuant Skip performing the actual transcript

quantification (including any Gibbs

sampling or bootstrapping).

--dumpEq Dump the simple equivalence class

counts that were computed during

mapping or alignment.

-d [ --dumpEqWeights ] Dump conditional probabilities

associated with transcripts when

equivalence class information is being

dumped to file. Note, this will dump

the factorization that is actually used

by salmon's offline phase for

inference. If you are using

range-factorized equivalence classes

(the default) then the same transcript

set may appear multiple times with

different associated conditional

probabilities.

--minAssignedFrags arg (=10) The minimum number of fragments that

must be assigned to the transcriptome

for quantification to proceed.

--reduceGCMemory If this option is selected, a more

memory efficient (but slightly slower)

representation is used to compute

fragment GC content. Enabling this will

reduce memory usage, but can also

reduce speed. However, the results

themselves will remain the same.

--biasSpeedSamp arg (=5) The value at which the fragment length

PMF is down-sampled when evaluating

sequence-specific & GC fragment bias.

Larger values speed up effective length

correction, but may decrease the

fidelity of bias modeling results.

--fldMax arg (=1000) The maximum fragment length to consider

when building the empirical

distribution

--fldMean arg (=250) The mean used in the fragment length

distribution prior

--fldSD arg (=25) The standard deviation used in the

fragment length distribution prior

-f [ --forgettingFactor ] arg (=0.65000000000000002)

The forgetting factor used in the

online learning schedule. A smaller

value results in quicker learning, but

higher variance and may be unstable. A

larger value results in slower learning

but may be more stable. Value should

be in the interval (0.5, 1.0].

--initUniform initialize the offline inference with

uniform parameters, rather than seeding

with online parameters.

--maxOccsPerHit arg (=1000) When collecting "hits" (MEMs), hits

having more than maxOccsPerHit

occurrences won't be considered.

-w [ --maxReadOcc ] arg (=200) Reads "mapping" to more than this many

places won't be considered.

--noLengthCorrection [experimental] : Entirely disables

length correction when estimating the

abundance of transcripts. This option

can be used with protocols where one

expects that fragments derive from

their underlying targets without regard

to that target's length (e.g. QuantSeq)

--noEffectiveLengthCorrection Disables effective length correction

when computing the probability that a

fragment was generated from a

transcript. If this flag is passed in,

the fragment length distribution is not

taken into account when computing this

probability.

--noSingleFragProb Disables the estimation of an

associated fragment length probability

for single-end reads or for orphaned

mappings in paired-end libraries. The

default behavior is to consider the

probability of all possible fragment

lengths associated with the retained

mapping. Enabling this flag (i.e.

turning this default behavior off) will

simply not attempt to estimate a

fragment length probability in such

cases.

--noFragLengthDist [experimental] : Don't consider

concordance with the learned fragment

length distribution when trying to

determine the probability that a

fragment has originated from a

specified location. Normally,

Fragments with unlikely lengths will be

assigned a smaller relative probability

than those with more likely lengths.

When this flag is passed in, the

observed fragment length has no effect

on that fragment's a priori

probability.

--noBiasLengthThreshold [experimental] : If this option is

enabled, then no (lower) threshold will

be set on how short bias correction can

make effective lengths. This can

increase the precision of bias

correction, but harm robustness. The

default correction applies a threshold.

--numBiasSamples arg (=2000000) Number of fragment mappings to use when

learning the sequence-specific bias

model.

--numAuxModelSamples arg (=5000000) The first <numAuxModelSamples> are used

to train the auxiliary model parameters

(e.g. fragment length distribution,

bias, etc.). After ther first

<numAuxModelSamples> observations the

auxiliary model parameters will be

assumed to have converged and will be

fixed.

--numPreAuxModelSamples arg (=5000) The first <numPreAuxModelSamples> will

have their assignment likelihoods and

contributions to the transcript

abundances computed without applying

any auxiliary models. The purpose of

ignoring the auxiliary models for the

first <numPreAuxModelSamples>

observations is to avoid applying these

models before thier parameters have

been learned sufficiently well.

--useEM Use the traditional EM algorithm for

optimization in the batch passes.

--useVBOpt Use the Variational Bayesian EM

[default]

--rangeFactorizationBins arg (=4) Factorizes the likelihood used in

quantification by adopting a new notion

of equivalence classes based on the

conditional probabilities with which

fragments are generated from different

transcripts. This is a more

fine-grained factorization than the

normal rich equivalence classes. The

default value (4) corresponds to the

default used in Zakeri et al. 2017

(doi: 10.1093/bioinformatics/btx262),

and larger values imply a more

fine-grained factorization. If range

factorization is enabled, a common

value to select for this parameter is

4. A value of 0 signifies the use of

basic rich equivalence classes.

--numGibbsSamples arg (=0) Number of Gibbs sampling rounds to

perform.

--noGammaDraw This switch will disable drawing

transcript fractions from a Gamma

distribution during Gibbs sampling. In

this case the sampler does not account

for shot-noise, but only assignment

ambiguity

--numBootstraps arg (=0) Number of bootstrap samples to

generate. Note: This is mutually

exclusive with Gibbs sampling.

--bootstrapReproject This switch will learn the parameter

distribution from the bootstrapped

counts for each sample, but will

reproject those parameters onto the

original equivalence class counts.

--thinningFactor arg (=16) Number of steps to discard for every

sample kept from the Gibbs chain. The

larger this number, the less chance

that subsequent samples are

auto-correlated, but the slower

sampling becomes.

-q [ --quiet ] Be quiet while doing quantification

(don't write informative output to the

console unless something goes wrong).

--perTranscriptPrior The prior (either the default or the

argument provided via --vbPrior) will

be interpreted as a transcript-level

prior (i.e. each transcript will be

given a prior read count of this value)

--perNucleotidePrior The prior (either the default or the

argument provided via --vbPrior) will

be interpreted as a nucleotide-level

prior (i.e. each nucleotide will be

given a prior read count of this value)

--sigDigits arg (=3) The number of significant digits to

write when outputting the

EffectiveLength and NumReads columns

--vbPrior arg (=0.01) The prior that will be used in the VBEM

algorithm. This is interpreted as a

per-transcript prior, unless the

--perNucleotidePrior flag is also

given. If the --perNucleotidePrior

flag is given, this is used as a

nucleotide-level prior. If the default

is used, it will be divided by 1000

before being used as a nucleotide-level

prior, i.e. the default per-nucleotide

prior will be 1e-5.

--writeOrphanLinks Write the transcripts that are linked

by orphaned reads.

--writeUnmappedNames Write the names of un-mapped reads to

the file unmapped_names.txt in the

auxiliary directory.

ラン

transcriptsのfastaにindexをつける。

salmon index -p 2 -t transcripts.fa.gz -i ref_index

-p [ --threads ] arg (=2) Number of threads to use (only used for computing bias features)
-t [ --transcripts ] arg Transcript fasta file.
-k [ --kmerLen ] arg (=31) The size of k-mers that should be used for the quasi index.
-i [ --index ] arg Salmon index.

定量。

salmon quant -i ref_index -l A -1 pair1.fastq.gz -2 pair2.fastq.gz -p 8 -o output/sample1

output/sample1/の中に複数のファイルができる。quant.sfが定量結果のファイルとなる。

> head quant.sf |column -t

$ head quant.sf |column -t

Name Length EffectiveLength TPM NumReads

NODE_1_length_23927_cov_361.684814_g0_i0 23927 23716.8 258.816 19183

NODE_2_length_23311_cov_54.219328_g1_i0 23311 23100.8 32.8009 2368

NODE_3_length_21707_cov_72.753024_g2_i0 21707 21496.8 52.4408 3523

NODE_4_length_21289_cov_42.848023_g3_i0 21289 21078.8 32.9264 2169

NODE_5_length_21214_cov_88.072195_g4_i0 21214 21003.8 69.6986 4575

NODE_6_length_21116_cov_110.811411_g5_i0 21116 20905.8 70.9743 4637

NODE_7_length_20592_cov_75.743027_g6_i0 20592 20381.8 20.7355 1320.77

NODE_8_length_20497_cov_76.082942_g6_i1 20497 20286.8 42.181 2674.23

NODE_9_length_20233_cov_124.431480_g7_i0 20233 20022.8 41.9024 2622

引用

Salmon provides fast and bias-aware quantification of transcript expression

Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry & Carl Kingsford

Nature Methods 14, 417–419 (2017)

RNA seq Blog

http://www.rna-seqblog.com/salmon-fast-and-bias-aware-quantification-of-transcript-expression/

Scallopとsalmonの組み合わせ

ゲノムからの定量（2022年3月現在は初期バージョン）

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

超高速にRNA seqのリードカウント（定量）を行う salmon