2018-01-21

複数のトランスクリプトームをマージするtransfuse

transfuseは、フィルターを満たすtranscriptsをクラスタリングし、融合したtranscriptsを作るツール。複数のRNA のアセンブルツールで合成されたtranscriptsをマージし、よりハイグレードなtranscriptsを作るために用いられる。現在、論文準備中とされる。

インストール

Github

https://github.com/cboursnell/transfuse

transfuseはRubyで構築されており、rubyのパッケージ管理コマンドgemでインストールできる。

sudo gem install transfuse

> transfuse

$ transfuse

/System/Library/Frameworks/Ruby.framework/Versions/2.3/usr/lib/ruby/2.3.0/universal-darwin17/rbconfig.rb:214: warning: Insecure world writable dir /Users/kazumaxneo/Documents/art_bin_MountRainier/ in PATH, mode 040777

Transfuse v0.5.0

by Chris Boursnell <cmb211@cam.ac.uk> and

Richard Smith-Unna <rds45@cam.ac.uk>

DESCRIPTION:

Merge multiple assemblies.

USAGE:

transfuse <options>

OPTIONS:

-a, --assemblies=<s> assembly files in FASTA format, comma-separated

-l, --left=<s> left reads file in FASTQ format

-r, --right=<s> right reads file in FASTQ format

-o, --output=<s> write merged assembly to file

-t, --threads=<i> number of threads (default: 1)

-i, --id=<f> sequence identity to cluster at (default: 1.0)

-n, --install install dependencies

-v, --verbose be verbose

-e, --version Print version and exit

-h, --help Show this message

ラン

transfuse --assemblies soap-k31.fa,soap-k41.fa,soap-k51.fa --left reads_1.fq --right reads_2.fq --output soap-merged.fa --threads 12 -v

-vをつけてランしても一向にlogが出てこないなら、エラーの可能性が高いです。その場合、fastaのヘッダーが悪さをしている可能性があります。ヘッダーをシンプルな名前に修正してやり直してみてください。

perl -ane 'if(/\>/){$a++;print ">name_$a\n"}else{print;}' input.fa > rename.fa

引用

https://github.com/cboursnell/transfuse

https://groups.google.com/forum/#!topic/trinityrnaseq-users/Rt9Wnrs3k0A

2018-01-21

超高速にRNA seqのリードカウント（定量）を行う salmon

RNA seq 高速なツール RNAseqの定量 2017 Nature Methods

2019 2/15 動画リンク追加

2020 4/19 インストール追記, help更新

2020 4/29 論文アブスト追記

RNA-seqリードからトランスクリプトの豊富さを定量化する軽量な手法であるSalmonを紹介する。Salmonは、新しい二相並列推論アルゴリズムと特徴量の多いバイアスモデルを組み合わせ、超高速なリードマッピングを実現している。この手法は、フラグメントGC含有量のバイアスを補正する初めてのトランスクリプトームワイドな定量手法であり、ここで実証したように、アバンダンス推定の精度とその後の差分発現解析の感度が大幅に向上している。

salmonは豊富なbiasモデルを取り込み、高速、高精度、堅牢なRNAseqの発現定量を行う方法論。 kallistoやeXpressと比べて、同じFDRで2倍以上精度が高い（DEG判定された遺伝子が２倍以上少ない=false positiveが少ない）というデータを出している。

f:id:kazumaxneo:20180121202042j:plain

Supplementary Figure 1より転載。

salomonは単独で「アラインメント」と「定量」の両方を行う、すなわち、indexがついたtranscriptsのリファレンスとFASTQを入力として受け取り、中間のアラインメントファイルを生成せずに直接定量を実行する。結果かなりの時間とスペースを節約することができる。１例をあげると、 6億のリード (75bp, paired-end)をわずか23分で定量できるとされる（ 30 スレッド使用時）。

公式サイト

Overview – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data

quick start guide

Getting Started – Salmon: Fast, accurate and bias-aware transcript quantification from RNA-seq data

Document

http://salmon.readthedocs.io/en/latest/

Accurate, Fast, and Model-Aware Transcript Expression Quantification with Salmon

インストール

Github

conda, brewで導入できる。

#bioconda (link)
mamba install -c bioconda -y salmon

#homebrew
brew install salmon

Githubのリリースにはビルド済みのバイナリもあります（リンク）。さらにdockerイメージも用意されています。

> salmon

$ salmon

salmon v1.2.0

Usage: salmon -h|--help or

salmon -v|--version or

salmon -c|--cite or

salmon [--no-version-check] <COMMAND> [-h | options]

Commands:

index Create a salmon index

quant Quantify a sample

alevin single cell analysis

swim Perform super-secret operation

quantmerge Merge multiple quantifications into a single file

> salmon quant --help-alignment

$ salmon quant --help-alignment

Version Info: This is the most recent version of salmon.

Quant

==========

Perform dual-phase, alignment-based estimation of

transcript abundance from RNA-seq reads

salmon quant options:

alignment input options:

--discardOrphans [alignment-based mode only] : Discard

orphan alignments in the input . If

this flag is passed, then only paired

alignments will be considered toward

quantification estimates. The default

behavior is to consider orphan

alignments if no valid paired mappings

exist.

-l [ --libType ] arg Format string describing the library

type

-a [ --alignments ] arg input alignment (BAM) file(s).

-e [ --eqclasses ] arg input salmon weighted equivalence class

file.

-t [ --targets ] arg FASTA format file containing target

transcripts.

basic options:

-v [ --version ] print version string

-h [ --help ] produce help message

-o [ --output ] arg Output quantification directory.

--seqBias Perform sequence-specific bias

correction.

--gcBias [beta for single-end reads] Perform

fragment GC bias correction.

--posBias Perform positional bias correction.

-p [ --threads ] arg (=8) The number of threads to use

concurrently.

--incompatPrior arg (=0) This option sets the prior probability

that an alignment that disagrees with

the specified library type (--libType)

results from the true fragment origin.

Setting this to 0 specifies that

alignments that disagree with the

library type should be "impossible",

while setting it to 1 says that

alignments that disagree with the

library type are no less likely than

those that do

-g [ --geneMap ] arg File containing a mapping of

transcripts to genes. If this file is

provided salmon will output both

quant.sf and quant.genes.sf files,

where the latter contains aggregated

gene-level abundance estimates. The

transcript to gene mapping should be

provided as either a GTF file, or a in

a simple tab-delimited format where

each line contains the name of a

transcript and the gene to which it

belongs separated by a tab. The

extension of the file is used to

determine how the file should be

parsed. Files ending in '.gtf', '.gff'

or '.gff3' are assumed to be in GTF

format; files with any other extension

are assumed to be in the simple format.

In GTF / GFF format, the

"transcript_id" is assumed to contain

the transcript identifier and the

"gene_id" is assumed to contain the

corresponding gene identifier.

--auxTargetFile arg A file containing a list of "auxiliary"

targets. These are valid targets

(i.e., not decoys) to which fragments

are allowed to map and be assigned, and

which will be quantified, but for which

auxiliary models like sequence-specific

and fragment-GC bias correction should

not be applied.

--meta If you're using Salmon on a metagenomic

dataset, consider setting this flag to

disable parts of the abundance

estimation model that make less sense

for metagenomic data.

alignment-specific options:

--noErrorModel Turn off the alignment error model,

which takes into account the the

observed frequency of different types

of mismatches / indels when computing

the likelihood of a given alignment.

Turning this off can speed up

alignment-based salmon, but can harm

quantification accuracy.

--numErrorBins arg (=6) The number of bins into which to divide

each read when learning and applying

the error model. For example, a value

of 10 would mean that effectively, a

separate error model is leared and

applied to each 10th of the read, while

a value of 3 would mean that a separate

error model is applied to the read

beginning (first third), middle (second

third) and end (final third).

-s [ --sampleOut ] Write a "postSample.bam" file in the

output directory that will sample the

input alignments according to the

estimated transcript abundances. If

you're going to perform downstream

analysis of the alignments with tools

which don't, themselves, take fragment

assignment ambiguity into account, you

should use this output.

-u [ --sampleUnaligned ] In addition to sampling the aligned

reads, also write the un-aligned reads

to "postSample.bam".

--gencode This flag will expect the input

transcript fasta to be in GENCODE

format, and will split the transcript

name at the first '|' character. These

reduced names will be used in the

output and when looking for these

transcripts in a gene to transcript

GTF.

--scoreExp arg (=1) The factor by which sub-optimal

alignment scores are downweighted to

produce a probability. If the best

alignment score for the current read is

S, and the score for a particular

alignment is w, then the probability

will be computed porportional to exp( -

scoreExp * (S-w) ). NOTE: This flag

only has an effect if you are parsing

alignments produced by salmon itself

(i.e. pufferfish or RapMap in

selective-alignment mode).

--mappingCacheMemoryLimit arg (=2000000)

If the file contained fewer than this

many mapped reads, then just keep the

data in memory for subsequent rounds of

inference. Obviously, this value should

not be too large if you wish to keep a

low memory usage, but setting it large

enough to accommodate all of the mapped

read can substantially speed up

inference on "small" files that contain

only a few million reads.

advanced options:

--alternativeInitMode [Experimental]: Use an alternative

strategy (rather than simple

interpolation between) the online and

uniform abundance estimates to

initalize the EM / VBEM algorithm.

--auxDir arg (=aux_info) The sub-directory of the quantification

directory where auxiliary information

e.g. bootstraps, bias parameters, etc.

will be written.

--skipQuant Skip performing the actual transcript

quantification (including any Gibbs

sampling or bootstrapping).

--dumpEq Dump the simple equivalence class

counts that were computed during

mapping or alignment.

-d [ --dumpEqWeights ] Dump conditional probabilities

associated with transcripts when

equivalence class information is being

dumped to file. Note, this will dump

the factorization that is actually used

by salmon's offline phase for

inference. If you are using

range-factorized equivalence classes

(the default) then the same transcript

set may appear multiple times with

different associated conditional

probabilities.

--minAssignedFrags arg (=10) The minimum number of fragments that

must be assigned to the transcriptome

for quantification to proceed.

--reduceGCMemory If this option is selected, a more

memory efficient (but slightly slower)

representation is used to compute

fragment GC content. Enabling this will

reduce memory usage, but can also

reduce speed. However, the results

themselves will remain the same.

--biasSpeedSamp arg (=5) The value at which the fragment length

PMF is down-sampled when evaluating

sequence-specific & GC fragment bias.

Larger values speed up effective length

correction, but may decrease the

fidelity of bias modeling results.

--fldMax arg (=1000) The maximum fragment length to consider

when building the empirical

distribution

--fldMean arg (=250) The mean used in the fragment length

distribution prior

--fldSD arg (=25) The standard deviation used in the

fragment length distribution prior

-f [ --forgettingFactor ] arg (=0.65000000000000002)

The forgetting factor used in the

online learning schedule. A smaller

value results in quicker learning, but

higher variance and may be unstable. A

larger value results in slower learning

but may be more stable. Value should

be in the interval (0.5, 1.0].

--initUniform initialize the offline inference with

uniform parameters, rather than seeding

with online parameters.

--maxOccsPerHit arg (=1000) When collecting "hits" (MEMs), hits

having more than maxOccsPerHit

occurrences won't be considered.

-w [ --maxReadOcc ] arg (=200) Reads "mapping" to more than this many

places won't be considered.

--noLengthCorrection [experimental] : Entirely disables

length correction when estimating the

abundance of transcripts. This option

can be used with protocols where one

expects that fragments derive from

their underlying targets without regard

to that target's length (e.g. QuantSeq)

--noEffectiveLengthCorrection Disables effective length correction

when computing the probability that a

fragment was generated from a

transcript. If this flag is passed in,

the fragment length distribution is not

taken into account when computing this

probability.

--noSingleFragProb Disables the estimation of an

associated fragment length probability

for single-end reads or for orphaned

mappings in paired-end libraries. The

default behavior is to consider the

probability of all possible fragment

lengths associated with the retained

mapping. Enabling this flag (i.e.

turning this default behavior off) will

simply not attempt to estimate a

fragment length probability in such

cases.

--noFragLengthDist [experimental] : Don't consider

concordance with the learned fragment

length distribution when trying to

determine the probability that a

fragment has originated from a

specified location. Normally,

Fragments with unlikely lengths will be

assigned a smaller relative probability

than those with more likely lengths.

When this flag is passed in, the

observed fragment length has no effect

on that fragment's a priori

probability.

--noBiasLengthThreshold [experimental] : If this option is

enabled, then no (lower) threshold will

be set on how short bias correction can

make effective lengths. This can

increase the precision of bias

correction, but harm robustness. The

default correction applies a threshold.

--numBiasSamples arg (=2000000) Number of fragment mappings to use when

learning the sequence-specific bias

model.

--numAuxModelSamples arg (=5000000) The first <numAuxModelSamples> are used

to train the auxiliary model parameters

(e.g. fragment length distribution,

bias, etc.). After ther first

<numAuxModelSamples> observations the

auxiliary model parameters will be

assumed to have converged and will be

fixed.

--numPreAuxModelSamples arg (=5000) The first <numPreAuxModelSamples> will

have their assignment likelihoods and

contributions to the transcript

abundances computed without applying

any auxiliary models. The purpose of

ignoring the auxiliary models for the

first <numPreAuxModelSamples>

observations is to avoid applying these

models before thier parameters have

been learned sufficiently well.

--useEM Use the traditional EM algorithm for

optimization in the batch passes.

--useVBOpt Use the Variational Bayesian EM

[default]

--rangeFactorizationBins arg (=4) Factorizes the likelihood used in

quantification by adopting a new notion

of equivalence classes based on the

conditional probabilities with which

fragments are generated from different

transcripts. This is a more

fine-grained factorization than the

normal rich equivalence classes. The

default value (4) corresponds to the

default used in Zakeri et al. 2017

(doi: 10.1093/bioinformatics/btx262),

and larger values imply a more

fine-grained factorization. If range

factorization is enabled, a common

value to select for this parameter is

4. A value of 0 signifies the use of

basic rich equivalence classes.

--numGibbsSamples arg (=0) Number of Gibbs sampling rounds to

perform.

--noGammaDraw This switch will disable drawing

transcript fractions from a Gamma

distribution during Gibbs sampling. In

this case the sampler does not account

for shot-noise, but only assignment

ambiguity

--numBootstraps arg (=0) Number of bootstrap samples to

generate. Note: This is mutually

exclusive with Gibbs sampling.

--bootstrapReproject This switch will learn the parameter

distribution from the bootstrapped

counts for each sample, but will

reproject those parameters onto the

original equivalence class counts.

--thinningFactor arg (=16) Number of steps to discard for every

sample kept from the Gibbs chain. The

larger this number, the less chance

that subsequent samples are

auto-correlated, but the slower

sampling becomes.

-q [ --quiet ] Be quiet while doing quantification

(don't write informative output to the

console unless something goes wrong).

--perTranscriptPrior The prior (either the default or the

argument provided via --vbPrior) will

be interpreted as a transcript-level

prior (i.e. each transcript will be

given a prior read count of this value)

--perNucleotidePrior The prior (either the default or the

argument provided via --vbPrior) will

be interpreted as a nucleotide-level

prior (i.e. each nucleotide will be

given a prior read count of this value)

--sigDigits arg (=3) The number of significant digits to

write when outputting the

EffectiveLength and NumReads columns

--vbPrior arg (=0.01) The prior that will be used in the VBEM

algorithm. This is interpreted as a

per-transcript prior, unless the

--perNucleotidePrior flag is also

given. If the --perNucleotidePrior

flag is given, this is used as a

nucleotide-level prior. If the default

is used, it will be divided by 1000

before being used as a nucleotide-level

prior, i.e. the default per-nucleotide

prior will be 1e-5.

--writeOrphanLinks Write the transcripts that are linked

by orphaned reads.

--writeUnmappedNames Write the names of un-mapped reads to

the file unmapped_names.txt in the

auxiliary directory.

ラン

transcriptsのfastaにindexをつける。

salmon index -p 2 -t transcripts.fa.gz -i ref_index

-p [ --threads ] arg (=2) Number of threads to use (only used for computing bias features)
-t [ --transcripts ] arg Transcript fasta file.
-k [ --kmerLen ] arg (=31) The size of k-mers that should be used for the quasi index.
-i [ --index ] arg Salmon index.

定量。

salmon quant -i ref_index -l A -1 pair1.fastq.gz -2 pair2.fastq.gz -p 8 -o output/sample1

output/sample1/の中に複数のファイルができる。quant.sfが定量結果のファイルとなる。

> head quant.sf |column -t

$ head quant.sf |column -t

Name Length EffectiveLength TPM NumReads

NODE_1_length_23927_cov_361.684814_g0_i0 23927 23716.8 258.816 19183

NODE_2_length_23311_cov_54.219328_g1_i0 23311 23100.8 32.8009 2368

NODE_3_length_21707_cov_72.753024_g2_i0 21707 21496.8 52.4408 3523

NODE_4_length_21289_cov_42.848023_g3_i0 21289 21078.8 32.9264 2169

NODE_5_length_21214_cov_88.072195_g4_i0 21214 21003.8 69.6986 4575

NODE_6_length_21116_cov_110.811411_g5_i0 21116 20905.8 70.9743 4637

NODE_7_length_20592_cov_75.743027_g6_i0 20592 20381.8 20.7355 1320.77

NODE_8_length_20497_cov_76.082942_g6_i1 20497 20286.8 42.181 2674.23

NODE_9_length_20233_cov_124.431480_g7_i0 20233 20022.8 41.9024 2622

引用

Salmon provides fast and bias-aware quantification of transcript expression

Rob Patro, Geet Duggal, Michael I Love, Rafael A Irizarry & Carl Kingsford

Nature Methods 14, 417–419 (2017)

RNA seq Blog

http://www.rna-seqblog.com/salmon-fast-and-bias-aware-quantification-of-transcript-expression/

Scallopとsalmonの組み合わせ

ゲノムからの定量（2022年3月現在は初期バージョン）

2018-01-21

リピート領域をマスクしてプライマー設計を支援するPrimer3_maskerと、プライマーを作成するprimer3

primer mask 2018 Bioinformatics repetitive sequences

#2018 9/20 brew によるインストールとprimer3のコマンド追加

2018 10/22 わかりにくい部分を修正

2020 3/9 インストール手順修正

2020 3/18 誤字修正

2021 4/18 インストール手順修正

Primer3_maskerは、ゲノムに対してk-mer頻度のデータベースを構築し、プライマーが高頻度に結合する配列をマスクすることで、特異的なプライマー設計が行えるよう支援するツール。

インストール

Github

#biocondaチャネルでprimer3を導入するとprimer3_maskerも一緒に入る。
conda install -c bioconda primer3 

#homebrew
brew install primer3

#conda
mamba install -c bioconda primer3 -y

#from source
git clone https://github.com/bioinfo-ut/primer3_masker 
cd primer3_masker/src/ 
make primer3_masker

#k-mer_listのダウンロード ヒトゲノム向け
cd ../kmer_lists/
wget http://primer3.ut.ee/lists/homo_sapiens_11.list #11-merのリスト
wget http://primer3.ut.ee/lists/homo_sapiens_16.list #16-merのリスト

> ./primer3_masker

$ primer3_masker

Usage: ./primer3_masker [OPTIONS] <INPUTFILE>

Options:

-h, --help - print this usage screen and exit

-p, --probability_cutoff - masking cutoff [0, 1] (default: >=0.1)

-lh, --kmer_lists_path - path to the kmer list files (default: ../kmer_lists/)

-lp, --list_prefix - prefix of the k-mer lists to use with default model (default: homo_sapiens)

-a, --absolute_value_cutoff - masking cutoff based on k-mer count; requires a single list name, defined with -l

-l, --list - define a single k-mer list for masking with absolute cutoff option -a

-m5, --mask_5p - nucleotides to mask in 5' direction (default: 1)

-m3, --mask_3p - nucleotides to mask in 3' direction (default: 0)

-c, --masking_char - character used for masking (default: N)

-s, --soft_mask - use soft masking (default: false)

-d, --masking_direction - a strand to mask (fwd, rev, both) (default: both)

パスの通ったディレクトリに移動しておく。

テストラン

ここではtemplate.fasta（ヒトゲノムの配列）のリピートをtest_data/の定義に従いマスクする。中身を確認。

$ cat ../test_data/template.fasta |fold -w 80

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTAACATTTTTTATTATTTTTTATTTTAATACAGAGTCTCACTCTCTCTCTCT

ATCAGGCTGGAGTGCAGTGGCATGATCTCAGCTCACTGCAAACTCCACCTTCTGGGTTCAAGCAATTTTCCTGACTCAGC

CTCCTGAGTAGCTGAGATTACAGACATGCACCACCATACCCGGCTAATTTTTGTATTTTTAGTAGAGACAGGGTTTCACC

ATGTTGGCCAGGCTGGTATCAAACTCCTGACCTCAGGTGATCCACCCACCTCAGCATCCCAAAGTGCTGGGATTCAATTC

AGGTGTGAGCCACTGTGCCAGCCCTAGGCTCGCTGTGTGTGTGTGTGTGTGTATACACACATACACATACATATATATAT

GTATTTTTTTTTTTTTTGAGACGGAGTCTTGCTTTACCACCCAGACTGGAGTGTAGAGTGTAGTGGTGTGATCTCTGCTC

ACTGCAACCTCTGCCTCCCGGGTTCAAGGGATTCTCCTGCCTCAGCCTCCCGAGGAGCTGGGACTACGGGAGCATGCCAC

GACACCAAGCTAATATGTGTATTTTTAGTAGAGACAGGTGTTCGCCACATTAGCCAGGCTGGTCTCGAACTTCTGACCCC

AGATGATCTGCCTGCCTTGACCTCCCAAAGTGCTAGGAT

１、k-mer listをダウンロードする。

http://primer3.ut.ee/lists.htm

リストにない生き物のゲノムのマスキングを行いたい場合は、GenomeTester4を用いて自分で作成する。

例えば-c 2をつけてglistmaker（紹介）を実行。

glistmaker input.fasta -w 16 -c 2 -o output

オーサーに連絡してもリストファイルは作ってもらえるようです（リンク上）。

２、FASTA配列のマスクを実行。

primer3_masker -lh test_data/test_16.list \
-lp test test_data/template.fasta

-lh　path to the kmer list files (default: ../kmer_lists/)
-lp　define prefix of the k-mer lists to use (default: homo_sapiens)

結果 (foldに渡し、80文字折り返し出力）

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTAACATTTTTTATTATTTTTTATTTTAATACAGAGTCTCACTCTCTCTCTCT

ATNNNNNNNNAGTGCAGNNNNNTGNNCTCAGCTCACTGCNNACTCCACCTTCTGGGTTCAAGCAATTTTCCTGANNNNNC

CTCCTGAGTNNNNNAGATTACAGACATGCACCACCATACCCNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNNNNCANN

NNGTTNGCNNNNCNNGNATCANNNNCCTGACCTCAGGTGATCCACCCACCTCAGCANNNCAAAGTGCTGGGNNNCAATTC

AGGTGTGAGCCACTGTGCCAGCCCTAGGCTCGCNNNNNNNGTGTGTGNNNNNNTACACACATACACATACATATATATAT

GTANNNNNNNNNTTNNNGNNNNGGAGTCTTGCTTTACCACCCAGACTGGAGTGTAGAGTGTAGTGGTGTGATCTCTNNNN

NCTGCAACCTCTGCCTCCCGGGTTCAAGGNNNNNNNNNNCCTCANNNNNNNNAGGAGCTGGGACTACGGGAGCATGCCAC

GACACCAAGCTAATATGNNNNNNTTTAGTAGANNNAGGTGTTCGCCACATTAGCCAGGCTGGTCTCGAACTTCTGACCCC

AGATGATCTGCCTGCCTTGACCTCCCAAAGTGCTAGGAT

k-mer頻度を指定する。例えば頻度10以上の配列はマスクする。

primer3_masker -a 10 -l test_data/test_16.list \
test_data/template.fasta |fold -w 80

-a　masking cutoff based on k-mer count; requires a single list name, defined with -l
-l　define a single k-mer list; for using with absolute cutoff option -a

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTAACATTTTTTATTATTTTTTATTTTAATACAGAGTCTCACTCTCTCTCTCT

ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACTCCACCTTNTGGGTNCAAGCAATNTNCCTNANNNNNN

NNCNTGNNTNNNNNNNNTTACNNACATGCACCACCATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNANNNCAAAGTGCTGGGNNNCAATTN

NNNNGTGAGCCACTNNNNNAGCCCTAGGCTCGNNNNNNNNGTGTGTGNNNNNNNNCACACATACACATACATATATATAN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTACNACCCAGACTGGAGTNTAGAGTGTAGTGGTGTGNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTNNNNNNNNNNNNGAGCTGGGACTACGGGAGCATGCCAC

GACACCAAGCTAATANNNNNNNNTTTAGTANNNNNNNNTGTTCGCCACANTANNNNGGCTGGTCNCGNNNNTCTGACCCC

AGANGATCTGCCTGNNNNNANNNNCCAAANNNNNANNNN

-aが２だと、当然より多くの領域がマスクされる。

primer3_masker -a 2 -l test_data/test_16.list \
test_data/template.fasta |fold -w 80

>template

TTGTCAAGGTTAGATGCTGTTTCTACAGGTCACCAACTGCGGAAACAATGACATGGTCTGAAAATATGGACACGCTTTTA

GCCAACCAAGGTAAGATTTAACTAATAATAGGCTTAAAATACAATAATTAAATATAAATTATTAAATTCTGAAAGTTGGT

AACATATCATAAAGTATGAGTTTAATCAATGAAGTATAAAATTATTAATAATCATAAATTCATAAAAATCCAAAATCTAA

ATAGAATCAGGTTGGGGCTAAAATAAGTTTATAGGTTAACTCTGTACATTAAAACAAAAGGGAAATTCAATCTAGCAAGT

GAAATTTTCCATTGCCTTAGACTCACTTTANCANNTNNNATTATTNTTNNTNNNAATACAGAGNNTCACTCTCTCTCTNN

ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTGNNNNNNNNNNNNNTNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNANNNNCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNAANNGCTNNNNNNNNATTN

NNNNNNGAGCCACTNNNNNNNCCCTAGGCTCNNNNNNNNNNNNNNTNNNNNNNNNNNNNNNNNNNNNNNNNNNTANNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTTACNNNNCAGACTGGAGTNNNNNNTGTNNNNNNNNNNNNNNNNNNN

NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNCTANGGGNGCATGCCAC

GACACCAAGCTAATANNNNNNNNNTTAGTANNNNNNNNNGTTCGCCACNNNANNNNNNNNNNNNNNGNNNNNNNNNNNCN

NNNNGATCTGNNNNNNNNNNNNNNCNNNNNNNNNNNNNN

追記

primer3でゲノムのfinishingプライマーを設計する。ユニークなプライマーを作成したいので、リピートマスクしてからprimer3でプライマーを設計する。contigの末端を1000bpほど抽出してくる。試してないが、BED形式で領域を書き出して、seqkitやseqtkに任せれば簡単に抽出できるはず。

まずprimer3を入れる。brewでprimer3を導入した場合、primer3_configファイルが入ってなくて”but path to thermodynamic parameters not specified”とのエラーが出て怒られてしまうので、手動で、/optにconfigファイルを入れる。

git clone https://github.com/primer3-org/primer3.git
sudo mv primer3/src/primer3_config /opt

primer3はconfigファイルからprimerを設計するので、まずconfigファイルを作成する。

ヒアドキュメントでconfig作成。例えばアセンブリした配列の下流末端だけ抽出し、forwardのプライマーのみ作成する。"SEQUENCE_TEMPLATE="の行にはprimerを作成したい配列を記載する。あらかじめPrimer3_maskerでリピートマスクしておけば、特異的なプライマーを設計できる。

cat > test_config <<EOF 
SEQUENCE_ID=example 
SEQUENCE_TEMPLATE=NNNNNNNNNNNNNNNNNNNNNCATGCATGCATGCATGCATCATGCATGCATGCATGCATGCATGCATGCATNNMMMMMMATGC
PRIMER_TASK=generic
PRIMER_PICK_LEFT_PRIMER=1
PRIMER_PICK_INTERNAL_OLIGO=0
PRIMER_PICK_RIGHT_PRIMER=0
PRIMER_OPT_SIZE=22
PRIMER_MIN_SIZE=20
PRIMER_MAX_SIZE=26
PRIMER_PRODUCT_SIZE_RANGE=75-150
PRIMER_EXPLAIN_FLAG=1
PRIMER_NUM_RETURN=1
PRIMER_MAX_GC=60
PRIMER_MAX_END_GC=3
PRIMER_MIN_TM=55
PRIMER_OPT_TM=60
PRIMER_MAX_TM=63
PRIMER_MUST_MATCH_THREE_PRIME=nnnns
PRIMER_MUST_MATCH_FIVE_PRIME=wnnnn
EOF

PRIMER_MUST_MATCH_THREE_PRIME=nnnnsで5'endはTかA、PRIMER_MUST_MATCH_FIVE_PRIME=wnnnnで3'endはGかCにしています。PRIMER_MAX_GC=60によりGC maxは60%に、PRIMER_OPT_TM=60により至適Tm値は60℃にしています。SEQUENCE_TEMPLATEには、プライマーをデザインしたい配列を登録します。スクリプトを書いて放り込む作業を自動化してしまえば楽になります。

その他にも多数のパラメータがあります。primer3 documentの"9. "GLOBAL" INPUT TAGS"を読んで下さい。

実際にプライマーを作成するには、作成したconfigファイルを指定して以下のように打つ。

primer3_core test_config

引用

Primer3_masker: integrating masking of template sequence with primer design software

Triinu Kõressaar, Maarja Lepamets, Lauris Kaplinski, Kairi Raime, Reidar Andreson, Maido Remm

Bioinformatics, bty036 Published: 19 January 2018

2018-01-21

rRNAを除く SortMeRNA

RNA seq metagenome 系統解析 GO enrichment analysis contamination 2012 Bioinformatics OTU rRNA

2020 2/5 condaインストール追記

2020 6/16 コマンドが大きく変更したため更新（v２.１）

2020 12/9 unmapを出力するようにコマンドを修正, 再びhelp更新（v４.２）

次世代シーケンシング（NGS）技術を生物群集から直接抽出したRNAに適用すると、コーディング型とノンコーディング型の両方のRNAを特徴づける断片が混在している。これらを区別し、メッセンジャー RNAとリボソーム RNA（rRNA）のファミリーをさらに分類することは、相互作用環境の遺伝子発現パターンや構成種の系統分類を調べる上で重要なステップである。

　本研究では、メタトランスクリプトームデータからrRNA断片を迅速にフィルタリングするために設計された新しいソフトウェアSortMeRNAを紹介する。このソフトウェアは、大規模なリードセットを扱うことができ、rRNAデータベースに一致する全ての断片を高感度かつ低ランニングタイムでソートすることが可能である。

出力はfasta、fastq、アライメントのsam、またblastライクな出力も可能である。Illumina, 454, Ion Torrent and PacBioのシーケンスデータに対応している。QIIMEと一緒に使用することで、OTUを検出し系統解析にも利用することができる。

マニュアル

http://bioinfo.lifl.fr/RNA/sortmerna/code/SortMeRNA-user-manual-v2.1.pdf

FAQ

http://bioinfo.lifl.fr/sortmerna/faqs.php

インストール

Github

conda install -c bioconda -y sortmerna

> sortmerna -h

# sortmerna -h

[process:1369] === Options processing starts ... ===

Found value: sortmerna

Found flag: -h

[process:1453] Processing option: h with value:

Program: SortMeRNA version 4.2.0

Turnhoutseweg 30, 2340 Beerse, Belgium

2014-2016 Knight Lab:

Department of Pediatrics, UCSD, La Jolla

2012-2014 Bonsai Bioinformatics Research Group:

LIFL, University Lille 1, CNRS UMR 8022, INRIA Nord-Europe

Disclaimer: SortMeRNA comes with ABSOLUTELY NO WARRANTY; without even the

implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

See the GNU Lesser General Public License for more details.

Contributors: Jenya Kopylova jenya.kopylov@gmail.com

Laurent Noﾃｩ laurent.noe@lifl.fr

Pierre Pericard pierre.pericard@lifl.fr

Daniel McDonald wasade@gmail.com

Mikaﾃｫl Salson mikael.salson@lifl.fr

Hﾃｩlﾃｨne Touzet helene.touzet@lifl.fr

Rob Knight robknight@ucsd.edu

Usage: sortmerna --ref FILE [--ref FILE] --reads FWD_READS [--reads REV_READS] [OPTIONS]:

-------------------------------------------------------------------------------------------------------------

| option type-format description default |

-------------------------------------------------------------------------------------------------------------

[REQUIRED]

--ref PATH Required Reference file (FASTA) absolute or relative path.

Use mutliple times, once per a reference file

--reads PATH Required Raw reads file (FASTA/FASTQ/FASTA.GZ/FASTQ.GZ).

Use twice for files with paired reads

[COMMON]

--workdir PATH Optional Directory for storing Reference index, USRDIR/sortmerna/run/

Key-value database, and the output.

Default structure:

WORKDIR/

idx/

kvdb/

out/

--kvdb PATH Optional Directory for storing Key-value database WORKDIR/kvdb

KVDB is used for storing alignement results.

--idx PATH Optional Directory for storing Reference index. WORKDIR/idx

--fastx BOOL Optional Output aligned reads into FASTA/FASTQ file

--sam BOOL Optional Output SAM alignment for aligned reads.

--SQ BOOL Optional Add SQ tags to the SAM file

--blast STRING Optional output alignments in various Blast-like formats

'0' - pairwise

'1' - tabular(Blast - m 8 format)

'1 cigar' - tabular + column for CIGAR

'1 cigar qcov' - tabular + columns for CIGAR

and query coverage

'1 cigar qcov qstrand' - tabular + columns for CIGAR,

query coverage and strand

--aligned STRING/BOOL Optional Aligned reads file prefix [dir/][pfx] WORKDIR/out/aligned

Directory and file prefix for aligned output i.e.

each output file will go into the specified directory with the given prefix.

The appropriate extension (fasta|fastq|blast|sam|etc) will be automatically added.

Both 'dir' and 'pfx' are optional.

The 'dir' can be a relative or an absolute path.

If 'dir' is not specified, the output will be created in the WORKDIR/out/

If 'pfx' is not specified, the prefix 'aligned' will be used

Examples:

-aligned $MYDIR/dir_1/dir_2/1 -> $MYDIR/dir_1/dir_2/1.fasta

-aligned dir_1/apfx -> $PWD/dir_1/apfx.fasta

-aligned dir_1/ -> $PWD/aligned.fasta

-aligned apfx -> $PWD/apfx.fasta

-aligned (no argument) -> WORKDIR/out/aligned.fasta

--other STRING/BOOL Optional Non-aligned reads file prefix [dir/][pfx] WORKDIR/out/other

Must be used with 'fastx'.

Directory and file prefix for non-aligned output i.e.

each output file will go into the specified directory with the given prefix.

The appropriate extension (fasta|fastq|blast|sam|etc) will be automatically added.

Both 'dir' and 'pfx' are optional.

The 'dir' can be a relative or an absolute path.

If 'dir' is not specified, the output will be created in the WORKDIR/out/

If 'pfx' is not specified, the prefix 'other' will be used

Examples:

-other $MYDIR/dir_1/dir_2/1 -> $MYDIR/dir_1/dir_2/1.fasta

-other dir_1/apfx -> $PWD/dir_1/apfx.fasta

-other dir_1/ -> $PWD/dir_1/other.fasta

-other apfx -> $PWD/apfx.fasta

-other (no argument) -> aligned_out/other.fasta

i.e. the same output directory as used

for aligned output

--num_alignments INT Optional Positive integer (INT >=0).

Report first INT alignments per read reaching E-value

If INT = 0, all alignments will be output

--best INT Optional Report INT best alignments per read reaching E-value 1

by searching --min_lis INT candidate alignments

If INT == 0: search All candidate alignments

If INT > 0: search INT best alignments.

The larger is the INT, the longer is the search time.

Explanation:

A read can potentially be aligned (reaching E-value threshold)

to multiple reference sequences.

The 'best' alignment is an alignment that is better

than the previously found alignments.

The very first found alignment is automatically the best alignment

until a better one is found.

--min_lis INT Optional Search all alignments having the first INT longest LIS

LIS stands for Longest Increasing Subsequence,

it is computed using seeds' positions to expand hits into

longer matches prior to Smith - Waterman alignment.

Requires option 'best'.

Mutually exclusive with option 'num_alignments'

--print_all_reads BOOL Optional Output null alignment strings for non-aligned reads False

to SAM and/or BLAST tabular files

--paired BOOL Optional Indicates a Single reads file with paired reads False

If a single reads file with paired reads is used,

and neither 'paired_in' nor 'paired_out' are specified,

use this option together with 'out2' to output

FWD and REV reads into separate files

--paired_in BOOL Optional If one of the paired-end reads is Aligned, False

put both reads into Aligned FASTA/Q file

Must be used with 'fastx'.

Mutually exclusive with 'paired_out'.

--paired_out BOOL Optional If one of the paired-end reads is Non-aligned, False

put both reads into Non-Aligned FASTA/Q file

Must be used with 'fastx'.

Mutually exclusive with 'paired_in'.

--out2 BOOL Optional Output paired reads into separate files. False

Must be used with 'fastx'.

Ignored without either of 'paired_in' |

'paired_out' | 'paired' | two 'reads'

--match INT Optional SW score (positive integer) for a match. 2

--mismatch INT Optional SW penalty (negative integer) for a mismatch. -3

--gap_open INT Optional SW penalty (positive integer) for introducing a gap. 5

--gap_ext INT Optional SW penalty (positive integer) for extending a gap. 2

-e DOUBLE Optional E-value threshold. 1

Defines the 'statistical significance' of a local alignment.

Exponentially correllates with the Minimal Alignment Score.

Higher E-values (100, 1000, ...) cause More reads

to Pass the alignment threshold

-F BOOL Optional Search only the forward strand. False

-N BOOL Optional SW penalty for ambiguous letters (N's) scored

as --mismatch

-R BOOL Optional Search only the reverse-complementary strand. False

[OTU_PICKING]

--id INT Optional %%id similarity threshold (the alignment 0.97

must still pass the E-value threshold).

--coverage INT Optional %%query coverage threshold (the alignment must 0.97

still pass the E-value threshold)

--de_novo_otu BOOL Optional FASTA/FASTQ file for reads matching database < %%id False

(set using --id) and < %%cov (set using --coverage)

(alignment must still pass the E-value threshold).

--otu_map BOOL Optional Output OTU map (input to QIIME's make_otu_table.py). False

[ADVANCED]

--passes INT,INT,INT Optional Three intervals at which to place the seed on the read L,L/2,3

(L is the seed length)

--edges INT Optional Number (or percent if INT followed by %% sign) of 4

nucleotides to add to each edge of the read

prior to SW local alignment

--num_seeds BOOL Optional Number of seeds matched before searching 2

for candidate LIS

--full_search INT Optional Search for all 0-error and 1-error seed False

matches in the index rather than stopping

after finding a 0-error match (<1%% gain in

sensitivity with up four-fold decrease in speed)

--pid BOOL Optional Add pid to output file names. False

-a INT Optional DEPRECATED in favour of '-threads'. Number of numCores

processing threads to use.

Automatically redirects to '-threads'

--threads INT Optional Number of Processing threads to use numCores

[INDEXING]

-L DOUBLE Optional Indexing: seed length. 18

-m DOUBLE Optional Indexing: the amount of memory (in Mbytes) for building 3072

the index.

-v BOOL Optional Produce verbose output when building the index True

--interval INT Optional Indexing: Positive integer: index every Nth L-mer in 1

the reference database e.g. '-interval 2'.

--max_pos INT Optional Indexing: maximum (integer) number of positions to store 1000

for each unique L-mer. If 0 all positions are stored.

[HELP]

-h BOOL Optional Print help information

--version BOOL Optional Print SortMeRNA version number

[DEVELOPER]

--dbg_put_db BOOL Optional

--cmd BOOL Optional Launch an interactive session (command prompt) False

--task INT Optional Processing Task: 4

0 - align. Only perform alignment

1 - post-processing (log writing)

2 - generate reports

3 - align and post-process

4 - all

データベース

git clone https://github.com/biocore/sortmerna.git

> ls -lh

$ ls -lh sortmerna/data/rRNA_databases/

total 149672

-rw-r--r-- 1 kazu staff 2.0K 6 16 17:25 README.txt

-rw-r--r-- 1 kazu staff 2.2M 6 16 17:25 rfam-5.8s-database-id98.fasta

-rw-r--r-- 1 kazu staff 8.1M 6 16 17:25 rfam-5s-database-id98.fasta

drwxr-xr-x 5 kazu staff 160B 6 16 17:25 scripts/

-rw-r--r-- 1 kazu staff 3.7M 6 16 17:25 silva-arc-16s-id95.fasta

-rw-r--r-- 1 kazu staff 734K 6 16 17:25 silva-arc-23s-id98.fasta

-rw-r--r-- 1 kazu staff 19M 6 16 17:25 silva-bac-16s-id90.fasta

-rw-r--r-- 1 kazu staff 12M 6 16 17:25 silva-bac-23s-id98.fasta

-rw-r--r-- 1 kazu staff 13M 6 16 17:25 silva-euk-18s-id95.fasta

-rw-r--r-- 1 kazu staff 14M 6 16 17:25 silva-euk-28s-id98.fasta

-rw-r--r-- 1 kazu staff 591K 6 16 17:25 silva_ids_acc_tax.tar.gz

（データベースにはアーキアと真核生物のrRNA配列もある。SILVAからダウンロードするとarb ファイルで扱いにくいので、最新バージョンではないがこちらの配列セットを様々な処理に使うのもあり）

ラン

１、indexファイルの準備

解析にはデータベースのrRNA (FASTA) にindexをつける必要がある。ここではSILVAの16S rRNAのindexを作成している。１はスキップ可能。

indexdb_rna --ref silva-arc-16s-id95.fasta,silva-arc-16s-id95.fasta-db

-v verbose
--ref STRING,STRING FASTA reference file, index file

fastaとinexの間は","で区切る。複数のファイルを指定することもできる（マニュアル参照）。

２、SortMeRNAの実行

indexがない場合、作成されてからランされる。

mkdir temp
sortmerna --ref silva-arc-16s-id95.fasta,silva-arc-16s-id95.fasta-db \
 --reads pair_1.fq --reads pair_2.fq \
 --aligned mapped --fastx --workdir temp --sam mapped --other unmap --out2

--fastx output FASTA/FASTQ file
--aligned STRING aligned reads filepath + base file name

mapped.fq（インターレース）とunmap_fwd.fq, unmap_rev.fqが出力される。ペアの順番は同期されていないので注意すること。

--fastx、--otherのほかに、--samや--blastがある。詳細はhelpから確認してください。

引用

SortMeRNA: fast and accurate filtering of ribosomal RNAs in metatranscriptomic data.

Kopylova E, Noé L, Touzet H.

Bioinformatics. 2012 Dec 15;28(24):3211-7.

2018-01-21

古いサンプルのデータ (fastqやbam) から効率的にアダプターを除く leeHom

trimming / preprocessing bam/sam Ancient DNA

　古いDNAが断片化したサンプルからのシーケンスが増えている。しばしば数万年前のサンプルからも抽出される古代のサンプルのDNAは断片化が起きており、うまくDNAを抽出してもサイズが100-bpを超えることは滅多にない。短いDNAをペアードエンドでシーケンスすると、インサート全体がシーケンスされ、さらにアダプターまでシーケンスされることになる（図参照）。

f:id:kazumaxneo:20180121020540j:plain

ResearchGateより転載。

　シーケンス全体がオーバーラップしているので、正しくマージできれば、単純に高いクオリテイのペア側の配列に更新するだけでシーケンス精度を高めることも可能になるが、元々のDNAの質が悪いので（ミスマッチとギャップが発生しうる）、マージは難しいと推測される。

　leeHomは5'側と3'側のアダプターを除去し、ベイジアン最大事後確率アプローチを用いて元のDNA配列を再構成する方法論となる。シミュレーションと、古代のサンプルとして有名なネアンデルタール人のシーケンスデータを使いテストされており、他の方法論より精度が高いという結果が出ている。シングルエンド、ペアードエンドのシーケンスデータに対応している。

公式サイト

https://bioinf.eva.mpg.de/leehom/

インストール

Github

https://github.com/grenaud/leeHom

git clone --recursive https://github.com/grenaud/leeHom.git 
cd leeHom/ 
make 
src/leeHom #動作確認

$ src/leeHom

Usage:

src/leeHom [options] BAMfile

This program takes an unaligned BAM where mates are consecutive

or fastq files and trims and merges reads

You can specify a unaligned bam file or one or two fastq :

-fq1 First fastq

-fq2 Second fastq file (for paired-end)

-fqo Output fastq prefix

-o , --outfile Output (BAM format).

-u Produce uncompressed bam (good for pipe)

--aligned Allow reads to be aligned (default false)

-v , --verbose Turn all messages on (default false)

--log [log file] Print a tally of merged reads to this log file (default only to stderr)

--phred64 Use PHRED 64 as the offset for QC scores (default : PHRED33)

Paired End merging/Single Read trimming options

You can specify either:

--ancientdna ancient DNA (default false)

this allows for partial overlap

or if you know your size length distribution:

--loc Location for lognormal dist. (default none)

--scale Scale for lognormal dist. (default none)

--keepOrig Write original reads if they are trimmed or merged (default false)

Such reads will be marked as PCR duplicates

-f , --adapterFirstRead Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)

-s , --adapterSecondRead Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)

-c , --FirstReadChimeraFilter If the forward read looks like this sequence, the cluster is filtered out.

Provide several sequences separated by comma (def. Multiplex: ACACTCTTTCCCTACACGTCTGAACTCCAG)

-k , --key Key sequence with which each sequence starts. Comma separate for forward and reverse reads. (default '')

-i , --allowMissing Allow one base in one key to be missing or wrong. (default false)

--trimCutoff Lowest number of adapter bases to be observed for single Read trimming (default 1)

パスを通しておく。

マルチコア版

> src/leeHomMulti #-tでthread数を決める

$ src/leeHomMulti

Usage:

leeHomMulti [options] BAMfile

This program takes an unaligned BAM where mates are consecutive

or fastq files and trims and merges reads

You can specify a unaligned bam file or one or two fastq :

-fq1 First fastq

-fq2 Second fastq file (for paired-end)

-fqo Output fastq prefix

-o , --outfile Output (BAM format).

-u Produce uncompressed bam (good for pipe)

--aligned Allow reads to be aligned (default false)

-v , --verbose Turn all messages on (default false)

--log [log file] Print a tally of merged reads to this log file (default only to stderr)

--phred64 Use PHRED 64 as the offset for QC scores (default : PHRED33)

-t [threads] Use multiple cores (default : 1)

Paired End merging/Single Read trimming options

You can specify either:

--ancientdna ancient DNA (default false)

this allows for partial overlap

or if you know your size length distribution:

--loc Location for lognormal dist. (default none)

--scale Scale for lognormal dist. (default none)

--keepOrig Write original reads if they are trimmed or merged (default false)

Such reads will be marked as PCR duplicates

-f , --adapterFirstRead Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)

-s , --adapterSecondRead Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)

-c , --FirstReadChimeraFilter If the forward read looks like this sequence, the cluster is filtered out.

Provide several sequences separated by comma (def. Multiplex: ACACTCTTTCCCTACACGTCTGAACTCCAG)

-k , --key Key sequence with which each sequence starts. Comma separate for forward and reverse reads. (default '')

-i , --allowMissing Allow one base in one key to be missing or wrong. (default false)

--trimCutoff Lowest number of adapter bases to be observed for single Read trimming (default 1)

ラン

テストラン。

fasrqファイルからアダプターを除く。

leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -fq1 testData/rawAncientDNA.f1.gz -fq2 testData/rawAncientDNA.f2.gz -fqo testData/outfq

-f Adapter that is observed after the forward read (def. Multiplex: AGATCGGAAGAGCACACGTCTGAACTCCAG)
-s Adapter that is observed after the reverse read (def. Multiplex: AGATCGGAAGAGCGTCGTGTAGGGAAAGAG)
--ancientdna ancient DNA (default false) this allows for partial overlap
-fq1 First fastq-fq1 First fastq
-fq2 Second fastq file (for paired-end)
-fqo Output fastq prefix

ランが終わると下の解析logが表示される。

Total 50000; Merged (trimming) 40540; Merged (overlap) 7376; Kept PE/SR 1955; Trimmed SR 0; Adapter dimers/chimeras 129; Failed Key 0

47,916のペアードエンドがマージされ（=40540+7376)、1955は重複が見つからずマージされなかったことを意味している。またインサートなしにアダプターが２つタンデムに結合したキメラは129見つかった。

出力はマージされたfastq.gzと、マージされなかったr1.fastq.gz、r2.fastq.gz、そして条件が満たされなかったfailのfastq.gzが出力される。

bamファイルからアダプターを除く。

leeHom -f AGATCGGAAGAGCACACGTCTGAACTCCAG -s GGAAGAGCGTCGTGTAGGGAAAGAGTGTAG --ancientdna -o testData/reconsAncientDNA.bam testData/rawAncientDNA.bam

-o , --outfile Output (BAM format).

引用

leeHom: adaptor trimming and merging for Illumina sequencing reads

Gabriel Renaud, Udo Stenzel, and Janet Kelso.

Nucleic Acids Res. 2014 Oct 13; 42(18): e141.

2018-01-20

RNA seqのクオリティチェックツール QoRTs

RNA seq evaluation tool 結果の視覚化 (visualization) 2015 BMC Bioinformatics

RNA-Seqは特定のバイアス、アーティファクトを受けやすく、堅牢で包括的なクオリティチェックが重要になる。とくにサンプル調製、ライブラリー作成、またはシークエンシングのエラーは、予期せぬアーティファクト、バイアスを引き起こす。適切に処理できるように、そのような問題を検出することが重要になるが、品質を自動的にテストする包括的な方法が存在しない。

QoRTsは幅広いクオリティマトリクスを生成し、インフォマティシャンが適切にクオリティチェックを行えるようサポートする。

公式サイト

QoRTs: Quality of RNA-Seq Toolset

Example Walkthrough

QoRTs/example-walkthrough.pdf at master · hartleys/QoRTs · GitHub

インストール

Githubのリリースからコンパイル済みのjarファイルをダウンロードできる。マニュアルもある。

https://github.com/hartleys/QoRTs/releases

java -jar QoRTs.jar QC -man #マニュアル

#描画するためのRのパッケージも入れておく。Rに入って
install.packages("http://hartleys.github.io/QoRTs/QoRTs_LATEST.tar.gz", repos=NULL, type="source");

r$ java -jar QoRTs.jar QC --man

Starting QoRTs v1.3.0 (Compiled Fri Oct 20 11:56:37 EDT 2017)

Starting time: (Mon Jan 22 17:49:51 JST 2018)

NAME

Version: 1.3.0 (Updated Fri Oct 20 11:56:37 EDT 2017)

USAGE

java [Java Options] -jar QoRTs.jar QC [options] infile

gtffile.gtf qcDataDir

DESCRIPTION:

This utility runs a large battery of QC / data processing tools

on a single given sam or bam file. This is the primary function

of the QoRTs utility. All analyses are run via a single pass

through the sam/bam file.

REQUIRED ARGUMENTS:

infile

The input .bam or .sam file of aligned sequencing reads. Or

'-' to read from stdin.

(String)

gtffile.gtf

The gtf annotation file. This tool was designed to use the

standard gtf annotations provided by Ensembl, but other

annotations can be used as well.

If the filename ends with ".gz" or ".zip", the file will be

parsed using the appropriate decompression method.

(String)

qcDataDir

The output directory.

(String)

OPTIONS:

--singleEnded

Flag to indicate that reads are single end.

(flag)

--stranded

Flag to indicate that data is stranded.

(flag)

--stranded_fr_secondstrand

Flag to indicate that reads are from a fr_secondstrand type

of stranded library (equivalent to the "stranded = yes"

option in HTSeq or the "fr_secondStrand" library-type option

in TopHat/CuffLinks). If your data is stranded, you must

know the library type in order to analyze it properly. This

utility uses the same definitions as cufflinks to define

strandedness type. By default, the fr_firststrand library

type is assumed for all stranded data (equivalent to the

"stranded = reverse" option in HTSeq).

(flag)

--maxReadLength len

Sets the maximum read length. For unclipped datasets this

option is not necessary since the read length can be

determined from the data. By default, QoRTs will attempt to

determine the max read length by examining the first 1000

reads. If your data is hard-clipped prior to alignment, then

it is strongly recommended that this option be included, or

else an error may occur. Note that hard-clipping data prior

to alignment is generally not recommended, because this

makes it difficult (or impossible) to determine the

sequencer read-cycle of each nucleotide base. This may

obfuscate cycle-specific artifacts, trends, or errors, the

detection of which is one of the primary purposes of QoRTs!

In addition, hard clipping (whether before or after

alignment) removes quality score data, and thus quality

score metrics may be misleadingly optimistic. A MUCH

preferable method of removing undesired sequence is to

replace such sequence with N's, which preserves the quality

score and the sequencer cycle information while still

removing undesired sequence.

(Int)

--minMAPQ num

Filter out reads with less than the given MAPQ. Most RNA-Seq

aligners use the MAPQ field to differentiate uniquely-mapped

and multi-mapped reads. However, different aligners use a

different MAPQ conventions. By default, all reads with a

MAPQ of less than 255 will be excluded, as this is the MAPQ

associated with uniquely-aligned reads generated by the

RNA-STAR aligner. For use with TopHat2 you should set this

to 50. The MAPQ behavior for GSNAP is not well documented,

but it appears that a filtering threshold of 30 should be

adequate. Set this to 0 to turn off mapq filtering.

(Int)

--generatePlots

Generate all single-replicate QC plots. Equivalent to the

combination of: --generateMultiPlot --generateSeparatePlots

and --generatePdfReport. This option will cause QoRTs to

make an Rscript system call, loading the R package QoRTs.

(Note: this requires that R be installed and in the PATH,

and that QoRTs be installed on that R installation)

(flag)

--testRun

Flag to indicate that only the first 100k reads should be

read in. Used for testing.

(flag)

--keepMultiMapped

Flag to indicate that the tool should NOT filter out

multi-mapped reads. Note that even with this flag raised

this utility will still only use the 'primary' alignment

location for each read. By default any reads that are marked

as multi-mapped will be ignored entirely. Most aligners use

the MAPQ value to mark multi-mapped reads. Any read with

MAPQ < 255 is assumed to be non-uniquely mapped (this is the

standard used by RNA-STAR and TopHat/TopHat2). This option

is equivalent to "--minMAPQ 0".

(flag)

--noGzipOutput

Flag to indicate that output files should NOT be compressed

into the gzip format. By default almost all output files are

compressed to save space.

(flag)

--readGroup readGroupName

If this option is set, all analyses will be restricted to

ONLY reads that are tagged with the given readGroupName

(using an RG tag). This can be used if multiple read-groups

have already been combined into a single bam file, but you

want to summarize each read-group separately.

(String)

--dropChrom dropChromosomes

A comma-delimited list of chromosomes to ignore and exclude

from all analyses. Important: no whitespace!

(CommaDelimitedListOfStrings)

--skipFunctions func1,func2,...

A list of functions to skip (comma-delimited, no

whitespace). See the sub-functions list, below. The

default-on functions are: NVC, GCDistribution, GeneCalcs,

readLengthDistro, QualityScoreDistribution,

writeJunctionSeqCounts, writeKnownSplices,

writeNovelSplices, writeClippedNVC, CigarOpDistribution,

overlapMatch, cigarLocusCounts, InsertSize, chromCounts,

writeSpliceExon, writeGenewiseGeneBody, JunctionCalcs,

writeGeneCounts, writeBiotypeCounts, writeDESeq,

writeDEXSeq, writeGeneBody, StrandCheck

(CommaDelimitedListOfStrings)

--addFunctions func1,func2,...

A list of functions to add (comma-delimited, no whitespace).

This can be used to add functions that are off by default.

Followed by a comma delimited list, with no internal

whitespace. See the sub-functions list, below. The

default-off functions are: mismatchEngine,

annotatedSpliceExonCounts, calcOnTarget, FPKM, cigarMatch,

testDataDump, writeGeneBodyIv, fastqUtils, referenceMatch,

writeDocs, makeJunctionBed, makeWiggles,

makeAllBrowserTracks, calcDetailedGeneCounts

(CommaDelimitedListOfStrings)

--runFunctions func1,func2,...

The complete list of functions to run (comma-delimited, no

whitespace). Setting this option runs ONLY for the functions

explicitly requested here (along with any functions upon

which the assigned functions are dependent). See the

sub-functions list, below. Allowed options are: NVC,

mismatchEngine, annotatedSpliceExonCounts, GCDistribution,

calcOnTarget, GeneCalcs, FPKM, readLengthDistro, cigarMatch,

QualityScoreDistribution, testDataDump,

writeJunctionSeqCounts, writeKnownSplices,

writeNovelSplices, writeClippedNVC, CigarOpDistribution,

overlapMatch, cigarLocusCounts, InsertSize, chromCounts,

writeGeneBodyIv, fastqUtils, writeSpliceExon,

referenceMatch, writeGenewiseGeneBody, JunctionCalcs,

writeGeneCounts, writeDocs, makeJunctionBed,

writeBiotypeCounts, writeDESeq, writeDEXSeq, makeWiggles,

writeGeneBody, StrandCheck, makeAllBrowserTracks,

calcDetailedGeneCounts

(CommaDelimitedListOfStrings)

--seqReadCt val

(Optional) The total number of reads (or read-pairs, for

paired-end data) generated by the sequencer for this sample,

prior to alignment. This will be passed on into the

QC.summary.txt file and used to calculate mapping rate.

(Int)

--rawfastq myfastq.1.fq.gz,myfastq.2.fq.gz

(Optional) The raw fastq, prior to alignment. In normal

operation, this is used ONLY to calculate the number of

pre-alignment reads (or read-pairs) simply by counting the

number of lines and dividing by 4. Alternatively, the number

of pre-alignment read-pairs can be included explicitly via

the --seqReadCt option, or added in the plotting /

cross-comparison step by including the input.read.pair.count

column in the replicate decoder.In general, the --seqReadCt

option is recommended when available.

Certain optional QC functions are also available that

utilize the raw Fastq file in other ways. If the filename

ends with ".gz" or ".zip", the file will be parsed using the

appropriate decompression method.

(CommaDelimitedListOfStrings)

--chromSizes chrom.sizes.txt

A chrom.sizes file. The first (tab-delimited) column must

contain all chromosomes found in the dataset. The second

column must contain chromosome sizes (in base-pairs). If a

standard genome is being used, it is strongly recommended

that this be generated by the UCSC utility

'fetchChromSizes'.

This file is ONLY needed to produce wiggle files. If this is

provided, then by default QoRTs will produce 100-bp-window

wiggle files (and junction '.bed' files) for the supplied

data.In order to produce wiggle files, this parameter is

REQUIRED.

(String)

--title myTitle

The title of the replicate. Used for the track name in the

track definition line of any browser tracks ('.wig' or

'.bed' files) generated by this utility. Also may be used in

the figure text, if figures are being generated.Note that no

browser tracks will be created by default, unless the

'--chromSizes' option is set. Bed files can also be

generated using the option '--addFunction makeJunctionBed'

(String)

--flatgff flattenedGffFile.gff.gz

A "flattened" gff file that matches the standard gtf file.

Optional. The "flattened" gff file assigns unique

identifiers for all exons, splice junctions, and

aggregate-genes. This is used for the junction counts and

exon counts (for DEXSeq). The flattened gtf file can be

generated using the "makeFlatGff" command. Flattened GFF

files containing novel splice junctions can be generated

using the "mergeNovelSplices" function. Note that (for most

purposes) the command should be run with the same

strandedness code as found in the dataset. Running a

flattened gff that was generated using a different

strandedness mode may be useful for certain purposes, but is

generally not supported and is for advanced users only.See

the documentation for makeFlatGff for more information.

If the filename ends with ".gz" or ".zip", the file will be

parsed using the appropriate decompression method.

(String)

--generateMultiPlot

Generate a multi-frame figure, containing a visual summary

of all QC stats. (Note: this requires that R be installed

and in the PATH, and that QoRTs be installed on that R

installation)

(flag)

--generateSeparatePlots

Generate seperate plots for each QC stat, rather than only

one big multiplot. (Note: this requires that R be installed

and in the PATH, and that QoRTs be installed on that R

installation)

(flag)

--generatePdfReport

Generate a pdf report. (Note: this requires that R be

installed and in the PATH, and that QoRTs be installed on

that R installation)

(flag)

--adjustPhredScore val

QoRTs expects input files to conform to the SAM format

specification, which requires all Phred scores to be in

Phred+33 encoding. However some older tools produce SAM

files with nonstandard encodings. To read such data, you can

set this parameter to subtract from the apparent (phred+33)

phred score. Thus, to read Phred+64 data (produced by

Illumina v1.3-1.7), set this parameter to 31. Note: QoRTs

does not support negative Phred scores. NOTE: THIS OPTION IS

EXPERIMENTAL!

(Int)

--maxPhredScore val

According to the standard FASTQ and SAM format

specification, Phred quality scores are supposed to range

from 0 to 41. However, certain sequencing machines such as

the HiSeq4000 supposedly produce occasional quality scores

as high as 45. If your dataset contains quality scores in

excess of 41, then you must use this option to set the

maximum legal quality score. Otherwise, QoRTs will throw an

error.

(Int)

--summaryFileSuffix .summary.txt

The suffix of the 'summary' file. This is useful to set if

you want to run multiple QC runs in parallel to reduce

runtime, without overwriting one another's summary files.In

particular, the NVC metrics often take a long time to run,

so splitting those off using the --runFunctions parameter

might speed things up considerably. Note that 'QC' will be

appended in the actual filename. THIS OPTION IS BETA!

(String)

--extractReadsByMetric metric=value

THIS OPTIONAL PARAMETER IS STILL UNDER BETA TESTING. This

parameter allows the user to extract anomalous reads that

showed up in previous QoRTs runs. Currently reads can be

extracted based on the following metrics: StrandTestStatus,

InsertSize and GCcount.

(String)

--keepOnlyOnTarget

Experimental flag. Ignores reads that DO NOT fall within the

target region (specified by the required bedfile using the

--targetRegionBed parameter).

(flag)

--dropOnTarget

Experimental flag. Ignores reads that DO fall within the

target region (specified by the required bedfile using the

--targetRegionBed parameter).

(flag)

--randomSubsample 1.00

If this option is set, QoRTs will ignore a random fraction

of the input read pairs. This can drastically reduce

runtime, though it may reduce the accuracy of the output QC

metrics.

(Double)

--restrictToGeneList geneList.txt

If this option is set, almost all analyses will be

restricted to reads that are found on genes named in the

supplied gene list file. The file should contain a gene ID

on each line and nothing else. The only functions that will

be run on the full set of all reads will be the functions

that calculate the gene mapping itself. NOTE: if you want to

include ambiguous reads, include a line with the text:

'_ambiguous'. If you want to include reads that do not map

to any known feature, include a line with the text:

'_no_feature'. WARNING: this is not intended for default

use. It is intended to be used when re-running QoRTs, with

the intention of examining artifacts that can be caused in

various plots by a small number of genes with extremely high

coverage. For example, GC content plots sometimes contain

visible spikes caused by small mitochondrial genes with

extremely high expression.ADDITIONAL WARNING: This feature

is in BETA, and is not yet fully tested.

(String)

--dropGeneList geneList.txt

If this option is set, almost all analyses will be

restricted to reads that are NOT found on genes named in the

supplied gene list file. The file should contain a gene ID

on each line and nothing else. The only functions that will

be run on the full set of all reads will be the functions

that calculate the gene mapping itself. NOTE: if you want to

EXCLUDE ambiguous reads, include a line with the text:

'_ambiguous'. If you want to EXCLUDE reads that do not map

to any known feature, include a line with the text:

'_no_feature'. WARNING: this is not intended for default

use. It is intended to be used when re-running QoRTs, with

the intention of examining artifacts that can be caused by

certain individual 'problem genes'. For example, GC content

plots sometimes contain visible spikes caused by small

transcripts / RNA's with extremely high expression

levels.ADDITIONAL WARNING: This feature is in BETA, and is

not yet fully tested.

(String)

--DNA

BETA: This flag makes various changes to allow QoRTs to run

on whole-exome or whole-genome DNA-Seq data.

(flag)

--RNA

Indicates that the data is RNA-Seq (this is the default:

flag does nothing).

(flag)

--genomeFA chr.fa.gz[,chr2.fa,...]

Reference genome sequence. This can either be a single FASTA

file with all the chromosomes included, or a comma-delimited

list of fasta files with 1 chromosome each. Note: IF

multiple fasta files are specificed, each must contain ONLY

ONE chromosome. If a single multi-chromosome fasta file is

specificed, performance will be improved if the chromosomes

are in the same order as they are found in the BAM file,

however, this is not required. The genomic sequence is used

by certain experimental sub-utilities (currently only the

referenceMatch utility). Comma delimited, no spaces. Fasta

files can be in plaintext, gzipped or zipped.

(CommaDelimitedListOfStrings)

--genomeBufferSize val

The size of the genome fasta buffer. Tuning this parameter

may improve performance.

(Int)

--outfilePrefix sampID

Prefix to be prepended to all output files. If this is set,

all output files will use the format:

"outfiledir/<prefix>QC.qcfilename.txt.gz"

(String)

--nameSorted

DEPRECATED: Relevant for paired-end reads only.

This flag is used to run QoRTs in "name-sorted" mode. This

flag is optional, as under the default mode QoRTs will

accept BAM files sorted by either name OR position. However,

The only actual requirement in this mode is that read pairs

be adjacent.

Errors may occur if the SAM flags are inconsistent: for

example, if orphaned reads appear with the "mate mapped" SAM

flag set.

(flag)

--coordSorted

DEPRECATED: this mode is now subsumed by the default mode

and as such this parameter is now nonfunctional.

Note that, in the default mode, for paired-end data QoRTs

will accept EITHER coordinate-sorted OR name-sorted bam

files. In "--nameSorted" mode, QoRTs ONLY accepts

name-sorted bam files.

If a large fraction of the read-pairs are mapped to

extremely distant loci (or to different chromosomes), then

memory issues may arise. However, this should not be a

problem with most datasets. Technically by default QoRTs can

run on arbitrarily-ordered bam files, but this is STRONGLY

not recommended, as memory usage will by greatly increased.

(flag)

--fileContainsNoMultiMappedReads

DEPRECATED. Flag to indicate that the input sam/bam file

contains only primary alignments (ie, no multi-mapped

reads). This flag is ALWAYS OPTIONAL, but when applicable

this utility will run (slightly) faster when using this

argument. (DEPRECIATED! The performance improvement was

marginal)

(flag)

--parallelFileRead

DEPRECATED: DO NOT USE. Flag to indicate that bam file

reading should be run in paralell for increased speed. Note

that in this mode you CANNOT read from stdin. Also note that

for this to do anything useful, the numThreads option must

be set to some number greater than 1. Also note that

additional threads above 9 will have no appreciable affect

on speed.

(flag)

--numThreads num

DEPRECIATED, nonfunctional.

(Int)

--checkForAlignmentBlocks

Certain aligners will mark reads 'aligned' even though they

have no aligned bases. This option will automatically check

for some reads and ignore them, rather than throwing an

error.

(flag)

--targetRegionBed targetRegion.bed

For whole exome sequencing, this specifies the exome target

regions.

(String)

--stopAfterNReads n

Stop after reading in n reads or read-pairs.

(Int)

--randomSeed n

Use specified random seed.

(Long)

--parseIlluminaStyleReadIDs

Specifies that the read-names are in the illumina style.

CURRENTLY NONFUNCTIONAL!

(flag)

--verbose

Flag to indicate that debugging information and extra

progress information should be sent to stderr.

(flag)

--quiet

Flag to indicate that only errors and warnings should be

sent to stderr.

(flag)

DEFAULT SUB-FUNCTIONS

NVC

Nucleotide-vs-Cycle counts.

GCDistribution

Calculate GC content distribution.

GeneCalcs

Find gene assignment and gene body calculations.

readLengthDistro

Tabulates the distribution of read lengths.

QualityScoreDistribution

Calculate quality scores by cycle.

writeJunctionSeqCounts

Write counts file designed for use with JunctionSeq

(contains known splice junctions, gene counts, and exon

counts). [Depends: writeSpliceExon]

writeKnownSplices

Write known splice junction counts. [Depends: JunctionCalcs]

writeNovelSplices

Write novel splice junction counts. [Depends: JunctionCalcs]

writeClippedNVC

Write NVC file containing clipped sequences. [Depends: NVC]

CigarOpDistribution

Cigar operation rates by cycle and cigar operation length

rates (deletions, insertions, splicing, clipping, etc).

overlapMatch

BETA: This function calculates the matching of overlapping

sections of paired reads. [Depends: mismatchEngine]

cigarLocusCounts

BETA: This function is still undergoing basic testing. It is

not intended for production use at this time.

InsertSize

Insert size distribution (paired-end data only).

chromCounts

Calculate chromosome counts

writeSpliceExon

Synonym for function "writeJunctionSeqCounts" (for

backwards-compatibility) [Depends: JunctionCalcs]

writeGenewiseGeneBody

Write file containing gene-body distributions for each

(non-overlapping) gene. [Depends: writeGeneBody]

JunctionCalcs

Find splice junctions (both novel and annotated).

writeGeneCounts

Write extended gene-level read/read-pair counts file

(includes counts for CDS/UTR, ambiguous regions, etc).

[Depends: GeneCalcs]

writeBiotypeCounts

Write a table listing read counts for each gene BioType

(uses the optional "gene_biotype" GTF attribute). [Depends:

GeneCalcs]

writeDESeq

Write gene-level read/read-pair counts file, suitable for

use with DESeq, EdgeR, etc. [Depends: GeneCalcs]

writeDEXSeq

Write exon-level read/read-pair counts file, designed for

use with DEXSeq. [Depends: JunctionCalcs]

writeGeneBody

Write gene-body distribution file. [Depends: GeneCalcs]

StrandCheck

Check the strandedness of the data. Note that if the

stranded option is set incorrectly, this tool will

automatically print a warning to that effect.

NON-DEFAULT SUB-FUNCTIONS

mismatchEngine

Internal module that runs overlap/reference mismatch

calculations. Automatically included on any runs that

include these functions.

annotatedSpliceExonCounts

Write counts for exons, known-splice-junctions, and genes,

with annotation columns indicating chromosome, etc (default:

OFF) [Depends: JunctionCalcs]

calcOnTarget

BETA: requires --targetRegionBed parameter. This function

calculates the rates at which reads intersect with the

On-Target area. Intended for whole exome sequencing data.

Make sure to use the --targetRegionBed parameter or else

this function will deactivate! (Default: ON iff

targetRegionBed param is found)

FPKM

Write FPKM values. Note: FPKMs are generally NOT the

recommended normalization method. We recommend using a more

advanced normalization as provided by DESeq, edgeR,

CuffLinks, or similar (default: OFF)

cigarMatch

Work-In-Progress: this function is a placeholder for future

functionality, and is not intended for use at this time.

(default: OFF)

testDataDump

EXPERIMENTAL: This function dumps a bunch of information for

internal testing purposes. NOT FOR GENERAL USE! (default:

OFF)

writeGeneBodyIv

Writes an optional additional file detailing the intervals

used in the gene-body coverage calculations

("QC.geneBodyCoverage.DEBUG.intervals.txt.gz") (default:

OFF) [Depends: writeGeneBody]

fastqUtils

BETA: requires --rawfastq parameter. Adds additional tests

that use the supplied raw fastq file. Requires that one (or

two) fastq files be supplied. (Default: ON iff rawfastq

param is found)

referenceMatch

BETA: requires --genomeFA parameter. This function

calculates the matching against the reference. Requires the

specification of genome fasta file(s). REQUIRES

COORDINATE-SORTED BAM FILES! REQUIRES THAT FA AND BAM HAVE

THE SAME CHROMOSOME ORDERING! (Default: ON iff genomeFA

param is found) [Depends: mismatchEngine]

writeDocs

Writes a QC.documentation.txt file that documents all output

files.

makeJunctionBed

Write splice-junction count "bed" files. (default: OFF)

makeWiggles

Write "wiggle" coverage files with 100-bp window size. Note:

this REQUIRES that the --chromSizes parameter be included!

(default: OFF)

makeAllBrowserTracks

Write both the "wiggle" and the splice-junction bed files

(default: OFF) [Depends: makeJunctionBed, makeWiggles]

calcDetailedGeneCounts

Calculate more detailed read counts for each gene, counting

the number of reads that cover introns, cross-strand, etc

(default: OFF)

AUTHORS:

Stephen W. Hartley, Ph.D. stephen.hartley (at nih dot gov)

LEGAL:

This software is "United States Government Work" under the terms

of the United States Copyright Act. It was written as part of

the authors' official duties for the United States Government

and thus cannot be copyrighted. This software is freely

available to the public for use without a copyright notice.

Restrictions cannot be placed on its present or future use.

Although all reasonable efforts have been taken to ensure the

accuracy and reliability of the software and data, the National

Human Genome Research Institute (NHGRI) and the U.S. Government

does not and cannot warrant the performance or results that may

be obtained by using this software or data. NHGRI and the U.S.

Government disclaims all warranties as to performance,

merchantability or fitness for any particular purpose.

In any work or product derived from this material, proper

attribution of the authors as the source of the software or data

should be made, using "NHGRI Genome Technology Branch" as the

citation.

NOTE: This package includes (internally) the sam-1.113.jar

library from picard tools. That package uses the MIT license,

which can be accessed using the command:

java -jar thisjarfile.jar help samjdkinfo

Done. (Mon Jan 22 17:49:51 JST 2018)

例えばbin/に入れて省略名"QoRTs"でランできるようにしておく。

mv QoRTs.jar /usr/local/bin/
echo alias QoRTs=\'java -jar /usr/local/bin/QoRTs.jar\' >> ~/.bash_profile && source ~/.bash_profile

ラン

アライメント結果のbamとgtfを指定してランする。結果は指定したディレクトリに出力される。

QoRTs --generatePdfReport input.bam input.gtf qcDataDir

--singleEnded　Flag to indicate that reads are single end. (flag)
--stranded　Flag to indicate that data is stranded. (flag)
--stranded_fr_secondstrand　Flag to indicate that reads are from a fr_secondstrand type of stranded library (equivalent to the "stranded = yes" option in HTSeq or the "fr_secondStrand" library-type option in TopHat/CuffLinks). If your data is stranded, you must know the library type in order to analyze it properly. This utility uses the same definitions as cufflinks to define strandedness type. By default, the fr_firststrand library type is assumed for all stranded data (equivalent to the "stranded = reverse" option in HTSeq). (flag)
--generateMultiPlot　Generate a multi-frame figure, containing a visual summary of all QC stats. (Note: this requires that R be installed and in the PATH, and that QoRTs be installed on that R installation) (flag)
--generatePdfReport　Generate a pdf report. (Note: this requires that R be installed and in the PATH, and that QoRTs be installed on that R installation)

シングルエンドのbamなら"--singleEnded"のフラグをつけてランする。結果は指定したディレクトリqcDataDir/に出力される。

--generatePdfReportをつけると、次のようなPDFレポートが出力される。

f:id:kazumaxneo:20180122180004j:plain

f:id:kazumaxneo:20180122180008j:plain

f:id:kazumaxneo:20180122180010j:plain

f:id:kazumaxneo:20180122180013j:plain

f:id:kazumaxneo:20180122180016j:plain

f:id:kazumaxneo:20180122180018j:plain

QCコマンド以外に、QC出力のreplicates内のマージ、UCSCで使えるカバレッジのwigファイルへの変換、レポートを出力するユーティリティコマンドなどいくつかあります。詳細は次のコマンドで確認してください。

QoRTs utilname --man

こちらにも分かりやすく説明されています。

QoRTs: Quality of RNA-Seq Toolset

引用

QoRTs: a comprehensive toolset for quality control and data processing of RNA-Seq experiments

Stephen W. HartleyEmail author and James C. Mullikin

BMC Bioinformatics 201516:224

2018-01-20

高頻度なk-merを効率的にカウントする Turtle

k-mer 高速なツール 2014 Bioinformatics

　k-merを用いたde Bruijnグラフ構造は今日普及しているゲノムアセンブルの中核であり、多くの方法論で使われている。k-merはCeleraのようなOLCのアセンブルツールでも重複のシードを用いるのに使われている。また、いくつかのエラー訂正ツールは、k-merの頻度を分析してエラー訂正を行う。すなわち複数回出現するk-merをゲノムの真の配列に由来するものと考え、対照的に１回しか出現しないk-merをシーケンスエラーとみなす。

　サイズがgのゲノムでは、最大g回までのユニークなk-merが期待される。ただしこの数は重複により減少する。k-merが小さくなっても、偶然ユニークでない可能性が高まるため、減ることになる（5merならATGCの全組み合わせは4^5しかない）。しかしながら、シーケンスデータのユニークなk-merを全て数え上げると、ゲノムサイズから期待されるユニークなk-mer数よりずっと多くなる。この理由は、シーケンスエラーが直感的に理解するより頻繁に起きているからである。例えばphread quality scoreが平均30のデータからk=31でk-merを数え上げることを考える。シーケンスエラーが偶然起きる確率は１つの部位で0.1%なので、正解率は99.9%、すなわち31mer連続してエラーが起きない可能性は96.6%になる。裏を返せば3.4%はシーケンスエラーをどこかに含んでいる。クオリティが20なら、シーケンスエラーがゼロの31merは全体の73%まで落ちる。ここに、ゲノムのレアなバリアント、ハプロタイプ、コンタミなども乗ってくるため、ゲノムサイズから予想される値よりずっと大きくなる。よって、ゲノムサイズを推測するには、低頻度なk-merを除きカウントする必要がある。

　Turtleは高頻度なk-merの数をBloom filterを使って省メモリで計算する方法論。並列化にも対応しており、巨大なデータのk-merを少ないリソースで計算するすることができる。具体的には、135.3Gbのヒトゲノムデータの31-merのカウントを、20スレッド使用時に2時間未満で終えるとされる。

インストール

cent OSに導入した。

公式サイトからダウンロードする。

http://bioinformatics.rutgers.edu/Software/Turtle/

> scTurtle64

$ scTurtle64

This program comes with ABSOLUTELY NO WARRANTY.

This is free software, and you are welcome to redistribute it under certain conditions. For details see the document COPYING.

Parameters received:

Please specify an input file.

scTurtle64 Usage:

scTurtle64 [arguments]

example: ./scTurtle64 -f 1Mreads.fq -o kmer_counts -k 31 -n 6000000 -t 9

-i input reads file in fasta format.

-f input reads file in fastq format. This is mutually exclusive with -i.

-o ouput files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.

-q ouput files prefix. k-mers and their counts are stored in tab delimited fromat (quake compatible) in multiple files named prefix0, prefix1... which the user can concatenate if desired.

-k k-mer length.

-t Number of threads.

-n Expected number of frequent k-mers. For uniform coverage libraries this is usually close to genome length. For single-cell libraries, 2-3 times the gemome length is recommended.

-s The approximate amount of space (in GB) to be used. It is used to indirectly compute -n and is mutually exclusive with -n. When both -n and -s are specified, the one that appears last is used.

-h Print this help menu.

-v Print software version.

パスを通しておく。

ラン

scTurtle32 -f reads.fq -o kmer_counts -k 31 -n 3900000 -t 8

-i input reads file in fasta format.-i input reads file in fasta format.
-f input reads file in fastq format. This is mutually exclusive with -i.
-o ouput files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.-o ouput files prefix. k-mers and their counts are stored in fasta format (headers indicating frequency) in multiple files named prefix0, prefix1... which the user can concatenate if desired.
-q ouput files prefix. k-mers and their counts are stored in tab delimited fromat (quake compatible) in multiple files named prefix0, prefix1... which the user can concatenate if desired.
-k k-mer length.
-t Number of threads.
-n Expected number of frequent k-mers. For uniform coverage libraries this is usually close to genome length. For single-cell libraries, 2-3 times the gemome length is recommended.

出力

STATs:

No of reads: 254371

No of k-mers: 62516374

no of frequent k-mers found :3854800

no of frequent k-mersはゲノムサイズに近くなる。

作成中

引用

Turtle: identifying frequent k-mers with cache-efficient algorithms

Rajat Shuvro Roy Debashish Bhattacharya Alexander Schliep.

Bioinformatics, Volume 30, Issue 14, 15 July 2014, Pages 1950–1957.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

複数のトランスクリプトームをマージするtransfuse

超高速にRNA seqのリードカウント（定量）を行う salmon

リピート領域をマスクしてプライマー設計を支援するPrimer3_maskerと、プライマーを作成するprimer3

rRNAを除く SortMeRNA

古いサンプルのデータ (fastqやbam) から効率的にアダプターを除く leeHom

RNA seqのクオリティチェックツール QoRTs

高頻度なk-merを効率的にカウントする Turtle