2019-06-03

転写領域アノテーションのためSRAのデータをサンプリングしてマッピング率等を評価する VARUS

2019 Preprint SRA download RNA seq

2019 6/3 何も表示されないバグを修正

　非常に大量の次世代シークエンシング（NGS）データがNCBIのシークエンスリードアーカイブ（SRA）[ref.1]やENA[ref.2]などの公共のデータベースに保管されている。これを書いている時点で、2019年3月に、SRAは約2.7 * 10^16 bpのrawシーケンシングデータを保存している[ref.3]。この保存されたデータは、多様な種の特定の目的を持つ、非常に多様な個々への実験的支援を提供する。
　このNGSデータの大部分はRNA-Seqである。多くの研究において、RNAは転写産物の発現についての統計的結論を引き出すために多くのreplicatesからシーケンシングされてきた[ref.4、5]。オリジナルの研究の再現性に対する要件に加えて、リポジトリの大きな価値は、オリジナルの研究とは非常に異なる目的でさらなる研究のためにデータを使用できる可能性にある。 RNA-Seqを再利用または使用する重要な理由の1つは、ゲノム（再）アノテーションである。多くの場合、ゲノムには同じ種のトランスクリプトーム研究がいくつか行われた後にアノテーションが付けられる。構造的なゲノム再アノテーションにも多くの理由がある。アセンブリが改良され、さらなる系統のゲノムがシーケンシングされ、より正確なアノテーション方法が追加されたり、多数の種のゲノムに一貫してアノテーションが付けられたりする。
　ある種について入手可能なすべてのRNA-Seqデータを取得することは、データの質やアラインメントの問題のために実行可能ではなく、また逆効果でもない。利用可能なデータは非常に豊富で冗長な場合がある。例えば、キイロショウジョウバエ（表1）のための30,000以上のRNA-Seqシークエンシングランがあり、合計で41 Tbp以上になる。さらに、いくつかのゲノムライブラリーは、トランスクリプトームとして誤ってラベル付けされているか、アダプター配列を含んでいるか、または単に低い品質もしくはアラインメント性を有するため、シーケンシングのクオリティ管理が必要である（表１参照）。したがって、ゲノムアノテーションのために利用可能なすべてのリードのサブセットを選択するのが賢明である。これを行う際に、選択されたリードの相補性は、計算およびデータスループットに対する負担を管理可能なままに保ちながら、生物の全転写産物の大部分をカバーできるべきである。組織特異的または状態特異的な発現のために、多様な組織または状態からの多くの異なるシーケンシングランのリードを使用することが必要であり得る。アップロード者が入力したメタデータには、研究の要約が含まれていることがよくある。これらの手動スキャンは面倒であり、組織や状態について常に決定的なものではなく、たとえそうであったとしても、他のシーケンスランの選択によってまだカバーされていない転写物のどの部分を表現できるかを結論づけることができない。さらに、メタデータが正しくない可能性がある。
　Ohta、Nakazato、Bonoは、SRAで報告されたリードの質について大規模な評価を行ったが、リードは全くアラインされなかった[ref.6]。本著者らの知る限りでは、最大の転写物セットSRAからのリードを自動的にサンプリングするためのソフトウェアは存在しなかった。アーカイブに表示されている組織や条件に大きな偏りがある可能性があるため、すべてのライブラリからの簡単なランダムサンプリングは最適ではない。
　特定の種についてSRAで利用可能なすべてのランからのリードを自動的にサンプリング、ダウンロード、およびアラインメントする新しいツールVARUS [ref.7]を紹介する。シーケンシングランから少量のランダムサンプルを繰り返しダウンロードするオンラインアルゴリズムを実装して、以前にダウンロードされたリードを最もよく補完すると推定される、独自にアラインメントするリードまたはリードペアの割合が比較的高いサンプルをダウンロードする。本著者らはVARUSを12のゲノムについて試験し、そしてスプライスアラインメントによって示唆されたイントロンをリファレンスアノテーションのものと比較した。、VARUSはほとんどの種に対して、シーケンシャルランの手動選択と比較して、より少ない数のダウンロードリード数でより高い精度を達成する。

インストール

ubuntu14.04でテストした。

依存

samtools
fastq-dump and
STAR or HISAT2 (tested with HISAT 2, version 2.0.0-beta) and
requires the script filterIntronsFindStrand.pl to be in the executable path.

#bioconda
conda install -c bioconda -y samtools
conda install -c bioconda -y sra-tools
conda install -c bioconda -y hisat2

filterIntronsFindStrand.pl

apt-get update && apt-get install bamtools libbamtools-dev
git clone https://github.com/nextgenusfs/augustus.git
cd cd augustus/
make -j 16
cd auxprogs/
make clean

本体　Github

git clone https://github.com/MarioStanke/VARUS.git
cd VARUS/Implementation/
make -j 12

> ./VARUS -h

# ./VARUS -h

Usage:

Online-algorithm to automatically draw optimal samples from libraries of

RNA-seqdata from the NCBI with the aim to achieve a coverage as high as

possible, with as little data being downloaded as possible.

In a stepwise procedure parts of these libraries are downloaded and aligned

to the Genome of the species with STAR or HISAT. At each step the run that is expected

to yield the most increase in coverage, is chosen.

Input:

- Genome of the target species specified with --genomeDir

The Genome files have to be prepared for the aligner.

After indexing with STAR the folder should

contain the following files:

chrLenghth.txt, chrName.txt, chrNameLength.txt, chrStart.txt, Genome,

genomeParameters.txt, SA, SAindex

https://github.com/alexdobin/STAR/blob/master/doc/STARmanual.pdf

- Runlist named "Runlist.txt".

Runlist-Format:

@Run_acc total_spots total_bases bool:paired #tabulator separated

ERR1328564 75912627 1 5041721263 0

ERR1328563 88959042 1 7616693616 0

DRR030358 20853417 1 750723012 0

...

An exemplary call of VARUS might look like this:

./VARUS --blockSize 5000 --batchSize 10000 --genomeDir <...>

--pathToRuns <...> --outFileNamePrefix <...>

We divide the genome in many smaller blocks.This is done because on some larger

transcripts the coverage is also desired to be evenly distributed.

At each step a fixed number of reads is downloaded. We call a set of reads

downloaded at once a batch. The batchSize specifies how many reads are

downloaded at once.

----------------------------------------------------------------------------------------------------

Mandatory Inputs:

----------------------------------------------------------------------------------------------------

--aligner: STAR

specifies the aligner, STAR or HISAT

--fastqDumpCall: fastq-dump

if fastq-dump is not in your $PATH, you can specify the path as well

like /path/to/fastqdump/./fastq-dump.Note that this string has to

contain './fastq-dump' as well.

--genomeDir:

specifies the path to the genome/transcriptome that you are

investigating.

--outFileNamePrefix: ../Tutorial/Drosophila/

specifies the path in which all output of VARUS should be stored.

--pathToRuns: ../Tutorial/Drosophila/

specifies the path to Runlist.txt.You can retrieve a Runlist with the

perl-script 'RunListRetriever.pl'.

--pathToSTAR:

specifies the path to the executable of STAR, that aligns the reads to

the transcriptome/genome.

--pathToVARUS:

specifies the path to the executable of VARUS

----------------------------------------------------------------------------------------------------

Basic Settings:

If you choose to not use an estimator, runs are

downloaded randomlyuntil you have reached

maxBatches.

----------------------------------------------------------------------------------------------------

--batchSize: 100000

the number of reads to be downloaded in each step. Typically a library

contains about 10e+6 to 10e+7 reads.

--blockSize: 5000

the number of bases one block will have. This is done in order to be

able to compare the coverage of larger chromosomes with smaller ones,

since larger chromosomes will naturally have more reads mapped to them

than smaller ones.

--deleteLater: 0

if set to 1, the fasta-files and alignment-files will be deleted after

they are used to identify the next run to be downloaded from. If you

want to use the reads for your genome-annotation you should not use this

option.

--estimator: 2

1 == simple, 2 == advanced, 3 == Dirichlet mixture, 4 == cluster

estimator, otherwise downloads will be done choosing the runs randomly.

Note that if you choose to download randomly, you should specify

maxBatches in order to let the program end at some point.

--exportObservationsToFile:0

if set to 1 the program will output the observations in all runs in all

steps into CSV-files. This does not refer to the fasta.files but to the

quantitative distributions of the reads among the genome. WARNING: This

will overwrite older files! NOTE: Using this option can lead to

performance-issues.

--maxBatches: 100

if maxBatches > 0, the program will exit as soon as it loaded maxBatches

batches.

--mergeThreshold: 2

Specifies after how many downloads a merge of all the existing

alignments should be done. Note that after the alignments are merged all

subfolders with smaller alignments are deleted. At the end one final

merge is done. 2 is the minimal value. Note: If you set the parameter

deleteLater = 0, then only one merge at the end of execution is done.

--qualityThreshold: 5

Specifies how much of uniquely mapping reads you are allowing. STAR

produces a file 'Log.final.out' with information of how much percent of

the reads did map. Really small values like 0.1 can be an indicator of

intrinsic dna.

--randomSeed: -1

if randomSeed > 0, the program will have deterministic results. Else the

seed will be set according to the current time

--runThreadN: 4

Number of threads to run STAR parallel with. Read STAR-manual for more

information.

--verbosityDebug: 0

sets the debug-verbosity-level. There are 4 verbosity-levels. Higher

values mean more output.

----------------------------------------------------------------------------------------------------

Estimators:

In each step of the algorithm the distributions of

the reads in each run are estimated based on the

reads that are downloaded already of this run. We

introduced a cost-function that assigns higher

scores to runs that are expected to contain reads

mapping to regions in the genome with only little

observations at the given step. The download stops

if the expected profit is lower than the cost for

downloading another batch. The default of cost = 0

lets VARUS download maxbatches reads. With cost > 0

the download can stop prematurely. A sensible choice

could be cost = 1.

----------------------------------------------------------------------------------------------------

--cost: 0

sets the cost for downloading one read. The cost-function is a

monotonically increasing function with monotonically decreasing

derivative.

--genomeFaFile:

specifies the path to the genome/transcriptome fasta file

--loadAllOnce: 0

if set to 1, a single batch from each run will be downloaded once,

before the estimation process starts.

--profitCondition: 0

if profitConditon == 1, the program will exit if the expected profit

falls below 0. Note that the expected profit can lead to the program

downloading for a very long time, since some of the estimators tend to

be very optimistic if the parameters are not set adequately.

----------------------------------------------------------------------------------------------------

Simple and advanced Estimator:

The simple estimator adds pseudocounts to the actual

observations. The distribution of the reads is then

estimated with the maximum-likelihood-method. The

advanced estimator adds the frequencies of the sum

of all observations in all runs. This sum is

multiplied with a weight lambda to control how

similar the runs are to be assumed.

----------------------------------------------------------------------------------------------------

--lambda: 2

parameter for estimator 2. The weight with which the sum of all

observations in all runs is added as additional pseudocounts for the

estimation.

--pseudoCount: 1

adds a pseudocount to all possible observations. Only relevant for

estimators 1, 2.

----------------------------------------------------------------------------------------------------

Commandline handling:

Instead of typing all commands in the console you

can read in and write to a file that stores all your

settings. This file can also be read in for future

program-runs to ensure to use the same settings.

----------------------------------------------------------------------------------------------------

--exportParametersToFile: 0

if set to 1 the parameters used for this execution of the program will

be exported into a parametersfile

--pathToParameters: /VARUS/Implementation/VARUSparameters.txt

specifies the path and name to the parameters-file that should be read

in and written to. Default is the current working directory and a file

called 'VARUSparameters.txt'.

--readParametersFromFile: 1

if set to 1, the program will look for a file specified with

pathToParameters and interpret its content as command-line arguments.

These parameters will then be used to run the program. Note: additional

parameters passed with the command line will overwrite the parameters

read from the parameters-file.

--------------------------------------------------------------------------------

> ./runVARUS.pl -h

S# ./runVARUS.pl -h

Usage:

Parameter default Explanation

--outFileDir /cwd/ Folder in which all ouput should be stored

--varusParameters path to a parameters file, defaults to /current/working/directory/VARUSparameters.txt

--pathToSTAR specifies the path to the STAR executable, only required it

STAR is not in the PATH

--pathToHISAT specifies the path to the HISAT executables, only required it

hisat-build is not in the PATH

--createindex 1 creates the genome index, 0 if you don't want to create the index

You need an index in order to run STAR

--runThreadN 4 Number of threads used for STAR index creation

--createRunList 1 creates the RunList, 0 if you don't want to create the RunList

You need a RunList in order to run VARUS

--allRuns 1 put all available accession-ids in the Runlist.txt, if false only the first 100 are used

--runVARUS 1 runs VARUS

--createStatistics 0 creates a plot of the coverage achieved with all downloads

--readFromTable 1 searches for a file 'species.txt' with two columns ;-separated

first column=species name in latin

second column=path to the corresponding genome in fasta-format

--pathToSpecies /cwd/ path to the file 'species.txt'

--latinGenus latin name of the genus e.g Drosophila

--latinSpecies latin name of the species e.g melanogaster

--speciesGenome path to the corresponding genome in fasta-format

--VARUScall default ./VARUS

--logfile default runVarus.log

--aligner STAR alignment program: STAR or HISAT

--verbosity 4 between 0 and 5 for less and more logging output

テストラン

ランにはVARUS/runVARUS.plを使う。テストデータをランする。

cd /VARUS/example/
../runVARUS.pl --aligner=HISAT --readFromTable=0 \
 --createindex=1 --latinGenus=Schizosaccharomyces \
 --latinSpecies=pombe --speciesGenome=S.pombe.genome.fa \
 --logfile=S.pombe.log 2> S.pombe.err

--readFromTable searches for a file 'species.txt' with two columns ;-separated first column=species name in latin second column=path to the corresponding
--createindex creates the genome index, 0 if you don't want to create the index. You need an index in order to run STAR.
--latinGenus latin name of the genus e.g Drosophila
--latinSpecies latin name of the species e.g melanogaster
--speciesGenome path to the corresponding genome in fasta-format
--batchSize specifies how many reads should be downloaded in each iteration (e.g. 50000 or 200000)
--maxBatches specifies how many batches should be downloaded at most