TransPi: a comprehensive pipeline for de novo transcriptome assembly

Updated 2021-06-04

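The use of RNA-Seq data and the generation of de novo transcriptome assemblies have played a key role in ecology and evolution research. This is especially true for non-model organisms, for which no genome information is available. Even so, studies of differential gene expression, the design of DNA enrichment baits, and phylogenetics can all be carried out with data collected at the transcriptome level. Several tools are available for transcriptome assembly, but no single tool produces the best assembly for every dataset. For that reason, a multi-assembler approach followed by a reduction step is often used to generate an improved representation of the assembly. To reduce errors in such complex analyses while also achieving reproducibility and scalability, automated workflows have become essential for RNA-Seq data analysis. However, most of these tools are designed for species whose genome data can serve as a reference for the assembly process, which limits their use with non-model organisms. The authors present TransPi, a comprehensive pipeline for de novo transcriptome assembly that keeps user input to a minimum without giving up the ability to perform a thorough analysis. Combinations of different model organisms, kmer sets, read lengths, and read quantities were used to evaluate the tool. In addition, 49 non-model organisms spanning different phyla were analyzed. Compared with approaches that use a single assembler, TransPi yields higher BUSCO completeness percentages together with a significant reduction in duplication rates. TransPi is easy to configure and can be deployed seamlessly with Conda, Docker, and Singularity.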

manual

TransPi Manual

Programs used

Do you work on de novo transcriptome assembly? One sample? 20 samples? More? Try our new pipeline, TransPi. https://t.co/fvbsuuMswa
— Ramón Rivera-Vicéns (@RERV787) February 18, 2021

Installation

GitHub

git clone https://github.com/palmuc/TransPi.git
cd TransPi/
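
Before going further, it can help to confirm that the command-line tools used in this post are on the PATH. The following is a minimal sketch, not part of TransPi: the function name `missing_tools` is hypothetical, and the tool list in the example comment (git, nextflow, docker) is an assumption based on the Docker-based runs shown later.

```shell
# Report which of the given commands are missing from PATH.
# Prints one missing command per line; returns 1 if any are missing.
missing_tools() {
    local status=0
    local tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
            status=1
        fi
    done
    return "$status"
}

# Example (assumed tool list for a Docker-based run):
#   missing_tools git nextflow docker || echo "install the tools listed above first"
```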

> nextflow run TransPi.nf --help

N E X T F L O W ~ version 20.07.1
Launching `TransPi.nf` [condescending_stonebraker] - revision: 9be7a05d7b

==================================================
TransPi - Transcriptome Analysis Pipeline v1.0.0
==================================================

Steps:
  1- Run the `precheck_TransPi.sh` to set up the databases and tools (if neccesary) used by TransPi
  2- Run TransPi

Usage:
  nextflow run TransPi.nf TransPi_analysis_option other_options

Example usage:
  nextflow run TransPi.nf --all --reads "/PATH/TO/READS/*_R[1,2].fastq.gz" --k 25,41,53 --maxReadLen 75

Manadatory arguments:
  --all      Run the entire pipeline (Assemblies, EvidentialGene, Annotation, etc.)
             This option also requires arguments --reads, --k, and --maxReadLen
             Example:
               --reads "/PATH/TO/READS/*_R[1,2].fastq.gz" --k 25,35,55,75,85 --maxReadLen 150
             NOTE: Use of quotes is needed for the reads PATH. Kmer list depends on read length.
  --onlyAsm  Run only the Assemblies and EvidentialGene analysis
             This option also requires arguments --reads, --k, --maxReadLen
  --onlyEvi  Run only the Evidential Gene analysis
             Transcriptome expected to be in a directory called "onlyEvi"
  --onlyAnn  Run only the Annotation analysis (starting from a final assembly)
             Transcriptome expected to be in a directory called "onlyAnn"

Other options:
  -profile    Configuration profile to use. Can use multiple (comma separated)
                test              Run TransPi with a test dataset
                conda             Run TransPi with conda.
                docker            Run TransPi with docker container
                singularity       Run TransPi with singularity container with all the neccesary tools
                TransPiContainer  Run TransPi with a single container with all tools
  --help      Display this message
  --fullHelp  Display this message and examples for running TransPi

Output options:
  --outdir    Name of output directory. Default "results"
  -w, -work   Name of working directory. Default "work". Only one dash is needed for -work since it is a nextflow function.
  --tracedir  Name for directory to save pipeline trace files. Default "pipeline_info"

Additional analyses:
  --rRNAfilter         Remove rRNA from sequences. Requires option --rRNAdb
  --rRNAdb             PATH to database of rRNA sequences to use for filtering of rRNA. Default ""
  --filterSpecies      Perform psytrans filtering of transcriptome. Default "false" Requires options --host and --symbiont
  --host               PATH to host (or similar) protein file. Default ""
  --symbiont           PATH to symbionts (or similar) protein files. Default ""
  --psyval             Psytrans value to train model. Default "160"
  --allBuscos          Run BUSCO analysis in all assemblies. Default "false"
  --rescueBusco        Generate BUSCO distribution analysis. Default "false"
  --minPerc            Mininmum percentage of assemblers require for the BUSCO distribution. Default ".70"
  --shortTransdecoder  Run Transdecoder without the homology searches. Default "false"
  --withSignalP        Include SignalP for the annotation. Needs manual installation of CBS-DTU tools. Default "false". Requires --signalp
  --signalp            PATH to SignalP software. Default ""
  --withTMHMM          Include TMHMM for the annotation. Needs manual installation of CBS-DTU tools. Default "false". Requires --tmhmm
  --tmhmm              PATH to TMHMM software. Default ""
  --withRnammer        Include Rnammer for the annotation. Needs manual installation of CBS-DTU tools. Default "false". Requires --rnam
  --rnam               PATH to Rnammer software. Default ""

Skip options:
  --skipEvi     Skip EvidentialGene run in --onlyAsm option. Default "false"
  --skipQC      Skip FastQC step. Default "false"
  --skipFilter  Skip fastp filtering step. Default "false"
  --skipKegg    Skip kegg analysis. Default "false"
  --skipReport  Skip generation of final TransPi report. Default "false"

Others:
  --minQual         Minimum quality score for fastp filtering. Default "25"
  --pipeInstall     PATH to TransPi directory. Default "". If precheck is used this will be added to the nextflow.config automatically.
  --myCondaInstall  PATH to local conda environment of TransPi. Default "". Requires use of --myConda.
  --myConda         Make TransPi use a local conda environemt with all the tools (generated with precheck)
  --envCacheDir     PATH for environment cache directory (either conda or containers). Default "Launch directory of pipeline"
  --getVersions     Get software versions. Default "false"

Preparing the databases

git clone https://github.com/palmuc/TransPi.git
cd TransPi/

Running TransPi requires several databases. The precheck script installs the databases and software needed to run the tools, as required.

# Here the databases are saved under /home/kazu/TransPi/
bash precheck_TransPi.sh /home/kazu/TransPi/

The script is interactive and asks which databases to download as it goes.

At the first prompt I chose 2.

I answered y to continue.

Here I chose 2.

Here I chose 4.

Here I chose 2.

The matching BUSCO database is downloaded, extracted, and saved in DBs/busco_db/ under the path given above (here /home/kazu/TransPi/).

Next, prepare the UniProt database. Because I had accidentally quit the script once, it warned that an empty DB was present, so I downloaded it again by choosing 1.

Follow the prompts from there.

Done.

[Screenshot: contents of DBs/]
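
Since an interrupted precheck can leave an empty database directory behind, a quick check after the script finishes may save a failed run later. The sketch below only tests DBs/busco_db/, the one directory named in this post. The function names `dir_has_content` and `check_busco_db` are hypothetical and are not part of TransPi.

```shell
# Succeed if the given path is a directory containing at least one entry.
dir_has_content() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Report the state of DBs/busco_db under the path passed to precheck_TransPi.sh.
check_busco_db() {
    local base="${1%/}"          # drop one trailing slash, if present
    if dir_has_content "$base/DBs/busco_db"; then
        echo "ok: $base/DBs/busco_db"
    else
        echo "missing or empty: $base/DBs/busco_db"
        return 1
    fi
}

# Example:
#   check_busco_db /home/kazu/TransPi/
```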

Usage

The docker profile is used here. --all runs every process that TransPi supports. The read files must be named in the _R1.fastq.gz / _R2.fastq.gz format.

nextflow run TransPi.nf --all --reads '/YOUR/READS/*_R[1,2].fastq.gz' \
 --k 25,41,53,75,99 --maxReadLen 150 -profile docker

--all Run the entire pipeline (Assemblies, EvidentialGene, Annotation, etc.) This option also requires arguments --reads, --k, and --maxReadLen
Example: --reads "/PATH/TO/READS/*_R[1,2].fastq.gz" --k 25,35,55,75,85 --maxReadLen 150 NOTE: Use of quotes is needed for the reads PATH. Kmer list depends on read length.
--onlyAsm Run only the Assemblies and EvidentialGene analysis This option also requires arguments --reads, --k, --maxReadLen
--onlyEvi Run only the Evidential Gene analysis Transcriptome expected to be in a directory called "onlyEvi"
--onlyAnn Run only the Annotation analysis (starting from a final assembly) Transcriptome expected to be in a directory called "onlyAnn"
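
Because the reads are matched through the `_R[1,2]` glob, a missing mate file is worth catching before a long run. The following is a minimal sketch under the naming convention described above (_R1.fastq.gz and _R2.fastq.gz). The function name `unpaired_reads` is hypothetical and not part of TransPi.

```shell
# List R1 files in a directory that have no matching R2 file.
# Prints one unpaired R1 path per line; returns 1 if any are found.
unpaired_reads() {
    local dir="${1%/}"
    local status=0
    local r1 r2
    for r1 in "$dir"/*_R1.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_R1.fastq.gz}_R2.fastq.gz"      # expected mate file name
        if [ ! -e "$r2" ]; then
            echo "$r1"
            status=1
        fi
    done
    return "$status"
}

# Example:
#   unpaired_reads /YOUR/READS || echo "fix the files listed above before running"
```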

With --all, the run failed with an error at the database build step.

After updating Nextflow to the latest version, the run worked.

Run only the assembly and EvidentialGene steps.
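
A version guard can make this kind of failure easier to spot. The sketch below compares two version strings with `sort -V` (available in GNU coreutils). The function name `version_ge` is hypothetical, and the minimum version in the example comment is a placeholder, not a requirement documented by TransPi. `nextflow -version` and `nextflow self-update` are standard Nextflow commands.

```shell
# Succeed if version $1 is greater than or equal to version $2.
# The smaller of the two sorts first, so $1 >= $2 exactly when $2 is the first line.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (21.04.0 is a placeholder minimum, chosen only for illustration):
#   current=$(nextflow -version | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
#   version_ge "$current" "21.04.0" || nextflow self-update
```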

nextflow run TransPi.nf --onlyAsm --reads '/YOUR/READS/*_R[1,2].fq.gz' \
 --k 25,41,53 --maxReadLen 75 -profile docker

--onlyAsm Run only the Assemblies and EvidentialGene analysis This option also requires arguments --reads, --k, --maxReadLen
--onlyEvi Run only the Evidential Gene analysis Transcriptome expected to be in a directory called "onlyEvi"
--onlyAnn Run only the Annotation analysis (starting from a final assembly) Transcriptome expected to be in a directory called "onlyAnn"
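
The help text notes that the kmer list depends on read length. As a rough pre-flight check, the sketch below assumes that every k should be an odd integer smaller than --maxReadLen. That rule is a common assembler convention and an assumption here, not something stated by TransPi. The function name `check_kmers` is hypothetical.

```shell
# Check a comma-separated k-mer list against a maximum read length.
# Assumption: each k must be an odd integer smaller than the read length.
# Prints one message per problem; returns 1 if any problem is found.
check_kmers() {
    local kmers="$1"
    local max_len="$2"
    local status=0
    local k
    for k in $(echo "$kmers" | tr ',' ' '); do
        case "$k" in
            *[!0-9]*) echo "not a number: $k"; status=1; continue ;;
        esac
        if [ $((k % 2)) -eq 0 ]; then
            echo "even k-mer: $k"
            status=1
        elif [ "$k" -ge "$max_len" ]; then
            echo "k-mer $k is not smaller than read length $max_len"
            status=1
        fi
    done
    return "$status"
}

# Example, using the values from the commands in this post:
#   check_kmers 25,41,53 75
#   check_kmers 25,41,53,75,99 150
```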

Run only EvidentialGene. The transcript FASTA file(s) must be in ./onlyEvi/.

nextflow run TransPi.nf --onlyEvi -profile docker
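
Before launching an --onlyEvi run, one might confirm that ./onlyEvi/ actually holds FASTA files. The following is a minimal sketch: the function name `count_fasta` is hypothetical, and the .fa and .fasta extensions are assumptions, since this post does not say which extensions TransPi accepts.

```shell
# Count FASTA files in a directory.
# The extensions checked (.fa, .fasta) are an assumption; adjust as needed.
count_fasta() {
    local dir="${1%/}"
    local n=0
    local f
    for f in "$dir"/*.fa "$dir"/*.fasta; do
        if [ -f "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}

# Example:
#   [ "$(count_fasta ./onlyEvi)" -gt 0 ] || echo "no FASTA files found in ./onlyEvi"
```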

Edit the parameters in nextflow.config if needed.

If resources are plentiful, raise the CPU count and memory limit before the run. For the assembly steps, the relevant setting is med_cpus.

Because the pipeline has many processes, a run takes a long time.

Citation

TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly

Rivera-Vicéns, R.E., Garcia-Escudero, C.A., Conci, N., Eitel, M., Wörheide, G.

bioRxiv, Posted February 18, 2021.

2022/02/14

TransPi – a comprehensive TRanscriptome ANalysiS PIpeline for de novo transcriptome assembly
R.E. Rivera-Vicéns, C.A. Garcia-Escudero, N. Conci, M. Eitel, G. Wörheide
First published: 04 February 2022

https://onlinelibrary.wiley.com/doi/10.1111/1755-0998.13593