rnaQUAST: evaluating de novo transcriptome assemblies

2020 2/3: added installation notes and run examples

2020 8/13: added installation notes

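rnaQUAST is a tool for comparing the quality of de novo transcriptome assemblies. It aligns the assembled sequences to a reference genome and a catalog of transcripts, and writes a range of statistics to a PDF report. Even when no gene annotation (GTF) is available for the reference, it can run GeneMarkS-T during the run to predict genes. It can also run an RNA-seq read aligner internally (STAR or TopHat is required) to examine coverage.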

Official site

http://cab.spbu.ru/software/rnaquast/

Manual

http://cab.spbu.ru/files/rnaquast/release1.5.0/manual.html

Installation

Tested on an Ubuntu 18.04 LTS machine (macOS is not supported).

Dependencies

Python 2 (2.5 or higher)
matplotlib python package
joblib python package
gffutils python package (needs biopython)
NCBI BLAST+ (blastn)
GMAP (or BLAT) aligner

optional

STAR (or alternatively TopHat aligner + SAM tools)
GeneMarkS-T

gffutils, matplotlib, and joblib are installed with pip. BLAT and BLAST can be installed with brew. To use GMAP, download it from GitHub (link) and build it.
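
Before running rnaQUAST, it can help to confirm that the external tools listed above are actually on PATH. The following is a minimal sketch; the function name check_tools is hypothetical, and the executable names (blastn, gmap, blat, STAR) are assumed to be the ones your installation provides.

```shell
# Report which of the given command-line tools are available on PATH.
# Prints one line per tool: "<name> ok" or "<name> MISSING".
check_tools() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool ok"
        else
            echo "$tool MISSING"
        fi
    done
}

# Tools named in the dependency list: blastn is required, either gmap
# or blat is enough for transcript alignment, and STAR is optional.
check_tools blastn gmap blat STAR
```

A MISSING line for gmap is harmless if blat is present and you plan to run with --blat, and the reverse also holds.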

GitHub

#bioconda (link)
conda create -n rnaquast -y
conda activate rnaquast
conda install -c bioconda -y rnaquast

>rnaQUAST.py -h

$ rnaQUAST.py -h
usage: /home/kazu/anaconda3/envs/rnaquast/bin/rnaQUAST.py [-h]
       [-r REFERENCE [REFERENCE ...]]
       [--gtf GTF [GTF ...]]
       [--gene_db GENE_DB]
       [-c TRANSCRIPTS [TRANSCRIPTS ...]]
       [-psl ALIGNMENT [ALIGNMENT ...]]
       [-sam READS_ALIGNMENT]
       [-1 LEFT_READS]
       [-2 RIGHT_READS]
       [-s SINGLE_READS]
       [--gmap_index GMAP_INDEX]
       [-o OUTPUT_DIR]
       [--test] [-d]
       [-t THREADS]
       [-l LABELS [LABELS ...]]
       [-ss]
       [--min_alignment MIN_ALIGNMENT]
       [--no_plots]
       [--blat] [--tophat]
       [--gene_mark]
       [--meta]
       [--lower_threshold LOWER_THRESHOLD]
       [--upper_threshold UPPER_THRESHOLD]
       [--disable_infer_genes]
       [--disable_infer_transcripts]
       [--busco_lineage BUSCO_LINEAGE]
       [--prokaryote]

QUALITY ASSESSMENT FOR TRANSCRIPTOME ASSEMBLIES /home/kazu/anaconda3/envs/rnaquast/bin/rnaQUAST.py v.1.5.1

Usage:
python /home/kazu/anaconda3/envs/rnaquast/bin/rnaQUAST.py --transcripts TRANSCRIPTS --reference REFERENCE --gtf GENE_COORDINATES

optional arguments:
  -h, --help            show this help message and exit

Input data:
  -r REFERENCE [REFERENCE ...], --reference REFERENCE [REFERENCE ...]
                        Single file (or several files for meta RNA) with
                        reference genome in FASTA format or *.txt file with
                        one-per-line list of FASTA files with reference
                        sequences
  --gtf GTF [GTF ...]   File with gene coordinates (or several files or *.txt
                        file with one-per-line list of GTF / GFF files for
                        meta RNA). We recommend to use files downloaded from
                        GENCODE or Ensembl [GTF/GFF]
  --gene_db GENE_DB     Path to the gene database generated by gffutils to be
                        used
  -c TRANSCRIPTS [TRANSCRIPTS ...], --transcripts TRANSCRIPTS [TRANSCRIPTS ...]
                        File(s) with transcripts [FASTA]
  -psl ALIGNMENT [ALIGNMENT ...], --alignment ALIGNMENT [ALIGNMENT ...]
                        File(s) with transcript alignments to the reference
                        genome [PSL]
  -sam READS_ALIGNMENT, --reads_alignment READS_ALIGNMENT
                        File with read alignments to the reference genome
                        [SAM]
  -1 LEFT_READS, --left_reads LEFT_READS
                        File with forward paired-end reads [FASTQ or gzip-
                        compressed]
  -2 RIGHT_READS, --right_reads RIGHT_READS
                        File with reverse paired-end reads [FASTQ or gzip-
                        compressed]
  -s SINGLE_READS, --single_reads SINGLE_READS
                        File with unpaired reads [FASTQ or gzip-compressed]
  --gmap_index GMAP_INDEX
                        Folder containing GMAP index for the reference genome

Basic options:
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Directory to store all results [default:
                        rnaQUAST_results/results_<datetime>]
  --test                Run rnaQUAST on the test data from the test_data
                        folder, output directory is rnaOUAST_test_output
  -d, --debug           Report detailed information, typically used only for
                        detecting problems.

Advanced options:
  -t THREADS, --threads THREADS
                        Maximum number of threads, default: min(number of CPUs
                        / 2, 16)
  -l LABELS [LABELS ...], --labels LABELS [LABELS ...]
                        Name(s) of assemblies that will be used in the reports
  -ss, --strand_specific
                        Set if transcripts were assembled using strand-
                        specific RNA-Seq data
  --min_alignment MIN_ALIGNMENT
                        Minimal alignment length, default: 50
  --no_plots            Do not draw plots (to speed up computation)
  --blat                Run with BLAT alignment tool
                        (http://hgwdev.cse.ucsc.edu/~kent/exe/) instead of
                        GMAP
  --tophat              Run with TopHat tool
                        (https://ccb.jhu.edu/software/tophat/index.shtml)
                        instead of STAR
  --gene_mark           Run with GeneMarkS-T tool
                        (http://topaz.gatech.edu/GeneMark/)
  --meta                Run QUALITY ASSESSMENT FOR METATRANSCRIPTOME
                        ASSEMBLIES
  --lower_threshold LOWER_THRESHOLD
                        Lower threshold for x-assembled/covered/matched
                        metrics, default: 0.5
  --upper_threshold UPPER_THRESHOLD
                        Upper threshold for x-assembled/covered/matched
                        metrics, default: 0.95
  --prokaryote          Use this option if the genome is prokaryotic

Gffutils related options:
  --disable_infer_genes
                        Use this option if your GTF file already contains
                        genes records
  --disable_infer_transcripts
                        Use this option if your GTF already contains
                        transcripts records

BUSCO related options:
  --busco_lineage BUSCO_LINEAGE
                        Run with BUSCO tool (http://busco.ezlab.org/). Path to
                        the BUSCO lineage data to be used (Eukaryota, Metazoa,
                        Arthropoda, Vertebrata or Fungi)

Don't forget to add GMAP (or BLAT) to PATH.

(rnaquast) kazu@kazu:~/Downloads$

Run

Run rnaQUAST by specifying the assembled FASTA, the reference genome, and the reference GTF file. To evaluate multiple assemblies, separate the files with spaces (-c transcripts1.fasta<space>transcripts2.fasta).

rnaQUAST.py -c transcripts1.fasta transcripts2.fasta -r reference.fasta --gtf gene_coordinates.gtf -t 2 -o output

-c File(s) with transcripts in FASTA format separated by space.
-o Directory to store all results. Default is rnaQUAST_results/results_<datetime>.
-r Single file with reference genome containing all chromosomes/scaffolds in FASTA format (preferably with *.fasta, *.fa, *.fna, *.ffn or *.frn extension) OR *.txt file containing the one-per-line list of FASTA files with reference sequences.
--gtf File with gene coordinates in GTF/GFF format (needs information about parent relations). We recommend to use files downloaded from GENCODE or Ensembl.
-t Maximum number of threads. Default is min(number of CPUs / 2, 16).
--blat Run with BLAT alignment tool instead of GMAP.
--gene_mark Run with GeneMarkS-T gene prediction tool. Use --prokaryote option if the genome is prokaryotic.
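
When several assemblies are compared, the -l option shown in the help output can give each one a readable name in the report. As a rough sketch, labels could be derived from the FASTA file names. The helper names label_for and print_rnaquast_cmd are hypothetical, the reference and GTF names are the placeholders from the example above, and the function only prints the command instead of running it.

```shell
# Derive a report label from a FASTA path by dropping the directory
# and the final extension, e.g. "asm/Trinity.fasta" -> "Trinity".
label_for() {
    base=$(basename "$1")
    echo "${base%.*}"
}

# Print (do not run) an rnaQUAST command for the given transcript
# files, adding one -l label per file in the same order as -c.
print_rnaquast_cmd() {
    labels=""
    for f in "$@"; do
        labels="$labels $(label_for "$f")"
    done
    echo "rnaQUAST.py -c $* -l$labels -r reference.fasta --gtf gene_coordinates.gtf -t 2 -o output"
}

print_rnaquast_cmd transcripts1.fasta transcripts2.fasta
```

Printing the command first makes it easy to check that the labels line up with the files before starting a long run. Paths containing spaces would need quoting, which this sketch does not handle.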

The report is written in PDF format.


For details on each evaluation metric, see the rnaQUAST 1.5 Manual. They are described toward the bottom of the page, below the option descriptions.

Addendum

To run BUSCO as well, download a lineage dataset from the BUSCO v3 website (link), extract it, and pass the extracted lineage directory with --busco_lineage.

 rnaQUAST.py -1 R1.fastq -2 R2.fastq -t 12 -o quast \
 -c Trinity.fasta transcripts.fasta \
 --busco_lineage protists_ensembl/


# If a reference genome and annotation are also available, use them as well
rnaQUAST.py -1 R1.fastq -2 R2.fastq -t 12 -o quast \
 -c Trinity.fasta transcripts.fasta \
 --busco_lineage protists_ensembl/ \
 -r ref_genome.fa --gtf ref_annotation.gtf
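
Since the lineage data has to be downloaded and extracted by hand, a quick check that the directory exists and is not empty can save a failed run. This is only a sketch: the function name lineage_dir_ok is hypothetical, and it does not validate the contents of the BUSCO dataset.

```shell
# Return 0 if the given path is an existing, non-empty directory.
lineage_dir_ok() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

lineage=protists_ensembl   # directory name used in the example above
if lineage_dir_ok "$lineage"; then
    echo "lineage directory found: $lineage"
else
    echo "lineage directory missing or empty: $lineage"
fi
```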