Analyzing long RNA reads with IsoQuant - macでインフォマティクス

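IsoQuant is a tool for reference-based analysis of long RNA reads, such as PacBio or Oxford Nanopore reads. It maps reads to a reference genome and assigns them to annotated isoforms based on their intron and exon structure. It can also discover various modifications, such as intron retention, alternative splice sites, and skipped exons. IsoQuant further quantifies genes, isoforms, exons, and introns. If the reads are grouped (for example, by cell type), counts are reported according to the provided grouping. In addition, IsoQuant generates the transcript models it discovers, including novel ones. IsoQuant version 1.1 was released on December 11, 2020 under GPLv2 and can be downloaded from https://github.com/ablab/IsoQuant.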

http://cab.spbu.ru/software/isoquant/

Meet IsoQuant - a new tool for long-read RNA analysis! It can detect structural alternations in individual reads, quantify isoforms and construct novel transcript models. Works well on PacBio and ONT reads, and has features for single-cell data analysis. https://t.co/KzYpajdq74
— Center for Algorithmic Biotechnology (@bioinf_spbu) July 14, 2020

Installation

I tested it in a Python 3.7 virtual environment created with conda (OS: Ubuntu 18.04 LTS).

Dependencies
IsoQuant requires a 64-bit Linux system or Mac OS and Python (3.7 and higher) to be pre-installed on it. You will also need

gffutils
pysam
biopython
pyfaidx
pandas
numpy
minimap2
samtools
STAR (optional)

GitHub

#bioconda (link)
conda create -n isoquant -y python=3.7
conda activate isoquant
conda install -c bioconda isoquant -y
#or 
conda install -c bioconda -c conda-forge -c isoquant isoquant -y

#gffutils and pybedtools were not installed, so I added them manually
pip install gffutils pybedtools

#from source
git clone https://github.com/ablab/IsoQuant.git
cd IsoQuant
git checkout latest
pip install -r requirements.txt
#=> You also need minimap2 to be in the $PATH variable.
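Before running anything, it can help to confirm that the dependencies listed above are actually visible from the active environment. The following is a minimal sketch, not part of IsoQuant: the helper names `missing_tools` and `missing_modules` are made up for this post, and the module list assumes the usual import names (biopython is imported as `Bio`).

```python
import importlib.util
import shutil


def missing_tools(tools=("minimap2", "samtools")):
    """Return the executables from `tools` that are not found on $PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_modules(modules=("gffutils", "pysam", "Bio", "pyfaidx", "pandas", "numpy")):
    """Return the top-level Python modules from `modules` that cannot be located."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Report anything that still needs to be installed in this environment.
    print("missing executables:", missing_tools() or "none")
    print("missing Python modules:", missing_modules() or "none")
```

If either list is non-empty, install the missing pieces with conda or pip, as I did for gffutils and pybedtools.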

> isoquant.py -h

$ isoquant.py -h

usage: isoquant.py [-h] [--output OUTPUT] --genedb GENEDB [--complete_genedb]
                   [--reference REFERENCE] [--index INDEX]
                   (--bam BAM [BAM ...] | --fastq FASTQ [FASTQ ...] | --bam_list BAM_LIST | --fastq_list FASTQ_LIST)
                   --data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}
                   [--stranded STRANDED] [--has_polya] [--fl_data]
                   [--full_help] [--test] [--threads THREADS]
                   [--labels LABELS [LABELS ...]] [--read_group READ_GROUP]
                   [--sqanti_output] [--count_exons]
                   [--matching_strategy {exact,precise,default,loose}]
                   [--model_construction_strategy {reliable,default,fl,all,assembly}]

optional arguments:
  -h, --help            show this help message and exit
  --output OUTPUT, -o OUTPUT
                        output folder, will be created automatically
                        [default=isoquant_output]
  --genedb GENEDB, -g GENEDB
                        gene database in gffutils DB format or GTF/GFF format
  --complete_genedb     use this flag if gene annotation contains transcript
                        and gene metafeatures, e.g. with official annotations,
                        such as GENCODE; speeds up gene database conversion
  --reference REFERENCE, -r REFERENCE
                        reference genome in FASTA format, should be provided
                        to compute some additional stats and when raw reads
                        are used as an input
  --index INDEX         genome index for specified aligner, should be provided
                        only when raw reads are used as an input
  --bam BAM [BAM ...]   sorted and indexed BAM file(s), each file will be
                        treated as a separate sample
  --fastq FASTQ [FASTQ ...]
                        input FASTQ file(s), each file will be treated as a
                        separate sample; reference genome should be provided
                        when using raw reads
  --bam_list BAM_LIST   text file with list of BAM files, one file per line,
                        leave empty line between samples
  --fastq_list FASTQ_LIST
                        text file with list of FASTQ files, one file per line,
                        leave empty line between samples
  --data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}, -d {assembly,pacbio_raw,pacbio_ccs,nanopore}
                        type of data to process, supported types are:
                        assembly, pacbio_raw, pacbio_ccs, nanopore
  --stranded STRANDED   reads strandness type, supported values are: forward,
                        reverse, none
  --has_polya           set if reads were not polyA trimmed; polyA tails will
                        be detected and further required for transcript model
                        construction
  --fl_data             reads represent FL transcripts; both ends of the read
                        are considered to be reliable
  --full_help           show full list of options
  --test                run IsoQuant on toy dataset
  --threads THREADS, -t THREADS
                        number of threads to use
  --labels LABELS [LABELS ...], -l LABELS [LABELS ...]
                        sample names to be used; input file names are used if
                        not set
  --read_group READ_GROUP
                        a way to group feature counts (no grouping by
                        default): by BAM file tag (tag:TAG), using additional
                        file (file:FILE:READ_COL:GROUP_COL:DELIM), using read
                        id (read_id:DELIM)
  --sqanti_output       produce SQANTI-like TSV output (requires more time)
  --count_exons         perform exon and intron counting
  --matching_strategy {exact,precise,default,loose}
                        matching strategy to use from most strict to least
  --model_construction_strategy {reliable,default,fl,all,assembly}
                        transcritp model construnction strategy to use
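The help text describes the list files for --bam_list and --fastq_list as one file per line, with an empty line between samples. As a rough sketch of that layout (the function `format_file_list` and the file names are hypothetical, chosen only for illustration), such a file could be generated like this:

```python
def format_file_list(samples):
    """Return list-file text: one path per line, one empty line between samples.

    `samples` is a list of lists; each inner list holds the files of one sample.
    """
    blocks = ["\n".join(files) for files in samples if files]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical file names: two runs for sample A, one run for sample B.
    samples = [
        ["sampleA.run1.fastq.gz", "sampleA.run2.fastq.gz"],
        ["sampleB.run1.fastq.gz"],
    ]
    print(format_file_list(samples), end="")
```

Saving the printed text to a file and passing that file to --fastq_list should make IsoQuant treat the two blocks as two samples, assuming the format is exactly as the help text states.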

Installation check

> isoquant.py --test

$ isoquant.py --test

=== Running in test mode ===
Any other option is ignored
2020-08-18 00:37:15,601 - INFO - === IsoQuant pipeline started ===
2020-08-18 00:37:15,601 - INFO - Indexing reference
2020-08-18 00:37:15,633 - INFO - Aligning /home/kazu/anaconda3/share/isoquant-1.0.0-0/tests/toy_data/MAPT.Mouse.ONT.simulated.fastq to the reference
2020-08-18 00:37:17,727 - INFO - Converting gene annotation file to .db format (takes a while)...
2020-08-18 00:37:18,263 - INFO - Gene database written to /home/kazu/Documents/isoquant_test/MAPT.Mouse.genedb.db
2020-08-18 00:37:18,264 - INFO - Provide this database next time to avoid excessive conversion
2020-08-18 00:37:18,264 - INFO - Loading gene database from /home/kazu/Documents/isoquant_test/MAPT.Mouse.genedb.db
2020-08-18 00:37:18,266 - INFO - Loading reference genome from /home/kazu/anaconda3/share/isoquant-1.0.0-0/tests/toy_data/MAPT.Mouse.reference.fasta
2020-08-18 00:37:18,283 - INFO - Processing 1 sample
2020-08-18 00:37:18,283 - INFO - Processing sample 00_MAPT.Mouse.ONT.simulated
2020-08-18 00:37:18,283 - INFO - Sample has 1 BAM file: isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.bam
2020-08-18 00:37:18,283 - INFO - Processing chromosome chr11
2020-08-18 00:37:18,367 - INFO - Combining output
2020-08-18 00:37:18,373 - INFO - Finished processing sample 00_MAPT.Mouse.ONT.simulated
2020-08-18 00:37:18,373 - INFO - Gene counts are stored in isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.gene_counts.tsv
2020-08-18 00:37:18,373 - INFO - Transcript counts are stored in isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.transcript_counts.tsv
2020-08-18 00:37:18,373 - INFO - Transcript model file isoquant_test/00_MAPT.Mouse.ONT.simulated/transcript_models.gff
2020-08-18 00:37:18,373 - INFO - Processed sample 00_MAPT.Mouse.ONT.simulated
2020-08-18 00:37:18,373 - INFO - Read assignment statistics
2020-08-18 00:37:18,373 - INFO - empty: 15
2020-08-18 00:37:18,373 - INFO - unique: 117
2020-08-18 00:37:18,374 - INFO - Transcript model statistics
2020-08-18 00:37:18,374 - INFO - known: 10
2020-08-18 00:37:18,374 - INFO - Processed 1 sample
2020-08-18 00:37:18,374 - INFO - === IsoQuant pipeline finished ===
2020-08-18 00:37:18,375 - INFO - === TEST PASSED CORRECTLY ===

OK.

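Usage

This example runs IsoQuant on Nanopore dRNA reads. The poly-A tails have not been trimmed (--has_polya). It specifies the genome and an official (--complete_genedb) annotation in GTF format (--genedb), and uses a sample label instead of the file name. The sample name is My_ONT.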

isoquant.py --data_type nanopore --has_polya --stranded forward --fastq ONT.raw.fastq.gz --reference reference.fasta --genedb annotation.gtf --complete_genedb --output output_dir --threads 12 --labels My_ONT

--data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}  type of data to process, supported types are: assembly, pacbio_raw, pacbio_ccs, nanopore
--has_polya  set if reads were not polyA trimmed; polyA tails will be detected and further required for transcript model construction
--stranded  reads strandness type, supported values are: forward, reverse, none
--fastq  input FASTQ file(s), each file will be treated as a separate sample; reference genome should be provided when using raw reads
--reference  reference genome in FASTA format, should be provided to compute some additional stats and when raw reads are used as an input
--genedb  gene database in gffutils DB format or GTF/GFF format
--complete_genedb  use this flag if gene annotation contains transcript and gene metafeatures, e.g. with official annotations, such as GENCODE; speeds up gene database conversion
--output  output folder, will be created automatically [default=isoquant_output]
--threads  number of threads to use
--labels  sample names to be used; input file names are used if not set
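When the same run has to be repeated for several samples, it may be more convenient to assemble the command in a script. The sketch below only rebuilds the Nanopore command used in this post from the options listed above. The function name `build_isoquant_command` and its defaults are my own choices for illustration, not part of IsoQuant.

```python
import shlex


def build_isoquant_command(fastq, reference, genedb,
                           output="output_dir", label="My_ONT", threads=12):
    """Return the argument list for the Nanopore dRNA run shown in this post."""
    return [
        "isoquant.py",
        "--data_type", "nanopore",
        "--has_polya",                 # poly-A tails were not trimmed
        "--stranded", "forward",
        "--fastq", fastq,
        "--reference", reference,
        "--genedb", genedb,
        "--complete_genedb",           # official annotation, e.g. GENCODE
        "--output", output,
        "--threads", str(threads),
        "--labels", label,
    ]


if __name__ == "__main__":
    cmd = build_isoquant_command("ONT.raw.fastq.gz", "reference.fasta", "annotation.gtf")
    # Print a shell-safe version of the command for inspection.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually launch the run, the list can be passed to `subprocess.run(cmd, check=True)`. Passing the arguments as a list avoids shell quoting problems with unusual file names.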

A subfolder is created in the output directory for each sample.
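Judging from the test log earlier in this post, each sample subfolder holds files named like `<sample>.gene_counts.tsv` and `<sample>.transcript_counts.tsv`. The following sketch collects those files per sample. It assumes that this naming pattern also applies to regular runs and to other IsoQuant versions, which I have not verified, and `find_count_files` is a hypothetical helper name.

```python
from pathlib import Path


def find_count_files(output_dir):
    """Map each sample subfolder name to the count files found inside it."""
    found = {}
    for sample_dir in sorted(p for p in Path(output_dir).iterdir() if p.is_dir()):
        found[sample_dir.name] = {
            "gene_counts": sorted(sample_dir.glob("*.gene_counts.tsv")),
            "transcript_counts": sorted(sample_dir.glob("*.transcript_counts.tsv")),
        }
    return found


if __name__ == "__main__":
    out = Path("output_dir")  # the --output value used in the command in this post
    if out.is_dir():
        for sample, files in find_count_files(out).items():
            print(sample, [p.name for p in files["gene_counts"]],
                  [p.name for p in files["transcript_counts"]])
    else:
        print("output_dir not found; run IsoQuant first")
```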

Reference

https://github.com/ablab/IsoQuant