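IsoQuant is a tool for reference-based analysis of long RNA reads, such as those from PacBio or Oxford Nanopore. IsoQuant maps reads to a reference genome and assigns them to annotated isoforms based on their intron and exon structure. IsoQuant can also detect various modifications, such as intron retention, alternative splice sites, and skipped exons. It further quantifies genes, isoforms, exons, and introns. If the reads are grouped (for example, by cell type), counts are reported according to the provided grouping. In addition, IsoQuant generates the discovered transcript models, including novel ones. IsoQuant version 1.1 was released under GPLv2 on December 11, 2020, and can be downloaded from https://github.com/ablab/IsoQuant.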
Homepage
http://cab.spbu.ru/software/isoquant/
Meet IsoQuant - a new tool for long-read RNA analysis! It can detect structural alternations in individual reads, quantify isoforms and construct novel transcript models. Works well on PacBio and ONT reads, and has features for single-cell data analysis. https://t.co/KzYpajdq74
Center for Algorithmic Biotechnology (@bioinf_spbu), July 14, 2020
Installation
I created a Python 3.7 virtual environment with conda and tested IsoQuant there (OS: Ubuntu 18.04 LTS).
Dependencies
IsoQuant requires a 64-bit Linux system or Mac OS and Python (3.7 and higher) to be pre-installed on it. You will also need
- gffutils
- pysam
- biopython
- pyfaidx
- pandas
- numpy
- minimap2
- samtools
- STAR (optional)
#bioconda
conda create -n isoquant -y python=3.7
conda activate isoquant
conda install -c bioconda isoquant -y
#or
conda install -c bioconda -c conda-forge -c isoquant isoquant -y
#gffutils and pybedtools were not installed by conda, so I added them with pip
pip install gffutils pybedtools
#from source
git clone https://github.com/ablab/IsoQuant.git
cd IsoQuant
git checkout latest
pip install -r requirements.txt
#=> You also need minimap2 to be in the $PATH variable.
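Before running IsoQuant, it can help to confirm that the external command-line tools from the dependency list are actually on the PATH. The following is a minimal sketch, not part of IsoQuant itself; the helper name `missing_tools` and the two tool lists are assumptions made for illustration.

```python
import shutil

# External command-line tools named in the dependency list above.
REQUIRED_TOOLS = ["minimap2", "samtools"]
OPTIONAL_TOOLS = ["STAR"]


def missing_tools(tools):
    """Return the entries of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    # Report anything that is missing; print nothing if all tools are found.
    for tool in missing_tools(REQUIRED_TOOLS):
        print(f"required tool not found on PATH: {tool}")
    for tool in missing_tools(OPTIONAL_TOOLS):
        print(f"optional tool not found on PATH: {tool}")
```

Run it inside the activated conda environment so that the PATH being checked is the one IsoQuant will see.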
> isoquant.py -h
$ isoquant.py -h
usage: isoquant.py [-h] [--output OUTPUT] --genedb GENEDB [--complete_genedb]
[--reference REFERENCE] [--index INDEX]
(--bam BAM [BAM ...] | --fastq FASTQ [FASTQ ...] | --bam_list BAM_LIST | --fastq_list FASTQ_LIST)
--data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}
[--stranded STRANDED] [--has_polya] [--fl_data]
[--full_help] [--test] [--threads THREADS]
[--labels LABELS [LABELS ...]] [--read_group READ_GROUP]
[--sqanti_output] [--count_exons]
[--matching_strategy {exact,precise,default,loose}]
[--model_construction_strategy {reliable,default,fl,all,assembly}]
optional arguments:
-h, --help show this help message and exit
--output OUTPUT, -o OUTPUT
output folder, will be created automatically
[default=isoquant_output]
--genedb GENEDB, -g GENEDB
gene database in gffutils DB format or GTF/GFF format
--complete_genedb use this flag if gene annotation contains transcript
and gene metafeatures, e.g. with official annotations,
such as GENCODE; speeds up gene database conversion
--reference REFERENCE, -r REFERENCE
reference genome in FASTA format, should be provided
to compute some additional stats and when raw reads
are used as an input
--index INDEX genome index for specified aligner, should be provided
only when raw reads are used as an input
--bam BAM [BAM ...] sorted and indexed BAM file(s), each file will be
treated as a separate sample
--fastq FASTQ [FASTQ ...]
input FASTQ file(s), each file will be treated as a
separate sample; reference genome should be provided
when using raw reads
--bam_list BAM_LIST text file with list of BAM files, one file per line,
leave empty line between samples
--fastq_list FASTQ_LIST
text file with list of FASTQ files, one file per line,
leave empty line between samples
--data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}, -d {assembly,pacbio_raw,pacbio_ccs,nanopore}
type of data to process, supported types are:
assembly, pacbio_raw, pacbio_ccs, nanopore
--stranded STRANDED reads strandness type, supported values are: forward,
reverse, none
--has_polya set if reads were not polyA trimmed; polyA tails will
be detected and further required for transcript model
construction
--fl_data reads represent FL transcripts; both ends of the read
are considered to be reliable
--full_help show full list of options
--test run IsoQuant on toy dataset
--threads THREADS, -t THREADS
number of threads to use
--labels LABELS [LABELS ...], -l LABELS [LABELS ...]
sample names to be used; input file names are used if
not set
--read_group READ_GROUP
a way to group feature counts (no grouping by
default): by BAM file tag (tag:TAG), using additional
file (file:FILE:READ_COL:GROUP_COL:DELIM), using read
id (read_id:DELIM)
--sqanti_output produce SQANTI-like TSV output (requires more time)
--count_exons perform exon and intron counting
--matching_strategy {exact,precise,default,loose}
matching strategy to use from most strict to least
--model_construction_strategy {reliable,default,fl,all,assembly}
transcritp model construnction strategy to use
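According to the help text above, --bam_list and --fastq_list take a text file with one file per line and an empty line between samples. A minimal sketch of producing that layout is shown here; the helper name `format_file_list` and the file names are hypothetical, and the only thing taken from the help text is the layout rule.

```python
def format_file_list(samples):
    """Render samples (each a list of file paths) as list-file text.

    One path per line, with an empty line between samples, following the
    description of --bam_list / --fastq_list in the help output.
    """
    blocks = ["\n".join(files) for files in samples if files]
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    samples = [
        ["sampleA.run1.fastq.gz", "sampleA.run2.fastq.gz"],
        ["sampleB.fastq.gz"],
    ]
    print(format_file_list(samples), end="")
```

The resulting text can be written to a file and passed to --fastq_list (or to --bam_list when the entries are BAM files).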
Installation check
> isoquant.py --test
$ isoquant.py --test
=== Running in test mode ===
Any other option is ignored
2020-08-18 00:37:15,601 - INFO - === IsoQuant pipeline started ===
2020-08-18 00:37:15,601 - INFO - Indexing reference
2020-08-18 00:37:15,633 - INFO - Aligning /home/kazu/anaconda3/share/isoquant-1.0.0-0/tests/toy_data/MAPT.Mouse.ONT.simulated.fastq to the reference
2020-08-18 00:37:17,727 - INFO - Converting gene annotation file to .db format (takes a while)...
2020-08-18 00:37:18,263 - INFO - Gene database written to /home/kazu/Documents/isoquant_test/MAPT.Mouse.genedb.db
2020-08-18 00:37:18,264 - INFO - Provide this database next time to avoid excessive conversion
2020-08-18 00:37:18,264 - INFO - Loading gene database from /home/kazu/Documents/isoquant_test/MAPT.Mouse.genedb.db
2020-08-18 00:37:18,266 - INFO - Loading reference genome from /home/kazu/anaconda3/share/isoquant-1.0.0-0/tests/toy_data/MAPT.Mouse.reference.fasta
2020-08-18 00:37:18,283 - INFO - Processing 1 sample
2020-08-18 00:37:18,283 - INFO - Processing sample 00_MAPT.Mouse.ONT.simulated
2020-08-18 00:37:18,283 - INFO - Sample has 1 BAM file: isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.bam
2020-08-18 00:37:18,283 - INFO - Processing chromosome chr11
2020-08-18 00:37:18,367 - INFO - Combining output
2020-08-18 00:37:18,373 - INFO - Finished processing sample 00_MAPT.Mouse.ONT.simulated
2020-08-18 00:37:18,373 - INFO - Gene counts are stored in isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.gene_counts.tsv
2020-08-18 00:37:18,373 - INFO - Transcript counts are stored in isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.transcript_counts.tsv
2020-08-18 00:37:18,373 - INFO - Transcript model file isoquant_test/00_MAPT.Mouse.ONT.simulated/transcript_models.gff
2020-08-18 00:37:18,373 - INFO - Processed sample 00_MAPT.Mouse.ONT.simulated
2020-08-18 00:37:18,373 - INFO - Read assignment statistics
2020-08-18 00:37:18,373 - INFO - empty: 15
2020-08-18 00:37:18,373 - INFO - unique: 117
2020-08-18 00:37:18,374 - INFO - Transcript model statistics
2020-08-18 00:37:18,374 - INFO - known: 10
2020-08-18 00:37:18,374 - INFO - Processed 1 sample
2020-08-18 00:37:18,374 - INFO - === IsoQuant pipeline finished ===
2020-08-18 00:37:18,375 - INFO - === TEST PASSED CORRECTLY ===
The test passed.
Usage
Specify nanopore dRNA reads as input. The poly-A tails have not been trimmed (--has_polya). Provide the genome and an official (--complete_genedb) annotation in GTF format (--genedb), and use a sample label instead of the file name. The sample name is My_ONT.
isoquant.py --data_type nanopore --has_polya --stranded forward --fastq ONT.raw.fastq.gz --reference reference.fasta --genedb annotation.gtf --complete_genedb --output output_dir --threads 12 --labels My_ONT
- --data_type {assembly,pacbio_raw,pacbio_ccs,nanopore} type of data to process, supported types are: assembly, pacbio_raw, pacbio_ccs, nanopore
- --has_polya set if reads were not polyA trimmed; polyA tails will be detected and further required for transcript model construction
- --stranded reads strandness type, supported values are: forward, reverse, none
- --fastq input FASTQ file(s), each file will be treated as a separate sample; reference genome should be provided when using raw reads
- --reference reference genome in FASTA format, should be provided to compute some additional stats and when raw reads are used as an input
- --genedb gene database in gffutils DB format or GTF/GFF format
- --complete_genedb use this flag if gene annotation contains transcript and gene metafeatures, e.g. with official annotations, such as GENCODE; speeds up gene database conversion
- --output output folder, will be created automatically [default=isoquant_output]
- --threads number of threads to use
- --labels sample names to be used; input file names are used if not set
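When the same run has to be repeated for several inputs, the command can be assembled programmatically. This is a minimal sketch that uses only options listed in the help output; the function name `build_isoquant_command` and its default values are assumptions that mirror the example command, not an IsoQuant API.

```python
import shlex


def build_isoquant_command(fastq, reference, genedb, output, label,
                           data_type="nanopore", stranded="forward",
                           threads=12, has_polya=True, complete_genedb=True):
    """Assemble an isoquant.py argument list from the options described above."""
    cmd = [
        "isoquant.py",
        "--data_type", data_type,
        "--stranded", stranded,
        "--fastq", fastq,
        "--reference", reference,
        "--genedb", genedb,
        "--output", output,
        "--threads", str(threads),
        "--labels", label,
    ]
    # Boolean flags are appended only when requested.
    if has_polya:
        cmd.append("--has_polya")
    if complete_genedb:
        cmd.append("--complete_genedb")
    return cmd


if __name__ == "__main__":
    cmd = build_isoquant_command(
        fastq="ONT.raw.fastq.gz",
        reference="reference.fasta",
        genedb="annotation.gtf",
        output="output_dir",
        label="My_ONT",
    )
    # Print the command for inspection instead of running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

To execute the command rather than print it, the list can be passed to `subprocess.run(cmd, check=True)`; keeping the arguments as a list avoids shell quoting problems with unusual file names.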
A subfolder is created for each sample in the output directory.
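The per-sample layout can be walked with a few lines of Python. This is a minimal sketch that assumes the file naming seen in the test log above (`<sample>.gene_counts.tsv` and `<sample>.transcript_counts.tsv` inside each sample subfolder); the function name `find_count_tables` is hypothetical, and other IsoQuant versions may name their files differently.

```python
import sys
from pathlib import Path


def find_count_tables(output_dir):
    """Map each sample subfolder name to the count TSV files found in it."""
    tables = {}
    for sample_dir in sorted(Path(output_dir).iterdir()):
        if not sample_dir.is_dir():
            continue  # skip files that sit next to the sample subfolders
        counts = sorted(sample_dir.glob("*_counts.tsv"))
        if counts:
            tables[sample_dir.name] = counts
    return tables


if __name__ == "__main__":
    # Output folder from the example command; override via the first argument.
    out_dir = sys.argv[1] if len(sys.argv) > 1 else "output_dir"
    if Path(out_dir).is_dir():
        for sample, paths in find_count_tables(out_dir).items():
            for path in paths:
                print(f"{sample}\t{path}")
    else:
        print(f"output folder not found: {out_dir}")
```

The sketch only locates the tables; it makes no assumption about their column layout, so inspect a file before parsing it.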
Citation
https://github.com/ablab/IsoQuant
Related