macでインフォマティクス


A collection of notes on informatics for HTS (NGS).

IsoQuant: analyzing long RNA reads

 

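 IsoQuant is a tool for reference-based analysis of long RNA reads, such as those from PacBio or Oxford Nanopore. IsoQuant maps reads to a reference genome and assigns them to annotated isoforms based on their intron and exon structure. IsoQuant can also discover various modifications, such as intron retention, alternative splice sites, and skipped exons. In addition, IsoQuant performs gene, isoform, exon, and intron quantification. If reads are grouped (e.g., by cell type), counts are reported according to the provided grouping. IsoQuant also generates the transcript models it discovers, including novel ones. IsoQuant version 1.1 was released on December 11, 2020 under GPLv2 and can be downloaded from https://github.com/ablab/IsoQuant.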

 

Homepage

http://cab.spbu.ru/software/isoquant/

 

 

Installation

Tested in a Python 3.7 virtual environment created with conda (OS: Ubuntu 18.04 LTS).

Dependencies
IsoQuant requires a 64-bit Linux system or Mac OS and Python (3.7 and higher) to be pre-installed on it. You will also need:

  • gffutils
  • pysam
  • biopython
  • pyfaidx
  • pandas
  • numpy
  • minimap2
  • samtools
  • STAR (optional)
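
A quick way to confirm that the dependencies listed above are visible in the active environment is sketched below. The helper names `check_cmd`, `check_pymod`, and `report_deps` are hypothetical and not part of IsoQuant. The sketch assumes `python3` is the interpreter of the conda environment and that biopython is imported as `Bio`. STAR is optional and is not checked.

```shell
#!/bin/sh
# Hypothetical helpers (not part of IsoQuant): report whether the
# dependencies listed above are visible in the current environment.

# Return 0 if an executable is on $PATH.
check_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Return 0 if python3 can import the given module.
check_pymod() {
    python3 -c "import $1" >/dev/null 2>&1
}

# Print one "name: ok" or "name: MISSING" line per dependency.
report_deps() {
    for tool in minimap2 samtools; do
        if check_cmd "$tool"; then echo "$tool: ok"; else echo "$tool: MISSING"; fi
    done
    # biopython is imported as "Bio"; the others use their package name.
    for mod in gffutils pysam Bio pyfaidx pandas numpy; do
        if check_pymod "$mod"; then echo "$mod: ok"; else echo "$mod: MISSING"; fi
    done
}

report_deps
```

Any line ending in `MISSING` points to a package that still has to be installed, for example with conda or pip as in the commands below.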

GitHub

# bioconda
conda create -n isoquant -y python=3.7
conda activate isoquant
conda install -c bioconda isoquant -y
#or
conda install -c bioconda -c conda-forge -c isoquant isoquant -y

# gffutils and pybedtools were not installed automatically, so I added them manually
pip install gffutils pybedtools

#from source
git clone https://github.com/ablab/IsoQuant.git
cd IsoQuant
git checkout latest
pip install -r requirements.txt
#=> You also need minimap2 to be in the $PATH variable.

isoquant.py -h

$ isoquant.py -h

usage: isoquant.py [-h] [--output OUTPUT] --genedb GENEDB [--complete_genedb]

                   [--reference REFERENCE] [--index INDEX]

                   (--bam BAM [BAM ...] | --fastq FASTQ [FASTQ ...] | --bam_list BAM_LIST | --fastq_list FASTQ_LIST)

                   --data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}

                   [--stranded STRANDED] [--has_polya] [--fl_data]

                   [--full_help] [--test] [--threads THREADS]

                   [--labels LABELS [LABELS ...]] [--read_group READ_GROUP]

                   [--sqanti_output] [--count_exons]

                   [--matching_strategy {exact,precise,default,loose}]

                   [--model_construction_strategy {reliable,default,fl,all,assembly}]

 

optional arguments:

  -h, --help            show this help message and exit

  --output OUTPUT, -o OUTPUT

                        output folder, will be created automatically

                        [default=isoquant_output]

  --genedb GENEDB, -g GENEDB

                        gene database in gffutils DB format or GTF/GFF format

  --complete_genedb     use this flag if gene annotation contains transcript

                        and gene metafeatures, e.g. with official annotations,

                        such as GENCODE; speeds up gene database conversion

  --reference REFERENCE, -r REFERENCE

                        reference genome in FASTA format, should be provided

                        to compute some additional stats and when raw reads

                        are used as an input

  --index INDEX         genome index for specified aligner, should be provided

                        only when raw reads are used as an input

  --bam BAM [BAM ...]   sorted and indexed BAM file(s), each file will be

                        treated as a separate sample

  --fastq FASTQ [FASTQ ...]

                        input FASTQ file(s), each file will be treated as a

                        separate sample; reference genome should be provided

                        when using raw reads

  --bam_list BAM_LIST   text file with list of BAM files, one file per line,

                        leave empty line between samples

  --fastq_list FASTQ_LIST

                        text file with list of FASTQ files, one file per line,

                        leave empty line between samples

  --data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}, -d {assembly,pacbio_raw,pacbio_ccs,nanopore}

                        type of data to process, supported types are:

                        assembly, pacbio_raw, pacbio_ccs, nanopore

  --stranded STRANDED   reads strandness type, supported values are: forward,

                        reverse, none

  --has_polya           set if reads were not polyA trimmed; polyA tails will

                        be detected and further required for transcript model

                        construction

  --fl_data             reads represent FL transcripts; both ends of the read

                        are considered to be reliable

  --full_help           show full list of options

  --test                run IsoQuant on toy dataset

  --threads THREADS, -t THREADS

                        number of threads to use

  --labels LABELS [LABELS ...], -l LABELS [LABELS ...]

                        sample names to be used; input file names are used if

                        not set

  --read_group READ_GROUP

                        a way to group feature counts (no grouping by

                        default): by BAM file tag (tag:TAG), using additional

                        file (file:FILE:READ_COL:GROUP_COL:DELIM), using read

                        id (read_id:DELIM)

  --sqanti_output       produce SQANTI-like TSV output (requires more time)

  --count_exons         perform exon and intron counting

  --matching_strategy {exact,precise,default,loose}

                        matching strategy to use from most strict to least

  --model_construction_strategy {reliable,default,fl,all,assembly}

                        transcritp model construnction strategy to use
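
According to the help text above, `--fastq_list` (and `--bam_list`) expect a text file with one file per line and an empty line between samples. The sketch below writes such a list for two samples. All FASTQ names are placeholders rather than files from this post, and the commented IsoQuant call at the end is only an illustration built from flags shown in the help output.

```shell
#!/bin/sh
# Sketch: write a --fastq_list file for two samples.
# All FASTQ paths below are placeholders.
list_file="${1:-fastq_list.txt}"

{
    # Sample 1: two FASTQ files, one per line.
    printf '%s\n' "sampleA_run1.fastq.gz" "sampleA_run2.fastq.gz"
    # An empty line separates samples.
    printf '\n'
    # Sample 2: a single FASTQ file.
    printf '%s\n' "sampleB_run1.fastq.gz"
} > "$list_file"

cat "$list_file"

# The list would then take the place of --fastq, for example:
# isoquant.py --data_type nanopore --fastq_list fastq_list.txt \
#     --reference reference.fasta --genedb annotation.gtf --output output_dir
```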

 

Installation check

isoquant.py --test

$ isoquant.py --test

=== Running in test mode === 

Any other option is ignored 

2020-08-18 00:37:15,601 - INFO -  === IsoQuant pipeline started === 

2020-08-18 00:37:15,601 - INFO - Indexing reference

2020-08-18 00:37:15,633 - INFO - Aligning /home/kazu/anaconda3/share/isoquant-1.0.0-0/tests/toy_data/MAPT.Mouse.ONT.simulated.fastq to the reference

2020-08-18 00:37:17,727 - INFO - Converting gene annotation file to .db format (takes a while)...

2020-08-18 00:37:18,263 - INFO - Gene database written to /home/kazu/Documents/isoquant_test/MAPT.Mouse.genedb.db

2020-08-18 00:37:18,264 - INFO - Provide this database next time to avoid excessive conversion

2020-08-18 00:37:18,264 - INFO - Loading gene database from /home/kazu/Documents/isoquant_test/MAPT.Mouse.genedb.db

2020-08-18 00:37:18,266 - INFO - Loading reference genome from /home/kazu/anaconda3/share/isoquant-1.0.0-0/tests/toy_data/MAPT.Mouse.reference.fasta

2020-08-18 00:37:18,283 - INFO - Processing 1 sample

2020-08-18 00:37:18,283 - INFO - Processing sample 00_MAPT.Mouse.ONT.simulated

2020-08-18 00:37:18,283 - INFO - Sample has 1 BAM file: isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.bam

2020-08-18 00:37:18,283 - INFO - Processing chromosome chr11

2020-08-18 00:37:18,367 - INFO - Combining output

2020-08-18 00:37:18,373 - INFO - Finished processing sample 00_MAPT.Mouse.ONT.simulated

2020-08-18 00:37:18,373 - INFO - Gene counts are stored in isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.gene_counts.tsv

2020-08-18 00:37:18,373 - INFO - Transcript counts are stored in isoquant_test/00_MAPT.Mouse.ONT.simulated/00_MAPT.Mouse.ONT.simulated.transcript_counts.tsv

2020-08-18 00:37:18,373 - INFO - Transcript model file isoquant_test/00_MAPT.Mouse.ONT.simulated/transcript_models.gff

2020-08-18 00:37:18,373 - INFO - Processed sample 00_MAPT.Mouse.ONT.simulated

2020-08-18 00:37:18,373 - INFO - Read assignment statistics

2020-08-18 00:37:18,373 - INFO - empty: 15

2020-08-18 00:37:18,373 - INFO - unique: 117

2020-08-18 00:37:18,374 - INFO - Transcript model statistics

2020-08-18 00:37:18,374 - INFO - known: 10

2020-08-18 00:37:18,374 - INFO - Processed 1 sample

2020-08-18 00:37:18,374 - INFO -  === IsoQuant pipeline finished === 

2020-08-18 00:37:18,375 - INFO -  === TEST PASSED CORRECTLY === 

O.K 
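
For scripted setups, the pass/fail decision can be read from the log instead of by eye. The sketch below looks for the `TEST PASSED CORRECTLY` marker that appears in the log above. `test_passed` is a hypothetical helper name, and the log file name in the usage comment is a placeholder.

```shell
#!/bin/sh
# Sketch: decide pass/fail from a saved IsoQuant test log.
# The marker string is taken from the log shown above.
test_passed() {
    grep -q "TEST PASSED CORRECTLY" "$1"
}

# Typical use (assumes isoquant.py is on $PATH):
#   isoquant.py --test > isoquant_test.log 2>&1
#   test_passed isoquant_test.log && echo "IsoQuant test OK"
```

Redirecting both stdout and stderr into the log keeps the check independent of which stream IsoQuant writes its messages to.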

 

 

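Usage

This example takes nanopore dRNA (direct RNA) reads as input. The polyA tails have not been trimmed (--has_polya). A reference genome and a GTF annotation (--genedb) are supplied, the annotation is an official one (--complete_genedb), and a sample label is used instead of the file name. The sample name is My_ONT.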

isoquant.py --data_type nanopore --has_polya --stranded forward --fastq ONT.raw.fastq.gz --reference reference.fasta --genedb annotation.gtf --complete_genedb --output output_dir --threads 12 --labels My_ONT
  • --data_type {assembly,pacbio_raw,pacbio_ccs,nanopore}   type of data to process, supported types are: assembly, pacbio_raw, pacbio_ccs, nanopore
  • --has_polya   set if reads were not polyA trimmed; polyA tails will be detected and further required for transcript model construction
  • --stranded   reads strandness type, supported values are: forward, reverse, none
  • --fastq   input FASTQ file(s), each file will be treated as a separate sample; reference genome should be provided when using raw reads
  • --reference   reference genome in FASTA format, should be provided to compute some additional stats and when raw reads are used as an input
  • --genedb   gene database in gffutils DB format or GTF/GFF format
  • --complete_genedb   use this flag if gene annotation contains transcript and gene metafeatures, e.g. with official annotations, such as GENCODE; speeds up gene database conversion
  • --output   output folder, will be created automatically [default=isoquant_output]
  • --threads   number of threads to use
  • --labels   sample names to be used; input file names are used if not set
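
When alignments already exist, the help output indicates that `--bam` can be given instead of `--fastq`, and that the BAM files must be sorted and indexed. The wrapper below is a minimal sketch of such a run, not a tested recipe: `run_isoquant_bam` is a hypothetical helper, the file names in the usage comment are placeholders, the `pacbio_ccs` data type is only an example, and the index check assumes a `.bai` file next to the BAM.

```shell
#!/bin/sh
# Sketch: run IsoQuant on an existing BAM file (hypothetical wrapper).
# Every isoquant.py flag used here appears in the help output above.
run_isoquant_bam() {
    bam="$1"
    genedb="$2"
    outdir="$3"

    # --bam expects sorted and indexed input, so look for an index first.
    # Assumption: the index is named <file>.bam.bai or <file>.bai.
    if [ ! -f "$bam.bai" ] && [ ! -f "${bam%.bam}.bai" ]; then
        echo "no index found for $bam; try: samtools index $bam" >&2
        return 1
    fi

    # --complete_genedb is only appropriate for annotations that contain
    # gene and transcript records, such as GENCODE (see the help text).
    isoquant.py --data_type pacbio_ccs --bam "$bam" \
        --genedb "$genedb" --complete_genedb \
        --output "$outdir" --threads 12
}

# Example (placeholder file names):
#   run_isoquant_bam aligned.sorted.bam annotation.gtf output_bam
```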

 

A subfolder is created for each sample in the output directory.
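
To get an overview of the results, the count tables in each sample subfolder can be listed. The sketch below assumes the layout seen in the test log earlier in this post, where each sample folder holds files ending in `_counts.tsv`. `list_count_tables` is a hypothetical helper, and the internal format of the tables is not assumed.

```shell
#!/bin/sh
# Sketch: list the count tables found in each per-sample subfolder.
# Assumes the layout seen in the test log: <output>/<sample>/<sample>.*_counts.tsv
list_count_tables() {
    outdir="$1"
    for sample_dir in "$outdir"/*/; do
        [ -d "$sample_dir" ] || continue
        for table in "$sample_dir"*_counts.tsv; do
            [ -f "$table" ] || continue
            # Print the path and its number of lines, separated by a tab.
            printf '%s\t%s\n' "$table" "$(wc -l < "$table" | tr -d ' ')"
        done
    done
}

# Example: list_count_tables output_dir
```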

 

Citation

https://github.com/ablab/IsoQuant

 

Related