macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

RNA seqのリードカウント HTSeq-count

2020 8/15 condaによるインストールとhelp追記

2021 8/6 リンク消去

 

HTSeqはNGSデータの各種ハンドリングができるツール。ここではその1つhtseq-countコマンドを紹介する。htseq-countはリードのアライメントデータからカウントデータを出力するために使う。htseq-countを使うと、bamから数分でカウントデータを得ることができる。samtoolsのAPIを使い、bamの読み書きを行なっている(リンク)。2015年にBioinformaticsに論文が発表された。 

 

公式サイト

http://htseq.readthedocs.io/en/release_0.9.1/

 

インストール

依存

piplかcondaで導入できる。

pip install htseq

#bioconda(依存も含めて導入される)
conda install -c bioconda -y htseq

> htseq-count -h

$ htseq-count -h

usage: htseq-count [options] alignment_file gff_file

 

This script takes one or more alignment files in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it. See http://htseq.readthedocs.io/en/master/count.html for details.

 

positional arguments:

  samfilenames          Path to the SAM/BAM files containing the mapped reads. If '-' is selected, read from standard input

  featuresfilename      Path to the GTF file containing the features

 

optional arguments:

  -h, --help            show this help message and exit

  -f {sam,bam,auto}, --format {sam,bam,auto}

                        Type of <alignment_file> data. DEPRECATED: file format is detected automatically. This option is ignored.

  -r {pos,name}, --order {pos,name}

                        'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single-end data.

  --max-reads-in-buffer MAX_BUFFER_SIZE

                        When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect for single end or paired end sorted by name

  -s {yes,no,reverse}, --stranded {yes,no,reverse}

                        Whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation

  -a MINAQUAL, --minaqual MINAQUAL

                        Skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10). MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map the reads.

  -t FEATURETYPE, --type FEATURETYPE

                        Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)

  -i IDATTR, --idattr IDATTR

                        GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id). All feature of the right type (see -t option) within the same GTF attribute will be added together. The typical way of using this option is to count all exonic reads from each gene and

                        add the exons but other uses are possible as well.

  --additional-attr ADDITIONAL_ATTR

                        Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name). Use multiple times for more than one additional attribute. These attributes are only used as annotations in the output, while the determination of how the counts are added together is

                        done based on option -i.

  -m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}

                        Mode to handle reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union)

  --nonunique {none,all,fraction,random}

                        Whether and how to score reads that are not uniquely aligned or ambiguously assigned to features (choices: none, all, fraction, random; default: none)

  --secondary-alignments {score,ignore}

                        Whether to score secondary alignments (0x100 flag)

  --supplementary-alignments {score,ignore}

                        Whether to score supplementary alignments (0x800 flag)

  -o SAMOUTS, --samout SAMOUTS

                        Write out all SAM alignment records into SAM/BAM files (one per input file needed), annotating each line with its feature assignment (as an optional field with tag 'XF'). See the -p option to use BAM instead of SAM.

  -p {SAM,BAM,sam,bam}, --samout-format {SAM,BAM,sam,bam}

                        Format to use with the --samout option.

  -d OUTPUT_DELIMITER, --delimiter OUTPUT_DELIMITER

                        Column delimiter in output (default: TAB).

  -c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME

                        Filename to output the counts to instead of stdout.

  --append-output       Append counts output. This option is useful if you have already creates a TSV/CSV/similar file with a header for your samples (with additional columns for the feature name and any additionl attributes) and want to fill in the rest of the file.

  -n NPROCESSES, --nprocesses NPROCESSES

                        Number of parallel CPU processes to use (default: 1).

  --feature-query FEATURE_QUERY

                        Restrict to features descibed in this expression. Currently supports a single kind of expression: attribute == "one attr" to restrict the GFF to a single gene or transcript, e.g. --feature-query 'gene_name == "ACTB"' - notice the single quotes around the argument of this

                        option and the double quotes around the gene name. Broader queries might become available in the future.

  -q, --quiet           Suppress progress report

  --version             Show software version and exit

 

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology Laboratory (EMBL) and Fabio Zanini (fabio.zanini@unsw.edu.au), UNSW Sydney. (c) 2010-2020. Released under the terms of the GNU General Public License v3. Part of the 'HTSeq' framework, version 0.12.4.

htseq-count-barcodes -h

$ htseq-count-barcodes -h

usage: htseq-count-barcodes [options] alignment_file gff_file

 

This script takes one alignment file in SAM/BAM format and a feature file in GFF format and calculates for each feature the number of reads mapping to it, accounting for barcodes. See http://htseq.readthedocs.io/en/master/count.html for details.

 

positional arguments:

  samfilename           Path to the SAM/BAM file containing the barcoded, mapped reads. If '-' is selected, read from standard input

  featuresfilename      Path to the GTF file containing the features

 

optional arguments:

  -h, --help            show this help message and exit

  -f {sam,bam,auto}, --format {sam,bam,auto}

                        Type of <alignment_file> data. DEPRECATED: file format is detected automatically. This option is ignored.

  -r {pos,name}, --order {pos,name}

                        'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single-end data.

  --max-reads-in-buffer MAX_BUFFER_SIZE

                        When <alignment_file> is paired end sorted by position, allow only so many reads to stay in memory until the mates are found (raising this number will use more memory). Has no effect for single end or paired end sorted by name

  -s {yes,no,reverse}, --stranded {yes,no,reverse}

                        Whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand interpretation

  -a MINAQUAL, --minaqual MINAQUAL

                        Skip all reads with MAPQ alignment quality lower than the given minimum value (default: 10). MAPQ is the 5th column of a SAM/BAM file and its usage depends on the software used to map the reads.

  -t FEATURETYPE, --type FEATURETYPE

                        Feature type (3rd column in GTF file) to be used, all features of other type are ignored (default, suitable for Ensembl GTF files: exon)

  -i IDATTR, --idattr IDATTR

                        GTF attribute to be used as feature ID (default, suitable for Ensembl GTF files: gene_id)

  --additional-attr ADDITIONAL_ATTR

                        Additional feature attributes (default: none, suitable for Ensembl GTF files: gene_name). Use multiple times for each different attribute

  -m {union,intersection-strict,intersection-nonempty}, --mode {union,intersection-strict,intersection-nonempty}

                        Mode to handle reads overlapping more than one feature (choices: union, intersection-strict, intersection-nonempty; default: union)

  --nonunique {none,all}

                        Whether to score reads that are not uniquely aligned or ambiguously assigned to features

  --secondary-alignments {score,ignore}

                        Whether to score secondary alignments (0x100 flag)

  --supplementary-alignments {score,ignore}

                        Whether to score supplementary alignments (0x800 flag)

  -o SAMOUT, --samout SAMOUT

                        Write out all SAM alignment records into aSAM/BAM file, annotating each line with its feature assignment (as an optional field with tag 'XF'). See the -p option to use BAM instead of SAM.

  -p {SAM,BAM,sam,bam}, --samout-format {SAM,BAM,sam,bam}

                        Format to use with the --samout option.

  -d OUTPUT_DELIMITER, --delimiter OUTPUT_DELIMITER

                        Column delimiter in output (default: TAB).

  -c OUTPUT_FILENAME, --counts_output OUTPUT_FILENAME

                        TSV/CSV filename to output the counts to instead of stdout.

  --cell-barcode CB_TAG

                        BAM tag used for the cell barcode (default compatible with 10X Genomics Chromium is CB).

  --UMI UB_TAG          BAM tag used for the unique molecular identifier, also known as molecular barcode (default compatible with 10X Genomics Chromium is UB).

  -q, --quiet           Suppress progress report

  --version             Show software version and exit

 

Written by Simon Anders (sanders@fs.tum.de), European Molecular Biology Laboratory (EMBL) and Fabio Zanini (fabio.zanini@unsw.edu.au), UNSW Sydney. (c) 2010-2020. Released under the terms of the GNU General Public License v3. Part of the 'HTSeq' framework, version 0.12.4.

htseq-qa -h

$ htseq-qa -h

usage: htseq-qa [-h] [-t {sam,bam,solexa-export,fastq,solexa-fastq}] [-o OUTFILE] [-r READLEN] [-g GAMMA] [-n] [-m MAXQUAL] [--primary-only] [--max-records MAX_RECORDS] readfilename

 

This script take a file with high-throughput sequencing reads (supported formats: SAM, Solexa _export.txt, FASTQ, Solexa _sequence.txt) and performs a simply quality assessment by producing plots showing the distribution of called bases and base-call quality scores by position within the reads. The

plots are output as a PDF file.

 

positional arguments:

  readfilename          The file to count reads in (SAM/BAM or Fastq)

 

optional arguments:

  -h, --help            show this help message and exit

  -t {sam,bam,solexa-export,fastq,solexa-fastq}, --type {sam,bam,solexa-export,fastq,solexa-fastq}

                        type of read_file (one of: sam [default], bam, solexa-export, fastq, solexa-fastq)

  -o OUTFILE, --outfile OUTFILE

                        output filename (default is <read_file>.pdf)

  -r READLEN, --readlength READLEN

                        the maximum read length (when not specified, the script guesses from the file

  -g GAMMA, --gamma GAMMA

                        the gamma factor for the contrast adjustment of the quality score plot

  -n, --nosplit         do not split reads in unaligned and aligned ones

  -m MAXQUAL, --maxqual MAXQUAL

                        the maximum quality score that appears in the data (default: 41)

  --primary-only        For SAM/BAM input files, ignore alignments that are not primary. This only affects 'multimapper' reads that align to several regions in the genome. By choosing this option, each read will only count as one; without this option, each of its alignments counts as one.

  --max-records MAX_RECORDS

                        Limit the analysis to the first N reads/alignments.

 

 

ラン

シングルエンドのカウント。

htseq-count -f bam align.bam reference.gtf > count.txt
  •  -f type of <alignment_file> data, either 'sam' or 'bam' (default: sam)

 

ペアードエンドのカウント。入力するbamはあらかじめsamtoolsでソートされている必要がある。

htseq-count -f bam -r pos align.bam reference.gtf > count.txt
  • -r  'pos' or 'name'. Sorting order of <alignment_file> (default: name). Paired-end sequencing data must be sorted either by position or by read name, and the sorting order must be specified. Ignored for single- end data.
  • -f type of <alignment_file> data, either 'sam' or 'bam' (default: sam)
  • -s whether the data is from a strand-specific assay. Specify 'yes', 'no', or 'reverse' (default: yes). 'reverse' means 'yes' with reversed strand
  • -a skip all reads with alignment quality lower than the given minimum value (default: 10)

 

 

引用

HTSeq—a Python framework to work with high-throughput sequencing data

Simon Anders,* Paul Theodor Pyl, and Wolfgang Huber

Bioinformatics. 2015 Jan 15; 31(2): 166–169.