macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

RNA seqのリードカウント featureCounts

2019 6/19 インストール追記

2019 6/19 追記

2019 8/14 help追加

2019 8/15 run log追加

2020 11/1 コマンド追加

 

RNA reqのリードカウントツールを紹介する。

featureCounts

 

ダウンロード

sourceforgeリンク

https://sourceforge.net/projects/subread/files/subread-1.5.2/

 

インストール

ソースコードをダウンロードして解凍し、/srcに移動。macでは以下のようにしてビルドする。

cd subread-1.4.6-source/src
make -f Makefile.MacOS

#bioconda (link)
conda install -c bioconda subread

linux公式マニュアル参照。 

featureCounts

$ featureCounts

 

Version 1.6.4

 

Usage: featureCounts [options] -a <annotation_file> -o <output_file> input_file1 [input_file2] ... 

 

## Mandatory arguments:

 

  -a <string>         Name of an annotation file. GTF/GFF format by default. See

                      -F option for more format information. Inbuilt annotations

                      (SAF format) is available in 'annotation' directory of the

                      package. Gzipped file is also accepted.

 

  -o <string>         Name of output file including read counts. A separate file

                      including summary statistics of counting results is also

                      included in the output ('<string>.summary'). Both files

                      are in tab delimited format.

 

  input_file1 [input_file2] ...   A list of SAM or BAM format files. They can be

                      either name or location sorted. If no files provided,

                      <stdin> input is expected. Location-sorted paired-end reads

                      are automatically sorted by read names.

 

## Optional arguments:

# Annotation

 

  -F <string>         Specify format of the provided annotation file. Acceptable

                      formats include 'GTF' (or compatible GFF format) and

                      'SAF'. 'GTF' by default.  For SAF format, please refer to

                      Users Guide.

 

  -t <string>         Specify feature type in GTF annotation. 'exon' by 

                      default. Features used for read counting will be 

                      extracted from annotation using the provided value.

 

  -g <string>         Specify attribute type in GTF annotation. 'gene_id' by 

                      default. Meta-features used for read counting will be 

                      extracted from annotation using the provided value.

 

  --extraAttributes   Extract extra attribute types from the provided GTF

                      annotation and include them in the counting output. These

                      attribute types will not be used to group features. If

                      more than one attribute type is provided they should be

                      separated by comma.

 

  -A <string>         Provide a chromosome name alias file to match chr names in

                      annotation with those in the reads. This should be a two-

                      column comma-delimited text file. Its first column should

                      include chr names in the annotation and its second column

                      should include chr names in the reads. Chr names are case

                      sensitive. No column header should be included in the

                      file.

 

# Level of summarization

 

  -f                  Perform read counting at feature level (eg. counting 

                      reads for exons rather than genes).

 

# Overlap between reads and features

 

  -O                  Assign reads to all their overlapping meta-features (or 

                      features if -f is specified).

 

  --minOverlap <int>  Minimum number of overlapping bases in a read that is

                      required for read assignment. 1 by default. Number of

                      overlapping bases is counted from both reads if paired

                      end. If a negative value is provided, then a gap of up

                      to specified size will be allowed between read and the

                      feature that the read is assigned to.

 

  --fracOverlap <float> Minimum fraction of overlapping bases in a read that is

                      required for read assignment. Value should be within range

                      [0,1]. 0 by default. Number of overlapping bases is

                      counted from both reads if paired end. Both this option

                      and '--minOverlap' option need to be satisfied for read

                      assignment.

 

  --fracOverlapFeature <float> Minimum fraction of overlapping bases in a

                      feature that is required for read assignment. Value

                      should be within range [0,1]. 0 by default.

 

  --largestOverlap    Assign reads to a meta-feature/feature that has the 

                      largest number of overlapping bases.

 

  --nonOverlap <int>  Maximum number of non-overlapping bases in a read (or a

                      read pair) that is allowed when being assigned to a

                      feature. No limit is set by default.

 

  --nonOverlapFeature <int> Maximum number of non-overlapping bases in a feature

                      that is allowed in read assignment. No limit is set by

                      default.

 

  --readExtension5 <int> Reads are extended upstream by <int> bases from their

                      5' end.

 

  --readExtension3 <int> Reads are extended upstream by <int> bases from their

                      3' end.

 

  --read2pos <5:3>    Reduce reads to their 5' most base or 3' most base. Read

                      counting is then performed based on the single base the 

                      read is reduced to.

 

# Multi-mapping reads

 

  -M                  Multi-mapping reads will also be counted. For a multi-

                      mapping read, all its reported alignments will be 

                      counted. The 'NH' tag in BAM/SAM input is used to detect 

                      multi-mapping reads.

 

# Fractional counting

 

  --fraction          Assign fractional counts to features. This option must

                      be used together with '-M' or '-O' or both. When '-M' is

                      specified, each reported alignment from a multi-mapping

                      read (identified via 'NH' tag) will carry a fractional

                      count of 1/x, instead of 1 (one), where x is the total

                      number of alignments reported for the same read. When '-O'

                      is specified, each overlapping feature will receive a

                      fractional count of 1/y, where y is the total number of

                      features overlapping with the read. When both '-M' and

                      '-O' are specified, each alignment will carry a fractional

                      count of 1/(x*y).

 

# Read filtering

 

  -Q <int>            The minimum mapping quality score a read must satisfy in

                      order to be counted. For paired-end reads, at least one

                      end should satisfy this criteria. 0 by default.

 

  --splitOnly         Count split alignments only (ie. alignments with CIGAR

                      string containing 'N'). An example of split alignments is

                      exon-spanning reads in RNA-seq data.

 

  --nonSplitOnly      If specified, only non-split alignments (CIGAR strings do

                      not contain letter 'N') will be counted. All the other

                      alignments will be ignored.

 

  --primary           Count primary alignments only. Primary alignments are 

                      identified using bit 0x100 in SAM/BAM FLAG field.

 

  --ignoreDup         Ignore duplicate reads in read counting. Duplicate reads 

                      are identified using bit Ox400 in BAM/SAM FLAG field. The 

                      whole read pair is ignored if one of the reads is a 

                      duplicate read for paired end data.

 

# Strandness

 

  -s <int or string>  Perform strand-specific read counting. A single integer

                      value (applied to all input files) or a string of comma-

                      separated values (applied to each corresponding input

                      file) should be provided. Possible values include:

                      0 (unstranded), 1 (stranded) and 2 (reversely stranded).

                      Default value is 0 (ie. unstranded read counting carried

                      out for all input files).

 

# Exon-exon junctions

 

  -J                  Count number of reads supporting each exon-exon junction.

                      Junctions were identified from those exon-spanning reads

                      in the input (containing 'N' in CIGAR string). Counting

                      results are saved to a file named '<output_file>.jcounts'

 

  -G <string>         Provide the name of a FASTA-format file that contains the

                      reference sequences used in read mapping that produced the

                      provided SAM/BAM files. This optional argument can be used

                      with '-J' option to improve read counting for junctions.

 

# Parameters specific to paired end reads

 

  -p                  If specified, fragments (or templates) will be counted

                      instead of reads. This option is only applicable for

                      paired-end reads.

 

  -B                  Only count read pairs that have both ends aligned.

 

  -P                  Check validity of paired-end distance when counting read 

                      pairs. Use -d and -D to set thresholds.

 

  -d <int>            Minimum fragment/template length, 50 by default.

 

  -D <int>            Maximum fragment/template length, 600 by default.

 

  -C                  Do not count read pairs that have their two ends mapping 

                      to different chromosomes or mapping to same chromosome 

                      but on different strands.

 

  --donotsort         Do not sort reads in BAM/SAM input. Note that reads from 

                      the same pair are required to be located next to each 

                      other in the input.

 

# Number of CPU threads

 

  -T <int>            Number of the threads. 1 by default.

 

# Read groups

 

  --byReadGroup       Assign reads by read group. "RG" tag is required to be

                      present in the input BAM/SAM files.

                      

 

# Long reads

 

  -L                  Count long reads such as Nanopore and PacBio reads. Long

                      read counting can only run in one thread and only reads

                      (not read-pairs) can be counted. There is no limitation on

                      the number of 'M' operations allowed in a CIGAR string in

                      long read counting.

 

# Assignment results for each read

 

  -R <format>         Output detailed assignment results for each read or read-

                      pair. Results are saved to a file that is in one of the

                      following formats: CORE, SAM and BAM. See Users Guide for

                      more info about these formats.

 

  --Rpath <string>    Specify a directory to save the detailed assignment

                      results. If unspecified, the directory where counting

                      results are saved is used.

 

# Miscellaneous

 

  --tmpDir <string>   Directory under which intermediate files are saved (later

                      removed). By default, intermediate files will be saved to

                      the directory specified in '-o' argument.

 

  --maxMOp <int>      Maximum number of 'M' operations allowed in a CIGAR

                      string. 10 by default. Both 'X' and '=' are treated as 'M'

                      and adjacent 'M' operations are merged in the CIGAR

                      string.

 

  --verbose           Output verbose information for debugging, such as un-

                      matched chromosome/contig names.

 

  -v                  Output version of the program.

 

 

 

マニュアル

featureCounts - quick guide

WEHI Bioinformatics - featureCounts

  

実行方法

inputはsam/bamファイルと、リファレンスのカウントするfeatureの場所を記したgtfファイルか、よりシンプルなSAFというフォーマットである。タブ区切りの5フィールドの形式で、公式サイトに例がある。

GeneID	Chr	Start	End	Strand
497097	chr1	3204563	3207049	-
497097	chr1	3411783	3411982	-
497097	chr1	3660633	3661579	-

 

シングルエンドのデータをカウント。例えば3つのbamからカウントデータを取得する。

featureCounts -T 8 -t exon -g gene_id -a annotation.gtf -o counts.txt input1.bam input2.bam input3.bam
  • -T Number of the threads. 1 by default.
  • -t Specify the feature type. Only rows which have the matched matched feature type in the provided GTF annotation file will be included for read counting. `exon' by default.
  • -g  Specify the attribute type used to group features (eg. exons) into meta-features (eg. genes), when GTF annotation is provided. `gene_id' by default.
  • -a Give the name of the annotation file. The program assumes that the provided annotation file is in GTF format. Use -F option to specify other annotation formats.
  • -o Give the name of the output file.

6サンプルカウントした結果のキャプチャが以下となる。

f:id:kazumaxneo:20170710110348j:plain

先頭部分数十行を載せている。出力はタブ区切りのテキストファイルで、1行目はコメント行(#)である。2行目以下は以下のような並びになっている。

‘Geneid’, ‘Chr’, ‘Start’, ‘End’, ‘Strand’,‘Length’,‘sample1_count’,‘sample2_count’ ...

2列目に1;1;1;1;1;1とあるのは、exsonが6あることを意味している。lengthは1つの遺伝子のオーバーラップをのぞいたexonの合計サイズ (bp) である。右端にあるのがカウントの列である。サンプル数分だけできる。

 

 

strand specificにカウント。

featureCounts -s 1 -T 8 -t exon -g gene_id -a annotation.gtf -o counts.txt <input.bam>
  • -s  Indicate if strand-specific read counting should be performed. It has three possible values: 0 (unstranded), 1 (stranded) and 2 (reversely stranded). 0 by default.

"-s 1"はfeatrureと同じ向きにアライメントされたリードだけカウント。"-s 2"はfeatureの反対向きにアライメントされたリードだけカウント。

 

ペアリードのフラグメントをカウント。

featureCounts -p -T 8 -t exon -g gene_id -a annotation.gtf -o counts.txt <input_PE.bam>
  • -p If specified, fragments (or templates) will be counted instead of reads. This option is only applicable for paired-end reads. The two reads from the same fragment must be adjacent to each other in the provided SAM/BAM file.

 

複数ファイルを同時にカウントして出力する。

featureCounts -T 8 -t exon -g gene_id -a annotation.gtf -o counts.txt library1.bam library2.bam library3.bam

bam/samをスペース区切りで記載する。

 

ペアエンドの両方がマッピングされたものだけカウント。

featureCounts -p -B -t exon -g gene_id -a annotation.gtf -o counts.txt <input_PE.bam>
  • -B If specified, only fragments that have both ends successfully aligned will be considered for summarization. This option is only applicable for paired-end reads.

 

ペアエンドの両方がマッピングされたものだけカウント。unstranded。featureは CDS、rRNA、tmRNA、tRNAをカウント(フィーチャーは実際にannotation.gtfを開いて確認すること*1)。

featureCounts -p -B -s 0 -T 8 -t CDS,rRNA,tmRNA,tRNA -g transcript_id -a annotation.gtf -o counts.txt <input_PE.bam>

紹介したオプション以外に、インサートサイズに下限、上限を設けるオプション(-P-d, -Dや、異なるクロモソームにアライメントされたpaired-endを排除するオプション(-C)、マッピングクオリティの閾値-Q)、複数箇所にマッピングされたリードのカウント(-M)、など多様なオプションがあります。詳細は公式quickマニュアルを確認してください。

 

実行時は標準出力にlogも表示される。

 

        ==========     _____ _    _ ____  _____  ______          _____  

        =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 

          =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |

            ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |

              ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |

        ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/

  v1.6.4

 

//========================== featureCounts setting ===========================\\

||                                                                            ||

||             Input files : 6 BAM files                                      ||

||                           S SRX2391505.bam                                 ||

||                           S SRX2391506.bam                                 ||

||                           S SRX2391507.bam                                 ||

||                           S SRX2391512.bam                                 ||

||                           S SRX2391513.bam                                 ||

||                           S SRX2391514.bam                                 ||

||                                                                            ||

||             Output file : counts.txt                                       ||

||                 Summary : counts.txt.summary                               ||

||              Annotation : Escherichia_coli_k_12.ASM80076v1.44.gtf (GTF)    ||

||      Dir for temp files : ./                                               ||

||                                                                            ||

||                 Threads : 8                                                ||

||                   Level : meta-feature level                               ||

||              Paired-end : no                                               ||

||      Multimapping reads : not counted                                      ||

|| Multi-overlapping reads : not counted                                      ||

||   Min overlapping bases : 1                                                ||

||                                                                            ||

\\============================================================================//

 

//================================= Running ==================================\\

||                                                                            ||

|| Load annotation file Escherichia_coli_k_12.ASM80076v1.44.gtf ...           ||

||    Features : 4258                                                         ||

||    Meta-features : 4257                                                    ||

||    Chromosomes/contigs : 1                                                 ||

||                                                                            ||

|| Process BAM file SRX2391505.bam...                                         ||

||    Single-end reads are included.                                          ||

||    Assign alignments to features...                                        ||

||    Total alignments : 11635683                                             ||

||    Successfully assigned alignments : 8362707 (71.9%)                      ||

||    Running time : 0.03 minutes                                             ||

||                                                                            ||

|| Process BAM file SRX2391506.bam...                                         ||

||    Single-end reads are included.                                          ||

||    Assign alignments to features...                                        ||

||    Total alignments : 9923480                                              ||

||    Successfully assigned alignments : 7888505 (79.5%)                      ||

||    Running time : 0.02 minutes                                             ||

||                                                                            ||

|| Process BAM file SRX2391507.bam...                                         ||

||    Single-end reads are included.                                          ||

||    Assign alignments to features...                                        ||

||    Total alignments : 15194419                                             ||

||    Successfully assigned alignments : 12625859 (83.1%)                     ||

||    Running time : 0.04 minutes                                             ||

||                                                                            ||

|| Process BAM file SRX2391512.bam...                                         ||

||    Single-end reads are included.                                          ||

||    Assign alignments to features...                                        ||

||    Total alignments : 14376088                                             ||

||    Successfully assigned alignments : 11823380 (82.2%)                     ||

||    Running time : 0.03 minutes                                             ||

||                                                                            ||

|| Process BAM file SRX2391513.bam...                                         ||

||    Single-end reads are included.                                          ||

||    Assign alignments to features...                                        ||

||    Total alignments : 13734901                                             ||

||    Successfully assigned alignments : 11896845 (86.6%)                     ||

||    Running time : 0.03 minutes                                             ||

||                                                                            ||

|| Process BAM file SRX2391514.bam...                                         ||

||    Single-end reads are included.                                          ||

||    Assign alignments to features...                                        ||

||    Total alignments : 11465553                                             ||

||    Successfully assigned alignments : 9222513 (80.4%)                      ||

||    Running time : 0.03 minutes                                             ||

||                                                                            ||

||                                                                            ||

|| Summary of counting results can be found in file "counts.txt.summary"      ||

||                                                                            ||

\\============================================================================//

 

引用

featureCounts: an efficient general purpose program for assigning sequence reads to genomic features.

Liao Y, Smyth GK and Shi W (2014).  

Bioinformatics. 2014 Apr 1;30(7):923-30.

 

関連

 

オプションについてはbioinformaticsさんのページが詳しいです。


*1

featureCountsラン開始時にfeature数が画面に表示されるので、合っているか確認すること。