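MutEnricher: a flexible toolset for somatic mutation enrichment analysis of tumor whole genomes

Analysis of somatic mutations from tumor whole exomes has accelerated the discovery of novel cancer driver genes. However, about 98% of the genome is non-coding, and it includes regulatory elements whose normal cellular functions can be disrupted by mutation. Whole genome sequencing (WGS), by contrast, enables the identification of non-coding somatic mutations and the estimation of background mutation rates, yet few computational tools exist for specifically interrogating this space.

The authors present MutEnricher, a flexible toolset for investigating somatic mutation enrichment in both coding and non-coding genomic regions from WGS data. MutEnricher contains two distinct modules for these purposes and provides customizable options for calculating sample- and feature-specific background mutation rates. In addition, both MutEnricher modules calculate feature-level and local ("hotspot") somatic mutation enrichment statistics.

MutEnricher is a flexible software package for investigating somatic mutation enrichment. It is implemented in Python, freely available, efficiently parallelizable, and highly configurable to a researcher's specific needs. MutEnricher is available online at https://github.com/asoltis/MutEnricher.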

My software paper describing MutEnricher, a tool for investigating somatic mutation recurrence in both protein coding and non-coding genomic regions, was published by BMC Bioinformatics! https://t.co/I6xEL3RnhC
— Anthony Soltis (@ToroSoltis) August 1, 2020

Wiki

https://github.com/asoltis/MutEnricher/wiki

Tutorial

https://github.com/asoltis/MutEnricher/wiki/Tutorial

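From the manual

MutEnricher is a flexible toolset that performs somatic mutation enrichment analysis of protein-coding and non-coding genomic loci from whole genome sequencing (WGS) data. It is implemented in Python and can be used with Python 2 and 3. MutEnricher is also provided as a Docker image. It contains two distinct modules:

coding - performs somatic enrichment analysis of non-silent mutations in protein-coding genes.
noncoding - performs enrichment analysis of non-coding regions.

A subdirectory of the repository contains two helper functions for generating the covariate files used by MutEnricher's covariate clustering functions.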

Installation

Dependencies

This software has been explicitly tested with Python 2.7 (versions 2.7.12 and greater) and Python 3.7 (version 3.7.3) on Red Hat >=6, Ubuntu 16 LTS, and macOS Sierra. Compatibility with Python versions < 2.7 is likely possible, though untested.

Github

# Docker image
docker pull asoltis/mutenricher:latest

> mutEnricher.py coding -h

usage: python mutEnricher.py coding [-h] [-o OUTDIR] [--prefix PREFIX]
                                    [--gene-field GENEFIELD] [-g GENE_LIST]
                                    [--stat-type STAT_TYPE]
                                    [--bg-vars-type BG_VARS_TYPE] [--maf MAF]
                                    [--exome-only] [--anno-type TTYPE]
                                    [-m MAP_REGIONS] [-p NPROCESSORS]
                                    [--snps-only] [-c COV_FN] [-w WEIGHTS_FN]
                                    [--by-contig] [--use-local]
                                    [--min-clust-size MIN_CLUST_SIZE]
                                    [--precomputed-covars COV_PRECOMP_DIR]
                                    [-d MAX_HS_DIST]
                                    [--min-hs-vars MIN_HS_VARS]
                                    [--min-hs-samps MIN_HS_SAMPS]
                                    [--blacklist BLACKLIST_FN]
                                    [--ap-iters AP_ITERS]
                                    [--ap-convits AP_CONVITS]
                                    [--ap-algorithm AP_ALG]
                                    genes.gtf vcfs_list.txt

positional arguments:
  genes.gtf             Input GTF file (Required). Can be provided as plain
                        text or gzip-compressed file.
  vcfs_list.txt         Input VCFs list file (Required). Required columns:
                        file path, sample name. NOTE: sample names must be
                        unique for each sample!

optional arguments:
  -h, --help            show this help message and exit
  -o OUTDIR, --outdir OUTDIR
                        Provide output directory for analysis. (default: ./)
  --prefix PREFIX       Provide prefix for analysis. (default:
                        mutation_enrichment)
  --gene-field GENEFIELD
                        Provide field name from input GTF containing gene
                        name/id information. (default: gene_id)
  -g GENE_LIST, --gene-list GENE_LIST
                        Provide list of genes to which analysis should be
                        restricted (one gene per-line in text file). Analysis
                        will only considers genes from GTF file that are
                        present in this list. Default behavior is to query all
                        coding genes present in input GTF. (default: None)
  --stat-type STAT_TYPE
                        Select the stype of statistical testing to perform.
                        Options are: 1) 'nsamples' (default), which uses the
                        binomial distribution to compute the significance of
                        the number of samples containing a non-silent somatic
                        mutation ('n') among 'N' total samples against
                        background mutation rate 'p', or 2) 'nmutations',
                        which uses the negative binomial distribution to
                        compute the significance of the number of non-silent
                        mutations 'k' in a gene of coding length 'x' against
                        background mutation rate 'p' (default: nsamples)
  --bg-vars-type BG_VARS_TYPE
                        Select which variants should be counted in background
                        rate calculations. Choices are: 'all' and 'silent'. If
                        'all' is selected, all variants (silent + non-silent)
                        are counted in background calculations. If 'silent' is
                        selected, only silent mutations count towards
                        background. (default: all)
  --maf MAF             Instead of VCF list file, provide MAF (mutation
                        annotation format) file with mutation information. To
                        use, provide a dummy character (e.g. "-") for the VCFs
                        argument and provide a MAF file with this option. Gene
                        information (e.g. lengths) are computed from input
                        GTF. Genes not present by genefield in GTF (read from
                        first column of MAF) are skipped. Input MAF can be
                        provided as plain text of gzip-compressed file.
                        (default: None)
  --exome-only          If using exome-based data, choose this flag to only
                        consider exonic coordinates of genes for background
                        estimates. Default behavior is to consider full gene
                        length (exons + introns) in calculations. (default:
                        False)
  --anno-type TTYPE     Select annotation type for determining non-silent
                        somatic variants. Valid pre-sets are: 'annovar-
                        refGene', 'annovar-knownGene', 'annovar-ensGene',
                        'SnpEff', 'VEP', or 'illumina'. For 'illumina', 'CSQT'
                        INFO field is parsed; for 'SnpEff', 'ANN' INFO field
                        is parsed. For 'VEP', the CSQ INFO field is parsed.
                        Alternatively, provide tab-delimited input text file
                        describing terms for use. If providing text file, must
                        include one term per row with 3 columns: 1) String
                        that is either 'Gene' or 'Effect' to denote field with
                        gene name or gene effect, respectively; 2) value from
                        VCF INFO field for code to search for matching gene
                        name or non-silent effect; 3) valid terms (can be left
                        blank for 'Gene' row). If MAF input is used, this
                        option is ignored and default MAF terms are used.
                        (default: annovar-refGene)
  -m MAP_REGIONS, --mappable-regions MAP_REGIONS
                        Provide BED file of mappable genomic regions (sorted
                        and tabix-indexed). If provided, only portions of
                        regions from input file overlapping these mappable
                        regions will be used in analsyis. Region lengths are
                        also adjusted for enrichment calculations. (default:
                        None)
  -p NPROCESSORS, --processors NPROCESSORS
                        Set number of processors for parallel runs. (default:
  --snps-only           Set this flag to tell program to only consider SNPs in
                        analysis. Default is to consider all variant types.
                        (default: False)
  -c COV_FN, --covariates-file COV_FN
                        Provide covariates file. Format is tab-delimited text
                        file, with first column listing gene name according to
                        gene_id field in input GTF. Header should contain
                        covariate names in columns 2 to end. (default: None)
  -w WEIGHTS_FN, --covariate-weights WEIGHTS_FN
                        Provide covariates weight file. Format is tab-
                        delimited file (no header) with: covariate name,
                        weight. Weights are normalized to sum=1. If not
                        provided, uniform weighting of covariates is assumed.
                        (default: None)
  --by-contig           Use this flag to perform clustering on genes by contig
                        (i.e. by chromosome). This speeds computation of gene
                        clusters. If not set, clusters are computed using all
                        genes in same run. (default: False)
  --use-local           Use this flag to tell the program to use the local
                        gene background rate instead of global background
                        rate. If covariate files or pre-computed covariates
                        are supplied along with this flag being set, a
                        combined covariate plus local background scheme is
                        used whereby local backgrounds from cluster members
                        are considered. (default: False)
  --min-clust-size MIN_CLUST_SIZE
                        Set minimum number of covariate cluster members.
                        Regions belonging to a cluster with only itself or
                        less than this value are flagged and a local
                        background around the region is calculated and used
                        instead. (default: 3)
  --precomputed-covars COV_PRECOMP_DIR
                        Provide path to pre-computed covariate clusters for
                        genes in input GTF file. (default: None)
  -d MAX_HS_DIST, --hotspot-distance MAX_HS_DIST
                        Set maximum distance between mutations for candidate
                        hotspot discovery. (default: 50)
  --min-hs-vars MIN_HS_VARS
                        Set minimum number of mutations that must be present
                        for a valid candidate hotspot. (default: 3)
  --min-hs-samps MIN_HS_SAMPS
                        Set minimum number of samples that must contain
                        mutations to inform a valid candidate hotspot.
                        (default: 2)
  --blacklist BLACKLIST_FN
                        Provide a blacklist of specific variants to exclude
                        from analysis. Blacklist file format is tab-delimited
                        text file with four required columns: contig
                        (chromosome), position (1-indexed), reference base,
                        alternate base. (default: None)
  --ap-iters AP_ITERS   Set maximum number of AP iterations before re-
                        computing with alternate self-similarity. (default:
                        1000)
  --ap-convits AP_CONVITS
                        Set number of convergence iterations for AP runs (i.e.
                        if exemplars remain constant for this many iterations,
                        terminate early). This value MUST be smaller than the
                        total number of iterations. (default: 50)
  --ap-algorithm AP_ALG
                        Select between one of two versions of AP clustering
                        algorithm: 'slow' or 'fast'. The 'fast' version is
                        faster in terms of runtime but consumes more memory
                        than 'slow'. (default: fast)
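
The default 'nsamples' test described under --stat-type can be illustrated in a few lines of Python. This is a minimal sketch and not MutEnricher's own code: it assumes a one-sided (upper-tail) binomial test, and the function name and the example numbers are hypothetical.

```python
from math import comb


def binomial_enrichment_pvalue(n, N, p):
    """Upper-tail binomial p-value P(X >= n) for X ~ Binomial(N, p).

    n: number of samples with a non-silent mutation in the gene
    N: total number of samples
    p: background probability that a sample carries such a mutation
    """
    if not 0 <= n <= N:
        raise ValueError("n must be between 0 and N")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    tail = sum(comb(N, k) * p**k * (1.0 - p) ** (N - k) for k in range(n, N + 1))
    return min(1.0, tail)  # guard against floating-point overshoot


# Illustrative numbers only: 8 of 100 samples mutated, background rate 0.01.
pval = binomial_enrichment_pvalue(8, 100, 0.01)
print(f"p-value = {pval:.3e}")
```

A small p-value here means that more samples carry a mutation in the gene than the background rate alone would predict. How 'p' itself is estimated (global, local, or covariate-based background) is controlled by the options listed in the help text.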

Test run

1. Clone the repository.

git clone https://github.com/asoltis/MutEnricher.git
cd MutEnricher/example_data/


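2. A MutEnricher run requires a list file containing the path and sample name of each VCF, which is then passed to the program. example_data/vcfs/ contains 100 VCF files. When using the Docker image, assume they will be available at /data/vcfs inside the container and run the following commands.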

# Write one "path<TAB>sample name" line per VCF; ">" overwrites any earlier list
for VCF in vcfs/*.vcf.gz; do
    name=$(basename "$VCF" .vcf.gz)
    printf '/data/%s\t%s\n' "$VCF" "$name"
done > test_vcf_paths.txt
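
The same list file can also be produced from Python, which may be convenient on systems without a POSIX shell. The following is a sketch under stated assumptions: the function name write_vcf_list and its container_root parameter are hypothetical, and the demonstration uses empty placeholder files in a temporary directory rather than the real example VCFs.

```python
import tempfile
from pathlib import Path


def write_vcf_list(vcf_dir, out_path, container_root="/data"):
    """Write one 'path<TAB>sample name' line per *.vcf.gz file in vcf_dir.

    container_root is the assumed mount point of the working directory
    inside the Docker container; it is prepended to every path.
    """
    vcf_dir = Path(vcf_dir)
    lines = []
    for vcf in sorted(vcf_dir.glob("*.vcf.gz")):
        sample = vcf.name[: -len(".vcf.gz")]  # file name without the suffix
        lines.append(f"{container_root}/{vcf_dir.name}/{vcf.name}\t{sample}\n")
    # Mode "w" overwrites an existing list, so re-running never duplicates rows.
    with open(out_path, "w") as handle:
        handle.writelines(lines)
    return len(lines)


# Demonstration on two empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    vcfs = Path(tmp) / "vcfs"
    vcfs.mkdir()
    for sample in ("sampleA", "sampleB"):
        (vcfs / f"{sample}.vcf.gz").touch()
    list_file = Path(tmp) / "test_vcf_paths.txt"
    n_written = write_vcf_list(vcfs, list_file)
    print(list_file.read_text(), end="")
```

For the real data, calling write_vcf_list("vcfs", "test_vcf_paths.txt") from MutEnricher/example_data/ should give a file equivalent to the one written by the shell loop.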

> head test_vcf_paths.txt

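3. Run MutEnricher. Specify the VCF list together with a BED file (noncoding module) or a GTF file (coding module). The example below runs the noncoding module on a BED file of promoter regions.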

mkdir out
sudo docker run -it --rm -v "$PWD":/data -v "$PWD"/out:/tmp asoltis/mutenricher \
    python mutEnricher.py noncoding \
    /data/annotation_files/ucsc.refFlat.20170829.promoters_up2kb_downUTR.no_chrMY.bed \
    /data/test_vcf_paths.txt -o /tmp --prefix noncoding_example_global_bg

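Output

gene_enrichments.txt: the overall gene enrichment results determined by MutEnricher.

hotspot.txt: the results of the hotspot enrichment procedure.

gene_hotspot_Fisher_enrichments.txt: the Fisher-combined results for the whole gene region (the first file above) and the candidate hotspots (if any were found).

gene_data.pkl: a Python pickle object containing the mutation data and the computed results used in the enrichment analysis. It can be loaded in Python.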
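
Loading the pickle file from Python might look like the sketch below. The helper name load_gene_data, the py2_pickle switch, and the example file name are assumptions for illustration. The internal structure of the stored object is not described here, so the demonstration round-trips a stand-in dictionary instead of real MutEnricher output.

```python
import os
import pickle
import tempfile


def load_gene_data(pkl_path, py2_pickle=False):
    """Return the object stored in a MutEnricher gene_data.pkl file.

    Set py2_pickle=True when the file was written under Python 2 and is
    being read under Python 3 (an assumption about how the run was done).
    """
    with open(pkl_path, "rb") as handle:
        if py2_pickle:
            return pickle.load(handle, encoding="latin1")
        return pickle.load(handle)


# Stand-in data: this is NOT the real layout of MutEnricher's pickle.
stand_in = {"GENE1": {"n_samples_mutated": 4}}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example_gene_data.pkl")
    with open(path, "wb") as handle:
        pickle.dump(stand_in, handle)
    loaded = load_gene_data(path)
print(type(loaded).__name__, sorted(loaded))
```

Two general cautions about pickle apply: only load files you created yourself or otherwise trust, and if the stored object contains instances of MutEnricher's own classes, the MutEnricher code must be importable when the file is loaded.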

Citation

MutEnricher: a flexible toolset for somatic mutation enrichment analysis of tumor whole genomes
Anthony R Soltis, Clifton L Dalgard, Harvey B Pollard, Matthew D Wilkerson

BMC Bioinformatics. 2020 Jul 31;21(1):338