somalier: rapid relatedness estimation for cohort samples using genome sketches

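Detecting sample mix-ups is essential when interpreting sequencing data from multiple spatial or longitudinal biopsies, but it is harder than in germline variation studies. In most genomic studies of tumors, genetic variation is detected by pairwise comparison of the tumor with normal tissue from the same donor, and in many cases only somatic variants are reported. The resulting genotype information is fragmented, which prevents the use of existing tools that detect sample swaps solely from germline genotypes. To solve this problem, the authors developed somalier, which operates directly on alignments. Somalier extracts a small sketch of informative genetic variation for each sample. Sketches from hundreds of biopsies and normal tissues can then be compared in under a second. This speed also makes it useful for checking relatedness in large cohorts of germline samples. Somalier produces both text output and an interactive visual report, and uses multiple relatedness metrics to make sample swaps easier to detect and correct. The authors introduce the tool and demonstrate its utility on a cohort of five glioma samples, each with a normal, tumor, and cell-free DNA sample. Applying somalier to the high-coverage sequencing data from the 1000 Genomes Project also identified several related samples. Somalier can be applied to diverse sequencing data types and genome builds, and is freely available for academic use at github.com/brentp/somalier.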

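somalier takes a list of known polymorphic sites. Even a few hundred (or a few dozen) sites can be a very good indicator of relatedness. The best sites are those with a population allele frequency close to 0.5, because that maximizes the probability that any two samples differ. Lists of such sites for GRCh37 and hg38 are provided with the releases. To compute genotypes at these sites quickly, somalier assays the exact base at each site. The extract step runs directly on bam/cram files, one sample at a time.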
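The claim that allele frequencies near 0.5 are the most informative can be checked with a small calculation. The sketch below is only an illustration, not part of somalier: it assumes a biallelic site, two unrelated samples, and Hardy-Weinberg genotype proportions, and the function name `prob_genotypes_differ` is hypothetical.

```python
def prob_genotypes_differ(alt_af: float) -> float:
    """Probability that two unrelated samples carry different genotypes.

    Assumes a biallelic site in Hardy-Weinberg proportions; this is an
    assumption made for the illustration, not something somalier requires.
    """
    if not 0.0 <= alt_af <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    p, q = alt_af, 1.0 - alt_af
    # Expected genotype frequencies: hom-ref, het, hom-alt.
    genotype_freqs = (q * q, 2.0 * p * q, p * p)
    # Two independent samples match when both draw the same genotype.
    prob_same = sum(f * f for f in genotype_freqs)
    return 1.0 - prob_same


if __name__ == "__main__":
    for af in (0.01, 0.1, 0.3, 0.5):
        print(f"AF={af:<4}  P(genotypes differ)={prob_genotypes_differ(af):.3f}")
```

Under these assumptions the probability rises steadily from 0 at a monomorphic site to 0.625 at an allele frequency of 0.5, which is why a short list of common variants is enough to separate samples.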

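The relate step runs on the output of the extract command. It runs very quickly, so new samples can be added and compared at any time. It uses three bit vectors per sample, one each for hom-ref, het, and hom-alt. Each bit vector is an array of 64-bit integers in which a bit is set if the variant at that index in the sample is, for example, heterozygous. With this layout, relatedness can be computed very quickly using fast bitwise operations and the hardware popcount instruction.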
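The bit-vector layout described above can be sketched in a few lines of Python. This is a minimal illustration of the idea, not somalier's actual code: the genotype codes, the function names, and the choice of metrics (IBS0, shared hets, IBS2) are assumptions made for the example, and `popcount` here counts bits in software where compiled code would use the hardware instruction.

```python
from typing import List, Sequence, Tuple

WORD_BITS = 64
# Hypothetical genotype codes used only in this sketch.
HOM_REF, HET, HOM_ALT, UNKNOWN = 0, 1, 2, -1

BitVectors = Tuple[List[int], List[int], List[int]]


def encode(genotypes: Sequence[int]) -> BitVectors:
    """Pack per-site genotypes into (hom_ref, het, hom_alt) bit vectors."""
    n_words = (len(genotypes) + WORD_BITS - 1) // WORD_BITS
    vectors: BitVectors = ([0] * n_words, [0] * n_words, [0] * n_words)
    for site, gt in enumerate(genotypes):
        if gt in (HOM_REF, HET, HOM_ALT):  # unknown sites set no bit
            vectors[gt][site // WORD_BITS] |= 1 << (site % WORD_BITS)
    return vectors


def popcount(word: int) -> int:
    """Count the set bits in one word."""
    return bin(word).count("1")


def compare(a: BitVectors, b: BitVectors) -> Tuple[int, int, int]:
    """Return (ibs0, shared_hets, ibs2) site counts for two samples."""
    ibs0 = shared_hets = ibs2 = 0
    # zip(*a) walks the three vectors of a sample one 64-site word at a time.
    for (ref_a, het_a, alt_a), (ref_b, het_b, alt_b) in zip(zip(*a), zip(*b)):
        ibs0 += popcount((ref_a & alt_b) | (alt_a & ref_b))  # opposite homozygotes
        shared_hets += popcount(het_a & het_b)               # both heterozygous
        ibs2 += popcount((ref_a & ref_b) | (het_a & het_b) | (alt_a & alt_b))
    return ibs0, shared_hets, ibs2


if __name__ == "__main__":
    sample_a = encode([HOM_REF, HET, HOM_ALT, HET, HOM_REF, UNKNOWN])
    sample_b = encode([HOM_REF, HET, HOM_REF, HET, HOM_ALT, HET])
    print(compare(sample_a, sample_b))
```

Each loop iteration handles 64 sites with a handful of AND/OR operations, so comparing two samples costs time proportional to the number of words rather than the number of sites.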

somalier (kinship and QC checks from BAM/CRAM/VCF) output can now be shown in multiQC !! https://t.co/cLe7OZ0XTY
- brent pedersen (@brent_p), May 30, 2020

somalier evaluates kinship across samples, genome-builds and file types using "sketches". new release also supports ancestry prediction using a set of training samples (default is thousand-G). https://t.co/JS4sf1j3Cp

and example ancestry output: https://t.co/HQMN9Ep367
- brent pedersen (@brent_p), November 27, 2019

new release of somalier supports GVCF (and CRAM,BAM,multi-sample VCF) so you can find relatedness across formats, across genome builds and across cohorts:https://t.co/fcqDze80zc

it's also quite fast.
- brent pedersen (@brent_p), October 21, 2019

Installation

GitHub

Download the static binary from the releases page.

> ./somalier -h

# ./somalier -h

somalier version: 0.2.10

Commands:

extract : extract genotype-like information for a single sample from VCF/BAM/CRAM.

relate : aggregate `extract`ed information and calculate relatedness among samples.

ancestry : perform ancestry prediction on a set of samples, given a set of labeled samples

find-sites : create a new sites.vcf.gz file from a population VCF (this is rarely needed).

> ./somalier extract -h

# ./somalier extract -h

somalier version: 0.2.10

somalier extract

extract genotype-like information for a single-sample at selected sites

Usage:

somalier extract [options] sample_file

Arguments:

sample_file single-sample CRAM/BAM/GVCF file or multi/single-sample VCF from which to extract

Options:

-s, --sites=SITES sites vcf file of variants to extract

-f, --fasta=FASTA path to reference fasta file

-d, --out-dir=OUT_DIR path to output directory (default: .)

--sample-prefix=SAMPLE_PREFIX

prefix for the sample name stored inside the digest

-h, --help Show this help

> ./somalier relate -h

# ./somalier relate -h

somalier version: 0.2.10

somalier relate

calculate relatedness among samples from extracted, genotype-like information

Usage:

somalier relate [options] [extracted ...]

Arguments:

[extracted ...] $sample.somalier files for each sample. the first 10 are tested as a glob patterns

Options:

-g, --groups=GROUPS optional path to expected groups of samples (e.g. tumor normal pairs).

specified as comma-separated groups per line e.g.:

normal1,tumor1a,tumor1b

normal2,tumor2a

--sample-prefix=SAMPLE_PREFIX

optional sample prefixes that can be removed to find identical samples. e.g. batch1-sampleA batch2-sampleA

-p, --ped=PED optional path to a ped/fam file indicating the expected relationships among samples.

-d, --min-depth=MIN_DEPTH only genotype sites with at least this depth. (default: 7)

--min-ab=MIN_AB hets sites must be between min-ab and 1 - min_ab. set this to 0.2 for RNA-Seq data (default: 0.3)

-u, --unknown set unknown genotypes to hom-ref. it is often preferable to use this with VCF samples that were not jointly called

-i, --infer infer relationships (https://github.com/brentp/somalier/wiki/pedigree-inference)

-o, --output-prefix=OUTPUT_PREFIX

output prefix for results. (default: somalier)

-h, --help Show this help

> ./somalier ancestry -h

# ./somalier ancestry -h

somalier version: 0.2.10

somalier pca

dimensionality reduction

Usage:

somalier pca [options] [extracted ...]

Arguments:

[extracted ...] $sample.somalier files for each sample. place labelled samples first followed by '++' then *.somalier for query samples

Options:

--labels=LABELS file with ancestry labels

-o, --output-prefix=OUTPUT_PREFIX

prefix for output files (default: somalier-ancestry)

--n-pcs=N_PCS number of principal components to use in the reduced dataset (default: 5)

--nn-hidden-size=NN_HIDDEN_SIZE

shape of hidden layer in neural network (default: 16)

--nn-batch-size=NN_BATCH_SIZE

batch size fo training neural network (default: 32)

--nn-test-samples=NN_TEST_SAMPLES

number of labeled samples to test for NN convergence (default: 101)

-h, --help Show this help

> ./somalier find-sites -h

# ./somalier find-sites -h

somalier version: 0.2.10

somalier find-sites

Usage:

somalier find-sites [options] vcf

Arguments:

vcf population VCF to use to find sites

Options:

-x, --exclude=EXCLUDE optional exclude files

-i, --include=INCLUDE optional include file. only consider variants that fall in ranges within this file

--gnotate-exclude=GNOTATE_EXCLUDE

sites in slivar gnotation (zip) format to exclude

--snp-dist=SNP_DIST minimum distance between autosomal SNPs to avoid linkage (default: 10000)

--min-AN=MIN_AN minimum number of alleles (AN) at the site. (must be less than twice number of samples in the cohort) (default: 115_000)

-h, --help Show this help

this will write output sites to: ./sites.vcf.gz

Usage

1. extract - extract genotype-like information for a single sample from VCF/BAM/CRAM.

Extract from a cohort VCF or a single-sample VCF. The VCF must be gzip-compressed and have an index file.

somalier extract -d extracted/ -s sites.vcf.gz -f g1k_v37_decoy.fa $cohort.vcf.gz

-s sites vcf file of variants to extract
-f path to reference fasta file
-d path to output directory (default: .)

Alternatively, extract from a set of bam/cram files by iterating over them with a for loop.

for f in *.cram; do
 somalier extract -d extracted/ -s sites.vcf.gz -f g1k_v37_decoy.fa "$f"
done
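When the extraction is driven from a pipeline script rather than an interactive shell, the same loop can be written in Python. The sketch below is one possible wrapper, assuming a `somalier` binary on the PATH; the helper names `build_extract_command` and `extract_all` are hypothetical, and only the `-d`, `-s`, and `-f` options shown in the help text above are used.

```python
import subprocess
from pathlib import Path
from typing import List


def build_extract_command(sample_file: str, sites: str, fasta: str,
                          out_dir: str = "extracted/",
                          binary: str = "somalier") -> List[str]:
    """Build the argument list for one `somalier extract` call."""
    return [binary, "extract", "-d", str(out_dir), "-s", str(sites),
            "-f", str(fasta), str(sample_file)]


def extract_all(cram_dir: str, sites: str, fasta: str,
                out_dir: str = "extracted/",
                dry_run: bool = True) -> List[List[str]]:
    """Build (and optionally run) one extract command per *.cram file."""
    commands = [build_extract_command(str(path), sites, fasta, out_dir)
                for path in sorted(Path(cram_dir).glob("*.cram"))]
    for cmd in commands:
        if not dry_run:
            # check=True stops at the first failing sample instead of continuing.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands without executing them.
    for cmd in extract_all(".", "sites.vcf.gz", "g1k_v37_decoy.fa"):
        print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems with unusual file names, and the dry-run default makes it easy to inspect the commands before anything is executed.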

2. relate - aggregate `extract`ed information and calculate relatedness among samples.

Specify the output of step 1. A pedigree file can optionally be given.

somalier relate -p $pedigree extracted/*.somalier

-p optional path to a ped/fam file indicating the expected relationships among samples.

An interactive HTML output (example) is produced.
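For tumor-normal cohorts, the relate help text above describes a groups file with one comma-separated group per line. The following sketch shows one way to generate that file and assemble the relate command in Python. The helper names are hypothetical, the `somalier` binary is assumed to be on the PATH, and only the `-p`, `-g`, and `-o` options listed in the help text are used.

```python
from typing import List, Optional, Sequence


def format_groups(groups: Sequence[Sequence[str]]) -> str:
    """Render expected sample groups, one comma-separated group per line."""
    lines = []
    for group in groups:
        for name in group:
            if "," in name or "\n" in name:
                raise ValueError(f"sample name not representable: {name!r}")
        lines.append(",".join(group))
    return "\n".join(lines) + "\n"


def build_relate_command(extracted: Sequence[str],
                         ped: Optional[str] = None,
                         groups: Optional[str] = None,
                         output_prefix: Optional[str] = None,
                         binary: str = "somalier") -> List[str]:
    """Build the argument list for `somalier relate`."""
    cmd = [binary, "relate"]
    if ped:
        cmd += ["-p", ped]
    if groups:
        cmd += ["-g", groups]
    if output_prefix:
        cmd += ["-o", output_prefix]
    # Paths to the per-sample .somalier files written by the extract step.
    return cmd + list(extracted)


if __name__ == "__main__":
    text = format_groups([["normal1", "tumor1a", "tumor1b"],
                          ["normal2", "tumor2a"]])
    print(text, end="")
    print(" ".join(build_relate_command(["extracted/normal1.somalier"],
                                        groups="groups.csv")))
```

The groups text would be written to a file (here the hypothetical `groups.csv`) and its path passed with `-g`; the resulting command list can then be run with `subprocess.run(cmd, check=True)`.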

Citation

Somalier: rapid relatedness estimation for cancer and germline studies using efficient genome sketches
Pedersen BS, Bhetariya PJ, Brown J, Marth G, Jensen RL, Bronner MP, Underhill HR, Quinlan AR
BioRxiv, 12 Nov 2019