smoove: genotyping small and large cohorts while removing noise

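smoove wraps existing software and adds several filtering steps to make calling and genotyping structural variants (SVs) easier. The filtering removes spurious alignment signals that are indicative of low-level noise, which improves specificity. smoove supports both small cohorts, with a single command, and population-level calling, in four steps in total. Where possible it streams output directly to multiple svtyper processes for genotyping, so the steps can be parallelized. For cohorts of fewer than about 40 samples, a single command yields a jointly called, genotyped VCF.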

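Workflow

Run lumpy_filter in parallel to extract the split reads and discordant reads required by lumpy.
Further filter the lumpy_filter output to remove spurious high-coverage regions and user-specified regions, and drop reads that look like spurious signals. Then remove singleton reads from the discordant BAMs.
Calculate the per-sample metrics required by lumpy: the mean, standard deviation, and distribution of insert size.
Stream the lumpy output directly into multiple svtyper processes to run genotyping.
Sort, compress, and index the final VCF.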

Blog

Smoove - genomics dev blog

twitter

Just made a new release of smoove (easy SV calling) that includes a quality score for HET calls. If you give the score, please let me know how it's working for you. https://t.co/knTxqBtYjR

this is in the `annotate` sub-command. upstream parts of smoove are unchanged.
— brent pedersen (@brent_p) April 23, 2018

Installation

I tested smoove using the Docker image provided by the author.

Dependencies

lumpy and lumpy_filter
samtools: for CRAM support
gsort: to sort final VCF
bgzip+tabix: to compress and index final VCF

And optionally (but all highly recommended):

svtyper: to genotype SVs
svtools: required for large cohorts
mosdepth: remove high coverage regions.
bcftools: version 1.5 or higher for VCF indexing and filtering.
duphold: to annotate depth changes within events and at the break-points.
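
If you install the tools yourself rather than using the Docker image, a quick PATH check along the following lines may help. This is only a sketch: the required/optional split follows the list above (the `smoove -h` output shown later marks svtyper and mosdepth as required as well), and `check_tools` is a hypothetical helper, not part of smoove.

```shell
#!/bin/sh
# Sketch: report which smoove dependencies are found on PATH.
# The grouping below follows the dependency list above (an assumption).
REQUIRED="lumpy lumpy_filter samtools gsort bgzip tabix"
OPTIONAL="svtyper svtools mosdepth bcftools duphold"

# check_tools NAME... -> prints "[Y] name" or "[ ] name" for each tool;
# the return status is the number of missing tools (0 means all found).
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '[Y] %s\n' "$tool"
        else
            printf '[ ] %s\n' "$tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}

echo "required:"
check_tools $REQUIRED || echo "-> some required tools are missing"
echo "optional:"
check_tools $OPTIONAL || echo "-> some optional tools are missing"
```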

Main repository: GitHub

docker pull brentp/smoove

> docker run brentp/smoove smoove -h

smoove version: 0.2.3

smoove calls several programs. Those with 'Y' are found on your $PATH. Only those with '*' are required.

*[Y] bgzip [ sort -> (compress) -> index ]

*[Y] gsort [(sort) -> compress -> index ]

*[Y] tabix [ sort -> compress -> (index)]

*[Y] lumpy

*[Y] lumpy_filter

*[Y] samtools

*[Y] svtyper

*[Y] mosdepth [extra filtering of split and discordant files for better scaling]

[Y] duphold [(optional) annotate calls with depth changes]

[Y] svtools [only needed for large cohorts].

Available sub-commands are below. Each can be run with -h for additional help.

call : call lumpy (and optionally svtyper)

merge : merge and sort (using svtools) calls from multiple samples

genotype : parallelize svtyper on an input VCF

paste : square final calls from multiple samples (each with same number of variants)

plot-counts : plot counts of split, discordant reads before, after smoove filtering

annotate : annotate a VCF with gene and quality of SV call

hipstr : run hipSTR in parallel

cnvnator : run cnvnator in parallel

duphold : run duphold in parallel (this can be done by adding a flag to call or genotype)

> docker run brentp/smoove smoove call -h

[smoove] 2019/04/05 15:30:33 starting with version 0.2.3

this runs lumpy ands sends output to {outdir}/{name}-smoove.vcf.gz if --genotype is requested, the output goes to {outdir}/{name}-smoove.genotyped.vcf.gz

Usage: smoove --name NAME --fasta FASTA [--exclude EXCLUDE] [--excludechroms EXCLUDECHROMS] [--processes PROCESSES] [--outdir OUTDIR] [--noextrafilters] [--support SUPPORT] [--genotype] [--duphold] [--removepr] BAMS [BAMS ...]

Positional arguments:

BAMS path to bam(s) to call.

Options:

--name NAME, -n NAME project name used in output files.

--fasta FASTA, -f FASTA

fasta file.

--exclude EXCLUDE, -e EXCLUDE

BED of exclude regions.

--excludechroms EXCLUDECHROMS, -C EXCLUDECHROMS

ignore SVs with either end in this comma-delimited list of chroms. If this starts with ~ it is treated as a regular expression to exclude. [default: hs37d5,~:,~^GL,~decoy]

--processes PROCESSES, -p PROCESSES

number of processors to parallelize. [default: 3]

--outdir OUTDIR, -o OUTDIR

output directory.

--noextrafilters, -F use lumpy_filter only without extra smoove filters.

--support SUPPORT, -S SUPPORT

mininum support required to report a variant. [default: 4]

--genotype stream output to svtyper for genotyping

--duphold, -d run duphold on output. only works with --genotype

--removepr, -x remove PRPOS and PREND tags from INFO (only used with --gentoype).

--help, -h display this help and exit
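
The `--excludechroms` rule in the help text above (plain entries, plus entries starting with `~` that are treated as regular expressions) can be illustrated with a small shell sketch. This is not smoove's own code: `is_excluded` is a hypothetical helper, and treating plain entries as exact matches is an assumption.

```shell
#!/bin/sh
# Sketch of the --excludechroms rule described in the help text:
# entries starting with "~" are regular expressions, others are literal names.
# Exact matching for literal entries is an assumption.
DEFAULT_EXCLUDE='hs37d5,~:,~^GL,~decoy'

# is_excluded CHROM LIST -> exit status 0 if CHROM matches any entry in LIST
is_excluded() {
    chrom=$1
    old_ifs=$IFS
    IFS=,
    set -f                          # keep entries from being glob-expanded
    for entry in $2; do
        case $entry in
            "~"*)
                # strip the leading "~" and use the rest as an extended regex
                if printf '%s\n' "$chrom" | grep -Eq -- "${entry#?}"; then
                    IFS=$old_ifs; set +f; return 0
                fi ;;
            *)
                if [ "$chrom" = "$entry" ]; then
                    IFS=$old_ifs; set +f; return 0
                fi ;;
        esac
    done
    IFS=$old_ifs; set +f
    return 1
}

for c in chr1 hs37d5 GL000207.1 'HLA-A*01:01' chrUn_KN707606v1_decoy; do
    if is_excluded "$c" "$DEFAULT_EXCLUDE"; then
        echo "$c: excluded"
    else
        echo "$c: kept"
    fi
done
```

With the default list, only `chr1` among these example names would be kept; names containing `:` (such as HLA contigs), names starting with `GL`, and decoy contigs are dropped.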

> docker run brentp/smoove smoove merge -h

[smoove] 2019/04/05 15:31:00 starting with version 0.2.3

Usage: smoove --name NAME [--outdir OUTDIR] --fasta FASTA VCFS [VCFS ...]

Positional arguments:

VCFS path to vcfs.

Options:

--name NAME, -n NAME project name used in output files.

--outdir OUTDIR, -o OUTDIR

output directory. [default: ./]

--fasta FASTA, -f FASTA

fasta file.

--help, -h display this help and exit

> docker run brentp/smoove smoove genotype -h

[smoove] 2019/04/05 15:31:34 starting with version 0.2.3

Usage: smoove --name NAME [--outdir OUTDIR] --fasta FASTA [--removepr] [--duphold] [--processes PROCESSES] --vcf VCF BAMS [BAMS ...]

Positional arguments:

BAMS path to bam to call.

Options:

--name NAME, -n NAME project name used in output files.

--outdir OUTDIR, -o OUTDIR

output directory.

--fasta FASTA, -f FASTA

fasta file.

--removepr, -x remove PRPOS and PREND tags from INFO.

--duphold, -d run duphold on output.

--processes PROCESSES, -p PROCESSES

number of processors to use. [default: 3]

--vcf VCF, -v VCF vcf to genotype (use - for stdin). [default: -]

--help, -h display this help and exit

> docker run brentp/smoove smoove paste -h

square VCF files from different samples with the same number of records

Usage: smoove --name NAME [--outdir OUTDIR] VCFS [VCFS ...]

Positional arguments:

VCFS path to vcfs.

Options:

--name NAME, -n NAME project name used in output files.

--outdir OUTDIR, -o OUTDIR

output directory. [default: ./]

--help, -h display this help and exit

[smoove] 2019/04/05 15:31:56 starting with version 0.2.3

> docker run brentp/smoove smoove plot-counts -h

[smoove] 2019/04/05 15:32:22 starting with version 0.2.3

Usage: smoove --vcf VCF --html HTML

Options:

--vcf VCF, -v VCF path to input VCF from smoove 0.2.3 or greater.

--html HTML, -h HTML path to output html file to be written.

--help, -h display this help and exit

> docker run brentp/smoove smoove annotate -h

[smoove] 2019/04/05 15:32:39 starting with version 0.2.3

GFF3 annotation files can be downloaded from Ensembl:

ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/

ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/

Usage: smoove [--gff GFF] VCF

Positional arguments:

VCF path to VCF(s) to annotate.

Options:

--gff GFF, -g GFF path to GFF for gene annotation

--help, -h display this help and exit

> docker run brentp/smoove smoove hipstr -h

Usage: smoove --regions REGIONS --fasta FASTA BAMS [BAMS ...]

Positional arguments:

BAMS bams in which to call STRs

Options:

--regions REGIONS, -r REGIONS

BED file of regions containing STRs

--fasta FASTA, -f FASTA

path to reference fasta file

--help, -h display this help and exit

[smoove] 2019/04/05 15:33:03 starting with version 0.2.3

> docker run brentp/smoove smoove cnvnator -h

Usage: smoove --name NAME --fasta FASTA [--outdir OUTDIR] [--excludechroms EXCLUDECHROMS] BAM

Positional arguments:

BAM path to bam to call.

Options:

--name NAME, -n NAME project name used in output files.

--fasta FASTA, -f FASTA

fasta file.

--outdir OUTDIR, -o OUTDIR

output directory.

--excludechroms EXCLUDECHROMS, -C EXCLUDECHROMS

ignore SVs with either end in this comma-delimited list of chroms. If this starts with ~ it is treated as a regular expression to exclude. [default: hs37d5,~:,~^GL,~decoy,~random,~chrUn,~alt$]

--help, -h display this help and exit

[smoove] 2019/04/05 15:33:20 starting with version 0.2.3

> docker run brentp/smoove smoove duphold -h

[smoove] 2019/04/05 15:33:45 starting with version 0.2.3

Usage: smoove --fasta FASTA --vcf VCF [--processes PROCESSES] [--snps SNPS] --outvcf OUTVCF BAMS [BAMS ...]

Positional arguments:

BAMS paths to sample BAM/CRAMs

Options:

--fasta FASTA, -f FASTA

fasta file.

--vcf VCF, -v VCF path to input SV VCF

--processes PROCESSES, -p PROCESSES

number of threads ot use. [default: 4]

--snps SNPS, -s SNPS optional path to SNP/Indel VCF containing these samples for annotation with allele balance.

--outvcf OUTVCF, -o OUTVCF

path to output SV VCF

--help, -h display this help and exit

Usage

1. Small cohorts (n < ~40)

When the number of samples is small, a single command produces a jointly called, genotyped VCF.

smoove call -x --name my-cohort --exclude input.bed --fasta ref.fasta -p 40 --genotype input_dir/*.bam

To reduce false calls, it is recommended to pass a BED file of problematic regions with "--exclude". The author recommends the BED files published in the speedseq GitHub repository (one each is provided for GRCh37 and hg38).
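
Since I ran smoove through the Docker image, the same call can be wrapped in `docker run`. The following is a dry-run sketch only: the `/work` mount point, the `DATA_DIR` and `THREADS` variables, and the `smoove_docker_cmd` helper are all hypothetical, and it assumes the image provides `sh`. File names mirror the command above.

```shell
#!/bin/sh
# Sketch: wrap the small-cohort call in docker (dry run, prints the command).
# /work is a hypothetical mount point inside the container.
DATA_DIR=${DATA_DIR:-$PWD}    # host directory with ref.fasta, input.bed, input_dir/
THREADS=${THREADS:-4}

# smoove_docker_cmd HOST_DIR THREADS -> prints the docker command line
smoove_docker_cmd() {
    # The glob sits inside single quotes so it expands in the container,
    # where the BAM files are visible under /work, not on the host.
    printf '%s\n' "docker run --rm -v $1:/work -w /work brentp/smoove sh -c 'smoove call -x --name my-cohort --exclude input.bed --fasta ref.fasta -p $2 --genotype input_dir/*.bam'"
}

cmd=$(smoove_docker_cmd "$DATA_DIR" "$THREADS")
echo "$cmd"
# To actually run it (requires docker and the image pulled earlier):
#   eval "$cmd"
```

Because `-w /work` makes the mounted directory the working directory, the relative paths in the smoove command resolve against the host data directory. Host paths containing spaces would need extra quoting.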

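2. Population calling (large cohorts)

Because parallelization within a single run is inefficient, for large datasets it is recommended to give each sample one thread and to run many samples in parallel. The per-sample results are then merged, every sample is genotyped at the merged sites, and the genotyped VCFs are pasted into a joint call set that can optionally be annotated. The steps are as follows.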

# Use one thread per sample and run the samples in parallel.
smoove call --outdir results-smoove/ --exclude input.bed --name sample_name --fasta ref.fasta -p 1 --genotype sample_dir/sample.bam
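
One way to drive the per-sample runs is to generate one command line per BAM and hand the list to a job runner. The sketch below is a dry run under stated assumptions: `call_commands` is a hypothetical helper, the sample name is taken from the BAM file name, and the "already done" check relies on the output naming given in the `smoove call -h` text.

```shell
#!/bin/sh
# Sketch: emit one single-threaded `smoove call` per BAM, skipping samples
# whose genotyped VCF already exists. Paths are the placeholders used above.

# call_commands BAM_DIR OUT_DIR -> prints one command line per pending sample
call_commands() {
    for bam in "$1"/*.bam; do
        [ -e "$bam" ] || continue          # the glob matched nothing
        sample=$(basename "$bam" .bam)     # sample name taken from the file name
        # Output name per `smoove call -h`: {outdir}/{name}-smoove.genotyped.vcf.gz
        if [ -e "$2/${sample}-smoove.genotyped.vcf.gz" ]; then
            continue
        fi
        echo "smoove call --outdir $2 --exclude input.bed --name $sample --fasta ref.fasta -p 1 --genotype $bam"
    done
}

# Dry run: print the pending commands.
call_commands sample_dir results-smoove
# To run, for example, 8 samples at a time (requires smoove on PATH and an
# xargs that supports -P):
#   call_commands sample_dir results-smoove | xargs -P 8 -I{} sh -c '{}'
```

Skipping finished samples makes the loop safe to rerun after an interrupted batch.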

# Once every sample has finished, merge the calls into a single set of sites (smoove merge has no --processes option).
smoove merge --name merged -f ref.fasta --outdir ./ results-smoove/*.genotyped.vcf.gz

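# Genotype every sample at the merged sites, and add depth information at the call sites with duphold (-d). Run this once per sample; the runs can be spread over as many CPUs as are available.
smoove genotype -d -x -p 1 --name sample-joint --outdir results-genotyped/ --fasta ref.fasta --vcf merged.sites.vcf.gz sample_dir/sample.bam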

# Paste the per-sample VCFs into a single squared, joint-called VCF.
smoove paste --name cohort results-genotyped/*.vcf.gz
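
The `smoove paste -h` text says the input VCFs must have the same number of records, so a quick check before pasting can catch a sample that was genotyped against a different site list. A minimal sketch, using only gzip and awk (bgzip output can be read with gzip); `count_records` and `same_record_count` are hypothetical helpers.

```shell
#!/bin/sh
# Sketch: verify that per-sample VCFs hold the same number of records
# before running `smoove paste`.

# count_records FILE -> prints the number of non-header lines in a gzipped VCF
count_records() {
    gzip -dc "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}

# same_record_count FILE... -> exit status 0 if all counts agree, 1 otherwise
same_record_count() {
    expected=""
    for vcf in "$@"; do
        n=$(count_records "$vcf")
        if [ -z "$expected" ]; then
            expected=$n
        elif [ "$n" != "$expected" ]; then
            echo "mismatch: $vcf has $n records, expected $expected" >&2
            return 1
        fi
    done
    return 0
}

# Usage (file names as in the paste command above):
#   same_record_count results-genotyped/*.vcf.gz && \
#       smoove paste --name cohort results-genotyped/*.vcf.gz
```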

# Optional: annotate the calls with gene information (exons and so on) from a known GFF file.
smoove annotate --gff Homo_sapiens.GRCh37.82.gff3.gz cohort.smoove.square.vcf.gz | bgzip -c > cohort.smoove.square.anno.vcf.gz