2022-02-06

ショートリードやロングリードを使った効率的な構造バリアントコールを行う Dysgu

　構造変異（SV）はゲノム進化において基本的な役割を果たし、癌などの遺伝性または後天性疾患の根底にある可能性がある。ロングリードシーケンス技術により、構造変異（SV）の特徴付けが向上したが、ペアエンドシーケンスの方が拡張性に優れている。本発表では、ペアエンドまたはロングリードを用いて、SVやインデルを検出するdysguを紹介する。dysguはアライメントギャップ、不一致、補足マッピングからのシグナルを検出し、コンセンサスコンティグを生成した後、機械学習を用いてイベントを分類する。また、異常配列の再マッピングにより、追加のSVを同定する。Dysguは、ペアエンドやロングリードを用いた既存の最先端ツールを凌駕し、高い感度と精度を提供すると同時に、最も高速に実行できるツールの1つである。また、低カバレッジのペアエンドとロングリードを組み合わせることで、高カバレッジのロングリードと競合する性能を発揮することがわかった。

インストール

pipで導入した。

依存

Dysgu requires Python>=3.7 and has been tested on linux and MacOS.

Github

pip install dysgu

#build from source
git clone --recursive https://github.com/kcleal/dysgu.git
cd dysgu
bash INSTALL.sh

#test

dysgu test

f:id:kazumaxneo:20220205233220p:plain

O.K

> dysgu

$ dysgu

Usage: dysgu [OPTIONS] COMMAND [ARGS]...

Dysgu-SV is a set of tools calling structural variants from bam/cram files

Options:

--version Show the version and exit.

--help Show this message and exit.

Commands:

call Call structural variants from alignment file/stdin

fetch Filters input .bam/.cram for read-pairs that are discordant or...

merge Merge .vcf/csv variant files

run Run the dysgu pipeline.

test Run dysgu tests

> dysgu call --help

Usage: dysgu call [OPTIONS] REFERENCE WORKING_DIRECTORY [SV_ALIGNS]

Call structural variants from alignment file/stdin

Options:

-b, --ibam PATH Original input file usef with 'fetch'

command, used for calculating insert size

parameters

-o, --svs-out PATH Output file [default: stdout]

-f, --out-format [csv|vcf] Output format [default: vcf]

--sites PATH A vcf file of known variant sites. Matching

output variants are labelled with 'PASS'

plus the ID from --sites

--sites-prob FLOAT RANGE Prior probability that a matching variant in

--sites is true [default: 0.6; 0<=x<=1]

--sites-pass-only [True|False] Only add variants from sites that have PASS

[default: True]

--parse-probs [True|False] Parse INFO:MeanPROB or FORMAT:PROB instead

of using --sites-p [default: False]

--all-sites [True|False] Output a genotype for all variants in

--sites (including homozygous reference 0/0)

--pfix TEXT Post-fix of temp alignment file (used when a

working-directory is provided instead of sv-

aligns)

--mode [pe|pacbio|nanopore] Type of input reads. Multiple options are

set, overrides other optionspacbio/nanopore:

--mq 20 --paired False --min-support 2

--max-cov 150 [default: pe]

--pl [pe|pacbio|nanopore] Type of input reads [default: pe]

--clip-length INTEGER Minimum soft-clip length, >= threshold are

kept. Set to -1 to ignore [default: 15]

--max-cov TEXT Regions with > max-cov that do no overlap

'include' are discarded. Use 'auto' to

estimate a value from the alignment index

file [default: 200]. Regions with > max-cov

that do no overlap 'include' are discarded.

Set to -1 to ignore.

--max-tlen INTEGER Maximum template length to consider when

calculating paired-end template size

[default: 1000]

--min-support INTEGER Minimum number of reads per SV [default: 3]

--min-size INTEGER Minimum size of SV to report [default: 30]

--mq INTEGER Minimum map quality < threshold are

discarded [default: 1]

--dist-norm FLOAT Distance normalizer [default: 100]

--spd FLOAT Span position distance [default: 0.3]

--trust-ins-len TEXT Trust insertion length from cigar, for high

error rate reads use False [default: True]

-I, --template-size TEXT Manually set insert size, insert stdev,

read_length as 'INT,INT,INT'

--regions PATH bed file of target regions, used for

labelling events

--regions-only [True|False] If --regions is provided, call only events

within target regions [default: False]

--regions-mm-only [True|False] If --regions is provided, only use minimizer

clustering within --regions. Useful for high

coverage targeted sequencing [default:

False]

-p, --procs INTEGER RANGE Processors to use [default: 1; 1<=x<=128]

--buffer-size INTEGER Number of alignments to buffer [default: 0]

--merge-within [True|False] Try and merge similar events, recommended

for most situations [default: True]

--drop-gaps [True|False] Drop SVs near gaps +/- 250 bp of Ns in

reference [default: True]

--merge-dist INTEGER Attempt merging of SVs below this distance

threshold, default is (insert-median +

5*insert_std) for pairedreads, or 500 bp for

single-end reads

--paired [True|False] Paired-end reads or single [default: True]

--contigs [True|False] Generate consensus contigs for each side of

break and use sequence-based metrics in

model scoring [default: True]

-v, --verbosity [0|1|2] 0 = no contigs in output, 1 = output contigs

for variants without ALT sequence called, 2

= output all contigs [default: 1]

--diploid [True|False] Use diploid model for scoring variants. Use

'False' for non-diploid or poly clonal

samples [default: True]

--remap TEXT Try and remap anomalous contigs to find

additional small SVs [default: True]

--metrics Output additional metrics for each SV

[default: False]

--no-gt Skip adding genotype to SVs

--keep-small Keep SVs < min-size found during re-mapping

-x, --overwrite Overwrite temp files

-c, --clean Remove temp files when finished

--thresholds TEXT Probability threshold to label as PASS for

'DEL,INS,INV,DUP,TRA' [default:

0.45,0.45,0.45,0.45,0.45]

--help Show this message and exit.

> dysgu fetch --help

Usage: dysgu fetch [OPTIONS] WORKING_DIRECTORY BAM

Filters input .bam/.cram for read-pairs that are discordant or have a soft-

clip of length > '--clip-length', saves bam file in WORKING_DIRECTORY

Options:

-f, --out-format [bam|fq|fasta]

Output format. 'bam' output maintains sort

order, 'fq' output is collated by name

[default: bam]

--reference PATH Reference file for opening cram files

--pfix TEXT Post-fix to add to temp alignment files

-o, --output TEXT Output reads, discordant, supplementary and

soft-clipped reads to file.

--compression TEXT Set output bam compression level. Default is

uncompressed [default: wb0]

-a, --write_all Write all alignments from SV-read template

to temp file [default: False]

--clip-length INTEGER Minimum soft-clip length, >= threshold are

kept. Set to -1 to ignore [default: 15]

--mq INTEGER Minimum map quality < threshold are

discarded [default: 1]

--min-size INTEGER Minimum size of SV to report [default: 30]

--max-cov FLOAT Genomic regions with coverage > max-cov are

discarded. Set to -1 to ignore. [default:

200]

-p, --procs INTEGER RANGE Compression threads to use for writing bam

[default: 1; 1<=x<=128]

--search PATH .bed file, limit search to regions

--exclude PATH .bed file, do not search/call SVs within

regions. Takes precedence over --search

-x, --overwrite Overwrite temp files [default: False]

--pl [pe|pacbio|nanopore] Type of input reads [default: pe]

--help Show this message and exit.

> dysgu merge --help

Usage: dysgu merge [OPTIONS] INPUT_FILES...

Merge .vcf/csv variant files

Options:

-o PATH Output file, [default: stdout]

-f, --out-format [csv|vcf] Output format [default: vcf]

--merge-across [True|False] Merge records across input samples [default:

True]

--merge-within [True|False] Perform additional merge within input samples,

prior to --merge-across [default: False]

--merge-dist INTEGER Distance threshold for merging [default: 500]

--separate [True|False] Keep merged tables separate, adds --post-fix to

file names, csv format only [default: False]

--post-fix TEXT Adds --post-fix to file names, only if

--separate is True [default: dysgu]

--no-chr [True|False] Remove 'chr' from chromosome names in vcf

output [default: False]

--no-contigs [True|False] Remove contig sequences from vcf output

[default: False]

--add-kind [True|False] Add region-overlap 'kind' to vcf output

[default: False]

-v, --verbosity [0|1|2] 0 = no contigs in output, 1 = output contigs

for variants without ALT sequence called, 2 =

output all contigs [default: 1]

--help Show this message and exit.

> dysgu run --help

Usage: dysgu run [OPTIONS] REFERENCE WORKING_DIRECTORY BAM

Run the dysgu pipeline. Important parameters are --mode, --diploid, --min-

support, --min-size, --max-cov

Options:

--sites PATH A vcf file of known variant sites. All sites

will be genotyped in the output vcf

--sites-prob FLOAT RANGE Prior probability that a matching variant in

--sites is true [default: 0.6; 0<=x<=1]

--sites-pass-only [True|False] Only add variants from sites that have PASS

[default: True]

--parse-probs [True|False] Parse INFO:MeanPROB or FORMAT:PROB instead

of using --sites-p [default: False]

--all-sites [True|False] Output a genotype for all variants in

--sites (including homozygous reference 0/0)

--pfix TEXT Post-fix to add to temp alignment files

-o, --svs-out PATH Output file, [default: stdout]

-f, --out-format [csv|vcf] Output format [default: vcf]

-a, --write_all Write all alignments from SV-read template

to temp file [default: False]

--compression TEXT Set temp file bam compression level. Default

is uncompressed [default: wb0]

-p, --procs INTEGER RANGE Number of cpu cores to use [default: 1;

1<=x<=128]

--mode [pe|pacbio|nanopore] Type of input reads. Multiple options are

set, overrides other options. pacbio: --mq

20 --paired False --min-support 2 --max-cov

150 --dist-norm 200 --trust-ins-len True.

nanopore: --mq 20 --paired False --min-

support 2 --max-cov 150 --dist-norm 900

--trust-ins-len False [default: pe]

--pl [pe|pacbio|nanopore] Type of input reads [default: pe]

--clip-length INTEGER Minimum soft-clip length, >= threshold are

kept. Set to -1 to ignore [default: 15]

--max-cov TEXT Genomic regions with coverage > max-cov

discarded. Use 'auto' to estimate a value

from the alignment index file [default:

200]. Set to -1 to ignore

--max-tlen INTEGER Maximum template length to consider when

calculating paired-end template size

[default: 1000]

--min-support INTEGER Minimum number of reads per SV [default: 3]

--min-size INTEGER Minimum size of SV to report [default: 30]

--mq INTEGER Minimum map quality < threshold are

discarded [default: 1]

--dist-norm FLOAT Distance normalizer [default: 100]

--spd FLOAT Span position distance [default: 0.3]

--trust-ins-len TEXT Trust insertion length from cigar, for high

error rate reads use False [default: True]

-I, --template-size TEXT Manually set insert size, insert stdev,

read_length as 'INT,INT,INT'

--search PATH .bed file, limit search to regions

--exclude PATH .bed file, do not search/call SVs within

regions. Takes precedence over --search

--regions PATH bed file of target regions, used for

labelling events

--regions-only [True|False] If --regions is provided, call only events

within target regions [default: False]

--regions-mm-only [True|False] If --regions is provided, only use minimizer

clustering within --regions. Useful for high

coverage targeted sequencing [default:

False]

--buffer-size INTEGER Number of alignments to buffer [default: 0]

--merge-within [True|False] Try and merge similar events, recommended

for most situations [default: True]

--drop-gaps [True|False] Drop SVs near gaps +/- 250 bp of Ns in

reference [default: True]

--merge-dist INTEGER Attempt merging of SVs below this distance

threshold. Default for paired-end data is

(insert-median + 5*insert_std) for

pairedreads, or 500 bp for single-end reads

--paired [True|False] Paired-end reads or single [default: True]

--contigs [True|False] Generate consensus contigs for each side of

break and use sequence-based metrics in

model scoring [default: True]

-v, --verbosity [0|1|2] 0 = no contigs in output, 1 = output contigs

for variants without ALT sequence called, 2

= output all contigs [default: 1]

--diploid [True|False] Use diploid model for scoring variants. Use

'False' for non-diploid or poly clonal

samples [default: True]

--remap TEXT Try and remap anomalous contigs to find

additional small SVs [default: True]

--metrics Output additional metrics for each SV

[default: False]

--no-gt Skip adding genotype to SVs

--keep-small Keep SVs < min-size found during re-mapping

-x, --overwrite Overwrite temp files

-c, --clean Remove temp files when finished

--thresholds TEXT Probability threshold to label as PASS for

'DEL,INS,INV,DUP,TRA' [default:

0.45,0.45,0.45,0.45,0.45]

--help Show this message and exit.

実行方法

ショートリード (100-250 bpの範囲の長さのPEリード)

dysgu の実行は、データの種類によっていくつかの方法がある。ペアエンドデータでは、fetch と call をラップした run コマンドが推奨されている。

bwa memでアライメントされたペアエンドリードで動作するように開発されている（他のアライナーも可）。bam (またはCRAM形式) とゲノム (fasta 形式) を指定する。また、一時ファイルを保存するための作業ディレクトリが必要。

dysgu run -p 12 --mode pe reference.fa temp_dir input.bam > svs.vcf

-p Number of cpu cores to use [default: 1; 1<=x<=128]
-v [0|1|2] 0 = no contigs in output, 1 = output contigs for variants without ALT sequence called, 2 = output all contigs [default: 1]
--mode [pe|pacbio|nanopore] Type of input reads.

シーケンスプラットフォームに応じて、dysguは推奨設定と特定の機械学習モデルを適用する。modeはこの特定のモデルを適用するための複数パラメータの一括設定になる。

ロングリード

非常に長いリード（Oxford nanopore）の場合、パイプラインのfetchステージは必要ないので、callコマンドを直接使用する。minimap2やngmlrでアライメントされたロングリードで動作するように開発されている。

#Pacbio
dysgu call -p 12 --mode pacbio reference.fa temp_dir input.bam > svs.vcf

#ONT
dysgu call -p 12 --mode nanopore reference.fa temp_dir input.bam > svs.vcf

（注；PacBio Sequel II HiFiリードでは一般にrunコマンドが推奨されている）

ONTのような非常に長いリードの場合、リードの大部分が複数のSV候補を持っており、事実上入力ファイルが冗長になる。このようなデータセットによっては 'fetch' コマンドは不要で、代わりに 'call' コマンドを推奨する（論文より）。

複数サンプルのマージ

サンプルのマージを計画している場合、「run/call」モジュールを実行する際に-v2オプションを使用する。

dysgu merge sample1.vcf sample2.vcf > combined.vcf

引用

Dysgu: efficient structural variant calling using short or long reads
Kez Cleal, Duncan M Baird
Nucleic Acids Research, Published: 31 January 2022