構造変異(SV)はゲノム進化において基本的な役割を果たし、癌などの遺伝性または後天性疾患の根底にある可能性がある。ロングリードシーケンス技術により、構造変異(SV)の特徴付けが向上したが、ペアエンドシーケンスの方が拡張性に優れている。本発表では、ペアエンドまたはロングリードを用いて、SVやインデルを検出するdysguを紹介する。dysguはアライメントギャップ、不一致、補足マッピングからのシグナルを検出し、コンセンサスコンティグを生成した後、機械学習を用いてイベントを分類する。また、異常配列の再マッピングにより、追加のSVを同定する。Dysguは、ペアエンドやロングリードを用いた既存の最先端ツールを凌駕し、高い感度と精度を提供すると同時に、最も高速に実行できるツールの1つである。また、低カバレッジのペアエンドとロングリードを組み合わせることで、高カバレッジのロングリードと競合する性能を発揮することがわかった。
インストール
pipで導入した。
依存
pip install dysgu
#build from source
git clone --recursive https://github.com/kcleal/dysgu.git
cd dysgu
bash INSTALL.sh
#test
dysgu test
O.K
> dysgu
$ dysgu
Usage: dysgu [OPTIONS] COMMAND [ARGS]...
Dysgu-SV is a set of tools calling structural variants from bam/cram files
Options:
--version Show the version and exit.
--help Show this message and exit.
Commands:
call Call structural variants from alignment file/stdin
fetch Filters input .bam/.cram for read-pairs that are discordant or...
merge Merge .vcf/csv variant files
run Run the dysgu pipeline.
test Run dysgu tests
> dysgu call --help
Usage: dysgu call [OPTIONS] REFERENCE WORKING_DIRECTORY [SV_ALIGNS]
Call structural variants from alignment file/stdin
Options:
-b, --ibam PATH Original input file usef with 'fetch'
command, used for calculating insert size
parameters
-o, --svs-out PATH Output file [default: stdout]
-f, --out-format [csv|vcf] Output format [default: vcf]
--sites PATH A vcf file of known variant sites. Matching
output variants are labelled with 'PASS'
plus the ID from --sites
--sites-prob FLOAT RANGE Prior probability that a matching variant in
--sites is true [default: 0.6; 0<=x<=1]
--sites-pass-only [True|False] Only add variants from sites that have PASS
[default: True]
--parse-probs [True|False] Parse INFO:MeanPROB or FORMAT:PROB instead
of using --sites-p [default: False]
--all-sites [True|False] Output a genotype for all variants in
--sites (including homozygous reference 0/0)
--pfix TEXT Post-fix of temp alignment file (used when a
working-directory is provided instead of sv-
aligns)
--mode [pe|pacbio|nanopore] Type of input reads. Multiple options are
set, overrides other optionspacbio/nanopore:
--mq 20 --paired False --min-support 2
--max-cov 150 [default: pe]
--pl [pe|pacbio|nanopore] Type of input reads [default: pe]
--clip-length INTEGER Minimum soft-clip length, >= threshold are
kept. Set to -1 to ignore [default: 15]
--max-cov TEXT Regions with > max-cov that do no overlap
'include' are discarded. Use 'auto' to
estimate a value from the alignment index
file [default: 200]. Regions with > max-cov
that do no overlap 'include' are discarded.
Set to -1 to ignore.
--max-tlen INTEGER Maximum template length to consider when
calculating paired-end template size
[default: 1000]
--min-support INTEGER Minimum number of reads per SV [default: 3]
--min-size INTEGER Minimum size of SV to report [default: 30]
--mq INTEGER Minimum map quality < threshold are
discarded [default: 1]
--dist-norm FLOAT Distance normalizer [default: 100]
--spd FLOAT Span position distance [default: 0.3]
--trust-ins-len TEXT Trust insertion length from cigar, for high
error rate reads use False [default: True]
-I, --template-size TEXT Manually set insert size, insert stdev,
read_length as 'INT,INT,INT'
--regions PATH bed file of target regions, used for
labelling events
--regions-only [True|False] If --regions is provided, call only events
within target regions [default: False]
--regions-mm-only [True|False] If --regions is provided, only use minimizer
clustering within --regions. Useful for high
coverage targeted sequencing [default:
False]
-p, --procs INTEGER RANGE Processors to use [default: 1; 1<=x<=128]
--buffer-size INTEGER Number of alignments to buffer [default: 0]
--merge-within [True|False] Try and merge similar events, recommended
for most situations [default: True]
--drop-gaps [True|False] Drop SVs near gaps +/- 250 bp of Ns in
reference [default: True]
--merge-dist INTEGER Attempt merging of SVs below this distance
threshold, default is (insert-median +
5*insert_std) for pairedreads, or 500 bp for
single-end reads
--paired [True|False] Paired-end reads or single [default: True]
--contigs [True|False] Generate consensus contigs for each side of
break and use sequence-based metrics in
model scoring [default: True]
-v, --verbosity [0|1|2] 0 = no contigs in output, 1 = output contigs
for variants without ALT sequence called, 2
= output all contigs [default: 1]
--diploid [True|False] Use diploid model for scoring variants. Use
'False' for non-diploid or poly clonal
samples [default: True]
--remap TEXT Try and remap anomalous contigs to find
additional small SVs [default: True]
--metrics Output additional metrics for each SV
[default: False]
--no-gt Skip adding genotype to SVs
--keep-small Keep SVs < min-size found during re-mapping
-x, --overwrite Overwrite temp files
-c, --clean Remove temp files when finished
--thresholds TEXT Probability threshold to label as PASS for
'DEL,INS,INV,DUP,TRA' [default:
0.45,0.45,0.45,0.45,0.45]
--help Show this message and exit.
> dysgu fetch --help
Usage: dysgu fetch [OPTIONS] WORKING_DIRECTORY BAM
Filters input .bam/.cram for read-pairs that are discordant or have a soft-
clip of length > '--clip-length', saves bam file in WORKING_DIRECTORY
Options:
-f, --out-format [bam|fq|fasta]
Output format. 'bam' output maintains sort
order, 'fq' output is collated by name
[default: bam]
--reference PATH Reference file for opening cram files
--pfix TEXT Post-fix to add to temp alignment files
-o, --output TEXT Output reads, discordant, supplementary and
soft-clipped reads to file.
--compression TEXT Set output bam compression level. Default is
uncompressed [default: wb0]
-a, --write_all Write all alignments from SV-read template
to temp file [default: False]
--clip-length INTEGER Minimum soft-clip length, >= threshold are
kept. Set to -1 to ignore [default: 15]
--mq INTEGER Minimum map quality < threshold are
discarded [default: 1]
--min-size INTEGER Minimum size of SV to report [default: 30]
--max-cov FLOAT Genomic regions with coverage > max-cov are
discarded. Set to -1 to ignore. [default:
200]
-p, --procs INTEGER RANGE Compression threads to use for writing bam
[default: 1; 1<=x<=128]
--search PATH .bed file, limit search to regions
--exclude PATH .bed file, do not search/call SVs within
regions. Takes precedence over --search
-x, --overwrite Overwrite temp files [default: False]
--pl [pe|pacbio|nanopore] Type of input reads [default: pe]
--help Show this message and exit.
> dysgu merge --help
Usage: dysgu merge [OPTIONS] INPUT_FILES...
Merge .vcf/csv variant files
Options:
-o PATH Output file, [default: stdout]
-f, --out-format [csv|vcf] Output format [default: vcf]
--merge-across [True|False] Merge records across input samples [default:
True]
--merge-within [True|False] Perform additional merge within input samples,
prior to --merge-across [default: False]
--merge-dist INTEGER Distance threshold for merging [default: 500]
--separate [True|False] Keep merged tables separate, adds --post-fix to
file names, csv format only [default: False]
--post-fix TEXT Adds --post-fix to file names, only if
--separate is True [default: dysgu]
--no-chr [True|False] Remove 'chr' from chromosome names in vcf
output [default: False]
--no-contigs [True|False] Remove contig sequences from vcf output
[default: False]
--add-kind [True|False] Add region-overlap 'kind' to vcf output
[default: False]
-v, --verbosity [0|1|2] 0 = no contigs in output, 1 = output contigs
for variants without ALT sequence called, 2 =
output all contigs [default: 1]
--help Show this message and exit.
> dysgu run --help
Usage: dysgu run [OPTIONS] REFERENCE WORKING_DIRECTORY BAM
Run the dysgu pipeline. Important parameters are --mode, --diploid, --min-
support, --min-size, --max-cov
Options:
--sites PATH A vcf file of known variant sites. All sites
will be genotyped in the output vcf
--sites-prob FLOAT RANGE Prior probability that a matching variant in
--sites is true [default: 0.6; 0<=x<=1]
--sites-pass-only [True|False] Only add variants from sites that have PASS
[default: True]
--parse-probs [True|False] Parse INFO:MeanPROB or FORMAT:PROB instead
of using --sites-p [default: False]
--all-sites [True|False] Output a genotype for all variants in
--sites (including homozygous reference 0/0)
--pfix TEXT Post-fix to add to temp alignment files
-o, --svs-out PATH Output file, [default: stdout]
-f, --out-format [csv|vcf] Output format [default: vcf]
-a, --write_all Write all alignments from SV-read template
to temp file [default: False]
--compression TEXT Set temp file bam compression level. Default
is uncompressed [default: wb0]
-p, --procs INTEGER RANGE Number of cpu cores to use [default: 1;
1<=x<=128]
--mode [pe|pacbio|nanopore] Type of input reads. Multiple options are
set, overrides other options. pacbio: --mq
20 --paired False --min-support 2 --max-cov
150 --dist-norm 200 --trust-ins-len True.
nanopore: --mq 20 --paired False --min-
support 2 --max-cov 150 --dist-norm 900
--trust-ins-len False [default: pe]
--pl [pe|pacbio|nanopore] Type of input reads [default: pe]
--clip-length INTEGER Minimum soft-clip length, >= threshold are
kept. Set to -1 to ignore [default: 15]
--max-cov TEXT Genomic regions with coverage > max-cov
discarded. Use 'auto' to estimate a value
from the alignment index file [default:
200]. Set to -1 to ignore
--max-tlen INTEGER Maximum template length to consider when
calculating paired-end template size
[default: 1000]
--min-support INTEGER Minimum number of reads per SV [default: 3]
--min-size INTEGER Minimum size of SV to report [default: 30]
--mq INTEGER Minimum map quality < threshold are
discarded [default: 1]
--dist-norm FLOAT Distance normalizer [default: 100]
--spd FLOAT Span position distance [default: 0.3]
--trust-ins-len TEXT Trust insertion length from cigar, for high
error rate reads use False [default: True]
-I, --template-size TEXT Manually set insert size, insert stdev,
read_length as 'INT,INT,INT'
--search PATH .bed file, limit search to regions
--exclude PATH .bed file, do not search/call SVs within
regions. Takes precedence over --search
--regions PATH bed file of target regions, used for
labelling events
--regions-only [True|False] If --regions is provided, call only events
within target regions [default: False]
--regions-mm-only [True|False] If --regions is provided, only use minimizer
clustering within --regions. Useful for high
coverage targeted sequencing [default:
False]
--buffer-size INTEGER Number of alignments to buffer [default: 0]
--merge-within [True|False] Try and merge similar events, recommended
for most situations [default: True]
--drop-gaps [True|False] Drop SVs near gaps +/- 250 bp of Ns in
reference [default: True]
--merge-dist INTEGER Attempt merging of SVs below this distance
threshold. Default for paired-end data is
(insert-median + 5*insert_std) for
pairedreads, or 500 bp for single-end reads
--paired [True|False] Paired-end reads or single [default: True]
--contigs [True|False] Generate consensus contigs for each side of
break and use sequence-based metrics in
model scoring [default: True]
-v, --verbosity [0|1|2] 0 = no contigs in output, 1 = output contigs
for variants without ALT sequence called, 2
= output all contigs [default: 1]
--diploid [True|False] Use diploid model for scoring variants. Use
'False' for non-diploid or poly clonal
samples [default: True]
--remap TEXT Try and remap anomalous contigs to find
additional small SVs [default: True]
--metrics Output additional metrics for each SV
[default: False]
--no-gt Skip adding genotype to SVs
--keep-small Keep SVs < min-size found during re-mapping
-x, --overwrite Overwrite temp files
-c, --clean Remove temp files when finished
--thresholds TEXT Probability threshold to label as PASS for
'DEL,INS,INV,DUP,TRA' [default:
0.45,0.45,0.45,0.45,0.45]
--help Show this message and exit.
実行方法
ショートリード (100-250 bpの範囲の長さのPEリード)
dysgu の実行は、データの種類によっていくつかの方法がある。ペアエンドデータでは、fetch と call をラップした run コマンドが推奨されている。
bwa memでアライメントされたペアエンドリードで動作するように開発されている(他のアライナーも可)。bam (またはCRAM形式) とゲノム (fasta 形式) を指定する。また、一時ファイルを保存するための作業ディレクトリが必要。
dysgu run -p 12 --mode pe reference.fa temp_dir input.bam > svs.vcf
- -p Number of cpu cores to use [default: 1; 1<=x<=128]
- -v [0|1|2] 0 = no contigs in output, 1 = output contigs for variants without ALT sequence called, 2 = output all contigs [default: 1]
- --mode [pe|pacbio|nanopore] Type of input reads.
シーケンスプラットフォームに応じて、dysguは推奨設定と特定の機械学習モデルを適用する。modeはこの特定のモデルを適用するための複数パラメータの一括設定になる。
ロングリード
非常に長いリード(Oxford nanopore)の場合、パイプラインのfetchステージは必要ないので、callコマンドを直接使用する。minimap2やngmlrでアライメントされたロングリードで動作するように開発されている。
#Pacbio
dysgu call -p 12 --mode pacbio reference.fa temp_dir input.bam > svs.vcf
#ONT
dysgu call -p 12 --mode nanopore reference.fa temp_dir input.bam > svs.vcf
(注;PacBio Sequel II HiFiリードでは一般にrunコマンドが推奨されている)
ONTのような非常に長いリードの場合、リードの大部分が複数のSV候補を持っており、事実上入力ファイルが冗長になる。このようなデータセットによっては 'fetch' コマンドは不要で、代わりに 'call' コマンドを推奨する(論文より)。
複数サンプルのマージ
サンプルのマージを計画している場合、「run/call」モジュールを実行する際に-v2オプションを使用する。
dysgu merge sample1.vcf sample2.vcf > combined.vcf
引用
Dysgu: efficient structural variant calling using short or long reads
Kez Cleal, Duncan M Baird
Nucleic Acids Research, Published: 31 January 2022