macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ノイズの多いロングリードを使ってSVをコールする SVIM

2020 7/21 出力画像追加

 

 構造変異とは、50 bpよりも大きいゲノム変異と定義されている。構造変異は、一塩基多型や小さな挿入・欠失よりも、任意のゲノムのより多くの塩基に影響を与えることが示されている。さらに、これらの変異はヒトの表現型や多様性に大きな影響を与え、多くの疾患と関連している。それらのサイズとリピートとの関連性のため、ショットガンシーケンシングでは、特にショートリードに基づいている場合には検出が困難である。Pacific BiosciencesやOxford Nanopore Technologiesが提供するようなロングリード、1分子シーケンシング技術は、数千塩基対の長さのリードを生成する。より高いエラー率とシークエンシングコストにもかかわらず、ロングリードシークエンシングは構造変異の検出に多くの利点をもたらす。しかし、利用可能なソフトウェアツールでは、その可能性を十分に活用できていないのが現状である。
 本研究では、ロングリードデータから構造バリアントを高感度に検出し、正確に特徴付けするためのツールであるSVIMを紹介する。SVIMは、リードアラインメントからの構造バリアントシグネチャの収集、クラスタリング、および組み合わせのための3つのコンポーネントから構成されている。SVIMは、タンデム重複やインタースパン重複、新規エレメント挿入などの類似したタイプを含む5つの異なるバリアントクラスを識別する。SVIMは、ゲノム上の重複の発生源と発生先の両方を抽出できる点でユニークである。Pacific Biosciences社とNanopore社のシーケンシングマシンによるシミュレーションデータと実データでの評価では、既存のツールと比較して良好な結果が得られている。
 SVIM のソースコードと実行ファイルは Github: github.com/eldariont/svim で公開されている。SVIMはPython 3で実装されており、biocondaとPythonパッケージインデックスで利用できる。

 

SVIMは4つの主要なステップで構成される。

COLLECTは、ロングリードアラインメントのSVのシグネチャを検出する。
CLUSTER は同じ SV からのシグネチャをマージする。
COMBINEは、異なるゲノム領域からのクラスタを結合し、異なるSVタイプに分類する。
GENOTYPEはSVにまたがるアラインメントを用いて遺伝子型を決定する。

 

wiki

https://github.com/eldariont/svim/wiki

 

raw long readsではなくハプロイドや二倍体ゲノムアセンブリ、またはコンティグを解析する場合は、別の方法であるSVIM-asmを使う。

インストール

ubuntu18.04LTSでテストした。

Github

#bioconda (link)
conda create -n svim_env --channel bioconda svim
conda activate svim_env

#pip (Python 3.6 or newer)
pip install svim

#Install from github 
git clone https://github.com/eldariont/svim.git
cd svim
pip install .

svim -h

$ svim -h

usage: svim [-h] [--version] {reads,alignment} ...

 

SVIM (pronounced SWIM) is a structural variant caller for long reads. 

It discriminates five different variant classes: deletions, tandem and interspersed duplications, 

inversions and insertions. SVIM is unique in its capability of extracting both the genomic origin and 

destination of duplications.

 

SVIM consists of four major steps:

- COLLECT detects signatures for SVs in long read alignments

- CLUSTER merges signatures that come from the same SV

- COMBINE combines clusters from different genomic regions and classifies them into distinct SV types

- GENOTYPE uses alignments spanning SVs to determine their genotype

 

SVIM can process two types of input. Firstly, it can detect SVs from raw reads by aligning them to a given reference genome first ("SVIM.py reads [options] working_dir reads genome").

Alternatively, it can detect SVs from existing reads alignments in SAM/BAM format ("SVIM.py alignment [options] working_dir bam_file").

 

positional arguments:

  {reads,alignment}  modes

    reads            Detect SVs from raw reads. Align reads to given reference

                     genome first.

    alignment        Detect SVs from an existing alignment

 

optional arguments:

  -h, --help         show this help message and exit

  --version, -v      show program's version number and exit

> svim reads -h

$ svim reads -h

usage: svim reads [-h] [--verbose] [--cores CORES]

                  [--aligner {ngmlr,minimap2}] [--nanopore]

                  [--min_mapq MIN_MAPQ] [--min_sv_size MIN_SV_SIZE]

                  [--max_sv_size MAX_SV_SIZE]

                  [--segment_gap_tolerance SEGMENT_GAP_TOLERANCE]

                  [--segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE]

                  [--all_bnds]

                  [--partition_max_distance PARTITION_MAX_DISTANCE]

                  [--distance_normalizer DISTANCE_NORMALIZER]

                  [--cluster_max_distance CLUSTER_MAX_DISTANCE]

                  [--del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE]

                  [--trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE]

                  [--trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE]

                  [--trans_sv_max_distance TRANS_SV_MAX_DISTANCE]

                  [--skip_genotyping] [--minimum_score MINIMUM_SCORE]

                  [--homozygous_threshold HOMOZYGOUS_THRESHOLD]

                  [--heterozygous_threshold HETEROZYGOUS_THRESHOLD]

                  [--minimum_depth MINIMUM_DEPTH] [--sample SAMPLE]

                  [--types TYPES] [--sequence_alleles] [--insertion_sequences]

                  [--tandem_duplications_as_insertions]

                  [--interspersed_duplications_as_insertions] [--read_names]

                  [--zmws]

                  working_dir reads genome

 

positional arguments:

  working_dir           Working and output directory. Existing files in the

                        directory are overwritten. If the directory does not

                        exist, it is created.

  reads                 Read file (FASTA, FASTQ, gzipped FASTA, gzipped FASTQ

                        or file list). The read file has to have one of the

                        following supported file endings: FASTA: .fa, .fasta,

                        .FA, .fa.gz, .fa.gzip, .fasta.gz, .fasta.gzip FASTQ:

                        .fq, .fastq, .FQ, .fq.gz, .fq.gzip, .fastq.gz,

                        .fastq.gzip FILE LIST: .fa.fn, fq.fn

  genome                Reference genome file (FASTA)

 

optional arguments:

  -h, --help            show this help message and exit

  --verbose             Enable more verbose logging (default: False)

 

ALIGN:

  --cores CORES         CPU cores to use for the alignment (default: 1)

  --aligner {ngmlr,minimap2}

                        Tool for read alignment: ngmlr or minimap2 (default:

                        ngmlr)

  --nanopore            Use Nanopore settings for read alignment (default:

                        False)

 

COLLECT:

  --min_mapq MIN_MAPQ   Minimum mapping quality of reads to consider (default:

                        20). Reads with a lower mapping quality are ignored.

  --min_sv_size MIN_SV_SIZE

                        Minimum SV size to detect (default: 40). SVIM can

                        potentially detect events of any size but is limited

                        by the signal-to-noise ratio in the input alignments.

                        That means that more accurate reads and alignments

                        enable the detection of smaller events. For current

                        PacBio or Nanopore data, we would recommend a minimum

                        size of 40bp or larger.

  --max_sv_size MAX_SV_SIZE

                        Maximum SV size to detect (default: 100000). This

                        parameter is used to distinguish long deletions (and

                        inversions) from translocations which cannot be

                        distinguished from the alignment alone. Split read

                        segments mapping far apart on the reference could

                        either indicate a very long deletion (inversion) or a

                        translocation breakpoint. SVIM calls a translocation

                        breakpoint if the mapping distance is larger than this

                        parameter and a deletion (or inversion) if it is

                        smaller or equal.

  --segment_gap_tolerance SEGMENT_GAP_TOLERANCE

                        Maximum tolerated gap between adjacent alignment

                        segments (default: 10). This parameter applies to gaps

                        on the reference and the read. Example: Deletions are

                        detected from two subsequent segments of a split read

                        that are mapped far apart from each other on the

                        reference. The segment gap tolerance determines the

                        maximum tolerated length of the read gap between both

                        segments. If there is an unaligned read segment larger

                        than this value between the two segments, no deletion

                        is called.

  --segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE

                        Maximum tolerated overlap between adjacent alignment

                        segments (default: 5). This parameter applies to

                        overlaps on the reference and the read. Example:

                        Deletions are detected from two subsequent segments of

                        a split read that are mapped far apart from each other

                        on the reference. The segment overlap tolerance

                        determines the maximum tolerated length of an overlap

                        between both segments on the read. If the overlap

                        between the two segments on the read is larger than

                        this value, no deletion is called.

  --all_bnds            Output all rearrangements additionally in BND notation

                        (default: False). By default, SV signatures from the

                        read alignments are used to detect complete SVs, such

                        as deletions, insertions and inversions. When this

                        option is enabled, all SVs are also output in breakend

                        (BND) notation as defined in the VCF specs. For

                        instance, a deletion gets two records in the VCF

                        output: 1. the normal <DEL> record and 2. a <BND>

                        record representing the novel adjacency between the

                        deletion's start and end coordinate in the sample

                        genome.

 

CLUSTER:

  --partition_max_distance PARTITION_MAX_DISTANCE

                        Maximum distance in bp between SVs in a partition

                        (default: 1000). Before clustering, the SV signatures

                        are divided into coarse partitions. This parameter

                        determines the maximum distance between two subsequent

                        signatures in the same partition. If the distance

                        between two subsequent signatures is larger than this

                        parameter, they are distributed into separate

                        partitions.

  --distance_normalizer DISTANCE_NORMALIZER

                        Distance normalizer used for span-position distance

                        (default: 900). SVIM clusters the SV signatures using

                        an hierarchical clustering approach and a novel

                        distance metric called "span-position distance". Span-

                        position distance is the sum of two components, span

                        distance and position distance. The span distance is

                        the difference in lengths between signatures

                        normalized by the greater length and always lies in

                        the interval [0,1]. The position distance is the

                        difference in position between signatures normalized

                        by the distance normalizer (this parameter). For a

                        position difference of 1.8kb and a distance normalizer

                        of 900, the position distance will be 2. A smaller

                        distance normalizer leads to a higher position

                        distance and as a consequence increases the importance

                        of the position distance in the span-position distance

                        relative to the span distance.

  --cluster_max_distance CLUSTER_MAX_DISTANCE

                        Maximum span-position distance between SVs in a

                        cluster (default: 0.3). This is the most important

                        parameter because it determines the strictness of

                        clustering. Choosing a large value leads to fewer but

                        larger clusters with larger distances between its

                        members. Choosing a small value leads to more but

                        smaller clusters with smaller distances between its

                        members. This parameter determines the height of the

                        cut-off in the hierarchical clustering dendrogram.

 

COMBINE:

  --del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE

                        Maximum span-position distance between the origin of

                        an insertion and a deletion to be flagged as a

                        potential cut&paste insertion (default: 1.0)

  --trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE

                        Maximum distance in bp between translocation

                        breakpoint destinations in a partition (default: 1000)

  --trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE

                        Maximum distance in bp between translocation

                        breakpoints in a partition (default: 200)

  --trans_sv_max_distance TRANS_SV_MAX_DISTANCE

                        Maximum distance in bp between a translocation

                        breakpoint and an SV signature to be combined

                        (default: 500)

 

GENOTYPE:

  --skip_genotyping     Disable genotyping (default: False)

  --minimum_score MINIMUM_SCORE

                        Minimum score for genotyping (default: 3). Only SV

                        candidates with a higher or equal score are genotyped.

                        Depending on the score distribution among the SV

                        candidates, decreasing this value increases the

                        runtime. We recommend to choose a value close to the

                        score threshold used for filtering the SV candidates.

  --homozygous_threshold HOMOZYGOUS_THRESHOLD

                        Minimum variant allele frequency to be called as

                        homozygous (default: 0.8). Allele frequency is

                        computed as the fraction of reads supporting the

                        variant over the total number of reads covering the

                        variant. Variants with an allele frequence greater

                        than or equal to this threshold are called as

                        homozygous alternative.

  --heterozygous_threshold HETEROZYGOUS_THRESHOLD

                        Minimum variant allele frequency to be called as

                        heterozygous (default: 0.2). Allele frequency is

                        computed as the fraction of reads supporting the

                        variant over the total number of reads covering the

                        variant. Variants with an allele frequence greater

                        than or equal to this threshold but lower than the

                        homozygous threshold are called as heterozygous

                        alternative. Variants with an allele frequence lower

                        than this threshold are called as homozygous

                        reference.

  --minimum_depth MINIMUM_DEPTH

                        Minimum total read depth for genotyping (default: 4).

                        Variants covered by a total number of reads lower than

                        this value are not assigned a genotype (./. in the

                        output VCF file).

 

OUTPUT:

  --sample SAMPLE       Sample ID to include in output vcf file (default:

                        Sample)

  --types TYPES         SV types to include in output VCF (default:

                        DEL,INS,INV,DUP:TANDEM,DUP:INT,BND). Give a comma-

                        separated list of SV types. The possible SV types are:

                        DEL (deletions), INS (novel insertions), INV

                        (inversions), DUP:TANDEM (tandem duplications),

                        DUP:INT (interspersed duplications), BND (breakends).

  --sequence_alleles    Use nucleotide sequences for alleles of deletions,

                        inversions and insertions in output VCF (default:

                        False). By default, all SVs are represented by

                        symbolic alleles, such as <DEL>, <INV> or <INS>. If

                        enabled, ALT alleles of insertions are obtained from

                        the sequence of a random read that supports the

                        variant.

  --insertion_sequences

                        Output insertion sequences in INFO tag of VCF

                        (default: False). If enabled, the INFO/SEQS tag

                        contains a list of insertion sequences from the

                        supporting reads. However, the insertion sequences are

                        not combined into a consensus sequence.

  --tandem_duplications_as_insertions

                        Represent tandem duplications as insertions in output

                        VCF (default: False). By default, tandem duplications

                        are represented by the SVTYPE=DUP:TANDEM and the

                        genomic source is given by the POS and END tags. When

                        enabling this option, duplications are instead

                        represented by the SVTYPE=INS and POS and END both

                        give the insertion point of the duplication.

  --interspersed_duplications_as_insertions

                        Represent interspersed duplications as insertions in

                        output VCF (default: False). By default, interspersed

                        duplications are represented by the SVTYPE=DUP:INT and

                        the genomic source is given by the POS and END tags.

                        When enabling this option, duplications are instead

                        represented by the SVTYPE=INS and POS and END both

                        give the insertion point of the duplication.

  --read_names          Output names of supporting reads in INFO tag of VCF

                        (default: False). If enabled, the INFO/READS tag

                        contains the list of names of the supporting reads.

  --zmws                look for information on ZMWs in PacBio read names

                        (default: False). If enabled, the INFO/ZMWS tag

                        contains the number of ZMWs that produced supporting

                        reads.

svim alignment -h

$ svim alignment -h

usage: svim alignment [-h] [--verbose] [--min_mapq MIN_MAPQ]

                      [--min_sv_size MIN_SV_SIZE] [--max_sv_size MAX_SV_SIZE]

                      [--segment_gap_tolerance SEGMENT_GAP_TOLERANCE]

                      [--segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE]

                      [--partition_max_distance PARTITION_MAX_DISTANCE]

                      [--distance_normalizer DISTANCE_NORMALIZER]

                      [--cluster_max_distance CLUSTER_MAX_DISTANCE]

                      [--all_bnds]

                      [--del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE]

                      [--trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE]

                      [--trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE]

                      [--trans_sv_max_distance TRANS_SV_MAX_DISTANCE]

                      [--skip_genotyping] [--minimum_score MINIMUM_SCORE]

                      [--homozygous_threshold HOMOZYGOUS_THRESHOLD]

                      [--heterozygous_threshold HETEROZYGOUS_THRESHOLD]

                      [--minimum_depth MINIMUM_DEPTH] [--sample SAMPLE]

                      [--types TYPES] [--sequence_alleles]

                      [--insertion_sequences]

                      [--tandem_duplications_as_insertions]

                      [--interspersed_duplications_as_insertions]

                      [--read_names] [--zmws]

                      working_dir bam_file genome

 

positional arguments:

  working_dir           Working and output directory. Existing files in the

                        directory are overwritten. If the directory does not

                        exist, it is created.

  bam_file              Coordinate-sorted and indexed BAM file with aligned

                        long reads

  genome                Reference genome file that the long reads were aligned

                        to (FASTA)

 

optional arguments:

  -h, --help            show this help message and exit

  --verbose             Enable more verbose logging (default: False)

 

COLLECT:

  --min_mapq MIN_MAPQ   Minimum mapping quality of reads to consider (default:

                        20). Reads with a lower mapping quality are ignored.

  --min_sv_size MIN_SV_SIZE

                        Minimum SV size to detect (default: 40). SVIM can

                        potentially detect events of any size but is limited

                        by the signal-to-noise ratio in the input alignments.

                        That means that more accurate reads and alignments

                        enable the detection of smaller events. For current

                        PacBio or Nanopore data, we would recommend a minimum

                        size of 40bp or larger.

  --max_sv_size MAX_SV_SIZE

                        Maximum SV size to detect (default: 100000). This

                        parameter is used to distinguish long deletions (and

                        inversions) from translocations which cannot be

                        distinguished from the alignment alone. Split read

                        segments mapping far apart on the reference could

                        either indicate a very long deletion (inversion) or a

                        translocation breakpoint. SVIM calls a translocation

                        breakpoint if the mapping distance is larger than this

                        parameter and a deletion (or inversion) if it is

                        smaller or equal.

  --segment_gap_tolerance SEGMENT_GAP_TOLERANCE

                        Maximum tolerated gap between adjacent alignment

                        segments (default: 10). This parameter applies to gaps

                        on the reference and the read. Example: Deletions are

                        detected from two subsequent segments of a split read

                        that are mapped far apart from each other on the

                        reference. The segment gap tolerance determines the

                        maximum tolerated length of the read gap between both

                        segments. If there is an unaligned read segment larger

                        than this value between the two segments, no deletion

                        is called.

  --segment_overlap_tolerance SEGMENT_OVERLAP_TOLERANCE

                        Maximum tolerated overlap between adjacent alignment

                        segments (default: 5). This parameter applies to

                        overlaps on the reference and the read. Example:

                        Deletions are detected from two subsequent segments of

                        a split read that are mapped far apart from each other

                        on the reference. The segment overlap tolerance

                        determines the maximum tolerated length of an overlap

                        between both segments on the read. If the overlap

                        between the two segments on the read is larger than

                        this value, no deletion is called.

 

CLUSTER:

  --partition_max_distance PARTITION_MAX_DISTANCE

                        Maximum distance in bp between SVs in a partition

                        (default: 1000). Before clustering, the SV signatures

                        are divided into coarse partitions. This parameter

                        determines the maximum distance between two subsequent

                        signatures in the same partition. If the distance

                        between two subsequent signatures is larger than this

                        parameter, they are distributed into separate

                        partitions.

  --distance_normalizer DISTANCE_NORMALIZER

                        Distance normalizer used for span-position distance

                        (default: 900). SVIM clusters the SV signatures using

                        an hierarchical clustering approach and a novel

                        distance metric called "span-position distance". Span-

                        position distance is the sum of two components, span

                        distance and position distance. The span distance is

                        the difference in lengths between signatures

                        normalized by the greater length and always lies in

                        the interval [0,1]. The position distance is the

                        difference in position between signatures normalized

                        by the distance normalizer (this parameter). For a

                        position difference of 1.8kb and a distance normalizer

                        of 900, the position distance will be 2. A smaller

                        distance normalizer leads to a higher position

                        distance and as a consequence increases the importance

                        of the position distance in the span-position distance

                        relative to the span distance.

  --cluster_max_distance CLUSTER_MAX_DISTANCE

                        Maximum span-position distance between SVs in a

                        cluster (default: 0.3). This is the most important

                        parameter because it determines the strictness of

                        clustering. Choosing a large value leads to fewer but

                        larger clusters with larger distances between its

                        members. Choosing a small value leads to more but

                        smaller clusters with smaller distances between its

                        members. This parameter determines the height of the

                        cut-off in the hierarchical clustering dendrogram.

  --all_bnds            Output all rearrangements additionally in BND notation

                        (default: False). By default, SV signatures from the

                        read alignments are used to detect complete SVs, such

                        as deletions, insertions and inversions. When this

                        option is enabled, all SVs are also output in breakend

                        (BND) notation as defined in the VCF specs. For

                        instance, a deletion gets two records in the VCF

                        output: 1. the normal <DEL> record and 2. a <BND>

                        record representing the novel adjacency between the

                        deletion's start and end coordinate in the sample

                        genome.

 

COMBINE:

  --del_ins_dup_max_distance DEL_INS_DUP_MAX_DISTANCE

                        Maximum span-position distance between the origin of

                        an insertion and a deletion to be flagged as a

                        potential cut&paste insertion (default: 1.0)

  --trans_destination_partition_max_distance TRANS_DESTINATION_PARTITION_MAX_DISTANCE

                        Maximum distance in bp between translocation

                        breakpoint destinations in a partition (default: 1000)

  --trans_partition_max_distance TRANS_PARTITION_MAX_DISTANCE

                        Maximum distance in bp between translocation

                        breakpoints in a partition (default: 200)

  --trans_sv_max_distance TRANS_SV_MAX_DISTANCE

                        Maximum distance in bp between a translocation

                        breakpoint and an SV signature to be combined

                        (default: 500)

 

GENOTYPE:

  --skip_genotyping     Disable genotyping (default: False)

  --minimum_score MINIMUM_SCORE

                        Minimum score for genotyping (default: 3). Only SV

                        candidates with a higher or equal score are genotyped.

                        Depending on the score distribution among the SV

                        candidates, decreasing this value increases the

                        runtime. We recommend to choose a value close to the

                        score threshold used for filtering the SV candidates.

  --homozygous_threshold HOMOZYGOUS_THRESHOLD

                        Minimum variant allele frequency to be called as

                        homozygous (default: 0.8). Allele frequency is

                        computed as the fraction of reads supporting the

                        variant over the total number of reads covering the

                        variant. Variants with an allele frequence greater

                        than or equal to this threshold are called as

                        homozygous alternative.

  --heterozygous_threshold HETEROZYGOUS_THRESHOLD

                        Minimum variant allele frequency to be called as

                        heterozygous (default: 0.2). Allele frequency is

                        computed as the fraction of reads supporting the

                        variant over the total number of reads covering the

                        variant. Variants with an allele frequence greater

                        than or equal to this threshold but lower than the

                        homozygous threshold are called as heterozygous

                        alternative. Variants with an allele frequence lower

                        than this threshold are called as homozygous

                        reference.

  --minimum_depth MINIMUM_DEPTH

                        Minimum total read depth for genotyping (default: 4).

                        Variants covered by a total number of reads lower than

                        this value are not assigned a genotype (./. in the

                        output VCF file).

 

OUTPUT:

  --sample SAMPLE       Sample ID to include in output vcf file (default:

                        Sample)

  --types TYPES         SV types to include in output VCF (default:

                        DEL,INS,INV,DUP:TANDEM,DUP:INT,BND). Give a comma-

                        separated list of SV types. The possible SV types are:

                        DEL (deletions), INS (novel insertions), INV

                        (inversions), DUP:TANDEM (tandem duplications),

                        DUP:INT (interspersed duplications), BND (breakends).

  --sequence_alleles    Use nucleotide sequences for alleles of deletions,

                        inversions and insertions in output VCF (default:

                        False). By default, all SVs are represented by

                        symbolic alleles, such as <DEL>, <INV> or <INS>. If

                        enabled, ALT alleles of insertions are obtained from

                        the sequence of a random read that supports the

                        variant.

  --insertion_sequences

                        Output insertion sequences in INFO tag of VCF

                        (default: False). If enabled, the INFO/SEQS tag

                        contains a list of insertion sequences from the

                        supporting reads. However, the insertion sequences are

                        not combined into a consensus sequence.

  --tandem_duplications_as_insertions

                        Represent tandem duplications as insertions in output

                        VCF (default: False). By default, tandem duplications

                        are represented by the SVTYPE=DUP:TANDEM and the

                        genomic source is given by the POS and END tags. When

                        enabling this option, duplications are instead

                        represented by the SVTYPE=INS and POS and END both

                        give the insertion point of the duplication.

  --interspersed_duplications_as_insertions

                        Represent interspersed duplications as insertions in

                        output VCF (default: False). By default, interspersed

                        duplications are represented by the SVTYPE=DUP:INT and

                        the genomic source is given by the POS and END tags.

                        When enabling this option, duplications are instead

                        represented by the SVTYPE=INS and POS and END both

                        give the insertion point of the duplication.

  --read_names          Output names of supporting reads in INFO tag of VCF

                        (default: False). If enabled, the INFO/READS tag

                        contains the list of names of the supporting reads.

  --zmws                look for information on ZMWs in PacBio read names

                        (default: False). If enabled, the INFO/ZMWS tag

                        contains the number of ZMWs that produced supporting

                        reads.

 

 

実行方法

 生のロングリードを与えられたリファレンスゲノムにアラインメントすることで、生のロングリードからSVを検出することができる。また、SAM/BAMフォーマットの既存のリードアライメントからSVを検出することもできる。

 

デフォルトのngmlrを使ってリードをアラインし、SVをコール。

#PacBio reads
svim reads out_dir input_reads.fa ref.fa

#ONT reads
svim reads --nanopore out_dir input_reads.fa ref.fa

生のfasta、fastq、gzip圧縮を扱える。

指定したout_dirディレクトリにログファイル、VCF 形式の SVコール結果、BED 形式の SVコール結果、BED形式の中間シグネチャクラスタが出力される。

 

minimap2を使ってリードをアラインし、SVをコール。

#PacBio reads
svim reads --aligner minimap2 out_dir input_reads.fa ref.fa

#Use only high-quality alignments
svim reads --min_mapq 30 out_dir input_reads.fa ref.fa

 

bamを指定する。

svim alignment my_sample input.bam

出力

f:id:kazumaxneo:20200721141259p:plain

variants.vcf

f:id:kazumaxneo:20200721141302p:plain

  • SVIM は全ての SVコールとスコアを出力する。そのため、結果をこれらのスコアに基づいてフィルタリングすることが強く推奨されている。これはすべてのバリアントコールを出力して、フィルタリングを後処理のステップとして残した方が、よりユーザーフレンドリーであるというオーサーらの考えによる。
  • SVIM は各 variant コールに対して 0 から 100 の間のスコアを出力する。スコアが高いほど、そのコールがより信頼できることを意味する。スコアはVCF出力のQUAL列やBEDファイルの5番目の列にある。スコアは主にサポートするリードの数に基づいており。さらにSVのスパンと位置に関するサポートリード間の一致も考慮されている。
  • 結果のフィルタリングに関してだが、スコアがサポートリード数に依存するため、そのスコア分布は入力のシーケンシングカバレッジに応じて変化する。したがって、適切なスコアカットオフについて一概には言えない。カバレッジの高いデータセット(>40倍)では、10-15のしきい値が推奨される。カバー率の低いデータセットでは,閾値を低くすべきである。

引用

SVIM: structural variant identification using mapped long reads
David Heller, Martin Vingron
Bioinformatics, Volume 35, Issue 17, 1 September 2019