macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ハプロイドまたは二倍体ゲノムのためのSVコーラー SVIM-asm

2021 8/14 タイトル修正

 

 ゲノム変異の主要なクラスの一つとして、構造バリアント(SV)は50 bps以上の大きさのゲノムリアレンジメントの多様な範囲から構成されている。ヒトの平均的なゲノムには、シングルヌクレオチドバリアント(Single Nucleotide Variants: SNV)やsmall indelに比べてSVの数はかなり少ないが、SVはより多くの塩基対に影響を与える(1000 Genomes Project Consortium, 2015)。その結果、SVは健康なヒトの表現型とヒトの疾患の両方に強い影響を与える。
 手頃な価格で正確な次世代シークエンシング(NGS)技術が利用できるようになったため、SVは現在、一般的にシークエンシングのリードの分析によって検出されるようになった。通常、調査対象のゲノムからのリードを既存のリファレンスゲノムにアラインメントして、両ゲノム間の違いを明らかにする(リードベースのSVコール)。また、de novo アセンブリでは、リード間の配列のオーバーラップを利用してコンティグと呼ばれるより長いゲノム断片を計算で再構築する。生のシーケンシングリードと同様に、これらのアセンブリーコンティグをリファレンスゲノムまたは比較対象のゲノムにアラインメントして、SVの検出を容易にすることができる(アセンブリベースのSVコール)(Sedlazeckら、2018)。
 生のリードの代わりにコンティグからのSVの検出は、高品質のリファレンスゲノムが利用できない種の解析に特に有用である。アセンブリベースのSVコールの他のアプリケーションには、リファレンスと比較して大規模なリアレンジメントを有するゲノムの解析、複数の関連するゲノムアセンブリのペアワイズ比較、およびサンプル固有の配列または大きな挿入の解析が含まれる。ショートリードとロングリードのアライメントからSVを検出するソフトウェアツールは増えているが(Kosugiら、2019)、ゲノム-ゲノムアライメントからSVを検出するツールは少ない。そのような3つのツール、AsmVar、AssemblyticsおよびSyRyは最近開発されたが、ハプロイドゲノムアセンブリの解析のためだけに開発されている(Goelら、2019)。
最近まで、ゲノムアセンブリ法は、通常、二倍体ゲノムの2つの親ハプロタイプをハプロイドゲノム表現に畳み込んでいた。しかし、より長いシークエンシングリードの利用可能性およびHi-CおよびStrand-Seqのような相補的なシークエンシング技術の利用可能性により、ハプロタイプresolvedゲノムアセンブリのルーチンの生産が可能になった(Gargら、2019; Nurkら、2020)。これらのハプロタイプresolveされたアセンブリは、両方の親ハプロタイプの完全な遺伝情報を捕捉し、バリアントコール感度を向上させることができる(Chaisson et al)。 しかし、これまでのところ、ハプロタイプresolveされたゲノムアセンブリからの大きな挿入や欠失を検出するための手法としては、DipCallという1つの手法しか発表されていない(Li et al., 2018)。本研究では、ハプロタイプおよび二倍体ゲノムアセンブリから、挿入や欠失を含む6つの一般的なクラスのSVを検出し、ジェノタイピングするための手法SVIM-asmを紹介する。

 

Githubより

SVIM-asm (SWIM-assemと発音) は、ハプロイドまたは二倍体ゲノムアライメントのための構造的バリアント コーラーである。与えられたソート済みBAMファイル(できればminimap2から)を解析し、クエリアセンブリとリファレンス間の5つの異なるバリアントクラス(欠失、挿入、タンデムおよびインタースパン重複、逆位)を検出する。生のロングシーケンシングリードからSVをコールするにはSVIMを使用する。


インストール

インストールスクリプトを使って導入した。

依存

本体 Github

#bioconda(link)
conda create -n svimasm_env python=3.8 -y
conda activate svimasm_env
conda install --channel bioconda svim-asm -y

#Install from github (requires Python 3)
git clone https://github.com/eldariont/svim-asm.git
cd svim-asm
pip install .

svim-asm -h

$ svim-asm -h

usage: svim-asm [-h] [--version] {haploid,diploid} ...

 

SVIM-asm (pronounced SWIM-assem) is a structural variant caller for genome-genome alignments. 

It discriminates five different variant classes: deletions, insertions, tandem and interspersed duplications and inversions.

SVIM-asm analyzes alignments between a haploid or diploid query assembly and a reference assembly in SAM/BAM format. 

We recommend to produce the alignments using minimap2.

 

SVIM-asm has an haploid and a diploid mode depending on the input assembly and performs the following steps:

- COLLECT detects SVs from genome-genome alignments in BAM format

- PAIR merges the SV calls from the two haplotypes of a diploid assembly (diploid mode only)

- OUTPUT prints the found SVs in VCF format

 

positional arguments:

  {haploid,diploid}  modes

    haploid          Detect SVs from the alignment of an haploid query assembly to a reference assembly

    diploid          Detect SVs from the alignment of a diploid query assembly to a reference assembly

 

optional arguments:

  -h, --help         show this help message and exit

  --version, -v      show program's version number and exit

svim-asm haploid -h

$ svim-asm haploid -h

usage: svim-asm haploid [-h] [--min_mapq MIN_MAPQ] [--min_sv_size MIN_SV_SIZE] [--max_sv_size MAX_SV_SIZE] [--query_gap_tolerance QUERY_GAP_TOLERANCE] [--query_overlap_tolerance QUERY_OVERLAP_TOLERANCE]

                        [--reference_gap_tolerance REFERENCE_GAP_TOLERANCE] [--reference_overlap_tolerance REFERENCE_OVERLAP_TOLERANCE] [--sample SAMPLE] [--types TYPES] [--symbolic_alleles] [--tandem_duplications_as_insertions]

                        [--interspersed_duplications_as_insertions] [--query_names]

                        working_dir bam_file genome

 

positional arguments:

  working_dir           Working and output directory. Existing files in the directory are overwritten. If the directory does not exist, it is created.

  bam_file              SAM/BAM file with alignment of query assembly to reference assembly (needs to be coordinate-sorted and indexed)

  genome                Reference genome file that the assembly was aligned to (FASTA)

 

optional arguments:

  -h, --help            show this help message and exit

 

COLLECT:

  --min_mapq MIN_MAPQ   Minimum mapping quality of alignments to consider (default: 20). Alignments with a lower mapping quality are ignored.

  --min_sv_size MIN_SV_SIZE

                        Minimum SV size to detect (default: 40). SVIM can potentially detect events of any size but is limited by the signal-to-noise ratio in the input alignments. That means that more accurate assemblies and alignments

                        enable the detection of smaller events.

  --max_sv_size MAX_SV_SIZE

                        Maximum SV size to detect (default: 100000). This parameter is used to distinguish long deletions (and inversions) from translocations which cannot be distinguished from the alignment alone. Split read segments

                        mapping far apart on the reference could either indicate a very long deletion (inversion) or a translocation breakpoint. SVIM calls a translocation breakpoint if the mapping distance is larger than this parameter and

                        a deletion (or inversion) if it is smaller or equal.

  --query_gap_tolerance QUERY_GAP_TOLERANCE

                        Maximum tolerated gap between adjacent alignment segments on the query (default: 50). Example: Deletions are detected from two subsequent segments of a split query sequence that are mapped far apart from each other on

                        the reference. The query gap tolerance determines the maximum tolerated length of the query gap between both segments. If there is an unaligned query segment larger than this value between the two segments, no

                        deletion is called.

  --query_overlap_tolerance QUERY_OVERLAP_TOLERANCE

                        Maximum tolerated overlap between adjacent alignment segments on the query (default: 50). Example: Deletions are detected from two subsequent segments of a split query sequence that are mapped far apart from each

                        other on the reference. The query overlap tolerance determines the maximum tolerated length of an overlap between both segments in the query. If the overlap between the two segments in the query is larger than this

                        value, no deletion is called.

  --reference_gap_tolerance REFERENCE_GAP_TOLERANCE

                        Maximum tolerated gap between adjacent alignment segments on the reference (default: 50). Example: Insertions are detected from two segments of a split query sequence that are mapped right next to each other on the

                        reference but with unaligned sequence between them on the query. The reference gap tolerance determines the maximum tolerated length of the reference gap between both segments. If there is a reference gap larger than

                        this value between the two segments, no insertion is called.

  --reference_overlap_tolerance REFERENCE_OVERLAP_TOLERANCE

                        Maximum tolerated overlap between adjacent alignment segments on the reference (default: 50). Example: Insertions are detected from two segments of a split query sequence that are mapped right next to each other on

                        the reference but with unaligned sequence between them on the query. The reference overlap tolerance determines the maximum tolerated length of an overlap between both segments on the reference. If there is a

                        reference gap larger than this value between the two segments, no insertion is called.

 

OUTPUT:

  --sample SAMPLE       Sample ID to include in output vcf file (default: Sample)

  --types TYPES         SV types to include in output VCF (default: DEL,INS,INV,DUP:TANDEM,DUP:INT,BND). Give a comma-separated list of SV types. The possible SV types are: DEL (deletions), INS (novel insertions), INV (inversions),

                        DUP:TANDEM (tandem duplications), DUP:INT (interspersed duplications), BND (breakends).

  --symbolic_alleles    Use symbolic alleles, such as <DEL> or <INV> in the VCF output (default: False). By default, deletions, insertions, and inversions are represented by their nucleotide sequence in the output VCF.

  --tandem_duplications_as_insertions

                        Represent tandem duplications as insertions in output VCF (default: False). By default, tandem duplications are represented by the SVTYPE=DUP:TANDEM and the genomic source is given by the POS and END tags. When

                        enabling this option, duplications are instead represented by the SVTYPE=INS and POS and END both give the insertion point of the duplication.

  --interspersed_duplications_as_insertions

                        Represent interspersed duplications as insertions in output VCF (default: False). By default, interspersed duplications are represented by the SVTYPE=DUP:INT and the genomic source is given by the POS and END tags.

                        When enabling this option, duplications are instead represented by the SVTYPE=INS and POS and END both give the insertion point of the duplication.

  --query_names         Output names of supporting query sequences in INFO tag of VCF (default: False). If enabled, the INFO/READS tag contains the list of names of the supporting query sequences.

svim-asm diploid -h

$ svim-asm diploid -h

usage: svim-asm diploid [-h] [--min_mapq MIN_MAPQ] [--min_sv_size MIN_SV_SIZE] [--max_sv_size MAX_SV_SIZE] [--query_gap_tolerance QUERY_GAP_TOLERANCE] [--query_overlap_tolerance QUERY_OVERLAP_TOLERANCE]

                        [--reference_gap_tolerance REFERENCE_GAP_TOLERANCE] [--reference_overlap_tolerance REFERENCE_OVERLAP_TOLERANCE] [--max_edit_distance MAX_EDIT_DISTANCE] [--sample SAMPLE] [--types TYPES] [--symbolic_alleles]

                        [--tandem_duplications_as_insertions] [--interspersed_duplications_as_insertions] [--query_names]

                        working_dir bam_file1 bam_file2 genome

 

positional arguments:

  working_dir           Working and output directory. Existing files in the directory are overwritten. If the directory does not exist, it is created.

  bam_file1             SAM/BAM file with alignment of query assembly's first haplotype to reference assembly (needs to be coordinate-sorted and indexed)

  bam_file2             SAM/BAM file with alignment of query assembly's second haplotype to reference assembly (needs to be coordinate-sorted and indexed)

  genome                Reference genome file that the assembly was aligned to (FASTA)

 

optional arguments:

  -h, --help            show this help message and exit

 

COLLECT:

  --min_mapq MIN_MAPQ   Minimum mapping quality of alignments to consider (default: 20). Alignments with a lower mapping quality are ignored.

  --min_sv_size MIN_SV_SIZE

                        Minimum SV size to detect (default: 40). SVIM can potentially detect events of any size but is limited by the signal-to-noise ratio in the input alignments. That means that more accurate assemblies and alignments

                        enable the detection of smaller events.

  --max_sv_size MAX_SV_SIZE

                        Maximum SV size to detect (default: 100000). This parameter is used to distinguish long deletions (and inversions) from translocations which cannot be distinguished from the alignment alone. Split read segments

                        mapping far apart on the reference could either indicate a very long deletion (inversion) or a translocation breakpoint. SVIM calls a translocation breakpoint if the mapping distance is larger than this parameter and

                        a deletion (or inversion) if it is smaller or equal.

  --query_gap_tolerance QUERY_GAP_TOLERANCE

                        Maximum tolerated gap between adjacent alignment segments on the query (default: 50). Example: Deletions are detected from two subsequent segments of a split query sequence that are mapped far apart from each other on

                        the reference. The query gap tolerance determines the maximum tolerated length of the query gap between both segments. If there is an unaligned query segment larger than this value between the two segments, no

                        deletion is called.

  --query_overlap_tolerance QUERY_OVERLAP_TOLERANCE

                        Maximum tolerated overlap between adjacent alignment segments on the query (default: 50). Example: Deletions are detected from two subsequent segments of a split query sequence that are mapped far apart from each

                        other on the reference. The query overlap tolerance determines the maximum tolerated length of an overlap between both segments in the query. If the overlap between the two segments in the query is larger than this

                        value, no deletion is called.

  --reference_gap_tolerance REFERENCE_GAP_TOLERANCE

                        Maximum tolerated gap between adjacent alignment segments on the reference (default: 50). Example: Insertions are detected from two segments of a split query sequence that are mapped right next to each other on the

                        reference but with unaligned sequence between them on the query. The reference gap tolerance determines the maximum tolerated length of the reference gap between both segments. If there is a reference gap larger than

                        this value between the two segments, no insertion is called.

  --reference_overlap_tolerance REFERENCE_OVERLAP_TOLERANCE

                        Maximum tolerated overlap between adjacent alignment segments on the reference (default: 50). Example: Insertions are detected from two segments of a split query sequence that are mapped right next to each other on

                        the reference but with unaligned sequence between them on the query. The reference overlap tolerance determines the maximum tolerated length of an overlap between both segments on the reference. If there is a

                        reference gap larger than this value between the two segments, no insertion is called.

 

PAIR:

  --max_edit_distance MAX_EDIT_DISTANCE

                        Maximum edit distance between both alleles to be paired up into a homozygous call (default: 200).

 

OUTPUT:

  --sample SAMPLE       Sample ID to include in output vcf file (default: Sample)

  --types TYPES         SV types to include in output VCF (default: DEL,INS,INV,DUP:TANDEM,DUP:INT,BND). Give a comma-separated list of SV types. The possible SV types are: DEL (deletions), INS (novel insertions), INV (inversions),

                        DUP:TANDEM (tandem duplications), DUP:INT (interspersed duplications), BND (breakends).

  --symbolic_alleles    Use symbolic alleles, such as <DEL> or <INV> in the VCF output (default: False). By default, deletions, insertions, and inversions are represented by their nucleotide sequence in the output VCF.

  --tandem_duplications_as_insertions

                        Represent tandem duplications as insertions in output VCF (default: False). By default, tandem duplications are represented by the SVTYPE=DUP:TANDEM and the genomic source is given by the POS and END tags. When

                        enabling this option, duplications are instead represented by the SVTYPE=INS and POS and END both give the insertion point of the duplication.

  --interspersed_duplications_as_insertions

                        Represent interspersed duplications as insertions in output VCF (default: False). By default, interspersed duplications are represented by the SVTYPE=DUP:INT and the genomic source is given by the POS and END tags.

                        When enabling this option, duplications are instead represented by the SVTYPE=INS and POS and END both give the insertion point of the duplication.

  --query_names         Output names of supporting query sequences in INFO tag of VCF (default: False). If enabled, the INFO/READS tag contains the list of names of the supporting query sequences.

 

 

実行方法

svimはクエリアセンブリとリファレンスアセンブリを比較してSVをコールする。アライメントはminimap2で作成することが推奨されている。

ハプロイドアセンブリでの実行例。svim-asm haploidを使う。

minimap2 --paf-no-hit -a -x asm5 --cs -r2k -t 20 reference.fa assembly.fasta > align.sam
samtools sort -m4G -@4 align.sam > align.bam
samtools index align.sort.bam
svim-asm haploid working_dir align.sort.bam ref.fa

出力

f:id:kazumaxneo:20201224150927p:plain

sv-lengths-q5.png

f:id:kazumaxneo:20201224150715p:plain

variants.vcf

f:id:kazumaxneo:20201224150759p:plain

 

2つのハプロタイプからなる2倍体アセンブリのクエリを解析するには、両方のクエリアセンブリをリファレンスアセンブリにアラインさせる必要がある。svim-asm diploidを使う。

minimap2 --paf-no-hit -a -x asm5 --cs -r2k -t 20 ref.fa haplotype1.fasta > align_hap1.sam
minimap2 --paf-no-hit -a -x asm5 --cs -r2k -t 20 ref.fa haplotype2.fasta > align_hap2.sam

samtools sort -m4G -@4 align_hap1.sam > align_hap1.bam
samtools sort -m4G -@4 align_hap2.sam > align_hap2.bam

samtools index align_hap1.bam
samtools index align_hap2.bam

svim-asm diploid working_dir align_hap1.bam align_hap2.bam ref.fa

 

variants.vcf には、検出された SV が VCF 形式で格納されています (http://samtools.github.io/hts-specs/VCFv4.2.pdf を参照)。
sv-lengths.pngには、SVサイズのヒストグラムが含まれています。
SVIM_<day>_<time>.logには、コマンドラインと同じロギング出力が含まれています。

引用

SVIM-asm: Structural variant detection from haploid and diploid genome assemblies
David Heller, Martin Vingron

Bioinformatics. 2020 Dec 21