SyRI: detecting structural rearrangements by comparing genomes

2019 12/17: paper citation added

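Haploid genomes of the same species typically contain extensive co-linear (syntenic) regions whose genome structure is highly similar. These syntenic regions, however, are interrupted by structural rearrangements (SRs), which are characterized by a different orientation and/or position in different haplotypes. SRs can be classified into inversions, translocations, and duplications. (partly omitted)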

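Most existing SR prediction methods rely on alignments of short or long reads to a reference sequence. Local differences such as SNPs and small indels can be detected this way with high accuracy, but accurately predicting complex SRs from read alignments alone is difficult. In contrast, comparisons based on high-quality genome assemblies are more powerful for accurate SR identification, because assemblies are usually much longer and of higher quality than raw sequencing reads [ref.6]. However, despite the significant technical improvements behind recent de novo whole-genome assembly generation [ref.7], only a few tools use whole-genome alignments (WGA) as the basis for identifying genomic differences [ref.8, pubmed]. Available tools include, for example, AsmVar [ref.9], which compares individual scaffolds of a de novo assembly against a reference sequence and analyzes alignment breakpoints to identify inversions and translocations, and Assemblytics [ref.10] (introduced previously), which uses contigs uniquely aligned to a reference sequence to identify various genomic differences such as large indels and local repeat differences.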

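Here the authors introduce SyRI (Synteny and Rearrangement Identifier), which uses genome graphs generated from pairwise WGAs to compare the genomic structure of two related genomes (typically from the same species). SyRI starts by identifying all syntenic regions between the homologous chromosomes of the two genomes. All other regions are SRs by definition (otherwise they would be part of the syntenic regions), so this turns the problem of SR identification into one of SR classification. SyRI classifies the non-syntenic regions into inversions, translocations, and duplications. It analyzes the rearranged regions across the whole genome and annotates SRs globally so as to optimize the description of genomic differences. Translocations and transpositions are commonly distinguished, but here both types are called translocations. In addition, translocations and duplications are collectively referred to as TDs. Finally, SyRI identifies local variation across the whole genome, in both rearranged and non-rearranged regions. It is important to note that local variation and structural rearrangements are not distinguished by size or complexity, since local variation can also include large structural variation such as large deletions or large insertions. Instead, local variation is variation that can be found within syntenic regions as well as within structurally rearranged regions. This nesting makes it possible, for example, to distinguish SNPs in syntenic regions from SNPs in rearranged regions. The distinction matters because rearranged regions (and the local variation within them) do not follow Mendelian segregation patterns in the offspring and can even lead to changes in copy number. Using SyRI, the authors analyzed divergent genomes of five model species and genetically validated the translocations found between two A. thaliana strains by analyzing Illumina whole-genome sequencing data of 50 F2 recombinant genomes.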
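
The "find the syntenic regions first, then classify the rest" idea can be illustrated with a toy example. The sketch below is not SyRI's algorithm (SyRI works on genome graphs built from the alignments). It assumes each alignment is a hypothetical (ref_start, ref_end, qry_start, qry_end) tuple on a single chromosome pair, picks the highest-scoring co-linear chain by dynamic programming, and labels whatever is left as inverted (INV) or as a translocation/duplication candidate (TD). All names and coordinates are made up for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical alignment record: (ref_start, ref_end, qry_start, qry_end).
# qry_start > qry_end marks an alignment on the reverse strand.
Aln = Tuple[int, int, int, int]


def syntenic_chain(alns: List[Aln]) -> List[int]:
    """Indices of the co-linear forward chain covering the most reference bases."""
    fwd = sorted((i for i, a in enumerate(alns) if a[2] < a[3]),
                 key=lambda i: alns[i][0])
    if not fwd:
        return []
    score = {}
    prev = {}  # type: dict
    for pos, i in enumerate(fwd):
        length = alns[i][1] - alns[i][0] + 1
        score[i], prev[i] = length, None
        for j in fwd[:pos]:
            # j may precede i only if it ends before i starts in BOTH genomes.
            if alns[j][1] < alns[i][0] and alns[j][3] < alns[i][2]:
                if score[j] + length > score[i]:
                    score[i], prev[i] = score[j] + length, j
    cur: Optional[int] = max(fwd, key=lambda i: score[i])
    chain = []
    while cur is not None:
        chain.append(cur)
        cur = prev[cur]
    return chain[::-1]


def classify(alns: List[Aln]) -> List[str]:
    """Label each alignment SYN, INV or TD (a deliberately crude split)."""
    syn = set(syntenic_chain(alns))
    labels = []
    for i, a in enumerate(alns):
        if i in syn:
            labels.append("SYN")
        elif a[2] > a[3]:
            labels.append("INV")   # reverse strand, treated as inverted here
        else:
            labels.append("TD")    # forward strand but out of order
    return labels


example = [
    (1, 1000, 1, 1000),        # co-linear
    (1001, 1500, 5001, 5500),  # forward, but displaced in the query
    (1501, 2500, 2500, 1501),  # reverse strand
    (2501, 4000, 2501, 4000),  # co-linear
]
print(classify(example))  # ['SYN', 'TD', 'INV', 'SYN']
```

Because every alignment outside the chain receives some rearrangement label, the hard part in this toy version is finding the chain, which mirrors the point made above that identification turns into classification.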

Documentation

Synteny and Rearrangement Identifier (SyRI) | syri

Installation

Tested on Ubuntu 16.04 in a miniconda3.4.0.5 environment.

Dependencies

Python3

Cython
numpy
scipy
pandas
python-igraph
biopython
psutil
MUMmer3

conda install -y cython numpy scipy pandas biopython psutil
conda install -y -c conda-forge python-igraph
conda install -c bioconda mummer
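
After running the conda commands, a quick check that everything is visible from the active environment can save a failed build later. The following is a minimal sketch, not part of SyRI: the import names (python-igraph imports as igraph, biopython as Bio) and the list of MUMmer programs are the ones used in this post, and the function name is hypothetical.

```python
import importlib.util
import shutil

# Import names for the packages listed above.
MODULES = ["Cython", "numpy", "scipy", "pandas", "igraph", "Bio", "psutil"]
# MUMmer programs that appear in the commands and help text of this post.
PROGRAMS = ["nucmer", "delta-filter", "show-coords", "show-snps"]


def check_dependencies(modules=MODULES, programs=PROGRAMS):
    """Return {name: True/False} for each Python module and executable."""
    status = {}
    for name in modules:
        # find_spec returns None when a top-level module cannot be found.
        status[name] = importlib.util.find_spec(name) is not None
    for prog in programs:
        # shutil.which returns None when the program is not on PATH.
        status[prog] = shutil.which(prog) is not None
    return status


if __name__ == "__main__":
    for name, ok in check_dependencies().items():
        print(f"{name:14s} {'ok' if ok else 'MISSING'}")
```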

Main program: GitHub

git clone https://github.com/schneebergerlab/syri.git
cd syri/
python3 setup.py install
cd syri/bin/

> python syri

$ python syri

usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-o FOUT] [-k]
            [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR]
            [--prefix PREFIX] [--seed SEED] [--nc NCORES] [--novcf] [--nosr]
            [-b BRUTERUNTIME] [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT]
            [--inc INCREASEBY] [--no-chrmatch] [--nosv] [--nosnp] [--all]
            [--allow-offset OFFSET] [--cigar] [-s SSPATH]

syri: error: the following arguments are required: -c

kazu@edb2e2639563:~/syri/syri/bin$ ./syri -h

usage: syri [-h] -c INFILE [-r REF] [-q QRY] [-d DELTA] [-o FOUT] [-k]
            [--log {DEBUG,INFO,WARN}] [--lf LOG_FIN] [--dir DIR]
            [--prefix PREFIX] [--seed SEED] [--nc NCORES] [--novcf] [--nosr]
            [-b BRUTERUNTIME] [--unic TRANSUNICOUNT] [--unip TRANSUNIPERCENT]
            [--inc INCREASEBY] [--no-chrmatch] [--nosv] [--nosnp] [--all]
            [--allow-offset OFFSET] [--cigar] [-s SSPATH]

Input Files:
  -c INFILE             File containing alignment coordinates in a tsv format
                        (default: None)
  -r REF                Genome A (which is considered as reference for the
                        alignments). Required for local variation (large
                        indels, CNVs) identification. (default: None)
  -q QRY                Genome B (which is considered as query for the
                        alignments). Required for local variation (large
                        indels, CNVs) identification. (default: None)
  -d DELTA              .delta file from mummer. Required for short variation
                        (SNPs/indels) identification when CIGAR string is not
                        available (default: None)

optional arguments:
  -h, --help            show this help message and exit
  -o FOUT               Output file name (default: syri)
  -k                    Keep internediate output files (default: False)
  --log {DEBUG,INFO,WARN}
                        log level (default: INFO)
  --lf LOG_FIN          Name of log file (default: syri.log)
  --dir DIR             path to working directory (if not current directory)
                        (default: None)
  --prefix PREFIX       Prefix to add before the output file Names (default: )
  --seed SEED           seed for generating random numbers (default: 1)
  --nc NCORES           number of cores to use in parallel (max is number of
                        chromosomes) (default: 1)
  --novcf               Do not combine all files into one output file
                        (default: False)

SR identification:
  --nosr                Set to skip structural rearrangement identification
                        (default: False)
  -b BRUTERUNTIME       Cutoff to restrict brute force methods to take too
                        much time (in seconds). Smaller values would make
                        algorithm faster, but could have marginal effects on
                        accuracy. In general case, would not be required.
                        (default: 60)
  --unic TRANSUNICOUNT  Number of uniques bps for selecting translocation.
                        Smaller values would select smaller TLs better, but
                        may increase time and decrease accuracy. (default:
                        1000)
  --unip TRANSUNIPERCENT
                        Percent of unique region requried to select
                        translocation. Value should be in range (0,1]. Smaller
                        values would selection of translocation which are more
                        overlapped with other regions. (default: 0.5)
  --inc INCREASEBY      Minimum score increase required to add another
                        alignment to translocation cluster solution (default:
                        1000)
  --no-chrmatch         Do not allow SyRI to automatically match chromosome
                        ids between the two genomes if they are not equal
                        (default: False)

ShV identification:
  --nosv                Set to skip structural variation identification
                        (default: False)
  --nosnp               Set to skip SNP/Indel (within alignment)
                        identification (default: False)
  --all                 Use duplications too for variant identification
                        (default: False)
  --allow-offset OFFSET
                        BPs allowed to overlap (default: 0)
  --cigar               Find SNPs/indels using CIGAR string. Necessary for
                        alignment generated using aligners other than nucmers
                        (default: False)
  -s SSPATH             path to show-snps from mummer (default: show-snps)
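
When syri is called from a script, it can help to assemble the argument list in one place. The helper below is a hypothetical convenience wrapper, not something shipped with SyRI. It uses only flags that appear in the help text above (-c, -r, -q, -d, --prefix, --nc, --nosnp) and only builds the command, so nothing is executed.

```python
import shlex
from typing import List, Optional


def build_syri_command(coords: str,
                       ref: Optional[str] = None,
                       qry: Optional[str] = None,
                       delta: Optional[str] = None,
                       prefix: Optional[str] = None,
                       ncores: int = 1,
                       skip_snp: bool = False,
                       executable: str = "syri") -> List[str]:
    """Build an argument list for syri; -c is the only required input."""
    cmd = [executable, "-c", coords]
    if ref:
        cmd += ["-r", ref]
    if qry:
        cmd += ["-q", qry]
    if delta:
        cmd += ["-d", delta]
    if prefix:
        cmd += ["--prefix", prefix]
    if ncores != 1:
        cmd += ["--nc", str(ncores)]
    if skip_snp:
        cmd.append("--nosnp")
    return cmd


cmd = build_syri_command("out_m_i90_l100.coords", ref="ref.fa", qry="qry.fa",
                         delta="out_m_i90_l100.delta", ncores=4)
# Print a shell-safe version for inspection.
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) once the input files exist.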

> ./chroder

$ ./chroder

usage: chroder [-h] [-n NCOUNT] [-o OUT] [-noref] coords ref qry

chroder: error: the following arguments are required: coords, ref, qry

kazu@edb2e2639563:~/syri/syri/bin$ ./chroder -h

usage: chroder [-h] [-n NCOUNT] [-o OUT] [-noref] coords ref qry

positional arguments:
  coords      Alignment coordinates in a tsv format
  ref         Assembly of genome A in multi-fasta format
  qry         Assembly of genome B in multi-fasta format

optional arguments:
  -h, --help  show this help message and exit
  -n NCOUNT   number of N's to be inserted
  -o OUT      output file prefix
  -noref      Use this parameter when no assembly is at chromosome level

How to run

SyRI requires chromosome-level assemblies to identify SRs. If these are not available, first create pseudo-chromosome sequences with chroder and then run SyRI (documentation page).

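# If the assemblies are not at chromosome level, run chroder first

# step 1: run nucmer (documentation page)
nucmer --maxmatch -c 500 -b 500 -l 100 ref.fa scaffolds.fa
delta-filter -m -i 90 -l 100 out.delta > out_m_i90_l100.delta
show-coords -THrd out_m_i90_l100.delta > out_m_i90_l100.coords

# step 2: chroder
chroder -o output out_m_i90_l100.coords ref.fa scaffolds.fa

# step 3: syri
# Note: the .coords/.delta files must describe alignments of the sequences given to -r and -q,
# so after chroder the pseudo-chromosomes presumably need to be re-aligned (repeat step 1) first.
syri -c out_m_i90_l100.coords -d out_m_i90_l100.delta -r ref.fa -q output.fa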
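
Before handing the .coords file to chroder or syri, it can be worth a quick look at what nucmer produced. The sketch below assumes the tab-separated layout that show-coords -THrd is expected to write (start/end in reference and query, the two alignment lengths, percent identity, the two direction columns, then reference and query sequence IDs). The column names and sample rows are made up for illustration, so check them against a real file.

```python
import csv
import io
from collections import defaultdict

# Assumed column order of `show-coords -THrd` output (names are hypothetical).
COLUMNS = ["ref_start", "ref_end", "qry_start", "qry_end", "ref_len",
           "qry_len", "identity", "ref_dir", "qry_dir", "ref_id", "qry_id"]


def summarize_coords(handle):
    """Count alignments, aligned reference bases and reverse-strand hits
    for every (reference sequence, query sequence) pair."""
    summary = defaultdict(lambda: {"alignments": 0, "ref_bases": 0, "reverse": 0})
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) != len(COLUMNS):
            continue  # skip blank or unexpected lines
        rec = dict(zip(COLUMNS, row))
        entry = summary[(rec["ref_id"], rec["qry_id"])]
        entry["alignments"] += 1
        entry["ref_bases"] += int(rec["ref_len"])
        if rec["qry_dir"] == "-1":
            entry["reverse"] += 1
    return dict(summary)


# Made-up rows standing in for a real out_m_i90_l100.coords file.
sample = io.StringIO(
    "1\t1000\t1\t1000\t1000\t1000\t99.50\t1\t1\tChr1\tChr1\n"
    "2001\t3000\t2900\t2001\t1000\t900\t98.00\t1\t-1\tChr1\tChr1\n"
)
result = summarize_coords(sample)
print(result)
```

With a real file, open it with open("out_m_i90_l100.coords") and pass the handle instead of the StringIO object. Many distinct query IDs per reference chromosome is a hint that the query is still at scaffold level and that chroder is needed.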

The walkthrough of the example run is easy to follow.

https://schneebergerlab.github.io/syri/pipeline.html

Plot example

Plotting genomic structure using plotsr | syri

Citation

SyRI: identification of syntenic and rearranged regions from whole genome assemblies

Manish Goel, Hequan Sun, Wen-Biao Jiao, Korbinian Schneeberger

bioRxiv preprint first posted online Feb. 11, 2019

Addendum

SyRI: finding genomic rearrangements and local sequence differences from whole-genome assemblies

Manish Goel, Hequan Sun, Wen-Biao Jiao & Korbinian Schneeberger
Genome Biology volume 20, Article number: 277 (2019)