2019-05-20

viral quasispeciesのアセンブリを行う SAVAGE

2019 5/20 誤字修正、コメント、ヘルプ追加

　HIV、Zika、Ebolaなどのウイルスは、一般的にウイルス準種（viral quasispecies, wiki）と呼ばれる、遺伝的に関連しているが異なる変異株の集団として宿主に存在する。それぞれ独自のハプロタイプ配列によって特徴付けられているこれらの株は、高い突然変異率および組換え率の傾向がある（Duffy et al、2008; Domingo et al、2012）。感染サンプルに存在するウイルス準種の遺伝的多様性を捉えることを目的とした次世代シーケンシング（NGS）に基づく方法が、治療選択肢や他の臨床的に関連性のある決定を選択する際に臨床医を支援する。
　理想的には、ウイルス準種のアセンブリは、それらの存在率と共にすべてのウイルスハプロタイプを提示することによって感染の遺伝的多様性を特徴付ける。これには2つの大きな課題がある。
1、異なる系統の数は通常不明である。さらに、２つの異なる株はごくわずかな量の区別できる突然変異によって異なり得る。最後に述べるが決して軽んずべきでないが、豊富さの割合はシークエンシングエラー率と同じくらい低くなる可能性があり、これは低頻度で存在する真の突然変異の検出を妨げる。
2、遺伝的多様性と高突然変異率の両方、高品質のコンセンサスゲノム配列を表すリファレンスゲノムは、疾患発生時には時代遅れになる可能性がある。適切なゲノムの欠如は、多くのウイルス準種アセンブリにとって大きな障害となる。
　既存のすべてのアセンブリ方法では、最初のポイントまたは2番目のポイントのどちらにも対処できないことを理解することが重要である。ウイルス準種の構築に特化した最近のリファレンスガイドアプローチは、ウイルス準種の進化の根底にある原動力をモデル化する統計的枠組みを示唆した。以前のアプローチは主にハプロタイプの局所的再構築に焦点を当てていたが（Zagordi et al、2010、2011; Quince et al、2011; Huang et al、2012）、例えばハプロタイプのグローバル再構築を目的としたより高度なアプローチとして、Dirichletプロセス混合モデル（Prabhakaran et al、2014）、隠れマルコフモデル（Töpferet al、2013）、またはサンプリングスキーム（Prosperi and Salemi 2012）がある。オーバーラップグラフでパスを計算する（Astrovskaya et al、2011）、オーバーラップグラフで最大クリークを列挙する、またはコンフリクトグラフで最大独立集合を計算する（Mangul et al、2014）最近の組み合わせアプローチもある。これらのアプローチはしっかりとポイント1を扱っているが、それらの方法のバックボーンとしてそれらの大多数は高品質のリファレンスシーケンスに依存している。これが今度はポイント2を扱えない理由である。変異パターンを逸脱すると、これらのアプローチは十分にうまく機能しない。

　一方、de novoアセンブリアプローチはリファレンスゲノムに依存しない。哺乳動物ゲノム構築のための多数のデノボアプローチが存在するが（例えば、比較評価についてはSalzberg et al、2011; Bradnam et al、2013; Gurevich et al、2013を参照）、これらの一般的方法はウイルス準種構築にはあまり適していない。主な違いは、ウイルスの突然変異率が真核生物よりも数桁高いため、1つのリードに複数のpolymorphicな部位が生じることである（Duffy et al、2008; Domingo et al、2012）。これにより、突然変異を別々のハプロタイプにphasingすることが可能になる。しかし、一般的なアセンブリ手法はこの性質を利用しない。一般的なアセンブラは、単一のコンセンサス配列を再構築することを目的としており、ひどい倍数体のゲノムを処理するようには設計されていない。これに関して、ウイルスゲノム構築を専門とする新規のアセンブラが既にあることに注意されたい（Yang et al、2012; Hunt et al、2015）。しかしながら、これらの特定のアプローチはまた、株特異的な配列よりもむしろコンセンサスゲノムをアセンブリすることを目的としており、その目的は個々の配列よりもむしろ新しいリファレンスを構築することである。著者らの知る限りでは、haplotype-resolvedのウイルス準種アセンブリの既存の唯一のアプローチはMLEHaploである（Malhotra et al、2016b）。結果として、ポイント２に対処していると、大部分の既存のデノボアセンブリ方法は満足できる程度までポイント１に対処することができない。
　考えられる主な問題は、上記の特殊なde novoウイルス準種アプローチを含むほぼすべてのNGSベースのゲノムアセンブラが、アセンブリパラダイムとしてde Bruijnグラフに依存していることである。それによって、リードはk-merに分解される。ここで、kは通常リード長よりかなり小さい。この概念を一般化したものとして、paired de Bruijn graphが導入された（Medvedev et al、2011）、これは後処理段階でメイトペアを分析する代わりにメイトペア情報をグラフ構造自体に組み入れ、結果としてより大きなcontigsを作り出す。上記のように、ウイルス種間のアセンブリでは、低頻度の変異とシーケンシングエラーを区別することが不可欠である。低頻度の変異は遺伝的に関連しており、したがって異なるリード内でco-occurするが、シーケンシングエラーは co-occurrenceのパターンを示さない。 Co-occurrenceパターンの検出は、リードを全長で調べることによって決定的にサポートされるが、この情報はde Bruijnグラフでは利用できない。一方、オーバーラップグラフは、完全長のリードを利用し、それらを小さい部分に分解しない。したがって、オーバーラップグラフのパラダイムはウイルス準種アセンブリ問題により適していると考えられる。
　オーバーラップグラフに基づくウイルス準種アセンブリのための唯一の既存の方法はHaploCliqueである（Töpferet al、2014）。このメソッドはリファレンスガイドだが、オーバーラップグラフを作成するためのアンカーポイントを提供するためだけにリファレンスを使用する。他の多くのアプローチとは異なり（Zagordi et al、2010、2011;Töpferet al、2013; Di Giallonardo et al、2014）、ハプロタイプ配列はリファレンスからではなくリードからアセンブリされる。インスピレーションを提供する一方で、HaploCliqueアルゴリズムは、比較的低いカバレッジ（1000倍以上）のデータセットにも既に過剰な計算リソースを必要とすることが証明されている。その理由は、最大クリークの列挙に基づいているためである。これは、ランタイムとメモリ空間の両方ともシーケンシングカバレッジの対数になる。したがって、アセンブリアルゴリズムのクリーク列挙のための新規でより効率的なアルゴリズムを提示する。

　リファレンスゲノム欠如問題を解決する2つの出口戦略がある。最初の戦略は、利用可能なde novoコンセンサスゲノムアセンブラのうちの1つ（その中で最も人気のあるツールはVICUNA [Yang et al、2012]）を使用して、患者サンプル自体からコンセンサスゲノム配列を構築することである。このアドホックコンセンサスをリファレンスとして使用し、リファレンスガイドアプローチの1つに適用する。この戦略はMangulら（2014）によっても示唆されていて、本著者らはそれをここでさらに探求する。 2番目の戦略は、患者サンプルリードから直接オーバーラップグラフを作成することである。続いて、倍数性を意識したアセンブリアルゴリズムを使用してオーバーラップグラフから株特異的配列を抽出する。問題は、オーバーラップグラフを作成するにはすべてのリードのペアワイズ比較を行う必要があることである。このため、カバレッジのディープなデータセットで実行可能にするためには、洗練されたインデックス作成手法が必要になる。ここでは、リファレンスゲノム無しでオーバーラップグラフを作成するため、FMインデックスベースの手法を効率的に利用する方法を示す（Välimäkiet al、2012）。そのようにして、オーバーラップグラフに基づいたウイルス準種のde novoアセンブリの最初のアプローチを提供する。
要約すると、以下に関連する貢献をする。
１、ディープカバレッジシーケンシングデータセットからのオーバーラップグラフ作成
２、オーバーラップグラフアセンブリパラダイムを用いたウイルス準種アセンブリ。
これらを組み合わせて、本著者らはSAVAGE (strain aware viral genome assembly)を提示する。これは、ディープカバレッジ（20,000倍以上）のシーケンスデータセットからウイルス準種のリファレンスなしのアセンブリを可能にする方法である。ここでは、オーバーラップグラフに基づく最初の本物のde novoウイルス準種アセンブリアプローチを提供するだけでなく、例えばVICUNA（Yang et al、2012）によって計算されるように、患者サンプルから生成されたアドホックコンセンサス配列を利用した最初の方法も提供する。

最後に、ジカウイルスとC型肝炎ウイルスに感染した患者からの2つのウイルスサンプルについてSAVAGEを使ったアセンブリ結果を示す。（以下略）

Figure 1. An overview of the workflow and algorithms of SAVAGE. 論文より転載。

インストール

依存

Python2

rust-overlaps
ncbi-blast
bwa mem
Kallisto

本体　Bitbucket

#bioconda
conda install -c bioconda -y savage

#またはconda仮想環境を作って導入する
conda create -n savage -c bioconda -y savage python=2.7.15
conda activate savage

> savage -h

$ savage -h

usage: savage.py [-h] [-s INPUT_S] [-p1 INPUT_P1] [-p2 INPUT_P2]

[-m MIN_OVERLAP_LEN] [-t THREADS] --split SPLIT_NUM

[--revcomp] [-o OUTDIR] [--ref REFERENCE] [--no_stage_a]

[--no_stage_b] [--no_stage_c] [--no_overlaps]

[--no_preprocessing] [--no_assembly] [--count_strains]

[--ignore_subreads] [--merge_contigs MERGE_CONTIGS]

[--min_clique_size MIN_CLIQUE_SIZE]

[--overlap_len_stage_c OVERLAP_STAGE_C]

[--contig_len_stage_c CONTIG_LEN_STAGE_C] [--keep_branches]

[--sfo_mm SFO_MM] [--diploid]

[--diploid_contig_len DIPLOID_CONTIG_LEN]

[--diploid_overlap_len DIPLOID_OVERLAP_LEN]

[--average_read_len AVERAGE_READ_LEN] [--no_filtering]

[--max_tip_len MAX_TIP_LEN]

Program: SAVAGE - Strain Aware VirAl GEnome assembly

Version: 0.4.1

Release date: July 17, 2017

Contact: Jasmijn Baaijens - j.a.baaijens@cwi.nl

SAVAGE assembles individual (viral) haplotypes from NGS data. It expects as

input single- and/or paired-end Illumina sequencing reads. Please note that the

paired-end reads are expected to be in forward-forward format, as output by

PEAR.

Run savage -h for a complete description of required and optional arguments.

For the complete manual, please visit https://bitbucket.org/jbaaijens/savage

optional arguments:

-h, --help show this help message and exit

basic arguments:

-s INPUT_S path to input fastq containing single-end reads

-p1 INPUT_P1 path to input fastq containing paired-end reads (/1)

-p2 INPUT_P2 path to input fastq containing paired-end reads (/2)

-m MIN_OVERLAP_LEN, --min_overlap_len MIN_OVERLAP_LEN

minimum overlap length required between reads

-t THREADS, --num_threads THREADS

allowed number of cores

--split SPLIT_NUM split the data set into patches s.t. 500 < coverage/split_num < 1000

--revcomp use this option when paired-end input reads are in forward-reverse orientation;

this option will take reverse complements of /2 reads (specified with -p2)

please see the SAVAGE manual for more information about input read orientations

-o OUTDIR, --outdir OUTDIR

specify output directory

reference-guided mode:

--ref REFERENCE reference genome in fasta format

advanced arguments:

--no_stage_a skip Stage a (initial contig formation)

--no_stage_b skip Stage b (extending initial contigs)

--no_stage_c skip Stage c (merging maximized contigs into master strains)

--no_overlaps skip overlap computations (use existing overlaps file instead)

--no_preprocessing skip preprocessing procedure (i.e. creating data patches)

--no_assembly skip all assembly steps; only use this option when using --count_strains separate from assembly (e.g. on a denovo assembly)

--count_strains compute a lower bound on the number of strains in this sample; note: this requires a reference genome.

--ignore_subreads ignore subread info from previous stage

--merge_contigs MERGE_CONTIGS

specify maximal distance between contigs for merging into master strains (stage c)

--min_clique_size MIN_CLIQUE_SIZE

minimum clique size used during error correction

--overlap_len_stage_c OVERLAP_STAGE_C

min_overlap_len used in stage c

--contig_len_stage_c CONTIG_LEN_STAGE_C

minimum contig length required for stage c input contigs

--keep_branches disable merging along branches by removing them from the graph (stage b & c)

--sfo_mm SFO_MM input parameter -e=SFO_MM for sfo: maximal mismatch rate 1/SFO_MM

--diploid use this option for diploid genome assembly

--diploid_contig_len DIPLOID_CONTIG_LEN

minimum contig length required for diploid step contigs

--diploid_overlap_len DIPLOID_OVERLAP_LEN

min_overlap_len used in diploid assembly step

--average_read_len AVERAGE_READ_LEN

average length of the input reads; will be computed from the input if not specified

--no_filtering disable kallisto-based filtering of contigs

--max_tip_len MAX_TIP_LEN

maximum extension length for a sequence to be called a tip

(savage) kazu@kazu:~/taniguchi_datadir/20190518$

> python freq_est.py -h

$ python freq_est.py -h

usage: freq_est.py [-h] -c CONTIGS [-o OUT_FILE] [-m MIN_LEN]

[--select_ids SELECT_IDS] [--kallisto]

[--kallisto_path KALLISTO_PATH] -l FRAGMENTSIZE -d STDDEV

[-f FORWARD] [-r REVERSE] [-s SUBREADS_FILE]

[-k LEN_CORRECTION]

Estimate relative frequencies for contigs produced by SAVAGE. The frequency

estimation procedure has two modes: 1. Compute frequency estimates for given

contigs by running Kallisto, then parse Kallisto output (abundance.tsv) and

compute frequency estimates from TPM counts. 2. Quick estimates based on the

relative number of reads contributing to a contig during SAVAGE assembly.

Option 1 obtains most accurate results, but requires Kallisto to be installed.

Note: For option 1, please make sure that the input fasta contains all

assembled sequences; contigs of sufficient length will be selected after

estimating TPM counts (Kallisto). For option 2, please make sure to run SAVAGE

with the option --use_subreads for better frequency estimates of stage b/c

contigs.

optional arguments:

-h, --help show this help message and exit

general arguments:

-c CONTIGS, --contigs CONTIGS

fastq or fasta file containing contigs

-o OUT_FILE, --out OUT_FILE

write output to file; if not specified, output is

written to stdout.

-m MIN_LEN, --min_len MIN_LEN

only consider contigs at least this length

--select_ids SELECT_IDS

select target contigs for frequency estimation by

providing a comma-separated list of contig identifiers

kallisto-mode arguments:

--kallisto use Kallisto for improved frequency estimation

--kallisto_path KALLISTO_PATH

path to kallisto executable; needed only if Kallisto

path is not included in PATH

-l FRAGMENTSIZE, --fragmentsize FRAGMENTSIZE

Estimated average fragment size

-d STDDEV, --stddev STDDEV

Estimated standard deviation of fragment size

-f FORWARD, --forward FORWARD

original forward input reads (before assembly) in

fastq format

-r REVERSE, --reverse REVERSE

original reverse input reads (before assembly) in

fastq format

quick-mode arguments:

-s SUBREADS_FILE, --subreads SUBREADS_FILE

subreads file produced by savage, corresponding to the

fastq file provided

-k LEN_CORRECTION, --correction LEN_CORRECTION

(optional) correction term for computing effective

contig length: use c = fragment_size - 2*(read_len-

min_overlap_len)

(savage) kazu@kazu:/media/kazu/1/virus/savage$

freq_est.pyはcondaによるインストールでパスは通っていない。

テストラン

git clone https://bitbucket.org/jbaaijens/savage.git
cd savage/example/

リードのstatistics

> seqkit stats input_fas/*

file format type num_seqs sum_len min_len avg_len max_len

input_fas/paired1.fastq FASTQ DNA 200 49,970 245 249.9 250

input_fas/paired2.fastq FASTQ DNA 200 49,962 246 249.8 250

input_fas/singles.fastq FASTQ DNA 2,000 896,174 380 448.1 490

１、merge

PEARでペアエンドをマージする。テストデータではすでに用意されているため実行する必要はない。

pear -f pair1.fq -r pair2.fq -o merged

merged.fastqとunmpergedのfastqが出力される。

2-1、De novo

ペアエンドfastqとマージしたfastqを指定してDe novoアセンブリを実行。ペアエンドfastはPEARでmegeし、マージされなかったペアエンドfastq（forward- forwad）を指定する。error correctionが500-1000xのカバレッジに最適化されているため、それ以上ディープなデータの場合、データを 500-1000xカバレッジになるよう分割する"--split"フラグを立てる。

savage -s input_fas/singles.fastq -p1 input_fas/paired1.fastq -p2 input_fas/paired2.fastq -m 200 --split 1 -o output -t 12

-s path to input fastq containing single-end reads
-p1 path to input fastq containing paired-end reads (/1)
-p2 path to input fastq containing paired-end reads (/2)
-o specify output directory
-m minimum overlap length required between reads
-t allowed number of cores
--split split the data set into patches s.t. 500 < coverage/split_num < 1000
--merge_contigs specify maximal distance between contigs for merging into master strains (stage c). By default this is set to 0.

出力

ls -lth out/

$ ls -lth out/

合計 112K

-rw-rw-r-- 1 kazu 19K 5月 19 21:57 contigs_stage_c.fasta

drwxrwxr-x 2 kazu 4.0K 5月 19 21:57 stage_c/

-rw-rw-r-- 1 kazu 19K 5月 19 21:57 contigs_stage_b.fasta

drwxrwxr-x 2 kazu 4.0K 5月 19 21:57 stage_b/

-rw-rw-r-- 1 kazu 60K 5月 19 21:57 contigs_stage_a.fasta

drwxrwxr-x 4 kazu 4.0K 5月 19 21:57 stage_a/

stage aがerror correction、stage bがオーバーラップグラフに基づくアセンブリ、stage cがマージしてhaplotypeを再構成するステージになる。stage cのマージは、"--merge_contigs"のパラメータでマージされるかどうか決定される。例えばdefaultの0から0.01（=1%）まで増やせば、1%以下の違いしかない配列は１つのhaplotypeにマージされる。

2-2、Reference guide

またはリファレンスを指定してReference guidedアセンブリ実行。

savage --ref /path/to/refeence.fasta --split patch_num --min_overlap_len M --s singles.fastq --p1 paired1.fastq --p2 paired2.fastq -t 12 -o output

リファレンスはfull pathで指定しないとエラーになる。

3、Counting virus strains

python freq_est.py --contigs output/contigs_stage_c.fasta --subreads output/stage_c/subreads.txt --min_len 1000 \
-l <Estimated average fragment size> \
-d <Estimated standard deviation of fragment size>

引用
De novo assembly of viral quasispecies using overlap graphs.

Baaijens JA, Aabidine AZE, Rivals E, Schönhuth A

Genome Res. 2017 May;27(5):835-848