mixed sampleの多様性を見積もる ShoRAH - macでインフォマティクス

　ディープシークエンシングや次世代シークエンシング（NGS）と呼ばれる新しい世代のハイスループットDNAシークエンシング技術の出現により、基礎的、応用的、臨床的研究における新しい実験的アプローチの扉が開かれた。 NGSによって生成される膨大な量のデータの恩恵を受けるアプリケーションの中には、heterogeneousなサンプルにおける遺伝的多様性の研究がある。遺伝的多様性は、例えばHIV感染に重大な影響を及ぼす。多様なウイルスのバリアントセットは、疾患の進行に関与し、効果的な治療法を開発する努力を妨げる[論文より　ref.1]。遺伝的多様性が重要な生物学的システムの他の例には、バクテリアコミュニティ[ref.2]および癌細胞[ref.3]が含まれる。

　Traditionalなサンガーシーケンシングでは集団のコンセンサス配列が生じるため、遺伝的に多様なサンプルの低頻度バリアントは検出されない（閾値は約20％）。この制限は、シークエンシングエラーが適切に処理されていれば、混合サンプルの遺伝的多様性を正確に定量できるディープシークエンシングでは克服されている[re4]。多くの遺伝子分析では、ゲノムの一部、例えば単一の遺伝子のみが単離され、数十万のクローンシークエンシング反応を行い、統計的にサンプル中の多様性を代表する一連のリードを生成する。リードは、増幅された領域のリファレンス配列とアライメントさせることができる。遺伝的変異の信頼できる推定値を得るためには、シークエンシングエラーについて交絡効果（confounding effects、交絡）を考慮する必要がある[ref.5]。

　シークエンシングエラーを訂正し、heterogeneousなサンプルの構造を推定するために、ShoRAH（HaplotypesへのShort Read Assembly）を開発した。このソフトウェアは、ゲノムの同じ領域とオーバーラップするすべてのリードをクラスタリングすることで、エラーを訂正する（論文図1参照）。各クラスターのコンセンサス配列は、誤ったリードが得られたオリジナルのハプロタイプを表す。クラスターに関連するリードの数は、集団におけるハプロタイプの有病率を推定する。多くのアプリケーションでは、このローカルダイバーシティ推定は重要な結論を導き出すのに十分である。例えば、約500bpの長さの現在のパイロシーケンシングリードでは、抗レトロウイルス治療の重要な標的であるHIVのプロテアーゼ遺伝子を完全にカバーすることができる。より長いハプロタイプ配列は、隣接する重複するリードから再構成することができ、これらの全体的なハプロタイプの頻度を次に推定することができる。

　ShoRAHは、単一のディープシーケンシング実行で混合サンプルから得られた一連の読み取りからローカルおよびグローバル母集団構造を推論するための計算ツールのセットである。この論文では、その実装の概要、使用例、および得られた結果の簡単な概要を示すことで、ソフトウェアの主な機能を報告する。

Schematic view of the local haplotype reconstruction. 論文より転載

http://cbg-ethz.github.io/shorah/global.html

wiki

https://wiki-bsse.ethz.ch/display/ShoRAH/Documentation

インストール

ubuntu16.04を使ってテストした（docker使用。ホストOS: mac os10.12）。

依存

Python 2 or Python 3, backward compatibility is provided as some current Linux distributions and OS X systems are still using 2.x as default. The required dependencies are:
a) Biopython, and b) NumPy. These packages can be downloaded using pip or anaconda
Perl, for some scripts
zlib, which is used by the bundled samtools for compressing bam files
pkg-config, for discovering dependencies, which most Unix-like systems include
GNU scientific library, for random number generation

In addition, if you want to bootstrap the git version of shorah instead of using the provided tarballs, you will need the GNU Autotools:

Autoconf 2.69
Automake 1.15
m4, which most Unix-like system include

本体　Github

#Anacondaを使っているなら、condaで導入できる。
conda install -c bioconda shorah

以下のコマンドがある。（Githubより）

> shorah.py -h

$ shorah.py -h

usage: shorah.py [-h] -b BAM -f REF [-a FLOAT] [-w INT] [-r chrm:start-stop]

[-S INT] [-s INT] [-i FLOAT] [-x INT] [-k]

ShoRAH: Short Read Assembly into Haplotypes

optional arguments:

-h, --help show this help message and exit

Input files:

Required input

-b BAM, --bam BAM sorted bam format alignment file (default: None)

-f REF, --fasta REF reference genome in fasta format (default: None)

Run options:

Parameters that can (maybe should) be changed according to the needs

-a FLOAT, --alpha FLOAT

alpha in dpm sampling (default: 0.1)

-w INT, --windowsize INT

window size (default: 201)

-r chrm:start-stop, --region chrm:start-stop

region in format 'chr:start-stop', e.g.

'chrm:1000-3000' (default: )

-S INT, --seed INT set seed for reproducible results (default: None)

More options:

Do you really want to change this?

-s INT, --winshifts INT

number of window shifts (default: 3)

-i FLOAT, --sigma FLOAT

value of sigma to use when calling SNVs (default:

0.01)

-x INT, --maxcov INT approximate max coverage allowed (default: 10000)

-k, --keep_files keep intermediate files (default: True)

> dec.py -h

$ dec.py -h

usage: dec.py [-h] -b BAM -f REF [-w INT] [-r chrm:start-stop] [-a FLOAT]

[-S INT] [-s INT] [-x INT] [-k]

Local haplotype recontruction: Dirichlet process mixture

optional arguments:

-h, --help show this help message and exit

Input files:

Required input

-b BAM, --bam BAM file with aligned reads in .bam format (default: None)

-f REF, --fasta REF reference genome in fasta format (default: None)

Run options:

Parameters that can (maybe should) be changed according to the needs

-w INT, --windowsize INT

window size (default: 201)

-r chrm:start-stop, --region chrm:start-stop

region in format 'chrm:start-stop', e.g.

'ch3:1000-3000' (default: )

-a FLOAT, --alpha FLOAT

alpha in dpm sampling (default: 0.1)

-S INT, --seed INT set seed for reproducible results (default: None)

More options:

Do you really want to change this?

-s INT, --winshifts INT

number of window shifts (default: 3)

-x INT, --maxcov INT approximate max coverage allowed (default: 10000)

-k, --keep_files keep all intermediate files (default: True)

> amplian.py -h

$ amplian.py -h

usage: amplian.py [-h] -b BAM -f REF [-r chrm:start-stop] [-d] [-m FLOAT]

[-a FLOAT] [-x INT] [-s FLOAT]

Local haplotype reconstruction - amplicon mode

optional arguments:

-h, --help show this help message and exit

Input files:

Required input

-b BAM, --bam BAM file with aligned reads in .bam format (default: None)

-f REF, --fasta REF reference genome in fasta format (default: None)

Type of run:

You can specify a region, or look for the highest diversity region

-r chrm:start-stop, --region chrm:start-stop

region in format 'chrm:start-stop' e.g.

'ch3:1000-1300' (default: )

-d, --diversity if set, automatically detects the highest entropy

region and runs there (default: False)

Run options:

Fine tuning

-m FLOAT, --min_overlap FLOAT

fraction of read overlap to be included (default:

0.95)

-a FLOAT, --alpha FLOAT

alpha in dpm sampling (default: 0.5)

More options:

Do you really want to change this?

-x INT, --maxcov INT approximate max coverage allowed (default: 50000)

-s FLOAT, --sigma FLOAT

sigma value to use when calling SNVs (default: 0.01)

実行方法

globalモード

shorah.py -b input.bam -f ref.fa

ampliconモード

amplian.py -b input.bam -f target.fa

テストではエラーを起こした。issuesの#5と同じ問題かと思われるが、未解決

https://github.com/cbg-ethz/shorah/issues/5

引用

ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data

Osvaldo Zagordi, Arnab Bhattacharya, Nicholas Eriksson, Niko Beerenwinkel

BMC Bioinformatics 2011 12:119