2018-11-07

QuasiRecomb

　次世代シークエンシング（NGS）技術は、以前はあまりにも労働集約的であると考えられていた実験を日常的な作業に変えた（Metzker、2010）。 NGSの1つの用途は、genetic diversityを定量化するために遺伝的にheterogousなpopulationsのシーケンシングである。Genetic diversityは、例えば、HIVおよびHCVなどのRNAウイルスによる感染において主な関心事である。これらのシステムでは、病原体の高い突然変異率と病原体間の組換えとの組み合わせにより、ウイルス性疑似種（quasispecies）と呼ばれる異なるが関連する個体集団が生じる。異なるウイルス粒子が単一の細胞に感染すると、後の事象として組換えが起こり得る。我々（著者ら）は、この集団における異なるウイルス株をハプロタイプとして示す。ウイルス準種の特徴を研究することで、宿主における病原体の進化のメカニズムを明らかにすることができる。例えば、準種の多様性は病原性（Vignuzzi et al、2006）、immune escape （Nowak et al、1991）および薬剤耐性（Johnson et al、2008）に影響を及ぼすことが示されている。

　The quasispecies equationは、突然変異選択プロセスに従って進化するRNAウイルス集団の数学モデルである（Eigen、1971）。モデルのダイナミクスは、複製時にあるウイルスのハプロタイプ（または株）を別のものへと変換する突然変異と、異なる株の複製速度の変動を説明する選択用語とによって記述される。 HIVおよびHCVを含む多くの臨床的に関連性の高いウイルスでは組換えが頻繁であることが知られているが、突然変異プロセスは一般に点突然変異の結果としてのみ考慮される。例えば、HIVの組換え率は、その点突然変異率より約10倍高いと推定される。したがって、quasispecies modelは、突然変異および組換えの両方を説明するために拡張されている（Boerlijst et al、1996）。平衡状態では、モデルは、ウイルス集団が1つまたは少数のハプロタイプによってdominatであり、、常に生成される低頻度の突然変異株の雲に囲まれていると予測する。

　最近のNGS技術は、1回の実験で数百万のDNAシーケンシングリードを生成することによって、前例のないレベルでウイルスquasispeciesを観察することを可能にする。しかしながら、この高い収率はコストがかかる。リードは、通常、最新の技術では700bpまでの短差で、最小のウイルスゲノムよりずっと小さく、サンプル調製やシーケンシングエラーが原因でエラーが発生しやすい（Gilles et al、2011）。結果として、得られたデータは不完全でノイズが多いので、NGSを用いてウイルス集団を有意に特徴づけるには、シーケンシングデータを注意深く解析する必要がある（Beerenwinkel and Zagordi、2011）。

　この論文では、突然変異と組換えを明示的にモデル化することにより、NGSデータに基づいてウイルスquasispeciesを推定することを目指している。我々（著者ら）は、ウィルス集団、すなわち、ハプロタイプ分布およびNGSによるプロービングを生成するために、隠れマルコフモデル（HMM）を使用する。本発明者らのモデルでは、ハプロタイプは、ハプロタイプがどの配列から派生するかを選択する、HMMの状態の切り替えとして記述される少数の生成配列、および生成配列の位置特異的確率表によって記載される突然変異から生じる。シークエンシングリードは、観察エラーの対象となるハプロタイプから得られる。

　本論文では、ウイルスquasispeciesの推定を行うための、すなわち患者内のウイルスのハプロタイプの分布を推定するための新規の確率的モデルを提示する。具体的には、実際の遺伝的多様性は、突然変異および組換えにより、generatorsと呼ばれる、いくつかの配列によって生成され、観察された多様性は追加のシーケンシングエラーに起因すると仮定する。我々（著者ら）は局所ハプロタイプ推論のモデルを提示する。すなわち、個々のリードによってカバーすることができる大きさのゲノム領域の集団構造を推定することを目的とするが、このモデルをグローバルなハプロタイプ推論、すなわちより長いゲノム領域に拡張することを目的とする。一般に、ローカル推論は、多くのアプリケーションにとって、より信頼性が高く、十分である。例えば、抗レトロウイルス治療の重要な標的であるHIVプロテアーゼ遺伝子は297bpであり、臨床診断のための共通のパイロシーケンシングプラットフォームであるRoche / 454 GS Juniorシーケンサーで400bp以上のリード値を得ることが現在の標準である（論文執筆時点）。局所的なハプロタイプの再構成はまた、全体的な再構築の出発点として使用することもできる（Astrovskaya、2011; Eriksson et al、2008; Prabhakaran et al、2010; Prosperi et al、2011; Zagordi et al、2011）。本モデルは、ウイルス性ハプロタイプの分布を高い信頼性で推定することができることを、ground truthにアクセスできるシミュレートされたデータに適用することで示し、そしてHIV感染患者から得たリードへの適用を示す。

QuasiRecombに関するツイート

特徴（Githubより）

First algorithm that models the recombination process
Allows position-wise mutation events
Infers a parametric probability distribution from the underlying viral population
Error correction by estimating position-wise sequencing error-rates
Local, gene- and genome-wide reconstruction
Reports SNV (single nucleotide variant) posteriors
Incorporates paired-end information
Uses PHRED scores to weight each base of each read
Input may contain paired-end and single reads
Supports reads of all current sequencing technologies (454/Roche, Illumina and PacBio)
Suitable for amplicon and shotgun sequencing projects
Reports reconstructed haplotypes and their relative frequencies
Reports translated proteins in all three reading frames with their relative frequencies
Input data can be in BAM or SAM format

インストール

本体　Github

リリースよりビルドされたjavaプログラムをダウンロードする。

> java -jar /Users/user/Downloads/QuasiRecomb.jar

$ java -jar /Users/user/Downloads/QuasiRecomb.jar

No input given

QuasiRecomb version: 1.2

Get latest version from http://bit.ly/QuasiRecomb

USAGE: java -jar QuasiRecomb.jar options...

-------------------------

=== GENERAL options ===

-i INPUT : Alignment file in BAM or SAM format.

-o PATH : Path to the output directory (default: current directory).

-K INT or INT-INT : The interval or fixed number of sequence generators, i.e. 1-4 or 2

In a grid enviroment the $SGE_TASK_ID.

In case of no input, K will be incremented as long as max BIC has not been reached, but will stop at K=5.

-m INT : The number of EM restarts during model selection (default: 5).

-t INT : The number of EM restarts for best K to find optimum (default: 50).

-r INT-INT : Only reconstruct a specific region.

-noRecomb : Do not allow recombination.

-quality : Account phred quality scores (slower runtime).

-printAlignment : Save alignment.txt in a human readable format.

-sampleProteins : Sample full-length protein sequences in three reading frames.

-coverage : If your dataset only contains a single region of interest,

regions with a minimum coverage of 100x, 500x, 1,000x and 10,000x are reported.

-refine : Can only be used after QuasiRecomb has been executed once before on the same dataset in the same directory.

Thins the number of haplotypes.

-noGaps : Ignore gaps; useful if data is 454 and gaps are only technical errors.

-conservative : Use this if the major haplotypes are only of interest.

-maxDel INT : Remove reads with more than INT consecutive deletions.

-maxPercDel DOUBLE : Remove reads with more than DOUBLE ratio of deletions, between 0.0 - 1.0

-unpaired : If read names are not unique and reads are single-end, prevent pairing and merging.

-------------------------

=== Technical options ===

-XX:NewRatio=9 : Reduces the memory consumption (RECOMMENDED to use).

-Xms2G -Xmx10G : Increase heap space.

-XX:+UseParallelGC : Enhances performance on multicore systems.

-XX:+UseNUMA : Enhances performance on multi-CPU systems.

-------------------------

=== EXAMPLES ===

java -XX:NewRatio=9 -jar QuasiRecomb.jar -i alignment.bam

java -XX:NewRatio=9 -jar QuasiRecomb.jar -i alignment.bam -conservative

java -XX:NewRatio=9 -jar QuasiRecomb.jar -i alignment.bam -K 1:10

java -XX:NewRatio=9 -jar QuasiRecomb.jar -i alignment.bam -noRecomb -r 790-2292

java -XX:+UseParallelGC -Xms2g -Xmx10g -XX:+UseNUMA -XX:NewRatio=9 -jar QuasiRecomb.jar -i alignment.bam

-------------------------

For further information, see http://bit.ly/QuasiRecomb-howto

-------------------------

実行方法

java -jar QuasiRecomb.jar -i alignment.bam

ハプロタイプのFASTAが出力される。

テスト

論文と同じSRR069887のデータを使いテストした。リファレンスはHXB2（リンク）を使用。

マッピング結果

f:id:kazumaxneo:20181107103828j:plain

ターゲット領域のリードのみ取り出す。

samtools view -@ 10 -b sort.bam chr:6321-6499 > target.bam

QuasiRecombのラン

java -jar QuasiRecomb.jar -i target.bam

> head -n 50 support/K1-snvs.txt

head -n 50 support/K1-snvs.txt

A C G T -

0 0 0 0 1 0

1 1 0 0 0 0

2 0 0 1 0 0

3 0 0 0 1 0

4 1 0 0 0 0

5 1 0 0 0 0

6 1 0 0 0 0

7 1 0 0 0 0

8 1 0 0 0 0

9 0 0 0 1 0

10 0 0 0 1 0

11 0 0 1 0 0

12 1 0 0 0 0

13 1 0 0 0 0

14 0 1 0 0 0

15 1 0 0 0 0

16 0 0 0 1 0

17 1 0 0 0 0

18 0 0 1 0 0

19 0 0 1 0 0

20 0 0 0 0 1

21 0 0 0 0 1

22 1 0 0 0 0

23 1 0 0 0 0

24 0 0 0 1 0

25 1 0 0 0 0

26 0 0 1 0 0

27 0 1 0 0 0

28 1 0 0 0 0

29 0 1 0 0 0

30 0 1 0 0 0

31 1 0 0 0 0

32 0 1 0 0 0

33 0 1 0 0 0

34 1 0 0 0 0

35 1 0 0 0 0

36 0 0 1 0 0

37 0 0 1 0 0

38 0 1 0 0 0

39 1 0 0 0 0

40 1 0 0 0 0

41 1 0 0 0 0

42 0 0 1 0 0

43 1 0 0 0 0

44 0 0 1 0 0

45 1 0 0 0 0

46 0 1 0 0 0

47 1 0 0 0 0

48 0.23398 0 0.76601 0 0

K1-summary.html

f:id:kazumaxneo:20181107115532j:plain

quasispecies.fasta

f:id:kazumaxneo:20181107153054j:plain

サマリープロット

R CMD BATCH support/coverage.R
R CMD BATCH support/modelselection.R

f:id:kazumaxneo:20181107115327j:plain

f:id:kazumaxneo:20181107115331j:plain

引用

Probabilistic Inference of Viral Quasispecies Subject to Recombination
Armin Töpfer, Osvaldo Zagordi, Sandhya Prabhakaran, Volker Roth, Eran Halperin, Niko Beerenwinkel

J Comput Biol. 2013 Feb; 20(2): 113–123

2018-11-06

mixed sampleの多様性を見積もる ShoRAH

2011 BMC Bioinformatics haplotype reconstruction bacteria amplicon sequence テスト失敗

　ディープシークエンシングや次世代シークエンシング（NGS）と呼ばれる新しい世代のハイスループットDNAシークエンシング技術の出現により、基礎的、応用的、臨床的研究における新しい実験的アプローチの扉が開かれた。 NGSによって生成される膨大な量のデータの恩恵を受けるアプリケーションの中には、heterogeneousなサンプルにおける遺伝的多様性の研究がある。遺伝的多様性は、例えばHIV感染に重大な影響を及ぼす。多様なウイルスのバリアントセットは、疾患の進行に関与し、効果的な治療法を開発する努力を妨げる[論文より　ref.1]。遺伝的多様性が重要な生物学的システムの他の例には、バクテリアコミュニティ[ref.2]および癌細胞[ref.3]が含まれる。

　Traditionalなサンガーシーケンシングでは集団のコンセンサス配列が生じるため、遺伝的に多様なサンプルの低頻度バリアントは検出されない（閾値は約20％）。この制限は、シークエンシングエラーが適切に処理されていれば、混合サンプルの遺伝的多様性を正確に定量できるディープシークエンシングでは克服されている[re4]。多くの遺伝子分析では、ゲノムの一部、例えば単一の遺伝子のみが単離され、数十万のクローンシークエンシング反応を行い、統計的にサンプル中の多様性を代表する一連のリードを生成する。リードは、増幅された領域のリファレンス配列とアライメントさせることができる。遺伝的変異の信頼できる推定値を得るためには、シークエンシングエラーについて交絡効果（confounding effects、交絡）を考慮する必要がある[ref.5]。

　シークエンシングエラーを訂正し、heterogeneousなサンプルの構造を推定するために、ShoRAH（HaplotypesへのShort Read Assembly）を開発した。このソフトウェアは、ゲノムの同じ領域とオーバーラップするすべてのリードをクラスタリングすることで、エラーを訂正する（論文図1参照）。各クラスターのコンセンサス配列は、誤ったリードが得られたオリジナルのハプロタイプを表す。クラスターに関連するリードの数は、集団におけるハプロタイプの有病率を推定する。多くのアプリケーションでは、このローカルダイバーシティ推定は重要な結論を導き出すのに十分である。例えば、約500bpの長さの現在のパイロシーケンシングリードでは、抗レトロウイルス治療の重要な標的であるHIVのプロテアーゼ遺伝子を完全にカバーすることができる。より長いハプロタイプ配列は、隣接する重複するリードから再構成することができ、これらの全体的なハプロタイプの頻度を次に推定することができる。

　ShoRAHは、単一のディープシーケンシング実行で混合サンプルから得られた一連の読み取りからローカルおよびグローバル母集団構造を推論するための計算ツールのセットである。この論文では、その実装の概要、使用例、および得られた結果の簡単な概要を示すことで、ソフトウェアの主な機能を報告する。

Schematic view of the local haplotype reconstruction. 論文より転載

http://cbg-ethz.github.io/shorah/global.html

wiki

https://wiki-bsse.ethz.ch/display/ShoRAH/Documentation

インストール

ubuntu16.04を使ってテストした（docker使用。ホストOS: mac os10.12）。

依存

Python 2 or Python 3, backward compatibility is provided as some current Linux distributions and OS X systems are still using 2.x as default. The required dependencies are:
a) Biopython, and b) NumPy. These packages can be downloaded using pip or anaconda
Perl, for some scripts
zlib, which is used by the bundled samtools for compressing bam files
pkg-config, for discovering dependencies, which most Unix-like systems include
GNU scientific library, for random number generation

In addition, if you want to bootstrap the git version of shorah instead of using the provided tarballs, you will need the GNU Autotools:

Autoconf 2.69
Automake 1.15
m4, which most Unix-like system include

本体　Github

#Anacondaを使っているなら、condaで導入できる。
conda install -c bioconda shorah

以下のコマンドがある。（Githubより）

> shorah.py -h

$ shorah.py -h

usage: shorah.py [-h] -b BAM -f REF [-a FLOAT] [-w INT] [-r chrm:start-stop]

[-S INT] [-s INT] [-i FLOAT] [-x INT] [-k]

ShoRAH: Short Read Assembly into Haplotypes

optional arguments:

-h, --help show this help message and exit

Input files:

Required input

-b BAM, --bam BAM sorted bam format alignment file (default: None)

-f REF, --fasta REF reference genome in fasta format (default: None)

Run options:

Parameters that can (maybe should) be changed according to the needs

-a FLOAT, --alpha FLOAT

alpha in dpm sampling (default: 0.1)

-w INT, --windowsize INT

window size (default: 201)

-r chrm:start-stop, --region chrm:start-stop

region in format 'chr:start-stop', e.g.

'chrm:1000-3000' (default: )

-S INT, --seed INT set seed for reproducible results (default: None)

More options:

Do you really want to change this?

-s INT, --winshifts INT

number of window shifts (default: 3)

-i FLOAT, --sigma FLOAT

value of sigma to use when calling SNVs (default:

0.01)

-x INT, --maxcov INT approximate max coverage allowed (default: 10000)

-k, --keep_files keep intermediate files (default: True)

> dec.py -h

$ dec.py -h

usage: dec.py [-h] -b BAM -f REF [-w INT] [-r chrm:start-stop] [-a FLOAT]

[-S INT] [-s INT] [-x INT] [-k]

Local haplotype recontruction: Dirichlet process mixture

optional arguments:

-h, --help show this help message and exit

Input files:

Required input

-b BAM, --bam BAM file with aligned reads in .bam format (default: None)

-f REF, --fasta REF reference genome in fasta format (default: None)

Run options:

Parameters that can (maybe should) be changed according to the needs

-w INT, --windowsize INT

window size (default: 201)

-r chrm:start-stop, --region chrm:start-stop

region in format 'chrm:start-stop', e.g.

'ch3:1000-3000' (default: )

-a FLOAT, --alpha FLOAT

alpha in dpm sampling (default: 0.1)

-S INT, --seed INT set seed for reproducible results (default: None)

More options:

Do you really want to change this?

-s INT, --winshifts INT

number of window shifts (default: 3)

-x INT, --maxcov INT approximate max coverage allowed (default: 10000)

-k, --keep_files keep all intermediate files (default: True)

> amplian.py -h

$ amplian.py -h

usage: amplian.py [-h] -b BAM -f REF [-r chrm:start-stop] [-d] [-m FLOAT]

[-a FLOAT] [-x INT] [-s FLOAT]

Local haplotype reconstruction - amplicon mode

optional arguments:

-h, --help show this help message and exit

Input files:

Required input

-b BAM, --bam BAM file with aligned reads in .bam format (default: None)

-f REF, --fasta REF reference genome in fasta format (default: None)

Type of run:

You can specify a region, or look for the highest diversity region

-r chrm:start-stop, --region chrm:start-stop

region in format 'chrm:start-stop' e.g.

'ch3:1000-1300' (default: )

-d, --diversity if set, automatically detects the highest entropy

region and runs there (default: False)

Run options:

Fine tuning

-m FLOAT, --min_overlap FLOAT

fraction of read overlap to be included (default:

0.95)

-a FLOAT, --alpha FLOAT

alpha in dpm sampling (default: 0.5)

More options:

Do you really want to change this?

-x INT, --maxcov INT approximate max coverage allowed (default: 50000)

-s FLOAT, --sigma FLOAT

sigma value to use when calling SNVs (default: 0.01)

実行方法

globalモード

shorah.py -b input.bam -f ref.fa

ampliconモード

amplian.py -b input.bam -f target.fa

テストではエラーを起こした。issuesの#5と同じ問題かと思われるが、未解決

https://github.com/cbg-ethz/shorah/issues/5

引用

ShoRAH: estimating the genetic diversity of a mixed sample from next-generation sequencing data

Osvaldo Zagordi, Arnab Bhattacharya, Nicholas Eriksson, Niko Beerenwinkel

BMC Bioinformatics 2011 12:119

2018-11-05

マッピングツール segemehl

Bisulfite sequencing 2014 Bioinformatics mapping RNA seq

2018 11/5 タイトル修正

　近年、短いシーケンシングリードを大きなリファレンスゲノムにアライメントさせる問題はかなりの注目を集めており、これまで様々な異なるアルゴリズムアプローチに基づく、異なる多くのアラインメントツールが発表されている。 EBIの調査では、現在、80以上の異なるマッパーが数えられている（Fonseca et al、2012）（論文執筆時点）。この競争の激しい分野ではかなり進化と進歩が見られた。初期のマッパーは、ミスマッチがないか少ないミスマッチでショートリードをアライメントさせることに制限され、挿入および欠失を伴うリードはしばしば除外された。しかし RNA-seqプロトコルの登場により、問題の複雑さがさらに増した。スプライシングである。 2つ以上のエクソンを連結するcDNAにリードをマッピングする場合、アライナーは、リードを「splitして」、その部分をリファレンスゲノムの適切なエキソンにアライメントさせる必要がある。あるいは、マッパーにジャンクション情報を提供する必要がある。今日では、ほとんどのツールは1つのスプリットしか想定しないため（論文執筆時点）、複数のエキソン - エキソンジャンクションにまたがるリードは適切にsplitアライメントされないことがある。リード長は絶えず増加しており、複数のsplitを可能にするアルゴリズムは明らかに有利である。 segemehlは、シミュレートされたデータベンチマークで非常によく機能することが示されたローカルトランジションアルゴリズムを使用してマルチスプリットアライメントを容易にする（Hoffmann et al、2014）。bisulfite処理したDNAリードのアライメント、すなわちmethyl-cytosineシーケンシングはまた、特殊なアルゴリズムおよび方法を必要とする（Chen et al、2010; Krueger and Andrews、2011; Smith et al、2009）。 segemehlの特徴であるsplit-readおよびbisulfiteは、広範なベンチマーク（Hoffmann et al、2014; Otto et al、2012）とともに最近publishされた。ツールの多様性とアルゴリズムとソフトウェアの迅速な開発には、頻繁で透明で再現性のあるベンチマークが必要である。ここで著者らは、DNA-seqおよびRNA-seqリードアライメントの性能テストの結果を示し、詳細情報を提供する。これらすべてのデータ、カスタムスクリプト、および分析を再実行する方法の詳細な説明を含む広範なelectronic supplement（http://www.bioinf.uni-leipzig.de/publications/supplements/13-008）を用意した。ベンチマークが特定の側面のみを測定したり、ツールの普遍的な優越性または劣等性を主張するために使用されないことを強調したい。すべてのreadersにこのデータを再現することを奨励し、代替ベンチマークを提示することを奨励する。

ベンチマーク

http://www.ecseq.com/support/benchmark.html

インストール

mac os10.13のanaconda2.4.3環境でテストした。

#Anacondaを使っているなら、condaで導入できる。

conda install -c bioconda segemehl

> segemehl.x

$ segemehl.x

segemehl.x: required option 'database' (d) missing

usage: segemehl.x [-sbcKVTYCO] -d <file> [<file>] [-q <file>] [-p <file>] [-i <file>] [-j <file>] [-x <file>] [-y <file>] [-B <string>] [-F <n>] [-m <n>] [-t <n>] [-o <string>] [-u <file>] [-D <n>] [-J <n>] [-E <double>] [-w <double>]

[-M <n>] [-r <n>] [-S] [--nohead] [-e <n>] [-n <n>] [-X <n>] [-A <n>] [-W <n>] [-U <n>] [-Z <n>] [-l <f>] [-H] [--showalign] [-P <string>] [-Q <string>] [-R <n>] [-I <n>]

Heuristic mapping of short sequences

[INPUT]

-d, --database <file> [<file>] list of path/filename(s) of database sequence(s)

-q, --query <file> path/filename of query sequences (default:none)

-p, --mate <file> path/filename of mate pair sequences (default:none)

-i, --index <file> path/filename of db index (default:none)

-j, --index2 <file> path/filename of second db index (default:none)

-x, --generate <file> generate db index and store to disk (default:none)

-y, --generate2 <file> generate second db index and store to disk (default:none)

-B, --filebins <string> file bins with basename <string> for easier data handling (default:none)

-F, --bisulfite <n> bisulfite mapping with methylC-seq/Lister et al. (=1) or bs-seq/Cokus et al. protocol (=2) (default:0)

[GENERAL]

-m, --minsize <n> minimum size of queries (default:12)

-s, --silent shut up!

-b, --brief brief output

-c, --checkidx check index

-t, --threads <n> start <n> threads (default:1)

-o, --outfile <string> outputfile (default:none)

-u, --nomatchfilename <file> filename for unmatched reads (default:none)

[SEEDPARAMS]

-D, --differences <n> search seeds initially with <n> differences (default:1)

-J, --jump <n> search seeds with jump size <n> (0=automatic) (default:0)

-E, --evalue <double> max evalue (default:5.000000)

-w, --maxsplitevalue <double> max evalue for splits (default:50.000000)

-M, --maxinterval <n> maximum width of a suffix array interval, i.e. a query seed will be omitted if it matches more than <n> times (default:100)

-r, --maxout <n> maximum number of alignments that will be reported. If set to zero, all alignments will be reported (default:0)

-S, --splits detect split/spliced reads (default:none)

-K, --SEGEMEHL output SEGEMEHL format (needs to be selected for brief)

-V, --MEOP output MEOP field for easier variance calling in SAM (XE:Z:)

--nohead do not output header

[SEEDEXTENSIONPARAMS]

-e, --extensionscore <n> score of a match during extension (default:2)

-n, --extensionpenalty <n> penalty for a mismatch during extension (default:4)

-X, --dropoff <n> dropoff parameter for extension (default:8)

[ALIGNPARAMS]

-A, --accuracy <n> min percentage of matches per read in semi-global alignment (default:90)

-W, --minsplicecover <n> min coverage for spliced transcripts (default:80)

-U, --minfragscore <n> min score of a spliced fragment (default:18)

-Z, --minfraglen <n> min length of a spliced fragment (default:20)

-l, --splicescorescale <f> report spliced alignment with score s only if <f>*s is larger than next best spliced alignment (default:1.000000)

-H, --hitstrategy report only best scoring hits (=1) or all (=0) (default:1)

--showalign show alignments

-P, --prime5 <string> add 5' adapter (default:none)

-Q, --prime3 <string> add 3' adapter (default:none)

-R, --clipacc <n> clipping accuracy (default:70)

-T, --polyA clip polyA tail

-Y, --autoclip autoclip unknown 3prime adapter

-C, --hardclip enable hard clipping

-O, --order sorts the output by chromsome and position (might take a while!)

-I, --maxinsertsize <n> maximum size of the inserts (paired end) (default:5000)

[VERSION]

0.2.0-418 (2015-01-05 05:17:35 -0500 (Mon, 05 Jan 2015))

[BUGS]

Please report bugs to steve@bioinf.uni-leipzig.de

[REFERENCES]

SEGEMEHL is free software for non-commercial use

実行方法

１、index（*1）

segemehl.x -x index -d ref.fasta

-x generate db index and store to disk (default:none)
-d list of path/filename(s) of database sequence(s)

２、マッピング

segemehl.x -i index -d ref.fasta  -q input.fq > mapped.sam

-q path/filename of query sequences (default:none)
-p path/filename of mate pair sequences (default:none)

Split read alignmentをONにしたマッピング

segemehl.x -S -i index -d ref.fasta  -q input.fq > mapped.sam

-S detect split/spliced reads (default:none)

詳細はマニュアルを読んでください。丁寧に説明されています。

引用
Lacking alignments? The next-generation sequencing mapper segemehl revisited
Otto C, Stadler PF, Hoffmann S

Bioinformatics. 2014 Jul 1;30(13):1837-43

A multi-split mapping algorithm for circular RNA, splicing, trans-splicing and fusion detection
Hoffmann S, Otto C, Doose G, Tanzer A, Langenberger D, Christ S, Kunz M, Holdt LM, Teupser D, Hackermüller J, Stadler PF

Genome Biol. 2014 Feb 10;15(2):R34

特定のクロモソームだけindexingすることもできるが、ここでは記載しない。WGSデータを特定のクロモソームだけにマッピングさせると偽陽性が生じやすい（例えばヒトゲノムのリファレンスではデコイの配列も用意されている）。

https://software.broadinstitute.org/gatk/documentation/article?id=11010

Heng Liさんのブログ

Which human reference genome to use?

http://lh3.github.io/2017/11/13/which-human-reference-genome-to-use

2018-11-04

ハプロタイプベースのバリアントコーラー octopus

2018 family trios somatic mutation germline SNV tumor human exome joint calling 2021 Nature Biotechnology

2019 4/16 誤字修正

2020 4/15 インストール追記

2021 4/8 論文引用

　ハプロタイプベースのアプローチは、生殖系列のバリアントをコールするための選択方法として浮かび上がってきた。なぜなら、これらの方法は、リードマッパーからのアライメントエラーに対して強く、ポジショナルアプローチ1-7（ref.1 Platypus paper）よりも優れたシグナル - ノイズ特性（S/N）を有するからである。しかしながら、既存のハプロタイプベースのバリアントコーラーにはいくつかの限界がある。第1に、既存のツールは、二倍体（ref.1-3）または一定のコピーナンバー（ref.4-6）のいずれかを仮定するモデルを実装する、ほとんどの場合は多くの問題に対して準最適であり、理想的な無関係な個体集団から標本が選択されると仮定する。このようなモデルは、スモールコホートの生殖系列バリアントをコールするのに適しているが、他の実験デザインで生成されたデータにはpoorにしかフィッティングしない。これには、paired tumours（tumours とnormalのペア）、シングルセル、parent-offspring trios（家族トリオ）などの既知の関連性を有する試料を含む研究、pooled tumour および試料がしばしばheterogeneous であるバクテリアのシーケンシングが含まれる。これらの制限により、研究者はさまざまなコーラーを使い、ポストhoc フィルタリングと統合を伴うカスタムパイプラインを実装して対応している（ref.8-16）。

　第2に、既存のハプロタイプベースの方法は、バリアントが非重複領域内で評価される際に、ウィンドウアーチファクトを被る。これは、複雑な領域で、評価対象の領域外にあるバリアントでfalseコールを引き起こす可能性がある。第3に、既存の方法は、リードデータによって支持されたハプロタイプ配列とそれを生じさせた突然変異事象とを明確に区別しない。これは、これらのハプロタイプ配列に適切な事前確率を割り当てることを困難にする。なぜなら、同じハプロタイプ配列を生じさせていても、異なるセットの変異は非常に異なる生物学的尤度を有し得るためである、第4に、ハプロタイプベースの方法は本質的にはバリアントを物理的にphasingすることができるが、既存のツールは２倍体の遺伝子型のphasingに限定され、生殖系列バリアントまたは他の体細胞de novoバリアントに関してpoorにしかフィットしない。

　増大するさまざまな実験デザインでのバリアントコールの要求を満たすために、著者らは、統一されたハプロタイプawareフレームワークで異なる遺伝子型モデルを適応させるアルゴリズムを設計した。particle filteringからインスピレーションを得て、通常、他の方法よりも長いハプロタイプを生成する新規なハプロタイプ推論手続きを開発し、アーチファクトをwindowingしてS/N比を改善し、より正確なバリアントコールを実現した。さらに、本発明者らの方法は、同一の配列になるにもかかわらず別個の突然変異イベントからなる突然変異の生物学的妥当性を考慮し、比較することができる。著者らは、事前情報とリード情報の両方を活用し、体細胞突然変異を含む任意のploidyの遺伝型phaseを可能にする確率的なphasingアルゴリズムを提案する。

　我々（著者ら）はC ++で記述されたアルゴリズムOctopusを実装する。著者らは、Octopusが、いくつかの一般的な実験デザイン、すなわち個人の生殖細胞バリアントコール、親子トリオのde novoバリアントコール、tumoursの体細胞バリアントコール（paired tumoursの有り無しに関わらず）を専門とする最先端のツールよりも正確であることを示す。 Opsusはhttps://github.com/luntergroup/octopusのMITライセンス下で自由に利用できる。

Overview of the unified haplotype-based algorithm. Preprintより転載。

2018年11月現在、preprintです。

User Document

インストール

依存

A C++14 compiler with SSE2 support
A C++14 standard library implementation
Git 2.5 or greater
Boost 1.65 or greater
htslib 1.4 or greater
CMake 3.9 or greater
Optional:
Python3 or greater

本体　Github

#linuxのanaconda環境ならcondaで導入できる。(link)
conda install -y -c conda-forge -c bioconda octopus

#macの場合、Githubを参照。

#またはdocker イメージをpullするかbuildする。buildするなら
git clone https://github.com/luntergroup/octopus.git
cd octopus/
#ビルド(*1)
docker build -t octopus .
#run
docker run --rm -it octopus /tmp/octopus/bin/octopus -h

> octopus -h

$ octopus -h

octopus population calling options:

General:

-h [ --help ] Produce help message

--version Output the version number

--config arg A config file, used to populate command

line options

--debug [=arg(="octopus_debug.log")] Writes verbose debug information to

debug.log in the working directory

--trace [=arg(="octopus_trace.log")] Writes very verbose debug information

to trace.log in the working directory

--fast Turns off some features to improve

runtime, at the cost of decreased

calling accuracy. Equivalent to '-a off

-l minimal -x 50`

--very-fast The same as fast but also disables

inactive flank scoring

Backend:

-w [ --working-directory ] arg Sets the working directory

--threads [=arg(=0)] Maximum number of threads to be used,

enabling this option with no argument

lets the application decide the number

of threads ands enables specific

algorithm parallelisation

-X [ --max-reference-cache-footprint ] arg (=500MB)

Maximum memory footprint for cached

reference sequence

-B [ --target-read-buffer-footprint ] arg (=6GB)

None binding request to limit the

memory footprint of buffered read data

--max-open-read-files arg (=250) Limits the number of read files that

can be open simultaneously

--target-working-memory arg Target working memory footprint for

analysis not including read or

reference footprint

I/O:

-R [ --reference ] arg FASTA format reference genome file to

be analysed. Target regions will be

extracted from the reference index if

not provded explicitly

-I [ --reads ] arg Space-separated list of BAM/CRAM files

to be analysed. May be specified

multiple times

-i [ --reads-file ] arg Files containing lists of BAM/CRAM

files, one per line, to be analysed

--one-based-indexing Notifies that input regions are given

using one based indexing rather than

zero based

-T [ --regions ] arg Space-separated list of regions

(chrom:begin-end) to be analysed. May

be specified multiple times

-t [ --regions-file ] arg File containing a list of regions

(chrom:begin-end), one per line, to be

analysed

-K [ --skip-regions ] arg Space-separated list of regions

(chrom:begin-end) to skip May be

specified multiple times

-k [ --skip-regions-file ] arg File of regions (chrom:begin-end), one

per line, to skip

-S [ --samples ] arg Space-separated list of sample names to

analyse

-s [ --samples-file ] arg File of sample names to analyse, one

per line, which must be a subset of the

samples that appear in the read files

--ignore-unmapped-contigs Ignore any contigs that are not present

in the read files

--pedigree arg PED file containing sample pedigree

-o [ --output ] arg File to where output is written. If

unspecified, calls are written to

stdout

--contig-output-order arg (=asInReferenceIndex)

The order contigs should be written to

the output

--sites-only Only reports call sites (i.e. without

sample genotype information)

--legacy Outputs a legacy version of the final

callset in addition to the native

version

--regenotype arg VCF file specifying calls to

regenotype, only sites in this files

will appear in the final output

--bamout arg Output realigned BAM files

--split-bamout arg Output split realigned BAM files

--data-profile arg Output a profile of polymorphisms and

errors found in the data

Read transformations:

--read-transforms arg (=1) Enable all read transformations

--mask-low-quality-tails [=arg(=3)] Masks read tail bases with base quality

less than this

--mask-tails [=arg(=1)] Unconditionally mask this many read

tail sbases

--soft-clip-masking arg (=1) Turn on or off soft clip base

recalibration

--soft-clip-mask-threshold [=arg(=3)] Only soft clipped bases with quality

less than this will be recalibrated,

rather than all bases

--mask-soft-clipped-boundary-bases arg (=2)

Masks this number of adjacent non soft

clipped bases when soft clipped bases

are present

--adapter-masking arg (=1) Enable adapter detection and masking

--overlap-masking arg (=1) Enable read segment overlap masking

Read filtering:

--read-filtering arg (=1) Enable all read filters

--consider-unmapped-reads Allows reads marked as unmapped to be

used for calling

--min-mapping-quality arg (=20) Minimum read mapping quality required

to consider a read for calling

--good-base-quality arg (=20) Base quality threshold used by

min-good-bases and min-good-base-fracti

on filters

--min-good-base-fraction [=arg(=0.5)] Base quality threshold used by

min-good-bases filter

--min-good-bases arg (=20) Minimum number of bases with quality

min-base-quality before read is

considered

--allow-qc-fails Filters reads marked as QC failed

--min-read-length arg Filters reads shorter than this

--max-read-length arg Filter reads longer than this

--allow-marked-duplicates Allows reads marked as duplicate in

alignment record

--allow-octopus-duplicates Allows reads considered duplicates by

octopus

--allow-secondary-alignments Allows reads marked as secondary

alignments

--allow-supplementary-alignments Allows reads marked as supplementary

alignments

--no-reads-with-unmapped-segments Filter reads with unmapped template

segments to be used for calling

--no-reads-with-distant-segments Filter reads with template segments

that are on different contigs

--no-adapter-contaminated-reads Filter reads with possible adapter

contamination

--disable-downsampling Disables downsampling

--downsample-above arg (=1000) Downsample reads in regions where

coverage is over this

--downsample-target arg (=500) The target coverage for the downsampler

Candidate variant generation:

-g [ --raw-cigar-candidate-generator ] arg (=1)

Enable candidate generation from raw

read alignments (CIGAR strings)

--repeat-candidate-generator arg (=1) Enable candidate generation from

adjusted read alignments (CIGAR

strings) around tandem repeats

-a [ --assembly-candidate-generator ] arg (=1)

Enable candidate generation using local

re-assembly

-c [ --source-candidates ] arg Variant file paths containing known

variants. These variants will

automatically become candidates

--source-candidates-file arg Files containing lists of source

candidate variant files

--min-source-quality [=arg(=2)] Only variants with quality above this

value are considered for candidate

generation

--extract-filtered-source-candidates arg (=0)

Extract variants from source VCF

records that have been filtered

--min-base-quality arg (=20) Only bases with quality above this

value are considered for candidate

generation

--min-supporting-reads [=arg(=2)] Minimum number of reads that must

support a variant if it is to be

considered a candidate. By default

octopus will automatically determine

this value

--max-variant-size arg (=2000) Maximum candidate variant size to

consider (in region space)

--kmer-sizes arg (=10 15 20) Kmer sizes to use for local assembly

--num-fallback-kmers arg (=10) How many local assembly fallback kmer

sizes to use if the default sizes fail

--fallback-kmer-gap arg (=10) The gap size used to generate local

assembly fallback kmers

--max-region-to-assemble arg (=400) The maximum region size that can be

used for local assembly

--max-assemble-region-overlap arg (=200)

The maximum number of bases allowed to

overlap assembly regions

--assemble-all Forces all regions to be assembled

--assembler-mask-base-quality arg (=10)

Aligned bases with quality less than

this will be converted to reference

before being inserted into the De

Bruijn graph

--min-kmer-prune arg (=2) Minimum number of read observations to

keep a kmer in the assembly graph

before bubble extraction

--max-bubbles arg (=30) Maximum number of bubbles to extract

from the assembly graph

--min-bubble-score arg (=2) Minimum bubble score that will be

extracted from the assembly graph

Haplotype generation:

-x [ --max-haplotypes ] arg (=200) Maximum number of candidate haplotypes

the caller may consider. If a region

contains more candidate haplotypes than

this then filtering is applied

--haplotype-holdout-threshold arg (=2500)

Forces the haplotype generator to

temporarily hold out some alleles if

the number of haplotypes in a region

exceeds this threshold

--haplotype-overflow arg (=200000) Regions with more haplotypes than this

will be skipped

--max-holdout-depth arg (=20) Maximum number of holdout attempts the

haplotype generator can make before the

region is skipped

--extension-level arg (=normal) Level of haplotype extension. Possible

values are: conservative, normal,

optimistic, aggressive

-e [ --haplotype-extension-threshold ] arg (=100)

Haplotypes with posterior probability

less than this can be filtered before

extension

--dedup-haplotypes-with-prior-model arg (=1)

Remove duplicate haplotypes using

mutation prior model

--protect-reference-haplotype arg (=1)

Protect the reference haplotype from

filtering

Calling (general):

-C [ --caller ] arg (=population) Which of the octopus callers to use

-P [ --organism-ploidy ] arg (=2) All contigs with unspecified ploidies

are assumed the organism ploidy

-p [ --contig-ploidies ] arg (=Y=1 chrY=1 MT=1 chrM=1)

Space-separated list of contig

(contig=ploidy) or sample contig

(sample:contig=ploidy) ploidies

--contig-ploidies-file arg File containing a list of contig

(contig=ploidy) or sample contig

(sample:contig=ploidy) ploidies, one

per line

--min-variant-posterior arg (=1) Report variant alleles with posterior

probability (phred scale) greater than

this

--refcall [=arg(=blocked)] Caller will report reference confidence

calls for each position (positional),

or in automatically sized blocks

(blocked)

--min-refcall-posterior arg (=2) Report reference alleles with posterior

probability (phred scale) greater than

this

-z [ --snp-heterozygosity ] arg (=0.001)

Germline SNP heterozygosity for the

given samples

--snp-heterozygosity-stdev arg (=0.01)

Standard deviation of the germline SNP

heterozygosity used for the given

samples

-y [ --indel-heterozygosity ] arg (=0.0001)

Germline indel heterozygosity for the

given samples

--use-uniform-genotype-priors Use a uniform prior model when

calculating genotype posteriors

--max-genotypes arg (=5000) The maximum number of genotypes to

evaluate

--max-joint-genotypes arg (=1000000) The maximum number of joint genotype

vectors to consider when computing

joint genotype posterior probabilities

--use-independent-genotype-priors Use independent genotype priors for

joint calling

--model-posterior arg Calculate model posteriors for every

call

--inactive-flank-scoring arg (=1) Disables additional calculation to

adjust alignment score when there are

inactive candidates in haplotype

flanking regions

--model-mapping-quality arg (=1) Include the read mapping quality in the

haplotype likelihood calculation

--sequence-error-model arg (=HiSeq) The sequencer error model to use (HiSeq

or xTen)

--max-vb-seeds arg (=12) Maximum number of seeds to use for

Variational Bayes algorithms

Phasing:

-l [ --phasing-level ] arg (=normal) Level of phasing - longer range phasing

can improve calling accuracy at the

cost of runtime speed. Possible values

are: minimal, conservative, moderate,

normal, aggressive

--min-phase-score arg (=10) Minimum phase score (phred scale)

required to report sites as phased

Variant filtering:

-f [ --call-filtering ] arg (=1) Enable all variant call filtering

--filter-expression arg (=QUAL < 10 | MQ < 10 | MP < 10 | AF < 0.05 | SB > 0.98 | BQ < 15 | DP < 1)

Boolean expression to use to filter

variant calls

--somatic-filter-expression arg (=QUAL < 2 | GQ < 20 | MQ < 30 | SMQ < 40 | SB > 0.9 | SD > 0.9 | BQ < 20 | DP < 3 | MF > 0.2 | NC > 1 | FRF > 0.5)

Boolean expression to use to filter

somatic variant calls

--denovo-filter-expression arg (=QUAL < 50 | PP < 40 | GQ < 20 | MQ < 30 | AF < 0.1 | SB > 0.95 | BQ < 20 | DP < 10 | DC > 1 | MF > 0.2 | FRF > 0.5 | MP < 30 | MQ0 > 2)

Boolean expression to use to filter

somatic variant calls

--refcall-filter-expression arg (=QUAL < 2 | GQ < 20 | MQ < 10 | DP < 10 | MF > 0.2)

Boolean expression to use to filter

homozygous reference calls

--use-calling-reads-for-filtering arg (=0)

Use the original reads used for variant

calling for filtering

--keep-unfiltered-calls Keep a copy of unfiltered calls

--training-annotations arg Outputs all calls as PASS and annotates

output VCF with the given measures or

measure set

--filter-vcf arg Filter the given Octopus VCF without

calling

--forest-file arg Trained Ranger random forest file

--somatic-forest-file arg Trained Ranger random forest file for

somatic variants

実行方法

リファレンスとbamを指定して実行する。

octopus --reference ref.fasta --reads input.bam --threads 4 > octopus.vcf


#dockerなら
docker run --rm -itv $PWD:/data octopus /tmp/octopus/bin/octopus \
--reference /data/ref.fasta --reads /data/input.bam --threads 4\
> /data/output.vcf

領域指定。"-T chr:start-end"で指定するか、bedファイルを読み込む。

octopus --referencehs37d5.fa --reads NA12878.bam -t regions.bed > output.vcf

スキップする領域を指定するには"-K chr:start-end"、または"-t regions.bed"。

Family trios

octopus --reference hs37d5.fa --reads NA12878.bam NA12891.bam NA12892.bam -M NA12892 -F NA12891 -o output.vcf

-M maternal sample
-F paternal sample

Normal tumour pair

octopus --reference hs37d5.fa --reads normal.bam tumour.bam --normal-sample NORMAL > output.vcf

複数tumourサンプルがある場合

octopus --reference hs37d5.fa --reads normal.bam tumourA.bam tumourB.bam --normal-sample NORMAL

家系情報がない同一集団のJoint variant calling (現バージョンではexperimental)（*2）

octopus --reference hs37d5.fa --reads NA12878.bam NA12891.bam NA12892.bam

他にも様々な解析例がまとめられています。Githubで確認して下さい。

https://github.com/luntergroup/octopus

引用

A unified haplotype-based method for accurate and comprehensive variant calling

Daniel P Cooke, David C Wedge, Gerton Lunter

bioRxiv preprint first posted online Oct. 29, 2018

doi: https://doi.org/10.1101/456103

2021/04/05
A unified haplotype-based method for accurate and comprehensive variant calling

Daniel P. Cooke, David C. Wedge & Gerton Lunter
Nature Biotechnology (2021), Published: 29 March 2021

Dockerfile61行目の

RUN ./scripts/install.py --root --threads=2

　↓

RUN ./scripts/install.py --threads=2

に変えてビルドした。

Joint calling

Genotype Calling From Ngs Data - Per Sample Or Multi Sample?

Variation & Genotype Calling From Ngs Data - Per Sample Or Multi Sample?

2018-11-03

シーケンシングデータのハプロタイプを可視化し、リードを分類する HapFlow

2016 Bioinformatics 結果の視覚化 (visualization) haplotype bacteria GUIツール bam/sam OTU

2018 11/3 誤字修正

2019 3/18 freebayes追記

　ハイスループットシーケンシング技術の出現により、バクテリア集団のシーケンシングのような新しい実験的アプローチが可能になった。感染は、しばしば同じ種の複数の株を含んでおり（Darch et al、2015; Taylor et al、1995）、これは治療方法決定（Cohen et al、2012）に重要な意味を持つ。mixed-strain の集団を分析するためのいくつかの方法が開発されている。ShoRAH（Zagordi et al、2011）は、全体的なハプロタイプの最小セットを再構成し、推測されたハプロタイプの頻度を推定する。これは、オーバーラップしたリードによってリンクするため十分な密度のバリアントが必要になる。ドミナントの株およびマイナーな株からの感染を同定するために 2ステップの最尤アプローチも報告されている（Eyre et al、2013）。このアプローチは、バリアント密度に依存しないが、ローカルまたはグローバルのハプロタイプを推論することはできない。１サンプル内で同じバクテリア種に複数の株がある場合、ゲノム分析を行うベストな戦略を決定するためには、シーケンシングデータのハプロタイプを視覚化するツールが必要である。

　Savant（Fiume et al、2010 pubmed）、Tablet（Milne et al、2010）、Consed（Gordon and Green、2013 pubmed）など、多くの優れたリードアライメント視覚化ツールが存在する。これらのツールは、各リードが線、またはベースの行として表される方法でリードを配列する。このレイアウトは、バリアントまたはミスアライメントされたリードを識別するのには満足できるものだが、リードに存在するハプロタイプを識別するのには理想的ではない。これは、リードは密接にまとめられているため、離れた位置のバリアントが同じリードペアにあるかどうかを判断することは困難であるためである。さらに、ハプロタイプによってグループ分けされていないので、シーケンシングデータでハプロタイプがどの程度頻繁に表されるかを同定することが困難である。

　HapFlowは、存在するハプロタイプをより容易に同定するため、リードアラインメントデータを抽出することによってこれらの問題を解決する。 HapFlowは、シークエンシングアーチファクトの同定、サンプル中に存在する最小の株数の同定、およびシーケンシングデータ単独でローカルハプロタイプまたはグローバルハプロタイプの決定が可能かどうかを判断するのに役立つ。

HP（プログラムの他、テストデータ、マニュアルもダウンロードできる）

http://mjsull.github.io/HapFlow/files.html

wiki

https://github.com/mjsull/HapFlow/wiki

インストール

HPからダウンロードする。CUI環境で動くpythonのパッケージと、GUIバージョンがある（windows版、mac版、linux版あり）。

HapFlow - downloads

ここではmac osxで動作確認した。

解析の流れ

１、リードをアライメントしてbamを出力する（公式例）。

２、Freebayesなどのploidyを指定してバリアントコールできるツールを使い、バリアントコールを行う（freebayesはcondaで導入できる）。

例えばploidyの値を3に増やしてバリアントコールする。

freebayes -f ref.fa -p 3 -F 0.2 -C 5 input.bam >aln.vcf

#もっと厳しくするなら
freebayes -f assembly.fasta -p 3 -F 0.2 -C 50 -q 30 -m 30 --min-coverage 100 -i -u input.bam >aln.vcf

-F Require at least this fraction of observations supporting an alternate allele within a single individual in the in order to evaluate the position. default: 0.2
-p Sets the default ploidy for the analysis to N. default: 2
-C Require at least this count of observations supporting an alternate allele within a single individual in order to evaluate the position. default: 2

３、 Flow Fileを作成するため、HapFlowを実行する。

./HapFlow

コマンドを叩くとHapFlowのウィンドウが出現する。

f:id:kazumaxneo:20181102004906p:plain

メニューバーのCreate flow fileをクリック

f:id:kazumaxneo:20181102212815p:plain

ウィンドウが表示されるので、bamとvcfを指定する。

f:id:kazumaxneo:20181102212451p:plain

ジョブ開始はOKをクリック。ランが終わると.flwファイルが出力される（上の画像ではoutoutをflowとしているが、拡張子なしでも認識する）。

４、File => Load flow file から.flwファイルを読み込み可視化する。

f:id:kazumaxneo:20181102213129p:plain

ここで読み込んでいるのは、HPからダウンロードしたテストデータ1のtutorial_1.flw（real data from mixed-strain infection of Koala）

図の表現方法は、公式wikiで丁寧に説明されている。

https://github.com/mjsull/HapFlow/wiki/Interpreting-data-using-HapFlow

公式より転載

　左の図は、リードがリファレンスゲノムにアライメントされていることを表している。縦の線は、そこにバリエーションがあることを意味する。同じポジションに赤と青のバリアント（アレル）があり、それが３ポジション存在している。リードによってバリアントのパターンは異なるが、左から順に青ー赤ー赤のバリアントプロファイルを持つリードが一番多い。左図右端の縦線は、バリアントのプロファイルによるグループ分け結果を表していて、この青ー赤ー赤のバリアントプロファイルのリードはオレンジの縦線が割り当てられている。

　右の図は、バリアント組み合わせに基づいてユニークなバリアントプロファイルを"flow"で表している。先ほどの４リードサポートがあるオレンジのflowが一番太く、ー青ー青の組み合わせの紫のflowが一番薄い。このように、flowの厚みはそのユニークなバリアントプロファイルの比率を表している。flowの矢印の向きはリードの向きと対応している。

次の図

f:id:kazumaxneo:20181103070400p:plain

公式より転載

　ペアエンドシーケンシングの場合、ペアのリードは同じテンプレートに由来しているので、より長い範囲でバリアントプロファイルをカバーできる。flow図では、ペアエンドリードは上図のように破線で接続される。そのほか、ゲノムは一番上に青のboxで表わされ（右の図）、バリアントのリファレンス上での位置は、青boxの中のオレンジboxで表わされる。オレンジ領域を拡大したのが一段下のオレンジboxで、バリアント位置はオレンジbox中の縦線で表わされる。

この図のflowをよく見ると、dark green flowの左端矢印は長さが半分になっているが、これはバリアント位置がリードの上流末端に相当することを意味している。同じく、リード下流末端のバリアントも矢印の長さが半分になる（Light greenのflowの右側矢印）。末端バリアントのみ区別した表記があるのは、リード末端はミスアライメントに由来することが多いためと述べられている。flowはバリアントを共有している度合いによってグルーピングされ、にたバリアントプロファイルを持つflowは同系色の色で表現され、近傍にまとめられる（上で言えば、赤とオレンジのflowグループ、濃い緑と薄緑のflowグループ）。

f:id:kazumaxneo:20181103112200j:plain

一番下には、リファレンスのポジションとリファレンスのアレル（*付きの方）が表示される。

f:id:kazumaxneo:20181103112817j:plain

この領域では、左端２番目の363598のアレルでメジャーなもの（青色のflows）は、次（左端３番目）の363732ではマイナーなアレルに変化している。

view > hide gappedで、ギャップのラインを非表示。

f:id:kazumaxneo:20181103113649j:plain

また、A、D、W、Sそれぞれのキーを押すと、flowの幅と高さを１段階ずつ変更できる。

Dを数回押して横長にした。

f:id:kazumaxneo:20181103114310j:plain

tools > goto baseでポジションにジャンプできる。

f:id:kazumaxneo:20181103114502j:plain

異なるtaxonomicグループが混ざっていることがわかれば、taxonomicグループごとにリードを分割することもできる。そのためには、自分でリードをアサインしていく必要がある。

左端から順にダブルクリックして、OTUの番号を割りふる。まずはOTU1をアサインする。左端から順に同じ系統のカラーをダブルクリックしていく（１つでも飛ばすと番号がつかない）。

f:id:kazumaxneo:20181103120041j:plain

全て終わったら、 OTU2に切り替え、同様に番号をアサインしていく。

f:id:kazumaxneo:20181103120246j:plain

全アサインが終わった状態

f:id:kazumaxneo:20181103120333j:plain

公式より

tools > Write OTUsで、アサインしたグループごとにbamを分割出力する。

f:id:kazumaxneo:20181103120422j:plain

出力された。

f:id:kazumaxneo:20181103120501j:plain

少し大きなゲノムでも表示できるようです。詳細はwikiのperfrmanceを読んで下さい。質問も受け付けているようです。

https://github.com/mjsull/HapFlow/wiki/Performance

引用
HapFlow: visualizing haplotypes in sequencing data
Sullivan MJ, Bachmann NL, Timms P, Polkinghorne A

Bioinformatics. 2016 Feb 1;32(3):441-3

2018-11-02

複数のアセンブラとk-merを使ったTranscriptome 自動アセンブリワークフロー Oyster River Protocol

2018 PeerJ RNA seq assembly error correction evaluation tool automatic pipeline fish plant マウス worm de novo transcriptome

2018 11/2 コマンド追記 & 誤字修正

2018 11/7 誤字修正

2019 4/6 docker追記

2019 6/17 追記、誤字修正

2019 6/21追記

2019 7/5 Step by step instructions link追記

　現代のシーケンシング技術は細胞内の代謝過程から人口変動パターンまで、非常に幅広い自然現象の基礎となるゲノムレベルのプロセスを深く理解する機会を提供してきた。トランスクリプトームシーケンシングは、特に functional genomics（論文より　Lappalainen et al、2013; Cahoy et al、2008）に影響を与えており、何年か前には不可能だった発見も可能になってきている（Mortazavi et al、2008; Wang、Gerstein＆Snyder、2009）。この理由の大部分は、これらの研究が実施されるスケールに起因している（Li et al、2017; Tan et al、2017）。 1つまたは少数の候補遺伝子に基づくadaptationの研究とは異なり（Fitzpatrick et al、2005; Panhuis、2006）、現代の研究は発現された転写産物を同時に全評価（トランスクリプトーム）することができる。ダイナミックレンジの高まりの直接的な結果として、スケール問題に加えて、低および高発現転写産物を同時に再構成および定量する能力が高まった（Wolf、2013; Vijay et al、2013）。最後に、遺伝子発現の変化を検出するためのより改善された方法が登場し（Robinson、McCarthy＆Smyth、2010 (link); Love、Huber＆Anders、2014 (link)）、遺伝子発現変動の理解が目的の研究の解像度が向上した。

　広範囲にわたり人気になった直接的な結果として、トランスクリプトームのアセンブリのための多様なツールセットが存在しているが、潜在的には、再構成された転写産物のそれぞれが再構築に失敗する。最も初期の特別なトランスクリプトームアセンブラには、Trans-ABySS（Robertson et al、2010）、Oases（Schulz et al、2012）、SOAPdenovoTrans（Xie et al、2014）のパッケージがある。これらは、 de BruijnのgraphベースのゲノムアセンブラのABySS（Simpson et al、2009）、Velvet（Zerbino＆Birney、2008）、SOAP（Li et al、2008）に機能が基づいている。これらの初期の努力から、トリニティ（Haas et al、2013）およびIDBA-Tran（Pengら、2013）のような一連のより専門化されたde novoトランスクリプトームアセンブラが生み出された。 De Bruijnのグラフアプローチは強力だが、より新しく開発されたソフトウェアはアルゴリズム的な視点で斬新な部分を探究しており、新しいメソッドがトランスクリプトームの再構成できなかった配列をアセンブリできると仮定すれば、大きな利点を提供できている。例えば、BinPacker（Liu et al、2016）（紹介）は、古典的なビン充填問題の後にアセンブリ問題をモデル化するde Bruijn graph手法を放棄し、一方、Shannon（Kannan et al、2016 preprint）は、ソフトウェアエンジニアが決定したヒューリスティックではなくinformation theoryを使う。これらの新しいアセンブラは、基本的に異なるアセンブリアルゴリズムを実装することにより、他のアセンブラが正確にアセンブルできないトランスクリプトームの部分を再構築する可能性がある。

　デノボアセンブリに利用可能な様々なツールに加えて、リードトリミングによる前処理（例えば、Skewer; Jiang et al、2014、Trimmomatic、Bolger、Lohse＆Usadel、2014 、2011）、リードの正規化（khmer; Pell et al、2012）、リードのエラー訂正（SEECER; Le et al、2013、RCorrector; Song＆Florea、2015、Reptile; Yang、Dorman＆Aluru 、2010）がある。同様に、TransRate（Smith-Unna et al、2016）、BUSCO（Benchmarking Universal Single-Copy Orthologs、Simãoet al、2015）、Detonate（Li et al、2014）などのアセンブリされたトランスクリプトのクオリティを評価するベンチマークツールも同様に開発されている。

(一部略)

　トランスクリプトーム再構成に関連する微妙な（そしてそれほど微妙ではない）方法論的課題は、非常に可変なアセンブリ品質をもたらす可能性がある。特に、ほとんどのツールはデフォルト設定を使用して実行されるが、これらのデフォルトは、特定の（多くの場合、不特定の）ユースケースまたはデータ型に対してのみ有効である。パラメータの最適化は本質的にデータセット依存型と階乗型であるため、特にパイプライン全体の網羅的な最適化は不可能である。このことを考慮すると、デノボトランスクリプトームアセンブリを実行するには、時間とリソース面で大きな投資が必要であり、各ステップを慎重に検討する必要がある。ここでは、様々な一般的な実験条件または分類群にわたって、高品質のトランスクリプトームアセンブリ作製をもたらす、エビデンスに基づくアセンブリプロトコルを提案する。

この論文では、トランスクリプトームアセンブリ用のオイスターリバープロトコル（ORP）1の開発について説明する。これは、Vijay et al (2013) に記載されている多くの欠点を明示的に考慮し、その改善を試みる。本方法は複数のkmerとマルチアセンブラ戦略を活用している。この革新は非常に重要である。なぜなら、すべてのアセンブリソリューションがバイアスがある方法でシーケンシングリードデータを処理し転写産物をレカバリするからである。具体的には、アセンブリソフトウェアの開発では、アセンブリ問題自体の範囲を考慮して必要なヒューリスティックセットを使用する。各ソフトウェア開発チームが様々なアセンブリアルゴリズムを実装しながらこれらのヒューリスティックに関連するユニークなアイデアを持ち、個々のアセンブラはユニークなアセンブリ動作を示す。マルチアセンブラのアプローチを活用することで、あるアセンブラの強みが別のアセンブラの弱みを補うことができる。アセンブリーヒューリスティックに関連するバイアスに加えて、アセンブリのkmer-lengthは転写再構成に重要な影響を及ぼす。より短いkmersは、abundantな転写産物より少ない転写産物を効率的に再構築することがよく知られている。この場合、複数の異なるkmer長でアセンブルし、結果として得られるアセンブリをマージすると、このタイプのバイアスを効果的に減らすことができる。これらの問題を認識して、私 (著者のこと) は、複数の異なるアセンブラの組み合わせと複数のアセンブリ - kmers長によって生じるアセンブリが、さまざまなメトリックにわたって、個々のアセンブリより優れていると仮定している。

強化されたパイプラインを開発することに加えて、この作業では、新しいアセンブリアルゴリズムやパイプラインを開発する際に他の研究者が使用できる、完全にベンチマークされたリファレンスアセンブリを利用できるようにしながら、他の多くの研究者がアセンブリ方法の比較を公表してきたが、これまでは単一のデータセットがいくつかの異なる方法でアセンブリされていた（Marchant et al、2016; Finseth＆Harrison、2014）。

ソフトウエア

ORPはLinuxプラットフォームにインストールできる。Linuxbrew（Jackman＆Birol、2016）のみ必要で、インストールにスーパーユーザー権限は必要ない。このソフトウェアは、以下に説明するすべてのステップを調整するスタンドアローンのmakefileとして実装されている。ソフトウェアはバージョン管理されており、共有と再利用を促進するためにオープンライセンスとなっている。ユーザーガイドはhttp://oyster-river-protocol.rtfd.ioから入手できる。

1、プリアセンブリ

Illuminaシーケンシングアダプターを、MacManes（2014 link）の推奨に従い、Trimmomaticバージョン0.36（Bolger、Lohse＆Usadel、2014）プログラムを用いて除去した（リードの両端からPhred≦2を除去 *1）。トリミング後、MacManes＆Eisen（2013）の推奨に従ってソフトウェアRCorrectorバージョン1.0.2（Song＆Florea、2015）（紹介）を使用してエラー訂正した。

２、アセンブリ

トリムしてエラーコレクションされたデータセットを使い、3つの異なるトランスクリプトームアセンブラと3つの異なるkmer長で4つのユニークなアセンブリを作成した。まず、リードの正規化を行わずに、Trinityリリース2.4.0（Haas et al、2013）とデフォルト設定（k = 25）を使用してアセンブリを行なった。正規化しない決定は、正規化されたデータセットのパフォーマンスがわずかに悪化していることを示す以前の研究（MacManes、2015 preprint）に基づいている。次に、SPAdes RNAseqアセンブラ（バージョン3.10）（Chikhi＆Medvedev、2014）（SPAdes紹介）によるアセンブリを、k-merサイズ55および75の別個の２つのランで行った。最後に、Shannonアセンブラのバージョン0.0.2（Kannan et al）によるアセンブリをk-merサイズ75で行った。これらのアセンブラは以下の事実に基づいて選択された:（1）オープンサイエンス開発モデルを使っており、エンドユーザがコードに貢献する可能性がある。（2）維持されており、積極的な開発が続いている。（3）異なるアルゴリズムを使っている。

３、OrthoFuseを使ったアセンブリのマージ

4つのアセンブリを結合するため、私（著者）はトランスクリプトームアセンブリを効果的にマージする新しいソフトウェアを開発した。簡単に説明すると、OrthoFuseはすべてのアセンブリをコンカテネートすることから始まり、ORPにパッケージ化されている、ヌクレオチド配列を受け入れるようにmodifyされてORPに組み込まれたOrthoFinder（Emms＆Kelly、2015）を実行して転写産物グループを形成する。

一部略

Orthofinderの実行後、modifyされてORPに組み込まれたTransRateバージョン1.0.3（Smith-Unna et al、2016）（紹介）をマージされたアセンブリに使い、その後最高のコンティグスコアのcontigを各グループから選択し、新しいアセンブリファイルに配置する。得られたファイルは、すべての下流分析に使用できる。

４、アセンブリの評価

すべてのアセンブリは、ORP-TransRate、Detonateバージョン1.11（Li et al、2014）（紹介）、shmlastバージョン1.2（Scott、2017, link）、およびBUSCOバージョン3.0.2（Simãoet al、2015）（紹介）を使用して評価された。 TransRateは、長さとマッピングベースのmetrcisによりスコアを生成し、トランスクリプトームアセンブリの連続性を評価する。Detonateはorthogonal analysisを実行し、アセンブリのスコアを生成する。 BUSCOは、すべての真核生物に見られる保存されたシングルコピーオーソログのアセンブリを検索することにより、アセンブリを評価する。具体的には、「完全なオーソログ」は、BUSCOグループ平均の長さの2標準偏差以内の転写産物として定義され、このメトリックを満たすと"complete orthologs"、満たさないと"fragmented"と判定される。

Oyster River Protocolに関するツイート

Finally Published! The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly. https://t.co/6Ss7TULCKQ
— Matt MacManes (@macmanes) August 3, 2018

インストール

mac os10.12でテストした。

依存

Rcorrector
HMMER
Trimmomatic
Trinity
SPAdes
TransABySS
MCL
Metis
OrthoFuser
BLAST
seqtk
BUSCO (make sure to install databases)
TransRate (the ORP version packaged here)
Python modules numpy, scipy, biopython, cvxopt

本体 Github

ツールの導入過程でAnacondaや様々な環境が導入される。ここでは仮想環境で実行した。

cd ~
git clone https://github.com/macmanes-lab/Oyster_River_Protocol.git
cd Oyster_River_Protocol
make
source ~/.profile

#configファイルを修正する
vi software/config.ini

> cat software/config.ini

$ cat software/config.ini

[busco]

lineage_path = /home/kazu/Oyster_River_Protocol/busco_dbs/eukaryota_odb9

force = True

quiet = False

[tblastn]

path = /home/kazu/Oyster_River_Protocol/software/anaconda/install/envs/orp_v2/bin/

[makeblastdb]

path = /home/kazu/Oyster_River_Protocol/software/anaconda/install/envs/orp_v2/bin/

[hmmsearch]

path = /home/kazu/Oyster_River_Protocol/software/anaconda/install/envs/orp_v2/bin/

パスを調べ、ここでは↑ のように修正した。

仮想環境"orp_v2"を立ち上げる。

source $HOME/Oyster_River_Protocol/software/anaconda/install/bin/activate orp_v2

ヘルプが表示されるか確認。フルパスで指定する。

> ~/Oyster_River_Protocol/oyster.mk -h

$ ~/Oyster_River_Protocol/oyster.mk -h

***** Welcome to the Oyster River Prptocol *****

***** This is version 2.1.0 *****

Usage:

/path/to/Oyster_River/Protocol/oyster.mk main CPU=24

MEM=128

STRAND=RF

READ1=1.subsamp_1.cor.fq

READ2=1.subsamp_2.cor.fq

RUNOUT=test

(orp_v2) kazu@aa796a2b1908:~/Oyster_River_Protocol$ /home/kazu/Oyster_River_Protocol/oyster.mk -h

Usage: make [options] [target] ...

Options:

-b, -m Ignored for compatibility.

-B, --always-make Unconditionally make all targets.

-C DIRECTORY, --directory=DIRECTORY

Change to DIRECTORY before doing anything.

-d Print lots of debugging information.

--debug[=FLAGS] Print various types of debugging information.

-e, --environment-overrides

Environment variables override makefiles.

-f FILE, --file=FILE, --makefile=FILE

Read FILE as a makefile.

-h, --help Print this message and exit.

-i, --ignore-errors Ignore errors from commands.

-I DIRECTORY, --include-dir=DIRECTORY

Search DIRECTORY for included makefiles.

-j [N], --jobs[=N] Allow N jobs at once; infinite jobs with no arg.

-k, --keep-going Keep going when some targets can't be made.

-l [N], --load-average[=N], --max-load[=N]

Don't start multiple jobs unless load is below N.

-L, --check-symlink-times Use the latest mtime between symlinks and target.

-n, --just-print, --dry-run, --recon

Don't actually run any commands; just print them.

-o FILE, --old-file=FILE, --assume-old=FILE

Consider FILE to be very old and don't remake it.

-p, --print-data-base Print make's internal database.

-q, --question Run no commands; exit status says if up to date.

-r, --no-builtin-rules Disable the built-in implicit rules.

-R, --no-builtin-variables Disable the built-in variable settings.

-s, --silent, --quiet Don't echo commands.

-S, --no-keep-going, --stop

Turns off -k.

-t, --touch Touch targets instead of remaking them.

-v, --version Print the version number of make and exit.

-w, --print-directory Print the current directory.

--no-print-directory Turn off -w, even if it was turned on implicitly.

-W FILE, --what-if=FILE, --new-file=FILE, --assume-new=FILE

Consider FILE to be infinitely new.

--warn-undefined-variables Warn when an undefined variable is referenced.

This program built for x86_64-pc-linux-gnu

Report bugs to <bug-make@gnu.org>

インストールはMakefileに従い自動に行われますが、ビルドが面倒で数回失敗したので、dockerイメージも上げておきます。testデータは捨ててあるので、このイメージで下記のテストランを行う場合はgithubから再取得して下さい。このイメージは16GBほどあります。

docker pull kazumax/oyster_river_protocol

#run
docker run -itv $PWD:/data kazumax/oyster_river_protocol
#user: kazu
#pass: 1898 
su kazu
cd ~
source ~/.profiles
$HOME/Oyster_River_Protocol/software/anaconda/install/bin/activate orp_v2

#help
/home/kazu/Oyster_River_Protocol/oyster.mk -h

↑　非推奨

2019 4/6 追記

オーサーがdockerイメージを準備してくれています。こちらを使って下さい。

The Oyster River transcriptome assembler can now be installed using @Docker `macmaneslab/orp:2.2.4`. Also, wow, Docker is amazing! See instructions https://t.co/QQwPeAyCSP
— Matt MacManes (@macmanes) April 5, 2019

How to install the ORP using Docker — Oyster River Protocol latest documentation

#docker hub version

#latest tagが付いてないのでversionチェックして入れてください
docker pull macmaneslab/orp:2.2.7

cd RNA_seq_raw_fastq
docker run -itv $PWD:/data -w /data macmaneslab/orp:2.2.7
> conda activate orp
> $HOME/Oyster_River_Protocol/oyster.mk \
TPM_FILT=1 \
MEM=50 \
CPU=40 \
READ1=pair_1.fq.gz \
READ2=pair_1.fq.gz \
RUNOUT=output

テストラン

Run test dataset

cd sampledata
#実行(*2)
$HOME/Oyster_River_Protocol/oyster.mk main STRAND=RF \
MEM=15 \
CPU=8 \
READ1=test.1.fq.gz \
READ2=test.2.fq.gz \
RUNOUT=test

seqs] Processed 53898 reads in 1.990 CPU sec, 0.424 real sec

[main] Version: 0.7.17-r1188

[main] CMD: bwa mem -t 8 test /dev/fd/63 /dev/fd/62

[main] Real time: 0.821 sec; CPU: 2.110 sec

[bam_sort_core] merging from 0 files and 10 in-memory blocks...

-parsing file: test.sorted.bam

-done parsing file, examining orientations of reads.

7| ##

6| ##

5| ##

4| ##

3| ##

2| ### #

1| ### ## #

-----------

------------------------------------

| Summary |

------------------------------------

| observations: 20 |

| min value: -1.000000 |

| mean : -0.988000 |

| max value: -0.935000 |

------------------------------------

***** See the following link for interpretation *****

***** https://oyster-river-protocol.readthedocs.io/en/latest/strandexamine.html *****

***** QUALITY REPORT FOR: test using the ORP version 2.1.0 ****

***** THE ASSEMBLY CAN BE FOUND HERE: /home/kazu/Oyster_River_Protocol/sampledata/assemblies/test.ORP.fasta ****

***** BUSCO SCORE ~~~~~> C:0.0%[S:0.0%,D:0.0%],F:0.3%,M:99.7%,n:303

***** TRANSRATE SCORE ~~~~~> 0.37907

***** TRANSRATE OPTIMAL SCORE ~~~~~> 0.53726

***** UNIQUE GENES ORP ~~~~~> 37

***** UNIQUE GENES TRINITY ~~~~~> 31

***** UNIQUE GENES SPADES55 ~~~~~> 22

***** UNIQUE GENES SPADES75 ~~~~~> 23

***** UNIQUE GENES TRANSABYSS ~~~~~> 35

ジョブ終了

QCリード、アセンブリ、レポート等が出力される。
出力

f:id:kazumaxneo:20181102102919j:plain

レポート解析だけ実行する。

report.mk

$HOME/Oyster_River_Protocol/report.mk main \
ASSEMBLY=assembly.fasta \
CPU=24 \
LINEAGE=eukaryota_odb9 \
READ1=SRR2016923_1.fastq \
READ2=SRR2016923_2.fastq \
RUNOUT=SRR2016923

他にもquant.mk、strandeval.mkなどのコマンドがあります（version2.1）。helpで使い方は確認してください。また、HPのversion logも定期的に確認してください。一部アセンブラやツールが、論文と変更されたりしています。今後もパフォーマンス改善のチューニングがされるかもしれません。

追記

Step by step instructions

orp_website/full_instructions.md at master · macmanes-lab/orp_website · GitHub

引用
The Oyster River Protocol: a multi-assembler and kmer approach for de novo transcriptome assembly
Matthew D. MacManes

PeerJ. 2018; 6: e5428

参考

Frontiers | On the optimal trimming of high-throughput mRNA sequence data | Genetics

では、phread quality trimingなし、≤2、≤5、≤20のトリミング、がその後の処理に与える影響を調べている。

複数アセンブラを使ったde novo assemblyのワークフローの実際の流れに興味があれば、以下のprotocol.ioページも確認してみて下さい。丁寧に説明されています。

FR、RF等はBiostarsの質問を参照

https://www.biostars.org/p/169942/

Paired reads:

RF: first read (/1) of fragment pair is sequenced as anti-sense (reverse(R)), and second read (/2) is in the sense strand (forward(F)); typical of the dUTP/UDG sequencing method.
FR: first read (/1) of fragment pair is sequenced as sense (forward), and second read (/2) is in the antisense strand (reverse)

Unpaired (single) reads:

F: the single read is in the sense (forward) orientation
R: the single read is in the antisense (reverse) orientation

追記

良い論文が出ています。

De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers | GigaScience | Oxford Academic

2018-11-01

メタゲノムシーケンシングデータからMLST タイピングを行う MetaMLST

2017 Nucleic Acids Research clinical and diagnostic bacteria MLST metagenome

2020 5/24 レポジトリが引っ越ししているので、リンク更新

2022/09/06 インストール手順更新

　高分解能の微生物菌株同定およびトレースは、臨床および研究環境の両方において重要な課題である。微生物の菌株レベルのタイピングのための最も一般的な方法の1つは、すべての株に存在することが知られている少数の種特異的ゲノム遺伝子座（通常は7つ）をシーケンシングすることに基づくmultilocus sequence typing（MLST）を与えられた標的種全てで行うことである。そのシンプルさと解像度のおかげで、多くの原核生物（ref.2,3）および真核微生物（ref.4）にMLSTアプローチが定義され採用されてきた。様々な病原菌種に対応した何千ものMLSTのプロファイルとシーケンスのデータベースが利用可能になっている（ref.5,6）。

　臨床でのMLSTの日常的な使用を妨げる制限要因は、関心のある各細菌種を単離して培養する必要性である。 MLSTプロトコル（標的遺伝子座のDNA抽出、PCR増幅、精製およびシーケンシング）も高価で面倒である。次世代シークエンシング技術のスループット向上によるコスト削減により、サンプル（metagenomics（ref.7））の全DNA含量を直接シーケンシングして、時間のかかる単離および培養をスキップして複雑な微生物群集の特性評価を行う効果的なアプローチが可能になっている。メタゲノムデータセットは、所与の微生物群集に存在するすべての菌株のシーケンス情報を含み、理論的には、サンプル中のすべての種のデータをタイピングすることができる。

　メタゲノムMLSTタイピングを達成するための可能な戦略は、メタゲノミクスアセンブリを行い、そのメタゲノミクスコンティグをMLSTデータベースに対してマッピングすることである。しかしながら、このアプローチは、豊富な株（すなわちアセンブリのための十分なデプスを有する株）しか明らかにならない。さらに、メタゲノミクスアセンブリは計算上要求が厳しい。したがって、現在、メタゲノムからMLST遺伝子座を容易かつ効率的に抽出する方法はない。

　MLST手法の有効性と、培養フリーかつ高処理能力のメタゲノミクスを組み合わせるために、著者らはMetaMLSTと呼ばれる微生物タイピングのための新しい計算パイプラインを開発した。利用可能なMLST lociを持つ種について、MLST lociのテンプレートデータベースが与えられると、MetaMLSTは、メタゲノミクス試料中の微生物株のlociプロファイルのin silicoコンセンサス配列再構築を行う。再構成されたloci（プロファイル）は、PubMLST（ref.5）で維持されている公に利用可能なプロファイルのデータベースと比較することによって同定される。新しいalleles、すなわちデータベースに存在しない新しいallelesは、配列内部データベースをテンプレートとして使用し、与えられた信頼スコアに基づき、comparative sequence reconstruction によって決定される。このマッピングベースの手法は、計算上の制限を克服し、メタゲノムアセンブリと比較して検出限界を減らすことできる。下流の標準MLSTパイプラインに続いて、サンプルを個別に処理してsequence types（ST）プロファイルを決定し、疫学分析のために組み合わせ、可能であれば公的データベースで利用可能な多数のプロファイリングされた株と統合する。このソフトウェアは、http://segatalab.cibio.unitn.it/tools/metamlst/のサポート資料で自由に入手できる。

Segata Lab - Computational Metagenomics

wiki

https://bitbucket.org/CibioCM/metamlst/wiki/Home

Segata Lab のツールは以前も何度か紹介させてもらっています。

PhyloPhlAn（紹介）
PanPhlAn（紹介）
MetaPhlAn2（紹介）

MetaMLST works by reconstructing the MLST loci-sequences using the closest reference from the publicly available datsets (PubMLST) and traces the most abundant strain of each species. Bitbucketより。

インストール

mac os10.13のanaconda3-5.0.0環境でテストした。

依存

Bowtie2 >= v. 2.2.6
Samtools >= 1.3.1
Biopython >= 1.63
Pysam >= 0.11.1
Python3

Bitbucket Github

git clone https://github.com/SegataLab/metamlst.git
cd metamlst/

> python3 metamlst.py

$ python3 metamlst.py

usage: metamlst.py [-h] [-o OUTPUT FOLDER] [-d DB PATH]

[--filter species1,species2...] [--penalty PENALTY]

[--minscore MINSCORE] [--max_xM XM] [--min_read_len LENGTH]

[--min_accuracy CONFIDENCE] [--debug] [--presorted]

[--quiet] [--version] [--nloci NLOCI] [--log] [-a]

[BAMFILE]

Reconstruct the MLST loci from a BAMFILE aligned to the reference MLST loci

positional arguments:

BAMFILE BowTie2 BAM file containing the alignments (default:

None)

optional arguments:

-h, --help show this help message and exit

-o OUTPUT FOLDER Output Folder (default: ./out) (default: ./out)

-d DB PATH, --database DB PATH

Specify a different MetaMLST-Database. If unset, use

the default Database. You can create a custom DB with

metaMLST-index.py) (default: /Users/kamisakakazuma/loc

al/metamlst/metamlst_databases/metamlstDB_2018.db)

--filter species1,species2...

Filter for specific set of organisms only (METAMLST-

KEYs, comma separated. Use metaMLST-index.py

--listspecies to get MLST keys) (default: None)

--penalty PENALTY MetaMLST penaty for under-represented alleles

(default: 100)

--minscore MINSCORE Minimum alignment score for each alignment to be

considered valid (default: 80)

--max_xM XM Maximum SNPs rate for each alignment to be considered

valid (BowTie2s XM value) (default: 5)

--min_read_len LENGTH

Minimum BowTie2 alignment length (default: 50)

--min_accuracy CONFIDENCE

Minimum threshold on Confidence score (percentage) to

pass the reconstruction step (default: 0.9)

--debug Debug Mode (default: False)

--presorted The input BAM file is sorted and indexed with

samtools. If set, MetaMLST skips this step (default:

False)

--quiet Suppress text output (default: False)

--version Prints version informations (default: False)

--nloci NLOCI Do not discard samples where at least NLOCI (percent)

are detected. This can lead to imperfect MLST typing

(default: 100)

--log generate logfiles (default: False)

-a Write known sequences (default: False)

>python3 metamlst-merge.py -h

$ python3 metamlst-merge.py -h

usage: metamlst-merge.py [-h] [-d DB PATH] [--filter species1,species2...]

[-z ED] [--meta METADATA_PATH] [--idField IDFIELD]

[--outseqformat {A,A+,B,B+,C,C+}]

[-j subjectID,diet,age...] [--jgroup] [--version]

[folder]

Detects the MLST profiles from a collection of intermediate files from MetaMLST.py

positional arguments:

folder Path to the folder containing .nfo MetaMLST.py files

optional arguments:

-h, --help show this help message and exit

-d DB PATH, --database DB PATH

Specify a different MetaMLST-Database. If unset, use the default Database. You can create a custom DB with metaMLST-index.py)

--filter species1,species2...

Filter for specific set of organisms only (METAMLST-KEYs, comma separated. Use metaMLST-index.py --listspecies to get MLST keys)

-z ED Maximum Edit Distance from the closest reference to call a new MLST allele. Default: 5

--meta METADATA_PATH Metadata file (CSV)

--idField IDFIELD Field number pointing to the 'sampleID' value in the metadata file

--outseqformat {A,A+,B,B+,C,C+}

A : Concatenated Fasta (Only Detected STs)

A+ : Concatenated Fasta (All STs)

B : Single loci (Only New Loci)

B+ : Single loci (All loci)

C : CSV STs Table [default]

-j subjectID,diet,age...

Embed a LIST of metadata in the the output sequences (A or A+ outseqformat modes). Requires a comma separated list of field names from the metadata file specified with --meta

--jgroup Group the output sequences (A or A+ outseqformat modes) by ST, rather than by sample. Requires -j

--version Prints version informations

実行方法

１、データベースの準備

初回のみデータベースのダウンロードとビルドを行う必要がある。データベースが見つからなければラン時に再構築される。

#オーサーが準備した2018更新データがダウンロードされる。
python3 metamlst-index.py

#indexing
python3 metamlst-index.py -i bowtie_index

または下記からダウンロードする。

https://bitbucket.org/CibioCM/metamlst/downloads/

f:id:kazumaxneo:20181027103345p:plain

a$ metamlst-index.py -i bowtie_index

-bash: metamlst-index.py: command not found

kamisakBookpuro:metamlst kamisakakazuma$ python3 metamlst-index.py -i bowtie_index

Database /Users/kamisakakazuma/local/metamlst/metamlst_databases/metamlstDB_2018.db contains:

133 organisms

923 total loci

115566 total alleles (~54.71 Megabases)

13928 total profiles

COLLECTING DATA... [ - ... - ]

BUILDING INDEX [ - ... - ]

BUILDING INDEX [ - DONE - ]

完了。以後は作成した"bowtie_index"を使う。

２、データベースへのマッピング

example/4_single_sample_multiple_species/の２サンプルを使う。

#サンプル1
bowtie2 --threads 4 --very-sensitive-local -a --no-unal -x bowtie_index \
metamlst_examples/4_two_samples_with_metadata/SRS015937_epidermidis.fastq \
| samtools view -bS - > SRS015937_epidermidis.bam

#サンプル2
bowtie2 --threads 4 --very-sensitive-local -a --no-unal -x bowtie_index \
metamlst_examples/4_two_samples_with_metadata/SRS013261_epidermidis.fastq \
| samtools view -bS - > SRS013261_epidermidis.bam

ペアエンドは-1と-2フラグを立てて指定する。並列処理数は--threadsで指定する。bamが出力される。

３、データベースへのアライメント結果からベストマッチの配列を決め、MLST alleleを再構成する

#1
metamlst.py SRS015937_epidermidis.bam -o ./out/

#2
metamlst.py SRS013261_epidermidis.bam -o ./out/

$ python3 metamlst.py SRS013261_epidermidis.bam -o ./out/

--------------------------- SRS013261_epidermidis ----------------------------

sepidermidis Detected Loci: arcC, aroE, gtr, mutS, pyrR, tpiA, yqiL

Closest allele identification [ - ... - ]

Locus Avg. Coverage Score Hits Reference Allele(s)

arcC 33.51 183.3 2990 8

aroE 24.02 179.0 1656 7

gtr 26.63 177.9 1863 12

mutS 31.48 178.2 1613 4

pyrR 25.38 183.8 1607 12

tpiA 28.89 180.1 1637 2

yqiL 34.91 172.2 2531 2

Building Consensus Sequences [ - ... - ]

Locus Ref. Length Ns SNPs Confidence Notes

arcC 8 465 0 0 100.0 % --

aroE 7 420 0 0 100.0 % --

gtr 12 438 0 0 100.0 % --

mutS 4 412 0 0 100.0 % --

pyrR 12 428 0 0 100.0 % --

tpiA 2 424 0 0 100.0 % --

yqiL 2 416 0 0 100.0 % --

Reconstruction Successful [ - WRITE - ]

pacnes Detected Loci: CAMP2, aroE, atpD, gmk, guaA, lepA, sodA, tly

Closest allele identification [ - ... - ]

Locus Avg. Coverage Score Hits Reference Allele(s)

CAMP2 101.19 184.6 864 6

aroE 25.29 176.8 106 1

atpD 28.22 174.0 129 1

gmk 17.56 175.1 71 1

guaA 26.13 179.4 130 3

lepA 46.12 187.6 226 1

sodA 30.4 177.7 138 1

tly 36.64 184.3 297 8

Building Consensus Sequences [ - ... - ]

Locus Ref. Length Ns SNPs Confidence Notes

CAMP2 6 804 0 0 100.0 % --

aroE 1 424 0 0 100.0 % --

atpD 1 453 0 0 100.0 % --

gmk 1 400 0 0 100.0 % --

guaA 3 493 0 0 100.0 % --

lepA 1 452 0 0 100.0 % --

sodA 1 450 0 0 100.0 % --

tly 8 777 0 0 100.0 % --

Reconstruction Successful [ - WRITE - ]

saureus Detected Loci: tpi, yqiL

Missing Loci : arcC, aroE, glpF, gmk, pta

szooepidemicus Detected Loci: tpi

Missing Loci : arcC, nrdE, proS, spi, tdk, yqiL

sbsec Detected Loci: tpi

Missing Loci : ddl, gki, glnA, mutS, mutS2, pheS, proS, pyrE, thrS

kpneumoniae Detected Loci: pgi

Missing Loci : gapA, infB, mdh, phoE, rpoB, tonB

koxytoca Detected Loci: pgi

Missing Loci : gapA, infB, mdh, phoE, rpoB, tonB

mabscessus Detected Loci: Mab-rpoB

Missing Loci : Mab-argH, Mab-cya, Mab-gnd, Mab-murC, Mab-pta, Mab-purH

cdiphtheriae Detected Loci: atpA, rpoB

Missing Loci : dnaE, dnaK, fusA, leuA, odhA

pfluorescens Detected Loci: rpoB

Missing Loci : glnS, gyrB, ileS, nuoD, recA, rpoD

streptomyces Detected Loci: atpD

Missing Loci : 16S, gyrB, recA, rpoB, trpB

soralis Detected Loci: ddl

Missing Loci : aroE, gdh, gki, hexB, recP, xpt

spneumoniae Detected Loci: ddl, gki

Missing Loci : aroE, gdh, recP, spi, xpt

bcc Detected Loci: lepA

Missing Loci : atpD, gltB, gyrB, phaC, recA, trpB

mcaseolyticus Detected Loci: tuf

Missing Loci : ack, cpn60, fdh, pta, purA, sar

arcobacter Detected Loci: aspA

Missing Loci : atpA, glnA, gltA, glyA, pgm, tkt

ecoli Detected Loci: adk, fumC

Missing Loci : gyrB, icd, mdh, purA, recA

mcanis Detected Loci: tuf

Missing Loci : ack, cpn60, fdh, pta, purA, sar

aeromonas Detected Loci: groL

Missing Loci : gltA, gyrB, metG, ppsA, recA

neisseria Detected Loci: fumC

Missing Loci : abcZ, adk, aroE, gdh, pdhC, pgm

smaltophilia Detected Loci: recA

Missing Loci : atpD, gapA, guaA, mutM, nuoD, ppsA

tenacibaculum Detected Loci: atpA

Missing Loci : dnaK, glyA, gyrB, infB, rlmN, tgt

clari Detected Loci: Cla-atpA

Missing Loci : Cla-adk, Cla-glnA, Cla-glyA, Cla-pgi, Cla-pgm, Cla-tkt

bpseudomallei Detected Loci: gltB

Missing Loci : ace, gmhD, lepA, lipA, narK, ndh

efaecalis Detected Loci: gyd

Missing Loci : aroE, gdh, gki, pstS, xpt, yqiL

efaecium Detected Loci: gyd

Missing Loci : adk, atpA, ddl, gdh, pstS, purK

senterica Detected Loci: purE

Missing Loci : aroC, dnaN, hemD, hisD, sucA, thrA

spseudintermedius Detected Loci: tuf

Missing Loci : ack, cpn60, fdh, pta, purA, sar

[ - Completed - ]

検出されたlociとmissingのlociがそれぞれ表示される。

４、複数サンプルの解析では、次のステップにメタデータファイルを読み込む必要があるので、ここで準備する。ここでは公式に習いsample_metadata.txtといファイル名にした。書式はexample5のメタデータファイルを真似した。

sampleID metadata_field_1 dataset_name source

SRS013261 some_data1 hmp source1

SRS015937 some_data1 hmp source1

５、strain typing

metamlst-merge.py --meta sample_metadata.txt ./out/

$ metamlst-merge.py ./out/

+------------------------------------------------------------------------------+

| Staphylococcus epidermidis |

+------------------------------------------------------------------------------+

KNOWN MLST profiles found:

ST arcC aroE gtr mutS pyrR tpiA yqiL Hits

19 8 7 12 4 12 2 2 1

NEW MLST profiles found:

ST arcC aroE gtr mutS pyrR tpiA yqiL Hits

100001 1 1 1 2 2 1 13 1

REJECTED NEW MLST profiles, as they have > SNPs than max-threshold (-z 5)

ST arcC aroE gtr mutS pyrR tpiA yqiL Hits

Outputing results [ - ... - ]

+------------------------------------------------------------------------------+

| Propionibacterium acnes |

+------------------------------------------------------------------------------+

KNOWN MLST profiles found:

ST CAMP2 aroE atpD gmk guaA lepA sodA tly Hits

4 6 1 1 1 3 1 1 8 1

NEW MLST profiles found:

ST CAMP2 aroE atpD gmk guaA lepA sodA tly Hits

REJECTED NEW MLST profiles, as they have > SNPs than max-threshold (-z 5)

ST CAMP2 aroE atpD gmk guaA lepA sodA tly Hits

Outputing results [ - ... - ]

Colour Legend:

--------------------------------------------------------------------------------

Alleles: [Known] [NEW] [NEW-RECURRING]

Profiles: [Known] [NEW] [NEW*]

New* profiles are composed by Known and Recurring alleles only

--------------------------------------------------------------------------------

Completed! Have a nice day.

出力ディレクトリにmergedディレクトリができる。

f:id:kazumaxneo:20181027134700p:plain

解説: Githubより

MetaMLST Report File: Contains the aggregate analysis for all the samples.

MetaMLST ST File : Contains the new S. epidermidis ST table after the analys (all the known profiles plus the new profiles detected in the samples).

ツールにPATHが通っていれば、他のexampleもそれぞれのサブディレクトリのtest.shを叩くことで実行できます。wikiの他のexampleには、MetaMLSTで解析後、タイピング結果を元に系統樹やMinimum Spanning Trees （wiki）を描く流れにも簡単に触れられています。

感想

メタゲノムアセンブリのbinningツールは、データによってはspeciesレベルのbinningを可能にしますが、ultra deep シーケンシングしてもstrainレベルの分類はまだ困難なままです（論文 Fig.2d Critical Assessment of Metagenome Interpretation—a benchmark of metagenomics software）。これは、メタゲノムのショートリードシーケンシングデータからde novoアセンブリを行っても、よく似たゲノムはまとめられてしまうためです。研究の目的によっては、MLSTのようなstrainレベルの分解能を持った手法も引き続き重要になってきます。

引用
MetaMLST: multi-locus strain-level bacterial typing from metagenomic samples
Zolfo M, Tett A, Jousson O, Donati C, Segata N

Nucleic Acids Res. 2017 Jan 25;45(2):e7

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

QuasiRecomb

mixed sampleの多様性を見積もる ShoRAH

マッピングツール segemehl

ハプロタイプベースのバリアントコーラー octopus

シーケンシングデータのハプロタイプを可視化し、リードを分類する HapFlow

複数のアセンブラとk-merを使ったTranscriptome 自動アセンブリワークフロー Oyster River Protocol

メタゲノムシーケンシングデータからMLST タイピングを行う MetaMLST