2020-04-29

（メタゲノム）ONTのロングリードをアセンブリしてコンプリートMAGsを生成する lathe

2020 4/29 実行手順修正

　細菌および古細菌での完全なメタゲノムアセンブリゲノム（MAG）の新規生成は、マイクロバイオーム研究における長年の目標である。既存のメタゲノムシーケンスおよびアセンブリ法では通常、完成した細菌ゲノムシーケンスが得られないため、類似のコンティグをグループ化または「ビニング」することでゲノムドラフトが形成される。このアプローチは、細菌ゲノムの膨大なコレクションを生み出し、微生物の世界に対する認識を大幅に拡大した。

　ビニングの品質は、主に基礎となるアセンブリのサイズと連続性に依存する。アセンブリの連続性が高まると、各ゲノムを形成するためにグループ化する必要があるコンティグが少なくなり、より大きくなるため、ゲノムビニングの感度と特異性が向上する。リードクラウドシーケンスを含むシーケンスおよびアセンブリテクノロジーの進歩により、MAGの品質は向上している。しかし、繰り返しシーケンスを正しく配置する能力には限界がある。

　repetititve elementsのサイズは、数十塩基対から数百キロベースまでの範囲に及ぶ。ロングリードは、ミニチュア逆方向反復配列転移因子、トランスポゾン、遺伝子複製、プロファージ配列などの一般的な反復因子全体に及ぶことがある。最近、ナノポアおよびPacBioロングリードアセンブリが腸内細菌叢およびその他の微生物叢に適用された。しかし、腸内細菌叢を分析するためのロングリードの適用は、便から高分子量（HMW）DNAを抽出する効率的な方法の欠如により妨げられてきた。標準的なビードビーティングは広範囲のせん断をもたらす可能性があり、固相可逆固定化（SPRI）ビードの「クリーンアップ」ステップは数百塩基対の低いDNAフラグメントを除去するが、多くの場合、バクテリア全体をスキャホールドするのに十分な大きさのDNAフラグメントを濃縮できない。穏やかなビードビーティングはせん断を減らすことができるが、溶解が困難な生物からDNAを抽出できない場合がある。したがって、ゲノムアセンブリの限界を克服するために、グラム陽性菌とグラム陰性菌の両方から反復エレメントにまたがるDNAの長い断片を抽出する方法が必要である。

　DNA抽出およびゲノムアセンブリのプロトコルを含む、糞便サンプルのナノポアシーケンスのワークフローを示す（論文補足図1）。本DNA抽出プロトコルは、培養バクテリアの抽出方法から適応されており、溶解酵素のカクテルによる細胞壁の酵素分解、フェノール-クロロホルム抽出、その後のRNAse AおよびプロテイナーゼK消化、重力カラム精製、およびSPRIサイズ選択を含む。このアプローチにより、わずか300mgの糞便からロングリードシーケンシングに適したマイクログラム量の純粋なHMW DNAが生成される。このバイオインフォマティクスワークフローであるLatheは、最近報告されたOPERA-MSなどのハイブリッドアセンブリメソッドではなく、ロングリードアセンブリベースのアプローチを使用している。ナノポアまたはPacBioテクノロジーのいずれかによって、入力されたロングリードデータを生成できる。 Latheは、ベースコーリング、ロングリードアセンブリ、およびポリッシングの既存の手順と、ミスアセンブリ検出およびゲノム環状化（メソッド）の洗練されたアプローチを組み合わせている。

インストール

Github

#condaでsnakemakeの仮想環境を作り、そこにlatheを導入
conda create -n snakemake -y
conda activate snakemake
#snakemake (bioconda)(>=5.4.3*1）
conda install -c bioconda -y snakemake
#flyeを使うならflyeを挿入
conda install -c bioconda -y flye
#またはcanuアセンブラ
conda install -c conda-forge -c bioconda Canu=1.8

git clone https://github.com/elimoss/lathe.git
cd lathe/
git pull

ONTのHPから guppyをダウンロード、パスを通す。

実行方法

configdファイルを準備する。

> cat lathe/config.yaml

$ cat config.yaml

sample_name: 'sample'

#data input

fast5_directory: '/absolute/path/to/fast5/data/' #no relative paths!

flowcell: 'FLO-MIN106'

kit: 'SQK-LSK109'

#workflow steps to perform

assembler: 'flye' #or canu

min_contig_size: 0 #remove contigs smaller than this from the assembly (can speed up polishing but potentially hurt genome completeness)

skip_circularization: False

skip_polishing: False

polish_both: False #should the input to short read polishing be the output of long read polishing?

short_reads: '' #Comma-separated forward and reverse optionally gzipped fastq files. Optional. If no reads are given, the pilon step is skipped.

#the below options are all related to Canu. genome_size is used by Flye as well.

canu_args: 'cnsThreads=2 cnsMemory=32'

usegrid: True #should Canu use the grid?

grid_options: '--time=80:00:00 --account asbhatt'

genome_size: '100m,250m' #Estimated genome size. The default values work well for typical healthy human gut samples.

#A single value can be specified instead, which will perform only one assembly and bypass

#merging. This would be suitable for bacterial isolate data, small datasets or very simple

#metagenomes.

最低限、fast5のパス、推定ゲノムサイズは修正する。任意でアセンブラの種類（defaultはflye）、使用スレッドやメモリ使用量上限なども修正する。

実行方法

configファイルを指定する。/dataにconfigファイルがあり、fast5は/data/fast5にあるなら

cd /data
snakemake --use-singularity --singularity-args '--bind /data/,/scg/,/home/ ' -s Snakefile \
--configfile config.yaml --restart-times 0 --keep-going --latency-wait 30

ubuntu18.04にて、１から環境を構築する。goとsingularityの導入(1-2)、snkaemakeの導入(3)、lathe のレポジトリのclone(4)の順に進める。

#ubuntu18.04で実行。既にcondaコマンドが使えるものとする
#1 依存の導入
apt-get update && \
 sudo apt-get install -y build-essential \
 sudo libssl-dev uuid-dev libseccomp-dev \
 sudo pkg-config squashfs-tools cryptsetup

#2 GOの導入
export VERSION=1.13.7 OS=linux ARCH=amd64 
wget -O /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz https://dl.google.com/go/go${VERSION}.${OS}-${ARCH}.tar.gz && \
 tar -C /usr/local -xzf /tmp/go${VERSION}.${OS}-${ARCH}.tar.gz
##pathを通す 
echo 'export GOPATH=${HOME}/go' >> ~/.bashrc && \
 echo 'export PATH=/usr/local/go/bin:${PATH}:${GOPATH}/bin' >> ~/.bashrc && \
 source ~/.bashrc

#3 singularityのビルド
mkdir -p ${GOPATH}/src/github.com/sylabs && \
 cd ${GOPATH}/src/github.com/sylabs && \
 git clone https://github.com/sylabs/singularity.git && \
 cd singularity git checkout v3.5.3 ./mconfig && \
 cd ./builddir && \
 make -j && \
 sudo make install

#4 snakemakeの導入
conda create -n snakemake -y
conda activate snakemake
conda install -c bioconda -c conda-forge -y snakemake==3.5.4

手順はsingularityの導入マニュアルに準拠している。1-4に加えて、flye/canu、guppyを使えるようにしておく。

https://hub.docker.com/r/snakemake/snakemake/tags

まだベータ版だが、macでもSINGULARITY DESKTOP MACOS（https://sylabs.io/singularity-desktop-macos/）をインストールすればsingularityを利用できる。自己解凍の.dmgに従うだけで２−３分でインストール可能。インストール後、４を実行する。

引用

Complete, closed bacterial genomes from microbiomes using nanopore sequencing

Eli L. Moss, Dylan G. Maghini & Ami S. Bhatt
Nature Biotechnology (2020)

参考

DEICODE

2019 mSystems microbial diversity metagenome

　β多様性とは、生物群集間の分類学的または系統的構成の違いを表す生態学的な概念である。β多様性法は、多くのマイクロバイオーム統計解析パイプラインの主要なコンポーネントとなっている。これらの分析により、複雑な微生物群集の概要を把握し、微生物群集を差別化する環境要因を特定することができる。しかし、微生物生態学者がデータを分析するために利用できる距離メトリックは数十種類あり、それぞれの距離メトリックは特定のデータ特性を捉えるように調整されている。そのため、β多様性プロットは、選択した距離測定基準によって劇的に異なって見えることがあり、生データの解釈の違いが発生している(ref.1)。

　β多様性解析における主な交絡因子の1つは、マイクロバイオームデータセットがまばらであること（すなわち、ほとんどの微生物がほとんどのデータセットには存在しないこと）であり、これが序列プロットにおいてスパイクパターンやhorseshoeパターンを生じさせ、解析を複雑にすることが示されている（ref.2、3）。さらに、主成分分析（PCA）は、正規分布と線形関連変数の共通の仮定を持っており、生物学的データではしばしば違反している(ref.4,-7)。その結果、Jaccard指数のような分類群の有無のみを考慮する古典的な距離測定法や、Bray-Curtis symmetrized distanceのような相対的な豊富さの度合いを明示的に考慮する測定法が一般的に使用されている。UniFrac (ref.8)で示されたように、系統情報を取り入れることで、微生物のβ多様性の推定は大幅に改善された。しかし、 presence/absence methodでは、しばしばコミュニティ間の差が顕著に現れ、それは豊富さに基づいた方法では不明瞭である。これは逆説的に見えるかもしれないが、豊富さに基づいた手法はコミュニティに関するより多くの情報を統合しているからである。しかし、重要な担い手が豊富な種ではなく希少な種であったり、豊富な種が大きなランダム変動を示す場合には、豊富さの情報は系統的な指標を用いても結果を明確にするどころか不明瞭にしてしまう可能性がある(ref.9)。

　表現型とマイクロバイオーム全体の関連性を明らかにできないのは、微生物分類群の豊富さの相対的な変化を適切に考慮していない方法の症状かもしれない。この原理を実証するために、論文図1Aのシナリオを考えてみる。このシナリオでは、taxon1は他の2つのtaxonよりもはるかに少ないが、時間の経過とともに指数関数的に増加している。taxon2は個体数が多く、時間の経過とともに安定している。 taxon3もまた、豊富さは高いが、ランダムに変動している。第1群集と他の2つの時間点との間のユークリッド距離は非常に変動しており、taxon1の指数関数的な成長によって誘発される変化を捉えていない。ユークリッド距離のこの変動性は、主に豊富なtaxonのランダムな変動によって駆動される。

　ユークリッド距離とは対照的に、Aitchison距離（論文式2）のような組成的距離測定法は、そのような相対的な変化を適切に説明することができる(ref.10)。ここでは、Aitchison距離は対数倍の変化のみを考慮しており、高個体数のtaxonでの偏差が絶対的なスケールでは大きいが、相対的なスケールでは小さいという事実を反映している。100カウントと120カウントの差は20カウントであり、これは最初のtaxonの豊富さに比べれば大きいが、20％の増加に過ぎない。これに対して、最初の分類群では約2,000%の増加が見られ、結果として、Aitchison距離は低個体数種の変化も含めた大きな相対的な変化で駆動されている。

　サンプル間で大きな倍数変化を示す微生物は、Aitchison距離の計算においてより重くウエイトされることになる。しかし、この距離メトリックはゼロを扱うことができず、したがって、マイクロバイオーム研究を特徴づける疎なデータセットに適用することは困難である。マイクロバイオームデータにゼロを生じさせる可能性のあるプロセスは多くある。それは、アンダースサンプリングがあった可能性があり、シーケンシングデータ中に低頻度の微生物が検出されなかったことが考えられる。また、サンプルの不均一性により、環境中に存在しているにもかかわらず検出されなかった可能性もある。さらに、環境中に微生物が全く存在していない可能性もある。これらの可能性のあるすべてのプロセスを考慮すると、シーケンシングデータからこれらの異なるプロセスを区別することは実行可能ではない(ref.11, 12)。この問題を回避するために、本著者らは、行列補完を使用して疎なデータも明示的に扱うことができる新しい構成的距離メトリックを提案する。これは、すべてのゼロを欠損値として扱い、行列補完を用いてこの欠損データを扱うモデルを構築することによって行われる。

　行列補完は元々、疎なデータを扱うための自然な解決策として、ユーザー項目の評価を予測するためのレコメンドシステムのコンテキストで開発された(ref.13)。例えば、Netflixデータベースには、すべての映画ごとにすべての顧客を詳細に記述した行列があり、そのエントリは映画の評価を表している。しかし、各ユーザはNetflixで利用可能な映画のごく一部しか評価していないため、データベースの約1%しかゼロではない値を含んでいない。その結果、特定の顧客に特定の映画を推薦しようとするとき、モデルは顧客が提供した利用可能なレーティングに基づいて訓練される必要がある。マトリックス補完タスクは、この種のタスクを実行するための最先端の手法の一つとなっている。

　この論文では、シミュレーションベンチマークと2つのケーススタディを用いて、マトリクス補完を用いて疎なマイクロバイオームデータセットを前処理することの有用性を実証し、構成の順序付けを可能にし、サンプル間の違いの原因となる特徴に関する情報を保持することを示す。

Robust Aitchison PCA Beta Diversity with DEICODE

インストール

本体　Github

#conda
conda create -n deicode -y
conda activate deicode
conda install -c conda-forge deicode -y

#biomが入らなかったので追加導入した
conda install -c bioconda biom-format

> deicode --help

$ deicode --help

Usage: deicode [OPTIONS] COMMAND [ARGS]...

Options:

--help Show this message and exit.

Commands:

auto-rpca Runs RPCA with an rclr preprocessing step and auto-estimates...

rpca Runs RPCA with an rclr preprocessing step.

> deicode rpca --help

$ deicode rpca --help

Usage: deicode rpca [OPTIONS]

Runs RPCA with an rclr preprocessing step.

Options:

--in-biom TEXT Input table in biom format. [required]

--output-dir TEXT Location of output files. [required]

--n_components INTEGER The underlying low-rank structure. The input

can be an integer (suggested: 1 < rank < 10)

[minimum 2]. Note: as the rank increases the

runtime will increase dramatically.

[default: 3]

--min-sample-count INTEGER Minimum sum cutoff of sample across all

features. The value can be at minimum zero

and must be an whole integer. It is

suggested to be greater than or equal to

500. [default: 500]

--min-feature-count INTEGER Minimum sum cutoff of features across all

samples. The value can be at minimum zero

and must be an whole integer [default: 10]

--min-feature-frequency INTEGER

Minimum percentage of samples a feature must

appear with a value greater than zero. This

value can range from 0 to 100 with decimal

values allowed. [default: 0]

--max_iterations INTEGER The number of iterations to optimize the

solution (suggested to be below 100; beware

of overfitting) [minimum 1] [default: 5]

--help Show this message and exit.

> deicode auto-rpca

$ deicode auto-rpca -h

Usage: deicode auto-rpca [OPTIONS]

Try 'deicode auto-rpca --help' for help.

Error: no such option: -h

(deicode) kamisakakazumanoMac-mini:human-unmap kazu$ deicode auto-rpca --help

Usage: deicode auto-rpca [OPTIONS]

Runs RPCA with an rclr preprocessing step and auto-estimates the rank

(i.e. n-components parameter).

Options:

--in-biom TEXT Input table in biom format. [required]

--output-dir TEXT Location of output files. [required]

--min-sample-count INTEGER Minimum sum cutoff of sample across all

features. The value can be at minimum zero

and must be an whole integer. It is

suggested to be greater than or equal to

500. [default: 500]

--min-feature-count INTEGER Minimum sum cutoff of features across all

samples. The value can be at minimum zero

and must be an whole integer [default: 10]

--min-feature-frequency INTEGER

Minimum percentage of samples a feature must

appear with a value greater than zero. This

value can range from 0 to 100 with decimal

values allowed. [default: 0]

--max_iterations INTEGER The number of iterations to optimize the

solution (suggested to be below 100; beware

of overfitting) [minimum 1] [default: 5]

--help Show this message and exit.

実行方法

deicodeには2つのコマンド、 rpca と auto-rpca がある。auto-rpcaは行列の下位ランクを自動的に推定し、n_componentsパラメータの入力を必要としない。rpcaでは，n_componentsを明示的に設定する必要がある。コマンドの構造はQIIME2のコマンドに従っている。QIIME 2でのDEICODEの使用チュートリアル（リンク）を参照する。

QIIME 2上ではなく、単独で使用する。HUMAnN2などで得たbiom形式のファイル（生のカウントテーブル）を指定する。

deicode auto-rpca --in-biom input.biom --output-dir outdir

引用
A Novel Sparse Compositional Technique Reveals Microbial Perturbations

Martino C, Morton JT, Marotz CA, Thompson LR, Tripathi A, Knight R, Zengler K

mSystems. 2019 Feb 12;4(1). pii: e00016-19

シーケンシングリードの前処理を行う AUSPP

2019 Journal of Bioinformatics and Computational Biology trimming / preprocessing RNA seq small RNA ChIP-Seq

　ショートリードをリファレンスゲノム/配列にマッピングできるショートリードアライナーは多数あり、それらのほとんどはFASTQファイルを入力クエリファイルとして直接受け入れることができる。ただし、通常、生データは前処理する必要がある。さまざまな次世代シーケンス（NGS）テクノロジーによって生成された生データの前処理に特化したソフトウェアプログラムはほとんど存在しない。ここでは、NGSショートリードの前処理と自動マッピングのためのPerl スクリプトベースのパイプラインであるAUSPPを紹介する。このパイプラインには、品質管理、アダプターのトリミング、リードのcollapsing、structural RNAの除去、長さの選択、リードマッピング、および正規化されたwiggleファイルの作成が含まれる。生データからゲノムマッピングまでの処理を容易にするため、メタ分析前のステップの強力なツールとなる。最も重要なことは、AUSPPには多くのタイプのNGSデータ用のデフォルトの処理パイプライン設定があるため、ほとんどの場合、ユーザーは生データとゲノムを提供するだけで利用できることである。 AUSPPは移植可能で簡単にインストールでき、ソースコードはhttps://github.com/highlei/AUSPPから無料で入手できる。

インストール

ubuntu18.04LTS環境でテストした。

依存

Auspp was developed on linux (Ubuntu 14.04.5 LTS), and hasn't been tested on other OS platform.
Auspp was developed on perl version 5.14.
PATH executables one of the following short read aligners: bowtie and bowtie-build or bowtie2 and bowtie2-build or soap and 2bwt-builder or hisat2 and hisat2-build or bwa

本体　Github

git clone https://github.com/highlei/auspp.git
cd auspp/

#管理者権限で
perl MAKEFILE.pl -i /usr/local/bin/

> auspp

# auspp

==========================| /usr/local/bin/auspp start |=================================

Now = 2020-02-11 19:38:36

Version : 1.0

Author : Lei Gao <highlei@hotmail.com> or <leigao@szu.edu.cn>

Usage: /usr/local/bin/auspp -i fastq_file -x sampleID -M Modes {-D index | -G genome} [options]

Usage: /usr/local/bin/auspp -i fastq_file -x sampleID -M degradome [options]

-i <str> input the fastq file (Could be gzip'ed (extension: .gz)). eg: Col.fastq or Col_r1.fastq

-I <str> input the other mate if paired-end sequencing. eg: Col_r2.fastq

-x <str> input the sampleID for -i library. eg: Col

-D <str> reference sequence index: soap index or bwa or bowtie(2) or hisat2 index.

-G <str> reference sequence/genome in fasta format. (Required when -P soap and step 7.)

-M <str> Modes: presets for supported SEQ:

smallRNA same as -P soap -s 124567 -L "20-25;21;22;23;24;All";

mRNA same as -P hisat2 -s 1367 -L All;

ribo same as -P soap -s 12467 -L All;

chip same as -P bowtie -s 167 -L All;

snp same as -P baw mem -s 167 -L All;

pseudo same as -P soap -s 1267 -L All;

nucleosome same as -P bowtie2 -s 167 -L All;

degradome same as -s 12;

sRNAexample will run example

Customized settings by user:

-s <str> running step (eg: 1-7 or 1367):

1 quality control,2 trim,3 collapse,4 filter,5 length,6 mapping,7 GenomeBrowser

-P <str> align program: soap or "bwa aln" or "bwa mem" or bowtie(2) or hasat2 or tophat2. [soap]

-L <str> the read lenth range. eg: "20-25;21;22;23;24;All" [All]

Required by special step:

-R <str> r/t/sn/snoRNA or repeats or other database in fasta format for filter;

must be makeblastdb by blast+. Required when step 4 activated

-a <str> adaptor sequence. Required when step 2 activated. e.g.TGGAATTCTCGGG. [AGATCGGAAGA]

-A <str> adaptor sequence for mate if paired-end. e.g. AAAAAAAAAAA. [AGATCGGAAGA]

-f <str> gtf file for hisat2 or gff File for tophat2 if have. eg: TAIR10_GFF3_genes.gff.gtf

-e <str> where is the example/?(It locates in the directory of source code). Required when mode=sRNAexample. eg: yoursoft/auspp/example/

Defult settings are recommended:

-d <str> name of directory for store intermediate/output files. [fasta,trim,filter,map2gnm]

Normally, fasta/: output of 1st step; trim/: output of 2nd and 3rd step; filter/: output of 4th and 5th; map2gnm/: output of 6th step; map2gnm/bam2wigM/: output of 7th step

-S <str> the parameter set for soap or bwa samse or bowtie(2) or hasat2 or tophat2.

-T <str> parameter settings for trim_adaptor ["-l 9 -m 18"]

-Q <str> quality control. ["-q 20 -c 5"]

-C <str> copy number filter. e.g. "-c 5,10" to discard reads with copy>10 or copy<5. ["-c 1,"]

-h display this help

Example:

/usr/local/bin/auspp -i Col.fastq -x Col -M smallRNA -D tair10.Chr.fa.index

/usr/local/bin/auspp -i Col.fastq -x Col -M RNA -G tair10.Chr.fa

/usr/local/bin/auspp -i Col.fastq -x Col -M degradome

Documentation:

perldoc auspp

makeblastのバイナリがエラーを吐いたので修正した（*1）。

テストラン

cd example/
auspp -M sRNAexample

出力

f:id:kazumaxneo:20200211203511p:plain

test.sRNA.auspp.info

f:id:kazumaxneo:20200211203228p:plain

test.sRNA.auspp.info.pdf

f:id:kazumaxneo:20200211203027p:plain

test.sRNA.auspp.lenDist.pdf

f:id:kazumaxneo:20200211203058p:plain

実行方法

fastqとプリセット、リファレンスゲノムを指定する。

auspp -i pair_r1.fastq.gz -I pair_r2.fastq.gz -x Col -M mRNA -G tair10.Chr.fa

-i input the fastq file (Could be gzip'ed (extension: .gz)). eg: Col.fastq or Col_r1.fastq
-I input the other mate if paired-end sequencing. eg: Col_r2.fastq
-x input the sampleID for -i library. eg: Col
-G reference sequence/genome in fasta format. (Required when -P soap and step 7.)
-M Modes: presets for supported SEQ:
smallRNA same as -P soap -s 124567 -L "20-25;21;22;23;24;All";
mRNA same as -P hisat2 -s 1367 -L All;
ribo same as -P soap -s 12467 -L All;
chip same as -P bowtie -s 167 -L All;
snp same as -P baw mem -s 167 -L All;
pseudo same as -P soap -s 1267 -L All;
nucleosome same as -P bowtie2 -s 167 -L All;
degradome same as -s 12;
sRNAexample will run example

引用

AUSPP: A universal short-read pre-processing package

Gao L, Wu C, Liu L

J Bioinform Comput Biol. 2019 Dec;17(6):1950037

condaを使ってblastパッケージを導入した。

2020-04-26

GO annotation分析ウェブサービス WEGO2

2018 Nucleic Acids Research web tool GO enrichment analysis de novo transcriptome

2020 4/26 タイトル修正

　WEGO (Web Gene Ontology Annotation Plot) は、2006年に発表された、GO (Gene Ontology) のアノテーション結果を可視化、比較、プロットするためのシンプルで便利なツールである。ハイスループットシーケンシングの急速な発展とGOの普及により、WEGOは近年の利用者数や引用数の増加に伴い、バージョン2.0へのアップデートを行った。WEGOはGOのアノテーション結果を入力として使用する。GOの標準化されたDAG（Directed Acyclic Graph）構造化語彙システムに基づいて、各GO IDに対応する遺伝子の数を計算し、グラフィカルな形式で表示する。WEGO 2.0のアップデートでは、比較ゲノム解析のためのより効率的で最新のアプローチを提供することを目的として、以下の4つの点に重点を置いている。第一に、これまで3つに制限されていた入力ファイルの数が無制限になり、WEGOは複数のデータセットを解析できるようになった。また、このバージョンでは、ゲノム比較解析のベースラインとして採用できる9つのモデル種のリファレンスデータセットが追加された。さらに、解析プロセスでは、2つのサンプルごとではなく、複数のデータセットに対してカイ二乗検定が行われる。最後に、WEGO 2.0は、従来のWEGO ヒストグラムに加えて、GO termのソートされたP値を表示し、それらの有意差を示す追加の出力グラフを提供する。同時に、WEGO 2.0は全く新しいユーザーインターフェースを特徴としている。WEGOは、http://wego.genomics.org.cn で無料で利用できる。

Documentation

http://wego.genomics.org.cn/document

webサービス

http://wego.genomics.org.cn にアクセスする。

入力フォーマットはWEGOネイティブフォーマット、InterProScanテキストフォーマット、Raw/XML出力フォーマット、Gene Ontology Annotationフォーマット（GAF）の5種類対応している。シンプルなWEGOネイティブフォーマットは、1行に1つのアノテーションレコードを持つタブ区切りのテキストファイルである。最初のカラムは遺伝子名、その他のカラムはGOを記載する。遺伝子名は、標準的な遺伝子識別子、アクセッション番号、ユーザー定義の文字列のいずれでも構わない。

f:id:kazumaxneo:20200425215954p:plain

WEGO Native file。web docunemtより。

ここではDocunemtで提供されているdemoファイルを指定した。アノテーション情報ソースとしてリファレンスを指定する。ここにない場合、ユーザーがアノテーション情報を提供する事もできる。その場合はGAFフォーマットにまとめた、GO termリストとアノテーションの単一ファイルをアップロードする（詳細はDocunemt参照）。

f:id:kazumaxneo:20200426012240p:plain
demoファイルの場合はフォーマット"WEGO Native file"を選択。submitをクリック。

出力

f:id:kazumaxneo:20200426013212p:plain

GO tree - アップロードされたファイルに含まれるすべてのGOを階層化表示

f:id:kazumaxneo:20200426013331p:plain

それぞれの行のboxは階層表示されたGO termを表し、左から順に遺伝子番号、アップロードされたデータセットに対するGO termの遺伝子割合、各入力データのピアソンカイ二乗検定のp値（期待される遺伝子数が 5 未満だとMl表示）、GO ID、GO termの説明、そしてこのGOtermへの遺伝子リストのリンクとなる。遺伝子パーセンテージは、アップロードされたデータセットの全遺伝子数に対する、この用語のアノテーションを持つ遺伝子またはその用語の子ノードの数のパーセンテージ。スター記号はp値<0.05の有意な関係を持つGO termに付く。

graph表示。

f:id:kazumaxneo:20200426015105p:plain

引用

WEGO 2.0: a web tool for analyzing and plotting GO annotations, 2018 update
Jia Ye, Yong Zhang, Huihai Cui, Jiawei Liu, Yuqing Wu, Yun Cheng, Huixing Xu, Xingxin Huang, Shengting Li, An Zhou, Xiuqing Zhang, Lars Bolund, Qiang Chen, Jian Wang, Huanming Yang, Lin Fang, Chunmei Shi
Nucleic Acids Research, Volume 46, Issue W1, 2 July 2018, Pages W71–W75

2020-04-25

超高速な配列検索エンジン BIGSI

2019 Nature Biotechnology

2020 4/30 追記

　グローバルアーカイブには、未処理の細菌およびウイルスのゲノム配列データが指数関数的に増加している。これらのデータに配列検索用語を検索する機能があれば、基礎研究はもちろん、リアルタイムのゲノム疫学やサーベイランスなどのアプリケーションも容易になるだろう。しかし、これは現在の方法では不可能である。この問題を解決するために、微生物集団ゲノムに関する知識とウェブ検索のために考案された計算手法を組み合わせて、BItsliced Genomic Signature Index (BIGSI)という検索可能なデータ構造を作成した。その結果、従来の方法に比べて4桁も少ない保存量で、447,833の細菌およびウイルスの全ゲノム配列データセットからなる全世界コーパスのインデックスを作成した。BIGSIの検索機能を利用して、抗生物質耐性遺伝子MCR-1、MCR-2、MCR-3を迅速に検索し、2,827のプラスミドの宿主範囲を決定し、アーカイブされたデータセットの抗生物質耐性を定量化した。また、新規（未処理またはアセンブル）の配列データセットが堆積してくると、インデックスは増加していき、数百万件のデータセットにスケールアップすることができる。

Hey, @alexmsalmeida and @robdfinn have made a BIGSI search index of a representative of genomes from their 280k genomes from the human gut. Paper https://t.co/DaZNGwG1dW and search here :https://t.co/MQYAVOemDI
— Zamin Iqbal (@ZaminIqbal) September 24, 2019

Mash better than bigsi for finding similar genomes. I think it would be hard to use it to find a specific sequence/gene/plasmid . And both not good for protein-space search. Depends on use case
— Zamin Iqbal (@ZaminIqbal) May 27, 2018

We're building a BIGSI ( https://t.co/WbHf8gtjf1) search index of that (all bacterial, viral, meta genomic sequence) set up to update as data is submitted to the ncbi /ENA. But it's not ready yet
— Zamin Iqbal (@ZaminIqbal) May 27, 2018

このツイートも載せておきます。

BigSi was incredible already. For those who don't know it yet: imagine running a sequence similarity search in a database 10,000 times bigger than NR. https://t.co/ml9uJyt5G9
— Daniel Gautheret (@DGautheret) May 28, 2019

インストール

Github

導入手順は公式の手順を参照

https://bigsi.readme.io/docs/installation-on-mac

ここでは用意されているdockerイメージを使用する。

docker pull phelimb/bigsi

> docker run phelimb/bigsi bigsi --help

$ sudo docker run phelimb/bigsi bigsi --help

[sudo] password for kazu:

bigsi-v0.3.5

Available Commands:

- bloom

- build

- bulk_search

- delete

- insert

- merge

- search

- variant_search

> docker run phelimb/bigsi bigsi bloom --help

$ docker run phelimb/bigsi bigsi bloom --help

usage: bigsi-v0.3.5 bloom [-h] [-c CONFIG] ctx outfile

Creates a bloom filter from a sequence file or cortex graph.

(fastq,fasta,bam,ctx) e.g. index insert ERR1010211.ctx

positional arguments:

ctx

outfile

optional arguments:

-h, --help show this help message and exit

-c CONFIG, --config CONFIG

> docker run phelimb/bigsi bigsi build --help

$ docker run phelimb/bigsi bigsi build --help

usage: bigsi-v0.3.5 build [-h] [-b BLOOMFILTERS] [-s SAMPLES] [-f FROM_FILE]

[-c CONFIG]

optional arguments:

-h, --help show this help message and exit

-b BLOOMFILTERS, --bloomfilters BLOOMFILTERS

Multiple Values

-s SAMPLES, --samples SAMPLES

Multiple Values

-f FROM_FILE, --from_file FROM_FILE

Basic text / string value

-c CONFIG, --config CONFIG

Basic text / string value

> docker run phelimb/bigsi bigsi bulk_search --help

$ docker run phelimb/bigsi bigsi bulk_search --help

usage: bigsi-v0.3.5 bulk_search [-h] [-t THRESHOLD] [-c CONFIG] [-s]

[-f {json,csv}] [-st]

fasta

positional arguments:

fasta Basic text / string value

optional arguments:

-h, --help show this help message and exit

-t THRESHOLD, --threshold THRESHOLD

A float number

-c CONFIG, --config CONFIG

Basic text / string value

-s, --score Accepts a true or false value

-f {json,csv}, --format {json,csv}

Accepts one of the following values: (json|csv)

-st, --stream Accepts a true or false value

> docker run phelimb/bigsi bigsi variant_search --help

$ docker run phelimb/bigsi bigsi variant_search --help

usage: bigsi-v0.3.5 variant_search [-h] [-g GENE] [-ge GENBANK] [-c CONFIG]

[-f {json,csv}]

reference ref pos alt

positional arguments:

reference Basic text / string value

ref Basic text / string value

pos A whole number

alt Basic text / string value

optional arguments:

-h, --help show this help message and exit

-g GENE, --gene GENE Basic text / string value

-ge GENBANK, --genbank GENBANK

Basic text / string value

-c CONFIG, --config CONFIG

Basic text / string value

-f {json,csv}, --format {json,csv}

Accepts one of the following values: (json|csv)

> docker run phelimb/bigsi bigsi insert --help

$ docker run phelimb/bigsi bigsi insert --help

usage: bigsi-v0.3.5 insert [-h] config bloomfilter sample

Inserts a bloom filter into the graph e.g. bigsi insert ERR1010211.bloom

ERR1010211

positional arguments:

config Basic text / string value

bloomfilter

sample

optional arguments:

-h, --help show this help message and exit

> docker run phelimb/bigsi bigsi delete --help

$ docker run phelimb/bigsi bigsi delete --help

usage: bigsi-v0.3.5 delete [-h] [-c CONFIG]

optional arguments:

-h, --help show this help message and exit

-c CONFIG, --config CONFIG

Basic text / string value

> docker run phelimb/bigsi bigsi merge --help

$ docker run phelimb/bigsi bigsi merge --help

usage: bigsi-v0.3.5 merge [-h] config merge_config

positional arguments:

config Basic text / string value

merge_config Basic text / string value

optional arguments:

-h, --help show this help message and exit

demo

（実際に配列を検索可能）

BIGSI Demo

f:id:kazumaxneo:20191104034812p:plain

テスト配列検索結果

f:id:kazumaxneo:20191104035036p:plain

作成中。

引用

Ultrafast search of all deposited bacterial and viral genomic data

Phelim Bradley, Henk C. den Bakker, Eduardo P. C. Rocha, Gil McVean & Zamin Iqbal
Nature Biotechnology volume 37, pages152–159 (2019)

パスウェイデータベース間の共通性を探索、分析、キュレーションする ComPath

npj Systems Biology and Applications 2018 Pathway database KEGG pathway web tool

　パスウェイは生物系の解析や表現に広く利用されているが、明確な境界線がなく、多数のデータベースに分散しており、相互運用性がないため、それらの間の網羅性、一致性、不一致を評価することができない。本研究では、データベース間のパスウェイマッピングのキュレーションをサポートし、いくつかの新しい可視化を通じてパスウェイ知識の探索を促進するエコシステムであるComPathを紹介する。3つの主要なパスウェイデータベース間のマッピングをキュレーションし、パーキンソン病に焦点を当てたケーススタディを紹介する。ComPath のソースコードとリソースは https://github.com/ComPath から、ウェブアプリケーションは https://compath.scai.fraunhofer.de/ からアクセスできる。

webサービス

https://compath.scai.fraunhofer.de にアクセスする。

ComPath Overview - ComPathに搭載されているさまざまなパスウェイデータベースを検索

f:id:kazumaxneo:20200423233410p:plain

特定の遺伝子から、その遺伝子が関わるパスウェイを検索。

f:id:kazumaxneo:20200423233611p:plain

各データベースのパスウェイや遺伝子の分布を調べる。

f:id:kazumaxneo:20200423233730p:plain

5つのボタンのいずれかをクリックしてそのパスウェイデータベースのページに入ると、分布の詳細を調べることができる（下図）。

f:id:kazumaxneo:20200423233946p:plain

Pathway Similarity - 関連するパスウェイをリンクさせ、その相互作用や整合性を調べる。

f:id:kazumaxneo:20200423232320p:plain

Overlap Heatmap

KEGG

f:id:kazumaxneo:20200331192736p:plain

マウスホイールで拡大できる。

例えばRNA polymeraseはRNA polymerase自身以外にPyrimidine metabolism、Purine metabolism、Metabolic pathwayともオーバーラップしている。

f:id:kazumaxneo:20200423232508p:plain

Reactome

f:id:kazumaxneo:20200331192804p:plain

WikiPathways

f:id:kazumaxneo:20200331192815p:plain

Pathway Similarity Network

ネットワークビューでは、パスウェイのネットワークとして視覚化できる。比較するパスウェイデータベースのペアと類似度の最小閾値を選択してネットワークを描く。

f:id:kazumaxneo:20200423232811p:plain

wikipathwayとKEGG

f:id:kazumaxneo:20200423233126p:plain

Pathway Overlap - 異なるパスウェイ間の重なりを可視化

ユーザーが特定のパスウェイを指定して、異なるパスウェイデータベースでどの程度重複しているか調べることができる。

f:id:kazumaxneo:20200423234514p:plain

ベン図の文字をクリックすると構成遺伝子を確認できる。

f:id:kazumaxneo:20200423234942p:plain

Pathway Enrichment - 遺伝子セットを提供し、エンリッチされたパスウェイを分析

f:id:kazumaxneo:20200331200423p:plain

ここではexampleデータを指定した。

結果

。5つのパスウェイデータベースが順番にまとめられ、それぞれP-valueの低い順にパスウェイリストが並ぶ。

f:id:kazumaxneo:20200331200430p:plain

リスト興味のあるパスウェイを選択して、実行する分析の種類を選択する。

オーバーラップビューでは、選択したパスウェイ間の境界をベン図またはオイラー図で表示する。クラスタービュー」では、距離に基づいてクラスタ化されたパスウェイのインタラクティブなデンドログラムが表示される。"ネットワークビュー "では選択されたパスウェイの周辺のを表示し、パスウェイ間の相互作用を調べることができる。

f:id:kazumaxneo:20200331200707p:plain

静的なブログでは説明に限界があります。他の機能は実際にアクセスして確かめて下さい。

引用
ComPath: an ecosystem for exploring, analyzing, and curating mappings across pathway databases

Domingo-Fernández D, Hoyt CT, Bobis-Álvarez C, Marín-Llaó J, Hofmann-Apitius M

NPJ Syst Biol Appl. 2018 Dec 13;5:3

2020-04-23

（ヒトゲノム）インタラクティブな遺伝子の変異プロットを出力する G3viz

CRAN 2020 Bioinformatics tumor 結果の視覚化 (visualization) cohort population genomics human genome

　ロリポップダイアグラムは、ガンゲノミクスにおける遺伝子変異のトランスレーショナル効果を可視化し、探索するために広く用いられているグラフィカルな表現の一つである。しかし、使いやすい機能を備えたロリポップダイアグラムツールはまだ不足している。ここでは、研究者がウェブブラウザ上でロリポップダイアグラムを用いて遺伝子変異データを探索できるようにするRパッケージ、g3vizを紹介する。わずか数行のRコードで、ユーザーはデータの詳細をインタラクティブに可視化し、所見に注釈を付け、結果として得られた図を高品質の図にエクスポートすることができる。その有用性と使い勝手の良さから、g3vizは、バイオインフォマティクスのスキルやプログラミングの経験が豊富な研究者に広く利用されている。R パッケージは CRAN (http://cran.r-project.org/web/packages/g3viz) から MIT ライセンスで自由に入手できる。g3lollipop JavaScriptパッケージはGitHub (https://github.com/g3viz/g3lollipop.js)からMITライセンスで自由に利用できる。

Package ‘g3viz’

https://cran.r-project.org/web/packages/g3viz/g3viz.pdf

Tutorial

http://127.0.0.1:23058/library/g3viz/doc/chart_themes.html

インストール

macos10.14にてRstudioを使ってテストした。

依存

Depends R (>= 3.0.0)

本体　Github

#CRAN(link)
install.packages('g3viz', dependencies = TRUE)
library(g3viz)

help

> browseVignettes("g3viz")

実行方法

ここではGithubに用意されているテストデータ（g3viz/visual_test/tables/tp53-msk_impact_2017.tsv）を使う。

１、readMAF関数を使ってmutation.datデータフレームにバリアント情報を読み込む。

mutation.dat <- readMAF("/Users/kazuma/Downloads/g3viz-master/visual_test/tables/tp53-msk_impact_2017.tsv", sep="\t")

自分のデータを使うなら,setwd("<PATH>") で作業ディレクトリに移動して実行するか、ファイルをフルパスで記載する。

読み込んでいるデータは次のような形式になる。HUGOのgene symbol、アミノ酸変異とその検体数、変異の種類になる。

f:id:kazumaxneo:20200422233047p:plain

２、g3Lollipopを使ってインタラクティブな図を出力する。まずはテーマ"simple"を使ってみる。

g3Lollipop(mutation.dat,
 plot.options =
  g3Lollipop.theme(theme.name = "simple",
      title.text = "test",
      y.axis.label = "y-label",
      legend.title = "legend-title"),
 btn.style = "blue",
 gene.symbol = "TP53")

mutation.dat Input genomic mutation data frame
plot.options g3lollipop diagram options in list format. Check g3Lollipop.options
gene.symbol HGNC primary gene symbol
btn.style button style, including browser default button style, and two built-in styles, blue or gray. Default NA, indicating browser default.
theme.name me name, including default, cbioportal, nature, nature2, dark, blue, ggplot2, and simple. Default default.

遺伝子名はHGNCのgene symbolに対応している。

出力。 gene symbolが記載されている領域が対象になる。

f:id:kazumaxneo:20200422230611p:plain

カーソルを合わせると、多型/変異のタイプや内訳を確認できる。

f:id:kazumaxneo:20200422231002p:plain

マウスホイールで拡大してアミノ酸変化の注釈をつけた。

f:id:kazumaxneo:20200422231232p:plain

結果はwebページとして出力できる。いつでもブラウザに読み込んでインタラクティブに操作できるようになっている。

f:id:kazumaxneo:20200422224358p:plain

Rstudioの場合。

テーマcBioPortal

g3Lollipop(mutation.dat,
 plot.options =
  g3Lollipop.theme(theme.name = "cbioportal",
      title.text = "test2",
      y.axis.label = "y-label",
      legend.title = "legend-title"),
 btn.style = "blue",
 gene.symbol = "TP53")

f:id:kazumaxneo:20200422231717p:plain

テーマggplot2

g3Lollipop(mutation.dat,
 plot.options =
  g3Lollipop.theme(theme.name = "ggplot2",
      title.text = "test3",
      y.axis.label = "y-label",
      legend.title = "legend-title"),
 btn.style = "blue",
 gene.symbol = "TP53")

f:id:kazumaxneo:20200422231944p:plain

少し拡大してキャプチャした。

他の例はチュートリアルを参照して下さい。

引用

G3viz: an R package to interactively visualize genetic mutation data using a lollipop-diagram
Xin Guo, Bo Zhang, Wenqi Zeng, Shuting Zhao, Dongliang Ge
Bioinformatics, Volume 36, Issue 3, 1 February 2020, Pages 928–929

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

（メタゲノム）ONTのロングリードをアセンブリしてコンプリートMAGsを生成する lathe

DEICODE

シーケンシングリードの前処理を行う AUSPP

GO annotation分析ウェブサービス WEGO2

超高速な配列検索エンジン BIGSI

パスウェイデータベース間の共通性を探索、分析、キュレーションする ComPath

（ヒトゲノム）インタラクティブな遺伝子の変異プロットを出力する G3viz