2019-06-19

(Omics向け) 従来のベン図表現を拡張する DiVenn

2019 Frontiers in Genetics web tool GO term venn diagram KEGG pathway RNA seq proteome

　ハイスループットデータ技術の進歩により、詳細な分析なしに膨大な量の遺伝子発現データが生成されてきた。例えば、INVEX (Xia et al., 2013)、ExAtlas (Sharov et al., 2015)、そしてWebGIVI (Sun et al., 2017)などのいくつかのウェブベースの視覚化ツールは、発現データ分析において首尾よく使用された。ただし、3つ以上の実験を体系的に比較することは依然として困難である。統合されたバイオインフォマティクスデータベースとともに複数の実験のデータを視覚化することは特に困難になる。ベン図は、複数の実験間で遺伝子リストを比較するために広く使用されている。 GeneVenn（Pirooznia et al、2007）、Venny（Oliveros、2015）、およびInteractiVenn（Heberle et al、2015）は、現在使用されているWebベースツールの例である。しかしながら、それらは重大な制限を有する：（１）gene IDを遺伝子機能に結び付けることはできない。Biological pathwayおよび遺伝子オントロジー（GO）などのバイオインフォマティクスデータベースを統合することはできない。（2）グラフに遺伝子発現量を表示することはできない。（3）ベン図で興味を引く可能性がある共通または固有の遺伝子は、遺伝子発現値および遺伝子機能では抽出できない。

　本著者らは、上記の制限を克服し、生物学者が彼らの遺伝子リストを視覚化し、biological pathwayおよびGOデータベースからの統合された知識に基づいて生物学的仮説を生成するのを助けるインタラクティブなwebベースのツールを提供する。このツールを使用すると、研究者は遺伝子リストを比較および視覚化するだけでなく、関心のある遺伝子機能に基づいてグラフ内の遺伝子ノードをサブセット化または強調表示することもできる。このツールはユーザーフレンドリーで、force-directed focus packageを使用して大量の入力データを処理できる（ref.1）。ユーザーは結果/情報テーブルから重要な遺伝子情報を抽出およびダウンロードし、publication品質の高解像度画像をダウンロードできる。
DiVennは、PHP、JavaScript、R、D3.js（Bostock et al、2011）、およびMySQLデータベースを使用して開発された。データの視覚化のフローチャートを論文図1に示す。DiVennは現在、2種類の入力データを受け入れる。（1）2列のタブ区切りのカスタムデータ。例えば、 gene IDsおよび対応するpathwayデータ、転写因子およびそれらによって調節される下流の遺伝子、ならびにマイクロRNAおよび対応する標的遺伝子などである。 2列目は「1」または「2」でなければならない。（２）遺伝子発現データ。 1列目は gene IDs、2列目は遺伝子制御値である。遺伝子調節値は、differentially expressed (DE) 遺伝子から得られるべきである。使用者は、倍数変化のカットオフ値（例えば、２倍変化）を選択してそれらのDE遺伝子を定義できる。この遺伝子調節値を単純化するために、本著者らは、使用者が自身の倍数変化のカットオフ値に基づいて、上方調節遺伝子を表すために「１」を、下方調節遺伝子を表すために「２」を用いることを要求する。ユーザーが自分の遺伝子をKEGG pathway（Kanehisa et al、2019）またはGOデータベースにリンクする必要がある場合、DiVennではKEGG pathwayおよびGOデータベースが利用可能な14のモデル生物種がサポートされている。現在、3種類のgene IDs、すなわち KEGG gene IDs、Uniprotgene ID（UniProt、2008）、およびNCBI gene ID（Benson et al、2018）がpathway解析に受け入れられている。DiVennによるd分析には、すべてのagriGO（Du et al、2010; Tian et al、2017）がサポートするIDが受け入れられる。 DiVennでは、ユーザーはネットワークグラフ内の最大8つの遺伝子リストを比較して視覚化することができる。（以下略）

チュートリアル

https://github.com/noble-research-institute/DiVenn

DiVenn Tutorial

ブラウザ

All modern browsers, such as Safari, Google Chrome, and IE are supported. The recommended web browser is Chrome.

使い方

DiVennの解析前に、（ユーザー指定のcut-off条件で）それぞれのトランスクリプトーム解析データから発現変動遺伝子セットが抽出されていないといけない。

http://divenn.noble.org にアクセスする。

オーサーが公開しているYou tube動画と同じ流れで説明する。

生物種を選択する。現在１４のモデル生物種に対応している。

f:id:kazumaxneo:20190619025013p:plain

ここではArabidospsisを選択。

実験データ数を選択する。ここでは実験データ３つのDEGセットを比較するとして３を選択。

f:id:kazumaxneo:20190616204821p:plain

"Load Sample Data"ボタンをクリック。

f:id:kazumaxneo:20190616204855p:plain

それぞれのウィンドウ内にsampleデータが読み込まれた。データの１列目がgene IDs、２列目が発現パターンになる。発現パターンは、ユーザーが指定の条件でフィルタリングして得たDEGのセットの発現パターンで（ユーザー指定のcut-off条件でDEGは抽出済みのはず）、数値の1か2のみ受け付ける。1は up-regulated、2は down-regulatedを表す。

視覚化後、オプションのpathway解析まで行うためには、KEGG、Uniprot (UniProt, 2008)、 NCBI (Benson, et al., 2018)のgene IDを使う必要がある。

表示される実験名を変えたいならウィンドウに手打ちする。Exp1からexperiment1に変更した。

f:id:kazumaxneo:20190619015042p:plain

--補足--

実際の解析時にはデータを含むテキストファイルをuploadする。

f:id:kazumaxneo:20190616205116p:plain

-------------

一番下のsubmitボタンを押し、視覚化を実行する。

f:id:kazumaxneo:20190616210529p:plain

視覚化結果

f:id:kazumaxneo:20190616210559p:plain

色は以下のような情報を表す。

f:id:kazumaxneo:20190616210738p:plain

従来のベン図では実験間で共通/固有のDEG数しか表現されない。一方、見てわかるようにこのインタラクティブなグラフは、実験間で共通/固有のDEGに加え、各々のDEGが誘導か抑制なのか、クリックすることで後述する遺伝子名、gene IDs、GO term、KEGG pathway情報まで表現できる。さらに、GO enrichment 解析やpathway enrichment解析を視覚的に確認しながら進めることが可能（後半で説明しているが、リストのGO termやpahway情報とフィッシャー検定結果がその場で表にまとめられるので、興味あるリストのみ選択して再描画できる）。

個人的な意見として、共通/固有のDEG数は一般的なベン図の方が判断しやすいと思う。

図はマウスのホイール上下やTrackpadの上下で拡大縮小したり、nodeをドラッグして操作できる。もとに戻したい時は十字アイコンの真ん中のRESETをクリックする。

f:id:kazumaxneo:20190616212158p:plain

nodeの上で右クリックすることで、gene IDsの表示/非表示を選択可能。

f:id:kazumaxneo:20190616210702p:plain

一括表示。

f:id:kazumaxneo:20190619022517p:plain

ノードの上で右クリックしてGene detailをクリックすると、GO termやGO IDなどの情報が表示される。

f:id:kazumaxneo:20190619022610p:plain

KEGG、Uniprot、 NCBI のgene IDsも表示されている。

一括表示/非表示は右上のメニューからも実行できる。

f:id:kazumaxneo:20190616211714p:plain

メニューではnodeの色も変更可能。

f:id:kazumaxneo:20190616211652p:plain

表示されている遺伝子群について、エンリッチされたKEGG pathwayとGO termを調べることができる。メニューのShow Gene DetailsからPathwayかGOを選択。

f:id:kazumaxneo:20190616211844p:plain

pahwayを選択した。しばらく待ってから下の方にスクロールすると表が出現している。

f:id:kazumaxneo:20190616212529p:plain

表のgene IDをクリックすると、ノードを右クリックしたときと同様詳細が表示できる。

f:id:kazumaxneo:20190616214156p:plain

pathway列にpathwayの情報がある場合、文字をクリックするとそのKEGG pathwayに飛ぶ。

f:id:kazumaxneo:20190616213806p:plain

KEGG pathwayではその生物種でアサインされているものが緑色で表示される。基礎的な話になるが、上の図では、この生物種（ここではシロイヌナズナ）でKEGG pathwayにアサインされているαリノレン酸代謝系酵素遺伝子が緑色になっている（ユーザーがこのDiVennで使用したDEGリストにあった遺伝子では無い）。

ここでは特にエンリッチされていそうなpathwayを調べたい。表の一番上のpathwayボタンを押し、pathwayでソートし直す。

f:id:kazumaxneo:20190616212808p:plain

検索効率を上げるため、左上から表示件数を変更。50 => 200にした。

f:id:kazumaxneo:20190616212708p:plain

多いpathwayを探す。p valueをみて増えてそうなpathwayが見つかったら、各行の右端にあるチェックboxに✔︎をつけて選択していく。

f:id:kazumaxneo:20190616212731p:plain

shiftキーを押しながら一番上と一番下をサンドすることで該当pathwayを一気に選択可能。

一番下までスクロールし、Only Redraw Selectedにチェックをつけ、Redrawボタンをクリック。

f:id:kazumaxneo:20190616212959p:plain

一番上に戻ると、選択したgene IDsのみが描画されている。

f:id:kazumaxneo:20190616213139p:plain

DiVennによって生成されたグラフは、右上のメニューからPNGかSVGとして出力、ダウンロードできます。

ノードはクリックしたりドラッグして移動できます。重要な遺伝子を真ん中に配置するなどして整えてから出力するといいと思います。

引用
DiVenn: An Interactive and Integrated Web-Based Visualization Tool for Comparing Gene Lists
Liang Sun, Sufen Dong, Yinbing Ge, Jose Pedro Fonseca, Zachary T. Robinson, Kirankumar S. Mysore, Perdeep Mehta

Front Genet. 2019; 10: 421

関連

agriGO

2019-06-18

多機能なNGS分析ツール BBtools　其の３BBMap追加コマンド

高速なツール SNV small indel error correction metagenome annotation

BBMapの追加コマンドについて紹介します。

BBMap Guide

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/

callvariants.sh

Introducing CallVariants, a new variant caller in #BBMap! CallVariants is 81x faster than mpilup+bcftools, and very, very accurate.
— Brian Bushnell (@BBToolsBio) December 15, 2016

Do you like #GATK?

BBTools' CallVariants is 2000 times faster. Also, it does not place adjacent deletions and insertions, like GATK.

I'm not saying GATK is garbage, but I'm also not saying that. You can't run GATK without incurring adjacent insertions and deletions.
— Brian Bushnell (@BBToolsBio) June 13, 2019

bbcms.sh

New in #BBTools:https://t.co/H04qsEfnrO - error correction utility for metagenomes. Tadpole is more accurate, but BBCMS uses the same algorithm and never runs out of memory (since it uses a lossy data structure). Use Tadpole when you can, and BBCMS if it runs out of memory.
— Brian Bushnell (@BBToolsBio) June 13, 2019

callgenes.sh

New in #BBTools- https://t.co/VyzMcc0Wda, a new prokaryotic gene-caller. Did the world need a new prok gene caller? No! But, I needed a super-fast caller for internal use by sendsketch.

"https://t.co/D9wMtd802X in=assembly.fa protein"

You can even use it with raw reads!
— Brian Bushnell (@BBToolsBio) June 13, 2019

It's actually really good, though:https://t.co/VyzMcc0Wda in=assembly.fa out=genes.gff outa=proteins.faa

By default it uses models for hundreds of bacteria and archaea. You can increase the accuracy (if you know the clade) by making your own model with https://t.co/gnqK0US6UD.
— Brian Bushnell (@BBToolsBio) June 13, 2019

インストール

bbmapを導入する。

#bioconda (link)
conda install -c bioconda -y bbmap

バリアントコール

> callvariants.sh

$ callvariants.sh

Written by Brian Bushnell

Last modified October 12, 2017

Description: Calls variants from sam or bam input.

In default mode, all input files are combined and treated as a single sample.

In multisample mode, each file is treated as an individual sample,

and gets its own column in the VCF file. Unless overridden, input file

names are used as sample names.

Please read bbmap/docs/guides/CallVariantsGuide.txt for more information,

or bbmap/pipelines/variantPipeline.sh for a usage example.

Usage: callvariants.sh in=<file,file,...> ref=<file> vcf=<file>

Input may be sorted or unsorted.

The reference should be fasta.

I/O parameters:

in=<file> Input; may be one file or multiple comma-delimited files.

list=<file> Optional text file containing one input file per line.

Use list or in, but not both.

out=<file> Output variant list in native format. If the name ends

with .vcf then it will be vcf format.

vcf=<file> Output variant list in vcf format.

ref=<file> Reference fasta. Required to display ref alleles.

Variant calling wil be more accurate with the reference.

shist=<file> (scorehist) Output for variant score histogram.

zhist=<file> (zygosityhist) Output for zygosity histogram.

overwrite=f (ow) Set to false to force the program to abort rather than

overwrite an existing file.

extended=t Print additional variant statistics columns.

sample= Optional comma-delimited list of sample names.

multisample=f (multi) Set to true if there are multiple sam/bam files,

and each should be tracked as an individual sample.

vcf0= Optional comma-delimited list of per-sample outputs.

Only used in multisample mode.

bgzip=f Use bgzip for gzip compression.

Processing Parameters:

prefilter=f Use a Bloom filter to exclude variants seen fewer than

minreads times. Doubles the runtime but greatly reduces

memory usage. The results are identical.

samstreamer=t (ss) Load reads multithreaded to increase speed.

coverage=t (cc) Calculate coverage, to better call variants.

ploidy=1 Set the organism's ploidy.

rarity=1.0 Penalize the quality of variants with allele fraction

lower than this. For example, if you are interested in

4% frequency variants, you could set both rarity and

minallelefraction to 0.04. This is affected by ploidy -

a variant with frequency indicating at least one copy

is never penalized.

covpenalty=0.8 (lowcoveragepenalty) A lower penalty will increase the

scores of low-coverage variants, and is useful for

low-coverage datasets.

useidentity=t Include average read identity in score calculation.

usepairing=t Include pairing rate in score calculation.

usebias=t Include strand bias in score calculation.

useedist=t Include read-end distance in score calculation.

homopolymer=t Penalize scores of substitutions matching adjacent bases.

nscan=t Consider the distance of a variant from contig ends when

calculating strand bias.

32bit=f Set to true to allow coverage tracking over depth 65535.

strandedcov=t Tracks per-strand ref coverage to print the DP4 field.

Requires more memory when enabled.

callsub=t Call substitutions.

calldel=t Call deletions.

callins=t Call insertions.

nopassdot=f Use . as genotype for variations failing the filter.

Trimming parameters:

border=5 Trim at least this many bases on both ends of reads.

qtrim=r Quality-trim reads on this end

r: right, l: left, rl: both, f: don't quality-trim.

trimq=10 Quality-trim bases below this score.

Realignment parameters:

realign=f Realign all reads with more than a couple mismatches.

Decreases speed. Recommended for aligners other than BBMap.

unclip=f Convert clip symbols from exceeding the ends of the

realignment zone into matches and substitutitions.

repadding=70 Pad alignment by this much on each end. Typically,

longer is more accurate for long indels, but greatly

reduces speed.

rerows=602 Use this many rows maximum for realignment. Reads longer

than this cannot be realigned.

recols=2000 Reads may not be aligned to reference seqments longer

than this. Needs to be at least read length plus

max deletion length plus twice padding.

msa= Select the aligner. Options:

MultiStateAligner11ts: Default.

MultiStateAligner9PacBio: Use for PacBio reads, or for

Illumina reads mapped to PacBio/Nanopore reads.

Sam-filtering parameters:

minpos= Ignore alignments not overlapping this range.

maxpos= Ignore alignments not overlapping this range.

minreadmapq=4 Ignore alignments with lower mapq.

contigs= Comma-delimited list of contig names to include. These

should have no spaces, or underscores instead of spaces.

secondary=f Include secondary alignments.

supplimentary=f Include supplimentary alignments.

invert=f Invert sam filters.

Variant-Calling Cutoffs:

minreads=2 Ignore variants seen in fewer reads.

minqualitymax=15 Ignore variants with lower max base quality.

minedistmax=20 Ignore variants with lower max distance from read ends.

minmapqmax=0 Ignore variants with lower max mapq.

minidmax=0 Ignore variants with lower max read identity.

minpairingrate=0.1 Ignore variants with lower pairing rate.

minstrandratio=0.1 Ignore variants with lower plus/minus strand ratio.

minquality=12.0 Ignore variants with lower average base quality.

minedist=10.0 Ignore variants with lower average distance from ends.

minavgmapq=0.0 Ignore variants with lower average mapq.

minallelefraction=0.1 Ignore variants with lower allele fraction. This

should be adjusted for high ploidies.

minid=0 Ignore variants with lower average read identity.

minscore=20.0 Ignore variants with lower Phred-scaled score.

clearfilters Reset all filters to zero. Filter flags placed after

the clearfilters flag will still be applied.

There are additionally max filters for score, quality, mapq, allelefraction,

and identity.

Java Parameters:

-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

ORF予測

>callgenes.sh

$ callgenes.sh

Written by Brian Bushnell

Last modified December 19, 2018

Description: Finds orfs and calls genes in unspliced prokaryotes.

This includes bacteria, archaea, viruses, and mitochondria.

Can also predict 16S, 23S, 5S, and tRNAs.

Usage: callgenes.sh in=contigs.fa out=calls.gff outa=aminos.faa

File parameters:

in=<file> A fasta file.

out=<file> Output gff file.

outa=<file> Amino acid output.

model=<file> A pgm file or comma-delimited list.

If unspecified a default model will be used.

compareto= Optional reference gff file to compare with the gene calls.

Other parameters:

minlen=60 Don't call genes shorter than this.

trd=f (trimreaddescription) Set to true to trim read headers after

the first whitespace. Necessary for IGV.

merge=f For paired reads, merge before calling.

detranslate=f Output canonical nucleotide sequences instead of amino acids.

recode=f Re-encode nucleotide sequences over called genes, leaving

non-coding regions unchanged.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

> analyzegenes.sh

$ analyzegenes.sh

Written by Brian Bushnell

Last modified September 27, 2018

Description: Generates a prokaryotic gene model (.pkm) for gene calling.

Input is fasta and gff files.

The .pkm file may be used by CallGenes.

Usage: analyzegenes.sh in=x.fa gff=x.gff out=x.pgm

File parameters:

in=<file> A fasta file or comma-delimited list of fasta files.

gff=<file> A gff file or comma-delimited list. This is optional;

if present, it must match the number of fasta files.

If absent, a fasta file 'foo.fasta' will imply the

presence of 'foo.gff'.

out=<file> Output pgm file.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

メタゲノムエラーコレクション

> bbcms.sh

r$ bbcms.sh

Written by Brian Bushnell

Last modified July 24, 2018

Description: Error corrects reads and/or filters by depth, storing

kmer counts in a count-min sketch (a Bloom filter variant).

This uses a fixed amount of memory. The error-correction algorithm is taken

from Tadpole; with plenty of memory, the behavior is almost identical to

Tadpole. As the number of unique kmers in a dataset increases, the accuracy

decreases such that it will make fewer corrections. It is still capable

of making useful corrections far past the point where Tadpole would crash

by running out of memory, even with the prefilter flag. But if there is

sufficient memory to use Tadpole, then Tadpole is more desirable.

Because accuracy declines with an increasing number of unique kmers, it can

be useful with very large datasets to run this in 2 passes, with the first

pass for filtering only using a 2-bit filter with the flags tossjunk=t and

ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a

4-bit filter for the actual error correction.

Usage: bbcms.sh in=<input file> out=<output> outb=<reads failing filters>

Example of use in error correction:

bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge

Example of use in depth filtering:

bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4

Error correction and depth filtering can be done simultaneously.

File parameters:

in=<file> Primary input, or read 1 input.

in2=<file> Read 2 input if reads are in two files.

out=<file> Primary read output.

out2=<file> Read 2 output if reads are in two files.

outb=<file> (outbad/outlow) Output for reads failing mincount.

outb2=<file> (outbad2/outlow2) Read 2 output if reads are in two files.

extra=<file> Additional comma-delimited files for generating kmer counts.

ref=<file> If ref is set, then only files in the ref list will be used

for kmer counts, and the input files will NOT be used for

counts; they will just be filtered or corrected.

overwrite=t (ow) Set to false to force the program to abort rather than

overwrite an existing file.

Hashing parameters:

k=31 Kmer length, currently 1-31.

hashes=3 Number of hashes per kmer. Higher generally reduces

false positives at the expense of speed; rapidly

diminishing returns above 4.

minprob=0.5 Ignore kmers with probability of being correct below this.

memmult=1.0 Fraction of free memory to use for Bloom filter. 1.0 should

generally work; if the program crashes with an out of memory

error, set this lower. You may be able to increase accuracy

by setting it slightly higher.

cells= Option to set the number of cells manually. By default this

will be autoset to use all available memory. The only reason

to set this is to ensure deterministic output.

seed=0 This will change the hash function used.

Depth filtering parameters:

mincount=0 If positive, reads with kmer counts below mincount will

be discarded (sent to outb).

hcf=1.0 (highcountfraction) Fraction of kmers that must be at least

mincount to pass.

requireboth=t Require both reads in a pair to pass in order to go to out.

When true, if either read has a count below mincount, both

reads in the pair will go to outb. When false, reads will

only go to outb if both fail.

tossjunk=f Remove reads or pairs with outermost kmer depth below 2.

(Suggested params for huge metagenomes: mincount=2 hcf=0.4 tossjunk=t)

Error correction parameters:

ecc=t Perform error correction.

bits= Bits used to store kmer counts; max count is 2^bits-1.

Supports 2, 4, 8, 16, or 32. 16 is best for high-depth data;

2 or 4 are for huge, low-depth metagenomes that saturate the

bloom filter otherwise. Generally 4 bits is recommended for

error-correction and 2 bits is recommended for filtering only.

ecco=f Error-correct paired reads by overlap prior to kmer-counting.

merge=t Merge paired reads by overlap prior to kmer-counting, and

again prior to correction. Output will still be unmerged.

smooth=3 Remove spikes from kmer counts due to hash collisions.

The number is the max width of peaks to be smoothed; range is

0-3 (3 is most aggressive; 0 disables smoothing).

This also affects tossjunk.

Java Parameters:

-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

実行方法

1、callvariants.sh

bam/samからバリアントコールを行う。

1-1、リファレンスにペアエンドfastqをmappingしてbamかsamを得る。

bbmap.sh ref=ref.fa in1=pair_1.fq in2=pair_2.fq out=output.sam nodisk
samtools sort -@ 12 -O BAM output.sam > sort.bam && samtools index -@ 12 sort.bam

indexディレクトリが作られ、リードがリファレンスにアライメントされる。defaultだとマシンの全コアが計算に使用される（明示するなら"t="）。それからsortコマンドでcoordinate sortedされたbamを得る。

1-2、入力のbam/samとリファレンスを指定してcallvariantsを実行する。ここでは倍数性(ploidy)2で実行。

callvariants.sh ploidy=2 in=sort.bam ref=ref.fa vcf=output.vcf

callsub=t Call substitutions
calldel=t Call deletions
callins=t Call insertions
ploidy=1 Set the organism's ploidy
in=<file> Input; may be one file or multiple comma-delimited files
vcf=<file> Output variant list in vcf format

バクテリアゲノムだと数秒でランは終わる。 vcf=を指定時の出力はVCF v4.2に従う。

2、callgenes.sh

contigやゲノム配列からprokaryotesの遺伝子配列を予測する。ここでは予測した遺伝子のアミノ酸配列も出力している。

callgenes.sh in=ref.fa out=output outa=amnoacids.fa

NGSのリードから実行することもできる。

callgenes.sh in=pair_1.fq in2=pair_2.fq out=output outa=amnoacids.fa

リードそれぞれから予測するため、出力は相応に大きくなることに注意。

3、bbcms.sh

メタゲノムシーケンシングリードのエラーコレクション。

入力のペアエンドfastqと出力のペアエンドfastqをそれぞれ指定する。

bbcms.sh in=pair_1.fq in2=pair_2.fq \
 out=error_corrected_1.fq.gz out2=error_corrected_2.fq.gz \
 outb=failing_mincount_1.fq.gz outb2=failing_mincount_2.fq.gz

in=<file> Primary input, or read 1 input.
in2=<file> Read 2 input if reads are in two files.
out=<file> Primary read output.
out2=<file> Read 2 output if reads are in two files.
outb=<file> Output for reads failing mincount.
outb2=<file> Read 2 output if reads are in two files.

参考

Question: bbmap callvariants - how to add sample names, how to get 0/0 alleles back?

http://callvariants.sh

関連

2019-06-17

（ヒト、マウス向け）GO term enrichment解析を行う GOnet

2018 BMC Bioinformatics web tool GO term human genome mouse

　ゲノムワイド研究のアウトプットは、通常、共有の発現パターンを示す遺伝子（またはそれらのタンパク質産物）のリストである。例えば、これらは、疾患の有無にかかわらずドナー群において差次的に発現される遺伝子、または生物学的サンプルの特定の画分において質量分析法によって同定されたタンパク質のリストであり得る。そのようなデータから科学的な意味を理解することは、関与する遺伝子／タンパク質およびそれらの機能に関する生物学的知識を必要とする複雑な作業である。公開されたデータが拡大するにつれて、絶えず拡大する知識と計算方法により最新情報を入手することがますます困難になっている。データベースリソースは、この知識を利用しやすくするための重要な機能である。 Gene Ontology（GO、http://geneontology.org/、[ref.1]）は、molecular functions、biological processes、そしてcellular componentsを記述するための論理的定義とともに、コントロールされた階層vocabularyを維持する先駆的プロジェクトである。このコントロールされたvocabularyは、特定の遺伝子が果たす役割に関する実験的（および計算上の）知見を捉えるために、いくつかのモデル生物データベースによって利用されている。この知識は、遺伝子にアノテーションを付けるGO termを調べ、それらを機能グループに分割するために、与えられた遺伝子のリスト（遺伝子セットとも呼ばれる）に適用することができる（‘annotation’ analysis）。このアプローチは、例えばDAVIDツール[ref.2]で実装されている。もう1つの一般的な手順は、ユーザーによって送信されたエントリの一覧で大幅にover-represented しているtermのみに焦点を当てることである（‘enrichment’ analysis）。このアプローチは、Gene Ontology アノテーションに適用されたGSEA（gene set enrichment analysis）の特定のケースである。そのような分析は、他のウェブアプリケーション（例えば、GOrilla [ref.4]、NaviGO [ref.5]、DAVID [ref.2]、AmiGO [ref.6]）を使用してGOプロジェクトのウェブサイト[ref.3]から実行できる。プログラミング言語のPythonで利用可能なモジュール（例えばGOATools [ref.7]、goenrich [ref.8]）およびR（例えばGOstats [ref.9]、topGO [ref.10]）でも実行できる。そのようなアプローチの人気は、最初のGOC publication [ref.11]が22000以上の論文によって引用されているという事実によって強調されている（2018年10月現在のGoogle Scholar検索による）。

　ただし、現在のGO分析Webアプリケーション（AmiGOやDAVIDなど）の出力は、termの階層構造を完全には伝えていない。 GOrillaやNaviGOなどのツールを使用すると、GO termの階層を視覚化できるが、分析中の遺伝子やタンパク質に対するGO termの関係は失われる。termのhierarchical structureの視覚化と遺伝子と termの関係の両方に取り組むことが、オープンソースのWebアプリケーションGOnet（https://github.com/mikpom/gonet）を作成する主な動機であった。それは遺伝子とtermのノードを持つ完全にインタラクティブなグラフを生成することによって達成される。グラフはさまざまなレイアウトをサポートしているため、グラフトポロジに基づいて分析を拡張することができる。

　研究者は、時折、より詳細な情報を得るために、調査した各遺伝子産物の機能を調べる必要がある。そのようなエントリーごとの分析のために、研究者は様々な公的リソースから情報を検索する必要があり得る。 GOnetはこのアプローチに準拠し、結果として得られるビューに外部データベース（UniProt [ref.12]、Ensembl [ref.13]、DICE-DB [ref.14]、Genecards [ref.15]）への便利なリンクを提供する。さらに、外部情報源からの発現データを使用して、遺伝子を色付けし、調査されたシグネチャへのさらなる洞察を提供することができる。全体的にこれらの特徴はGOnetを実験生物学者および計算生物学者のためのオミックスデータの生物学的解釈を容易にするための重要な道具にしている。

　GOnetアプリケーションは入力として遺伝子シンボル、タンパク質シンボル、またはタンパク質ID（UniProt ID）のリストを受け取り、グラフを出力する（論文図1の例）。視覚化されたグラフの実際の構造とその外観に影響を与えるさまざまな入力パラメータがある。最初の主なユーザーの選択は、遺伝子に注釈が付けられているGO termである。

１、submitされた遺伝子リストにおいて統計的に有意に過剰に示されたGO term。

２、

事前定義されたサブセット（「GO slim」とも呼ばれる）、またはユーザー指定のtermリスト。

（以下略）

使い方

https://tools.dice-database.org/GOnet/ にアクセスする。

入力遺伝子名をプレーンテキスト形式で貼り付ける。形式は一般的な遺伝子名(ABCB1, CCR7, Klf4, etc), Uniprot IDs (e.g. P32248) 、MGI IDs (e.g. MGI:1342287, mouse only)になる。Tab区切りにして2カラム目にlog2FCの値をつけても動作する。

f:id:kazumaxneo:20190614192858j:plain

ヒューマンデータにはUniprot IDし、マウスデータにはMGI ID使用が推奨されてる。

２種類の解析、GO term enrichment解析とGO term annotation解析がサポートされている。

f:id:kazumaxneo:20190614193015j:plain

defaultのGO term enrichment解析では標準的なGOenrichment解析を実行し、submitされた遺伝子リストと過剰に表されたGO termとの関係のみが示される。GO term annotation解析では、GOnetは入力遺伝子リストとGOサブセットパラメータで指定されたGO termとの関係を再構築する。

GO term annotation解析を行うにはGOサブセットを指定する。defaultはGO slim genericになっている。これはGOCによって維持されている遺伝子オントロジーのサブセットで、大部分の研究に適したカテゴリーを含んでいる。

f:id:kazumaxneo:20190614193435j:plain

カスタムオプションを選択し、ユーザーがGO termを指定することも可能。

GO term enrichment解析ではバッググラウンドを指定できる。defaulではAll GO annotated genesになっている。

f:id:kazumaxneo:20190614193827j:plain

結果は十分に具体的ではない場合、定義済みのバッググラウンドを指定することで改善される場合がある。GOnetでは、いくつかのcell type/tissueがバッググラウンドとして指定できる。

f:id:kazumaxneo:20190614200238j:plain

ユーザのデータセットに由来する細胞型で有意に発現される遺伝子をバッググラウンドとして使う。例えば、オプション（DICE-DB）TH1が選択された場合、Th1細胞中で1 TPMより高い発現を有する全ての遺伝子がバックグラウンドとして使用される。

エンリッチ解析のp値の計算は、Python goenrichパッケージ(link)の手順に従う。すべてのGOに対してフィッシャーの直接確率検定 (wiki)のp値が計算され、すべての遺伝子リストのGO termについて、過剰表現されたGO termはないという帰無仮説を立てて検定が実施される。

条件を決定したらsubmitする。

f:id:kazumaxneo:20190614202033j:plain

結果

エッジとノードからなる一種の有向グラフとして表現される。図はインタラクティブに編集可能になっている。ノードをクリックして移動したり、右のメニューから条件を変更できる。

f:id:kazumaxneo:20190614202038j:plain

四角形のノードはGO term、楕円は遺伝子を表す。p値が小さいほどノードの色が濃くなる（GO term enrichment解析の場合）。

f:id:kazumaxneo:20190614202320j:plain

ノードをクリックすると詳細が表示される。

f:id:kazumaxneo:20190614204333j:plain

リンクをクリックするとAmiGO2にジャンプする。

f:id:kazumaxneo:20190614204447j:plain

レイアウトは３種類から指定できる。上のレイアウトはdefaultの"Cose"で、サンプル数が少なめの時に推奨される。サンプル数が増えると、階層構造の”Hierarchical”が推奨される。

Hierarchical

f:id:kazumaxneo:20190614202902j:plain

このレイアウトでは、ノードが階層内に表示される。特異性の低いGO termはネットワークの上部に配置され、特異性の高いGO termは下部に配置される。遺伝子はグラフの一番下に配置される。

もう1つ、Eulerというレイアウントも指定できる。

f:id:kazumaxneo:20190614203457j:plain

これも大規模ネットワークに適しているとされる。

メニューからはレイアウントの他にもp valueを変更できる。

threshould p-valueを減らして厳しくした。

f:id:kazumaxneo:20190614203738j:plain

Threshould p-valueをより厳しくすると、GO termのノードの一部が見えなくなった。

引用
GOnet: a tool for interactive Gene Ontology analysis

Pomaznoy M, Ha B, Peters B

BMC Bioinformatics. 2018 Dec 7;19(1):470

2019-06-16

bamファイルを扱う bamM

library bam/sam

　BamMはBAMファイルを解析するpythonにラップされたcライブラリである。このコードはPySam (link) のすべての機能を実装するものではないが、PySamよりも高速で安定したBAMファイルのインターフェースを提供することを目的としている。

http://ecogenomics.github.io/BamM/

マニュアル

http://ecogenomics.github.io/BamM/manual/BamM_manual.pdf

python library

API Documentation

インストール

ubuntu16.04のminiconda2-4.0.5環境でテストした（docker使用、ホストOS ubuntu18.04）。

依存

python2.7
samtools
bwa

本体　Github

#bioconda（link）
conda install -c bioconda bamm

#GNU Guix
guix package --install bamm

> bamm

...::: BamM :::...

Working with the BAM, not against it...

-------------------------------------------------------------------------

version: 1.7.3

-------------------------------------------------------------------------

bamm make -> Make BAM/TAM files (sorted + indexed)

bamm parse -> Get coverage profiles / linking reads / insert types

bamm extract -> Extract reads / headers from BAM files

bamm filter -> Filter BAM file reads

USE: bamm OPTION -h to see detailed options

> bamm make -h

$ bamm make -h

usage: bamm make -d DATABASE [-i INTERLEAVED [INTERLEAVED ...]]

[-c COUPLED [COUPLED ...]] [-s SINGLE [SINGLE ...]]

[-p PREFIX] [-o OUT_FOLDER]

[--index_algorithm INDEX_ALGORITHM]

[--alignment_algorithm ALIGNMENT_ALGORITHM] [--extras EXTRAS]

[-k] [-K] [-f] [--output_tam] [-u] [-t THREADS] [-m MEMORY]

[--temporary_directory TEMPORARY_DIRECTORY] [-h]

[--show_commands] [--quiet] [--silent]

make a BAM/TAM file (sorted + indexed)

required argument:

-d DATABASE, --database DATABASE

contigs to map onto (in fasta format)

reads to map (specify one or more arguments):

-i INTERLEAVED [INTERLEAVED ...], --interleaved INTERLEAVED [INTERLEAVED ...]

map interleaved sequence files (as many as you like) EX: -i reads1_interleaved.fq.gz reads2_interleaved.fna

-c COUPLED [COUPLED ...], --coupled COUPLED [COUPLED ...]

map paired sequence files (as many sets as you like) EX: -c reads1_1.fq.gz reads1_2.fq.gz reads2_1.fna reads2_2.fna

-s SINGLE [SINGLE ...], --single SINGLE [SINGLE ...]

map Single ended sequence files (as many as you like) EX: -s reads1_singles.fq.gz reads2_singles.fna

optional arguments:

-p PREFIX, --prefix PREFIX

prefix to apply to BAM/TAM files (None for reference name)

-o OUT_FOLDER, --out_folder OUT_FOLDER

write to this folder (default: .)

--index_algorithm INDEX_ALGORITHM

algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]

--alignment_algorithm ALIGNMENT_ALGORITHM

algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)

--extras EXTRAS extra arguments to use during mapping. Format is "<BWA_MODE1>:'<ARGS>',<BWA_MODE2>:'<ARGS>'"

-k, --keep keep all generated database index files (see also --kept)

-K, --kept assume database indices already exist (e.g. previously this script was run with -k/--keep)

-f, --force force overwriting of index files if they are present

--output_tam output TAM file instead of BAM file

-u, --keep_unmapped Keep unmapped reads in BAM output

-t THREADS, --threads THREADS

maximum number of threads to use (default: 1)

-m MEMORY, --memory MEMORY

maximum amount of memory to use per samtools sort thread (default 2GB). Suffix K/M/G recognized

--temporary_directory TEMPORARY_DIRECTORY

temporary directory for working with BAM files (default do not use)

-h, --help show this help message and exit

--show_commands show all commands being run

--quiet suppress output from the mapper

--silent suppress all output

Example usage:

Produce a sorted, indexed BAM file with reads mapped onto contigs.fa using 40 threads

bamm make -d contigs.fa -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -c reads3_1.fq.gz reads3_2.fq.gz -t 40

Produce a 3 sorted, indexed BAM files with reads mapped onto contigs.fa.gz

bamm make -d contigs.fa.gz -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -s reads4_singles.fq.gz

Extra arguments are passed on a "per-mode" basis using the format:

<BWA_MODE>:'<ARGS>'

For example, the argument:

--extras "mem:-k 25"

tells bwa mem to use a minimum seed length of 25bp.

Multiple modes are separated by commas. For example:

--extras "aln:-O 12 -E 3,sampe:-n 15"

Passes the arguments "-O 12 -E 3" to bwa aln and the arguments "-n 15" to bwa sampe.

********************************************************************************

*** WARNING ***

********************************************************************************

Values passed using --extras are not checked by BamM. This represents a

significant security risk if BamM is being run with elevated privileges.

Thus you should NEVER run 'bamm make' as root or some other powerful user,

ESPECIALLY if you are providing access to multiple / unknown users.

********************************************************************************

> bamm parse -h

$ bamm parse -h

usage: bamm parse -b BAMFILES [BAMFILES ...] [-c COVERAGES] [-l LINKS]

[-i INSERTS] [-n NUM_TYPES [NUM_TYPES ...]]

[-m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}]

[-r CUTOFF_RANGE [CUTOFF_RANGE ...]] [--length LENGTH]

[--base_quality BASE_QUALITY]

[--mapping_quality MAPPING_QUALITY]

[--max_distance MAX_DISTANCE] [--use_secondary]

[--use_supplementary] [-v] [-t THREADS] [-h]

get bamfile type and/or coverage profiles and/or linking reads

required argument:

-b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]

bam files to parse

optional arguments:

-c COVERAGES, --coverages COVERAGES

filename to write coverage profiles to [default: don't calculate coverage]

-l LINKS, --links LINKS

filename to write pairing links to [default: don't calculate links]

-i INSERTS, --inserts INSERTS

filename to write bamfile insert distributions to [default: print to STDOUT]

-n NUM_TYPES [NUM_TYPES ...], --num_types NUM_TYPES [NUM_TYPES ...]

number of insert/orientation types per BAM

-m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}, --coverage_mode {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}

how to calculate coverage (requires --coverages) (default: pmean)

-r CUTOFF_RANGE [CUTOFF_RANGE ...], --cutoff_range CUTOFF_RANGE [CUTOFF_RANGE ...]

range used to calculate upper and lower bounds when calculating coverage

--length LENGTH minimum Q length (default: 50)

--base_quality BASE_QUALITY

base quality threshold (Qscore) (default: 20)

--mapping_quality MAPPING_QUALITY

mapping quality threshold

--max_distance MAX_DISTANCE

maximum allowable edit distance from query to reference (default: 1000)

--use_secondary use reads marked with the secondary flag

--use_supplementary use reads marked with the supplementary flag

-v, --verbose be verbose

-t THREADS, --threads THREADS

maximum number of threads to use (default: 1)

-h, --help show this help message and exit

Example usage:

Calculate insert distribution, print to STDOUT

bamm parse -b my.bam

Calculate contig-wise coverage

bamm parse -b my.bam -c output_coverage.tsv

Calculate linking information on 2 bam files

bamm parse -b first.bam second.bam -l output_links.tsv

Coverage calculation modes:

* opmean: outlier pileup coverage: average of reads overlapping each base,

after bases with coverage outside mean +/- 1 standard deviation

have been excluded. The number of standard deviation used for the

cutoff can be changed with --cutoff_range.

* pmean: pileup coverage: average of number of reads overlapping each base

* tpmean: trimmed pileup coverage: average of reads overlapping each base,

after bases with in the top and bottom 10% have been excluded. The

10% range can be changed using --cutoff_range and should be

specified as a percentage (e.g., 15 for 15%).

* counts: absolute number of reads mapping

* cmean: like 'counts' except divided by the length of the contig

* pmedian: median pileup coverage: median of number of reads overlapping each

base

* pvariance: variance of pileup coverage: variance of number of reads

overlapping each base

The 'cutoff_range' variable is used for trimmed mean and outlier mean. This

parameter takes at most two values. The first is the lower cutoff and the

second is the upper. If only one value is supplied then lower == upper.

> bamm extract -h

$ bamm extract -h

usage: bamm extract -b BAMFILES [BAMFILES ...] -g GROUPS [GROUPS ...]

[-p PREFIX] [-o OUT_FOLDER] [--mix_bams] [--mix_groups]

[--mix_reads] [--interleave]

[--mapping_quality MAPPING_QUALITY] [--use_secondary]

[--use_supplementary] [--max_distance MAX_DISTANCE]

[--no_gzip] [--headers_only] [-v] [-t THREADS] [-h]

Extract reads which hit the given references

required arguments:

-b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]

bam files to parse

-g GROUPS [GROUPS ...], --groups GROUPS [GROUPS ...]

files containing reference names (1 per line) or contigs file in fasta format

optional arguments:

-p PREFIX, --prefix PREFIX

prefix to apply to output files

-o OUT_FOLDER, --out_folder OUT_FOLDER

write to this folder (default: .)

--mix_bams use the same file for multiple bam files

--mix_groups use the same files for multiple groups

--mix_reads use the same files for paired/unpaired reads

--interleave interleave paired reads in ouput files

--mapping_quality MAPPING_QUALITY

mapping quality threshold

--use_secondary use reads marked with the secondary flag

--use_supplementary use reads marked with the supplementary flag

--max_distance MAX_DISTANCE

maximum allowable edit distance from query to reference (default: 1000)

--no_gzip do not gzip output files

--headers_only extract only (unique) headers

-v, --verbose be verbose

-t THREADS, --threads THREADS

maximum number of threads to use (default: 1)

-h, --help show this help message and exit

> bamm filter -h

$ bamm filter -h

usage: bamm filter -b BAMFILE [-o OUT_FOLDER]

[--mapping_quality MAPPING_QUALITY]

[--max_distance MAX_DISTANCE] [--length LENGTH]

[--percentage_id PERCENTAGE_ID]

[--percentage_aln PERCENTAGE_ALN] [--use_secondary]

[--use_supplementary] [-v] [-h]

Apply stringency filter to Bam file reads

required arguments:

-b BAMFILE, --bamfile BAMFILE

bam file to filter

optional arguments:

-o OUT_FOLDER, --out_folder OUT_FOLDER

write to this folder (output file has '_filtered.bam' suffix) (default: .)

--mapping_quality MAPPING_QUALITY

mapping quality threshold

--max_distance MAX_DISTANCE

maximum allowable edit distance from query to reference (default: 1000)

--length LENGTH minimum allowable query length

--percentage_id PERCENTAGE_ID

minimum base identity of mapped region (between 0 and 1)

--percentage_aln PERCENTAGE_ALN

minimum fraction of read mapped (between 0 and 1)

--use_secondary use reads marked with the secondary flag

--use_supplementary use reads marked with the supplementary flag

-v, --invert_match select unmapped reads

-h, --help show this help message and exit

コマンドとしての使用方法

1、bamm make mappingしてbamを作成

ペアエンドfastqを指定する。出力はmappedとし、8スレッド使う。bwaのアルゴリズムはmemを指定する。indexがすでにあっても上書きする。unmapのリードも出力する。

bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
 -p mapped -t 8 --alignment_algorithm mem -f -u

-i shuffled reads
-c paired files
-s singleton read files
-p prefix to apply to BAM/TAM files (None for reference name)
-o write to this folder (default: .)
--index_algorithm algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]
--alignment_algorithm algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)
-f force overwriting of index files if they are present
-t maximum number of threads to use (default: 1)
-m maximum amount of memory to use per samtools sort thread
-u Keep unmapped reads in BAM output

ペアエンド、シングルエンド、インターリーブのfastqをそれぞれ指定する。20スレッド指定する。unmapのリードは出力しない（default）。

bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
 -s unpaired.fastq.gz -i interleave.fq.gz \
 -p mapped -t 20 --alignment_algorithm mem -f

2、bamm parse カバレッジやインサートサイズ計算

それぞれのchromosomeのリードのカバレッジ、相対的なリードの方向、リードのインサートサイズ、リンク情報を出力する。

bamm parse -c covs.tsv -l links.tsv -i inserts.tsv -t 8 -b mapped.bam

-c filename to write coverage profiles to [default: don't calculate coverage]
-l filename to write pairing links to [default: don't calculate links]
-i filename to write bamfile insert distributions to [default: print to STDOUT]
-t maximum number of threads to use (default: 1)
-b bam files to parse

リンク情報は他のcontigやchrにまたがってmapingされるリード情報をもとにリンクされている配列を出力する（下の図はmanualより）。

インサートサイズ、相対的なリードの方向 (IN/OUT)を標準出力に出力する。

bamm parse -b mapped.bam

複数のbamのカバレッジを出力する。

bamm parse -c coverage.tsv -m pmean -b f1.bam f2.bam f3.bam

-m <pmean,opmean,tpmean,counts,cmean,pmedian,pvariance> how to calculate coverage (requires --coverages) (default: pmean)

計算モード（manualより）

opmean: Outlier pileup coverage: average of reads overlapping each base, after bases with coverage outside mean +/- 1 standard deviation have been excluded. The number of standard deviation used for the cutoff can be changed with --coverage_range.
pmean: Pileupcoverage:averageofnumberofreadsoverlappingeachbase
tpmean: Trimmed pileup coverage: average of reads overlapping each base, after bases with in the top and bottom 10% have been excluded. The 10% range can be changed using --coverage_range.
counts: Absolute number of reads mapping
cmean: Like'counts'exceptdividedbythelengthofthecontig
pmedian: Median pileup coverage: median of number of reads overlapping each base

３、extract bamからリードを出力

bamを指定する。他のコマンドと同様、複数bamを指定できる。

bamm extract -g ref.fasta -b mapped.bam -t 8

ペアエンドのbamを指定した場合、ペアエンドfastq、ペアエンドfastqでsingletonになったfastqの３つのファイルが出力される。ペアエンドfastqの順番は同期されている。出力はgzip圧縮される。

出力されたリードのヘッダーにはどのchr/contigにマッピングされていたか、ペアとしてマッピングされていたかなどを示す情報がつく（manualより）。

引用

GitHub - Ecogenomics/BamM: Metagenomics-focused BAM file manipulation

2019-06-15

メタゲノムのファージ配列分析webサーバー VirMiner

2019 Microbiome (Journal) virus web tool 結果の視覚化 (visualization) ゲノム比較 (comparative genomics) drug resistance phage CRISPR

2019 6/15 誤字修正

ウイルスは、それらの恒常性および進化に寄与する微生物群集の必須の構成要素である。ヒトの腸内細菌叢のウイルス群集はバクテリオファージが支配的である[ref.1]。ファージは遺伝子水平伝播（HGT）[ref.2]によって細菌群集の構造と機能を調節することができ、それによって病原性、抗生物質耐性、およびバイオフィルム形成を含む細菌表現型を変化させる[ref.3、4、5]。このようなファージ誘発性の変化は、細菌の病原性および抗生物質耐性に影響を与えることによって潜在的な健康上のリスクを引き起こす可能性がある。例えば、ファージはコレラ菌のような通性病原体の病原性に影響を与える[ref.6]。志賀毒素のような、いくつかのファージコード化された毒性因子が発見されており[ref.7]、これは多くの細胞型においてアポトーシスを誘導することが示されている[ref.8、9]。一方、抗生物質耐性菌増加におけるファージの役割はまだ物議をかもしている。伝統的に、ファージは抗生物質治療のようなストレス下での細菌適応における遺伝的貯蔵庫と考えられていた。抗生物質耐性遺伝子が抗生物質治療マウス由来のファージに高度に濃縮されていることが実験的に証明されている［ref.10］。しかしながら、最近の研究[ref.11]において、研究者らは、抗生物質耐性遺伝子（ARG）の存在はファージゲノムにおいて非常に過大評価されていると結論付けた。探索的バイオインフォマティクス戦略を通して、彼らは1181の公に利用可能なファージゲノム中から2つの既知ARGおよび421の新たに予測されたARGを同定した。しかしながら、彼らの実施した大腸菌で４つのARGを発現す実験では、抗生物質耐性増加を引き起こさなかった。矛盾する知見は、抗生物質耐性の増殖およびヒトの健康におけるファージの役割が十分に理解されていないことを示している。

　バクテリオファージが健康関連のウイルス群集の重要な部分を占めているとしても、ファージのより深い理解を得ることは、ウイルスの単離および精製における困難性のために依然として挑戦的である[ref.12、13、14]。しかし、原核細胞とウイルスの両方から同時にゲノムリードを生成するメタゲノムシーケンシングの使用は、生態学的および臨床サンプルのメタゲノムからファージゲノムを回収することによってウイルス研究を劇的に促進した[ref.15、16、17]。このアプローチを用いて、抗生物質耐性遺伝子もコードする、一組の潜在的な腸特異的Bacteroidales-likeファージがヒトの腸内微生物ゲノム内で同定された［ref.17］。別の研究では、分類群特異的マーカー遺伝子を用いて世界中の207の個人からの便メタゲノムに含まれるファージ分類群を分類および定量化し[ref.18]、ヒト集団間での特定のファージ分類群の量の違いを見出した。したがって、phageomeの分類学的および機能的組成ならびにファージ - 宿主相互作用は、メタゲノムデータを直接使用して明らかにすることができた。さらに、マイクロバイオームについてのより良い生態学的理解、およびそれらがヒトの健康に与える影響についてのより深い洞察を得ることができる。

　バクテリアとファージの混合配列からファージコンティグを同定することは、メタゲノム研究におけるファージ分析にとって必要かつ重要なステップである。ファージ配列またはプロファージ領域を同定するための現在のツールのほとんど、すなわちPhage Finder [ref.19]、Prophage Finder [ref.20]、Prophinder [ref.21]、およびPHAST [ref.22]などは、viromeシーケンシングまたは原核生物ゲノムシーケンシングには適しているが、メタゲノムデータからファージ配列を同定するようには設計されておらず、そしてマイクロバイオームからファージおよびバクテリア配列を効率的に分離することはできない。

　Metaphinder [ref.23]は、メタゲノムシーケンスデータからファージコンティグを同定するように開発されたWebサーバーである。しかしながら、これら全ての前述のツールは、現在のデータベース中の既知のファージ配列に対する相同性検索を通してファージ配列を同定する。微生物群集に感染しているウイルス粒子は10³¹あると推定されているが、現在のデータベースにはわずか数千のウイルスゲノムしか登録されていない[ref.24、25]。したがって、現在のツールは、未知または未培養のファージを多数無視する可能性がある。

　メタゲノムデータから未知のファージを予測するために、最近VirSorter [ref.26]（紹介）とVirFinder [ref.12]が開発された。 VirSorterは、「major capsid protein」、「portal」、「terminase large subunit」、「spike」、「tail」、「virion formation」として定義される「特徴」遺伝子の存在を検出するためにウイルスタンパク質配列の2つのリファレンスデータベースを採用した。メタゲノム名の各コンティグには、「、」または「coat」の注釈を付ける。さらに、VirSorterは、ウイルス様遺伝子、Pfam関連遺伝子、短い遺伝子、およびstrand交換における枯渇を含む他の測定基準を使用して、予測されたウイルス領域の信頼レベルを測定するための確率モデルを構築する。これに基づいて、メタゲノムコンティグを３つのカテゴリーに分類することができる：検出されたウイルス様遺伝子およびウイルスホールマーク遺伝子の有意なエンリッチメントを有する配列（検出されたウイルス様遺伝子またはウイルスホールマーク遺伝子の有意なエンリッチメントを有する配列）。（「可能性が高い」）、および既知のウイルスリファレンスとは異なるが既知のウイルスゲノムとは構造的に類似している（「可能性がある」）配列。対照的に、VirFinderは経験的にウイルスとファージのk-mer頻度が異なると仮定しているため、メタゲノムサンプルのウイルスシグナルを決定するためにk-merベースの機械学習モデルを構築した。どちらのツールも優れた予測パフォーマンスを示している。しかし、性能評価は、ウイルスコンティグの割合を人為的に設定することによって生成されたmock メタゲノム[ref.12、26]に基づいていたため、実際のサンプルに対する予測能力を反映することはできない。本著者らの分析は、前述のツールを評価するために使用される構成が、ヒトの腸内の実際のミクロバイオーム構成とは大きく異なることを明らかにした。その上、VirSorterとVirFinderが微生物群内のファージ分析のために提供する機能は比較的限られている。ファージコンティグを同定した後、ファージ - 宿主相互作用などのさらなる分析は提供されず、それは特定のストレスに応答したファージの重要な役割を明らかにし得る。したがって、微生物群集内でのファージの可能性のある役割についてのより深い理解を提供するために、より強力なツールが必要である。

　ここでは、メタゲノムデータから、特に豊富なファージコンティグのために、ファージコンティグを識別するためにランダムフォレストモデルを使用する、ユーザーフレンドリーなWebツールVirMinerを開発した。実際のメタゲノムデータにおいてより高い予測を達成するために、VirMinerを抗生物質で処理された10人の個体の精製ファージライブラリーのpaired phageomesおよび長期サンプリングにより、ヒト腸内微生物メタゲノムの訓練および評価を行った。さらに、VirMinerはいくつかのハイライトを含む包括的な分析パイプライン、すなわち（1）rawリード処理、オンサイトメタゲノムアセンブリ、および遺伝子予測。（2）Pfam、KEGGオロソログ（KO）、ファージオルソロググループ（POG）、ウイルスタンパク質ファミリー、およびウイルス特徴を含む包括的なfunctional annotation。（３）ファージコンティグ識別のための高感度ランダムフォレスト（RF）予測モデル。これは、豊富なファージコンティグの識別において優れた性能を示す。（４）ファージ - 宿主関係予測およびCRISPR部位認識。（５）異なるサンプル群間の統計的比較、を提供する。

f:id:kazumaxneo:20190421210009p:plain

The workflow of VirMiner. 論文より転載。

help

http://147.8.185.62/VirMiner/help.php

使い方

HPにアクセスする。下の方にあるstart analysisボタンをクリック。

fastqファイルを指定する (Please make sure the names of unzipped pair-end FASTQ files are ended with "_1.fastq" or "_2.fastq"). 複数同時にuploadしても問題ない。gzip圧縮には対応していない。
f:id:kazumaxneo:20190421210803p:plain

２サンプル間比較を行いたい場合、メタデータファイルもアップする必要がある。1列目がfastq名、２列目が条件になる。ファイル名は"filenames_comparison.txt"にする。

f:id:kazumaxneo:20190606214839p:plain

アップロードが終わったら、Emailを記載して送信ボタンを押す。

送信ボタンを押すと画面が下のように切り替わり、さらにデータを追加することもできる。これで

f:id:kazumaxneo:20190421211433p:plain

OKならstart the analysisボタンを押すことでジョブが開始される。

f:id:kazumaxneo:20190421212131p:plain

アップロードされたfastqは、クオリティトリミング後、IDBA_UDでde novo assemlyされ、HMMを使うMetaGeneMarkによってcontigsからORFが予測される。それから論文（helpも参照）の条件で事前学習されたランダムフォレストモデル（link）に従ってcontigsはphageかそれ以外に分類され、NCBI GenBankの3319のphagefゲノム配列とその taxonomy情報をもとに学習された単純ベイズ分類器（もともとrRNAを分類するために使用されていたRDP Classifier ）によってtaxonomic rankがアサインされる。　Shannonインデックス、Simpsonインデックス、Pielou均等性インデックスを含む微生物多様性インデックスは、属および種レベルの両方でRパッケージを使用して計算される。グループ間の比較にはウィルコクソンの順位和検定が使われている。

viral contigs間の豊富さトップ50 Pfamによるヒートマップ

f:id:kazumaxneo:20190613013728p:plain

Pfamの豊富さに基づくMDRによるクラスタリング

f:id:kazumaxneo:20190613013546p:plain

viral contigs間の豊富さトップ50 KEGG orthologyによるヒートマップ

f:id:kazumaxneo:20190613013749p:plain

Viral cotingsのKEGG pathway相対量

f:id:kazumaxneo:20190613013755p:plain

KEGG pathway classの豊富さによるヒートマップ

f:id:kazumaxneo:20190613013801p:plain

KEGG orthologyの豊富さに基づくMDRによるクラスタリング

f:id:kazumaxneo:20190613013808p:plain

phageコミュニティのジーナスレベルのmicrobial diveristy

f:id:kazumaxneo:20190613013809p:plain

phageコミュニティのspeciesレベルのmicrobial diveristy

f:id:kazumaxneo:20190613013814p:plain

Phage-host interaction network

f:id:kazumaxneo:20190613013825p:plain

f:id:kazumaxneo:20190613013830p:plain

Phage-host network

f:id:kazumaxneo:20190613013840p:plain

f:id:kazumaxneo:20190613013848p:plain

ランが終わるとメールが届く。

f:id:kazumaxneo:20190613163222j:plain

curlやwgetでまとめてダウンロードする。

wget -r -np http://147.8.185.62/VirMiner/tasks/201961313249

f:id:kazumaxneo:20190613165716j:plain

感想

多機能かつユーザフレンドリな解析ツールです。２、３ヶ月に渡ってトライしましたが、しばしばfastqをuploadして新規データ解析することができなくなることがありました。メタゲノムのraw fastqをアップロードする仕様のため、重すぎてパンクしているのかもしれません。どうしてもすぐに使いたい場合は、混雑状況についてオーサーらに問い合わるか、データを削減してuploadすることも検討してください。

VirMiner - Contact

引用

Mining, analyzing, and integrating viral signals from metagenomic data
Tingting Zheng, Jun Li, Yueqiong Ni, Kang Kang, Maria-Anna Misiakou, Lejla Imamovic, Billy K. C. Chow, Anne A. Rode, Peter Bytzer, Morten Sommer, Gianni Panagiotou
Microbiome 2019 7:42

関連

参考

2019-06-14

メタゲノムアセンブリをbinningする CONCOCT

2014 Nature Methods ヒト腸内 (human gut) docker Binning (metagenomics)

2021 4/28 コマンド追記

　ショットガンシーケンシングは、複雑な微生物群集からのゲノムの再構築を可能にするが、全ゲノムを再構築することはできないので、ゲノムの断片をビンに入れることが必要である。この論文では、CONCOCTを提示する。これは、コンティグをゲノムに自動的にクラスタリングするために、複数のサンプルにわたるシーケンス構成とカバレッジを組み合わせたアルゴリズムである。シミュレーション、およびリアルのhuman gutメタゲノムデータセットで高い再現率と精度を示す。

documentation

https://concoct.readthedocs.io/en/latest/

インストール

ubuntu16.04でcondaの仮想環境を作ってテストした（docker使用、ホストOS ubuntu18.04）。

依存

python=2.7

#Bioconda link参照

本体　Github

#bioconda（link）
conda install -c bioconda -y concoct

#依存が多いので仮想環境を作った方が無難
conda create -n concoct_env -c bioconda python=2.7 concoct
source activate concoct_env

> concoct -h

# concoct -h

usage: concoct [-h] [--coverage_file COVERAGE_FILE]

[--composition_file COMPOSITION_FILE] [-c CLUSTERS]

[-k KMER_LENGTH] [-l LENGTH_THRESHOLD] [-r READ_LENGTH]

[--total_percentage_pca TOTAL_PERCENTAGE_PCA] [-b BASENAME]

[-s SEED] [-i ITERATIONS] [-e EPSILON] [--no_cov_normalization]

[--no_total_coverage] [--no_original_data] [-o] [-d] [-v]

optional arguments:

-h, --help show this help message and exit

--coverage_file COVERAGE_FILE

specify the coverage file, containing a table where

each row correspond to a contig, and each column

correspond to a sample. The values are the average

coverage for this contig in that sample. All values

are separated with tabs.

--composition_file COMPOSITION_FILE

specify the composition file, containing sequences in

fasta format. It is named the composition file since

it is used to calculate the kmer composition (the

genomic signature) of each contig.

-c CLUSTERS, --clusters CLUSTERS

specify maximal number of clusters for VGMM, default

400.

-k KMER_LENGTH, --kmer_length KMER_LENGTH

specify kmer length, default 4.

-l LENGTH_THRESHOLD, --length_threshold LENGTH_THRESHOLD

specify the sequence length threshold, contigs shorter

than this value will not be included. Defaults to

1000.

-r READ_LENGTH, --read_length READ_LENGTH

specify read length for coverage, default 100

--total_percentage_pca TOTAL_PERCENTAGE_PCA

The percentage of variance explained by the principal

components for the combined data.

-b BASENAME, --basename BASENAME

Specify the basename for files or directory where

outputwill be placed. Path to existing directory or

basenamewith a trailing '/' will be interpreted as a

directory.If not provided, current directory will be

used.

-s SEED, --seed SEED Specify an integer to use as seed for clustering. 0

gives a random seed, 1 is the default seed and any

other positive integer can be used. Other values give

ArgumentTypeError.

-i ITERATIONS, --iterations ITERATIONS

Specify maximum number of iterations for the VBGMM.

Default value is 500

-e EPSILON, --epsilon EPSILON

Specify the epsilon for VBGMM. Default value is 1.0e-6

--no_cov_normalization

By default the coverage is normalized with regards to

samples, then normalized with regards of contigs and

finally log transformed. By setting this flag you skip

the normalization and only do log transorm of the

coverage.

--no_total_coverage By default, the total coverage is added as a new

column in the coverage data matrix, independently of

coverage normalization but previous to log

transformation. Use this tag to escape this behaviour.

--no_original_data By default the original data is saved to disk. For big

datasets, especially when a large k is used for

compositional data, this file can become very large.

Use this tag if you don't want to save the original

data.

-o, --converge_out Write convergence info to files.

-d, --debug Debug parameters.

-v, --version show program's version number and exit

> python CONCOCT/scripts/cut_up_fasta.py -h

$ python CONCOCT/scripts/cut_up_fasta.py -h

usage: cut_up_fasta.py [-h] [-c CHUNK_SIZE] [-o OVERLAP_SIZE] [-m]

[-b BEDFILE]

contigs [contigs ...]

Cut up fasta file in non-overlapping or overlapping parts of equal length.

Optionally creates a BED-file where the cutup contigs are specified in terms

of the original contigs. This can be used as input to concoct_coverage_table.py.

positional arguments:

contigs Fasta files with contigs

optional arguments:

-h, --help show this help message and exit

-c CHUNK_SIZE, --chunk_size CHUNK_SIZE

Chunk size

-o OVERLAP_SIZE, --overlap_size OVERLAP_SIZE

Overlap size

-m, --merge_last Concatenate final part to last contig

-b BEDFILE, --bedfile BEDFILE

BEDfile to be created with exact regions of the

original contigs corresponding to the newly created

contigs

> python CONCOCT/scripts/concoct_coverage_table.py -h

$ python CONCOCT/scripts/concoct_coverage_table.py -h

usage: concoct_coverage_table.py [-h] [--samplenames SAMPLENAMES]

bedfile bamfiles [bamfiles ...]

A script to generate the input coverage table for CONCOCT using a BEDFile.

Output is written to stdout. The BEDFile defines the regions used as

subcontigs for concoct. This makes it possible to get the coverage for

subcontigs without specifically mapping reads against the subcontigs. @author:

inodb, alneberg

positional arguments:

bedfile Contigs BEDFile with four columns representing:

'Contig ID, Start Position, End Position and SubContig

ID' respectively. The Subcontig ID is usually the same

as the Contig ID for contigs which are not cutup. This

file can be generated by the cut_up_fasta.py script.

bamfiles BAM files with mappings to the original contigs.

optional arguments:

-h, --help show this help message and exit

--samplenames SAMPLENAMES

File with sample names, one line each. Should be same

nr of bamfiles. Default sample names used are the file

names of the bamfiles, excluding the file extension.

dockerイメージも用意されている。

docker pull binpro/concoct_latest

実行方法

0、megahitでアセンブリを実行する。

１、 contigを1000bp以下に小さく分割する。

git clone https://github.com/BinPro/CONCOCT.git

python CONCOCT/scripts/cut_up_fasta.py original_contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa

-c Chunk size
-m, --merge_last Concatenate final part to last contig
-b BEDfile to be created with exact regions of the original contigs corresponding to the newly created contigs
-o Overlap size

contigs_10K.bedとcontigs_10K.faが出力される。

２、coverage depth テーブルの出力。前もってbowtie2でオリジナルのfastaにリードをmappingしてbamを作成しておく必要がある。

python CONCOCT/scripts/concoct_coverage_table.py contigs_10K.bed mapping/Sample*.sorted.bam > coverage_table.tsv

３、concoctを実行する。

concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_output/

このステップが一番時間がかかる。gzip圧縮の10GBのfastqからアセンブルしたcontigを使った時は10時間ほどかかった。

４、

merge_cutup_clustering.py concoct_output/clustering_gt1000.csv > concoct_output/clustering_merged.csv

５、

mkdir concoct_output/fasta_bins
extract_fasta_bins.py original_contigs.fa concoct_output/clustering_merged.csv --output_path concoct_output/fasta_bins

出力

f:id:kazumaxneo:20210428235707p:plain

引用

Binning metagenomic contigs by coverage and composition.

Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Lahti L, Loman NJ, Andersson AF, Quince C

Nat Methods. 2014 Nov;11(11):1144-6

関連

2019-06-13

ロングリードのドラフトアセンブリをpolishする marginpolish

Nanopore long read polish 高速なツール

2019 6/13 tweetリンク追加、誤字修正

　MarginPolishはグラフベースのアセンブリのpolisher。入力としてFASTA アセンブリとインデックス付きBAM（ONTのアセンブリ配列へのアラインメント）を受け取り、polishingしたFASTA アセンブリを生成する。

　MarginPolishはスタンドアロンでも機能するが、超高速ナノポアアセンブラShastaとマルチタスクRNNポリッシャHELENを含むアセンブリパイプラインの一部にもなっている。 HELENはMarginPolishによって生成されたイメージを操作する。

概要

MarginPolishはBAMからリードとアライメントを取得し、アライメントで観察された一致、挿入、欠失を説明する初期グラフを生成、複数の可能性のあるアラインメントに対する確率を決定する。すべてのリードが調整されると、グラフ内の各ノードについて加重調整スコアが生成され、これらを使用してグラフ内の最も可能性の高いパスが決定される。このプロセスは繰り返され、繰り返しの合間にアライメントの総重み付き尤度が減少するまで、または設定された最大繰り返し数に達するまで続く。これにより、ナノポアリードの主なエラー原因となるホモポリマーが低減される。

アセンブリワークフロー詳細については、HELENレポジトリのドキュメントに記載されている。

helen/walkthrough.md at master · kishwarshafin/helen · GitHub

In the #nanoporeconf data theatre, @mitenjain et al have been busy... Announcing a new assembler (SHASTA) and not one but two new polishers (MarginPolish, HELEN) for making large scale reference-quality human genomes. pic.twitter.com/XdLFIsGu0e
— Sam Nicholls (@samstudio8) May 23, 2019

Shastaでドラフトアセンブリし、MarginpolishとHELENでエラーを減らすフローで開発されています。ShastaやHELENについては近いうちに紹介します。

インストール

centos6でテストした。

依存

If compiling on Ubuntu, this will install all required packages: (Githubより)

apt-get -y install git make gcc g++ autoconf zlib1g-dev libcurl4-openssl-dev libbz2-dev libhdf5-dev


#MarginPolish is compiled with cmake. We recommend using the latest cmake version, but 3.7 and higher are supported:
wget https://github.com/Kitware/CMake/releases/download/v3.14.4/cmake-3.14.4-Linux-x86_64.sh && sudo mkdir /opt/cmake && sudo sh cmake-3.14.4-Linux-x86_64.sh --prefix=/opt/cmake --skip-license && sudo ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake
cmake --version

本体　Github

git clone https://github.com/UCSC-nanopore-cgl/marginPolish.git
cd marginPolish
git submodule update --init

mkdir build
cd build
cmake ..
make -j 12

> ./marginPolish

$ ./marginPolish

usage: marginPolish <BAM_FILE> <ASSEMBLY_FASTA> <PARAMS> [options]

Version: 1.0.0

Polishes the ASSEMBLY_FASTA using alignments in BAM_FILE.

Required arguments:

BAM_FILE is the alignment of reads to the assembly (or reference).

ASSEMBLY_FASTA is the reference sequence BAM file in fasta format.

PARAMS is the file with marginPolish parameters.

Default options:

-h --help : Print this help screen

-a --logLevel : Set the log level [default = info]

-t --threads : Set number of concurrent threads [default = 1]

-o --outputBase : Name to use for output files [default = 'output']

-r --region : If set, will only compute for given chromosomal region.

Format: chr:start_pos-end_pos (chr3:2000-3000).

HELEN feature generation options:

-f --produceFeatures : output features for HELEN.

-F --featureType : output features of chunks for HELEN. Valid types:

splitRleWeight: [default] run lengths split into chunks

nuclAndRlWeight: split into nucleotide and run length (RL across nucleotides)

rleWeight: weighted likelihood from POA nodes (RLE)

simpleWeight: weighted likelihood from POA nodes (non-RLE)

-L --splitRleWeightMaxRL : max run length (for 'splitRleWeight' type only) [default = 10]

-u --trueReferenceBam : true reference aligned to ASSEMBLY_FASTA, for HELEN

features. Setting this parameter will include labels

in output.

Miscellaneous supplementary output options:

-i --outputRepeatCounts : Output base to write out the repeat counts [default = NULL]

-j --outputPoaTsv : Output base to write out the poa as TSV file [default = NULL]

テストラン

１、ドラフトアセンブリのfastaに対してminimap2でロングリードをmappingし、bamを作成する。

２、bamとpolishing対象のfastaを指定する。

marginPolish long_reads.bam input.fasta \
/path/to/MarginPolish/params/allParams.np.json -o output -t 20 2>log

進捗logが大量に出力されたので2>logとした。ランが終わるとoutput.faが出力される。

まとめ

　現在もまだ開発中のようですが、マニュアルが整備されており、パフォーマンスも良かったので早めに紹介しました。テストした限り、Raconより高速に動作しつつ、polishing率もRacon/pilonよりMarginpolishの方が多くなっていました。

引用

GitHub - UCSC-nanopore-cgl/MarginPolish

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

(Omics向け) 従来のベン図表現を拡張する DiVenn

多機能なNGS分析ツール BBtools　其の３BBMap追加コマンド

（ヒト、マウス向け）GO term enrichment解析を行う GOnet

bamファイルを扱う bamM

メタゲノムのファージ配列分析webサーバー VirMiner

メタゲノムアセンブリをbinningする CONCOCT

ロングリードのドラフトアセンブリをpolishする marginpolish