2019-11-13

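BLAST-XYPlot Viewer: a BLAST service for whole-genome sequenced bacteria and archaea

Thousands of completely sequenced bacterial and archaeal genomes are now available in public repositories, and the number is growing rapidly. This information makes thorough comparative genomic studies possible. One of the most widely used tools for sequence similarity searches is BLAST (Altschul et al., 1990), which can be run from a number of web servers or locally as a standalone version. Running BLAST from a web server limits the comparison to a small number of query sequences at a time. A local BLAST installation accepts as many query sequences as needed, but extracting the information from the results still requires additional programming skills.

Genomic information is a powerful source of insight into the characteristics and functions of microorganisms. Given the sheer volume of data, its study depends on bioinformatics tools. Genome mining is used to find genes, or clusters of genes, that encode a particular biological function in sequenced genomes. One of the most successful approaches is to use as the query sequence either the most conserved gene/protein or a representative one with a known function. The hit then serves as the starting point for scanning its genomic context to find the remaining genes involved in the trait under study. Characterizing that genomic context usually involves BLAST or sequence alignment comparisons, from which functions are predicted. This strategy is time-consuming, however. Alternatively, a large-scale BLAST search can include multiple nucleotide or amino acid sequences of those genes/proteins as queries. This saves time on the bioinformatic characterization but also produces a large number of BLAST results that have to be analyzed.

Because many biological processes are encoded in gene clusters, determining the presence of a complete biosynthetic pathway or bacterial operon requires multiple searches. A tool is also needed that performs these searches in a single run and makes the results easy to analyze and display. Several tools are currently available and are very convenient for searching single and multiple genes/proteins by sequence similarity (Fong et al., 2008; Revanna et al., 2009; Despalins et al., 2011; Medema et al., 2013; Neumann et al., 2014). These tools include flexible configuration options, but the amount of data they can display at once is limited, so browsing the complete results requires additional steps.

Here the authors introduce a platform-independent, free web tool (available at http://www.blast-xyplot-viewer.icuap.buap.mx). With it, the presence, completeness, and distribution of a single gene/protein, an operon, or a complete biosynthetic pathway can be searched in a single run across a specific taxon or biological hierarchy, or even across all sequenced bacteria/archaea.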

Tutorial

http://www.blast-xyplot-viewer.icuap.buap.mx/tutorial

Usage

Go to http://www.blast-xyplot-viewer.icuap.buap.mx.

Enter a job title and an e-mail address, and choose the BLAST type. Here genomes are searched, and an E. coli genome was used.

f:id:kazumaxneo:20191110022429p:plain

Choose the database. Here all bacteria and archaea were selected.

f:id:kazumaxneo:20191110022432p:plain

Results

f:id:kazumaxneo:20191110004226p:plain

XYPlot

In the plot, zooming, dragging, and mouse-over make it possible to analyze the data as an overall distribution of thousands of BLAST results or to zoom in on specific results.

f:id:kazumaxneo:20191110024723p:plain

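The default plot shows the whole dataset, with an x-axis covering the range 0 to 360 and a y-axis containing every replicon used in the search configuration parameters. In addition to the main plot, there are two plots in which the data can be zoomed to a specific range.

Dragging a box over the lower and right-hand plots jumps to the selected region.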

f:id:kazumaxneo:20191110025053p:plain

f:id:kazumaxneo:20191113195914p:plain

According to the paper, a 20-degree window on the x-axis is sufficient for visualizing individual BLAST results and gene/protein clusters when scanning the plot.
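To get a feel for what a 20-degree window means, here is a minimal sketch. It assumes the x-axis is simply the position on a replicon scaled linearly to 0 to 360 degrees; the post does not spell out the exact mapping, so treat this as an illustration only. The function names and the 4.6 Mb length are hypothetical.

```python
def position_to_degrees(position: int, replicon_length: int) -> float:
    """Map a 1-based position on a replicon to an angle in [0, 360).

    Assumption: a plain linear scaling of position to degrees.
    """
    if replicon_length <= 0:
        raise ValueError("replicon_length must be positive")
    if not 1 <= position <= replicon_length:
        raise ValueError("position must lie within the replicon")
    return (position - 1) / replicon_length * 360.0


def window_in_bases(window_degrees: float, replicon_length: int) -> int:
    """Approximate number of bases spanned by a window of the given width."""
    return round(window_degrees / 360.0 * replicon_length)


# Hypothetical 4.6 Mb chromosome, roughly the size of an E. coli genome.
length = 4_600_000
print(position_to_degrees(1_150_001, length))
print(window_in_bases(20, length))
```

Under this assumption, a 20-degree window on a genome of that size spans a few hundred kilobases, which is wide enough to hold a typical operon or biosynthetic cluster.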

Reference

BLAST-XYPlot Viewer: A Tool for Performing BLAST in Whole-Genome Sequenced Bacteria/Archaea and Visualize Whole Results Simultaneously

Yagul Pedraza-Pérez, Rodrigo Alberto Cuevas-Vede, Ángel Bernardo Canto-Gómez, Liliana López-Pliego, Rosa María Gutiérrez-Ríos, Ismael Hernández-Lucas, Gustavo Rubín-Linares, Ygnacio Martínez-Laguna, Jesús Francisco López-Olguín, Luis Ernesto Fuentes-Ramírez

G3. 2018 Jul; 8(7): 2167–2172

2019-11-12

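BCFtools/csq: haplotype-aware annotation of VCF files

2017 Bioinformatics consensus caller haplotype

The number of sequenced exome and whole-genome samples is growing rapidly, and it is becoming important to sift quickly through the vast amount of data for the variants of greatest interest. A key step in this process is to take the sequence variants and annotate them with their functional effects. In clinical, evolutionary, and genotype-phenotype studies, accurate prediction of functional consequences is critical to downstream interpretation. Several popular programs exist for predicting the effects of variants, such as the Ensembl Variant Effect Predictor (VEP) (McLaren et al., 2016), SnpEff (Cingolani et al., 2012), and ANNOVAR (Wang et al., 2010). One significant limitation is that they work on single records: each variant is annotated in isolation, which, as shown in Figure 1 of the paper, can lead to incorrect annotation once surrounding variants in the same phase are taken into account.
Thanks to recent experimental and computational advances, phased haplotypes spanning tens of kilobases are becoming routinely available, through the reduced cost of long-range sequencing technologies (Zheng et al., 2016) and the improved accuracy of statistical phasing algorithms (Loh et al., 2016; Sharp et al., 2016) that comes with growing sample cohort sizes (McCarthy et al., 2016). The authors present a new variant consequence predictor, implemented in BCFtools/csq, that can exploit this information.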

Three types of compound variants that lead to incorrect consequence prediction when handled in a localized manner each separately rather than jointly.

Reproduced from the bcftools Consequence calling page.
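The effect of compound variants can be shown with a toy example. The sketch below is not how bcftools csq is implemented; it only illustrates, with a made-up codon and two made-up SNVs, why annotating each record separately can give a different answer from applying both substitutions on the same haplotype.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def apply_snvs(codon, snvs):
    """Return the codon after applying (offset, alt_base) substitutions."""
    bases = list(codon)
    for offset, alt in snvs:
        bases[offset] = alt
    return "".join(bases)


ref_codon = "GAA"            # glutamate (E)
snvs = [(0, "T"), (2, "C")]  # two hypothetical SNVs in the same codon, same haplotype

# Localized prediction: each record is annotated on its own.
local = [CODON_TABLE[apply_snvs(ref_codon, [snv])] for snv in snvs]
# Haplotype-aware prediction: both substitutions are applied together.
joint = CODON_TABLE[apply_snvs(ref_codon, snvs)]

print("separately:", local)
print("jointly:", joint)
```

Taken one at a time, the first SNV looks like a stop gain and the second like a missense change; applied together they produce a single missense change and no stop codon.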

bcftools

http://samtools.github.io/bcftools/

Installation

Source: GitHub

download

http://www.htslib.org/download/

#v1.9
wget https://github.com/samtools/bcftools/releases/download/1.9/bcftools-1.9.tar.bz2
tar xf bcftools-1.9.tar.bz2
cd bcftools-1.9
# Install under /usr/local; the binary is placed in /usr/local/bin.
./configure --prefix=/usr/local
make -j 8
make install

> ./bcftools

$ ./bcftools

Program: bcftools (Tools for variant calling and manipulating VCFs and BCFs)

Version: 1.9 (using htslib 1.9)

Usage: bcftools [--version|--version-only] [--help] <command> <argument>

Commands:

-- Indexing

index index VCF/BCF files

-- VCF/BCF manipulation

annotate annotate and edit VCF/BCF files

concat concatenate VCF/BCF files from the same set of samples

convert convert VCF/BCF files to different formats and back

isec intersections of VCF/BCF files

merge merge VCF/BCF files files from non-overlapping sample sets

norm left-align and normalize indels

plugin user-defined plugins

query transform VCF/BCF into user-defined formats

reheader modify VCF/BCF header, change sample names

sort sort VCF/BCF file

view VCF/BCF conversion, view, subset and filter VCF/BCF files

-- VCF/BCF analysis

call SNP/indel calling

consensus create consensus sequence by applying VCF variants

cnv HMM CNV calling

csq call variation consequences

filter filter VCF/BCF files using fixed thresholds

gtcheck check sample concordance, detect sample swaps and contamination

mpileup multi-way pileup producing genotype likelihoods

roh identify runs of autozygosity (HMM)

stats produce VCF/BCF stats

Most commands accept VCF, bgzipped VCF, and BCF with the file type detected

automatically even when streaming from a pipe. Indexed VCF and BCF will work

in all situations. Un-indexed VCF and BCF and streams will work in most but

not all situations.

> bcftools csq

# bcftools csq

About: Haplotype-aware consequence caller.

Usage: bcftools csq [options] in.vcf

Required options:

-f, --fasta-ref <file> reference file in fasta format

-g, --gff-annot <file> gff3 annotation file

CSQ options:

-c, --custom-tag <string> use this tag instead of the default BCSQ

-l, --local-csq localized predictions, consider only one VCF record at a time

-n, --ncsq <int> maximum number of consequences to consider per site [16]

-p, --phase <a|m|r|R|s> how to handle unphased heterozygous genotypes: [r]

a: take GTs as is, create haplotypes regardless of phase (0/1 -> 0|1)

m: merge *all* GTs into a single haplotype (0/1 -> 1, 1/2 -> 1)

r: require phased GTs, throw an error on unphased het GTs

R: create non-reference haplotypes if possible (0/1 -> 1|1, 1/2 -> 1|2)

s: skip unphased hets

Options:

-e, --exclude <expr> exclude sites for which the expression is true

--force run even if some sanity checks fail

-i, --include <expr> select sites for which the expression is true

-o, --output <file> write output to a file [standard output]

-O, --output-type <b|u|z|v|t> b: compressed BCF, u: uncompressed BCF, z: compressed VCF

v: uncompressed VCF, t: plain tab-delimited text output [v]

-q, --quiet suppress warning messages. Can be given two times for even less messages

-r, --regions <region> restrict to comma-separated list of regions

-R, --regions-file <file> restrict to regions listed in a file

-s, --samples <-|list> samples to include or "-" to apply all variants and ignore samples

-S, --samples-file <file> samples to include

-t, --targets <region> similar to -r but streams rather than index-jumps

-T, --targets-file <file> similar to -R but streams rather than index-jumps

Example:

bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf

# GFF3 annotation files can be downloaded from Ensembl. e.g. for human:

ftp://ftp.ensembl.org/pub/current_gff3/homo_sapiens/

ftp://ftp.ensembl.org/pub/grch37/release-84/gff3/homo_sapiens/
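The `-p, --phase` modes listed in the help text above can be sketched as a small function. This is only an illustration built from the examples in the help text (for instance `0/1 -> 0|1` for `a`); it is not bcftools code, the function name is hypothetical, and the choice of the first non-reference allele for mode `m` is an assumption made to match the `1/2 -> 1` example.

```python
def resolve_unphased_het(gt: str, mode: str = "r"):
    """Show what each --phase mode does to one unphased heterozygous genotype."""
    alleles = gt.split("/")
    if "|" in gt or len(set(alleles)) == 1:
        return gt  # phased or homozygous: the option does not apply here
    non_ref = [a for a in alleles if a != "0"]
    if mode == "a":   # take GTs as is, regardless of phase
        return "|".join(alleles)
    if mode == "m":   # merge into a single haplotype (assumed: first non-ref allele)
        return non_ref[0]
    if mode == "r":   # require phased GTs
        raise ValueError(f"unphased heterozygous genotype: {gt}")
    if mode == "R":   # create non-reference haplotypes if possible
        return "|".join(a if a != "0" else non_ref[0] for a in alleles)
    if mode == "s":   # skip unphased hets
        return None
    raise ValueError(f"unknown mode: {mode}")


for mode in "amRs":
    print(mode, resolve_unphased_het("0/1", mode), resolve_unphased_het("1/2", mode))
```

Because the default is `r`, an input VCF containing unphased heterozygous genotypes stops with an error unless one of the other modes is chosen.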

How to run

A run requires the reference FASTA, a GFF3 gene annotation, and a VCF file. Only GFF3 (GFF version 3) files in the Ensembl format are supported.

# compressed BCF output
bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ob -o out.bcf

# uncompressed VCF output
bcftools csq -f hs37d5.fa -g Homo_sapiens.GRCh37.82.gff3.gz in.vcf -Ov -o out.vcf

-O, --output-type <b|u|z|v|t> b: compressed BCF, u: uncompressed BCF, z: compressed VCF
v: uncompressed VCF, t: plain tab-delimited text output [v]

Reference
BCFtools/csq: haplotype-aware variant consequences

Danecek P, McCarthy SA

Bioinformatics. 2017 Jul 1;33(13):2037-2039

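SNPdat: annotating VCF files (non-model organisms supported)

2013 VCF BMC Bioinformatics download Ensembl annotation dbSNP human genome SNP

Single nucleotide polymorphisms (SNPs) are the most common genetic variants found in vertebrates and invertebrates [ref.1]. SNPs are routinely used as the molecular marker of choice in association studies [ref.2], gene mapping [ref.3], and population genetics [ref.4]. With improved technology and lower costs, researchers are identifying thousands of variants, including rare ones, that may affect phenotypic variation [ref.5, 6]. Non-bioinformatics researchers increasingly need to analyze ever larger datasets. Disease susceptibility, agriculture, and evolution are among the fields concerned with understanding how SNPs affect the biological function and phenotypic variation of complex traits [ref.7-9]. Annotating large numbers of SNPs with this kind of information by hand, however, proves difficult and impractical.

Many bioinformatics tools for SNP annotation already exist (SNPit [ref.10], SNPnexus [ref.11], Snap [ref.12], SNP Function Portal [ref.13], SNPper [ref.14], Fans [ref.15], FunctSNP [ref.16], Annovar [ref.17]). Although reference sequences for more than 50 eukaryotic species are available from Ensembl (release 65, at the time the paper was written) [ref.18], very few tools currently allow analysis of non-human SNP data (e.g. Snat, Fans, FunctSNP, Annovar). Many of the more general tools can analyze only species that have SNP information in dbSNP, and some require the SNPs being annotated to be present in dbSNP already. Some tools try to work around the problem of unknown variants by returning information on known SNPs nearby.
The authors developed an easy-to-use SNP data analysis tool (SNPdat) for organisms that are not supported by other tools and may have few annotated SNPs. SNPdat can equally be used to analyze datasets from well-studied organisms in which SNPs have been sampled deeply.
SNPdat is a cross-platform command-line tool written in Perl that can easily be incorporated into existing SNP discovery or annotation pipelines and can also be run by users on a standard desktop machine. (The rest is omitted.)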

Installation

Source: GitHub

git clone https://github.com/agdoran/snpdat.git 
cd snpdat/
perl SNPdat_v1.0.5.pl -h

> perl SNPdat_v1.0.5.pl -h

$ perl SNPdat_v1.0.5.pl -h

SNPdat v1.0.5

start time:

2019年 11月 7日木曜日 00時52分36秒 JST

SNPdat v1.0.5

SNPdat is a high throughput analysis tool that can provide a comprehensive annotation of both novel and known single nucleotide polymorphisms (SNPs).

SNPdat requires that each file is specified when running the program. There are 3 mandatory file definitions.

Usage:

perl SNPdat -i Input_file -f Fasta_file -g Gene_Transfer_File

Required:

-i Input file

-g Gene transfer file (GTF)

-f FASTA formated sequence file

Optional:

-d a dbSNP ASN_FLAT file processed using SNPdat_parse_dbsnp.pl (optional)

-s a file containing a summary of the queried SNPs (optional)

NOTE:If no output file is specified, results will be printed to 'Input_file.summary'

-o output_file specified by the user (optional)

NOTE:If no output file is specified, results will be printed to 'Input_file.output'

Advanced:

-x retrieve sequence information from the next/previous feature should a codon cross that boundary.

User can specify a comma separated list of features from the GTF. This is case-sensitive.

This is only recommended for advanced users who understand what it does.

By default this is not set. See website/manual for more information.

USAGE:

-x feature1,feature2

e.g.

-x exon

-x CDS

-x exon,CDS

Info:

-h This wonderful help page

-v This version of SNPdat

For more instuctions see the SNPdat webage:

http://code.google.com/p/snpdat/

How to run

A run requires a genome FASTA file, a VCF file of variant calls (or a tab-delimited text file), and a GTF file of gene annotations.

perl SNPdat_v1.0.5.pl -i input.vcf -f reference.fasta \
 -g gene_annotation.gtf -o output -s vcf.summary

-i Input file (Mandatory)
-g Gene transfer file (GTF) (Mandatory)
-f FASTA formated sequence file (Mandatory)
-d a dbSNP ASN_FLAT file processed using SNPdat_parse_dbsnp.pl (optional)
-s a file containing a summary of the queried SNPs (optional)
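When several VCF files need annotating, the command above can be generated per sample. The following is a minimal sketch that uses only the flags shown in the help text; the helper name and the sample names are hypothetical, and the script path assumes the current directory is the cloned `snpdat/` folder.

```python
import shlex


def build_snpdat_command(input_file, fasta, gtf, output=None, summary=None,
                         script="SNPdat_v1.0.5.pl"):
    """Assemble the argument list for one SNPdat run."""
    cmd = ["perl", script, "-i", input_file, "-f", fasta, "-g", gtf]
    if output is not None:
        cmd += ["-o", output]    # otherwise SNPdat writes 'Input_file.output'
    if summary is not None:
        cmd += ["-s", summary]   # otherwise SNPdat writes 'Input_file.summary'
    return cmd


# Hypothetical sample names; replace them with real VCF files.
for sample in ["sampleA", "sampleB"]:
    cmd = build_snpdat_command(
        f"{sample}.vcf", "reference.fasta", "gene_annotation.gtf",
        output=f"{sample}.output", summary=f"{sample}.summary",
    )
    # Print the command; pass cmd to subprocess.run(cmd, check=True) to execute it.
    print(shlex.join(cmd))
```

Building the argument list rather than a single string avoids quoting problems if a file name contains spaces.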

vcf.summary

f:id:kazumaxneo:20191109150011p:plain

output

f:id:kazumaxneo:20191109150512p:plain

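Other scripts

The package also includes helper scripts, such as one that downloads the genome FASTA file and the annotation GTF from a given Ensembl release. It runs interactively.

perl GTF_FASTA_finder_v1.0.4.pl

First specify the release version. For release-35, for example, type the number 15 shown at the left edge.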

f:id:kazumaxneo:20191109003837p:plain

Next, choose the organism whose genome sequence is to be downloaded. In this release version the human genome is number 23.

f:id:kazumaxneo:20191109003854p:plain

(Compared with older releases, the latest release offers considerably more genomes.)

The reference FASTA is downloaded. Next comes the annotation GTF: choose the organism again. Here yeast was selected.

f:id:kazumaxneo:20191109004235p:plain

When the download finishes, a message is printed and the interactive mode ends.

f:id:kazumaxneo:20191109004523p:plain

Use the resulting FASTA and GTF.

The other script downloads each version of dbSNP.

perl GTF_FASTA_finder_v1.0.4.pl

The script enters interactive mode; choose a version.

f:id:kazumaxneo:20191109005744p:plain

The procedure is the same, so it is omitted here.

Reference
Snpdat: easy and rapid annotation of results from de novo snp discovery projects for model and non-model organisms

Doran AG, Creevey CJ

BMC Bioinformatics. 2013 Feb 8;14:45

2019-11-10

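Jannovar: annotating VCF files (for human and mouse)

2014 VCF Human Mutation (Journal) annotation animal human exome human genome mouse variant ranking SNV small indel small RNA Variant annotations in VCF format

Whole-exome sequencing (WES), the targeted sequencing of the protein-coding exons of the human genome, is a powerful and cost-effective method for identifying novel Mendelian disease genes and is increasingly used in diagnostic settings [Bamshad et al., 2011; Robinson et al., 2011; Shendure, 2011; Choi et al., 2012]. Since the introduction of next-generation sequencing (NGS) in 2005 [Margulies et al., 2005; Shendure et al., 2005] and the first identification of a Mendelian disease gene by WES in 2010 [Ng et al., 2010], more than 100 novel disease genes identified by WES have been published [Rabbani et al., 2012]. The cost of WES is now below US$1,000 and falling rapidly, ushering in a new era in which WES is used for research and clinical diagnostics in human genetics and other fields of medicine.

Although generating WES data is rapidly becoming easier and cheaper, analyzing and interpreting these data remains a challenge. Depending on the definition of the target region and the capture technology used, a typical WES experiment identifies more than 20,000 variants [Ng et al., 2009]. Identifying variants from raw sequencing reads involves many processing steps, including mapping the reads to a reference genome and variant calling with one or more algorithms. The results of this analysis are stored in a Variant Call Format (VCF) file, which contains the chromosomal position, read depth, quality, and other metadata for each variant identified [Danecek et al., 2011]. A key step in interpreting these data is annotating the variants with their potential effects on genes and transcripts, that is, converting variants in the VCF file, which are expressed in chromosomal coordinates (e.g., chr11:g.1857751C>G), into gene-based variant annotations (e.g., c.655C>G:p.P219A in the gene SYT8). Most biological or medical interpretation attempts to assess the potential effect of a variant on the gene product.

Many tools have been developed for annotating genomic variation from VCF files and other sources, including ANNOVAR [Wang et al., 2010], the cloud-computing framework VAT [Habegger et al., 2012], and the Variant Effect Predictor. However, these tools are not designed to perform pedigree analysis, many do not provide accurate annotations for certain classes of variants, such as those in 5' or 3' untranslated regions (UTRs), and they cannot be used as software libraries. Jannovar was developed to meet the need for a flexible, well-tested software library for variant annotation and pedigree analysis in advanced exome sequencing software pipelines, and it additionally provides a fast, easy-to-use standalone Java program for annotating VCF files. Jannovar is written in the Java programming language, and developers can use it as a component of programs for variant interpretation, visualization, prioritization, and related tasks. Jannovar builds transcript definition files from University of California, Santa Cruz (UCSC) Genome Browser [Meyer et al., 2013], NCBI RefSeq [Pruitt et al., 2012], or Ensembl [Flicek et al., 2013] data. It uses a fast and reliable interval-tree-based algorithm to find the transcripts affected by a variant and generates Human Genome Variation Society (HGVS; Antonarakis, 1998) compliant variant nomenclature for exonic variants, variants located in 5' and 3' UTRs, and non-coding RNA variants.

The Jannovar source code is available from the GitHub repository at https://github.com/charite/jannovar. A precompiled version (Jannovar.jar) is available, together with a detailed tutorial, from the homepage at http://compbio.charite.de. Used as a standalone application, Jannovar provides fast, HGVS-compliant annotation of VCF files from exome or genome sequencing and can filter variants according to the mode of inheritance. Jannovar can also be used as a Java programming library within exome-filtering programs or pipelines. (The rest is omitted.)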

Installation

Dependencies

Java >=8

Source: GitHub

# bioconda
conda install -c bioconda -y jannovar-cli

> jannovar

$ jannovar

usage: jannovar-cli [-h] [--version] {annotate-pos,annotate-csv,annotate-vcf,db-list,download,statistics,rest-server,hgvs-to-vcf} ...

jannovar-cli: error: too few arguments

$ jannovar -h

usage: jannovar-cli [-h] [--version] {annotate-pos,annotate-csv,annotate-vcf,db-list,download,statistics,rest-server,hgvs-to-vcf} ...

Jannovar CLI performs a series of VCF annotation tasks, including predicted molecular impact of variants and annotation of compatible Mendelian inheritance.

positional arguments:

{annotate-pos,annotate-csv,annotate-vcf,db-list,download,statistics,rest-server,hgvs-to-vcf}

annotate-pos annotate genomic changes given on the command line

annotate-csv Annotate a csv file

annotate-vcf annotate VCF files

db-list list databases available for download

download download transcript databases

statistics compute statistics about VCF file

rest-server start REST server

hgvs-to-vcf project transcript-level to chromosome-level changes

optional arguments:

-h, --help show this help message and exit

--version Show Jannovar version

You can find out more at http://jannovar.rtfd.org

> jannovar annotate-vcf -h

$ jannovar annotate-vcf -h

usage: jannovar-cli annotate-vcf [-h] -i INPUT_VCF -o OUTPUT_VCF -d DATABASE [--interval INTERVAL] [--pedigree-file PEDIGREE_FILE] [--annotate-as-singleton-pedigree] [--ref-fasta REF_FASTA] [--dbsnp-vcf DBSNP_VCF] [--dbsnp-prefix DBSNP_PREFIX]

[--exac-vcf EXAC_VCF] [--exac-prefix EXAC_PREFIX] [--gnomad-exomes-vcf GNOMAD_EXOMES_VCF] [--gnomad-exomes-prefix GNOMAD_EXOMES_PREFIX] [--gnomad-genomes-vcf GNOMAD_GENOMES_VCF]

[--gnomad-genomes-prefix GNOMAD_GENOMES_PREFIX] [--uk10k-vcf UK10K_VCF] [--uk10k-prefix UK10K_PREFIX] [--g1k-vcf G1K_VCF] [--g1k-prefix G1K_PREFIX] [--clinvar-vcf CLINVAR_VCF] [--clinvar-prefix CLINVAR_PREFIX]

[--cosmic-vcf COSMIC_VCF] [--cosmic-prefix COSMIC_PREFIX] [--one-parent-gt-filtered-filters-affected] [--inheritance-anno-use-filters] [--dbnsfp-tsv DBNSFP_TSV] [--dbnsfp-col-contig DBNSFP_COL_CONTIG]

[--dbnsfp-col-position DBNSFP_COL_POSITION] [--dbnsfp-prefix DBNSFP_PREFIX] [--dbnsfp-columns DBNSFP_COLUMNS] [--bed-annotation BED_ANNOTATION] [--vcf-annotation VCF_ANNOTATION] [--tsv-annotation TSV_ANNOTATION]

[--use-threshold-filters] [--gt-thresh-filt-min-cov-het GT_THRESH_FILT_MIN_COV_HET] [--gt-thresh-filt-min-cov-hom-alt GT_THRESH_FILT_MIN_COV_HOM_ALT] [--gt-thresh-filt-max-cov GT_THRESH_FILT_MAX_COV]

[--gt-thresh-filt-min-gq GT_THRESH_FILT_MIN_GQ] [--gt-thresh-filt-min-aaf-het GT_THRESH_FILT_MIN_AAF_HET] [--gt-thresh-filt-max-aaf-het GT_THRESH_FILT_MAX_AAF_HET]

[--gt-thresh-filt-min-aaf-hom-alt GT_THRESH_FILT_MIN_AAF_HOM_ALT] [--gt-thresh-filt-max-aaf-hom-ref GT_THRESH_FILT_MAX_AAF_HOM_REF] [--var-thresh-max-allele-freq-ad VAR_THRESH_MAX_ALLELE_FREQ_AD]

[--var-thresh-max-allele-freq-ar VAR_THRESH_MAX_ALLELE_FREQ_AR] [--var-thresh-max-hom-alt-exac VAR_THRESH_MAX_HOM_ALT_EXAC] [--var-thresh-max-hom-alt-g1k VAR_THRESH_MAX_HOM_ALT_G1K] [--use-advanced-pedigree-filters]

[--de-novo-max-parent-ad2 DE_NOVO_MAX_PARENT_AD2] [--enable-off-target-filter] [--utr-is-off-target] [--intronic-splice-is-off-target] [--no-escape-ann-field] [--show-all] [--no-3-prime-shifting] [--3-letter-amino-acids]

[--disable-parent-gt-is-filtered] [--version] [--report-no-progress] [-v] [-vv] [--http-proxy HTTP_PROXY] [--https-proxy HTTPS_PROXY] [--ftp-proxy FTP_PROXY]

Perform annotation of a single VCF file

optional arguments:

-h, --help show this help message and exit

--version Show Jannovar version

Required arguments:

-i INPUT_VCF, --input-vcf INPUT_VCF

Path to input VCF file

-o OUTPUT_VCF, --output-vcf OUTPUT_VCF

Path to output VCF file

-d DATABASE, --database DATABASE

Path to database .ser file

--interval INTERVAL Interval with regions to annotate (optional)

Annotation Arguments (optional):

--pedigree-file PEDIGREE_FILE

Pedigree file to use for Mendelian inheritance annotation

--annotate-as-singleton-pedigree

Annotate VCF file with single individual as singleton pedigree (singleton assumed to be affected)

--ref-fasta REF_FASTA Path to FAI-indexed reference FASTA file, required for dbSNP/ExAC/UK10K-based annotation

--dbsnp-vcf DBSNP_VCF Path to dbSNP VCF file, activates dbSNP annotation

--dbsnp-prefix DBSNP_PREFIX

Prefix for dbSNP annotations

--exac-vcf EXAC_VCF Path to ExAC VCF file, activates ExAC annotation

--exac-prefix EXAC_PREFIX

Prefix for ExAC annotations

--gnomad-exomes-vcf GNOMAD_EXOMES_VCF

Path to gnomAD exomes VCF file, activates gnomAD exomes annotation

--gnomad-exomes-prefix GNOMAD_EXOMES_PREFIX

Prefix for ExgnomAD exomes AC annotations

--gnomad-genomes-vcf GNOMAD_GENOMES_VCF

Path to gnomAD genomes VCF file, activates gnomAD genomes annotation

--gnomad-genomes-prefix GNOMAD_GENOMES_PREFIX

Prefix for ExgnomAD genomes AC annotations

--uk10k-vcf UK10K_VCF Path to UK10K VCF file, activates UK10K annotation

--uk10k-prefix UK10K_PREFIX

Prefix for UK10K annotations

--g1k-vcf G1K_VCF Path to thousand genomes VCF file, activates thousand genomes annotation

--g1k-prefix G1K_PREFIX

Prefix for thousand genomes annotations

--clinvar-vcf CLINVAR_VCF

Path to ClinVar file, activates ClinVar annotation

--clinvar-prefix CLINVAR_PREFIX

Prefix for ClinVar annotations

--cosmic-vcf COSMIC_VCF

Path to COSMIC file, activates COSMIC annotation

--cosmic-prefix COSMIC_PREFIX

Prefix for COSMIC annotations

--one-parent-gt-filtered-filters-affected

If one parent's genotype is affected, apply OneParentGtFiltered filter to child

--inheritance-anno-use-filters

Use filters in inheritance mode annotation

Annotation with dbNSFP (experimental; optional):

--dbnsfp-tsv DBNSFP_TSV

Patht to dbNSFP TSV file

--dbnsfp-col-contig DBNSFP_COL_CONTIG

Column index of contig in dbNSFP

--dbnsfp-col-position DBNSFP_COL_POSITION

Column index of position in dbNSFP

--dbnsfp-prefix DBNSFP_PREFIX

Prefix for dbNSFP annotations

--dbnsfp-columns DBNSFP_COLUMNS

Columns from dbDSFP file to use for annotation

BED-based Annotation (experimental; optional):

--bed-annotation BED_ANNOTATION

Add BED file to use for annotating. The value must be of the format "pathToBed:infoField:description[:colNo]".

Generic VCF-based Annotation (experimental; optional):

--vcf-annotation VCF_ANNOTATION

Add VCF file to use for annotating. The value must be of the format "pathToVfFile:prefix:field1,field2,field3".

TSV-based Annotation (experimental; optional):

--tsv-annotation TSV_ANNOTATION

Add TSV file to use for annotating. The value must be of the format "pathToTsvFile:oneBasedOffset:colContig:colStart:colEnd:colRef(or=0):colAlt(or=0):isRefAnnotated(R=yes,A=no):colValue:fieldType:fieldName:

fieldDescription:accumulationStrategy".

Threshold-filter related arguments:

--use-threshold-filters

Use threshold-based filters

--gt-thresh-filt-min-cov-het GT_THRESH_FILT_MIN_COV_HET

Minimal coverage for het. call

--gt-thresh-filt-min-cov-hom-alt GT_THRESH_FILT_MIN_COV_HOM_ALT

Minimal coverage for hom. alt calls

--gt-thresh-filt-max-cov GT_THRESH_FILT_MAX_COV

Maximal coverage for a sample

--gt-thresh-filt-min-gq GT_THRESH_FILT_MIN_GQ

Minimal genotype call quality

--gt-thresh-filt-min-aaf-het GT_THRESH_FILT_MIN_AAF_HET

Minimal het. call alternate allele fraction

--gt-thresh-filt-max-aaf-het GT_THRESH_FILT_MAX_AAF_HET

Maximal het. call alternate allele fraction

--gt-thresh-filt-min-aaf-hom-alt GT_THRESH_FILT_MIN_AAF_HOM_ALT

Minimal hom. alt call alternate allele fraction

--gt-thresh-filt-max-aaf-hom-ref GT_THRESH_FILT_MAX_AAF_HOM_REF

Maximal hom. ref call alternate allele fraction

--var-thresh-max-allele-freq-ad VAR_THRESH_MAX_ALLELE_FREQ_AD

Maximal allele fraction for autosomal dominant inheritance mode

--var-thresh-max-allele-freq-ar VAR_THRESH_MAX_ALLELE_FREQ_AR

Maximal allele fraction for autosomal recessive inheritance mode

--var-thresh-max-hom-alt-exac VAR_THRESH_MAX_HOM_ALT_EXAC

Maximal count in homozygous state in ExAC before ignoring

--var-thresh-max-hom-alt-g1k VAR_THRESH_MAX_HOM_ALT_G1K

Maximal count in homozygous state in ExAC before ignoring

--use-advanced-pedigree-filters

Use advanced pedigree-based filters (mainly useful for de novo variants)

--de-novo-max-parent-ad2 DE_NOVO_MAX_PARENT_AD2

Maximal support of alternative allele in parent for de novo variants.

Exome on/off target filters:

--enable-off-target-filter

Enable filter for on/off-target based on effect impact

--utr-is-off-target Make UTR count as off-target (default is to count UTR as on-target)

--intronic-splice-is-off-target

Make intronic (non-consensus site) splice region count as off-target (default is to count as on-target)

Other, optional Arguments:

--no-escape-ann-field Disable escaping of INFO/ANN field in VCF output

--show-all Show all effects

--no-3-prime-shifting Disable shifting towards 3' of transcript

--3-letter-amino-acids

Enable usage of 3 letter amino acid codes

--disable-parent-gt-is-filtered

Verbosity Options:

--report-no-progress Disable progress report, more quiet mode

-v, --verbose Enable verbose mode

-vv, --very-verbose Enable very verbose mode

Proxy Options:

Configuration related to Proxy, note that environment variables *_proxy and *_PROXY are also interpreted

--http-proxy HTTP_PROXY

Set HTTP proxy to use, if any

--https-proxy HTTPS_PROXY

Set HTTPS proxy to use, if any

--ftp-proxy FTP_PROXY Set FTP proxy to use, if any

> jannovar annotate-csv -h

$ jannovar annotate-csv -h

usage: jannovar-cli annotate-csv [-h] -d DATABASE -i INPUT -c CHR -p POS -r REF -a ALT [-t {Default,TDF,RFC4180,Excel,MySQL}] [--header] [--show-all] [--no-3-prime-shifting] [--3-letter-amino-acids] [--version] [--report-no-progress] [-v] [-vv]

[--http-proxy HTTP_PROXY] [--https-proxy HTTPS_PROXY] [--ftp-proxy FTP_PROXY]

Perform annotation of genomic changes given on the command line

optional arguments:

-h, --help show this help message and exit

--version Show Jannovar version

Required arguments:

-d DATABASE, --database DATABASE

Path to database .ser file

-i INPUT, --input INPUT

CSV file

-c CHR, --chr CHR Column of chr (1 based)

-p POS, --pos POS Column of pos (1 based)

-r REF, --ref REF Column of ref (1 based)

-a ALT, --alt ALT Column of alt (1 based)

Additional CSV arguments (optional):

-t {Default,TDF,RFC4180,Excel,MySQL}, --type {Default,TDF,RFC4180,Excel,MySQL}

Type of csv file.

--header Set if the file contains a header.

Optional Arguments:

--show-all Show all effects

--no-3-prime-shifting Disable shifting towards 3' of transcript

--3-letter-amino-acids

Enable usage of 3 letter amino acid codes

Verbosity Options:

--report-no-progress Disable progress report, more quiet mode

-v, --verbose Enable verbose mode

-vv, --very-verbose Enable very verbose mode

Proxy Options:

Configuration related to Proxy, note that environment variables *_proxy and *_PROXY are also interpreted

--http-proxy HTTP_PROXY

Set HTTP proxy to use, if any

--https-proxy HTTPS_PROXY

Set HTTPS proxy to use, if any

--ftp-proxy FTP_PROXY Set FTP proxy to use, if any

Example: java -jar Jannovar.jar annotate-csv -d hg19_refseq.ser -c 1 -p 2 -r 3 -r 4 -t TDF --header -i input.csv

> jannovar statistics -h

$ jannovar statistics -h

usage: jannovar-cli statistics [-h] -i INPUT_VCF -o OUTPUT_REPORT -d DATABASE [--version] [--report-no-progress] [-v] [-vv] [--http-proxy HTTP_PROXY] [--https-proxy HTTPS_PROXY] [--ftp-proxy FTP_PROXY]

Compute statistics about variants in VCF file

optional arguments:

-h, --help show this help message and exit

--version Show Jannovar version

Required arguments:

-i INPUT_VCF, --input-vcf INPUT_VCF

Path to input VCF file

-o OUTPUT_REPORT, --output-report OUTPUT_REPORT

Path to output report TXT file

-d DATABASE, --database DATABASE

Path to database .ser file

Verbosity Options:

--report-no-progress Disable progress report, more quiet mode

-v, --verbose Enable verbose mode

-vv, --very-verbose Enable very verbose mode

Proxy Options:

Configuration related to Proxy, note that environment variables *_proxy and *_PROXY are also interpreted

--http-proxy HTTP_PROXY

Set HTTP proxy to use, if any

--https-proxy HTTPS_PROXY

Set HTTPS proxy to use, if any

--ftp-proxy FTP_PROXY Set FTP proxy to use, if any

> jannovar annotate-pos -h

$ jannovar annotate-pos -h

usage: jannovar-cli annotate-pos [-h] -d DATABASE -c GENOMIC_CHANGE [--show-all] [--no-3-prime-shifting] [--3-letter-amino-acids] [--version] [--report-no-progress] [-v] [-vv] [--http-proxy HTTP_PROXY] [--https-proxy HTTPS_PROXY]

[--ftp-proxy FTP_PROXY]

Perform annotation of genomic changes given on the command line

optional arguments:

-h, --help show this help message and exit

--version Show Jannovar version

Required arguments:

-d DATABASE, --database DATABASE

Path to database .ser file

-c GENOMIC_CHANGE, --genomic-change GENOMIC_CHANGE

Genomic change to annotate, you can give multiple ones

Optional Arguments:

--show-all Show all effects

--no-3-prime-shifting Disable shifting towards 3' of transcript

--3-letter-amino-acids

Enable usage of 3 letter amino acid codes

Verbosity Options:

--report-no-progress Disable progress report, more quiet mode

-v, --verbose Enable verbose mode

-vv, --very-verbose Enable very verbose mode

Proxy Options:

Configuration related to Proxy, note that environment variables *_proxy and *_PROXY are also interpreted

--http-proxy HTTP_PROXY

Set HTTP proxy to use, if any

--https-proxy HTTPS_PROXY

Set HTTPS proxy to use, if any

--ftp-proxy FTP_PROXY Set FTP proxy to use, if any

Example: java -jar Jannovar.jar annotate-pos -d hg19_refseq.ser -c 'chr1:12345C>A'

> jannovar download -h

$ jannovar download -h

usage: jannovar-cli download [-h] -d DATABASE [-s DATA_SOURCE_LIST] [--download-dir DOWNLOAD_DIR] [--gene-ids GENE_IDS [GENE_IDS ...]] [-o OUTPUT_FILE] [--version] [--report-no-progress] [-v] [-vv] [--http-proxy HTTP_PROXY]

[--https-proxy HTTPS_PROXY] [--ftp-proxy FTP_PROXY]

Download transcript database

optional arguments:

-h, --help show this help message and exit

--version Show Jannovar version

Required arguments:

-d DATABASE, --database DATABASE

Name of database to download, can be given multiple times

Optional Arguments:

-s DATA_SOURCE_LIST, --data-source-list DATA_SOURCE_LIST

INI file with data source list

--download-dir DOWNLOAD_DIR

Path to download directory

--gene-ids GENE_IDS [GENE_IDS ...]

Optional list of genes to limit creation of database to

-o OUTPUT_FILE, --output-file OUTPUT_FILE

Optional path to output file

Verbosity Options:

--report-no-progress Disable progress report, more quiet mode

-v, --verbose Enable verbose mode

-vv, --very-verbose Enable very verbose mode

Proxy Options:

Configuration related to Proxy, note that environment variables *_proxy and *_PROXY are also interpreted

--http-proxy HTTP_PROXY

Set HTTP proxy to use, if any

--https-proxy HTTPS_PROXY

Set HTTPS proxy to use, if any

--ftp-proxy FTP_PROXY Set FTP proxy to use, if any

How to run

1. Prepare the database (first time only)

Download the RefSeq transcript database for hg19/GRCh37.

jannovar download -d hg19/refseq

After the download, the database file data/hg19_refseq.ser is created.

f:id:kazumaxneo:20191106150720p:plain

The following databases are available.

> jannovar db-list

$ jannovar db-list

Options

JannovarDBOptions [dataSourceFiles=[bundle:///default_sources.ini], isReportProgress()=true, getHttpProxy()=null, getHttpsProxy()=null, getFtpProxy()=null]

Available data sources:

hg18/ucsc

hg18/ensembl

hg18/refseq

hg18/refseq_curated

hg19/ucsc

hg19/ensembl

hg19/refseq

hg19/refseq_curated

hg19/refseq_interim

hg19/refseq_interim_curated

hg38/ucsc

hg38/ensembl

hg38/refseq

hg38/refseq_curated

mm9/ucsc

mm9/ensembl

mm9/refseq

mm9/refseq_curated

mm10/ucsc

mm10/ensembl

mm10/refseq

mm10/refseq_curated

rn6/refseq

rn6/refseq_curated

The available builds are hg18, hg19, and hg38 for human, mm9 and mm10 for mouse, and rn6 for rat.

2. Annotate the variants in a VCF file

Specify the VCF of variants. Here small.vcf from the Jannovar GitHub repository is used.

jannovar annotate-vcf -d data/hg19_refseq.ser \
 -i jannovar/examples/small.vcf -o output.vcf

Input VCF

f:id:kazumaxneo:20191106152451p:plain

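Output VCF. Annotations have been assigned.

f:id:kazumaxneo:20191106151803p:plain

Labels such as "LOW" and "MODERATE" are the variant impact levels assigned according to the definitions in "Variant annotations in VCF format".

An explanation is also given in "Variant Effects — Jannovar 0.11.0 documentation".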
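To pull the impact level out of an annotated VCF, the ANN entry of the INFO column can be split on `|`. The sketch below assumes the 16-field layout of the "Variant annotations in VCF format" specification as I recall it; check the `##INFO=<ID=ANN,...>` header line of your own output before relying on the field order. The INFO string in the example is made up (the gene IDs and transcript ID are placeholders).

```python
# Field order assumed from the "Variant annotations in VCF format" specification.
ANN_FIELDS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/cDNA.length", "CDS.pos/CDS.length",
    "AA.pos/AA.length", "Distance", "ERRORS/WARNINGS/INFO",
]


def parse_ann(info: str):
    """Return one dict per annotation found in a VCF INFO string."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            return [
                dict(zip(ANN_FIELDS, annotation.split("|")))
                for annotation in entry[len("ANN="):].split(",")
            ]
    return []


# Made-up INFO column with a single annotation (placeholder IDs).
info = ("DP=20;ANN=G|missense_variant|MODERATE|SYT8|GENE1|transcript|TX1|"
        "Coding|1/1|c.655C>G|p.(Pro219Ala)|655/1000|655/900|219/300||")
for ann in parse_ann(info):
    print(ann["Gene_Name"], ann["Annotation"], ann["Annotation_Impact"], ann["HGVS.c"])
```

Filtering on `Annotation_Impact` is a quick way to keep, say, only HIGH and MODERATE variants for a first look.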

Other commands

・jannovar annotate-pos - Perform annotation of genomic changes given on the command line

Quickly check the annotation of a given sequence change. A C-to-A change at position 12345 of chr1 is specified as 'chr1:12345C>A' (the format is {CHROMOSOME}:{POSITION}{REF}>{ALT}). The example below specifies a second position as well.

jannovar annotate-pos -d data/hg19_refseq.ser \
 -c 'chr1:12345C>A' -c 'chr1:12346C>A'

Output

#change effect hgvs_annotation messages

chr1:12345C>A NON_CODING_TRANSCRIPT_INTRON_VARIANT DDX11L1:NR_046018.2:n.354+118C>A: .

chr1:12346C>A NON_CODING_TRANSCRIPT_INTRON_VARIANT DDX11L1:NR_046018.2:n.354+119C>A: .
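The `{CHROMOSOME}:{POSITION}{REF}>{ALT}` strings are easy to build or validate before passing them to `-c`. This is a minimal sketch covering only the substitution form shown above; the regular expression and the function names are my own and are not part of Jannovar.

```python
import re

# Matches only the simple substitution form described in the text.
CHANGE_RE = re.compile(
    r"^(?P<chrom>[^:\s]+):(?P<pos>\d+)(?P<ref>[ACGTN]+)>(?P<alt>[ACGTN]+)$"
)


def parse_change(text: str):
    """Split 'chr1:12345C>A' into (chromosome, position, ref, alt)."""
    match = CHANGE_RE.match(text)
    if match is None:
        raise ValueError(f"not a valid genomic change: {text!r}")
    return match["chrom"], int(match["pos"]), match["ref"], match["alt"]


def format_change(chrom: str, pos: int, ref: str, alt: str) -> str:
    """Build the string expected by the -c option."""
    return f"{chrom}:{pos}{ref}>{alt}"


print(parse_change("chr1:12345C>A"))
print(format_change("chr1", 12346, "C", "A"))
```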

・jannovar statistics - Compute statistics about variants in VCF file

jannovar statistics -d data/hg19_refseq.ser \
 -i input.vcf -o stats

Output

f:id:kazumaxneo:20191106160055p:plain

Reference
Jannovar: a java library for exome annotation
Jäger M, Wang K, Bauer S, Smedley D, Krawitz P, Robinson PN

Hum Mutat. 2014 May;35(5):548-55

2019-11-09

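Cluster Locator: a web service for finding groups of genes that form clusters on the genome

2018 Bioinformatics gene cluster visualization mouse human genome yeast

It is well established that genes are not arranged randomly along eukaryotic genomes (Feuerborn and Cook, 2015; Hurst et al., 2004). Diverse correlations between gene position and gene expression, gene function, or quantitative traits have been found in every eukaryote studied so far (De and Babu, 2010; Ghanbarian and Hurst, 2015). These correlations were first observed almost 20 years ago in the yeast Saccharomyces (Eisen et al., 1998) and later in nematodes, flies, mice, humans, and other organisms (Michalak, 2008). Using varying definitions of the term "cluster", several studies have found clusters of co-expressed genes that share a function, clusters of functionally related genes that share a genomic neighborhood, or groups of neighboring genes with similar expression patterns or related functions (Corrales et al., 2017; Lee and Sonnhammer, 2003; Reimegård et al., 2017; Thévenin et al., 2014; Tiirikka et al., 2014; Yi et al., 2007). It is therefore now accepted that the relative position of a gene in the genome is not independent of its biological function or its expression pattern.

In recent years, improved genome annotation and the growth of gene expression data have made it relatively easy to build lists of co-functional or co-expressed genes. Nevertheless, tools that allow a simple statistical analysis of how the genes on a list cluster along the genome are lacking. A few tools that have been developed can provide some insight (Aboukhalil et al., 2013; Dottorini et al., 2013; Yi et al., 2007), but they were not designed specifically for this purpose and are not currently available online or upon request.

Here the authors present Cluster Locator, a free, easy-to-use online tool that takes a user-supplied list of protein-coding genes and, once the maximum allowed gap has been chosen (see the definition below), finds, quantifies, and displays all clusters. Results are shown in the browser, and the downloadable Cluster Locator output includes the number, size, and position of the clusters identified, the identity and position of the genes within each cluster, and a statistical analysis of the results.

Cluster Locator is a web-based application whose back end is implemented in Python 2.7 and whose front end uses the ReactJS and D3js libraries. The back end is deployed on AWS Lambda, and the static front-end files are stored in AWS S3. (The rest is omitted.)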

User guide

https://s3.amazonaws.com/cluster-locator-statics/user_guide.pdf

Usage

Go to http://clusterlocator.bnd.edu.uy .

Specify the reference genome.

f:id:kazumaxneo:20191106002015p:plain

Upload the list of genes to examine, one ID per line. See the Document for the supported gene IDs. For human, Ensembl gene IDs and HGNC official symbols are supported.

f:id:kazumaxneo:20191106021751p:plain

Up to 1,000 genes can be analyzed.
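Before uploading, it can help to check that a list really has one ID per line and stays within the 1,000-gene limit. The following is a small sketch; `read_gene_list` is a hypothetical helper, and dropping duplicate IDs is my own assumption rather than documented Cluster Locator behavior.

```python
def read_gene_list(text, max_genes=1000):
    """Return the unique gene IDs in `text`, one ID per line, in input order."""
    ids = []
    seen = set()
    for line_number, line in enumerate(text.splitlines(), start=1):
        gene_id = line.strip()
        if not gene_id:
            continue  # ignore blank lines
        if len(gene_id.split()) != 1:
            raise ValueError(f"line {line_number}: expected a single ID, got {gene_id!r}")
        if gene_id not in seen:  # assumption: duplicates are dropped
            seen.add(gene_id)
            ids.append(gene_id)
    if len(ids) > max_genes:
        raise ValueError(f"{len(ids)} genes exceeds the limit of {max_genes}")
    return ids
```

Reading a file with `read_gene_list(open("genes.txt").read())` (file name hypothetical) returns the cleaned list or raises `ValueError` describing the problem.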

A cluster is defined as a set of genes in which no gap between adjacent genes exceeds the specified maximum gap (see Fig. 1 of the Document). Specify this maximum gap size.

f:id:kazumaxneo:20191106021539p:plain
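The max-gap definition can be sketched in a few lines. The code assumes each listed gene is given by its rank (order) along one chromosome and that the gap is counted as the number of intervening genes that are not on the list; this is my reading of the definition, so check the Document before relying on it. `find_clusters` and the `min_size` of 2 are hypothetical.

```python
def find_clusters(ranks, max_gap, min_size=2):
    """Group gene ranks on one chromosome into max-gap clusters.

    Two consecutive listed genes share a cluster when at most `max_gap`
    unlisted genes lie between them (assumed gap definition).
    """
    clusters = []
    current = []
    for rank in sorted(set(ranks)):
        # Number of unlisted genes between this gene and the previous listed one.
        if current and rank - current[-1] - 1 > max_gap:
            if len(current) >= min_size:
                clusters.append(current)
            current = []
        current.append(rank)
    if len(current) >= min_size:
        clusters.append(current)
    return clusters
```

With listed genes at ranks 1, 2, 4, 10, 11, and 20, a max gap of 1 merges ranks 1, 2, and 4 into one cluster, while a max gap of 0 keeps only directly adjacent genes together. If gaps are measured in base pairs instead, the same loop applies with coordinates in place of ranks.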

Here I run the example data from the "Click here" link near the top.

Results

The number of clusters detected and the number of genes contained in clusters are shown.

f:id:kazumaxneo:20191106002147p:plain

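The Uniformity test applies the Kolmogorov–Smirnov test, which is also used for purposes such as testing for normality, to check whether the genes in the input list are uniformly distributed along the chromosomes or unevenly placed. Random sampling compares the result against the chance that genes drawn at random from the genome happen to lie close together. See the Document.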
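The idea behind a Kolmogorov–Smirnov test for uniformity can be illustrated with the one-sample statistic D, the largest distance between the empirical distribution of gene positions and the uniform distribution over the chromosome. This is a standard-library sketch of the statistic only, not Cluster Locator's implementation; `ks_uniform_statistic` is a hypothetical name, and whether the tool works on coordinates or gene ranks is not stated here.

```python
def ks_uniform_statistic(positions, chrom_length):
    """One-sample KS statistic D against Uniform(0, chrom_length)."""
    if not positions:
        raise ValueError("need at least one position")
    # Under the uniform hypothesis the CDF at x is simply x / chrom_length.
    xs = sorted(p / chrom_length for p in positions)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs, start=1):
        # Compare the CDF with the empirical step just after and just before x.
        d = max(d, i / n - x, x - (i - 1) / n)
    return d
```

Evenly spread genes give a small D, and genes packed into one corner of the chromosome give a D close to 1. To turn D into a p-value, `scipy.stats.kstest(xs, "uniform")` on the scaled positions is one option if SciPy is available.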

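The analyzed genes and the identified clusters are visualized along the chromosomes. Up to six segments are displayed simultaneously.

f:id:kazumaxneo:20191106002153p:plain

Every chromosome is drawn as a vertical line of the same length. Genes are shown as dots. For preloaded genomes, the labels link to the corresponding database (Ensembl, FlyBase, WormBase, or SGD).

Citation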

Cluster Locator, online analysis and visualization of gene clustering
Flavio Pazos Obregón, Pablo Soto, José Luis Lavín, Ana Rosa Cortázar, Rosa Barrio, Ana María Aransay, Rafael Cantera
Bioinformatics, Volume 34, Issue 19, 01 October 2018, Pages 3377–3379

2019-11-08

pATLASflow

nextflow plasmid

pATLASflow is a pipeline that runs the mapping, mash screen, and assembly methods for plasmid ATLAS.

plasmid ATLAS

Installation

GitHub

# Here I install into a conda virtual environment.
conda create -n pATLASflow nextflow
conda activate pATLASflow
conda install -c bioconda -y mash

$ nextflow run tiagofilipe12/pATLASflow --help

N E X T F L O W ~ version 19.10.0

Launching `tiagofilipe12/pATLASflow` [zen_golick] - revision: 7cec7485d8 [master]

===========================================================

p A T L A S f l o w

===========================================================

Version: 1.1.0

Usage:

nextflow run tiagofilipe12/pATLAS_mash_screen.nf

Nextflow magic options:

-profile Forces nextflow to run with docker or singularity. Default: standard Choices: standard, singularity,slurm

Main options:

--help Opens this help. It will open only when --help is provided. So, yes, this line is pretty useless since you already know that if you reached here.

--version Prints the version of the pipeline script.

--mash_screen Enables mash screen run.

--assembly Enables mash dist run to use fasta file against plasmid db

--mapping Enables mapping pipeline.

Mash options:

--kMer the length of the kmer to be used by mash. Default: 21

--pValue The p-value cutoff. Default: 0.05

Mash screen exclusive options:

--identity The minimum identity value between two sequences. Default: 0.9

--noWinner This option allows to disable the -w option of mash screen Default: false

Mash dist exclusive options:

--mash_distance Provide the maximum distance between two plasmids to be reported. Default: 0.1

--shared_hashes Provide a percentage for the hashesshared between the reference and the query sequence(s). Default: 0.8

Reads options:

--reads The path to the read files. Here users may provide many samples in the same directory. However be assured that glob pattern is unique (e.g. 'path/to/*_{1,2}.fastq').

--singleEnd Provide this option if you have single-end reads. By default the pipeline will assume that you provide paired-end reads. Default: false

Fasta options:

--fasta Provide fasta file pattern to be searched by nextflow. Default: 'fasta/*.fas'

Bowtie2 options:

--trim5 Provide parameter -5 to bowtie2 allowing to trim 5' end. Default: 0

--cov_cutoff Provide a cutoff value to filter results for coverage results. Default: 0.60

How to run

Classifying plasmid sequences.

nextflow run tiagofilipe12/pATLASflow --assembly --fasta input.fasta

Load the output JSON file in /results/mashdist/ into pATLAS.
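The `--assembly` mode runs mash dist, and the cutoffs in the help text (`--mash_distance` 0.1, `--pValue` 0.05, `--shared_hashes` 0.8) can be illustrated on raw `mash dist` output, which is tab-delimited with reference ID, query ID, distance, p-value, and shared hashes (for example `900/1000`). This is only a sketch of what the thresholds mean, not the pipeline's own code; `filter_mash_dist` is a hypothetical helper, and treating the cutoffs as inclusive is an assumption.

```python
def filter_mash_dist(lines, max_distance=0.1, max_pvalue=0.05, min_shared=0.8):
    """Keep `mash dist` rows that pass the distance, p-value, and shared-hash cutoffs."""
    hits = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        reference, query, distance, pvalue, hashes = line.split("\t")
        shared, total = (int(value) for value in hashes.split("/"))
        fraction = shared / total
        # Assumption: all three cutoffs are inclusive.
        if (float(distance) <= max_distance
                and float(pvalue) <= max_pvalue
                and fraction >= min_shared):
            hits.append((reference, query, float(distance), fraction))
    return hits
```

Feeding it the lines of a saved `mash dist` table returns only the reference plasmids close enough to the query under these defaults.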

Citation

GitHub - tiagofilipe12/pATLASflow: A pipeline to run mapping, mash screen and assembly methods for pATLAS.

Plasmid ATLAS: plasmid visual analytics and identification in high-throughput sequencing data
Tiago F Jesus, Bruno Ribeiro-Gonçalves, Diogo N Silva, Valeria Bortolaia, Mário Ramirez, João A Carriço
Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D188–D194

2019-11-07

Ensembl's Variant Effect Predictor (VEP)

2019 Genome Biology web tool VCF evaluation tool human exome human genome Ensembl cohort population genomics docker

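2019 11/10 title corrected

2020 10/14 docker link added

Analysis of variant data from genome or exome sequencing is fundamental to progress in biology, from basic research to translational genomics in the clinic. It is key to investigating function and to moving from a healthcare system based on standardized treatment to one that targets the individual patient.

For patients with common or rare diseases, the potential benefits of variant analysis include improved patient care, monitoring, and treatment outcomes. In cancer, there are already many successes using genetic test data. For example, patients who test positive for an inherited BRCA mutation have the option of elective preventive surgery. Lung cancer patients with EGFR gene mutations and triple-negative breast cancer patients can have their drug prescriptions tailored to improve success [ref.1, 2].

Rare diseases can be difficult to diagnose individually because of their low incidence and the incomplete penetrance of the associated alleles. However, variant analysis of whole genome sequencing (WGS) or whole exome sequencing data can lead to discovery of the underlying genetic variation [ref.3]. Identifying the relevant mutations benefits research into treatment options and future drug discovery. Beyond the direct benefit of a diagnosis, it may also give a more accurate prognosis and remove the burden of additional medical investigations.

The most common non-communicable diseases worldwide are cardiovascular disease, cancer, and diabetes [ref.4]. Although many array-based genome-wide association studies (GWAS) have searched for risk loci, only a relatively small heritable component of these conditions has been explained [ref.5]. WGS of large numbers of samples is needed to obtain enough statistical power to detect rare variants with potential phenotype or disease associations [ref.6, 7]. WGS studies also detect variants in regulatory and non-coding regions of the genome. These are thought to make up the majority of trait-associated variants [ref.8] and also play a role in cancer [ref.9].

The potential of large-scale sequencing and variant analysis is transformative. Recognizing this value, major population sequencing initiatives have been launched in Iceland [ref.10], the United Kingdom [ref.11], and the United States [ref.12]. In other species, efforts such as Genome 10K [ref.13], the 1001 Arabidopsis genomes [ref.14], and the 1000 bull genomes project [ref.15] have similar goals and operate under different funding models.

Continued improvement of DNA sequencing technology, together with a current cost of about $1,000 per human genome, is producing large numbers of genomes and exomes, and with them variant data that need interpretation. Meanwhile, the cost of analysis to determine functional consequences remains considerably high because variants are difficult to interpret. For example, a typical diploid human genome has about 3.5 million SNVs and 1,000 copy number variants relative to the reference genome sequence [ref.16]. About 20,000 to 25,000 of these variants are protein coding, of which 10,000 change an amino acid, but only 50 to 100 are protein-truncating or loss-of-function variants [ref.16]. Manual review of large numbers of variants is impractical and expensive, and there are additional difficulties such as the lack of functional annotation and the interpretation of multiple variants within a haplotype.

Variant interpretation often considers the effect of a variant on a transcript or protein. It therefore depends on transcript annotation and on localizing the variant to protein-coding or non-coding regions. There are two main sources of Homo sapiens annotation: GENCODE [ref.17] and the Reference Sequence (RefSeq) collection of the National Center for Biotechnology Information (NCBI) [ref.18]. Both transcript annotations are subject to version changes and updates that can alter how a variant is reported and interpreted. For data reproducibility, transcript isoforms and transcript versions must be tracked rigorously, yet in some cases even including the version is not enough to avoid all potential misinterpretation [ref.19]. The transcript sets also differ in how they are built: the GENCODE annotation is genome-based, whereas RefSeq transcripts are independent of the reference genome. RefSeq transcripts may correct errors in the reference assembly and provide transcripts with improved biological representation (such as the genes ABO, ACTN3, and ALMS1 in the GRCh37 reference), but differences between the genome and the transcript set can cause confusion and errors. GENCODE aims to create a comprehensive transcript set representing the expression of each isoform in every tissue and developmental stage, and as a result there are on average nearly four isoforms per protein-coding gene.

VEP has been used for trait analysis in livestock [ref.24, 25], patient diagnosis in the clinic, and GWAS studies [ref.26–30]. It has been used for analysis in many large-scale projects, such as 1000 Genomes [ref.31] and the Exome Aggregation Consortium (ExAC) [ref.32]. VEP annotation is used as input to tools for exploring variant annotation in depth, such as GEMINI [ref.33]. It is a flexible tool that is valuable to any project requiring detailed annotation of sequence variants.

VEP annotates two broad categories of genomic variants: (1) sequence variants with specific, well-defined changes (including SNVs, insertions, deletions, multiple base pair substitutions, microsatellites, and tandem repeats), and (2) larger structural variants (longer than 50 nucleotides), which include copy number changes, insertions, and deletions. For every input variant, VEP returns detailed annotation of the effects on transcripts, proteins, and regulatory regions. For known or overlapping variants, allele frequencies and disease or phenotype information are included. (The rest is omitted.)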
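The two categories can be illustrated with a rough classifier based only on REF and ALT allele lengths. This is a simplified sketch, not how VEP itself decides: VEP uses Sequence Ontology terms and its own logic, and structural variants in VCF are often written with symbolic alleles such as `<DEL>`, which this sketch does not handle. `classify_variant` and its labels are hypothetical; only the 50-nucleotide threshold comes from the description above.

```python
def classify_variant(ref, alt, sv_threshold=50):
    """Return (category, kind) for a variant given explicit REF and ALT alleles."""
    # Category: anything longer than the threshold is treated as structural.
    longest = max(len(ref), len(alt))
    category = "structural" if longest > sv_threshold else "sequence"

    # Kind: decided purely from the allele lengths.
    if len(ref) == 1 and len(alt) == 1:
        kind = "SNV"
    elif len(ref) == len(alt):
        kind = "substitution"  # multiple base pair substitution
    elif len(ref) < len(alt):
        kind = "insertion"
    else:
        kind = "deletion"
    return category, kind
```

For instance, `classify_variant("C", "A")` falls in the sequence-variant category as an SNV, while a deletion of 60 bases is placed in the structural category.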

custom annotations

https://asia.ensembl.org/info/docs/tools/vep/script/vep_custom.html

Local version

Tested on Ubuntu 18.04 LTS (using docker; host OS macOS 10.14).

Dependencies

gcc, g++ and make
Perl (>=5.10 recommended, tested on 5.10, 5.14, 5.18, 5.22, 5.26)
Perl libraries Archive::Zip and DBI

# These can be installed with cpanm
cpanm Archive::Zip
cpanm DBI

Main program (GitHub)

git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep
perl INSTALL.pl

$ vep

#----------------------------------#

# ENSEMBL VARIANT EFFECT PREDICTOR #

#----------------------------------#

Versions:

ensembl : 101.856c8e8

ensembl-funcgen : 101.b918a49

ensembl-io : 101.943b6c2

ensembl-variation : 101.851c7e0

ensembl-vep : 101.0

Help: dev@ensembl.org , helpdesk@ensembl.org

Twitter: @ensembl

http://www.ensembl.org/info/docs/tools/vep/script/index.html

Usage:

./vep [--cache|--offline|--database] [arguments]

Basic options

=============

--help Display this message and quit

-i | --input_file Input file

-o | --output_file Output file

--force_overwrite Force overwriting of output file

--species [species] Species to use [default: "human"]

--everything Shortcut switch to turn on commonly used options. See web

documentation for details [default: off]

--fork [num_forks] Use forking to improve script runtime

For full option documentation see:

http://www.ensembl.org/info/docs/tools/vep/script/vep_options.html

Docker image (Docker Hub)

docker pull ensemblorg/ensembl-vep:latest

Test run

git clone https://github.com/Ensembl/ensembl-vep.git
cd ensembl-vep/examples/
#arabidopsis thaliana
vep -i arabidopsis_thaliana.TAIR10.vcf -o out.txt --species arabidopsis_thaliana --database --genomes

#Homo sapiens (the run takes roughly 1 h)
vep -i homo_sapiens_GRCh37.vcf -o output1.txt --database
vep -i homo_sapiens_GRCh38.vcf -o output2.txt --database

--cache Enables use of the cache. Add --refseq or --merged to use the refseq or merged cache, (if installed).
--database Enable VEP to use local or remote databases.
--genomes Override the default connection settings with those for the Ensembl Genomes public MySQL server. Required when using any of the Ensembl Genomes species. Not used by default
--offline Enable offline mode. No database connections will be made, and a cache file or GFF/GTF file is required for annotation. Add --refseq to use the refseq cache (if installed). Not used by default
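When the same test run has to be repeated for several inputs, the command line can be assembled in a script. The sketch below only uses options that appear in the help text and examples above (`-i`, `-o`, `--species`, `--database`, `--genomes`, `--force_overwrite`, `--fork`); `build_vep_command` is a hypothetical helper and it always uses `--database` mode.

```python
def build_vep_command(input_vcf, output_file, species="human",
                      genomes=False, force_overwrite=False, forks=None):
    """Assemble a vep argument list for a run in --database mode."""
    command = ["vep", "-i", input_vcf, "-o", output_file,
               "--species", species, "--database"]
    if genomes:
        # Needed for Ensembl Genomes species such as Arabidopsis thaliana.
        command.append("--genomes")
    if force_overwrite:
        command.append("--force_overwrite")
    if forks:
        command += ["--fork", str(forks)]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)` on a machine where `vep` is on the PATH.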

Output (arabidopsis thaliana)

f:id:kazumaxneo:20201014105443p:plain

out.txt

f:id:kazumaxneo:20201014105522p:plain
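A quick tally of predicted consequences can be made directly from out.txt. The sketch assumes VEP's default tab-delimited output, where `##` lines are metadata, the single `#` line is the column header, and the `Consequence` column may hold several comma-separated terms; `count_consequences` is a hypothetical helper.

```python
from collections import Counter


def count_consequences(lines):
    """Count consequence terms in VEP default (tab-delimited) output."""
    counts = Counter()
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # metadata or blank line
        if line.startswith("#"):
            header = line[1:].split("\t")  # column names, e.g. Uploaded_variation
            continue
        if header is None:
            raise ValueError("data row found before the header line")
        row = dict(zip(header, line.split("\t")))
        for term in row["Consequence"].split(","):
            counts[term] += 1
    return counts
```

Calling `count_consequences(open("out.txt"))` returns a `Counter`, and `.most_common()` on it lists the most frequent consequence terms first. The same numbers are also summarized in out.txt_summary.html.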

out.txt_summary.html

f:id:kazumaxneo:20201014105552p:plain

For details on the options and more, see the URL below.

https://asia.ensembl.org/info/docs/tools/vep/script/vep_tutorial.html

Web service

Select Variant Effect Predictor (VEP) at http://asia.ensembl.org/info/genome/variation/tools/variant_tools.html .

Variant Effect Predictor (VEP)

http://asia.ensembl.org/Multi/Tools/VEP?db=core

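Choose the reference.

f:id:kazumaxneo:20191105181931p:plain

The genomes are based on the latest Ensembl release. For the human genome this is GRCh38; GRCh37 is used from the legacy page.

Specify a VCF. Here I chose a VCF of variant calls for NA12877 made with freebayes against the GRCh38 assembly as the reference.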

f:id:kazumaxneo:20191105182204p:plain

Results

A report is displayed.

f:id:kazumaxneo:20191105180237p:plain

For the variant effect columns, details can be looked up by following the links.

f:id:kazumaxneo:20191106115031p:plain

For example, click rs367896724 (dbSNP).

f:id:kazumaxneo:20191106115341p:plain

Linkage disequilibrium

Population genetics

rs367896724 INDEL

Sample genotypes

Genes and regulation

Context

Flanking sequence

Return to the Variant Effect Predictor. The table can be sorted and filtered in place.

f:id:kazumaxneo:20191105180240p:plain

Citation

The Ensembl Variant Effect Predictor

William McLaren, Laurent Gil, Sarah E. Hunt, Harpreet Singh Riat, Graham R. S. Ritchie, Anja Thormann, Paul Flicek, Fiona Cunningham
Genome Biology volume 17, Article number: 122 (2016)
