2020-06-29

高速かつメモリ使用量の少ないポリッシングツール POLCA

2020 6/29 インストール手順修正

　Pacific Biosciences（PacBio）によるSingle Molecule Real Time（SMRT）シーケンスや、Oxford Nanopore Technologies（ONT）によるnanoporeシーケンスなどの第3世代シーケンスプラットフォームは、数キロベースからメガベース以上の範囲のリードを生成する。ただし、両方のテクノロジーのエラー率は8〜15％と比較的高くなっている。エラーの種類はテクノロジーによって異なるが、十分に深くカバーしているため、ほとんどのエラーはリードを相互にチェックすることで修正できる。エラー修正のもう1つの戦略は、エラー率が0.5％未満の長い（100〜250bp）イルミナリードとロングリードをペアにすることである。ハイブリッド戦略では、より高価なロングリードのカバレッジを大幅に低くする必要があるが、これはより安価なイルミナリードに置き換えることができる。 2番目のテクノロジーを使用すると、ロングリードでのシステマティックなエラーがディープカバレッジでも修正されない可能性があり、Illuminaのリードを使用すると、これらのエラーを修正できる追加の利点がある。これにより、ハイブリッドシーケンス戦略を使用してアセンブリされた全ゲノムアセンブリは、10万塩基あたり1エラー未満のエラー率を得ることができる。
　ハイブリッドゲノムプロジェクトでイルミナのデータを使用する方法は2つある。 PBcR（Berlin et al、2015）およびMaSuRCA（Zimin et al、2017）アセンブラで行われているように、ロングリードをプロセスの早い段階で修正して使用するか、ロングリードアセンブリの後にルミナのリードをアセンブリにアラインしてコンセンサスの品質を向上させる。この後者のアプローチは、一般的にコンセンサスの「ポリッシング」と呼ばれる。イルミナのデータを使用してアセンブリをポリッシングするために利用できるソフトウェアツールがいくつかあるが、最も広く使用されているのはPilon（Walker et al、2014）およびRacon（Vaser et al、2017）である。このペーパーでは、バージョン3.3.5以降のMaSuRCAアセンブラパッケージで配布しているPOLCA（Calling AlternativesによるPOLishing）と呼ばれる新しいポリッシングツールを紹介する。 POLCAには、PilonおよびRaconに比べて3つの主な利点がある。（1）非常に高速、（2）使用メモリが非常に少ない、（3）誤訂正が非常に少ない。著者らの実験が示すように、洗練されたシーケンスの品質は、PilonまたはRaconによって達成された品質に匹敵する。その速度と導入されたエラー率の低さにより、POLCAはアセンブリポリッシング用の優れたツールとなる。また、示されているように、最高のコンセンサス精度はPOLCAとRaconの両方でポリッシュすることによって達成される。
　以下では、3つのデータセットでのPOLCAのパフォーマンスの分析を示す。まず、シミュレーションデータセットを使用して、既知のランダムエラーをゲノムに導入し、同じゲノムからシミュレートしたリードでそれを洗練する。これにより、洗練されたアセンブリを「真の」ゲノムシーケンスと比較する。次に、オックスフォードナノポアデータから生成された一連の細菌ゲノムアセンブリと、PacBioデータからアセンブリされたヒトゲノムに対して、ポリッシング方法をテストする。
　既存のアセンブリのコンセンサスシーケンスをポリッシュするには、少なくとも2つのアプローチがある。 1つは、ゲノムアセンブリにアラインメントしてリードのマルチアラインメントをリカバーし、元のシークエンシングデータまたは追加のシークエンシングデータを使用してコンセンサス計算を再実行することである。 2番目のアプローチは、リードをコンセンサスに合わせ、リードがエラーの可能性を示す場所を特定し、リード配列を使用してそれらのエラーを修正することである。 RaconとPilonが含まれる最初のアプローチは、計算コストが高くなるが、アセンブリに多数のエラーが含まれる場合はよりうまく機能する可能性がある。 POLCAは後者のアプローチを採用している。（以下略）

インストール

condaを使いversion 3.4.1をインストールした(ubuntu18.04LTS使用)。

依存

POLCA has one external dependency: bwa mem aligner

本体　Github

#bioconda(link) ここでは仮想環境に入れる
mamba create -n masurca -y
conda activate masurca
mamba install -c bioconda -y masurca

#from source (未テスト)
git clone https://github.com/alekseyzimin/masurca
cd masurca/
git submodule init
git submodule update
make

> polca.sh

$ polca.sh

Usage: polca.sh -a <assembly contigs or scaffolds> -r <'Illumina_reads_fastq1 Illumina_reads_fastq'> -t <number of threads> [-n] <optional:do not fix errors that are found> [-m] <optional: memory per thread to use in samtools sort>

> masurca -v

version 3.4.1

実行方法

ランにはドラフトゲノムアセンブリとilluminaのリードまたはPacbioのHIFIリードが必要。

polca.sh -a genome.fasta -r 'reads1.fastq reads2.fastq.gz' -t 16 -m 1G

-a assembly contigs or scaffolds
-r 'Illumina_reads_fastq1 Illumina_reads_fastq'
-m optional: memory per thread to use in samtools sort
-t number of threads

ランタイムは非常に短い。

出力

f:id:kazumaxneo:20200629120554p:plain

assembly.fasta.report

f:id:kazumaxneo:20200629120813p:plain

入力ファイルがassembly.fastaの場合、assembly.fasta.PolcaCorrected.faがポリッシュされたFASTAファイルになる。中間ファイルがたくさん残るのでラン後に整頓する必要がある。

引用

The genome polishing tool POLCA makes fast and accurate corrections in genome assemblies
Aleksey V. Zimin , Steven L. Salzberg
PLOS Computational Biology, Published: June 26, 2020

2020-06-28

自動化されたProkaryotesのRNA seq解析パイプライン ProkSeq

2020 Preprint RNA seq RNAseqの定量 bacteria archaea automated pipeline automatic pipeline docker

2020 6/29 補足説明追加。

RNA-seq技術は、導入以来、病原性細菌の研究において異なる条件にさらされた細菌からの複数のサンプルにわたる遺伝子発現の違いを同定し、定量化するために広く利用されてきた。一部の例外を除いて、遺伝子発現を評価するための現在のツールは、真核生物の遺伝子の構造を中心に設計されている。原核生物用に設計されたスタンドアローンのツールは少なく、改善が必要である。原核生物の品質管理、差分遺伝子発現の決定、ダウンストリームパスウェイ解析、極限の生物学的条件で収集されたデータの正規化のために必要なツールをすべて含む原核生物のための明確なパイプラインはまだ不足している。ここでは、原核生物用に設計されたユーザーフレンドリーで完全自動化されたRNA-seqデータ解析パイプラインであるProkSeqについて説明する。ProkSeqは、 differential gene expressionの解析、発現データの正規化、データと結果の可視化のための様々なオプションを提供し、publication-qualityの数値を生成する。
　ProkSeqはPythonで実装されており、ISCオープンソースライセンスで公開されている。ツールと詳細なユーザーマニュアルは、Docker: https://hub.docker.com/repository/docker/snandids/prokseq-v2.1、Anaconda: https://anaconda.org/snandiDS/prokseq、Github: https://github.com/snandiDS/prokseq でホストされている。

　原核生物のRNA-seq解析は、利用可能なほとんどのRNA-seqパッケージは、入力データが原核生物のものとは多くの面で異なる真核生物の遺伝子構造を反映していることを前提としているため、困難である(Johnson, et al., 2016)。細菌の転写物はイントロンを持たず、交互にスプライスされていないため、スプライスジャンクションを考慮して開発されたアライナーを使用するとしばしばゲノム中の誤って割り当てられたリードが増加することがある（Magoc, et al.）。さらに、真核生物とは異なり、特定のストレス下では、ほぼすべての原核生物遺伝子の発現が変化する可能性がある(Creecy and Conway, 2015)。さらに、実験セットアップのばらつき、プラスミド遺伝子の存在および過剰発現、ならびにRNA-seqプロトコルの違いのために、原核生物のデータには、品質トリミング、アダプター除去、および歪んだデータの正規化がしばしば必要とされる（Magoc, et al., 2013; McClure, et al）。

　原核生物には、SPARTA(Johnson, et al., 2016)、EdgePRO(Magoc, et al., 2013)、およびRockHopper(McClure, et al., 2013)のような、RNA-seqデータの解析を容易にすることができるいくつかのソフトウェアパッケージがあるが、いずれも、データ操作に関する実質的な知識を必要とする。そこで、原核生物のRNA-seqデータ解析を行う際の人間の介入を減らすために、完全自動化されたコマンドラインベースのワークフローであるProkSeqを開発した。ProkSeqは、利用可能な様々なツールとPythonで書かれた組み込み関数を統合している。ProkSeqは、RNA-seqデータの品質管理ステップから、異なる発現を持つ遺伝子のパスウェイエンリッチメント解析までを処理する。差分発現、正規化発現、可視化のための多様なオプションを提供し、出版品質の数値を生成する。人間の介入が少ないため、ProkSeqの使用は、データの再フォーマットが必要となることが多い個別のツールを逐次適用するよりも時間がかからない。利便性に加えて、ProkSeqのマルチスレッド機能により、パイプラインにかかる時間が短縮される。

　ユーザーは任意の分析を行うためにパラメータファイルの設定を柔軟に変更することができる。サンプルファイルには、解析に含まれる fastq ファイルの名前が記載されており、処理サンプルやコントロールサンプルなどの実験クラスも定義されている。ProkSeqは、まずリードの品質をFastQC（http://www.bioinformatics.babraham.ac.uk/projects/fastqc）とafterQC（Chen, et al）でチェックし、bowtie2（Langmead and Salzberg, 2012）を使用してリードをリファレンスゲノムにマッピングする。ProkSeqはその後、各ライブラリのアライメント品質に関するレポートを図とテキストの両方で生成し、カバレッジの均一性、タンパク質コーディング配列に沿った分布、5'および3'UTR領域、RseQC(Wang, et al., 2012)およびその他の組み込み機能によって生成されたリード重複率とストランド特異性に関する情報を提供する。遺伝子あたりの総リード数はfeatureCounts(Liao, et al., 2014)を用いて計算される。またProkSeqは、各遺伝子の正規化された遺伝子発現値を、100万あたりの転写物（TPM）（Wagner, et al. 2012）や100万あたりのカウント（CPM）（Wagner, et al. 2012）の形で計算する。これらを算出した計算式については、補足（S1）で説明する。さらに、バイアスを考慮したアルゴリズムを使用して転写物の発現を定量化するために、salmon（Patro et al、2017）を統合しており、これはその後のdifferential expressionの解析の精度および信頼性を実質的に向上させる。
　ProkSeqは、DEseq2(Love, et al., 2014)、edgeR(Robinson, et al., 2010)、およびNOISeq(Tarazona, et al., 2015)などのdifferential expression analysisのためのいくつかのツールを統合している。differential expression genesの下流解析には、ProkSeqではGOエンリッチメント、clusterProfilerを統合したパスウェイエンリッチメントを用いている（Yu, et al. アライメント前後の品質統計やグラフィカルな可視化に関するレポートはpdfとHTML形式で作成される。ProkSeqの重要なユニークな機能の1つは、歪んだデータに対する確立された正規化手法の統合である（Creecy and Conway, 2015; Zhu, et al）。さらに、このパッケージは、任意のゲノムブラウザで可視化するための1ヌクレオチド分解能のwiggleファイルを生成する。ProkSeqは、データ解析の各ステップで鮮やかなグラフィックと出版可能な数値を生成し、ユーザーのデータへの信頼性と理解度を高める。その方法については、補足方法(S1)で詳しく説明している。

インストール

推奨されているdockerイメージを使ってテストした（ホストはmacos）。

本体　Github

#docker hub(link)
docker pull snandids/prokseq-v2.1:v1

#Bioconda(link)
conda install -c bioconda prokseq -y

> python scripts/pipeline-v2.8.py -h

# python scripts/pipeline-v2.8.py -h

Usage: pipeline-v2.8.py [options] arg

Options:

-h, --help show this help message and exit

-s SAMPLE_FILE_NAME, --sample=SAMPLE_FILE_NAME

provide the sample file

-p PARAMETER_FILE_NAME, --param=PARAMETER_FILE_NAME

provide the parameter file

-n NUMBER OF PROCESSORS, --numproc=NUMBER OF PROCESSORS

provide the number of processors

テストラン

ここではdockerイメージを使う。

docker run -it snandids/prokseq-v2.1:v1
source /etc/profile.d/conda.sh
conda activate py36
cd prokseq

カレントのファイル構成

ls -lh /root/prokseq

# ls -lh

total 111M

drwxr-xr-x 10 root root 4.0K May 29 06:14 Output

-rw-rw-r-- 1 1000 1000 21K Jun 7 07:08 README.txt

-rw-r--r-- 1 1000 1000 4.6M May 1 06:29 SequenceChromosome.fasta

drwxr-xr-x 2 root root 4.0K May 29 05:33 bowtie2_genome

-rw-r--r-- 1 root root 18 May 29 06:14 chr.sizes

-rw-r--r-- 1 root root 5.0K May 29 06:16 class.log

drwxr-xr-x 2 root root 4.0K May 29 06:09 data

drwxr-xr-x 1 root root 4.0K May 29 06:16 depend

-rw-rw-r-- 1 1000 1000 273K May 1 06:28 oldAnnotationGFF.bed

-rw-rw-r-- 1 1000 1000 367K May 1 06:28 oldAnnotationGFF.gtf

-rw-rw-r-- 1 1000 1000 2.9K Jun 8 07:00 param.input.bowtie

-rw-rw-r-- 1 1000 1000 3.1K Jun 8 07:00 param.input.salmon

-rw-r--r-- 1 1000 1000 1.3M Mar 30 07:12 reads_1.fastq

-rw-r--r-- 1 1000 1000 1.3M Mar 30 07:13 reads_2.fastq

-rw-r--r-- 1 1000 1000 1.3M Mar 30 08:28 reads_3.fastq

-rw-r--r-- 1 1000 1000 1.3M Mar 30 08:28 reads_4.fastq

-rw-rw-r-- 1 1000 1000 8.2M May 27 13:41 sampleCtrl_1.R1.fq

-rw-rw-r-- 1 1000 1000 8.6M Jun 2 05:39 sampleCtrl_1.R2.fq

-rw-rw-r-- 1 1000 1000 8.2M May 27 13:41 sampleCtrl_2.R1.fq

-rw-rw-r-- 1 1000 1000 8.6M May 27 13:42 sampleCtrl_2.R2.fq

-rw-rw-r-- 1 1000 1000 8.2M May 27 13:41 sampleCtrl_3.R1.fq

-rw-rw-r-- 1 1000 1000 8.6M May 27 13:42 sampleCtrl_3.R2.fq

-rw-rw-r-- 1 1000 1000 8.2M May 27 13:40 sampleTreat_1.R1.fq

-rw-rw-r-- 1 1000 1000 8.6M May 27 13:41 sampleTreat_1.R2.fq

-rw-rw-r-- 1 1000 1000 8.2M May 27 13:40 sampleTreat_2.R1.fq

-rw-rw-r-- 1 1000 1000 8.6M May 27 13:40 sampleTreat_2.R2.fq

-rw-rw-r-- 1 1000 1000 8.2M May 27 13:40 sampleTreat_3.R1.fq

-rw-rw-r-- 1 1000 1000 8.6M May 27 13:40 sampleTreat_3.R2.fq

-rw-rw-r-- 1 1000 1000 1.2K May 28 12:28 samples.bowtie.PEsample

-rw-rw-r-- 1 1000 1000 1.1K May 27 12:35 samples.bowtie.SEsample

-rw-rw-r-- 1 1000 1000 868 May 28 12:49 samples.salmon.PEsample

-rw-rw-r-- 1 1000 1000 840 May 28 14:13 samples.salmon.SEsample

drwxrwxr-x 1 1000 1000 4.0K Jun 8 06:59 scripts

-rw-r--r-- 1 root root 18 May 29 06:14 test.fasta

ランにはサンプル情報を記載したファイルを使用する。

cat samples.bowtie.PEsample

# cat samples.bowtie.PEsample

###################################################################################

# File "sample" - sample description file

# Specify the names of the sample files and tag them as "treat" and "control".

###################################################################################

# Specify the genome file, and specify the path where the

# indexed file will be stored, and the prifex of the indexed genome.

# Default is 'bowtie2_genome'.

GENOME SequenceChromosome.fasta bowtie2_genome/sequenceChr

# Specify the fastq files

# Specify the output name of the sam files.

# Followed by the tag/class/condition of the sample (treated or control)

# List all the fastq files as bellow.

FASTQ sampleTreat_1.R1.fq sampleTreat_1.R2.fq sampleTreat_1.sam treat

FASTQ sampleTreat_2.R1.fq sampleTreat_2.R2.fq sampleTreat_2.sam treat

FASTQ sampleTreat_3.R1.fq sampleTreat_3.R2.fq sampleTreat_3.sam treat

FASTQ sampleCtrl_1.R1.fq sampleCtrl_1.R2.fq sampleCtrl_1.sam control

FASTQ sampleCtrl_2.R1.fq sampleCtrl_2.R2.fq sampleCtrl_2.sam control

FASTQ sampleCtrl_3.R1.fq sampleCtrl_3.R2.fq sampleCtrl_3.sam control

# End of file "sample"

下の方のコメントアウトされていない６行（白）がfastq情報になる。ペアエンドのR1<space>R2<space>output.sam<space>case/control を記載する。(３群以上の多変量解析は行えない)。

実際のランでは、赤字の行のGENOME SequenceChromosome.fasta bowtie2_genome/sequenceChrを自分の使用しているgenomeのFASTAファイルとbowtie indexのディレクトリに修正する。indexはラン中に作成されるので、名前だけ指示しておけば良い（上ではbowtie2_genome/にsequenceChr~ができる）。

もう１つ必要なのはパラメータを記載したファイル。 exampleではparam.input.bowtie。テストはデフォルト設定でランできるが、実際のランでは白字のgtfとbedファイル名を修正する必要がある。また、GO enrichment解析はTERM2GENE.csvをもとに実行される（GO databaseはdata/TERM2NAME.csv）。GO enrichment解析を実行したいなら、事前に対応表を作って置いておく必要がある（*１）。

> cat param.input.bowtie

cat param.input.bowtie

####################################################################

# File "param.input" - defination file

# Define the paths and parameters to run the external packages.

####################################################################

# Describe the Bowtie options below as:

BOWTIE -I 0

BOWTIE -X 500

BOWTIE -k 1

BOWTIE -p 1

# In case the package salmon is to be run, uncomment the options.

# These are the default values.

#SALMONINDEX -k 29

#SALMONQUANT -l A

#SALMONQUANT -p 2

#SALMONQUANT --validateMappings TRUE

# Define the Featurecounts options as bellow:

FEATURECOUNTS a oldAnnotationGFF.gtf

# Define the Featurecounts output file name.

FEATURECOUNTS o FeatCount

# Specify a count file name

COUNTFILE countFile.csv

# geneBody coverage.r require a bed file. Specify the name of bed file as bellow:

geneBody_coverage r oldAnnotationGFF.bed

# Specify if batch effect removal is required. FALSE if not required, else TRUE.

BATCH_EFFECT_REMOVE FALSE

# For pathway analysis, define the log fold change cutoff as bellow:

PATHWAY cutoffPositive 2.0

PATHWAY cutoffNegative -2.0

# For pathway analysis, define the organism in three alphabets as bellow.

# ypy = Yersinia pseudotuberculosis

# User need to change the keg abbreviation of their genome which can be

# found in https://www.genome.jp/kegg/catalog/org_list.html. Here ypy is

# the Yersinia pseudotuberculosis YPIII

PATHWAY Organism ypy

# For Gene Ontology of the pathway analysis, define GO term and gene name file.

PATHWAY TERM2GENE data/TERM2GENE.csv

PATHWAY TERM2NAME data/TERM2NAME.csv

# Specify the path to samtools

PATH SAMTOOLS /root/prokseq/depend/samtools/bin

# Specify the root path. That means where the ProkSeq bundle is unpacked.

# The location should have the following folders

# 1. depend - contains all the binaries of the external packages

# 2. scripts - contains all the modules required for ProkSeq, and pipeline-vx.x.sh

# 3. data - Contains Gene ontology files for pathway analysis.

PATH ROOT /root/prokseq

# If the above environment (depend, scripts, data) is true, the following

# line maye uncommented.

#PATH DEFAULT

# Specify the path to geneBody_coverage

PATH geneBody_coverage /root/prokseq/depend/RSeQC-2.6.2/scripts/

# Specify the path to FEATURECOUNTS

PATH FEATURECOUNTS /root/prokseq/depend/subread-1.4.6-p5-Linux-i386/bin/

# Specify the path to fastqc

PATH FASTQC /root/prokseq/depend/FastQC

# Specify the path to bowtie

PATH BOWTIE /root/prokseq/depend/bowtie2/bowtie2-2.3.5.1-linux-x86_64

# Specify the path to pypy required for running afterqc

PATH PYPY /root/prokseq/depend/pypy2.7-v7.2.0-linux64/bin

# Specify the path to readfasta

PATH READFASTA /root/prokseq/depend

# End of file "param.input"

(py36) sh-5.0#

注意；エラーが起こるので実際のランでも/root/proseq/を使うこと。

実行

python scripts/pipeline-v2.8.py -s samples.bowtie.PEsample -p param.input.bowtie -n 8

-s provide the sample file
-p provide the parameter file
-n provide the number of processors

Checking required tools or packages:

samtools : Found

salmon : Not found

fastqc : Found

bowtie : Found

afterqc : Found

readFasta : Found

R : Found

GnBdCov : Found

FeatCnt : Found

Checking R packages:

DESeq2 : TRUE

ggplot2 : TRUE

edgeR : TRUE

NOISeq : TRUE

limma : TRUE

clusterProfiler : TRUE

apeglm : TRUE

SALMON:

salmon IS NOT FOUND!

Done with package checks. Seems all the required packages are available.

Do you want to continue? (Y/N) : Y

Yで実行。

PATHWAY ANALYSIS

Rscript --vanilla /root/prokseq/depend/clusterProf.r 2.0 -2.0 ypy data/TERM2GENE.csv data/TERM2NAME.csv

Output:

Registered S3 method overwritten by 'enrichplot':

method from

fortify.enrichResult DOSE

clusterProfiler v3.14.3 For help: https://guangchuangyu.github.io/software/clusterProfiler

If you use clusterProfiler in published research, please cite:

Guangchuang Yu, Li-Gen Wang, Yanyan Han, Qing-Yu He. clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS: A Journal of Integrative Biology. 2012, 16(5):284-287.

[1] "ypy"

[1] "Found organism"

`universe` is not in character and will be ignored...

Success.

終了

出力

QC_preFilter/ #fastqc出力

f:id:kazumaxneo:20200628193238p:plain

QC_afterFilter/ #afterQC出力

f:id:kazumaxneo:20200628193630p:plain

sam/

f:id:kazumaxneo:20200628193436p:plain

bam/

f:id:kazumaxneo:20200628193352p:plain
wig/

f:id:kazumaxneo:20200628193452p:plain

AlignmentStat/

f:id:kazumaxneo:20200628193536p:plain

Results/

f:id:kazumaxneo:20200628193511p:plain

edgeR_results.txt

f:id:kazumaxneo:20200628214447p:plain

>DESeq2lfcShrink_results.txt

f:id:kazumaxneo:20200628214508p:plain

> DESeq2_results.txt

f:id:kazumaxneo:20200628214532p:plain

> afterNoiseq.txt

f:id:kazumaxneo:20200628214559p:plain

＞KEGGpathway.txt

f:id:kazumaxneo:20200628214213p:plain

> KEGGenricher.txt

f:id:kazumaxneo:20200628214232p:plain

> GOpathways.txt

f:id:kazumaxneo:20200628214301p:plain

> GOenricher.txt

f:id:kazumaxneo:20200628214401p:plain

plots/

f:id:kazumaxneo:20200628193326p:plain

correlation_violin.pdf

f:id:kazumaxneo:20200628192851p:plain

correlation.pdf

f:id:kazumaxneo:20200628192911p:plain

DESeq-Volcano.pdf

f:id:kazumaxneo:20200628192926p:plain

edgeRvolcano.pdf

f:id:kazumaxneo:20200628192950p:plain

MA-plot.pdf

f:id:kazumaxneo:20200628193009p:plain

pca.pdf

f:id:kazumaxneo:20200628193024p:plain

pValue-histogram.pdf

f:id:kazumaxneo:20200628193040p:plain

RSeQC.geneBodyCoverage.curves.pdf

f:id:kazumaxneo:20200628193102p:plain

sampleCtrl_1.sortedBAM.bam_RSeQC.geneBodyCoverage.curves.pdf(サンプルごとに個別出力されるので合計6つ)

f:id:kazumaxneo:20200628193201p:plain

注意；ノートパソコなどで実行する場合は空き容量に注意して下さい。データによりますが、SAM出力するので、dockerの利用できる容量と空き容量を増やしておいて下さい。目安は100GB以上です。

引用

ProkSeq for complete analysis of RNA-seq data from prokaryotes
Authors: A K M Firoj Mahmud1, Soumyadeep Nandi#2, Maria Fällman

bioRxiv, Posted June 09, 2020.

*１

TERM2GENE.csvの中身

f:id:kazumaxneo:20200629210453p:plain

遺伝子名は他のアノテーションファイルと合致していること。

TERM2NAME.csvはGO termを定義するファイル。

f:id:kazumaxneo:20200629210709p:plain

2020-06-27

リファレンスゲノムのアノテーション情報をターゲットゲノムに移す Liftoff

2020 Preprint liftOver Bioinformatics human genome

　DNA シーケンシング技術と計算手法の向上により、多くの種の高品質なゲノムアセンブリが大幅に増加している。これらのゲノムの生物学を理解するためには、遺伝子の特徴やその他の機能的エレメントのアノテーションが不可欠であるが、ほとんどの種ではリファレンスゲノムのみが十分にアノテーションされている。ここでは、同種または近縁種の2つのアセンブリ間で遺伝子をマッピングすることができる新しいゲノムアノテーションリフトオーバーツールであるLiftoffについて説明する。Liftoffは、リファレンスゲノムからターゲットゲノムに遺伝子をアラインメントし、各エクソン、トランスクリプト、遺伝子の構造を保持しながら、配列の同一性を最大化するマッピングを見つける。著者らは、Liftoffがヒトリファレンスゲノムの2つのバージョン間（GRCh37とGRCh38）で99.9%の遺伝子を平均99.9%以上の配列同一性で正確にマッピングできることを示す。また、Liftoffは、ヒトのタンパク質をコードする遺伝子の98.4%以上を、配列同一性98.7%のチンパンジーゲノムアセンブリにリフティングすることに成功し、種を超えて遺伝子をマッピングできることを示す。

My new paper with @StevenSalzberg1 is out in Bioinformatics (online as of Dec. 15th)! Liftoff is a tool we created to lift gene annotations over from one genome assembly to another https://t.co/EQ1JWT2MPn
— Alaina Shumate (@alaina_shumate) January 12, 2021

The 20th birthday of the Human Genome Project seems like a great day for my latest pre-print with @StevenSalzberg1 to come out! Liftoff is an open-source tool to map genome annotation from one genome to another. https://t.co/xCgwviG8nt
— Alaina Shumate (@alaina_shumate) 2020年6月26日

Do you need a robust system for mapping genome annotation from one genome to another? One that doesn't require you to somehow first align the genomes base-by-base? Check out Liftoff, by @alaina_shumate, and our new preprint https://t.co/PeuEyC3NXC
— Steven Salzberg (@StevenSalzberg1) 2020年6月26日

ゲノム全体をアラインメントするのではなく、遺伝子配列のみをアラインメントすることで、2つのゲノム間に構造的な違いが多くあっても、遺伝子をリフトオーバーすることができる。Liftoffは、各遺伝子について、転写物と遺伝子の構造を保持しながら、配列の同一性を最大化するエクソンのアラインメントを見つける。2つの遺伝子が重複する遺伝子座に誤ってマップされている場合、Liftoffはどちらの遺伝子が誤ってマップされている可能性が高いかを判断し、それを再マップしようとする。Liftoffは、リファレンスにアノテーションされていない、ターゲットアセンブリに存在する追加の遺伝子コピーを見つけることもできる。

インストール

依存

Liftoff requires Python3 and also depends on Minimap2.
python3>

conda install -c bioconda minimap2

本体　Github

git clone https://github.com/agshumate/Liftoff
cd liftoff/
python setup.py install #install_requires=['numpy', 'biopython','gffutils', 'networkx','pysam']

> python liftoff.py

$ python liftoff.py

usage: liftoff.py [-h] -t <target.fasta> -r <reference.fasta>

[-g <ref_annotation.gff>] [-chroms <chroms.txt>] [-p 1]

[-o <output.gff>] [-db DB] [-infer_transcripts]

[-u <unmapped_features.txt>] [-infer_genes] [-a 0.5]

[-s 0.5] [-unplaced <unplaced_seq_names.txt>] [-copies]

[-sc 1.0] [-m PATH] [-dir <intermediate_files_dir>]

liftoff.py: error: the following arguments are required: -t, -r

(base) kamisakakazumanoMac-mini:liftoff kazu$ python liftoff.py -h

usage: liftoff.py [-h] -t <target.fasta> -r <reference.fasta>

[-g <ref_annotation.gff>] [-chroms <chroms.txt>] [-p 1]

[-o <output.gff>] [-db DB] [-infer_transcripts]

[-u <unmapped_features.txt>] [-infer_genes] [-a 0.5]

[-s 0.5] [-unplaced <unplaced_seq_names.txt>] [-copies]

[-sc 1.0] [-m PATH] [-dir <intermediate_files_dir>]

Lift features from one genome assembly to another

optional arguments:

-h, --help show this help message and exit

-t <target.fasta> target fasta genome to lift genes to

-r <reference.fasta> reference fasta genome to lift genes from

-g <ref_annotation.gff>

annotation file to lift over in gff or gtf format

-chroms <chroms.txt> comma seperated file with corresponding chromosomes in

the reference,target sequences

-p 1 processes

-o <output.gff> output file

-db DB name of feature database. If none, -g argument must be

provided and a database will be built automatically

-infer_transcripts use if GTF file only includes exon/CDS features and

does not include transcripts/mRNA

-u <unmapped_features.txt>

name of file to write unmapped features to

-infer_genes use if GTF file only includes transcripts, exon/CDS

features

-a 0.5 minimum alignment coverage to consider a feature

mapped [0-1]

-s 0.5 minimum sequence identity in child features (usually

exons/CDS) to consider a feature mapped [0-1]

-unplaced <unplaced_seq_names.txt>

text file with name(s) of unplaced sequences to map

genes from after genes from chromosomes in chroms.txt

are mapped

-copies look for extra gene copies in the target genome

-sc 1.0 with -copies, minimum sequence identity in exons/CDS

for which a gene is considered a copy. Must be greater

than -s

-m PATH Minimap2 path

-dir <intermediate_files_dir>

name of directory to save intermediate fasta and SAM

files

実行方法

ターゲットゲノムのFASTAファイル、リファレンスのFASTAファイルとアノテーションのGFFファイルを指定する。

python liftoff.py -t target.fa -r ref.fa -g ref.gff

引用

Liftoff: an accurate gene annotation mapping tool

Alaina Shumate, Steven Salzberg

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.24.169680

2020 12/16

Liftoff: accurate mapping of gene annotations
Alaina Shumate, Steven L Salzberg
Bioinformatics, Published: 15 December 2020

2020-06-26

バクテリアとアーキアのアミノ酸生合成パスウェイを調べる GapMind

2020 mSystems bacteria web tool Pathway amino acids archaea

2021 1/15 ファイルサイズが大きいと受け付けないエラーが修正されたのを確認

　ゲノム配列は何万もの微生物について利用可能である。これらの微生物のほとんどについては、分離された条件以外にその生理学についてはほとんど知られていない。また、酵母エキスのような複雑な基質を用いて単離された微生物の場合、その栄養所要量については何も知られていない。これらの微生物の生態学的役割や潜在的な利用法を理解するためには、原則としてゲノム配列から予測可能な生育要件を理解することが重要である。具体的には、微生物が20種類の標準アミノ酸を合成できるかどうかに焦点を当てる。

　比較ゲノミクスツールの中には、微生物がどのアミノ酸を合成できるかを予測しようとするものもあるが（ref.1,2）、その予測は全く信頼できるものではない（ref.3）。例えば、最小培地で生育できる細菌でIntegrated Microbial Genomes tool(ref.1)をテストしたところ、これらの細菌はすべてのアミノ酸を合成できるにもかかわらず、平均して6つのアミノ酸に対してauxotrophic （wiki）になると予測されることが分かった(ref.3)。あるいは、ゲノムスケールの代謝モデルを用いてauxotrophiesを同定することもできるが、正確なモデルはほとんどの分類群では利用できない。著者らが知る限り、正確で自動化された補助病の予測が成功しているのは、よく研究されている分類群、例えば腸内細菌やシュードモナスのようなものだけである。定義された培地で生育する40種類の多様な腸内細菌の研究では、自動生成された代謝モデルは、そのうちの30種類の生育を予測することができなかった(ref.5)。

　成長要件を自動的に予測することはいくつかの理由から困難である。第一に、多くの細菌は教科書に記載されている大腸菌や枯草菌の標準的な生合成パスウェイを使用していない。これらのバリアントパスウェイは、自動化ツールが頼りにしているデータベースからはしばしば欠落している(ref.3, 6)。変異したパスウェイや変異した酵素は発見され続けているため、ゲノム配列だけから微生物の増殖能力を正確に予測することはまだできないかもしれない(ref.3)。

　第二に、タンパク質の配列から酵素活性を予測することは、その配列が実験的に研究されたどのタンパク質とも大きく異なる場合には困難である。カバレッジを高めるために、比較ツールは、実験的に研究されていないタンパク質のアノテーションを含む、アノテーションされたタンパク質のデータベースに依存していることが多い。残念ながら、GenBank、KEGG、SEEDなどのデータベースにある酵素アノテーションの多くは正しくない(ref.7, 8)。もう一つの問題は、比較ツールがベストヒット同定だけに依存していることが多く、これは融合タンパク質や分裂タンパク質にはうまく機能しない。例えば、あるタンパク質がXとYの融合であり、そのベストヒットがXである場合、そのタンパク質はXとアノテーションされ、Yは存在しないように見えるかもしれない。

　著者らは、原核生物ゲノムのアミノ酸生合成パスウェイを再構築してアノテーションするツールGapMindを構築した。生合成パスウェイの理解が限られていることと、自動アノテーションの難しさを考えると、GapMindは生合成能力の有無を予測するものではない。その代わりに、現在の知識に基づいて各アミノ酸を作るための最も妥当なパスウェイを特定し、潜在的なギャップを強調する。例えば、あるステップの分岐した候補が特定された場合、そのステップは中程度の信頼度と表示され、そのステップが最も可能性の高いパスウェイの一部である場合はハイライトされる。ユーザーは結果を調べて、そのパスウェイが存在する可能性が高いかどうかを判断することができる。　

　GapMindの結果が類似タンパク質の機能に関する実験データに確実にトレースできるようにするために、GapMindは主に実験的に特徴づけられたタンパク質との類似性に依存している。GapMindはベストヒットを使用せず、融合タンパク質や分割タンパク質を正しく処理している。GapMindはウェブベースのインターフェースを持ち(http://papers.genomics.lbl.gov/gaps)、実行にはゲノムあたり約15秒かかる。

　GapMindのデータベースには、数十種類のバリアント生合成パスウェイや酵素が含まれている。追加のバリアントを特定するために、定義された培地で生育し、大規模な遺伝子データが利用可能な35の細菌を対象にGapMindをテストした。その遺伝子データに基づいて、2つのバリアントパスウェイと数十種類の発散した酵素をGapMindのデータベースに組み込んだ。

　とはいえ、まだまだ多くの変異パスウェイや酵素が発見されていない。このように、GapMindには”known gaps”のデータベースも含まれている。興味のあるゲノムに、類似の生物（最小限の培地で生育可能な生物）の”known gaps”なステップが欠落しているように見える場合、GapMindはそのステップを”known gaps”としてマークする。このようにして、ユーザーはそのギャップがまだ知られていない酵素やパスウェイによるものである可能性があることを知ることができる。

https://twitter.com/search?q=GapMind&src=typed_query

Our colleague Morgan Price just released GapMind for automated annotation of amino acid biosynthesis in #microbes https://t.co/SAJtMjqp8y pic.twitter.com/SfFY7ZN9n1
— @Deutschbauer_Lab (@DeutschbauerLab) 2019年6月5日

GapMindでは、17種類のアミノ酸の生合成と、芳香族アミノ酸の前駆体であるコリスマテートの生合成を記述している。GapMindには、他の3つのアミノ酸（アラニン、アスパラギン酸、またはグルタミン酸）の生合成は含まれていないが、これらはそれぞれ、中央代謝からの中間体（ピルビン酸、オキサロ酢酸、またはα-ケトグルタル酸）のトランスアミノ化によって形成されるためである。アミノ酸トランスアミナーゼはしばしば非特異的であり、それらの正確な基質をアノテーションすることは困難であるので、これらの3つのトランスアミノ化反応を触媒する酵素が存在し、アミノ酸を生成することができると仮定する。GapMindの主な目的は、微生物がどのようにして最小限の栄養素で成長できるかを理解することなので、異常な栄養要求に対応するパスウェイは含めていない（論文参照）。

webサービス

http://papers.genomics.lbl.gov/cgi-bin/gapView.cgiにアクセスする。

f:id:kazumaxneo:20200625012958p:plain

35の既知バクテリアのアミノ酸合成系遺伝子の潜在的ギャップ

Potential Gaps in Amino acid biosynthesis

データベースに登録されているバクテリアやアーキアのアミノ酸合成系パスウエイ（とそのギャップ）を探索したり、ユーザーがプロテオーム配列をアップロードしてアミノ酸合成系遺伝子が見つかるのか調べることができる。

f:id:kazumaxneo:20200626130841p:plain

ユーザーがプロテオーム配列をアップロードした場合も数十秒待てば結果は出力される。

探索結果。赤字はlow confidence、？はknown gaps（写真下の注釈）。

f:id:kazumaxneo:20200626131655p:plain

アミノ酸をクリックすると詳細が表示される。このデータではL-histidine生合成系のhistidinol-phosphate phosphataseという酵素のFidelityが低い。histidinol-phosphate phosphataseのBest candidateの文字をクリックすると、

f:id:kazumaxneo:20200626131915p:plain

amino acid identityとカバレッジが低いことが分かる。

f:id:kazumaxneo:20200626132037p:plain

（高信頼性の候補については、80％のカバレッジを持つ特徴的なタンパク質との40％のアミノ酸同一性、または80％のカバレッジを持つキュレーションされたファミリーとの一致を必要とする）

引用

GapMind: Automated Annotation of Amino Acid Biosynthesis
Morgan N. Price, Adam M. Deutschbauer, Adam P. Arkin

mSystems, 2020

anvi'oのパンゲノム解析でヒートマップを追加する

ANI heatmap pan-genome metagenome 系統解析

Prochlorococcus Metapangenome - Anvi'o Server

　anvi'oは様々な解析方法や表現方法をサポートするマルチオミクス解析パッケージである。その機能の1つに、パンゲノムやメタゲノム（binned.fasta）のgenomic ANIを総当たりで計算し、 anvi'oマップにヒートマップのレイヤーとして表示する機能がある。ここではANI計算を行なってANIヒートマップレイヤー付きanvi'oマップを描く手順をまとめておきます。

マニュアル

インストール

公式dockerイメージを使ってubuntu18.04LTS上でテストした。

本体　Github

依存が多いので、condaだと依存チェックに異常な時間がかかる。dockerを使うと簡単。

#docker (dockerhub) (link)

#latest (v6)
docker pull meren/anvio:latest

インストールチェック

> anvi-self-test --suite pangenomics

help

> anvi-gen-contigs-database -h

usage: anvi-gen-contigs-database [-h] -f FASTA [-n PROJECT_NAME]

[-o DB_FILE_PATH] [--description TEXT_FILE]

[-L INT] [-K INT] [--skip-gene-calling]

[--prodigal-translation-table INT]

[--external-gene-calls GENE-CALLS]

[--ignore-internal-stop-codons]

[--skip-mindful-splitting]

Generate a new anvi'o contigs database.

optional arguments:

-h, --help show this help message and exit

MANDATORY INPUTS:

Things you really need to provide to be in business.

-f FASTA, --contigs-fasta FASTA

The FASTA file that contains reference sequences you

mapped your samples against. This could be a reference

genome, or contigs from your assembler. Contig names

in this file must match to those in other input files.

If there is a problem anvi'o will gracefully complain

about it.

-n PROJECT_NAME, --project-name PROJECT_NAME

Name of the project. Please choose a short but

descriptive name (so anvi'o can use it whenever she

needs to name an output file, or add a new table in a

database, or name her first born).

OPTIONAL INPUTS:

Things you may want to tweak.

-o DB_FILE_PATH, --output-db-path DB_FILE_PATH

Output file path for the new database.

--description TEXT_FILE

A plain text file that contains some description about

the project. You can use Markdwon syntax. The

description text will be rendered and shown in all

relevant interfaces, including the anvi'o interactive

interface, or anvi'o summary outputs.

-L INT, --split-length INT

Anvi'o splits very long contigs into smaller pieces,

without actually splitting them for real. These

'virtual' splits improve the efficacy of the

visualization step, and changing the split size gives

freedom to the user to adjust the resolution of their

display when necessary. The default value is (20000).

If you are planning to use your contigs database for

metagenomic binning, we advise you to not go below

10,000 (since the lower the split size is, the more

items to show in the display, and decreasing the split

size does not really help much to binning). But if you

are thinking about using this parameter for ad hoc

investigations other than binning, you should ignore

our advice, and set the split size as low as you want.

If you do not want your contigs to be split, you can

set the split size to '0' or any other negative

integer (lots of unnecessary freedom here, enjoy!).

-K INT, --kmer-size INT

K-mer size for k-mer frequency calculations. The

default k-mer size for composition-based analyses is

4, historically. Although tetra-nucleotide frequencies

seem to offer the the sweet spot of sensitivity,

information density, and manageable number of

dimensions for clustering approaches, you are welcome

to experiment (but maybe you should leave it as is for

your first set of analyses).

--skip-mindful-splitting

By default, anvi'o attempts to prevent soft-splitting

large contigs by cutting proper gene calls to make

sure a single gene is not broken into multiple splits.

This requires a careful examination of where genes

start and end, and to find best locations to split

contigs with respect to this information. So, when the

user asks for a split size of, say, 1,000, it serves

as a mere suggestion. When this flag is used, anvi'o

does what the user wants and creates splits at desired

lengths (although some functionality may become

unavailable for the projects that rely on a contigs

database that is initiated this way).

GENES IN CONTIGS:

Expert thingies.

--skip-gene-calling By default, generating an anvi'o contigs database

includes the identification of open reading frames in

contigs by running a bacterial gene caller. Declaring

this flag will by-pass that process. If you prefer,

you can later import your own gene calling results

into the database.

--prodigal-translation-table INT

This is a parameter to pass to the Prodigal for a

specific translation table. This parameter corresponds

to the parameter `-g` in Prodigal, the default value

of which is 11 (so if you do not set anything, it will

be set to 11 in Prodigal runtime. Please refer to the

Prodigal documentation to determine what is the right

translation table for you if you think you need it.)

--external-gene-calls GENE-CALLS

A TAB-delimited file to utilize external gene calls.

The file must have these columns: 'gene_callers_id' (a

unique integer number for each gene call, start from

1), 'contig' (the contig name the gene call is found),

'start' (start position, integer), 'stop' (stop

position, integer), 'direction' (the direction of the

gene open reading frame; can be 'f' or 'r'), 'partial'

(whether it is a complete gene call, or a partial one;

must be 1 for partial calls, and 0 for complete

calls), 'source' (the gene caller), and 'version' (the

version of the gene caller, i.e., v2.6.7 or v1.0). An

example file can be found via the URL

https://bit.ly/2qEEHuQ

--ignore-internal-stop-codons

This is only relevant when you have an external gene

calls file. If anvi'o figures out that your custom

gene calls result in amino acid sequences with stop

codons in the middle, it will complain about it. You

can use this flag to tell anvi'o to don't check for

internal stop codons, EVEN THOUGH IT MEANS THERE IS

MOST LIKELY SOMETHING WRONG WITH YOUR EXTERNAL GENE

CALLS FILE. Anvi'o will understand that sometimes we

don't want to care, and will not judge you. Instead,

it will replace every stop codon residue in the amino

acid sequence with an 'X' character. Please let us

know if you used this and things failed, so we can

tell you that you shouldn't have really used it if you

didn't like failures at the first place (smiley).

> anvi-gen-genomes-storage -h

usage: anvi-gen-genomes-storage [-h] [-e FILE_PATH] [-i FILE_PATH]

[--gene-caller GENE-CALLER] -o GENOMES_STORAGE

Create a genome storage from internal or external genomes for a pan genome

analysis.

optional arguments:

-h, --help show this help message and exit

EXTERNAL GENOMES:

External genomes listed as anvi'o contigs databases. As in, you have one

or more genomes say from NCBI you want to work with, and you created an

anvi'o contigs database for each one of them.

-e FILE_PATH, --external-genomes FILE_PATH

A two-column TAB-delimited flat text file that lists

anvi'o contigs databases. The first item in the header

line should read 'name', and the second should read

'contigs_db_path'. Each line in the file should

describe a single entry, where the first column is the

name of the genome (or MAG), and the second column is

the anvi'o contigs database generated for this genome.

INTERNAL GENOMES:

Genome bins stored in an anvi'o profile databases as collections.

-i FILE_PATH, --internal-genomes FILE_PATH

A five-column TAB-delimited flat text file. The header

line must contain these columns: 'name', 'bin_id',

'collection_id', 'profile_db_path', 'contigs_db_path'.

Each line should list a single entry, where 'name' can

be any name to describe the anvi'o bin identified as

'bin_id' that is stored in a collection.

PRO STUFF:

Things you may not have to change. But you never know (unless you read the

help).

--gene-caller GENE-CALLER

The gene caller to utilize. Anvi'o supports multiple

gene callers, and some operations (including this one)

requires an explicit mentioning of which one to use.

The default is 'prodigal', but it will not be enough

if you if you were a rebel and have used `--external-

gene-callers` or something.

OUTPUT:

Give it a nice name. Must end with '-GENOMES.db'. This is primarily due to

the fact that there are other .db files used throughout anvi'o and it

would be better to distinguish this very special file from them.

-o GENOMES_STORAGE, --output-file GENOMES_STORAGE

File path to store results.

> anvi-pan-genome -h

WARNING

===============================================

If you publish results from this workflow, please do not forget to cite DIAMOND

(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL

(http://micans.org/mcl/ and doi:10.1007/978-1-61779-361-5_15)

usage: anvi-pan-genome [-h] -g GENOMES_STORAGE [-G GENOME_NAMES]

[--skip-alignments] [--skip-homogeneity]

[--quick-homogeneity] [--align-with ALIGNER]

[--exclude-partial-gene-calls] [--use-ncbi-blast]

[--minbit MINBIT] [--mcl-inflation INFLATION]

[--min-occurrence NUM_OCCURRENCE]

[--min-percent-identity PERCENT] [--sensitive]

[-n PROJECT_NAME] [--description TEXT_FILE]

[-o PAN_DB_DIR] [-W] [-T NUM_THREADS]

[--skip-hierarchical-clustering]

[--enforce-hierarchical-clustering]

[--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]

A DIAMOND and MCL-based anvi'o workflow for pangenomics. You provide genomes

from anywhere (whether they are external genomes, or anvi'o genome bins in

collections), and it gives you back a pangenome analysis.

optional arguments:

-h, --help show this help message and exit

GENOMES:

The very fancy genomes storage file. This file is generated by the program

`anvi-genomes-storage`. Please see the online tutorial on pangenomic

workflow if you don't know how to generate one.

-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE

Anvi'o genomes storage file

-G GENOME_NAMES, --genome-names GENOME_NAMES

Genome names to 'focus'. You can use this parameter to

limit the genomes included in your analysis. You can

provide these names as a comma-separated list of

names, or you can put them in a file, where you have a

single genome name in each line, and provide the file

path.

PARAMETERS:

Important stuff Tom never pays attention (but you should).

--skip-alignments By default, anvi'o attempts to align amino acid

sequences in each gene cluster using multiple sequnce

alignment via muscle. You can use this flag to skip

that step and be upset later.

--skip-homogeneity By default, anvi'o attempts to calculate homogeneity

values for every gene cluster, given that they are

aligned. You can use this flag to have anvi'o skip

homogeneity calculations. Anvi'o will ignore this flag

if you decide to skip alignments

--quick-homogeneity By default, anvi'o will use a homogeneity algorithm

that checks for horizontal and vertical geometric

homogeneity (along with functional). With this flag,

you can tell anvi'o to skip horizontal geometric

homogeneity calculations. It will be less accurate but

quicker. Anvi'o will ignore this flag if you skip

homogeneity calculations or alignments all together.

--align-with ALIGNER The multiple sequence alignment program to use when

multiple sequence alignment is necessary. To see all

available options, use the flag `--list-aligners`.

--exclude-partial-gene-calls

By default, anvi'o includes all partial gene calls

from the analysis, which, in some cases, may inflate

the number of gene clusters identified and introduce

extra heterogeneity within those gene clusters. Using

this flag, you can request anvi'o to exclude partial

gene calls from the analysis (whether a gene call is

partial or not is an information that comes directly

from the gene caller used to identify genes during the

generation of the contigs database).

--use-ncbi-blast This program uses DIAMOND by default, however, if you

like, you can use good ol' blastp from NCBI instead.

--minbit MINBIT The minimum minbit value. The minbit heuristic

provides a mean to set a to eliminate weak matches

between two amino acid sequences. We learned it from

ITEP (Benedict MN et al, doi:10.1186/1471-2164-15-8),

which is a comprehensive analysis workflow for

pangenomes, and decided to use it in the anvi'o

pangenomic workflow, as well. Briefly, If you have two

amino acid sequences, 'A' and 'B', the minbit is

defined as 'BITSCORE(A, B) / MIN(BITSCORE(A, A),

BITSCORE(B, B))'. So the minbit score between two

sequences goes to 1 if they are very similar over the

entire length of the 'shorter' amino acid sequence,

and goes to 0 if (1) they match over a very short

stretch compared even to the length of the shorter

amino acid sequence or (2) the match betwen sequence

identity is low. The default is 0.5.

--mcl-inflation INFLATION

MCL inflation parameter, that defines the sensitivity

of the algorithm during the identification of the gene

clusters. More information on this parameter and it's

effect on cluster granularity is here:

(http://micans.org/mcl/man/mclfaq.html#faq7.2). The

default is 2.

--min-occurrence NUM_OCCURRENCE

Do you not want singletons?\ You don't? Well, this

parameter will help you get rid of them (along with

doubletons, if you want). Anvi'o will remove gene

clusters that occur less than the number you set using

this parameter from the analysis. The default is 1,

which means everything will be kept. If you want to

remove singletons, set it to 2, if you want to remove

doubletons as well, set it to 3, and so on.

--min-percent-identity PERCENT

Minimum percent identity between the two amino acid

sequences for them to have an edge for MCL analysis.

This value will be used to filter hits from Diamond

search results. Because percent identity is not a

predictor of a good match (since it does not

communicate many other important factors such as the

alignment length between the two sequences and its

proportion to the entire length of those involved), we

suggest you rely on 'minbit' parameter. But you know

what? Maybe you shouldn't listen to anyone, and

experiment on your own! The default is 0 percent.

--sensitive DIAMOND sensitivity. With this flag you can instruct

DIAMOND to be 'sensitive', rather than 'fast' during

the search. It is likely the search will take

remarkably longer. But, hey, if you are doing it for

your final analysis, maybe it should take longer and

be more accurate. This flag is only relevant if you

are running DIAMOND.

OTHERS:

Sweet parameters of convenience.

-n PROJECT_NAME, --project-name PROJECT_NAME

Name of the project. Please choose a short but

descriptive name (so anvi'o can use it whenever she

needs to name an output file, or add a new table in a

database, or name her first born).

--description TEXT_FILE

A plain text file that contains some description about

the project. You can use Markdwon syntax. The

description text will be rendered and shown in all

relevant interfaces, including the anvi'o interactive

interface, or anvi'o summary outputs.

-o PAN_DB_DIR, --output-dir PAN_DB_DIR

Directory path for output files

-W, --overwrite-output-destinations

Overwrite if the output files and/or directories

exist.

-T NUM_THREADS, --num-threads NUM_THREADS

Maximum number of threads to use for multithreading

whenever possible. Very conservatively, the default is

1. It is a good idea to not exceed the number of CPUs

/ cores on your system. Plus, please be careful with

this option if you are running your commands on a SGE

--if you are clusterizing your runs, and asking for

multiple threads to use, you may deplete your

resources very fast.

ORGANIZING GENE CLUSTERs:

These are stuff that will change the clustering dendrogram of your gene

clusters.

--skip-hierarchical-clustering

Anvi'o attempts to generate a hierarchical clustering

of your gene clusters once it identifies them so you

can use `anvi-display-pan` to play with it. But if you

want to skip this step, this is your flag.

--enforce-hierarchical-clustering

If you want anvi'o to try to generate a hierarchical

clustering of your gene clusters even if the number of

gene clusters exceeds its suggested limit for

hierarchical clustering, you can use this flag to

enforce it. Are you are a rebel of some sorts? Or did

computers made you upset? Express your anger towards

machine using this flag.

--distance DISTANCE_METRIC

The distance metric for the clustering of gene

clusters. If you do not use this flag, the default

distance metric will be used for each clustering

configuration which is "euclidean".

--linkage LINKAGE_METHOD

The same story with the `--distance`, except, the

system default for this one is ward.

> anvi-compute-genome-similarity -h

usage: anvi-compute-genome-similarity [-h] [-i FILE_PATH] [-e FILE_PATH]

[-f FASTA_TEXT_FILE] -o DIR_PATH

[-p PAN_DB]

[--program {pyANI,fastANI,sourmash}]

[--fastani-kmer-size FASTANI_KMER_SIZE]

[--fragment-length FRAGMENT_LENGTH]

[--min-num-fragments MIN_NUM_FRAGMENTS]

[--method {ANIm,ANIb,ANIblastall,TETRA}]

[--min-alignment-fraction NUM]

[--significant-alignment-length INT]

[--min-full-percent-identity FULL_PERCENT_IDENTITY]

[--kmer-size INT] [--scale INT]

[--distance DISTANCE_METRIC]

[--linkage LINKAGE_METHOD]

[-T NUM_THREADS] [--just-do-it]

[--log-file FILE_PATH]

Export sequences from sequence sources and compute a similarity metric (e.g.

ANI). If a Pan Database is given anvi'o will write computed output to misc

data tables of Pan Database.

optional arguments:

-h, --help show this help message and exit

INPUT OPTIONS:

Tell anvi'o what you want.

-i FILE_PATH, --internal-genomes FILE_PATH

A five-column TAB-delimited flat text file. The header

line must contain these columns: 'name', 'bin_id',

'collection_id', 'profile_db_path', 'contigs_db_path'.

Each line should list a single entry, where 'name' can

be any name to describe the anvi'o bin identified as

'bin_id' that is stored in a collection.

-e FILE_PATH, --external-genomes FILE_PATH

A two-column TAB-delimited flat text file that lists

anvi'o contigs databases. The first item in the header

line should read 'name', and the second should read

'contigs_db_path'. Each line in the file should

describe a single entry, where the first column is the

name of the genome (or MAG), and the second column is

the anvi'o contigs database generated for this genome.

-f FASTA_TEXT_FILE, --fasta-text-file FASTA_TEXT_FILE

A two-column TAB-delimited file that lists multiple

FASTA files to import for analysis. If using for

`anvi-dereplicate-genomes` or `anvi-compute-distance`,

each FASTA is assumed to be a genome. The first item

in the header line should read 'name', and the second

item should read 'path'. Each line in the field should

describe a single entry, where the first column is the

name of the FASTA file or corresponding sequence, and

the second column is the path to the FASTA file

itself.

OUTPUT OPTIONS:

Tell anvi'o where to store your results.

-o DIR_PATH, --output-dir DIR_PATH

Directory path for output files

-p PAN_DB, --pan-db PAN_DB

This is totally optional, but very useful when

applicable. If you are running this for genomes for

which you already have an anvi'o pangeome, then you

can show where the pan database is and anvi'o would

automatically add the results into the misc data

tables of your pangenome. Those data can then be shown

as heatmaps on the pan interactive interface through

the 'layers' tab.

Program:

Tell anvi'o which similarity program to run.

--program {pyANI,fastANI,sourmash}

Tell anvi'o which program to run to process genome

similarity. For ANI, you should either use pyANI or

fastANI. If accuracy is paramount (for example,

distinguishing things less than 1 percent different),

or for dealing with genomes < 80 percent similar,

pyANI is what we recommend. However, fastANI is much

faster. If you for some reason want to use mash

similarity, you can use sourmash, but its really not

intended for genome comparisons. If you don't choose

anything here, anvi'o will reluctantly set the program

to pyANI, but you really should be the one who is on

top of these things.

fastANI Settings:

Tell anvi'o to tell fastANI what settings to set. Only if `--program` is

set to `pyANI`

--fastani-kmer-size FASTANI_KMER_SIZE

Choose a kmer. The default is 16.

--fragment-length FRAGMENT_LENGTH

Choose a fragment length. The default is 3000.

--min-num-fragments MIN_NUM_FRAGMENTS

Choose the minimum number of fragment lengths to that

can can be trusted. The default is 50.

pyANI Settings:

Tell anvi'o to tell pyANI what method you wish to use and what settings to

set. Only if `--program` is set to `pyANI`

--method {ANIm,ANIb,ANIblastall,TETRA}

Method for pyANI. The default is ANIb. You must have

the necessary binary in path for whichever method you

choose. According to the pyANI help for v0.2.7 at

https://github.com/widdowquinn/pyani, the method

'ANIm' uses MUMmer (NUCmer) to align the input

sequences. 'ANIb' uses BLASTN+ to align 1020nt

fragments of the input sequences. 'ANIblastall': uses

the legacy BLASTN to align 1020nt fragments Finally,

'TETRA': calculates tetranucleotide frequencies of

each input sequence

--min-alignment-fraction NUM

In some cases you may get high raw ANI estimates

(percent identity scores) between two genomes that

have little to do with each other simply because only

a small fraction of their content may be aligned. This

filter will set all ANI scores between two genomes to

0 if the alignment fraction is less than you deem

trustable. When you set a value, anvi'o will go

through the ANI results, and set percent identity

scores between two genomes to 0 if the alignment

fraction *between either of them* is less than the

parameter described here. The default is 0.

--significant-alignment-length INT

So --min-alignment-fraction discards any hit that is

coming from alignments that represent shorter

fractions of genomes, but what if you still don't want

to miss an alignment that is longer than an X number

of nucleotides regardless of what fraction of the

genome it represents? Well, this parameter is to

recover things that may be lost due to --min-

alignment-fraction parameter. Let's say, if you set

--min-alignment-fraction to '0.05', and this parameter

to '5000', anvi'o will keep hits from alignments that

are longer than 5000 nts, EVEN IF THEY REPRESENT less

than 5 percent of a given genome pair. Basically if

--min-alignment-fraction is your shield to protect

yourself from incoming garbage, --significant-

alignment-length is your chopstick to pick out those

that may be interesting, and you are a true warrior

here.

--min-full-percent-identity FULL_PERCENT_IDENTITY

In some cases you may get high raw ANI estimates

(percent identity scores) between two genomes that

have little to do with each other simply because only

a small fraction of their content may be aligned. This

can be partly alleviated by considering the *full*

percent identity, which includes in its calculation

regions that did not align. For example, if the

alignment is a whopping 97 percent identity but only 8

percent of the genome aligned, the *full* percent

identity is 0.970 * 0.080 = 0.078 OR 7.8 percent.

*full* percent identity is always included in the

report, but you can also use it as a filter for other

metrics, such as percent identity. This filter will

set all ANI measures between two genomes to 0 if the

*full* percent identity is less than you deem

trustable. When you set a value, anvi'o will go

through the ANI results, and set all ANI measures

between two genomes to 0 if the *full* percent

identity *between either of them* is less than the

parameter described here. The default is 0.

Sourmash Settings:

Tell anvi'o to tell sourmash what settings to set. Only if `--program` is

set to `sourmash`

--kmer-size INT Set the k-mer size for mash similarity checks. We

found 13 in almost all cases correlates best with

alignment-based ANI.

--scale INT Set the compression ratio for fasta signature file

computations. The default is 1000. Smaller ratios

decrease sensitivity, while larger ratios will lead to

large fasta signatures.

HIERARCHICAL CLUSTERING:

anvi-compute-genome-similarity outputs similarity matrix files, which can

be clustered into nice looking dendrograms to display the relationships

between genomes nicely (in the anvi'o interface and elsewhere). Here you

can set the distance metric and the linkage algorithm for that.

--distance DISTANCE_METRIC

The distance metric for the hierarchical clustering.

The default is "euclidean".

--linkage LINKAGE_METHOD

The linkage method for the hierarchical clustering.

The default is "ward".

OTHER IMPORTANT STUFF:

Yes. You're almost done.

-T NUM_THREADS, --num-threads NUM_THREADS

Maximum number of threads to use for multithreading

whenever possible. Very conservatively, the default is

1. It is a good idea to not exceed the number of CPUs

/ cores on your system. Plus, please be careful with

this option if you are running your commands on a SGE

--if you are clusterizing your runs, and asking for

multiple threads to use, you may deplete your

resources very fast.

--just-do-it Don't bother me with questions or warnings, just do

it.

--log-file FILE_PATH File path to store debug/output messages.

実行方法

ここではパンゲノム解析を想定して進める。

1、FASTAファイルが’置いてある作業ディレクトリにてanvi'oのdockerイメージを立ち上げる。

docker run --rm -it -v `pwd`:`pwd` -w `pwd` -p 8080:8080 meren/anvio:latest

２、microbial genomeのFASTAファイルを対象にデータベースを作成する。この作業はゲノムごとに順番に行う必要があり時間がかかる。計算リソースが潤沢なら、バックグラウンドに回して並行処理することでスピードアップできる。

anvi-gen-contigs-database -f ghenome1.fna -o genome1.db -n 'genome1' &
anvi-gen-contigs-database -f ghenome2.fna -o genome2.db -n 'genome2' &
anvi-gen-contigs-database -f ghenome3.fna -o genome3.db -n 'genome3' &
anvi-gen-contigs-database -f ghenome4.fna -o genome4.db -n 'genome4' &
anvi-gen-contigs-database -f ghenome5.fna -o genome5.db -n 'genome5' &
anvi-gen-contigs-database -f ghenome6.fna -o genome6.db -n 'genome6' &
anvi-gen-contigs-database -f ghenome7.fna -o genome7.db -n 'genome7' &
anvi-gen-contigs-database -f ghenome8.fna -o genome8.db -n 'genome8' &

注意；FASTAファイルのヘッダやファイル名で割と一般的に使われるのが"-", "<space>", "-"などですが、これらはファイル内に存在してもファイル名に存在してもエラーを起こします。必ず置換しておいてください。アンダーバー”_”に置換しておけばエラーは起きません。また、"-n"で指定する名前は視覚化されるときに使われます。禁則文字に注意しつつ適切な名前をつけてください。

３、データベースを統合する。タブ区切りのリストファイルを与える必要がある。-nで指定した名前とdbファイル名が記載されたファイルになる。

list.txt

f:id:kazumaxneo:20200624123643p:plain

リストファイルとデータベース名を指定して実行する。

anvi-gen-genomes-storage -e list.txt -o PROCHLORO-GENOMES.db

統合されたデータベースPROCHLORO-GENOMES.dbが出力される。

４、anvi-pan-genomeプログラムを使ってパンゲノム解析を実行する。3の出力であるPROCHLORO-GENOMES.dbを指定する。

anvi-pan-genome -g PROCHLORO-GENOMES.db -n PROJECT -T 40

ディレクトリ PROJECT/ができ、ディレクトリ内にパンゲノムデータベースPROJECT-PAN.dbと関連ファイルが出力される。以降はPROJECT-PAN.dbを使う。

５、既にデータベースは作成されておりいつでも視覚化できるが、その前にANI計算をして視覚化時にヒートマップレイヤーを選択できるようにする。anvi-compute-genome-similarity コマンドを使う。このコマンドにはPyani（紹介）などの代表的な総当たりANI計算プログラムが組み込まれている。ANIの計算方法はPyaniのGIthub参照。

３で使ったリストファイルとANI計算方法を指定して実行する。

anvi-compute-genome-similarity -p PROCHLORO-GENOMES/PROJECT-PAN.db --program pyANI --method ANIm -T 40 --log-file log -e
list.txt -o ANI

--program {pyANI, fastANI, sourmash} Tell anvi'o which program to run to process genome similarity. For ANI, you should either use pyANI or fastANI. If accuracy is paramount (for example, distinguishing things less than 1 percent different), or for dealing with genomes < 80 percent similar, pyANI is what we recommend. However, fastANI is much faster. If you for some reason want to use mash similarity, you can use sourmash, but its really not intended for genome comparisons. If you don't choose anything here, anvi'o will reluctantly set the program to pyANI, but you really should be the one who is on top of these things.
--method {ANIm, ANIb, ANIblastall, TETRA} Method for pyANI. The default is ANIb. You must have the necessary binary in path for whichever method you choose. According to the pyANI help for v0.2.7 at https://github.com/widdowquinn/pyani, the method 'ANIm' uses MUMmer (NUCmer) to align the input sequences. 'ANIb' uses BLASTN+ to align 1020nt fragments of the input sequences. 'ANIblastall': uses the legacy BLASTN to align 1020nt fragments Finally, 'TETRA': calculates tetranucleotide frequencies of each input sequence
-o Directory path for output files
-p <PAN_DB> This is totally optional, but very useful when applicable. If you are running this for genomes for which you already have an anvi'o pangeome, then you can show where the pan database is and anvi'o would automatically add the results into the misc data tables of your pangenome. Those data can then be shown as heatmaps on the pan interactive interface through the 'layers' tab.
-T Maximum number of threads to use for multithreading whenever possible. Very conservatively, the default is 1. It is a good idea to not exceed the number of CPUs / cores on your system. Plus, please be careful with this option if you are running your commands on a SGE --if you are clusterizing your runs, and asking for multiple threads to use, you may deplete your resources very fast.
--log-file File path to store debug/output messages.

指定したディレクトリに総当たりANI計算結果やnewickファイルが出力される。データベースを指定してランしていれば、ANI計算結果は既にデータベースに組み込まれている。

６、視覚化する。

anvi-display-pan -p PROJECT1/PROJECT-PAN.db -g PROCHLORO-GENOMES.db

http://localhost:8080 にアクセスする。

レイヤータブでANIにチェックをつける。

f:id:kazumaxneo:20200624130903p:plain

完成。右上にANIのヒートマップが追加された。

f:id:kazumaxneo:20200624131128p:plain

ヒートマップはリングの位置関係と揃っているように見えますが、完全には同期していないので注意してください。

マニュアルで完成例を見ることができます。

web server (pangenome)

anvi'o server

引用

Anvi'o: an advanced analysis and visualization platform for 'omics data

Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, Delmont TO

PeerJ. 2015 Oct 8;3:e1319

GO enrichmet解析結果を視覚化する GOplot

CRAN 2015 Bioinformatics 結果の視覚化 (visualization) RNA seq GO enrichment analysis GO term

　オミクスデータの機能分析に利用できる方法が多すぎるにもかかわらず、結果の包括的なまだ詳細な理解を得ることは困難なままになっている。これは主に、この種の情報を視覚化するための公開ツールが不足しているためである。ここでは、グラフィック表示を強化するために、ggplot2に基づいたGOplotと呼ばれるRパッケージを紹介する。このパッケージは、一般的なエンリッチメント分析の出力を取得し、さまざまな詳細レベルでプロットを生成する。このパッケージは、オミクスデータに対するより深い洞察を提供し、研究者がわずか数行のコードで洞察に富んだプロットを生成し、調査結果を簡単に伝達できるようにする。
　RパッケージGOplotは、CRAN-The Comprehensive R Archive Network：http://cran.r-project.org/web/packages/GOplotから入手できる。ベン図のShiny Webアプリケーションは、https://wwalter.shinyapps.io/Venn/に、サンプル図を含むパッケージの詳細なマニュアルは、https://wencke.github.io/にある。

wiki

https://wencke.github.io

manual

https://cran.r-project.org/web/packages/GOplot/GOplot.pdf

ベン図のwebサーバー

https://wwalter.shinyapps.io/Venn/

インストール

本体　Github

install.packages('GOplot')

#latest development version
install_github('wencke/wencke.github.io')

Rのバージョンによりツールが導入できないなら、 Rocker Communityのdockerイメージを使うのが手っ取り早い（紹介）。ローカルのR環境を汚さず、Rstudioをブラウザ上で使用できる。今回はこの方法で使用する。

#tag一覧
#ここでは3.5.1を使う
docker pull r-base:3.5.1

パスワードを指定して立ち上げる。ここではポート番号8787番を指定。

docker run -e PASSWORD=yourpassword --rm -p 8787:8787 rocker/rstudio

localhost:8787にアクセスする。ユーザー名はrstudio、パスワードはここでは"yourpassword"になる。

f:id:kazumaxneo:20200223161439p:plain

テストラン

データセットECが用意されている。

library(GOplot)
head(EC$david)

f:id:kazumaxneo:20200224010848p:plain

head(EC$genelist)

f:id:kazumaxneo:20200623231248p:plain

２、circオブジェクトを作成

circ <- circle_dat(EC$david, EC$genelist)

３、GOBarでバープロット出力

GOBar(subset(circ, category == 'BP'))

f:id:kazumaxneo:20200623230514p:plain

４、GOBubbleでbubble plotを出力

GOBubble(circ, labels = 3)

f:id:kazumaxneo:20200623230653p:plain

またはBP、CC、MFに分けて出力

GOBubble(circ, title = 'Bubble plot', colour = c('orange', 'darkred', 'gold'), display = 'multiple', labels = 3)

f:id:kazumaxneo:20200623230744p:plain

GOBubble(circ, title = 'Bubble plot', colour = c('orange', 'darkred', 'gold'), display = 'multiple', labels = 3)

f:id:kazumaxneo:20191106190105p:plain

詳細はマニュアルを確認して下さい。

引用

GOplot: an R package for visually combining expression data with functional analysis.

Walter W, Sánchez-Cabo F, Ricote M

Bioinformatics. 2015 Sep 1;31(17):2912-4

https://www.rocker-project.org

Github

動物（Metazoa）ミトコンドリアゲノムのアノテーションを行うウェブサーバー MITOS

2013 Molecular Phylogenetics and Evolution web tool annotation mitochondria organelle genome

　信頼性の高い標準化されたゲノムアノテーションは、ゲノム配列データの系統的な比較解析に不可欠な前提条件である。これは、特に系統の再構成、ゲノムリアレンジメントのメカニズムの研究、配列変化の影響の調査に当てはまる。正確で偏りのないアノテーションの必要性は、新しいシーケンス技術によって利用可能になった大量のデータを処理するために自動化されたパイプラインが採用された場合には、さらに緊急性を増している。　

　現在、ミトコンドリアゲノムの完全な配列は、多様な分類群からなる2000種以上のMetazoanで利用可能である。Metazoanのミトコンドリアゲノムは（ごく一部の例外を除いて）平均長さ約16,500ntの環状分子で、極端な長さのものでは11,423nt（Paraspadella gotoi NC_006083）や43,079nt（Trichoplax adhaerens NC_008151）などがある。ミトコンドリアゲノムは、通常、13のタンパク質コード遺伝子、22のtRNA、2つのrRNA、そしてほとんどの調節エレメントを含む1つのノンコーディング領域からなる、よく保存された遺伝子内容を持っている（Wolstenholme, 1992）。この単純な構造は、動物のミトコンドリアゲノムを大規模な比較研究のための魅力的なターゲットにしている。

　ミトコンドリア遺伝子は通常、単一の連続したエクソンで構成されているが、いくつかのクレードではタンパク質をコードする遺伝子やrRNAで例外が報告されており(Beagley et al., 1996, Dellaporta et al., 2006, Wang and Lavrov, 2008)、いくつかのサウロプシドグループでは保存されたフレームシフトが存在する(Mindell et al., 1998)。いくつかのケースでは、いくつかの重複および欠失の証拠もある（例えば、SanMauro et al、2006、Fujita et al、2007）。ミトコンドリアゲノムの特殊性は、逸脱した遺伝子コードの使用、遺伝子の重複、不完全な停止コドンの存在である(Wolstenholme, 1992, Jühling et al., 2012)。これらすべての問題を合わせると、ゲノムアノテーションの作業は複雑になり、大規模な手動による「エキスパートキュレーション」が不可欠になった。このプロセスでは、さまざまなキュレーション担当者によって多くの異なるツールが使用されてきた。Boore (2006)などで議論されているように、これにはいくつかの問題がある。(a) 古いアノテーションで使用されていたツールが時代遅れになっている可能性がある、すなわち、改良された方法がすでに利用可能になっている可能性がある、(b) 相同性アノテーションの基礎として使用されている配列が間違っていたり、不完全であったりする可能性がある、(c) アノテーションのための一般的に認められたガイドラインが存在しない、などである。

　マイトゲノムとそのアノテーションに関する最も包括的で最新のリソースは、NCBI RefSeq (Pruitt et al., 2007)である。RefSeqのキュレーターがデータの質を向上させるために多大な努力をしているにもかかわらず、アノテーションにはいくつかの不整合やエラーが残っており、自動化された解析パイプラインの問題となっている。これには、読み取り方向（ストランド）の情報の欠落や誤り、遺伝子の指定の誤り、遺伝子アノテーションの欠落、trnL1/trnL2とtrnS1/trnS2のtRNAの同一性の間違い、遺伝子名の不一致などが含まれる。

　Boore (2006)は、これらの問題を克服するためのいくつかの可能な解決策を提案している。系統的なエラースクリーニング、遺伝子名の標準化、tRNAのアンチコドンラベリング、遺伝子と遺伝子の境界指定の基準、遺伝子の割り当ての現実を受け入れるための基準である。Bernt et al. (2013a)で詳細にレビューされているいくつかのデータベースは、これらのラインに沿ってRefSeq mitogenomesの改良されたアノテーションを提供することを目的としている。METAMiGA (Feijao et al., 2006)とOGRe (Jameson et al., 2003)は、専門家の知識に基づいたデータの手動による改良を組み込んでいる。tRNAscan-SE (Lowe and Eddy, 1997), ARWEN (Laslett and Canback, 2008), BLAST (Altschul et al., 1990)の検索に基づくルールのリストと専門家の知識を用いた体系的な半自動エラースクリーニングが、最近リリースされた新しいデータベースであるMitoZoa (Lupi et al., 2010)に使用されている。

　遺伝情報は、その情報をもとにして、その情報がどのように変化しているのかを調べたものである。DOGMA (Wyman et al., 2004)は、ミトコンドリアゲノムと葉緑体ゲノムの両方を扱う半自動化された手法のパイプラインである。コーディング遺伝子とノンコーディング遺伝子の同定にBLASTを使用している。COVE (Eddy and Durbin, 1994)は、二次構造に基づいてtRNA候補を同定するためにDOGMAで採用されている。MOSAS (Sheffield et al., 2010)は、配列データの整理とアノテーションに焦点を当てた手法で、昆虫のマイトゲノムを対象としている。tRNAの予測にはARWENとtRNAscan-SEを採用している。BLASTは、MOSASがクエリ配列のローカルデータベース（現在は昆虫のみ）に基づいてオープンリーディングフレームとrRNAを検索するために使用されている。ユーザー定義のカットオフ値が必要であり、予測値を手動で改良する必要があるため、このアプローチは大規模なデータセットに適用することが難しく、予測値の比較可能性が制限されている。

　MITOchondrial genome annotation Server (MITOS)は、Metazoanのミトコンドリアゲノムのde novoアノテーションのための完全自動化されたパイプラインへのアクセスを提供している。また、BLAST検索とアノテーション済みのタンパク質配列とのアグリゲーションに基づく新しい戦略を用いて、タンパク質コード遺伝子を同定する（セクション2.1）。また、tRNAとrRNAは、構造化されたRNAのそれぞれについて、特定の共分散モデルを用いてアノテーションされる（セクション2.2）。本論文では、RefSeq 39に含まれるすべての動物のマイトゲノムのde novoアノテーションにMITOSを適用し、結果の品質を慎重に評価することに焦点を当てた（セクション3）。
　MITOSは、入力としてFASTA形式の配列ファイルと対応する遺伝子コードのみを必要とする。パイプラインは2段階で進行し、最初に各遺伝子の候補配列を特定し、次にこれらを照合して最終的なアノテーションを行う。以下では、MITOSの各コンポーネントの詳細な説明を行う。（以下省略）。

Gitlab

#bioconda
conda create -n mitos -c bioconda -y mitos
conda activate mitos

#pip
pip install mitos

webサービス

MITOS Web Server にアクセスする。

ジョブ名やメールアドレスを記載し、それからミトコンドリアゲノム配列を指定する。

f:id:kazumaxneo:20200622234454p:plain

Advanced

f:id:kazumaxneo:20200622234458p:plain

短い配列でもジョブが終わるまで数時間以上かかる。

出力

アノテーション結果はBEDやGFFなどでダウンロードできる。tRNAの場合、二次構造予測結果へのリンクも付いている。

f:id:kazumaxneo:20200623090939p:plain

引用

MITOS: Improved De Novo Metazoan Mitochondrial Genome Annotation

Matthias Bernt , Alexander Donath, Frank Jühling, Fabian Externbrink, Catherine Florentz, Guido Fritzsch, Joern Pütz, Martin Middendorf, Peter F Stadler

Mol Phylogenet Evol. 2013 Nov;69(2):313-9

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

高速かつメモリ使用量の少ないポリッシングツール POLCA

自動化されたProkaryotesのRNA seq解析パイプライン ProkSeq

リファレンスゲノムのアノテーション情報をターゲットゲノムに移す Liftoff

バクテリアとアーキアのアミノ酸生合成パスウェイを調べる GapMind

anvi'oのパンゲノム解析でヒートマップを追加する

GO enrichmet解析結果を視覚化する GOplot

動物（Metazoa）ミトコンドリアゲノムのアノテーションを行うウェブサーバー MITOS