2020-04-03

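Metagenomic analysis with anvi'o

Updated 2020-04-22

Code fixed 2020-05-20

Advances in high-throughput sequencing and omics technologies are revolutionizing the study of naturally occurring microbial communities. Comprehensive investigation of microbial lifestyles requires the ability to organize and visualize genetic information interactively, and to incorporate the subtle differences that raise the resolution of complex data. The paper introduces anvi'o, an advanced analysis and visualization platform that offers automated and human-guided characterization of microbial genomes in metagenomic assemblies, with an interactive interface that can link omics data from multiple sources into a single, intuitive display. Its extensible visualization approach distills multiple dimensions of information about each contig and provides a dynamic, unified working environment for data exploration, manipulation, and reporting. Using anvi'o, the authors re-analyzed publicly available datasets, explored temporal genomic changes within microbial populations through de novo characterization of single-nucleotide variants, and linked cultivar and single-cell genomes with metagenomic and metatranscriptomic data. Anvi'o is an open-source platform that enables researchers without extensive bioinformatics skills to perform and communicate in-depth analyses of large omics datasets.

http://merenlab.org/software/anvio/

With anvi'o you can perform metagenomic binning, analysis of single-nucleotide variants, bacterial pangenomics, estimation of the number of bacterial genomes in a metagenomic assembly, and even removal of contamination from eukaryotic assembly projects.

The tool is under active development and its features change from version to version, so check the behavior of the version you are using.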

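Installation

Tested on several Ubuntu 18.04 LTS machines using the official Docker image.

Source code: GitHub

#bioconda (link). Note: installation takes a long time because of the many dependencies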
conda create -n anvio -y
conda activate anvio
conda install -c bioconda anvio -y

#homebrew (not tested)
brew tap merenlab/anvio
brew install merenlab/anvio/anvio

Test

> anvi-self-test --suite mini

#Because of the many dependencies, conda spends an extremely long time resolving them. If you only want to run the tool, Docker is easier.

#docker (dockerhub) (link)

#latest
docker pull meren/anvio:latest

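#launch (the current directory is mounted at the same path inside the container; quoting keeps paths with spaces intact)
docker run --rm -it -v "$(pwd)":"$(pwd)" -w "$(pwd)" -p 8080:8080 meren/anvio:latest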

> anvi-setup-ncbi-cogs -h

usage: anvi-setup-ncbi-cogs [-h] [--cog-data-dir COG_DATA_DIR] [--reset]

[--just-do-it] [-T NUM_THREADS]

Download and setup NCBI's Clusters of Orthologous Groups database.

optional arguments:

-h, --help show this help message and exit

--cog-data-dir COG_DATA_DIR

The directory for COG data to be stored. If you leave

it as is without specifying anything, the default

destination for the data directory will be used to set

things up. The advantage of it is that everyone will

be using a single data directory, but then you may

need superuser privileges to do it. Using this

parameter you can choose the location of the data

directory somewhere you like. However, when it is time

to run COGs, you will need to remember that path and

provide it to the program.

--reset Remove all the previously stored files and start over.

If something is feels wrong for some reason and if you

believe re-downloading files and setting them up could

address the issue, this is the flag that will tell

anvi'o to act like a real computer scientist

challenged with a computational problem.

--just-do-it Don't bother me with questions or warnings, just do

it.

-T NUM_THREADS, --num-threads NUM_THREADS

Maximum number of threads to use for multithreading

whenever possible. Very conservatively, the default is

1. It is a good idea to not exceed the number of CPUs

/ cores on your system. Plus, please be careful with

this option if you are running your commands on a SGE

--if you are clusterizing your runs, and asking for

multiple threads to use, you may deplete your

resources very fast.

> anvi-gen-contigs-database -h

usage: anvi-gen-contigs-database [-h] -f FASTA [-n PROJECT_NAME]

[-o DB_FILE_PATH] [--description TEXT_FILE]

[-L INT] [-K INT] [--skip-gene-calling]

[--prodigal-translation-table INT]

[--external-gene-calls GENE-CALLS]

[--ignore-internal-stop-codons]

[--skip-mindful-splitting]

Generate a new anvi'o contigs database.

optional arguments:

-h, --help show this help message and exit

MANDATORY INPUTS:

Things you really need to provide to be in business.

-f FASTA, --contigs-fasta FASTA

The FASTA file that contains reference sequences you

mapped your samples against. This could be a reference

genome, or contigs from your assembler. Contig names

in this file must match to those in other input files.

If there is a problem anvi'o will gracefully complain

about it.

-n PROJECT_NAME, --project-name PROJECT_NAME

Name of the project. Please choose a short but

descriptive name (so anvi'o can use it whenever she

needs to name an output file, or add a new table in a

database, or name her first born).

OPTIONAL INPUTS:

Things you may want to tweak.

-o DB_FILE_PATH, --output-db-path DB_FILE_PATH

Output file path for the new database.

--description TEXT_FILE

A plain text file that contains some description about

the project. You can use Markdwon syntax. The

description text will be rendered and shown in all

relevant interfaces, including the anvi'o interactive

interface, or anvi'o summary outputs.

-L INT, --split-length INT

Anvi'o splits very long contigs into smaller pieces,

without actually splitting them for real. These

'virtual' splits improve the efficacy of the

visualization step, and changing the split size gives

freedom to the user to adjust the resolution of their

display when necessary. The default value is (20000).

If you are planning to use your contigs database for

metagenomic binning, we advise you to not go below

10,000 (since the lower the split size is, the more

items to show in the display, and decreasing the split

size does not really help much to binning). But if you

are thinking about using this parameter for ad hoc

investigations other than binning, you should ignore

our advice, and set the split size as low as you want.

If you do not want your contigs to be split, you can

set the split size to '0' or any other negative

integer (lots of unnecessary freedom here, enjoy!).

-K INT, --kmer-size INT

K-mer size for k-mer frequency calculations. The

default k-mer size for composition-based analyses is

4, historically. Although tetra-nucleotide frequencies

seem to offer the the sweet spot of sensitivity,

information density, and manageable number of

dimensions for clustering approaches, you are welcome

to experiment (but maybe you should leave it as is for

your first set of analyses).

--skip-mindful-splitting

By default, anvi'o attempts to prevent soft-splitting

large contigs by cutting proper gene calls to make

sure a single gene is not broken into multiple splits.

This requires a careful examination of where genes

start and end, and to find best locations to split

contigs with respect to this information. So, when the

user asks for a split size of, say, 1,000, it serves

as a mere suggestion. When this flag is used, anvi'o

does what the user wants and creates splits at desired

lengths (although some functionality may become

unavailable for the projects that rely on a contigs

database that is initiated this way).

GENES IN CONTIGS:

Expert thingies.

--skip-gene-calling By default, generating an anvi'o contigs database

includes the identification of open reading frames in

contigs by running a bacterial gene caller. Declaring

this flag will by-pass that process. If you prefer,

you can later import your own gene calling results

into the database.

--prodigal-translation-table INT

This is a parameter to pass to the Prodigal for a

specific translation table. This parameter corresponds

to the parameter `-g` in Prodigal, the default value

of which is 11 (so if you do not set anything, it will

be set to 11 in Prodigal runtime. Please refer to the

Prodigal documentation to determine what is the right

translation table for you if you think you need it.)

--external-gene-calls GENE-CALLS

A TAB-delimited file to utilize external gene calls.

The file must have these columns: 'gene_callers_id' (a

unique integer number for each gene call, start from

1), 'contig' (the contig name the gene call is found),

'start' (start position, integer), 'stop' (stop

position, integer), 'direction' (the direction of the

gene open reading frame; can be 'f' or 'r'), 'partial'

(whether it is a complete gene call, or a partial one;

must be 1 for partial calls, and 0 for complete

calls), 'source' (the gene caller), and 'version' (the

version of the gene caller, i.e., v2.6.7 or v1.0). An

example file can be found via the URL

https://bit.ly/2qEEHuQ

--ignore-internal-stop-codons

This is only relevant when you have an external gene

calls file. If anvi'o figures out that your custom

gene calls result in amino acid sequences with stop

codons in the middle, it will complain about it. You

can use this flag to tell anvi'o to don't check for

internal stop codons, EVEN THOUGH IT MEANS THERE IS

MOST LIKELY SOMETHING WRONG WITH YOUR EXTERNAL GENE

CALLS FILE. Anvi'o will understand that sometimes we

don't want to care, and will not judge you. Instead,

it will replace every stop codon residue in the amino

acid sequence with an 'X' character. Please let us

know if you used this and things failed, so we can

tell you that you shouldn't have really used it if you

didn't like failures at the first place (smiley).
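The column layout that `--external-gene-calls` expects can be sketched with a tiny hand-made file. The contig name, coordinates, gene caller name, and version below are made up for illustration, and `check_gene_calls` is a hypothetical helper: it only verifies the header and the value ranges described in the help text, and is not anvi'o's own validator.

```shell
# Build a toy external gene calls file (TAB-delimited) in a temporary directory.
workdir=$(mktemp -d)
gene_calls="$workdir/external_gene_calls.txt"

printf 'gene_callers_id\tcontig\tstart\tstop\tdirection\tpartial\tsource\tversion\n' > "$gene_calls"
printf '1\tcontig_001\t10\t310\tf\t0\tprodigal\tv2.6.3\n' >> "$gene_calls"
printf '2\tcontig_001\t450\t900\tr\t1\tprodigal\tv2.6.3\n' >> "$gene_calls"

# Hypothetical checker: header names, integer fields, direction and partial flag.
check_gene_calls() {
    awk -F '\t' '
        NR == 1 {
            expected = "gene_callers_id\tcontig\tstart\tstop\tdirection\tpartial\tsource\tversion"
            if ($0 != expected) { print "bad header"; bad = 1 }
            next
        }
        NF != 8 { print "line " NR ": expected 8 columns"; bad = 1; next }
        $1 !~ /^[0-9]+$/ { print "line " NR ": gene_callers_id is not an integer"; bad = 1 }
        $3 !~ /^[0-9]+$/ || $4 !~ /^[0-9]+$/ { print "line " NR ": start/stop are not integers"; bad = 1 }
        $5 != "f" && $5 != "r" { print "line " NR ": direction must be f or r"; bad = 1 }
        $6 != "0" && $6 != "1" { print "line " NR ": partial must be 0 or 1"; bad = 1 }
        END { if (!bad) print "OK"; exit bad }
    ' "$1"
}

check_gene_calls "$gene_calls"
```

A file that passes this check could then be given to `anvi-gen-contigs-database` together with `-f` via `--external-gene-calls`.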

> anvi-run-hmms -h

usage: anvi-run-hmms [-h] -c CONTIGS_DB [-H HMM PROFILE PATH]

[-I HMM PROFILE NAME] [--also-scan-trnas]

[-T NUM_THREADS] [--just-do-it]

This program deals with populating tables that store HMM hits in an anvi'o

contigs database.

optional arguments:

-h, --help show this help message and exit

DB:

An anvi'o contigs adtabase to populate with HMM hits

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

HMM OPTIONS:

If you have your own HMMs, or if you would like to run only a set of

default anvi'o HMM profiles rather than running them all, this is your

stop.

-H HMM PROFILE PATH, --hmm-profile-dir HMM PROFILE PATH

You can use this parameter you can specify a directory

path that contain an HMM profile. This way you can run

HMM profiles that are not included in anvi'o. See the

online to find out about the specifics of this

directory structure .

-I HMM PROFILE NAME, --installed-hmm-profile HMM PROFILE NAME

When you run this program without any parameter, it

runs all 4 HMM profiles installed on your system. If

you want only a specific one to run, you can select it

by using this parameter. These are the currently

available ones: "Protista_83" (type: singlecopy),

"Bacteria_71" (type: singlecopy), "Archaea_76" (type:

singlecopy), "Ribosomal_RNAs" (type: Ribosomal_RNAs).

tRNAs:

Through this program you can also scan Transfer RNA sequences in your

contigs database for free (instead of running `anvi-scan-trnas` later).

--also-scan-trnas Also scan tRNAs while you're at it.

PERFORMANCE:

Stuff everyone forgets to set and then get upset with how slow science

goes.

-T NUM_THREADS, --num-threads NUM_THREADS

Maximum number of threads to use for multithreading

whenever possible. Very conservatively, the default is

1. It is a good idea to not exceed the number of CPUs

/ cores on your system. Plus, please be careful with

this option if you are running your commands on a SGE

--if you are clusterizing your runs, and asking for

multiple threads to use, you may deplete your

resources very fast.

AUTHORITY:

Because you are the boss.

--just-do-it Don't bother me with questions or warnings, just do

it.
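Combining the two help screens above, creating a contigs database and then populating it with HMM hits might look like the following sketch. The file names (`contigs.fa`, `contigs.db`), the project name, and the thread count are placeholders, and `run` and `make_contigs_db` are hypothetical helpers. With `DRY_RUN=1` (the default here) the commands are only printed, so the sketch can be inspected without anvi'o installed.

```shell
# Print commands instead of executing them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

make_contigs_db() {
    # $1: FASTA of contigs, $2: output database path, $3: project name, $4: threads
    run anvi-gen-contigs-database -f "$1" -o "$2" -n "$3"
    # Without -I, anvi-run-hmms runs all installed HMM profiles.
    run anvi-run-hmms -c "$2" -T "$4"
}

make_contigs_db contigs.fa contigs.db my_project 4
```

To restrict the search to one profile, `-I Bacteria_71` (one of the names listed in the help) could be appended to the second command.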

> anvi-display-contigs-stats -h

usage: anvi-display-contigs-stats [-h] [--report-as-text] [-o FILE_PATH]

[-I IP_ADDR] [-P INT] [--browser-path PATH]

[--server-only] [--password-protected]

CONTIG DATABASES) [CONTIG DATABASE(S ...]

Start the anvi'o interactive interactive for viewing or comparing contigs

statistics

positional arguments:

CONTIG DATABASE(S) Anvio'o Contig databases to display statistics, you

can give multiple databases by seperating them with

space.

optional arguments:

-h, --help show this help message and exit

REPORT CONFIGURATION:

Specify what kind of output you want.

--report-as-text If you give this flag, Anvi'o will not open new

browser to show Contigs database statistics and write

all stats to TAB separated file and you should also

give --output-file with this flag otherwise Anvi'o

will complain.

-o FILE_PATH, --output-file FILE_PATH

File path to store results.

SERVER CONFIGURATION:

For power users.

-I IP_ADDR, --ip-address IP_ADDR

IP address for the HTTP server. The default ip address

(0.0.0.0) should work just fine for most.

-P INT, --port-number INT

Port number to use for anvi'o services. If nothing is

declared, anvi'o will try to find a suitable port

number, starting from the default port number, 8080.

--browser-path PATH By default, anvi'o will use your default browser to

launch the interactive interface. If you would like to

use something else than your system default, you can

provide a full path for an alternative browser using

this parameter, and hope for the best. For instance we

are using this parameter to call Google's experimental

browser, Canary, which performs better with demanding

visualizations.

--server-only The default behavior is to start the local server, and

fire up a browser that connects to the server. If you

have other plans, and want to start the server without

calling the browser, this is the flag you need.

--password-protected If this flag is set, command line tool will ask you to

enter a password and interactive interface will be

only accessible after entering same password. This

option is recommended for shared machines like

clusters or shared networks where computers are not

isolated.
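Two ways of calling `anvi-display-contigs-stats` follow from the help text: a text report (which needs `-o` alongside `--report-as-text`) and a server without a browser. The sketch below only prints the commands by default; the file names are placeholders, the helper functions are hypothetical, and using `--server-only` on port 8080 inside the Docker container started earlier (where no browser is available) is an assumption about a convenient setup, not something the help text prescribes.

```shell
# Print commands instead of executing them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

contigs_stats_text() {
    # $1: output file for the TAB-separated report; remaining args: contigs databases
    out=$1; shift
    run anvi-display-contigs-stats --report-as-text -o "$out" "$@"
}

contigs_stats_server() {
    # Start the server only; connect from a browser outside the container.
    run anvi-display-contigs-stats -P 8080 --server-only "$@"
}

contigs_stats_text contigs_stats.txt contigs.db
contigs_stats_server contigs.db
```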

> anvi-run-ncbi-cogs -h

usage: anvi-run-ncbi-cogs [-h] -c CONTIGS_DB [--cog-data-dir COG_DATA_DIR]

[-T NUM_THREADS] [--sensitive]

[--temporary-dir-path PATH]

[--search-with SEARCH_METHOD]

Run NCBI's COGs to associate genes in an anvi'o contigs database with

functions. COGs database was been designed as an attempt to classify proteins

from completely sequenced genomes on the basis of the orthology concept. It is

no longer actively developed, however, it is still very effective for daily

needs. You may want to consider Pfams or the eggNOG database for more

comprehensive functional insights.

optional arguments:

-h, --help show this help message and exit

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

--cog-data-dir COG_DATA_DIR

The directory path for your COG setup. Anvi'o will try

to use the default path if you do not specify

anything.

-T NUM_THREADS, --num-threads NUM_THREADS

Maximum number of threads to use for multithreading

whenever possible. Very conservatively, the default is

1. It is a good idea to not exceed the number of CPUs

/ cores on your system. Plus, please be careful with

this option if you are running your commands on a SGE

--if you are clusterizing your runs, and asking for

multiple threads to use, you may deplete your

resources very fast.

--sensitive DIAMOND sensitivity. With this flag you can instruct

DIAMOND to be 'sensitive', rather than 'fast' during

the search. It is likely the search will take

remarkably longer. But, hey, if you are doing it for

your final analysis, maybe it should take longer and

be more accurate. This flag is only relevant if you

are running DIAMOND.

--temporary-dir-path PATH

If you don't provide anything here, this program will

come up with a temporary directory path by itself to

store intermediate files, and clean it later. If you

want to have full control over this, you can use this

flag to define one..

--search-with SEARCH_METHOD

What program to use for database searching. The

default search uses diamond. All available options

include: diamond, blastp.
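The help for `anvi-setup-ncbi-cogs` notes that a custom `--cog-data-dir` must be remembered and passed again when COGs are run. A minimal sketch of that pairing is shown below; the directory `./cog_data`, the database name, and the thread count are placeholders, `run` and `annotate_cogs` are hypothetical helpers, and the commands are only printed unless `DRY_RUN=0` is set.

```shell
# Print commands instead of executing them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

annotate_cogs() {
    # $1: contigs database, $2: COG data directory, $3: threads
    # The same directory is used for setup and for the search.
    run anvi-setup-ncbi-cogs --cog-data-dir "$2" -T "$3"
    run anvi-run-ncbi-cogs -c "$1" --cog-data-dir "$2" -T "$3" --search-with diamond
}

annotate_cogs contigs.db ./cog_data 4
```

Keeping the directory in a single function argument avoids the mismatch the help text warns about.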

> anvi-get-sequences-for-gene-calls -h

usage: anvi-get-sequences-for-gene-calls [-h] [-c CONTIGS_DB]

[--gene-caller-ids GENE_CALLER_IDS]

[--delimiter CHAR]

[--report-extended-deflines]

[--wrap WRAP] [--export-gff3]

[--get-aa-sequences]

[-g GENOMES_STORAGE]

[-G GENOME_NAMES] -o FILE_PATH

A script to get back sequences for gene calls

optional arguments:

-h, --help show this help message and exit

OPTION #1: EXPORT FROM CONTIGS DB:

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

--gene-caller-ids GENE_CALLER_IDS

Gene caller ids. Multiple of them can be declared

separated by a delimiter (the default is a comma). In

anvi-gen-variability-profile, if you declare nothing

you will get all genes matching your other filtering

criteria. In other programs, you may get everything,

nothing, or an error. It really depends on the

situation. Fortunately, mistakes are cheap, so it's

worth a try.

--delimiter CHAR The delimiter to parse multiple input terms. The

default is ','.

--report-extended-deflines

When declared, the deflines in the resulting FASTA

file will contain more information.

--wrap WRAP When to wrap sequences when storing them in a FASTA

file. The default is '120'. A value of '0' would be

equivalent to 'do not wrap'.

--export-gff3 If this is true, the output file will be in GFF3

format.

--get-aa-sequences Store amino acid sequences instead.

OPTION #2: EXPORT FROM A GENOMES STORAGE:

-g GENOMES_STORAGE, --genomes-storage GENOMES_STORAGE

Anvi'o genomes storage file

-G GENOME_NAMES, --genome-names GENOME_NAMES

Genome names to 'focus'. You can use this parameter to

limit the genomes included in your analysis. You can

provide these names as a comma-separated list of

names, or you can put them in a file, where you have a

single genome name in each line, and provide the file

path.

OPTIONS COMMON TO ALL INPUTS:

-o FILE_PATH, --output-file FILE_PATH

File path to store results.
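Two of the options above are easy to illustrate with plain shell tools. The sketch first joins a hypothetical list of gene caller ids with the default delimiter (a comma) to form the value for `--gene-caller-ids`, then uses `fold` (not anvi'o) to show what wrapping a sequence at a fixed width looks like. The ids, file names, and the width of 10 (instead of the default 120) are chosen only to keep the example short.

```shell
# Hypothetical list of gene caller ids, one per line.
workdir=$(mktemp -d)
printf '12\n57\n203\n' > "$workdir/ids.txt"

# Join the ids with the default delimiter (a comma).
ids=$(paste -s -d, "$workdir/ids.txt")
echo "anvi-get-sequences-for-gene-calls -c contigs.db --gene-caller-ids $ids -o genes.fa"

# Illustrate wrapping: a 24-nt toy sequence folded at 10 characters per line.
wrapped=$(printf 'ATGCATGCATGCATGCATGCATGC\n' | fold -w 10)
printf '%s\n' "$wrapped"
```

Adding `--get-aa-sequences` to the printed command would request amino acid sequences instead.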

> anvi-import-taxonomy-for-genes -h

usage: anvi-import-taxonomy-for-genes [-h] -c CONTIGS_DB [-p PARSER] -i FILES)

[FILE(S ...] [--just-do-it]

Import gene-level taxonomy into an anvi'o contigs database.

optional arguments:

-h, --help show this help message and exit

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

-p PARSER, --parser PARSER

Parser to make sense of the input files. There are 3

parsers readily available: ['default_matrix',

'centrifuge', 'kaiju']. It is OK if you do not select

a parser, but in that case there will be no additional

contigs available except the identification of single-

copy genes in your contigs for later use. Using a

parser will not prevent the analysis of single-copy

genes, but make anvio more powerful to help you make

sense of your results. Please see the documentation,

or get in touch with the developers if you have any

questions regarding parsers.

-i FILE(S) [FILE(S) ...], --input-files FILE(S) [FILE(S) ...]

Input file(s) for selected parser. Each parser (except

"blank") requires input files to process that you

generate before running anvio. Please see the

documentation for details.

--just-do-it Don't bother me with questions or warnings, just do

it.

> anvi-merge -h

usage: anvi-merge [-h] -c CONTIGS_DB [-o DIR_PATH] [-S NAME]

[--description TEXT_FILE] [--skip-hierarchical-clustering]

[--enforce-hierarchical-clustering]

[--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD] [-W]

SINGLE_PROFILES) [SINGLE_PROFILE(S ...]

Merge multiple anvio profiles

positional arguments:

SINGLE_PROFILE(S) Anvo'o single profiles to merge

optional arguments:

-h, --help show this help message and exit

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

-o DIR_PATH, --output-dir DIR_PATH

Directory path for output files

-S NAME, --sample-name NAME

It is important to set a sample name (using only ASCII

letters and digits and without spaces) that is unique

(considering all others). If you do not provide one,

anvi'o will try to make up one for you based on other

information, although, you should never let the

software to decide these things).

--description TEXT_FILE

A plain text file that contains some description about

the project. You can use Markdwon syntax. The

description text will be rendered and shown in all

relevant interfaces, including the anvi'o interactive

interface, or anvi'o summary outputs.

--skip-hierarchical-clustering

If you are not planning to use the interactive

interface (or if you have other means to add a tree of

contigs in the database) you may skip the step where

hierarchical clustering of your items are preformed

based on default clustering recipes matching to your

database type.

--enforce-hierarchical-clustering

If you have more than 25,000 splits in your merged

profile, anvi-merge will automatically skip the

hierarchical clustering of splits (by setting --skip-

hierarchical-clustering flag on). This is due to the

fact that computational time required for hierarchical

clustering increases exponentially with the number of

items being clustered. Based on our experience we

decided that 25,000 splits is about the maximum we

should try. However, this is not a theoretical limit,

and you can overwrite this heuristic by using this

flag, which would tell anvi'o to attempt to cluster

splits regardless.

--distance DISTANCE_METRIC

The distance metric for the hierarchical clustering.

If you do not use this flag, the default distance

metric will be used for each clustering configuration

which is "euclidean".

--linkage LINKAGE_METHOD

The same story with the `--distance`, except, the

system default for this one is ward.

-W, --overwrite-output-destinations

Overwrite if the output files and/or directories

exist.
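The 25,000-split rule described under `--enforce-hierarchical-clustering` can be written down as a small decision helper, followed by a dry-run merge command. This is only a sketch: the split count is assumed to be known already, `clustering_note`, `run`, and `merge_profiles` are hypothetical helpers, and the profile paths and names are placeholders.

```shell
# Print commands instead of executing them unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

clustering_note() {
    # $1: number of splits in the merged profile
    if [ "$1" -gt 25000 ]; then
        echo "skipped by default; --enforce-hierarchical-clustering attempts it anyway"
    else
        echo "performed by default; --skip-hierarchical-clustering turns it off"
    fi
}

merge_profiles() {
    # $1: contigs database, $2: output directory, $3: sample name; remaining args: single profiles
    db=$1; out=$2; name=$3; shift 3
    run anvi-merge -c "$db" -o "$out" -S "$name" "$@"
}

clustering_note 30000
merge_profiles contigs.db merged my_merged sample_01/PROFILE.db sample_02/PROFILE.db
```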

> anvi-profile -h

usage: anvi-profile [-h] [-i INPUT_BAM] [-c CONTIGS_DB] [--blank-profile]

[-o DIR_PATH] [-W] [-S NAME] [--report-variability-full]

[--skip-SNV-profiling] [--profile-SCVs]

[--description TEXT_FILE] [--cluster-contigs]

[--skip-hierarchical-clustering]

[--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]

[-M INT] [--max-contig-length INT] [-X INT] [-V INT]

[--list-contigs] [--contigs-of-interest FILE]

[-T NUM_THREADS] [--queue-size INT]

[--write-buffer-size-per-thread INT] [--force-multi]

Creates a single anvi'o profile database. The default input to this program is

a BAM file. When it is run on a BAM file, depending on the user parameters,

the program quantifies coverage per nucleotide position (and averages them out

per contig), calculates single-nucleotide, single-codon, and single-amino acid

variants, and stores these data into appropriate tables. Anvi'o single

profiles can be merged by the program `anvi-merge`.

optional arguments:

-h, --help show this help message and exit

INPUTS:

There are two possible inputs for anvio profiler. You must to declare

either of these two.

-i INPUT_BAM, --input-file INPUT_BAM

Sorted and indexed BAM file to analyze. Takes a long

time depending on the length of the file and

parameters used for profiling.

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

--blank-profile If you only have contig sequences, but no mapping data

(i.e., you found a genome and would like to take a

look from it), this flag will become very hand. After

creating a contigs database for your contigs, you can

create a blank anvi'o profile database to use anvi'o

interactive interface with that contigs database

without any mapping data.

EXTRAS:

Things that are not mandatory, but can be useful if/when declared.

-o DIR_PATH, --output-dir DIR_PATH

Directory path for output files

-W, --overwrite-output-destinations

Overwrite if the output files and/or directories

exist.

-S NAME, --sample-name NAME

It is important to set a sample name (using only ASCII

letters and digits and without spaces) that is unique

(considering all others). If you do not provide one,

anvi'o will try to make up one for you based on other

information, although, you should never let the

software to decide these things).

--report-variability-full

One of the things anvi-profile does is to store

information about variable nucleotide positions.

Usually it does not report every variable position,

since not every variable position is genuine

variation. Say, if you have 1,000 coverage, and all

nucleotides at that position are Ts and only one of

them is a C, the confidence of that C being a real

variation is quite low. anvi'o has a simple algorithm

in place to reduce the impact of noise. However, using

this flag you can disable it and ask profiler to

report every single variation (which may result in

very large output files and millions of reports, but

you are the boss). Do not forget to take a look at '--

min-coverage-for-variability' parameter

--skip-SNV-profiling By default, anvi'o characterizes single-nucleotide

variation in each sample. The use of this flag will

instruct profiler to skip that step. Please remember

that parameters and flags must be identical between

different profiles using the same contigs database for

them to merge properly.

--profile-SCVs Anvi'o can perform accurate characterization of codon

frequencies in genes during profiling. While having

codon frequencies opens doors to powerful evolutionary

insights in downstream analyses, due to its

computational complexity, this feature comes 'off' by

default. Using this flag you can rise against the

authority, as you always should, and make anvi'o

profile codons.

--description TEXT_FILE

A plain text file that contains some description about

the project. You can use Markdwon syntax. The

description text will be rendered and shown in all

relevant interfaces, including the anvi'o interactive

interface, or anvi'o summary outputs.

HIERARCHICAL CLUSTERING:

Do you want your splits to be clustered? Yes? No? Maybe? Remember: By

default, anvi-profile will not perform hierarchical clustering on your

splits; but if you use `--blank` flag, it will try. You can skip that by

using the `--skip-hierarchical-clustering` flag.

--cluster-contigs Single profiles are rarely used for genome binning or

visualization, and since clustering step increases the

profiling runtime for no good reason, the default

behavior is to not cluster contigs for individual

runs. However, if you are planning to do binning on

one sample, you must use this flag to tell anvi'o to

run cluster configurations for single runs on your

sample.

--skip-hierarchical-clustering

If you are not planning to use the interactive

interface (or if you have other means to add a tree of

contigs in the database) you may skip the step where

hierarchical clustering of your items are preformed

based on default clustering recipes matching to your

database type.

--distance DISTANCE_METRIC

The distance metric for the hierarchical clustering.

Only relevant if you are using `--cluster-contigs`

flag. The default is "euclidean".

--linkage LINKAGE_METHOD

The linkage method for the hierarchical clustering.

Just like the distance metric this is only relevant if

you are using it with `--cluster-contigs` flag. The

default is "ward".

NUMBERS:

Defaults of these parameters will impact your analysis. You can always

come back to them and update your profiles, but it is important to make

sure defaults are reasonable for your sample.

-M INT, --min-contig-length INT

Minimum length of contigs in a BAM file to analyze.

The minimum length should be long enough for tetra-

nucleotide frequency analysis to be meaningful. There

is no way to define a golden number of minimum length

that would be applicable to genomes found in all

environments, but we chose the default to be 1000, and

have been happy with it. You are welcome to

experiment, but we advise to never go below 1,000. You

also should remember that the lower you go, the more

time it will take to analyze all contigs. You can use

--list-contigs parameter to have an idea how many

contigs would be discarded for a given M.

--max-contig-length INT

Just like the minimum contig length parameter, but to

set a maximum. Basically this will remove any contig

longer than a certain value. Why would anyone need

this? Who knows. But if you ever do, it is here.

-X INT, --min-mean-coverage INT

Minimum mean coverage for contigs to be kept in the

analysis. The default value is 0, which is for your

best interest if you are going to profile multiple BAM

files which are then going to be merged for a cross-

sectional or time series analysis. Do not change it if

you are not sure this is what you want to do.

-V INT, --min-coverage-for-variability INT

Minimum coverage of a nucleotide position to be

subjected to SNV profiling. By default, anvi'o will

not attempt to make sense of variation in a given

nucleotide position if it is covered less than 10X.

You can change that minimum using this parameter.

CONTIGS:

Sweet parameters of convenience

--list-contigs When declared, the program will list contigs in the

BAM file and exit gracefully without any further

analysis.

--contigs-of-interest FILE

It is possible to analyze only a group of contigs from

a given BAM file. If you provide a text file, in which

every contig of interest is listed line by line, the

profiler would engine only on those contigs in the BAM

file and ignore the rest. This can be used for

debugging purposes, or to engine on a particular group

of contigs that were identified as relevant during the

interactive analysis.

PERFORMANCE:

Performance settings for profiler

-T NUM_THREADS, --num-threads NUM_THREADS

Maximum number of threads to use for multithreading

whenever possible. Very conservatively, the default is

1. It is a good idea to not exceed the number of CPUs

/ cores on your system. Plus, please be careful with

this option if you are running your commands on a SGE

--if you are clusterizing your runs, and asking for

multiple threads to use, you may deplete your

resources very fast.

--queue-size INT The queue size for worker threads to store data to

communicate to the main thread. The default is set by

the class based on the number of threads. If you have

*any* hesitation about whether you know what you are

doing, you should not change this value.

--write-buffer-size-per-thread INT

How many items should be kept in memory before they

are written do the disk. The default is 500 per

thread. So a single-threaded job would have a write

buffer size of 500, whereas a job with 4 threads would

have a write buffer size of 4*500. The larger the

buffer size, the less frequent the program will access

to the disk, yet the more memory will be consumed

since the processed items will be cleared off the

memory only after they are written to the disk. The

default buffer size will likely work for most cases.

Please keep an eye on the memory usage output to make

sure the memory use never exceeds the size of the

physical memory.

--force-multi This is not useful to non-developers. It forces the

multi-process routine even when 1 thread is chosen.
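To get a feel for `-M` before profiling, the number of contigs above and below a length cutoff can be estimated directly from the contigs FASTA. The sketch below is an approximation done with awk on a toy FASTA, not the output of `--list-contigs`; it assumes that contigs of exactly M nucleotides are kept, and the helper names, file names, and thread count are hypothetical. The final `anvi-profile` command is only printed.

```shell
workdir=$(mktemp -d)
fasta="$workdir/contigs.fa"

# Toy FASTA with three contigs of 1200, 800 and 2500 nt.
make_seq() { awk -v n="$1" 'BEGIN { for (i = 0; i < n; i++) printf "A"; print "" }'; }
{
    echo '>c1'; make_seq 1200
    echo '>c2'; make_seq 800
    echo '>c3'; make_seq 2500
} > "$fasta"

count_contigs() {
    # $1: FASTA path, $2: minimum length M; prints "kept<TAB>discarded"
    awk -v m="$2" '
        function tally() { if (seen) { if (len >= m) kept++; else dropped++ } }
        /^>/ { tally(); seen = 1; len = 0; next }
        { len += length($0) }
        END { tally(); printf "%d\t%d\n", kept, dropped }
    ' "$1"
}

count_contigs "$fasta" 1000

# Dry run: the profiling command that would use the same cutoff.
echo "anvi-profile -i sample_01.bam -c contigs.db -o sample_01 -S sample_01 -M 1000 -T 4"
```

Because the help text asks for identical parameters across profiles that will later be merged, the same `-M` value would be reused for every sample.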

> anvi-interactive -h

usage: anvi-interactive [-h] [-p PROFILE_DB] [-c CONTIGS_DB]

[-C COLLECTION_NAME] [--manual-mode] [-f FASTA]

[-d VIEW_DATA] [-t NEWICK] [--items-order FLAT_FILE]

[-V ADDITIONAL_VIEW] [-A ADDITIONAL_LAYERS]

[--gene-mode] [--inseq-stats] [-b BIN_NAME]

[--view NAME] [--title NAME]

[--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}]

[--split-hmm-layers] [--hide-outlier-SNVs]

[--state-autoload NAME] [--collection-autoload NAME]

[--export-svg FILE_PATH] [--show-views]

[--skip-check-names] [-o DIR_PATH] [--dry-run]

[--show-states] [--list-collections]

[--skip-init-functions] [--skip-auto-ordering]

[--distance DISTANCE_METRIC]

[--linkage LINKAGE_METHOD] [-I IP_ADDR] [-P INT]

[--browser-path PATH] [--read-only] [--server-only]

[--password-protected] [--user-server-shutdown]

Start an anvi'o server for the interactive interface

optional arguments:

-h, --help show this help message and exit

DEFAULT INPUTS:

The interavtive interface can be started with and without anvi'o

databases. The default use assumes you have your profile and contigs

database, however, it is also possible to start the interface using ad hoc

input files. See 'MANUAL INPUT' section for required parameters.

-p PROFILE_DB, --profile-db PROFILE_DB

Anvi'o profile database

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

-C COLLECTION_NAME, --collection-name COLLECTION_NAME

If you have a collection in your profile database, you

can use this flag to start the interactive interface

with a tree showing your bins in your collection,

instead of each split. This is very useful when you

have imported your external binning results into

anvi'o, and want to see the distribution of your bins

across samples. In these cases anvi'o will cluster

your bins and based on multiple metrics. Because this

particular clustering will be done on the fly within

anvi'o interactive class, you get to define a

disntance metric and a linkage method using --linkage

and --distance parameters if you want!

MANUAL INPUTS:

Mandatory input parameters to start the interactive interface without

anvi'o databases.

--manual-mode Using this flag, you can run the interactive interface

in an ad hoc manner using input files you curated

instead of standard output files generated by an

anvi'o run. In the manual mode you will be asked to

provide a profile database. In this mode a profile

database is only used to store 'state' of the

interactive interface so you can reload your visual

settings when you re-analyze the same files again. If

the profile database you provide does not exist,

anvi'o will create an empty one for you.

-f FASTA, --fasta-file FASTA

A FASTA-formatted input file

-d VIEW_DATA, --view-data VIEW_DATA

A TAB-delimited file for view data

-t NEWICK, --tree NEWICK

NEWICK formatted tree structure

--items-order FLAT_FILE

A flat file that contains the order of items you wish

the display using the interactive interface. You may

want to use this if you have a specific order of items

in your mind, and do not want to display a tree in the

middle (or simply you don't have one). The file format

is simple: each line should have an item name, and

there should be no header.

ADDITIONAL STUFF:

Parameters to provide additional layers, views, or layer data.

-V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW

A TAB-delimited file for an additional view to be used

in the interface. This file should contain all split

names, and values for each of them in all samples.

Each column in this file must correspond to a sample

name. Content of this file will be called 'user_view',

which will be available as a new item in the 'views'

combo box in the interface

-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS

A TAB-delimited file for additional layers for splits.

The first column of this file must be split names, and

the remaining columns should be unique attributes. The

file does not need to contain all split names, or

values for each split in every column. Anvi'o will try

to deal with missing data nicely. Each column in this

file will be visualized as a new layer in the tree.

GENE MODE:

Gene mode related parameters.

--gene-mode Initiate the interactive interface in 'gene mode'. In

this mode, the items are genes (instead of splits of

contigs). The following views are available: detection

(the detection value of each gene in each sample). The

mean_coverage (the mean coverage of genes). The

non_outlier_mean_coverage (the mean coverage of the

non-outlier nucleotide positions of each gene in each

sample (median absolute deviation is used to remove

outliers per gene per sample)). The

non_outlier_coverage_std view (standard deviation of

the coverage of non-outlier positions of genes in

samples). You can also choose to order items and

layers according to each one of the aforementioned

views. In addition, all layer ordering that are

available in the regular mode (i.e. the full mode

where you have contigs/splits) are also available in

'gene mode', so that, for example, you can choose to

order the layers according to 'detection', and that

would be the order according to the detection values

of splits, whereas if you choose 'genes_detections'

then the order of layers would be according to the

detection values of genes. Inspection and sequence

functionality are available (through the right-click

menu), except now sequences are of the specific gene.

Inspection has now two options available: 'Inspect

Context', which brings you to the inspection page of

the split to which the gene belongs where the

inspected gene will be highlighted in yellow in the

bottom, and 'Inspect Gene', which opens the inspection

page only for the gene and 100 nts around each side of

it (the purpose of this option is to make the

inspection page load faster if you only want to look

at the nucleotide coverage of a specific gene).

NOTICE: You can't store states or collections in 'gene

mode'. However, you still can make fake selections,

and create fake bins for your viewing convenience only

(smiley). Search options are available, and you can

even search for functions if you have them in your

contigs database. ANOTHER NOTICE: loading this mode

might take a while if your bin has many genes, and

your profile database has many samples, this is

because the gene coverages stats are computed in an

ad-hoc manner when you load this mode, we know this is

not ideal and we plan to improve that (along with

other things). If you have suggestions/complaints

regarding this mode please comment on this github

issue: https://goo.gl/yHhRei. Please refer to the

online tutorial for more information.

--inseq-stats Provide if working with INSeq/Tn-Seq genomic data.

With this, all gene level coverage stats will be

calculated using INSeq/Tn-Seq statistical methods.

-b BIN_NAME, --bin-id BIN_NAME

Bin name you are interested in.

VISUALS RELATED:

Parameters that give access to various adjustements regarding the

interface.

--view NAME Start the interface with a pre-selected view. To see a

list of available views, use --show-views flag.

--title NAME Title for the interface. If you are working with a

RUNINFO dict, the title will be determined based on

information stored in that file. Regardless, you can

override that value using this parameter.

--taxonomic-level {t_domain,t_phylum,t_class,t_order,t_family,t_genus,t_species}

The taxonomic level to use whenever relevant and/or

available. The default taxonomic level is t_genus, but

if you choose something specific, anvi'o will focus on

that whenever possible.

--split-hmm-layers When declared, this flag tells the interface to split

every gene found in HMM searches that were performed

against non-singlecopy gene HMM profiles into their

own layer. Please see the documentation for details.

--hide-outlier-SNVs During profiling, anvi'o marks positions of single-

nucleotide variations (SNVs) that originate from

places in contigs where coverage values are a bit

'sketchy'. If you would like to avoid SNVs in those

positions of splits in applicable projects you can use

this flag, and the interface would hide SNVs that are

marked as 'outlier' (although it is clearly the best

to see everything, no one will judge you if you end up

using this flag) (plus, there may or may not be some

historical data on this here:

https://github.com/meren/anvio/issues/309).

--state-autoload NAME

Automatically load previous saved state and draw tree.

To see a list of available states, use --show-states

flag.

--collection-autoload NAME

Automatically load a collection and draw tree. To see

a list of available collections, use --list-

collections flag.

--export-svg FILE_PATH

The SVG output file path.

SWEET PARAMS OF CONVENIENCE:

Parameters and flags that are not quite essential (but nice to have).

--show-views When declared, the program will show a list of

available views, and exit.

--skip-check-names For debugging purposes. You should never really need

it.

-o DIR_PATH, --output-dir DIR_PATH

Directory path for output files

--dry-run Don't do anything real. Test everything, and stop

right before wherever the developer said 'well, this

is enough testing', and decided to print out results.

--show-states When declared the program will print all available

states and exit.

--list-collections Show available collections and exit.

--skip-init-functions

When declared, function calls for genes will not be

initialized (therefore will be missing from all

relevant interfaces or output files). The use of this

flag may reduce the memory fingerprint and processing

time for large datasets.

--skip-auto-ordering When declared, the attempt to include automatically

generated orders of items based on additional data is

skipped. In case those buggers cause issues with your

data, and you still want to see your stuff and deal

with the other issue maybe later.

--distance DISTANCE_METRIC

The distance metric for the hierarchical clustering.

Only relevant if you are running the interactive

interface in "collection" mode. The default is

"euclidean".

--linkage LINKAGE_METHOD

The linkage method for the hierarchical clustering.

Only relevant if you are running the interactive

interface in "collection" mode. The default is "ward".

SERVER CONFIGURATION:

For power users.

-I IP_ADDR, --ip-address IP_ADDR

IP address for the HTTP server. The default ip address

(0.0.0.0) should work just fine for most.

-P INT, --port-number INT

Port number to use for anvi'o services. If nothing is

declared, anvi'o will try to find a suitable port

number, starting from the default port number, 8080.

--browser-path PATH By default, anvi'o will use your default browser to

launch the interactive interface. If you would like to

use something else than your system default, you can

provide a full path for an alternative browser using

this parameter, and hope for the best. For instance we

are using this parameter to call Google's experimental

browser, Canary, which performs better with demanding

visualizations.

--read-only When the interactive interface is started with this

flag, all 'database write' operations will be

disabled.

--server-only The default behavior is to start the local server, and

fire up a browser that connects to the server. If you

have other plans, and want to start the server without

calling the browser, this is the flag you need.

--password-protected If this flag is set, command line tool will ask you to

enter a password and interactive interface will be

only accessible after entering same password. This

option is recommended for shared machines like

clusters or shared networks where computers are not

isolated.

--user-server-shutdown

Allow users to shutdown an anvi'server via web

interface.

> anvi-script-reformat-fasta -h

usage: anvi-script-reformat-fasta [-h] [-l MIN_LENGTH]

[--max-percentage-gaps PERCENTAGE]

[-i TXT FILE] [-I TXT FILE] -o FASTA FILE

[--simplify-names] [--prefix PREFIX]

[-r REPORT FILE]

FASTA FILE

Reformat FASTA file (remove contigs based on length, or based on a given list

of deflines, and/or generate an output with simpler names)

positional arguments:

FASTA FILE

optional arguments:

-h, --help show this help message and exit

-l MIN_LENGTH, --min-len MIN_LENGTH

Minimum length of contigs to keep (contigs shorter

than this value will not be included in the output

file). The default is 0, so nothing will be removed if

you do not declare a minimum size.

--max-percentage-gaps PERCENTAGE

Maximum fraction of gaps in a sequence (any sequence

with more gaps will be removed from the output FASTA

file). The default is 100.000000.

-i TXT FILE, --exclude-ids TXT FILE

IDs to remove from the FASTA file. You cannot provide

both --keep-ids and --exclude-ids.

-I TXT FILE, --keep-ids TXT FILE

If provided, all IDs not in this file will be excluded

from the reformatted FASTA file. Any additional

filters (such as --min-len) will still be applied to

the IDs in this file. You cannot provide both

--exclude-ids and --keep-ids.

-o FASTA FILE, --output-file FASTA FILE

Output file path.

--simplify-names Edit deflines to make sure they contigs have simple

names.

--prefix PREFIX Use this parameter if you would like to add a prefix

to your contig names while simplifying them. The

prefix must be a single word (you can use underscor

character, but nothing more!).

-r REPORT FILE, --report-file REPORT FILE

Report file path. When you run this program with

`--simplify-names` flag, all changes to deflines will

be reported in this file in case you need to go back

to this information later. It is not mandatory to

declare one, but it is a very good idea to have it.

> anvi-export-splits-and-coverages -h

usage: anvi-export-splits-and-coverages [-h] -p PROFILE_DB -c CONTIGS_DB

[-o DIR_PATH] [-O FILENAME_PREFIX]

[--splits-mode] [--report-contigs]

[--use-Q2Q3-coverages]

Export split or contig sequences and coverages across samples stored in an

anvi'o profile database. This program is especially useful if you would like

to 'bin' your splits or contigs outside of anvi'o and import the binning

results into anvi'o using `anvi-import-collection` program.

optional arguments:

-h, --help show this help message and exit

-p PROFILE_DB, --profile-db PROFILE_DB

Anvi'o profile database

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

-o DIR_PATH, --output-dir DIR_PATH

Directory path for output files

-O FILENAME_PREFIX, --output-file-prefix FILENAME_PREFIX

A prefix to be used while naming the output files (no

file type extensions please; just a prefix).

--splits-mode Specify this flag if you would like to output

coverages of individual 'splits', rather than their

'parent' contig coverages.

--report-contigs By default this program reports sequences and their

coverages for 'splits'. By using this flag, you can

report contig sequences and coverages instead. For

obvious reasons, you can't use this flag with

`--splits-mode` flag.

--use-Q2Q3-coverages By default this program reports the mean coverage of a

split (or contig, see --report-contigs) for each

sample. By using this flag, you can report the mean

Q2Q3 coverage by excluding 25 percent of the

nucleotide positions with the smallest coverage

values, and 25 percent of the nucleotide positions

with the largest coverage values. The hope is that

this removes 'outlier' positions resulting from non-

specific mapping, etc. that skew the mean coverage

estimate.

> anvi-import-collection -h

usage: anvi-import-collection [-h] [-c CONTIGS_DB] [-p PAN_OR_PROFILE_DB] -C

COLLECTION_NAME [--bins-info BINS_INFO]

[--contigs-mode]

TAB DELIMITED FILE

Import an external binning result into anvi'o

positional arguments:

TAB DELIMITED FILE The input file that describes bin IDs for each split

or contig.

optional arguments:

-h, --help show this help message and exit

-c CONTIGS_DB, --contigs-db CONTIGS_DB

Anvi'o contigs database generated by 'anvi-gen-

contigs'

-p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB

Anvi'o pan or profile database (and even genes

database in appropriate contexts).

-C COLLECTION_NAME, --collection-name COLLECTION_NAME

Collection name.

--bins-info BINS_INFO

Additional information for bins. The file must contain

three TAB-delimited columns, where the first one must

be a unique bin name, the second should be a 'source',

and the last one should be a 7 character HTML color

code (i.e., '#424242'). Source column must contain

information about the origin of the bin. If these bins

are automatically identified by a program like

CONCOCT, this column could contain the program name

and version. The source information will be associated

with the bin in various interfaces so in a sense it is

not *that* critical what it says there, but on the

other hand it is, becuse we should also think about

people who may end up having to work with what we put

together later.

--contigs-mode Use this flag if your binning was done on contigs

instead of splits. Please refer to the documentation

for help.

How to run

The analysis proceeds according to the linked procedure.

1. Prepare the NCBI COG (Clusters of Orthologous Groups) database. This only needs to be run the first time.

anvi-setup-ncbi-cogs -T 40

If you are using the docker image, run this once and commit the container. Using the committed image from then on saves effort.
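The commit can be sketched roughly as follows. This is a minimal example, not the author's exact procedure: the function name `commit_anvio_container` and the image tag `anvio_with_cogs:latest` are made up for illustration, and only the standard `docker commit` and `docker ps` commands are assumed.

```shell
# Save a running anvi'o container (with the COG data already set up)
# as a new local image. The default tag below is a hypothetical example.
commit_anvio_container() {
    container_id=$1                          # ID or name shown by `docker ps`
    image_tag=${2:-anvio_with_cogs:latest}   # hypothetical tag for the new image
    docker commit "$container_id" "$image_tag"
}

# Example, run on the host in a second terminal while the container is up:
#   docker ps
#   commit_anvio_container <container_id>
# Later sessions can then start from the committed image:
#   docker run --rm -it -v "$PWD":"$PWD" -w "$PWD" -p 8080:8080 anvio_with_cogs:latest
```

A container started with `--rm` is deleted as soon as it exits, so the commit has to be issued while the container is still running.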

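2. Prepare contigs.fasta and create the database

Prepare the contigs.fasta or binned.fasta of the metagenome to be analyzed. If there are too many sequences, hierarchical clustering fails with an error. To avoid this, discard short sequences to reduce the number of sequences.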
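Before choosing a length cutoff, it can help to see how many sequences would survive each candidate threshold. The following is a small sketch using only awk; the function name `count_contigs_min_len` is hypothetical and the file name in the usage comment is the one used in this post. Note that anvi'o clusters splits rather than whole contigs, so these counts are only a rough guide.

```shell
# Print how many sequences in a FASTA file are at least MIN_LEN bases long.
# Sequences wrapped over several lines are handled.
count_contigs_min_len() {
    fasta=$1
    min_len=$2
    awk -v min="$min_len" '
        /^>/ { if (seen && len >= min) n++; len = 0; seen = 1; next }
             { len += length($0) }
        END  { if (seen && len >= min) n++; print n + 0 }
    ' "$fasta"
}

# Example: compare a few candidate thresholds before choosing -l.
#   for l in 1000 2500 5000; do
#       printf '%s\t%s\n' "$l" "$(count_contigs_min_len input_contigs.fa "$l")"
#   done
```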

Supplementary step. ======================================

Fix the deflines in the FASTA headers (optional). Size selection is also possible. Spaces and similar characters in the headers cause an error in step 2.

anvi-script-reformat-fasta -l 5000 -o contigs.fa input_contigs.fa

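contigs.fa is written. If the headers were not fixed, replace them with simple names (the --simplify-names flag does this). The BAM alignment files used later must be built against the fixed FASTA, otherwise an error occurs.

Characters such as "-" and "<space>" are fairly common in headers and file names, but they cause errors whether they appear inside the file or in the file name. Always replace them. The underscore "_" does not cause errors.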
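As an illustration of the replacement described above, the following sketch checks for and replaces problematic defline characters with plain awk. It is an alternative to `--simplify-names`, not what anvi'o itself does, and the function names are hypothetical. The set of "safe" characters (letters, digits, underscore) is an assumption based on the advice above.

```shell
# Count deflines that contain anything other than letters, digits and "_".
count_bad_deflines() {
    awk '/^>/ && substr($0, 2) ~ /[^A-Za-z0-9_]/ { n++ } END { print n + 0 }' "$1"
}

# Write the FASTA to stdout with every other defline character replaced by "_".
# Sequence lines are passed through unchanged.
sanitize_deflines() {
    awk '
        /^>/ { name = substr($0, 2); gsub(/[^A-Za-z0-9_]/, "_", name); print ">" name; next }
             { print }
    ' "$1"
}

# Example:
#   count_bad_deflines input_contigs.fa
#   sanitize_deflines input_contigs.fa > input_contigs_renamed.fa
```

Two different headers can become identical after replacement (for example "a-b" and "a b"), so it is worth checking for duplicate names afterwards. Reads also have to be mapped against the renamed file.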

================================================

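Create a database from binned.fasta. It stores information related to the contigs (positions of ORFs identified with Prodigal, k-mer frequencies of each contig, start and end positions of splits, functional and taxonomic annotations of genes, and so on).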

anvi-gen-contigs-database -f binned.fasta -o contigs.db -n 'An example contigs1 database'

# If there are several FASTA files, create a database for each in turn
anvi-gen-contigs-database -f binned2.fasta -o contigs2.db -n 'An example contigs2 database'
anvi-gen-contigs-database -f binned3.fasta -o contigs3.db -n 'An example contigs3 database'
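With many FASTA files, the repeated commands can be generated instead of typed. The sketch below only prints the commands so they can be reviewed first; the function name `print_gen_contigs_db_cmds` and the output naming scheme (`<basename>.db`) are illustrative choices, not the naming used above.

```shell
# Print one anvi-gen-contigs-database command per FASTA file given.
# Database and project names are derived from the file name (illustrative choice).
print_gen_contigs_db_cmds() {
    for fasta in "$@"; do
        base=$(basename "$fasta")
        stem=${base%.*}
        printf "anvi-gen-contigs-database -f %s -o %s.db -n '%s contigs database'\n" \
            "$fasta" "$stem" "$stem"
    done
}

# Review the output, then execute it by piping to a shell:
#   print_gen_contigs_db_cmds binned.fasta binned2.fasta binned3.fasta
#   print_gen_contigs_db_cmds binned.fasta binned2.fasta binned3.fasta | sh
```

This assumes file names without spaces, which matches the naming advice given earlier.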

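3. Decorate the contigs database with hits from the HMM models bundled with the platform (at present, several collections of bacterial single-copy genes are available). Assign as many threads as possible. If there are several databases, do this for all of them; the same applies to the following steps until the databases are merged.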

anvi-run-hmms -c contigs.db -T 40

-T Maximum number of threads to use for multithreading whenever possible. Very conservatively, the default is 1. It is a good idea to not exceed the number of CPUs/ cores on your system. Plus, please be careful with this option if you are running your commands on a SGE --if you are clusterizing your runs, and asking for multiple threads to use, you may deplete your resources very fast.

4. Display contig statistics (the contigs database must already exist and the HMM step must already have been run).

anvi-display-contigs-stats contigs.db

Open http://127.0.0.1:8080 and check the statistics.

When you have finished checking, stop the server with "Ctrl + C".

5. Run anvi-run-ncbi-cogs, the program that annotates the genes in the contigs database using NCBI COGs. It runs DIAMOND, so assign as many threads as possible.

anvi-run-ncbi-cogs -c contigs.db --num-threads 40

6. Export the sequences of the gene calls stored in the contigs database to a FASTA file with anvi-get-sequences-for-gene-calls.

anvi-get-sequences-for-gene-calls -c contigs.db -o gene-calls.fa

================================================

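Additional step - running centrifuge and importing taxonomy (for reference)

Run this when you have taxonomy annotations for each gene and want to put them into the database for curation. Kaiju or centrifuge can be used; with centrifuge, proceed as follows. Using gene-calls.fa from step 6, assign a taxonomy annotation to each gene. centrifuge comes preinstalled in the docker image, so only the database needs to be prepared.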

# centrifuge prebuilt database (first time only)
wget ftp://ftp.ccb.jhu.edu/pub/infphilo/centrifuge/data/p_compressed+h+v.tar.gz
tar -zxvf p_compressed+h+v.tar.gz
# This creates p_compressed+h+v.1.cf, p_compressed+h+v.2.cf and p_compressed+h+v.3.cf. At run time, specify the index as "-x p_compressed+h+v".

# run centrifuge
centrifuge -f -x p_compressed+h+v gene-calls.fa -S centrifuge_hits.tsv -p 40
#=> centrifuge_report.tsv and centrifuge_hits.tsv are created.

# import the centrifuge results into anvi'o
anvi-import-taxonomy-for-genes -c contigs.db -i centrifuge_report.tsv centrifuge_hits.tsv -p centrifuge

If this ran without errors, the taxonomy options become available during visualization.

================================================

7. Prepare sorted BAM files and their bam.bai indexes. With minimap2:

#sample1
minimap2 -R "@RG\tID:X\tLB:Y\tSM:sample1\tPL:ILLUMINA" -t 40 -ax sr \
contigs.fa sample1_R1.fq.gz sample1_R2.fq.gz \
|samtools sort -@ 40 -m 2G -O BAM - > sample1.bam \
&& samtools index -@ 8 sample1.bam

#sample2
minimap2 -R "@RG\tID:X\tLB:Y\tSM:sample2\tPL:ILLUMINA" -t 40 -ax sr \
contigs.fa sample2_R1.fq.gz sample2_R2.fq.gz \
|samtools sort -@ 40 -m 2G -O BAM - > sample2.bam \
&& samtools index -@ 8 sample2.bam

#sample3
minimap2 -R "@RG\tID:X\tLB:Y\tSM:sample3\tPL:ILLUMINA" -t 40 -ax sr \
contigs.fa sample3_R1.fq.gz sample3_R2.fq.gz \
|samtools sort -@ 40 -m 2G -O BAM - > sample3.bam \
&& samtools index -@ 8 sample3.bam
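Because the indexing only runs when the sort succeeds, a quick check that every BAM has its index can catch a failed mapping run before profiling. This is a minimal sketch with a hypothetical function name; it assumes the default index name `<file>.bam.bai` that `samtools index` writes.

```shell
# Print every BAM path from the arguments that has no "<bam>.bai" next to it.
list_unindexed_bams() {
    for bam in "$@"; do
        if [ ! -e "$bam.bai" ]; then
            printf '%s\n' "$bam"
        fi
    done
}

# Example: no output means every BAM has an index.
#   list_unindexed_bams sample1.bam sample2.bam sample3.bam
```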

8. Create a database of the information specific to each sample. Run this for the BAM file of every sample. By default, contigs longer than 2,500 bp are processed (--min-contig-length).

#sample1
anvi-profile -i sample1.bam -c contigs.db --output-dir PROFILES/SAMPLE1_Profile --sample-name sample1 -T 50

#sample2
anvi-profile -i sample2.bam -c contigs.db --output-dir PROFILES/SAMPLE2_Profile --sample-name sample2 -T 50

#sample3
anvi-profile -i sample3.bam -c contigs.db --output-dir PROFILES/SAMPLE3_Profile --sample-name sample3 -T 50

If there is only one sample (one BAM file), run with "--cluster-contigs".

Step 2 produced the contigs database contigs.db; this step produces PROFILE.db, a database for each sample.

If everything went well, "Happy" is printed.
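The per-sample profiling commands follow a fixed pattern, so they can be generated from a list of sample names. The sketch below prints the commands rather than running them; the function name `print_profile_cmds` is hypothetical, and the `<sample>.bam` input and `PROFILES/<SAMPLE>_Profile` output naming simply mirror the examples in this step.

```shell
# Print one anvi-profile command per sample name.
# Usage: print_profile_cmds CONTIGS_DB THREADS SAMPLE [SAMPLE ...]
print_profile_cmds() {
    contigs_db=$1
    threads=$2
    shift 2
    for sample in "$@"; do
        # Upper-case sample name for the output directory, as in the examples above
        upper=$(printf '%s' "$sample" | tr '[:lower:]' '[:upper:]')
        printf 'anvi-profile -i %s.bam -c %s --output-dir PROFILES/%s_Profile --sample-name %s -T %s\n' \
            "$sample" "$contigs_db" "$upper" "$sample" "$threads"
    done
}

# Review the output, then execute it by piping to a shell:
#   print_profile_cmds contigs.db 50 sample1 sample2 sample3 | sh
```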

9. Merge the results of step 8 and run the clustering. This is not needed when there is only one sample.

anvi-merge \
 PROFILES/SAMPLE1_Profile/PROFILE.db \
 PROFILES/SAMPLE2_Profile/PROFILE.db \
 PROFILES/SAMPLE3_Profile/PROFILE.db \
 -o SAMPLES-MERGED -c contigs.db  --enforce-hierarchical-clustering

--skip-hierarchical-clustering If you are not planning to use the interactive interface (or if you have other means to add a tree of contigs in the database) you may skip the step where hierarchical clustering of your items are preformed based on default clustering recipes matching to your database type.
--enforce-hierarchical-clustering If you have more than 25,000 splits in your merged profile, anvi-merge will automatically skip the hierarchical clustering of splits (by setting --skip-hierarchical-clustering flag on). This is due to the fact that computational time required for hierarchical clustering increases exponentially with the number of items being clustered. Based on our experience we decided that 25,000 splits is about the maximum we should try. However, this is not a theoretical limit, and you can overwrite this heuristic by using this flag, which would tell anvi'o to attempt to cluster splits regardless.

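From version 6 onward, binning has moved to a separate command, anvi-cluster-contigs, and is no longer run as part of this merge. Hierarchical clustering is performed in this step, but it can fail even with --enforce-hierarchical-clustering. If it fails, the data cannot be visualized, so retry after, for example, removing short sequences.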
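When the number of samples grows, the merge command can be assembled from the list of PROFILE.db files. This is a sketch with a hypothetical function name; it prints the command instead of running it and refuses fewer than two profiles, following the note above that a single sample does not need merging. The output directory name is the one used in this step.

```shell
# Print an anvi-merge command for the given single-sample profile databases.
# Usage: print_merge_cmd CONTIGS_DB PROFILE_DB PROFILE_DB [PROFILE_DB ...]
print_merge_cmd() {
    contigs_db=$1
    shift
    if [ "$#" -lt 2 ]; then
        echo "need at least two PROFILE.db files to merge" >&2
        return 1
    fi
    printf 'anvi-merge'
    for db in "$@"; do
        printf ' %s' "$db"
    done
    printf ' -o SAMPLES-MERGED -c %s --enforce-hierarchical-clustering\n' "$contigs_db"
}

# Example: the glob expands to every profile created in step 8.
#   print_merge_cmd contigs.db PROFILES/*/PROFILE.db
```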

========================================================

Additional step 1: if you have binned the contigs yourself, the result can be imported as a collection into the merged profile database.

anvi-import-collection binning_results.txt -p SAMPLES-MERGED/PROFILE.db -c contigs.db -C "COLLECTION_NAME" --contigs-mode

binning_results.txt is a TAB-delimited text file that must state which contig belongs to which bin. Concretely, prepare a file in which each line has the form contig name<TAB>name of the bin it belongs to (example).
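If the external binner wrote one FASTA file per bin, the two-column table can be derived from those files. The following is a rough sketch under that assumption; the function names and the `bins/*.fa` layout are hypothetical, and bin names are taken from the file names with anything other than letters, digits and "_" replaced by "_".

```shell
# Print "contig<TAB>bin" lines for a set of per-bin FASTA files.
make_binning_table() {
    for fasta in "$@"; do
        bin=$(basename "$fasta")
        bin=${bin%.*}
        bin=$(printf '%s' "$bin" | tr -c 'A-Za-z0-9_' '_')
        # The contig name is the first word of the defline, without ">"
        awk -v bin="$bin" '/^>/ { print substr($1, 2) "\t" bin }' "$fasta"
    done
}

# Succeed only if every line has exactly two non-empty TAB-separated fields.
check_binning_table() {
    awk -F '\t' 'NF != 2 || $1 == "" || $2 == "" { bad++ } END { exit (bad > 0) }' "$1"
}

# Example:
#   make_binning_table bins/*.fa > binning_results.txt
#   check_binning_table binning_results.txt && echo "format looks fine"
```

A table built this way lists contig names, which is the case the `--contigs-mode` flag in the help output is meant for.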

Additional step 2: export the coverage and sequence composition information

anvi-export-splits-and-coverages -p SAMPLES-MERGED/PROFILE.db -c contigs.db -o output

========================================================

10. Visualization and analysis.

anvi-interactive -p SAMPLES-MERGED/PROFILE.db -c contigs.db --server-only -P 8080

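Running anvi-interactive computes various properties of each bin on the fly (mean coverage, variability, and estimates of completion and redundancy) and shows their distribution across the samples in the interactive interface.

Open http://localhost:8080 to check the results.

Press the Draw button at the lower left to render the display.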

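Only a single genome from an isolate sample was analyzed here. The contigs are clustered, and coverage, GC content and so on are shown as circular layers. The dendrogram in the center represents the clustering of the contigs. Note that this does not depict one circular genome: all contigs are clustered by coverage and arranged in a circle (it is not a problem that the upper-right quarter is taken up by the annotation labels).

The display was then switched from the circular representation to a linear phylogram.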
