A collection of informatics notes related to HTS (NGS).


Updated 2021 2/15


The anvi'o pangenomic workflow consists of three main steps.

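1. Generate an anvi'o genomes storage with anvi-gen-genomes-storage

2. Run the pangenome analysis with anvi-pan-genome (requires the genomes storage)

3. Display the results with anvi-display-pan (requires the genomes storage and the pan database)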
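As a rough sketch of how the three steps chain together, the Python snippet below only assembles the command lines (it does not run anvi'o). The function name `pangenome_commands` is hypothetical; the flags (`-e`, `-o`, `-g`, `-n`, `-T`, `-p`) and the `PROJECT/PROJECT-PAN.db` output layout are the ones used later in this post.

```python
def pangenome_commands(external_genomes, storage, project, threads=1):
    """Return the three anvi'o command lines as argument lists (nothing is executed)."""
    # The help text of anvi-gen-genomes-storage says the name must end with '-GENOMES.db'.
    if not storage.endswith("-GENOMES.db"):
        raise ValueError("genomes storage name must end with '-GENOMES.db'")
    # anvi-pan-genome writes PROJECT-PAN.db into a directory named after the project.
    pan_db = f"{project}/{project}-PAN.db"
    return [
        ["anvi-gen-genomes-storage", "-e", external_genomes, "-o", storage],
        ["anvi-pan-genome", "-g", storage, "-n", project, "-T", str(threads)],
        ["anvi-display-pan", "-p", pan_db, "-g", storage],
    ]


if __name__ == "__main__":
    for cmd in pangenome_commands("external-genomes.txt", "PROCHLORO-GENOMES.db", "PROJECT1", threads=4):
        print(" ".join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` inside the anvi'o environment; keeping the commands as data first makes it easy to review them before anything runs.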





Source code: GitHub


#docker (Docker Hub)

#latest (v6)
docker pull meren/anvio:latest


> anvi-self-test --suite pangenomics


docker run --rm -it -v `pwd`:`pwd` -w `pwd` -p 8080:8080 meren/anvio:latest

anvi-migrate -h

> anvi-migrate -h

usage: anvi-migrate [-h] [--just-do-it] [-t VERSION] DATABASE [DATABASE ...]


Migrate an anvi'o database or config file to a newer version.


positional arguments:

  DATABASE              Anvi'o database or config file for migration


optional arguments:

  -h, --help            show this help message and exit

  --just-do-it          Do not bother me with warnings

  -t VERSION, --target-version VERSION

                        Anvi'o will stop upgrading your database when it

                        reaches to this version.

 :: anvi'o v6.2 :: 

anvi-gen-genomes-storage -h

> anvi-gen-genomes-storage -h

usage: anvi-gen-genomes-storage [-h] [-e FILE_PATH] [-i FILE_PATH]

                                [--gene-caller GENE-CALLER] -o GENOMES_STORAGE


Create a genome storage from internal or external genomes for a pan genome



optional arguments:

  -h, --help            show this help message and exit



  External genomes listed as anvi'o contigs databases. As in, you have one

  or more genomes say from NCBI you want to work with, and you created an

  anvi'o contigs database for each one of them.


  -e FILE_PATH, --external-genomes FILE_PATH

                        A two-column TAB-delimited flat text file that lists

                        anvi'o contigs databases. The first item in the header

                        line should read 'name', and the second should read

                        'contigs_db_path'. Each line in the file should

                        describe a single entry, where the first column is the

                        name of the genome (or MAG), and the second column is

                        the anvi'o contigs database generated for this genome.



  Genome bins stored in an anvi'o profile databases as collections.


  -i FILE_PATH, --internal-genomes FILE_PATH

                        A five-column TAB-delimited flat text file. The header

                        line must contain these columns: 'name', 'bin_id',

                        'collection_id', 'profile_db_path', 'contigs_db_path'.

                        Each line should list a single entry, where 'name' can

                        be any name to describe the anvi'o bin identified as

                        'bin_id' that is stored in a collection.



  Things you may not have to change. But you never know (unless you read the



  --gene-caller GENE-CALLER

                        The gene caller to utilize. Anvi'o supports multiple

                        gene callers, and some operations (including this one)

                        requires an explicit mentioning of which one to use.

                        The default is 'prodigal', but it will not be enough

                        if you if you were a rebel and have used `--external-

                        gene-callers` or something.



  Give it a nice name. Must end with '-GENOMES.db'. This is primarily due to

  the fact that there are other .db files used throughout anvi'o and it

  would be better to distinguish this very special file from them.



                        File path to store results.

 :: anvi'o v6.2 :: 
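The help text above describes two input formats: a two-column TAB-delimited file for `-e` and a five-column one for `-i`. The following Python sketch checks a file's text against those descriptions before it is passed to anvi-gen-genomes-storage. The function `check_genomes_file` is hypothetical and is not part of anvi'o; anvi'o performs its own, stricter validation.

```python
import csv
import io

EXTERNAL_HEADER = ["name", "contigs_db_path"]
INTERNAL_HEADER = ["name", "bin_id", "collection_id", "profile_db_path", "contigs_db_path"]


def check_genomes_file(text, kind):
    """Return a list of problems found in an external or internal genomes file."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    if not rows:
        return ["file is empty"]
    header, body = rows[0], rows[1:]
    problems = []
    if kind == "external":
        # First column must be 'name', second 'contigs_db_path'.
        if header[:2] != EXTERNAL_HEADER:
            problems.append("header must start with: name, contigs_db_path")
    elif kind == "internal":
        # The header must contain all five documented columns.
        missing = [column for column in INTERNAL_HEADER if column not in header]
        if missing:
            problems.append("missing columns: " + ", ".join(missing))
    else:
        raise ValueError("kind must be 'external' or 'internal'")
    for number, row in enumerate(body, start=1):
        if len(row) != len(header):
            problems.append(f"data row {number}: expected {len(header)} fields, found {len(row)}")
    return problems


if __name__ == "__main__":
    print(check_genomes_file("name\tcontigs_db_path\ngenome1\tgenome1.db\n", "external"))
```

A common failure this catches is a file that was saved with spaces instead of TABs, which makes the whole header read as a single column.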

anvi-pan-genome -h

> anvi-pan-genome -h




If you publish results from this workflow, please do not forget to cite DIAMOND

(doi:10.1038/nmeth.3176), unless you use it with --use-ncbi-blast flag, and MCL

( and doi:10.1007/978-1-61779-361-5_15)


usage: anvi-pan-genome [-h] -g GENOMES_STORAGE [-G GENOME_NAMES]

                       [--skip-alignments] [--skip-homogeneity]

                       [--quick-homogeneity] [--align-with ALIGNER]

                       [--exclude-partial-gene-calls] [--use-ncbi-blast]

                       [--minbit MINBIT] [--mcl-inflation INFLATION]

                       [--min-occurrence NUM_OCCURRENCE]

                       [--min-percent-identity PERCENT] [--sensitive]

                       [-n PROJECT_NAME] [--description TEXT_FILE]

                       [-o PAN_DB_DIR] [-W] [-T NUM_THREADS]



                       [--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]


A DIAMOND and MCL-based anvi'o workflow for pangenomics. You provide genomes

from anywhere (whether they are external genomes, or anvi'o genome bins in

collections), and it gives you back a pangenome analysis.


optional arguments:

  -h, --help            show this help message and exit



  The very fancy genomes storage file. This file is generated by the program

  `anvi-genomes-storage`. Please see the online tutorial on pangenomic

  workflow if you don't know how to generate one.



                        Anvi'o genomes storage file

  -G GENOME_NAMES, --genome-names GENOME_NAMES

                        Genome names to 'focus'. You can use this parameter to

                        limit the genomes included in your analysis. You can

                        provide these names as a comma-separated list of

                        names, or you can put them in a file, where you have a

                        single genome name in each line, and provide the file




  Important stuff Tom never pays attention (but you should).


  --skip-alignments     By default, anvi'o attempts to align amino acid

                        sequences in each gene cluster using multiple sequnce

                        alignment via muscle. You can use this flag to skip

                        that step and be upset later.

  --skip-homogeneity    By default, anvi'o attempts to calculate homogeneity

                        values for every gene cluster, given that they are

                        aligned. You can use this flag to have anvi'o skip

                        homogeneity calculations. Anvi'o will ignore this flag

                        if you decide to skip alignments

  --quick-homogeneity   By default, anvi'o will use a homogeneity algorithm

                        that checks for horizontal and vertical geometric

                        homogeneity (along with functional). With this flag,

                        you can tell anvi'o to skip horizontal geometric

                        homogeneity calculations. It will be less accurate but

                        quicker. Anvi'o will ignore this flag if you skip

                        homogeneity calculations or alignments all together.

  --align-with ALIGNER  The multiple sequence alignment program to use when

                        multiple sequence alignment is necessary. To see all

                        available options, use the flag `--list-aligners`.


                        By default, anvi'o includes all partial gene calls

                        from the analysis, which, in some cases, may inflate

                        the number of gene clusters identified and introduce

                        extra heterogeneity within those gene clusters. Using

                        this flag, you can request anvi'o to exclude partial

                        gene calls from the analysis (whether a gene call is

                        partial or not is an information that comes directly

                        from the gene caller used to identify genes during the

                        generation of the contigs database).

  --use-ncbi-blast      This program uses DIAMOND by default, however, if you

                        like, you can use good ol' blastp from NCBI instead.

  --minbit MINBIT       The minimum minbit value. The minbit heuristic

                        provides a mean to set a to eliminate weak matches

                        between two amino acid sequences. We learned it from

                        ITEP (Benedict MN et al, doi:10.1186/1471-2164-15-8),

                        which is a comprehensive analysis workflow for

                        pangenomes, and decided to use it in the anvi'o

                        pangenomic workflow, as well. Briefly, If you have two

                        amino acid sequences, 'A' and 'B', the minbit is

                        defined as 'BITSCORE(A, B) / MIN(BITSCORE(A, A),

                        BITSCORE(B, B))'. So the minbit score between two

                        sequences goes to 1 if they are very similar over the

                        entire length of the 'shorter' amino acid sequence,

                        and goes to 0 if (1) they match over a very short

                        stretch compared even to the length of the shorter

                        amino acid sequence or (2) the match betwen sequence

                        identity is low. The default is 0.5.

  --mcl-inflation INFLATION

                        MCL inflation parameter, that defines the sensitivity

                        of the algorithm during the identification of the gene

                        clusters. More information on this parameter and it's

                        effect on cluster granularity is here:

                        ( The

                        default is 2.

  --min-occurrence NUM_OCCURRENCE

                        Do you not want singletons?\ You don't? Well, this

                        parameter will help you get rid of them (along with

                        doubletons, if you want). Anvi'o will remove gene

                        clusters that occur less than the number you set using

                        this parameter from the analysis. The default is 1,

                        which means everything will be kept. If you want to

                        remove singletons, set it to 2, if you want to remove

                        doubletons as well, set it to 3, and so on.

  --min-percent-identity PERCENT

                        Minimum percent identity between the two amino acid

                        sequences for them to have an edge for MCL analysis.

                        This value will be used to filter hits from Diamond

                        search results. Because percent identity is not a

                        predictor of a good match (since it does not

                        communicate many other important factors such as the

                        alignment length between the two sequences and its

                        proportion to the entire length of those involved), we

                        suggest you rely on 'minbit' parameter. But you know

                        what? Maybe you shouldn't listen to anyone, and

                        experiment on your own! The default is 0 percent.

  --sensitive           DIAMOND sensitivity. With this flag you can instruct

                        DIAMOND to be 'sensitive', rather than 'fast' during

                        the search. It is likely the search will take

                        remarkably longer. But, hey, if you are doing it for

                        your final analysis, maybe it should take longer and

                        be more accurate. This flag is only relevant if you

                        are running DIAMOND.



  Sweet parameters of convenience.


  -n PROJECT_NAME, --project-name PROJECT_NAME

                        Name of the project. Please choose a short but

                        descriptive name (so anvi'o can use it whenever she

                        needs to name an output file, or add a new table in a

                        database, or name her first born).

  --description TEXT_FILE

                        A plain text file that contains some description about

                        the project. You can use Markdwon syntax. The

                        description text will be rendered and shown in all

                        relevant interfaces, including the anvi'o interactive

                        interface, or anvi'o summary outputs.

  -o PAN_DB_DIR, --output-dir PAN_DB_DIR

                        Directory path for output files

  -W, --overwrite-output-destinations

                        Overwrite if the output files and/or directories


  -T NUM_THREADS, --num-threads NUM_THREADS

                        Maximum number of threads to use for multithreading

                        whenever possible. Very conservatively, the default is

                        1. It is a good idea to not exceed the number of CPUs

                        / cores on your system. Plus, please be careful with

                        this option if you are running your commands on a SGE

                        --if you are clusterizing your runs, and asking for

                        multiple threads to use, you may deplete your

                        resources very fast.



  These are stuff that will change the clustering dendrogram of your gene




                        Anvi'o attempts to generate a hierarchical clustering

                        of your gene clusters once it identifies them so you

                        can use `anvi-display-pan` to play with it. But if you

                        want to skip this step, this is your flag.


                        If you want anvi'o to try to generate a hierarchical

                        clustering of your gene clusters even if the number of

                        gene clusters exceeds its suggested limit for

                        hierarchical clustering, you can use this flag to

                        enforce it. Are you are a rebel of some sorts? Or did

                        computers made you upset? Express your anger towards

                        machine using this flag.

  --distance DISTANCE_METRIC

                        The distance metric for the clustering of gene

                        clusters. If you do not use this flag, the default

                        distance metric will be used for each clustering

                        configuration which is "euclidean".

  --linkage LINKAGE_METHOD

                        The same story with the `--distance`, except, the

                        system default for this one is ward.

 :: anvi'o v6.2 :: 
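The `--minbit` description in the help above defines the heuristic as BITSCORE(A, B) / MIN(BITSCORE(A, A), BITSCORE(B, B)). A minimal Python illustration of that formula follows. The function names are hypothetical, and treating a hit as kept when its minbit is greater than or equal to the threshold is an assumption here, not something the help text states.

```python
def minbit(bitscore_ab, bitscore_aa, bitscore_bb):
    """minbit = BITSCORE(A, B) / MIN(BITSCORE(A, A), BITSCORE(B, B))."""
    denominator = min(bitscore_aa, bitscore_bb)
    if denominator <= 0:
        raise ValueError("self bit scores must be positive")
    return bitscore_ab / denominator


def keep_hit(bitscore_ab, bitscore_aa, bitscore_bb, threshold=0.5):
    """Assumed filter: keep the A-B hit when its minbit reaches the threshold (default 0.5)."""
    return minbit(bitscore_ab, bitscore_aa, bitscore_bb) >= threshold


if __name__ == "__main__":
    # A short protein (self score 100) matching part of a long one (self score 200).
    print(minbit(80, 100, 200))
    print(keep_hit(40, 100, 200))
```

Because the denominator is the smaller of the two self scores, a short sequence that aligns well over its whole length still scores close to 1, which is the behavior the help text describes.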

anvi-display-pan -h

> anvi-display-pan -h

usage: anvi-display-pan [-h] -p PAN_DB [-g GENOMES_STORAGE] [-d VIEW_DATA]

                        [-t NEWICK] [-V ADDITIONAL_VIEW]

                        [-A ADDITIONAL_LAYERS] [--view NAME] [--title NAME]

                        [--state-autoload NAME] [--collection-autoload NAME]

                        [--export-svg FILE_PATH] [--skip-init-functions]

                        [--dry-run] [--skip-auto-ordering] [-I IP_ADDR]

                        [-P INT] [--browser-path PATH] [--read-only]

                        [--server-only] [--password-protected]



Start an anvi'o server to display a pan-genome


optional arguments:

  -h, --help            show this help message and exit



  Input files from the pangenome analysis.


  -p PAN_DB, --pan-db PAN_DB

                        Anvi'o pan database


                        Anvi'o genomes storage file



  Where the yay factor becomes a reality.


  -d VIEW_DATA, --view-data VIEW_DATA

                        A TAB-delimited file for view data

  -t NEWICK, --tree NEWICK

                        NEWICK formatted tree structure



  Parameters to provide additional layers, views, or layer data.



                        A TAB-delimited file for an additional view to be used

                        in the interface. This file should contain all split

                        names, and values for each of them in all samples.

                        Each column in this file must correspond to a sample

                        name. Content of this file will be called 'user_view',

                        which will be available as a new item in the 'views'

                        combo box in the interface


                        A TAB-delimited file for additional layers for splits.

                        The first column of this file must be split names, and

                        the remaining columns should be unique attributes. The

                        file does not need to contain all split names, or

                        values for each split in every column. Anvi'o will try

                        to deal with missing data nicely. Each column in this

                        file will be visualized as a new layer in the tree.



  Parameters that give access to various adjustements regarding the



  --view NAME           Start the interface with a pre-selected view. To see a

                        list of available views, use --show-views flag.

  --title NAME          Title for the interface. If you are working with a

                        RUNINFO dict, the title will be determined based on

                        information stored in that file. Regardless, you can

                        override that value using this parameter.

  --state-autoload NAME

                        Automatically load previous saved state and draw tree.

                        To see a list of available states, use --show-states


  --collection-autoload NAME

                        Automatically load a collection and draw tree. To see

                        a list of available collections, use --list-

                        collections flag.

  --export-svg FILE_PATH

                        The SVG output file path.



  Parameters and flags that are not quite essential (but nice to have).



                        When declared, function calls for genes will not be

                        initialized (therefore will be missing from all

                        relevant interfaces or output files). The use of this

                        flag may reduce the memory fingerprint and processing

                        time for large datasets.

  --dry-run             Don't do anything real. Test everything, and stop

                        right before wherever the developer said 'well, this

                        is enough testing', and decided to print out results.

  --skip-auto-ordering  When declared, the attempt to include automatically

                        generated orders of items based on additional data is

                        skipped. In case those buggers cause issues with your

                        data, and you still want to see your stuff and deal

                        with the other issue maybe later.



  For power users.


  -I IP_ADDR, --ip-address IP_ADDR

                        IP address for the HTTP server. The default ip address

                        ( should work just fine for most.

  -P INT, --port-number INT

                        Port number to use for anvi'o services. If nothing is

                        declared, anvi'o will try to find a suitable port

                        number, starting from the default port number, 8080.

  --browser-path PATH   By default, anvi'o will use your default browser to

                        launch the interactive interface. If you would like to

                        use something else than your system default, you can

                        provide a full path for an alternative browser using

                        this parameter, and hope for the best. For instance we

                        are using this parameter to call Google's experimental

                        browser, Canary, which performs better with demanding


  --read-only           When the interactive interface is started with this

                        flag, all 'database write' operations will be


  --server-only         The default behavior is to start the local server, and

                        fire up a browser that connects to the server. If you

                        have other plans, and want to start the server without

                        calling the browser, this is the flag you need.

  --password-protected  If this flag is set, command line tool will ask you to

                        enter a password and interactive interface will be

                        only accessible after entering same password. This

                        option is recommended for shared machines like

                        clusters or shared networks where computers are not



                        Allow users to shutdown an anvi'server via web


 :: anvi'o v6.2 :: 

anvi-split -h

> anvi-split -h

usage: anvi-split [-h] -p PAN_OR_PROFILE_DB [-c CONTIGS_DB]

                  [-g GENOMES_STORAGE] [--skip-variability-tables]

                  [--compress-auxiliary-data] [-C COLLECTION_NAME]

                  [-b BIN_NAME] [-o DIR_PATH] [--list-collections]



                  [--distance DISTANCE_METRIC] [--linkage LINKAGE_METHOD]


Split an anvi'o pan or profile database into smaller, self-contained pieces.

This is usually great when you want to share a subset of an anvi'o project.

You give this guy your databases, and a collection id, and it gives you back

directories of individual projects for each bin that can be treated as self-

contained smaller anvi'o projects. We know you don't read this far into these

help menus, but please remember: you will either need to provide a profile &

contigs database pair, or a pan & genomes storage pair. The rest will be taken

care of. Magic.


optional arguments:

  -h, --help            show this help message and exit



  You will either provide a PROFILE/CONTIGS or a PAN/GENOMES STORAGE pair



  -p PAN_OR_PROFILE_DB, --pan-or-profile-db PAN_OR_PROFILE_DB

                        Anvi'o pan or profile database (and even genes

                        database in appropriate contexts).

  -c CONTIGS_DB, --contigs-db CONTIGS_DB

                        Anvi'o contigs database generated by 'anvi-gen-



                        Anvi'o genomes storage file



  Some options that are specific to this only.



                        Processing variability tables in profile database

                        might take a very long time. With this flag you will

                        be asking anvi'o to skip them.


                        When declared, the auxiliary data file in the

                        resulting output will be compressed. This saves space,

                        but it takes long. Also, if you are planning to

                        compress the entire later using GZIP, it is even

                        useless to do. But you are the boss!



  You should provide a valid collection name. If you do not provide bin

  names, the program will generate an output for each bin in your collection




                        Collection name.

  -b BIN_NAME, --bin-id BIN_NAME

                        Bin name you are interested in.



  Where do we want the resulting split profiles to be stored.


  -o DIR_PATH, --output-dir DIR_PATH

                        Directory path for output files



  Stuff that you rarely need, but you really really need when the time

  comes. Following parameters will aply to each of the resulting anvi'o

  profile that will be split from the mother anvi'o profile.


  --list-collections    Show available collections and exit.


                        If you are not planning to use the interactive

                        interface (or if you have other means to add a tree of

                        contigs in the database) you may skip the step where

                        hierarchical clustering of your items are preformed

                        based on default clustering recipes matching to your

                        database type.


                        If you have more than 25,000 splits in your merged

                        profile, anvi-merge will automatically skip the

                        hierarchical clustering of splits (by setting --skip-

                        hierarchical-clustering flag on). This is due to the

                        fact that computational time required for hierarchical

                        clustering increases exponentially with the number of

                        items being clustered. Based on our experience we

                        decided that 25,000 splits is about the maximum we

                        should try. However, this is not a theoretical limit,

                        and you can overwrite this heuristic by using this

                        flag, which would tell anvi'o to attempt to cluster

                        splits regardless.

  --distance DISTANCE_METRIC

                        The distance metric for the hierarchical clustering.

                        If you do not use this flag, the default distance

                        metric will be used for each clustering configuration

                        which is "euclidean".

  --linkage LINKAGE_METHOD

                        The same story with the `--distance`, except, the

                        system default for this one is ward.

 :: anvi'o v6.2 :: 





Both FASTA files from anvi'o metagenomic binning (internal genomes) and user-supplied FASTA files (external genomes) can be used.



wget -O Prochlorococcus_31_genomes.tar.gz
tar -zxvf Prochlorococcus_31_genomes.tar.gz
cd Prochlorococcus_31_genomes





docker run --rm -it -v `pwd`:`pwd` -w `pwd` -p 8080:8080 meren/anvio:latest

pip install h5py

anvi-migrate *.db



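3. Build a single .db (the genomes storage) covering all genomes. Provide a TAB-delimited file (external-genomes.txt) that lists the name and contigs database path of each genome.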




name	contigs_db_path
genome1	genome1.db
genome2	genome2.db
genome3	genome3.db
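With many genomes, writing this file by hand is error-prone. The Python sketch below builds the table from a list of contigs database paths, using each file name without its .db extension as the genome name. The function `external_genomes_table` is hypothetical, and skipping files ending in -GENOMES.db or -PAN.db is a convenience assumption so that storage and pan databases in the same directory are not listed as genomes.

```python
from pathlib import Path


def external_genomes_table(db_paths):
    """Build the text of a two-column TAB-delimited external genomes file."""
    lines = ["name\tcontigs_db_path"]
    for path in sorted(str(p) for p in db_paths):
        # Skip genomes storage and pan databases; they are not contigs databases.
        if path.endswith("-GENOMES.db") or path.endswith("-PAN.db"):
            continue
        name = Path(path).stem  # genome1.db -> genome1
        lines.append(f"{name}\t{path}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Write external-genomes.txt for every .db file in the current directory.
    table = external_genomes_table(Path(".").glob("*.db"))
    Path("external-genomes.txt").write_text(table)
    print(table, end="")
```

It is worth opening the resulting file once to confirm that the names are the labels you want to see in the interface.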






anvi-gen-genomes-storage -e external-genomes.txt -o PROCHLORO-GENOMES.db



4. Once the genomes storage is ready, run the pangenome analysis with the anvi-pan-genome program.

anvi-pan-genome -g PROCHLORO-GENOMES.db -n PROJECT1 -T 40

This creates the directory PROJECT1/, which contains the pan database PROJECT1-PAN.db and other output files.



anvi-display-pan -p PROJECT1/PROJECT1-PAN.db -g PROCHLORO-GENOMES.db 

Open http://localhost:8080 in a browser.








Reorder the layers (that is, the order of the rings from the center to the outer edge) based on the gene clustering results: in the Layer tab, select Order by => gene_cluster frequencies.






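Item order in the Main tab sets the ordering within the rings. Changing it leaves the layer order as it is and reorders the gene clusters within each layer. For example, the core gene clusters at the upper left (the solid black region) may move to the lower right.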





anvi-import-state -p PROJECT1/PROJECT1-PAN.db \
--state pan-state.json \
--name default

anvi-display-pan -p PROJECT1/PROJECT1-PAN.db -g PROCHLORO-GENOMES.db











Renamed the bin to be assigned from bin_1 to core gene and set its color to red.





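Hover the mouse over the branch you want to label as core gene.


While the branch is highlighted (its color changes on hover), left-click once. The label core gene now appears on the outermost ring for that branch.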




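Click the plus button in the upper-left menu as well to create a tag named accessory gene (color: blue), then label the remaining branches as accessory gene on the outermost ring.


The lower part is now labeled accessory gene.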
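The same split can be approximated outside the interface. The Python sketch below labels a gene cluster as core gene when it occurs in every genome and as accessory gene otherwise. That definition is a common convention assumed here, not something the interface enforces; the function name and the example cluster and genome names are hypothetical.

```python
def classify_gene_clusters(presence, genomes):
    """Label each gene cluster 'core gene' or 'accessory gene'.

    presence maps a gene cluster name to the set of genomes that contain it.
    """
    all_genomes = set(genomes)
    if not all_genomes:
        raise ValueError("at least one genome is required")
    labels = {}
    for cluster, members in presence.items():
        found = set(members) & all_genomes
        # Assumed convention: core means present in every genome.
        labels[cluster] = "core gene" if found == all_genomes else "accessory gene"
    return labels


if __name__ == "__main__":
    presence = {
        "GC_1": {"genome1", "genome2", "genome3"},
        "GC_2": {"genome1"},
        "GC_3": {"genome2", "genome3"},
    }
    for cluster, label in classify_gene_clusters(presence, ["genome1", "genome2", "genome3"]).items():
        print(cluster, label)
```

Bins drawn by hand in the interface can follow the dendrogram rather than strict presence in all genomes, so the two groupings need not match exactly.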





Right-click and select inspect gene cluster from the menu; a new window opens and displays the sequences.









anvi-display-pan -p PROJECT1/PROJECT1-PAN.db -g PROCHLORO-GENOMES.db

Once the run finishes, draw the results by specifying the core gene or accessory gene .db in the output directory.


2020 6/23

52 genome



Anvi'o: an advanced analysis and visualization platform for 'omics data

Eren AM, Esen ÖC, Quince C, Vineis JH, Morrison HG, Sogin ML, Delmont TO

PeerJ. 2015 Oct 8;3:e1319