A collection of notes on informatics for HTS (NGS).

SatelliteFinder: identifying bacteriophage satellites


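 Interactions between bacteriophages and bacteria are affected by phage satellites, elements that exploit phages to transfer between bacteria. Satellites can encode defense systems, antibiotic resistance genes, and virulence factors, but their number and diversity are unknown. The authors developed SatelliteFinder to identify satellites in bacterial genomes; it detects the four best-described families: P4-like, phage-inducible chromosomal islands (PICI), capsid-forming PICI (cfPICI), and PICI-like elements (PLE). They expanded the number of described elements to about 5000 and found bacterial genomes carrying satellites from up to three different families. Most satellites were found in Proteobacteria and Firmicutes, but some occur in novel taxa such as Actinobacteria. The gene repertoires of satellites vary in size and composition, whereas their genomic organization is highly conserved. Phylogenies of the core genes of PICI and cfPICI indicate that their hijacking modules evolved independently. Few other core genes are homologous between satellite families, and even fewer are homologous to phage genes. Phage satellites therefore appear to be ancient and diverse, and probably evolved independently multiple times. Because many phage-infected bacteria still have no known satellites, and new families have recently been proposed, the authors speculate that we are at the beginning of the discovery of massive numbers and types of satellites.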






>docker run --rm gempasteur/satellite_finder:0.9.1

macsypy.error.OptionError: ERROR: option --models is mandatory except if you use --previous-run.

usage: [-h] [-m {cfPICI,PICI,P4,PLE}]

                           [--sequence-db SEQUENCE_DB]

                           [--db-type {ordered_replicon,gembase}]

                           [--replicon-topology {linear,circular}]

                           [--topology-file TOPOLOGY_FILE] [--idx]

                           [--inter-gene-max-space INTER_GENE_MAX_SPACE INTER_GENE_MAX_SPACE]

                           [--min-mandatory-genes-required MIN_MANDATORY_GENES_REQUIRED MIN_MANDATORY_GENES_REQUIRED]

                           [--min-genes-required MIN_GENES_REQUIRED MIN_GENES_REQUIRED]

                           [--max-nb-genes MAX_NB_GENES MAX_NB_GENES]

                           [--multi-loci MULTI_LOCI] [--hmmer HMMER]

                           [--e-value-search E_VALUE_SEARCH] [--cut-ga]

                           [--i-evalue-sel I_EVALUE_SEL]

                           [--coverage-profile COVERAGE_PROFILE]

                           [--mandatory-weight MANDATORY_WEIGHT]

                           [--accessory-weight ACCESSORY_WEIGHT]

                           [--exchangeable-weight EXCHANGEABLE_WEIGHT]

                           [--redundancy-penalty REDUNDANCY_PENALTY]

                           [--out-of-cluster OUT_OF_CLUSTER] [-o OUT_DIR]

                           [--index-dir INDEX_DIR]

                           [--res-search-suffix RES_SEARCH_SUFFIX]

                           [--res-extract-suffix RES_EXTRACT_SUFFIX]

                           [--profile-suffix PROFILE_SUFFIX] [-w WORKER] [-v]

                           [--mute] [--version] [-l] [--cfg-file CFG_FILE]

                           [--previous-run PREVIOUS_RUN]


     *            *               *                   *

*           *               *   *   *  *    **                *   *

  **     *    *   *  *     *                    *               *

 *      *   * *     *   **         *   *  *           *


 ____        _       _ _ _ _         _____ _           _           

/ ___|  __ _| |_ ___| | (_) |_ ___  |  ___(_)_ __   __| | ___ _ __ 

\___ \ / _` | __/ _ \ | | | __/ _ \ | |_  | | '_ \ / _` |/ _ \ '__|

 ___) | (_| | ||  __/ | | | ||  __/ |  _| | | | | | (_| |  __/ |   

|____/ \__,_|\__\___|_|_|_|\__\___| |_|   |_|_| |_|\__,_|\___|_|  


  *      *         *        *    *              *

             *                           *  *           *     *


SatelliteFinder - Detection of four families of phage satellites: P4-like, PICI, cfPICI and PLESatellite systems 

in protein datasets using systems modelling and similarity search.  



  -h, --help            show this help message and exit

  -m {cfPICI,PICI,P4,PLE}, --models {cfPICI,PICI,P4,PLE}

                        The models to search.


                        (required unless --previous-run is set)


Input dataset options:

  --sequence-db SEQUENCE_DB

                        Path to the sequence dataset in fasta format.

                        (required unless --previous-run is set)

  --db-type {ordered_replicon,gembase}

                        The type of dataset to deal with.

                        "ordered_replicon" to an assembled genome,

                        "gembase" to a set of replicons where sequence identifiers

                        follow this convention: ">RepliconName_SequenceID".

                        (required unless --previous-run is set)

  --replicon-topology {linear,circular}

                        The topology of the replicons

                        (this option is meaningful only if the db_type is

                        'ordered_replicon' or 'gembase'.)

                        (default: circular)

  --topology-file TOPOLOGY_FILE

                        Topology file path. The topology file allows to specify a topology

                        (linear or circular) for each replicon (this option is meaningful only if the db_type is 

                        'ordered_replicon' or 'gembase'.

                        A topology file is a tabular file with two columns:

                            the 1st is the replicon name, and the 2nd the corresponding topology:

                            "RepliconA    linear"

  --idx                 Forces to build the indexes for the sequence dataset even

                        if they were previously computed and present at the dataset location.

                        (default: False)


Systems detection options:


  --inter-gene-max-space INTER_GENE_MAX_SPACE INTER_GENE_MAX_SPACE

                        Co-localization criterion: maximum number of components non-matched by a

                            profile allowed between two matched components for them to be considered contiguous.

                        Option only meaningful for 'ordered' datasets.

                        The first value must name a model, the second a number of components.

                        This option can be repeated several times:

                            "--inter-gene-max-space TXSS/T2SS 12 --inter-gene-max-space TXSS/Flagellum 20


  --min-mandatory-genes-required MIN_MANDATORY_GENES_REQUIRED MIN_MANDATORY_GENES_REQUIRED

                        The minimal number of mandatory genes required for model assessment.

                        The first value must correspond to a model fully qualified name, the second value to an integer.

                        This option can be repeated several times:

                            "--min-mandatory-genes-required TXSS/T2SS 15 --min-mandatory-genes-required TXSS/Flagellum 10"


  --min-genes-required MIN_GENES_REQUIRED MIN_GENES_REQUIRED

                        The minimal number of genes required for model assessment

                        (includes both 'mandatory' and 'accessory' components).

                        The first value must correspond to a model fully qualified name, the second value to an integer.

                        This option can be repeated several times:

                            "--min-genes-required TXSS/T2SS 15 --min-genes-required TXSS/Flagellum 10

  --max-nb-genes MAX_NB_GENES MAX_NB_GENES

                        The maximal number of genes to consider a system as full.

                        The first value must correspond to a model name, the second value to an integer.

                        This option can be repeated several times:

                            "--max-nb-genes TXSS/T2SS 5 --max-nb-genes TXSS/Flagellum 10"

  --multi-loci MULTI_LOCI

                        Specifies if the system can be detected as a 'scattered' (or multiple-loci-encoded) system.

                        The models are specified as a comma separated list of fully qualified name(s)

                            "--multi-loci model_familyA/model_1,model_familyB/model_2"


Options for Hmmer execution and hits filtering:

  --hmmer HMMER         Path to the hmmsearch program.

                        If not specified, rely on the environment variable PATH

                        (default: hmmsearch)

  --e-value-search E_VALUE_SEARCH

                        Maximal e-value for hits to be reported during hmmsearch search.

                        By default MSF set per profile threshold for hmmsearch run (hmmsearch --cut_ga option) 

                        for profiles containing the GA bit score threshold.

                        If a profile does not contains the GA bit score the --e-value-search (-E in hmmsearch) is applied to this profile.

                        To applied the --e-value-search to all profiles use the --no-cut-ga option. 

                        (default: 0.1)

  --cut-ga              By default the satellite_finder try to applied a threshold by default for all profiles

                        But it is possible to activate a threshold per profile by using the hmmer -cut-ga option.

                        This is possible only if the GA bit score is present in the profile otherwise 

                        MF switch to use the --e-value-search (-E in hmmsearch). 

                        (default: False)

  --i-evalue-sel I_EVALUE_SEL

                        Maximal independent e-value for Hmmer hits to be selected for systems detection.

                        (default:0.01 )

  --coverage-profile COVERAGE_PROFILE

                        Minimal profile coverage required for the hit alignment  with the profile to allow

                        the hit selection for systems detection. 

                        (default: 0.4)


Score options:

  Options for cluster and systems scoring


  --mandatory-weight MANDATORY_WEIGHT

                        the weight of a mandatory component in cluster scoring


  --accessory-weight ACCESSORY_WEIGHT

                        the weight of a accessory component in cluster scoring


  --exchangeable-weight EXCHANGEABLE_WEIGHT

                        the weight modifier for a component which code for exchangeable cluster scoring


  --redundancy-penalty REDUNDANCY_PENALTY

                        the weight modifier for cluster which bring a component already presents in other

                        clusters (default:1.5)

  --out-of-cluster OUT_OF_CLUSTER

                        the weight modifier for a hit which is a

                         - true loner (not in cluster)

                         - or multi-system (from an other system) 



Path options:

  -o OUT_DIR, --out-dir OUT_DIR

                        Path to the directory where to store output results.

                        if out-dir is specified, res-search-dir will be ignored.

  --index-dir INDEX_DIR

                        Specifies the path to a directory to store/read the sequence index when the sequence-db dir is not writable.

  --res-search-suffix RES_SEARCH_SUFFIX

                        The suffix to give to Hmmer raw output files. (default: .search_hmm.out)

  --res-extract-suffix RES_EXTRACT_SUFFIX

                        The suffix to give to filtered hits output files. (default: .res_hmm_extract)

  --profile-suffix PROFILE_SUFFIX

                        The suffix of profile files. For each 'Gene' element, the corresponding profile is

                        searched in the 'profile_dir', in a file which name is based on the

                        Gene name + the profile suffix.

                        For instance, if the Gene is named 'gspG' and the suffix is '.hmm3',

                        then the profile should be placed at the specified location

                        under the name 'gspG.hmm3'

                        (default: .hmm)


General options:

  -w WORKER, --worker WORKER

                        Number of workers to be used by MacSyFinder.

                        In the case the user wants to run MacSyFinder in a multi-thread mode.

                        0 mean that all threads available will be used.

                        (default: 1)

  -v, --verbosity       Increases the verbosity level. There are 4 levels:

                        Error messages (default), Warning (-v), Info (-vv) and Debug.(-vvv)

  --mute                Mute the log on stdout.

                        (continue to log on macsyfinder.log)

                        (default: False)

  --version             show program's version number and exit

  -l, --list-models     Displays all models installed at generic location and quit.

  --cfg-file CFG_FILE   Path to a MacSyFinder configuration file to be used. (conflict with --previous-run)

  --previous-run PREVIOUS_RUN

                        Path to a previous MacSyFinder run directory.

                        It allows to skip the Hmmer search step on a same dataset,

                        as it uses previous run results and thus parameters regarding Hmmer detection.

                        The configuration file from this previous run will be used.

                        Conflicts with options:  

                            --cfg-file, --sequence-db, --profile-suffix, --res-extract-suffix, --e-value-res, --db-type, --hmmer
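
Before choosing between `ordered_replicon` and `gembase` for `--db-type`, it can help to check how the FASTA headers would split into replicons. The sketch below is not part of SatelliteFinder: `count_per_replicon` is a hypothetical helper, and it assumes that the replicon name is everything before the last underscore of the sequence identifier, which is one reading of the `>RepliconName_SequenceID` convention quoted in the help text.

```shell
# Hypothetical helper (not part of SatelliteFinder): count proteins per
# replicon in a FASTA file whose headers look like ">RepliconName_SequenceID".
# Assumption: the replicon name is everything before the LAST underscore of
# the first whitespace-delimited token on each header line.
count_per_replicon() {
  awk '/^>/ {
         id = substr($1, 2)              # drop the leading ">"
         n = split(id, parts, "_")
         if (n < 2) { bad++; next }      # header has no underscore at all
         rep = parts[1]
         for (i = 2; i < n; i++) rep = rep "_" parts[i]
         count[rep]++
       }
       END {
         for (r in count) print r "\t" count[r]
         if (bad > 0) print "headers_without_underscore\t" bad
       }' "$1" | LC_ALL=C sort
}

# Usage (file name taken from the example command in this post):
#   count_per_replicon my_protein_dataset.fasta
```

If many headers are reported as lacking an underscore, the file probably does not follow the gembase convention as read here.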






docker run -v ${PWD}/:/home/msf -u $(id -u ${USER}):$(id -g ${USER})  gempasteur/satellite_finder:0.9.1  --db-type ordered_replicon --models  <cfPICI | PICI | P4 | PLE> --sequence-db my_protein_dataset.fasta -w 12 --out-dir <result_dir>
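
Since `--models` takes one family at a time in the command above, screening a proteome against all four families means running it four times. The following is a minimal sketch of such a loop, not something taken from the SatelliteFinder documentation: the `run_model` function, the `DRY_RUN` switch, and the `results_<model>` directory names are assumptions for illustration. Only the image tag and the options come from the help output and the command above.

```shell
# Sketch: one SatelliteFinder run per satellite family.
# run_model, DRY_RUN and the results_<model> naming are hypothetical.

IMAGE="gempasteur/satellite_finder:0.9.1"
FASTA="${FASTA:-my_protein_dataset.fasta}"   # protein FASTA in the current directory
WORKERS="${WORKERS:-12}"
DRY_RUN="${DRY_RUN:-1}"                      # 1 = only print the commands

run_model() {
  model="$1"   # one of cfPICI, PICI, P4, PLE
  # Build the argument list with "set --" so paths with spaces stay intact.
  set -- docker run --rm -v "${PWD}:/home/msf" -u "$(id -u):$(id -g)" \
    "$IMAGE" --db-type ordered_replicon --models "$model" \
    --sequence-db "$FASTA" -w "$WORKERS" --out-dir "results_${model}"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"    # show what would be executed
  else
    "$@"         # actually run the container
  fi
}

for m in cfPICI PICI P4 PLE; do
  run_model "$m"
done
```

With `DRY_RUN=1` (the default in this sketch) the four commands are only printed, so they can be inspected first; setting `DRY_RUN=0` executes them.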



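  • The Docker image is hosted on Docker Hub, so it can also be used directly with Apptainer. Unlike with Docker, there is no need to set up shared directories, because HOME and /tmp are shared automatically.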
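
As a rough sketch of the Apptainer route: Apptainer can run Docker Hub images through a `docker://` URI. The command below assumes that the image's entry point accepts the same options under Apptainer as under Docker, and that the FASTA file sits somewhere under HOME so it is visible inside the container. Neither assumption has been verified here. The `apptainer_satfinder` function, the `DRY_RUN` switch, and the `results_P4` directory name are illustrative.

```shell
# Sketch only: assumes `apptainer run docker://...` forwards the options to
# the same entry point that `docker run` uses for this image.

IMAGE_URI="docker://gempasteur/satellite_finder:0.9.1"
DRY_RUN="${DRY_RUN:-1}"   # hypothetical switch: 1 = print the command only

apptainer_satfinder() {
  # $1 = model (cfPICI, PICI, P4 or PLE), $2 = protein FASTA, $3 = output dir
  set -- apptainer run "$IMAGE_URI" \
    --db-type ordered_replicon --models "$1" \
    --sequence-db "$2" -w 12 --out-dir "$3"
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"    # show what would be executed
  else
    "$@"         # run the container through Apptainer
  fi
}

apptainer_satfinder P4 my_protein_dataset.fasta results_P4
```

No `-v` or `-u` options appear here because, as noted above, HOME and /tmp are shared automatically.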


Identification and characterization of thousands of bacteriophage satellites across bacteria 
Jorge A Moura de Sousa, Alfred Fillol-Salom, José R Penadés, Eduardo P C Rocha
Nucleic Acids Research, 03 March 2023