An increasing number of metagenome analysis tools aimed at characterizing environmental samples are now available [paper refs. 1,2,3,4]. Motivated by the large amount of data generated by whole metagenome shotgun (WMS) sequencing, metagenome profiling has become more accessible and faster to apply in real scenarios, and is becoming a standard method in metagenomics analysis [refs. 5,6,7]. Several kinds of tools perform sequence classification based on WMS sequencing data. One basic approach is de novo sequence assembly [refs. 8,9,10], which aims to reconstruct complete or near-complete genomes from short reads without references or prior knowledge. This would provide the best picture of community composition, but short read lengths, insufficient coverage, highly similar DNA sequences, and low-abundance strains make it very difficult to generate meaningful assemblies from metagenomics data [ref. 11].
More commonly, reference-based methods use WMS reads directly without assembly, i.e., the analysis relies on previously obtained genome sequences. Two standard kinds of applications fall into this category: taxonomic profiling tools and binning tools. Profilers analyze the WMS sequences as a whole and aim to predict the organisms present and their relative abundances based on a given set of reference sequences. Binning tools aim to classify each sequence in a sample individually, linking each of them to the most likely organism in the reference set. Despite this conceptual difference, both groups of tools can be used to characterize microbial communities. However, binning tools produce an individual classification for each sequence, which has to be converted and normalized before they can be used as taxonomic profilers.
The methods available in these two categories employ several techniques, e.g., read mapping, k-mer alignment, and composition analysis. Variations in how the database is built, e.g., from complete genome sequences, marker genes, or protein sequences, are also common. Many of these techniques were developed to cope with the high throughput of modern sequencing technologies and with the computational cost of handling the large number of available reference genome sequences.
The availability of so many options for tools, parameters, databases, and techniques makes it hard for researchers to decide which method to use. Different tools give good results in different scenarios and are more or less precise or sensitive depending on the configuration, so it is difficult to rely on their output as studies and samples vary. Moreover, when several methods are used, it is hard to integrate conflicting results from tools that rely on different reference sets. On top of that, installation, parameter choice, database creation, and the lack of a standard output format are challenges that are not easily solved.
MetaMeta is proposed as a new pipeline for the joint execution and integration of metagenome sequence classification tools. MetaMeta has several advantages: easy installation and setup; support for multiple tools, samples, and databases; an improved final profile that combines multiple results; out-of-the-box parallelization and high-performance computing (HPC) integration; automatic database download and configuration; creation of custom databases; and integrated preprocessing steps (read trimming, error correction, and sub-sampling). MetaMeta achieves a more careful profiling result by merging correct identifications and properly filtering out wrong ones. MetaMeta is built with Snakemake [ref. 12] and is open source. (The rest of the introduction is omitted.)
From the "Tool selection" section:
MetaMeta was evaluated with a set of six tools: CLARK [ref. 19], DUDes [ref. 20], GOTTCHA [ref. 21], Kraken [ref. 22], Kaiju [ref. 23], and mOTUs [ref. 24]. The choice was motivated in part by recent papers comparing the performance of such tools [refs. 3, 4, 25]. CLARK, GOTTCHA, Kraken, and mOTUs achieved very low numbers of false positives according to [ref. 4]. DUDes is an in-house tool that achieved a good trade-off between precision and sensitivity according to [ref. 25]. Kaiju uses a translated (protein) database, adding diversity to the otherwise whole-genome-based methods.
MetaMeta pipeline (figure reproduced from the paper; see PubMed).
Installation
Tested with anaconda3-4.0.0 on CentOS 6.
Source code: GitHub
#In an Anaconda environment it can be installed with conda (Linux only).
conda install -c bioconda metameta
#It seemed to conflict with other packages, so I reinstalled it into a dedicated environment.
#Test in a virtual environment; Python versions are assumed to be managed with pyenv.
conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda create -n metametaenv metameta=1.2.0
source activate metametaenv
$ metameta --help
usage: snakemake [-h] [--profile PROFILE] [--snakefile FILE] [--gui [PORT]]
[--cores [N]] [--local-cores N]
[--resources [NAME=INT [NAME=INT ...]]]
[--config [KEY=VALUE [KEY=VALUE ...]]] [--configfile FILE]
[--list] [--list-target-rules] [--directory DIR] [--dryrun]
[--printshellcmds] [--debug-dag] [--dag]
[--force-use-threads] [--rulegraph] [--d3dag] [--summary]
[--detailed-summary] [--archive FILE] [--touch]
[--keep-going] [--force] [--forceall]
[--forcerun [TARGET [TARGET ...]]]
[--prioritize TARGET [TARGET ...]]
[--until TARGET [TARGET ...]]
[--omit-from TARGET [TARGET ...]] [--allow-ambiguity]
[--cluster CMD | --cluster-sync CMD | --drmaa [ARGS]]
[--drmaa-log-dir DIR] [--cluster-config FILE]
[--immediate-submit] [--jobscript SCRIPT] [--jobname NAME]
[--cluster-status CLUSTER_STATUS] [--kubernetes [NAMESPACE]]
[--kubernetes-env ENVVAR [ENVVAR ...]]
[--container-image IMAGE] [--reason] [--stats FILE]
[--nocolor] [--quiet] [--nolock] [--unlock]
[--cleanup-metadata FILE [FILE ...]] [--rerun-incomplete]
[--ignore-incomplete] [--list-version-changes]
[--list-code-changes] [--list-input-changes]
[--list-params-changes] [--latency-wait SECONDS]
[--wait-for-files [FILE [FILE ...]]] [--benchmark-repeats N]
[--notemp] [--keep-remote] [--keep-target-files]
[--keep-shadow]
[--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]]
[--max-jobs-per-second MAX_JOBS_PER_SECOND]
[--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND]
[--restart-times RESTART_TIMES] [--attempt ATTEMPT]
[--timestamp] [--greediness GREEDINESS] [--no-hooks]
[--print-compilation]
[--overwrite-shellcmd OVERWRITE_SHELLCMD] [--verbose]
[--debug] [--runtime-profile FILE] [--mode {0,1,2}]
[--bash-completion] [--use-conda] [--conda-prefix DIR]
[--create-envs-only] [--use-singularity]
[--singularity-prefix DIR] [--singularity-args ARGS]
[--wrapper-prefix WRAPPER_PREFIX]
[--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp}]
[--default-remote-prefix DEFAULT_REMOTE_PREFIX]
[--no-shared-fs] [--version]
[target [target ...]]
Snakemake is a Python based language and execution environment for GNU Make-
like workflows.
positional arguments:
target Targets to build. May be rules or files.
optional arguments:
-h, --help show this help message and exit
--profile PROFILE Name of profile to use for configuring Snakemake.
Snakemake will search for a corresponding folder in
/etc/xdg/xdg-ubuntu/snakemake and
/home/kazu/.config/snakemake. Alternatively, this can
be an absolute or relative path. The profile folder
has to contain a file 'config.yaml'. This file can be
used to set default values for command line options in
YAML format. For example, '--cluster qsub' becomes
'cluster: qsub' in the YAML file. Profiles can be
obtained from https://github.com/snakemake-profiles.
--snakefile FILE, -s FILE
The workflow definition in a snakefile.
--gui [PORT] Serve an HTML based user interface to the given
network and port e.g. 168.129.10.15:8000. By default
Snakemake is only available in the local network
(default port: 8000). To make Snakemake listen to all
ip addresses add the special host address 0.0.0.0 to
the url (0.0.0.0:8000). This is important if Snakemake
is used in a virtualised environment like Docker. If
possible, a browser window is opened.
--cores [N], --jobs [N], -j [N]
Use at most N cores in parallel (default: 1). If N is
omitted, the limit is set to the number of available
cores.
--local-cores N In cluster mode, use at most N cores of the host
machine in parallel (default: number of CPU cores of
the host). The cores are used to execute local rules.
This option is ignored when not in cluster mode.
--resources [NAME=INT [NAME=INT ...]], --res [NAME=INT [NAME=INT ...]]
Define additional resources that shall constrain the
scheduling analogously to threads (see above). A
resource is defined as a name and an integer value.
E.g. --resources gpu=1. Rules can use resources by
defining the resource keyword, e.g. resources: gpu=1.
If now two rules require 1 of the resource 'gpu' they
won't be run in parallel by the scheduler.
--config [KEY=VALUE [KEY=VALUE ...]], -C [KEY=VALUE [KEY=VALUE ...]]
Set or overwrite values in the workflow config object.
The workflow config object is accessible as variable
config inside the workflow. Default values can be set
by providing a JSON file (see Documentation).
--configfile FILE Specify or overwrite the config file of the workflow
(see the docs). Values specified in JSON or YAML
format are available in the global config dictionary
inside the workflow.
--list, -l Show availiable rules in given Snakefile.
--list-target-rules, --lt
Show available target rules in given Snakefile.
--directory DIR, -d DIR
Specify working directory (relative paths in the
snakefile will use this as their origin).
--dryrun, -n Do not execute anything.
--printshellcmds, -p Print out the shell commands that will be executed.
--debug-dag Print candidate and selected jobs (including their
wildcards) while inferring DAG. This can help to debug
unexpected DAG topology or errors.
--dag Do not execute anything and print the directed acyclic
graph of jobs in the dot language. Recommended use on
Unix systems: snakemake --dag | dot | display
--force-use-threads Force threads rather than processes. Helpful if shared
memory (/dev/shm) is full or unavailable.
--rulegraph Do not execute anything and print the dependency graph
of rules in the dot language. This will be less
crowded than above DAG of jobs, but also show less
information. Note that each rule is displayed once,
hence the displayed graph will be cyclic if a rule
appears in several steps of the workflow. Use this if
above option leads to a DAG that is too large.
Recommended use on Unix systems: snakemake --rulegraph
| dot | display
--d3dag Print the DAG in D3.js compatible JSON format.
--summary, -S Print a summary of all files created by the workflow.
The has the following columns: filename, modification
time, rule version, status, plan. Thereby rule version
contains the versionthe file was created with (see the
version keyword of rules), and status denotes whether
the file is missing, its input files are newer or if
version or implementation of the rule changed since
file creation. Finally the last column denotes whether
the file will be updated or created during the next
workflow execution.
--detailed-summary, -D
Print a summary of all files created by the workflow.
The has the following columns: filename, modification
time, rule version, input file(s), shell command,
status, plan. Thereby rule version contains the
versionthe file was created with (see the version
keyword of rules), and status denotes whether the file
is missing, its input files are newer or if version or
implementation of the rule changed since file
creation. The input file and shell command columns are
selfexplanatory. Finally the last column denotes
whether the file will be updated or created during the
next workflow execution.
--archive FILE Archive the workflow into the given tar archive FILE.
The archive will be created such that the workflow can
be re-executed on a vanilla system. The function needs
conda and git to be installed. It will archive every
file that is under git version control. Note that it
is best practice to have the Snakefile, config files,
and scripts under version control. Hence, they will be
included in the archive. Further, it will add input
files that are not generated by by the workflow itself
and conda environments. Note that symlinks are
dereferenced. Supported formats are .tar, .tar.gz,
.tar.bz2 and .tar.xz.
--touch, -t Touch output files (mark them up to date without
really changing them) instead of running their
commands. This is used to pretend that the rules were
executed, in order to fool future invocations of
snakemake. Fails if a file does not yet exist.
--keep-going, -k Go on with independent jobs if a job fails.
--force, -f Force the execution of the selected target or the
first rule regardless of already created output.
--forceall, -F Force the execution of the selected (or the first)
rule and all rules it is dependent on regardless of
already created output.
--forcerun [TARGET [TARGET ...]], -R [TARGET [TARGET ...]]
Force the re-execution or creation of the given rules
or files. Use this option if you changed a rule and
want to have all its output in your workflow updated.
--prioritize TARGET [TARGET ...], -P TARGET [TARGET ...]
Tell the scheduler to assign creation of given targets
(and all their dependencies) highest priority.
(EXPERIMENTAL)
--until TARGET [TARGET ...], -U TARGET [TARGET ...]
Runs the pipeline until it reaches the specified rules
or files. Only runs jobs that are dependencies of the
specified rule or files, does not run sibling DAGs.
--omit-from TARGET [TARGET ...], -O TARGET [TARGET ...]
Prevent the execution or creation of the given rules
or files as well as any rules or files that are
downstream of these targets in the DAG. Also runs jobs
in sibling DAGs that are independent of the rules or
files specified here.
--allow-ambiguity, -a
Don't check for ambiguous rules and simply use the
first if several can produce the same file. This
allows the user to prioritize rules by their order in
the snakefile.
--cluster CMD, -c CMD
Execute snakemake rules with the given submit command,
e.g. qsub. Snakemake compiles jobs into scripts that
are submitted to the cluster with the given command,
once all input files for a particular job are present.
The submit command can be decorated to make it aware
of certain job properties (input, output, params,
wildcards, log, threads and dependencies (see the
argument below)), e.g.: $ snakemake --cluster 'qsub
-pe threaded {threads}'.
--cluster-sync CMD cluster submission command will block, returning the
remote exitstatus upon remote termination (for
example, this should be usedif the cluster command is
'qsub -sync y' (SGE)
--drmaa [ARGS] Execute snakemake on a cluster accessed via DRMAA,
Snakemake compiles jobs into scripts that are
submitted to the cluster with the given command, once
all input files for a particular job are present. ARGS
can be used to specify options of the underlying
cluster system, thereby using the job properties
input, output, params, wildcards, log, threads and
dependencies, e.g.: --drmaa ' -pe threaded {threads}'.
Note that ARGS must be given in quotes and with a
leading whitespace.
--drmaa-log-dir DIR Specify a directory in which stdout and stderr files
of DRMAA jobs will be written. The value may be given
as a relative path, in which case Snakemake will use
the current invocation directory as the origin. If
given, this will override any given '-o' and/or '-e'
native specification. If not given, all DRMAA stdout
and stderr files are written to the current working
directory.
--cluster-config FILE, -u FILE
A JSON or YAML file that defines the wildcards used in
'cluster'for specific rules, instead of having them
specified in the Snakefile. For example, for rule
'job' you may define: { 'job' : { 'time' : '24:00:00'
} } to specify the time for rule 'job'. You can
specify more than one file. The configuration files
are merged with later values overriding earlier ones.
--immediate-submit, --is
Immediately submit all jobs to the cluster instead of
waiting for present input files. This will fail,
unless you make the cluster aware of job dependencies,
e.g. via: $ snakemake --cluster 'sbatch --dependency
{dependencies}. Assuming that your submit script (here
sbatch) outputs the generated job id to the first
stdout line, {dependencies} will be filled with space
separated job ids this job depends on.
--jobscript SCRIPT, --js SCRIPT
Provide a custom job script for submission to the
cluster. The default script resides as 'jobscript.sh'
in the installation directory.
--jobname NAME, --jn NAME
Provide a custom name for the jobscript that is
submitted to the cluster (see --cluster). NAME is
"snakejob.{rulename}.{jobid}.sh" per default. The
wildcard {jobid} has to be present in the name.
--cluster-status CLUSTER_STATUS
Status command for cluster execution. This is only
considered in combination with the --cluster flag. If
provided, Snakemake will use the status command to
determine if a job has finished successfully or
failed. For this it is necessary that the submit
command provided to --cluster returns the cluster job
id. Then, the status command will be invoked with the
job id. Snakemake expects it to return 'success' if
the job was successfull, 'failed' if the job failed
and 'running' if the job still runs.
--kubernetes [NAMESPACE]
Execute workflow in a kubernetes cluster (in the
cloud). NAMESPACE is the namespace you want to use for
your job (if nothing specified: 'default'). Usually,
this requires --default-remote-provider and --default-
remote-prefix to be set to a S3 or GS bucket where
your . data shall be stored. It is further advisable
to activate conda integration via --use-conda.
--kubernetes-env ENVVAR [ENVVAR ...]
Specify environment variables to pass to the
kubernetes job.
--container-image IMAGE
Docker image to use, e.g., when submitting jobs to
kubernetes. By default, this is
'quay.io/snakemake/snakemake', tagged with the same
version as the currently running Snakemake instance.
Note that overwriting this value is up to your
responsibility. Any used image has to contain a
working snakemake installation that is compatible with
(or ideally the same as) the currently running
version.
--reason, -r Print the reason for each executed rule.
--stats FILE Write stats about Snakefile execution in JSON format
to the given file.
--nocolor Do not use a colored output.
--quiet, -q Do not output any progress or rule information.
--nolock Do not lock the working directory
--unlock Remove a lock on the working directory.
--cleanup-metadata FILE [FILE ...], --cm FILE [FILE ...]
Cleanup the metadata of given files. That means that
snakemake removes any tracked version info, and any
marks that files are incomplete.
--rerun-incomplete, --ri
Re-run all jobs the output of which is recognized as
incomplete.
--ignore-incomplete, --ii
Do not check for incomplete output files.
--list-version-changes, --lv
List all output files that have been created with a
different version (as determined by the version
keyword).
--list-code-changes, --lc
List all output files for which the rule body (run or
shell) have changed in the Snakefile.
--list-input-changes, --li
List all output files for which the defined input
files have changed in the Snakefile (e.g. new input
files were added in the rule definition or files were
renamed). For listing input file modification in the
filesystem, use --summary.
--list-params-changes, --lp
List all output files for which the defined params
have changed in the Snakefile.
--latency-wait SECONDS, --output-wait SECONDS, -w SECONDS
Wait given seconds if an output file of a job is not
present after the job finished. This helps if your
filesystem suffers from latency (default 5).
--wait-for-files [FILE [FILE ...]]
Wait --latency-wait seconds for these files to be
present before executing the workflow. This option is
used internally to handle filesystem latency in
cluster environments.
--benchmark-repeats N
Repeat a job N times if marked for benchmarking
(default 1).
--notemp, --nt Ignore temp() declarations. This is useful when
running only a part of the workflow, since temp()
would lead to deletion of probably needed files by
other parts of the workflow.
--keep-remote Keep local copies of remote input files.
--keep-target-files Do not adjust the paths of given target files relative
to the working directory.
--keep-shadow Do not delete the shadow directory on snakemake
startup.
--allowed-rules ALLOWED_RULES [ALLOWED_RULES ...]
Only consider given rules. If omitted, all rules in
Snakefile are used. Note that this is intended
primarily for internal use and may lead to unexpected
results otherwise.
--max-jobs-per-second MAX_JOBS_PER_SECOND
Maximal number of cluster/drmaa jobs per second,
default is 10, fractions allowed.
--max-status-checks-per-second MAX_STATUS_CHECKS_PER_SECOND
Maximal number of job status checks per second,
default is 10, fractions allowed.
--restart-times RESTART_TIMES
Number of times to restart failing jobs (defaults to
0).
--attempt ATTEMPT Internal use only: define the initial value of the
attempt parameter (default: 1).
--timestamp, -T Add a timestamp to all logging output
--greediness GREEDINESS
Set the greediness of scheduling. This value between 0
and 1 determines how careful jobs are selected for
execution. The default value (1.0) provides the best
speed and still acceptable scheduling quality.
--no-hooks Do not invoke onstart, onsuccess or onerror hooks
after execution.
--print-compilation Print the python representation of the workflow.
--overwrite-shellcmd OVERWRITE_SHELLCMD
Provide a shell command that shall be executed instead
of those given in the workflow. This is for debugging
purposes only.
--verbose Print debugging output.
--debug Allow to debug rules with e.g. PDB. This flag allows
to set breakpoints in run blocks.
--runtime-profile FILE
Profile Snakemake and write the output to FILE. This
requires yappi to be installed.
--mode {0,1,2} Set execution mode of Snakemake (internal use only).
--bash-completion Output code to register bash completion for snakemake.
Put the following in your .bashrc (including the
accents): `snakemake --bash-completion` or issue it in
an open terminal session.
--use-conda If defined in the rule, run job in a conda
environment. If this flag is not set, the conda
directive is ignored.
--conda-prefix DIR Specify a directory in which the 'conda' and 'conda-
archive' directories are created. These are used to
store conda environments and their archives,
respectively. If not supplied, the value is set to the
'.snakemake' directory relative to the invocation
directory. If supplied, the `--use-conda` flag must
also be set. The value may be given as a relative
path, which will be extrapolated to the invocation
directory, or as an absolute path.
--create-envs-only If specified, only creates the job-specific conda
environments then exits. The `--use-conda` flag must
also be set.
--use-singularity If defined in the rule, run job within a singularity
container. If this flag is not set, the singularity
directive is ignored.
--singularity-prefix DIR
Specify a directory in which singularity images will
be stored.If not supplied, the value is set to the
'.snakemake' directory relative to the invocation
directory. If supplied, the `--use-singularity` flag
must also be set. The value may be given as a relative
path, which will be extrapolated to the invocation
directory, or as an absolute path.
--singularity-args ARGS
Pass additional args to singularity.
--wrapper-prefix WRAPPER_PREFIX
Prefix for URL created from wrapper directive
(default: https://bitbucket.org/snakemake/snakemake-
wrappers/raw/). Set this to a different URL to use
your fork or a local clone of the repository.
--default-remote-provider {S3,GS,FTP,SFTP,S3Mocked,gfal,gridftp}
Specify default remote provider to be used for all
input and output files that don't yet specify one.
--default-remote-prefix DEFAULT_REMOTE_PREFIX
Specify prefix for default remote provider. E.g. a
bucket name.
--no-shared-fs Do not assume that jobs share a common file system.
When this flag is activated, Snakemake will assume
that the filesystem on a cluster node is not shared
with other nodes. For example, this will lead to
downloading remote files on each cluster node
separately. Further, it won't take special measures to
deal with filesystem latency issues. This option will
in most cases only make sense in combination with
--default-remote-provider. Further, when using
--cluster you will have to also provide --cluster-
status. Only activate this if you know what you are
doing.
--version, -v show program's version number and exit
How to run
A run requires a config file like the one below, listing the database directory and the paths to the sequencing data (multiple samples can be listed, as shown after the example).
workdir: "/home/user/folder/results/"
dbdir: "/home/user/folder/databases/"
samples:
  sample_name_1:
    fq1: "/home/user/folder/reads/file.1.fq"
    fq2: "/home/user/folder/reads/file.2.fq"
Databases can be downloaded from the links in the MetaMeta GitHub repository.
Once the config file is filled in, run MetaMeta:
metameta --configfile yourconfig.yaml --use-conda --keep-going --cores 24
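Because metameta's command line is passed straight through to Snakemake (as the help output above shows), the usual Snakemake flags should also work here. For example, a dry run previews the planned jobs without executing them, and cluster submission should be possible via --cluster; both are untested in this post, and the qsub options need to be adjusted to your scheduler.
#Preview the jobs without executing anything
metameta --configfile yourconfig.yaml --use-conda --dryrun
#Submit jobs through an SGE-style scheduler
metameta --configfile yourconfig.yaml --use-conda --keep-going --cores 24 --cluster "qsub -pe threaded {threads}"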
Test run
#Move to the directory where metameta is installed. Since anaconda is managed under pyenv here, that was the following location.
cd /home/kazu/.pyenv/versions/anaconda3-4.0.0/opt/metameta/
metameta --configfile sampledata/sample_data_custom_viral.yaml --use-conda --keep-going --cores 6
A large number of errors appeared. There seemed to be conflicts between the tools, so I created a conda virtual environment and tried again (see the installation section above).
Integration of multiple results and the addition of further tools are also supported.
Citation
MetaMeta: integrating metagenome analysis tools to improve taxonomic profiling
Piro VC, Matschkowski M, Renard BY
Microbiome. 2017 Aug 14;5(1):101