2023/03/02 Paper abstract
Analyzing shotgun metagenomic data provides valuable insights into microbial communities while enabling resolution at the level of individual genomes. When no complete reference genomes are available, metagenome-assembled genomes (MAGs) must be reconstructed from the sequencing reads. Here we present the nf-core/mag pipeline, which performs metagenome assembly, binning, and taxonomic classification. nf-core/mag can combine short and long reads to increase assembly contiguity, and can use per-sample group information for co-assembly and genome binning. The pipeline is easy to install, with all dependencies provided inside containers, making it portable and reproducible. It is written in Nextflow and developed as part of the nf-core initiative for best-practice pipeline development. All code is hosted in the nf-core organization on GitHub (https://github.com/nf-core/mag) and released under the MIT license.
usage
From the GitHub README:
By default, the pipeline runs the following analyses. Both short and long reads are supported.
1. Quality- and adapter-trim the reads with fastp and Porechop, and run basic QC with FastQC.
2. Assign taxonomy to reads with Centrifuge and/or Kraken2.
3. Assemble with MEGAHIT and SPAdes, and check assembly quality with QUAST.
4. Bin contigs with MetaBAT2 and check the quality of the genome bins with BUSCO.
5. Assign taxonomy to bins with GTDB-Tk and CAT.
6. Write to the specified results directory, among others, a MultiQC report summarizing some of the results and the software versions used.
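The default workflow above is launched with a single command. A hypothetical sketch of such a run (all paths are placeholders; --kraken2_db and --gtdb_db are pipeline parameters listed in the help output later in this post):

```shell
# Sketch of a default run with optional databases supplied up front.
# /path/to/... values are placeholders, not real locations.
nextflow run nf-core/mag \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --kraken2_db /path/to/kraken2_db \
    --gtdb_db /path/to/gtdbtk_db
```

If the database parameters are omitted, steps that require them either download the database automatically (GTDB-Tk) or are skipped.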
2023/03/02
Pipeline release! nf-core/mag v2.3.0 (Assembly and binning of metagenomes)
— nf-core (@nf_core) March 2, 2023
See the changelog: https://t.co/RaSuR1r8G0
Installation
Dependencies
- Nextflow (>=21.04.0)
help
> nextflow run nf-core/mag --help --show_hidden
N E X T F L O W ~ version 25.04.7
Launching `https://github.com/nf-core/mag` [elegant_tuckerman] DSL2 - revision: 7ffd8b8c65 [main]
------------------------------------------------------
,--./,-.
___ __ __ __ ___ /,-._.--~'
|\ | |__ __ / ` / \ |__) |__ } {
| \| | \__, \__/ | \ |___ \`-._,-`-,
`._,._,'
nf-core/mag 4.0.0
------------------------------------------------------
Typical pipeline command:
nextflow run nf-core/mag -profile <docker/singularity/.../institute> --input samplesheet.csv --outdir <OUTDIR>
--help [boolean, string] Show the help message for all top level parameters. When a parameter is given to `--help`, the full help message of that parameter will be printed.
--help_full [boolean] Show the help message for all non-hidden parameters.
--show_hidden [boolean] Show all hidden parameters in the help message. This needs to be used in combination with `--help` or `--help_full`.
Input/output options
--input [string] CSV samplesheet file containing information about the samples in the experiment.
--single_end [boolean] Specifies that the input is single-end reads.
--assembly_input [string] Additional input CSV samplesheet containing information about pre-computed assemblies. When set, both read pre-processing and assembly are skipped and the pipeline begins at the binning stage.
--outdir [string] The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.
--email [string] Email address for completion summary.
--multiqc_title [string] MultiQC report title. Printed as page header, used for filename if not otherwise specified.
Reference genome options
--igenomes_ignore [boolean] Do not load the iGenomes reference config.
--igenomes_base [string] The base path to the igenomes reference files [default: s3://ngi-igenomes/igenomes/]
Institutional config options
--custom_config_version [string] Git commit id for Institutional configs. [default: master]
--custom_config_base [string] Base directory for Institutional configs. [default: https://raw.githubusercontent.com/nf-core/configs/master]
--config_profile_name [string] Institutional config name.
--config_profile_description [string] Institutional config description.
--config_profile_contact [string] Institutional config contact information.
--config_profile_url [string] Institutional config URL link.
Generic options
--version [boolean] Display version and exit.
--publish_dir_mode [string] Method used to save pipeline results to output directory. (accepted: symlink, rellink, link, copy, copyNoFollow, move) [default: copy]
--monochrome_logs [boolean] Use monochrome_logs
--email_on_fail [string] Email address for completion summary, only when pipeline fails.
--plaintext_email [boolean] Send plain-text email instead of HTML.
--max_multiqc_email_size [string] File size limit when attaching MultiQC reports to summary emails. [default: 25.MB]
--hook_url [string] Incoming hook URL for messaging service
--multiqc_config [string] Custom config file to supply to MultiQC.
--multiqc_logo [string] Custom logo file to supply to MultiQC. File name must also be set in the MultiQC config file
--multiqc_methods_description [string] Custom MultiQC yaml file containing HTML including a methods description.
--validate_params [boolean] Boolean whether to validate parameters against the schema at runtime [default: true]
--pipelines_testdata_base_path [string] Base URL or local path to location of pipeline test dataset files [default: https://raw.githubusercontent.com/nf-core/test-datasets/]
--trace_report_suffix [string] Suffix to add to the trace report filename. Default is the date and time in the format yyyy-MM-dd_HH-mm-ss.
Reproducibility options
--megahit_fix_cpu_1 [boolean] Fix number of CPUs for MEGAHIT to 1. Not increased with retries.
--spades_fix_cpus [integer] Fix number of CPUs used by SPAdes. Not increased with retries. [default: -1]
--spadeshybrid_fix_cpus [integer] Fix number of CPUs used by SPAdes hybrid. Not increased with retries. [default: -1]
--metabat_rng_seed [integer] RNG seed for MetaBAT2. [default: 1]
Quality control for short reads options
--clip_tool [string] Specify which adapter clipping tool to use. (accepted: fastp, adapterremoval, trimmomatic) [default: fastp]
--save_clipped_reads [boolean] Specify to save the resulting clipped FASTQ files to --outdir.
--reads_minlength [integer] The minimum length reads must have to be retained for downstream analysis. [default: 15]
--fastp_qualified_quality [integer] Minimum phred quality value of a base to be qualified in fastp. [default: 15]
--fastp_cut_mean_quality [integer] The mean quality requirement used for per read sliding window cutting by fastp. [default: 15]
--fastp_save_trimmed_fail [boolean] Save reads that fail fastp filtering in a separate file. Not used downstream.
--fastp_trim_polyg [boolean] Turn on detecting and trimming of poly-G tails
--adapterremoval_minquality [integer] The minimum base quality for low-quality base trimming by AdapterRemoval. [default: 2]
--adapterremoval_trim_quality_stretch [boolean] Turn on quality trimming by consecutive stretch of low quality bases, rather than by window.
--adapterremoval_adapter1 [string] Forward read adapter to be trimmed by AdapterRemoval. [default: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG]
--adapterremoval_adapter2 [string] Reverse read adapter to be trimmed by AdapterRemoval for paired end data. [default: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT]
--host_genome [string] Name of iGenomes reference for host contamination removal.
--host_fasta [string] Fasta reference file for host contamination removal.
--host_fasta_bowtie2index [string] Bowtie2 index directory corresponding to `--host_fasta` reference file for host contamination removal.
--host_removal_verysensitive [boolean] Use the `--very-sensitive` instead of the `--sensitive` setting for Bowtie 2 to map reads against the host genome.
--host_removal_save_ids [boolean] Save the read IDs of removed host reads.
--save_hostremoved_reads [boolean] Specify to save input FASTQ files with host reads removed to --outdir.
--keep_phix [boolean] Keep reads similar to the Illumina internal standard PhiX genome.
--phix_reference [string] Genome reference used to remove Illumina PhiX contaminant reads. [default: ${baseDir}/assets/data/GCA_002596845.1_ASM259684v1_genomic.fna.gz]
--skip_clipping [boolean] Skip read preprocessing using fastp or adapterremoval.
--save_phixremoved_reads [boolean] Specify to save input FASTQ files with phiX reads removed to --outdir.
--bbnorm [boolean] Run BBnorm to normalize sequence depth.
--bbnorm_target [integer] Set BBnorm target maximum depth to this number. [default: 100]
--bbnorm_min [integer] Set BBnorm minimum depth to this number. [default: 5]
--save_bbnorm_reads [boolean] Save normalized read files to output directory.
Quality control for long reads options
--skip_adapter_trimming [boolean] Skip removing adapter sequences from long reads.
--longreads_min_length [integer] Discard any read which is shorter than this value. [default: 1000]
--longreads_min_quality [integer] Discard any read which has a mean quality score lower than this value.
--longreads_keep_percent [integer] Keep this percent of bases. [default: 90]
--longreads_length_weight [integer] The higher this value, the more important read length is when choosing the best reads. [default: 10]
--keep_lambda [boolean] Keep reads similar to the ONT internal standard Escherichia virus Lambda genome.
--lambda_reference [string] Genome reference used to remove ONT Lambda contaminant reads. [default: ${baseDir}/assets/data/GCA_000840245.1_ViralProj14204_genomic.fna.gz]
--save_lambdaremoved_reads [boolean] Specify to save input FASTQ files with lambda reads removed to --outdir.
--save_porechop_reads [boolean] Specify to save the resulting clipped FASTQ files to --outdir.
--save_filtered_longreads [boolean] Specify to save the resulting length filtered long read FASTQ files to --outdir.
--longread_adaptertrimming_tool [string] Specify which long read adapter trimming tool to use. (accepted: porechop, porechop_abi) [default: porechop_abi]
--longread_filtering_tool [string] Specify which long read filtering tool to use. (accepted: filtlong, nanoq, chopper) [default: filtlong]
Taxonomic profiling options
--centrifuge_db [string] Database for taxonomic binning with centrifuge.
--kraken2_db [string] Database for taxonomic binning with kraken2.
--krona_db [string] Database for taxonomic binning with krona
--skip_krona [boolean] Skip creating a krona plot for taxonomic binning.
--cat_db [string] Database for taxonomic classification of metagenome assembled genomes. Can be either a zipped file or a directory containing the extracted output of such.
--cat_db_generate [boolean] Generate CAT database.
--save_cat_db [boolean] Save the CAT database generated when specified by `--cat_db_generate`.
--cat_official_taxonomy [boolean] Only return official taxonomic ranks (Kingdom, Phylum, etc.) when running CAT.
--skip_gtdbtk [boolean] Skip running GTDB-Tk, as well as the automatic download of its database
--gtdb_db [string] Specify the location of a GTDBTK database. Can be either an uncompressed directory or a `.tar.gz` archive. If not specified will be downloaded for you when GTDBTK or binning QC is not skipped. [default:
--gtdb_mash [string] Specify the location of a GTDBTK mash database. If missing, GTDB-Tk will skip the ani_screening step
--gtdbtk_min_completeness [number] Min. bin completeness (in %) required to apply GTDB-tk classification. [default: 50]
--gtdbtk_max_contamination [number] Max. bin contamination (in %) allowed to apply GTDB-tk classification. [default: 10]
--gtdbtk_min_perc_aa [number] Min. fraction of AA (in %) in the MSA for bins to be kept. [default: 10]
--gtdbtk_min_af [number] Min. alignment fraction to consider closest genome. [default: 0.65]
--gtdbtk_pplacer_cpus [integer] Number of CPUs used by pplacer during the GTDB-Tk run. [default: 1]
--gtdbtk_pplacer_useram [boolean] Speed up pplacer step of GTDB-Tk by loading to memory.
Assembly options
--coassemble_group [boolean] Co-assemble samples within one group, instead of assembling each sample separately.
--spades_options [string] Additional custom options for SPAdes and SPAdesHybrid. Do not specify `--meta` as this will be added for you!
--spades_downstreaminput [string] Specify whether to use contigs or scaffolds assembled by SPAdes (accepted: scaffolds, contigs) [default: scaffolds]
--megahit_options [string] Additional custom options for MEGAHIT.
--skip_spades [boolean] Skip Illumina-only SPAdes assembly.
--skip_spadeshybrid [boolean] Skip SPAdes hybrid assembly.
--skip_megahit [boolean] Skip MEGAHIT assembly.
--skip_quast [boolean] Skip metaQUAST.
Gene prediction and annotation options
--skip_prodigal [boolean] Skip Prodigal gene prediction
--prokka_with_compliance [boolean] Turn on Prokka compliance mode for truncating contig names for NCBI/ENA compatibility.
--prokka_compliance_centre [string] Specify sequencing centre name required for Prokka's compliance mode.
--skip_prokka [boolean] Skip Prokka genome annotation.
--skip_metaeuk [boolean] Skip MetaEuk gene prediction and annotation
--metaeuk_mmseqs_db [string] A string containing the name of one of the databases listed in the [mmseqs2 documentation](https://github.com/soedinglab/MMseqs2/wiki#downloading-databases). This database will be downloaded and formatted for eukaryotic genome annotation. Incompatible with --metaeuk_db. (accepted: UniRef100, UniRef90, UniRef50, UniProtKB, UniProtKB/TrEMBL, UniProtKB/Swiss-Prot, NR, NT, GTDB, PDB, PDB70, Pfam-A.full, Pfam-A.seed, Pfam-B, CDD, eggNOG, VOGDB, dbCAN2, SILVA, Resfinder, Kalamari)
--metaeuk_db [string] Path to either a local fasta file of protein sequences, or to a directory containing an MMseqs2-formatted database, for annotation of eukaryotic genomes.
--save_mmseqs_db [boolean] Save the downloaded mmseqs2 database specified in `--metaeuk_mmseqs_db`.
Virus identification options
--run_virus_identification [boolean] Run virus identification.
--genomad_db [string] Database for virus classification with geNomad
--genomad_min_score [number] Minimum geNomad score for a sequence to be considered viral [default: 0.7]
--genomad_splits [integer] Number of groups that geNomad's MMSeqs2 database should be split into (reduces memory requirements) [default: 1]
Binning options
--binning_map_mode [string] Defines mapping strategy to compute co-abundances for binning, i.e. which samples will be mapped against the assembly. (accepted: all, group, own) [default: group]
--skip_binning [boolean] Skip metagenome binning entirely
--skip_metabat2 [boolean] Skip MetaBAT2 Binning
--skip_maxbin2 [boolean] Skip MaxBin2 Binning
--skip_concoct [boolean] Skip CONCOCT Binning
--min_contig_size [integer] Minimum contig size to be considered for binning and for bin quality check. [default: 1500]
--min_length_unbinned_contigs [integer] Minimal length of contigs that are not part of any bin but treated as individual genome. [default: 1000000]
--max_unbinned_contigs [integer] Maximal number of contigs that are not part of any bin but treated as individual genome. [default: 100]
--bin_min_size [integer] Specify the shortest length a bin should be to retain for downstream processing (in base pairs) [default: 0]
--bin_max_size [integer] Specify the longest length a bin should be to retain for downstream processing (in base pairs). By default no limit.
--bin_concoct_chunksize [integer] Specify length of sub-contigs cut up prior to CONCOCT binning [default: 10000]
--bin_concoct_overlap [integer] Specify the overlap between each sub-contig prior to CONCOCT binning [default: 0]
--bin_concoct_donotconcatlast [boolean] Specify to not append the last contig, if shorter than the sub-contig length, to the previous full-length sub-contig
--bowtie2_mode [string] Specify alternative Bowtie2 settings for aligning reads back against the assembly.
--save_assembly_mapped_reads [boolean] Save the output of mapping raw reads back to assembled contigs
--bin_domain_classification [boolean] Enable domain-level (prokaryote or eukaryote) classification of bins using Tiara. Processes which are domain-specific will then only receive bins matching the domain requirement.
--bin_domain_classification_tool [string] Specify which tool to use for domain classification of bins. Currently only 'tiara' is implemented. [default: tiara]
--tiara_min_length [integer] Minimum contig length for Tiara to use for domain classification. For accurate classification, should be longer than 3000 bp. [default: 3000]
--exclude_unbins_from_postbinning [boolean] Exclude unbinned contigs in the post-binning steps (bin QC, taxonomic classification, and annotation steps).
Bin quality check options
--skip_binqc [boolean] Disable bin QC with BUSCO, CheckM or CheckM2.
--binqc_tool [string] Specify which tool for bin quality-control validation to use. (accepted: busco, checkm, checkm2) [default: busco]
--busco_db [string] Download URL, local tar.gz archive, or local uncompressed directory for an *_odb10 or *_odb12 BUSCO lineage dataset.
--busco_db_lineage [string] Name of the BUSCO *_odb10 or *_odb12 lineage to check against. Additionally supports 'auto', 'auto_prok' and 'auto_euk' for automatic lineage selection mode. [default: auto]
--save_busco_db [boolean] Save the used BUSCO lineage datasets provided via `--busco_db`.
--busco_clean [boolean] Enable clean-up of temporary files created during BUSCO runs.
--checkm_download_url [string] URL pointing to checkM database for auto download, if local path not supplied. [default: https://zenodo.org/records/7401545/files/checkm_data_2015_01_16.tar.gz]
--checkm_db [string] Path to local folder containing already downloaded and uncompressed CheckM database.
--save_checkm_data [boolean] Save the used CheckM reference files downloaded when not using --checkm_db parameter.
--checkm2_db [string] Path to local file of an already downloaded and uncompressed CheckM2 database (.dmnd file).
--checkm2_db_version [integer] CheckM2 database version number to download (Zenodo record ID, for reference check the canonical reference https://zenodo.org/records/5571251, and pick the Zenodo ID of the database version of your choice). [default: 14897628]
--save_checkm2_data [boolean] Save the used CheckM2 reference files downloaded when not using --checkm2_db parameter.
--refine_bins_dastool [boolean] Turn on bin refinement using DAS Tool.
--refine_bins_dastool_threshold [number] Specify single-copy gene score threshold for bin refinement. [default: 0.5]
--postbinning_input [string] Specify which binning output is sent for downstream annotation, taxonomic classification, bin quality control etc. (accepted: raw_bins_only, refined_bins_only, both) [default: raw_bins_only]
--run_gunc [boolean] Turn on GUNC genome chimerism checks
--gunc_db [string] Specify a path to a pre-downloaded GUNC dmnd database file
--gunc_database_type [string] Specify which database to auto-download if not supplying own (accepted: progenomes, gtdb) [default: progenomes]
--gunc_save_db [boolean] Save the used GUNC reference files downloaded when not using --gunc_db parameter.
Ancient DNA assembly
--ancient_dna [boolean] Turn on/off the ancient DNA subworkflow
--pydamage_accuracy [number] PyDamage accuracy threshold [default: 0.5]
--skip_ancient_damagecorrection [boolean] Deactivate damage correction of ancient contigs using variant and consensus calling
--freebayes_ploidy [integer] Ploidy for variant calling [default: 1]
--freebayes_min_basequality [integer] Minimum base quality required for variant calling [default: 20]
--freebayes_minallelefreq [number] Minimum minor allele frequency for considering variants [default: 0.33]
--bcftools_view_high_variant_quality [integer] Minimum genotype quality for considering a variant high quality [default: 30]
--bcftools_view_medium_variant_quality [integer] Minimum genotype quality for considering a variant medium quality [default: 20]
--bcftools_view_minimal_allelesupport [integer] Minimum number of bases supporting the alternative allele [default: 3]
------------------------------------------------------
* The pipeline
https://doi.org/10.1093/nargab/lqac007
* The nf-core framework
https://doi.org/10.1038/s41587-020-0439-x
* Software dependencies
https://github.com/nf-core/mag/blob/main/CITATIONS.md
Test run
The pipeline supports conda, Docker, Singularity, Shifter, Podman (a Docker-compatible container engine), Charliecloud, and others.
#docker
nextflow run nf-core/mag -profile test,docker
#conda
nextflow run nf-core/mag -profile test,conda
The steps are executed in order. Even the test run takes some time.
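If a test run is interrupted partway through, it can be restarted without redoing completed steps using Nextflow's standard -resume option (a core Nextflow flag, not specific to this pipeline):

```shell
# Re-run the test profile, reusing any cached task results
nextflow run nf-core/mag -profile test,docker -resume
```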

Output
The results directory contains, among others, the following (screenshots in the original post):
- Taxonomy
- Assembly
- Genome Binning (e.g. MEGAHIT-test_minigut-binDepths.heatmap.png, SPAdes-test_minigut-binDepths.heatmap.png)
- Genome Binning/QC
- multiqc
For a real run, specify a profile together with either the paths to the FastQ files, or a CSV samplesheet listing the sample names and FastQ paths.
#docker
nextflow run nf-core/mag -profile docker --input '*_R{1,2}.fastq.gz'
#samplesheet.csv
nextflow run nf-core/mag -profile docker --input samplesheet.csv
The samplesheet is comma-separated with up to five columns; the header must be sample,group,short_reads_1,short_reads_2,long_reads.
sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz,data/sample1.fastq.gz
sample2,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,data/sample2.fastq.gz
sample3,1,data/sample3_R1.fastq.gz,data/sample3_R2.fastq.gz,
Sample IDs must be unique. The group information in the second column is used only for computing co-abundances in the binning step, not for co-assembly; for co-assembly, use the --coassemble_group option. The FastQ files given in the third and later columns must be gzip-compressed (.fastq.gz, .fq.gz). Long reads, if present, can only be provided in combination with paired-end short-read data. Single-end and paired-end reads cannot be mixed within a single samplesheet. When providing single-end reads, also set the command-line parameter --single_end.
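As a sanity check before launching, the samplesheet rules above (exactly five columns per row, unique sample IDs) can be verified with a small shell snippet. This is an illustrative sketch, not part of the pipeline; the file names and data paths are placeholders:

```shell
# Create a minimal example samplesheet (paths are placeholders)
cat > samplesheet.csv <<'EOF'
sample,group,short_reads_1,short_reads_2,long_reads
sample1,0,data/sample1_R1.fastq.gz,data/sample1_R2.fastq.gz,data/sample1.fastq.gz
sample2,0,data/sample2_R1.fastq.gz,data/sample2_R2.fastq.gz,
EOF

# Every row must have exactly 5 comma-separated fields,
# and the sample IDs in column 1 must be unique.
awk -F',' 'NF != 5 { bad=1 } seen[$1]++ { dup=1 }
           END { exit (bad || dup) }' samplesheet.csv && echo "samplesheet OK"
```

Note that a trailing comma (empty long_reads field, as for sample2) still counts as five fields, matching the samplesheet example above.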
Single-end input, assembling with MEGAHIT only and skipping the unneeded steps:
nextflow run nf-core/mag \
-profile docker \
--input samplesheet.csv \
--outdir results_dir \
--single_end \
--skip_gtdbtk \
--skip_quast \
--skip_prodigal --skip_prokka --skip_metaeuk \
--skip_spades \
--skip_spadeshybrid \
--skip_krona
With this example, only a subset of the steps is executed (highlighted in yellow in the figure in the original post).

Citation
nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning
Sabrina Krakau, Daniel Straub, Hadrien Gourlé, Gisela Gabernet, Sven Nahnsen
bioRxiv, posted August 31, 2021
2023/01
nf-core/mag: a best-practice pipeline for metagenome hybrid assembly and binning
Sabrina Krakau, Daniel Straub, Hadrien Gourlé, Gisela Gabernet, Sven Nahnsen
NAR Genom Bioinform. 2022 Mar; 4(1): lqac007
See also
file:///Users/kazu/Downloads/IPSJ-BIO18054047.pdf