CLEAN: a Nextflow pipeline for removing contamination from sequencing data

Updated 2025/02/14

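Many biological and medical questions are answered by analyzing sequence data. However, contamination, artificial spike-ins, and overrepresented rRNA sequences turn up in many read collections and assemblies. In particular, spike-ins used as controls, such as those known from Illumina and Nanopore data, are often not treated as contamination and are not properly removed during analysis. In addition, removing human host DNA may be necessary for data protection and ethical reasons, so that individuals cannot be identified.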

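The authors developed CLEAN, a pipeline that removes unwanted sequences from both long-read and short-read sequencing data. The pipeline focuses on Illumina and Nanopore data and their technology-specific control sequences, but it can also be used for host decontamination of metagenomic reads and assemblies and for removing rRNA from RNA-Seq data. The results are the cleaned sequences and the sequences identified as contaminated, with statistics summarized in a report.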

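The output can be used directly in downstream analyses, which speeds up computation and improves results. Decontamination may seem mundane, but many contaminants are routinely overlooked, or are cleaned in steps that are not fully reproducible or are hard to trace. CLEAN facilitates reproducible, platform-independent data analysis in genomics and transcriptomics, and is available under the BSD3 license at https://github.com/rki-mf1/clean.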

Installation

Dependencies

Nextflow plus one of the following (Docker is the default):

- Conda
- Mamba
- Docker
- Singularity
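
Before launching, it can help to check which of these engines is actually installed. The sketch below is only an illustration: `first_on_path` is a hypothetical helper (not part of CLEAN or Nextflow), and the preference order in the usage comment is an assumption. Note that finding `docker` on the PATH does not guarantee that the Docker daemon is running.

```shell
# Hypothetical helper: print the first argument that resolves to a command on PATH.
# Returns non-zero (and prints nothing) when none of the candidates is found.
first_on_path() {
  local tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf '%s\n' "$tool"
      return 0
    fi
  done
  return 1
}

# Possible use (profile names as listed in the CLEAN help text; order is an assumption):
#   profile=$(first_on_path docker singularity mamba conda) && echo "-profile $profile"
```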

GitHub

nextflow info rki-mf1/clean

> nextflow run rki-mf1/clean -r v1.1.0 --help

$ nextflow run rki-mf1/clean --help

Nextflow 24.10.4 is available - Please consider updating your version to it

N E X T F L O W ~ version 23.10.1

Launching `https://github.com/rki-mf1/clean` [small_ride] DSL2 - revision: 085a0a8e29 [main]

____________________________________________________________________________________________

Workflow: Decontamination

Clean your Illumina, Nanopore, PacBio or any FASTA-formated sequence date. The output are the clean

and as contaminated identified sequences. Per default minimap2 is used for aligning your sequences

to a host but we recommend using BWA for mapping short reads --bwa or the --bbduk flag

to switch to bbduk to clean short-read data.

Use the --host and --control flag to download a host database or specify your --own FASTA.

Usage example:

nextflow run rki-mf1/clean --input_type nano --input '*/*.fastq' --host eco --control dcs

nextflow run rki-mf1/clean --input_type illumina --input '*/*.R{1,2}.fastq' --own some_host.fasta --bbduk

nextflow run rki-mf1/clean --input_type illumina --input 'test/illumina*.R{1,2}.fastq.gz' --nano data/nanopore.fastq.gz --fasta data/assembly.fasta --host eco --control phix

Input:

--input_type nano --input '*.fasta' or '*.fastq.gz' -> one sample per file

--input_type pacbio --input '*.fasta' or '*.fastq.gz' -> one sample per file (for PacBio CLR reads)

--input_type illumina --input '*.R{1,2}.fastq.gz' -> file pairs

--input_type illumina_single_end --input '*.fastq.gz' -> one sample per file

--input_type fasta --input '*.fasta.gz' -> one sample per file

...read above input from csv files: --list

required format: name,path for --input_type nano, --input_type pacbio, and --input_type fasta; name,pathR1,pathR2 for --illumina input_type; name,path for --input_type illumina_single_end

Decontamination options:

--host Comma separated list of reference genomes for decontamination, downloaded based on this parameter [default: false]

Currently supported are:

- hsa [Ensembl: Homo_sapiens.GRCh38.dna.primary_assembly, incl. mtDNA]

- t2t [T2T Consortium: human genome w/ additional 200 Mbp, closed gaps, and more complete Y (T2T-CHM13+Yv2.0), incl. mtDNA]

- mmu [Ensembl: Mus_musculus.GRCm38.dna.primary_assembly, incl. mtDNA]

- csa [NCBI: GCF_000409795.2_Chlorocebus_sabeus_1.1_genomic, incl. mtDNA]

- gga [NCBI: Gallus_gallus.GRCg6a.dna.toplevel, incl. mtDNA]

- cli [NCBI: GCF_000337935.1_Cliv_1.0_genomic, incl. mtDNA]

- eco [Ensembl: Escherichia_coli_k_12.ASM80076v1.dna.toplevel]

- sc2 [ENA: MN908947.3 (Wuhan-Hu-1 complete genome)]

--control Comma separated list of common controls used in Illumina or Nanopore sequencing [default: false]

Currently supported are:

- phix [Illumina: enterobacteria_phage_phix174_sensu_lato_uid14015, NC_001422]

- dcs [ONT DNA-Seq: a positive control (3.6 kb standard amplicon mapping the 3' end of the Lambda genome)]

- eno [ONT RNA-Seq: a positive control (yeast ENO2 Enolase II of strain S288C, YHR174W)]

--own Use your own FASTA sequences (comma separated list of files) for decontamination, e.g. host.fasta.gz,spike.fasta [default: false]

--keep Use your own FASTA sequences (comma separated list of files) to explicitly keep mapped reads, e.g. target.fasta.gz,important.fasta [default: false]

Reads are assigned to a combined index for decontamination and keeping. The use of this parameter can prevent

false positive hits and the accidental removal of reads due to (poor quality) mappings.

--rm_rrna Clean your data from rRNA [default: false]

--bwa Add this flag to use BAW MEM instead of minimap2 for decontamination of short reads [default: false]

--bbduk Add this flag to use bbduk instead of minimap2 for decontamination of short reads [default: false]

--bbduk_kmer Set kmer for bbduk [default: 27]

--bbduk_qin Set quality ASCII encoding for bbduk [default: auto; options are: 64, 33, auto]

--reads_rna Add this flag for noisy direct RNA-Seq Nanopore data [default: false]

--min_clip Filter mapped reads by soft-clipped length (left + right). If >= 1 total number; if < 1 relative to read length

--dcs_strict Filter out alignments that cover artificial ends of the ONT DCS to discriminate between Lambda Phage and DCS

--skip_qc Skip quality control steps (fastqc, nanoplot, multiqc, etc.) [default: false]

Compute options:

--cores Max cores per process for local use [default 8]

--max_cores Max cores used on the machine for local use [default 24]

--memory Max memory for local use, enter in this format '8.GB' [default: 8 GB]

--output Name of the result folder [default: results]

Nextflow options:

-with-report rep.html CPU / RAM usage (may cause errors)

-with-dag chart.html Generates a flowchart for the process tree

-with-timeline time.html Timeline (may cause errors)

Computing:

In particular for execution of the workflow on a HPC (LSF, SLURM) adjust the following parameters:

--databases Defines the path where databases are stored [default: nextflow-clean-autodownload]

--condaCacheDir Defines the path where environments (conda) are cached [default: conda]

--singularityCacheDir Defines the path where images (singularity) are cached [default: singularity]

Miscellaneous:

--cleanup_work_dir Deletes all files in the work directory after a successful completion of a run [default: false]

warning: if true, the option will prevent the use of the resume feature!

--no_intermediate Do not save intermediate .bam/fastq/etc files into the `results/intermediate/` directory [default: false]

Saves a lot of disk space, especially if used with the `--cleanup_work_dir` argument.

Profile:

You can merge different profiles for different setups, e.g.

-profile local,docker

-profile lsf,singularity

-profile slurm,singularity

-profile standard (local,docker) [default]

local

lsf

slurm

docker

singularity

conda

mamba

gcloud (use this as template for your own GCP setup)
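
The --min_clip description in the help text above has two modes: a value of 1 or more is an absolute number of soft-clipped bases, and a value below 1 is a fraction of the read length. The arithmetic can be sketched as follows. This is not CLEAN's implementation: `exceeds_min_clip` is a hypothetical function, and treating the threshold as a strict "greater than" comparison is an assumption of this sketch.

```shell
# Hypothetical illustration of the --min_clip threshold arithmetic.
# Usage: exceeds_min_clip MIN_CLIP READ_LENGTH CLIPPED_BASES
# Exit status 0 if the soft-clipped length (left + right) exceeds the limit, 1 otherwise.
exceeds_min_clip() {
  awk -v t="$1" -v len="$2" -v clip="$3" 'BEGIN {
    # t >= 1: absolute number of bases; t < 1: fraction of the read length
    limit = (t >= 1) ? t : t * len
    # Assumption: strict comparison; the real pipeline may use >= instead
    exit (clip > limit) ? 0 : 1
  }'
}
```

For example, with a threshold of 0.5 and a 1000-base read, the limit is 500 clipped bases; with a threshold of 100, the limit is 100 bases regardless of read length.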

Usage

Read input is specified with --input_type nano for Nanopore, --input_type pacbio for PacBio CLR, and --input_type illumina or --input_type illumina_single_end for Illumina. Additional control sequences to remove can be defined with --control.
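
The help text also mentions reading input from a CSV file (--list), with the format name,pathR1,pathR2 for paired Illumina reads. A minimal sketch for generating such lines from a directory of paired files is shown here. `make_pairs_csv` is a hypothetical helper, the `.R1.fastq.gz` / `.R2.fastq.gz` naming is assumed from the help examples, and writing no header line is also an assumption; check the pipeline documentation for how the CSV is passed on the command line.

```shell
# Hypothetical helper: print one "name,pathR1,pathR2" line per read pair in a directory.
# Assumes files are named <name>.R1.fastq.gz and <name>.R2.fastq.gz.
make_pairs_csv() {
  local dir=$1 r1 r2 name
  for r1 in "$dir"/*.R1.fastq.gz; do
    [ -e "$r1" ] || continue              # glob matched nothing
    r2=${r1%.R1.fastq.gz}.R2.fastq.gz     # expected mate file
    if [ ! -e "$r2" ]; then
      echo "warning: no R2 mate for $r1, skipping" >&2
      continue
    fi
    name=$(basename "$r1" .R1.fastq.gz)   # sample name without the read suffix
    printf '%s,%s,%s\n' "$name" "$r1" "$r2"
  done
}

# Possible use: make_pairs_csv ./reads > samples.csv
```

Samples without a mate file are reported on stderr and left out, so an incomplete pair does not silently end up in the sheet.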

Test run

#docker
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
--host eco --control dcs

#mamba
nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
--host eco --control dcs -profile mamba

$ nextflow run rki-mf1/clean -r v1.1.0 --input_type nano --input ~/.nextflow/assets/rki-mf1/clean/test/nanopore.fastq.gz \
--host eco --control dcs

Nextflow 24.10.4 is available - Please consider updating your version to it

N E X T F L O W ~ version 23.10.1

Launching `https://github.com/rki-mf1/clean` [awesome_lamarck] DSL2 - revision: d02998c570 [v1.1.0]

Profile: standard

executor > local (21)

[b7/5c593d] process > prepare_contamination:prepare_auto_host:download_host (1) [100%] 1 of 1 ✔

[15/e78643] process > prepare_contamination:concat_contamination [100%] 1 of 1 ✔

[ee/9fbb65] process > clean:minimap2 (1) [100%] 1 of 1 ✔

[d4/c445b8] process > clean:sort_bam (1) [100%] 1 of 1 ✔

[93/f389d1] process > clean:index_bam (1) [100%] 1 of 1 ✔

[2f/7f9928] process > clean:idxstats_from_bam (1) [100%] 1 of 1 ✔

[8d/6e3601] process > clean:flagstats_from_bam (1) [100%] 1 of 1 ✔

[1a/235c97] process > clean:split_bam (1) [100%] 1 of 1 ✔

[4b/3d6a01] process > clean:index_bam2 (2) [100%] 2 of 2 ✔

[af/8e4ac2] process > clean:fastq_from_bam (2) [100%] 2 of 2 ✔

[27/119473] process > summarize:BAM_STATISTICS (1) [100%] 1 of 1 ✔

[71/753ebb] process > summarize:COMBINE_BAM_STATISTICS (combine bam statistics files) [100%] 1 of 1 ✔

[32/42e5e8] process > qc:nanoplot (3) [100%] 3 of 3 ✔

[a0/6abf6b] process > qc:format_nanoplot_report (2) [100%] 3 of 3 ✔

[f5/7229df] process > qc:multiqc (1) [100%] 1 of 1 ✔

Completed at: 10-Feb-2025 06:47:36

Duration : 7m 12s

CPU hours : 0.1

Succeeded : 21

Output

qc/multiqc_report.html

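In addition to the input reads, the same statistics and plots are reported for the reads that mapped to the contaminant sequences and for the reads that did not map to them.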

Removing human contamination from paired-end Illumina reads:

nextflow run rki-mf1/clean -r v1.1.0 --input_type illumina --input './*R{1,2}.fastq.gz' -profile docker --host hsa

From the repository

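CLEAN can clean Illumina, Nanopore, PacBio CLR, or any FASTA-formatted sequence data.
By default, minimap2 is used to align the input against the reference sequences (preset map-ont for Nanopore data, map-pb for PacBio CLR data, and sr for short-read data). For short-read data, you can switch to BWA (--bwa) instead.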
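
The mapping from input type to minimap2 preset described above can be written out as a small lookup. This is an illustrative sketch, not code from the pipeline: `minimap2_preset` is a hypothetical function, the input type names come from the help text, and the preset names are minimap2's own (`map-ont`, `map-pb`, `sr`). The `fasta` input type is left out because the text does not say which preset it uses.

```shell
# Hypothetical lookup: CLEAN --input_type value -> minimap2 preset (-x) named in the text.
minimap2_preset() {
  case "$1" in
    nano)                          echo map-ont ;;  # Nanopore reads
    pacbio)                        echo map-pb ;;   # PacBio CLR reads
    illumina|illumina_single_end)  echo sr ;;       # short reads
    *)
      echo "no preset known for input type: $1" >&2
      return 1
      ;;
  esac
}

# Possible use: echo "minimap2 -x $(minimap2_preset nano) ..."
```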

Citation

Targeted decontamination of sequencing data with CLEAN

Marie Lataretu, Sebastian Krautwurst, Matthew R. Huska, Mike Marquet, Adrian Viehweger, Sascha D. Braun, Christian Brandt, Martin Hölzer

bioRxiv, Posted January 24, 2025.