polishCLR: a Nextflow workflow for polishing PacBio CLR genome assemblies

2023/08/23: added the paper citation

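Long-read sequencing has transformed genome assembly by yielding highly contiguous, chromosome-level contigs. However, assemblies from some third-generation long-read technologies, such as Pacific Biosciences (PacBio) Continuous Long Reads (CLR), have a high error rate. Such errors can be corrected with short reads through a process called polishing. The Vertebrate Genome Project (VGP) Assembly community recently described best practices for polishing de novo genome assemblies of non-model organisms, but there is still a need for a publicly available, reproducible workflow that can be easily implemented and run in a conventional high-performance computing environment. This paper describes polishCLR (https://github.com/isugifNF/polishCLR), a reproducible Nextflow workflow that implements best practices for polishing assemblies made from CLR data. polishCLR can be started from several input options that extend the best practices to suboptimal cases. It also provides re-entry points at several key processes, including identifying duplicate haplotypes with purge_dups, a break for scaffolding when data are available, and multiple rounds of polishing and evaluation with Arrow and FreeBayes. polishCLR is containerized and publicly available to the assembly community as a tool for completing assemblies from existing, error-prone long-read data.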

With co-lead @bcbjen, Siva Chudalayandi, Ben Rosen, Andrew Severin (@isugif) of the Ag100Pest assembly team - we're proud to share our @nextflow implementation of the @genomeark best practices for CLR assembly, ✨polishCLR✨ https://t.co/l91gwcetYz
— Amanda Stahlke, PhD (@Amanda_Stahlke) February 14, 2022

example data

https://data.nal.usda.gov/dataset/data-polishclr-example-input-genome-assemblies

SRA

ID 804956 - BioProject - NCBI

Installation

Dependencies

Github

git clone https://github.com/isugifNF/polishCLR.git
cd polishCLR
mamba env create -f environment.yml -p ${PWD}/env/polishCLR_env

#activate
conda activate $PWD/env/polishCLR_env

> nextflow main.nf

N E X T F L O W ~ version 21.10.0

Launching `main.nf` [high_babbage] - revision: 6a81970115

----------------------------------------------------

isugifNF/polishCLR v1.0.0

----------------------------------------------------

Usage:

The typical command for running the pipeline are as follows:

nextflow run main.nf --primary_assembly "*fasta" --illumina_reads "*{1,2}.fastq.bz2" --pacbio_reads "*_subreads.bam" -resume

Mandatory arguments:

--illumina_reads paired end illumina reads, to be used for Merqury QV scores, and freebayes polish primary assembly

--pacbio_reads pacbio reads in bam format, to be used to arrow polish primary assembly

--mitochondrial_assembly mitocondrial assembly will be concatinated to the assemblies before polishing [default: false]

Either FALCON (or FALCON Unzip) assembly:

--primary_assembly genome assembly fasta file to polish

--alternate_assembly if alternate/haplotig assembly file is provided, will be concatinated to the primary assembly before polishing [default: false]

--falcon_unzip if primary assembly has already undergone falcon unzip [default: false]. If true, will Arrow polish once instead of twice.

Or TrioCanu assembly

--paternal_assembly paternal genome assembly fasta file to polish

--maternal_assembly maternal genome assembly fasta file to polish

Pick Step 1 (arrow, purgedups) or Step 2 (arrow, freebayes, freebayes)

--step Run step 1 or step 2 (default: 1)

Optional modifiers

--species if a string is given, rename the final assembly by species name [default:false]

--k kmer to use in MerquryQV scoring [default:21]

--same_specimen if illumina and pacbio reads are from the same specimin [default: true].

--meryldb path to a prebuilt meryl database, built from the illumina reads. If not provided, tehen build.

Optional configuration arguments

--parallel_app Link to parallel executable [default: 'parallel']

--bzcat_app Link to bzcat executable [default: 'bzcat']

--pigz_app Link to pigz executable [default: 'pigz']

--meryl_app Link to meryl executable [default: 'meryl']

--merqury_sh Link to merqury script [default: '$MERQURY/merqury.sh']

--pbmm2_app Link to pbmm2 executable [default: 'pbmm2']

--samtools_app Link to samtools executable [default: 'samtools']

--gcpp_app Link to gcpp executable [default: 'gcpp']

--bwamem2_app Link to bwamem2 executable [default: 'bwa-mem2']

--freebayes_app Link to freebayes executable [default: 'freebayes']

--bcftools_app Link to bcftools executable [default: 'bcftools']

--merfin_app Link to merfin executable [default: 'merfin']

Optional arguments:

--outdir Output directory to place final output [default: 'PolishCLR_Results']

--clusterOptions Cluster options for slurm or sge profiles [default slurm: '-N 1 -n 40 -t 04:00:00'; default sge: ' ']

--threads Number of CPUs to use during each job [default: 40]

--queueSize Maximum number of jobs to be queued [default: 50]

--account Some HPCs require you supply an account name for tracking usage. You can supply that here.

--help This usage statement.
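
The "Optional configuration arguments" in the usage text each default to a bare executable name, so it may be worth confirming that those executables are actually on PATH after activating the conda environment. The following is a minimal sketch and not part of polishCLR: missing_tools is a hypothetical helper, and the tool names are simply the defaults listed in the usage text, plus nextflow itself.

```shell
# Hypothetical helper (not part of polishCLR):
# print each named tool that cannot be found on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}

# Default executable names copied from the usage text, plus nextflow.
missing_tools nextflow parallel bzcat pigz meryl pbmm2 samtools gcpp \
  bwa-mem2 freebayes bcftools merfin
```

An empty result means every default name resolved. The merqury script is left out because its default is a path under $MERQURY rather than a name on PATH, so it would need a separate file check.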

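How to run

polishCLR is designed to polish genome assemblies in one of three states: 1) a primary assembly of a polyploid genome whose haplotypes have not been resolved, 2) a haplotype-resolved assembly with both primary and alternate contigs that has not yet been polished, and 3) the same data as in case 2 that has additionally been polished with PacBio CLR reads.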
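
One way to read these three cases against the usage text is sketched below. The mapping from case to flags is an assumption inferred from the descriptions of --alternate_assembly and --falcon_unzip in the help output, not something the help states case by case. polishclr_case_flags is a hypothetical helper, and primary.fasta and alternate.fasta are placeholder file names.

```shell
# Hypothetical helper: print the assembly-related flags that seem to fit each case.
# The case-to-flag mapping is an assumption based on the usage text.
polishclr_case_flags() {
  case "$1" in
    # Case 1: unresolved primary assembly only.
    1) echo '--primary_assembly primary.fasta --falcon_unzip false' ;;
    # Case 2: haplotype-resolved but unpolished primary + alternate contigs.
    2) echo '--primary_assembly primary.fasta --alternate_assembly alternate.fasta --falcon_unzip false' ;;
    # Case 3: haplotype-resolved and already CLR-polished (Arrow runs once, not twice).
    3) echo '--primary_assembly primary.fasta --alternate_assembly alternate.fasta --falcon_unzip true' ;;
    *) echo "unknown case: $1" >&2; return 1 ;;
  esac
}

polishclr_case_flags 3
```

The read inputs (--illumina_reads, --pacbio_reads) would be supplied in the same way for all three cases.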

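Specify the assembly sequences for primary_assembly, alternate_assembly, and mitochondrial_assembly (all three are required), together with the Illumina reads (fastq) and the PacBio reads (bam). Here I use the example assemblies and sequencing data linked above.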

#download project
grabseqs sra -t 8 -m metadata.csv -o outdir PRJNA804956

# Glob patterns are quoted, as in the usage text, so that Nextflow rather than the shell expands them
nextflow run main.nf \
  --primary_assembly cns_p_ctg.fasta \
  --alternate_assembly cns_h_ctg.fasta \
  --mitochondrial_assembly Hzea_mtDNA_contig.fasta \
  --illumina_reads "JDRP*{R1,R2}.fastq.bz2" \
  --pacbio_reads "m*.subreads.bam" \
  --species "Hzea" \
  --k "21" \
  --falcon_unzip true \
  --step 1 \
  --busco_lineage "lepidoptera_odb10" \
  -resume \
  -profile ceres
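
The usage text puts its glob patterns in quotes ("*_subreads.bam"), and the reason is easy to demonstrate. If a pattern is left unquoted, the shell expands it before Nextflow starts, so a parameter such as --pacbio_reads may end up bound to only the first matching file. The sketch below shows the difference in a scratch directory; count_args is a hypothetical helper and the two .bam file names are made up for illustration.

```shell
# Hypothetical helper: report how many arguments a command would receive.
count_args() { echo "$#"; }

# Scratch directory with two empty stand-in files (names are made up).
demo_dir="$(mktemp -d)"
touch "$demo_dir/m1.subreads.bam" "$demo_dir/m2.subreads.bam"

(
  cd "$demo_dir" || exit 1
  # Unquoted: the shell expands the pattern into one argument per matching file.
  count_args m*.subreads.bam
  # Quoted: the literal pattern is passed through as a single argument.
  count_args "m*.subreads.bam"
)

rm -r "$demo_dir"
```

The first call reports one argument per matching file, while the second reports a single argument, which is the form a tool that expands the pattern itself expects to receive.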

There were parameters I did not fully understand, so I could not get the run to work properly.

Citation

polishCLR: a Nextflow workflow for polishing PacBio CLR genome assemblies
Jennifer Chang, Amanda R. Stahlke, Sivanandan Chudalayandi, Benjamin D. Rosen, Anna K. Childers, Andrew Severin

bioRxiv, Posted February 11, 2022

polishCLR: A Nextflow Workflow for Polishing PacBio CLR Genome Assemblies
Jennifer Chang, Amanda R Stahlke, Sivanandan Chudalayandi, Benjamin D Rosen, Anna K Childers, Andrew J Severin
Genome Biology and Evolution, Volume 15, Issue 3, March 2023