Pipeasm: a tool for automated large genome assembly and evaluation

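With biodiversity research built on high-quality reference genomes gaining momentum, and sequencing now feasible for a wide range of organisms, there is an urgent need for accessible, reproducible, and easy-to-use tools that incorporate state-of-the-art methods for large genome assembly. Pipeasm is a Snakemake pipeline designed to assemble vertebrate genomes from PacBio HiFi, ONT, and Hi-C data. With the input information and recommended parameters set in a configuration file, Pipeasm was able to assemble diploid genomes of several sizes into assemblies ready for manual curation. Pipeasm requires an environment with Snakemake and Singularity and is available at https://github.com/itvgenomics/pipeasm.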

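Pipeasm integrates a set of tools to ensure high-quality data, from the preprocessing steps through to comprehensive reports and statistics. It consists of six main steps: trimming and QC, k-mer profiling, assembly, assembly statistics, decontamination, and Hi-C mapping and scaffolding (Figure 1 of the paper). The required input is PacBio HiFi (CCS) long reads. These allow a solo assembly (an assembly from HiFi reads only), which yields primary and alternate assemblies containing partially phased contigs. Optionally, if chromatin contact (Hi-C) short reads are provided, Pipeasm can also run a phased assembly that uses both HiFi and Hi-C to produce, under ideal conditions, an assembly with the two haplotypes resolved. When both data types are available, the user can choose to run only the phased assembly to save time, or to run both the phased and solo assemblies. Oxford Nanopore (ONT) long reads can also be provided to run both the solo and phased assemblies.

Pipeasm provides a configuration file containing the parameters for every step. The defaults are already optimized but can be easily customized. In this file the user enters specific information such as the species name, sample ID, read paths, genetic code, taxonomy ID, and the BUSCO database to use.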
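The input rules described above can be summarized as a small decision function. This is only an illustrative sketch: the function name pipeasm_modes and its yes/no arguments are hypothetical and not part of Pipeasm, which reads these choices from its configuration file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Pipeasm): given which read sets are
# available, print the assembly modes described in the text.
# Usage: pipeasm_modes <hifi: yes|no> <hic: yes|no> <solo_asm: yes|no>
pipeasm_modes() {
    local hifi="$1" hic="$2" solo_asm="$3"

    # HiFi reads are the only mandatory input.
    if [ "$hifi" != "yes" ]; then
        echo "error: HiFi reads are required" >&2
        return 1
    fi

    if [ "$hic" = "yes" ]; then
        # HiFi + Hi-C: phased assembly, optionally plus the solo assembly.
        echo "phased"
        if [ "$solo_asm" = "yes" ]; then
            echo "solo"
        fi
    else
        # HiFi only: solo assembly (primary + alternate).
        echo "solo"
    fi
}
```

For example, pipeasm_modes yes yes no prints only "phased", which corresponds to skipping the solo assembly to save time.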

Installation

Dependencies

A Linux-based operating system (e.g., Ubuntu, CentOS, Fedora)
Python (version 3.5 or later) installed on your system
Snakemake/Singularity/Docker installation

GitHub

mamba create -n snakemake -c bioconda -c conda-forge singularity=3.8.5 snakemake=8.5.1
conda activate snakemake
git clone https://github.com/itvgenomics/pipeasm.git
cd pipeasm
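Before running anything, it can help to confirm that the tools installed above are actually on the PATH. The following is a minimal sketch; check_tools is a hypothetical helper name, not something shipped with Pipeasm.

```shell
#!/usr/bin/env bash
# Report which of the given commands are missing from PATH.
# Returns 0 if all are found, 1 otherwise.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example (run inside the activated environment):
#   check_tools snakemake singularity git
```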

$ bash Pipeasm.sh

### Pipeasm - a tool for automated large genome assembly and analysis

#Authors: Trindade F., Silva B. M., Canesin L., Souza Junior R. O., Oliveira R. 2024

###

### This program is free software: you can redistribute it and/or modify

### it under the terms of the GNU General Public License as published by

### the Free Software Foundation, either version 3 of the License, or

### any later version.

###

### This program is distributed in the hope that it will be useful,

### but WITHOUT ANY WARRANTY; without even the implied warranty of

### MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the

### GNU General Public License for more details.

###

### You should have received a copy of the GNU General Public License

### along with this program. If not, see <http://www.gnu.org/licenses/>.

ERROR: Missing one of the required arguments: -d (Work Directory), -t (# Threads)

## Usage: ./Pipeasm.sh -c <config.yaml> -d </path/to/work/dir> -s </path/to/snakefile> -t <# threads>

#-d </path/to/work/dir> = Path to your working directory where all the workflow file are

#-c </path/to/config.yaml> = Overwrite the default configuration file with all nedded parameters (config/config.yaml).

#-s </path/to/snakefile> = Overwrite the default snakefile path (workflow/Snakefile)

#-t <int> = Number of threads to use

# You can choose a Pipeasm step with:

--trimming_qc (for Trimming and Quality Control);

--kmer_eval (for k-mer Evaluation stats/plots);

--assembly (for all Assembly and Decontamination steps);

--scaffolding (for all Scaffolding and Hi-C Map steps);

# Only the -d and -t flags are required if you want to use the Snakemake default parameters

How to run

Write the paths of the FASTQ files in config.yaml. For each of HiFi, Hi-C, and ONT, the reads must be merged into a single file (if there are multiple files) and compressed with gzip.
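Merging can be done without decompressing, because concatenated gzip streams form a valid gzip file. The sketch below shows one way to do it; merge_fastq_gz and count_reads_gz are hypothetical helper names, and the file names in the usage comment are placeholders.

```shell
#!/usr/bin/env bash
# Merge several gzip-compressed FASTQ files into one .fastq.gz.
# Usage: merge_fastq_gz <output.fastq.gz> <input1.fastq.gz> [input2.fastq.gz ...]
merge_fastq_gz() {
    local out="$1"
    shift
    if [ "$#" -lt 1 ]; then
        echo "error: no input files given" >&2
        return 1
    fi
    local f
    for f in "$@"; do
        # Verify each input is an intact gzip file before merging.
        gzip -t "$f" || return 1
    done
    cat "$@" > "$out"
    # Verify the merged file as well.
    gzip -t "$out"
}

# Count reads in a gzip-compressed FASTQ (4 lines per record).
count_reads_gz() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}

# Example (placeholder file names):
#   merge_fastq_gz hifi_all.fastq.gz run1.fastq.gz run2.fastq.gz
#   count_reads_gz hifi_all.fastq.gz
```

If the Hi-C reads come as paired R1/R2 files, merge the R1 files and the R2 files in the same order so that read pairs stay in sync.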

config.yaml

https://github.com/itvgenomics/pipeasm/blob/main/config/config.yaml

Most of the following are required.

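species: Species name of the organism being assembled. MitoHiFi uses it to search for a complete mitochondrial genome to use as a reference (required).
sample: DToL_ID identifying the assembly for each species (where to obtain it) (required). In the example it is bHypStr1.1.
Lines 4-7: Specify the FASTQ files with full paths, with each data type merged into a single file (required); ont is optional.
geneticcode: Genetic code used for annotating the mitochondrial genome. 2 (the vertebrate mitochondrial code) is fine here; it only needs to be changed for simple eukaryotes or special cases (required).
taxid: ID in the NCBI Taxonomy database (required).
buscodb: BUSCO DB lineage closest to your species (link) (required).
solo_asm: Whether to run both the solo and phased assemblies (required).
threads: 32 to 64 threads recommended (assembling a 1 Gb genome needs about 150 GB of RAM; decontamination and k-mer analysis need up to 500 GB of RAM).
gxdb: FCS-GX database*1 used for the contamination check step (download guide). Leave it blank if you do not run this step.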
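Because a missing required entry only shows up once the workflow starts, a quick check of the config file beforehand may save time. The sketch below is a plain text check with grep, not a YAML parser, and it assumes the keys listed above appear as top-level "key: value" lines in config.yaml; check_config_keys is a hypothetical helper name. It does not catch values that are empty quoted strings.

```shell
#!/usr/bin/env bash
# Check that a YAML config has a non-empty value for each required
# top-level key. Prints the keys that are missing or empty and
# returns 1 if any were found, 0 otherwise.
check_config_keys() {
    local config="$1" key status=0
    shift
    for key in "$@"; do
        # Match "key: value" at line start with a non-blank, non-comment value.
        if ! grep -Eq "^${key}:[[:space:]]*[^[:space:]#]" "$config"; then
            echo "missing or empty: $key"
            status=1
        fi
    done
    return "$status"
}

# Keys marked as required in the list above.
REQUIRED_KEYS="species sample geneticcode taxid buscodb solo_asm"

# Example:
#   check_config_keys config/config.yaml $REQUIRED_KEYS
```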

Once everything is ready, run the pipeline. You can choose among four steps.

--trimming_qc (for Trimming and Quality Control);
--kmer_eval (for k-mer Evaluation stats/plots);
--assembly (for all Assembly and Decontamination steps);
--scaffolding (to run YAHS auto-scaffolding and create the Hi-C Maps);

Here I chose --trimming_qc. Specify the config file, workflow/Snakefile, and the current path (the repository root, where the FASTQ files are placed).

cd pipeasm
bash Pipeasm.sh -d ./ -c config.yaml -s workflow/Snakefile -t 32 --trimming_qc

-d Path to your working directory where all the workflow file are
-c Overwrite the default configuration file with all nedded parameters (config/config.yaml).
-s Overwrite the default snakefile path (workflow/Snakefile)
-t Number of threads to use

The first run takes extra time for things such as pulling the Singularity images.

Preprocessing about 80 GB in total of gzip-compressed Hi-C and HiFi reads took about an hour and a half (CPU: 5995WX).

Output

cd results/

Here you can find the Hi-C reads preprocessed with fastp, the adapter-trimmed HiFi reads, and the FastQC and NanoPlot reports, among other files.

bash Pipeasm.sh -d ./ -c config.yaml -s workflow/Snakefile -t 32 --assembly

bash Pipeasm.sh -d ./ -c config.yaml -s workflow/Snakefile -t 32 --scaffolding
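The individual step commands can also be chained in a small wrapper. This is a sketch under the assumption that the steps are meant to run in the order the help message lists them; run_pipeasm_steps and the DRY_RUN variable are hypothetical and not part of Pipeasm. Only the flags shown in the help output are used.

```shell
#!/usr/bin/env bash
# Run the four Pipeasm steps in order, stopping at the first failure.
# Must be called from the repository root (where Pipeasm.sh lives).
# Set DRY_RUN=1 to print the commands instead of running them.
# Usage: run_pipeasm_steps <workdir> <config.yaml> <snakefile> <threads>
run_pipeasm_steps() {
    local workdir="$1" config="$2" snakefile="$3" threads="$4"
    local step
    for step in trimming_qc kmer_eval assembly scaffolding; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "bash Pipeasm.sh -d $workdir -c $config -s $snakefile -t $threads --$step"
        else
            echo ">>> step: $step" >&2
            bash Pipeasm.sh -d "$workdir" -c "$config" -s "$snakefile" \
                -t "$threads" "--$step" || {
                echo "step failed: $step" >&2
                return 1
            }
        fi
    done
}

# Example:
#   DRY_RUN=1 run_pipeasm_steps ./ config.yaml workflow/Snakefile 32
```

Stopping at the first failing step avoids spending hours on assembly when an earlier step did not complete.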

Assembly/ and Scaffolding/

bHypStr1.1_linear_plot.png

KAT/bHypStr1.1.mx.png

Smudgeplot/bHypStr1.1_smudgeplot.pdf

bHypStr1.1_Scaffolding_Hap1.snail.png

bHypStr1.1.bHypStr1.1.yahs_hap1.spectra-cn.fl.png
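To locate plots like the ones above without browsing each subdirectory, the results directory can be searched for image and PDF files. A minimal sketch; list_reports is a hypothetical helper, and it assumes only that the reports are PNG or PDF files somewhere under results/.

```shell
#!/usr/bin/env bash
# List PNG and PDF reports under a results directory, sorted by path.
# Usage: list_reports [results_dir]   (defaults to "results")
list_reports() {
    local results_dir="${1:-results}"
    find "$results_dir" -type f \( -name '*.png' -o -name '*.pdf' \) | sort
}

# Example:
#   list_reports results
```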

Assembly and scaffolding of about 80 GB in total of gzip-compressed Hi-C and HiFi reads (haploid genome size 600 Mb) took about one day (CPU: 5995WX).

Citation

Pipeasm: a tool for automated large chromosome-scale genome assembly and evaluation

Bruno Marques Silva, Fernanda de Jesus Trindade, Lucas Eduardo Costa Canesin, Giordano Bruno Soares Souza, Alexandre Aleixo, Gisele Lopes Nunes, Renato R. M. Oliveira

bioRxiv, Posted October 24, 2024.

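*1 FCS-GX: a module of the NCBI Foreign Contamination Screening (FCS) program suite. It detects contamination from foreign organisms in genome sequences. Running FCS-GX on the final assembly before submission to GenBank is recommended. If further valid contamination is identified in the final assembly, rescreening after removing the contamination is also recommended.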

NCBI fcs

FCS GX quickstart · ncbi/fcs Wiki · GitHub