深層学習を用いて細菌分離株の高精度なSNVコールを行う AccuSNV

　細菌種内の変異を正確に検出することは、微生物進化の基礎研究、伝播事象の再構築、抗菌薬耐性変異の同定において極めて重要である。全ゲノムシーケンスから一塩基多型（SNV）を特定する多くのツールが開発されているが、細菌ゲノムの複雑さや、サンプルタイプやシーケンスデプスごとに異なるフィルタリング閾値が必要となることから、高い偽陽性率に悩まされることが多い。データセットの規模拡大に伴い、高精度を実現するための手動フィルタリングは重大な障壁となっている。本稿では、高精度かつ自動化された細菌SNV検出を実現する新たな深層学習ベースツール「AccuSNV」を提案する。従来の1サンプル単位処理手法とは異なり、AccuSNVは畳み込みニューラルネットワーク（CNN）を活用し、複数サンプルのアラインメント情報を統合、学習したサンプル横断パターンにより精度を向上させる。6種の細菌種から得られたシミュレーションデータ（多様なシーケンス深度、分離株数、変異数、分岐レベルを含む）を用い、7つの主要SNV検出ツールとの比較評価を実施した。さらに実用性を検証するため、報告済みSNVを含む複数のキュレーション済み細菌データセットでAccuSNVをテストした。シミュレーション環境と実データ環境の両方で、AccuSNVは一貫して最高の性能を達成した。さらに、AccuSNVは包括的で使いやすい下流解析モジュールと出力を提供する。これには変異アノテーション情報、系統推定、dN/dS計算、およびオプションの手動フィルタリングが含まれている。自動化された深層学習ベースの変異検出と相まって、これらの機能により、AccuSNVは計算技術の習熟度が異なる様々なユーザーに広く利用可能となっている。

このパイプラインとツールキットは、WGSデータから細菌分離株間の単一ヌクレオチド差異を検出・分析するために使用できる。主な特徴は、

ディープラーニング手法により、カバレッジ不足による偽陰性や偽陽性を回避しつつ、生データの可視化を可能にする。
系統樹構築、非同義変異と同義変異のカウント、平行進化など、容易な進化解析を実現。

インストール

ubuntuで環境を作ってテストした。

Github

mamba create -n accusnv -c conda-forge -c bioconda accusnv -y
conda activate accusnv

> accusnv -h

$ accusnv -h

usage: Local analysis module of AccuSNV [-h] -i INPUT_MAT [-c INPUT_COV] [-s MIN_COV_SAMP] [-v MIN_COV] [-e EXCLUDE_SAMP] [-g GENERATE_REP]

[-r REF_GENOME] [-o OUTPUT_DIR]

Apply filters and CNN to call SNVs for closely related bacterial isolates.

optional arguments:

-h, --help show this help message and exit

-i INPUT_MAT, --input_mat INPUT_MAT

The input mutation table in npz file

-c INPUT_COV, --input_cov INPUT_COV

The input coverage table in npz file

-s MIN_COV_SAMP, --min_cov_for_filter_sample MIN_COV_SAMP

Before running the CNN model, low-quality samples with more than 45% of positions having zero aligned reads will be

filtered out. (default "-s 45") You can adjust this threshold with this parameter; to include all samples, set "-s 100".

-v MIN_COV, --min_cov_for_filter_pos MIN_COV

For the filter module: on individual samples, calls must have at least this many reads on the fwd/rev strands

individually. If many samples have low coverage (e.g. <5), then you can set this parameter to smaller value. (e.g. -v 2).

Default is 5.

-e EXCLUDE_SAMP, --excluse_samples EXCLUDE_SAMP

The names of the samples you want to exclude (e.g. -e S1,S2,S3). If you specify a number, such as "-e 1000", any sample

with more than 1,000 SNVs will be automatically excluded.

-g GENERATE_REP, --generate_report GENERATE_REP

If not generate html report and other related files, set to 0. (default: 1)

-r REF_GENOME, --rer REF_GENOME

The reference genome

-o OUTPUT_DIR, --output_dir OUTPUT_DIR

The output dir

> accusnv_snakemake -h

usage: AccuSNV [-h] -i INPUT_SP [-t TF_SLURM] [-c CP_ENV] [-r REF_DIR] [-s MIN_COV_SAMP] [-v MIN_COV] [-e EXCLUDE_SAMP] [-g GENERATE_REP]

[-o OUT_DIR]

AccuSNV - SNV calling tool for bacterial isolates using deep learning.

optional arguments:

-h, --help show this help message and exit

-i INPUT_SP, --input_sample_info INPUT_SP

The dir of input sample info file --- Required

-t TF_SLURM, --turn_off_slurm TF_SLURM

If set to 1, the SLURM system will not be used for automatic job submission. Instead, all jobs will run locally or on a

single node. (Default: 0)

-c CP_ENV, --conda_prebuilt_env CP_ENV

The absolute dir of your pre-built conda env. e.g. /path/snake_pipeline/accusnv_sub

-r REF_DIR, --ref_dir REF_DIR

The dir of your reference genomes

-s MIN_COV_SAMP, --min_cov_for_filter_sample MIN_COV_SAMP

Before running the CNN model, low-quality samples with more than 45% of positions having zero aligned reads will be

filtered out. (default "-s 45") You can adjust this threshold with this parameter; to include all samples, set "-s 100".

-v MIN_COV, --min_cov_for_filter_pos MIN_COV

For the filter module: on individual samples, calls must have at least this many reads on the fwd/rev strands

individually. If many samples have low coverage (e.g. <5), then you can set this parameter to smaller value. (e.g. -v 2).

Default is 5.

-e EXCLUDE_SAMP, --excluse_samples EXCLUDE_SAMP

The names of the samples you want to exclude (e.g. -e S1,S2,S3). If you specify a number, such as "-e 1000", any sample

with more than 1,000 SNVs will be automatically excluded.

-g GENERATE_REP, --generate_report GENERATE_REP

If not generate html report and other related files, set to 0. (default: 1)

-o OUT_DIR, --output_dir OUT_DIR

Output dir (default: current dir/wd_out_(uid), uid is generated randomly)

テストラン

git clone https://github.com/liaoherui/AccuSNV.git
cd AccuSNV/conda_test_dir/
sh test_run.sh

（空のfastq.gzを使ったテスト。すぐに終わる）

実行方法

最初のリファレンスゲノムへの大量のfastqのマッピングは、slurmジョブとしてサブミットしてクラスタ環境で実行するように設計されているが、ローカルのワークステーションなどでもランできる。

実際に実行するには、細菌分離株のショートリードシーケンスデータのパスを示したCSVとアノテーション付き参照ゲノムファイルが必要。CSVは以下の構成で記載する。

samples_cae_test_pe.csv

Path,Sample,FileName,Reference,Group,Outgroup,Type
Test_data/pe_test/strain1/,strain1,strain1,Cae_ref,pe_test,0,PE
Test_data/pe_test/strain2/,strain2,strain2,Cae_ref,pe_test,0,PE
Test_data/pe_test/strain3/,strain3,strain3,Cae_ref,pe_test,0,PE
Test_data/pe_test/strain4/,strain4,strain4,Cae_ref,pe_test,0,PE

の形式で準備する。PathがFASTQ ファイルがあるディレクトリのパス、Sampleがサンプル名、FileNameが実際のファイル名、Referenceがリファレンスゲノムの識別子（ここでは Cae_ref）、Groupがサンプルグループ（ここでは pe_test）、Outgroupが外群フラグ（0 は in-group, 1 は outgroup）、TypeはPE（ペアエンド）/ SE（シングルエンド）のどちらかを指定する。SampleとFileNameが同じ名前であっても可。

リファレンスゲノムも必要。リファレンスはファイルをディレクトリに配置して、そのパスを指定するようになっている。テストデータは以下のファイル構成が用意されている。ラン後にはannotation_genes.pandas.py.pk1というpickleファイルもできる。

Cae_ref/

genome.fasta
genome.fasta.fai
annotations.gff

準備ができたら実行する。

accusnv_snakemake -i input_sample.csv -r ref_dir -o outdir

#テストラン
cd AccuSNV//snake_pipeline/
python accusnv_snakemake.py -i samples_cae_test_pe.csv -r reference_genomes -o cae_pe_test

出力例

snakemake実行のためのconfigファイルなどができる。config.yamlを一番下に記載。

準備ができたら実行する。

snakemake --cores 4 -s Snakefile --configfile config.yaml

自動作成されるconfigファイルでは指定されていないキーが複数あって、正常にランされず。実際は最終的にcandidate_mutation_table_final.npzというファイルができる。これを使って下流解析を行う。

下流解析

#テストラン
cd AccuSNV/
accusnv_downstream -i local_analysis/test_data/candidate_mutation_table_final.npz \
  -r snake_pipeline/reference_genomes/Cae_ref -o mac_downstream_test

出力例

下流解析向けに、分離株全ての高品質なSNVの表、分離株間の系統関係を示す最大節約法のツリーファイルなどが出力される。

論文とレポジトリより

AccuSNVは、各候補SNVにおけるサンプル間アラインメント情報を要約する独自のデータ構造と、このマルチサンプルリードアラインメントデータから有益なパターンを導出するCNNモデルを活用する。サンプル内シグナルとサンプル間パターンの両方を捕捉することで、低カバレッジ、実ゲノムの分岐、多様な実データセットに対する頑健性を向上させている。
ランする時に、カレントにカレントにconfig.yaml,experiment_info.yaml,Snakefileファイルと同名のファイル、あるいはscripts/と同名のディレクトリがないように注意する。
AccuSNVはSLURM ジョブスケジューラが使えるクラスタ環境でランするように設計されており、レポジトリではテストラン例としてslurmにjobをサブミットする流れが説明されている。ローカルPCでランする時には"--turn_off_slurm 1"を使用する（デフォルト -turn_off_slurm 0）。

引用

High-accuracy SNV calling for bacterial isolates using deep learning with AccuSNV
Herui Liao, Arolyn Conwill, Ian Light-Maka, Martin Fenk, Alyssa H. Mitchell, Evan B. Qu, Paul Torrillo, Jacob Baker, Felix M. Key, Tami Lieberman

bioRxiv, Posted September 29, 2025.

関連

> cat config.yaml
#change configfile to point to experiment_config.yaml for this experiment
#change mkdir, --output and --error to point to log folder for this experiment
#change mail-user, mail-type to point to your email
#can change default resources or set-resources as needed

snakefile: Snakefile
use-conda: True
conda-frontend: mamba
rerun-incomplete: True
restart-times: 1
jobs: 400
latency-wait: 30
keep-going: True
configfile: ./experiment_info.yaml
conda-prefix: ~/miniconda3/envs/accusnv
cluster-status: ./slurm_status_script.py
reason: True
# dryrun: True
# unlock: True

cluster:
mkdir -p ./logs/ &&
sbatch
--partition={resources.partition}
--ntasks={resources.ntasks}
--cpus-per-task={resources.cpus_per_task}
--mem={resources.mem}
--time={resources.time}
--job-name="SM.{rule}"
--output="./logs/%j_{rule}.err.txt"
--error="./logs/%j_{rule}.err.txt"
--mail-user="UPDATE_EMAIL_ADDRESS"
--mail-type="FAIL"

#note that resource names should use underscore _ instead of dash -

default-resources:
- partition="sched_mit_tami,mit_normal,newnodes,sched_mit_chisholm,sched_mit_hill"
- time="1:00:00"
- ntasks=1
- cpus_per_task=1
- mem=100000

set-resources:
# make_data_links
#- make_data_links:partition="sched_mit_hill"
# cutadapt
- cutadapt:ntasks=1
- cutadapt:cpus_per_task=1
- cutadapt:time="04:00:00"
# sickle
- sickle:ntasks=1
- sickle:cpus_per_task=1
- sickle:time="04:00:00"
# bowtie2
- bowtie2:mem=64000
- bowtie2:cpus_per_task=8
- bowtie2:time="02:00:00"
# sam2bam
- sam2bam:mem=64000
- sam2bam:cpus_per_task=1
- sam2bam:time="12:00:00"
# mpileup2vcf
- mpileup2vcf:mem=60000
- mpileup2vcf:cpus_per_task=1
- mpileup2vcf:time="12:00:00"
# pileup2diversity_matrix
- pileup2diversity_matrix:mem=60000
- pileup2diversity_matrix:cpus_per_task=1
- pileup2diversity_matrix:time="12:00:00"
# variants2positions
- variants2positions:mem=60000
- variants2positions:mem_per_cpu=60000
- variants2positions:cpus_per_task=1
- variants2positions:time="02:00:00"
# combine_positions
- combine_positions:mem=60000
- combine_positions:mem_per_cpu=60000
- combine_positions:cpus_per_task=1
- combine_positions:time="02:00:00"
# build_data_links
- build_data_links:mem=24000
- build_data_links:mem_per_cpu=24000
- build_data_links:cpus_per_task=10
- build_data_links:time="12:00:00"
# kraken2
- kraken2:mem_per_cpu=10000
- kraken2:cpus_per_task=20
- kraken2:time="05:00:00"
# spades
- spades:mem=500G
- spades:cpus_per_task=16
- spades:time="12:00:00"
# sumstats
- sumstats:mem=10000
# prokka
- prokka:mem=100G
- prokka:cpus_per_task=16
- prokka:time="04:00:00"
(accusnv) kazu@DESKTOP-R7O0V4R:~/AccuSNV/snake_pipeline$