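Long-read sequencing enables variant detection in genomic regions that are considered difficult to map with short-read sequencing. NanoCaller detects SNPs using long-range haplotype information, phases the long reads with the detected SNPs, and then calls indels by local realignment. Evaluated on eight human genomes, NanoCaller was shown to achieve higher performance than competing approaches. The authors also experimentally validated 41 novel variants in a widely used benchmark genome that had previously been difficult to detect. NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.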
NanoCaller, from Ahsan, Liu, Fang & Wang is a variant detection program for ONT and PacBio reads. It uses long-range haplotype information for SNP calling, and local alignment of phased reads for indel calling. Works well in difficult-to-map regions https://t.co/ErcyilX4W2 pic.twitter.com/2Tls3A68uO
— Genome Biology (@GenomeBiology) September 13, 2021
Installation
#conda (link)
mamba create -n nanocaller -y
conda activate nanocaller
mamba install -c bioconda nanocaller
#docker (hub)
docker pull genomicslab/nanocaller:latest
> NanoCaller -h
usage: NanoCaller [-h] [-mode MODE] [-seq SEQUENCING] [-cpu CPU] [-mincov MINCOV] [-maxcov MAXCOV] [-keep_bam] [-o OUTPUT] [-prefix PREFIX] [-sample SAMPLE] [-include_bed INCLUDE_BED] [-exclude_bed EXCLUDE_BED] [-start START]
[-end END] [-p PRESET] -bam BAM -ref REF -chrom CHROM [-snp_model SNP_MODEL] [-min_allele_freq MIN_ALLELE_FREQ] [-min_nbr_sites MIN_NBR_SITES] [-nbr_t NEIGHBOR_THRESHOLD] [-sup] [-indel_model INDEL_MODEL]
[-ins_t INS_THRESHOLD] [-del_t DEL_THRESHOLD] [-win_size WIN_SIZE] [-small_win_size SMALL_WIN_SIZE] [-impute_indel_phase] [-phase_bam] [-enable_whatshap]
optional arguments:
-h, --help show this help message and exit
Required Arguments:
-bam BAM, --bam BAM Bam file, should be phased if 'indel' mode is selected (default: None)
-ref REF, --ref REF Reference genome file with .fai index (default: None)
-chrom CHROM, --chrom CHROM
Chromosome (default: None)
Preset:
-p PRESET, --preset PRESET
Apply recommended preset values for SNP and Indel calling parameters, options are 'ont', 'ul_ont', 'ul_ont_extreme', 'ccs' and 'clr'. 'ont' works well for any type of ONT sequencing datasets. However, use
'ul_ont' if you have several ultra-long ONT reads up to 100kbp long, and 'ul_ont_extreme' if you have several ultra-long ONT reads up to 300kbp long. For PacBio CCS (HiFi) and CLR reads, use 'ccs'and 'clr'
respectively. Presets are described in detail here: github.com/WGLab/NanoCaller/blob/master/docs/Usage.md#preset-options. (default: None)
Configurations:
-mode MODE, --mode MODE
NanoCaller mode to run, options are 'snps', 'snps_unphased', 'indels' and 'both'. 'snps_unphased' mode quits NanoCaller without using WhatsHap for phasing. (default: both)
-seq SEQUENCING, --sequencing SEQUENCING
Sequencing type, options are 'ont', 'ul_ont', 'ul_ont_extreme', and 'pacbio'. 'ont' works well for any type of ONT sequencing datasets. However, use 'ul_ont' if you have several ultra-long ONT reads up to 100kbp
long, and 'ul_ont_extreme' if you have several ultra-long ONT reads up to 300kbp long. For PacBio CCS (HiFi) and CLR reads, use 'pacbio'. (default: ont)
-cpu CPU, --cpu CPU Number of CPUs to use (default: 1)
-mincov MINCOV, --mincov MINCOV
Minimum coverage to call a variant (default: 8)
-maxcov MAXCOV, --maxcov MAXCOV
Maximum coverage of reads to use. If sequencing depth at a candidate site exceeds maxcov then reads are downsampled. (default: 160)
Variant Calling Regions:
-include_bed INCLUDE_BED, --include_bed INCLUDE_BED
Only call variants inside the intervals specified in the bgzipped and tabix indexed BED file. If any other flags are used to specify a region, intersect the region with intervals in the BED file, e.g. if -chom
chr1 -start 10000000 -end 20000000 flags are set, call variants inside the intervals specified by the BED file that overlap with chr1:10000000-20000000. Same goes for the case when whole genome variant calling
flag is set. (default: None)
-exclude_bed EXCLUDE_BED, --exclude_bed EXCLUDE_BED
Path to bgzipped and tabix indexed BED file containing intervals to ignore for variant calling. BED files of centromere and telomere regions for the following genomes are included in NanoCaller: hg38, hg19, mm10
and mm39. To use these BED files use one of the following options: ['hg38', 'hg19', 'mm10', 'mm39']. (default: None)
-start START, --start START
start, default is 1 (default: None)
-end END, --end END end, default is the end of contig (default: None)
SNP Calling:
-snp_model SNP_MODEL, --snp_model SNP_MODEL
NanoCaller SNP model to be used (default: ONT-HG002)
-min_allele_freq MIN_ALLELE_FREQ, --min_allele_freq MIN_ALLELE_FREQ
minimum alternative allele frequency (default: 0.15)
-min_nbr_sites MIN_NBR_SITES, --min_nbr_sites MIN_NBR_SITES
minimum number of nbr sites (default: 1)
-nbr_t NEIGHBOR_THRESHOLD, --neighbor_threshold NEIGHBOR_THRESHOLD
SNP neighboring site thresholds with lower and upper bounds seperated by comma, for Nanopore reads '0.4,0.6' is recommended, for PacBio CCS anc CLR reads '0.3,0.7' and '0.3,0.6' are recommended respectively
(default: 0.4,0.6)
-sup, --supplementary
Use supplementary reads (default: False)
Indel Calling:
-indel_model INDEL_MODEL, --indel_model INDEL_MODEL
NanoCaller indel model to be used (default: ONT-HG002)
-ins_t INS_THRESHOLD, --ins_threshold INS_THRESHOLD
Insertion Threshold (default: 0.4)
-del_t DEL_THRESHOLD, --del_threshold DEL_THRESHOLD
Deletion Threshold (default: 0.6)
-win_size WIN_SIZE, --win_size WIN_SIZE
Size of the sliding window in which the number of indels is counted to determine indel candidate site. Only indels longer than 2bp are counted in this window. Larger window size can increase recall, but use a
maximum of 50 only (default: 40)
-small_win_size SMALL_WIN_SIZE, --small_win_size SMALL_WIN_SIZE
Size of the sliding window in which indel frequency is determined for small indels (default: 4)
-impute_indel_phase, --impute_indel_phase
Infer read phase by rudimentary allele clustering if the no or insufficient phasing information is available, can be useful for datasets without SNPs or regions with poor phasing quality (default: False)
Output Options:
-keep_bam, --keep_bam
Keep phased bam files. (default: False)
-o OUTPUT, --output OUTPUT
VCF output path, default is current working directory (default: None)
-prefix PREFIX, --prefix PREFIX
VCF file prefix (default: variant_calls)
-sample SAMPLE, --sample SAMPLE
VCF file sample name (default: SAMPLE)
Phasing:
-phase_bam, --phase_bam
Phase bam files if snps mode is selected. This will phase bam file without indel calling. (default: False)
-enable_whatshap, --enable_whatshap
Allow WhatsHap to change SNP genotypes when phasing using --distrust-genotypes and --include-homozygous flags (this is not the same as regenotyping), considerably increasing the time needed for phasing. It has a
negligible effect on SNP calling accuracy for Nanopore reads, but may make a small improvement for PacBio reads. By default WhatsHap will only phase SNP calls produced by NanoCaller, but not change their
genotypes. (default: False)
Usage
A run requires at minimum a BAM file, a reference FASTA, and a chromosome name. Both the BAM file and the FASTA file must be indexed.
NanoCaller -bam nanopore.bam -ref ref.fa -chrom chr22 -cpu 12 -p ont -o outdir -mode both
- -bam Bam file, should be phased if 'indel' mode is selected (default: None)
- -ref Reference genome file with .fai index (default: None)
- -chrom Chromosome (default: None)
- -p Apply recommended preset values for SNP and Indel calling parameters, options are 'ont', 'ul_ont', 'ul_ont_extreme', 'ccs' and 'clr'. 'ont' works well for any type of ONT sequencing datasets. However, use 'ul_ont' if you have several ultra-long ONT reads up to 100kbp long, and 'ul_ont_extreme' if you have several ultra-long ONT reads up to 300kbp long. For PacBio CCS (HiFi) and CLR reads, use 'ccs'and 'clr' respectively. Presets are described in detail here: github.com/WGLab/NanoCaller/blob/master/docs/Usage.md#preset-options. (default: None)
- -mode NanoCaller mode to run, options are 'snps', 'snps_unphased', 'indels' and 'both'. 'snps_unphased' mode quits NanoCaller without using WhatsHap for phasing. (default: both)
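The indexing requirement mentioned above can be checked before launching a run. The following is a minimal sketch, assuming samtools is installed and that each index file sits next to its data file. `check_inputs` is a hypothetical helper name and is not part of NanoCaller; the file names in the usage note are the placeholders from the command above.

```shell
# Preflight sketch (assumption: index files live next to the BAM and FASTA).
# check_inputs is a hypothetical helper, not something NanoCaller provides.
check_inputs() {
  local bam="$1" ref="$2" status=0

  # samtools index writes <bam>.bai by default; <name>.bai and <bam>.csi are also accepted here.
  if [ ! -e "${bam}.bai" ] && [ ! -e "${bam%.bam}.bai" ] && [ ! -e "${bam}.csi" ]; then
    echo "missing BAM index; try: samtools index ${bam}" >&2
    status=1
  fi

  # samtools faidx writes <ref>.fai next to the FASTA file.
  if [ ! -e "${ref}.fai" ]; then
    echo "missing FASTA index; try: samtools faidx ${ref}" >&2
    status=1
  fi

  # 0 when both indexes were found, 1 otherwise.
  return "$status"
}
```

For example, `check_inputs nanopore.bam ref.fa && NanoCaller -bam nanopore.bam -ref ref.fa -chrom chr22 -cpu 12 -p ont -o outdir -mode both` would start the run only when both indexes are present.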
Example output
outdir/
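A quick way to inspect the output directory is to count the variant records in each VCF it contains. This is a sketch under assumptions: the help text only states that VCF files are written to the output path with the given prefix, so the exact file names and whether they are compressed are not assumed here, and both plain and gzip-compressed files are handled. `count_records` is a hypothetical helper name.

```shell
# Sketch: print "<file><TAB><number of non-header lines>" for every VCF in a directory.
# count_records is a hypothetical helper; output file names are not assumed.
count_records() {
  local dir="$1" f n
  for f in "$dir"/*.vcf "$dir"/*.vcf.gz; do
    [ -e "$f" ] || continue   # skip glob patterns that matched nothing
    case "$f" in
      # grep -c exits non-zero when the count is 0, hence "|| true"
      *.gz) n=$(gzip -cd "$f" | grep -vc '^#' || true) ;;
      *)    n=$(grep -vc '^#' "$f" || true) ;;
    esac
    printf '%s\t%s\n' "$f" "$n"
  done
}
```

Calling `count_records outdir` lists each VCF with its record count, which makes an empty or truncated result easy to spot.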
Test run
The command for running the test data is shown below. Note that the link to the test data is broken, so it cannot currently be run.
NanoCaller -bam HG002.nanopore.chr22.sample.bam -p ont -o test_run -chrom chr22 -start 20000000 -end 21000000 -ref chr22_ref.fa -cpu 4 > log
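Because the test data are unavailable, a comparable small test can be prepared from one's own alignments by extracting the same kind of 1 Mbp window. The sketch below only prints the commands (a dry run) and assumes samtools is installed, the input BAM is indexed, and paths contain no spaces. `region_cmds` and all file names are hypothetical; the NanoCaller flags are the ones used in the test command above.

```shell
# Dry-run sketch: print the commands for a small region-restricted run.
# region_cmds is a hypothetical helper; assumes paths without spaces.
region_cmds() {
  local bam="$1" ref="$2" chrom="$3" start="$4" end="$5" out="$6"
  local sub="${out}/subset.${chrom}_${start}_${end}.bam"

  echo "mkdir -p ${out}"
  # Extract reads overlapping the window from an indexed BAM, then index the subset.
  echo "samtools view -b -o ${sub} ${bam} ${chrom}:${start}-${end}"
  echo "samtools index ${sub}"
  # Call variants in the same window, mirroring the options of the test command.
  echo "NanoCaller -bam ${sub} -p ont -o ${out} -chrom ${chrom} -start ${start} -end ${end} -ref ${ref} -cpu 4"
}
```

Running `region_cmds my.bam ref.fa chr22 20000000 21000000 test_run` prints four commands for review; piping the output to `sh` would execute them.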
Citation
NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks
Mian Umair Ahsan, Qian Liu, Li Fang & Kai Wang
Genome Biology volume 22, Article number: 261 (2021)
See also
Anyone aware of a tool that does INDEL calling on nanopore data that also does PHASING on the indels in the absence of SNPs? I tried clair3/longshot/gatk/nanocaller, none of them does seem to support what Im looking for. #Bioinformatics #nanopore #ngsanalysis #plants
— Henri van de Geest (@geesthc) December 22, 2021