macでインフォマティクス

A collection of notes on informatics for HTS (NGS).

NanoCaller: haplotype-aware, accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing

 

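Long-read sequencing enables variant detection in genomic regions that are considered difficult to map with short-read sequencing. NanoCaller detects SNPs using long-range haplotype information, phases the long reads with the called SNPs, and then calls indels through local realignment. In an evaluation on eight human genomes, NanoCaller performed better than competing approaches. The authors also experimentally validated 41 novel variants in a widely used benchmark genome that had previously been difficult to detect. NanoCaller facilitates the discovery of novel variants in complex genomic regions from long-read sequencing.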

 

 

Installation

GitHub

#conda
mamba create -n nanocaller -y
conda activate nanocaller
mamba install -c bioconda nanocaller

#docker (Docker Hub)
docker pull genomicslab/nanocaller:latest

> NanoCaller -h

usage: NanoCaller [-h] [-mode MODE] [-seq SEQUENCING] [-cpu CPU] [-mincov MINCOV] [-maxcov MAXCOV] [-keep_bam] [-o OUTPUT] [-prefix PREFIX] [-sample SAMPLE] [-include_bed INCLUDE_BED] [-exclude_bed EXCLUDE_BED] [-start START]
                  [-end END] [-p PRESET] -bam BAM -ref REF -chrom CHROM [-snp_model SNP_MODEL] [-min_allele_freq MIN_ALLELE_FREQ] [-min_nbr_sites MIN_NBR_SITES] [-nbr_t NEIGHBOR_THRESHOLD] [-sup] [-indel_model INDEL_MODEL]
                  [-ins_t INS_THRESHOLD] [-del_t DEL_THRESHOLD] [-win_size WIN_SIZE] [-small_win_size SMALL_WIN_SIZE] [-impute_indel_phase] [-phase_bam] [-enable_whatshap]

optional arguments:
  -h, --help            show this help message and exit

Required Arguments:
  -bam BAM, --bam BAM   Bam file, should be phased if 'indel' mode is selected (default: None)
  -ref REF, --ref REF   Reference genome file with .fai index (default: None)
  -chrom CHROM, --chrom CHROM
                        Chromosome (default: None)

Preset:
  -p PRESET, --preset PRESET
                        Apply recommended preset values for SNP and Indel calling parameters, options are 'ont', 'ul_ont', 'ul_ont_extreme', 'ccs' and 'clr'. 'ont' works well for any type of ONT sequencing datasets. However, use
                        'ul_ont' if you have several ultra-long ONT reads up to 100kbp long, and 'ul_ont_extreme' if you have several ultra-long ONT reads up to 300kbp long. For PacBio CCS (HiFi) and CLR reads, use 'ccs'and 'clr'
                        respectively. Presets are described in detail here: github.com/WGLab/NanoCaller/blob/master/docs/Usage.md#preset-options. (default: None)

Configurations:
  -mode MODE, --mode MODE
                        NanoCaller mode to run, options are 'snps', 'snps_unphased', 'indels' and 'both'. 'snps_unphased' mode quits NanoCaller without using WhatsHap for phasing. (default: both)
  -seq SEQUENCING, --sequencing SEQUENCING
                        Sequencing type, options are 'ont', 'ul_ont', 'ul_ont_extreme', and 'pacbio'. 'ont' works well for any type of ONT sequencing datasets. However, use 'ul_ont' if you have several ultra-long ONT reads up to 100kbp
                        long, and 'ul_ont_extreme' if you have several ultra-long ONT reads up to 300kbp long. For PacBio CCS (HiFi) and CLR reads, use 'pacbio'. (default: ont)
  -cpu CPU, --cpu CPU   Number of CPUs to use (default: 1)
  -mincov MINCOV, --mincov MINCOV
                        Minimum coverage to call a variant (default: 8)
  -maxcov MAXCOV, --maxcov MAXCOV
                        Maximum coverage of reads to use. If sequencing depth at a candidate site exceeds maxcov then reads are downsampled. (default: 160)

Variant Calling Regions:
  -include_bed INCLUDE_BED, --include_bed INCLUDE_BED
                        Only call variants inside the intervals specified in the bgzipped and tabix indexed BED file. If any other flags are used to specify a region, intersect the region with intervals in the BED file, e.g. if -chom
                        chr1 -start 10000000 -end 20000000 flags are set, call variants inside the intervals specified by the BED file that overlap with chr1:10000000-20000000. Same goes for the case when whole genome variant calling
                        flag is set. (default: None)
  -exclude_bed EXCLUDE_BED, --exclude_bed EXCLUDE_BED
                        Path to bgzipped and tabix indexed BED file containing intervals to ignore for variant calling. BED files of centromere and telomere regions for the following genomes are included in NanoCaller: hg38, hg19, mm10
                        and mm39. To use these BED files use one of the following options: ['hg38', 'hg19', 'mm10', 'mm39']. (default: None)
  -start START, --start START
                        start, default is 1 (default: None)
  -end END, --end END   end, default is the end of contig (default: None)

SNP Calling:
  -snp_model SNP_MODEL, --snp_model SNP_MODEL
                        NanoCaller SNP model to be used (default: ONT-HG002)
  -min_allele_freq MIN_ALLELE_FREQ, --min_allele_freq MIN_ALLELE_FREQ
                        minimum alternative allele frequency (default: 0.15)
  -min_nbr_sites MIN_NBR_SITES, --min_nbr_sites MIN_NBR_SITES
                        minimum number of nbr sites (default: 1)
  -nbr_t NEIGHBOR_THRESHOLD, --neighbor_threshold NEIGHBOR_THRESHOLD
                        SNP neighboring site thresholds with lower and upper bounds seperated by comma, for Nanopore reads '0.4,0.6' is recommended, for PacBio CCS anc CLR reads '0.3,0.7' and '0.3,0.6' are recommended respectively
                        (default: 0.4,0.6)
  -sup, --supplementary
                        Use supplementary reads (default: False)

Indel Calling:
  -indel_model INDEL_MODEL, --indel_model INDEL_MODEL
                        NanoCaller indel model to be used (default: ONT-HG002)
  -ins_t INS_THRESHOLD, --ins_threshold INS_THRESHOLD
                        Insertion Threshold (default: 0.4)
  -del_t DEL_THRESHOLD, --del_threshold DEL_THRESHOLD
                        Deletion Threshold (default: 0.6)
  -win_size WIN_SIZE, --win_size WIN_SIZE
                        Size of the sliding window in which the number of indels is counted to determine indel candidate site. Only indels longer than 2bp are counted in this window. Larger window size can increase recall, but use a
                        maximum of 50 only (default: 40)
  -small_win_size SMALL_WIN_SIZE, --small_win_size SMALL_WIN_SIZE
                        Size of the sliding window in which indel frequency is determined for small indels (default: 4)
  -impute_indel_phase, --impute_indel_phase
                        Infer read phase by rudimentary allele clustering if the no or insufficient phasing information is available, can be useful for datasets without SNPs or regions with poor phasing quality (default: False)

Output Options:
  -keep_bam, --keep_bam
                        Keep phased bam files. (default: False)
  -o OUTPUT, --output OUTPUT
                        VCF output path, default is current working directory (default: None)
  -prefix PREFIX, --prefix PREFIX
                        VCF file prefix (default: variant_calls)
  -sample SAMPLE, --sample SAMPLE
                        VCF file sample name (default: SAMPLE)

Phasing:
  -phase_bam, --phase_bam
                        Phase bam files if snps mode is selected. This will phase bam file without indel calling. (default: False)
  -enable_whatshap, --enable_whatshap
                        Allow WhatsHap to change SNP genotypes when phasing using --distrust-genotypes and --include-homozygous flags (this is not the same as regenotyping), considerably increasing the time needed for phasing. It has a
                        negligible effect on SNP calling accuracy for Nanopore reads, but may make a small improvement for PacBio reads. By default WhatsHap will only phase SNP calls produced by NanoCaller, but not change their
                        genotypes. (default: False)
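
The -include_bed and -exclude_bed options in the help text expect a bgzipped, tabix-indexed BED file. Below is a minimal sketch of preparing one, assuming bgzip and tabix from HTSlib are installed. The function name sort_bed and the file names regions.bed and regions.sorted.bed.gz are placeholders for illustration, not part of NanoCaller.

```shell
# Hypothetical helper: sort a BED file by chromosome, then numerically by start.
# tabix needs position-sorted input. The sorted BED is written to stdout.
sort_bed() {
  LC_ALL=C sort -k1,1 -k2,2n "$1"
}

# Compress and index only when the placeholder input and the HTSlib tools exist.
if [ -f regions.bed ] && command -v bgzip >/dev/null 2>&1 && command -v tabix >/dev/null 2>&1; then
  sort_bed regions.bed | bgzip -c > regions.sorted.bed.gz
  tabix -p bed regions.sorted.bed.gz
fi
```

The resulting regions.sorted.bed.gz could then be passed to -include_bed or -exclude_bed.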

 

 

 

 

How to run

At minimum, a run requires the BAM file, the reference FASTA, and a chromosome name. Both the BAM file and the FASTA file must be indexed.
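
As a rough sketch of the indexing prerequisite, the helper below reports which index files are missing before NanoCaller is launched. The function name missing_indexes is made up for illustration. The samtools commands in the comments are the usual way to create the indexes, assuming samtools is installed.

```shell
# Hypothetical helper: print the index files that are missing for a BAM/FASTA pair.
# Prints nothing and returns 0 when both indexes are present, returns 1 otherwise.
missing_indexes() {
  mi_bam=$1
  mi_ref=$2
  mi_rc=0
  # A BAM index may be named sample.bam.bai or sample.bai
  if [ ! -e "${mi_bam}.bai" ] && [ ! -e "${mi_bam%.bam}.bai" ]; then
    echo "${mi_bam}.bai"   # create with: samtools index "$mi_bam"
    mi_rc=1
  fi
  if [ ! -e "${mi_ref}.fai" ]; then
    echo "${mi_ref}.fai"   # create with: samtools faidx "$mi_ref"
    mi_rc=1
  fi
  return "$mi_rc"
}
```

For example, `missing_indexes nanopore.bam ref.fa || echo "create the indexes listed above first"` stops early instead of letting the variant-calling run fail later.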

NanoCaller -bam nanopore.bam -ref ref.fa -chrom chr22 -cpu 12 -p ont -o outdir -mode both
  • -bam    BAM file; should be phased if 'indels' mode is selected (default: None)
  • -ref    Reference genome file with .fai index (default: None)
  • -chrom    Chromosome (default: None)
  • -p    Apply recommended preset values for SNP and indel calling parameters; options are 'ont', 'ul_ont', 'ul_ont_extreme', 'ccs' and 'clr'. 'ont' works well for any type of ONT sequencing dataset. However, use 'ul_ont' if you have several ultra-long ONT reads up to 100kbp long, and 'ul_ont_extreme' if you have several ultra-long ONT reads up to 300kbp long. For PacBio CCS (HiFi) and CLR reads, use 'ccs' and 'clr' respectively. Presets are described in detail here: github.com/WGLab/NanoCaller/blob/master/docs/Usage.md#preset-options. (default: None)
  • -mode    NanoCaller mode to run; options are 'snps', 'snps_unphased', 'indels' and 'both'. 'snps_unphased' mode quits NanoCaller without using WhatsHap for phasing. (default: both)
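
Since -chrom takes one chromosome per run in the help text above, a whole-genome call can be approximated by looping over chromosome names. The sketch below only prints the commands (a dry run) so they can be reviewed first. The function name print_nanocaller_cmds, the chr1 to chr22 naming, and the per-chromosome output directories are assumptions for illustration.

```shell
# Hypothetical helper: print one NanoCaller command per autosome (dry run).
# Assumes the reference uses chr1..chr22 names and that paths contain no spaces.
print_nanocaller_cmds() {
  pc_bam=$1
  pc_ref=$2
  pc_out=$3
  pc_i=1
  while [ "$pc_i" -le 22 ]; do
    # Only flags listed in the help output are used here
    echo "NanoCaller -bam $pc_bam -ref $pc_ref -chrom chr$pc_i -p ont -mode both -cpu 12 -o $pc_out/chr$pc_i"
    pc_i=$((pc_i + 1))
  done
}
```

Piping the output to sh, for example `print_nanocaller_cmds nanopore.bam ref.fa outdir | sh`, would run the chromosomes one after another. Writing each chromosome to its own directory is a cautious choice to keep the runs from overwriting each other's files.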

Example output

outdir/
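
For a quick look at what a run produced, the sketch below counts the non-header records in each gzipped VCF. It assumes the results are written as *.vcf.gz files directly under outdir/, which should be checked against the actual run. count_variants is an illustrative name.

```shell
# Hypothetical helper: count variant records (non-header lines) in a gzipped VCF.
count_variants() {
  gzip -dc "$1" | grep -vc '^#'
}

# Report every gzipped VCF found in the output directory, if there are any.
for vcf in outdir/*.vcf.gz; do
  [ -e "$vcf" ] || continue
  printf '%s\t%s\n' "$vcf" "$(count_variants "$vcf")"
done
```

bgzip-compressed files are gzip-compatible, so plain gzip -dc is enough here and no extra tool is needed.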

 

 

Test run

The command for running the test data is shown below. Note that the link to the data is broken, so the test currently cannot be run.

NanoCaller -bam HG002.nanopore.chr22.sample.bam -p ont -o test_run -chrom chr22 -start 20000000 -end 21000000 -ref chr22_ref.fa -cpu 4 > log

 

Citation

NanoCaller for accurate detection of SNPs and indels in difficult-to-map regions from long-read sequencing by haplotype-aware deep neural networks
Mian Umair Ahsan, Qian Liu, Li Fang & Kai Wang 
Genome Biology volume 22, Article number: 261 (2021) 

 

References