macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

超高速・高精度な体細胞スモールバリアントコーラー rabbitvar

 

 次世代シーケンサー(NGS)技術の継続的な発展により、がん研究においてゲノム解析が広範囲かつ頻繁に利用されるようになった。それに伴う大規模なNGSデータセットの作成は、一般的に使用されるハードウェアプラットフォーム上で高度に最適化された高精度の体細胞バリアントコール法の必要性を確立している。著者らは、最新のマルチコアCPU上で腫瘍と正常のペアNGSデータから小さな体細胞変異を検出できるスケーラブルなバリアントコーラーRabbitVar ( https://github.com/LeiHaoa/RabbitVar ) を発表する。本アプローチは、候補探しと機械学習ベースのフィルタリング戦略を、最適化されたデータ構造とマルチスレッドで組み合わせ、高い精度と効率の両方を達成する。異なるシーケンス条件と汚染率のもと、実世界のHCC1395乳がんデータセットにおいて、RabbitVarの性能を主要な最先端コーラー(Strelka2、Mutect2、NeuSomatic、VarDict、VarScan2)と比較した。評価結果は、RabbitVarがSNVを呼び出す際に高い競争力を持つF1スコアを達成したことを示している。さらに、より難易度の高いindel variantをコールした場合にも、常に最高のF1スコアを達成した。RabbitVarは、48コアのワークステーションで、80倍の深さの腫瘍と正常ヒト全ゲノムシーケンスデータセットを20分未満で処理することができ、効率という点で、テストした他のすべてのバリアントコーラーよりも優れている。

 

インストール

依存

  • htslib
  • zlib (included with most Linuxes)

Github

git clone https://github.com/LeiHaoa/RabbitVar.git
cd RabbitVar/
mamba create -n rabbitvar python=3.8
conda activate rabbitvar
./install.sh #途中でpipやcondaのinstallコマンドが実行される(link)。condaの速度が気になるならmambaに書き換える。*1
cd bin/
> ./RabbitVar 

usage: ./RabbitVar --Genome_fasta=string --in_bam=string [options] ... 

options:

  -H, --help                  Print this help page

  -p, --pileup                Do pileup regardless of the frequency

  -t, --dedup                 Indicate to remove duplicated reads.  Only one pair with same start positions will be kept

  -3, --3-prime               Indicate to move indels to 3-prime if alternative alignment can be achieved.

  -u, --uni                   Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once

       using forward read only.

      --UN                    Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once 

       using first read only.

      --chimeric              Indicate to turn off chimeric reads filtering.

      --deldupvar             Turn on deleting of duplicate variants. Variants in this mode are considered and outputted only if 

       start position of variant is inside the region interest.

  -y, --verbose               

  -F, --Filter                The hexical to filter reads using samtools. Default: 0x504 (filter 2nd alignments, unmapped reads 

       and duplicates).  Use -F 0 to turn it off. (string [=0x504])

  -z, --zero_based            Indicate whether coordinates are zero-based, as IGV uses.  Default: 1 for BED file or amplicon BED 

       file.Use 0 to turn it off. When using the -R option, it's set to 0

  -k, --local_realig          Indicate whether to perform local realignment.  Default: 1.  Set to 0 to disable it.  For Ion or 

       PacBio, 0 is recommended. (int [=1])

  -a, --amplicon              Indicate it's amplicon based calling. Reads that don't map to the amplicon will be skipped. A read

       pair is considered belonging to the amplicon if the edges are less than int bp to the amplicon, 

       and overlap fraction is at least float.  Default: 10:0.95 (string [=])

  -c, --column                The column for chromosome (int [=2])

  -G, --Genome_fasta          The reference fasta. Should be indexed (.fai). (string)

  -R, --Region                The region of interest. In the format of chr:start-end. If end is omitted, then a single position.  

       No BED is needed. (string [=])

  -d, --delemiter             The delimiter for split region_info, default to tab "\t" (string [= ])

  -n, --regular_expression    The regular expression to extract sample name from BAM filenames.  

       Default to: /([^\/\._]+?)_[^\/]*.bam/ (string [=/([^\/\._]+?)_[^\/]*.bam/])

  -N, --Name                  The sample name to be used directly.  Will overwrite -n option (string [=])

  -b, --in_bam                The indexed BAM file (string)

  -S, --region_start          The column for region start, e.g. gene start (int [=6])

  -E, --region_end            The column for region end, e.g. gene end (int [=7])

  -s, --seg_start             The column for segment starts in the region, e.g. exon starts (int [=9])

  -e, --seg_end               The column for segment ends in the region, e.g. exon ends (int [=10])

  -g, --gene_name             The column for gene name, or segment annotation (int [=12])

  -x, --numcl_extend          The number of nucleotide to extend for each segment, default: 0 (int [=0])

  -B, --min                   The minimum # of reads to determine strand bias, default 2 (int [=2])

  -Q, --Quality               If set, reads with mapping quality less than INT will be filtered and ignored (int [=0])

  -q, --phred_score           The phred score for a base to be considered a good call.  Default: 25 (for Illumina) For PGM, set 

       it to ~15, as PGM tends to under estimate base quality. (double [=22.5])

  -m, --mismatch              If set, reads with mismatches more than INT will be filtered and ignored.  Gaps are not counted as 

       mismatches. Valid only for bowtie2/TopHat or BWA aln followed by sampe.  BWA mem is calculated as 

       NM - Indels.  Default: 8, or reads with more than 8 mismatches will not be used. (int [=8])

  -T, --trim                  Trim bases after [INT] bases in the reads (int [=0])

  -X, --extension             Extension of bp to look for mismatches after insersion or deletion.  Default to 2 bp, or only calls 

       when they're within 2 bp. (int [=2])

  -P, --Position              The read position filter.  If the mean variants position is less that specified, it's considered 

       false positive.  Default: 5 (int [=5])

  -I, --Indel_size            The indel size.  Default: 50bp (int [=50])

      --th                    Threads count. (int [=0])

      --fisher                Experimental feature: fisher test

  -M, --Min_macth             The minimum matches for a read to be considered. If, after soft-clipping, the matched bp is less 

       than INT, then the read is discarded. It's meant for PCR based targeted sequencing where there's no 

       insert and the matching is only the primers. Default: 0, or no filtering (int [=0])

  -Y, --ref-extension         Extension of bp of reference to build lookup table. Default to 1200 bp. Increase the number will 

       slowdown the program. The main purpose is to call large indels with 1000 bit that can be missed by 

       discordant mate pairs. (int [=1200])

  -r, --minimum_reads         The minimum # of variant reads, default 2 (int [=2])

  -o, --Qratio                The Qratio of (good_quality_reads)/(bad_quality_reads+0.5).  The quality is defined by -q option.  

       Default: 1.5 (double [=1.5])

  -O, --MapQ                  The reads should have at least mean MapQ to be considered a valid variant.  

       Default: no filtering (double [=0])

  -V, --freq                  The lowest frequency in the normal sample allowed for a putative somatic mutation.  

       Defaults to 0.05 (double [=0.05])

  -f, --allele_fre            The threshold for allele frequency, default: 0.01 or 1% (double [=0.01])

  -Z, --downsample            For downsampling fraction.  e.g. 0.7 means roughly 70% downsampling.  Default: No downsampling. 

       Use with caution.  The downsampling will be random and non-reproducible. (double [=0])

      --adaptor               Filter adaptor sequences so that they aren't used in realignment. Multiple adaptors can be supplied 

       by setting them with comma, like: --adaptor ACGTTGCTC,ACGGGGTCTC,ACGCGGCTAG . (string [=])

  -J, --crispr                The genomic position that CRISPR/Cas9 suppose to cut, typically 3bp from the PAM NGG site and  

       within the guide.  For CRISPR mode only.  It will adjust the variants (mostly In-Del) start and end 

       sites to as close to this location as possible,if there are alternatives. The option should only be 

       used for CRISPR mode. (int [=0])

  -j, --CRISPR_fbp            In CRISPR mode, the minimum amount in bp that a read needs to overlap with cutting site.  If a read does not meet the criteria,

       it will not be used for variant calling, since it is likely just a partially amplified PCR.  Default: not set, or no filtering (int [=0])

      --mfreq                 The variant frequency threshold to determine variant as good in case of monomer MSI. 

       Default: 0.25 (double [=0.25])

      --nmfreq                The variant frequency threshold to determine variant as good in case of non-monomer MSI. 

       Default: 0.1 (double [=0.1])

      --out                   The out put file path. 

       Default: ./out.txt (string [=./out.txt])

  -i, --bed                   The region file to be processed (string [=])

      --version               Print FastVC version information

      --auto_resize           Auto resize the bed region size for better performance

 

 

実行方法

RabbitVarではアラインメント.bamファイル(ソートされ、インデックスが付けられているもの)、リージョンの.bedファイル(または-Rパラメータでリージョンを指定)、ゲノムファイル.faまたは.fasta(hg19.faやhg38.faなど。インデックスを付ける(.fai))が必要。

./RabbitVar \
  -R chr1:2829690-4918526  \
-G hg38.fa \
-f 0.01 -N 'TUMORSMAPLE|NORMALSAMPLE'  \
-b 'TUMOR.bam|NORMAL.bam' \
  -c 1 -S 2 -E 3 -g 4 \
  --fisher \
  --out ./out.txt
  • -G    The reference fasta. Should be indexed (.fai). (string)
  • -R     The region of interest. In the format of chr:start-end. If end is omitted, then a single position.  No BED is needed. (string [=]) 
  • -N    The sample name to be used directly.  Will overwrite -n option (string [=])
  • -b     The indexed BAM file (string)
  •  -f     The threshold for allele frequency, default: 0.01 or 1% (double [=0.01])
  • --fisher    Experimental feature: fisher test
  • --out    The out put file path
  • -c     The column for chromosome (int [=2])
  • -S     The column for region start, e.g. gene start (int [=6])
  • -E      The column for region end, e.g. gene end (int [=7])
  • -g      The column for gene name, or segment annotation (int [=12])

 

引用

RabbitVar: ultra-fast and accurate somatic small-variant calling on multi-core architectures
Zhang H, Song H, Yin Z, Chang Q, Wei Y, Niu B, Schmidt B, Liu W
Preprint from bioRxiv, 06 Jan 2023

 

インストールスクリプト69行目の

cmake -DHTS_PREFIX=$(pwd)/htslib/build -DCMAKE_INSTALL_PREFIX=$(pwd) ..

を以下に修正した。

cmake -DHTS_PREFIX=$(pwd)/htslib/build -DCMAKE_INSTALL_PREFIX=$(pwd) .