超高速・高精度な体細胞スモールバリアントコーラー rabbitvar

　次世代シーケンサー（NGS）技術の継続的な発展により、がん研究においてゲノム解析が広範囲かつ頻繁に利用されるようになった。それに伴う大規模なNGSデータセットの作成は、一般的に使用されるハードウェアプラットフォーム上で高度に最適化された高精度の体細胞バリアントコール法の必要性を確立している。著者らは、最新のマルチコアCPU上で腫瘍と正常のペアNGSデータから小さな体細胞変異を検出できるスケーラブルなバリアントコーラーRabbitVar ( https://github.com/LeiHaoa/RabbitVar ) を発表する。本アプローチは、候補探しと機械学習ベースのフィルタリング戦略を、最適化されたデータ構造とマルチスレッドで組み合わせ、高い精度と効率の両方を達成する。異なるシーケンス条件と汚染率のもと、実世界のHCC1395乳がんデータセットにおいて、RabbitVarの性能を主要な最先端コーラー（Strelka2、Mutect2、NeuSomatic、VarDict、VarScan2）と比較した。評価結果は、RabbitVarがSNVを呼び出す際に高い競争力を持つF1スコアを達成したことを示している。さらに、より難易度の高いindel variantをコールした場合にも、常に最高のF1スコアを達成した。RabbitVarは、48コアのワークステーションで、80倍の深さの腫瘍と正常ヒト全ゲノムシーケンスデータセットを20分未満で処理することができ、効率という点で、テストした他のすべてのバリアントコーラーよりも優れている。

インストール

依存

htslib
zlib (included with most Linuxes)

Github

git clone https://github.com/LeiHaoa/RabbitVar.git
cd RabbitVar/
mamba create -n rabbitvar python=3.8
conda activate rabbitvar
./install.sh #途中でpipやcondaのinstallコマンドが実行される（link）。condaの速度が気になるならmambaに書き換える。*1
cd bin/

> ./RabbitVar

usage: ./RabbitVar --Genome_fasta=string --in_bam=string [options] ...

options:

-H, --help Print this help page

-p, --pileup Do pileup regardless of the frequency

-t, --dedup Indicate to remove duplicated reads. Only one pair with same start positions will be kept

-3, --3-prime Indicate to move indels to 3-prime if alternative alignment can be achieved.

-u, --uni Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once

using forward read only.

--UN Indicate unique mode, which when mate pairs overlap, the overlapping part will be counted only once

using first read only.

--chimeric Indicate to turn off chimeric reads filtering.

--deldupvar Turn on deleting of duplicate variants. Variants in this mode are considered and outputted only if

start position of variant is inside the region interest.

-y, --verbose

-F, --Filter The hexical to filter reads using samtools. Default: 0x504 (filter 2nd alignments, unmapped reads

and duplicates). Use -F 0 to turn it off. (string [=0x504])

-z, --zero_based Indicate whether coordinates are zero-based, as IGV uses. Default: 1 for BED file or amplicon BED

file.Use 0 to turn it off. When using the -R option, it's set to 0

-k, --local_realig Indicate whether to perform local realignment. Default: 1. Set to 0 to disable it. For Ion or

PacBio, 0 is recommended. (int [=1])

-a, --amplicon Indicate it's amplicon based calling. Reads that don't map to the amplicon will be skipped. A read

pair is considered belonging to the amplicon if the edges are less than int bp to the amplicon,

and overlap fraction is at least float. Default: 10:0.95 (string [=])

-c, --column The column for chromosome (int [=2])

-G, --Genome_fasta The reference fasta. Should be indexed (.fai). (string)

-R, --Region The region of interest. In the format of chr:start-end. If end is omitted, then a single position.

No BED is needed. (string [=])

-d, --delemiter The delimiter for split region_info, default to tab "\t" (string [= ])

-n, --regular_expression The regular expression to extract sample name from BAM filenames.

Default to: /([^\/\._]+?)_[^\/]*.bam/ (string [=/([^\/\._]+?)_[^\/]*.bam/])

-N, --Name The sample name to be used directly. Will overwrite -n option (string [=])

-b, --in_bam The indexed BAM file (string)

-S, --region_start The column for region start, e.g. gene start (int [=6])

-E, --region_end The column for region end, e.g. gene end (int [=7])

-s, --seg_start The column for segment starts in the region, e.g. exon starts (int [=9])

-e, --seg_end The column for segment ends in the region, e.g. exon ends (int [=10])

-g, --gene_name The column for gene name, or segment annotation (int [=12])

-x, --numcl_extend The number of nucleotide to extend for each segment, default: 0 (int [=0])

-B, --min The minimum # of reads to determine strand bias, default 2 (int [=2])

-Q, --Quality If set, reads with mapping quality less than INT will be filtered and ignored (int [=0])

-q, --phred_score The phred score for a base to be considered a good call. Default: 25 (for Illumina) For PGM, set

it to ~15, as PGM tends to under estimate base quality. (double [=22.5])

-m, --mismatch If set, reads with mismatches more than INT will be filtered and ignored. Gaps are not counted as

mismatches. Valid only for bowtie2/TopHat or BWA aln followed by sampe. BWA mem is calculated as

NM - Indels. Default: 8, or reads with more than 8 mismatches will not be used. (int [=8])

-T, --trim Trim bases after [INT] bases in the reads (int [=0])

-X, --extension Extension of bp to look for mismatches after insersion or deletion. Default to 2 bp, or only calls

when they're within 2 bp. (int [=2])

-P, --Position The read position filter. If the mean variants position is less that specified, it's considered

false positive. Default: 5 (int [=5])

-I, --Indel_size The indel size. Default: 50bp (int [=50])

--th Threads count. (int [=0])

--fisher Experimental feature: fisher test

-M, --Min_macth The minimum matches for a read to be considered. If, after soft-clipping, the matched bp is less

than INT, then the read is discarded. It's meant for PCR based targeted sequencing where there's no

insert and the matching is only the primers. Default: 0, or no filtering (int [=0])

-Y, --ref-extension Extension of bp of reference to build lookup table. Default to 1200 bp. Increase the number will

slowdown the program. The main purpose is to call large indels with 1000 bit that can be missed by

discordant mate pairs. (int [=1200])

-r, --minimum_reads The minimum # of variant reads, default 2 (int [=2])

-o, --Qratio The Qratio of (good_quality_reads)/(bad_quality_reads+0.5). The quality is defined by -q option.

Default: 1.5 (double [=1.5])

-O, --MapQ The reads should have at least mean MapQ to be considered a valid variant.

Default: no filtering (double [=0])

-V, --freq The lowest frequency in the normal sample allowed for a putative somatic mutation.

Defaults to 0.05 (double [=0.05])

-f, --allele_fre The threshold for allele frequency, default: 0.01 or 1% (double [=0.01])

-Z, --downsample For downsampling fraction. e.g. 0.7 means roughly 70% downsampling. Default: No downsampling.

Use with caution. The downsampling will be random and non-reproducible. (double [=0])

--adaptor Filter adaptor sequences so that they aren't used in realignment. Multiple adaptors can be supplied

by setting them with comma, like: --adaptor ACGTTGCTC,ACGGGGTCTC,ACGCGGCTAG . (string [=])

-J, --crispr The genomic position that CRISPR/Cas9 suppose to cut, typically 3bp from the PAM NGG site and

within the guide. For CRISPR mode only. It will adjust the variants (mostly In-Del) start and end

sites to as close to this location as possible,if there are alternatives. The option should only be

used for CRISPR mode. (int [=0])

-j, --CRISPR_fbp In CRISPR mode, the minimum amount in bp that a read needs to overlap with cutting site. If a read does not meet the criteria,

it will not be used for variant calling, since it is likely just a partially amplified PCR. Default: not set, or no filtering (int [=0])

--mfreq The variant frequency threshold to determine variant as good in case of monomer MSI.

Default: 0.25 (double [=0.25])

--nmfreq The variant frequency threshold to determine variant as good in case of non-monomer MSI.

Default: 0.1 (double [=0.1])

--out The out put file path.

Default: ./out.txt (string [=./out.txt])

-i, --bed The region file to be processed (string [=])

--version Print FastVC version information

--auto_resize Auto resize the bed region size for better performance

実行方法

RabbitVarではアラインメント.bamファイル（ソートされ、インデックスが付けられているもの）、リージョンの.bedファイル(または-Rパラメータでリージョンを指定）、ゲノムファイル.faまたは.fasta（hg19.faやhg38.faなど。インデックスを付ける（.fai））が必要。

./RabbitVar \
  -R chr1:2829690-4918526  \
  -G hg38.fa \
  -f 0.01 -N 'TUMORSMAPLE|NORMALSAMPLE'  \
  -b 'TUMOR.bam|NORMAL.bam' \
  -c 1 -S 2 -E 3 -g 4 \
  --fisher \
  --out ./out.txt

-G The reference fasta. Should be indexed (.fai). (string)
-R The region of interest. In the format of chr:start-end. If end is omitted, then a single position. No BED is needed. (string [=])
-N The sample name to be used directly. Will overwrite -n option (string [=])
-b The indexed BAM file (string)
-f The threshold for allele frequency, default: 0.01 or 1% (double [=0.01])
--fisher Experimental feature: fisher test
--out The out put file path
-c The column for chromosome (int [=2])
-S The column for region start, e.g. gene start (int [=6])
-E The column for region end, e.g. gene end (int [=7])
-g The column for gene name, or segment annotation (int [=12])

引用

RabbitVar: ultra-fast and accurate somatic small-variant calling on multi-core architectures
Zhang H, Song H, Yin Z, Chang Q, Wei Y, Niu B, Schmidt B, Liu W
Preprint from bioRxiv, 06 Jan 2023

インストールスクリプト69行目の

cmake -DHTS_PREFIX=$(pwd)/htslib/build -DCMAKE_INSTALL_PREFIX=$(pwd) ..

を以下に修正した。

cmake -DHTS_PREFIX=$(pwd)/htslib/build -DCMAKE_INSTALL_PREFIX=$(pwd) .