高感度なバリアントコーラー LoFreq - macでインフォマティクス

2020 1/5 ツイッターリンク追記、12/15 help更新

2023/07/24 追記

　シーケンシング技術の最近の進歩により、細胞集団におけるheterogeneityおよびsub-populationsのより広範な研究、およびそれらの進化による「コンセンサス配列」からの移行が可能になった。このような‘population perspective’ な視点は、ウイルス性疑似種（Viral quasispecies、wiki）およびウィルスの宿主内バリエーション（論文より ref.1,2）、バクテリアのsub-populations（ref.3-5）、がんのサブクローン進化（ref.6-8）など適用可能な様々なアプリケーションを持つ。これらの研究における集団構造（およびレアな亜集団）の正確な特徴付けは、宿主応答または抗生物質曝露のダイナミクスにとって基本的である。いくつかの最近のガンシーケンシング研究はレアなsub-populationsおよびバリアントの機能的役割を、腫瘍のgrowth、抗生物質耐性、転移（9,10）およびそれらを研究するための計算ツールの観点でさらに強調している。

　原理的には、大規模並列シーケンシングの高スループットにより、レアなsub-populationsのサンプリングが可能になる。しかし、シークエンシングエラーはpopulationsにおける真の変異の決定を複雑にする。シークエンシングエラー率は、テクノロジー、ラン、レーン、マルチプレックス、ゲノムロケーション、および置換タイプ（ref.11〜13）により著しく異なることが知られている。これらの問題を解決するためのアプローチが研究されてきたが、バリアントコール法の大部分は、関心のある個別の頻度（すなわち0,0.5および1; 低カバレッジのヒトリシーケンシングデータおよび二倍体コール（ref.14-16）関連の一連の方法は、プールされたシーケンシングデータ（17-20）において二倍体genotypesをコールするために調整されたものであり、一般には適用されない）。

　RNAウイルスの研究はこの規則（ref.21-24）の例外を提供している。 RNAウイルスは、RNA依存性DNAポリメラーゼのプルーフリーディング能力が不十分または欠損しているために高い突然変異率および高い複製速度を有し、結果として生じる宿主内変異は抗生物質治療戦略（ref.25）、生ワクチンの遺伝子モニタリングにも影響がある（ref.26）。これらの研究で使用された手法は ad hoc なトリミング、フィルタリング、閾値を使用してバリアントコールに頼っており、そのため感度や広範な適用性が制限されている（サンプルまたはシーケンスごとの手動調整が必要）。 Breseq（ref.27,28）やSNVer（ref.29）などのモデルベースの手法は、潜在的により敏感で一般的だが、単純な二項モデルに依存しており、エラー率の偏りに対応するように調整されていない。シーケンシングエラーの影響を減らすフェージングに依存し、454シークエンシングに合わせたより洗練されたアプローチが、ウイルスデータセットに適用されている（ref.30）。しかし、この方法は他の技術に直接適用することはできず、大きなゲノムやシーケンシングデータセットでは実行できない。

　シーケンシングを用いて細胞のゲノム状態をモニタリングする新しい臨床応用では、population内のレアなバリアントを検出し、検出限界の末端までそれを行う能力は、未達成の重要な能力である。一方で、バリアントコールの感受性の増加は、レアだが重要なsub-populations（e.g. cancer stem cell mutations）をモニターすることを可能にする。他方、薬剤耐性sub-populationsの早期検出には感度が不可欠である（e.g. with antiretroviral drugs for HIV）。そのような設定では、ad hocなアプローチは適合性およびロバストさを欠いており、バリアント検出感度において人工的なキャップを被る可能性がある。シミュレーションエラーの正確なモデリングは、感度限界を押し上げるために不可欠であり、取り組むべき必要性がある。

　この研究では、シークエンシングエラー率の偏りの公式モデルに基づいて、高カバレッジシークエンシングデータセットからシングルヌクレオチドバリアント（SNV）をコールための、高感度で堅牢なアプローチを提示する。近似に頼ることなく、このモデルでは厳密な統計的検定を効率的に行うことができ、大きなゲノムや高カバレッジのデータセットを正確に解析することができる。結果のメソッドLoFreqは、シーケンス実行およびポジション特異的なシーケンスバイアスに自動適応し、データセットの平均シーケンシングエラーレートより低い頻度のSNVを呼び出すことができる。 LoFreqの堅牢性、感度および特異性は、いくつかのシミュレートされたデータセットと実際のデータセット（ウイルス、バクテリアおよびヒト）および2つの実験プラットフォーム（FluidigmおよびSequenom）を用いて検証された。レアなsomatic SNV（胃がんのエキソームシーケンシングデータセット）コールにLoFreqを適用し、臨床研究（ヌクレオシドアナログ薬Balapiravir）で治療前後のデングウイルス quasi species を研究した結果は、さらに堅牢性と多様性を強調している。

　LoFreq（バージョン2）は、次世代シーケンシングデータからSNVとindelを推論するための、高速で機密性の高いバリアントコーラー。これは、通常他の方法によって無視される、ベースコールクオリティおよびシーケンスに固有の他のエラー源（例えばindelアラインメントの不確実さ）を十分に利用する。LoFreq *は、シーケンシングマシンまたはシーケンシング技術に依存しない閾値を使うため、ほぼすべてのタイプのアライメントしたデータ（例えばIllumina、IonTorrentまたはPacbio）で使用できる。またカバレッジおよびシーケンスクオリティ変化に自動的に適応するので、様々なデータセットにも適用できる（例えばウイルス、バクテリア、メタゲノムまたは体細胞シーケンシングデータなど）。LoFreqは非常に敏感で、平均ベースコールクオリティ（すなわち、シーケンシングエラー率）を下回るバリエーションを予測できる。各バリアントコールには、false postiveの確率をコントロールできるp値が割り当てられる。近似やヒューリスティックは使用しない。（擬似）並列実装も行われている。カバレッジに自動適応するので、カバレッジの高いデータや大きなゲノムに利用できる。単一プロセッサーでは、600XカバレッジのE.coliゲノムのSNVコールでおよそ1時間、100Xカバレッジのヒトexomeデータセットではおよそ1時間かかる。

LoFreq · Fast and sensitive variant calling from next-gen sequencing data

マニュアル

Commands · LoFreq

LoFreqに関するツイート。

There's life in the old dog yet: #lofreq 2.1.4 released! The one big change is the proper use of the @htslib API implemented by @jomarnz (huge thanks!). Please use `./configure --with-htslib`, which replaces the old SAMTOOLS and HTSLIB variables. Blog post to follow soon.
— Andreas Wilm (@me_myself_andY) 2020年1月4日

インストール

mac os10.13でテストした。

依存（ソースからコンパイルする場合に必要）

a C compiler (e.g. gcc or clang)
a Python 2.7 or Python 3 interpreter
zlib developer files
a compiled version of samtools 1.1
a compiled version of htslib 1.1; use the one that comes bundled with samtools!)

Github

SourceForgeにバイナリが用意されている。またcondaでもインストールできる。

https://sourceforge.net/projects/lofreq/files/

ダウンロードしてパスを通す。condaなら

mamba install -c bioconda lofreq

> lofreq

$ lofreq

| ____|

| _ \ | __| _ \ _` |

| ( | __| | __/ ( |

_____| \___/ _| _| \___| \__, |

Fast and sensitive inference of SNVs and indels

Usage: lofreq <command> [options]

Main Commands:

call : Call variants

call-parallel : Call variants in parallel

somatic : Call somatic variants

Preprocessing Commands

viterbi : Viterbi realignment

indelqual : Insert indel qualities

alnqual : Insert base and indel alignment qualities

Other Commands:

checkref : Check that reference fasta and BAM file match

filter : Filter variants in VCF file

uniq : Test whether variants predicted in only one sample really are unique

plpsummary : Print pileup summary per position

vcfset : VCF set operations

version : Print version info

Samtools Clones:

faidx : Create index for fasta file

index : Create index for BAM file

idxstats : Print stats for indexed BAM file

Extra Tools (if installed):

vcfplot : Plot VCF statistics

> lofreq call -h

$ lofreq call -h

lofreq call: call variants from BAM file

Usage: lofreq call [options] in.bam

Options:

- Reference:

-f | --ref FILE Indexed reference fasta file (gzip supported) [null]

- Output:

-o | --out FILE Vcf output file [- = stdout]

- Regions:

-r | --region STR Limit calls to this region (chrom:start-end) [null]

-l | --bed FILE List of positions (chr pos) or regions (BED) [null]

- Base-call quality:

-q | --min-bq INT Skip any base with baseQ smaller than INT [6]

-Q | --min-alt-bq INT Skip alternate bases with baseQ smaller than INT [6]

-R | --def-alt-bq INT Overwrite baseQs of alternate bases (that passed bq filter) with this value (-1: use median ref-bq; 0: keep) [0]

-j | --min-jq INT Skip any base with joinedQ smaller than INT [0]

-J | --min-alt-jq INT Skip alternate bases with joinedQ smaller than INT [0]

-K | --def-alt-jq INT Overwrite joinedQs of alternate bases (that passed jq filter) with this value (-1: use median ref-bq; 0: keep) [0]

- Base-alignment (BAQ) and indel-aligment (IDAQ) qualities:

-B | --no-baq Disable use of base-alignment quality (BAQ)

-A | --no-idaq Don't use IDAQ values (NOT recommended under ANY circumstances other than debugging)

-D | --del-baq Delete pre-existing BAQ values, i.e. compute even if already present in BAM

-e | --no-ext-baq Use 'normal' BAQ (samtools default) instead of extended BAQ (both computed on the fly if not already present in lb tag)

- Mapping quality:

-m | --min-mq INT Skip reads with mapping quality smaller than INT [0]

-M | --max-mq INT Cap mapping quality at INT [255]

-N | --no-mq Don't merge mapping quality in LoFreq's model

- Indels:

--call-indels Enable indel calls (note: preprocess your file to include indel alignment qualities!)

--only-indels Only call indels; no SNVs

- Source quality:

-s | --src-qual Enable computation of source quality

-S | --ign-vcf FILE Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas

-T | --def-nm-q INT If >= 0, then replace non-match base qualities with this default value [-1]

- P-values:

-a | --sig P-Value cutoff / significance level [0.010000]

-b | --bonf Bonferroni factor. 'dynamic' (increase per actually performed test) or INT ['dynamic']

- Misc.:

-C | --min-cov INT Test only positions having at least this coverage [1]

(note: without --no-default-filter default filters (incl. coverage) kick in after predictions are done)

-d | --max-depth INT Cap coverage at this depth [1000000]

--illumina-1.3 Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded

--use-orphan Count anomalous read pairs (i.e. where mate is not aligned properly)

--plp-summary-only No variant calling. Just output pileup summary per column

--no-default-filter Don't run default 'lofreq filter' automatically after calling variants

--force-overwrite Overwrite any existing output

--verbose Be verbose

--debug Enable debugging

ラン

利用するbamは、false positiveを防ぐため、予めPCR duplicationを除き、indel周辺のリアライメントを行って、GATKのBQSR（ヒトでは既知バリアントデータベースがあるので利用可能）（公式）を行うことが推奨されている。

バリアントコール。bamのidnexとfastaのindexは前もって作っておく必要がある。

samtools faidx ref.fa
samtools index short.bam
lofreq call -f ref.fa -o vars.vcf aln.bam

#20並列
lofreq call-parallel --pp-threads 20 -f ref.fa -o vars.vcf aln.bam

--call-indels Enable indel calls (note: preprocess your file to include indel alignment qualities!)
-a P-Value cutoff / significance level [0.010000]
-C Test only positions having at least this coverage [1]
-d Cap coverage at this depth [1000000]
--illumina-1.3 Assume the quality is Illumina-1.3-1.7/ASCII+64 encoded
--verbose Be verbose
-S Ignore variants in this vcf file for source quality computation. Multiple files can be given separated by commas

normalと比較してtumorのバリアントをコール。

lofreq somatic -n normal.bam -t tumor.bam -f hg19.fa \
 --threads 8 -o output -d dbsnp.vcf.gz

高感度で高カバレッジなデータセットにも自動適応できることを利用して、バクテリアの研究室進化実験（シミュレーション）でのheterogenicなバリアントコールにも使われています。

https://www.ncbi.nlm.nih.gov/pubmed/28224054