データが同じサンプルに由来するかどうかをvariant callingから判定する BAM-matcher

　特に、シーケンススループットの高いプロジェクトや施設（Koboldt et al。、2010）においては、ミスラベルやミックスアップはよくある問題である。次世代シーケンシング（NGS）データを扱う場合、誤ったラベルのサンプルは誤ったデータ処理と分析につながり、矛盾する結果や誤った結論につながる。たとえエラーが特定されても、貴重なデータリソースが無駄になり、分析や診断が遅れることがある。

　主要なゲノムセンターでの一般的なプラクティスは、カスタム高密度SNPパネルを使用して各サンプルのSNPプロファイルを作成する（論文より　Koboldt et al、2010）。これらのパネルの背後にある原則は、サンプルマッチングに使用できる各サンプルの独立して生成されたSNPプロファイルを提供することである。

　しかしNGS技術のコストが着実に減少しているため、１研究室で腫瘍と正常サンプルを用いた体細胞変異の解析、（家系の）縦方向のサンプルを用いた疾患進行のモニタリング、または同一家系の複数の個体由来の試料を用いて、希少疾患における病原性変異体を探すことが可能になっている。これらのNGSデータには既にかなりの量の遺伝子型情報が含まれているため、これらのデータをサンプルマッチングに直接使用して、追加のプロファイリングを実行するための追加リソースやサンプルを増やす必要はない。

　既存のNGSデータを使用したサンプルマッチングを容易にするために、BAMファイル（http://samtools.github.io/hts-specs）の迅速なペアワイズ比較を提供するツールであるBAM-matcherを開発した。所定のゲノム位置におけるサンプル遺伝子型を比較することによって決定される。 BAM-matcherには、次の機能がある。まず、使用が簡単で、処理パイプラインの初期段階で展開できる。第2に、BAM-matcherは非常に高速である。genotype-calling を所定のポジションに限定することによりIntel Xeon E5、3.6 GHzの環境で〜2分で2つのサンプルを比較することができる。 BAM-matcherがサンプルの遺伝子型データをキャッシュするので、以前に計算されたサンプルを含むその後の比較は、はるかに速く（約1秒）、大きなコホートの全体的な処理時間を大幅に短縮する可能性がある。第3に、BAM-matcherは柔軟性がある。全ゲノムシーケンス（WGS）、WESおよびRNA-seqデータを含む様々なタイプのNGSデータを比較することができる。適切なゲノムのリファレンスおよびSNPの適切なリストが提供される場合、BAM-マッチャーは非ヒトゲノムにも使用され得る。

wiki

https://bitbucket.org/sacgf/bam-matcher/wiki/Installation

インストール

mac os10.13のpython2.7.10でテストした。

依存

PyVCF
ConfigParser
Cheetah
pysam (requires python-dev and zlib1g-dev libraries)
fisher (requires numpy and python-dev libraries).

pythonライブラリはいずれもpipで導入できる。その他バリアントコーラーも必要になる。

Genome Analysis Toolkit (GATK) https://www.broadinstitute.org/gatk/
Freebayes https://github.com/ekg/freebayes
VarScan http://varscan.sourceforge.net/

本体　bitbucket

https://bitbucket.org/sacgf/bam-matcher

git clone https://bitbucket.org/sacgf/bam-matcher.git
cd bam-matcher/

> python bam-matcher.py -h

$ python bam-matcher.py -h

usage: bam-matcher.py [-h] --bam1 BAM1 --bam2 BAM2 [--config CONFIG]

[--generate-config GENERATE_CONFIG] [--output OUTPUT]

[--short-output] [--html] [--no-report]

[--scratch-dir SCRATCH_DIR] [--vcf VCF]

[--caller {gatk,freebayes,varscan}]

[--dp-threshold DP_THRESHOLD]

[--number_of_snps NUMBER_OF_SNPS] [--fastfreebayes]

[--gatk-mem-gb GATK_MEM_GB] [--gatk-nt GATK_NT]

[--varscan-mem-gb VARSCAN_MEM_GB]

[--reference REFERENCE] [--ref-alternate REF_ALTERNATE]

[--chromosome-map CHROMOSOME_MAP]

[--about-alternate-ref ABOUT_ALTERNATE_REF]

[--do-not-cache] [--recalculate] [--cache-dir CACHE_DIR]

[--experimental] [--allele-freq] [--debug] [--verbose]

Compare two BAM files to see if they are from the same samples, using

frequently occuring SNPs reported in the 1000Genome database

optional arguments:

-h, --help show this help message and exit

--debug, -d Debug mode. Temporary files are not removed

--verbose, -v Verbose reporting. Default = False

REQUIRED:

--bam1 BAM1, -B1 BAM1

First BAM file

--bam2 BAM2, -B2 BAM2

Second BAM file

CONFIGURATION:

--config CONFIG, -c CONFIG

Specify configuration file (default =

/dir/where/script/is/located/bam-matcher.conf)

--generate-config GENERATE_CONFIG, -G GENERATE_CONFIG

Specify where to generate configuration file template

OUTPUT REPORT:

--output OUTPUT, -o OUTPUT

Specify output report path (default =

/current/dir/bam_matcher.SUBFIX)

--short-output, -so Short output mode (tab-separated).

--html, -H Enable HTML output. HTML file name = report + '.html'

--no-report, -n Don't write output to file. Results output to command

line only.

--scratch-dir SCRATCH_DIR, -s SCRATCH_DIR

Scratch directory for temporary files. If not

specified, the report output directory will be used

(default = /tmp/[random_string])

VARIANTS:

--vcf VCF, -V VCF VCF file containing SNPs to check (default can be

specified in config file instead)

CALLERS AND SETTINGS (will override config values):

--caller {gatk,freebayes,varscan}, -CL {gatk,freebayes,varscan}

Specify which caller to use (default = 'gatk')

--dp-threshold DP_THRESHOLD, -DP DP_THRESHOLD

Minimum required depth for comparing variants

--number_of_snps NUMBER_OF_SNPS, -N NUMBER_OF_SNPS

Number of SNPs to compare.

--fastfreebayes, -FF Use --targets option for Freebayes.

--gatk-mem-gb GATK_MEM_GB, -GM GATK_MEM_GB

Specify Java heap size for GATK (GB, int)

--gatk-nt GATK_NT, -GT GATK_NT

Specify number of threads for GATK UnifiedGenotyper

(-nt option)

--varscan-mem-gb VARSCAN_MEM_GB, -VM VARSCAN_MEM_GB

Specify Java heap size for VarScan2 (GB, int)

REFERENCES:

--reference REFERENCE, -R REFERENCE

Default reference fasta file. Needs to be indexed with

samtools faidx. Overrides config settings.

--ref-alternate REF_ALTERNATE, -R2 REF_ALTERNATE

Alternate reference fasta file. Needs to be indexed

with samtools faidx. Overrides config settings.

--chromosome-map CHROMOSOME_MAP, -M CHROMOSOME_MAP

Required when using alternate reference. Run BAM-

matcher with --about-alternate-ref for more details.

--about-alternate-ref ABOUT_ALTERNATE_REF, -A ABOUT_ALTERNATE_REF

Print information about using --alternate-ref and

--chromosome-map

BATCH OPERATIONS:

--do-not-cache, -NC Do not keep variant-calling output for future

comparison. By default (False) data is written to

/bam/filepath/without/dotbam.GT_compare_data

--recalculate, -RC Don't use cached variant calling data, redo variant-

calling. Will overwrite cached data unless told not to

(-NC)

--cache-dir CACHE_DIR, -CD CACHE_DIR

Specify directory for cached data. Overrides

configuration

EXPERIMENTAL:

--experimental, -X Enable experimental features

--allele-freq, -AF Plot variant allele frequency graphs

ラン

テストラン。configファイルを準備。

#configのテンプレートをコピー。
cp bam-matcher.conf.template bam-matcher.conf

テストランを行うには、.confの以下の行を修正する必要がある。

38行目: VCF_file: ~/bam-matcher/test_data/1kg.exome.highAF.1511.vcf #他にもvcfは2つある。

53行目: REFERENCE: ~/reference/hg19.fa

61行目: CACHE_DIR: ~/temp #logやテンポラルなファイル置き場

ここで使っているhg19のヘッダーはchr1,chr2...でなく1,2...の番号。ヘッダーが違うならsedで置換する（sed -e "s/chr//g" hg19.fa > hg19_rename.fa）。

バリアントコーラーはデフォルトではfreebayesになっている。他のツールに変更したいなら修正する。合わせてバリアントコーラーのパラメータや使用スレッド数も環境に合わせ修正する。

16行目: caller: freebayes

準備できたら実行する。調べたいbamファイル2つを指定する。

python bam-matcher.py -B1 test_data/sample1.bam -B2 test_data/sample2.bam -o output_report.txt

出力

depth threshold: 15

________________________________________

Positions with same genotype: 163

breakdown: hom: 71

het: 92

________________________________________

Positions with diff genotype: 108

breakdown:

BAM 1

| het | hom | subset

-------+------+------+-------

het | 1 | 0 | 31 |

-------+------+------+-------

BAM 2 hom | 0 | 21 | - |

-------+------+------+-------

subset| 55 | - | - |

________________________________________

Total sites compared: 271

Fraction of common: 0.601476 (163/271)

________________________________________

CONCLUSION:

LIKELY FROM DIFFERENT SOURCES

——

“Subset” denotes the case where the site genotype of the sample is homozygous to one of the heterozygous alleles in the other sample" (supplenentaryより)

Fraction of common: 0.601476 (163/271)になり、違うサンプルと結論された。ちなみに同じサンプルだと判断されると"BAM FILES ARE FROM THE SAME SOURCE"が出る。

定義、理屈については、Supplementary Text S2を確認してください。

こちらもどうぞ。

引用

BAM-matcher: a tool for rapid NGS sample matching

Paul P.S. Wang Wendy T. Parker Susan Branford Andreas W. Schreiber

Bioinformatics, Volume 32, Issue 17, 1 September 2016, Pages 2699–2701