HAT: a haplotype assembly tool using short and long reads

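A haplotype is the set of alleles that co-occur on a single chromosome and are inherited together by the next generation. A haploid reference genome loses this co-occurrence information, so it is of limited use for associating phenotypes with the allelic combinations of genotypes. Methods that reconstruct complete haplotypes from DNA sequencing data are therefore important. Several haplotype reconstruction methods have been proposed in recent years, but significant limitations remain. In particular, high-quality contiguous haplotypes cannot be built reliably when there are few differences between the homologous chromosomes.
The authors introduce HAT, a haplotype assembly tool that reconstructs haplotypes from short reads, long reads, and a reference genome. HAT tries to combine the accuracy of short reads with the length of long reads. They tested HAT on the aneuploid yeast Saccharomyces pastorianus CBS1483 and on multiple simulated polyploid data sets of the same strain, and report that it outperforms existing tools.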
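
To make the "lost co-occurrence information" concrete, here is a minimal sketch in Python. The two diploid samples and their two SNP sites are hypothetical and invented for illustration; the code is not part of HAT. It shows that two samples can have identical unphased genotypes while carrying different haplotypes.

```python
from collections import Counter


def unphased_genotypes(haplotypes):
    """Collapse haplotypes into per-site allele counts (phase is discarded)."""
    n_sites = len(haplotypes[0])
    return [Counter(h[i] for h in haplotypes) for i in range(n_sites)]


# Two hypothetical diploid samples, each with two SNP sites.
sample_a = ["AC", "GT"]  # haplotypes A-C and G-T
sample_b = ["AT", "GC"]  # haplotypes A-T and G-C

same_genotypes = unphased_genotypes(sample_a) == unphased_genotypes(sample_b)
same_haplotypes = sorted(sample_a) == sorted(sample_b)
print(same_genotypes, same_haplotypes)  # prints: True False
```

Both samples are heterozygous A/G at the first site and C/T at the second, so only phasing can tell them apart.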

Installation

I created a conda environment and installed HAT with pip.

Dependencies

Python
pysam
Biopython
numpy
matplotlib
seaborn
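
As a quick sanity check after installation, the following sketch reports which of the listed dependencies cannot be imported in the active environment. The helper name `missing_dependencies` is made up for this example; the module names are the standard import names of the packages above (Biopython is imported as `Bio`).

```python
from importlib.util import find_spec

# Package name as listed above -> top-level module name used for import.
REQUIRED_MODULES = {
    "pysam": "pysam",
    "Biopython": "Bio",
    "numpy": "numpy",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
}


def missing_dependencies(modules=REQUIRED_MODULES):
    """Return the package names whose module cannot be found."""
    return [pkg for pkg, mod in modules.items() if find_spec(mod) is None]


missing = missing_dependencies()
print("missing:", missing or "none")
```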

GitHub

mamba create -n hat python=3.9
conda activate hat

git clone https://github.com/AbeelLab/hat
cd hat
pip install .

# Alternatively, install the released package with pip or with conda/mamba
pip install HAT-phasing

mamba install -c bioconda hat-phasing -y

> HAT -h

usage: HAT [-h] [-rl READ_LENGTH] [-pl PHASING_LOCATION] [-r REFERENCE_FILE]
           [-lf LONGREADS_FASTA] [-sf1 SHORTREADS_1_FASTQ]
           [-sf2 SHORTREADS_2_FASTQ] [-th TRUE_HAPLOTYPES]
           [-ma MULTIPLE_GENOME_ALIGNMENT] [-ha HAPLOTYPE_ASSEMBLY]
           chromosome_name vcf_file short_read_alignment long_read_alignment
           ploidy output output_dir

positional arguments:
  chromosome_name       The chromosome which is getting phased
  vcf_file              VCF file name
  short_read_alignment  short reads alignment file
  long_read_alignment   long reads alignment file
  ploidy                ploidy of the chromosome
  output                output prefix file name
  output_dir            output directory

optional arguments:
  -h, --help            show this help message and exit
  -rl READ_LENGTH, --read_length READ_LENGTH
                        short reads length
  -pl PHASING_LOCATION, --phasing_location PHASING_LOCATION
                        the location in the chromosome which is phased
  -r REFERENCE_FILE, --reference_file REFERENCE_FILE
                        reference file
  -lf LONGREADS_FASTA, --longreads_fasta LONGREADS_FASTA
                        long reads fasta file
  -sf1 SHORTREADS_1_FASTQ, --shortreads_1_fastq SHORTREADS_1_FASTQ
                        first pair fastq file
  -sf2 SHORTREADS_2_FASTQ, --shortreads_2_fastq SHORTREADS_2_FASTQ
                        second pair fastq file
  -th TRUE_HAPLOTYPES, --true_haplotypes TRUE_HAPLOTYPES
                        the correct haplotypes file
  -ma MULTIPLE_GENOME_ALIGNMENT, --multiple_genome_alignment MULTIPLE_GENOME_ALIGNMENT
                        Multiple genome alignment file of haplotypes to the reference
  -ha HAPLOTYPE_ASSEMBLY, --haplotype_assembly HAPLOTYPE_ASSEMBLY
                        Assembly of the haplotype sequences

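Test run

Running HAT requires a VCF file containing SNPs and two sorted BAM files, one from aligning the short reads and one from aligning the long reads to the reference. The test data were mapped with bwa mem and minimap2, variants were called with FreeBayes, and SNPs were selected with vcffilter.

Run it.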
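
To illustrate what a SNP-only VCF means in practice, here is a minimal Python sketch that keeps only records whose REF and ALT alleles are all single nucleotides. This is a simplified definition assumed for the example, not the actual vcffilter expression used to prepare the test data, and the records below are hypothetical.

```python
def is_snp_record(vcf_line):
    """Return True if REF and every ALT allele are single nucleotides."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref = fields[3].upper()
    alts = fields[4].upper().split(",")
    bases = {"A", "C", "G", "T"}
    return ref in bases and all(alt in bases for alt in alts)


def keep_snps(vcf_lines):
    """Keep header lines (starting with '#') and SNP records only."""
    return [ln for ln in vcf_lines if ln.startswith("#") or is_snp_record(ln)]


# Hypothetical records, made up for illustration (not from the HAT test data).
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrX\t101\t.\tA\tG\t50\t.\t.",    # SNP
    "chrX\t205\t.\tAT\tA\t50\t.\t.",   # deletion
    "chrX\t310\t.\tC\tG,T\t50\t.\t.",  # multi-allelic SNP
    "chrX\t415\t.\tG\tGA\t50\t.\t.",   # insertion
]
snps_only = keep_snps(example)
print(len(snps_only))  # 2 header lines + 2 SNP records = 4
```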

cd hat/Example/haplosim-triploid-CP048984.1-highhetero/
HAT -r CP048984.1.fna CP048984.1 snp-var.vcf.gz short_reads_alignment.sorted.bam \
long_reads_alignment.sorted.bam 3 hat_output results/
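
HAT phases one chromosome per call, so when several chromosomes need phasing it can help to build the command line programmatically. The sketch below follows the argument order shown in the `HAT -h` usage above; the helper names `build_hat_command` and `run_hat` are made up for this example and are not part of HAT.

```python
import subprocess


def build_hat_command(chromosome, vcf, short_bam, long_bam, ploidy,
                      output_prefix, output_dir, reference=None):
    """Assemble the HAT argument list in the order given by `HAT -h`."""
    cmd = ["HAT"]
    if reference is not None:
        cmd += ["-r", reference]  # optional reference file
    cmd += [chromosome, vcf, short_bam, long_bam,
            str(ploidy), output_prefix, output_dir]
    return cmd


def run_hat(cmd):
    """Run HAT; raises CalledProcessError if it exits with a non-zero status."""
    subprocess.run(cmd, check=True)


# Same arguments as the test run above.
cmd = build_hat_command(
    "CP048984.1", "snp-var.vcf.gz",
    "short_reads_alignment.sorted.bam", "long_reads_alignment.sorted.bam",
    3, "hat_output", "results/", reference="CP048984.1.fna",
)
print(" ".join(cmd))
```

Calling `run_hat(cmd)` from the example directory would then start the same run as the shell command.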

For chromosome CP048984.1 of the test data, four files are written to the results directory, including:

hat_output_ploidy_blocks figure
hat_output_phase_matrix
hat_output_phased_blocks
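
A ploidy block is a region with sufficient differences among the haplotypes. HAT works within these regions to find the alleles of each haplotype.

The phase_matrix output contains the alleles of the haplotypes at the SNP loci.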

Citation

HAT: Haplotype Assembly Tool using short and error-prone long reads
Ramin Shirali Hossein Zade, Aysun Urhan, Alvaro Assis de Souza, Akash Singh,
Thomas Abeel

bioRxiv, posted July 21, 2022