ハプロタイプを考慮したロングリードからの2倍体ゲノムアセンブリを行う phasebook

　ハプロタイプを考慮した2倍体ゲノムアセンブリは、ゲノミクス、精密医療、その他多くの分野で極めて重要である。ロングリードシーケンス技術により、ゲノムアセンブリは大幅に改善された。しかし、現在のロングリードアセンブラは、リファレンスベースのためバイアスがかかっていたり、2倍体ゲノムのハプロタイプの多様性を捉えることができなかったりする。本著者らは、ロングリードから二倍体ゲノムのハプロタイプを再構成するde novoアプローチであるphasebookを発表する。phasebookは、ハプロタイプカバレッジにおいて他のアプローチを大きく上回り、さらにアセンブリエラーとアセンブリの連続性の点でも競争力のある性能を達成する。

インストール

依存

whatshap
minimap2
longshot
samtools
bcftools
fpa
overlap graph construction module from HaploConduct
error correction modules from Racon, CONSENT, MECAT2 and NECAT
g++ >=5.5.0 and with boost libraries
python3

Github

#依存ツールの導入
mamba create -n phasebook python=3.7 -y
conda activate phasebook
mamba install -c bioconda whatshap=0.18 minimap2=2.18 longshot=0.4.1 samtools=1.12 bcftools=1.12 racon=1.4.20 fpa=0.5 -y

#本体のビルド
git clone https://github.com/phasebook/phasebook.git
cd phasebook
sh install.sh

> python scripts/phasebook.py -h

usage: python phasebook.py [-h] -i INFILE [-o OUTDIR] [-t THREADS]

[-p PLATFORM] [-x] [-g GENOMESIZE]

[--overlaps OVERLAPS] [--nsplit NSPLIT]

[--run_mode RUN_MODE] [--min_cov MIN_COV]

[--min_identity MIN_IDENTITY]

[--min_read_len MIN_READ_LEN]

[--min_sread_len MIN_SREAD_LEN]

[--min_ovlp_len MIN_OVLP_LEN]

[--n_correct N_CORRECT] [--n_polish N_POLISH]

[--polish_tool POLISH_TOOL] [--max_oh MAX_OH]

[--oh_ratio OH_RATIO] [--sp_oh SP_OH]

[--sp_ohratio SP_OHRATIO]

[--sp_min_identity SP_MIN_IDENTITY]

[--sp_min_ovlplen SP_MIN_OVLPLEN]

[--max_tip_len MAX_TIP_LEN]

[--min_cluster_size MIN_CLUSTER_SIZE]

[--level LEVEL] [--rm_trans RM_TRANS]

[--trim_ends TRIM_ENDS] [--rename RENAME] [--qc QC]

[--ctg_asm CTG_ASM]

[--super_ovlp_fast SUPER_OVLP_FAST]

[--correct_mode CORRECT_MODE]

[--max_het_snps MAX_HET_SNPS]

[--min_allele_cov MIN_ALLELE_COV]

[--n_final_polish N_FINAL_POLISH]

[--sort_by_len SORT_BY_LEN] [--rm_tmp RM_TMP]

[--max_memory MAX_MEMORY]

[--superead_id SUPEREAD_ID]

[--limited_times LIMITED_TIMES]

[--max_ovlps MAX_OVLPS]

[--max_cluster_size MAX_CLUSTER_SIZE] [--version]

Haplotype-aware de novo assembly of diploid genome from long reads

optional arguments:

-h, --help show this help message and exit

-i INFILE, --infile INFILE

input file in FASTA/FASTQ format (default: None)

-o OUTDIR, --outdir OUTDIR

output directory (default: .)

-t THREADS, --threads THREADS

number of threads (default: 1)

-p PLATFORM, --platform PLATFORM

sequencing platform(PacBio CLR/PacBio HiFi/Oxford

Nanopore): [pb/hifi/ont] (default: pb)

-x, --preset use preset parameters (default: False)

-g GENOMESIZE, --genomesize GENOMESIZE

genome size: small/large (default: small)

--overlaps OVERLAPS input file in PAF format (default: None)

--nsplit NSPLIT number of splitted input fasta/fastq files (default:

--run_mode RUN_MODE running mode: local/hpc, hpc is not available yet

(default: local)

--min_cov MIN_COV min coverage for trimming consensus (default: 4.0)

--min_identity MIN_IDENTITY

min identity for filtering overlaps (default: 0.75)

--min_read_len MIN_READ_LEN

min read length for processing (default: 1000)

--min_sread_len MIN_SREAD_LEN

min seed read length (default: 1000)

--min_ovlp_len MIN_OVLP_LEN

min overlap length for super reads construction

(default: 1000)

--n_correct N_CORRECT

times for self error correction of raw reads (default:

--n_polish N_POLISH times for super reads polishing (default: 2)

--polish_tool POLISH_TOOL

tool used for polishing contigs: [racon/hypo]. Note:

hypo is recommended if using HiFi reads (default:

racon)

--max_oh MAX_OH max overhang length (default: 1000)

--oh_ratio OH_RATIO max overhang to mapping length ratio (default: 0.8)

--sp_oh SP_OH max overhang length (default: 10)

--sp_ohratio SP_OHRATIO

max overhang to mapping length ratio (default: 0.8)

--sp_min_identity SP_MIN_IDENTITY

super reads min identity for filtering overlaps

(default: 0.98)

--sp_min_ovlplen SP_MIN_OVLPLEN

min overlap length for super reads construction

(default: 1000)

--max_tip_len MAX_TIP_LEN

max length to be removed as tips (default: 1000)

--min_cluster_size MIN_CLUSTER_SIZE

min size of read clusters (default: 4)

--level LEVEL iteration times of clustering read (default: 1)

--rm_trans RM_TRANS choose to (0) keep all edges, (1) remove transitive

edges, (2) to remove double transitive edges (default:

--trim_ends TRIM_ENDS

trim the erroneous bases in both ends, should be

either True or False (default: False)

--rename RENAME rename read name or not, should be either True or

False (default: True)

--qc QC quality control for input reads or not, should be

either True or False, TODO (default: False)

--ctg_asm CTG_ASM method to assemble super reads: [rb/naive/iter], rb is

recommended (default: rb)

--super_ovlp_fast SUPER_OVLP_FAST

compute super read overlaps using fast mode or not,

should be either True or False (default: False)

--correct_mode CORRECT_MODE

method to correct raw reads: [msa/hybrid], msa is much

faster than hybrid, which is recommended for large

genomes (default: hybrid)

--max_het_snps MAX_HET_SNPS

maximum number of heterozygous SNPs to determine the

contig overlap is from the identical haplotype or not

(default: 0)

--min_allele_cov MIN_ALLELE_COV

number of observations of each allele (default: 4)

--n_final_polish N_FINAL_POLISH

polish times for final contigs (default: 1)

--sort_by_len SORT_BY_LEN

sort super reads by read length or not(by number of

overlaps), should be either True or False (default:

True)

--rm_tmp RM_TMP remove temp files or not, should be either True or

False (default: True)

--max_memory MAX_MEMORY

max memory to use (default: 50G)

--superead_id SUPEREAD_ID

superead id required if running on HPC (default: None)

--limited_times LIMITED_TIMES

max times used for read (default: 5)

--max_ovlps MAX_OVLPS

max number of overlaps for read (default: 10000)

--max_cluster_size MAX_CLUSTER_SIZE

max cluster size (default: 100000)

--version, -v show the version

Built by: Xiao Luo

テストラン

入力リードファイルはFASTAまたはFASTQ形式。FASTAファイルでは、1行ごとのリードで、配列途中のラップは不可になっている。

PacBioのリード

cd phasebook/example/

#HiFi reads(small genome)
python ../scripts/phasebook.py -i reads.fa -t 8 -p hifi -g small -x 

#CLR reads(small genome)
python ../scripts/phasebook.py -i reads.fa -t 8 -p pb -g small -x

-i input file in FASTA/FASTQ format (default: None)
-o output directory (default: .)
-t number of threads (default: 1)
-p sequencing platform(PacBio CLR/PacBio HiFi/Oxford Nanopore): [pb/hifi/ont] (default: pb)
-x use preset parameters (default: False)
-g genome size: small/large (default: small)

出力

f:id:kazumaxneo:20220227235342p:plain

ONTのリード

#small genome
python phasebook.py -i reads.fa -t 8 -p ont -g small -x

#large genome
python phasebook.py -i reads.fa -t 8 -p ont -g large -x

レポジトリより

また、ヒトゲノムのような非常に大きなゲノムを扱う場合、HPC上でphasebookを実行することができる。
-g は、ゲノムの大きさが小さいか大きいかを設定する。gを大きくすると、リードのオーバーラップ計算やフィルタリング、シーケンスエラー補正がより効率的に行われれるが、アセンブルのパフォーマンスが低下する可能性がある。

引用

phasebook: haplotype-aware de novo assembly of diploid genomes from long reads
Xiao Luo, Xiongbin Kang & Alexander Schönhuth
Genome Biology volume 22, Article number: 299 (2021)