ハプロタイプを考慮した2倍体ゲノムアセンブリは、ゲノミクス、精密医療、その他多くの分野で極めて重要である。ロングリードシーケンス技術により、ゲノムアセンブリは大幅に改善された。しかし、現在のロングリードアセンブラは、リファレンスベースのためバイアスがかかっていたり、2倍体ゲノムのハプロタイプの多様性を捉えることができなかったりする。本著者らは、ロングリードから二倍体ゲノムのハプロタイプを再構成するde novoアプローチであるphasebookを発表する。phasebookは、ハプロタイプカバレッジにおいて他のアプローチを大きく上回り、さらにアセンブリエラーとアセンブリの連続性の点でも競争力のある性能を達成する。
インストール
依存
- whatshap
- minimap2
- longshot
- samtools
- bcftools
- fpa
- overlap graph construction module from HaploConduct
- error correction modules from Racon, CONSENT, MECAT2 and NECAT
- g++ >=5.5.0 and with boost libraries
- python3
#依存ツールの導入
mamba create -n phasebook python=3.7 -y
conda activate phasebook
mamba install -c bioconda whatshap=0.18 minimap2=2.18 longshot=0.4.1 samtools=1.12 bcftools=1.12 racon=1.4.20 fpa=0.5 -y
#本体のビルド
git clone https://github.com/phasebook/phasebook.git
cd phasebook
sh install.sh
> python scripts/phasebook.py -h
usage: python phasebook.py [-h] -i INFILE [-o OUTDIR] [-t THREADS]
[-p PLATFORM] [-x] [-g GENOMESIZE]
[--overlaps OVERLAPS] [--nsplit NSPLIT]
[--run_mode RUN_MODE] [--min_cov MIN_COV]
[--min_identity MIN_IDENTITY]
[--min_read_len MIN_READ_LEN]
[--min_sread_len MIN_SREAD_LEN]
[--min_ovlp_len MIN_OVLP_LEN]
[--n_correct N_CORRECT] [--n_polish N_POLISH]
[--polish_tool POLISH_TOOL] [--max_oh MAX_OH]
[--oh_ratio OH_RATIO] [--sp_oh SP_OH]
[--sp_ohratio SP_OHRATIO]
[--sp_min_identity SP_MIN_IDENTITY]
[--sp_min_ovlplen SP_MIN_OVLPLEN]
[--max_tip_len MAX_TIP_LEN]
[--min_cluster_size MIN_CLUSTER_SIZE]
[--level LEVEL] [--rm_trans RM_TRANS]
[--trim_ends TRIM_ENDS] [--rename RENAME] [--qc QC]
[--ctg_asm CTG_ASM]
[--super_ovlp_fast SUPER_OVLP_FAST]
[--correct_mode CORRECT_MODE]
[--min_allele_cov MIN_ALLELE_COV]
[--n_final_polish N_FINAL_POLISH]
[--sort_by_len SORT_BY_LEN] [--rm_tmp RM_TMP]
[--max_memory MAX_MEMORY]
[--superead_id SUPEREAD_ID]
[--limited_times LIMITED_TIMES]
[--max_ovlps MAX_OVLPS]
[--max_cluster_size MAX_CLUSTER_SIZE] [--version]
Haplotype-aware de novo assembly of diploid genome from long reads
optional arguments:
-h, --help show this help message and exit
-i INFILE, --infile INFILE
input file in FASTA/FASTQ format (default: None)
-o OUTDIR, --outdir OUTDIR
output directory (default: .)
-t THREADS, --threads THREADS
number of threads (default: 1)
-p PLATFORM, --platform PLATFORM
sequencing platform(PacBio CLR/PacBio HiFi/Oxford
Nanopore): [pb/hifi/ont] (default: pb)
-x, --preset use preset parameters (default: False)
-g GENOMESIZE, --genomesize GENOMESIZE
genome size: small/large (default: small)
--overlaps OVERLAPS input file in PAF format (default: None)
--nsplit NSPLIT number of splitted input fasta/fastq files (default:
1)
--run_mode RUN_MODE running mode: local/hpc, hpc is not available yet
(default: local)
--min_cov MIN_COV min coverage for trimming consensus (default: 4.0)
--min_identity MIN_IDENTITY
min identity for filtering overlaps (default: 0.75)
--min_read_len MIN_READ_LEN
min read length for processing (default: 1000)
--min_sread_len MIN_SREAD_LEN
min seed read length (default: 1000)
--min_ovlp_len MIN_OVLP_LEN
min overlap length for super reads construction
(default: 1000)
--n_correct N_CORRECT
times for self error correction of raw reads (default:
0)
--n_polish N_POLISH times for super reads polishing (default: 2)
--polish_tool POLISH_TOOL
tool used for polishing contigs: [racon/hypo]. Note:
hypo is recommended if using HiFi reads (default:
racon)
--max_oh MAX_OH max overhang length (default: 1000)
--oh_ratio OH_RATIO max overhang to mapping length ratio (default: 0.8)
--sp_oh SP_OH max overhang length (default: 10)
--sp_ohratio SP_OHRATIO
max overhang to mapping length ratio (default: 0.8)
--sp_min_identity SP_MIN_IDENTITY
super reads min identity for filtering overlaps
(default: 0.98)
--sp_min_ovlplen SP_MIN_OVLPLEN
min overlap length for super reads construction
(default: 1000)
--max_tip_len MAX_TIP_LEN
max length to be removed as tips (default: 1000)
--min_cluster_size MIN_CLUSTER_SIZE
min size of read clusters (default: 4)
--level LEVEL iteration times of clustering read (default: 1)
--rm_trans RM_TRANS choose to (0) keep all edges, (1) remove transitive
edges, (2) to remove double transitive edges (default:
1)
--trim_ends TRIM_ENDS
trim the erroneous bases in both ends, should be
either True or False (default: False)
--rename RENAME rename read name or not, should be either True or
False (default: True)
--qc QC quality control for input reads or not, should be
either True or False, TODO (default: False)
--ctg_asm CTG_ASM method to assemble super reads: [rb/naive/iter], rb is
recommended (default: rb)
--super_ovlp_fast SUPER_OVLP_FAST
compute super read overlaps using fast mode or not,
should be either True or False (default: False)
--correct_mode CORRECT_MODE
method to correct raw reads: [msa/hybrid], msa is much
faster than hybrid, which is recommended for large
genomes (default: hybrid)
maximum number of heterozygous SNPs to determine the
contig overlap is from the identical haplotype or not
(default: 0)
--min_allele_cov MIN_ALLELE_COV
number of observations of each allele (default: 4)
--n_final_polish N_FINAL_POLISH
polish times for final contigs (default: 1)
--sort_by_len SORT_BY_LEN
sort super reads by read length or not(by number of
overlaps), should be either True or False (default:
True)
--rm_tmp RM_TMP remove temp files or not, should be either True or
False (default: True)
--max_memory MAX_MEMORY
max memory to use (default: 50G)
--superead_id SUPEREAD_ID
superead id required if running on HPC (default: None)
--limited_times LIMITED_TIMES
max times used for read (default: 5)
--max_ovlps MAX_OVLPS
max number of overlaps for read (default: 10000)
--max_cluster_size MAX_CLUSTER_SIZE
max cluster size (default: 100000)
--version, -v show the version
Built by: Xiao Luo
テストラン
入力リードファイルはFASTAまたはFASTQ形式。FASTAファイルでは、1行ごとのリードで、配列途中のラップは不可になっている。
PacBioのリード
cd phasebook/example/
#HiFi reads(small genome)
python ../scripts/phasebook.py -i reads.fa -t 8 -p hifi -g small -x
#CLR reads(small genome)
python ../scripts/phasebook.py -i reads.fa -t 8 -p pb -g small -x
- -i input file in FASTA/FASTQ format (default: None)
- -o output directory (default: .)
- -t number of threads (default: 1)
- -p sequencing platform(PacBio CLR/PacBio HiFi/Oxford Nanopore): [pb/hifi/ont] (default: pb)
- -x use preset parameters (default: False)
- -g genome size: small/large (default: small)
出力
ONTのリード
#small genome
python phasebook.py -i reads.fa -t 8 -p ont -g small -x
#large genome
python phasebook.py -i reads.fa -t 8 -p ont -g large -x
レポジトリより
- また、ヒトゲノムのような非常に大きなゲノムを扱う場合、HPC上でphasebookを実行することができる。
- -g は、ゲノムの大きさが小さいか大きいかを設定する。gを大きくすると、リードのオーバーラップ計算やフィルタリング、シーケンスエラー補正がより効率的に行われれるが、アセンブルのパフォーマンスが低下する可能性がある。
引用
phasebook: haplotype-aware de novo assembly of diploid genomes from long reads
Xiao Luo, Xiongbin Kang & Alexander Schönhuth
Genome Biology volume 22, Article number: 299 (2021)