PerSVade: calling and annotating SVs, SNPs, indels, and CNVs from WGS datasets of any organism of interest

2022/08/22: added notes on options

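Structural variants (SVs) underlie genomic variation but are often overlooked because they are difficult to detect from short reads. Most algorithms have been tested on human data, and it remains unclear how well they apply to other organisms. To address this, the authors developed perSVade (personalized structural variation detection), a sample-tailored pipeline that provides optimally called SVs and their inferred accuracy, as well as small and copy number variants. PerSVade increased SV calling accuracy on a benchmark of six eukaryotes. The authors could not find a universal set of optimal parameters, which underscores the need for sample-specific parameter optimization. PerSVade will facilitate the detection and study of SVs across diverse organisms.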

Happy to present the publication of #PerSVade in @GenomeBiology
PerSVade is a tool to
personalize Structural Variant prediction for your genome of interest
Developed by @MikiSchikora at @gabaldonlab https://t.co/15AZYFrhQT pic.twitter.com/bFttF5p1EX
Toni Gabaldón (@toni_gabaldon), August 16, 2022

Most tools are optimized for human genomes, and we found out that optimal parameters vary widely across datasets. One size does not fit all species!

Beyond Structural Variants, #PerSVade will also call other type of variants such as SNPs, CNVs and INDELS pic.twitter.com/s94PuRROXx
Toni Gabaldón (@toni_gabaldon), August 16, 2022

wiki

https://github.com/Gabaldonlab/perSVade/wiki

FAQ

https://github.com/Gabaldonlab/perSVade/wiki/8.-FAQs

Installation

I tested the tool using the distributed Docker image.

Github

#docker
docker pull mikischikora/persvade:v1.02.6

> docker run -i mikischikora/persvade:v1.02.6 scripts/perSVade --help

--------------------------------------------------------------------------------
perSVade: personalized Structural Variation detection
This is a pipeline to call and annotate small variants, structural
variants (SVs) and/or coverage-derived copy number variants (CNVs)
Find more information in https://github.com/Gabaldonlab/perSVade
--------------------------------------------------------------------------------

These are the available modules (see
https://github.com/Gabaldonlab/perSVade/wiki/1.-Pipeline-overview to
combine them):

  trim_reads_and_QC        Trimming (with trimmomatic) and quality control (with fastqc) of the reads
                           IMPORTANT: Check the output of fastqc before using the trimmed reads for other analyses
  align_reads              Align reads, mark duplicates and calculate coverage per windows
  infer_repeats            Find repeats in a genome, which is necessary for some of the modules below
  find_homologous_regions  Find regions with pairwise homology in a genome
  find_knownSVs_regions    Find regions with perSVade-inferred SVs
  optimize_parameters      Find optimal parameters for SV calling through simulations
  call_SVs                 Call structural variants (SVs) with gridss and clove
  call_CNVs                Call copy-number variants (CNVs) based on coverage
  integrate_SV_CNV_calls   Integrate the variant calls of 'call_SVs' and 'call_CNVs' into a single .vcf file
  annotate_SVs             Annotate the fuctional impact of the variants from 'integrate_SV_CNV_calls'
  call_small_variants      Call SNPs and small IN/DELs
  annotate_small_vars      Annotate the fuctional impact of the variants from 'call_small_variants'
  get_cov_genes            Calculate the coverage for each gene of the genome

Usage:
  perSVade <module> <args>. Type 'perSVade <module> -h' for more information on each of them.

Test run

# when using docker
git clone https://github.com/Gabaldonlab/perSVade.git
cd perSVade/
docker run -v $PWD/perSVade_testing_outputs:/perSVade/installation/test_installation/testing_outputs mikischikora/persvade:v1.02.6 python -u ./installation/test_installation/test_installation_modules.py

The test_installation_modules.py script writes its data to /perSVade/installation/test_installation/testing_outputs inside the container, which is mounted from ./perSVade_testing_outputs on the host. The run takes about an hour.

Example output (figure not shown)

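How to run

perSVade goes through multiple steps to obtain all of its possible outputs from paired-end FASTQ files, a reference genome, and an annotation file (given as a GFF file via -gff in the commands below). Providing the FASTQ files gzip-compressed is recommended. It used to be possible to run the steps together in a single command, but that was error-prone and ended up being less user-friendly, so running the commands step by step is now recommended.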
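Before starting the step-by-step run, it can help to confirm that the inputs are in place and that the FASTQ files are gzip-compressed. The following is a minimal sketch, not part of perSVade: the function name `check_inputs` is hypothetical, and the file names in the usage comment are simply the ones used in the commands of this post.

```shell
# Hypothetical helper (not part of perSVade): check that each input file
# exists and is non-empty, flag uncompressed FASTQ files, and test the
# integrity of any .gz file with `gzip -t`.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            status=1
            continue
        fi
        case "$f" in
            *.fastq|*.fq)
                echo "not gzip-compressed: $f" >&2
                status=1 ;;
            *.gz)
                gzip -t "$f" 2>/dev/null || {
                    echo "corrupt gzip file: $f" >&2
                    status=1
                } ;;
        esac
    done
    return "$status"
}

# Usage (file names follow the commands in this post):
# check_inputs read_1.fastq.gz read_2.fastq.gz genome.fasta annotations.gff
```

The function returns a non-zero status if any file fails a check, so it can gate the first pipeline step with `&&`.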

1. Read trimming and quality control

mkdir output
docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade trim_reads_and_QC -f1 /data/read_1.fastq.gz -f2 /data/read_2.fastq.gz -o /data/output/trimmed_reads

Example output (figure not shown)
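Every command in this post repeats the same docker prefix, so a small wrapper can cut down on typos. This is only a sketch: the function `persvade` and the variables `PERSVADE_IMAGE` and `PERSVADE_DRY_RUN` are names made up here, not something perSVade provides. The docker options are copied from the commands in this post.

```shell
# Hypothetical wrapper around the docker invocation used in this post.
# PERSVADE_IMAGE and PERSVADE_DRY_RUN are names invented for this sketch.
PERSVADE_IMAGE="mikischikora/persvade:v1.02.6"

persvade() {
    if [ "${PERSVADE_DRY_RUN:-0}" = "1" ]; then
        # Dry run: only print the command that would be executed.
        echo "docker run -itv $PWD:/data --rm $PERSVADE_IMAGE python -u ./scripts/perSVade $*"
    else
        docker run -itv "$PWD:/data" --rm "$PERSVADE_IMAGE" \
            python -u ./scripts/perSVade "$@"
    fi
}

# Usage: preview the trimming command without running docker.
PERSVADE_DRY_RUN=1
persvade trim_reads_and_QC -f1 /data/read_1.fastq.gz -f2 /data/read_2.fastq.gz -o /data/output/trimmed_reads
```

Setting `PERSVADE_DRY_RUN=0` (or unsetting it) makes the same call actually start the container.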

2. Read alignment

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade align_reads -f1 /data/output/trimmed_reads/trimmed_reads1.fastq.gz -f2 /data/output/trimmed_reads/trimmed_reads2.fastq.gz --ref /data/genome.fasta -o /data/output/aligned_reads --fraction_available_mem 0.5 -thr 16

--fraction_available_mem This pipeline calculates the available RAM for several steps, and it may not work well in some systems (i.e. HPC clusters). This parameter allows you to correct possible errors. If --fraction_available_mem is not provided (default behavior), this pipeline will calculate the available RAM by filling the memory, which may give errors. If you want to use all the available memory you should specify --fraction_available_mem 1.0. See the FAQ 'How does the --fraction_available_mem work?' from https://github.com/Gabaldonlab/perSVade/wiki/8.-FAQs for more info.
-thr Number of threads, Default: 16

Example output (figure not shown)
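The -thr value of 16 used in these commands may exceed the number of cores on a smaller machine. The sketch below picks a thread count capped at the number of available cores. The helper name `pick_threads` is hypothetical, and it relies on `nproc` (GNU coreutils) with `getconf _NPROCESSORS_ONLN` as a fallback. Whether the container can see all host cores depends on your Docker setup.

```shell
# Hypothetical helper: choose a value for -thr that does not exceed the
# number of CPU cores reported by the system (capped at the requested maximum).
pick_threads() {
    max="${1:-16}"
    cores=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
    if [ "$cores" -lt "$max" ]; then
        echo "$cores"
    else
        echo "$max"
    fi
}

THREADS=$(pick_threads 16)
echo "would pass: --fraction_available_mem 0.5 -thr $THREADS"
```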

3. Repeat inference

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade infer_repeats --ref /data/genome.fasta -o /data/output/repeat_inference --fraction_available_mem 0.5 -thr 16

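Example output (figure not shown)

4. Define the SV regions (regions with pairwise homology and regions with known SVs).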

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade find_homologous_regions --ref /data/genome.fasta -o /data/output/find_hom_regions

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade find_knownSVs_regions -o /data/output/find_known_SVs --ref /data/genome.fasta --mitochondrial_chromosome chrM --SVcalling_parameters default --repeats_file skip --close_shortReads_table /data/close_shortReads_table.tab

5. Find the filtering parameters best suited to SV calling. The BAM file is aligned_reads.bam.sorted.

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade optimize_parameters --ref /data/genome.fasta -o /data/output/parameter_optimization -sbam /data/output/aligned_reads/aligned_reads.bam.sorted --mitochondrial_chromosome chrM --repeats_file /data/output/repeat_inference/combined_repeats.tab --regions_SVsimulations random --simulation_ploidies haploid --fraction_available_mem 0.5 -thr 16

6. Run SV calling with the optimized parameters.

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade call_SVs --ref /data/genome.fasta -o /data/output/call_SVs -sbam /data/output/aligned_reads/aligned_reads.bam.sorted --mitochondrial_chromosome chrM --SVcalling_parameters /data/output/parameter_optimization/optimized_parameters.json --repeats_file /data/output/repeat_inference/combined_repeats.tab --fraction_available_mem 0.5 -thr 16
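The call_SVs command depends on the optimized_parameters.json file written by the previous step, so a guard against a missing file can save a failed run. A minimal sketch follows. The function name `optimized_params_path` is hypothetical, and it assumes you run it from the host directory that is mounted at /data and pass a relative path.

```shell
# Hypothetical guard: take the host-side (relative) path of the parameters
# file, check that it exists and is non-empty, and print the matching path
# inside the container, where the host working directory is mounted at /data.
optimized_params_path() {
    host_path="$1"
    if [ ! -s "$host_path" ]; then
        echo "not found or empty: $host_path (run optimize_parameters first)" >&2
        return 1
    fi
    echo "/data/$host_path"
}

# Usage, from the directory that is mounted at /data:
# params=$(optimized_params_path output/parameter_optimization/optimized_parameters.json) \
#   && echo "--SVcalling_parameters $params"
```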

7. Run coverage-based CNV calling (500 bp window size, assuming a haploid genome).

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade call_CNVs --ref /data/genome.fasta -o /data/output/call_CNVs -sbam /data/output/aligned_reads/aligned_reads.bam.sorted --mitochondrial_chromosome chrM -p 1 --cnv_calling_algs HMMcopy,AneuFinder --window_size_CNVcalling 500 --fraction_available_mem 0.5 -thr 16
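To get a feel for what a 500 bp window size implies, the sketch below counts how many windows of a given size cover each sequence listed in a samtools-style .fai index. It only illustrates the arithmetic and is not how perSVade computes its windows internally. The function name `count_windows` and the toy chromosome lengths are made up.

```shell
# count_windows FAI WINDOW: print "<sequence> <number of windows>" for each
# sequence in a samtools-style .fai index (column 1 = name, column 2 = length).
# The count is the length divided by the window size, rounded up.
count_windows() {
    awk -v w="$2" '{ printf "%s %d\n", $1, int(($2 + w - 1) / w) }' "$1"
}

# Toy index with made-up sequence names and lengths.
fai=$(mktemp)
printf 'chr1\t1200\t6\t60\t61\nchrM\t500\t1240\t60\t61\n' > "$fai"
count_windows "$fai" 500
rm -f "$fai"
```

For the toy index this prints 3 windows for chr1 (1200 bp) and 1 for chrM (500 bp).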

8. Integrate the SV and CNV calls into a single .vcf file.

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade integrate_SV_CNV_calls -o /data/output/integrated_SV_CNV_calls --ref /data/genome.fasta --mitochondrial_chromosome chrM -p 1 -sbam /data/output/aligned_reads/aligned_reads.bam.sorted --outdir_callSVs /data/output/call_SVs --outdir_callCNVs /data/output/call_CNVs --repeats_file skip --fraction_available_mem 0.5 -thr 16

9. Annotate the functional impact of the SVs and CNVs in the integrated .vcf file.

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade annotate_SVs -o /data/output/annotate_SVs --ref /data/genome.fasta --mitochondrial_chromosome chrM -gff /data/annotations.gff -mcode 3 -gcode 1 --SV_CNV_vcf /data/output/integrated_SV_CNV_calls/SV_and_CNV_variant_calling.vcf --fraction_available_mem 0.5 -thr 16

10. Run SNP and indel calling with three algorithms (bcftools, freebayes, HaplotypeCaller).

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade call_small_variants -o /data/output/small_vars --ref /data/genome.fasta -p 1 -sbam /data/output/aligned_reads/aligned_reads.bam.sorted --repeats_file /data/output/repeat_inference/combined_repeats.tab --callers bcftools,freebayes,HaplotypeCaller --min_AF 0.9 --min_coverage 20 --fraction_available_mem 0.5 -thr 16
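The --min_AF 0.9 and --min_coverage 20 values act as thresholds on the calls. The sketch below shows the kind of filter these numbers suggest, applied to a made-up three-column table. It is an illustration only: how perSVade actually computes allele frequency and whether its comparisons are inclusive are assumptions here, and the function name `filter_calls` is hypothetical.

```shell
# filter_calls MIN_COV MIN_AF: keep rows of a whitespace-separated table
# "<position> <coverage> <alt_allele_fraction>" that pass both thresholds.
# Inclusive comparisons (>=) are an assumption of this sketch.
filter_calls() {
    awk -v c="$1" -v af="$2" '$2 >= c && $3 >= af'
}

# Made-up example rows: the second fails on coverage, the third on allele
# fraction, so only the first row is printed.
printf '101 35 0.97\n202 12 1.00\n303 40 0.48\n' | filter_calls 20 0.9
```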

11. Run functional annotation of the small variants generated in step 10 (assuming the standard nuclear genetic code and the yeast mitochondrial code).

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade annotate_small_vars -o /data/output/annotate_small_vars --ref /data/genome.fasta --mitochondrial_chromosome chrM -gff /data/annotations.gff -mcode 3 -gcode 1 --merged_vcf /data/output/small_vars/merged_vcfs_allVars_ploidy1.vcf --fraction_available_mem 0.5 -thr 16
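The annotation commands pass -gcode 1 and -mcode 3, which this post uses for the standard code and the yeast mitochondrial code. These match the NCBI translation table numbers, so the sketch below maps a few common table numbers to their NCBI names as a quick sanity check when working with another organism. That perSVade accepts every NCBI table number is an assumption, and `genetic_code_name` is a hypothetical helper covering only a handful of tables.

```shell
# Hypothetical lookup of a few NCBI translation table numbers.
# Only some tables are listed; anything else is reported as unknown.
genetic_code_name() {
    case "$1" in
        1)  echo "Standard" ;;
        2)  echo "Vertebrate Mitochondrial" ;;
        3)  echo "Yeast Mitochondrial" ;;
        4)  echo "Mold, Protozoan, and Coelenterate Mitochondrial; Mycoplasma/Spiroplasma" ;;
        11) echo "Bacterial, Archaeal and Plant Plastid" ;;
        12) echo "Alternative Yeast Nuclear" ;;
        *)  echo "unknown to this sketch: $1"; return 1 ;;
    esac
}

echo "-gcode 1 -> $(genetic_code_name 1)"
echo "-mcode 3 -> $(genetic_code_name 3)"
```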

12. Calculate the coverage of each gene.

docker run -itv $PWD:/data --rm mikischikora/persvade:v1.02.6 python -u ./scripts/perSVade get_cov_genes -o /data/output/get_cov_genes --ref /data/genome.fasta -gff /data/annotations.gff -sbam /data/output/aligned_reads/aligned_reads.bam.sorted --fraction_available_mem 0.5 -thr 16

All results are written to output/.
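Because the steps are run one at a time, it is easy to lose track of which ones have finished. The sketch below lists the output subdirectories, named after the -o arguments used in this post, that do not exist yet. The function name `missing_outputs` is hypothetical, and a directory that exists is not proof that the step completed successfully.

```shell
# Hypothetical helper: print the per-step output directories (named as in
# the -o arguments of this post) that are missing under the given directory.
missing_outputs() {
    outdir="$1"
    for d in trimmed_reads aligned_reads repeat_inference find_hom_regions \
             find_known_SVs parameter_optimization call_SVs call_CNVs \
             integrated_SV_CNV_calls annotate_SVs small_vars \
             annotate_small_vars get_cov_genes; do
        [ -d "$outdir/$d" ] || echo "$d"
    done
}

# Usage: run from the directory that contains output/.
missing_outputs output
```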

(This post is still a work in progress.)

Citation

PerSVade: personalized structural variant detection in any species of interest
Miquel Àngel Schikora-Tamarit & Toni Gabaldón
Genome Biology volume 23, Article number: 175 (2022)