macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

重複遺伝子のコピー数をマッピングデータから推定する parascopy

 

 ヒトゲノムには数百の低コピー反復配列(LCR)が存在するが、コピー数のばらつきが大きく、リードマッピングがあいまいなため、ショートリードシーケンス技術による解析は困難である。LCRに重複する150以上の重複遺伝子のコピー数および配列の変異は、単発性および複合性のヒト疾患に関与している。全ゲノムシーケンス(WGS)を用いて重複遺伝子の総コピー数とパラログ特異的コピー数を推定する計算ツールParascopyについて述べる。Parascopyは、グローバルリアライメントを行うことなく、異なる繰り返しコピーにマッピングされたリードを共同で解析する効率的な手法である。複数のサンプルを活用してシーケンスバイアスを軽減し、リピートコピーを区別する信頼性の高いパラロガス配列バリアント(PSV)を同定する。多様な集団から得られた2504人分のWGSデータを解析した結果、Parascopyはシーケンスバイアスに強く、既存の手法に比べて精度が高く、重複遺伝子における病原性コピー数変化の優先順位付けが可能であることが示された。

 

インストール

依存

  • samtools ≥ 1.11,
  • tabix ≥ 1.11,
  • bgzip ≥ 1.11,
  • bwa ≥ 0.7.

Github

mamba create --name paras_env parascopy -y
conda activate paras_env
mamba install -c bioconda parascopy -y
#bedtoolsも必要
mamba install -c bioconda bedtools-y

> parascopy -h

Parascopy v1.17.1

Created by Timofey Prodanov & Vikas Bansal

 

Usage:   parascopy <command> <arguments>

 

Commands:

 

[ Creating homology table ]

    pretable    Create homology pre-table.

                This command aligns genomic regions back to the genome

                to find homologous regions.

    table       Convert homology pre-table into homology table.

                This command combines overlapping homologous regions

                into longer duplications.

 

[ Analyzing BAM/CRAM files ]

    depth       Calculate read depth and variance in given genomic windows.

    cn          Find aggregate and paralog-specific copy number

                for given unique and duplicated regions.

    cn-using    Same as "cn", but use input model parameters.

    pool        Pool reads from various copies of a given homologous region.

    call        Call pooled and paralog-specific variants.

 

[ Querying homology table ]

    view        View and filter homology table.

    msa         Visualize multiple sequence alignment of

                homologous regions.

    psvs        Output PSVs (paralogous-sequence variants)

                between homologous regions.

    examine     Split input regions by reference copy number

                and write homologous regions.

 

[ General help ]

    help        Show this help message.

    version     Show version.

    cite        Show citation information.

 

> parascopy pretable -h

 

usage: parascopy pretable -f <fasta> [-r <regions> | -R <bed>] -o <bed> [arguments]

 

Create homology pre-table.

This command aligns genomic regions back to the genome to find homologous regions.

 

Input/output arguments:

  -f <file>, --fasta-ref <file>

                        Input reference fasta file. Should have BWA index.

  -o <file>, --output <file>

                        Output bed[.gz] file.

 

Region arguments (optional):

  -r <region> [<region> ...], --regions <region> [<region> ...]

                        Region(s) in format "chr" or "chr:start-end").

                        Start and end are 1-based inclusive. Commas are ignored.

  -R <file> [<file> ...], --regions-file <file> [<file> ...]

                        Input bed[.gz] file(s) containing regions (tab-separated, 0-based semi-exclusive).

 

Filtering arguments:

  --min-aln-len <int>   Minimal alignment length [default: 250].

  --min-seq-sim <float>

                        Minimal sequence similarity [default: 0.96].

  --max-homologies <int>

                        Skip regions with more than <int> homologies (copies) in the genome [default: 10].

  --keep-self-alns      If true, partial self-alignments will be kept

                        (for example keep an alignment with a 300bp shift).

                        Full self-alignments are always discarded.

 

Optional arguments:

  --read-len <int>      Artificial read length [default: 900].

  --step-size <int>     Artificial reads step size [default: 150].

  -s, --symmetric       Create a symmetric homology table (may be unstable).

 

Execution arguments:

  -F, --force           Force overwrite output file.

  -@ <int>, --threads <int>

                        Use <int> threads [default: 4].

  --tmp-dir <dir>       Puts temporary files in the following directory (does not remove temporary

                        files after finishing). Otherwise, creates a temporary directory.

  -b <path>, --bwa <path>

                        Path to BWA executable [default: bwa].

  --seed-len <int>      Minimum BWA seed length [default: 16].

  --tabix <path>        Path to "tabix" executable [default: tabix].

  --bedtools <path>     Path to "bedtools" executable [default: bedtools].

  --bgzip <path>        Path to "bgzip" executable [default: bgzip].

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

>  parascopy table -h

usage: parascopy table -i <pretable> -f <fasta> -o <table> [arguments]

 

Convert homology pre-table into homology table.

This command combines overlapping homologous regions into longer duplications.

 

Input/output arguments:

  -i <file>, --input <file>

                        Input indexed bed.gz homology pre-table.

  -f <file> [<file>], --fasta-ref <file> [<file>]

                        Input reference fasta file.

  -o <file>, --output <file>

                        Output table bed[.gz] file with homology table.

  -g <file>, --graph <file>

                        Optional: output directory with duplication graphs.

 

Region arguments (optional):

  -r <region>, --regions <region>

                        Region(s) in format "chr" or "chr:start-end").

                        Start and end are 1-based inclusive. Commas are ignored.

  -R <file>, --regions-file <file>

                        Input bed[.gz] file(s) containing regions (tab-separated, 0-based semi-exclusive).

 

Duplications filtering arguments:

  -e <expr>, --exclude <expr>

                        Exclude duplications for which the expression is true

                        [default: length < 500 && seq_sim < 0.97].

  -t <expr>, --tangled <expr>

                        Exclude duplications for which the expression is true and mark regions

                        as "tangled". These regions will be discarded from downstream analysis.

                        Regions in args.exclude will be discarded first.

                        By default, this discards tandem duplications with high multiplicity

                        (low complexity). [default: sep < 5000 && av_mult > 2].

 

Optional arguments:

  --tabix <path>        Path to "tabix" executable [default: tabix].

  -@ <int>, --threads <int>

                        Use <int> threads [default: 4].

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy depth -h

usage: parascopy depth (-i <bam> [...] | -I <bam-list>) (-g hg19|hg38 | -b <bed>) -f <fasta> -o <dir> [arguments]

 

Calculate read depth and variance in given genomic windows.

 

Input/output arguments:

  -i <file> [<file> ...], --input <file> [<file> ...]

                        Input indexed BAM/CRAM files.

                        All entries should follow the format "filename[::sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input-list.

  -I <file>, --input-list <file>

                        A file containing a list of input BAM/CRAM files.

                        All lines should follow the format "filename[ sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input.

                        

  -g hg19|hg38, --genome hg19|hg38

                        Use predefined windows for the human genome. Mutually exclusive with --bed-regions.

  -b <file>, --bed-regions <file>

                        Input bed[.gz] file containing windows (tab-separated, 0-based semi-exclusive),

                        which will be used to calculate read depth. Windows must be non-overlapping

                        and have the same size. Mutually exclusive with --genome.

                        

  -f <file>, --fasta-ref <file>

                        Input reference fasta file.

  -o <dir>, --output <dir>

                        Output directory.

 

Filtering arguments:

  -m <float>, --low-mapq-perc <float>

                        Skip windows that have more than <float>% reads with MAPQ < 10 [default: 10].

  -c <float>, --clipped-perc <float>

                        Skip windows that have more than <float>% reads with soft/hard clipping [default: 10].

  -u <float>, --unpaired-perc <float>

                        Skip windows that have more than <float>% reads without proper pair [default: 10].

  -N <int>, --neighbours <int>

                        Discard <int> neighbouring windows to the left and to the right

                        of a skipped window [default: 1].

 

Depth calculation arguments:

  --loess-frac <float>  Loess parameter: use <float> closest windows to estimate average read depth

                        for each GC-percentage [default: 0.2].

  --tail-windows <int>  Do not use GC-content if it lies in the left or right tail

                        with less than <int> windows [default: 1000].

  --low-mapq <int>      Read mapping quality under <int> is considered as low [default: 10].

  --mate-dist <int>     Insert size (~ distance between read mates) is expected to be under <int> [default: 5000].

  --ploidy <int>        Genome ploidy. [default: 2].

                        If not 2, run "parascopy cn[-using]" with "--modify-ref" parameter.

 

Optional arguments:

  -@ <int>, --threads <int>

                        Number of threads [default: 4].

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy cn -h

 

usage: parascopy cn (-i <bam> [...] | -I <bam-list>) -t <table> -f <fasta> (-r <region> [...] | -R <bed>) -d <bg-depth> [...] -o <dir> [arguments]

 

Find aggregate and paralog-specific copy number for given unique and duplicated regions.

 

Input/output arguments:

  -i <file>, --input <file>

                        Input indexed BAM/CRAM files.

                        All entries should follow the format "filename[::sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input-list.

  -I <file>, --input-list <file>

                        A file containing a list of input BAM/CRAM files.

                        All lines should follow the format "filename[ sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input.

                        

  -t <file>, --table <file>

                        Input indexed bed table with information about segmental duplications.

  -f <file>, --fasta-ref <file>

                        Input reference fasta file.

  -d <file> [<file> ...], --depth <file> [<file> ...]

                        Input files / directories with background read depth.

                        Should be created using "parascopy depth".

  -o <dir>, --output <dir>

                        Output directory.

 

Region arguments:

  -r <region> [<region> ...], --regions <region> [<region> ...]

                        Region(s) in format "chr" or "chr:start-end".

                        Start and end are 1-based inclusive. Commas are ignored.

                        Optionally, you can provide region name using the format "region::name"

                        For example "-r chr5:70,925,087-70,953,015::SMN1".

  -R <file> [<file> ...], --regions-file <file> [<file> ...]

                        Input bed[.gz] file(s) containing regions (tab-separated, 0-based semi-exclusive).

                        Optional fourth column will be used as a region name.

  --force-agcn <bed>    Instead of calculating aggregate copy numbers, use provided bed file.

                        Columns: chrom, start, end, samples, copy_num. Fourth column can be a single sample name;

                        list of sample names separated by ","; or "*" to indicate all samples.

  --modify-ref <bed>    Modify reference copy number using bed file with the same format as `--force-agcn`.

                        Provided values are used for paralog-specific copy number detection.

  --regions-subset <str> [<str> ...]

                        Additionally filter input regions: only use regions with names that are in this list.

                        If the first argument is "!", only use regions not in this list.

 

Duplications filtering arguments:

  -e <expr>, --exclude <expr>

                        Exclude duplications for which the expression is true

                        [default: length < 500 && seq_sim < 0.97].

  --short <int>         Skip regions with short duplications (shorter than <int> bp),

                        not excluded in the -e/--exclude argument [default: 500].

  --max-ref-cn <int>    Skip regions with reference copy number higher than <int> [default: 10].

  --unknown-seq <float>

                        At most this fraction of region sequence can be unknown (N) [default: 0.1].

  --skip-unique         Skip regions without any duplications in the reference genome.

 

Aggregate copy number detection arguments:

  --min-samples <int>   Use multi-sample information if there are at least <int> samples present

                        for the region/PSV [default: 50].

  --min-windows <int>   Predict aggregate and paralog copy number only in regions with at

                        least <int> windows [default: 5].

  --region-dist <int>   Jointly calculate copy number for nearby duplications with equal reference copy number,

                        if the distance between them does not exceed <int> [default: 1000].

  --window-filtering <float>

                        Modify window filtering: by default window filtering is the same as in the background

                        read depth calculation [default: 1].

                        Values < 1 - discard more windows, > 1 - keep more windows.

  --agcn-range <int> <int>

                        Detect aggregate copy number in a range around the reference copy number [default: 5 7].

                        For example, for a duplication with copy number 8 copy numbers 3-15 can be detected.

  --strict-agcn-range   Detect aggregate copy number strictly within the --agcn-range, even if there are

                        samples with bigger/smaller copy number values.

  --agcn-jump <int>     Maximal jump in the aggregate copy number between two consecutive windows [default: 6].

  --transition-prob <float>

                        Log10 transition probability for the aggregate copy number HMM [default: -5].

  --uniform-initial     Copy number HMM: use uniform initial distribution and do not update initial probabilities.

  --no-multipliers      Do not estimate or use read depth multipliers.

  --update-agcn <float>

                        Update agCN using psCN probabilities when agCN quality is less than <float> [default: 40].

  --vmr <str>           Sort samples by variance-mean ratio, and only use samples with smallest values.

                        Value should be either <float> (ratio threshold), or <float>% (use this percentile).

 

Paralog-specific copy number detection arguments:

  --reliable-threshold <float> <float>

                        PSV-reliability thresholds (reliable PSV has all f-values over the threshold).

                        First value is used for gene conversion detection,

                        second value is used to estimate paralog-specific CN [default: 0.80 0.95].

  --pscn-bound <int> [<int>]

                        Do not estimate paralog-specific copy number if any of the statements is true:

                        - aggregate copy number is higher than <int>[1]           [default: 8],

                        - number of possible psCN tuples is higher than <int>[2]  [default: 500].

 

Execution parameters:

  --rerun full|partial|none

                        Rerun CN analysis for all loci:

                            full:    complete rerun,

                            partial: use pooled reads from a previous run,

                            none:    skip successfully finished loci [default].

  --skip-cn             Do not calculate agCN and psCN profiles. If this option is set, Parascopy still

                        calculates read depth for duplicated windows and PSV-allelic read depth.

  --skip-pscn           Do not calculate psCN profiles.

  --pool-cram           Pool reads into CRAM files (currently, BAM by default).

  -@ <int>, --threads <int>

                        Number of available threads [default: 4].

  --samtools <path>|none

                        Path to samtools executable [default: samtools].

                        Use python wrapper if "none", can lead to errors.

  --tabix <path>        Path to "tabix" executable [default: tabix].

                        Use "none" to skip indexing output files.

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy cn-using -h

 

usage: parascopy cn-using <model> (-i <bam> [...] | -I <bam-list>) -t <table> -f <fasta> -d <bg-depth> [...] -o <dir> [arguments]

 

Find aggregate and paralog-specific copy number for given unique and duplicated regions.

 

Input/output arguments:

  <model>               Use model parameters from an independent "parascopy cn" run.

                        Allows multiple arguments of the following types:

                        - model/<region>.gz,

                        - file with paths to *.gz files,

                        - directory: use all subfiles *.gz.

                        

  -i <file>, --input <file>

                        Input indexed BAM/CRAM files.

                        All entries should follow the format "filename[::sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input-list.

  -I <file>, --input-list <file>

                        A file containing a list of input BAM/CRAM files.

                        All lines should follow the format "filename[ sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input.

                        

  -t <file>, --table <file>

                        Input indexed bed table with information about segmental duplications.

  -f <file>, --fasta-ref <file>

                        Input reference fasta file.

  -d <file> [<file> ...], --depth <file> [<file> ...]

                        Input files / directories with background read depth.

                        Should be created using "parascopy depth".

  -o <dir>, --output <dir>

                        Output directory.

 

Region arguments:

  --force-agcn <bed>    Instead of calculating aggregate copy numbers, use provided bed file.

                        Columns: chrom, start, end, samples, copy_num. Fourth column can be a single sample name;

                        list of sample names separated by ","; or "*" to indicate all samples.

  --regions-subset <str> [<str> ...]

                        Additionally filter input regions: only use regions with names that are in this list.

                        If the first argument is "!", only use regions not in this list.

 

Duplications filtering arguments:

  --unknown-seq <float>

                        At most this fraction of region sequence can be unknown (N) [default: 0.1].

  --skip-unique         Skip regions without any duplications in the reference genome.

 

Aggregate copy number detection arguments:

  --uniform-initial     Copy number HMM: use uniform initial distribution and do not update initial probabilities.

  --no-multipliers      Do not estimate or use read depth multipliers.

  --update-agcn <float>

                        Update agCN using psCN probabilities when agCN quality is less than <float> [default: 40].

  --vmr <str>           Sort samples by variance-mean ratio, and only use samples with smallest values.

                        Value should be either <float> (ratio threshold), or <float>% (use this percentile).

 

Paralog-specific copy number detection arguments:

  --reliable-threshold <float> <float>

                        PSV-reliability thresholds (reliable PSV has all f-values over the threshold).

                        First value is used for gene conversion detection,

                        second value is used to estimate paralog-specific CN.

                        Default: use reliable thresholds from <model>.

 

Execution parameters:

  --rerun full|partial|none

                        Rerun CN analysis for all loci:

                            full:    complete rerun,

                            partial: use pooled reads from a previous run,

                            none:    skip successfully finished loci [default].

  --skip-cn             Do not calculate agCN and psCN profiles. If this option is set, Parascopy still

                        calculates read depth for duplicated windows and PSV-allelic read depth.

  --skip-pscn           Do not calculate psCN profiles.

  --pool-cram           Pool reads into CRAM files (currently, BAM by default).

  -@ <int>, --threads <int>

                        Number of available threads [default: 4].

  --samtools <path>|none

                        Path to samtools executable [default: samtools].

                        Use python wrapper if "none", can lead to errors.

  --tabix <path>        Path to "tabix" executable [default: tabix].

                        Use "none" to skip indexing output files.

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

 

> parascopy pool -h

usage: parascopy pool (-i <bam> [...] | -I <bam-list>) -t <table> -f <fasta> -r <region> -o <dir>

 

Pool reads from various copies of a duplication.

 

Input/output arguments:

  -i <file> [<file> ...], --input <file> [<file> ...]

                        Input indexed BAM/CRAM files.

                        All entries should follow the format "filename[::sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input-list.

  -I <file>, --input-list <file>

                        A file containing a list of input BAM/CRAM files.

                        All lines should follow the format "filename[ sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input.

                        

  -t <file>, --table <file>

                        Input indexed bed table with information about segmental duplications.

  -f <file>, --fasta-ref <file>

                        Input reference fasta file.

  -r <region>, --region <region>

                        Single region in format "chr:start-end". Start and end are 1-based inclusive.

                        Commas are ignored.

  -o <dir>|<file>, --output <dir>|<file>

                        Output BAM/CRAM file if corresponding extension is used.

                        Otherwise, write CRAM files in the output directory.

  -b, --bam             Write BAM files to the output directory instead of CRAM.

 

Duplications filtering arguments:

  -e <expr>, --exclude <expr>

                        Exclude duplications for which the expression is true

                        [default: length < 500 && seq_sim < 0.97].

 

Optional arguments:

  -m <int>|infinity, --mate-dist <int>|infinity

                        Output read mates even if they are outside of the duplication,

                        if the distance between mates is less than <int> [default: 5000].

                        Use 0 to skip all mates outside the duplicated regions.

                        Use inf|infinity to write all mapped read mates.

  -q, --quiet           Do not write information to the stderr.

  --samtools <path>|none

                        Path to samtools executable [default: samtools].

                        Use python wrapper if "none", can lead to errors.

  --only-regions <file>

                        Append regions, used for pooling and realigning, to this file, and stop.

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy call -h

 

usage: parascopy call -p <dir> [-i <bam> ... | -I <bam-list>] -t <table> -f <fasta> [-o <dir>]

 

Call variants in duplicated regions.

 

Input/output arguments:

  -p <dir>, --parascopy <dir>

                        Input directory with Parascopy copy number estimates.

                        By default, pooled reads from the --parascopy directory are taken,

                        however, one may supply alignment files using -i and -I arguments.

  -i <file>, --input <file>

                        Optional: Input indexed BAM/CRAM files.

                        All entries should follow the format "filename[::sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input-list.

  -I <file>, --input-list <file>

                        Optional: A file containing a list of input BAM/CRAM files.

                        All lines should follow the format "filename[ sample]"

                        If sample name is not set, all reads in a corresponding file should have a read group (@RG).

                        Mutually exclusive with --input.

                        

  -t <file>, --table <file>

                        Input indexed bed table with information about segmental duplications.

  -f <file>, --fasta-ref <file>

                        Input reference fasta file.

  -o <dir>, --output <dir>

                        Output directory. Required if -i or -I arguments were used.

  -R <file>, --regions <file>

                        Limit analysis to regions in the provided BED file.

 

Freebayes parameters:

  --freebayes <path>    Optional: path to the modified Freebayes executable.

  --alternate-count <int>

                        Minimum alternate allele read count (in at least one sample),

                        corresponds to freebayes parameter -C <int> [default: 4].

  --alternate-fraction <float>

                        Minimum alternate allele read fraction (in at least one sample),

                        corresponds to freebayes parameter -F <float> [default: 0.05].

  --n-alleles <int>     Use at most <int> best alleles (set 0 to all),

                        corresponds to freebayes parameter -n <int> [default: 3].

 

Variant calling parameters:

  --skip-paralog        Do not calculate paralog-specific genotypes.

  --limit-qual <float>  Skip SNVs that do not overlap PSVs and have Freebayes quality

                        under <float> [default: 1].

  --assume-cn <bed>     Instead of using Parascopy paralog-specific copy number values, use copy number from

                        the input file with columns "chrom start end samples copy_num".

                        Fourth input column can be a single sample name; list of sample names separated by ",";

                        or "*" to indicate all samples.

  --limit-pooled <float>

                        Based solely on allelic read depth, ignore pooled genotypes with probabilities

                        under 10^<float> [default: -5].

  --mutation-rate <float>

                        Log10 mutation rate (used for calculating genotype priors) [default: -3].

  --error-rate <float> <float>

                        Two error rates: first for SNPs, second for indels [default: 0.01 0.01].

  --base-qual <int> <int>

                        Ignore observations with low base quality (first for SNPs, second for indels) [default: 10 10].

  --no-mate-penalty <float>

                        Penalize possible paired-read alignment positions in case they do not match

                        second read alignment position (log10 penalty) [default: -5].

  --psv-ref-gt <float>  Use all PSVs (even unreliable) if they have a reference paralog-specific

                        genotype (genotype quality >= <float>) [default: 20].

  --limit-depth <int> <int>

                        Min and max variant read depth [default: 3, 2000].

  --strand-bias <float>

                        Maximum strand bias (Phred score) [default: 30].

  --unpaired-bias <float>

                        Maximum unpaired bias (Phred score) [default: 30].

  --max-agcn <int>      Maximum aggregate copy number [default: 10].

 

Execution parameters:

  --rerun full|partial|none

                        Rerun analysis for all loci:

                            full:    complete rerun,

                            partial: use already calculated read-allele observations,

                            none:    skip successfully finished loci [default].

  -s (<file>|<name>) ..., --samples (<file>|<name>) ...

                        Limit the analysis to the provided sample names.

                        Input may consist of sample names and files with sample names (filename should contain "/").

  -@ <int>, --threads <int>

                        Number of available threads [default: 4].

  --regions-subset <str> [<str> ...]

                        Additionally filter input regions: only use regions with names that are in this list.

                        If the first argument is "!", only use regions not in this list.

  --samtools <path>     Path to "samtools" executable [default: samtools].

  --tabix <path>        Path to "tabix" executable [default: tabix].

                        Use "none" to skip indexing output files.

  --debug               Output additional debug information.

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy view -h

usage: parascopy view <table> [-o <table>] [arguments]

 

View and filter homology table.

 

Input/output arguments:

  <file>                Input indexed bed.gz homology table.

  -o <file>, --output <file>

                        Optional: output path.

 

Region arguments (optional):

  -r <region> [<region> ...], --regions <region> [<region> ...]

                        Region(s) in format "chr" or "chr:start-end").

                        Start and end are 1-based inclusive. Commas are ignored.

  -R <file> [<file> ...], --regions-file <file> [<file> ...]

                        Input bed[.gz] file(s) containing regions (tab-separated, 0-based semi-exclusive).

  --r2 <region> [<region> ...], --regions2 <region> [<region> ...]

                        Second region must overlap region(s) in format "chr" or "chr:start-end").

                        Start and end are 1-based inclusive. Commas are ignored.

  --R2 <file> [<file> ...], --regions2-file <file> [<file> ...]

                        Second region must overlap regions from the input bed[.gz] file(s).

 

Duplications filtering arguments:

  -i <expr>, --include <expr>

                        Include duplications for which the expression is true.

  -e <expr>, --exclude <expr>

                        Exclude duplications for which the expression is true.

  -t, --skip-tangled    Do not show tangled regions.

 

Output format arguments:

  -p, --pretty          Print commas as thousand separator and split info field into tab entries.

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy msa -h

usage: parascopy msa -t <table> -f <fasta> (-r <region> | -R <bed>) [-o <clustal>] [arguments]

 

Visualize multiple sequence alignment of homologous regions.

 

Input/output arguments:

  -t <file>, --table <file>

                        Input indexed bed.gz homology table.

  -f <file>, --fasta-ref <file>

                        Input reference fasta file.

  -o <file>, --output <file>

                        Optional: output in clustal format.

  -O <file>, --out-csv <file>

                        Optional: output csv file with aligned positions.

 

Region arguments (at least one is required):

  -r <region> [<region> ...], --regions <region> [<region> ...]

                        Region(s) in format "chr" or "chr:start-end").

                        Start and end are 1-based inclusive. Commas are ignored.

  -R <file> [<file> ...], --regions-file <file> [<file> ...]

                        Input bed[.gz] file(s) containing regions (tab-separated, 0-based semi-exclusive).

 

Duplications filtering arguments:

  -e <expr>, --exclude <expr>

                        Exclude duplications for which the expression is true [default: length < 500].

 

Optional arguments:

  -c, --true-clustal    Outputs true clustal format: writes gaps outside the boundary of duplication,

                        writes number of bases instead of genomic position.

  -w <int>, --width <int>

                        Number of basepairs per line [default: 60].

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy psvs -h

usage: parascopy psvs -t <table> -f <fasta> (-r <region> | -R <bed>) -o <vcf> [arguments]

 

Output PSVs (paralogous-sequence variants) between homologous regions.

 

Input/output arguments:

  -t <file>, --table <file>

                        Input indexed bed.gz homology table.

  -f <file>, --fasta-ref <file>

                        Input reference fasta file.

  -o <file>, --output <file>

                        Output vcf[.gz] file.

 

Region arguments (required):

  -r <region> [<region> ...], --regions <region> [<region> ...]

                        Region(s) in format "chr" or "chr:start-end").

                        Start and end are 1-based inclusive. Commas are ignored.

  -R <file> [<file> ...], --regions-file <file> [<file> ...]

                        Input bed[.gz] file(s) containing regions (tab-separated, 0-based semi-exclusive).

 

Duplications filtering arguments:

  -e <expr>, --exclude <expr>

                        Exclude duplications for which the expression is true [default: length < 500].

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

> parascopy examine -h

usage: parascopy examine -t <table> -o <table> [arguments]

 

Split input regions by reference copy number.

 

Input/output arguments:

  -t <file>, --table <file>

                        Input indexed bed.gz homology table.

  -o <file>, --output <file>

                        Output bed[.gz] file.

 

Region arguments (optional):

  -r <region> [<region> ...], --regions <region> [<region> ...]

                        Region(s) in format "chr" or "chr:start-end").

                        Start and end are 1-based inclusive. Commas are ignored.

  -R <file> [<file> ...], --regions-file <file> [<file> ...]

                        Input bed[.gz] file(s) containing regions (tab-separated, 0-based semi-exclusive).

 

Duplications filtering arguments:

  -e <expr>, --exclude <expr>

                        Exclude duplications for which the expression is true [default: length < 500].

 

Optional arguments:

  -m <int>, --min-length <int>

                        Do not output entries shorter that the minimal length [default: 0].

 

Other arguments:

  -h, --help            Show this help message

  -V, --version         Show version.

 

 

実行方法

1、Parascopyのランには重複遺伝子のテーブルファイルを使う。ゲノムからビルドする。

bwa index genome.fa
parascopy pretable -f genome.fa -o pretable.bed.gz
parascopy table -i pretable.bed.gz -f genome.fa -o table.bed.gz

GRCh37, GRCh38 and CHM13の計算済みDBがダウンロードできる。

https://zenodo.org/records/15019940

 

2、バックグラウンドのリードデプスを計算

入力bam/cramはソートおよびインデックス付けされている必要がある。-i in1.bam in2.bam ... または -I input-list.txt として渡す(input-list.txt は、各行にファイル名が 1 つずつ記述されたテキスト)。

#-bで1のテーブルファイルを指定。bamのリスト-Iでを指定
parascopy depth -b input.bed.gz -I input.list -f genome.fa -o depth

#あるいはbamファイルを-iで指定
parascopy depth -b input.bed.gz -i input.bam -f genome.fa -o depth
  • -i     Input indexed BAM/CRAM files. All entries should follow the format "filename[::sample]".  If sample name is not set, all reads in a corresponding file should have a read group (@RG). Mutually exclusive with --input-list.
  • -I     A file containing a list of input BAM/CRAM files. All lines should follow the format "filename[ sample]" If sample name is not set, all reads in a corresponding file should  have a read group (@RG). Mutually exclusive with --input.
  • -g     Use predefined windows for the human genome, possible values are: GRCh37, GRCh38, CHM13; with an optional suffix `-fast`. Mutually exclusive with --bed-regions.
  • -b     Input bed[.gz] file containing windows (tab-separated, 0-based semi-exclusive), which will be used to calculate read depth. Windows must be non-overlapping and have the same size. Mutually exclusive with --genome.         

depthコマンドのあとはcnのステップを行う。

 

3、複数サンプルの agCN(アグリゲートコピー数)および psCN(パラログ特異的コピー数)を推定

parascopy cn -I input.list -t table.bed.gz -f genome.fa -R regions.bed -d depth -o analysis1

 

その他

  • ヒトゲノムには多様な構造多型が存在するため、特定の領域には別バージョンの配列(alternate locus / ALT contig) が追加で用意されていて、ALT contigと呼ばれる(ALT contigs: 父、母、由来の配列という意味ではなく、MHC 領域のような配列多様性が高い領域のprimary assemblyとは異なる代替の配列表現)。ALT contig を “別のコピーと解釈するとコピー数推定が狂う 可能性がある。"-modify-ref"フラグを立てる事でALT contigのコピー数を 0 に設定できる。
  • 配列ヘッダーにはChromosomesがついている必要がある。前提条件が厳しく、些細なことでエラーが起きやすいので、レポジトリのリンク先のゲノム配列とbedを使うのが無難。計算にかなり時間がかかるので、最初は小さな染色体、特に19番染色体以降の1個くらいを使うのが良いかもしれない。

引用

Robust and accurate estimation of paralog-specific copy number for duplicated genes using whole-genome sequencing

Timofey Prodanov & Vikas Bansal 
Nature Communications volume 13, Article number: 3221 (2022) 

 

A multilocus approach for accurate variant calling in low-copy repeats using whole-genome sequencing Open Access

Timofey Prodanov , Vikas Bansal

Bioinformatics, Volume 39, Issue Supplement_1, June 2023, Pages i279–i287