
Structural Variation Engine (SVE)

 

This post introduces SVE, the SV-calling pipeline used by FusorSV, which was covered here recently.

 

(Figure: Core Frameworks and Extension. From the GitHub repository.)

 

Installation

Because SVE has many dependencies, running it from the Docker container is recommended.

GitHub

docker pull timothyjamesbecker/sve


$ docker run --rm -v $PWD/:/data timothyjamesbecker/sve /software/SVE/scripts/prepare_ref.py -h

usage: prepare_ref.py [-h] [-g | -r REF_PATH] [-t TARGET] [-o OUT_DIR]

                      [-l LEN] [-c CHR] [-P CPUS] [-d DATABASE] [-e]

 

Prepares a random fasta reference or existing fasta reference file for

downstream SV analysis. This includes chrom copying, indexing, masking and all

associated one-time processecies that need to be done before || per sample or

other || means can be started in indepencence.

 

optional arguments:

  -h, --help            show this help message and exit

  -g, --gen_rnd         generate a rgXXX.fa file using -L and -C [False]

  -r REF_PATH, --ref_path REF_PATH

                        fasta reference file [None]

  -t TARGET, --target TARGET

                        comma sperated list of stage names to filenames

                        (wildcards will work for multiple files) note: this is

                        a nessesary step for use with breakseq2 and some other

                        callers that rely on specific libraries/files. [EX] -t

                        breakseq:/data/breakseq2_bplib_20150129* [None]

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files [~/refs]

  -l LEN, --len LEN     random reference total length [None]

  -c CHR, --chr CHR     random reference number of chroms [3]

  -P CPUS, --cpus CPUS  number of cpus/cores to use [1]

  -d DATABASE, --database DATABASE

                        database configuration file [SVE/data/svedb.config]

  -e, --erase_db        wipe the SVEDB [False]
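
As a side note, the -g, -l and -c flags documented above can also generate a small random test reference instead of preparing an existing one. A minimal sketch; the length, chromosome count and output directory here are arbitrary placeholders:

docker run --rm -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_ref.py \
-g -l 1000000 -c 3 \
-o /data/test_ref/ \
-P 1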


$ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/prepare_bam.py -h

usage: prepare_bam.py [-h] [-a ALGORITHM] [-g] [-m] [-s SAMPLE] [-o OUT_DIR]

                      [-r REF] [-d DATABASE] [-f FQS] [-b BAM] [-P CPUS]

                      [-T THREADS] [-M MEM]

 

Processes Illumina PE .fq reads into a .bam file given a reference .fa file as

input. [Note] mem, piped_mem, piped_split sub-pipelines assume reads > 75bp in

length

 

optional arguments:

  -h, --help            show this help message and exit

  -a ALGORITHM, --algorithm ALGORITHM

                        aln|mem|piped_mem|piped_split|speed_seq [piped_split]

  -g, --replace_rg      replace reads groups [False]

  -m, --mark_duplicates

                        mark duplicate reads [False]

  -s SAMPLE, --sample SAMPLE

                        sample name [input]

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files [None]

  -r REF, --ref REF     fasta reference file path [None]

  -d DATABASE, --database DATABASE

                        database configuration file [SVE/data]

  -f FQS, --fqs FQS     fq comma-sep file path list [None] [EX PE] --fqs

                        ~/data/sample1_FWD.fq,~/data/sample1_REV.fq

  -b BAM, --bam BAM     bam file path [None]

  -P CPUS, --cpus CPUS  number of cpus for alignment and sorting, ect [1]

  -T THREADS, --threads THREADS

                        number of threads per CPU [4]

  -M MEM, --mem MEM     ram in GB units to use for processing per cpu/thread

                        unit [4]

——


 $ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/auto.py -h

usage: auto.py [-h] [-o OUT_DIR] [-r REF] [-f FQS] [-b BAM] [-s SAMPLE]

               [-t TARGET] [-m MODEL] [-P CPUS] [-T THREADS] [-M MEM]

               [-a ALGORITHM] [-R REALIGN] [-v]

 

Autonomous Structural Variation Engine:

Given a .fa reference file and a pair: NA12878_1.fq.gz,NA12878_2.fq.gz, 

produce a FusorSV VCF file all_samples_merge.vcf with comprehensive merged SV calls using a fusion model.

#[USAGE] auto.py --ref --fqs|bam --out_dir

#----------------------------------------------------------------------------------------------------------------- 

 

optional arguments:

  -h, --help            show this help message and exit

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files [None]

  -r REF, --ref REF     fasta reference path (if indexes are not found, run (A)) [None]

  -f FQS, --fqs FQS     fq comma-sep file path list [None]

                        [EX PE] --fqs ~/data/sample1_FWD.fq,~/data/sample1_REV.fq

  -b BAM, --bam BAM     BAM format file for use without alignment or for realignmnet [None]

  -s SAMPLE, --sample SAMPLE

                        Sample identifier [infered from BAM|fastq read 1 trimmed by . and _]

  -t TARGET, --target TARGET

                        maps a stage name to a comma seperated list of filenames with a colon : (wildcards will work for multiple files)

                        multiple mappings are sperated by a semi colon ; This is a nessesary step when using breakseq2 and some other callers 

                        that rely on specific libraries/files with a new reference. The SVE image has hg19 libraries for breakseq2 included.

                        [EX] -t delly:/data/exclude_list.bed;breakseq:/data/breakseq2_bplib_20150129*

                        [None]

  -m MODEL, --model MODEL

                        data fusion model [FusorSV/data/models/g1k_v37_decoy.P3.pickle]

  -P CPUS, --cpus CPUS  number of cpus for alignment and sorting, ect [1]

  -T THREADS, --threads THREADS

                        number of threads per CPU [4]

  -M MEM, --mem MEM     ram in GB units to use for processing per cpu/thread unit [4]

  -a ALGORITHM, --algorithm ALGORITHM

                        alignment/realignment algorithm pipeline [speed_seq]

  -R REALIGN, --realign REALIGN

                        does realignment of the BAM file to the reference [place holder]

  -v, --verbose         stderr and stdout from all stages [False]

——


$ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/variant_processor.py -h

usage: variant_processor.py [-h] [-o OUT_DIR] [-r REF] [-b BAMS]

                            [-D READ_DEPTH] [-L READ_LENGTH] [-t TARGETS]

                            [-s STAGES] [-c CHROMS] [-p PARTITIONS] [--debug]

                            [-d DATABASE] [-e] [-v]

 

Shortened Version of the SVE that uses pre-made .bam files Allthough .bam

files are not compatible with all callers such as Variation Hunter

 

optional arguments:

  -h, --help            show this help message and exit

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files

  -r REF, --ref REF     fasta reference file path

  -b BAMS, --bams BAMS  BAM comma-sep bam file path list, assuming matching

                        bam.bai is in place... [EX] --bam ~/data/sample1.bam,~

                        /data/sample2.bam,~/data/sample3.bam

  -D READ_DEPTH, --read_depth READ_DEPTH

                        Average Depth of Coverage in ROI

  -L READ_LENGTH, --read_length READ_LENGTH

                        Read Length

  -t TARGETS, --targets TARGETS

                        target region files for input: vcf,bd,1gk formats

  -s STAGES, --stages STAGES

                        stage name list

  -c CHROMS, --chroms CHROMS

                        chrom name list

  -p PARTITIONS, --partitions PARTITIONS

                        number of partitions or ||L

  --debug               save result/error data to db

  -d DATABASE, --database DATABASE

                        database configuration file

  -e, --erase_db        reset and clear the schema bound db

  -v, --verbose         be verbose with caller stdout/stderr

As of November 2018, the following ten SV callers are supported.

  • 1. BreakDancer (with VCF converter)
  • 2. BreakSeq2
  • 3. cnMOPS (with VCF converter)
  • 4. CNVnator
  • 5. Delly2
  • 6. Hydra-Multi (with VCF converter)
  • 7. GATK Haplotype Caller 3.6 (with SV size VCF filter)
  • 8. GenomeSTRiP2.0 (both SVDiscovery and CNVdiscovery) (with DEL/DUP VCF converter)
  • 9. Lumpy-SV
  • 10. Tigra-SV (and EXT pipeline)

 

Usage

Autonomous Mode: all steps (alignment, SV calling, and merging with FusorSV) are run automatically. The reference FASTA file must have the ".fa" extension.

docker run -v $PWD/:/data -it timothyjamesbecker/sve /software/SVE/scripts/auto.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-f /data/sample1/sample1_1.fq,/data/sample1/sample1_2.fq \
-o /data/run_sample1/

Using the 1000 Genomes phase 3 reference is the default (link).
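
According to the help above, auto.py can also start from an existing BAM via -b instead of FASTQ input. A minimal sketch, assuming an already prepared BAM and an arbitrary output directory:

docker run -v $PWD/:/data -it timothyjamesbecker/sve /software/SVE/scripts/auto.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/run_sample1_bam/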

 

Running the steps individually

1. Prepare the reference indexes (needed only the first time)

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_ref.py \
-r /data/human_g1k_v37_decoy.fa \
-t breakseq:/data/breakseq2*.fna \
-o /data/ref/ \
-P 12

The FASTA file's extension must be ".fa". This step takes quite a long time. When the job finishes, the FASTA file and its associated indexes are written to ref/; specify the FASTA file in this directory for all subsequent runs.

 

2. Create a BAM using speedseq as the aligner

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_bam.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-f /data/sample1/sample1_1.fq,/data/sample1/sample1_2.fq \
-o /data/bams/ \
-P 4 -T 6 -M 4 -a speed_seq
  • -P    number of CPUs for alignment, sorting, etc. [1]
  • -T    number of threads per CPU [4]
  • -M    RAM in GB to use per CPU/thread unit [4]
  • -a    aln|mem|piped_mem|piped_split|speed_seq [piped_split]

With -P 4 -T 6 -M 4, the run uses 24 cores and 96 GB of RAM (4 GB per core).
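
My reading of how these three flags combine (a rough estimate, not an official formula):

# total threads  ≈ -P × -T        → 4 × 6 = 24
# total RAM (GB) ≈ -P × -T × -M   → 24 × 4 = 96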

 

3. Call SVs from the BAM

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/vcfs/ \
-s breakdancer,gatk_haplo,delly,lumpy,cnvnator

# run with more SV callers
docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/vcfs/ \
-s breakdancer,gatk_haplo,delly,lumpy,cnvnator,breakseq,cnmops,hydra,genome_strip,tigra
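
Since -b/--bams takes a comma-separated BAM list (see the help above), SVs can also be called for several samples in one run before merging. A sketch with hypothetical sample2 and sample3 BAMs:

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam,/data/bams/sample2.bam,/data/bams/sample3.bam \
-o /data/vcfs/ \
-s breakdancer,delly,lumpy,cnvnator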

 

4. Merge the SV calls with FusorSV

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/FusorSV/FusorSV.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y \
-i /data/vcfs/ \
-o /data/fused/ \
-f /data/models/human_g1k_v37_decoy.P3.pickle.gz \
-p 2 \
-M 0.5 \
-L DEFAULT
  • -p    number of cores to use in parallel [1]
  • -M    reciprocal overlap needed for clustering [0.5]
  • -L    liftover chain file path or DEFAULT [./data/hg19ToHg38.over.chain.gz]
  • -c or --chroms: an optional chromosome list if you only want calls made on specific sequences. The default is 1-22,X,Y,MT with auto-sensing of the "chr" prefix.
  • -i or --in_dir: the directory of VCFs produced in step 3.

 

Citation

FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods

Becker T, Lee WP, Leone J, Zhu Q, Zhang C, Liu S, Sargent J, Shanker K, Mil-Homens A, Cerveira E, Ryan M, Cha J, Navarro FCP, Galeev T, Gerstein M, Mills RE, Shin DG, Lee C, Malhotra A

Genome Biol. 2018 Mar 20;19(1):38.

 

Related/similar tools