
Structural Variation Engine (SVE)

 

This post introduces SVE, the SV-calling pipeline used by FusorSV, which was covered here recently.

 

(Figure: Core Frameworks and Extension. From the GitHub repository.)

 

Installation

Because SVE has many dependencies, running it from the Docker container is recommended.

GitHub

docker pull timothyjamesbecker/sve


$ docker run --rm -v $PWD/:/data timothyjamesbecker/sve /software/SVE/scripts/prepare_ref.py -h

usage: prepare_ref.py [-h] [-g | -r REF_PATH] [-t TARGET] [-o OUT_DIR]

                      [-l LEN] [-c CHR] [-P CPUS] [-d DATABASE] [-e]

 

Prepares a random fasta reference or existing fasta reference file for

downstream SV analysis. This includes chrom copying, indexing, masking and all

associated one-time processecies that need to be done before || per sample or

other || means can be started in indepencence.

 

optional arguments:

  -h, --help            show this help message and exit

  -g, --gen_rnd         generate a rgXXX.fa file using -L and -C [False]

  -r REF_PATH, --ref_path REF_PATH

                        fasta reference file [None]

  -t TARGET, --target TARGET

                        comma sperated list of stage names to filenames

                        (wildcards will work for multiple files) note: this is

                        a nessesary step for use with breakseq2 and some other

                        callers that rely on specific libraries/files. [EX] -t

                        breakseq:/data/breakseq2_bplib_20150129* [None]

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files [~/refs]

  -l LEN, --len LEN     random reference total length [None]

  -c CHR, --chr CHR     random reference number of chroms [3]

  -P CPUS, --cpus CPUS  number of cpus/cores to use [1]

  -d DATABASE, --database DATABASE

                        database configuration file [SVE/data/svedb.config]

  -e, --erase_db        wipe the SVEDB [False]
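
As a side note, the -g, -l and -c flags documented above can also generate a small random test reference instead of preparing an existing one. A minimal sketch; the length, chromosome count and output directory here are arbitrary placeholders:

docker run --rm -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_ref.py \
-g -l 1000000 -c 3 \
-o /data/test_ref/ \
-P 1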


$ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/prepare_bam.py -h

usage: prepare_bam.py [-h] [-a ALGORITHM] [-g] [-m] [-s SAMPLE] [-o OUT_DIR]

                      [-r REF] [-d DATABASE] [-f FQS] [-b BAM] [-P CPUS]

                      [-T THREADS] [-M MEM]

 

Processes Illumina PE .fq reads into a .bam file given a reference .fa file as

input. [Note] mem, piped_mem, piped_split sub-pipelines assume reads > 75bp in

length

 

optional arguments:

  -h, --help            show this help message and exit

  -a ALGORITHM, --algorithm ALGORITHM

                        aln|mem|piped_mem|piped_split|speed_seq [piped_split]

  -g, --replace_rg      replace reads groups [False]

  -m, --mark_duplicates

                        mark duplicate reads [False]

  -s SAMPLE, --sample SAMPLE

                        sample name [input]

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files [None]

  -r REF, --ref REF     fasta reference file path [None]

  -d DATABASE, --database DATABASE

                        database configuration file [SVE/data]

  -f FQS, --fqs FQS     fq comma-sep file path list [None] [EX PE] --fqs

                        ~/data/sample1_FWD.fq,~/data/sample1_REV.fq

  -b BAM, --bam BAM     bam file path [None]

  -P CPUS, --cpus CPUS  number of cpus for alignment and sorting, ect [1]

  -T THREADS, --threads THREADS

                        number of threads per CPU [4]

  -M MEM, --mem MEM     ram in GB units to use for processing per cpu/thread

                        unit [4]

——


 $ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/auto.py -h

usage: auto.py [-h] [-o OUT_DIR] [-r REF] [-f FQS] [-b BAM] [-s SAMPLE]

               [-t TARGET] [-m MODEL] [-P CPUS] [-T THREADS] [-M MEM]

               [-a ALGORITHM] [-R REALIGN] [-v]

 

Autonomous Structural Variation Engine:

Given a .fa reference file and a pair: NA12878_1.fq.gz,NA12878_2.fq.gz, 

produce a FusorSV VCF file all_samples_merge.vcf with comprehensive merged SV calls using a fusion model.

#[USAGE] auto.py --ref --fqs|bam --out_dir

#----------------------------------------------------------------------------------------------------------------- 

 

optional arguments:

  -h, --help            show this help message and exit

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files [None]

  -r REF, --ref REF     fasta reference path (if indexes are not found, run (A)) [None]

  -f FQS, --fqs FQS     fq comma-sep file path list [None]

                        [EX PE] --fqs ~/data/sample1_FWD.fq,~/data/sample1_REV.fq

  -b BAM, --bam BAM     BAM format file for use without alignment or for realignmnet [None]

  -s SAMPLE, --sample SAMPLE

                        Sample identifier [infered from BAM|fastq read 1 trimmed by . and _]

  -t TARGET, --target TARGET

                        maps a stage name to a comma seperated list of filenames with a colon : (wildcards will work for multiple files)

                        multiple mappings are sperated by a semi colon ; This is a nessesary step when using breakseq2 and some other callers 

                        that rely on specific libraries/files with a new reference. The SVE image has hg19 libraries for breakseq2 included.

                        [EX] -t delly:/data/exclude_list.bed;breakseq:/data/breakseq2_bplib_20150129*

                        [None]

  -m MODEL, --model MODEL

                        data fusion model [FusorSV/data/models/g1k_v37_decoy.P3.pickle]

  -P CPUS, --cpus CPUS  number of cpus for alignment and sorting, ect [1]

  -T THREADS, --threads THREADS

                        number of threads per CPU [4]

  -M MEM, --mem MEM     ram in GB units to use for processing per cpu/thread unit [4]

  -a ALGORITHM, --algorithm ALGORITHM

                        alignment/realignment algorithm pipeline [speed_seq]

  -R REALIGN, --realign REALIGN

                        does realignment of the BAM file to the reference [place holder]

  -v, --verbose         stderr and stdout from all stages [False]

——


$ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/variant_processor.py -h

usage: variant_processor.py [-h] [-o OUT_DIR] [-r REF] [-b BAMS]

                            [-D READ_DEPTH] [-L READ_LENGTH] [-t TARGETS]

                            [-s STAGES] [-c CHROMS] [-p PARTITIONS] [--debug]

                            [-d DATABASE] [-e] [-v]

 

Shortened Version of the SVE that uses pre-made .bam files Allthough .bam

files are not compatible with all callers such as Variation Hunter

 

optional arguments:

  -h, --help            show this help message and exit

  -o OUT_DIR, --out_dir OUT_DIR

                        output directory to store resulting files

  -r REF, --ref REF     fasta reference file path

  -b BAMS, --bams BAMS  BAM comma-sep bam file path list, assuming matching

                        bam.bai is in place... [EX] --bam ~/data/sample1.bam,~

                        /data/sample2.bam,~/data/sample3.bam

  -D READ_DEPTH, --read_depth READ_DEPTH

                        Average Depth of Coverage in ROI

  -L READ_LENGTH, --read_length READ_LENGTH

                        Read Length

  -t TARGETS, --targets TARGETS

                        target region files for input: vcf,bd,1gk formats

  -s STAGES, --stages STAGES

                        stage name list

  -c CHROMS, --chroms CHROMS

                        chrom name list

  -p PARTITIONS, --partitions PARTITIONS

                        number of partitions or ||L

  --debug               save result/error data to db

  -d DATABASE, --database DATABASE

                        database configuration file

  -e, --erase_db        reset and clear the schema bound db

  -v, --verbose         be verbose with caller stdout/stderr

As of November 2018, the following ten SV callers are supported.

  • 1. BreakDancer (with VCF converter)
  • 2. BreakSeq2
  • 3. cnMOPS (with VCF converter)
  • 4. CNVnator
  • 5. Delly2
  • 6. Hydra-Multi (with VCF converter)
  • 7. GATK Haplotype Caller 3.6 (with SV size VCF filter)
  • 8. GenomeSTRiP2.0 (both SVDiscovery and CNVdiscovery) (with DEL/DUP VCF converter)
  • 9. Lumpy-SV
  • 10. Tigra-SV (and EXT pipeline)

 

Usage

Autonomous Mode: all steps (alignment, SV calling, and merging with FusorSV) are run automatically. The reference FASTA file must have the ".fa" extension.

docker run -v $PWD/:/data -it timothyjamesbecker/sve /software/SVE/scripts/auto.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-f /data/sample1/sample1_1.fq,/data/sample1/sample1_2.fq \
-o /data/run_sample1/

Using the 1000 Genomes phase 3 reference is the default (link).
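
According to the help above, auto.py can also start from an existing BAM via -b instead of FASTQ input. A minimal sketch, assuming an already prepared BAM and an arbitrary output directory:

docker run -v $PWD/:/data -it timothyjamesbecker/sve /software/SVE/scripts/auto.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/run_sample1_bam/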

 

Running the steps individually

1. Prepare the reference indexes (needed only the first time)

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_ref.py \
-r /data/human_g1k_v37_decoy.fa \
-t breakseq:/data/breakseq2*.fna \
-o /data/ref/ \
-P 12

The FASTA file's extension must be ".fa". This step takes quite a long time. When the job finishes, the FASTA file and its associated indexes are written to ref/; specify the FASTA file in this directory for all subsequent runs.

 

2. Create a BAM using speedseq as the aligner

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_bam.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-f /data/sample1/sample1_1.fq,/data/sample1/sample1_2.fq \
-o /data/bams/ \
-P 4 -T 6 -M 4 -a speed_seq
  • -P    number of CPUs for alignment, sorting, etc. [1]
  • -T    number of threads per CPU [4]
  • -M    RAM in GB to use per CPU/thread unit [4]
  • -a    aln|mem|piped_mem|piped_split|speed_seq [piped_split]

With -P 4 -T 6 -M 4, the run uses 24 cores and 96 GB of RAM (4 GB per core).
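
My reading of how these three flags combine (a rough estimate, not an official formula):

# total threads  ≈ -P × -T        → 4 × 6 = 24
# total RAM (GB) ≈ -P × -T × -M   → 24 × 4 = 96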

 

3. Call SVs from the BAM

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/vcfs/ \
-s breakdancer,gatk_haplo,delly,lumpy,cnvnator

# run with more SV callers
docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/vcfs/ \
-s breakdancer,gatk_haplo,delly,lumpy,cnvnator,breakseq,cnmops,hydra,genome_strip,tigra
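
Since -b/--bams takes a comma-separated BAM list (see the help above), SVs can also be called for several samples in one run before merging. A sketch with hypothetical sample2 and sample3 BAMs:

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam,/data/bams/sample2.bam,/data/bams/sample3.bam \
-o /data/vcfs/ \
-s breakdancer,delly,lumpy,cnvnator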

 

4. Merge the SV calls with FusorSV

docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/FusorSV/FusorSV.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y \
-i /data/vcfs/ \
-o /data/fused/ \
-f /data/models/human_g1k_v37_decoy.P3.pickle.gz \
-p 2 \
-M 0.5 \
-L DEFAULT
  • -p    number of cores to use in parallel [1]
  • -M    reciprocal overlap needed for clustering [0.5]
  • -L    liftover chain file path or DEFAULT [./data/hg19ToHg38.over.chain.gz]
  • -c or --chroms: an optional chromosome list if you only want calls made on specific sequences. The default is 1-22,X,Y,MT with auto-sensing of the "chr" prefix.
  • -i or --in_dir: the directory of VCFs produced in step 3.

 

Citation

FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods

Becker T, Lee WP, Leone J, Zhu Q, Zhang C, Liu S, Sargent J, Shanker K, Mil-Homens A, Cerveira E, Ryan M, Cha J, Navarro FCP, Galeev T, Gerstein M, Mills RE, Shin DG, Lee C, Malhotra A

Genome Biol. 2018 Mar 20;19(1):38.

 

Related/similar tools