This post introduces SVE, the SV-calling pipeline used by FusorSV, which was covered here previously.
Core Frameworks and Extension (image from GitHub)
Installation
SVE has many dependencies, so running it via the Docker container is recommended.
docker pull timothyjamesbecker/sve
$ docker run --rm -v $PWD/:/data timothyjamesbecker/sve /software/SVE/scripts/prepare_ref.py -h
usage: prepare_ref.py [-h] [-g | -r REF_PATH] [-t TARGET] [-o OUT_DIR]
[-l LEN] [-c CHR] [-P CPUS] [-d DATABASE] [-e]
Prepares a random fasta reference or an existing fasta reference file for
downstream SV analysis. This includes chrom copying, indexing, masking and all
associated one-time processes that need to be done before parallel per-sample
or other parallel runs can be started independently.
optional arguments:
-h, --help show this help message and exit
-g, --gen_rnd generate a rgXXX.fa file using -L and -C [False]
-r REF_PATH, --ref_path REF_PATH
fasta reference file [None]
-t TARGET, --target TARGET
comma separated list of stage names to filenames
(wildcards will work for multiple files) note: this is
a necessary step for use with breakseq2 and some other
callers that rely on specific libraries/files. [EX] -t
breakseq:/data/breakseq2_bplib_20150129* [None]
-o OUT_DIR, --out_dir OUT_DIR
output directory to store resulting files [~/refs]
-l LEN, --len LEN random reference total length [None]
-c CHR, --chr CHR random reference number of chroms [3]
-P CPUS, --cpus CPUS number of cpus/cores to use [1]
-d DATABASE, --database DATABASE
database configuration file [SVE/data/svedb.config]
-e, --erase_db wipe the SVEDB [False]
$ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/prepare_bam.py -h
usage: prepare_bam.py [-h] [-a ALGORITHM] [-g] [-m] [-s SAMPLE] [-o OUT_DIR]
[-r REF] [-d DATABASE] [-f FQS] [-b BAM] [-P CPUS]
[-T THREADS] [-M MEM]
Processes Illumina PE .fq reads into a .bam file given a reference .fa file as
input. [Note] mem, piped_mem, piped_split sub-pipelines assume reads > 75bp in
length
optional arguments:
-h, --help show this help message and exit
-a ALGORITHM, --algorithm ALGORITHM
aln|mem|piped_mem|piped_split|speed_seq [piped_split]
-g, --replace_rg replace read groups [False]
-m, --mark_duplicates
mark duplicate reads [False]
-s SAMPLE, --sample SAMPLE
sample name [input]
-o OUT_DIR, --out_dir OUT_DIR
output directory to store resulting files [None]
-r REF, --ref REF fasta reference file path [None]
-d DATABASE, --database DATABASE
database configuration file [SVE/data]
-f FQS, --fqs FQS fq comma-sep file path list [None] [EX PE] --fqs
~/data/sample1_FWD.fq,~/data/sample1_REV.fq
-b BAM, --bam BAM bam file path [None]
-P CPUS, --cpus CPUS number of cpus for alignment and sorting, etc. [1]
-T THREADS, --threads THREADS
number of threads per CPU [4]
-M MEM, --mem MEM ram in GB units to use for processing per cpu/thread
unit [4]
——
$ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/auto.py -h
usage: auto.py [-h] [-o OUT_DIR] [-r REF] [-f FQS] [-b BAM] [-s SAMPLE]
[-t TARGET] [-m MODEL] [-P CPUS] [-T THREADS] [-M MEM]
[-a ALGORITHM] [-R REALIGN] [-v]
Autonomous Structural Variation Engine:
Given a .fa reference file and a pair: NA12878_1.fq.gz,NA12878_2.fq.gz,
produce a FusorSV VCF file all_samples_merge.vcf with comprehensive merged SV calls using a fusion model.
#[USAGE] auto.py --ref --fqs|bam --out_dir
#-----------------------------------------------------------------------------------------------------------------
optional arguments:
-h, --help show this help message and exit
-o OUT_DIR, --out_dir OUT_DIR
output directory to store resulting files [None]
-r REF, --ref REF fasta reference path (if indexes are not found, run (A)) [None]
-f FQS, --fqs FQS fq comma-sep file path list [None]
[EX PE] --fqs ~/data/sample1_FWD.fq,~/data/sample1_REV.fq
-b BAM, --bam BAM BAM format file for use without alignment or for realignment [None]
-s SAMPLE, --sample SAMPLE
Sample identifier [inferred from BAM|fastq read 1 trimmed by . and _]
-t TARGET, --target TARGET
maps a stage name to a comma separated list of filenames with a colon : (wildcards will work for multiple files)
multiple mappings are separated by a semicolon ; This is a necessary step when using breakseq2 and some other callers
that rely on specific libraries/files with a new reference. The SVE image has hg19 libraries for breakseq2 included.
[EX] -t delly:/data/exclude_list.bed;breakseq:/data/breakseq2_bplib_20150129*
[None]
-m MODEL, --model MODEL
data fusion model [FusorSV/data/models/g1k_v37_decoy.P3.pickle]
-P CPUS, --cpus CPUS number of cpus for alignment and sorting, etc. [1]
-T THREADS, --threads THREADS
number of threads per CPU [4]
-M MEM, --mem MEM ram in GB units to use for processing per cpu/thread unit [4]
-a ALGORITHM, --algorithm ALGORITHM
alignment/realignment algorithm pipeline [speed_seq]
-R REALIGN, --realign REALIGN
does realignment of the BAM file to the reference [place holder]
-v, --verbose stderr and stdout from all stages [False]
——
$ docker run --rm -itv $PWD:/data/ timothyjamesbecker/sve /software/SVE/scripts/variant_processor.py -h
usage: variant_processor.py [-h] [-o OUT_DIR] [-r REF] [-b BAMS]
[-D READ_DEPTH] [-L READ_LENGTH] [-t TARGETS]
[-s STAGES] [-c CHROMS] [-p PARTITIONS] [--debug]
[-d DATABASE] [-e] [-v]
Shortened version of the SVE that uses pre-made .bam files, although .bam
files are not compatible with all callers such as Variation Hunter
optional arguments:
-h, --help show this help message and exit
-o OUT_DIR, --out_dir OUT_DIR
output directory to store resulting files
-r REF, --ref REF fasta reference file path
-b BAMS, --bams BAMS BAM comma-sep bam file path list, assuming matching
bam.bai is in place... [EX] --bam
~/data/sample1.bam,~/data/sample2.bam,~/data/sample3.bam
-D READ_DEPTH, --read_depth READ_DEPTH
Average Depth of Coverage in ROI
-L READ_LENGTH, --read_length READ_LENGTH
Read Length
-t TARGETS, --targets TARGETS
target region files for input: vcf,bd,1gk formats
-s STAGES, --stages STAGES
stage name list
-c CHROMS, --chroms CHROMS
chrom name list
-p PARTITIONS, --partitions PARTITIONS
number of partitions (degree of parallelism)
--debug save result/error data to db
-d DATABASE, --database DATABASE
database configuration file
-e, --erase_db reset and clear the schema bound db
-v, --verbose be verbose with caller stdout/stderr
As of November 2018, ten SV callers are supported:
- 1. BreakDancer (with VCF converter)
- 2. BreakSeq2
- 3. cnMOPS (with VCF converter)
- 4. CNVnator
- 5. Delly2
- 6. Hydra-Multi (with VCF converter)
- 7. GATK Haplotype Caller 3.6 (with SV size VCF filter)
- 8. GenomeSTRiP2.0 (both SVDiscovery and CNVdiscovery) (with DEL/DUP VCF converter)
- 9. Lumpy-SV
- 10. Tigra-SV (and EXT pipeline)
Usage
Autonomous Mode: runs every step, from alignment through SV calling to merging with FusorSV. The fasta reference must have the ".fa" extension.
docker run -v $PWD/:/data -it timothyjamesbecker/sve /software/SVE/scripts/auto.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-f /data/sample1/sample1_1.fq,/data/sample1/sample1_2.fq \
-o /data/run_sample1/
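Because of the ".fa" requirement, a reference distributed with a ".fasta" extension must be renamed before use. A minimal sketch (the file name below is a placeholder for demonstration, not a file the pipeline ships):

```shell
# SVE looks for the ".fa" extension, so rename any ".fasta" reference first.
mkdir -p ref
touch ref/example_reference.fasta    # placeholder standing in for a real reference
for f in ref/*.fasta; do
  [ -e "$f" ] || continue            # no-op when no .fasta files are present
  mv "$f" "${f%.fasta}.fa"           # strip ".fasta", append ".fa"
done
```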
By default, the 1000 Genomes phase 3 reference is assumed (link).
Running the steps individually
1. Prepare the reference indexes (needed only on the first run)
docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_ref.py \
-r /data/human_g1k_v37_decoy.fa \
-t breakseq:/data/breakseq2*.fna \
-o /data/ref/ \
-P 12
The FASTA file's extension must be ".fa". This step takes quite a long time. When the job finishes, the FASTA file and its associated indexes are written to ref/. Point subsequent runs at the FASTA file in this directory.
2. Create a BAM using speedseq as the aligner
docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/prepare_bam.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-f /data/sample1/sample1_1.fq,/data/sample1/sample1_2.fq \
-o /data/bams/ \
-P 4 -T 6 -M 4 -a speed_seq
- -P number of cpus for alignment and sorting, etc. [1]
- -T number of threads per CPU [4]
- -M ram in GB units to use for processing per cpu/thread unit [4]
- -a aln|mem|piped_mem|piped_split|speed_seq [piped_split]
With -P 4 -T 6 -M 4, the run uses 24 cores and 96 GB of memory (4 GB per core).
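The arithmetic behind that sizing can be checked before launching, since total parallelism is -P × -T and total RAM is -P × -T × -M:

```shell
# Total parallel lanes = -P x -T; total RAM = -P x -T x -M (GB per lane).
P=4; T=6; M=4
CORES=$((P * T))
RAM=$((P * T * M))
echo "requesting ${CORES} cores and ${RAM} GB RAM"
# -> requesting 24 cores and 96 GB RAM
```

Compare these totals against the machine's core count and memory before picking values.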
3. Call SVs from the BAM
docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/vcfs/ \
-s breakdancer,gatk_haplo,delly,lumpy,cnvnator
#with all ten supported SV callers
docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/SVE/scripts/variant_processor.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-b /data/bams/sample1.bam \
-o /data/vcfs/ \
-s breakdancer,gatk_haplo,delly,lumpy,cnvnator,breakseq,cnmops,hydra,genome_strip,tigra
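If you would rather not have one problematic caller interfere with a long multi-caller run, the stages can also be submitted one at a time. A sketch of that workflow (a convenience pattern, not a documented SVE requirement; "echo" makes this a dry run that only prints each command, so remove it to actually execute):

```shell
# Run each caller in its own variant_processor.py invocation so a single
# failing stage can be retried alone. "echo" turns this into a dry run.
for caller in breakdancer gatk_haplo delly lumpy cnvnator; do
  echo docker run -v "$PWD/:/data" timothyjamesbecker/sve \
    /software/SVE/scripts/variant_processor.py \
    -r /data/ref/human_g1k_v37_decoy.fa \
    -b /data/bams/sample1.bam \
    -o /data/vcfs/ \
    -s "$caller"
done
```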
4. Merge the SV calls with FusorSV
docker run -v $PWD/:/data timothyjamesbecker/sve \
/software/FusorSV/FusorSV.py \
-r /data/ref/human_g1k_v37_decoy.fa \
-c 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,X,Y \
-i /data/vcfs/ \
-o /data/fused/ \
-f /data/models/human_g1k_v37_decoy.P3.pickle.gz \
-p 2 \
-M 0.5 \
-L DEFAULT
- -p number of cores to use in parallel [1]
- -M reciprocal overlap needed for clustering [0.5]
- -L liftover chain file path or default [./data/hg19ToHg38.over.chain.gz]
- -c or --chroms is an optional chrom list if you only want calls that were made on specific sequences. The default is 1-22,X,Y,MT with auto-sensing of the chr prefix
- -i or --in_dir takes the VCF directory that was produced in the previous step
Citation
FusorSV: an algorithm for optimally combining data from multiple structural variation detection methods
Becker T, Lee WP, Leone J, Zhu Q, Zhang C, Liu S, Sargent J, Shanker K, Mil-Homens A, Cerveira E, Ryan M, Cha J, Navarro FCP, Galeev T, Gerstein M, Mills RE, Shin DG, Lee C, Malhotra A
Genome Biol. 2018 Mar 20;19(1):38.
Related/similar tools