pacbioのロングリードの構造変異検出ツール pbsv

pbsvは、PacBio一分子リアルタイムシークエンシング（SMRT）リードから二倍体ゲノムの構造変異をコールして分析するための一連のツールである。このツールは、PacBioのSMRT Link GUIのStructural Variant Calling分析ワークフローを強化する。

pbsvは挿入、欠失、逆位、duplications、および転座をコールする。シングルサンプルコールとジョイント（マルチサンプル）コールの両方がサポートされている。

pbsvの効果的な範囲

20 bpから10 kbまでの挿入
20塩基対から100塩基対までの欠失
200 bpから10 kbへの逆位
20 bpから10 kbまでのduplications
異なる染色体間の転座、あるいは単一の染色体上で100kb以上離れている

Structural Variant Detection in SMRT Link 5 with pbsv

インストール

ubuntu16.04でテストした（ホストOS macos10.12）。

PacBio Secondary Analysis Tools on Bioconda

本体 Github

#anacondaを使っているならcondaで導入可能（linux only)
conda install -c bioconda pbsv

#pbmm2がないならこちらも入れる
conda install -c bioconda pbmm2

> pbsv

# pbsv

pbsv - PacBio structural variant (SV) calling and analysis tools

Usage: pbsv <tool>

Basic Options:

-h, --help Output this help.

--version Output version information.

Tools:

discover Find structural variant signatures in read alignments (BAM to SVSIG).

call Call structural variants from SV signatures and assign genotypes (SVSIG to VCF).

Typical workflow:

1. Align PacBio reads to a reference genome, per movie.

Subreads BAM input:

$ pbmm2 align ref.fa movie1.subreads.bam ref.movie1.bam --sort --median-filter --sample sample1

CCS BAM input:

$ pbmm2 align ref.fa movie1.ccs.bam ref.movie1.bam --sort --preset CCS --sample sample1

CCS FASTQ input:

$ pbmm2 align ref.fa movie1.Q20.fastq ref.movie1.bam --sort --preset CCS --sample sample1 --rg '@RG\tID:movie1'

2. Discover signatures of structural variation (per movie or per sample):

$ pbsv discover ref.movie1.bam ref.sample1.svsig.gz

$ pbsv discover ref.movie2.bam ref.sample2.svsig.gz

3. Call structural variants and assign genotypes (all samples), for CCS input append "--ccs":

$ pbsv call ref.fa ref.sample1.svsig.gz ref.sample2.svsig.gz ref.var.vcf

root@908d5cac1b6e:/#

> pbsv discover

$ pbsv discover

Usage: pbsv discover [options] <ref.in.bam|xml> <ref.out.svsig.gz>

pbsv discover - Find structural variant (SV) signatures in read alignments (BAM to SVSIG).

Basic Options:

-h,--help Output this help.

--version Output version information.

--log-file Log to a file, instead of stderr.

--log-level Set log level: "TRACE", "DEBUG", "INFO", "WARN", "FATAL". ["WARN"]

Sample Options:

-s,--sample Override sample name tag from BAM read group.

Alignment Filter Options:

-q,--min-mapq Ignore alignments with mapping quality < N. [20]

-m,--min-ref-span Ignore alignments with reference length < N bp. ["100"]

Downsample Options:

-w,--downsample-window-length Window in which to limit coverage, in basepairs. ["10K"]

-a,--downsample-max-alignments Consider up to N alignments in a window; 0 means disabled. [20]

Discovery Options:

-r,--region Limit discovery to this reference region: CHR|CHR:START-END.

-l,--min-svsig-length Ignore SV signatures with length < N bp. ["7"]

-b,--tandem-repeats Tandem repeat intervals for indel clustering.

-k,--max-skip-split Ignore alignment pairs separated by > N bp of a read or reference. ["100"]

Options:

--emit-tool-contract Emit tool contract.

--resolved-tool-contract Use args from resolved tool contract.

Arguments:

ref.in.bam|xml Coordinate-sorted aligned reads in which to identify SV signatures.

ref.out.svsig.gz Structural variant signatures output.

> pbsv call

# pbsv call

Usage: pbsv call [options] <ref.fa|xml> <ref.in.svsig.gz|fofn> <ref.out.vcf>

pbsv call - Call structural variants from SV signatures and assign genotypes (SVSIG to VCF).

Basic Options:

-h,--help Output this help.

--version Output version information.

--log-file Log to a file, instead of stderr.

--log-level Set log level: "TRACE", "DEBUG", "INFO", "WARN", "FATAL". ["WARN"]

-j,--num-threads Number of threads to use, 0 means autodetection. [0]

-z,--chunk-length Process in chunks of N reference bp. ["1M"]

Variant Options:

-t,--types Call these SV types: "DEL", "INS", "INV", "DUP", "BND", "CNV". ["DEL,INS,INV,DUP,BND,CNV"]

-m,--min-sv-length Ignore variants with length < N bp. ["20"]

--min-cnv-length Ignore CNVs with length < N bp. ["1K"]

--max-ins-length Ignore insertions with length > N bp. ["10K"]

--max-dup-length Ignore duplications with length > N bp. ["100K"]

SV Signature Cluster Options:

--cluster-max-length-perc-diff Do not cluster signatures with difference in length > P%. [25]

--cluster-max-ref-pos-diff Do not cluster signatures > N bp apart in reference. ["200"]

--cluster-min-basepair-perc-id Do not cluster signatures with basepair identity < P%. [10]

Consensus Options:

-x,--max-consensus-coverage Limit to N reads for variant consensus. [20]

-s,--poa-scores Score POA alignment with triplet match,mismatch,gap. ["1,-2,-2"]

--min-realign-length Consider segments with > N length for re-alignment. ["100"]

Call Options:

-A,--call-min-reads-all-samples Ignore calls supported by < N reads total across samples. [2]

-O,--call-min-reads-one-sample Ignore calls supported by < N reads in every sample. [2]

-S,--call-min-reads-per-strand-all-samples Ignore calls supported by < N reads per strand total across samples [1]

-P,--call-min-read-perc-one-sample Ignore calls supported by < P% of reads in every sample. [20]

--ccs CCS optimized parameters: -A 1 -O 1 -S 0 -P 10.

Genotyping:

--gt-min-reads Minimum supporting reads to assign a sample a non-reference genotype. [1]

Annotations:

--annotations Annotate variants by comparing with sequences in fasta. Default annotations are ALU, L1, SVA.

--annotation-min-perc-sim Annotate variant if sequence similarity > P%. [60]

Variant Filtering Options:

--min-N-in-gap Consider >= N consecutive "N" bp as a reference gap. ["50"]

--filter-near-reference-gap Flag variants < N bp from a gap as "NearReferenceGap". ["1K"]

--filter-near-contig-end Flag variants < N bp from a contig end as "NearContigEnd". ["1K"]

Options:

--emit-tool-contract Emit tool contract.

--resolved-tool-contract Use args from resolved tool contract.

Arguments:

ref.fa|xml Reference genome assembly against which to call variants.

ref.in.svsig.gz|fofn SV signatures from one or more samples.

ref.out.vcf Variant call format (VCF) output.

実行方法

１、リファレンスにマッピングする。pbmm2（紹介）（インストール参照）の使用が推奨されている。

subreads.bam（説明）を入力とする。

pbmm2 align ref.fa movie1.subreads.bam ref.movie1.bam --sort --median-filter --sample sample1

またはccs.bamを入力とする。

pbmm2 align ref.fa movie1.ccs.bam ref.movie1.bam --sort --preset CCS --sample sample1

またはfastqを入力とする。

pbmm2 align ref.fa movie1.Q20.fastq ref.movie1.bam --sort --preset CCS --sample sample1 --rg '@RG\tID:movie1'

unaligned.bam/fastqからマッピング結果のbamが出力される。

２、pbsv discoverでSVのシグネチャを検出

sample1と2があるなら、それぞれに対して実行。タンデムリピートのbedファイルがあるなら、感度とrecall改善のためにフラグを立てることが推奨されている（"--tandem-repeats repeat.bed"）。

pbsv discover ref.movie1.bam ref.sample1.svsig.gz
pbsv discover ref.movie2.bam ref.sample2.svsig.gz

ref.sample1.svsig.gz、 ref.sample2.svsig.gzが出力される。

３、pbsv callでSV検出（joint callも可能）

sample1と2があるなら両方指定する。

pbsv call ref.fa ref.sample1.svsig.gz ref.sample2.svsig.gz ref.var.vcf

ラージゲノムの場合、２以降のプロセスは並行して実行することでかなり高速化できる（*1）。実行例はpbsvのGithubを参照してください。検出アルゴリズムについてもGithubで説明されています。

引用

GitHub - PacificBiosciences/pbsv: pbsv - PacBio structural variant (SV) calling and analysis tools

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

pacbioのロングリードの構造変異検出ツール pbsv