CRISPR-DAV: generating indel reports from amplicon sequencing after CRISPR/Cas9 editing

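The simplicity and precision of the CRISPR/Cas9 system have ushered in a new era of gene editing. Next-generation sequencing (NGS), thanks to its multiplexing capability, makes it possible to screen for desired clones carrying CRISPR-mediated genome edits. Here the authors present CRISPR-DAV (CRISPR Data Analysis and Visualization), a pipeline for high-throughput analysis of CRISPR NGS data. The pipeline uses the Burrows-Wheeler Aligner (BWA) and Assembly Based ReAlignment (ABRA) to detect small and large indels, and presents the results as a comprehensive set of charts and an interactive alignment view. CRISPR-DAV is available at https://github.com/pinetree1/crispr-dav.git (GitHub) and https://hub.docker.com/r/pinetree1/crispr-dav/ (Docker Hub).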

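The CRISPR-DAV pipeline processes FASTQ reads generated by amplicon-based NGS. Low-quality and very short reads are first removed with PRINSEQ (Schmieder et al., 2011). The remaining high-quality reads are then aligned to a reference genome or an amplicon sequence with BWA and ABRA to detect indels. In the authors' experience, ABRA detected large indels that BWA missed.

Indel frequencies are calculated at every nucleotide position in the amplicon. However, CRISPR-induced indels usually do not all start at the same nucleotide position, so the pipeline also defines a simple read-level measure of CRISPR efficiency, % indel reads. First, reads spanning the sgRNA (single guide RNA) target sequence are counted as total reads. Second, based on the authors' observation that all CRISPR-induced indels overlap the sgRNA sequence, a read counts as a CRISPR indel read only if at least one base is inserted or deleted within this target region. The percentage of indel reads among total reads is then calculated.

To assess HDR efficiency, the pipeline examines the desired base changes of the HDR oligo in each read and assigns the read to one of four categories:

(i) Perfect HDR: all intended base changes occur and there is no indel among them.
(ii) Edited HDR: at least one intended base change occurs but an indel is present, most likely because CRISPR re-edited the site after the directed repair.
(iii) Partial HDR: some, but not all, intended base changes occur and no indel is present.
(iv) Non-HDR: the read shows none of the intended base changes.

Visualization is important for understanding the data. The pipeline generates an HTML report with charts showing read counts at various stages, read depth and indel frequency across the amplicon, counts and percentages of indel reads, and allele, SNP, and HDR frequencies, together with an alignment view powered by CanvasXpress (http://canvasxpress.org).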
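
The read-level definitions above can be sketched in a few lines of Python. This is a toy illustration, not CRISPR-DAV's own code: the `Read` class, the function names, and the way indels and observed bases are represented are all assumptions made for the example. In particular, the HDR check interprets "indel present" as an indel overlapping the span of the intended positions, which is one possible reading of the definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Read:
    """Toy aligned read (hypothetical model, not a CRISPR-DAV data structure)."""
    start: int  # 0-based inclusive reference start
    end: int    # 0-based exclusive reference end
    # Indel events as 0-based half-open reference intervals. An insertion is
    # recorded as the 1-base interval of the base it precedes (assumption).
    indels: list = field(default_factory=list)
    # Observed base at each 1-based reference position of interest.
    bases: dict = field(default_factory=dict)


def overlaps(a_start, a_end, b_start, b_end):
    """True if half-open intervals [a_start, a_end) and [b_start, b_end) share a base."""
    return a_start < b_end and b_start < a_end


def percent_indel_reads(reads, sg_start, sg_end):
    """Return (total reads, indel reads, % indel reads) for one sgRNA target."""
    total = indel = 0
    for read in reads:
        # Only reads spanning the whole sgRNA target count as total reads.
        if not (read.start <= sg_start and read.end >= sg_end):
            continue
        total += 1
        # An indel read has at least one indel overlapping the target.
        if any(overlaps(s, e, sg_start, sg_end) for s, e in read.indels):
            indel += 1
    pct = 100.0 * indel / total if total else 0.0
    return total, indel, pct


def classify_hdr(read, hdr_changes):
    """Assign a read to one of the four HDR categories.

    hdr_changes maps 1-based positions to the intended new base.
    """
    hits = sum(read.bases.get(pos) == base for pos, base in hdr_changes.items())
    if hits == 0:
        return "non-HDR"
    # Convert the 1-based inclusive span of intended changes to 0-based half-open.
    lo, hi = min(hdr_changes) - 1, max(hdr_changes)
    if any(overlaps(s, e, lo, hi) for s, e in read.indels):
        return "edited HDR"
    return "perfect HDR" if hits == len(hdr_changes) else "partial HDR"
```

Checking for an indel before checking completeness mirrors the category definitions: a read with any intended change plus an indel is "edited" regardless of how many changes it carries.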

Installation

I tested it using the prebuilt Docker image.

GitHub: https://github.com/pinetree1/crispr-dav

# Docker Hub: https://hub.docker.com/r/pinetree1/crispr-dav/
docker pull pinetree1/crispr-dav:latest

> /crispr-dav/crispr.pl

# /crispr-dav/crispr.pl

CRISPR data analysis and visualization pipeline.

Usage: ../../crispr.pl [options]

--conf <str> Configuration file. Required. See template /crispr-dav/conf.txt
    It has information about genome locations, tools, and parameters.

    Specify a reference using --genome or --amp_fasta, but not both.
    Use --genome for standard genome, such as hg19. Need to have paths of fasta file,
    bwa index, and refGene coordinate file in the configuration file. To download the
    coordinate file, go to UCSC Genome Browser, in TableBrowser, select group:Genes
    and Gene Predictions, track:RefSeq Genes, table:refGene, region:genome,
    output format:all fields from selected table. The downloaded tab-delimited file
    should have these columns:
    bin,name,chrom,strand,txStart,txEnd,cdsStart,cdsEnd,exonStarts,exonEnds,...
    Use --amp_fasta when using a custom amplicon sequence as reference.

--genome <str> Genome version (e.g. hg19) as specified in configuration file.

--amp_fasta <str> Amplicon reference fasta file containing a single sequence.

--codon_start <int> Translation starting position in the amplicon reference sequence.
    If the first codon starts at the first base, then the position is 1. No translation
    will be performed if the option is omitted. No intron should be present in the
    amplicon reference sequence if translation is needed.

--region <str> Required when --genome option is used. This is a bed file for amplicon region.
    The tab-separated fields are chr, start, end, genesym, refseqid, strand(+/-).
    No header. All fields are required.
    The start and end are 0-based; start is inclusive and end is exclusive.
    Genesym is gene symbol. Refseqid is used to identify transcript coordinates in
    UCSC refGene coordinate file. If refseqid is '-', no alignment view will be created.
    Only one row is allowed in this file. If an experiment has two amplicons, run the
    pipeline separately for each amplicon.

--crispr <str> Required. A bed file containing one or more CRISPR sgRNA sites.
    Tab-delimited file. No header. Information for each site:
    The fields are: chr, start, end, CRISPR_name, sgRNA_sequence, strand, and
    HDR mutations. All fields except HDR mutations are required. The start and end
    are 0-based; start is inclusive and end is exclusive. CRISPR names and sequences
    must be unique.
    HDR format: <Pos1><NewBase1>,<Pos2><NewBase2>,... The bases are desired new bases
    on positive strand, e.g. 101900208C,101900229G,101900232C,101900235A. No space. The
    positions are 1-based and inclusive.

--fastqmap <str> Required. A tab-delimited file containing 2 or 3 columns. No header.
    The fields are sample name, read1 fastq file, and optionally read2 fastq file.
    Fastq files must be gzipped and file names end with .gz.

--sitemap <str> Required. A tab-delimited file that associates sample name with CRISPR
    sites. No header. Each line starts with sample name, followed by one or more sgRNA
    guide sequences. This file controls what samples to be analyzed.

--merge Y or N. Default: Y. Merge paired-end reads before filtering and alignment.

--sge Submit jobs to SGE queue. The system must already have been configured for SGE.

--outdir <str> Output directory. Default: current directory.

--help Print this help message.

--verbose Print some commands and information for debugging.
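
The help text mixes two coordinate conventions: BED start/end are 0-based and half-open, while HDR positions are 1-based and inclusive. The Python sketch below parses one line of a `--crispr` file as described in the help and shows how the two conventions relate. It is an illustration only: the function names and the returned dictionary layout are assumptions, the sample line in the usage part is hypothetical, and none of this is part of CRISPR-DAV itself.

```python
import re

# One HDR item is a 1-based position immediately followed by the new base,
# e.g. "101900208C" (format taken from the help text).
HDR_ITEM = re.compile(r"(\d+)([ACGT])")


def parse_hdr(text):
    """Parse '<Pos1><NewBase1>,<Pos2><NewBase2>,...' into {position: base}."""
    changes = {}
    for item in text.split(","):
        match = HDR_ITEM.fullmatch(item)
        if match is None:
            raise ValueError(f"bad HDR item: {item!r}")
        changes[int(match.group(1))] = match.group(2)
    return changes


def parse_crispr_line(line):
    """Parse one tab-delimited line of a --crispr BED file into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(fields)}")
    chrom, start, end, name, sgrna, strand = fields[:6]
    if strand not in ("+", "-"):
        raise ValueError(f"strand must be + or -, got {strand!r}")
    start, end = int(start), int(end)
    if start >= end:
        raise ValueError("start must be smaller than end")
    # The HDR field is optional.
    hdr = parse_hdr(fields[6]) if len(fields) == 7 and fields[6] else {}
    return {"chrom": chrom, "start": start, "end": end, "name": name,
            "sgrna": sgrna, "strand": strand, "hdr": hdr}


def in_site(pos_1based, start, end):
    """True if a 1-based position lies inside the 0-based half-open [start, end)."""
    return start < pos_1based <= end


if __name__ == "__main__":
    # Hypothetical site line; only the HDR string is taken from the help text.
    line = ("chrX\t101900200\t101900223\tCR1\tACGTACGTACGTACGTACGTAGG\t+\t"
            "101900208C,101900229G,101900232C,101900235A")
    site = parse_crispr_line(line)
    for pos, base in sorted(site["hdr"].items()):
        print(pos, base, in_site(pos, site["start"], site["end"]))
```

The `in_site` helper encodes the off-by-one: the first base of a BED interval starting at 0-based `start` is the 1-based position `start + 1`.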

Test run 1

Run

docker run --rm -it -v "$PWD":/Users/xyz/temp pinetree1/crispr-dav
cd /crispr-dav/Examples/example1/
sh run.sh

run.sh executes the following command.

../..//crispr.pl --conf conf.txt --region amplicon.bed --crispr site.bed \
--sitemap sample.site --fastqmap fastq.list --genome genomex

conf.txt is the configuration file. To use it with other data, only the genome paths at the top need to be rewritten.

> cat conf.txt

[Screenshot: beginning of conf.txt; the rest is omitted]

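amplicon.bed is a BED file specifying the amplified region. site.bed is a BED file specifying the CRISPR sgRNA sites. sample.site is a tab-delimited file that maps sample names to CRISPR sites. fastq.list is a tab-delimited file listing the FASTQ paths.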
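
Because these input files refer to one another by sample name and guide sequence, a small pre-flight check can catch mismatches before a run. The Python sketch below follows the column layouts given in the help text for `--fastqmap` and `--sitemap`. The function names are hypothetical, and the cross-checks (every sample in the sitemap has a FASTQ entry, every guide appears in the `--crispr` file) are assumptions about what a consistent input set looks like, not checks taken from CRISPR-DAV.

```python
def parse_fastqmap(text):
    """Return {sample: [fastq paths]} from --fastqmap content (2 or 3 columns)."""
    samples = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) not in (2, 3):
            raise ValueError(f"fastqmap line needs 2 or 3 columns: {line!r}")
        sample, fastqs = fields[0], fields[1:]
        for path in fastqs:
            # The help text requires gzipped files whose names end with .gz.
            if not path.endswith(".gz"):
                raise ValueError(f"fastq file name must end with .gz: {path!r}")
        samples[sample] = fastqs
    return samples


def parse_sitemap(text):
    """Return {sample: [guide sequences]} from --sitemap content."""
    sites = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError(f"sitemap line needs a sample and a guide: {line!r}")
        sites[fields[0]] = fields[1:]
    return sites


def find_problems(fastqmap_text, sitemap_text, known_guides):
    """List inconsistencies between the sitemap, the fastqmap and the guides."""
    fastqs = parse_fastqmap(fastqmap_text)
    sites = parse_sitemap(sitemap_text)
    problems = []
    for sample, guides in sites.items():
        if sample not in fastqs:
            problems.append(f"{sample}: in sitemap but missing from fastqmap")
        for guide in guides:
            if guide not in known_guides:
                problems.append(f"{sample}: guide {guide} not in the --crispr file")
    return problems


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    fastqmap = "sample1\ts1_R1.fastq.gz\ts1_R2.fastq.gz\nsample2\ts2_R1.fastq.gz\n"
    sitemap = "sample1\tACGTACGTACGTACGTACGT\nsample3\tTTTTACGTACGTACGTACGT\n"
    for problem in find_problems(fastqmap, sitemap, {"ACGTACGTACGTACGTACGT"}):
        print(problem)
```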


After the run, move the results to the directory shared with the host, then exit.

mv deliverables /Users/xyz/temp
exit

index.html

[Screenshots: index.html report]

GENEX_CR1_cx0.html

[Screenshots: GENEX_CR1_cx0.html]

The charts can be manipulated interactively.

[Screenshot: interactive chart]

Test run 2

Preparation. Unlike test run 1, here a sample sheet is prepared and a helper script is used to set up the files needed for the run.

docker run --rm -it -v "$PWD":/Users/xyz/temp pinetree1/crispr-dav
cd /crispr-dav/Examples/example1_a/
../../prepare_run.pl samplesheet.txt

Run

cd amp_GENEX_chr4_35769_36280/
sh run.sh

Reference
CRISPR-DAV: CRISPR NGS Data Analysis and Visualization Pipeline

Xuning Wang, Charles Tilford, Isaac Neuhaus, Gabe Mintier, Qi Guo, John N Feder, Stefan Kirov

Bioinformatics. 2017 Dec 1;33(23):3811-3812

macでインフォマティクス
