10x genomicsのシングルセルRNA-seq解析パイプライン cellranger（version4について）

2020 10/31 説明を追加

2021 2/11 docker インストールにv５.01追加

　Cell Rangerは、ChromiumのシングルセルRNA-seq出力を処理して、リードのアラインメント、フィーチャ-バーコードマトリックスの生成、クラスタリングと遺伝子発現解析を行う解析パイプラインのセットである。Cell Rangerには、シングルセル遺伝子発現実験に関連する4つのパイプラインが含まれている。

　cellranger mkfastqは、Illuminaシーケンサーで生成された生のベースコール（BCL）ファイルをFASTQファイルにデマルチプレックスする。これはIlluminaのbcl2fastqのラッパーで、10xライブラリーに特化した便利な機能を追加し、サンプルシートのフォーマットを簡略化したものである。

　cellranger countは、cellranger mkfastqからFASTQファイルを取得し、アラインメント、フィルタリング、バーコードカウント、UMIカウントを実行する。それは、Chromium セルラーバーコードを使用して、フィーチャ-バーコードマトリックスを生成し、クラスタを決定し、遺伝子発現解析を実行するために使用される。カウントパイプラインは、同じGEMウェル上の複数のシークエンシング実行からの入力を取り込むことができる。cellranger countはまた、遺伝子発現リードと一緒にフィーチャーバーコードデータを処理する。

　cellranger aggr は、cellranger count の複数回の実行からの出力を集約し、それらの実行を同じシークエンシングデプスに正規化してから、特徴バーコードマトリクスを再計算し、結合されたデータを解析する。aggrパイプラインを使用して、複数のサンプルからのデータを実験全体のフィーチャ-バーコードマトリックスと解析に結合することができる。

　cellranger reanalyzeは、cellranger countまたはcellranger aggrによって生成された特徴-バーコードマトリックスを利用し、調整可能なパラメータ設定を用いて次元削減、クラスタリング、遺伝子発現アルゴリズムを再実行する。

　これらのパイプラインは、広く使用されているRNA-seqアライナーSTARとChromium固有のアルゴリズムを組み合わせたものである。出力は、標準的なBAM、MEX、CSV、HDF5、HTML形式で提供され、細胞情報が付加されている。

What is Cell Ranger?

https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/what-is-cell-ranger

テスト環境

intel xeon platinumのシングルノード環境でテスト（osはubuntu18.04LTS）。

インストール

10x GenomicsのHPからダウンロードする。ダウンロード前に所属機関やメールアドレス等の記入が必要（ --transcriptomeで指定するリファレンスもリンク先からダウンロードできる）。

Cell Ranger Installation -Software -Single Cell Gene Expression -Official 10x Genomics Support

#解凍
tar -xzvf cellranger-4.0.0.tar.gz
cd cellranger-4.0.0/bin/

> cellranger

$ cellranger

cellranger 4.0.0

Process 10x Genomics Gene Expression, Feature Barcode, and Immune Profiling data

USAGE:

cellranger <SUBCOMMAND>

FLAGS:

-h, --help Prints help information

-V, --version Prints version information

SUBCOMMANDS:

count Count gene expression and feature barcoding reads from a single sample and GEM well

vdj Assembles single-cell VDJ receptor sequences from 10x Immune Profiling libraries

aggr Aggregate data from multiple Cell Ranger runs

reanalyze Re-run secondary analysis (dimensionality reduction, clustering, etc)

targeted-compare Analyze targeted enrichment performance by comparing a targeted sample to its cognate parent WTA sample (used as input for targeted gene expression)

targeted-depth Estimate targeted read depth values (mean reads per cell) for a specified input parent WTA sample and a target panel CSV file

mkvdjref Prepare a reference for use with CellRanger VDJ

mkfastq Run Illumina demultiplexer on sample sheets that contain 10x-specific sample index sets

testrun Execute the 'count' pipeline on a small test dataset

mat2csv Convert a gene count matrix to CSV format

mkgtf Prepare a GTF file for use as a 10x transcriptome reference

mkref Prepare a reference for use with 10x analysis software. Requires a GTF and FASTA

upload Upload a summary of an analysis pipeline job to 10x Genomics support

sitecheck Collect Linux system configuration information

help Prints this message or the help of the given subcommand(s)

> cellranger count -h

$ cellranger count -h

cellranger-count

Count gene expression and feature barcoding reads from a single sample and GEM well

USAGE:

cellranger count [FLAGS] [OPTIONS] --id <ID> --transcriptome <PATH>

FLAGS:

--no-target-umi-filter Turn off the target UMI filtering subpipeline

--nosecondary Disable secondary analysis, e.g. clustering. Optional

--no-libraries Proceed with processing using a --feature-ref but no Feature Barcode libraries specified with the 'libraries' flag

--dry Do not execute the pipeline. Generate a pipeline invocation (.mro) file and stop

--disable-ui Do not serve the UI

--noexit Keep web UI running after pipestance completes or fails

--nopreflight Skip preflight checks

-h, --help Prints help information

OPTIONS:

--id <ID> A unique run id and output folder name [a-zA-Z0-9_-]+

--description <TEXT> Sample description to embed in output files

--transcriptome <PATH> Path of folder containing 10x-compatible transcriptome reference

-f, --fastqs <PATH>... Path to input FASTQ data

-p, --project <TEXT> Name of the project folder within a mkfastq or bcl2fastq-generated folder to pick FASTQs from

-s, --sample <PREFIX>... Prefix of the filenames of FASTQs to select

--lanes <NUMS>... Only use FASTQs from selected lanes

--libraries <CSV> CSV file declaring input library data sources

--feature-ref <CSV> Feature reference CSV file, declaring Feature Barcode constructs and associated barcodes

--target-panel <CSV> The target panel CSV file declaring the target panel used, if any

--expect-cells <NUM> Expected number of recovered cells

--force-cells <NUM> Force pipeline to use this number of cells, bypassing cell detection

--r1-length <NUM> Hard trim the input Read 1 to this length before analysis

--r2-length <NUM> Hard trim the input Read 2 to this length before analysis

--chemistry <CHEM> Assay configuration. NOTE: by default the assay configuration is detected automatically, which is the recommened mode. You usually will not need to specify a chemistry. Options are: 'auto' for

autodetection, 'threeprime' for Single Cell 3', 'fiveprime' for Single Cell 5', 'SC3Pv1' or 'SC3Pv2' or 'SC3Pv3' for Single Cell 3' v1/v2/v3, 'SC5P-PE' or 'SC5P-R2' for Single Cell 5', paired-

end/R2-only, 'SC-FB' for Single Cell Antibody-only 3' v2 or 5' [default: auto]

--jobmode <MODE> Job manager to use. Valid options: local (default), sge, lsf, slurm or a .template file. Search for help on "Cluster Mode" at support.10xgenomics.com for more details on configuring the pipeline to

use a compute cluster [default: local]

--localcores <NUM> Set max cores the pipeline may request at one time. Only applies to local jobs

--localmem <NUM> Set max GB the pipeline may request at one time. Only applies to local jobs

--localvmem <NUM> Set max virtual address space in GB for the pipeline. Only applies to local jobs

--mempercore <NUM> Reserve enough threads for each job to ensure enough memory will be available, assuming each core on your cluster has at least this much memory available. Only applies in cluster jobmodes

--maxjobs <NUM> Set max jobs submitted to cluster at one time. Only applies in cluster jobmodes

--jobinterval <NUM> Set delay between submitting jobs to cluster, in ms. Only applies in cluster jobmodes

--overrides <PATH> The path to a JSON file that specifies stage-level overrides for cores and memory. Finer-grained than --localcores, --mempercore and --localmem. Consult the 10x support website for an example

override file

--uiport <PORT> Serve web UI at http://localhost:PORT

> cellranger bamtofastq -h

$ cellranger bamtofastq -h

bamtofastq v1.2.1

10x Genomics BAM to FASTQ converter.

Tool for converting 10x BAMs produced by Cell Ranger or Long Ranger back to

FASTQ files that can be used as inputs to re-run analysis. The FASTQ files

emitted by the tool should contain the same set of sequences that were

input to the original pipeline run, although the order will not be

preserved. The FASTQs will be emitted into a directory structure that is

compatible with the directories created by the 'mkfastq' tool.

10x BAMs produced by Long Ranger v2.1+ and Cell Ranger v1.2+ contain header

fields that permit automatic conversion to the correct FASTQ sequences.

Older 10x pipelines require one of the arguments listed below to indicate

which pipeline created the BAM.

NOTE: BAMs created by non-10x pipelines are unlikely to work correctly,

unless all the relevant tags have been recreated.

NOTE: BAM produced by the BASIC and ALIGNER pipeline from Long Ranger 2.1.2 and earlier

are not compatible with bamtofastq

NOTE: BAM files created by CR < 1.3 do not have @RG headers, so bamtofastq will use the GEM well

annotations attached to the CB (cell barcode) tag to split data from multiple input libraries.

Reads without a valid barcode do not carry the CB tag and will be dropped. These reads would

not be included in any valid cell.

Usage:

bamtofastq [options] <bam> <output-path>

bamtofastq (-h | --help)

Options:

--nthreads=<n> Threads to use for reading BAM file [default: 4]

--locus=<locus> Optional. Only include read pairs mapping to locus. Use chrom:start-end format.

--reads-per-fastq=N Number of reads per FASTQ chunk [default: 50000000]

--gemcode Convert a BAM produced from GemCode data (Longranger 1.0 - 1.3)

--lr20 Convert a BAM produced by Longranger 2.0

--cr11 Convert a BAM produced by Cell Ranger 1.0-1.1

--bx-list=L Only include BX values listed in text file L. Requires BX-sorted and index BAM file (see Long Ranger support for details).

-h --help Show this screen.

(grabseqs) [

cellranger-4.0.0/bin/にパスを通しておく。

cell barcodeのファイルは

cellranger-4.0.0/lib/python/cellranger/barcodes/737K-august-2016.txt

がそうらしい。テスト時は737280行あった。

非公式だがdockerイメージも上がっている。windowsやmacでランする時に便利と思われる。

https://hub.docker.com/r/cumulusprod/cellranger/tags

#6.1.2
docker pull cumulusprod/cellranger:6.1.2

#v5.0.1
docker pull cumulusprod/cellranger:5.0.1

#v4.0.0
docker pull cumulusprod/cellranger:4.0.0

#v3.1.0
docker pull cumulusprod/cellranger:3.1.0

#v2.2.0
docker pull cumulusprod/cellranger:2.2.0

実行方法

10xのチュートリアルではSRAのデータを例に説明しているので、ここでもそれに則る。SRAのscRNAseqデータを使う時の注意点だが、（fastqがindexとペアに分かれてアップロードされているなら問題ないが、）ここで使うチュートリアルデータのように１つのfastqとして登録されている場合、このfastqをダウンロードしてもindexやペアエンドが分離されておらず使えない。10xのbarcoded bam形式bamをダウンロードする（このページの下のAWSに置いてあるbarcoded bamをダウンロード）。それからcellranger bamtofastqを使い、それぞれのライブラリに正しく分離されたfastqを作る。この作業で、複数poolしていればライブラリごとのfastqディレクトリと、その中にfastq（通常ペアエンドなのでR1とR2のfastq.gzと、indexのfastq.gz）ができる。fastqは５千万リード毎にchunk出力される。cell ranger countのランには、この作業で得られたディレクトリを指定する。

１、cellranger bamtofastq（link）によりbarcoded bamからfastqを出力する。ここでは公式でexampleデータとして使っているC05.bam.1と C07.bam.1のうちのC05.bam.1を使う。こちらからダウンロードできる。

cellranger bamtofastq --nthreads=4 --reads-per-fastq=50000000  C05.bam.1 normal

"--nthreads=4"と"--reads-per-fastq=50000000"はなくても動作する。

ランが終わると、normal/にindepth_C05_MissingLibrary_1_HL5G3BBXXとindepth_C05_MissingLibrary_1_HNNWNBBXXができる。

出力ディレクトリ

ls normal/indepth_C05_MissingLibrary_1_HL5G3BBXX/

-rw-r--r-- 1 kazumaxgene 317M 10月 24 16:10 bamtofastq_S1_L003_I1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 563M 10月 24 16:10 bamtofastq_S1_L003_R1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.5G 10月 24 16:10 bamtofastq_S1_L003_R2_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 475M 10月 24 16:10 bamtofastq_S1_L004_I1_006.fastq.gz

-rw-r--r-- 1 kazumaxgene 804M 10月 24 16:10 bamtofastq_S1_L004_R1_006.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.6G 10月 24 16:10 bamtofastq_S1_L004_R2_006.fastq.gz

drwxr-xr-x 2 kazumaxgene 4.0K 10月 24 16:04 ./

-rw-r--r-- 1 kazumaxgene 677M 10月 24 16:04 bamtofastq_S1_L004_I1_005.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.2G 10月 24 16:04 bamtofastq_S1_L004_R1_005.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.6G 10月 24 16:04 bamtofastq_S1_L004_R2_005.fastq.gz

-rw-r--r-- 1 kazumaxgene 672M 10月 24 15:56 bamtofastq_S1_L004_I1_004.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.2G 10月 24 15:56 bamtofastq_S1_L004_R1_004.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:56 bamtofastq_S1_L004_R2_004.fastq.gz

-rw-r--r-- 1 kazumaxgene 697M 10月 24 15:56 bamtofastq_S1_L003_I1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:56 bamtofastq_S1_L003_R1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.8G 10月 24 15:56 bamtofastq_S1_L003_R2_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 680M 10月 24 15:47 bamtofastq_S1_L004_I1_003.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:47 bamtofastq_S1_L004_R1_003.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:47 bamtofastq_S1_L004_R2_003.fastq.gz

-rw-r--r-- 1 kazumaxgene 680M 10月 24 15:38 bamtofastq_S1_L004_I1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:38 bamtofastq_S1_L004_R1_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:38 bamtofastq_S1_L004_R2_002.fastq.gz

-rw-r--r-- 1 kazumaxgene 680M 10月 24 15:30 bamtofastq_S1_L004_I1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 1.3G 10月 24 15:30 bamtofastq_S1_L004_R1_001.fastq.gz

-rw-r--r-- 1 kazumaxgene 2.7G 10月 24 15:30 bamtofastq_S1_L004_R2_001.fastq.gz

このディレクトリを以後の解析に使用する。

２、cellranger countを実行する。

cellranger count --id=normal \
 --transcriptome=refdata-gex-mm10-2020-A/ \
 --fastqs=indepth_C05_MissingLibrary_1_HL5G3BBXX/,indepth_C05_MissingLibrary_1_HNNWNBBXX/ 
 --chemistry auto --sample=bamtofastq --localcores 16

"--chemistry auto --sample=bamtofastq --localcores 16"はなくても動作する。

出力ディレクトリのouts/が出力

f:id:kazumaxneo:20201026095243p:plain

outs/cloupe.cloupeはloupe rangerで開くことができる。

summary.html f:id:kazumaxneo:20201101123433p:plain

f:id:kazumaxneo:20201101123435p:plain

Tips

SRAからダウンロードしたデータなどで、ディレクトリにfastq.gzを置いているにも関わらずcellrangercountがfastqのファイル名が認識できない場合はこちらを参照（SRAからダウンロードした少し前のデータなど）。

https://kb.10xgenomics.com/hc/en-us/articles/115003802691

他のコマンドは10xのマニュアルを参照して下さい。

引用

What is Cell Ranger? -Software -Single Cell Gene Expression -Official 10x Genomics Support

status 255

https://kb.10xgenomics.com/hc/en-us/articles/360042488811-Cellranger-job-failed-in-stage-code-exit-status-255

2020 12/02 追記

DBKEROのシングルセルチュートリアル。基本的にコピペで実行できるようになっている。

https://kero.hgc.jp/open_tutorials/learning/

2022/09/28

Useful tip & heads-up❗️for #scrnaseq cellranger @10xGenomics users:starting with v.7.0, cellranger count also considers reads aligned to introns (20-40% of all reads).

this is great for increased sensitivity!

however, what to do if previous data has been counted w/o introns? 🧵 pic.twitter.com/m0JZ8u4Cyj
— Simona Cristea (@simocristea) September 26, 2022