RNA-Bloom (rnabloom): reference-free and reference-guided assembly of single-cell and bulk transcriptomes

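Single-cell RNA sequencing (scRNA-seq) technology has advanced rapidly over the past decade, but single-cell transcriptome analysis workflows still rely mainly on gene expression data, and isoform sequence analysis at the single-cell level remains quite limited. Detecting and discovering isoforms in single cells is difficult because of technical shortcomings inherent to scRNA-seq data, and existing transcriptome assembly methods are designed mainly for bulk RNA samples. To address this challenge, the authors developed RNA-Bloom, an assembly algorithm that leverages the rich information content aggregated from multiple single-cell transcriptomes to reconstruct cell-specific isoforms. Assembly with RNA-Bloom can be either reference-guided or reference-free, which enables unbiased discovery of novel isoforms and foreign transcripts. Both RNA-Bloom assembly strategies were compared against five state-of-the-art reference-free and reference-based transcriptome assembly methods. In benchmarks on a simulated 384-cell dataset, reference-free RNA-Bloom reconstructed 37.9% to 38.3% more isoforms than the best reference-free assembler, whereas reference-guided RNA-Bloom reconstructed 4.1% to 11.6% more isoforms than the reference-based assemblers. When applied to a real 3840-cell dataset consisting of more than 4 billion reads, RNA-Bloom reconstructed 9.7% to 25.0% more isoforms than the best competing reference-based and reference-free approaches evaluated. The authors expect RNA-Bloom to boost the utility of scRNA-seq data beyond gene expression analysis, expanding what is currently informatically accessible.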

Installation

Tested in a virtual environment created with mamba.

GitHub

# conda (Bioconda)
mamba create -n rnabloom -y
conda activate rnabloom
mamba install -c bioconda rnabloom

$ rnabloom -h
RNA-Bloom v1.3.1
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer

usage: java -jar RNA-Bloom.jar [-l <FILE>] [-r <FILE>] [-pool <FILE>]
       [-long <FILE>] [-ref <FILE>] [-rcl] [-rcr] [-rc] [-ss] [-n <STR>]
       [-prefix <STR>] [-u] [-t <INT>] [-o <PATH>] [-f] [-k <INT>] [-stage
       <INT>] [-q <INT>] [-Q <INT>] [-c <INT>] [-hash <INT>] [-sh <INT>]
       [-dh <INT>] [-ch <INT>] [-ph <INT>] [-nk <INT>] [-ntcard] [-mem
       <DECIMAL>] [-sm <DECIMAL>] [-dm <DECIMAL>] [-cm <DECIMAL>] [-pm
       <DECIMAL>] [-fpr <DECIMAL>] [-savebf] [-tiplength <INT>]
       [-lookahead <INT>] [-sample <INT>] [-e <INT>] [-grad <DECIMAL>]
       [-indel <INT>] [-p <DECIMAL>] [-length <INT>] [-norr] [-mergepool]
       [-overlap <INT>] [-bound <INT>] [-extend] [-nofc] [-sensitive]
       [-artifact] [-chimera] [-stratum <01|e0|e1|e2|e3|e4|e5>] [-pair
       <INT>] [-a <INT>] [-mmopt <OPTIONS>] [-lrop <DECIMAL>] [-lrrd
       <INT>] [-lrpb] [-debug] [-h] [-v]

 -l,--left <FILE>                 left reads file(s)
 -r,--right <FILE>                right reads file(s)
 -pool,--pool <FILE>              list of read files for pooled assembly
 -long <FILE>                     long reads file(s)
                                  (Requires `minimap2` and `racon` in
                                  PATH. Presets `-k 17 -c 3 -indel 10 -e
                                  3 -p 0.8 -overlap 200 -tip 100` unless
                                  each option is defined otherwise.)
 -ref <FILE>                      reference transcripts file(s) for
                                  guiding the assembly process
 -rcl,--revcomp-left              reverse-complement left reads [false]
 -rcr,--revcomp-right             reverse-complement right reads [false]
 -rc,--revcomp-long               reverse-complement long reads [false]
 -ss,--stranded                   reads are strand specific [false]
 -n,--name <STR>                  assembly name [rnabloom]
 -prefix <STR>                    name prefix in FASTA header for
                                  assembled transcripts
 -u,--uracil                      output uracils (U) in place of thymines
                                  (T) in assembled transcripts [false]
 -t,--threads <INT>               number of threads to run [2]
 -o,--outdir <PATH>               output directory
                                  [/Users/kazu/Desktop/6803GT-S_pseudo_ss
                                  embly_teat/rnabloom_assembly]
 -f,--force                       force overwrite existing files [false]
 -k,--kmer <INT>                  k-mer size [25]
 -stage <INT>                     assembly termination stage
                                  short reads: [3]
                                  1. construct graph
                                  2. assemble fragments
                                  3. assemble transcripts
                                  long reads: [3]
                                  1. construct graph
                                  2. correct reads
                                  3. assemble transcripts
 -q,--qual-dbg <INT>              minimum base quality in reads for
                                  constructing DBG [3]
 -Q,--qual-frag <INT>             minimum base quality in reads for
                                  fragment reconstruction [3]
 -c,--mincov <INT>                minimum k-mer coverage [1]
 -hash <INT>                      number of hash functions for all Bloom
                                  filters [2]
 -sh,--sbf-hash <INT>             number of hash functions for screening
                                  Bloom filter [2]
 -dh,--dbgbf-hash <INT>           number of hash functions for de Bruijn
                                  graph Bloom filter [2]
 -ch,--cbf-hash <INT>             number of hash functions for k-mer
                                  counting Bloom filter [2]
 -ph,--pkbf-hash <INT>            number of hash functions for paired
                                  k-mers Bloom filter [2]
 -nk,--num-kmers <INT>            expected number of unique k-mers in
                                  input reads
 -ntcard                          count unique k-mers in input reads with
                                  ntCard [false]
                                  (Requires `ntcard` in PATH. If this
                                  option is used along with `-long`, the
                                  value for `-c` is set automatically
                                  based on the ntCard histogram, unless
                                  `-c` is defined otherwise)
 -mem,--memory <DECIMAL>          total amount of memory (GB) for all
                                  Bloom filters [auto]
 -sm,--sbf-mem <DECIMAL>          amount of memory (GB) for screening
                                  Bloom filter [auto]
 -dm,--dbgbf-mem <DECIMAL>        amount of memory (GB) for de Bruijn
                                  graph Bloom filter [auto]
 -cm,--cbf-mem <DECIMAL>          amount of memory (GB) for k-mer
                                  counting Bloom filter [auto]
 -pm,--pkbf-mem <DECIMAL>         amount of memory (GB) for paired kmers
                                  Bloom filter [auto]
 -fpr,--fpr <DECIMAL>             maximum allowable false-positive rate
                                  of Bloom filters [0.01]
 -savebf                          save graph (Bloom filters) from stage 1
                                  to disk [false]
 -tiplength <INT>                 maximum number of bases in a tip [5]
 -lookahead <INT>                 number of k-mers to look ahead during
                                  graph traversal [3]
 -sample <INT>                    sample size for estimating
                                  read/fragment lengths [1000]
 -e,--errcorritr <INT>            number of iterations of
                                  error-correction in reads [1]
 -grad,--maxcovgrad <DECIMAL>     maximum k-mer coverage gradient for
                                  error correction [0.50]
 -indel <INT>                     maximum size of indels to be collapsed
                                  [1]
 -p,--percent <DECIMAL>           minimum percent identity of sequences
                                  to be collapsed [0.90]
 -length <INT>                    minimum transcript length in output
                                  assembly [200]
 -norr                            skip redundancy reduction for assembled
                                  transcripts [false]
                                  (will not create 'transcripts.nr.fa')
 -mergepool                       merge pooled assemblies [false]
                                  (Requires `-pool`; overrides `-norr`)
 -overlap <INT>                   minimum number of overlapping bases
                                  between reads [10]
 -bound <INT>                     maximum distance between read mates
                                  [500]
 -extend                          extend fragments outward during
                                  fragment reconstruction [false]
 -nofc                            turn off assembly consistency with
                                  fragment paired k-mers [false]
 -sensitive                       assemble transcripts in sensitive mode
                                  [false]
 -artifact                        keep potential sequencing artifacts
                                  [false]
 -chimera                         keep potential chimeras [false]
 -stratum <01|e0|e1|e2|e3|e4|e5>  fragments lower than the specified
                                  stratum are extended only if they are
                                  branch-free in the graph [e0]
 -pair <INT>                      minimum number of consecutive k-mer
                                  pairs for assembling transcripts [10]
 -a,--polya <INT>                 prioritize assembly of transcripts with
                                  poly-A tails of the minimum length
                                  specified [0]
 -mmopt <OPTIONS>                 options for minimap2 [-r 150]
                                  (`-x` and `-t` are already in use)
 -lrop <DECIMAL>                  minimum proportion of matching bases
                                  within long-read overlaps [0.4]
 -lrrd <INT>                      min read depth required for long-read
                                  assembly [2]
 -lrpb                            use PacBio preset for minimap2 [false]
 -debug                           print debugging information [false]
 -h,--help                        print this message and exits
 -v,--version                     print version information and exits
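The help text notes that `-ntcard` needs `ntcard` in PATH and that `-long` needs `minimap2` and `racon`. A minimal sketch for checking this before starting a long run is shown below; the function name `check_tools` is only illustrative and is not part of RNA-Bloom.

```shell
# Report which of the given tools are missing from PATH.
# check_tools is a hypothetical helper, not an RNA-Bloom command.
# Returns 0 when every tool is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# ntcard is needed for -ntcard; minimap2 and racon are needed for -long.
check_tools ntcard minimap2 racon || echo "install the missing tools before using -ntcard or -long"
```

If something is reported missing, the tools are available as Bioconda packages (`ntcard`, `minimap2`, `racon`) and can be added to the same environment; depending on the recipe version they may already have been pulled in with rnabloom.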

Usage

paired-end reads

rnabloom -l LEFT.fastq -r RIGHT.fastq -revcomp-right -ntcard -t 20 -o OUTDIR

-l left reads file(s)
-r right reads file(s)
-rcl,--revcomp-left reverse-complement left reads [false]
-rcr,--revcomp-right reverse-complement right reads [false]
-rc,--revcomp-long reverse-complement long reads [false]
-ntcard count unique k-mers in input reads with ntCard [false]
-t number of threads to run [2]
-o output directory
-ss,--stranded reads are strand specific [false]

The stranded option indicates that the input reads are strand-specific (see the GitHub page).
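The runs in this post are all reference-free. For a reference-guided run, the help text lists `-ref <FILE>` for supplying reference transcripts. The sketch below only prints the command it would run (a dry run), so it can be inspected first; `ref_guided` is a hypothetical wrapper name, and `TRANSCRIPTS.fasta` is a placeholder for your own reference transcript file.

```shell
# Print (dry run) a reference-guided paired-end command.
# ref_guided is a hypothetical helper; remove "echo" to actually run rnabloom.
# Arguments: left reads, right reads, reference transcripts, threads, output dir.
ref_guided() {
    echo rnabloom -l "$1" -r "$2" -revcomp-right -ntcard -ref "$3" -t "$4" -o "$5"
}

ref_guided LEFT.fastq RIGHT.fastq TRANSCRIPTS.fasta 20 OUTDIR
```

The only difference from the reference-free paired-end command is the added `-ref` argument; this variant was not tested in this post.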

Output

[Figure: screenshot of the output directory]
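A quick way to sanity-check an assembly is to count the FASTA records and total bases. The sketch below assumes the assembled transcripts are written to `OUTDIR/rnabloom.transcripts.fa`, a file name inferred from the default assembly name `rnabloom` and the `transcripts.nr.fa` mention in the help text; check your own output directory for the actual names. `fasta_stats` is an illustrative helper, not part of RNA-Bloom.

```shell
# Count records (lines starting with '>') and total sequence bases in a FASTA file.
# fasta_stats is a hypothetical helper name.
fasta_stats() {
    awk '/^>/ {n++; next}
         {bases += length($0)}
         END {printf "%d transcripts, %d bases\n", n, bases}' "$1"
}

# Assumed output file name; adjust to whatever is present in your OUTDIR.
if [ -f OUTDIR/rnabloom.transcripts.fa ]; then
    fasta_stats OUTDIR/rnabloom.transcripts.fa
fi
```

Because `-length` defaults to 200, transcripts shorter than that are not expected in the main output.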

single-end reads

rnabloom -l READS.fastq -ntcard -t 20 -o OUTDIR

single-cell RNA-seq

rnabloom -pool READSLIST.txt -revcomp-right -ntcard -t THREADS -outdir OUTDIR

-pool,--pool <FILE> list of read files for pooled assembly
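The list file passed to `-pool` is not described in the help text. The sketch below assumes the three-column layout given in the RNA-Bloom README (cell name, left reads, right reads, separated by tabs) and a hypothetical naming scheme of `<cell>_1.fastq` and `<cell>_2.fastq`; check the README for your version before relying on it.

```shell
# Write one line per cell: name, left reads, right reads (tab-separated).
# make_reads_list is a hypothetical helper; the <cell>_1.fastq / <cell>_2.fastq
# naming scheme is an assumption for illustration.
make_reads_list() {
    reads_dir=$1
    for left in "$reads_dir"/*_1.fastq; do
        [ -e "$left" ] || continue              # glob matched nothing
        cell=$(basename "$left" _1.fastq)
        right="$reads_dir/${cell}_2.fastq"
        if [ -e "$right" ]; then
            printf '%s\t%s\t%s\n' "$cell" "$left" "$right"
        else
            echo "warning: no right reads for $cell" >&2
        fi
    done
}

# Example: make_reads_list /path/to/cells > READSLIST.txt
```

Cells with a missing mate file are skipped with a warning rather than written as incomplete lines.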

Nanopore reads

# nanopore PCR cDNA sequencing data
rnabloom -long READS.fasta -ntcard -t THREADS -outdir OUTDIR

# nanopore direct cDNA sequencing data
rnabloom -long READS.fasta -stranded -revcomp-long -ntcard -t THREADS -outdir OUTDIR

Citation

RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes
Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, René L. Warren, Inanc Birol

Genome Res. 2020 Aug; 30(8): 1191–1200