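Single-cell RNA sequencing (scRNA-seq) technologies have advanced rapidly over the last decade, but single-cell transcriptome analysis workflows have primarily used gene expression data, and isoform sequence analysis at the single-cell level remains fairly limited. Detecting and discovering isoforms in single cells is difficult because of the inherent technical shortcomings of scRNA-seq data, and existing transcriptome assembly methods are mainly designed for bulk RNA samples. To address this, the authors developed RNA-Bloom, an assembly algorithm that leverages the rich information content aggregated from multiple single-cell transcriptomes to reconstruct cell-specific isoforms. Assembly with RNA-Bloom can be either reference-guided or reference-free, which enables unbiased discovery of novel isoforms and foreign transcripts. The authors compared both RNA-Bloom assembly strategies against five state-of-the-art reference-free and reference-based transcriptome assembly methods. In benchmarks on a simulated 384-cell data set, reference-free RNA-Bloom reconstructed 37.9% to 38.3% more isoforms than the best reference-free assembler, whereas reference-guided RNA-Bloom reconstructed 4.1% to 11.6% more isoforms than the reference-based assemblers. When applied to a real 3840-cell data set consisting of more than 4 billion reads, RNA-Bloom reconstructed 9.7% to 25.0% more isoforms than the best reference-based and reference-free approaches evaluated. The authors expect RNA-Bloom to boost the utility of scRNA-seq data beyond gene expression analysis, expanding what is informatically accessible now.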
Installation
I created a virtual environment with mamba and tested it there.
#conda
mamba create -n rnabloom -y
conda activate rnabloom
mamba install -c bioconda rnabloom
> rnabloom -h
RNA-Bloom v1.3.1
Ka Ming Nip, Canada's Michael Smith Genome Sciences Centre, BC Cancer
Copyright 2018
usage: java -jar RNA-Bloom.jar [-l <FILE>] [-r <FILE>] [-pool <FILE>]
[-long <FILE>] [-ref <FILE>] [-rcl] [-rcr] [-rc] [-ss] [-n <STR>]
[-prefix <STR>] [-u] [-t <INT>] [-o <PATH>] [-f] [-k <INT>] [-stage
<INT>] [-q <INT>] [-Q <INT>] [-c <INT>] [-hash <INT>] [-sh <INT>]
[-dh <INT>] [-ch <INT>] [-ph <INT>] [-nk <INT>] [-ntcard] [-mem
<DECIMAL>] [-sm <DECIMAL>] [-dm <DECIMAL>] [-cm <DECIMAL>] [-pm
<DECIMAL>] [-fpr <DECIMAL>] [-savebf] [-tiplength <INT>]
[-lookahead <INT>] [-sample <INT>] [-e <INT>] [-grad <DECIMAL>]
[-indel <INT>] [-p <DECIMAL>] [-length <INT>] [-norr] [-mergepool]
[-overlap <INT>] [-bound <INT>] [-extend] [-nofc] [-sensitive]
[-artifact] [-chimera] [-stratum <01|e0|e1|e2|e3|e4|e5>] [-pair
<INT>] [-a <INT>] [-mmopt <OPTIONS>] [-lrop <DECIMAL>] [-lrrd
<INT>] [-lrpb] [-debug] [-h] [-v]
-l,--left <FILE> left reads file(s)
-r,--right <FILE> right reads file(s)
-pool,--pool <FILE> list of read files for pooled assembly
-long <FILE> long reads file(s)
(Requires `minimap2` and `racon` in
PATH. Presets `-k 17 -c 3 -indel 10 -e
3 -p 0.8 -overlap 200 -tip 100` unless
each option is defined otherwise.)
-ref <FILE> reference transcripts file(s) for
guiding the assembly process
-rcl,--revcomp-left reverse-complement left reads [false]
-rcr,--revcomp-right reverse-complement right reads [false]
-rc,--revcomp-long reverse-complement long reads [false]
-ss,--stranded reads are strand specific [false]
-n,--name <STR> assembly name [rnabloom]
-prefix <STR> name prefix in FASTA header for
assembled transcripts
-u,--uracil output uracils (U) in place of thymines
(T) in assembled transcripts [false]
-t,--threads <INT> number of threads to run [2]
-o,--outdir <PATH> output directory
[/Users/kazu/Desktop/6803GT-S_pseudo_ss
embly_teat/rnabloom_assembly]
-f,--force force overwrite existing files [false]
-k,--kmer <INT> k-mer size [25]
-stage <INT> assembly termination stage
short reads: [3]
1. construct graph
2. assemble fragments
3. assemble transcripts
long reads: [3]
1. construct graph
2. correct reads
3. assemble transcripts
-q,--qual-dbg <INT> minimum base quality in reads for
constructing DBG [3]
-Q,--qual-frag <INT> minimum base quality in reads for
fragment reconstruction [3]
-c,--mincov <INT> minimum k-mer coverage [1]
-hash <INT> number of hash functions for all Bloom
filters [2]
-sh,--sbf-hash <INT> number of hash functions for screening
Bloom filter [2]
-dh,--dbgbf-hash <INT> number of hash functions for de Bruijn
graph Bloom filter [2]
-ch,--cbf-hash <INT> number of hash functions for k-mer
counting Bloom filter [2]
-ph,--pkbf-hash <INT> number of hash functions for paired
k-mers Bloom filter [2]
-nk,--num-kmers <INT> expected number of unique k-mers in
input reads
-ntcard count unique k-mers in input reads with
ntCard [false]
(Requires `ntcard` in PATH. If this
option is used along with `-long`, the
value for `-c` is set automatically
based on the ntCard histogram, unless
`-c` is defined otherwise)
-mem,--memory <DECIMAL> total amount of memory (GB) for all
Bloom filters [auto]
-sm,--sbf-mem <DECIMAL> amount of memory (GB) for screening
Bloom filter [auto]
-dm,--dbgbf-mem <DECIMAL> amount of memory (GB) for de Bruijn
graph Bloom filter [auto]
-cm,--cbf-mem <DECIMAL> amount of memory (GB) for k-mer
counting Bloom filter [auto]
-pm,--pkbf-mem <DECIMAL> amount of memory (GB) for paired kmers
Bloom filter [auto]
-fpr,--fpr <DECIMAL> maximum allowable false-positive rate
of Bloom filters [0.01]
-savebf save graph (Bloom filters) from stage 1
to disk [false]
-tiplength <INT> maximum number of bases in a tip [5]
-lookahead <INT> number of k-mers to look ahead during
graph traversal [3]
-sample <INT> sample size for estimating
read/fragment lengths [1000]
-e,--errcorritr <INT> number of iterations of
error-correction in reads [1]
-grad,--maxcovgrad <DECIMAL> maximum k-mer coverage gradient for
error correction [0.50]
-indel <INT> maximum size of indels to be collapsed
[1]
-p,--percent <DECIMAL> minimum percent identity of sequences
to be collapsed [0.90]
-length <INT> minimum transcript length in output
assembly [200]
-norr skip redundancy reduction for assembled
transcripts [false]
(will not create 'transcripts.nr.fa')
-mergepool merge pooled assemblies [false]
(Requires `-pool`; overrides `-norr`)
-overlap <INT> minimum number of overlapping bases
between reads [10]
-bound <INT> maximum distance between read mates
[500]
-extend extend fragments outward during
fragment reconstruction [false]
-nofc turn off assembly consistency with
fragment paired k-mers [false]
-sensitive assemble transcripts in sensitive mode
[false]
-artifact keep potential sequencing artifacts
[false]
-chimera keep potential chimeras [false]
-stratum <01|e0|e1|e2|e3|e4|e5> fragments lower than the specified
stratum are extended only if they are
branch-free in the graph [e0]
-pair <INT> minimum number of consecutive k-mer
pairs for assembling transcripts [10]
-a,--polya <INT> prioritize assembly of transcripts with
poly-A tails of the minimum length
specified [0]
-mmopt <OPTIONS> options for minimap2 [-r 150]
(`-x` and `-t` are already in use)
-lrop <DECIMAL> minimum proportion of matching bases
within long-read overlaps [0.4]
-lrrd <INT> min read depth required for long-read
assembly [2]
-lrpb use PacBio preset for minimap2 [false]
-debug print debugging information [false]
-h,--help print this message and exits
-v,--version print version information and exits
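The help text above notes that `-long` requires `minimap2` and `racon` in PATH and that `-ntcard` requires `ntcard`. A minimal sketch for checking this before starting a run is shown below; `check_tools` is a hypothetical helper name of my own, not part of RNA-Bloom.

```shell
# Report which of the given executables are missing from PATH.
# Prints one line per missing tool and returns non-zero if any is missing.
# (check_tools is a hypothetical helper, not an RNA-Bloom command.)
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# rnabloom itself, plus the helpers named in the help text:
# minimap2 and racon for -long, ntcard for -ntcard.
check_tools rnabloom minimap2 racon ntcard \
    || echo "install the tools listed above before using -long or -ntcard"
```

If only short reads are assembled without `-ntcard`, checking `rnabloom` alone should be enough.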
Usage
paired-end reads
rnabloom -l LEFT.fastq -r RIGHT.fastq -revcomp-right -ntcard -t 20 -o OUTDIR
- -l left reads file(s)
- -r right reads file(s)
- -rcl,--revcomp-left reverse-complement left reads [false]
- -rcr,--revcomp-right reverse-complement right reads [false]
- -rc,--revcomp-long reverse-complement long reads [false]
- -ntcard count unique k-mers in input reads with ntCard [false]
- -t number of threads to run [2]
- -o output directory
- -ss,--stranded reads are strand specific [false]
The stranded option (-ss) indicates that the input reads are strand-specific (see the GitHub repository).
Output
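A quick way to inspect the result is to count the assembled transcripts. The sketch below assumes the default assembly name `rnabloom` and the `transcripts.nr.fa` file mentioned in the help text, so the full file name `rnabloom.transcripts.nr.fa` is an assumption; `fasta_summary` is a hypothetical helper name.

```shell
# fasta_summary FILE: print the number of FASTA records and total bases.
# Header lines start with '>'; every other line is counted as sequence.
# (fasta_summary is a hypothetical helper, not an RNA-Bloom command.)
fasta_summary() {
    awk '/^>/ { n++; next }
         { bases += length($0) }
         END { printf "%d transcripts, %d bases\n", n, bases }' "$1"
}

# Example (file name assumed from the default assembly name):
# fasta_summary OUTDIR/rnabloom.transcripts.nr.fa
```

Running it on the file before and after redundancy reduction (if both exist in OUTDIR) gives a rough idea of how many sequences were collapsed.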
single-end reads
rnabloom -l READS.fastq -ntcard -t 20 -o OUTDIR
single-cell RNA-seq
rnabloom -pool READSLIST.txt -revcomp-right -ntcard -t THREADS -outdir OUTDIR
- -pool,--pool <FILE> list of read files for pooled assembly
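The file passed to `-pool` lists the read files of each cell. As far as I understand the RNA-Bloom README, it is a whitespace-separated table with a `#name left right` header and one row per cell (name, left reads, right reads); treat that layout as an assumption and check the README for your version. The sketch below builds such a list from a directory, assuming a hypothetical `<cell>_1.fastq` / `<cell>_2.fastq` naming scheme; `make_pool_list` is my own helper name.

```shell
# make_pool_list DIR: print a pooled-assembly read list for every
# DIR/<cell>_1.fastq that has a matching DIR/<cell>_2.fastq.
# The _1/_2 naming and the three-column layout are assumptions.
make_pool_list() {
    local dir=$1 left right cell
    echo "#name left right"
    for left in "$dir"/*_1.fastq; do
        [ -e "$left" ] || continue            # glob matched nothing
        cell=$(basename "$left" _1.fastq)
        right="$dir/${cell}_2.fastq"
        if [ -e "$right" ]; then
            printf '%s %s %s\n' "$cell" "$left" "$right"
        else
            echo "skipping $cell: no right-read file" >&2
        fi
    done
}

# Example:
# make_pool_list /path/to/cells > READSLIST.txt
```

Because the columns are separated by whitespace, paths containing spaces would break this format and should be avoided.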
Nanopore reads
# nanopore PCR cDNA sequencing data
rnabloom -long READS.fasta -ntcard -t THREADS -outdir OUTDIR
# nanopore direct cDNA sequencing data
rnabloom -long READS.fasta -stranded -revcomp-long -ntcard -t THREADS -outdir OUTDIR
Citation
RNA-Bloom enables reference-free and reference-guided sequence assembly for single-cell transcriptomes
Ka Ming Nip, Readman Chiu, Chen Yang, Justin Chu, Hamid Mohamadi, René L. Warren, Inanc Birol
Genome Res. 2020 Aug; 30(8): 1191–1200