RATTLE: reference-free clustering of long RNA-seq reads

2021/1/12: Added help output and an analysis example

2022/04/19: Added tweet

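Single-molecule long-read sequencing with Nanopore provides an unprecedented opportunity to measure transcriptomes from any sample. However, current analysis methods rely on comparison with a reference genome or transcriptome, or on the use of multiple sequencing technologies. This precludes cost-effective studies in species with no genome assembly available, in individuals underrepresented in existing reference genomes, and in the discovery of disease-specific transcripts that cannot be identified directly from a reference genome. Methods for DNA assembly cannot be applied directly to transcriptomes, because their consensus sequences lack the interpretability required for genes with multiple transcript isoforms. To address these challenges, the authors developed RATTLE, the first tool for reference-free reconstruction and quantification of transcripts from Nanopore long reads. Using simulated data, isoform spike-ins, and sequencing data from tissues and cell lines, they demonstrate that RATTLE accurately determines transcript sequences and abundances, that it is comparable to reference-based methods, and that the number of predicted transcripts saturates as the number of input reads increases.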

We have improved RATTLE to be able to process >10M Nanopore reads https://t.co/xm90pLT8eB
— Eduardo Eyras (@EduEyras) April 19, 2022

Installation

GitHub

git clone --recurse-submodules https://github.com/comprna/RATTLE
cd RATTLE
./build.sh

$ ./rattle
Run with mode: ./rattle <cluster|cluster_summary|extract_clusters|correct|polish>

# rattle cluster -h
  -h, --help
      shows this help message
  -i, --input
      input fasta/fastq file (required)
  --fastq
      whether input and output should be in fastq format (instead of fasta)
  -o, --output
      output folder (default: .)
  -t, --threads
      number of threads to use (default: 1)
  -k, --kmer-size
      k-mer size for gene clustering (default: 10)
  -s, --score-threshold
      minimum score for two reads to be in the same gene cluster (default: 0.2)
  -v, --max-variance
      max allowed variance for two reads to be in the same gene cluster (default: 1000000)
  --iso
      perform clustering at the isoform level
  --iso-kmer-size
      k-mer size for isoform clustering (default: 11)
  --iso-score-threshold
      minimum score for two reads to be in the same isoform cluster (default: 0.3)
  --iso-max-variance
      max allowed variance for two reads to be in the same isoform cluster (default: 25)
  -B, --bv-start-threshold
      starting threshold for the bitvector k-mer comparison (default: 0.4)
  -b, --bv-end-threshold
      ending threshold for the bitvector k-mer comparison (default: 0.2)
  -f, --bv-falloff
      falloff value for the bitvector threshold for each iteration (default: 0.05)
  -r, --min-reads-cluster
      minimum number of reads per cluster (default: 0)
  -p, --repr-percentile
      cluster representative percentile (default: 0.15)
  --rna
      use this mode if data is direct RNA (disables checking both strands)
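
The three bitvector options (-B, -b, -f) suggest a threshold that starts at 0.4 and is lowered by 0.05 per iteration until it reaches 0.2. The Python sketch below lists that schedule under this assumption. `bv_thresholds` is a hypothetical helper, not part of RATTLE, and the tool's actual iteration logic may differ. The run log later in this post reports iterations at 0.35, 0.3, 0.25 and 0.2, which is consistent with this reading.

```python
def bv_thresholds(start=0.4, end=0.2, falloff=0.05):
    """Return the assumed sequence of bitvector thresholds, from start down to end.

    Assumption: the threshold drops by `falloff` each iteration (defaults taken
    from the help text above). This is an illustration, not RATTLE's own code.
    """
    if falloff <= 0 or start < end:
        raise ValueError("expected falloff > 0 and start >= end")
    # Derive the step count once, then compute each value from the start,
    # so floating-point error does not accumulate across iterations.
    steps = round((start - end) / falloff)
    return [round(start - i * falloff, 10) for i in range(steps + 1)]


if __name__ == "__main__":
    print(bv_thresholds())
```

Raising -b or -f shortens the schedule, which under this assumption would mean fewer comparison passes.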

# rattle cluster_summary -h
  -h, --help
      shows this help message
  -i, --input
      input fasta/fastq file (required)
  -c, --clusters
      clusters file (required)
  --fastq
      whether input and output should be in fastq format (instead of fasta)

# rattle extract_clusters -h
  -h, --help
      shows this help message
  -i, --input
      input fasta/fastq file (required)
  -c, --clusters
      clusters file (required)
  -o, --output-folder
      output folder for fastx files (default: .)
  -m, --min-reads
      min reads per cluster to save it into a file
  --fastq
      whether input and output should be in fastq format (instead of fasta)

# rattle correct -h
  -h, --help
      shows this help message
  -i, --input
      input fasta/fastq file (required)
  -c, --clusters
      clusters file (required)
  -o, --output
      output folder (default: .)
  -g, --gap-occ
      gap-occ (default: 0.3)
  -m, --min-occ
      min-occ (default: 0.3)
  -s, --split
      split clusters into sub-clusters of size s for msa (default: 200)
  -r, --min-reads
      min reads to correct/output consensus for a cluster (default: 5)
  -t, --threads
      number of threads to use (default: 1)

# rattle polish -h
  -h, --help
      shows this help message
  -i, --input
      input RATTLE consensi fasta/fastq file (required)
  -o, --output-folder
      output folder for fastx files (default: .)
  -t, --threads
      number of threads to use (default: 1)
  --rna
      use this mode if data is direct RNA (disables checking both strands)

Usage

Specify a FASTQ file as input.

rattle cluster -i reads.fq -t 24 --fastq -o clusters

This run failed with an error.

With a different set of ONT RNA-seq reads, the run completed:

Reading fasta file... Done
[================================================================================] 98056/98056 (100%)
[================================================================================] 26082/26082 (100%)
Iteration 0.35 complete
[================================================================================] 14039/14039 (100%)
Iteration 0.3 complete
[================================================================================] 7434/7434 (100%)
Iteration 0.25 complete
[================================================================================] 4173/4173 (100%)
Iteration 0.2 complete
[================================================================================] 2651/2651 (100%)
Iteration 0 complete
Gene clustering done
1684 gene clusters found
[================================================================================] 1684/1684 (100%)

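Summary (CSV with read_id,cluster_id). Here clusters.out is the clusters file written by rattle cluster; adjust the path to match the -o folder used in the clustering step.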

rattle cluster_summary -c clusters.out -i reads.fq --fastq > summary
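
The summary can then be tallied to see how many reads fall in each cluster. The Python sketch below assumes the file has two comma-separated columns (read_id, cluster_id) as described above and no header row. `cluster_sizes` is a hypothetical helper name, not part of RATTLE.

```python
import csv
import sys
from collections import Counter


def cluster_sizes(lines):
    """Count reads per cluster from rows of the form read_id,cluster_id.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumption: two comma-separated columns and no header row.
    """
    sizes = Counter()
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        sizes[row[1].strip()] += 1
    return sizes


if __name__ == "__main__":
    # Usage: python cluster_sizes.py summary
    if len(sys.argv) > 1:
        with open(sys.argv[1], newline="") as handle:
            sizes = cluster_sizes(handle)
        # Print the ten largest clusters as cluster_id <tab> read count.
        for cluster_id, n_reads in sizes.most_common(10):
            print(cluster_id, n_reads, sep="\t")
```

Cluster IDs are kept as strings, so the counts do not depend on how the IDs are numbered.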

The GitHub repository has many other examples. Going from clustering all the way to quantification takes several steps, as shown in the examples there.
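
As a rough illustration of chaining the steps, the Python sketch below builds cluster, correct and polish command lines using only flags that appear in the help output above. The helper names `rattle_commands` and `run_pipeline` are hypothetical, and the intermediate file names (clusters.out, consensi.fq) and their location in the output folder are assumptions to check against the GitHub examples. Whether `correct` needs an extra flag for FASTQ input is not shown in the help above, so none is passed.

```python
import subprocess
from pathlib import Path


def rattle_commands(reads, outdir, threads=1, rattle="rattle"):
    """Build cluster -> correct -> polish command lines from documented flags."""
    out = Path(outdir)
    clusters = out / "clusters.out"  # assumed name of the clusters file
    consensi = out / "consensi.fq"   # assumed name of the consensus file
    t = str(threads)
    return [
        [rattle, "cluster", "-i", str(reads), "-t", t, "--fastq", "-o", str(out)],
        [rattle, "correct", "-i", str(reads), "-c", str(clusters), "-t", t, "-o", str(out)],
        [rattle, "polish", "-i", str(consensi), "-t", t, "-o", str(out)],
    ]


def run_pipeline(reads, outdir, threads=1):
    """Run each step in order; check=True stops at the first failing step."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    for cmd in rattle_commands(reads, outdir, threads):
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed first.
    for cmd in rattle_commands("reads.fq", "clusters", threads=24):
        print(" ".join(cmd))
```

Building the commands separately from running them makes it easy to inspect or adjust each step before anything is executed.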

Citation

Reference-free reconstruction and quantification of transcriptomes from Nanopore long-read sequencing

Ivan de la Rubia, Joel A. Indi, Silvia Carbonell-Sala, Julien Lagarde, M Mar Albà, Eduardo Eyras
bioRxiv, Posted July 30, 2020

2022/07/12

RATTLE: reference-free reconstruction and quantification of transcriptomes from Nanopore sequencing

Ivan de la Rubia, Akanksha Srivastava, Wenjing Xue, Joel A. Indi, Silvia Carbonell-Sala, Julien Lagarde, M. Mar Albà & Eduardo Eyras
Genome Biology volume 23, Article number: 153 (2022)