メタゲノム、メタトランススクリプトーム、ncRNAのシークエンシングデータからrRNA配列を正確かつ高速に検出・除去する RiboDetector

2022/03/11追記

2023/03/05 追記

　トランスクリプトームやトランスラトーム技術の進歩により、RNAの活性プロファイルやRNAによる制御機構を深く研究することが可能になった。リボソーム RNA（rRNA）配列は細胞内RNAの中で非常に豊富に存在するが、ターゲット配列にポリアデニレーションが含まれていない場合、ライブラリ調製時にこれらを容易に除去することができず、下流の解析を加速・改善するためには計算技術によるポストホックの除去が必要となる。本発表では、BiLSTM（Bi-directional Long Short-Term Memory）ニューラルネットワークをベースに、トランスクリプトーム、メタゲノム、メタトランスリプトーム、ノンコーディング RNA、リボソームプロファイリングのシーケンスデータからrRNAリードを高速かつ正確に同定する新規ソフトウェアRiboDetectorについて説明する。RiboDetectorは、最新のアプローチと比較して、ベンチマークデータセットにおける誤判定を6倍以上減らすことができた。重要なことは、RiboDetectorのわずかな誤検出は、特定のGene Ontology (GO) タームに富んでおらず、下流の機能プロファイリングに偏りが少ないことが示唆された。また、RiboDetectorは、90%未満の配列同一性で学習データから乖離した新規rRNA配列を検出する優れた汎用性を実証した。パーソナルコンピュータ上で、RiboDetectorは40Mリードを6分以内に処理し、他の方法と比較してGPUモードで約50倍、CPUモードで約15倍高速化した。RiboDetectorはGPL v3.0ライセンスの下、https://github.com/hzi-bifo/RiboDetector で利用できる。

RiboDetector also demonstrated a remarkable generalizability for detecting novel rRNA sequences that are divergent from the training data with sequence identities of <90%. pic.twitter.com/3PE5V5650h
— Zhi-Luo Deng (@zhldeng) February 22, 2022

インストール

condaで環境を使って導入した。

依存

Github

mamba create -n ribodetector python=3.8 -y
conda activate ribodetector
#conda(link)
mamba install -c bioconda -y ribodetector
#pip
pip install ribodetector

#Install pytorch in the ribodetector env if GPU is available
mamba install -c pytorch pytorch

#GPU

> ribodetector -h

usage: ribodetector [-h] [-c CONFIG] [-d DEVICEID] -l LEN -i [INPUT [INPUT ...]] -o [OUTPUT [OUTPUT ...]] [-r [RRNA [RRNA ...]]] [-e {rrna,norrna,both,none}] [-t THREADS] [-m MEMORY] [--chunk_size CHUNK_SIZE] [-v]

rRNA sequence detector

optional arguments:

-h, --help show this help message and exit

-c CONFIG, --config CONFIG

Path of config file

-d DEVICEID, --deviceid DEVICEID

Indices of GPUs to enable. Quotated comma-separated device ID numbers. (default: all)

-l LEN, --len LEN Sequencing read length, should be not smaller than 50.

-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]

Path of input sequence files (fasta and fastq), the second file will be considered as second end if two files given.

-o [OUTPUT [OUTPUT ...]], --output [OUTPUT [OUTPUT ...]]

Path of the output sequence files after rRNAs removal (same number of files as input).

(Note: 2 times slower to write gz files)

-r [RRNA [RRNA ...]], --rrna [RRNA [RRNA ...]]

Path of the output sequence file of detected rRNAs (same number of files as input)

-e {rrna,norrna,both,none}, --ensure {rrna,norrna,both,none}

Only output certain sequences with high confidence

norrna: output non-rRNAs with high confidence, remove as many rRNAs as possible;

rrna: vice versa, output rRNAs with high confidence;

both: both non-rRNA and rRNA prediction with high confidence;

none: give label based on the mean probability of read pair.

(Only applicable for paired end reads, discard the read pair when their predicitons are discordant)

-t THREADS, --threads THREADS

number of threads to use. (default: 10)

-m MEMORY, --memory MEMORY

amount (GB) of GPU RAM. (default: 12)

--chunk_size CHUNK_SIZE

Use this parameter when having low memory. Parsing the file in chunks.

Not needed when free RAM >=5 * your_file_size (uncompressed, sum of paired ends).

When chunk_size=256, memory=16 it will load 256 * 16 * 1024 reads each chunk (use ~20 GB for 100bp paired end).

-v, --version show program's version number and exit

#CPU

> ribodetector_cpu -h

usage: ribodetector_cpu [-h] [-c CONFIG] -l LEN -i [INPUT [INPUT ...]] -o [OUTPUT [OUTPUT ...]] [-r [RRNA [RRNA ...]]] [-e {rrna,norrna,both,none}] [-t THREADS] [--chunk_size CHUNK_SIZE] [-v]

rRNA sequence detector

optional arguments:

-h, --help show this help message and exit

-c CONFIG, --config CONFIG

Path of config file

-l LEN, --len LEN Sequencing read length, should be not smaller than 50.

-i [INPUT [INPUT ...]], --input [INPUT [INPUT ...]]

Path of input sequence files (fasta and fastq), the second file will be considered as second end if two files given.

-o [OUTPUT [OUTPUT ...]], --output [OUTPUT [OUTPUT ...]]

Path of the output sequence files after rRNAs removal (same number of files as input).

(Note: 2 times slower to write gz files)

-r [RRNA [RRNA ...]], --rrna [RRNA [RRNA ...]]

Path of the output sequence file of detected rRNAs (same number of files as input)

-e {rrna,norrna,both,none}, --ensure {rrna,norrna,both,none}

Only output certain sequences with high confidence

norrna: output non-rRNAs with high confidence, remove as many rRNAs as possible;

rrna: vice versa, output rRNAs with high confidence;

both: both non-rRNA and rRNA prediction with high confidence;

none: give label based on the mean probability of read pair.

(Only applicable for paired end reads, discard the read pair when their predicitons are discordant)

-t THREADS, --threads THREADS

number of threads to use. (default: 10)

--chunk_size CHUNK_SIZE

chunk_size * threads reads to process per thread.(default: 1024)

When chunk_size=1024 and threads=20, each process will load 1024 reads, in total consumming ~20G memory.

-v, --version show program's version number and exit

実行方法

シークエンシングリード（fastq）を指定する。

#GPU
ribodetector -t 20 \
  -l 100 \
  -i inputs/reads.1.fq.gz inputs/reads.2.fq.gz \
  -m 10 \
  -e rrna \
  --chunk_size 256 \
  -o outputs/reads.rrna.{1,2}.fq.gz

#CPU
ribodetector_cpu -t 20 \
  -l 100 \
  -i inputs/reads.1.fq.gz inputs/reads.2.fq.gz \
  -e rrna \
  --chunk_size 256 \
-o outputs/reads.rrna.{1,2}.fq

-l Sequencing read length, should be not smaller than 50.
-i Path of input sequence files (fasta and fastq), the second file will be considered as second end if two files given.
-o Path of the output sequence files after rRNAs removal (same number of files as input). (Note: 2 times slower to write gz files)
-t number of threads to use. (default: 10)
-m amount (GB) of GPU RAM. (default: 12)
--chunk_size Use this parameter when having low memory. Parsing the file in chunks. Not needed when free RAM >=5 * your_file_size (uncompressed, sum of paired ends). When chunk_size=256, memory=16 it will load 256 * 16 * 1024 reads each chunk (use ~20 GB for 100bp paired end).
-e {rrna, norrna, both, none} Only output certain sequences with high confidence norrna: output non-rRNAs with high confidence, remove as many rRNAs as possible;
rrna: vice versa, output rRNAs with high confidence;
both: both non-rRNA and rRNA prediction with high confidence;
none: give label based on the mean probability of read pair.
(Only applicable for paired end reads, discard the read pair when their predicitons are discordant)

CPU版だとある程度時間がかかる。3GBx2のファイルサイズのfastq.gzを使ったところ、ファイル書き出しまでに40分ほどかかった（3990x、20スレッド使用）。

f:id:kazumaxneo:20220310001802p:plain

ribodetector前後のfastqの18S rRNAへのマッピング。上段が処理前、中央が"-e rrna"設定での実行後、下段が"-e nonrrna"設定での実行後。

28S rRNA。上の画像と同じく、上段が処理前、中央が"-e rrna"設定での実行後、下段が"-e nonrrna"設定での実行後。まれにアラインされるリードが見つかる。

感想

論文中ではRNA配列と配列類似性を有する短い領域をもつ配列にrRNA由来のリードがアラインされ得ることが書かれています。rRNA配列が分かっているなら問題にはならないと思いますが、環境サンプルのmetatranscriptomeなどではrRNA配列が既存の分類から遠くて影響を及ぼす可能性があります。RiboDetectorを自分も使っていこうと思います。一点注意点として、GPU版はそれなりにGPUのメモリを要求します。３GBメモリの1060GTXではメモリが足りませんでした（参考まで）。--chunk_sizeオプションでスレッド１つあたり一度に読み込むリード数を制御してできるので、足りなければデフォルトの1024から減らしてスレッド数も減らしてください。かなりメモリを使うので注意してください。

引用

Rapid and accurate identification of ribosomal RNA sequences via deep learning
Zhi-Luo Deng, Philipp C Münch, René Mreches, Alice C McHardy
Nucleic Acids Research, Published: 21 February 2022