2022-02-20

（非モデル生物）RNA-seqデータの超高速な機能的プロファイリングを行う seq2fun

2022/02/21 画像追記

　リファレンスゲノムを持たない非モデル生物のRNA-seqデータ解析では、計算時間とコストが依然として大きなボトルネックとなっている。この課題を解決するために、著者らは、トランスクリプトームde novoアセンブリを行わずにRNA-seqリードの機能定量を直接行う、新規のオールインワン、超高速ツール、Seq2Funを開発した。このパイプラインは、シーケンスエラー修正、ポリ（A）テールの除去、オーバーラップしたペアエンドリードの結合など、生のリードの品質管理から始まる。次に、各リードを可能な限りのアミノ酸断片に変換してDNA-タンパク質検索を行い、その後、十分にキュレーションされたタンパク質データベースから相同配列を同定する。最後に、このパイプラインは、遺伝子のアバンダンステーブル、パスウェイと生物種のヒットテーブル、結果を視覚化するHTMLレポート、下流解析に適したマッピングされた遺伝子がアノテーションされたクリーンリードの出力など、いくつかの有益な出力を生成する。Seq2Funは、ファイルの書き込みや読み込みといった中間ステップを持たないため、I/Oが非常に効率的になっている。Seq2FunはC++で記述されており、限られたCPU数、メモリ数のパソコンで動作させることができる。2,000,000リード/分以上の処理が可能で、de novoアセンブリに基づく従来のワークフローに比べ120倍以上高速でありながら、様々なテストデータセットにおいて高い精度を維持することができる。

参照ゲノムを持たない生物種の遺伝子レベルまたはパスウェイレベルの解析は、その転写産物の構築とアノテーションに大きく依存している（Martin and Wang 2011; Eldem et al. 2017; Voshall and Moriyama 2018）。従来のRNA-seqワークフローでは、複数のソフトウェアツールを使用して、生リードの品質チェック、リードエラー補正、トランスクリプトームde novoアセンブリ、トランスクリプトーム品質評価、トランスクリプトームアノテーション、DEGの同定やパスウェイエンリッチメント解析などのダウンストリーム解析を行う（Martin and Wang 2011; Eldem et al.2017; Voshall and Moriyama 2018）。下流の統計解析は比較的簡単だが、生データ処理が重要な障害として残っている。特に、トランスクリプトームde novoアセンブリは複雑で時間のかかる作業であり、大規模な計算資源を必要とする（Martin and Wang 2011; Eldem et al.）いくつかのトランスクリプトームde novoアセンブラが開発されており、例えば確立されたツールであるTrinity (Haas et al. 2013; https://github.com/trinityrnaseq/trinityrnaseq/wiki) やSOAPdenovo-Trans (Xie et al. 2014) 、さらに最近開発されたツールであるBridger (Chang et al. 2015), BinPacker (Liu et al. 2016), and TransLiG (Liu et al. 2019)などが挙げられる。しかし、これらのツールによる解析は、高性能なコンピュータで数日から数週間かかることもある。さらに、非モデル生物の組み立てられたトランスクリプトームは、しばしば多くの偽陽性および偽陰性に悩まされ、単一のアセンブラがすべてのシナリオで最良の結果を提供することはできない（Hölzer and Marz 2019; Liu et al.）従来のRNA-seqワークフローにおけるもう1つの重要なステップは、トランスクリプトームのアノテーションである。確立された手順は、特にDNA間BLASTNアプローチと比較して、相同配列間の大きな進化的乖離を克服できる translated searchを介してDNA間BLASTXを実行することである（Conesa et al.2016; Ye et al.2019) 。この方法は、Blast2GO (Conesa and Götz 2008) やTrinotate (https://github.com/Trinotate/Trinotate.github.io/wiki) などのいくつかのプログラムで実装されている。しかし、これらの実装も時間がかかり、計算量が多い。de novoアセンブリおよびアノテーションの実行に必要な計算スキルおよび計算資源は、非モデル生物の研究においてRNA-seqを使用する際の大きな障壁となっている。

　ここでは、上記の問題点を解決するために、非モデル生物のRNA-seqリードの機能定量を行う、超高速、アセンブリフリー、オールインワンツールであるSeq2Funを紹介します。基礎となるアルゴリズムの説明に加え、様々な生物種のシミュレーションデータと実データセットを用いて、Seq2Funがトランスクリプトームde novoアセンブリに基づく従来のRNA-seq解析ワークフローを精度および計算効率の両面で上回ることを示す。また、リファレンスゲノムがない生物のRNA-seqデータの解析にSeq2Funがどのように利用できるかをケーススタディとして示す。

Seq2Funバージョン2の主な特徴（レポジトリより・一部改変）

Seq2Funは、RNA-seqのリードを可能な限りのアミノ酸配列に直接変換し、十分にキュレーションされたデータベースから相同なタンパク質配列を検索するという新しい戦略を持っている。Seq2Funには3つの主要なステップが存在する。(1) 生リードの品質管理、(2) クリーンアップしたリードをマージして可能な限りのアミノ酸配列に変換し、多種類のタンパク質データベースとアライメント、(3) KEGGオーソログ（KO）付き発現マトリックスと下流解析用のサマリー図の作成。

超高速：従来のRNA-seqワークフローに比べ、120倍以上のスピード（～200万リード/分）。
超低メモリコスト：わずか2.27GBのメモリ消費、8スレッド、16GBメモリの標準的なPCで実行可能。
リファレンスフリー：生物のゲノムやトランスクリプトームのリファレンスが不要で、トランスクリプトームde novoアセンブリも不要。
オールインワン：Seq2Funは生のRNA-seqリードを直接入力とし、中間ファイルの書き込みや読み込みをせずに遺伝子存在量テーブルを出力するため、I/Oが非常に効率的。
多機能：Seq2Funは、オルソログ遺伝子バンダンステーブル、これらのテーブルとリードの品質チェックをまとめたhtmlレポート、さらに遺伝子アセンブリなどのさらなる解析のためにマッピングされたクリーンリードの出力など、複数の出力ファイルを生成する。
柔軟性：カスタマイズされたデータベースを使用して、特定の遺伝子や生物のグループに関するRNA-seq解析をサポートする。
使いやすさ：最小限のプログラミングスキルで使用できる。
ターゲット遺伝子アセンブルのサポート：マッピングされたリードを抽出し、ターゲット遺伝子アセンブルを行うための新機能がある。

インストール

Seq2Funは、Ubuntu (16.04 LTS以上)とmacOS Catalinaでテストされている。ここではubuntu18でビルドした。

依存

Seq2Fun (version 2.0.0) is written in C/C++11 and can be installed on Linux or Mac OS X (with Xcode and Xcode Command Line Tools installed).

Github

git clone https://github.com/xia-lab/Seq2Fun.git
cd Seq2Fun/src/
make clean
make

> ./seq2fun

Seq2Fun: high-throughput functional profiling of RNA-seq data for non-model organisms

version 2.0.2

usage: ./seq2fun [options] ...

options:

-s, --sampletable (recommended) sample table must consist of 3 columns (sample prefix name (sample01), forward reads name (sample01_R1.fq.gz), group info (control) for single-reads or 4 columns (sample prefix name (sample01), forward reads (sample01_R1.fq.gz), reverse reads (sample01_R2.fq.gz), group info (control) for paired-end reads. The columns must be separated by tab (string [=])

-i, --in1 read1 input file name (string [=])

-I, --in2 read2 input file name (string [=])

-X, --prefix prefix name for output files, eg: sample01 (string [=])

--outputMappedCleanReads enable output mapped clean reads into fastq.gz files, by default is false, using --outputMappedCleanReads to enable it

--outputReadsAnnoMap enable output mapped clean reads-annotation map into .gz files, by default is false, using --outputReadsAnnoMap to enable it

-D, --genemap gene/protein KO species map (string [=])

--profiling profiling mode, by default is false, using --profiling to enable it

-d, --tfmi fmi index of Protein database (string [=])

-K, --mode searching mode either tGREEDY or tMEM (maximum exactly match). By default greedy (string [=tGREEDY])

-E, --mismatch number of mismatched amino acid in sequence comparison with protein database with default value 2 (int [=2])

-j, --minscore minimum matching score of amino acid sequence in comparison with protein database with default value 80 (int [=80])

-J, --minlength minimum matching length of amino acid sequence in comparison with protein database with default value 19, for GREEDY and 13 for MEM model (int [=0])

-m, --maxtranslength maximum cutoff of translated peptides, it must be no less than minlength, with default 60 (int [=60])

--allFragments enable this function will force Seq2Fun to use all the translated AA fragments with length > minlength. This will slightly help to classify reads contain the true stop codon and start codon; This could have limited impact on the accuracy for comparative study and enable this function will slow down the Seq2Fun. by default is false, using --allFragments to enable it

--codontable select the codon table (same as blastx in NCBI), we provide 20 codon tables from 'https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi#SG31'. By default is the codontable1 (Standard Code) (string [=codontable1])

--dbDir dir for internal database such as ko_fullname.txt (string [=])

-Z, --pathway list of selected pathways for target pathways analysis (string [=])

-z, --genefa the gene/protein sequences fasta file for retrieving proteins in selected pathways to construct database (string [=])

-w, --thread worker thread number, default is 2 (int [=2])

-V, --verbose enable verbose

--debug enable debug

--longlog enable the long logout format

--reads_buffer specify reads buffer size (MB) for each file. (int [=1])

--fix_mgi_id the MGI FASTQ ID format is not compatible with many BAM operation tools, enable this option to fix it.

-6, --phred64 indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)

--reads_to_process specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])

-A, --disable_adapter_trimming adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled

-a, --adapter_sequence the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])

--adapter_sequence_r2 the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])

--adapter_fasta specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])

--detect_adapter_for_pe by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.

--no_trim_polyA by default, ployA tail will be trimmed. If this option is specified, polyA trimming is disabled

-f, --trim_front1 trimming how many bases in front for read1, default is 0 (int [=0])

-t, --trim_tail1 trimming how many bases in tail for read1, default is 0 (int [=0])

-b, --max_len1 if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])

-F, --trim_front2 trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])

-T, --trim_tail2 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])

-B, --max_len2 if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])

-g, --trim_poly_g force polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

--poly_g_min_len the minimum length to detect polyG in the read tail. 10 by default. (int [=10])

-G, --disable_trim_poly_g disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

-x, --trim_poly_x enable polyX trimming in 3' ends.

--poly_x_min_len the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

-5, --cut_front move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.

-3, --cut_tail move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.

-r, --cut_right move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.

-W, --cut_window_size the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])

-M, --cut_mean_quality the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])

--cut_front_window_size the window size option of cut_front, default to cut_window_size if not specified (int [=4])

--cut_front_mean_quality the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])

--cut_tail_window_size the window size option of cut_tail, default to cut_window_size if not specified (int [=4])

--cut_tail_mean_quality the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])

--cut_right_window_size the window size option of cut_right, default to cut_window_size if not specified (int [=4])

--cut_right_mean_quality the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])

-Q, --disable_quality_filtering quality filtering is enabled by default. If this option is specified, quality filtering is disabled

-q, --qualified_quality_phred the quality value that a base is qualified. Default 20 means phred quality >=Q15 is qualified. (int [=20])

-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])

-n, --n_base_limit if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])

-e, --average_qual if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement (int [=0])

-L, --disable_length_filtering length filtering is enabled by default. If this option is specified, length filtering is disabled

-l, --length_required reads shorter than length_required will be discarded, default is 60. (int [=60])

--length_limit reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])

--no_low_complexity_filter disable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).

-Y, --complexity_threshold the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])

--filter_by_index1 specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])

--filter_by_index2 specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])

--filter_by_index_threshold the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])

-c, --disable_correction disenable base correction in overlapped regions (only for PE data), default is enabled

-v, --overlap_len_require the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default. (int [=30])

--overlap_diff_limit the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])

--overlap_diff_percent_limit the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])

-U, --umi enable unique molecular identifier (UMI) preprocessing

--umi_loc specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])

--umi_len if the UMI is in read1/read2, its length should be provided (int [=0])

--umi_prefix if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])

--umi_skip if the UMI is in read1/read2, Seq2Fun can skip several bases following UMI, default is 0 (int [=0])

-p, --overrepresentation_analysis enable overrepresented sequence analysis.

-P, --overrepresentation_sampling one in (--overrepresentation_sampling) reads will be computed for overrepresentation analysis (1~10000), smaller is slower, default is 20. (int [=20])

--cut_by_quality5 DEPRECATED, use --cut_front instead.

--cut_by_quality3 DEPRECATED, use --cut_tail instead.

--cut_by_quality_aggressive DEPRECATED, use --cut_right instead.

--discard_unmerged DEPRECATED, no effect now, see the introduction for merging.

-?, --help print this message

データベースの準備

seq2funはリファレンスゲノムや転写産物配列は使わないが、キュレーションされたタンパク質データベースを使う。このため、事前に、真核生物、動物、植物、菌類、さらに哺乳類、鳥類、爬虫類、両生類、魚類、節足動物などのサブグループに対して、リードの定量速度と精度のバランスを取るために、様々なKEGGオルソログタンパク質データベースが構築されている。

https://github.com/xia-lab/Seq2Fun

多くの非モデル生物では、研究成果の生物学的理解は、KEGGパスウェイ、Gene Ontology、PANTHER分類システムなどの機能アノテーションを持つタンパク質コード遺伝子に限定されている。そのため、Seq2Fun version 2では、オルソログのような機能的アノテーションを持つ遺伝子にフォーカスしたデータベースを開発することで、非モデル生物を研究する多くの研究者のニーズに応えられるようになりました。Orthofinderをベースに構築された数十種類のデータベースをダウンロードできます（レポジトリより）。

v２に入る。

http://gofile.me/4esAc/bG1RlcQIS

f:id:kazumaxneo:20220221172112p:plain

version2 database

f:id:kazumaxneo:20220221172133p:plain

一部はまだアノテーションの情報が不完全なことに注意する。animalsとbirdsで動作確認した。

ここではPlantsとbirdsをダウンロードした。解凍する。

tar -xzvf plants.tar.gz
tar -xzvf birds.tar.gz

plants

f:id:kazumaxneo:20220220213003p:plain

テストラン

Seq2Funは、maximum exact match (MEM) モードとGreedyモードの2つのモードで実行できる。MEMモードはクエリーとリファレンス配列の完全一致のみを許可するため、データベース内に非常に近縁な種が存在する生物に適している。Greedyモードは、相同配列間の進化的距離を克服するために、クエリーと参照配列間のミスマッチを許容し、データベースに近縁の参照ゲノムを持たない生物に適している（最適なパラメータは論文の補足資料に記載されている）。

テストランではbirdsデータを使う。解凍したbirds/ディスプレイをcdSeq2Fun/testdata/に配置する。seq2funのランには、fastqのファイルprefixとファイルのパス、サンプルグループについて書かれたテキストを指定する（シングルエンドリードの場合は3番目のカラムを削除）。

testdata/sample.txt

f:id:kazumaxneo:20220220233202p:plain

cdSeq2Fun/testdata/
../bin/seq2fun --sampletable sample.txt --tfmi birds/birds.fmi --genemap birds/birds_annotation.txt -w 8 --profiling -V --outputMappedCleanReads --outputReadsAnnoMap

--sampletable (recommended) sample table must consist of 3 columns (sample prefix name (sample01), forward reads name (sample01_R1.fq.gz), group info (control) for single-reads or 4 columns (sample prefix name (sample01), forward reads (sample01_R1.fq.gz), reverse reads (sample01_R2.fq.gz), group info (control) for paired-end reads. The columns must be separated by tab (string [=])
--in1 read1 input file name (string [=])
--in2 read2 input file name (string [=])
--prefix prefix name for output files, eg: sample01 (string [=])
--tfmi fmi index of Protein database (string [=])
--genemap gene/protein KO species map (string [=])
-w worker thread number, default is 2 (int [=2])
-V enable verbose
--outputMappedCleanReads enable output mapped clean reads into fastq.gz files, by default is false, using --outputMappedCleanReads to enable it
--outputReadsAnnoMap enable output mapped clean reads-annotation map into .gz files, by default is false, using --outputReadsAnnoMap to enable it
--profiling profiling mode, by default is false, using --profiling to enable it

数十秒で終了した。

出力

（レポジトリより）Seq2Funには、比較モードとプロファイリングモード（デフォルト）の2つの出力モードがあります。比較モードではオルソログアバンダンス表のみ、プロファイリングモードでは4つの表 1).全サンプルのオルソログアバンダンス表 2).networkanalyst に提出する全サンプルのオルソログアバンダンス表 3).networkanalyst に提出して下流の解析に用いるアノテーションファイルサミット、各サンプルのオルソログアバンダンス表 4).mapped clean reads file 5).html reportが作成されます。

f:id:kazumaxneo:20220220234649p:plain

（出力の詳細はレポジトリで説明されています）

全サンプルの存在量テーブル

S2fid_abundance_table_all_samples.txt

f:id:kazumaxneo:20220220234749p:plain

Seq2Fun_summary_all_samples.html

f:id:kazumaxneo:20220220214803p:plain

f:id:kazumaxneo:20220220214817p:plain

f:id:kazumaxneo:20220220214906p:plain

f:id:kazumaxneo:20220220214935p:plain

他に、サンプル別のレポートも出力されます。

論文中では、通常50GBのRAMを必要としたマウスのRNA seq解析でseq2fun はわずか0.4GBしか消費しなかった事と、マウス、ニワトリ、ゼブラフィッシュ、回虫のデータセットで50－113倍高速だった事から、8スレッド、16GB RAMの性能の（ラップトップ）コンピュータで実行可能であると書かれています。

レポジトリでは、エチニルエストラジオール（EE2）で処理された実際の非モデル生物ダブルクレステッドウのRNA-seqデータセットを使ったチュートリアルがあります。興味がある方は確認して下さい。

引用

Ultrafast functional profiling of RNA-seq data for nonmodel organisms
Peng Liu, Jessica Ewald, Jose Hector Galvez, Jessica Head, Doug Crump, Guillaume Bourque, Niladri Basu, Jianguo Xia

Genome Res. 2021 Apr;31(4):713-720