ロングリードやショートリードのRNA seq情報をもとに転写領域をアセンブリして出力する StringTie2

2020 7/1 インストール方法追記, コマンド追記

2020 7/2 タイトル修正

2020 7/27 merge追記

2022/06/09 論文引用

2022/12/10, 12/28追記

2023/01/21 レポジトリURL修正

RNAシーケンス（RNAシーケンス）データセット内の転写産物の量を測定することは、細胞の働きを理解するための強力な方法である。リードをリファレンスゲノムに合わせるだけで、遺伝子の平均発現の大まかな推定値が得られ、スプライス部位の異なる使用のヒントが得られる[ref.1]。オルタナティブスプライシングは真核生物では非常に一般的であり、推定90％のヒトマルチエクソンタンパク質コーディング遺伝子と30％の非コーディングRNA（ncRNA）遺伝子が複数のアイソフォームを持っている[ref.2、3]。アノテーション付きのヒトタンパク質をコードする遺伝子の数は、過去10年間ほぼ一定のままだが、ncRNA遺伝子およびタンパク質をコードするアイソフォームの数は増加し続けている[ref.4]。

　イルミナなどの第2世代のシーケンサーは、数億のショート（〜100 bp）RNA-seqリードを生成できる。この長さのリードは、非常に小さなエクソンの場合を除いて、通常、2エクソンを超えない。短いリードをアセンブリすることにより、完全長の転写物を再構築し、新規遺伝子と遺伝子アイソフォームを特定できる。トランスクリプトームアセンブリには、de novoとreference-guidedの2つの主なアプローチがある。 Trinity [ref.5]やOases [ref.6]などのde novoトランスクリプトームアセンブラは、リードをゲノムにアラインせずに、リード間のオーバーラップを見つけて、それらを完全なトランスクリプトにチェーンしようとする。このタスクは、相互に大きく重複する多くのアイソフォームを持つパラロガス遺伝子および転写物の存在により複雑になり、その結果、このアプローチは非常に断片化され、エラーを起こしやすいトランスクリプトームを生成する。 Cufflinks [ref.7]、Bayesembler [ref.8]、StringTie [ref.9]、TransComb [ref.10]、Scallop [ref.11]などのリファレンスガイド付きアセンブラは、スプライシングを使用してRNA-seqリードがHISAT [12]やSTAR [13]などのアライナーでアラインされる既存のゲノムを利用する。これらのアセンブラは、アラインメントに基づいてスプライスグラフ（または他のデータ構造）を構築し、それらのグラフを使用して個々のトランスクリプトを構築する。一部のリファレンスガイド付きアセンブラは、オプションのガイドとして既知の転写産物のエクソン-イントロンアノテーションを使用することもでき、可能な場合は既知の遺伝子を優先させることができる。最近の研究[ref.14]では、より多くの転写産物を正確かつ高精度にアセンブルすることにより、StringTieがCufflinksとBayesemblerの両方よりも優れていることがわかった。

StringTieおよびその他のトランスクリプトームアセンブラーは、各トランスクリプトに割り当てられたアライメントされたリード数に基づいて、トランスクリプトの存在量を推定する。最近では、Sailfish [ref.15]、Salmon [ref.16]、Kallisto [ref.17]などの代替方法により、正確なk-merマッチングに基づいて既知の転写産物にリードをアサインすることで豊富さを推定できることが示された。しかし、これらのアラインメントフリーの方法は、新規遺伝子またはアイソフォームを検出することができず、アラインメントベースのパイプラインと比較して、低存在量および低分子量RNAの定量化のパフォーマンスが低下する[ref.18]。

　StringTieの最初のリリースでは、全ゲノムアセンブリ用に開発されたスーパーリードの構築を介して、de novoトランスクリプトームアセンブリの限定バージョンを使用する方法を提案した[ref.19]。概念的に、スーパーリードは、k-merルックアップテーブルに基づいた一意の拡張がある限り、ショートリードの各端を拡張することによって構築される。これにより、ショートリードのエラー率が低い合成のロングリードコレクションが作成される。それらはより長いため、ゲノムに一意にアラインする可能性が高く、遺伝子のスプライスグラフが単純化される可能性がある。スーパーリードは、StringTie 1.0（以降StringTie1）の限られた容量で使用され、ペアエンドリード間のギャップを埋めるだけであった。その限られた実装では、リードのペアを置き換えるためにスーパーリードが使用され、単一のペアでないリードのように処理できるようになった。スーパーリードを使用する際の難しさの1つは、ゲノムアセンブリ用にスーパーリードを作成するために使用されるアルゴリズムにエラー修正ステップが含まれていることである。もう1つの問題は、完全なスーパーリードに多数のショートリードが含まれている可能性があるため、定量化ステップ中に単一のリードとしてカウントできないことである。そのため、スーパーリード間でリード範囲を分散するために、期待値最大化（EM）アルゴリズムを開発した。　

　第二世代のシーケンサーは非常に多くのリードを生成するが、それらのリード長は通常非常に短く、ほとんどのRNA-seq実験では75〜125 bpの範囲である。これらの短いリードは、多くの場合、複数の場所にアラインし、このようなリードを「マルチマッピング」と指定する。ショートリードには、2つ以上のエクソンにまたがることがめったにないという制限もある。これらの問題は、Pacific Biosciences（PacBio）やOxford Nanopore Technologies（ONT）などの第3世代のシーケンシングテクノロジーによって軽減できる。 10,000µbpを超えるリード長を生成できるこれらのロングリードテクノロジーは、全ゲノムアセンブリを劇的に改善し[ref.20]、RNAシーケンス実験に使用すると、アイソフォーム同定、そして発見の精度が大幅に向上する可能性がある[ref.21,22,23]。第3世代のシーケンサーによって生成されるリードの一部はRNA転写産物の全長をカバーするが、多くは必然的に部分的な転写産物のみをキャプチャする。これはさまざまな理由で発生する。たとえば、（1）RNAは急速に分解され、シーケンシングのためにキャプチャされるまでに全長より短くなることがある。（2）長い分子は、ライブラリの準備中に壊れる可能性がある。または（3）cDNAシーケンスでは、逆転写ステップで完全なRNA分子を捕捉できない場合がある。したがって、トランスクリプトを完全にカバーするリードのみを考慮する計算ツールは、多くのリードを破棄せざるを得ず、感度が大幅に低下する可能性がある。しかし、これまでは、ロングリードはトランスクリプトームアセンブリに広く採用されていなかった。これは、部分的にエラー率がはるかに高く（通常8〜10％以上）、アライメントが困難であるためである[ref.24、25]。またシーケンサーのスループットがはるかに低いため、最高発現の遺伝子以外のすべての正確な定量は不可能である。

（一部略）

ここでは、StringTie2トランスクリプトアセンブラの主要な新しいリリースであるStringTie2を紹介する。これは、ショートリードとロングリードの両方、および完全な長さのスーパーリードをアセンブリできる。 33のIllumina RNA-seqデータセットに関する結果は、StringTie2がより正確であることを示している。

StringTie2 paper is out: more accurate than any other RNA-seq assembler, even faster than before, and now able to handle ONT and PacBio reads, where it's significantly more accurate and 80 times faster than the next-best method. s/o @elapertea @samkovaka https://t.co/NpUA1o8FKg
— Steven Salzberg (@StevenSalzberg1) 2019年12月19日

インストール

condaを使いubuntu18.04とmacos10.14でテストした(anaconda3.7環境)。

依存

StringTie is compatible with a wide range of Linux and Apple OS systems

本体　Github

#get latest release
git clone https://github.com/mpertea/stringtie2
cd stringtie2
make release

#bioconda (link)
mamba install -c bioconda -y stringtie

> stringtie

$ stringtie

StringTie v2.1.2 usage:

stringtie <input.bam ..> [-G <guide_gff>] [-l <label>] [-o <out_gtf>] [-p <cpus>]

[-v] [-a <min_anchor_len>] [-m <min_tlen>] [-j <min_anchor_cov>] [-f <min_iso>]

[-C <coverage_file_name>] [-c <min_bundle_cov>] [-g <bdist>] [-u] [-L] [-e]

[--ptf <f_tab>] [-x <seqid,..>] [-A <gene_abund.out>] [-h] {-B | -b <dir_path>}

Assemble RNA-Seq alignments into potential transcripts.

Options:

--version : print just the version at stdout and exit

--conservative : conservative transcriptome assembly, same as -t -c 1.5 -f 0.05

--rf assume stranded library fr-firststrand

--fr assume stranded library fr-secondstrand

-G reference annotation to use for guiding the assembly process (GTF/GFF3)

--ptf load point-features from a given 4 column feature file <f_tab>

-o output path/file name for the assembled transcripts GTF (default: stdout)

-l name prefix for output transcripts (default: STRG)

-f minimum isoform fraction (default: 0.01)

-L use long reads settings (default:false)

-R if long reads are provided, just clean and collapse the reads but do not assemble

-m minimum assembled transcript length (default: 200)

-a minimum anchor length for junctions (default: 10)

-j minimum junction coverage (default: 1)

-t disable trimming of predicted transcripts based on coverage

(default: coverage trimming is enabled)

-c minimum reads per bp coverage to consider for multi-exon transcript

(default: 1)

-s minimum reads per bp coverage to consider for single-exon transcript

(default: 4.75)

-v verbose (log bundle processing details)

-g maximum gap allowed between read mappings (default: 50)

-M fraction of bundle allowed to be covered by multi-hit reads (default:1)

-p number of threads (CPUs) to use (default: 1)

-A gene abundance estimation output file

-B enable output of Ballgown table files which will be created in the

same directory as the output GTF (requires -G, -o recommended)

-b enable output of Ballgown table files but these files will be

created under the directory path given as <dir_path>

-e only estimate the abundance of given reference transcripts (requires -G)

-x do not assemble any transcripts on the given reference sequence(s)

-u no multi-mapping correction (default: correction enabled)

-h print this usage message and exit

Transcript merge usage mode:

stringtie --merge [Options] { gtf_list | strg1.gtf ...}

With this option StringTie will assemble transcripts from multiple

input files generating a unified non-redundant set of isoforms. In this mode

the following options are available:

-G <guide_gff> reference annotation to include in the merging (GTF/GFF3)

-o <out_gtf> output file name for the merged transcripts GTF

(default: stdout)

-m <min_len> minimum input transcript length to include in the merge

(default: 50)

-c <min_cov> minimum input transcript coverage to include in the merge

(default: 0)

-F <min_fpkm> minimum input transcript FPKM to include in the merge

(default: 1.0)

-T <min_tpm> minimum input transcript TPM to include in the merge

(default: 1.0)

-f <min_iso> minimum isoform fraction (default: 0.01)

-g <gap_len> gap between transcripts to merge together (default: 250)

-i keep merged transcripts with retained introns; by default

these are not kept unless there is strong evidence for them

-l <label> name prefix for output transcripts (default: MSTRG)

Error: no input file provided!

実行方法

ショートリードかロングリードをリファレンスゲノムにマッピングして得たbamを指定する。複数指定も可能。

stringtie -p 12 -o out.gtf sample1.bam sample2.bam sample3.bam 2> log

-o output path/file name for the assembled transcripts GTF (default: stdout)
-p number of threads (CPUs) to use (default: 1)

bamとリファレンスのgtf（ガイド）を指定する。

stringtie -L -G human-chr19_P.gff -p 12 -o long_reads_guided.out.gtf long_reads.bam 2> log

-G reference annotation to use for guiding the assembly process (GTF/GFF3)
-L use long reads settings (default:false)

複数のgtfをマージするオプションもあります。

stringtie --merge in1.gtf in2.gtf in3.gtf ... -o output.gtf 2> log

#reference gtf/gff3も指定する(*1)
stringtie --merge in1.gtf in2.gtf in3.gtf ... -G ref.gff3 -o output.gtf 2> log

stringtie --merge [Options] { gtf_list | strg1.gtf ...}

追記

こちらの論文ではPacBioのLong RNA sequencinng単独をStringTie2で解析したり、illuminaのRNA seqのbamとPacBioのLong RNA sequencinngからのbamをマージしてStringTie2で解析するのワークフローよりも、StringTie2の新しいmixオプション（2022年の論文#2）のオプションを使う事（PacBioとilluminaそれぞれのbamを個別に指定する）を推奨しています。詳しくはこちら（Example 3: Using StringTie --mixの部分）。

引用

Transcriptome assembly from long-read RNA-seq alignments with StringTie2

Sam Kovaka, Aleksey V. Zimin, Geo M. Pertea, Roham Razaghi, Steven L. Salzberg, Mihaela Pertea
Genome Biology volume 20, Article number: 278 (2019)

2022/06/09

”ハイブリッドリードアセンブリを実行するStringTieの新リリースを紹介する。ロングリードとショートリードの両方の長所を利用することで、StringTieによるハイブリッドリードアセンブリは、ロングリードのみ、またはショートリードのみのアセンブリよりも精度が高く、データセットによっては、正しくアセンブリされた転写産物の数を2倍以上に増やすことができ、ロングリードデータのアセンブリのみよりも大幅に高い精度を得ることができる”

Improved transcriptome assembly using a hybrid of long and short reads with StringTie
Alaina Shumate,Brandon Wong,Geo Pertea,Mihaela Pertea

https://doi.org/10.1371/journal.pcbi.1009730

Published: June 1, 2022