2019-03-27

複数のtranscritome情報（gtf）をマージする TACO

2023/01/04 追記

　ハイスループット RNAシークエンシング（RNA-Seq）により、トランスクリプトームの詳細な理解が可能になった（ref.1–3）。手動および自動システムによる high fidelityな遺伝子モデルアノテーションの試みは、主にロースループットシークエンシング法（ref.4–6）に頼っていたが、RNA-Seqを用いた広大なトランスクリプトームについてのいくつかの研究（ref.3, 7, 8）は、リファレンス遺伝子カタログが完全とはほど遠いことを述べている。このアノテーションのギャップは、TCGA (ref.9)、ICGC (ref.10)、GTEX (ref.11) などのコンソーシアムによる数千もの高品質RNA-Seqデータセットの迅速なシーケンスによってさらに広がっている。現在のリファレンスアノテーションプロジェクトの限界は、トランスクリプトを発見するためにはこれらの大規模なRNA-Seqデータセットを利用からの計算トランスクリプトームアセンブリ法の開発を必要とすることである。

　単一サンプル実験のためのトランスクリプトーム再構築は依然として研究の活発な分野である。現在の再構成法はヌクレオチドまたはスプライスジャンクションレベルで高い精度を達成しているが、研究はそれらが完全長転写物のスプライシングパターンを確実に予測しないことを示した（ref.12）。さらに、単一サンプルのトランスクリプトーム再構成は、多くのサンプルにわたる転写動態の下流分析のための有用性が限られている。この問題に取り組むためには、複数の入力データセットからのコンセンサストランスクリプトームを構築しなければならず、それは個々のトランスクリプトーム再構成を階層的に併合することによって行うことができる。本著者らはこの手順を“meta-assembly”と名付ける。

　Meta-assembly用の現在のソフトウェアには、Cufflinksパッケージ内のCuffmergeユーティリティ（ref.13）と、StringTieソフトウェア内のmergeモード（以後「StringTie-merge」と表記）（ref.14）が含まれる。本著者らは、入力データセットの数が増えるとCuffmergeとStringTie-mergeの両方のパフォーマンスが低下し、イントロンを保持する誤って予測された転写物構造、異常に長い3 'および5'末端、および同じstrandの複数の隣接遺伝子を連結するリードスルー転写物、などが起きることを示す。これらのエラーは、不完全なプロセッシングのRNA、またはゲノムDNAでインプットライブラリーが汚染されたことに起因する可能性や、ゲノムguidedアセンブリから由来する可能性がある。スケーラブルを維持するために、メタアセンブリ方法論はこれらのエラーの原因を軽減するように設計されなければならない。

　この論文では、転写構造予測のため膨大なRNA-Seqデータランドスケープを活用する堅牢なソリューションとして、新しいメタアセンブリ法、TACO（Transcriptome Assemblies Combined In One）を紹介する。TACOのデータを調製するために、シーケンシングリードはスプライスアラインメントツールによってリファレンスゲノムにアラインメントされる。その後にgenome-guided transcript assemblyが実行され、これがTACOへの入力として利用される（論文図1）。純粋なメタアセンブラーであるため、TACOは使用されるアライメントおよびアセンブリ戦略にとらわれず、したがってTuxedo suite（ref.16）などの既存のRNA-Seq分析プロトコルに容易に組み込むことができる（論文のSupplementary Note, “Integration with other tools”参照）。このソフトウェアは、PythonおよびCで書かれており、その資料はhttp://tacorna.github.ioで入手できる。実行時間とメモリ使用量の統計はSupplementary Noteにある。

（複数段落省略）

リファレンスの選択は精度の測定基準に大きな影響を与え、選択されたリファレンスの性質に応じて潜在的な偏りが生じる可能性があることに注意することが重要である。

TACOは、最も豊富な上位5000転写産物について、76.8％のスプライシングパターン精度を達成し、それぞれStringTie-mergeおよびCuffmergeの30％および16.2％の精度よりも劇的に優れていた（論文図2c、補足図5、および補足表6）。

上記のパフォーマンス評価から明らかなように、TACOは、多数の入力データセットを処理する場合でも堅牢な正確性を維持し、CuffmergeおよびStringTie-mergeと比較してより忠実度の高い転写産物を生成した。一方CuffmergeとStringTie-mergeはたくさんの偽のアイソフォームを作り出した。この問題は、50個のサンプルをマージするときに発生し、マージされたサンプルの数が500個になると悪化した（論文図3）。

Fig.1 Schematic detailing the transcriptome meta-assembly workflow for TACO.

論文より転載

https://tacorna.github.io

manual

https://tacorna.github.io

インストール

downloadからmac、linux向けのバイナリをダウンロードできる。

https://tacorna.github.io

mamba install -c conda-forge -c bioconda -y taco

> ./taco_run -h

$ ./taco_run -h

usage: taco_run [-h] [-o DIR] [-p N] [-v] [--resume] [--assemble BED]

[--gtf-expr-attr ATTR] [--filter-min-length N]

[--filter-min-expr X] [--filter-splice-juncs]

[--add-splice-motif ADD_SPLICE_MOTIF]

[--ref-genome-fasta REF_GENOME_FASTA_FILE]

[--isoform-frac FRAC] [--max-isoforms N]

[--assemble-unstranded] [--no-assemble-unstranded]

[--change-point] [--no-change-point]

[--change-point-pvalue <float 0.0-1.0>]

[--change-point-fold-change <float 0.0-1.0>]

[--change-point-trim] [--no-change-point-trim]

[--path-kmax kmax] [--path-frac X] [--max-paths N]

[sample_file]

TACO: Multi-sample transcriptome assembly from RNA-Seq

positional arguments:

sample_file

optional arguments:

-h, --help show this help message and exit

-o DIR, --output-dir DIR

directory where output files will be stored (if

already exists then --resume must be specified)

[default=output]

-p N, --num-processes N

Run TACO in parallel with N processes [default=1]

-v, --verbose Enabled detailed logging (for debugging)

--resume Resumes an existing run that may have ended

prematurely. Specify the location of the run using the

-o/--output-dir option.

--assemble BED Assemble transfrags produced by a previous TACO run,

bypassing the GTF aggregation step. Accepts a taco-

formatted BED file.

--gtf-expr-attr ATTR GTF attribute field containing expression

[default=FPKM]

--filter-min-length N

Filter input transfrags with length < N prior to

assembly [default=200]

--filter-min-expr X Filter input transfrags with expression (FPKM/TPM) < X

prior to assembly [default=0.5]

--filter-splice-juncs

Filter input transfrags that possess non-canonical

splice motifs prior to assembly. Splice motifs are

GTAG, GCAG, and ATAC are allowed [default=False].

Requires genome sequence to be specified using --ref-

genome-fasta.

--add-splice-motif ADD_SPLICE_MOTIF

Add additional splice junction motifs to permit when

using the --filter-splice-juncs flag. Use this flag

multiple times for each additional junction motif.

Motif must be 4 bases.

--ref-genome-fasta REF_GENOME_FASTA_FILE

Reference genome sequence in FASTA format needed to

assess splice junction motif sequences. Use in

conjunction with --filter-splice-juncs.

--isoform-frac FRAC Report transcript isoforms with expression fraction >=

FRAC (0.0-1.0) relative to the major isoform within

each gene [default=0.05]

--max-isoforms N Maximum isoforms to report for each gene [default=0]

--assemble-unstranded

Enable assembly of unstranded transfrags

[default=False]

--no-assemble-unstranded

Disable assembly of unstranded transfrags

--change-point Enable change point detection [default=True]

--no-change-point Disable change point detection

Advanced Options:

(recommend leaving at their default settings for most purposes)

--change-point-pvalue <float 0.0-1.0>

Mann-Whitney-U p-value threshold for calling change

points [default=0.01]

--change-point-fold-change <float 0.0-1.0>

Fold change threshold between the means of two

putative segments on either side of a change point. A

value of 0.0 is the most strict setting, effectively

calling no change points. Conversely, setting the

value to 1.0 calls allchange points [default=0.85]

--change-point-trim Trim transfrags around change points [default=True]

--no-change-point-trim

Disable trimming around change points

--path-kmax kmax Limit optimization for choosing parameter k for path

graph (DeBruijn graph) to k <= kmax [default=0]

--path-frac X dynamic programming algorithm will stop finding

suboptimal paths when path expression drops below a

fraction X (0.0-1.0) of the total locus expression

[default=0.0]

--max-paths N dynamic programming algorithm will stop after finding

N paths [default=0]

> ./taco_refcomp -h #TACOのアセンブリ結果を他のリファレンスと比較する

$ ./taco_refcomp -h

usage: taco_refcomp [-h] [-v] [-o OUTPUT_DIR] [-p NUM_CORES] [--cpat]

[--cpat-species CPAT_SPEC] [--cpat-genome CPAT_GEN]

[-r REF_GTF_FILE] [-t TEST_GTF_FILE]

optional arguments:

-h, --help show this help message and exit

-v, --verbose

-o OUTPUT_DIR, --output-dir OUTPUT_DIR

Directory for reference comparison output (default:

taco_compare)

-p NUM_CORES, --num-processes NUM_CORES

Run tool in parallel with N processes. (note: each

core processes 1 chromosome) (default: 1)

--cpat Run CPAT tool to for coding potential scoring. (CPAT

function currently only supports Human, Mouse, and

Zebrafish) (WARNING: The CPAT tool can take over an

hour) (default: False)

--cpat-species CPAT_SPEC

Select either: human, mouse, zebrafish (default:

human)

--cpat-genome CPAT_GEN

Provide a genome fasta for the genome used to produce

assemblies being compared. Required if "--cpat" used.

CPAT uses this to obtain sequence for the provided

transcripts (default: None)

-r REF_GTF_FILE, --ref-gtf REF_GTF_FILE

Reference GTF file to compare against (default: None)

-t TEST_GTF_FILE, --test-gtf TEST_GTF_FILE

GTF file used for comparison (default: None)

実行方法

GTFファイルのパスを記載したテキスト指定する。

ls *gtf > gtf_files.txt
taco_run gtf_files.txt

他のgtfとの比較

taco_refcomp -o <output_directory> -r <reference_gtf> -t <test_gtf> --cpat (optional flag to run coding potential prediction)

テストラン

ランにはgtfファイルと、そのパスを記載したテキストファイルが必要。

１、tiny dataset

# オーサーのサイトからtxtとgtfをダウンロード
wget https://raw.githubusercontent.com/tacorna/taco/master/taco_test.gtf
wget https://raw.githubusercontent.com/tacorna/taco/master/taco_test_files.txt

中身をチェック

>cat taco_test_files.txt

taco_test.gtf

> head taco_test.gtf

chr1 Cufflinks transcript 894755 896612 1000 . . expr "0.0967430354"; transcript_id "49.T21"; ref "0"; sample_id "49";

chr1 Cufflinks transcript 894755 899245 1000 - . expr "0.0603817385"; transcript_id "49.T23"; ref "0"; sample_id "49";

chr1 Cufflinks exon 894755 896091 1000 - . transcript_id "49.T23";

chr1 Cufflinks exon 894755 896612 1000 . . transcript_id "49.T21";

chr1 Cufflinks transcript 894770 899167 1000 - . expr "0.2315295601"; transcript_id "6.T43"; ref "0"; sample_id "6";

chr1 Cufflinks transcript 894770 899167 1000 . . expr "0.1813954292"; transcript_id "6.T42"; ref "0"; sample_id "6";

chr1 Cufflinks exon 894770 896091 1000 - . transcript_id "6.T43";

chr1 Cufflinks exon 894770 899167 1000 . . transcript_id "6.T42";

chr1 Cufflinks transcript 894836 896157 1000 . . expr "0.1777070486"; transcript_id "46.T28"; ref "0"; sample_id "46";

chr1 Cufflinks transcript 894836 904238 1000 - . expr "0.0358336285"; transcript_id "46.T30"; ref "0"; sample_id "46";

Cufflinks/HAVANA/ENSEMBLのgtfを結合して１つのgtfにしたのが入力ファイル。

tacoを実行する。

taco_run --gtf-expr-attr expr taco_test_files.txt

進捗

019-03-24 16:54:35,322 pid=171 INFO - taco version 0.6.2

2019-03-24 16:54:35,322 pid=171 INFO - ------------------------------------------------------------------------------

2019-03-24 16:54:35,322 pid=171 INFO - verbose logging: False

2019-03-24 16:54:35,322 pid=171 INFO - num processes: 1

2019-03-24 16:54:35,322 pid=171 INFO - output directory: output

2019-03-24 16:54:35,323 pid=171 INFO - filter min length: 200

2019-03-24 16:54:35,323 pid=171 INFO - filter min expression: 0.5

2019-03-24 16:54:35,323 pid=171 INFO - filter splice juncs: 0

2019-03-24 16:54:35,323 pid=171 INFO - additional splice motifs:

2019-03-24 16:54:35,323 pid=171 INFO - reference genome FASTA file: None

2019-03-24 16:54:35,323 pid=171 INFO - reference GTF file: None

2019-03-24 16:54:35,323 pid=171 INFO - guided assembly mode: False

2019-03-24 16:54:35,324 pid=171 INFO - guided strand mode: False

2019-03-24 16:54:35,324 pid=171 INFO - guided ends mode: False

2019-03-24 16:54:35,324 pid=171 INFO - GTF expression attribute: expr

2019-03-24 16:54:35,324 pid=171 INFO - isoform fraction: 0.05

2019-03-24 16:54:35,324 pid=171 INFO - max_isoforms: 0

2019-03-24 16:54:35,324 pid=171 INFO - assemble_unstranded: 0

2019-03-24 16:54:35,324 pid=171 INFO - change point: True

2019-03-24 16:54:35,324 pid=171 INFO - change point pvalue: 0.01

2019-03-24 16:54:35,325 pid=171 INFO - change point fold change: 0.85

2019-03-24 16:54:35,325 pid=171 INFO - change point trim: True

2019-03-24 16:54:35,325 pid=171 INFO - path frac: 0.0

2019-03-24 16:54:35,325 pid=171 INFO - max paths: 0

2019-03-24 16:54:35,325 pid=171 INFO - Samples: 1

2019-03-24 16:54:35,325 pid=171 INFO - Aggregating GTF files

2019-03-24 16:54:35,325 pid=171 INFO - Aggregating in parallel using 1 processes

2019-03-24 16:54:35,592 pid=171 INFO - Merging aggregated files

2019-03-24 16:54:35,634 pid=171 INFO - Removing temporary files

2019-03-24 16:54:35,640 pid=171 INFO - Aggregate done

2019-03-24 16:54:35,645 pid=171 INFO - Indexing Loci

2019-03-24 16:54:35,654 pid=171 INFO - Assembling GTF files

2019-03-24 16:54:35,654 pid=171 INFO - Assembling in parallel using 1 processes

2019-03-24 16:54:36,481 pid=171 INFO - Merging output files

2019-03-24 16:54:36,547 pid=171 INFO - Removing temporary files

2019-03-24 16:54:36,565 pid=171 INFO - Done

出力

# ls -al output/

total 352

drwxr-xr-x 19 root root 646 Mar 24 16:54 .

drwxr-xr-x 8 root root 272 Mar 24 16:54 ..

-rw-r--r-- 1 root root 850 Mar 24 16:54 args.pickle

-rw-r--r-- 1 root root 10061 Mar 24 16:54 assembly.bed

-rw-r--r-- 1 root root 82339 Mar 24 16:54 assembly.gtf

-rw-r--r-- 1 root root 4758 Mar 24 16:54 change_points.gtf

-rw-r--r-- 1 root root 11727 Mar 24 16:54 expr.neg.bedgraph

-rw-r--r-- 1 root root 11956 Mar 24 16:54 expr.none.bedgraph

-rw-r--r-- 1 root root 14891 Mar 24 16:54 expr.pos.bedgraph

-rw-r--r-- 1 root root 59 Mar 24 16:54 loci.txt

-rw-r--r-- 1 root root 7136 Mar 24 16:54 path_graph_stats.txt

-rw-r--r-- 1 root root 84 Mar 24 16:54 sample_stats.txt

-rw-r--r-- 1 root root 30 Mar 24 16:54 samples.txt

-rw-r--r-- 1 root root 47412 Mar 24 16:54 splice_graph.gtf

-rw-r--r-- 1 root root 13956 Mar 24 16:54 splice_junctions.bed

-rw-r--r-- 1 root root 73 Mar 24 16:54 status.json

drwxr-xr-x 2 root root 68 Mar 24 16:54 tmp

-rw-r--r-- 1 root root 71538 Mar 24 16:54 transfrags.bed

-rw-r--r-- 1 root root 4743 Mar 24 16:54 transfrags.filtered.bed

> head output/assembly.bed

# head output/assembly.bed

chr1 902851 903946 G4|TU43(2623.2) 1000 - 902851 902851 0 1 1095, 0,

chr1 904464 948833 G1|TU2(5605.4) 527 + 904464 904464 0 16 1215,2411,4604,9199,92,182,51,125,90,138,163,116,79,500,125,5136, 0,1337,6403,11285,21457,25690,26574,31307,34575,34810,36679,37671,37945,38094,38788,39233,

chr1 904464 948833 G1|TU3(3993.9) 376 + 904464 904464 0 17 497,153,684,2200,4728,92,182,51,125,90,141,163,116,79,819,111,4926, 0,10854,11285,13417,15756,21457,25690,26574,31307,34575,34807,36679,37671,37945,38094,39233,39443,

chr1 904464 948833 G1|TU6(2321.8) 218 + 904464 904464 0 18 3748,3291,153,684,8420,92,182,51,125,90,186,163,116,79,500,125,111,4926, 0,7480,10854,11285,12064,21457,25690,26574,31307,34575,34810,36679,37671,37945,38094,38788,39233,39443,

chr1 904464 948833 G1|TU8(1366.5) 129 + 904464 904464 0 20 497,385,2411,3577,243,304,200,7067,92,182,51,125,90,186,163,116,79,500,125,5136, 0,830,1337,6403,10338,10703,11285,13417,21457,25690,26574,31307,34575,34810,36679,37671,37945,38094,38788,39233,

chr1 904464 908520 G1|TU9(913.0) 86 + 904464 904464 0 1 4056, 0,

chr1 904464 932076 G1|TU10(817.4) 77 + 904464 904464 0 8 497,2918,4604,200,7067,2830,182,1038, 0,830,6403,11285,13417,21457,25690,26574,

chr1 904753 908521 G5|TU44(1287.7) 1000 - 904753 904753 0 1 3768, 0,

chr1 908833 909977 G2|TU11(255.6) 1000 + 908833 908833 0 2 451,247, 0,897,

chr1 908833 909977 G2|TU12(197.3) 772 + 908833 908833 0 2 631,337, 0,807,

> head output/assembly.gtf

# head output/assembly.gtf

chr1 taco transcript 902852 903946 1000 - . tss_id "TSS8"; locus_id "L1"; abs_frac "1.00000"; rel_frac "1.00000"; expr "2623.181"; transcript_id "TU43"; gene_id "G4";

chr1 taco exon 902852 903946 1000 - . tss_id "TSS8"; locus_id "L1"; transcript_id "TU43"; gene_id "G4";

chr1 taco transcript 904465 948833 527 + . tss_id "TSS2"; locus_id "L1"; abs_frac "0.17052"; rel_frac "0.52742"; expr "5605.392"; transcript_id "TU2"; gene_id "G1";

chr1 taco transcript 904465 948833 376 + . tss_id "TSS2"; locus_id "L1"; abs_frac "0.12149"; rel_frac "0.37579"; expr "3993.875"; transcript_id "TU3"; gene_id "G1";

chr1 taco transcript 904465 948833 218 + . tss_id "TSS2"; locus_id "L1"; abs_frac "0.07063"; rel_frac "0.21846"; expr "2321.791"; transcript_id "TU6"; gene_id "G1";

chr1 taco transcript 904465 948833 129 + . tss_id "TSS2"; locus_id "L1"; abs_frac "0.04157"; rel_frac "0.12858"; expr "1366.496"; transcript_id "TU8"; gene_id "G1";

chr1 taco transcript 904465 908520 86 + . tss_id "TSS2"; locus_id "L1"; abs_frac "0.02777"; rel_frac "0.08591"; expr "913.030"; transcript_id "TU9"; gene_id "G1";

chr1 taco transcript 904465 932076 77 + . tss_id "TSS2"; locus_id "L1"; abs_frac "0.02486"; rel_frac "0.07691"; expr "817.373"; transcript_id "TU10"; gene_id "G1";

chr1 taco exon 904465 905679 527 + . tss_id "TSS2"; locus_id "L1"; transcript_id "TU2"; gene_id "G1";

chr1 taco exon 904465 904961 376 + . tss_id "TSS2"; locus_id "L1"; transcript_id "TU3"; gene_id "G1";

２、large dataset

データの準備（link）。

wget http://mctp-taco.path.med.umich.edu/ccle_cufflinks.tar.gz
tar -zxvf ccle_cufflinks.tar.gz
cd ccle_cufflinks
ls *gtf > list

#tacoを実行
taco_run --gtf-expr-attr FPKM -o output_dir list

引用
TACO produces robust multi-sample transcriptome assemblies from RNA-seq
Yashar S. Niknafs, Balaji Pandian, Hariharan K. Iyer, Arul M. Chinnaiyan, Matthew K. Iyer

Nat Methods. 2017 Jan; 14(1): 68–70. Published online 2016 Nov 21