annotate_my_genomes: improving genome annotation with hybrid RNA sequencing data

Updated 2022/12/27, 28

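Advances in hybrid sequencing technologies are steadily expanding the number of genome assemblies that are annotated with hybrid sequencing transcriptomics. This improves genome characterization and leads to the identification of novel genes and isoforms in a wide range of organisms.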
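The authors developed an easy-to-use, genome-guided transcriptome annotation pipeline that takes transcripts assembled from hybrid sequencing data as input. By integrating several bioinformatic approaches, it distinguishes coding RNAs from long non-coding RNAs and reconciles genes against a previous annotation in GTF format. They demonstrated the effectiveness of the method by correctly assembling and annotating all exons of the chicken SCO-spondin gene (which contains more than 105 exons), including the identification of genes missing from the chicken reference annotation by homology assignment.
The method helps improve the chicken brain transcriptome annotation. It is an easy-to-use package implemented with Anaconda/Nextflow and Docker that can be applied to a wide range of species, tissues, and research fields, and it helps improve and reconcile current annotations. The code and datasets are available at https://github.com/cfarkas/annotate_my_genomes.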

wiki

https://github.com/cfarkas/annotate_my_genomes/wiki

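This pipeline is not meant for annotating a genome from scratch. It is intended for cases where an annotation of reasonable quality is already public and you want to improve it further with long RNA-seq data such as Iso-Seq.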

Installation

The docker image failed with a garbled-character (encoding) error, so I created the environment with conda and tested with that (#2; the install option 2 procedure).

GitHub

#1 
git clone https://github.com/cfarkas/annotate_my_genomes.git
cd annotate_my_genomes
current_dir=$(pwd)
nextflow run makefile.nf --workdir $current_dir --conda ./22.04_environment.yml
sudo cp ./bin/* /usr/local/bin/

#2
git clone https://github.com/cfarkas/annotate_my_genomes.git
cd annotate_my_genomes
conda config --add channels bioconda
conda config --add channels conda-forge
mamba env create -f 22.04_environment.yml
conda activate annotate_my_genomes
bash makefile.sh
# Copy the binaries into a directory on PATH; here, the bin directory of the conda environment
sudo cp ./bin/* /home/kazu/mambaforge/envs/annotate_my_genomes/bin/     

#3 docker
docker pull carlosfarkas/annotate_my_genomes:latest
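After install option #1 or #2, it is worth confirming that the copied commands are actually visible on PATH before moving on. The following is a minimal sketch of such a check. The function name check_tools is my own and is not part of annotate_my_genomes; the command names are the ones that appear in this article.

```shell
#!/bin/sh
# Hypothetical helper (not part of annotate_my_genomes):
# report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example call with the commands mentioned in this article:
# check_tools annotate-my-genomes add-ncbi-annotation isoform-identification
```

If anything is reported as missing, the cp destination is probably not on PATH for the current shell (for example, the conda environment is not activated).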

> annotate-my-genomes

arguments -a, -r, -g, -c, -t and -o must be provided

annotate-my-genomes [-h] [-a <stringtie.gtf>] [-r <reference_genome.gtf>] [-g <reference_genome.fasta>] [-c <gawn_config>] [-t <threads>]

This pipeline will Overlap StringTie transcripts (GTF format) with current UCSC annotation and will annotate novel transcripts.

Arguments:
-h show this help text
-a StringTie GTF
-r UCSC gene annotation (in GTF format)
-g Reference genome (in fasta format)
-c GAWN config file (path to gawn_config.sh in annotate_my_genomes folder)
-t Number of threads for processing (integer)
-o output folder (must exist)

> add-ncbi-annotation

add-ncbi-annotation [-h] [-a <stringtie.gtf>] [-n <NCBI_reference.gtf>] [-r <reference_genome.gtf>] [-g <reference_genome.fasta>] [-c <gawn_config>] [-t <threads>] [-o <output>]

This pipeline will Overlap StringTie transcripts (GTF format) with current NCBI annotation and will annotate novel transcripts.

Arguments:
-h show this help text
-a StringTie GTF
-n NCBI gene annotation (in GTF format)
-r UCSC gene annotation (in GTF format)
-g Reference genome (in fasta format)
-c GAWN config file (path to gawn_config.sh in annotate_my_genomes folder)
-t Number of threads for processing (integer)
arguments -a, -n, -r, -g, -c, -t and -o must be provided
-o output folder (must exist)
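Since all arguments are mandatory and the output folder must already exist, a small pre-flight check can catch a typo in a path before a long run starts. This is only a sketch: preflight is a name I made up, and it takes the five path-like arguments in the order -a, -r, -g, -c, -o. It does not call the pipeline itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of annotate_my_genomes).
# Usage: preflight <stringtie.gtf> <reference.gtf> <genome.fasta> <gawn_config.sh> <output_folder>
# Returns 0 when all inputs look usable, 1 on a problem, 2 on wrong usage.
preflight() {
    if [ "$#" -ne 5 ]; then
        echo "usage: preflight <a> <r> <g> <c> <o>" >&2
        return 2
    fi
    status=0
    # -a, -r, -g and -c must be readable, non-empty files
    for f in "$1" "$2" "$3" "$4"; do
        if [ ! -s "$f" ] || [ ! -r "$f" ]; then
            echo "missing, empty or unreadable input file: $f" >&2
            status=1
        fi
    done
    # -o must already exist (the help text says "must exist")
    if [ ! -d "$5" ]; then
        echo "output folder does not exist: $5" >&2
        status=1
    fi
    return "$status"
}

# Example:
# preflight outdir/transcripts.gtf outdir/galGal6.gtf outdir/galGal6.fa gawn_config.sh output2/ \
#   && echo "inputs look fine"
```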


Test run

1. Prepare the genome

cd annotate_my_genomes/nextflow_scripts/
mkdir outdir
nextflow run genome-download.nf \
--genome galGal6 \
--conda ../22.04_environment.yml --outdir outdir

If you are using a recent version of nextflow, add -dsl1 to the run command. Note that an old nextflow version causes errors.
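Whether -dsl1 is needed depends on the installed Nextflow version. As far as I understand from the Nextflow release notes (this is my assumption, not something stated in the annotate_my_genomes repository), DSL1 was the default before 22.03, the -dsl1 option exists from 22.03 up to the 22.10 series, and DSL1 support was removed from 22.12 onward. A rough sketch that classifies a version string under those assumed thresholds:

```shell
#!/bin/sh
# Hypothetical helper; the version thresholds are assumptions taken from
# my reading of the Nextflow release notes, so verify them for your setup.
# Prints one of:
#   default  (DSL1 is the default, no flag needed)
#   flag     (pass -dsl1 to "nextflow run")
#   removed  (DSL1 scripts are no longer supported)
dsl1_mode() {
    major=$(printf '%s\n' "$1" | cut -d. -f1)
    minor=$(printf '%s\n' "$1" | cut -d. -f2)
    minor=${minor#0}      # "03" -> "3", so it is not parsed as octal
    minor=${minor:-0}     # "0"  -> ""  -> "0"
    ver=$((major * 100 + minor))
    if [ "$ver" -lt 2203 ]; then
        echo "default"
    elif [ "$ver" -lt 2212 ]; then
        echo "flag"
    else
        echo "removed"
    fi
}

# Example: dsl1_mode 22.10.6
# When it prints "flag", the download step would become something like:
# nextflow run genome-download.nf -dsl1 --genome galGal6 \
#   --conda ../22.04_environment.yml --outdir outdir
```

If the result is "removed", installing a Nextflow release from the 22.10 series is probably the simplest way to run these scripts unchanged.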

outdir/

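2. Run annotate-my-genomes.
Specify the input files by full path. Running through nextflow produced an error, so here I run the command directly without it. The command recipe for creating stringtie.gtf is explained here. Here,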

# This fails with an error
cd annotate_my_genomes/nextflow_scripts/
mkdir output_folder
nextflow run annotate-my-genomes.nf \
--stringtie $PWD/outdir/stringtie.gtf \
--ref_annotation $PWD/outdir/galGal6.gtf \
--genome $PWD/outdir/galGal6.fa \
--config $PWD/gawn_config.sh \
--threads 20 \
--conda /path/to/22.04_environment.yml --outdir $PWD/output_folder/

# Because of the error, run the command directly instead of the nextflow script
mkdir output2
annotate-my-genomes -a outdir/transcripts.gtf -g outdir/galGal6.fa -c gawn_config.sh -t 20 -o output2/ -r outdir/galGal6.gtf
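Because the files should be given as full paths, a small helper that turns a relative path into an absolute one can be handy. This is a sketch that avoids realpath, which is not available everywhere; abs_path is a hypothetical name and the parent directory of the given path must exist.

```shell
#!/bin/sh
# Hypothetical helper: print the absolute form of a path.
# Fails (returns 1) if the parent directory does not exist.
abs_path() {
    dir=$(cd "$(dirname "$1")" 2>/dev/null && pwd -P) || return 1
    case "$dir" in
        /) echo "/$(basename "$1")" ;;      # avoid a leading "//"
        *) echo "$dir/$(basename "$1")" ;;
    esac
}

# Example: the direct command with every path made absolute
# annotate-my-genomes \
#   -a "$(abs_path outdir/transcripts.gtf)" \
#   -r "$(abs_path outdir/galGal6.gtf)" \
#   -g "$(abs_path outdir/galGal6.fa)" \
#   -c "$(abs_path gawn_config.sh)" \
#   -t 20 \
#   -o "$(abs_path output2)"
```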

output2/

final_annotated.gtf is the file in which the annotation derived from long RNA-seq reads (stringtie2.gtf) has been merged into the current annotation.
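To get a quick feel for what was merged, the feature counts of the input GTF and final_annotated.gtf can be compared. The sketch below assumes the files follow the usual 9-column, tab-separated GTF layout with the feature type in column 3 and a gene_id attribute in column 9. The function names are my own.

```shell
#!/bin/sh
# Hypothetical helpers for a quick look at a GTF file.

# Count records whose feature type (column 3) equals the second argument,
# skipping comment lines that start with "#".
count_gtf_feature() {
    awk -F '\t' -v feat="$2" '$0 !~ /^#/ && $3 == feat { n++ } END { print n + 0 }' "$1"
}

# Count distinct gene_id values found in the attribute column (column 9).
count_gene_ids() {
    awk -F '\t' '
        $0 !~ /^#/ {
            if (match($9, /gene_id "[^"]+"/)) ids[substr($9, RSTART, RLENGTH)] = 1
        }
        END { n = 0; for (k in ids) n++; print n }
    ' "$1"
}

# Example:
# count_gtf_feature output2/final_annotated.gtf transcript
# count_gene_ids    output2/final_annotated.gtf
```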

3. The NCBI annotation can be added as well. In that case, use add-ncbi-annotation instead of the command in step 2. All parameters other than -n are the same as in step 2.

mkdir output3
add-ncbi-annotation -a outdir/transcripts.gtf -n outdir/galGal6_ncbiRefSeq.gtf -r outdir/galGal6.gtf -g outdir/galGal6.fa -c gawn_config.sh -t 30 -o output3/
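This step combines three GTF files with one FASTA, so one plausible source of trouble (my assumption; the repository does not document this) is a mismatch in sequence names, such as "chr1" in one file and "1" in another. A sketch that lists GTF sequence names with no matching FASTA header follows. missing_seqnames is a hypothetical name, and the FASTA is assumed to be non-empty.

```shell
#!/bin/sh
# Hypothetical helper: print each sequence name used in a GTF (column 1)
# that has no ">name" header in the FASTA. Prints nothing when all match.
# Usage: missing_seqnames <annotation.gtf> <genome.fasta>
missing_seqnames() {
    gtf=$1
    fasta=$2
    awk -F '\t' '
        # First file (FASTA): remember header names up to the first blank
        NR == FNR {
            if (/^>/) { name = substr($0, 2); sub(/[ \t].*/, "", name); seen[name] = 1 }
            next
        }
        # Second file (GTF): report each unknown sequence name once
        $0 !~ /^#/ && NF >= 1 && !($1 in seen) && !reported[$1]++ { print $1 }
    ' "$fasta" "$gtf"
}

# Example: run once per GTF passed to add-ncbi-annotation
# missing_seqnames outdir/galGal6_ncbiRefSeq.gtf outdir/galGal6.fa
```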

output3/

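The BRAKER2 pipeline output braker.gtf, or the TSEBRA pipeline output tsebra.gtf, can also be used to build an annotation file merged with the NCBI annotation (BRAKER2 writes both braker.gff3 and braker.gtf; use the GTF). To do this, fix the format of braker.gtf with the AGAT toolkit and then run the add-ncbi-annotation command. Details are given in the repository under "VI Annotation of BRAKER2 / TSEBRA gtf output".
When stringtie.gtf has been annotated with add-ncbi-annotation, a transcripts annotation table (csv) can be generated from the output of the add-ncbi-annotation pipeline with the isoform-identification command.
Novel proteins can also be annotated, and paralogs can be identified among the novel genes found by the pipeline (link).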

Citation

annotate_my_genomes: an easy-to-use pipeline to improve genome annotation and uncover neglected genes by hybrid RNA sequencing
Carlos Farkas, Antonia Recabal, Andy Mella, Daniel Candia-Herrera, Maryori González Olivero, Jody Jonathan Haigh, Estefanía Tarifeño-Saldivia, Teresa Caprile
GigaScience, Volume 11, 2022, Published: 06 December 2022