FINDER: an automated gene annotation tool for eukaryotes

2021-09-01: added the published paper

2022-12-27: updated

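　Gene annotation in eukaryotes requires meticulous analysis of accumulated transcript data and is not a simple task. The challenges include transcriptionally active regions of the genome that contain overlapping genes, genes that produce many transcripts, transposable elements, and numerous diverse sequence repeats. Currently available gene annotation software depends on pre-constructed full-length gene sequence assemblies, which are not guaranteed to be error-free. The origins of these sequences are also often uncertain, which makes it difficult to identify and correct the errors in them. As a result, the transcriptomic landscape across multiple tissues and experimental conditions cannot be represented accurately and holistically. To capture the diversity of gene structures, a comprehensive analysis of genome-wide expression data is therefore essential.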

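　Here the authors present FINDER, a fully automated computational tool that optimizes the entire process of annotating gene and transcript structures. Unlike current state-of-the-art pipelines, FINDER automates the RNA-Seq pre-processing step by working directly with raw sequence reads, and it optimizes gene prediction from BRAKER2 by supplementing these reads with associated proteins. The FINDER pipeline (1) reports transcripts and recognizes genes that are expressed under specific conditions, (2) generates all possible alternatively spliced transcripts from the expressed RNA-Seq data, (3) analyzes read coverage patterns to modify existing transcript models and create new ones, and (4) scores genes as high- or low-confidence based on the evidence available across multiple datasets. The authors demonstrate that FINDER can automatically annotate a diverse set of genomes from eight species. FINDER takes a fully automated approach and annotates genes directly from raw expression data. It can process eukaryotic genomes of any size and requires no manual supervision, which makes it ideal for bench researchers with limited experience of computational tools.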

Installation

GitHub

git clone https://github.com/sagnikbanerjee15/finder.git
cd finder
mamba env create -f environment.yml
conda activate finder_conda_env
cd dep


# GeneMark-ES and GeneMarkS/T are required. Place the downloaded gmes_linux_64.tar.gz and the key file gmes_linux_64.gz in the current directory and run the install script (or add them to your PATH manually).
./install.sh

cd gmes_linux_64/
perl change_path_in_perl_scripts.pl /usr/bin/perl

# Pull the image from Docker Hub beforehand. This is mandatory, because run_finder.sh launches the Docker image to do the actual work.
docker pull sagnikbanerjee15/finder:1.1.0

> finder

Please use the --help option to get usage information
usage: finder [-h] --metadatafile METADATAFILE --output_directory
              OUTPUT_DIRECTORY --genome GENOME [--cpu CPU]
              [--genome_dir_star GENOME_DIR_STAR]
              [--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE]
              [--protein PROTEIN] [--no_cleanup] [--preserve_raw_input_data]
              [--checkpoint CHECKPOINT]
              [--perform_post_completion_data_cleanup]
finder: error: the following arguments are required: --metadatafile/-mf, --output_directory/-out_dir, --genome/-g

> run_finder

$ run_finder

Please use the --help option to get usage information
usage: run_finder [-h] [--version] --metadatafile METADATAFILE
                  --output_directory OUTPUT_DIRECTORY --genome GENOME
                  --organism_model {VERT,INV,PLANTS,FUNGI}
                  --genemark_path GENEMARK_PATH
                  --genemark_license GENEMARK_LICENSE [--cpu CPU]
                  [--genome_dir_star GENOME_DIR_STAR]
                  [--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE]
                  [--protein PROTEIN] [--no_cleanup]
                  [--preserve_raw_input_data] [--checkpoint CHECKPOINT]
                  [--perform_post_completion_data_cleanup] [--run_tests]
                  [--addUTR] [--skip_cpd] [--exonerate_gff3 EXONERATE_GFF3]
                  [--star_shared_mem] [--framework {docker,singularity}]
run_finder: error: the following arguments are required: --metadatafile/-mf, --output_directory/-out_dir, --genome/-g, --organism_model/-om, --genemark_path/-gm, --genemark_license/-gml

$ run_finder -h

usage: run_finder [-h] [--version] --metadatafile METADATAFILE
                  --output_directory OUTPUT_DIRECTORY --genome GENOME
                  --organism_model {VERT,INV,PLANTS,FUNGI}
                  --genemark_path GENEMARK_PATH
                  --genemark_license GENEMARK_LICENSE [--cpu CPU]
                  [--genome_dir_star GENOME_DIR_STAR]
                  [--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE]
                  [--protein PROTEIN] [--no_cleanup]
                  [--preserve_raw_input_data] [--checkpoint CHECKPOINT]
                  [--perform_post_completion_data_cleanup] [--run_tests]
                  [--addUTR] [--skip_cpd] [--exonerate_gff3 EXONERATE_GFF3]
                  [--star_shared_mem] [--framework {docker,singularity}]

Generates gene annotation from RNA-Seq data

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Required arguments:
  --metadatafile METADATAFILE, -mf METADATAFILE
                        Please enter the name of the metadata file. Enter 0
                        in the last column of those samples which you wish to
                        skip processing. The columns should represent the
                        following in order --> BioProject, SRA Accession,
                        Tissues, Description, Date, Read Length, Ended (PE or
                        SE), RNA-Seq, process, Location. If the sample is
                        skipped it will not be downloaded. Leave the
                        directory path blank if you are downloading the
                        samples. In the end of the run the program will
                        output a csv file with the directory path filled out.
                        Please check the provided csv file for more
                        information on how to configure the metadata file.
  --output_directory OUTPUT_DIRECTORY, -out_dir OUTPUT_DIRECTORY
                        Enter the name of the directory where all other
                        operations will be performed
  --genome GENOME, -g GENOME
                        Enter the SOFT-MASKED genome file of the organism
  --organism_model {VERT,INV,PLANTS,FUNGI}, -om {VERT,INV,PLANTS,FUNGI}
                        Enter the type of organism
  --genemark_path GENEMARK_PATH, -gm GENEMARK_PATH
                        Enter the path to genemark
  --genemark_license GENEMARK_LICENSE, -gml GENEMARK_LICENSE
                        Enter the licence file. Please make sure your license
                        file is less than 365 days old

Optional arguments:
  --cpu CPU, -n CPU     Enter the number of CPUs to be used.
  --genome_dir_star GENOME_DIR_STAR, -gdir_star GENOME_DIR_STAR
                        Please enter the location of the genome index
                        directory of STAR
  --genome_dir_olego GENOME_DIR_OLEGO, -gdir_olego GENOME_DIR_OLEGO
                        Please enter the location of the genome index
                        directory of OLego
  --verbose VERBOSE, -verb VERBOSE
                        Enter a verbosity level
  --protein PROTEIN, -p PROTEIN
                        Enter the protein fasta
  --no_cleanup, -no_cleanup
                        Provide this option if you do not wish to remove any
                        intermediate files. Please note that this will NOT
                        remove any files and might take up a large amount of
                        space
  --preserve_raw_input_data, -preserve
                        Set this argument if you want to preserve the raw
                        fastq files. All other temporary files will be
                        removed. These fastq files can be later used.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Enter a value if you wish to restart operations from
                        a certain check point. Please note if you have new
                        RNA-Seq samples, then FINDER will override this
                        argument and computation will take place from read
                        alignment. If there are missing data in any step then
                        also FINDER will enforce restart of operations from a
                        previous . For example, if you wish to run assembly
                        on samples for which alignments are not available
                        then FINDER will readjust this value and set it to 1.
                          1. Align reads to reference genome (Will trigger
                             removal of all alignments and start from
                             beginning)
                          2. Assemble with PsiCLASS (Will remove all
                             assemblies)
                          3. Find genes with FINDER (entails changepoint
                             detection)
                          4. Predict genes using BRAKER2 (Will remove
                             previous results of gene predictions with
                             BRAKER2)
                          5. Annotate coding regions
                          6. Merge FINDER annotations with BRAKER2
                             predictions and protein sequences
  --perform_post_completion_data_cleanup, -pc_clean
                        Set this field if you wish to clean up all the
                        intermediate files after the completion of the
                        execution. If this operation is requested prior to
                        generation of all the important files then it will be
                        ignored and finder will proceed to annotate the
                        genome.
  --run_tests, -rt      Modify behaviour of finder to accelerate tests. This
                        will reduce the downloaded fastq files to a bare
                        minimum and also check the other installations
  --addUTR, --addUTR    Turn on this option if you wish BRAKER to add UTR
                        sequences
  --skip_cpd, --skip_cpd
                        Turn on this option to skip changepoint detection.
                        Could be effective for grasses
  --exonerate_gff3 EXONERATE_GFF3, -egff3 EXONERATE_GFF3
                        Enter the exonerate output in gff3 format
  --star_shared_mem, --star_shared_mem
                        Turn on this option if you want STAR to load the
                        genome index into shared memory. This saves memory if
                        multiple finder runs are executing on the same host,
                        but might not work in your cluster environment.
  --framework {docker,singularity}, -fm {docker,singularity}
                        Enter your choice of framework

Test run

FINDER is designed to handle large numbers of RNA-Seq samples. Download the soft-masked genome FASTA, build the STAR and OLego indexes, and then prepare the raw fastq files.

cd example
wget ftp://ftp.ensemblgenomes.org/pub/plants/release-49/fasta/arabidopsis_thaliana/dna/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz
gunzip Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz


mkdir star_index_without_transcriptome
STAR --runMode genomeGenerate --runThreadN 20 --genomeDir star_index_without_transcriptome --genomeSAindexNbases 12 --genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa

olegoindex -p olego_index Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa
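
The value 12 passed to --genomeSAindexNbases in the STAR command above matches the rule of thumb in the STAR manual for small genomes, min(14, log2(GenomeLength)/2 - 1). The following is a minimal sketch of that calculation for other genomes. The function names genome_length and sa_index_nbases are hypothetical helpers written for this note, not part of STAR or FINDER, and the FASTA is assumed to be uncompressed.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not part of STAR or FINDER).

# Total sequence length of an uncompressed FASTA file; header lines are skipped.
genome_length() {
    awk '!/^>/ { len += length($0) } END { print len + 0 }' "$1"
}

# min(14, log2(L)/2 - 1), truncated to an integer, for a genome length L.
sa_index_nbases() {
    awk -v len="$1" 'BEGIN {
        n = int(log(len) / log(2) / 2 - 1)
        if (n > 14) n = 14
        print n
    }'
}
```

For example, sa_index_nbases "$(genome_length Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa)" gives the value to pass to STAR. For a genome of roughly 120 Mb the formula evaluates to 12.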

[Figure: raw fastq files]

Pull the Docker image as well.

Run FINDER. Give every file as a full path so that it is visible from inside the Docker container. Here each one is written as $PWD/file.

run_finder -no_cleanup -mf $PWD/Arabidopsis_thaliana_metadata.csv -n 20 -out_dir $PWD/FINDER_test_ARATH -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa -p $PWD/uniprot_ARATH.fasta -preserve --genemark_path $PWD/gmes_linux_64_4/ --genemark_license $PWD/gm_key_64 --organism_model PLANTS 1> log
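
Since every input has to be given as a full path, it can help to resolve and check each path before launching the run. The following is a minimal sketch, assuming GNU realpath is installed. abs_path is a hypothetical helper written for this note, not part of FINDER.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FINDER): print the absolute path of an
# existing file or directory, or fail with a message on stderr.
abs_path() {
    if [ ! -e "$1" ]; then
        echo "missing: $1" >&2
        return 1
    fi
    realpath "$1"
}
```

It could be used as -mf "$(abs_path Arabidopsis_thaliana_metadata.csv)", so that a mistyped file name is caught before the container starts.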

Some columns of the metadata CSV are described as optional, but in practice the run fails if they are missing.

An error of unknown cause then occurs.
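
The run_finder help lists ten metadata columns (BioProject, SRA Accession, Tissues, Description, Date, Read Length, Ended, RNA-Seq, process, Location). A quick check that every row has all ten may catch a missing column before the run. The following is a minimal sketch, assuming a plain comma-separated file with no quoted commas inside fields. check_metadata_columns is a hypothetical helper written for this note, not part of FINDER.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of FINDER): report rows whose column count
# differs from the expected number (default 10, per the run_finder help).
# Assumes simple comma-separated rows with no quoted commas.
# Exits non-zero if any row is reported.
check_metadata_columns() {
    local csv="$1" expected="${2:-10}"
    awk -F',' -v n="$expected" '
        NF != n { printf "line %d: %d columns (expected %d)\n", NR, NF, n; bad = 1 }
        END { exit bad }
    ' "$csv"
}
```

A row that leaves the last column (Location) empty but keeps the trailing comma still counts as ten columns.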

Citation

FINDER: An automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences
Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z. Sen, Roger P. Wise, Carson M. Andorf

bioRxiv, Posted February 06, 2021

Added 2021-09-01

FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences
Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z. Sen, Roger P. Wise & Carson M. Andorf
BMC Bioinformatics volume 22, Article number: 205 (2021)