

A collection of informatics notes related to HTS (NGS).

FINDER, an automated gene annotation tool for eukaryotes

2021/9/1: added the published paper

2022/12/27: updated







git clone
cd Finder
mamba env create -f environment.yml
conda activate finder_conda_env
cd dep


cd gmes_linux_64/
perl /usr/bin/perl

# Pull the image from Docker Hub beforehand. This is required because run_finder.sh calls and runs the docker image.
docker pull sagnikbanerjee15/finder:1.1.0

> finder
Please use the --help option to get usage information
usage: finder [-h] --metadatafile METADATAFILE --output_directory
              [--genome_dir_star GENOME_DIR_STAR]
              [--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE]
              [--protein PROTEIN] [--no_cleanup] [--preserve_raw_input_data]
              [--checkpoint CHECKPOINT]
finder: error: the following arguments are required: --metadatafile/-mf, --output_directory/-out_dir, --genome/-g


$ run_finder
Please use the --help option to get usage information
usage: run_finder [-h] [--version] --metadatafile METADATAFILE
                  --output_directory OUTPUT_DIRECTORY --genome GENOME --organism_model
                  --genemark_license GENEMARK_LICENSE [--cpu CPU]
                  [--genome_dir_star GENOME_DIR_STAR]
                  [--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE] [--protein
                  PROTEIN] [--no_cleanup] [--preserve_raw_input_data] [--checkpoint
                  [--run_tests] [--addUTR] [--skip_cpd] [--exonerate_gff3
                  EXONERATE_GFF3] [--star_shared_mem] [--framework {docker,singularity}]
run_finder: error: the following arguments are required: --metadatafile/-mf, --output_directory/-out_dir, --genome/-g, --organism_model/-om, --genemark_path/-gm, --genemark_license/-gml

$ run_finder -h
usage: run_finder [-h] [--version] --metadatafile METADATAFILE
                  --output_directory OUTPUT_DIRECTORY --genome GENOME --organism_model
                  --genemark_license GENEMARK_LICENSE [--cpu CPU]
                  [--genome_dir_star GENOME_DIR_STAR]
                  [--genome_dir_olego GENOME_DIR_OLEGO] [--verbose VERBOSE] [--protein
                  PROTEIN] [--no_cleanup] [--preserve_raw_input_data] [--checkpoint
                  [--run_tests] [--addUTR] [--skip_cpd] [--exonerate_gff3
                  EXONERATE_GFF3] [--star_shared_mem] [--framework {docker,singularity}]


Generates gene annotation from RNA-Seq data


optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

Required arguments:

                        Please enter the name of the metadata file. Enter 0 in
                        the last column of those samples which you wish to skip
                        processing. The columns should represent the following
                        in order --> BioProject, SRA Accession, Tissues,
                        Description, Date, Read Length, Ended (PE or SE),
                        RNA-Seq, process, Location. If the sample is skipped it
                        will not be downloaded. Leave the directory path blank
                        if you are downloading the samples. In the end of the
                        run the program will output a csv file with the
                        directory path filled out. Please check the provided
                        csv file for more information on how to configure the
                        metadata file.
  --output_directory OUTPUT_DIRECTORY, -out_dir OUTPUT_DIRECTORY
                        Enter the name of the directory where all other
                        operations will be performed
  --genome GENOME, -g GENOME
                        Enter the SOFT-MASKED genome file of the organism

                        Enter the type of organism
  --genemark_path GENEMARK_PATH, -gm GENEMARK_PATH
                        Enter the path to genemark
  --genemark_license GENEMARK_LICENSE, -gml GENEMARK_LICENSE
                        Enter the licence file. Please make sure your license
                        file is less than 365 days old


Optional arguments:
  --cpu CPU, -n CPU     Enter the number of CPUs to be used.
  --genome_dir_star GENOME_DIR_STAR, -gdir_star GENOME_DIR_STAR
                        Please enter the location of the genome index
                        directory of STAR
  --genome_dir_olego GENOME_DIR_OLEGO, -gdir_olego GENOME_DIR_OLEGO
                        Please enter the location of the genome index
                        directory of OLego
  --verbose VERBOSE, -verb VERBOSE
                        Enter a verbosity level
  --protein PROTEIN, -p PROTEIN
                        Enter the protein fasta
  --no_cleanup, -no_cleanup
                        Provide this option if you do not wish to remove any
                        intermediate files. Please note that this will NOT
                        remove any files and might take up a large amount of
                        space
  --preserve_raw_input_data, -preserve
                        Set this argument if you want to preserve the raw
                        fastq files. All other temporary files will be
                        removed. These fastq files can be later used.

  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Enter a value if you wish to restart operations from
                        a certain check point. Please note if you have new
                        RNA-Seq samples, then FINDER will override this
                        argument and computation will take place from read
                        alignment. If there are missing data in any step then
                        also FINDER will enforce restart of operations from a
                        previous
                        . For example, if you wish to run assembly on samples
                        for which alignments are not available then FINDER
                        will readjust this value and set it to 1.
                            1. Align reads to reference genome (Will trigger
                               removal of all alignments and start from
                               beginning)
                            2. Assemble with PsiCLASS (Will remove all
                            3. Find genes with FINDER (entails changepoint
                               detection)
                            4. Predict genes using BRAKER2 (Will remove
                               previous results of gene predictions with
                               BRAKER2)
                            5. Annotate coding regions
                            6. Merge FINDER annotations with BRAKER2
                               predictions and protein sequences
  --perform_post_completion_data_cleanup, -pc_clean
                        Set this field if you wish to clean up all the
                        intermediate files after the completion of the
                        execution. If this operation is requested prior to
                        generation of all the important files then it will be
                        ignored and finder will proceed to annotate the
  --run_tests, -rt      Modify behaviour of finder to accelerate tests. This
                        will reduce the downloaded fastq files to a bare
                        minimum and also check the other installations
  --addUTR, --addUTR    Turn on this option if you wish BRAKER to add UTR
                        sequences
  --skip_cpd, --skip_cpd
                        Turn on this option to skip changepoint detection.
                        Could be effective for grasses
  --exonerate_gff3 EXONERATE_GFF3, -egff3 EXONERATE_GFF3
                        Enter the exonerate output in gff3 format
  --star_shared_mem, --star_shared_mem
                        Turn on this option if you want STAR to load the
                        genome index into shared memory. This saves memory if
                        multiple finder runs are executing on the same host,
                        but might not work in your cluster environment.
  --framework {docker,singularity}, -fm {docker,singularity}
                        Enter your choice of framework
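
The metadata file described in the help text is a ten-column CSV. The block below is only a minimal sketch of that layout, not the file shipped with FINDER: the file name, the header spelling, the accessions, and every sample value are made-up placeholders, and whether a header row is expected is an assumption. Treating the "process" column as the 0/1 skip flag is also my reading of the help text. Check the example CSV provided with FINDER for the exact format. The Location field is left blank, as the help text says to do for samples that will be downloaded.

```shell
#!/bin/sh
# Sketch of a FINDER-style metadata CSV. All values are hypothetical placeholders.
# Column order follows the run_finder help text:
# BioProject, SRA Accession, Tissues, Description, Date, Read Length,
# Ended (PE or SE), RNA-Seq, process, Location
metadata=my_metadata.csv   # hypothetical file name

cat > "$metadata" <<'EOF'
BioProject,SRA Accession,Tissues,Description,Date,Read Length,Ended,RNA-Seq,process,Location
PRJNA000000,SRR0000001,leaf,placeholder sample 1,2020-01-01,100,PE,1,1,
PRJNA000000,SRR0000002,root,placeholder sample 2,2020-01-01,100,PE,1,0,
EOF

# Sanity check: every row should have exactly 10 comma-separated fields.
# (This simple check assumes no field contains a quoted comma.)
bad_rows=$(awk -F',' 'NF != 10 { n++ } END { print n + 0 }' "$metadata")
echo "rows with a wrong column count: $bad_rows"

# Count samples whose "process" column (9th field) is 0, i.e. marked to be skipped.
skipped=$(awk -F',' 'NR > 1 && $9 == 0 { n++ } END { print n + 0 }' "$metadata")
echo "samples marked to skip: $skipped"
```

Running a column-count check like this before starting a long job is cheap insurance against a misaligned row.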





FINDER is designed to handle large numbers of RNA-Seq samples. Download the annotation files first, then prepare the raw fastq files.

cd example
gunzip Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa.gz

mkdir star_index_without_transcriptome
STAR --runMode genomeGenerate --runThreadN 20 --genomeDir star_index_without_transcriptome --genomeSAindexNbases 12 --genomeFastaFiles Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa

olegoindex -p olego_index Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa
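
Since these steps rely on several external programs, a quick preflight check can catch a missing tool before a long run. The sketch below only tests whether each command named in this post is on PATH. `have_tool` is a helper name introduced here for illustration and is not part of FINDER.

```shell
#!/bin/sh
# Preflight sketch: report which tools used in this walkthrough are on PATH.
# have_tool is a hypothetical helper, not part of FINDER.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

missing=0
for tool in STAR olegoindex docker run_finder; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
        missing=$((missing + 1))
    fi
done
echo "$missing tool(s) missing"
```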

raw fastq



Also pull the docker image beforehand.



run_finder -no_cleanup -mf $PWD/Arabidopsis_thaliana_metadata.csv -n 20 \
  -out_dir $PWD/FINDER_test_ARATH \
  -g $PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa \
  -p $PWD/uniprot_ARATH.fasta -preserve \
  --genemark_path $PWD/gmes_linux_64_4/ --genemark_license $PWD/gm_key_64 \
  --organism_model PLANTS 1> log
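
The help lists a `--genome_dir_star` option, so the STAR index built earlier can presumably be handed to the run. The following is a dry-run sketch under that assumption; I have not confirmed how FINDER behaves when the index is supplied. The OLego index is left out because it was built with a file prefix (`-p olego_index`) and it is unclear to me what directory `--genome_dir_olego` would expect. `RUN` is a switch invented for this sketch: without `RUN=1` the script only prints the command it would execute.

```shell
#!/bin/sh
# Dry-run sketch: assemble the run_finder command from this post and add the
# prebuilt STAR index only if its directory exists.
# RUN is a hypothetical switch introduced for this sketch.
star_index="$PWD/star_index_without_transcriptome"

# Build the argument list in the positional parameters (POSIX sh has no arrays).
set -- run_finder -no_cleanup -preserve \
    -mf "$PWD/Arabidopsis_thaliana_metadata.csv" \
    -n 20 \
    -out_dir "$PWD/FINDER_test_ARATH" \
    -g "$PWD/Arabidopsis_thaliana.TAIR10.dna_sm.toplevel.fa" \
    -p "$PWD/uniprot_ARATH.fasta" \
    --genemark_path "$PWD/gmes_linux_64_4/" \
    --genemark_license "$PWD/gm_key_64" \
    --organism_model PLANTS

# Append the optional index argument when the directory is present.
if [ -d "$star_index" ]; then
    set -- "$@" --genome_dir_star "$star_index"
fi

cmd_line="$*"
echo "$cmd_line"

# Execute only when explicitly requested and run_finder is installed.
if [ "${RUN:-0}" = 1 ] && command -v run_finder >/dev/null 2>&1; then
    "$@" 1> log
fi
```

Printing the assembled command first makes it easy to review long paths before committing to a run that may take many hours.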





FINDER: An automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences
Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z. Sen, Roger P. Wise, Carson M. Andorf

bioRxiv, Posted February 06, 2021



FINDER: an automated software package to annotate eukaryotic genes from RNA-Seq data and associated protein sequences
Sagnik Banerjee, Priyanka Bhandary, Margaret Woodhouse, Taner Z. Sen, Roger P. Wise & Carson M. Andorf 
BMC Bioinformatics volume 22, Article number: 205 (2021)