LoReAn: an automated annotation tool for eukaryotic genomes that can also use long-read cDNA sequencing

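Single-molecule full-length complementary DNA (cDNA) sequencing can aid genome annotation by revealing transcript structures and splice forms, but current annotation pipelines do not incorporate such information. This study introduces the Long Read Annotation (LoReAn) software. Based on annotations of two fungal genomes (Verticillium dahliae and Plicaturopsis crispa) and two plant genomes (Arabidopsis [Arabidopsis thaliana] and Oryza sativa), it shows that LoReAn outperforms popular annotation pipelines.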

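Most genome annotations to date rely on a combination of short-read mapping data and ab initio gene prediction. This process is error-prone for three reasons: short-read RNA-seq data cannot always be mapped unambiguously, a single read does not span the full length of a gene, and the evidence leading to a gene prediction is weighted inconsistently. LoReAn was developed to address these mapping and gene-structure problems by emphasizing empirical mapping data during gene prediction and by also using information from long-read sequencing data.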

INSTALLATION

https://github.com/lfaino/LoReAn/blob/master/INSTALL.md

Installation

Tested with Docker on Ubuntu 18.

Software used by the pipeline

TransDecoder-3.0.1
samtools v0.1.19-96b5f2294a
bedtools v2.25.0
bowtie v1.1.2
bamtools v2.4.1
AATpackage r03052011
iAssembler v1.3.2.x64
GeneMark-ES/ET v.4.33 64bit (THIS SOFTWARE IS NOT FREE FOR EVERYONE, check installation instruction)
PASApipeline v2.1.0
augustus v3.3
trinityrnaseq v2.5.1
STAR v2.5.3a
gmap-gsnap v2017-06-20
fasta v36.3.8e
BRAKER v2.0
EVidenceModeler v1.1.1
gffread v0.9.9
genometools v1.5.9

GitHub

Using the Docker image prepared by the author is recommended.

# Docker Hub
docker pull lfaino/lorean:latest

> docker run -it --rm lfaino/lorean lorean.py -h

usage: lorean [options] reference

LoReAn - Automated genome annotation pipeline that integrates long reads

positional arguments:
  reference             Path to reference file

optional arguments:
  -h, --help            show this help message and exit
  -pr [PROTEINS], --proteins [PROTEINS]
                        Path to protein sequences FASTA file
  -sp [SPECIES], --species [SPECIES]
                        Species name for AUGUSTUS training. No re-training if
                        species already present in AUGUSTUS config folder
  -d, --stranded        Run LoReAn on stranded mode [FALSE]
  -mm, --minimap2       Use minimap2 to map long cDNA reads on the genome;
                        default setting is GMAP [FALSE]
  -iprs, --interproscan
                        Run interproscan after gene annotation [FALSE]
  -f, --fungus          Use this option for fungal species (used in Gene Mark-
                        ES) [FALSE]
  -kt, --keep_tmp       Keep temporary files [FALSE]
  -sr [FASTQ_file], --short_reads [FASTQ_file]
                        Path to short reads FASTQ. If paired end, comma-
                        separated (1-1.fq,1-2.fq). BAM sorted files are
                        allowed; the extension of the file should be
                        filename.sorted.bam
  -lr [FASTQ_file], --long_reads [FASTQ_file]
                        Path to long reads FASTQ
  -a [FASTA_file], --adapter [FASTA_file]
                        FASTA file containing the adapter sequences. Adapter
                        sequences in forward and reverse strain of the same
                        adapter need to be used in the file
  -am [ADAPTER_MATCH_SCORE], --adapter_match_score [ADAPTER_MATCH_SCORE]
                        Score value for an adapter to match a read. Lower
                        values keep more reads but the orientation is less
                        reliable [0-100]. If left empty, the value is
                        automatically calculated)
  -rp [GFF_file], --repeat_masked [GFF_file]
                        GFF or GFF3 or GTF or BED file containing repeats
                        coordinates
  -mg, --mask_genome    Run RepeatScout and RepeatMasker on the genome fasta
                        file [FALSE]
  -rl [N], --repeat_lenght [N]
                        Minimum length of a repeat to be masked
  -ex [GFF_file], --external [GFF_file]
                        GFF3 of FASTA file containing external annotation
                        information
  -up [GFF_file], --upgrade [GFF_file]
                        GFF3 to upgrade using long reads [and short
                        read]information []
  -m [MAX_LONG_READ], --max_long_read [MAX_LONG_READ]
                        Filter out long reads longer than this value (longer
                        reads may affect mapping and assembling) [20000]
  -pasa [PASA_DB], --pasa_db [PASA_DB]
                        PASA database name [pipeline_run]
  -n [PREFIX_GENE], --prefix_gene [PREFIX_GENE]
                        Prefix to add to the final Gff3 gene name [specie]
  -o [OUT_DIR], --out_dir [OUT_DIR]
                        In this path all the files will be stored
  -w [WORKING_DIR], --working_dir [WORKING_DIR]
                        Working directory (will create if not present)
  -t [N], --threads [N]
                        Number of threads [3]
  -cw N, --augustus_weigth N
                        Weight assigned to AUGUSTUS evidence for EVM [1]
  -ew N, --external_weigth N
                        Weight assigned to external for EVM [1]
  -gw [N], --genemark_weigth [N]
                        Weight assigned to GENEMARK evidence for EVM [1]
  -tw [N], --trinity_weigth [N]
                        Weight assigned to Trinity mapped with GMAP evidence
                        for EVM [1]
  -pw [N], --pasa_weigth [N]
                        Weight assigned to PASA evidence for EVM [5]
  -aw [N], --exonerate_weigth [N]
                        Weight assigned to AAT protein evidence for EVM [1]
  -c [N], --segmentSize [N]
                        Segment size for EVM partitions [100000]
  -e [N], --overlap_size [N]
                        Overlap size for EVM partitions [10000]
  -g [N], --min_intron_length [N]
                        Minimal intron length for GMAP [9]
  -q [N], --max_intron_length [N]
                        Maximal intron length for GMAP, STAR and TRINITY
                        [1000]
  -ee [N], --end_exon [N]
                        Minimal length for end exon with GMAP [20]
  -cme [N], --cluster_min_evidence [N]
                        Minimal evidence needed to form a cluster [5]
  -cMe [N], --cluster_max_evidence [N]
                        Maximal evidence to form a cluster.Prevents the
                        clustering or rRNA genes i.e. [5000]
  -aol [N], --assembly_overlap_length [N]
                        Minimal length (in nt) of overlap for ASSEMBLY [200]
  -api [N], --assembly_percent_identity [N]
                        Minimal identity for the ASSEMBLY (95-100) [97]
  -art [F], --assembly_read_threshold [F]
                        Fraction of reads supporting an assembled UNITIG to
                        keep(0.1-1) [0.3]
  -v, --verbose         Prints out the commands used in LoReAn[FALSE]

Luigi Faino - luigi.faino@gmail.com; luigi.faino@uniroma1.it - October 2017

In addition, GeneMark-ES/ET v.4.48_3.60 is required. Because of its license, it has to be downloaded from the GeneMark website.

http://exon.gatech.edu/GeneMark/license_download.cgi

Copy the GeneMark key to the user's home directory.

zcat gm_key_64.gz > ~/.gm_key
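
Before starting a run that takes hours, it may be worth confirming that the key really is in place. The following is a minimal sketch; `check_gm_key` is a hypothetical helper name, not part of LoReAn or GeneMark, and it only checks that the file exists and is non-empty, not that the license is still valid.

```shell
# Hypothetical helper (not part of LoReAn or GeneMark).
# Succeeds only if the GeneMark key file exists and is not empty.
check_gm_key() {
    local key="${1:-$HOME/.gm_key}"   # default: the location used above
    if [ -s "$key" ]; then
        echo "GeneMark key found: $key"
    else
        echo "GeneMark key missing or empty: $key" >&2
        return 1
    fi
}
```

For example, `check_gm_key || exit 1` at the top of a job script stops the job early when the key was not copied.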

To use Docker or Singularity, download and extract two files.

#1 AUGUSTUS config files
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1ZzbbHXYGLGtScrpC3SmRGT0w2DWBNRaP' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1ZzbbHXYGLGtScrpC3SmRGT0w2DWBNRaP" -O ./config.augustus.tar.gz && rm -rf /tmp/cookies.txt && tar -zxvf config.augustus.tar.gz && rm config.augustus.tar.gz

#2 RepeatMasker library files
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt --keep-session-cookies --no-check-certificate 'https://docs.google.com/uc?export=download&id=1hEhzvyLDRTLPJM_f7pibq9E9X7ral5j0' -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1\n/p')&id=1hEhzvyLDRTLPJM_f7pibq9E9X7ral5j0" -O ./RepeatMasker.Libraries.tar.gz && rm -rf /tmp/cookies.txt && tar -zxvf RepeatMasker.Libraries.tar.gz && rm RepeatMasker.Libraries.tar.gz

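This creates config/ and Libraries/. These directories are specified when running LoReAn (with Docker, only config/ is used).

Usage

Protein sequences (from a closely related species), a species name for training, and the genome file are required.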

lorean -pr protein.fasta -sp <species name> genome.fasta

-pr Path to protein sequences FASTA file []
-sp Species name for AUGUSTUS training. No re-training if species already present in AUGUSTUS config folder
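
Since both required files are FASTA, a quick pre-flight check can catch a wrong path before the pipeline starts. This is only a sketch: `check_fasta` is a hypothetical helper, it assumes uncompressed plain-text files, and it checks nothing beyond the first character being `>`.

```shell
# Hypothetical pre-flight check (not part of LoReAn).
# Every file given must exist, be non-empty, and start with a '>' header.
check_fasta() {
    local f first
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f" >&2
            return 1
        fi
        first=$(head -c 1 "$f")       # first byte of the file
        if [ "$first" != ">" ]; then
            echo "not FASTA (no '>' header): $f" >&2
            return 1
        fi
    done
}
```

It could be used as `check_fasta protein.fasta genome.fasta && lorean -pr protein.fasta ...`, keeping the rest of the command as written above.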

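Results are written to LoReAn_annotation/.

To run with Docker, share the host's config/ directory with the corresponding directory inside the container, and share the host home directory that holds the GeneMark key with the home directory inside the container.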

sudo docker run -it --rm \
-v $PWD:/data \
-v $PWD/config:/opt/LoReAn/third_party/software/augustus/config/ \
-v $HOME:/home/lorean \
-u $(id -u ${USER}):$(id -g ${USER}) \
lfaino/lorean:latest lorean -pr protein.faa -sp <species name> genome.fna
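
One detail matters for the `-v` options: Docker treats a bare name on the left-hand side (one without a slash) as a named volume, not as a host directory, so the host config/ has to be given as an absolute path to be bind-mounted. The sketch below prints such a command for review and does not run it. `lorean_docker_cmd` and the species name `myspecies` are hypothetical; the image name and container paths are the ones used above. Paths containing spaces would need extra quoting.

```shell
# Hypothetical wrapper (not part of LoReAn): print, but do not run, a
# docker command whose bind mounts all use absolute host paths.
lorean_docker_cmd() {
    local config_dir="$1"    # host directory holding the AUGUSTUS config/
    shift                    # remaining arguments are passed to lorean
    case "$config_dir" in
        /*) ;;                                  # already absolute
        *)  config_dir="$PWD/$config_dir" ;;    # make a relative path absolute
    esac
    printf 'docker run -it --rm -v %s:/data -v %s:/opt/LoReAn/third_party/software/augustus/config/ -v %s:/home/lorean -u %s:%s lfaino/lorean:latest lorean %s\n' \
        "$PWD" "$config_dir" "$HOME" "$(id -u)" "$(id -g)" "$*"
}

# Example: show the command for a config/ directory in the current directory.
lorean_docker_cmd config -pr protein.faa -sp myspecies genome.fna
```

Printing first makes it easy to check the mounts before committing to a long run; the printed line can then be pasted into the shell, prefixed with sudo if needed.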

Example

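Use the provided example dataset.

The number of threads is set with the "-t" option. On a machine with 8 CPUs and 24 GB of RAM, the run takes several hours because of the BRAKER1 training (according to the GitHub page). Paired-end RNA-seq FASTQ files are given as a comma-separated pair with the "-sr" option, and long cDNA reads are given with "-lr".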

git clone https://github.com/lfaino/LoReAn_Example.git
cd LoReAn_Example/Crispa/
gunzip *gz
lorean.py -a -rp repeats.scaffold3.bed -sr scaffold3.short_1.fastq,scaffold3.short_2.fastq -lr scaffold3.long.fasta -pr scaffold3.prot.fasta -sp crispa scaffold3.fasta -d -f -mg -t 20 -kt

-sr Path to short reads FASTQ. If paired end, comma-separated (1-1.fq,1-2.fq). BAM sorted files are allowed; the extension of the file should be filename.sorted.bam
-lr Path to long reads FASTQ
-a FASTA file containing the adapter sequences. Adapter sequences in forward and reverse strain of the same adapter need to be used in the file
-rp GFF or GFF3 or GTF or BED file containing repeats coordinates
-d Run LoReAn on stranded mode [FALSE]
-mg Run RepeatScout and RepeatMasker on the genome fasta file [FALSE]
-kt Keep temporary files [FALSE]
-f Use this option for fungal species (used in Gene Mark-ES) [FALSE]
-t Number of threads [3]
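
The "-sr" value is easy to get wrong when it is assembled in a script, because the two paired-end files have to be joined by a comma with no space. A minimal sketch of a helper that builds the value follows; `short_reads_arg` is a hypothetical name and is not provided by LoReAn.

```shell
# Hypothetical helper (not part of LoReAn): build the value for -sr
# from one read file (single-end or sorted BAM) or two (paired-end).
short_reads_arg() {
    case $# in
        1) printf '%s\n' "$1" ;;             # single file, passed through
        2) printf '%s,%s\n' "$1" "$2" ;;     # paired-end: comma, no spaces
        *) echo "usage: short_reads_arg READS_1 [READS_2]" >&2
           return 1 ;;
    esac
}
```

With the example data this would be used as `-sr "$(short_reads_arg scaffold3.short_1.fastq scaffold3.short_2.fastq)"`, which expands to the same value as in the command above.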