Liftoff: transferring annotation from a reference genome to a target genome

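Improvements in DNA sequencing technology and computational methods have led to a substantial increase in the number of high-quality genome assemblies for many species. Understanding the biology of these genomes requires annotation of genes and other functional elements, yet for most species only the reference genome is well annotated. Here the authors describe Liftoff, a new genome annotation lift-over tool that can map genes between two assemblies of the same or closely related species. Liftoff aligns genes from a reference genome to a target genome and finds the mapping that maximizes sequence identity while preserving the structure of each exon, transcript, and gene. The authors show that Liftoff can accurately map 99.9% of genes between two versions of the human reference genome (GRCh37 and GRCh38) with an average sequence identity above 99.9%. They also show that Liftoff can map genes across species by successfully lifting over more than 98.4% of human protein-coding genes to a chimpanzee genome assembly with 98.7% sequence identity.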

My new paper with @StevenSalzberg1 is out in Bioinformatics (online as of Dec. 15th)! Liftoff is a tool we created to lift gene annotations over from one genome assembly to another https://t.co/EQ1JWT2MPn
— Alaina Shumate (@alaina_shumate) January 12, 2021

The 20th birthday of the Human Genome Project seems like a great day for my latest pre-print with @StevenSalzberg1 to come out! Liftoff is an open-source tool to map genome annotation from one genome to another. https://t.co/xCgwviG8nt
— Alaina Shumate (@alaina_shumate) June 26, 2020

Do you need a robust system for mapping genome annotation from one genome to another? One that doesn't require you to somehow first align the genomes base-by-base? Check out Liftoff, by @alaina_shumate, and our new preprint https://t.co/PeuEyC3NXC
— Steven Salzberg (@StevenSalzberg1) June 26, 2020

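Because Liftoff aligns only the gene sequences rather than the whole genome, it can lift genes over even when the two genomes differ by many structural rearrangements. For each gene, Liftoff finds the exon alignments that maximize sequence identity while preserving the transcript and gene structure. If two genes are incorrectly mapped to overlapping loci, Liftoff determines which of them is more likely to be mis-mapped and attempts to re-map it. Liftoff can also find additional gene copies that are present in the target assembly but not annotated in the reference.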
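The overlap check described above can be illustrated with a small toy sketch. This is not Liftoff's actual implementation: the `Mapping` record, the function names, and the rule of flagging the gene with the lower sequence identity are all assumptions made for illustration. The text only says that Liftoff decides which gene is more likely to be mis-mapped, not how.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Mapping:
    """Hypothetical record of where one gene landed in the target genome."""
    gene: str
    seqid: str
    start: int       # 1-based, inclusive
    end: int         # 1-based, inclusive
    identity: float  # fraction of matching bases, 0-1


def overlaps(a: Mapping, b: Mapping) -> bool:
    """True if both mappings share at least one base on the same sequence."""
    return a.seqid == b.seqid and a.start <= b.end and b.start <= a.end


def genes_to_remap(mappings):
    """Return the names of genes flagged for re-mapping.

    Assumption: within each overlapping pair, the mapping with the lower
    sequence identity is treated as the likely mis-mapping.
    """
    flagged = set()
    for a, b in combinations(mappings, 2):
        if overlaps(a, b):
            worse = a if a.identity < b.identity else b
            flagged.add(worse.gene)
    return sorted(flagged)


# Made-up coordinates and identities, for illustration only.
example = [
    Mapping("geneA", "chr1", 1000, 2000, 0.99),
    Mapping("geneB", "chr1", 1500, 2500, 0.91),
    Mapping("geneC", "chr2", 1500, 2500, 0.95),
]
print(genes_to_remap(example))  # prints ['geneB']
```

Here geneA and geneB collide on chr1, so the lower-identity geneB is flagged; geneC sits on a different sequence and is left alone.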

Installation

Dependencies

Liftoff requires Python3 and also depends on Minimap2.
python3>

conda install -c bioconda minimap2

Liftoff itself (GitHub)

git clone https://github.com/agshumate/Liftoff
cd Liftoff/
python setup.py install #install_requires=['numpy', 'biopython','gffutils', 'networkx','pysam']

$ python liftoff.py
usage: liftoff.py [-h] -t <target.fasta> -r <reference.fasta>
                  [-g <ref_annotation.gff>] [-chroms <chroms.txt>] [-p 1]
                  [-o <output.gff>] [-db DB] [-infer_transcripts]
                  [-u <unmapped_features.txt>] [-infer_genes] [-a 0.5]
                  [-s 0.5] [-unplaced <unplaced_seq_names.txt>] [-copies]
                  [-sc 1.0] [-m PATH] [-dir <intermediate_files_dir>]
liftoff.py: error: the following arguments are required: -t, -r

$ python liftoff.py -h
usage: liftoff.py [-h] -t <target.fasta> -r <reference.fasta>
                  [-g <ref_annotation.gff>] [-chroms <chroms.txt>] [-p 1]
                  [-o <output.gff>] [-db DB] [-infer_transcripts]
                  [-u <unmapped_features.txt>] [-infer_genes] [-a 0.5]
                  [-s 0.5] [-unplaced <unplaced_seq_names.txt>] [-copies]
                  [-sc 1.0] [-m PATH] [-dir <intermediate_files_dir>]

Lift features from one genome assembly to another

optional arguments:
  -h, --help            show this help message and exit
  -t <target.fasta>     target fasta genome to lift genes to
  -r <reference.fasta>  reference fasta genome to lift genes from
  -g <ref_annotation.gff>
                        annotation file to lift over in gff or gtf format
  -chroms <chroms.txt>  comma seperated file with corresponding chromosomes in
                        the reference,target sequences
  -p 1                  processes
  -o <output.gff>       output file
  -db DB                name of feature database. If none, -g argument must be
                        provided and a database will be built automatically
  -infer_transcripts    use if GTF file only includes exon/CDS features and
                        does not include transcripts/mRNA
  -u <unmapped_features.txt>
                        name of file to write unmapped features to
  -infer_genes          use if GTF file only includes transcripts, exon/CDS
                        features
  -a 0.5                minimum alignment coverage to consider a feature
                        mapped [0-1]
  -s 0.5                minimum sequence identity in child features (usually
                        exons/CDS) to consider a feature mapped [0-1]
  -unplaced <unplaced_seq_names.txt>
                        text file with name(s) of unplaced sequences to map
                        genes from after genes from chromosomes in chroms.txt
                        are mapped
  -copies               look for extra gene copies in the target genome
  -sc 1.0               with -copies, minimum sequence identity in exons/CDS
                        for which a gene is considered a copy. Must be greater
                        than -s
  -m PATH               Minimap2 path
  -dir <intermediate_files_dir>
                        name of directory to save intermediate fasta and SAM
                        files
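According to the help text, the `-chroms` option takes a comma-separated file that pairs each reference chromosome with its counterpart in the target. A minimal sketch for writing such a file is shown here. The sequence names and the `write_chroms_file` helper are hypothetical, and the one-pair-per-line layout (`reference,target`) is an assumption based on the wording of the help message.

```python
import csv
import os
import tempfile


def write_chroms_file(pairs, path):
    """Write one 'reference_name,target_name' pair per line."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for ref_name, target_name in pairs:
            writer.writerow([ref_name, target_name])
    return path


# Hypothetical sequence names; use the FASTA header names of your own genomes.
pairs = [("chr1", "scaffold_1"), ("chr2", "scaffold_2")]
chroms_path = os.path.join(tempfile.mkdtemp(), "chroms.txt")
write_chroms_file(pairs, chroms_path)

with open(chroms_path) as handle:
    chroms_text = handle.read()
print(chroms_text)
```

The resulting file would then be passed as `-chroms chroms.txt`.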

Usage

Specify the target genome FASTA file, the reference genome FASTA file, and the reference annotation GFF file.

python liftoff.py -t target.fa -r ref.fa -g ref.gff
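When the same command has to be run over many genomes, it can help to assemble it in a script. The following is a rough sketch that uses only the flags listed in the help output; the `build_liftoff_command` helper and the output file names (`target.gff`, `unmapped.txt`) are hypothetical. It only builds and prints the command, so it can be inspected before anything is executed.

```python
import shlex


def build_liftoff_command(target, reference, annotation, output=None,
                          unmapped=None, processes=1, min_coverage=0.5,
                          min_identity=0.5, copies=False, copy_identity=1.0):
    """Assemble a liftoff.py command line from options in the help text."""
    # The help text gives [0-1] as the range for -a and -s; the same range
    # is assumed here for -sc.
    for name, value in (("-a", min_coverage), ("-s", min_identity),
                        ("-sc", copy_identity)):
        if not 0 <= value <= 1:
            raise ValueError(f"{name} must be within [0, 1], got {value}")
    cmd = ["python", "liftoff.py", "-t", target, "-r", reference,
           "-g", annotation, "-p", str(processes),
           "-a", str(min_coverage), "-s", str(min_identity)]
    if output:
        cmd += ["-o", output]
    if unmapped:
        cmd += ["-u", unmapped]
    if copies:
        # The help text states that -sc must be greater than -s.
        if copy_identity <= min_identity:
            raise ValueError("-sc must be greater than -s")
        cmd += ["-copies", "-sc", str(copy_identity)]
    return cmd


cmd = build_liftoff_command("target.fa", "ref.fa", "ref.gff",
                            output="target.gff", unmapped="unmapped.txt",
                            processes=4)
print(shlex.join(cmd))
# To actually run it: import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems with unusual file names.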

Citation

Liftoff: an accurate gene annotation mapping tool

Alaina Shumate, Steven Salzberg

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.24.169680

Update, December 16, 2020

Liftoff: accurate mapping of gene annotations
Alaina Shumate, Steven L Salzberg
Bioinformatics, Published: 15 December 2020
