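mitoVGP: the mitogenome assembly pipeline used in the Vertebrate Genomes Project

With modern sequencing technologies, assembling the relatively small mitochondrial genome should be straightforward. However, few tools directly address mitochondrial assembly. As part of the Vertebrate Genomes Project (VGP), the authors developed mitoVGP, a fully automated pipeline for similarity-based identification of mitochondrial reads and de novo assembly of mitochondrial genomes. The pipeline successfully assembled complete mitogenomes for 100 vertebrate species. The choice of tissue type and library size was found to have a significant impact on mitogenome sequencing and assembly. Comparing the mitoVGP assemblies with presumed-complete reference mitogenomes based on short-read sequencing identified errors, missing sequences, and incomplete genes in the references, particularly in repeat regions. The mitoVGP assemblies also identified novel gene-region duplications, shedding new light on mitochondrial genome evolution and organization.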

VGP

Installation

Github

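# There is one pipeline for PacBio reads and one for ONT reads.
# Two of the dependencies are written in Python 2 and Python 3 respectively, so the pipeline is designed to be run with the matching virtual environment activated for each.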
git clone https://github.com/gf777/mitoVGP.git
cd mitoVGP/
tar -xvf canu-1.8.Linux-amd64.tar.xz
rm canu-1.8.Linux-amd64.tar.xz

# Create a virtual environment for each of PacBio and ONT
# PacBio
conda env create -f mitoVGP_conda_env_pacbio.yml
#ONT
conda env create -f mitoVGP_conda_env_ONT.yml
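After creating the environments, it can be worth confirming that both are registered before running anything. The following is a minimal sketch, not part of mitoVGP: `env_listed` is a hypothetical helper that reads the output of `conda env list` on standard input, and the environment names `mitoVGP_pacbio` and `mitoVGP_ONT` are taken from the `conda activate` commands used in this post.

```shell
# Hypothetical helper (not part of mitoVGP): succeed if the named environment
# appears in the first column of `conda env list` output read from stdin.
env_listed() {
  awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage (requires conda on PATH):
#   conda env list | env_listed mitoVGP_pacbio && echo "PacBio environment found"
#   conda env list | env_listed mitoVGP_ONT    && echo "ONT environment found"
```

Reading the listing from standard input keeps the helper independent of where conda is installed.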

> ./mitoVGP -h

Usage: './mitoVGP -s species -i species_ID -r reference -t threads'

mitoVGP is used for reference-guided de novo mitogenome assembly using a combination of long and short read data.
An existing reference from closely to distantly related species is used to identify mito-like reads in pacbio WGS data,
which are then employed in de novo genome assembly. The assembly is further polished using both long and short read data,
and linearized to start with the conventional Phenylalanine tRNA sequence.

Check the github page https://github.com/GiulioF1/mitoVGP for a description of the pipeline.
A complete Conda environment with all dependencies is available to run the pipeline in the same github page.

This script a simple wrapper of the scripts found in the scripts/ folder. You can find more information
on each step in the help (-h) of each script.

Required arguments are:
-a long read sequencing platform (Pacbio/ONT)
-s the species name (e.g. Calypte_anna)
-i the VGP species ID (e.g. bCalAnn1)
-r the reference sequence fasta file
-t the number of threads

Optional arguments are:
-g the putative mitogenome size (potentially, that of the reference genome). If not provided, length of reference is used.
It does not need to be precise. Accepts Canu formatting.
-d multithreaded download of files (true/false default: false) !! caution: true may require considerable amount of space.
-1 use pacbio/nanopore reads from list of files, requires absolute path (default looks into aws)
-2 use PE illumina reads from list of files (with fw and rv reads including R1/R2 in their names), requires absolute path (default looks into aws)
-m the aligner (blasr|minimap2|pbmm2). Default is pbmm2
-f filter reads by size prior to assembly (reduces the number of NUMT reads and helps the assembly)
-p filter reads by percent coverage of the reference over their length (avoid noise in the assembly when low coverage)
-o the options for Canu
-v picard validation stringency (STRICT/LENIENT default: STRICT)
-z increase sensitivity of mummer overlap detection
-b use gcpp or variantCaller during arrow polishing for 2.0 or earlier chemistry respectively (gcpp/variantCaller default: gcpp)

Running the pipeline

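PacBio. The variant caller passed with "-b" depends on the PacBio chemistry version: use "gcpp" for chemistry 2.0 and "variantCaller" for chemistries earlier than 2.0. For RSII data, "blasr" can also be selected; in the help output above, blasr is listed as an aligner for "-m" rather than as a value for "-b".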

conda activate mitoVGP_pacbio
./mitoVGP -a pacbio -s <species_name> -i <VGP_species_ID> -r ref_mtDNA.fasta -t 12 -b variantCaller

-a long read sequencing platform (Pacbio/ONT)
-s the species name (e.g. Calypte_anna)
-i the VGP species ID (e.g. bCalAnn1)
-r the reference sequence fasta file
-t the number of threads
-b use gcpp or variantCaller during arrow polishing for 2.0 or earlier chemistry respectively (gcpp/variantCaller default: gcpp)
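The choice of "-b" described above can be wrapped in a small shell helper. This is only a sketch under stated assumptions: `pick_polisher` and `mitovgp_pacbio_cmd` are hypothetical names that are not part of mitoVGP, and the chemistry is assumed to be passed either as `2.0` or as the literal word `earlier`. The helper prints the command instead of running it, so the result can be inspected before launching the pipeline.

```shell
# Hypothetical helper (not part of mitoVGP): map PacBio chemistry to a -b value.
# Per the help text: gcpp for chemistry 2.0, variantCaller for earlier chemistry.
pick_polisher() {
  case "$1" in
    2.0)     echo "gcpp" ;;
    earlier) echo "variantCaller" ;;
    *)       echo "unknown chemistry: $1 (use 2.0 or earlier)" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print (do not run) a PacBio mitoVGP command line.
# Arguments: species name, VGP species ID, reference FASTA, threads, chemistry.
mitovgp_pacbio_cmd() {
  polisher=$(pick_polisher "$5") || return 1
  echo "./mitoVGP -a pacbio -s $1 -i $2 -r $3 -t $4 -b $polisher"
}

# Example: print the command for the species named in the help text.
mitovgp_pacbio_cmd Calypte_anna bCalAnn1 ref_mtDNA.fasta 12 earlier
```

Any chemistry value other than the two assumed ones is rejected, because the help text only describes those two cases.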

ONT

conda activate mitoVGP_ONT
./mitoVGP -a ONT -s <species_name> -i fMasArm1 -r ref_mtDNA.fasta -t 12 -b variantCaller
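Both invocations take the reference through "-r", so a quick pre-flight check of that file can catch a wrong path or a non-FASTA file before the pipeline starts. The sketch below is an illustration only: `check_reference` is a hypothetical helper, not part of mitoVGP, and it checks nothing beyond basic FASTA layout (a header line starting with `>` followed by at least one sequence line).

```shell
# Hypothetical pre-flight check (not part of mitoVGP) for the -r reference file.
# Succeeds if the file is readable, its first line is a FASTA header, and it
# contains at least one sequence line; prints the number of FASTA records.
check_reference() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 1; }
  first=""
  IFS= read -r first < "$1" || true
  case "$first" in
    '>'*) ;;
    *) echo "$1 does not start with a FASTA header" >&2; return 1 ;;
  esac
  awk '!/^>/ && /[A-Za-z]/ { ok = 1; exit } END { exit !ok }' "$1" \
    || { echo "$1 has no sequence lines" >&2; return 1; }
  echo "$(grep -c '^>' "$1") record(s) in $1"
}

# Usage:
#   check_reference ref_mtDNA.fasta && ./mitoVGP -a ONT ... -r ref_mtDNA.fasta
```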

Mitogenome assemblies produced by mitoVGP are also available for some species in GenomeArk (GenomeArk is the database of genomes and annotation information created by the VGP).

Citation

Complete vertebrate mitogenomes reveal widespread gene duplications and repeats

Giulio Formenti, Arang Rhie, Jennifer Balacco, Bettina Haase, Jacquelyn Mountcastle, Olivier Fedrigo, Samara Brown, Marco Capodiferro, Farooq O. Al-Ajli, Roberto Ambrosini, Peter Houde, Sergey Koren, Karen Oliver, Michelle Smith, Jason Skelton, Emma Betteridge, Jale Dolucan, Craig Corton, Iliana Bista, James Torrance, Alan Tracey, Jonathan Wood, Marcela Uliano-Silva, Kerstin Howe, Shane McCarthy, Sylke Winkler, Woori Kwak, Jonas Korlach, Arkarachai Fungtammasan, Daniel Fordham, Vania Costa, Simon Mayes, Matteo Chiara, David S. Horner, Eugene Myers, Richard Durbin, Alessandro Achilli, Edward L. Braun, Adam M. Phillippy, Erich D. Jarvis, The Vertebrate Genomes Project Consortium

bioRxiv, Posted July 01, 2020
