MitoFinder: assembling mitochondrial genomes from short reads (requires a known mitochondrial genome as reference)

2020-09-29: title corrected

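Advances in high-throughput sequencing have made target-enrichment sequencing of nuclear ultraconserved DNA elements (UCEs) possible, so phylogenetic relationships can now be inferred routinely from thousands of genomic markers. It has recently been shown that mitochondrial DNA (mtDNA) is frequently sequenced alongside the targeted loci in such capture experiments. Yet despite its broad evolutionary interest, mtDNA is rarely assembled and used together with nuclear markers. Here the authors developed MitoFinder, a user-friendly bioinformatics pipeline that efficiently assembles and annotates mitogenomic data from hundreds of UCE libraries. As a case study they used ants (Formicidae), for which 501 UCE libraries have been sequenced but only 29 mitogenomes are available. They compared the efficiency of four assemblers (IDBA-UD, MEGAHIT, MetaSPAdes, and Trinity) for assembling both UCE and mtDNA loci. Using MitoFinder, they show that metagenomic assemblers, MetaSPAdes in particular, are well suited to assembling both UCEs and mtDNA. Mitogenomic signal was successfully extracted from all 501 UCE libraries, which allowed species identification to be confirmed by CO1 barcoding. In addition, the automated procedure retrieved 296 cases in which the mitochondrial genome was assembled into a single contig, increasing the number of available ant mitogenomes by an order of magnitude. By harnessing the power of metagenomic assemblers, MitoFinder provides an efficient tool for extracting complementary mitogenomic data from UCE libraries, which makes it possible to test for potential mitonuclear discordance. The approach is also potentially applicable to other sequence-capture methods, transcriptomic data, and whole-genome shotgun sequencing in diverse taxa. The MitoFinder software is available from GitHub (https://github.com/RemiAllio/MitoFinder).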

Conceptualization of the pipeline. Reproduced from the paper.

MitoFinder is a pipeline that assembles mitochondrial genomes and annotates mitochondrial genes from Illumina short reads. It is also designed to find and annotate mitochondrial sequences in an existing genome assembly.
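The second mode, annotating an existing assembly, can be sketched as a dry run that only prints the command it would launch. The -j, -a, -r and -o flags are the ones listed in the tool's help; the function name, sample ID and file names are hypothetical, and a FASTA assembly file is assumed.

```shell
#!/bin/sh
# Dry-run sketch: print (do not execute) a MitoFinder command that
# annotates an existing assembly. All names below are hypothetical.
annotate_assembly_cmd() {
    seqid=$1      # -j: sequence ID used throughout the run
    assembly=$2   # -a: your own assembly (assumed to be FASTA)
    refseq=$3     # -r: reference mitogenome(s) in GenBank format
    code=$4       # -o: NCBI genetic code number
    printf 'mitofinder -j %s -a %s -r %s -o %s\n' \
        "$seqid" "$assembly" "$refseq" "$code"
}

annotate_assembly_cmd ant_sample assembly.fasta reference.gb 5
```

Once the printed line looks right, it can be pasted into the terminal or piped to `sh`.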

Installation

Tested on Ubuntu 18.04 LTS.

GitHub

git clone https://github.com/RemiAllio/MitoFinder.git
cd MitoFinder
./install.sh
# See the GitHub repository for other installation methods

$ ./mitofinder -help

usage: mitofinder [-h] [--megahit] [--idba] [--metaspades] [-j PROCESSNAME]
                  [-1 PE1] [-2 PE2] [-s SE] [-a ASSEMBLY] [-m MEM]
                  [-l SHORTESTCONTIG] [-p PROCESSORSTOUSE] [-r REFSEQFILE]
                  [-e BLASTEVAL] [-n NWALK] [--override] [--adjust-direction]
                  [--ignore] [--new-genes] [--allow-intron] [--numt]
                  [--intron-size INTRONSIZE] [--max-contig MAXCONTIG]
                  [--cds-merge] [--out-gb] [--contig-size CONTIGSIZE]
                  [--rename-contig RENAME]
                  [--blast-identity-nucl BLASTIDENTITYNUCL]
                  [--blast-identity-prot BLASTIDENTITYPROT]
                  [--blast-size ALIGNCUTOFF] [--circular-size CIRCULARSIZE]
                  [--circular-offset CIRCULAROFFSET] [-o ORGANISMTYPE] [-v]
                  [--example] [--citation]

Mitofinder is a pipeline to assemble and annotate mitochondrial DNA from
trimmed sequencing reads.

optional arguments:
  -h, --help            show this help message and exit
  --megahit             Use Megahit for assembly. (Default)
  --idba                Use IDBA-UD for assembly.
  --metaspades          Use MetaSPAdes for assembly.
  -j PROCESSNAME, --seqid PROCESSNAME
                        Sequence ID to be used throughout the process
  -1 PE1, --Paired-end1 PE1
                        File with forward paired-end reads
  -2 PE2, --Paired-end2 PE2
                        File with reverse paired-end reads
  -s SE, --Single-end SE
                        File with single-end reads
  -a ASSEMBLY, --assembly ASSEMBLY
                        File with your own assembly
  -m MEM, --max-memory MEM
                        max memory to use in Go (MEGAHIT or MetaSPAdes)
  -l SHORTESTCONTIG, --length SHORTESTCONTIG
                        Shortest contig length to be used (MEGAHIT). Default =
                        100
  -p PROCESSORSTOUSE, --processors PROCESSORSTOUSE
                        Number of threads Mitofinder will use at most.
  -r REFSEQFILE, --refseq REFSEQFILE
                        Reference mitochondrial genome in GenBank format
                        (.gb).
  -e BLASTEVAL, --blast-eval BLASTEVAL
                        e-value of blast program used for contig
                        identification and annotation. Default = 0.00001
  -n NWALK, --nwalk NWALK
                        Maximum number of codon steps to be tested on each
                        size of the gene to find the start and stop codon
                        during the annotation step. Default = 5 (30 bases)
  --override            This option forces MitoFinder to override the previous
                        output directory for the selected assembler.
  --adjust-direction    This option tells MitoFinder to adjust the direction
                        of selected contig(s) (given the reference).
  --ignore              This option tells MitoFinder to ignore the non-
                        standart mitochondrial genes.
  --new-genes           This option tells MitoFinder to try to annotate the
                        non-standard animal mitochondrial genes (e.g. rps3 in
                        fungi). If several references are used, make sure the
                        non-standard genes have the same names in the several
                        references
  --allow-intron        This option tells MitoFinder to search for genes with
                        introns. Recommendation : Use it on mitochondrial
                        contigs previously found with MitoFinder without this
                        option.
  --numt                This option tells MitoFinder to search for both
                        mitochondrial genes and NUMTs. Recommendation : Use it
                        on nuclear contigs previously found with MitoFinder
                        without this option.
  --intron-size INTRONSIZE
                        Size of intron allowed. Default = 5000 bp
  --max-contig MAXCONTIG
                        Maximum number of contigs matching to the reference to
                        keep. Default = 0 (unlimited)
  --cds-merge           This option tells MitoFinder to not merge the exons in
                        the NT and AA fasta files.
  --out-gb              Do not create annotation output file in GenBank
                        format.
  --contig-size CONTIGSIZE
                        Minimum size of a contig to be considered. Default =
                        1000
  --rename-contig RENAME
                        "yes/no" If "yes", the contigs matching the
                        reference(s) are renamed. Default is "yes" for de novo
                        assembly and "no" for existing assembly (-a option)
  --blast-identity-nucl BLASTIDENTITYNUCL
                        Nucleotide identity percentage for a hit to be
                        retained. Default = 50
  --blast-identity-prot BLASTIDENTITYPROT
                        Amino acid identity percentage for a hit to be
                        retained. Default = 40
  --blast-size ALIGNCUTOFF
                        Percentage of overlap in blast best hit to be
                        retained. Default = 30
  --circular-size CIRCULARSIZE
                        Size to consider when checking for circularization.
                        Default = 45
  --circular-offset CIRCULAROFFSET
                        Offset from start and finish to consider when looking
                        for circularization. Default = 200
  -o ORGANISMTYPE, --organism ORGANISMTYPE
                        Organism genetic code following NCBI table (integer):
                        1. The Standard Code 2. The Vertebrate Mitochondrial
                        Code 3. The Yeast Mitochondrial Code 4. The Mold,
                        Protozoan, and Coelenterate Mitochondrial Code and the
                        Mycoplasma/Spiroplasma Code 5. The Invertebrate
                        Mitochondrial Code 6. The Ciliate, Dasycladacean and
                        Hexamita Nuclear Code 9. The Echinoderm and Flatworm
                        Mitochondrial Code 10. The Euplotid Nuclear Code 11.
                        The Bacterial, Archaeal and Plant Plastid Code 12. The
                        Alternative Yeast Nuclear Code 13. The Ascidian
                        Mitochondrial Code 14. The Alternative Flatworm
                        Mitochondrial Code 16. Chlorophycean Mitochondrial
                        Code 21. Trematode Mitochondrial Code 22. Scenedesmus
                        obliquus Mitochondrial Code 23. Thraustochytrium
                        Mitochondrial Code 24. Pterobranchia Mitochondrial
                        Code 25. Candidate Division SR1 and Gracilibacteria
                        Code
  -v, --version         Version 1.3
  --example             Print getting started examples
  --citation            How to cite MitoFinder
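The value passed to -o is just the number from the NCBI table printed in the help. As a small sketch, a lookup for a few common animal groups might look like the function below. The group labels and the function name are hypothetical; the numbers are taken from the help text, and only a handful of codes are covered.

```shell
#!/bin/sh
# Map a coarse organism group (hypothetical labels) to the NCBI genetic
# code number expected by -o. Numbers follow the table in the help output.
genetic_code_for() {
    case "$1" in
        vertebrate)   echo 2 ;;    # The Vertebrate Mitochondrial Code
        yeast)        echo 3 ;;    # The Yeast Mitochondrial Code
        invertebrate) echo 5 ;;    # The Invertebrate Mitochondrial Code
        echinoderm)   echo 9 ;;    # The Echinoderm and Flatworm Mitochondrial Code
        ascidian)     echo 13 ;;   # The Ascidian Mitochondrial Code
        *)  echo "unknown group: $1" >&2
            return 1 ;;
    esac
}

# Ants, the case study in the paper, are insects and use code 5.
genetic_code_for invertebrate
```

Groups not listed here should be looked up in the full table rather than guessed.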

Usage

Specify the FASTQ files and a reference file in GenBank format.

mitofinder --megahit -j [seqid] -1 pair1.fastq.gz -2 pair2.fastq.gz -r genbank_reference.gb -o 1 -p 12 -m [memory]

You can choose the assembler with one of the following options:
--megahit     default; faster
--metaspades  recommended; a bit slower but more efficient (see the associated paper). WARNING: not compatible with single-end reads
--idba

The other options used in the command:
-j  Sequence ID to be used throughout the process
-r  Reference mitochondrial genome in GenBank format (a file containing at least one reference mitochondrial genome extracted from NCBI)
-p  Maximum number of threads MitoFinder will use
-m  Maximum memory to use, in Go (gigaoctets, i.e. GB); applies to MEGAHIT and MetaSPAdes
-o  Organism genetic code, given as the integer from the NCBI table:
    1   The Standard Code
    2   The Vertebrate Mitochondrial Code
    3   The Yeast Mitochondrial Code
    4   The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
    5   The Invertebrate Mitochondrial Code
    6   The Ciliate, Dasycladacean and Hexamita Nuclear Code
    9   The Echinoderm and Flatworm Mitochondrial Code
    10  The Euplotid Nuclear Code
    11  The Bacterial, Archaeal and Plant Plastid Code
    12  The Alternative Yeast Nuclear Code
    13  The Ascidian Mitochondrial Code
    14  The Alternative Flatworm Mitochondrial Code
    16  Chlorophycean Mitochondrial Code
    21  Trematode Mitochondrial Code
    22  Scenedesmus obliquus Mitochondrial Code
    23  Thraustochytrium Mitochondrial Code
    24  Pterobranchia Mitochondrial Code
    25  Candidate Division SR1 and Gracilibacteria Code
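Because MetaSPAdes is recommended but cannot take single-end reads, the assembler choice can be tied to the read layout. The sketch below is a dry run that only prints the command. It uses only flags listed in the help; the function name, sample IDs, file names, and the thread and memory values (12 threads, 16 Go) are hypothetical placeholders.

```shell
#!/bin/sh
# Dry-run sketch: pick an assembler from the read layout and print the
# MitoFinder command instead of running it. All concrete values are
# hypothetical placeholders.
build_mitofinder_cmd() {
    seqid=$1    # -j: sequence ID
    refseq=$2   # -r: reference mitogenome(s), GenBank format
    code=$3     # -o: NCBI genetic code number
    shift 3     # the remaining arguments are the read files
    common="-j $seqid -r $refseq -o $code -p 12 -m 16"
    case $# in
        2)  # paired-end reads: use the recommended MetaSPAdes
            echo "mitofinder --metaspades $common -1 $1 -2 $2" ;;
        1)  # single-end reads: MetaSPAdes is not compatible, use MEGAHIT
            echo "mitofinder --megahit $common -s $1" ;;
        *)  echo "expected one or two read files" >&2
            return 1 ;;
    esac
}

build_mitofinder_cmd ant01 reference.gb 5 ant01_R1.fastq.gz ant01_R2.fastq.gz
build_mitofinder_cmd ant02 reference.gb 5 ant02_SE.fastq.gz
```

Printing the command first makes it easy to review the flags before committing to a long assembly run.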