HiFiリードを使う複雑な植物オルガネラゲノムのde novoアセンブリツール Oatk

　植物オルガネラゲノム、特に複雑な繰り返し構造を持つ巨大なミトコンドリアゲノムは、アセンブリにとって大きな課題である。ロングリードシーケンス技術の登場は、完全長のゲノムを構築する画期的な機会を提供するが、代替構造を解決する問題は依然として残っている。ここでは、高精度ロングリードからの植物オルガネラゲノムアセンブリのための新しいツールを紹介する。本手法は、k-merベースのアセンブラを用いて迅速にアセンブリグラフを構築し、プロファイルHMM遺伝子データベースを統合してオルガネラ配列の確実なアノテーションを行い、アセンブリグラフを通して最適なサポートパスを見つける新しい検索手法を活用する。195種の植物について高品質なオルガネラアセンブリを記述し、他の方法よりも優れていることを示す。アセンブルされたゲノムは、構造の複雑さ、ヘテロプラスミー、オルガネラ間のDNA交換について多くの知見を与えてくれる。

インストール

Github

mamba create -n oatk python=3.11 -y
conda activate oatk
mamba install -c bioconda oatk -y

> oatk -h

Usage: oatk [options] <target.fa[stq][.gz]> [...]

Options:

Input/Output:

-o FILE prefix of output files [./oatk.asm]

-t INT number of threads [1]

-G using input FILE as assembly graph file instead of raw reads for Syncasm

-M run minicircle mode for small animal mitochondria or plasmid

-v INT verbose level [0]

--version show version number

Syncasm:

-k INT kmer size [1001]

-s INT smer size (no larger than 31) [31]

-c INT minimum kmer coverage [30]

-a FLOAT minimum arc coverage [0.35]

-D INT maximum amount of data to use; suffix K/M/G recognized [0]

--max-bubble INT maximum bubble size for assembly graph clean [100000]

--max-tip INT maximum tip size for assembly graph clean [10000]

--weak-cross FLOAT maximum relative edge coverage for weak crosslink clean [0.30]

--unzip-round INT maximum round of assembly graph unzipping [3]

--no-read-ec do not do read error correction

Annotation:

-m FILE mitochondria gene annotation HMM profile database [NULL]

-p FILE plastid gene annotation HMM profile database [NULL]

-b INT batch size [100000]

-T DIR temporary directory [NULL]

--nhmmscan STR nhmmscan executable path [nhmmscan]

Pathfinder:

-f FLOAT prefer circular path to longest if >= FLOAT sequence covered [0.90]

-S FLOAT minimum annotation score of a core gene [300.0]

-e FLOAT maximum E-value of a core gene [1.000e-06]

-g INT[,INT] minimum number of core gene gain; the second INT for mitochondria [3,1]

-l INT minimum length of a singleton sequence to keep [10000]

-q FLOAT minimum coverage of a sequence compared to the subgraph average [0.20]

-C INT maximum copy number to consider [10]

--include-trn include tRNA genes for sequence classification

--include-rrn include rRNA genes for sequence classification

--no-graph-clean do not do assembly graph clean

--edge-c-tag STR edge coverage tag in the GFA file [EC:i]

--kmer-c-tag STR kmer coverage tag in the GFA file [KC:i]

--seq-c-tag STR sequence coverage tag in the GFA file [SC:f]

Example: ./oatk -o oatk.asm -t 16 -m angiosperm_mito.fam -p angiosperm_pltd.fam hifi.fa.gz

> ./oatkdb

ERROR: requires at least TWO positional parameters

Program: oatkdb (build gene HMM profile database for an NCBI taxonomy)

Version: 1.0

Usage: oatkdb [options] ${taxid} [mitochondrion|chloroplast]

Optional:

-j|--jobs INT Number of parallel jobs to run (default 4).

-t|--threads INT Number of threads to use for each parallel job (default 2).

-p|--protein Use protein instead of nucleotide sequence alignment for coding genes.

-c|--codon INT NCBI genetic code table for translation (default 1).

-T|--tmpdir STR Temp file directory (default auto).

-f|--force Overwrite existing files.

-a|--alias STR Gene name alias file.

-o|--output STR Output HMM profile database file name prefix (default auto).

--max-ref INT Maximum number of NCBI reference sequences to download (default 1000000).

--min-seq INT Minimum number of sequences to keep a gene (default 5).

--max-seq INT Maximum number of sequences to use for HMM (default 10000).

--no-trna Do not build HMM profiles for tRNA genes.

--no-rrna Do not build HMM profiles for rRNA genes.

--incl-orf Include open reading frames.

--resume Resume a previously unfinished run.

--clean Clean temporary files when finished.

--test Run program in test mode.

--log STR Log file (default stdout).

-h|--help Print this help message.

-v|--version Print version number.

1. Refer to NCBI Taxonomy https://www.ncbi.nlm.nih.gov/taxonomy for taxid

2. Refer to webpage https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for code table selection

Commonly used genetic code tables include:

[1] The Standard Code

[2] The Vertebrate Mitochondrial Code

[5] The Invertebrate Mitochondrial Code

[11] The Bacterial, Archaeal and Plant Plastid Code

Example: oatkdb -j 4 -t 8 -c 11 -o angiosperms_pltd_v20241101 3398 chloroplast

> ./seqdb

ERROR: requires at least ONE positional parameter

Program: seqdb (build profile HMM for a gene from sequence file)

Version: 1.0

Usage: seqdb [options] sequence

Optional:

-n|--name STR Gene name to write to the HMM file (default auto).

-t|--threads INT Number of threads to use for each parallel job (default 8).

-p|--protein The input is protein sequences.

-c|--codon INT NCBI genetic code table for translation used for '-p' (default 1).

-T|--tmpdir STR Temp file directory (default auto).

-f|--force Overwrite existing files.

-o|--output STR Output HMM profile database file (default auto).

--min-seq INT Minimum number of sequences to keep a gene (default 5).

--max-seq INT Maximum number of sequences to use for HMM (default 10000).

--resume Resume a previously unfinished run.

--clean Clean temporary files when finished.

--log STR Log file (default stdout).

-h|--help Print this help message.

-v|--version Print version number.

1. Refer to NCBI Taxonomy https://www.ncbi.nlm.nih.gov/taxonomy for taxid

2. Refer to webpage https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi for code table selection

Commonly used genetic code tables include:

[1] The Standard Code

[2] The Vertebrate Mitochondrial Code

[5] The Invertebrate Mitochondrial Code

[11] The Bacterial, Archaeal and Plant Plastid Code

Example: seqdb -t 12 -c 11 -n rpoc2 -o ./rpoc2.hmm rpo2.fna

データベース

ローカルデータベースを構築するためのツールoatkdbを使う。NCBI tax. IDとオルガネラタイプ（ミトコンドリアまたは葉緑体）の2つのパラメータを指定する。oatkdbはソースリファレンス配列のGenBankファイルを解析し、タンパク質コード、rRNA、tRNAを含むコア遺伝子を（主にソース配列に見られる頻度によって）見つけ、対応する配列を抽出する。それから、配列長、内容、分岐を考慮し、外れ値／エラーになりやすい配列を除去するシーケンスクリーン、配列のマルチプルアラインメントを経て、プロファイルHMMを構築する。

https://github.com/c-zhou/OatkDB

#python、seqtk、hmmer3、Entrez E-utilities、mafft、GNU parallelが必要
conda activate oatk
mamba install -c bioconda seqtk hmmer entrez-direct mafft -y

git clone https://github.com/c-zhou/OatkDB.git
cd OatkDB/
oatkdb -j 4 -t 8 -c 11 -o angiosperms_pltd_v20230911 3398 chloroplast

テストラン

実行するには、DNA配列をDNAプロファイル・データベースと照合するnhmmscanとオルガネラ遺伝子プロファイルデータベース（上のデータベース参照）が必要。

exampleデータを使う。Darwin Tree of Lifeからの Arabidopsis thalianaのHiFiリードとなっている。

https://zenodo.org/records/10367917

oatkラッパーを使うと、3つのモジュール： syncasmによるHiFiリードアセンブリ、hmmannotによるHMMアノテーション、pathfinderによるオルガネラゲノム抽出、が連続して実行される。3つのモジュールを別々に実行することもできる。

oatk -k 1001 -c 30 -t 8 -m embryophyta_mito.fam -p embryophyta_pltd.fam -o ddAraThal4 ddAraThal4_organelle.hifi.fa.gz

出力例

MTとCTアセンブリの他、遺伝子アノテーションやアセンブリグラフファイルも含まれる。詳細はレポジトリ参照。

論文とレポジトリより

oatkプログラムは、MBG、Spades、Hifiasmなどから構築した独自のゲノムアセンブリグラフからオルガネラゲノムを抽出することもできる。これには、"-G"オプションを指定し、入力配列ファイルをGFAファイルに置き換える。しかしoatkはゲノム構造を解くために配列およびアークカバレッジ（特に配列カバレッジ）に強く依存しているため、多くのロングリードアセンブラ（例えばHifiasm）から構築されるオーバーラップ／ストリンググラフは、完全なオルガネラゲノム構築には理想的ではないことに注意すべきである。また、ロングリードアセンブラによってGFAファイルとアークカバレッジのタグ名が異なる点にも注意するべきである（レポジトリ参照）
データベースの選択について。陸上植物オルガネラについては、いくつかの下位分類群のデータベースがあるが、まず"embryophyta"から始めるのが推奨される。またOatkDBサイトの指示に従ってカスタムデータベースを構築することもできる。
例えば本研究で用いたk = 1,001の場合、データは約500倍に圧縮され、効率的なアセンブリグラフの構築が容易になる。
HifiasmやHiCanuなど、HiFiデータ用によく設計されたゲノムアセンブリツールは数多くあるが、オルガネラゲノムアセンブリには効果的に使われていない。これらのアセンブラは、オルガネラゲノムのような、カバレッジが数千倍から数万倍に達するような極めてカバレッジの高いデータには最適化されておらず、配列エラー修正に問題がある。さらに、オルガネラゲノムと核ゲノム間のデータカバレッジが非常にアンバランスであることももう一つの課題である。
Oatkは、オルガネラ配列同定のために、シード配列に依存するのではなく、プロファイルHMM遺伝子データベースを使用している。このアプローチはより感度の高い方法を提供する。

引用

Oatk: a de novo assembly tool for complex plant organelle genomes
Chenxi Zhou, Max Brown, Mark Blaxter, The Darwin Tree of Life Project Consortium, Shane A. McCarthy, and Richard Durbin

bioRxiv, posted October 28, 2024