compleasm: improving the accuracy and speed of genome-based BUSCO assessment by using miniprot

2023/07/01 Renamed the tool from miniBUSCO to compleasm

2023/09/29 Added the paper citation

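Evaluating the completeness of a genome assembly is important for assessing the accuracy and reliability of genomic data. An incomplete assembly can lead to errors in gene prediction, annotation, and other downstream analyses. BUSCO is one of the most widely used tools for assessing assembly completeness; it does so by checking for the presence of single-copy orthologs that are conserved across a wide range of taxa. However, BUSCO's runtime can be long, especially for some large genome assemblies, which makes it difficult for researchers to iterate quickly on an assembly or to analyze a large number of assemblies. Here we introduce compleasm, an efficient tool for assessing genome assembly completeness. compleasm uses the protein-to-genome aligner miniprot together with BUSCO's datasets of conserved orthologous genes. On a real human assembly, compleasm achieved a 14-fold speedup over BUSCO. Furthermore, compleasm reported a completeness of 99.6%, which is more accurate than BUSCO's 95.7% and closely matches the 99.5% annotation completeness of T2T-CHM13.

Installation

GitHub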

mamba create -n compleasm -c conda-forge -c bioconda compleasm -y
conda activate compleasm
compleasm -h

> compleasm -h

usage: compleasm [-h] {download,list,miniprot,analyze,run} ...

compleasm

positional arguments:
  {download,list,miniprot,analyze,run}
                        compleasm modules help
    download            Download specified BUSCO lineages
    list                List local or remote BUSCO lineages
    miniprot            Run miniprot alignment
    analyze             Evaluate genome completeness from provided miniprot
                        alignment
    run                 Run compleasm including miniprot alignment and
                        completeness evaluation

optional arguments:
  -h, --help            show this help message and exit

> compleasm run -h

usage: compleasm run [-h] -a ASSEMBLY_PATH -o OUTPUT_DIR [-t THREADS]
                     [-l LINEAGE] [-L LIBRARY_PATH] [-m {lite,busco}]
                     [--specified_contigs SPECIFIED_CONTIGS [SPECIFIED_CONTIGS ...]]
                     [--outs OUTS]
                     [--miniprot_execute_path MINIPROT_EXECUTE_PATH]
                     [--hmmsearch_execute_path HMMSEARCH_EXECUTE_PATH]
                     [--autolineage] [--sepp_execute_path SEPP_EXECUTE_PATH]
                     [--min_diff MIN_DIFF] [--min_identity MIN_IDENTITY]
                     [--min_length_percent MIN_LENGTH_PERCENT]
                     [--min_complete MIN_COMPLETE] [--min_rise MIN_RISE]

optional arguments:
  -h, --help            show this help message and exit
  -a ASSEMBLY_PATH, --assembly_path ASSEMBLY_PATH
                        Input genome file in FASTA format.
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        The output folder.
  -t THREADS, --threads THREADS
                        Number of threads to use
  -l LINEAGE, --lineage LINEAGE
                        Specify the name of the BUSCO lineage to be used.
                        (e.g. eukaryota, primates, saccharomycetes etc.)
  -L LIBRARY_PATH, --library_path LIBRARY_PATH
                        Folder path to download lineages or already downloaded
                        lineages. If not specified, a folder named
                        "mb_downloads" will be created on the current running
                        path by default to store the downloaded lineage files.
  -m {lite,busco}, --mode {lite,busco}
                        The mode of evaluation. dafault mode: busco. lite:
                        Without using hmmsearch to filtering protein
                        alignment. busco: Using hmmsearch on all candidate
                        protein alignment to purify the miniprot alignment to
                        imporve accuracy.
  --specified_contigs SPECIFIED_CONTIGS [SPECIFIED_CONTIGS ...]
                        Specify the contigs to be evaluated, e.g. chr1 chr2
                        chr3. If not specified, all contigs will be evaluated.
  --outs OUTS           output if score at least FLOAT*bestScore [0.99]
  --miniprot_execute_path MINIPROT_EXECUTE_PATH
                        Path to miniprot executable
  --hmmsearch_execute_path HMMSEARCH_EXECUTE_PATH
                        Path to hmmsearch executable
  --autolineage         Automatically search for the best matching lineage
                        without specifying lineage file.
  --sepp_execute_path SEPP_EXECUTE_PATH
                        Path to run_sepp.py executable
  --min_diff MIN_DIFF   The thresholds for the best matching and second best
                        matching.
  --min_identity MIN_IDENTITY
                        The identity threshold for valid mapping results.
  --min_length_percent MIN_LENGTH_PERCENT
                        The fraction of protein for valid mapping results.
  --min_complete MIN_COMPLETE
                        The length threshold for complete gene.
  --min_rise MIN_RISE   Minimum length threshold to make dupicate take
                        precedence over single or fragmented over
                        single/duplicate.

> compleasm download -h

usage: compleasm download [-h] [-L LIBRARY_PATH] lineages [lineages ...]

positional arguments:
  lineages              Specify the names of the BUSCO lineages to be
                        downloaded. (e.g. eukaryota, primates, saccharomycetes
                        etc.)

optional arguments:
  -h, --help            show this help message and exit
  -L LIBRARY_PATH, --library_path LIBRARY_PATH
                        The destination folder to store the downloaded lineage
                        files.If not specified, a folder named "mb_downloads"
                        will be created on the current running path.

> compleasm list -h

usage: compleasm list [-h] [--remote] [--local] [-L LIBRARY_PATH]

optional arguments:
  -h, --help            show this help message and exit
  --remote              List remote BUSCO lineages
  --local               List local BUSCO lineages
  -L LIBRARY_PATH, --library_path LIBRARY_PATH
                        Folder path to stored lineages.

> compleasm analyze -h

usage: compleasm analyze [-h] -g GFF -l LINEAGE -o OUTPUT_DIR [-t THREADS]
                         [-L LIBRARY_PATH] [-m {lite,busco}]
                         [--hmmsearch_execute_path HMMSEARCH_EXECUTE_PATH]
                         [--specified_contigs SPECIFIED_CONTIGS [SPECIFIED_CONTIGS ...]]
                         [--min_diff MIN_DIFF] [--min_identity MIN_IDENTITY]
                         [--min_length_percent MIN_LENGTH_PERCENT]
                         [--min_complete MIN_COMPLETE] [--min_rise MIN_RISE]

optional arguments:
  -h, --help            show this help message and exit
  -g GFF, --gff GFF     Miniprot output gff file
  -l LINEAGE, --lineage LINEAGE
                        BUSCO lineage name
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output analysis folder
  -t THREADS, --threads THREADS
                        Number of threads to use
  -L LIBRARY_PATH, --library_path LIBRARY_PATH
                        Folder path to stored lineages.
  -m {lite,busco}, --mode {lite,busco}
                        The mode of evaluation. dafault mode: busco. lite:
                        Without using hmmsearch to filtering protein
                        alignment. busco: Using hmmsearch on all candidate
                        protein alignment to purify the miniprot alignment to
                        imporve accuracy.
  --hmmsearch_execute_path HMMSEARCH_EXECUTE_PATH
                        Path to hmmsearch executable
  --specified_contigs SPECIFIED_CONTIGS [SPECIFIED_CONTIGS ...]
                        Specify the contigs to be evaluated, e.g. chr1 chr2
                        chr3. If not specified, all contigs will be evaluated.
  --min_diff MIN_DIFF   The thresholds for the best matching and second best
                        matching.
  --min_identity MIN_IDENTITY
                        The identity threshold for valid mapping results. [0,
  --min_length_percent MIN_LENGTH_PERCENT
                        The fraction of protein for valid mapping results.
  --min_complete MIN_COMPLETE
                        The length threshold for complete gene.
  --min_rise MIN_RISE   Minimum length threshold to make dupicate take
                        precedence over single or fragmented over
                        single/duplicate.

> compleasm miniprot -h

usage: compleasm miniprot [-h] -a ASSEMBLY -p PROTEIN -o OUTDIR [-t THREADS]
                          [--outs OUTS]
                          [--miniprot_execute_path MINIPROT_EXECUTE_PATH]

optional arguments:
  -h, --help            show this help message and exit
  -a ASSEMBLY, --assembly ASSEMBLY
                        Input genome file in FASTA format
  -p PROTEIN, --protein PROTEIN
                        Input protein file
  -o OUTDIR, --outdir OUTDIR
                        Miniprot alignment output directory
  -t THREADS, --threads THREADS
                        Number of threads to use
  --outs OUTS           output if score at least FLOAT*bestScore [0.95]
  --miniprot_execute_path MINIPROT_EXECUTE_PATH
                        Path to miniprot executable

Usage

Specify the FASTA file of the genome assembly. Adding "--autolineage" makes compleasm pick the best-matching lineage automatically.

compleasm run --autolineage -a hg38.fa -o hs38-mb -t 12

-a Input genome file in FASTA format.
-o The output folder.
-t Number of threads to use
-l Specify the name of the BUSCO lineage to be used. (e.g. eukaryota, primates, saccharomycetes etc.)
--autolineage Automatically search for the best matching lineage without specifying lineage file.

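The protein sequences in the selected lineage file are aligned to the genome with miniprot, and the resulting miniprot alignments are then analyzed to evaluate genome completeness.

Output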
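When several assemblies need checking, the same command can be wrapped in a small loop. This is only a minimal sketch and not part of compleasm itself: the function name `run_all` and its argument order are made up for this illustration, and it uses only the flags shown in the help output above.

```shell
# Hypothetical helper: run compleasm with --autolineage on each FASTA file.
# Usage: run_all OUTPUT_ROOT THREADS assembly1.fa [assembly2.fa ...]
# Each result goes to OUTPUT_ROOT/<file name without its last extension>.
run_all() {
  outroot=$1
  threads=$2
  shift 2
  for fa in "$@"; do
    name=$(basename "$fa")
    name=${name%.*}   # strip the last extension, e.g. hg38.fa -> hg38
    compleasm run --autolineage -a "$fa" -o "$outroot/$name" -t "$threads"
  done
}
```

For example, `run_all results 12 hg38.fa chm13.fa` would write to results/hg38 and results/chm13 (the file names here are placeholders).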

hs38-mb/

summary.txt
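To pull a single completeness figure out of summary.txt for scripting, something like the following could work. The file layout is an assumption on my part, since the contents are not reproduced in this post: the sketch expects a `## lineage: <name>` header followed by lines such as `S:<percent>%, <count>` (single-copy) and `D:<percent>%, <count>` (duplicated), with D coming after S. The function name `summary_complete_pct` is hypothetical. Check it against your own summary.txt before relying on it.

```shell
# Hypothetical helper: print "<lineage><TAB><S% + D%>" for each lineage
# section of a summary.txt.
# ASSUMED layout (verify against your file):
#   ## lineage: <name>
#   S:<percent>%, <count>
#   D:<percent>%, <count>
summary_complete_pct() {
  awk '
    /^## lineage:/ { lineage = $3 }
    /^S:/ { split($1, a, "[%:]"); s = a[2] }
    /^D:/ { split($1, a, "[%:]"); d = a[2]
            printf "%s\t%.2f\n", lineage, s + d }
  ' "$1"
}
```

Calling `summary_complete_pct hs38-mb/summary.txt` prints one line per lineage section, which is convenient when --autolineage reports more than one lineage.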

To use a lineage directory that has already been downloaded, specify it with the -L option.

compleasm run -a genome.fasta -o output_dir -l eukaryota -t 12 -L /path/to/lineage

-L Folder path to download lineages or already downloaded lineages. If not specified, a folder named "mb_downloads" will be created on the current running path by default to store the downloaded lineage files.

With the "--specified_contigs" option, you can evaluate only specific contig sequences.

compleasm run -a genome.fasta -o output_dir -l eukaryota -t 12 --specified_contigs chr1 chr2 chr3 chr4 chr5 chr6 chr7 chr8 chr9 chr10 chr11 chr12 chr13 chr14 chr15 chr16 chr17 chr18 chr19 chr20 chr21 chr22

--specified_contigs Specify the contigs to be evaluated, e.g. chr1 chr2 chr3. If not specified, all contigs will be evaluated.
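Typing out chr1 to chr22 by hand is error-prone, so the list can be generated instead. A minimal sketch, assuming the contigs in the FASTA really are named chr1, chr2, and so on; `chr_names` and `run_autosomes` are names invented for this example.

```shell
# Hypothetical helper: print "chr1 chr2 ... chrN" on one line.
chr_names() {
  n=$1
  i=1
  names=""
  while [ "$i" -le "$n" ]; do
    names="$names${names:+ }chr$i"   # add a space only between names
    i=$((i + 1))
  done
  printf '%s\n' "$names"
}

# Hypothetical wrapper: evaluate only chr1..chr22.
# Usage: run_autosomes ASSEMBLY OUTPUT_DIR LINEAGE THREADS
run_autosomes() {
  # $(chr_names 22) is left unquoted on purpose so that each name
  # becomes a separate argument to --specified_contigs.
  compleasm run -a "$1" -o "$2" -l "$3" -t "$4" --specified_contigs $(chr_names 22)
}
```

`run_autosomes genome.fasta output_dir eukaryota 12` then expands to the same command line as the one written out above.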

The miniprot module runs the miniprot alignment and writes a GFF file.

compleasm miniprot -a genome.fasta -p protein.faa -o output_dir -t 12
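The alignment step can be chained with the analyze module, whose options appear in the help output above, to split a run into two stages. This is a rough sketch under stated assumptions: `align_then_analyze` is a made-up name, and because this post does not show the name of the GFF file that the miniprot module writes, its path is passed in by the caller rather than guessed.

```shell
# Hypothetical two-stage run: miniprot alignment, then completeness analysis.
# Usage: align_then_analyze ASSEMBLY PROTEINS GFF LINEAGE LIBRARY WORKDIR [THREADS]
#   GFF is the path of the GFF written by the first step; check the
#   alignment output directory for the actual file name.
align_then_analyze() {
  asm=$1
  proteins=$2
  gff=$3
  lineage=$4
  lib=$5
  workdir=$6
  threads=${7:-1}
  # The analysis only starts if the alignment step succeeded.
  compleasm miniprot -a "$asm" -p "$proteins" -o "$workdir/miniprot" -t "$threads" &&
  compleasm analyze -g "$gff" -l "$lineage" -o "$workdir/analysis" -t "$threads" -L "$lib"
}
```

Keeping the two stages separate means the slow alignment can be reused while the analysis is repeated, for instance with a different -m mode.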

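Other notes

The analyze module directly analyzes a given miniprot alignment result to evaluate genome completeness.
The download module only downloads the lineage database.
Because compleasm has limited sensitivity to remote homologs, the paper recommends combining the results of compleasm and BUSCO for more reliable conclusions when the assembly is distant from the BUSCO database.
Table 1 of the preprint compares BUSCO and compleasm. It shows that the original BUSCO is less accurate on the human genome in particular.

In the repository, the tool has been renamed from miniBUSCO to compleasm. This blog post has been updated to use the name compleasm as well.

Citation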
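As a concrete use of the download module, lineages can be fetched once into a shared folder and then reused by later runs through -L. A minimal sketch using only options from the help output above; the function names `fetch_lineages` and `run_with_library` are made up for this example.

```shell
# Hypothetical helper: download one or more lineages into a shared folder.
# Usage: fetch_lineages LIBRARY_DIR lineage1 [lineage2 ...]
fetch_lineages() {
  lib=$1
  shift
  compleasm download "$@" -L "$lib"
}

# Hypothetical helper: run compleasm against the already-downloaded lineages.
# Usage: run_with_library ASSEMBLY OUTPUT_DIR LINEAGE THREADS LIBRARY_DIR
run_with_library() {
  compleasm run -a "$1" -o "$2" -l "$3" -t "$4" -L "$5"
}
```

For example, `fetch_lineages /path/to/lineage eukaryota primates` followed by `run_with_library genome.fasta output_dir primates 12 /path/to/lineage`. `compleasm list --local -L /path/to/lineage` can be used to check what is already in the folder.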

miniBUSCO: a faster and more accurate reimplementation of BUSCO

Neng Huang, Heng Li

bioRxiv, Posted June 06, 2023

2023/09/29 Added the paper citation

compleasm: a faster and more accurate reimplementation of BUSCO
Neng Huang, Heng Li
Bioinformatics, Published: 27 September 2023