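MetaEuk is a toolkit for high-throughput, reference-based discovery and annotation of protein-coding genes in eukaryotic metagenomic contigs. It runs a fast search with six-frame-translated fragments that cover all possible exons, then optimally combines the matches into multi-exon proteins. A benchmark on seven diverse annotated genomes showed that MetaEuk remains highly sensitive even when sequence similarity to the reference database is low. To demonstrate MetaEuk's ability to discover novel eukaryotic proteins in large-scale metagenomic data, the authors assembled contigs from 912 samples of the Tara Oceans project. MetaEuk predicted more than 12 million protein-coding genes in 8 days on ten 16-core servers. Most of the discovered proteins are highly diverged from known proteins and originate from very sparsely sampled eukaryotic supergroups.
Tested on Ubuntu 18.04 (CPU: Xeon Platinum).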
# static build sse4.1
wget https://mmseqs.com/metaeuk/metaeuk-linux-sse41.tar.gz
tar xvfz metaeuk-linux-sse41.tar.gz
export PATH=$(pwd)/metaeuk/bin/:$PATH
# static build AVX2
wget https://mmseqs.com/metaeuk/metaeuk-linux-avx2.tar.gz
tar xvfz metaeuk-linux-avx2.tar.gz
export PATH=$(pwd)/metaeuk/bin/:$PATH
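Only one of the two static builds is needed, and the choice depends on the instruction sets the CPU supports. The following is a minimal sketch of that choice, not part of MetaEuk itself: the function name `pick_metaeuk_build` is hypothetical, and it assumes a Linux system where `/proc/cpuinfo` lists CPU flags such as `avx2` and `sse4_1`. It only prints the URL of the matching archive from the two listed above.

```shell
# Hypothetical helper (not shipped with MetaEuk): choose a static build
# from a CPU flags string. Pass the flags explicitly, or omit the argument
# to read them from /proc/cpuinfo (assumes Linux).
pick_metaeuk_build() {
  flags="${1-$(grep -m1 '^flags' /proc/cpuinfo 2>/dev/null)}"
  # Pad with spaces so that each flag is matched as a whole word.
  case " $flags " in
    *" avx2 "*)   echo "https://mmseqs.com/metaeuk/metaeuk-linux-avx2.tar.gz" ;;
    *" sse4_1 "*) echo "https://mmseqs.com/metaeuk/metaeuk-linux-sse41.tar.gz" ;;
    *)            echo "no matching static build for this CPU" >&2; return 1 ;;
  esac
}
```

Called without arguments, as in wget "$(pick_metaeuk_build)", it would download the AVX2 build when the CPU supports it and fall back to the SSE4.1 build otherwise.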
> metaeuk
MetaEuk is homology-based strategy to efficiently query many contigs assembled from metagenomic samples against a comprehensive protein/profile target database to describe their protein repertoire. It does not require preliminary binning of the contigs and makes no assumption concerning the splicing signal when searching for multi-exon proteins.
Please cite:
Levy Karin E, Mirdita M, Soding J: MetaEuk — sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics. Microbiome (2020) 8:48.
metaeuk Version: e7e2d95f454105e5d4aa40bc221a8b18fdc1ce41
© Eli Levy Karin, eli.levy.karin@gmail.com
usage: metaeuk <command> [<args>]
Main workflows for database input/output
predictexons Call optimal exon sets based on protein similarity
easy-predict Predict protein-coding genes from contigs (fasta/database) based on similarities to targets (fasta/database) and return a fasta of the predictions in a single step
taxtocontig Assign taxonomic labels to predictions and aggregate them per contig
reduceredundancy Cluster metaeuk calls that share an exon and select representative prediction
unitesetstofasta Create a fasta output from optimal exon sets
groupstoacc Create a TSV output from representative prediction to member
> metaeuk easy-predict
usage: metaeuk easy-predict <i:contigs> <i:targets> <o:predictionsFasta> <tmpDir> [options]
-s FLOAT Sensitivity: 1.0 faster; 4.0 fast; 7.5 sensitive [4.000]
--max-seqs INT Maximum results per query sequence allowed to pass the prefilter (affects sensitivity) [300]
-a BOOL Add backtrace string (convert to alignments with mmseqs convertalis module) [0]
--alignment-mode INT How to compute the alignment:
0: automatic
1: only score and end_pos
2: also start_pos and cov
3: also seq.id
4: only ungapped alignment [2]
-e FLOAT List matches below this E-value (range 0.0-inf) [100.000]
--min-seq-id FLOAT List matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]
--min-aln-len INT Minimum alignment length (range 0-INT_MAX) [0]
--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]
--alt-ali INT Show up to this many alternative alignments [0]
-c FLOAT List matches above this fraction of aligned (covered) residues (see --cov-mode) [0.000]
--cov-mode INT 0: coverage of query and target
1: coverage of target
2: coverage of query
3: target seq. length has to be at least x% of query length
4: query seq. length has to be at least x% of target length
5: short seq. needs to be at least x% of the other seq. length [0]
--max-rejected INT Maximum rejected alignments before alignment calculation for a query is stopped [2147483647]
--max-accept INT Maximum accepted alignments before alignment calculation for a query is stopped [2147483647]
--e-profile FLOAT Include sequences matches with < e-value thr. into the profile (>=0.0) [0.001]
--num-iterations INT Number of iterative profile search iterations [1]
--rescore-mode INT Rescore diagonals with:
0: Hamming distance
1: local alignment (score only)
2: local alignment
3: global alignment
4: longest alignment fullfilling window quality criterion [0]
--allow-deletion BOOL Allow deletions in a MSA [0]
--min-length INT Minimum codon number in open reading frames [15]
--max-length INT Maximum codon number in open reading frames [32734]
--max-gaps INT Maximum number of codons with gaps or unknown residues before an open reading frame is rejected [2147483647]
--contig-start-mode INT Contig start can be 0: incomplete, 1: complete, 2: both [2]
--contig-end-mode INT Contig end can be 0: incomplete, 1: complete, 2: both [2]
--orf-start-mode INT Orf fragment can be 0: from start to stop, 1: from any to stop, 2: from last encountered start to stop (no start in the middle) [1]
--forward-frames STR Comma-seperated list of frames on the forward strand to be extracted [1,2,3]
--reverse-frames STR Comma-seperated list of frames on the reverse strand to be extracted [1,2,3]
--translate INT Translate ORF to amino acid [0]
--use-all-table-starts BOOL Use all alteratives for a start codon in the genetic table, if false - only ATG (AUG) [0]
--id-offset INT Numeric ids in index file are offset by this value [0]
--add-orf-stop BOOL Add stop codon '*' at complete start and end [0]
--search-type INT Search type 0: auto 1: amino acid, 2: translated, 3: nucleotide, 4: translated nucleotide alignment [0]
--start-sens FLOAT Start sensitivity [4.000]
--sens-steps INT Number of search steps performed from --start-sens to -s [1]
--metaeuk-eval FLOAT maximal combined evalue of an optimal set [0.0, inf] [0.001]
--metaeuk-tcov FLOAT minimal length ratio of combined set to target [0.0, 1.0] [0.500]
--max-intron INT Maximal allowed intron length [10000]
--min-intron INT Minimal allowed intron length [15]
--min-exon-aa INT Minimal allowed exon length in amino acids [11]
--max-overlap INT Maximal allowed overlap of consecutive exons in amino acids [10]
--set-gap-open INT Gap open penalty (negative) for missed target amino acids between exons [-1]
--set-gap-extend INT Gap extend penalty (negative) for missed target amino acids between exons [-1]
--overlap INT allow predictions to overlap another on the same strand. when not allowed (default), only the prediction with better E-value will be retained [0,1] [0]
--protein INT translate the joint exons coding sequence to amino acids [0,1] [0]
--target-key INT write the target key (internal DB identifier) instead of its accession. By default (0) target accession will be written [0,1] [0]
--reverse-fragments INT reverse AA fragments to compute under null [0,1] [0]
--threads INT Number of CPU-cores used (all by default) [56]
--compressed INT Write compressed output [0]
-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
Combines the following MetaEuk modules into a single step: predictexons, reduceredundancy and unitesetstofasta
- Levy Karin E, Mirdita M, Soeding J: MetaEuk – sensitive, high-throughput gene discovery and annotation for large-scale eukaryotic metagenomics. biorxiv, 851964 (2019).
metaeuk createdb database_proteins.faa database
metaeuk easy-predict query_nucleotides.fasta database predsResults tempFolder
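The two commands above can be wrapped so that the options are visible before anything runs. This is a rough sketch under stated assumptions: the function name `metaeuk_plan`, the `_targetDB` suffix, and the default of 8 threads are all made up for illustration. The only options used are `--threads` and `--protein`, both taken from the help output above. The function prints the commands (a dry run) and does not execute them.

```shell
# Hypothetical wrapper: prints the createdb and easy-predict commands
# instead of running them. Names and defaults are illustrative assumptions.
metaeuk_plan() {
  proteins="${1-}"; contigs="${2-}"; out="${3-}"; tmp="${4-}"
  threads="${5:-8}"   # assumed default; MetaEuk itself uses all cores
  if [ -z "$proteins" ] || [ -z "$contigs" ] || [ -z "$out" ] || [ -z "$tmp" ]; then
    echo "usage: metaeuk_plan <proteins.faa> <contigs.fasta> <outPrefix> <tmpDir> [threads]" >&2
    return 1
  fi
  # Step 1: build the target database from the reference proteins.
  echo "metaeuk createdb $proteins ${out}_targetDB"
  # Step 2: predict genes; --protein 1 asks for amino-acid output.
  echo "metaeuk easy-predict $contigs ${out}_targetDB $out $tmp --threads $threads --protein 1"
}
```

For example, metaeuk_plan database_proteins.faa query_nucleotides.fasta predsResults tempFolder 16 prints both commands for review; once they look right, the output can be piped to sh. Paths containing spaces are not handled by this sketch.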
MetaEuk can also assign taxonomic labels to the predicted proteins and aggregate those labels per contig. See the GitHub repository for the procedure.
MetaEuk—sensitive, high-throughput gene discovery, and annotation for large-scale eukaryotic metagenomics
Eli Levy Karin, Milot Mirdita & Johannes Söding
Microbiome volume 8, Article number: 48 (2020)