（プロテイン）レシプロカルベストヒットを抽出する getRBH.pl

　タンパク質配列の迅速な比較のためのソフトウェアの著者は、そのソフトウェアの速度を評価し、その結果をそのタスクのための最も一般的なソフトウェアと比較しているが、より特殊な用途、例えば、Reciprocal Best Hit（RBH）としてのオルソログの発見のように、そのソフトウェアを評価することは一般的ではない。ここでは、blastpよりも高速に動作するソフトウェアを用いてRBHの結果を比較することに焦点を当てた。すなわち、lastal、diamond、MMseqs2である。

　テストしたプログラムの中で、lastalは結果を出すのに最も時間がかからなかった。しかし、進化的に離れたゲノムを比較した場合、他のどのプログラムよりも結果は少なかった。blastpと最も類似したRBHの数を生成したプログラムはMMseqs2だった。このプログラムは、使用されたプログラムの中で最も低い誤差推定値を示した。diamondを用いた結果は、MMseqs2で得られた結果に非常に近く、diamondの方がはるかに高速に実行されていた。本結果は、テストしたプログラムの中で最も優れていたのは、"sensitive "オプションで実行された diamond であったことを示唆しており、実行にかかる時間はblastpの7%で、blastpよりも低いエラー率で結果が得られた。各シーケンス比較プログラムを使用した相互のベストヒットを得るためのプログラムは、https://github.com/Computational-conSequences/SequenceToolsに置かれている。

インストール

依存

使用するタンパク質相同性検索プログラム

Github

git clone https://github.com/Computational-conSequences/SequenceTools.git
cd SequenceTools/

> perl getdNdS.pl

$ perl getdNdS.pl

about:

This program produces files with dN/dS results

usage:

getdNdS.pl -q <queryFile> -t <targetFile> [options]

options:

-q query genome ID [GCF_000005845], required

-t target genome ID [GCF_000287335], required

-g directory with genome data (faa, cds|ffn),

default GenomeData

-w dN/dS estimate method [bayes|codeml|yn00], default bayes

-c codon freq estimate, 2: Goldman Yang, 5: Muse Gaut,

default 2 (Goldman Yang)

-p program for pairwise comparisons [blastp|diamond|mmseqs],

default: diamond

(CAUTION: mmseqs is not working yet)

-m directory for blastp|diamond|mmseqs results, default compRuns

-o output directory, default [method]-dNdS

requirements:

This program requires the blastp program, or blast results

for the query + target genomes in the compRuns directory,

or in the directory indicated with the '-m' option

The program also requires at least two files per genome:

1. A fasta file with protein sequences ending in ".faa"

2. A fasta file with corresponding coding sequences

ending in ".cds" or ".ffn"

The files can be compressed with gzip or bzip2

> perl getRBH.pl

if not -d I need both query and target files

NAME

getRBH.pl - compare proteomes and extract reciprocal best hits

SYNOPSIS

getRBH.pl -q [filename1.faa] -t [filename2.faa] [options]

EXAMPLES

getRBH.pl -q Genomes/GCF_000005845.faa.gz -t Genomes/GCF_000009045.faa.gz

getRBH.pl -d Genomes -o AllvsAllRBH

OPTIONS

-q query fasta file(s) or directory with fasta files, required. files can

be compressed with gzip or bzip2.

-t target fasta file(s) or directory with fasta files, required. files can

be compressed with gzip or bzip2.

-d directory with fasta files for all-vs-all comparisons, no default. If

set -q and -t will be ignored.

-o directory for reciprocal best hits, default: RBH.

-p program for pairwise comparisons [blastp|diamond|lastal|mmseqs],

default: diamond

-s sensitivity (only works for diamond and mmseqs):

F: low sensitivity: diamond: fast, mmseqs: -s 1

S: diamond: sensitive, mmseqs: -s 2

M: diamond: more-sensitive, mmseqs: -s 4

V: diamond: very-sensitive, mmseqs: -s 5.7

U: highest sensitivity: diamond: ultra-sensitive, mmseqs: -s 7

for mmseqs the option will also accept numbers between 1 and 7,

defaults: diamond: V; mmseqs: 5.7

-m directory for blastp|diamond|lastal|mmseqs results, default compRuns.

-c minimum coverage of shortest sequence [30 - 99], default 60.

-f maximum overlap between fused sequences [0 - 0.5], default 0.1 (10%) of

shortest sequence.

-e maximum e-value, default 1e-06

-a include aligned sequences in comparison results [T/F], default F. (not

available in lastal)

-r find RBH [T|F], default 'T', if 'F' the program will only run the

blastp|diamond|lastal|mmseqs comparisons

-k keep prior results [T|R|F], default 'T':

F: repeat both alignments and gathering of RBH

R: only repeat gathering of RBHs

T: keep prior alignments and prior RBHs

-z maximum number of targets to find with blastp|diamond|lastal|mmseqs

minimum of 50. Defaults to 1/7th of the sequences in the target file

-x number of CPUs to use, default: 4 (max: 12)

REQUIREMENTS

This program requires either appropriately formatted comparison results for

the query/target genomes, or blastp|diamond|lastal|mmseqs to compare

protein sequences

WARNING

The following software is missing:

1 lastal available at: http://last.cbrc.jp/

2 mmseqs available at: https://github.com/soedinglab/MMseqs2/releases

> perl calcANI.pl

$ perl calcANI.pl

about:

This program calculates Average Nucleotide Identity

(ANI) by several methods

usage: calcANI.pl -q <fnafile> -t <fnafile> [options]

options:

-q query file in fasta format, or directory with query

files in fasta format, required

(files can be compressed with gzip or bzip2)

-t target file in fasta format, or directory with target

files in fasta format, required

(files can be compressed with gzip or bzip2)

-d directory with fna files for all-vs-all, no default,

if set, -i and -t will be ignored

ANIx: our blast+ reinterpretation (megablast)

ANIp: our blast+ implementation of original method (slow)

ANIb: original method using legacy blast (slowest)

https://www.doi.org/10.1099/ijs.0.64483-0

ANIm: mummer method (using nucmer)

https://www.doi.org/10.1073/pnas.0906412106

ANIf: fastANI method

https://www.doi.org/10.1038/s41467-018-07641-9

ANIl: our lastal implementation

ANIu: our ublast implementation

-c fragment length (not used in ANIx or ANIm), minimum 1000,

maximum 3000 default 1020

-o output folder, default ResultsANI

-k keep prior result [T|F], default 'T'. If 'T' prior results

will be taken from appropriate files in the output folder

-x number of cpus to use, maximum 12, default 2

Warning:

The software for the following methods is missing:

1. ANIf: fastANI available at:

https://github.com/ParBLiSS/FastANI/releases/

2. ANIl: lastal available at:

http://last.cbrc.jp/

3. ANIm: nucmer available at:

https://github.com/mummer4/mummer/releases/

4. ANIu: usearch available at:

https://www.drive5.com/usearch/

実行方法

クエリのタンパク質ファイルとターゲットのタンパク質ファイルを指定する。相同性検索ツールは-pで、最小カバレッジは-cで、最大e-valueは-eで指定する。

perl getRBH.pl -q query_protein.faa -t target_protein.faa -o outdir -p diamond -s -c 60 -e 1e-06

-q query fasta file(s) or directory with fasta files, required. files can
be compressed with gzip or bzip2.
-t target fasta file(s) or directory with fasta files, required. files can
be compressed with gzip or bzip2.
-d directory with fasta files for all-vs-all comparisons, no default. If
set -q and -t will be ignored.
-o directory for reciprocal best hits, default: RBH.
-p program for pairwise comparisons [blastp|diamond|lastal|mmseqs],
default: diamond
-e maximum e-value, default 1e-06
-c minimum coverage of shortest sequence [30 - 99], default 60.
-s sensitivity (only works for diamond and mmseqs):

F: low sensitivity: diamond: fast, mmseqs: -s 1

S: diamond: sensitive, mmseqs: -s 2

M: diamond: more-sensitive, mmseqs: -s 4

V: diamond: very-sensitive, mmseqs: -s 5.7

U: highest sensitivity: diamond: ultra-sensitive, mmseqs: -s 7

for mmseqs the option will also accept numbers between 1 and 7,
defaults: diamond: V; mmseqs: 5.7