2020-09-11

Average amino acid identity（AAI）を計算する CompareM

2020 10/5 追記

2020 10/9 修正

CompareMは大規模な比較ゲノム解析をサポートするソフトウェアツールキットである。ゲノムのセット（アミノ酸の同一性など）と個々のゲノム（コドン使用率など）の統計を提供する。数千ゲノムへのスケーラビリティを可能にするために、並列化された実装が提供されている。一般的なワークフローは、ユーザーが簡単に導入できるように単一のメソッドとして提供され、経験豊富なユーザーが特定の機能を利用できるようにより詳細なインターフェースが提供されている。CompareMはオープンソースであり、GNU General Public License (Version 3)の下でリリースされている。

　CompareMには２つの共通のワークフローがある。1つ目はゲノム間のペアワイズAAIを計算するためのもので(--aai_wf)、2つ目はリファレンスゲノムのセットに対してAAIの値を計算してクエリゲノムを分類するものである(--classify_wf)。これらのワークフローはCompareMが提供するより詳細なコマンドのいくつかを呼び出している。具体的には、aai_wf は call_genes, similarity, and aai コマンドを実行し、classify_wf は call_genes, similarity, and classify コマンドを実行する。CompareM は Prodigal を使用してゲノムのセットそれぞれの遺伝子をコールするが、ユーザーが提供したタンパク質配列セットを使用することもできる。

ユーザーガイド

https://github.com/dparks1134/CompareM/blob/master/users_guide.pdf

インストール

yamlファイルが3.6以上になっているので、conda mambaでpython3.6の仮想環境を作ってテストした（ macos10.14で試したが、エラーメッセージが出ないまま正しい数値が出力されない危険なエラーが起きた。ubuntu18.04で動作確認した）。

依存

CompareM makes use of the numpy, scipy, matplotlib, and biolib python packages
prodigal >= 2.6.2
diamond >= 0.9.0

Github

#bioconda (link)
mamba create -n Comparem -y python=3.6
conda activate Comparem
mamba install -c bioconda comparem -y

> comparem

$ comparem

...::: CompareM v0.1.1 :::...

Common workflows:

aai_wf -> Calculate AAI between all pairs of genomes

(runs call_genes => similarity => aai)

classify_wf -> Identify similar genomes based on AAI values

(runs call_genes => similarity => classify)

Gene prediction:

call_genes -> Identify genes within genomes

Gene homology and genome similarity:

similarity -> Perform reciprocal sequence similarity search between proteins

aai -> Calculate AAI between all pairs of genomes

classify -> Identify similar genomes based on AAI value

Usage profiles:

aa_usage -> Calculate amino acid usage within each genome

codon_usage -> Calculate codon usage within each genome

kmer_usage -> Calculate kmer usage within each genome

stop_usage -> Calculate stop codon usage within each genome

Lateral gene transfer:

lgt_di -> Calculate dinuceotide (3rd,1st) usage of genes to identify putative LGT events

lgt_codon -> Calculate codon usage of genes to identify putative LGT events

Visualization and exploration:

diss -> Calculate the dissimilarity between usage profiles

hclust -> Perform hierarchical clustering

Use: comparem <command> -h for command specific help.

Feature requests or bug reports can be sent to Donovan Parks (donovan.parks@gmail.com)

or posted on GitHub (https://github.com/dparks1134/comparem).

> comparem aai_wf -h

$ comparem aai_wf -h

usage: comparem aai_wf [-h] [-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]

[-x FILE_EXT] [--proteins] [--force_table FORCE_TABLE]

[--blastp] [--sensitive] [--keep_headers] [--keep_rbhs]

[--tmp_dir TMP_DIR] [-c CPUS] [--silent]

input_files output_dir

Calculate AAI between all pairs of genomes

positional arguments:

input_files genome files

output_dir output directory

optional arguments:

-h, --help show this help message and exit

-e, --evalue EVALUE e-value cutoff for identifying initial blast hits

(default: 0.001)

-p, --per_identity PER_IDENTITY

percent identity for defining homology (default: 30.0)

-a, --per_aln_len PER_ALN_LEN

percent alignment length of query sequence for

defining homology (default: 70.0)

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

--proteins indicates the input files contain protein sequences

--force_table FORCE_TABLE

force use of specific translation table

--blastp use blastp instead of diamond

--sensitive use sensitive mode of DIAMOND

--keep_headers indicates FASTA headers already have the format

<genome_id>~<gene_id>

--keep_rbhs create file with reciprocal best hits

--tmp_dir TMP_DIR specify alternative directory for temporary files

(default:

/var/folders/v0/rdv6fcx11fz1dy2jjvthmx1r0000gn/T)

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem classify_wf -h

$ comparem classify_wf -h

husage: comparem classify_wf [-h] [-k NUM_TOP_TARGETS] [-t TAXONOMY_FILE]

[-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]

[-x FILE_EXT] [--proteins]

[--force_table FORCE_TABLE] [--blastp]

[--sensitive] [--keep_headers] [--keep_rbhs]

[--tmp_dir TMP_DIR] [-c CPUS] [--silent]

query_files target_files output_dir

Identify similar genomes based on AAI value.

positional arguments:

query_files query genome files

target_files target genome files

output_dir output directory

optional arguments:

-h, --help show this help message and exit

-k, --num_top_targets NUM_TOP_TARGETS

number of top scoring target genomes to report per

query genome (default: 1)

-t, --taxonomy_file TAXONOMY_FILE

file indicating taxonomic identification of all target

genomes

-e, --evalue EVALUE e-value cutoff for identifying initial blast hits

(default: 0.001)

-p, --per_identity PER_IDENTITY

percent identity for defining homology (default: 30.0)

-a, --per_aln_len PER_ALN_LEN

percent alignment length of query sequence for

defining homology (default: 70.0)

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

--proteins indicates the input files contain protein sequences

--force_table FORCE_TABLE

force use of specific translation table

--blastp use blastp instead of diamond

--sensitive use sensitive mode of DIAMOND

--keep_headers indicates FASTA headers already have the format

<genome_id>~<gene_id>

--keep_rbhs create file with reciprocal best hits

--tmp_dir TMP_DIR specify alternative directory for temporary files

(default:

/var/folders/v0/rdv6fcx11fz1dy2jjvthmx1r0000gn/T)

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem call_genes -h

$ comparem call_genes -h

usage: comparem call_genes [-h] [-x FILE_EXT] [--force_table FORCE_TABLE]

[-c CPUS] [--silent]

input_genomes output_dir

Identify genes within genomes.

positional arguments:

input_genomes genome files to process

output_dir output directory

optional arguments:

-h, --help show this help message and exit

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

--force_table FORCE_TABLE

force use of specific translation table

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem similarity -h

$ comparem similarity -h

usage: comparem similarity [-h] [-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]

[-x FILE_EXT] [--blastp] [--sensitive]

[--keep_headers] [--tmp_dir TMP_DIR] [-c CPUS]

[--silent]

query_proteins target_proteins output_dir

Perform sequence similarity search between genes.

positional arguments:

query_proteins query protein files to process

target_proteins target protein files to process

output_dir output directory

optional arguments:

-h, --help show this help message and exit

-e, --evalue EVALUE maximum e-value for reporting an alignments (default:

0.001)

-p, --per_identity PER_IDENTITY

minimum percent identity for reporting an alignment

(default: 30.0)

-a, --per_aln_len PER_ALN_LEN

minimum percent coverage of query sequence for

reporting an alignment (default: 70.0)

-x, --file_ext FILE_EXT

extension of files to process (default: faa)

--blastp use Blastp-fast instead of DIAMOND

--sensitive use sensitive mode of DIAMOND

--keep_headers indicates FASTA headers already have the format

<genome_id>~<gene_id>

--tmp_dir TMP_DIR specify alternative directory for temporary files

(default:

/var/folders/v0/rdv6fcx11fz1dy2jjvthmx1r0000gn/T)

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem aai -h

$ comparem aai -h

usage: comparem aai [-h] [-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]

[--keep_rbhs] [-c CPUS] [--silent]

query_gene_file sorted_hit_table output_dir

Calculate the AAI between all genome pairs.

positional arguments:

query_gene_file file with all query genes

sorted_hit_table sorted file indicating genes passing sequence

similarity criteria

output_dir output directory

optional arguments:

-h, --help show this help message and exit

-e, --evalue EVALUE maximum e-value for reporting an alignments (default:

0.001)

-p, --per_identity PER_IDENTITY

minimum percent identity for reporting an alignment

(default: 30.0)

-a, --per_aln_len PER_ALN_LEN

minimum percent coverage of query sequence for

reporting an alignment (default: 70.0)

--keep_rbhs create file with reciprocal best hits

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem classify -h

$ comparem classify -h

usage: comparem classify [-h] [-k NUM_TOP_TARGETS] [-t TAXONOMY_FILE]

[-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]

[-x FILE_EXT] [--keep_rbhs] [-c CPUS] [--silent]

query_gene_file target_gene_file sorted_hit_table

output_dir

Identify similar genomes based on AAI value.

positional arguments:

query_gene_file file with all query genes

target_gene_file file with all target genes

sorted_hit_table sorted file indicating genes passing sequence

similarity criteria

output_dir output directory

optional arguments:

-h, --help show this help message and exit

-k, --num_top_targets NUM_TOP_TARGETS

number of top scoring target genomes to report per

query genome (default: 1)

-t, --taxonomy_file TAXONOMY_FILE

file indicating taxonomic identification of all target

genomes

-e, --evalue EVALUE e-value cutoff for identifying initial blast hits

(default: 0.001)

-p, --per_identity PER_IDENTITY

percent identity for defining homology (default: 30.0)

-a, --per_aln_len PER_ALN_LEN

percent alignment length of query sequence for

defining homology (default: 70.0)

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

--keep_rbhs create file with reciprocal best hits

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem aa_usage –h

$ comparem aa_usage –h

usage: comparem aa_usage [-h] [--counts] [-x FILE_EXT] [-c CPUS] [--silent]

protein_gene_files output_file

comparem aa_usage: error: the following arguments are required: output_file

> comparem codon_usage -h

$ comparem codon_usage -h

usage: comparem codon_usage [-h] [--counts] [-x FILE_EXT] [--keep_ambiguous]

[-c CPUS] [--silent]

nucleotide_gene_files output_file

Calculate codon usage within each genome.

positional arguments:

nucleotide_gene_files

input files with genes in nucleotide space

output_file output file indicating codon usage of each genome

optional arguments:

-h, --help show this help message and exit

--counts output raw counts instead of frequencies

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

--keep_ambiguous keep codons with ambiguous bases

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem kmer_usage -h

$ comparem kmer_usage -h

usage: comparem kmer_usage [-h] [--counts] [-k K] [-x FILE_EXT] [-c CPUS]

[--silent]

genome_files output_file

Calculate kmer usage within each genome.

positional arguments:

genome_files input files with genomes in nucleotide space

output_file output file indicating kmer usage of each genome

optional arguments:

-h, --help show this help message and exit

--counts output raw counts instead of frequencies

-k K length of kmers, e.g., 4 -> tetranucleotides (default:

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem stop_usage -h

$ comparem stop_usage -h

usage: comparem stop_usage [-h] [--counts] [--mean_gene_length] [-x FILE_EXT]

[-c CPUS] [--silent]

nucleotide_gene_files output_file

Calculate stop codon usage within each genome.

positional arguments:

nucleotide_gene_files

input files with genes in nucleotide space

output_file output file indicating stop codon usage of each genome

optional arguments:

-h, --help show this help message and exit

--counts output raw counts instead of frequencies

--mean_gene_length report mean gene length for genes with each stop codon

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem lgt_di -h

$ comparem lgt_di -h

usage: comparem lgt_di [-h] [-x FILE_EXT] [--crit_value CRIT_VALUE] [-c CPUS]

[--silent]

nucleotide_gene_files output_dir

Calculate dinuceotide (3rd,1st) usage of genes to identify putative LGT

events.

positional arguments:

nucleotide_gene_files

input files with genes in nucleotide space

output_dir output directory to write dinucleotide usage for each

gene in each genome

optional arguments:

-h, --help show this help message and exit

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

--crit_value CRIT_VALUE

critical value for defining deviant genes (default:

0.001)

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem lgt_codon -h

$ comparem lgt_codon -h

usage: comparem lgt_codon [-h] [-x FILE_EXT] [-c CPUS] [--silent]

nucleotide_gene_files output_dir

Calculate codon usage of genes to identify putative LGT events.

positional arguments:

nucleotide_gene_files

input files with genes in nucleotide space

output_dir output directory to write dinucleotide usage for each

gene in each genome

optional arguments:

-h, --help show this help message and exit

-x, --file_ext FILE_EXT

extension of files to process (default: fna)

-c, --cpus CPUS number of CPUs to use (default: 1)

--silent suppress output

> comparem diss -h

$ comparem diss -h

usage: comparem diss [-h]

[--metric {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski}]

[--full_matrix] [--silent]

profile_file output_file

Calculate the dissimilarity between usage profiles.

positional arguments:

profile_file file with usage profile for each genome

output_file output file with pairwise dissimilarity between

genomes

optional arguments:

-h, --help show this help message and exit

--metric {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski}

distance metric to use (default: euclidean)

--full_matrix output full dissimilarity matrix

--silent suppress output

> comparem hclust -h

$ comparem hclust -h

usage: comparem hclust [-h]

[--method {single,complete,average,weighted,centroid,median,ward}]

[--similarity] [--max_sim_value MAX_SIM_VALUE]

[--name_col1 NAME_COL1] [--name_col2 NAME_COL2]

[--value_col VALUE_COL] [--silent]

pairwise_value_file output_tree

Perform hierarchical clustering.

positional arguments:

pairwise_value_file file with pairwise similarity or dissimilarity values

between genomes

output_tree name for output hierarchical cluster tree

optional arguments:

-h, --help show this help message and exit

--method {single,complete,average,weighted,centroid,median,ward}

clustering method to use. (default: average)

--similarity indicates file contain similarity values

--max_sim_value MAX_SIM_VALUE

maximum similarity value (default: 100)

--name_col1 NAME_COL1

index of first column with genome names (default: 0)

--name_col2 NAME_COL2

index of second column with genome names (default: 1)

--value_col VALUE_COL

index of column with similarity or dissimilarity

values (default: 2)

--silent suppress output

実行方法

ランするにはfasta形式のゲノムファイルが必要。

f:id:kazumaxneo:20200910194851p:plain

1、comparem aai_wf - ペアワイズAAIの計算

比較するゲノムのディレクトリ、出力ディレクトリを指定する。

comparem aai_wf --cpus 12 -x fna input_dir/ outdir

#または翻訳済みのproteinディレクトリを指定
comparem aai_wf --cpus 12 --proteins -x faa input_dir/ outdir

-e e-value cutoff for identifying initial blast hits (default: 0.001)
-p percent identity for defining homology (default: 30.0)
-a percent alignment length of query sequence for defining homology (default: 70.0)
--sensitive use sensitive mode of DIAMOND
-x extension of files to process (default: fna)
-blastp use blastp instead of diamond
--proteins indicates the input files contain protein sequences

出力

f:id:kazumaxneo:20200910195147p:plain

aai/aai_summary.tsv

f:id:kazumaxneo:20200910195841p:plain

出力はall versus allのマトリックスファイルではなく、ペアワイズの比較結果のテーブルになる。使用したデータでは、上の表のようにgenome３と他のゲノムとの比較（２−5行目）、次にゲノム５と他のゲノムとの比較（６−8行目）、次にゲノム１（9行目）、最後にゲノム４（10−11行目）という順番でプリントされている。ゲノム5から見たゲノム3との比較結果は２行目にあるが、ゲノム３から見たゲノム５の比較結果はない。レシプロカルヒットの平均からAAIを出しているなら不要になる。

２、comparem classify_wf - AAIの結果から近いゲノムを推定。

クエリのゲノムと比較するゲノムのディレクトリ、出力ディレクトリを指定する。

comparem classify_wf -x fna --cpus 12 query_genome.fna input_dir outdir

テスト時はエラーを起こした。

補足

--keep_headers をつけるとAAIだけでなく全比較結果が残る。aai_summary.tsvは巨大なテキストファイルになるので、ラン後にユーザーがパースする必要がある。

現在のところ、CompareMの実装にはエラーが出る可能性のある部分がある。詳細はGithubで確認してください。

引用

GitHub - dparks1134/CompareM: A toolbox for comparative genomics.

参考

https://www.biostars.org/p/303333/

*1 エラーが起きるときはファイル名が問題の可能性がある。シンプルなファイル名にリネームする。

Teaching-Dundee-BS32010/06-RBBH.ipynb at master · widdowquinn/Teaching-Dundee-BS32010 · GitHub

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

Average amino acid identity（AAI）を計算する CompareM