2020 10/5 追記
2020 10/9 修正
CompareMは大規模な比較ゲノム解析をサポートするソフトウェアツールキットである。ゲノムのセット(アミノ酸の同一性など)と個々のゲノム(コドン使用率など)の統計を提供する。数千ゲノムへのスケーラビリティを可能にするために、並列化された実装が提供されている。一般的なワークフローは、ユーザーが簡単に導入できるように単一のメソッドとして提供され、経験豊富なユーザーが特定の機能を利用できるようにより詳細なインターフェースが提供されている。CompareMはオープンソースであり、GNU General Public License (Version 3)の下でリリースされている。
CompareMには2つの共通のワークフローがある。1つ目はゲノム間のペアワイズAAIを計算するためのもので(--aai_wf)、2つ目はリファレンスゲノムのセットに対してAAIの値を計算してクエリゲノムを分類するものである(--classify_wf)。これらのワークフローはCompareMが提供するより詳細なコマンドのいくつかを呼び出している。具体的には、aai_wf は call_genes, similarity, and aai コマンドを実行し、classify_wf は call_genes, similarity, and classify コマンドを実行する。CompareM は Prodigal を使用してゲノムのセットそれぞれの遺伝子をコールするが、ユーザーが提供したタンパク質配列セットを使用することもできる。
ユーザーガイド
https://github.com/dparks1134/CompareM/blob/master/users_guide.pdf
インストール
yamlファイルが3.6以上になっているので、conda mambaでpython3.6の仮想環境を作ってテストした( macos10.14で試したが、エラーメッセージが出ないまま正しい数値が出力されない危険なエラーが起きた。ubuntu18.04で動作確認した)。
依存
- CompareM makes use of the numpy, scipy, matplotlib, and biolib python packages
- prodigal >= 2.6.2
- diamond >= 0.9.0
#bioconda (link)
mamba create -n Comparem -y python=3.6
conda activate Comparem
mamba install -c bioconda comparem -y
> comparem
$ comparem
...::: CompareM v0.1.1 :::...
Common workflows:
aai_wf -> Calculate AAI between all pairs of genomes
(runs call_genes => similarity => aai)
classify_wf -> Identify similar genomes based on AAI values
(runs call_genes => similarity => classify)
Gene prediction:
call_genes -> Identify genes within genomes
Gene homology and genome similarity:
similarity -> Perform reciprocal sequence similarity search between proteins
aai -> Calculate AAI between all pairs of genomes
classify -> Identify similar genomes based on AAI value
Usage profiles:
aa_usage -> Calculate amino acid usage within each genome
codon_usage -> Calculate codon usage within each genome
kmer_usage -> Calculate kmer usage within each genome
stop_usage -> Calculate stop codon usage within each genome
Lateral gene transfer:
lgt_di -> Calculate dinuceotide (3rd,1st) usage of genes to identify putative LGT events
lgt_codon -> Calculate codon usage of genes to identify putative LGT events
Visualization and exploration:
diss -> Calculate the dissimilarity between usage profiles
hclust -> Perform hierarchical clustering
Use: comparem <command> -h for command specific help.
Feature requests or bug reports can be sent to Donovan Parks (donovan.parks@gmail.com)
or posted on GitHub (https://github.com/dparks1134/comparem).
> comparem aai_wf -h
$ comparem aai_wf -h
usage: comparem aai_wf [-h] [-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]
[-x FILE_EXT] [--proteins] [--force_table FORCE_TABLE]
[--blastp] [--sensitive] [--keep_headers] [--keep_rbhs]
[--tmp_dir TMP_DIR] [-c CPUS] [--silent]
input_files output_dir
Calculate AAI between all pairs of genomes
positional arguments:
input_files genome files
output_dir output directory
optional arguments:
-h, --help show this help message and exit
-e, --evalue EVALUE e-value cutoff for identifying initial blast hits
(default: 0.001)
-p, --per_identity PER_IDENTITY
percent identity for defining homology (default: 30.0)
-a, --per_aln_len PER_ALN_LEN
percent alignment length of query sequence for
defining homology (default: 70.0)
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
--proteins indicates the input files contain protein sequences
--force_table FORCE_TABLE
force use of specific translation table
--blastp use blastp instead of diamond
--sensitive use sensitive mode of DIAMOND
--keep_headers indicates FASTA headers already have the format
<genome_id>~<gene_id>
--keep_rbhs create file with reciprocal best hits
--tmp_dir TMP_DIR specify alternative directory for temporary files
(default:
/var/folders/v0/rdv6fcx11fz1dy2jjvthmx1r0000gn/T)
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem classify_wf -h
$ comparem classify_wf -h
husage: comparem classify_wf [-h] [-k NUM_TOP_TARGETS] [-t TAXONOMY_FILE]
[-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]
[-x FILE_EXT] [--proteins]
[--force_table FORCE_TABLE] [--blastp]
[--sensitive] [--keep_headers] [--keep_rbhs]
[--tmp_dir TMP_DIR] [-c CPUS] [--silent]
query_files target_files output_dir
Identify similar genomes based on AAI value.
positional arguments:
query_files query genome files
target_files target genome files
output_dir output directory
optional arguments:
-h, --help show this help message and exit
-k, --num_top_targets NUM_TOP_TARGETS
number of top scoring target genomes to report per
query genome (default: 1)
-t, --taxonomy_file TAXONOMY_FILE
file indicating taxonomic identification of all target
genomes
-e, --evalue EVALUE e-value cutoff for identifying initial blast hits
(default: 0.001)
-p, --per_identity PER_IDENTITY
percent identity for defining homology (default: 30.0)
-a, --per_aln_len PER_ALN_LEN
percent alignment length of query sequence for
defining homology (default: 70.0)
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
--proteins indicates the input files contain protein sequences
--force_table FORCE_TABLE
force use of specific translation table
--blastp use blastp instead of diamond
--sensitive use sensitive mode of DIAMOND
--keep_headers indicates FASTA headers already have the format
<genome_id>~<gene_id>
--keep_rbhs create file with reciprocal best hits
--tmp_dir TMP_DIR specify alternative directory for temporary files
(default:
/var/folders/v0/rdv6fcx11fz1dy2jjvthmx1r0000gn/T)
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem call_genes -h
$ comparem call_genes -h
usage: comparem call_genes [-h] [-x FILE_EXT] [--force_table FORCE_TABLE]
[-c CPUS] [--silent]
input_genomes output_dir
Identify genes within genomes.
positional arguments:
input_genomes genome files to process
output_dir output directory
optional arguments:
-h, --help show this help message and exit
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
--force_table FORCE_TABLE
force use of specific translation table
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem similarity -h
$ comparem similarity -h
usage: comparem similarity [-h] [-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]
[-x FILE_EXT] [--blastp] [--sensitive]
[--keep_headers] [--tmp_dir TMP_DIR] [-c CPUS]
[--silent]
query_proteins target_proteins output_dir
Perform sequence similarity search between genes.
positional arguments:
query_proteins query protein files to process
target_proteins target protein files to process
output_dir output directory
optional arguments:
-h, --help show this help message and exit
-e, --evalue EVALUE maximum e-value for reporting an alignments (default:
0.001)
-p, --per_identity PER_IDENTITY
minimum percent identity for reporting an alignment
(default: 30.0)
-a, --per_aln_len PER_ALN_LEN
minimum percent coverage of query sequence for
reporting an alignment (default: 70.0)
-x, --file_ext FILE_EXT
extension of files to process (default: faa)
--blastp use Blastp-fast instead of DIAMOND
--sensitive use sensitive mode of DIAMOND
--keep_headers indicates FASTA headers already have the format
<genome_id>~<gene_id>
--tmp_dir TMP_DIR specify alternative directory for temporary files
(default:
/var/folders/v0/rdv6fcx11fz1dy2jjvthmx1r0000gn/T)
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem aai -h
$ comparem aai -h
usage: comparem aai [-h] [-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]
[--keep_rbhs] [-c CPUS] [--silent]
query_gene_file sorted_hit_table output_dir
Calculate the AAI between all genome pairs.
positional arguments:
query_gene_file file with all query genes
sorted_hit_table sorted file indicating genes passing sequence
similarity criteria
output_dir output directory
optional arguments:
-h, --help show this help message and exit
-e, --evalue EVALUE maximum e-value for reporting an alignments (default:
0.001)
-p, --per_identity PER_IDENTITY
minimum percent identity for reporting an alignment
(default: 30.0)
-a, --per_aln_len PER_ALN_LEN
minimum percent coverage of query sequence for
reporting an alignment (default: 70.0)
--keep_rbhs create file with reciprocal best hits
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem classify -h
$ comparem classify -h
usage: comparem classify [-h] [-k NUM_TOP_TARGETS] [-t TAXONOMY_FILE]
[-e EVALUE] [-p PER_IDENTITY] [-a PER_ALN_LEN]
[-x FILE_EXT] [--keep_rbhs] [-c CPUS] [--silent]
query_gene_file target_gene_file sorted_hit_table
output_dir
Identify similar genomes based on AAI value.
positional arguments:
query_gene_file file with all query genes
target_gene_file file with all target genes
sorted_hit_table sorted file indicating genes passing sequence
similarity criteria
output_dir output directory
optional arguments:
-h, --help show this help message and exit
-k, --num_top_targets NUM_TOP_TARGETS
number of top scoring target genomes to report per
query genome (default: 1)
-t, --taxonomy_file TAXONOMY_FILE
file indicating taxonomic identification of all target
genomes
-e, --evalue EVALUE e-value cutoff for identifying initial blast hits
(default: 0.001)
-p, --per_identity PER_IDENTITY
percent identity for defining homology (default: 30.0)
-a, --per_aln_len PER_ALN_LEN
percent alignment length of query sequence for
defining homology (default: 70.0)
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
--keep_rbhs create file with reciprocal best hits
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem aa_usage –h
$ comparem aa_usage –h
usage: comparem aa_usage [-h] [--counts] [-x FILE_EXT] [-c CPUS] [--silent]
protein_gene_files output_file
comparem aa_usage: error: the following arguments are required: output_file
> comparem codon_usage -h
$ comparem codon_usage -h
usage: comparem codon_usage [-h] [--counts] [-x FILE_EXT] [--keep_ambiguous]
[-c CPUS] [--silent]
nucleotide_gene_files output_file
Calculate codon usage within each genome.
positional arguments:
nucleotide_gene_files
input files with genes in nucleotide space
output_file output file indicating codon usage of each genome
optional arguments:
-h, --help show this help message and exit
--counts output raw counts instead of frequencies
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
--keep_ambiguous keep codons with ambiguous bases
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem kmer_usage -h
$ comparem kmer_usage -h
usage: comparem kmer_usage [-h] [--counts] [-k K] [-x FILE_EXT] [-c CPUS]
[--silent]
genome_files output_file
Calculate kmer usage within each genome.
positional arguments:
genome_files input files with genomes in nucleotide space
output_file output file indicating kmer usage of each genome
optional arguments:
-h, --help show this help message and exit
--counts output raw counts instead of frequencies
-k K length of kmers, e.g., 4 -> tetranucleotides (default:
4)
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem stop_usage -h
$ comparem stop_usage -h
usage: comparem stop_usage [-h] [--counts] [--mean_gene_length] [-x FILE_EXT]
[-c CPUS] [--silent]
nucleotide_gene_files output_file
Calculate stop codon usage within each genome.
positional arguments:
nucleotide_gene_files
input files with genes in nucleotide space
output_file output file indicating stop codon usage of each genome
optional arguments:
-h, --help show this help message and exit
--counts output raw counts instead of frequencies
--mean_gene_length report mean gene length for genes with each stop codon
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem lgt_di -h
$ comparem lgt_di -h
usage: comparem lgt_di [-h] [-x FILE_EXT] [--crit_value CRIT_VALUE] [-c CPUS]
[--silent]
nucleotide_gene_files output_dir
Calculate dinuceotide (3rd,1st) usage of genes to identify putative LGT
events.
positional arguments:
nucleotide_gene_files
input files with genes in nucleotide space
output_dir output directory to write dinucleotide usage for each
gene in each genome
optional arguments:
-h, --help show this help message and exit
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
critical value for defining deviant genes (default:
0.001)
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem lgt_codon -h
$ comparem lgt_codon -h
usage: comparem lgt_codon [-h] [-x FILE_EXT] [-c CPUS] [--silent]
nucleotide_gene_files output_dir
Calculate codon usage of genes to identify putative LGT events.
positional arguments:
nucleotide_gene_files
input files with genes in nucleotide space
output_dir output directory to write dinucleotide usage for each
gene in each genome
optional arguments:
-h, --help show this help message and exit
-x, --file_ext FILE_EXT
extension of files to process (default: fna)
-c, --cpus CPUS number of CPUs to use (default: 1)
--silent suppress output
> comparem diss -h
$ comparem diss -h
usage: comparem diss [-h]
[--metric {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski}]
[--full_matrix] [--silent]
profile_file output_file
Calculate the dissimilarity between usage profiles.
positional arguments:
profile_file file with usage profile for each genome
output_file output file with pairwise dissimilarity between
genomes
optional arguments:
-h, --help show this help message and exit
--metric {euclidean,minkowski,cityblock,seuclidean,sqeuclidean,cosine,correlation,hamming,jaccard,chebyshev,canberra,braycurtis,mahalanobis,yule,matching,dice,kulsinski,rogerstanimoto,russellrao,sokalmichener,sokalsneath,wminkowski}
distance metric to use (default: euclidean)
--full_matrix output full dissimilarity matrix
--silent suppress output
> comparem hclust -h
$ comparem hclust -h
usage: comparem hclust [-h]
[--method {single,complete,average,weighted,centroid,median,ward}]
[--similarity] [--max_sim_value MAX_SIM_VALUE]
[--name_col1 NAME_COL1] [--name_col2 NAME_COL2]
[--value_col VALUE_COL] [--silent]
pairwise_value_file output_tree
Perform hierarchical clustering.
positional arguments:
pairwise_value_file file with pairwise similarity or dissimilarity values
between genomes
output_tree name for output hierarchical cluster tree
optional arguments:
-h, --help show this help message and exit
--method {single,complete,average,weighted,centroid,median,ward}
clustering method to use. (default: average)
--similarity indicates file contain similarity values
maximum similarity value (default: 100)
--name_col1 NAME_COL1
index of first column with genome names (default: 0)
--name_col2 NAME_COL2
index of second column with genome names (default: 1)
index of column with similarity or dissimilarity
values (default: 2)
--silent suppress output
実行方法
ランするにはfasta形式のゲノムファイルが必要。
1、comparem aai_wf - ペアワイズAAIの計算
comparem aai_wf --cpus 12 -x fna input_dir/ outdir
#または翻訳済みのproteinディレクトリを指定
comparem aai_wf --cpus 12 --proteins -x faa input_dir/ outdir
- -e e-value cutoff for identifying initial blast hits (default: 0.001)
- -p percent identity for defining homology (default: 30.0)
- -a percent alignment length of query sequence for defining homology (default: 70.0)
- --sensitive use sensitive mode of DIAMOND
- -x extension of files to process (default: fna)
- -blastp use blastp instead of diamond
- --proteins indicates the input files contain protein sequences
出力
aai/aai_summary.tsv
出力はall versus allのマトリックスファイルではなく、ペアワイズの比較結果のテーブルになる。使用したデータでは、上の表のようにgenome3と他のゲノムとの比較(2−5行目)、次にゲノム5と他のゲノムとの比較(6−8行目)、次にゲノム1(9行目)、最後にゲノム4(10−11行目)という順番でプリントされている。ゲノム5から見たゲノム3との比較結果は2行目にあるが、ゲノム3から見たゲノム5の比較結果はない。レシプロカルヒットの平均からAAIを出しているなら不要になる。
*2
2、comparem classify_wf - AAIの結果から近いゲノムを推定。
クエリのゲノムと比較するゲノムのディレクトリ、出力ディレクトリを指定する。
comparem classify_wf -x fna --cpus 12 query_genome.fna input_dir outdir
テスト時はエラーを起こした。
補足
--keep_headers をつけるとAAIだけでなく全比較結果が残る。aai_summary.tsvは巨大なテキストファイルになるので、ラン後にユーザーがパースする必要がある。
現在のところ、CompareMの実装にはエラーが出る可能性のある部分がある。詳細はGithubで確認してください。
引用
GitHub - dparks1134/CompareM: A toolbox for comparative genomics.
参考
https://www.biostars.org/p/303333/
*1 エラーが起きるときはファイル名が問題の可能性がある。シンプルなファイル名にリネームする。
*2
Teaching-Dundee-BS32010/06-RBBH.ipynb at master · widdowquinn/Teaching-Dundee-BS32010 · GitHub