EvolClust: detecting evolutionarily conserved gene clusters in eukaryotes

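Gene order in eukaryotic genomes tends to be poorly conserved over evolution (Dávila López et al., 2010). Despite this trend, certain groups of genes remain close together in the genome across long evolutionary distances, which suggests that selection acts to maintain their genomic co-localization. Genes in such conserved clusters may be functionally linked (Lee and Sonnhammer, 2003; Wisecaver et al., 2014) and may be transcribed in a coordinated manner (Boutanaev et al., 2002; Reimegård et al., 2017). Comparative genomics can reveal groups of genes that remain significantly closer together than expected, even though gene order has been extensively shuffled.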

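Gecko3 (Winter et al., 2015) was designed to address this kind of question, but it saturates when a large number of species are compared. Other programs, such as i-ADHoRe (Proost et al., 2012) and MCScanX (Wang et al., 2012), are built to search for collinearity between genomes but do not focus on the presence of gene clusters.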

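The authors developed EvolClust, an algorithm that detects groups of neighboring genes shared by the genomes being compared. Unlike the programs mentioned above, it detects conserved gene clusters and is ready for large-scale comparisons using hundreds of genomes. It is also not limited to one specific type of gene cluster, unlike programs that search for secondary metabolite clusters (Khaldi et al., 2010; Medema et al., 2011). EvolClust has been tested on Linux and can be ported to other systems using the bundled Dockerfile.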

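EvolClust is a Python-based tool that infers evolutionarily conserved gene clusters from genome comparisons, independently of the function or gene composition of the clusters. It predicts conserved gene clusters from pairwise genome comparisons and infers families of related clusters from all-versus-all genome comparisons.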
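To make the idea of a pairwise comparison concrete, here is a minimal Python 3 sketch. It is not EvolClust's actual algorithm (see the wiki for that); it only illustrates the general notion of finding stretches of neighboring genes whose homolog families are shared between two genomes. The function name `conserved_windows`, the fixed window size, and the `min_shared` cutoff are all assumptions made for this illustration.

```python
def conserved_windows(genome_a, genome_b, family_of, size=5, min_shared=4):
    """Return (start_a, start_b, shared) for window pairs sharing enough families.

    genome_a, genome_b: gene IDs in chromosomal order (hypothetical toy input).
    family_of: dict mapping gene ID -> homolog family ID.
    Illustrative sliding-window comparison only, not EvolClust's method.
    """
    def window_families(genome, start):
        # Families present in the window of `size` genes starting at `start`.
        return {family_of[g] for g in genome[start:start + size] if g in family_of}

    hits = []
    for i in range(len(genome_a) - size + 1):
        fams_a = window_families(genome_a, i)
        for j in range(len(genome_b) - size + 1):
            shared = len(fams_a & window_families(genome_b, j))
            if shared >= min_shared:
                hits.append((i, j, shared))
    return hits
```

Because the windows are compared as sets of families, gene order inside a window may be shuffled and the region still counts as shared, which is the situation described above.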

wiki

https://github.com/Gabaldonlab/EvolClust/wiki

Figure: Main algorithm (from GitHub).

Installation

Tested in a Python 2.7.14 environment (using an Ubuntu 18.04 Docker image).

Dependencies

python 2.7
numpy and mcl

conda install -c anaconda numpy
apt update && apt-get install -y mcl

Main program (GitHub)

git clone https://github.com/Gabaldonlab/EvolClust.git
cd EvolClust/


# Build the Docker image
docker build -t evolclust .

python evolclust.py -h

usage: evolclust.py [-h] [-i INFILE] [-f FASTAFILE] [-l LISTFILE]
                    [-s1 SPECIES1] [-s2 SPECIES2] [-d OUTDIR]
                    [--minSize MINSIZE] [--maxSize MAXSIZE]
                    [--non_homologs NON_HOMOLOGS]
                    [--threshold {2stdv,3stdv,1stdv,90percent,75percent}]
                    [--initial_files] [--get_pairwise_clusters]
                    [--filter_clusters] [--cluster_comparison]
                    [--cluster_families] [--statistics]
                    [--path_evolclust PATHEVOLCLUST] [--path_mcl PATHMCL]
                    [--local]

Will perform the genome walking

optional arguments:
  -h, --help            show this help message and exit
  -i INFILE, --infile INFILE
                        Input file. Can change depending on the analysis run
  -f FASTAFILE, --fastafile FASTAFILE
                        Fasta file that contains the complete proteome
                        database.
  -l LISTFILE, --listfile LISTFILE
                        Complete list of proteins.
  -s1 SPECIES1, --species1 SPECIES1
                        Species tag needed to run the threshold calculation
  -s2 SPECIES2, --species2 SPECIES2
                        Species tag needed to run the threshold calculation
  -d OUTDIR, --outdir OUTDIR
                        basepath folder where the results will be stored
  --minSize MINSIZE     Minimum size a cluster needs to have to be considered.
                        Default is set to 5
  --maxSize MAXSIZE     Maximum size a cluster needs to have to be considered.
                        Default is set to 35
  --non_homologs NON_HOMOLOGS
                        Number of non-homologous genes needed to split a
                        cluster
  --threshold {2stdv,3stdv,1stdv,90percent,75percent}
                        Way to calculate the thresholds to accept a conserved
                        region as cluster
  --initial_files       Will prompt the program to create the initial files
  --get_pairwise_clusters
                        Main program in the genome walking - starts from pairs
                        files
  --filter_clusters     Will filter out identical clusters and name them all;
                        will also create the files needed to run the final
                        cluster comparison
  --cluster_comparison  Will compare clusters
  --cluster_families    Builds the cluster families
  --statistics          Will calculate statistics for each cluster family.
  --path_evolclust PATHEVOLCLUST
                        Path to the python program evolclust.py
  --path_mcl PATHMCL    Path to mcl
  --local               This will run the whole evolclust pipeline without
                        splitting it in steps. It is only recommended for
                        small datasets
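The --threshold choices (1stdv, 2stdv, 3stdv, 75percent, 90percent) read like two families of cutoffs: a number of standard deviations, or a percentile. The help text does not say how EvolClust computes them, so the Python 3 sketch below is only one plausible reading of the option names. It assumes a cutoff of mean + k standard deviations, or a nearest-rank percentile, over some list of background scores. The function `cluster_threshold` and the notion of "scores" are hypothetical; check the wiki for the real definition.

```python
import statistics


def cluster_threshold(scores, mode="2stdv"):
    """Compute a cutoff from background scores (illustrative assumption only)."""
    if not scores:
        raise ValueError("scores must not be empty")
    if mode in ("1stdv", "2stdv", "3stdv"):
        k = int(mode[0])
        # Assumed form: mean plus k population standard deviations.
        return statistics.mean(scores) + k * statistics.pstdev(scores)
    if mode in ("75percent", "90percent"):
        pct = int(mode[:2])
        ordered = sorted(scores)
        # Nearest-rank percentile, using integer ceiling division.
        rank = (pct * len(ordered) + 99) // 100
        return ordered[rank - 1]
    raise ValueError("unknown threshold mode: %s" % mode)
```

Under this reading, 3stdv is the strictest of the standard-deviation settings and 75percent the most permissive of the percentile ones, but that ordering is a property of the sketch, not a documented property of the tool.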

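# Run in Docker with the image built above (outFolder_computer and outFolder_docker are placeholders for the host and container output folders)
docker run --rm -v outFolder_computer:outFolder_docker evolclust -i datasets/test_dataset.mcl -f datasets/test_dataset.fa -d outFolder_docker --local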

Test run

Prepare a multi-FASTA file of proteins and a list of homologous proteins. For small datasets, steps 1-5 can be run in one go by adding --local (see the wiki for details).

python evolclust.py -i datasets/test_dataset.mcl -f datasets/test_dataset.fa -d evol_test --local

-f Fasta file that contains the complete proteome database.
-i Input file. Can change depending on the analysis run
-d base path folder where the results will be stored
--local This will run the whole evolclust pipeline without splitting it in steps. It is only recommended for small datasets
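Before a long run it can be worth checking that the two inputs agree with each other. The Python 3 sketch below lists protein IDs that appear in the homolog file but not in the FASTA file. It assumes the homolog file has one family per line with whitespace-separated protein IDs (the usual MCL output layout) and that the FASTA ID is the first word after ">". Both are assumptions about the input format, and the function names are made up for this example, so confirm against the wiki before relying on it.

```python
def read_fasta_ids(lines):
    """Collect the first word of every FASTA header line."""
    ids = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def read_families(lines):
    """Parse one family per line, members separated by whitespace (assumed)."""
    return [line.split() for line in lines if line.strip()]


def missing_ids(fasta_lines, family_lines):
    """Return sorted IDs that are in the families but absent from the FASTA."""
    known = read_fasta_ids(fasta_lines)
    families = read_families(family_lines)
    return sorted({m for fam in families for m in fam if m not in known})
```

Both functions accept any iterable of lines, so they can be called on open file handles, for example `missing_ids(open("datasets/test_dataset.fa"), open("datasets/test_dataset.mcl"))`.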

Output

ls -l evol_test/

total 656
-rw-r--r-- 1 root  14651 Oct 2 18:57 all_cluster_comparisons.txt
drwxr-xr-x 5 root   4096 Oct 2 18:56 all_cluster_predictions/
drwxr-xr-x 2 root   4096 Oct 2 18:57 cluster_comparison/
-rw-r--r-- 1 root  51620 Oct 2 18:57 cluster_families.complemented.txt
-rw-r--r-- 1 root  51620 Oct 2 18:57 cluster_families.txt
drwxr-xr-x 2 root   4096 Oct 2 18:57 clusters_by_spe/
drwxr-xr-x 5 root   4096 Oct 2 18:56 clusters_from_pairs/
-rw-r--r-- 1 root  53508 Oct 2 18:57 complete_cluster_list.txt
-rw-r--r-- 1 root 448268 Oct 2 18:49 complete_protein_list.txt
drwxr-xr-x 2 root   4096 Oct 2 18:49 conversion_files/
drwxr-xr-x 2 root   4096 Oct 2 18:57 jobs/
-rw-r--r-- 1 root   2391 Oct 2 18:57 mcl_comparison_clusters.txt
drwxr-xr-x 5 root   4096 Oct 2 18:49 pairs_files/
drwxr-xr-x 5 root   4096 Oct 2 18:57 thresholds/
drwxr-xr-x 4 root   4096 Oct 2 18:57 timming/

Explanation of the output files

https://github.com/Gabaldonlab/EvolClust/wiki/6.--Output-files
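A quick way to confirm that a --local run finished is to check that the entries from the listing above exist in the output folder. The Python 3 sketch below does only that. The names in `EXPECTED` are copied from the test-run listing in this post, and whether every run produces exactly this set is an assumption. `missing_outputs` is a made-up helper name.

```python
from pathlib import Path

# Entries seen in the test-run listing ("timming" is spelled as the tool names it).
EXPECTED = [
    "all_cluster_comparisons.txt",
    "all_cluster_predictions",
    "cluster_comparison",
    "cluster_families.complemented.txt",
    "cluster_families.txt",
    "clusters_by_spe",
    "clusters_from_pairs",
    "complete_cluster_list.txt",
    "complete_protein_list.txt",
    "conversion_files",
    "jobs",
    "mcl_comparison_clusters.txt",
    "pairs_files",
    "thresholds",
    "timming",
]


def missing_outputs(outdir, expected=EXPECTED):
    """Return the expected entries that are not present under outdir."""
    base = Path(outdir)
    return [name for name in expected if not (base / name).exists()]
```

For the test run above, `missing_outputs("evol_test")` returning an empty list would mean that every entry from the listing is present.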

Citation

EvolClust: automated inference of evolutionary conserved gene clusters in eukaryotes
Marina Marcet-Houben, Toni Gabaldón
Bioinformatics, Published: 27 September 2019