2020-05-20

ゲノムの中の関心がある遺伝子を視覚化する Gcluster

2020 5/27 コメント追加

2020 5/28 -mオプション追記

　遺伝子、遺伝子クラスター、およびその近傍のゲノムコンテクストを比較することは、遺伝子の機能や微生物の進化の基盤を決定する上で非常に重要である。現在のところ、多数のゲノムのゲノムコンテクストを可視化し、比較することができる使いやすく柔軟なツールはまだ存在していない。
　ここでは、研究者が大量の完成したゲノムやドラフトゲノムにわたって、興味のある遺伝子の周りのゲノム領域の高品質な線形マップをカスタマイズして作成することを可能にするスタンドアロンのPerlツールであるGclusterを紹介する。重要なことは、Gclusterは、組み込みのorthoMCLの形での相同遺伝子解析と、遺伝子のコンテキストの優れた比較を提供するために、与えられた系統樹上にゲノムをマッピングすることを統合している。
　GclusterはPerlで書かれており、GPLv3でリリースされている。ソースコードは https://github.com/Xiangyang1984/Gcluster とhttp://www.microbialgenomic.com/Gcluster_tool.html において自由に入手できる。

インストール

ubuntu18.04でテストした。

本体　Github

#bioconda (link)*1
conda install -c bioconda gcluster -y

> Gcluster.pl

$ Gcluster.pl

=NAME

Gcluster.pl

=DESCRIPTION

Gcluster is a simple tool for visualizing and comparing genome contexts for massive genomes. It is freely available at http://www.microbialgenomic.com/Gcluster_tool.html and https://github.com/Xiangyang1984/Gcluster_v1.01 under an open source GPLv3 license. It is a stand-alone Perl application, which requires MCL, NCBI BLAST+ and several Perl Modules (e.g. GD, GD::SVG) to be installed before use.

=USAGE

Gcluster.pl -dir genbank_file_directory -gene interested_gene_file [options]

FOR EXAMPLE:

perl /home/xiangyang/Gcluster_v1.01-master/Gcluster.pl -dir /home/xiangyang/Gcluster_v1.01-master/test_data/gbk -gene /home/xiangyang/Gcluster_v1.01-master/test_data/interested_gene_name.txt -tree /home/xiangyang/Gcluster_v1.01-master/test_data/full_NJ_rRNA_tree.nwk -m 3

Large test data is available at website (http://www.microbialgenomic.com/160_genomes_testdata.tar.gz). It contains 160 annotated genomes used as input files to create Fig. 1 in manuscript.

=ARGUMENTS

REQUIRED ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-dir, --genbank_file_directory

A Directory containing annotated genomes as Genbank format file. To avoid a mistake, genome names cannot use special character, such as space, equal. For large number of genomes, users are recommended to download using Aspera, a high-speed file transfer tool (https://downloads.asperasoft.com/). If enough rgb colors provided or not to color homologous genes, there are no limitation to the number of genomes and the number of genes flanking gene of interest for Gcluster.

-gene, --interested_gene_file

A list of the interested gene, in which each line contains a locus tag of the interested gene for individual genome. Users are recommended to use "interested_gene_generation.pl" in Gcluster package for generation this file. In this situation, user needs to provide a blast database file in FASTA format, which contains at least one protein sequence homologous to the gene of interest. To map genome contexts to a given phylogeny or to order the genome contexts using a "strain_reorder_file", only one gene of interest is allowed to provide for each genome.

For example:

AX2_RS10405 #arsenite_oxidase_large_subunit;Achromobacter_xylosoxidans_NBRC_15126_ATCC_27061

KUC_RS10495 #arsenite_oxidase_large_subunit;Halomonas_boliviensis_LC1

KYC_RS14580 #arsenite_oxidase_large_subunit;Achromobacter_arsenitoxydans_SY8

...

OPTIONAL ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-o, --Gcluster_output_directory

An output directory holding all the generated files by Gcluster.pl. if this option is not set, Gcluster will create a directory named "Gcluster_workplace" in the bin directory from where Gcluster.pl was invoked.

-tree, --phylogenetic_file

A Newick format tree file is used by Gcluster to automatically accociate the genomes with their phylogeny. Meanwhile, Gcluster will output a file named "temp_strain_reorder_file", which contains the order information of genomes in tree from up to down. It should be noted that all nodes name in provided tree must completely match with the genbank files name of all genomes.

Gcluster provides a perlscript in "Gcluster/script" directory for batch extraction of 16S rRNA gene sequences, which can be used to build a 16S rRNA tree using software like MEGA (https://www.megasoftware.net/).

-topology, --show_tree_topology

Display the tree topology, which is obtained from the tree file (Default: T).

-branch, --show_tree_branch

Draw tree using tree branch length, which is obtained from the tree file (Default: F).

--x_step

Draw tree using xstep instead of tree branch length (Default: 10).

-bootstrap, --show_tree_bootstrap

Display the tree bootstrap value, which is obtained from the tree file (Default: F).

-srf, --strain_reorder_file

A two-column tab-delimited text file is used to sort genomes from up to down accoding to users requirement. Each row must consist of a strain name followed by the numerical order that is used for sorting genomes. It should be noted that all strains name must completely match with the genbank files name of all genomes. Gcluster needs a "strain_reorder_file" or a "phylogenetic_file", but not both at the same time.

For example:

Achromobacter_xylosoxidans_NCTC10807 1

Achromobacter_xylosoxidans_NBRC_15126_ATCC_27061 2

Achromobacter_xylosoxidans_NCTC10808 9

Achromobacter_sp._2789STDY5608623 5

Achromobacter_sp._2789STDY5608621 4

Alcaligenes_faecalis_subsp._faecalis_NCIB_8687 10

Achromobacter_xylosoxidans_DPB_1 7

Achromobacter_xylosoxidans_B_1 8

Achromobacter_piechaudii_HLE 3

Achromobacter_marplatensis_B2 6

... ...

-n, --flanking_gene_number

Number of genes flanking gene of interest are set to show. If enough rgb colors provided or not to color homologous genes, there are no limitation to the number of genomes and the number of genes flanking gene of interest (Default: 10).

-color_f, --gene_color_filled

Color was used to fill homologous gene clusters or gene families of interest (Default: T), if choose F, all of the genes were filled with the color customized by "gene_no_color_filled" parameter).

-pso, --percent_strain_homologouscluster_color

Only color certain homologous gene clusters, in which the holding number of different genomes exceeds the threshold number (Default: 0). This is measured using (X/Y)*100. In this formula, X denotes the number of different genomes in a set of homologous gene cluster, and Y denotes the total number of genomes. This parameter is useful when no enough rgb colors are provided in colors_configure_file under color_configure direcoty. Users could to reduce the number of colors used by setting a high value.

-c_color_b, --cds_color_border

To color the border of the CDS genes (Default: black), users can choose from blue, black, red, white, gray, dgray.

-p_color_b, --pseudo_color_border

To color the border of the Pseudo genes (Default: dgray), users can choose from blue, black, red, white, gray, dgray.

-r_color_b, --RNA_color_border

To color the border of the RNA (tRNA, rRNA) genes (Default: red), users can choose from blue, black, red, white, gray, dgray.

-no_color_f, --gene_no_color_filled

To fill uniqe genes (including RNA genes), pseudo genes, and homologous gene clusters not meeting the criteria set by "percent_strain_homologouscluster_color" parameter with a single color (Default: white), users can choose from blue, black, red, white, gray, dgray.

-dw, --line_drawing_width

Set the line drawing width (Default: 1).

-l, --arrow_relative_Length

Set the relative length of the gene arrow (Default: 4).

-w, --arrow_relative_Height

Set the relative Height of the gene arrow (Default: 6).

-scale, --figure_Scale_up_multiple

Adjust gene length through zooming (Default: 0.5).

-s_Y, --strain_name_shift_Y

Set the offset along Y-axis for strain names (Default: 0).

-g_Y, --gene_label_shift_Y

Set the offset along Y-axis for gene labels (Default: 2).

-dis, --distance_between_two_genomes

Set the distance between two genome contexts in Y-axis (Default: 70).

-up, --up_shift

Set the top margin of image in pixels (Default: 10).

-down, --down_shift

Set the bottom margin of image in pixels (Default: 20).

-left, --left_shift

Set the left margin of image in pixels (Default: 10).

-right, --right_shift

Set the right margin of image in pixels (Default: 20).

-label, --show_label

Display the gene label (gene Locus Tag or genename) (Default: T).

-ul, --unification_label

Unify gene label for homologous gene cluster (Default: T). Among a set of homologous gene cluster, if a gene is annotated with a name X, all other genes will be labeled with X.

-family, --font_family

Set font family for the genome name and the gene label, e.g. Times New Roman, Arial, Verdana and so on (Default: Times New Roman). Users are suggested to choose font family listed in metrcis module, or causing a miscalculation of string width for genome name in SVG-format map.

-style, --font_style

Set font style for the genome name and the gene label, e.g. Normal, Bold, Italic (Default: Normal). It should be noted that the font style "Bold" does not work when using to cearte a PNG format figure in MacOS.

-size, --label_font_size

Set font size for gene label (Default: 6).

-color, --label_font_color

Set font color for gene label (Default: dgray).

-i_color, --interested_gene_label_font_color

Customize gene label color for gene of interest (Default: red), users can choose from blue, black, red, white, gray, dgray.

-r, --rotate_gene_label (Default: 30)

Rotate the angle of the gene label, e.g. 30, 45, 135 and so on.

--strain_name_font_size

Set font size for genome name (Default: 12).

--Strain_name_font_color

set font color for genome name (Default: black). Users can choose from blue, black, red, white, gray, dgray.

-Bst, --homologous_gene_cutoff

Array to set blast parse cutoff: E-value, Identify, Coverage, Match_length (Default: E-value=1-e5, Identify=0, Coverage=50%, Match_length=0).

-m, --multiple_threads

Numbers of thread to use (Default: 1).

-SVG, --SVG_image

Create SVG format figure (Default: T).

-PNG, --PNG_image

Create PNG format figure (Default: T).

-sub_TFT, --start_at_sub_TFT (Default: F)

Jump to generate a collection of sub-TFT tables and perform homologous gene analysis (Default: F). Skips sequences extraction and TFT file generation.

-map, --start_at_map

Jump to map generation (Default: F). Generation of a collection of sub-TFT tables and homologous gene clusters has already been done. This parameter is very useful to customize the map quickly. It should be noted that there's no sense to reset "flanking_gene_number" parameter if this parameter set to "T".

Importantly, at this step, users can revise the gene label by directly edition of the locus_tag in sub_TFT file or all_orthomcl.out. In sub_TFT files and all_orthomcl.out file, there are two forms of gene locus tag, (1) "Locus_Tag", in this case, no genename is defined for a gene; (2) "GeneName;Locus_Tag", in this case, genename is given for a gene. For the first form, user can revise gene label by addition of a genename followed by a semicolon in the front of the Locus_Tag. For the second form, user can revise gene label by modification of the genename.

-h, --help

Show this message.

=AUTHOR

Dr. Xiangyang Li (E-mail: lixiangyang@fudan.edu.cn, lixiangyang1984@gmail.com), Fudan university; Kaili University; Bacterial Genome Data mining & Bioinformatic Analysis (http://www.microbialgenomic.com/).

=COPYRIGHT

> interested_gene_generation.pl

# interested_gene_generation.pl

=NAME

interested_gene_generation.pl

=DESCRIPTION

Run this command to enble users to obtain a list of the interested gene (a two-column tab-delimited text file) by a local blastP analysis using multiple threads.

=USAGE

interested_gene_generation.pl -dir genbank_file_directory -db database [options]

FOR EXAMPLE:

# perl /home/xiangyang/Gcluster_v1.01-master/interested_gene_generation.pl -dir /home/xiangyang/Gcluster_v1.01-master/test_data/gbk -db /home/xiangyang/Gcluster_v1.01-master/test_data/aioB.fasta -m 3

=ARGUMENTS

=======================

REQUIRED ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-dir, --genbank_file_directory

A directory containing annotated genomes as Genbank format file. To avoid a mistake, genome names cannot use special character,

such as space, equal. For large number of genomes, users are recommended to download using Aspera, a high-speed file transfer

tool (https://downloads.asperasoft.com/).

-db, --database

A protein database in FASTA format, which contains at least one protein sequence homologous to the gene of interest.

OPTIONAL ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-o, --output_directory

An output directory holding all the generated files by interested_gene_generation.pl. if this option is not set, interested_gene_generation.pl will create a directory named "interested_gene_workplace" in the bin directory from where interested_gene_generation.pl was invoked.

-m, --multiple_threads

set thread number (Default: 1)

-b, --start_at_blast

Jump to a local blastp analysis, and Skips sequencing extraction (Default: T).

-e, --e_value

set E-value cutoff in Blast analysi (default: 1e-5)

-i, --identify

set percent identity cutoff in Blast analysis (default: 30)

-c, --coverage

set percent coverage (Query and Subject) cutoff in Blast analysis (default: 50)

-l, --match_length

set alignment length cutoff in Blast analysis (default: 30)

-h, --help

Show this message.

=AUTHOR

Dr. Xiangyang Li (E-mail: lixiangyang@fudan.edu.cn), Fudan university; Kaili University; Bacterial Genome Data mining & Bioinformatic Analysis (www.microbialgenomic.com/).

=COPYRIGHT

テストラン

git clone https://github.com/Xiangyang1984/Gcluster.git
cd Gcluster/
perl test.pl #*1

実行方法

準備

ランには比較するゲノムのGenBnakファイルとリストファイルが必要。

１、GenBnakファイル。

テストデータのGenBnak（ファイル名にスペースや特殊文字がない事）

f:id:kazumaxneo:20200520230937p:plain

２、関心のある遺伝子のリストファイル。

関心のある遺伝子の locus tagを記載したテキストファイルになる。9 genbankの比較なら、それぞれのゲノムから1つずつ9行のリストを作ることになる。

f:id:kazumaxneo:20200520231626p:plain

手動で作成してもよいが、数が多いと手間なので、blastpのヒットから自動作成するスクリプトが準備されている。ランするにはgenbankのディレクトリと興味ある遺伝子のタンパク質配列（クエリ）を指定する。

interested_gene_generation.pl -dir test_data/gbk_dir -db test_data/aioB.fasta -m 4 -o out_dir -r 1e-5 -i 30 -c 50 -I 30

-dir A directory containing annotated genomes as Genbank format file.
-db A protein database in FASTA format, which contains at least one protein sequence homolo
-o output_directory
-m set thread number (Default: 1)
-e set E-value cutoff in Blast analysi (default: 1e-5)
-i set percent identity cutoff in Blast analysis (default: 30)
-c set percent coverage (Query and Subject) cutoff in Blast analysis (default: 50)
-l set alignment length cutoff in Blast analysis (default: 30)

ランが終わると出力ディレクトリに上の画像のような内容のテキストファイルができる。これをGclusterのランに使用する。

f:id:kazumaxneo:20200520232903p:plain

interested_gene_name.txtにblastpのトップヒット遺伝子のlocus tagがリスト化される。blastpにセカンドヒットなども存在する場合、２列目のコメントアウトした部分以降にまとめられる。

ラン

genbankのディレクトリと興味がある遺伝子のリストファイルを指定する。

Gcluster.pl -dir ./test_data/gbk -gene interested_gene_name.txt -o out_dir -m 12

#遺伝子名を大きくする
Gcluster.pl -dir ./test_data/gbk -gene interested_gene_name.txt -o out_dir -m 12 -size 10

#よく似たゲノム同士なら条件を厳しくする
Gcluster.pl -dir ./test_data/gbk -gene interested_gene_name.txt -o out_dir -m 12 -size 10 -Bst 1e-100

-dir A Directory containing annotated genomes as Genbank format file.
-gene A list of the interested gene, in which each line contains a locus tag of the interested gene for individual genome.
-m Numbers of thread to use (Default: 1).
-size Set font size for gene label (Default: 6).
-Bst Array to set blast parse cutoff: E-value, Identify, Coverage, Match_length (Default: E-value=1-e5, Identify=0, Coverage=50%, Match_length=0).

出力

f:id:kazumaxneo:20200520234055p:plain

figure.png。指定した関心のある遺伝子は、図の真ん中の赤い注釈になっている遺伝子（*４）。

f:id:kazumaxneo:20200520234111p:plain

順番を変えるには、次の例のようにnewlickのファイルを与えるか、順番を支持したテキストファイルを指定する（Github参照）。

Newickフォーマットのツリーファイルも与える事で、phylogenyに従って順番をソートできる*2。

Gcluster.pl -dir ./test_data/gbk \
 -gene out_dir/interested_gene_name.txt \
 -tree ./test_data/16S_rRNA_tree.nwk \
 -o out_dir -m 12

figure.svg

f:id:kazumaxneo:20200520234944p:plain

出力図はGcluster.plのオプションでカスタマイズできます。例えば"-n <num>"で隣接する遺伝子の表示数という感じです。コマンドのhelpを見て下さい。*３

公式（大規模な例あり）

http://www.microbialgenomic.com/Gcluster_tool.html

引用

Gcluster: a simple-to-use tool for visualizing and comparing genome contexts for numerous genomes
Xiangyang Li, Fang Chen, Yunpeng Chen
Bioinformatics, Published: 28 March 2020

*1 Githubにも注意書きがあるが、/usr/bin/にコマンドがないとエラーが出たのでシンボリックリンクを張って対応した。

ln -s /root/anaconda3/bin/blastp /usr/bin/

ln -s /root/anaconda3/bin/makeblastdb /usr/bin/

ln -s /root/anaconda3/bin/mcl /usr/bin/

genbankから自動で16Sの配列を抽出するスクリプトが用意されている。抽出後、MEGAなどを使って16Sのツリーを構築し、それを指定する。ファイル名が同じでないと受け付けないので注意。

*３

たくさんの遺伝子クラスターが図示されてしまってわかりにくい場合は、interested_gene_generation.plの結果得られるinterested_gene_name.txtを開いて、適切なアノテーション行１つだけ残し、あとは消して下さい。それを使ってGcluster.plをランすれば、１ゲノムから１クラスターのみ視覚化されるようになります。

*４

図中の注釈遺伝子名はリストファイルのデータに則って同じ名前で統一される。注釈が同じだからとってそのサンプルに該当遺伝子があることを意味するわけではない。注意する。