macでインフォマティクス (Informatics on Mac)

A collection of notes on informatics for HTS (NGS).

Treemmer: reducing large phylogenetic datasets

 

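Large sequence datasets are difficult to visualize and to handle. Moreover, they often do not represent a random subset of natural diversity but are the result of uncoordinated convenience sampling. As a consequence, they can suffer from redundancy and sampling bias. Treemmer is a simple tool that evaluates the redundancy of a phylogenetic tree and reduces its complexity by removing the leaves that contribute least to the tree's diversity.
Treemmer can reduce the size of datasets with different phylogenetic structures and levels of redundancy while keeping a subsample that is representative of the original diversity. In addition, its behavior can be fine-tuned with any kind of meta-information, which makes Treemmer particularly useful for empirical studies.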
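To make the idea concrete, here is a minimal pure-Python sketch of redundancy-based pruning on a toy tree. It is not Treemmer's implementation: it assumes that "redundant" leaves are simply the two sibling leaves with the smallest path distance, that one of them is dropped at random, and that the relative tree length (RTL) is the remaining total branch length divided by the starting total. The `Node` class, the helper names, and the toy tree are all hypothetical.

```python
import random


class Node:
    """Minimal rooted-tree node; `length` is the branch to the parent."""

    def __init__(self, name=None, length=0.0, children=()):
        self.name = name
        self.length = length
        self.children = list(children)
        self.parent = None
        for child in self.children:
            child.parent = self


def leaf_names(node):
    """Return the names of all leaves below `node`."""
    if not node.children:
        return [node.name]
    return [name for child in node.children for name in leaf_names(child)]


def tree_length(node):
    """Sum of all branch lengths below `node`."""
    return sum(child.length + tree_length(child) for child in node.children)


def closest_sibling_leaves(root):
    """Find the pair of sibling leaves with the smallest path distance."""
    best = None
    stack = [root]
    while stack:
        node = stack.pop()
        kids = [c for c in node.children if not c.children]
        for i, a in enumerate(kids):
            for b in kids[i + 1:]:
                dist = a.length + b.length
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        stack.extend(c for c in node.children if c.children)
    return best


def prune_leaf(leaf):
    """Detach `leaf`; splice out a non-root parent left with one child."""
    parent = leaf.parent
    parent.children.remove(leaf)
    if len(parent.children) == 1 and parent.parent is not None:
        only = parent.children[0]
        only.length += parent.length  # keep the path length to the root
        grand = parent.parent
        grand.children[grand.children.index(parent)] = only
        only.parent = grand


def prune_to(root, n_leaves, rng):
    """Greedily prune until `n_leaves` remain; return the final RTL."""
    if n_leaves < 2:
        raise ValueError("this sketch keeps at least 2 leaves")
    start = tree_length(root)
    while len(leaf_names(root)) > n_leaves:
        _, a, b = closest_sibling_leaves(root)
        prune_leaf(rng.choice([a, b]))  # assumption: random pick of the pair
    return tree_length(root) / start


def toy_tree():
    """Build ((A:0.1,B:0.1):1.0,(C:2.0,D:3.0):1.0,E:4.0); by hand."""
    ab = Node(length=1.0, children=[Node("A", 0.1), Node("B", 0.1)])
    cd = Node(length=1.0, children=[Node("C", 2.0), Node("D", 3.0)])
    return Node(children=[ab, cd, Node("E", 4.0)])


if __name__ == "__main__":
    tree = toy_tree()
    rtl = prune_to(tree, 4, random.Random(0))
    print(sorted(leaf_names(tree)), round(rtl, 3))
```

In the toy tree, A and B are nearly identical, so one of them is removed first and the total tree length barely changes. That is the sense in which a leaf "contributes little to the diversity".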

 

Installation

I tested it using the publicly available Singularity image.

Dependencies

Treemmer is compatible with both Python 2 and Python 3.

  • ETE3
  • Joblib
  • Numpy
  • Matplotlib

GitHub

# pull the image and build a sandbox
cd data_dir/
singularity pull --arch amd64 library://fmenardo/treemmer/treemmer:0.3
singularity build --sandbox treemmer_sb/ treemmer_0.3.sif

singularity exec treemmer_sb/ python3 /Treemmer_v0.3.py -h

usage: Treemmer_v0.3.py INFILE [options (-h to see all options)]

positional arguments:
  INFILE                path to the newick tree

optional arguments:
  -h, --help            show this help message and exit
  -X [X [X ...]], --stop_at_X_leaves [X [X ...]]
                        Output reduced tree with X leaves. If multiple values are given Treemmer will produce multiple reduced datsets in the same run
  -RTL [0-1 [0-1 ...]], --stop_at_RTL [0-1 [0-1 ...]]
                        Output reduced tree with the specified RTL. If multiple values are given Treemmer will produce multiple reduced datsets in the same run
  -r [INT], --resolution [INT]
                        number of leaves to prune at each iteration (default: 1)
  -p, --solve_polytomies
                        resolve polytomies at random (default: FALSE)
  -pr, --prune_random   prune random leaves (default: FALSE)
  -lp [0,1,2], --leaves_pair [0,1,2]
                        After the pair of leaves with the smallest distance is dentified Treemmer prunes: 0: the longest leaf 1: the shortest leaf 2: random choice (default: 2)
  -np, --no_plot        do not load matplotlib and plot (default: FALSE)
  -fp, --fine_plot      when --resolution > 1, plot RTL vs n leaves every time a leaf is pruned (default: FALSE => plot every X leaves (X = -r))
  -c [INT], --cpu [INT]
                        number of cpu to use (default: 1)
  -lm [path/to/file], --list_meta [path/to/file]
                        path to file with metainformation. Format for each line: "leaf_name,tag". Leaves can appear mutiple times with different tags, or not appear at all
  -mc [INT], --meta_count [INT]
                        if the -lm option is active -mc defines the minimum number of leaves that will be kept for each category defined in the metainformation file (default = 0)
  -lmc [path/to/file], --list_meta_count [path/to/file]
                        path to file. Format for each line: "tag,number", this option is alternative to -mc and allows to specify the different minimum number of leaves that shuld be retained for different categories
  -v [0,1,2], --verbose [0,1,2]
                        0: silent (almost), 1: show progress, 2: print tree at each iteration, 3: only for testing (findN), 4: only for testing (prune_t) (default: 1)
  -sc1 [leaf_name], --select_clade_1 [leaf_name]
                        use together with -sc2. Treemmer will identify the smallest monophyletic clade including two specified leaves and output a list of leaves belonging to this clade. This can be usefull to prepare the --list_meta input file in case you want to prune only leaves belonging (or not belonging) to a certain clade
  -sc2 [leaf_name], --select_clade_2 [leaf_name]
                        use together with -sc1. Treemmer will identify the smallest monophyletic clade including two specified leaves and output a list of leaves belonging to this clade. This can be useful to prepare the --list_meta input file in case you want to prune only leaves belonging (or not belonging) to a certain clade
  -sa, --select_all     output the list of leaf names in the input tree and exit
  -pa, --plot_always    output the RTL plot with the smallest tree defined by the -X or -RTL option
  -pc, --plot_complete  plot the complete RTL plot and file when the -X or -RTL options are specified
  -sX [sX], --switch_at_X [sX]
                        Treemmer will start normally and switch to random subsampling when the tree has less than sX leaves. This option can be used with -sRTL, Treemmer will change behaviour as soon as one of the two criteria is met
  -sRTL [0-1], --switch_at_RTL [0-1]
                        Treemmer will start normally and switch to random subsampling when the tree is shorter than sRTL. This option can be used with -sX, Treemmer will change behaviour as soon as one of the two criteria is met
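
The help text above describes two plain-text inputs: a `-lm` file with one "leaf_name,tag" entry per line and a `-lmc` file with one "tag,number" entry per line. The following is a small sketch of how such files could be generated from Python. The function names, leaf names, and tags are hypothetical, and the sketch assumes that neither names nor tags contain commas, since the help text does not describe any quoting rules.

```python
def format_list_meta(leaf_tags):
    """Format (leaf_name, tag) pairs as "leaf_name,tag" lines for -lm."""
    lines = []
    for leaf, tag in leaf_tags:
        if "," in leaf or "," in tag:
            # Assumption: no quoting mechanism, so commas would be ambiguous.
            raise ValueError(f"comma not allowed in {leaf!r} / {tag!r}")
        lines.append(f"{leaf},{tag}\n")
    return "".join(lines)


def format_list_meta_count(min_per_tag):
    """Format {tag: minimum leaves to keep} as "tag,number" lines for -lmc."""
    lines = []
    for tag, number in min_per_tag.items():
        if "," in tag:
            raise ValueError(f"comma not allowed in {tag!r}")
        lines.append(f"{tag},{int(number)}\n")
    return "".join(lines)


if __name__ == "__main__":
    # Hypothetical leaves and tags; a leaf may carry several tags.
    meta = [
        ("sample_001", "Europe"),
        ("sample_002", "Asia"),
        ("sample_002", "drug_resistant"),
    ]
    print(format_list_meta(meta), end="")
    print(format_list_meta_count({"Europe": 10, "Asia": 5}), end="")
```

The returned strings can be written to disk with `pathlib.Path(...).write_text(...)` and the resulting paths passed to `-lm` and `-lmc`.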

 

 

Usage

Plot how the relative tree length (RTL) of the whole tree decreases as leaves are pruned. Specify a Newick tree file.

singularity exec treemmer_sb/ python3 /Treemmer_v0.3.py tree_file.nwk

A PDF is saved.


Prune the tree until its relative tree length (RTL) falls to 90% of the starting value.

singularity exec treemmer_sb/ python3 /Treemmer_v0.3.py tree_file.nwk -RTL 0.9

Files such as tree_file.nwkRTL_0.9 are saved.
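
As a quick sanity check on a pruned tree, the RTL can be recomputed outside Treemmer. The sketch below assumes that RTL is the total branch length of the pruned tree divided by that of the original tree, and it sums branch lengths with a regular expression instead of a full Newick parser. It would therefore miscount on trees whose quoted labels or comments contain colons. The function names are hypothetical.

```python
import re

# Matches the number that follows each ':' in a Newick string.
_BRANCH_LENGTH = re.compile(r":\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)")


def total_branch_length(newick):
    """Sum every branch length found in a Newick string."""
    return sum(float(value) for value in _BRANCH_LENGTH.findall(newick))


def relative_tree_length(original_newick, pruned_newick):
    """Total branch length of the pruned tree relative to the original."""
    return total_branch_length(pruned_newick) / total_branch_length(original_newick)


if __name__ == "__main__":
    # Toy trees for illustration only.
    original = "((A:0.1,B:0.1):1.0,(C:2.0,D:3.0):1.0,E:4.0);"
    pruned = "(A:1.1,(C:2.0,D:3.0):1.0,E:4.0);"
    print(round(relative_tree_length(original, pruned), 3))
```

For real files, read the original tree and the pruned tree written by Treemmer with `pathlib.Path(...).read_text()` and pass both strings to `relative_tree_length`.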

 

Prune the tree down to 100 tips.

singularity exec treemmer_sb/ python3 /Treemmer_v0.3.py tree_file.nwk -X 100
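
According to the help text, `-X` accepts several values and then produces several reduced datasets in one run. When scripting such runs, it can be convenient to assemble the command in Python. The helper below is a hypothetical sketch that only builds the argument list. It uses the sandbox directory and script path from the commands in this post as defaults, plus the `-X`, `-c`, and `-np` flags listed in the help output.

```python
def build_treemmer_command(tree_path, stop_at_leaves, cpu=1, no_plot=False,
                           sandbox="treemmer_sb/", script="/Treemmer_v0.3.py"):
    """Build the `singularity exec` argument list for a Treemmer -X run."""
    sizes = [int(x) for x in stop_at_leaves]
    if not sizes or any(x < 1 for x in sizes):
        raise ValueError("stop_at_leaves needs at least one positive integer")
    cmd = ["singularity", "exec", sandbox, "python3", script, tree_path]
    cmd += ["-X"] + [str(x) for x in sizes]  # several sizes in a single run
    cmd += ["-c", str(int(cpu))]
    if no_plot:
        cmd.append("-np")  # skip loading matplotlib and plotting
    return cmd


if __name__ == "__main__":
    print(" ".join(build_treemmer_command("tree_file.nwk", [500, 100], cpu=4)))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the run.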

Input

Output

 

Citation

Treemmer: a tool to reduce large phylogenetic datasets with minimal loss of diversity

Fabrizio Menardo, Chloé Loiseau, Daniela Brites, Mireia Coscolla, Sebastian M. Gygli, Liliana K. Rutaihwa, Andrej Trauner, Christian Beisel, Sonia Borrell & Sebastien Gagneux 
BMC Bioinformatics volume 19, Article number: 164 (2018)