2020-07-06

Kmasker

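Many plant genomes contain a high proportion of repetitive sequences. Assembling these complex genomes from high-throughput sequencing reads remains a difficult task. Underestimating or ignoring the repeat complexity of these datasets can easily mislead downstream analyses. Detecting repetitive regions by k-mer counting has proven to be reliable. Easy-to-use applications that exploit k-mer counting are in high demand, particularly in plant research.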

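We present Kmasker plants, a tool that uses k-mer count information as an assistant throughout the analysis workflow for genomic data, available as both a command-line and a web-based solution. Beyond its core competence of screening and masking repetitive sequences, we integrated features that enable comparative studies between different cultivars or closely related species, as well as a method for estimating the target specificity of guide RNAs for site-directed mutagenesis with the Cas9 endonuclease. In addition, we set up a web service for Kmasker plants that maintains pre-computed indices for ten of the economically most important cultivated plant species.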

Kmasker plants ‐ a tool for assessing complex sequence space in plant species https://t.co/fcfWCDhQhk
- Uwe Scholz (@UweScholz271), December 11, 2019

Web service FAQ

https://kmasker.ipk-gatersleben.de//?id=faq

tutorial

https://doi.ipk-gatersleben.de/DOI/10fdd0bb-825f-459a-9d08-7c04066208f0/77a28b20-87c1-402b-ae36-47af029956e0/2

Installation

Tested on Ubuntu 18.04 LTS.

Github

# Bioconda (assumes the bioconda channel is configured)
conda create -n Kmasker Kmasker
conda activate Kmasker

Check Kmasker installation

> Kmasker --check_config --verbose

help

> Kmasker

$ Kmasker

Usage of program Kmasker:

(version: 1.1.1 rc231015) (session id: 3shXCd251r)

Description:

Kmasker is a tool for the automatic detection of repetitive sequence regions.

There are three modules and you should select one for your analysis.

Modules:

--build construction of new index (requires --seq)

--run perform analysis and masking (requires --fasta)

--explore perform downstream analysis with constructed index and detected repeats

General options:

--show_repository show complete list of private and external k-mer indices

--show_details show details for a requested kindex

--show_path show path Kmaskers looks for constructed kindex

--remove_kindex remove kindex from repository

--set_private_path change path to private repository

--set_external_path change path to external repository [readonly]

--expert_setting_kmasker submit individual parameter to Kmasker eg. pctgap,

minseed, mingff (see documentation!)

--expert_setting_jelly submit individual parameter to jellyfish (e.g. on memory usage

for index construction)

--expert_setting_blast submit individual parameter to blast (e.g. '-evalue')

--threads set number of threads [4]

--bed force additional BED output [off]

--user_conf set specific user configuration file [/Users/kazu/.kmasker_user.config]

--global_conf set specific global configuration file [/Users/kazu/anaconda3/envs/Kmasker/etc/kmasker.config]

--check_install shows the detected/configured path for all used applications

--setid set a user specified process id

--long_id create a process id that is unique for this host (e.g. for use in cluster environments)

--temp sets the location of temporary files [./temp/]

--verbose enables verbose output and keeps log files

--make_model For use with krispr: Build a new krispr model. You have to specifiy a .csv after this paramter. Details at https://git.io/JecYI. You can use -m to specify the coverage threshold.

> Kmasker --build

Usage

1. build: set the repository path (first time only)

Kmasker --build --set_private_path path/to/directory

2. indexing: build the k-mer index structure

Kmasker --build --seq input.fq --gs 135 --in At1 --cn arabidopsis
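
To make the idea behind the index concrete, here is a minimal sketch of k-mer counting in plain Python. This is not how Kmasker works internally (the help text above indicates that index construction is handed to jellyfish); the function names `count_kmers` and `kmer_profile`, the toy reads, and k=4 are all assumptions chosen for illustration. Real tools usually also account for reverse complements, which this sketch ignores.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a collection of reads."""
    counts = Counter()
    for read in reads:
        read = read.upper()
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if "N" not in kmer:  # skip k-mers containing ambiguous bases
                counts[kmer] += 1
    return counts


def kmer_profile(sequence, counts, k):
    """For each k-mer start position in `sequence`, look up its count in the index."""
    sequence = sequence.upper()
    return [counts.get(sequence[i:i + k], 0)
            for i in range(len(sequence) - k + 1)]


# Toy reads (made up for this example).
reads = ["ACGTACGTAC", "GTACGTACGT", "TTTTGGGGCC"]
index = count_kmers(reads, k=4)

# Positions whose k-mers occur many times in the reads are repeat candidates.
profile = kmer_profile("ACGTACGTTTTG", index, k=4)
print(profile)
```

Positions with a high count relative to the expected sequencing depth are the ones a k-mer based masker would flag as repetitive.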

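3. run: the core Kmasker process

There are four general options: 1) basic k-mer analysis using SINGLE or MULTIPLE index structures, 2) screening for candidate sequences suitable for fluorescence in situ hybridization (FISH), 3) comparative analysis that searches for differences between the applied k-mer index structures, and 4) analysis of short sequence probes to examine their genome-wide specificity.

1) Basic k-mer analysis: specify the index you built and a genome FASTA file.

Kmasker --run --fasta query.fasta --kindex At1

KMASKER_masked_KDX_At1_1Cwrcpk6RY.fasta

Repetitive regions are masked with X. For other uses of the run command, see the GitHub repository and the paper.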
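
As a quick sanity check on a masked FASTA, the sketch below computes the fraction of each sequence that consists of X. The helpers `read_fasta` and `masked_fraction` are hypothetical names of my own, and the example records are made up; to use it on a real result, pass an open file handle for the masked FASTA in place of the example lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def masked_fraction(sequence, mask_char="X"):
    """Return the fraction of positions equal to the mask character."""
    if not sequence:
        return 0.0
    return sequence.upper().count(mask_char) / len(sequence)


# Made-up example standing in for a masked output file.
example = """>chr1
ACGTXXXXACGT
XXXX
>chr2
ACGTACGT
""".splitlines()

for name, seq in read_fasta(example):
    print(name, round(masked_fraction(seq), 3))
```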

Web service

Go to https://kmasker.ipk-gatersleben.de.

Specify your e-mail address and the plant species (pre-built indices for fixed k-mer lengths are maintained on the server).

Upload the genome sequence you want to repeat-mask.

See the FAQ for details on the parameters.

Citation

Kmasker plants ‐ a tool for assessing complex sequence space in plant species
Sebastian Beier, Chris Ulpinnis, Markus Schwalbe, Thomas Münch, Robert Hoffie, Iris Koeppel, Christian Hertig, Nagaveni Budhagatapalli, Stefan Hiekel, Krishna Mohan Pathi, Goetz Hensel, Martin Grosse, Sindy Chamas, Sophia Gerasimova, Jochen Kumlehn, Uwe Scholz, Thomas Schmutzer

Plant J. 2020 May;102(3):631-642

2020-07-05

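GfaViz: an interactive visualization tool for GFA

2019 Bioinformatics assembly graph GFA GUI tool

The graphical fragment assembly (GFA) format is an emerging standard format for representing sequence graphs. While GFA 1 was primarily targeted at assembly graphs, the newer GFA 2 format introduces several features that make it suitable for representing other kinds of information, such as scaffolding graphs, variation graphs, alignment graphs, and colored metagenomic graphs. Here we present GfaViz, an interactive graphical tool for visualizing sequence graphs in GFA format. The software supports all new features of GFA 2 and introduces conventions for their visualization. Users can choose between two different layouts and multiple styles for representing single elements or groups. All customizations can be stored in custom tags of the GFA format, without requiring external configuration files. Stylesheets are supported for storing standard configuration options for groups of files. Visualizations can be exported to raster and vector graphics formats. A command-line interface allows batch generation of images. GfaViz is available at https://github.com/ggonnella/gfaviz.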
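
For readers who have not looked inside a GFA file, here is a minimal sketch that reads the two most common GFA 1 record types: segments (S lines) and links (L lines), both tab-separated. The function name `parse_gfa1` and the three-line toy graph are assumptions for illustration; optional tags and the other record types (including everything specific to GFA 2) are ignored.

```python
def parse_gfa1(lines):
    """Collect segments and links from GFA 1 lines.

    Returns (segments, links), where segments maps name -> sequence and
    links is a list of (from, from_orient, to, to_orient, overlap) tuples.
    """
    segments = {}
    links = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S" and len(fields) >= 3:
            segments[fields[1]] = fields[2]
        elif fields[0] == "L" and len(fields) >= 6:
            links.append(tuple(fields[1:6]))
    return segments, links


# Toy graph: two segments joined by one link with a 3 bp overlap.
example = [
    "H\tVN:Z:1.0",
    "S\ts1\tACGTACGT",
    "S\ts2\tCGTTTAA",
    "L\ts1\t+\ts2\t+\t3M",
]

segments, links = parse_gfa1(example)
print(segments)
print(links)
```

A viewer like GfaViz draws each S record as a segment and each L record as an edge between segment ends.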

specification of the Graphical Fragment Assembly (GFA) format.

GitHub - GFA-spec/GFA-spec: Graphical Fragment Assembly (GFA) Format Specification

Installation

Built from source and tested on Ubuntu 18.04. I used qmake instead of qmake-qt5 to generate the Makefile.

Build dependencies

Qt framework: Qt5 needs to be installed on your system.
GCC version 7.1.0 or newer and clang version 3.8.0 or newer

Main program: GitHub

git clone https://github.com/ggonnella/gfaviz
cd gfaviz
qmake-qt5
# alternatively, to build without SVG support (e.g. if the Qt SVG module is missing)
qmake-qt5 NOSVG=true
make

> ./gfaviz -h

$ ./gfaviz -h

Warning: QStandardPaths: XDG_RUNTIME_DIR not set, defaulting to '/tmp/runtime-kazu'

Usage: ./gfaviz [options] [filenames...]

Options:

-h, --help Displays this help.

-n, --no-gui Disable GUI

-r, --render Render graph(s) into file(s).

-o, --output <filename> Render graph(s) into <filename>

-f, --output-format <format> File format for the output. If no value

is specified, format will be inferred

from the file suffix specified in the

--output option. Possible values: BMP,

PNG, JPG, JPEG, PBM, XBM, XPM, SVG.

Default: PNG

-W, --width <width> Width of the output file in pixels.

-H, --height <height> Height of the output file in pixels.

-t, --transparency Transparent background in rendered

images (only png).

-s, --usestyle <filename> Use the style options represented by the

stylesheet <filename>.

--bg-color <value> Background color.

Default: #00ffffff

--labels Add all labels to the graph.

Default: false

--seg-labels Add segment labels to the graph.

Default: false

--edge-labels Add edge labels to the graph.

Default: false

--gap-labels Add gap labels to the graph.

Default: false

--group-labels Add group labels to the graph.

Default: false

--fragment-labels Add fragment labels to the graph.

Default: false

--no-gaps Do not show gaps in the graph.

Default: false

--no-fragments Do not show fragments in the graph.

Default: false

--no-groups Do not show groups in the graph.

Default: false

--seg-width <value> Width of the segments.

Default: 2.00

--seg-outline-width <value> Width of the segment outline.

Default: 0.50

--seg-color <value> Color of the segment.

Default: #e9ffdd

--seg-outline-color <value> Color of the segment outline.

Default: #000000

--seg-max-sub <value> Maximum number of subsegments in segment

representation.

Default: 10

--seg-as-arrow Indicate segment direction with arrow.

Default: false

--edge-width <value> Width of the links/edges.

Default: 1.00

--edge-color <value> Color of the links/edges.

Default: #000000

--edge-highlights-show Always show highlight of the overlapped

parts of edges on segments.

Default: false

--edge-highlights-color <value> Color of the edge-highlights on

segments. Tip: use low alpha value.

Default: #32ff0000

--edge-highlights-color-random Randomize color of the edge-highlights

on segments.

Default: false

--dovetail-width <value> Width of dovetail links.

Default: 1.00

--dovetail-length <value> Length of dovetail links.

Default: 10.00

--dovetail-color <value> Color of dovetail links.

Default: #000000

--internal-width <value> Width of non-dovetail links.

Default: 1.00

--internal-length <value> Length of internal links.

Default: 80.00

--internal-color <value> Color of non-dovetail links.

Default: #000000

--group-width <value> Width of the groups.

Default: 1.20

--group-colors <value> Colors of the groups, separated by

commas.

Default: red,orange,pink,green

--gap-width <value> Width of the gaps.

Default: 1.00

--gaps-as-edges Use gaps as edges in layout computation.

Default: false

--gap-length <value> Gap length factor if --gaps-as-edges is

set.

Default: 10.00

--gap-color <value> Color of positive-length gaps.

Default: #668074

--neg-gap-color <value> Color of negative-length gaps.

Default: #c09090

--fragment-width <value> Width of the fragments.

Default: 1.00

--fragment-color <value> Color of the fragments.

Default: #009000

--rev-fragment-color <value> Color of the fragments in rev

orientation.

Default: #cf0000

--fwd-fragment-color <value> Color of the fragments in fwd

orientation.

Default: #009000

--fragment-dist <value> Distance of fragments from segments.

Default: 1.00

--fragment-conn-color <value> Color of connection of fragments to

segments.

Default: #000000

--fragment-conn-width <value> Width of connection of fragments to

segments.

Default: 0.50

--fragment-minlength <value> Min length of fragment representation.

Default: 0.10

--fragment-multlength <value> Multiplier of fragment length.

Default: 1.00

--fragment-highlights-show Always show highlight of the overlapped

parts of fragments on segments.

Default: false

--fragment-highlights-color <value> Color of the fragment-highlights on

segments. Tip: use low alpha value.

Default: #32ff00ff

--label-font <value> Font family of all labels.

Default: Arial

--label-size <value> Font point size of all labels.

Default: 4.50

--label-color <value> Font color of all labels.

Default: #000000

--label-outline-width <value> Font outline width of all labels.

Default: 1.00

--label-outline-color <value> Font outline color of all labels.

Default: #ffffff

--seg-label-font <value> Font family of the segment labels.

Default: Arial

--seg-label-size <value> Font point size of the segment labels.

Default: 4.50

--seg-label-color <value> Font color of the segment labels.

Default: #000000

--seg-label-outline-width <value> Font outline width of the segment

labels.

Default: 1.00

--seg-label-outline-color <value> Font outline color of the segment

labels.

Default: #ffffff

--edge-label-font <value> Font family of the edge labels.

Default: Arial

--edge-label-size <value> Font point size of the edge labels.

Default: 4.50

--edge-label-color <value> Font color of the edge labels.

Default: #000000

--edge-label-outline-width <value> Font outline width of the edge labels.

Default: 1.00

--edge-label-outline-color <value> Font outline color of the edge labels.

Default: #ffffff

--group-label-font <value> Font family of the group labels.

Default: Arial

--group-label-size <value> Font point size of the group labels.

Default: 4.50

--group-label-color <value> Font color of the group labels.

Default: #000000

--group-label-outline-width <value> Font outline width of the group labels.

Default: 1.00

--group-label-outline-color <value> Font outline color of the group labels.

Default: #ffffff

--gap-label-font <value> Font family of the gap labels.

Default: Arial

--gap-label-size <value> Font point size of the gap labels.

Default: 4.50

--gap-label-color <value> Font color of the gap labels.

Default: #000000

--gap-label-outline-width <value> Font outline width of the gap labels.

Default: 1.00

--gap-label-outline-color <value> Font outline color of the gap labels.

Default: #ffffff

--frag-label-font <value> Font family of the fragment labels.

Default: Arial

--frag-label-size <value> Font point size of the fragment labels.

Default: 4.50

--frag-label-color <value> Font color of the fragment labels.

Default: #000000

--frag-label-outline-width <value> Font outline width of the fragment

labels.

Default: 1.00

--frag-label-outline-color <value> Font outline color of the fragment

labels.

Default: #ffffff

--seg-label-showlength Show segment length in label.

Default: false

--seg-label-seq Show segment sequence as label.

Default: false

--minweight <value> Minimum length of fragments and

segments, expressed in fraction of the

longest segment length divided by the max

number of subsegments.

Default: 0.20

--weight-factor <value> Weight factor for the computation of

segments and fragment lengths.

Default: 1.00

--fmmm Use the FMMM layouting algorithm, which

is faster than the default (SM).

Default: false

Arguments:

filenames Name of the file(s) to be opened.

Usage

Here I use the GUI version. To run from the command line without the GUI, set the "--no-gui" flag.

./gfaviz

Load a GFA file via File => Open GFA file. Both GFA1 and GFA2 are supported.

Selecting a graph displays information about it in the panel on the right.

Here I ticked the label checkbox.

For editing GFA files, GfaPy is recommended.
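
For batch rendering without the GUI, the options listed in the help output above can be combined roughly as follows. This is a sketch I have not validated against every GfaViz version: the helper name `gfaviz_render_command`, the input file names, and the image size are assumptions, and only flags that appear in the help text (--no-gui, --render, --output, --width, --height) are used.

```python
import shlex


def gfaviz_render_command(gfa_path, image_path, width=1200, height=800,
                          binary="./gfaviz"):
    """Build an argument list for rendering one GFA file to an image.

    The default width and height are arbitrary values chosen for this example.
    """
    return [
        binary,
        "--no-gui",
        "--render",
        "--output", str(image_path),
        "--width", str(width),
        "--height", str(height),
        str(gfa_path),
    ]


# Hypothetical input files; replace with real GFA paths.
for gfa in ["sample1.gfa", "sample2.gfa"]:
    png = gfa.rsplit(".", 1)[0] + ".png"
    print(shlex.join(gfaviz_render_command(gfa, png)))
```

To actually run a command, pass the returned list to `subprocess.run(cmd, check=True)`; building a list rather than a string avoids quoting problems with unusual file names.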

Citation

GfaViz: flexible and interactive visualization of GFA sequence graphs
Giorgio Gonnella, Niklas Niehus, Stefan Kurtz
Bioinformatics, Volume 35, Issue 16, 15 August 2019, Pages 2853–2855


2020-07-04

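PhySpeTree: automatic phylogenetic inference from just a list of species names

2019 BMC Evolutionary Biology rRNA molecular phylogenetic tree phylogenetic analysis automated pipeline

2020 7/6 update

Phylogenetic trees are widely used to infer evolutionary relationships. Existing software and algorithms focus mainly on phylogenetic inference itself. However, less attention has been paid to intermediate steps, such as processing very large sequence sets and preparing configuration files to connect multiple pieces of software. When the number of species is large, these intermediate steps can become a bottleneck that severely affects the efficiency of tree building.
Here we present an easy-to-use pipeline named PhySpeTree to facilitate the reconstruction of phylogenetic trees across bacteria, archaea, and eukaryotes. Users only need to input the abbreviations of species names; PhySpeTree then prepares the complex configuration files for the different software, automatically downloads genomic data, cleans sequences, and builds the tree. Key steps such as sequence alignment and tree construction can be tuned through advanced options. PhySpeTree provides two parallel pipelines, based on concatenated highly conserved proteins and on small subunit ribosomal RNA sequences, respectively. Accessory modules, such as those for inserting new species, generating visualization configurations, and combining trees, are distributed along with PhySpeTree.
Together with its accessory modules, PhySpeTree significantly simplifies tree reconstruction. PhySpeTree is implemented in Python and runs on modern operating systems (Linux, macOS, and Windows). The source code is freely available with detailed documentation (https://github.com/yangfangs/physpetools).

PhySpeTree workflow (reproduced from the GitHub repository).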

https://twitter.com/search?q=PhySpeTree&src=typed_query

Installation

Because GLIBC_2.29 was required, I tested in an Ubuntu 19.10 virtual environment (using Docker).

Main program: GitHub

pip install PhySpeTree

> PhySpeTree -h

$ PhySpeTree -h

usage: PhySpeTree [-h] {autobuild,combine,iview,build,check} ...

-------------------------------------------------------------------------------------

PhySpeTree (0.3.9) - Reconstruct Phylogenetic species Tree

Citation:

PhySpeTree: automatically reconstructing phylogenetic species tree (submitted)

--------------------------------------------------------------------------------------

optional arguments:

-h, --help show this help message and exit

RCONSTRUCT PHYLOGENETIC TREE:

{autobuild,combine,iview,build,check}

autobuild Auto reconstruct phylogenetic tree

combine Combine phylogenetic tree

iview View tree by iTol

build Extend phylogenetic tree with new species

check Check organism database and prepare for extend tree

files

> PhySpeTree autobuild -h

# PhySpeTree autobuild -h

usage: PhySpeTree autobuild [-h] [-i [SPENAMES]] [-o OUTDATA] [-t THREAD]

[-e EXTENDDATA] [--hcp] [--ehcp] [--srna]

[--esrna] [-db DB] [--muscle]

[--muscle_p MUSCLE_PARAMETER] [--clustalw]

[--clustalw_p CLUSTALW_PARAMETER] [--mafft]

[--mafft_p MAFFT_PARAMETER] [--gblocks]

[--gblocks_p GBLOCKS_PARAMETER] [--trimal]

[--trimal_p TRIMAL_PARAMETER] [--raxml]

[--raxml_p RAXML_PARAMETER] [--fasttree]

[--fasttree_p FASTTREE_PARAMETER] [--iqtree]

[--iqtree_p IQTREE_PARAMETER]

optional arguments:

-h, --help show this help message and exit

AUTOBUILD OPTIONS:

-i [SPENAMES] Input a TXT file contain the species names

(abbreviated names) are same with KEGG species

abbreviation.

-o OUTDATA A directory include output data (tree files). The

default name is Outdata.

-t THREAD Specify the number of processing threads (CPUs) to

reconstruct phylogenetic tree. The default is 1.

-e EXTENDDATA The extended data should be FASTA format to extend

phylogenetic tree by --ehcp or --esrna option.

--hcp Specify the hcp (highly conserved protein) method to

reconstruct phylogenetic tree. The default method is

hcp.

--ehcp The ehcp mode is use highly conserved proteins with

extend highly conserved protein (users provide) to

reconstruct phylogenetic tree.

--srna The srna (SSU rRNA) method is use SSU rRNA data to

reconstruct phylogenetic tree.

--esrna The esrna mode is use SSU RNA sequence with extend SSU

RNA sequence (users provide) to reconstruct

phylogenetic tree.

-db DB The absolute path for local database.

ADVANCE OPTIONS:

--muscle Multiple sequence alignment by muscle. The default

multiple sequence alignment software is Muscle.

--muscle_p MUSCLE_PARAMETER

Set Muscle advance parameters. The default is -maxiter

100.

--clustalw Multiple sequence alignment by clustalw2.

--clustalw_p CLUSTALW_PARAMETER

Set clustalw2 advance parameters. Here use clustalw

default parameters.

--mafft Multiple sequence alignment by mafft.

--mafft_p MAFFT_PARAMETER

Set mafft advance parameters. Here use mafft default

parameters.

--gblocks Trim by Gblocks.

--gblocks_p GBLOCKS_PARAMETER

Set Gblocks advance parameters.

--trimal Trim by trimal.

--trimal_p TRIMAL_PARAMETER

Set trimal advance parameters.

--raxml Reconstruct phylogenetic tree by RAxML. The default

build tree software is RAxML.

--raxml_p RAXML_PARAMETER

Set RAxML advance parameters.

--fasttree Reconstruct phylogenetic tree by FastTree.

--fasttree_p FASTTREE_PARAMETER

Set FastTree advance parameters.

--iqtree Reconstruct phylogenetic tree by iqtree.

--iqtree_p IQTREE_PARAMETER

Set iqtree advance parameters.

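Usage

autobuild: automatically download sequences from the database and infer the phylogeny

Specify a text file listing species-name abbreviations (example: organism_example_list.txt). The set of highly conserved proteins is downloaded automatically (*1) and phylogenetic inference is run. To switch to SSU rRNA, set the "--srna" flag.

PhySpeTree autobuild -i organism_example_list.txt -t 20 --hcp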

--hcp Specify the hcp (highly conserved protein) method to reconstruct phylogenetic tree. The default method is hcp.
--srna The srna (SSU rRNA) method is use SSU rRNA data to reconstruct phylogenetic tree.

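Output

I loaded the tree into iTOL.

Alternatively, specify detailed settings: the inference method, the substitution model, and other parameters can all be set explicitly.

Phylogenetic inference from SSU rRNA. Multiple alignment uses muscle, the default (alternatives: mafft/clustalw), alignment trimming uses Gblocks (alternative: trimAl), and tree inference uses RAxML (alternatives: fasttree/iqtree). Detailed RAxML parameters such as the substitution model are passed as well (see the RAxML manual).

PhySpeTree autobuild -i organism_example_list.txt -o test -t 12 --srna --raxml --raxml_p ' -f a -m GTRGAMMA -p 12345 -x 12345 -# 100 -n T1'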

--srna The srna (SSU rRNA) method is use SSU rRNA data to reconstruct phylogenetic tree
-o A directory include output data (tree files). The default name is Outdata.
-t Specify the number of processing threads (CPUs) to reconstruct phylogenetic tree. The default is 1.
--muscle Multiple sequence alignment by muscle. The default multiple sequence alignment software is Muscle.
--raxml Reconstruct phylogenetic tree by RAxML. The default build tree software is RAxML.
--raxml_p Set RAxML advance parameters.
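
When the command above is launched from a script, the RAxML parameter string has to reach PhySpeTree as one single argument. The sketch below builds the argument list in Python and checks the quoting with `shlex`; the variable names are my own, and the values are simply those from the command above. The leading space inside the --raxml_p value is copied from the original command; my reading (an assumption, not something stated in the help text) is that it keeps the option parser from mistaking "-f" for another PhySpeTree flag.

```python
import shlex

# RAxML parameters, passed through verbatim; note the leading space.
raxml_params = " -f a -m GTRGAMMA -p 12345 -x 12345 -# 100 -n T1"

argv = [
    "PhySpeTree", "autobuild",
    "-i", "organism_example_list.txt",
    "-o", "test",
    "-t", "12",
    "--srna",
    "--raxml",
    "--raxml_p", raxml_params,
]

# shlex.join adds the quotes a shell would need to keep raxml_params intact.
command = shlex.join(argv)
print(command)
```

Passing `argv` directly to `subprocess.run` sidesteps shell quoting altogether, which is the safer choice for parameter strings like this one.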

There are also commands for merging multiple tree files and for generating annotation files for viewing in iTOL. See the GitHub repository for details.

Citation

PhySpeTree: an automated pipeline for reconstructing phylogenetic species trees

Yang Fang, Chengcheng Liu, Jiangyi Lin, Xufeng Li, Kambiz N. Alavian, Yi Yang, Yulong Niu
BMC Evolutionary Biology volume 19, Article number: 219 (2019)

*1 For the protein sequences used, see the paper and the GitHub README (at the bottom).

2020-07-03

RepeatModeler2: de novo discovery of TEs

2020 PNAS Long Terminal Repeat retrotransposons (LTR-RTs) transposon repetitive sequences large genome

2020 7/5 added the ProcessRepeats help

2020 7/6 corrected step 3

2020 7/7 fixed an error in the ProcessRepeats command

2022/04/18 update

2023/07/24 update

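The accelerating pace of genome sequencing throughout the tree of life is driving the need for improved unsupervised annotation of genome components such as transposable elements (TEs). Because the types and sequences of TEs vary widely across species, automated TE discovery and annotation are challenging and time-consuming tasks. A critical first step is the de novo identification and accurate compilation of sequence models representing all of the unique TE families dispersed in the genome. Here we introduce RepeatModeler2, a pipeline that greatly facilitates this process. This program brings substantial improvements over the original version of RepeatModeler, one of the most widely used tools for TE discovery. In particular, this version incorporates a module for structural discovery of complete long terminal repeat (LTR) retroelements, which are widespread in eukaryotic genomes but difficult to identify automatically because of their size and sequence complexity. The authors benchmarked RepeatModeler2 on three model species with diverse TE landscapes and high-quality, manually curated TE libraries: Drosophila melanogaster (fruit fly), Danio rerio (zebrafish), and Oryza sativa (rice). In these three species, RepeatModeler2 identified roughly three times more consensus sequences with at least 95% sequence identity and sequence coverage to the manually curated sequences than the original RepeatModeler. As expected, the greatest improvement was for LTR retroelements. RepeatModeler2 is thus a valuable addition to the genome annotation toolkit that will enhance the identification and study of TEs in eukaryotic genome sequences. RepeatModeler2 is available as source code or as a containerized package under an open license (https://github.com/Dfam-consortium/RepeatModeler, http://www.repeatmasker.org/RepeatModeler/).

RepeatModeler was released in 2008 by Hubley and Smit and is one of the most widely used TE discovery tools (1,462 citations as of November 21, 2019). RepeatModeler builds seed alignments and consensus sequences of repeat families across the genome de novo. However, the original version of RepeatModeler, like other existing TE discovery software, falls short of producing a nonredundant library of full-length consensus sequences. The most problematic issue is resolving what should be a single contiguous consensus sequence for a given TE family from among the many fragmented, partially redundant sequences in the output library. This problem, in turn, hinders the classification of TE families, inflates the apparent number of TE families in the genome, and can confound genome annotation and downstream analyses. LTR retroelements are particularly resistant to automated TE family identification because of their size (up to 20 kilobase pairs [kbp]) and the complexity of their sequence and organization. Yet these elements are widespread in eukaryotic genomes and are often highly abundant and diverse. For example, the maize reference genome contains more than 100,000 LTR elements, classified into about 20,000 distinct families, which account for roughly half of the genomic DNA (ref. 29).

To address these issues, the authors developed an improved version of RepeatModeler. In particular, they integrated an optional module that identifies LTR elements in the genome from their structural features (refs. 30, 31). Benchmarking on three diverse model species shows that RepeatModeler2 substantially improves on the previous version in both detection sensitivity and consensus sequence quality. This open-source package is designed to run on single- and multiprocessor computers and is provided as a source distribution or as a Docker/Singularity container for ease of installation. (Remainder omitted.)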

http://www.repeatmasker.org/RepeatModeler/

Installation

Installed with conda. The bioconda recipe shows that ABBlast and NINJA are not installed by conda. To also run the LTR search, NINJA (the cluster-only version) is required.

Dependencies
Prerequisites

Perl
RepeatMasker & Libraries
RECON - De Novo Repeat Finder
RepeatScout - De Novo Repeat Finder
TRF - Tandem Repeat Finder
RMBlast - A modified version of NCBI Blast for use with RepeatMasker and RepeatModeler.

Optional. Additional search engine:

ABBlast

Optional. Required for running LTR structural search pipeline:

LtrHarvest - The LtrHarvest program is part of the GenomeTools suite.
Ltr_retriever - A LTR discovery post-processing and filtering tool.
MAFFT
CD-HIT
Ninja - A tool for large-scale neighbor-joining phylogeny inference and clustering. We developed and tested RepeatModeler using Ninja version "0.95-cluster_only".

Main program: GitHub

# or Bioconda; here I use the faster mamba
mamba create -n repeatmodeler python=3.11 -y
conda activate repeatmodeler
mamba install -c bioconda repeatmodeler -y

> RepeatModeler -h

$ RepeatModeler -h

Unknown option: h

/Users/kazu/anaconda3/envs/repeatmodeler/share/RepeatModeler/RepeatModeler - 2.0.1

NAME

RepeatModeler - Model repetitive DNA

SYNOPSIS

RepeatModeler [-options] -database <XDF Database>

DESCRIPTION

The options are:

-h(elp)

Detailed help

-database

The name of the sequence database to run an analysis on. This is the

name that was provided to the BuildDatabase script using the "-name"

option.

-pa #

Specify the number of parallel search jobs to run. RMBlast jobs will

use 4 cores each and ABBlast jobs will use a single core each. i.e.

on a machine with 12 cores and running with RMBlast you would use

-pa 3 to fully utilize the machine.

-recoverDir <Previous Output Directory>

If a run fails in the middle of processing, it may be possible

recover some results and continue where the previous run left off.

Simply supply the output directory where the results of the failed

run were saved and the program will attempt to recover and continue

the run.

-srand #

Optionally set the seed of the random number generator to a known

value before the batches are randomly selected ( using Fisher Yates

Shuffling ). This is only useful if you need to reproduce the sample

choice between runs. This should be an integer number.

-LTRStruct

Run the LTR structural discovery pipeline ( LTR_Harvest and

LTR_retreiver ) and combine results with the RepeatScout/RECON

pipeline. [optional]

-genomeSampleSizeMax #

Optionally change the maximum bp of the genome to sample in all

rounds of RECON (default=243000000).

CONFIGURATION OVERRIDES

-mafft_dir <string>

The path to the installation of the MAFFT multiple alignment

program.

-repeatmasker_dir <string>

The path to the installation of RepeatMasker.

-trf_prgm <string>

The full path including the name for the TRF program ( 4.0.9 or

higher )

-rscout_dir <string>

The path to the installation of the RepeatScout ( 1.0.6 or higher )

de-novo repeatfinding program.

-ninja_dir <string>

The path to the installation of the Ninja phylogenetic analysis

package.

-rmblast_dir <string>

The path to the installation of the RMBLAST sequence alignment

program.

-abblast_dir <string>

The path to the installation of the ABBLAST sequence alignment

program.

-recon_dir <string>

The path to the installation of the RECON de-novo repeatfinding

program.

-genometools_dir <string>

The path to the installation of the GenomeTools package.

-cdhit_dir <string>

The path to the installation of the CD-Hit sequence clustering

package.

-ltr_retriever_dir <string>

The path to the installation of the LTR_Retriever structural LTR

analysis package.

SEE ALSO

RepeatMasker, RMBlast

AUTHOR

RepeatModeler:

Robert Hubley <rhubley@systemsbiology.org>

Arian Smit <asmit@systemsbiology.org>

LTR Pipeline Extensions:

Jullien Michelle Flynn <jmf422@cornell.edu>
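
The -pa description in the help above implies a simple rule of thumb: each RMBlast job uses 4 cores and each ABBlast job uses 1, so the value to pass is the core count divided by the cores per job. A minimal sketch of that arithmetic follows, assuming the 2.0.1 behaviour shown in the help text; the function name `suggest_pa` is hypothetical.

```python
def suggest_pa(total_cores, engine="rmblast"):
    """Suggest a value for -pa given the number of available cores.

    Cores per job follow the RepeatModeler 2.0.1 help text:
    4 per RMBlast job, 1 per ABBlast job.
    """
    cores_per_job = {"rmblast": 4, "abblast": 1}
    if engine not in cores_per_job:
        raise ValueError(f"unknown engine: {engine}")
    # Always request at least one job, even on very small machines.
    return max(1, total_cores // cores_per_job[engine])


print(suggest_pa(12))             # the 12-core RMBlast case from the help text
print(suggest_pa(12, "abblast"))
```

Other versions may size jobs differently, so check the help of the installed version before relying on this.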

> BuildDatabase -h

$ BuildDatabase -h

No query sequence file indicated

/Users/kazu/anaconda3/envs/repeatmodeler/share/RepeatModeler/BuildDatabase - 2.0.1

NAME

BuildDatabase - Format FASTA files for use with RepeatModeler

SYNOPSIS

BuildDatabase [-options] -name "mydb" <seqfile(s) in fasta format>

BuildDatabase [-options] -name "mydb"

-dir <dir containing fasta files *.fa, *.fasta,

*.fast, *.FA, *.FASTA, *.FAST, *.dna,

and *.DNA >

BuildDatabase [-options] -name "mydb"

-batch <file containing a list of fasta files>

DESCRIPTION

This is basically a wrapper around AB-Blast's and NCBI Blast's

DB formating programs. It assists in aggregating files for processing

into a single database. Source files can be specified by:

- Placing the names of the FASTA files on the command

line.

- Providing the name of a directory containing FASTA files

with the file suffixes *.fa or *.fasta.

- Providing the name of a manifest file which contains the

names of FASTA files ( fully qualified ) one per line.

NOTE: Sequence identifiers are not preserved in this database. Each

sequence is assigned a new GI ( starting from 1 ). The

translation back to the original sequence is preserved in the

*.translation file.

The options are:

-h(elp)

Detailed help

-name <database name>

The name of the database to create.

-engine <engine name>

The name of the search engine we are using. I.e abblast/wublast or

rmblast.

-dir <directory>

The name of a directory containing fasta files to be processed. The

files are recognized by their suffix. Only *.fa and *.fasta files

are processed.

-batch <file>

The name of a file which contains the names of fasta files to

process. The files names are listed one per line and should be fully

qualified.

SEE ALSO

RepeatModeler, RMBlast

AUTHOR

Robert Hubley <rhubley@systemsbiology.org>

> ProcessRepeats -h

$ ProcessRepeats -h

No cat file indicated

NAME

ProcessRepeats - Post process results from RepeatMasker and produce an

annotation file.

SYNOPSIS

ProcessRepeats [-options] <RepeatMasker *.cat file>

DESCRIPTION

The options are:

-h(elp)

Detailed help

-species <query species>

Post process RepeatMasker results run on sequence from this species.

Default is human.

-lib <libfile>

Skips most processing, does not produce a .tbl file unless the

custome library is in the ">name#class" format.

-nolow

Does not display simple repeats or low_complexity DNA in the

annotation.

-noint

Skips steps specific to interspersed repeats, saving lots of time.

-lcambig

Outputs ambiguous DNA transposon fragments using a lower case name.

All other repeats are listed in upper case. Ambiguous fragments

match multiple repeat elements and can only be called based on

flanking repeat information.

-u Creates an untouched annotation file besides the manipulated file.

-xm Creates an additional output file in cross_match format (for

parsing).

-ace

Creates an additional output file in ACeDB format.

-gff

Creates an additional Gene Feature Finding format.

-poly

Creates an output file listing only potentially polymorphic simple

repeats.

-no_id

Leaves out final column with unique number for each element (was

default).

-excln

Calculates repeat densities excluding long stretches of Ns in the

query.

-orf2

Results in sometimes negative coordinates for L1 elements; all L1

subfamilies are aligned over the ORF2 region, sometimes improving

interpretation of data.

-a Shows the alignments in a .align output file.

-maskSource <originalSeqenceFile>

Instructs ProcessRepeats to mask the sequence file using the

annotation.

-x Mask repeats with a lower case 'x'.

-xsmall

Mask repeats by making the sequence lowercase.

SEE ALSO

RepeatMasker, Crossmatch, Blast

Biology

AUTHORS

Arian Smit <asmit@systemsbiology.org>

Robert Hubley <rhubley@systemsbiology.org>

The RepeatMasker help is included as well.

> RepeatMasker -h

$ RepeatMasker -h

RepeatMasker version open-4.0.9

Option h is ambiguous (help, html)

NAME

RepeatMasker - Mask repetitive DNA

SYNOPSIS

RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION

The options are:

-h(elp)

Detailed help

Default settings are for masking all type of repeats in a primate

sequence.

-e(ngine) [crossmatch|wublast|abblast|ncbi|rmblast|hmmer]

Use an alternate search engine to the default. Note: 'ncbi' and

'rmblast' are both aliases for the rmblastn search engine engine.

The generic NCBI blastn program is not sensitive enough for use with

RepeatMasker at this time.

-pa(rallel) [number]

The number of processors to use in parallel (only works for batch

files or sequences over 50 kb)

-s Slow search; 0-5% more sensitive, 2-3 times slower than default

-q Quick search; 5-10% less sensitive, 2-5 times faster than default

-qq Rush job; about 10% less sensitive, 4->10 times faster than default

(quick searches are fine under most circumstances)

Repeat options

-nolow

Does not mask low_complexity DNA or simple repeats

-noint

Only masks low complex/simple repeats (no interspersed repeats)

-norna

Does not mask small RNA (pseudo) genes

-alu

Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

-div [number]

Masks only those repeats < x percent diverged from consensus seq

-lib [filename]

Allows use of a custom library (e.g. from another species)

-cutoff [number]

Sets cutoff score for masking repeats when using -lib (default 225)

-species <query species>

Specify the species or clade of the input sequence. The species name

must be a valid NCBI Taxonomy Database species name and be contained

in the RepeatMasker repeat database. Some examples are:

-species human

-species mouse

-species rattus

-species "ciona savignyi"

-species arabidopsis

Other commonly used species:

mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,

danio, "ciona intestinalis" drosophila, anopheles, worm, diatoaea,

artiodactyl, arabidopsis, rice, wheat, and maize

Contamination options

-is_only

Only clips E coli insertion elements out of fasta and .qual files

-is_clip

Clips IS elements before analysis (default: IS only reported)

-no_is

Skips bacterial insertion element check

Running options

-gc [number]

Use matrices calculated for 'number' percentage background GC level

-gccalc

RepeatMasker calculates the GC content even for batch files/small

seqs

-frag [number]

Maximum sequence length masked without fragmenting (default 60000)

-nocut

Skips the steps in which repeats are excised

-noisy

Prints search engine progress report to screen (defaults to .stderr

file)

-nopost

Do not postprocess the results of the run ( i.e. call ProcessRepeats

). NOTE: This options should only be used when ProcessRepeats will

be run manually on the results.

output options

-dir [directory name]

Writes output to this directory (default is query file directory,

"-dir ." will write to current directory).

-a(lignments)

Writes alignments in .align output file

-inv

Alignments are presented in the orientation of the repeat (with

option -a)

-lcambig

Outputs ambiguous DNA transposon fragments using a lower case name.

All other repeats are listed in upper case. Ambiguous fragments

match multiple repeat elements and can only be called based on

flanking repeat information.

-small

Returns complete .masked sequence in lower case

-xsmall

Returns repetitive regions in lowercase (rest capitals) rather than

masked

-x Returns repetitive regions masked with Xs rather than Ns

-poly

Reports simple repeats that may be polymorphic (in file.poly)

-source

Includes for each annotation the HSP "evidence". Currently this

option is only available with the "-html" output format listed

below.

-html

Creates an additional output file in xhtml format.

-ace

Creates an additional output file in ACeDB format

-gff

Creates an additional Gene Feature Finding format output

-u Creates an additional annotation file not processed by

ProcessRepeats

-xm Creates an additional output file in cross_match format (for

parsing)

-no_id

Leaves out final column with unique ID for each element (was

default)

-e(xcln)

Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in

the query

SEE ALSO

Crossmatch, ProcessRepeats

AUTHORS

Arian Smit <asmit@systemsbiology.org>

Robert Hubley <rhubley@systemsbiology.org>

With the container below, RepeatModeler, RepeatMasker, and coseg are all available.

Dockerhub

How to run

1. Build the database.

BuildDatabase -name prefix input_genome.fa

-name The name of the database to create

Output

[Screenshot: files created by BuildDatabase]

2. Run RepeatModeler. The thread count used to be set with -pa, but is now set with -threads.

RepeatModeler -database prefix -threads 20

# The run takes at least several hours, so the GitHub README strongly recommends running it under nohup. Because of nohup, the progress log is redirected to a file.
nohup RepeatModeler -database prefix -threads 20 >& run.out &

-database The name of the sequence database to run an analysis on. This is the name that was provided to the BuildDatabase script using the "-name" option.
-threads Specify the maximum number of threads which can be used by the program at any one time. Note that the '-pa' parameter in previous releases controlled the number of sequence batches compared in parallel using rmblastn (each running 4 threads). Therefore, if '-pa 4' was used previously the new thread parameter should be set to '-threads 16'.
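
The conversion described in the help text above is simple arithmetic. As a minimal sketch, assuming each old parallel batch ran with 4 threads as the help states (the function name `pa_to_threads` is made up for this illustration and is not part of RepeatModeler):

```python
def pa_to_threads(pa, threads_per_batch=4):
    """Translate a legacy '-pa' value into a '-threads' value.

    Assumes, as the help text states, that each parallel batch ran
    rmblastn with 4 threads.
    """
    if pa < 1:
        raise ValueError("pa must be a positive integer")
    return pa * threads_per_batch


if __name__ == "__main__":
    print(pa_to_threads(4))  # 16, the example given in the help text
```

So an old command that used `-pa 5` would correspond to `-threads 20` under this assumption.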

Various files are written. The output directory name includes the run date, so it is somewhat long (e.g. RM_88440.TueJul71311202020). *2

tmpBlastXResults.out.bxsummary


The seed alignment file (.stk) is in Dfam-compatible Stockholm format and can be submitted to the Dfam database via help@dfam.org (from the manual).

3. Run RepeatMasker, specifying as the library the FASTA file of repeat consensus sequences obtained above (redundant sequences merged into consensus sequences). Repeats in input_genome.fa are replaced with N (*1).

RepeatMasker -pa 20 -html -gff -small -lib outdir/tmpConsensus.fa.masked input_genome.fa

-html Creates an additional output file in xhtml format.
-gff Creates an additional Gene Feature Finding format output
-small Returns complete .masked sequence in lower case
-lib Allows use of a custom library (e.g. from another species)

Output

The .tbl file is the summary file.

out.html (the additional xhtml output requested with -html)

4. Run ProcessRepeats to soft-mask the repeats. It uses the input_genome.fa.cat file produced by RepeatMasker in step 3, so the step 3 command must be run first (see also the comment by YNS).

ProcessRepeats -maskSource input_genome.fa -xsmall -gff input_genome.fa.cat

-maskSource Instructs ProcessRepeats to mask the sequence file using the annotation.
-xsmall Mask repeats by making the sequence lowercase.
-gff Creates an additional Gene Feature Finding format output

input_genome.fa.masked, GFF files, and other files are written.

Addendum

Run the LTR search as part of the RepeatModeler run as well.

RepeatModeler -database prefix -threads 20 -LTRStruct

-LTRStruct Run the LTR structural discovery pipeline ( LTR_Harvest and LTR_retriever ) and combine results with the RepeatScout/RECON pipeline. [optional]

This cannot run without NINJA. Download it from

https://github.com/TravisWheelerLab/NINJA/releases/tag/0.95-cluster_only

and build it with make. Once NINJA is built, set $NINJA_DIR.

export NINJA_DIR=<path>/<to>/NINJA-0.95-cluster_only/NINJA/

When applied to a plant genome assembly, RepeatMasker with the default repeat library distributed via conda masked only 2% of the genome, whereas predicting repeats with RepeatModeler2 and then running RepeatMasker with that library masked 34%.
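
RepeatMasker's .tbl summary already reports the masked percentage, but the same figure can be cross-checked directly from a masked FASTA file. The following is only a sketch: `masked_fraction` is a hypothetical helper, it counts lowercase bases for soft-masked input and N/X for hard-masked input, and it does not distinguish masked N from assembly gaps that were already N.

```python
def masked_fraction(fasta_text, soft=True):
    """Return the fraction of bases that are masked in FASTA-formatted text."""
    total = 0
    masked = 0
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and FASTA headers
        total += len(line)
        if soft:
            # soft masking: repeats are written in lowercase
            masked += sum(1 for base in line if base.islower())
        else:
            # hard masking: repeats are replaced with N (or X with -x)
            masked += sum(1 for base in line if base in "NX")
    return masked / total if total else 0.0


example = ">chr1\nACGTacgtAC\n>chr2\nNNNNNACGTA\n"
print(masked_fraction(example, soft=True))   # 0.2
print(masked_fraction(example, soft=False))  # 0.25
```

For a real genome, read the file line by line instead of loading it as one string.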

Citation

RepeatModeler2 for automated genomic discovery of transposable element families
Jullien M. Flynn, Robert Hubley, Clément Goubert, Jeb Rosen, Andrew G. Clark, Cédric Feschotte, and Arian F. Smit

PNAS April 28, 2020 117 (17) 9451-9457; first published April 16, 2020

Reference

https://heavywatal.github.io/bio/repeatmasker.html

*1

Masking generally means replacing repeats with N. Simply switching the bases to lowercase, instead of replacing them with N, is called soft masking.
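
To make the difference concrete, here is a small illustration of the two masking styles on a toy sequence. The `mask` function and the 0-based, end-exclusive interval convention are assumptions made for this sketch; RepeatMasker and ProcessRepeats do this internally from their own annotation.

```python
def mask(seq, intervals, soft=True):
    """Mask the given (start, end) intervals of seq.

    Intervals are 0-based and end-exclusive (an assumption of this sketch).
    soft=True lowercases the repeat; soft=False replaces it with N.
    """
    bases = list(seq.upper())
    for start, end in intervals:
        for i in range(start, min(end, len(bases))):
            bases[i] = bases[i].lower() if soft else "N"
    return "".join(bases)


seq = "ACGTACGT" + "TTTTTT" + "ACGT"
repeats = [(8, 14)]  # the run of six T's
print(mask(seq, repeats, soft=True))   # ACGTACGTttttttACGT
print(mask(seq, repeats, soft=False))  # ACGTACGTNNNNNNACGT
```

Soft masking keeps the sequence information, which is why many downstream tools prefer it.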

RepeatMasker issue: Setting up library from fasta file (PGSB-REdat) #13

https://github.com/rmhubley/RepeatMasker/issues/13

2020-07-02

Interactive online phylogenetic tree tool: Interactive Tree Of Life (iTOL) v4

2007 2011 2016 2019 Bioinformatics Nucleic Acids Research web tool molecular phylogenetics for beginners metadata multi-omics visualization of phylogenetic analysis results multiple sequence alignment (MSA) heatmap phylogenetic tree viewer

2020 7/2: typo fixes

2021 4/27: added link to the v5 paper

2022 8/27: addendum

2024/04/21: added the v6 paper

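　Phylogenetic trees are an important tool in biology and other sciences, and also serve to contextualize various data types. This is evident from how frequently tools for building such trees are used (MEGA (ref. 2)). Visualization of such trees has been covered by various software tools over the years, while iTOL (ref. 5) introduced tree annotation with various types of additional data. Today, a range of software packages, both online and standalone, offer tree annotation features, such as ETE toolkit (ref. 6), Dendroscope (ref. 7), and Evolview (ref. 8). Here we report recent developments in iTOL that extend and streamline its functionality, making it more powerful and easier to use.
　iTOL is an online tool accessible from any modern web browser. The tree display engine is implemented in pure JavaScript and uses HTML5 Canvas for visualization.
　iTOL supports the commonly used phylogenetic tree formats: Newick, Nexus, and phyloXML. Phylogenetic placement files created by EPA and pplacer are also supported. The current version introduces support for QIIME 2 tree and annotation files. (partly omitted)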

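　iTOL provides most of the common features available in other tree viewers. In addition to the standard display formats (rectangular, circular and unrooted), iTOL v4 supports a slanted display mode. Trees can be manipulated in various ways; the basic editing functions allow interactive deletion or moving of single nodes or whole clades. Branches can also be pruned or collapsed, manually or automatically, based on various parameters (such as the associated bootstrap values or average branch length distances). Trees can be re-rooted manually at any node, or automatically using the midpoint rooting method. Tree leaves can be sorted in various ways, manually or automatically.

　The current version introduces several new annotation features, extends user control over individual display elements, and adds four new dataset types (Figure 2 of the paper).

　iTOL supports individual styles and colors for each node and label in the tree. The current version fully supports the UTF-8 character set throughout the user interface and allows any font from the Google Web Fonts list, greatly expanding support for different fonts and font styles. In addition, the background color of any text label can be changed independently.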

https://twitter.com/search?q=iTOL%20v4&src=typed_query

video tutorial

https://itol.embl.de/video_tutorial.cgi

help

https://itol.embl.de/help.cgi

gallery

https://itol.embl.de/gallery.cgi

Test data

Example annotation data can be downloaded from the help page (direct link). The archive also contains a tree file. This demo data is used here.

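Web service

Go to https://itol.embl.de/. I had not noticed, but at some point it became v6.

Upload a tree file with the Upload tree button at the bottom center of the page, or from the menu at the top. Newick, Nexus, and phyloXML formats are supported, as are QIIME 2 tree files and others.

Creating an account and logging in lets you manage multiple trees, which is convenient. Without logging in, this management page is not shown and the tree is displayed directly.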

1. Basic operations

The example data was loaded. Selecting a tree file in the management page visualizes the tree. The view is interactive: zoom with the mouse wheel and pan by dragging. The same operations are available from the buttons on the left edge.

The tree display can be changed from the window on the right.

The Display mode is Circular by default.

Normal: an ordinary rectangular tree.

Unrooted

Normal, Slanted => ON

Normal, Dashed line => 3, Branch lines => 2

Labels => At tips

Labels => At tips & Label shift 60, Label Font => Times New Roman.

Next, a quick look at the Advanced features.

Branch length => Display, Font size 16, round to 2 decimals (number of decimal places shown)

Branch length => Display, Display as age => ON.

With the age display option, node ages are shown instead of branch length values. The farthest node in the tree has age zero, and ages increase toward the root of the tree.
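
As a rough illustration of this age convention, the sketch below computes ages as "distance of the farthest node from the root, minus the node's own distance from the root". This is my reading of the description above, not iTOL's actual code, and the tree representation (a `parent` map and a `length` map) is an assumption of the example.

```python
def node_ages(parent, length, root):
    """Compute node ages from a tree given as child->parent and child->branch length maps."""
    depth = {root: 0.0}

    def depth_of(node):
        # distance from the root, accumulated along the path of parents
        if node not in depth:
            depth[node] = depth_of(parent[node]) + length[node]
        return depth[node]

    for node in parent:
        depth_of(node)
    farthest = max(depth.values())
    # the farthest node gets age 0; ages grow toward the root
    return {node: farthest - d for node, d in depth.items()}


parent = {"A": "root", "X": "root", "B": "X", "C": "X"}
length = {"A": 1.0, "X": 0.5, "B": 1.5, "C": 0.5}
print(node_ages(parent, length, "root"))
```

In this toy tree, leaf B is farthest from the root (distance 2.0), so B has age 0 and the root has age 2.0.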

Bootstrap values can be displayed in four ways: symbol mode, text mode, color mode, and width mode.

bootstrap => Display & symbol mode, Legend On.

Bootstrap values can be displayed only if the tree contains bootstrap information.

bootstrap => Display & text mode, Font size 13, position on branch 30.

bootstrap => Display & color mode, Legend On, Display range 50-100.

bootstrap => Display & width mode, max width 3, min width 1.

Delete low-confidence branches with bootstrap values below 50 (the threshold depends on the experiment), and collapse nodes with short branch lengths (< 0.2).

Auto collapse clades < 0.2, Delete branches < 50.

When clades are collapsed, a menu appears for choosing the shape of the collapsed parts.

Collapsed nodes => circle.

Collapsed branches can be expanded again by clicking them.

Internal tree scale => Display, interval1 => 0.2 & dark blue, font size 27, Set root to 1997 & Scaling factor 10.

In time scale mode, the tree scale can be customized to show values other than branch lengths (note that this option is meaningful only when the tree has been estimated and calibrated in a way appropriate for the data).

If the tree is drawn with supported node IDs, pressing the Auto assign taxonomy button at the bottom right and then running Reset tree once replaces the IDs with the corresponding names. Here the numbers change to species names.

Use the Save/restore button at the bottom left of the menu to save the settings under a name. The current settings can then be used as the default settings.

Save as => choose a name and save.

Saved settings can be recalled at any time, which is very convenient.

Clicking a branch lets you change the font color or background of a whole clade or of a specific node.

Color => Set clade color

Red was selected.

Only the selected clade turned red.

The branches of the red clade were changed from solid to dashed lines.

Style => Clade => Dashed line.

Leaf labels => set labels color changes the label color.

The font was set to green, bold, size 3.

Leaf labels => set labels background changes the label background color.

The background was set to light green.

Tidying up the backgrounds gives each clade a unified look, as in the circular tree-of-life figure.

Addendum

To change the background of a whole clade, set a color with Color => New color group. With the method above, uncolored gaps remain between nodes.

Delete unneeded clades. Think carefully before doing this.

Tree structure => Delete clade.

Copy leaf labels: copies only the labels of the selected clade.

The labels were pasted into a text editor.

Reroot tree here

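2023/08 addendum

From the bottom of the Advanced menu you can now collapse, in bulk, sister lineages within a specified branch length, collapse in bulk branches below or above a given bootstrap value, and delete in bulk nodes with no more than a given number of leaves. Collapses can be undone with the un-collapse all button.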

2. Advanced operations

From here on, we look at the workflow that uses config files, which allow bulk settings and the display of annotation data. The example archive above is used. It contains various config files, which are used here to show the kinds of annotation iTOL v4 can display.

[Screenshot: contents of the example folder]

To use one, simply drag and drop the annotation file onto the displayed tree.

As a test, dragging colors_tol.txt onto the page applied the annotation immediately. colors_tol.txt defines node font sizes and colors, so those are what is applied.

Next, ranges.txt was loaded. ranges.txt specifies ranges in bulk. The display is in circular mode.

Addendum

To color the background of a whole clade with a range, either set the leaf label position to At tips, or choose Clade or Full with the button at the bottom of the ranges window.

tol_alignment.txt: shows a multiple sequence alignment next to the tree. Used for trees of a specific protein or gene. It is displayed only in normal mode.

tol_binary.txt

tol_binary2.txt

This was added after tol_binary.txt. When multiple metadata sets are added, they are drawn from the inside outward in the order in which they were added.

tol_boxplot.txt

# Each node must define a five-number summary, followed by any number of extreme values. With a comma separator, each line must therefore look like

ID1,minimum,q1,median,q3,maximum,extreme1,extreme2
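
If you start from raw measurements per node, such a line can be generated with a short script. This is a sketch under assumptions: `boxplot_row` is a hypothetical helper, quartiles are taken with Python's `statistics.quantiles` using the inclusive method (iTOL does not prescribe a quartile method as far as I know), and any extreme values are passed in separately rather than detected automatically.

```python
import statistics


def boxplot_row(node_id, values, extremes=(), sep=","):
    """Build one data line: ID,min,q1,median,q3,max[,extreme1,...]."""
    ordered = sorted(values)
    if len(ordered) < 2:
        raise ValueError("need at least two values to compute quartiles")
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    numbers = [ordered[0], q1, median, q3, ordered[-1], *extremes]
    # ':g' drops trailing zeros so 2.0 is written as 2
    return sep.join([node_id] + [f"{float(x):g}" for x in numbers])


print(boxplot_row("ID1", [1, 2, 3, 4, 5], extremes=[9, 12]))
# ID1,1,2,3,4,5,9,12
```

The separator argument should match the SEPARATOR declared in the dataset file.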

tol_color_strip.txt

tol_domains.txt: displays protein domains and similar features.

tol_external_shapes.txt

tol_heatmap1.txt

tol_linechart-sine.txt

tol_multibar10.txt

tol_pies1.txt

tol_connections_leaves.txt

Several config files loaded together

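At the bottom of the left menu buttons is the manual annotation button, which draws semi-transparent rectangles or ellipses directly on the canvas. However, it is not very convenient, and editing the exported figure in Illustrator or a similar tool felt easier.

Now, taking tol_binary.txt from the example as a case, here is what kind of file you actually need to prepare. tol_binary.txt is used to load metadata that can be expressed as binary (1/0) values.

[Screenshot: contents of tol_binary.txt]

Dragging and dropping this file onto the page while the iTOL v4 example tree is displayed shows 11 metadata columns beside the tree.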

Here is how this is controlled. First, lines 5-7 of tol_binary.txt: the separator for the binary values is set to comma. To switch the separator from comma to tab, uncomment line 5 and comment out line 7.

Lines 24-26 are important. To add 11 metadata columns, each of lines 24-26 must specify 11 values. If there is any mismatch, for example only 10 values on line 25, nothing is displayed.

Line 24 holds the labels. Specify 11 unique names in metadata order.

Line 25 holds the colors as hexadecimal color codes. Specify 11 in metadata order. For hex color codes, use an online color picker or a gradient calculator site (example).

Line 26 holds the plot shapes. As described on lines 19-23, a number from 1 to 5 selects the shape. In the example, the innermost metadata column is a rectangle and the next one is a right-pointing arrow. Specify 11 in metadata order.

There are other detailed settings, but the file itself is written with comments, so reading the example file is enough to understand them.

The binary metadata itself goes at the end of the file.

Node 155864 of the tree is shown enlarged. This is how it appears in the tree, while

in tol_binary.txt this node 155864 is 0,1,1,1,0,1,0,-1,0,1.

The shapes of the 11 marks are determined by line 26 of tol_binary.txt, so what line 72 specifies is whether each mark is shown. 1 means shown (filled), 0 means shown as an outline only, and -1 means hidden.

In the result, the first mark from the left (◁) is 0 and so is drawn as an outline, and the second (◀︎) is 1 and so is drawn filled, exactly as specified. The eighth is -1, so it alone is hidden.

(To repeat, the mark shapes are set on line 26, and the same shape applies to every sample.)

This is how the binary metadata of the example is represented.
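
Because a mismatch between the number of labels, colors, and shapes silently results in nothing being displayed, it can help to generate the file with a script that checks the counts first. The sketch below is not an official iTOL tool: `build_binary_dataset` is a hypothetical helper, and the keyword names (DATASET_BINARY, SEPARATOR, DATASET_LABEL, COLOR, FIELD_SHAPES, FIELD_LABELS, FIELD_COLORS, DATA) follow the iTOL dataset template as I understand it, so compare the result against the commented example file before relying on it.

```python
def build_binary_dataset(label, field_labels, field_colors, field_shapes, rows):
    """Return the text of a comma-separated iTOL binary dataset file.

    rows maps a node ID to a list of values: 1 (filled), 0 (outline), -1 (hidden).
    """
    n = len(field_labels)
    if len(field_colors) != n or len(field_shapes) != n:
        raise ValueError("labels, colors and shapes must have the same length")
    if len(set(field_labels)) != n:
        raise ValueError("field labels must be unique")
    lines = [
        "DATASET_BINARY",
        "SEPARATOR COMMA",
        f"DATASET_LABEL,{label}",
        "COLOR,#ff0000",  # dataset color, arbitrary choice for this sketch
        "FIELD_SHAPES," + ",".join(str(s) for s in field_shapes),
        "FIELD_LABELS," + ",".join(field_labels),
        "FIELD_COLORS," + ",".join(field_colors),
        "DATA",
    ]
    for node_id, values in rows.items():
        if len(values) != n:
            raise ValueError(f"{node_id}: expected {n} values, got {len(values)}")
        if any(v not in (1, 0, -1) for v in values):
            raise ValueError(f"{node_id}: values must be 1, 0 or -1")
        lines.append(",".join([node_id] + [str(v) for v in values]))
    return "\n".join(lines) + "\n"


text = build_binary_dataset(
    "demo",
    ["f1", "f2", "f3"],
    ["#ff0000", "#00ff00", "#0000ff"],
    [1, 2, 3],
    {"155864": [0, 1, -1]},
)
print(text)
```

The returned text can be written to a .txt file and dragged onto the tree in the same way as the example files.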

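For loaded metadata, the symbol spacing and size can be changed from Datasets.

Since most of the information lives in the config file, a mistake can be recovered simply by resetting the settings and loading the file again. For the detailed metadata rules, see the help page linked above.

Results can be exported as PDF, SVG, and other formats.

Besides saving trees, there is also a feature for sharing them.

After a few hours with the example data and the gallery files, you can master most of the operations. In my tests it worked even with several thousand nodes. As expected from a tool that has gone through many versions, it has become an excellent, user-friendly phylogenetic tree viewer.

# I do not know whether it is a bug or by design, but with one dataset some leaves were missing when a Newick tree file was loaded. Loading the file into FigTree first, exporting all the data in NEXUS format, and using that file made the missing leaves reappear.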

Citation

Interactive Tree of Life (iTOL) v6: recent updates to the phylogenetic tree display and annotation tool

Ivica Letunic, Peer Bork

Nucleic Acids Research, Published: 13 April 2024

Interactive Tree Of Life (iTOL) v4: recent updates and new developments
Ivica Letunic, Peer Bork
Nucleic Acids Research, Volume 47, Issue W1, 02 July 2019

Interactive Tree of Life (iTOL) v3: An Online Tool for the Display and Annotation of Phylogenetic and Other Trees

Ivica Letunic, Peer Bork

Nucleic Acids Res. 2016 Jul 8;44(W1):W242-5

Interactive Tree Of Life v2: Online Annotation and Display of Phylogenetic Trees Made Easy

Ivica Letunic , Peer Bork

Nucleic Acids Res. 2011 Jul;39

Interactive Tree Of Life (iTOL): An Online Tool for Phylogenetic Tree Display and Annotation

Ivica Letunic, Peer Bork

Bioinformatics. 2007 Jan 1;23(1):127-8

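Reference

2021 4/27

From SERVER ACCESS

iTOL is a self-supporting tool that has been maintained and extended without substantial public funding. Given the continuing growth in the number of active users and uploaded trees, the developers have sought a sustainable model to maintain and develop iTOL, to provide the necessary storage and CPU power, and to offer timely technical support to the user base. In version 5, the tree annotation features remain free to use, but some of iTOL's account management features now require an active subscription. Except for storing tree annotations on the iTOL server and the batch upload mode, most user account management features are free. Trees and annotations uploaded by users remain accessible at any time, with or without a subscription.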

iTOL subscriptions

2020-07-01

PlasmidID: reconstructing and annotating conserved plasmid sequences by targeted assembly

target assembly target gene reconstruction AMR plasmid

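　PlasmidID is a mapping-based, assembly-assisted plasmid identification tool that provides analysis and a graphical solution for plasmid identification.

　PlasmidID is a computational pipeline implemented in BASH that maps Illumina reads onto the sequences of a plasmid database. The k-mer filtered, most covered sequences are clustered by identity to avoid redundancy, and the longest ones are used as scaffolds for plasmid reconstruction. Reads are assembled and annotated by automatic and specific annotation. All the information obtained from mapping, assembly, annotation, and local alignment analysis is gathered and accurately represented in a circular image, allowing the user to determine the plasmid composition of any bacterial sample.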

wiki

https://github.com/BU-ISCIII/plasmidID/wiki

twitter

https://twitter.com/search?q=PlasmidID&src=typed_query

Installation

Tested in a conda virtual environment on Ubuntu 18.04 (it did not work on macOS at the time of testing).

Dependencies

Python >=3.6
Trimmomatic v0.33(Optional)
Spades v3.8 (Optional)
Perl v5.26.0
NCBI_blast + v2.2.3
Bedtools v2.25
Bowtie 2 v2.2.4
SAMtools v1.2
prokka v1.12
cd-hit v4.6.6 (no longer needed since v1.6)
circos v0.69.3
mash v2.2

Main program: GitHub

#bioconda (link); installed into a virtual environment here
conda create -n plasmidid -y
conda activate plasmidid
conda install -c conda-forge -c bioconda plasmidid -y

#docker images(link)
docker pull buisciii/plasmidid

> plasmidID -h

$ plasmidID -h

plasmidID is a computational pipeline tha reconstruct and annotate the most likely plasmids present in one sample

usage : /home/kazu/anaconda3/envs/plasmidid/bin/plasmidID <-1 R1> <-2 R2> <-d database(fasta)> <-s sample_name> [-g group_name] [options]

Mandatory input data:

-1 | --R1 <filename> reads corresponding to paired-end R1 (mandatory)

-2 | --R2 <filename> reads corresponding to paired-end R2 (mandatory)

-d | --database <filename> database to map and reconstruct (mandatory)

-s | --sample <string> sample name (mandatory), less than 37 characters

Optional input data:

-g | --group <string> group name (optional). If unset, samples will be gathered in NO_GROUP group

-c | --contigs <filename> file with contigs. If supplied, plasmidID will not assembly reads

-a | --annotate <filename> file with configuration file for specific annotation

-o <output_dir> output directory, by default is the current directory

Pipeline options:

--explore Relaxes default parameters to find less reliable relationships within data supplied and database

--only-reconstruct Database supplied will not be filtered and all sequences will be used as scaffold

This option does not require R1 and R2, instead a contig file can be supplied

-w Undo winner takes it all algorithm when clustering by kmer - QUICKER MODE

Trimming:

--trimmomatic-directory Indicate directory holding trimmomatic .jar executable

--no-trim Reads supplied will not be quality trimmed

Coverage and Clustering:

-C | --coverage-cutoff <int> minimun coverage percentage to select a plasmid as scafold (0-100), default 80

-S | --coverage-summary <int> minimun coverage percentage to include plasmids in summary image (0-100), default 90

-f | --cluster <int> kmer identity to cluster plasmids into the same representative sequence (0 means identical) (0-1), default 0.5

-k | --kmer <int> identity to filter plasmids from the database with kmer approach (0-1), default 0.95

Contig local alignment

-i | --alignment-identity <int> minimun identity percentage aligned for a contig to annotate, default 90

-l | --alignment-percentage <int> minimun length percentage aligned for a contig to annotate, default 20

-L | --length-total <int> minimun alignment length to filter blast analysis

--extend-annotation <int> look for annotation over regions with no homology found (base pairs), default 500bp

Draw images:

--config-directory <dir> directory holding config files, default config_files/

--config-file-individual <file-name> file name of the individual file used to reconstruct

Additional options:

-M | --memory <int> max memory allowed to use

-T | --threads <int> number of threads

-v | --version version

-h | --help display usage message

example: ./plasmidID.sh -1 ecoli_R1.fastq.gz -2 ecoli_R2.fastq.gz -d database.fasta -s ECO_553 -G ENTERO

./plasmidID.sh -1 ecoli_R1.fastq.gz -2 ecoli_R2.fastq.gz -d PacBio_sample.fasta -c scaffolds.fasta -C 60 -s ECO_60 -G ENTERO --no-trim

Database

A set of plasmid sequences from plasmidFinder is provided as a (demo) dataset.

git clone https://github.com/BU-ISCIII/plasmidID.git

Specify the paired-end fastq files and the plasmid sequences to use as scaffolds. If contigs from an assembly are also supplied with "-c contig.fasta", the SPAdes assembly step is skipped.

plasmidID -1 SAMPLE_R1.fastq.gz -2 SAMPLE_R2.fastq.gz \
-d plasmids.fasta --no-trim -s sample -T 16

-1 reads corresponding to paired-end R1 (mandatory)
-2 reads corresponding to paired-end R2 (mandatory)
-d database to map and reconstruct (mandatory)
-s sample name (mandatory), less than 37 characters
--no-trim Reads supplied will not be quality trimmed
-T number of threads
-c file with contigs. If supplied, plasmidID will not assembly reads
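
When running many samples, it can be convenient to assemble the command line in a script and check the sample name limit up front. This is only a sketch: `plasmidid_command` is a hypothetical helper that uses the flags listed above and nothing else, and it merely builds the argument list (pass it to `subprocess.run` to actually execute it).

```python
import shlex


def plasmidid_command(r1, r2, database, sample, threads=None, no_trim=False, contigs=None):
    """Build a plasmidID argument list from the options shown in the help."""
    if len(sample) >= 37:
        # the help states the sample name must be less than 37 characters
        raise ValueError("sample name must be less than 37 characters")
    cmd = ["plasmidID", "-1", r1, "-2", r2, "-d", database, "-s", sample]
    if no_trim:
        cmd.append("--no-trim")
    if threads is not None:
        cmd += ["-T", str(threads)]
    if contigs:
        cmd += ["-c", contigs]  # with contigs supplied, reads are not assembled
    return cmd


cmd = plasmidid_command(
    "SAMPLE_R1.fastq.gz", "SAMPLE_R2.fastq.gz", "plasmids.fasta", "sample",
    threads=16, no_trim=True,
)
print(shlex.join(cmd))
```

Building a list rather than a single string avoids quoting problems when file names contain spaces.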

output-dir_final_results.tab

To identify plasmids that were not clustered properly and to select only the most appropriate ones, a summary image showing the links between identical contigs is produced (for example, when the same contig is present in two different plasmids). The user ultimately decides manually how many different plasmids are present in the analyzed sample. The decision is left to the user, but with all the information provided by PlasmidID and this guide, the task should be straightforward.

About the output

https://github.com/BU-ISCIII/plasmidID/wiki/Understanding-the-image:-track-by-track

Citation

GitHub - BU-ISCIII/plasmidID: PlasmidID is a mapping-based, assembly-assisted plasmid identification tool that analyzes and gives graphic solution for plasmid identification.

2020-06-30

PhytoClust: detecting metabolic gene clusters in plant genomes

2017 Nucleic Acids Research gene cluster web tool Pathway plant

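　Metabolic gene clusters (MGCs) are genes of a particular metabolic pathway that are co-localized in the genome and potentially co-regulated. In contrast to bacterial operons, they are not under the control of a single transcriptional unit. MGCs are common in fungal genomes, and it was long assumed that MGCs occur in plants only as an exception (ref. 1). In recent years, however, more than 20 MGCs have been experimentally characterized in various species, most of them associated with plant specialized metabolism (2–6). Plant MGCs span the genomes of both monocot and dicot species and mediate the biosynthesis of end products belonging to diverse chemical classes such as benzoxazinoids, cyanogenic glycosides, terpenoids, and alkaloids (ref. 3). Importantly, the reported plant MGCs include biosynthetic reactions for the synthesis of pharmaceutically and agronomically important chemicals.

　The common descriptor of plant MGCs is the adjacent localization of at least three, sometimes two, non-homologous biosynthetic genes encoding enzymes involved in the synthesis of a specialized metabolite (ref. 4). Plant MGCs appear to have evolved independently of prokaryotic operons. Typically, a plant MGC contains one gene encoding a so-called "signature enzyme", i.e. an enzyme that catalyzes the first committed step of the biosynthetic pathway and synthesizes the scaffold of the subsequent specialized metabolites. The remaining genes encode subsequent "tailoring enzymes" that modify the scaffold to form the desired end product (ref. 4). Furthermore, because signature genes share homology with genes of plant primary metabolism, it is widely assumed that plant MGCs were formed by gene duplication and neofunctionalization, with recruitment of additional tailoring enzymes.

(partly omitted)

　The authors developed and applied PhytoClust, an in silico MGC prediction tool. PhytoClust allows searching for known plant MGC types as well as mining for novel types of clusters (i.e. in terms of enzyme class composition). Co-expression analysis of the genes in candidate clusters is available for selected plant species. PhytoClust is expected to enhance the characterization of novel MGCs in a wide range of plant species as new genome assemblies become available.

　The PhytoClust workflow follows the core antiSMASH implementation detailed in (ref. 18). The cluster analysis pipeline takes GBK, EMBL, or FASTA files as input and either extracts genes from the input file or, if none are specified, predicts genes from the input nucleotide sequence using GlimmerHMM (ref. 21). (partly omitted) The optional co-expression analysis is described in detail in "The co-expression module of PhytoClust". The result output includes HTML, GBK, EMBL, TXT, and XLS files.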

PhytoClust- our tool for plant metabolic gene clusters search is already online: https://t.co/Fnh42IZBer
see also: https://t.co/HO6C0Ut2TM
— Jedrzej Szymanski (@JJSzymanski) 2017年3月28日

about

http://phytoclust.weizmann.ac.il/about/

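Usage

Go to http://phytoclust.weizmann.ac.il.

Select one of the plant genomes already stored on the server, or upload your own sequence. When uploading, provide a FASTA file or an annotation file in EMBL or GBK format. If a FASTA file is given, genes are predicted first and then used for the analysis.

If the file is larger than 1.5 GB, upload it chromosome by chromosome or upload only the region of interest. Tick the co-expression analysis box if needed.

If you enter an email address, a notification containing a download link is sent when the computation finishes. Results remain available for download for 7 days after the computation finishes. See the about page for the gene clusters that can be detected.

Citation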

The PhytoClust tool for metabolic gene clusters discovery in plant genomes
Nadine Töpfer, Lisa-Maria Fuchs, Asaph Aharoni

Nucleic Acids Res. 2017 Jul 7; 45(12): 7049–7063
