TEtrimmer: a tool to assist the curation of annotated transposable elements (TEs)

From the repository

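Many tools have been developed for the discovery and annotation of transposable elements (TEs). However, building a high-quality TE consensus library still requires manual curation of TEs, which is time-consuming and demands expert knowledge.

TEtrimmer is a powerful software tool designed to automate the manual curation of TEs. Its input is a TE library from a de novo TE discovery tool such as EDTA2 or RepeatModeler2, or a TE library from a closely related species. For each input consensus sequence, TEtrimmer automatically performs BLAST, sequence extraction, extension, multiple sequence alignment (MSA), MSA clustering, MSA cleaning, TE boundary definition, and TE classification. TEtrimmer also provides a graphical user interface (GUI) for inspecting and improving the predicted TEs, which helps users reach a TE consensus library of manual-curation quality with little effort.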

Are you still struggling with the manual curation of TEs? Do you want to know the magic of cleaning TE MSAs with only a single button click?
Welcome to try TEtrimmer (paper will come soon) https://t.co/PV8SUwokHp
I am also looking forward to discussing this with you in #ICTE2024 pic.twitter.com/CDFo4r0KVq
- Jiangzhao (@qjiangzhao), April 21, 2024

Manual

https://github.com/qjiangzhao/TEtrimmer/blob/main/docs/TEtrimmerv1.2.0Manual.pdf

Installation

I installed it on Ubuntu 22 following the instructions in the repository. On macOS, the TEtrimmer conda package can be installed directly (see the repository).

Dependencies

python=3.10

GitHub

git clone https://github.com/qjiangzhao/TEtrimmer.git
cd TEtrimmer/
mamba env create -f TEtrimmer_env_for_linux.yml
conda activate TEtrimmer

> python tetrimmer/TEtrimmer.py --help

Usage: TEtrimmer.py [OPTIONS]

##########################################################################################
[ASCII-art banner: TEtrimmer]

Version: v1.2.0 (19/April/2024)
Github: https://github.com/qjiangzhao/TEtrimmer

Developers:
Jiangzhao Qian; RWTH Aachen University; Email: jqian@bio1.rwth-aachen.de
Hang Xue; University of California, Berkeley; Email: hang_xue@berkeley.edu
Stefan Kusch; Research Center Juelich; Email: s.kusch@fz-juelich.de

Funding source:
Ralph Panstruga Lab; RWTH Aachen University; Email: panstruga@bio1.rwth-aachen.de
Website: https://www.bio1.rwth-aachen.de/PlantMolCellBiology/index.html
##########################################################################################

python ./path_to_TEtrimmer_bin/TEtrimmer.py -i <TE_consensus_file> -g <genome_file>

TEtrimmer is designed to replace manual curation of transposable elements (TEs).
Two mandatory arguments are required, including <genome file>, the genome FASTA file, and <TE consensus file>
from TE annotation software like RepeatModeler, EDTA, or REPET. TEtrimmer can do BLAST, sequence extension,
multiple sequence alignment, and defining TE boundaries.

Options:
  -i, --input_file TEXT
      Path to TE consensus file (FASTA format). Use the output from RepeatModeler, EDTA, REPET, et al. [required]
  -g, --genome_file TEXT
      Path to genome FASTA file (FASTA format). [required]
  -o, --output_dir TEXT
      Path to output directory. Default: current working directory.
  -s, --preset [conserved|divergent]
      Choose one preset config (conserved or divergent).
  -t, --num_threads INTEGER
      Thread number used for TEtrimmer. Default: 10
  --classify_unknown
      Use RepeatClassifier to classify the consensus sequence if the input sequence is not classified or is unknown or the processed sequence length by TEtrimmer is 2000 bp longer or shorter than the query sequence.
  --classify_all
      Use RepeatClassifier to classify every consensus sequence. WARNING: This may take a long time.
  -ca, --continue_analysis
      Continue from previous unfinished TEtrimmer run and would use the same output directory.
  --dedup
      Remove duplicate sequences in input file.
  -ga, --genome_anno
      Perform genome TE annotation using RepeatMasker with the TEtrimmer curated TE libraries.
  --hmm
      Generate HMM files for each processed consensus sequence.
  --debug
      debug mode. This will keep all raw files. WARNING: Many files will be generated.
  --fast_mode
      Reduce running time at the cost of lower accuracy and specificity.
  -pd, --pfam_dir TEXT
      Pfam database directory. TE Trimmer will download the database automatically. Only turn on this option if you want to use a local PFAM database or the automatic download fails.

  --cons_thr FLOAT
      The minimum level of agreement required at a given position in the alignment for a consensus character to be called. Default: 0.8
  --mini_orf INTEGER
      Define the minimum ORF length to be predicted by TEtrimmer. Default: 200
  --max_msa_lines INTEGER
      Set the maximum number of sequences to be included in a multiple sequence alignment. Default: 100
  --top_msa_lines INTEGER
      If the sequence number of multiple sequence alignment (MSA) is greater than <max_msa_lines>, TEtrimmer will first sort sequences by length and choose <top_msa_lines> number of sequences. Then, TEtrimmer will randomly select sequences from all remaining BLAST hits until <max_msa_lines> sequences are found for the multiple sequence alignment. Default: 100
  --min_seq_num INTEGER
      The minimum blast hit number required for the input sequence. We do not recommend decreasing this number. Default: 10
  --min_blast_len INTEGER
      The minimum sequence length for blast hits to be included for further analysis. Default: 150
  --max_cluster_num INTEGER
      The maximum number of clusters assigned in each multiple sequence alignment. Each multiple sequence alignment can be grouped into different clusters based on alignment patterns. WARNING: using a larger number will potentially result in more accurate consensus results but will significantly increase the running time. We do not recommend increasing this value to over 5. Default: 2

  --ext_thr FLOAT
      The threshold to call “N” at a position. For example, if the most conserved nucleotide in a MSA column has proportion smaller than <ext_thr>, a “N” will be called at this position. Used with <ext_check_win>. The lower the value of <ext_thr>, the more likely to get longer the extensions on both ends. You can try reducing <ext_thr> if TEtrimmer fails to find full-length TEs. Default: 0.7
  --ext_check_win TEXT
      the check windows size during defining start and end of the consensus sequence based on the multiple sequence alignment. Used with <ext_thr>. If <ext_check_win> bp at the end of multiple sequence alignment has “N” present (ie. positions have similarity proportion smaller than <ext_thr>), the extension will stop, which defines the edge of the consensus sequence. Default: 150
  --ext_step INTEGER
      the number of nucleotides to be added to the left and right ends of the multiple sequence alignment in each extension step. TE_Trimmer will iteratively add <ext_step> nucleotides until finding the TE boundary or reaching <max_ext>. Default: 1000
  --max_ext INTEGER
      The maximum extension in nucleotides at both ends of the multiple sequence alignment. Default: 7000

  --gap_thr FLOAT
      If a single column in the multiple sequence alignment has a gap proportion larger than <gap_thr> and the proportion of the most common nucleotide in this column is less than <gap_nul_thr>, this column will be removed from the consensus. Default: 0.4
  --gap_nul_thr FLOAT
      The nucleotide proportion threshold for keeping the column of the multiple sequence alignment. Used with the <gap_thr> option. i.e. if this column has <40% gap and the portion of T (or any other) nucleotide is >70% in this particular column, this column will be kept. Default: 0.7
  --crop_end_div_thr FLOAT
      The crop end by divergence function will convert each nucleotide in the multiple sequence alignment into a proportion value. This function will iteratively choose a sliding window from each end of each sequence of the MSA and sum up the proportion numbers in this window. The cropping will continue until the sum of proportions is larger than <--crop_end_div_thr>. Cropped nucleotides will be converted to -. Default: 0.7
  --crop_end_div_win INTEGER
      Window size used for the end-cropping process. Used with the <--crop_end_div_thr> option. Default: 40
  --crop_end_gap_thr FLOAT
      The crop end by gap function will iteratively choose a sliding window from each end of each sequence of the MSA and calculate the gap proportion in this window. The cropping will continue until the sum of gap proportions is smaller than <--crop_end_gap_thr>. Cropped nucleotides will be converted to -. Default: 0.1
  --crop_end_gap_win INTEGER
      Define window size used to crop end by gap. Used with the <--crop_end_gap_thr> option. Default: 250

  --start_patterns TEXT
      LTR elements always start with a conserved sequence pattern. TEtrimmer searches the beginning of the consensus sequence for these patterns. If the pattern is not found, TEtrimmer will extend the search of <--start_patterns> to up to 15 nucleotides from the beginning of the consensus sequence and redefine the start of the consensus sequence if the pattern is found. Note: The user can provide multiple LTR start patterns in a comma-separated list, like: TG,TA,TC (no spaces; the order of patterns determines the priority for the search). Default: TG
  --end_patterns TEXT
      LTR elements always end with a conserved sequence pattern. TEtrimmer searches the end of the consensus sequence for these patterns. If the pattern is not found, TEtrimmer will extend the search of <--end_patterns> to up to 15 nucleotides from the end of the consensus sequence and redefine the end of the consensus sequence if the pattern is found. Note: The user can provide multiple LTR end patterns in a comma-separated list, like: CA,TA,GA (no spaces; the order of patterns determines the priority for the search). Default: CA
  --help
      Show this message and exit.
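To make two of the rules in the option help more concrete, here is a minimal Python sketch. It is not TEtrimmer's actual code: `keep_column` and `redefine_start` are hypothetical names of my own. I also assume that the nucleotide proportion for --gap_nul_thr is computed among non-gap characters, that a start pattern must lie entirely within the first 15 nucleotides, and that patterns are tried one after another in the order given.

```python
from collections import Counter


def keep_column(column, gap_thr=0.4, gap_nul_thr=0.7):
    """Decide whether one MSA column survives the --gap_thr / --gap_nul_thr rule.

    `column` is a string with one character per aligned sequence, "-" for a gap.
    Assumption: the top-nucleotide proportion is taken over non-gap characters.
    """
    gap_prop = column.count("-") / len(column)
    bases = [c for c in column if c != "-"]
    if not bases:
        return False  # a column made only of gaps carries no information
    top_prop = Counter(bases).most_common(1)[0][1] / len(bases)
    # Dropped only when the column is gappy AND no nucleotide dominates
    return not (gap_prop > gap_thr and top_prop < gap_nul_thr)


def redefine_start(consensus, patterns="TG", window=15):
    """Trim the consensus so that it begins at an LTR start pattern.

    Patterns are tried in the order given (comma-separated, no spaces), and
    each one is searched only within the first `window` nucleotides.
    """
    for pattern in patterns.split(","):
        pos = consensus.find(pattern, 0, window)
        if pos != -1:
            return consensus[pos:]
    return consensus  # no pattern nearby: leave the sequence unchanged


print(keep_column("AC-G------"))             # gappy and no dominant base
print(redefine_start("ACCTATGA", "TG,TA"))   # TG has priority over TA
```

The same idea applies to --end_patterns, searching from the other end of the sequence.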

Usage

A run requires a genome sequence in FASTA format (.fa or .fasta) and a TE consensus library, specifically one produced by a de novo TE annotation tool such as RepeatModeler or EDTA.

TEtrimmer --input_file TE_consensus_library.fa --genome_file genome.fasta --output_dir outdir --num_threads 20 --classify_all
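For scripted or repeated runs, the same command line can be assembled in Python. This is only a sketch: `build_tetrimmer_command` is a hypothetical helper of my own, not part of TEtrimmer, and it uses only flags that appear in the --help output. It builds the argument list without running anything.

```python
import shlex


def build_tetrimmer_command(te_library, genome, output_dir, threads=10,
                            classify_all=False, preset=None, resume=False):
    """Assemble a TEtrimmer command line as an argument list (hypothetical helper)."""
    cmd = [
        "TEtrimmer",
        "--input_file", str(te_library),
        "--genome_file", str(genome),
        "--output_dir", str(output_dir),
        "--num_threads", str(threads),
    ]
    if preset is not None:
        # --preset accepts only these two values according to the help text
        if preset not in ("conserved", "divergent"):
            raise ValueError("preset must be 'conserved' or 'divergent'")
        cmd += ["--preset", preset]
    if classify_all:
        cmd.append("--classify_all")
    if resume:
        # Continue an unfinished run in the same output directory
        cmd.append("--continue_analysis")
    return cmd


cmd = build_tetrimmer_command("TE_consensus_library.fa", "genome.fasta",
                              "outdir", threads=20, classify_all=True)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` inside the activated TEtrimmer environment.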

The path to a local Pfam database can be specified with -pd. If no database is found, it is downloaded automatically when the command runs.

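Example output

(The repository describes the output files.)

After a TEtrimmer run, a GUI tool is available for inspecting and improving the TE consensus library. This step is strongly recommended in order to raise the quality of the TE consensus library to the level of conventional manual curation.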

#launch GUI

> python tetrimmer/TEtrimmer_proof_anno_GUI/annoGUI.py

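Citation

https://github.com/qjiangzhao/TEtrimmer/tree/main

A paper is reportedly in preparation, so this is an early introduction. If you are struggling with the curation of TEs annotated in a genome assembly, TEtrimmer may be worth a try. It ran without any particular trouble, but the computation takes a fair amount of time, so for a first run it may be better to test with a dataset containing fewer TE sequences (each supplied TE sequence is curated, so a library with several thousand or more sequences takes a considerable amount of time).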
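One way to prepare such a smaller test set is to subsample the TE consensus library before the first run. The following is a rough sketch using only the Python standard library. `read_fasta` and `subsample_fasta` are hypothetical helper names of my own, and the random sampling with a fixed seed is my choice, not something TEtrimmer prescribes.

```python
import random


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def subsample_fasta(text, n, seed=0):
    """Return FASTA text with at most n randomly chosen records, in original order."""
    records = read_fasta(text)
    if n < len(records):
        rng = random.Random(seed)  # fixed seed keeps the subset reproducible
        keep = sorted(rng.sample(range(len(records)), n))
        records = [records[i] for i in keep]
    return "".join(f">{header}\n{seq}\n" for header, seq in records)


library = ">TE1\nACGT\nAC\n>TE2\nGGGG\n>TE3\nTTTT\n"
small = subsample_fasta(library, 2)
print(small)
```

In practice the text would be read from the library file, and the subsampled result written to a new FASTA file that is then passed to --input_file.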