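TEtrimmer is software designed to automate the manual curation of transposable elements (TEs). Its input is a TE library produced by a de novo TE discovery tool such as EDTA2 or RepeatModeler2, or a TE library from a closely related species. For each input consensus sequence, TEtrimmer automatically performs BLAST, sequence extraction, extension, multiple sequence alignment (MSA), MSA clustering, MSA cleaning, TE boundary definition, and TE classification. TEtrimmer also provides a graphical user interface (GUI) for inspecting and improving the predicted TEs, which helps users reach a TE consensus library of manual-curation quality.
I installed it on Ubuntu 22 following the instructions in the repository. On macOS, the TEtrimmer conda package can be installed directly (see the repository).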
- python=3.10
git clone https://github.com/qjiangzhao/TEtrimmer.git
cd TEtrimmer/
mamba env create -f TEtrimmer_env_for_linux.yml
conda activate TEtrimmer
> python tetrimmer/TEtrimmer.py --help
Usage: TEtrimmer.py [OPTIONS]
Version: v1.2.0 (19/April/2024)
Github: https://github.com/qjiangzhao/TEtrimmer
Jiangzhao Qian; RWTH Aachen University; Email: jqian@bio1.rwth-aachen.de
Hang Xue; University of California, Berkeley; Email: hang_xue@berkeley.edu
Stefan Kusch; Research Center Juelich; Email: s.kusch@fz-juelich.de
Funding source:
Ralph Panstruga Lab; RWTH Aachen University; Email: panstruga@bio1.rwth-aachen.de
Website: https://www.bio1.rwth-aachen.de/PlantMolCellBiology/index.html
python ./path_to_TEtrimmer_bin/TEtrimmer.py -i <TE_consensus_file> -g <genome_file>
TEtrimmer is designed to replace manual curation of transposable elements (TEs).
Two mandatory arguments are required, including <genome file>, the genome FASTA file, and <TE consensus file>
from TE annotation software like RepeatModeler, EDTA, or REPET. TEtrimmer can do BLAST, sequence extension,
multiple sequence alignment, and defining TE boundaries.
-i, --input_file TEXT Path to TE consensus file (FASTA format). Use the output from RepeatModeler, EDTA,
REPET, et al. [required]
-g, --genome_file TEXT Path to genome FASTA file (FASTA format). [required]
-o, --output_dir TEXT Path to output directory. Default: current working directory.
-s, --preset [conserved|divergent]
Choose one preset config (conserved or divergent).
-t, --num_threads INTEGER Thread number used for TEtrimmer. Default: 10
--classify_unknown Use RepeatClassifier to classify the consensus sequence if the input sequence is not
classified or is unknown or the processed sequence length by TEtrimmer is 2000 bp
longer or shorter than the query sequence.
--classify_all Use RepeatClassifier to classify every consensus sequence. WARNING: This may take a
long time.
-ca, --continue_analysis Continue from previous unfinished TEtrimmer run and would use the same output
--dedup Remove duplicate sequences in input file.
-ga, --genome_anno Perform genome TE annotation using RepeatMasker with the TEtrimmer curated TE
--hmm Generate HMM files for each processed consensus sequence.
--debug debug mode. This will keep all raw files. WARNING: Many files will be generated.
--fast_mode Reduce running time at the cost of lower accuracy and specificity.
-pd, --pfam_dir TEXT Pfam database directory. TE Trimmer will download the database automatically. Only
turn on this option if you want to use a local PFAM database or the automatic
download fails.
--cons_thr FLOAT The minimum level of agreement required at a given position in the alignment for a
consensus character to be called. Default: 0.8
--mini_orf INTEGER Define the minimum ORF length to be predicted by TEtrimmer. Default: 200
--max_msa_lines INTEGER Set the maximum number of sequences to be included in a multiple sequence alignment.
Default: 100
--top_msa_lines INTEGER If the sequence number of multiple sequence alignment (MSA) is greater than
<max_msa_lines>, TEtrimmer will first sort sequences by length and choose
<top_msa_lines> number of sequences. Then, TEtrimmer will randomly select sequences
from all remaining BLAST hits until <max_msa_lines> sequences are found for the
multiple sequence alignment. Default: 100
--min_seq_num INTEGER The minimum blast hit number required for the input sequence. We do not recommend
decreasing this number. Default: 10
--min_blast_len INTEGER The minimum sequence length for blast hits to be included for further analysis.
Default: 150
--max_cluster_num INTEGER The maximum number of clusters assigned in each multiple sequence alignment. Each
multiple sequence alignment can be grouped into different clusters based on
alignment patterns WARNING: using a larger number will potentially result in more
accurate consensus results but will significantly increase the running time. We do
not recommend increasing this value to over 5. Default: 2
--ext_thr FLOAT The threshold to call “N” at a position. For example, if the most conserved
nucleotide in a MSA column has proportion smaller than <ext_thr>, a “N” will be
called at this position. Used with <ext_check_win>. The lower the value of
<ext_thr>, the more likely to get longer the extensions on both ends. You can try
reducing <ext_thr> if TEtrimmer fails to find full-length TEs. Default: 0.7
--ext_check_win TEXT the check windows size during defining start and end of the consensus sequence based
on the multiple sequence alignment. Used with <ext_thr>. If <ext_check_win> bp at
the end of multiple sequence alignment has “N” present (ie. positions have
similarity proportion smaller than <ext_thr>), the extension will stop, which
defines the edge of the consensus sequence. Default: 150
--ext_step INTEGER the number of nucleotides to be added to the left and right ends of the multiple
sequence alignment in each extension step. TE_Trimmer will iteratively add
<ext_step> nucleotides until finding the TE boundary or reaching <max_ext>. Default:
--max_ext INTEGER The maximum extension in nucleotides at both ends of the multiple sequence
alignment. Default: 7000
--gap_thr FLOAT If a single column in the multiple sequence alignment has a gap proportion larger
than <gap_thr> and the proportion of the most common nucleotide in this column is
less than <gap_nul_thr>, this column will be removed from the consensus. Default:
--gap_nul_thr FLOAT The nucleotide proportion threshold for keeping the column of the multiple sequence
alignment. Used with the <gap_thr> option. i.e. if this column has <40% gap and the
portion of T (or any other) nucleotide is >70% in this particular column, this
column will be kept. Default: 0.7
--crop_end_div_thr FLOAT The crop end by divergence function will convert each nucleotide in the multiple
sequence alignment into a proportion value. This function will iteratively choose a
sliding window from each end of each sequence of the MSA and sum up the proportion
numbers in this window. The cropping will continue until the sum of proportions is
larger than <--crop_end_div_thr>. Cropped nucleotides will be converted to -.
Default: 0.7
--crop_end_div_win INTEGER Window size used for the end-cropping process. Used with the <--crop_end_div_thr>
option. Default: 40
--crop_end_gap_thr FLOAT The crop end by gap function will iteratively choose a sliding window from each end
of each sequence of the MSA and calculate the gap proportion in this window. The
cropping will continue until the sum of gap proportions is smaller than
<--crop_end_gap_thr>. Cropped nucleotides will be converted to -. Default: 0.1
--crop_end_gap_win INTEGER Define window size used to crop end by gap. Used with the <--crop_end_gap_thr>
option. Default: 250
--start_patterns TEXT LTR elements always start with a conserved sequence pattern. TEtrimmer searches the
beginning of the consensus sequence for these patterns. If the pattern is not found,
TEtrimmer will extend the search of <--start_patterns> to up to 15 nucleotides from
the beginning of the consensus sequence and redefine the start of the consensus
sequence if the pattern is found. Note: The user can provide multiple LTR start
patterns in a comma-separated list, like: TG,TA,TC (no spaces; the order of patterns
determines the priority for the search). Default: TG
--end_patterns TEXT LTR elements always end with a conserved sequence pattern. TEtrimmer searches the
end of the consensus sequence for these patterns. If the pattern is not found,
TEtrimmer will extend the search of <--end_patterns> to up to 15 nucleotides from
the end of the consensus sequence and redefine the end of the consensus sequence if
the pattern is found. Note: The user can provide multiple LTR end patterns in a
comma-separated list, like: CA,TA,GA (no spaces; the order of patterns determines
the priority for the search). Default: CA
--help Show this message and exit.
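The --cons_thr and --ext_thr descriptions above both come down to the same idea: look at one MSA column at a time and call "N" when the most common nucleotide does not reach a given proportion. The following is a minimal sketch of that idea only, not TEtrimmer's actual implementation. The function name `call_consensus`, the toy alignment, and the way gaps are handled (counted in the column size but never called) are all assumptions made for illustration.

```python
from collections import Counter


def call_consensus(msa, threshold=0.7):
    """Call one consensus character per alignment column.

    Hypothetical helper for illustration; not part of TEtrimmer.
    A column yields 'N' when its most common nucleotide makes up
    less than `threshold` of the rows in that column.
    """
    if not msa:
        return ""
    length = len(msa[0])
    if any(len(row) != length for row in msa):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*msa):
        # Assumption: gaps ('-') count toward the column size
        # but can never be called as the consensus character.
        counts = Counter(base.upper() for base in column if base != "-")
        if not counts:
            consensus.append("N")
            continue
        base, count = counts.most_common(1)[0]
        consensus.append(base if count / len(column) >= threshold else "N")
    return "".join(consensus)


# Toy alignment: the third column is split 3 G vs 2 C.
msa = ["ACGT-A", "ACGTTA", "ACCTTA", "ATCATA", "ACGTTC"]
print(call_consensus(msa, threshold=0.7))  # ACNTTA
```

With a lower threshold such as 0.5, the third column is called as G instead of N, which matches the note in the help text that lowering <ext_thr> tends to give longer extensions.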
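To run it, you need a TE consensus library in addition to the genome sequence in FASTA format (.fa or .fasta). Specifically, this is the TE consensus library produced by a de novo TE annotation tool such as RepeatModeler or EDTA.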
TEtrimmer --input_file TE_consensus_library.fa --genome_file genome.fasta --output_dir outdir --num_threads 20 --classify_all
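When running several genomes, it can be convenient to assemble the command above from Python. The sketch below only builds the argument list, using the long option names listed in the help output. The helper name `build_tetrimmer_command` is hypothetical, and it assumes that a `TEtrimmer` executable is on the PATH, as in the command above.

```python
import shlex


def build_tetrimmer_command(consensus_fasta, genome_fasta, output_dir,
                            threads=10, classify_all=False, preset=None):
    """Build a TEtrimmer argument list (hypothetical helper, not part of TEtrimmer)."""
    cmd = [
        "TEtrimmer",  # assumption: the executable is available on PATH
        "--input_file", str(consensus_fasta),
        "--genome_file", str(genome_fasta),
        "--output_dir", str(output_dir),
        "--num_threads", str(threads),
    ]
    if preset is not None:
        # The help output lists exactly these two presets.
        if preset not in ("conserved", "divergent"):
            raise ValueError("preset must be 'conserved' or 'divergent'")
        cmd += ["--preset", preset]
    if classify_all:
        cmd.append("--classify_all")
    return cmd


cmd = build_tetrimmer_command(
    "TE_consensus_library.fa", "genome.fasta", "outdir",
    threads=20, classify_all=True,
)
print(shlex.join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to actually start the run.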
#launch GUI
> python tetrimmer/TEtrimmer_proof_anno_GUI/annoGUI.py