Informatics on Mac

A collection of notes on informatics for HTS (NGS).

trimAl, a trimming tool for multiple sequence alignments

2020 5/14: added help output

2021 1/23: added installation via conda

 

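A multiple alignment often contains regions where the sequences can barely be aligned. Such regions are hard to use as information, so removing them is generally not a problem. trimAl is a trimming tool for multiple alignments that also scales to large data sets: it can remove poorly aligned regions from a multiple alignment of thousands of sequences. Accepted input formats include Phylip, Fasta, Clustal, NBRF/Pir, Mega, and Nexus.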

 

Manual

http://trimal.cgenomics.org/use_of_the_command_line_trimal_v1.2

Tutorial

http://trimal.cgenomics.org/_media/manual.b.pdf

 

Installation

GitHub

Download

http://trimal.cgenomics.org/downloads

Extract the downloaded archive and build it, or clone the repository from GitHub and build from source:

git clone https://github.com/scapella/trimal.git
cd trimal/source/
make -j

#conda
mamba install -c bioconda trimal
mamba install -c bioconda/label/cf201901 trimal

./trimal

$ ./trimal

trimAl v1.4.rev22 build[2015-05-21]. 2009-2015. Salvador Capella-Gutierrez and Toni Gabaldón.

trimAl webpage: http://trimal.cgenomics.org

This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, the last available version.

Please cite:
trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses.
Salvador Capella-Gutierrez; Jose M. Silla-Martinez; Toni Gabaldon.
Bioinformatics 2009, 25:1972-1973.

Basic usage
trimal -in <inputfile> -out <outputfile> -(other options).

Common options (for a complete list please see the User Guide or visit http://trimal.cgenomics.org):

    -h                          Print this information and show some examples.
    --version                   Print the trimAl version.

    -in <inputfile>             Input file in several formats (clustal, fasta, NBRF/PIR, nexus, phylip3.2, phylip).

    -compareset <inputfile>     Input list of paths for the files containing the alignments to compare.
    -forceselect <inputfile>    Force selection of the given input file in the files comparison method.

    -backtrans <inputfile>      Use a Coding Sequences file to get a backtranslation for a given AA alignment
    -ignorestopcodon            Ignore stop codons in the input coding sequences
    -splitbystopcodon           Split input coding sequences up to first stop codon appearance

    -matrix <inpufile>          Input file for user-defined similarity matrix (default is Blosum62).
    --alternative_matrix <name> Select an alternative similarity matrix already loaded.
                                Only available 'degenerated_nt_identity'

    -out <outputfile>           Output alignment in the same input format (default stdout). (default input format)
    -htmlout <outputfile>       Get a summary of trimal's work in an HTML file.

    -keepheader                 Keep original sequence header including non-alphanumeric characters.
                                Only available for input FASTA format files. (future versions will extend this feature)

    -nbrf                       Output file in NBRF/PIR format
    -mega                       Output file in MEGA format
    -nexus                      Output file in NEXUS format
    -clustal                    Output file in CLUSTAL format

    -fasta                      Output file in FASTA format
    -fasta_m10                  Output file in FASTA format. Sequences name length up to 10 characters.

    -phylip                     Output file in PHYLIP/PHYLIP4 format
    -phylip_m10                 Output file in PHYLIP/PHYLIP4 format. Sequences name length up to 10 characters.
    -phylip_paml                Output file in PHYLIP format compatible with PAML
    -phylip_paml_m10            Output file in PHYLIP format compatible with PAML. Sequences name length up to 10 characters.
    -phylip3.2                  Output file in PHYLIP3.2 format
    -phylip3.2_m10              Output file in PHYLIP3.2 format. Sequences name length up to 10 characters.

    -complementary              Get the complementary alignment.
    -colnumbering               Get the relationship between the columns in the old and new alignment.

    -selectcols { n,l,m-k }     Selection of columns to be removed from the alignment. Range: [0 - (Number of Columns - 1)]. (see User Guide).
    -selectseqs { n,l,m-k }     Selection of sequences to be removed from the alignment. Range: [0 - (Number of Sequences - 1)]. (see User Guide).

    -gt -gapthreshold <n>       1 - (fraction of sequences with a gap allowed). Range: [0 - 1]
    -st -simthreshold <n>       Minimum average similarity allowed. Range: [0 - 1]
    -ct -conthreshold <n>       Minimum consistency value allowed.Range: [0 - 1]
    -cons <n>                   Minimum percentage of the positions in the original alignment to conserve. Range: [0 - 100]

    -nogaps                     Remove all positions with gaps in the alignment.
    -noallgaps                  Remove columns composed only by gaps.
    -keepseqs                   Keep sequences even if they are composed only by gaps.

    -gappyout                   Use automated selection on "gappyout" mode. This method only uses information based on gaps' distribution. (see User Guide).
    -strict                     Use automated selection on "strict" mode. (see User Guide).
    -strictplus                 Use automated selection on "strictplus" mode. (see User Guide).
                               (Optimized for Neighbour Joining phylogenetic tree reconstruction).

    -automated1                 Use a heuristic selection of the automatic method based on similarity statistics. (see User Guide). (Optimized for Maximum Likelihood phylogenetic tree reconstruction).

    -terminalonly               Only columns out of internal boundaries (first and last column without gaps) are
                                candidates to be trimmed depending on the selected method
    --set_boundaries { l,r }    Set manually left (l) and right (r) boundaries - only columns out of these boundaries are
                                candidates to be trimmed depending on the selected method. Range: [0 - (Number of Columns - 1)]
    -block <n>                  Minimum column block size to be kept in the trimmed alignment. Available with manual and automatic (gappyout) methods

    -resoverlap                 Minimum overlap of a positions with other positions in the column to be considered a "good position". Range: [0 - 1]. (see User Guide).
    -seqoverlap                 Minimum percentage of "good positions" that a sequence must have in order to be conserved. Range: [0 - 100](see User Guide).

    -clusters <n>               Get the most Nth representatives sequences from a given alignment. Range: [1 - (Number of sequences)]
    -maxidentity <n>            Get the representatives sequences for a given identity threshold. Range: [0 - 1].

    -w <n>                      (half) Window size, score of position i is the average of the window (i - n) to (i + n).
    -gw <n>                     (half) Window size only applies to statistics/methods based on Gaps.
    -sw <n>                     (half) Window size only applies to statistics/methods based on Similarity.
    -cw <n>                     (half) Window size only applies to statistics/methods based on Consistency.

    -sgc                        Print gap scores for each column in the input alignment.
    -sgt                        Print accumulated gap scores for the input alignment.
    -ssc                        Print similarity scores for each column in the input alignment.
    -sst                        Print accumulated similarity scores for the input alignment.
    -sfc                        Print sum-of-pairs scores for each column from the selected alignment
    -sft                        Print accumulated sum-of-pairs scores for the selected alignment
    -sident                     Print identity scores matrix for all sequences in the input alignment. (see User Guide).
    -soverlap                   Print overlap scores matrix for all sequences in the input alignment. (see User Guide).

 

 

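Usage

The input is the output file of a multiple alignment.

Trim every column in which 10% or more of the sequences have a gap and write the result. If that would leave less than 60% of the original alignment length, the best 60% of the columns (those with the fewest gaps) are kept instead.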

trimal -in input.aln -out output.aln -htmlout output.html -gt 0.9 -cons 60 
  • -in Input file in several formats (clustal, fasta, NBRF/PIR, nexus, phylip3.2, phylip).
  • -out Output alignment in the same input format (default stdout). (default input format)
  • -htmlout Get a summary of trimal's work in an HTML file.
  • -gt 1 - (fraction of sequences with a gap allowed).
  • -cons Minimum percentage of the positions in the original alignment to conserve. 
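
To make the interaction between -gt and -cons concrete, here is a minimal Python sketch of the idea. It is not trimAl's implementation: the function names gap_free_fraction and trim_columns are hypothetical, only "-" is treated as a gap, and both the handling of columns sitting exactly on the threshold and the rounding of the -cons minimum are assumptions made for illustration.

```python
import math
from typing import List


def gap_free_fraction(alignment: List[str]) -> List[float]:
    """Return, for each column, the fraction of sequences without a gap."""
    if not alignment:
        raise ValueError("alignment is empty")
    n_cols = len(alignment[0])
    if any(len(seq) != n_cols for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    return [
        sum(seq[col] != "-" for seq in alignment) / n_seqs
        for col in range(n_cols)
    ]


def trim_columns(alignment: List[str], gt: float = 0.9, cons: float = 60.0) -> List[str]:
    """Keep columns whose gap-free fraction reaches gt, but never fewer
    than cons percent of the original columns (assumed rounding: ceil)."""
    scores = gap_free_fraction(alignment)
    n_cols = len(scores)

    # Step 1: columns that pass the gap threshold on their own.
    keep = {i for i, score in enumerate(scores) if score >= gt}

    # Step 2: if too few columns survive, add back the least gappy ones.
    min_keep = math.ceil(n_cols * cons / 100)
    if len(keep) < min_keep:
        rejected = sorted(
            (i for i in range(n_cols) if i not in keep),
            key=lambda i: (-scores[i], i),  # best score first, then position
        )
        keep.update(rejected[: min_keep - len(keep)])

    columns = sorted(keep)
    return ["".join(seq[i] for i in columns) for seq in alignment]


if __name__ == "__main__":
    toy = ["ACGT-", "AC-T-", "A--T-", "ACGTA"]
    # Only columns 0 and 3 pass gt=0.9; cons=60 restores column 1 as well.
    print(trim_columns(toy, gt=0.9, cons=60))
```

With cons set to 0 the same toy alignment is reduced to the two gap-free columns, which shows why -cons is useful as a safety net when a strict -gt value would discard most of the alignment.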

 

Let trimAl determine the trimming thresholds automatically. Four methods are available.

trimal -in input.aln -out output.aln -gappyout
  • -gappyout Use automated selection on "gappyout" mode. This method only uses information based on gaps' distribution. (see User Guide).
trimal -in input.aln -out output.aln -strict
  • -strict Use automated selection on "strict" mode. (see User Guide).
trimal -in input.aln -out output.aln -strictplus
  • -strictplus Use automated selection on "strictplus" mode. (see User Guide). (Optimized for Neighbour Joining phylogenetic tree reconstruction).
trimal -in input.aln -out output.aln -automated1
  • -automated1 Use a heuristic selection of the automatic method based on similarity statistics. (see User Guide). (Optimized for Maximum Likelihood phylogenetic tree reconstruction).
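
When comparing the four automated modes on the same alignment, it can be convenient to generate the commands programmatically. The following Python sketch only assembles the command lines shown above; build_trimal_command and run_all_modes are hypothetical helper names, the output naming scheme is an assumption, and run_all_modes requires a trimal executable on PATH.

```python
import shutil
import subprocess
from typing import List

# The four automated modes listed in the trimAl help output.
AUTOMATED_MODES = ("gappyout", "strict", "strictplus", "automated1")


def build_trimal_command(in_path: str, out_path: str, mode: str,
                         executable: str = "trimal") -> List[str]:
    """Build the argument list for one automated trimming run."""
    if mode not in AUTOMATED_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {AUTOMATED_MODES}")
    return [executable, "-in", in_path, "-out", out_path, f"-{mode}"]


def run_all_modes(in_path: str, stem: str = "output") -> List[str]:
    """Run every automated mode and return the output file names.

    Output names such as output.strict.aln are an illustrative choice.
    """
    if shutil.which("trimal") is None:
        raise RuntimeError("trimal was not found on PATH")
    outputs = []
    for mode in AUTOMATED_MODES:
        out_path = f"{stem}.{mode}.aln"
        subprocess.run(build_trimal_command(in_path, out_path, mode), check=True)
        outputs.append(out_path)
    return outputs


if __name__ == "__main__":
    # Print the commands without executing them.
    for mode in AUTOMATED_MODES:
        print(" ".join(build_trimal_command("input.aln", f"output.{mode}.aln", mode)))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.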

 

  

The multiple alignment itself can be produced with a tool such as T-Coffee.

t_coffee input.fasta

 

Citation

trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses

Salvador Capella-Gutiérrez, José M. Silla-Martínez and Toni Gabaldón

Bioinformatics. 2009 Aug 1;25(15):1972-3.

 

Related