Clustal Omega for multiple sequence alignment

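Clustal Omega is a package for making multiple sequence alignments (MSAs). It was developed about ten years ago in response to the sharp growth in the number of available sequences and the need to build large alignments quickly and accurately. Over the past 30 years the most widely used MSA packages have been Clustal W and Clustal X, although more than 100 MSA packages were released in that period. These packages fall roughly into two main groups: those that are fast and can make very large alignments, and those that are more accurate but limited to smaller numbers of sequences. MUSCLE and MAFFT are widely used examples of the former, while T-Coffee and MAFFT L-INS-i are examples of the latter. Clustal W and Clustal X are widely used because they run on personal computers and servers, because the code is robust and portable, and because the user interface is very flexible and intuitive. The original motivation for designing Clustal Omega was to produce a package that could make very large alignments without sacrificing accuracy.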

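The first Clustal packages featured a fast and simple method for making "guide trees". These are clusterings of the sequences that are used to decide the order of alignment during the later progressive alignment phase. Clustal is one example of a family of related methods that go back to the first fully automated MSA method in the 1970s. The general idea is to start by aligning just two sequences, usually the two closest in the data set. The alignment is then built up by aligning alignments to each other, or sequences to alignments, following the topology of the guide tree. Guide tree construction usually has complexity O(N^2) for N sequences, because all N sequences must be compared with one another. Early versions of Clustal used fast word-based alignments for these comparisons, which made them memory-efficient and fast enough to run on PCs and Macintosh computers. Once the number of sequences grows beyond a few thousand, however, the O(N^2) step becomes time-consuming and very large alignments become difficult. The authors developed an O(N log N) method called mBed, which restricts the number of sequence alignment score computations to N log(N) and can make guide trees for hundreds of thousands of sequences. mBed is the method used in Clustal Omega, and it provides the capacity and scalability needed for very large data sets.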
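To get a feel for why the full distance matrix becomes impractical, the following minimal Python sketch compares the number of pairwise comparisons in the all-against-all case, N(N-1)/2, with a rough N log2(N) count. The function names are hypothetical, and the N log2(N) figure ignores constant factors and the details of how mBed actually selects its comparisons, so it only illustrates the order of magnitude.

```python
import math


def full_matrix_comparisons(n: int) -> int:
    """Pairwise comparisons when every sequence is compared with every other one."""
    return n * (n - 1) // 2


def nlogn_comparisons(n: int) -> int:
    """Rough N*log2(N) count; constants are ignored, so this is illustrative only."""
    if n <= 1:
        return 0
    return math.ceil(n * math.log2(n))


# Print both counts for a few data set sizes.
for n in (100, 10_000, 100_000):
    full = full_matrix_comparisons(n)
    approx = nlogn_comparisons(n)
    print(f"N={n:>7,}  full matrix={full:>14,}  N log2 N ~ {approx:>10,}")
```

The gap between the two columns widens quickly as N grows, which is the reason the full matrix is only offered as an option (--full) while mBed is the default.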

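The second main development in Clustal Omega was the use of an alignment engine that aligns profile hidden Markov models (HMMs) to each other, in place of conventional dynamic programming and profile alignment. The authors used HHalign, which has been shown to align profile HMMs with very high accuracy. This greatly improved the accuracy of Clustal Omega over the earlier Clustal programs, as measured on structure-based alignment benchmarks. Only a small amount of the original code from the earlier Clustal programs was reused in the new program: the fast word-based pairwise alignment routines. The rest of the code was either newly written from scratch or taken from publicly available libraries.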

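The result was a completely new program that can align many thousands of sequences without loss of accuracy. It was released in 2011, and all of the source code can be freely downloaded under an open-source license. Users can download executables for most operating systems (www.clustal.org) or use the program online at many sites, notably the EMBL European Bioinformatics Institute (www.ebi.ac.uk). The paper describes several features added since the original release of Clustal Omega and presents benchmark results for various program options, using a protein benchmark based on the accuracy of secondary structure prediction.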

Installation

Installed with conda in an anaconda3 (Python 3.7) environment on macOS 10.14.

#bioconda
mamba create -n clustalo python=3.8 -y
conda activate clustalo
mamba install -c bioconda clustalo -y
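After installation it can be handy to confirm from a script that the clustalo executable is visible in the active environment. The sketch below is one possible check in Python; the function name clustalo_version is hypothetical, and it relies only on the --version flag listed in the help output.

```python
import shutil
import subprocess
from typing import Optional


def clustalo_version(executable: str = "clustalo") -> Optional[str]:
    """Return the version string reported by the executable, or None if it is not on PATH."""
    path = shutil.which(executable)
    if path is None:
        return None
    # --version prints version information and exits.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    return result.stdout.strip() or None


if __name__ == "__main__":
    version = clustalo_version()
    if version:
        print(f"clustalo found, version {version}")
    else:
        print("clustalo not found on PATH; is the conda environment active?")
```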

> clustalo --help

Clustal Omega - 1.2.4 (AndreaGiacomo)

If you like Clustal-Omega please cite:

Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG.

Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega.

Mol Syst Biol. 2011 Oct 11;7:539. doi: 10.1038/msb.2011.75. PMID: 21988835.

If you don't like Clustal-Omega, please let us know why (and cite us anyway).

Check http://www.clustal.org for more information and updates.

Usage: clustalo [-hv] [-i {<file>,-}] [--hmm-in=<file>]... [--hmm-batch=<file>] [--dealign] [--profile1=<file>] [--profile2=<file>] [--is-profile] [-t {Protein, RNA, DNA}] [--infmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]}] [--distmat-in=<file>] [--distmat-out=<file>] [--guidetree-in=<file>] [--guidetree-out=<file>] [--pileup] [--full] [--full-iter] [--cluster-size=<n>] [--clustering-out=<file>] [--trans=<n>] [--posterior-out=<file>] [--use-kimura] [--percent-id] [-o {file,-}] [--outfmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]}] [--residuenumber] [--wrap=<n>] [--output-order={input-order,tree-order}] [--iterations=<n>] [--max-guidetree-iterations=<n>] [--max-hmm-iterations=<n>] [--maxnumseq=<n>] [--maxseqlen=<l>] [--auto] [--threads=<n>] [--pseudo=<file>] [-l <file>] [--version] [--long-version] [--force] [--MAC-RAM=<n>]

A typical invocation would be: clustalo -i my-in-seqs.fa -o my-out-seqs.fa -v

See below for a list of all options.

Sequence Input:

-i, --in, --infile={<file>,-} Multiple sequence input file (- for stdin)

--hmm-in=<file> HMM input files

--hmm-batch=<file> specify HMMs for individual sequences

--dealign Dealign input sequences

--profile1, --p1=<file> Pre-aligned multiple sequence file (aligned columns will be kept fix)

--profile2, --p2=<file> Pre-aligned multiple sequence file (aligned columns will be kept fix)

--is-profile disable check if profile, force profile (default no)

-t, --seqtype={Protein, RNA, DNA} Force a sequence type (default: auto)

--infmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} Forced sequence input file format (default: auto)

Clustering:

--distmat-in=<file> Pairwise distance matrix input file (skips distance computation)

--distmat-out=<file> Pairwise distance matrix output file

--guidetree-in=<file> Guide tree input file (skips distance computation and guide-tree clustering step)

--guidetree-out=<file> Guide tree output file

--pileup Sequentially align sequences

--full Use full distance matrix for guide-tree calculation (might be slow; mBed is default)

--full-iter Use full distance matrix for guide-tree calculation during iteration (might be slowish; mBed is default)

--cluster-size=<n> soft maximum of sequences in sub-clusters

--clustering-out=<file> Clustering output file

--trans=<n> use transitivity (default: 0)

--posterior-out=<file> Posterior probability output file

--use-kimura use Kimura distance correction for aligned sequences (default no)

--percent-id convert distances into percent identities (default no)

Alignment Output:

-o, --out, --outfile={file,-} Multiple sequence alignment output file (default: stdout)

--outfmt={a2m=fa[sta],clu[stal],msf,phy[lip],selex,st[ockholm],vie[nna]} MSA output file format (default: fasta)

--residuenumber, --resno in Clustal format print residue numbers (default no)

--wrap=<n> number of residues before line-wrap in output

--output-order={input-order,tree-order} MSA output order like in input/guide-tree

Iteration:

--iterations, --iter=<n> Number of (combined guide-tree/HMM) iterations

--max-guidetree-iterations=<n> Maximum number of guidetree iterations

--max-hmm-iterations=<n> Maximum number of HMM iterations

Limits (will exit early, if exceeded):

--maxnumseq=<n> Maximum allowed number of sequences

--maxseqlen=<l> Maximum allowed sequence length

Miscellaneous:

--auto Set options automatically (might overwrite some of your options)

--threads=<n> Number of processors to use

--pseudo=<file> Input file for pseudo-count parameters

-l, --log=<file> Log all non-essential output to this file

-h, --help Print this help and exit

-v, --verbose Verbose output (increases if given multiple times)

--version Print version information and exit

--long-version Print long version information and exit

--force Force file overwriting

Usage

Specify a multi-FASTA file containing the sequences to compare.

#protein
clustalo -t Protein -i input_proteins.fa -o aligned.fa

#DNA
clustalo -t DNA -i input_DNA.fa -o aligned.fa

#RNA
clustalo -t RNA -i input_RNA.fa -o aligned.fa

#auto (omit -t so that the sequence type is detected automatically)
clustalo -i input.fa -o aligned.fa

-t, --seqtype={Protein, RNA, DNA} Force a sequence type (default: auto)
-i Multiple sequence input file (- for stdin)
-o Multiple sequence alignment output file (default: stdout)
--threads Number of processors to use (requires OpenMP)
--auto Set options automatically (might overwrite some of your options)
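When the same kind of run is repeated over many input files, assembling the command line in a script avoids typos in the flags. The following is a minimal sketch, not an official wrapper: build_clustalo_cmd and VALID_SEQTYPES are hypothetical names, and the validation simply restricts -t to the three spellings shown in the help text. Only flags that appear in the help output above are used.

```python
from typing import List, Optional

# Spellings taken from the help text: -t, --seqtype={Protein, RNA, DNA}
VALID_SEQTYPES = {"Protein", "RNA", "DNA"}


def build_clustalo_cmd(
    infile: str,
    outfile: str,
    seqtype: Optional[str] = None,
    threads: Optional[int] = None,
    force: bool = False,
) -> List[str]:
    """Assemble a clustalo argument list; seqtype=None leaves type detection on auto."""
    cmd = ["clustalo", "-i", infile, "-o", outfile]
    if seqtype is not None:
        if seqtype not in VALID_SEQTYPES:
            raise ValueError(f"seqtype must be one of {sorted(VALID_SEQTYPES)}")
        cmd += ["-t", seqtype]
    if threads is not None:
        cmd.append(f"--threads={threads}")
    if force:
        # --force allows an existing output file to be overwritten.
        cmd.append("--force")
    return cmd


# Show the command that would be run for a protein alignment on 4 threads.
print(" ".join(build_clustalo_cmd("input_proteins.fa", "aligned.fa", "Protein", 4)))
```

The returned list can be passed to subprocess.run(cmd, check=True) once clustalo is installed; building the list separately keeps the sketch usable without the program.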

Passing the "--distmat-out" flag with an output file name writes a distance matrix file in addition to the multiple alignment. Adding "--percent-id" converts the distances to percent identities (%).

clustalo --full --percent-id --distmat-out=output.distmat -i input.fa -o output.aln

--full Use full distance matrix for guide-tree calculation (might be slow; mBed is default)
--full-iter Use full distance matrix for guide-tree calculation during iteration (might be slowish; mBed is default)
--percent-id convert distances into percent identities (default no)
--distmat-out Pairwise distance matrix output file (requires --full; not available in the default mBed mode)

This option is also useful when you want a distance matrix file for an all-versus-all sequence comparison.
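For downstream analysis the matrix file usually needs to be loaded into a script. The sketch below parses a square matrix under the assumption that the file starts with a line holding the number of sequences, followed by one row per sequence consisting of the sequence name and N whitespace-separated values. That layout is an assumption about the output format, the function name parse_distmat is hypothetical, and the example values are invented for illustration, so check the result against a real file before relying on it. Sequence names containing spaces would not be handled.

```python
from typing import List, Tuple


def parse_distmat(text: str) -> Tuple[List[str], List[List[float]]]:
    """Parse a square matrix: first line is N, then N rows of 'name v1 ... vN'."""
    lines = [line for line in text.splitlines() if line.strip()]
    n = int(lines[0].split()[0])
    names: List[str] = []
    matrix: List[List[float]] = []
    for line in lines[1 : n + 1]:
        fields = line.split()
        row = [float(value) for value in fields[1:]]
        if len(row) != n:
            raise ValueError(f"row for {fields[0]} has {len(row)} values, expected {n}")
        names.append(fields[0])
        matrix.append(row)
    if len(names) != n:
        raise ValueError(f"expected {n} rows, found {len(names)}")
    return names, matrix


# Invented example data in the assumed layout (percent identities).
example = """3
seqA 100.000000 80.500000 42.000000
seqB 80.500000 100.000000 55.250000
seqC 42.000000 55.250000 100.000000
"""

names, matrix = parse_distmat(example)
for name, row in zip(names, matrix):
    print(name, row)
```

For a real run, read the file written by --distmat-out (output.distmat in the command above) and pass its contents to parse_distmat.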

EMBOSS also provides a command with similar functionality, distmat (it lets you choose the algorithm, and it computes the matrix much faster).

https://www.bioinformatics.nl/cgi-bin/emboss/help/distmat

Citation
Clustal Omega for making accurate alignments of many protein sequences

Fabian Sievers, Desmond G Higgins

Protein Sci. 2018 Jan;27(1):135-145

Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega

Fabian Sievers, Andreas Wilm, David Dineen, Toby J Gibson, Kevin Karplus, Weizhong Li, Rodrigo Lopez, Hamish McWilliam, Michael Remmert, Johannes Söding, Julie D Thompson, Desmond G Higgins

Mol Syst Biol. 2011 Oct 11;7:539

Reference

Question: Make matrix of protein pairwise identities/similarities from multiple protein sequences

https://www.biostars.org/p/301422/
