All Versus Allの配列比較（sequence comparison）を行うEMBOSSの needleall

2020 10/29 vsearchのコマンド追記

needleall は入力された一連の配列を読み込み、それらをすべて 1 つ以上の配列と比較し、最適なグローバル配列のアラインメントをファイルに書き込む。Needleman-Wunschアライメントアルゴリズムを使用して、全長に沿った2つの配列の最適アライメント（ギャップを含む）を見つける。このアルゴリズムは動的計画法を使用して、可能なすべてのアラインメントを探索し、最適なものを選択することで、最適なアラインメントを確実にする。可能性のある残基またはヌクレオチドが一致するすべての値を含むスコアリングマトリクスが読み込まれる。needleallは、アラインメントのスコアがスコアリングマトリクスから取得したマッチの合計から、アラインメントされた配列のギャップを開いたり拡張したりすることで生じるペナルティを差し引いた値に等しい最大のスコアを持つアラインメントを見つける。置換マトリクスとギャップの開きと拡張のペナルティはユーザーが指定する。

アルゴリズム

Needleman-Wunsch アルゴリズムは、2 つの配列のベストスコアとアラインメントを mn ステップ（n と m は配列の長さ）で計算できるアルゴリズムの一種である。これらの動的プログラミングアルゴリズムは、タンパク質の配列比較のためにNeedlemanとWunschによって最初に開発されたが、同様の手法は1960年代後半から1970年代初頭にかけて、音声処理やコンピュータサイエンスの分野で使用するために独自に考案された。重要な問題は、ギャップ、すなわちアライメントスコアを最適化するために挿入されるスペースの処理である。各ギャップが開かれるごとにスコアからペナルティが減算され（「ギャップオープン」ペナルティ）、ギャップスペースの合計数にコストを掛けたものがスコアから減算される（「ギャップエクステンション」ペナルティ）。一般的に、ギャップを拡張するためのコストは、ギャップを開くためのコストの5～10倍に設定されている。

n個の位置のギャップに対するペナルティは、次の式を用いて計算される。

gap opening penalty + (n - 1) * gap extension penalty
Needleman-Wunschのグローバルアラインメントでは、各配列の全長がアラインメントされる。配列が部分的に重なっている場合もあれば、一方の配列が他方の配列に完全に内部的にアラインメントされている場合もある。オーバーラップの両端が垂れ下がっていてもペナルティはない。バイオインフォマティクスでは配列が不完全であると考えるのが一般的であり、欠落している塩基のアラインメントに失敗してもペナルティはない。

EMBOSS eexplorer

https://www.bioinformatics.nl/cgi-bin/emboss/needleall

needleall wiki

EMBOSS: needleall

インストール

EMBOSSはcondaやbrewで導入できる。

#bioconda
conda install -c bioconda -y emboss 

#homebrew 
brew install emboss

> needleall -h

$ needleall -h

Many-to-many pairwise alignments of two sequence sets

Version: EMBOSS:6.6.0.0

Standard (Mandatory) qualifiers:

[-asequence] seqset Sequence set filename and optional format,

or reference (input USA)

[-bsequence] seqall Sequence(s) filename and optional format, or

reference (input USA)

-gapopen float [10.0 for any sequence] The gap open penalty

is the score taken away when a gap is

created. The best value depends on the

choice of comparison matrix. The default

value assumes you are using the EBLOSUM62

matrix for protein sequences, and the

EDNAFULL matrix for nucleotide sequences.

(Floating point number from 1.0 to 100.0)

-gapextend float [0.5 for any sequence] The gap extension,

penalty is added to the standard gap penalty

for each base or residue in the gap. This

is how long gaps are penalized. Usually you

will expect a few long gaps rather than many

short gaps, so the gap extension penalty

should be lower than the gap penalty. An

exception is where one or both sequences are

single reads with possible sequencing

errors in which case you would expect many

single base gaps. You can get this result by

setting the gap open penalty to zero (or

very low) and using the gap extension

penalty to control gap scoring. (Floating

point number from 0.0 to 10.0)

[-outfile] align [*.needleall] Output alignment file name

(default -aformat score)

Additional (Optional) qualifiers:

-datafile matrixf [EBLOSUM62 for protein, EDNAFULL for DNA]

This is the scoring matrix file used when

comparing sequences. By default it is the

file 'EBLOSUM62' (for proteins) or the file

'EDNAFULL' (for nucleic sequences). These

files are found in the 'data' directory of

the EMBOSS installation.

-endweight boolean [N] Apply end gap penalties.

-endopen float [10.0 for any sequence] The end gap open

penalty is the score taken away when an end

gap is created. The best value depends on

the choice of comparison matrix. The default

value assumes you are using the EBLOSUM62

matrix for protein sequences, and the

EDNAFULL matrix for nucleotide sequences.

(Floating point number from 1.0 to 100.0)

-endextend float [0.5 for any sequence] The end gap

extension, penalty is added to the end gap

penalty for each base or residue in the end

gap. (Floating point number from 0.0 to

10.0)

-minscore float [1.0 for any sequence] Minimum alignment

score to report an alignment. (Floating

point number from -10.0 to 100.0)

-errfile outfile [needleall.error] Error file to be written

Advanced (Unprompted) qualifiers:

-[no]brief boolean [Y] Brief identity and similarity

General qualifiers:

-help boolean Report command line options and exit. More

information on associated and general

qualifiers can be found with -help -verbose

実行方法

クエリのmulti-fasta、比較先のmulti-fastaの順番で指定する（塩基配列かアミノ酸配列）。コマンドだけ叩くと、対話モードで実行できる。

needleall first_sequences.fasta second_sequences.fasta output -auto

-gapopen float [10.0 for any sequence] The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62
matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. (Floating point number from 1.0 to 100.0)
-gapextend float [0.5 for any sequence] The gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. (Floating point number from 0.0 to 10.0)
-auto Turn off prompts

１行形式の出力（"-aformat <>"で変更可能*1）

f:id:kazumaxneo:20201028200806p:plain

avaBLAST（DOI: 10.1109/CIBEC.2008.4786046）というプログラム（all versus all blast 比較を行う）も試したかったのですが、ダウンロードHPが消えていたのでこちらを紹介しました。二配列間の比較にはEMBOSSパッケージのneedleまたはstretcherが利用できます。needleについては、ばいばいバイオさんのHPで分かりやすく説明されています。

ゲノムのような大きな配列間のall vs all比較には使えません。また、遠く離れた配列間の比較にも適していません。注意して下さい。

2020 10/29追記

windowmoonさんに教えていただきました。

vsearchの--allpairs_globalサブコマンドを使う（Needleman-Wunsch の非常に高速な実装が利用でき、needleallの数十倍以上高速）。

vsearch --allpairs_global 16S_seq.fa --id 0.5 --uc output

--alnout <FILENAME> filename for human-readable alignment output
--acceptall output all pairwise alignments
--id <REAL> reject if identity lower
--uc <FILENAME> specify filename for UCLUST-like output

引用

EMBOSS: the European Molecular Biology Open Software Suite.
Rice P, Longden I, Bleasby A

Trends Genet. 2000 Jun;16(6):276-7

参考

The available multiple alignment format names are: unknown, multiple, simple, fasta, msf, trace, srs

The available pairwise alignment format names are: pair, markx0, markx1, markx2, markx3, markx10, srspair, score

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

All Versus Allの配列比較（sequence comparison）を行うEMBOSSの needleall