FSA: multiple alignment of megabase-scale sequences or of thousands of sequences

2019 7/29: added conda installation and help output

Official site

http://fsa.sourceforge.net

Q&A

FSA Frequently Asked Questions

Download

SourceForge

https://sourceforge.net/projects/fsa/

Extract the archive, move into the extracted directory, and build.

./configure
make
make install

fsa -h # confirm the installation

#bioconda (link)
conda install -c bioconda fsa

FSA relies on MUMmer or exonerate (or Mercator) when comparing megabase-scale sequences, so install them beforehand, for example with brew.

brew install mummer exonerate

mummer -h # confirm the installation
exonerate -h # confirm the installation
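
The installation checks above can also be done in one pass. The following is a minimal sketch, not part of FSA: `check_tools` is a hypothetical helper name, and it only tests whether each program is on PATH with `command -v`, not whether it runs correctly.

```shell
# Report which of the given programs are on PATH.
# Prints one "name: found" or "name: missing" line per tool and
# returns non-zero if at least one tool is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
            missing=1
        fi
    done
    return "$missing"
}

# Tools used in this article; a non-zero result only means something still needs installing.
check_tools fsa mummer exonerate || echo "install the missing tools before running FSA"
```

Mercator is left out of the list because it is only needed when you pass `--mercator`.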

$ fsa --help

fsa - Distance-based alignment of DNA, RNA and proteins.

Usage: fsa [options] <sequence file(s)>

Command-line options (righthandmost options take precedence)

------------------------------------------------------------

-h,-help,--help display this message

-v,--version display version

Logging options

---------------

--log <string> turn on diagnostic logging (-loghelp shows syntax)

--logfile <file> log to file

--logcopy <file> log to file and standard error

--logtime timestamp standard error (logfile stamped automatically)

--logxml (default) add XML timestamps (--nologxml to disable)

--logerr log on standard error (default)

Output options

--------------

--stockholm output Stockholm alignments (default is multi-FASTA format)

--gui record alignment & statistical model for interactive Java GUI

--write-params write learned emission distributions (substitution matrices) to disk

--write-posteriors write learned pairwise posterior alignment probability matrices to disk

Parallelization options

-----------------------

(Parallelization not available; please see the manual for more information.)

Database options

----------------

(Database not available; please see the manual for more information.)

Pair HMM model options

----------------------

--nucprot align input nucleotide sequences (must all be nucleotide) in protein space

--indel2 (default) use two sets of indel states in Pair HMM (use --noindel2 to use 1 set only)

--gapopen1 <real> initial gap-open probability (for set 1 of indel states)

--gapextend1 <real> initial gap-extend probability (for set 1 of indel states)

--gapopen2 <real> initial gap-open probability (for set 2 of indel states)

--gapextend2 <real> initial gap-extend probability (for set 2 of indel states)

--model <integer> initial substitution model: 0 = Jukes-Cantor, 1 = Tamura-Nei / BLOSUM62-like (proteins) (default is 1)

--time <real> Jukes-Cantor/Tamura-Nei evolutionary time parameter (default is 0.4)

--alphar <real> Tamura-Nei rate alpha_R (transition: purine) (default is 1.3)

--alphay <real> Tamura-Nei rate alpha_Y (transition: pyrimidine) (default is 1.3)

--beta <real> Tamura-Nei rate beta (transversion) (default is 1)

--load-probs <string> load pairwise posterior probabilities from a file rather than performing inference with Pair HMM

Parameter estimation options

----------------------------

--learngap estimate indel probabilities for each pair of sequences (--nolearngap to disable)

--learnemit-bypair (default for DNA and RNA) estimate emission probabilities for each pair of sequences (--nolearnemit-bypair to disable)

--learnemit-all (default for proteins) estimate emission probabilities averaged over all sequences (--nolearnemit-all to disable)

--nolearn disable ALL parameter learning (use ProbCons defaults)

--regularize (default) regularize learned emission and gap probabilities with Dirichlet prior (--noregularize to disable)

--regularization-gapscale <real> scaling factor for transition prior

--regularization-emitscale <real> scaling factor for emission Dirichlet prior

--mininc <real> minimum fractional increase in log-likelihood per round of EM (default is 0.1)

--maxrounds <integer> maximum number of iterations of EM (default is 3)

--mingapdata <integer> minimum amount of sequence data (# of aligned pairs of characters) for training gap probs

--minemitdata <integer> minimum amount of sequence data (# of aligned pairs of characters) for training emission probs

Multiple alignment options: sequence annealing

----------------------------------------------

--refinement <integer> number of iterative refinement steps (default is unlimited; 0 for none)

--maxsn maximum sensitivity (instead of highest accuracy)

--gapfactor <real> gap factor; 0 for highest sensitivity (the internal effective minimum is 0.01); >1 for higher specificity (default is 1)

--dynamicweights (default) enable dynamic edge re-weighting (--nodynamicweights to disable)

--treeweights <string> weights for sequence pairs based on a tree

--require-homology require that there be some detectable homology between all input sequences

Alignment speedup options: many sequences

-----------------------------------------

--fast fast alignment: use 5 * Erdos-Renyi threshold percent of sequence pairs for alignment and 2 * for learning

--refalign alignment to a reference sequence only (reference must be first sequence in file)

--mst-min <integer> build --mst-min minimum spanning trees on input sequences for pairwise comparisons (default is 3)

--mst-max <integer> build --mst-max maximum spanning trees on input sequences for pairwise comparisons (default is 0)

--mst-palm <integer> build --mst-palm minimum spanning palm trees on input sequences for pairwise comparisons (default is 0)

--degree <integer> use --degree number of pairwise comparisons between closest sequences (default is 0)

--kmer <integer> length of k-mers to use when determining sequence similarity

--alignment-fraction <real> randomized fraction of all (n choose 2) pairs of sequences to consider during alignment inference (default is 1)

--alignment-number <integer> total number of (randomized) pairs of sequences to consider during alignment inference

Alignment speedup options: long sequences (MUMmer)

--------------------------------------------------

--anchored use anchoring (--noanchored to disable)

--translated perform anchoring in protein space

--minlen <integer> minimum length of exact matches for anchoring

--maxjoinlen <integer> maximum ungapped separation of parallel adjacent anchors to join (default is 2)

--hardmasked leave hardmasked sequence >10 nt unaligned instead of randomizing it (default for long DNA)

Alignment speedup options: long sequences (exonerate)

-----------------------------------------------------

--exonerate call exonerate to get anchors (implies --anchored)

--minscore <integer> minimum score of alignments found by exonerate (default is 100)

--softmasked input sequences are softmasked

Alignment speedup options: long sequences (Mercator)

----------------------------------------------------

--mercator <string> input Mercator constraints

Memory savings

--------------

--maxram <integer> maximum RAM to use (in megabytes) (default is -1)

--bandwidth <integer> banding (default is no banding)

--minprob <real> minimum posterior probability to store (default is 0.01)

Input sequence file(s) must be in FASTA format.

FSA attempts to automatically figure out appropriate settings;

you can override its automated choices with the above options.

Please contact the FSA team at fsa@math.berkeley.edu with any questions or comments.

Run

Aligning hundreds or more sequences (genes)

fsa --fast genes.fa --log 7

--log <string> turn on diagnostic logging (-loghelp shows syntax)
--gui record alignment & statistical model for interactive Java GUI

Adding --gui lets you draw the multiple alignment with the bundled Java application.
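
The `--fast` run above is aimed at inputs with hundreds of sequences or more. As a rough sketch, the sequence count can be checked first and `--fast` added only when it is large. The helper names `count_seqs` and `fast_flag`, the cutoff of 200, and the output name `genes.aln.fa` are all assumptions for illustration. Redirecting standard output to a file assumes that FSA prints the alignment to standard output.

```shell
# Count FASTA records by counting header lines (lines starting with ">").
count_seqs() {
    awk '/^>/ { n++ } END { print n + 0 }' "$1"
}

# Print "--fast" when the file holds at least $2 sequences
# (default 200, an arbitrary illustrative cutoff); print nothing otherwise.
fast_flag() {
    n=$(count_seqs "$1")
    if [ "$n" -ge "${2:-200}" ]; then
        echo "--fast"
    fi
}

# Run only when fsa and the input file are actually present.
if command -v fsa >/dev/null 2>&1 && [ -f genes.fa ]; then
    # $(fast_flag ...) is left unquoted on purpose so an empty result adds no argument.
    fsa $(fast_flag genes.fa) genes.fa > genes.aln.fa
fi
```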

Genome alignment using MUMmer

fsa --anchored genome_set.fa --log 7

--anchored use anchoring (--noanchored to disable)

Genome alignment using exonerate

fsa --exonerate --softmasked genome_set.fa --log 7

--softmasked input sequences are softmasked
--exonerate call exonerate to get anchors (implies --anchored)
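
`--softmasked` only makes sense when the input really is soft-masked, that is, when repeats are written in lowercase and the rest in uppercase. The sketch below is a quick heuristic check and is not part of FSA. `is_softmasked` is a hypothetical helper that reports whether the sequence lines contain both cases.

```shell
# Print "yes" if the sequence (non-header) lines contain both lowercase and
# uppercase letters, the usual signature of soft-masking; print "no" otherwise.
# LC_ALL=C keeps the [a-z] and [A-Z] ranges strictly ASCII.
is_softmasked() {
    LC_ALL=C awk '
        !/^>/ {
            if ($0 ~ /[a-z]/) lower = 1
            if ($0 ~ /[A-Z]/) upper = 1
        }
        END { print ((lower && upper) ? "yes" : "no") }
    ' "$1"
}

# Report the result only when the input file is present.
if [ -f genome_set.fa ]; then
    echo "soft-masked: $(is_softmasked genome_set.fa)"
fi
```

A file written entirely in lowercase is reported as "no", since it carries no masking information.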

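Visualizing the results

Put the output file input.fa.gui and the FASTA file used as input in the same directory, then pass the name of the input FASTA file as shown below.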

java -jar fsa-1.15.9/display/mad.jar genes.fa
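
Before launching the viewer, it can help to confirm that both files are in place. This is a small sketch under one assumption: the `.gui` file is named after the input FASTA with `.gui` appended, as the `input.fa.gui` example suggests. `gui_inputs_ready` is a hypothetical helper name.

```shell
# Succeed only if the FASTA file and its companion <fasta>.gui file both exist;
# otherwise print each missing path and return non-zero.
gui_inputs_ready() {
    fasta=$1
    missing=0
    for f in "$fasta" "$fasta.gui"; do
        if [ ! -f "$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Launch the viewer only when java is available and both files are present.
if command -v java >/dev/null 2>&1 && gui_inputs_ready genes.fa; then
    java -jar fsa-1.15.9/display/mad.jar genes.fa
fi
```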

Other visualization tools (wiki)

Note that genome-scale alignments require considerable time and memory.
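
The help output lists `--maxram <integer>`, described as the maximum RAM to use in megabytes. The sketch below shows one way to pass a budget given in gigabytes. The helper `maxram_mb`, the 16 GB figure, and the output name `genome_set.aln.fa` are assumptions for illustration. How strictly FSA honors the limit is not covered here, and the redirect assumes FSA prints the alignment to standard output.

```shell
# Convert a memory budget in gigabytes to the megabyte value
# that --maxram expects according to the help text.
maxram_mb() {
    echo $(( $1 * 1024 ))
}

# Run only when fsa and the input file are actually present.
if command -v fsa >/dev/null 2>&1 && [ -f genome_set.fa ]; then
    fsa --anchored --maxram "$(maxram_mb 16)" genome_set.fa > genome_set.aln.fa
fi
```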

The Q&A explains how to balance the trade-off between alignment sensitivity and specificity; refer to it for details (link).
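
The help output shows two options that bear on this trade-off: `--maxsn` (maximum sensitivity) and `--gapfactor` (0 for highest sensitivity, above 1 for higher specificity, default 1). As a sketch, a small wrapper can map an intent to the matching option. `tradeoff_opts` is a hypothetical helper, and the gap factor of 2 is an arbitrary example value, not a recommendation from the FSA authors.

```shell
# Map an intent to FSA options taken from the help text:
#   sensitive -> --maxsn
#   specific  -> --gapfactor N (N defaults to 2, an arbitrary example above 1)
#   other     -> no extra option (FSA defaults)
tradeoff_opts() {
    case "$1" in
        sensitive) echo "--maxsn" ;;
        specific)  echo "--gapfactor ${2:-2}" ;;
        *)         : ;;
    esac
}

# Run only when fsa and the input file are actually present.
if command -v fsa >/dev/null 2>&1 && [ -f genes.fa ]; then
    # Left unquoted on purpose so "--gapfactor 2" splits into two arguments.
    fsa $(tradeoff_opts specific) genes.fa > genes.specific.aln.fa
fi
```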

Citation

Fast Statistical Alignment

Robert K. Bradley, Adam Roberts, Michael Smoot, Sudeep Juvekar, Jaeyoung Do, Colin Dewey, Ian Holmes, Lior Pachter

Published: May 29, 2009 https://doi.org/10.1371/journal.pcbi.1000392

https://www.biostars.org/p/55961/
