Informatics on the Mac

A collection of notes on informatics for NGS.

FSA: multiple alignment of megabase-scale sequences or of thousands of sequences

Updated 2019-07-29: added conda installation and help output

  

Official site

http://fsa.sourceforge.net

Q&A

FSA Frequently Asked Questions

 

Download

SourceForge

https://sourceforge.net/projects/fsa/

Extract the archive, move into the extracted directory, and build.

./configure
make
make install

fsa -h # verify the installation

# Bioconda (link)
conda install -c bioconda fsa

Aligning megabase-scale sequences relies on MUMmer or exonerate (or Mercator), so install them beforehand, for example with Homebrew.

brew install mummer exonerate

mummer -h # verify the installation
exonerate -h # verify the installation
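
The installation checks above can be folded into one small helper. This is a minimal sketch, not part of FSA: `check_tools` is a hypothetical function name, and it only reports whether each command is on the PATH (via the POSIX `command -v` builtin), not whether the tool actually works.

```shell
#!/bin/sh
# check_tools: hypothetical helper (not part of FSA).
# Prints the status of each named command and returns non-zero
# if at least one of them is missing from PATH.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (tool names as used in this post):
#   check_tools fsa mummer exonerate
```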

$ fsa --help

fsa - Distance-based alignment of DNA, RNA and proteins.

Usage: fsa [options] <sequence file(s)>

 

Command-line options (righthandmost options take precedence)

------------------------------------------------------------

-h,-help,--help                   display this message

-v,--version                      display version

 

Logging options

---------------

--log <string>                    turn on diagnostic logging (-loghelp shows syntax)

--logfile <file>                  log to file

--logcopy <file>                  log to file and standard error

--logtime                         timestamp standard error (logfile stamped automatically)

--logxml                          (default) add XML timestamps (--nologxml to disable)

--logerr                          log on standard error (default)

 

Output options

--------------

--stockholm                       output Stockholm alignments (default is multi-FASTA format)

--gui                             record alignment & statistical model for interactive Java GUI

--write-params                    write learned emission distributions (substitution matrices) to disk

--write-posteriors                write learned pairwise posterior alignment probability matrices to disk

 

Parallelization options

-----------------------

(Parallelization not available; please see the manual for more information.)

 

Database options

----------------

(Database not available; please see the manual for more information.)

 

Pair HMM model options

----------------------

--nucprot                         align input nucleotide sequences (must all be nucleotide) in protein space

--indel2                          (default) use two sets of indel states in Pair HMM (use --noindel2 to use 1 set only)

--gapopen1 <real>                 initial gap-open probability (for set 1 of indel states)

--gapextend1 <real>               initial gap-extend probability (for set 1 of indel states)

--gapopen2 <real>                 initial gap-open probability (for set 2 of indel states)

--gapextend2 <real>               initial gap-extend probability (for set 2 of indel states)

--model <integer>                 initial substitution model: 0 = Jukes-Cantor, 1 = Tamura-Nei / BLOSUM62-like (proteins) (default is 1)

--time <real>                     Jukes-Cantor/Tamura-Nei evolutionary time parameter (default is 0.4)

--alphar <real>                   Tamura-Nei rate alpha_R (transition: purine) (default is 1.3)

--alphay <real>                   Tamura-Nei rate alpha_Y (transition: pyrimidine) (default is 1.3)

--beta <real>                     Tamura-Nei rate beta (transversion) (default is 1)

--load-probs <string>             load pairwise posterior probabilities from a file rather than performing inference with Pair HMM

 

Parameter estimation options

----------------------------

--learngap                        estimate indel probabilities for each pair of sequences (--nolearngap to disable)

--learnemit-bypair                (default for DNA and RNA) estimate emission probabilities for each pair of sequences (--nolearnemit-bypair to disable)

--learnemit-all                   (default for proteins) estimate emission probabilities averaged over all sequences (--nolearnemit-all to disable)

--nolearn                         disable ALL parameter learning (use ProbCons defaults)

--regularize                      (default) regularize learned emission and gap probabilities with Dirichlet prior (--noregularize to disable)

--regularization-gapscale <real>  scaling factor for transition prior

--regularization-emitscale <real> scaling factor for emission Dirichlet prior

--mininc <real>                   minimum fractional increase in log-likelihood per round of EM (default is 0.1)

--maxrounds <integer>             maximum number of iterations of EM (default is 3)

--mingapdata <integer>            minimum amount of sequence data (# of aligned pairs of characters) for training gap probs

--minemitdata <integer>           minimum amount of sequence data (# of aligned pairs of characters) for training emission probs

 

Multiple alignment options: sequence annealing

----------------------------------------------

--refinement <integer>            number of iterative refinement steps (default is unlimited; 0 for none)

--maxsn                           maximum sensitivity (instead of highest accuracy)

--gapfactor <real>                gap factor; 0 for highest sensitivity (the internal effective minimum is 0.01); >1 for higher specificity (default is 1)

--dynamicweights                  (default) enable dynamic edge re-weighting (--nodynamicweights to disable)

--treeweights <string>            weights for sequence pairs based on a tree

--require-homology                require that there be some detectable homology between all input sequences

 

Alignment speedup options: many sequences

-----------------------------------------

--fast                            fast alignment: use 5 * Erdos-Renyi threshold percent of sequence pairs for alignment and 2 * for learning

--refalign                        alignment to a reference sequence only (reference must be first sequence in file)

--mst-min <integer>               build --mst-min minimum spanning trees on input sequences for pairwise comparisons (default is 3)

--mst-max <integer>               build --mst-max maximum spanning trees on input sequences for pairwise comparisons (default is 0)

--mst-palm <integer>              build --mst-palm minimum spanning palm trees on input sequences for pairwise comparisons (default is 0)

--degree <integer>                use --degree number of pairwise comparisons between closest sequences (default is 0)

--kmer <integer>                  length of k-mers to use when determining sequence similarity

--alignment-fraction <real>       randomized fraction of all (n choose 2) pairs of sequences to consider during alignment inference (default is 1)

--alignment-number <integer>      total number of (randomized) pairs of sequences to consider during alignment inference

 

Alignment speedup options: long sequences (MUMmer)

--------------------------------------------------

--anchored                        use anchoring (--noanchored to disable)

--translated                      perform anchoring in protein space

--minlen <integer>                minimum length of exact matches for anchoring

--maxjoinlen <integer>            maximum ungapped separation of parallel adjacent anchors to join (default is 2)

--hardmasked                      leave hardmasked sequence >10 nt unaligned instead of randomizing it (default for long DNA)

 

Alignment speedup options: long sequences (exonerate)

-----------------------------------------------------

--exonerate                       call exonerate to get anchors (implies --anchored)

--minscore <integer>              minimum score of alignments found by exonerate (default is 100)

--softmasked                      input sequences are softmasked

 

Alignment speedup options: long sequences (Mercator)

----------------------------------------------------

--mercator <string>               input Mercator constraints

 

Memory savings

--------------

--maxram <integer>                maximum RAM to use (in megabytes) (default is -1)

--bandwidth <integer>             banding (default is no banding)

--minprob <real>                  minimum posterior probability to store (default is 0.01)

 

 

Input sequence file(s) must be in FASTA format.

 

FSA attempts to automatically figure out appropriate settings;

you can override its automated choices with the above options.

 

Please contact the FSA team at fsa@math.berkeley.edu with any questions or comments.

 

 

 

Running FSA

Aligning several hundred or more sequences (genes)

fsa --fast genes.fa --log 7
  • --log <string> turn on diagnostic logging (-loghelp shows syntax)
  • --gui record alignment & statistical model for interactive Java GUI

Adding --gui to the command lets you render the resulting multiple alignment with the bundled Java application.
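
Since --fast is aimed at inputs with many sequences, it can help to count the records before choosing flags. The following is a rough sketch under stated assumptions: `count_seqs` and `fsa_gene_cmd` are hypothetical helper names, and the default threshold of 200 sequences is an arbitrary value picked for illustration, not a recommendation from the FSA documentation. The helper only prints a command so it can be reviewed before running.

```shell
#!/bin/sh
# count_seqs: number of FASTA records (header lines starting with '>').
count_seqs() {
    grep -c '^>' "$1"
}

# fsa_gene_cmd: hypothetical helper that prints (does not run) an fsa command.
# It adds --fast when the file holds at least $2 sequences
# (default 200, an arbitrary threshold for this sketch) and always adds --gui.
fsa_gene_cmd() {
    input=$1
    threshold=${2:-200}
    n=$(count_seqs "$input")
    if [ "$n" -ge "$threshold" ]; then
        echo "fsa --fast --gui $input"
    else
        echo "fsa --gui $input"
    fi
}

# Usage:
#   fsa_gene_cmd genes.fa        # prints the command for review
```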

 

Genome alignment using MUMmer

fsa --anchored genome_set.fa --log 7
  •  --anchored use anchoring (--noanchored to disable)

 

Genome alignment using exonerate

fsa --exonerate --softmasked genome_set.fa --log 7
  • --softmasked input sequences are softmasked
  • --exonerate call exonerate to get anchors (implies --anchored)

 

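Visualizing the results

Place the generated input.fa.gui file and the FASTA file used for the run in the same directory, then pass the input FASTA file name as shown below.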

java -jar fsa-1.15.9/display/mad.jar genes.fa
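
Because the viewer needs both files side by side, a quick pre-flight check can avoid a confusing failure. This is a minimal sketch: `check_gui_inputs` is a hypothetical helper, and it assumes the GUI record is named by appending ".gui" to the FASTA file name, as in the input.fa.gui example above.

```shell
#!/bin/sh
# check_gui_inputs: hypothetical pre-flight check (not part of FSA).
# Succeeds only if the FASTA file and its companion "<fasta>.gui" file
# both exist, which also means they sit in the same directory.
check_gui_inputs() {
    fasta=$1
    if [ ! -f "$fasta" ]; then
        echo "missing FASTA: $fasta" >&2
        return 1
    fi
    if [ ! -f "$fasta.gui" ]; then
        echo "missing GUI record: $fasta.gui" >&2
        return 1
    fi
    return 0
}

# Usage (jar path as in the command above):
#   check_gui_inputs genes.fa && java -jar fsa-1.15.9/display/mad.jar genes.fa
```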

 

Other visualization tools (wiki)

 

Note that genome-scale alignments require a considerable amount of time and memory.
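
The help output lists a --maxram option that takes a value in megabytes, which may be useful on a machine with limited memory. The sketch below only builds and prints such a command; `fsa_capped_cmd` is a hypothetical helper, the conversion treats 1 GB as 1024 MB, and how FSA behaves when it reaches the cap is not covered here.

```shell
#!/bin/sh
# fsa_capped_cmd: hypothetical helper that prints (does not run) an anchored
# fsa command with a RAM cap. The cap is given in GB and converted to the
# megabytes that --maxram expects according to `fsa --help`.
fsa_capped_cmd() {
    input=$1
    max_gb=$2
    max_mb=$((max_gb * 1024))
    echo "fsa --anchored --maxram $max_mb $input"
}

# Usage:
#   fsa_capped_cmd genome_set.fa 8   # prints the command for review
```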

 

How to balance the trade-off between alignment sensitivity and specificity is covered in the Q&A; please refer to it (link).
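
The help output shown earlier lists two options related to this trade-off: --maxsn (maximum sensitivity) and --gapfactor (values above 1 for higher specificity, default 1). As a rough illustration only, the hypothetical helper `tradeoff_flags` below maps a goal to those flags; the gap factor of 2 used for "specific" is an arbitrary example value, and the Q&A remains the place to look for actual guidance.

```shell
#!/bin/sh
# tradeoff_flags: hypothetical helper mapping a goal to flags from `fsa --help`.
#   sensitive -> --maxsn
#   specific  -> --gapfactor N (N > 1; defaults to 2, an arbitrary example)
#   default   -> --gapfactor 1 (the documented default)
tradeoff_flags() {
    case "$1" in
        sensitive) echo "--maxsn" ;;
        specific)  echo "--gapfactor ${2:-2}" ;;
        default)   echo "--gapfactor 1" ;;
        *)         echo "unknown goal: $1" >&2; return 1 ;;
    esac
}

# Usage (word splitting of the flags is intentional):
#   fsa $(tradeoff_flags specific 3) genes.fa
```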

 

 

Citation

Fast Statistical Alignment

Robert K. Bradley, Adam Roberts, Michael Smoot, Sudeep Juvekar, Jaeyoung Do, Colin Dewey, Ian Holmes, Lior Pachter

Published: May 29, 2009

https://doi.org/10.1371/journal.pcbi.1000392

 

https://www.biostars.org/p/55961/