macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

infernal

 

 infernalは、入力として与えられた構造的にアノテーションされた複数の配列アラインメントから、共分散モデル(CM)と呼ばれるRNAファミリーの配列と二次構造の確率的プロファイルを構築する。infernalは、共分散モデルを使用して、配列データベース内の新しいファミリーメンバーを検索し、潜在的に大規模な多重配列アラインメントを作成する。Infernalのバージョン1.1では、加速プロファイル隠れマルコフモデル(HMM)法とHMMバンド共分散モデルアライメント法に基づいたRNAホモロジー検索のための新しいフィルタパイプラインが導入されている。これにより、従来のバージョンよりも∼100倍の速度と、フィルタリングされていない網羅的な共分散モデル検索よりも∼10,000倍の加速が可能となった。

 ソースコード、ドキュメント、ベンチマークhttp://infernal.janelia.org からダウンロードできる。InfernalはGNU GPLv3の下でライセンスされており、LinuxMac OS/Xを含むPOSIX準拠のオペレーティング・システムに自由に移植可能である。ドキュメントには、チュートリアル付きのユーザーガイド、ファイルフォーマットやユーザーオプションの説明、ソフトウェアに実装されている方法についての追加の詳細が含まれている。

 

HP

http://eddylab.org/infernal/

PDF manual

http://eddylab.org/infernal/Userguide.pdf

 

インストール

Github

#bioconda
mamba install -c bioconda -y infernal

#homebrew
brew install infernal

#from source
wget eddylab.org/infernal/infernal-1.1.3.tar.gz
tar xf infernal-1.1.3.tar.gz
cd infernal-1.1.3
./configure
make
make check
make install

cmalign -h

# cmalign -h

# cmalign :: align sequences to a CM

# INFERNAL 1.1.2 (July 2016)

# Copyright (C) 2016 Howard Hughes Medical Institute.

# Freely distributed under a BSD open source license.

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Usage: cmalign [-options] <cmfile> <seqfile>

 

Basic options:

  -h     : show brief help on version and usage

  -o <f> : output the alignment to file <f>, not stdout

  -g     : configure CM for global alignment [default: local]

 

Options controlling alignment algorithm:

  --optacc   : use the Holmes/Durbin optimal accuracy algorithm  [default]

  --cyk      : use the CYK algorithm

  --sample   : sample alignment of each seq from posterior distribution

  --seed <n> : w/--sample, set RNG seed to <n> (if 0: one-time arbitrary seed)

  --notrunc  : do not use truncated alignment algorithm

  --sub      : build sub CM for columns b/t HMM predicted start/end points

 

Options controlling speed and memory requirements:

  --hbanded    : accelerate using CM plan 9 HMM derived bands  [default]

  --tau <x>    : set tail loss prob for HMM bands to <x>  [1e-7]  (1e-18<x<1)

  --mxsize <x> : set maximum allowable DP matrix size to <x> Mb  [1028.0]  (x>0.)

  --fixedtau   : do not adjust tau (tighten bands) until mx size is < limit

  --maxtau <x> : set max tau <x> when tightening HMM bands  [0.05]  (0<x<0.5)

  --nonbanded  : do not use HMM bands for faster alignment

  --small      : use small memory divide and conquer (d&c) algorithm

 

Optional output files:

  --sfile <f>  : dump alignment score information to file <f>

  --tfile <f>  : dump individual sequence parsetrees to file <f>

  --ifile <f>  : dump information on per-sequence inserts to file <f>

  --elfile <f> : dump information on per-sequence EL inserts to file <f>

 

Other options:

  --mapali <f>    : include alignment in file <f> (same ali that CM came from)

  --mapstr        : include structure (w/pknots) from <f> from --mapali <f>

  --noss          : cmbuild --noss option was used w/aln from --mapali <f>

  --informat <s>  : assert <seqfile> is in format <s>: no autodetection

  --outformat <s> : output alignment in format <s>  [Stockholm]

  --dnaout        : output alignment as DNA (not RNA) sequence data

  --noprob        : do not include posterior probabilities in the alignment

  --matchonly     : include only match columns in output alignment

  --ileaved       : force output in interleaved Stockholm format

  --regress <f>   : save regression test data to file <f>

  --verbose       : report extra information; mainly useful for debugging

  --cpu <n>       : number of parallel CPU workers to use for multithreads  (n>=0)

 

Sequence input formats:   FASTA, GenBank

Alignment output formats: Stockholm, Pfam, AFA (aligned FASTA), A2M, Clustal, PHYLIP

cmsearch -h

# cmsearch -h

# cmsearch :: search CM(s) against a sequence database

# INFERNAL 1.1.2 (July 2016)

# Copyright (C) 2016 Howard Hughes Medical Institute.

# Freely distributed under a BSD open source license.

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Usage: cmsearch [options] <cmfile> <seqdb>

 

Basic options:

  -h        : show brief help on version and usage

  -g        : configure CM for glocal alignment [default: local]

  -Z <x>    : set search space size in *Mb* to <x> for E-value calculations  (x>0)

  --devhelp : show list of otherwise hidden developer/expert options

 

Options directing output:

  -o <f>       : direct output to file <f>, not stdout

  -A <f>       : save multiple alignment of all significant hits to file <s>

  --tblout <f> : save parseable table of hits to file <s>

  --acc        : prefer accessions over names in output

  --noali      : don't output alignments, so output is smaller

  --notextw    : unlimit ASCII text output line width

  --textw <n>  : set max width of ASCII text output lines  [120]  (n>=120)

  --verbose    : report extra information; mainly useful for debugging

 

Options controlling reporting thresholds:

  -E <x> : report sequences <= this E-value threshold in output  [10.0]  (x>0)

  -T <x> : report sequences >= this score threshold in output

 

Options controlling inclusion (significance) thresholds:

  --incE <x> : consider sequences <= this E-value threshold as significant  [0.01]

  --incT <x> : consider sequences >= this score threshold as significant

 

Options controlling model-specific reporting thresholds:

  --cut_ga : use CM's GA gathering cutoffs as reporting thresholds

  --cut_nc : use CM's NC noise cutoffs as reporting thresholds

  --cut_tc : use CM's TC trusted cutoffs as reporting thresholds

 

Options controlling acceleration heuristics*:

  --max      : turn all heuristic filters off (slow)

  --nohmm    : skip all HMM filter stages, use only CM (slow)

  --mid      : skip first two HMM filter stages (SSV & Vit)

  --default  : default: run search space size-dependent pipeline  [default]

  --rfam     : set heuristic filters at Rfam-level (fast)

  --hmmonly  : use HMM only, don't use a CM at all

  --FZ <x>   : set filters to defaults used for a search space of size <x> Mb

  --Fmid <x> : with --mid, set P-value threshold for HMM stages to <x>  [0.02]

 

Other options*:

  --notrunc     : do not allow truncated hits at sequence termini

  --anytrunc    : allow full and truncated hits anywhere within sequences

  --nonull3     : turn off the NULL3 post hoc additional null model

  --mxsize <x>  : set max allowed alnment mx size to <x> Mb [df: autodetermined]

  --smxsize <x> : set max allowed size of search DP matrices to <x> Mb  [128.]

  --cyk         : use scanning CM CYK algorithm, not Inside in final stage

  --acyk        : align hits with CYK, not optimal accuracy

  --wcx <x>     : set W (expected max hit len) as <x> * cm->clen (model len)

  --toponly     : only search the top strand

  --bottomonly  : only search the bottom strand

  --tformat <s> : assert target <seqdb> is in format <s>: no autodetection

  --cpu <n>     : number of parallel CPU workers to use for multithreads

 

*Use --devhelp to show additional expert options.

cmstat -h

# cmstat -h

# cmstat :: display summary statistics for CMs

# INFERNAL 1.1.2 (July 2016)

# Copyright (C) 2016 Howard Hughes Medical Institute.

# Freely distributed under a BSD open source license.

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Usage: cmstat [-options] <cmfile>

 

Options:

  -h          : show brief help on version and usage

  -E <x>      : print bit scores that correspond to E-value threshold of <x>

  -P <x>      : print bit scores that correspond to E-value threshold of <x>

  -T <x>      : print E-values that correspond to bit score threshold of <x>

  -Z <x>      : set database size in *Mb* to <x> for E-value calculations  [10]

  --cut_ga    : print E-values that correspond to GA bit score thresholds

  --cut_nc    : print E-values that correspond to NC bit score thresholds

  --cut_tc    : print E-values that correspond to TC bit score thresholds

  --key <s>   : only print statistics for CM with name or accession <s>

  --hmmonly   : print filter HMM bit scores/E-values, not CM ones

  --nohmmonly : print CM bit scores/E-values, even for models with 0 basepairs

cmbuild -h

# cmbuild -h

# cmbuild :: covariance model construction from multiple sequence alignments

# INFERNAL 1.1.2 (July 2016)

# Copyright (C) 2016 Howard Hughes Medical Institute.

# Freely distributed under a BSD open source license.

# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Usage: cmbuild [-options] <cmfile_out> <msafile>

 

Basic options:

  -h        : show brief help on version and usage

  -n <s>    : name the CM(s) <s>, (only if single aln in file)

  -F        : force; allow overwriting of <cmfile_out>

  -o <f>    : direct summary output to file <f>, not stdout

  -O <f>    : resave consensus/insert column annotated MSA to file <f>

  --devhelp : show list of otherwise hidden developer/expert options

 

Alternative model construction strategies:

  --fast        : assign cols w/ >= symfrac residues as consensus

  --hand        : use reference coordinate annotation to specify consensus

  --symfrac <x> : fraction of non-gaps to require in a consensus column [0..1]

  --noss        : ignore secondary structure annotation in input alignment

  --rsearch <f> : use RSEARCH parameterization with RIBOSUM matrix file <f>

 

Other model construction options*:

  --null <f>  : read null (random sequence) model from file <f>

  --prior <f> : read priors from file <f>

 

Alternative relative sequence weighting strategies:

  --wpb     : Henikoff position-based weights  [default]

  --wgsc    : Gerstein/Sonnhammer/Chothia tree weights

  --wnone   : don't do any relative weighting; set all to 1

  --wgiven  : use weights as given in MSA file

  --wblosum : Henikoff simple filter weights

  --wid <x> : for --wblosum: set identity cutoff  [0.62]  (0<=x<=1)

 

Alternative effective sequence weighting strategies:

  --eent        : adjust eff seq # to achieve relative entropy target  [default]

  --enone       : no effective seq # weighting: just use nseq

  --ere <x>     : for --eent: set CM target relative entropy to <x>

  --eset <x>    : set eff seq # for all models to <x>

  --eminseq <x> : for --eent: set minimum effective sequence number to <x>  [0.1]

  --ehmmre <x>  : for --eent: set minimum HMM relative entropy to <x>

  --esigma <x>  : for --eent: set sigma param to <x>  [45.0]

 

Options for HMM filter construction*:

  --p7ere <x> : for the filter p7 HMM, set minimum rel entropy/posn to <x>

  --p7ml      : define the filter p7 HMM as the ML p7 HMM

 

Options for HMM filter calibration*:

  --EmN <n>  : number of sampled seqs to use for p7 local MSV calibration  [200]

  --EvN <n>  : number of sampled seqs to use for p7 local Vit calibration  [200]

  --ElfN <n> : number of sampled seqs to use for p7 local Fwd calibration  [200]

  --EgfN <n> : number of sampled seqs to use for p7 glocal Fwd calibration  [200]

 

Options for refining the input alignment*:

  --refine <f> : refine input aln w/Expectation-Maximization, save to <f>

  -l           : w/--refine, configure model for local alignment [default: global]

  --gibbs      : w/--refine, use Gibbs sampling instead of EM

  --seed <n>   : w/--gibbs, set RNG seed to <n> (if 0: one-time arbitrary seed)

  --cyk        : w/--refine, use CYK instead of optimal accuracy

  --notrunc    : w/--refine, do not use truncated alignment algorithm

 

*Use --devhelp to show additional expert options.

 

 

テストラン

Infernalは、用意した複数の配列のアラインメントファイルから始まる。Stockholm format(Wiki)で、コンセンサス二次構造のアノテーションが含まれている必要がある(例:tutorial/tRNA5.sto)。

f:id:kazumaxneo:20201125113602p:plain

ストックホルム形式の二次構造アノテーションを用いたマルチプルRNA配列のアラインメントの例。ストックホルムフォーマットは、HMMERやInfernal、PfamやRfamデータベースで使用されているネイティブのアラインメントフォーマット。右の絵は参考のために掲載されているtRNAの二次構造。PDFマニュアルより転載。

git clone https://github.com/EddyRivasLab/infernal
cd infernal/
cp tutorial/tRNA5.c.cm tRNA5.cm
cmsearch tRNA5.cm tutorial/mrum-genome.fa

 

 

2023/02/28 追記

Rfamを使う。

manual; https://docs.rfam.org/en/latest/genome-annotation.html

wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.cm.gz
gunzip Rfam.cm.gz
wget ftp://ftp.ebi.ac.uk/pub/databases/Rfam/CURRENT/Rfam.clanin

#indexing
cmpress Rfam.cm

#cmscanではデータベースサイズを提供する必要がある。これは検索対象となるヌクレオチドの両鎖の塩基数をメガベース(Mb、数百万ヌクレオチド)単位で表したもの。seqkit statsなどでゲノムサイズを算出し、*2/10^6する。
#cmscanの実行
cmscan -Z 5.874406 --cut_ga --rfam --nohmmonly --tblout mrum-genome.tblout --fmt 2 --clanin Rfam.clanin Rfam.cm tutorial/mrum-genome.fa > mrum-genome.cmscan

詳しくはRfamのマニュアルを読んでください。

 

引用
Infernal 1.1: 100-fold faster RNA homology searches
Eric P. Nawrocki, Sean R. Eddy

Bioinformatics. 2013 Nov 15; 29(22): 2933–2935

 

関連