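EST sequencing is a versatile approach for rapidly gathering protein-coding sequences. ESTs provide direct access to an organism's gene repertoire, bypassing the still error-prone step of gene prediction from genomic data. As a result, ESTs are often the only source of biological sequence data for taxa outside mainstream interest. The wider use of ESTs in evolutionary studies, and in molecular systematics in particular, is still hindered by the lack of efficient and reliable approaches for automated ortholog prediction in ESTs. Existing methods either depend on a known species tree or cannot cope with the redundancy in EST data.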
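The authors present a new approach (HaMStR) for mining EST data. HaMStR combines a profile Hidden Markov Model search with a subsequent BLAST search to extend existing ortholog clusters with sequences from further taxa. They show that HaMStR results agree with those obtained by existing orthology prediction methods that require completely sequenced genomes. A case study on the phylogeny of 35 fungal taxa illustrates that HaMStR is well suited to compiling informative data sets for phylogenetic studies from EST and protein sequence data.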
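The two-step idea described above (a profile HMM search proposes candidates, and a reverse BLAST search confirms them) can be illustrated with a small filter. This is a simplified sketch and not HaMStR's actual code: the two-column input format, the function name `filter_reciprocal`, and the candidate IDs are hypothetical. Only the reference ID P83876 is taken from the test run further down in this post.

```shell
#!/bin/sh
# Hypothetical input: one line per HMM candidate with two whitespace-separated
# columns: <candidate_id> <best_reverse_blast_hit_in_reference_species>.
# A candidate is kept only if its best reverse hit is the reference protein.
filter_reciprocal() {
    ref_id="$1"
    awk -v ref="$ref_id" '$2 == ref { print $1 }'
}

# Made-up candidates for illustration only.
printf '%s\n' \
    'cand_A P83876' \
    'cand_B Q99999' \
    'cand_C P83876' | filter_reciprocal P83876
```

Here cand_B is rejected because its best reverse hit is a different protein. The tool's `-strict` and `-checkCoorthologsRef` options, listed in the help output below, tighten or relax this kind of criterion.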
Installation
Dependencies
- wget, grep and sed (or gsed on macOS) to install HaMStR
- To use the FAS tool (a dependency of HaMStR), you also need Python 3.
conda create --name hamstr -y
conda activate hamstr
conda install -c BIONF -c bioconda -c conda-forge -y hamstr
Prepare the database. Note that it is several GB in size.
# A root account is recommended; otherwise some dependencies cannot be installed.
sudo setup_hamstr
During setup you are asked whether to download the annotation tools. Choose Y.
cd /HaMStR/bin/
export PATH=$PATH:$(pwd)
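Because `export PATH=...` only affects the current shell session, it can be worth confirming that the command is resolvable before continuing. A minimal sketch using the POSIX `command -v` builtin; the helper name `require_tool` is hypothetical and not part of HaMStR.

```shell
#!/bin/sh
# Return 0 and print the resolved location if the named tool is on PATH;
# otherwise print a hint to stderr and return 1.
require_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found: $(command -v "$1")"
    else
        echo "$1 not found on PATH; re-run the export in this shell" >&2
        return 1
    fi
}

# Example: check for oneSeq without aborting if it is missing.
require_tool oneSeq || true
```

To make the setting permanent, the same `export` line would normally go into the shell's startup file.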
$ oneSeq -h
Please wait why the taxonomy database is indexing...
indexing done!
YOU ARE RUNNING oneSeq v.1.4 on kazu
This program is freely distributed under a GPL.
Copyright (c) GRL limited: portions of the code are from separate copyrights
USAGE: oneSeq.pl -sequence_file=<> -seqId=<> -seqName=<> -refSpec=<> -minDist=<> -maxDist=<> [OPTIONS]
OPTIONS:
GENERAL
-h
Invoke this help method
-version
Print the program version
-showTaxa
Print available taxa (dependent on the on/off status of database mode)
REQUIRED
-seqFile=<>
Specifies the file containing the seed sequence (protein only) in fasta format.
If not provided the program will ask for it.
-seqId=<>
Specifies the sequence identifier of the seed sequence in the reference protein set.
If not provided, the program will attempt to determine it automatically.
-refSpec
Determines the reference species for the hamstr search. It should be the species the seed sequence was derived from.
If not provided, the program will ask for it.
-minDist=<>
specify the minimum systematic distance of primer taxa for the core set compilation.
If not provided, the program will ask for it.
-maxDist=<>
specify the maximum systematic distance of primer taxa to be considered for core set compilation.
If not provided, the program will ask for it.
-coreOrth=<>
Specify the number of orthologs added to the core set.
USING NON-DEFAULT PATHS
-outpath=<>
Specifies the path for the output directory. Default is /home/kazu/Document/HaMStR/output;
-hmmpath=<>
Specifies the path for the core ortholog directory. Default is /home/kazu/Document/HaMStR/core_orthologs/
ADDITIONAL OPTIONS
-append
Set this flag to append the output to existing output files
-seqName=<>
Specifies a name for the search. If not set a random name will be set.
-db
Run oneSeq.pl in database mode. Requires a MySQL database. Only for internal use.
-filter=[T|F]
Switch on or off the low complexity filter for the blast search. Default: T
-silent
Suppress output to the command line
-coreTaxa=<>
You can provide a list of primer taxa that should exclusively be used for the compilation
of the core ortholog set
-strict
Run the final HaMStR search in 'strict mode'. An ortholog is only then accepted when the reciprocity is fulfilled
for each sequence in the core set.
-force
Force the final HaMStR search to create output file. Existing files will be overwritten.
-coreStrict
Run the HaMStR for the compilation of the core set in strict mode.
-checkCoorthologsRef
During the final HaMStR search, accept an ortholog also when its best hit in the reverse search is not the
core ortholog itself, but a co-ortholog of it.
-CorecheckCoorthologsRef
Invokes the 'checkCoorthologsRef' behavior in the course of the core set compilation.
-rbh
Requires a reciprocal best hit during the HaMStR search to accept a new ortholog.
-evalBlast=<>
This option allows to set the e-value cut-off for the Blast search. Default: 1E-5
-evalHmmer=<>
This option allows to set the e-value cut-off for the HMM search. Default: 1E-5
-evalRelaxfac=<>
This option allows to set the factor to relax the e-value cut-off (Blast search and HMM search) for the final HaMStR run. Default: 10
-hitLimit=<>
Provide an integer specifying the number of hits of the initial pHMM based search that should be evaluated
via a reverse search. Default: 10
-coreHitLimit=<>
Provide an integer specifying the number of hits of the initial pHMM based search that should be evaluated
via a reverse search. Default: 3
-autoLimit
Setting this flag will invoke a lagPhase analysis on the score distribution from the hmmer search. This will determine automatically
a hit limit for each query. Note, when setting this flag, it will be effective for both the core ortholog compilation
and the final ortholog search.
-scoreThreshold
Instead of setting an automatic hit limit, you can specify with this flag that only candidates with an hmm score no less
than x percent of the hmm score of the best hit are further evaluated. Default is x = 10.
You can change this cutoff with the option -scoreCutoff. Note, when setting this flag, it will be effective for
both the core ortholog compilation and the final ortholog search.
-scoreCutoff=<>
In combination with -scoreThreshold you can define the percent range of the hmm score of the best hit up to which a
candidate of the hmmsearch will be subjected for further evaluation. Default: 10%.
-coreOnly
Set this flag to compile only the core orthologs. These sets can later be used for a stand alone HaMStR search.
-reuse_core
Set this flag if the core set for your sequence is already existing. No check currently implemented.
-ignoreDistance
Set this flag to ignore the distance between Taxa and to choose orthologs only based on score
-distDeviation=<>
Specify the deviation in score in percent (1=100%, 0=0%) allowed for two taxa to be considered similar
-blast
Set this flag to determine sequence id and refspec automatically. Note, the chosen sequence id and reference species
does not necessarily reflect the species the sequence was derived from.
-rep
Set this flag to obtain only the sequence being most similar to the corresponding sequence in the core set rather
than all putative co-orthologs.
-coreRep
Set this flag to invoke the '-rep' behaviour for the core ortholog compilation.
-cpu
Determine the number of threads to be run in parallel
-batch=<>
Currently has NO functionality.
-group=<>
Allows to limit the search to a certain systematic group
-cleanup
Temporary output will be deleted.
-aligner
Choose between mafft-linsi or muscle for the multiple sequence alignment. DEFAULT: muscle
SPECIFYING FAS SUPPORT OPTIONS
-fasoff
Turn OFF FAS support. Default is ON.
-coreFilter=[relaxed|strict]
Specify mode for filtering core orthologs by FAS score. In 'relaxed' mode candidates with insufficient FAS score will be disadvantaged.
In 'strict' mode candidates with insufficient FAS score will be deleted from the candidates list. Default is None.
The option '-minScore=<>' specifies the cut-off of the FAS score.
-minScore=<>
Specify the threshold for coreFilter. Default is 0.75.
-weight_seed
Specify the gene set (either seed species or orthologs origin) which is used to determine the weight of a feature. If this flag is set the weights will be determined on the basis of the seed species. Default is the origin of the respective ortholog.
-local
Specify the alignment strategy during core ortholog compilation. Default is local.
-glocal
Set the alignment strategy during core ortholog compilation to glocal.
-global
Set the alignment strategy during core ortholog compilation to global.
-countercheck
Set this flag to counter-check your final profile. The FAS score will be computed in two ways (seed vs. hit and hit vs. seed).
SPECIFYING EXTENT OF OUTPUT TO SCREEN
-debug
Set this flag to obtain more detailed information about the programs actions
-silent
Suppress output to screen as much as possible
Test run
cd HaMStR/data/
oneSeq -seqFile=infile.fa -seqId=P83876 -refSpec=HUMAN@9606@1 -minDist=genus -maxDist=kingdom -coreOrth=5 -cleanup -global
- -seqFile Specifies the file containing the seed sequence (protein only) in FASTA format. If not provided, the program will ask for it.
- -seqId Specifies the sequence identifier of the seed sequence in the reference protein set. If not provided, the program will attempt to determine it automatically.
- -refSpec Determines the reference species for the HaMStR search. It should be the species the seed sequence was derived from. If not provided, the program will ask for it.
- -minDist Specifies the minimum systematic distance of primer taxa for the core set compilation. If not provided, the program will ask for it.
- -maxDist Specifies the maximum systematic distance of primer taxa to be considered for core set compilation. If not provided, the program will ask for it.
- -coreOrth Specifies the number of orthologs added to the core set.
- -global Sets the alignment strategy during core ortholog compilation to global.
- -cleanup Deletes temporary output.
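When several seed proteins need the same settings, the command line above can be generated rather than typed. The sketch below only prints the commands (a dry run), so they can be inspected before anything is executed. The flags are the ones used in the test run, spelled as in the help output. The function name `print_oneseq_cmd` and the second seed in the example (file name and ID) are hypothetical placeholders.

```shell
#!/bin/sh
# Print (do not run) a oneSeq command for one seed.
# Arguments: <fasta_file> <seed_id> <reference_species>
print_oneseq_cmd() {
    printf 'oneSeq -seqFile=%s -seqId=%s -refSpec=%s -minDist=genus -maxDist=kingdom -coreOrth=5 -cleanup -global\n' \
        "$1" "$2" "$3"
}

# Seed list: "<fasta_file> <seed_id>" per line. The second entry is a
# made-up placeholder, not a real example shipped with HaMStR.
while read -r fasta seed_id; do
    print_oneseq_cmd "$fasta" "$seed_id" 'HUMAN@9606@1'
done <<'EOF'
infile.fa P83876
another_seed.fa SEED_ID_2
EOF
```

Once the printed commands look right, piping the output to `sh` would run them one after another.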
The output files are described on GitHub. Of these, the .phyloprofile file and the domain architecture file can be used as input for PhyloProfile.
$ head acCJEfk.phyloprofile
geneID ncbiID orthoID FAS_F FAS_B
acCJEfk ncbi7165 acCJEfk|ANOGA@7165@1|Q7Q496|1 1.0 0.0
acCJEfk ncbi3702 acCJEfk|ARATH@3702@1|Q9FE62|1 1.0 0.0
acCJEfk ncbi330879 acCJEfk|ASPFU@330879@1|Q4WY99|1 0.98359 0.0
acCJEfk ncbi684364 acCJEfk|BATDJ@684364@1|F4NXD8|1 1.0 0.0
acCJEfk ncbi9913 acCJEfk|BOVIN@9913@1|F1MTU6|1 1.0 0.0
acCJEfk ncbi7739 acCJEfk|BRAFL@7739@1|C3ZKG5|1 1.0 0.0
acCJEfk ncbi6239 acCJEfk|CAEEL@6239@1|L8E6I4|1 1.0 0.0
acCJEfk ncbi237561 acCJEfk|CANAL@237561@1|Q5A1M0|1 0.9999399999999999 0.0
acCJEfk ncbi9615 acCJEfk|CANLF@9615@1|E2R204|1 1.0 0.0
The file lists the geneID, ncbiID, orthoID, and FAS scores (FAS_F and FAS_B).
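The orthoID column packs several values into one string, which is easy to unpack with awk. The sketch below assumes the layout seen in the rows above: pipe-separated parts, the second of which has the form SPECIES@taxid@n. The meaning of the trailing numbers is not stated in this post, so they are ignored here. The function names and the 0.99 threshold are hypothetical choices for illustration.

```shell
#!/bin/sh
# Split each orthoID (geneID|SPECIES@taxid@n|proteinID|n) and print:
# species code, taxon ID, protein ID, FAS_F. The header line is skipped.
split_ortho_ids() {
    awk 'NR > 1 {
        split($3, part, "|")        # part[2] = SPECIES@taxid@n, part[3] = protein ID
        split(part[2], tax, "@")    # tax[1] = species code, tax[2] = taxon ID
        print tax[1], tax[2], part[3], $4
    }'
}

# Keep the header plus rows whose FAS_F is at least the given threshold.
filter_by_fas_f() {
    awk -v min="$1" 'NR == 1 || $4 + 0 >= min + 0'
}

# Header and three rows copied from the head output above.
sample='geneID ncbiID orthoID FAS_F FAS_B
acCJEfk ncbi7165 acCJEfk|ANOGA@7165@1|Q7Q496|1 1.0 0.0
acCJEfk ncbi330879 acCJEfk|ASPFU@330879@1|Q4WY99|1 0.98359 0.0
acCJEfk ncbi9913 acCJEfk|BOVIN@9913@1|F1MTU6|1 1.0 0.0'

printf '%s\n' "$sample" | filter_by_fas_f 0.99 | split_ortho_ids
```

With these sample rows, the ANOGA and BOVIN entries are printed and the ASPFU row (FAS_F 0.98359) is dropped. For a real file, the same pipeline can read from `acCJEfk.phyloprofile` instead of the inline sample.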
Merge multiple results.
cat *.extended.profile > combined.extended.profile
#run the parsing script
perl HaMStR/bin/visuals/parseOneSeq.pl -i combined.extended.profile -o combined.phyloprofile
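Once several profiles are merged, a quick sanity check is to count how many ortholog rows each seed contributed. A minimal sketch, assuming the combined file keeps the column layout shown earlier (geneID in the first column, header lines starting with `geneID`). The function name `count_per_gene` and the second gene ID in the sample data are made up.

```shell
#!/bin/sh
# Count rows per geneID, ignoring any header line(s) that start with "geneID".
count_per_gene() {
    awk '$1 != "geneID" { n[$1]++ } END { for (g in n) print g, n[g] }' | sort
}

# Sample input: two rows copied from the output above plus one made-up seed.
printf '%s\n' \
    'geneID ncbiID orthoID FAS_F FAS_B' \
    'acCJEfk ncbi7165 acCJEfk|ANOGA@7165@1|Q7Q496|1 1.0 0.0' \
    'acCJEfk ncbi3702 acCJEfk|ARATH@3702@1|Q9FE62|1 1.0 0.0' \
    'seedB ncbi9913 seedB|BOVIN@9913@1|X00000|1 1.0 0.0' | count_per_gene
```

For the real merged file this would be run as `count_per_gene < combined.phyloprofile`.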
Citation
HaMStR: profile hidden Markov model based search for orthologs in ESTs
Ebersberger I, Strauss S, von Haeseler A
BMC Evol Biol. 2009 Jul 8;9:157