macでインフォマティクス


A collection of notes on informatics for HTS (NGS).

HaMStR-OneSeq

 

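 EST sequencing is a versatile approach for rapidly gathering protein-coding sequences. ESTs provide direct access to an organism's gene repertoire, bypassing the still error-prone procedure of gene prediction from genomic data. As a result, ESTs are often the only source of biological sequence data for taxa outside mainstream interest. The widespread use of ESTs in evolutionary studies, and in molecular systematics in particular, is still hindered by the lack of efficient and reliable approaches for automated ortholog prediction in ESTs. Existing methods either depend on a known species tree or cannot cope with the redundancy in EST data.
 The authors present a novel approach (HaMStR) for mining EST data. HaMStR combines a profile Hidden Markov Model search with a subsequent BLAST search to extend existing ortholog clusters with sequences from further taxa. They show that the HaMStR results are consistent with those obtained by existing orthology prediction methods that require completely sequenced genomes. A case study on the phylogeny of 35 fungal taxa shows that HaMStR is well suited to compiling informative data sets for phylogenetic studies from EST and protein sequence data.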

 

Installation

Dependencies

Github

conda create --name hamstr -y
conda activate hamstr
conda install -c BIONF -c bioconda -c conda-forge -y hamstr 

Prepare the database. Note that it takes up several GB.

# Running as root is recommended; otherwise some dependencies cannot be installed.
sudo setup_hamstr

During setup you are asked whether to download the annotation tools. Choose Y.


cd /HaMStR/bin/
export PATH="$PATH:$(pwd)"
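
Before calling the tool, it can help to confirm that the shell actually resolves it. The following is a minimal sketch, assuming the executable is named oneSeq as in the commands in this post; the variable name oneseq_status is only illustrative.

```shell
# Check whether oneSeq can be found on the current PATH.
# "oneseq_status" is an illustrative variable name, not part of HaMStR.
if command -v oneSeq >/dev/null 2>&1; then
    oneseq_status="found"
else
    oneseq_status="missing"
fi
echo "oneSeq on PATH: ${oneseq_status}"
```

If this reports "missing", the PATH entry probably does not point at the directory that contains oneSeq.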

oneSeq -h

$ oneSeq -h

Please wait why the taxonomy database is indexing...

indexing done!

 

YOU ARE RUNNING oneSeq v.1.4 on kazu

 

This program is freely distributed under a GPL.

Copyright (c) GRL limited: portions of the code are from separate copyrights

 

 

USAGE: oneSeq.pl -sequence_file=<> -seqId=<>  -seqName=<> -refSpec=<> -minDist=<> -maxDist=<> [OPTIONS]

 

OPTIONS:

 

GENERAL

 

-h

Invoke this help method

-version

Print the program version

-showTaxa

        Print availible Taxa (dependent on the on/off status of database mode)

 

REQUIRED

 

-seqFile=<>

Specifies the file containing the seed sequence (protein only) in fasta format.

If not provided the program will ask for it.

-seqId=<>

Specifies the sequence identifier of the seed sequence in the reference protein set.

If not provided, the program will attempt to determin it automatically.

-refSpec

Determines the reference species for the hamstr search. It should be the species the seed sequence was derived from.

If not provided, the program will ask for it.

-minDist=<>

specify the minimum systematic distance of primer taxa for the core set compilation.

If not provided, the program will ask for it.

-maxDist=<>

specify the maximum systematic distance of primer taxa to be considered for core set compilation.

If not provided, the program will ask for it.

-coreOrth=<>

Specify the number of orthologs added to the core set.

 

USING NON-DEFAULT PATHS

 

-outpath=<>

Specifies the path for the output directory. Default is /home/kazu/Document/HaMStR/output;

-hmmpath=<>

Specifies the path for the core ortholog directory. Default is /home/kazu/Document/HaMStR/core_orthologs/

 

ADDITIONAL OPTIONS

 

-append

Set this flag to append the output to existing output files

-seqName=<>

        Specifies a name for the search. If not set a random name will be set.

-db

Run oneSeq.pl in database mode. Requires a mySql database. Only for internatl use.

-filter=[T|F]

Switch on or off the low complexity filter for the blast search. Default: T

-silent

Surpress output to the command line

-coreTaxa=<>

You can provide a list of primer taxa that should exclusively be used for the compilation

of the core ortholog set

-strict

Run the final HaMStR search in 'strict mode'. An ortholog is only then accepted when the reciprocity is fulfilled

for each sequence in the core set.

-force

Force the final HaMStR search to create output file. Existing files will be overwritten.

-coreStrict

Run the HaMStR for the compilation of the core set in strict mode.

-checkCoorthologsRef

During the final HaMStR search, accept an ortholog also when its best hit in the reverse search is not the

core ortholog itself, but a co-ortholog of it.

-CorecheckCoorthologsRef

Invokes the 'checkCoorthologsRef' behavior in the course of the core set compilation.

-rbh

Requires a reciprocal best hit during the HaMStR search to accept a new ortholog.

-evalBlast=<>

This option allows to set the e-value cut-off for the Blast search. Default: 1E-5

-evalHmmer=<>

This options allows to set the e-value cut-off for the HMM search. Default: 1E-5

-evalRelaxfac=<>

This options allows to set the factor to relax the e-value cut-off (Blast search and HMM search) for the final HaMStR run. Default: 10

-hitLimit=<>

Provide an integer specifying the number of hits of the initial pHMM based search that should be evaluated

via a reverse search. Default: 10

-coreHitLimit=<>

Provide an integer specifying the number of hits of the initial pHMM based search that should be evaluated

via a reverse search. Default: 3

-autoLimit

                Setting this flag will invoke a lagPhase analysis on the score distribution from the hmmer search. This will determine automatically

                a hit limit for each query. Note, when setting this flag, it will be effective for both the core ortholog compilation

and the final ortholog search.

-scoreThreshold

                Instead of setting an automatic hit limit, you can specify with this flag that only candidates with an hmm score no less

than x percent of the hmm score of the best hit are further evaluated. Default is x = 10.

                You can change this cutoff with the option -scoreCutoff. Note, when setting this flag, it will be effective for

both the core ortholog compilation and the final ortholog search.

-scoreCutoff=<>

                In combination with -scoreThreshold you can define the percent range of the hmms core of the best hit up to which a

                candidate of the hmmsearch will be subjected for further evaluation. Default: 10%.

-coreOnly

Set this flag to compile only the core orthologs. These sets can later be used for a stand alone HaMStR search.

-reuse_core

Set this flag if the core set for your sequence is already existing. No check currently implemented.

-ignoreDistance

Set this flag to ignore the distance between Taxa and to choose orthologs only based on score

-distDeviation=<>

Specify the deviation in score in percent (1=100%, 0=0%) allowed for two taxa to be considered similar

-blast

Set this flag to determine sequence id and refspec automatically. Note, the chosen sequence id and reference species

does not necessarily reflect the species the sequence was derived from.

-rep

Set this flag to obtain only the sequence being most similar to the corresponding sequence in the core set rather

than all putative co-orthologs.

-coreRep

Set this flag to invoke the '-rep' behaviour for the core ortholog compilation.

-cpu

Determine the number of threads to be run in parallel

-batch=<>

Currently has NO functionality.

-group=<>

Allows to limit the search to a certain systematic group

-cleanup

        Temporary output will be deleted.

-aligner

Choose between mafft-linsi or muscle for the multiple sequence alignment. DEFAULT: muscle

 

SPECIFYING FAS SUPPORT OPTIONS

 

-fasoff

        Turn OFF FAS support. Default is ON.

-coreFilter=[relaxed|strict]

        Specifiy mode for filtering core orthologs by FAS score. In 'relaxed' mode candidates with insufficient FAS score will be disadvantaged.

        In 'strict' mode candidates with insufficient FAS score will be deleted from the candidates list. Default is None.

        The option '-minScore=<>' specifies the cut-off of the FAS score.

-minScore=<>

        Specify the threshold for coreFilter. Default is 0.75.

-weight_seed

        Specify the gene set (either seed species or orthologs origin) which is used to determine the weight of a feature. If this flag is set the weights will be determined on the basis of the seed species. Default is the origin of the respective ortholog.

-local

        Specify the alignment strategy during core ortholog compilation. Default is local.

-glocal

        Set the alignment strategy during core ortholog compilation to glocal.

-global

        Set the alignment strategy during core ortholog compilation to global.

-countercheck

        Set this flag to counter-check your final profile. The FAS score will be computed in two ways (seed vs. hit and hit vs. seed).

 

SPECIFYING EXTENT OF OUTPUT TO SCREEN

 

-debug

Set this flag to obtain more detailed information about the programs actions

-silent

Surpress output to screen as much as possbile

 

 

 

 

Test run

cd HaMStR/data/
oneSeq -seqFile=infile.fa -seqid=P83876 -refspec=HUMAN@9606@1 -minDist=genus -maxDist=kingdom -coreOrth=5 -cleanup -global
  • -seqFile    Specifies the file containing the seed sequence (protein only) in fasta format. If not provided the program will ask for it.
  • -seqId     Specifies the sequence identifier of the seed sequence in the reference protein set. If not provided, the program will attempt to determine it automatically.
  • -refSpec    Determines the reference species for the hamstr search. It should be the species the seed sequence was derived from. If not provided, the program will ask for it.
  • -minDist    Specify the minimum systematic distance of primer taxa for the core set compilation. If not provided, the program will ask for it.
  • -maxDist    Specify the maximum systematic distance of primer taxa to be considered for core set compilation. If not provided, the program will ask for it.
  • -coreOrth    Specify the number of orthologs added to the core set.
  • -global    Set the alignment strategy during core ortholog compilation to global.
  • -cleanup    Temporary output will be deleted.
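
When several seed sequences need the same settings, the command line can be generated in a loop. The sketch below is a dry run that only prints the commands; it assumes the options used in the test run above, and the second file name and seed ID (other_seed.fa, SEED_ID_2) are hypothetical placeholders.

```shell
# Dry run: print one oneSeq command per seed sequence instead of executing it.
# The second file name and seed ID are hypothetical placeholders.
ref_spec="HUMAN@9606@1"
commands=""
for seed in "infile.fa:P83876" "other_seed.fa:SEED_ID_2"; do
    seq_file="${seed%%:*}"   # text before the first colon
    seq_id="${seed#*:}"      # text after the first colon
    cmd="oneSeq -seqFile=${seq_file} -seqid=${seq_id} -refspec=${ref_spec} -minDist=genus -maxDist=kingdom -coreOrth=5 -cleanup -global"
    echo "$cmd"
    # Collect the printed commands, one per line, for later inspection.
    commands="${commands}${cmd}
"
done
```

Once the printed commands look right, the echo can be replaced by an actual call. According to the help text, a random search name is assigned when -seqName is not set, so adding -seqName may make the outputs easier to tell apart.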

The output is described on GitHub. Among the output files, the .phyloprofile file and the domain architecture files can be used as input for PhyloProfile.

head acCJEfk.phyloprofile

$ head acCJEfk.phyloprofile 

geneID ncbiID orthoID FAS_F FAS_B

acCJEfk ncbi7165 acCJEfk|ANOGA@7165@1|Q7Q496|1 1.0 0.0

acCJEfk ncbi3702 acCJEfk|ARATH@3702@1|Q9FE62|1 1.0 0.0

acCJEfk ncbi330879 acCJEfk|ASPFU@330879@1|Q4WY99|1 0.98359 0.0

acCJEfk ncbi684364 acCJEfk|BATDJ@684364@1|F4NXD8|1 1.0 0.0

acCJEfk ncbi9913 acCJEfk|BOVIN@9913@1|F1MTU6|1 1.0 0.0

acCJEfk ncbi7739 acCJEfk|BRAFL@7739@1|C3ZKG5|1 1.0 0.0

acCJEfk ncbi6239 acCJEfk|CAEEL@6239@1|L8E6I4|1 1.0 0.0

acCJEfk ncbi237561 acCJEfk|CANAL@237561@1|Q5A1M0|1 0.9999399999999999 0.0

acCJEfk ncbi9615 acCJEfk|CANLF@9615@1|E2R204|1 1.0 0.0

The file lists the geneID, ncbiID, orthoID, FAS scores, and so on.
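
The columns can be picked apart with standard tools. The sketch below recreates three rows of the output shown above in a temporary file, then uses awk to split orthoID on "|" and keep rows whose FAS_F value reaches a cutoff. It assumes whitespace-separated columns as displayed here, and the 0.99 cutoff is purely illustrative.

```shell
# Recreate a few lines of the .phyloprofile shown above in a temporary file.
profile="$(mktemp)"
cat > "$profile" <<'EOF'
geneID ncbiID orthoID FAS_F FAS_B
acCJEfk ncbi7165 acCJEfk|ANOGA@7165@1|Q7Q496|1 1.0 0.0
acCJEfk ncbi330879 acCJEfk|ASPFU@330879@1|Q4WY99|1 0.98359 0.0
acCJEfk ncbi9913 acCJEfk|BOVIN@9913@1|F1MTU6|1 1.0 0.0
EOF

# Skip the header, split orthoID on "|" and print species, protein ID and FAS_F
# for rows whose FAS_F value reaches an illustrative cutoff of 0.99.
hits="$(awk 'NR > 1 && $4 >= 0.99 {
    split($3, part, "|")
    print part[2], part[3], $4
}' "$profile")"
echo "$hits"
rm -f "$profile"
```

With these three rows, the ANOGA and BOVIN entries pass the cutoff and the ASPFU entry (0.98359) is filtered out.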

 

Merging multiple results.

cat *.extended.profile > combined.extended.profile

#run the parsing script
perl HaMStR/bin/visuals/parseOneSeq.pl -i combined.extended.profile -o combined.phyloprofile
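
If the per-run .phyloprofile files were concatenated directly instead, a plain cat would repeat the header line seen in the head output. The sketch below shows one generic way to keep a single header with awk. This is an assumption-based alternative, not the procedure documented for HaMStR, and the file names are hypothetical.

```shell
# Build two tiny example profiles in a scratch directory (file names are hypothetical).
workdir="$(mktemp -d)"
printf 'geneID ncbiID orthoID FAS_F FAS_B\nacCJEfk ncbi7165 acCJEfk|ANOGA@7165@1|Q7Q496|1 1.0 0.0\n' > "$workdir/run1.phyloprofile"
printf 'geneID ncbiID orthoID FAS_F FAS_B\nacCJEfk ncbi3702 acCJEfk|ARATH@3702@1|Q9FE62|1 1.0 0.0\n' > "$workdir/run2.phyloprofile"

# Keep the header of the first file only: FNR restarts at 1 for every input file,
# while NR keeps counting across all files.
awk 'FNR == 1 && NR > 1 { next } { print }' "$workdir"/run*.phyloprofile > "$workdir/merged.txt"
merged_lines="$(wc -l < "$workdir/merged.txt" | tr -d ' ')"
header_count="$(grep -c '^geneID' "$workdir/merged.txt")"
echo "lines: ${merged_lines}, headers: ${header_count}"
rm -rf "$workdir"
```

The merged file here ends up with one header line followed by one data row from each input.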

 

 

Citation
HaMStR: profile hidden markov model based search for orthologs in ESTs

Ebersberger I, Strauss S, von Haeseler A

BMC Evol Biol. 2009 Jul 8;9:157