2020-07-15

OLCのメタゲノムアセンブラ BBAP

　メタゲノムアセンブリの精度は、通常、シークエンシングおよびアセンブリの際に、同じゲノム領域からの発散性のあるリードが異なる遺伝子座として認識されるため、高レベルの多型によって損なわれる。ウイルス準種（viral quasispecies）とは、単一のキャリアに見られる豊富で多様な遺伝学的に関連したウイルスのグループである。VelvetやSOAPdenovoのような現在の主流のアセンブル方法は、もともとこのようなメタゲノムデータのアセンブルを意図したものではないため、メタゲノムデータの正確で有益なアセンブル結果を提供するための新しい方法が求められている。
　本研究では、 partial de novo-reference assembly (PDR) 戦略とBLAST-based assembly pipeline (BBAP) を組み合わせたハイブリッドな多相性データ収集法を提案する。PDR戦略は、ランダムに抽出された部分的なデータセットをde novoアセンブリすることでin situリファレンス配列を生成し、その後、全データセットのリファレンスアセンブリに使用する。BBAPは、多型リードをアセンブルするために greedy algorithmを採用している。PDRとBBAPの両方の性能を評価し、比較するために、以前の研究から12のB型肝炎ウイルス準種NGSデータセットを使用した。解析の結果、メタゲノムデータセットの高多型は、断片化されたde novoアセンブリにつながることが示唆された。一方、外部リファレンス配列の偏ったまたは限定された表現は、アセンブリの精度とバリエーション感度を低下させ、これはアセンブリに含まれるリードの数が少ないことを示している。一方、PDRで生成されたin situリファレンス配列は、フルメタゲノミクスデータセットのPDRアセンブリでより多くのリードがアセンブルされ、より高い精度と高いバリエーション感度が得られた。BBAPアセンブリの結果は、他のアセンブリ方法と比較してアセンブリの効率と精度が高いことも示唆している。さらに、BBAPアセンブリは、他の方法のアセンブリ結果では観察されなかったHBVの構造バリアントを回収した。これらの結果から、PDR/BBAPのアセンブリ結果は、他の比較法と比較して有意に優れていた。
　PDRとBBAPの両方が独立して、高度に多型化されたデータのアセンブリ効率と精度を向上させ、一緒に使用するとアセンブリ性能がさらに向上した。BBAPはまた、ヌクレオチド頻度情報も提供する。PDRとBBAPを併用することで、メタゲノムデータ研究のための強力なツールを提供する。

manual (PDF)

http://homepage.ntu.edu.tw/~youylin/BBAP_Manual.pdf

インストール

依存

legacy blast

#仮想環境に導入(bioconda)
conda create -n blast-legacy -y
conda activate blast-legacy
conda install -c bioconda blast-legacy -y

http://homepage.ntu.edu.tw/~youylin/BBAP.html

e$ perl -w QC_SB_AC_masterPipeline.pl

====START OF QC_SB_AC pipeline====

-p directory for QC SB AC perl scripts

-F sequence format: 1 --> fastq file Illumina 1.3+ 2 --> fasta file 3 --> unique fasta file

-o output heading

-O output directory heading (give directory path)

====QUALITY CONTROL AND UNIQUE SEQUENCE====

====QC_0_quality_control_unique_pipeline.pl====

====QC_1_quality_control_fragment.pl====

-f fastq file Illumina 1.3+ or fasta file

-q quality threshold (default = 20, fastq only)

-Q 33 for Illumina 1.8+, 64 for Illumina 1.3+ (default = 33, fastq only)

-A trim ? bp from start (default = 0; for barcode, fastq only)

-B trim ? bp from end (default = 0; fastq only)

trim is after quality filtering in workflow (whole sequence is subject to quality filtering)

====QC_2_fasta_unique.pl====

-r view reverse complement sequence as identical sequence, 1 -> yes, 2 -> no (default = 2)

====QC_3_fasta_countFilter.pl====

-c minimum number of sequences per unique sequence (default = 1)

====SELFBLAST AND CLUSTERING====

====SB_0_selfblast_cluster_pipeline.pl====

====SB_1_selfblast.pl====

-e e for blast (default = 1e-5)

-b directory for formatdb and blastall

-a number of CPUs to run for blast (defaule = 1)

====SB_2_cluster.pl====

-i identity threshold (default = 85, for clustering)

-l length threshold (default = 85% of read length, for clustering)

====SB_3_cluster_stat_sort.pl====

-C count per cluster threshold (default = 1)

====SB_4_cluster_getfasta.pl====

Died at ../QC_SB_AC_masterPipeline.pl line 97.

テストラン

cd /BBAP/Example/
perl -w ../QC_SB_AC_masterPipeline.pl -p ../../BBAP/ -F 1 \
 -o DenovoExample1 -O ./DenovoExample1 \
 -f Example_NGSdataset.fastq \
 -b <your>/<blast>/<direcotry>

出力

f:id:kazumaxneo:20200716000634p:plain

引用

De novo assembly of highly polymorphic metagenomic data using in situ generated reference sequences and a novel BLAST-based assembly pipeline

You-Yu Lin, Chia-Hung Hsieh, Jiun-Hong Chen, Xuemei Lu, Jia-Horng Kao, Pei-Jer Chen, Ding-Shinn Chen & Hurng-Yi Wang
BMC Bioinformatics volume 18, Article number: 223 (2017)

2020-07-14

marginPhase

2019 Genome Biology テスト失敗 haplotyping genotype

　リファレンスベースの遺伝的変異の同定は、ジェノタイピングとphasingという2つの関連プロセスから成り立っている。ジェノタイピングは、個人のゲノムにどのような遺伝的変異が存在するかを決定するプロセスである。ある部位の遺伝子型とは、両方の染色体コピーがバリアント対立遺伝子を持っているか、片方だけが持っているか、あるいはバリアント対立遺伝子が全く存在しないかを示す。フェーシングとは、個人のハプロタイプを決定することであり、これは同じ染色体上で互いに近くに存在し、一緒に遺伝するバリアントで構成されている。生物のすべての遺伝的変異を完全に記述するためには、ジェノタイピングとフェーズングの両方が必要である。この2つのプロセスを合わせて、ディプロタイプと呼ばれている。

　多くの既存のバリアント解析パイプラインは、ショートDNAシーケンスリード用に設計されている。ショートリードは塩基レベルでは非常に正確だが、特に繰り返しや重複する領域では、ゲノムに明確にアラインすることが難しいという問題がある。その結果、現在のところ、基準となるヒトゲノムの何百万塩基もの塩基が、主に染色体のセントロメアや短腕の近くのマルチメガバイトギャップで、ショートリードによって確実に遺伝子型を決定することができていない。ショートリードはこれらの領域に一意にマッピングすることができないが、ロングリードはこれらの領域にまたがる可能性がある。ロングリードは、リードベースのハプロタイピング、大規模な構造変異の検出、およびde novoアセンブリに有用であることがすでに証明されている[ref,5-8]。ここでは、より包括的なジェノタイピングのためのロングリードの有用性を実証する。これらの技術は歴史的に相対的にコストが高く、シーケンスエラー率が高いため、これまでこの問題にはほとんど注目されていなかった。しかし、ロングリードDNAシーケンシング技術は急速に価格が下がり、一般的に利用できるようになってきている。このような技術としては、Pacific Biosciences社（PacBio）の1分子リアルタイム（SMRT）シーケンシングやOxford Nanopore Technologies社（ONT）のナノポアシーケンシングがあり、いずれも本研究で評価している。

　ジェノタイピングの問題は、ロングリードシークエンシングデータからハプロタイプを推論する作業に関連しており、豊富な文献や多くのツールが存在している。ハプロタイプ再構成の最も一般的な形式化は、 minimum error correction（MEC）問題である。MEC問題は、同じハプロタイプからのリードを互いに整合性のあるものにするために、最小のエラー数を修正する必要があるように、ハプロタイプごとにリードを分割しようとする問題である。原理的には、この問題の定式化は遺伝子型を推論するのに役立つが、実際には「すべてのヘテロ接合体」が仮定されている。ハプロタイプ再構成のためのツールは、一般的にヘテロ接合体の位置のセットが入力として与えられることを仮定しており、これらのサイトのみを対象としている。

　このようなツールの一般的な欠如にもかかわらず、ロングリードを用いたジェノタイピングのためのいくつかの方法が提案されている。Guoら[ref.17]は、ロングリード一塩基バリアント（Single-nucleotide variant: SNV）コールとハプロタイプ再構成のための方法を記述しており、この方法では、各SNV部位で、その部位に重なる近くのリードと最もよく一致する模範的なリードを特定する。次に、隣接する SNV サイトの模範となるリードとの類似性に基づいて、そのサイト周辺のリードを分割する。しかし、この方法ではハプロタイプ間の最適なリードの分割が保証されておらず、著者らの報告によると、NA12878のPacBioデータでは比較的高い偽発見率（15.7％）と偽陰性率（11.0％）が報告されており、これはわずか86.6％のF1スコアに対応している。さらに、2つのグループが現在、学習ベースのバリアントコーラーを開発しており、彼らは、ノイズの多いロングリードを使用して動作するように調整できることを示している。最近のプレプリントでは、Luo et al. [18] が、ロングリードデータからバリアントコールするために畳み込みニューラルネットワーク (CNN) を使用する方法を記述している。（一部略）

　ロングリードを使用して隣接するサイト間のディプロタイプを行うことの潜在的な利点を説明するために、論文図1aを考える。ロングリードでカバーされている3つのSNV位置が示されている。灰色の配列は真のハプロタイプ配列を表しており、リードは青と赤で着色されている。シーケンシングエラーが発生する可能性があるため、リードによってサポートされる対立遺伝子は、灰色で示されているハプロタイプの真の対立遺伝子とは必ずしも一致していない。SNVを個別に考えると、各アレルをサポートするリードの数が同じなので、最初のものをA/C、2番目のものをT/G、3番目のものをG/Cと呼ぶのが妥当だろう。これは、2つ目のSNVの予測を誤ってしまうことになる。しかし、もし各リードがどのハプロタイプに由来するかがわかれば、つまりその色がわかれば、2番目のSNV部位でシーケンスエラーが発生しているはずだということがわかる。同じハプロタイプに由来するリードは同じアレルをサポートしていなければならず、また、この部位のハプロタイプに由来するリード間には不一致があるため、この遺伝子座での遺伝子型予測は非常に不確実なものとして扱われなければならない。したがって、遺伝子型決定の際にハプロタイプ情報を使用することで、不確実性を検出し、より信頼性の高い遺伝子型予測を計算することが可能になる。

　本論文では、現代のロングリード技術において、リードベースのphase推論をSNVのジェノタイピングプロセスと同時に組み合わせることで、正確なディプロタイプを作成し、ショートリードではマッピングできない領域のバリアントを検出することができることを示す。この推論の鍵は、リード内のヘテロ接合部位間の連結関係を検出することであることを示す。これを実現するために、著者らは、ノイズの多いロングリードから正確にディプロタイプを予測するための新しいアルゴリズムを記述した。

次に、このアルゴリズムを1000 Genomes ProjectのNA12878の1人の個体のディプロタイプに適用した。NA12878は広範囲にシークエンスされ研究されており、Genome in a Bottle Consortiumは、これらのゲノム領域内の高信頼性領域とそれに対応する高信頼性バリアントコールのセットを発表している[ref.20]。本論文では、提案する方法が正確であり、不確実性の高い領域のバリアントを確認するために使用でき、ショートDNAリードシーケンス技術を使用してマッピングできない領域のバリアントの発見を可能にすることを実証する。

Finally out: our paper on haplotype-aware genotyping using #whatshap and marginphase with up to 99.8% genotype accuracy from long reads. Great to move into regions opaque to short reads. With Jana Ebler, @MarinaHaukness, Trevor Pesout and @BenedictPaten. https://t.co/9Mp9qvKaPM
— Tobias Marschall (@tobiasmarschal) 2019年6月4日

インストール

ビルド依存

cmake version 3.7 (or higher):

wget https://cmake.org/files/v3.7/cmake-3.7.2-Linux-x86_64.sh && mkdir /opt/cmake && sh cmake-3.7.2-Linux-x86_64.sh --prefix=/opt/cmake --skip-license && ln -s /opt/cmake/bin/cmake /usr/local/bin/cmake

apt-get -y install git make gcc g++ autoconf bzip2 lzma-dev zlib1g-dev libcurl4-openssl-dev libcrypto++-dev libpthread-stubs0-dev libbz2-dev liblzma-dev

GIthub

git clone git@github.com:benedictpaten/marginPhase.git
cd marginPhase/
git submodule update --init
mkdir build
cd build
cmake ..
make -j 8

調整中

引用

Haplotype-aware diplotyping from noisy long reads

Jana Ebler, Marina Haukness, Trevor Pesout, Tobias Marschall & Benedict Paten
Genome Biology volume 20, Article number: 116 (2019) Cite this article

2020-07-13

StoatyDive

2019 Preprint ChIP-Seq bam/sam CLIP-Seq

タンパク質の生物学的機能は、その相互作用パートナーと相互作用のモードによって決まる。これらの相互作用を研究することで、オルタナティブスプライシングや転写後調節などの細胞メカニズムに関する視野が広がる。クロスリンク、またはクロマチン免疫沈降とハイスループットシーケンス（CLIP-Seq、ChIP-Seq）の組み合わせは、これらの相互作用を推測する方法である。 CLIP- Seqは、RNA結合タンパク質（RBP）とそのターゲットRNA間のすべての相互作用を調査する（ref.1）。したがって、CLIP-Seqは、RBPによる転写後の調節を精査する。結合領域の予測（ピークコール）は、CLIP-SeqやChIP-Seqなどのメソッドのデータ分析における重要なステップである。通常、ピーク分析の前に、ピーク特性の評価と分類は行われない。それでも、peakcallerから取得したピークセットには、ダウンストリーム分析を改良するためにフィルタリングする価値のある異なるピークプロファイルがある場合がある。異なるピーク形状は、いくつかの生物学的および技術的な問題の結果である。
　JankowskyとHarris（ref.2）は、RNA-タンパク質相互作用の特性と潜在的な問題について議論している。RBPは異なる結合ドメインを持っているか、タンパク質複合体の一部である可能性がある。したがって、このタンパク質には、特異的から非特異的までのさまざまな親和性（メカニズム）を持つ異種の結合部位が存在する可能性がある。 RNA部位に対するタンパク質の親和性、タンパク質とRNAの濃度などの要因は、異なるタンパク質結合ドメインの結合特異性に影響する。例えば、これらはいくつかのRNAに結合する能力を持っているmRNA輸送因子について述べている。言及されていないのは、タンパク質タイプが異なるピークプロファイルにも現れる可能性があることである。ヘリカーゼは、転写因子と比較して異なるピークプロファイルを持つ場合がある。
さらに、技術的なバイアスがピークプロファイルの状況を変える可能性がある。ライブラリの準備中にエラーが発生すると、不特定のバインドが発生する場合がある。プロトコルバイアス、たとえば、エンドヌクレアーゼおよび光活性化可能なヌクレオシドによって導入されるPAR-CLIPバイアス（ref.3）も、結合部位の予測に影響を与える可能性がある。さらに、peakcaller自体が特定のピークプロファイルと偽陽性を生成する場合があるが、ユーザーはそれらをデータに含めたくない場合がある。
　そのため、結合部位のデータ分析には多くの疑問が生じる。対象のタンパク質は一般に、より特異的（論文図1a）または非特異的（図1b）に結合するのだろうか？目的のRBPには複数の結合部位があり得るか？私の実験にはいくつかの品質の問題があるのか、つまり、ライブラリ調整のエラーのためリードが非特異的な結合から来ているのか？私のプロトコルはバイアスを生成するか？選択したpeakcallerからの予測ピークのセットに偽陽性があるか？（一部略）
　ここでは、ピークプロファイルを評価および分類して、前述の質問に答えるのに役立つツールであるStoatyDiveを紹介する。 StoatyDiveは、ピークプロファイル全体と定義済みの機能を使用して、シーケンスデータのピーク形状クラスタリングを実行する。この論文では、ヒストンステムループ結合タンパク質（SLBP）のeCLIPプロトコルのCLIPデータでStoatyDiveをテストする（ref.5）。StoatyDiveは、タンパク質のさまざまな結合プロファイルを評価するいくつかのプロットと表を提供する。このツールは、特異的および非特異的結合部位を選択し、同様の形状のピークプロファイルを見つけるのに役立つ。したがって、SLBPデータの得られたピークを改良して、SLBPのより具体的なサイトを見つけようとする。（以下略）

インストール

macos10.14のpython3.7.4環境で仮想環境を作成してテストした。

本体　Github

#Bioconda(link)condaで仮想環境を作って導入 
conda create -n stoatydive -c bioconda -y StoatyDive
conda activate stoatydive

> StoatyDive.py -h

$ StoatyDive.py

[START]

usage: StoatyDive.py [-h] [options] -a *.bed -b *.bam/*bed -c *.txt

StoatyDive.py: error: the following arguments are required: -a/--input_bed, -b/--input_bam, -c/--chr_file

(StoatyDive) kamisakakazumanoMac-mini:HINGE kazu$ StoatyDive.py -h

[START]

usage: StoatyDive.py [-h] [options] -a *.bed -b *.bam/*bed -c *.txt

The tool can evalute the profile of peaks. Provide the peaks you want to evalutate in bed6 format and the reads

you used for the peak detection in bed or bam format. The user obtains a distributions of the coefficient of variation (CV)

which can be used to evaluate the profile landscape. In addition, the tool generates ranked list for the peaks based

on the CV. The table hast the following columns: Chr Start End ID VC Strand bp r p Max_Norm_VC

Left_Border_Center_Difference Right_Border_Center_Difference. See StoatyDive's development page for a detailed description.

optional arguments:

-h, --help show this help message and exit

-v, --version show program's version number and exit

-a *.bed, --input_bed *.bed

Path to the peak file in bed6 format.

-b *.bam/*.bed, --input_bam *.bam/*.bed

Path to the read file used for the peak calling in bed

or bam format.

-c *.txt, --chr_file *.txt

Path to the chromosome length file.

-o path/, --output_folder path/

Write results to this path. [Default: Operating Path]

-t float, --thresh float

Set a normalized CV threshold to divide the peak

profiles into more specific (0) and more unspecific

(1). [Default: 1.0]

--peak_correction Activate peak correction. The peaks are recentered

(shifted) for the correct sumit.

--max_translocate Set this flag if you want to shift the peak profiles

based on the maximum value inside the profile instead

of a Gaussian blur translocation.

--peak_length int Set maximum peak length for the constant peak length.

--max_norm_value float

Provide a maximum value for CV to make the normalized

CV plot more comparable.

--border_penalty Adds a penalty for non-centered peaks.

--scale_max float Provide a maximum value for the CV plot.

--maxcl int Maximal number of clusters of the kmeans clustering of

the peak profiles. The algorithm will be optimized,

i.e., the parameter is just a constraint and not

absolute. [Default: 15]

-k int, --numcl int You can forcefully set the number of cluster of peak

profiles.

--sm Turn on the peak profile smoothing for the peak

profile classification. It is recommended to turn it

on.

--lam float Parameter for the peak profile classification. Set

lambda for the smoothing of the peak profiles. A

higher value (> default) will underfit. A lower value

(< default) will overfit. [Default: 0.3]

--turn_off_classification

Turn off the peak profile classification.

テストラン

git clone https://github.com/BackofenLab/StoatyDive.git
cd StoatyDive/

#exaample1
StoatyDive.py -a test/broad_peaks/peaks.bed -b test/broad_peaks/reads.bed -c test/chrom_sizes.txt --peak_correction --border_penalty --turn_off_classification -o test/broad_peaks/

-a Path to the peak file in bed6 format.
-b Path to the read file used for the peak calling in bed or bam format.
-c Path to the chromosome length file.
-o Write results to this path. [Default:

CV distribution plot

f:id:kazumaxneo:20200714001025p:plain

この図は、興味のあるタンパク質の結合特異性についての第一印象を与えてくれる。図はまた、あなたの実験のパフォーマンス/品質についても教えてくれる。非特異的な結合部位を多く含む実験では、ゼロに近いCV分布を持つことになる。（Githubより）

Normalized CV distribution plot

f:id:kazumaxneo:20200714001155p:plain

正規化されたCV分布は、実験内の特異的な部位と非特異的な部位を識別するのに役立つ。正規化されたCVは，範囲[0,1]にある。特異的なサイトは 1 の値を持ち、非特異的なサイトは 0 の値を持つ。

final_tab.bed file

最終的な表形式ファイルは、予測された結合部位のランク付けされた、タブで区切られたリストになる。

実行方法

アラインメントのbamに加え、ピークファイルをBED6フォーマットで指定する。

StoatyDive.py -a peak.bed -b input.bam -c chromosome_length.txt --border_penalty --peak_correction

--border_penalty Adds a penalty for non-centered peaks.
--peak_correction Activate peak correction. The peaks are recentered (shifted) for the correct sumit.

--border_penalty と --peak_correction を指定して使用することが推奨されている。 --border_penalty を追加すると、正しくセンタリングされていないピークを処理する。

引用

StoatyDive: Evaluation and Classification of Peak Profiles for Sequencing Data

Florian Heyl, Rolf Backofen
BioRxiv preprint first posted online Oct. 9, 2019

2020-07-12

高感度な類似タンパク質配列検索ツール HH-suite3（hhblitsについて）

2019 BMC Bioinformatics protein search 高速なツール multiple sequence alignment (MSA)

2020 7/13 タイトル変更

2020 7/14追記

2022/10/19 追記

ゲノミクスやメタゲノミクスプロジェクトのかなりの割合のタンパク質では同定可能なアノテーションされた相同なタンパク質がなく、アノテーションされていないタンパク質がかなりの割合を占めている[ref. 1]。配列類似性検索の感度が高ければ、アノテーションされた機能や既知の構造を持つ相同タンパク質を見つける可能性が高くなり、検索対象のタンパク質の機能や構造を推測することができる。そのため、比較タンパク質構造モデリングやディープ機能的アノテーションのためのテンプレートタンパク質を見つけるために、HMMERやHHblits [ref.5]のような最も感度の高い検索ツールがよく使われている[ref.6-9]。これらのツールは、単一配列と他の配列とのアラインメントだけでなく、多数の相同配列を含む複数配列のアラインメント（MSA）という形でより多くの情報を利用することで、相同性の検出を向上させることができる。MSAの各列のアミノ酸の頻度から、「配列プロファイル」と呼ばれる位置特異的なアミノ酸置換スコアの20×長さの行列を計算する。

　隠れマルコフモデル(HMM)は、位置特異的なアミノ酸置換スコアを挿入や欠失の位置特異的なペナルティで補強することで、配列プロファイルを拡張する。これらは、MSAにおける挿入と欠失の頻度から推定することができる。追加された情報は、PSI-BLAST [ref.10]のような配列プロファイルに基づく手法よりも、HHHblitsやHMMER3のようなプロファイルHMMベースの手法の感度を向上させる。

　検索ツールの中で、クエリとターゲットタンパク質の両方を相同タンパク質のMSAから構築された配列プロファイルとして表現しているものはほとんどない[ref.11-14]。対照的に、HHblits / HHsearchは、クエリとターゲットタンパク質の両方をプロファイルHMMとして表現する。これにより、配列類似性検索やより遠い相同性検出のための最も感度の高いツールの一つとなっている[ref.5, 15]。

　近年、BLASTに比べて最大4桁の高速化を実現した様々な配列検索ツールが開発されている[ref.16-19]。この高速化は、膨大な量の次世代シークエンスデータを、増え続けるアノテーション配列のデータベースに対して検索する必要性に対応している。しかし、BLASTやMMseqs2のような感度の高い方法を用いても、これらの配列の多くは相同性が見つからないことがある[ref.19]。

　ゲノミクスやメタゲノミクスプロジェクトは、PDBやPfam、その他のプロファイルデータベースを介したHHblits検索をパイプラインに追加することで、より多くの配列のアノテーションを行うことができる[ref.8]。本研究で紹介するHHblitsのバージョンは、Pfam [ref.20]やInterPro [ref.21]のアノテーションのための標準ツールであるHMMERの20倍の速度で実行されるため、追加の計算コストはわずかであろう。

　本研究では、最も時間的に重要なツールであるHHblitsとHHsearchを中心に、様々なHHスイートアルゴリズムを高速化し、並列化することを目標とした。Advanced Vector Extension 2 (AVX2) 命令や Streaming SIMD Extension 2 (SSE2) 命令を使用したデータレベルの並列化、OpenMP を使用したスレッドレベルの並列化、MPI を使用したコンピュータ間の並列化を適用した。最も重要なのは、最新のIntel、AMD、IBMのCPUに搭載されているSIMD演算ユニットによる並列化を十分に利用したことで、CPUコアあたり2～4倍のスピードアップを達成したことである。

Githubより

HH-suite は、隠れマルコフモデル（HMM）のペアワイズアラインメントに基づいた高感度なタンパク質配列検索のためのオープンソースのソフトウェアパッケージである。最も重要な2つのプログラムはHHsearchとHHblitsである。どちらもプロファイル隠れマルコフモデル（HMM）のペアワイズ比較に基づいている（クエリ配列とデータベース配列の両方をプロファイルHMMで表現することでペアワイズ配列比較やプロファイル配列比較に基づく方法よりも、遠い相同タンパク質の検出とアラインメントの感度が高くなる）。HHsearch は、多重配列アライメント（MSA）またはプロファイル HMM を入力とし、HMM のデータベース（PDB、Pfam、InterPro など）から相同なタンパク質を検索する。HHsearchは、相同なテンプレートを検出し、相同性モデリングのための高精度なクエリ-テンプレートペアワイズアラインメントを構築するために、タンパク質の構造予測によく利用されている。

　HHblitsは、単一配列またはMSAから始まる高品質のMSAを構築することができる。これをクエリHMMに変換し、Uniclust30データベースを反復的に検索し、前回の検索で得られた有意に類似した配列を、次の検索反復のために更新されたクエリHMMに追加する。PSI-BLASTと比較して、HHblitsは高速で、最大2倍の感度を持ち、より正確なアラインメントを生成する。HHblitsはHHsearchと同じHMM-HMMアライメントアルゴリズムを使用しているが、低速なHMM-HMM比較を実行するためのデータベースHMMの数を数千万から数千に減らす高速なプレフィルターを採用している。

計算機クラスタで動かす手順など

https://github.com/soedinglab/hh-suite/wiki#running-hhblits-efficiently-on-a-computer-cluster

Overview of programs

hhblits: Iteratively) search an HHsuite database with a query sequence or MSA
hhsearch: Search an HHsuite database with a query MSA or HMM
hhmake: Build an HMM from an input MSA
hhfilter: Filter an MSA by max sequence identity, coverage, and other criteria
hhalign: Calculate pairwise alignments etc. for two HMMs/MSAs
hhconsensus: Calculate the consensus sequence for an A3M/FASTA input file
reformat.pl: Reformat one or many MSAs
addss.pl: Add PSIPRED predicted secondary structure to an MSA or HHM file
hhmakemodel.pl: Generate MSAs or coarse 3D models from HHsearch or HHblits results
hhmakemodel.py: Generates coarse 3D models from HHsearch or HHblits results and modifies cif files such that they are compatible with MODELLER
hhsuitedb.py: Build HHsuite database with prefiltering, packed MSA/HMM, and index files
splitfasta.pl: Split a multiple-sequence FASTA file into multiple single-sequence files
renumberpdb.pl: Generate PDB file with indices renumbered to match input sequence indices
HHPaths.pm: Configuration file with paths to the PDB, BLAST, PSIPRED etc.
mergeali.pl: Merge MSAs in A3M format according to an MSA of their seed sequences
pdb2fasta.pl: Generate FASTA sequence file from SEQRES records of globbed pdb files
cif2fasta.py: Generate a FASTA sequence from the pdbx_seq_one_letter_code entry of the entity_poly of globbed cif files
pdbfilter.pl: Generate representative set of PDB/SCOP sequences from pdb2fasta.pl output
pdbfilter.py: Generate representative set of PDB/SCOP sequences from cif2fasta.py output

インストール

ubuntu18.04の計算機でビルドして動作確認した。

ビルド依存

To compile from source, you will need a recent C/C++ compiler (at least GCC 4.8 or Clang 3.6) and CMake 2.8.12 or later.
HHblits requires a CPU supporting at least the SSE2 (Streaming SIMD Extensions 2) instruction set. Every 64-bit CPU also supports SSE2. By default, the HH-suite binaries are compiled using the most recent supported instruction set on the computer used for compiling.

Github

#bioconda (link)
mamba install -c bioconda -c conda-forge hhsuite -y

#from source
git clone https://github.com/soedinglab/hh-suite.git
mkdir -p hh-suite/build && cd hh-suite/build
cmake -DCMAKE_INSTALL_PREFIX=. ..
make -j 8 && make install

#パスを通す
export PATH="$(pwd)/bin:$(pwd)/scripts:$PATH"

> hhblits -h

$ hhblits -h

HHblits 3.3.0:

HMM-HMM-based lightning-fast iterative sequence search

HHblits is a sensitive, general-purpose, iterative sequence search tool that represents

both query and database sequences by HMMs. You can search HHblits databases starting

with a single query sequence, a multiple sequence alignment (MSA), or an HMM. HHblits

prints out a ranked list of database HMMs/MSAs and can also generate an MSA by merging

the significant database HMMs/MSAs onto the query MSA.

Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger S J, and Söding J (2019)

HH-suite3 for fast remote homology detection and deep protein annotation.

BMC Bioinformatics, doi:10.1186/s12859-019-3019-7

Usage: hhblits -i query [options]

-i <file> input/query: single sequence or multiple sequence alignment (MSA)

in a3m, a2m, or FASTA format, or HMM in hhm format

<file> may be 'stdin' or 'stdout' throughout.

Options:

-d <name> database name (e.g. uniprot20_29Feb2012)

Multiple databases may be specified with '-d <db1> -d <db2> ...'

-n [1,8] number of iterations (default=2)

-e [0,1] E-value cutoff for inclusion in result alignment (def=0.001)

Input alignment format:

-M a2m use A2M/A3M (default): upper case = Match; lower case = Insert;

' -' = Delete; '.' = gaps aligned to inserts (may be omitted)

-M first use FASTA: columns with residue in 1st sequence are match states

-M [0,100] use FASTA: columns with fewer than X% gaps are match states

-tags/-notags do NOT / do neutralize His-, C-myc-, FLAG-tags, and trypsin

recognition sequence to background distribution (def=-notags)

Output options:

-o <file> write results in standard format to file (default=<infile.hhr>)

-oa3m <file> write result MSA with significant matches in a3m format

-opsi <file> write result MSA of significant matches in PSI-BLAST format

-ohhm <file> write HHM file for result MSA of significant matches

-oalis <name> write MSAs in A3M format after each iteration

-blasttab <name> write result in tabular BLAST format (compatible to -m 8 or -outfmt 6 output)

1 2 3 4 5 6 7 8 9 10 11 12

query target #match/tLen alnLen #mismatch #gapOpen qstart qend tstart tend eval score

-add_cons generate consensus sequence as master sequence of query MSA (default=don't)

-hide_cons don't show consensus sequence in alignments (default=show)

-hide_pred don't show predicted 2ndary structure in alignments (default=show)

-hide_dssp don't show DSSP 2ndary structure in alignments (default=show)

-show_ssconf show confidences for predicted 2ndary structure in alignments

-Ofas <file> write pairwise alignments in FASTA xor A2M (-Oa2m) xor A3M (-Oa3m) format

-seq <int> max. number of query/template sequences displayed (default=1)

-aliw <int> number of columns per line in alignment list (default=80)

-p [0,100] minimum probability in summary and alignment list (default=20)

-E [0,inf[ maximum E-value in summary and alignment list (default=1E+06)

-Z <int> maximum number of lines in summary hit list (default=500)

-z <int> minimum number of lines in summary hit list (default=10)

-B <int> maximum number of alignments in alignment list (default=500)

-b <int> minimum number of alignments in alignment list (default=10)

Prefilter options

-noprefilt disable all filter steps

-noaddfilter disable all filter steps (except for fast prefiltering)

-maxfilt max number of hits allowed to pass 2nd prefilter (default=20000)

-min_prefilter_hits min number of hits to pass prefilter (default=100)

-prepre_smax_thresh min score threshold of ungapped prefilter (default=10)

-pre_evalue_thresh max E-value threshold of Smith-Waterman prefilter score (default=1000.0)

-pre_bitfactor prefilter scores are in units of 1 bit / pre_bitfactor (default=4)

-pre_gap_open gap open penalty in prefilter Smith-Waterman alignment (default=20)

-pre_gap_extend gap extend penalty in prefilter Smith-Waterman alignment (default=4)

-pre_score_offset offset on sequence profile scores in prefilter S-W alignment (default=50)

Filter options applied to query MSA, database MSAs, and result MSA

-all show all sequences in result MSA; do not filter result MSA

-interim_filter NONE|FULL

filter sequences of query MSA during merging to avoid early stop (default: FULL)

NONE: disables the intermediate filter

FULL: if an early stop occurs compare filter seqs in an all vs. all comparison

-id [0,100] maximum pairwise sequence identity (def=90)

-diff [0,inf[ filter MSAs by selecting most diverse set of sequences, keeping

at least this many seqs in each MSA block of length 50

Zero and non-numerical values turn off the filtering. (def=1000)

-cov [0,100] minimum coverage with master sequence (%) (def=0)

-qid [0,100] minimum sequence identity with master sequence (%) (def=0)

-qsc [0,100] minimum score per column with master sequence (default=-20.0)

-neff [1,inf] target diversity of multiple sequence alignment (default=off)

-mark do not filter out sequences marked by ">@"in their name line

HMM-HMM alignment options:

-norealign do NOT realign displayed hits with MAC algorithm (def=realign)

-realign_old_hits realign hits from previous iterations

-mact [0,1[ posterior prob threshold for MAC realignment controlling greedi-

ness at alignment ends: 0:global >0.1:local (default=0.35)

-glob/-loc use global/local alignment mode for searching/ranking (def=local)

-realign realign displayed hits with max. accuracy (MAC) algorithm

-realign_max <int> realign max. <int> hits (default=500)

-ovlp <int> banded alignment: forbid <ovlp> largest diagonals |i-j| of DP matrix (def=0)

-alt <int> show up to this many alternative alignments with raw score > smin(def=4)

-smin <float> minimum raw score for alternative alignments (def=20.0)

-shift [-1,1] profile-profile score offset (def=-0.03)

-corr [0,1] weight of term for pair correlations (def=0.10)

-sc <int> amino acid score (tja: template HMM at column j) (def=1)

0 = log2 Sum(tja*qia/pa) (pa: aa background frequencies)

1 = log2 Sum(tja*qia/pqa) (pqa = 1/2*(pa+ta) )

2 = log2 Sum(tja*qia/ta) (ta: av. aa freqs in template)

3 = log2 Sum(tja*qia/qa) (qa: av. aa freqs in query)

5 local amino acid composition correction

-ssm {0,..,4} 0: no ss scoring

1,2: ss scoring after or during alignment [default=2]

3,4: ss scoring after or during alignment, predicted vs. predicted

-ssw [0,1] weight of ss score (def=0.11)

-ssa [0,1] ss confusion matrix = (1-ssa)*I + ssa*psipred-confusion-matrix [def=1.00)

-wg use global sequence weighting for realignment!

Gap cost options:

-gapb [0,inf[ Transition pseudocount admixture (def=1.00)

-gapd [0,inf[ Transition pseudocount admixture for open gap (default=0.15)

-gape [0,1.5] Transition pseudocount admixture for extend gap (def=1.00)

-gapf ]0,inf] factor to increase/reduce gap open penalty for deletes (def=0.60)

-gapg ]0,inf] factor to increase/reduce gap open penalty for inserts (def=0.60)

-gaph ]0,inf] factor to increase/reduce gap extend penalty for deletes(def=0.60)

-gapi ]0,inf] factor to increase/reduce gap extend penalty for inserts(def=0.60)

-egq [0,inf[ penalty (bits) for end gaps aligned to query residues (def=0.00)

-egt [0,inf[ penalty (bits) for end gaps aligned to template residues (def=0.00)

Pseudocount (pc) options:

Context specific hhm pseudocounts:

-pc_hhm_contxt_mode {0,..,3} position dependence of pc admixture 'tau' (pc mode, default=2)

0: no pseudo counts: tau = 0

1: constant tau = a

2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)

3: CSBlast admixture: tau = a(1+b)/(Neff[i]+b)

(Neff[i]: number of effective seqs in local MSA around column i)

-pc_hhm_contxt_a [0,1] overall pseudocount admixture (def=0.9)

-pc_hhm_contxt_b [1,inf[ Neff threshold value for mode 2 (def=4.0)

-pc_hhm_contxt_c [0,3] extinction exponent c for mode 2 (def=1.0)

Context independent hhm pseudocounts (used for templates; used for query if contxt file is not available):

-pc_hhm_nocontxt_mode {0,..,3} position dependence of pc admixture 'tau' (pc mode, default=2)

0: no pseudo counts: tau = 0

1: constant tau = a

2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)

(Neff[i]: number of effective seqs in local MSA around column i)

-pc_hhm_nocontxt_a [0,1] overall pseudocount admixture (def=1.0)

-pc_hhm_nocontxt_b [1,inf[ Neff threshold value for mode 2 (def=1.5)

-pc_hhm_nocontxt_c [0,3] extinction exponent c for mode 2 (def=1.0)

Context specific prefilter pseudocounts:

-pc_prefilter_contxt_mode {0,..,3} position dependence of pc admixture 'tau' (pc mode, default=3)

0: no pseudo counts: tau = 0

1: constant tau = a

2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)

3: CSBlast admixture: tau = a(1+b)/(Neff[i]+b)

(Neff[i]: number of effective seqs in local MSA around column i)

-pc_prefilter_contxt_a [0,1] overall pseudocount admixture (def=0.8)

-pc_prefilter_contxt_b [1,inf[ Neff threshold value for mode 2 (def=2.0)

-pc_prefilter_contxt_c [0,3] extinction exponent c for mode 2 (def=1.0)

Context independent prefilter pseudocounts (used if context file is not available):

-pc_prefilter_nocontxt_mode {0,..,3} position dependence of pc admixture 'tau' (pc mode, default=2)

0: no pseudo counts: tau = 0

1: constant tau = a

2: diversity-dependent: tau = a/(1+((Neff[i]-1)/b)^c)

(Neff[i]: number of effective seqs in local MSA around column i)

-pc_prefilter_nocontxt_a [0,1] overall pseudocount admixture (def=1.0)

-pc_prefilter_nocontxt_b [1,inf[ Neff threshold value for mode 2 (def=1.5)

-pc_prefilter_nocontxt_c [0,3] extinction exponent c for mode 2 (def=1.0)

Context-specific pseudo-counts:

-nocontxt use substitution-matrix instead of context-specific pseudocounts

-contxt <file> context file for computing context-specific pseudocounts (default=)

-csw [0,inf] weight of central position in cs pseudocount mode (def=1.6)

-csb [0,1] weight decay parameter for positions in cs pc mode (def=0.9)

Other options:

-v <int> verbose mode: 0:no screen output 1:only warings 2: verbose (def=2)

-neffmax ]1,20] skip further search iterations when diversity Neff of query MSA

becomes larger than neffmax (default=20.0)

-cpu <int> number of CPUs to use (for shared memory SMPs) (default=2)

-scores <file> write scores for all pairwise comparisons to file

-filter_matrices filter matrices for similarity to output at most 100 matrices

-atab <file> write all alignments in tabular layout to file

-maxseq <int> max number of input rows (def=65535)

-maxres <int> max number of HMM columns (def=20001)

-maxmem [1,inf[ limit memory for realignment (in GB) (def=3.0)

Examples:

hhblits -i query.fas -o query.hhr -d ./uniclust30

hhblits -i query.fas -o query.hhr -oa3m query.a3m -n 1 -d ./uniclust30

> hhsearch -h

$ hhsearch -h

HHsearch 3.3.0

Search a database of HMMs with a query alignment or query HMM

Steinegger M, Meier M, Mirdita M, Vöhringer H, Haunsberger S J, and Söding J (2019)

HH-suite3 for fast remote homology detection and deep protein annotation.

BMC Bioinformatics, doi:10.1186/s12859-019-3019-7

Usage: hhsearch -i query -d database [options]

-i <file> input/query multiple sequence alignment (a2m, a3m, FASTA) or HMM

<file> may be 'stdin' or 'stdout' throughout.

Options:

-d <name> database name (e.g. uniprot20_29Feb2012)

Multiple databases may be specified with '-d <db1> -d <db2> ...'

-e [0,1] E-value cutoff for inclusion in result alignment (def=0.001)

Input alignment format:

-M a2m use A2M/A3M (default): upper case = Match; lower case = Insert;

'-' = Delete; '.' = gaps aligned to inserts (may be omitted)