Informatics on Mac (macでインフォマティクス)

A collection of informatics notes related to HTS (NGS).

SyntenyQC: quality control ahead of synteny plot analysis

 

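SyntenyQC is a data preprocessing tool for building synteny plots. It supports the collection, annotation, and dereplication of genomic data, which facilitates (and in some cases fundamentally enables) the construction of informative synteny plots.

SyntenyQC is a command-line app developed with Python 3.10 and tested with pytest. It is available under the MIT license from PyPI (https://pypi.org/project/SyntenyQC), together with a detailed user tutorial. The package is hosted at https://github.com/Tim-Kirkwood/SyntenyQC.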

 

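From the repository

 Synteny plots are widely used to compare genomic neighbourhoods. They are often included as part of larger software suites (such as the antiSMASH ClusterBlast module), but there are now many low-code, standalone tools that let users extract candidate neighbourhoods and create their own synteny plots. However, a gap remains between the following:

(i) tools that extract candidate neighbourhoods (e.g. cblaster, which can detect hundreds of candidate regions)

(ii) tools that build synteny plots (e.g. clinker, which struggles once the number of neighbourhoods exceeds 30 to 50)

(iii) the synteny plot itself (the more neighbourhoods it contains, the harder it is to analyse and present).

SyntenyQC is a Python application for curating neighbourhoods immediately before a synteny plot is created. SyntenyQC collect supports the systematic definition and annotation of candidate neighbourhoods through direct integration with cblaster. SyntenyQC sieve provides a flexible way to objectively remove redundant neighbourhoods, obtained with cblaster or other tools, before the synteny plot is created. In some cases this is an absolute requirement (e.g. cblaster called through the CAGECAT web server is limited to 50 neighbourhoods).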

 

Installation

GitHub

mamba create --name syntenyqc_env pip python=3.12.9
conda activate syntenyqc_env
pip install SyntenyQC

# cblaster is also used
pip install cblaster

> SyntenyQC collect -h

usage: SyntenyQC collect [-h] -bp -ns -em [-fn] [-sp] [-wg]

 

Write genbank files corresponding to cblaster neighbourhoods from a specified CSV-format binary file loacted at BINARY_PATH.  For each cblaster hit accession in the binary file:

 

1) A record is downloaded from NCBI using the accession.  NCBI requires a user EMAIL to search for this record programatically.  If WRITE_GENOMES is specified, this record is written to a local file according to FILENAMES (see final bulletpoint).

2) A neighbourhood of size NEIGHBOURHOOD_SIZE bp is defined, centered on the cblaster hits defined in the binary file for the target accession. 

3) (If STRICT_SPAN is specified:) If the accession's record is too small to contain a neighbourhood of the desired size, it is discarded.  For example, if an accession record is a 25kb contig and NEIGHBOURHOOD_SIZE is 50000, the record is discarded.

4) If FILENAMES is "organism", the nighbourhood is written to file called *organism*.gbk. If FILENAMES is "accession", the neighbourhood is written to *accession*.gbk. Synteny softwares such as clinker can use these filesnames to label synetny plot samples.

                                            

Once COLLECT has been run, a new folder with the same name as the binary file should be created in the directory that holds the binary file (i.e. the file "path/to/binary/file.txt" will generate the folder "path/to/binary/file"). This folder will have a subdirectory called "neighbourhood", containing all of the neighbourhood genbank files (i.e. "path/to/binary/file/neighbourhood"). If WRITE_GENOMES is specified, a second direcory ("genome") will also be present, containing the entire record associated with each cblaster accession (i.e. "path/to/binary/file/genome").  Finally, a log file will be present in the folder "path/to/binary/file", containing a summary of accessions whose neighbourhoods were discarded.

 

options:

  -h, --help            show this help message and exit

  -bp, --binary_path 

                        Full filepath to the CSV-format cblaster binary file containing neighbourhoods that should be extracted

  -ns, --neighbourhood_size 

                        Size (basepairs) of neighbourhood to be extracted (centered on middle of CBLASTER-defined neighbourhood)

  -em, --email      Email - required for NCBI entrez querying

  -fn, --filenames  If "organism", all collected files will be named according to organism. If "accession", all files will be named by NCBI accession. (default: organism)

  -sp, --strict_span    If set, will discard all neighbourhoods that are smaller than neighbourhood_size bp. For example, if you set a neighbourhood_size of 50000, a 50kb neighbourhood will be extracted from the NCBI

                        record associateed with each cblaster hit. If the record is too small for this to be done (i.e. the record is smaller then 50kb) it is discarded

  -wg, --write_genomes  If set, will write entire NCBI record containing a cblaster hit to file (as well as just the neighbourhood)

> SyntenyQC sieve -h

usage: SyntenyQC sieve [-h] -gf [-ev] [-mts] [-mev] [-sf] [-am] [-dmts] [-ex] [-qc] [-sc] [-id] [-ks]

 

Filter redundant genomic neighbourhoods based on neighbourhood similarity:

- First, an all-vs-all BLASTP is performed with user-specified BLASTP settings and the neighbourhoods in GENBANK_FOLDER.

- Secondly, these are parsed to define reciprocal best hits between every pair of neighbourhoods.

- Thirdly, these reciprocal best hits are used to derive a neighbourhood similarity network, where edges indicate two neighbourhood nodes that have a similarity > SIMILARITY_FILTER. Similarity = Number of RBHs / Number of proteins in smallest neighbourhood in pair.

- Finally, this network is pruned to remove neighbourhoods that exceed the user's SIMILARITY_FILTER threshold. Nodes that remain are copied to the newly created folder "genbank_folder/sieve_results/genbank".

 

options:

  -h, --help            show this help message and exit

  -gf, --genbank_folder

                        Full path to folder containing neighbourhood genbank files requiring de-duplication

  -ev, --e_value    BLASTP evalue threshold. (default: 1e-05)

  -mts, --max_target_seqs

                        BLASTP -max_target_seqs. Maximum number of aligned sequences to keep. (default: 200)

  -mev, --min_edge_view

                        Minimum similarity between two neighbourhoods for an edge to be drawn betweeen them in the RBH graph. Purely for visualisation of the graph HTML file - has no impact on the graph pruning

                        results. (default: --similarity_filter)

  -sf, --similarity_filter

                        Similarity threshold above which two neighbourhoods are considered redundant (default: 0.7)

  -am, --alignment_mode

                        Alignment mode used by DIAMOND (choices: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, ultra-sensitive). Without using any sensitivity option, the default mode will run

                        which is designed for finding hits of >60 percent identity and short read alignment. Its sensitivity is between --fast and --mid-sensitive. See here

                        https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#sensitivity-modes

  -dmts, --dynamic_max_target_seqs

                        If set, --max_target_seqs will be automatically defined as the numer of genbank files within --genbank_folder or --max_target_seqs, whichever is larger

  -ex, --expand         If set, DO NOT gzip compress DIAMOND results file (will increase disk space requirments)

  -qc, --query_cover

                        Report only alignments above the given percentage of query cover. Note that using this option reduces performance.

  -sc, --subject_cover

                        Report only alignments above the given percentage of subject cover. Note that using this option reduces performance.

  -id, --identity   Report only alignments above the given percentage of sequence identity. Note that using this option reduces performance.

  -ks, --keep_pseudo    if set, will count pseudo entries (or missing sequences) when counting the number of proteins within a given neighbourhood for the inter-neighbourhood similarity score.

 

 

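Usage

SyntenyQC works from the binary file (a CSV of candidate neighbourhood information) obtained with cblaster or a similar tool.

 

1. Run cblaster. Write out the binary table (-b binary.txt) with "-bde ','" set so that it is comma-delimited. The -ig flag must not be set.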

git clone https://github.com/gamcil/cblaster.git
cd cblaster/tests

# Register your own email address before querying the data
cblaster config --email name@domain.com
cblaster search --query_file test.faa -p out.html --output summary.csv -s session.json -bde ',' -b binary.txt

Output

> cat binary.txt # CSV format

Organism,Scaffold,Start,End,Score,QBE85648.1,QBE85647.1,QBE85646.1
Aspergillus alliaceus IBT 14317,ML735331.1,25371,30385,5.2238,1,1,1
Aspergillus alliaceus FRR 5400,SPNV01000377.1,20785,25771,5.2249,1,1,1
Aspergillus alliaceus FRR 5400,MK425157.1,9655,14641,5.2249,1,1,1
Aspergillus alliaceus CBS 536.65,NW_022474703.1,24798,29805,5.2231,1,1,1
Aspergillus versicolor IMB17-055,MN395477.1,10953,14873,5.1532,1,1,1
Aspergillus versicolor CBS 583.65,NW_024467525.1,3170167,3174073,5.1544,1,1,1
Aspergillus mulundensis DSM 5745,NW_020797889.1,1726891,1732402,5.1639,1,1,1
Aspergillus multicolor CBS 133.54,NW_027395953.1,1159173,1163556,5.1565,1,1,1
Pseudomassariella vexata CBS 129021,NW_024467959.1,1616956,1621475,5.1029,1,1,1
Plenodomus lindquistii US01,JANUWI010000007.1,789573,793524,5.0959,1,1,1

 

The cblaster output is quite redundant. SyntenyQC thins it out so that a meaningful comparison becomes possible.
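To get a feel for how redundant a binary table is before running SyntenyQC, the hits can be tallied per species. The following is a minimal sketch, not part of cblaster or SyntenyQC: the helper name `hits_per_species` is hypothetical, the rows are a few of those shown in the output above, and treating the first two words of the Organism column as the species name is an assumption that holds for these rows only.

```python
import csv
import io
from collections import Counter

# A few rows copied from the binary.txt output shown earlier (comma-delimited).
BINARY_TABLE = """\
Organism,Scaffold,Start,End,Score,QBE85648.1,QBE85647.1,QBE85646.1
Aspergillus alliaceus IBT 14317,ML735331.1,25371,30385,5.2238,1,1,1
Aspergillus alliaceus FRR 5400,SPNV01000377.1,20785,25771,5.2249,1,1,1
Aspergillus alliaceus FRR 5400,MK425157.1,9655,14641,5.2249,1,1,1
Aspergillus versicolor IMB17-055,MN395477.1,10953,14873,5.1532,1,1,1
"""


def hits_per_species(handle):
    """Count hit rows per species (assumed: first two words of 'Organism')."""
    counts = Counter()
    for row in csv.DictReader(handle):
        species = " ".join(row["Organism"].split()[:2])
        counts[species] += 1
    return counts


counts = hits_per_species(io.StringIO(BINARY_TABLE))
for species, n in counts.most_common():
    print(f"{species}\t{n}")
```

For a real run, pass an open file handle for binary.txt instead of the in-memory string.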

 

Alternatively, run cblaster on the CAGECAT web server (introduced separately).

Here, the demo gene cluster GenBank file is specified.

 

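2. Run SyntenyQC collect. Using the neighbourhood information in the binary table produced by cblaster search, it extracts GenBank files of the genomic neighbourhood around each hit, with the length given by -ns (42566 bp is twice the length of the demo gene cluster). Specify binary.txt by its full path.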

SyntenyQC collect -bp "$(pwd)/binary.txt" -ns 42566 -em my_email@domain.com -fn organism -sp -wg
  • -bp      Full filepath to the CSV-format cblaster binary file containing neighbourhoods that should be extracted
  • -ns      Size (basepairs) of neighbourhood to be extracted (centered on middle of CBLASTER-defined neighbourhood)
  • -em      Email - required for NCBI entrez querying
  • -fn        If "organism", all collected files will be named according to organism. If "accession", all files will be named by NCBI accession. (default: organism)
  • -sp       If set, will discard all neighbourhoods that are smaller than neighbourhood_size bp. For example, if you set a neighbourhood_size of 50000, a 50kb neighbourhood will be extracted from the NCBI record associated with each cblaster hit. If the record is too small for this to be done (i.e. the record is smaller than 50kb) it is discarded
  • -wg      If set, will write entire NCBI record containing a cblaster hit to file (as well as just the neighbourhood)
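The interaction between -ns and -sp can be illustrated with a small sketch. This is not SyntenyQC's implementation: the function name `neighbourhood_window` is hypothetical, coordinates are treated as simple 0-based offsets, and clipping the window to the record boundaries when -sp is not set is an assumption (the help text only states what happens when -sp is set).

```python
def neighbourhood_window(hit_start, hit_end, record_length, size, strict_span=False):
    """Return (start, end) of a window of `size` bp centred on the hit region.

    Returns None when strict_span is set and the record cannot hold a
    full-size window. Without strict_span the window is clipped to the
    record (assumed behaviour, for illustration only).
    """
    midpoint = (hit_start + hit_end) // 2
    start = midpoint - size // 2
    end = start + size

    # Clip the window to the boundaries of the record.
    clipped_start = max(0, start)
    clipped_end = min(record_length, end)

    if strict_span and (clipped_end - clipped_start) < size:
        return None  # neighbourhood smaller than requested: discard
    return clipped_start, clipped_end


# A 25 kb contig cannot hold a 50 kb neighbourhood (example from the help text).
print(neighbourhood_window(10000, 15000, 25000, 50000, strict_span=True))
print(neighbourhood_window(10000, 15000, 25000, 50000, strict_span=False))

# Hit coordinates from the A. mulundensis row; the record length is made up.
print(neighbourhood_window(1726891, 1732402, 5_000_000, 42566, strict_span=True))
```

With -sp set, short contigs drop out of the analysis entirely, so every remaining neighbourhood has the same length and can be compared on an equal footing.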

 

Example output

binary/

binary/genome/

binary/neighbourhood/
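According to the collect help text, the results folder takes its name from the binary file with the extension removed. The sketch below derives the expected locations from a binary file path; the helper name `expected_collect_outputs` is hypothetical, and it only computes paths without checking that they exist.

```python
from pathlib import Path


def expected_collect_outputs(binary_path, write_genomes=False):
    """Return the folders that collect is described as creating."""
    # "path/to/binary/file.txt" -> "path/to/binary/file"
    results_dir = Path(binary_path).with_suffix("")
    outputs = {
        "results": results_dir,
        "neighbourhood": results_dir / "neighbourhood",
    }
    if write_genomes:
        # Only present when -wg / --write_genomes is given.
        outputs["genome"] = results_dir / "genome"
    return outputs


outputs = expected_collect_outputs("path/to/binary/file.txt", write_genomes=True)
for name, path in outputs.items():
    print(f"{name}: {path.as_posix()}")
```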

 

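3. Run the sieve command. It automatically removes redundant entries from the collected GenBank-format neighbourhood files. Specify binary/neighbourhood/ from the output of step 2. -sf sets the similarity threshold at which two neighbourhoods are considered similar enough for one of them to be discarded as redundant. Note that an error occurs if diamond is not on the PATH (mamba install -c bioconda -c conda-forge diamond).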

SyntenyQC sieve -sf 0.7 -gf binary/neighbourhood
  • -gf      Full path to folder containing neighbourhood genbank files requiring de-duplication
  • -sf      Similarity threshold above which two neighbourhoods are considered redundant (default: 0.7)
  • -ev      BLASTP evalue threshold. (default: 1e-05)
  • -am    Alignment mode used by DIAMOND (choices: fast, mid-sensitive, sensitive, more-sensitive, very-sensitive, ultra-sensitive). Without using any sensitivity option, the default mode will run which is designed for finding hits of >60 percent identity and short read alignment. Its sensitivity is between --fast and --mid-sensitive. See here https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#sensitivity-modes
  • -mts     BLASTP -max_target_seqs. Maximum number of aligned sequences to keep. (default: 200)
  • -mev    Minimum similarity between two neighbourhoods for an edge to be drawn between them in the RBH graph. Purely for visualisation of the graph HTML file - has no impact on the graph pruning results. (default: --similarity_filter)

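The sieve command first runs an all-vs-all BLASTP using the user-specified BLASTP settings and the neighbourhoods in GENBANK_FOLDER, then defines reciprocal best hits (RBHs) between every pair of neighbourhoods. These RBHs are used to build a neighbourhood similarity network, in which an edge connects two neighbourhood nodes whose similarity exceeds SIMILARITY_FILTER. Similarity = number of RBHs / number of proteins in the smaller neighbourhood of the pair. Finally, the network is pruned to remove neighbourhoods that exceed the user's SIMILARITY_FILTER threshold. The remaining nodes are copied to the newly created folder "genbank_folder/sieve_results/genbank".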
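The similarity score and the idea of pruning the network can be sketched as follows. The `similarity` function follows the formula stated above. The pruning strategy is an assumption made purely for illustration: the description does not say how SyntenyQC chooses which nodes to remove, so this sketch greedily drops the node with the most redundant neighbours until no edges remain. All function names, protein counts, and RBH counts are made up.

```python
from itertools import combinations


def similarity(n_rbh, n_proteins_a, n_proteins_b):
    """Number of RBHs divided by the protein count of the smaller neighbourhood."""
    return n_rbh / min(n_proteins_a, n_proteins_b)


def prune(protein_counts, rbh_counts, similarity_filter=0.7):
    """Greedy pruning sketch (assumed strategy, not SyntenyQC's own)."""
    # Build the similarity network: an edge means similarity > threshold.
    neighbours = {name: set() for name in protein_counts}
    for a, b in combinations(sorted(protein_counts), 2):
        n_rbh = rbh_counts.get((a, b), rbh_counts.get((b, a), 0))
        if similarity(n_rbh, protein_counts[a], protein_counts[b]) > similarity_filter:
            neighbours[a].add(b)
            neighbours[b].add(a)

    kept = set(protein_counts)
    while kept:
        # Degree counts only edges to nodes that are still kept.
        degree = {n: len(neighbours[n] & kept) for n in kept}
        worst = max(sorted(kept), key=degree.get)
        if degree[worst] == 0:
            break  # no redundant pairs remain
        kept.remove(worst)
    return sorted(kept)


# Hypothetical neighbourhoods: proteins per neighbourhood and RBHs per pair.
protein_counts = {"A": 10, "B": 12, "C": 8, "D": 9}
rbh_counts = {("A", "B"): 9, ("A", "C"): 7, ("B", "C"): 3, ("C", "D"): 2}

print(similarity(9, 10, 12))
print(prune(protein_counts, rbh_counts, similarity_filter=0.7))
```

In this toy network, A is redundant with both B and C, so removing A alone leaves a set in which no pair exceeds the threshold.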

Example output

binary/sieve_results/

visualisations/

The number of GenBank files dropped from 534 to 94.


=> After removing redundancy, visualise the result with clinker or a similar tool.

Citation

Synteny plot quality control with SyntenyQC

Timothy D J Kirkwood, Jack A Connolly, Ee Lui Ang, Huimin Zhao, Eriko Takano, Rainer Breitling

Bioinformatics, Volume 41, Issue 12, December 2025

 
