pysradb: a toolkit for fetching metadata and data from SRA/ENA/GEO and converting between IDs

2022/04/20: title corrected

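Several projects have made efforts to analyze and publish summaries of DNA-seq [ref.1] and RNA-seq [ref.2, 3] datasets. Obtaining metadata and raw data from the NCBI Sequence Read Archive (SRA) [ref.4] is the first step in comparing publicly available next-generation sequencing datasets with one's own data or in testing new hypotheses. The NCBI SRA toolkit [ref.5] provides utility methods for downloading raw sequencing data, whereas metadata are obtained by querying the website or through the Entrez efetch command-line utility [ref.6]. Most workflows for analyzing public data rely on first searching the metadata for relevant keywords, via the command-line utility or the website, and then downloading the matching datasets. A more streamlined workflow would allow both steps to be performed together.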
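To make queries of both metadata and data more precise and robust, the SRAdb [ref.7] project provides a frequently updated SQLite database containing all the metadata parsed from SRA. SRAdb tracks the five main data objects in SRA's metadata (submission, study, sample, experiment, and run). These are mapped to five different relational database tables that are made available in the SQLite file. The metadata semantics in the file remain the same as in SRA. The SRAdb package [ref.8], available for the R programming language [ref.9], provides a convenient framework for handling metadata queries and raw data downloads by using the SQLite database. Although powerful, SRAdb requires end users to be familiar with the R programming language and does not provide a command-line interface for query or download operations.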
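How the five objects relate can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the table names follow the five objects listed above, but the column names, the accession values, and the rows are made up for the example, and the real SRAmetadb.sqlite has many more columns.

```python
import sqlite3

# Hypothetical, heavily simplified schema: one table per SRA object.
# Column names are assumptions for illustration, not the real SRAmetadb layout.
SCHEMA = """
CREATE TABLE submission (submission_accession TEXT PRIMARY KEY);
CREATE TABLE study (study_accession TEXT PRIMARY KEY,
                    submission_accession TEXT,
                    study_title TEXT);
CREATE TABLE sample (sample_accession TEXT PRIMARY KEY);
CREATE TABLE experiment (experiment_accession TEXT PRIMARY KEY,
                         study_accession TEXT,
                         sample_accession TEXT);
CREATE TABLE run (run_accession TEXT PRIMARY KEY,
                  experiment_accession TEXT);
"""


def build_demo_db():
    """Create an in-memory database filled with made-up accessions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO submission VALUES ('SRA000001')")
    conn.execute("INSERT INTO study VALUES ('SRP000001', 'SRA000001', 'demo study')")
    conn.executemany("INSERT INTO sample VALUES (?)",
                     [("SRS000001",), ("SRS000002",)])
    conn.executemany("INSERT INTO experiment VALUES (?, ?, ?)",
                     [("SRX000001", "SRP000001", "SRS000001"),
                      ("SRX000002", "SRP000001", "SRS000002")])
    conn.executemany("INSERT INTO run VALUES (?, ?)",
                     [("SRR000001", "SRX000001"),
                      ("SRR000002", "SRX000001"),
                      ("SRR000003", "SRX000002")])
    conn.commit()
    return conn


def runs_for_study(conn, study_accession):
    """Return the run accessions of a study by joining run -> experiment."""
    query = """
        SELECT r.run_accession
        FROM run AS r
        JOIN experiment AS e
          ON e.experiment_accession = r.experiment_accession
        WHERE e.study_accession = ?
        ORDER BY r.run_accession
    """
    return [row[0] for row in conn.execute(query, (study_accession,))]


if __name__ == "__main__":
    demo = build_demo_db()
    print(runs_for_study(demo, "SRP000001"))
```

A join of this kind is what a study-to-run lookup amounts to once the metadata sit in relational tables; whether SRAdb or pysradb issue exactly this query is not stated in the text.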
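The pysradb package builds on the principles of SRAdb and provides a simple, user-friendly command-line interface for querying metadata and downloading datasets from SRA. Users no longer need to be familiar with a programming language to query and download datasets from SRA. It also provides utility functions for running more fine-grained queries, which are often needed when working with multiple large datasets. By enabling both metadata search and download operations from the command line, pysradb aims to bridge the gap in seamlessly searching public sequencing datasets and their associated metadata.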
pysradb is written in Python (Python Software Foundation, https://www.python.org/) [ref.10] and is currently developed on GitHub under the open-source BSD 3-Clause license. To simplify installation for end users, it can also be downloaded from PyPI (https://pypi.org/project/pysradb) and bioconda [ref.11] (https://bioconda.github).

A preprint describing pysradb: https://t.co/9WjQPDxPsb
— Saket Choudhary (@saketkc) March 17, 2019

Installation

Tested in a Python 3.6.1 environment.

Dependencies

pandas>=0.23.4
tqdm>=4.28
aspera-client
SRAmetadb.sqlite

Installing aspera-client is strongly recommended.

Direct download links: (from GitHub; on macOS, install by following the instructions in the downloaded dmg file)

pysradb itself: GitHub

pip install pysradb

# If you use Anaconda, install with conda (mamba is used here)
mamba install -c bioconda pysradb -y

# Or create a dedicated environment named pysradb and install into it
conda create -c bioconda -n pysradb python=3 pysradb

> pysradb -h

$ pysradb -h
Usage: pysradb [OPTIONS] COMMAND [ARGS]...

  pysradb: Query NGS metadata and data from NCBI Sequence Read Archive.

  Citation: Pending.

Options:
  --version   Show the version and exit.
  -h, --help  Show this message and exit.

Commands:
  download    Download SRA project (SRPnnnn)
  gse-to-gsm  Get GSM for a GSE
  gse-to-srp  Get SRP for a GSE
  gsm-to-gse  Get GSE for a GSM
  gsm-to-srp  Get SRP for a GSM
  gsm-to-srr  Get SRR for a GSM
  gsm-to-srx  Get SRX for a GSM
  metadata    Fetch metadata for SRA project (SRPnnnn)
  metadb      Download SRAmetadb.sqlite
  search      Search SRA for matching text
  srp-to-gse  Get GSE for a SRP
  srp-to-srr  Get SRR for a SRP
  srp-to-srs  Get SRS for a SRP
  srp-to-srx  Get SRX for a SRP
  srr-to-srp  Get SRP for a SRR
  srr-to-srs  Get SRS for a SRR
  srr-to-srx  Get SRX for a SRR
  srs-to-srx  Get SRX for a SRS
  srx-to-srp  Get SRP for a SRX
  srx-to-srr  Get SRR for a SRX
  srx-to-srs  Get SRS for a SRX

> pysradb download -h

$ pysradb download -h
usage: pysradb download [-h] [--out-dir OUT_DIR] [--db DB]
                        [--srx SRX [SRX ...]] [--srp SRP [SRP ...]]
                        [--skip-confirmation] [--use-wget]

optional arguments:
  -h, --help            show this help message and exit
  --out-dir OUT_DIR     Output directory root
  --db DB               Path to SRAmetadb.sqlite file
  --srx SRX [SRX ...], -x SRX [SRX ...]
                        Download only these SRX(s)
  --srp SRP [SRP ...], -p SRP [SRP ...]
                        SRP ID
  --skip-confirmation, -y
                        Skip confirmation
  --use-wget, -w        Use wget instead of aspera

> pysradb search -h

$ pysradb search -h
usage: pysradb search [-h] [--saveto SAVETO] [--db DB] [--assay] [--desc]
                      [--detailed] [--expand]
                      search_text

positional arguments:
  search_text

optional arguments:
  -h, --help       show this help message and exit
  --saveto SAVETO  Save metadata dataframe to file
  --db DB          Path to SRAmetadb.sqlite file
  --assay          Include assay type in output
  --desc           Should sample_attribute be included
  --detailed       Display detailed metadata table
  --expand         Should sample_attribute be expanded

> pysradb srp-to-gse -h

$ pysradb srp-to-gse -h
usage: pysradb srp-to-gse [-h] [--db DB] [--saveto SAVETO] [--detailed]
                          [--desc] [--expand]
                          srp_id

positional arguments:
  srp_id

optional arguments:
  -h, --help       show this help message and exit
  --db DB          Path to SRAmetadb.sqlite file
  --saveto SAVETO  Save output to file
  --detailed       Output additional columns: [sample_accession,
                   run_accession]
  --desc           Should sample_attribute be included
  --expand         Should sample_attribute be exp

Help for the other conversion commands is omitted.

Preparing the SQLite database

pysradb metadb

This creates the database file SRAmetadb.sqlite, which is about 39 GB. Run pysradb from the directory that contains this file, or pass its path with --db.
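Because the database is large and has to be found at run time, it can help to check its location before starting a long job. The sketch below shows one way to do that in Python. The function name resolve_db_path is hypothetical, and the fallback rule (use the explicit path if given, else look for SRAmetadb.sqlite in the working directory) is an assumption based on the description above, not pysradb's actual lookup code.

```python
from pathlib import Path

# File name taken from the text above.
DEFAULT_DB_NAME = "SRAmetadb.sqlite"


def resolve_db_path(db=None, cwd=None):
    """Return the path of the metadata database or raise FileNotFoundError.

    db  : explicit path, analogous to the --db option (optional)
    cwd : directory to search when db is not given (defaults to the
          current working directory)
    """
    base = Path(cwd) if cwd is not None else Path.cwd()
    candidate = Path(db) if db is not None else base / DEFAULT_DB_NAME
    if not candidate.is_file():
        raise FileNotFoundError(
            f"{candidate} not found; run 'pysradb metadb' or pass the path with --db"
        )
    return candidate


if __name__ == "__main__":
    try:
        print(resolve_db_path())
    except FileNotFoundError as err:
        print(err)
```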

Examples

1. Search for SRA projects containing "ribosome profiling".

pysradb search "ribosome profiling" | head

2. Look up the metadata for SRA accession number SRP000941. Specify the downloaded SRAmetadb.sqlite with --db.

pysradb metadata --db ./SRAmetadb.sqlite SRP000941 --assay --desc --expand | head

--assay Include assay type in output
--desc Should sample_attribute be included
--expand Should sample_attribute be expanded
--detailed Display detailed metadata table

Adding "--detailed" appends columns with detailed information.

Result

[screenshot of the metadata table]

Enlarged

[screenshot of the metadata table, enlarged]

3. Convert an SRP to a GSE (see the documentation page).

pysradb srp-to-gse --db ./SRAmetadb.sqlite SRP075720

$ pysradb srp-to-gse --db ./SRAmetadb.sqlite SRP075720
study_accession study_alias
SRP075720       GSE81903

See GitHub for other conversion examples. Conversions such as GSM => SRP, GSM => GSE, GSM => SRX, and GSM => SRR are available.
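The conversion subcommands follow a regular "source-to-target" naming pattern, so the right command line can be derived from the accession itself. The following is a minimal sketch of that idea. The helper names id_type and conversion_command are hypothetical, the set of subcommands is copied from the pysradb -h listing earlier in this post, and the assumption that every conversion subcommand accepts --db is based only on the srp-to-gse help shown above.

```python
import re

# Conversion subcommands copied from the `pysradb -h` listing in this post.
KNOWN_CONVERSIONS = {
    "gse-to-gsm", "gse-to-srp",
    "gsm-to-gse", "gsm-to-srp", "gsm-to-srr", "gsm-to-srx",
    "srp-to-gse", "srp-to-srr", "srp-to-srs", "srp-to-srx",
    "srr-to-srp", "srr-to-srs", "srr-to-srx",
    "srs-to-srx",
    "srx-to-srp", "srx-to-srr", "srx-to-srs",
}


def id_type(accession):
    """Return the lower-case ID type (e.g. 'srp') from an accession prefix."""
    match = re.fullmatch(r"(SRP|SRX|SRR|SRS|GSE|GSM)\d+", accession)
    if match is None:
        raise ValueError(f"unrecognized accession: {accession!r}")
    return match.group(1).lower()


def conversion_command(accession, target, db_path=None):
    """Build the argument list for converting `accession` to `target` IDs."""
    subcommand = f"{id_type(accession)}-to-{target.lower()}"
    if subcommand not in KNOWN_CONVERSIONS:
        raise ValueError(f"no such conversion in the help listing: {subcommand}")
    argv = ["pysradb", subcommand]
    if db_path is not None:
        # Assumption: --db is accepted by all conversion subcommands.
        argv += ["--db", db_path]
    argv.append(accession)
    return argv


if __name__ == "__main__":
    print(" ".join(conversion_command("SRP075720", "gse", "./SRAmetadb.sqlite")))
```

On a machine where pysradb is installed, the returned list could be handed to subprocess.run(argv, check=True). The sketch only recognizes the six prefixes that appear in the command names.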

4. Download the entire project SRP063852.

pysradb download --db ./SRAmetadb.sqlite --out-dir ./pysradb_downloads -p SRP063852

The specified directory pysradb_downloads/ is created and the data are downloaded into it.

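5. From project SRP000941, download only the RNA-Seq data. The grep pattern keeps the header line (which contains "study") together with every line containing "RNA-Seq", and the filtered table is piped to pysradb download.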

pysradb metadata SRP000941 --assay | grep 'study\|RNA-Seq' | pysradb download
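What the grep step in this pipeline does can be reproduced in a few lines of Python, which makes its behavior easier to inspect. The sketch below is a plain substring filter equivalent to grep 'study\|RNA-Seq'. The function name and the sample table (column names, accessions, and assay values) are made up for illustration and are not real pysradb output.

```python
def keep_header_and_assay(lines, header_token="study", assay="RNA-Seq"):
    """Keep lines that contain either substring.

    The header line survives because its column names contain "study";
    data lines survive only if they contain the assay string.
    """
    return [line for line in lines if header_token in line or assay in line]


# Hypothetical stand-in for the output of `pysradb metadata ... --assay`.
SAMPLE_OUTPUT = """\
study_accession experiment_accession run_accession assay
SRP000001 SRX000001 SRR000001 RNA-Seq
SRP000001 SRX000002 SRR000002 ChIP-Seq
SRP000001 SRX000003 SRR000003 RNA-Seq
"""

if __name__ == "__main__":
    for line in keep_header_and_assay(SAMPLE_OUTPUT.splitlines()):
        print(line)
```

Because this is a substring match, a line whose assay is, for example, miRNA-Seq would also be kept. If that matters, compare the assay field exactly instead of searching the whole line.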

References

pysradb: A Python package to query next-generation sequencing metadata and data from NCBI Sequence Read Archive
Saket Choudhary

bioRxiv preprint first posted online Mar. 16, 2019

original SRAdb package

SRAdb: query and use public next-generation sequencing data from within R

Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, Sean R. Davis

BMC bioinformatics 14, no. 1 (2013): 19.

macでインフォマティクス

A collection of informatics notes related to HTS (NGS).
