ffq: fetching metadata from sequence read archives in JSON format

2022/05/20: added paper citation

The tool does exactly what the title says. Here is a brief introduction.

`ffq` (Fetch FastQ) is a new command line tool that makes it easier to find #sequencing data from the SRA / GEO / ENA. Importantly `ffq` does not download files, just file metadata / download links in #json format https://t.co/iWwOgMPyB0

developed by @lioscro and myself. pic.twitter.com/3Rm6BtEAxb
— Sina Booeshaghi (@sinabooeshaghi) May 24, 2021

Installation

I installed it with mamba on Ubuntu 18.

Source: GitHub

#conda (bioconda)
mamba install -c bioconda ffq -y

#pip (pypi)
pip install ffq

> ffq -h

usage: ffq [-h] [-o OUT] [-t TYPE] [--split] [--verbose] IDs [IDs ...]

ffq 0.0.4: Fetch run information from the European Nucleotide Archive (ENA).

positional arguments:

IDs Can be a SRA / ENA Run Accessions or Study Accessions, GEO Study Accessions, DOIs or paper titles.

optional arguments:

-h, --help Show this help message and exit

-o OUT Path to JSON file to write run information. If `--split` is used, path to directory in which to place JSON files. (default: standard out)

-t TYPE The type of term used to query data. Can be one of SRR, ERR, DRR, SRP, ERP, DRP, GSE, DOI (default: SRR)

--split Split runs into their own files.

--verbose Print debugging information

Usage

Specify a read archive accession that starts with SRR.

Here I use a read archive entry for a stool sample from the Human Microbiome Project 2 (HMP2).

ffq SRR6664502

Combined with a tool such as jq, you can extract only the fields you need (reference).
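
As an alternative to jq, the same kind of filtering can be sketched in Python with only the standard library. The helper below collects every value stored under a given key, however deeply it is nested in the JSON. The key name `url`, the accession `SRR0000001`, and the sample record are assumptions made up for illustration; the real layout of ffq's output depends on the ffq version, so inspect the output before relying on a specific key.

```python
import json


def collect_values(node, key):
    """Recursively collect every value stored under `key` in nested JSON data."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            else:
                found.extend(collect_values(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_values(item, key))
    return found


# Hypothetical record, NOT real ffq output: the accession and URLs are placeholders.
SAMPLE = """
{
  "SRR0000001": {
    "accession": "SRR0000001",
    "files": [
      {"url": "ftp://example.org/SRR0000001_1.fastq.gz"},
      {"url": "ftp://example.org/SRR0000001_2.fastq.gz"}
    ]
  }
}
"""

for url in collect_values(json.loads(SAMPLE), "url"):
    print(url)
```

To apply this to real output, save it with `ffq -o out.json SRR6664502` and load `out.json` with `json.load` in place of the sample string.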

The accepted query types are 'SRR', 'ERR', 'DRR', 'SRP', 'ERP', 'DRP', 'GSE', 'DOI'.

Multiple accessions

ffq [SRR1] [SRR2] ...
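
When driving ffq from a script, one call covering several accessions can be assembled as follows. This is a minimal sketch, assuming the `-o` and `--split` options behave as described in the ffq 0.0.4 help text; newer versions may differ. The function names `build_ffq_command` and `run_ffq`, the output directory `runs`, and the accession `SRR0000000` are hypothetical placeholders.

```python
import subprocess


def build_ffq_command(accessions, out=None, split=False):
    """Build the argument list for one ffq call covering several accessions."""
    if not accessions:
        raise ValueError("at least one accession is required")
    command = ["ffq"]
    if out is not None:
        command += ["-o", out]  # JSON file, or a directory when --split is used
    if split:
        command.append("--split")  # one JSON file per run
    command += list(accessions)
    return command


def run_ffq(accessions, out=None, split=False):
    """Run ffq (must be installed and on PATH) and return its standard output."""
    result = subprocess.run(
        build_ffq_command(accessions, out, split),
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout


# SRR6664502 is the run used in this post; SRR0000000 is a placeholder.
print(build_ffq_command(["SRR6664502", "SRR0000000"], out="runs", split=True))
```

Keeping command construction separate from execution makes it possible to check the arguments without network access; `run_ffq` is only needed when the metadata is actually fetched.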

References

GitHub - pachterlab/ffq: A command line tool that makes it easier to find sequencing data from the SRA / GEO / ENA.

This is the tweet through which I first learned about this tool.

If you want to fetch in bulk, you can use ffq (https://t.co/llueKghQvF) or you can get the E-MTAB-xxxx.sdrf.txt file from the page, and the urls for all files per sample are just there. You can simply download them using wget. The header & the index reads are all retained. 4/5
— Xi Chen (@XiChenUoM) April 15, 2022