2023-06-01

gget: running a wide range of genome database queries with a single line of code

2023 Bioinformatics JSON download API scRNAseq human genome mouse PDB

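Interpreting genomic data repeatedly runs into the same problem: results have to be evaluated against existing reference databases. As the number of command-line and Python users grows, there is a need for tools that give automated, easy programmatic access to the curated reference information stored across a diverse collection of large public genomic databases. gget is a free and open-source command-line tool and Python package that enables efficient querying of genomic reference databases such as Ensembl. It consists of separate but interoperable modules. The manual and source code are available at https://github.com/pachterlab/gget.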

Analysis of #scRNAseq requires constant, tedious, interaction with genomics databases. To facilitate querying from @ensembl et al., @NeuroLuebbert developed gget: https://t.co/xoaqK9scZo (code @ https://t.co/j5FLp7DyFx).
gget has many uses; a 🧵on the its amazing versatility: 1/ pic.twitter.com/yp7UJoVo22
— Lior Pachter (@lpachter) May 19, 2022

(A tweet from 2022)

Tutorial

https://github.com/pachterlab/gget_examples

gget is designed so that a database can be queried with a single line of code. Each command is organized as an interoperable module. The image is reproduced from the repository.

Installation

GitHub

#conda
mamba install -c bioconda gget

#pip
pip install --upgrade gget

#Jupyter or Python environment
> import gget

> gget

$ gget

usage: gget [-h] [-v] {ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb,gpt,cellxgene} ...

gget v0.27.7

positional arguments:

{ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb,gpt,cellxgene}

ref Fetch FTPs for reference genomes and annotations by species.

search Fetch gene and transcript IDs from Ensembl using free-form search terms.

info Fetch gene and transcript metadata using Ensembl IDs.

seq Fetch nucleotide or amino acid sequence (FASTA) of a gene (and all isoforms) or transcript by Ensembl, WormBase or FlyBase ID.

muscle Align multiple nucleotide or amino acid sequences against each other (using the Muscle v5 algorithm).

blast BLAST a nucleotide or amino acid sequence against any BLAST database.

blat BLAT a nucleotide or amino acid sequence against any BLAT UCSC assembly.

enrichr Perform an enrichment analysis on a list of genes using Enrichr.

archs4 Find the most correlated genes or the tissue expression atlas of a gene using data from the human and mouse RNA-seq database ARCHS4 (https://maayanlab.cloud/archs4/).

setup Install third-party dependencies for a specified gget module.

alphafold Predicts the structure of a protein using a simplified version of AlphaFold v2.3.0 (https://doi.org/10.1038/s41586-021-03819-2).

pdb Query RCSB PDB for the protein structutre/metadata of a given PDB ID.

gpt Generates natural language text based on a given prompt using the OpenAI API's 'openai.ChatCompletion.create' endpoint.

cellxgene Query data from CZ CELLxGENE Discover (https://cellxgene.cziscience.com/).

optional arguments:

-h, --help Print manual.

-v, --version Print version.

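Usage

The repository includes several examples. Let's go through them.

search: fetch gene and transcript IDs from Ensembl using search terms

Retrieve the Ensembl IDs of human genes whose descriptions contain 'ace2' or 'angiotensin converting enzyme 2'.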

gget search -s homo_sapiens 'ace2' 'angiotensin converting enzyme 2'

-s SPECIES, --species SPECIES Species to be queried, e.g. homo_sapiens.
-t {gene, transcript}, --id_type {gene, transcript} 'gene': Returns genes that match the searchwords. (default). 'transcript': Returns transcripts that match the searchwords.
-ao {and, or}, --andor {and, or} 'or': Gene descriptions must include at least one of the searchwords (default). 'and': Only return genes whose descriptions include all searchwords.
-csv, --csv Returns results in csv format instead of json.
-o OUT, --out OUT Path to the file the results will be saved in, e.g. path/to/directory/results.json. Default: Standard out.

Results are written to standard output in JSON format. Add -o <out> to save them to a file, and add "--csv" to write CSV instead of JSON.
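When gget is driven from a Python script rather than typed into a shell, the search command can be assembled as an argument list. This is a minimal sketch and not part of gget: the helper name `build_search_cmd` is hypothetical, and it relies only on the flags listed above (`-s`, `-ao`, `-csv`, `-o`). Actually running the command needs gget installed and internet access, so that call is left commented out.

```python
import subprocess  # only needed if the command is actually executed


def build_search_cmd(searchwords, species, andor="or", csv=False, out=None):
    """Assemble a `gget search` argument list (hypothetical helper)."""
    if andor not in ("and", "or"):
        raise ValueError("andor must be 'and' or 'or'")
    cmd = ["gget", "search", "-s", species, "-ao", andor]
    if csv:
        cmd.append("-csv")
    if out is not None:
        cmd += ["-o", out]
    # Search words go last; passing a list avoids shell quoting problems
    # for multi-word terms such as 'angiotensin converting enzyme 2'.
    cmd += list(searchwords)
    return cmd


cmd = build_search_cmd(
    ["ace2", "angiotensin converting enzyme 2"],
    species="homo_sapiens",
    csv=True,
    out="results.csv",
)
print(cmd)
# To execute (requires gget and network access):
# subprocess.run(cmd, check=True)
```

Building a list instead of a single string keeps each search term as one argument, which matters for terms that contain spaces.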

Example output in CSV format

ref: fetch FTP links for reference genomes and annotations by species

Fetch the reference and annotation FTP links for Homo sapiens from the latest Ensembl release.

gget ref homo_sapiens

-w WHICH, --which WHICH Defines which results to return. Default: 'all' -> Returns all available results. Possible entries are one or a combination (as a comma-separated list) of the following: 'gtf' - Returns the annotation (GTF). 'cdna' - Returns the trancriptome (cDNA). 'dna' - Returns the genome (DNA). 'cds - Returns the coding sequences corresponding to Ensembl genes. (Does not contain UTR or intronic sequence.) 'cdrna' - Returns transcript sequences corresponding to non-coding RNA genes (ncRNA). 'pep' - Returns the protein translations of Ensembl genes. Example: '-w dna,gtf'
-r RELEASE, --release RELEASE Ensembl release the FTPs will be fetched from, e.g. 104 (default: latest Ensembl release).
-ftp, --ftp Return only the FTP link(s).

Thu Jun 1 08:35:01 2023 INFO Fetching reference information for homo_sapiens from Ensembl release: 109.

{
    "homo_sapiens": {
        "transcriptome_cdna": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "11:30",
            "bytes": "75M"
        },
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "00:02",
            "bytes": "840M"
        },
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-15",
            "release_time": "11:20",
            "bytes": "52M"
        },
        "coding_seq_cds": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "11:31",
            "bytes": "21M"
        },
        "non-coding_seq_ncRNA": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "12:40",
            "bytes": "18M"
        },
        "protein_translation_pep": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "11:30",
            "bytes": "14M"
        }
    }
}
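Because the result is plain JSON keyed by species and then by file type, the FTP links are easy to pull out in a script. The sketch below works on an abbreviated copy of the output shown above; the function name `ftp_links` is hypothetical, and the key names are taken from that output rather than from any documented schema.

```python
import json

# Abbreviated copy of the `gget ref homo_sapiens` output shown above
ref_json = """
{
  "homo_sapiens": {
    "genome_dna": {
      "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
      "ensembl_release": 109,
      "bytes": "840M"
    },
    "annotation_gtf": {
      "ftp": "http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz",
      "ensembl_release": 109,
      "bytes": "52M"
    }
  }
}
"""


def ftp_links(ref, species, wanted):
    """Return {entry_name: ftp_url} for the requested entries that exist."""
    entries = ref[species]
    return {name: entries[name]["ftp"] for name in wanted if name in entries}


ref = json.loads(ref_json)
links = ftp_links(ref, "homo_sapiens", ["genome_dna", "annotation_gtf"])
for name, url in links.items():
    print(name, url)
```

The resulting URLs can then be handed to whatever download tool the pipeline already uses.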

info: fetch gene and transcript metadata using Ensembl IDs

Look up the gene ENSG00000130234 (ACE2) and its transcript ENST00000252519.

gget info ENSG00000130234 ENST00000252519

-o OUT, --out OUT Path to file the results will be saved as, e.g. path/to/directory/results.json. Default: Standard out.
-csv, --csv Returns results in csv format instead of json.
-pdb, --pdb Also returns PDB IDs (might increase run time).

Example output in CSV format

(continued)

seq: fetch the nucleotide or amino acid sequence (FASTA) of a gene (and all its isoforms) or a transcript by Ensembl, WormBase, or FlyBase ID

Specify a gene ID. Multiple IDs can be given.

gget seq --translate ENSG00000130234 -o out.fa

-t, --translate Returns amino acid sequences from UniProt. (Otherwise returns nucleotide sequences from Ensembl.)
-iso, --isoforms Returns sequences of all known transcripts (default: False). (Only for gene IDs.)
-o OUT, --out OUT Path to the FASTA file the results will be saved in, e.g. path/to/directory/results.fa. Default: Standard out.
--transcribe DEPRECATED - use True/False flag 'translate' instead.
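The FASTA file written with `-o` can be read back for downstream steps. The following is a minimal parser sketch using only the standard library; `read_fasta` is a hypothetical helper, and the two records in `example` are made up for illustration (a real run would read the file produced above, e.g. `open("out.fa").read()`).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; keep the header without the '>'
            header = line[1:]
            records[header] = []
        elif header is not None:
            # Sequence lines may be wrapped, so collect and join them later
            records[header].append(line)
    return {h: "".join(parts) for h, parts in records.items()}


# Made-up two-record example
example = """>seq1 example header
MSSSSWLLLS
LVAVTAAQST
>seq2
ACGT
"""
records = read_fasta(example)
for header, seq in records.items():
    print(header, len(seq))
```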

muscle: build a multiple sequence alignment of nucleotide or amino acid sequences using the Muscle v5 algorithm

gget muscle seq.fa

-s5, --super5 If True, align input using Super5 algorithm instead of PPP algorithm to decrease time and memory. Use for large inputs (a few hundred sequences).
-o OUT, --out OUT Path to save an 'aligned FASTA' (.afa) file with the results, e.g. path/to/directory/results.afa. Default: 'None' -> Standard out in Clustal format.

ENST00000252519 MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNYNTNITEENVQNMNNAGDKWSAFLKEQSTLA

ENST00000252519 QMYPLQEIQNLTVKLQLQALQQNGSSVLSEDKSKRLNTILNTMSTIYSTGKVCNPDNPQECLLLEPGLNEIMANSLDYNE

ENST00000252519 RLWAWESWRSEVGKQLRPLYEEYVVLKNEMARANHYEDYGDYWRGDYEVNGVDGYDYSRGQLIEDVEHTFEEIKPLYEHL

ENST00000252519 HAYVRAKLMNAYPSYISPIGCLPAHLLGDMWGRFWTNLYSLTVPFGQKPNIDVTDAMVDQAWDAQRIFKEAEKFFVSVGL

ENST00000252519 PNMTQGFWENSMLTDPGNVQKAVCHPTAWDLGKGDFRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGF

ENST00000252519 HEAVGEIMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKGEIPKDQWMKKWWEM

ENST00000252519 KREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQEALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRL

ENST00000252519 GKSEPWTLALENVVGAKNMNVRPLLNYFEPLFTWLKDQNKNSFVGWSTDWSPYADQSIKVRISLKSALGDKAYEWNDNEM

ENST00000252519 YLFRSSVAYAMRQYFLKVKNQMILFGEEDVRVANLKPRISFNFFVTAPKNVSDIIPRTEVEKAIRMSRSRINDAFRLNDN

ENST00000252519 SLEFLGIQPTLGPPNQPPVSIWLIVFGVVMGVIVVGIVILIFTGIRDRKKKNKARSGENPYASIDISKGENNPGFQNTDD

ENST00000252519 VQTSF

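blat: BLAT a nucleotide or amino acid sequence against a UCSC assembly

Specify a nucleotide or amino acid sequence. By default the human assembly is used as the database. Whether the input is DNA or protein is detected automatically, but the input type can be forced with "-st".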

gget blat MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS -a human

-st {DNA,protein,translated%20RNA,translated%20DNA}, 'DNA', 'protein', 'translated RNA', or 'translated%20DNA'. Default: 'DNA' for nucleotide sequences; 'protein' for amino acid sequences.
-a ASSEMBLY, --assembly ASSEMBLY 'human' (assembly hg38) (default), 'mouse' (assembly mm39), or any of the species assemblies available at https://genome.ucsc.edu/cgi-bin/hgBlat (use short assembly name as listed after the '/').
-csv, --csv Returns results in csv format instead of json.
-o OUT, --out OUT Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv.Default: St
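To give a feel for what automatic sequence-type detection involves, here is a rough sketch based purely on the alphabet of the input. This is an illustration under simple assumptions, not gget's actual detection logic, and `guess_seqtype` is a hypothetical name.

```python
NUCLEOTIDES = set("ACGTN")
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def guess_seqtype(seq):
    """Guess 'DNA' or 'protein' from the characters in a sequence."""
    letters = set(seq.upper())
    if not letters:
        raise ValueError("empty sequence")
    # Check the smaller alphabet first: every nucleotide letter is also
    # a valid amino acid letter, so the order of these tests matters.
    if letters <= NUCLEOTIDES:
        return "DNA"
    if letters <= AMINO_ACIDS:
        return "protein"
    unknown = "".join(sorted(letters - AMINO_ACIDS))
    raise ValueError("unrecognized characters: " + unknown)


query = "MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS"
print(guess_seqtype(query))
print(guess_seqtype("ATGTCAAGCTCTTCC"))
```

A heuristic like this cannot tell a short peptide made only of A, C, G, T, and N apart from DNA, which is one reason an explicit override such as "-st" is useful.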

Thu Jun 1 08:54:51 2023 INFO Sequence recognized as amino acid sequence. 'seqtype' will be set as protein.

[

{

"genome": "hg38",

"query_size": 47,

"aligned_start": 1,

"aligned_end": 47,

"matches": 47,

"mismatches": 0,

"%_aligned": 100.0,

"%_matched": 100.0,

"chromosome": "chrX",

"strand": "+-",

"start": 15600771,

"end": 15600911

}

]
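A quick sanity check on a protein-versus-genome hit is to compare the genomic span with the query length: each aligned residue corresponds to three bases. The sketch below runs that check on the hit shown above. It assumes the reported start and end are inclusive and that the hit is ungapped, neither of which is stated in the output itself; `genomic_span` is a hypothetical helper.

```python
import json

# The single hit returned above for the 47-residue query
hits = json.loads("""
[{"genome": "hg38", "query_size": 47, "aligned_start": 1, "aligned_end": 47,
  "matches": 47, "mismatches": 0, "%_aligned": 100.0, "%_matched": 100.0,
  "chromosome": "chrX", "strand": "+-", "start": 15600771, "end": 15600911}]
""")


def genomic_span(hit):
    """Length of the hit on the genome, assuming inclusive coordinates."""
    return hit["end"] - hit["start"] + 1


for hit in hits:
    span = genomic_span(hit)
    aligned_residues = hit["aligned_end"] - hit["aligned_start"] + 1
    # Expected number of bases if every residue maps to one codon
    expected = aligned_residues * 3
    print(hit["chromosome"], span, expected, span == expected)
```

Under those assumptions the span is 141 bases, which matches 47 residues times 3.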

blast: BLAST a nucleotide or amino acid sequence against any BLAST database

gget blast MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS

-db {nt,nr,refseq_rna,refseq_protein,swissprot,pdbaa,pdbnt} 'nt', 'nr', 'refseq_rna', 'refseq_protein', 'swissprot', 'pdbaa', or 'pdbnt'. Default: 'nt' for nucleotide sequences; 'nr' for amino acid sequences. More info on BLAST databases: https://ncbi.github.io/blast-cloud/blastdb/available-blastdbs.html
-e EXPECT, --expect EXPECT float or None. An expect value cutoff. Default 10.0.
-csv, --csv Returns results in csv format instead of json.
-o OUT, --out OUT Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.

enrichr: perform an enrichment analysis on a list of genes using Enrichr

Specify the database and a list of gene names.

gget enrichr -db ontology ACE2 AGT AGTR1 ACE AGTRAP AGTR2 ACE3P

-db DATABASE, --database DATABASE 'pathway', 'transcription', 'ontology', 'diseases_drugs', 'celltypes', 'kinase_interactions' or any database listed at: https://maayanlab.cloud/Enrichr/#libraries
-e, --ensembl Add this flag if genes are given as Ensembl gene IDs.
-csv, --csv Returns results in csv format instead of json.
-o OUT, --out OUT Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.

archs4: find the most correlated genes, or the tissue expression atlas of a gene, using data from the human and mouse RNA-seq database ARCHS4

gget archs4 -w tissue ACE2

-w {correlation,tissue}, --which {correlation,tissue} 'correlation' (default) or 'tissue'. - 'correlation' returns a gene correlation table that contains the 100 most correlated genes to the gene of interest. The Pearson correlation is calculated over all samples and tissues in ARCHS4. - 'tissue' returns a tissue expression atlas calculated from human or mouse samples (as defined by 'species') in ARCHS4.
-gc GENE_COUNT, --gene_count GENE_COUNT Number of correlated genes to return (default: 100). (Only for gene correlation.)
-s {human,mouse}, --species {human,mouse} 'human' (default) or 'mouse'. (Only for tissue expression atlas.)
-csv, --csv Returns results in csv format instead of json.
-o OUT, --out OUT Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.

pdb: query RCSB PDB for the protein structure or metadata of a given PDB ID

Specify a PDB ID.

gget pdb 1R42 -o 1R42.pdb

-r {pdb,entry,pubmed,assembly,branched_entity,nonpolymer_entity,polymer_entity,uniprot,branched_entity_instance,polymer_entity_instance,nonpolymer_entity_instance} Defines type of information to be returned. "pdb": Returns the protein structure in PDB format. "entry": Information about PDB structures at the top level of PDB structure hierarchical data organization. "pubmed": Get PubMed annotations (data integrated from PubMed) for a given entry's primary citation. "assembly": Information about PDB structures at the quaternary structure level. "branched_entity": Get branched entity description (define entity ID as "identifier"). "nonpolymer_entity": Get non-polymer entity data (define entity ID as "identifier"). "polymer_entity": Get polymer entity data (define entity ID as "identifier"). "uniprot": Get UniProt annotations for a given macromolecular entity (define entity ID as "identifier"). "branched_entity_instance": Get branched entity instance description (define chain ID as "identifier"). "polymer_entity_instance": Get polymer entity instance (a.k.a chain) data (define chain ID as "identifier"). "nonpolymer_entity_instance": Get non-polymer entity instance description (define chain ID as "identifier").
-i IDENTIFIER, --identifier IDENTIFIER Can be used to define assembly, entity or chain ID if applicable (default: None). Assembly/entity IDs are numbers (e.g. 1), and chain IDs are letters (e.g. A).
-o OUT, --out OUT Path to the file the results will be saved in, e.g. path/to/directory/7S7U.pdb or path/to/directory/7S7U_entry.json. Resource 'pdb' is returned in PDB format. All other resources are returned in JSON format. Default: Standard out.

PDB IDs can be obtained by adding the -pdb flag to gget info.
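Since several of the options above take a chain ID as the identifier, it can help to list the chains present in a downloaded structure. The sketch below reads chain IDs from ATOM records using the fixed-column layout of the PDB format (record name in columns 1 to 6, chain ID in column 22). Both function names are hypothetical, and the demo lines are synthetic rather than taken from 1R42; a real run would read the saved file, e.g. `open("1R42.pdb").read()`.

```python
def chain_ids(pdb_text):
    """Collect chain IDs, in order of first appearance, from ATOM records."""
    chains = []
    for line in pdb_text.splitlines():
        # Record name occupies columns 1-6; chain ID is column 22 (index 21)
        if line.startswith("ATOM") and len(line) > 21:
            chain = line[21]
            if chain not in chains:
                chains.append(chain)
    return chains


def make_atom_line(serial, atom, res, chain, resnum):
    """Build a synthetic, column-aligned ATOM line for demonstration only."""
    return "ATOM  {:>5d} {:<4s} {:>3s} {}{:>4d}".format(
        serial, atom, res, chain, resnum
    )


# Synthetic records (not real coordinates or real 1R42 content)
demo = "\n".join(
    [
        make_atom_line(1, "N", "SER", "A", 19),
        make_atom_line(2, "CA", "SER", "A", 19),
        make_atom_line(3, "N", "MET", "B", 1),
    ]
)
chains = chain_ids(demo)
print(chains)
```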

cellxgene: fetch an scRNAseq count matrix (AnnData format) for the specified genes, tissues, and cell types

Specify gene names, tissue names, and so on. The default species is human.

gget setup cellxgene # first run only
gget cellxgene --gene ACE2 SLC5A1 --tissue lung --cell_type 'mucus secreting cell' -o example_adata.h5ad

-o OUT, --out OUT Path to save the generated AnnData .h5ad file (or .csv with --meta_only).
-s {homo_sapiens,mus_musculus}, --species {homo_sapiens,mus_musculus} Choice of 'homo_sapiens' or 'mus_musculus'.
-g GENE [GENE ...], --gene GENE [GENE ...] Str or space-separated list of gene name(s) or Ensembl ID(s), e.g. ACE2 SLC5A1 or ENSG00000130234 ENSG00000100170 NOTE: Set ensembl=True when providing Ensembl ID(s) instead of gene name(s).
--tissue TISSUE [TISSUE ...] Str or space-separated list of tissue(s), e.g. lung blood
--cell_type CELL_TYPE [CELL_TYPE ...] Str or space-separated list of cell_type(s), e.g. 'mucus secreting cell' 'neuroendocrine cell'

Thu Jun 1 14:01:47 2023 INFO Fetching metadata from CZ CELLxGENE Discover...

The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.

Thu Jun 1 14:01:48 2023 INFO The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.

example_adata.h5ad is saved.

alphafold: predict a protein structure from an amino acid sequence (GFP in this example)

Specify an amino acid sequence.

gget setup alphafold # first run only
gget alphafold MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK

-mfm, --multimer_for_monomer Use multimer model for a monomer.
-mr MULTIMER_RECYCLES, --multimer_recycles MULTIMER_RECYCLES The multimer model will continue recycling until the predictions stop changing, up to the limit set here. For higher accuracy, at the potential cost of longer inference times, set this to 20.
-r, --relax AMBER relax the best model.
-o OUT, --out OUT Path to folder the predicted aligned error (json) and the prediction (PDB) will be saved in. Default: ./[date_time]_gget_alphafold_prediction