macでインフォマティクス (Informatics on Mac)

A collection of informatics notes related to HTS (NGS).

gget: querying genomic databases in many ways with a single line of code

 

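A recurring challenge in interpreting genomic data is assessing results in the context of existing reference databases. As the number of command-line and Python users grows, there is a need for tools that provide automated, easy programmatic access to the curated reference information stored in a diverse collection of large public genomic databases. gget is a free and open-source command-line tool and Python package that enables efficient querying of genomic reference databases such as Ensembl. gget consists of separate but interoperable modules. The manual and source code are available at https://github.com/pachterlab/gget.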

 

 

Tutorial

https://github.com/pachterlab/gget_examples

 

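gget is designed so that each database query can be made in a single line of code. The individual commands are organized as interoperable modules. The figure is reproduced from the repository.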

 

Installation

GitHub

#conda (mamba)
mamba install -c bioconda gget

#pip
pip install --upgrade gget

#Jupyter or Python environment
> import gget

gget

$ gget

usage: gget [-h] [-v] {ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb,gpt,cellxgene} ...

 

gget v0.27.7

 

positional arguments:

  {ref,search,info,seq,muscle,blast,blat,enrichr,archs4,setup,alphafold,pdb,gpt,cellxgene}

    ref                 Fetch FTPs for reference genomes and annotations by species.

    search              Fetch gene and transcript IDs from Ensembl using free-form search terms.

    info                Fetch gene and transcript metadata using Ensembl IDs.

    seq                 Fetch nucleotide or amino acid sequence (FASTA) of a gene (and all isoforms) or transcript by Ensembl, WormBase or FlyBase ID.

    muscle              Align multiple nucleotide or amino acid sequences against each other (using the Muscle v5 algorithm).

    blast               BLAST a nucleotide or amino acid sequence against any BLAST database.

    blat                BLAT a nucleotide or amino acid sequence against any BLAT UCSC assembly.

    enrichr             Perform an enrichment analysis on a list of genes using Enrichr.

    archs4              Find the most correlated genes or the tissue expression atlas of a gene using data from the human and mouse RNA-seq database

                        ARCHS4 (https://maayanlab.cloud/archs4/).

    setup               Install third-party dependencies for a specified gget module.

    alphafold           Predicts the structure of a protein using a simplified version of AlphaFold v2.3.0 (https://doi.org/10.1038/s41586-021-03819-2).

    pdb                 Query RCSB PDB for the protein structutre/metadata of a given PDB ID.

    gpt                 Generates natural language text based on a given prompt using the OpenAI API's 'openai.ChatCompletion.create' endpoint.

    cellxgene           Query data from CZ CELLxGENE Discover (https://cellxgene.cziscience.com/).

 

optional arguments:

  -h, --help            Print manual.

  -v, --version         Print version.

 

 

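Usage

The repository includes several examples. Let's go through them.

 

search: Fetch gene and transcript IDs from Ensembl using search terms

Get the Ensembl IDs of human genes whose descriptions contain "ace2" or "angiotensin converting enzyme 2".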

gget search -s homo_sapiens 'ace2' 'angiotensin converting enzyme 2' 
  • -s SPECIES, --species SPECIES   Species to be queried, e.g. homo_sapiens.                        
  • -t {gene,transcript}, --id_type {gene,transcript}    'gene': Returns genes that match the searchwords. (default). 'transcript': Returns transcripts that match the searchwords.
  • -ao {and,or}, --andor {and,or}     'or': Gene descriptions must include at least one of the searchwords (default). 'and': Only return genes whose descriptions include all searchwords.
  • -csv, --csv           Returns results in csv format instead of json.
  • -o OUT, --out OUT     Path to the file the results will be saved in, e.g. path/to/directory/results.json. Default: Standard out.

Results are written to standard output in JSON format. To save them to a file, add -o <out>. To write CSV instead of JSON, also add "--csv".
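When the same query is run repeatedly from a script, it can help to assemble the command line programmatically. The following is a minimal sketch that only uses the flags listed above (-s, -csv, -o); the helper name `build_search_command` and the output file name `results.csv` are my own placeholders, not part of gget.

```python
import shlex


def build_search_command(searchwords, species, csv=False, out=None):
    """Assemble the argument list for `gget search` from the options listed above."""
    cmd = ["gget", "search", "-s", species]
    if csv:
        cmd.append("-csv")  # CSV instead of the default JSON
    if out is not None:
        cmd += ["-o", out]  # save to a file instead of standard out
    # Search words go last, one list element each, so multi-word terms stay intact.
    cmd += list(searchwords)
    return cmd


cmd = build_search_command(
    ["ace2", "angiotensin converting enzyme 2"],
    species="homo_sapiens",
    csv=True,
    out="results.csv",  # placeholder file name
)

# Print a shell-ready version of the command for checking before running it.
print(shlex.join(cmd))
```

With gget installed and network access available, the list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids shell quoting problems with multi-word search terms.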

Example output in CSV format

 

ref: Fetch reference genome and annotation FTP links by species

Fetch the reference and annotation FTP links for Homo sapiens from the latest Ensembl release.

gget ref homo_sapiens
  • -w WHICH, --which WHICH   Defines which results to return. Default: 'all' -> Returns all available results. Possible entries are one or a combination (as a comma-separated list) of the following: 'gtf' - Returns the annotation (GTF). 'cdna' - Returns the trancriptome (cDNA). 'dna' - Returns the genome (DNA). 'cds - Returns the coding sequences corresponding to Ensembl genes. (Does not contain UTR or intronic sequence.) 'cdrna' - Returns transcript sequences corresponding to non-coding RNA genes (ncRNA). 'pep' - Returns the protein translations of Ensembl genes. Example: '-w dna,gtf'
  • -r RELEASE, --release RELEASE   Ensembl release the FTPs will be fetched from, e.g. 104 (default: latest Ensembl release).      
  • -ftp, --ftp    Return only the FTP link(s).

Thu Jun  1 08:35:01 2023 INFO Fetching reference information for homo_sapiens from Ensembl release: 109.

{

    "homo_sapiens": {

        "transcriptome_cdna": {

            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh38.cdna.all.fa.gz",

            "ensembl_release": 109,

            "release_date": "2022-12-13",

            "release_time": "11:30",

            "bytes": "75M"

        },

        "genome_dna": {

            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",

            "ensembl_release": 109,

            "release_date": "2022-12-13",

            "release_time": "00:02",

            "bytes": "840M"

        },

        "annotation_gtf": {

            "ftp": "http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz",

            "ensembl_release": 109,

            "release_date": "2022-12-15",

            "release_time": "11:20",

            "bytes": "52M"

        },

        "coding_seq_cds": {

            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/cds/Homo_sapiens.GRCh38.cds.all.fa.gz",

            "ensembl_release": 109,

            "release_date": "2022-12-13",

            "release_time": "11:31",

            "bytes": "21M"

        },

        "non-coding_seq_ncRNA": {

            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh38.ncrna.fa.gz",

            "ensembl_release": 109,

            "release_date": "2022-12-13",

            "release_time": "12:40",

            "bytes": "18M"

        },

        "protein_translation_pep": {

            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/pep/Homo_sapiens.GRCh38.pep.all.fa.gz",

            "ensembl_release": 109,

            "release_date": "2022-12-13",

            "release_time": "11:30",

            "bytes": "14M"

        }

    }

}
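The JSON above is easy to post-process. As a minimal sketch, the snippet below pulls the download link out of each entry using only the Python standard library. It embeds an abbreviated copy of the output shown above (two of the six entries) and assumes the JSON has been captured without the INFO log line; the helper name `ftp_links` is hypothetical.

```python
import json

# Abbreviated copy of the `gget ref homo_sapiens` output shown above.
ref_json = """
{
    "homo_sapiens": {
        "genome_dna": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-13",
            "release_time": "00:02",
            "bytes": "840M"
        },
        "annotation_gtf": {
            "ftp": "http://ftp.ensembl.org/pub/release-109/gtf/homo_sapiens/Homo_sapiens.GRCh38.109.gtf.gz",
            "ensembl_release": 109,
            "release_date": "2022-12-15",
            "release_time": "11:20",
            "bytes": "52M"
        }
    }
}
"""


def ftp_links(ref, species):
    """Map each result type (e.g. 'annotation_gtf') to its FTP URL."""
    return {kind: entry["ftp"] for kind, entry in ref[species].items()}


links = ftp_links(json.loads(ref_json), "homo_sapiens")
for kind, url in links.items():
    print(kind, url, sep="\t")
```

If only the links are needed, the -ftp flag listed above returns them directly, so a script like this is mainly useful when the release number or file size is also wanted.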

 

info: Fetch gene and transcript metadata using Ensembl IDs

Look up the gene ENSG00000130234 (ACE2) and its transcript ENST00000252519.

gget info ENSG00000130234 ENST00000252519
  • -o OUT, --out OUT     Path to file the results will be saved as, e.g. path/to/directory/results.json. Default: Standard out.
  • -csv, --csv        Returns results in csv format instead of json.
  • -pdb, --pdb      Also returns PDB IDs (might increase run time).

Example output in CSV format

(continued)

 

seq: Fetch the nucleotide or amino acid sequence (FASTA) of a gene (and all its isoforms) or a transcript by Ensembl, WormBase, or FlyBase ID

Specify a gene ID. Multiple IDs can be given.

gget seq --translate ENSG00000130234 -o out.fa
  • -t, --translate       Returns amino acid sequences from UniProt. (Otherwise returns nucleotide sequences from Ensembl.)
  • -iso, --isoforms    Returns sequences of all known transcripts (default: False). (Only for gene IDs.)
  • -o OUT, --out OUT    Path to the FASTA file the results will be saved in, e.g. path/to/directory/results.fa. Default: Standard out.
  • --transcribe          DEPRECATED - use True/False flag 'translate' instead.
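To check what ended up in a saved FASTA file such as out.fa, a small reader is enough. The sketch below is a generic FASTA parser, not something provided by gget. The record in `example` is a stand-in: its header is a placeholder (I have not reproduced the real gget header format here), and its sequence is the 47-residue ACE2 fragment that is used as the BLAT and BLAST query later in this post.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)  # sequence may be wrapped over several lines
    if header is not None:
        yield header, "".join(chunks)


# Placeholder record wrapped over two lines; the header is not real gget output.
example = """>example_record
MSSSSWLLLSLVAVTAAQSTIEEQAKTF
LDKFNHEAEDLFYQSSLAS
"""

records = list(read_fasta(example.splitlines()))
for header, seq in records:
    print(header, len(seq))
```

For a real file, pass an open file handle instead, for example `with open("out.fa") as fh: records = list(read_fasta(fh))`.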

 

muscle: Create a multiple sequence alignment of nucleotide or amino acid sequences using the Muscle v5 algorithm

gget muscle seq.fa
  • -s5, --super5         If True, align input using Super5 algorithm instead of PPP algorithm to decrease time and memory. Use for large inputs (a few
                            hundred sequences).
  • -o OUT, --out OUT     Path to save an 'aligned FASTA' (.afa) file with the results, e.g. path/to/directory/results.afa.Default: 'None' -> Standard out in Clustal format.

ENST00000252519 MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNYNTNITEENVQNMNNAGDKWSAFLKEQSTLA

ENST00000252519 MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLASWNYNTNITEENVQNMNNAGDKWSAFLKEQSTLA

 

 

ENST00000252519 QMYPLQEIQNLTVKLQLQALQQNGSSVLSEDKSKRLNTILNTMSTIYSTGKVCNPDNPQECLLLEPGLNEIMANSLDYNE

ENST00000252519 QMYPLQEIQNLTVKLQLQALQQNGSSVLSEDKSKRLNTILNTMSTIYSTGKVCNPDNPQECLLLEPGLNEIMANSLDYNE

 

 

ENST00000252519 RLWAWESWRSEVGKQLRPLYEEYVVLKNEMARANHYEDYGDYWRGDYEVNGVDGYDYSRGQLIEDVEHTFEEIKPLYEHL

ENST00000252519 RLWAWESWRSEVGKQLRPLYEEYVVLKNEMARANHYEDYGDYWRGDYEVNGVDGYDYSRGQLIEDVEHTFEEIKPLYEHL

 

 

ENST00000252519 HAYVRAKLMNAYPSYISPIGCLPAHLLGDMWGRFWTNLYSLTVPFGQKPNIDVTDAMVDQAWDAQRIFKEAEKFFVSVGL

ENST00000252519 HAYVRAKLMNAYPSYISPIGCLPAHLLGDMWGRFWTNLYSLTVPFGQKPNIDVTDAMVDQAWDAQRIFKEAEKFFVSVGL

 

 

ENST00000252519 PNMTQGFWENSMLTDPGNVQKAVCHPTAWDLGKGDFRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGF

ENST00000252519 PNMTQGFWENSMLTDPGNVQKAVCHPTAWDLGKGDFRILMCTKVTMDDFLTAHHEMGHIQYDMAYAAQPFLLRNGANEGF

 

 

ENST00000252519 HEAVGEIMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKGEIPKDQWMKKWWEM

ENST00000252519 HEAVGEIMSLSAATPKHLKSIGLLSPDFQEDNETEINFLLKQALTIVGTLPFTYMLEKWRWMVFKGEIPKDQWMKKWWEM

 

 

ENST00000252519 KREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQEALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRL

ENST00000252519 KREIVGVVEPVPHDETYCDPASLFHVSNDYSFIRYYTRTLYQFQFQEALCQAAKHEGPLHKCDISNSTEAGQKLFNMLRL

 

 

ENST00000252519 GKSEPWTLALENVVGAKNMNVRPLLNYFEPLFTWLKDQNKNSFVGWSTDWSPYADQSIKVRISLKSALGDKAYEWNDNEM

ENST00000252519 GKSEPWTLALENVVGAKNMNVRPLLNYFEPLFTWLKDQNKNSFVGWSTDWSPYADQSIKVRISLKSALGDKAYEWNDNEM

 

 

ENST00000252519 YLFRSSVAYAMRQYFLKVKNQMILFGEEDVRVANLKPRISFNFFVTAPKNVSDIIPRTEVEKAIRMSRSRINDAFRLNDN

ENST00000252519 YLFRSSVAYAMRQYFLKVKNQMILFGEEDVRVANLKPRISFNFFVTAPKNVSDIIPRTEVEKAIRMSRSRINDAFRLNDN

 

 

ENST00000252519 SLEFLGIQPTLGPPNQPPVSIWLIVFGVVMGVIVVGIVILIFTGIRDRKKKNKARSGENPYASIDISKGENNPGFQNTDD

ENST00000252519 SLEFLGIQPTLGPPNQPPVSIWLIVFGVVMGVIVVGIVILIFTGIRDRKKKNKARSGENPYASIDISKGENNPGFQNTDD

 

 

ENST00000252519 VQTSF

ENST00000252519 VQTSF

 

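blat: Run a nucleotide or amino acid BLAT search against a UCSC assembly

Specify a nucleotide or amino acid sequence. The human assembly is used as the database by default. Whether the input is DNA or protein is detected automatically, but the input type can be forced with "-st".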

gget blat MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS -a human
  • -st {DNA,protein,translated%20RNA,translated%20DNA},    'DNA', 'protein', 'translated RNA', or 'translated%20DNA'. Default: 'DNA' for nucleotide sequences; 'protein' for amino acid sequences.
  • -a ASSEMBLY, --assembly ASSEMBLY   'human' (assembly hg38) (default), 'mouse' (assembly mm39), or any of the species assemblies available at https://genome.ucsc.edu/cgi-bin/hgBlat (use short assembly name as listed after the '/').
  • -csv, --csv           Returns results in csv format instead of json.
  • -o OUT, --out OUT     Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv.Default: St

Thu Jun  1 08:54:51 2023 INFO Sequence recognized as amino acid sequence. 'seqtype' will be set as protein.

[

    {

        "genome": "hg38",

        "query_size": 47,

        "aligned_start": 1,

        "aligned_end": 47,

        "matches": 47,

        "mismatches": 0,

        "%_aligned": 100.0,

        "%_matched": 100.0,

        "chromosome": "chrX",

        "strand": "+-",

        "start": 15600771,

        "end": 15600911

    }

]
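A quick sanity check on this hit: a protein query that aligns without gaps should cover three nucleotides per residue on the genome. The sketch below reproduces that arithmetic from the JSON shown above. Treating "start" and "end" as inclusive coordinates is my assumption, not something stated in the output.

```python
import json

# The BLAT result shown above, copied verbatim.
blat_json = """
[
    {
        "genome": "hg38",
        "query_size": 47,
        "aligned_start": 1,
        "aligned_end": 47,
        "matches": 47,
        "mismatches": 0,
        "%_aligned": 100.0,
        "%_matched": 100.0,
        "chromosome": "chrX",
        "strand": "+-",
        "start": 15600771,
        "end": 15600911
    }
]
"""

hit = json.loads(blat_json)[0]

# Number of aligned residues in the protein query.
aligned_residues = hit["aligned_end"] - hit["aligned_start"] + 1

# Genomic span of the hit, assuming inclusive start/end coordinates.
span_nt = hit["end"] - hit["start"] + 1

# Each amino acid corresponds to one codon, i.e. three nucleotides.
expected_nt = 3 * aligned_residues

print(hit["chromosome"], aligned_residues, span_nt, expected_nt)
```

Under that assumption the span is 141 nt, which equals 47 residues × 3, consistent with an ungapped match of the whole query.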

 

blast: Run a BLAST search with a nucleotide or amino acid sequence against any BLAST database

gget blast MSSSSWLLLSLVAVTAAQSTIEEQAKTFLDKFNHEAEDLFYQSSLAS
  • -db {nt,nr,refseq_rna,refseq_protein,swissprot,pdbaa,pdbnt}  'nt', 'nr', 'refseq_rna', 'refseq_protein', 'swissprot', 'pdbaa', or 'pdbnt'. Default: 'nt' for nucleotide sequences; 'nr' for  amino acid sequences. More info on BLAST databases: https://ncbi.github.io/blast-cloud/blastdb/available-blastdbs.html
  • -e EXPECT, --expect EXPECT    float or None. An expect value cutoff. Default 10.0.
  • -csv, --csv      Returns results in csv format instead of json.
  • -o OUT, --out OUT     Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.

 

 

enrichr: Perform an enrichment analysis on a list of genes using Enrichr

Specify the database and a list of gene names.

gget enrichr -db ontology ACE2 AGT AGTR1 ACE AGTRAP AGTR2 ACE3P
  • -db DATABASE, --database DATABASE   'pathway', 'transcription', 'ontology', 'diseases_drugs', 'celltypes', 'kinase_interactions'or any database listed at:
    https://maayanlab.cloud/Enrichr/#libraries
  • -e, --ensembl      Add this flag if genes are given as Ensembl gene IDs.
  • -csv, --csv           Returns results in csv format instead of json.
  • -o OUT, --out OUT     Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv.Default: Standard out.

 

archs4: Find the most correlated genes, or the tissue expression atlas of a gene, using data from the human and mouse RNA-seq database ARCHS4

gget archs4 -w tissue ACE2
  • -w {correlation,tissue}, --which {correlation,tissue}   'correlation' (default) or 'tissue'. - 'correlation' returns a gene correlation table that contains the 100 most correlated genes to the gene of interest. The Pearson correlation is calculated over all samples and tissues in ARCHS4. - 'tissue' returns a tissue expression atlas calculated from human or mouse samples (as defined by 'species') in ARCHS4.
  • -gc GENE_COUNT, --gene_count GENE_COUNT   Number of correlated genes to return (default: 100). (Only for gene correlation.)
  • -s {human,mouse}, --species {human,mouse}   'human' (default) or 'mouse'. (Only for tissue expression atlas.) 
  • -csv, --csv     Returns results in csv format instead of json.
  • -o OUT, --out OUT     Path to the csv file the results will be saved in, e.g. path/to/directory/results.csv. Default: Standard out.

 

 

pdb: Query RCSB PDB for the protein structure or metadata of a given PDB ID

Specify a PDB ID.

gget pdb 1R42 -o 1R42.pdb
  •  -r  {pdb,entry,pubmed,assembly,branched_entity,nonpolymer_entity,polymer_entity,uniprot,branched_entity_instance,polymer_entity_instance,nonpolymer_entity_instance}    Defines type of information to be returned. "pdb": Returns the protein structure in PDB format. "entry": Information about PDB
     structures at the top level of PDB structure hierarchical data organization. "pubmed": Get PubMed annotations (data integrated from PubMed) for a given entry's primary citation. "assembly": Information about PDB structures at the quaternary structure level. "branched_entity": Get branched entity description (define entity ID as "identifier"). "nonpolymer_entity": Get non-polymer entity
     data (define entity ID as "identifier"). "polymer_entity": Get polymer entity data (define entity ID as "identifier"). "uniprot": Get UniProt annotations for a given macromolecular entity (define entity ID as "identifier"). "branched_entity_instance": Get branched entity instance description (define chain ID as "identifier"). "polymer_entity_instance": Get polymer entity instance
     (a.k.a chain) data (define chain ID as "identifier"). "nonpolymer_entity_instance": Get non-polymer entity instance description (define chain ID as "identifier").
  • -i IDENTIFIER, --identifier IDENTIFIER    Can be used to define assembly, entity or chain ID if applicable (default: None). Assembly/entity IDs are numbers (e.g. 1), and  chain IDs are letters (e.g. A).
  • -o OUT, --out OUT     Path to the file the results will be saved in, e.g. path/to/directory/7S7U.pdb or path/to/directory/7S7U_entry.json. Resource 'pdb' is returned in PDB format. All other resources are returned in JSON format. Default: Standard out.

PDB IDs can be obtained by running gget info with the -pdb flag.

 

cellxgene: Fetch an scRNA-seq count matrix (AnnData format) for the specified genes, tissues, and cell types

Specify gene names, tissue names, and so on. The default species is human.

gget setup cellxgene #first time only
gget cellxgene --gene ACE2 SLC5A1 --tissue lung --cell_type 'mucus secreting cell' -o example_adata.h5ad
  •  -o OUT, --out OUT     Path to save the generated AnnData .h5ad file (or .csv with --meta_only).
  • -s {homo_sapiens,mus_musculus}, --species {homo_sapiens,mus_musculus} Choice of 'homo_sapiens' or 'mus_musculus'.
  • -g GENE [GENE ...], --gene GENE [GENE ...]  Str or space-separated list of gene name(s) or Ensembl ID(s), e.g. ACE2 SLC5A1 or ENSG00000130234 ENSG00000100170NOTE: Set ensembl=True when providing Ensembl ID(s) instead of gene name(s).
  •  --tissue TISSUE [TISSUE ...]   Str or space-separated list of tissue(s), e.g. lung blood
  • --cell_type CELL_TYPE [CELL_TYPE ...]   Str or space-separated list of cell_type(s), e.g. 'mucus secreting cell' 'neuroendocrine cell'

Thu Jun  1 14:01:47 2023 INFO Fetching metadata from CZ CELLxGENE Discover...

The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.

Thu Jun  1 14:01:48 2023 INFO The "stable" release is currently 2023-05-15. Specify 'census_version="2023-05-15"' in future calls to open_soma() to ensure data consistency.

The file example_adata.h5ad is saved.

 

alphafold: Predict a protein structure from an amino acid sequence (GFP in this example)

Specify an amino acid sequence.

gget setup alphafold #first time only
gget alphafold MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
  • -mfm, --multimer_for_monomer   Use multimer model for a monomer.
  • -mr MULTIMER_RECYCLES, --multimer_recycles MULTIMER_RECYCLES   The multimer model will continue recycling until the predictions stop changing, up to the limit set here. For higher accuracy, at the potential cost of longer inference times, set this to 20.
  • -r, --relax     AMBER relax the best model.
  • -o OUT, --out OUT     Path to folder the predicted aligned error (json) and the prediction (PDB) will be saved in. Default: ./[date_time]_gget_alphafold_prediction

The alphafold command did not run in my tests. This was not a bug in gget; some of the libraries and tools it depends on could not be installed.

 

Citation

Efficient querying of genomic reference databases with gget
Laura Luebbert, Lior Pachter
Bioinformatics, Volume 39, Issue 1, January 2023