NCBI Datasets: a tool for downloading genomes and gene sequences from NCBI on the command line

Updated 2024/09/17

Updated 2025/02/11

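NCBI Datasets is a resource that makes it easy to gather data from across the NCBI databases. Using the command-line interface (CLI) tools or the NCBI Datasets web interface, you can search for and download gene and genome sequences, annotation, and metadata. The NCBI Datasets tools are still under development; to give feedback, open a GitHub issue or contact NCBI directly.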

The NCBI Datasets command-line tools (CLI) consist of datasets and dataformat. With datasets you can download biological sequence data across all domains of life from NCBI. With dataformat you can convert metadata from JSON Lines format into other formats.

2024/10/27

NCBI Datasets reminder! The v1 API, the command-line interface (CLI) version 13 and older versions, and the Python library v1 will no longer be available, effective December 31, 2024. Please update to API v2 & the latest version of the CLI. More info: https://t.co/otFVyr6nt3
— NCBI (@NCBI) October 26, 2024

NCBI Command-line tools

https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

How to Guides

https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/

Installation

I created a conda environment on Ubuntu 22.04 LTS and installed the latest version as of August 2024. Note that downloads may fail with older versions.

Download

https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/

GitHub

# conda
mamba create -n ncbi_datasets -y
conda activate ncbi_datasets
mamba install -c conda-forge ncbi-datasets-cli -y

# Latest version (Linux)
wget https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
chmod +x datasets
sudo mv datasets /usr/local/bin/

> datasets --version

datasets version: 16.26.0
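Since NCBI announced that CLI version 13 and older are no longer available, it can be worth checking the installed version in a script. The following is only a minimal sketch: the function names `datasets_major_version` and `is_supported_version` are hypothetical helpers, not part of the tool, and the parsing assumes the `datasets version: X.Y.Z` output format shown above.

```shell
# Hypothetical helper: print the major version from the line that
# `datasets --version` prints, e.g. "datasets version: 16.26.0" -> 16.
datasets_major_version() {
    local version
    version="${1##* }"                 # keep the text after the last space
    printf '%s\n' "${version%%.*}"     # keep the text before the first dot
}

# Hypothetical helper: succeed only when the major version is 14 or newer
# (assumption: anything above the retired "version 13 and older" is fine).
is_supported_version() {
    local major
    major="$(datasets_major_version "$1")"
    case "$major" in
        ''|*[!0-9]*) return 1 ;;       # not a plain number
    esac
    [ "$major" -ge 14 ]
}

# Example using the version string shown above. To check a real
# installation, pass "$(datasets --version)" instead of the literal.
if is_supported_version "datasets version: 16.26.0"; then
    echo "supported"
else
    echo "too old: please update"
fi
```

With the literal string above this prints `supported`.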

> datasets

datasets is a command-line tool that is used to query and download biological sequence data

across all domains of life from NCBI databases.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.

Usage

datasets [command]

Data Retrieval Commands

summary Print a data report containing gene, genome, taxonomy or virus metadata

download Download a gene, genome or virus dataset as a zip file

rehydrate Rehydrate a downloaded, dehydrated dataset

Miscellaneous Commands

completion Generate autocompletion scripts

Flags

--api-key string Specify an NCBI API key

--debug Emit debugging info

--help Print detailed help about a datasets command

--version Print version of datasets

Use datasets <command> --help for detailed help about a command.

> datasets summary

Error: Continue with one of the sub-commands

datasets summary [command]

Available Commands

gene Print a summary of a gene dataset

genome Print a data report containing genome metadata

virus Print a data report containing virus genome metadata

taxonomy Print a data report containing taxonomy metadata

Use datasets summary <command> --help for detailed help about a command.

> datasets download -h

Download genome, gene and virus data packages, including sequence, annotation, and metadata, as a zip file.

Refer to NCBI's [download and install](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/) documentation for information about getting started with the command-line tools.

Usage

datasets download [command]

Sample Commands

datasets download genome accession GCF_000001405.40 --chromosomes X,Y --exclude-gff3 --exclude-rna

datasets download genome taxon "bos taurus"

datasets download gene gene-id 672

datasets download gene symbol brca1 --taxon "mus musculus"

datasets download gene accession NP_000483.3

datasets download taxonomy taxon human,sars-cov-2

datasets download virus genome taxon sars-cov-2 --host dog

datasets download virus protein S --host dog --filename SARS2-spike-dog.zip

Available Commands

gene Download a gene data package

genome Download a genome data package

taxonomy Download a taxonomy data package

virus Download a virus data package

Flags

--filename string Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")

--no-progressbar Hide progress bar

Global Flags

--api-key string Specify an NCBI API key

--debug Emit debugging info

--help Print detailed help about a datasets command

--version Print version of datasets

Use datasets download <command> --help for detailed help about a command.

> datasets rehydrate -h

Download data files for an unzipped, dehydrated genome data package. Data files specified in fetch.txt will be downloaded from NCBI. Read more about how rehydration can help with large genome downloads: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genomes/large-download/

Usage

datasets rehydrate [flags] --directory <directory_name>

Flags

--directory string Specify the directory containing the unzipped dehydrated bag

--gzip rehydrate files to gzip format

--list List files that would be downloaded during rehydration

--match string Specify substring that matches files for rehydration

--max-workers int Limit the maximum number of concurrent download workers (allowed range is 1-30) (default 10)

--no-progressbar Hide progress bar

Global Flags

--api-key string Specify an NCBI API key

--debug Emit debugging info

--help Print detailed help about a datasets command

--version Print version of datasets
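The help text above says that --max-workers accepts values from 1 to 30. As a small sketch of how a script might respect that range, the hypothetical helpers below (`clamp_workers` and `rehydrate_command` are names made up for this example) clamp a requested worker count and only print the resulting command rather than running it.

```shell
# Hypothetical helper: clamp a requested worker count to the 1-30 range
# stated in the `datasets rehydrate -h` output.
clamp_workers() {
    local n="$1"
    if [ "$n" -lt 1 ]; then n=1; fi
    if [ "$n" -gt 30 ]; then n=30; fi
    printf '%s\n' "$n"
}

# Hypothetical helper: print (not run) a rehydrate command for
# directory $1 using $2 workers after clamping.
rehydrate_command() {
    printf 'datasets rehydrate --directory %s --max-workers %s\n' \
        "$1" "$(clamp_workers "$2")"
}

# Asking for 64 workers yields a command with the maximum of 30.
rehydrate_command genome_dir 64
```

This prints `datasets rehydrate --directory genome_dir --max-workers 30`. Adding --list to a rehydrate command, as described in the help, shows which files would be downloaded without fetching them.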

Usage

The download command has four subcommands: genome for genomes, gene for genes, taxonomy for taxonomy data, and virus for virus data.

https://www.ncbi.nlm.nih.gov/datasets/docs/v2/reference-docs/command-line/datasets/download/

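To download, add "--dehydrated" to fetch a literally "dehydrated" data package as a zip file; it contains only the data report and the locations of the data files. Then unzip the archive and run "rehydrate" to retrieve the actual data files. You can also download without --dehydrated, but the download takes longer (the flag is recommended especially for large datasets, although rehydration itself takes some time).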

For example, download the human reference genome by specifying its accession (GCF_000001405.40), then unzip the archive and rehydrate it to obtain the files.

#1 Download and save as human_GRCh38_dataset.zip
datasets download genome accession GCF_000001405.40 --dehydrated --filename human_GRCh38_dataset.zip

#2 Unzip the archive
unzip human_GRCh38_dataset.zip -d genome_dir

#3 rehydrate: download the actual data files (FASTA etc.) listed in fetch.txt into the unzipped directory
datasets rehydrate --directory genome_dir/

#1 For the human T2T assembly instead
datasets download genome accession GCF_009914755.1 --dehydrated --filename human_T2T_dataset.zip
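The three steps above (download with --dehydrated, unzip, rehydrate) can be wrapped into one function. This is just a sketch under my own assumptions: `run`, `fetch_genome`, and the `DRY_RUN` variable are hypothetical names, and by default the script only prints the commands so that nothing is downloaded by accident.

```shell
# With DRY_RUN=1 (the default in this sketch) commands are only printed.
# Set DRY_RUN=0 to actually run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print or execute a command depending on DRY_RUN.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical helper: dehydrated download, unzip, then rehydrate.
#   $1: assembly accession (e.g. GCF_009914755.1)
#   $2: output directory; the zip is saved as <directory>.zip
fetch_genome() {
    local accession="$1"
    local outdir="$2"
    local zipfile="${outdir}.zip"
    run datasets download genome accession "$accession" --dehydrated --filename "$zipfile" &&
    run unzip "$zipfile" -d "$outdir" &&
    run datasets rehydrate --directory "$outdir"
}

fetch_genome GCF_009914755.1 human_T2T
```

In dry-run mode this prints the three commands in order, which makes it easy to check the accession and file names before starting a large download.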

Limit the assembly level to chromosome and complete, and further restrict to the sex chromosomes

datasets download genome taxon human --assembly-level chromosome,complete --chromosomes X,Y --include protein,cds --dehydrated

--assembly-level Limit to genomes at one or more assembly levels (comma-separated):
* chromosome
* complete
* contig
* scaffold
(default "[]")

Download only the CDS and protein sequences.

datasets download genome taxon human --include protein,cds --dehydrated

--include Specify the data files to include (comma-separated).
* genome: genomic sequence
* rna: transcript
* protein: amnio acid sequences
* cds: nucleotide coding sequences
* gff3: general feature file
* gtf: gene transfer format
* gbff: GenBank flat file
* seq-report: sequence report file
* none: do not retrieve any sequence files
(default [genome])

Specify two accessions

datasets download genome accession GCA_003774525.2 GCA_000001635 --chromosomes X,Y,Un.9
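When the list of accessions gets longer, keeping it in a text file is easier than typing it on the command line. The sketch below assumes a plain text file with one accession per line; `accessions_from_file` is a hypothetical helper, and the final command is only echoed, not executed. The CLI's own help (`datasets download genome accession --help`) may also list a built-in way to read accessions from a file, so check there first.

```shell
# Hypothetical helper: print the accessions in file $1 on a single line,
# skipping blank lines and lines starting with '#'.
accessions_from_file() {
    grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$1" | tr '\n' ' ' | sed 's/ *$//'
}

# Demo with a temporary list file holding the two accessions used above.
list_file="$(mktemp)"
printf '%s\n' 'GCA_003774525.2' '# blank and comment lines are skipped' '' 'GCA_000001635' > "$list_file"

# The substitution is left unquoted on purpose so that each accession
# becomes its own argument. Remove "echo" to run the command for real.
echo datasets download genome accession $(accessions_from_file "$list_file") --chromosomes X,Y,Un.9

rm -f "$list_file"
```

The echoed line is the same command as the two-accession example above.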

Specify by NCBI taxID (662 here) and limit to reference genomes (*1)

datasets download genome taxon 662 --include protein --dehydrated --reference

Reference genomes of Bacteria (all eubacteria, taxID 2), limited to complete assemblies (5,707 assemblies when run in December 2024).

datasets download genome taxon 2 --include genome --dehydrated --reference --assembly-level complete

Preview

datasets download genome accession GCA_003774525.2 --preview

$ datasets download genome accession GCA_003774525.2 --preview

Collecting 1 genome record [================================================] 100% 1/1

{"resource_updated_on":"2024-08-17T02:30:17Z","record_count":1,"estimated_file_size_mb":769,"included_data_files":{"all_genomic_fasta":{"file_count":1,"size_mb":768.0921}}}
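The preview JSON includes estimated_file_size_mb, so a script can check the size before committing to a download. Below is a minimal sketch using only sed; `preview_size_mb` and `size_within_limit` are hypothetical helper names, the 1000 MB limit is an arbitrary example value, and a real JSON parser such as jq would be more robust than this pattern match.

```shell
# Hypothetical helper: read preview JSON on stdin and print the value of
# estimated_file_size_mb. Lines without that key are ignored.
preview_size_mb() {
    sed -n 's/.*"estimated_file_size_mb":[[:space:]]*\([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: succeed when size $1 (MB) is at most limit $2 (MB).
size_within_limit() {
    local size_mb="${1%%.*}"    # drop any fractional part
    local limit_mb="$2"
    [ "$size_mb" -le "$limit_mb" ]
}

# Demo with the preview output shown above. For a real check, pipe
#   datasets download genome accession GCA_003774525.2 --preview
# into preview_size_mb instead of using the sample string.
sample='{"resource_updated_on":"2024-08-17T02:30:17Z","record_count":1,"estimated_file_size_mb":769,"included_data_files":{"all_genomic_fasta":{"file_count":1,"size_mb":768.0921}}}'
size="$(printf '%s\n' "$sample" | preview_size_mb)"

if size_within_limit "$size" 1000; then
    echo "OK to download: about ${size} MB"
else
    echo "Too large: about ${size} MB"
fi
```

For the sample above this prints `OK to download: about 769 MB`.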

Specify a BioProject ID

datasets download genome accession PRJNA289059 --include none

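The tool is reportedly still under development and its help is not yet very complete, but the features of NCBI Assembly keep changing, and command-line downloads of genomes and genes look likely to be consolidated into this tool, so I am introducing it early. For the other features, please check the NCBI documentation.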

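I have hardly updated this blog recently; my apologies to anyone who was waiting. I have been offline, concentrating on writing a paper. I will probably remain short on time for a while, but I will keep posting, even if irregularly, whenever I can find the time.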

Reference

https://github.com/ncbi/datasets?tab=readme-ov-file

*1 For bacteria, this is the type strain of the species, or a non-type strain with a good-quality assembly that represents the species.

Help for datasets download genome

$ datasets download genome -h

Download a genome data package. Genome data packages may include genome, transcript and protein sequences, annotation and one or more data reports. Data packages are downloaded as a zip archive.

The default genome data package includes the following files:
* <accession>_<assembly_name>_genomic.fna (genomic sequences)
* assembly_data_report.jsonl (data report with genome assembly and annotation metadata)
* dataset_catalog.json (a list of files and file types included in the data package)

Usage
datasets download genome [flags]
datasets download genome [command]

Sample Commands
datasets download genome accession GCF_000001405.40 --chromosomes X,Y --include genome,gff3,rna
datasets download genome taxon "bos taurus" --dehydrated
datasets download genome taxon human --assembly-level chromosome,complete --dehydrated
datasets download genome taxon mouse --search C57BL/6J --search "Broad Institute" --dehydrated

Available Commands
accession Download a genome data package by Assembly or BioProject accession
taxon Download a genome data package by taxon (NCBI Taxonomy ID, scientific or common name at any tax rank)

Flags
--annotated Limit to annotated genomes
--assembly-level string Limit to genomes at one or more assembly levels (comma-separated):
* chromosome
* complete
* contig
* scaffold
(default "[]")
--assembly-source string Limit to 'RefSeq' (GCF_) or 'GenBank' (GCA_) genomes (default "all")
--assembly-version string Limit to 'latest' assembly accession version or include 'all' (latest + previous versions)
--chromosomes strings Limit to a specified, comma-delimited list of chromosomes, or 'all' for all chromosomes
--dehydrated Download a dehydrated zip archive including the data report and locations of data files (use the rehydrate command to retrieve data files).
--exclude-atypical Exclude atypical assemblies
--exclude-multi-isolate Exclude assemblies from multi-isolate projects
--from-type Only return records with type material
--include string(,string) Specify the data files to include (comma-separated).
* genome: genomic sequence
* rna: transcript
* protein: amnio acid sequences
* cds: nucleotide coding sequences
* gff3: general feature file
* gtf: gene transfer format
* gbff: GenBank flat file
* seq-report: sequence report file
* none: do not retrieve any sequence files
(default [genome])
--mag string Limit to metagenome assembled genomes (only) or remove them from the results (exclude) (default "all")
--preview Show information about the requested data package
--reference Limit to reference genomes
--released-after string Limit to genomes released on or after a specified date (free format, ISO 8601 YYYY-MM-DD recommended)
--released-before string Limit to genomes released on or before a specified date (free format, ISO 8601 YYYY-MM-DD recommended)
--search strings Limit results to genomes with specified text in the searchable fields:
species and infraspecies, assembly name and submitter.
To search multiple strings, use the flag multiple times.

Global Flags
--api-key string Specify an NCBI API key
--debug Emit debugging info
--filename string Specify a custom file name for the downloaded data package (default "ncbi_dataset.zip")
--help Print detailed help about a datasets command
--no-progressbar Hide progress bar
--version Print version of datasets

Use datasets download genome <command> --help for detailed help about a command.