2019 11/8 Fixed a mistake in a command ("Escherichia coli" => "Escherichia")
2019 12/19 Added links to related tools
A script that does what the title says: it downloads genome data from NCBI.
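Installation
Tested in a miniconda2-4.0.5 environment on macOS 10.13.
Dependencies
Source: GitHub
# In an Anaconda environment it can be installed from bioconda (mamba is used here; conda works the same way)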
mamba create -n ncbi-genome-download -y
conda activate ncbi-genome-download
mamba install -c bioconda ncbi-genome-download -y
$ ncbi-genome-download -h
/Users/kazuma/.pyenv/versions/miniconda2-4.0.5/lib/python2.7/site-packages/cryptography/hazmat/primitives/constant_time.py:26: CryptographyDeprecationWarning: Support for your Python version is deprecated. The next version of cryptography will remove support. Please upgrade to a 2.7.x release that supports hmac.compare_digest as soon as possible.
utils.DeprecatedIn23,
usage: ncbi-genome-download [-h] [-s {refseq,genbank}] [-F FILE_FORMAT]
[-l ASSEMBLY_LEVEL] [-g GENUS] [-T SPECIES_TAXID]
[-t TAXID] [-A ASSEMBLY_ACCESSIONS]
[-R {all,reference,representative}] [-o OUTPUT]
[-H] [-u URI] [-p N] [-r N] [-m METADATA_TABLE]
[-n] [-N] [-v] [-d] [-V]
group
positional arguments:
group The NCBI taxonomic group to download (default: all). A
comma-separated list of taxonomic groups is also
possible. For example: "bacteria,viral"Choose from:
['all', 'archaea', 'bacteria', 'fungi',
'invertebrate', 'plant', 'protozoa',
'vertebrate_mammalian', 'vertebrate_other', 'viral']
optional arguments:
-h, --help show this help message and exit
-s {refseq,genbank}, --section {refseq,genbank}
NCBI section to download (default: refseq)
-F FILE_FORMAT, --format FILE_FORMAT
Which format to download (default: genbank).A comma-
separated list of formats is also possible. For
example: "fasta,assembly-report". Choose from:
['genbank', 'fasta', 'features', 'gff', 'protein-
fasta', 'genpept', 'wgs', 'cds-fasta', 'rna-fna',
'rna-fasta', 'assembly-report', 'assembly-stats',
'all']
-l ASSEMBLY_LEVEL, --assembly-level ASSEMBLY_LEVEL
Assembly level of genomes to download (default: all).
A comma-separated list of assembly levels is also
possible. For example: "complete,chromosome". Coose
from: ['all', 'complete', 'chromosome', 'scaffold',
'contig']
-g GENUS, --genus GENUS
Only download sequences of the provided genus. A
comma-seperated list of genera is also possible. For
example: "Streptomyces coelicolor,Escherichia coli".
(default: )
-T SPECIES_TAXID, --species-taxid SPECIES_TAXID
Only download sequences of the provided species NCBI
taxonomy ID. A comma-separated list of species taxids
is also possible. For example: "52342,12325".
(default: )
-t TAXID, --taxid TAXID
Only download sequences of the provided NCBI taxonomy
ID. A comma-separated list of taxids is also possible.
For example: "9606,9685". (default: [])
-A ASSEMBLY_ACCESSIONS, --assembly-accessions ASSEMBLY_ACCESSIONS
Only download sequences matching the provided NCBI
assembly accession(s). A comma-separated list of
accessions is possible, as well as a path to a
filename containing one accession per line.
-R {all,reference,representative}, --refseq-category {all,reference,representative}
Only download sequences of the provided refseq
category (default: all)
-o OUTPUT, --output-folder OUTPUT
Create output hierarchy in specified folder (default:
/Users/kazuma)
-H, --human-readable Create links in human-readable hierarchy (might fail
on Windows)
-u URI, --uri URI NCBI base URI to use (default:
https://ftp.ncbi.nih.gov/genomes)
-p N, --parallel N Run N downloads in parallel (default: 1)
-r N, --retries N Retry download N times when connection to NCBI fails
(default: 0)
-m METADATA_TABLE, --metadata-table METADATA_TABLE
Save tab-delimited file with genome metadata
-n, --dry-run Only check which files to download, don't download
genome files.
-N, --no-cache Don't cache the assembly summary file in
/Users/kazuma/Library/Caches/ncbi-genome-download.
-v, --verbose increase output verbosity
-d, --debug print debugging information
-V, --version print version information
Usage
1. Download all bacterial genomes from RefSeq.
ncbi-genome-download bacteria -s refseq
- group The NCBI taxonomic group to download (default: all). Choose from: ['all', 'archaea', 'bacteria', 'fungi','invertebrate', 'plant', 'protozoa', 'vertebrate_mammalian', 'vertebrate_other', 'viral'].
- -s {refseq, genbank} NCBI section to download (default: refseq)
2. Several groups can be specified at once. On a fast connection, running downloads in parallel speeds things up.
ncbi-genome-download bacteria,viral,archaea,fungi,protozoa -p 4
- -p Run N downloads in parallel (default: 1)
3. Download viral genomes in FASTA format, together with their assembly reports.
ncbi-genome-download --format fasta,assembly-report viral
- --format Which format to download (default: genbank). Choose from: ['genbank', 'fasta', 'features', 'gff', 'protein-fasta', 'genpept', 'wgs', 'cds-fasta', 'rna-fna','rna-fasta', 'assembly-report', 'assembly-stats','all']
4. Download bacterial genomes with assembly level "complete" in FASTA format.
ncbi-genome-download --assembly-level complete --format fasta bacteria
- --assembly-level Assembly level of genomes to download (default: all). Choose from: ['all', 'complete', 'chromosome', 'scaffold', 'contig']
5. Download bacterial genomes that match the given genus.
ncbi-genome-download --genus "Streptomyces coelicolor,Escherichia" bacteria
- --genus Only download sequences of the provided genus.
6. Download the genome for a specific taxonomy ID (this example is E. coli K-12 substr. MG1655).
ncbi-genome-download --taxid 511145 bacteria
- -t Only download sequences of the provided NCBI taxonomy ID. A comma-separated list of taxids is also possible.
Taxonomy IDs can be looked up in the NCBI Taxonomy Browser (link), among other places.
7. To check which genomes would be downloaded before actually downloading them, set the "--dry-run" flag.
ncbi-genome-download --dry-run bacteria > list
- --dry-run Only check which files to download, don't download genome files.
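The saved list can then be inspected with standard tools before committing to a large download. The following is a minimal sketch, assuming each candidate assembly appears on its own line beginning with a GCF_ or GCA_ accession; the exact dry-run layout may differ between versions, and count_assemblies is a hypothetical helper name, not part of ncbi-genome-download.

```shell
# Count candidate assemblies in a saved dry-run listing.
# Assumption: one assembly per line, and the first whitespace-separated
# field is a GCF_/GCA_ accession. Any other line (headers, warnings)
# is ignored.
count_assemblies() {
    # $1: path to the file written by "--dry-run ... > list"
    awk '$1 ~ /^GC[AF]_[0-9]+/ { n++ } END { print n + 0 }' "$1"
}

# Example: count_assemblies list
```

If the count is far larger than expected, narrowing the request with --assembly-level, --genus or --taxid before the real run avoids downloading unwanted genomes.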
Addendum
For a specific bacterial genus, download the FASTA, GenBank, protein FASTA (faa) files and the assembly report wherever a complete genome is available.
ncbi-genome-download --assembly-level complete --format genbank,fasta,protein-fasta,assembly-report --genus "Photobacterium" bacteria
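After a run like this, it can be useful to list which genomic FASTA files actually arrived. This is a sketch under stated assumptions: that files are written as <output>/refseq/bacteria/<accession>/*_genomic.fna.gz, which is the layout I would expect from the tool but is not shown in this post, and list_genomic_fasta is a hypothetical helper name.

```shell
# List downloaded genomic FASTA files for RefSeq bacteria.
# Assumption: the output hierarchy is
#   <output>/refseq/bacteria/<accession>/<name>_genomic.fna.gz
# Files such as *_cds_from_genomic.fna.gz are excluded so that only
# whole-genome FASTA files are listed.
list_genomic_fasta() {
    # $1: folder given to -o / --output-folder (defaults to ".")
    find "${1:-.}/refseq/bacteria" -type f -name '*_genomic.fna.gz' \
        ! -name '*_from_genomic.fna.gz' | sort
}

# Example: list_genomic_fasta . | wc -l
```

Comparing this count with the number of assemblies reported by --dry-run is a quick way to spot interrupted downloads.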
Download by accession number
For the different kinds of NCBI accession IDs, the following page is easy to follow. See "Unique identifiers and NCBI accession prefixes" about halfway down the linked page.
https://www.ncbi.nlm.nih.gov/pathogens/pathogens_help/#isolates-browser-what-is
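The help output above lists -A / --assembly-accessions for this purpose, accepting either a comma-separated list or a file with one accession per line. A minimal sketch of the file-based form follows; GCF_000005845.2 (the RefSeq assembly for E. coli K-12 MG1655) is used purely as an illustrative accession, and write_accession_list and accessions.txt are hypothetical names.

```shell
# Write one assembly accession per line, the file layout that the
# help text describes for -A / --assembly-accessions.
write_accession_list() {
    # $1: output file; remaining arguments: accessions
    list_file=$1
    shift
    printf '%s\n' "$@" > "$list_file"
}

write_accession_list accessions.txt GCF_000005845.2

# Check what would be fetched first. Drop --dry-run to download, and
# add e.g. "--format fasta" to choose the file type.
if command -v ncbi-genome-download >/dev/null 2>&1; then
    ncbi-genome-download --dry-run -A accessions.txt bacteria \
        || echo "dry run failed (check network access)" >&2
else
    echo "ncbi-genome-download not found on PATH" >&2
fi
```

Keeping the accessions in a file makes the download easy to repeat and to record alongside the analysis.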
Reference
https://github.com/kblin/ncbi-genome-download
Related
NCBI nr database