genome_updater: download genomes from NCBI and update only what has changed

2020-04-25: added help output, changed title

genome_updater is a Bash script that downloads and updates NCBI genomes (RefSeq / GenBank). It supports data updates, detailed logs, file integrity checks (MD5), and parallel [2] downloads.
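To make the MD5 integrity check concrete, here is a minimal sketch of verifying one file against a known hash. It is an independent illustration and not genome_updater's own code; the helper names `md5_of` and `verify_md5` are hypothetical, and it assumes either `md5sum` (Linux) or `md5` (macOS) is installed.

```shell
# Print the MD5 hash of a file, using whichever tool is available.
md5_of() {
    if command -v md5sum >/dev/null 2>&1; then
        md5sum "$1" | cut -d ' ' -f 1   # Linux (GNU coreutils)
    else
        md5 -q "$1"                     # macOS
    fi
}

# verify_md5 FILE EXPECTED_HASH
# Returns 0 when the file's hash matches the expected value, 1 otherwise.
verify_md5() {
    [ "$(md5_of "$1")" = "$2" ]
}
```

For example, `verify_md5 some_genomic.fna.gz "$expected" || echo "corrupt download"`, where the expected hash would come from the checksum list that NCBI publishes next to each assembly.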

Installation

Tested on macOS 10.14 in an anaconda3.7 environment.

Source: GitHub

#bioconda (link)
conda create -n genome_updater -y
conda activate genome_updater
conda install -c bioconda -y genome_updater
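
After installing, it can be worth confirming that the script is actually on the PATH of the active environment. A small sketch; `check_tool` is a hypothetical helper name, not part of genome_updater.

```shell
# Report whether a command is reachable on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# If this reports "missing", the conda environment is probably not active.
check_tool genome_updater.sh || echo "run: conda activate genome_updater"
```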

> genome_updater.sh

$ genome_updater.sh

genome_updater v0.2.0 by Vitor C. Piro (vitorpiro@gmail.com, http://github.com/pirovc)

-g Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)]. Example: archaea,bacteria

or Species level taxids (one or more comma-separated entries). Example: species:622,562

or Any level taxids - lineage will be generated (one or more comma-separated entries). Example: taxids:620,649776

-d Database [genbank, refseq]

Default: refseq

-c RefSeq Category [all, reference genome, representative genome, na]

Default: all

-l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]

Default: all

-f File formats [genomic.fna.gz,assembly_report.txt, ... - check ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats]

Default: assembly_report.txt

-k Dry-run, no data is downloaded or updated - just checks for available sequences and changes

-i Fix failed downloads or any incomplete data from a previous run, keep current version

-x Allow the deletion of extra files if some are found in the repository folder

-u Report of updated assembly accessions (Added/Removed, assembly accession, url)

-r Report of updated sequence accessions (Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid). Only available when file assembly_report.txt selected and successfully downloaded

-p Output list of URLs for downloaded and failed files

-a Download the current version of the Taxonomy database (taxdump.tar.gz)

-o Working output directory

Default: ./tmp.XXXXXXXXXX

-b Version label

Default: current timestamp (YYYY-MM-DD_HH-MM-SS)

-e External "assembly_summary.txt" file to recover data from

Default: ""

-t Threads

Default: 1

-m Check MD5 for downloaded files

-s Silent output

-w Silent output with download progress (%) and download version at the end

-n Conditional exit status. Exit Code = 1 if more than N files failed to download (integer for file number, float for percentage, 0 -> off)

Default: 0
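
The -n option is useful when genome_updater runs from a scheduled script, because the caller can branch on the exit status. A minimal sketch, assuming only what the help text above states (exit code 1 when more than N files fail); `run_and_report` is a hypothetical wrapper name.

```shell
# Run any command and print a one-line summary based on its exit status.
run_and_report() {
    if "$@"; then
        echo "update ok"
    else
        # In the else branch, $? still holds the status of the command above.
        echo "update failed (exit $?)"
        return 1
    fi
}
```

Usage would look like `run_and_report genome_updater.sh -d refseq -g fungi -o fungi -n 10`, which reports a failure if more than 10 files could not be downloaded.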

Usage

Download the FASTA and GenBank files for the complete genomes of RefSeq archaea and bacteria, then update them some time later.

#Download the complete genomes of RefSeq archaea and bacteria, using 8 threads and checking MD5.
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz" -o "arc_bac_refseq_cg" -t 8 -u -m

#Download gbff (GenBank) files in addition to the FASTA files.
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -t 8 -u -m -i

#Some time later, check for updates (dry run, nothing is downloaded).
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -k

#If there are updates, apply them.
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -t 8 -u -m

-d [genbank, refseq] Default: refseq
-g Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)].
-t Threads Default: 1
-c RefSeq Category [all, reference genome, representative genome, na]
Default: all
-l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]
Default: all
-f File formats [genomic.fna.gz,assembly_report.txt, ...
-o Working output directory. Default: ./tmp.XXXXXXXXXX
-u Report of updated assembly accessions (Added/Removed, assembly accession, url)
-m Check MD5 for downloaded files
-i Fix failed downloads or any incomplete data from a previous run, keep current version
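
The commands above share most of their options, so a small wrapper keeps them consistent between runs. This is only a sketch: `gu_run` and the `GU_BIN` variable are hypothetical names introduced here, and the option values are copied from the commands above. Setting `GU_BIN=echo` prints the arguments instead of running the tool.

```shell
# Command to run; override with GU_BIN=echo to preview the arguments.
GU_BIN=${GU_BIN:-genome_updater.sh}

# gu_run MODE   where MODE is check, update, or fix
gu_run() {
    mode=$1
    # Options shared by every run against this repository.
    set -- -d refseq -g archaea,bacteria -c all -l "Complete Genome" \
           -f genomic.fna.gz,genomic.gbff.gz -o arc_bac_refseq_cg
    case "$mode" in
        check)  set -- "$@" -k ;;              # dry run only
        update) set -- "$@" -t 8 -u -m ;;      # download a new version
        fix)    set -- "$@" -t 8 -u -m -i ;;   # repair the current version
        *)      echo "unknown mode: $mode" >&2; return 1 ;;
    esac
    "$GU_BIN" "$@"
}
```

A typical session would be `gu_run check` followed by `gu_run update` once the dry run shows changes.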

Download all RefSeq RNA viruses (under the taxon Riboviria)

genome_updater.sh -d "refseq" -g "taxids:2559587" -f "genomic.fna.gz" -o "all_rna_virus" -t 12

Check the number of viral genomes available in GenBank

genome_updater.sh -d "genbank" -g "viral" -c "all" -l "all" -k
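
To cross-check the dry-run counts by hand, one option is to count rows in an NCBI assembly_summary.txt file directly. The sketch below assumes that file is already on disk, that it is tab-separated with header lines starting with `#`, and that the assembly level is in column 12; `count_level` is a hypothetical helper name.

```shell
# count_level FILE LEVEL
# Count non-header rows whose 12th tab-separated column equals LEVEL.
count_level() {
    awk -F '\t' -v level="$2" \
        '!/^#/ && $12 == level { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_level assembly_summary.txt "Complete Genome"` prints the number of complete genomes listed in that file.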


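Download the assembly reports for all RefSeq fungi, with a report of updated sequence accessions (-r) and a list of URLs for downloaded and failed files (-p)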

genome_updater.sh -d "refseq" -g "fungi" -c "all" -l "all" -f "assembly_report.txt" -o "fungi" -t 12 -r -p

See GitHub for other examples.

Reference

GitHub - pirovc/genome_updater: Automatic download and update genome and sequences files from NCBI

macでインフォマティクス

A collection of informatics notes related to HTS (NGS).
