2020-04-25: added the help output, changed the title
genome_updater is a Bash script that downloads and updates NCBI genomes (RefSeq / GenBank). It supports updating existing data, keeping detailed logs, checking file integrity (MD5), and parallel [2] downloads.
Installation
Tested on macOS 10.14 in an anaconda3.7 environment.
Source: GitHub
#bioconda (link)
conda create -n genome_updater -y
conda activate genome_updater
conda install -c bioconda -y genome_updater
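To confirm that the install put the script on the PATH of the active environment, a small check along these lines may help. The helper name `have_cmd` is made up for this sketch; `command -v` is the standard shell builtin.

```shell
# have_cmd is a hypothetical helper: succeeds if the named command is on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

if have_cmd genome_updater.sh; then
  echo "genome_updater.sh found: $(command -v genome_updater.sh)"
else
  echo "genome_updater.sh not found; is the conda environment active?"
fi
```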
> genome_updater.sh
$ genome_updater.sh
genome_updater v0.2.0 by Vitor C. Piro (vitorpiro@gmail.com, http://github.com/pirovc)
-g Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)]. Example: archaea,bacteria
or Species level taxids (one or more comma-separated entries). Example: species:622,562
or Any level taxids - lineage will be generated (one or more comma-separated entries). Example: taxids:620,649776
-d Database [genbank, refseq]
Default: refseq
-c RefSeq Category [all, reference genome, representative genome, na]
Default: all
-l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]
Default: all
-f File formats [genomic.fna.gz,assembly_report.txt, ... - check ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats]
Default: assembly_report.txt
-k Dry-run, no data is downloaded or updated - just checks for available sequences and changes
-i Fix failed downloads or any incomplete data from a previous run, keep current version
-x Allow the deletion of extra files if some are found in the repository folder
-u Report of updated assembly accessions (Added/Removed, assembly accession, url)
-r Report of updated sequence accessions (Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid). Only available when file assembly_report.txt selected and successfully downloaded
-p Output list of URLs for downloaded and failed files
-a Download the current version of the Taxonomy database (taxdump.tar.gz)
-o Working output directory
Default: ./tmp.XXXXXXXXXX
-b Version label
Default: current timestamp (YYYY-MM-DD_HH-MM-SS)
-e External "assembly_summary.txt" file to recover data from
Default: ""
-t Threads
Default: 1
-m Check MD5 for downloaded files
-s Silent output
-w Silent output with download progress (%) and download version at the end
-n Conditional exit status. Exit Code = 1 if more than N files failed to download (integer for file number, float for percentage, 0 -> off)
Default: 0
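As a rough illustration of what an MD5 integrity check such as the one enabled by -m involves, the sketch below compares a file's hash with an expected value. It is not taken from genome_updater itself: `verify_md5` is a hypothetical helper, and it assumes GNU `md5sum` is installed (on macOS, `md5 -q FILE` is the usual substitute).

```shell
# verify_md5 is a hypothetical helper, not part of genome_updater.
# Usage: verify_md5 FILE EXPECTED_MD5
# Returns 0 if the MD5 hash of FILE equals EXPECTED_MD5, non-zero otherwise.
verify_md5() {
  file="$1"
  expected="$2"
  # md5sum prints "<hash>  <filename>"; keep only the hash.
  actual=$(md5sum "$file" | cut -d' ' -f1)
  [ "$actual" = "$expected" ]
}

# Example call (the file name and the expected hash are placeholders):
# verify_md5 "some_file_genomic.fna.gz" "$expected_hash" && echo "OK"
```

A mismatch usually means a truncated or corrupted download, which is the kind of file the -i option is meant to fetch again.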
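Usage
Download the FASTA and GenBank files for the complete genomes of RefSeq archaea and bacteria, then update them some time later.
#Download the complete genomes of RefSeq archaea and bacteria, using 8 threads and an MD5 check.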
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz" -o "arc_bac_refseq_cg" -t 8 -u -m
#Download gbff (GenBank) files in addition to the FASTA files.
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -t 8 -u -m -i
#Some time later, check for updates (dry run).
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -k
#If there are updates, apply them.
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -t 8 -u -m
- -d [genbank, refseq] Default: refseq
- -g Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)].
- -t Threads Default: 1
- -c RefSeq Category [all, reference genome, representative genome, na] Default: all
- -l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig] Default: all
- -f File formats [genomic.fna.gz,assembly_report.txt, ...
- -o Working output directory. Default: ./tmp.XXXXXXXXXX
- -u Report of updated assembly accessions (Added/Removed, assembly accession, url)
- -m Check MD5 for downloaded files
- -i Fix failed downloads or any incomplete data from a previous run, keep current version
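Because the dry run and the real update share most of their options, the check-then-update routine can be wrapped in two small functions. This is only a sketch: `GU_BIN`, `gu_common`, `gu_check`, and `gu_update` are names invented here, while the option values are the ones used in the commands above.

```shell
# GU_BIN and the function names are hypothetical. Setting GU_BIN=echo prints
# the arguments instead of running genome_updater.sh.
GU_BIN="${GU_BIN:-genome_updater.sh}"

# Options shared by the dry run and the real update.
gu_common() {
  "$GU_BIN" -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" \
    -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" "$@"
}

# Dry run: -k only reports available sequences and changes.
gu_check() { gu_common -k; }

# Real update: 8 threads, assembly accession report (-u), MD5 check (-m).
gu_update() { gu_common -t 8 -u -m; }
```

Running `gu_check` first and `gu_update` only when changes are reported keeps the two invocations from drifting apart.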
Download all RNA viruses (under the taxon Riboviria) from RefSeq
genome_updater.sh -d "refseq" -g "taxids:2559587" -f "genomic.fna.gz" -o "all_rna_virus" -t 12
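The -g option also accepts several comma-separated taxids, as in the help example `taxids:620,649776`. When the taxids come from a list, a helper like the one below can build that value. `taxid_group` is a hypothetical name, and the output directory in the commented usage line is a placeholder.

```shell
# taxid_group is a hypothetical helper: joins its arguments with commas and
# prefixes "taxids:", producing a value that can be passed to -g.
# The body runs in a subshell so the IFS change does not leak out.
taxid_group() (
  IFS=,
  printf 'taxids:%s\n' "$*"
)

# Example usage (not executed here; "my_taxa" is a placeholder directory):
# genome_updater.sh -d "refseq" -g "$(taxid_group 620 649776)" -f "genomic.fna.gz" -o "my_taxa" -t 12
```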
Check the number of virus genomes available in GenBank (dry run)
genome_updater.sh -d "genbank" -g "viral" -c "all" -l "all" -k
Download the assembly reports for RefSeq fungi and generate the sequence accession report (-r) and the URL list (-p)
genome_updater.sh -d "refseq" -g "fungi" -c "all" -l "all" -f "assembly_report.txt" -o "fungi" -t 12 -r -p
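The sequence accession report produced by -r lists one sequence per line, starting with the Added/Removed marker. A quick way to summarize such a report is to count lines per marker. The sketch below assumes the report is tab-separated with the marker in the first column, following the column order in the help text; `count_by_status` is a hypothetical helper, and the report's file name is not given here, so the path is passed as an argument.

```shell
# count_by_status is a hypothetical helper.
# Usage: count_by_status REPORT_FILE
# Assumes a tab-separated report whose first column is the Added/Removed
# marker. Prints "<marker><TAB><count>" per distinct marker, sorted.
count_by_status() {
  awk -F '\t' '{ n[$1]++ } END { for (s in n) print s "\t" n[s] }' "$1" | sort
}
```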
See GitHub for more examples.
Citation
GitHub - pirovc/genome_updater: Automatic download and update genome and sequences files from NCBI
Related