Informatics on Mac (macでインフォマティクス)

A collection of notes on informatics for HTS (NGS).

genome_updater: download genomes from NCBI and update only what has changed

2020-04-25: added the help output, changed the title

 

genome_updater is a Bash script that downloads and updates NCBI genomes (refseq / genbank). It supports updating existing data, keeping detailed logs, checking file integrity (MD5), and downloading in parallel [2].
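To make the MD5 integrity check mentioned above concrete, here is a minimal sketch of what verifying one downloaded file against an expected digest can look like. This is not genome_updater's own code; with -m the tool performs the check itself. The function names md5_of and verify_md5 are made up for this illustration, and the sketch assumes that either md5sum (Linux) or md5 (macOS) is installed.

```sh
# md5_of FILE: print the hex MD5 digest of FILE, using whichever tool
# is available (md5sum on Linux, md5 on macOS). Hypothetical helper.
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{ print $1 }'
  else
    md5 -q "$1"
  fi
}

# verify_md5 FILE EXPECTED: return 0 and print "OK" if the digest of FILE
# equals EXPECTED, otherwise report the mismatch on stderr and return 1.
verify_md5() {
  actual=$(md5_of "$1")
  if [ "$actual" = "$2" ]; then
    echo "OK  $1"
  else
    echo "FAILED  $1 (expected $2, got $actual)" >&2
    return 1
  fi
}
```

It would be called as verify_md5 some_file.fna.gz EXPECTED_DIGEST, where the expected digest comes from wherever the checksum for that file is published.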

 

Installation

Tested in an anaconda3.7 environment on macOS 10.14.

Source: GitHub

#bioconda (link)
conda create -n genome_updater -y
conda activate genome_updater
conda install -c bioconda -y genome_updater

genome_updater.sh

$ genome_updater.sh 

genome_updater v0.2.0 by Vitor C. Piro (vitorpiro@gmail.com, http://github.com/pirovc)

 -g Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)]. Example: archaea,bacteria
    or Species level taxids (one or more comma-separated entries). Example: species:622,562
    or Any level taxids - lineage will be generated (one or more comma-separated entries). Example: taxids:620,649776

 -d Database [genbank, refseq]
    Default: refseq
 -c RefSeq Category [all, reference genome, representative genome, na]
    Default: all
 -l Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]
    Default: all
 -f File formats [genomic.fna.gz,assembly_report.txt, ... - check ftp://ftp.ncbi.nlm.nih.gov/genomes/all/README.txt for all file formats]
    Default: assembly_report.txt

 -k Dry-run, no data is downloaded or updated - just checks for available sequences and changes
 -i Fix failed downloads or any incomplete data from a previous run, keep current version
 -x Allow the deletion of extra files if some are found in the repository folder

 -u Report of updated assembly accessions (Added/Removed, assembly accession, url)
 -r Report of updated sequence accessions (Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid). Only available when file assembly_report.txt selected and successfully downloaded
 -p Output list of URLs for downloaded and failed files
 -a Download the current version of the Taxonomy database (taxdump.tar.gz)

 -o Working output directory
    Default: ./tmp.XXXXXXXXXX
 -b Version label
    Default: current timestamp (YYYY-MM-DD_HH-MM-SS)
 -e External "assembly_summary.txt" file to recover data from
    Default: ""
 -t Threads
    Default: 1

 -m Check MD5 for downloaded files
 -s Silent output
 -w Silent output with download progress (%) and download version at the end
 -n Conditional exit status. Exit Code = 1 if more than N files failed to download (integer for file number, float for percentage, 0 -> off)
    Default: 0

 

 

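Usage

Download the FASTA and GenBank files for the complete genomes of archaea and bacteria from refseq, then update them some time later.

#Download the complete genomes of archaea and bacteria from refseq. 8 threads, with MD5 check.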
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz" -o "arc_bac_refseq_cg" -t 8 -u -m

#Download gbff (genbank) files in addition to the fasta files.
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -t 8 -u -m -i

#Some time later, check for updates (dry run, nothing is downloaded)
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -k

#If there are updates, run the update.
genome_updater.sh -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg" -t 8 -u -m

  • -d   Database [genbank, refseq]. Default: refseq
  • -g   Organism group (one or more comma-separated entries) [archaea, bacteria, fungi, human (also contained in vertebrate_mammalian), invertebrate, metagenomes (genbank), other (synthetic genomes - only genbank), plant, protozoa, vertebrate_mammalian, vertebrate_other, viral (only refseq)].
  • -t   Threads. Default: 1
  • -c   RefSeq Category [all, reference genome, representative genome, na]. Default: all
  • -l   Assembly level [all, Complete Genome, Chromosome, Scaffold, Contig]. Default: all
  • -f   File formats [genomic.fna.gz,assembly_report.txt, ...
  • -o   Working output directory. Default: ./tmp.XXXXXXXXXX
  • -u   Report of updated assembly accessions (Added/Removed, assembly accession, url)
  • -m   Check MD5 for downloaded files
  • -i   Fix failed downloads or any incomplete data from a previous run, keep current version
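Because the dry-run check and the real update must use exactly the same selection options, it can be convenient to keep them in one place. The following is a minimal sketch of such a wrapper, not something shipped with genome_updater. The function name run_updater and the GENOME_UPDATER variable are hypothetical names introduced here; the options themselves are the ones used in the commands above.

```sh
# Command to run; defaults to the real script. Overriding it (for example
# with "echo" to preview the arguments) is a convenience of this sketch,
# not a genome_updater feature.
: "${GENOME_UPDATER:=genome_updater.sh}"

# run_updater MODE
#   check  -> dry run (-k): only report what is available or has changed
#   update -> download/update with 8 threads, report (-u) and MD5 check (-m)
run_updater() {
  mode=$1
  # Options shared by both modes, so that both select the same genomes.
  set -- -d "refseq" -g "archaea,bacteria" -c "all" -l "Complete Genome" \
         -f "genomic.fna.gz,genomic.gbff.gz" -o "arc_bac_refseq_cg"
  case "$mode" in
    check)  set -- "$@" -k ;;
    update) set -- "$@" -t 8 -u -m ;;
    *)      echo "usage: run_updater check|update" >&2; return 2 ;;
  esac
  "$GENOME_UPDATER" "$@"
}
```

A typical session would be run_updater check, a look at what the dry run reports, and then run_updater update. Setting GENOME_UPDATER=echo beforehand prints the arguments instead of contacting NCBI.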

 

Download all RNA viruses in RefSeq (under the taxon Riboviria)

genome_updater.sh -d "refseq" -g "taxids:2559587" -f "genomic.fna.gz" -o "all_rna_virus" -t 12

 

Check the number of viral genomes available in genbank (dry run)

genome_updater.sh -d "genbank" -g "viral" -c "all" -l "all" -k

 

Download the refseq assembly reports for fungi, and output the sequence accession report (-r) and the URL lists (-p)

genome_updater.sh -d "refseq" -g "fungi" -c "all" -l "all" -f "assembly_report.txt" -o "fungi" -t 12 -r -p
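The help text describes the -r report as rows of (Added/Removed, assembly accession, genbank accession, refseq accession, sequence length, taxid). As a rough sketch of how such a report could be summarized afterwards, the function below tallies rows by the value in the first column. It assumes the report is tab-separated with the Added/Removed status in the first column, which this article does not confirm, and count_by_status is a name made up here. The report's file name and location are not shown in this article either, so pass the path of whatever file your run produced.

```sh
# count_by_status FILE: count rows per distinct value of the first column.
# Assumption: FILE is tab-separated and column 1 holds the Added/Removed status.
count_by_status() {
  awk -F '\t' 'NF { n[$1]++ } END { for (s in n) print s "\t" n[s] }' "$1" | sort
}
```

The output is one line per status value followed by its row count, which gives a quick view of how many sequences were added or removed by an update.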

 

See GitHub for other examples.

Reference

GitHub - pirovc/genome_updater: Automatic download and update genome and sequences files from NCBI

 

Related