getSequenceInfo: fetching genome sequences and sequencing data from public repositories

2022/07/14: typos corrected

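The volume of biological sequence data is growing rapidly, indeed exponentially, worldwide. Nucleotide sequence databases play an important role in providing meaningful genomic information on a wide range of organisms. getSequenceInfo is a software tool for accessing sequence information from public repositories such as GenBank, RefSeq, and the European Nucleotide Archive. It runs on Linux, macOS, and Microsoft Windows and can be used either programmatically (from the command line) or through a GUI. Sequence data can be queried by kingdom or species, or by release date (everything released since a given date). Chromosomes and plasmids (or other genetic elements/components) can be separated, with each component placed in its own folder. The program also computes basic statistics, such as the GC content of the queried assemblies, and uses nucleotide information to calculate empirically designed nucleotide ratios that give a tentative "NucleScore" for each genome assembly studied. Besides the main tool, gSeqI, further tools have been developed for various tasks related to sequence analysis.
The aim of this work is to democratize programmatic access to public repositories and to make sequence data analysis easier from an educational perspective. Results are available in FASTA, FASTQ, Excel/TSV, or HTML format. The program is available at https://github.com/karubiotools/getSequenceInfo. getSequenceInfo and its companion tools are also partly available through the recently released Galaxy KaruBioNet platform (http://calamar.univ-ag.fr/c3i/galaxy_karubionet.html).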
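The abstract mentions that the program reports basic statistics such as GC content. As a rough illustration of what that statistic is, here is a minimal Python sketch that reads FASTA records and computes the GC percentage. The function names are made up for this post, and excluding ambiguous bases (N and so on) from the denominator is an assumption; the post does not say how getSequenceInfo itself handles them.

```python
def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(sequence):
    """Return the GC percentage over unambiguous bases (A, C, G, T).

    Assumption: ambiguous symbols such as N are ignored. Returns 0.0 when
    the sequence contains no unambiguous base.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return 100.0 * gc / total if total else 0.0
```

With a downloaded FASTA file, something like `for name, seq in read_fasta(open(path)): print(name, gc_content(seq))` would print one value per record.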

Installation

Dependencies

Perl (version 5.26 or greater)

GitHub

git clone https://github.com/dcouvin/getSequenceInfo.git
cd getSequenceInfo/
bash install/installer_Unix.sh
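If you want to check the Perl side before running the tool, the usual one-liner `perl -MModule -e 1` exits with a non-zero status when a module cannot be loaded. The sketch below wraps that idiom in Python. The module list is copied from the startup messages in the help output shown in this post; the helper names are hypothetical, and this is not part of getSequenceInfo.

```python
import shutil
import subprocess

# Modules reported at startup in the help output quoted in this post.
REQUIRED_MODULES = [
    "Archive::Tar", "Bio::SeqIO", "Bio::Species", "Date::Calc",
    "File::Copy", "File::Path", "Net::FTP", "IO::Uncompress::Gunzip",
    "LWP::Simple", "POSIX", "File::Log",
]


def module_check_command(module, perl="perl"):
    """Build the `perl -MModule -e 1` command that tries to load one module."""
    return [perl, f"-M{module}", "-e", "1"]


def missing_modules(modules=REQUIRED_MODULES, perl="perl"):
    """Return the modules that the given perl executable fails to load."""
    if shutil.which(perl) is None:
        raise FileNotFoundError(f"{perl} was not found on PATH")
    missing = []
    for module in modules:
        result = subprocess.run(module_check_command(module, perl),
                                capture_output=True)
        if result.returncode != 0:
            missing.append(module)
    return missing
```

Calling `missing_modules()` should return an empty list once the installer has done its job.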

> perl getSequenceInfo.pl -h

perl getSequenceInfo/getSequenceInfo.pl -h

##################################################################
## ---> Welcome to getSequenceInfo/getSequenceInfo.pl (version 1.0.1)!
## Start Date (yyyy-mm-dd, hh:min:sec): 2022-7-13, 23:19:1
##################################################################

Archive::Tar is.................installed!
Bio::SeqIO is.................installed!
Bio::Species is.................installed!
Date::Calc is.................installed!
File::Copy is.................installed!
File::Path is.................installed!
Net::FTP is.................installed!
IO::Uncompress::Gunzip is.................installed!
LWP::Simple is.................installed!
POSIX is.................installed!
File::Log is.................installed!

Name:
getSequenceInfo/getSequenceInfo.pl

Synopsis:
A Perl script allowing to get sequence information from GenBank RefSeq or ENA repositories.

Usage:
perl getSequenceInfo/getSequenceInfo.pl [options]

examples:
perl getSequenceInfo/getSequenceInfo.pl -k bacteria -s "Helicobacter pylori" -l "Complete Genome" -date 2019-06-01
perl getSequenceInfo/getSequenceInfo.pl -k viruses -n 5 -date 2019-06-01
perl getSequenceInfo/getSequenceInfo.pl -k "bacteria" -taxid 9,24 -n 10 -c plasmid -dir genbank -o Results
perl getSequenceInfo/getSequenceInfo.pl -ena BN000065
perl getSequenceInfo/getSequenceInfo.pl -fastq ERR818002
perl getSequenceInfo/getSequenceInfo.pl -fastq ERR818002,ERR818004

Kingdoms:
archaea
bacteria
fungi
invertebrate
plant
protozoa
vertebrate_mammalian
vertebrate_other
viral

Assembly levels:
"Complete Genome"
Chromosome
Scaffold
Contig

General:
-help or -h displays this help
-version or -v displays the current version of the program

Options ([XXX] represents the expected value):
-directory or -dir [XXX] allows to indicate the NCBI's nucleotide sequences repository (default: genbank)
-get or -getSummaries [XXX] allows to obtain a new assembly summary file in function of given kingdoms (bacteria,fungi,protozoa...)
-k or -kingdom [XXX] allows to indicate kingdom of the organism (see the examples above)
-s or -species [XXX] allows to indicate the species (must be combined with -k option)
-taxid [XXX] allows to indicate a specific taxid (must be combined with -k option)
-assembly_or_project [XXX] allows to indicate a specific assembly accession or bioproject (must be combined with -k option)
-date [XXX] indicates the release date (with format yyyy-mm-dd) from which sequence information are available
-l or -level [XXX] allows to select a specific assembly level (e.g. "Complete Genome")
-o or -output [XXX] allows users to name the output result folder
-n or -number [XXX] allows to limit the total number of assemblies to be downloaded
-c or -components [XXX] allows to select specific components of the assembly (e.g. plasmid, chromosome, ...)
-ena [XXX] allows to download report and fasta file given a ENA sequence ID
-fastq [XXX] allows to download FASTQ sequences from ENA given a run accession (https://ena-docs.readthedocs.io/en/latest/faq/archive-generated-files.html)
-log allows to create a log file

How to run

Download the FASTA and GBFF files for complete Helicobacter pylori genomes registered (made available) on or after 2022-04-01.

perl getSequenceInfo.pl -k bacteria -s "Helicobacter pylori" -l "Complete Genome" -date 2022-04-01

-k specifies the kingdom of the organism
-s specifies the species (must be combined with the -k option)
-l selects a specific assembly level (e.g. "Complete Genome")
-date sets the release date (format yyyy-mm-dd) from which sequence information is retrieved
-o names the output result folder
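When the tool is called from a script, the values containing spaces ("Helicobacter pylori", "Complete Genome") are the part that usually goes wrong. Here is a minimal Python sketch that assembles the command above as an argument list, so no manual quoting is needed. `build_command` is a hypothetical helper written for this post; only the flags documented in the help output are used.

```python
import shlex
from datetime import date


def build_command(kingdom, species=None, level=None, since=None,
                  output=None, script="getSequenceInfo.pl"):
    """Assemble a getSequenceInfo call as an argument list.

    Passing a list to subprocess.run avoids shell quoting problems with
    values that contain spaces.
    """
    cmd = ["perl", script, "-k", kingdom]
    if species:
        cmd += ["-s", species]
    if level:
        cmd += ["-l", level]
    if since:
        cmd += ["-date", since.isoformat()]  # yyyy-mm-dd, as the help requires
    if output:
        cmd += ["-o", output]
    return cmd


cmd = build_command("bacteria", species="Helicobacter pylori",
                    level="Complete Genome", since=date(2022, 4, 1))
# shlex.join shows the equivalent, safely quoted shell command line.
print(shlex.join(cmd))
```

To actually run it, pass the list to `subprocess.run(cmd, check=True)` from inside the cloned getSequenceInfo directory.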

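Assembly, GenBank, and Report directories are created in the current directory, and the downloaded files are saved in them. A summary.xls file is written at the end. Once the download finishes, these directories are moved into a subdirectory of the directory given with -o (default: Results/xxx).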

Assembly/

GenBank/

Report/

Results/bacteria_2022_7_14/assembly_repository_Helicobacter pylori__bacteria_2022_7_14/
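To get a quick overview of what ended up in the results folder, a small Python sketch like the following can help. The folder name pattern (kingdom plus the run date with unpadded month and day) is inferred only from the single example path above, so treat `run_directory` as a guess; both function names are made up for this post.

```python
from collections import Counter
from datetime import date
from pathlib import Path


def run_directory(kingdom, run_date, output="Results"):
    """Guess the per-run folder, e.g. Results/bacteria_2022_7_14.

    Assumption: month and day are not zero-padded, as in the example path.
    """
    name = f"{kingdom}_{run_date.year}_{run_date.month}_{run_date.day}"
    return Path(output) / name


def count_files_by_suffix(root):
    """Count files under root (recursively), grouped by their last suffix."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

For the run shown here, `count_files_by_suffix(run_directory("bacteria", date(2022, 7, 14)))` would list how many files of each type were downloaded.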

Download plasmids of the bacteria Buchnera aphidicola (taxID: 9) and Shewanella putrefaciens (taxID: 24).

perl getSequenceInfo.pl -k "bacteria" -taxid 9,24 -n 10 -c plasmid -dir genbank -o Results

-n limits the total number of assemblies to be downloaded
-taxid specifies one or more taxids (must be combined with the -k option)
-c selects specific components of the assembly (e.g. plasmid, chromosome, ...)
-dir specifies the NCBI nucleotide sequence repository (default: genbank)
-o names the output result folder
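Conceptually, options such as -taxid, -l, -date, and -n amount to filtering rows of an NCBI assembly summary file (the file that -getSummaries refreshes). The Python sketch below illustrates that kind of filter. It is not the tool's own code: the column positions are my reading of NCBI's assembly_summary.txt layout and should be checked against the header of the file you actually have, and `select_assemblies` is a hypothetical name.

```python
from datetime import date

# 0-based column positions, assumed from NCBI's assembly_summary.txt layout.
COLUMNS = {
    "assembly_accession": 0, "taxid": 5, "species_taxid": 6,
    "organism_name": 7, "assembly_level": 11, "seq_rel_date": 14,
    "ftp_path": 19,
}


def parse_release_date(text):
    """Parse yyyy-mm-dd or yyyy/mm/dd into a date object."""
    year, month, day = (int(p) for p in text.replace("/", "-").split("-"))
    return date(year, month, day)


def select_assemblies(lines, taxids=None, level=None, since=None, limit=None):
    """Filter assembly summary lines by taxid, assembly level and release date."""
    selected = []
    for line in lines:
        if limit is not None and len(selected) >= limit:
            break
        if line.startswith("#") or not line.strip():
            continue  # skip header/comment lines and blanks
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= COLUMNS["ftp_path"]:
            continue  # malformed or truncated row
        record = {name: fields[i] for name, i in COLUMNS.items()}
        if taxids and not ({record["taxid"], record["species_taxid"]} & set(taxids)):
            continue
        if level and record["assembly_level"] != level:
            continue
        if since and parse_release_date(record["seq_rel_date"]) < since:
            continue
        selected.append(record)
    return selected
```

For the command above, the equivalent call would be roughly `select_assemblies(open("assembly_summary.txt"), taxids={"9", "24"}, limit=10)`; picking out plasmids (-c) happens at the sequence level and is not covered here.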

Download the ENA entry BN000065.

perl getSequenceInfo.pl -ena BN000065 -o outdir

-ena downloads the report and FASTA file for a given ENA sequence ID

outdir_name/
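For comparison, a single ENA record can also be fetched directly over HTTP. The sketch below uses the ENA Browser API endpoint as I understand it (`/ena/browser/api/fasta/<accession>` and `/ena/browser/api/embl/<accession>`); whether getSequenceInfo uses the same endpoint internally is not something this post confirms, and the function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

# Assumed base URL of the ENA Browser API.
ENA_BROWSER_API = "https://www.ebi.ac.uk/ena/browser/api"


def ena_record_url(accession, fmt="fasta"):
    """Build the download URL for one ENA sequence record."""
    if fmt not in ("fasta", "embl"):
        raise ValueError("fmt must be 'fasta' or 'embl'")
    return f"{ENA_BROWSER_API}/{fmt}/{quote(accession)}"


def fetch_ena_record(accession, fmt="fasta", timeout=30):
    """Download the record as text (requires network access)."""
    with urlopen(ena_record_url(accession, fmt), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

`fetch_ena_record("BN000065")` would return the FASTA text for the entry used in this example.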

Download the ENA runs ERR818002 and ERR818004.

perl getSequenceInfo.pl -fastq ERR818002,ERR818004

-fastq downloads FASTQ sequences from ENA for a given run accession

A directory is created and the files are saved in it.

ERR818002_folder/
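If you only need the FASTQ locations, the ENA Portal API "filereport" endpoint returns them as a TSV table. The Python sketch below builds the query URL and parses such a report; it assumes the `fastq_ftp` field holds semicolon-separated paths without a scheme, which matches my understanding of the API but is not taken from the getSequenceInfo source. All function names are hypothetical.

```python
from urllib.parse import urlencode

# Assumed ENA Portal API endpoint for per-run file reports.
ENA_FILEREPORT = "https://www.ebi.ac.uk/ena/portal/api/filereport"


def split_accessions(arg):
    """Split a comma-separated -fastq style argument into run accessions."""
    return [acc.strip() for acc in arg.split(",") if acc.strip()]


def filereport_url(run_accession):
    """Build the filereport query URL for one run accession."""
    query = urlencode({
        "accession": run_accession,
        "result": "read_run",
        "fields": "run_accession,fastq_ftp",
        "format": "tsv",
    })
    return f"{ENA_FILEREPORT}?{query}"


def fastq_urls(report_tsv):
    """Map each run accession in a filereport TSV to its FASTQ URLs."""
    lines = [line for line in report_tsv.splitlines() if line.strip()]
    if len(lines) < 2:
        return {}
    header = lines[0].split("\t")
    run_col = header.index("run_accession")
    ftp_col = header.index("fastq_ftp")
    urls = {}
    for line in lines[1:]:
        fields = line.split("\t")
        # Paired-end runs list several files separated by ';'.
        paths = [p for p in fields[ftp_col].split(";") if p]
        urls[fields[run_col]] = ["ftp://" + p for p in paths]
    return urls
```

For the example above, `split_accessions("ERR818002,ERR818004")` gives the two run IDs, and each one can be passed to `filereport_url` to look up its files.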

Citation

getSequenceInfo: a suite of tools allowing to get genome sequence information from public repositories
Vincent Moco, Damien Cazenave, Maëlle Garnier, Matthieu Pot, Isabel Marcelino, Antoine Talarmin, Stéphanie Guyomard-Rabenirina, Sébastien Breurec, Séverine Ferdinand, Alexis Dereeper, Yann Reynaud & David Couvin
BMC Bioinformatics volume 23, Article number: 268 (2022)