riboSeed: assembling across entire ribosomal regions by exploiting prokaryotic genome structure

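Most bacterial genome sequencing is performed with Illumina short reads. Because short reads alone struggle to resolve repeated regions, only about 10% of sequencing projects have yielded a closed genome. The most common repeated regions are those encoding ribosomal operons (rDNA), which occur 1 to 15 times in a bacterial genome and are widely used as sequence markers for classifying and identifying bacteria. This study exploits the conservation of rDNA across taxa and the uniqueness of the regions flanking each rDNA within a genome to improve assembly of rDNA regions relative to de novo sequencing. The authors describe a method that builds targeted pseudocontigs by iteratively assembling the reads that map to the rDNA of a reference genome. These pseudocontigs are then used to assemble the newly sequenced chromosome more accurately. Implemented as riboSeed, the method is shown to correctly bridge adjacent contigs in bacterial genome assemblies and, combined with other genome polishing tools, to assist in closing genomes.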

Overview

https://nickp60.github.io/riboSeed/

Documentation

https://riboseed.readthedocs.io/en/latest/

Installation

Source: GitHub

#bioconda
mamba create -n riboseed -y
conda activate riboseed
mamba install -c bioconda riboseed -y

#docker 
docker pull nickp60/riboseed:latest

> ribo -h

$ ribo -h

riboSeed v0.4.90

Contact: Nick Waters <nickp60@gmail.com>

Description: A suite of tools to perform de fere novo assembly to bridge
             gaps caused by rDNA repeats

Usage: ribo <command> [options]

Available commands:
  run        execute pipeline (scan, select, seed, and more)
  scan       reannotate rRNAs in a FASTA file
  select     group rRNA annotations into rDNA operons
  seed       perform de fere novo assembly
  snag       extract rDNA regions and plot entropy
  sim        perform simulations used in manuscript
  sketch     plot results from a de fere novo assembly
  stack      compare coverage depth in rDNA regions to rest of genome
  score      score batches of assemblies with BLASTn
  swap       swap contigs from assemblies
  spec       use assembly graph to speculate number of rDNAs
  structure  view the rRNA operon structure across several genomes
  config     write out a blank config file to be used with `run`
  try        runs the pipeline on some included sample data
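
The `try` subcommand listed above offers a quick smoke test after installation. The following is a minimal sketch, assuming `ribo` is provided by the active conda environment. `have_tool` is a hypothetical helper name, and `ribo try` is called without options because its flags are not shown in this post.

```shell
#!/bin/sh
# have_tool is a hypothetical helper: it succeeds only if the named
# executable can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

if have_tool ribo; then
    # `try` runs the pipeline on the bundled sample data (see `ribo -h`).
    ribo try
else
    echo "ribo not found; activate the riboseed environment first" >&2
fi
```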

> ribo run -h

$ ribo run -h

usage: ribo run [-r reference.fasta] -c config_file [-o /output/dir/]
                [-e experiment_name] [-K {bac,euk,arc,mito}] [-S 16S:23S:5S]
                [--clusters str] [-C str] [-F reads_F.fq] [-R reads_R.fq]
                [-S1 reads_S.fq] [-s int]
                [--ref_as_contig {ignore,infer,trusted,untrusted}] [--linear]
                [--subassembler {spades,skesa}] [-j] [-l int]
                [-k 21,33,55,77,99,127] [--force_kmers] [-p 21,33,55,77,99]
                [-d int] [--clean_temps] [-i int] [--skip_control]
                [-v {1,2,3,4,5}] [--cores int] [--memory int]
                [--damn_the_torpedos]
                [--stages {sketch,spec,snag,score,stack,none} [{sketch,spec,snag,score,stack,none} ...]]
                [-t {1,2,4}] [--additional_libs str] [-z] [-h] [--version]

Run the riboSeed pipeline of scan, select, and seed, plus any additional
stages. Uses a config file to wrangle all the args not available via these
commandline args. This can either be run by providing (as minimum) a
reference, some reads, and an output directory; or, if you have a completed
config file, you can run it with just that.

optional arguments:
  -r reference.fasta, --reference_fasta reference.fasta
                        path to a (multi)fasta or a directory containing one
                        or more chromosomal sequences in fasta format.
                        Required, unless using a config file
  -c config_file, --config config_file
                        config file; if none given, create one; default:
                        /Users/kazu/Desktop/T1
  -o /output/dir/, --output /output/dir/
                        output directory; default:
                        /Users/kazu/Desktop/T1/2021-05-02T1322_riboSeed_pipeline_results/
  -e experiment_name, --experiment_name experiment_name
                        prefix for results files; default: inferred
  -K {bac,euk,arc,mito}, --Kingdom {bac,euk,arc,mito}
                        whether to look for eukaryotic, archaeal, or bacterial
                        rDNA; default: bac
  -S 16S:23S:5S, --specific_features 16S:23S:5S
                        colon:separated -- specific features; default:
                        16S:23S:5S
  --clusters str        number of rDNA clusters;if submitting multiple
                        records, must be a colon:separated list whose length
                        matches number of genbank records. Default is inferred
                        from specific feature with fewest hits
  -C str, --cluster_file str
                        clustered_loci file output from riboSelect;this is
                        created by default from run_riboSeed, but if you don't
                        agree with the operon structure predicted by
                        riboSelect, you can use your alternate clustered_loci
                        file. default: None
  -F reads_F.fq, --fastq1 reads_F.fq
                        path to forward fastq file, can be compressed
  -R reads_R.fq, --fastq2 reads_R.fq
                        path to reverse fastq file, can be compressed
  -S1 reads_S.fq, --fastq_single1 reads_S.fq
                        path to single fastq file
  -s int, --score_min int
                        If using smalt, this sets the '-m' param; default with
                        smalt is inferred from read length. If using BWA,
                        reads mapping with ASscore lower than this will be
                        rejected; default with BWA is half of read length
  --ref_as_contig {ignore,infer,trusted,untrusted}
                        ignore: reference will not be used in subassembly.
                        trusted: SPAdes will use the seed sequences as a
                        --trusted-contig; untrusted: SPAdes will treat as
                        --untrusted-contig. infer: if mapping percentage over
                        80%, 'trusted'; else 'untrusted'. See SPAdes docs for
                        details. default: infer
  --linear              if genome is known to not be circular and a region of
                        interest (including flanking bits) extends past
                        chromosome end, this extends the seqence past
                        chromosome origin forward by --padding; default: False
  --subassembler {spades,skesa}
                        assembler to use for subassembly scheme. SPAdes is
                        used by default, but Skesa is a new addition that
                        seems to work for subassembly and is faster
  -j, --just_seed       Don't do an assembly, just generate the long read
                        'seeds'; default: False
  -l int, --flanking_length int
                        length of flanking regions, in bp; default: 1000
  -k 21,33,55,77,99,127, --kmers 21,33,55,77,99,127
                        kmers used for final assembly, separated by commas
                        such as21,33,55,77,99,127. Can be set to 'auto', where
                        SPAdes chooses. We ensure kmers are not too big or too
                        close to read length; default: 21,33,55,77,99,127
  --force_kmers         skip checking to see if kmerchoice is appropriate to
                        read length. Sometimes kmers longer than reads can
                        help in the final assembly, as the long reads
                        generated by riboSeed contain kmers longer than the
                        read length
  -p 21,33,55,77,99, --pre_kmers 21,33,55,77,99
                        kmers used during seeding assemblies, separated bt
                        commas; default: 21,33,55,77,99
  -d int, --min_flank_depth int
                        a subassembly won't be performed if this minimum depth
                        is not achieved on both the 3' and5' end of the
                        pseudocontig. default: 0
  --clean_temps         if --clean_temps, mapping files will be removed once
                        they are no no longer needed during the mapping
                        iterations to save space; default: False
  -i int, --iterations int
                        if iterations>1, multiple seedings will occur after
                        subassembly of seed regions; if setting --target_len,
                        seedings will continue until --iterations are
                        completed or --target_len is matched or exceeded;
                        default: 3
  --skip_control        if --skip_control, no de novo assembly will be done;
                        default: False
  -v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
                        Logger writes debug to file in output dir; this sets
                        verbosity level sent to stderr. 1 = debug(), 2 =
                        info(), 3 = warning(), 4 = error() and 5 = critical();
                        default: 2
  --cores int           cores used; default: None
  --memory int          cores for multiprocessing; default: 8
  --damn_the_torpedos   Ignore certain errors, full speed ahead!
  --stages {sketch,spec,snag,score,stack,none} [{sketch,spec,snag,score,stack,none} ...]
                        Which assessment stages you wish to run: sketch, spec,
                        snag, score, stack. Any combination thereof
  -t {1,2,4}, --threads {1,2,4}
                        if your cores are hyperthreaded, set number threads to
                        the number of threads per processer.If unsure, see
                        'cat /proc/cpuinfo' under 'cpu cores', or 'lscpu'
                        under 'Thread(s) per core'.: 1
  --additional_libs str
                        include these libraries in final assembly in addition
                        to the reads supplied as -F and -R. They must be
                        supplied according to SPAdes arg naming scheme. Use at
                        own risk.default: None
  -z, --serialize       if --serialize, runs seeding and assembly without
                        multiprocessing. We recommend this for machines with
                        less than 8GB RAM: False
  -h, --help            Displays this help message
  --version             show program's version number and exit
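
The resource-related options in the help text above (`--cores` and `-z/--serialize`) can be chosen from the machine's specification. The following is a minimal sketch: `resource_flags` is a hypothetical helper, not part of riboSeed, and the 8 GB threshold is taken from the `--serialize` recommendation in the help output.

```shell
#!/bin/sh
# resource_flags CORES MEM_GB  (hypothetical helper, not part of riboSeed)
# Prints flags for `ribo run`; adds --serialize below 8 GB of RAM,
# following the recommendation in the --serialize help text.
resource_flags() {
    cores=$1
    mem_gb=$2
    flags="--cores $cores"
    if [ "$mem_gb" -lt 8 ]; then
        flags="$flags --serialize"
    fi
    echo "$flags"
}

resource_flags 4 4     # prints: --cores 4 --serialize
resource_flags 8 16    # prints: --cores 8
```

The printed flags can be appended to a `ribo run` command line through command substitution, for example `$(resource_flags 4 4)`.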

Usage

The ribo run command executes the most commonly used series of subcommands: scan, select, seed, sketch, score, and so on.

ribo run -r concatenated_seq.fasta -F test_reads1.fq -R test_reads2.fq -o outdir -v 1

-r path to a (multi)fasta or a directory containing one or more chromosomal sequences in fasta format. Required, unless using a config file
-c config file; if none given, create one
-o output directory
-F path to forward fastq file, can be compressed
-R path to reverse fastq file, can be compressed
-S1 path to single fastq file
-v {1,2,3,4,5} Logger writes debug to file in output dir; this sets verbosity level sent to stderr. 1 = debug(), 2 = info(), 3 = warning(), 4 = error() and 5 = critical(); default: 2
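
The invocation above can be wrapped in a small script that checks the inputs first. This is a minimal sketch, not part of riboSeed: `run_riboseed` and the `DRY_RUN` variable are hypothetical names, the file names are the placeholders used in this post, and only flags listed in `ribo run -h` are passed.

```shell
#!/bin/sh
# run_riboseed REF FWD REV OUTDIR  (hypothetical wrapper around `ribo run`)
# Verifies that the three input files exist, then runs a paired-end job.
# With DRY_RUN=1 the command is printed instead of executed.
run_riboseed() {
    ref=$1; fwd=$2; rev=$3; out=$4
    for f in "$ref" "$fwd" "$rev"; do
        if [ ! -f "$f" ]; then
            echo "missing input file: $f" >&2
            return 1
        fi
    done
    # Reuse the function's positional parameters to hold the command line.
    set -- ribo run -r "$ref" -F "$fwd" -R "$rev" -o "$out" -v 1
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example call (not executed here):
#   run_riboseed concatenated_seq.fasta test_reads1.fq test_reads2.fq outdir
```

Setting `DRY_RUN=1` before the call makes it easy to inspect the exact command before committing to a long assembly.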

Citation

riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions

Nicholas R Waters, Florence Abram, Fiona Brennan, Ashleigh Holmes, Leighton Pritchard

Nucleic Acids Res. 2018 Jun 20;46(11):e68

macでインフォマティクス

A collection of informatics notes related to HTS (NGS).
