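The vast majority of bacterial genome sequencing has been performed with Illumina short reads. Because repeated regions are difficult to resolve with short reads alone, only about 10% of sequencing projects have produced a closed genome. The most common repeated regions are those coding for ribosomal operons (rDNAs), which occur 1 to 15 times in a bacterial genome and are typically used as sequence markers to classify and identify bacteria. The authors exploit the conservation of rDNAs across taxa and the uniqueness of their flanking regions within a genome to improve the assembly of rDNA regions relative to de novo sequencing. They describe a method that constructs targeted pseudocontigs by iteratively assembling the reads that map to the rDNAs of a reference genome. These pseudocontigs are then used to assemble the newly sequenced chromosome more accurately. The method, implemented as riboSeed, is shown to correctly bridge adjacent contigs in bacterial genome assemblies and, when used together with other genome polishing tools, to assist in closing a genome.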
Overview
https://nickp60.github.io/riboSeed/
Documentation
https://riboseed.readthedocs.io/en/latest/
Installation
Source code: GitHub
#bioconda (link)
mamba create -n riboseed -y
conda activate riboseed
mamba install -c bioconda riboseed -y
#docker
docker pull nickp60/riboseed:latest
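After installing, it can be worth confirming that the ribo entry point is actually on the PATH of the active environment before starting a long run. The helper below is a minimal sketch: have_cmd is a hypothetical name used only for illustration (it is not part of riboSeed), and it relies only on the standard `command -v` shell builtin.

```shell
# have_cmd is a hypothetical helper, not part of riboSeed.
# It returns 0 if the named program is found on PATH, non-zero otherwise.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

if have_cmd ribo; then
    echo "ribo found: $(command -v ribo)"
else
    echo "ribo not found; is the riboseed environment activated?" >&2
fi
```

If the check passes, the try subcommand, which the help output describes as running the pipeline on included sample data, is a natural next smoke test.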
> ribo -h
$ ribo -h
riboSeed v0.4.90
Contact: Nick Waters <nickp60@gmail.com>
Description: A suite of tools to perform de fere novo assembly to bridge
gaps caused by rDNA repeats
Usage: ribo <command> [options]
Available commands:
run execute pipeline (scan, select, seed, and more)
scan reannotate rRNAs in a FASTA file
select group rRNA annotations into rDNA operons
seed perform de fere novo assembly
snag extract rDNA regions and plot entropy
sim perform simulations used in manuscript
sketch plot results from a de fere novo assembly
stack compare coverage depth in rDNA regions to rest of genome
score score batches of assemblies with BLASTn
swap swap contigs from assemblies
spec use assembly graph to speculate number of rDNAs
structure view the rRNA operon structure across several genomes
config write out a blank config file to be used with `run`
try runs the pipeline on some included sample data
> ribo run -h
$ ribo run -h
usage: ribo run [-r reference.fasta] -c config_file [-o /output/dir/]
[-e experiment_name] [-K {bac,euk,arc,mito}] [-S 16S:23S:5S]
[--clusters str] [-C str] [-F reads_F.fq] [-R reads_R.fq]
[-S1 reads_S.fq] [-s int]
[--ref_as_contig {ignore,infer,trusted,untrusted}] [--linear]
[--subassembler {spades,skesa}] [-j] [-l int]
[-k 21,33,55,77,99,127] [--force_kmers] [-p 21,33,55,77,99]
[-d int] [--clean_temps] [-i int] [--skip_control]
[-v {1,2,3,4,5}] [--cores int] [--memory int]
[--damn_the_torpedos]
[--stages {sketch,spec,snag,score,stack,none} [{sketch,spec,snag,score,stack,none} ...]]
[-t {1,2,4}] [--additional_libs str] [-z] [-h] [--version]
Run the riboSeed pipeline of scan, select, and seed, plus any additional
stages. Uses a config file to wrangle all the args not available via these
commandline args. This can either be run by providing (as minimum) a
reference, some reads, and an output directory; or, if you have a completed
config file, you can run it with just that.
optional arguments:
-r reference.fasta, --reference_fasta reference.fasta
path to a (multi)fasta or a directory containing one
or more chromosomal sequences in fasta format.
Required, unless using a config file
-c config_file, --config config_file
config file; if none given, create one; default:
/Users/kazu/Desktop/T1
-o /output/dir/, --output /output/dir/
output directory; default: /Users/kazu/Desktop/T1/2021
-05-02T1322_riboSeed_pipeline_results/
-e experiment_name, --experiment_name experiment_name
prefix for results files; default: inferred
-K {bac,euk,arc,mito}, --Kingdom {bac,euk,arc,mito}
whether to look for eukaryotic, archaeal, or bacterial
rDNA; default: bac
-S 16S:23S:5S, --specific_features 16S:23S:5S
colon:separated -- specific features; default:
16S:23S:5S
--clusters str number of rDNA clusters;if submitting multiple
records, must be a colon:separated list whose length
matches number of genbank records. Default is inferred
from specific feature with fewest hits
-C str, --cluster_file str
clustered_loci file output from riboSelect;this is
created by default from run_riboSeed, but if you don't
agree with the operon structure predicted by
riboSelect, you can use your alternate clustered_loci
file. default: None
-F reads_F.fq, --fastq1 reads_F.fq
path to forward fastq file, can be compressed
-R reads_R.fq, --fastq2 reads_R.fq
path to reverse fastq file, can be compressed
-S1 reads_S.fq, --fastq_single1 reads_S.fq
path to single fastq file
-s int, --score_min int
If using smalt, this sets the '-m' param; default with
smalt is inferred from read length. If using BWA,
reads mapping with ASscore lower than this will be
rejected; default with BWA is half of read length
--ref_as_contig {ignore,infer,trusted,untrusted}
ignore: reference will not be used in subassembly.
trusted: SPAdes will use the seed sequences as a
--trusted-contig; untrusted: SPAdes will treat as
--untrusted-contig. infer: if mapping percentage over
80%, 'trusted'; else 'untrusted'. See SPAdes docs for
details. default: infer
--linear if genome is known to not be circular and a region of
interest (including flanking bits) extends past
chromosome end, this extends the seqence past
chromosome origin forward by --padding; default: False
--subassembler {spades,skesa}
assembler to use for subassembly scheme. SPAdes is
used by default, but Skesa is a new addition that
seems to work for subassembly and is faster
-j, --just_seed Don't do an assembly, just generate the long read
'seeds'; default: False
-l int, --flanking_length int
length of flanking regions, in bp; default: 1000
-k 21,33,55,77,99,127, --kmers 21,33,55,77,99,127
kmers used for final assembly, separated by commas
such as21,33,55,77,99,127. Can be set to 'auto', where
SPAdes chooses. We ensure kmers are not too big or too
close to read length; default: 21,33,55,77,99,127
--force_kmers skip checking to see if kmerchoice is appropriate to
read length. Sometimes kmers longer than reads can
help in the final assembly, as the long reads
generated by riboSeed contain kmers longer than the
read length
-p 21,33,55,77,99, --pre_kmers 21,33,55,77,99
kmers used during seeding assemblies, separated bt
commas; default: 21,33,55,77,99
-d int, --min_flank_depth int
a subassembly won't be performed if this minimum depth
is not achieved on both the 3' and5' end of the
pseudocontig. default: 0
--clean_temps if --clean_temps, mapping files will be removed once
they are no no longer needed during the mapping
iterations to save space; default: False
-i int, --iterations int
if iterations>1, multiple seedings will occur after
subassembly of seed regions; if setting --target_len,
seedings will continue until --iterations are
completed or --target_len is matched or exceeded;
default: 3
--skip_control if --skip_control, no de novo assembly will be done;
default: False
-v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
Logger writes debug to file in output dir; this sets
verbosity level sent to stderr. 1 = debug(), 2 =
info(), 3 = warning(), 4 = error() and 5 = critical();
default: 2
--cores int cores used; default: None
--memory int cores for multiprocessing; default: 8
--damn_the_torpedos Ignore certain errors, full speed ahead!
--stages {sketch,spec,snag,score,stack,none} [{sketch,spec,snag,score,stack,none} ...]
Which assessment stages you wish to run: sketch, spec,
snag, score, stack. Any combination thereof
-t {1,2,4}, --threads {1,2,4}
if your cores are hyperthreaded, set number threads to
the number of threads per processer.If unsure, see
'cat /proc/cpuinfo' under 'cpu cores', or 'lscpu'
under 'Thread(s) per core'.: 1
--additional_libs str
include these libraries in final assembly in addition
to the reads supplied as -F and -R. They must be
supplied according to SPAdes arg naming scheme. Use at
own risk.default: None
-z, --serialize if --serialize, runs seeding and assembly without
multiprocessing. We recommend this for machines with
less than 8GB RAM: False
-h, --help Displays this help message
--version show program's version number and exit
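The help text for -k says that riboSeed makes sure the k-mers are not too big or too close to the read length, but it does not spell out the rule. The sketch below shows one way to pre-check a k-mer list by hand. Both function names are hypothetical, the "strictly smaller than the read length" rule is an assumption of this sketch rather than riboSeed's own logic, and first_read_len only looks at the first record of an uncompressed FASTQ file.

```shell
# Hypothetical helpers for illustration; not part of riboSeed.

# first_read_len FASTQ
# Prints the length of the first read in an uncompressed FASTQ file
# (line 2 of the file holds the first sequence).
first_read_len() {
    awk 'NR == 2 { print length($0); exit }' "$1"
}

# filter_kmers READ_LENGTH KMER_LIST
# Prints the comma-separated subset of KMER_LIST that is strictly
# smaller than READ_LENGTH (assumed rule, see the note above).
# The body runs in a subshell so the IFS change does not leak out.
filter_kmers() (
    read_len=$1
    kept=""
    IFS=,
    for k in $2; do
        if [ "$k" -lt "$read_len" ]; then
            kept="${kept:+$kept,}$k"
        fi
    done
    echo "$kept"
)

filter_kmers 100 21,33,55,77,99,127   # prints 21,33,55,77,99
```

The filtered list could then be passed to -k. Since the help notes that k-mers longer than the reads can sometimes help the final assembly (see --force_kmers), a filter like this is only a conservative starting point.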
Usage
The ribo run command executes the most commonly used series of commands, such as scan, select, seed, sketch, and score. The reference is passed with -r.
ribo run -r concatenated_seq.fasta -F test_reads1.fq -R test_reads2.fq -o outdir -v 1
- -r path to a (multi)fasta or a directory containing one or more chromosomal sequences in fasta format. Required, unless using a config file
- -c config file; if none given, create one
- -o output directory
- -F path to forward fastq file, can be compressed
- -R path to reverse fastq file, can be compressed
- -S1 path to single fastq file
- -v {1,2,3,4,5} Logger writes debug to file in output dir; this sets verbosity level sent to stderr. 1 = debug(), 2 = info(), 3 = warning(), 4 = error() and 5 = critical(); default: 2
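Because a run can take a long time, it may help to validate the inputs and inspect the exact command line before launching it. The wrapper below is a minimal sketch under stated assumptions: ribo_run_checked and the DRY_RUN variable are hypothetical names introduced here, and the only riboSeed flags used are the ones listed in the options above (-r, -F, -R, -o, -v).

```shell
# ribo_run_checked is a hypothetical wrapper, not part of riboSeed.
# Usage: ribo_run_checked REF FWD REV OUTDIR
# Set DRY_RUN=1 to print the command instead of executing it.
ribo_run_checked() {
    ref=$1 fwd=$2 rev=$3 outdir=$4

    # Refuse to start if any input file is missing or empty.
    for f in "$ref" "$fwd" "$rev"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            return 1
        fi
    done

    # Rebuild the positional parameters as the command to run.
    set -- ribo run -r "$ref" -F "$fwd" -R "$rev" -o "$outdir" -v 1

    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example (commented out so that sourcing this file runs nothing):
# DRY_RUN=1 ribo_run_checked concatenated_seq.fasta test_reads1.fq test_reads2.fq outdir
```

In dry-run mode the function only prints the command, so it can be tried without riboSeed installed; without DRY_RUN it executes ribo with the same arguments.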
Citation
riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions
Nicholas R Waters, Florence Abram, Fiona Brennan, Ashleigh Holmes, Leighton Pritchard
Nucleic Acids Res. 2018 Jun 20;46(11):e68