macでインフォマティクス

A collection of informatics notes related to HTS (NGS).

riboSeed: assembling across entire ribosomal regions by exploiting prokaryotic genome structure

 

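Most bacterial genome sequencing has been performed with Illumina short reads. However, because repeated regions are difficult to resolve with short reads alone, only about 10% of sequencing projects have yielded a closed genome. The most common repeated regions are those encoding ribosomal operons (rDNAs), which occur 1 to 15 times in a bacterial genome and are commonly used as sequence markers to classify and identify bacteria. This study exploits the conservation of rDNAs across taxa and the uniqueness of the regions flanking each rDNA within a genome to improve the assembly of rDNA regions relative to de novo sequencing. The authors describe a method that builds targeted pseudocontigs by iteratively assembling reads that map to the rDNAs of a reference genome. These pseudocontigs are then used to assemble the newly sequenced chromosome more accurately. Implemented as riboSeed, the method correctly bridges adjacent contigs in bacterial genome assemblies and, when used together with other genome polishing tools, can assist in closing a genome.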

 

Overview

https://nickp60.github.io/riboSeed/

Documentation

https://riboseed.readthedocs.io/en/latest/

 

Installation

Main repository: GitHub

#bioconda
mamba create -n riboseed -y
conda activate riboseed
mamba install -c bioconda riboseed -y

#docker
docker pull nickp60/riboseed:latest
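
After installing, it can be worth confirming that the ribo executable is actually visible in the current shell. The following is a minimal sketch; the helper name have_tool is made up for this example and is not part of riboSeed.

```shell
# Return success if the named tool is on PATH (defaults to ribo).
# have_tool is a hypothetical helper, not something riboSeed provides.
have_tool() {
    command -v "${1:-ribo}" >/dev/null 2>&1
}

if have_tool ribo; then
    # Print the top-level help to confirm the install responds.
    ribo -h
else
    echo "ribo not found on PATH; run 'conda activate riboseed' first" >&2
fi
```

The ribo help also lists a try subcommand, described as running the pipeline on some included sample data, which may serve as a more thorough check than printing the help.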

> ribo -h 

$ ribo -h
riboSeed v0.4.90
Contact: Nick Waters <nickp60@gmail.com>
Description: A suite of tools to perform de fere novo assembly to bridge
             gaps caused by rDNA repeats

Usage: ribo <command> [options]

Available commands:
  run        execute pipeline (scan, select, seed, and more)
  scan       reannotate rRNAs in a FASTA file
  select     group rRNA annotations into rDNA operons
  seed       perform de fere novo assembly
  snag       extract rDNA regions and plot entropy
  sim        perform simulations used in manuscript
  sketch     plot results from a de fere novo assembly
  stack      compare coverage depth in rDNA regions to rest of genome
  score      score batches of assemblies with BLASTn
  swap       swap contigs from assemblies
  spec       use assembly graph to speculate number of rDNAs
  structure  view the rRNA operon structure across several genomes
  config     write out a blank config file to be used with `run`
  try        runs the pipeline on some included sample data

> ribo run -h

$ ribo run -h
usage: ribo run [-r reference.fasta] -c config_file [-o /output/dir/]
                [-e experiment_name] [-K {bac,euk,arc,mito}] [-S 16S:23S:5S]
                [--clusters str] [-C str] [-F reads_F.fq] [-R reads_R.fq]
                [-S1 reads_S.fq] [-s int]
                [--ref_as_contig {ignore,infer,trusted,untrusted}] [--linear]
                [--subassembler {spades,skesa}] [-j] [-l int]
                [-k 21,33,55,77,99,127] [--force_kmers] [-p 21,33,55,77,99]
                [-d int] [--clean_temps] [-i int] [--skip_control]
                [-v {1,2,3,4,5}] [--cores int] [--memory int]
                [--damn_the_torpedos]
                [--stages {sketch,spec,snag,score,stack,none} [{sketch,spec,snag,score,stack,none} ...]]
                [-t {1,2,4}] [--additional_libs str] [-z] [-h] [--version]

Run the riboSeed pipeline of scan, select, and seed, plus any additional
stages. Uses a config file to wrangle all the args not available via these
commandline args. This can either be run by providing (as minimum) a
reference, some reads, and an output directory; or, if you have a completed
config file, you can run it with just that.

optional arguments:
  -r reference.fasta, --reference_fasta reference.fasta
                        path to a (multi)fasta or a directory containing one
                        or more chromosomal sequences in fasta format.
                        Required, unless using a config file
  -c config_file, --config config_file
                        config file; if none given, create one; default:
                        /Users/kazu/Desktop/T1
  -o /output/dir/, --output /output/dir/
                        output directory; default: /Users/kazu/Desktop/T1/2021
                        -05-02T1322_riboSeed_pipeline_results/
  -e experiment_name, --experiment_name experiment_name
                        prefix for results files; default: inferred
  -K {bac,euk,arc,mito}, --Kingdom {bac,euk,arc,mito}
                        whether to look for eukaryotic, archaeal, or bacterial
                        rDNA; default: bac
  -S 16S:23S:5S, --specific_features 16S:23S:5S
                        colon:separated -- specific features; default:
                        16S:23S:5S
  --clusters str        number of rDNA clusters;if submitting multiple
                        records, must be a colon:separated list whose length
                        matches number of genbank records. Default is inferred
                        from specific feature with fewest hits
  -C str, --cluster_file str
                        clustered_loci file output from riboSelect;this is
                        created by default from run_riboSeed, but if you don't
                        agree with the operon structure predicted by
                        riboSelect, you can use your alternate clustered_loci
                        file. default: None
  -F reads_F.fq, --fastq1 reads_F.fq
                        path to forward fastq file, can be compressed
  -R reads_R.fq, --fastq2 reads_R.fq
                        path to reverse fastq file, can be compressed
  -S1 reads_S.fq, --fastq_single1 reads_S.fq
                        path to single fastq file
  -s int, --score_min int
                        If using smalt, this sets the '-m' param; default with
                        smalt is inferred from read length. If using BWA,
                        reads mapping with ASscore lower than this will be
                        rejected; default with BWA is half of read length
  --ref_as_contig {ignore,infer,trusted,untrusted}
                        ignore: reference will not be used in subassembly.
                        trusted: SPAdes will use the seed sequences as a
                        --trusted-contig; untrusted: SPAdes will treat as
                        --untrusted-contig. infer: if mapping percentage over
                        80%, 'trusted'; else 'untrusted'. See SPAdes docs for
                        details. default: infer
  --linear              if genome is known to not be circular and a region of
                        interest (including flanking bits) extends past
                        chromosome end, this extends the seqence past
                        chromosome origin forward by --padding; default: False
  --subassembler {spades,skesa}
                        assembler to use for subassembly scheme. SPAdes is
                        used by default, but Skesa is a new addition that
                        seems to work for subassembly and is faster
  -j, --just_seed       Don't do an assembly, just generate the long read
                        'seeds'; default: False
  -l int, --flanking_length int
                        length of flanking regions, in bp; default: 1000
  -k 21,33,55,77,99,127, --kmers 21,33,55,77,99,127
                        kmers used for final assembly, separated by commas
                        such as21,33,55,77,99,127. Can be set to 'auto', where
                        SPAdes chooses. We ensure kmers are not too big or too
                        close to read length; default: 21,33,55,77,99,127
  --force_kmers         skip checking to see if kmerchoice is appropriate to
                        read length. Sometimes kmers longer than reads can
                        help in the final assembly, as the long reads
                        generated by riboSeed contain kmers longer than the
                        read length
  -p 21,33,55,77,99, --pre_kmers 21,33,55,77,99
                        kmers used during seeding assemblies, separated bt
                        commas; default: 21,33,55,77,99
  -d int, --min_flank_depth int
                        a subassembly won't be performed if this minimum depth
                        is not achieved on both the 3' and5' end of the
                        pseudocontig. default: 0
  --clean_temps         if --clean_temps, mapping files will be removed once
                        they are no no longer needed during the mapping
                        iterations to save space; default: False
  -i int, --iterations int
                        if iterations>1, multiple seedings will occur after
                        subassembly of seed regions; if setting --target_len,
                        seedings will continue until --iterations are
                        completed or --target_len is matched or exceeded;
                        default: 3
  --skip_control        if --skip_control, no de novo assembly will be done;
                        default: False
  -v {1,2,3,4,5}, --verbosity {1,2,3,4,5}
                        Logger writes debug to file in output dir; this sets
                        verbosity level sent to stderr. 1 = debug(), 2 =
                        info(), 3 = warning(), 4 = error() and 5 = critical();
                        default: 2
  --cores int           cores used; default: None
  --memory int          cores for multiprocessing; default: 8
  --damn_the_torpedos   Ignore certain errors, full speed ahead!
  --stages {sketch,spec,snag,score,stack,none} [{sketch,spec,snag,score,stack,none} ...]
                        Which assessment stages you wish to run: sketch, spec,
                        snag, score, stack. Any combination thereof
  -t {1,2,4}, --threads {1,2,4}
                        if your cores are hyperthreaded, set number threads to
                        the number of threads per processer.If unsure, see
                        'cat /proc/cpuinfo' under 'cpu cores', or 'lscpu'
                        under 'Thread(s) per core'.: 1
  --additional_libs str
                        include these libraries in final assembly in addition
                        to the reads supplied as -F and -R. They must be
                        supplied according to SPAdes arg naming scheme. Use at
                        own risk.default: None
  -z, --serialize       if --serialize, runs seeding and assembly without
                        multiprocessing. We recommend this for machines with
                        less than 8GB RAM: False
  -h, --help            Displays this help message
  --version             show program's version number and exit
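
The help above recommends --serialize on machines with less than 8GB of RAM and says --skip_control omits the de novo control assembly. As a rough sketch, those options could be combined for a small machine like this. The file names, the output directory, and the function name lowmem_cmd are assumptions made for illustration; only the flags come from the help text. The function prints the command instead of running it, so it can be inspected first.

```shell
# Hypothetical input and output names; replace with your own.
REF="reference.fasta"
READS_F="reads_F.fq"
READS_R="reads_R.fq"
OUTDIR="riboseed_lowmem"

# Print a `ribo run` command line for a small machine.
# $1 is the number of cores to request (defaults to 2).
# Every flag used here appears in the `ribo run -h` output.
lowmem_cmd() {
    printf 'ribo run -r %s -F %s -R %s -o %s --serialize --cores %s --threads 1 --skip_control\n' \
        "$REF" "$READS_F" "$READS_R" "$OUTDIR" "${1:-2}"
}

lowmem_cmd 2
```

Once the printed command looks right, it can be pasted into the shell. Whether skipping the control assembly is acceptable depends on whether you want a plain de novo assembly to compare against.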

 

 

Usage

The ribo run command executes the most commonly used series of commands, such as scan, select, seed, sketch, and score.

ribo run -r concatenated_seq.fasta -F test_reads1.fq -R test_reads2.fq -o outdir -v 1
  • -r     path to a (multi)fasta or a directory containing one or more chromosomal sequences in fasta format. Required, unless using a config file
  • -c     config file; if none given, create one
  • -o     output directory
  • -F     path to forward fastq file, can be compressed
  • -R     path to reverse fastq file, can be compressed
  • -S1   path to single fastq file
  • -v {1,2,3,4,5}    Logger writes debug to file in output dir; this sets verbosity level sent to stderr. 1 = debug(), 2 = info(), 3 = warning(), 4 = error() and 5 = critical(); default: 2
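
The options listed above cover both paired-end (-F and -R) and single-end (-S1) input. The following sketch picks between them based on how many FASTQ files are supplied and prints the resulting command line. The function names read_args and ribo_cmd are invented for this example, and it assumes that -S1 can be used on its own without -F and -R, which the help text suggests but does not state outright.

```shell
# Choose read options: paired (-F/-R) for two files, single-end (-S1) for one.
# read_args and ribo_cmd are hypothetical helpers, not part of riboSeed.
read_args() {
    if [ "$#" -eq 2 ]; then
        printf '%s %s %s %s\n' -F "$1" -R "$2"
    elif [ "$#" -eq 1 ]; then
        printf '%s %s\n' -S1 "$1"
    else
        echo "expected one or two FASTQ files" >&2
        return 1
    fi
}

# Usage: ribo_cmd REFERENCE OUTDIR READS1 [READS2]
# Prints the command instead of running it so it can be reviewed first.
ribo_cmd() {
    ref=$1
    out=$2
    shift 2
    reads=$(read_args "$@") || return 1
    printf 'ribo run -r %s %s -o %s -v 1\n' "$ref" "$reads" "$out"
}

ribo_cmd concatenated_seq.fasta outdir test_reads1.fq test_reads2.fq
```

With the two test read files this prints a paired-end ribo run command for concatenated_seq.fasta. Passing a single FASTQ file switches the output to the -S1 form.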

 

Citation

riboSeed: leveraging prokaryotic genomic architecture to assemble across ribosomal regions

Nicholas R Waters, Florence Abram, Fiona Brennan, Ashleigh Holmes, Leighton Pritchard

Nucleic Acids Res. 2018 Jun 20;46(11):e68