macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

シーケンスデータ中のウイルス・微生物検出を高速に行う RabbitV

 

 シークエンスデータに含まれるウイルスや微生物の検出・同定は、病原体の診断や研究において重要な役割を担っている。しかし、この問題のための既存のツールは、しばしば高い実行時間とメモリ消費に悩まされている。
本著者らは、ユニークなk-merの高速同定に基づき、イルミナシーケンスデータセット中のウイルスや微生物を迅速に検出するツール、RabbitVを発表する。マルチスレッド、ベクトル化、高速データ解析を用いることで、最新のマルチコアCPUのパワーを活用できる。実験によると、RabbitVはユニークk-merの生成(RabbitUniq)および病原体の同定(RabbitV)において、それぞれ少なくとも42.5倍および14.4倍fastvを上回った。さらに、RabbitVは40サンプルのシーケンスデータ(FASTQ形式で255GB)から、わずか320秒でCOVID-19を検出することができた。

 

インストール

AMDの5950xのCPUを積んだubuntu18でテストした。

Github

#RabbitV
git clone https://github.com/RabbitBio/RabbitV
cd RabbitV/
make -j20

#RabbitUniq
git clone https://github.com/RabbitBio/RabbitUniq
cd RabbitUniq
make -j20
cd bin/

> python RabbitUniq.py -h

usage: RabbitUniq.py [-h] [--workspace WORKSPACE] --infile_list INFILE_LIST

                     --outfile OUTFILE [--gu_worker GU_WORKER]

                     [--kmer_len KMER_LEN] [--bin_num BIN_NUM]

                     [--exclude_last] [--uniq_ref_num UNIQ_REF_NUM]

                     [--output_char]

 

RabbitUniq

 

optional arguments:

  -h, --help            show this help message and exit

  --workspace WORKSPACE, -w WORKSPACE

                        workspace directory the bin files stored [default:

                        workspace]

  --infile_list INFILE_LIST, -l INFILE_LIST

                        input file list, one line per file

  --outfile OUTFILE, -o OUTFILE

                        out put file

  --gu_worker GU_WORKER, -n GU_WORKER

                        The number of worker thread when generate unique kmer

                        [default: 20]

  --kmer_len KMER_LEN, -k KMER_LEN

                        Unique k-mer length [default: 25]

  --bin_num BIN_NUM, -b BIN_NUM

                        Number of bin files to be store, from 64 to

                        2000[default: 512]

  --exclude_last, -e    Exclude the last element in infile_list when output

  --uniq_ref_num UNIQ_REF_NUM, -u UNIQ_REF_NUM

                        Threshold considered as unique kmer, default is 1

  --output_char, -c     Output the unique k-mer collection in character-based

                        file instead of binary file (slower, so not

                        recommended)

> ./RabbitV 

RabbitV version: 0.1.0

usage: ./RabbitV [options] ... 

options:

  -i, --in1                                        read1 input file name (string [=])

  -I, --in2                                        read2 input file name (string [=])

  -o, --out1                                       file name to store read1 with on-target sequences (string [=])

  -O, --out2                                       file name to store read2 with on-target sequences (string [=])

  -c, --kmer_collection                            the unique k-mer collection file in fasta format, see an example: http://opengene.org/kmer_collection.fasta (string [=])

  -k, --kmer                                       the unique k-mer file of the detection target in fasta format. data/SARS-CoV-2.kmer.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])

  -g, --genomes                                    the genomes file of the detection target in fasta format. data/SARS-CoV-2.genomes.fa will be used if none of k-mer/Genomes/k-mer_Collection file is specified (string [=])

  -p, --positive_threshold                         the data is considered as POSITIVE, when its mean coverage of unique kmer >= positive_threshold (0.001 ~ 100). 0.1 by default. (float [=0.1])

  -d, --depth_threshold                            For coverage calculation. A region is considered covered when its mean depth >= depth_threshold (0.001 ~ 1000). 1.0 by default. (float [=1])

  -E, --ed_threshold                               If the edit distance of a sequence and a genome region is <=ed_threshold, then consider it a match (0 ~ 50). 8 by default. (int [=8])

      --long_read_threshold                        A read will be considered as long read if its length >= long_read_threshold (100 ~ 10000). 200 by default. (int [=200])

      --read_segment_len                           A long read will be splitted to read segments, with each <= read_segment_len (50 ~ 5000, should be < long_read_threshold). 100 by default. (int [=100])

      --bin_size                                   For coverage calculation. The genome is splitted to many bins, with each bin has a length of bin_size (1 ~ 100000), default 0 means adaptive. (int [=0])

      --kc_coverage_threshold                      For each genome in the k-mer collection FASTA, report it when its coverage > kc_coverage_threshold. Default is 0.01. (double [=0.01])

      --kc_high_confidence_coverage_threshold      For each genome in the k-mer collection FASTA, report it as high confidence when its coverage > kc_high_confidence_coverage_threshold. Default is 0.7. (double [=0.7])

      --kc_high_confidence_median_hit_threshold    For each genome in the k-mer collection FASTA, report it as high confidence when its median hits > kc_high_confidence_median_hit_threshold. Default is 5. (int [=5])

  -j, --json                                       the json format report file name (string [=RabbitV.json])

  -h, --html                                       the html format report file name (string [=RabbitV.html])

  -R, --report_title                               should be quoted with ' or ", default is "RabbitV report" (string [=RabbitV report])

  -w, --thread                                     worker thread number, default is 4 (int [=4])

  -6, --phred64                                    indicate the input is using phred64 scoring (it'll be converted to phred33, so the output will still be phred33)

  -z, --compression                                compression level for gzip output (1 ~ 9). 1 is fastest, 9 is smallest, default is 4. (int [=4])

      --stdin                                      input from STDIN. If the STDIN is interleaved paired-end FASTQ, please also add --interleaved_in.

      --stdout                                     stream passing-filters reads to STDOUT. This option will result in interleaved FASTQ output for paired-end output. Disabled by default.

      --interleaved_in                             indicate that <in1> is an interleaved FASTQ which contains both read1 and read2. Disabled by default.

      --reads_to_process                           specify how many reads/pairs to be processed. Default 0 means process all reads. (int [=0])

      --dont_overwrite                             don't overwrite existing files. Overwritting is allowed by default.

  -V, --verbose                                    output verbose log information (i.e. when every 1M reads are processed).

  -A, --disable_adapter_trimming                   adapter trimming is enabled by default. If this option is specified, adapter trimming is disabled

  -a, --adapter_sequence                           the adapter for read1. For SE data, if not specified, the adapter will be auto-detected. For PE data, this is used if R1/R2 are found not overlapped. (string [=auto])

      --adapter_sequence_r2                        the adapter for read2 (PE data only). This is used if R1/R2 are found not overlapped. If not specified, it will be the same as <adapter_sequence> (string [=auto])

      --adapter_fasta                              specify a FASTA file to trim both read1 and read2 (if PE) by all the sequences in this FASTA file (string [=])

      --detect_adapter_for_pe                      by default, the auto-detection for adapter is for SE data input only, turn on this option to enable it for PE data.

  -f, --trim_front1                                trimming how many bases in front for read1, default is 0 (int [=0])

  -t, --trim_tail1                                 trimming how many bases in tail for read1, default is 0 (int [=0])

  -b, --max_len1                                   if read1 is longer than max_len1, then trim read1 at its tail to make it as long as max_len1. Default 0 means no limitation (int [=0])

  -F, --trim_front2                                trimming how many bases in front for read2. If it's not specified, it will follow read1's settings (int [=0])

  -T, --trim_tail2                                 trimming how many bases in tail for read2. If it's not specified, it will follow read1's settings (int [=0])

  -B, --max_len2                                   if read2 is longer than max_len2, then trim read2 at its tail to make it as long as max_len2. Default 0 means no limitation. If it's not specified, it will follow read1's settings (int [=0])

      --poly_g_min_len                             the minimum length to detect polyG in the read tail. 10 by default. (int [=10])

  -G, --disable_trim_poly_g                        disable polyG tail trimming, by default trimming is automatically enabled for Illumina NextSeq/NovaSeq data

  -x, --trim_poly_x                                enable polyX trimming in 3' ends.

      --poly_x_min_len                             the minimum length to detect polyX in the read tail. 10 by default. (int [=10])

  -5, --cut_front                                  move a sliding window from front (5') to tail, drop the bases in the window if its mean quality < threshold, stop otherwise.

  -3, --cut_tail                                   move a sliding window from tail (3') to front, drop the bases in the window if its mean quality < threshold, stop otherwise.

  -r, --cut_right                                  move a sliding window from front to tail, if meet one window with mean quality < threshold, drop the bases in the window and the right part, and then stop.

  -W, --cut_window_size                            the window size option shared by cut_front, cut_tail or cut_sliding. Range: 1~1000, default: 4 (int [=4])

  -M, --cut_mean_quality                           the mean quality requirement option shared by cut_front, cut_tail or cut_sliding. Range: 1~36 default: 20 (Q20) (int [=20])

      --cut_front_window_size                      the window size option of cut_front, default to cut_window_size if not specified (int [=4])

      --cut_front_mean_quality                     the mean quality requirement option for cut_front, default to cut_mean_quality if not specified (int [=20])

      --cut_tail_window_size                       the window size option of cut_tail, default to cut_window_size if not specified (int [=4])

      --cut_tail_mean_quality                      the mean quality requirement option for cut_tail, default to cut_mean_quality if not specified (int [=20])

      --cut_right_window_size                      the window size option of cut_right, default to cut_window_size if not specified (int [=4])

      --cut_right_mean_quality                     the mean quality requirement option for cut_right, default to cut_mean_quality if not specified (int [=20])

  -Q, --disable_quality_filtering                  quality filtering is enabled by default. If this option is specified, quality filtering is disabled

  -q, --qualified_quality_phred                    the quality value that a base is qualified. Default 15 means phred quality >=Q15 is qualified. (int [=15])

  -u, --unqualified_percent_limit                  how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])

  -n, --n_base_limit                               if one read's number of N base is >n_base_limit, then this read/pair is discarded. Default is 5 (int [=5])

  -e, --average_qual                               if one read's average quality score <avg_qual, then this read/pair is discarded. Default 0 means no requirement (int [=0])

  -L, --disable_length_filtering                   length filtering is enabled by default. If this option is specified, length filtering is disabled

  -l, --length_required                            reads shorter than length_required will be discarded, default is 15. (int [=15])

      --length_limit                               reads longer than length_limit will be discarded, default 0 means no limitation. (int [=0])

  -y, --low_complexity_filter                      enable low complexity filter. The complexity is defined as the percentage of base that is different from its next base (base[i] != base[i+1]).

  -Y, --complexity_threshold                       the threshold for low complexity filter (0~100). Default is 30, which means 30% complexity is required. (int [=30])

      --filter_by_index1                           specify a file contains a list of barcodes of index1 to be filtered out, one barcode per line (string [=])

      --filter_by_index2                           specify a file contains a list of barcodes of index2 to be filtered out, one barcode per line (string [=])

      --filter_by_index_threshold                  the allowed difference of index barcode for index filtering, default 0 means completely identical. (int [=0])

  -C, --correction                                 enable base correction in overlapped regions (only for PE data), default is disabled

      --overlap_len_require                        the minimum length to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 30 by default. (int [=30])

      --overlap_diff_limit                         the maximum number of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. 5 by default. (int [=5])

      --overlap_diff_percent_limit                 the maximum percentage of mismatched bases to detect overlapped region of PE reads. This will affect overlap analysis based PE merge, adapter trimming and correction. Default 20 means 20%. (int [=20])

  -U, --umi                                        enable unique molecular identifier (UMI) preprocessing

      --umi_loc                                    specify the location of UMI, can be (index1/index2/read1/read2/per_index/per_read, default is none (string [=])

      --umi_len                                    if the UMI is in read1/read2, its length should be provided (int [=0])

      --umi_prefix                                 if specified, an underline will be used to connect prefix and UMI (i.e. prefix=UMI, UMI=AATTCG, final=UMI_AATTCG). No prefix by default (string [=])

      --umi_skip                                   if the UMI is in read1/read2, RabbitV can skip several bases following UMI, default is 0 (int [=0])

  -K, --kmer_len                                   the uniuqe k-mer length, default is 25 (int [=25])

  -?, --help                                       print this message

 

 

実行方法

1、リファレンスゲノムリストからunique k-mer の生成

./RabbitUniq.py --infile_list ${infile_list} --outfile kmer.fa \
-n 20 -k 25 -b 2000
  • --infile_list   input file list, one line per file
  • --outfile    out put file
  • -n    The number of worker thread when generate unique kmer [default: 20]
  • -k     Unique k-mer length [default: 25]
  • -b     Number of bin files to be store, from 64 to 2000[default: 512]

 kmer.faが出力される。

data/SARS-CoV-2.kmer.fa

 

2、RabbitVのラン

RabbitV -i paired_*.fq -k SARS-CoV-2.kmer.fa -g SARS-CoV-2.genomes.fa




 SARS-CoV-2の同定

RabbitV -i filename.fastq -k SARS-CoV-2.kmer.fa -g SARS-CoV-2.genomes.fa

 

引用

RabbitV: fast detection of viruses and microorganisms in sequencing data on multi-core architectures Get access Arrow
Hao Zhang, Qixin Chang, Zekun Yin, Xiaoming Xu, Yanjie Wei, Bertil Schmidt, Weiguo Liu
Bioinformatics, Published: 25 March 2022

 

関連