macでインフォマティクス

A collection of informatics notes related to HTS (NGS).

MIRA's mirabait command: collecting sequencing reads related to a target sequence

 

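MIRAbait: a 'grep'-like tool for k-mers up to 256 bp

mirabait selects, from a read collection, the reads that are partly similar or identical to sequences defined as target baits. Similarity is defined by finding a user-adjustable number of shared k-mers (sequences of k consecutive bases) between the bait sequences and the screened sequences, either in the forward direction only or in both the forward and reverse-complement directions. A DUST-like repeat filter for repeats of up to 4 bases can optionally be added. When used on paired-end files, it selects pairs in which at least one mate matches.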
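To make the selection rule above concrete, here is a minimal Python sketch of k-mer based baiting. It is not mirabait's actual implementation: the function names, the toy sequences, and the way hits are counted per read position are all assumptions made for illustration. The parameters `k` and `min_hits` only loosely mirror the `-k` and `-n` options.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    complement = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(complement)[::-1]


def kmers(seq, k):
    """Return the set of all k-mers (uppercased) contained in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_bait_hits(read, bait_kmers, k, check_revcomp=True):
    """Count read positions whose k-mer (or its reverse complement) is a bait k-mer."""
    read = read.upper()
    hits = 0
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        if kmer in bait_kmers:
            hits += 1
        elif check_revcomp and reverse_complement(kmer) in bait_kmers:
            hits += 1
    return hits


def is_selected(read, bait_kmers, k, min_hits=1, check_revcomp=True):
    """A read is kept when it shares at least min_hits k-mers with the bait."""
    return count_bait_hits(read, bait_kmers, k, check_revcomp) >= min_hits


# Toy data (hypothetical sequences, tiny k for readability).
bait = "ACGTTGCAAGGCTTAACG"
bait_kmers = kmers(bait, 5)
read = "TTGCAAGGCT"  # a substring of the bait

print(count_bait_hits(read, bait_kmers, 5))                      # 6
print(count_bait_hits(reverse_complement(read), bait_kmers, 5))  # 6
print(is_selected("GGGGGGGGGG", bait_kmers, 5))                  # False
```

Raising `min_hits` makes the filter stricter, and passing `check_revcomp=False` restricts matching to the forward direction, which is the idea behind the `-r` option.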

 

Installation

I downloaded V5rc1 and added it to my PATH (on Ubuntu 18.04). v4 can be installed with conda.

Latest binary (V5rc1)

SourceForge

#version4
mamba create -n mira -y
conda activate mira
mamba install -c bioconda mira -y

$ mirabait

Usage: mirabait [options] {-b baitfile [-b ...] | -B file | -j joblibrary} {-p file_1 file_2 | -P file3}* [file4 ...]

MIRAbait: a 'grep' like tool for kmers up to 256 bp

mirabait selects reads from a read collection which are partly similar or
equal to sequences defined as target baits. Similarity is defined by finding a
user-adjustable number of common k-mers (sequences of k consecutive bases)
which are the same in the bait sequences and the screened sequences to be
selected, either in forward or forward/reverse complement direction. Adding a
DUST-like repeat filter for repeats up 4 bases is optional.
When used on paired files, selects sequences where at least one mate matches.

Options:
  -b file         Load bait sequences from file
                  (multiple -b allowed)
  -B file         Load baits from kmer statistics file, not from sequence files.
                  Only one -B allowed, cannot be combined with -b.
                  (see -K for creating such a file)
  -j job          Set options for predefined job from supplied MIRA library
                  Currently available jobs:
                    rrna    Bait rRNA sequences
  -p file1 file2  Load paired sequences to search from file1 and file2
                  Files must contain same number of sequences, sequence
                  names must be in same order.
                  Multiple -p allowed, but must come before non-paired
                  files.
  -P file         Load paired sequences from file
                  File must be interleaved: pairs must follow each other,
                  non-pairs are not allowed.
                  Multiple -p allowed, but must come before non-paired
                  files.

  -k int          kmer length of bait in bases (<=256, default=31)
  -n int          If >0: minimum number of k-mer baits needed (default=1)
                  If <=0: allowed number of missed kmers over sequence
                          length

  -d              Do not use kmers with microrepeats (DUST-like, see also -D)
  -D int          Set length of microrepeats in kmers to discard from bait.
                  int > 0 microrepeat len in percentage of kmer length.
                        E.g.: -k 17 -D 67 --> 11.39 bases --> 12 bases.
                  int < 0 microrepeat len in bases.
                  int != 0 implies -d, int=0 turns DUST filter off.
  -i              Selects sequences that do not hit bait
  -I              Selects sequences that hit and do not hit bait (to
                  different files)
  -r              No checking of reverse complement direction
  -t              Number of threads to use (default=0 -> up to 4 CPU cores)

Options for output definition:
  Normally mirabait writes separate result files (named 'bait_match_*' and
  'bait_miss_*') for each input to the current directory. For changing this
  behaviour and other relating to output, use these options:
  -c char         Normally, mirabait lowercases bases a kmer hit.
                  Using this option, one can instead mask those bases with the
                  given character.
                  Use a blank to neither mask nor lowercase hits.
  -l int          length of a line (FASTA only, default 0=unlimited)
  -K file         Save kmer statistics to 'file' (see also -B)
  -N name         Change the prefix 'bait' to <name>
                  Has no effect if -o/-O is used and targets are not
                  directories
  -o <path>       Save sequences matching bait to path
                  If path is a directory, write separate files into this
                  directory. If not, combine all matching sequences from
                  the input file(s) into a single file specified by the
                  path.
  -O <path>       Like -o, but for sequences not matching

Other options:
  -T dir          Use 'dir' as directory for temporary files instead of
                  current working directory.
  -m integer      Memory to use for computing kmer statistics
                  0..100 = use percentage of free system memory
                  >100 = amount of MiB to use (e.g. 16384 for 16 GiB)
                  Default 75 (75% of free system memory).

Defining files types to load/save:
  Normally mirabait recognises the file types according to the file extension
  (even when packed). In cases you need to force a certain file type because the
  file extension is non-standard, use the EMBOSS notation to force a type:
  <filetype>::<name_of_file>. E.g., to tell that "somefile.dat" is FASTQ, use:
  fastq::somefile.dat
  Recognised types are: caf, fasta, fastq, gbf, gbk, gbff, maf and phd.

  MIRABAIT will write files in the same file type as the corresponding input
  files.

Examples:
  mirabait -b b.fasta file.fastq
  mirabait -I -j rrna -p file_1.fastq file_2.fastq
  mirabait -b b1.fasta -b b2.gbk file.fastq
  mirabait -b fasta::baits.dat -p fastq::file_1.dat fastq::file_2.dat
  mirabait -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf
  mirabait -I -b b.fasta -p file_1.fastq file_2.fastq -P file3.fasta file4.caf
  mirabait -k 27 -n 10 -b b.fasta file.fastq
  mirabait -b fasta::b.dat fastq::file.dat
  mirabait -o /dev/shm/ -b b.fasta -p file_1.fastq file_2.fastq
  mirabait -o /dev/shm/match -b b.fasta -p file_1.fastq file_2.fastq
  mirabait -b human_genome.fasta -K HG_kmerstats.mhs.gz -p file1.fastq file2.fastq
  mirabait -B HG_kmerstats.mhs.gz -p file1.fastq file2.fastq
  mirabait -d -B HG_kmerstats.mhs.gz -p file1.fastq file2.fastq

./mirabait: No bait files defined via -b and no -B given!
Did you use the command line for the old mirabait (<= 4.0.2)?

 

 

How to run

Specify the bait reference and a FASTQ file (.fastq). Gzip-compressed FASTQ is not supported.

mirabait -b ref.fasta input.fastq
  • -b <file>   Load bait sequences from file (multiple -b allowed)
  • -k <int>    kmer length of bait in bases (<=256, default=31)

Reads that match the bait reference are written to the output.
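Since gzip-compressed FASTQ did not work as input here, one workaround is to decompress the file before running mirabait. The following is a minimal sketch using only the Python standard library; the helper name `gunzip_fastq` and the file name in the usage note are hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_fastq(gz_path, out_dir="."):
    """Decompress a .gz file into out_dir and return the path of the plain file."""
    gz_path = Path(gz_path)
    if gz_path.suffix != ".gz":
        return gz_path  # not gzip-named, so leave it untouched
    # Path.stem drops only the last suffix: "reads.fastq.gz" -> "reads.fastq"
    out_path = Path(out_dir) / gz_path.stem
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams the data, so large files are fine
    return out_path
```

For example, `gunzip_fastq("input.fastq.gz")` would write `input.fastq` to the current directory, which can then be passed to `mirabait -b ref.fasta input.fastq` as shown above.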

 

 

Paired-end FASTQ

mirabait -b ref.fasta -p pair_R1.fastq pair_R2.fastq
  • -p <file1> <file2>   Load paired sequences to search from file1 and file2. Files must contain same number of sequences, sequence names must be in same order. Multiple -p allowed, but must come before non-paired files.
  • -P <file>   Load paired sequences from file. File must be interleaved: pairs must follow each other, non-pairs are not allowed. Multiple -p allowed, but must come before non-paired files.

The paired-end FASTQ files are written out with their pairing kept in sync.
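Because -p requires both files to contain the same number of sequences with names in the same order, it can be worth checking the inputs first. The sketch below assumes plain four-line FASTQ records and assumes that mates may differ only by a trailing /1 or /2. How mirabait itself compares names is not documented here, so treat this as an approximation; all function names are hypothetical.

```python
from itertools import zip_longest


def fastq_names(path):
    """Yield the read name from each 4-line FASTQ record in path."""
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            if line_no % 4 == 0:
                # Header line looks like "@name optional description".
                fields = line[1:].split()
                yield fields[0] if fields else ""


def strip_mate_suffix(name):
    """Drop a trailing /1 or /2 so that both mates compare equal."""
    return name[:-2] if name.endswith(("/1", "/2")) else name


def pairs_in_sync(r1_path, r2_path):
    """Return True if both files hold the same number of reads in the same name order."""
    for n1, n2 in zip_longest(fastq_names(r1_path), fastq_names(r2_path)):
        if n1 is None or n2 is None:
            return False  # one file has more records than the other
        if strip_mate_suffix(n1) != strip_mate_suffix(n2):
            return False  # names differ at this position
    return True
```

Calling `pairs_in_sync("pair_R1.fastq", "pair_R2.fastq")` before the mirabait run gives an early warning if the two files have drifted apart, for instance after filtering only one of them.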

 

Adding "-i" outputs the reads that do not hit the bait sequences.

References
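When the same run has to be repeated over many samples, it can help to assemble the command line programmatically. The sketch below builds an argument list from the options described in the help text (-k, -n, -i, -b, -p). The function name `build_mirabait_cmd` and its parameters are hypothetical, and the option order simply follows the usage line.

```python
import shlex


def build_mirabait_cmd(bait, single=(), pairs=(), kmer_len=None, min_hits=None, invert=False):
    """Build a mirabait argument list suitable for subprocess.run()."""
    cmd = ["mirabait"]
    if kmer_len is not None:
        cmd += ["-k", str(kmer_len)]   # k-mer length of the bait
    if min_hits is not None:
        cmd += ["-n", str(min_hits)]   # minimum number of k-mer hits
    if invert:
        cmd.append("-i")               # select reads that do NOT hit the bait
    cmd += ["-b", str(bait)]
    for r1, r2 in pairs:
        cmd += ["-p", str(r1), str(r2)]  # paired files come before unpaired ones
    cmd += [str(path) for path in single]
    return cmd


cmd = build_mirabait_cmd("ref.fasta", pairs=[("pair_R1.fastq", "pair_R2.fastq")], invert=True)
print(shlex.join(cmd))  # mirabait -i -b ref.fasta -p pair_R1.fastq pair_R2.fastq
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once mirabait is on the PATH. Building a list rather than a single string avoids shell-quoting problems with unusual file names.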

Chevreux, B., Wetter, T. and Suhai, S. (1999): Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.

 

Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J., Müller, W. E., Wetter, T. and Suhai, S. (2004): Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs. Genome Research, 14(6)

 

Related