macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

NGSデータをマッピングする Magic-BLAST

2019 4/2 文章修正

2021 8/14  論文引用

2023/07/06 help更新

 

 Magic-BLASTは、NGSシーケンスデータ(Illumina、Roche-454、ABI(SOLiDを除く))をゲノムやトランスクリプトーム全体に対してマッピングするため開発されたNCBI BLASTの派生ツール。Magic-BLASTは他のBLASTプログラムと同様に動作し、はじめにシード16塩基の正確なマッチを見つけ、左右にアライメントを拡張する。 スプライスシグナルが検出された場合、イントロンをまたぐ短いアライメントは1つに繋ぐため、真核生物のRNA seqのマッパーとしても使用することができるとされる。ペアエンドも考慮しており、累積ペアスコアを使用して最適なマッピングを選択する。繰り返しのマッピングを避けるために、Magic-BLASTはデータベースをスキャンし、16塩基のwordを数える機能をもち、defaultでは 10回以上出現するものは拡張されない。この機能は、-limit_lookup Fオプションを使用して無効化できる。

 入力は、FASTA、FASTQ、そしてローカルのSRAファイルにも対応する。SRA ID を指定することでリモートのSRAのデータを直接扱うこともできる。

 

NCBI Insights

Introducing Magic-BLAST | NCBI Insights

How Magic-BLAST works(Documentation)

https://ncbi.github.io/magicblast/doc/tutorial.html

 

インストール

Github

https://github.com/ncbi/magicblast

 ftpサイトからプログラムのバイナリをダウンロードし、パスの通ったディレクトリにコピーする。

https://ncbi.github.io/magicblast/doc/download.html

追記

condaを使う

mamba install -c bioconda magicblast -y

> makeblastdb -help

USAGE

  makeblastdb [-h] [-help] [-in input_file] [-input_type type]

    -dbtype molecule_type [-title database_title] [-parse_seqids]

    [-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]

    [-mask_desc mask_algo_descriptions] [-gi_mask]

    [-gi_mask_name gi_based_mask_names] [-out database_name]

    [-blastdb_version version] [-max_file_sz number_of_bytes]

    [-logfile File_Name] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-version]

 

DESCRIPTION

   Application to create BLAST databases, version 2.12.0+

 

REQUIRED ARGUMENTS

 -dbtype <String, `nucl', `prot'>

   Molecule type of target db

 

OPTIONAL ARGUMENTS

 -h

   Print USAGE and DESCRIPTION;  ignore all other parameters

 -help

   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters

 -version

   Print version number;  ignore other arguments

 

 *** Input options

 -in <File_In>

   Input file/database name

   Default = `-'

 -input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>

   Type of the data specified in input_file

   Default = `fasta'

 

 *** Configuration options

 -title <String>

   Title for BLAST database

   Default = input file name provided to -in argument

 -parse_seqids

   Option to parse seqid for FASTA input if set, for all other input types

   seqids are parsed automatically

 -hash_index

   Create index of sequence hash values.

 

 *** Sequence masking options

 -mask_data <String>

   Comma-separated list of input files containing masking data as produced by

   NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)

 -mask_id <String>

   Comma-separated list of strings to uniquely identify the masking algorithm

    * Requires:  mask_data

    * Incompatible with:  gi_mask

 -mask_desc <String>

   Comma-separated list of free form strings to describe the masking algorithm

   details

    * Requires:  mask_id

 -gi_mask

   Create GI indexed masking data.

    * Requires:  parse_seqids

    * Incompatible with:  mask_id

 -gi_mask_name <String>

   Comma-separated list of masking data output files.

    * Requires:  mask_data, gi_mask

 

 *** Output options

 -out <String>

   Name of BLAST database to be created

   Default = input file name provided to -in argumentRequired if multiple

   file(s)/database(s) are provided as input

 -blastdb_version <Integer, 4..5>

   Version of BLAST database to be created

   Default = `5'

 -max_file_sz <String>

   Maximum file size for BLAST database files

   Default = `1GB'

 -logfile <File_Out>

   File to which the program log should be redirected

 

 *** Taxonomy options

 -taxid <Integer, >=0>

   Taxonomy ID to assign to all sequences

    * Incompatible with:  taxid_map

 -taxid_map <File_In>

   Text file mapping sequence IDs to taxonomy IDs.

   Format:<SequenceId> <TaxonomyId><newline>

    * Requires:  parse_seqids

    * Incompatible with:  taxid

 

./magicblast -help

$ magicblast -help

USAGE

  magicblast [-h] [-help] [-db database_name] [-gilist filename]

    [-seqidlist filename] [-negative_gilist filename]

    [-negative_seqidlist filename] [-taxids taxids] [-negative_taxids taxids]

    [-taxidlist filename] [-negative_taxidlist filename]

    [-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]

    [-subject subject_input_file] [-subject_loc range] [-query input_file]

    [-out output_file] [-gzo] [-out_unaligned output_file]

    [-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty]

    [-perc_identity float_value] [-fr] [-rf] [-penalty penalty]

    [-lcase_masking] [-validate_seqs TF] [-infmt format] [-paired]

    [-query_mate infile] [-sra accession] [-sra_batch file]

    [-parse_deflines TF] [-sra_cache] [-outfmt format] [-unaligned_fmt format]

    [-md_tag] [-no_query_id_trim] [-no_unaligned] [-no_discordant] [-tag tag]

    [-num_threads int_value] [-max_intron_length length] [-score num]

    [-max_edit_dist num] [-splice TF] [-reftype type] [-limit_lookup TF]

    [-max_db_word_count num] [-lookup_stride num] [-version]

 

DESCRIPTION

   Short read mapper

 

OPTIONAL ARGUMENTS

 -h

   Print USAGE and DESCRIPTION;  ignore all other parameters

 -help

   Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters

 -version

   Print version number;  ignore other arguments

 

 *** Input query options

 -query <File_In>

   Input file name

   Default = `-'

    * Incompatible with:  sra, sra_batch

 -infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>

   Input format for sequences

   Default = `fasta'

    * Incompatible with:  sra, sra_batch

 -paired

   Input query sequences are paired

 -query_mate <File_In>

   FASTA file with mates for query sequences (if given in another file)

    * Requires:  query

 -sra <String>

   Comma-separated SRA accessions

    * Incompatible with:  query, infmt, sra_batch

 -sra_batch <File_In>

   File with a list of SRA accessions, one per line

    * Incompatible with:  sra, query, infmt

 

 *** General search options

 -db <String>

   BLAST database name

    * Incompatible with:  subject, subject_loc

 -out <File_Out, file name length < 256>

   Output file name

   Default = `-'

 -gzo

   Output will be compressed

 -out_unaligned <File_Out>

   Report unaligned reads to this file

    * Incompatible with:  no_unaligned

 -word_size <Integer, >=12>

   Minimum number of consecutive bases matching exactly

   Default = `18'

 -gapopen <Integer>

   Cost to open a gap

   Default = `0'

 -gapextend <Integer>

   Cost to extend a gap

   Default = `4'

 -penalty <Integer, <=0>

   Penalty for a nucleotide mismatch

   Default = `-4'

 -max_intron_length <Integer, >=0>

   Maximum allowed intron length

   Default = `500000'

 

 *** BLAST-2-Sequences options

 -subject <File_In>

   Subject sequence(s) to search

    * Incompatible with:  db, gilist, seqidlist, negative_gilist,

   negative_seqidlist, taxids, taxidlist, negative_taxids, negative_taxidlist,

   db_soft_mask, db_hard_mask

 -subject_loc <String>

   Location on the subject sequence in 1-based offsets (Format: start-stop)

    * Incompatible with:  db, gilist, seqidlist, negative_gilist,

   negative_seqidlist, taxids, taxidlist, negative_taxids, negative_taxidlist,

   db_soft_mask, db_hard_mask, remote

 

 *** Formatting options

 -outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >

   alignment view options:

   sam = SAM format,

   tabular = Tabular format,

   asn = text ASN.1

   Default = `sam'

 -unaligned_fmt <String, Permissible values: 'fasta' 'sam' 'tabular' >

   format for reporting unaligned reads:

   sam = SAM format,

   tabular = Tabular format,

   fasta = sequences in FASTA format

   Default = same as outfmt

    * Requires:  out_unaligned

 -md_tag

   Include MD tag in SAM report

 -no_query_id_trim

   Do not trim '.1', '/1', '.2', or '/2' at the end of read ids for SAM format

   andpaired runs

 -no_unaligned

   Do not report unaligned reads

    * Incompatible with:  out_unaligned

 -no_discordant

   Suppress discordant alignments for paired reads

 -tag <String>

   A user tag to add to each alignment

 

 *** Query filtering options

 -lcase_masking

   Use lower case filtering in subject sequence(s)?

 -validate_seqs <Boolean>

   Reject low quality sequences 

   Default = `true'

 -limit_lookup <Boolean>

   Remove word seeds with high frequency in the searched database

   Default = `true'

 -max_db_word_count <Integer, (>=2 and =<255)>

   Words that appear more than this number of times in the database will be

   masked in the lookup table

   Default = `30'

 -lookup_stride <Integer>

   Number of words to skip after collecting one while creating a lookup table

   Default = `0'

 

 *** Restrict search or results

 -gilist <String>

   Restrict search of database to list of GIs

    * Incompatible with:  seqidlist, taxids, taxidlist, negative_gilist,

   negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

   subject_loc

 -seqidlist <String>

   Restrict search of database to list of SeqIDs

    * Incompatible with:  gilist, taxids, taxidlist, negative_gilist,

   negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

   subject_loc

 -negative_gilist <String>

   Restrict search of database to everything except the specified GIs

    * Incompatible with:  gilist, seqidlist, taxids, taxidlist,

   negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

   subject_loc

 -negative_seqidlist <String>

   Restrict search of database to everything except the specified SeqIDs

    * Incompatible with:  gilist, seqidlist, taxids, taxidlist,

   negative_gilist, negative_taxids, negative_taxidlist, remote, subject,

   subject_loc

 -taxids <String>

   Restrict search of database to include only the specified taxonomy IDs

   (multiple IDs delimited by ',')

    * Incompatible with:  gilist, seqidlist, taxidlist, negative_gilist,

   negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

   subject_loc

 -negative_taxids <String>

   Restrict search of database to everything except the specified taxonomy IDs

   (multiple IDs delimited by ',')

    * Incompatible with:  gilist, seqidlist, taxids, taxidlist,

   negative_gilist, negative_seqidlist, negative_taxidlist, remote, subject,

   subject_loc

 -taxidlist <String>

   Restrict search of database to include only the specified taxonomy IDs

    * Incompatible with:  gilist, seqidlist, taxids, negative_gilist,

   negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

   subject_loc

 -negative_taxidlist <String>

   Restrict search of database to everything except the specified taxonomy IDs

    * Incompatible with:  gilist, seqidlist, taxids, taxidlist,

   negative_gilist, negative_seqidlist, negative_taxids, remote, subject,

   subject_loc

 -db_soft_mask <String>

   Filtering algorithm ID to apply to the BLAST database as soft masking

    * Incompatible with:  db_hard_mask, subject, subject_loc

 -db_hard_mask <String>

   Filtering algorithm ID to apply to the BLAST database as hard masking

    * Incompatible with:  db_soft_mask, subject, subject_loc

 -perc_identity <Real, 0..100>

   Percent identity cutoff for alignments

   Default = `0.0'

 -fr

   Strand specific reads forward/reverse

    * Incompatible with:  rf

 -rf

   Strand specific reads reverse/forward

    * Incompatible with:  fr

 

 *** Miscellaneous options

 -parse_deflines <Boolean>

   Should the query and subject defline(s) be parsed?

   Default = `true'

 -sra_cache

   Enable SRA caching in local files

    * Requires:  sra

 -num_threads <Integer, >=1>

   Number of threads (CPUs) to use in the BLAST search

   Default = `1'

    * Incompatible with:  remote

 

 *** Mapping options

 -score <String>

   Cutoff score for accepting alignments. Can be expressed as a number or a

   function of read length: L,b,a for a * length + b.

   Zero means that the cutoff score will be equal to:

   read length,      if read length <= 20,

   20,               if read length <= 30,

   read length - 10, if read length <= 50,

   40,               otherwise.

   Default = `0'

 -max_edit_dist <Integer>

   Cutoff edit distance for accepting an alignment

   Default = unlimited

 -splice <Boolean>

   Search for spliced alignments

   Default = `true'

 -reftype <String, `genome', `transcriptome'>

   Type of the reference: genome or transcriptome

   Default = `genome'

 

 

 

ラン

1、index

データベースの作成。

makeblastdb -in ref.fa -dbtype nucl -parse_seqids -out my_database
  • -parse_seqids   Option to parse seqid for FASTA input if set, for all other input types seqids are parsed automatically

 

 2、mapping

SRAのIDを指定すると、データをダウンロードしてmapingを実行する。sraはIDのみ記載する。.sraまで記載するとエラーを起こす。ペアエンドは自動で認識される。出力はsam形式。

magicblast -sra <accessionID> -db my_database -num_threads 8 -out output.sam
  •  -outfmt    output.format. SAM format, tabular = Tabular format, asn = text ASN.1 Default = `sam'

デフォルトではスプライスアライメントがONになっている。bacteria RNA seqや一般的なDNA seqでは、スプライスアライメントをOFFにする"-splice F"フラグを立てる。

 

またはローカルのペアエンドのfastqをマッピングする。アンマップのリードは捨てる。samtoolsにパイプして出力はポジションソート(coordinate sort)のbamとする。"-infmt fastq"がないとエラーになる。

magicblast -query pair1.fastq -query_mate pair2.fastq -db my_database -infmt fastq -num_threads 8 -no_unaligned |\
samtools view -@ 8 -bS - |\
samtools sort -@ 8 -O BAM -o output.bam - # *1
  • -no_unaligned   Do report unaligned reads

NGSデータがFASTAフォーマットなら-infmt fastqは消す。

*1 上のコマンドはsamtools1.2など古いsamtoolsでは動作しません。

 

トランスクリプトームがデータベースの場合、"-reftype transcriptome"フラグを立てる。これは"-splice F -limit_lookup F"の一括設定で、スプライスアライメントOFFとリピートアライメント無効化のキャンセル(否定)になる。

magicblast -query reads.fa -db my_database -reftype transcriptome -num_threads 8 -out output.sam 

 

publishされたペーパーがなく単なる感想になってしまいますが、1サンプルだけテストした限り、ランニングタイムはそれなりに長めですが、アライメント感度が非常に高いという結果が得られました。BLAST譲りの高い感度があるようなので、適材適所で使うケースを間違えなければ使えるアライナーではないでしょうか?すでにMagic-BLASTをフローに組み込んだツールも出てきています。

 

引用

Copyright

https://ncbi.github.io/magicblast/dev/copyright.html

 

Magic-BLAST, an accurate RNA-seq aligner for long and short reads
Grzegorz M. Boratyn, Jean Thierry-Mieg, Danielle Thierry-Mieg, Ben Busby & Thomas L. Madden
BMC Bioinformatics volume 20, Article number: 405 (2019)