2018-07-06

NGSデータをマッピングする Magic-BLAST

2019 4/2 文章修正

2021 8/14 論文引用

2023/07/06 help更新

　Magic-BLASTは、NGSシーケンスデータ（Illumina、Roche-454、ABI（SOLiDを除く））をゲノムやトランスクリプトーム全体に対してマッピングするため開発されたNCBI BLASTの派生ツール。Magic-BLASTは他のBLASTプログラムと同様に動作し、はじめにシード16塩基の正確なマッチを見つけ、左右にアライメントを拡張する。スプライスシグナルが検出された場合、イントロンをまたぐ短いアライメントは１つに繋ぐため、真核生物のRNA seqのマッパーとしても使用することができるとされる。ペアエンドも考慮しており、累積ペアスコアを使用して最適なマッピングを選択する。繰り返しのマッピングを避けるために、Magic-BLASTはデータベースをスキャンし、16塩基のwordを数える機能をもち、defaultでは 10回以上出現するものは拡張されない。この機能は、-limit_lookup Fオプションを使用して無効化できる。

入力は、FASTA、FASTQ、そしてローカルのSRAファイルにも対応する。SRA ID を指定することでリモートのSRAのデータを直接扱うこともできる。

NCBI Insights

Introducing Magic-BLAST | NCBI Insights

How Magic-BLAST works（Documentation）

https://ncbi.github.io/magicblast/doc/tutorial.html

インストール

Github

https://github.com/ncbi/magicblast

ftpサイトからプログラムのバイナリをダウンロードし、パスの通ったディレクトリにコピーする。

https://ncbi.github.io/magicblast/doc/download.html

追記

condaを使う

mamba install -c bioconda magicblast -y

> makeblastdb -help

USAGE

makeblastdb [-h] [-help] [-in input_file] [-input_type type]

-dbtype molecule_type [-title database_title] [-parse_seqids]

[-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]

[-mask_desc mask_algo_descriptions] [-gi_mask]

[-gi_mask_name gi_based_mask_names] [-out database_name]

[-blastdb_version version] [-max_file_sz number_of_bytes]

[-logfile File_Name] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-version]

DESCRIPTION

Application to create BLAST databases, version 2.12.0+

REQUIRED ARGUMENTS

-dbtype <String, `nucl', `prot'>

Molecule type of target db

OPTIONAL ARGUMENTS

-h

Print USAGE and DESCRIPTION; ignore all other parameters

-help

Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters

-version

Print version number; ignore other arguments

*** Input options

-in <File_In>

Input file/database name

Default = `-'

-input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>

Type of the data specified in input_file

Default = `fasta'

*** Configuration options

-title <String>

Title for BLAST database

Default = input file name provided to -in argument

-parse_seqids

Option to parse seqid for FASTA input if set, for all other input types

seqids are parsed automatically

-hash_index

Create index of sequence hash values.

*** Sequence masking options

-mask_data <String>

Comma-separated list of input files containing masking data as produced by

NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)

-mask_id <String>

Comma-separated list of strings to uniquely identify the masking algorithm

* Requires: mask_data

* Incompatible with: gi_mask

-mask_desc <String>

Comma-separated list of free form strings to describe the masking algorithm

details

* Requires: mask_id

-gi_mask

Create GI indexed masking data.

* Requires: parse_seqids

* Incompatible with: mask_id

-gi_mask_name <String>

Comma-separated list of masking data output files.

* Requires: mask_data, gi_mask

*** Output options

-out <String>

Name of BLAST database to be created

Default = input file name provided to -in argumentRequired if multiple

file(s)/database(s) are provided as input

-blastdb_version <Integer, 4..5>

Version of BLAST database to be created

Default = `5'

-max_file_sz <String>

Maximum file size for BLAST database files

Default = `1GB'

-logfile <File_Out>

File to which the program log should be redirected

*** Taxonomy options

-taxid <Integer, >=0>

Taxonomy ID to assign to all sequences

* Incompatible with: taxid_map

-taxid_map <File_In>

Text file mapping sequence IDs to taxonomy IDs.

Format:<SequenceId> <TaxonomyId><newline>

* Requires: parse_seqids

* Incompatible with: taxid

> ./magicblast -help

$ magicblast -help

USAGE

magicblast [-h] [-help] [-db database_name] [-gilist filename]

[-seqidlist filename] [-negative_gilist filename]

[-negative_seqidlist filename] [-taxids taxids] [-negative_taxids taxids]

[-taxidlist filename] [-negative_taxidlist filename]

[-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]

[-subject subject_input_file] [-subject_loc range] [-query input_file]

[-out output_file] [-gzo] [-out_unaligned output_file]

[-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty]

[-perc_identity float_value] [-fr] [-rf] [-penalty penalty]

[-lcase_masking] [-validate_seqs TF] [-infmt format] [-paired]

[-query_mate infile] [-sra accession] [-sra_batch file]

[-parse_deflines TF] [-sra_cache] [-outfmt format] [-unaligned_fmt format]

[-md_tag] [-no_query_id_trim] [-no_unaligned] [-no_discordant] [-tag tag]

[-num_threads int_value] [-max_intron_length length] [-score num]

[-max_edit_dist num] [-splice TF] [-reftype type] [-limit_lookup TF]

[-max_db_word_count num] [-lookup_stride num] [-version]

DESCRIPTION

Short read mapper

OPTIONAL ARGUMENTS

-h

Print USAGE and DESCRIPTION; ignore all other parameters

-help

Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters

-version

Print version number; ignore other arguments

*** Input query options

-query <File_In>

Input file name

Default = `-'

* Incompatible with: sra, sra_batch

-infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>

Input format for sequences

Default = `fasta'

* Incompatible with: sra, sra_batch

-paired

Input query sequences are paired

-query_mate <File_In>

FASTA file with mates for query sequences (if given in another file)

* Requires: query

-sra <String>

Comma-separated SRA accessions

* Incompatible with: query, infmt, sra_batch

-sra_batch <File_In>

File with a list of SRA accessions, one per line

* Incompatible with: sra, query, infmt

*** General search options

-db <String>

BLAST database name

* Incompatible with: subject, subject_loc

-out <File_Out, file name length < 256>

Output file name

Default = `-'

-gzo

Output will be compressed

-out_unaligned <File_Out>

Report unaligned reads to this file

* Incompatible with: no_unaligned

-word_size <Integer, >=12>

Minimum number of consecutive bases matching exactly

Default = `18'

-gapopen <Integer>

Cost to open a gap

Default = `0'

-gapextend <Integer>

Cost to extend a gap

Default = `4'

-penalty <Integer, <=0>

Penalty for a nucleotide mismatch

Default = `-4'

-max_intron_length <Integer, >=0>

Maximum allowed intron length

Default = `500000'

*** BLAST-2-Sequences options

-subject <File_In>

Subject sequence(s) to search

* Incompatible with: db, gilist, seqidlist, negative_gilist,

negative_seqidlist, taxids, taxidlist, negative_taxids, negative_taxidlist,

db_soft_mask, db_hard_mask

-subject_loc <String>

Location on the subject sequence in 1-based offsets (Format: start-stop)

* Incompatible with: db, gilist, seqidlist, negative_gilist,

negative_seqidlist, taxids, taxidlist, negative_taxids, negative_taxidlist,

db_soft_mask, db_hard_mask, remote

*** Formatting options

-outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >

alignment view options:

sam = SAM format,

tabular = Tabular format,

asn = text ASN.1

Default = `sam'

-unaligned_fmt <String, Permissible values: 'fasta' 'sam' 'tabular' >

format for reporting unaligned reads:

sam = SAM format,

tabular = Tabular format,

fasta = sequences in FASTA format

Default = same as outfmt

* Requires: out_unaligned

-md_tag

Include MD tag in SAM report

-no_query_id_trim

Do not trim '.1', '/1', '.2', or '/2' at the end of read ids for SAM format

andpaired runs

-no_unaligned

Do not report unaligned reads

* Incompatible with: out_unaligned

-no_discordant

Suppress discordant alignments for paired reads

-tag <String>

A user tag to add to each alignment

*** Query filtering options

-lcase_masking

Use lower case filtering in subject sequence(s)?

-validate_seqs <Boolean>

Reject low quality sequences

Default = `true'

-limit_lookup <Boolean>

Remove word seeds with high frequency in the searched database

Default = `true'

-max_db_word_count <Integer, (>=2 and =<255)>

Words that appear more than this number of times in the database will be

masked in the lookup table

Default = `30'

-lookup_stride <Integer>

Number of words to skip after collecting one while creating a lookup table

Default = `0'

*** Restrict search or results

-gilist <String>

Restrict search of database to list of GIs

* Incompatible with: seqidlist, taxids, taxidlist, negative_gilist,

negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

subject_loc

-seqidlist <String>

Restrict search of database to list of SeqIDs

* Incompatible with: gilist, taxids, taxidlist, negative_gilist,

negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

subject_loc

-negative_gilist <String>

Restrict search of database to everything except the specified GIs

* Incompatible with: gilist, seqidlist, taxids, taxidlist,

negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

subject_loc

-negative_seqidlist <String>

Restrict search of database to everything except the specified SeqIDs

* Incompatible with: gilist, seqidlist, taxids, taxidlist,

negative_gilist, negative_taxids, negative_taxidlist, remote, subject,

subject_loc

-taxids <String>

Restrict search of database to include only the specified taxonomy IDs

(multiple IDs delimited by ',')

* Incompatible with: gilist, seqidlist, taxidlist, negative_gilist,

negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

subject_loc

-negative_taxids <String>

Restrict search of database to everything except the specified taxonomy IDs

(multiple IDs delimited by ',')

* Incompatible with: gilist, seqidlist, taxids, taxidlist,

negative_gilist, negative_seqidlist, negative_taxidlist, remote, subject,

subject_loc

-taxidlist <String>

Restrict search of database to include only the specified taxonomy IDs

* Incompatible with: gilist, seqidlist, taxids, negative_gilist,

negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,

subject_loc

-negative_taxidlist <String>

Restrict search of database to everything except the specified taxonomy IDs

* Incompatible with: gilist, seqidlist, taxids, taxidlist,

negative_gilist, negative_seqidlist, negative_taxids, remote, subject,

subject_loc

-db_soft_mask <String>

Filtering algorithm ID to apply to the BLAST database as soft masking

* Incompatible with: db_hard_mask, subject, subject_loc

-db_hard_mask <String>

Filtering algorithm ID to apply to the BLAST database as hard masking

* Incompatible with: db_soft_mask, subject, subject_loc

-perc_identity <Real, 0..100>

Percent identity cutoff for alignments

Default = `0.0'

-fr

Strand specific reads forward/reverse

* Incompatible with: rf

-rf

Strand specific reads reverse/forward

* Incompatible with: fr

*** Miscellaneous options

-parse_deflines <Boolean>

Should the query and subject defline(s) be parsed?

Default = `true'

-sra_cache

Enable SRA caching in local files

* Requires: sra

-num_threads <Integer, >=1>

Number of threads (CPUs) to use in the BLAST search

Default = `1'

* Incompatible with: remote

*** Mapping options

-score <String>

Cutoff score for accepting alignments. Can be expressed as a number or a

function of read length: L,b,a for a * length + b.

Zero means that the cutoff score will be equal to:

read length, if read length <= 20,

20, if read length <= 30,

read length - 10, if read length <= 50,

40, otherwise.

Default = `0'

-max_edit_dist <Integer>

Cutoff edit distance for accepting an alignment

Default = unlimited

-splice <Boolean>

Search for spliced alignments

Default = `true'

-reftype <String, `genome', `transcriptome'>

Type of the reference: genome or transcriptome

Default = `genome'

ラン

１、index

データベースの作成。

makeblastdb -in ref.fa -dbtype nucl -parse_seqids -out my_database

-parse_seqids Option to parse seqid for FASTA input if set, for all other input types seqids are parsed automatically

2、mapping

SRAのIDを指定すると、データをダウンロードしてmapingを実行する。sraはIDのみ記載する。.sraまで記載するとエラーを起こす。ペアエンドは自動で認識される。出力はsam形式。

magicblast -sra <accessionID> -db my_database -num_threads 8 -out output.sam

-outfmt output.format. SAM format, tabular = Tabular format, asn = text ASN.1 Default = `sam'

デフォルトではスプライスアライメントがONになっている。bacteria RNA seqや一般的なDNA seqでは、スプライスアライメントをOFFにする"-splice F"フラグを立てる。

またはローカルのペアエンドのfastqをマッピングする。アンマップのリードは捨てる。samtoolsにパイプして出力はポジションソート（coordinate sort）のbamとする。"-infmt fastq"がないとエラーになる。

magicblast -query pair1.fastq -query_mate pair2.fastq -db my_database -infmt fastq -num_threads 8 -no_unaligned |\
 samtools view -@ 8 -bS - |\
 samtools sort -@ 8 -O BAM -o output.bam - # *1

-no_unaligned Do report unaligned reads

NGSデータがFASTAフォーマットなら-infmt fastqは消す。

*1 上のコマンドはsamtools1.2など古いsamtoolsでは動作しません。

トランスクリプトームがデータベースの場合、"-reftype transcriptome"フラグを立てる。これは"-splice F -limit_lookup F"の一括設定で、スプライスアライメントOFFとリピートアライメント無効化のキャンセル（否定）になる。

magicblast -query reads.fa -db my_database -reftype transcriptome -num_threads 8 -out output.sam

publishされたペーパーがなく単なる感想になってしまいますが、１サンプルだけテストした限り、ランニングタイムはそれなりに長めですが、アライメント感度が非常に高いという結果が得られました。BLAST譲りの高い感度があるようなので、適材適所で使うケースを間違えなければ使えるアライナーではないでしょうか？すでにMagic-BLASTをフローに組み込んだツールも出てきています。

引用

https://ncbi.github.io/magicblast/dev/copyright.html

Magic-BLAST, an accurate RNA-seq aligner for long and short reads
Grzegorz M. Boratyn, Jean Thierry-Mieg, Danielle Thierry-Mieg, Ben Busby & Thomas L. Madden
BMC Bioinformatics volume 20, Article number: 405 (2019)

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

NGSデータをマッピングする Magic-BLAST