2019 4/2 文章修正
2021 8/14 論文引用
2023/07/06 help更新
Magic-BLASTは、NGSシーケンスデータ(Illumina、Roche-454、ABI(SOLiDを除く))をゲノムやトランスクリプトーム全体に対してマッピングするため開発されたNCBI BLASTの派生ツール。Magic-BLASTは他のBLASTプログラムと同様に動作し、はじめにシード16塩基の正確なマッチを見つけ、左右にアライメントを拡張する。 スプライスシグナルが検出された場合、イントロンをまたぐ短いアライメントは1つに繋ぐため、真核生物のRNA seqのマッパーとしても使用することができるとされる。ペアエンドも考慮しており、累積ペアスコアを使用して最適なマッピングを選択する。繰り返しのマッピングを避けるために、Magic-BLASTはデータベースをスキャンし、16塩基のwordを数える機能をもち、defaultでは 10回以上出現するものは拡張されない。この機能は、-limit_lookup Fオプションを使用して無効化できる。
入力は、FASTA、FASTQ、そしてローカルのSRAファイルにも対応する。SRA ID を指定することでリモートのSRAのデータを直接扱うこともできる。
NCBI Insights
Introducing Magic-BLAST | NCBI Insights
How Magic-BLAST works(Documentation)
https://ncbi.github.io/magicblast/doc/tutorial.html
インストール
https://github.com/ncbi/magicblast
ftpサイトからプログラムのバイナリをダウンロードし、パスの通ったディレクトリにコピーする。
https://ncbi.github.io/magicblast/doc/download.html
追記
condaを使う
mamba install -c bioconda magicblast -y
> makeblastdb -help
USAGE
makeblastdb [-h] [-help] [-in input_file] [-input_type type]
-dbtype molecule_type [-title database_title] [-parse_seqids]
[-hash_index] [-mask_data mask_data_files] [-mask_id mask_algo_ids]
[-mask_desc mask_algo_descriptions] [-gi_mask]
[-gi_mask_name gi_based_mask_names] [-out database_name]
[-blastdb_version version] [-max_file_sz number_of_bytes]
[-logfile File_Name] [-taxid TaxID] [-taxid_map TaxIDMapFile] [-version]
DESCRIPTION
Application to create BLAST databases, version 2.12.0+
REQUIRED ARGUMENTS
-dbtype <String, `nucl', `prot'>
Molecule type of target db
OPTIONAL ARGUMENTS
-h
Print USAGE and DESCRIPTION; ignore all other parameters
-help
Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
-version
Print version number; ignore other arguments
*** Input options
-in <File_In>
Input file/database name
Default = `-'
-input_type <String, `asn1_bin', `asn1_txt', `blastdb', `fasta'>
Type of the data specified in input_file
Default = `fasta'
*** Configuration options
-title <String>
Title for BLAST database
Default = input file name provided to -in argument
-parse_seqids
Option to parse seqid for FASTA input if set, for all other input types
seqids are parsed automatically
-hash_index
Create index of sequence hash values.
*** Sequence masking options
-mask_data <String>
Comma-separated list of input files containing masking data as produced by
NCBI masking applications (e.g. dustmasker, segmasker, windowmasker)
-mask_id <String>
Comma-separated list of strings to uniquely identify the masking algorithm
* Requires: mask_data
* Incompatible with: gi_mask
-mask_desc <String>
Comma-separated list of free form strings to describe the masking algorithm
details
* Requires: mask_id
-gi_mask
Create GI indexed masking data.
* Requires: parse_seqids
* Incompatible with: mask_id
-gi_mask_name <String>
Comma-separated list of masking data output files.
* Requires: mask_data, gi_mask
*** Output options
-out <String>
Name of BLAST database to be created
Default = input file name provided to -in argumentRequired if multiple
file(s)/database(s) are provided as input
-blastdb_version <Integer, 4..5>
Version of BLAST database to be created
Default = `5'
-max_file_sz <String>
Maximum file size for BLAST database files
Default = `1GB'
-logfile <File_Out>
File to which the program log should be redirected
*** Taxonomy options
-taxid <Integer, >=0>
Taxonomy ID to assign to all sequences
* Incompatible with: taxid_map
-taxid_map <File_In>
Text file mapping sequence IDs to taxonomy IDs.
Format:<SequenceId> <TaxonomyId><newline>
* Requires: parse_seqids
* Incompatible with: taxid
> ./magicblast -help
$ magicblast -help
USAGE
magicblast [-h] [-help] [-db database_name] [-gilist filename]
[-seqidlist filename] [-negative_gilist filename]
[-negative_seqidlist filename] [-taxids taxids] [-negative_taxids taxids]
[-taxidlist filename] [-negative_taxidlist filename]
[-db_soft_mask filtering_algorithm] [-db_hard_mask filtering_algorithm]
[-subject subject_input_file] [-subject_loc range] [-query input_file]
[-out output_file] [-gzo] [-out_unaligned output_file]
[-word_size int_value] [-gapopen open_penalty] [-gapextend extend_penalty]
[-perc_identity float_value] [-fr] [-rf] [-penalty penalty]
[-lcase_masking] [-validate_seqs TF] [-infmt format] [-paired]
[-query_mate infile] [-sra accession] [-sra_batch file]
[-parse_deflines TF] [-sra_cache] [-outfmt format] [-unaligned_fmt format]
[-md_tag] [-no_query_id_trim] [-no_unaligned] [-no_discordant] [-tag tag]
[-num_threads int_value] [-max_intron_length length] [-score num]
[-max_edit_dist num] [-splice TF] [-reftype type] [-limit_lookup TF]
[-max_db_word_count num] [-lookup_stride num] [-version]
DESCRIPTION
Short read mapper
OPTIONAL ARGUMENTS
-h
Print USAGE and DESCRIPTION; ignore all other parameters
-help
Print USAGE, DESCRIPTION and ARGUMENTS; ignore all other parameters
-version
Print version number; ignore other arguments
*** Input query options
-query <File_In>
Input file name
Default = `-'
* Incompatible with: sra, sra_batch
-infmt <String, `asn1', `asn1b', `fasta', `fastc', `fastq'>
Input format for sequences
Default = `fasta'
* Incompatible with: sra, sra_batch
-paired
Input query sequences are paired
-query_mate <File_In>
FASTA file with mates for query sequences (if given in another file)
* Requires: query
-sra <String>
Comma-separated SRA accessions
* Incompatible with: query, infmt, sra_batch
-sra_batch <File_In>
File with a list of SRA accessions, one per line
* Incompatible with: sra, query, infmt
*** General search options
-db <String>
BLAST database name
* Incompatible with: subject, subject_loc
-out <File_Out, file name length < 256>
Output file name
Default = `-'
-gzo
Output will be compressed
-out_unaligned <File_Out>
Report unaligned reads to this file
* Incompatible with: no_unaligned
-word_size <Integer, >=12>
Minimum number of consecutive bases matching exactly
Default = `18'
-gapopen <Integer>
Cost to open a gap
Default = `0'
-gapextend <Integer>
Cost to extend a gap
Default = `4'
-penalty <Integer, <=0>
Penalty for a nucleotide mismatch
Default = `-4'
-max_intron_length <Integer, >=0>
Maximum allowed intron length
Default = `500000'
*** BLAST-2-Sequences options
-subject <File_In>
Subject sequence(s) to search
* Incompatible with: db, gilist, seqidlist, negative_gilist,
negative_seqidlist, taxids, taxidlist, negative_taxids, negative_taxidlist,
db_soft_mask, db_hard_mask
-subject_loc <String>
Location on the subject sequence in 1-based offsets (Format: start-stop)
* Incompatible with: db, gilist, seqidlist, negative_gilist,
negative_seqidlist, taxids, taxidlist, negative_taxids, negative_taxidlist,
db_soft_mask, db_hard_mask, remote
*** Formatting options
-outfmt <String, Permissible values: 'asn' 'sam' 'tabular' >
alignment view options:
sam = SAM format,
tabular = Tabular format,
asn = text ASN.1
Default = `sam'
-unaligned_fmt <String, Permissible values: 'fasta' 'sam' 'tabular' >
format for reporting unaligned reads:
sam = SAM format,
tabular = Tabular format,
fasta = sequences in FASTA format
Default = same as outfmt
* Requires: out_unaligned
-md_tag
Include MD tag in SAM report
-no_query_id_trim
Do not trim '.1', '/1', '.2', or '/2' at the end of read ids for SAM format
andpaired runs
-no_unaligned
Do not report unaligned reads
* Incompatible with: out_unaligned
-no_discordant
Suppress discordant alignments for paired reads
-tag <String>
A user tag to add to each alignment
*** Query filtering options
-lcase_masking
Use lower case filtering in subject sequence(s)?
-validate_seqs <Boolean>
Reject low quality sequences
Default = `true'
-limit_lookup <Boolean>
Remove word seeds with high frequency in the searched database
Default = `true'
-max_db_word_count <Integer, (>=2 and =<255)>
Words that appear more than this number of times in the database will be
masked in the lookup table
Default = `30'
-lookup_stride <Integer>
Number of words to skip after collecting one while creating a lookup table
Default = `0'
*** Restrict search or results
-gilist <String>
Restrict search of database to list of GIs
* Incompatible with: seqidlist, taxids, taxidlist, negative_gilist,
negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,
subject_loc
-seqidlist <String>
Restrict search of database to list of SeqIDs
* Incompatible with: gilist, taxids, taxidlist, negative_gilist,
negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,
subject_loc
-negative_gilist <String>
Restrict search of database to everything except the specified GIs
* Incompatible with: gilist, seqidlist, taxids, taxidlist,
negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,
subject_loc
-negative_seqidlist <String>
Restrict search of database to everything except the specified SeqIDs
* Incompatible with: gilist, seqidlist, taxids, taxidlist,
negative_gilist, negative_taxids, negative_taxidlist, remote, subject,
subject_loc
-taxids <String>
Restrict search of database to include only the specified taxonomy IDs
(multiple IDs delimited by ',')
* Incompatible with: gilist, seqidlist, taxidlist, negative_gilist,
negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,
subject_loc
-negative_taxids <String>
Restrict search of database to everything except the specified taxonomy IDs
(multiple IDs delimited by ',')
* Incompatible with: gilist, seqidlist, taxids, taxidlist,
negative_gilist, negative_seqidlist, negative_taxidlist, remote, subject,
subject_loc
-taxidlist <String>
Restrict search of database to include only the specified taxonomy IDs
* Incompatible with: gilist, seqidlist, taxids, negative_gilist,
negative_seqidlist, negative_taxids, negative_taxidlist, remote, subject,
subject_loc
-negative_taxidlist <String>
Restrict search of database to everything except the specified taxonomy IDs
* Incompatible with: gilist, seqidlist, taxids, taxidlist,
negative_gilist, negative_seqidlist, negative_taxids, remote, subject,
subject_loc
-db_soft_mask <String>
Filtering algorithm ID to apply to the BLAST database as soft masking
* Incompatible with: db_hard_mask, subject, subject_loc
-db_hard_mask <String>
Filtering algorithm ID to apply to the BLAST database as hard masking
* Incompatible with: db_soft_mask, subject, subject_loc
-perc_identity <Real, 0..100>
Percent identity cutoff for alignments
Default = `0.0'
-fr
Strand specific reads forward/reverse
* Incompatible with: rf
-rf
Strand specific reads reverse/forward
* Incompatible with: fr
*** Miscellaneous options
-parse_deflines <Boolean>
Should the query and subject defline(s) be parsed?
Default = `true'
-sra_cache
Enable SRA caching in local files
* Requires: sra
-num_threads <Integer, >=1>
Number of threads (CPUs) to use in the BLAST search
Default = `1'
* Incompatible with: remote
*** Mapping options
-score <String>
Cutoff score for accepting alignments. Can be expressed as a number or a
function of read length: L,b,a for a * length + b.
Zero means that the cutoff score will be equal to:
read length, if read length <= 20,
20, if read length <= 30,
read length - 10, if read length <= 50,
40, otherwise.
Default = `0'
-max_edit_dist <Integer>
Cutoff edit distance for accepting an alignment
Default = unlimited
-splice <Boolean>
Search for spliced alignments
Default = `true'
-reftype <String, `genome', `transcriptome'>
Type of the reference: genome or transcriptome
Default = `genome'
ラン
1、index
データベースの作成。
makeblastdb -in ref.fa -dbtype nucl -parse_seqids -out my_database
- -parse_seqids Option to parse seqid for FASTA input if set, for all other input types seqids are parsed automatically
2、mapping
SRAのIDを指定すると、データをダウンロードしてmapingを実行する。sraはIDのみ記載する。.sraまで記載するとエラーを起こす。ペアエンドは自動で認識される。出力はsam形式。
magicblast -sra <accessionID> -db my_database -num_threads 8 -out output.sam
- -outfmt output.format. SAM format, tabular = Tabular format, asn = text ASN.1 Default = `sam'
デフォルトではスプライスアライメントがONになっている。bacteria RNA seqや一般的なDNA seqでは、スプライスアライメントをOFFにする"-splice F"フラグを立てる。
またはローカルのペアエンドのfastqをマッピングする。アンマップのリードは捨てる。samtoolsにパイプして出力はポジションソート(coordinate sort)のbamとする。"-infmt fastq"がないとエラーになる。
magicblast -query pair1.fastq -query_mate pair2.fastq -db my_database -infmt fastq -num_threads 8 -no_unaligned |\
samtools view -@ 8 -bS - |\
samtools sort -@ 8 -O BAM -o output.bam - # *1
- -no_unaligned Do report unaligned reads
NGSデータがFASTAフォーマットなら-infmt fastqは消す。
*1 上のコマンドはsamtools1.2など古いsamtoolsでは動作しません。
トランスクリプトームがデータベースの場合、"-reftype transcriptome"フラグを立てる。これは"-splice F -limit_lookup F"の一括設定で、スプライスアライメントOFFとリピートアライメント無効化のキャンセル(否定)になる。
magicblast -query reads.fa -db my_database -reftype transcriptome -num_threads 8 -out output.sam
publishされたペーパーがなく単なる感想になってしまいますが、1サンプルだけテストした限り、ランニングタイムはそれなりに長めですが、アライメント感度が非常に高いという結果が得られました。BLAST譲りの高い感度があるようなので、適材適所で使うケースを間違えなければ使えるアライナーではないでしょうか?すでにMagic-BLASTをフローに組み込んだツールも出てきています。
引用
Copyright
https://ncbi.github.io/magicblast/dev/copyright.html
Magic-BLAST, an accurate RNA-seq aligner for long and short reads
Grzegorz M. Boratyn, Jean Thierry-Mieg, Danielle Thierry-Mieg, Ben Busby & Thomas L. Madden
BMC Bioinformatics volume 20, Article number: 405 (2019)