MOSAIK aligner - macでインフォマティクス

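MOSAIK is a stable, sensitive, open-source program for mapping second- and third-generation sequencing reads to a reference genome. Notable among current mapping tools, MOSAIK can align reads generated by all the major sequencing technologies, including Illumina, Applied Biosystems SOLiD, Roche 454, Ion Torrent, and Pacific BioSciences SMRT. Indeed, MOSAIK was the only aligner to provide consistent mappings for all the data generated in the 1000 Genomes Project (sequencing technologies, low-coverage, and exome). To provide highly accurate alignments, MOSAIK employs a hash clustering strategy coupled with the Smith-Waterman algorithm. This method is well suited to capturing mismatches as well as short insertions and deletions. To support the growing interest in discovering larger structural variants (SVs), MOSAIK provides explicit support for handling known-sequence SVs and generates output tailored to aid SV discovery. All variant discovery benefits from an accurate description of read-placement confidence. To this end, MOSAIK uses a neural-network-based training scheme to provide well-calibrated mapping quality scores. To ensure that studies of any genome are supported, a training pipeline is provided to ensure optimal mapping quality scores for the genome under investigation. MOSAIK is multi-threaded, open source, and incorporated into the authors' command and pipeline launcher system GKNO (http://gkno.me) (*the site appears to be no longer available).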

Installation

Tested on macOS 10.14 and Ubuntu 18.04 LTS (Python 3.8 environment).

Dependencies

GitHub

#bioconda
conda install -c bioconda mosaik
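
After installation, it can be worth confirming that the executables used in this post are actually on PATH. The following is a minimal sketch, assuming the bioconda package exposes the four tool names that appear in the help output below. `check_mosaik_tools` is a hypothetical helper name and not part of MOSAIK.

```shell
# Hypothetical helper (not part of MOSAIK): report which executables are on PATH.
# The return status is the number of missing tools, so 0 means all were found.
check_mosaik_tools() {
    missing=0
    for tool in MosaikBuild MosaikAligner MosaikJump MosaikText; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    return "$missing"
}
```

Calling `check_mosaik_tools || echo "some MOSAIK tools are not on PATH"` prints one line per tool and a warning if any of them is missing.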

> MosaikAligner -h

$ MosaikAligner -h

Description: pairwise aligns a MOSAIK read file.

Usage: MosaikAligner -in <filename> -out <filename> -ia <filename>

Input/output: (required):

-ia <MOSAIK reference filename> the input reference file

-in <MOSAIK read filename> the input read file

-out <MOSAIK alignment filename> the output alignment file

-ibs <MOSAIK reference filename> enables colorspace to basespace conversion using the supplied BASESPACE reference archive

-annpe <Neural network filename>

-annse <Neural network filename>

Essential parameters:

-a <algorithm> alignment algorithm: [fast, single, multi, all]. def: all

-m <mode> alignment mode: [unique, all]. def: all

-hs <hash size> hash size [4 - 32]. def: 15

Filtering:

-act <threshold> the alignment candidate threshold (length)

-ls <radius> enable local alignment search for PE reads

-mhp <hash positions> the maximum # of positions stored per seed

-mhr <hash regionss> the maximum # of regions for aligning

-min <nucleotides> the minimum # of aligned nucleotides

-minp <percent> the minimum alignment percentage [0.0 - 1.0]

-mm <mismatches> the # of mismatches allowed

-mmp <threshold> the percentage of mismatches allowed [0.0 - 1.0]

-ncg not count gaps as mismatches

Performance:

-p <processors> uses the specified number of processors

-bw <bandwidth> specifies the Smith-Waterman bandwidth. def: 9

-lm enable low-memory functions

Jump database:

-j <filename stub> uses the specified jump database

-kd keeps the keys file on disk

-pd keeps the positions file on disk

-sref <reference prefixes> specifies the prefixes of special references

-srefn <hashes> the maximum special hashes

Reporting:

-statmq <threshold> enable mapping quality threshold for statistical map [0 - 255]

-omi output chromosome ids and positions of multiply mapped alignments in the multiple.bam

-om output complete multiply mapped alignments in the multiple.bam

-zn output zn tags

Pairwise Alignment Scores:

-ms <match score> the match score. def: 10.00

-mms <mismatch score> the mismatch score. def: -9.00

-gop <gap open penalty> the gap open penalty. def: 15.00

-gep <gap extend penalty> the gap extend penalty. def: 1.00

-hgop <gap open penalty> enables the homopolymer gop. def: 4.00

Interface Options:

-quiet disable progress bars and counters

Help:

--help, -h shows this help text

> MosaikBuild

$ MosaikBuild

Description: converts external read formats to native MOSAIK formats.

Usage: MosaikBuild [OPTIONS] [-out|-oa] <filename>

Conversion (Reference Sequence):

-cs translate reference to colorspace

-fr <FASTA reference filename> the FASTA reference sequences file

-ga <genome assembly ID> the genome assembly ID. e.g. HG18

-oa <MOSAIK reference filename> the output reference file

-sn <species name> the species name. e.g. "Homo sapiens"

-uri <uniform resource ID> the URI (e.g. URL or URN)

Conversion (FASTA):

-fr <FASTA read filename> the FASTA reads file

-fq <FASTA quality filename> the FASTA base qualities file

-fr2 <FASTA read filename> the FASTA 2nd mate

-fq2 <FASTA quality filename> the FASTA BQ 2nd mate

-assignQual <base quality> assigns a quality for each base

Conversion (FASTQ):

-q <FASTQ filename or directory> the FASTQ file or directory

-q2 <FASTQ filename or directory> the FASTQ 2nd mate

Conversion (Short Read Format):

-srf <SRF filename or directory> the SRF file or directory

Conversion (Illumina Bustard):

-bd <Bustard directory> the Illumina Bustard directory

-il <lanes> the desired lanes e.g 5678 for lanes 5-8

-split splits the read into two mates

Conversion (Illumina Gerald):

-gd <Gerald directory> the Illumina Gerald directory

-il <lanes> the desired lanes e.g 5678 for lanes 5-8

Read Archive Metadata:

-cn <center name> sequencing center name. e.g. broad

-ds <description> read group description

-id <identifier> read group ID. e.g. SRR009060

-ln <library name> library name. e.g. g1k-sc-NA18944-JPT-1

-mfl <median fragment length> median fragment length. e.g. 150

-pu <run name & lane> the platform unit. e.g. IL12_490_5

-sam <sample name> sample name. e.g. NA12878

-st <sequencing technology> sets the sequencing technology: '454', 'helicos', 'illumina', 'illumina_long', 'sanger' or 'solid'

Read Archive Options:

-out <MOSAIK read filename> the output read file

-tp <# of beginning bases> trims the first # of bases

-ts <# of end bases> trims the last # of bases

Interface Options:

-quiet disable progress bars and counters

Help:

--help, -h shows this help text

> MosaikJump

$ MosaikJump

------------------------------------------------------------------------------

MosaikJump 2.2.26 2014-03-28

Wan-Ping Lee & Michael Stromberg Marth Lab, Boston College Biology Department

------------------------------------------------------------------------------

Description: produces a jump database from a MOSAIK reference file.

Usage: MosaikJump -ia <filename> -out <filename> -hs <hash size>

Input:

-ia <MOSAIK reference filename> the input reference file

Output:

-out <jump filename stub> the stub for the output filenames

Options:

-kd keeps the keys database on disk

-mem <GB> the amount memory used when sorting hashes. def: 2

-hs <hash size> the hash size [4 - 32]

-iupac considers IUPAC

Help:

--help, -h shows this help text
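
The help text above suggests a two-step workflow: build a jump database from the MOSAIK reference archive, then hand its filename stub to MosaikAligner with `-j`. The sketch below only prints the two command lines so the arguments can be inspected first. The file names (`reference.dat`, `read.mkb`, `read.mka`), the stub `reference_hs15`, and the helper name `jump_commands` are assumptions for illustration, and using the same hash size in both steps is my reading of how the options are meant to fit together rather than something this post verifies.

```shell
# Hypothetical helper: print the MosaikJump and MosaikAligner command lines
# for a reference archive, a jump-database stub, and a hash size.
# Nothing is executed here; pipe the output to sh to actually run it.
jump_commands() {
    ref=$1                      # MOSAIK reference archive made by MosaikBuild
    stub=$2                     # filename stub for the jump database files
    hs=${3:-15}                 # hash size; 15 is the MosaikAligner default in its help
    ann=MOSAIK/src/networkFile  # assumed location of the neural network files
    echo "MosaikJump -ia $ref -out $stub -hs $hs"
    echo "MosaikAligner -in read.mkb -out read.mka -ia $ref -hs $hs -j $stub -annpe $ann/2.1.78.pe.ann -annse $ann/2.1.78.se.ann"
}

jump_commands reference.dat reference_hs15 15
```

Printing rather than executing keeps the sketch safe to try on a machine where the reference archive has not been built yet.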

> MosaikText

$ MosaikText

------------------------------------------------------------------------------

MosaikText 2.2.26 2014-03-28

Wan-Ping Lee & Michael Stromberg Marth Lab, Boston College Biology Department

------------------------------------------------------------------------------

Description: exports reads and alignments from MOSAIK.

Usage: MosaikText -in <filename> -out <filename> -ia <filename>

Read Archive Options:

-ir <MOSAIK read filename> the input read file

-fastq <FASTQ filename> stores the data in a FASTQ file

-screen displays the reads on the screen

Alignment Archive Options:

-in <MOSAIK alignment filename> the input alignment file

-axt <axt filename> stores the data in an AXT file

-bam <bam filename> stores the data in a BAM file

-bed <bed filename> stores the data in a BED file

-eland <eland filename> stores the data in an Eland file

-ref <reference sequence name> displays output for a specific reference

-sam <sam filename> stores the data in a SAM file

-screen displays the alignments on the screen

-u limit output to unique reads

Help:

--help, -h shows this help text

Test run

git clone https://github.com/wanpinglee/MOSAIK.git
cd MOSAIK/src/
make
#If MOSAIK was installed with conda, you can skip the build and symlink the executables into bin/ instead

#test run
cd ../demo/
./Build.sh
./Align.sh
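
The symlink alternative mentioned in the comment above could look roughly like the following. This is a sketch under the assumption that the demo scripts look for the executables in `MOSAIK/bin`, which is not verified here. `link_mosaik_bins` is a hypothetical helper name.

```shell
# Hypothetical helper: symlink MOSAIK executables found on PATH into a target
# directory (for example MOSAIK/bin), skipping tools that are not installed.
link_mosaik_bins() {
    target=$1
    mkdir -p "$target"
    for tool in MosaikBuild MosaikAligner MosaikJump MosaikText; do
        # command -v prints the resolved path of an executable on PATH
        src=$(command -v "$tool" 2>/dev/null) || continue
        ln -sf "$src" "$target/$tool"
        echo "linked: $target/$tool -> $src"
    done
}
```

From the directory that contains the clone, `link_mosaik_bins MOSAIK/bin` would then stand in for the `make` step.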

Usage

1. Convert the reference FASTA and the sequencing-read FASTQ files to MOSAIK's binary format

#ref fasta
MosaikBuild -fr reference.fa -oa reference.dat

#illumina fastq
MosaikBuild -q pair_1.fq -q2 pair_2.fq -out read.mkb -st illumina

-fr <FASTA reference filename> the FASTA reference sequences file
-oa <MOSAIK reference filename> the output reference file
-q <FASTQ filename or directory> the FASTQ file or directory
-q2 <FASTQ filename or directory> the FASTQ 2nd mate
-st <sequencing technology> sets the sequencing technology: '454', 'helicos', 'illumina', 'illumina_long', 'sanger' or 'solid'
-out <MOSAIK read filename> the output read file

This produces reference.dat and read.mkb.
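
The MosaikBuild help also lists read-group metadata options (`-id`, `-sam`, `-mfl`) that can be set when the read archive is created. The sketch below prints a paired-end conversion command with those options filled in. The values are placeholders borrowed from the examples in the help text, and `build_reads_command` is a hypothetical helper name.

```shell
# Hypothetical helper: print a MosaikBuild command for paired-end FASTQ input
# with read-group metadata attached. Nothing is executed.
build_reads_command() {
    fq1=$1
    fq2=$2
    out=$3
    id=$4          # read group ID (-id)
    sample=$5      # sample name (-sam)
    mfl=${6:-150}  # median fragment length (-mfl); 150 is the help text's example
    echo "MosaikBuild -q $fq1 -q2 $fq2 -out $out -st illumina -id $id -sam $sample -mfl $mfl"
}

build_reads_command pair_1.fq pair_2.fq read.mkb SRR009060 NA12878 150
```

Replace the placeholder ID, sample name, and fragment length with the values for your own library before running the printed command.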

2. Mapping

The neural network files 2.1.78.pe.ann and 2.1.78.se.ann are included in MOSAIK/src/networkFile in the cloned repository.

 MosaikAligner -in read.mkb -out read.mka -ia reference.dat -annpe MOSAIK/src/networkFile/2.1.78.pe.ann -annse MOSAIK/src/networkFile/2.1.78.se.ann

-out <MOSAIK alignment filename> the output alignment file
-ia <MOSAIK reference filename> the input reference file
-annpe <Neural network filename>
-annse <Neural network filename>

This produces read.mka.bam and read.mka.stat.
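
The two steps above can be chained in a small wrapper. This is a sketch rather than a tested pipeline: the function names `run` and `mosaik_pipeline`, the output naming scheme based on a prefix, and the thread count of 4 passed to `-p` are all assumptions. With `DRY_RUN=1` the commands are printed instead of executed.

```shell
# Hypothetical wrapper around the two steps above (not part of MOSAIK).
# With DRY_RUN=1 each command is printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

mosaik_pipeline() {
    ref_fa=$1
    fq1=$2
    fq2=$3
    prefix=$4
    ann_dir=${5:-MOSAIK/src/networkFile}  # location of the .ann files, as in step 2
    threads=${6:-4}                       # passed to -p; 4 is an arbitrary choice

    # Step 1: convert the reference and the reads to MOSAIK's binary formats.
    run MosaikBuild -fr "$ref_fa" -oa "$prefix.ref.dat" &&
    run MosaikBuild -q "$fq1" -q2 "$fq2" -out "$prefix.mkb" -st illumina &&
    # Step 2: align; stops early if either conversion failed.
    run MosaikAligner -in "$prefix.mkb" -out "$prefix.mka" -ia "$prefix.ref.dat" \
        -annpe "$ann_dir/2.1.78.pe.ann" -annse "$ann_dir/2.1.78.se.ann" -p "$threads"
}
```

For example, `DRY_RUN=1 mosaik_pipeline reference.fa pair_1.fq pair_2.fq sample` prints the three commands. Going by the output names reported above, a real run would presumably leave sample.mka.bam and sample.mka.stat behind.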

Citation

MOSAIK: a hash-based algorithm for accurate next-generation sequencing short-read mapping

Wan-Ping Lee, Michael P Stromberg, Alistair Ward, Chip Stewart, Erik P Garrison, Gabor T Marth

PLoS One. 2014 Mar 5;9(3):e90581.