2018-06-10

MaSuRCA アセンブラ

2018/8/28,29 dockerコマンド等、分かりにくい部分を修正

2019 5/3 動作条件追記、6/12 hybrid assembly リンク追加、10/9 condaインストール追記、ONTのハイブリッド追記、12/22 condaインストール追記

2020 1/22リンク追記

2022/10/10 help更新、dockerイメージのリンク更新

　 2001年にヒトゲノムのドラフトバージョンが作成された後、800bpを超えるリード長を有する第1世代（すなわちSanger）シーケンス技術を使用して、多くのスモールゲノムおよびラージゲノムのシーケンスが決定された。より最近では、様々な種類のsecond-generation sequencing（SGS）技術が、50〜400bpのリード長で出現している（論文が執筆された当時）。近年、着実に改良されてきた新しいアセンブリ方法が、ショートリードアセンブリの課題に対応して開発されてきた。しかし、この進歩にもかかわらず、ゲノムを決定する問題は、解決から遠い。事実上、現在公開されているすべてのアセンブリは、さまざまなレベルの品質を持つ「下書き」ゲノムであり、下流の分析にこれらのゲノムに頼っている科学者にとって重大な問題となる多くのギャップとアセンブリエラーが含まれている。この記事では、ゲノムアセンブリの新しいアプローチによって促進されるゲノムのアセンブリ方法を報告する。最初に、全ゲノムショットガンシーケンシングデータのアセンブリに使用された2つの一般的なアプローチについて簡単に説明する。

　Overlap–layout–consensus (OLC) assembly。簡単に述べると、OLCパラダイムは、最初に、配列の類似性を用いて重複を決定することによって、リード間のすべてのペアワイズ重複を計算しようと試みる。次に、OLCアルゴリズムはレイアウトを作成する。これはすべての重複するリードのアライメントである。このレイアウトから、アルゴリズムはマルチアライメントアラインメントを列ごとにスキャンすることによってコンセンサス配列を抽出する。 Celera Assembler（論文より　Miller et al、2008; Myers et al、2000）、PCAP（Huang、2003）、Arachne（Batzoglou et al、2002）およびPhusion（Mullikin and Ning、2003）を含むSangerシーケンシングデータのためのほとんどのアセンブラ）は、OLCのアプローチに基づいている。OLCアプローチの主な利点の2つは、リード長に関する柔軟性とシーケンシングエラーに対する堅牢性である。見かけ上の重なりが現実である（そしてリピートが誘発されない）可能性を高めるために、OLCアルゴリズムは、典型的には、ある最小長を超えることを要求する。 Celera Assemblerは40 bp以上のオーバーラップを必要とするため、オーバーラップ領域でのエラー率（1〜2％）が小さくなる。

　SGSのデノボ・アセンブリ・プロジェクトでは、サンガーシークエンス・プロジェクトの100倍のリードを生成する。例えばオリジナルのヒトゲノムプロジェクト（Lander et al., 2001; Venter et al., 2001）とマウスゲノムプロジェクト（Mouse Genome Sequencing Consortium et al、 2002）は、それぞれ3500万のリード生成した。一方、最近のヒトゲノムシーケンス(Li et al., 2010) では、3-4億リード生成した。 de Bruijn graphアプローチは、ペアワイズ重複計算を完全に回避する。これがSGSアセンブリの主要な方法になった理由の1つである。

The de Bruijn graph approach。 de BruijnグラフアセンブリアルゴリズムはPevzner et alが先鞭をつけた手法で (Idury and Waterman, 1995; Pevzner, 1989)、オイラーのアセンブラで初めて実装された（Pevzner et al、2001）。オイラーはサンガーシーケンスリードのために設計されているが、SGSデータをアセンブリするためのプログラムや、特にイルミナのデータ用のプログラムでも、同じ枠組みが採用されている。 de Bruijn戦略を使用する最近開発されたアセンブラにはAllpaths-LG（Gnerre et al、2010）、SOAPdenovo（Li et al、2008）、Velvet（Zerbino and Birney、2008）、EULER-SR（Chaisson and Pevzner、2008））およびABySS（Simpsonら、2009）がある。

　このアプローチは、以下のように、シーケンスデータからde Bruijn graphを作成することから始まる。固定値kに対して、各リードから長さk（a k-mer）の各部分文字列は、ノードAとノードBとを接続するgraphの有向枝に割り当てられる。ノードAとBは、それぞれの最初と最後のk-1個のヌクレオチドに対応する。正式にはオイラーパスとして知られているすべてのエッジを正確に1回訪れるgraph内のパスは、リードのドラフトアセンブリを形成している。実際には、これらのグラフは多くの交差するサイクルと多くの代替のオイラーパスを持つ複雑なものであるため、graphを作成することは、優れたドラフトアセンブリを作成する最初の小さなステップに過ぎない。アセンブリを完了するには、graphにメイトペア情報を組み込み、リピートシーケンスによって作成された複雑なサイクルを解消する必要がある。 k-mersはリードよりも短いので、graphにはリードよりも少ない情報しか含まれていないため、後でgraphの曖昧さをなくすためにリードを保持する必要がある。

　このアプローチの主な利点は計算効率である。膨大な数のオーバーラップが計算されないという事実から利益を得ている。主な欠点は、グラフ中のk-mer隣接情報の損失およびデータのエラーによって引き起こされる偽のブランチングである。

このジャーナルでは、著者らが super-readsと呼ぶものを生成する、ショートリードデータアセンブリの第3のパラダイムを提案する。目的は、元のリードに存在するすべてのシーケンス情報を含む super-readsのセットを作成することである。理想的なエラーのない場合については、下記の定理を参照。

　 Super-readsの基本的な概念は、extension（以後、伸長）が一意である限り、各元のリードを前後にベースごとに伸長することである。この概念は以下のように説明することができる。効率的なハッシュテーブルを使用してk-merカウントルックアップテーブルを作成し、各k-merが何回発生するかを素早く判断する。リードの終わりに見つかったk-merがある場合、ゲノムのシーケンスの次のk-merになる可能性のある4つのk-merがある。これらは、最後にA、C、GまたはTを付けることによって形成された文字列ですk-1個の塩基が読み取られる。著者らのアルゴリズムは、これらのk-mersのどれがテーブル内に現れるかを調べる。 4つの可能なk-merのうちの1つのみが出現する場合、その読みは固有のk-merを有し、その塩基をリードに付加する。リードが１通りに伸長されなくなるまで続ける。すなわち、複数の可能性が存在するか、または行き止まりに達するまで。この伸長は、リードの3 '末端および5'末端の両方で行う。新しい長い文字列は、super-readsと呼ばれる。多くのリードは、論文図1（リンク）に示すように同じsuper-readsに伸長される。たとえば、2つの同一でないリピートまたは2つの異なるハプロタイプに由来している場合、これらは別個のsuper-readsを生成するだろう。もちろん、super-readsはde Bruijn graphを使って簡単に計算できる。要点は、super-readsが作成されると、super-readsと接続するメイトペアと一緒に、de brujin graphを一括して置き換えることである。ペア情報は、以下に説明するOLCアセンブリ工程を用いて行われる。スーパーリードデータ計算の2つの最も重要な特性は、以下の通りである。

各元のリードはsuper-readsに含まれているため（情報が失われていない）。
オリジナルリードの多くは同じsuper-readsになるため、super-readsを使用するとデータセットが大幅に削減される。

（複数段落省略）

言い換えれば、super-readsのセットは、リード内のすべての情報を含み、エラーを導入していない。この証拠はsuper-readsの構築に従っている。各リードはsuper-readsに含まれるため、データは失われない（一部略）。MaSuRCAアセンブラは、super-readsからコンティグとスキャホールドを作成するため、CABOGアセンブラのアセンブリ技術を利用している。

ジョンズホプキンス大学 Center for Computational Biology（CCB）のツール一覧

http://ccb.jhu.edu/software.shtml

MaSuRCA (Maryland Super-Read Celera Assembler) ページ

http://masurca.blogspot.com

Slideshare

Masurca genome assembly with super reads from Abdullah Khan Zehady

2020 1/22

インストール

cent os6とubuntu18.04のdockerコンテナ内でテストした。

必要条件（Githubより）

Compile/Install requirements.

To compile the assembler we require gcc version 4.7 or newer to be installed on the system. Only Linux is supported (May or may not compile under gcc for MacOS or Cygwin, Windows, etc). The assembler has been tested on the following distributions:

• Fedora 12 and up

• RedHat 5 and 6 (requires installation of gcc 4.7)

• CentOS 5 and 6 (requires installation of gcc 4.7)

• Ubuntu 12 LTS and up

• SUSE Linux 16 and up

Hardware requirements.

The hardware requirements vary with the size of the genome project. Both Intel and AMD x64 architectures are supported. The general guidelines for hardware configuration are as follows:

• Bacteria (up to 10Mb): 16Gb RAM, 8+ cores, 10Gb disk space

• Insect (up to 500Mb): 128Gb RAM, 16+ cores, 1Tb disk space

• Avian/small plant genomes (up to 1Gb): 256Gb RAM, 32+ cores, 2Tb disk space

• Mammalian genomes (up to 3Gb): 512Gb RAM, 32+ cores, 5Tb disk space

• Plant genomes (up to 30Gb): 1Tb RAM, 64+cores, 10Tb+ disk space

Expected run times.

The expected run times depend on the cpu speed/number of cores used for the assembly and on the data used. The following lists the expected run times for the minimum configurations outlined above for Illumina-only data sets. Adding long reads (454, Sanger, etc. makes the assembly run about 50-100% longer:

• Bacteria (up to 10Mb): <1 hour

• Insect (up to 500Mb): 1-2 days

• Avian/small plant genomes (up to 1Gb): 4-5 days

• Mammalian genomes (up to 3Gb): 15-20 days

• Plant genomes (up to 30Gb): 60-90 days

本体　Github

ビルドに失敗したのでdockerhubのイメージを使う。複数イメージがアップロードされているが、sauloさんのイメージを使用した（https://hub.docker.com/r/sauloal/masurca/）。

docker pull sauloal/masurca
docker run --rm -m "30g" -itv ${PWD}/:/home/ -w /MaSuRCA-2.2.1/bin/ sauloal/masurca

staphbのイメージを使う（link）
docker pull staphb/masurca:latest
#イメージのhomeとカレントディレクトリを共有ディレクトリにして立ち上げる。メモリは100gとした（*2）
sudo docker run --rm -m "100g" -itv ${PWD}/:/home/ staphb/masurca:latest
cp /MaSuRCA-4.0.9/masurca_config_example.txt .

#追記 Bioconda (link) (Python 3.7.4でテスト)
mamba create -n masurca
conda activate masurca
mamba install -c bioconda -y masurca

> ./masurca

# masurca -h

Create the assembly script from a MaSuRCA configuration file. A

sample configuration file can be generated with the -g switch. The

assembly script assemble.sh will run the assembly proper. For a

quick run without creating a configuration file, and with two Illumina

paired end reads files (forward/reverse) and (optionally) a

long reads (Nanopore/PacBio) file use -i switch, setting the number of threads with -t:

masurca -r paired_ends_fwd.fastq.gz -t 32

masurca -r paired_ends_fwd.fastq.gz,paired_ends_rev.fastq.gz -t 32

,and for a hybrid assembly you can also add the long Nanopore or PacBio reads with -r switch:

masurca -r paired_ends_fwd.fastq.gz,paired_ends_rev.fastq.gz -r nanopore.fa.gz -t 32

this will run paired-ends Illumina or hybrid assembly with CABOG contigger and default settings.

This is suitable for small assembly projects.

Options:

-t, --threads ONLY to use with -i option, number of threads

-i, --illumina Run assembly without creating configuration file, argument can be

illumina_paired_end_forward_reads

illumina_paired_end_forward_reads,illumina_paired_end_reverse_reads

if you only have single-end Illumina data. Reads file(s) could be fasta or fastq, can be gzipped.

-r, --reads ONLY to use with -i option, single long reads file for hybrid assembly, can be Nanopore or PacBio,

fasta or fastq, can be gzipped

-v, --version Report version

-o, --output Assembly script (assemble.sh)

-g, --generate Generate example configuration file

-p, --path Prepend to PATH in assembly script

-l, --ld-library-path Prepend to LD_LIBRARY_PATH in assembly script

--skip-checking Skip checking availability of other executables

-h, --help This message

ラン

ペアエンドショートリードのfastqを指定する。

masurca -t 32 -i pe_R1.fq,/path_to/pe_R2.fq

#ロングリードも指定
masurca -t 32 -i pe_R1.fq,pe_R2.fq -r nanopore.fastq.gz

このコマンドは、最初にイルミナデータでNanoporeリードを補正して、デフォルト設定でハイブリッドアセンブリを実行する。fastqはgzip圧縮されていても認識する。

大きなプロジェクトでランするにはconfigファイルが必要になる。configに記載することで、複数のライブラリを効率的に管理し、まとめてアセンブリできるようになっている。configのマスターシートをコピーして編集する（*1）。

docker run --rm -m "30g" -itv ${PWD}/:/home/ -w /MaSuRCA-2.2.1/bin/ sauloal/masurca
cp /MaSuRCA-4.0.9/masurca_config_example.txt .
vi masurca_config_example.txt

> cat config.txt

# example configuration file

# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:

# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads

# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be

# innies, i.e. --->.<---, and JUMP are assumed to be outties

# <---.--->. If there are any jump libraries that are innies, such as

# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads

# are optional for PE libraries and mandatory for JUMP libraries. Any

# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first

# converted into Celera Assembler compatible .frg files (see

# http://wgs-assembler.sourceforge.com)

DATA

#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>

#if single-end, do not specify <reverse_reads>

#If mean/stdev are unknown use 500 and 50 -- these are safe values that will work for most runs

#MUST HAVE Illumina paired end reads to use MaSuRCA

PE= pe 500 50 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq

#Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>

#JUMP= sh 3600 200 /FULL_PATH/short_1.fastq /FULL_PATH/short_2.fastq

#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped

#if you have both types of reads supply them both as NANOPORE type

#PACBIO=/FULL_PATH/pacbio.fa

#NANOPORE=/FULL_PATH/nanopore.fa

#Legacy reads (Sanger, 454, etc) in one frg file, concatenate your frg files into one if you have many

#OTHER=/FULL_PATH/file.frg

#synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data

#REFERENCE=/FULL_PATH/nanopore.fa

END

PARAMETERS

#PLEASE READ all comments to essential parameters below, and set the parameters according to your project

#set this to 1 if your Illumina mate pair (jumping) library reads are shorter than 100bp

EXTEND_JUMP_READS=0

#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content

GRAPH_KMER_SIZE = auto

#set this to 1 for all Illumina-only assemblies

#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)

USE_LINKING_MATES = 0

#specifies whether to run the assembly on the grid

USE_GRID=0

#specifies grid engine to use SGE or SLURM

GRID_ENGINE=SGE

#specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY

GRID_QUEUE=all.q

#batch size in the amount of long read sequence for each batch on the grid

GRID_BATCH_SIZE=500000000

#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads

#can increase this to 30 or 35 if your long reads reads have N50<7000bp

LHE_COVERAGE=25

#this parameter is useful if you have too many Illumina jumping library reads. Typically set it to 60 for bacteria and 300 for the other organisms

LIMIT_JUMP_COVERAGE = 300

#these are the additional parameters to Celera Assembler; do not worry about performance, number or processors or batch sizes -- these are computed automatically.

#CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.

CA_PARAMETERS = cgwErrorRate=0.15

#CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina or long read data

CLOSE_GAPS=1

#number of cpus to use, set this to the number of CPUs/threads per node you will be using

NUM_THREADS = 32

#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20

JF_SIZE = 200000000

#ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.

#Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data

SOAP_ASSEMBLY=0

#If you are doing Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY (no Illumina mate pairs or OTHER frg files).

#Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.

#A lot faster than CABOG, AND QUALITY IS THE SAME OR BETTER.

#DO NOT use if you have less than 20x coverage by long reads.

FLYE_ASSEMBLY=1

END

PE= pe 600 60 /MaSuRCA-2.2.1/test/pair1.fq /MaSuRCA-2.2.1/test/pair2.fq

のように５カラムでデータの情報を記載する。1目のPEはペアエンド、２カラム目の600はインサートサイズ、３カラム目の60はインサートサイズのSD。４−５カラムにペアエンドファイルのパスを記載する。それ以外の行のパラメータは、条件に応じて変更する（コメントアウトされてない行がパラメータ）。

インサートサイズやそのSDをなるべく精度よく求めたければ、seqkit（紹介）などで10%ほどサンプリングしてminimap2（紹介）等でbamを作り、picard-toolsやgoleftなどで計算してください（リンク）。

MaSuRCAの実行。

../bin/masurca config.txt -o assemble.sh
#assemble.shができる

#de novoアセンブリ実行
./assemble.sh

エラーコレクションが終わると、そのまま自動でアセンブリが実行される。

[Sun Jun 10 01:33:04 UTC 2018] Gap closing

Gap close success. Output sequence is in CA/10-gapclose/genome.{ctg,scf}.fasta

ランが終わると、カレントに大量のtempoファイルとCAディレクトリができる。CA/10-gapclose/genome.{ctg,scf}.fastaが出力となる。

ホストとの共有ディレクトリにコピー。

cp CA/10-gapclose/genome.*fasta /home/

説明しやすいのでコンテナ中で作業しましたが、sauloさんが記載されているように、コンテナの外からもジョブは投げれます。

性能については、GAGE-Bやレビューを参考にしてください。GAGEは論文で用いられたハイカバレッジのシーケンスデータもダウンロードしてテストできるようになっています。

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3702249/

2019 10/9 追記

他のシーケンシングリードも使う（*3）。

https://github.com/alekseyzimin/masurcaにconfigファイルが載っているので、これを真似る。以下のように、DATAには様々なリードとリファレンスを指定できる。

DATA
#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
#if single-end, do not specify <reverse_reads>
#MUST HAVE Illumina paired end reads to use MaSuRCA
PE= pe 500 50  /FULL_PATH/frag_1.fastq  /FULL_PATH/frag_2.fastq
#Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
JUMP= sh 3600 200  /FULL_PATH/short_1.fastq  /FULL_PATH/short_2.fastq
#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
#if you have both types of reads supply them both as NANOPORE type
#PACBIO=/FULL_PATH/pacbio.fa
#NANOPORE=/FULL_PATH/nanopore.fa
#Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
#OTHER=/FULL_PATH/file.frg
#synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data
#REFERENCE=/FULL_PATH/nanopore.fa
END

ペアエンドのショートリードは必須になる。これにジャンピングライブラリ（mate pair）、ONTのリード（FASTA）を加えるならDATAは以下のようになる。

DATA
PE= pe 500 50  /FULL_PATH/frag_1.fastq  /FULL_PATH/frag_2.fastq
JUMP= sh 3600 200  /FULL_PATH/short_1.fastq  /FULL_PATH/short_2.fastq
NANOPORE=/FULL_PATH/nanopore.fa
END

対応していないフォーマットを指定すると、masurca config.txtの実行時にエラーが出る。

OLCとde brujin graphのハイブリッドde novoアセンブリツールも出てきています。

ハイカバレッジのバクテリアのde novoアセンブリツール。

Unicycler

引用

The MaSuRCA genome assembler.

Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA.

Bioinformatics. 2013 Nov 1;29(21):2669-77.

GIthubへの移行

http://masurca.blogspot.com/2018/01/masurca-is-now-on-github.html

参考

emacsに慣れているなら、その場でインストールして使う。

apt-get update && apt-get install emacs

何度もランするなら--rmは消してください。毎回configを作るは面倒です。

上で紹介したdockerファイルは古いので対応していない。condaで最新版を入れてください。

#仮想環境を作って導入
conda create -n masurca 
conda activate masurca
conda install -c bioconda -y masurca
#help
masurca -h

追加

MaSuRCA version 3.3.1

MaSuRCA4についても触れられています。

MaSuRCA genome assembly package: March 2019

追記

Hybrid assembly approach with MaSuRCA to assemble genomes from Illumina and Oxford Nanopore reads

KBase

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

MaSuRCA アセンブラ