Meraculous2: an assembler for large polyploid genomes

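Accurate deep shotgun sequencing of human and other gigabase-scale genomes is now readily available at modest cost. This increase in sequencing throughput has driven the development of a new generation of computational algorithms for assembling shotgun sequence from large, complex genomes (reviewed in refs. 1-3 of the preprint). These approaches typically incorporate the de Bruijn graph approach pioneered for short-read assembly by Euler [ref. 4] and Velvet [ref. 5]. Several groups have assembled human and other mammalian genomes with these methods [6-11]. Such assemblies are difficult not only because of the size and repetitiveness of the genomes, but also because of the intrinsic heterozygosity of outbred organisms, which encompasses single-nucleotide variants (SNVs), small- and large-scale insertions and deletions (indels), and larger structural variation. Assembling large, repetitive, polymorphic genomes typically requires a large shared-memory system and takes a week or more to run, so producing even a single assembly demands substantial computing resources. Although several large genomes have been assembled purely from short (<150 bp) paired-end sequences, it remains unclear which genomes have a structure that permits a good-quality assembly and/or which combinations of data facilitate one [11, 13, 14]. The "Assemblathon" competitions help with testing and implementing new approaches by providing common datasets that make direct comparison of algorithms easier [refs. 11, 13]. For large-scale genome projects such as Genome 10K, which aims to assemble 10,000 vertebrate genomes [ref. 15], the development of highly efficient and accurate assembly tools is essential.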

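The authors previously presented the "Meraculous" algorithm, a hybrid k-mer/read-based assembler. Briefly, Meraculous first assembles the unique regions of the genome into preliminary uncontested "UU" contigs by efficiently constructing and traversing a simplified de Bruijn k-mer graph. These contigs are then linked by aligning paired-end data to them, and the gaps remaining in the scaffolds are filled using local assemblies of the relevant reads. Meraculous exploits the high accuracy of current Illumina sequence to avoid an explicit error-correction step, which the authors consider redundant with the assembly process. The first version of Meraculous did not handle polymorphic diploid genomes and, although parallelized, had only been demonstrated on a simulated dataset for a haploid fungal genome of about 15 Mbp; it could not be applied to gigabase-class genomes.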
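The contig-building idea can be illustrated with a toy sketch. This is a simplification under stated assumptions, not Meraculous's actual code: it ignores reverse complements and base qualities, keeps everything in memory, and the function names (`count_kmers`, `unique_extension`, `uu_contigs`) are hypothetical. A k-mer is joined to its neighbour only when the extension is unique in both directions, so a walk stops wherever the graph forks.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurring in the reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def unique_extension(kmer, kmers, forward=True):
    """Return the single neighbouring k-mer, or None if there are zero or several."""
    if forward:
        candidates = [kmer[1:] + base for base in "ACGT"]
    else:
        candidates = [base + kmer[:-1] for base in "ACGT"]
    hits = [c for c in candidates if c in kmers]
    return hits[0] if len(hits) == 1 else None


def uu_contigs(reads, k, min_depth=1):
    """Walk the k-mer graph, joining k-mers only across mutually unique links."""
    kmers = {km for km, n in count_kmers(reads, k).items() if n >= min_depth}
    visited, contigs = set(), []
    for seed in sorted(kmers):
        if seed in visited:
            continue
        visited.add(seed)
        contig = seed
        cur = seed
        while True:  # extend to the right
            nxt = unique_extension(cur, kmers, forward=True)
            if (nxt is None or nxt in visited
                    or unique_extension(nxt, kmers, forward=False) != cur):
                break
            visited.add(nxt)
            contig += nxt[-1]
            cur = nxt
        cur = seed
        while True:  # extend to the left
            prv = unique_extension(cur, kmers, forward=False)
            if (prv is None or prv in visited
                    or unique_extension(prv, kmers, forward=True) != cur):
                break
            visited.add(prv)
            contig = prv[0] + contig
            cur = prv
        contigs.append(contig)
    return sorted(contigs)


if __name__ == "__main__":
    # Two "haplotypes" differing by one base: the walk breaks at the fork.
    print(uu_contigs(["ATGGCGTGCA", "ATGGCTTGCA"], k=4))
    # ['ATGGC', 'GGCGTGC', 'GGCTTGC', 'TGCA']
```

In the two-haplotype example the shared flanks and the two alleles come out as separate contigs, which is the kind of "bubble" a diploid-aware assembler then has to resolve.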

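The paper describes improvements to Meraculous that overcome these earlier limitations and extend the modifications incorporated into the Meraculous entry for Assemblathon 2. The new features are (1) explicit handling of allelic polymorphism using linear chains of "bubble" structures in the de Bruijn graph, (2) improved gap closure based on case studies, and (3) an improved scaffolding algorithm that produces more complete assemblies. A new parallel implementation greatly improves speed and bandwidth efficiency, making it possible to assemble a human genome in under 24 hours of wall-clock time on the JGI Genepool cluster (see Table 1 of the preprint for details of the resources used for this computation). To examine these features of Meraculous2, the authors describe its application to assembling the NA12878 human dataset; NA12878 is a female of European ancestry whose genome has been sequenced and extensively analyzed by the 1000 Genomes Project. Phased maternal and paternal haplotype sequences have been inferred by trio phasing (that is, by combining the sequences of NA12878 and her parents) (partly omitted). Because the NA12878 genome had previously been assembled with ALLPATHS-LG [ref. 9] and SGA [ref. 10], the performance of Meraculous2 was compared with these two state-of-the-art assemblers. The authors present several methods not previously described for this human assembly and discuss several metrics for measuring the accuracy of long-range linkage.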

PDF manual

http://1ofdmq2n8tc36m6i46scovo2e.wpengine.netdna-cdn.com/wp-content/uploads/2014/12/Manual.pdf

It has the following characteristics (from the manual).

Currently Meraculous works with Illumina data only. It relies on Illumina naming conventions and Phred-like sequence quality scores. Long-read/low-depth sequencing platforms are not supported at this time.
An overall mean depth of read coverage of at least 30x is strongly recommended. Low-coverage datasets will likely result in a highly fragmented assembly or an aborted process altogether.
Meraculous deals with genomic diploidy by creating a pseudo-haploid assembly where haplotypes are "squashed", i.e., a contig is formed with a single majority allele. However, the higher the polymorphic rate the less effective this process is. As a result, genomes with polymorphism rates of over 0.05 are better to assemble as haploid, letting Meraculous keep both haplotypes as distinct contigs, in essence imitating a metagenome.
Although it is capable of assembling small bacterial genomes, it may not be the most resource-efficient choice for these scenarios.
Meraculous relies heavily on distributed and threaded computing and will perform best on a multiple-core server or in a cluster environment. For more on this, see sections 'Operating System requirements' and 'Hardware considerations'
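The two numeric thresholds quoted above (30x mean depth, polymorphism rate 0.05) can be checked before starting a run. The sketch below is only a back-of-the-envelope helper; the function names and the idea of a "preflight" check are hypothetical and not part of Meraculous. Mean depth is estimated as reads × read length ÷ genome size.

```python
def mean_coverage(num_reads, read_len, genome_size_bp):
    """Estimated mean depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_len / genome_size_bp


def preflight(num_reads, read_len, genome_size_bp, polymorphism_rate):
    """Return (depth, notes) based on the thresholds quoted from the manual."""
    depth = mean_coverage(num_reads, read_len, genome_size_bp)
    notes = []
    if depth < 30:
        notes.append("mean depth below 30x: expect a fragmented or aborted assembly")
    if polymorphism_rate > 0.05:
        notes.append("polymorphism rate over 0.05: consider assembling as haploid")
    return depth, notes


if __name__ == "__main__":
    # Hypothetical numbers: 900 million 101 bp reads on a 3 Gbp genome.
    depth, notes = preflight(900_000_000, 101, 3_000_000_000, 0.001)
    print(round(depth, 1), notes)
```

The read count, read length and genome size in the example are made-up illustrative values.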

Memory usage (from the manual).

Installation

Tested on CentOS 6.

Dependencies

Meraculous can run on any 64-bit Linux system
cmake >= 2.8
GCC g++ >= 4.4.7
GNU make 3.81
Boost C++ library >= 1.50.0
Perl (>= 5.10)
Log4perl.pm (>= 1.31 )
gnuplot (>= 3.7)
qacct (optional but highly recommended for Grid Engine cluster environments)

I installed Log4perl.pm the quick way with "cpanm Log4perl.pm" (on CPAN the module is distributed as Log::Log4perl).
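Before building, it can save time to confirm that the command-line tools from the dependency list are on the PATH. The following is a minimal sketch with assumptions: the executable names are the usual ones (`cmake`, `g++`, `make`, `perl`, `gnuplot`), the function name `missing_tools` is hypothetical, and it checks presence only, not the minimum versions, Boost, or the Perl module.

```python
import shutil

# Executables implied by the dependency list (assumed names, presence check only).
REQUIRED_TOOLS = ["cmake", "g++", "make", "perl", "gnuplot"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required executables were found.")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default `shutil.which` is used.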

Main program: SourceForge

https://sourceforge.net/projects/meraculous20/

tar -xvf Meraculous-v2.2.5.1-1-ga103cd6.tar
cd Meraculous-v2.2.5.1-1-ga103cd6/
./install.sh <installation directory>

> ./run_meraculous.sh

$ ./run_meraculous.sh

Smartmatch is experimental at /home/uesaka/test2/Meraculous-v2.2.5.1-1-ga103cd6/bin/meraculous.pl line 929.

Smartmatch is experimental at /home/uesaka/test2/Meraculous-v2.2.5.1-1-ga103cd6/bin/meraculous.pl line 2789.

Command line arguments for meraculous.pl (Version 2.2.5.1):

meraculous.pl

Required:

-c|config <config file> : user config file

Optional:

-label <label> : provide a label name for new runs ( Default: 'run' )

-dir <directory> : provide a run directory name ( Default: latest run )

-restart : restart a previously failed assembly

-resume : restart but preserve any partial results

-step : execute one stage and stop

-start <stage> : re-run starting with this stage

-stop <stage> : stop after this stage

-archive : save any old stage directories (valid only with -restart)

-cleanup_level [0|1|2] : decide how agressively the pipeline should clean up intermediate data ( Default: 1)

0 - do not delete any intermediate outputs (disk space footprint may be huge)

1 - delete files that are not used in any of the subsequent stages and that are generally not informative to the user

2 - delete files as soon as possible. WARNING!!! You will not be able to rerun the

stages individually once they have completed!

-h|help : you guessed it: this usage page

-v|version : about this program

The default configuration file is 'meraculous.params', which must be present

The default label name is <genus>_<species>_[strain] if these are defined in

the configuration file, and 'run' otherwise;

-resume/-restart : If no directory is given, the most recently run dir. is used.

Invalid command line combinations:

-restart with -resume

-label with -restart or -resume

-start without -restart or -resume

-archive without -restart

Please contact Eugene Goltsman at egoltsman@lbl.gov if you encounter any problems.
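The "Invalid command line combinations" listed in the help text are easy to get wrong when scripting runs. As a sketch, a wrapper could check them up front; the function name `invalid_combinations` is hypothetical, and the rules are transcribed from the usage message above rather than from the meraculous.pl source.

```python
def invalid_combinations(flags):
    """Return the invalid option combinations (as worded in the help text)."""
    flags = set(flags)
    rerun = flags & {"-restart", "-resume"}
    problems = []
    if {"-restart", "-resume"} <= flags:
        problems.append("-restart with -resume")
    if "-label" in flags and rerun:
        problems.append("-label with -restart or -resume")
    if "-start" in flags and not rerun:
        problems.append("-start without -restart or -resume")
    if "-archive" in flags and "-restart" not in flags:
        problems.append("-archive without -restart")
    return problems


if __name__ == "__main__":
    print(invalid_combinations(["-c", "-start"]))
    # ['-start without -restart or -resume']
```

Only the flag names are inspected here; option values such as the stage name or the config path are outside the scope of this check.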

Run

Execute a test run.

cd /etc/meraculous/test/pipeline
bash /bin/run_meraculous.sh -c meraculous.config

The command specifies a config file. The lib_seq lines give the FASTQ files and their parameters. For details, see the "Basic assembly parameters:" section of the PDF manual.

###################################

# Meraculous params file

###################################

#######################################

# Basic parameters

########################################

# Describe the libraries ( one line per library )

# lib_seq [ wildcard ][ prefix ][ insAvg ][ insSdev ][ avgReadLen ][ hasInnieArtifact ][ isRevComped ][ useForContigging ][ onoSetId ][ useForGapClosing ][ 5pWiggleRoom ][3pWiggleRoom]

lib_seq frags.fastq.25K FRA 180 10 101 0 0 1 1 1 0 0

lib_seq jumps.fastq.25K JMP 3000 500 101 1 1 1 2 1 0 0

genome_size 0.0002

diploid_mode 0

mer_size 21

min_depth_cutoff 7

num_prefix_blocks 4

#################################################

# Advanced parameters

#################################################

no_read_validation 1

use_cluster 0
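Because each lib_seq line carries twelve positional values, it is easy to shift a column by mistake. The sketch below reads a params file like the one above and labels each lib_seq value with the field name taken from the comment line in that file. It is an illustration only: `parse_params` is a hypothetical helper, not part of Meraculous, values are kept as strings, and the real parser may accept syntax this one does not.

```python
# Field names copied from the "# lib_seq [...]" comment in the params file.
LIB_SEQ_FIELDS = [
    "wildcard", "prefix", "insAvg", "insSdev", "avgReadLen",
    "hasInnieArtifact", "isRevComped", "useForContigging",
    "onoSetId", "useForGapClosing", "5pWiggleRoom", "3pWiggleRoom",
]


def parse_params(text):
    """Split a params file into (list of library dicts, dict of other settings)."""
    libraries, settings = [], {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, *values = line.split()
        if key == "lib_seq":
            if len(values) != len(LIB_SEQ_FIELDS):
                raise ValueError(f"lib_seq needs {len(LIB_SEQ_FIELDS)} values: {raw!r}")
            libraries.append(dict(zip(LIB_SEQ_FIELDS, values)))
        else:
            settings[key] = " ".join(values)
    return libraries, settings


if __name__ == "__main__":
    example = """
    # Describe the libraries ( one line per library )
    lib_seq frags.fastq.25K FRA 180 10 101 0 0 1 1 1 0 0
    lib_seq jumps.fastq.25K JMP 3000 500 101 1 1 1 2 1 0 0
    mer_size 21
    diploid_mode 0
    """
    libs, settings = parse_params(example)
    for lib in libs:
        print(lib["prefix"], "insert size", lib["insAvg"], "+/-", lib["insSdev"])
    print(settings)
```

Printing the labelled fields makes it straightforward to confirm, for instance, that the JMP library really is the one with the 3000 bp insert size and the innie-artifact flag set.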