Reference-assisted assembly 1: Ragout - macでインフォマティクス

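A tool that works for both small and large genomes. It was published in 2014 (ref. 1). It is reported to improve assembly accuracy by using multiple closely related genomes. According to the official page, a paper currently under review shows that it can reconstruct mammalian chromosomes (ref. 2).

Installation

Version 2.0 is currently available. macOS is officially supported.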

Ragout - enlarge your contigs!

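Binary releases for macOS and Linux are also provided and can be downloaded from the official page. With the binary release you only need to run ragout.py in the top-level directory, but it uses Sibelia for alignment. Ragout will not run without Sibelia, so first run the following command in the top-level ragout directory to install it.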

python scripts/install-sibelia.py

Now everything is ready.

> python ragout.py -h  # help

$ python ragout.py -h
usage: ragout.py [-h] [-o output_dir] [-s {sibelia,cactus,maf,hal}]
                 [--no-refine] [--solid-scaffolds] [--overwrite] [--repeats]
                 [--debug] [-t THREADS] [--version]
                 recipe_file

A tool for reference-assisted assembly

positional arguments:
  recipe_file           path to recipe file

optional arguments:
  -h, --help            show this help message and exit
  -o output_dir, --outdir output_dir
                        output directory (default: ragout-out)
  -s {sibelia,cactus,maf,hal}, --synteny {sibelia,cactus,maf,hal}
                        backend for synteny block decomposition (default:
                        sibelia)
  --no-refine           disable refinement with assembly graph (default:
                        False)
  --solid-scaffolds     do not break input sequences - disables chimera
                        detection module (default: False)
  --overwrite           overwrite results from the previous run (default:
                        False)
  --repeats             resolve repetitive input sequences (default: False)
  --debug               enable debug output (default: False)
  -t THREADS, --threads THREADS
                        number of threads for synteny backend (default: 1)
  --version             show program's version number and exit

Usage

Run it with the following command. The rcp file is a config file that records the data locations and similar settings.

python ragout.py input.rcp -t 12

The main parameters that affect the results are listed below.

-o output_dir output directory (default: ragout-out)

-s {sibelia,cactus,maf,hal} backend for synteny block decomposition (default: sibelia)

--no-refine disable refinement with assembly graph (default: False)

--solid-scaffolds do not break input sequences - disables chimera detection module (default: False)

--overwrite overwrite results from the previous run (default: False)

--repeats resolve repetitive input sequences (default: False)

-t THREADS number of threads for synteny backend (default: 1)

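The bundled test data cover E. coli, Helicobacter, Vibrio cholerae, and Staphylococcus aureus. As far as I remember, frequent genome shuffling has been reported for S. aureus and V. cholerae. That sounds interesting, so let's assemble the S. aureus data. First, move to the S. aureus directory.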

cd examples/S.Aureus/

There should be a file named aureus.rcp in that directory. This is the config file. It does not need editing, but check its contents with cat.

$ cat aureus.rcp

.references = rf122,col,jkd,n315

.target = usa

col.fasta = references/COL.fasta

jkd.fasta = references/JKD6008.fasta

rf122.fasta = references/RF122.fasta

n315.fasta = references/N315.fasta

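The lines printed after the cat command are the contents of the rcp file.

Line 1: the names of the reference genomes (four here). They do not have to match the file names.

Line 2: the target.

Line 3 onward: the paths to the reference genomes. Paths relative to the current directory are fine.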
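The layout described above can be checked before starting a run. The following is only a minimal sketch, assuming the plain `key = value` layout seen in aureus.rcp and that lines starting with `#` are comments; `parse_recipe` and `missing_fasta` are hypothetical helpers written for this post, not part of Ragout, and Ragout's own parser may accept more syntax.

```python
def parse_recipe(text):
    """Parse 'key = value' recipe lines (hypothetical helper, not Ragout's parser)."""
    recipe = {"references": [], "target": None, "fasta": {}}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # assumption: '#' starts a comment
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError("expected 'key = value': %r" % raw)
        key, value = key.strip(), value.strip()
        if key == ".references":
            recipe["references"] = [n.strip() for n in value.split(",") if n.strip()]
        elif key == ".target":
            recipe["target"] = value
        elif key.endswith(".fasta"):
            # 'col.fasta = references/COL.fasta' -> fasta['col'] = path
            recipe["fasta"][key[:-len(".fasta")]] = value
        # any other key is ignored in this sketch
    return recipe


def missing_fasta(recipe):
    """Return reference names that have no '<name>.fasta' path entry."""
    return [n for n in recipe["references"] if n not in recipe["fasta"]]


EXAMPLE = """\
.references = rf122,col,jkd,n315
.target = usa
col.fasta = references/COL.fasta
jkd.fasta = references/JKD6008.fasta
rf122.fasta = references/RF122.fasta
n315.fasta = references/N315.fasta
"""

if __name__ == "__main__":
    recipe = parse_recipe(EXAMPLE)
    print("references:", recipe["references"])
    print("target:", recipe["target"])
    print("references without a path:", missing_fasta(recipe))
```

A reference name listed in `.references` without a matching `<name>.fasta` line shows up in the last list, which makes typos in a hand-written recipe easy to spot.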

For the example there is no need to edit the config file. Start the run as is.

python ../../ragout.py aureus.rcp -o ragout-out_default_condition -t 12

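When the run finished, the output directory (ragout-out_default_condition, as given with -o) contained h1_scaffolds.fasta along with other intermediate files. The scaffolds file held a single sequence, parts of which were joined by runs of N. Ragout masks repetitive regions when assembling, and assemblies usually break at exactly such repetitive regions. The N regions may therefore derive from repeats.

Re-run the analysis with the --repeats option so that repeats are also included when the graph is built.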
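Checking how many sequences the scaffolds file holds and where the N runs sit can be scripted. The following is a minimal sketch, assuming a plain FASTA file; `read_fasta` and `gap_runs` are hypothetical helper names, not something Ragout provides.

```python
import re
import sys


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gap_runs(seq):
    """Return (start, length) for every run of N/n in seq (0-based start)."""
    return [(m.start(), m.end() - m.start()) for m in re.finditer(r"[Nn]+", seq)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # usage: python this_script.py path/to/scaffolds.fasta
    with open(sys.argv[1]) as handle:
        for name, seq in read_fasta(handle):
            gaps = gap_runs(seq)
            total_n = sum(length for _, length in gaps)
            print(name, "length:", len(seq), "N runs:", len(gaps), "N bases:", total_n)
```

The positions returned by `gap_runs` can then be compared against the alignment plots to see which joins were bridged with N.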

python ../../ragout.py aureus.rcp --repeats -o ragout-out_repeats_resolution -t 12
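The two runs above differ only in the --repeats flag and the output directory, so they can be driven from one small script. This is just a sketch under assumptions: `ragout_command` is a hypothetical helper, the `--run` switch belongs to this sketch and not to Ragout, and only options that appear in the help output above are used.

```python
import subprocess
import sys


def ragout_command(recipe, outdir, threads=1, repeats=False, script="../../ragout.py"):
    """Build the argument list for one Ragout run (hypothetical helper)."""
    cmd = ["python", script, recipe, "-o", outdir, "-t", str(threads)]
    if repeats:
        cmd.append("--repeats")
    return cmd


# (output directory, use --repeats?) for the two conditions tried in this post
RUNS = [
    ("ragout-out_default_condition", False),
    ("ragout-out_repeats_resolution", True),
]

if __name__ == "__main__" and "--run" in sys.argv:
    # Only launches Ragout when this sketch is called with its own --run switch.
    for outdir, repeats in RUNS:
        subprocess.run(ragout_command("aureus.rcp", outdir, 12, repeats), check=True)
```

Keeping each condition in its own output directory avoids needing --overwrite and leaves both scaffold files available for comparison.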

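Some N runs still remained, but let's move on to the actual goal, evaluating the h1_scaffolds.fasta assembly. Using Murasaki and GMV (ref. 3), compare the single scaffold produced by Ragout with the references that were used.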

[Figure: Murasaki/GMV comparison of the scaffold with the references]

Press Control+T to switch to the Toggle plot view.

[Figure: the same comparison in the Toggle plot view]

The example data should consist of sequences from genomes that are already finished. Run the scaffold through blastn to find a reference that matches over its whole length.

[Figure: blastn results]

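The whole scaffold matches the Staphylococcus aureus strain C2406 chromosome, although there are black regions. The order within the scaffold may differ at those sites. Download the sequence from NCBI and compare the scaffold with the C2406 chromosome, again using Murasaki.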

[Figure: Murasaki comparison of the scaffold with the C2406 chromosome]

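A few lines extend to other regions, but almost the entire sequence is arranged in the same order as the C2406 chromosome. This suggests the assembly is most likely correct.

For accuracy the analysis needs two or more closely related genomes; as long as that requirement is met, this is a useful tool. However, a problem shared by reference-based assembly methods in general, not just Ragout, is that inversions, tandem repeats, duplicated genes and the like tend to be left out of the assembly unless they are already incorporated at the contig stage. Verifying the sequence after building the draft genome is therefore always necessary.

Also, if the reference genomes are not similar enough, errors occur during the run. For a successful run, prepare reference genomes from species regarded as nearly identical.

Murasaki and GMV were introduced in an earlier post.

References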

1. Ragout-a reference-assisted assembly tool for bacterial genomes.

Kolmogorov M, Raney B, Paten B, Pham S.

Bioinformatics. 2014 Jun 15;30(12):i302-9. doi: 10.1093/bioinformatics/btu280.

2. Chromosome assembly of large and complex genomes using multiple references

Mikhail Kolmogorov, Joel Armstrong, Brian J. Raney, Ian Streeter, Matthew Dunn, Fengtang Yang, Duncan Odom, Paul Flicek, Thomas Keane, David Thybert, Benedict Paten, Son Pham

doi: https://doi.org/10.1101/088435 (preprint)

3. Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes.

Popendorf K, Tsuyoshi H, Osana Y, Sakakibara Y.

PLoS One. 2010 Sep 24;5(9):e12651.