2017-06-28

Reference-assisted assembly1　ragout

small genomeとlarge genomeいずれにも使えるツール。2014年に発表された（ref.1）。複数の近縁ゲノムを使うことで、アセンブル精度を高めたとされる。公式ページには、現在レビュアー審査中の論文では哺乳類のクロモソームを再構成できると記載されている（ref.2）。

インストール

現在version2.0が公開されている。macが公式サポートされている。

Ragout - enlarge your contigs!

mac、linuxのbinary版も提供されており、こちらのリンクからダウンロードできる。bainary版はディレクトリ直下にあるragout.pyを打つだけだが、アライメントにsibeliaを使う。sibeliaが入っていないと動かないので、最初にragoutディレクトリ直下で以下のコマンドを打ってsibeliaをインストールしておく。

python scripts/install-sibelia.py

これで準備ができた。

> python ragput.py #ヘルプ

$ python ragout.py -h

usage: ragout.py [-h] [-o output_dir] [-s {sibelia,cactus,maf,hal}]

[--no-refine] [--solid-scaffolds] [--overwrite] [--repeats]

[--debug] [-t THREADS] [--version]

recipe_file

A tool for reference-assisted assembly

positional arguments:

recipe_file path to recipe file

optional arguments:

-h, --help show this help message and exit

-o output_dir, --outdir output_dir

output directory (default: ragout-out)

-s {sibelia,cactus,maf,hal}, --synteny {sibelia,cactus,maf,hal}

backend for synteny block decomposition (default:

sibelia)

--no-refine disable refinement with assembly graph (default:

False)

--solid-scaffolds do not break input sequences - disables chimera

detection module (default: False)

--overwrite overwrite results from the previous run (default:

False)

--repeats resolve repetitive input sequences (default: False)

--debug enable debug output (default: False)

-t THREADS, --threads THREADS

number of threads for synteny backend (default: 1)

--version show program's version number and exit

実行方法

ランは以下のコマンドで行う。rcpファイルはデータの場所などを記載したいわゆるconfigファイルである。

python ragout.py input.rcp -t 12

結果に影響する主なパラメータを載せておく。

-o output_dir output directory (default: ragout-out)

-s {sibelia,maf,hal} backend for synteny block decomposition (default: sibelia)

--refine enable refinement with assembly graph (default: False)

--solid-scaffolds do not break input sequences - disables chimera detection module (default: False)

--overwrite overwrite results from the previous run (default: False)

--repeats enable repeat resolution algorithm (default: False)

-t THREADS number of threads for synteny backend (default: 1)

テストデータとして、 E.coli、ヘリコバクター、コレラ菌、黄色ブドウ球菌のデータが含まれている。黄色ブドウ球菌やコレラ菌はゲノムのシャッフリングが高頻度で起こることが報告されていた記憶がある。面白そうなので黄色ブドウ球菌のデータをアセンブルしてみよう。まず黄色ブドウ球菌のディレクトリに移動する。

cd examples/S.Aureus/

そこにはaureus.rcpというファイルがあるはず。これがconfigファイルに当たる。編集する必要はないが、catで中を確認しておく。

user-no-MacBook-Pro:V.Cholerae user$ cat aureus.rcp

.references = rf122,col,jkd,n315

.target = usa

col.fasta = references/COL.fasta

jkd.fasta = references/JKD6008.fasta

rf122.fasta = references/RF122.fasta

n315.fasta = references/N315.fasta

赤の部分がrcpの中身である。

１行目　参照するリファンレンス３つの名前。必ずしもファイル名と同じでなくて良い。

２行目　target

３行目以降　参照するリファンレンスゲノムのパスを書く。カレントディレクトリからの相対パスで構わない。

exampleはconfigファイルを編集する必要はない。そのままランを開始する。

python ../../ragout.py aureus.rcp -o ragout-out_default_condition -t 12

ランが終わると、ragout-outでイレクトリにh1_scaffolds.fastaファイル、またそのほかの中間ファイルができた。scaffoldsを確認すると、１本になっており、一部はNで繋がっていた。ragoutは、繰り返し配列やリピートの領域をマスクしてアセンブルするが、通常アセンブルが切れるのは繰り返し配列やリピートの領域になる。よってNの領域はリピート領域に由来するものかもしれない。

--repeatsオプションをつけて、リピートも含めてグラフが描かれるような設定で再解析する。

python ../../ragout.py aureus.rcp --repeats -o ragout-out_repeats_resolution -t 12

Nはまだ残っていたが、本来の目的であるh1_scaffolds.fastaのアセンブルに移る。mursakiとGMV（ref. 3）を使い、ragoutで一本になったscaffoldsを参照したリファレンスと比較してみる。

f:id:kazumaxneo:20170628133642j:plain

コントロールキー＋TでToggle plot表示に切り替え

f:id:kazumaxneo:20170628144420j:plain

example dataはすでにゲノムがFinishした配列を使っているはずである。scaffoldsをblastnにかけ、全領域がマッチするリファレンスを見つけ出す。

f:id:kazumaxneo:20170628142903j:plain

blastnの結果。

scaffoldsの全領域がStaphylococcus aureus strain C2406 chromosomeとマッチする。ただし黒い部分がある。この部位でscaffoldsの並びが変わっているのかもしれない。配列をNCBIからダウンロードし、再びmurasakiを使ってC2406 chromosomeと比較してみよう。

f:id:kazumaxneo:20170628143708j:plain

線が別領域に伸びているとこは少しはあるが、ほとんどの全領域が C2406 chromosomeと同じ配置になっている。おそらく正しくアセンブルできていることが確認できた。

精度を上げるため解析には2つ以上の近縁種ゲノムが必要であるが、それさえクリアしていれば有用なツールである。ただし、ragoutに限らずReference-based assemblyな手法に共通する問題であるが、inversion、タンデムリイート、コピー遺伝子などがcontigの段階で組み込まれていないと、抜いてアセンブルする傾向がある。そのため、draft genomeを構築後、配列を検証する作業が必ず必要になってくる。

また、参照するゲノムにあまり相同性がなければ、ラン中にエラーを起こす。うまくランさせるには、ほぼ同一とみなされている菌種のリファレンスゲノムを用意する必要がある。

murasakiとGMVは以前紹介しています。

引用

1、Ragout-a reference-assisted assembly tool for bacterial genomes.

Kolmogorov M1, Raney B2, Paten B2, Pham S2.

Bioinformatics. 2014 Jun 15;30(12):i302-9. doi: 10.1093/bioinformatics/btu280.

２、Chromosome assembly of large and complex genomes using multiple references

Mikhail Kolmogorov, Joel Armstrong, Brian J. Raney, Ian Streeter, Matthew Dunn, Fengtang Yang, Duncan Odom, Paul Flicek, Thomas Keane, David Thybert, Benedict Paten, Son Pham

doi: https://doi.org/10.1101/088435 preprint

３、Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes.

Popendorf K1, Tsuyoshi H, Osana Y, Sakakibara Y.

PLoS One. 2010 Sep 24;5(9):e12651.

2017-06-27

k-mer カウントして、配列も出力するツール　jellyfish、BFCounter

k-mer 2011 Bioinformatics BMC Bioinformatics

2017.11追記

2018.03 -sフラグ修正

2019 5/14 リンク追加

2019 9/9 BFcounterインストール追記

2019 12/19 condaインストール追記、BFCounterインストール追記

k-merカウントを行うjellyfishと、k-merの全配列を書き出すBFCounterを紹介する。

Jellyfish

公式サイト

JELLYFISH - Fast, Parallel k-mer Counting for DNA

マニュアルPDF

http://www.cbcb.umd.edu/software/jellyfish/jellyfish-manual-1.1.pdf

Github

git clone https://github.com/gmarcais/Jellyfish.git
cd Jellyfish/
autoreconf -i
 ./configure --prefix=$HOME
make -j 8
sudo make install

#bioconda(link)
#version1系
conda install -c bioconda -y kmer-jellyfish==1.1.12-0

#version2 latest
conda install -c bioconda -y kmer-jellyfish

またはbrewでも導入できる。

ラン

k-merカウントする。

jellyfish count -m 31 -s 10000000 -t 20 input.fasta/fastq -o mer_counts.jf

-m　Length of mer
-s　Initial hash size
-t　Number of threads (1)

インプットファイルはfasta ,fastqとも対応している。

mer_counts.jfができる。中身はbinaryである。このファイルを以後の解析に使う。

簡単なstatistics

jellyfish stats mer_counts.jf

user$ jellyfish stats mer_counts.jf

Unique: 2413100

Distinct: 10097611

Total: 69409538

Max_count: 388

DIstinct k-merはたとえ複数回出現しても1回しか数えられないk-merで、Unique k-merは1回しか出現しないk-mer。Canonicalかどうかでカウント値は影響を受ける。

k-merの配列を全て出力する。

jellyfish dump mer_counts.jf > k31.fasta

中身を見てみる。

user-no-MacBook-Pro: user$ less -S k33.fa

CCCAGGGTCTGGAACTATCGCCGGTGGAAAATATCGAAACGGAAGCGGGGCAAGGGGTTCAGGGTTGGTACCAAGGC

>15

AATGCCCTGGGGAGTAACTGAAAGGAAACGAAAAAATAATCTTCTTCCTATTCTATTCCTTCTGATTGAGTAACCAG

CCTTGTTTGACAGTGCAGTCCGGGTTCCTAAAAGGTTGCAAGGGGATTTATTCATTCGCAGTACCCAAACGGACACC

GGTGGAATAAATACCTGTGGTGAAGGTATTAACGCCAAAATATTGCACTGTGCCAAAATCGTTCAGAGTCCCCATCA

AGAACGTCGTGGACGGCGTCCCCGTCGTGAGGGAACGGTGGAGGTCAAAGGGCGATCGGGAAGGGCGGCGGAACTAT

>16

TGGTGGACGGAAATCTTAAACATTGCCTAGAGAAAAGTTGGGAATTGACGGTGGTGGGGGCAGGGCAAGGAGTGGAA

>15

GGTTTCCTGGGCAAACTCGATGGAACCTGGACAATCGAGGAAATTTAAGCGCAAATTTTCATAATCAATCCCTGCCA

指定のk-merサイズの全塩基配列が出力されている。数値は登場回数である。重複させながら元の配列を分解しているので、元のfastqよりサイズははるかに大きくなる。

ヒストグラムデータを出力。

jellyfish histo mer_counts.jf > histogram.txt

出力の中身を見てみる。

user-no-MacBook-Pro: user$ less -S histogram.txt

1 3101726

2 8283

3 5424

4 12735

5 29386

6 61451

7 103516

8 164603

9 230629

10 291824

11 347466

Rに読み込んでグラフ化する。

純化されたバクテリアのシーケンスデータを分析する。

> R #Rに入る
input <- read.table("histogram.txt") #読み込み
plot(input[2:100,],type="l") #2列目が頻度情報。

f:id:kazumaxneo:20171120142312j:plain

２つ目の山がゲノムのシングルコピーの領域に由来する33merのピークと思われる。ピーク座標を調べる。

input[0:20,]

13がピークと思われる（赤色）。

> input[0:20,]

V1 V2

1 1 136238639

2 2 3362948

3 3 446016

4 4 142167

5 5 122950

6 6 159056

7 7 216472

8 8 287219

9 9 362505

10 10 430060

11 11 487906

12 12 520707

13 13 531598

14 14 531210

15 15 504837

16 16 472747

17 17 435696

18 18 390053

19 19 342325

20 20 298075

もっとちゃんと調べる。

k15

f:id:kazumaxneo:20171120154345j:plain

k35

f:id:kazumaxneo:20171120154503j:plain

k55

f:id:kazumaxneo:20171120154620j:plain

k65

f:id:kazumaxneo:20171120154637j:plain

k77

f:id:kazumaxneo:20171120154650j:plain

k99

f:id:kazumaxneo:20171120155119j:plain

k127

f:id:kazumaxneo:20171120155356j:plain

kmergenuineでもbestなk-merを探す（参考にしたページ）。（kmergenuineはbrewで導入できる。）

kmergenie input.fq

fitting model to histograms to estimate best k

table of predicted num. of genomic k-mers: histograms.dat

recommended coverage cut-off for best k: 2

best k: 65

kmergenuineでは65-merがベストと判定された。

65-merの入力テーブルを使い、ゲノムサイズを推定する。

sum(as.numeric(input[12:1130,1]*input[12:1130,2]))/65

リピートがあると、その分だけ少ない値が出るはずである。

k-mer coverageの貧弱なものを除きたいならkhmerを使う。khmerはk-merグラフの分布から低頻度k-mer配列を除去したり様々な機能をもつde novo assemblyをアシストするツールである。（pipでインストール可能）

load-into-counting.py -T 20 -k 20 -x 3e6 output1 input.fastq

T スレッド数
k k-merサイズ
x 大まかなゲノムサイズ。

filter-abund.py -C 2 -T 20 output1 input.fastq

注　khmrは別途記事にします。

BFCounter

BFCounterは k-merを数えて、全部の配列を出力するツール。

Github

https://github.com/pmelsted/BFCounter

ソースコードをダウンロードしてmakeする。

git clone https://github.com/pmelsted/BFCounter.git
cd BFCounter/

#ここではlarge k-merを扱えるようにしておく
make MAX_KMER_SIZE=300

パスを通しておく。

# ./BFCounter

Error: too few arguments

BFCounter BFCounter 1.0

A memory efficient k-mer counting program.

Usage: BFCounter <cmd> [options] ...

Where <cmd> can be one of:

count Counts the occurrences of k-mers in sequence files

dump Writes occurrences of k-mers into a tab-separated text file

cite Prints information for citing the paper

version Displays version number

ラン

BFCounter count -n 1000 -k 127 -o single_out -t 12 input.fq

-n　Estimated number of k-mers
-k　Size of k-mers, at most 301
-t　Number of threads to use (default 1)

出力はbinaryファイルである。ここからk-mer長の配列を抽出する。

BFCounter dump -i single_out -k 127 -o k-127_output

-i　Filename for input file from count
-k　Size of k-mers, at most 301
-o　Filename for output

出力の中身を見てみる。

f:id:kazumaxneo:20170627225750j:plain

このように全k-mer配列が出力されている。あとはスクリプトを使い自由にいじることができる。

このツールの機能はこれだけであるが、シンプルでなため使いやすい。

追記

１、シロイヌナズナゲノム (k-mer size 28)

$ jellyfish stats mer_counts.jf

Unique: 109610472

Distinct: 112718914

Total: 119479935

Max_count: 4778

28merという大きなサイズでゲノムのk-merを調べると、偶然２回以上出現する配列は限りなくゼロになる。そのため、ユニークなk-merの出現頻度は大半が１になり（下の図）、頻度1のユニークなk-merの数とゲノムのサイズはほぼ等しくなる。しかしリピートがあるゲノム（28merで調べているなら28以上完全に同一の配列がある）ならば、同じ28-merの配列が複数回出現するため、出現頻度1のユニークなk-merの合計と、全k-merのトータル（重複があるかどうかを問わず）に大きな差が生じる。リピート領域が多ければ、ヒストグラムの裾野の分布（肩）が右向きに広く（コピー数が多い）、高く（コピーの割合が多い）なる。

f:id:kazumaxneo:20180412152648j:plain

２、シーケンスデータのk-mer分布。

シロイヌナズナのillumina rawシーケンスデータ (k-mer size 28)

$ jellyfish stats mer_counts.jf

Unique: 241278948

Distinct: 475904703

Total: 3772597024

Max_count: 843576

シーケンスデータのk-merカウントを行うと、高いカバレッジでシーケンスしているため、ゲノムがリピートを持つかどうかに関わらず、ユニークなk-merはほぼゼロという結果が得られる（下の図）。それにもかかわらずユニークなk-merの数が非常に多いのは、シーケンスエラーが発生しているためである。平均クオリティQ30の100bpリードは、平均して10リードに１つしかシーケンスエラーを含まないが、それでも１つのシーケンスエラーが発生すると、28merでカウントしているなら28のユニークk-merが出現する（変化して他に配列と合致する可能性がゼロとして）。リアルデータだと、シーケンスエラーの発生はそこまで単純ではないため、おそらくはユニークk-mer数はもっと多くなる。参考リンク

f:id:kazumaxneo:20180412154040j:plain

図、左端はシーケンスエラー由来のピークで、ゲノムアセンブリに必要な真の情報は頻度5-15付近のピーク部分。リピートが多いゲノムのシーケンスデータでは、右に大きな第二ピークができる。

引用

Guillaume Marcais and Carl Kingsford, A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Bioinformatics (2011) 27(6): 764-770 (first published online January 7, 2011) doi:10.1093/bioinformatics/btr011)

Efficient counting of k-mers in DNA sequences using a bloom filter.

BMC Bioinformatics. 2011 Aug 10;12:333. doi: 10.1186/1471-2105-12-333.

Melsted P1, Pritchard JK.

K-mer analysis and genome size estimate

velvetのベストなk-merを自動で決めてアセンブルするvelvetoptimiser

k-mer assembly

velvetoptimiserは自動でk-merを振ってKmer coverage を調べ、velvetのアセブルにベストと思われるk-merのサイズを決め、アセンブルまで自動で行うラッパーツール。Velvetkよりもっと便利に使える。

Github

wikiリンク

https://wiki.gacrc.uga.edu/wiki/VelvetOptimiser

macでもvelvetとBioperlをインストールしていれば動く。BioperlはCPANで検索してダウンロード&インストールする（CPANリンク）。

MiSeqペアードエンドデータをアセンブルするなら

perl VelvetOptimiser.pl -s 19 -e 151 -x 8 -f '-fastq -shortPaired R1.fq R2.fq'

--s The starting (lower) hash value (default '19').
--e The end (higher) hash value (default '99').
--x The step in hash search.. min 2, no odd numbers (default '2').

--sで検討する最小のk-merサイズ、--eで検討する最大のk-merサイズを指定する。この例だとk=19から始まり、k=8刻みでk=151まで検討することになる。最大サイズは当然リードよりも短くする必要がある。アセンブルは自分で行うが、velvetに限らず最適なk-merサイズを調べたければ、kmergenieが使いやすい。パスを通しておけば、kmergenie input.fqだけでk-merのヒストグラムを描いてくれる。

kmergenieについては、日本の方の記事リンクを貼っておきます。

引用

GitHub - tseemann/VelvetOptimiser: Automatically optimise three of Velvet's assembly parameters.

2017-06-27

De novoアセンブリ関連のツールまとめ

assembly summary

2020 3/8 整頓、help追記, 2024/01/22 タイトルからショートリード削除

随時更新

これまでNGSデータをアセンブル様々なアセンブルやツールが発表されてきた。例えばOMICs toolでヒットするツールは100を超える（https://omictools.com/genome-assembly-category）。全てを整理しきるのは難しいが、代表的と思われるいくつかのツールをインストールしてランした結果をまとめる。

追加分

主にショートリードを使うアセンブラ

１、ラージゲノムにも対応したアセンブラ

ABySS 2.0
polyploidのラージゲノムのアセンブラ Meraculous2
highly heterozygous なゲノムのアセンブリパイプライン Redundans
MaSuRCA アセンブラ （複数シーケンシングテクノロジー）
高速なラージゲノムのハイブリッドアセンブルツール DBG2OLC
haplotype-resolved assemblyを行う Hifiasm
オーバーラップグラフに基づいたde novoゲノムアセンブラ ALGA

２、スモールゲノム向けのアセンブラ

３、ウィルスゲノムのアセンブラ

４、メタゲノムのアセンブラ

５、オルガネラゲノムのアセンブラ

６、プラスミドのアセンブラ

アセンブリグラフからプラスミドを検出する HyAsP

7、シード配列から拡張

コンティグの拡張アセンブリを行う PRICE

８，その他

サテライトDNAモチーフをアセンブルする SRF

主にロングリードを使うアセンブラ

１、ラージ〜スモールゲノム

２、メタゲノムのアセンブラ

３，ターゲットアセンブラ

ターゲット2倍体ゲノムアセンブラ JTK

４、プラスミドのアセンブラ

バクテリアのプラスミド自動アセンブリツール Plassembler

エラーコレクションツール

１、ロングリード

２、ショートリード

その他

scaffoldingツール

ガイドアセンブリ

前処理やフィルタリング

Merge

Polishing

ロングリードのドラフトアセンブリをpolishする marginpolish
ロングリードを使ってリファレンスベースのアセンブリとpolishを行う Rebaler
miniasmでアセンブリして得たGFAをポリッシュする minipolish
ナノポアのアセンブルデータのキュレーション及び変異の検出 nanopolish
ロングリードのドラフトアセンブリからコンセンサス配列を出力する Racon
高速かつ精度が高い（illuminaとPacbio HiFlリードに対応）　POLCA
コンセンサスコールを行うmedaka
ラージゲノムにもスケールする高速且つ精度の高いドラフトゲノムポリッシャー hypo
ONTのロングリードアセンブリをポリッシュする PEPPER
ハプロタイプを考慮してドラフトゲノムアセンブリをポリッシュする Hapo-G
ラージゲノムにもスケールする高速なドラフトゲノム配列polishingツール ntEdit
ロングリードを使ったゲノムアセンブリの評価とミスアセンブリ修正を行う Inspector（SVを含むゲノムアセンブリ評価、ミスアセンブリ修正）

評価ツール

シミュレータ

Repetitive sequence

Primary contigs

２倍体ゲノムドラフトアセンブリからPrimary contigsを出力　Purge Haplotigs

リード情報は使わないscaffoldingツール

Ab initio gene prediction

Evidence-driven gene prediction

EVidenceModeler

複数の遺伝子モデルを統合する EVidenceModeler

Eukaryotic genome Annotation (including gene predictions)

真核生物のゲノムプロジェクトにおいて共同研究者と共にアノテーションを効率的に進めるためのwebサービス GenSAS （リピートマスクと遺伝子予測、予測結果のリファイン、予測された遺伝子の機能アノテーションまで全てサポート）
Functional annotationを行うwebサーバー PANNZER2
機能アノテーション付けを行うwebサービス eggNOG-Mapper
De novo遺伝子予測やメタゲノムの機能アノテーションなどに対応したeggNOG-Mapper v2（denovo で作ったcontigの汚染の有無の確認にも利用可能）
アノテーションパイプライン PASA（遺伝子予測と予測結果の向上）
eukaryotesのアノテーション BRAKER2
真菌を中心とした真核生物ゲノムのアノテーションツール funannotate
真菌の遺伝子予測を行う CodingQuarry
GenBankやEMBL/ENA flat fileをNCBI submission用に変換する GB2sequin
真菌ゲノムのアノテーションパイプライン FunGAP

wiki

GAGE-Aのレシピ (論文 http://genome.cshlp.org/content/22/3/557.long)

http://gage.cbcb.umd.edu/recipes/index.html

AbySS

de Bruijn graph based assembler

マニュアル

http://crusade1096.web.fc2.com/abyss.html

Github

https://github.com/bcgsc/abyss/#install-abyss-on-debian-or-ubuntu

インストールは brewやconda で行える。

#bioconda
conda create -n abyss -y
conda activate abyss
conda install -c bioconda -y abyss

#homebrew
breq install homebrew/science/abyss

brewでインストールするとmax kは64だが、手動でインストールするなら、例えば ./configure --enable-maxk=96 としてビルドすれば、k=96まで扱えるように修正できる。

ただしソースからビルドするならboostが必要になる。

> abyss --help

u$ abyss --help

Usage: ABYSS -k<kmer> -o<output.fa> [OPTION]... FILE...

Assemble the input files, FILE, which may be in FASTA, FASTQ,

qseq, export, SAM or BAM format and compressed with gz, bz2 or xz.

Options:

--chastity discard unchaste reads [default]

--no-chastity do not discard unchaste reads

--trim-masked trim masked bases from the ends of reads

[default]

--no-trim-masked do not trim masked bases from the ends of

reads

-q, --trim-quality=N trim bases from the ends of reads whose

quality is less than the threshold

-Q, --mask-quality=N mask all low quality bases as `N'

--standard-quality zero quality is `!' (33)

default for FASTQ and SAM files

--illumina-quality zero quality is `@' (64)

default for qseq and export files

--SS assemble in strand-specific mode

--no-SS do not assemble in strand-specific mode

-o, --out=FILE write the contigs to FILE

-k, --kmer=N the length of a k-mer (when -K is not set) [<=128]

or the span of a k-mer pair (when -K is set)

-K, --single-kmer=N the length of a single k-mer in a k-mer pair

-t, --trim-length=N maximum length of blunt contigs to trim [k]

-c, --coverage=FLOAT remove contigs with mean k-mer coverage

less than this threshold

--kc=N remove all k-mers with multiplicity < N [0]

-b, --bubbles=N pop bubbles shorter than N bp [3*k]

-b0, --no-bubbles do not pop bubbles

-e, --erode=N erode bases at the ends of blunt contigs with coverage

less than this threshold [round(sqrt(median))]

-E, --erode-strand=N erode bases at the ends of blunt contigs

with coverage less than this threshold on

either strand [1 if sqrt(median) > 2 else 0]

--coverage-hist=FILE write the k-mer coverage histogram to FILE

-m, --mask-cov do not include kmers containing masked bases in

coverage calculations [experimental]

-s, --snp=FILE record popped bubbles in FILE

-v, --verbose display verbose output

--help display this help and exit

--version output version information and exit

--db=FILE specify path of database repository in FILE

--library=NAME specify library NAME for database

--strain=NAME specify strain NAME for database

--species=NAME specify species NAME for database

ABYSS Options: (won't work with ABYSS-P)

-g, --graph=FILE generate a graph in dot format

Report bugs to <abyss-users@bcgsc.ca>.

ラン

abyss-pe name=bacteria j=16 k=64 n=5 s=100 in='R1.fq R2.fq'

j number of threads [2]
k size of k-mer (bp)
s minimum unitig size required for building contigs (bp) [200]
n minimum number of pairs required for building contigs [10]
in=' 'でinputするfastqを指定。

forループを使い、複数のk-merを自動検討する。

for k in {65..96}; do ¥
 ABYSS -k$k R1.fq -o contigs-k$k.fa ¥
done

SOAPdenovo2

de Bruijn graph based assembler

公式マニュアル

SOAPdenovo: short-read assembly

公式マニュアル2

SOAP :: Short Oligonucleotide Analysis Package

mamba create -n soapdenovo2 -y 
conda activate soapdenovo2
mamba install -c conda-forge -c bioconda -y soapdenovo2

> SOAPdenovo-127mer -h

$ SOAPdenovo-127mer -h

Version 2.04: released on July 13th, 2012

Compile Dec 23 2013 10:46:03

Usage: SOAPdenovo <command> [option]

pregraph construct kmer-graph

sparse_pregraph construct sparse kmer-graph

contig eliminate errors and output contigs

map map reads to contigs

scaff construct scaffolds

all do pregraph-contig-map-scaff in turn

> SOAPdenovo-127mer all

$ SOAPdenovo-127mer all

Version 2.04: released on July 13th, 2012

Compile Feb 14 2018 20:39:13

SOAPdenovo all -s configFile -o outputGraph [-R -F -u -w] [-K kmer -p

n_cpu -a initMemoryAssumption -d KmerFreqCutOff -D EdgeCovCutoff -M

mergeLevel -k kmer_R2C, -G gapLenDiff -L minContigLen -c minContigCvg

-C maxContigCvg -b insertSizeUpperBound -B bubbleCoverage -N

genomeSize]

-s <string> configFile: the config file of solexa reads

-o <string> outputGraph: prefix of output graph file name

-K <int> kmer(min 13, max 127): kmer size, [23]

-p <int> n_cpu: number of cpu for use, [8]

-a <int> initMemoryAssumption: memory assumption initialized

to avoid further reallocation, unit G, [0]

-d <int> kmerFreqCutoff: kmers with frequency no larger than

KmerFreqCutoff will be deleted, [0]

-R (optional) resolve repeats by reads, [NO]

-D <int> edgeCovCutoff: edges with coverage no larger than

EdgeCovCutoff will be deleted, [1]

-M <int> mergeLevel(min 0, max 3): the strength of merging

similar sequences during contiging, [1]

-e <int> arcWeight: two edges, between which the arc's weight

is larger than arcWeight, will be linerized, [0]

-m <int> maxKmer (max 127): maximum kmer size used for multi-kmer, [NO]

-E (optional) merge clean bubble before iterate, works only if -M

is set when using multi-kmer, [NO]

-k <int> kmer_R2C(min 13, max 127): kmer size used for mapping

reads to contigs, [K]

-F (optional) fill gaps in scaffolds, [NO]

-u (optional) un-mask contigs with high/low coverage before

scaffolding, [mask]

-w (optional) keep contigs weakly connected to other contigs in

scaffold, [NO]

-G <int> gapLenDiff: allowed length difference between

estimated and filled gap, [50]

-L <int> minContigLen: shortest contig for scaffolding, [K+2]

-c <float> minContigCvg: minimum contig coverage (c*avgCvg),

contigs shorter than 100bp with coverage smaller than c*avgCvg will be

masked before scaffolding unless -u is set, [0.1]

-C <float> maxContigCvg: maximum contig coverage (C*avgCvg),

contigs with coverage larger than C*avgCvg or contigs shorter than

100bp with coverage larger than 0.8*C*avgCvg will be masked before

scaffolding unless -u is set, [2]

-b <float> insertSizeUpperBound: (b*avg_ins) will be used as

upper bound of insert size for large insert size ( > 1000) when

handling pair-end connections between contigs if b is set to larger

than 1, [1.5]

-B <float> bubbleCoverage: remove contig with lower cvoerage in

bubble structure if both contigs' coverage are smaller than

bubbleCoverage*avgCvg, [0.6]

-N <int> genomeSize: genome size for statistics, [0]

-V (optional) output information for Hawkeye to visualize the assembly, [NO]

ラン

ランはconfig_fileファイル（参考）を作成し、それを元にして実行する。mate-pairデータも扱うことができ、 contigをNでつなぎscaffoldsを作ってくれる。paired-endデータを使った場合もこの作業は行われる。ただしメモリ喰いで、ある程度大きいゲノムだとサーバークラスのメモリが必要となる。64GB程度だと途中でストップする。khmerなどのツールであらかじめシーケンスデータを減らしておく必要があるかもしれない。

SOAPdenovo-127mer all -s config_file -k 127 -R -o output -p 20

SPAdes

de Bruijn graph based assembler

現在もバージョンアップが続けられているbacteriaむけのアセンブラ。paired-endデータを扱えるが、それだけでなくロングリードを使いhybrid-assemblyを行うことも可能である。

spadesはハプロタイプのような部位もできる限り繋げようとする傾向が強い。また、contigをpaired-ned情報を使いscaffoldingするため、ほかと比べてかなり長いcontigができる。ただしこれは精度と性能の天秤にになりやすく、微妙な部位をむりやり繋げる恐れも高いかもしれない。例えば、monoploidの生き物でも、ゲノムがマルチコピーだとmelody ploideの状態のindelが発生している可能性はある。そのようなシーケンスデータでも、spadesはarbitraryな判断で繋ぐ傾向が強い。これは例えWT:MT=50:50でも起こる（そもそもspadesは純化したバクテリアゲノム向け）。

spadesはbrewでインストールできる。

ラン

spades.py -k auto -t 8 -1 R1.fq -2 R2.fq -o output_dir

Edena

string graph-based assembler

公式サイト

http://computing.bio.cam.ac.uk/local/doc/edena.pdf

mamba create -n edena -y 
conda activate edena
mamba install -c bioconda -y edena

> edena

$ edena

Edena v3.131028

David Hernandez, Patrice Francois, Jacques Schrenzel

Genomic Research Laboratory, Geneva University Hospitals, Switzerland

PROGRAM OPTIONS:

1) Overlapping mode:

-nThreads <int> Number of threads to use. Default 2

-r

-singleEnd <file1> <file2> ...

Reads file(s) in FASTA or FASTQ format.

Several files can be specified

-DRpairs

-paired <file1_1> <file1_2> <file2_1> <file_2_2> ...

Direct-Reverse paired reads files. Several

pairs of files can be specified.

-RDpairs

-matePairs <file1_1> <file1_2> <file2_1> <file_2_2> ...

Reverse-Direct paired reads files. Several

pairs of files can be specified.

-p

-prefix <name> Prefix for the output files. Default is "out".

-M

-minOverlap <int> Minimum size of the overlaps to compute.

Default is half of the reads length.

-t

-truncate <int> Truncate the 3' end of the reads TO the specified

length

2) Assembler mode:

-e

-edenaFile <file.ovl> Edena overlap (.ovl) file. Required.

-p

-prefix <name> Prefix for the output files.

-m

-overlapCutoff <int> Only consider overlaps >= than the specified size.

The optimal setting of this parameter depends on the

coverage that was achieved by the sequencing run.

You should therefore try different values in order

to get the optimal one.

-c

-minContigSize <int> Minimum size of the contigs to output.

Default is 1.5*readLength.

-cc

-contextualCleaning

<yes/no> Contextual cleaning of spurious edges.

-minCoverage <int> Minimum required coverage for the contigs. This

value is automatically determined if not specified.

-sph

-shortPeHorizon <int> Maximum search distance for short >< paired-end

sampling. Default: 1000

-lph

-longPeHorizon <int> Maximum search distance for long <> paired-end

sampling. Default: 15000

-peHorizon <int> obsolete: Maximum search distance for both short

and long paired-end reads sampling.

-trimRed <yes/no> By default, possible redundant sequences, caused by

unresolved repeats, are trimmed from contigs ends.

Setting this flag to 'no' allows keeping such

redundancies up to the length of the largest insert

size (maxJump). !! setting this setting to 'no'

produces an artificially increased assembly size !!

-maxRed <int> Max ending redundancy length. Default: 0 (equivalent

to '-trimRed yes'. Overrides -trimRed.

-d

-deadEnds <int> Maximum length for dead-end paths removal.

Default value is set to 2*readLength-1.

-discardNonUsable

<yes/no> Reads that are involved in orphan nodes smaller than

1.5*readLength are considered as "non-usable".

This filter discards such nodes. Default: enabled

-trim <int> Coverage cutoff for contigs ends:

Contig ends ending in a gap may contain errors due

to low coverage. This option trim a few bases from

these ends until a minimum coverage is reached.

Default is 4. Ends are not trimmed if set to 1.

-shell Interactive shell. Allows querying the assembly

graph.

REPORT BUGS:

david.hernandez@genomic.ch

ラン

fastq配列の長さが同じでないと動かないので、短いリードを除去し、さらにスクリプトを書いて整形した。

edena -paired R1.fq R2.fq

out.ovlが出力される。

edena -o out.ovl

mオプション次第でかなり変わる。-mの値は100-249まで選べるが、Miseq 301bpだと150前後がベストな気がする。

A5-miseq

quality check、data trimming、エラーコレクション、contig生成、scaffold構築、視覚化を全自動でやってくれるのが特徴のアセンブラである。A5-miseqは手法としての新しさよりも、パイプラインを自動化することに重きを置いているため、コマンドラインに不慣れなユーザーでも比較的手軽に扱うことができる。

ダウンロード

https://sourceforge.net/projects/ngopt/

conda create -n a5-miseq -y 
conda activate a5-miseq
conda install -c bioconda -y a5-miseq

> a5_pipeline.pl -h

$ a5_pipeline.pl -h

A5-miseq version 20160825

Usage: a5_pipeline.pl [--begin=1-5] [--end=1-5] [--preprocessed] [--threads=4] [--debug] [--metagenome] <lib_file> <out_base>

Or: a5_pipeline.pl <Read 1 FastQ> <Read 2 FastQ> <out_base>

Or: a5_pipeline.pl <Read 1,2 Interleaved FastQ> <out_base>

<out_base> is the base file name for all output files. When assembling from

a single library, the fastq files may be given directly on the command line.

If using more than one library, a library file must be given as <lib_file>.

The library file must contain the filenames of all read files.

If --preprocessed is used, <lib_file> is expected to be the library file

created during step 2 of the pipeline, named <out_base>.preproc.libs. Note

that this flag only applies if beginning pipeline after step 2.

docker版があり、仮想環境で気軽にテストもできる(docker-A5-Miseq)。

参考動画

Velvet

インストール

#bioconda
conda create -n velvet -y 
conda activate velvet
conda install -c bioconda -y velvet

#from source
git clone https://github.com/dzerbino/velvet.git 
cd velvet/
vi MakeFile #開く。31mer以上でもアセンブル用にする。別のエディタで開いてもOK
６行目のMAXKMERLENGTH?=31を => MAXKMERLENGTH?=250 とかに修正。
#保存して閉じる。

make #makeはデフォルトでMakefileを自動認識する
パスを通しておく。

> velvetg

$ velvetg

velvetg - de Bruijn graph construction, error removal and repeat resolution

Version 1.2.10

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Compilation settings:

CATEGORIES = 4

MAXKMERLENGTH = 191

OPENMP

LONGSEQUENCES

Usage:

./velvetg directory [options]

directory : working directory name

Standard options:

-cov_cutoff <floating-point|auto> : removal of low coverage nodes AFTER tour bus or allow the system to infer it

(default: no removal)

-ins_length <integer> : expected distance between two paired end reads (default: no read pairing)

-read_trkg <yes|no> : tracking of short read positions in assembly (default: no tracking)

-min_contig_lgth <integer> : minimum contig length exported to contigs.fa file (default: hash length * 2)

-amos_file <yes|no> : export assembly to AMOS file (default: no export)

-exp_cov <floating point|auto> : expected coverage of unique regions or allow the system to infer it

(default: no long or paired-end read resolution)

-long_cov_cutoff <floating-point>: removal of nodes with low long-read coverage AFTER tour bus

(default: no removal)

Advanced options:

-ins_length* <integer> : expected distance between two paired-end reads in the respective short-read dataset (default: no read pairing)

-ins_length_long <integer> : expected distance between two long paired-end reads (default: no read pairing)

-ins_length*_sd <integer> : est. standard deviation of respective dataset (default: 10% of corresponding length)

[replace '*' by nothing, '2' or '_long' as necessary]

-scaffolding <yes|no> : scaffolding of contigs used paired end information (default: on)

-max_branch_length <integer> : maximum length in base pair of bubble (default: 100)

-max_divergence <floating-point>: maximum divergence rate between two branches in a bubble (default: 0.2)

-max_gap_count <integer> : maximum number of gaps allowed in the alignment of the two branches of a bubble (default: 3)

-min_pair_count <integer> : minimum number of paired end connections to justify the scaffolding of two long contigs (default: 5)

-max_coverage <floating point> : removal of high coverage nodes AFTER tour bus (default: no removal)

-coverage_mask <int> : minimum coverage required for confident regions of contigs (default: 1)

-long_mult_cutoff <int> : minimum number of long reads required to merge contigs (default: 2)

-unused_reads <yes|no> : export unused reads in UnusedReads.fa file (default: no)

-alignments <yes|no> : export a summary of contig alignment to the reference sequences (default: no)

-exportFiltered <yes|no> : export the long nodes which were eliminated by the coverage filters (default: no)

-clean <yes|no> : remove all the intermediary files which are useless for recalculation (default : no)

-very_clean <yes|no> : remove all the intermediary files (no recalculation possible) (default: no)

-paired_exp_fraction <double> : remove all the paired end connections which less than the specified fraction of the expected count (default: 0.1)

-shortMatePaired* <yes|no> : for mate-pair libraries, indicate that the library might be contaminated with paired-end reads (default no)

-conserveLong <yes|no> : preserve sequences with long reads in them (default no)

Output:

directory/contigs.fa : fasta file of contigs longer than twice hash length

directory/stats.txt : stats file (tab-spaced) useful for determining appropriate coverage cutoff

directory/LastGraph : special formatted file with all the information on the final graph

directory/velvet_asm.afg : (if requested) AMOS compatible assembly file

ラン

１、velvethで処理するためのデータを準備。

velveth folder-name 31,45,2 -fastq -shortPaired1 input1.fastq -shortPaired2 input2.fastq #31,45,2はハッシュ値が31から45まで2ずつ検討するということ

シングルなら

velveth directory 31,45,2 -fastq single.fq

２、velvetgでコンティグを作成 (例えばk=31に対して)

velvetg folder-name_31 -ins_length1 300 -ins_length2 3000

-exp_cov <floating point|auto>　expected coverage of unique regions or allow the system to infer it -exp_cov <floating point|auto> : expected coverage of unique regions or allow the system to infer it (default: no long or paired-end read resolution)
-ins_length* <integer>　expected distance between two paired-end reads in the respective short-read dataset (default: no read pairing) -ins_length* <integer> : expected distance between two paired-end reads in the respective short-read dataset (default: no read pairing)

真核生物ゲノムの遺伝子予測

遺伝子予測の実践的な手順についてはこちらがよくまとまっています。

基礎的な知識を学ぶならこちら

https://www.science.smith.edu/cmbs/wp-content/uploads/sites/36/2015/09/euk_genome_annotation_review.pdf

PASAに記載されている真核生物アノテーションワークフローの手順

以下のソフトウェアツールを用いたab initio遺伝子探索。GeneMarkHMM、FGENESH、Augustus、およびSNAP、GlimmerHMM
GeneWiseソフトウェアとuniref90非冗長タンパク質データベースを用いたタンパク質相同性検出とイントロン分析
既知のEST、完全長cDNA、最近ではTrinity RNA-Seqアセンブリのゲノムへのアラインメント
ステップ( C)で得られたオーバーラップする転写産物アラインメントに基づくPASAアラインメントアセンブリ
EVidenceModeler (EVM) を用いて、上記 (A, B, C, D) に基づく重み付きコンセンサス遺伝子構造アノテーションの計算
PASAを使用してEVMコンセンサス予測を更新し、UTRアノテーションと交互スプライシングアイソフォームのモデルを追加（DとEを利用）
Argo または Apollo を用いたゲノムアノテーション (F) の限定的な手動改良

追記

以下のツールはデータの関係で使えなかった。

MIRA 454、Solexaデータのアセンブラ。
Velvet-SC シングルセルのアセンブラ。
METAVelvet メタゲノムのアセンブラ。
CABOG 454データのアセンブラ。
PASHA large genomeのアセンブラ。
Fermi large genomeのアセンブラ。

既に昔のデータになってしまっているが、他のアセンブルツールのコマンドについてはGAGEやAssemblathonのペーパー参照。

GAGE recipes

GAGE recipes

GAGE-B recipes

https://ccb.jhu.edu/gage_b/recipes/recipes.pdf

Assemblathon2

https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-10

13742_2013_29_MOESM3_ESM.docx

各ツールの特徴

Next Generation Sequencing (NGS)/De novo assembly - Wikibooks, open books for an open world

エラーコレクションツールのベンチマーク

https://biorxiv.org/cgi/content/short/2020.03.06.977975v1

2020 10/31

バクテリアゲノムアセンブリのガイド

2020 12/13

When developing methods, it is good to know the variability in user data. It was hard to find varied data from many GIAB samples, so we sequenced HG001-7+NA12891/2 on NovaSeq and HiSeqX/4000 in both WGS 50x PCR-Free and PCR+ runs and 3 WES capture kits. https://t.co/LOnvZV0FDA
— Andrew Carroll (@acarroll_ATG) 2020年12月12日

2020 12/27

By the way, you can find the links to various public CHM13 and HG00733 assemblies in this repo: https://t.co/Tf2HUpXDAY
— Heng Li (@lh3lh3) 2020年12月27日

2021 1/27

LeafGo: Leaf to Genome, a quick workflow to produce high-quality De novo genomes with Third Generation Sequencing technology https://t.co/2kbmLrSLW0 #biorxiv_genomic
— bioRxiv Genomics (@biorxiv_genomic) 2021年1月26日

(Published: November 12, 2020)

2021 08/01

2022 06/25

バイオインフォマティクスの前提を崩すような奇妙な遺伝子アノテーションのリスト

https://t.co/tRMGU15m0r "A list of weird gene annotations or things that break bioinformatics assumptions"
— Pierre Lindenbaum (@yokofakun) 2022年6月24日

2023

"様々なヘテロ接合度を持つ6つのゲノムを用いて、ゲノムサイズやヘテロ接合度などのゲノム特性の推定、de novoアセンブル、ポリッシング、対立遺伝子コンティグの除去などの一連のプロセスを評価した。5つのロングリードのみのアセンブラ（Canu、Flye、miniasm、NextDenovo、Redbean）と、ショートリードとロングリードを組み合わせた5つのハイブリッドアセンブラ（HASLR、MaSuRCA、Platanus-allee、SPAdes、WENGAN）を評価し、安定で高性能なアセンブラを用いて、ヘテロ接合の程度に応じたハプロタイプ表現の構築、それに続くハプロティグのポリッシングとパージについて、具体的なガイドラインを提案した。"

A practical assembly guideline for genomes with various levels of heterozygosity
Takako Mochizuki, Mika Sakamoto, Yasuhiro Tanizawa, Takuro Nakayama, Goro Tanifuji, Ryoma Kamikawa, Yasukazu Nakamura
Briefings in Bioinformatics, Volume 24, Issue 6, November 2023, bbad337

https://academic.oup.com/bib/article/24/6/bbad337/7291993?login=false

2017-06-27

DNA解析ソフト1 Snapgene

DNA解析ソフト結果の視覚化 (visualization) windowsツール GUIツール sanger sequencing multiple sequence alignment (MSA)

2018 10/26 追記

2019 4/22 プライマーの説明追加

2019 5/30 誤字修正

2019 7/1 追記

2020 1/6 リンク追加

2020 1/25 追記

2020 2/12 分かりにくい表現を修正

SnapGeneはプラスミドやゲノムの解析ソフトである。ビジュアルで見やすく表示する機能に特化しており、制限酵素処理結果を画面で見やすく表示したり、TA cloningやgibson、in fuisionなどの反応結果を表示し設計を支援する機能などを持つ。

SnapGeneはアノテーション情報がない遺伝子配列からorfを自動検出してプラスミド図などを描画する機能も持つが、この機能は無償版のSnapGene viewerでも使うことができる。さらに素晴らしいのは、無償版、機能の豊富な有償版ともLinux版、mac版、windows版全て揃っていることである。とりあえずゲノム配列を手に入れたらsnapgene (viewer)で確認してみよう、くらいの気軽さで使うことができる。

公式サイトからダウンロードできる。

SnapGene | Software for everyday molecular biology

インストールは画面の指示に従っていくだけでできる。

SnapGeneは、genbankファイルの取り込みに対応している。genbankファイルは、NCBIで遺伝子を見るときのフォーマットである。以下のリンクを開いてgenbankファイルを１つダウンロードしてみよう。

https://www.ncbi.nlm.nih.gov/nuccore/?term=cloning+vector

f:id:kazumaxneo:20170627130533j:plain

左端のgenbankを選択。

f:id:kazumaxneo:20170627130705j:plain

右上のSend toからgenbankを選択してダウンロードする。

f:id:kazumaxneo:20171001152156j:plain

または配列まで含めて画面全体をコピーし、

f:id:kazumaxneo:20170627130652j:plain

　↓

それを軽量なプレーンテキストエディタ（macならMi(Mi ダウンロードリンク）、windowsならTerapadかサクラエディタ）にペーストする。

f:id:kazumaxneo:20170627130838j:plain

.gbkで保存。

f:id:kazumaxneo:20170627131016j:plain

Snapgeneを指定してgbkファイルを開く。

f:id:kazumaxneo:20170627131108j:plain

環状かどうか聞いてくるので、環状を選択。

f:id:kazumaxneo:20170627131212j:plain

開いた直後の画面。orfがgbkファイルの情報に従い自動認識されている。また制限酵素サイトが自動で表示されている。

f:id:kazumaxneo:20170627131242j:plain

左下のタブ ”シーケンス” を押すと配列に切り替わる。

f:id:kazumaxneo:20170627131711j:plain

左下のタブ ”酵素” を押すと制限酵素の位置が表示される。

f:id:kazumaxneo:20170627131900j:plain

上のタブ”ライン"を押すと切れる場所が縦線で表示される。cut部位の相対的な位置関係を掴みやすい。

f:id:kazumaxneo:20170627131941j:plain

追記　プライマー追加

f:id:kazumaxneo:20190423201522p:plain

ウィンドウが出てくるのでプライマー配列をペーストする。

f:id:kazumaxneo:20190423201529p:plain

マッチする部位が表示されるので、右下のテンプレートにプライマーを追加ボタンをクリック。

プライマー部位が追加された。

f:id:kazumaxneo:20190423201539p:plain

繰り返すことで複数登録できる。

2019 7/1 追記

サンガー配列の読み込みはviewerでもできます。

f:id:kazumaxneo:20190701145358j:plain

開いたところ。

f:id:kazumaxneo:20190701145410j:plain

マルチプルシーケンスアラインメント（MSA）結果も表示できる。

f:id:kazumaxneo:20200125001046p:plain

他にも様々な機能を持ち、有償版は、例えば配列の編集、配列同士のアライメントなどに対応している。

いくつか動画を貼っておく。

コンストラクトの構築手順。

動画後半では、右下のタブ”履歴” 機能を使いコンストラクトの設計を行っている。

アライメント。

SnapGene Software Tutorial Videos | Aligning to a Reference DNA Sequence

下のCCからsubtitleをonにすると説明がつきます。

Primer設計と変異導入。

アカデミックライセンスなら、１ユーザ345米ドルで購入でき、他の解析ソフトと比べても良心的な金額になっている。まずは無償版で試し、有償版の機能が欲しくなったら、（ボスにお願いして）購入を考えればよい。

有償版と無償版の機能の違い。

SnapGene vs. SnapGene Viewer | Molecular Biology Software

他にもmac vectorも良い評判を聞いたことがある（安くて、かつ解析ソフトの機能は概ね備えている）。使っている人を見たことがあるが、genetyxに似ており概ね使いやすそうだった。２つ動画を貼っておく。

MacVector Tutorial

Create constructs using MacVector's Cloning Clipboard

これ以外にも無料で使えるツールはいくつかあります。

これらとsnap gene viewerと組み合われば、コストをかけずに使うことも可能です。

2020 1/6追記

UGENEおすすめです。

引用

SnapGene software (from GSL Biotech; available at snapgene.com)

2017-06-23

ゲノムの相同性の高い領域の網羅的な検索 MUMmer

ゲノム比較 (comparative genomics) 2004 Genome Biology

MUMmerはゲノム全体を高速にアライメントするオープンソースのツール。Finisihしたゲノムだけでなくドラフトゲノムでも使用でき、容易に何百あるcontigのアライメントを行うことができる。最初の論文が発表されたのは1999年だが（ref.1）、現在でもオープンソースのまま開発が続けられており、間も無くMUMmer４が論文化されるとアナウンスされている（その後、2018年始めにpublishされた）。

オンラインマニュアル

The MUMmer 3 manual

MUMmerは５つのアライメントプログラム（tu-rukaranaru.mummer、nucmer、promer、rum-mummer1、そしてrun-mummer3) が専ら使われている。

brewで導入できる。

brew install mummer

作成まで時間がかかってしまったので、別に投稿しました。

引用------------------------------------------------------------------------------------------------------------------------

Mummer1

1、Alignment of whole genomes

Nucleic Acids Research, 1999, Vol. 27, No. 11 2369–2376

http://mummer.sourceforge.net/MUMmer.pdf

Mummer2

２、Fast algorithms for large-scale genome alignment and comparison.

Delcher AL1, Phillippy A, Carlton J, Salzberg SL.

Nucleic Acids Res. 2002 Jun 1;30(11):2478-83.

http://mummer.sourceforge.net/MUMmer2.pdf

Mummer3

３、Versatile and open software for comparing large genomes

Stefan Kurtz*, Adam Phillippy†, Arthur L Delcher†, Michael Smoot†‡, Martin Shumway†, Corina Antonescu† and Steven L Salzberg†

Genome Biology 2004, 5:R12

http://mummer.sourceforge.net/MUMmer3.pdf

追記

Mummer4

MUMmer4: A fast and versatile genome alignment system

Guillaume Marçais, Arthur L. Delcher, Adam M. Phillippy, Rachel Coston, Steven L. Salzberg, Aleksey Zimin

Published online 2018 Jan 26.

2017-06-23

ゲノム比較のmurasakiと結果を表示するGMV

ゲノム比較 (comparative genomics) 結果の視覚化 (visualization) GUIツール 2010 PLoS ONE docker synteny_block

2020 5/25 docker link追加、ヘルプ追加、分かりにくい文章の修正

murasakiは複数ゲノムの相同性ある領域の探索を高速に行うツールで、GMVはその比較結果を見るためのビューアソフトである。領域によってカラフルな色がつくので、ゲノムリアレンジメントなどの構造変化をわかりやすく示すことができる。

公式サイト

murasaki genomemurasaki genome

murasakiのインストールはこの方のブログを参考にしてください。

2020 5/26 追記

#dockerhub(link)
docker pull biocontainers/murasaki:v1.68.6-8b1-deb_cv1

> docker run --rm -it biocontainers/murasaki:v1.68.6-8b1-deb_cv1 murasaki -h

$ docker run --rm -it biocontainers/murasaki:v1.68.6-8b1-deb_cv1 murasaki -h

Usage: murasaki -p<patternSpec> [options] seq1 [seq2 [seq3 ... ]]

Options:

--pattern|-p = seed pattern (eg. 11101001010011011).

using the format [<w>:<l>] automatically generates a

random pattern of weight <w> and length <l>

--directory|-d = output directory (default: output)

--name|-n = alignment name (default: test)

--quickhash|-q = specify a hashing function:

0 - adaptive with S-boxes

1 - don't pack bits to make hash (use first word only)

2 - naively use the first hashbits worth of pattern

3 - adaptivevely find a good hash (default)

**experimental CryptoPP hashes**

4 - MD5

5 - SHA1

6 - Whirlpool

7 - CRC-32

8 - Adler-32

--hashbits|-b = use n bit hashes (for n's of 1 to WORDSIZE. default 26)

--hashtype|-t = select hash table data structure to use:

OpenHash - open sub-word packing of hashbits

EcoHash - chained sub-word packing of hashbits (default)

ArrayHash - malloc/realloc (fast but fragmenty)

MSetHash - memory exorbanant, almost pointless.

--probing = 0 - linear, 1 - quadratic (default)

--hitfilter|-h = minimum number of hits to be outputted as an anchor

(default 1)

--histogram|-H = histogram computation level: (-H alone implies -H1)

0 - no histogram (default)

1 - basic bucketsize/bucketcount histogram data

2 - bucket-based scores to anchors.detils

3 - perbucket count data

4 - perbucket + perpattern count data

--repeatmask|-r= skip repeat masked data (ie: lowercase atgc)

--seedfilter|-f= skip seeds that occur more than N times

--hashfilter|-m= like --seedfilter but works on hash keys instead of

seeds. May cause some collateral damage to otherwise

unique seeds, but it's faster. Also non-sequence-specific

so more like at best 1/N the tolerance of seedfilter.

--rseed|-s = random number seed for non-deterministic algorithms

(ie: the adative hash-finding). If you're doing any

performance comparisons, it's probably imperative that you

use the same seed for each run of the same settings.

Default is obtained from time() (ie: seconds since 1970).

--skipfwd|-F = Skip forward facing matches

--skiprev|-R = Skip reverse facing matches

--skip1to1|-1 = Skip matches along the 1:1 line (good for comparing to self)

--hashonly|-Q = Hash Only. no anchors. just statistics.

--hashskip|-S = Hashes every n bases. (Default is 1. ie all)

Not supplying any argument increments the skip amount by 1.

--hashCache|-c = Caches hash tables in the directory. (default: cache/)

--join|-j = Join anchors within n bases of eachother (default: 0)

Specifying a negative n implies -n*patternLength

--bitscore|-B = toggles compututation of a bitscore for all anchors

(default is on)

--memory|-M = set the target amount of total memory

(either in gb or as % total memory)

--seedterms|-T = toggles retention of seed terms (defaults to off)

(these are necessary for computing TF-IDF scores)

--sectime|-e = always display times in seconds

--repeatmap|-i = toggles keeping of a repeat map when --mergefilter

is used (defaults to yes).

--mergefilter|-Y = filter out matches which would would cause more than N

many anchors to be generated from 1 seed (default -Y100).

Use -Y0 to disable.

--scorefilter = set a minimum ungapped score for seeds

--tfidf|-k = perform accurate tfidf scoring from within murasaki

(requires extra memory at anchor generation time)

--reverseotf|-o = generate reverse complement on the fly (defaults to on)

--rifts|-/ = allow anchors to skip N sequences (default 0)

--islands|-% = same as --rifts=S-N (where S is number of seqs)

--fuzzyextend|-z = enable (default) or disable fuzzy extension of hits

--fuzzyextendlosslimit|-Z = set the cutoff at which to stop extending

fuzzy hits (ie. the BLAST X parameter).

--gappedanchors = use gapped (yes) or ungapped (no (default)) anchors.

--scorebyminimumpair = do anchor scoring by minimum pair when appropriate

(default). Alternative is mean (somewhat illogical, but

theoretically faster).

--binaryseq = enable (default) or disable binary sequence read/write

Adaptive has function related:

--hasherFairEntropy = use more balanced entropy estimation (default: yes)

--hasherCorrelationAdjust = adjust entropy estimates for nearby sources

assuming some correlation (default: yes)

--hasherTargetGACycles = GA cycle cutoff

--hasherEntropyAgro = how aggressive to be about pursuing maximum

entropy hash functions (takes a real. default is 1).

...and of course

--verbose|-v = increases verbosity

--version|-V = prints version information and quits

--help|-? = prints this help message and quits

Platform information:

Wordsize: 64 bits

sizeof(word): 8 bytes

Total Memory: 22.97 GB

Available Memory: 23.93 GB (104.20%)

Murasaki version 1.68.6 (LARGESEQ, CRYPTOPP)

ラン

ゲノム・プラスミドのfastaかgenbakファイルを指定する。

３つのgbkファイルの比較。

./murasaki -p 19:26 -d output -n sample_name 7942.gbk GT-S.gbk 7002.gbk Leptolyngbya_dg5.gbk Nostoc_sp.PCC7120.gbk

#docker imageを使うなら
docker run --rm -itv $PWD:/data/ -w /data biocontainers/murasaki:v1.68.6-8b1-deb_cv1 murasaki A.fasta B.fasta C.fasta -p 19:26 -d output

出力ディレクトリoutputをGNVに読み込ませて描画すると、下のような図を出力できる。

f:id:kazumaxneo:20170623134158j:plain

４種の非常に近縁なシアノバクテリアを比較。右端にはANI値を載せた。

ランさせた時の様子はこのような感じになる。

動画では、genbankファイルを２つ選択してmurasakiで相同性を調べ、出力されたフォルダをGMVで表示している。

著者らによるアルゴリズムの説明とチュートリアル

Yasunori Osana @ University of the Ryukyus

こちらも貼っておきます。

Yasunori Osana @ University of the Ryukyus

引用

Murasaki: a fast, parallelizable algorithm to find anchors from multiple genomes

Popendorf K, Tsuyoshi H, Osana Y, Sakakibara Y

PLoS One. 2010 Sep 24;5(9):e12651

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

Reference-assisted assembly1　ragout

k-mer カウントして、配列も出力するツール　jellyfish、BFCounter

Jellyfish

BFCounter

velvetのベストなk-merを自動で決めてアセンブルするvelvetoptimiser

De novoアセンブリ関連のツールまとめ

DNA解析ソフト1 Snapgene

ゲノムの相同性の高い領域の網羅的な検索 MUMmer

ゲノム比較のmurasakiと結果を表示するGMV