2018-06-12

fastq-dumpを並列化した pfastq-dump

2018 11/25 誤字修正

2019 12/18 インストール手順修正、コマンド実行手順追加

pfastq-dumpは、Ohtaさんが公開されているfastq-dumpを並列処理するpython スクリプトparallel-fastq-dumpのbash実装バージョン。Sequence Read Archive（wiki）からダウンロードされたシーケンスデータ（SRAフォーマット）をfastq-dumpの並列処理で素早くfastqに変換することができる。

インストール

依存

SRA toolkit

#science系ツールレポジトリをtapしとく（してない人だけ）。
brew tap brewsci/science

#インストール
brew install sratoolkit

fastq-dumpとsra-statコマンドを実行し、パスが通っているか確認する。

本体 Github

git clone https://github.com/inutano/pfastq-dump
cd pfastq-dump/bin/
chmod a+x pfastq-dump
#パスの通ったディレクトリに移動かシンボリックリンクを張る
mv pfastq-dump /usr/local/bin

$ ./pfastq-dump

Using sra-stat : 2.8.2

Using fastq-dump : 2.8.2

tmpdir: ~/pfastq-dump/bin

outdir: ~/pfastq-dump/bin

Usage:

pfastq-dump --threads <number of threads> [options] <path> [<path>...]

pfastq-dump option:

-t|--threads number of threads

-s|--sra-id SRA id

-O|--outdir output directory

--tmpdir temporary directory

Extra arguments will be passed through

Other:

-v|--version show version info

-h|--help program usage

本体はbash スクリプト。パスの通ったディレクトリに移動する。

ラン

prefetchでsraファイルをダウンロード(*1)してから、８スレッド使ってSRR000001.sraをペアエンドfastqに変換。

#ダウンロード
prefetch SRR000001.sra

#fastqに変換。ダウンロードしたSRR000001.sraのパスを指定する。
pfastq-dump --threads 8 --split-files --outdir output_dir <path>/<to>/SRR000001.sra

#ダウンロードも含め一括実行
pfastq-dump --threads 8 --split-files --outdir output_dir -s SRR000001.sra

fastq-dumpはダウンロードとfastq変換を一括で行うことも可能ですが（下の参考を参照）、本ツールはsraをダウンロードしてから利用することが強く推奨されています。

参考

fastq-dumpでsraをダウンロードし、fastqをgz圧縮出力する。

fastq-dump SRR000001 --gzip  #ペアエンドは"--split-files"をつける。

#sraファイルが入らなければ消す。
rm -i ~/ncbi/publc/sra/SRR000001.sra

追記

引用

GitHub - inutano/pfastq-dump: parallel-fastq-dump implementation in bash script

参考ページ

https://bonohu.wordpress.com

fastq-dump

http://yagays.github.io/blog/2013/06/20/fastq-dump-sratoolkit/

SRAからfastqを取得する

SRAからfastqを取得する - Palmsonntagmorgen

fastq-dumpでもダウンロードできるが、非常に遅くなるので、prefetchでダウンロードし、fastq-dumpは解凍だけに使う方が早くなる。

2018-06-11

fastqから素早くインサートサイズを計算する

fasta/fastqの操作 tips

　bamファイルをすでに作っているなら、ペアエンドのインサートサイズはPicard-tools等ですぐ出せますが、raw fastqしかない時にいちいちbamにして求めるのは少し面倒です。ワンランナーで出すスクリプト書きました。好みにあわせて修正して使ってください。手順ですが、fastqをランダムサンプリングし、minimap2&&samtoolsでbamを作り、goleftに渡してインサートサイズを出しています。リファレンスがなければ、de novo assemblyを行い、得られたcontigを使ってください。

インストール

依存

samtools (1.7で動作確認。古いバージョンでなければO.K）。
seqkit（紹介）
minimap2（紹介）
goleft（紹介）

> cat insert_size.sh

#!/bin/bash

#samtools,seqkit、minimap2、goleftを使う。パスが通っていること。

#version1 2018/06/11

CMDNAME=`basename $0`

if [ $# -ne 3 ]; then

echo "Usage: $CMDNAME ref.fa pair1.fq pair2.fq" 1>&2

exit 1

#10万リードサンプリング (goleftはリードが少なすぎるとエラーになる）

seqkit sample -n 100000 $2 > sampling1.fq 2>/dev/null

seqkit sample -n 100000 $3 > sampling2.fq 2>/dev/null

#mapping

minimap2 -t 8 -ax sr $1 sampling1.fq sampling2.fq 2>/dev/null | samtools sort -@ 8 -O BAM -o sampling.bam - 2>/dev/null

#index

samtools index sampling.bam

#出力

echo

echo ********results********

goleft covstats sampling.bam |column -t

echo

#samplingで出たfastqとbamを消す。

rm sampling1.fq sampling2.fq sampling.bam*

exit

上記をコピーしてファイル名insert_size.shで保存。画面出力は2>/dev/nullで捨てている。進捗が欲しければ消してください。

実行権をつける。

chmod u+x cat insert_size.sh

> insert_size.sh

$ insert_size.sh

Usage: insert_size.sh ref.fa pair1.fq pair2.fq

引数は3つ必要（ reference.fasta、pair1.fastq、pair2.fastq）。

パスの通ったディクトリに移動しておく。

ラン

ペアエンドシーケンスデータのインサートサイズを計算。引数はフルパスで与えてもOKです。

insert_size.sh ref.fa pair1.fq pair2.fq

f:id:kazumaxneo:20180611215507j:plain

インサートサイズ（outer insert size）は418、standard deviationは73だった。

感想

goleftはピカード（picard-tools）あたりに変え（紹介）、グラフィック出力にしてもいいんじゃないでしょうか。それから、ラージゲノムを使っている人は、小さな染色体だけ取り出して計算してください (ヒトならchr20とか)。時間がかかります。その時は、seqkitのサンプリングサイズを、10万リードから100万リードくらいに増やしてください。マッピングされたリードがあまりに少ないとエラーを吐きます。

お粗末様でした。

引用

https://www.biostars.org/p/50783/

2018-06-11

メタゲノムのリアルタイム分類ツール LiveKraken

生物種の推定 (taxonomic profiling) 結果の視覚化 (visualization) リアルタイム解析 metagenome

　ゲノムシーケンシングデータのリアルタイム解析は、シーケンサがまだ稼動している間にデータを分析できるため、過去数年にわたって特に注目を集めている。しかし、Minionシーケンサーをベースにしたライブ解析アプローチの可能性は、これらのデバイスのスループット・レートとシーケンスクオリティが低いため、依然として限られている。著者らは、HiLive（Lindner et al 2017）を用いて、イルミナのシーケンサーからのハイスループットシーケンシングデータのリアルタイム解析のための最初の方法を提案し、新しい分野のアプリケーションを可能にした。メタゲノム研究の場合、Kraken（Wood and Salzberg 2014）のような分類ツールも、 time-relevantなアプリケーションで使用されている。しかし、これはウエットおよびドライのラボの逐次的なパラダイムに影響され、実行の全体時間の下限はシーケンシングマシンの実行時間でセットされる。

　これらの制限に取り組むために、著者らはKrakenのコアアルゴリズムに基づくリアルタイムtaxonomy分類ツールであるLiveKrakenを提示する。シーケンサーが終了するずっと前に、確立されたツールの結果と同等の結果が得られ、シーケンシングの実行が終了したら直ちにKrakenの結果と同じ結果が保証されることを示す。 LiveKrakenはHiSeqとMiSeqシステム上でテストされており、Krakenと同じくらい堅牢で使いやすい。アプリケーションフィールドは、サンプル組成の制御、汚染の同定、またはリアルタイムでのアウトブレイクの検出にまで及ぶ可能性がある。

　もともと、Krakenは線形のワークフローを持っている（Wood and Salzberg 2014）。シーケンシングされたリードは、FASTAまたはFASTQファイルから読み取られ、続いて事前計算されたデータベースを用いて分類される。シーケンスリードは互いに独立しているため、並列に処理することができる。各リードで見つかったlowest common ancestor （LCA）分類結果は、Krakenの表形式のレポートファイルに書き込まれる。
　このワークフローを生きた分類学分類に適合させるため、HiLive（Lindner et al、2017）のアプローチと同様に、イルミナのバイナリBCLフォーマットからシーケンシングデータを読み取ることができる新しいシーケンスリーダーモジュールを実装した。 LiveKrakenを使用すると、オリジナルのKrakenと同じデータベース構造を使用して、メタゲノムのサンプル構成を継続的に分析し、改良することができる。
イルミナシーケンサは、すべてのリードをサイクルと呼ばれる並行処理で処理し、サイクルごとにすべてのリードに1ベースを追加する。各サイクルで、Basecall（BCL）ファイルはIlluminaのBaseCallsディレクトリに作成され、FASTAまたはFASTQファイルの代わりにLiveKrakenの入力として宣言される。新しいデータは、サイズ指定kの最初のk-merから始まるj個のシーケンシングサイクルのユーザー指定の間隔でBCLシーケンシングリーダーモジュールによって収集される。収集されたデータは分類器に送られ、分類器は新しいシーケンス情報で記憶された部分分類を精緻化する。Krakenの一時的なデータ構造は、LCAリスト、あいまいなヌクレオチドのリスト、およびデータベース中のk-mer発生の数など、各リードについて保存される。これは、各シーケンスリードについて見出されるLCAの数に比例して、メモリ消費の全体的な増加をもたらす。さらに、繰り返し精緻化のために非常に重要なことは、各リードが分類された位置を保持する変数が格納されることである。各精緻化ステップの後、Krakenと同じ表形式の出力が生成される。これにより、早期分類が可能になると同時に、最後のシーケンシングサイクルからデータを読み取った後の分類出力が、Krakenが生産するものと全く同じであることが保証される（論文図1a参照）。
LiveKrakenは、付属のスクリプトinstall_kraken.shを使ってKrakenと同様にインストールできる。 Gcc v4.9.2およびv 7.2.0およびブーストv 1.5.8でテストされている。さらに、Condaパッケージが利用可能である（Grüninget al、2017）。 LiveKrakenはKrakenと同じコマンドラインインターフェイスを使用している。

インストール

centos6でテストした。

本体 GitLab

https://gitlab.com/rki_bioinformatics/LiveKraken

git clone https://gitlab.com/rki_bioinformatics/LiveKraken.git
cd LiveKraken/
./install_kraken.sh $target_directory

混同がないよう、krakenと同じコマンドはlivekrakenという名前にリネームされてビルドされる。

> ./livekraken -h

$ ./livekraken -h

Usage: livekraken [options] <filename(s)>

Options:

--db NAME Name for Kraken DB

(default: none)

--threads NUM Number of threads (default: 1)

--fasta-input Input is FASTA format

--fastq-input Input is FASTQ format

--bcl-input Input is BCL format

--bcl-length NUM Number of sequencing cycles

--bcl-start NUM First analysis cycle (>31)

--bcl-spacing NUM Spacing between classification

--bcl-lanes NUM NUM .. The lanes to analyse (e.g. 1 3 4)

Default: Analyse all lanes found.

--bcl-tiles NUM NUM .. The tiles to analyse (e.g. 1101 2115 2304)

Default: 96 tiles (2 sides -> 3 swaths -> 16 tiles).

--bcl-max-tile NUM Maximum tile to analyse, in XYZZ tile format.

Default: Up to tile 2316 (for 96 tiles.)

If this option is used, --bcl-tiles is ignored.

--gzip-compressed Input is gzip compressed

--bzip2-compressed Input is bzip2 compressed

--quick Quick operation (use first hit or hits)

--min-hits NUM In quick op., number of hits req'd for classification

NOTE: this is ignored if --quick is not specified

--unclassified-out FILENAME

Print unclassified sequences to filename

--classified-out FILENAME

Print classified sequences to filename

--output FILENAME Print output to filename (default: stdout); "-" will

suppress normal output

--only-classified-output

Print no Kraken output for unclassified sequences

--preload Loads DB into memory before classification

--paired The two filenames provided are paired-end reads

--check-names Ensure each pair of reads have names that agree

with each other; ignored if --paired is not specified

--help Print this message

--version Print version information

If none of the *-input or *-compressed flags are specified, and the

file is a regular file, automatic format detection is attempted.

> ./livekraken-build -h

$ ./livekraken-build -h

Usage: livekraken-build [task option] [options]

Task options (exactly one must be selected):

--download-taxonomy Download NCBI taxonomic information

--download-library TYPE Download partial library

(TYPE = one of "bacteria", "plasmids",

"viruses", "human")

--add-to-library FILE Add FILE to library

--build Create DB from library

(requires taxonomy d/l'ed and at least one file

in library)

--rebuild Create DB from library like --build, but remove

existing non-library/taxonomy files before build

--clean Remove unneeded files from a built database

--shrink NEW_CT Shrink an existing DB to have only NEW_CT k-mers

--standard Download and create default database

--upgrade Upgrade an existing older database to use scrambled

minimizer ordering (see README for details)

--help Print this message

--version Print version information

Options:

--db NAME Kraken DB/library name (mandatory except for

--help/--version)

--threads # Number of threads (def: 1)

--new-db NAME New Kraken DB name (shrink task only; mandatory

for shrink task)

--kmer-len NUM K-mer length in bp (build/shrink tasks only;

def: 31)

--minimizer-len NUM Minimizer length in bp (build/shrink tasks only;

def: 15)

--jellyfish-hash-size STR Pass a specific hash size argument to jellyfish

when building database (build task only)

--max-db-size SIZE Shrink the DB before full build, making sure

database and index together use <= SIZE gigabytes

(build task only)

--shrink-block-offset NUM When shrinking, select the k-mer that is NUM

positions from the end of a block of k-mers

(default: 1)

--work-on-disk Perform most operations on disk rather than in

RAM (will slow down build in most cases)

>./classify #ベースコール解析用に追加されたコマンド

$ ./classify

Missing mandatory option -d

Usage: classify [options] <fasta/fastq file(s)>

Options: (*mandatory)

-a output at all classification steps

* -d filename Kraken DB filename

* -i filename Kraken DB index filename

-n filename NCBI Taxonomy nodes file

-o filename Output file for Kraken output

-t # Number of threads

-u # Thread work unit size (in bp)

-q Quick operation

-m # Minimum hit count (ignored w/o -q)

-C filename Print classified sequences

-U filename Print unclassified sequences

-l length Length of reads (BCL mode)

-s start Initial cycle to start classification (>31)

-k spacing How many cycles to wait between classification attempts.

-x tiles Tiles to analyze (can be passed multiple times). Default: all tiles

-y lanes Lanes to analyze (can be passed multiple times). Default: all lanes

-z max_tile Maximum tile number to analyse (Default: 2316). Conflicts with -x.

-f Input is in FASTQ format

-b Input is in BCL format

-c Only include classified reads in output

-M Preload database files

-h Print this message

At least one FASTA or FASTQ file or a BaseCalls (BCL) folder must be specified.

Kraken output is to standard output by default.

./visualisation/livekraken_sankey_diagram.py -h

$ ./visualisation/livekraken_sankey_diagram.py -h

usage: livekraken_sankey_diagram.py [-h] -i INFILE [-c] [-s]

[-r {species,genus,family,order}] [-t TOP]

[-o OUTPUT] [-m NAMES] [-n NODES]

This tool creates sankey plots from LiveKraken output.

optional arguments:

-h, --help show this help message and exit

-i INFILE, --infile INFILE

Used to list input files, can be used several times to

input an ordered list of files

-c, --color Used to switch from red-green to red-blue color scheme

-s, --compress Used to "compress" unclassified nodes by only keeping

a number of reads corresponding to the sum of flows

from/to nodes other than unclassified.

-r {species,genus,family,order}, --rank {species,genus,family,order}

Used to set on which level to bin the classified reads

-t TOP, --top TOP Used to determine the top x nodes to display for every

cycle (plus one node serving as bin for everyting

else)

-o OUTPUT, --output OUTPUT

Used to set the output directory path for the html and

json file

-m NAMES, --names NAMES

Used to set the path to the names.dmp for taxonomic

resolution

-n NODES, --nodes NODES

Used to set the path to the nodes.dmp for taxonomic

resolution

LiveKrakenはcondaでも導入できる（リンク）。mac osxも対応している。

ラン

--データベースの準備--

すでにkrakenを導入しておりデータベースがあるなら飛ばしてよい。

データベースがない場合、ダウンロードする必要がある。ここではdatabaseというディレクトリ名にダウンロードする。livekraken-buildの34-37行目がkraken-buildになっていたので、実行前にlivekraken-buildに書き換えた。実行する。

livekraken-build --standard --db database --threads 40

database/にRefseqの全データとtaxonomy情報がダウンロードされ、40スレッド使いビルドされる。環境によってはかなりの時間がかかる。

bacteria、viruses、plasmids、humanのいずれかだけデータベースを作るなら、以下のようになる。taxonomyファイルをダウンロードし、それから必要なデータをダウンロード&ビルドする。

#taxonomyデータ
livekraken-build --download-taxonomy --db database

#データ
livekraken-build --download-library bacteria --db database

#さらにユーザー指定のfastaを追加することもできる。数がある時は、公式のようにfindとxargsをつなぎ自動化する
livekraken-build --add-to-library chr1.fa --db database

#ビルド
livekraken-build --build --db database --threads 40

bacteriaはkraken同様、URLが変更されていてダウンロードできなかった。NCBI FTPサーバー（ftp://ftp.ncbi.nlm.nih.gov/genomes/）のarchive/old_refseq/Bacteria/all.fna.tar.gzに変更した。

”bacteria”をダウンロードするには、43行目の

wget $FTP_SERVER/genomes/Bacteria/all.fna.tar.gz

　↓

wget $FTP_SERVER/genomes/archive/old_refseq/Bacteria/all.fna.tar.gz

に修正して実行する。他のデータベースの配列を使うことも可能（kaijuの例）。wgetでダウンロードし、上に書いた"--add-to-library"フラグを立てて追加する。

データベースの内容はkrakenから変わっていない。

database.kdb: Contains the k-mer to taxon mappings
database.idx: Contains minimizer offset locations in database.kdb
taxonomy/nodes.dmp: Taxonomy tree structure + ranks
taxonomy/names.dmp: Taxonomy names

--taxonomyラベリング--

fastqを分析する。gz圧縮fastqを使う時は--gzip-compressedフラグを立てる。

livekraken --preload --threads 20 --fastq-input --paired --db database pair1.fq pair2.fq --output output

--preload Loads DB into memory before classification
--fasta-input Input is FASTA format
--fastq-input Input is FASTQ format
--paired The two filenames provided are paired-end reads
--gzip-compressed Input is gzip compressed
--bzip2-compressed Input is bzip2 compressed

学名に変換。superkingdom, kingdom, phylum, class, order, family, genus, species表記になる。

livekraken-translate --mpa-format --db database output > sequences.labels

集計レポートを作る。

livekraken-report --db database output > summary

以上の作業はKrakenから変更がない。

--realtime classification--

LiveKrakenではbasecallを直接分析、分類するclassifyコマンドが実装されている（解析の時系列は論文の図１参照 pubmed）。

サポートされているシーケンサー

Miseq
HiSeq 1500
HiSeq 2000/2500/3000/4000/X (untested)

以下のコマンドはテストしてません。注意してください。

./classify -d database/ -i database/database.idx -n database/taxonomy/nodes.dmp -b basecall_dir/ -o classify_output

-b used for BCL files, the input path should be the BaseCalls folder generated by Illumina sequencers instead of a file as in the case of FASTA/FASTQ input.
-l final number of cycles so that the software knows when it can stop waiting for additional BCL files to be generated
-s wait for this many cycles to accumulate before starting classification. Note that this should be greater than 31, which is the k-mer size used by Kraken
-k the number of cycles to wait before another incremental classification occurs
-x which tiles on the flowcell to analyze (default is all). Example: -x 1101 -x 1315 -x 2108
-y which of the lanes (usually 8 lanes) to analyze (default is all). Example: -y 1 -y 6
-a whether to output intermediate results at the incremental classification steps. This would make sense if you want to monitor the sequence composition during sequencing.
-f if input is in FASTQ format

出力は付属ツールで画像にできる。

./visualisation/livekraken_sankey_diagram.py -i classify_output

リアルタイム解析を可能にするHiLive（pubmed）も紹介したかったのですが、 HiLiveは依存ライブラリSeqAnの大量コンパイルの後のhiliveのビルドでつまづき時間を浪費しそうだったので一旦諦めました。dockerコンテナはあるようですが（私の環境でちゃんとテストするには、シーケンスを行いbasecallをリアルタイムでLinuxサーバーに送りテストする必要があり、とっても手間がかかります。臨床分野の人なら24時間シーケンスしてるのかもしれませんが...）。

ジョンズホプキンス大学 Center for Computational Biology（CCB）メタゲノム分析ツールには、下記のツールがあります。いくつかは紹介しています。

http://ccb.jhu.edu/software.shtml

Kraken

Centrifuge

Pavian

Bracken (Bayesian Reestimation of Abundance with KrakEN)

KrakenHLL

引用

LiveKraken - Real-time metagenomic classification of Illumina data.

Tausch SH, Strauch B, Andrusch A, Loka TP, Lindner MS, Nitsche A, Renard BY.

Bioinformatics. 2018 Jun 1.

2018-06-10

ターゲット遺伝子座のリファレンスガイドアセンブリを行う aTRAM2.0

assembly target gene reconstruction k-mer 2015 BMC Bioinformatics 2018

2021 7/21 タイトル修正

　大規模なシーケンスからの迅速な標的遺伝子座特異的なアセンブリは、現在、医学から広範囲の系統学までの応用分野で、生物学科学全体で一般的に使用されている。ターゲットアセンブリ手法は、完全なゲノムのデノボアセンブリと比較してアセンブリの計算規模を大幅に削減し、全ゲノムアセンブリのエラーを避けることができる（論文より ref2-5）

　aTRAM1.0 (automated Target Restricted Assembly Method）は、次世代シークエンシング（NGS）データから任意の遺伝子座をアセンブルするために開発された(6,7)。このプロセスでは、NGSのターゲットにマッチするリードのみアセンブリされる。Perlで書かれ、個々のコンピュータと高性能コンピューティング環境の両方で動作するように設計されている。aTRAMは、最後に、BLASTを使用して、非常に異なるtaxonomy、すなわちhighly closedなゲノムを持たないraxonの遺伝子座のアセンブリを可能にする、一致するリードを見出す（一部略）。

　ここでは、新しい機能を追加し、より詳細なチューニングオプションを使用してアセンブリの制御を大幅に向上させ、以前の方法をより効率的に処理するため、TRAM 1.0を完全な再実装したTRAM 2.0を提示する。（1）BLASTデータベースへの読み込みを増加させてヒットを増加させること、（2）新しいデータベース化戦略を用いて配列クエリーおよび検索を高速化すること、（3）アセンブリされたコンティグをBLASTすルためフィルタリングステップを逆にする（4）さらなるユースケースを可能にするために新たなデノボアセンブラを追加する、（5）多くの分類群から多くの遺伝子座を迅速に組み立てるための自動化パイプラインを追加する、（6）コードの完全なるPythonへの書き換えによって長期的な持続性とパフォーマンスを向上させる。

　aTRAM 2.0は、2つのステップで動作するコマンドラインのPythonベースのツールである。第1に、ユーザはターゲットとユーザオプションのセットを指定し、次いでaTRAMフォーマットのライブラリが構築され、そこではペアエンドリードが分割される。分割はシーケンスの内容とは無関係で、データベースサイズのしきい値（デフォルトで250 MB、シャード (shards) サイズと数はユーザーが調整可能）に基づいている。このデータセット分割により、並列クエリ読み取りが可能になり、シリアルクエリでもパフォーマンスが大幅に向上する。各シャードについて、BLASTフォーマットのデータベースが両方のメイト対に対して構築される。第2のステップでは、関心対象の遺伝子座を標的とし、アセンブルする。これを行うために、クエリ配列は、シャードされたデータベースに対してBLASTされる。一致するリードは、デノボアセンブリモジュールでアセンブルされる。アセンブリは反復アプローチによって改善される。つまり、2回目の反復では、アセンブルされたコンティグが元のクエリを置き換え、ショートリードデータベースにぶつかる。最初の繰り返しと同様に、マッチングしたリードがデノボでアセンブリされる。このプロセスは、ユーザー指定の反復制限に達するか、新しいコンティグがアセンブルされなくなるまで続く（論文図1 ワークフロー）。各遺伝子が独立してアセンブルされるので、複数のレベルでパラレル化が可能になっている。多数の遺伝子を同時に並行して迅速にアセンブルすることができる。さらに、データベースが断片に分割され、それぞれが独立して検索されるように、各実行内で並列処理が可能になっている。最後に、アセンブリモジュール内の並列化制御も実装されている。これらの処理ステップは、ユーザーオプション設定によって高度にカスタマイズ可能になった。例えば、大規模なデータセット（例えば、ミトコンドリアゲノム）のような高カバレッジなターゲットについてaTRAMライブラリーの一部のみを検索することが可能である。デノボアセンブリモジュールの多くのオプションを活用することも可能になっている。

インストール

cent OS6のPython 3.6.3 :: Anacondaでテストした。

依存（アセンブラはどれか１つあればOK）

BLAST - version 2.7.1
Velvet
Trinity - version 2.5.1
Abyss - version 2.0.2
SPAdes - version 3.11.1

本体　Github

https://github.com/HKU-BAL/megagta

virtualenvを使い環境を構築する。

git clone https://github.com/juliema/aTRAM.git 
cd atram 
virtualenv venv -p python3
source venv/bin/activate
pip install -r requirements.txt

> python atram_preprocessor.py -h

$ python atram_preprocessor -h

usage: atram_preprocessor.py [-h] [--version]

[--end-1 FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[--end-2 FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[--mixed-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[--single-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...]]

[-b DB] [--cpus CPUS] [-t DIR] [-l LOG_FILE]

[-s SHARDS] [--path PATH] [--sqlite-temp-dir DIR]

This script prepares data for use by the atram.py

script. It takes fasta or fastq files of paired-end (or

single-end) sequence reads and creates a set of atram

databases.

You need to prepare the sequence read archive files so that the

header lines contain only a sequence ID with the optional

paired-end suffix at the end of the header line. The separator

for the optional trailing paired-end suffix may be a space,

a slash "/", a dot ".", or an underscore "_".

For example:

>DBRHHJN1:427:H9YYAADXX:1:1101:10001:77019/1

GATTAA...

>DBRHHJN1:427:H9YYAADXX:1:1101:10001:77019/2

ATAGCC...

>DBRHHJN1:427:H9YYAADXX:1:1101:10006:63769/2

CGAAAA...

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

--end-1 FASTA_or_FASTQ [FASTA_or_FASTQ ...], -1 FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have only end 1

sequences. The sequence names do not need an end

suffix, we will assume the suffix is always 1. The

files are in fasta or fastq format. You may enter more

than one file or you may use wildcards.

--end-2 FASTA_or_FASTQ [FASTA_or_FASTQ ...], -2 FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have only end 2

sequences. The sequence names do not need an end

suffix, we will assume the suffix is always 2. The

files are in fasta or fastq format. You may enter more

than one file or you may use wildcards.

--mixed-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...], -m FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have a mix of both

end 1 and end 2 sequences (or single ends). The files

are in fasta or fastq format. You may enter more than

one file or you may use wildcards.

--single-ends FASTA_or_FASTQ [FASTA_or_FASTQ ...], -0 FASTA_or_FASTQ [FASTA_or_FASTQ ...]

Sequence read archive files that have only unpaired

sequences. Any sequence suffix will be ignored. The

files are in fasta or fastq format. You may enter more

than one file or you may use wildcards.

preprocessor arguments:

-b DB, --blast-db DB, --output DB, --db DB

This is the prefix of all of the blast database files.

So you can identify different blast database sets. You

may include a directory as part of the prefix. The

default is "./atram_2018-06-10".

--cpus CPUS, --processes CPUS, --max-processes CPUS

Number of CPU threads to use. On this machine the

default is ("10")

-t DIR, --temp-dir DIR

You may save intermediate files for debugging in this

directory. The directory must be empty.

-l LOG_FILE, --log-file LOG_FILE

Log file (full path). The default is to use the DB and

program name to come up with a name like

"<DB>_atram_preprocessor.log"

-s SHARDS, --shards SHARDS, --number SHARDS

Number of blast DB shards to create. The default is to

have each shard contain roughly 250MB of sequence

data.

--path PATH If blast or makeblastdb is not in your $PATH then use

this to prepend directories to your path.

--sqlite-temp-dir DIR

Use this directory to save temporary SQLITE3 files.

This is a possible fix for "database or disk is full"

errors.

——

> python atram.py -h

$ python atram.py -h

usage: atram.py [-h] [--version] -b DB [DB ...] [-q QUERY [QUERY ...]]

[-Q QUERY_SPLIT [QUERY_SPLIT ...]] -o OUTPUT_PREFIX

[-a {abyss,trinity,velvet,spades,none}] [-i N] [-p]

[--fraction FRACTION] [--cpus CPUS] [--log-file LOG_FILE]

[--path PATH] [-t DIR] [--sqlite-temp-dir DIR] [-T SECONDS]

[--no-filter] [--bit-score SCORE]

[--contig-length CONTIG_LENGTH] [--db-gencode CODE]

[--evalue EVALUE] [--max-target-seqs MAX] [--no-long-reads]

[--kmer KMER] [--mpi] [--bowtie2] [--max-memory MEMORY]

[--exp-coverage EXP_COVERAGE] [--ins-length INS_LENGTH]

[--min-contig-length MIN_CONTIG_LENGTH]

[--cov-cutoff COV_CUTOFF]

This is the aTRAM script. It takes a query sequence and a blast

database built with the atram_preprocessor.py script and builds an

assembly.

If you specify more than one query sequence and/or more than one blast

database then aTRAM will build one assembly for each query/blast

DB pair.

NOTE: You may use a text file to hold the command-line arguments

like: @/path/to/args.txt. This is particularly useful when specifying

multiple blast databases or multiple query sequences.

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

required arguments:

-b DB [DB ...], --blast-db DB [DB ...], --sra DB [DB ...], --db DB [DB ...], --database DB [DB ...]

This needs to match the DB prefix you entered for

atram_preprocessor.py. You may repeat this argument to

run the --query sequence(s) against multiple blast

databases.

-q QUERY [QUERY ...], --query QUERY [QUERY ...], --target QUERY [QUERY ...], --probe QUERY [QUERY ...]

The path to the fasta file with sequences of interest.

You may repeat this argument. If you do then Each

--query sequence file will be run against every

--blast-db.

-Q QUERY_SPLIT [QUERY_SPLIT ...], --query-split QUERY_SPLIT [QUERY_SPLIT ...], --target-split QUERY_SPLIT [QUERY_SPLIT ...]

The path to the fasta file with multiple sequences of

interest. This will take every sequence in the fasta

file and treat it as if it were its own --query

argument. So every sequence in --query-split will be

run against every --blast-db.

-o OUTPUT_PREFIX, --output-prefix OUTPUT_PREFIX

This is the prefix of all of the output files. So you

can identify different blast output file sets. You may

include a directory as part of the prefix. aTRAM will

add suffixes to differentiate ouput files.

optional aTRAM arguments:

-a {abyss,trinity,velvet,spades,none}, --assembler {abyss,trinity,velvet,spades,none}

Which assembler to use. Choosing "none" (the default)

will do a single blast run and stop before any

assembly.

-i N, --iterations N The number of pipeline iterations. The default is "5".

-p, --protein Are the query sequences protein? aTRAM will guess if

you skip this argument.

--fraction FRACTION Use only the specified fraction of the aTRAM database.

The default is "1.0"

--cpus CPUS, --processes CPUS, --max-processes CPUS

Number of CPU threads to use. This will also be used

for the assemblers when possible. On this machine the

default is ("10")

--log-file LOG_FILE Log file (full path)."

--path PATH If the assembler or blast you want to use is not in

your $PATH then use this to prepend directories to

your path.

-t DIR, --temp-dir DIR

You may save intermediate files for debugging in this

directory. The directory must be empty.

--sqlite-temp-dir DIR

Use this directory to save temporary SQLITE3 files.

This is a possible fix for "database or disk is full"

errors.

-T SECONDS, --timeout SECONDS

How many seconds to wait for an assembler before

stopping the run. To wait forever set this to 0. The

default is "300" (5 minutes).

optional values for blast-filtering contigs:

--no-filter Do not filter the assembled contigs. This will: set

both the --bit-score and --contig-length to 0

--bit-score SCORE Remove contigs that have a value less than this. The

default is "70.0". This is turned off by the --no-

filter argument.

--contig-length CONTIG_LENGTH, --length CONTIG_LENGTH

Remove blast hits that are shorter than this length.

The default is "100". This is turned off by the --no-

filter argument.

optional blast arguments:

--db-gencode CODE The genetic code to use during blast runs. The default

is "1".

--evalue EVALUE The default evalue is "1e-10".

--max-target-seqs MAX

Maximum hit sequences per shard. Default is calculated

based on the available memory and the number of

shards.

optional assembler arguments:

--no-long-reads Do not use long reads during assembly. (Abyss,

Trinity, Velvet)

--kmer KMER k-mer size. The default is "64" for Abyss and "31" for

Velvet. Note: the maximum kmer length for Velvet is

31. (Abyss, Velvet)

--mpi Use MPI for this assembler. The assembler must have

been compiled to use MPI. (Abyss)

--bowtie2 Use bowtie2 during assembly. (Trinity)

--max-memory MEMORY Maximum amount of memory to use in gigabytes. The

default is "14". (Trinity, Spades)

--exp-coverage EXP_COVERAGE, --expected-coverage EXP_COVERAGE

The expected coverage of the region. The default is

"30". (Velvet)

--ins-length INS_LENGTH

The size of the fragments used in the short-read

library. The default is "300". (Velvet)

--min-contig-length MIN_CONTIG_LENGTH

The minimum contig length used by the assembler

itself. The default is "100". (Velvet)

--cov-cutoff COV_CUTOFF

Read coverage cutoff value. Must be a positive float

value, or "auto", or "off". The default is "off".

(Spades)

——

ラン

１、プリプロセシング。

mkdir library
python atram_preprocessor.py --cpus 20 -b library/name --end-1 pair11.fastq --end-2 pair2.fastq

--cpus CPUS, --processes CPUS, --max-processes CPUS
-b This is the prefix of all of the blast database files so you can identify different blast database sets and so they can be stored together without resorting to subdirectories. You may include a directory as part of the prefix. The default is ./atram_2017-10-1

gzip圧縮fastqには対応していない。

２、ターゲットアセンブリ。ここではvelvetを使う。velvetのk-merはデフォルト31。ターゲット遺伝子座のFASTAファイルを指定する。

python atram.py -b library/name -q ref_locus.fasta -i 5 --cpus 40 --kmer 31 -o Locus_atram2.fasta --log-file Locus.log -a velvet

-q This specifies the query sequence or sequences. If one sequence use the "-q" option. For many query sequences, "-Q".
-i The number of pipeline iterations. The default is "5".
-a {abyss, trinity, velvet, spades} Choose which assembler to use from the list. If you do not use this argument then aTRAM will do a single blast run and stop before assembly.
-o Give a prefix for output files. You may include a directory path as part of the prefix.
-p Are the query sequences protein? aTRAM will guess if you skip this argument.
--kmer The default is 64 for Abyss and 31 for Velvet. Note: the maximum kmer length for Velvet is 31 (Abyss, Velvet).

ランが終わるとターゲットアセンブリのFASTAが出力される。

f:id:kazumaxneo:20180610193618j:plain

実際のランでは、sqliteのデータベースが増え続ける？ようなバグに備え、場所を変えることを推奨している。さらにメジャーアップデートをした時は、データベースをリセットすることが推奨されている。

引用

aTRAM 2.0: An Improved, Flexible Locus Assembler for NGS Data.

Allen JM, LaFrance R, Folk RA, Johnson KP, Guralnick RP

Evol Bioinform Online. 2018 May 8;14

aTRAM - automated target restricted assembly method: a fast method for assembling loci across divergent taxa from next-generation sequencing data.

Allen JM, Huang DI, Cronk QC, Johnson KP

BMC Bioinformatics. 2015 Mar 25;16:98.

2018-06-10

メタゲノムのgene-targeted assembler: MegaGTA

assembly metagenome target gene reconstruction k-mer sequence clustering 2017 BMC Bioinformatics

　次世代シーケンシングは、近年のメタゲノミクスの研究を大きく促進してきた。これらの研究は、しばしば何百万から数十億のリードをde novoでアセンブリし、コンティグにして遺伝子アノテーションすることを含む。これは、メタゲノムのアセンブリ効率を大幅に向上させる高度なアルゴリズムの研究の開始のトリガーとなった[論文より ref.1,2,3]。一方、不均一なカバレッジと cross-genome repeatsのために、断片化された遺伝子配列になるのが一般的である。これらの欠点を克服するために、EMIRGE [ref.5]、REAGO [ref.6]、SAT-Assembler [ref.7]、Xander [ref.8]を含むいくつかの gene targeted assembly methodsが公開されている。 Xanderは、アセンブリ前に標的とされた遺伝子に由来する可能性があるリードを分類しようとする最初の3つの方法とは異なり、はじめにde Bruijn graph（DBG）を構築し、オンザフライでk-merのトラバーサルを直接探して遺伝子を探し出す。特定の遺伝子の集合は、基礎となる遺伝子ファミリーの複数のシーケンスアライメント結果を用いて構築された隠れマルコフモデル（HMM）のプロファイル[ref.9]によって導かれる。ノードから出発して、Xanderはgraphのブランチ部分で決定を行い、HMMの可能性が最も高いユニークパスを出力する。これは、ブランチで本来停止するデノボアセンブリの問題を克服する。より具体的には、Xanderアルゴリズムは、DBGとHMMとの組み合わせgraph上で操作される。そのワークフローは論文図1に示されており、次のセクションで詳しく説明する。

　Xanderは、以前の方法よりも長くよりハイクオリティの遺伝子特異的コンティグを産生するが、まだ改善の余地がある。このペーパーでは、以下の3つの点を改善した。

A.　複数のサイズのk-merを使用する。 Xanderは固定kを使ってk-mersのde Bruijn graphを作成する。固定k-merは古典的なジレンマを引き起こす。小さいkは短いリピートで崩壊してde Bruijn graphに過度の分岐をもたらし、大きなkは低いカバレッジのゲノム領域の間にギャップをもたらすため、これらの領域由来の遺伝子はアセンブリされにくい[ref.10]。 HMM-guided assemblyは、HMMによって「示唆」された最良の経路を選択することによってリピートを解決することを目標としているが、異なる遺伝子の2つの部分が反復を介してキメラコンティグに結合する可能性はゼロではない。これに関して、Iterativeなde Bruijn graph [ref.10]は、複数のk-merでレバレッジをかけるものであり、いくつかのde novoアセンブラでその利点が示されている[3,11,12,13]。以下のベンチマークは、HMM誘導アセンブリがIterative（繰り返し）なde Bruijn graphから利益を得ることができることを示している。
B.　De Bruijn graphの誤ったk-merをフィルターする。与えられたリードセットで一度だけ出現するk-mersはエラーの傾向が強い。低カバレッジな遺伝子の感度を上がるために、Xanderは、頻度の低いk-merをフィルタリングしない方針を選択した。代わりに、Xanderは誤ったk-mersを避けるためHMMに依存している。しかし、それは構造的な誤りまたは誤った塩基を伴う多くのコンティグをもたらし、特に小さいサイズのk-merが使用される場合に起こりやすくなる。この不具合を回避するために、著者らは、HMM誘導検索の間に1回だけ発生するk-mersにペナルティを課している（これらのk-mersに事前の誤った確率を設定することに等しい）。
C.　確率的なデータ構造によって引き起こされる偽陽性のk-merを避ける。より良いメモリ効率を達成するために、XanderはBloomフィルタ[ref.14]を使用してde Bruijn graphを表している。これは確率的なデータ構造であり、一定割合の偽陽性（ただし偽陰性はない）を引き起こす。著者らは、Bloomフィルタを、偽陽性と偽陰性の両方から解放された最新のメモリ効率の高いデータ構造である簡潔なde Bruijn graph[ref.3、15]に置き換えて解決した。新しい手法との比較のため、Xanderのk-mer選択による誤ったコンティグを調べ、De Bruijn graphの正確な表現に問い合わせベンチマークした。

前述の3つの改善が実装され、MegaGTA（Xanderとの違いやMegaGTAワークフローも図1に示されている）と呼ばれる新しい遺伝子ターゲットアセンブラに統合された。最初に、HMP定義のmock communityと、大規模な根圏土壌メタゲノムサンプルの2つのデータセットを用いて、MegaGTAの有効性を実証した[ref.8]。Iterativeなde Bruijn graphで、MegaGTAはXander（よく調整されたパラメータ）よりも高い感度と精度を達成した。さらに興味深いことに、24コアサーバーでテストした場合、MegaGTAは、複数のk-merサイズを使うことで計算量が増えても、同様の量のピークメモリを使用しながら、Xanderより約2〜4倍高速だった。 MegaGTAに似た方法で複数のk-merサイズでXanderを繰り返し実行すると、MegaGTAの相対的なスピードアップはさらに大きくなる。

　根圏土壌データセットに関しては、Xanderがそれぞれ256GB、128GBおよび64GBのBloomフィルターサイズを使用するとき、Xanderが生成するコンティグの0.02、0.39、および10.52％が偽陽性k-mersを含むことがわかった。明らかにBloomフィルタは、メモリが制約されているときに精度の問題を引き起こす。簡潔な de Bruijn graphは、Bloomフィルタの不正確さを克服するだけでなく、より速いgraph構築を可能にする。

f:id:kazumaxneo:20180610122012j:plain

ワークフロー。論文より転載。

インストール

cent OS6のpython2.7.14でテストした。

依存

HMMER 3.1 (http://hmmer.janelia.org)
UCHIME (http://drive5.com/usearch/manual/uchime_algo.html)
ant 1.9.2 or greater (for compiling RDPTools)
g++ 4.6 or greater (for compiling MegaGTA core)

UCHIMEはバイナリをHP（リンク）からダウンロードし、/usr/local/binなどに移動する。

本体　Github

https://github.com/HKU-BAL/megagta

git clone https://github.com/HKU-BAL/megagta.git 
cd megagta 
git submodule update --init --recursive 
./make.sh
cd bin/

> python megagta.py -h

$ python megagta.py -h

MegaGTA v0.1-alpha

Usage:

megagta.py [options] {-1 <pe1> -2 <pe2> | --12 <pe12> | -r <se>} -g gene_list.txt [-o <out_dir>]

Input options that can be specified for multiple times (supporting plain text and gz/bz2 extensions)

-1 <pe1> comma-separated list of fasta/q paired-end #1 files, paired with files in <pe2>

-2 <pe2> comma-separated list of fasta/q paired-end #2 files, paired with files in <pe1>

--12 <pe12> comma-separated list of interleaved fasta/q paired-end files

-r/--read <se> comma-separated list of fasta/q single-end files

-g/--gene-list <string> gene list

Optional Arguments:

Basic assembly options:

-c/--min-count <int> minimum multiplicity for filtering k-mers [1]

-k/--k-list <int,int,..> comma-separated list of kmer size (in range 15-127)

the last k must be a multiple of 3) [30,36,45]

-p/--prune-len <int> prune the search if the score does not improve after <int> steps [20]

-l/--low-cov-penalty <float> penalty for coverage one edges (in [0,1]) [0.5]

--max-tip-len <int> max tip length [150]

--no-mercy do not add mercy kmers

Hardware options:

-m/--memory <float> max memory in byte to be used in SdBG construction [0.9]

(if set between 0-1, fraction of the machine's total memory)

--mem-flag <int> SdBG builder memory mode [1]

0: minimum; 1: moderate; others: use all memory specified by '-m/--memory'.

--use-gpu use GPU

--gpu-mem <float> GPU memory in byte to be used. Default: auto detect to use up all free GPU memory [0]

-t/--num-cpu-threads <int> number of CPU threads, at least 2. Default: auto detect to use all CPU threads [auto]

Output options:

-o/--out-dir <string> output directory [./megagta_out]

--min-contig-len <int> minimum length of contigs to output [450]

--keep-tmp-files keep all temporary files

Other Arguments:

--continue continue a MEGAHIT run from its last available check point.

please set the output directory correctly when using this option.

-h/--help print the usage message

-v/--version print version

--verbose verbose mode

>./post_proc.sh

$ ./post_proc.sh

Usage: post_proc.sh -g <gene_resources> -d <workdir> -m <max_jvm_heap> -c <dist_cutoff> [-t <num_threads=1>] [-f (turn on framebot)]

ラン

ランにはXanderと同様にアセンブリターゲット遺伝子のHMMモデルが必要になる。Githubに掲載されている例を貼っておく。１行１遺伝子で、先頭カラムで遺伝子名を指定する。これは出力ディレクトリ名などに使われる。

rplB share/RDPTools/Xander_assembler/gene_resource/rplB/for_enone.hmm share/RDPTools/Xander_assembler/gene_resource/rplB/rev_enone.hmm share/RDPTools/Xander_assembler/gene_resource/rplB/ref_aligned.faa
nirK share/RDPTools/Xander_assembler/gene_resource/nirK/for_enone.hmm share/RDPTools/Xander_assembler/gene_resource/nirK/rev_enone.hmm share/RDPTools/Xander_assembler/gene_resource/nirK/ref_aligned.faa

実際には付属のXanderの中のmegagta/share/RDPTools/Xander_assembler/gene_resource/に、該当遺伝子を、たくさんの種のマルチプルアライメントから計算しHMMファイルが入っている（nifやrpoBなど）。ターゲットがこれらなら、そのまま指定すればよい。

該当遺伝子のHMMファイルが見つかれなければ作成する必要がある。手順はXanderのHマニュアルに書かれている。

https://github.com/rdpstaff/Xander_assembler/blob/master/README.md#per-gene-preparation-requires-biological-insight

ターゲットアセンブリを実行する。ターゲットのconfigファイルを"-g gene_list.txt"で指定している。

python megagta.py -r input.fq -g gene_list.txt -o output_dir

出力

> ls -lth output_dir/

$ ls -lth output_dir/

total 52K

-rw-r--r-- 1 uesaka user 20K Jun 10 18:06 log

drwxr-xr-x 4 uesaka user 4.0K Jun 10 18:06 contigs

drwxr-xr-x 7 uesaka user 4.0K Jun 10 18:06 .

drwxr-xr-x 2 uesaka user 4.0K Jun 10 18:06 k44

drwxr-xr-x 2 uesaka user 4.0K Jun 10 18:06 k35

drwxr-xr-x 2 uesaka user 4.0K Jun 10 18:05 k29

drwxr-xr-x 2 uesaka user 4.0K Jun 10 18:05 tmp

-rw-r--r-- 1 uesaka user 72 Jun 10 18:05 opts.txt

drwxr-xr-x 4 uesaka user 4.0K Jun 10 18:05 ..

contigs/の中にターゲット遺伝子のfastaが出力される。

> ls -lth output_dir/contigs/nirK/

$ ls -lth output_dir/contigs/nirK/

total 0

-rw-r--r-- 1 uesaka user 0 Jun 10 18:06 prot_merged.fasta

-rw-r--r-- 1 uesaka user 0 Jun 10 18:06 nucl_merged.fasta

--クラスタリング--

1塩基でも異なると別出力されるため、かなり多くの配列ができる。付属スクリプトpost_proc.shを使い、例えば塩基配列同一性が99%以上の配列をまとめる。

post_proc.shラン前に、post_proc.shの先頭部分のUCHIMEとHMMALIGNのパスを変更する。

$ head -n 11 post_proc.sh

#!/bin/bash

## This script runs contigs search and post-assembly processing

## The search step requires large memory for large dataset, see Readme for instructions

## This step assumes a gene directory already exists with gene_start.txt in the directory for each gene in the genes list

## This will overwrites the previous search and post-assembly results

SCRIPT=$(readlink -f $0)

SCRIPTPATH=`dirname $SCRIPT`

JAR_DIR=${SCRIPTPATH}/../share/RDPTools/

UCHIME=/usr/local/bin/uchime

HMMALIGN=~/.linuxbrew/bin/hmmalign

準備できたらクラスタリングを実行する。

./post_proc.sh -g megagta/share/RDPTools/Xander_assembler/gene_resource/ -d output_dir/contigs -m 16G -c 0.01

引用

MegaGTA: a sensitive and accurate metagenomic gene-targeted assembler using iterative de Bruijn graphs.

Li D, Huang Y, Leung CM, Luo R, Ting HF, Lam TW.

BMC Bioinformatics. 2017 Oct 16;18(Suppl 12):408.

Xander: employing a novel method for efficient gene-targeted metagenomic assembly

Qiong Wang, Jordan A. Fish, Mariah Gilman, Yanni Sun, C. Titus Brown, James M. Tiedje and James R. Cole

Microbiome. 2015 Aug 5;3:32.

2018-06-10

MaSuRCA アセンブラ

assembly 2013 Bioinformatics large genome Pacbio Nanopore long read hybrid assembly

2018/8/28,29 dockerコマンド等、分かりにくい部分を修正

2019 5/3 動作条件追記、6/12 hybrid assembly リンク追加、10/9 condaインストール追記、ONTのハイブリッド追記、12/22 condaインストール追記

2020 1/22リンク追記

2022/10/10 help更新、dockerイメージのリンク更新

　 2001年にヒトゲノムのドラフトバージョンが作成された後、800bpを超えるリード長を有する第1世代（すなわちSanger）シーケンス技術を使用して、多くのスモールゲノムおよびラージゲノムのシーケンスが決定された。より最近では、様々な種類のsecond-generation sequencing（SGS）技術が、50〜400bpのリード長で出現している（論文が執筆された当時）。近年、着実に改良されてきた新しいアセンブリ方法が、ショートリードアセンブリの課題に対応して開発されてきた。しかし、この進歩にもかかわらず、ゲノムを決定する問題は、解決から遠い。事実上、現在公開されているすべてのアセンブリは、さまざまなレベルの品質を持つ「下書き」ゲノムであり、下流の分析にこれらのゲノムに頼っている科学者にとって重大な問題となる多くのギャップとアセンブリエラーが含まれている。この記事では、ゲノムアセンブリの新しいアプローチによって促進されるゲノムのアセンブリ方法を報告する。最初に、全ゲノムショットガンシーケンシングデータのアセンブリに使用された2つの一般的なアプローチについて簡単に説明する。

　Overlap–layout–consensus (OLC) assembly。簡単に述べると、OLCパラダイムは、最初に、配列の類似性を用いて重複を決定することによって、リード間のすべてのペアワイズ重複を計算しようと試みる。次に、OLCアルゴリズムはレイアウトを作成する。これはすべての重複するリードのアライメントである。このレイアウトから、アルゴリズムはマルチアライメントアラインメントを列ごとにスキャンすることによってコンセンサス配列を抽出する。 Celera Assembler（論文より　Miller et al、2008; Myers et al、2000）、PCAP（Huang、2003）、Arachne（Batzoglou et al、2002）およびPhusion（Mullikin and Ning、2003）を含むSangerシーケンシングデータのためのほとんどのアセンブラ）は、OLCのアプローチに基づいている。OLCアプローチの主な利点の2つは、リード長に関する柔軟性とシーケンシングエラーに対する堅牢性である。見かけ上の重なりが現実である（そしてリピートが誘発されない）可能性を高めるために、OLCアルゴリズムは、典型的には、ある最小長を超えることを要求する。 Celera Assemblerは40 bp以上のオーバーラップを必要とするため、オーバーラップ領域でのエラー率（1〜2％）が小さくなる。

　SGSのデノボ・アセンブリ・プロジェクトでは、サンガーシークエンス・プロジェクトの100倍のリードを生成する。例えばオリジナルのヒトゲノムプロジェクト（Lander et al., 2001; Venter et al., 2001）とマウスゲノムプロジェクト（Mouse Genome Sequencing Consortium et al、 2002）は、それぞれ3500万のリード生成した。一方、最近のヒトゲノムシーケンス(Li et al., 2010) では、3-4億リード生成した。 de Bruijn graphアプローチは、ペアワイズ重複計算を完全に回避する。これがSGSアセンブリの主要な方法になった理由の1つである。

The de Bruijn graph approach。 de BruijnグラフアセンブリアルゴリズムはPevzner et alが先鞭をつけた手法で (Idury and Waterman, 1995; Pevzner, 1989)、オイラーのアセンブラで初めて実装された（Pevzner et al、2001）。オイラーはサンガーシーケンスリードのために設計されているが、SGSデータをアセンブリするためのプログラムや、特にイルミナのデータ用のプログラムでも、同じ枠組みが採用されている。 de Bruijn戦略を使用する最近開発されたアセンブラにはAllpaths-LG（Gnerre et al、2010）、SOAPdenovo（Li et al、2008）、Velvet（Zerbino and Birney、2008）、EULER-SR（Chaisson and Pevzner、2008））およびABySS（Simpsonら、2009）がある。

　このアプローチは、以下のように、シーケンスデータからde Bruijn graphを作成することから始まる。固定値kに対して、各リードから長さk（a k-mer）の各部分文字列は、ノードAとノードBとを接続するgraphの有向枝に割り当てられる。ノードAとBは、それぞれの最初と最後のk-1個のヌクレオチドに対応する。正式にはオイラーパスとして知られているすべてのエッジを正確に1回訪れるgraph内のパスは、リードのドラフトアセンブリを形成している。実際には、これらのグラフは多くの交差するサイクルと多くの代替のオイラーパスを持つ複雑なものであるため、graphを作成することは、優れたドラフトアセンブリを作成する最初の小さなステップに過ぎない。アセンブリを完了するには、graphにメイトペア情報を組み込み、リピートシーケンスによって作成された複雑なサイクルを解消する必要がある。 k-mersはリードよりも短いので、graphにはリードよりも少ない情報しか含まれていないため、後でgraphの曖昧さをなくすためにリードを保持する必要がある。

　このアプローチの主な利点は計算効率である。膨大な数のオーバーラップが計算されないという事実から利益を得ている。主な欠点は、グラフ中のk-mer隣接情報の損失およびデータのエラーによって引き起こされる偽のブランチングである。

このジャーナルでは、著者らが super-readsと呼ぶものを生成する、ショートリードデータアセンブリの第3のパラダイムを提案する。目的は、元のリードに存在するすべてのシーケンス情報を含む super-readsのセットを作成することである。理想的なエラーのない場合については、下記の定理を参照。

　 Super-readsの基本的な概念は、extension（以後、伸長）が一意である限り、各元のリードを前後にベースごとに伸長することである。この概念は以下のように説明することができる。効率的なハッシュテーブルを使用してk-merカウントルックアップテーブルを作成し、各k-merが何回発生するかを素早く判断する。リードの終わりに見つかったk-merがある場合、ゲノムのシーケンスの次のk-merになる可能性のある4つのk-merがある。これらは、最後にA、C、GまたはTを付けることによって形成された文字列ですk-1個の塩基が読み取られる。著者らのアルゴリズムは、これらのk-mersのどれがテーブル内に現れるかを調べる。 4つの可能なk-merのうちの1つのみが出現する場合、その読みは固有のk-merを有し、その塩基をリードに付加する。リードが１通りに伸長されなくなるまで続ける。すなわち、複数の可能性が存在するか、または行き止まりに達するまで。この伸長は、リードの3 '末端および5'末端の両方で行う。新しい長い文字列は、super-readsと呼ばれる。多くのリードは、論文図1（リンク）に示すように同じsuper-readsに伸長される。たとえば、2つの同一でないリピートまたは2つの異なるハプロタイプに由来している場合、これらは別個のsuper-readsを生成するだろう。もちろん、super-readsはde Bruijn graphを使って簡単に計算できる。要点は、super-readsが作成されると、super-readsと接続するメイトペアと一緒に、de brujin graphを一括して置き換えることである。ペア情報は、以下に説明するOLCアセンブリ工程を用いて行われる。スーパーリードデータ計算の2つの最も重要な特性は、以下の通りである。

各元のリードはsuper-readsに含まれているため（情報が失われていない）。
オリジナルリードの多くは同じsuper-readsになるため、super-readsを使用するとデータセットが大幅に削減される。

（複数段落省略）

言い換えれば、super-readsのセットは、リード内のすべての情報を含み、エラーを導入していない。この証拠はsuper-readsの構築に従っている。各リードはsuper-readsに含まれるため、データは失われない（一部略）。MaSuRCAアセンブラは、super-readsからコンティグとスキャホールドを作成するため、CABOGアセンブラのアセンブリ技術を利用している。

ジョンズホプキンス大学 Center for Computational Biology（CCB）のツール一覧

http://ccb.jhu.edu/software.shtml

MaSuRCA (Maryland Super-Read Celera Assembler) ページ

http://masurca.blogspot.com

Slideshare

Masurca genome assembly with super reads from Abdullah Khan Zehady

2020 1/22

インストール

cent os6とubuntu18.04のdockerコンテナ内でテストした。

必要条件（Githubより）

Compile/Install requirements.

To compile the assembler we require gcc version 4.7 or newer to be installed on the system. Only Linux is supported (May or may not compile under gcc for MacOS or Cygwin, Windows, etc). The assembler has been tested on the following distributions:

• Fedora 12 and up

• RedHat 5 and 6 (requires installation of gcc 4.7)

• CentOS 5 and 6 (requires installation of gcc 4.7)

• Ubuntu 12 LTS and up

• SUSE Linux 16 and up

Hardware requirements.

The hardware requirements vary with the size of the genome project. Both Intel and AMD x64 architectures are supported. The general guidelines for hardware configuration are as follows:

• Bacteria (up to 10Mb): 16Gb RAM, 8+ cores, 10Gb disk space

• Insect (up to 500Mb): 128Gb RAM, 16+ cores, 1Tb disk space

• Avian/small plant genomes (up to 1Gb): 256Gb RAM, 32+ cores, 2Tb disk space

• Mammalian genomes (up to 3Gb): 512Gb RAM, 32+ cores, 5Tb disk space

• Plant genomes (up to 30Gb): 1Tb RAM, 64+cores, 10Tb+ disk space

Expected run times.

The expected run times depend on the cpu speed/number of cores used for the assembly and on the data used. The following lists the expected run times for the minimum configurations outlined above for Illumina-only data sets. Adding long reads (454, Sanger, etc. makes the assembly run about 50-100% longer:

• Bacteria (up to 10Mb): <1 hour

• Insect (up to 500Mb): 1-2 days

• Avian/small plant genomes (up to 1Gb): 4-5 days

• Mammalian genomes (up to 3Gb): 15-20 days

• Plant genomes (up to 30Gb): 60-90 days

本体　Github

ビルドに失敗したのでdockerhubのイメージを使う。複数イメージがアップロードされているが、sauloさんのイメージを使用した（https://hub.docker.com/r/sauloal/masurca/）。

docker pull sauloal/masurca
docker run --rm -m "30g" -itv ${PWD}/:/home/ -w /MaSuRCA-2.2.1/bin/ sauloal/masurca

staphbのイメージを使う（link）
docker pull staphb/masurca:latest
#イメージのhomeとカレントディレクトリを共有ディレクトリにして立ち上げる。メモリは100gとした（*2）
sudo docker run --rm -m "100g" -itv ${PWD}/:/home/ staphb/masurca:latest
cp /MaSuRCA-4.0.9/masurca_config_example.txt .

#追記 Bioconda (link) (Python 3.7.4でテスト)
mamba create -n masurca
conda activate masurca
mamba install -c bioconda -y masurca

> ./masurca

# masurca -h

Create the assembly script from a MaSuRCA configuration file. A

sample configuration file can be generated with the -g switch. The

assembly script assemble.sh will run the assembly proper. For a

quick run without creating a configuration file, and with two Illumina

paired end reads files (forward/reverse) and (optionally) a

long reads (Nanopore/PacBio) file use -i switch, setting the number of threads with -t:

masurca -r paired_ends_fwd.fastq.gz -t 32

masurca -r paired_ends_fwd.fastq.gz,paired_ends_rev.fastq.gz -t 32

,and for a hybrid assembly you can also add the long Nanopore or PacBio reads with -r switch:

masurca -r paired_ends_fwd.fastq.gz,paired_ends_rev.fastq.gz -r nanopore.fa.gz -t 32

this will run paired-ends Illumina or hybrid assembly with CABOG contigger and default settings.

This is suitable for small assembly projects.

Options:

-t, --threads ONLY to use with -i option, number of threads

-i, --illumina Run assembly without creating configuration file, argument can be

illumina_paired_end_forward_reads

illumina_paired_end_forward_reads,illumina_paired_end_reverse_reads

if you only have single-end Illumina data. Reads file(s) could be fasta or fastq, can be gzipped.

-r, --reads ONLY to use with -i option, single long reads file for hybrid assembly, can be Nanopore or PacBio,

fasta or fastq, can be gzipped

-v, --version Report version

-o, --output Assembly script (assemble.sh)

-g, --generate Generate example configuration file

-p, --path Prepend to PATH in assembly script

-l, --ld-library-path Prepend to LD_LIBRARY_PATH in assembly script

--skip-checking Skip checking availability of other executables

-h, --help This message

ラン

ペアエンドショートリードのfastqを指定する。

masurca -t 32 -i pe_R1.fq,/path_to/pe_R2.fq

#ロングリードも指定
masurca -t 32 -i pe_R1.fq,pe_R2.fq -r nanopore.fastq.gz

このコマンドは、最初にイルミナデータでNanoporeリードを補正して、デフォルト設定でハイブリッドアセンブリを実行する。fastqはgzip圧縮されていても認識する。

大きなプロジェクトでランするにはconfigファイルが必要になる。configに記載することで、複数のライブラリを効率的に管理し、まとめてアセンブリできるようになっている。configのマスターシートをコピーして編集する（*1）。

docker run --rm -m "30g" -itv ${PWD}/:/home/ -w /MaSuRCA-2.2.1/bin/ sauloal/masurca
cp /MaSuRCA-4.0.9/masurca_config_example.txt .
vi masurca_config_example.txt

> cat config.txt

# example configuration file

# DATA is specified as type {PE,JUMP,OTHER,PACBIO} and 5 fields:

# 1)two_letter_prefix 2)mean 3)stdev 4)fastq(.gz)_fwd_reads

# 5)fastq(.gz)_rev_reads. The PE reads are always assumed to be

# innies, i.e. --->.<---, and JUMP are assumed to be outties

# <---.--->. If there are any jump libraries that are innies, such as

# longjump, specify them as JUMP and specify NEGATIVE mean. Reverse reads

# are optional for PE libraries and mandatory for JUMP libraries. Any

# OTHER sequence data (454, Sanger, Ion torrent, etc) must be first

# converted into Celera Assembler compatible .frg files (see

# http://wgs-assembler.sourceforge.com)

DATA

#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>

#if single-end, do not specify <reverse_reads>

#If mean/stdev are unknown use 500 and 50 -- these are safe values that will work for most runs

#MUST HAVE Illumina paired end reads to use MaSuRCA

PE= pe 500 50 /FULL_PATH/frag_1.fastq /FULL_PATH/frag_2.fastq

#Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>

#JUMP= sh 3600 200 /FULL_PATH/short_1.fastq /FULL_PATH/short_2.fastq

#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped

#if you have both types of reads supply them both as NANOPORE type

#PACBIO=/FULL_PATH/pacbio.fa

#NANOPORE=/FULL_PATH/nanopore.fa

#Legacy reads (Sanger, 454, etc) in one frg file, concatenate your frg files into one if you have many

#OTHER=/FULL_PATH/file.frg

#synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data

#REFERENCE=/FULL_PATH/nanopore.fa

END

PARAMETERS

#PLEASE READ all comments to essential parameters below, and set the parameters according to your project

#set this to 1 if your Illumina mate pair (jumping) library reads are shorter than 100bp

EXTEND_JUMP_READS=0

#this is k-mer size for deBruijn graph values between 25 and 127 are supported, auto will compute the optimal size based on the read data and GC content

GRAPH_KMER_SIZE = auto

#set this to 1 for all Illumina-only assemblies

#set this to 0 if you have more than 15x coverage by long reads (Pacbio or Nanopore) or any other long reads/mate pairs (Illumina MP, Sanger, 454, etc)

USE_LINKING_MATES = 0

#specifies whether to run the assembly on the grid

USE_GRID=0

#specifies grid engine to use SGE or SLURM

GRID_ENGINE=SGE

#specifies queue (for SGE) or partition (for SLURM) to use when running on the grid MANDATORY

GRID_QUEUE=all.q

#batch size in the amount of long read sequence for each batch on the grid

GRID_BATCH_SIZE=500000000

#use at most this much coverage by the longest Pacbio or Nanopore reads, discard the rest of the reads

#can increase this to 30 or 35 if your long reads reads have N50<7000bp

LHE_COVERAGE=25

#this parameter is useful if you have too many Illumina jumping library reads. Typically set it to 60 for bacteria and 300 for the other organisms

LIMIT_JUMP_COVERAGE = 300

#these are the additional parameters to Celera Assembler; do not worry about performance, number or processors or batch sizes -- these are computed automatically.

#CABOG ASSEMBLY ONLY: set cgwErrorRate=0.25 for bacteria and 0.1<=cgwErrorRate<=0.15 for other organisms.

CA_PARAMETERS = cgwErrorRate=0.15

#CABOG ASSEMBLY ONLY: whether to attempt to close gaps in scaffolds with Illumina or long read data

CLOSE_GAPS=1

#number of cpus to use, set this to the number of CPUs/threads per node you will be using

NUM_THREADS = 32

#this is mandatory jellyfish hash size -- a safe value is estimated_genome_size*20

JF_SIZE = 200000000

#ILLUMINA ONLY. Set this to 1 to use SOAPdenovo contigging/scaffolding module.

#Assembly will be worse but will run faster. Useful for very large (>=8Gbp) genomes from Illumina-only data

SOAP_ASSEMBLY=0

#If you are doing Hybrid Illumina paired end + Nanopore/PacBio assembly ONLY (no Illumina mate pairs or OTHER frg files).

#Set this to 1 to use Flye assembler for final assembly of corrected mega-reads.

#A lot faster than CABOG, AND QUALITY IS THE SAME OR BETTER.

#DO NOT use if you have less than 20x coverage by long reads.

FLYE_ASSEMBLY=1

END

PE= pe 600 60 /MaSuRCA-2.2.1/test/pair1.fq /MaSuRCA-2.2.1/test/pair2.fq

のように５カラムでデータの情報を記載する。1目のPEはペアエンド、２カラム目の600はインサートサイズ、３カラム目の60はインサートサイズのSD。４−５カラムにペアエンドファイルのパスを記載する。それ以外の行のパラメータは、条件に応じて変更する（コメントアウトされてない行がパラメータ）。

インサートサイズやそのSDをなるべく精度よく求めたければ、seqkit（紹介）などで10%ほどサンプリングしてminimap2（紹介）等でbamを作り、picard-toolsやgoleftなどで計算してください（リンク）。

MaSuRCAの実行。

../bin/masurca config.txt -o assemble.sh
#assemble.shができる

#de novoアセンブリ実行
./assemble.sh

エラーコレクションが終わると、そのまま自動でアセンブリが実行される。

[Sun Jun 10 01:33:04 UTC 2018] Gap closing

Gap close success. Output sequence is in CA/10-gapclose/genome.{ctg,scf}.fasta

ランが終わると、カレントに大量のtempoファイルとCAディレクトリができる。CA/10-gapclose/genome.{ctg,scf}.fastaが出力となる。

ホストとの共有ディレクトリにコピー。

cp CA/10-gapclose/genome.*fasta /home/

説明しやすいのでコンテナ中で作業しましたが、sauloさんが記載されているように、コンテナの外からもジョブは投げれます。

性能については、GAGE-Bやレビューを参考にしてください。GAGEは論文で用いられたハイカバレッジのシーケンスデータもダウンロードしてテストできるようになっています。

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3702249/

2019 10/9 追記

他のシーケンシングリードも使う（*3）。

https://github.com/alekseyzimin/masurcaにconfigファイルが載っているので、これを真似る。以下のように、DATAには様々なリードとリファレンスを指定できる。

DATA
#Illumina paired end reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
#if single-end, do not specify <reverse_reads>
#MUST HAVE Illumina paired end reads to use MaSuRCA
PE= pe 500 50  /FULL_PATH/frag_1.fastq  /FULL_PATH/frag_2.fastq
#Illumina mate pair reads supplied as <two-character prefix> <fragment mean> <fragment stdev> <forward_reads> <reverse_reads>
JUMP= sh 3600 200  /FULL_PATH/short_1.fastq  /FULL_PATH/short_2.fastq
#pacbio OR nanopore reads must be in a single fasta or fastq file with absolute path, can be gzipped
#if you have both types of reads supply them both as NANOPORE type
#PACBIO=/FULL_PATH/pacbio.fa
#NANOPORE=/FULL_PATH/nanopore.fa
#Other reads (Sanger, 454, etc) one frg file, concatenate your frg files into one if you have many
#OTHER=/FULL_PATH/file.frg
#synteny-assisted assembly, concatenate all reference genomes into one reference.fa; works for Illumina-only data
#REFERENCE=/FULL_PATH/nanopore.fa
END

ペアエンドのショートリードは必須になる。これにジャンピングライブラリ（mate pair）、ONTのリード（FASTA）を加えるならDATAは以下のようになる。

DATA
PE= pe 500 50  /FULL_PATH/frag_1.fastq  /FULL_PATH/frag_2.fastq
JUMP= sh 3600 200  /FULL_PATH/short_1.fastq  /FULL_PATH/short_2.fastq
NANOPORE=/FULL_PATH/nanopore.fa
END

対応していないフォーマットを指定すると、masurca config.txtの実行時にエラーが出る。

OLCとde brujin graphのハイブリッドde novoアセンブリツールも出てきています。

ハイカバレッジのバクテリアのde novoアセンブリツール。

Unicycler

引用

The MaSuRCA genome assembler.

Zimin AV, Marçais G, Puiu D, Roberts M, Salzberg SL, Yorke JA.

Bioinformatics. 2013 Nov 1;29(21):2669-77.

GIthubへの移行

http://masurca.blogspot.com/2018/01/masurca-is-now-on-github.html

参考

emacsに慣れているなら、その場でインストールして使う。

apt-get update && apt-get install emacs

何度もランするなら--rmは消してください。毎回configを作るは面倒です。

上で紹介したdockerファイルは古いので対応していない。condaで最新版を入れてください。

#仮想環境を作って導入
conda create -n masurca 
conda activate masurca
conda install -c bioconda -y masurca
#help
masurca -h

追加

MaSuRCA version 3.3.1

MaSuRCA4についても触れられています。

MaSuRCA genome assembly package: March 2019

追記

Hybrid assembly approach with MaSuRCA to assemble genomes from Illumina and Oxford Nanopore reads

KBase

2018-06-10

高速かつメモリ使用量の少ないメタゲノムアセンブリツール MEGAHIT

assembly metagenome 高速なツール k-mer 2016 Methods (Journal) mix-assembly

2019 5/6　インストール方法修正、5/7　パラメータ追記、5/15 タイトル修正、5/19 リンク追加、5/17 タイトル再修正、パラメータ修正、6/3 コメント追加、7/27 condaインストール追記

2021 5/7 helpとインストール手順更新、7/21 リンク追加

2022 2/19 メモ追記

2024/03/21 v1.2.2 0 => v1.2.9

　次世代シーケンシング技術は、メタゲノミクスを研究し、ヒトの腸、動物の第一胃および土壌などの様々な微生物群を理解する新しい機会を提供してきた。リファレンスゲノムの欠如のため、メタゲノミクスデータのde novo assemblyは、メタゲノミクス分析のための有益かつほぼ不可避なステップである（Qin et al、2010）。しかし、このステップは、特に、環境メタゲノミクスで遭遇する大規模かつ複雑なデータセット（Howe et al、2014）の重い計算資源要件による制約がある。 Howeらによって最近公表された土壌メタゲノミクスデータセットは、低クオリティの塩基をトリミングした後でさえ、252Gbp含まれる。データセットは、パーティショニングとdigital normalization（紹介）を含む前処理ステップでうまくアセンブリされた。現時点では、デノボ・アセンブラは、コンピュータ・メモリを使用してデータの全体をアセンブリすることはできない。土壌のメタゲノムデータをアセンブリするためのSOAPdenovo2（Luo et al、2012）およびIDBA-UD（Peng et al、2012）の推定メモリ要件は少なくとも4 TBである。メタゲノミクスデータの量が増え続けるにつれて、特に単一ノードサーバー（現在の2ソケットのサーバーでは最大メモリ容量768GB（論文執筆時点））上で、大規模かつ複雑なメタゲノミクスデータを時間と費用効率の高い方法で組み立てることができるアセンブラMEGAHITを開発した。

MEGAHITのワークフロー。論文より転載。

wiki

https://github.com/voutcn/megahit/wiki

インストール

macosとubuntu18.04でテストした。

依存

zlib
python 2.6 or greater
G++ 4.4 or greater

本体　Github

#linux binary v1.2.9
wget https://github.com/voutcn/megahit/releases/download/v1.2.9/MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
tar zvxf MEGAHIT-1.2.9-Linux-x86_64-static.tar.gz
cd MEGAHIT-1.2.9-Linux-x86_64-static/bin/

#test run
./megahit --test # run on a toy dataset

#docker image
docker run -v $(pwd):/workspace -w /workspace --user $(id -u):$(id -g) vout/megahit \
megahit -1 YOUR_PE_READ_1.gz -2 YOUR_PE_READ_2.fq.gz -o YOUR_OUTPUT_DIR


#from source
git clone https://github.com/voutcn/megahit.git
cd megahit
git submodule update --init
mkdir build && cd build
cmake .. -DCMAKE_BUILD_TYPE=Release # add -DCMAKE_INSTALL_PREFIX=YOUR_PREFIX if needed
make -j4
make simple_test # will test MEGAHIT with a toy dataset
# make install if needed

#bioconda (link)
mamba create -n megahit -y
conda activate megahit
mamba install -c bioconda -y megahit

> megahit

$ megahit

megahit: MEGAHIT v1.2.9

contact: Dinghua Li <voutcn@gmail.com>

Usage:

megahit [options] {-1 <pe1> -2 <pe2> | --12 <pe12> | -r <se>} [-o <out_dir>]

Input options that can be specified for multiple times (supporting plain text and gz/bz2 extensions)

-1 <pe1> comma-separated list of fasta/q paired-end #1 files, paired with files in <pe2>

-2 <pe2> comma-separated list of fasta/q paired-end #2 files, paired with files in <pe1>

--12 <pe12> comma-separated list of interleaved fasta/q paired-end files

-r/--read <se> comma-separated list of fasta/q single-end files

Optional Arguments:

Basic assembly options:

--min-count <int> minimum multiplicity for filtering (k_min+1)-mers [2]

--k-list <int,int,..> comma-separated list of kmer size

all must be odd, in the range 15-255, increment <= 28)

[21,29,39,59,79,99,119,141]

Another way to set --k-list (overrides --k-list if one of them set):

--k-min <int> minimum kmer size (<= 255), must be odd number [21]

--k-max <int> maximum kmer size (<= 255), must be odd number [141]

--k-step <int> increment of kmer size of each iteration (<= 28), must be even number [12]

Advanced assembly options:

--no-mercy do not add mercy kmers

--bubble-level <int> intensity of bubble merging (0-2), 0 to disable [2]

--merge-level <l,s> merge complex bubbles of length <= l*kmer_size and similarity >= s [20,0.95]

--prune-level <int> strength of low depth pruning (0-3) [2]

--prune-depth <int> remove unitigs with avg kmer depth less than this value [2]

--disconnect-ratio <float> disconnect unitigs if its depth is less than this ratio times

the total depth of itself and its siblings [0.1]

--low-local-ratio <float> remove unitigs if its depth is less than this ratio times

the average depth of the neighborhoods [0.2]

--max-tip-len <int> remove tips less than this value [2*k]

--cleaning-rounds <int> number of rounds for graph cleanning [5]

--no-local disable local assembly

--kmin-1pass use 1pass mode to build SdBG of k_min

Presets parameters:

--presets <str> override a group of parameters; possible values:

meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'

meta-large: '--k-min 27 --k-max 127 --k-step 10'

(large & complex metagenomes, like soil)

Hardware options:

-m/--memory <float> max memory in byte to be used in SdBG construction

(if set between 0-1, fraction of the machine's total memory) [0.9]

--mem-flag <int> SdBG builder memory mode. 0: minimum; 1: moderate;

others: use all memory specified by '-m/--memory' [1]

-t/--num-cpu-threads <int> number of CPU threads [# of logical processors]

--no-hw-accel run MEGAHIT without BMI2 and POPCNT hardware instructions

Output options:

-o/--out-dir <string> output directory [./megahit_out]

--out-prefix <string> output prefix (the contig file will be OUT_DIR/OUT_PREFIX.contigs.fa)

--min-contig-len <int> minimum length of contigs to output [200]

--keep-tmp-files keep all temporary files

--tmp-dir <string> set temp directory

Other Arguments:

--continue continue a MEGAHIT run from its last available check point.

please set the output directory correctly when using this option.

--test run MEGAHIT on a toy test dataset

-h/--help print the usage message

-v/--version print version

パスの通ったディレクトリにコピーするなら、以下全てを移動する。

sudo cp megahit megahit_asm_core megahit_toolkit megahit_sdbg_build /usr/local/bin/

Dockerコンテナも複数上げられている（リンク）。バイナリもリリースから入手できる（リンク）。

open MPやCUDAにも対応してます。詳細はGithubで確認してください。

実行方法

ペアエンドfastqを指定してランする。後半のパラメータは省略可能。

megahit -1 pair1.fq -2 pair2.fq -o output --k-min 21 --k-max 141 --k-step 12 -t 40

interleaveのペアエンドは"-12"で指定する。kは奇数、kのステップは偶数にする。非常に複雑なsoilなどのメタゲノムは、de brujin graphをシンプルにするため、最小kを27などにすることが推奨されている（カバレッジの大きいデータも同様）。-min-countも重要で、カバレッジが十分（>40）以上あるゲノムがターゲットなら、クオリティトリミングを行い、--k-minも上げることが推奨されている。一方、カバレッジが不十分でも、cut-off 2以下はシーケンスエラーを高頻度に拾うため非推奨になっている。

シングルエンド

megahit -r single.fq -o output --k-min 21 --k-max 141 --k-step 12 -t 40

merged.fqを使う。

https://www.biostars.org/p/328994/

merged.fqを使用した時にlow k-merを指定してメモリエラーを吐く場合、少し値を上げることで対処療法的に対応可能。

megahit -1 pair1.fq -2 pair2.fq -o output --k-min 31 --k-max 121 --k-step 10 -t 40

sensitiveモード

#meta-sensitive
megahit -1 pair1.fq -2 pair2.fq -o output -t 40 --presets meta-sensitive

#meta-large (large & complex metagenomes, like soil))
megahit -1 pair1.fq -2 pair2.fq -o output -t 40 --presets meta-large

--presets <str> override a group of parameters; possible values:
meta-sensitive: '--min-count 1 --k-list 21,29,39,49,...,129,141'
meta-large: '--k-min 27 --k-max 127 --k-step 10' (large & complex metagenomes, like soil)

途中で止めたジョブを再開するには--continueフラグを立て、当時の出力ディレクトリを指定する。プログラムのバージョンが同じなら他の計算機に移しても機能するはず（メモリが足りない時など）。

megahit -o output --continue

--continue　 continue a MEGAHIT run from its last available check point. please set the output directory correctly when using this option.

de brujin graphはbandageで可視化できる（詳細）。

megahit_toolkit contig2fastg 99 k99.contigs.fa > k99.fastg

SRAのpublicデータを使った実際のワークフローもあります（De novoアセンブリとカバレッジ計算のフロー）。

https://github.com/voutcn/megahit/wiki/An-example-of-real-assembly

GPUなしでも動作は非常に高速です。

追記

基本的に短いk-merから長いk-merまでk-mer stepを小さくしてiterativeにアセンブルを繰り貸せば、トータルconitgサイズは増加します。この時、merged.fastqも加えているならk>200以上までma k-merを増やしても伸ばすこともトータルconitgサイズは増加します。ただし、そうやって限界まで伸ばしたcontigが一番良いとは限りません。サブサンプリングしたデータを使い、いくつかのパラメータ設定でアセンブルをってmetaquast等で評価してみてください（ほとんどのゲノムはリファレンスから遠いので数株分のゲノムでの評価にしかなりませんが、それでもないよりずっとマシです）。

2022/02/19

250GBx2 gzipped fastq（非圧縮でおよそ1.5TB）のランでは、k=21は3.8日で計算できたが、続いてのk=33のアセンブリでメモリが足りなくなり落ちた（ピークメモリは523 giga bytes。サンプルの多様性が大きいデータを使用）。=> その後、BBtoolsのbbnorm.shを使って低頻度な配列を除外することによって200GBx２のgzipped fastqにまで減らしたところ、ピークメモリ475GB、ランタイム7日と１時間（53155 user CPU time (min)）でラン出来た。

引用

MEGAHIT v1.0: A fast and scalable metagenome assembler driven by advanced methodologies and community practices

Li D, Luo R, Liu CM3, Leung CM, Ting HF, Sadakane K, Yamashita H, Lam TW

Methods. 2016 Jun 1;102:3-11.

MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph.

Li D, Liu CM, Luo R, Sadakane K, Lam TW

Bioinformatics. 2015 May 15;31(10):1674-6.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

fastq-dumpを並列化した pfastq-dump

fastqから素早くインサートサイズを計算する

メタゲノムのリアルタイム分類ツール LiveKraken

ターゲット遺伝子座のリファレンスガイドアセンブリを行う aTRAM2.0

メタゲノムのgene-targeted assembler: MegaGTA

MaSuRCA アセンブラ

高速かつメモリ使用量の少ないメタゲノムアセンブリツール MEGAHIT