2019-07-01

blast結果を視覚化する BlastViewer

BLAST GUIツールインフォマティクス解析をサポートするツール結果の視覚化 (visualization) DNA解析ソフト

BlastViewerは、NCBI BLASTソフトウェアの結果をグラフィカルに表示することを目的として開発されたツールである。詳細はGIhutbとwiki参照。

wiki

Home · pgdurand/BlastViewer Wiki · GitHub

インストール

windows10 proとubuntu16.0.4でテストした（いずれもjava version1.8.0）。

依存

Java Virtual Machine 1.7 (or above) from Oracle (link).

本体　GIthub

リリースよりビルド済みjarファイルをダウンロードする。

実行方法

１、blastを実行してlegacy xmlファイルを出力しておく。NCBIにアクセスしてblastを実行したなら、上のメニューからxml形式でダウンロードする。

f:id:kazumaxneo:20190701121004j:plain

左上の方にあるXMLを選択してダウンロード実行。

f:id:kazumaxneo:20190701123724j:plain

ローカルblastの場合、"-outfmt 5"付きでblastを実行する（legacy blastを使っているなら"-m 7"）。

#例えばblastpなら
blastp -db blastp_database -query query.fasta -out output.xml \
 -num_threads 8 -outfmt 5

-outfmt <String>

alignment view options:
0 = Pairwise,
1 = Query-anchored showing identities,
2 = Query-anchored no identities,
3 = Flat query-anchored showing identities,
4 = Flat query-anchored no identities,
5 = BLAST XML,
6 = Tabular,
7 = Tabular with comment lines,
8 = Seqalign (Text ASN.1),
9 = Seqalign (Binary ASN.1),
10 = Comma-separated values,
11 = BLAST archive (ASN.1),
12 = Seqalign (JSON),
13 = Multiple-file BLAST JSON,
14 = Multiple-file BLAST XML2,
15 = Single-file BLAST JSON,
16 = Single-file BLAST XML2,
18 = Organism Report

２、コンソールで以下のように打ってBlastViewerを起動。

java -jar blastviewer-5.2.0.jar

#またはメモリ指定して起動。512mメモリで立ち上げ、10gまでメモリ使用を許可。
java -Xms512m -Xmx10G -jar blastviewer-5.0.0.jar

windowsで起動したところ。

f:id:kazumaxneo:20190701115822j:plain

xmlファイルを読み込む。

f:id:kazumaxneo:20190701124051j:plain

読み込んだところ。BCBIでblastを実行した時と同じように、表形式で結果はまとめられる。

f:id:kazumaxneo:20190701124126j:plain

ヒット部位

f:id:kazumaxneo:20190701143446j:plain

それぞれのヒットをクリックすると、その配列に関する情報が表示される。

f:id:kazumaxneo:20190701143611j:plain

Multiple Sequence Alignment (MSA) 表示

f:id:kazumaxneo:20190701124401j:plain

引用

https://github.com/pgdurand/BlastViewer

AMOSアセンブラパッケージのMinimusとMinimus2

2007 2011 BMC Bioinformatics assembly dereplication

2021 6/11 minimus2のコマンドを修正

MInumusのpaper（Sommer et al., 2007）より

　大規模な全ゲノムシークエンシングプロジェクトの課題に対処するためのアルゴリズムの必要性に応えて、ゲノムアセンブラは非常に大きく複雑になっている。しかし、アセンブラの最も一般的な用途の多くは、より少ないソフトウェアコンポーネントのみ必要とし、よりメモリ使用量が少なく、インストールと実行がはるかに簡単な、より単純なタイプのアセンブラである。

　これらの問題に対処するためにMinimusアセンブラを開発し、さまざまなアセンブリ問題でテストした。Minimusがウイルスゲノム、個々の遺伝子、およびBACクローンのアセンブリを含む、いくつかの小さなアセンブリ作業でうまく機能することを示す。さらに、大規模なアセンブリパイプラインの構成要素としての適合性を評価するために、バクテリアゲノムアセンブリにおけるMinimusの性能を評価する。これらのタスクに現在使用されている他のソフトウェアとは異なり、Minimusは、より細分化されたアセンブリを生成することを犠牲にして、大幅に少ないアセンブリエラーを生成することを示す。

　スモールゲノムや他の小さなアセンブリ作業のために、Minimusが既存のツールより速くそしてはるかに柔軟であることを見つける。その小型サイズとモジュラー設計により、Minimusは複雑なアセンブリパイプラインのコンポーネントとして最適である。 Minimusはオープンソースソフトウェアプロジェクトとしてリリースされており、そのコードはSourceforgeのAMOSプロジェクトの一部として入手可能である。

AMOS wiki

http://amos.sourceforge.net/wiki/index.php/AMOS

Minimus wiki

http://amos.sourceforge.net/wiki/index.php/Minimus

Minimus2 wiki

http://amos.sourceforge.net/wiki/index.php/Minimus2

インストール

condaを使ってpython2.7の仮想環境に導入した。

SourceForge

A clone of the official AMOS git repo on sourceforge

Amosパッケージを入れれば使える。

#bioconda(link)
mamba create -n amos -y python=2.7
conda activate amos
mamba install -c bioconda -y amos

> minimus

# minimus

The log file is: runAmos.log

Cannot substitute variable strip .afg PREFIX

>minimus2 -h

# minimus2 -h

minimus2 - The AMOS Pipeline for merging 2 assemblies

Usage:

minimus2 prefix \

-D REFCOUNT=<n> \ # Number of sequences in the 1st assembly ; (Required)

-D OVERLAP=<n> \ # Assembly 1 vs 2 minimum overlap (Default 40bp)

-D CONSERR=<f> \ # Maximum consensus error (0..1) (Default 0.06)

-D MINID=<n> \ # Minimum overlap percent identity for alignments (Default 94)

-D MAXTRIM=<n> # Maximum sequence trimming length (Default 20bp)

実行方法

１、minimus

少数の配列セットをアセンブルする（次世代のようなたくさんの配列は処理できない）。

multi-fastaのmy_reads.seqを指定する。

toAmos -s input.seq -o input.afg
minimus input

作業ディレクトリとアセンブル結果のinput.fastaが出力される。

２、minimus2

redundancyを取り除きながら２組のコンティグのマージを行うならminimus2を使う。

cat pair1.seq pair2.seq > concatenate.seq
toAmos -s concatenate.seq -o concatenate.afg
minimus2 concatenate -D REFCOUNT=<number>

上で指定する<number>は最初の配列pair1.seqのサイズになる。grepで取得する。

grep -c "^>" pair1.seq

引用

Minimus: a fast, lightweight genome assembler.
Sommer DD, Delcher AL, Salzberg SL, Pop M

BMC Bioinformatics. 2007 Feb 26;8:64.

Next generation sequence assembly with AMOS
Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M

Curr Protoc Bioinformatics. 2011 Mar;Chapter 11:Unit 11.8

2019-06-29

(メタゲノム向け) blastアノテーション結果をインタラクティブなグラフで視覚化する Keanu

2019 BMC Bioinformatics 結果の視覚化 (visualization) BLAST テスト失敗

　メタゲノミクスは、環境サンプルから回収された遺伝物質の研究である。これらのサンプルは、特定の環境の多様性や生態学に関する情報を提供する。メタゲノミクス研究は通常、ショットガンシーケンスデータセットから得られた微生物シーケンスに焦点を当てている[ref.1]。ウイルス、細菌、真核生物の成分の特徴付けは、特定のサンプルに存在する多様性を総合的に理解するために重要である[ref.1]。シーケンシングプラットフォームとツールの急速な開発と改良により、研究者は環境サンプルの種の豊富さと多様性をよりよく観察することができる。しかしながら、これらの進歩はまた分析され解釈されるべき膨大な量のデータの発生をもたらした。大量のテキストデータを分析するのは面倒で困難なプロセスである。メタゲノミクス研究も例外ではない。得られたデータセットは、本質的な多次元性と、複数レベルの階層および接続性の存在によって特徴付けられる[ref.2]。視覚化は、研究者がデータについてさらに質問をすることができるもの、および追加の分析を実行できるものを決定するのに役立つ。対話機能により、さらに調査と分析のオプションが追加される。メタゲノミクスデータの視覚化は研究の活発な分野であり、毎年新しい方法が開発されている。視覚化方法は、それらの視覚化能力に従って分類することができる。いくつかの方法は、単一のメタゲノムを視覚的に探索することを目的としているが、他の方法は、複数のメタゲノムの視覚化を可能にする。このツールには、円グラフ、バブルチャート、ツリーと樹状図、ボックスプロット、自己組織化マップなど、さまざまな種類の視覚化も含まれる。さまざまな視覚化方法とツールの詳細なレビューについては[ref.2]を参照してください。

　ここでは、アラスカ内陸部の古代の土壌における生物多様性のパターンを調査するために、微生物、ウイルス、真核生物種を含むメタゲノムの種構成を調べることができる可視化ツール、Keanuを開発した。このツールは、ローカルで開くことができ、内容に基づいてサンプルをさらに分析する方法を決定するために探索できるインタラクティブなWebページを生成する。この種の探索的データ分析は、どのような種類の追加の質問にデータを尋ねることができるのか、およびより多くのデータを収集する必要があるのかどうかを知らせる[ref.3]。静止画像は小さなデータセットを探索するのに役立つが、そのような画像はデータによっては非常に大きくまたは非常に詳細になることがあるため、それらの探索は困難になる。この対話型の視覚化はNCBI分類データベースからの情報を使用して、ルートから種レベルまで降順で各分類群の存在量を示している。ユーザーは、データセット全体の混沌としたノイズの多い視覚化を作成せずに分類の特定のサブセクションを探索できる。

　Keanuに似たツールが開発された。 MetaSee [ref.4]は、かつてはWebサービスおよびスタンドアロンのGUIアプリケーションとして利用可能だった。Keanuが使用しているのと非常によく似た形式でメタゲノムデータを表示できる対話型の視覚化ツールだったが、MetaSeeを実行することは、サーバーのソースコードをダウンロードしてローカルで実行する必要があるため、簡単な作業ではない。 Blobtools [ref.5]とMetacoder [ref.6]はどちらも大きな静止画像を作成する。 Blobtoolsは、BLOBプロットを作成するPythonツールである。 Metacoder [ref.6]はヒートツリーを作成するRパッケージである。どちらもKeanuの開発を駆り立てたインタラクティブな機能を欠いている。 Keanuはローカルで実行されるため、ユーザーはWebサービスを使用してデータベースを管理するのではなく最新のデータベースを管理したり、サービスをホストしたりすることができる。また、Keanuの対話機能は出力の分析に役立つ。

f:id:kazumaxneo:20190317181449p:plain

Fig. 2
flow diagram of Keanu showing how data is piped into different tools or processes

論文より転載

インストール

依存

Keanu can be run on any system where Python is available, which includes Windows, macOS, and Linux.

Keanu is written in Python 3.5 and is available with an ex- ample dataset at https://github.com/IGBB/keanu (Fig. 1).

Github

git clone https://github.com/IGBB/keanu.git
cd keanu/

> python3 make_db.py -h

# python3 make_db.py -h

usage: make_db.py [-h] [-names NAMES] [-nodes NODES] [-merged MERGED]

[-deleted DELETED] [-out_db OUT_DB] [-out_md_db OUT_MD_DB]

A tool to format NCBI taxonomy data for Keanu

optional arguments:

-h, --help show this help message and exit

-names NAMES Location of names.dmp

-nodes NODES Location of nodes.dmp

-merged MERGED Location of merged.dmp

-deleted DELETED Location of delnodes.dmp

-out_db OUT_DB Name of output taxonomy database

-out_md_db OUT_MD_DB Name of output merged/deleted database

> python3 format_input.py -h

# python3 format_input.py -h

usage: format_input.py [-h] [-in INPUT] [-out OUTPUT]

A tool to format BLAST qseqid/staxid files into Keanu's input

optional arguments:

-h, --help show this help message and exit

-in INPUT, --input INPUT

BLAST query/taxon data

-out OUTPUT, --output OUTPUT

Output filename

> python3 keanu.py -h

# python3 keanu.py -h

usage: keanu.py [-h] [-db DATABASE] [-md_db MERGED_DELETED_DATABASE]

[-view {tree,bilevel}] [-in INPUT] [-out OUTPUT]

optional arguments:

-h, --help show this help message and exit

-db DATABASE, --database DATABASE

Formatted taxonomy database

-md_db MERGED_DELETED_DATABASE, --merged_deleted_database MERGED_DELETED_DATABASE

Merged/deleted taxonomy database

-view {tree,bilevel}, --view {tree,bilevel}

View choice

-in INPUT, --input INPUT

Input data set

-out OUTPUT, --output OUTPUT

Output HTML filename

実行方法

１、blast実行。" -outfmt '6 std staxids' "をつけて実行する。ここではblastnを使っているが、感度を上げるならアミノ酸で探す（クエリが塩基配列なら、プロテインデータベースを用意してx）。

blastn -db blast_database/nt -query sample2.fasta -outfmt '6 std staxids' -out query.txt -num_threads 10 -evalue 1e-10

２、データベース作成。初回のみ。ローカルのblastデータベースを指定する。

#taxdmpファイルのダウンロード
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
tar -zxvf taxdump.tar.gz

python3 make_db.py -names names.dmp -nodes nodes.dmp \
 -out_db taxonomy.dat -deleted delnodes.dmp -merged merged.dmp \
 -out_md_db merged_deleted.dat

３、クエリデータの準備。

python3 format_input.py -in query.txt -out keanu.txt

４、グラフ表示

bilevel partition graph

python3 keanu.py -db taxonomy.dat -md_db merged_deleted.dat \
 -in keanu.txt -view bilevel -out bilevel.html

５、collapsible tree表示

python3 keanu.py -db taxonomy.dat -md_db merged_deleted.dat \
 -in keanu.txt -view tree -out tree.html

テスト時はhtml出力でエラーになxった。

引用

Keanu: a novel visualization tool to explore biodiversity in metagenomes

Adam Thrash, Mark ArickII, Robyn A. Barbato, Robert M. Jones, Thomas A. Douglas, Julie Esdale, Edward J. Perkins and Natàlia Garcia-ReyeroEmail author
BMC Bioinformatics 2019 20 (Suppl 2) :103

参考

https://evosite3d.blogspot.com/2013/06/browsing-ncbi-taxonomy-with-python.html

2019-06-28

MMseqs2 コマンド其の2 タンパク質配列のクラスタリング

2017 Nature Biotechnology 高速なツール sequence clustering

インストール

以前の記事を参照

> mmseqs

$ mmseqs

MMseqs2 (Many against Many sequence searching) is an open-source software suite for very fast,

parallelized protein sequence searches and clustering of huge protein sequence data sets.

Please cite: M. Steinegger and J. Soding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017).

MMseqs2 Version: 905109c40ac720c407282d4d6517a164c3470fba

Easy workflows (for non-experts)

easy-search Search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step

easy-linsearch Linear time search with a query fasta against target fasta (or database) and return a BLAST-compatible result in a single step

easy-linclust Compute clustering of a fasta database in linear time. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

easy-cluster Compute clustering of a fasta database. The workflow outputs the representative sequences, a cluster tsv and a fasta-like format containing all sequences.

easy-taxonomy Compute taxonomy and lowest common ancestor for each sequence. The workflow outputs a taxonomic classification for sequences and a hierarchical summery report.

Main tools (for non-experts)

createdb Convert protein sequence set in a FASTA file to MMseqs sequence DB format

search Search with query sequence or profile DB (iteratively) through target sequence DB

linsearch Search with query sequence DB through target sequence DB

map Fast ungapped mapping of query sequences to target sequences.

cluster Compute clustering of a sequence DB (quadratic time)

linclust Cluster sequences of >30% sequence identity *in linear time*

createindex Precompute index table of sequence DB for faster searches

createlinindex Precompute index for linsearch

enrich Enrich a query set by searching iteratively through a profile sequence set.

rbh Find reciprocal best hits between query and target

clusterupdate Update clustering of old sequence DB to clustering of new sequence DB

Utility tools for format conversions

createtsv Create tab-separated flat file from prefilter DB, alignment DB, cluster DB, or taxa DB

convertalis Convert alignment DB to BLAST-tab format or specified custom-column output format

convertprofiledb Convert ffindex DB of HMM files to profile DB

convert2fasta Convert sequence DB to FASTA format

result2flat Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster DB

createseqfiledb Create DB of unaligned FASTA files (1 per cluster) from sequence DB and cluster DB

Taxonomy tools

taxonomy Compute taxonomy and lowest common ancestor for each sequence.

createtaxdb Annotates a sequence database with NCBI taxonomy information

addtaxonomy Add taxonomy information to result database.

lca Compute the lowest common ancestor from a set of taxa.

taxonomyreport Create Kraken-style taxonomy report.

filtertaxdb Filter taxonomy database.

An extended list of all tools can be obtained by calling 'mmseqs -h'.

Bash completion for tools and parameters can be installed by adding "source MMSEQS_HOME/util/bash-completion.sh" to your "$HOME/.bash_profile".

Include the location of the MMseqs2 binary is in your "$PATH" environment variable.

kazu@kazu:~/taniguchi_datadir/metagenome/sample2/test$

kamisakumanoMBP:~ kazuma$

> mmseqs cluster

$ mmseqs cluster

Usage: mmseqs cluster <i:sequenceDB> <o:clusterDB> <tmpDir> [options]

Compute clustering of a sequence DB (quadratic time)

Options:

Prefilter:

--max-seqs INT Maximum result sequences per query allowed to pass the prefilter (this parameter affects sensitivity) [20]

Align:

-c FLOAT list matches above this fraction of aligned (covered) residues (see --cov-mode) [0.800]

--cov-mode INT 0: coverage of query and target, 1: coverage of target, 2: coverage of query 3: target seq. length needs be at least x% of query length, 4: query seq. length needs be at least x% of target length [0]

-a add backtrace string (convert to alignments with mmseqs convertalis utility)

-e FLOAT list matches below this E-value (range 0.0-inf) [0.001]

--min-seq-id FLOAT list matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]

--min-aln-len INT minimum alignment length (range 0-INT_MAX) [0]

--seq-id-mode INT 0: alignment length 1: shorter, 2: longer sequence [0]

--alt-ali INT Show up to this many alternative alignments [0]

--force-reuse reuse tmp file in tmp/latest folder ignoring parameters and git version change

Clust:

--cluster-mode INT 0: Setcover, 1: connected component, 2: Greedy clustering by sequence length 3: Greedy clustering by sequence length (low mem) [0]

--single-step-clustering 0 switches from cascaded to simple clustering workflow [1, set to 0 to disable]

Common:

--threads INT number of cores used for the computation (uses all cores by default) [56]

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs cluster -h'.

- Steinegger M, Soding J: Clustering huge protein sequence sets in linear time. Nature Communications, doi:10.1038/s41467-018-04964-5 (2018)

- Hauser M, Steinegger M, Soding J: MMseqs software suite for fast and deep clustering and searching of large protein sequence sets. Bioinformatics, 32(9), 1323-1330 (2016).

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

3 Database paths are required

> mmseqs createdb

$ mmseqs createdb

Usage: mmseqs createdb <i:fastaFile1[.gz]> ... <i:fastaFileN[.gz]> <o:sequenceDB> [options]

Convert protein sequence set in a FASTA file to MMseqs sequence DB format

Options:

Misc:

--dont-split-seq-by-len 0 Dont split sequences by --max-seq-len [1, set to 0 to disable]

--dbtype INT Database type 0: auto, 1: amino acid 2: nucleotides [0]

--dont-shuffle 0 Do not shuffle input database [1, set to 0 to disable]

--id-offset INT numeric ids in index file are offset by this value [0]

Common:

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs createdb -h'.

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

2 Database paths are required

> mmseqs createtsv

$ mmseqs createtsv

Usage: mmseqs createtsv <i:queryDB> [<i:targetDB>] <i:resultDB> <o:tsvFile> [options]

Create tab-separated flat file from prefilter DB, alignment DB, cluster DB, or taxa DB

Options:

Misc:

--first-seq-as-repr Use the first sequence of the clustering result as representative sequence

--target-column INT Select a target column (default 1), 0 if no target id exists. [1]

--full-header Replace DB ID by its corresponding Full Header

--idx-seq-src INT 0: auto, 1: split/translated sequences, 2: input sequences [0]

--db-output Output a result db instead of a text file

Common:

--threads INT number of cores used for the computation (uses all cores by default) [56]

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs createtsv -h'.

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

> mmseqs result2flat -h

$ mmseqs result2flat -h

Usage: mmseqs result2flat <i:queryDB> <i:targetDB> <i:resultDB> <o:fastaDB> [options]

Create a FASTA-like flat file from prefilter DB, alignment DB, or cluster DB

By Martin Steinegger <martin.steinegger@mpibpc.mpg.de>

Options:

Misc:

--use-fasta-header use the id parsed from the fasta header as the index key instead of using incrementing numeric identifiers

Common:

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

> mmseqs result2repseq

$ mmseqs result2repseq

Usage: mmseqs result2repseq <i:sequenceDB> <i:resultDB> <o:sequenceDb> [options]

Get representative sequences for a result database

Options:

Common:

--threads INT number of cores used for the computation (uses all cores by default) [56]

--compressed INT write results in compressed format [0]

-v INT verbosity level: 0=nothing, 1: +errors, 2: +warnings, 3: +info [3]

An extended list of options can be obtained by calling 'mmseqs result2repseq -h'.

- Steinegger M, Soding J: MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, doi:10.1038/nbt.3988 (2017)

実行方法

mmseq2 cluster

１、まずproteome.fastaをデータベースに変換する。

mmseqs createdb proteome.fasta DB

２、クラスタリング実行。作業ディレクトリが必要。ここでは/tmpとした。巨大なproteomeファイルの場合はかなりのディスクスペースを必要とするので、作業ディレクトリの空き容量に注意する。

mmseqs cluster DB DB_clu tmp

--min-seq-id <FLOAT> list matches above this sequence identity (for clustering) (range 0.0-1.0) [0.000]
--cov-mode <INT> 0: coverage of query and target, 1: coverage of target, 2: coverage of query 3: target seq. length needs be at least x% of query length, 4: query seq. length needs be at least x% of target length [0]

クラスタリング感度は"--min-seq-id"、"-c"、”--cov-mode”などを立てることで手動設定できる。

３、クラスタリング結果をTSV出力する。

mmseqs createtsv DB DB DB_clu clustered.tsv

> head clustered.tsv

$ head clustered.tsv

k161_4496204_2_367_+ k161_4496204_2_367_+

k161_2597819_391_528_- k161_2597819_391_528_-

k161_2397988_694_1326_- k161_2397988_694_1326_-

k161_2397988_694_1326_- k161_732702_1_726_+

k161_2397988_694_1326_- k161_4554834_402_1124_-

k161_1498756_826_1108_- k161_1498756_826_1108_-

k161_3197337_1_597_+ k161_3197337_1_597_+

k161_4696064_350_931_+ k161_4696064_350_931_+

k161_4696064_350_931_+ k161_4519536_22584_23195_-

k161_4696064_350_931_+ k161_3060204_1_699_+

４、クラスタリング後のfastaを出力する。

mmseqs result2flat DB DB DB_clu_rep DB_clu_rep.fasta --use-fasta-header

メタゲノムのアセンブリ配列をFragGeneScaにかけてprootein配列にしたもの（output.faa）を入力として、クラスタリングを実行した。

f:id:kazumaxneo:20190625000340p:plain

ラン後のファイルはDB_clu_rep.fastaだが、わずかに配列数が減少している。

mmseq2 linclust

Linclustは線形時間でのクラスタリングである。実行時間は早くなるが、クラスタリングよりも少しだけ感度が劣る。

１、データベース作成

mmseqs createdb proteome.fasta DB

２、クラスタリング実行。

mmseqs linclust DB DB_clu /tmp

４、クラスタリング後のfastaを出力する。

mmseqs result2repseq DB DB_clu DB_clu_rep
mmseqs result2flat DB DB DB_clu_rep DB_clu_rep.fasta --use-fasta-header

引用

MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets
Steinegger M, Söding J

Nat Biotechnol. 2017 Nov;35(11):1026-1028

MMseqs software suite for fast and deep clustering and searching of large protein sequence sets.
Hauser M, Steinegger M, Söding J

Bioinformatics. 2016 May 1;32(9):1323-30.

Clustering huge protein sequence sets in linear time

Martin Steinegger & Johannes Söding

Nat Commun. 2018; 9: 2542.

参考

Check out my new beginner-friendly post on how to write a shell script that checks whether your computer has SSE or AVX instructions and then runs the correct version of @thesteinegger's MMseqs2 program automatically! https://t.co/N5OtebdBhY
— Ryan Moore (@TenderIsTheByte) May 21, 2019

2019-06-27

FragGeneScan

2010 Nucleic Acids Research metagenome HMM coding region

　次世代シーケンシング技術の進歩は、環境試料（すなわちメタゲノム）内の遺伝物質の全コレクションを直接シーケンシングしようと試みるメタゲノム研究を促進した。メタゲノムアセンブリは利用できないことが多いので（論文執筆時点）、ショートリードから直接遺伝子を同定することは、メタゲノムにアノテーションををつける重要だがチャレンジングな問題となっている。全ゲノム（例えばGlimmer）用に開発され、そしてメタゲノム配列（例えばMetaGene）用に最近開発された遺伝子予測法は、シーケンシングエラー率が増加するにつれて、またはリードが短くなるにつれて、性能の有意な低下を示す。本著者らは、ショートリードにおけるタンパク質コード領域の予測を改善するために、シークエンシングエラーモデルと隠れマルコフモデルにおけるコドン使用法を組み合わせた、新規な遺伝子予測方法FragGeneScanを開発した。 FragGeneScanの性能は、完全なゲノムに関してGlimmerおよびMetaGeneと同等であった。しかしショートリードでは、FragGeneScanは一貫してMetaGeneを上回った（精度は400塩基リード長で1％シークエンスエラーの場合62％、100塩基リード長でエラーがない場合は18％向上した）。メタゲノムに適用した場合、FragGeneScanはMetaGeneが予測したよりも実質的に多くの遺伝子（相同性検索によって同定された遺伝子の90％超）、および現在のタンパク質配列データベースにホモログを含まない多くの新規遺伝子を回収した。

インストール

ubbuntuのminiconda2.4.0.5環境でテストした。

SourcceForge

#bioconda (link)
conda install -c bioconda -y fraggenescan

実行方法

genomeからproteinを得る。-complete=1と-train=completeを指定。

run_FragGeneScan.pl -genome=contigs.fasta -out=output -complete=1 --train=complete -thread=20

-genome= sequence file name including the full path
-complete= 1 if the sequence file has complete genomic sequences. 0 if the sequence file has short sequence reads
train= file name that contains model parameters. [complete] for complete genomic sequences or short sequence reads without sequencing error. [sanger_5] for Sanger sequencing reads with about 0.5% error rate. [sanger_10] for Sanger sequencing reads with about 1% error rate. [454_10] for 454 pyrosequencing reads with about 1% error rate. [454_30] for 454 pyrosequencing reads with about 3% error rate. [illumina_5] for Illumina sequencing reads with about 0.5% error rate. [illumina_10] for Illumina sequencing reads with about 1% error rate.

output.faやgffファイル等が出力される。

illuminaのショートリードからproteinを得る。-complete=0とtrain=illumina_5（またはillumina_10）を指定。

run_FragGeneScan.pl -genome=ngs.fasta -out=output -complete=0 -train=illumina_5 -thread=20

-genome= sequence file name including the full path
-complete= 1 if the sequence file has complete genomic sequences. 0 if the sequence file has short sequence reads
train= file name that contains model parameters. [complete] for complete genomic sequences or short sequence reads without sequencing error. [sanger_5] for Sanger sequencing reads with about 0.5% error rate. [sanger_10] for Sanger sequencing reads with about 1% error rate. [454_10] for 454 pyrosequencing reads with about 1% error rate. [454_30] for 454 pyrosequencing reads with about 3% error rate. [illumina_5] for Illumina sequencing reads with about 0.5% error rate. [illumina_10] for Illumina sequencing reads with about 1% error rate.

454なら-train=454_10、-train=454_30を指定。

引用

FragGeneScan: predicting genes in short and error-prone reads
Mina Rho, Haixu Tang, Yuzhen Ye

Nucleic Acids Res. 2010 Nov; 38(20): e191.

FragGeneScan-Plus

https://ieeexplore.ieee.org/document/7300341

2019-06-26

再現性のあるメタゲノム解析を行うためのモジュール設計された自動パイプライン Sunbeam

2019 Microbiome (Journal) workflow manager automatic pipeline metagenome contamination low complexity error correction quality check filtering assembly SRA download mapping 高速なツール

2019 6/26 誤字修正

　メタゲノミックショットガンシークエンシングは、関心のある微生物混合群からDNAを抽出し、無作為に抽出されたDNAをディープシーケンシングする。これは、特定の標的遺伝子領域が増幅およびシーケンシングされるマーカー遺伝子シーケンシング（例えば、細菌の16S rRNA遺伝子）とは対照的である。メタゲノムシーケンシングは、特にウイルスおよびバクテリオファージコミュニティの研究において、微生物学において重要な洞察を可能にしました[ref.1–9]。ただし、継続的な課題は、結果として得られる大規模データセットを標準的で信頼性の高い方法で分析および解釈することである[ref.16–23]。

　微生物のメタゲノムを調べる一般的な方法は、イルミナのシークエンシング技術を使用して、目的のサンプルから単離した断片化DNAから多数の短い（100〜250塩基対）リードを取得することである。シーケンスの取得後、シーケンスを使用して基礎となる生物学的洞察を得る前に、いくつかの後処理ステップを実行する必要がある[ref.21、24]。

　研究者はそれぞれの後処理ステップを実行するための多くのツールを自由に使えるようにしているが、その結果として、得られるアウトプットや下流の分析を変える可能性がある各ツールそれぞれに様々なパラメータが存在する。分析間でパラメータ、ツール、またはリファエンスデータベースのバージョンを変えると、異なるメタゲノムシーケンシング実験の結果を比較するのが難しくなる[ref.25]。逆に、研究間で一貫したワークフローを採用することで、実験は比較可能であり、下流分析は再現可能であることが保証される[ref.21]。使用されるソフトウェア、データベース、およびパラメータの文書化は、このプラクティスの重要な要素である。そうでなければ、一貫した再現可能なワークフローの利点が失われる。

　メタゲノムのポストプロセッシングワークフローは、その有用性と柔軟性を最大にするために以下の特性を持つべきである。それは共有コンピューティングシステムとクラウドにデプロイ可能であるべきであり、ソフトウェアパラメータとリファレンスデータベースの簡単な設定を許容する。中断の後に再開する能力、それは不必要なステップがスキップされるか無視されることができるようにモジュール式であるべきである、そしてそれは新しい手順がユーザによって加えられることを可能にするべきである。機関プラットフォームとクラウドプラットフォームの両方にワークフローを展開する機能により、さまざまなラボでさまざまなコンピューティング設定でワークフローを繰り返すことができ、研究者が機関とクラウドのコンピューティングリソースを柔軟に選択できるようになる。同様に、設定ファイルを使用して実行中のパラメータを記録する機能により、実験固有のソフトウェアパラメータを使用することができ、今後の参考資料として使用できる。

　いくつかの機能が効率的なデータ分析に貢献する。ワークフローのエラーや中断が最初からやり直す必要がない場合は、有益である。シーケンス実験では大量のデータが生成されるため、データ処理のステップを繰り返す必要があり、時間と費用がかかる可能性がある。さらに、ワークフローのすべてのステップがすべての実験に必要なわけではなく、実験によってはカスタム処理が必要な場合もある。実験を適切に処理するために、ワークフローは不要なステップをスキップし、必要に応じて後で実行する簡単な方法を提供するべきである。フレームワークを広く有用にするために、ユーザーは必要に応じてワークフローに新しいステップを簡単に追加して他のユーザーと共有できなければならない。これらの目標の多くを達成するパイプラインがいくつか開発されているが[ref.26-29]、メタゲノムデータセットの処理における柔軟性の向上と分析の長期的な再現性に対する我々（本著者ら）のニーズを満たしていなかった。

　ここでは、メタゲノムシーケンス実験から一貫した一連の後処理済みファイルを生成する、簡単に配置および構成可能なパイプラインであるSunbeamを紹介する。 Sunbeamは自己完結型で、管理者権限なしでGNU / Linuxシステムにインストールできる。 Snakemakeワークフロー言語での実装により、エラー処理、タスク再開、および並列コンピューティング機能を備えている[ref.30]。ほぼすべての手順は事前に指定された妥当なデフォルト値を使用して構成可能であるため、広範なパラメータ調整を行わなくても迅速に展開できる。 Sunbeamは、ローカルデータを直接使用することも、NCBIのシーケンスリードアーカイブ（SRA）からの外部データを使用することもできる[ref.31]。 Sunbeamは、新しい手順をユーザーが追加できるようにする単純なメカニズムを使用して拡張可能である。

　さらに、Sunbeamは、低品質またはホスト由来の配列を多く含む困難なサンプルのデータを処理できるカスタムソフトウェアを備えている。これらには、任意の数の宿主ゲノムまたは汚染ゲノム配列に対するカスタム調整された宿主由来のリード除去ステップ、およびKomplexity、問題のある低複雑性リードを下流の分析の前に迅速かつ正確に除去する新しい配列複雑性分析プログラムが含まれる。マイクロサテライトDNA配列は、ヒトゲノムのかなりの部分を占めており、個体間で非常に多様であり[ref.32-34]、単一のリファレンスゲノムに対するアラインメントによってそれらを除去することの難しさを増している。 Komplexityを開発したのは、ヌクレオチド配列の複雑さを分析するための既存のツール[ref.35-37]が、スピード、怪しいヒットの除去、およびfastqファイルのネイティブ処理の点で、本著者らのニーズを満たしていなかったためである。ここでは宿主関連の low-microbial biomass body sites [ref.19、38、39]、virome[ref.40]、およびマイクロバイオームの縦断サンプリング[ref.41、42]の公表され進行中の研究においてSunbeamを使用した。

　SunbeamはPython、Bash、およびRustで実装されている。 GPLv3でライセンスされている。https://github.com/sunbeam-labs/sunbeamで自由に利用可能であり、ドキュメントはhttp://sunbeam.readthedocs.ioで利用可能である。

　Sunbeamは、「可用性と要件」に記載されているハードウェアとソフトウェアの初期要件を満たすGNU / Linuxディストリビューションにインストールできる。インストールは、リポジトリからソフトウェアをダウンロードして「install.sh」を実行することによって実行される。Sunbeamは、インストールまたは実行に管理者権限を必要としない。 Sunbeamの基本的な分析ワークフローをDebian 9、 CentOS 6と7、Ubuntu 14.04、16.04、18.04、および18.10、 Red Hat Enterprise 6および7、そしてSUSE Enterprise 12にインストールして実行できることを確認した。

　SunbeamはCondaパッケージ管理システム[ref.43]を利用して、「可用性と要件」のセクションに記載されているもの以外にも追加のソフトウェア依存関係をインストールする。 Condaは、依存関係の自動解決を提供し、管理者以外のユーザーにソフトウェアのインストールを容易にし、追加のソフトウェアをパッケージ化してサードパーティのソフトウェアチャネルを管理するための標準化されたシステムを使用する。 Condaは、パイプラインの外側にある既存のソフトウェアとの競合を避けるために、追加のソフトウェア依存関係解決のための分離された環境も提供する。必要に応じて、CondaはSunbeamインストールスクリプトによってインストールされる。 Sunbeamが使用する分離ソフトウェア環境もインストールスクリプトによって作成される。

　Sunbeamパイプラインへの入力は、ローカルのファイルまたはSRAを通じて利用可能なサンプルのいずれかで、raw Illuminaシーケンスリードからなる。デフォルトでは、Sunbeamはリードに対して次の順序で予備操作を実行する。

クオリティ管理：オプションで、grabseq [ref.44]とsra-tools [ref.31]を使ってSRAからリードをダウンロードする。アダプター配列を除去し、そしてCutadapt [ref.45]およびTrimmomatic [ref.46]ソフトウェアを用いて塩基をクオリティフィルタリングする。クオリティフィルタリングを通過したリードペアは保持される。クオリティはFastQCを使用して評価され[ref.47]、別々のレポートにまとめられる。

低複雑度マスキング：各リードにおける配列複雑度は、後述する新規複雑度スコアリングアルゴリズムであるKomplexityを使用して評価される。ユーザーがカスタマイズ可能なシーケンス複雑度のしきい値を下回るリードは削除される。削除されたリード数のログは後で検査するために書き込まれる。

ホストリード除去：リードはbwaを使用してユーザー指定のホストまたは汚染物質シーケンスのセットに対してマッピングされる。特定の同一性および長さのしきい値内でこれらのシーケンスのいずれかにマップされるリードは削除される。削除されたリード数は後で検査するために記録される。

初期クオリティ管理プロセスの後、複数のオプションの下流工程を並行して実施することができる。分類ステップでは、汚染配列が除去されクオリティ管理されたリードがKraken [ref.49]を使用して分類的に分類され、タブ区切り形式とBIOM [ref.50]形式の両方で要約される。アセンブリステップでは、MEGAHIT [ref.51]を使用して、各サンプルからのリードを連続アセンブリする。事前に指定された長さを超えるコンティグには、あのテーションが付けられる。オープンリーディングフレーム（ORF）は、Prodigal ［ref.52］を用いて予測される。次に、コンティグ全体（および関連するORF）を、コンティグ全体および推定ORFの両方を使用して、任意の数のユーザー指定のヌクレオチドまたはタンパク質BLAST [ref.53]データベースに対して検索する。結果は各サンプルのレポートにまとめられる。最後に、マッピングステップでは、bwa [ref.48]を使用してクオリティ管理されたリードを任意の数のユーザー指定のリファレンスゲノムまたは遺伝子セットにマッピングし、結果のBAMファイルをSAMtools [ref.54]を使用してソートおよびインデックス付けする。

Sunbeamは、出力ファイルを概念的に異なるフォルダにグループ化し、論理的な出力フォルダ構造を提供するよう設計されている。 Sunbeamからの標準出力には、クオリティ管理プロセスの各ステップからのfastqファイル、各リードの分類系統アサイン、各サンプルからアセンブリされたコンティグ、遺伝子予測、およびすべてのクオリティ管理されたリードのアラインメントファイルが含まれる。ほとんどのルールは、後の検査と要約のためにその操作のログを作成する。ユーザーは特定の出力を個別にまたはグループとして要求でき、パイプラインは必要なファイルを作成するのに必要なステップのみを実行する。これにより、ユーザーはモジュール式にパイプラインの任意の部分をスキップまたは再実行できる。

表1、他のツールとの機能比較

論文より転載

以下の作業が自動化されている (Githubより)

Quality control, including adaptor trimming, host read removal, and quality filtering;
Taxonomic assignment of reads to databases using Kraken;
Assembly of reads into contigs using Megahit;
Contig annotation using BLAST[n/p/x];
Mapping of reads to target genomes; and
ORF prediction using Prodigal.

Documentation

https://sunbeam.readthedocs.io/en/latest/

インストール

ubuntu16.0.4のminiconda3-4.3.30環境でテストした。

依存

python3.6 ~ 3.7

Gihutb

git clone -b stable https://github.com/sunbeam-labs/sunbeam sunbeam-stable
cd sunbeam-stable

#condaがない環境で実行するとcondaも含めて導入される
./install.sh
#condaをここで入れたなら
echo "export PATH=$PATH:/root/miniconda3/bin" >> ~/.bashrc && source ~/.bashrc
source activate sunbeam

テスト

#テストラン時に出た文字化けエラーはC.UTF-8をexportして応急処置

export LC_ALL=C.UTF-8
export LANG=C.UTF-8

> tests/run_tests.bash -e sunbeam

f:id:kazumaxneo:20190622180053j:plain

> sunbeam

# sunbeam

usage: sunbeam [-h/--help,-v/--version] <subcommand>

subcommands:

init Create a new config file for a project using local or SRA data.

run Execute the pipeline.

config Modify or update config files.

list_samples Make a list of samples from a directory.

optional arguments:

-v, --version show program's version number and exit

For more help, see the docs at http://sunbeam.readthedocs.io.

Unrecognized command.

> sunbeam init -h

# sunbeam init -h

usage: init [-h] [-f] [--output FILE] [--defaults FILE] [--template FILE]

[--data_fp PATH] [--format STR] [--single_end]

[--data_acc ACC [ACC ...]]

project_fp

Initialize a new Sunbeam project in a given directory, creating a new config

file and (optionally) a sample list. The sample list source can be either a

folder of input files or a list of SRA accession numbers.

positional arguments:

project_fp project directory (will be created if it does not

exist)

optional arguments:

-h, --help show this help message and exit

-f, --force overwrite files if they already exist

--output FILE name of config file (sunbeam_config.yml)

config file options:

--defaults FILE set of default values to use to populate config file

--template FILE custom config file template, in YAML format

sample list options:

Options to automatically generate a sample list. --data_fp (for reading

filenames from a specified folder) and --data_acc (for fetching SRA

accession numbers to be downloaded during a run) are mutually exclusive.

If --data_acc is used, all SRA Run (sample) entries corresponding to the

given accession numbers (e.g. PRJNA###, SAMN###, SRR###) will be used to

create the sample list. Note in this case project_fp should either be

given before --data_acc or separated by a '--' to distinguish it from an

accession number.

--data_fp PATH path to folder containing fastq.gz files

--format STR filename format for --data_fp (default: guessed)

--single_end fastq files are in single-end, not paired-end, format

for --data_fp

--data_acc ACC [ACC ...]

list of SRA-compatible accession numbers

> sunbeam run -h

# sunbeam run -h

usage: sunbeam run [options] -- <snakemake options>

Executes the Sunbeam pipeline by calling Snakemake.

optional arguments:

-h, --help show this help message and exit

-s SUNBEAM_DIR, --sunbeam_dir SUNBEAM_DIR

Path to Sunbeam installation

You can pass further arguments to Snakemake after --, e.g:

$ sunbeam run -- --cores 12

See http://snakemake.readthedocs.io for more information.

> sunbeam config -h

# sunbeam config -h

usage: sunbeam config [-h] {update,modify} ...

positional arguments:

{update,modify}

optional arguments:

-h, --help show this help message and exit

> sunbeam list_samples -h

# sunbeam list_samples -h

usage: sunbeam list_samples [-h] [-s] [-f STR] data_fp

positional arguments:

data_fp Path to folder containing reads

optional arguments:

-h, --help show this help message and exit

-s, --single_end Samples are single-end (not paired-end)

-f STR, --format STR Filename format (e.g. {sample}_R{rp}.fastq.gz).

(default: guessed)

最小限の手間でパイプラインにツールを追加することも可能になっている（未テスト）。

#sunbeam内にて
git clone https://github.com/sunbeam-labs/sbx_kaiju extensions/sbx_kaiju
cd extensions/sbx_kaiju/
conda install -c bioconda --file requirements.txt

#databaseをダウンロード後、ymlファイルにパスを追加。(kaiju github)
mkdir kaijudb
cd kaijudb
kaiju-makedb -s (database name)

#それからrun前にymlファイルに追加する
cat extensions/sbx_kaiju/config.yml >> my_config.yml

データベースの準備

１、kraken database（version.1.0）

ここではpre-build (2017)のMiniKraken DB_8GBデータベースを使う。

mkdir /data/kraken
cd /data/kraken/
wget https://ccb.jhu.edu/software/kraken/dl/minikraken_20171019_8GB.tgz
tar -zxvf minikraken_20171019_8GB.tgz && rm minikraken_20171019_8GB.tgz

ymlファイルのclassify:kraken_db_fp: にパスを記載する。上の通りなら

kraken_db_fp: '/data/kraken/minikraken_20171019_8GB'

２、filtering of contaminated sequences

コンタミネーション（ホストゲノムやテクニカルな汚染配列）の除去。例えばヒトゲノムの汚染を除く。リファレンスの拡張子はfastaにする必要がある。

mkdir /data/contamination_genome
cd /data/contamination_genome/
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.fa.gz
unpigz -c hg38.fa.gz -p 8 > hg38.fasta && rm hg38.fa.gz

ymlファイルのhost_fp: にパスを記載する。上の通りなら

host_fp: '/data/contamination_genome'

３、blast（n、p、x）

blastしてアノテーションをつけることができる。ここではNCBI NTを使う。抗生物質耐性遺伝子、virulance factor、NCBI nt、など目的に応じてデータベースを構築しておく。

mkdir /data/annotation_database/
cd /data/annotation_database/

#NCBI ntを入れる (容量に注意)
mkdir nt
cd nt
wget "ftp://ftp.ncbi.nlm.nih.gov/blast/db/nt.*.tar.gz"
for a in nt.*.tar.gz; do tar xzf $a; done


#CARDデータベース
mkdir /data/annotation_database/card
#調べてから追記します

#VFDBータベース

ymlファイルの blastdbs以下に記載する。ここではproteinは載せていないのでblastpは実行されない。

f:id:kazumaxneo:20190625194055j:plain

ダウンロードしたntは70以上に分かれているが、sunbeamはnt/ntで認識する。ここではblastデータベースは全て/data/annotation_database/にあり、その下位ディレクトリのnt/にNCBI ntデータベースファイルがあるという設定で進めている。

下の画像の通り、 nt/にはダウンロードし解凍したntのblastデータベースが保存されている。

f:id:kazumaxneo:20190625194446j:plain

実行方法

１、fastqの準備。

特定のディレクトリに解析対象のfastqをまとめる。例えばこのようなサンプルを分析する。ファイル名は推測できるような名前になっていないといけない。下のファイル名だと問題なく認識した（sample1, sample2, ...）。

f:id:kazumaxneo:20190623155306j:plain

２、ランのconfigファイル作成。

ここではfastqを保存しているディレクトリを/sequencing/fastq、プロジェクトファイル名はmy_project1としている。

sunbeam init my_project1 --data_fp /sequencing/fastq

my_project1ディレクトリができ、その中にsamples.csvとsunbeam_config.ymlができる。

３、.ymlの編集。他のツールのconfiguration fileに相当するymlファイルができるので、ラン前に編集する。

vim my_project1/sunbeam_config.yml

sunbeam_config.yml

f:id:kazumaxneo:20190626131356j:plain

上から順番に見ていって、パラメータを変更していく。モジュラー設計のパイプラインなのでステップを丸ごと追加・削除することもできる。

４、実行する。-nでdry run

sunbeam run --configfile my_project1/sunbeam_config.yml

出力

f:id:kazumaxneo:20190625194650j:plain

SRAデータを使ったテストラン

例えばmetagenome ４サンプルを分析する。

mkdir ~/test4strain
sunbeam init /data/test4strain/ --data_acc ERR3277301 ERR3277295 ERR3277320 ERR3277274
#.ymlとsample.csvができる。

#~/testにできた.ymlを指定してラン
sunbeam run --configfile /data/test4strain/sunbeam_config.yml

ymlファイルで出力defaultは"subbeam_output"になっている。

ラン後、subbeam_outputを開く。

f:id:kazumaxneo:20190623130924j:plain

dowload、qc、assembly、classify、annotationに分かれている。

f:id:kazumaxneo:20190623130942j:plain

download/にはダウンロードされたfastqが収められている。

f:id:kazumaxneo:20190623131008j:plain

qc/にはcutadapt、trimmomaticによるアダプター除去/QT、low comlexility reads除去、ホストコンタミネーションリード除去後のそれぞれのfastqが収められている。したがってかなりサイズが大きくなる。

f:id:kazumaxneo:20190623130955j:plain

qc/reportsにはQC結果のレポート

f:id:kazumaxneo:20190623131812j:plain

classify/にはdefaultではkrakenによる分析結果が収められている。

f:id:kazumaxneo:20190623131025j:plain

全サンプル結果はBIOM形式とTSV形式でまとめられる。

all_samples.tsv

f:id:kazumaxneo:20190623132003j:plain

biomはqiimeやRのphyloseqなどで可視化できる。

assembly/にはmegahitのcontigsやカバレッジファイルが収められている。

f:id:kazumaxneo:20190623131040j:plain

全サンプルの全contigのカバレッジやサイズレポート

f:id:kazumaxneo:20190623131044j:plain

アノテーションにはユーザーが指定したblastデータベースによるアノテーション結果が収められる。

f:id:kazumaxneo:20190623131052j:plain

snakemakeのスケーラビリティにより、シングルノードサーバーから、クラスターの分散メモリ環境でも特別なチューニングなしにランできるようになっています。パラメータは

https://sunbeam.readthedocs.io/en/latest/usage.html#configuration

のCluster optionsで確認して下さい。

引用

Sunbeam: an extensible pipeline for analyzing metagenomic sequencing experiments

Clarke EL, Taylor LJ, Zhao C, Connell A, Lee JJ, Fett B, Bushman FD, Bittinger K

Microbiome. 2019 Mar 22;7(1):46.

2019-06-25

複数のメタゲノムをその場で分析するための軽量で多機能なメタゲノム分析ツール SqueezeMeta（オフライン使用）

2018 Frontiers in Microbiology 生物種の推定 (taxonomic profiling) metagenome metatranscriptome Nanopore long read リアルタイム解析 off-line automatic pipeline assembly Binning (metagenomics) taxonomic assignment

2020 11/19 condaインストール追記

　シーケンシング技術の改良によりメタゲノムシーケンシングが一般化し、メタゲノムシーケンシングがマイクロバイオームの構造および機能性を分析するための標準的な手順となった。メタゲノム実験によって生成された膨大な数のショートリード配列に対処するために、多くの新規なバイオインフォマティックツールおよびアプローチが開発されてきた。単に膨大な量のデータ以外に、メタゲノム解析は、結果に直接互換性がないことが多いさまざまなソフトウェアツールといくつかの標準化されていないステップを含む複雑な作業になっている。

　最近、携帯性の高いシーケンサー、特にナノポア技術に基づくシーケンサーの開発（Deamer et al、2016）は、迅速な結果を得る必要性が最も重要であるシナリオ、例えば疾病管理または伝染病の臨床シナリオにおいてin situシークエンシングを容易にした（Quick et al、2015、2016）。メタゲノムシークエンシングもまた、例えば南極氷の海洋調査がその場で行われており（Lim et al、2014； Johnson et al、2017）、サンプリングしてすぐにシークエンスを生成する能力が増大している。これにより、前日の結果に従って、今後のサンプリング実験の情報に基づく計画が可能になる。近い将来、このようなアプリケーションがますます使用されるようになると予測される。したがって、バイオインフォマティック分析は非常に短い時間（数時間）で実行されるべきであり、軽量コンピューティングインフラストラクチャに適しているはずである。

　標準的なメタゲノムパイプラインには、リードのキュレーション、アセンブリ、遺伝子予測、および結果として得られる遺伝子の機能および分類学的アノテーションが含まれる。これらの分析の大部分を自動化するために、いくつかのパイプラインが作成されている（Li、2009; Arumugam et al、2010; Glass and Meyer、2011; Abubucker et al、2012; Eren et al、2015; Kim et al、2016）。ただし、能力とアプローチの点では異なり、最も重要な違いの1つは、アセンブリ手順が必要かどうかになる。いくつかのプラットフォームは、アセンブリをスキップし、その結果、遺伝子予測をスキップし、代わりにrawリードのダイレクトアノテーションに頼っているが、rawリードを扱ういくつかの欠点がある：これは巨大なリファレンスデータベースに対する何百万ものシーケンスの相同性検索に基づいているので、それは通常非常に大きいCPU使用率を必要とする。特に分類学的アサインについては、リファレンスデータベースはエラーを最小にするために可能な限り完全でなければならない（Pignatelli et al、2008）。さらに、正確なアサインを生成するにはシーケンス長が短すぎることが多い（Wommack et al、2008; Carr and Borenstein、2014）。

　多くの遺伝子を含むことが多いゲノムのより大きな断片を回収することができるので、アセンブリは賢明な手段である。完全な遺伝子配列およびそのコンテキストを有することは、その機能および分類学的アサインをはるかに容易かつより信頼性の高いものにする。アセンブリの欠点は、異なるゲノムパーツを誤ってアセンブリことによるキメラ形成、およびいくつかのリード、特に存在量の少ない種からのリードをアセンブリすることができないことである。アセンブリされていないリードの割合は、いくつかの要因、特にシーケンシングデプスとマイクロバイオームの多様性に左右されるが、通常は少ない（多くの場合20％未満）。最近、最初の段階でアセンブリされなかったリードをリアセンブリし、このステップの性能を向上させるいくつかのツールが開発されている（Hitch and Creevey、2018）。結果のセクションで説明するように、関連するメタゲノムを共にアセンブリすることで、この問題を大幅に軽減することもできる。

　アセンブリはまた、ビニング方法を介して準完全なゲノム回収を容易にするので、賢明である。ゲノムの検索は、微生物と機能を結び付けることを可能にし、よってコミュニティの機能のより正確な生態学的記述に寄与するので、マイクロバイオームの研究における大きな前進につながる。例えば、（特に重要な機能に関与する）マイクロバイオームの重要なメンバーを決定すること、メンバー間の潜在的な相互作用を推測すること（例えば代謝の補完関係を探すこと）、および生態学的効果の理解を進めることが可能である。

　ビニングのための最良の戦略は、関連メタゲノムの共アセンブリである。異なる試料中のコンティグの存在量および組成を比較することによって、どのコンティグが同じ生物に属するかを決定することが可能である：これらのコンティグは類似のオリゴヌクレオチド組成、個々の試料中で類似の存在量、および異なる試料間では共変動パターンを有する。このようにして、マイクロバイオームの機能をより詳細に分析するための出発点として使用することができる、異なる完成レベルを有する数十または数百のゲノムビンを検索することが可能である。

　SqueezeMetaはメタゲノム/メタトランスクリプトミクスのための完全自動のパイプラインで、分析のすべてのステップを網羅している。これは、関連メタゲノムの共アセンブリおよびビニング手順による個々のゲノムの検索を可能にするマルチメタゲノムサポートを含む。

　SqueezeMetaと他のパイプラインの機能の比較を論文の表Table 1.1に示す。現在のほとんどのパイプラインでは、協調およびビニングのサポートは含まれていないが、外部ビニング結果をインポートして関連情報を表示することを許可するものもある。

SqueezeMetaには、既存のパイプラインとは異なる高度な特性がいくつかある。例えば、

1、各メタゲノムにおける個々の遺伝子存在量推定のためのリードマッピングと組み合わせた共アセンブリ手順。
２、個々のメタゲノムのマージを介して無制限の数のメタゲノムの処理を可能にする代替の共アセンブリ手法。
３、ナノポアロングリードのサポート。
４、個々のゲノムを回収するためのビニングおよびビンチェック。
５、コンティグとビンの分類学的アノテーションの内部チェック。
６、リファレンスメタゲノムに対してcDNAリードをマッピングするメタトランスクリプトームのサポート。メタゲノムとメタトランスクリプトームの共アセンブリへのマッピングも利用できる。
７、結果を保存するためのMySQLデータベース。Webインターフェースを使用して簡単にエクスポート、共有および検査ができる。

我々（著者ら）は、サンプリングしたその場でのメタゲノムシーケンシング実験に利用できるよう、SqueezeMetaをコンピュータリソースの乏しい環境でも実行できるように設計した。すべてのパイプラインのコンポーネントを適切に設定することで、16GBのRAMを搭載したデスクトップコンピューターを使用して、個々のメタゲノムを完全に分析し、さらに関連するメタゲノムを統合することさえできた。本システムは完全に自動化されているため、技術的な知識やバイオインフォマティックな知識を必要としない非常に使いやすいものである。また、インターネット接続有無とは完全に無関係である。

SqueezeMetaはhttps://github.com/jtamames/SqueezeMetaからダウンロードできる。

Workflow of the three modes of operation of SqueezeMeta.

論文より転載

SqueezeMeta uses a combination of custom scripts and external software packages for the different steps of the analysis: （Githubより）

Assembly
RNA prediction and classification
ORF (CDS) prediction
Homology searching against taxonomic and functional databases
Hmmer searching against Pfam database
Taxonomic assignment of genes
Functional assignment of genes
Taxonomic assignment of contigs, and check for taxonomic disparities
Coverage and abundance estimation for genes and contigs
Estimation of taxa abundances
Estimation of function abundances
Merging of previous results to obtain the ORF table
Binning with MaxBin
Binning with MetaBAT
Binning integration with DAS tool
Taxonomic assignment of bins, and check for taxonomic disparities
Checking of bins with CheckM
Merging of previous results to obtain the bin table
Merging of previous results to obtain the contig table
Prediction of kegg and metacyc patwhays for each bin
Final statistics for the run

Our SqueezeMeta paper for metagenomic analysis is finally published! Check it out: https://t.co/9Wu3wgh78F
— Microbiome Analysis Group, CNB-CSIC (@Microbiome_CNB) January 24, 2019

インストール

ubuntu16.04でテストした（ホストOS macos10.14）。

本体　Github

GithubレポジトリのINSTALL-UBUNTU14+（リンク）の手順にしたがってインストールした。

git clone https://github.com/jtamames/SqueezeMeta.git
cd SqueezeMeta/

#以下のインストールスクリプトをdockerのrootで実行した。インストールスクリプトのsudoとスクリプト最後の200GBのデータセットダウンロードを消してから実行した。
bash INSTALL-UBUNTU14+

#blastp、pysamが入っていなかったので手動で導入。usearch(v11)はlinux 32bit版をオーサーのHPからダウンロード

2020 11/19追記(python3.7でテストした)
conda create -n SqueezeMeta -c bioconda -c conda-forge -c fpusan squeezemeta
conda activate SqueezeMeta

> perl scripts/SqueezeMeta.pl

# perl scripts/SqueezeMeta.pl

SqueezeMeta v0.4.2, Jan 2019 - (c) J. Tamames, F. Puente-Sánchez CNB-CSIC, Madrid, SPAIN

Please cite: Tamames & Puente-Sanchez, bioRxiv 347559 (2018); doi: https://doi.org/10.1101/347559; Tamames & Puente-Sanchez, Frontiers in Microbiology (2019), in press

Usage: SqueezeMeta.pl -m <mode> -p <projectname> -s <equivfile> -f <raw fastq dir> <options>

Arguments:

Mandatory parameters:

-m: Mode (sequential, coassembly, merged) (REQUIRED)

-s|-samples: Samples file (REQUIRED)

-f|-seq: Fastq read files' directory (REQUIRED)

-p: Project name (REQUIRED in coassembly and merged modes)

Filtering:

--cleaning: Filters with Trimmomatic (Default: No)

-cleaning_options: Options for Trimmomatic (Default:LEADING:8 TRAILING:8 SLIDINGWINDOW:10:15 MINLEN:30)

Assembly:

-a: assembler [megahit,spades,canu] (Default: megahit)

-assembly_options: Options for required assembler

-c|-contiglen: Minimum length of contigs (Default: 200)

Mapping:

-map: mapping software [bowtie,bwa,minimap2-ont,minimap2-pb,minimap2-sr] (Default: bowtie)

Annotation:

--nocog: Skip COG assignment (Default: no)

--nokegg: Skip KEGG assignment (Default: no)

--nopfam: Skip Pfam assignment (Default: no)

-b|-block-size: block size for diamond against the nr database (Default: 8)

-e|-evalue: max evalue for discarding hits diamond run (Default: 1e-03)

-miniden: identity perc for discarding hits in diamond run (Default: 50)

Binning:

--nobins: Skip all binning (Default: no)

--nomaxbin: Skip MaxBin binning (Default: no)

--nometabat: Skip MetaBat2 binning (Default: no)

Performance:

-t: Number of threads (Default: 12)

-canumem: memory for canu in Gb (Default: 32)

--lowmem: run on less than 16Gb of memory (Default:no)

Other:

--minion: Run on MinION reads (use canu and minimap2) (Default: no)

Information:

-v: Version number

-h: This help

データベースの準備

オフライン環境になる前にダウンロードしておく。

200GBほどスペースが必要。

mkdir data_dir
perl scripts/preparing_databases/download_databases.pl data_dir

#condaで導入した場合
download_databases.pl data_dir

ダウンロードされ、data_dir/db/に解凍される。

> ls -alth *

db:

total 120G

drwxr-xr-x 4 root root 4.0K Mar 20 06:47 ..

drwxr-xr-x 10 1000 1000 4.0K Feb 26 07:55 .

-rw-r--r-- 1 1000 1000 56 Feb 26 07:55 DB_BUILD_DATE

drwxr-xr-x 2 1000 1000 4.0K Feb 26 07:55 LCA_tax

-rw-r--r-- 1 1000 1000 1.4G Feb 26 06:18 Pfam-A.hmm

-rw-r--r-- 1 1000 1000 3.3G Feb 26 06:17 eggnog.dmnd

-rw-r--r-- 1 1000 1000 112G Feb 26 06:13 nr.dmnd

-rw-r--r-- 1 1000 1000 2.7G Feb 26 00:17 keggdb.dmnd

-rw-r--r-- 1 1000 1000 130M Feb 24 21:18 taxdb.btd

-rw-r--r-- 1 1000 1000 14M Feb 24 21:18 taxdb.bti

-rw-r--r-- 1 1000 1000 1.1K Feb 22 20:51 nr.pal

-rw-r--r-- 1 1000 1000 65M Nov 16 23:54 arc.all.faa

-rw-r--r-- 1 1000 1000 1.3M Nov 16 23:54 arc.scg.faa

-rw-r--r-- 1 1000 1000 198K Nov 16 23:54 arc.scg.lookup

-rw-r--r-- 1 1000 1000 16M Nov 16 23:54 bac.all.faa

-rw-r--r-- 1 1000 1000 244K Nov 16 23:54 bac.scg.faa

-rw-r--r-- 1 1000 1000 36K Nov 16 23:54 bac.scg.lookup

-rw-r--r-- 1 1000 1000 335 May 21 2018 ReadMe

-rw-r--r-- 1 1000 1000 820K Apr 27 2018 arc.hmm

-rw-r--r-- 1 1000 1000 790K Apr 27 2018 bac.hmm

-rw-r--r-- 1 1000 1000 1015K Apr 27 2018 euk.hmm

-rw-r--r-- 1 1000 1000 446K Apr 27 2018 mito.hmm

-rw-r--r-- 1 1000 1000 3.6M Apr 27 2018 bacar_marker.hmm

-rw-r--r-- 1 1000 1000 16M Apr 27 2018 marker.hmm

-rw-r--r-- 1 1000 1000 5.1K Apr 27 2018 .dmanifest

drwxr-xr-x 4 1000 1000 4.0K Jan 14 2015 genome_tree

drwxr-xr-x 2 1000 1000 4.0K Jan 14 2015 img

drwxr-xr-x 2 1000 1000 4.0K Jan 14 2015 pfam

-rw-r--r-- 1 1000 1000 80K Jan 14 2015 selected_marker_sets.tsv

-rw-r--r-- 1 1000 1000 21M Jan 14 2015 taxon_marker_sets.tsv

drwxr-xr-x 2 1000 1000 4.0K Jan 14 2015 test_data

drwxr-xr-x 2 1000 1000 4.0K Jan 14 2015 hmms

drwxr-xr-x 2 1000 1000 4.0K Jan 14 2015 hmms_ssu

drwxr-xr-x 2 1000 1000 4.0K Jan 14 2015 distributions

test:

total 16K

drwxr-xr-x 4 root root 4.0K Mar 20 06:47 ..

drwxrwxr-x 3 1007 1008 4.0K May 21 2018 .

drwxrwxr-x 2 1007 1008 4.0K May 21 2018 raw

-rw-r--r-- 1 1007 1008 156 May 21 2018 test.samples

実行方法

３つのモードがある。テストデータを使い説明する。

data_dir/test/raw/

f:id:kazumaxneo:20190610171133j:plain

２セットのfastqが用意されている。

１、Sequential mode

すべてのサンプルを個別に処理し、順次分析する。このモードはビニングを含まない。

perl scripts/SqueezeMeta.pl -m sequential -t 40 \
-f data_dir/test/raw/ -s data_dir/test/test.samples

-m <sequential, coassembly, merged>: Mode (REQUIRED)
-s Samples file (REQUIRED)
-f Fastq read files' directory (REQUIRED)
-t Number of threads (Default: 12)

２、Coassembly mode

すべてのサンプルからのリードがプールされ、まとめてアセンブリが実行される。次に、個々のサンプルからのリードをCoアセンブリにマッピングして、各サンプル量を取得し、ビニングしてゲノムビンを得る。メモリ要件が高すぎてクラッシュする場合、３のmerged modeを使う。プロジェクト名が必要。

perl scripts/SqueezeMeta.pl -m coassembly -p projectname -t 40 \
-f data_dir/test/raw/ -s data_dir/test/test.samples

-p <string>: Project name (REQUIRED in coassembly and merged modes)

３、Merged mode

このモードでは無制限数のサンプルのCoアセンブリが可能なっている。まずサンプルを個々にアセンブリし、得られたコンティグを単一の共アセンブリにマージする。その後、Coassemblyモードの場合と同様にビニングが行われる。キメラコンティグを作成する可能性がより高いので、これは推奨される手順ではない（可能であればcoassembly mode使用）。プロジェクト名が必要。

perl scripts/SqueezeMeta.pl -m merged -p projectname -t 40 \
-f data_dir/test/raw/ -s data_dir/test/test.samples

dockerイメージもあげましたが、ランの最後にクラッシュします。非推奨です。

docker pull kazumax/squeezemeta 

#ホストのカレントディレクトリとイメージの/dataをシェアして起動（pullを飛ばして以下を実行してもOK）
docker run -itv $PWD:/data/ kazumax/squeezemeta

> source ~/.profile
> cd ~/SqueezeMeta

引用

SqueezeMeta, A Highly Portable, Fully Automatic Metagenomic Analysis Pipeline
Javier Tamames, Fernando Puente-Sánchez

Front Microbiol. 2018; 9: 3349

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

blast結果を視覚化する BlastViewer

AMOSアセンブラパッケージのMinimusとMinimus2

(メタゲノム向け) blastアノテーション結果をインタラクティブなグラフで視覚化する Keanu

MMseqs2 コマンド其の2 タンパク質配列のクラスタリング

FragGeneScan

再現性のあるメタゲノム解析を行うためのモジュール設計された自動パイプライン Sunbeam

複数のメタゲノムをその場で分析するための軽量で多機能なメタゲノム分析ツール SqueezeMeta（オフライン使用）