2017-09-11

メタゲノムcontigのカバレッジ、GC、taxonomy情報を可視化して分析できる BlobTools

2019 1/16 テストラン追加、diamondデータベースbuidコマンドエラー修正

2019 1/19 diamondデータベースbuidコマンド修正

2019 1/21 追記

2019 6/22 インストール追記

2020 7/29 シミュレーション追記

2020 9/29 追記

2021 9/1 ビルドコマンド修正( リンク修正と、公開データベースのファイル構造が変わっているようので修正しくビルドできるように直しました)

2021 11/4 helpの誤字修正

アセンブリしたcontig中に、アセンブリツールのアーティファクトやコンタミ由来のcontigが混じることは頻繁に起きる。そのため、アセンブリのクオリティチェックの一つにターゲットとなる生物以外の配列がどれほど混じっているか見積もることが重要になる。BlobToolsはcontigのカバレッジ、GC、taxonomy情報などをグラフに出力し、候補となるcontigのスクリーニングを支援するツールである。

公式サイト

https://github.com/DRL/blobtools

マニュアル

https://blobtools.readme.io/docs

インストール

依存

samtools1.5

Github

installを実行する。依存も含めてインストールされる。

git clone https://github.com/DRL/blobtools.git 
cd blobtools
sudo ./install

#bioconda (link)
conda install -c bioconda -y blobtools

[+] Python dependencies installed.

[+] Creating BlobTools executable...done.

[+] Downloading samtools-1.5...done.

[+] Unpacking samtools-1.5...done.

[+] Configuring samtools-1.5...done.

[+] Compiling samtools-1.5...done.

[+] Cleaning up...

[+] Downloading NCBI taxdump from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz ...done.

[+] Unpacking nodes/names.dmp ...done.

[+] Creating nodesDB from /root/blobtools/data/nodes.dmp and /root/blobtools/data/names.dmp

[+] Store nodesDB in default location /root/blobtools/data/nodesDB.txt

[%] 100%

[+] Removing intermediate files...

[+] BlobTools was installed successfully. Please run ./blobtools

ビルドの過程で、samtools1.5が導入される。NCBI taxonomyファイルもダウンロードされる。blobtoolsのバイナリができる。

> blobtools

# blobtools

usage: blobtools [<command>] [<args>...] [--help] [--version]

commands:

create create a BlobDB

view generate tabular view, CONCOCT input or COV files from BlobDB

plot generate a BlobPlot from a BlobDB

covplot generate a CovPlot from a BlobDB and a COV file

map2cov generate a COV file from BAM file

taxify generate a BlobTools compatible HITS file (TSV)

bamfilter subset paired-end reads from a BAM file

seqfilter subset sequences in FASTA file based sequence IDs in list

nodesdb create nodesdb based on NCBI Taxdump's names.dmp and nodes.dmp

-h, --help show this

-v, --version show version number

See 'blobtools <command> --help' for more information on a specific command.

examples:

# 1. Create a BlobDB

./blobtools create -i example/assembly.fna -b example/mapping_1.bam -t example/blast.out -o example/test

# 2. Generate a tabular view

./blobtools view -i example/test.blobDB.json

# 3. Generate a blobplot

./blobtools blobplot -i example/test.blobDB.json

> blobtools create -h

# blobtools create -h

usage: blobtools create -i FASTA [-y FASTATYPE] [-o PREFIX] [--title TITLE]

[-b BAM...] [-s SAM...] [-a CAS...] [-c COV...]

[--nodes <NODES>] [--names <NAMES>] [--db <NODESDB>]

[-t HITS...] [-x TAXRULE...] [-m FLOAT] [-d FLOAT] [--tax_collision_random]

[-h|--help]

Options:

-h --help show this

-i, --infile FASTA FASTA file of assembly. Headers are split at whitespaces.

-y, --type FASTATYPE Assembly program used to create FASTA. If specified,

coverage will be parsed from FASTA header.

(Parsing supported for 'spades', 'velvet', 'platanus')

-t, --hitsfile HITS... Hits file in format (qseqid\ttaxid\tbitscore)

(e.g. BLAST output "--outfmt '6 qseqid staxids bitscore'")

Can be specified multiple times

-x, --taxrule <TAXRULE>... Taxrule determines how taxonomy of blobs

is computed (by default both are calculated)

"bestsum" : sum bitscore across all

hits for each taxonomic rank

"bestsumorder" : sum bitscore across all

hits for each taxonomic rank.

- If first <TAX> file supplies hits, bestsum is calculated.

- If no hit is found, the next <TAX> file is used.

-m, --min_score <FLOAT> Minimal score necessary to be considered for taxonomy calculaton, otherwise set to 'no-hit'

[default: 0.0]

-d, --min_diff <FLOAT> Minimal score difference between highest scoring

taxonomies (otherwise "unresolved") [default: 0.0]

--tax_collision_random Random allocation of taxonomy if highest scoring

taxonomies have equal scores (otherwise "unresolved") [default: False]

--nodes <NODES> NCBI nodes.dmp file. Not required if '--db'

--names <NAMES> NCBI names.dmp file. Not required if '--db'

--db <NODESDB> NodesDB file (default: $BLOBTOOLS/data/nodesDB.txt).

-b, --bam <BAM>... BAM file(s), can be specified multiple times

-s, --sam <SAM>... SAM file(s), can be specified multiple times

-a, --cas <CAS>... CAS file(s) (requires clc_mapping_info in $PATH), can be specified multiple times

-c, --cov <COV>... COV file(s), can be specified multiple times

-o, --out <PREFIX> BlobDB output prefix

--title TITLE Title of BlobDB [default: output prefix)

> blobtools plot -h

# blobtools plot -h

usage: blobtools blobplot -i BLOBDB

[-p INT] [-l INT] [--cindex] [-n] [-s]

[-r RANK] [-x TAXRULE] [--label GROUPS...]

[--lib COVLIB] [-o PREFIX] [-m]

[--sort ORDER] [--sort_first LABELS] [--hist HIST] [--notitle] [--filelabel]

[--colours FILE] [--exclude FILE]

[--refcov FILE] [--catcolour FILE]

[--format FORMAT] [--noblobs] [--noreads] [--legend]

[--cumulative] [--multiplot]

[-h|--help]

Options:

-h --help show this

-i, --infile BLOBDB BlobDB file (created with "blobtools create")

--lib COVLIB Plot only certain covlib(s). Separated by ","

--notitle Do not add filename as title to plot

--filelabel Label axis based on filenames

-p, --plotgroups INT Number of (taxonomic) groups to plot, remaining

groups are placed in 'other' [default: 7]

-l, --length INT Minimum sequence length considered for plotting [default: 100]

--cindex Colour blobs by 'c index' [default: False]

-n, --nohit Hide sequences without taxonomic annotation [default: False]

-s, --noscale Do not scale sequences by length [default: False]

--legend Plot legend of blobplot in separate figure

-m, --multiplot Multi-plot. Print blobplot for each (taxonomic) group separately

--cumulative Print plot after addition of each (taxonomic) group

--sort <ORDER> Sort order for plotting [default: span]

span : plot with decreasing span

count : plot with decreasing count

--sort_first <L1,L2,...> Labels that should always be plotted first, regardless of sort order

("no-hit,other,undef" is often a useful setting)

--hist <HIST> Data for histograms [default: span]

span : span-weighted histograms

count : count histograms

-r, --rank <RANK> Taxonomic rank used for colouring of blobs [default: phylum]

(Supported: species, genus, family, order,

phylum, superkingdom)

-x, --taxrule <TAXRULE> Taxrule which has been used for computing taxonomy

(Supported: bestsum, bestsumorder) [default: bestsum]

--format FORMAT Figure format for plot (png, pdf, eps, jpeg,

ps, svg, svgz, tiff) [default: png]

--noblobs Omit blobplot [default: False]

--noreads Omit plot of reads mapping [default: False]

-o, --out PREFIX Output prefix

--label GROUPS... Relabel (taxonomic) groups, can be used several times.

e.g. "A=Actinobacteria,Proteobacteria"

--colours COLOURFILE File containing colours for (taxonomic) groups

--exclude GROUPS Exclude these (taxonomic) groups (also works for 'other')

e.g. "Actinobacteria,Proteobacteria,other"

--refcov <FILE> File containing number of "total" and "mapped" reads

per coverage file. (e.g.: bam0,900,100). If provided, info

will be used in read coverage plot(s).

--catcolour <FILE> Colour plot based on categories from FILE

(format : "seq category").

ラン

example/のデータを使う。

blobDBの作成。

blobtools create -i example/assembly.fna -b example/mapping_1.bam -t example/blast.out -o example/my_first_blobplot

-i　FASTA file of assembly. Headers are split at whitespaces.
-b　BAM file (requires samtools in $PATH)
-t　Taxonomy file in format (qseqid taxid bitscore) (e.g. BLAST output "--outfmt '6 qseqid staxids bitscore'")

-s　SAM file
-o　BlobDB output prefix --title　Title of BlobDB [default: FASTA)
--db　NodesDB file [default: data/nodesDB.txt].
-y　Assembly program used to create FASTA. If specified, coverage will be parsed from FASTA header. (Parsing supported for 'spades', 'soap', 'velvet', 'abyss')

入力するのは、contigのFASTAファイル、FASTAにマッピングして作ったbamファイル、blast解析してtaxonomy情報をつけたファイルとなる。taxonomyファイルの作成方法は下の方に記載した。taxonomyファイル自体はなくてもランは可能である。日本語パスのファイルがあると動作しないので注意する。

blobDBのデータベース情報の確認。

blobtools view -i example/my_first_blobplot.blobDB.json -o example/

出力されるtableファイルは以下のような内容になっている。

user$ cat example/my_first_blobplot.blobDB.table.txt

## blobtools v1.0

## assembly : /Users/user/local/blobtools/example/assembly.fna

## coverage : bam0 - /Users/user/local/blobtools/example/mapping_1.bam

## taxonomy : tax0 - /Users/user/local/blobtools/example/blast.out

## nodesDB : /Users/user/local/blobtools/data/nodesDB.txt

## taxrule : bestsum

## min_score : 0.0

## min_diff : 0.0

## tax_collision_random : False

# name length GC N bam0 phylum.t.6 phylum.s.7 phylum.c.8

contig_1 756 0.2606 0 90.406 Actinobacteria 200.0 0

contig_2 1060 0.2623 0 168.409 Actinobacteria 2300.0 0

contig_3 602 0.2342 0 43.761 Actinobacteria 10000.0 0

contig_4 951 0.3155 0 456.313 Actinobacteria 1000.0 0

contig_5 614 0.329 0 163.557 Nematoda 2000.0 0

contig_6 216 0.1944 0 25.88 Tardigrada 4000.0 2

contig_7 4060 0.2584 0 52.312 Nematoda 2000.0 0

contig_8 2346 0.2801 0 91.742 unresolved 2000.0 1

contig_9 1599 0.2439 0 74.757 Nematoda 200.0 0

bam0　Coverage from bam0 (see main header for filename)
phylum.t.6　The assigned taxonomy of the sequence at the taxonomic rank of "phylum" under the tax-rule "best-sum"
phylum.s.7　The sum of scores for the taxonomy of the sequence at the taxonomic rank of "phylum" under the tax-rule "best-sum"
phylum.c.8　The c-index for the taxonomy of the sequence at the taxonomic rank of "phylum" under the tax-rule "best-sum"

テストラン

1、jsonファイル作成

blobtools create -i example/assembly.fna -b example/mapping_1.bam -t example/blast.out -o example/

my_first_blobplot.blobDB.jsonとmy_first_blobplot.mapping_1.bam.covが出力される。

blobplotの作成。

blobtools plot -i example/my_first_blobplot.blobDB.json -o example/

--format Figure format for plot (png, pdf, eps, jpeg, ps, svg, svgz, tiff) [default: png]

以下の２つの図が出力される。

f:id:kazumaxneo:20170911232614p:plain

contig.faへのマッピング率とtaxonomyごとのマッピング率。

f:id:kazumaxneo:20170911232623p:plain

横軸がcontigのGC率、縦軸がcontigにリードを当てた時のカバレッジ。プロットの大きさがcontigの長さを表す。メタゲノムの分析にも利用できると思われる。

追記

カスタマイズする。階級はsuperkingdom。プロットの順番を変える（一番下に持って行きたいのを--sort_firstフラグで記載）。nohitはプロットしない（ --nohit）。出力はPDF。

blobtools plot -i blobDB.json --rank superkingdom --format pdf --nohit --sort_first bacteria --multiplot

--multiplot Multi-plot. Print blobplot for each (taxonomic) group separately
--format <FORMAT> Figure format for plot (png, pdf, eps, jpeg,
ps, svg, svgz, tiff) [default: png]
--nohit Hide sequences without taxonomic annotation [default: False]
--rank <RANK> Taxonomic rank used for colouring of blobs [default: phylum]
(Supported: species, genus, family, order, phylum, superkingdom)

門レベルを指定。分類ごとファイルを分け、順番に重ねる形でプロットする（--multiplot、最後のファイルには全てプロットされる）。sort countで上位20をプロットする。

blobtools plot -i blobDB.json --rank phylum --plotgroups 20 --multiplot --sort count

--multiplot Multi-plot. Print blobplot for each (taxonomic) group separately
--plotgroups <INT> Number of (taxonomic) groups to plot, remaining
groups are placed in 'other' [default: 7]
--sort <ORDER> Sort order for plotting [default: span]
span : plot with decreasing span
count : plot with decreasing count

taxonomyファイル（blast.out）の作成方法

１、Diamond database for UniProt Reference Proteomes

Uniplotのデータベース(dowoload) Proteomics mappingダウンロード（link）

#2022/05/02
wget
https://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/reference_proteomes/Reference_Proteomes_2022_01.tar.gz

2019年1月16日にダウンロードした時は2018_11が最新だった。=>現在は2022.01が最新

ここから先は公式マニュアル通りに行う（link）。この行程のみcent OSで作業した。

#1 Unpack protein FASTAs for each kingdom
parallel -j4 'gunzip {}' ::: `ls` | grep "fasta.gz" | grep -v 'DNA' | grep -v 'additional' 

#またはGNU parallelを使わず１つずつ進める。必要なのはアミノ酸fasta
tar -zxvf Reference_Proteomes_2019_01.tar.gz

#以下は不要なので消す（消さないと結合時に巻き込まれる）
rm */*/*acc.gz
rm */*/*DNA*gz
rm */*/*additional.fasta.gz

#タンパク質配列を結合(2021.03はディレクトリ構造が変わったので修正した)
find . -maxdepth 3 -name '*.fasta.gz' |xargs cat > uniprot_ref_proteomes.fasta.gz
gzip -dv uniprot_ref_proteomes.fasta.gz

#3 Simplify sequence IDs
cat uniprot_ref_proteomes.fasta | sed -r 's/(^>sp\|)|(^>tr\|)/>/g' | cut -f1 -d"|" > temp
mv temp uniprot_ref_proteomes.fasta

#4 Subset mapping file to only contain NCBI TaxID entries
cat */*/*.idmapping.gz > idmapping.gz
gzip -dv idmapping.gz
cat idmapping | grep "NCBI_TaxID" > uniprot_ref_proteomes.taxids

#5 Make Diamond database
diamond makedb --in uniprot_ref_proteomes.fasta --taxonmap uniprot_ref_proteomes.taxids -d uniprot_ref_proteomes.diamond


#taxidsの取り込みでエラーが出たので、以下のように修正した
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdmp.zip
unzip taxdmp.zip -d taxdmp
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/prot.accession2taxid.gz
gzip -dv prot.accession2taxid.gz
#database作成
diamond makedb --in uniprot_ref_proteomes.fasta \
 --taxonmap prot.accession2taxid \
 --taxonnodes taxdmp/nodes.dmp \
 -d uniprot_ref_proteomes.diamond

２、RNAcentral SILVA collection

# Download ID mapping file from RNAcentral release 5.0 wget
ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/id_mapping/id_mapping.tsv.gz

# extract SILVA sequence IDs (this is the TaxID-mapping file) gunzip -c id_mapping.tsv.gz | grep SILVA | sort | uniq > id_mapping.SILVA.tsv 

# generate a list of sequence IDs 
cut -f1 id_mapping.SILVA.tsv > id_mapping.SILVA.names.txt 

# Download FASTA from RNAcentral release 5.0 (contains all sequences) 
wget ftp://ftp.ebi.ac.uk/pub/databases/RNAcentral/releases/5.0/sequences/rnacentral_active.fasta.gz gunzip rnacentral_active.fasta.gz

# Subselect only SILVA rRNA sequences 
blobtools seqfilter -i rnacentral_active.fasta -l id_mapping.SILVA.names.tsv

# Convert rRNA to rDNA 
perl -ne 'chomp; if(m/^>/){@temp=split(" "); print $temp[0]."\n";} else {$_ =~ tr/U/T/; print $_."\n"}' rnacentral_active.filtered.fna > rnacentral_active.SILVA.rDNA.fasta

# make the blastDB 
makeblastdb -dbtype nucl -in rnacentral_active.SILVA.rDNA.fasta

データベースができたら、scaffolds.fastaをblastにかける。--outfmt 6でタブ出力、出力項目はqseqid、staxids、 bitscoreとする。

diamond blastx --query scaffolds.fa \
 --db uniprot_ref_proteomes.diamond.dmnd \
 --outfmt 6 qseqid staxids bitscore \
 --sensitive --max-target-seqs 1 \
 --evalue 1e-25 > blast.out

これでblastのテキスト出力が得られる。しかし、tatxidsがつかない配列が多数だったので、uniprot_ref_proteomes.taxidsをデータベースとして、diamondのタブ出力に手動でtaxonomy idsをアサインするスクリプトを書いた。下の図は、系統IDを手動アサインして、blobtoolsで可視化した結果。

Mock metagenomeのアセンブリ配列分析結果

f:id:kazumaxneo:20190121203448p:plain

f:id:kazumaxneo:20190121203853p:plain

human oral swab

f:id:kazumaxneo:20190928130544p:plain

metagenome of marburg forest soil（SRX2359704）

f:id:kazumaxneo:20190122112207j:plain

f:id:kazumaxneo:20190122112007p:plain

f:id:kazumaxneo:20190928130513p:plain

アサインできてないcontigがたくさん発生している。

追記

2020 7/29

シミュレートデータを使って調べてみる。

　１、大腸菌のゲノムからARTを使いショートリードを50x分発生させた。

art_illumina -ss HS25 -i GCF_000005845.2_ASM584v2_genomic.fna -p -l 150 -f 50 -m 300 -s 30 -o ecoli_

次にシロイヌナズナの４番染色体配列からショートリードを15x分発生させた。

art_illumina -ss HS25 -i Arabidopsis_thaliana.TAIR10.dna.chromosome.4.fa -p -l 150 -f 15 -m 300 -s 30 -o at_

リード数は以下の通り。リード数が全体に占める割合は大腸菌のゲノムが45.4% 、シロイヌナズナの４番染色体が54.5%となる。シロイヌナズナの方は15xしかリードを発生させていないが、染色体サイズは18.5 Mbと大きいため、リードの占める割合はシロイヌナズナが大腸菌を上回った。

Table.1

f:id:kazumaxneo:20200729174003p:plain

２、２組のペアエンドfastqをマージしてmetaspadesでDe novoアセンブリした。

metaspades.py -1 merge-1.fq -2 merge-2.fq -k 77 -o outdir -t 30

=> contigs数は978、トータルサイズは22,746,854 bpだった。

３、Uniref90をデータベース（上の方と同じ手順で作成）にdiamond blastx検索し（≤1e-50）、結果をblobtoolsでまとめて視覚化した。tax rankはspeciesとする。

blobtools create -i contig.fasta -b mapped.bam -t blastx-out
blobtools plot -i blobDB.json -r species

f:id:kazumaxneo:20200729175240p:plain

Fig,1

f:id:kazumaxneo:20200729175242p:plain

Fig,2

図の情報を保存しているblobplot.stats.txtファイルをexcelで開く。少し見栄えを良くしてある。

Table.2

f:id:kazumaxneo:20200729182620p:plain

A 、L、M列を拡大した。

大腸菌とシロイヌナズナのアサインは太字の5-6行目。７行目以降は誤判定となる（E value ≤ 1e-50で検索）。太字にしたM列目は上の図Fig.2で使用されている。リード数が全体に占める割合は、準備のセクションに記載通り、大腸菌のゲノムが45.4% 、シロイヌナズナの４番染色体が54.5%で比率は1.20だが、表（下の図）の36.9 %と43.7 %：ratio 1.18はこれに近い。つまりbamo_read_mapが全リードのうちどれだけのリードがその分類群に当たったかを示していることになる。bamo_read_map_pはその%表記。