2020-05-23

メタゲノムデータからウィルスゲノムをアセンブルする metaviralSPAdes

2020 5/25 タイトル修正

2020 11/11 dockerリンク追加

　現在知られているウイルスのセットは着実に拡大しているが、これまでのところ、地球上のウイルスのごく一部しかシークエンシングされていない。ショットガンメタゲノムシークエンシングは、新しいウイルスを明らかにする機会を提供するが、メタゲノムアセンブリでは検出が困難なウイルスゲノムを特定するという計算上の課題に直面している。
　本研究では、ウイルスと細菌のクロモソーム間のカバレッジデプスのばらつきを分析することに基づいて、メタゲノムアセンブリグラフ中のウイルスゲノムを同定するためのmetaviralSPAdesツールについて説明する。多様なメタゲノムデータセット上でmetaviralSPAdesをベンチマークし、ウイルス固有の隠れマルコフモデルのセットを用いて我々の予測を検証し、最先端のウイルス同定パイプラインを改善することを実証した。
　metaviralSPAdes には viralAssembly、viralVerify、viralComplete モジュールが含まれており、スタンドアロンパッケージとしてhttps://github.com/ablab/spades/tree/metaviral_publication、 https://github.com/ablab/viralVerify/、 https://github.com/ablab/viralComplete/ で利用できる。

インストール

ubuntu18.04でテストした。

Github

cd spades/assembler/
./spades_compile.sh

> ./metaviralspades.py

$ ./metaviralspades.py

SPAdes genome assembler v3.14.0-dev [metaplasmidSPAdes mode]

Usage: spades.py [options] -o <output_dir>

Basic options:

-o <output_dir> directory to store all the resulting files (required)

--iontorrent this flag is required for IonTorrent data

--test runs SPAdes on toy dataset

-h, --help prints this usage message

-v, --version prints version

Input data:

--12 <filename> file with interlaced forward and reverse paired-end reads

-1 <filename> file with forward paired-end reads

-2 <filename> file with reverse paired-end reads

-s <filename> file with unpaired reads

--merged <filename> file with merged forward and reverse paired-end reads

--pe-12 <#> <filename> file with interlaced reads for paired-end library number <#>.

Older deprecated syntax is -pe<#>-12 <filename>

--pe-1 <#> <filename> file with forward reads for paired-end library number <#>.

Older deprecated syntax is -pe<#>-1 <filename>

--pe-2 <#> <filename> file with reverse reads for paired-end library number <#>.

Older deprecated syntax is -pe<#>-2 <filename>

--pe-s <#> <filename> file with unpaired reads for paired-end library number <#>.

Older deprecated syntax is -pe<#>-s <filename>

--pe-m <#> <filename> file with merged reads for paired-end library number <#>.

Older deprecated syntax is -pe<#>-m <filename>

--pe-or <#> <or> orientation of reads for paired-end library number <#>

(<or> = fr, rf, ff).

Older deprecated syntax is -pe<#>-<or>

--s <#> <filename> file with unpaired reads for single reads library number <#>.

Older deprecated syntax is --s<#> <filename>

--mp-12 <#> <filename> file with interlaced reads for mate-pair library number <#>.

Older deprecated syntax is -mp<#>-12 <filename>

--mp-1 <#> <filename> file with forward reads for mate-pair library number <#>.

Older deprecated syntax is -mp<#>-1 <filename>

--mp-2 <#> <filename> file with reverse reads for mate-pair library number <#>.

Older deprecated syntax is -mp<#>-2 <filename>

--mp-s <#> <filename> file with unpaired reads for mate-pair library number <#>.

Older deprecated syntax is -mp<#>-s <filename>

--mp-or <#> <or> orientation of reads for mate-pair library number <#>

(<or> = fr, rf, ff).

Older deprecated syntax is -mp<#>-<or>

--hqmp-12 <#> <filename> file with interlaced reads for high-quality mate-pair library number <#>.

Older deprecated syntax is -hqmp<#>-12 <filename>

--hqmp-1 <#> <filename> file with forward reads for high-quality mate-pair library number <#>.

Older deprecated syntax is -hqmp<#>-1 <filename>

--hqmp-2 <#> <filename> file with reverse reads for high-quality mate-pair library number <#>.

Older deprecated syntax is -hqmp<#>-2 <filename>

--hqmp-s <#> <filename> file with unpaired reads for high-quality mate-pair library number <#>.

Older deprecated syntax is -hqmp<#>-s <filename>

--hqmp-or <#> <or> orientation of reads for high-quality mate-pair library number <#>

(<or> = fr, rf, ff).

Older deprecated syntax is -hqmp<#>-<or>

--nxmate-1 <#> <filename> file with forward reads for Lucigen NxMate library number <#>.

Older deprecated syntax is -nxmate<#>-1 <filename>

--nxmate-2 <#> <filename> file with reverse reads for Lucigen NxMate library number <#>.

Older deprecated syntax is -nxmate<#>-2 <filename>

--sanger <filename> file with Sanger reads

--pacbio <filename> file with PacBio reads

--nanopore <filename> file with Nanopore reads

--tslr <filename> file with TSLR-contigs

--trusted-contigs <filename>

file with trusted contigs

--untrusted-contigs <filename>

file with untrusted contigs

Pipeline options:

--only-error-correction runs only read error correction (without assembling)

--only-assembler runs only assembling (without read error correction)

--careful tries to reduce number of mismatches and short indels

--checkpoints <last or all>

save intermediate check-points ('last', 'all')

--continue continue run from the last available check-point

--restart-from <cp> restart run with updated options and from the specified check-point

('ec', 'as', 'k<int>', 'mc', 'last')

--disable-gzip-output forces error correction not to compress the corrected reads

--disable-rr disables repeat resolution stage of assembling

Advanced options:

--dataset <filename> file with dataset description in YAML format

-t <int>, --threads <int> number of threads. [default: 16]

-m <int>, --memory <int> RAM limit for SPAdes in Gb (terminates if exceeded). [default: 250]

--tmp-dir <dirname> directory for temporary files. [default: <output_dir>/tmp]

-k <int> [<int> ...] list of k-mer sizes (must be odd and less than 128)

[default: 'auto']

--cov-cutoff <float> coverage cutoff value (a positive float number, or 'auto', or 'off')

[default: 'off']

--phred-offset <33 or 64> PHRED quality offset in the input reads (33 or 64),

[default: auto-detect]

公式ではないがdockerイメージもアップされている。

#dockerhub(link)
docker pull nakor/metaviralspades:latest

docker run --rm -itv $PWD:/data/ -w /data nakor/metaviralspades metaviralspades.py -1 /data/pair1.fq /data/pair2.fq -o /data/output_dir

実行方法

メタゲノムシーケンシングリードを指定する。

./metaviralspades.py -1 input_1.fq.gz -2 input_2.fq.gz -o outdir -t 12

引用

metaviralSPAdes: assembly of viruses from metagenomic data
Dmitry Antipov, Mikhail Raiko, Alla Lapidus, Pavel A Pevzner
Bioinformatics, Published: 15 May 2020 Article history

関連

2020-05-22

k-merカウンタ DSK

2013 Bioinformatics k-mer

　DNA/RNAシーケンシングのリード中のすべてのk-mer（長さkの部分文字列）をカウントすることは、多くのバイオインフォマティクスアプリケーションの前段階である。しかし、最新のk-merカウント方法では、大きなデータ構造がメモリ内に存在する必要がある。このようなデータ構造は、通常、カウントするk-merの数に応じて大きくなる。本著者らは、DSK (disk streaming of k-mer)と呼ばれるk-merカウントのための新しいストリーミングアルゴリズムを提示する。このアプローチは、メモリ、時間、ディスクのトレードオフを実現する。リードに含まれるすべてのk-merのマルチセットはパーティション化され、パーティションはディスクに保存される。その後、各パーティションは一時的なハッシュテーブルに個別にメモリにロードされる。k-merカウントは、各ハッシュテーブルをトラバースすることで返される。低密度のk-merはオプションでフィルタリングされる。DSKは、わずか4.0GBのメモリと160GBの適度なディスク容量を用いて、ヒトゲノムデータセットの27 k-merを17.9時間でカウントすることができる最初のアプローチである。（論文執筆時点）

インストール

ubuntu18.04LTSでテストした。

ビルド依存

CMake 3.1+
C++/11 capable compiler (e.g. gcc 4.7+, clang 3.5+, Apple/clang 6.0+)

Github

#bioconda (link)
conda install -c bioconda -y dsk

#from soruce
git clone --recursive https://github.com/GATB/dsk.git 
cd dsk 
sh INSTALL

> dsk

$ dsk

ERROR: Option '-file' is mandatory

[dsk options]

-nb-cores (1 arg) : number of cores [default '0']

-verbose (1 arg) : verbosity level [default '1']

-version (0 arg) : version

-help (0 arg) : help

-file (1 arg) : reads file

-kmer-size (1 arg) : size of a kmer [default '31']

-abundance-min (1 arg) : min abundance threshold for solid kmers [default '2']

-abundance-max (1 arg) : max abundance threshold for solid kmers [default '2147483647']

-abundance-min-threshold (1 arg) : min abundance hard threshold (only used when min abundance is "auto") [default '2']

-histo-max (1 arg) : max number of values in kmers histogram [default '10000']

-solidity-kind (1 arg) : way to compute counts of several files (sum, min, max, one, all, custom) [default 'sum']

-solidity-custom (1 arg) : when solidity-kind is custom, specifies list of files where kmer must be present [default '']

-max-memory (1 arg) : max memory (in MBytes) [default '5000']

-max-disk (1 arg) : max disk (in MBytes) [default '0']

-solid-kmers-out (1 arg) : output file for solid kmers (only when constructing a graph) [default '']

-out (1 arg) : output file [default '']

-out-dir (1 arg) : output directory [default '.']

-out-tmp (1 arg) : output directory for temporary files [default '.']

-out-compress (1 arg) : h5 compression level (0:none, 9:best) [default '0']

-storage-type (1 arg) : storage type of kmer counts ('hdf5' or 'file') [default 'hdf5']

-histo2D (1 arg) : compute the 2D histogram (with first file = genome, remaining files = reads) [default '0']

-histo (1 arg) : output the kmer abundance histogram [default '0']

[kmer count, advanced performance tweaks options]

-minimizer-type (1 arg) : minimizer type (0=lexi, 1=freq) [default '0']

-minimizer-size (1 arg) : size of a minimizer [default '10']

-repartition-type (1 arg) : minimizer repartition (0=unordered, 1=ordered) [default '0']

実行方法

１、31-merをカウント

dsk -file input.fq -kmer-size 31

#listとして与える
ls -1 *.fastq > list_reads
dsk -file list_reads

-kmer-size size of a kmer [default '31']

f:id:kazumaxneo:20200522231810p:plain

（以下省略）
出力はHDF5 フォーマットのファイルになる。

２、変換する。

dsk2ascii -file input.fq.h5 -out output.txt

> head output.txt

f:id:kazumaxneo:20200522234111p:plain

引用

DSK: k-mer counting with very low memory usage.

Rizk G, Lavenier D, Chikhi R.

Bioinformatics. 2013 Mar 1;29(5):652-3

関連

2020-05-21

ゲノムのリアレンジメントを検出して視覚化する smashpp

2020 GigaScience genomic rearrangements ゲノム比較 (comparative genomics) 結果の視覚化 (visualization) synteny_block

2020 5/22 追記

　ゲノムのリアレンジメントの研究は、染色体の進化や遺伝的疾患、ガンなどの研究に重要な役割を果たしており、その研究は非常に重要である。本研究では、2つのDNA配列間の小規模・大規模なゲノムリアレンジメントを検出し、可視化するためのアラインメントフリーでメモリ効率の良いツール、Smash++を紹介する。この計算ソリューションは、データ圧縮技術を利用して2つの配列の情報内容を抽出し、リアレンジメントを見つける。また、SVG（Scalable Vector Graphics）画像を生成することで、検出されたリアレンジメントを自己および相対的な複雑さとともに可視化できるツール、Smash++ビジュアライザーも紹介する。
　細菌、真菌、鳥類、哺乳類の合成DNA配列とreal DNA配列を用いて実験を行った結果、提案したツールはゲノムリアレンジメントを正確に検出することができた。検出された領域は、アライメントに基づくアプローチやFISH(fluorescence in situ hybridization)解析を行った先行研究と一致していた。すべての実験において、最大メモリ使用量のピークは約1GBであり、Smash++は現在の標準的なコンピュータ上で実行可能である。

インストール

ubuntu18.04でテストした（macに導入する場合はソースからビルドする。Github参照）。

Gihtub

#bioconda(link)
conda create -n smashpp
conda activate smashpp
conda install -y -c bioconda smashpp

> smashpp

# smashpp

SYNOPSIS

./smashpp [OPTIONS] -r <REF-FILE> -t <TAR-FILE>

OPTIONS

Required:

-r <FILE> = reference file (Seq/FASTA/FASTQ)

-t <FILE> = target file (Seq/FASTA/FASTQ)

Optional:

-l <INT> = level of compression: [0, 6]. Default -> 3

-m <INT> = min segment size: [1, 4294967295] -> 50

-e <FLOAT> = entropy of 'N's: [0.0, 100.0] -> 2.0

-n <INT> = number of threads: [1, 255] -> 4

-f <INT> = filter size: [1, 4294967295] -> 100

-ft <INT/STRING> = filter type (windowing function): -> hann

{0/rectangular, 1/hamming, 2/hann,

3/blackman, 4/triangular, 5/welch,

6/sine, 7/nuttall}

-fs [S][M][L] = filter scale

{S/small, M/medium, L/large}

-d <INT> = sampling steps -> 1

-th <FLOAT> = threshold: [0.0, 20.0] -> 1.5

-rb <INT> = ref beginning guard: [-32768, 32767] -> 0

-re <INT> = ref ending guard: [-32768, 32767] -> 0

-tb <INT> = tar beginning guard: [-32768, 32767] -> 0

-te <INT> = tar ending guard: [-32768, 32767] -> 0

-ar = consider asymmetric regions -> no

-nr = do NOT compute self complexity -> no

-sb = save sequence (input: FASTA/FASTQ) -> no

-sp = save profile (*.prf) -> no

-sf = save filtered file (*.fil) -> no

-ss = save segmented files (*.s[i]) -> no

-sa = save profile, filetered and -> no

segmented files

-rm k,[w,d,]ir,a,g/t,ir,a,g:...

-tm k,[w,d,]ir,a,g/t,ir,a,g:...

= parameters of models

<INT> k: context size

<INT> w: width of sketch in log2 form,

e.g., set 10 for w=2^10=1024

<INT> d: depth of sketch

<INT> ir: inverted repeat: {0, 1, 2}

0: regular (not inverted)

1: inverted, solely

2: both regular and inverted

<FLOAT> a: estimator

<FLOAT> g: forgetting factor: [0.0, 1.0)

<INT> t: threshold (no. substitutions)

-ll = list of compression levels

-h = usage guide

-v = more information

--version = show version

AUTHOR

Morteza Hosseini seyedmorteza@ua.pt

SAMPLE

./smashpp -r ref -t tar -l 0 -m 1000

> smashpp -viz

# smashpp -viz

SYNOPSIS

./smashpp -viz [OPTIONS] -o <SVG-FILE> <POS-FILE>

OPTIONS

Required:

<POS-FILE> = position file, generated by

Smash++ tool (*.pos)

Optional:

-o <SVG-FILE> = output image name (*.svg). Default -> map.svg

-rn <STRING> = reference name shown on output. If it

has spaces, use double quotes, e.g.

"Seq label". Default: name in header

of position file

-tn <STRING> = target name shown on output

-l <INT> = type of the link between maps: [1, 6] -> 1

-c <INT> = color mode: [0, 1] -> 0

-p <FLOAT> = opacity: [0.0, 1.0] -> 0.9

-w <INT> = width of the sequence: [8, 100] -> 10

-s <INT> = space between sequences: [5, 200] -> 40

-tc <INT> = total number of colors: [1, 255]

-rt <INT> = reference tick: [1, 4294967295]

-tt <INT> = target tick: [1, 4294967295]

-th [0][1] = tick human readable: 0=false, 1=true -> 1

-m <INT> = minimum block size: [1, 4294967295] -> 1

-vv = vertical view -> no

-nrr = do NOT show relative redundancy -> no

(relative complexity)

-nr = do NOT show redunadancy -> no

-ni = do NOT show inverse maps -> no

-ng = do NOT show regular maps -> no

-n = show 'N' bases -> no

-stat = save stats (*.csv) -> stat.csv

-h = usage guide

-v = more information

--version = show version

AUTHOR

Morteza Hosseini seyedmorteza@ua.pt

SAMPLE

./smashpp -viz -vv -o simil.svg ref.tar.pos

テストラン1

１、配列比較

git clone https://github.com/smortezah/smashpp.git
cd smashpp/example/
smashpp -r ref -t tar

> cat ref.tar.pos

# cat ref.tar.pos

##SMASH++

##PARAM=<-r ref -t tar>

##INFO=<Ref=ref,RefSize=1000,Tar=tar,TarSize=1000>

#RBeg REnd RRelRdn RRdn TBeg TEnd TRelRdn TRdn Inv

0 510 0.5108 2.0000 497 1000 0.5120 2.0000 F

498 1000 0.1518 2.0000 508 0 0.1551 2.0000 T

２、視覚化

smashpp -viz -stat -o example.svg ref.tar.pos

-r reference file (Seq/FASTA/FASTQ)
-t target file (Seq/FASTA/FASTQ)

f:id:kazumaxneo:20200522085852p:plain

"-vv"をつけると縦に出力される。

テストラン２

cd smashpp/experiment/dataset/bacteria/Xanthomonas_oryzae_pv_oryzae/
smashpp -r MAFF_311018.seq -t PXO99A.seq -l 0 -m 1000
smashpp -viz -stat -o example.svg MAFF_311018.seq.PXO99A.seq.pos

-l level of compression: [0, 6]. Default -> 3
-m min segment size: [1, 4294967295] -> 50

f:id:kazumaxneo:20200522090912p:plain

他にもいくつかの配列が用意されている。

f:id:kazumaxneo:20200522091327p:plain

引用

Smash++: an alignment-free and memory-efficient tool to find genomic rearrangements
Morteza Hosseini, Diogo Pratas, Burkhard Morgenstern, Armando J Pinho
GigaScience, Volume 9, Issue 5, May 2020

関連

2020-05-20

ゲノムの中の関心がある遺伝子を視覚化する Gcluster

2020 Bioinformatics 結果の視覚化 (visualization) pan-genome ゲノム比較 (comparative genomics) BLAST gene cluster

2020 5/27 コメント追加

2020 5/28 -mオプション追記

　遺伝子、遺伝子クラスター、およびその近傍のゲノムコンテクストを比較することは、遺伝子の機能や微生物の進化の基盤を決定する上で非常に重要である。現在のところ、多数のゲノムのゲノムコンテクストを可視化し、比較することができる使いやすく柔軟なツールはまだ存在していない。
　ここでは、研究者が大量の完成したゲノムやドラフトゲノムにわたって、興味のある遺伝子の周りのゲノム領域の高品質な線形マップをカスタマイズして作成することを可能にするスタンドアロンのPerlツールであるGclusterを紹介する。重要なことは、Gclusterは、組み込みのorthoMCLの形での相同遺伝子解析と、遺伝子のコンテキストの優れた比較を提供するために、与えられた系統樹上にゲノムをマッピングすることを統合している。
　GclusterはPerlで書かれており、GPLv3でリリースされている。ソースコードは https://github.com/Xiangyang1984/Gcluster とhttp://www.microbialgenomic.com/Gcluster_tool.html において自由に入手できる。

インストール

ubuntu18.04でテストした。

本体　Github

#bioconda (link)*1
conda install -c bioconda gcluster -y

> Gcluster.pl

$ Gcluster.pl

=NAME

Gcluster.pl

=DESCRIPTION

Gcluster is a simple tool for visualizing and comparing genome contexts for massive genomes. It is freely available at http://www.microbialgenomic.com/Gcluster_tool.html and https://github.com/Xiangyang1984/Gcluster_v1.01 under an open source GPLv3 license. It is a stand-alone Perl application, which requires MCL, NCBI BLAST+ and several Perl Modules (e.g. GD, GD::SVG) to be installed before use.

=USAGE

Gcluster.pl -dir genbank_file_directory -gene interested_gene_file [options]

FOR EXAMPLE:

perl /home/xiangyang/Gcluster_v1.01-master/Gcluster.pl -dir /home/xiangyang/Gcluster_v1.01-master/test_data/gbk -gene /home/xiangyang/Gcluster_v1.01-master/test_data/interested_gene_name.txt -tree /home/xiangyang/Gcluster_v1.01-master/test_data/full_NJ_rRNA_tree.nwk -m 3

Large test data is available at website (http://www.microbialgenomic.com/160_genomes_testdata.tar.gz). It contains 160 annotated genomes used as input files to create Fig. 1 in manuscript.

=ARGUMENTS

REQUIRED ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-dir, --genbank_file_directory

A Directory containing annotated genomes as Genbank format file. To avoid a mistake, genome names cannot use special character, such as space, equal. For large number of genomes, users are recommended to download using Aspera, a high-speed file transfer tool (https://downloads.asperasoft.com/). If enough rgb colors provided or not to color homologous genes, there are no limitation to the number of genomes and the number of genes flanking gene of interest for Gcluster.

-gene, --interested_gene_file

A list of the interested gene, in which each line contains a locus tag of the interested gene for individual genome. Users are recommended to use "interested_gene_generation.pl" in Gcluster package for generation this file. In this situation, user needs to provide a blast database file in FASTA format, which contains at least one protein sequence homologous to the gene of interest. To map genome contexts to a given phylogeny or to order the genome contexts using a "strain_reorder_file", only one gene of interest is allowed to provide for each genome.

For example:

AX2_RS10405 #arsenite_oxidase_large_subunit;Achromobacter_xylosoxidans_NBRC_15126_ATCC_27061

KUC_RS10495 #arsenite_oxidase_large_subunit;Halomonas_boliviensis_LC1

KYC_RS14580 #arsenite_oxidase_large_subunit;Achromobacter_arsenitoxydans_SY8

...

OPTIONAL ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-o, --Gcluster_output_directory

An output directory holding all the generated files by Gcluster.pl. if this option is not set, Gcluster will create a directory named "Gcluster_workplace" in the bin directory from where Gcluster.pl was invoked.

-tree, --phylogenetic_file

A Newick format tree file is used by Gcluster to automatically accociate the genomes with their phylogeny. Meanwhile, Gcluster will output a file named "temp_strain_reorder_file", which contains the order information of genomes in tree from up to down. It should be noted that all nodes name in provided tree must completely match with the genbank files name of all genomes.

Gcluster provides a perlscript in "Gcluster/script" directory for batch extraction of 16S rRNA gene sequences, which can be used to build a 16S rRNA tree using software like MEGA (https://www.megasoftware.net/).

-topology, --show_tree_topology

Display the tree topology, which is obtained from the tree file (Default: T).

-branch, --show_tree_branch

Draw tree using tree branch length, which is obtained from the tree file (Default: F).

--x_step

Draw tree using xstep instead of tree branch length (Default: 10).

-bootstrap, --show_tree_bootstrap

Display the tree bootstrap value, which is obtained from the tree file (Default: F).

-srf, --strain_reorder_file

A two-column tab-delimited text file is used to sort genomes from up to down accoding to users requirement. Each row must consist of a strain name followed by the numerical order that is used for sorting genomes. It should be noted that all strains name must completely match with the genbank files name of all genomes. Gcluster needs a "strain_reorder_file" or a "phylogenetic_file", but not both at the same time.

For example:

Achromobacter_xylosoxidans_NCTC10807 1

Achromobacter_xylosoxidans_NBRC_15126_ATCC_27061 2

Achromobacter_xylosoxidans_NCTC10808 9

Achromobacter_sp._2789STDY5608623 5

Achromobacter_sp._2789STDY5608621 4

Alcaligenes_faecalis_subsp._faecalis_NCIB_8687 10

Achromobacter_xylosoxidans_DPB_1 7

Achromobacter_xylosoxidans_B_1 8

Achromobacter_piechaudii_HLE 3

Achromobacter_marplatensis_B2 6

... ...

-n, --flanking_gene_number

Number of genes flanking gene of interest are set to show. If enough rgb colors provided or not to color homologous genes, there are no limitation to the number of genomes and the number of genes flanking gene of interest (Default: 10).

-color_f, --gene_color_filled

Color was used to fill homologous gene clusters or gene families of interest (Default: T), if choose F, all of the genes were filled with the color customized by "gene_no_color_filled" parameter).

-pso, --percent_strain_homologouscluster_color

Only color certain homologous gene clusters, in which the holding number of different genomes exceeds the threshold number (Default: 0). This is measured using (X/Y)*100. In this formula, X denotes the number of different genomes in a set of homologous gene cluster, and Y denotes the total number of genomes. This parameter is useful when no enough rgb colors are provided in colors_configure_file under color_configure direcoty. Users could to reduce the number of colors used by setting a high value.

-c_color_b, --cds_color_border

To color the border of the CDS genes (Default: black), users can choose from blue, black, red, white, gray, dgray.

-p_color_b, --pseudo_color_border

To color the border of the Pseudo genes (Default: dgray), users can choose from blue, black, red, white, gray, dgray.

-r_color_b, --RNA_color_border

To color the border of the RNA (tRNA, rRNA) genes (Default: red), users can choose from blue, black, red, white, gray, dgray.

-no_color_f, --gene_no_color_filled

To fill uniqe genes (including RNA genes), pseudo genes, and homologous gene clusters not meeting the criteria set by "percent_strain_homologouscluster_color" parameter with a single color (Default: white), users can choose from blue, black, red, white, gray, dgray.

-dw, --line_drawing_width

Set the line drawing width (Default: 1).

-l, --arrow_relative_Length

Set the relative length of the gene arrow (Default: 4).

-w, --arrow_relative_Height

Set the relative Height of the gene arrow (Default: 6).

-scale, --figure_Scale_up_multiple

Adjust gene length through zooming (Default: 0.5).

-s_Y, --strain_name_shift_Y

Set the offset along Y-axis for strain names (Default: 0).

-g_Y, --gene_label_shift_Y

Set the offset along Y-axis for gene labels (Default: 2).

-dis, --distance_between_two_genomes

Set the distance between two genome contexts in Y-axis (Default: 70).

-up, --up_shift

Set the top margin of image in pixels (Default: 10).

-down, --down_shift

Set the bottom margin of image in pixels (Default: 20).

-left, --left_shift

Set the left margin of image in pixels (Default: 10).

-right, --right_shift

Set the right margin of image in pixels (Default: 20).

-label, --show_label

Display the gene label (gene Locus Tag or genename) (Default: T).

-ul, --unification_label

Unify gene label for homologous gene cluster (Default: T). Among a set of homologous gene cluster, if a gene is annotated with a name X, all other genes will be labeled with X.

-family, --font_family

Set font family for the genome name and the gene label, e.g. Times New Roman, Arial, Verdana and so on (Default: Times New Roman). Users are suggested to choose font family listed in metrcis module, or causing a miscalculation of string width for genome name in SVG-format map.

-style, --font_style

Set font style for the genome name and the gene label, e.g. Normal, Bold, Italic (Default: Normal). It should be noted that the font style "Bold" does not work when using to cearte a PNG format figure in MacOS.

-size, --label_font_size

Set font size for gene label (Default: 6).

-color, --label_font_color

Set font color for gene label (Default: dgray).

-i_color, --interested_gene_label_font_color

Customize gene label color for gene of interest (Default: red), users can choose from blue, black, red, white, gray, dgray.

-r, --rotate_gene_label (Default: 30)

Rotate the angle of the gene label, e.g. 30, 45, 135 and so on.

--strain_name_font_size

Set font size for genome name (Default: 12).

--Strain_name_font_color

set font color for genome name (Default: black). Users can choose from blue, black, red, white, gray, dgray.

-Bst, --homologous_gene_cutoff

Array to set blast parse cutoff: E-value, Identify, Coverage, Match_length (Default: E-value=1-e5, Identify=0, Coverage=50%, Match_length=0).

-m, --multiple_threads

Numbers of thread to use (Default: 1).

-SVG, --SVG_image

Create SVG format figure (Default: T).

-PNG, --PNG_image

Create PNG format figure (Default: T).

-sub_TFT, --start_at_sub_TFT (Default: F)

Jump to generate a collection of sub-TFT tables and perform homologous gene analysis (Default: F). Skips sequences extraction and TFT file generation.

-map, --start_at_map

Jump to map generation (Default: F). Generation of a collection of sub-TFT tables and homologous gene clusters has already been done. This parameter is very useful to customize the map quickly. It should be noted that there's no sense to reset "flanking_gene_number" parameter if this parameter set to "T".

Importantly, at this step, users can revise the gene label by directly edition of the locus_tag in sub_TFT file or all_orthomcl.out. In sub_TFT files and all_orthomcl.out file, there are two forms of gene locus tag, (1) "Locus_Tag", in this case, no genename is defined for a gene; (2) "GeneName;Locus_Tag", in this case, genename is given for a gene. For the first form, user can revise gene label by addition of a genename followed by a semicolon in the front of the Locus_Tag. For the second form, user can revise gene label by modification of the genename.

-h, --help

Show this message.

=AUTHOR

Dr. Xiangyang Li (E-mail: lixiangyang@fudan.edu.cn, lixiangyang1984@gmail.com), Fudan university; Kaili University; Bacterial Genome Data mining & Bioinformatic Analysis (http://www.microbialgenomic.com/).

=COPYRIGHT

> interested_gene_generation.pl

# interested_gene_generation.pl

=NAME

interested_gene_generation.pl

=DESCRIPTION

Run this command to enble users to obtain a list of the interested gene (a two-column tab-delimited text file) by a local blastP analysis using multiple threads.

=USAGE

interested_gene_generation.pl -dir genbank_file_directory -db database [options]

FOR EXAMPLE:

# perl /home/xiangyang/Gcluster_v1.01-master/interested_gene_generation.pl -dir /home/xiangyang/Gcluster_v1.01-master/test_data/gbk -db /home/xiangyang/Gcluster_v1.01-master/test_data/aioB.fasta -m 3

=ARGUMENTS

=======================

REQUIRED ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-dir, --genbank_file_directory

A directory containing annotated genomes as Genbank format file. To avoid a mistake, genome names cannot use special character,

such as space, equal. For large number of genomes, users are recommended to download using Aspera, a high-speed file transfer

tool (https://downloads.asperasoft.com/).

-db, --database

A protein database in FASTA format, which contains at least one protein sequence homologous to the gene of interest.

OPTIONAL ARGUMENTS:

~~~~~~~~~~~~~~~~~~~

-o, --output_directory

An output directory holding all the generated files by interested_gene_generation.pl. if this option is not set, interested_gene_generation.pl will create a directory named "interested_gene_workplace" in the bin directory from where interested_gene_generation.pl was invoked.

-m, --multiple_threads

set thread number (Default: 1)

-b, --start_at_blast

Jump to a local blastp analysis, and Skips sequencing extraction (Default: T).

-e, --e_value

set E-value cutoff in Blast analysi (default: 1e-5)

-i, --identify

set percent identity cutoff in Blast analysis (default: 30)

-c, --coverage

set percent coverage (Query and Subject) cutoff in Blast analysis (default: 50)

-l, --match_length

set alignment length cutoff in Blast analysis (default: 30)

-h, --help

Show this message.

=AUTHOR

Dr. Xiangyang Li (E-mail: lixiangyang@fudan.edu.cn), Fudan university; Kaili University; Bacterial Genome Data mining & Bioinformatic Analysis (www.microbialgenomic.com/).

=COPYRIGHT

テストラン

git clone https://github.com/Xiangyang1984/Gcluster.git
cd Gcluster/
perl test.pl #*1

実行方法

準備

ランには比較するゲノムのGenBnakファイルとリストファイルが必要。

１、GenBnakファイル。

テストデータのGenBnak（ファイル名にスペースや特殊文字がない事）

f:id:kazumaxneo:20200520230937p:plain

２、関心のある遺伝子のリストファイル。

関心のある遺伝子の locus tagを記載したテキストファイルになる。9 genbankの比較なら、それぞれのゲノムから1つずつ9行のリストを作ることになる。

f:id:kazumaxneo:20200520231626p:plain

手動で作成してもよいが、数が多いと手間なので、blastpのヒットから自動作成するスクリプトが準備されている。ランするにはgenbankのディレクトリと興味ある遺伝子のタンパク質配列（クエリ）を指定する。

interested_gene_generation.pl -dir test_data/gbk_dir -db test_data/aioB.fasta -m 4 -o out_dir -r 1e-5 -i 30 -c 50 -I 30

-dir A directory containing annotated genomes as Genbank format file.
-db A protein database in FASTA format, which contains at least one protein sequence homolo
-o output_directory
-m set thread number (Default: 1)
-e set E-value cutoff in Blast analysi (default: 1e-5)
-i set percent identity cutoff in Blast analysis (default: 30)
-c set percent coverage (Query and Subject) cutoff in Blast analysis (default: 50)
-l set alignment length cutoff in Blast analysis (default: 30)

ランが終わると出力ディレクトリに上の画像のような内容のテキストファイルができる。これをGclusterのランに使用する。

f:id:kazumaxneo:20200520232903p:plain

interested_gene_name.txtにblastpのトップヒット遺伝子のlocus tagがリスト化される。blastpにセカンドヒットなども存在する場合、２列目のコメントアウトした部分以降にまとめられる。

ラン

genbankのディレクトリと興味がある遺伝子のリストファイルを指定する。

Gcluster.pl -dir ./test_data/gbk -gene interested_gene_name.txt -o out_dir -m 12

#遺伝子名を大きくする
Gcluster.pl -dir ./test_data/gbk -gene interested_gene_name.txt -o out_dir -m 12 -size 10

#よく似たゲノム同士なら条件を厳しくする
Gcluster.pl -dir ./test_data/gbk -gene interested_gene_name.txt -o out_dir -m 12 -size 10 -Bst 1e-100

-dir A Directory containing annotated genomes as Genbank format file.
-gene A list of the interested gene, in which each line contains a locus tag of the interested gene for individual genome.
-m Numbers of thread to use (Default: 1).
-size Set font size for gene label (Default: 6).
-Bst Array to set blast parse cutoff: E-value, Identify, Coverage, Match_length (Default: E-value=1-e5, Identify=0, Coverage=50%, Match_length=0).

出力

f:id:kazumaxneo:20200520234055p:plain

figure.png。指定した関心のある遺伝子は、図の真ん中の赤い注釈になっている遺伝子（*４）。

f:id:kazumaxneo:20200520234111p:plain

順番を変えるには、次の例のようにnewlickのファイルを与えるか、順番を支持したテキストファイルを指定する（Github参照）。

Newickフォーマットのツリーファイルも与える事で、phylogenyに従って順番をソートできる*2。

Gcluster.pl -dir ./test_data/gbk \
 -gene out_dir/interested_gene_name.txt \
 -tree ./test_data/16S_rRNA_tree.nwk \
 -o out_dir -m 12

figure.svg

f:id:kazumaxneo:20200520234944p:plain

出力図はGcluster.plのオプションでカスタマイズできます。例えば"-n <num>"で隣接する遺伝子の表示数という感じです。コマンドのhelpを見て下さい。*３

公式（大規模な例あり）

http://www.microbialgenomic.com/Gcluster_tool.html

引用

Gcluster: a simple-to-use tool for visualizing and comparing genome contexts for numerous genomes
Xiangyang Li, Fang Chen, Yunpeng Chen
Bioinformatics, Published: 28 March 2020

*1 Githubにも注意書きがあるが、/usr/bin/にコマンドがないとエラーが出たのでシンボリックリンクを張って対応した。

ln -s /root/anaconda3/bin/blastp /usr/bin/

ln -s /root/anaconda3/bin/makeblastdb /usr/bin/

ln -s /root/anaconda3/bin/mcl /usr/bin/

genbankから自動で16Sの配列を抽出するスクリプトが用意されている。抽出後、MEGAなどを使って16Sのツリーを構築し、それを指定する。ファイル名が同じでないと受け付けないので注意。

*３

たくさんの遺伝子クラスターが図示されてしまってわかりにくい場合は、interested_gene_generation.plの結果得られるinterested_gene_name.txtを開いて、適切なアノテーション行１つだけ残し、あとは消して下さい。それを使ってGcluster.plをランすれば、１ゲノムから１クラスターのみ視覚化されるようになります。

*４

図中の注釈遺伝子名はリストファイルのデータに則って同じ名前で統一される。注釈が同じだからとってそのサンプルに該当遺伝子があることを意味するわけではない。注意する。

関連

2020-05-19

効率的なVCFの圧縮器と関連ツールを提供する genozip

2020 Bioinformatics VCF Compression / decompression

　大規模なゲノムプロジェクトはますます一般的になりつつあり、その結果、数千もの個々のゲノムデータセットからなるVCF（Variant Call Format; (Danecek et al., 2011))ファイルが作成される。圧縮された形式であっても、このようなファイルは非常に大きく（通常数GB）、長期的なデータ保存とファイル転送のコストが急速に上昇し、より効率的な圧縮アルゴリズムの開発に拍車がかかっている。
　最近、一握りの新しい圧縮アルゴリズムが出現してe.g. (Durbin, 2014; Deorowicz and Danek, 2019; Kelleher et al., 2019)、これらはVCFファイル内のgenotypesを圧縮することによって機能するが、genotypesは、VCFファイル内で表現される１つのデータ型に過ぎず、多くの場合、総データ内容に対するマイナーなコントリビュータに過ぎない。例えば、実世界の例として使用された(Durbin, 2014)のファイル(我々のベンチマークのFile1）では、genotypesは非圧縮のVCFファイルデータの7.1%しか表していない。このように、genotypesを圧縮するだけでは、VCFファイルの圧縮戦略としては十分ではないことは明らかである。
ここで、本著者らはロスレス圧縮ツールであるgenozipを紹介する。genozipは、任意のploidy、Phasing structure、またはvariantタイプのVCFファイルを扱うことができる。genozipの主な目的は、ゲノムデータを効率的にパッケージングすることである。
また、パイプラインの解析機能も備えている。

genozip パッケージはすべての一般的なオペレーティングシステム上で動作し、genozip, genounzip, genocat, genols の 4 つのコマンドラインツールが含まれている。 genounzip は .vcf.genozip ファイルを .vcf または .vcf.gz 形式に解凍し、genols は .genozip ファイルの内容に関する統計情報を提供する。
　分析パイプラインへのシームレスな統合をサポートするために、.vcf.genozip ファイル内のデータにアクセスするための genocat コマンドが提供されており、データへのランダムアクセスを可能にする --regions や -samples などのオプションが含まれている。インデックス作成は圧縮の一部として行われ、個別のインデックス作成ステップやインデックスファイルはない。さらに、このツールセットは標準的な入出力ストリームを使用できるように設計されている。
　データを--password(256ビットAESを使用)で暗号化することで、genozipは厳しいプライバシー要件を満たすゲノムファイルの効率的かつ安全な配布を可能にする。データの完全性は、--md5でMD5署名を生成することでさらに保証される。さらに、--outputオプションは、同一のサンプルを含むVCFファイルを連結し、genounzip -splitを使用してオリジナルのコンポーネントを再生成することができる。

　ユーザーが必要に応じて圧縮を変更できるように、いくつかのオプションを追加した。まず、--optimize オプションでは、INFO と FORMAT のサブフィールドのデータを修正し、浮動小数点数を有効数字 2 桁に丸めたり、Phred 値に上限を設定したりすることで圧縮を改善する。この場合、VCFデータが変更されるため、圧縮はロスレスではないが、下流の解析結果には影響しないことに注意する。第二に、--gtshark オプションは GTShark (Deorowicz and Danek, 2019) を使用しており、genozip または GTShark を単独で使用する場合と比較して圧縮率を向上させる (補足資料を参照)。最後に、--vblockおよび--sblockオプションを使用すると、領域やサンプルのサブセット化に関連する圧縮と速度の間のトレードオフを制御することができる。
　.bcf ファイルを .genozip 形式に圧縮するには bcftools が、.xz ファイルを圧縮するには XZ Utils (Collin, 2011) が、.vcf.gz に解凍するには bgzip が、--gtshark を使うには GTShark が、URL から圧縮するには cURL が必要である (Hostetter et al., 1997)。

インストール

macos10.14でテストした。

Github

git clone https://github.com/divonlan/genozip.git
cd genozip/
make -j

> ./genozip

$ ./genozip

Compress VCF (Variant Call Format) files

Usage: genozip [options]... [files or urls]...

One or more file names or URLs may be given, or if omitted, standard input is used instead

Supported input file types: .vcf .vcf.gz .vcf.bgz .vcf.bz2 .vcf.xz .bcf .bcf.gz .bcf.bgz

Note: for .bcf files, bcftools needs to be installed, and for .xz files, xz needs to be installed

Examples: genozip file1.vcf file2.vcf -o concat.vcf.genozip

genozip --optimize -password 12345 ftp://ftp.ncbi.nlm.nih.gov/file2.vcf.gz

アセンブリグラフからメタゲノムのビニングを行う GraphBin

2020 Bioinformatics Binning (metagenomics) metagenome assembly graph

　メタゲノミクスの分野では、微生物群集の構造、多様性、生態についての貴重な知見が得られている。メタゲノム解析の重要なステップの1つは、長いコンティグにリードをアセンブリし、メタゲノムサンプル中に存在する異なる種に属するコンティグのグループにビン化することである。コンティグのビン化はメタゲノム解析において重要な役割を果たしており、ほとんどのビニングアルゴリズムは、オリゴヌクレオチド/k-mer組成やコンティグカバレッジなどのゲノムの特徴を利用してコンティグをビニングする。メタゲノムのコンティグはアセンブリプロセスから得られるため、コンティグ間の接続性に関する貴重な情報が含まれているアセンブリグラフはビニングに利用できる。
　本論文では、既存のツールのビニング結果を改良するために、アセンブリグラフを利用し、ラベル伝播アルゴリズムを適用した新しいビニング手法GraphBinを提案する。本論文では、de Bruijn graphとオーバーラップ・レイアウト・コンセンサス法の両方から構成されたアセンブリグラフを利用できることを示す。さらに、誤ってビニングされたコンティグの識別や、既存のビニングツールでは捨てられていたコンティグのビニングについても、実験的に改善された結果が得られた。アセンブリグラフの情報がメタゲノムコンティグのビニングツールに利用されたのは初めてのことである。

インストール

Github

git clone https://github.com/Vini2/GraphBin.git
cd GraphBin/
conda env create -f environment.yml
conda activate graphbin

> python graphbin.py -h

$ python graphbin.py -h

usage: graphbin.py [-h] --assembler ASSEMBLER --graph GRAPH [--paths PATHS] --binned BINNED --output OUTPUT [--prefix PREFIX] [--max_iteration MAX_ITERATION]

[--diff_threshold DIFF_THRESHOLD]

GraphBin Help. GraphBin is a metagenomic contig binning tool that makes use of the contig connectivity information from the assembly graph to bin contigs. It utilizes the

binning result of an existing binning tool and a label propagation algorithm to correct mis-binned contigs and predict the labels of contigs which are discarded due to short

length.

optional arguments:

-h, --help show this help message and exit

--assembler ASSEMBLER

name of the assembler used (SPAdes, SGA or MEGAHIT)

--graph GRAPH path to the assembly graph file

--paths PATHS path to the contigs.paths file

--binned BINNED path to the .csv file with the initial binning output from an existing tool

--output OUTPUT path to the output folder

--prefix PREFIX prefix for the output file

--max_iteration MAX_ITERATION

maximum number of iterations for label propagation algorithm. [default: 100]

--diff_threshold DIFF_THRESHOLD

difference threshold for label propagation algorithm. [default: 0.1]

テストラン

#megahitの場合
python graphbin.py --assembler megahit --graph test_data/ES_MEGAHIT/final.gfa --binned test_data/ES_MEGAHIT/maxbin_contig_bins.csv --output output_dir/

> ls -l output_dir/

$ ls -l output_dir/

total 144

-rw-r--r-- 1 kazuma staff 2280 5 19 00:12 graphbin.log

-rw-r--r-- 1 kazuma staff 2652 5 19 00:12 graphbin_output.csv

-rw-r--r-- 1 kazuma staff 61788 5 19 00:12 graphbin_unbinned.csv

実行方法

SGA、megahit、metaspadesのアセンブリグラフに対応している。

1, 外部ビニングツールを使って binningを実行

２、binning結果をまとめたCSVファイルを出力

 python GraphBin/support/prepResult.py --binned <path>/<to>/binning_result_dir --assembler SPAdes/SGA/MEGAHIT --output output_dir

３、graphbinの実行

metaspadesのアセンブリグラフを使う。

python graphbin.py --assembler spades \
 --graph graph.gfa \
 --paths pathsfile.paths\
 --binned binning_result.csv \
 --output output_dir

megahitのアセンブリグラフを使う。

python graphbin.py --assembler megahit \
 --graph graph.gfa \
 --binned binning_result.csv \
 --output output_dir

sgaのアセンブリグラフを使う。

python graphbin.py --assembler sga \
 --graph graph_file.asqg \
 --binned binning_result.csv \
 --output output_dir

引用

GraphBin: refined binning of metagenomic contigs using assembly graphs
Vijini Mallawaarachchi, Anuradha Wickramarachchi, Yu Lin
Bioinformatics, Published: 13 March 2020

2020-05-17

Iontorrentのエラーコレクションを行う IonHammer

2019 Journal of Computational Biology error correction iontorrent

IonTorrentのリードエラー修正アルゴリズムを用いて、幅広いデータセットのシーケンシングエラーを大幅に削減することができる。IonHammerはC++で実装されており、SPAdesゲノムアセンブラパッケージの一部として自由に利用できる。

インストール

mac os 10.14でコンパイル済みのプログラムをダウンロードしてテストした。

Github

バイナリをダウンロードするかconda/brewでインストールする。

#3.14.0 linux
wget http://cab.spbu.ru/files/release3.14.0/SPAdes-3.14.0-Linux.tar.gz
tar -xzf SPAdes-3.14.0-Linux.tar.gz
cd SPAdes-3.14.0-Linux/bin/

#3.14.0 darwin
curl http://cab.spbu.ru/files/release3.14.0/SPAdes-3.14.0-Darwin.tar.gz -o SPAdes-3.14.0-Darwin.tar.gz
tar -zxf SPAdes-3.14.0-Darwin.tar.gz
cd SPAdes-3.14.0-Darwin/bin/


#bioconda(link)
conda install -c bioconda -y sppades

#homebrew 
brew install spades

実行方法

--iontorrentをつけてspadesを実行する（iontorrentのシーケンシングデータ以外に使わない事）。

spades.py --iontorrent -s input.fq.gz -o output_dir -t 12 -k auto

--iontorrent This flag is required when assembling IonTorrent data. Allows BAM files as input. Carefully read section 3.3 before using this option.
-t number of threads-t/--threads <int> number of threads [default: 16]
-k comma-separated list of k-mer sizes (must be odd and-k <int,int,...> comma-separated list of k-mer sizes (must be odd and less than 128) [default: 'auto']
-s file with unpaired reads

エラー修正のみ行う。

spades.py --iontorrent -s input.fq.gz -o output_dir -t 12 --only-error-correction

--iontorrent This flag is required when assembling IonTorrent data. Allows BAM
--only-error-correction runs only read error correction (without assembling)

引用

IonHammer: Homopolymer-Space Hamming Clustering for IonTorrent Read Error Correction
Vasily Ershov, Artem Tarasov, Alla Lapidus, Anton Korobeynikov

Journal of Computational Biology. Published Online: 6 Feb 2019

SPAdesアセンブラ自体は以前紹介しました。

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

メタゲノムデータからウィルスゲノムをアセンブルする metaviralSPAdes

k-merカウンタ DSK

ゲノムのリアレンジメントを検出して視覚化する smashpp

ゲノムの中の関心がある遺伝子を視覚化する Gcluster

効率的なVCFの圧縮器と関連ツールを提供する genozip

アセンブリグラフからメタゲノムのビニングを行う GraphBin

Iontorrentのエラーコレクションを行う IonHammer