2019-06-24

ショートリードのマッピングを行う Whisper

　リファレンスゲノムへのリードのマッピングは、シークエンシングデータ解析パイプラインの最初のステップである。シーケンシングコストが削減していることから、合理的な時間内に増大する量の生成データを処理することができるアルゴリズムに対する必要性が高まっている。
　著者らはリードをソートし、それからリファレンスゲノムとそのreverse complementのsuffix arraysに対してそれらをマッピングするという考えに基づいて、正確で高性能なマッピングツールであるWhisperを紹介する。タスクとデータの並列処理を採用し、一時データをディスクに格納すると、妥当なメモリ要件で優れた時間効率が得られる。 Whisperは、大規模なNGSリードコレクション、特にIlluminaの一般的なWGSカバレッジのデータ処理で優れた性能を示す。実データによる実験では、よく知られているBWA-MEMおよびBowtie 2ツールで必要とされる時間の約15％で、同等の精度でこのソリューションが機能することが示されている。

本体　Github

リリースよりダウンロードする

https://github.com/refresh-bio/Whisper/releases

#linux v1.1
wget https://github.com/refresh-bio/Whisper/releases/download/v1.1/whisper
chmod +x whisper
sudo mv whisper /usr/local/bin/

#linux 
wget https://github.com/refresh-bio/Whisper/releases/download/v1.1/whisper-index
chmod +x whisper-index
sudo mv whisper-index /usr/local/bin/

> whisper

$ whisper

Whisper v. 1.1 (2018-07-10)

Usage:

whisper [options] <index_name> @<files>

whisper [options] <index_name> file_se

whisper [options] <index_name> file_pe1 file_pe2

Parameters:

index_name - name of the index (as created by asm_pp)

files - name of the file containing list of FASTQ files with seq. reads

file_se - FASTQ file (single-end)

file_pe[1|2] - FASTQ files (paired-end)

Options:

-b <value> - no. of temporary files (minimum: 100, default: 384)

-d[fr/ff/rf] - mapping orientation (default: -dfr (forward - reverse)

-dist_paired <value> - max. distance for paired read (default: 1000)

-e <value> - max. no of errors (default: 4, max: 5% of read length)

-e-paired <value> - max. fraction of errors in paired read (default: 0.06)

-enable-boundary-clipping <value> - enable clipping at boundaries when a lot of mismatches appears (default: 0)

-filter <value> - store only mappings for given chromosome (default: )

-gap-open <value> - score for gap open (default: -6)

-gap-extend <value> - score for gap extend (default: -1)

-gzipped-SAM-level <value> - gzip compression level of SAM/BAM, 0 - no compression (default: 0)

-hit-merging-threshold <value> - minimal distance between different mappings (default: 12)

-high-confidence-sigmas <value> - (default: 4)

-hit-merging-wrt-first <value> - calculate distance in marged group w.r.t. first (default: 1)

-m[f/s/a] - mode: first stratum/second stratum/all strata (default: first stratum)

-mask-lqb <value> - mask bases of quality lower than value (default: 0)

-out <name> - name of the output file (default: whisper)

-penalty-saturation <value> - no. of sigmas for max. penalty in matching pairs (default: 7)

-rg <read_group> - complete read group header line, (example: '@RG\tID:foo\tSM:bar')

'\t' character will be converted to a TAB in the output SAM/BAM,

while the read group ID will be attached to every read

-r[s|p] - single or paired-end reads <value> (default: single)

-score-discretization-threshold (default: 0.5)

-score-match <value> - score for matching symbol (default: 1)

-score-clipping <value> score for clipping (default: -10)

-score-mismatch <value> - score for mismatching symbol (default: -5)

-sens <value> - turn on/off sensitive mode (default: 1)

-sens-factor <value> - sensitivity factor (default: 3)

-stdout - use stdout to store the output

-store-BAM - turn on saving in BAM

-t <value> - no. of threads (0-adjust to hardware) (default: 0)

-temp <name> - prefix for temporary files (default: ./whisper_temp_)

-x - load complete suffix arrays in main memory (default: 0)

Examples:

whisper human @files

whisper -temp temp/ human reads1.fq reads2.fq

whisper -out result.sam -temp temp/ -t 12 human reads1.fq reads2.fq

kazu@kazu:~$

> whisper-index

$ whisper-index

Whisper index construction v. 1.1 (2018-07-10)

Usage:

whisper-index <index_name> <ref_seq_file_name> <dest_dir> <temp_dir>

whisper-index <index_name> <@ref_seq_files_name> <dest_dir> <temp_dir>

実行方法

たとえば、圧縮されていないFASTQ形式のサンプル読み取りのサイズが100GBで、処理が16個のCPUスレッドによって行われる場合、Whisperは600のファイルを開く。そのため、同時に開けるファイル数を上げる必要がある。linuxなら

ulimit -n 600

fastqサイズや使用スレッド数が増えると増加する。必要な数は

num_files = num_bins (384 by default) + num_threads + 2 * total_size_of_FASTQ_in_GB

で算出できる。

１、indexing

出力と作業ディレクトリを指定する。ここでは作業ディレクトリは/tmpか/var/tmpあたりにする。

mdkir output_dir
whisper-index genome_index ref.fa output_dir /tmp

２、mapping

ペアエンドfastqとindexを指定する。

whisper -out output -t 8 output_dir/genome_index \
 pair_1.fq.gz pair_2.fq.gz

output.samが出力される。

引用

Whisper: read sorting allows robust mapping of DNA sequencing data
Sebastian Deorowicz Agnieszka Debudaj-Grabysz Adam Gudyś Szymon Grabowski
Bioinformatics, Volume 35, Issue 12, June 2019, Pages 2043–2050

2019-06-23

メタバーコディングのデータベース配列キュレーションなどを行うツールキット MetaCurator

2019 Preprint phylogenetic marker genes metagenome DNA barcoding PCR primer multiple sequence alignment (MSA) database

　配列ベースの生物学的コミュニティの特徴付けの過程において、配列の教師ありのtaxonomic classification は重要な目標である。多数の配列分類ソフトウェアプログラムは、配列類似性を測り、そして配列類似性と分類学的所属との間の関係をモデル化することによってこれを達成する。（一部略）

　配列組成に基づく分類の間、関心のあるマーカーに隣接する無関係の配列領域は、SINTAX、RDPナイーブベイズ分類器、VSEARCHなどの多くのツールで使用されるk-mer分布を偏らせる（Wang et al、2007; Edgar 2016; Rognes et al、 2017）。上位N個の配列の距離分布を分析することによって分類学的境界を推定する分類器（Huson et al、2011; Bengtsson-Palme et al、2015; Gao et al、2017）では、配列重複の保持は距離分布の特徴付けに偏りがある。ただし、2つの配列が同じ分類群からのものであることを保証せずに分類群的に単純な配列重複の除去を行うと、 2つの分類群間の変異に関してデータベースから重要な情報が削除されることになる。
　これらの問題を解決するために、MetaCuratorツールキットを開発した。メインツールであるIterRazorは、例えば全ゲノムアセンブリを含む任意の入手可能な供給源から標的マーカーリファレンス配列を同定および抽出する。与えられたマーカーのリファレンスを抽出した後、DerepByTaxonomyは分類学を意識したアプローチを使ってリファレンス配列を逆複製するために使われる。追加のPythonツールおよびUnix シェルスクリプトは、階層的キュレーションのための分類系統のフォーマット化およびNCBI ヌクレオチドデータベース内に一般的に見られる分類系統アーチファクトの除去を容易にする（NCBI Resource Coordinators、2018）。

インストール

ubuntu16.04にて、conda createで2.7.14の仮想環境を作ってテストした。

依存

MetaCurator is a command-line only toolkit which runs on typical Unix/Linux environments. It requires the following software to be installed and globally accessible.

Python 2.7.5 or greater
Perl 5.16.0 or compatible
VSEARCH 2.8.1 or compatible
HMMER3 3.1b2 or compatible
MAFFT 7.270 or compatible

#仮想環境を作って依存を導入 
conda create -n metacurator -c bioconda python=2.7.14 vsearch hmmer mafft

本体　Github

wget https://github.com/RTRichar/MetaCurator/archive/MetaCurator_v1.0beta.1.tar.gz
tar xzf MetaCurator_v1.0beta.1.tar.gz
cd MetaCurator-MetaCurator_v1.0beta.1/
export PATH=$PATH:$PWD

> IterRazor.py -h

# IterRazor.py -h

usage: IterRazor is a tool which uses hidden Markov models to identify and extract precise DNA barcode sequences of interest from the full diversity of reference sequence data available through sources such as NCBI (e.g. whole genome sequences, whole chloroplast sequences or Sanger seuqnces representative of variable portions of the locus of interest)

[-h] -r REFERENCESFILE -i INPUTFILE -o OUTPUTFILE [-t THREADS]

[--SaveTemp SAVETEMP] [-e HMMEVALUE] [-is ITERATIONSERIES]

[-cs COVERAGESERIES]

optional arguments:

-h, --help show this help message and exit

-r REFERENCESFILE, --ReferencesFile REFERENCESFILE

Fasta file of sequences trimmed to exact region of

interest (header format should conform to --InputFile

header format). This should be 5 to 10 sequences

spanning a diversity of the phylogenetic tree of

interest, so as to build a relatively broadly

inclusive HMM from the first iteration (REQUIRED)

-i INPUTFILE, --InputFile INPUTFILE

Input file of reference sequences to be extracted

(e.g. whole chloroplast genome sequences from NCBI).

Sequence headers should include a unique sequence

identifier, NCBI accessions are recomended, preceded

only by '>' (REQUIRED)

-o OUTPUTFILE, --OutPutFile OUTPUTFILE

Output file for extracted sequences (REQUIRED)

-t THREADS, --threads THREADS

Number of processors to use

--SaveTemp SAVETEMP Option for saving alignments and HMM files produced

during extraction

-e HMMEVALUE, --HmmEvalue HMMEVALUE

Evalue threshold for hmm search

-is ITERATIONSERIES, --IterationSeries ITERATIONSERIES

The IterationSeries and CoverageSeries arguments

dictate the number of iterative searches to run and

the minimum HMM coverege required for each round of

extracting. This argument consists of a comma-

separated list specifying the number of searches to

run during each round of extraction. The list can vary

in length, changing the total number of extraction

rounds conducted. However, if a search iteration fails

to yield any new reference sequence amplicons,

IterRazor will break out of that round of searching

and move to the next. By default, the software

conducts four rounds of extraction with 20, 10, 5 and

5 search iterations per round (-is 20,10,5,5).

-cs COVERAGESERIES, --CoverageSeries COVERAGESERIES

This argument consists of a comma-separated list

specifying the minimum HMM coverage required per round

for an extracted amplicon reference sequence to be

added to the reference sequence fasta. The list can

vary in length but must be equal in length to the

IterationSeries list. By default, the software

conducts four rounds of extraction with coverage

limits of 1.0, 0.95, 0.9 and 0.65 for each respective

search round (-cs 1.0,0.95,0.9,0.65).

> MetaCurator.py -h

# MetaCurator.py -h

usage: MetaCurator.py [-h] -r REFERENCESFILE -i INPUTFILE -it INTAX -ot OUTTAX

-of OUTFASTA [-t THREADS] [--SaveTemp SAVETEMP]

[-e HMMEVALUE] [-tf TAXONOMIZRFORMAT] [-ct CLEANTAX]

[-is ITERATIONSERIES] [-cs COVERAGESERIES] [-mh MAXHITS]

[-di ITERATIONS]

optional arguments:

-h, --help show this help message and exit

required arguments:

-r REFERENCESFILE, --ReferencesFile REFERENCESFILE

Fasta file of sequences trimmed to exact region of

interest (header format should conform to inputfile

header format). This should be approximately 10

sequences spanning a diversity of the phylogenetic

tree of interest, so as to build an appropriately

inclusive HMM from the first iteration.

-i INPUTFILE, --InputFile INPUTFILE

Input file of reference sequences to be extracted

(sequence headers should include a unique sequence

identifier, NCBI accessions are recomended, preceded

only by '>')

-it INTAX, --InTax INTAX

Input taxonomic lineages (a tab-delimited file with

the first column containing unique sequence

identifiers compatible with the files provided under

'-r' and '-i'). Normally, this would be in

Metaxa2-compatible format, however, if lineages come

directly from Taxonomizr, you can skip reformatting

and declare '-tf True.' See below for more details.

-ot OUTTAX, --OutTax OUTTAX

Filename for output taxonomic lineages

-of OUTFASTA, --OutFasta OUTFASTA

Filename for output fasta formatted sequences

universal optional arguments:

-t THREADS, --threads THREADS

Number of processors to use

--SaveTemp SAVETEMP Option for saving alignments and HMM files produced

during extraction and dereplication (True or False)

-tf TAXONOMIZRFORMAT, --TaxonomizrFormat TAXONOMIZRFORMAT

Specify if tax lineages are in Taxonomizr format and

must first be converted to Metaxa2-compatible tab-

delimited format (True or False)

-ct CLEANTAX, --CleanTax CLEANTAX

Specify if tax lineages are to be cleaned of common

NCBI artifacts and revised at unresolved midpoints

(True or False)

IterRazor optional arguments:

-e HMMEVALUE, --HmmEvalue HMMEVALUE

Evalue threshold for HMM search

-is ITERATIONSERIES, --IterationSeries ITERATIONSERIES

The IterationSeries and CoverageSeries arguments

dictate the number of iterative searches to run and

the minimum HMM coverege required for each round of

extracting. This argument consists of a comma-

separated list specifying the number of searches to

run during each round of extraction. The list can vary

in length, changing the total number of extraction

rounds conducted. However, if a search iteration fails

to yield any new reference sequence amplicons,

IterRazor will break out of that round of searching

and move to the next. By default, the software

conducts four rounds of extraction with 20, 10, 5 and

5 search iterations per round (i.e. '-is 20,10,5,5').

-cs COVERAGESERIES, --CoverageSeries COVERAGESERIES

This argument consists of a comma-separated list

specifying the minimum HMM coverage required per round

for an extracted amplicon reference sequence to be

added to the reference sequence fasta. The list can

vary in length but must be equal in length to the

IterationSeries list. By default, the software

conducts four rounds of extraction with coverage

limits of 1.0, 0.95, 0.9 and 0.65 for each respective

search round (i.e. '-cs 1.0,0.95,0.9,0.65').

DerepByTaxonomy-specific optional arguments:

-mh MAXHITS, --MaxHits MAXHITS

Max hits to find before ending search for any given

query using VSEARCH (default: 1000)

-di ITERATIONS, --Iterations ITERATIONS

Number of iterative VSEARCH searches to run (default:

10). This is important in case the number of

replicates for a given sequence greatly exceeds '--

MaxHits'

テストラン

１、メタバーコーディングする領域を指定するため手動でトリミングして集めてきた8つのrbcL塩基配列"rbcL_Reps.fa"を入力とし、NCBIでrbcLとアノテートされている数千のrbcL塩基配列"rbcL_sample.fa"をキュレーションする。

cd TestMetaCurator/
MetaCurator.py -r rbcL_Reps.fa -i rbcL_sample.fa -it rbcL_sample.tax -tf True -ct True -of Test.fa -ot Test.tax

元のデータベースrbcL_sample.fa平均サイズは850.9-bpで2500配列あったが、キュレーション後は1691配列、平均476.6-bpに減少した。

> seqkit stats rbcL_sample.fa Test.fa

# seqkit stats rbcL_sample.fa Test.fa

file format type num_seqs sum_len min_len avg_len max_len

rbcL_sample.fa FASTA DNA 2,500 2,127,215 94 850.9 8,413

Test.fa FASTA DNA 1,691 805,958 319 476.6 494

Taxnomy情報を表すTest.tax も出力される。

上ではNCBIに登録されているrbcL配列群rbcL_sample.fa を-iで指定しているが、実際のデータ解析（メタバーコーディングのプライマー作成、増幅領域のマルチプルアラインメントなど）で使用するときは、ゲノム、メタゲノムのアセンブリなどを-iで指定する。

引用

MetaCurator: A hidden Markov model-based toolkit for extracting and curating sequences from taxonomically-informative genetic markers

Rodney T. Richardson, Douglas B. Sponsler, Harper McMinn-Sauder, Reed M. Johnson

bioRxiv preprint first posted online Jun. 18, 2019

2019-06-22

ヒトのガン原遺伝子/腫瘍抑制遺伝子の変異を視覚化するwebツール Mutplot

2019 PLoS ONE human genome web tool 結果の視覚化 (visualization) tumor panel sequencing clinical and diagnostic UniProt

　シーケンシング技術開発はガン研究に革命をもたらした。約20年に及ぶ発展後、次世代シーケンシング（NGS）は速くて手頃な価格になっている。それは精密医療を臨床の現実にした。 NSGは、臨床現場での治療法を個別化し、研究情報を広げるための包括的なビッグデータを提供している。この技術的進歩は治療および研究のためのより多くの機会を生み出したが、それらは非常に大きく詳細であるため、結果として得られたデータを効率的に合成および要約するという問題も生じた。ビッグデータを手動でフィルタリングすると、エラーの可能性が高まり、整理するのに時間がかかる。ビッグデータも効果的に提示するのは困難である。ソフトウェアはこれらの問題をすべて回避する。この目的のためにいくつかのツールが利用可能である。ただし、ほとんどはプログラミングの背景を持つユーザー向けに設計されている。これは病院やそのような訓練を受けていない施設利用者の大部分を締め出す。 MutplotはWebブラウザで機能する機能し、簡単なカスタマイズのための柔軟性を提供する。臨床医や研究者が自分で使用するために特別に設計された。抽象的なビッグデータを視覚的な結果に変換する。さらに、Mutplotはすべてのプラットフォームで動作するオープンソースのツールであり、セキュリティ目的のため、ファイアウォールの内側で簡単に統合できる。

　Mutplotを、MutationMapper [ref.1]、Lollipops [ref.2]、Muts-needle-plot [ref.3]、Pfam [ref.4]、Plot Protein [ref.5]、trackViewer [ref.6]など、他の6つの最も一般的な突然変異プロットツールと比較した。それらのどれも技術的でないユーザーのためのすべての要件を満たさない（詳細は論文表1に示されている）。 MutationMapperを除くそれらすべては、プログラミングトレーニングを必要とするコマンドラインユーザーインターフェイスを使用する。 Muts-needle-plot、Plot Protein、TrackViewerでは、手動のドメイン入力が必要である。Lollipopsは、サンプル頻度が類似している突然変異とクラスタ化突然変異を区別することはできない。その上、手作業でデータを入力することは人的ミスを犯しがちであり、そしてそれは突然変異ハイライト機能を持たない。 Pfamは公開フォーマットではないJSONファイルを出力する。 MutationMapperはWebベースのユーザーインターフェースを使用しているので最良の選択と思われるが、それ自身の欠点もある。それは最も高い再発性の突然変異（アミノ酸の変化）を示すだけで、これは低い頻度であるがドライバー遺伝子の突然変異を排除するだろう[ref.7]。事実、腫瘍間の遺伝的異質性のために、多くのドライバー遺伝子が非常に低いvariant allele frequenciesで生じる。同じ遺伝子に複数の突然変異が発生した場合、MutationMapperは、ガン研究と個別化医療の進歩に不可欠な低発生突然変異を簡単に無視することがあり得る。さらに、2つのバリアントがMutationMapper内で非常に近くにあると、それらのうちの1つからの情報が重複する。 MutationMapperのもう1つの落とし穴は、スペースが限られている場合にドメイン名が自動的に切り捨てられることである。これらの欠点により、MutationMapperはNGS分析にはあまり理想的ではない。

　Mutplotには、さまざまなタンパク質の変異を可視化するための完全なワークフローが含まれている（論文図1）。バリアント情報（必要な4つの列はHugo_Symbol、Sample_ID、Protein_Change、およびMutaiton_Type。表S1参照）でファイル（タブ区切り形式またはカンマ区切り形式であること）を入力すると、Mutplotは自動的にUniProtから最新のタンパク質情報に接続する[ref.8]。ドロップダウンメニューを使用して、合計409のガン遺伝子および腫瘍抑制遺伝子が組み込まれている（論文表S2）。 Mutplotは選択された遺伝子のドメイン情報を取得する。アミノ酸頻度閾値のハイライトオプションは1、2、3、4、5、10、15、20、25、30に設定されている。両方の遺伝子とハイライト閾値オプションは、ソースコードをカスタマイズするだけで拡張できる。指示はGitHubにdepositされている：https：//github.com/VivianBailey/Mutplot。
この情報を使って、Mutplotはドメイン情報、アミノ酸の位置、変異の頻度、アミノ酸の変化、変異の種類、そして強調表示された変異を含むタンパク質の図を作成する。アミノ酸の位置は正確な比率のために遺伝子の長さに合わせて調整されている。強調表示された変異は詳細なアミノ酸改変情報を有する。突然変異の種類と説明は、視覚化と区別を容易にするために色分けされている。複数の変異が密集している場合、Mutplotはスマートに他の変異を妨害することなく変異を標識する方法を見つけ出す。 Mutplotは出力オプションに関しても高い柔軟性を与える。 JEPG、PDF、PNG、画像ダウンロード用のSVGをサポートしている。また、入力データと更新されたUniprotデータベースから取得した対応するドメイン情報から、ダイアグラム用に選択したデータをエクスポートすることもできる。

このソースコードはGitHub: https://github.com/VivianBailey/Mutplotで利用かのであり、非営利目的で使用することができる。そして簡単にアクセスしたり、修正したり、他のパイプラインやソフトウェアに統合することができる。ソースコードを修正すると、MutplotはWebアプリケーションからファイアウォールの内側にあるパーソナルコンピュータまたはサーバに移行する可能性がある。これは、厳格なセキュリティ規制に従う機関にとって大きな選択肢となる。さらに、GitHubにはMutplotの完全なドキュメント、ソースコードのカスタマイズ方法の説明、そして将来のリリースもGitHubに説明が載っている。

WebアプリはRプログラミング言語で開発された。 shinyパッケージ、ggplot2パッケージ、plyrパッケージ、httrパッケージ、drawProteinsパッケージ、およびggrepelパッケージが使用されている。

Github

使い方

https://bioinformaticstools.shinyapps.io/lollipop/ にアクセスする。

入力ファイルはタブ/カンマ区切りの４列からなるフォーマットで用意する必要がある。それぞれの列には、Hugo_Symbol、Sample_ID、Protein_Change、およびMutaiton_Typeが記載されている必要がある。

入力ファイル例（direct link）。

f:id:kazumaxneo:20190619213507p:plain

Browseからファイルをアップロードする。

f:id:kazumaxneo:20190619213710p:plain

ヘッダーがある場合は「ヘッダー」を選択。入力ファイルに合わせセパレータを指定する。

視覚化する遺伝子を409のリスト中から選択。ここではテストデータを使っているので、TP53遺伝子か、BRCA1を選ぶ。

f:id:kazumaxneo:20190619160137j:plain

利用できる遺伝子リスト（direct link）。

ハイライト表示するアミノ酸頻度のしきい値を設定する。

f:id:kazumaxneo:20190619221044p:plain

defaultでは全変異のハイライト表示は無しになっているが、数値を選ぶことで、多くのサンプルに共通する変異に絞ってハイライト表示できる。上に載せた例では、R175Hの変異が5サンプルで共通しており、他はそれぞれ１サンプルのみから由来。

ここでは１を選択、出力フォーマットにはpdfを指定。

f:id:kazumaxneo:20190619221615p:plain

Download Plotで図がダウンロードされる。

f:id:kazumaxneo:20190619221443p:plain

図の右端にレジェンドが表示される。変異のタイプの他、モチーフ位置も表現されている。

f:id:kazumaxneo:20190619221832p:plain

ハイライト表示するアミノ酸頻度のしきい値を5に変更

f:id:kazumaxneo:20190619221428p:plain

５サンプル以上で共通する変異のみハイライト表示された。

引用

Mutplot: An easy-to-use online tool for plotting complex mutation data with flexibility
Zhang W, Wang C, Zhang X

PLoS One. 2019 May 15;14(5)

2019-06-21

バクテリア、アーキア、プラスミドの複製起点（ori）データベース DoriC

2018 plasmid bacteria Nucleic Acids Research oriC archaea web tool database tips

2019 6/21 誤字修正、コマンド修正

2023/10/19 URL修正

　すべての生物において、DNA複製は複製機構の構築段階で正確に制御されている（ref.1）。複製起点は特定のゲノム遺伝子座であり、そこでは二本鎖DNAがほどけて一本鎖DNA鋳型を形成して新しい鎖の合成を開始する。大部分の細菌において、複製起点（oriC）は、主要なイニシエータータンパク質DnAによって認識されるいくつかのDnaAボックスモチーフを含む。また複製起点（oriC）にはAT含有量の高いDNA unwinding element （DUE）を含む領域、ここでは一本鎖DNAもDnaAにより認識される（ref.2〜5）、も含まれる。ATリッチなDNA unwinding elementは古細菌複製起点にも不可欠であることがわかっており、これはorigin recognition proteins （ref.6,7）の結合部位として働くorigin recognition boxes (ORBs) に隣接している。多数のプラスミドにおいて、origin of vegetative replication (oriV) はしばしばダイレクトリピートまたはiteron DNA配列（wiki）からなり、これらはRepタンパク質と相互作用して複製開始の過程で初期複合体を形成する（ref.8）。oriVのiteronの位置の近くにATに富む領域もあり、これはDNA巻き戻し要素として働く（ref.9）。

　複製起点が通常、dnaA、orc1 / cdc6およびrep遺伝子などの複製関連遺伝子の隣にあることは興味深い。chromosomeおよびプラスミド上の原核生物複製起点の類似構造は、同じ枠組みに基づいて起源予測のためのアルゴリズムを設計する機会を提供する。当初、Ori-Finderは細菌のchromosome上のoriC領域を同定するために開発された（ref.10）。

　3つのドメインの独立したドメインの１つとして、ほとんどの古細菌は地球上の様々な極端な環境に存在しており、特定のhabitsが実験的方法による複製起源の同定を困難にしている（ref.11）。したがって、ウェブベースのツールOri-Finder2は、古細菌ゲノムのoriC領域をin silicoで予測するために開発されたものであり、予測された結果は実験室での古細菌起源の同定に役立つ可能性がある（ref.12）。

　プラスミドは、chromosome外の自己複製する遺伝要素であり、細菌、古細菌、酵母、そしていくつかの高等真核細胞に広く見られる（ref.8）。プラスミドはしばしば抗生物質耐性やtoxin–antitoxinシステムのような宿主細胞にいくつかの特別な特徴をもたらす遺伝子を持っている（ref.13）。したがって、プラスミドの自律的なDNA複製は細胞の生存にとって重要である。Vegetative replicationの起源はプラスミドの最も重要な要素の一つである。これまで、oriVの位置と特徴は、RK2、F、P1、R6K、pPS10プラスミドなどの広範囲のプラスミドでよく理解されていた（ref.14–18）。しかしながら、シーケンシングされた多数のプラスミド上でoriVを自動的に同定するためにバイオインフォマティクスツールが緊急に必要とされている。

　複製起点に関する関連研究を容易にするために、Ori-Finderシステムの予測がオンラインデータベースにまとめられた（ref.19–21）。 2007年に、oriC領域のデータベースであるDoriCが最初に公に利用可能になり、2013年に、DoriC 5.0は細菌ゲノムと古細菌ゲノムの両方の複製起点を含めた（ref.22、23）。過去6年間で、次世代シーケンシング技術の急速な進歩と様々な微生物ゲノムプロジェクトからのシーケンシングされたゲノムの蓄積がDoriCの拡大を促進しており、この拡大されたデータベースはDoriC 10.0としてここに提示される。はじめてのプラスミドDoriCデータベースとOri-Finderシステムは、複製起点の構造と機能のより良い理解を確実にし、そしてDNA複製における開始段階の調節メカニズムへの新しい洞察を提供する。これまでのところ、DoriCに保存されている予測の多くは現在実験室で検証されており、過去のDoriCデータベースとOri-Finderシステムに基づくより多くのアプリケーションが我々の最近のarticleで見直された（ref.24, pubmed）。

　今回のリリースでは、DoriCの内容は次のようにバージョン5.0と比較して大幅に改善されている。（i）細菌chromosome上のoriCは4倍の1633から7580に増加した。（ii）古細菌のchromosome上のoriCは86から226に増加した。（iii）NCBI recordから検索された348の注釈付き起点および修正Ori-Finderシステムによって861の予測起点を含む1209のプラスミド複製起点が初めて提示された。（iv）originの機能に重要な新規リピートトリヌクレオチドモチーフである DnaA-trio要素を含む、細菌複製起点中のより多くの配列要素が組み込まれる。 DnaA-trioは、DoriCによるバイオインフォマティクス分析によって細菌界全体で高度に保存されている、origin の巻き戻しおよびDNAヘリカーゼローディングにおいて役割を果たす（ref.20）。 DnaA-trio様配列をDoriCデータベースで検索した後、その情報を対応するoriCレコードに追加した。さらに、データベースのユーザーインターフェースをより便利で直感的にわかるように再設計した（論文図1）。（以下略）

使い方

DoriCにアクセスする。

version12はこちら

以下、古いバージョンでの説明

Browseからbacteira、Archaea、plamidのいずれかを選択する。

f:id:kazumaxneo:20190620215431p:plain

登録されている配列が表示される。

f:id:kazumaxneo:20190620215623p:plain

DoriC accession numberをクリックすることで、ポジションなどの詳細を表示できる。

f:id:kazumaxneo:20190620215740p:plain

Z-curves (bacteira、Archaeaのみ)

f:id:kazumaxneo:20190620221938p:plain

Browse in NCBI

f:id:kazumaxneo:20190620220620p:plain

BLAST検索はテスト時動作しなかった。

downloadからはRefseq IDや配列を含むCSVファイルをダウンロードできる。

http://tubic.org/doric/public/index.php/download

解凍。bacteira、Archaea、plamid３つのCSVがある。

f:id:kazumaxneo:20190620222731p:plain

excelで開いた。

f:id:kazumaxneo:20190620223326p:plain

おまけ

ori配列を使ってアセンブルを支援する。

awkでCSVファイルからfasta形式に変換する。変換後、EMBOSSパッケージのseqret (紹介) を使い、エラーを除きつつ適当な文字数で改行。

#ubuntuで実行
#bacteira
awk 'BEGIN{FS=","}{ OFS = "" }{print ">", $3, "\n", $NF}' tubic_bacteria.csv \
 | sed '1,2d' - > bacteira_oriC.fasta

#archaea
awk 'BEGIN{FS=","}{ OFS = "" }{print ">", $3, "\n", $NF}' tubic_archaea.csv \
 | sed '1,2d' - > archaea_oriC.fasta 

#plasmid
awk 'BEGIN{FS=","}{ OFS = "" }{print ">", $3, "\n", $NF}' tubic_plasmid.csv \
 | sed '1,2d' - > plasmid_ori.fasta

#emboss seqret
seqret archaea_oriC.fasta archaea_oriC_corrected.fasta
seqret bacteira_oriC.fasta bacteira_oriC_corrected.fasta
seqret plasmid_ori.fasta plasmid_corrected.fasta

*$NFで最後のカラムのみ出力。OFS = ""でスペースを排除。sedに渡して先頭のコメント行（awk処理で2行になっている）を排除。EMBOSSのseqretはゲノム登録用に開発されており、潜在的なエラーをあらかた排除できる。fold -w "INT" やperl/awkよりオススメ。

De novoアセンブル支援に使う。

BandageにGFA/fastgを読み込みblast検索する。

f:id:kazumaxneo:20190621010852p:plain

完全ではないが、メタゲノムのアセンブリから、アーキア、バクテリア、プラスミドを見分けるのに役立つと思われる。

引用

DoriC 10.0: an updated database of replication origins in prokaryotic genomes including chromosomes and plasmids
Hao Luo, Feng Gao
Nucleic Acids Research, Volume 47, Issue D1, 08 January 2019, Pages D74–D77

DoriC 5.0: an updated database of oriC regions in both bacterial and archaeal genomes

Gao F, Luo H, Zhang CT

Nucleic Acids Res. 2013 Jan;41(Database issue):D90-3

DoriC: a database of oriC regions in bacterial genomes

Gao F, Zhang CT

Bioinformatics. 2007 Jul 15;23(14):1866-7. Epub 2007 May 12

2023/10/19

DoriC 12.0: an updated database of replication origins in both complete and draft prokaryotic genomes
Mei-Jing Dong, Hao Luo, Feng Gao

Nucleic Acids Res. 2023 Jan 6;51(D1):D117-D120.

参考

bioinformatics - Identifying the origin of replication of an unannotated *E. coli* plasmid - Biology Stack Exchange

2019-06-20

ショートリードによるpolishingも行う高速なロングリードアセンブラ Raven (旧名 Ra)

Nanopore long read Pacbio assembly polish 高速なツール large genome hybrid assembly 2021 2020 Preprint Nature Communications

2020 5/23 タイトル補足、ravenインストール追記

2020 8/11 引用にpreprint追記

2021 5/24 論文引用

2022/06/08 help更新

　Ra（現在はRaven）は、第3世代シーケンシングによって生成されたrawシーケンシングリードの高速で使いやすいアセンブラである。以下の図に示すように、RaはMinimap2、Rala、およびRaconで構成されている。

　Raは入力としてFASTA / FASTQフォーマット（gzipで圧縮可能）のrawシーケンシングリードを含む単一のファイルを取り、高精度の一連のコンティグをstdoutにFASTAフォーマットで出力する。さらに、FASTA / FASTQ形式の第2世代シーケンスリードファイル（gzip圧縮対応）を第2引数として受け取り、ショートリードでpolishingすることで最終的なアセンブリを完成させることができる。

Ra flow chart. Githubより

参考スライド ( Rayan Chikhi, CNRS, Univ Lille BiG talk, Lund University)

http://rayan.chikhi.name/pdf/big18_large_genome_assembly.pdf

とても面白い内容です。ワクワクしますね。 Raについては最後の方で触れられています。

2021/11/29

New release of Raven, long-read de novo assembler https://t.co/t7mCEPvLbC. In most cases, it improves contiguity by at least 10%. W/ @robertvaser
— Mile Sikic (@msikic) 2021年11月29日

Raven version 1.5.0 - up to 2x faster (increases with data accuracy), improves NGA50 in most cases, properly handles small plasmids. w/ @robertvaser https://t.co/t7mCEPvLbC
— Mile Sikic (@msikic) 2021年3月24日

Raven - rapid and easy to use long read de novo assembler which produces the least number of contigs https://t.co/YBvpY6mYuX w/ @robertvaser. Any comments are more than welcome!
— Mile Sikic (@msikic) 2020年8月11日

インストール

ubuntu16.04のminiconda3.4.0.5環境でテストした（docker使用、ホストOS ubuntu18.04）。

依存

gcc 4.8+ or clang 3.4+
cmake 3.2+

Github

#--recursiveをつけて依存するminimap2、racon、そしてralaモジュールも含めてクローンする。
git clone --recursive https://github.com/rvaser/ra.git ra
cd ra 
mkdir build
cd build 
cmake -DCMAKE_BUILD_TYPE=Release .. 
make -j 8
cd bin/

> ./ra --help

# ./ra --help

/usr/bin/env: 'python': No such file or directory

root@ebaf20ee92c1:~/ra/build/bin# source ~/.profile

root@ebaf20ee92c1:~/ra/build/bin# ./ra --help

usage: ra [-h] [-u] [-t THREADS] [--version] -x {ont,pb}

sequences [ngs_sequences]

positional arguments:

sequences input file in FASTA/FASTQ format (can be compressed

with gzip) containing third generation sequences for

assembly

ngs_sequences input file in FASTA/FASTQ format (can be compressed

with gzip) containing next generation sequences for

polishing (default: None)

optional arguments:

-h, --help show this help message and exit

-u, --include-unused output unassembled and unpolished sequences (default:

False)

-t THREADS, --threads THREADS

number of threads (default: 1)

--version show program's version number and exit

required arguments:

-x {ont,pb} sequencing technology of input sequences (default:

None)

Raven

#bioconda
mamba create -n raven -y
conda activate raven
mamba install -c bioconda raven-assembler -y

#from source
git clone --recursive https://github.com/lbcb-sci/raven.git raven
cd raven && mkdir build && cd build
cmake -DCMAKE_BUILD_TYPE=Release .. && make
./bin/raven

> ./raven

usage: raven [options ...] <sequences> [<sequences> ...]

# default output is to stdout in FASTA format

input file in FASTA/FASTQ format (can be compressed with gzip)

options:

-k, --kmer-len <int>

default: 15

length of minimizers used to find overlaps

-w, --window-len <int>

default: 5

length of sliding window from which minimizers are sampled

-f, --frequency <double>

default: 0.001

threshold for ignoring most frequent minimizers

-p, --polishing-rounds <int>

default: 2

number of times racon is invoked

-m, --match <int>

default: 3

score for matching bases

-n, --mismatch <int>

default: -5

score for mismatching bases

-g, --gap <int>

default: -4

gap penalty (must be negative)

--graphical-fragment-assembly <string>

prints the assembly graph in GFA format

--resume

resume previous run from last checkpoint

--disable-checkpoints

disable checkpoint file creation

-t, --threads <int>

default: 1

number of threads

--version

prints the version number

-h, --help

prints the usage

実行方法

ONTのアセンブリ。

ra -t 20 -x ont long_reads.fq > output.fa

GFAファイルとアセンブリされた配列output.faが出力される。

Pacbioのアセンブリ。

ra -t 20 -x pb long_reads.fq short_reads.fq > output.fa

ONTのリードとショートリードのハイブリッドアセンブリ。ペアエンドショートリードデータはあらかじめ結合しておく。

ra -t 20 -x ont long_reads.fq short_reads.fq > output.fa

引用

Raven: a de novo genome assembler for long reads

Robert Vaser, Mile Sikic

bioRxiv, Posted August 10, 2020

2021 5/24

Time- and memory-efficient genome assembly with Raven | Nature Computational Science

論文中のパフォーマンス比較結果をみると多くのmetricsについてバランスの取れた結果を出しており、Ravenが優秀なアセンブラであることが分かりますね。

(Omics向け) 従来のベン図表現を拡張する DiVenn

2019 Frontiers in Genetics web tool GO term venn diagram KEGG pathway RNA seq proteome

　ハイスループットデータ技術の進歩により、詳細な分析なしに膨大な量の遺伝子発現データが生成されてきた。例えば、INVEX (Xia et al., 2013)、ExAtlas (Sharov et al., 2015)、そしてWebGIVI (Sun et al., 2017)などのいくつかのウェブベースの視覚化ツールは、発現データ分析において首尾よく使用された。ただし、3つ以上の実験を体系的に比較することは依然として困難である。統合されたバイオインフォマティクスデータベースとともに複数の実験のデータを視覚化することは特に困難になる。ベン図は、複数の実験間で遺伝子リストを比較するために広く使用されている。 GeneVenn（Pirooznia et al、2007）、Venny（Oliveros、2015）、およびInteractiVenn（Heberle et al、2015）は、現在使用されているWebベースツールの例である。しかしながら、それらは重大な制限を有する：（１）gene IDを遺伝子機能に結び付けることはできない。Biological pathwayおよび遺伝子オントロジー（GO）などのバイオインフォマティクスデータベースを統合することはできない。（2）グラフに遺伝子発現量を表示することはできない。（3）ベン図で興味を引く可能性がある共通または固有の遺伝子は、遺伝子発現値および遺伝子機能では抽出できない。

　本著者らは、上記の制限を克服し、生物学者が彼らの遺伝子リストを視覚化し、biological pathwayおよびGOデータベースからの統合された知識に基づいて生物学的仮説を生成するのを助けるインタラクティブなwebベースのツールを提供する。このツールを使用すると、研究者は遺伝子リストを比較および視覚化するだけでなく、関心のある遺伝子機能に基づいてグラフ内の遺伝子ノードをサブセット化または強調表示することもできる。このツールはユーザーフレンドリーで、force-directed focus packageを使用して大量の入力データを処理できる（ref.1）。ユーザーは結果/情報テーブルから重要な遺伝子情報を抽出およびダウンロードし、publication品質の高解像度画像をダウンロードできる。
DiVennは、PHP、JavaScript、R、D3.js（Bostock et al、2011）、およびMySQLデータベースを使用して開発された。データの視覚化のフローチャートを論文図1に示す。DiVennは現在、2種類の入力データを受け入れる。（1）2列のタブ区切りのカスタムデータ。例えば、 gene IDsおよび対応するpathwayデータ、転写因子およびそれらによって調節される下流の遺伝子、ならびにマイクロRNAおよび対応する標的遺伝子などである。 2列目は「1」または「2」でなければならない。（２）遺伝子発現データ。 1列目は gene IDs、2列目は遺伝子制御値である。遺伝子調節値は、differentially expressed (DE) 遺伝子から得られるべきである。使用者は、倍数変化のカットオフ値（例えば、２倍変化）を選択してそれらのDE遺伝子を定義できる。この遺伝子調節値を単純化するために、本著者らは、使用者が自身の倍数変化のカットオフ値に基づいて、上方調節遺伝子を表すために「１」を、下方調節遺伝子を表すために「２」を用いることを要求する。ユーザーが自分の遺伝子をKEGG pathway（Kanehisa et al、2019）またはGOデータベースにリンクする必要がある場合、DiVennではKEGG pathwayおよびGOデータベースが利用可能な14のモデル生物種がサポートされている。現在、3種類のgene IDs、すなわち KEGG gene IDs、Uniprotgene ID（UniProt、2008）、およびNCBI gene ID（Benson et al、2018）がpathway解析に受け入れられている。DiVennによるd分析には、すべてのagriGO（Du et al、2010; Tian et al、2017）がサポートするIDが受け入れられる。 DiVennでは、ユーザーはネットワークグラフ内の最大8つの遺伝子リストを比較して視覚化することができる。（以下略）

チュートリアル

https://github.com/noble-research-institute/DiVenn

DiVenn Tutorial

ブラウザ

All modern browsers, such as Safari, Google Chrome, and IE are supported. The recommended web browser is Chrome.

使い方

DiVennの解析前に、（ユーザー指定のcut-off条件で）それぞれのトランスクリプトーム解析データから発現変動遺伝子セットが抽出されていないといけない。

http://divenn.noble.org にアクセスする。

オーサーが公開しているYou tube動画と同じ流れで説明する。

生物種を選択する。現在１４のモデル生物種に対応している。

f:id:kazumaxneo:20190619025013p:plain

ここではArabidospsisを選択。

実験データ数を選択する。ここでは実験データ３つのDEGセットを比較するとして３を選択。

f:id:kazumaxneo:20190616204821p:plain

"Load Sample Data"ボタンをクリック。

f:id:kazumaxneo:20190616204855p:plain

それぞれのウィンドウ内にsampleデータが読み込まれた。データの１列目がgene IDs、２列目が発現パターンになる。発現パターンは、ユーザーが指定の条件でフィルタリングして得たDEGのセットの発現パターンで（ユーザー指定のcut-off条件でDEGは抽出済みのはず）、数値の1か2のみ受け付ける。1は up-regulated、2は down-regulatedを表す。

視覚化後、オプションのpathway解析まで行うためには、KEGG、Uniprot (UniProt, 2008)、 NCBI (Benson, et al., 2018)のgene IDを使う必要がある。

表示される実験名を変えたいならウィンドウに手打ちする。Exp1からexperiment1に変更した。

f:id:kazumaxneo:20190619015042p:plain

--補足--

実際の解析時にはデータを含むテキストファイルをuploadする。

f:id:kazumaxneo:20190616205116p:plain

-------------

一番下のsubmitボタンを押し、視覚化を実行する。

f:id:kazumaxneo:20190616210529p:plain

視覚化結果

f:id:kazumaxneo:20190616210559p:plain

色は以下のような情報を表す。

f:id:kazumaxneo:20190616210738p:plain

従来のベン図では実験間で共通/固有のDEG数しか表現されない。一方、見てわかるようにこのインタラクティブなグラフは、実験間で共通/固有のDEGに加え、各々のDEGが誘導か抑制なのか、クリックすることで後述する遺伝子名、gene IDs、GO term、KEGG pathway情報まで表現できる。さらに、GO enrichment 解析やpathway enrichment解析を視覚的に確認しながら進めることが可能（後半で説明しているが、リストのGO termやpahway情報とフィッシャー検定結果がその場で表にまとめられるので、興味あるリストのみ選択して再描画できる）。

個人的な意見として、共通/固有のDEG数は一般的なベン図の方が判断しやすいと思う。

図はマウスのホイール上下やTrackpadの上下で拡大縮小したり、nodeをドラッグして操作できる。もとに戻したい時は十字アイコンの真ん中のRESETをクリックする。

f:id:kazumaxneo:20190616212158p:plain

nodeの上で右クリックすることで、gene IDsの表示/非表示を選択可能。

f:id:kazumaxneo:20190616210702p:plain

一括表示。

f:id:kazumaxneo:20190619022517p:plain

ノードの上で右クリックしてGene detailをクリックすると、GO termやGO IDなどの情報が表示される。

f:id:kazumaxneo:20190619022610p:plain

KEGG、Uniprot、 NCBI のgene IDsも表示されている。

一括表示/非表示は右上のメニューからも実行できる。

f:id:kazumaxneo:20190616211714p:plain

メニューではnodeの色も変更可能。

f:id:kazumaxneo:20190616211652p:plain

表示されている遺伝子群について、エンリッチされたKEGG pathwayとGO termを調べることができる。メニューのShow Gene DetailsからPathwayかGOを選択。

f:id:kazumaxneo:20190616211844p:plain

pahwayを選択した。しばらく待ってから下の方にスクロールすると表が出現している。

f:id:kazumaxneo:20190616212529p:plain

表のgene IDをクリックすると、ノードを右クリックしたときと同様詳細が表示できる。

f:id:kazumaxneo:20190616214156p:plain

pathway列にpathwayの情報がある場合、文字をクリックするとそのKEGG pathwayに飛ぶ。

f:id:kazumaxneo:20190616213806p:plain

KEGG pathwayではその生物種でアサインされているものが緑色で表示される。基礎的な話になるが、上の図では、この生物種（ここではシロイヌナズナ）でKEGG pathwayにアサインされているαリノレン酸代謝系酵素遺伝子が緑色になっている（ユーザーがこのDiVennで使用したDEGリストにあった遺伝子では無い）。

ここでは特にエンリッチされていそうなpathwayを調べたい。表の一番上のpathwayボタンを押し、pathwayでソートし直す。

f:id:kazumaxneo:20190616212808p:plain

検索効率を上げるため、左上から表示件数を変更。50 => 200にした。

f:id:kazumaxneo:20190616212708p:plain

多いpathwayを探す。p valueをみて増えてそうなpathwayが見つかったら、各行の右端にあるチェックboxに✔︎をつけて選択していく。

f:id:kazumaxneo:20190616212731p:plain

shiftキーを押しながら一番上と一番下をサンドすることで該当pathwayを一気に選択可能。

一番下までスクロールし、Only Redraw Selectedにチェックをつけ、Redrawボタンをクリック。

f:id:kazumaxneo:20190616212959p:plain

一番上に戻ると、選択したgene IDsのみが描画されている。

f:id:kazumaxneo:20190616213139p:plain

DiVennによって生成されたグラフは、右上のメニューからPNGかSVGとして出力、ダウンロードできます。

ノードはクリックしたりドラッグして移動できます。重要な遺伝子を真ん中に配置するなどして整えてから出力するといいと思います。

引用
DiVenn: An Interactive and Integrated Web-Based Visualization Tool for Comparing Gene Lists
Liang Sun, Sufen Dong, Yinbing Ge, Jose Pedro Fonseca, Zachary T. Robinson, Kirankumar S. Mysore, Perdeep Mehta

Front Genet. 2019; 10: 421

多機能なNGS分析ツール BBtools　其の３BBMap追加コマンド

高速なツール SNV small indel error correction metagenome annotation

BBMapの追加コマンドについて紹介します。

BBMap Guide

https://jgi.doe.gov/data-and-tools/bbtools/bb-tools-user-guide/bbmap-guide/

callvariants.sh

Introducing CallVariants, a new variant caller in #BBMap! CallVariants is 81x faster than mpilup+bcftools, and very, very accurate.
— Brian Bushnell (@BBToolsBio) December 15, 2016

Do you like #GATK?

BBTools' CallVariants is 2000 times faster. Also, it does not place adjacent deletions and insertions, like GATK.

I'm not saying GATK is garbage, but I'm also not saying that. You can't run GATK without incurring adjacent insertions and deletions.
— Brian Bushnell (@BBToolsBio) June 13, 2019

bbcms.sh

New in #BBTools:https://t.co/H04qsEfnrO - error correction utility for metagenomes. Tadpole is more accurate, but BBCMS uses the same algorithm and never runs out of memory (since it uses a lossy data structure). Use Tadpole when you can, and BBCMS if it runs out of memory.
— Brian Bushnell (@BBToolsBio) June 13, 2019

callgenes.sh

New in #BBTools- https://t.co/VyzMcc0Wda, a new prokaryotic gene-caller. Did the world need a new prok gene caller? No! But, I needed a super-fast caller for internal use by sendsketch.

"https://t.co/D9wMtd802X in=assembly.fa protein"

You can even use it with raw reads!
— Brian Bushnell (@BBToolsBio) June 13, 2019

It's actually really good, though:https://t.co/VyzMcc0Wda in=assembly.fa out=genes.gff outa=proteins.faa

By default it uses models for hundreds of bacteria and archaea. You can increase the accuracy (if you know the clade) by making your own model with https://t.co/gnqK0US6UD.
— Brian Bushnell (@BBToolsBio) June 13, 2019

インストール

bbmapを導入する。

#bioconda (link)
conda install -c bioconda -y bbmap

バリアントコール

> callvariants.sh

$ callvariants.sh

Written by Brian Bushnell

Last modified October 12, 2017

Description: Calls variants from sam or bam input.

In default mode, all input files are combined and treated as a single sample.

In multisample mode, each file is treated as an individual sample,

and gets its own column in the VCF file. Unless overridden, input file

names are used as sample names.

Please read bbmap/docs/guides/CallVariantsGuide.txt for more information,

or bbmap/pipelines/variantPipeline.sh for a usage example.

Usage: callvariants.sh in=<file,file,...> ref=<file> vcf=<file>

Input may be sorted or unsorted.

The reference should be fasta.

I/O parameters:

in=<file> Input; may be one file or multiple comma-delimited files.

list=<file> Optional text file containing one input file per line.

Use list or in, but not both.

out=<file> Output variant list in native format. If the name ends

with .vcf then it will be vcf format.

vcf=<file> Output variant list in vcf format.

ref=<file> Reference fasta. Required to display ref alleles.

Variant calling wil be more accurate with the reference.

shist=<file> (scorehist) Output for variant score histogram.

zhist=<file> (zygosityhist) Output for zygosity histogram.

overwrite=f (ow) Set to false to force the program to abort rather than

overwrite an existing file.

extended=t Print additional variant statistics columns.

sample= Optional comma-delimited list of sample names.

multisample=f (multi) Set to true if there are multiple sam/bam files,

and each should be tracked as an individual sample.

vcf0= Optional comma-delimited list of per-sample outputs.

Only used in multisample mode.

bgzip=f Use bgzip for gzip compression.

Processing Parameters:

prefilter=f Use a Bloom filter to exclude variants seen fewer than

minreads times. Doubles the runtime but greatly reduces

memory usage. The results are identical.

samstreamer=t (ss) Load reads multithreaded to increase speed.

coverage=t (cc) Calculate coverage, to better call variants.

ploidy=1 Set the organism's ploidy.

rarity=1.0 Penalize the quality of variants with allele fraction

lower than this. For example, if you are interested in

4% frequency variants, you could set both rarity and

minallelefraction to 0.04. This is affected by ploidy -

a variant with frequency indicating at least one copy

is never penalized.

covpenalty=0.8 (lowcoveragepenalty) A lower penalty will increase the

scores of low-coverage variants, and is useful for

low-coverage datasets.

useidentity=t Include average read identity in score calculation.

usepairing=t Include pairing rate in score calculation.

usebias=t Include strand bias in score calculation.

useedist=t Include read-end distance in score calculation.

homopolymer=t Penalize scores of substitutions matching adjacent bases.

nscan=t Consider the distance of a variant from contig ends when

calculating strand bias.

32bit=f Set to true to allow coverage tracking over depth 65535.

strandedcov=t Tracks per-strand ref coverage to print the DP4 field.

Requires more memory when enabled.

callsub=t Call substitutions.

calldel=t Call deletions.

callins=t Call insertions.

nopassdot=f Use . as genotype for variations failing the filter.

Trimming parameters:

border=5 Trim at least this many bases on both ends of reads.

qtrim=r Quality-trim reads on this end

r: right, l: left, rl: both, f: don't quality-trim.

trimq=10 Quality-trim bases below this score.

Realignment parameters:

realign=f Realign all reads with more than a couple mismatches.

Decreases speed. Recommended for aligners other than BBMap.

unclip=f Convert clip symbols from exceeding the ends of the

realignment zone into matches and substitutitions.

repadding=70 Pad alignment by this much on each end. Typically,

longer is more accurate for long indels, but greatly

reduces speed.

rerows=602 Use this many rows maximum for realignment. Reads longer

than this cannot be realigned.

recols=2000 Reads may not be aligned to reference seqments longer

than this. Needs to be at least read length plus

max deletion length plus twice padding.

msa= Select the aligner. Options:

MultiStateAligner11ts: Default.

MultiStateAligner9PacBio: Use for PacBio reads, or for

Illumina reads mapped to PacBio/Nanopore reads.

Sam-filtering parameters:

minpos= Ignore alignments not overlapping this range.

maxpos= Ignore alignments not overlapping this range.

minreadmapq=4 Ignore alignments with lower mapq.

contigs= Comma-delimited list of contig names to include. These

should have no spaces, or underscores instead of spaces.

secondary=f Include secondary alignments.

supplimentary=f Include supplimentary alignments.

invert=f Invert sam filters.

Variant-Calling Cutoffs:

minreads=2 Ignore variants seen in fewer reads.

minqualitymax=15 Ignore variants with lower max base quality.

minedistmax=20 Ignore variants with lower max distance from read ends.

minmapqmax=0 Ignore variants with lower max mapq.

minidmax=0 Ignore variants with lower max read identity.

minpairingrate=0.1 Ignore variants with lower pairing rate.

minstrandratio=0.1 Ignore variants with lower plus/minus strand ratio.

minquality=12.0 Ignore variants with lower average base quality.

minedist=10.0 Ignore variants with lower average distance from ends.

minavgmapq=0.0 Ignore variants with lower average mapq.

minallelefraction=0.1 Ignore variants with lower allele fraction. This

should be adjusted for high ploidies.

minid=0 Ignore variants with lower average read identity.

minscore=20.0 Ignore variants with lower Phred-scaled score.

clearfilters Reset all filters to zero. Filter flags placed after

the clearfilters flag will still be applied.

There are additionally max filters for score, quality, mapq, allelefraction,

and identity.

Java Parameters:

-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

ORF予測

>callgenes.sh

$ callgenes.sh

Written by Brian Bushnell

Last modified December 19, 2018

Description: Finds orfs and calls genes in unspliced prokaryotes.

This includes bacteria, archaea, viruses, and mitochondria.

Can also predict 16S, 23S, 5S, and tRNAs.

Usage: callgenes.sh in=contigs.fa out=calls.gff outa=aminos.faa

File parameters:

in=<file> A fasta file.

out=<file> Output gff file.

outa=<file> Amino acid output.

model=<file> A pgm file or comma-delimited list.

If unspecified a default model will be used.

compareto= Optional reference gff file to compare with the gene calls.

Other parameters:

minlen=60 Don't call genes shorter than this.

trd=f (trimreaddescription) Set to true to trim read headers after

the first whitespace. Necessary for IGV.

merge=f For paired reads, merge before calling.

detranslate=f Output canonical nucleotide sequences instead of amino acids.

recode=f Re-encode nucleotide sequences over called genes, leaving

non-coding regions unchanged.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

> analyzegenes.sh

$ analyzegenes.sh

Written by Brian Bushnell

Last modified September 27, 2018

Description: Generates a prokaryotic gene model (.pkm) for gene calling.

Input is fasta and gff files.

The .pkm file may be used by CallGenes.

Usage: analyzegenes.sh in=x.fa gff=x.gff out=x.pgm

File parameters:

in=<file> A fasta file or comma-delimited list of fasta files.

gff=<file> A gff file or comma-delimited list. This is optional;

if present, it must match the number of fasta files.

If absent, a fasta file 'foo.fasta' will imply the

presence of 'foo.gff'.

out=<file> Output pgm file.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

メタゲノムエラーコレクション

> bbcms.sh

r$ bbcms.sh

Written by Brian Bushnell

Last modified July 24, 2018

Description: Error corrects reads and/or filters by depth, storing

kmer counts in a count-min sketch (a Bloom filter variant).

This uses a fixed amount of memory. The error-correction algorithm is taken

from Tadpole; with plenty of memory, the behavior is almost identical to

Tadpole. As the number of unique kmers in a dataset increases, the accuracy

decreases such that it will make fewer corrections. It is still capable

of making useful corrections far past the point where Tadpole would crash

by running out of memory, even with the prefilter flag. But if there is

sufficient memory to use Tadpole, then Tadpole is more desirable.

Because accuracy declines with an increasing number of unique kmers, it can

be useful with very large datasets to run this in 2 passes, with the first

pass for filtering only using a 2-bit filter with the flags tossjunk=t and

ecc=f (and possibly mincount=2 and hcf=0.4), and the second pass using a

4-bit filter for the actual error correction.

Usage: bbcms.sh in=<input file> out=<output> outb=<reads failing filters>

Example of use in error correction:

bbcms.sh in=reads.fq out=ecc.fq bits=4 hashes=3 k=31 merge

Example of use in depth filtering:

bbcms.sh in=reads.fq out=high.fq outb=low.fq k=31 mincount=2 ecc=f hcf=0.4

Error correction and depth filtering can be done simultaneously.

File parameters:

in=<file> Primary input, or read 1 input.

in2=<file> Read 2 input if reads are in two files.

out=<file> Primary read output.

out2=<file> Read 2 output if reads are in two files.

outb=<file> (outbad/outlow) Output for reads failing mincount.

outb2=<file> (outbad2/outlow2) Read 2 output if reads are in two files.

extra=<file> Additional comma-delimited files for generating kmer counts.

ref=<file> If ref is set, then only files in the ref list will be used

for kmer counts, and the input files will NOT be used for

counts; they will just be filtered or corrected.

overwrite=t (ow) Set to false to force the program to abort rather than

overwrite an existing file.

Hashing parameters:

k=31 Kmer length, currently 1-31.

hashes=3 Number of hashes per kmer. Higher generally reduces

false positives at the expense of speed; rapidly

diminishing returns above 4.

minprob=0.5 Ignore kmers with probability of being correct below this.

memmult=1.0 Fraction of free memory to use for Bloom filter. 1.0 should

generally work; if the program crashes with an out of memory

error, set this lower. You may be able to increase accuracy

by setting it slightly higher.

cells= Option to set the number of cells manually. By default this

will be autoset to use all available memory. The only reason

to set this is to ensure deterministic output.

seed=0 This will change the hash function used.

Depth filtering parameters:

mincount=0 If positive, reads with kmer counts below mincount will

be discarded (sent to outb).

hcf=1.0 (highcountfraction) Fraction of kmers that must be at least

mincount to pass.

requireboth=t Require both reads in a pair to pass in order to go to out.

When true, if either read has a count below mincount, both

reads in the pair will go to outb. When false, reads will

only go to outb if both fail.

tossjunk=f Remove reads or pairs with outermost kmer depth below 2.

(Suggested params for huge metagenomes: mincount=2 hcf=0.4 tossjunk=t)

Error correction parameters:

ecc=t Perform error correction.

bits= Bits used to store kmer counts; max count is 2^bits-1.

Supports 2, 4, 8, 16, or 32. 16 is best for high-depth data;

2 or 4 are for huge, low-depth metagenomes that saturate the

bloom filter otherwise. Generally 4 bits is recommended for

error-correction and 2 bits is recommended for filtering only.

ecco=f Error-correct paired reads by overlap prior to kmer-counting.

merge=t Merge paired reads by overlap prior to kmer-counting, and

again prior to correction. Output will still be unmerged.

smooth=3 Remove spikes from kmer counts due to hash collisions.

The number is the max width of peaks to be smoothed; range is

0-3 (3 is most aggressive; 0 disables smoothing).

This also affects tossjunk.

Java Parameters:

-Xmx This will be passed to Java to set memory usage, overriding the program's automatic memory detection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs. The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

実行方法

1、callvariants.sh

bam/samからバリアントコールを行う。

1-1、リファレンスにペアエンドfastqをmappingしてbamかsamを得る。

bbmap.sh ref=ref.fa in1=pair_1.fq in2=pair_2.fq out=output.sam nodisk
samtools sort -@ 12 -O BAM output.sam > sort.bam && samtools index -@ 12 sort.bam

indexディレクトリが作られ、リードがリファレンスにアライメントされる。defaultだとマシンの全コアが計算に使用される（明示するなら"t="）。それからsortコマンドでcoordinate sortedされたbamを得る。

1-2、入力のbam/samとリファレンスを指定してcallvariantsを実行する。ここでは倍数性(ploidy)2で実行。

callvariants.sh ploidy=2 in=sort.bam ref=ref.fa vcf=output.vcf

callsub=t Call substitutions
calldel=t Call deletions
callins=t Call insertions
ploidy=1 Set the organism's ploidy
in=<file> Input; may be one file or multiple comma-delimited files
vcf=<file> Output variant list in vcf format

バクテリアゲノムだと数秒でランは終わる。 vcf=を指定時の出力はVCF v4.2に従う。

2、callgenes.sh

contigやゲノム配列からprokaryotesの遺伝子配列を予測する。ここでは予測した遺伝子のアミノ酸配列も出力している。

callgenes.sh in=ref.fa out=output outa=amnoacids.fa

NGSのリードから実行することもできる。

callgenes.sh in=pair_1.fq in2=pair_2.fq out=output outa=amnoacids.fa

リードそれぞれから予測するため、出力は相応に大きくなることに注意。

3、bbcms.sh

メタゲノムシーケンシングリードのエラーコレクション。

入力のペアエンドfastqと出力のペアエンドfastqをそれぞれ指定する。

bbcms.sh in=pair_1.fq in2=pair_2.fq \
 out=error_corrected_1.fq.gz out2=error_corrected_2.fq.gz \
 outb=failing_mincount_1.fq.gz outb2=failing_mincount_2.fq.gz

in=<file> Primary input, or read 1 input.
in2=<file> Read 2 input if reads are in two files.
out=<file> Primary read output.
out2=<file> Read 2 output if reads are in two files.
outb=<file> Output for reads failing mincount.
outb2=<file> Read 2 output if reads are in two files.

参考

Question: bbmap callvariants - how to add sample names, how to get 0/0 alleles back?

http://callvariants.sh

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ショートリードのマッピングを行う Whisper

メタバーコディングのデータベース配列キュレーションなどを行うツールキット MetaCurator

ヒトのガン原遺伝子/腫瘍抑制遺伝子の変異を視覚化するwebツール Mutplot

バクテリア、アーキア、プラスミドの複製起点（ori）データベース DoriC

ショートリードによるpolishingも行う高速なロングリードアセンブラ Raven (旧名 Ra)

(Omics向け) 従来のベン図表現を拡張する DiVenn

多機能なNGS分析ツール BBtools　其の３BBMap追加コマンド