2019-03-17

高速なロングリードのアセンブリツール wtdbg2 (Redbean)

2019 4/15 Githubリンクの誤り修正、7/3 名前修正、12/11 論文引用追加、タイトル修正

2022/06/07 インストール手順修正

　デノボシーケンスアセンブリは、比較的短いシーケンシングリードからサンプルゲノムを再構築する。リファレンスゲノムは関心のある領域を欠いている可能性があるため、マッピングベースの分析に失敗することが多い、新種および構造ゲノムの変化の研究に不可欠である。 Pacific Biosciences（PacBio）およびOxford Nanopore Technologies（ONT）による一分子シークエンシング技術の急速な進歩により、10〜100キロベース（kb）のリードを手頃なコストでシークエンスすることができる。このようなロングリードは霊長類の主要なrepetitiveクラスを解決し、アセンブリの連続性を改善するのに役立つ。今日では、いくつかの高品質のアセンブラが開発されたおかげで、ロングリードアセンブリはバクテリアや小さなゲノムにとっては日常的な作業となっている（ref.1-5）。しかし、哺乳類のゲノムの場合、既存のアセンブラには数CPU年かかる場合がある。GoogleまたはAmazonのクラウドをコンピューティングに使用する場合、最大1000ドルかかることがある。（一部略）この問題に対処するために、著者らはwtdbg2を開発した。これは大きなゲノムのためのアセンブリの品質をほとんど妥協することなく何十倍も速い、新しいロングリードアセンブラである。

　Wtdbg2は、概して、重複レイアウトコンセンサスパラダイムに従っている。これは、高速の全対全リードアライメント実装と、フfuzzy-Bruijn graph (FBG) に基づく新しいレイアウトアルゴリズムによって、既存のアセンブラを進歩させる。哺乳類のゲノムの場合、現在のリードオーバーラッパー（ref.7〜9）は、入力のリードを多数の小さなバッチに分割し、バッチ間の全対全アライメントを実行する。この方法では、ファイルのI / Oが繰り返されたり、情報が得られないk-merのインデックス付けやクエリが行われたりすると、計算時間が無駄になる。これらのオーバーラッパーは、大量のメモリを消費する可能性があるので、単一のハッシュテーブルを作成しない。興味深いことに、これは大きな問題ではないはずである。 Wtdbg2は最初にすべてのリードをメモリにロードし、k-merの発生をカウントする。次に、リードを256bpごとに1つのユニットにビンし、k-merが2回以上発生していて値が関連するビンの識別子になっているハッシュテーブルを作成する。例えば、60倍のカバレッジでシーケンシングされたCHM1ヒトゲノムの場合、デフォルト設定ではわずか15億の非ユニークなk-merしかない。rawリード配列をメモリにステージングしてハッシュテーブルを構築すると、ピーク時に250GBかかる。これは、ショートリードのアセンブラのメモリ使用量に相当する。
　Wtdbg2ビンは、配列の次のステップである動的計画法（DP）を高速化するためにシーケンスを読み取る。 256bpビニングでは、DP行列は、１ベース当たりのDP行列よりも65536６（＝ 256×256）倍小さい。これにより、k-merベースまたはベースレベルのDPと比較して、DPが大幅に縮小される。

（一段落省略）
　CANU-1.8、FALCON-180831、Flye-2.3.6、MECAT-180314とともに、5つのデータセットでwtdbg2 v2.3を評価した（Preprint表1）。minimap2を使用して、アセンブリされたコンティグをリファレンスゲノムにアライメントさせ、そして測定基準を収集した。すべてのデータセットで、wtdbg2は最も近い競合の少なくとも4倍の速さだった。 Wtdbg2アセンブリは、リファレンスゲノムをややカバーしないことがあり、それがwtdbg2の弱点だが、その理由はコンティグはduplicationsが少ない傾向があるためである。 1つのコンティグによってカバーされるゲノム領域の割合は、他のアセンブラと比較して2、3パーセントだけ低い。

wtdbg2に関するツイート

Interested in a long-read assembler with the speed of miniasm and comparable consensus accuracy of canu? Try wtdbg2 by my friend Jue Ruan. A human genome in one day. Also available from bioconda. https://t.co/H3CIshjLGY
— Heng Li (@lh3lh3) October 23, 2018

追記

Jue Ruan detailing his brilliant algorithm wtdbg2 (now entitled "redbean"). Uses fuzzy 'k-bins' instead of k-mers for DeBruijn graph assembly #UQwinterschool pic.twitter.com/TpYDGmPSdA
— Martin A. Smith (@martinalexsmith) July 2, 2019

インストール

ubuntu16.04でテストした（ホストOS macos10.12）。

依存

Wtdbg2 only works on 64-bit Linux. To compile, please type make in the source code directory. You can then copy wtdbg2 and wtpoa-cns to your PATH.

本体 Github

git clone https://github.com/ruanjue/wtdbg2
cd wtdbg2 && make

#Anacondaを使っているならcondaで導入できる(linuxのみ)
mamba install -y -c bioconda wtdbg

> ./wtdbg2

WTDBG: De novo assembler for long noisy sequences

Author: Jue Ruan <ruanjue@gmail.com>

Version: 2.3 (20181206)

Usage: wtdbg2 [options] -i <reads.fa> -o <prefix> [reads.fa ...]

Options:

-i <string> Long reads sequences file (REQUIRED; can be multiple),

-o <string> Prefix of output files (REQUIRED),

-t <int> Number of threads, 0 for all cores, [4]

-f Force to overwrite output files

-x <string> Presets, comma delimited,

rsII/rs: -p 21 -S 4 -s 0.05 -L 5000

sequel/sq

nanopore/ont:

(genome size < 1G) -p 0 -k 15 -AS 2 -s 0.05 -L 5000

(genome size >= 1G) -p 19 -AS 2 -s 0.05 -L 5000

corrected/ccs: -p 21 -k 0 -AS 4 -K 0.05 -s 0.5

Example: '-e 3 -x ont -S 1' in parsing order, -e will be 3, -S will be 1

-g <number> Approximate genome size (k/m/g suffix allowed) [0]

-X <float> Choose the best <float> depth from input reads(effective with -g) [50]

-L <int> Choose the longest subread and drop reads shorter than <int> (5000 recommended for PacBio) [0]

Negative integer indicate keeping read names, e.g. -5000.

-k <int> Kmer fsize, 0 <= k <= 25, [0]

-p <int> Kmer psize, 0 <= p <= 25, [21]

k + p <= 25, seed is <k-mer>+<p-homopolymer-compressed>

-K <float> Filter high frequency kmers, maybe repetitive, [1000.05]

>= 1000 and indexing >= (1 - 0.05) * total_kmers_count

-E <int> Min kmer frequency, [2]

-S <float> Subsampling kmers, 1/(<-S>) kmers are indexed, [4.00]

-S is very useful in saving memeory and speeding up

please note that subsampling kmers will have less matched length

-l <float> Min length of alignment, [2048]

-m <float> Min matched length by kmer matching, [200]

-A Keep contained reads during alignment

-s <float> Min similarity, calculated by kmer matched length / aligned length, [0.05]

-e <int> Min read depth of a valid edge, [3]

-q Quiet

-v Verbose (can be multiple)

-V Print version information and then exit

--help Show more options

> ./wtpoa-cns -h

WTPOA-CNS: Consensuser for wtdbg using PO-MSA

Author: Jue Ruan <ruanjue@gmail.com>

Version: 2.3

Usage: wtpoa-cns [options]

Options:

-t <int> Number of threads, [4]

-d <string> Reference sequences for SAM input, will invoke sorted-SAM input mode

-u Only process reference regions present in/between SAM alignments

-r Force to use reference mode

-p <string> Similar with -d, but translate SAM into wtdbg layout file

-i <string> Input file(s) *.ctg.lay from wtdbg, +, [STDIN]

Or sorted SAM files when having -d

-o <string> Output files, [STDOUT]

-f Force overwrite

-j <int> Expected max length of node, or say the overlap length of two adjacent units in layout file, [1500] bp

-b <int> Bonus for tri-bases match, [0]

-M <int> Match score, [2]

-X <int> Mismatch score, [-5]

-I <int> Insertion score, [-2]

-D <int> Deletion score, [-4]

-H <float> Homopolymer merge score used in dp-call-cns mode, [-3]

-B <int> Bandwidth, [96]

-W <int> Window size in the middle of the first read for fast align remaining reads, [200]

If $W is negative, will disable fast align, but use the abs($W) as Band align score cutoff

-w <int> Min size of aligned size in window, [$W * 0.5]

In sorted-SAM input mode, -w is the sliding window size [2000]

-A Abort TriPOA when any read cannot be fast aligned, then try POA

-S <int> Shuffle mode, 0: don't shuffle reads, 1: by shared kmers, 2: subsampling. [1]

-R <int> Realignment bandwidth, 0: disable, [16]

-c <int> Consensus mode: 0, run-length; 1, dp-call-cns, [0]

-C <int> Min count of bases to call a consensus base, [3]

-F <float> Min frequency of non-gap bases to call a consensus base, [0.5]

-N <int> Max number of reads in PO-MSA [20]

Keep in mind that I am not going to generate high accurate consensus sequences here

-x <string> Presets,

sam-sr: polishs contigs from short reads mapping, accepts sorted SAM files

shorted for '-w 200 -j 150 -R 0 -b 1 -c 1 -N 50 -rS 2'

-v Verbose

-V Print version information and then exit

> > wtdbg2 -V

$ wtdbg2 -V

wtdbg2 2.3

2019年3月16日現在v2.3が入る。

実行方法

１、ドラフトアセンブリ

wtdbg2 -t 16 -i reads.fa.gz -fo prefix

-i Long reads sequences file (REQUIRED; there can be multiple -i),
-t Number of threads, 0 for all cores, [4]
-f Force to overwrite output files
-o Prefix of output files (REQUIRED),

複数のファイルができる。

f:id:kazumaxneo:20181203123621j:plain

7GBのONTリードを使ったランタイムはジャスト６分だった（*1）。

２、コンセンサスコールを行いfastaファイル出力

wtpoa-cns -t 16 -i prefix.ctg.lay.gz -fo prefix.ctg.lay.fa

7GBのONTリードを使ったランタイムは10分ほどだった（*1）。

ランが終わると"output.ctg.lay.fa"ができる。

引用

Fast and accurate long-read assembly with wtdbg2
Jue Ruan, Heng Li

bioRxiv preprint first posted online Jan. 26, 2019

Fast and accurate long-read assembly with wtdbg2

Jue Ruan, Heng Li
Nature Methods (2019)

mac pro 2012でdockerを使用してラン。thread==16

関連

2019-03-16

コンタミやダメージを考慮してAncient DNAのシーケンシングリードをシミュレートする gargammel

2017 Bioinformatics Ancient DNA simulator GC docker

　Ancient DNA（aDNA）とも呼ばれるsubfossilsから回収されたDNAは、populationの歴史を再構築するためにますます使用されている（Leonardi et al、2016）。しかし、下流の推論に影響を与える可能性があるいくつかの要因があるため、aDNAデータの分析は依然として困難である。第一に、DNAは時間とともに分解する傾向があり、かなりのヌクレオチドのmisincorporationsを示す限られたサイズ（30〜80 bp）のフラグメントをもたらす（Briggs et al、2007）。第二に、生物死後に環境微生物がコロニーを形成する傾向がある（Green et al、2009 pubmed）。その結果、内因性DNA比率が極端に減少することがあり、ショットガンシークエンシングアプローチが非経済的になる。第三に、そのような外因性配列は、リードアラインメント中に正しく同定されないと、Ancient DNAの再構築に影響を及ぼし得る。ヒト族から回収されたDNAの場合、発掘中および実験室中を含む任意の段階で導入され得る現在のヒトDNA混入は、単一のサンプル内に無関係な集団の歴史を混在させるので特に問題がある。

　Ancient DNA研究者は、人口パラメータの推論を目的とした要約統計量の頑健性をテストするためにシミュレーションを頻繁に使用する。いくつかのパッケージはシーケンシングに起因するプラットフォーム特有のエラーをシミュレートしているが、損傷、断片化、ヒトおよび微生物汚染などのそれらの最も顕著な特徴を含むaDNA配列データセットを適切にシミュレートするパッケージは現在利用できない。

　ここでは、微生物画分、内因性画分、そして今日の人間の汚染を表す一連のゲノム参照からのaDNA配列データセットをシミュレートするパッケージであるgargammelを紹介する。このパッケージは、死後のDNA損傷や塩基の誤取り込みなど、DNA配列の最も一般的な機能をシミュレートすることができる。さらに、ライブラリーの調製に使用される分子ツールからのバイアス、GCに富むフラグメントからのバイアス、およびシーケンシングプラットフォームによってもたらされるエラーをシミュレートする。

Gargammel flowchart. 　論文より転載。

gargammel by grenaud

Wanna simulate ancient DNA sequence datasets, including contamination and bacterial communities? go see gargammel!https://t.co/sczWdMlL2b
— Ludovic Orlando (@LudovicLorlando) October 29, 2016

インストール

依存

git
C++ compiler supporting C++11
cmake, you can install on Ubuntu by typing: sudo apt install cmake
zlib
lib gsl, you can install on Ubuntu by typing: sudo apt-get install libgsl0-dev

If you plan on using ms2chromosomes.py to simulate chromosomes based on ms, you also need:

Hudson's ms (see: http://home.uchicago.edu/rhudson1/source/mksamples.html)
seq-gen, you can install on Ubuntu by typing: sudo apt install seq-gen

本体　Github

git clone https://github.com/grenaud/gargammel.git
cd gargammel/
make

> perl gargammel.pl

# perl gargammel.pl

This script is a wrapper to run the different programs to create

a set of Illumina reads for ancient DNA from a set of fasta files

representing the endogenous, the contaminant from the same species

and the bacterial contamination.

usage: gargammel.pl <options> [directory with fasta directories]

This directory should contain 3 directories:

bact/ The bacterial contamination

cont/ The contamination from the same species

endo/ The endogenous material

Options:

--comp [B,C,E] Composition of the final set in fraction

the 3 numbers represent the bacterial, contaminant and endogenous

ex: --comp 0.6,0.02,0.38 will result

in 60% bacterial contamination while the rest will be from the same

species 5% will be contamination and 95% will be endogenous

Default: --comp 0,0,1

--mock Do nothing, just print the commands that will be run

-o Output prefix (default: [input dir]/simadna)

Either specify:

-n [number] Generate [number] fragments (default: 1000)

-c [coverage] Endogenous coverage

Fragment selection

===================

--misince [file] Base misincorporation for the endogenous fragments (default none)

--misincc [file] Base misincorporation for the contaminant fragments (default none)

--misincb [file] Base misincorporation for the bacterial fragments (default none)

--distmis [dist] Distance to consider for base misincorporation (default 1)

this file is obtained from mapDamage

Fragment size distribution: specify either one of the 3 possible options:

-l [length] Generate fragments of fixed length (default: 35)

-s [file] Open file with size distribution (one fragment length per line)

-f [file] Open file with size frequency in the following format:

length[TAB]freq ex:

40 0.0525

41 0.0491

...

Length options:

--loc [file] Location for lognormal distribution (default none)

--scale [file] Scale for lognormal distribution (default none)

Fragment size limit:

--minsize [length] Minimum fragments length (default: 0)

--maxsize [length] Maximum fragments length (default: 1000)

Fragment methylation:

--methyl Allow for lowercase C and G to denote

methylated cytosines on the + and - strand respectively

(default: not used)

Deamination

===================

To add deamination to the bacterial and endogenous material,you can specify

either one of these options:

-mapdamage [mis.txt] [protocol] Read the miscorporation file [mis.txt]

produced by mapDamage

[protocol] can be either "single" or "double" (without quotes)

Single strand will have C->T damage on both ends

Double strand will have and C->T at the 5' end and G->A damage at the 3' end

-matfile [matrix file prefix] Read the matrix file of substitutions

Provide the prefix only, both files must end with

5.dat and 3.dat

-damage [v,l,d,s] For the Briggs et al. 2007 model

The parameters must be comma-separated e.g.: -damage 0.03,0.4,0.01,0.3

v: nick frequency

l: length of overhanging ends (geometric parameter)

d: prob. of deamination of Cs in double-stranded parts

s: prob. of deamination of Cs in single-stranded parts

Alternatively, you can specify these options independently for the endogenous (e), bacterial (b)

and present-day human contaminant (c) using the following options:

-mapdamagee [mis.txt] [protocol] Endogenous mapDamage misincorporation file

-matfilee [matrix file prefix] Endogenous matrix file of substitutions

-damagee [v,l,d,s] Endogenous Briggs parameters

-mapdamageb [mis.txt] [protocol] Bacterial mapDamage misincorporation file

-matfileb [matrix file prefix] Bacterial matrix file of substitutions

-damageb [v,l,d,s] Bacterial Briggs parameters

-mapdamagec [mis.txt] [protocol] Human contaminant mapDamage misincorporation file

-matfilec [matrix file prefix] Human contaminant matrix file of substitutions

-damagecd [v,l,d,s] Human contaminant Briggs parameters

please note that if you do specify deamination for one source but not for another, no deamination will be added

If using --methyl, you can also specify different matrix file for methylated

-matfilenonmeth [matrix file prefix] Read the matrix file of substitutions for non-methylated Cs

Provide the prefix only, both files must end with

5.dat and 3.dat

-matfilemeth [matrix file prefix] Read the matrix file of substitutions for methylated Cs

Provide the prefix only, both files must end with

5.dat and 3.data

Adapter and sequencing

===================

-fa [seq] Adapter that is observed after the forward read (Default: AGATCGGAAG...)

-sa [seq] Adapter that is observed after the reverse read (Default: AGATCGGAAG...)

-rl [length] Desired read length (Default: 75)

-se use single-end sequencing (Default: paired-end)

The following options change the sequencing error rate, please note that positive factor

will decrease the rate of such errors and a negative one will increase it.

-qs [factor] Increase error rate for forward reads by a factor of 1/(10^([factor]/10)) (Default: 0)

-qs2 [factor] Increase error rate for reverse reads by a factor of 1/(10^([factor]/10)) (Default: 0)

-ss [system] Illumina platfrom to use, the parentheses indicate the max. read length

use the shorthand in the left column:

(single-end, paired-end)

GA2 - GenomeAnalyzer II ( 50bp, 75bp)

HS20 - HiSeq 2000 ( 100bp, 100bp)

HS25 - HiSeq 2500 ( 125bp, 150bp) (Default)

HSXt - HiSeqX TruSeq ( 150bp, 150bp)

MSv1 - MiSeq v1 ( 250bp, 250bp)

MSv3 - MiSeq v3 ( 250bp, 250bp)

依存ツールも含めてビルドされる。しかしsamtoolsがビルドされなかったので、手っ取り早くapt installで古いバージョンを入れた。

dockerイメージも上げておきます。

$ docker run -itv $PWD:/data/ kazumax/gargammel
> source ~/.profile
> gargammel.pl -h

テストラン

1、リファレンスを作る。

cd gargammel/ && mkdir data && cd data/

#実行
python ../ms2chromosomes.py -s 0.2 -f . -n 1000 

#cleanup（作業ファイルが）
rm -rfv simul_* seedms

> ls -alh *

# ls -alh *

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 ref.fa

bact:

total 132K

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 .

drwxr-xr-x 5 root root 124K Mar 15 11:02 ..

cont:

total 20M

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 .

drwxr-xr-x 5 root root 124K Mar 15 11:02 ..

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 cont.1.fa

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 cont.2.fa

endo:

total 20M

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 .

drwxr-xr-x 5 root root 124K Mar 15 11:02 ..

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 endo.1.fa

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 endo.2.fa

-rw-r--r-- 1 root root 80K Mar 15 11:01 segsites

ref.faと３つのサブディレクトリ（bact、cont、endo）ができる。bactはバクテリアのコンタミ、contは同じ種からのコンタミ、endoは内性（endogenous）。endogenousのendo.1.faとendo.2.faのheterozygousポジションは以下のファイルにまとめられている。

> head data/endo/segsites

# head data/endo/segsites

>ref_1 312 C T

>ref_1 452 G T

>ref_1 1135 A G

>ref_1 1478 T A

>ref_1 1522 G T

>ref_1 2016 G C

>ref_1 2227 C T

>ref_1 4486 A C

>ref_1 5080 C G

>ref_1 5820 G A

２、リファレンスが用意できたらaDNAをシミュレートする。コマンドの最後には、１で作ったディレクトリを指定する。

cd ..

perl gargammel.pl -c 3 --comp 0,0.08,0.92 -f src/sizefreq.size.gz -matfile src/matrices/single- -o data/simulation data/

> ls -alh data/

# ls -alh data/

total 200M

drwxr-xr-x 1 root root 4.0K Mar 15 17:15 .

drwxr-xr-x 1 root root 4.0K Mar 15 10:58 ..

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 bact

drwxr-xr-x 1 root root 4.0K Mar 15 17:13 cont

drwxr-xr-x 1 root root 4.0K Mar 15 17:13 endo

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 ref.fa

-rw-r--r-- 1 root root 0 Mar 15 17:13 simulation.b.fa.gz

-rw-r--r-- 1 root root 1.3M Mar 15 17:13 simulation.c.fa.gz

-rw-r--r-- 1 root root 28M Mar 15 17:13 simulation.e.fa.gz

-rw-r--r-- 1 root root 36M Mar 15 17:14 simulation_a.fa.gz

-rw-r--r-- 1 root root 30M Mar 15 17:13 simulation_d.fa.gz

-rw-r--r-- 1 root root 48M Mar 15 17:14 simulation_s1.fq.gz

-rw-r--r-- 1 root root 49M Mar 15 17:14 simulation_s2.fq.gz

roo

simulation_s1.fq.gzとsimulation_s2.fq.gzが最終出力。

実行例

Githubには他にもいくつかの例がある。

１、非常に低いcoverageの40-bpリードを発生させる。

gargammel.pl -c 0.5 --comp 0,0,1 -l 40 -o data/simulation data/

-c Endogenous coverage
--comp [B,C,E] Composition of the final set in fraction the 3 numbers represent the bacterial, contaminant and endogenous
ex: --comp 0.6,0.02,0.38 will result in 60% bacterial contamination while the rest will be from the same species 5% will be contamination and 95% will be endogenous Default: --comp 0,0,1
-l Generate fragments of fixed length (default: 35)
-o Output prefix (default: [input dir]/simadna)

２、20xの高いcoverageだが現代のコンタミネーションが激しい（40%）45-bpのリードを発生させる。

gargammel.pl -c 20 --comp 0,0.4,0.6 -l 45 -o data/simulation data/

３、Briggs et al. 2007 model（pubmed）を使い、脱アミノ化した二本鎖DNAを1M発生させる。

gargammel.pl -n 1000000 --comp 0,0,1 -l 40 -damage 0.03,0.4,0.01,0.3 -o data/simulation data/

-n Generate [number] fragments (default: 1000)
-damage [v ,l ,d ,s] For the Briggs et al. 2007 model. The parameters must be comma-separated e.g.: -damage

他の例はGithubを確認して下さい。丁寧に説明されています。

引用

gargammel: simulations of ancient DNA datasets

Gabriel Renaud, Kristian Hanghøj, Eske Willerslev, Ludovic Orlando
Bioinformatics, Volume 33, Issue 4, 15 February 2017, Pages 577–579

2019-03-15

グラフィック出力が利用できない環境で頻度分布を素早く確認できる bashplotlib

tips インフォマティクス解析をサポートするツール結果の視覚化 (visualization)

bashplotlibは、端末で基本的なプロットを作成するためのpythonパッケージおよびコマンドラインツール。 GUIがない場合にデータを視覚化する簡単な方法を提供する。pythonでコーディングされており、pipを使って素早くインストールできる。

特徴（githubより）

quick plotting from the command line
customize the color, size, title, and shape of plots
pipe data into plots with stdin

bashplotlibに関するツイート

インストール

ubuntu16.04のpython2.7.13環境でテストした。

本体　Github

pip install bashplotlib

#from source
git clone git@github.com:glamp/bashplotlib.git
cd bashplotlib
python setup.py install

> hist --help

$ hist --help

Usage: hist is a command for making histograms. it accepts a series of values in one of the following formats:

1) txt file w/ 1 column of numbers

2) standard in piped from another command line cat or curl

for some examples of how to use hist, you can type the command:

hist --demo

or visit https://github.com/glamp/bashplotlib/blob/master/examples/sample.sh

Options:

-h, --help show this help message and exit

-f F, --file=F a file containing a column of numbers

-t T, --title=T title for the chart

-b B, --bins=B number of bins in the histogram

-w BINWIDTH, --binwidth=BINWIDTH

width of bins in the histogram

-s H, --height=H height of the histogram (in lines)

-p P, --pch=P shape of each bar

-x, --xlab label bins on x-axis

-c COLOUR, --colour=COLOUR

colour of the plot (blue, yellow, pink, default, aqua,

grey, green, black, white, red)

-d, --demo run demos

-n, --nosummary hide summary

-r, --regular use regular y-scale (0 - maximum y value), instead of

truncated y-scale (minimum y-value - maximum y-value)

> scatter --help

$ scatter --help

Usage: scatterplot is a command for making xy plots. it accepts a series of x values and a series of y values in the

following formats:

1) a txt file or standard in value w/ 2 comma seperated columns of x,y values

2) 2 txt files. 1 w/ designated x values and another with designated y values.

scatter -x <xcoords> -y <ycoords>

cat <file_with_x_and_y_coords> | scatter

Options:

-h, --help show this help message and exit

-f F, --file=F a csv w/ x and y coordinates

-t T, --title=T title for the chart

-x X x coordinates

-y Y y coordinates

-s SIZE, --size=SIZE y coordinates

-p PCH, --pch=PCH shape of point

-c COLOUR, --colour=COLOUR

colour of the plot (blue, yellow, pink, default, aqua,

grey, green, black, white, red)

実行方法

hist

以下のような1行、数値ファイル（1行１列）のテキストから頻度分布を出力する。

cd bashplotlib/examples/

#中身を確認
head data/exp.txt

$ head data/exp.txt

1.12578195876

0.16026238021

0.0392117875843

0.968428864579

0.334430039433

0.235098864615

1.09595064233

2.37425604196

1.18110389062

0.551066224631

hist実行

hist --file data/exp.txt

451| o

427| o

403| o

380| o

356| o

332| o

309| o

285| o

261| oo

238| oo

214| oo

190| oo

166| oo

143| oo

119| ooo

95| ooo

72| oooo

48| oooo

24| ooooo

1| ooooooooooo

-----------

---------------------------------

| Summary |

---------------------------------

| observations: 1000 |

| min value: 0.001718 |

| mean : 0.988786 |

| max value: 6.552654 |

---------------------------------

頻度が◯の数で示される。

テストラン

cd bashplotlib/examples/
./sample.sh

$ ./sample.sh

downloading data

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 424 0 424 0 0 813 0 --:--:-- --:--:-- --:--:-- 813

100 3421k 100 3421k 0 0 1672k 0 0:00:02 0:00:02 --:--:-- 2346k

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 392 0 392 0 0 1007 0 --:--:-- --:--:-- --:--:-- 1007

100 14618 100 14618 0 0 13349 0 0:00:01 0:00:01 --:--:-- 28274

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 400 0 400 0 0 1156 0 --:--:-- --:--:-- --:--:-- 1156

100 220k 100 220k 0 0 168k 0 0:00:01 0:00:01 --:--:-- 260k

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 396 0 396 0 0 1171 0 --:--:-- --:--:-- --:--:-- 1168

100 16723 100 16723 0 0 13662 0 0:00:01 0:00:01 --:--:-- 15.9M

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 398 0 398 0 0 1133 0 --:--:-- --:--:-- --:--:-- 1133

100 23 100 23 0 0 22 0 0:00:01 0:00:01 --:--:-- 0

% Total % Received % Xferd Average Speed Time Time Time Current

Dload Upload Total Spent Left Speed

100 398 0 398 0 0 1213 0 --:--:-- --:--:-- --:--:-- 1217

100 23 100 23 0 0 23 0 0:00:01 --:--:-- 0:00:01 0

plotting coordinates

------------------------------------------

| xxxxxx |

| x x |

| x xx |

| x xxxxxxx |

| x xxxxx|

| x |

| xx x xx |

| xx |

| xx xxx xx|

| xxx xx xx|

| xx x xxx |

| xx xx |

| x xx |

| xx xx |

| x xx |

| xxxx |

------------------------------------------

with x and y coords

--------------------------------------------

| x | |

| | |

| | x |

| | |

| x x| |

| | |

| ---------o-----------|

| | |

| x | |

| | |

| | x |

| | |

| | x |

| | x|

--------------------------------------------

plotting a histogram

451| o

427| o

403| o

380| o

356| o

332| o

309| o

285| o

261| oo

238| oo

214| oo

190| oo

166| oo

143| oo

119| ooo

95| ooo

72| oooo

48| oooo

24| ooooo

1| ooooooooooo

-----------

-----------------------------------

| Summary |

-----------------------------------

| observations: 1000 |

| min value: 0.001718 |

| mean : 0.988786 |

| max value: 6.552654 |

-----------------------------------

with colors

451| o

427| o

403| o

380| o

356| o

332| o

309| o

285| o

261| oo

238| oo

214| oo

190| oo

166| oo

143| oo

119| ooo

95| ooo

72| oooo

48| oooo

24| ooooo

1| ooooooooooo

-----------

-----------------------------------

| Summary |

-----------------------------------

| observations: 1000 |

| min value: 0.001718 |

| mean : 0.988786 |

| max value: 6.552654 |

-----------------------------------

changing the shape of the point

451| .

427| .

403| .

380| .

356| .

332| .

309| .

285| .

261| ..

238| ..

214| ..

190| ..

166| ..

143| ..

119| ...

95| ...

72| ....

48| ....

24| .....

1| ...........

-----------

-----------------------------------

| Summary |

-----------------------------------

| observations: 1000 |

| min value: 0.001718 |

| mean : 0.988786 |

| max value: 6.552654 |

-----------------------------------

adding x-labels

451| .

427| .

403| .

380| .

356| .

332| .

309| .

285| .

261| ..

238| ..

214| ..

190| ..

166| ..

143| ..

119| ...

95| ...

72| ....

48| ....

24| .....

1| ...........

-----------

0 1 2 3 5 6

. . . . . .

0 3 6 9 2 5

0 1 2 3 4 5

1 1 2 2 2 2

-----------------------------------

| Summary |

-----------------------------------

| observations: 1000 |

| min value: 0.001718 |

| mean : 0.988786 |

| max value: 6.552654 |

-----------------------------------

引用

GitHub - glamp/bashplotlib: plotting in the terminal

関連

2019-03-15

ウィルス分類器 viruses_classifier

2019 Nature Scientific Reports Viruses deep neural network virus

　次世代シーケンシング（NGS）の台頭により、メタゲノムは微生物生態学におけるゴールドスタンダードとなった。その限界、主にウイルス間の普遍的なマーカー遺伝子の欠如にもかかわらず、ウイルスメタゲノミクスはウイルス発見のための主要なツールとなっている（ref.1）。メタゲノムアプローチにより、培養を必要とせずにウイルスの発見が可能になっている。他方、明確に特定された宿主 - ウイルス共培養物の欠如は、新たに発見されたウイルスの宿主の同定を妨げる。この制限を回避するために、いくつかの実験方法が詳しく説明されている。例えば、溶菌性ファージは、自然界からフォスミドの使用によって発見され、それらの宿主にアサインされるかもしれない。選択したファージはphageFISHを使用している宿主と関連している可能性があるが、この方法では宿主遺伝子マーカー配列の予備知識が必要になる（ref.3）。別のアプローチ、ウイルスタギングは、蛍光染料での染色およびフローサイトメトリー選別を用いてWTウイルスをそれらの宿主と結び付けるが、培養可能な宿主に限定される（ref.4）。

　ウイルス - 宿主を連鎖させるプロセスは、ウエットの実験技術のみならずバイオインフォマティックツールでも実行できる。バクテリアゲノムのプロファージの検索に焦点を絞ったツール、たとえばPhaster（ref.5）、PhiSpy（ref.6）、Phage_Finder （ref.7）、Prophinder （ref.8）などがある。この場合、ファージ - 宿主の関係は明らかであるが、このアプローチは、ゲノムが、いくつかの生活環段階において宿主に組み込まれ、そしてその宿主が既にシーケンシングされているファージに限定される。したがって、ファージの宿主を予測するための他の多くの計算手法が存在する。宿主を門、綱、目、科、属、および種のレベルで予測するための計算方法は、Edwards et al（ref.9）によって広範に検討され、ベンチマークされている。以下の主題が評価の対象となった。いくつかの宿主ゲノムフラグメントのファージによる獲得、CRISPRシステムにおける以前のウイルス感染のサイン、ならびに宿主およびファージのオリゴヌクレオチドプロファイルの比較から生じる相同性。後に、Zhangら（ref.10）は、配列シグネチャとしてオリゴヌクレオチド頻度を使用して、属レベルでファージの宿主を予測するいくつかの機械学習アプローチ、- ロジスティック回帰、サポートベクターマシン、ランダムフォレスト、ガウスナイーブベイズ、およびベルヌーイナイーブベイズ、の適用に成功したことを報告した。ファージと宿主のオリゴヌクレオチドプロファイルを比較することは、Ahlgren et al（ref.11）によるより洗練された類似性測度の適用によって活用された。ファージ - 宿主関係の調査の他に、メタゲノムデータにおけるウイルスシグナルの検出のためのツールが利用可能であり、VirSorter（ref.12）（紹介）、VirFinder（ref.13）、およびMARVEL（ref.14）、次第に拡張されている。

　前述のアプローチは宿主 - ファージの関係を調べるために適用された（我々（ほん著者ら）の研究ではバクテリアとアーキアに感染するウイルスを指すために「ファージ」という用語を使う）が、真核生物ウイルスの宿主を推論する問題はあまり研究されていない。 2010年にKapoor et alが3つの新しいPicornaのようなウイルスの発見を報告して、それらの宿主分類群を推論する新しい方法を発表した。彼らがヌクレオチド組成分析（NCA）と名付けた技術は、異なる宿主に感染するウイルス間のヌクレオチド組成の違いを利用する。ウイルスは、それらのモノヌクレオチドおよびジヌクレオチド頻度ならびにジヌクレオチドバイアスによって表された。バイアスは、観察されたジヌクレオチドの頻度と、期待される頻度、これは２つの構成モノヌクレオチドの頻度を掛けることによって決定する、の比として評価された。判別分析を使用して、96％の予測精度で哺乳類、植物、昆虫に感染しているウイルスを識別することができた（ref.15）。この方法の適用は、巨大なウイルスグループではあるものの1つに限定されていたが、科学界の関心を集め、その後いくつかの（主にssRNA）真核生物ウイルスの宿主を推論するために類似の方法が適用された（ref.16-21）。残念なことにオーサーらは新たにシーケンシングされたウイルス宿主の分類群を予測する試みを強化できるツールをリリースしていない。ウイルス宿主を予測する際に、異なる機械学習、アライメント、アライメントフリーのk-merスペクトラムの相違に基づくアプローチを調査したグループもあるが、それらの研究は一部のグループのウイルスに限定されていた（ref.11, 22, 23, 2）。

　本著者らは最初により一般的な質問に答え、ファージと非ファージを区別できる分類器を開発することを目的とした。著者らの知る限り、現在知られている分類群を構成するウイルスは、真核生物またはバクテリア/アーキアのいずれかに感染するが、両方のグループに感染することはない。したがって、この作業はBLASTアライメント（ref.26）を単独で使用することで簡単に思えるかもしれない。しかしながら、高度に多様性のあるウイルスは、以前にシーケンシングされたウイルスとの配列類似性を欠く可能性があり、そして新たなウイルス属が依然として発見されている（例えばPandoravirus）。さらに、さまざまな生物学的材料、例えば糞便サンプルはファージと真核生物ウイルス（ref.15）を含んでいるので、サンプルの起源は新しくシーケンシングされたウイルスがファージであるかどうかを決定するのに役立たない。ここでは、この点で役立つ新しいツールを紹介する。 Host Taxon Predictor（HTP）は、標的グループのウイルスという意味での万能ツールであり、ウイルスが真核生物に感染するのかバクテリア/アーキアに感染するのかを非常に高い精度で予測することができる。 HTPのソースコードはhttps://github.com/wojciech-galan/viruses_classifierにある。

　研究の過程において、ウイルスヌクレオチド配列は単純な配列シグネチャ（モノヌクレオチド、ジヌクレオチド絶対頻度およびジヌクレオチド、トリヌクレオチド相対頻度）および核酸のタイプ（DNA, RNA）によって表された。ウイルス宿主の分類群を予測するために、4つの教師ありの機械学習法が使用された。（以下略）

　HTPは、メタゲノム研究で得られた長いコンティグに似ている全長配列または部分配列のいずれでもうまく機能する。残念なことに、HTPは短いコンティグやシーケンスリードに似た短いサブシーケンスに対してはパフォーマンスが弱い。

インストール

依存

kNN classifier may not work on 32-bit python, so stick to the 64-bit one. Also, all of the classifiers were trained with a distinct version of scikit-learn, and may not work for the newer/older ones.

Python = 2.7
NumPy >= 1.8.2
SciPy >= 0.13.3
scikit-learn = 0.19.2

本体　Github

pip install git+https://github.com/wojciech-galan/viruses_classifier.git

> viruses_classifier -h

$ viruses_classifier -h

usage: viruses_classifier [-h] [--nucleic_acid {DNA,RNA,dna,rna}]

[--classifier {SVC,kNN,QDA,LR,svc,knn,qda,lr}]

[--probas]

sequence

positional arguments:

sequence sequence in plaintext

optional arguments:

-h, --help show this help message and exit

--nucleic_acid {DNA,RNA,dna,rna}

nucleic acid: either DNA or RNA

--classifier {SVC,kNN,QDA,LR,svc,knn,qda,lr}

classifier: SVC, kNN, QDA or LR

--probas, -p

実行方法

ウィルスゲノムのfastaファイル、またはplan textを指定する。

viruses_classifier raw_or_FASTA-formatted_sequence_file --nucleic_acid rna --classifier qda

--nucleic_acid nucleic acid: either DNA or RNA
--classifier classifier: SVC, kNN, QDA or LR

出力

$ viruses_classifier virus.fasta --nucleic_acid DNA --classifier qda

phage

引用

Host Taxon Predictor - A Tool for Predicting Taxon of the Host of a Newly Discovered Virus
Wojciech Gałan, Maciej Bąk, Małgorzata Jakubowska
Scientific Reports volume 9, Article number: 3436 (2019)

2019-03-14

ナノポアのロングリードのQCツール ToulligQC

Nanopore long read 結果の視覚化 (visualization) evaluation tool docker trimming / preprocessing

2020 7/19 追記

　ToulligQCはPythonで書かれたEcole Normale Superieure の生物学研究所（IBENS）のゲノム施設によって開発されたプログラムである。このプログラムは、オックスフォードナノポアのQC分析を専門としている。さらに、DNA-SeqとともにRNA-Seqにも適応しており、1D squareのランと互換性がある。ToulligQCは部分的に要約ファイルとオックスフォードナノポアのベースコーラーであるAlbacoreによってベースコールプロセス間に生成されるpipieline.logファイルに依存している。ランには、また、単一のFAST5ファイル（フローセルIDと実行日を捉えるため）とAlbacoreが出力したFASTQファイル（シーケンス統計を計算するため）を必要とする。 ToulligQCは、使用されているバーコードを記述するsamplesheet.csvを使用して、バーコードサンプルを考慮に入れることができる。

　ToulligQCは異なるファイルフォーマットを扱える：gz、tar.gz、bz2、tar.bz2、FASTQとFAST5。 ToulligQCの出力は、一連のグラフ、txt形式の統計ファイル、およびHTMLレポートを作成する。

2021 5/6

2/3 ToulligQC 2 => is fast (few minutes on a laptop) =>supports all versions of Guppy =>can be used with all the @nanopore sequencing devices #MinION #GridION #PrometION => compatible with both 1D and 1D2 chemistries pic.twitter.com/2Zxu9xEJQq
— Génomique ENS (@Genomique_ENS) 2021年5月6日

インストール

依存

matplotlib
h5py
pandas
seaborn
numpy
plotly

本体　Github

conda create -n toulligqc -y 
conda activate toulligqc
conda install toulligqc

#pip (ここではcondaの仮想環境に入れる)
conda activate
conda install pip
pip install toulligqc

または
git clone https://github.com/GenomicParisCentre/toulligQC.git
cd toulligqc && python3 setup.py build install

> toulligqc -h

u$ toulligqc -h

usage: toulligqc [-h] [-c FILE] [-n REPORT_NAME] [-f FAST5_SOURCE]

[-a SEQUENCING_SUMMARY_SOURCE]

[-d SEQUENCING_SUMMARY_1DSQR_SOURCE]

[-p ALBACORE_PIPELINE_SOURCE] [-q FASTQ_SOURCE]

[-t TELEMETRY_SOURCE] [-o OUTPUT] [-b] [-s SAMPLE_SHEET_FILE]

[-l BARCODES] [--quiet] [--version]

optional arguments:

-h, --help show this help message and exit

-c FILE, --conf-file FILE

Specify config file

-n REPORT_NAME, --report-name REPORT_NAME

Report name

-f FAST5_SOURCE, --fast5-source FAST5_SOURCE

Fast5 file source

-a SEQUENCING_SUMMARY_SOURCE, --sequencing-summary-source SEQUENCING_SUMMARY_SOURCE, --albacore-summary-source SEQUENCING_SUMMARY_SOURCE

Basecaller sequencing summary source

-d SEQUENCING_SUMMARY_1DSQR_SOURCE, --sequencing-summary-1dsqr-source SEQUENCING_SUMMARY_1DSQR_SOURCE, --albacore-1dsqr-summary-source SEQUENCING_SUMMARY_1DSQR_SOURCE

Basecaller 1dsq summary source

-p ALBACORE_PIPELINE_SOURCE, --albacore-pipeline-source ALBACORE_PIPELINE_SOURCE

Albacore pipeline log source

-q FASTQ_SOURCE, --fastq-source FASTQ_SOURCE

Fastq file source

-t TELEMETRY_SOURCE, --telemetry-source TELEMETRY_SOURCE

Telemetry file source

-o OUTPUT, --output OUTPUT

Output directory

-b, --barcoding Barcode usage

-s SAMPLE_SHEET_FILE, --samplesheet-file SAMPLE_SHEET_FILE

Path to sample sheet file

-l BARCODES, --barcodes BARCODES

Coma separated barcode list

--quiet Quiet mode

--version show program's version number and exit

> toulligqc --version

$ toulligqc --version

1.3

またはオーサーらが用意してくれているdockerイメージを使う。

docker pull genomicpariscentre/toulligqc:latest

テストラン

wget http://outils.genomique.biologie.ens.fr/leburon/downloads/toulligqc-example/toulligqc_demo_data.tar.bz2
tar -xzf toulligqc_demo_data.tar.bz2
cd toulligqc_demo_data

#test run
./run-toulligqc.sh

#test run; docker version (docker imageがダウンロードされるので注意)
./run-toulligqc-with-docker.sh

出力

f:id:kazumaxneo:20200719203644p:plain

report.html

f:id:kazumaxneo:20200719203828p:plain

f:id:kazumaxneo:20200719203832p:plain

f:id:kazumaxneo:20200719203842p:plain

f:id:kazumaxneo:20200719203848p:plain

f:id:kazumaxneo:20200719203858p:plain

f:id:kazumaxneo:20200719203905p:plain

f:id:kazumaxneo:20200719203909p:plain

f:id:kazumaxneo:20200719203921p:plain

f:id:kazumaxneo:20200719203924p:plain

実行方法

python3 toulligqc.py --report-name FAF0256 \
 --fast5-source /fast5_dir
 --albacore-summary-source sequencing_summary.txt \
 --albacore-pipeline-source albacore/pipeline.log \
 --fastq-source /fastq \
 --output /output_dir

バーコードつきサンプル

python3 toulligqc.py --report-name FAF0256 \
 --barcoding \
 --fast5-source /fast5_dir
 --albacore-summary-source sequencing_summary.txt \
 --albacore-pipeline-source albacore/pipeline.log \
 --fastq-source /fastq \
 --output /output_dir
 --sample-sheet-source /sample/

python3.6.7ではエラーが起きる。

=> 最新の1.3を入れるとランできるようになりました。

引用

GitHub - GenomicParisCentre/toulligQC: A post sequencing QC tool for Oxford Nanopore sequencers

関連

2019-03-13

Nanoporeのbasecaller Chiron

Nanopore long read 2018 GigaScience CNN basecaller

　Oxford Nanopore Technologies（ONT）によって最近マーケットに導入された、バイオエンジニアリングされたナノポアを介したDNAシーケンシングは、ゲノムのlandscapeを大きく変えた。 ONTナノポアシーケンシングデバイスであるMinIONの重要な技術革新は、DNAの一本鎖分子が通過するときに、ナノポアを横切る電流の変化を測定することである。次にシグナルを使用してDNA鎖のヌクレオチド配列を決定する[ref.1-3]。重要なことに、このシグナルは、シークエンシングがまだ進行中の間にユーザーによって得られそして分析され得る。ホッチキスのサイズのMinIONデバイスに多数のナノポアを詰めることができるため、このテクノロジは非常に移植性が高くなる。サイズの小ささ、およびリアルタイムの性質は、タイムクリティカルなゲノミクスアプリケーション[ref.4-7]および遠隔地域[ref.8-11]に新たな機会を切り開く。

　ナノポアシーケンシングは、大きなナノポアのアレイを設計しそしてDNAフラグメントのより速い転位を可能にすることで大規模にスケールアップできるが、分析パイプラインにおけるボトルネックの１つは生のシグナルのヌクレオチド配列への翻訳またはbasecallingである。 Chironのリリース前は、ナノポアデータのベースコールには2つの段階があった。生データ系列は、最初にk-merから得られた信号に対応するセグメントに分割され（セグメンテーション）、次にモデルが適用されてセグメント信号をk-merに変換する。 DeepNano [ref.12]は、セグメントの基本統計量（平均信号、標準偏差、および長さ）を使用して対応するk-merを予測する双方向リカレントニューラルネットワーク（RNN）を使用するというアイデアを紹介した。 ONT、nanonet、およびAlbacore（v2.0.1より前）によってリリースされた公式のベースコールも同様の手法を採用している。連続するセグメントからのk-merはｋ-1塩基だけ重複すると予想されるので、これらの方法は動的計画法アルゴリズムを使用して最も確からしい経路を見いだし、それが基底シーケンスデータをもたらす。 BasecRAWller [ref.13]は一対の単方向RNNを使用する。最初のRNNはセグメンテーションのためのセグメント境界の確率を予測し、2番目のRNNは離散イベントを基本シーケンスに変換する。そのため、BasecRAWllerは生の信号データをストリーミング方式で処理することができる。

　この論文では、Chironを紹介する。これは、生の電気信号を直接ヌクレオチド配列に変換できる最初のディープニューラルネットワークモデルである。Chironは、畳み込みニューラルネットワーク（CNN）とRNNおよびコネクショニスト時間分類（CTC）デコーダを結合する新しいアーキテクチャを持っている[ref.13]。これにより、イベントセグメンテーションステップを使用せずに、生の信号データを直接モデル化することができる。 ONTは、Chiron v0.1の直後にリリースされた、セグメンテーションフリーのベースコール元Albacore v2.0.1も開発した。

　Chironはウイルスとバクテリアのゲノムからシーケンスされた小さなデータセットで訓練された、それでもそれは他のバクテリアとヒトのようなゲノムの範囲に一般化することができる。Chironは、バクテリアとウイルスのベースコールに関して、ONTで設計され訓練されたAlbacore v2.0.1と同じくらい正確であり、他のすべての既存の方法よりも優れている。さらに、Albacoreとは異なり、Chironはユーザーが独自のニューラルネットワークをトレーニングすることを可能にし、また完全にオープンソースであり、塩基修飾の検出などの特殊なbase callingアプリケーションの開発を可能にする。

Enjoyed giving presentation on Chiron to 'roll your own basecaller' using the Google cloud yesterday. Looking forward to introducing Chiron developer @HeavensTeng to Guppy developer Mark Bicknell at #nanoporeconf today! https://t.co/qvviyMIV0f
— Lachlan Coin (@ljcoin) May 25, 2018

Chironに関するツイート

インストール

ubuntu16.04のpython3.6.7環境でテストした（xeon E5 v4マシン）。

pip install tensorflow

#またはGPUバージョン導入（対応GPU、対応CUDAドライバ、cuDNNが必要。github参照）
pip install tensorflow-gpu

本体　Github

pipで導入できる。tensorflow以外の依存pythonパッケージも一括導入される。

pip install chiron

> chiron -h

model_default_path /home/kazu/.pyenv/versions/miniconda3-4.3.30/lib/python3.6/site-packages/chiron/model/DNA_default

usage: chiron [-h] {call,export,train} ...

A deep neural network basecaller.

optional arguments:

-h, --help show this help message and exit

sub command:

{call,export,train} sub command help

call Perform basecalling.

export Extract signal and label in the fast5 file.

train Train a model.

> chiron call -h

model_default_path /home/kazu/.pyenv/versions/miniconda3-4.3.30/lib/python3.6/site-packages/chiron/model/DNA_default

usage: chiron call [-h] -i INPUT -o OUTPUT [-m MODEL] [-s START]

[-b BATCH_SIZE] [-l SEGMENT_LEN] [-j JUMP] [-t THREADS]

[-e EXTENSION] [--beam BEAM] [--concise] [--mode MODE]

Perform basecalling

optional arguments:

-h, --help show this help message and exit

-i INPUT, --input INPUT

File path or Folder path to the fast5 file.

-o OUTPUT, --output OUTPUT

Output folder path

-m MODEL, --model MODEL

model folder path

-s START, --start START

Start index of the signal file.

-b BATCH_SIZE, --batch_size BATCH_SIZE

Batch size for run, bigger batch_size will increase

the processing speed but require larger RAM load

-l SEGMENT_LEN, --segment_len SEGMENT_LEN

Segment length to be divided into.

-j JUMP, --jump JUMP Step size for segment

-t THREADS, --threads THREADS

Threads number, default is 0, which use all the

available threads.

-e EXTENSION, --extension EXTENSION

Output file type.

--beam BEAM Beam width used in beam search decoder, default is 50,

set to 0 to use a greedy decoder. Large beam width

give better decoding result but require longer

decoding time.

--concise Concisely output the result, the meta and segments

files will not be output.

--mode MODE Output mode, can be chosen from dna or rna.

> chiron export -h

model_default_path /home/kazu/.pyenv/versions/miniconda3-4.3.30/lib/python3.6/site-packages/chiron/model/DNA_default

usage: chiron export [-h] -i INPUT -o OUTPUT [-f TFFILE]

[--basecall_group BASECALL_GROUP]

[--basecall_subgroup BASECALL_SUBGROUP]

Export signal and label from the fast5 file.

optional arguments:

-h, --help show this help message and exit

-i INPUT, --input INPUT

Input folder contain fast5 files.

-o OUTPUT, --output OUTPUT

Output folder.

-f TFFILE, --tffile TFFILE

tfrecord file

--basecall_group BASECALL_GROUP

Basecall group Nanoraw resquiggle into. Default is

Basecall_1D_000

--basecall_subgroup BASECALL_SUBGROUP

Basecall subgroup Nanoraw resquiggle into. Default is

BaseCalled_template

> chiron train -h

model_default_path /home/kazu/.pyenv/versions/miniconda3-4.3.30/lib/python3.6/site-packages/chiron/model/DNA_default

usage: chiron train [-h] -i DATA_DIR [-f TFRECORD] -o LOG_DIR -m MODEL_NAME

[-c CACHE_DIR] [-t RETRAIN] [-l SEQUENCE_LEN]

[-b BATCH_SIZE] [-x MAX_STEPS] [-r STEP_RATE] [-k K_MER]

Model training

optional arguments:

-h, --help show this help message and exit

-i DATA_DIR, --data_dir DATA_DIR

Folder containing the labelled data.

-f TFRECORD, --tfrecord TFRECORD

tfrecord file

-o LOG_DIR, --log_dir LOG_DIR

Log dir which save the trained model

-m MODEL_NAME, --model_name MODEL_NAME

Model name saved.

-c CACHE_DIR, --cache_dir CACHE_DIR

Output folder

-t RETRAIN, --retrain RETRAIN

If retrain is true, the previous trained model will be

loaded from LOG_DIR before training.

-l SEQUENCE_LEN, --sequence_len SEQUENCE_LEN

Segment length to be divided into.

-b BATCH_SIZE, --batch_size BATCH_SIZE

Batch size to train, large batch size require more ram

but give faster training speed.

-x MAX_STEPS, --max_steps MAX_STEPS

Maximum training steps conducted.

-r STEP_RATE, --step_rate STEP_RATE

Learning rate used for optimiztion algorithm.

-k K_MER, --k_mer K_MER

Output k-mer size.

テストラン

git clone https://github.com/haotianteng/Chiron.git

#run
chiron call -i Chiron/chiron/example_data/ \
 -m Chiron/chiron/model/DNA_default/ \
 -o out_dir

-i File path or Folder path to the fast5 file
-o Output folder path
-m model folder path

出力ディレクトリ

f:id:kazumaxneo:20190311150833j:plain

トレーニング

デファルトのDNAモデルはR.9.4のフローセルとE.coli、lambdaのゲノムの組み合わせでトレーニングされているが、結果に満足行かないなら、自分のデータで訓練することもできる。トレーニングにはChironのGPU版（GPUのメモリは8GB以上要求）の使用が特に推奨されている。

１、データの準備

python chiron/utils/raw.py --input <fast5 folder> --output <output_folder> --mode dna

２、トレーニング

chiron train --data_dir <signal_label folder> --log_dir <model_log_folder> --model_name <saved_model_name>

作成中

引用

Chiron: translating nanopore raw signal directly into nucleotide sequence using deep learning
Haotian Teng, Minh Duc Cao, Michael B Hall, Tania Duarte, Sheng Wang, Lachlan J M Coin

GigaScience, Volume 7, Issue 5, 1 May 2018, giy037

関連

2019-03-12

Nanoporeのオフィシャルコマンドラインbasecaller2 Guppy - CPU版について

高速なツール Nanopore long read basecalling GPU basecaller

2019 3/12 タイトル修正

2019 3/12 コマンド追記、誤ったコメント削除

2020 1/19 GPU版のリンク追記

2020 5/4 3.6ツイート追記

2021 1/8 helpのバージョン更新、リンク切れ修正

GPU版

2020 3/13 構成を微修正、タイトル変更

20200 7/15

guppy v4.0.11

I got back into some @nanopore sequencing data I had from 2018 and ran basecalling again. Who knew I'd actually see the improved accuracy by eye 😮. At this pace, it will look like @illumina reads in another 2 years! pic.twitter.com/FfIrYQnUAz
— François Kroll (@francois_kroll) 2020年7月14日

2020 5/4

Guppy 3.6.0 is a marked improvement for PromethION @nanopore R9.4.1 data!!! Looking forward to the new announcements at LC2020!!! @NanoporeConf @BenedictPaten @kishwarshafin @HOlsenUCSC pic.twitter.com/wofInG39Be
— Miten Jain (@mitenjain) 2020年5月4日

Next up in the random nanopore read comparison: Bonito 0.1.5 vs Guppy 3.6.0: Bonito (left) seems to have a slightly higher accuracy due to fewer inserton/delitions but guppy maps a few more bases at the end of the read. pic.twitter.com/3tAEUZ3tYF
— Ola Wallerman (@OWallerman) 2020年5月2日

2021 /8現在v4.4.1が最新。

GuppyはOxford Nanoporeによって提供されているコマンドラインのbasecaller。そしてポアを通過するDNAまたはRNAをbasecallingするために最新のリカレントニューラルネットワークアルゴリズムを利用してナノポアからのシグナルデータを解釈する。GPU GuppyはOxford Nanopore Technologiesのソフトウェア製品に安定した機能を実装しており、完全にサポートされている。 .fast5ファイルを入力として受け取り、ベースコール情報を付加した.fast5ファイル、処理された.fast5ファイル（1D^2は２回ランする）、fastqファイルを生成できる。ここではCPU版を使う流れを説明する。

マニュアルリンク

log inの必要あり。

インストール

mac10.12でバイナリをダウンロードしてテストした。

ハードウエア要件

4 GBのRAMと1 Dベースコールのスレッドあたり1 GB
4 GBのRAMと1 D 2ベースコールのスレッドあたり2 GB
.debまたは.msi インストーラーの管理者アクセス
インストール用に最大100 MBのドライブスペース、ベースコールされたリードファイル用に最低512 GBのストレージスペース（1 TB推奨）
外部GPU（GPU版）

Albacoreと同様、ダウンロードはOxford nanopore HPのsoftware downloadsから行う。

f:id:kazumaxneo:20210108125625p:plain

バイナリがダウンロードできる。ここではlinux 64bit向けをダウンロードした。gpu版のバイナリはcentos7版のみ用意されている。その後、他のプラットフォームも公開されるようになった。ARM版もある。

cd ont-guppy-cpu/bin/

> ls -l

$ ls -l

total 18360

-rw-r--r--@ 1 user staff 196187 12 11 15:37 Nanopore Community TCs 09 September 2016.pdf

-rw-r--r--@ 1 user staff 57178 1 28 17:16 THIRD_PARTY_LICENSES

-rwxr-xr-x@ 1 user staff 950944 2 26 14:22 guppy_aligner

-rwxr-xr-x@ 1 user staff 1098576 2 26 14:22 guppy_barcoder

-rwxr-xr-x@ 1 user staff 1774404 2 26 14:22 guppy_basecall_server

-rwxr-xr-x@ 1 user staff 2707612 2 26 14:22 guppy_basecaller

-rwxr-xr-x@ 1 user staff 2598784 2 26 14:22 guppy_basecaller_1d2

> ./guppy_basecaller (v4.4.1)

: Guppy Basecalling Software, (C) Oxford Nanopore Technologies, Limited. Version 4.4.1+1c81d62

Usage:

With config file:"

guppy_basecaller -i <input path> -s <save path> -c <config file> [options]

With flowcell and kit name:

guppy_basecaller -i <input path> -s <save path> --flowcell <flowcell name>

--kit <kit name>

List supported flowcells and kits:

guppy_basecaller --print_workflows

Command line parameters:

--trim_threshold arg Threshold above which data will be trimmed

(in standard deviations of current level

distribution).

--trim_min_events arg Adapter trimmer minimum stride intervals

after stall that must be seen.

--max_search_len arg Maximum number of samples to search through

for the stall

--override_scaling Manually provide scaling parameters rather

than estimating them from each read.

--scaling_med arg Median current value to use for manual

scaling.

--scaling_mad arg Median absolute deviation to use for manual

scaling.

--trim_strategy arg Trimming strategy to apply: 'dna' or 'rna'

(or 'none' to disable trimming)

--dmean_win_size arg Window size for coarse stall event

detection

--dmean_threshold arg Threshold for coarse stall event detection

--jump_threshold arg Threshold level for rna stall detection

--pt_scaling Enable polyT/adapter max detection for read

scaling.

--pt_median_offset arg Set polyT median offset for setting read

scaling median (default 2.5)

--adapter_pt_range_scale arg Set polyT/adapter range scale for setting

read scaling median absolute deviation

(default 5.2)

--pt_required_adapter_drop arg Set minimum required current drop from

adapter max to polyT detection. (default

30.0)

--pt_minimum_read_start_index arg Set minimum index for read start sample

required to attempt polyT scaling. (default

30)

--as_model_file arg Path to JSON model file for adapter

scaling.

--as_gpu_runners_per_device arg Number of runners per GPU device for

adapter scaling.

--as_cpu_threads_per_scaler arg Number of CPU worker threads per adapter

scaler

--as_reads_per_runner arg Maximum reads per runner for adapter

scaling.

--as_num_scalers arg Number of parallel scalers for adapter

scaling.

-m [ --model_file ] arg Path to JSON model file.

-k [ --kernel_path ] arg Path to GPU kernel files location (only

needed if builtin_scripts is false).

-x [ --device ] arg Specify basecalling device: 'auto', or

'cuda:<device_id>'.

--builtin_scripts arg Whether to use GPU kernels that were

included at compile-time.

--chunk_size arg Stride intervals per chunk.

--chunks_per_runner arg Maximum chunks per runner.

--chunks_per_caller arg Soft limit on number of chunks in each

caller's queue. New reads will not be

queued while this is exceeded.

--high_priority_threshold arg Number of high priority chunks to process

for each medium priority chunk.

--medium_priority_threshold arg Number of medium priority chunks to process

for each low priority chunk.

--overlap arg Overlap between chunks (in stride

intervals).

--gpu_runners_per_device arg Number of runners per GPU device.

--cpu_threads_per_caller arg Number of CPU worker threads per

basecaller.

--num_callers arg Number of parallel basecallers to create.

--post_out Return full posterior matrix in output

fast5 file and/or called read message from

server.

--stay_penalty arg Scaling factor to apply to stay probability

calculation during transducer decode.

--qscore_offset arg Qscore calibration offset.

--qscore_scale arg Qscore calibration scale factor.

--temp_weight arg Temperature adjustment for weight matrix in

softmax layer of RNN.

--temp_bias arg Temperature adjustment for bias vector in

softmax layer of RNN.

--beam_cut arg Beam score cutoff for beam search decoding.

--beam_width arg Beam score cutoff for beam search decoding.

--qscore_filtering Enable filtering of reads into PASS/FAIL

folders based on min qscore.

--min_qscore arg Minimum acceptable qscore for a read to be

filtered into the PASS folder

--reverse_sequence arg Reverse the called sequence (for RNA

sequencing).

--u_substitution arg Substitute 'U' for 'T' in the called

sequence (for RNA sequencing).

--log_speed_frequency arg How often to print out basecalling speed.

--barcode_kits arg Space separated list of barcoding kit(s) or

expansion kit(s) to detect against. Must be

in double quotes.

--trim_barcodes Trim the barcodes from the output sequences

in the FastQ files.

--num_extra_bases_trim arg How vigorous to be in trimming the barcode.

Default is 0 i.e. the length of the

detected barcode. A positive integer means

extra bases will be trimmed, a negative

number is how many fewer bases (less

vigorous) will be trimmed.

--arrangements_files arg Files containing arrangements.

--lamp_arrangements_files arg Files containing lamp arrangements.

--score_matrix_filename arg File containing mismatch score matrix.

--start_gap1 arg Gap penalty for aligning before the

reference.

--end_gap1 arg Gap penalty for aligning after the

reference.

--open_gap1 arg Penalty for opening a new gap in the

reference.

--extend_gap1 arg Penalty for extending a gap in the

reference.

--start_gap2 arg Gap penalty for aligning before the query.

--end_gap2 arg Gap penalty for aligning after the query.

--open_gap2 arg Penalty for opening a new gap in the query.

--extend_gap2 arg Penalty for extending a gap in the query.

--min_score arg Minimum score to consider a valid

alignment.

--min_score_rear_override arg Minimum score to consider a valid alignment

for the rear barcode only (and min_score

will then be used for the front only when

this is set).

--min_score_mask arg Minimum score for a barcode context to

consider a valid alignment.

--front_window_size arg Window size for the beginning barcode.

--rear_window_size arg Window size for the ending barcode.

--require_barcodes_both_ends Reads will only be classified if there is a

barcode above the min_score at both ends of

the read.

--allow_inferior_barcodes Reads will still be classified even if both

the barcodes at the front and rear (if

applicable) were not the best scoring

barcodes above the min_score.

--detect_mid_strand_barcodes Search for barcodes through the entire

length of the read.

--min_score_mid_barcodes arg Minimum score for a barcode to be detected

in the middle of a read.

--lamp_kit arg LAMP barcoding kit to perform LAMP

detection against.

--min_score_lamp arg Minimum score for a LAMP barcode to be

classified.

--min_score_lamp_mask arg Minimum score for a LAMP barcode mask

context to be classified.

--min_score_lamp_target arg Minimum score for a LAMP target to be

classified.

--additional_context_bases arg Number of bases from a lamp FIP barcode

context to append to the front and rear of

the FIP barcode before performing matching.

Default is 2.

--min_length_lamp_context arg Minimum align length for a LAMP barcode

mask context to be classified.

--min_length_lamp_target arg Minimum align length for a LAMP target to

be classified.

--num_barcoding_buffers arg Number of GPU memory buffers to allocate to

perform barcoding into. Controls level of

parallelism on GPU for barcoding.

--num_mid_barcoding_buffers arg Number of GPU memory buffers to allocate to

perform barcoding into. Controls level of

parallelism on GPU for mid barcoding.

--num_barcode_threads arg Number of worker threads to use for

barcoding.

--calib_detect Enable calibration strand detection and

filtering.

--calib_reference arg Reference FASTA file containing calibration

strand.

--calib_min_sequence_length arg Minimum sequence length for reads to be

considered candidate calibration strands.

--calib_max_sequence_length arg Maximum sequence length for reads to be

considered candidate calibration strands.

--calib_min_coverage arg Minimum reference coverage to pass

calibration strand detection.

--print_workflows Output available workflows.

--flowcell arg Flowcell to find a configuration for

--kit arg Kit to find a configuration for

-a [ --align_ref ] arg Path to alignment reference.

--bed_file arg Path to .bed file containing areas of

interest in reference genome.

--num_alignment_threads arg Number of worker threads to use for

alignment.

-z [ --quiet ] Quiet mode. Nothing will be output to

STDOUT if this option is set.

--trace_categories_logs arg Enable trace logs - list of strings with

the desired names.

--verbose_logs Enable verbose logs.

--trace_domains_log arg List of trace domains to include in verbose

logging (if enabled), '*' for all.

--trace_domains_config arg Configuration file containing list of trace

domains to include in verbose logging (if

enabled), this will override

--trace_domain_logs

--disable_pings Disable the transmission of telemetry

pings.

--ping_url arg URL to send pings to

--ping_segment_duration arg Duration in minutes of each ping segment.

--progress_stats_frequency arg Frequency in seconds in which to report

progress statistics, if supplied will

replace the default progress display.

-q [ --records_per_fastq ] arg Maximum number of records per fastq file, 0

means use a single file (per worker, per

run id).

--read_batch_size arg Maximum batch size, in reads, for grouping

input files.

--compress_fastq Compress fastq output files with gzip.

-i [ --input_path ] arg Path to input fast5 files.

--input_file_list arg Optional file containing list of input

fast5 files to process from the input_path.

-s [ --save_path ] arg Path to save fastq files.

-l [ --read_id_list ] arg File containing list of read ids to filter

-r [ --recursive ] Search for input files recursively.

--fast5_out Choice of whether to do fast5 output.

--bam_out Choice of whether to do BAM file output.

--bam_methylation_threshold arg The value below which a predicted

methylation probability will not be emitted

into a BAM file, expressed as a percentage.

Default is 5.0(%).

--resume Resume a previous basecall run using the

same output folder.

--client_id arg Optional unique identifier (non-negative

integer) for this instance of the Guppy

Client Basecaller, if supplied will form

part of the output filenames.

--nested_output_folder If flagged output fastq files will be

written to a nested folder structure, based

on: protocol_group/sample/protocol/qscore_p

ass_fail/barcode_arrangement/

--max_queued_reads arg Maximum number of reads to be submitted for

processing at any one time.

-h [ --help ] produce help message

-v [ --version ] print version number

-c [ --config ] arg Config file to use

-d [ --data_path ] arg Path to use for loading any data files the

application requires.

> ./guppy_basecaller_1d2 (Version 2.3.5)

$ ./guppy_basecaller_1d2

: Guppy 1D-Squared Basecalling Software, (C) Oxford Nanopore Technologies, Limited. Version 2.3.5+53a111f6

Command line parameters:

--print_workflows Output available workflows.

--flowcell arg Flowcell to find a configuration for

--kit arg Kit to find a configuration for

-m [ --model_file ] arg Path to JSON model file.

--chunk_size arg Stride intervals per chunk.

--chunks_per_runner arg Maximum chunks per runner.

--chunks_per_caller arg Soft limit on number of chunks in each

caller's queue. New reads will not be queued

while this is exceeded.

--overlap arg Overlap between chunks (in stride intervals).

--gpu_runners_per_device arg Number of runners per GPU device.

--cpu_threads_per_caller arg Number of CPU worker threads per basecaller.

--num_callers arg Number of parallel basecallers to create.

--stay_penalty arg Scaling factor to apply to stay probability

calculation during transducer decode.

--qscore_offset arg Qscore calibration offset.

--qscore_scale arg Qscore calibration scale factor.

--temp_weight arg Temperature adjustment for weight matrix in

softmax layer of RNN.

--temp_bias arg Temperature adjustment for bias vector in

softmax layer of RNN.

--hp_correct arg Whether to use homopolymer correction during

decoding.

--builtin_scripts arg Whether to use GPU kernels that were included

at compile-time.

-x [ --device ] arg Specify basecalling device: 'auto', or

'cuda:<device_id>'.

-k [ --kernel_path ] arg Path to GPU kernel files location (only

needed if builtin_scripts is false).

-z [ --quiet ] Quiet mode. Nothing will be output to STDOUT

if this option is set.

--trace_categories_logs arg Enable trace logs - list of strings with the

desired names.

--verbose_logs Enable verbose logs.

--qscore_filtering Enable filtering of reads into PASS/FAIL

folders based on min qscore.

--min_qscore arg Minimum acceptable qscore for a read to be

filtered into the PASS folder

--disable_pings Disable the transmission of telemetry pings.

--ping_url arg URL to send pings to

--ping_segment_duration arg Duration in minutes of each ping segment.

--calib_detect Enable calibration strand detection and

filtering.

--calib_reference arg Reference FASTA file containing calibration

strand.

--calib_min_sequence_length arg Minimum sequence length for reads to be

considered candidate calibration strands.

--calib_max_sequence_length arg Maximum sequence length for reads to be

considered candidate calibration strands.

--calib_min_coverage arg Minimum reference coverage to pass

calibration strand detection.

--score_matrix arg Path to mismatch matrix for prior label

alignment

-q [ --records_per_fastq ] arg Maximum number of records per fastq file (0

means use a single file).

--winsize1 arg Short window length for event detection.

--winsize2 arg Long window length for event detection.

--threshold1 arg Shirt time-scale threshold for event

detection.

--threshold2 arg Long time-scale threshold for event

detection.

--band_size arg Band size for 1d-squared alignment table.

--pa_band_size arg Band size for prior-alignment table.

--gap_penalty arg Gap penalty for prior label alignment.

--start_end_penalty arg Overhang penalty for prior label alignment.

--reverse_sequence arg Reverse the called sequence (for RNA

sequencing).

--u_substitution arg Substute 'U' for 'T' in the called sequence

(for RNA sequencing).

-i [ --input_path ] arg Path to input fast5 files.

-f [ --index_file ] arg Index file from 1D basecall.

-s [ --save_path ] arg Path to save fastq files.

-p [ --port ] arg Hostname and port for connecting to basecall

service (ie 'myserver:5555'), or port only

(ie '5555'), in which case localhost is

assumed.

-r [ --recursive ] Search for input files recursively.

--override_scaling Manually provide scaling parameters rather

than estimating them from each read.

--scaling_med arg Median current value to use for manual

scaling.

--scaling_mad arg Median absolute deviation to use for manual

scaling.

-h [ --help ] produce help message

-v [ --version ] print version number

-c [ --config ] arg Config file to use

-d [ --data_path ] arg Path to use for loading any data files the

application requires.

> ./guppy_barcoder

$ ./guppy_barcoder

Usage:

guppy_barcoder -i <input fastq path> -s <save path>

With kit name:

guppy_barcoder -i <input fastq path> -s <save path> --barcode_kits <kit name>

--kit <kit name>

List supported barcoding kits:

guppy_barcoder --print_kits

Command line parameters:

-z [ --quiet ] Quiet mode. Nothing will be output to stdout

if this option is set.

-t [ --worker_threads ] arg Number of worker threads.

-i [ --input_path ] arg Path to input fastq files.

-s [ --save_path ] arg Path to save fastq files.

-r [ --recursive ] Search for input file recursively.

--trace_categories_logs arg Enable trace logs - list of strings with the

desired names.

--verbose_logs Enable verbose logs.

--print_kits Output all available barcoding kits.

--barcode_kits arg Space separated list of barcoding kit(s) or

expansion kit(s) to detect against. Must be in

double quotes.

-q [ --records_per_fastq ] arg Maximum number of records per fastq file, 0

means use a single file (per run id).

--arrangements_files arg Files containing arrangements.

--score_matrix_filename arg File containing mismatch score matrix.

--start_gap1 arg Gap penalty for aligning before the reference.

--end_gap1 arg Gap penalty for aligning after the reference.

--open_gap1 arg Penalty for opening a new gap in the

reference.

--extend_gap1 arg Penalty for extending a gap in the reference.

--start_gap2 arg Gap penalty for aligning before the query.

--end_gap2 arg Gap penalty for aligning after the query.

--open_gap2 arg Penalty for opening a new gap in the query.

--extend_gap2 arg Penalty for extending a gap in the query.

--min_score arg Minimum score to consider a valid alignment.

--front_window_size arg Window size for the beginning barcode.

--rear_window_size arg Window size for the ending barcode.

-h [ --help ] produce help message

-v [ --version ] print version number

-c [ --config ] arg Config file to use

-d [ --data_path ] arg Path to use for loading any data files the

application requires.

> ./guppy_basecall_server

$ ./guppy_basecall_server

: Guppy Basecall Service Software, (C) Oxford Nanopore Technologies, Limited. Version 2.3.5+53a111f6

Usage:

With config file:

guppy_basecall_server -c <config file> --port <server listen port>

--log_path <log file path> [options]

With flowcell and kit:

guppy_basecall_server --flowcell <flowcell name> --kit <kit name>

--port <server listen port> --log_path <log file path> [options]

List supported flowcells and kits:

guppy_basecall_server --print_workflows

Command line parameters:

--print_workflows Output available workflows.

--flowcell arg Flowcell to find a configuration for

--kit arg Kit to find a configuration for

-m [ --model_file ] arg Path to JSON model file.

--chunk_size arg Stride intervals per chunk.

--chunks_per_runner arg Maximum chunks per runner.

--chunks_per_caller arg Soft limit on number of chunks in each

caller's queue. New reads will not be queued

while this is exceeded.

--overlap arg Overlap between chunks (in stride intervals).

--gpu_runners_per_device arg Number of runners per GPU device.

--cpu_threads_per_caller arg Number of CPU worker threads per basecaller.

--num_callers arg Number of parallel basecallers to create.

--stay_penalty arg Scaling factor to apply to stay probability

calculation during transducer decode.

--qscore_offset arg Qscore calibration offset.

--qscore_scale arg Qscore calibration scale factor.

--temp_weight arg Temperature adjustment for weight matrix in

softmax layer of RNN.

--temp_bias arg Temperature adjustment for bias vector in

softmax layer of RNN.

--hp_correct arg Whether to use homopolymer correction during

decoding.

--builtin_scripts arg Whether to use GPU kernels that were included

at compile-time.

-x [ --device ] arg Specify basecalling device: 'auto', or

'cuda:<device_id>'.

-k [ --kernel_path ] arg Path to GPU kernel files location (only

needed if builtin_scripts is false).

-z [ --quiet ] Quiet mode. Nothing will be output to STDOUT

if this option is set.

--trace_categories_logs arg Enable trace logs - list of strings with the

desired names.

--verbose_logs Enable verbose logs.

--qscore_filtering Enable filtering of reads into PASS/FAIL

folders based on min qscore.

--min_qscore arg Minimum acceptable qscore for a read to be

filtered into the PASS folder

--disable_pings Disable the transmission of telemetry pings.

--ping_url arg URL to send pings to

--ping_segment_duration arg Duration in minutes of each ping segment.

--calib_detect Enable calibration strand detection and

filtering.

--calib_reference arg Reference FASTA file containing calibration

strand.

--calib_min_sequence_length arg Minimum sequence length for reads to be

considered candidate calibration strands.

--calib_max_sequence_length arg Maximum sequence length for reads to be

considered candidate calibration strands.

--calib_min_coverage arg Minimum reference coverage to pass

calibration strand detection.

--ipc_threads arg Number of threads to use for inter-process

communication.

--max_queued_reads arg Maximum number of reads in input queue.

-l [ --log_path ] arg Path to save log file.

-p [ --port ] arg Port for hosting service. Specify "auto" to

make server automatically search for a free

port.

-h [ --help ] produce help message

-v [ --version ] print version number

-c [ --config ] arg Config file to use

-d [ --data_path ] arg Path to use for loading any data files the

application requires.

> ./guppy_aligner

$ ./guppy_aligner

Usage:

guppy_aligner -i <input fastq path> -s <output SAM path>

--align_ref <reference file>

Command line parameters:

-z [ --quiet ] Quiet mode. Nothing will be output to stdout if

this option is set.

-t [ --worker_threads ] arg Number of worker threads.

-i [ --input_path ] arg Path to input fastq files.

-s [ --save_path ] arg Path to save fastq files.

-r [ --recursive ] Search for input file recursively.

--trace_categories_logs arg Enable trace logs - list of strings with the

desired names.

--verbose_logs Enable verbose logs.

-a [ --align_ref ] arg Path to alignment reference.

--min_coverage arg Minimum coverage to accept an alignment.

-h [ --help ] produce help message

-v [ --version ] print version number

-c [ --config ] arg Config file to use

-d [ --data_path ] arg Path to use for loading any data files the

application requires.

対応フローセルとキット

> guppy_basecaller --print_workflows

$ guppy_basecaller --print_workflows

Available flowcell + kit combinations are:

flowcell kit barcoding config_name

FLO-MIN106 SQK-DCS108 dna_r9.4.1_450bps

FLO-MIN106 SQK-DCS109 dna_r9.4.1_450bps

FLO-MIN106 SQK-LRK001 dna_r9.4.1_450bps

FLO-MIN106 SQK-LSK108 dna_r9.4.1_450bps

FLO-MIN106 SQK-LSK109 dna_r9.4.1_450bps

FLO-MIN106 SQK-LWP001 dna_r9.4.1_450bps

FLO-MIN106 SQK-PCS108 dna_r9.4.1_450bps

FLO-MIN106 SQK-PCS109 dna_r9.4.1_450bps

FLO-MIN106 SQK-PSK004 dna_r9.4.1_450bps

FLO-MIN106 SQK-RAD002 dna_r9.4.1_450bps

FLO-MIN106 SQK-RAD003 dna_r9.4.1_450bps

FLO-MIN106 SQK-RAD004 dna_r9.4.1_450bps

FLO-MIN106 SQK-RAS201 dna_r9.4.1_450bps

FLO-MIN106 SQK-RLI001 dna_r9.4.1_450bps

FLO-MIN106 VSK-VBK001 dna_r9.4.1_450bps

FLO-MIN106 VSK-VSK001 dna_r9.4.1_450bps

FLO-MIN106 SQK-RBK001 included dna_r9.4.1_450bps

FLO-MIN106 SQK-RBK004 included dna_r9.4.1_450bps

FLO-MIN106 SQK-RLB001 included dna_r9.4.1_450bps

FLO-MIN106 SQK-LWB001 included dna_r9.4.1_450bps

FLO-MIN106 SQK-PBK004 included dna_r9.4.1_450bps

FLO-MIN106 SQK-RAB201 included dna_r9.4.1_450bps

FLO-MIN106 SQK-RAB204 included dna_r9.4.1_450bps

FLO-MIN106 SQK-RPB004 included dna_r9.4.1_450bps

FLO-MIN106 VSK-VMK001 included dna_r9.4.1_450bps

FLO-PRO001 SQK-LSK109 dna_r9.4.1_450bps_prom

FLO-PRO001 SQK-DCS109 dna_r9.4.1_450bps_prom

FLO-PRO001 SQK-PCS109 dna_r9.4.1_450bps_prom

FLO-PRO002 SQK-LSK109 dna_r9.4.1_450bps_prom

FLO-PRO002 SQK-DCS109 dna_r9.4.1_450bps_prom

FLO-PRO002 SQK-PCS109 dna_r9.4.1_450bps_prom

FLO-MIN107 SQK-DCS108 dna_r9.5_450bps

FLO-MIN107 SQK-DCS109 dna_r9.5_450bps

FLO-MIN107 SQK-LRK001 dna_r9.5_450bps

FLO-MIN107 SQK-LSK108 dna_r9.5_450bps

FLO-MIN107 SQK-LSK109 dna_r9.5_450bps

FLO-MIN107 SQK-LSK308 dna_r9.5_450bps

FLO-MIN107 SQK-LSK309 dna_r9.5_450bps

FLO-MIN107 SQK-LSK319 dna_r9.5_450bps

FLO-MIN107 SQK-LWP001 dna_r9.5_450bps

FLO-MIN107 SQK-PCS108 dna_r9.5_450bps

FLO-MIN107 SQK-PCS109 dna_r9.5_450bps

FLO-MIN107 SQK-PSK004 dna_r9.5_450bps

FLO-MIN107 SQK-RAD002 dna_r9.5_450bps

FLO-MIN107 SQK-RAD003 dna_r9.5_450bps

FLO-MIN107 SQK-RAD004 dna_r9.5_450bps

FLO-MIN107 SQK-RAS201 dna_r9.5_450bps

FLO-MIN107 SQK-RLI001 dna_r9.5_450bps

FLO-MIN107 VSK-VBK001 dna_r9.5_450bps

FLO-MIN107 VSK-VSK001 dna_r9.5_450bps

FLO-MIN107 SQK-LWB001 included dna_r9.5_450bps

FLO-MIN107 SQK-PBK004 included dna_r9.5_450bps

FLO-MIN107 SQK-RAB201 included dna_r9.5_450bps

FLO-MIN107 SQK-RAB204 included dna_r9.5_450bps

FLO-MIN107 SQK-RBK001 included dna_r9.5_450bps

FLO-MIN107 SQK-RBK004 included dna_r9.5_450bps

FLO-MIN107 SQK-RLB001 included dna_r9.5_450bps

FLO-MIN107 SQK-RPB004 included dna_r9.5_450bps

FLO-MIN107 VSK-VMK001 included dna_r9.5_450bps

FLO-MIN106 SQK-RNA001 rna_r9.4.1_70bps

FLO-MIN106 SQK-RNA002 rna_r9.4.1_70bps

FLO-MIN107 SQK-RNA001 rna_r9.4.1_70bps

FLO-MIN107 SQK-RNA002 rna_r9.4.1_70bps

FLO-PRO001 SQK-RNA002 rna_r9.4.1_70bps_prom

FLO-PRO002 SQK-RNA002 rna_r9.4.1_70bps_prom

実行方法

フローセル、kit名、入出力（*1）等を指定して実行する。-r (--recursive)をつけるとサブディレクトリも含めてbasecallingされる（*1）。

guppy_basecaller --flowcell FLO-MIN106 --kit SQK-LSK109 \
 --cpu_threads_per_caller 4 --num_callers 4 \
 -i fast5/input_dir -s output_dir -r

出力（サブディレクトリ１つ分）

f:id:kazumaxneo:20190310135002j:plain

1D^2のbase callingの場合、--fast5_outフラグを立てて上のコマンドを実行し、得られたfast5ファイルをguppy_basecaller_1d2で再びbase callingする2stepで行う（未テスト）。

引用

nanoporetech Guppy 2.3.5

https://community.nanoporetech.com/protocols/Guppy-protocol-preRev/v/gpb_2003_v1_revh_14dec2018

参考

ツイッターであかまるさんに教えていただきました。

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

高速なロングリードのアセンブリツール wtdbg2 (Redbean)

コンタミやダメージを考慮してAncient DNAのシーケンシングリードをシミュレートする gargammel

グラフィック出力が利用できない環境で頻度分布を素早く確認できる bashplotlib

ウィルス分類器 viruses_classifier

ナノポアのロングリードのQCツール ToulligQC

Nanoporeのbasecaller Chiron

Nanoporeのオフィシャルコマンドラインbasecaller2 Guppy - CPU版について