2020-01-19

GuppyのGPU版を使う

2020/07/23 モニターコマンド追記

2021/01/8 helpのバージョン更新

2021/08/22 更新

2022/1/7 v6に更新（helpはv4）

2022/02/16 helpをv6に更新

タイトルの通り、GuppyのGPU版を使うまでの流れをまとめておきます。

ubuntuへのインストール

１、Nvidia GPU driverのインストール

#レポジトリの追加
sudo add-apt-repository ppa:graphics-drivers/ppa 
sudo apt update

#NVIDIA driverのインストール。最新GPUだとより最新のNvidiaドライバーを入れる必要があるかもしれない（ONTのGuppy documentより）。
sudo apt install nvidia-384 

#OS reboot
sudo reboot

libcuda.so.1がないというエラーが出たら、/libcuda.soから/libcuda.so.1にシンボリックリンクを張って、$LD_LIBRARY_PATHに追加することでとりあえず解決。

#私の環境では
ln -s /usr/local/cuda/lib64/stubs/libcuda.so /usr/local/cuda/lib64/stubs/libcuda.so.1

export LD_LIBRARY_PATH=/usr/local/cuda/lib64/stubs:${LD_LIBRARY_PATH}

バージョン確認

> modinfo nvidia | grep version

２、GuppyのGPU版ダウンロード

2020 1/19現在、Guppyのv3.4.4が提供されている。log inしてsoftware downloadからlinux GPUビルドをダウンロードする（=> 2023 5/28現在v6.5.7が最新）。

https://community.nanoporetech.com/downloads

注；少し前からGuppyはをダウンロードしなくてもパッケージマネージャでインストールできるようになっています。画像中央列の各プラットフォーム向けマニュアルを確認して下さい。

cd ont-guppy/bin/

> ./guppy_basecaller #v.6.01

$ guppy_basecaller

: Guppy Basecalling Software, (C) Oxford Nanopore Technologies, Limited. Version 6.0.1+652ffd179

Usage:

With config file:"

guppy_basecaller -i <input path> -s <save path> -c <config file> [options]

With flowcell and kit name:

guppy_basecaller -i <input path> -s <save path> --flowcell <flowcell name>

--kit <kit name>

List supported flowcells and kits:

guppy_basecaller --print_workflows

Use GPU for basecalling:

guppy_basecaller -i <input path> -s <save path> -c <config file>

--device <cuda device name> [options]

Command line parameters:

--trim_threshold arg Threshold above which data will be

trimmed (in standard deviations of

current level distribution).

--trim_min_events arg Adapter trimmer minimum stride

intervals after stall that must be

seen.

--max_search_len arg Maximum number of samples to search

through for the stall

--override_scaling Manually provide scaling parameters

rather than estimating them from each

read.

--scaling_med arg Median current value to use for manual

scaling.

--scaling_mad arg Median absolute deviation to use for

manual scaling.

--trim_strategy arg Trimming strategy to apply: 'dna' or

'rna' (or 'none' to disable trimming)

--dmean_win_size arg Window size for coarse stall event

detection

--dmean_threshold arg Threshold for coarse stall event

detection

--jump_threshold arg Threshold level for rna stall detection

--pt_scaling Enable polyT/adapter max detection for

read scaling.

--pt_median_offset arg Set polyT median offset for setting

read scaling median (default 2.5)

--adapter_pt_range_scale arg Set polyT/adapter range scale for

setting read scaling median absolute

deviation (default 5.2)

--pt_required_adapter_drop arg Set minimum required current drop from

adapter max to polyT detection.

(default 30.0)

--pt_minimum_read_start_index arg Set minimum index for read start sample

required to attempt polyT scaling.

(default 30)

--as_model_file arg Path to JSON model file for adapter

scaling.

--as_gpu_runners_per_device arg Number of runners per GPU device for

adapter scaling.

--as_cpu_threads_per_scaler arg Number of CPU worker threads per

adapter scaler

--as_reads_per_runner arg Maximum reads per runner for adapter

scaling.

--as_num_scalers arg Number of parallel scalers for adapter

scaling.

--noisiest_section_scaling_max_size arg

Threshold read size in samples under

which nosiest-section scaling will be

performed.

-m [ --model_file ] arg Path to JSON model file.

-k [ --kernel_path ] arg Path to GPU kernel files location (only

needed if builtin_scripts is false).

-x [ --device ] arg Specify basecalling device: 'auto', or

'cuda:<device_id>'.

--builtin_scripts arg Whether to use GPU kernels that were

included at compile-time.

--chunk_size arg Stride intervals per chunk.

--chunks_per_runner arg Maximum chunks per runner.

--chunks_per_caller arg Soft limit on number of chunks in each

caller's queue. New reads will not be

queued while this is exceeded.

--high_priority_threshold arg Number of high priority chunks to

process for each medium priority chunk.

--medium_priority_threshold arg Number of medium priority chunks to

process for each low priority chunk.

--overlap arg Overlap between chunks (in stride

intervals).

--gpu_runners_per_device arg Number of runners per GPU device.

--cpu_threads_per_caller arg Number of CPU worker threads per

basecaller.

--num_callers arg Number of parallel basecallers to

create.

--post_out Return full posterior matrix in output

fast5 file and/or called read message

from server.

--stay_penalty arg Scaling factor to apply to stay

probability calculation during

transducer decode.

--qscore_offset arg Qscore calibration offset.

--qscore_scale arg Qscore calibration scale factor.

--temp_weight arg Temperature adjustment for weight

matrix in softmax layer of RNN.

--temp_bias arg Temperature adjustment for bias vector

in softmax layer of RNN.

--beam_cut arg Beam score cutoff for beam search

decoding.

--beam_width arg Beam score cutoff for beam search

decoding.

--duplex_window_size arg Window size to use for prefix search in

duplex decoding.

--disable_qscore_filtering Disable filtering of reads into

PASS/FAIL folders based on min qscore.

--min_qscore arg Minimum acceptable qscore for a read to

be filtered into the PASS folder

--reverse_sequence arg Reverse the called sequence (for RNA

sequencing).

--u_substitution arg Substitute 'U' for 'T' in the called

sequence (for RNA sequencing).

--log_speed_frequency arg How often to print out basecalling

speed.

--barcode_kits arg Space separated list of barcoding

kit(s) or expansion kit(s) to detect

against. Must be in double quotes.

--trim_barcodes Trim the barcodes from the sequences in

the output files.

--trim_adapters Trim the adapters from the sequences in

the output files.

--trim_primers Trim the primers from the sequences in

the output files.

--num_extra_bases_trim arg How vigorous to be in trimming the

barcode. Default is 0 i.e. the length

of the detected barcode. A positive

integer means extra bases will be

trimmed, a negative number is how many

fewer bases (less vigorous) will be

trimmed.

--score_matrix_filename arg File containing mismatch score matrix.

--start_gap1 arg Gap penalty for aligning before the

reference.

--end_gap1 arg Gap penalty for aligning after the

reference.

--open_gap1 arg Penalty for opening a new gap in the

reference.

--extend_gap1 arg Penalty for extending a gap in the

reference.

--start_gap2 arg Gap penalty for aligning before the

query.

--end_gap2 arg Gap penalty for aligning after the

query.

--open_gap2 arg Penalty for opening a new gap in the

query.

--extend_gap2 arg Penalty for extending a gap in the

query.

--min_score_barcode_front arg Minimum score to consider a front

barcode to be a valid barcode

alignment.

--min_score_barcode_rear arg Minimum score to consider a rear

barcode to be a valid alignment (and

min_score_front will then be used for

the front only when this is set).

--min_score_barcode_mask arg Minimum score for a barcode context to

be considered a valid alignment.

--min_score_adapter_mid arg Minimum score for a mid-strand adapter

to be considered a valid alignment.

--min_score_adapter arg Minimum score for an adapter to be

considered a valid alignment.

--min_score_primer arg Minimum score for a primer to be

considered to be a valid alignment.

--front_window_size arg Window size for the beginning barcode.

--rear_window_size arg Window size for the ending barcode.

--require_barcodes_both_ends Reads will only be classified if there

is a barcode above the min_score at

both ends of the read.

--allow_inferior_barcodes Reads will still be classified even if

both the barcodes at the front and rear

(if applicable) were not the best

scoring barcodes above the min_score.

--detect_barcodes Detect barcode sequences at the front

and rear of the read.

--detect_adapter Detect adapter sequences at the front

and rear of the read.

--detect_primer Detect primer sequences at the front

and rear of the read.

--detect_mid_strand_adapter Detect adapter sequences within reads.

--detect_mid_strand_barcodes Search for barcodes through the entire

length of the read.

--min_score_barcode_mid arg Minimum score for a barcode to be

detected in the middle of a read.

--lamp_kit arg LAMP barcoding kit to perform LAMP

detection against.

--min_score_lamp arg Minimum score for a LAMP barcode to be

classified.

--min_score_lamp_mask arg Minimum score for a LAMP barcode mask

context to be classified.

--min_score_lamp_target arg Minimum score for a LAMP target to be

classified.

--min_length_lamp_target arg Minimum align length for a LAMP target

to be classified.

--min_length_lamp_context arg Minimum align length for a LAMP barcode

mask context to be classified.

--additional_lamp_context_bases arg Number of bases from a lamp FIP barcode

context to append to the front and rear

of the FIP barcode before performing

matching. Default is 2.

--num_barcoding_buffers arg Number of GPU memory buffers to

allocate to perform barcoding into.

Controls level of parallelism on GPU

for barcoding.

--num_mid_barcoding_buffers arg Number of GPU memory buffers to

allocate to perform barcoding into.

Controls level of parallelism on GPU

for mid barcoding.

--num_barcode_threads arg Number of worker threads to use for

barcoding.

--read_splitting_arrangement_files arg

Files containing arrangements for read

splitting.

--read_splitting_score_matrix_filename arg

File containing mismatch score matrix

for read splitting.

--num_read_splitting_buffers arg Number of GPU memory buffers to

allocate to perform read splitting.

Controls level of parallelism on GPU

for read splitting using mid adapter

detection.

--num_read_splitting_threads arg Number of worker threads to use for

read splitting.

--min_score_read_splitting arg Minimum alignment score for the mid

adapter on which to split the read.

--do_read_splitting Perform read splitting based on

mid-strand adapter detection.

--max_read_split_depth arg The maximum number of iterations of

read splitting that should be

performed.

--num_reads_per_barcoding_buffer arg The maximum number of reads to process

at once in each barcoding buffer.

--calib_detect Enable calibration strand detection and

filtering.

--calib_reference arg Reference FASTA file containing

calibration strand.

--calib_min_sequence_length arg Minimum sequence length for reads to be

considered candidate calibration

strands.

--calib_max_sequence_length arg Maximum sequence length for reads to be

considered candidate calibration

strands.

--calib_min_coverage arg Minimum reference coverage to pass

calibration strand detection.

--print_workflows Output available workflows.

--flowcell arg Flowcell to find a configuration for

--kit arg Kit to find a configuration for

-a [ --align_ref ] arg Path to alignment reference.

--bed_file arg Path to .bed file containing areas of

interest in reference genome.

--align_type arg Specify whether you wand full or coarse

alignment. Valid values are

(auto/full/coarse).

--num_alignment_threads arg Number of worker threads to use for

alignment.

-z [ --quiet ] Quiet mode. Nothing will be output to

STDOUT if this option is set.

--trace_categories_logs arg Enable trace logs - list of strings

with the desired names.

--verbose_logs Enable verbose logs.

--trace_domains_config arg Configuration file containing list of

trace domains to include in verbose

logging (if enabled)

--disable_pings Disable the transmission of telemetry

pings.

--ping_url arg URL to send pings to

--ping_segment_duration arg Duration in minutes of each ping

segment.

--progress_stats_frequency arg Frequency in seconds in which to report

progress statistics, if supplied will

replace the default progress display.

-q [ --records_per_fastq ] arg Maximum number of records per fastq

file, 0 means use a single file (per

worker, per run id).

--read_batch_size arg Maximum batch size, in reads, for

grouping input files.

--compress_fastq Compress fastq output files with gzip.

-i [ --input_path ] arg Path to input fast5 files.

--input_file_list arg Optional file containing list of input

fast5 files to process from the

input_path.

-s [ --save_path ] arg Path to save fastq files.

-l [ --read_id_list ] arg File containing list of read ids to

filter to

-r [ --recursive ] Search for input files recursively.

--fast5_out Choice of whether to do fast5 output.

--bam_out Choice of whether to do BAM file

output.

--index Choice of whether to output BAM index

file.

--bam_methylation_threshold arg The value below which a predicted

methylation probability will not be

emitted into a BAM file, expressed as a

percentage. Default is 5.0(%).

--resume Resume a previous basecall run using

the same output folder.

--client_id arg Optional unique identifier

(non-negative integer) for this

instance of the Guppy Client

Basecaller, if supplied will form part

of the output filenames.

--nested_output_folder If flagged output fastq files will be

written to a nested folder structure,

based on: protocol_group/sample/protoco

l/qscore_pass_fail/barcode_arrangement/

--max_queued_reads arg Maximum number of reads to be submitted

for processing at any one time.

-h [ --help ] produce help message

-v [ --version ] print version number

-c [ --config ] arg Config file to use

-d [ --data_path ] arg Path to use for loading any data files

the application requires.

GPUを確認しておく。

> nvidia-smi

$ nvidia-smi

Sun Jan 19 12:06:32 2020

+-----------------------------------------------------------------------------+

| NVIDIA-SMI 390.129 Driver Version: 390.129 |

|-------------------------------+----------------------+----------------------+

| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |

| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |

|===============================+======================+======================|

| 0 GeForce GTX 1080 Off | 00000000:07:00.0 On | N/A |

| 19% 52C P8 9W / 200W | 334MiB / 8118MiB | 0% Default |

+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+

| Processes: GPU Memory |

| GPU PID Type Process name Usage |

|=============================================================================|

| 0 1256 G /usr/lib/xorg/Xorg 162MiB |

| 0 1501 G /usr/bin/gnome-shell 100MiB |

| 0 3217 G ...uest-channel-token=12657443395681243138 68MiB |

+-----------------------------------------------------------------------------+

GTX 1080の8 GB VRAM（GDDR5X）になっている。

実行方法

CPU版と同様の流れでランできる。異なるのはCPUスレッドの代わりにデバイス番号を指定するところ。１台しか利用できないなら--device autoか--device cuda:0を指定すればO.K。

guppy_basecaller \
 --flowcell FLO-MIN106 \
 --kit SQK-LSK109 \
 -x cuda:0 \
 -i fast5_dir \
 -s output_dir2 -r

-x [ --device ] Specify basecalling device: 'auto', or 'cuda:<device_id>'.
--flowcell Flowcell to find a configuration for
-kit Kit to find a configuration for
-i [ --input_path ] Path to input fast5 files.
-s [ --save_path ] Path to save fastq files.

100MB程度の小さなfast5データを使ってランタイムを調べた。

f:id:kazumaxneo:20200119144601p:plain

ラン中はGPU使用率がほぼ100%になる（右上）（nvtopを使用 *1）。

結果は

GTX 1080 =>18.5s

CPU(AMD 3700x) => 7m56.6s

25倍の差がついた。大きなデータでは、GPU版を使わないと終わらないのがよく分かりました。

参考

nvtopのインストール

github

git clone https://github.com/Syllo/nvtop.git 
mkdir -p nvtop/build && cd nvtop/build 
cmake .. -DNVML_RETRIEVE_HEADER_ONLINE=True
make
sudo make install
#help
nvtop -h

　nvtopは複数GPUもモニターできます。上では、Terminator（参考にしたHP）を入れて端末を分割しています。

nvtop

#追記　nvtopが導入できない環境ならnvidia-smiを使う。GPU1をモニター。1秒おきに更新（-l 1）。
nvidia-smi -i 1 -l 1

2020-01-18

URMAP

2020 Preprint 高速なツール mapping PeerJ

2020 1/19 コマンドの誤り修正

2020 1/20 twitter追記

2020 6/25 論文追記、リンク切れ更新

　次世代シーケンシングにより、ヒト機能ゲノミクス（Morozova and Marra、2008）から微生物メタゲノミクス（Gilbert and Dupont、2011）までの分野で劇的な進歩が可能になった。次世代研究のデータ分析では、リードをヒトゲノム、ヒトエクソーム、または完全長の微生物ゲノムのコレクションなどの参照データベースにマッピングする必要がある。マッピングは、シーケンスデータベース検索の特殊なケースであり、クエリシーケンスは短く、データベースシーケンスは長く、シーケンスの類似性は高くなる。特定のクエリシーケンス（リード）の場合、マッピングの主な目的は、可能であれば最適な一致を報告することである。

　URMAPはk-merのハッシュテーブルインデックス、つまり、長さがkの固定長ワードを使用する。ここで、k = 24はヒトゲノムで推奨される。インデックスは、メモリキャッシュミスを最小限に抑えるため、RAM内の特定のハッシュワード（スロット）に関連する情報を密接に保つように設計されている。リファレンス（ピン）で1回だけ見つかったスロットにフラグが付けられる。与えられたクエリに対して、URMAPは最初に、リファレンス内で互いに近接している重複しないピンのペアを検索する（ブレース、図1を参照）。中括弧が見つかった場合、位置合わせが試行され、検索が成功するとすぐに終了する。それ以外の場合、シードと拡張の戦略（Altschul et al、1990）が続く。

https://drive5.com/urmap/

URMAP quick start

https://drive5.com/urmap/manual/quickstart.html

Ultrafast readmapper posted, https://t.co/p3lEmuBjrK, paper: https://t.co/8BrXJSD20F. Email me with bugs and feature requests, I'll post updates promptly.
— Robert Edgar (@RobertEdgarPhD) 2020年1月17日

実行方法

１、create index

urmap -make_ufi hg38.fa -output hg38.ufi

２、mapping

#paired-end
urmap -map2 sample_R1.fastq.gz -reverse sample_R2.fastq.gz \
 -ufi hg38.ufi -samout sample.sam

#single
urmap -map sample.fastq.gz -ufi hg38.ufi -samout sample.sam

引用

URMAP, an ultra-fast read mapper

Robert C. Edgar
bioRxiv preprint first posted online Jan. 14, 2020

2020 6/25 追記

URMAP, an ultra-fast read mapper
Robert Edgar
PeerJ, June 24, 2020

2020-01-17

Cytobandファイルのダウンロード

download UCSC

Cytoband file format

https://software.broadinstitute.org/software/igv/Cytoband

Cytobandファイル（ギムザ染色された染色体のバンドのおおよその位置を表す）はUCSCから入手できる。

group => Mapping and Sequencing

track => Chromosome bandを選択。

f:id:kazumaxneo:20200117004111p:plain

get outputをクリックしてcytobandをダウンロードする。

f:id:kazumaxneo:20200117004228p:plain

2020-01-16

circosVCF

2017 Bioinformatics web tool 結果の視覚化 (visualization) UCSC circos

　1000米ドル未満で全ゲノムをシーケンスするという目標が現実になり、医療における全ゲノムシーケンス利用が次第に増えている。 Genomics England （http://www.genomicsengland.co.uk/）などの大規模ゲノムプロジェクトは、数千のゲノムから収集された情報を使用して、健康と病気をよりよく理解することを目的としている。次世代のシーケンス研究から得られたデータを分析するために、さまざまなタイプのソフトウェアが開発されている。全ゲノムVCFファイルは、エクソームVCFファイルの平均10倍の大きさになる。ただし、既存のツールのほとんどは、全ゲノム研究によって生成されるファイルのサイズの増加を無視しているため、大容量メモリなどの余分な要件のために不適切になる。効率的な全ゲノムデータ解析ソフトウェアツールの必要性が高まっている。

　さまざまなプラットフォームが、ヒトNGSデータの分析用に提案されている。（一部略）VarSifter Teer et al （2012）は、グラフィカルユーザーインターフェイス（GUI）の助けを借りて幅広い種類のクエリを定義できる、強力な公開ツールである。ただし、メモリでは使いやすさが制限される。（一部略）

　ここでは、大きなVCFファイルの分析を実行するように設計されたソフトウェアツールであるVCF-Explorerを紹介する。 VCF-Explorerは軽量な構造で、メモリを使い果たすことなく非常に大きなファイルを処理できる。 GUIを使用すると、アノテーションのバリアントを簡単にフィルタリングできる。 ITのバックグラウンドを持たない臨床医と研究者は、簡単にツールを使用できる。初期ファイルのロードや前処理は行われないため、CPU時間とメモリが節約される。シンプルなパーソナルコンピューターや、より大きな機能を備えた大型サーバーで使用できる。

circosVCF demo

https://www.ariel.ac.il/wp/fbl/software/

CircosVCF exercise

https://en-lifesci.m.tau.ac.il/sites/lifesci_en.tau.ac.il/files/media_server/all-units/CircosVCF_Ex.pdf

使い方

http://legolas.ariel.ac.il/~tools/CircosVCF/ にアクセスする。

種を指定する。

f:id:kazumaxneo:20191121033400p:plain

ゲノムのバージョンを指定する。

f:id:kazumaxneo:20191121033413p:plain

描画する染色体を指定する。

f:id:kazumaxneo:20191121033415p:plain

saveをクリック。

VCFをアップロードする。

f:id:kazumaxneo:20200117010701p:plain

途中からうまく動作しなくなりました。ネットワークが不安定なのか、メンテナンスされていないのか不明ですが、できるようになったら追記します。

補足

ローカルのkaryotypeファイルを使う。chr番号とサイズを記載したタブ区切りファイルを用意する。

f:id:kazumaxneo:20200117005944p:plain

これをアップロードする。

引用
CircosVCF: circos visualization of whole-genome sequence variations stored in VCF files

Drori E, Levy D, Smirin-Yosef P, Rahimi O, Salmon-Divon M

Bioinformatics. 2017 May 1;33(9):1392-1393

変異のlandscape visualizationを提供するwebサービス LandScape

2019 Preprint 結果の視覚化 (visualization) tumor human genome panel sequencing

　シーケンシング技術の莫大な発展はゲノムデータの蓄積を加速させ、指数関数的蓄積を引き起こし、ヒトゲノム研究を著しく加速させた。一方、生物学的研究では、増加するサンプルからのトランスオミクスデータを分析している。多くの適切に設計された視覚化は、結果の結果を明確かつ完全に報告するためにpublicationで広く利用されている。最近のガンゲノム研究では、一般に数百の腫瘍ゲノムデータを分析し、mutation burden、突然変異遺伝子、生物学的パスウェイなどの一連の属性間でそれらを比較すが、これらはよく知られている‘Landscape’ figuresで一般的に示される。ランドスケープ図とその変形は、すべてのサンプルの関心のある多層特性パターンを一度に要約するのに最適であるため、サンプルの大規模コホートの研究で広く適用されている。これらの数値の実装は、十分なプログラミングスキルを必要とし、publicationのために微調整する多大な努力を必要とするため、生物学研究者にとって課題のままである。さらに、既存のソフトウェアパッケージによって生成される数値は静的であるため、インタラクティブなデータ探索が不可能になる。　
　ここでは、LandScapeovizを紹介する。これは、バッチサンプルの統合マルチレイヤーデータをインタラクティブかつリアルタイムに視覚化するためのWebベースのアプリケーションである。 LandScapeは、適切に設計されたフォーマットを使用して複数のデータタイプを処理し、表示をカスタマイズし、結果を調べ、高品質の図を生成する一連の組み込み関数を提供する。

「LandScape」ビジュアライゼーションは、バッチサンプルの複数のレイヤーからの統合データの体系的な図を提供するために頻繁に利用される。これらのデータは、ガンで変異した遺伝子や生物学的パスウェイなどの特定の属性で互いに比較される。このオンラインの「LandScape」ビジュアライゼーションは、追加のパネル（年齢、性別、組織学など）を備えた固定部分（ヒストグラムと遺伝子パネル）で設計されている。

Oviz-Bio Github

使い方

Oviz-Bio

Oviz-Bioは、ランドスケープビュー、SNVシグネチャ、突然変異対遺伝子、CNV、融合遺伝子、SV、および染色体3Dコンフォメーションなどの分析をサポートしている。

https://bio.oviz.org/demo-project にアクセスする。

データを視覚化するには、必要な形式でCSVファイルをアップロードし、サイドバーオプションを使用して表示をカスタマイズする。

demoのCSVをアップロードした。

f:id:kazumaxneo:20200115235642p:plain

ダークモードOFF

f:id:kazumaxneo:20200115235646p:plain

入力ファイルCSVは1行目がヘッダー、2行目以降に１サンプル1行ずつ記載していく。de movo CSVをexcelで開いた。

f:id:kazumaxneo:20200116013048p:plain

拡大した。１列目はsampleID。２列目以降に表示したい遺伝子を記載する。変異なしは"-"で表記する。not availableはN/Aで表記する。

f:id:kazumaxneo:20200116013400p:plain

この変異情報の遺伝子パネル情報だけアップロードした場合、変異の垂直トラック図と変異の種類のヒストグラムだけが出力される。

f:id:kazumaxneo:20200116014608p:plain

横方向がサンプル、ここではT119まである。縦方向が遺伝子。

サブタイプ、性別、などの属性情報もアップすると、上に示したように追加のヒストグラムが表示される。詳細はmanualと、マニュアルの上からダウンロードできるテストデータを確認してください。

引用

LandScape: a web application for interactive genomic summary visualization
Wenlong Jia, Hechen Li, Shiying Li, Shuaicheng Li

bioRxiv preprint first posted online Dec. 6, 2019

ゲノムの指定した領域をNでマスクする

mask fasta/fastqの操作 request

bedtoolsを使う。

Document

bedtoolsのインストール

本体　Github

#bioconda(link)
condaw install -c bioconda -y bedtools

> bedtools maskfasta

$ bedtools maskfasta

Tool: bedtools maskfasta (aka maskFastaFromBed)

Version: v2.29.0

Summary: Mask a fasta file based on feature coordinates.

Usage: bedtools maskfasta [OPTIONS] -fi <fasta> -fo <fasta> -bed <bed/gff/vcf>

Options:

-fi Input FASTA file

-bed BED/GFF/VCF file of ranges to mask in -fi

-fo Output FASTA file

-soft Enforce "soft" masking.

Mask with lower-case bases, instead of masking with Ns.

-mc Replace masking character.

Use another character, instead of masking with Ns.

-fullHeader Use full fasta header.

By default, only the word before the first space or tab

is used.

実行方法

入力の FASTAとマスクしたい領域のBED/GFF/VCFを指定して実行する。

bedtools maskfasta -fi input.fa -fo mask.fa -bed regions.bed

出力

f:id:kazumaxneo:20200114200153p:plain

ソフトマスク

bedtools maskfasta -fi input.fa -fo mask.fa -bed regions.bed -soft

出力

f:id:kazumaxneo:20200114200655p:plain

引用

BEDTools: a flexible suite of utilities for comparing genomic features.

Quinlan AR1, Hall IM.

Bioinformatics. 2010 Mar 15;26(6):841-2.

https://www.ncbi.nlm.nih.gov/pubmed/20110278

ベンチマークその２（2019）

2020 9/12,9/13 誤字修正

１の続きになります。

ピークメモリ値

　以下のグラフは前回の投稿のxeon E5 v4 dualのランログから取った、各ツールのピークメモリ使用量（GB）になる (n=5)。flyeのピークメモリが突出していた。特にエラー修正していないraw ONT リードをアセンブリに使うと60GBを超えていた(*1)。前もってLordecでエラー修正してから使うと、ピークメモリは25GB前後まで減った。

f:id:kazumaxneo:20200113194302p:plain

前回の投稿で、32GBメモリのryzen7計算機は、raw ONT リードのアセンブリのみセグメンテーションフォルトを出して終了したと書いたが、ピークメモリの値からメモリが足りなかった事が主要な原因だとわかった。

補足

ところでLoRDEC（紹介）でエラー修正することでONTのアセンブリは変化するのだろうか？アセンブリ配列をseqkitでチェックする。

$ seqkit stats -T flye_correct/assembly.fasta flye_raw/assembly.fasta

出力を表にした。

f:id:kazumaxneo:20200113200920p:plain

de novo アセンブリ前にエラー修正することで、contig数は144から49まで減り、連続性も大きく改善していた。最大長のcontigサイズも14Mbから16Mbに増加した。QUAST-LG（紹介）の結果も一番下にまとめた（*5）。アセンブリ前にショートリードでエラー修正する価値は多いにあると言える。

まとめに入る。今回使ったマシンの長所・短所を簡単にまとめた。

１、3700x自作機

不可の低いジョブで高い性能を示したものの、多くのコアを使う計算集約的なジョブでは遅れが目立った。ECCメモリ（参考）を使えないため、長時間のジョブを走らせるには不安が残る。しかし8万円前後で組めることを考えれば費用効果は抜群に高い。2019夏に購入して連続運用しているが、今のところシャットダウンやパーツの破損などのトラブルもない。Ubuntuがかなり成熟してきたことも使い勝手の良さを後押ししている。目立った欠点は128GBしかメモリが積めないことか（ボードによっては64GB）。またメモリがデュアルチャネル、PCIの帯域なども気になるところ。

２、MousePro

簡単なジョブ、計算集約的なジョブを問わず高い性能を示した。

３、xeon platinum自作ワークステーション

本体価格が一番高い割には計算時間がかかることが多く、今回の比較ではいいところがなかった。flyeのランタイムが長くなった原因は不明。AVX-512対応は長所だが（e.g., BWA MEM2）、AVX-512はクロックが下がる問題があるので手放しで喜べない。

４、mac mini 2018上位

３と同様、特にいいところがなかった。計算集約的なジョブ、特にのlordecのランで遅れが目立ったのは、mac miniの冷却性能が低く、長時間負荷による熱の蓄積でCPUのサーマルスロットリングが起き、上限クロックまで上がらなかったためかもしれない（*4）。osxが使え、各種安定ドライバ、完成度の高い有料アプリが揃っているのが一番のメリットかもしれない。

５、Apple mac pro 2012 early （リンク）

すでに引退させた機種だが、予想以上に高いパフォーマンスを示した。osのアップデートができないので今後セキュリティ面が難しくなるが、用途を限定すればまだまだ使えそうな気がした。しかし電気代の面で高コストである。主力運用は控えたい。

次に、今回実行したジョブの各計算機のランタイムの平均値と、計算機の価格の図を示した。計算機の発売時期は全く異なり、点の数も全然足りないのだが、参考程度に見て欲しい。図では、安くて性能が高い（費用効果が良い）計算機ほど左下に位置する。ryzen7の計算機がその位置に最も近く、２番目に近いのはxeon E5 2680 dualの計算機である。では、ryzen7 3700xが最も優秀な計算機と結論して良いのだろうか。

f:id:kazumaxneo:20200114003900p:plain

　ここで忘れてはいけないのが、計算機の堅牢性や冗長性、環境負荷（消費電力）など数値に変えにくい因子である。上の図はこれらの因子を抜きにして議論している。いくら安くても、ラン中にメモリエラーを起こしたり、データが頻繁に飛ぶような計算機では使い物にならない。よって、左下に行くほど良い計算機と簡単に判断するのは危険である。とすれば、実は散布図中央の一番下付近がスイートスポットで、この位置する計算機が好ましいのかもしれない。高い計算能力を持ち、堅牢性も備えた計算機を組み立てれば、コストが上がって自然にその辺りにプロットされるからである。今回使った計算機の中では、xeon E5 2680 dualが最もその位置に近いだろうか。

　では結局のところどのような計算機を持つのが理想的だろうか？とりあえず3990xを買って計算集約的なジョブにはこちらを使い、ルーチンの軽いジョブはryzenやcoreiシリーズ、という風に使い分けするのがバランスが良いろうか？この手の話はケースバイケースであり、十把一絡げな議論は難しいが、ここで１つ注意したいのは、ZEN2世代のRyzen Threadripperのメモリ上限が256GBしかないことである。256GBメモリでは、真核生物のアセンブリ（例えばmasurcaのメモリ使用量）、メタゲノムのアセンブリ（metaFlyeの質問, metaspades）では不安が残る。幸い現在ではたくさんのクラウドリソースが利用できるようになっていて、お金を払えば24TBメモリなどリッチなクラウドはすぐに利用できる。

この文章自体５年、いや１０年遅いが、今や研究室単位で巨大なクラスタを持つ時代ではなくなって来ている。研究室単位ではせいぜいRyzen Threadripper程度のマシンにとどめ、大きなデータはクラウド（スパコン含む）、という風に使い分けるのが賢い選択だと思う。（分野によります。主観性の強い意見なので参考程度に）。

2020 9/13 追記

OEM向けに2TBまで対応したThreadripper Proが発売・出荷されている。こちらのCPUが搭載されたマシンを使えば、メモリ容量問題はほぼ解決する（3990xよりクロックがやや低いので、TRを使うメリットがやや損なわれているのは問題だと思う）。

まとめ

xeon E5 v4 2680 dual > ryzen7 3700x ≥ mac mini i7 > xeon platinum P-8136

補足

flyeのメモリ使用量はGithubにまとめられている。

https://github.com/fenderglass/Flye

実はxeon platinumのマシンはその位置を狙って組んだのだが、ジョブによってかなり時間がかかる計算機になってしまった（flye参照）。

このmac mini2018のcorei7はオプションであり、デフォルトではよりクロックの低いCPUになっている。appleは時々インテル系CPUの発熱を甘く見てやらかすので（macbook pro2018のcorei9コアモデル）、corei7の熱をうまくさばけていないのかもしれない。

学内のスパコンサービスや計算機研究者にお願いしてリッチな環境も利用させてもらえるなら解決する問題かもしれない。

連続性は向上していても、その分エラーが増えていれば意味がない。QUEST-LGで評価する。

quast flye_correct/assembly.fasta flye_raw/assembly.fasta \
 -R Arabidopsis_thaliana.TAIR10.dna.toplevel.fa.gz \
 -1 clean_1.fq.gz -2 clean_2.fq.gz \
 --nanopore ont_trimmed.fq.gz \
 --labels flye_correct,flye_raw \
 --eukaryote \
 -o quast_output \
 -t 40 --large

以下に結果の表を乗せた。エラー修正することで、genome fraction6%近く上がった。ミスマッチの塩基数は増えているものの、indelエラーが大きく減った。また、リファレンスゲノムに全くアラインされないcontigの数が41から1に減った。まだindelエラーは多いが、raconやpilonでエラー修正することでさらにrefineできると思われる。

Genome statistics	flye_correct	flye_raw
Genome fraction (%)	79.563	73.857
Duplication ratio	1.017	1.002
Largest alignment	3832552	3540314
Total aligned length	96645076	88413363
NG50	14486416	12278303
NG75	11422790	9738359
NA50	488799	274714
NA75	58141	-
NGA50	488799	336387
NGA75	56885	-
LG50	4	5
LG75	7	7
LA50	44	64
LA75	222	-
LGA50	44	54
LGA75	224	-
Misassemblies
# misassemblies	856	565
# relocations	428	354
# translocations	415	199
# inversions	13	12
# misassembled contigs	20	16
Misassembled contigs length	116668822	112170040
# local misassemblies	7465	4680
# scaffold gap ext. mis.	0	0
# scaffold gap loc. mis.	0	0
# possible TEs	304	166
# unaligned mis. contigs	15	56
Unaligned
# fully unaligned contigs	1	41
Fully unaligned length	66442	1098364
# partially unaligned contigs	36	92
Partially unaligned length	22758306	36369393
Mismatches
# mismatches	1264899	973593
# indels	381552	860169
Indels length	1335715	1752268
# mismatches per 100 kbp	1330.59	1103.27
# indels per 100 kbp	401.37	974.74
# indels (<= 5 bp)	348521	827147
# indels (> 5 bp)	33031	33022
# N's	200	900
# N's per 100 kbp	0.17	0.71
Statistics without reference
# contigs	47	139
# contigs (>= 0 bp)	47	144
# contigs (>= 1000 bp)	47	141
# contigs (>= 5000 bp)	46	127
# contigs (>= 10000 bp)	43	112
# contigs (>= 25000 bp)	41	102
# contigs (>= 50000 bp)	38	72
Largest contig	16188110	14781025
Total length	119509464	125898504
Total length (>= 0 bp)	119509464	125904440
Total length (>= 1000 bp)	119509464	125902454
Total length (>= 5000 bp)	119505257	125848957
Total length (>= 10000 bp)	119481288	125734397
Total length (>= 25000 bp)	119445279	125544799
Total length (>= 50000 bp)	119331684	124385925
N50	14486416	12278303
N75	11422790	8802673
L50	4	5
L75	7	8
GC (%)	36.1	36.41

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

GuppyのGPU版を使う

URMAP

Cytobandファイルのダウンロード

circosVCF

変異のlandscape visualizationを提供するwebサービス LandScape

ゲノムの指定した領域をNでマスクする

ベンチマークその２（2019）