2019-03-11

Nanoporeのオフィシャルコマンドラインbasecaller1 Albacore

2019 3/11 タイトル修正

2019 3/12 コマンド追記、誤ったコメント削除

Albacore software overview（2019 3/10確認）

Albacoreは、Oxford Nanoporeベースコールアルゴリズム、およびいくつかの後処理ステップを提供するデータ処理パイプラインである。 Windows、Mac OS X、および複数のLinuxプラットフォームでコマンドラインから実行できる。Albacoreパイプラインは以下を含む。
１、Basecalling: MinKNOW basecallingに見られるのと同様のアルゴリズムの実装。ただし、MinKNOWでは現在処理されていないベースコールケミストリ用の設定ファイルも含まれている（ e.g. 1D2 reads）。
２、 Calibration Strand Detection: 読み取り値は、内蔵されたminimap2アライナを介してキャリブレーションストランドの基準に対して調整される。キャリブレーションストランドは、ナノポアと実験の品質管理として機能する。現在の読み取り値がキャリブレーションストランドであると識別された場合、バーコードステップやアライメントステップは実行されない。
３、Barcoding/Demultiplexing: 各ストランドの始めと終わりは、オックスフォードナノポアテクノロジーズによって現在提供されているバーコードに対してアライメントされ、バーコードの結果によってdemultiplexingされる。
４、Alignment: ユーザーは、FASTA、lastdb、またはminimap2インデックス形式のリファレンスファイルを提供できる。この場合、リードは統合されたminimap2アライナを介してこの参照に対してアライメントされる。
現在の仮定と制限
このソフトウェアは現在、linear（1D）または1D2ケミストリのベースコールを提供している。このソフトウェアへの入力として使用されるリードの.fast5ファイルには、生データが含まれている必要がある。 MinKNOWソフトウェアによって生成されたリードファイルには、生データがデフォルトで含まれているので、ユーザーが古いファイルを処理しようとしない限り、これは問題にはならない。

ベースコール速度
ベースコールにかかる時間は、コンピューターの仕様、割り当てられたスレッド数、および分析されたリード数によって異なる。ガイドラインに従うと、1Dベースコールは、リード長の中央値7 kbで、毎分約90リード実行できる。

manual（サード?）

https://denbi-nanopore-training-course.readthedocs.io/en/latest/basecalling/basecalling.html

Release notes (2019 3/9)

インストール

docker（公式イメージ使用）と、mac os 10.12 python3.6.1環境でfast5のサブディレクトリ１つだけ使ってテストした。

version2.3.4のハードウエア要件

Albacoreを実行するための一般的なシステム要件
4 GBのRAMと1 Dベースコール用のワーカースレッドあたり1 GB
4 GBのRAMと1 D2ベースコール用のワーカースレッドあたり2 GB
.debまたは.msi インストーラーの管理者アクセス権（オプションで.whlの場合）
インストール用に最大100 MBのドライブスペース、ベースコールされたファイル用に最低512 GBのストレージスペース（1 TB推奨）
1Dベースコール：生データのみを含む.fast5ファイルから開始すると、ファイルサイズは元のサイズの約4.5倍に増加する。

version2.3.4のインストールソフトウエア依存

mac

64-bit OSX 10.11 (El Capitan) or higher
Python 3.6 (we recommend installing Python via brew)
pip for Python 3 (this should come with your Python 3 installation)

linux

Python 3.4, 3.5 or 3.6
pip for Python 3 version 8.1 or higher (depending on how this has been installed it may be referred to as "pip" or "pip3")

windows

64-bit Windows 10
64-bit Windows 7 SP1 and the Windows 10 Universal C Runtime

AlbacoreのダウンロードはOxford nanopore HPのsoftware downloadsから行う。

↑このページにジャンプするには購入用アカウントでlog inしておく必要あり。

インストール方法は上のページの各ツールのInstallation guideに書いてある。ここではdockerを使ってテストする。

docker pull genomicpariscentre/albacore

> read_fast5_basecaller.py -h

# read_fast5_basecaller.py -h

usage: read_fast5_basecaller.py [-h] [-l] [-v] [-i INPUT] -t WORKER_THREADS -s

SAVE_PATH [--resume [X]] [-f FLOWCELL]

[-k KIT] [--barcoding] [-c CONFIG]

[-d DATA_PATH] [-b] [-r]

[-n FILES_PER_BATCH_FOLDER] [-o OUTPUT_FORMAT]

[-q READS_PER_FASTQ_BATCH]

[--disable_filtering] [--disable_pings]

ONT Albacore Sequencing Pipeline Software

optional arguments:

-h, --help show this help message and exit

-l, --list_workflows List standard flowcell / kit combinations.

-v, --version Print the software version.

-i INPUT, --input INPUT

Folder containing read fast5 files (if not present,

will expect file names on stdin).

-t WORKER_THREADS, --worker_threads WORKER_THREADS

Number of worker threads to use.

-s SAVE_PATH, --save_path SAVE_PATH

Path to save output.

--resume [X] Resume previous run for the given save path. Optional

parameter X is for debugging purposes only.

-f FLOWCELL, --flowcell FLOWCELL

Flowcell used during the sequencing run.

-k KIT, --kit KIT Kit used during the sequencing run.

--barcoding Search for barcodes to demultiplex sequencing data.

-c CONFIG, --config CONFIG

Optional configuration file to use.

-d DATA_PATH, --data_path DATA_PATH

Optional path to model files.

-b, --debug Output additional debug information to the log.

-r, --recursive Recurse through subfolders for input data files.

-n FILES_PER_BATCH_FOLDER, --files_per_batch_folder FILES_PER_BATCH_FOLDER

Maximum number of files in each batch subfolder. Set

to 0 to disable batch subfolders.

-o OUTPUT_FORMAT, --output_format OUTPUT_FORMAT

desired output format, can be fastq,fast5 or only one

of these.

-q READS_PER_FASTQ_BATCH, --reads_per_fastq_batch READS_PER_FASTQ_BATCH

number of reads per FastQ batch file. Set to 1 to

receive one reads per file and file names which

include the read ID. Set to 0 to have all reads per

run ID written to one file.

--disable_filtering Disable filtering into pass/fail folders

--disable_pings Do not send summary information about the run

>read_fast5_basecaller.py -v

# read_fast5_basecaller.py -v

ONT Albacore Sequencing Pipeline Software (version 2.3.4)

テスト時のバージョンは2.3.4

#1D² chemistry

> full_1dsq_basecaller.py -h

# full_1dsq_basecaller.py -h

usage: full_1dsq_basecaller.py [-h] [-l] [-v] -i INPUT -t WORKER_THREADS -s

SAVE_PATH [--resume [REASON]] [--barcoding] -f

FLOWCELL -k KIT [-d DATA_PATH] [-b] [-r]

[-n FILES_PER_BATCH_FOLDER] [-o OUTPUT_FORMAT]

[-q READS_PER_FASTQ_BATCH]

[--disable_filtering] [--disable_pings]

ONT Albacore Sequencing Pipeline Software

optional arguments:

-h, --help show this help message and exit

-l, --list_workflows List standard flowcell / kit combinations.

-v, --version Print the software version.

-i INPUT, --input INPUT

Folder containing read fast5 files.

-t WORKER_THREADS, --worker_threads WORKER_THREADS

Number of worker threads to use.

-s SAVE_PATH, --save_path SAVE_PATH

Path to save output.

--resume [REASON] Resume previous run for the given save path. Optional

parameter X is for debugging purposes only.

--barcoding Search for barcodes to demultiplex sequencing data.

-f FLOWCELL, --flowcell FLOWCELL

Flowcell used during the sequencing run.

-k KIT, --kit KIT Kit used during the sequencing run.

-d DATA_PATH, --data_path DATA_PATH

Optional path to model files.

-b, --debug Output additional debug information to the log.

-r, --recursive Recurse through subfolders for input data files.

-n FILES_PER_BATCH_FOLDER, --files_per_batch_folder FILES_PER_BATCH_FOLDER

Maximum number of files in each batch subfolder. Set

to 0 to disable batch subfolders.

-o OUTPUT_FORMAT, --output_format OUTPUT_FORMAT

desired output format, can be fastq,fast5 or fast5.

Applies only to 1d part of process

-q READS_PER_FASTQ_BATCH, --reads_per_fastq_batch READS_PER_FASTQ_BATCH

number of reads per FastQ batch file. Set to 0 to

receive one reads per file and file names which

include the read ID.

--disable_filtering Disable filtering into pass/fail folders

--disable_pings Do not send summary information about the run

> paired_read_basecaller.py -h

# paired_read_basecaller.py -h

usage: paired_read_basecaller.py [-h] [-l] [-v] [-i INPUT] -f FILE_INDEX -t

WORKER_THREADS -s SAVE_PATH

[--resume [REASON]] [--barcoding] -c CONFIG

[-d DATA_PATH] [-b] [-r]

[-q READS_PER_FASTQ_BATCH]

[--disable_filtering] [--disable_pings]

ONT Albacore 1D-Squared Sequencing Pipeline Software

optional arguments:

-h, --help show this help message and exit

-l, --list_workflows List standard flowcell / kit combinations.

-v, --version Print the software version.

-i INPUT, --input INPUT

Folder containing read fast5 files (if not present,

will expect file names on stdin).

-f FILE_INDEX, --file_index FILE_INDEX

Summary file for indexing potential paired reads.

-t WORKER_THREADS, --worker_threads WORKER_THREADS

Number of worker threads to use.

-s SAVE_PATH, --save_path SAVE_PATH

Path to save output.

--resume [REASON] Resume previous run for the given save path. Optional

parameter X is for debugging purposes only.

--barcoding Search for barcodes to demultiplex sequencing data.

-c CONFIG, --config CONFIG

Configuration file to use.

-d DATA_PATH, --data_path DATA_PATH

Optional path to model files.

-b, --debug Output additional debug information to the log.

-r, --recursive Recurse through subfolders for input data files.

-q READS_PER_FASTQ_BATCH, --reads_per_fastq_batch READS_PER_FASTQ_BATCH

number of reads per FastQ batch file. Set to 0 to

receive one reads per file and file names which

include the read ID.

--disable_filtering Disable filtering into pass/fail folders

--disable_pings Do not send summary information about the run

サポートされているフローセルとkitの組み合わせの確認

> read_fast5_basecaller.py -l

# read_fast5_basecaller.py -l

Parsing config files in /opt/albacore.

Available flowcell + kit combinations are:

flowcell kit barcoding config file

FLO-MIN106 SQK-DCS108 r94_450bps_linear.cfg

FLO-MIN106 SQK-LRK001 r94_450bps_linear.cfg

FLO-MIN106 SQK-LSK108 r94_450bps_linear.cfg

FLO-MIN106 SQK-LSK109 r94_450bps_linear.cfg

FLO-MIN106 SQK-LWB001 included r94_450bps_linear.cfg

FLO-MIN106 SQK-LWP001 r94_450bps_linear.cfg

FLO-MIN106 SQK-PBK004 included r94_450bps_linear.cfg

FLO-MIN106 SQK-PCS108 r94_450bps_linear.cfg

FLO-MIN106 SQK-PSK004 r94_450bps_linear.cfg

FLO-MIN106 SQK-RAB201 included r94_450bps_linear.cfg

FLO-MIN106 SQK-RAB204 included r94_450bps_linear.cfg

FLO-MIN106 SQK-RAD002 r94_450bps_linear.cfg

FLO-MIN106 SQK-RAD003 r94_450bps_linear.cfg

FLO-MIN106 SQK-RAD004 r94_450bps_linear.cfg

FLO-MIN106 SQK-RAS201 r94_450bps_linear.cfg

FLO-MIN106 SQK-RBK001 included r94_450bps_linear.cfg

FLO-MIN106 SQK-RBK004 included r94_450bps_linear.cfg

FLO-MIN106 SQK-RLB001 included r94_450bps_linear.cfg

FLO-MIN106 SQK-RLI001 r94_450bps_linear.cfg

FLO-MIN106 SQK-RNA001 r941_70bps_rna_linear.cfg

FLO-MIN106 SQK-RPB004 included r94_450bps_linear.cfg

FLO-MIN106 VSK-VBK001 r94_450bps_linear.cfg

FLO-MIN106 VSK-VMK001 included r94_450bps_linear.cfg

FLO-MIN106 VSK-VSK001 r94_450bps_linear.cfg

FLO-MIN107 SQK-DCS108 r95_450bps_linear.cfg

FLO-MIN107 SQK-LRK001 r95_450bps_linear.cfg

FLO-MIN107 SQK-LSK108 r95_450bps_linear.cfg

FLO-MIN107 SQK-LSK109 r95_450bps_linear.cfg

FLO-MIN107 SQK-LWB001 included r95_450bps_linear.cfg

FLO-MIN107 SQK-LWP001 r95_450bps_linear.cfg

FLO-MIN107 SQK-PBK004 included r95_450bps_linear.cfg

FLO-MIN107 SQK-PCS108 r95_450bps_linear.cfg

FLO-MIN107 SQK-PSK004 r95_450bps_linear.cfg

FLO-MIN107 SQK-RAB201 included r95_450bps_linear.cfg

FLO-MIN107 SQK-RAB204 included r95_450bps_linear.cfg

FLO-MIN107 SQK-RAD002 r95_450bps_linear.cfg

FLO-MIN107 SQK-RAD003 r95_450bps_linear.cfg

FLO-MIN107 SQK-RAD004 r95_450bps_linear.cfg

FLO-MIN107 SQK-RAS201 r95_450bps_linear.cfg

FLO-MIN107 SQK-RBK001 included r95_450bps_linear.cfg

FLO-MIN107 SQK-RBK004 included r95_450bps_linear.cfg

FLO-MIN107 SQK-RLB001 included r95_450bps_linear.cfg

FLO-MIN107 SQK-RLI001 r95_450bps_linear.cfg

FLO-MIN107 SQK-RNA001 r941_70bps_rna_linear.cfg

FLO-MIN107 SQK-RPB004 included r95_450bps_linear.cfg

FLO-MIN107 VSK-VBK001 r95_450bps_linear.cfg

FLO-MIN107 VSK-VMK001 included r95_450bps_linear.cfg

FLO-MIN107 VSK-VSK001 r95_450bps_linear.cfg

FLO-PRO001 SQK-LSK109 r941_450bps_linear_prom.cfg

実行方法

フローセルとkitの種類、raw callのディレクトリパス等を指定する（*1）。例えばMinION/GridIONのR.9.4.1FLO-MIN106D（リンク）、Rapid barcoding lit（リンク）を使っているなら現行はSQK-RBK004なので、以下のようになる（2019 3/10現在）。

read_fast5_basecaller.py -f FLO-MIN106 -k SQK-LSK308 -t 24 \
 -s /data/albacore_basecalled/ -o fastq -q 100000 \
 -i /data/Ecoli_R7_ONI_flowcell_18/

-f Flowcell used during the sequencing run
-i Folder containing read fast5 files (if not present, will expect file names on stdin)
-s Path to save output
-t Number of worker threads to use
-k Kit used during the sequencing run
-o desired output format, can be fastq,fast5 or only one of these
-q number of reads per FastQ batch file. Set to 0 to receive one reads per file and file names which include the read ID

重いので利用できるだけスレッドを指定した方が良いです（-t <INT>）。また、-o fast5でexportし、さらに他のbase callerに使うこともできる。

出力

failtとpassにそれぞれfastqができる（-o fastq）。-qで指定した数でfastqは分割される。

f:id:kazumaxneo:20190309172231j:plain

追記

@kazumachack いつもツールの紹介記事をありがたく拝見しております。
ご存知だったら申し訳ないですが、
AlbacoreとGuppyについて、「--recursive」をつけることでサブフォルダ内のfast5をすべて検索してベースコールできます。全部のfast5を一つのフォルダにまとめなくても大丈夫です。
— あかまる (@aka_maru_k) March 12, 2019

サブディレクトリも含めてbasecallingする。

read_fast5_basecaller.py -f FLO-MIN106 -k SQK-LSK308 -t 24 \
 -s /data/albacore_basecalled/ -o fastq -q 100000 \
 -i /data/Ecoli_R7_ONI_flowcell_18/
 --recursive

-r, --recursive Recurse through subfolders for input data files

あかまるさん、教えていただきありがとうございます。

引用

nanoporetech Albacore 2.3.3 Release

https://community.nanoporetech.com/posts/albacore-2-3-3

参考

https://github.com/rrwick/Basecalling-comparison/tree/95bf07476f61cda79e6971f20f48c6ac83e634b3

基本的なSQK-RAD004などが入ってませんね。MinKNOW（リンク）を使ってということでしょうか。

2019-03-10

illuminaのショートリードシミュレータ Sandy（RNA seqにも対応）

simulator RNA seq fasta/fastqの操作

　Sandyは、与えられたfastaファイルからシングルエンド/ペアエンドのリードを生成するシンプルなバイオインフォマティックツールである。多くの次世代シーケンシング分析は、実際には正確には満足されていない仮説モデルおよび原理に依存している。ポジティブコントロールを提供するシミュレーションデータは、これらの困難を克服するための完璧な方法である。ここでは、Sandyを紹介する。Sandyは、第2世代および第3世代シーケンスの合成リードをを生成する簡単で使いやすく、高速かつ完全なツールセットである。 Sandyは全ゲノムシーケンス、全エクソームシーケンス、RNAseqリードをシミュレートする。 Sandyはまた、ユーザーがデータを操作するためのいくつかの機能を提供するとともに、リードのゲノム上のポジション、遺伝子および転写産物の発現、シーケンシングエラー、「真の」情報（生成データに基づく）を含む体系化されたデータベースも提供する。Sandyの最も印象的な機能の1つは、シーケンシングエラーとともに、多型をsnv、indel、および構造変異（例えば、遺伝子重複、逆位、遺伝子融合）としてシミュレートする能力である - さらなる処理ステップを必要としない。そのため、Sandyは、ゲノミクスやトランスクリプトミクスにおけるさまざまなパイプラインの結果のベンチマーク、および新しい仮説の生成やシーケンスプロジェクトの最善の設計、おそらく時間とコストの最適化に役立つだろう。

Welcome to Sandy simulator! ## | sandy

manual

https://galantelab.github.io/sandy/v0.22/main.html#docker-usage

インストール

ubuntu16.04でテストした。

依存

#step1 Debian/Ubuntu
apt update && apt install perl zlib1g-dev gcc make

#step1 Centos/Fedora
yum install perl zlib gcc make

#step2 install cpanminus package 
cpan -i App::cpanminus

本体　Github

cpanm App::Sandy

またはオーサーらが準備してくれているdockerイメージを使う

docker pull galantelab/sandy

>sandy

# sandy

Usage:

sandy [options]

sandy help <command>

sandy <command> ...

Options:

-h, --help brief help message

-u, --man full documentation

Help commands:

help show application or command-specific help

man show application or command-specific documentation

Misc commands:

version print the current version

citation export citation in BibTeX format

Database commands:

quality manage quality profile database

expression manage expression-matrix database

variation manage structural variation database

Main commands:

genome simulate genome sequencing

transcriptome simulate transcriptome sequencing

> sandy genome --help

# sandy genome --help

Usage:

sandy genome [options] <fasta-file>

Arguments:

a fasta-file

Options:

-h, --help brief help message

-u, --man full documentation

-v, --verbose print log messages

-p, --prefix prefix output [default:"out"]

-o, --output-dir output directory [default:"."]

-O, --output-format bam, sam, fastq.gz, fastq [default:"fastq.gz"]

-1, --join-paired-ends merge R1 and R2 outputs in one file

-x, --compression-level speed compression: "1" - compress faster,

"9" - compress better [default:"6"; Integer]

-i, --append-id append to the defined template id [Format]

-I, --id overlap the default template id [Format]

-j, --jobs number of jobs [default:"1"; Integer]

-s, --seed set the seed of the base generator

[default:"time()"; Integer]

-c, --coverage genome coverage [default:"8", Number]

-t, --sequencing-type single-end or paired-end reads

[default:"paired-end"]

-q, --quality-profile sequencing system profiles from quality

database [default:"poisson"]

-e, --sequencing-error sequencing error rate for poisson

[default:"0.001"; Number]

-m, --read-mean read mean size for poisson

[default:"100"; Integer]

-d, --read-stdd read standard deviation size for poisson

[default:"0"; Integer]

-M, --fragment-mean the fragment mean size for paired-end reads

[default:"300"; Integer]

-D, --fragment-stdd the fragment standard deviation size for

paired-end reads [default:"50"; Integer]

-a, --genomic-variation a list of genomic variation entries from

variation database. This option may be passed

multiple times [default:"none"]

-A, --genomic-variation-regex a list of perl-like regex to match genomic

variation entries in variation database.

This option may be passed multiple times

[default:"none"]

> sandy transcriptome -h

# sandy transcriptome -h

Usage:

sandy transcriptome [options] <fasta-file>

Arguments:

a fasta-file

Options:

-h, --help brief help message

-u, --man full documentation

-v, --verbose print log messages

-p, --prefix prefix output [default:"out"]

-o, --output-dir output directory [default:"."]

-O, --output-format bam, sam, fastq.gz, fastq [default:"fastq.gz"]

-1, --join-paired-ends merge R1 and R2 outputs in one file

-x, --compression-level speed compression: "1" - compress faster,

"9" - compress better [default:"6"; Integer]

-i, --append-id append to the defined template id [Format]

-I, --id overlap the default template id [Format]

-j, --jobs number of jobs [default:"1"; Integer]

-s, --seed set the seed of the base generator

[default:"time()"; Integer]

-n, --number-of-reads set the number of reads

[default:"1000000", Integer]

-t, --sequencing-type single-end or paired-end reads

[default:"paired-end"]

-q, --quality-profile sequencing system profiles from quality

database [default:"poisson"]

-e, --sequencing-error sequencing error rate for poisson

[default:"0.001"; Number]

-m, --read-mean read mean size for poisson

[default:"100"; Integer]

-d, --read-stdd read standard deviation size for poisson

[default:"0"; Integer]

-M, --fragment-mean the fragment mean size for paired-end reads

[default:"300"; Integer]

-D, --fragment-stdd the fragment standard deviation size for

paired-end reads [default:"50"; Integer]

-f, --expression-matrix an expression-matrix entry from database

root@84830ec770b4:/data#

> sandy expression #RNAseq向けデータベース組み込み済み発現パターン

# sandy expression

.----------------------------------------------------------------------------------.

+-------------------------------------+--------------------+----------+------------+

'-------------------------------------+--------------------+----------+------------'

詳細なマニュアル

> sandy genome --man

dockerでヘルプを見るなら

> docker run galantelab/sandy --help

$ docker run galantelab/sandy --help

Usage:

sandy [options]

sandy help <command>

sandy <command> ...

Options:

-h, --help brief help message

-u, --man full documentation

Help commands:

help show application or command-specific help

man show application or command-specific documentation

Misc commands:

version print the current version

citation export citation in BibTeX format

Database commands:

quality manage quality profile database

expression manage expression-matrix database

variation manage structural variation database

Main commands:

genome simulate genome sequencing

transcriptome simulate transcriptome sequencing

実行方法

１、genome

ペアエンドfastqをゲノムの20x発生させる。

sandy genome --verbose --sequencing-type=paired-end --coverage=20 input_genome.fa -o outdir

#dockerなら
docker run -v $PWD:/data/ galantelab/sandy genome --verbose --sequencing-type=paired-end --coverage=20 /data/input_genome.fa -o /data/outdir

default settings（manualより）

The strand is randomly chosen;
The number of reads is calculated by the coverage;
The chromossomes are raffled following a weighted raffle with the sequence length as the bias;

-s <integer>で乱数発生を固定し（defaltはtimeを使う）、再現性ある結果をreproduceできる。リードのidentifier(fastqの１、3行目)は"--id="で指定できる。フォーマットはCのprintf関数に似ている。詳細はdocumentation参照（リンク）。quaity profileのdefaultはhiseq。

出力

f:id:kazumaxneo:20190304220750p:plain

２、transcriptome

sandy transcriptome \
 --quality-profile=hiseq_101 \
 --sequencing-type=paired-end \
 --number-of-reads=1000000
 --fragment-mean=350 \
 --fragment-stdd=100 \
 --prefix=pancreas_sim \
 --output-dir=simulation_dir \
 --id="%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r" \
 --verbose \
 --seed=123 \
 --jobs=20 \
input_cDNAs.fa

default settings（manualより）

Choose the Minus strand;
The number of reads is directly passed;
The genes/transcripts are raffled following the expression matrix;

上の条件で実行して得たfastqのstatistics

file format type num_seqs sum_len min_len avg_len max_len

out_R1_001.fastq.gz FASTQ DNA 1,000,000 100,000,000 100 100 100

out_R2_001.fastq.gz FASTQ DNA 1,000,000 100,000,000 100 100 100

$ head -n 4 out_R1_001.fastq

@SR.1:syn:sll0006 aspC, aat; putative aminotransferase; K00837 1

agggctgtttctgctgctttttgtacgactcgaaaaattccataatccaagttagttttcagcgttctcaaaccttgaataatttcttgatttcccacga

@IAFFAHFACIDCAEH@FFIIFBEAHEFGEGEBIFGDBCBFABBHDEEHI@@EIFHF?@FFIDCAH78==658><8<6=9=99.,11-2+/+#&(%##%$

公式Documentでは以下の実行例がある。

sandy transcriptome \
 --quality-profile=hiseq_101 \
 --sequencing-type=paired-end \
 --fragment-mean=350 \
 --fragment-stdd=100 \
 --prefix=pancreas_sim \
 --output-dir=sim_dir \
 --id="%i.%U read=%c:%t-%n mate=%c:%T-%N length=%r" \
 --verbose \
 --seed=123 \
 --jobs=30 \
 --no-gzip \
 --expression-matrix=brain_cortex \
 gencode_pc_v26.fa.gz

expression-matrixはリファレンスと紐づいており、ヘッダーが違うと機能しないので注意。

自分の発現データとリファレンスを組み込み、"-expression-matrix"フラグで使えるようにする。

sandy expression add my_expression.txt
sandy expression -f my_expression.txt my_fasta.fa

他にもquality profileを追加するコマンドなどがある。

引用

galantelab/sandy: A straightforward and complete next-generation sequencing read simulator

Thiago Miller. (2018, May 6).
Zenodo. http://doi.org/10.5281/zenodo.1241587

2019-03-09

Nanoporeのロングリードのアセンブリを中心としたツールキット Pomoxis

Nanopore long read assembly automatic pipeline evaluation tool

　Pomoxisは、ナノポアシーケンスに合わせて調整された基本的なバイオインフォマティクスツールのセット。ドラフトアセンブリの生成分析に関するツールが含まれている。これらのツールの多くは、Oxford Nanopore Technologiesの研究データ分析グループによって使用されている。

Document

Pomoxis - bioinformatics tools for nanopore research — Pomoxis 0.2.2 documentation

特徴（documentより）

Wraps third party tools with known good default parameters and methods of use.
Creates an isolated environment with all third-party tools.
Can be installed with conda.
Streamlines common short analysis chains.
Includes a nanopore read simulator.
Server/client components for minimap2 and bwa.
Integrates into katuali for performing more complex analysis pipelines.
Open source (Mozilla Public License 2.0).

インストール

依存

Installation from source

gcc-4.9
g++-4.9
zlib1g-dev
libncurses5-dev
python3-all-dev
libhdf5-dev
libatlas-base-dev
libopenblas-base
libopenblas-dev
libbz2-dev
liblzma-dev
libffi-dev
make
python-virtualenv
cmake (for racon)
wget (for fetching modules from github)
bzip2 (for extracting those modules)

requires the user to provide several binaries:

minimap2
miniasm
racon
samtools
bcftools
seqkit

本体　Github

#virtualenvを使用
virtualenv pomoxis --python=python3 --prompt "(pomoxis) "
. pomoxis/bin/activate
pip install git+https://github.com/rrwick/Porechop
pip install pomoxis

#またはvirtualenvを使わずpipで導入
pip install git+https://github.com/rrwick/Porechop
pip install pomoxis

> mini_align

$ mini_align

mini_align [-h] -r <reference> -i <fastq>

Align fastq/a formatted reads to a genome using minimap2.

-h show this help text.

-r reference, should be a fasta file. If correspondng minimap indices

do not exist they will be created. (required).

-i fastq/a input reads (required).

-I split index every ~NUM input bases (default: 16G, this is larger

than the usual minimap2 default).

-a aggressively extend gaps (sets -A1 -B2 -O2 -E1 for minimap2).

-P filter to only primary alignments (i.e. run samtools view -F 2308.

Deprecated: this filter is now default and can be disabled with -A.)

-A do not filter alignments to primary alignments, output all.

-n sort bam by read name.

-c chunk size. Input reads/contigs will be broken into chunks

prior to alignment.

-t alignment threads (default: 1).

-p output file prefix (default: reads).

-m fill MD tag.

-i and -r must be specified.

> mini_assemble

$ mini_assemble

mini_assemble [-h] -i <fastq>

Assemble fastq/fasta formatted reads and perform POA consensus.

-h show this help text.

-i fastx input reads (required).

-r reference fasta for reference-guided consensus (instead of de novo assembly)

-o output folder (default: assm).

-p output file prefix (default: reads).

-t number of minimap and racon threads (default: 1).

-m number of racon rounds (default: 4).

-n number of racon shuffles (default: 1. If not 1, should be >10).

-w racon window length (default: 500).

-k keep intermediate files (default: delete).

-K minimap's -K option (default: 500M).

-c trim adapters from reads prior to everything else.

-e error correct longest e% of reads prior to assembly, or an estimated coverage (e.g. 2x).

For an estimated coverage, the -l option must be a fastx or a length (e.g. 4.8mb).

-l Reference length, either as a number (e.g. 4.8mb) or fastx from which length will be calculated.

-x log all commands before running.

-i must be specified.

> assess_assembly

$ assess_assembly

assess_assembly [-h] -r <reference> -i <fastq>

Calculate accuracy statistics for an assembly.

-h show this help text.

-r reference, should be a fasta file. If correspondng bwa indices

do not exist they will be created. (required).

-i fastq/a input assembly (required).

-c chunk size. Input reads/contigs will be broken into chunks

prior to alignment, 0 will not chunk (default 100000).

-C catalogue errors.

-t alignment threads (default: 1).

-p output file prefix (default: assm).

-T trim consensus to primary alignments of truth to assembly.

-b .bed file of reference regions to assess.

-i and -r must be specified.

> simulate_calls -h

$ simulate_calls -h

usage: simulate_calls [-h] [--mu MU] [--sigma SIGMA] [--noise NOISE]

[--threads THREADS]

fasta ncalls

Simulate basecalls with scrappy.

positional arguments:

fasta Source sequence file.

ncalls Number of basecalls to produce.

optional arguments:

-h, --help show this help message and exit

--mu MU mean fragment length.

--sigma SIGMA stdv fragment length.

--noise NOISE Additional Gaussian noise on signal.

--threads THREADS number of worker threads.

> subsample_bam

$ subsample_bam

usage: subsample bam to uniform or proportional depth [-h] [-o OUTPUT_PREFIX]

[-r REGIONS [REGIONS ...]]

[-p PROFILE]

[-O {fwd,rev}]

[-t THREADS]

[-q QUALITY]

[-a ACCURACY]

[-c COVERAGE]

[--any_fail | --all_fail]

[-x PATIENCE]

[-s STRIDE] [-P]

[-S SEED]

bam depth [depth ...]

subsample bam to uniform or proportional depth: error: the following arguments are required: bam, depth

実行方法

1、mini_align

マッピング（minimap2）。fastqはgzip圧縮（.gz）にも対応している。

mini_align -t 20 -r ref.fa -i input.fq

-r reference, should be a fasta file. If correspondng minimap indices do not exist they will be created. (required)
-t alignment threads (default: 1)
-i fastq/a input reads (required)

出力はcoordinate sortされたbamとbam.baiになる。

２、mini_assemble

de novo アセンブリ（miniasmとminimap）とコンセンサスコール（racon x 4 times）

mini_assemble -t 20 -e 10 -l 4.7m -m 4 -i input.fq

-i fastx input reads (required)
-t number of minimap and racon threads (default: 1)
-c trim adapters from reads prior to everything else.
-e error correct longest e% of reads prior to assembly, or an estimated coverage (e.g. 2x). For an estimated coverage, the -l option must be a fastx or a length (e.g. 4.8mb)
-l Reference length, either as a number (e.g. 4.8mb) or fastx from which length will be calculated

注: テスト時はpolish off（-m 0）にしないと出力がゼロになった。polishが失敗しているにもかかわらず作業ファイル（draft assembly）を消しているからかもしれない。

リファレンスガイドアセンブリとコンセンサスコール（racon）

mini_assemble -t 20 -e 10 -l 4.7m -m 4 -r ref.fa -i input.fq

-r reference fasta for reference-guided consensus (instead of de novo assembly)

未テスト

３、assess_assembly

アセンブリ評価

assess_assembly -t 20 -r assembly.fa -i input.fq

# Percentage Errors

name mean q10 q50 q90

err_ont 16.926% 13.544% 16.310% 20.197%

err_bal 17.925% 14.164% 17.221% 21.615%

iden 6.706% 5.190% 6.558% 8.406%

del 5.699% 4.222% 5.305% 7.134%

ins 5.903% 4.259% 5.470% 7.339%

# Q Scores

name mean q10 q50 q90

err_ont 7.71 8.68 7.88 6.95

err_bal 7.47 8.49 7.64 6.65

iden 11.74 12.85 11.83 10.75

del 12.44 13.74 12.75 11.47

ins 12.29 13.71 12.62 11.34

標準ではSTDOUT出力される。

４、simulate_calls

oxford nanoporeのロングリードのシミュレーション

refe.faを鋳型にして10000リード発生させる。平均リード長は10000とする。

assess_assembly --threads 20 --mu 10000 refe.fa 10000

--mu mean fragment length

重たいので使えるだけスレッド指定した方が良いです。

出力

f:id:kazumaxneo:20190305144844j:plain

内部で実際に動いているツールとパラメータについては、それぞれの実行用bash スクリプトを開いて確認して下さい。

引用

Pomoxis - bioinformatics tools for nanopore research

高速かつ精度の高いオロソログ検索を行う SonicParanoid

2019 Bioinformatics orthologue 高速なツール

3/8 誤字修正、ツイッターリンク追記

3/10 誤字修正

　最近のDNAシーケンシング技術の進歩により、completeシーケンシングされたゲノムの数は急速に増加している。複数のゲノムにコードされているオルソログ遺伝子を正確に推論することが、これらのデータセットに基づいたさまざまな解析の鍵となる（Altenhoff and Dessimoz、2012）。例えば、比較ゲノミクス、ゲノムアノテーション、ゲノムの系統、およびゲノムデータベースの開発はすべて、信頼できるオロソログ推論に依存している。 InParanoid（Sonnhammer andÖstlund、2015）およびその拡張MultiParanoid（Alexeyenko et al。、2006）、OrthoMCL (Li、2003)、Hieranoid (Kaduk and Sonnhammer, 2017)、OMA (Train et al、2017)、Proteinortho (Lechner et al、2011)、OrthoFinder (Emms and Kelly, 2015)、PANTHER (Mi et al、2017)など、オロソログ推論に利用できるツールは多数ある。 InParanoidは最も古く最も人気のあるツールの1つで、specificityとrecall 間で最良のトレードオフを持っている（Altenhoff et al、2016）。ここでは、SonicParanoidを紹介する。SonicParanoidは、高速で正確で使いやすい多種のオルソログ推論のためのツールである。

　SonicParanoidはInParanoidのグラフベースのアルゴリズム（Remm et al、2001）で報告されている、正確さのために使用されている概念を借用している（Altenhoff et al、2016; Chen et al。、2007）。プロセス全体をスピードアップし、自動化する（論文補足図S1）。計算時間を短縮するために、セカンドパスアライメントおよびブートストラップテストはスキップされ、従来のBLASTの代わりにMMseqs2（Steinegger andSöding、2017）が使用される（Altschul et al。、1997）。さらに、SonicParanoidは、シーケンス長の違いを考慮に入れた新しいスコア関数と設定可能なしきい値を採用している（補足資料と補足図S2とS3）。オリジナルのInParanoidアルゴリズムとのもう1つの違いは、オルソログの重複グループがクラスタ化されている点である。

Remm et al（2001）では、グループは分類されたオルソログの信頼スコアの比較に基づいてマージされるか、削除される。これとは対照的に、SonicParanoidはグループを数値セットの要素として扱うため、アルゴリズムは高速になるが、精度の点でもほとんど違いはない（補足図S4）。最後に、複数種のオルソログ推論ステップは完全に自動化されているため、ユーザーはMultiParanoidで必要とされる面倒でエラーが発生しやすいオルソログテーブルの収集と構成ファイルの作成を避けることができる。詳細は補足資料に記載されている。

Accuracy of SonicParanoid and other 13 orthology inference tools.　論文より転載

f:id:kazumaxneo:20190307162110j:plain

岩崎さんの発表（ゲノム微生物年会2019 3/7 招待講演より）

http://iwasakilab.bs.s.u-tokyo.ac.jp/sonicparanoid/

ラボHP

http://iwasakilab.bs.s.u-tokyo.ac.jp/eindex.html

Our #paper has been published online. #SonicParanoid:
fast, accurate and easy #orthology inference. https://t.co/HlLN8LaFrV. Congratulations @iwasakiw. https://t.co/5ysAOcAYq5
— Salvatore Cosentino (@punto_c) August 1, 2018

論文「SonicParanoid: 高速、正確、簡単なオルソログ推定」が出版になりました。https://t.co/xHd4iMrIvh https://t.co/dARxKORy96
— Wataru IWASAKI (@iwasakiw) August 1, 2018

インストール

mac os10.14のPython 3.6.8環境でテストした（* macの推奨OSバージョンは10.13）。

依存

SonicParanoid only requires the Python programming language and the MMseqs2 alignment tool, to be installed in your laptop/server in order to work.

Python 3.5 or above
GNU GCC compiler (version 5.0 or above)

本体　Bitbucket

pip install sonicparanoid

> sonicparanoid -h

$ sonicparanoid -h

usage: sonicparanoid -i <INPUT_DIRECTORY> -o <OUTPUT_DIRECTORY>[options]

SonicParanoid 1.0

optional arguments:

-h, --help show this help message and exit

-i INPUT_DIRECTORY, --input-directory INPUT_DIRECTORY

Directory containing the proteomes (in FASTA format)

of the species to be analyzed.

-o OUTPUT_DIRECTORY, --output-directory OUTPUT_DIRECTORY

The directory in which the results will be stored.

-m {fast,default,sensitive,most-sensitive}, --mode {fast,default,sensitive,most-sensitive}

SonicParanoid execution mode. The default mode is

suitable for most studies. Use sensitive or most-

sensitive if the input proteomes are not closely

related.

-p PROJECT_ID, --project-id PROJECT_ID

Name for the project reflecting the run. If not

specified it will be automatically generated using the

current date and time.

-sh SHARED_DIRECTORY, --shared-directory SHARED_DIRECTORY

The directory in which the alignment files are stored.

If not specified it will be created inside the main

output directory.

-db MMSEQS_DBS, --mmseqs-dbs MMSEQS_DBS

The directory in which the database files for MMseqs2

will be stored. If not specified it will be created

inside the main output directory.

-t THREADS, --threads THREADS

Number of parallel 1-CPU jobs to be used. Default=4

-se SENSITIVITY, --sensitivity SENSITIVITY

Sensitivity for MMseqs2. This will overwrite the

--mode option.

-noidx, --no-indexing

Avoid the creation of indexes for MMseqs2 databases.

IMPORTANT: while this saves starting storage space it

makes MMseqs2 slightly slower. The results might also

be sligthy different.

-ot, --overwrite-tables

This will force the re-computation of the ortholog

tables. Only missing alignment files will be re-

computed.

-ow, --overwrite Overwrite previous runs and execute it again. This can

be useful to update a subset of the computed tables.

-ml MAX_LEN_DIFF, --max-len-diff MAX_LEN_DIFF

Maximum allowed length-difference-ratio between main

orthologs and canditate inparalogs. Example: 0.5 means

one of the two sequences could be two times longer

than the other 0 means no length difference allowed; 1

means any length difference allowed. Default=0.5

-mgs MAX_GENE_PER_SP, --max-gene-per-sp MAX_GENE_PER_SP

Limits the maximum number of genes per species in the

multi-species output table. This option reduces the

verbosity of the multi-species output file when

comparing a high number of species (especially

eukaryotes). Default=10

-sm, --skip-multi-species

Skip the creation of multi-species ortholog groups.

-op, --output-pairs Output a text file with all the orthologous relations.

-qfo11, --qfo-2011 Output a text file with all the orthologous relations

formatted to be uploaded to the QfO benchmark service.

NOTE: implies --output-pairs

-ka, --keep-raw-alignments

Do not delete raw MMseqs2 alignment files. NOTE: this

will triple the space required for storing the

results.

-rs, --remove-old-species

Remove alignments and pairwise ortholog tables related

to species used in a previous run. This option should

be used when updating a run in which some input

proteomes were modified or removed.

-un, --update-input-names

Remove alignments and pairwise ortholog tables for an

input proteome used in a previous which file name

conflicts with a newly added species. This option

should be used when updating a run in which some input

proteomes or their file names were modified.

-d, --debug Output debug information.

実行方法

入力のproteome（amino-acids.fasta）が入ったディレクトリを指定して実行する。

sonicparanoid -i input_dir/ -o output/ -t 8 -m default --project-id my_project

-i Directory containing the proteomes (in FASTA format) of the species to be analyzed
-t Number of parallel 1-CPU jobs to be used. Default=4
-o The directory in which the results will be stored
-m {fast | default | sensitive | most-sensitive} SonicParanoid execution mode. The default mode is suitable for most studies. Use sensitive or most-sensitive if the input proteomes are not closely related

出力

f:id:kazumaxneo:20190307161634p:plain

結果は、output/run/my_project/にある。

f:id:kazumaxneo:20190307221748p:plain

入力した proteome配列間でシェアされているオロソログはmy_project/ortholog_groups/にある。

f:id:kazumaxneo:20190307222741p:plain

ortholog_groups.tsv Tab-separated table with the ortholog groups
flat.ortholog_groups.tsv Simpler table with only the gene names for each group
not_assigned_genes.ortholog_groups.tsv List of genes that could not be classified as orthologs

さらにそれぞれのproteomeペア間のPairwise ortholog tablesも出力される。

詳細はHPのoutput（リンク先下の方"Output"）の説明を読んで下さい。

引用
SonicParanoid: fast, accurate and easy orthology inference
Salvatore Cosentino, Wataru Iwasaki

Bioinformatics. 2019 Jan 1; 35(1): 149–151

高速かつ高効率にシーケンスデータを圧縮 / 解凍する NAF

2019 Compression / decompression Bioinformatics 高速なツール fasta/fastqの操作 Nanopore direct RNA-seq

2019 3/9 twitterコメント追記

Preprintより

　DNA配列データベースは、シーケンシング技術の継続的な進歩により、指数関数的に成長している。通常、データ圧縮は保存スペースを節約するためにすべての保存DNAシーケンシングデータに使用される。1993年に最初の専用のDNA圧縮器が提案された
（Tahi、1993）。それ以来、多数のDNA圧縮器が開発された（例えば、Cao et al、2007、Li et al、2013、Benoit et al、2015、Al-Okaily et al、2017）。著者らの経験では、2つの圧縮器だけが実用性を満たしている: DELIMINATE (Mohammed et al、 2012) とMFCompress (Pinho and Pratas、 2014)。これらは安定しており、FASTAフォーマットの一般的に使用される機能をサポートし、処理するのに十分効率的で、脊椎動物ゲノム全体の圧縮（および解凍）などの実用的な作業をハンドリングすることができる。DELIMINATEとMFCompressの両方の圧縮率はimpressiveだが、解凍が非常に遅く大規模データベースでは、その有用性が大幅に制限される。DNA圧縮に関する多数の研究にもかかわらず、現在、データベースの大多数は圧縮にgzip（https://www.gzip.org/）に頼り続けている。この永続的な人気は、gzipの可用性、堅牢性、および速度（特に解凍速度）によっている。他の一般的な汎用コンプレッサーとしてbzip2（http://www.bzip.org/）やlzma / xz（https://tukaani.org/xz/format.html）などが開発されている。さらに、近年では、新世代の高度な汎用コンプレッサー、特にbrotli（https://github.com/google/brotli）およびzstd （https://github.com/facebook/zstd）が登場した。これらのコンプレッサはgzipの性能を向上させるが、それでも圧縮強度の点で特殊なDNAコンプレッサには到達できない。この研究では、Nucleotide Archival Format（NAF）と呼ばれる新しいDNA圧縮フォーマットについて説明する。 NAFはDELIMINATEと同程度でMFCompressよりわずかに圧縮率が低いが、最新の圧縮率を提供する。同時に、それは30から80倍速い解凍を提供する。 NAFはFASTAおよびFASTQフォーマットで圧縮および圧縮解除を行う。 NAFはマスクされた配列とあいまいなIUPAC ヌクレオチドを支持する。NAFはシーケンスの長さに制限がなく、リファレンスシーケンスも必要としない。

Comparison of compressors. Githubより

NAF Documentation

NAF specification

https://github.com/KirillKryukov/naf/blob/master/NAFv2.pdf

Kirill Kryukovさんの発表（ゲノム微生物年会2019 3/6 ）

こちら研究室のポスドクのKirill Kryukovさんが開発しているfasta形式のDNA配列データ圧縮ポログラムNAFの解説いただいた記事です。現在fastq形式にも（ほぼ）対応し、fasta形式のアミノ酸配列などにも対応を計画しています。 https://t.co/VJQnRJtZ54
— So Nakagawa (@sounaka) March 6, 2019

NAFに関するツイート

インストール

ubuntu16.04でテストした（ホストOS: mac os 10.14）。

本体　Github

リリースよりバイナリ（linux & windows）をダウンロードできる。

ビルドするなら

git clone --recurse-submodules https://github.com/KirillKryukov/naf.git
cd naf && make && sudo make install

>./ennaf -h

# ./ennaf -h

Usage: ennaf [OPTIONS] [infile]

Options:

-o FILE - Write compressed output to FILE

-c - Write to standard output

-#, --level # - Use compression level # (from -131072 to 22, default: 1)

--temp-dir DIR - Use DIR as temporary directory

--name NAME - Use NAME as prefix for temporary files

--title TITLE - Store TITLE as dataset title

--fasta - Input is in FASTA format

--fastq - Input is in FASTQ format

--line-length N - Override line length to N

--verbose - Verbose mode

--keep-temp-files - Keep temporary files

--no-mask - Don't store mask

-h, --help - Show help

-V, --version - Show version

> ./unnaf -h

# ./unnaf -h

Usage: unnaf [OUTPUT-TYPE] [file.naf]

Options for selecting output type:

--format - File format version

--part-list - List of parts

--sizes - Part sizes

--number - Number of sequences

--title - Dataset title

--ids - Sequence ids (accession numbers)

--names - Full sequence names (including ids)

--lengths - Sequence lengths

--total-length - Sum of sequence lengths

--mask - Masked region lengths

--4bit - 4bit-encoded DNA (binary data)

--dna - Continuous DNA sequence

--fasta - FASTA-formatted sequences

--fastq - FASTQ-formatted sequences

Other options:

--line-length N - Use lines of width N for FASTA output

--no-mask - Ignore mask

-h, --help - Show help

-V, --version - Show version

実行方法

１、compression

#working dir
mkdir tmp
export TMP=tmp

#fasta
ennaf --fasta input.fa -o output.fa.naf
#またはworking dirを指定してラン
ennaf --fasta input.fa -o output.fa.naf --temp-dir temp

#fastq
ennaf --fastq input.fq -o output.fq.naf

#fq.gz
gzip -dc input.fq.gz | ./ennaf --fastq -o file.naf

--dna Input contains DNA sequences (default). Valid sequences can include: ACGT, RYSWKMBDHV, N, '-'.

--rna Input contains RNA sequences. Valid sequences can include: ACGU, RYSWKMBDHV, N, '-'.

output.nafが出力される。RNA配列なら"--rna"をつけて実行する (default DNA)。

2、decompression

unnaf input.naf > output.fq

#解凍してgzに再圧縮
unnaf input.naf | pigz -p 8 - > out.fq.gz

引用

Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences

Kirill Kryukov, Mahoko Takahashi Ueda, So Nakagawa, Tadashi Imanishi

Bioinformatics, btz144, Published: 25 February 2019

2019-03-06

サンプルのコンタミネーションを見積もる Mash Screen

2019 Preprint metagenome 生物種の推定 (taxonomic profiling) virus archaea bacteria contamination MInHash 高速なツール genetic distance plasmid

2019 11/5 論文追加

　シーケンシング技術がスループットを高めそしてコストを下げ続けるにつれて、シーケンシングされたゲノムのデータベース（例えばNCBI RefSeq [ref.1]）は指数関数的成長を続け、それらに対する検索をさらに複雑にしている[ref.2、3]。さらに、rawシーケンスデータであるNCBIのSequence Read Archive（SRA）[ref.4]はさらに急速に成長しており、表示されているゲノムおよびメタゲノムをアセンブリしてキュレーションする能力を上回っている[ref.5、6]。これらの傾向は、典型的には短い（〜21-bp）、一定の長さkについてゲノムの部分配列をオーバーラップして操作する、アライメントフリー分析法を普及させた。この分野における著者らの以前の研究はMash [ref.7]を導入した。これはMinHash次元圧縮[ref.8]を用いて全ゲノムのk-merセットを数百または数千の値のスケッチに圧縮し、ゲノムまたはメタゲノムの距離の極めて迅速な推定を可能にする。単離したゲノムまたは高レベルのメタゲノム相関には便利だが、Mashはメタゲノムの構成要素の分析には適していない。これは、MinHashingの一般的なアプリケーションが包含よりも近似に近いためである。この研究では、封じ込めの問題に取り組み、ゲノムまたはプロテオームがメタゲノム内でどの程度うまく表されているかを測定するための新しい方法、Mash Screenを示す。これは、迅速なコンタミネーションスクリーニングから、利用可能なすべてのメタゲノム全体にわたる特定の細菌およびウイルス株の追跡までの範囲の用途を有する。

（複数段落省略）

我々は公共のSRAメタゲノムからの新規ポリオーマウイルス種の同定と集合について述べる。

解説

https://genomeinformatics.github.io/mash-screen/

"Mash Screen: High-throughput sequence containment estimation for genome discovery" is published! Two years after I blogged about it, and 8 months after the preprint. I thank @BrianOndov for his patience and the reviewers for some key corrections https://t.co/boAWvvpnn5
— Adam Phillippy (@aphillippy) November 5, 2019

MASHは以前v2.0を紹介しており、オーサーらの手順に従ってメタゲノムの分析の流れもまとめていますが、重要なツールなのでメタゲノムやコンタミチェックの手順に絞って改めて紹介します。

インストール

Github

リリースよりバイナリをダウンロードするか、パッケージマネージャで導入する。

conda install -c bioconda -y mash==2.1

> mash

$ mash

Mash version 2.1

Type 'mash --license' for license and copyright information.

Usage:

mash <command> [options] [arguments ...]

Commands:

bounds Print a table of Mash error bounds.

dist Estimate the distance of query sequences to references.

info Display information about sketch files.

paste Create a single sketch file from multiple sketch files.

screen Determine whether query sequences are within a larger pool of

sequences.

sketch Create sketches (reduced representations for fast operations).

triangle Estimate a lower-triangular distance matrix.

> mash screen

$ mash screen

Version: 2.1

Usage:

mash screen [options] <queries>.msh <pool> [<pool>] ...

Description:

Determine how well query sequences are contained within a pool of sequences.

The queries must be formatted as a single Mash sketch file (.msh), created

with the `mash sketch` command. The <pool> files can be contigs or reads, in

fasta or fastq, gzipped or not, and "-" can be given for <pool> to read from

standard input. The <pool> sequences are assumed to be nucleotides, and will

be 6-frame translated if the <queries> are amino acids. The output fields are

[identity, shared-hashes, median-multiplicity, p-value, query-ID,

query-comment], where median-multiplicity is computed for shared hashes, based

on the number of observations of those hashes within the pool.

Options:

Option Description (range) [default]

-h Help

-p <int> Parallelism. This many threads will be spawned for processing. [1]

-w Winner-takes-all strategy for identity estimates. After counting

hashes for each query, hashes that appear in multiple queries will

be removed from all except the one with the best identity (ties

broken by larger query), and other identities will be reduced. This

removes output redundancy, providing a rough compositional outline.

...Output...

-i <num> Minimum identity to report. Inclusive unless set to zero, in which

case only identities greater than zero (i.e. with at least one

shared hash) will be reported. Set to -1 to output everything.

(-1-1) [0]

-v <num> Maximum p-value to report. (0-1) [1.0]

mash triangleコマンドも新しく実装されていますね。

~ GIthubより ~

A new triangle command computes a lower triangular
distance matrix in relaxed Phylip format. This streamlines all-pairs distance
commands and avoids computational redundancy.）

----------------------------------------------------------------------------------------

データベースの準備

https://mash.readthedocs.io/en/latest/tutorials.html よりNCBI RefSeqゲノムのpresketchされたアーカイブ"refseq.genomes.k21.s1000.msh direct link"をダウンロードする。plasmidの"refseq.genomes+plasmids.k21.s1000.msh refseq.plasmids.k21.s1000.msh direct link "も必要ならダウンロードする（direct link）。

または、MASH Documentからもダウンロードできる（合体版）。こちらには、SRA metagenomesとRefSeqのsketchファイルも用意されている。

https://mash.readthedocs.io/en/latest/data.html

実行方法

通常sketchコマンドはlow k-merを捨てて実行するが、screenコマンドはコンタミも含めてfastqファイル内にある全てのゲノムを検出したい時に使う。fastqをそのまま引数に指定する。

mash screen -p 4 refseq.genomes.k21s1000.msh pair1.fq pair2.fq > screen.tab
sort -gr screen.tab | head

-p Parallelism. This many threads will be spawned for processing. [1]
-w Winner-takes-all strategy for identity estimates. After counting hashes for each query, hashes that appear in multiple queries will be removed from all except the one with the best identity (ties broken by larger query), and other identities will be reduced. This removes output redundancy, providing a rough compositional outline

トップヒットに特定の種の様々な株がヒットしてくる場合、winner take all戦略のフラグ（-w）を立てることで回避できる。

出力

f:id:kazumaxneo:20190305160613j:plain

引用

Mash Screen: High-throughput sequence containment estimation for genome discovery

Brian D Ondov, Gabriel J Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B Buck, Adam M Phillippy

bioRxiv preprint first posted online Mar. 1, 2019

追記

Mash Screen: high-throughput sequence containment estimation for genome discovery

Brian D. Ondov, Gabriel J. Starrett, Anna Sappington, Aleksandra Kostic, Sergey Koren, Christopher B. Buck & Adam M. Phillippy
Genome Biology volume 20, Article number: 232 (2019)

ナノポアのrawロングリードからプラスミドを探索する tiptoft

plasmid Nanopore long read Journal of Open Source Software 2019 リアルタイム解析 bacteria archaea

2019 12/13 タイトル修正

　急速にコストが下がる中、Pacific Biosciences（PacBio）およびOxford Nanopore Technologies（ONT）のロングリードDNAシークエンシング技術がアウトブレイク調査に使用され始めている（Faria et al、2017; J. Quick et al、2005 2015）および急速な感染症の臨床診断（Votintseva et al、2017）。 ONT機器は数分以内にデータを作成でき、PacBioは数時間/数日かかるショートリードシーケンシング技術と比較して数時間以内にデータを作成できる。実用的な答えまでの時間を短縮することによって、ゲノミクスは臨床決定に直接影響を与え始め、患者に良い影響を与える可能性がある（Gardy＆Loman、2017）。抗菌薬耐性を付与するものまたは病原性因子をコードするもののような臨床的に重要な遺伝子は、プラスミドから水平に獲得され得る。ロングリードシーケンシング技術によって得られる速度の増加と共に、ベースエラー率が増加している。ロングリードシーケンシングリードに固有の高いエラー率は、リードを修正するための特別なツールを必要とする（Koren et al、2017）が、これらの方法はかなりの計算上の要求を必要とする。シークエンシングデータ、および臨床的に重要な小さいプラスミドの損失をもたらす可能性がある。
　基礎となるrawデータにどのプラスミドが存在するかを予測するために生の未補正リードを使用するTipToftを紹介する。これは、デノボアセンブリのプラスミド含有量を検証するための独立した方法を提供する。TipToftは、未修正のロングリードから実行できる唯一のツールである。 TipToftは速く、リアルタイムで結果を提供するためにストリーミング入力データを受け入れることもできる。プラスミドは、PlasmidFinderからのタイピングに使用されたレプリコン配列を使用して同定される（Carattoli et al、2014）。著者らは、PacBioのロングリードシークエンシング技術を使用して1975サンプル（https://www.sanger.ac.uk/resources/downloads/bacteria/nctc /）でソフトウェアをテストし、abricate（紹介）を使用してde novoアセンブリからプラスミドを予測した。1975サンプルから、プラスミド配列と100％一致するが、de novoアセンブリ内には対応するプラスミドが存在しない84サンプルが同定された。アセンブリにおいて同定されたすべてのプラスミドを100％の一致条件でとると、Tiptoftはこれらのうち97％（n ＝ 326）を同定した。デプスがより深まるとプラスミド配列を正確に同定する能力が高まるであろう。そのレベルは基本エラー率に依存する。 90％の塩基精度を有するシーケンシングデータについては、99.5％の信頼度でプラスミドレプリコン配列を同定するために、おおよそ５のデプスが必要とされる。このソフトウェアはPython 3で書かれており、https://github.com/andrewjpage/tiptoftからオープンソースのGNU GPLv3ライセンスの下で入手可能できる。

Small plasmids can disappear at the read correction step of long read assembly.Tiptoft published today in #JOSS by @andrewjpage & @torstenseemann takes uncorrected long reads & identifies what plasmids are actually present. Install: pip3 install tiptoft https://t.co/xru3UyeIL0
— Andrew Page (@andrewjpage) March 1, 2019

tiptoftに関するツイート

インストール

著者が配布しているdockerイメージを使いテストした（ホストOS: mac os）。

本体　GIthub

#dockcerイメージを使うならpullする
docker pull andrewjpage/tiptoft

#インストールする場合pipが利用できる
pip3 install cython
pip3 install tiptoft

#Anaconda環境なら
conda install -y -c bioconda tiptoft

#homebrewも利用できる
brew install python # this is python v3
pip3 install cython
pip3 install tiptoft

> tiptoft -h

# tiptoft -h

usage: tiptoft [options] input.fastq

Plasmid replicon and incompatibility group prediction from uncorrected long

reads

positional arguments:

input_fastq Input FASTQ file (optionally gzipped)

optional arguments:

-h, --help show this help message and exit

Optional input arguments:

--plasmid_data PLASMID_DATA, -d PLASMID_DATA

FASTA file containing plasmid data from downloader

script, defaults to bundled database (default: None)

--kmer KMER, -k KMER k-mer size (default: 13)

Optional output arguments:

--filtered_reads_file FILTERED_READS_FILE, -f FILTERED_READS_FILE

Filename to save matching reads to (default: None)

--output_file OUTPUT_FILE, -o OUTPUT_FILE

Output file [STDOUT] (default: None)

--print_interval PRINT_INTERVAL, -p PRINT_INTERVAL

Print results every this number of reads (default:

None)

--verbose, -v Turn on debugging [False]

--version show program's version number and exit

Optional advanced input arguments:

--no_hc_compression Turn off homoploymer compression of k-mers (default:

False)

--no_gene_filter Dont filter out lower coverage genes from same group

(default: False)

--max_gap MAX_GAP Maximum gap for blocks to be contigous, measured in

multiples of the k-mer size (default: 3)

--max_kmer_count MAX_KMER_COUNT

Exclude k-mers which occur more than this number of

times in a sequence (default: 10)

--margin MARGIN Flanking region around a block to use for mapping

(default: 10)

--min_block_size MIN_BLOCK_SIZE

Minimum block size in bases (default: 50)

--min_fasta_hits MIN_FASTA_HITS, -m MIN_FASTA_HITS

Minimum No. of kmers matching a read (default: 8)

--min_perc_coverage MIN_PERC_COVERAGE, -c MIN_PERC_COVERAGE

Minimum percentage coverage of typing sequence to

report (default: 85)

--min_kmers_for_onex_pass MIN_KMERS_FOR_ONEX_PASS

Minimum No. of kmers matching a read in 1st pass

(default: 5)

root@84830ec770b4:/data#

実行方法

１、最初にデータベースを準備する。PlasmidFinder（ref.1）のデータベースを使っている。

#ホストのカレントとシェアしてRUNする
docker run -itv $PWD:/data/ andrewjpage/tiptoft
> apt update && apt install -y curl #curlが入ってないので入れる
> cd /data

#データベースダウンロード
> tiptoft_database_downloader plasmid_data

plasmid_data.faができる。

２、tiptoftの実行。データベースとpacbioまたはn anopreのfastqを指定する。fastqはgzip圧縮にも対応している。

tiptoft long_reads.fq  --plasmid_data plasmid_data.fa

テストランの出力

GENE COMPLETENESS %COVERAGE ACCESSION DATABASE PRODUCT

rep11.1 Full 100 AB178871 plasmidfinder rep11.1_repA(pB82)_AB178871

rep17.1 Partial 86 AF507977 plasmidfinder rep17.1_CDS29(pRUM)_AF507977

repUS15. Partial 92 NZAAAK010000287 plasmidfinder repUS15._ORF(E.faecium287)_NZAAAK010000287

引用

TipToft: detecting plasmids contained in uncorrected long read sequencing data

Andrew J. Page, Torsten Seemann

Journal of Open Source Software, 4(35), 1021

ref.1

In silico detection and typing of plasmids using PlasmidFinder and plasmid multilocus sequence typing.

Carattoli A, Zankari E, García-Fernández A, Voldby Larsen M, Lund O, Villa L, Møller Aarestrup F, Hasman H.

Antimicrob Agents Chemother. 2014 Jul;58(7):3895-903. doi: 10.1128/AAC.02412-14. Epub 2014 Apr 28

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

Nanoporeのオフィシャルコマンドラインbasecaller1 Albacore

illuminaのショートリードシミュレータ Sandy（RNA seqにも対応）

Nanoporeのロングリードのアセンブリを中心としたツールキット Pomoxis

高速かつ精度の高いオロソログ検索を行う SonicParanoid

高速かつ高効率にシーケンスデータを圧縮 / 解凍する NAF

サンプルのコンタミネーションを見積もる Mash Screen

ナノポアのrawロングリードからプラスミドを探索する tiptoft