Informatics on Mac

A collection of notes on informatics for HTS (NGS).

NanoPack2: evaluating long-read sequencing data

 

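 Growing cohort sizes in long-read sequencing projects call for more efficient software for quality assessment and processing of sequencing data from Oxford Nanopore Technologies and Pacific Biosciences. The authors describe new tools for summarizing experiments, filtering datasets, and visualizing phased alignment results, as well as updates to the NanoPack software suite.
The cramino, chopper, kyber, and phasius tools are written in Rust and are provided as executable binaries that require no installation or dependency management. Binaries built on musl are available for broad compatibility. NanoPlot and NanoComp are written in Python3. Links to each tool and its documentation are at https://github.com/wdecoster/nanopack. All tools are compatible with Linux, Mac OS, and the MS Windows Subsystem for Linux, and are released under the MIT license. The repositories include test data, the tools are continuously tested with GitHub Actions, and they can be installed with the conda manager.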

 

Installation

  • "The python scripts are written and tested for Python >= 3.6. With pip install nanopack all python tools can be installed simultaneously, but using a conda environment is encouraged. For the rust tools binaries can be downloaded from the releases on the respective GitHub repositories, as well as installation through conda."

 

 

Installation and usage

Here a conda environment with Python 3.10 is created and the tools are installed into it.

mamba create -n NanoPack python=3.10 -y
conda activate NanoPack

#Deactivate to leave the environment.
conda deactivate

 

1. NanoPlot (GitHub)

Produces many plots useful for analyzing reads, taking reads (fastq), alignments (bam), or albacore summary files as input.

mamba create -n nanoplot
conda activate nanoplot
mamba install -c bioconda nanoplot

> NanoPlot -h

usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw] [--huge] [-o OUTDIR] [--no_static] [-p PREFIX] [--tsv_stats] [--info_in_report] [--maxlength N] [--minlength N] [--drop_outliers] [--downsample N] [--loglength] [--percentqual] [--alength] [--minqual N]
                [--runtime_until N] [--readtype {1D,2D,1D2}] [--barcoded] [--no_supplementary] [-c COLOR] [-cm COLORMAP] [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]] [--plots [{kde,hex,dot} ...]] [--legacy [{kde,dot,hex} ...]] [--listcolors] [--listcolormaps]
                [--no-N50] [--N50] [--title TITLE] [--font_scale FONT_SCALE] [--dpi DPI] [--hide_stats]
                (--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --ubam file [file ...] | --cram file [file ...] | --pickle pickle | --feather file [file ...])

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

General options:
  -h, --help            show the help and exit
  -v, --version         Print version and exit.
  -t, --threads THREADS
                        Set the allowed number of threads to be used by the script
  --verbose             Write log messages also to terminal.
  --store               Store the extracted data in a pickle file for future plotting.
  --raw                 Store the extracted data in tab separated file.
  --huge                Input data is one very large file.
  -o, --outdir OUTDIR   Specify directory in which output has to be created.
  --no_static           Do not make static (png) plots.
  -p, --prefix PREFIX   Specify an optional prefix to be used for the output files.
  --tsv_stats           Output the stats file as a properly formatted TSV.
  --info_in_report      Add NanoPlot run info in the report.

Options for filtering or transforming input prior to plotting:
  --maxlength N         Hide reads longer than length specified.
  --minlength N         Hide reads shorter than length specified.
  --drop_outliers       Drop outlier reads with extreme long length.
  --downsample N        Reduce dataset to N reads by random sampling.
  --loglength           Additionally show logarithmic scaling of lengths in plots.
  --percentqual         Use qualities as theoretical percent identities.
  --alength             Use aligned read lengths rather than sequenced length (bam mode)
  --minqual N           Drop reads with an average quality lower than specified.
  --runtime_until N     Only take the N first hours of a run
  --readtype {1D,2D,1D2}
                        Which read type to extract information about from summary. Options are 1D, 2D,
                        1D2
  --barcoded            Use if you want to split the summary file by barcode
  --no_supplementary    Use if you want to remove supplementary alignments

Options for customizing the plots created:
  -c, --color COLOR     Specify a valid matplotlib color for the plots
  -cm, --colormap COLORMAP
                        Specify a valid matplotlib colormap for the heatmap
  -f, --format [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]
                        Specify the output format of the plots, which are in addition to the html files
  --plots [{kde,hex,dot} ...]
                        Specify which bivariate plots have to be made.
  --legacy [{kde,dot,hex} ...]
                        Specify which bivariate plots have to be made (legacy mode).
  --listcolors          List the colors which are available for plotting and exit.
  --listcolormaps       List the colors which are available for plotting and exit.
  --no-N50              Hide the N50 mark in the read length histogram
  --N50                 Show the N50 mark in the read length histogram
  --title TITLE         Add a title to all plots, requires quoting if using spaces
  --font_scale FONT_SCALE
                        Scale the font of the plots by a factor
  --dpi DPI             Set the dpi for saving images
  --hide_stats          Not adding Pearson R stats in some bivariate plots

Input data sources, one of these is required.:
  --fastq file [file ...]
                        Data is in one or more default fastq file(s).
  --fasta file [file ...]
                        Data is in one or more fasta file(s).
  --fastq_rich file [file ...]
                        Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy
                        with additional information concerning channel and time.
  --fastq_minimal file [file ...]
                        Data is in one or more fastq file(s) generated by albacore, MinKNOW or guppy
                        with additional information concerning channel and time. Is extracted swiftly
                        without elaborate checks.
  --summary file [file ...]
                        Data is in one or more summary file(s) generated by albacore or guppy.
  --bam file [file ...]
                        Data is in one or more sorted bam file(s).
  --ubam file [file ...]
                        Data is in one or more unmapped bam file(s).
  --cram file [file ...]
                        Data is in one or more sorted cram file(s).
  --pickle pickle       Data is a pickle file stored earlier.
  --feather file [file ...]
                        Data is in one or more feather file(s).

EXAMPLES:
    NanoPlot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed
    NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot
    NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000

Specify the summary.txt output file from Guppy, albacore, or MinKNOW basecalling, or specify fastq or bam files.

#Specify summary.txt
NanoPlot --summary sequencing_summary.txt --loglength -o outdir

#Specify fastq. Multiple files are allowed.
NanoPlot -t 12 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots dot -o outdir

#Specify bam. Multiple files are allowed. Only 10,000 downsampled reads are analyzed.
NanoPlot -t 12 --color yellow --bam alignment1.bam alignment2.bam --downsample 10000 -o outdir
  • --downsample N        Reduce dataset to N reads by random sampling.

An HTML report is written to the output directory.
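NanoPlot can mark the read length N50 in its length histogram (--N50). As a rough cross-check of that value, the sketch below computes the N50 of a FASTQ with awk and sort. The function name read_n50 and the four-read dataset are made up for illustration, and the sketch assumes plain 4-line FASTQ records. It is not how NanoPlot itself computes its statistics.

```shell
# read_n50: print the N50 of the read lengths of an uncompressed FASTQ on stdin.
# Assumes 4-line FASTQ records (sequence on every 2nd line of a record).
read_n50() {
  awk 'NR % 4 == 2 { print length($0) }' |   # one read length per line
    sort -rn |                               # longest reads first
    awk '{ len[NR] = $1; total += $1 }
         END {
           half = total / 2
           for (i = 1; i <= NR; i++) {
             run += len[i]                   # cumulative length so far
             if (run >= half) { print len[i]; exit }
           }
         }'
}

# Made-up dataset with read lengths 2, 4, 6 and 8 (total 20, half 10).
printf '@r1\nAC\n+\nII\n@r2\nACGT\n+\nIIII\n@r3\nACGTAC\n+\nIIIIII\n@r4\nACGTACGT\n+\nIIIIIIII\n' | read_n50
# prints 6
```

For a gzipped file the same function can be fed with gunzip -c reads.fastq.gz | read_n50.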

 

2. NanoComp (GitHub)

Compares read length and quality across multiple runs.

pip install NanoComp

> NanoComp -h

usage: NanoComp [-h] [-v] [-t THREADS] [-o OUTDIR] [-p PREFIX] [--verbose] [--raw] [--store] [--tsv_stats] [--make_no_static] [--readtype {1D,2D,1D2}] [--maxlength N] [--minlength N] [--barcoded] [--split_runs TSV_FILE] [-f [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]]
                [-n names [names ...]] [-c colors [colors ...]] [--plot {violin,box,ridge,false}] [--title TITLE] [--dpi DPI]
                (--fasta file [file ...] | --fastq files [files ...] | --fastq_rich file [file ...] | --summary files [files ...] | --bam files [files ...] | --ubam file [file ...] | --cram file [file ...] | --pickle file [file ...] | --feather file [file ...])

Compares long read sequencing datasets.

General options:
  -h, --help            show the help and exit
  -v, --version         Print version and exit.
  -t, --threads THREADS
                        Set the allowed number of threads to be used by the script
  -o, --outdir OUTDIR   Specify directory in which output has to be created.
  -p, --prefix PREFIX   Specify an optional prefix to be used for the output files.
  --verbose             Write log messages also to terminal.
  --raw                 Store the extracted data in tab separated file.
  --store               Store the extracted data in a pickle file for future plotting.
  --tsv_stats           Output the stats file as a properly formatted TSV.
  --make_no_static      Do not make static (png) plots.

Options for filtering or transforming input prior to plotting:
  --readtype {1D,2D,1D2}
                        Which read type to extract information about from summary. Options are 1D, 2D,
                        1D2
  --maxlength N         Drop reads longer than length specified.
  --minlength N         Drop reads shorter than length specified.
  --barcoded            Barcoded experiment in summary format, splitting per barcode.
  --split_runs TSV_FILE
                        File: Split the summary on run IDs and use names in tsv file. Mandatory header
                        fields are 'NAME' and 'RUN_ID'.

Options for customizing the plots created:
  -f, --format [{png,jpg,jpeg,webp,svg,pdf,eps,json} ...]
                        Specify the output format of the plots, which are in addition to the html files
  -n, --names names [names ...]
                        Specify the names to be used for the datasets
  -c, --colors colors [colors ...]
                        Specify the colors to be used for the datasets
  --plot {violin,box,ridge,false}
                        Which plot type to use: 'box', 'violin' (default), 'ridge' (joyplot) or 'false'
                        (no plots)
  --title TITLE         Add a title to all plots, requires quoting if using spaces
  --dpi DPI             Set the dpi for saving images (deprecated)

Input data sources, one of these is required.:
  --fasta file [file ...]
                        Data is in (compressed) fasta format.
  --fastq files [files ...]
                        Data is in (compressed) fastq format.
  --fastq_rich file [file ...]
                        Data is in one or more fastq file(s) generated by MinKNOW or guppy with
                        additional information concerning channel and time.
  --summary files [files ...]
                        Data is in (compressed) summary files generated by guppy.
  --bam files [files ...]
                        Data is in sorted bam files.
  --ubam file [file ...]
                        Data is in one or more unmapped bam file(s).
  --cram file [file ...]
                        Data is in one or more sorted cram file(s).
  --pickle file [file ...]
                        Data is in one or more pickle file(s) from using NanoComp/NanoPlot.
  --feather file [file ...]
                        Data is in one or more feather file(s).

EXAMPLES:
    NanoComp --bam alignment1.bam alignment2.bam --outdir compare-runs
    NanoComp --fastq reads1.fastq.gz reads2.fastq.gz reads3.fastq.gz  --names run1 run2 run3

    

Specify the fastq or bam files to compare.

#Specify fastq
NanoComp --fastq ONT1.fastq.gz ONT2.fastq.gz ONT3.fastq.gz --names run1 run2 run3 --outdir outdir

#Specify bam
NanoComp --bam alignment1.bam alignment2.bam alignment3.bam --outdir outdir
  • --plot  {violin,box,ridge,false}    Which plot type to use: 'box', 'violin' (default), 'ridge' (joyplot) or 'false' (no plots)

[Figure: example output (part of the report)]
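When many runs are compared, typing --names by hand gets tedious. The sketch below derives the dataset names from the FASTQ file names and only prints the resulting NanoComp command; nothing is executed. nanocomp_cmd is a hypothetical helper and the naming rule (strip the directory and the .fastq/.fq/.gz suffixes) is an assumption. The flags are the ones listed in the help text. File names containing spaces are not handled.

```shell
# nanocomp_cmd OUTDIR FILE...: print (do not run) a NanoComp command whose
# --names are derived from the FASTQ file names. Hypothetical helper.
nanocomp_cmd() {
  outdir=$1
  shift
  names=""
  for f in "$@"; do
    base=$(basename "$f")   # drop the directory part
    base=${base%.gz}        # drop a trailing .gz, if any
    base=${base%.fastq}     # drop a trailing .fastq, if any
    base=${base%.fq}        # drop a trailing .fq, if any
    names="$names $base"
  done
  echo "NanoComp --fastq $* --names$names --outdir $outdir"
}

nanocomp_cmd outdir data/ONT1.fastq.gz data/ONT2.fastq.gz
# prints: NanoComp --fastq data/ONT1.fastq.gz data/ONT2.fastq.gz --names ONT1 ONT2 --outdir outdir
```

Once the printed command looks right, it can be pasted into the terminal or run with eval.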

3. nanoQC (GitHub)

Investigates nucleotide composition and quality distribution at the ends of reads.

mamba install -c bioconda nanoQC -y

> nanoQC -h

usage: nanoQC [-h] [-v] [-o OUTDIR] [--rna] [-l MINLEN] fastq

Investigate nucleotide composition and base quality.

positional arguments:
  fastq                 Reads data in fastq.gz format.

options:
  -h, --help            show this help message and exit
  -v, --version         Print version and exit.
  -o OUTDIR, --outdir OUTDIR
                        Specify directory in which output has to be created.
  --rna                 Fastq is from direct RNA-seq and contains U nucleotides.
  -l MINLEN, --minlen MINLEN
                        Filters the reads on a minimal length of the given range. Also plots the given length/2 of the begin and end of the reads.

Specify a fastq file.

nanoQC ONT.fastq.gz -o OUTDIR

[Figure: example output]
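nanoQC plots nucleotide composition and quality at the read ends. For a quick text-only look at the first few positions, the awk sketch below counts A/C/G/T at each of the first K bases. head_composition is a made-up name, the sketch assumes 4-line FASTQ records, and it only looks at read starts. It is an illustration of the idea, not nanoQC's implementation.

```shell
# head_composition K: per-position A/C/G/T counts for the first K bases of
# every read in an uncompressed FASTQ on stdin. Hypothetical helper.
head_composition() {
  awk -v k="$1" '
    NR % 4 == 2 {                                  # sequence lines only
      for (i = 1; i <= k && i <= length($0); i++)
        count[i, substr($0, i, 1)]++
    }
    END {
      for (i = 1; i <= k; i++)
        printf "%d\tA=%d\tC=%d\tG=%d\tT=%d\n", i,
               count[i, "A"], count[i, "C"], count[i, "G"], count[i, "T"]
    }'
}

# Two made-up reads: ACGT and AAGT.
printf '@r1\nACGT\n+\nIIII\n@r2\nAAGT\n+\nIIII\n' | head_composition 2
# prints (tab separated):
# 1  A=2  C=0  G=0  T=0
# 2  A=1  C=1  G=0  T=0
```

A strong skew at the first positions usually points at adapter or barcode sequence, which is what the nanoQC plots make visible.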

 

4. cramino (GitHub)

A Rust replacement for NanoStat. Summarizes BAM or CRAM files more quickly.

mamba install -c bioconda cramino -y

> cramino -h

cramino 0.9.7
Wouter De Coster decosterwouter@gmail.com
Tool to extract QC metrics from cram or bam

USAGE:
    cramino [OPTIONS] <INPUT>

ARGS:
    <INPUT>    cram or bam file to check

OPTIONS:
    -t, --threads <THREADS>
            Number of parallel decompression threads to use [default: 4]

    -m, --min-read-len <MIN_READ_LEN>
            Minimal length of read to be considered [default: 0]

        --hist
            If histograms have to be generated

        --checksum
            If a checksum has to be calculated

        --arrow <ARROW>
            Write data to a feather format

        --karyotype
            Provide normalized number of reads per chromosome

        --phased
            Calculate metrics for phased reads

    -h, --help
            Print help information

    -V, --version
            Print version information

Specify a CRAM or BAM file.

cramino -t 8 input.bam

[Figure: example output]

Very fast. Analyzing a 1 GB long-read BAM file stored on an HDD finished in a few seconds.

 

5. chopper (GitHub)

A Rust implementation that combines NanoLyse and NanoFilt and performs filtering, trimming, and contaminant removal faster.

mamba install -c bioconda chopper -y

> chopper -h

chopper 0.5.0
wdecoster <decosterwouter@gmail.com>
Filtering and trimming of fastq files. Reads on stdin and writes to stdout.

USAGE:
    chopper [OPTIONS]

OPTIONS:
    -q, --quality <MINQUAL>        Sets a minimum Phred average quality score [default: 0]
        --maxqual <MAXQUAL>        Sets a maximum Phred average quality score [default: 1000]
    -l, --minlength <MINLENGTH>    Sets a minimum read length [default: 1]
        --maxlength <MAXLENGTH>    Sets a maximum read length [default: 2147483647]
        --headcrop <HEADCROP>      Trim N nucleotides from the start of a read [default: 0]
        --tailcrop <TAILCROP>      Trim N nucleotides from the end of a read [default: 0]
    -t, --threads <THREADS>        Use N parallel threads [default: 4]
    -c, --contam <CONTAM>          Filter contaminants against a fasta
    -h, --help                     Print help information
    -V, --version                  Print version information

Pipe a long-read fastq into chopper on stdin.

#An example for ONT. Keep reads with quality 10 or higher and length 1000 bp or longer.
gunzip -c reads.fastq.gz | chopper -q 10 -l 1000 -t 12 | gzip > filtered_reads.fastq.gz
  • -t     Use N parallel threads [default: 4]
  • -c    Filter contaminants against a fasta
  • -l     Sets a minimum read length [default: 1]
  • -q   Sets a minimum Phred average quality score [default: 0]
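To make the -q and -l thresholds concrete, here is a minimal awk sketch of a length and mean-quality filter. filter_reads is a made-up name. The sketch assumes Phred+33 qualities, 4-line FASTQ records, and that the "average quality" is taken over error probabilities rather than over raw Phred scores, which is an assumption about how chopper averages. It does no trimming or contaminant filtering and is far slower than chopper.

```shell
# filter_reads MINQUAL MINLEN: keep FASTQ records from stdin whose length is
# >= MINLEN and whose mean quality is >= MINQUAL. Hypothetical helper.
filter_reads() {
  awk -v minq="$1" -v minlen="$2" '
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }  # char -> ASCII code
    { rec[(NR - 1) % 4] = $0 }                   # buffer the current record
    NR % 4 == 0 {
      n = length(rec[1]); err = 0
      for (i = 1; i <= n; i++)                   # sum per-base error probabilities
        err += 10 ^ (-(ord[substr(rec[3], i, 1)] - 33) / 10)
      # mean error probability converted back to a Phred score
      if (n > 0 && n >= minlen && -10 * log(err / n) / log(10) >= minq)
        print rec[0] "\n" rec[1] "\n" rec[2] "\n" rec[3]
    }'
}

# Made-up reads: r1 is Q40 and 4 bp, r2 is Q0, r3 is Q40 but only 2 bp.
printf '@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n!!!!\n@r3\nAC\n+\nII\n' | filter_reads 10 3
# prints only the @r1 record
```

Averaging error probabilities is stricter than averaging Phred scores: a 2 bp read with qualities Q40 and Q0 comes out at about Q3, not Q20.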

 

6. phasius (GitHub)

A Rust tool that creates plots showing how well the reads are phased. Install it with cargo or download a binary from the releases.

cargo install phasius

 

7. kyber (GitHub)

A tool that quickly creates a 600x600 pixel heatmap image of log-transformed read length versus accuracy from a BAM/CRAM file. Use it for a quick first assessment of the reads.

 

Other notes

The Deprecated list now includes NanoStat, NanoFilt, and NanoLyse, which were covered earlier on this blog. Keep this in mind.

Reference

NanoPack2: population-scale evaluation of long-read sequencing data
Wouter De Coster, Rosa Rademakers
Bioinformatics, Published: 12 May 2023