ナノポアのロングリードのトリミングやフィルタリングを行うNanofilt

2019 2/14 コマンド追加

2019 5/19 ヘルプ追加、パラメータ変更

2019 12/30並列処理例追加

2020 10/10 リンク追加

nanofitはナノポアのロングリードのクオリティトリミングができるツールである。

インストール

Github

https://github.com/wdecoster/nanofilt

https://github.com/wdecoster/NanoPlot

mamba install -y -c conda-forge -c bioconda Nanofilt
mamba install -y -c conda-forge -c bioconda NanoPlot

> NanoPlot -h

$ NanoPlot -h

usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw]

[-o OUTDIR] [-p PREFIX] [--maxlength N] [--minlength N]

[--drop_outliers] [--downsample N] [--loglength]

[--percentqual] [--alength] [--minqual N]

[--readtype {1D,2D,1D2}] [--barcoded] [-c COLOR]

[-f {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}]

[--plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]]

[--listcolors] [--no-N50] [--N50] [--title TITLE]

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

General options:

-h, --help show the help and exit

-v, --version Print version and exit.

-t, --threads THREADS

Set the allowed number of threads to be used by the script

--verbose Write log messages also to terminal.

--store Store the extracted data in a pickle file for future plotting.

--raw Store the extracted data in tab separated file.

-o, --outdir OUTDIR Specify directory in which output has to be created.

-p, --prefix PREFIX Specify an optional prefix to be used for the output files.

Options for filtering or transforming input prior to plotting:

--maxlength N Drop reads longer than length specified.

--minlength N Drop reads shorter than length specified.

--drop_outliers Drop outlier reads with extreme long length.

--downsample N Reduce dataset to N reads by random sampling.

--loglength Logarithmic scaling of lengths in plots.

--percentqual Use qualities as theoretical percent identities.

--alength Use aligned read lengths rather than sequenced length (bam mode)

--minqual N Drop reads with an average quality lower than specified.

--readtype {1D,2D,1D2}

Which read type to extract information about from summary. Options are 1D, 2D,

1D2

--barcoded Use if you want to split the summary file by barcode

Options for customizing the plots created:

-c, --color COLOR Specify a color for the plots, must be a valid matplotlib color

-f, --format {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}

Specify the output format of the plots.

--plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]

Specify which bivariate plots have to be made.

--listcolors List the colors which are available for plotting and exit.

--no-N50 Hide the N50 mark in the read length histogram

--N50 Show the N50 mark in the read length histogram

--title TITLE Add a title to all plots, requires quoting if using spaces

Input data sources, one of these is required.:

--fastq file [file ...]

Data is in one or more default fastq file(s).

--fasta file [file ...]

Data is in one or more fasta file(s).

--fastq_rich file [file ...]

Data is in one or more fastq file(s) generated by albacore or MinKNOW with

additional information concerning channel and time.

--fastq_minimal file [file ...]

Data is in one or more fastq file(s) generated by albacore or MinKNOW with

additional information concerning channel and time. Minimal data is extracted

swiftly without elaborate checks.

--summary file [file ...]

Data is in one or more summary file(s) generated by albacore.

--bam file [file ...]

Data is in one or more sorted bam file(s).

--cram file [file ...]

Data is in one or more sorted cram file(s).

--pickle pickle Data is a pickle file stored earlier.

EXAMPLES:

Nanoplot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed

NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot

NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000

> NanoFilt -h

# NanoFilt -h

usage: NanoFilt [-h] [-v] [--logfile LOGFILE] [-l LENGTH]

[--maxlength MAXLENGTH] [-q QUALITY] [--minGC MINGC]

[--maxGC MAXGC] [--headcrop HEADCROP] [--tailcrop TAILCROP]

[-s SUMMARY] [--readtype {1D,2D,1D2}]

Perform quality and/or length and/or GC filtering of (long read) fastq data. Reads on stdin.

General options:

-h, --help show the help and exit

-v, --version Print version and exit.

--logfile LOGFILE Specify the path and filename for the log file.

Options for filtering reads on.:

-l LENGTH, --length LENGTH

Filter on a minimum read length

--maxlength MAXLENGTH

Filter on a maximum read length

-q QUALITY, --quality QUALITY

Filter on a minimum average read quality score

--minGC MINGC Sequences must have GC content >= to this. Float between 0.0 and 1.0. Ignored if

using summary file.

--maxGC MAXGC Sequences must have GC content <= to this. Float between 0.0 and 1.0. Ignored if

using summary file.

Options for trimming reads.:

--headcrop HEADCROP Trim n nucleotides from start of read

--tailcrop TAILCROP Trim n nucleotides from end of read

Input options.:

-s SUMMARY, --summary SUMMARY

Use summary file for quality scores

--readtype {1D,2D,1D2}

Which read type to extract information about from summary. Options are 1D, 2D or

1D2

EXAMPLES:

gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 500 --headcrop 50 | minimap2 genome.fa - | samtools sort -O BAM -@24 -o alignment.bam -

gunzip -c reads.fastq.gz | NanoFilt -q 12 --headcrop 75 | gzip > trimmed-reads.fastq.gz

gunzip -c reads.fastq.gz | NanoFilt -q 10 | gzip > highQuality-reads.fastq.gz

root@fc7ac9b00489:/#

ラン

5'末端75-bpのトリミング、平均クオリティ10以下のリードを捨てるクオリティフィルタリング、500bp以下のリードを捨てるサイズフィルタリングを実行する。

gunzip -c input.fq.gz |NanoFilt -q 10 -l 500 --headcrop 75 | gzip > trimmed.fq.gz

-q　QUALITY Filter on a minimum average read quality score
-s　SUMMARYFILE optional, the sequencing_summary file from albacore for extracting quality scores
-l　LENGTH Filter on a minimum read length
--headcrop　HEADCROP Trim n nucleotides from start of read
--tailcrop　TAILCROP Trim n nucleotides from end of read

ナノポアのリードの先頭数十bpは特にクオリティが悪く、解析に悪影響を与えるので強制トリミングしている。

1Dのデータを分析してみる。

まずはfast5からbasecallingして作ったraw fastqを分析する（webでも利用可能）。

NanoPlot --fastq E.coli.fastq --loglength -t 12

f:id:kazumaxneo:20171007151034j:plain

quality6以下、1000bp付近に非常にたくさんのリードが出ており、クオリティの山が２つある状態である。また、山の形状も異なるのも興味深い。左下に伸びた短いリードはつまりジャンクということだろうか？

下の山をクオリティ6で切る。また5'末端50-bpをトリミングし、100bp以下になったリードは捨てる。

$gzip compressed fastq
gunzip -c input.fq.gz |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq

#fastq
cat input.fq |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq

nanoplotで分析。

NanoPlot --fastq trimmed.fastq --loglength -t 12 -o qc_result_dir

f:id:kazumaxneo:20171007152051j:plain

平均クオリティ6以下が完全になくなっている。

nanoplotは別に紹介しています。

追記

2019 12/30 mergeする前のfastqがあるならGNU parallelで簡単に並列処理できる（*1）。"q10>"、"500bp>"、"先頭50bpトリミング"を８並列で実行する。

ls *.fastq | parallel -j 8 'cat {} | NanoFilt -q 10 --headcrop 50 -l 500 > filtered_{}'

引用

NanoPack: visualizing and processing long read sequencing data.

De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C

Bioinformatics. 2018 Mar 14.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ナノポアのロングリードのトリミングやフィルタリングを行うNanofilt