macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ナノポアのロングリードのトリミングやフィルタリングを行うNanofilt

2019 2/14 コマンド追加

2019 5/19 ヘルプ追加、パラメータ変更

2019 12/30並列処理例追加

2020 10/10 リンク追加

 

nanofitはナノポアのロングリードのクオリティトリミングができるツールである。

  

インストール

Github

https://github.com/wdecoster/nanofilt

https://github.com/wdecoster/NanoPlot

mamba install -y -c conda-forge -c bioconda Nanofilt
mamba install -y -c conda-forge -c bioconda NanoPlot

> NanoPlot -h

$ NanoPlot -h

usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw]

                [-o OUTDIR] [-p PREFIX] [--maxlength N] [--minlength N]

                [--drop_outliers] [--downsample N] [--loglength]

                [--percentqual] [--alength] [--minqual N]

                [--readtype {1D,2D,1D2}] [--barcoded] [-c COLOR]

                [-f {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}]

                [--plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]]

                [--listcolors] [--no-N50] [--N50] [--title TITLE]

                (--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --cram file [file ...] | --pickle pickle)

 

CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.

 

General options:

  -h, --help            show the help and exit

  -v, --version         Print version and exit.

  -t, --threads THREADS

                        Set the allowed number of threads to be used by the script

  --verbose             Write log messages also to terminal.

  --store               Store the extracted data in a pickle file for future plotting.

  --raw                 Store the extracted data in tab separated file.

  -o, --outdir OUTDIR   Specify directory in which output has to be created.

  -p, --prefix PREFIX   Specify an optional prefix to be used for the output files.

 

Options for filtering or transforming input prior to plotting:

  --maxlength N         Drop reads longer than length specified.

  --minlength N         Drop reads shorter than length specified.

  --drop_outliers       Drop outlier reads with extreme long length.

  --downsample N        Reduce dataset to N reads by random sampling.

  --loglength           Logarithmic scaling of lengths in plots.

  --percentqual         Use qualities as theoretical percent identities.

  --alength             Use aligned read lengths rather than sequenced length (bam mode)

  --minqual N           Drop reads with an average quality lower than specified.

  --readtype {1D,2D,1D2}

                        Which read type to extract information about from summary. Options are 1D, 2D,

                        1D2

  --barcoded            Use if you want to split the summary file by barcode

 

Options for customizing the plots created:

  -c, --color COLOR     Specify a color for the plots, must be a valid matplotlib color

  -f, --format {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}

                        Specify the output format of the plots.

  --plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]

                        Specify which bivariate plots have to be made.

  --listcolors          List the colors which are available for plotting and exit.

  --no-N50              Hide the N50 mark in the read length histogram

  --N50                 Show the N50 mark in the read length histogram

  --title TITLE         Add a title to all plots, requires quoting if using spaces

 

Input data sources, one of these is required.:

  --fastq file [file ...]

                        Data is in one or more default fastq file(s).

  --fasta file [file ...]

                        Data is in one or more fasta file(s).

  --fastq_rich file [file ...]

                        Data is in one or more fastq file(s) generated by albacore or MinKNOW with

                        additional information concerning channel and time.

  --fastq_minimal file [file ...]

                        Data is in one or more fastq file(s) generated by albacore or MinKNOW with

                        additional information concerning channel and time. Minimal data is extracted

                        swiftly without elaborate checks.

  --summary file [file ...]

                        Data is in one or more summary file(s) generated by albacore.

  --bam file [file ...]

                        Data is in one or more sorted bam file(s).

  --cram file [file ...]

                        Data is in one or more sorted cram file(s).

  --pickle pickle       Data is a pickle file stored earlier.

 

EXAMPLES:

    Nanoplot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed

    NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot

    NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000

    

NanoFilt -h

# NanoFilt -h

usage: NanoFilt [-h] [-v] [--logfile LOGFILE] [-l LENGTH]

                [--maxlength MAXLENGTH] [-q QUALITY] [--minGC MINGC]

                [--maxGC MAXGC] [--headcrop HEADCROP] [--tailcrop TAILCROP]

                [-s SUMMARY] [--readtype {1D,2D,1D2}]

 

Perform quality and/or length and/or GC filtering of (long read) fastq data.           Reads on stdin.

 

General options:

  -h, --help            show the help and exit

  -v, --version         Print version and exit.

  --logfile LOGFILE     Specify the path and filename for the log file.

 

Options for filtering reads on.:

  -l LENGTH, --length LENGTH

                        Filter on a minimum read length

  --maxlength MAXLENGTH

                        Filter on a maximum read length

  -q QUALITY, --quality QUALITY

                        Filter on a minimum average read quality score

  --minGC MINGC         Sequences must have GC content >= to this. Float between 0.0 and 1.0. Ignored if

                        using summary file.

  --maxGC MAXGC         Sequences must have GC content <= to this. Float between 0.0 and 1.0. Ignored if

                        using summary file.

 

Options for trimming reads.:

  --headcrop HEADCROP   Trim n nucleotides from start of read

  --tailcrop TAILCROP   Trim n nucleotides from end of read

 

Input options.:

  -s SUMMARY, --summary SUMMARY

                        Use summary file for quality scores

  --readtype {1D,2D,1D2}

                        Which read type to extract information about from summary. Options are 1D, 2D or

                        1D2

 

EXAMPLES:

  gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 500 --headcrop 50 | minimap2 genome.fa - | samtools sort -O BAM -@24 -o alignment.bam -

  gunzip -c reads.fastq.gz | NanoFilt -q 12 --headcrop 75 | gzip > trimmed-reads.fastq.gz

  gunzip -c reads.fastq.gz | NanoFilt -q 10 | gzip > highQuality-reads.fastq.gz

root@fc7ac9b00489:/# 

 

 

ラン

5'末端75-bpのトリミング、平均クオリティ10以下のリードを捨てるクオリティフィルタリング、500bp以下のリードを捨てるサイズフィルタリングを実行する。

gunzip -c input.fq.gz |NanoFilt -q 10 -l 500 --headcrop 75 | gzip > trimmed.fq.gz
  • -q QUALITY Filter on a minimum average read quality score
  • -s SUMMARYFILE optional, the sequencing_summary file from albacore for extracting quality scores
  • -l LENGTH Filter on a minimum read length
  • --headcrop HEADCROP Trim n nucleotides from start of read
  • --tailcrop TAILCROP Trim n nucleotides from end of read

 

ナノポアのリードの先頭数十bpは特にクオリティが悪く、解析に悪影響を与えるので強制トリミングしている。

 

 

1Dのデータを分析してみる。

まずはfast5からbasecallingして作ったraw fastqを分析する(webでも利用可能)。

NanoPlot --fastq E.coli.fastq --loglength -t 12

f:id:kazumaxneo:20171007151034j:plain

quality6以下、1000bp付近に非常にたくさんのリードが出ており、クオリティの山が2つある状態である。また、山の形状も異なるのも興味深い。左下に伸びた短いリードはつまりジャンクということだろうか?

 

下の山をクオリティ6で切る。また5'末端50-bpをトリミングし、100bp以下になったリードは捨てる。

$gzip compressed fastq
gunzip -c input.fq.gz |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq

#fastq
cat input.fq |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq

nanoplotで分析。

NanoPlot --fastq trimmed.fastq --loglength -t 12 -o qc_result_dir

f:id:kazumaxneo:20171007152051j:plain

平均クオリティ6以下が完全になくなっている。

 

nanoplotは別に紹介しています。


追記

2019 12/30 mergeする前のfastqがあるならGNU parallelで簡単に並列処理できる(*1)。"q10>"、"500bp>"、"先頭50bpトリミング"を8並列で実行する。

ls *.fastq | parallel -j 8 'cat {} | NanoFilt -q 10 --headcrop 50 -l 500 > filtered_{}'

 

引用

NanoPack: visualizing and processing long read sequencing data.

De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C

Bioinformatics. 2018 Mar 14.

 

関連


 

 

 

 

*1

Trimming and filtering Oxford Nanopore sequencing reads – Gigabase or gigabyte