2019 2/14 コマンド追加
2019 5/19 ヘルプ追加、パラメータ変更
2019 12/30並列処理例追加
2020 10/10 リンク追加
nanofitはナノポアのロングリードのクオリティトリミングができるツールである。
インストール
https://github.com/wdecoster/nanofilt
https://github.com/wdecoster/NanoPlot
mamba install -y -c conda-forge -c bioconda Nanofilt
mamba install -y -c conda-forge -c bioconda NanoPlot
> NanoPlot -h
$ NanoPlot -h
usage: NanoPlot [-h] [-v] [-t THREADS] [--verbose] [--store] [--raw]
[-o OUTDIR] [-p PREFIX] [--maxlength N] [--minlength N]
[--drop_outliers] [--downsample N] [--loglength]
[--percentqual] [--alength] [--minqual N]
[--readtype {1D,2D,1D2}] [--barcoded] [-c COLOR]
[-f {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}]
[--plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]]
[--listcolors] [--no-N50] [--N50] [--title TITLE]
(--fastq file [file ...] | --fasta file [file ...] | --fastq_rich file [file ...] | --fastq_minimal file [file ...] | --summary file [file ...] | --bam file [file ...] | --cram file [file ...] | --pickle pickle)
CREATES VARIOUS PLOTS FOR LONG READ SEQUENCING DATA.
General options:
-h, --help show the help and exit
-v, --version Print version and exit.
-t, --threads THREADS
Set the allowed number of threads to be used by the script
--verbose Write log messages also to terminal.
--store Store the extracted data in a pickle file for future plotting.
--raw Store the extracted data in tab separated file.
-o, --outdir OUTDIR Specify directory in which output has to be created.
-p, --prefix PREFIX Specify an optional prefix to be used for the output files.
Options for filtering or transforming input prior to plotting:
--maxlength N Drop reads longer than length specified.
--minlength N Drop reads shorter than length specified.
--drop_outliers Drop outlier reads with extreme long length.
--downsample N Reduce dataset to N reads by random sampling.
--loglength Logarithmic scaling of lengths in plots.
--percentqual Use qualities as theoretical percent identities.
--alength Use aligned read lengths rather than sequenced length (bam mode)
--minqual N Drop reads with an average quality lower than specified.
--readtype {1D,2D,1D2}
Which read type to extract information about from summary. Options are 1D, 2D,
1D2
--barcoded Use if you want to split the summary file by barcode
Options for customizing the plots created:
-c, --color COLOR Specify a color for the plots, must be a valid matplotlib color
-f, --format {eps,jpeg,jpg,pdf,pgf,png,ps,raw,rgba,svg,svgz,tif,tiff}
Specify the output format of the plots.
--plots [{kde,hex,dot,pauvre} [{kde,hex,dot,pauvre} ...]]
Specify which bivariate plots have to be made.
--listcolors List the colors which are available for plotting and exit.
--no-N50 Hide the N50 mark in the read length histogram
--N50 Show the N50 mark in the read length histogram
--title TITLE Add a title to all plots, requires quoting if using spaces
Input data sources, one of these is required.:
--fastq file [file ...]
Data is in one or more default fastq file(s).
--fasta file [file ...]
Data is in one or more fasta file(s).
--fastq_rich file [file ...]
Data is in one or more fastq file(s) generated by albacore or MinKNOW with
additional information concerning channel and time.
--fastq_minimal file [file ...]
Data is in one or more fastq file(s) generated by albacore or MinKNOW with
additional information concerning channel and time. Minimal data is extracted
swiftly without elaborate checks.
--summary file [file ...]
Data is in one or more summary file(s) generated by albacore.
--bam file [file ...]
Data is in one or more sorted bam file(s).
--cram file [file ...]
Data is in one or more sorted cram file(s).
--pickle pickle Data is a pickle file stored earlier.
EXAMPLES:
Nanoplot --summary sequencing_summary.txt --loglength -o summary-plots-log-transformed
NanoPlot -t 2 --fastq reads1.fastq.gz reads2.fastq.gz --maxlength 40000 --plots hex dot
NanoPlot --color yellow --bam alignment1.bam alignment2.bam alignment3.bam --downsample 10000
> NanoFilt -h
# NanoFilt -h
usage: NanoFilt [-h] [-v] [--logfile LOGFILE] [-l LENGTH]
[--maxlength MAXLENGTH] [-q QUALITY] [--minGC MINGC]
[--maxGC MAXGC] [--headcrop HEADCROP] [--tailcrop TAILCROP]
[-s SUMMARY] [--readtype {1D,2D,1D2}]
Perform quality and/or length and/or GC filtering of (long read) fastq data. Reads on stdin.
General options:
-h, --help show the help and exit
-v, --version Print version and exit.
--logfile LOGFILE Specify the path and filename for the log file.
Options for filtering reads on.:
-l LENGTH, --length LENGTH
Filter on a minimum read length
--maxlength MAXLENGTH
Filter on a maximum read length
-q QUALITY, --quality QUALITY
Filter on a minimum average read quality score
--minGC MINGC Sequences must have GC content >= to this. Float between 0.0 and 1.0. Ignored if
using summary file.
--maxGC MAXGC Sequences must have GC content <= to this. Float between 0.0 and 1.0. Ignored if
using summary file.
Options for trimming reads.:
--headcrop HEADCROP Trim n nucleotides from start of read
--tailcrop TAILCROP Trim n nucleotides from end of read
Input options.:
-s SUMMARY, --summary SUMMARY
Use summary file for quality scores
--readtype {1D,2D,1D2}
Which read type to extract information about from summary. Options are 1D, 2D or
1D2
EXAMPLES:
gunzip -c reads.fastq.gz | NanoFilt -q 10 -l 500 --headcrop 50 | minimap2 genome.fa - | samtools sort -O BAM -@24 -o alignment.bam -
gunzip -c reads.fastq.gz | NanoFilt -q 12 --headcrop 75 | gzip > trimmed-reads.fastq.gz
gunzip -c reads.fastq.gz | NanoFilt -q 10 | gzip > highQuality-reads.fastq.gz
root@fc7ac9b00489:/#
ラン
5'末端75-bpのトリミング、平均クオリティ10以下のリードを捨てるクオリティフィルタリング、500bp以下のリードを捨てるサイズフィルタリングを実行する。
gunzip -c input.fq.gz |NanoFilt -q 10 -l 500 --headcrop 75 | gzip > trimmed.fq.gz
- -q QUALITY Filter on a minimum average read quality score
- -s SUMMARYFILE optional, the sequencing_summary file from albacore for extracting quality scores
- -l LENGTH Filter on a minimum read length
- --headcrop HEADCROP Trim n nucleotides from start of read
- --tailcrop TAILCROP Trim n nucleotides from end of read
ナノポアのリードの先頭数十bpは特にクオリティが悪く、解析に悪影響を与えるので強制トリミングしている。
1Dのデータを分析してみる。
まずはfast5からbasecallingして作ったraw fastqを分析する(webでも利用可能)。
NanoPlot --fastq E.coli.fastq --loglength -t 12
quality6以下、1000bp付近に非常にたくさんのリードが出ており、クオリティの山が2つある状態である。また、山の形状も異なるのも興味深い。左下に伸びた短いリードはつまりジャンクということだろうか?
下の山をクオリティ6で切る。また5'末端50-bpをトリミングし、100bp以下になったリードは捨てる。
$gzip compressed fastq
gunzip -c input.fq.gz |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq
#fastq
cat input.fq |NanoFilt -q 6 --headcrop 50 -l 100 > trimmed.fq
nanoplotで分析。
NanoPlot --fastq trimmed.fastq --loglength -t 12 -o qc_result_dir
平均クオリティ6以下が完全になくなっている。
nanoplotは別に紹介しています。
追記
2019 12/30 mergeする前のfastqがあるならGNU parallelで簡単に並列処理できる(*1)。"q10>"、"500bp>"、"先頭50bpトリミング"を8並列で実行する。
ls *.fastq | parallel -j 8 'cat {} | NanoFilt -q 10 --headcrop 50 -l 500 > filtered_{}'
引用
NanoPack: visualizing and processing long read sequencing data.
De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C
Bioinformatics. 2018 Mar 14.
関連
*1
Trimming and filtering Oxford Nanopore sequencing reads – Gigabase or gigabyte