macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

FastQCの高速な代替 Falco

 

 品質管理はシーケンスデータ解析において不可欠な最初のステップであり、品質管理のためのソフトウェアツールはほとんどのシーケンスセンターで標準的なパイプラインに深く浸透している。関連する計算は簡単だが、多くの環境では品質管理に必要な総計算量はかなりのものであり、最適化が必要である。Falcoは、FastQCのエミュレーションであり、同等の結果を生成しながら平均3倍速く動作する。FastQCと比較して、Falcoは実行に必要なメモリも少なく、HTMLレポートのより柔軟な視覚化を提供する。

 

インストール

ubuntu20とmacでテストした。

依存

  • zlib is required to read gzip compressed FASTQ files
  • htslib is required to process bam files

Github

https://github.com/smithlabcode/falco

#conda
mamba install -c bioconda falco -y

#source
git clone https://github.com/smithlabcode/falco.git
cd falco/
./autogen.sh
./configure CXXFLAGS="-O3 -Wall" --enable-hts
make HAVE_HTSLIB=1 all
make HAVE_HTSLIB=1 install

> falco -h

Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...

 

Options:

  -h, --help               Print this help file and exit  

  -v, --version            Print the version of the program and exit  

  -o, --outdir             Create all output files in the specified 

                           output directory. FALCO-SPECIFIC: If the 

                           directory does not exists, the program will 

                           create it. If this option is not set then 

                           the output file for each sequence file is 

                           created in the same directory as the 

                           sequence file which was processed.  

      --casava             [IGNORED BY FALCO] Files come from raw 

                           casava output. Files in the same sample 

                           group (differing only by the group number) 

                           will be analysed as a set rather than 

                           individually. Sequences with the filter flag 

                           set in the header will be excluded from the 

                           analysis. Files must have the same names 

                           given to them by casava (including being 

                           gzipped and ending with .gz) otherwise they 

                           won't be grouped together correctly.  

      --nano               [IGNORED BY FALCO] Files come from nanopore 

                           sequences and are in fast5 format. In this 

                           mode you can pass in directories to process 

                           and the program will take in all fast5 files 

                           within those directories and produce a 

                           single output file from the sequences found 

                           in all files.  

      --nofilter           [IGNORED BY FALCO] If running with --casava 

                           then don't remove read flagged by casava as 

                           poor quality when performing the QC 

                           analysis.  

      --extract            [ALWAYS ON IN FALCO] If set then the zipped 

                           output file will be uncompressed in the same 

                           directory after it has been created. By 

                           default this option will be set if fastqc is 

                           run in non-interactive mode.  

  -j, --java               [IGNORED BY FALCO] Provides the full path to 

                           the java binary you want to use to launch 

                           fastqc. If not supplied then java is assumed 

                           to be in your path.  

      --noextract          [IGNORED BY FALCO] Do not uncompress the 

                           output file after creating it. You should 

                           set this option if you do not wish to 

                           uncompress the output when running in 

                           non-interactive mode.  

      --nogroup            Disable grouping of bases for reads >50bp. 

                           All reports will show data for every base in 

                           the read. WARNING: When using this option, 

                           your plots may end up a ridiculous size. You 

                           have been warned!  

      --min_length         [NOT YET IMPLEMENTED IN FALCO] Sets an 

                           artificial lower limit on the length of the 

                           sequence to be shown in the report. As long 

                           as you set this to a value greater or equal 

                           to your longest read length then this will 

                           be the sequence length used to create your 

                           read groups. This can be useful for making 

                           directly comaparable statistics from 

                           datasets with somewhat variable read 

                           lengths.  

  -f, --format             Bypasses the normal sequence file format 

                           detection and forces the program to use the 

                           specified format. Validformats are bam, sam, 

                           bam_mapped, sam_mapped, fastq, fq, fastq.gz 

                           or fq.gz.  

  -t, --threads            [NOT YET IMPLEMENTED IN FALCO] Specifies the 

                           number of files which can be processed 

                           simultaneously. Each thread will be 

                           allocated 250MB of memory so you shouldn't 

                           run more threads than your available memory 

                           will cope with, and not more than 6 threads 

                           on a 32 bit machine [1] 

  -c, --contaminants       Specifies a non-default file which contains 

                           the list of contaminants to screen 

                           overrepresented sequences against. The file 

                           must contain sets of named contaminants in 

                           the form name[tab]sequence. Lines prefixed 

                           with a hash will be ignored. Default: 

                           /home/kazu/Documents/falco/Configuration/contaminant_list.txt 

  -a, --adapters           Specifies a non-default file which contains 

                           the list of adapter sequences which will be 

                           explicity searched against the library. The 

                           file must contain sets of named adapters in 

                           the form name[tab]sequence. Lines prefixed 

                           with a hash will be ignored. Default: 

                           /home/kazu/Documents/falco/Configuration/adapter_list.txt 

  -l, --limits             Specifies a non-default file which contains 

                           a set of criteria which will be used to 

                           determine the warn/error limits for the 

                           various modules. This file can also be used 

                           to selectively remove some modules from the 

                           output all together. The format needs to 

                           mirror the default limits.txt file found in 

                           the Configuration folder. Default: 

                           /home/kazu/Documents/falco/Configuration/limits.txt 

  -k, --kmers              [IGNORED BY FALCO AND ALWAYS SET TO 7] 

                           Specifies the length of Kmer to look for in 

                           the Kmer content module. Specified Kmer 

                           length must be between 2 and 10. Default 

                           length is 7 if not specified.  

  -q, --quiet              Supress all progress messages on stdout and 

                           only report errors.  

  -d, --dir                [IGNORED: FALCO DOES NOT CREATE TMP FILES] 

                           Selects a directory to be used for temporary 

                           files written when generating report images. 

                           Defaults to system temp directory if not 

                           specified.  

  -s, -subsample           [Falco only] makes falco faster (but 

                           possibly less accurate) by only processing 

                           reads that are multiple of this value (using 

                           0-based indexing to number reads). [1] 

  -b, -bisulfite           [Falco only] reads are whole genome 

                           bisulfite sequencing, and more Ts and fewer 

                           Cs are therefore expected and will be 

                           accounted for in base content.  

  -r, -reverse-complement  [Falco only] The input is a 

                           reverse-complement. All modules will be 

                           tested by swapping A/T and C/G  

      -skip-data           [Falco only] Do not create FastQC data text 

                           file.  

      -skip-report         [Falco only] Do not create FastQC report 

                           HTML file.  

      -skip-summary        [Falco only] Do not create FastQC summary 

                           file  

  -D, -data-filename       [Falco only] Specify filename for FastQC 

                           data output (TXT). If not specified, it will 

                           be called fastq_data.txt in either the input 

                           file's directory or the one specified in the 

                           --output flag. Only available when running 

                           falco with a single input.  

  -R, -report-filename     [Falco only] Specify filename for FastQC 

                           report output (HTML). If not specified, it 

                           will be called fastq_report.html in either 

                           the input file's directory or the one 

                           specified in the --output flag. Only 

                           available when running falco with a single 

                           input.  

  -S, -summary-filename    [Falco only] Specify filename for the short 

                           summary output (TXT). If not specified, it 

                           will be called fastq_report.html in either 

                           the input file's directory or the one 

                           specified in the --output flag. Only 

                           available when running falco with a single 

                           input.  

  -K, -add-call            [Falco only] add the command call call to 

                           FastQC data output and FastQC report HTML 

                           (this may break the parse of fastqc_data.txt 

                           in programs that are very strict about the 

                           FastQC output format).  

 

Help options:

  -?, -help                print this help message  

      -about               print about message  

 

 

 

実行方法

fastqファイルを指定する(fastq, fq, fastq.gz or fq.gz)。

#1つのfastq
falco input.fq.gz

#複数fastq、#出力指定、4スレッド
falco -t 4 -o outdir sample*.fq.gz
  • -o     Create all output files in the specified output directory. FALCO-SPECIFIC: If the directory does not exists, the program will create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed.                     
  • -t    Specifies the number of files which can be processed  simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine [1] 
  • -c    Specifies a non-default file which contains the list of contaminants to screen overrepresented sequences against. The file must contain sets of named contaminants in  the form name[tab]sequence. Lines prefixed with a hash will be ignored. Default: Configuration/contaminant_list.txt 
  • -a    Specifies a non-default file which contains the list of adapter sequences which will be explicity searched against the library. The file must contain sets of named adapters in  the form name[tab]sequence. Lines prefixed  with a hash will be ignored. Default:  Configuration/adapter_list.txt 

 

sam、bam (binary sam)のリードの分析にも対応している(htslibが必要)。

falco -t 4 -o outdir input.bam
  • -f      Bypasses the normal sequence file format   detection and forces the program to use the specified format. Valid formats are bam, sam, bam_mapped, sam_mapped, fastq, fq, fastq.gz or fq.gz. 
                              

出力

  • fastqc_data.txtはQCメトリクスの要約を含むテキストファイル
  • fastqc_report.htmlは、テキストサマリーのプロットを示すHTMLレポート
  • summary.txt: 各モジュールのpass/warn/failの結果を記述したタブ区切りファイル

レポート

ロングリードの場合(デフォルト設定でONTを調べるとqualityはfailとなった)

(以下省略)

 

その他

  • fastQC互換なので、出力はそのままmultiqcで認識する。

引用
Falco: high-speed FastQC emulation for quality control of sequencing data

Guilherme de Sena Brandine 1, Andrew D Smith

F1000Res. 2019 Nov 7:8:1874