品質管理はシーケンスデータ解析において不可欠な最初のステップであり、品質管理のためのソフトウェアツールはほとんどのシーケンスセンターで標準的なパイプラインに深く浸透している。関連する計算は簡単だが、多くの環境では品質管理に必要な総計算量はかなりのものであり、最適化が必要である。Falcoは、FastQCのエミュレーションであり、同等の結果を生成しながら平均3倍速く動作する。FastQCと比較して、Falcoは実行に必要なメモリも少なく、HTMLレポートのより柔軟な視覚化を提供する。
インストール
ubuntu20とmacでテストした。
依存
- zlib is required to read gzip compressed FASTQ files
- htslib is required to process bam files
https://github.com/smithlabcode/falco
#conda
mamba install -c bioconda falco -y
#source
git clone https://github.com/smithlabcode/falco.git
cd falco/
./autogen.sh
./configure CXXFLAGS="-O3 -Wall" --enable-hts
make HAVE_HTSLIB=1 all
make HAVE_HTSLIB=1 install
> falco -h
Usage: falco [OPTIONS] <seqfile1> <seqfile2> ...
Options:
-h, --help Print this help file and exit
-v, --version Print the version of the program and exit
-o, --outdir Create all output files in the specified
output directory. FALCO-SPECIFIC: If the
directory does not exists, the program will
create it. If this option is not set then
the output file for each sequence file is
created in the same directory as the
sequence file which was processed.
--casava [IGNORED BY FALCO] Files come from raw
casava output. Files in the same sample
group (differing only by the group number)
will be analysed as a set rather than
individually. Sequences with the filter flag
set in the header will be excluded from the
analysis. Files must have the same names
given to them by casava (including being
gzipped and ending with .gz) otherwise they
won't be grouped together correctly.
--nano [IGNORED BY FALCO] Files come from nanopore
sequences and are in fast5 format. In this
mode you can pass in directories to process
and the program will take in all fast5 files
within those directories and produce a
single output file from the sequences found
in all files.
--nofilter [IGNORED BY FALCO] If running with --casava
then don't remove read flagged by casava as
poor quality when performing the QC
analysis.
--extract [ALWAYS ON IN FALCO] If set then the zipped
output file will be uncompressed in the same
directory after it has been created. By
default this option will be set if fastqc is
run in non-interactive mode.
-j, --java [IGNORED BY FALCO] Provides the full path to
the java binary you want to use to launch
fastqc. If not supplied then java is assumed
to be in your path.
--noextract [IGNORED BY FALCO] Do not uncompress the
output file after creating it. You should
set this option if you do not wish to
uncompress the output when running in
non-interactive mode.
--nogroup Disable grouping of bases for reads >50bp.
All reports will show data for every base in
the read. WARNING: When using this option,
your plots may end up a ridiculous size. You
have been warned!
--min_length [NOT YET IMPLEMENTED IN FALCO] Sets an
artificial lower limit on the length of the
sequence to be shown in the report. As long
as you set this to a value greater or equal
to your longest read length then this will
be the sequence length used to create your
read groups. This can be useful for making
directly comaparable statistics from
datasets with somewhat variable read
lengths.
-f, --format Bypasses the normal sequence file format
detection and forces the program to use the
specified format. Validformats are bam, sam,
bam_mapped, sam_mapped, fastq, fq, fastq.gz
or fq.gz.
-t, --threads [NOT YET IMPLEMENTED IN FALCO] Specifies the
number of files which can be processed
simultaneously. Each thread will be
allocated 250MB of memory so you shouldn't
run more threads than your available memory
will cope with, and not more than 6 threads
on a 32 bit machine [1]
-c, --contaminants Specifies a non-default file which contains
the list of contaminants to screen
overrepresented sequences against. The file
must contain sets of named contaminants in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/home/kazu/Documents/falco/Configuration/contaminant_list.txt
-a, --adapters Specifies a non-default file which contains
the list of adapter sequences which will be
explicity searched against the library. The
file must contain sets of named adapters in
the form name[tab]sequence. Lines prefixed
with a hash will be ignored. Default:
/home/kazu/Documents/falco/Configuration/adapter_list.txt
-l, --limits Specifies a non-default file which contains
a set of criteria which will be used to
determine the warn/error limits for the
various modules. This file can also be used
to selectively remove some modules from the
output all together. The format needs to
mirror the default limits.txt file found in
the Configuration folder. Default:
/home/kazu/Documents/falco/Configuration/limits.txt
-k, --kmers [IGNORED BY FALCO AND ALWAYS SET TO 7]
Specifies the length of Kmer to look for in
the Kmer content module. Specified Kmer
length must be between 2 and 10. Default
length is 7 if not specified.
-q, --quiet Supress all progress messages on stdout and
only report errors.
-d, --dir [IGNORED: FALCO DOES NOT CREATE TMP FILES]
Selects a directory to be used for temporary
files written when generating report images.
Defaults to system temp directory if not
specified.
-s, -subsample [Falco only] makes falco faster (but
possibly less accurate) by only processing
reads that are multiple of this value (using
0-based indexing to number reads). [1]
-b, -bisulfite [Falco only] reads are whole genome
bisulfite sequencing, and more Ts and fewer
Cs are therefore expected and will be
accounted for in base content.
-r, -reverse-complement [Falco only] The input is a
reverse-complement. All modules will be
tested by swapping A/T and C/G
-skip-data [Falco only] Do not create FastQC data text
file.
-skip-report [Falco only] Do not create FastQC report
HTML file.
-skip-summary [Falco only] Do not create FastQC summary
file
-D, -data-filename [Falco only] Specify filename for FastQC
data output (TXT). If not specified, it will
be called fastq_data.txt in either the input
file's directory or the one specified in the
--output flag. Only available when running
falco with a single input.
-R, -report-filename [Falco only] Specify filename for FastQC
report output (HTML). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-S, -summary-filename [Falco only] Specify filename for the short
summary output (TXT). If not specified, it
will be called fastq_report.html in either
the input file's directory or the one
specified in the --output flag. Only
available when running falco with a single
input.
-K, -add-call [Falco only] add the command call call to
FastQC data output and FastQC report HTML
(this may break the parse of fastqc_data.txt
in programs that are very strict about the
FastQC output format).
Help options:
-?, -help print this help message
-about print about message
実行方法
fastqファイルを指定する(fastq, fq, fastq.gz or fq.gz)。
#1つのfastq
falco input.fq.gz
#複数fastq、#出力指定、4スレッド
falco -t 4 -o outdir sample*.fq.gz
- -o Create all output files in the specified output directory. FALCO-SPECIFIC: If the directory does not exists, the program will create it. If this option is not set then the output file for each sequence file is created in the same directory as the sequence file which was processed.
- -t Specifies the number of files which can be processed simultaneously. Each thread will be allocated 250MB of memory so you shouldn't run more threads than your available memory will cope with, and not more than 6 threads on a 32 bit machine [1]
- -c Specifies a non-default file which contains the list of contaminants to screen overrepresented sequences against. The file must contain sets of named contaminants in the form name[tab]sequence. Lines prefixed with a hash will be ignored. Default: Configuration/contaminant_list.txt
- -a Specifies a non-default file which contains the list of adapter sequences which will be explicity searched against the library. The file must contain sets of named adapters in the form name[tab]sequence. Lines prefixed with a hash will be ignored. Default: Configuration/adapter_list.txt
sam、bam (binary sam)のリードの分析にも対応している(htslibが必要)。
falco -t 4 -o outdir input.bam
- -f Bypasses the normal sequence file format detection and forces the program to use the specified format. Valid formats are bam, sam, bam_mapped, sam_mapped, fastq, fq, fastq.gz or fq.gz.
出力
- fastqc_data.txtはQCメトリクスの要約を含むテキストファイル
- fastqc_report.htmlは、テキストサマリーのプロットを示すHTMLレポート
- summary.txt: 各モジュールのpass/warn/failの結果を記述したタブ区切りファイル
レポート
ロングリードの場合(デフォルト設定でONTを調べるとqualityはfailとなった)
(以下省略)
その他
- fastQC互換なので、出力はそのままmultiqcで認識する。
引用
Falco: high-speed FastQC emulation for quality control of sequencing data
Guilherme de Sena Brandine 1, Andrew D Smith
F1000Res. 2019 Nov 7:8:1874