2022-09-19

Demultiplexing with the fgbio DemuxFastqs command

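fgbio is a command-line toolkit for working with deep sequencing data. It operates on read-level data (FASTQ, SAM, BAM, etc.) and variant-level data (VCF, BCF, etc.). According to its GitHub page, it places particular emphasis on providing the following: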

Robust, well-tested tools
An easy-to-use command line
Clear and thorough documentation for each tool
Open source development for the benefit of the community and clients

The fgbio DemuxFastqs subcommand demultiplexes FASTQ files. It can also optionally extract UMIs.

Tool section

http://fulcrumgenomics.github.io/fgbio/tools/latest/

fgbio DemuxFastqs

http://fulcrumgenomics.github.io/fgbio/tools/latest/DemuxFastqs.html

Installation

Installed on Ubuntu 18.

Github

mamba install -c bioconda fgbio -y

> fgbio DemuxFastqs

$ fgbio DemuxFastqs

USAGE: fgbio [fgbio arguments] [command name] [command arguments]

Version: 2.0.2

------------------------------------------------------------------------------------------------------------------------

fgbio Arguments:

------------------------------------------------------------------------------------------------------------------------

-h true|false, --help=true|false

Display the help message. [Default: false].

--async-io=true|false Use asynchronous I/O where possible, e.g. for SAM and BAM files. [Default:

false].

--version=true|false Display the version number for this tool. [Default: false].

--compression=Int Default GZIP compression level, BAM compression level. [Default: 5].

--tmp-dir=DirPath Directory to use for temporary files. [Default:

/var/folders/9y/gqf42hb548178qbs0mm2r78w0000gn/T].

--log-level=LogLevel Minimum severity log-level to emit. [Default: Info]. Options: Debug, Info,

Warning, Error, Fatal.

--sam-validation-stringency=ValidationStringency

Validation stringency for SAM/BAM reading. [Default: SILENT]. Options:

STRICT, LENIENT, SILENT.

DemuxFastqs

------------------------------------------------------------------------------------------------------------------------

Performs sample demultiplexing on FASTQs.

The sample barcode for each sample in the sample sheet will be compared against the sample barcode bases extracted from

the FASTQs, to assign each read to a sample. Reads that do not match any sample within the given error tolerance will

be placed in the 'unmatched' file.

The type of output is specified with the '--output-type' option, and can be BAM ('--output-type Bam'), gzipped FASTQ

('--output-type Fastq'), or both ('--output-type BamAndFastq').

For BAM output, the output directory will contain one BAM file per sample in the sample sheet or metadata CSV file,

plus a BAM for reads that could not be assigned to a sample given the criteria. The output file names will be the

concatenation of sample id, sample name, and sample barcode bases (expected not observed), delimited by '-'. A metrics

file will also be output providing analogous information to the metric described SampleBarcodeMetric

(http://fulcrumgenomics.github.io/fgbio/metrics/latest/#samplebarcodemetric).

For gzipped FASTQ output, one or more gzipped FASTQs per sample in the sample sheet or metadata CSV file will be

written to the output directory. For paired end data, the output will have the suffix '_R1.fastq.gz' and '_R2.fastq.gz'

for read one and read two respectively. The sample barcode and molecular barcodes (concatenated) will be appended to

the read name and delimited by a colon. If the '--illumina-standards' option is given, then the output read names and

file names will follow the Illumina standards described here

(https://help.basespace.illumina.com/articles/tutorials/upload-data-using-web-uploader/).

The output base qualities will be standardized to Sanger/SAM format.

FASTQs and associated read structures for each sub-read should be given:

* a single fragment read should have one FASTQ and one read structure

* paired end reads should have two FASTQs and two read structures

* a dual-index sample with paired end reads should have four FASTQs and four read structures given: two for the two

index reads, and two for the template reads.

If multiple FASTQs are present for each sub-read, then the FASTQs for each sub-read should be concatenated together

prior to running this tool (ex. 'cat s_R1_L001.fq.gz s_R1_L002.fq.gz > s_R1.fq.gz').

(Read structures)https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures are made up of '<number><operator>'

pairs much like the 'CIGAR' string in BAM files. Four kinds of operators are recognized:

1. 'T' identifies a template read

2. 'B' identifies a sample barcode read

3. 'M' identifies a unique molecular index read

4. 'S' identifies a set of bases that should be skipped or ignored

The last '<number><operator>' pair may be specified using a '+' sign instead of number to denote "all remaining bases".

This is useful if, e.g., fastqs have been trimmed and contain reads of varying length. Both reads must have template

bases. Any molecular identifiers will be concatenated using the '-' delimiter and placed in the given SAM record tag

('RX' by default). Similarly, the sample barcode bases from the given read will be placed in the 'BC' tag.

Metadata about the samples should be given in either an Illumina Experiment Manager sample sheet or a metadata CSV

file. Formats are described in detail below.

The read structures will be used to extract the observed sample barcode, template bases, and molecular identifiers from

each read. The observed sample barcode will be matched to the sample barcodes extracted from the bases in the sample

metadata and associated read structures.

Sample Sheet

------------

The read group's sample id, sample name, and library id all correspond to the similarly named values in the sample

sheet. Library id will be the sample id if not found, and the platform unit will be the sample name concatenated with

the sample barcode bases delimited by a '.'.

The sample section of the sample sheet should contain information related to each sample with the following columns:

* Sample_ID: The sample identifier unique to the sample in the sample sheet.

* Sample_Name: The sample name.

* Library_ID: The library Identifier. The combination sample name and library identifier should be unique across the

samples in the sample sheet.

* Description: The description of the sample, which will be placed in the description field in the output BAM's read

group. This column may be omitted.

* Sample_Barcode: The sample barcode bases unique to each sample. The name of the column containing the sample

barcode can be changed using the '--column-for-sample-barcode' option. If the sample barcode is present across

multiple reads (ex. dual-index, or inline in both reads of a pair), then the expected barcode bases from each read

should be concatenated in the same order as the order of the reads' FASTQs and read structures given to this tool.

Metadata CSV

------------

In lieu of a sample sheet, a simple CSV file may be provided with the necessary metadata. This file should contain the

same columns as described above for the sample sheet ('Sample_ID', 'Sample_Name', 'Library_ID', and 'Description').

Example Command Line

--------------------

As an example, if the sequencing run was 2x100bp (paired end) with two 8bp index reads both reading a sample barcode,

as well as an in-line 8bp sample barcode in read one, the command line would be

--inputs r1.fq i1.fq i2.fq r2.fq --read-structures 8B92T 8B 8B 100T \

--metadata SampleSheet.csv --metrics metrics.txt --output output_folder

Output Standards

----------------

The following options affect the output format:

1. If '--omit-fastq-read-numbers' is specified, then trailing /1 and /2 for R1 and R2 respectively, will not be

appended to e FASTQ read name. By default they will be appended.

2. If '--include-sample-barcodes-in-fastq' is specified, then sample barcode will replace the last field in the first

comment in the FASTQ header, e.g. replace 'NNNNNN' in the header '@Instrument:RunID:FlowCellID:Lane:Tile:X:Y

1:N:0:NNNNNN'

3. If '--illumina-file-names' is specified, the output files will be named according to the Illumina FASTQ file

naming conventions:

a. The file extension will be '_R1_001.fastq.gz' for read one, and '_R2_001.fastq.gz' for read two (if paired end). b.

The per-sample output prefix will be '<SampleName>_S<SampleOrdinal>_L<LaneNumber>' (without angle brackets).

Options (1) and (2) require the input FASTQ read names to contain the following elements:

'@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index>'

See the Illumina FASTQ conventions for more details.

(https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/FASTQFiles_Intro_swBS.htm)

The '--illumina-standards' option may not be specified with the three options above. Use this option if you intend to

upload to Illumina BaseSpace. This option implies:

'--omit-fastq-read-numbers=true --include-sample-barcodes-in-fastq=false --illumina-file-names=true'

See the Illumina Basespace standards described here

(https://help.basespace.illumina.com/articles/tutorials/upload-data-using-web-uploader/).

To output with recent Illumina conventions (circa 2021) that match 'bcl2fastq' and 'BCLconvert', use:

'--omit-fastq-read-numbers=true --include-sample-barcodes-in-fastq=true --illumina-file-names=true'

By default all input reads are output. If your input FASTQs contain reads that do not pass filter (as defined by the

Y/N filter flag in the FASTQ comment) these can be filtered out during demultiplexing using the '--omit-failing-reads'

option.

To output only reads that are not control reads, as encoded in the '<control number>' field in the header comment, use

the '--omit-control-reads' flag

DemuxFastqs Arguments:

------------------------------------------------------------------------------------------------------------------------

-i PathToFastq+, --inputs=PathToFastq+

One or more input fastq files each corresponding to a sub-read (ex. index-read, read-one,

read-two, fragment).

-o DirPath, --output=DirPath The output directory in which to place sample BAMs.

-x FilePath, --metadata=FilePath

A file containing the metadata about the samples.

-r ReadStructure+, --read-structures=ReadStructure+

The read structure for each of the FASTQs.

-h true|false, --help=true|false

Display the help message. [Default: false].

--version=true|false Display the version number for this tool. [Default: false].

-m FilePath, --metrics=FilePath

The file to which per-barcode metrics are written. If none given, a file named

'demux_barcode_metrics.txt' will be written to the output directory. [Optional].

-c String, --column-for-sample-barcode=String

The column name in the sample sheet or metadata CSV for the sample barcode.

[Default: Sample_Barcode].

-u String, --unmatched=String Output BAM file name for the unmatched records. [Default: unmatched.bam].

-q QualityEncoding, --quality-format=QualityEncoding

A value describing how the quality values are encoded in the FASTQ. Either Solexa for

pre-pipeline 1.3 style scores (solexa scaling + 66), Illumina for pipeline 1.3 and above

(phred scaling + 64) or Standard for phred scaled scores with a character shift of 33. If

this value is not specified, the quality format will be detected automatically.

[Optional]. Options: Solexa, Illumina, Standard.

-t Int, --threads=Int The number of threads to use while de-multiplexing. The performance does not increase

linearly with the # of threads and seems not to improve beyond 2-4 threads.

[Default: 1].

--max-mismatches=Int Maximum mismatches for a barcode to be considered a match. [Default: 1].

--min-mismatch-delta=Int Minimum difference between number of mismatches in the best and second best barcodes for

a barcode to be considered a match. [Default: 2].

--max-no-calls=Int Maximum allowable number of no-calls in a barcode read before it is considered

unmatchable. [Default: 2].

--sort-order=SortOrder The sort order for the output sam/bam file (typically unsorted or queryname).

[Default: queryname]. Options: unsorted, queryname, coordinate, duplicate,

unknown.

--umi-tag=String The SAM tag for any molecular barcode. If multiple molecular barcodes are specified, they

will be concatenated and stored here. [Default: RX].

--platform-unit=String The platform unit (typically '<flowcell-barcode>-<sample-barcode>.<lane>')

[Optional].

--sequencing-center=String The sequencing center from which the data originated [Optional].

--predicted-insert-size=Integer

Predicted median insert size, to insert into the read group header [Optional].

--platform-model=String Platform model to insert into the group header (ex. miseq, hiseq2500, hiseqX)

[Optional].

--platform=String Platform to insert into the read group header of BAMs (e.g Illumina) [Default:

Illumina].

--comments=String* Comment(s) to include in the merged output file's header. [Optional].

--run-date=Iso8601Date Date the run was produced, to insert into the read group header [Optional].

--output-type=OutputType The type of outputs to produce. [Optional]. Options: Fastq, Bam,

BamAndFastq.

--include-all-bases-in-fastqs=true|false

Output all bases (i.e. all sample barcode, molecular barcode, skipped, and template

bases) for every read with template bases (ex. read one and read two) as defined by the

corresponding read structure(s). [Default: false].

--illumina-standards=true|false

Output FASTQs according to Illumina BaseSpace Sequence Hub naming standards. This is

differfent than Illumina naming standards. [Default: false]. Cannot be

used in conjunction with argument(s): includeSampleBarcodesInFastq, omitFastqReadNumbers,

illuminaFileNames

--omit-fastq-read-numbers=true|false

Do not include trailing /1 or /2 for R1 and R2 in the FASTQ read name. [Default:

false]. Cannot be used in conjunction with argument(s): illuminaStandards

--include-sample-barcodes-in-fastq=true|false

Insert the sample barcode into the FASTQ header. [Default: false].

Cannot be used in conjunction with argument(s): illuminaStandards

--illumina-file-names=true|false

Name the output files according to the Illumina file name standards. [Default:

false]. Cannot be used in conjunction with argument(s): illuminaStandards

--omit-failing-reads=true|false

Keep only passing filter reads if true, otherwise keep all reads. Passing filter reads

are determined from the comment in the FASTQ header. [Default: false].

--omit-control-reads=true|false

Do not keep reads identified as control if true, otherwise keep all reads. Control reads

are determined from the comment in the FASTQ header. [Default: false].

--mask-bases-below-quality=Int Mask bases with a quality score below the specified threshold as Ns [Default:

0].

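Usage

Running the tool requires a sample sheet in CSV format that lists the index (barcode) information, as well as read structure information. A read structure is a string that describes how the bases of a sequencing read are allocated, written as a series of <number><operator> pairs. Optionally, the last segment of the string may use + in place of a number for its length. The + stands for all bases that remain after the other segments have been processed (that is, [0..infinity]).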

https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures

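Four kinds of operators are supported.

T or Template: the bases in the segment are template sequence (genomic DNA, RNA, etc.)
B or Sample Barcode: the bases in the segment are an index sequence that identifies the sample being sequenced
M or Molecular Barcode: the bases in the segment are an index sequence that identifies the unique source molecule (e.g. a UMI)
S or Skip: the bases in the segment are skipped or ignored, for example monotemplate sequence generated during library preparation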

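General rules

The length of each segment must be a positive integer >= 1 (or +).
Only the last segment of a read structure may use + for its length.
Adjacent segments may use the same operator. For example, 6B6B+T is permitted when two sample indexes are ligated onto the molecule separately so that they end up adjacent.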
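
The rules above can be sketched in a few lines of Python. This is only an illustration of the <number><operator> syntax as described in this post, not fgbio's own parser; the function names `parse_read_structure` and `split_bases` are hypothetical.

```python
import re
from typing import List, Optional, Tuple

# One <number><operator> pair; '+' may stand in for the number.
_SEGMENT = re.compile(r"(\d+|\+)([TBMS])")


def parse_read_structure(rs: str) -> List[Tuple[Optional[int], str]]:
    """Split a read structure into (length, operator) pairs.

    A length of None represents '+', i.e. all remaining bases.
    """
    segments = []
    pos = 0
    for m in _SEGMENT.finditer(rs):
        if m.start() != pos:  # characters that are not part of any pair
            raise ValueError(f"invalid read structure: {rs!r}")
        pos = m.end()
        length = None if m.group(1) == "+" else int(m.group(1))
        if length is not None and length < 1:
            raise ValueError("segment lengths must be >= 1")
        segments.append((length, m.group(2)))
    if not segments or pos != len(rs):
        raise ValueError(f"invalid read structure: {rs!r}")
    if any(length is None for length, _ in segments[:-1]):
        raise ValueError("only the last segment may use '+'")
    return segments


def split_bases(rs: str, bases: str) -> List[Tuple[str, str]]:
    """Cut a read into (operator, bases) pieces according to a read structure."""
    pieces = []
    offset = 0
    for length, op in parse_read_structure(rs):
        end = len(bases) if length is None else offset + length
        if end > len(bases):
            raise ValueError("read is shorter than the read structure requires")
        pieces.append((op, bases[offset:end]))
        offset = end
    return pieces


print(parse_read_structure("8B92T"))
print(split_bases("3M2S+T", "AAACCGGGG"))
```

Returning None for a + length keeps "all remaining bases" distinct from any real length, so a caller can resolve it against the actual read.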

Four examples of read structures, each described in two different ways.

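SampleSheet

It consists of four or five columns.

1. Sample_ID: The sample identifier, unique to the sample within the sample sheet.
2. Sample_Name: The sample name.
3. Library_ID: The library identifier. The combination of sample name and library identifier must be unique across the samples in the sample sheet.
(Optional) Description: The description of the sample, which is placed in the description field of the read group in the output BAM. This column may be omitted.
4. Sample_Barcode: The sample barcode bases unique to each sample. The name of the column containing the sample barcode can be changed with the --column-for-sample-barcode option. If the sample barcode is spread across multiple reads (e.g. dual-index, or inline in both reads of a pair), the barcode bases from each read must be concatenated in the same order as the FASTQs and read structures given to this tool.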
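
As a rough illustration of the concatenation rule for Sample_Barcode, the sketch below writes a simple metadata CSV for a run whose barcode bases come from three reads (inline in read one, then the two index reads). The sample names and barcode sequences are made up, and the helper `metadata_csv` is hypothetical; only the column names come from the description above.

```python
import csv
import io

# Hypothetical samples. "barcodes" lists the expected barcode bases per
# barcode-bearing read, in the same order as the FASTQs passed to -i
# (here: inline barcode in r1, then i1, then i2).
samples = [
    {"Sample_ID": "S1", "Sample_Name": "sampleA", "Library_ID": "libA",
     "barcodes": ["AAAACCCC", "ACGTACGT", "TTGGCCAA"]},
    {"Sample_ID": "S2", "Sample_Name": "sampleB", "Library_ID": "libB",
     "barcodes": ["GGGGTTTT", "TGCATGCA", "CCAATTGG"]},
]

COLUMNS = ["Sample_ID", "Sample_Name", "Library_ID", "Sample_Barcode"]


def metadata_csv(rows) -> str:
    """Render the metadata CSV text, joining each sample's barcodes in read order."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({
            "Sample_ID": row["Sample_ID"],
            "Sample_Name": row["Sample_Name"],
            "Library_ID": row["Library_ID"],
            # Concatenate without a delimiter, r1 barcode first.
            "Sample_Barcode": "".join(row["barcodes"]),
        })
    return buf.getvalue()


text = metadata_csv(samples)
print(text, end="")
```

With read structures 8B92T 8B 8B 100T, each Sample_Barcode value here is 24 bases long (8 + 8 + 8).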

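Specify the FASTQs to demultiplex with -i, the read structures (see the Read Structures link above) with --read-structures, the sample sheet CSV file with -x, the number of threads with -t, and the output directory with -o (created if it does not exist).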

fgbio DemuxFastqs -i r1.fq i1.fq i2.fq r2.fq --read-structures 8B92T 8B 8B 100T \
-x SampleSheet.csv -o outdir
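
When the command is assembled from a script, it may help to check that there is exactly one read structure per FASTQ before launching fgbio. The following is a minimal sketch: the helper `demux_command` is hypothetical, it only builds the argument list (it does not run fgbio), and it uses only options that appear in the help output above.

```python
import shlex
from typing import List, Optional


def demux_command(fastqs: List[str], read_structures: List[str],
                  metadata: str, outdir: str, threads: int = 1,
                  output_type: Optional[str] = None) -> List[str]:
    """Build an argument list for 'fgbio DemuxFastqs' (nothing is executed)."""
    if len(fastqs) != len(read_structures):
        raise ValueError("need exactly one read structure per FASTQ")
    cmd = ["fgbio", "DemuxFastqs",
           "-i", *fastqs,
           "--read-structures", *read_structures,
           "-x", metadata,
           "-o", outdir,
           "-t", str(threads)]
    if output_type is not None:
        # One of Fastq, Bam, BamAndFastq according to the help text.
        cmd += ["--output-type", output_type]
    return cmd


cmd = demux_command(
    ["r1.fq", "i1.fq", "i2.fq", "r2.fq"],
    ["8B92T", "8B", "8B", "100T"],
    metadata="SampleSheet.csv",
    outdir="outdir",
    threads=2,
    output_type="Fastq",
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where fgbio is installed.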

-i One or more input fastq files each corresponding to a sub-read (ex. index-read, read-one, read-two, fragment).
-o The output directory in which to place sample BAMs.
-x A file containing the metadata about the samples.
-m The file to which per-barcode metrics are written. If none given, a file named 'demux_barcode_metrics.txt' will be written to the output directory. [Optional].
--compression=Int Default GZIP compression level, BAM compression level. [Default: 5].
--tmp-dir=DirPath Directory to use for temporary files.
--max-mismatches=Int Maximum mismatches for a barcode to be considered a match. [Default: 1].
--min-mismatch-delta=Int Minimum difference between number of mismatches in the best and second best barcodes for a barcode to be considered a match. [Default: 2].
--max-no-calls=Int Maximum allowable number of no-calls in a barcode read before it is considered unmatchable. [Default: 2].
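
To get a feel for how --max-mismatches, --min-mismatch-delta and --max-no-calls interact, here is a small sketch of the matching rule as the option descriptions state it. It is an assumption-laden illustration, not fgbio's implementation: in particular, it counts an N in the observed barcode as an ordinary mismatch, which may differ from what fgbio does, and the function names are hypothetical.

```python
from typing import Dict, Optional


def count_mismatches(observed: str, expected: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    return sum(1 for o, e in zip(observed, expected) if o != e)


def assign_sample(observed: str, expected_by_sample: Dict[str, str],
                  max_mismatches: int = 1, min_mismatch_delta: int = 2,
                  max_no_calls: int = 2) -> Optional[str]:
    """Return the matching sample name, or None for 'unmatched'."""
    if not expected_by_sample:
        return None
    # Too many no-calls: the barcode is treated as unmatchable.
    if observed.upper().count("N") > max_no_calls:
        return None
    ranked = sorted(
        (count_mismatches(observed, barcode), name)
        for name, barcode in expected_by_sample.items()
    )
    best_mm, best_name = ranked[0]
    # With a single sample there is no second-best barcode to compete.
    second_mm = ranked[1][0] if len(ranked) > 1 else float("inf")
    if best_mm <= max_mismatches and second_mm - best_mm >= min_mismatch_delta:
        return best_name
    return None


expected = {"sampleA": "AAAAAAAA", "sampleB": "CCCCCCCC"}
print(assign_sample("AAAAAAAT", expected))  # one mismatch, clear winner
print(assign_sample("AAAAAATT", expected))  # two mismatches, over the limit
```

The delta check is what rejects an observed barcode that is nearly as close to a second sample as to the best one, even when the best match itself is within --max-mismatches.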

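The output directory receives the per-sample demultiplexed FASTQ files, a FASTQ of the reads that could not be assigned to any sample, and a text file of summary statistics. The --output-type option selects between BAM, gzipped FASTQ, or both.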

Citation

A Universal Analysis Pipeline for Hybrid Capture-Based Targeted Sequencing Data with Unique Molecular Indexes
Min-Jung Kim, Si-Cho Kim, and Young-Joon Kim

Genomics Inform. 2018 Dec; 16(4): e29