The latf-load command - macでインフォマティクス

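When you upload sequencing data to the DDBJ file server and apply for DRA registration, an automatic validation step runs after you enter the submission information in DRA on D-way. In this step, the sequencing data appear to be loaded from the file reception server with an SRA Toolkit program called "latf-load" (as of May 2023); you can confirm this in the log file of the DRA submission.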

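While reading the files, latf-load also checks whether the type of sequencing data matches what was declared in the submission. If a FASTQ file that does not match the submission is loaded, an error is written to the log that can be downloaded from DRA (similar commands apparently include "fastq-load", which uses XML, "illumina-load", "abi-load" (for SOLiD? *1), and several others). Because this command is part of sra-tools, you can check your sequencing data on a local machine beforehand and see whether the DRA automatic validation is likely to pass *2. Let's try it briefly.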

Installation

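SRA Toolkit can be installed with conda, but the conda build of v3.0.5 does not include this command (v2.1.1 did include it, so there may be a reason it was dropped). Prebuilt binaries for each platform can be downloaded instead. I downloaded the binary package and added it to my PATH (Ubuntu 18 LTS).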

GitHub

#Ubuntu
wget https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.5/sratoolkit.3.0.5-ubuntu64.tar.gz
tar -xvf sratoolkit.3.0.5-ubuntu64.tar.gz
cd sratoolkit.3.0.5-ubuntu64/bin/
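
Since some builds apparently ship without latf-load, it can be worth confirming that the loader binaries are really present before going further. The following is a minimal sketch, not part of SRA Toolkit; the function name `find_missing_tools` is hypothetical, and the directory is assumed to be the `bin/` folder extracted above.

```python
import shutil


def find_missing_tools(tool_names, bin_dir):
    """Return the names that are not found as executables in bin_dir."""
    missing = []
    for name in tool_names:
        # With an explicit path argument, shutil.which searches only
        # that directory instead of the PATH environment variable.
        if shutil.which(name, path=bin_dir) is None:
            missing.append(name)
    return missing
```

For example, `find_missing_tools(["latf-load", "fastq-load"], "sratoolkit.3.0.5-ubuntu64/bin")` should return an empty list if both commands were unpacked.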

> ./fastq-load -h

Usage:
  fastq-load [options] -r run.xml -e experiment.xml -o output-path

  -r|--run-xml               path to run.xml describing input files
  -e|--experiment            path to experiment.xml
  -o|--output-path           target location

Options:
  -i|--input-path            input files location, default '.'
  -u|--input-unpacked        input files are unpacked
  -t|--input-no-threads      disable input files threaded caching
  -f|--force                 force target overwrite
  -n|--spots-number          process only given number of spots from
                             input
  -bE|--bad-spot-number      acceptable number of spot creation errors,
                             default is 50
  -p|--bad-spot-percentage   acceptable percentage of spots creation
                             errors, default is 5
  -x|--expected              path to expected.xml
  -s|--intensities [on off]  load intensity data, default is
                             off. For Illumina: signal, intensity,
                             noise; AB SOLiD: signal(s); LS454:
                             signal, position (for SFF files this option
                             is ON by default).
  -z|--xml-log <logfile>     Produce XML-formatted log file.
  -h|--help                  Output brief explanation for the program.
  -V|--version               Display the version of the program then
                             quit.
  -L|--log-level <level>     Logging level as number or enum string. One
                             of (fatal|sys|int|err|warn|info|debug) or
                             (0-6) Current/default is warn.
  -v|--verbose               Increase the verbosity of the program
                             status messages. Use multiple times for more
                             verbosity. Negates quiet.
  -q|--quiet                 Turn off all status messages for the
                             program. Negated by verbose.
  --option-file <file>       Read more options and parameters from the
                             file.

./fastq-load : 3.0.5

> ./latf-load -h

Usage:
  latf-load [options] <fastq-file> ...

Summary:
  Load FASTQ formatted data files

Example:
  latf-load -p 454 -o /tmp/SRZ123456 123456-1.fastq 123456-2.fastq

  -o|--output <path>                  Path and Name of the output database.
  -q|--quality                        Quality encoding (PHRED_33, PHRED_64,
                                      LOGODDS)

Options:
  -t|--tmpfs <path-to-file>           Path to be used for scratch files.
  -Q|--qual-quant <phred-score>       Quality scores quantization level, can be
                                      number (0: none default, 1: 2bit, 2:
                                      1bit), or string like
                                      '1:10,10:20,20:30,30:-' (which is
                                      equivalent to 1).
  --cache-size <mbytes>               Set the cache size in MB for the temporary
                                      indices
  --max-rec-count <count>             Set the maximum number of records to
                                      process from the FASTQ file
  -E|--max-err-count <count>          Set the maximum number of errors to ignore
                                      from the FASTQ file
  -p|--platform                       Platform (ILLUMINA, LS454, SOLID,
                                      COMPLETE_GENOMICS, HELICOS, PACBIO,
                                      IONTORRENT, CAPILLARY)
  --max-err-pct                       acceptable percentage of spots creation
                                      errors, default is 5
  --ignore-illumina-tags              ignore barcodes contained in
                                      Illumina-formatted names
  --no-readnames                      drop original read names
  -a|--allow_duplicates               allow duplicate read names in the same file
  -1|--read1PairFiles <path-to-file>  Default read number for this file is 1.
                                      Processing will be interleaved with the
                                      file specified in --read2PairFile|-r2
  -2|--read2PairFiles <path-to-file>  Default read number for this file is 2.
                                      Processing will be interleaved with the
                                      file specified in --read1PairFile|-r1
  -z|--xml-log <logfile>              Produce XML-formatted log file.
  -h|--help                           Output brief explanation for the program.
  -V|--version                        Display the version of the program then
                                      quit.
  -L|--log-level <level>              Logging level as number or enum string. One
                                      of (fatal|sys|int|err|warn|info|debug) or
                                      (0-6) Current/default is warn.
  -v|--verbose                        Increase the verbosity of the program
                                      status messages. Use multiple times for more
                                      verbosity. Negates quiet.
  -q|--quiet                          Turn off all status messages for the
                                      program. Negated by verbose.
  --option-file <file>                Read more options and parameters from the
                                      file.

./latf-load : 3.0.5

Usage

Pass the FASTQ files as arguments. Here I assume paired-end FASTQ files sequenced on an Illumina NovaSeq 6000. Also specify the sequencing platform and the quality encoding type.

 ./latf-load -p ILLUMINA -o test -q PHRED_33 sample1_L01_1_1.fq.gz sample1_L01_1_2.fq.gz

-o Path and Name of the output database.
-q Quality encoding (PHRED_33, PHRED_64, LOGODDS)
-p Platform (ILLUMINA, LS454, SOLID, COMPLETE_GENOMICS, HELICOS, PACBIO, IONTORRENT, CAPILLARY)
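
When several samples need the same check, the invocation above can be wrapped in a small script. This is only a sketch under stated assumptions: the helper names `build_latf_load_cmd` and `run_latf_load` are hypothetical, the options are exactly the `-p`, `-o`, and `-q` flags shown above, and I have not verified how latf-load reports failures, so the wrapper returns both the exit status and stderr and leaves the interpretation to the caller.

```python
import subprocess


def build_latf_load_cmd(fastq_files, output, platform="ILLUMINA",
                        quality="PHRED_33", exe="./latf-load"):
    """Assemble the argument list for the latf-load call shown above."""
    if not fastq_files:
        raise ValueError("at least one FASTQ file is required")
    # Same order as the manual command: -p, -o, -q, then the FASTQ files.
    return [exe, "-p", platform, "-o", output, "-q", quality, *fastq_files]


def run_latf_load(cmd):
    """Run the command and return (exit status, captured stderr text)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode, result.stderr
```

Calling `run_latf_load(build_latf_load_cmd(["sample1_L01_1_1.fq.gz", "sample1_L01_1_2.fq.gz"], "test"))` reproduces the command above; inspect both returned values rather than relying on only one of them.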

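If this command reports an error, the data may not match the submission (the platform-dependent format given with -p and -q; for DRA, what was entered in the web form). The error messages are not very intuitive, however. For example, ":syntax error, unexpected fqENDLINE" does not necessarily mean that the sequenced bases on line 2 of a FASTQ record are wrong; the format of the ID line (line 1) may be the problem instead.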

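I ran into this error during validation. The FASTQ files I was submitting had been preprocessed, and I concluded that the cause was that the IDs no longer followed the standard format. Normally you would register raw, unprocessed reads in SRA/ENA/DRA, but these were metagenomic data that could contain human genome reads, so I uploaded filtered reads to guard against the risk of individual- or region-specific polymorphism profiles being exploited. That filtering changed the FASTQ IDs from the standard format, which is presumably why validation raised ":syntax error, unexpected fqENDLINE". I wrote a script to remove the extra information from the FASTQ IDs, confirmed that latf-load ran locally without errors, and then uploaded the files.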
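
The script itself is not shown in this post, so the following is only a hypothetical sketch of that kind of check and cleanup. It assumes the standard Illumina layout (seven colon-separated fields, optionally followed by a space and a `read:filtered:control:index` comment) and assumes the preprocessing appended extra whitespace-separated tokens to the ID line; if your tool altered the ID differently, the cleanup rule must be adapted. All function names are made up for illustration, and latf-load itself may accept other name styles.

```python
import gzip
import re

# Assumed Illumina layout: seven colon-separated fields in the read name,
# optionally followed by a "read:filtered:control:index" comment.
NAME_RE = re.compile(r"@[A-Za-z0-9_-]+:\d+:[A-Za-z0-9_-]+:\d+:\d+:\d+:\d+")
COMMENT_RE = re.compile(r"[12]:[YN]:\d+:\S*")


def is_standard_header(line):
    """True if the ID line is a read name plus at most one standard comment."""
    parts = line.rstrip("\r\n").split(" ")
    if not NAME_RE.fullmatch(parts[0]):
        return False
    if len(parts) == 1:
        return True
    return len(parts) == 2 and COMMENT_RE.fullmatch(parts[1]) is not None


def strip_extra_tokens(line):
    """Keep the read name and a standard comment; drop any later tokens."""
    parts = line.rstrip("\r\n").split()
    if not parts:
        return ""
    kept = parts[:1]
    if len(parts) > 1 and COMMENT_RE.fullmatch(parts[1]):
        kept.append(parts[1])
    return " ".join(kept)


def nonstandard_headers(path, limit=10):
    """Return up to `limit` (line number, header) pairs that look non-standard.

    Assumes strict four-line FASTQ records (no wrapped sequence lines).
    """
    opener = gzip.open if path.endswith(".gz") else open
    found = []
    with opener(path, "rt") as handle:
        for index, line in enumerate(handle):
            if index % 4 == 0 and not is_standard_header(line):
                found.append((index + 1, line.rstrip("\r\n")))
                if len(found) >= limit:
                    break
    return found
```

Running `nonstandard_headers` on a file before calling latf-load gives the line numbers of suspicious ID lines, which is easier to act on than the parser message alone.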

For reference, the ID line of a NovaSeq 6000 FASTQ file looks like the following *3. The colons are field separators.

@AXXXXX:7:XXXXXXX2:2:1101:26539:1110 1:N:0:1

From left to right: