Using fastp's batch-processing script - macでインフォマティクス

Update, 2025/09/19

fastp version 1.0 has been released, and it now ships with a handy script that batch-processes all the FASTQ files in a folder. This post walks through how to use that script.

Installation

A recent fastp must be on your PATH. The script does not work with fastp versions below 1.
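The version requirement can be checked from Python before running the script. This is a minimal sketch, not part of fastp: the helper names `parse_fastp_version` and `fastp_is_v1_or_later` are hypothetical, and it assumes `fastp --version` prints a line such as `fastp 1.0.1`. Both stdout and stderr are read so the check does not depend on which stream fastp uses.

```python
import re
import shutil
import subprocess


def parse_fastp_version(text):
    """Extract (major, minor, patch) from text such as 'fastp 1.0.1'."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in {text!r}")
    return tuple(int(part) for part in match.groups())


def fastp_is_v1_or_later(command="fastp"):
    """Return True if the given fastp command reports major version >= 1."""
    if shutil.which(command) is None:
        return False  # the command is not on PATH at all
    result = subprocess.run([command, "--version"], capture_output=True, text=True)
    # Assumption: the version line may appear on either stream, so check both.
    return parse_fastp_version(result.stdout + result.stderr)[0] >= 1


if __name__ == "__main__":
    print("fastp >= 1 available:", fastp_is_v1_or_later())
```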

# bioconda: pin a 1.x release (check the bioconda page for the latest version)
mamba install -c bioconda -y fastp==1.0.1
# alternatively, download the latest prebuilt binary
wget http://opengene.org/fastp/fastp
chmod a+x ./fastp

# parallel.py is not installed by conda; get it from the repository
git clone https://github.com/OpenGene/fastp.git
# make it executable and move it to a directory on your PATH
chmod +x fastp/parallel.py
mv fastp/parallel.py /usr/local/bin/ # (or ~/bin, etc.)

> python parallel.py -h

Usage: A python script to use fastp to preprocess all FASTQ files within a folder

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -i INPUT_DIR, --input_dir=INPUT_DIR
                        the folder contains the FASTQ files to be
                        preprocessed, by default is current dir (.)
  -o OUT_DIR, --out_dir=OUT_DIR
                        the folder to store the clean FASTQ. If not specified,
                        then there will be no output files.
  -r REPORT_DIR, --report_dir=REPORT_DIR
                        the folder to store QC reports. If not specified, use
                        out_dir if out_dir is specified, otherwise use
                        input_dir.
  -c COMMAND, --command=COMMAND
                        the path to fastp command, if not specified, then it
                        will use 'fastp' in PATH
  -a ARGS, --args=ARGS  the arguments that will be passed to fastp. Enclose in
                        quotation marks. Like --args='-f 3 -t 3'
  -p PARALLEL, --parallel=PARALLEL
                        the number of fastp processes can be run in parallel,
                        if not specified, then it will be CPU_Core/4
  -1 READ1_FLAG, --read1_flag=READ1_FLAG
                        specify the name flag of read1, default is R1, which
                        means a file with name *R1* is read1 file
  -2 READ2_FLAG, --read2_flag=READ2_FLAG
                        specify the name flag of read2, default is R2, which
                        means a file with name *R2* is read2 file

How to run

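Specify the directory containing the FASTQ (fastq.gz) files, the output directory, and the directory where the HTML and JSON reports are saved. fastp's own options are passed inside the quotes of -a ' '. For example, -f 3 -t 2 force-trims the first 3 bp and the last 2 bp of each read.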

python parallel.py -i input_fastq_dir -o fq_outdir -r report_outdir -a '-f 3 -t 2'

-i the folder contains the FASTQ files to be preprocessed, by default is current dir (.)
-o the folder to store the clean FASTQ. If not specified, then there will be no output files.
-r the folder to store QC reports. If not specified, use out_dir if out_dir is specified, otherwise use input_dir.
-a the arguments that will be passed to fastp. Enclose in quotation marks. Like --args='-f 3 -t 3'
-p the number of fastp processes can be run in parallel, if not specified, then it will be CPU_Core/4
-1 specify the name flag of read1, default is R1, which means a file with name *R1* is read1 file
-2 specify the name flag of read2, default is R2, which means a file with name *R2* is read2 file
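When the script is launched from another program, the main thing to get right is that everything after -a stays a single argument. The sketch below builds the same command as above as an argument list; `build_parallel_command` is a hypothetical helper name, and `python` and `parallel.py` are assumed to be resolvable from the current directory or PATH.

```python
import shlex


def build_parallel_command(input_dir, out_dir, report_dir, fastp_args,
                           script="parallel.py"):
    """Return the argument list for one parallel.py run."""
    return [
        "python", script,
        "-i", input_dir,
        "-o", out_dir,
        "-r", report_dir,
        "-a", fastp_args,  # kept as one token so parallel.py receives it intact
    ]


command = build_parallel_command(
    "input_fastq_dir", "fq_outdir", "report_outdir", "-f 3 -t 2"
)
# shlex.join shows the shell-quoted form of the list (Python 3.8+).
print(shlex.join(command))
# To actually run it: subprocess.run(command, check=True)
```

Passing a list to `subprocess.run` avoids a second round of shell quoting, which is where a space-containing value like '-f 3 -t 2' most easily gets split by mistake.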

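Results are saved in the specified folders. For paired-end data, read1 and read2 files are recognized by R1 and R2 in their file names by default. Use the -1 and -2 options to change these flags.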
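To see which files would be treated as pairs under a given pair of flags, the matching can be approximated as follows. This is only an illustrative sketch and not necessarily how parallel.py does it internally: `pair_fastq_files` is a hypothetical helper that swaps the last occurrence of the read1 flag for the read2 flag and checks whether the resulting name exists.

```python
def pair_fastq_files(names, read1_flag="R1", read2_flag="R2"):
    """Split file names into (read1, read2) pairs and leftover unpaired files."""
    available = set(names)
    pairs = []
    used = set()
    for name in sorted(available):
        if read1_flag not in name:
            continue
        # Swap the last occurrence of the read1 flag for the read2 flag.
        head, _, tail = name.rpartition(read1_flag)
        mate = head + read2_flag + tail
        if mate in available and mate != name:
            pairs.append((name, mate))
            used.update((name, mate))
    unpaired = sorted(available - used)
    return pairs, unpaired


files = [
    "A_R1.fastq.gz", "A_R2.fastq.gz",
    "B_R1.fastq.gz", "B_R2.fastq.gz",
    "C.fastq.gz",
]
pairs, unpaired = pair_fastq_files(files, "_R1", "_R2")
print(pairs)
print(unpaired)
```

A dry run like this on the real file names is a cheap way to confirm that the flags chosen for -1 and -2 match every sample before starting a long batch.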

parallel.py \
  -i ./ -o fastp_processed -r report -p 6 \
  -1 "_R1" -2 "_R2" \
  -a "-q 20 -u 30 -n 10 -l 20 --correction -w 3"

-c, --correction enable base correction in overlapped regions (only for PE data), default is disabled
-u, --unqualified_percent_limit how many percents of bases are allowed to be unqualified (0~100). Default 40 means 40% (int [=40])
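In the command above, -p 6 runs six fastp processes at once and -w 3 inside -a sets the worker threads of each process, so roughly 6 x 3 = 18 threads can be busy at the same time. The sketch below does that arithmetic. The helper names are hypothetical, the fallback of 3 threads assumes fastp's documented default for -w, and `default_parallel` assumes the "CPU_Core/4" default is an integer division with a floor of 1, which the help text does not state.

```python
import shlex


def threads_per_process(fastp_args, default=3):
    """Read the -w/--thread value from a fastp argument string."""
    tokens = shlex.split(fastp_args)
    for index, token in enumerate(tokens):
        if token in ("-w", "--thread") and index + 1 < len(tokens):
            return int(tokens[index + 1])
    return default  # assumption: fastp's own default when -w is absent


def total_threads(parallel, fastp_args):
    """Upper bound on worker threads: processes x threads per process."""
    return parallel * threads_per_process(fastp_args)


def default_parallel(cpu_cores):
    """Assumed reading of 'CPU_Core/4': integer division, at least 1."""
    return max(1, cpu_cores // 4)


print(total_threads(6, "-q 20 -u 30 -n 10 -l 20 --correction -w 3"))
```

If the product is larger than the number of cores on the machine, lowering either -p or -w is the obvious adjustment.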

Other notes