Splitting a multi-FASTA file - macでインフォマティクス

Updated 2018-10-26

2019-10-28: installation instructions added

Updated 2020-04-29

partition.sh from BBTools makes it easy to split a multi-FASTA file.

Installation

conda install -c bioconda -y bbmap

> partition.sh -h

$ partition.sh -h

Written by Brian Bushnell

Last modified April 17, 2018

Description: Splits a sequence file evenly into multiple files.

Usage: partition.sh in=<file> in2=<file2> out=<outfile> out2=<outfile2> ways=<number>

in2 and out2 are for paired reads and are optional.

If input is paired and out2 is not specified, data will be written interleaved.

Output filenames MUST contain a '%' symbol. This will be replaced by a number.

Parameters and their defaults:

in=<file> Input file.

out=<file> Output file pattern.

ways=-1 The number of output files to create; must be positive.

ow=f (overwrite) Overwrites files that already exist.

app=f (append) Append to files that already exist.

zl=4 (ziplevel) Set compression level, 1 (low) to 9 (max).

int=f (interleaved) Determines whether INPUT file is considered interleaved.

Java Parameters:

-Xmx This will set Java's memory usage, overriding autodetection.

-Xmx20g will specify 20 gigs of RAM, and -Xmx200m will specify 200 megs.

The max is typically 85% of physical memory.

-eoom This flag will cause the process to exit if an out-of-memory exception occurs. Requires Java 8u92+.

-da Disable assertions.

Please contact Brian Bushnell at bbushnell@lbl.gov if you encounter any problems.

Usage

To split a FASTA file containing five chromosomes, run the following command.

partition.sh in=input.fasta out=chromosome%.fasta ways=5

For large genomes such as the human genome, add an option such as -Xmx20G (sets Java's memory usage to 20 GB).

partition.sh -Xmx20G in=hs37d5.fa out=chromosome%.fasta ways=86

Alternatively, use seqretsplit from EMBOSS.

seqretsplit input_multi.fasta out

# multiple input files
seqretsplit input* out

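In partition.sh, ways sets the number of output files. If it is larger than the number of FASTA records, the extra files are written empty. If you do not know how many records the file contains, count them with grep first.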

grep -c "^>" input.fasta
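The counting step can be scripted so that the record count feeds straight into ways=. The following is a minimal sketch, not a tested pipeline: the three-record demo file and the variable names workdir and n_records are made up for illustration, and the partition.sh command is only printed so that the snippet runs even without BBTools installed.

```shell
set -eu

# Hypothetical demo input: three tiny records in a temporary directory.
workdir=$(mktemp -d)
cat > "$workdir/input.fasta" <<'EOF'
>chr1
ACGTACGT
>chr2
GGCCGGCC
>chr3
TTAATTAA
EOF

# Count header lines. Anchoring on ^> only matches '>' at the start of a line.
n_records=$(grep -c '^>' "$workdir/input.fasta")
echo "records: $n_records"

# Print the partition.sh call that would give one output file per record.
echo "partition.sh in=input.fasta out=chromosome%.fasta ways=$n_records"
```

Replacing the final echo with the command itself would run the split, assuming partition.sh is on the PATH.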

On Linux, GNU csplit can also split the file into one file per record.

csplit -z input.fasta '/>/' '{*}'
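As far as I know, the csplit shipped with macOS is the BSD version and does not accept -z or '{*}'. A portable alternative is a short awk script, sketched here under assumptions: the demo file, the workdir variable and the seq_001.fasta naming scheme are made up for illustration and are not part of any of the tools above.

```shell
set -eu

# Hypothetical demo input: two records, the first spanning two sequence lines.
workdir=$(mktemp -d)
cat > "$workdir/input.fasta" <<'EOF'
>seqA
ACGT
ACGT
>seqB
GGCC
EOF

# Start a new output file at every header line and write each line to the
# file that is currently open. Lines before the first header are skipped.
awk -v dir="$workdir" '
    /^>/ {
        if (out != "") close(out)
        n++
        out = sprintf("%s/seq_%03d.fasta", dir, n)
    }
    out != "" { print > out }
' "$workdir/input.fasta"

n_files=$(ls "$workdir"/seq_*.fasta | wc -l | tr -d ' ')
echo "wrote $n_files files"
```

Closing each file before opening the next keeps the number of open file handles at one, which matters when the input holds thousands of records.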

To split a multi-FASTA file that contains many sequences, seqkit can be used.

Split evenly into 10 parts.

seqkit split -p 10 sequences.fasta
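Where seqkit is not available, splitting into a fixed number of parts can be approximated with awk. This sketch assigns records to parts round-robin, which is an assumption made here and not necessarily how seqkit distributes them; the demo file, the variable names and the part_001.fasta naming are likewise hypothetical.

```shell
set -eu

# Hypothetical demo input: seven one-line records.
workdir=$(mktemp -d)
i=1
while [ "$i" -le 7 ]; do
    printf '>seq%d\nACGT\n' "$i"
    i=$((i + 1))
done > "$workdir/sequences.fasta"

parts=3

# Send record 1 to part 1, record 2 to part 2, and so on, wrapping around
# after the last part. Output files stay open for the whole run, so this
# suits a small number of parts.
awk -v dir="$workdir" -v parts="$parts" '
    /^>/ {
        out = sprintf("%s/part_%03d.fasta", dir, (n % parts) + 1)
        n++
    }
    out != "" { print > out }
' "$workdir/sequences.fasta"

for f in "$workdir"/part_*.fasta; do
    echo "$(basename "$f"): $(grep -c '^>' "$f") records"
done
```

Round-robin assignment balances the number of records per part, not the number of bases, so parts can differ in size when sequence lengths vary.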

To split BAM files, bamtools can be used.

References

Biostars