macでインフォマティクス

A collection of informatics notes related to HTS (NGS).

faSplit: the UCSC command for splitting FASTA files

 

As the title says, faSplit splits a FASTA file into several files.

 

Installation

# conda (bioconda channel)
mamba install -c bioconda ucsc-fasplit -y

faSplit

$ faSplit

faSplit - Split an fa file into several files.

usage:

   faSplit how input.fa count outRoot

where how is either 'about' 'byname' 'base' 'gap' 'sequence' or 'size'.

Files split by sequence will be broken at the nearest fa record boundary.

Files split by base will be broken at any base.

Files broken by size will be broken every count bases.

 

Examples:

   faSplit sequence estAll.fa 100 est

This will break up estAll.fa into 100 files

(numbered est001.fa est002.fa, ... est100.fa

Files will only be broken at fa record boundaries

 

   faSplit base chr1.fa 10 1_

This will break up chr1.fa into 10 files

 

   faSplit size input.fa 2000 outRoot

This breaks up input.fa into 2000 base chunks

 

   faSplit about est.fa 20000 outRoot

This will break up est.fa into files of about 20000 bytes each by record.

 

   faSplit byname scaffolds.fa outRoot/

This breaks up scaffolds.fa using sequence names as file names.

       Use the terminating / on the outRoot to get it to work correctly.

 

   faSplit gap chrN.fa 20000 outRoot

This breaks up chrN.fa into files of at most 20000 bases each,

at gap boundaries if possible.  If the sequence ends in N's, the last

piece, if larger than 20000, will be all one piece.

 

Options:

    -verbose=2 - Write names of each file created (=3 more details)

    -maxN=N - Suppress pieces with more than maxN n's.  Only used with size.

              default is size-1 (only suppresses pieces that are all N).

    -oneFile - Put output in one file. Only used with size

    -extra=N - Add N extra bytes at the end to form overlapping pieces.  Only used with size.

    -out=outFile Get masking from outfile.  Only used with size.

    -lift=file.lft Put info on how to reconstruct sequence from

                   pieces in file.lft.  Only used with size and gap.

    -minGapSize=X Consider a block of Ns to be a gap if block size >= X.

                  Default value 1000.  Only used with gap.

    -noGapDrops - include all N's when splitting by gap.

    -outDirDepth=N Create N levels of output directory under current dir.

                   This helps prevent NFS problems with a large number of

                   file in a directory.  Using -outDirDepth=3 would

                   produce ./1/2/3/outRoot123.fa.

    -prefixLength=N - used with byname option. create a separate output

                   file for each group of sequences names with same prefix

                   of length N.

 

 

Usage

faSplit base

Split the given FASTA file into 10 files, using output as the output prefix.

faSplit base input.fa 10 output

 

faSplit size

Split the given (multi-)FASTA file into 100 bp chunks.

faSplit size input.fa 100 output

 

Split the given (multi-)FASTA file into 10,000 bp chunks, discarding any chunk that contains more than 1,000 Ns.

faSplit size input.fa 10000 output -maxN=1000
  • -maxN=N - Suppress pieces with more than maxN n's.  Only used with size.
                  default is size-1 (only suppresses pieces that are all N).
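
To make the -maxN threshold concrete, here is a minimal Python sketch of the rule as the help text words it ("suppress pieces with more than maxN n's"). It is an illustration only, not faSplit's own code; count_n and keep_piece are hypothetical helper names.

```python
def count_n(piece: str) -> int:
    """Count ambiguous bases (N or n) in a sequence piece."""
    return sum(1 for base in piece if base in "Nn")


def keep_piece(piece: str, max_n: int) -> bool:
    """Return True when the piece has at most max_n N's, i.e. it is not suppressed."""
    return count_n(piece) <= max_n


if __name__ == "__main__":
    # A piece with exactly max_n N's is kept; one more N and it is suppressed.
    print(keep_piece("ACGT" * 10 + "N" * 5, max_n=5))
    print(keep_piece("ACGT" * 10 + "N" * 6, max_n=5))
```

Under this reading, -maxN=1000 keeps a chunk with exactly 1,000 Ns and drops one with 1,001.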

 

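Split the given (multi-)FASTA file into 10,000 bp pieces, append 500 extra bases to the end of each piece so that neighboring pieces overlap, and write all pieces to a single file.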

faSplit size input.fa 10000 output.fasta -extra=500 -oneFile
  • -extra=N - Add N extra bytes at the end to form overlapping
    pieces.  Only used with size.
  • -oneFile - Put output in one file. Only used with size
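
The overlap produced by -extra can be pictured with a short Python sketch. It assumes that windows start every size bases and that each window is extended by up to extra bases at its end, which is how the help text reads; how faSplit treats the final piece has not been checked here. split_with_overlap is a hypothetical name.

```python
def split_with_overlap(seq: str, size: int, extra: int = 0):
    """Yield (start, piece) pairs.

    Windows start every `size` bases; each one is extended by up to
    `extra` bases so that it overlaps the start of the next window.
    """
    for start in range(0, len(seq), size):
        yield start, seq[start:start + size + extra]


if __name__ == "__main__":
    # Toy sequence: windows of 4 with 2 extra bases of overlap.
    for start, piece in split_with_overlap("ABCDEFGHIJ", size=4, extra=2):
        print(start, piece)
```

With size=10000 and extra=500, each piece in this model is up to 10,500 bases long and shares its last 500 bases with the next piece.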

 

faSplit about

Split the given (multi-)FASTA file into files of roughly 20,000 bytes each. Files are broken only at record boundaries.

faSplit about input.fa 20000 output
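
One way to picture "about N bytes, by record" is a greedy grouping: keep adding whole records to the current file and close it once the target size is reached. The Python sketch below implements that policy as an assumption; faSplit's actual rule for closing a file may differ, and group_records is a hypothetical name.

```python
def group_records(records, target_bytes):
    """Group (header, sequence) records into batches of about target_bytes.

    A batch is closed once its accumulated size reaches target_bytes.
    Records are never split across batches.
    """
    batches, current, current_size = [], [], 0
    for header, seq in records:
        current.append((header, seq))
        # Approximate on-disk size: '>' + header + newline + sequence + newline.
        current_size += len(header) + len(seq) + 3
        if current_size >= target_bytes:
            batches.append(current)
            current, current_size = [], 0
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    demo = [("a", "A" * 10), ("b", "C" * 10), ("c", "G" * 10)]
    for batch in group_records(demo, target_bytes=20):
        print([header for header, _ in batch])
```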

 

faSplit byname

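Split the given (multi-)FASTA file by sequence name, using each sequence name as the file name. As the help text notes, end outRoot with a / so that it works correctly.

faSplit byname input.fa output/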
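
For readers who want to see what a by-name split amounts to, here is a minimal Python sketch that writes each record to its own file. It is not faSplit's implementation: split_by_name is a hypothetical helper, and it assumes the first word of each header is non-empty and safe to use as a file name.

```python
from pathlib import Path


def split_by_name(fasta_path, out_dir):
    """Write each record of a FASTA file to out_dir/<name>.fa; return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    handle = None
    with open(fasta_path) as fasta:
        for line in fasta:
            if line.startswith(">"):
                if handle is not None:
                    handle.close()
                # Assumption: the first word of the header is the sequence name.
                name = line[1:].split()[0]
                path = out_dir / f"{name}.fa"
                handle = open(path, "w")
                written.append(path)
            if handle is not None:
                handle.write(line)
    if handle is not None:
        handle.close()
    return written
```

Calling split_by_name("input.fa", "output") would, under these assumptions, create output/<name>.fa for every record in input.fa.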

 

faSplit gap

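Split the given (multi-)FASTA file into pieces of at most 20,000 bases each, breaking at gaps (runs of Ns) where possible. The count argument is required, as in the usage line above.

faSplit gap input.fa 20000 output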
  • -minGapSize=X  Consider a block of Ns to be a gap if block size >= X.
                      Default value 1000.  Only used with gap.
  •  -noGapDrops - include all N's when splitting by gap.
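
The -minGapSize rule ("a block of Ns is a gap if block size >= X") can be sketched in Python with a regular expression. This only locates candidate gaps; it does not reproduce how faSplit then chooses break points. find_gaps is a hypothetical name, and treating lowercase n as part of a gap is an assumption.

```python
import re


def find_gaps(seq: str, min_gap_size: int = 1000):
    """Return half-open (start, end) intervals of N/n runs at least min_gap_size long."""
    pattern = re.compile(r"[Nn]{%d,}" % min_gap_size)
    return [match.span() for match in pattern.finditer(seq)]


if __name__ == "__main__":
    # With a threshold of 3, the run of five Ns counts as a gap; the run of two does not.
    print(find_gaps("ACGT" + "N" * 5 + "AC" + "NN" + "GT", min_gap_size=3))
```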

 

Reference

https://www.biostars.org/p/348483/