Splitting contigs with overlap using pyfasta

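In metagenome assembly, the contigs of an initial metagenome assembly are sometimes split into overlapping pieces and fed to an OLC assembler (for example OPERA-MS or minimus2) in order to improve the contiguity of the assembly. The split command of pyfasta can be used to split contigs with overlap in this way.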

Installation

#conda (link)
mamba install -c bioconda -y pyfasta

#pypi (link)
pip install pyfasta

> pyfasta

available actions:
    `extract`: extract sequences from a fasta file
    `info`: show info about the fasta file and exit.
    `split`: split a large fasta file into separate files and/or into K-mers.
    `flatten`: flatten a fasta file inplace so that later command-line (and programmattic) access via pyfasta will use the inplace flattened version rather than creating another .flat copy of the sequence.

to view the help for a particular action, use:
    pyfasta [action] --help
e.g.:
    pyfasta extract --help

Usage

Split the sequences without overlap into 10000-bp pieces (-k 10000) and write them to a single file (-n 1).

pyfasta split -n 1 -k 10000 seq.fasta

-n number of new files to create
-o overlap in basepairs
-k split big files into pieces of this size in basepairs. default default of -1 means do not split the sequence up into k-mers, just split based on the headers. a reasonable value would be 10Kbp

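Running the command above writes seq.split.10Kmer.fasta, in which the sequences are split into 10000-bp pieces (for example, a 35000-bp sequence is split without overlap into four sequences of 10000 bp, 10000 bp, 10000 bp, and 5000 bp).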
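One way to confirm the piece sizes is to list the length of every record in the output file. The following is a minimal sketch in plain Python, not part of pyfasta; the helper name fasta_lengths is hypothetical, and the file name is the one reported above.

```python
from pathlib import Path


def fasta_lengths(path):
    """Return a list of (header, length) tuples, one per FASTA record."""
    records = []
    header, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, length))
                header, length = line[1:], 0
            else:
                length += len(line)
    if header is not None:
        records.append((header, length))
    return records


if __name__ == "__main__":
    out = Path("seq.split.10Kmer.fasta")  # output name reported in the text
    if out.exists():
        for name, n in fasta_lengths(out):
            print(name, n)
```

If the split behaved as described, every record should be 10000 bp long except the last piece of each original sequence.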

Split the sequences without overlap into 10000-bp pieces (-k 10000) and write them to three files (-n 3).

pyfasta split -n 3 -k 10000 seq.fasta

Splitting a single 21000-bp sequence with this command produces three FASTA files that hold the non-overlapping 10000-bp, 10000-bp, and 1000-bp pieces.
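How the pieces are distributed over the -n output files is not spelled out here. A simple way to picture it is round-robin assignment, sketched below. This is an assumption made for illustration, not a confirmed description of pyfasta's internals, and assign_round_robin is a hypothetical helper.

```python
def assign_round_robin(pieces, n_files):
    """Distribute pieces over n_files bins in turn (illustrative assumption)."""
    if n_files < 1:
        raise ValueError("n_files must be at least 1")
    bins = [[] for _ in range(n_files)]
    for i, piece in enumerate(pieces):
        # Piece i goes to bin i modulo n_files.
        bins[i % n_files].append(piece)
    return bins


if __name__ == "__main__":
    # Piece lengths of a 35000-bp sequence split with -k 10000.
    for idx, contents in enumerate(assign_round_robin([10000, 10000, 10000, 5000], 3)):
        print(idx, contents)
```

Under this assumption, a sequence that yields exactly three pieces puts one piece in each of the three files.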

To split with overlap, add -o <INT>.

Split the sequences of a multi-FASTA file into 1000-bp pieces (-k 1000) that overlap by 200 bp (-o 200), and write them to a single file (-n 1).

pyfasta split -n 1 -k 1000 -o 200 contig.fasta
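To see what -k and -o mean in terms of coordinates, the sketch below computes the windows of a sliding split in which each piece starts k minus overlap bases after the previous one. It illustrates the general technique. The function name overlapping_windows is hypothetical, and whether pyfasta handles the trailing short piece in exactly this way is an assumption.

```python
def overlapping_windows(seq_len, k, overlap=0):
    """Yield (start, end) 0-based half-open coordinates of each piece."""
    if not 0 <= overlap < k:
        raise ValueError("overlap must be >= 0 and smaller than k")
    step = k - overlap  # distance between consecutive piece starts
    start = 0
    while start < seq_len:
        # The last piece is truncated at the end of the sequence.
        yield start, min(start + k, seq_len)
        start += step


if __name__ == "__main__":
    # A 2600-bp contig with the settings of the command above.
    for start, end in overlapping_windows(2600, k=1000, overlap=200):
        print(start, end, end - start)
```

For a 2600-bp contig this gives pieces starting at 0, 800, 1600, and 2400, so neighbouring pieces share 200 bp and the final piece is shorter than k. With overlap=0 the same function reproduces the non-overlapping split.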


macでインフォマティクス