Repairing FASTA files with EMBOSS seqret

2019 6/19 Added installation instructions

2019 7/15 Revised title

2019 8/7 Added link

2019 10/3 Added comment

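When you edit FASTA files, the structure sometimes gets corrupted along the way, and indexing the file with samtools then fails with a segmentation fault. The cause may be a blank line in the middle of the file, a stray special character, or something else, but seqret from EMBOSS repairs the file easily. seqret reads and parses the input file and writes it back out in the standard NCBI FASTA format.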

It also strips numbers, spaces, and similar characters from the sequence, so it is handy for cleaning up a sequence copied from a GenBank record.
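Before repairing, it can help to see where a file is broken. The following is a minimal sketch, not part of EMBOSS: the function name check_fasta and the file demo_broken.fasta are hypothetical, and the check assumes that only letters, "*" and "-" are valid on sequence lines.

```shell
#!/bin/sh
# Hypothetical helper: report lines in a FASTA file that commonly break indexing.
# check_fasta FILE prints "LINE_NUMBER: reason" for each suspicious line.
check_fasta() {
    LC_ALL=C awk '
        /^>/            { next }                                # header lines are not inspected
        /^[ \t\r]*$/    { print NR ": blank line"; next }       # empty or whitespace-only line
        /[^A-Za-z*-]/   { print NR ": unexpected character" }   # digits, spaces, symbols, CR
    ' "$1"
}

# Demo input: one clean record, then a blank line and GenBank-style numbering.
cat > demo_broken.fasta <<'EOF'
>chr1
ACGTACGT

>chr2
        1 acgtacgtac gtacgtacgt
EOF

check_fasta demo_broken.fasta
```

On the demo file this flags line 3 (blank line) and line 5 (position number and spaces), which are the kinds of problems described above.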

Even a sequence like this

[Figure: example of a malformed input sequence]

↓ gets repaired.

[Figure: the same sequence after repair]

Official site

http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/seqret.html

Installation

EMBOSS can be installed with conda or brew.

#bioconda
mamba install -c bioconda -y emboss

#homebrew
brew install --cask xquartz #install XQuartz too if it is missing
brew install emboss

$ seqret -h
Read and write (return) sequences
Version: EMBOSS:6.6.0.0

   Standard (Mandatory) qualifiers:
  [-sequence]          seqall     (Gapped) sequence(s) filename and optional
                                  format, or reference (input USA)
  [-outseq]            seqoutall  [<sequence>.<format>] Sequence set(s)
                                  filename and optional format (output USA)

   Additional (Optional) qualifiers: (none)
   Advanced (Unprompted) qualifiers:
   -feature            boolean    Use feature information
   -firstonly          boolean    [N] Read one sequence and stop

   General qualifiers:
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose

Usage

seqret

Enter the input FASTA name and the output FASTA name in turn when prompted.

user$ seqret
Read and write (return) sequences
Input (gapped) sequence(s): input.fasta
output sequence(s) [chr.fasta]: out.fa

Alternatively, give the input and output FASTA names on the command line.

seqret input.fasta output.fasta

That is all it takes to repair the FASTA.
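As a sanity check after the repair, one might compare the number of records before and after. This is a minimal sketch under stated assumptions: repair_fasta is a hypothetical wrapper name, the -sequence and -outseq qualifiers are the ones listed by seqret -h, and -auto is used to turn off prompts.

```shell
#!/bin/sh
# Hypothetical wrapper: repair a FASTA with seqret and confirm no records were lost.
repair_fasta() {
    in=$1
    out=$2
    # -sequence / -outseq are the qualifiers shown by seqret -h; -auto turns off prompts.
    seqret -sequence "$in" -outseq "$out" -auto || return 1
    # Count header lines in the input and in the repaired output.
    n_in=$(grep -c '^>' "$in")
    n_out=$(grep -c '^>' "$out")
    if [ "$n_in" -ne "$n_out" ]; then
        echo "record count changed: $n_in -> $n_out" >&2
        return 1
    fi
    echo "repaired $n_out record(s): $out"
}

# Usage: repair_fasta input.fasta output.fasta
```

Counting header lines only catches lost or merged records; it does not verify the sequence content itself.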

UCSC also provides a similar tool.

https://users.soe.ucsc.edu/~kent/dnaDust/dnadust.html

Addendum

Converting GFF to GBK (manual)

seqret -sequence genome.fasta -feature -fformat gff -fopenfile input.gff -osformat genbank -osname_outseq output.gbk -ofdirectory_outseq gbk_file -auto

If you use an empty GFF (only the comment line at the top), you can also create a GBK file without any gene annotation.
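The empty-GFF case could be sketched as follows. The helper name make_empty_gff and the file names empty.gff and genome.fasta are placeholders, and the header line "##gff-version 3" is an assumption about what the top comment line looks like; the seqret call itself reuses the qualifiers from the command above.

```shell
#!/bin/sh
# Hypothetical helper: write a GFF that holds only the version header line.
make_empty_gff() {
    printf '##gff-version 3\n' > "$1"
}

make_empty_gff empty.gff

# Run the conversion only when seqret is on PATH and the placeholder genome exists.
if command -v seqret >/dev/null 2>&1 && [ -f genome.fasta ]; then
    seqret -sequence genome.fasta -feature -fformat gff -fopenfile empty.gff \
        -osformat genbank -osname_outseq output.gbk -ofdirectory_outseq gbk_file -auto
fi
```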

For repairing protein sequences, Protein Duster can be used.

References

EMBOSS: The European Molecular Biology Open Software Suite

Rice P, Longden I, Bleasby A.

Trends Genet. 2000 Jun;16(6):276-7.

http://seqanswers.com/forums/showthread.php?t=2352

Existing tool for converting gff3 to genbank (gbk)

https://bioinformatics.stackexchange.com/questions/11115/existing-tool-for-converting-gff3-to-genbank-gbk