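infoseq, part of the EMBOSS package, prints basic information about one or more input sequences to the screen. This includes the Uniform Sequence Address (USA), name, accession number, type (nucleic acid or protein), length, GC content (%), and description. The output can optionally be written as an HTML table.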
EMBOSS explorer (a graphical user interface to the EMBOSS suite)
https://www.bioinformatics.nl/cgi-bin/emboss/infoseq
Installation
It can be installed with conda or Homebrew.
#bioconda
conda install -c bioconda -y emboss
#homebrew
brew install emboss
$ infoseq -help
Display basic information about sequences
Version: EMBOSS:6.6.0.0
Standard (Mandatory) qualifiers:
[-sequence] seqall (Gapped) sequence(s) filename and optional
format, or reference (input USA)
Additional (Optional) qualifiers:
-outfile outfile [stdout] If you enter the name of a file
here then this program will write the
sequence details into that file.
-html boolean [N] Format output as an HTML table
Advanced (Unprompted) qualifiers:
-[no]columns boolean [Y] Set this option on (Y) to print the
sequence information into neat, aligned
columns in the output file. Alternatively,
leave it unset (N), in which case the
information records will be delimited by a
character, which you may specify by using
the -delimiter option. In other words, if
-columns is set on, the -delimiter option is
overriden.
-delimiter string [|] This string, which is usually a single
character only, is used to delimit
individual records in the text output file.
It could be a space character, a tab
character, a pipe character or any other
character or string. (Any string)
-only boolean [N] This is a way of shortening the command
line if you only want a few things to be
displayed. Instead of specifying:
'-nohead -noname -noacc -notype -nopgc
-nodesc'
to get only the length output, you can
specify
'-only -length'
-[no]heading boolean [Y] Display column headings
-usa boolean [@(!$(only))] Display the USA of the
sequence
-database boolean [@(!$(only))] Display 'database' column
-name boolean [@(!$(only))] Display 'name' column
-accession boolean [@(!$(only))] Display 'accession' column
-gi boolean [N] Display 'GI' column
-seqversion boolean [N] Display 'version' column
-type boolean [@(!$(only))] Display 'type' column
-length boolean [@(!$(only))] Display 'length' column
-pgc boolean [@(!$(only))] Display 'percent GC content'
column
-organism boolean [@(!$(only))] Display 'organism' column
-description boolean [@(!$(only))] Display 'description' column
General qualifiers:
-help boolean Report command line options and exit. More
information on associated and general
qualifiers can be found with -help -verbose
Usage
Every EMBOSS command runs in interactive mode when you type only the command name.
infoseq
Alternatively, pass the input FASTA file as an argument.
infoseq input.fasta
Output
fasta::rename.fa:contig1 - contig1 - N 2328312 47.92
fasta::rename.fa:contig2 - contig2 - N 1875280 46.73
fasta::rename.fa:contig3 - contig3 - N 1753326 48.22
fasta::rename.fa:contig4 - contig4 - N 1691583 46.81
fasta::rename.fa:contig5 - contig5 - N 1658655 47.94
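The last two values on each line are the sequence length and the GC content (%).
You can choose which fields are printed. The following prints only the contig name, length, and GC content.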
infoseq -only -name -length -pgc input.fasta |head
- -only This is a way of shortening the command line if you only want a few things to be displayed
- -database Display 'database' column
- -name Display 'name' column
- -accession Display 'accession' column
- -gi Display 'GI' column
- -seqversion Display 'version' column
- -type Display 'type' column
- -length Display 'length' column
- -pgc Display 'percent GC content' column
- -organism Display 'organism' column
- -description Display 'description' column
Output
$ infoseq -only -name -length -pgc rename.fa |head
Display basic information about sequences
Name Length %GC
contig1 2328312 47.92
contig2 1875280 46.73
contig3 1753326 48.22
contig4 1691583 46.81
contig5 1658655 47.94
contig6 1657196 48.81
contig7 1409879 48.51
contig8 1066393 47.87
contig9 1066242 48.72
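The column table above is easy to post-process with standard tools. As a minimal sketch, the snippet below feeds the header and the first three rows of that output to awk and reports the total length and the longest sequence. The variable names (`table`, `total`, `longest`) are illustrative only, and in practice the embedded sample text would be replaced by the real infoseq output.

```shell
#!/bin/sh
# Sample rows copied from the infoseq output above (header + 3 contigs).
# In practice this text would come from:
#   infoseq -only -name -length -pgc input.fasta
table=$(cat <<'EOF'
Name           Length %GC
contig1        2328312 47.92
contig2        1875280 46.73
contig3        1753326 48.22
EOF
)

# Skip the header line (NR > 1) and sum the second column (Length).
total=$(printf '%s\n' "$table" | awk 'NR > 1 { sum += $2 } END { printf "%d\n", sum }')

# Remember the row with the largest Length and print its name.
longest=$(printf '%s\n' "$table" | awk 'NR > 1 && $2 > max { max = $2; name = $1 } END { print name }')

echo "total length: $total"
echo "longest sequence: $longest"
```

For these three sample rows the script reports a total length of 5956918 and contig1 as the longest sequence.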
If you only need the name and length, samtools faidx also works.
samtools faidx input.fasta
Columns 1 and 2 of the resulting .fasta.fai index file are the name and length.
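To pull just those two columns out of the index, `cut` is enough. The sketch below works on a small stand-in `.fai` file written to a temporary directory, so it runs without samtools. The names and lengths are taken from the infoseq output shown earlier, while the values in columns 3 to 5 (offset, bases per line, bytes per line) are made-up placeholders.

```shell
#!/bin/sh
# Build a stand-in index in a temporary directory. A real one would be
# produced by: samtools faidx input.fasta
tmpdir=$(mktemp -d)
fai="$tmpdir/input.fasta.fai"

# .fai columns: name, length, offset, linebases, linewidth (tab-separated).
# Columns 3-5 here are placeholder values, not real offsets.
printf 'contig1\t2328312\t9\t60\t61\n'  > "$fai"
printf 'contig2\t1875280\t100\t60\t61\n' >> "$fai"

# Keep only column 1 (name) and column 2 (length).
name_length=$(cut -f1,2 "$fai")
printf '%s\n' "$name_length"

# Clean up the temporary directory.
rm -r "$tmpdir"
```

With a real index, the same `cut -f1,2 input.fasta.fai` call can be run directly on the file that samtools writes next to the FASTA.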
Citation
EMBOSS: the European Molecular Biology Open Software Suite.
Rice P, Longden I, Bleasby A
Trends Genet. 2000 Jun;16(6):276-7.
Reference
https://www.biostars.org/p/79490/
Related