FASTAやFASTQの塩基数をカウントする - macでインフォマティクス

インストール

TECHOVERFLO（(https://techoverflow.net）の公開しているpython スクリプトを利用させてもらう。該当記事（リンク）からコピーして、ファイル名 fasta-stats.pyで保存。

"chmod u+x python fasta-stats.py"で実行権もつけておく。

> python fasta-stats.py -h #ヘルプ

$ python fasta-stats.py -h

usage: fasta-stats.py [-h] [--case-sensitive] [-o ONLY] infiles [infiles ...]

Compute simple statistics for FASTA files.

positional arguments:

infiles FASTA files (.fa, .fa.gz) to generate statistics for

optional arguments:

-h, --help show this help message and exit

--case-sensitive Count characters in a case-sensitive way. Disabled per

default.

-o ONLY, --only ONLY If this option is supplied (e.g. set to 'ATGC'),

characters not in the set will be ignored for all

statistics

/usr/local/bin/とかに入れておく。

ラン

-- ATGC count from FASTA --

FASTAファイルを引数にしてランする。

 python fasta-stats.py input.fa > output

FASTAはgz圧縮でも読み込めます。また２つ以上も同時解析できます。

ヘッダーが１つのFASTAのATGCカウント結果

$ python fasta-stats.py reference.fa

Sequence '1' from FASTA file 'reference.fa' contains 3571707 sequence characters:

A : 931911 = 26.0914739087%

C : 851062 = 23.8278783786%

G : 853416 = 23.8937852405%

T : 935318 = 26.1868624722%

A+T : 1867229 = 52.2783363809%

G+C : 1704478 = 47.7216636191%%

-- ATGC count from FASTQ --

fastqのフォーマットは扱えないので、ここではseqkitでfastq => fastaに変換してからカウントする。

まずfastaに変換してコンカテネートする。

echo ">header" > pseudo.fasta
seqkit fq2fa input.fq |grep -nv ">" - |cut -d ":" -f 2 - |perl -pe 's/\n//g' - >> pseudo.fasta

seqkitでfastaにして、さらにヘッダーと改行を除き出力している。

カウント

 python fasta-stats.py pseudo.fasta

出力

$ python fasta-stats.py pseudo.fasta

Sequence 'header' from FASTA file 'pseudo.fasta' contains 77820790 sequence characters:

A : 20537394 = 26.3906264637%

C : 18373394 = 23.6098785427%

G : 18332744 = 23.5576431439%

N : 175 = 0.000224875640558%

T : 20577083 = 26.441626974%

uesaka-no-Air-2:Desktop kazumaxneo$

fastqだと時間がかかります。

何度も行うなら、上記をエイリアスか関数にしたほうが楽です。上の流れで関数を定義するなら、下記をbash_profileに記載して、soruce しておきます（fasta-stats.pyは/usr/local/bin/にあると仮定）。

function fastq_count() {
echo ">header" > pseudo.fasta && seqkit fq2fa $1 |grep -nv ">" - |cut -d ":" -f 2 - |perl -pe 's/\n//g' - >> pseudo.fasta && python /usr/local/bin/fasta-stats.py pseudo.fasta && rm pseudo.fasta
}

> source ~/.bash_profile

上のように fastq_countという関数名なら

fastq_count input.fastq

でカウントできます。コマンドは必要に応じて変更してください。

引用

TECHOVERFLO

https://techoverflow.net/2013/10/24/a-simple-tool-for-fasta-statistics/