fxtools: a utility tool for FASTA/FASTQ/BAM

Installation

Tested on Ubuntu 16.04.

Source: GitHub

git clone https://github.com/yangao07/fxtools.git --recursive
cd fxtools; make

> ./fxtools

$ fxtools

Program: fxtools (light-weight processing tool for FASTA, FASTQ and BAM format data)

Usage: fxtools <command> [options]

Command:

filter (fl) filter fa/fq sequences with specified length boundary.

filter-name (fn) filter fa/fq sequences with specified name.

filter-bam (fb) filter bam/sam records with specified read length boundary.

filter-bam-name (fbn) filter bam/sam records with specified read name.

split-fx (sx) split fa/fq file into multipule files.

fq2fa (qa) convert FASTQ format data to FASTA format data.

fa2fq (aq) convert FASTA format data to FASTQ format data.

bam2bed (bb) convert BAM file to BED file. seperated exon regions for spliced BAM

re-co (rc) convert DNA sequence(fa/fq) to its reverse-complementary sequence.

seq-display (sd) display a specified region of FASTA/FASTQ file.

cigar-parse (cp) parse the given cigar(stdout).

length-parse (lp) parse the length of sequences in fa/fq file.

merge-fa (mf) merge the reads with same read name in fasta/fastq file.

merge-filter-fa (mff) merge and filter the reads with same read name in fasta file.

duplicate-fa (df) duplicate the read sequence with specific copy number.

error-parse (ep) parse indel and mismatch error based on CIGAR and NM in bam file.

dna2rna (dr) convert DNA fa/fq to RNA fa/fq.

rna2dna (rd) convert RNA fa/fq to DNA fa/fq.

trim (tr) trim poly A tail(poly T head).

trimF (tf) trim and filter with poly A tail(poly T head). Only poly A contained reads will be kept.

Usage

1. fxtools filter

Usage: fxtools filter <in.fa/fq> <lower-bound> <upper-bound>(-1 for NO bound) > <out.fa/fq>

Output only sequences (FASTA/FASTQ) of 100-1000 bp

fxtools filter input.fasta 100 1000 > output.fasta

For BAM files, use fxtools filter-bam.
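
To make the two bounds concrete, here is a minimal Python sketch of a length filter on FASTA text. It is not how fxtools is implemented: the names `read_fasta` and `filter_by_length` are hypothetical, and treating both bounds as inclusive is an assumption.

```python
def read_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, lower, upper):
    """Keep records with lower <= length <= upper; upper == -1 means no upper bound.

    Inclusive bounds are an assumption, not confirmed fxtools behavior.
    """
    for header, seq in records:
        if len(seq) >= lower and (upper == -1 or len(seq) <= upper):
            yield header, seq


# Three toy records of length 4, 150 and 2000.
fasta = ">short\nACGT\n>medium\n" + "A" * 150 + "\n>long\n" + "C" * 2000 + "\n"
kept = [name for name, _ in filter_by_length(read_fasta(fasta), 100, 1000)]
print(kept)
```

With the bounds 100 and 1000, only the 150 bp record passes. Passing -1 as the upper bound keeps the 2000 bp record as well.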

2. fxtools filter-name

Usage: fxtools filter-name [-n name] [-m sub-name] [-l] <in.fa/fq> > <out.fa/fq>

-n [STR] only output read with specified name.

-m [STR] only output read whose name or comment contain specified string.

-l input a list of names or sub-names with a list file, each line is a name or sub-name. [False]

Output only the records with the specified header name

# header name is chr1
fxtools filter-name -n chr1 input.fa > output.fa

# partial match: header contains "plasmid"
fxtools filter-name -m plasmid input.fa > output.fa

For BAM files, use fxtools filter-bam-name.

3. fxtools split-fx

Usage: fxtools split-fx <in.fa/q> <N> <out_dir>

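Split a FASTA/FASTQ file into the specified number of files

# split a multi-FASTA of 8 sequences into 8 files
mkdir outdir
fxtools split-fx input.fa 8 outdir
# => 8 FASTA files are created in the output directory.


# split a multi-FASTA of 8 sequences into 3 files
mkdir outdir
fxtools split-fx input.fa 3 outdir
# => 3 FASTA files are created in the output directory. Two of them are multi-FASTA files with 3 sequences each, and the remaining one holds 2 sequences.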
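
When 8 sequences go into 3 files, two of the files hold 3 sequences each. One simple scheme that produces this layout is to cut the records into consecutive chunks of ceil(8/3) = 3. The Python sketch below illustrates that scheme only. Whether fxtools assigns records consecutively or in some other order is not confirmed here, and `split_records` is a hypothetical name.

```python
import math


def split_records(records, n_files):
    """Split a list into at most n_files consecutive chunks of ceil(len / n_files) items."""
    per_file = max(1, math.ceil(len(records) / n_files))
    return [records[i:i + per_file] for i in range(0, len(records), per_file)]


names = [f"seq{i}" for i in range(1, 9)]  # 8 toy record names
print([len(chunk) for chunk in split_records(names, 3)])
print([len(chunk) for chunk in split_records(names, 8)])
```

The first call gives chunk sizes of 3, 3 and 2, and the second gives eight chunks of one record each.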


4. fxtools bam2bed

Usage: fxtools bam2bed in.bam > out.bed

Convert BAM to BED

fxtools bam2bed input.bam > out.bed

5. fxtools re-co

Usage: fxtools re-co in.fa/fq > out.fa

Convert DNA sequences to their reverse complement

#fastq
fxtools re-co input.fastq > out.fastq
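
For reference, a reverse complement can be sketched in a few lines of Python. This only illustrates the operation and is not fxtools code: the function names are hypothetical, and reversing the quality string for FASTQ records is the usual convention rather than something verified against fxtools output.

```python
# Translation table for A/C/G/T/N in both cases.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_complement_fastq(seq, qual):
    """For FASTQ, the quality string is reversed so each score stays with its base."""
    return reverse_complement(seq), qual[::-1]


print(reverse_complement("ATGC"))
print(reverse_complement_fastq("ATGC", "IIH#"))
```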

6. fxtools seq-display

Usage: fxtools seq-display <in.fa/fq> <chr/read_name> <start_pos(1-based)> <end_pos>

use negative coordinate to indicate later part of sequence. (e.g., -1 for last bp)

Output the specified region of the specified sequence

# output positions 10000-11000 of chr1
fxtools seq-display input.fasta chr1 10000 11000 > out.fa
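
The coordinate convention in the usage line (1-based start, negative values counting from the end) can be sketched in Python as follows. The helper names `resolve` and `extract_region` are hypothetical, and treating the end position as inclusive is an assumption.

```python
def resolve(pos, length):
    """Map a 1-based coordinate to a positive 1-based one; -1 is the last base."""
    return length + pos + 1 if pos < 0 else pos


def extract_region(seq, start, end):
    """Return seq[start..end], 1-based, with an inclusive end (assumed)."""
    start, end = resolve(start, len(seq)), resolve(end, len(seq))
    return seq[start - 1:end]


seq = "ACGTACGTAC"  # toy 10 bp sequence
print(extract_region(seq, 3, 6))    # bases 3 to 6
print(extract_region(seq, -3, -1))  # last three bases
```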

8. fxtools cigar-parse

Usage: fxtools cigar-parse <input-cigar>

Parse a CIGAR string

fxtools cigar-parse 144S157M

Output

Cigar length:

157 M

144 S

seq-len: 301

ref-len: 157
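
The seq-len and ref-len values above follow from which CIGAR operations consume the query and which consume the reference in the SAM specification. The Python sketch below reproduces the two numbers for 144S157M. It is an illustration, not fxtools code: `parse_cigar` is a hypothetical name, and how fxtools itself counts hard clips (H) is not verified here.

```python
import re
from collections import Counter

# Per the SAM specification: operations that consume query / reference bases.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")


def parse_cigar(cigar):
    """Return (per-operation totals, query length, reference length)."""
    totals = Counter()
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        totals[op] += int(length)
    seq_len = sum(n for op, n in totals.items() if op in QUERY_OPS)
    ref_len = sum(n for op, n in totals.items() if op in REF_OPS)
    return totals, seq_len, ref_len


totals, seq_len, ref_len = parse_cigar("144S157M")
print(dict(totals), seq_len, ref_len)
```

The soft clip (S) counts toward the read length but not the reference span, which is why seq-len is 144 + 157 = 301 while ref-len is 157.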

9. fxtools length-parse

Usage: fxtools length-parse <in.fa/fq/len>

Analyze the read lengths of a FASTA/FASTQ file

fxtools length-parse reference.fasta

Read_9997_length=16254bp_startpos=2897484_number_of_errors=2626_total_error_prob=0.15_passes=1.359881649744054_passes_left=1_passes_right=2_cut_position=10404 16254

Read_9998_length=14882bp_startpos=1299896_number_of_errors=1313_total_error_prob=0.09_passes=1.5013880205881063_passes_left=1_passes_right=2_cut_position=7420 14882

Read_9999_length=3576bp_startpos=603342_number_of_errors=13_total_error_prob=0.00_passes=6.341188385708311_passes_left=7_passes_right=6_cut_position=1220 3576

== '/Users/kazuma/Documents/pacbio-GT-S_simulation.fastq' read length stats ==

Total reads 10,000

Total bases 81,524,014

Mean length 8,152

Min. length 50

Max. length 30,592

N-50 length 9,430
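
As a reminder of what the N50 figure means, the Python sketch below computes the same kind of summary from a list of read lengths. N50 is taken here in its usual sense: the length at which the cumulative sum of lengths, sorted from longest to shortest, first reaches half of the total bases. The function names are hypothetical, and the integer mean is an assumption based on the output above, not on fxtools' source.

```python
def n50(lengths):
    """Length L such that reads of length >= L cover at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def length_stats(lengths):
    """Summary statistics similar to the report shown above."""
    if not lengths:
        raise ValueError("no sequences")
    total = sum(lengths)
    return {
        "total_reads": len(lengths),
        "total_bases": total,
        "mean_length": total // len(lengths),  # integer mean (assumption)
        "min_length": min(lengths),
        "max_length": max(lengths),
        "n50_length": n50(lengths),
    }


stats = length_stats([2, 3, 4, 5, 6])
print(stats)
```

For the toy lengths 2, 3, 4, 5 and 6 (20 bases in total), the two longest reads already cover 11 bases, so the N50 is 5.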

10. fxtools merge-fa

Usage: fxtools merge-fa <in.fa/fq> [N] > <out.fa/fq>

optional: use N to separate merged sequences

Merge FASTA/FASTQ records that share the same header (the longer one is kept)

fxtools merge-fa input.fa > output.fa

11. fxtools duplicate-fa

Usage: fxtools duplicate-fa <in.fa/fq> <copy_number> > out.fa/fq

Duplicate FASTA/FASTQ sequences

# duplicate 3 times
fxtools duplicate-fa input.fa 3 > out.fa

> seqkit stats input.fa out.fa

file      format  type  num_seqs  sum_len  min_len  avg_len  max_len
input.fa  FASTA   DNA          1   15,360   15,360   15,360   15,360
out.fa    FASTA   DNA          1   46,080   46,080   46,080   46,080

12. fxtools error-parse

Usage: fxtools error-parse <input.bam> [-s] > error.out

-s include non-primary records in the output.

Parse indels and mismatches based on the CIGAR and NM tag in a BAM file

fxtools error-parse input.bam > output

13. fxtools trim

Usage: fxtools trim in.fa/fq min_trim_length min_fraction window_size > out.fa

Trim poly-A tails (poly-T heads)

fxtools trim input.fq 10 0.05 4 > output.fq

To output only the reads that have a poly-A tail (poly-T head), use fxtools trimF.

There are also commands for FASTQ <=> FASTA conversion and DNA <=> RNA conversion.

Reference

https://github.com/yangao07/fxtools

macでインフォマティクス
