2018-12-20

seqkitに新しく追加されたコマンドを確認する

2019 8/7 誤字修正

2023/01/20 translate help更新

seqkitを以前ブログで紹介した時は0..6.0でしたが、１年半近く経ち、2018年12月20日現在ではバージョンが0.9.4まで上がっています。ありがたいことに、bug fixだけでなく、新しいコマンドが複数追加されています。v0.6.1以降に追加されたコマンドについてまとめます。

インストール

ダウンロード

http://bioinf.shenwei.me/seqkit/download/

解凍するとbinaryファイルできる。

実行権をつけ、それを/usr/local/bin/などに移動して使う。

chmod u+x seqtk #実行権
ls -l seqtk #確認
#パスが通ったディレクトリに移動
mv seqtk /usr/local/bin

#anacondaを使っているならcondaで入る。例えば20181219時点で最新のv0.9.3を入れる
conda install -c bioconda -y seqkit==0.9.3

0.6.0の時点であったコマンドを復習しておく。

アルファベット順

common　 IDや配列が共通する配列抽出
dup 　　配列を指定回、複製
fq2fa　　　fastqをfastaに変換
fx2tab　　 fasta/fastqをtab形式に変換し、さらにGCなどの情報を表示
grep　　特定の配列を検索
head　　　指定数だけ先頭からリードを抽出
locate　　配列をリプレース
rename　 duplicateしたIDをユニークな名前に修復
rmdup　 duplication除去
restart　環状ゲノムのスタート位置変更
sample　リードをランダムサンプリング（10%だけランダムに取り出す）
seq　　配列を変形（逆相補鎖、DNA<=>RNA、指定サイズ以上/以下）
shuffle　ランダムな順番にする
sliding　配列を指定の単位ずつずらしながら取り出す
sort　　長さでソート
split　　分割（fastqを均等にn個に分けて出力）
stats　 fasta/fastqの簡単なstatistics
subseq 特定のリードや配列だけ抽出

新しいコマンド（アルファベット順）

concat （リンク）同じIDの配列を指定長ずつ合体

$ seqkit concat -h

concatenate sequences with same ID from multiple files

Example: concatenating leading 2 bases and last 2 bases

$ cat t.fa

>test

ACCTGATGT

>test2

TGATAGCTACTAGGGTGTCTATCG

$ seqkit concate <(seqkit subseq -r 1:2 t.fa) <(seqkit subseq -r -2:-1 t.fa)

>test

ACGT

>test2

TGCG

Usage:

seqkit concat [flags]

Aliases:

concat, concate

Flags:

-h, --help help for concat

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

上の例のように、１つ目の配列の先頭1-2文字と、２つ目の配列の後ろから1-2文字が結合（同じIDのもの同士）。

seqkit concat <(seqkit subseq -r 1:2 t.fa) <(seqkit subseq -r -2:-1 t.fa)

convert （リンク） fastqのクオリティエンコードをSolexa形式（1.3）、サンガーイルミナ（1.8）形式で相互変換

$ seqkit convert -h

convert FASTQ quality encoding between Sanger, Solexa and Illumina

Usage:

seqkit convert [flags]

Flags:

-d, --dry-run dry run

-f, --force for Illumina-1.8+ -> Sanger, truncate scores > 40 to 40

--from string source quality encoding. if not given, we'll guess it

-h, --help help for convert

-n, --nrecords int number of records for guessing quality encoding (default 1000)

-N, --thresh-B-in-n-most-common int threshold of 'B' in top N most common quality for guessing Illumina 1.5. (default 4)

-F, --thresh-illumina1.5-frac float threshold of faction of Illumina 1.5 in the leading N records (default 0.1)

--to string target quality encoding (default "Sanger")

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

クオリティフォーマット確認

seqkit convert input.fq.gz |seqkit head -n 1

このデータは[Sanger Illumina 1.8

$ seqkit convert SRR5035895_1.fastq |seqkit head -n 1

[INFO] possible quality encodings: [Sanger Illumina-1.8+]

[INFO] guessed quality encoding: Sanger

[INFO] converting Sanger -> Sanger

[WARN] source and target quality encoding match.

@SRR5035895.1 1 length=228

AAAAAGTATCTACGACACGCGATTTAGTGTCGAGAAAGGCGACTAAGACGGTAGCTTCGTCGGTTGCTTCGCTTCCTGCA+

CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGデフォルトの設定でランすると、イルミナ1.8に変換するようになっている。イルミナ1.5に変換。

seqkit convert Illimina1.8.fq.gz --to Illumina-1.5+ > out.fq

duplicate （リンク）配列を複製

$ seqkit duplicate -h

duplicate sequences N times

You may need "seqkit rename" to make the the sequence IDs unique.

Usage:

seqkit duplicate [flags]

Aliases:

duplicate, dup

Flags:

-h, --help help for duplicate

-n, --times int duplication number (default 1)

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

全配列ではなく、fastaの先頭の配列だけ１回複製する。seqkit renameでユニークなヘッダーにして出力。

seqkit head -n 1 input.fa | seqkit duplicate -n 2 | seqkit rename > out.fa

range （リンク）指定間のリードを抽出 (headでもtailでも単独では対応できない時に)

$ seqkit range -h

print FASTA/Q records in a range (start:end)

Usage:

seqkit range [flags]

Flags:

-h, --help help for range

-r, --range string range. e.g., 1:12 for first 12 records (head -n 12), -12:-1 for last 12 records (tail -n 12)

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

user-no-MacBook-Pro-2:Illumina_MiSeq_metagenome_of_Kelp_associated_biofilm user$ $

先頭から101番目 ~ 150番目の50リード出力

seqkit range -r 101:150 input.fq > out.fq

後ろから-1番目 ~ -100番目の50リード出力

seqkit range -r -100:-1 input.fq > out.fq

faidx （リンク） FASTAのindexを作成したり、特定の配列を抽出（samttools faidx互換だが、より高機能）

$ seqkit faidx -h

create FASTA index file and extract subsequence

This command is similar with "samtools faidx" but has some extra features:

1. output full header line with flag -f

2. support regular expression as sequence ID with flag -r

Usage:

seqkit faidx [flags] <fasta-file> [regions...]

Flags:

-f, --full-head print full header line instead of just ID. New fasta index file ending with .seqkit.fai will be created

-h, --help help for faidx

-i, --ignore-case ignore case

-r, --use-regexp IDs are regular expression. But subseq region is not suppored here.

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

seqkit faidx input.fa

input.fa.faiができる。

restart （リンク）　環状ゲノムのスタート位置を変える

$ seqkit restart -h

reset start position for circular genome

Examples

$ echo -e ">seq\nacgtnACGTN"

>seq

acgtnACGTN

$ echo -e ">seq\nacgtnACGTN" | seqkit restart -i 2

>seq

cgtnACGTNa

$ echo -e ">seq\nacgtnACGTN" | seqkit restart -i -2

>seq

TNacgtnACG

Usage:

seqkit restart [flags]

Flags:

-h, --help help for restart

-i, --new-start int new start position (1-base, supporting negative value counting from the end) (default 1)

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

入力環状

ゲノムのポジション10000からリスタート

seqkit restart -i 10000 input.fa > output.fa

split2 （リンク）　入力配列を分割する。

$ seqkit split2 -h

split sequences into files by part size or number of parts

This command supports FASTA and paired- or single-end FASTQ with low memory

occupation and fast speed.

The file extensions of output are automatically detected and created

according to the input files.

Usage:

seqkit split2 [flags]

Flags:

-p, --by-part int split sequences into N parts

-s, --by-size int split sequences into multi parts with N sequences

-f, --force overwrite output directory

-h, --help help for split2

-O, --out-dir string output directory (default value is $infile.split)

-1, --read1 string read1 file

-2, --read2 string read2 file

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

FASTAを10000リードずつに分ける。

seqkit split2 hairpin.fa.gz -s 10000 -f

FASTAを4パートに分ける。

seqkit split hairpin.fa.gz -p 4 -f

シングルエンドFASTQを2パートに分ける。

seqkit split2 reads_1.fq.gz -p 2 -O out -f

ペアエンドFASTQを2パートに分ける。

seqkit split2 -1 reads_1.fq.gz -2 reads_2.fq.gz -p 2 -O out -f

tab2fx （リンク）　FASTA/FASTQをタブ区切り（TSV）フォーマットに変換

$ seqkit tab2fx -h

convert tabular format (first two/three columns) to FASTA/Q format

Usage:

seqkit tab2fx [flags]

Flags:

-p, --comment-line-prefix strings comment line prefix (default [#,//])

-h, --help help for tab2fx

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

TSV形式で10リードだけ出力

seqkit fx2tab input.fq |head -n 10 > output.tsv

f:id:kazumaxneo:20181220205208p:plain

配列長、GC率だけ表示（*csvtkはcondaで導入可）

seqkit fx2tab contig.fa.gz -l -g -n -i -H | head -n 10 | csvtk -t -C '&' pretty

-g print GC content
-H print header line
-l print sequence length
-n only print names (no sequences and qualities)
-i print ID instead of full head

#name seq qual length GC

NODE_1_length_254790_cov_21.168655 254790 47.85

NODE_2_length_246122_cov_21.354689 246122 47.88

NODE_3_length_214379_cov_20.356971 214379 48.91

NODE_4_length_200639_cov_20.843756 200639 47.45

NODE_5_length_186136_cov_20.060989 186136 48.74

NODE_6_length_183861_cov_21.459733 183861 47.74

NODE_7_length_180715_cov_21.655153 180715 47.09

NODE_8_length_141474_cov_21.429290 141474 48.11

NODE_9_length_130080_cov_20.966290 130080 47.76

translate （リンク） DNA/RNAをアミノ酸配列に翻訳

$ seqkit translate -h

translate DNA/RNA to protein sequence (supporting ambiguous bases)

Note:

1. This command supports codons containing any ambiguous base.

Please switch on flag -L INT for details. e.g., for standard table:

ACN -> T

CCN -> P

CGN -> R

CTN -> L

GCN -> A

GGN -> G

GTN -> V

TCN -> S

MGR -> R

YTR -> L

Translate Tables/Genetic Codes:

# https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes

1: The Standard Code

2: The Vertebrate Mitochondrial Code

3: The Yeast Mitochondrial Code

4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code

5: The Invertebrate Mitochondrial Code

6: The Ciliate, Dasycladacean and Hexamita Nuclear Code

9: The Echinoderm and Flatworm Mitochondrial Code

10: The Euplotid Nuclear Code

11: The Bacterial, Archaeal and Plant Plastid Code

12: The Alternative Yeast Nuclear Code

13: The Ascidian Mitochondrial Code

14: The Alternative Flatworm Mitochondrial Code

16: Chlorophycean Mitochondrial Code

21: Trematode Mitochondrial Code

22: Scenedesmus obliquus Mitochondrial Code

23: Thraustochytrium Mitochondrial Code

24: Pterobranchia Mitochondrial Code

25: Candidate Division SR1 and Gracilibacteria Code

26: Pachysolen tannophilus Nuclear Code

27: Karyorelict Nuclear

28: Condylostoma Nuclear

29: Mesodinium Nuclear

30: Peritrich Nuclear

31: Blastocrithidia Nuclear

Usage:

seqkit translate [flags]

Flags:

-x, --allow-unknown-codon translate unknown code to 'X'. And you may not use flag --trim which removes 'X'

-F, --append-frame append frame information to sequence ID

--clean change all STOP codon positions from the '*' character to 'X' (an unknown residue)

-f, --frame strings frame(s) to translate, available value: 1, 2, 3, -1, -2, -3, and 6 for all six frames (default [1])

-h, --help help for translate

-M, --init-codon-as-M translate initial codon at beginning to 'M'

-l, --list-transl-table int show details of translate table N, 0 for all (default -1)

-L, --list-transl-table-with-amb-codons int show details of translate table N (including ambigugous codons), 0 for all. (default -1)

-T, --transl-table int translate table/genetic code, type 'seqkit translate --help' for more details (default 1)

--trim remove all 'X' and '*' characters from the right end of the translation

Global Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")

--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments

-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)

１フレームで翻訳

seqkit translate --frame 1 --transl-table 1 input.fa > output.fa

#The Bacterial, Archaeal and Plant Plastid Code
seqkit translate --frame 1 --transl-table 11 input.fa > output.fa