2019 8/7 誤字修正
2023/01/20 translate help更新
seqkitを以前ブログで紹介した時は0..6.0でしたが、1年半近く経ち、2018年12月20日現在ではバージョンが0.9.4まで上がっています。ありがたいことに、bug fixだけでなく、新しいコマンドが複数追加されています。v0.6.1以降に追加されたコマンドについてまとめます。
インストール
ダウンロード
http://bioinf.shenwei.me/seqkit/download/
解凍するとbinaryファイルできる。
実行権をつけ、それを/usr/local/bin/などに移動して使う。
chmod u+x seqtk #実行権
ls -l seqtk #確認
#パスが通ったディレクトリに移動
mv seqtk /usr/local/bin
#anacondaを使っているならcondaで入る。例えば20181219時点で最新のv0.9.3を入れる
conda install -c bioconda -y seqkit==0.9.3
0.6.0の時点であったコマンドを復習しておく。
アルファベット順
- common IDや配列が共通する配列抽出
- dup 配列を指定回、複製
- fq2fa fastqをfastaに変換
- fx2tab fasta/fastqをtab形式に変換し、さらにGCなどの情報を表示
- grep 特定の配列を検索
- head 指定数だけ先頭からリードを抽出
- locate 配列をリプレース
- rename duplicateしたIDをユニークな名前に修復
- rmdup duplication除去
- restart 環状ゲノムのスタート位置変更
- sample リードをランダムサンプリング(10%だけランダムに取り出す)
- seq 配列を変形(逆相補鎖、DNA<=>RNA、指定サイズ以上/以下)
- shuffle ランダムな順番にする
- sliding 配列を指定の単位ずつずらしながら取り出す
- sort 長さでソート
- split 分割(fastqを均等にn個に分けて出力)
- stats fasta/fastqの簡単なstatistics
- subseq 特定のリードや配列だけ抽出
新しいコマンド(アルファベット順)
concat (リンク) 同じIDの配列を指定長ずつ合体
$ seqkit concat -h
concatenate sequences with same ID from multiple files
Example: concatenating leading 2 bases and last 2 bases
$ cat t.fa
>test
ACCTGATGT
>test2
TGATAGCTACTAGGGTGTCTATCG
$ seqkit concate <(seqkit subseq -r 1:2 t.fa) <(seqkit subseq -r -2:-1 t.fa)
>test
ACGT
>test2
TGCG
Usage:
seqkit concat [flags]
Aliases:
concat, concate
Flags:
-h, --help help for concat
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
上の例のように、1つ目の配列の先頭1-2文字と、2つ目の配列の後ろから1-2文字が結合(同じIDのもの同士)。
seqkit concat <(seqkit subseq -r 1:2 t.fa) <(seqkit subseq -r -2:-1 t.fa)
convert (リンク) fastqのクオリティエンコードをSolexa形式(1.3)、サンガーイルミナ(1.8)形式で相互変換
$ seqkit convert -h
convert FASTQ quality encoding between Sanger, Solexa and Illumina
Usage:
seqkit convert [flags]
Flags:
-d, --dry-run dry run
-f, --force for Illumina-1.8+ -> Sanger, truncate scores > 40 to 40
--from string source quality encoding. if not given, we'll guess it
-h, --help help for convert
-n, --nrecords int number of records for guessing quality encoding (default 1000)
-N, --thresh-B-in-n-most-common int threshold of 'B' in top N most common quality for guessing Illumina 1.5. (default 4)
-F, --thresh-illumina1.5-frac float threshold of faction of Illumina 1.5 in the leading N records (default 0.1)
--to string target quality encoding (default "Sanger")
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
クオリティフォーマット確認
seqkit convert input.fq.gz |seqkit head -n 1
このデータは[Sanger Illumina 1.8
$ seqkit convert SRR5035895_1.fastq |seqkit head -n 1
[INFO] possible quality encodings: [Sanger Illumina-1.8+]
[INFO] guessed quality encoding: Sanger
[INFO] converting Sanger -> Sanger
[WARN] source and target quality encoding match.
@SRR5035895.1 1 length=228
AAAAAGTATCTACGACACGCGATTTAGTGTCGAGAAAGGCGACTAAGACGGTAGCTTCGTCGGTTGCTTCGCTTCCTGCA+
CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGデフォルトの設定でランすると、イルミナ1.8に変換するようになっている。イルミナ1.5に変換。
seqkit convert Illimina1.8.fq.gz --to Illumina-1.5+ > out.fq
duplicate (リンク) 配列を複製
$ seqkit duplicate -h
duplicate sequences N times
You may need "seqkit rename" to make the the sequence IDs unique.
Usage:
seqkit duplicate [flags]
Aliases:
duplicate, dup
Flags:
-h, --help help for duplicate
-n, --times int duplication number (default 1)
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
全配列ではなく、fastaの先頭の配列だけ1回複製する。seqkit renameでユニークなヘッダーにして出力。
seqkit head -n 1 input.fa | seqkit duplicate -n 2 | seqkit rename > out.fa
range (リンク) 指定間のリードを抽出 (headでもtailでも単独では対応できない時に)
$ seqkit range -h
print FASTA/Q records in a range (start:end)
Usage:
seqkit range [flags]
Flags:
-h, --help help for range
-r, --range string range. e.g., 1:12 for first 12 records (head -n 12), -12:-1 for last 12 records (tail -n 12)
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
user-no-MacBook-Pro-2:Illumina_MiSeq_metagenome_of_Kelp_associated_biofilm user$ $
先頭から101番目 ~ 150番目の50リード出力
seqkit range -r 101:150 input.fq > out.fq
後ろから-1番目 ~ -100番目の50リード出力
seqkit range -r -100:-1 input.fq > out.fq
faidx (リンク) FASTAのindexを作成したり、特定の配列を抽出(samttools faidx互換だが、より高機能)
$ seqkit faidx -h
create FASTA index file and extract subsequence
This command is similar with "samtools faidx" but has some extra features:
1. output full header line with flag -f
2. support regular expression as sequence ID with flag -r
Usage:
seqkit faidx [flags] <fasta-file> [regions...]
Flags:
-f, --full-head print full header line instead of just ID. New fasta index file ending with .seqkit.fai will be created
-h, --help help for faidx
-i, --ignore-case ignore case
-r, --use-regexp IDs are regular expression. But subseq region is not suppored here.
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
seqkit faidx input.fa
input.fa.faiができる。
restart (リンク) 環状ゲノムのスタート位置を変える
$ seqkit restart -h
reset start position for circular genome
Examples
$ echo -e ">seq\nacgtnACGTN"
>seq
acgtnACGTN
$ echo -e ">seq\nacgtnACGTN" | seqkit restart -i 2
>seq
cgtnACGTNa
$ echo -e ">seq\nacgtnACGTN" | seqkit restart -i -2
>seq
TNacgtnACG
Usage:
seqkit restart [flags]
Flags:
-h, --help help for restart
-i, --new-start int new start position (1-base, supporting negative value counting from the end) (default 1)
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
入力環状
ゲノムのポジション10000からリスタート
seqkit restart -i 10000 input.fa > output.fa
split2 (リンク) 入力配列を分割する。
$ seqkit split2 -h
split sequences into files by part size or number of parts
This command supports FASTA and paired- or single-end FASTQ with low memory
occupation and fast speed.
The file extensions of output are automatically detected and created
according to the input files.
Usage:
seqkit split2 [flags]
Flags:
-p, --by-part int split sequences into N parts
-s, --by-size int split sequences into multi parts with N sequences
-f, --force overwrite output directory
-h, --help help for split2
-O, --out-dir string output directory (default value is $infile.split)
-1, --read1 string read1 file
-2, --read2 string read2 file
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
FASTAを10000リードずつに分ける。
seqkit split2 hairpin.fa.gz -s 10000 -f
FASTAを4パートに分ける。
seqkit split hairpin.fa.gz -p 4 -f
シングルエンドFASTQを2パートに分ける。
seqkit split2 reads_1.fq.gz -p 2 -O out -f
ペアエンドFASTQを2パートに分ける。
seqkit split2 -1 reads_1.fq.gz -2 reads_2.fq.gz -p 2 -O out -f
tab2fx (リンク) FASTA/FASTQをタブ区切り(TSV)フォーマットに変換
$ seqkit tab2fx -h
convert tabular format (first two/three columns) to FASTA/Q format
Usage:
seqkit tab2fx [flags]
Flags:
-p, --comment-line-prefix strings comment line prefix (default [#,//])
-h, --help help for tab2fx
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^([^\\s]+)\\s?")
-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)
TSV形式で10リードだけ出力
seqkit fx2tab input.fq |head -n 10 > output.tsv
配列長、GC率だけ表示(*csvtkはcondaで導入可)
seqkit fx2tab contig.fa.gz -l -g -n -i -H | head -n 10 | csvtk -t -C '&' pretty
- -g print GC content
- -H print header line
- -l print sequence length
- -n only print names (no sequences and qualities)
- -i print ID instead of full head
#name seq qual length GC
NODE_1_length_254790_cov_21.168655 254790 47.85
NODE_2_length_246122_cov_21.354689 246122 47.88
NODE_3_length_214379_cov_20.356971 214379 48.91
NODE_4_length_200639_cov_20.843756 200639 47.45
NODE_5_length_186136_cov_20.060989 186136 48.74
NODE_6_length_183861_cov_21.459733 183861 47.74
NODE_7_length_180715_cov_21.655153 180715 47.09
NODE_8_length_141474_cov_21.429290 141474 48.11
NODE_9_length_130080_cov_20.966290 130080 47.76
translate (リンク) DNA/RNAをアミノ酸配列に翻訳
$ seqkit translate -h
translate DNA/RNA to protein sequence (supporting ambiguous bases)
Note:
1. This command supports codons containing any ambiguous base.
Please switch on flag -L INT for details. e.g., for standard table:
ACN -> T
CCN -> P
CGN -> R
CTN -> L
GCN -> A
GGN -> G
GTN -> V
TCN -> S
MGR -> R
YTR -> L
Translate Tables/Genetic Codes:
# https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes
1: The Standard Code
2: The Vertebrate Mitochondrial Code
3: The Yeast Mitochondrial Code
4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code
5: The Invertebrate Mitochondrial Code
6: The Ciliate, Dasycladacean and Hexamita Nuclear Code
9: The Echinoderm and Flatworm Mitochondrial Code
10: The Euplotid Nuclear Code
11: The Bacterial, Archaeal and Plant Plastid Code
12: The Alternative Yeast Nuclear Code
13: The Ascidian Mitochondrial Code
14: The Alternative Flatworm Mitochondrial Code
16: Chlorophycean Mitochondrial Code
21: Trematode Mitochondrial Code
22: Scenedesmus obliquus Mitochondrial Code
23: Thraustochytrium Mitochondrial Code
24: Pterobranchia Mitochondrial Code
25: Candidate Division SR1 and Gracilibacteria Code
26: Pachysolen tannophilus Nuclear Code
27: Karyorelict Nuclear
28: Condylostoma Nuclear
29: Mesodinium Nuclear
30: Peritrich Nuclear
31: Blastocrithidia Nuclear
Usage:
seqkit translate [flags]
Flags:
-x, --allow-unknown-codon translate unknown code to 'X'. And you may not use flag --trim which removes 'X'
-F, --append-frame append frame information to sequence ID
--clean change all STOP codon positions from the '*' character to 'X' (an unknown residue)
-f, --frame strings frame(s) to translate, available value: 1, 2, 3, -1, -2, -3, and 6 for all six frames (default [1])
-h, --help help for translate
-M, --init-codon-as-M translate initial codon at beginning to 'M'
-l, --list-transl-table int show details of translate table N, 0 for all (default -1)
-L, --list-transl-table-with-amb-codons int show details of translate table N (including ambigugous codons), 0 for all. (default -1)
-T, --transl-table int translate table/genetic code, type 'seqkit translate --help' for more details (default 1)
--trim remove all 'X' and '*' characters from the right end of the translation
Global Flags:
--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)
--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments
-w, --line-width int line width when outputting FASTA format (0 for no wrap) (default 60)
-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")
--quiet be quiet and do not show extra information
-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")
-j, --threads int number of CPUs. can also set with environment variable SEQKIT_THREADS) (default 4)
1フレームで翻訳
seqkit translate --frame 1 --transl-table 1 input.fa > output.fa
#The Bacterial, Archaeal and Plant Plastid Code
seqkit translate --frame 1 --transl-table 11 input.fa > output.fa
- -T, --transl-table int translate table/genetic code, type 'seqkit translate --help' for more details (default 1)
--trimをつけると*を除ける。
他にもbashに対応したシェル自動補完コマンドがあります。
引用
SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation
Wei Shen, Shuai Le, Yan Li , Fuquan Hu
PLOS ONE. 2016. doi:10.1371/journal.pone.0163962.