macでインフォマティクス

macでインフォマティクス

NGS関連のインフォマティクス情報についてまとめています。

seqkitに新しく追加されたコマンドを確認する

2019 8/7 誤字修正

 

seqkitを以前ブログで紹介した時は0..6.0でしたが、1年半近く経ち、2018年12月20日現在ではバージョンが0.9.4まで上がっています。ありがたいことに、bug fixだけでなく、新しいコマンドが複数追加されています。v0.6.1以降に追加されたコマンドについてまとめます。

 


インストール 

ダウンロード

http://bioinf.shenwei.me/seqkit/download/

解凍するとbinaryファイルできる。

実行権をつけ、それを/usr/local/bin/などに移動して使う。

chmod u+x seqtk #実行権
ls -l seqtk #確認
#パスが通ったディレクトリに移動
mv seqtk /usr/local/bin

#anacondaを使っているならcondaで入る。例えば20181219時点で最新のv0.9.3を入れる
conda install -c bioconda -y seqkit==0.9.3

 

0.6.0の時点であったコマンドを復習しておく。

アルファベット順

  • common  IDや配列が共通する配列抽出
  • dup        配列を指定回、複製
  • fq2fa   fastqをfastaに変換
  • fx2tab    fasta/fastqをtab形式に変換し、さらにGCなどの情報を表示
  • grep       特定の配列を検索
  • head   指定数だけ先頭からリードを抽出
  • locate    配列をリプレース
  • rename    duplicateしたIDをユニークな名前に修復
  • rmdup     duplication除去
  • restart     環状ゲノムのスタート位置変更
  • sample   リードをランダムサンプリング(10%だけランダムに取り出す)
  • seq       配列を変形(逆相補鎖、DNA<=>RNA、指定サイズ以上/以下)
  • shuffle    ランダムな順番にする
  • sliding     配列を指定の単位ずつずらしながら取り出す
  • sort      長さでソート
  • split     分割(fastqを均等にn個に分けて出力)
  • stats       fasta/fastqの簡単なstatistics
  • subseq      特定のリードや配列だけ抽出 

 

新しいコマンド(アルファベット順)

concat (リンク)     同じIDの配列を指定長ずつ合体

$ seqkit concat -h

concatenate sequences with same ID from multiple files

 

Example: concatenating leading 2 bases and last 2 bases

 

    $ cat t.fa

    >test

    ACCTGATGT

    >test2

    TGATAGCTACTAGGGTGTCTATCG

 

    $ seqkit concate <(seqkit subseq -r 1:2 t.fa) <(seqkit subseq -r -2:-1 t.fa)

    >test

    ACGT

    >test2

    TGCG

 

Usage:

  seqkit concat [flags]

 

Aliases:

  concat, concate

 

Flags:

  -h, --help   help for concat

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

上の例のように、1つ目の配列の先頭1-2文字と、2つ目の配列の後ろから1-2文字が結合(同じIDのもの同士)。

seqkit concat <(seqkit subseq -r 1:2 t.fa) <(seqkit subseq -r -2:-1 t.fa)

 

 

convert (リンク)       fastqのクオリティエンコードをSolexa形式(1.3)、サンガーイルミナ(1.8)形式で相互変換

$ seqkit convert -h

convert FASTQ quality encoding between Sanger, Solexa and Illumina

 

Usage:

  seqkit convert [flags]

 

Flags:

  -d, --dry-run                         dry run

  -f, --force                           for Illumina-1.8+ -> Sanger, truncate scores > 40 to 40

      --from string                     source quality encoding. if not given, we'll guess it

  -h, --help                            help for convert

  -n, --nrecords int                    number of records for guessing quality encoding (default 1000)

  -N, --thresh-B-in-n-most-common int   threshold of 'B' in top N most common quality for guessing Illumina 1.5. (default 4)

  -F, --thresh-illumina1.5-frac float   threshold of faction of Illumina 1.5 in the leading N records (default 0.1)

      --to string                       target quality encoding (default "Sanger")

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

クオリティフォーマット確認

seqkit convert input.fq.gz |seqkit head -n 1

このデータは[Sanger Illumina 1.8

$ seqkit convert SRR5035895_1.fastq |seqkit head -n 1

[INFO] possible quality encodings: [Sanger Illumina-1.8+]

[INFO] guessed quality encoding: Sanger

[INFO] converting Sanger -> Sanger

[WARN] source and target quality encoding match.

@SRR5035895.1 1 length=228

AAAAAGTATCTACGACACGCGATTTAGTGTCGAGAAAGGCGACTAAGACGGTAGCTTCGTCGGTTGCTTCGCTTCCTGCACAAGANNNNNNNNCNGTTGANNNNAANNNNCNTNNNNNNNNGCGTAGAACATTACTCAGAGCGNNNAGANNNNANGGNNNNNNAGAAAAGNNNANNNTTGCTGTAGATCTCTTGGAGCGTTACANAGGGCATNGCCNNNAAGATTTCC

+

CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG########:#:BFFG####::####:#:########::DFFGGGGGGGGGGGGGGGGG###:@C####3#88######66FFGG?###5###66=FGGGGGGFG<FFGGGGGFGEGGGE#2=EDGG6#*=C###2=CGFGFFF

デフォルトの設定でランすると、イルミナ1.8に変換するようになっている。イルミナ1.5に変換。

seqkit convert Illimina1.8.fq.gz --to Illumina-1.5+ > out.fq

 

 

duplicate (リンク)      配列を複製

$ seqkit duplicate -h

duplicate sequences N times

 

You may need "seqkit rename" to make the the sequence IDs unique.

 

Usage:

  seqkit duplicate [flags]

 

Aliases:

  duplicate, dup

 

Flags:

  -h, --help        help for duplicate

  -n, --times int   duplication number (default 1)

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

全配列ではなく、fastaの先頭の配列だけ1回複製する。seqkit renameでユニークなヘッダーにして出力。

seqkit head -n 1 input.fa | seqkit duplicate -n 2 | seqkit rename > out.fa

 

 

range (リンク)     指定間のリードを抽出 (headでもtailでも単独では対応できない時に)

$ seqkit range -h

print FASTA/Q records in a range (start:end)

 

Usage:

  seqkit range [flags]

 

Flags:

  -h, --help           help for range

  -r, --range string   range. e.g., 1:12 for first 12 records (head -n 12), -12:-1 for last 12 records (tail -n 12)

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

user-no-MacBook-Pro-2:Illumina_MiSeq_metagenome_of_Kelp_associated_biofilm user$ $

先頭から101番目 ~ 150番目の50リード出力

seqkit range -r 101:150 input.fq > out.fq

 後ろから-1番目 ~ -100番目の50リード出力

seqkit range -r -100:-1 input.fq > out.fq

 

faidx (リンク)    FASTAのindexを作成したり、特定の配列を抽出(samttools faidx互換だが、より高機能)

$ seqkit faidx -h

create FASTA index file and extract subsequence

 

This command is similar with "samtools faidx" but has some extra features:

 

  1. output full header line with flag -f

  2. support regular expression as sequence ID with flag -r

 

Usage:

  seqkit faidx [flags] <fasta-file> [regions...]

 

Flags:

  -f, --full-head     print full header line instead of just ID. New fasta index file ending with .seqkit.fai will be created

  -h, --help          help for faidx

  -i, --ignore-case   ignore case

  -r, --use-regexp    IDs are regular expression. But subseq region is not suppored here.

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

 

seqkit faidx input.fa

 input.fa.faiができる。

 

 

restartリンク) 環状ゲノムのスタート位置を変える

$ seqkit restart -h

reset start position for circular genome

 

Examples

 

    $ echo -e ">seq\nacgtnACGTN"

    >seq

    acgtnACGTN

 

    $ echo -e ">seq\nacgtnACGTN" | seqkit restart -i 2

    >seq

    cgtnACGTNa

 

    $ echo -e ">seq\nacgtnACGTN" | seqkit restart -i -2

    >seq

    TNacgtnACG

 

Usage:

  seqkit restart [flags]

 

Flags:

  -h, --help            help for restart

  -i, --new-start int   new start position (1-base, supporting negative value counting from the end) (default 1)

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

 入力環状

ゲノムのポジション10000からリスタート

seqkit restart -i 10000 input.fa > output.fa

 

 

split2リンク) 入力配列を分割する。

$ seqkit split2 -h

split sequences into files by part size or number of parts

 

This command supports FASTA and paired- or single-end FASTQ with low memory

occupation and fast speed.

 

The file extensions of output are automatically detected and created

according to the input files.

 

Usage:

  seqkit split2 [flags]

 

Flags:

  -p, --by-part int      split sequences into N parts

  -s, --by-size int      split sequences into multi parts with N sequences

  -f, --force            overwrite output directory

  -h, --help             help for split2

  -O, --out-dir string   output directory (default value is $infile.split)

  -1, --read1 string     read1 file

  -2, --read2 string     read2 file

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

FASTAを10000リードずつに分ける。

seqkit split2 hairpin.fa.gz -s 10000 -f

FASTAを4パートに分ける。

seqkit split hairpin.fa.gz -p 4 -f

 シングルエンドFASTQを2パートに分ける。

seqkit split2 reads_1.fq.gz -p 2 -O out -f

 ペアエンドFASTQを2パートに分ける。

seqkit split2 -1 reads_1.fq.gz -2 reads_2.fq.gz -p 2 -O out -f

 

tab2fxリンク) FASTA/FASTQをタブ区切り(TSV)フォーマットに変換

$ seqkit tab2fx -h

convert tabular format (first two/three columns) to FASTA/Q format

 

Usage:

  seqkit tab2fx [flags]

 

Flags:

  -p, --comment-line-prefix strings   comment line prefix (default [#,//])

  -h, --help                          help for tab2fx

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

kazuma@kamisakumanoMBP:~/Documents$ 

TSV形式で10リードだけ出力

seqkit fx2tab input.fq |head -n 10 > output.tsv

f:id:kazumaxneo:20181220205208p:plain

配列長、GC率だけ表示(*csvtkはcondaで導入可)

seqkit fx2tab contig.fa.gz -l -g -n -i -H | head -n 10 | csvtk -t -C '&' pretty
  • -g    print GC content
  • -H   print header line
  • -l     print sequence length
  • -n    only print names (no sequences and qualities)
  • -i     print ID instead of full head

#name                                seq   qual   length   GC

NODE_1_length_254790_cov_21.168655                254790   47.85

NODE_2_length_246122_cov_21.354689                246122   47.88

NODE_3_length_214379_cov_20.356971                214379   48.91

NODE_4_length_200639_cov_20.843756                200639   47.45

NODE_5_length_186136_cov_20.060989                186136   48.74

NODE_6_length_183861_cov_21.459733                183861   47.74

NODE_7_length_180715_cov_21.655153                180715   47.09

NODE_8_length_141474_cov_21.429290                141474   48.11

NODE_9_length_130080_cov_20.966290                130080   47.76

 

 

translateリンク)  DNA/RNAアミノ酸配列に翻訳

$ seqkit translate -h

translate DNA/RNA to protein sequence

 

Translate Tables/Genetic Codes:

 

    # https://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/index.cgi?chapter=tgencodes

 

     1: The Standard Code

     2: The Vertebrate Mitochondrial Code

     3: The Yeast Mitochondrial Code

     4: The Mold, Protozoan, and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code

     5: The Invertebrate Mitochondrial Code

     6: The Ciliate, Dasycladacean and Hexamita Nuclear Code

     9: The Echinoderm and Flatworm Mitochondrial Code

    10: The Euplotid Nuclear Code

    11: The Bacterial, Archaeal and Plant Plastid Code

    12: The Alternative Yeast Nuclear Code

    13: The Ascidian Mitochondrial Code

    14: The Alternative Flatworm Mitochondrial Code

    16: Chlorophycean Mitochondrial Code

    21: Trematode Mitochondrial Code

    22: Scenedesmus obliquus Mitochondrial Code

    23: Thraustochytrium Mitochondrial Code

    24: Pterobranchia Mitochondrial Code

    25: Candidate Division SR1 and Gracilibacteria Code

    26: Pachysolen tannophilus Nuclear Code

    27: Karyorelict Nuclear

    28: Condylostoma Nuclear

    29: Mesodinium Nuclear

    30: Peritrich Nuclear

    31: Blastocrithidia Nuclear

 

Usage:

  seqkit translate [flags]

 

Flags:

  -x, --allow-unknown-codon   translate unknown code to 'X'. And you may not use flag --trim which removes 'X'

      --clean                 changes all STOP codon positions from the '*' character to 'X' (an unknown residue)

      --frame int             frame to translate, available value: 1, 2, 3, -1, -2, -3 (default 1)

  -h, --help                  help for translate

  -T, --transl-table int      translate table/genetic code, type 'seqkit translate --help' for more details (default 1)

      --trim                  removes all 'X' and '*' characters from the right end of the translation

 

Global Flags:

      --alphabet-guess-seq-length int   length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

      --id-ncbi                         FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

      --id-regexp string                regular expression for parsing ID (default "^([^\\s]+)\\s?")

  -w, --line-width int                  line width when outputing FASTA format (0 for no wrap) (default 60)

  -o, --out-file string                 out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

      --quiet                           be quiet and do not show extra information

  -t, --seq-type string                 sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

  -j, --threads int                     number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

 1フレームで翻訳

seqkit translate --frame 1 --transl-table 1 input.fa > output.fa

 --trimをつけると*を除ける。

 

他にもbashに対応したシェル自動補完コマンドがあります。

引用

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

Wei Shen, Shuai Le, Yan Li , Fuquan Hu 

PLOS ONE. 2016. doi:10.1371/journal.pone.0163962.