SeqKit has got 1000 citations 🎉
We've been fixing bugs and adding new features during the last 7 years, with 17 sub-commands added and 50 versions released after publication.
We also emphasize usability, sustainable development and maintenance of our tools 🤝. pic.twitter.com/wOr4RAwMWk
— Wei Shen 沈伟 (@shenwei356) March 21, 2023

2023/03/20

SeqKit v2.4.0 supports the bzip2 format and brings many bug fixes and new features. https://t.co/lsKkDcmsls
— Wei Shen 沈伟 (@shenwei356) March 17, 2023

2021 8/28

🎉SeqKit bumps to v2.0 after five years🎉, bringing much faster speed: 2-3X faster on processing gzipped FASTQ files compared to previous v0.16.1. Even faster than 'seqtk | pigz -c'.https://t.co/LJEQZng3tc
— Wei Shen 沈伟 (@shenwei356) 2021年8月28日

さらにコマンドが増えました！すごい！！

seqkit v0.11.0 is released with 5 new subcommands, thanks @boti_ka ! available on bioconda too~ https://t.co/Y4UqTfHjEV https://t.co/RCEsVBzCwq
— Wei Shen (@shenwei356) September 25, 2019

インストール

Github

ダウンロード

http://bioinf.shenwei.me/seqkit/download/

解凍するとbinaryファイルできる。

実行権をつけ、それを/usr/local/bin/などに移動して使う。

chmod u+x seqtk #実行権
ls -l seqtk #確認
mv seqtk /usr/local/bin #パスが通ったディレクトリに移動。ln -sでリンクしても良い。

> seqkit #v0.12.0

$ seqkit --help

SeqKit -- a cross-platform and ultrafast toolkit for FASTA/Q file manipulation

Version: 0.12.0

Author: Wei Shen <shenwei356@gmail.com>

Documents : http://bioinf.shenwei.me/seqkit

Source code: https://github.com/shenwei356/seqkit

Please cite: https://doi.org/10.1371/journal.pone.0163962

Usage:

seqkit [command]

Available Commands:

amplicon retrieve amplicon (or specific region around it) via primer(s)

bam monitoring and online histograms of BAM record features

common find common sequences of multiple files by id/name/sequence

concat concatenate sequences with same ID from multiple files

convert convert FASTQ quality encoding between Sanger, Solexa and Illumina

duplicate duplicate sequences N times

faidx create FASTA index file and extract subsequence

fish look for short sequences in larger sequences using local alignment

fq2fa convert FASTQ to FASTA

fx2tab convert FASTA/Q to tabular format (with length/GC content/GC skew)

genautocomplete generate shell autocompletion script

grep search sequences by ID/name/sequence/sequence motifs, mismatch allowed

head print first N FASTA/Q records

help Help about any command

locate locate subsequences/motifs, mismatch allowed

mutate edit sequence (point mutation, insertion, deletion)

range print FASTA/Q records in a range (start:end)

rename rename duplicated IDs

replace replace name/sequence by regular expression

restart reset start position for circular genome

rmdup remove duplicated sequences by id/name/sequence

sample sample sequences by number or proportion

sana sanitize broken single line fastq files

seq transform sequences (revserse, complement, extract ID...)

shuffle shuffle sequences

sliding sliding sequences, circular genome supported

sort sort sequences by id/name/sequence/length

split split sequences into files by id/seq region/size/parts (mainly for FASTA)

split2 split sequences into files by size/parts (FASTA, PE/SE FASTQ)

stats simple statistics of FASTA/Q files

subseq get subsequences by region/gtf/bed, including flanking sequences

tab2fx convert tabular format to FASTA/Q format

translate translate DNA/RNA to protein sequence (supporting ambiguous bases)

version print version information and check for update

watch monitoring and online histograms of sequence features

Flags:

--alphabet-guess-seq-length int length of sequence prefix of the first FASTA record based on which seqkit guesses the sequence type (0 for whole seq) (default 10000)

-h, --help help for seqkit

--id-ncbi FASTA head is NCBI-style, e.g. >gi|110645304|ref|NC_002516.2| Pseud...

--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")

--infile-list string file of input files list (one file per line), if given, they are appended to files from cli arguments

-w, --line-width int line width when outputing FASTA format (0 for no wrap) (default 60)

-o, --out-file string out file ("-" for stdout, suffix .gz for gzipped out) (default "-")

--quiet be quiet and do not show extra information

-t, --seq-type string sequence type (dna|rna|protein|unlimit|auto) (for auto, it automatically detect by the first sequence) (default "auto")

-j, --threads int number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

Use "seqkit [command] --help" for more information about a command.

実行方法

入力はfasta、fastq、.gzなど。

v2になっていくつかコマンドが変化しています（v２リリース）。下に書いてあるのは、v２より前のバージョンなので、新しいバージョンを使っている方は注意して下さい。

seq　配列を変形

reverse complementary (相補鎖にしてさらにリバースにする）

seqkit seq -pr input.fa > output.fa

-p　　　　　　complement sequence (blank for Protein sequence)
-r reverse sequence
--dna2rna DNA to RNA
--rna2dna RNA to DNA
-l 　　 print sequences in lower case
-i　 print ID instead of full head
-u　　　　 print sequences in upper case
-n　　　　 only print names
-m only print sequences longer than the minimum length (-1 for no limit) (default -1)
-M, 　　 only print sequences shorter than the maximum length (-1 for no limit) (default -1)

ヘッダーだけ取り出すことでcontigの数をカウントする。

seqkit seq -n contig.fa |wc -l

fastqからヘッダーのIDだけ取り出す。

seqkit seq -ni fastq |less

--id-regexpを使うと、標準正規表現で仕切りを自由にカスタム可能（リンク）。

ギャップを除く。("-"が除かれる)。

seqkit seq -g scaffolds.fa > output.fasta

1000bp以上の配列だけ取り出す。

seqkit seq -m 1000 scaffolds.fa > output.fasta

1000bp以下の配列だけ取り出し、statsコマンドに渡して分析。

seqkit seq -M 1000 scaffolds.fa |seqkit stats

他にもいくつかのオプションがある。

subseq 特定のリードや配列だけ抽出

最初の100-bpを取り出す。

seqkit subseq -r 1:100 contigs.fa > left.fa

最後の100-bpを取り出す（マイナスで後ろからカウント）。

seqkit subseq -r -100:-1 contig.fa > right.fa

最初と最後の100-bpを除く。

seqkit subseq -r 100:-100 contig.fa > center.fa

gtfで指定した領域の配列だけ取り出す。

seqkit subseq --gtf cDNA.gtf genome.fa > cdna.fa

gtfで指定した領域の配列とその100bp上流までを取り出す。

seqkit subseq --gtf cDNA.gtf -u 100 genome.fa > cdna_and100bp_upstream.fa

-fもつけると、上流配列だけ取り出される。

bedファイルの指定領域かつchr1の配列のみ取り出す。

seqkit subseq --bed Homo_sapiens.GRCh38.bed.gz --chr 1 input.fa >  chr1_bed.fa

bedtoolsのツールと似ているが、動作が高速でメモリ使用量も少ないので、こちらの方が使いやすい。

sliding　配列を指定の単位ずつずらしながら取り出す。

100-bp windowでゲノム配列を切り出す（GC含量カウントに使ったりする）。

seqkit sliding -s 1 -W 100 genome.fa > 100bp_windows.fa

1bpずつずらしながら、100bpの配列が取り出される。

環状ゲノムなら-Cをつける。

GC含量をカウント

seqkit fx2tab genome.fa| head -n 1 | seqkit tab2fx | seqkit sliding -s 1 -W 100 | seqkit fx2tab -n -g > GC.txt

ウィンドウサイズ100でのGC率が出力される。平均GC率などを求めるなら、続けて以下のように打つ。

cut -f 4 GC.txt > GC #GC列を抽出
st --mean GC

stコマンドはbrewで導入できます。

stats　fastaの簡単な分析を行う。

カレントディレクトリの全fa.gzファイルを分析。

seqkit stats *.fa.gz

user$ seqkit stats R1.fastq

file format type num_seqs sum_len min_len avg_len max_len

T1second_R1.fastq FASTQ DNA 261,774 77,786,418 35 297.2 301

-aをつけるとより詳細な情報を表示。

user$ seqkit stats -a R1.fastq

file format type num_seqs sum_len min_len avg_len max_len sum_gap N50

T1second_R1.fastq FASTQ DNA 261,774 77,786,418 35 297.2 301 0 301

-Tをつけるとtabular formatで表示。チュートリアルにはcsvtkなどに渡してさらに整形する例も記載されている。

並列化と追加情報つき。csvtkに渡す

seqkit stats *.fa.gz -j 8 -a | csvtk pretty -t

file format type num_seqs sum_len min_len avg_len max_len Q1 Q2 Q3 sum_gap N50 Q20(%) Q30(%)

WS-tuber-1_S276_L001_R1_001.fastq FASTQ DNA 28630 8580372 35 299.7 301 299 300 301 0 300 98.00 94.16

WS-tuber-1_S276_L001_R1_001.fastq.gz FASTQ DNA 28630 8580372 35 299.7 301 299 300 301 0 300 98.00 94.16

WS-tuber-1_S276_L001_R2_001.fastq FASTQ DNA 28630 8598293 35 300.3 301 300 301 301 0 301 93.54 85.00

WS-tuber-1_S276_L001_R2_001.fastq.gz FASTQ DNA 28630 8598293 35 300.3 301 300 301 301 0 301 93.54 85.00

WS-tuber-2_S277_L001_R1_001.fastq.gz FASTQ DNA 28547 8555935 35 299.7 301 299 300 301 0 300 98.00 94.24

WS-tuber-2_S277_L001_R2_001.fastq.gz FASTQ DNA 28547 8575923 35 300.4 301 300 301 301 0 301 92.74 83.63

WS-tuber-3_S278_L001_R1_001.fastq.gz FASTQ DNA 35579 10656382 35 299.5 301 299 300 301 0 300 98.13 94.37

WS-tuber-3_S278_L001_R2_001.fastq.gz FASTQ DNA 35579 10680068 35 300.2 301 300 301 301 0 301 93.04 84.03

EMBOSSのinfoseqも便利（link）

fq2fa　fastqをfastaに変換

seqkit fq2fa input.fastq > output.fa

gz圧縮ファイルにも対応。

seqkit fq2fa input.fastq.gz > output.fa.gz

fx2tab　fasta/fastqをtab形式に変換し、さらにGCなどの情報を表示。

seqkit fx2tab input.fa > output.fa

-g　print GC content
-G　print GC-Skew
-H　print header line　(先頭行のコメント出力)
-l　print sequence length
-n　only print names (no sequences and qualities)
-i　print ID instead of full head

出力はタブ区切り形式になっている。 fastqならクオリティも後ろについている。

f:id:kazumaxneo:20170808221310j:plain

GC-skew、GC含量、長さも出力。配列は出力しない。

seqkit fx2tab -Ggln input.fa > output.fa

出力。名前の後ろの列に長さ、GC率、GC-skewがプリントされている。

f:id:kazumaxneo:20170808221724j:plain

タブからfastqなどに戻すtab2fxというコマンドもある。

head　指定数だけ先頭からリードを取り出す。

seqkit head -10 input.fastq > 10read.fastq

先頭から10リード（つまり40行）出力される。

rename　duplicateしたIDをユニークな名前に修復する。

seqkit rename input.fastq > renamed.fastq

-n　check duplicated by full name instead of just id.

duplicateしていると判断されると"_2"とか名前につく。

restart　環状ゲノムのスタート位置を変える。

seqkit restart -i 100 genome.fa > genome_restart.fa

-i　new start position (1-base, supporting negative value counting from the end) (default 1).

この例だと100番目の塩基がスタートになる（元々の99塩基は最後につく）。

shuffle　順番をバラバラにする。

seqkit shuffle input.fastq > shuffled.fastq

-2　two-pass mode read files twice to lower memory usage. (only for FASTA format).
-s　rand seed for shuffle (default 23)

一度メモリーに読み込むので、シーケンスデータが大きすぎるとランできない恐れあり。fasta限定だが、-2をつけるとtwo-passにしてメモリ使用量を削減可能。ペアリードをシャッフルするなら、R1とR2それぞれ-sの値を同じ値にすること。

sort　ソート。

長さでソートし、10000-bp以上の配列だけ出力。

seqkit sort --quiet -r -l conitg.fa -L 10000 > sorted.fa

-l　by sequence length
-n　by full name instead of just id
-s　by sequence
-r　reverse the result
-L　length of sequence prefix on which seqkit sorts by sequences (0 for whole sequence) (default 10000)
-2　two-pass mode read files twice to lower memory usage. (only for FASTA format)
-i　ignore case

配列でソート。

seqkit sort --quiet -s conitg.fa > sorted.fa

長さでソートして、名前と長さ、ヘッダー行だけタブ出力。

seqkit sort --quiet -l conitg.fa | seqkit fx2tab -l -n -i -H > length.txt

rmdup　duplicationを除去。

フルネームが重複したリードを除去。

seqkit rmdup -n input.fq

-n　by full name instead of just id
-s　by seq
-D　file to save number and list of duplicated seqs
-d　string file to save duplicated seqs
-i　ignore case

簡単な動作テストはseqkitのdupコマンドでfastqを複製して行えば良い。

seqkit dup -n 2 input.fastq > duplicated.fastq #全リードを1回ずつコピー。ファイルサイズが倍はなる。
seqkit rmdup -n duplicated.fastq > dup_removed.fastq

user$ seqkit dup -n 2 input.fastq > dup.fq

user$ seqkit rmdup -n dup.fq > dup_removed.fastq

[INFO] 261774 duplicated records removed

common　複数ファイルの共通する配列や名前を分析

カレントディレクトリのfastaファイルから、IDが共通する配列を抽出。

seqkit common *fa > common.fa

-n　match by full name instead of just id
-s　match by sequence
-i　ignore case

カレントディレクトリのfastqファイルから、配列が一致する配列を抽出。

seqkit common -s *fq > common_seq.fq

sample　リードをランダムサンプリング

リードを10%ランダムサンプリングする。

seqkit sample -p 0.1 input.fastq > output.fastq

#pigzに渡しgzip出力する。
seqkit sample -p 0.1 input.fq.gz |pigz -c - -p 8 > output.fq.gz

-n　int sample by number (result may not exactly match)
-p　float sample by proportion
-s　 int rand seed (default 11)
-2　2-pass mode read files twice to lower memory usage. Not allowed when reading from stdin
-j number of CPUs. (default value: 1 for single-CPU PC, 2 for others) (default 2)

-sでランダム発生器を制御しているので、ペアリードは-sを同じ値にして抽出する。

split　分割する。

fastqを均等に10ファイルに分けて出力。

seqkit split -p 10 input.fastq

-p　split squences into N parts

input.fastq.split/ができ、その中に10ファイルにfastqが分割され出力される。

巨大なデータを分割して渡したり、サーバーからダウンロードする時に使えるかもしれない。

他にもいくつかのコマンドがあるが、動作は１リードを１行として扱う以外unixの同名コマンドと同じ感じで使える。

locate　配列をリプレースする。
grep　特定の配列を検索する。
dup 　配列を指定回、複製させる。

紹介したコマンドや例は全てではありません。使えそうなコマンドがあれば、公式マニュアルで確認してみてください。

Usage - SeqKit - Ultrafast FASTA/Q kit

追加コマンド

引用

SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation

Wei Shen, Shuai Le, Yan Li , Fuquan Hu

PLOS ONE. 2016. doi:10.1371/journal.pone.0163962.

似たコマンドにseqtkがある。以前紹介しています。

2017-08-08

クレード特異的マーカー遺伝子を使いメタゲノム配列のtaxnomy assigmentを行う MetaPhlAn2、クラスタリングするHclust2、系統樹を作成するGraPhlAn

metagenome 生物種の推定 (taxonomic profiling) 2015 Nature Methods phylogenetic marker genes taxonomic assignment docker

2019 5/17 condaインストール追記、イントロ文章修正、

2019 7/2タイトル修正

2019 7/4 インストール追記

2019 7/6 インストール追記タイトル修正、誤解を招く文章を削除

2019 10/8 インストール追記

2020 8/24 condaインストール追記

2021 7/16 mambaに変更

　MetaPhlAnは、メタゲノムショットガンシーケンスデータから微生物群集の構成をプロファイリングするための計算ツールである。 MetaPhlAnは、〜17,000のリファレンスゲノム（〜13,500の細菌および古細菌、〜3,500のウイルス、〜110の真核生物）から同定された独自のクレード特異的マーカー遺伝子に依存している。

　1秒間に最大25,000回のリード（1つのCPU上）の分析速度（既存の方法と比較して桁違いに速い）。
MetaPhlAnマーカーとしての明確な分類アサインはクレード固有のものである。
生物学的相対abundanceの正確な推定（リードの割合よりもむしろ細胞数の観点から）。
細菌、古細菌、真核生物、ウイルスの種レベルの解像度。
いくつかの合成データセットおよび何千もの実際のメタゲノムに関するプロファイリング精度の広範な検証。

　2016年には、同じ研究チームからPanPhlAnも発表されている。PanPhlAnは種内の多様性を遺伝子の有無で評価するツールである。MetaPhlAn2でメタゲノムにどんな種のゲノムがあるか調べ、PanPhlAnで種内の多様性を調べることが可能になった。

マニュアル

https://bitbucket.org/biobakery/biobakery/wiki/metaphlan2

オーサーらのツール一覧

http://segatalab.cibio.unitn.it/tools/

インストール

condaやbrewで本体含め全ての依存をワンライナーで導入できる。

#全て仮想環境に入れてしまう
mamba create -n metaphlan2 python=2.7 -y
conda activate metaphlan2
#metaphylan2
mamba install -c bioconda -y metaphlan2
#or (faster)
mamba install -c bioconda/label/cf201901 metaphlan2 
#hclust2
mamba install -c bioconda -y hclust2
#graphlan
mamba install -c bioconda -y graphlan
mamba install -c bioconda -y export2graphlan



#homebrew
brew install biobakery/biobakery/graphlan
brew install metaphlan2


#docker
docker pull biobakery/metaphlan2:2.7.7
#help
docker run --rm biobakery/metaphlan2 metaphlan2.py -h
#run
docker run --rm -itv $PWD:/data -w /data biobakery/metaphlan2 metaphlan2.py -h

util/のスクリプトが入ってないなら、hg cloneして直接本体を叩く（mergeのステップで使うスクリプト）。

#hgコマンドがないなら　sudo apt update && apt install mercurial
hg clone https://bitbucket.org/biobakery/biobakery

> metaphlan2.py -h

# metaphlan2.py -h

usage: metaphlan2.py --input_type

{fastq,fasta,multifasta,multifastq,bowtie2out,sam}

[--mpa_pkl MPA_PKL] [--force]

[--bowtie2db METAPHLAN_BOWTIE2_DB] [-x INDEX]

[--bt2_ps BowTie2 presets] [--bowtie2_exe BOWTIE2_EXE]

[--bowtie2_build BOWTIE2_BUILD] [--bowtie2out FILE_NAME]

[--no_map] [--tmp_dir] [--tax_lev TAXONOMIC_LEVEL]

[--min_cu_len] [--min_alignment_len]

[--ignore_eukaryotes] [--ignore_bacteria]

[--ignore_archaea] [--stat_q] [--perc_nonzero]

[--ignore_markers IGNORE_MARKERS] [--avoid_disqm]

[--stat] [-t ANALYSIS TYPE] [--nreads NUMBER_OF_READS]

[--pres_th PRESENCE_THRESHOLD] [--clade] [--min_ab]

[-o output file] [--sample_id_key name]

[--sample_id value] [-s sam_output_file]

[--legacy-output] [--no-unknown-estimation]

[--biom biom_output] [--mdelim mdelim] [--nproc N]

[--install] [--force_download]

[--read_min_len READ_MIN_LEN] [-v] [-h]

[INPUT_FILE] [OUTPUT_FILE]

DESCRIPTION

MetaPhlAn version 2.9.12 (12 Jun 2019):

METAgenomic PHyLogenetic ANalysis for metagenomic taxonomic profiling.

AUTHORS: Nicola Segata (nicola.segata@unitn.it), Duy Tin Truong, Francesco Asnicar (f.asnicar@unitn.it), Francesco Beghini (francesco.beghini@unitn.it)

COMMON COMMANDS

We assume here that metaphlan2.py is in the system path and that mpa_dir bash variable contains the

main MetaPhlAn folder. Also BowTie2 should be in the system path with execution and read

permissions, and Perl should be installed)

========== MetaPhlAn 2 clade-abundance estimation =================

The basic usage of MetaPhlAn 2 consists in the identification of the clades (from phyla to species and

strains in particular cases) present in the metagenome obtained from a microbiome sample and their

relative abundance. This correspond to the default analysis type (-t rel_ab).

* Profiling a metagenome from raw reads:

$ metaphlan2.py metagenome.fastq --input_type fastq -o profiled_metagenome.txt

* You can take advantage of multiple CPUs and save the intermediate BowTie2 output for re-running

MetaPhlAn extremely quickly:

$ metaphlan2.py metagenome.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq -o profiled_metagenome.txt

* If you already mapped your metagenome against the marker DB (using a previous MetaPhlAn run), you

can obtain the results in few seconds by using the previously saved --bowtie2out file and

specifying the input (--input_type bowtie2out):

$ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out -o profiled_metagenome.txt

* bowtie2out files generated with MetaPhlAn2 versions below 2.9 are not compatibile.

Starting from MetaPhlAn2 2.9, the BowTie2 ouput now includes the size of the profiled metagenome.

If you want to re-run MetaPhlAn2 using these file you should provide the metagenome size via --nreads:

$ metaphlan2.py metagenome.bowtie2.bz2 --nproc 5 --input_type bowtie2out --nreads 520000 -o profiled_metagenome.txt

* You can also provide an externally BowTie2-mapped SAM if you specify this format with

--input_type. Two steps: first apply BowTie2 and then feed MetaPhlAn2 with the obtained sam:

$ bowtie2 --sam-no-hd --sam-no-sq --no-unal --very-sensitive -S metagenome.sam -x ${mpa_dir}/metaphlan_databases/mpa_v25_CHOCOPhlAn_201901 -U metagenome.fastq

$ metaphlan2.py metagenome.sam --input_type sam -o profiled_metagenome.txt

* We can also natively handle paired-end metagenomes, and, more generally, metagenomes stored in

multiple files (but you need to specify the --bowtie2out parameter):

$ metaphlan2.py metagenome_1.fastq,metagenome_2.fastq --bowtie2out metagenome.bowtie2.bz2 --nproc 5 --input_type fastq

-------------------------------------------------------------------

========== Marker level analysis ============================

MetaPhlAn 2 introduces the capability of charachterizing organisms at the strain level using non

aggregated marker information. Such capability comes with several slightly different flavours and

are a way to perform strain tracking and comparison across multiple samples.

Usually, MetaPhlAn 2 is first ran with the default -t to profile the species present in

the community, and then a strain-level profiling can be performed to zoom-in into specific species

of interest. This operation can be performed quickly as it exploits the --bowtie2out intermediate

file saved during the execution of the default analysis type.

* The following command will output the abundance of each marker with a RPK (reads per kilo-base)

higher 0.0. (we are assuming that metagenome_outfmt.bz2 has been generated before as

shown above).

$ metaphlan2.py -t marker_ab_table metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt

The obtained RPK can be optionally normalized by the total number of reads in the metagenome

to guarantee fair comparisons of abundances across samples. The number of reads in the metagenome

needs to be passed with the '--nreads' argument

* The list of markers present in the sample can be obtained with '-t marker_pres_table'

$ metaphlan2.py -t marker_pres_table metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt

The --pres_th argument (default 1.0) set the minimum RPK value to consider a marker present

* The list '-t clade_profiles' analysis type reports the same information of '-t marker_ab_table'

but the markers are reported on a clade-by-clade basis.

$ metaphlan2.py -t clade_profiles metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt

* Finally, to obtain all markers present for a specific clade and all its subclades, the

'-t clade_specific_strain_tracker' should be used. For example, the following command

is reporting the presence/absence of the markers for the B. fragulis species and its strains

the optional argument --min_ab specifies the minimum clade abundance for reporting the markers

$ metaphlan2.py -t clade_specific_strain_tracker --clade s__Bacteroides_fragilis metagenome_outfmt.bz2 --input_type bowtie2out -o marker_abundance_table.txt

-------------------------------------------------------------------

positional arguments:

INPUT_FILE the input file can be:

* a fastq file containing metagenomic reads

* a BowTie2 produced SAM file.

* an intermediary mapping file of the metagenome generated by a previous MetaPhlAn run

If the input file is missing, the script assumes that the input is provided using the standard

input, or named pipes.

IMPORTANT: the type of input needs to be specified with --input_type

OUTPUT_FILE the tab-separated output file of the predicted taxon relative abundances

[stdout if not present]

Required arguments:

--input_type {fastq,fasta,multifasta,multifastq,bowtie2out,sam}

set whether the input is the multifasta file of metagenomic reads or

the SAM file of the mapping of the reads against the MetaPhlAn db.

[default 'automatic', i.e. the script will try to guess the input format]

Mapping arguments:

--mpa_pkl MPA_PKL The metadata pickled MetaPhlAn file [deprecated]

--force

--bowtie2db METAPHLAN_BOWTIE2_DB

The BowTie2 database file of the MetaPhlAn database. Used if --input_type is fastq, fasta, multifasta, or multifastq [default /root/.pyenv/versions/miniconda2-4.0.5/bin/metaphlan_databases]

-x INDEX, --index INDEX

Specify the id of the database version to use. If the database

files are not found on the local MetaPhlAn2 installation they

will be automatically downloaded [default latest]

--bt2_ps BowTie2 presets

Presets options for BowTie2 (applied only when a multifasta file is provided)

The choices enabled in MetaPhlAn are:

* sensitive

* very-sensitive

* sensitive-local

* very-sensitive-local

[default very-sensitive]

--bowtie2_exe BOWTIE2_EXE

Full path and name of the BowTie2 executable. This option allowsMetaPhlAn to reach the executable even when it is not in the system PATH or the system PATH is unreachable

--bowtie2_build BOWTIE2_BUILD

Full path to the bowtie2-build command to use, deafult assumes that 'bowtie2-build is present in the system path

--bowtie2out FILE_NAME

The file for saving the output of BowTie2

--no_map Avoid storing the --bowtie2out map file

--tmp_dir The folder used to store temporary files [default is the OS dependent tmp dir]

Post-mapping arguments:

--tax_lev TAXONOMIC_LEVEL

The taxonomic level for the relative abundance output:

'a' : all taxonomic levels

'k' : kingdoms

'p' : phyla only

'c' : classes only

'o' : orders only

'f' : families only

'g' : genera only

's' : species only

[default 'a']

--min_cu_len minimum total nucleotide length for the markers in a clade for

estimating the abundance without considering sub-clade abundances

[default 2000]

--min_alignment_len The sam records for aligned reads with the longest subalignment

length smaller than this threshold will be discarded.

[default None]

--ignore_eukaryotes Do not profile eukaryotic organisms

--ignore_bacteria Do not profile bacterial organisms

--ignore_archaea Do not profile archeal organisms

--stat_q Quantile value for the robust average

[default 0.1]

--perc_nonzero Percentage of markers with a non zero relative abundance for misidentify a species

[default 0.33]

--ignore_markers IGNORE_MARKERS

File containing a list of markers to ignore.

--avoid_disqm Deactivate the procedure of disambiguating the quasi-markers based on the

marker abundance pattern found in the sample. It is generally recommended

to keep the disambiguation procedure in order to minimize false positives

--stat EXPERIMENTAL! Statistical approach for converting marker abundances into clade abundances

'avg_g' : clade global (i.e. normalizing all markers together) average

'avg_l' : average of length-normalized marker counts

'tavg_g' : truncated clade global average at --stat_q quantile

'tavg_l' : trunated average of length-normalized marker counts (at --stat_q)

'wavg_g' : winsorized clade global average (at --stat_q)

'wavg_l' : winsorized average of length-normalized marker counts (at --stat_q)

'med' : median of length-normalized marker counts

[default tavg_g]

Additional analysis types and arguments:

-t ANALYSIS TYPE Type of analysis to perform:

* rel_ab: profiling a metagenomes in terms of relative abundances

* rel_ab_w_read_stats: profiling a metagenomes in terms of relative abundances and estimate the number of reads comming from each clade.

* reads_map: mapping from reads to clades (only reads hitting a marker)

* clade_profiles: normalized marker counts for clades with at least a non-null marker

* marker_ab_table: normalized marker counts (only when > 0.0 and normalized by metagenome size if --nreads is specified)

* marker_counts: non-normalized marker counts [use with extreme caution]

* marker_pres_table: list of markers present in the sample (threshold at 1.0 if not differently specified with --pres_th

[default 'rel_ab']

--nreads NUMBER_OF_READS

The total number of reads in the original metagenome. It is used only when

-t marker_table is specified for normalizing the length-normalized counts

with the metagenome size as well. No normalization applied if --nreads is not

specified

--pres_th PRESENCE_THRESHOLD

Threshold for calling a marker present by the -t marker_pres_table option

--clade The clade for clade_specific_strain_tracker analysis

--min_ab The minimum percentage abundace for the clade in the clade_specific_strain_tracker analysis

Output arguments:

-o output file, --output_file output file

The output file (if not specified as positional argument)

--sample_id_key name Specify the sample ID key for this analysis. Defaults to '#SampleID'.

--sample_id value Specify the sample ID for this analysis. Defaults to 'Metaphlan2_Analysis'.

-s sam_output_file, --samout sam_output_file

The sam output file

--legacy-output Old two columns output

--no-unknown-estimation

Ignore estimation of reads mapping to unkwnown clades

--biom biom_output, --biom_output_file biom_output

If requesting biom file output: The name of the output file in biom format

--mdelim mdelim, --metadata_delimiter_char mdelim

Delimiter for bug metadata: - defaults to pipe. e.g. the pipe in k__Bacteria|p__Proteobacteria

Other arguments:

--nproc N The number of CPUs to use for parallelizing the mapping [default 4]

--install Only checks if the MetaPhlAn2 DB is installed and installs it if not. All other parameters are ignored.

--force_download Force the re-download of the latest MetaPhlAn2 database.

--read_min_len READ_MIN_LEN

Specify the minimum length of the reads to be considered when parsing the input file with 'read_fastx.py' script, default value is 70

-v, --version Prints the current MetaPhlAn version and exit

-h, --help show this help message and exit

>python2.7 hclust2/hclust2.py -h

$ python2.7 hclust2/hclust2.py -h

usage: hclust2.py [-h] [-i [INPUT_FILE]] [-o [OUTPUT_FILE]]

[--legend_file [LEGEND_FILE]] [-t INPUT_TYPE] [--sep SEP]

[--out_table OUT_TABLE] [--fname_row FNAME_ROW]

[--sname_row SNAME_ROW] [--metadata_rows METADATA_ROWS]

[--skip_rows SKIP_ROWS] [--sperc SPERC] [--fperc FPERC]

[--stop STOP] [--ftop FTOP] [--def_na DEF_NA]

[--f_dist_f F_DIST_F] [--s_dist_f S_DIST_F]

[--load_dist_matrix_f LOAD_DIST_MATRIX_F]

[--load_dist_matrix_s LOAD_DIST_MATRIX_S]

[--load_pickled_dist_matrix_f LOAD_PICKLED_DIST_MATRIX_F]

[--load_pickled_dist_matrix_s LOAD_PICKLED_DIST_MATRIX_S]

[--save_pickled_dist_matrix_f SAVE_PICKLED_DIST_MATRIX_F]

[--save_pickled_dist_matrix_s SAVE_PICKLED_DIST_MATRIX_S]

[--no_fclustering] [--no_plot_fclustering]

[--no_sclustering] [--no_plot_sclustering]

[--flinkage FLINKAGE] [--slinkage SLINKAGE] [--dpi DPI] [-l]

[--title TITLE] [--title_fontsize TITLE_FONTSIZE] [-s]

[--no_slabels] [--minv MINV] [--maxv MAXV] [--no_flabels]

[--max_slabel_len MAX_SLABEL_LEN]

[--max_flabel_len MAX_FLABEL_LEN]

[--flabel_size FLABEL_SIZE] [--slabel_size SLABEL_SIZE]

[--fdend_width FDEND_WIDTH] [--sdend_height SDEND_HEIGHT]

[--metadata_height METADATA_HEIGHT]

[--metadata_separation METADATA_SEPARATION]

[--colorbar_font_size COLORBAR_FONT_SIZE]

[--image_size IMAGE_SIZE]

[--cell_aspect_ratio CELL_ASPECT_RATIO]

[-c {Accent,Blues,BrBG,BuGn,BuPu,Dark2,GnBu,Greens,Greys,OrRd,Oranges,PRGn,Paired,Pastel1,Pastel2,PiYG,PuBu,PuBuGn,PuOr,PuRd,Purples,RdBu,RdGy,RdPu,RdYlBu,RdYlGn,Reds,Set1,Set2,Set3,Spectral,YlGn,YlGnBu,YlOrBr,YlOrRd,afmhot,autumn,binary,bone,brg,bwr,cool,copper,flag,gist_earth,gist_gray,gist_heat,gist_ncar,gist_rainbow,gist_stern,gist_yarg,gnuplot,gnuplot2,gray,hot,hsv,jet,ocean,pink,prism,rainbow,seismic,spectral,spring,summer,terrain,winter,bbcyr,bbcry,bcry}]

[--bottom_c BOTTOM_C] [--top_c TOP_C] [--nan_c NAN_C]

TBA

optional arguments:

-h, --help show this help message and exit

-i [INPUT_FILE], --inp [INPUT_FILE], --in [INPUT_FILE]

The input matrix

-o [OUTPUT_FILE], --out [OUTPUT_FILE]

The output image file [image on screen of not

specified]

--legend_file [LEGEND_FILE]

The output file for the legend of the provided

metadata

-t INPUT_TYPE, --input_type INPUT_TYPE

The input type can be a data matrix or distance matrix

[default data_matrix]

Input data matrix parameters:

--sep SEP

--out_table OUT_TABLE

Write processed data matrix to file

--fname_row FNAME_ROW

row number containing the names of the features

[default 0, specify -1 if no names are present in the

matrix

--sname_row SNAME_ROW

column number containing the names of the samples

[default 0, specify -1 if no names are present in the

matrix

--metadata_rows METADATA_ROWS

Row numbers to use as metadata[default None, meaning

no metadata

--skip_rows SKIP_ROWS

Row numbers to skip (0-indexed, comma separated) from

the input file[default None, meaning no rows skipped

--sperc SPERC Percentile of sample value distribution for sample

selection

--fperc FPERC Percentile of feature value distribution for sample

selection

--stop STOP Number of top samples to select (ordering based on

percentile specified by --sperc)

--ftop FTOP Number of top features to select (ordering based on

percentile specified by --fperc)

--def_na DEF_NA Set the default value for missing values [default None

which means no replacement]

Distance parameters:

--f_dist_f F_DIST_F Distance function for features [default correlation]

--s_dist_f S_DIST_F Distance function for sample [default euclidean]

--load_dist_matrix_f LOAD_DIST_MATRIX_F

Load the distance matrix to be used for features

[default None].

--load_dist_matrix_s LOAD_DIST_MATRIX_S

Load the distance matrix to be used for samples

[default None].

--load_pickled_dist_matrix_f LOAD_PICKLED_DIST_MATRIX_F

Load the distance matrix to be used for features as

previously saved as pickle file using hclust2 itself

[default None].

--load_pickled_dist_matrix_s LOAD_PICKLED_DIST_MATRIX_S

Load the distance matrix to be used for samples as

previously saved as pickle file using hclust2 itself

[default None].

--save_pickled_dist_matrix_f SAVE_PICKLED_DIST_MATRIX_F

Save the distance matrix for features to file [default

None].

--save_pickled_dist_matrix_s SAVE_PICKLED_DIST_MATRIX_S

Save the distance matrix for samples to file [default

None].

Clustering parameters:

--no_fclustering avoid clustering features

--no_plot_fclustering

avoid plotting the feature dendrogram

--no_sclustering avoid clustering samples

--no_plot_sclustering

avoid plotting the sample dendrogram

--flinkage FLINKAGE Linkage method for feature clustering [default

average]

--slinkage SLINKAGE Linkage method for sample clustering [default average]

Heatmap options:

--dpi DPI Image resolution in dpi [default 150]

-l, --log_scale Log scale

--title TITLE Title of the plot

--title_fontsize TITLE_FONTSIZE

Font size of the title

-s, --sqrt_scale Square root scale

--no_slabels Do not show sample labels

--minv MINV Minimum value to display in the color map [default

None meaning automatic]

--maxv MAXV Maximum value to display in the color map [default

None meaning automatic]

--no_flabels Do not show feature labels

--max_slabel_len MAX_SLABEL_LEN

Max number of chars to report for sample labels

[default 15]

--max_flabel_len MAX_FLABEL_LEN

Max number of chars to report for feature labels

[default 15]

--flabel_size FLABEL_SIZE

Feature label font size [default 10]

--slabel_size SLABEL_SIZE

Sample label font size [default 10]

--fdend_width FDEND_WIDTH

Width of the feature dendrogram [default 1 meaning

100% of default heatmap width]

--sdend_height SDEND_HEIGHT

Height of the sample dendrogram [default 1 meaning

100% of default heatmap height]

--metadata_height METADATA_HEIGHT

Height of the metadata panel [default 0.05 meaning 5%

of default heatmap height]

--metadata_separation METADATA_SEPARATION

Distance between the metadata and data panels.

[default 0.001 meaning 0.1% of default heatmap height]

--colorbar_font_size COLORBAR_FONT_SIZE

Color bar label font size [default 12]

--image_size IMAGE_SIZE

Size of the largest between width and eight size for

the image in inches [default 8]

--cell_aspect_ratio CELL_ASPECT_RATIO

Aspect ratio between width and height for the cells of

the heatmap [default 1.0]

-c {Accent,Blues,BrBG,BuGn,BuPu,Dark2,GnBu,Greens,Greys,OrRd,Oranges,PRGn,Paired,Pastel1,Pastel2,PiYG,PuBu,PuBuGn,PuOr,PuRd,Purples,RdBu,RdGy,RdPu,RdYlBu,RdYlGn,Reds,Set1,Set2,Set3,Spectral,YlGn,YlGnBu,YlOrBr,YlOrRd,afmhot,autumn,binary,bone,brg,bwr,cool,copper,flag,gist_earth,gist_gray,gist_heat,gist_ncar,gist_rainbow,gist_stern,gist_yarg,gnuplot,gnuplot2,gray,hot,hsv,jet,ocean,pink,prism,rainbow,seismic,spectral,spring,summer,terrain,winter,bbcyr,bbcry,bcry}, --colormap {Accent,Blues,BrBG,BuGn,BuPu,Dark2,GnBu,Greens,Greys,OrRd,Oranges,PRGn,Paired,Pastel1,Pastel2,PiYG,PuBu,PuBuGn,PuOr,PuRd,Purples,RdBu,RdGy,RdPu,RdYlBu,RdYlGn,Reds,Set1,Set2,Set3,Spectral,YlGn,YlGnBu,YlOrBr,YlOrRd,afmhot,autumn,binary,bone,brg,bwr,cool,copper,flag,gist_earth,gist_gray,gist_heat,gist_ncar,gist_rainbow,gist_stern,gist_yarg,gnuplot,gnuplot2,gray,hot,hsv,jet,ocean,pink,prism,rainbow,seismic,spectral,spring,summer,terrain,winter,bbcyr,bbcry,bcry}

--bottom_c BOTTOM_C Color to use for cells below the minimum value of the

scale [default None meaning bottom color of the scale]

--top_c TOP_C Color to use for cells below the maximum value of the

scale [default None meaning bottom color of the scale]

--nan_c NAN_C Color to use for nan cells [default None]

hclust2

>python2.7 hclust2/hclust2.py -h

$ python2.7 hclust2/hclust2.py -h

usage: hclust2.py [-h] [-i [INPUT_FILE]] [-o [OUTPUT_FILE]]

[--legend_file [LEGEND_FILE]] [-t INPUT_TYPE] [--sep SEP]

[--out_table OUT_TABLE] [--fname_row FNAME_ROW]

[--sname_row SNAME_ROW] [--metadata_rows METADATA_ROWS]

[--skip_rows SKIP_ROWS] [--sperc SPERC] [--fperc FPERC]

[--stop STOP] [--ftop FTOP] [--def_na DEF_NA]

[--f_dist_f F_DIST_F] [--s_dist_f S_DIST_F]

[--load_dist_matrix_f LOAD_DIST_MATRIX_F]

[--load_dist_matrix_s LOAD_DIST_MATRIX_S]

[--load_pickled_dist_matrix_f LOAD_PICKLED_DIST_MATRIX_F]

[--load_pickled_dist_matrix_s LOAD_PICKLED_DIST_MATRIX_S]

[--save_pickled_dist_matrix_f SAVE_PICKLED_DIST_MATRIX_F]

[--save_pickled_dist_matrix_s SAVE_PICKLED_DIST_MATRIX_S]

[--no_fclustering] [--no_plot_fclustering]

[--no_sclustering] [--no_plot_sclustering]

[--flinkage FLINKAGE] [--slinkage SLINKAGE] [--dpi DPI] [-l]

[--title TITLE] [--title_fontsize TITLE_FONTSIZE] [-s]

[--no_slabels] [--minv MINV] [--maxv MAXV] [--no_flabels]

[--max_slabel_len MAX_SLABEL_LEN]

[--max_flabel_len MAX_FLABEL_LEN]

[--flabel_size FLABEL_SIZE] [--slabel_size SLABEL_SIZE]

[--fdend_width FDEND_WIDTH] [--sdend_height SDEND_HEIGHT]

[--metadata_height METADATA_HEIGHT]

[--metadata_separation METADATA_SEPARATION]

[--colorbar_font_size COLORBAR_FONT_SIZE]

[--image_size IMAGE_SIZE]

[--cell_aspect_ratio CELL_ASPECT_RATIO]

[--bottom_c BOTTOM_C] [--top_c TOP_C] [--nan_c NAN_C]

TBA

optional arguments:

-h, --help show this help message and exit

-i [INPUT_FILE], --inp [INPUT_FILE], --in [INPUT_FILE]

The input matrix

-o [OUTPUT_FILE], --out [OUTPUT_FILE]

The output image file [image on screen of not

specified]

--legend_file [LEGEND_FILE]

The output file for the legend of the provided

metadata

-t INPUT_TYPE, --input_type INPUT_TYPE

The input type can be a data matrix or distance matrix

[default data_matrix]

Input data matrix parameters:

--sep SEP

--out_table OUT_TABLE

Write processed data matrix to file

--fname_row FNAME_ROW

row number containing the names of the features

[default 0, specify -1 if no names are present in the

matrix

--sname_row SNAME_ROW

column number containing the names of the samples

[default 0, specify -1 if no names are present in the

matrix

--metadata_rows METADATA_ROWS

Row numbers to use as metadata[default None, meaning

no metadata

--skip_rows SKIP_ROWS

Row numbers to skip (0-indexed, comma separated) from

the input file[default None, meaning no rows skipped

--sperc SPERC Percentile of sample value distribution for sample

selection

--fperc FPERC Percentile of feature value distribution for sample

selection

--stop STOP Number of top samples to select (ordering based on

percentile specified by --sperc)

--ftop FTOP Number of top features to select (ordering based on

percentile specified by --fperc)

--def_na DEF_NA Set the default value for missing values [default None

which means no replacement]

Distance parameters:

--f_dist_f F_DIST_F Distance function for features [default correlation]

--s_dist_f S_DIST_F Distance function for sample [default euclidean]

--load_dist_matrix_f LOAD_DIST_MATRIX_F

Load the distance matrix to be used for features

[default None].

--load_dist_matrix_s LOAD_DIST_MATRIX_S

Load the distance matrix to be used for samples

[default None].

--load_pickled_dist_matrix_f LOAD_PICKLED_DIST_MATRIX_F

Load the distance matrix to be used for features as

previously saved as pickle file using hclust2 itself

[default None].

--load_pickled_dist_matrix_s LOAD_PICKLED_DIST_MATRIX_S

Load the distance matrix to be used for samples as

previously saved as pickle file using hclust2 itself

[default None].

--save_pickled_dist_matrix_f SAVE_PICKLED_DIST_MATRIX_F

Save the distance matrix for features to file [default

None].

--save_pickled_dist_matrix_s SAVE_PICKLED_DIST_MATRIX_S

Save the distance matrix for samples to file [default

None].

Clustering parameters:

--no_fclustering avoid clustering features

--no_plot_fclustering

avoid plotting the feature dendrogram

--no_sclustering avoid clustering samples

--no_plot_sclustering

avoid plotting the sample dendrogram

--flinkage FLINKAGE Linkage method for feature clustering [default

average]

--slinkage SLINKAGE Linkage method for sample clustering [default average]

Heatmap options:

--dpi DPI Image resolution in dpi [default 150]

-l, --log_scale Log scale

--title TITLE Title of the plot

--title_fontsize TITLE_FONTSIZE

Font size of the title

-s, --sqrt_scale Square root scale

--no_slabels Do not show sample labels

--minv MINV Minimum value to display in the color map [default

None meaning automatic]

--maxv MAXV Maximum value to display in the color map [default

None meaning automatic]

--no_flabels Do not show feature labels

--max_slabel_len MAX_SLABEL_LEN

Max number of chars to report for sample labels

[default 15]

--max_flabel_len MAX_FLABEL_LEN

Max number of chars to report for feature labels

[default 15]

--flabel_size FLABEL_SIZE

Feature label font size [default 10]

--slabel_size SLABEL_SIZE

Sample label font size [default 10]

--fdend_width FDEND_WIDTH

Width of the feature dendrogram [default 1 meaning

100% of default heatmap width]

--sdend_height SDEND_HEIGHT

Height of the sample dendrogram [default 1 meaning

100% of default heatmap height]

--metadata_height METADATA_HEIGHT

Height of the metadata panel [default 0.05 meaning 5%

of default heatmap height]

--metadata_separation METADATA_SEPARATION

Distance between the metadata and data panels.

[default 0.001 meaning 0.1% of default heatmap height]

--colorbar_font_size COLORBAR_FONT_SIZE

Color bar label font size [default 12]

--image_size IMAGE_SIZE

Size of the largest between width and eight size for

the image in inches [default 8]

--cell_aspect_ratio CELL_ASPECT_RATIO

Aspect ratio between width and height for the cells of

the heatmap [default 1.0]

--bottom_c BOTTOM_C Color to use for cells below the minimum value of the

scale [default None meaning bottom color of the scale]

--top_c TOP_C Color to use for cells below the maximum value of the

scale [default None meaning bottom color of the scale]

--nan_c NAN_C Color to use for nan cells [default None]

GraPhlAn

> graphlan.py -h

# graphlan.py -h

usage: graphlan.py [-h] [--format ['output_image_format']]

[--warnings WARNINGS] [--positions POSITIONS]

[--dpi image_dpi] [--size image_size] [--pad pad_in]

[--external_legends] [--avoid_reordering] [-v]

input_tree output_image

GraPhlAn 1.1.3 (5 June 2018) AUTHORS: Nicola Segata (nsegata@hsph.harvard.edu)

positional arguments:

input_tree the input tree in PhyloXML format

output_image the output image, the format is guessed from the

extension unless --format is given. Available file

formats are: png, pdf, ps, eps, svg

optional arguments:

-h, --help show this help message and exit

--format ['output_image_format']

set the format of the output image (default none

meaning that the format is guessed from the output

file extension)

--warnings WARNINGS set whether warning messages should be reported or not

(default 1)

--positions POSITIONS

set whether the absolute position of the points should

be reported on the standard output. The two

cohordinates are r and theta

--dpi image_dpi the dpi of the output image for non vectorial formats

--size image_size the size of the output image (in inches, default 7.0)

--pad pad_in the distance between the most external graphical

element and the border of the image

--external_legends specify whether the two external legends should be put

in separate file or keep them along with the image

(default behavior)

--avoid_reordering specify whether the tree will be reorder or not

(default the tree will be reordered)

-v, --version Prints the current GraPhlAn version and exit

> python2.7 graphlan/graphlan_annotate.py -h

# python2.7 graphlan/graphlan_annotate.py -h

usage: graphlan_annotate.py [-h] [--annot annotation_file] [-v]

input_tree [output_tree]

GraPhlAn annotate module 1.1.3 (5 June 2018) AUTHORS: Nicola Segata

(nsegata@hsph.harvard.edu)

positional arguments:

input_tree the input tree in Newick, Nexus, PhyloXML or plain

text format

output_tree the output tree in PhyloXML format containing the

newly added annotations. If not specified, the input

tree file will be overwritten

optional arguments:

-h, --help show this help message and exit

--annot annotation_file

specify the annotation file

-v, --version Prints the current GraPhlAn version and exit

入力ファイル

fasta、fastq、tar.bz2、gzなど。

実行方法

テストデータダウンロード。ここでは1MB程度のgzがダウンロードされる。

curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014459-Stool.fasta.gz 
curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014464-Anterior_nares.fasta.gz 
curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014470-Tongue_dorsum.fasta.gz 
curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014472-Buccal_mucosa.fasta.gz 
curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014476-Supragingival_plaque.fasta.gz 
curl -O https://bitbucket.org/biobakery/biobakery/raw/tip/demos/biobakery_demos/data/metaphlan2/input/SRS014494-Posterior_fornix.fasta.gz

ダウンロードしたデータを移動させる。

mkdir metaphlan2_analysis #新しくディレクトリを作る。
mv *fasta.gz metaphlan2_analysis/ #ダウンロードしたファイルを移動させる
cd metaphlan2_analysis/ #移動

fastaデータを１つだけ解析するには以下のように打つ。

metaphlan2.py SRS014476-Supragingival_plaque.fasta.gz  --input_type fasta --nproc 4 > SRS014476-Supragingival_plaque_profile.txt

--nproc　
--input_type

終わると２つのファイルが出来る。

SRS014476-Supragingival_plaque.fasta.gz.bowtie2out.txt　１行１リード形式でマーカー遺伝子とのマッチング結果がまとめられている。中間ファイルだが、どのような生物のマーカー遺伝子とhitしたのか調べる時に使える。
SRS014476-Supragingival_plaque_profile.txt　解析結果。同定できたtaxisonomyがプリントされている。

中身を確認する。

今回は４データある。それぞれランする。ここではスレッドを12指定。fastaはgzip圧縮して拡張子をfasta.gzにする。

metaphlan2.py SRS014464-Anterior_nares.fasta.gz --input_type fasta --nproc 12 > SRS014464-Anterior_nares_profile.txt 
metaphlan2.py SRS014470-Tongue_dorsum.fasta.gz --input_type fasta --nproc 12 > SRS014470-Tongue_dorsum_profile.txt 
metaphlan2.py SRS014472-Buccal_mucosa.fasta.gz --input_type fasta --nproc 12 > SRS014472-Buccal_mucosa_profile.txt 
metaphlan2.py SRS014494-Posterior_fornix.fasta.gz --input_type fasta --nproc 12 > SRS014494-Posterior_fornix_profile.txt

公式サイトではforで回す例も記載されている。

４つの結果をマージする（*1）。

python2.7 metaphlan2/utils/merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt

ここではワイルドカードを使っているが、順番にファイルを指定しても良い。処理は一瞬で終わる。

出力のmerged_abundance_table.txtを確認。

less -S merged_abundance_table.txt |head -10

ID SRS014459-Stool_profile SRS014464-Anterior_nares_profile SRS014470-Tongue_dorsum_profile SRS014472-Buccal_mucosa_profile SRS014476-Supragingival_plaque_profile SRS014494-Posterior_fornix_profile

#SampleID Metaphlan2_Analysis Metaphlan2_Analysis Metaphlan2_Analysis Metaphlan2_Analysis Metaphlan2_Analysis Metaphlan2_Analysis

k__Bacteria 100.0 16.82175 100.0 100.0 100.0 100.0

k__Bacteria|p__Actinobacteria 0.0 14.07759 0.85102 0.0 90.32561 0.0

k__Bacteria|p__Actinobacteria|c__Actinobacteria 0.0 14.07759 0.85102 0.0 90.32561 0.0

k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales 0.0 14.07759 0.85102 0.0 90.32561 0.0

k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Actinomycetales|f__Actinomycetaceae 0.0 0.0 0.85102 0.0 0.0 0.0

species情報だけ抜き出す（この操作を行なわず上のファイルを直接hclustに使うと、階級を無視してクラスタリングを行ってしまうため、ここではspeciesレベルに階級を揃えている）。

grep -E "(s__)|(^ID)" merged_abundance_table.txt | grep -v "t__" | sed 's/^.*s__//g' > merged_abundance_table_species.txt

出力を確認。

head merged_abundance_table_species.txt

ID SRS014459-Stool_profile SRS014464-Anterior_nares_profile SRS014470-Tongue_dorsum_profile SRS014472-Buccal_mucosa_profile SRS014476-Supragingival_plaque_profile SRS014494-Posterior_fornix_profile

Actinomyces_graevenitzii 0.0 0.0 0.85102 0.0 0.0 0.0

Corynebacterium_matruchotii 0.0 0.0 0.0 0.0 58.21595 0.0

Corynebacterium_pseudodiphtheriticum 0.0 14.07759 0.0 0.0 0.0 0.0

Rothia_dentocariosa 0.0 0.0 0.0 0.0 32.10966 0.0

Bacteroides_cellulosilyticus 3.82206 0.0 0.0 0.0 0.0 0.0

Bacteroides_massiliensis 10.61295 0.0 0.0 0.0 0.0 0.0

Bacteroides_ovatus 4.08051 0.0 0.0 0.0 0.0 0.0

Bacteroides_stercoris 12.82765 0.0 0.0 0.0 0.0 0.0

Parabacteroides_distasonis 3.7393 0.0 0.0 0.0 0.0 0.0

hclust2を使いヒートマップを描く。ここではhclust2をダウンロードして、相対パスで実行している。

#hgコマンドがないなら　sudo apt update && apt install mercurial
hg clone https://bitbucket.org/nsegata/hclust2


python2.7 hclust2/hclust2.py -i merged_abundance_table_species.txt \
 -o abundance_heatmap_species.png \
 --ftop 25 \
 --f_dist_f braycurtis \
 --s_dist_f braycurtis \
 --cell_aspect_ratio 0.5 -l \
 --flabel_size 6 \
 --slabel_size 6 \
 --max_flabel_len 100 \
 --max_slabel_len 100 \
 --minv 0.1 \
 --dpi 300

hclust2はbrewで導入できる（今回はエラーになったため直接本体を叩いた）。

f:id:kazumaxneo:20170807130405j:plain

ヒートマップが描かれた。黒はその菌がゼロに近いことを意味し、赤-->黄色になると非常に多い（優占種である）ことを意味する。今回の６サンプルは、データによって菌種に共通するものはないことがわかる。ダウンサンプリングされたデータなので、レアなゲノムのデータが閾値以下のゼロに近い値になってしまったのかもしれない。また、データベースにな菌が多いデータでも、アサイン不可能なリードはたくさん出てくる（例えばrelative abundanceの99%が"unknown"など）。

サーバーで解析したなら、図をscpでローカルに落として確認してください。

系統樹を描く。

GraPhlAnの入力ファイルを作る。

hg clone https://bitbucket.org/nsegata/graphlan

python2.7 graphlan/export2graphlan/export2graphlan.py  \
 --skip_rows 1,2  \
 -i merged_abundance_table.txt \
 --tree merged_abundance.tree.txt \
 --annotation merged_abundance.annot.txt \
 --most_abundant 100 \
 --abundance_threshold 1 \
 --least_biomarkers 10 \
 --annotations 5,6 \
 --external_annotations 7 \
 --min_clade_size 1

graphlanもbrew/condaで導入できる（brew install biobakery/biobakery/graphlan ）。ここではエラーが出たので、直接ダウンロードして使用している。

cladogramを作成。

#xmlファイル作成
python2.7 graphlan/graphlan_annotate.py \
 --annot merged_abundance.annot.txt \
 merged_abundance.tree.txt merged_abundance.xml 

#xmlを使い系統樹描画
python2.7 graphlan/graphlan.py --dpi 300 \
 merged_abundance.xml merged_abundance.png \
 --external_legends

f:id:kazumaxneo:20170807132518j:plain

公式サイトでは他にもチュートリアルがある（リンク）。

引用

MetaPhlAn2 for enhanced metagenomic taxonomic profiling

Duy Tin Truong, Eric A Franzosa, Timothy L Tickle, Matthias Scholz, George Weingart, Edoardo Pasolli, Adrian Tett, Curtis Huttenhower & Nicola Segata.

Nature Methods 12, 902–903 (2015)

MetaPhlAn2 for enhanced metagenomic taxonomic profiling. - PubMed - NCBI

Compact graphical representation of phylogenetic data and metadata with GraPhlAn.

Asnicar F1, Weingart G2, Tickle TL3, Huttenhower C4, Segata N1.

PeerJ. 2015 Jun 18;3:e1029. doi: 10.7717/peerj.1029. eCollection 2015.

PanPhlAnは以前紹介しました。

追記

condaで導入して使うと、helpメッセージは出るものの、マージができず空のファイルができてしまうエラがーがあった。hg loneして使用した。

hg clone https://bitbucket.org/biobakery/metaphlan2
python2.7 metaphlan2/utils/merge_metaphlan_tables.py *_profile.txt > merged_abundance_table.txt

2017-08-07

PanPhlAnによるメタゲノムのプロファイリング

metagenome 生物種の推定 (taxonomic profiling) 2016 Nature Methods

2018 10/30 イントロ修正

　PanPhlAnはメタゲノムをstrainレベルで解析するツール。調べるのは遺伝子の有り/無しで、データベースのゲノムと比較することでメタゲムシーケンスしたバクテリアの特定の種に、実際にはどれくらいの多様性があるか（どれくらいのstrainが混じっているか）を調べることが可能となっている（本手法でプロファイルの指標になるのはコード領域の配列）。

　PanPhlAnの解析例として、公式ページにはアウトブレイクしたE.coliのシーケンスデータをPanPhlAnで解析し、ヒートマップに可視化した図が載っている（リンク先の上の図a）。図１の下の方にある黒い線がシーケンスされ検出された菌ゲノムで、黄色がPanphlanのデータベースのゲノムである。データベースにはこれまでシーケンス解読され登録されてきた様々なE.coliのstrainが載っているのだが、その中でアウトブレイクしたE.coliの株は密接してクラスターを作っていることがわかる。これはアウトブレイクを起こしたE.coliのstrainのゲノムが互いに非常によく似ていることを意味する。

　バクテリアのGWAS解析などができないかと妄想してこのツールを見つけたが、PanPhlAnは今後のメタゲノム解析に有用なツールと感じた。導入して使い勝手を見ていく。

公式サイト

https://bitbucket.org/CibioCM/panphlan/wiki/Home

チュートリアル

https://bitbucket.org/CibioCM/panphlan/wiki/Tutorial

インストール

依存

Python version 2.7 or 3.x (including the Biopython module, numpy, and matplotlib)
bowtie2　#2.21リンク
samtools
usearch 7 　#ダウンロード

Pythonとusearch以外はbrewでインストール可能。またpythonのモジュールはpipでインストールする。最後のusearch7はリンクよりダウンロードし、実行権をつけた上でusearch7という名前に変えてパスを通す。

本体のダウンロード

wget https://bitbucket.org/cibiocm/panphlan/get/default.zip 
unzip default.zip 
mkdir panphlan 
mv CibioCM-panphlan-93907843bbfa/ panphlan/ #名前を変更しておく

プログラムが認識できる様に、以下の名前でパスを通しておく。

echo 'export PATH=/Users/user/local/bowtie2-2.2.1/:$PATH' >> ~/.bash_profile 
echo 'export BT2_HOME=/Users/user/local/bowtie2-2.2.1/' >> ~/.bash_profile
echo 'export BOWTIE2_INDEXES=/Users/user/local/panphlan/Pre-processed_databases/' >> ~/.bash_profile

#ここでは書いていないがsamtoolsもパスが通っていなければ通す。

入力ファイル

fasta、fastq、tar.bz2、gzなど。

実行方法

以下の様な流れでランする。

データベースの作成

　↓

シーケンスデータをデータベースの配列にアライメントする。

　↓

結果をマージして１つにする。

　↓

遺伝子セットの有無をまとめた行列データが出力される。

マニュアルに沿って実際にやってみる。

1、データベースのダウンロード。

wget http://www.matthias-scholz.de/panphlan_erectale15.zip 
unzip panphlan_erectale15.zip

終わるとカレントディレクトリに８つのファイル panphlan_erectale15~ができる。このうち４つはbowtie2のindexファイルで、残りはPanPhlAnのデータベースファイルである。これらのファイルをデータベースとして使って以降の作業を行う。

２、シーケンスデータの準備。テストデータをダウンロードする。

wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AACyBh4CUJFJ1P6a-ypzHvlua/SRS013951.tar.bz2 
wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AAA4CVLuQVzg9-9ql4pAQxL0a/SRS014459.tar.bz2 
wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AABslonWx4_V7L9ugEbSN3XFa/SRS015065.tar.bz 
wget -P samples https://www.dropbox.com/sh/5vxicumnz3adiwi/AACanDqRyhaX0G6Dtf4NPTVwa/SRS019161.tar.bz2

４つ全て5GB以上あるので、空き容量に注意。

４ファイルできる。

ls samples/ 
SRS013951.tar.bz2 SRS014459.tar.bz2 SRS015065.tar.bz2 SRS019161.tar.bz2

シーケンスデータをデータベースの配列にマッピングする。

./panphlan/panphlan_map.py -c erectale15 -i samples/SRS013951.tar.bz2 -o map_results/SRS013951_erectale15.csv --verbose
./panphlan/panphlan_map.py -c erectale15 -i samples/SRS014459.tar.bz2 -o map_results/SRS014459_erectale15.csv --verbose
./panphlan/panphlan_map.py -c erectale15 -i samples/SRS015065.tar.bz2 -o map_results/SRS015065_erectale15.csv --verbose
./panphlan/panphlan_map.py -c erectale15 -i samples/SRS019161.tar.bz2 -o map_results/SRS019161_erectale15.csv --verbose

ダウンロードしたファイル４つそれぞれに対して実行する。

map_results/に４つのファイルができる。

４ファイルを付属のツールでまとめ、検出された遺伝子がどの種に存在するのかマトリックスにして出力する。

./panphlan/panphlan_profile.py -c erectale15 -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains --verbose

ディレクトリを指令するだけで認識する。

ヒートマップを書く。

hclust2をダウンロード（ない人だけ）。

hg clone https://bitbucket.org/nsegata/hclust2

python2.7 hclust2/hclust2.py -i result_gene_presence_absence.csv -o heatmap.png --legend_file legend.png --image_size 30 --cell_aspect_ratio 0.0005 --dpi 300 --slabel_size 16 --f_dist_f jaccard --s_dist_f jaccard

図が出力される。

f:id:kazumaxneo:20170809210137j:plain

データベースを自分で作る場合。（リンク）

NCBIの ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/ からデータを取ってくる。

wget -P fna ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Eubacterium_rectale_ATCC_33656_uid59169/NC_012781.fna 
wget -P ffn ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Eubacterium_rectale_ATCC_33656_uid59169/NC_012781.ffn

wget -P fna ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Eubacterium_rectale_uid197161/NC_021010.fna
wget -P ffn ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Eubacterium_rectale_uid197161/NC_021010.ffn

wget -P fna ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Eubacterium_rectale_uid197162/NC_021044.fna
wget -P ffn ftp://ftp.ncbi.nih.gov/genomes/archive/old_refseq/Bacteria/Eubacterium_rectale_uid197162/NC_021044.ffn

３生物のfnaとffnが保存される。ffnはコード領域のみのfna。

データベースを作成。

sudo python2.7 panphlan_pangenome_generation.py -c erectale15 --i_fna fna/ --i_ffn ffn/ -o database/

-c　to specify the clade or species database-name;
-i_fna　input folder for genome sequences
-i_fnn　input folder for gene sequences
-o　output folder for the pangenome database

-cで指定した名前が以降使うデータベース名になる。ここでは"erectale15"。ランには数十分かかる。

データベース作成が終わったら、出力をチェック

user$ ls -l database/

total 57344

-rw-r--r-- 1 root staff 7622462 8 7 16:35 panphlan_erectale15.1.bt2

-rw-r--r-- 1 root staff 2568828 8 7 16:35 panphlan_erectale15.2.bt2

-rw-r--r-- 1 root staff 1979 8 7 16:35 panphlan_erectale15.3.bt2

-rw-r--r-- 1 root staff 2568822 8 7 16:35 panphlan_erectale15.4.bt2

-rw-r--r-- 1 root staff 7622462 8 7 16:35 panphlan_erectale15.rev.1.bt2

-rw-r--r-- 1 root staff 2568828 8 7 16:35 panphlan_erectale15.rev.2.bt2

-rw-r--r-- 1 root staff 5231770 8 7 16:35 panphlan_erectale15_centroids.ffn

-rw-r--r-- 1 root staff 1159076 8 7 16:35 panphlan_erectale15_pangenome.csv

database/に移動する。

データベースを調べるなら以下のコマンドを打つ。

sudo python2.7 ../panphlan/panphlan_profile.py -c erectale15 --add_strains --o_dna genefamily_presence_absence.tsv

出力ファイルは以下のようなファイルになっている。

user$ head genefamily_presence_absence.tsv

REF_NC_012781 REF_NC_021010 REF_NC_021044

g000001 1 0 0

g000002 1 0 0

g000003 1 0 0

g000004 1 0 0

g000005 1 0 0

g000006 1 0 0

g000007 1 0 0

g000008 1 0 0

g000009 1 0 0

データベースを作成した３株の列それぞれに、マーガーのヒット結果が１行ずつプリントされている。

プリセットデータベースをダウンロードする場合。（リンク）

hg clone https://bitbucket.org/CibioCM/panphlan
source panphlan/tools/download_all_databases.sh
mkdir Pre-processed_databases #フォルダを作成して
mv *zip Pre-processed_databases/ #その中にzipを入れる。
cd Pre-processed_databases/ 
unzip '*zip' #解凍。シングルクオーテーションで囲まないとエラーになる。

#元のデータを消す。残してもOK
rm Pre-processed_databases/*zip

hgがない人はbrewでhgコマンドを導入可能（ゲノムデータベースを自分で作る場合は一番下を参照）。合計14GBある。

解凍が終わると大量のファイルができる。それぞれのデータベースの、ffn、fna、bowtie2のindexファイル、またツール専用のファイルなどになる。

この様な感じで、Pre-processed_databases/には大量のファイルができている。

user$ ls -l Pre-processed_databases/panphlan_* |head -20

-rw-rw-rw- 1 user staff 6498035 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16.1.bt2

-rw-rw-rw- 1 user staff 1723384 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16.2.bt2

-rw-rw-rw- 1 user staff 458 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16.3.bt2

-rw-rw-rw- 1 user staff 1723376 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16.4.bt2

-rw-rw-rw- 1 user staff 6498035 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16.rev.1.bt2

-rw-rw-rw- 1 user staff 1723384 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16.rev.2.bt2

-rw-rw-rw- 1 user staff 2681949 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16_centroids.ffn

-rw-rw-rw- 1 user staff 577364 3 6 2016 Pre-processed_databases/panphlan_aaphrophilus16_pangenome.csv

-rw-rw-rw- 1 user staff 5963263 3 6 2016 Pre-processed_databases/panphlan_abirgiae16.1.bt2

-rw-rw-rw- 1 user staff 1325616 3 6 2016 Pre-processed_databases/panphlan_abirgiae16.2.bt2

-rw-rw-rw- 1 user staff 620 3 6 2016 Pre-processed_databases/panphlan_abirgiae16.3.bt2

-rw-rw-rw- 1 user staff 1325612 3 6 2016 Pre-processed_databases/panphlan_abirgiae16.4.bt2

-rw-rw-rw- 1 user staff 5963263 3 6 2016 Pre-processed_databases/panphlan_abirgiae16.rev.1.bt2

-rw-rw-rw- 1 user staff 1325616 3 6 2016 Pre-processed_databases/panphlan_abirgiae16.rev.2.bt2

-rw-rw-rw- 1 user staff 4911052 3 6 2016 Pre-processed_databases/panphlan_abirgiae16_centroids.ffn

-rw-rw-rw- 1 user staff 439546 3 6 2016 Pre-processed_databases/panphlan_abirgiae16_pangenome.csv

-rw-rw-rw- 1 user staff 4925884 3 6 2016 Pre-processed_databases/panphlan_acardiffensis16.1.bt2

-rw-rw-rw- 1 user staff 547012 3 6 2016 Pre-processed_databases/panphlan_acardiffensis16.2.bt2

-rw-rw-rw- 1 user staff 368 3 6 2016 Pre-processed_databases/panphlan_acardiffensis16.3.bt2

-rw-rw-rw- 1 user staff 547005 3 6 2016 Pre-processed_databases/panphlan_acardiffensis16.4.bt2

合計3341ファイル。

例えばaaphrophilusを解析するなら、上に出てくるaaphrophilus16をデータベース名として使う。

データベースが充実したecoli16を確認してみる。

cd Pre-processed_databases/
sudo chmod u+rx *
sudo python2.7 ../panphlan/panphlan_profile.py -c aaphrophilus16 -i map_results/ --o_dna result_gene_presence_absence.csv --add_strains

f:id:kazumaxneo:20170807172817j:plain

100以上あるE.coliのstrain（列）の遺伝子（行）の存在する/存在しないが1/0でマトリックスに表記されている。

他にも下のような解析が可能である。公式リンクを貼っておく。

→ RNA-seq: How to extract strain specific transcriptional activities?

→ Functional analysis: How to get gene-family sequences and annotations?

→ PanPhlAn profile options

→ How to adapt strain detection thresholds?

→ How to get gene-family presence/absence profiles of all reference genomes, without sample profiles?

gene familyのIDを指定して配列を取り出すやり方など参考になりそうである。

引用

Strain-level microbial epidemiology and population genomics from shotgun metagenomics.

Matthias Scholz*, Doyle V. Ward*, Edoardo Pasolli*, Thomas Tolio, Moreno Zolfo, Francesco Asnicar, Duy Tin Truong, Adrian Tett, Ardythe L. Morrow, and Nicola Segata (* Equal contribution)

Nature Methods, 13, 435–438, 2016.

2017-08-06

ウィルスゲノムのアセンブルツール IVA (Iterative Virus Assembler)

assembly virus 2015 Bioinformatics

IVA (Iterative Virus Assembler)はウィルス用のDNAアセンブラ。2015年に発表された。カバレッジが大きく変動するウィルスゲノムのアセンブルに対応しているとされる。入力データはilluminaのペーアドエンドである。前もって作ったcontigからscaffoldを作ったり、リファレンスゲノムと比較する機能も備える。

公式サイト（サンガー研）

マニュアル

https://github.com/sanger-pathogens/iva/wiki

インストール

macではテストランのbashのスクリプト処理時にエラーが出たのでcent OS6にインストールした。

先に依存関係をインストール。

Python 3 version 3.3 or higher (IVA is written in Python 3)
KMC installed, so that kmc and kmc_dump are in your path.
MUMmer installed with its executables (ie nucmer etc) in your path.
Samtools installed, so that samtools is in your path.
SMALT installed, so that smalt is in your path.
Optional: Trimmomatic - although this is optional,

samtools、kmc、smalt、blast+、Trimmomatic はbrewで導入できる。

brew install samtools
brew install mummer
brew install kmc
brew install smalt
brew install Trimmomatic

本体はpip3で導入する。

pip3 install iva

実行方法

はじめにテストランを行う。

iva --test outdir

自動でfastqがダウンロードされアセンブルが実行される。動いているコマンドは以下のとおりである。

/bin/iva --threads 1 --pcr_primers hiv_pcr_primers.fa -f reads_1.fq.gz -r reads_2.fq.gz iva.out

上のhiv_pcr_primers.faは以下のようなプライマー配列のファイル。

outdir]$ head -8 hiv_pcr_primers.fa

>Pan-HIV-1_1F.1

AGCCCGGGAGCTCTCTG

>Pan-HIV-1_1F.2

AGCCTGGGAGCTCTCTG

>Pan-HIV-1_1R.1

CCTCCAATTCCCCCTATCATTTT

>Pan-HIV-1_1R.2

CCTCCAATTCCTCCTATCATTTT

プライマーを明示してアセンブル前に除去する。

ランにエラーがなければ、/outdir/iva.outにcontig.fastaができる。テストデータのリファレンスのゲノムは１つのセグメントから構成されているが、アセンブルされたconitgは二つのcontigまでアセンブルされていた。

実際のランは以下のように行う。

iva --threads 12 -f R1.fastq -r R2.fastq <Output_directory>

--threads
--max_insert
-f　 Name of forward reads fasta/q file [.gz].
-r　Name of reverse reads fasta/q file [.gz].
-fr　ame of interleaved fasta/q file [.gz].
-k　kmer hash length in SMALT (the -k option in smalt index) [19]
-s　kmer hash step size in SMALT (the -s option in smalt index) [11]
-y FLOAT　Minimum identity threshold for mapping to be reported (the -y option in smalt map) [0.5].
-i　Maximum insert size (includes read length). Reads with inferred insert size more than the maximum will not be used to extend contigs [800].

ラン中は画面更新がないので、初めて行うとき不安かもしれない。そうゆう人は小さなデータで一度テストしてみると良いかもしれない。正常に終わると指定したディレクトリにcontigファイルができる。

IVAのランにはイルミナのPCRプライマー配列とindexを除去したリードを使うのが望ましいが、除去されていない場合はTrimmomaticで除去してランできる（Trimmomaticがインストールされている必要あり）。

PCRプライマー配列をfastaフォーマットで明示してトリミング。

iva --trimmomatic /path/to/trimmomatic-0.32.jar --pcr_primers primers.fasta --fr R1R2.fastq <Output_directory>

アダプター配列も明示してトリミングを実行。

iva --trimmomatic /path/to/trimmomatic-0.32.jar \
 --adapters adapters.fasta \
 --pcr_primers primers.fasta \
 --fr R1R".fastq <Output_directory>

contig配列を指定してラン。

iva --contigs contigs.fasta --fr R1R2.fastq <Output_directory>

引用

IVA: accurate de novo assembly of RNA virus genomes

Martin Hunt, Astrid Gall, Swee Hoe Ong, Jacqui Brener, Bridget Ferns, Philip Goulder, Eleni Nastouli, Jacqueline A. Keane, Paul Kellam, and Thomas D. Otto

Bioinformatics. 2015 Jul 15; 31(14): 2374–2376.

2017-08-04

diginormによるシーケンスデータの軽量化

fasta/fastqの操作 assembly RNA seq normalize contamination 2012 metagenome

2019 5/14 helpとパラメータ追記

"digital normalization"という名で発表されたこの手法は、k-merを指標にリードを間引いて、データサイズを軽量化する方法論。データサイズが大きすぎてアセンブルできないサンプルの軽量化に使えるとされる。トリミングターゲットは、低/高のk-merカバレッジのリードになる。k-merのカバレッジが中央値から極端に少ないのはシーケンスエラーと考えられ、k-merのカバレッジが極端に多いにはリピートと考えられる。それらを削って軽量化する。

ただし低k-merカバレッジ配列はエラーでなくレアなバリアントの可能性もあるし、高カバレッジの配列も単純リピートでなく倍数体ゲノムの極端に似た配列の可能性がある。それらをエラーのk-merと一緒に消してしまう恐れがある。それについて論文の著者がHPで考えを述べている（Why you shouldn't use digital normalization）。興味がある人はその辺りを理解した上で使ってください。

本手法はde novo transcrptomeのデータにも使うことが可能とされ、de novo transcriptomeの論文にも使われた実績を持つ。

diginorm公式サイト

http://ged.msu.edu/papers/2012-diginorm/

解析にはkhmerを使う。

http://khmer.readthedocs.io/en/v2.0/user/scripts.html#scripts-diginorm

チュートリアル

A tutorial in basic digital normalization — ANGUS 2.0 documentation

ランに必要なツールのインストール

khmerとscreedが必要。

git clone git://github.com/ged-lab/screed.git 
git clone git://github.com/ged-lab/khmer.git

cd screed 
python setup.py install 
cd ../khmer 
make test 
export PYTHONPATH=/root/khmer/python

khmerとscreedはpipでも導入できます。

ベストなk-merについて

> normalize-by-median.py -h

# normalize-by-median.py -h

|| This is the script normalize-by-median.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

|| If you use this script in a publication, please cite EACH of the following:

|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

|| * CT Brown et al., arXiv:1203.4802 [q-bio.GN]

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

usage: normalize-by-median.py [--version] [--info] [-h] [-k KSIZE]

[-U UNIQUE_KMERS] [--fp-rate FP_RATE]

[-M MAX_MEMORY_USAGE] [--small-count] [-q]

[-C CUTOFF] [-p] [--force_single]

[-u unpaired_reads_filename] [-s filename]

[-R report_filename]

[--report-frequency report_frequency] [-f]

[-o filename] [-l filename] [--gzip | --bzip]

input_sequence_filename

[input_sequence_filename ...]

Do digital normalization (remove mostly redundant sequences)

positional arguments:

input_sequence_filename

Input FAST[AQ] sequence filename.

optional arguments:

--version show program's version number and exit

--info print citation information

-h, --help show this help message and exit

-k KSIZE, --ksize KSIZE

k-mer size to use (default: 32)

-U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS

approximate number of unique kmers in the input set

(default: 0)

--fp-rate FP_RATE Override the automatic FP rate setting for the current

script (default: None)

-M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE

maximum amount of memory to use for data structure

(default: None)

--small-count Reduce memory usage by using a smaller counter for

individual kmers. (default: False)

-q, --quiet

-C CUTOFF, --cutoff CUTOFF

when the median k-mer coverage level is above this

number the read is not kept. (default: 20)

-p, --paired require that all sequences be properly paired

(default: False)

--force_single treat all sequences as single-ended/unpaired (default:

False)

-u unpaired_reads_filename, --unpaired-reads unpaired_reads_filename

include a file of unpaired reads to which -p/--paired

does not apply. (default: None)

-s filename, --savegraph filename

save the k-mer countgraph to disk after all reads are

loaded. (default: None)

-R report_filename, --report report_filename

write progress report to report_filename (default:

None)

--report-frequency report_frequency

report progress every report_frequency reads (default:

100000)

-f, --force continue past file reading errors (default: False)

-o filename, --output filename

only output a single file with the specified filename;

use a single dash "-" to specify that output should go

to STDOUT (the terminal) (default: None)

-l filename, --loadgraph filename

load a precomputed k-mer graph from disk (default:

None)

--gzip Compress output using gzip (default: False)

--bzip Compress output using bzip2 (default: False)

Discard sequences based on whether or not their median k-mer abundance lies

above a specified cutoff. Kept sequences will be placed in <fileN>.keep.

By default, paired end reads will be considered together; if either read should

be kept, both will be kept. (This keeps both reads from a fragment, and helps

with retention of repeats.) Unpaired reads are treated individually.

If `-p`/`--paired` is set, then proper pairing is required and the script will

exit on unpaired reads, although `--unpaired-reads` can be used to supply a

file of orphan reads to be read after the paired reads.

`--force_single` will ignore all pairing information and treat reads

individually.

With `-s`/`--savegraph`, the k-mer countgraph will be saved to the specified

file after all sequences have been processed. `-l`/`--loadgraph` will load the

specified k-mer countgraph before processing the specified files. Note that

these graphs are are in the same format as those produced by `load-into-

counting.py` and consumed by `abundance-dist.py`.

To append reads to an output file (rather than overwriting it), send output to

STDOUT with `--output -` and use UNIX file redirection syntax (`>>`) to append

to the file.

Example:

normalize-by-median.py -k 17 tests/test-data/test-abund-read-2.fa

Example:

normalize-by-median.py -p -k 17 \

tests/test-data/test-abund-read-paired.fa

Example:

normalize-by-median.py -p -k 17 -o - tests/test-data/paired.fq \

>> appended-output.fq

Example:

normalize-by-median.py -k 17 -f tests/test-data/test-error-reads.fq \

tests/test-data/test-fastq-reads.fq

Example:

normalize-by-median.py -k 17 -s test.ct \

tests/test-data/test-abund-read-2.fa \

tests/test-data/test-fastq-reads.fq

> filter-abund.py -h

# filter-abund.py -h

|| This is the script filter-abund.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

|| If you use this script in a publication, please cite EACH of the following:

|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

|| * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

usage: filter-abund.py [--version] [--info] [-h] [-T THREADS] [-C CUTOFF] [-V]

[-Z NORMALIZE_TO] [-o optional_output_filename] [-f]

[-q] [--gzip | --bzip]

input_count_graph_filename input_sequence_filename

[input_sequence_filename ...]

Trim sequences at a minimum k-mer abundance.

positional arguments:

input_count_graph_filename

The input k-mer countgraph filename

input_sequence_filename

Input FAST[AQ] sequence filename

optional arguments:

--version show program's version number and exit

--info print citation information

-h, --help show this help message and exit

-T THREADS, --threads THREADS

Number of simultaneous threads to execute (default: 1)

-C CUTOFF, --cutoff CUTOFF

Trim at k-mers below this abundance. (default: 2)

-V, --variable-coverage

Only trim low-abundance k-mers from sequences that

have high coverage. (default: False)

-Z NORMALIZE_TO, --normalize-to NORMALIZE_TO

Base the variable-coverage cutoff on this median k-mer

abundance. (default: 20)

-o optional_output_filename, --output optional_output_filename

Output the trimmed sequences into a single file with

the given filename instead of creating a new file for

each input file. (default: None)

-f, --force Overwrite output file if it exists (default: False)

-q, --quiet

--gzip Compress output using gzip (default: False)

--bzip Compress output using bzip2 (default: False)

Trimmed sequences will be placed in "${input_sequence_filename}.abundfilt" for

each input sequence file. If the input sequences are from RNAseq or metagenome

sequencing then `--variable-coverage` should be used.

Example:

load-into-counting.py -k 20 -x 5e7 countgraph data/100k-filtered.fa

filter-abund.py -C 2 countgraph data/100k-filtered.fa

> load-into-counting.py -h

# load-into-counting.py -h

|| This is the script load-into-counting.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

|| If you use this script in a publication, please cite EACH of the following:

|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

|| * Q Zhang et al., https://doi.org/10.1371/journal.pone.0101271

|| * A. D\xf6ring et al. https://doi.org:80/10.1186/1471-2105-9-11

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

usage: load-into-counting.py [--version] [--info] [-h] [-k KSIZE]

[-U UNIQUE_KMERS] [--fp-rate FP_RATE]

[-M MAX_MEMORY_USAGE] [--small-count]

[-T THREADS] [-b] [-s FORMAT] [-f] [-q]

output_countgraph_filename

input_sequence_filename

[input_sequence_filename ...]

Build a k-mer countgraph from the given sequences.

positional arguments:

output_countgraph_filename

The name of the file to write the k-mer countgraph to.

input_sequence_filename

The names of one or more FAST[AQ] input sequence

files.

optional arguments:

--version show program's version number and exit

--info print citation information

-h, --help show this help message and exit

-k KSIZE, --ksize KSIZE

k-mer size to use (default: 32)

-U UNIQUE_KMERS, --unique-kmers UNIQUE_KMERS

approximate number of unique kmers in the input set

(default: 0)

--fp-rate FP_RATE Override the automatic FP rate setting for the current

script (default: None)

-M MAX_MEMORY_USAGE, --max-memory-usage MAX_MEMORY_USAGE

maximum amount of memory to use for data structure

(default: None)

--small-count Reduce memory usage by using a smaller counter for

individual kmers. (default: False)

-T THREADS, --threads THREADS

Number of simultaneous threads to execute (default: 1)

-b, --no-bigcount The default behaviour is to count past 255 using

bigcount. This flag turns bigcount off, limiting

counts to 255. (default: True)

-s FORMAT, --summary-info FORMAT

What format should the machine readable run summary be

in? (`json` or `tsv`, disabled by default) (default:

None)

-f, --force Overwrite output file if it exists (default: False)

-q, --quiet

Note: with `-b`/`--no-bigcount` the output will be the exact size of the k-mer

countgraph and this script will use a constant amount of memory. In exchange

k-mer counts will stop at 255. The memory usage of this script with `-b` will

be about 1.15x the product of the `-x` and `-N` numbers.

Example:

load-into-counting.py -k 20 -x 5e7 out data/100k-filtered.fa

Multiple threads can be used to accelerate the process, if you have extra cores

to spare.

Example:

load-into-counting.py -k 20 -x 5e7 -T 4 out data/100k-filtered.fa

> split-paired-reads.py -h

# split-paired-reads.py -h

|| This is the script split-paired-reads.py in khmer.

|| You are running khmer version 3.0.0a3

|| You are also using screed version 1.0

|| If you use this script in a publication, please cite EACH of the following:

|| * MR Crusoe et al., 2015. https://doi.org/10.12688/f1000research.6924.1

|| Please see http://khmer.readthedocs.io/en/latest/citations.html for details.

usage: split-paired-reads.py [--version] [--info] [-h] [-d output_directory]

[-0 output_orphaned] [-1 output_first]

[-2 output_second] [-f] [--gzip | --bzip]

[infile]

Split interleaved reads into two files, left and right.

positional arguments:

infile

optional arguments:

--version show program's version number and exit

--info print citation information

-h, --help show this help message and exit

-d output_directory, --output-dir output_directory

Output split reads to specified directory. Creates

directory if necessary (default: )

-0 output_orphaned, --output-orphaned output_orphaned

Allow "orphaned" reads and extract them to this file

(default: None)

-1 output_first, --output-first output_first

Output "left" reads to this file (default: None)

-2 output_second, --output-second output_second

Output "right" reads to this file (default: None)

-f, --force Overwrite output file if it exists (default: False)

--gzip Compress output using gzip (default: False)

--bzip Compress output using bzip2 (default: False)

Some programs want paired-end read input in the One True Format, which is

interleaved; other programs want input in the Insanely Bad Format, with left-

and right- reads separated. This reformats the former to the latter.

The directory into which the left- and right- reads are output may be specified

using `-d`/`--output-dir`. This directory will be created if it does not

already exist.

Alternatively, you can specify the filenames directly with `-1`/`--output-

first` and `-2`/`--output-second`, which will override the `-d`/`--output-dir`

setting on a file-specific basis.

`-0`/'--output-orphans` will allow broken-paired format, and orphaned reads

will be saved separately, to the specified file.

Example:

split-paired-reads.py tests/test-data/paired.fq

Example:

split-paired-reads.py -0 reads-output-file tests/test-data/paired.fq

Example:

split-paired-reads.py -1 reads.1 -2 reads.2 tests/test-data/paired.fq

ラン

トランスクリプトームデータ

1-passノーマライズを行う。

はじめにインターレースのpaired.fqにmergeしておく。

seqtk mergepe R1.fq R2.fq > paired.fq

分布でトリミング。

normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq

-C <CUTOFF> when the median k-mer coverage level is above this number the read is not kept. (default: 20)
-k <KSIZE > k-mer size to use (default: 32)

CとKの値について（上と同じリンク）。

ゲノムデータ

スリーパスノーマライズ (3-pass)を行う。

はじめにインターレースのpaired.fqにmergeしておく（seqtk（リンク））。

seqtk mergepe R1.fq R2.fq > paired.fq

k-merカウントグラフを書く。(countgraph.py)

load-into-counting.py -k 20 -x 5e8 -T 24 graph paired.fq

-k　k-mer size to use (default: 32)
-T　Number of simultaneous threads to execute
-x　upper bound on tablesize to use; overrides --max-memory-usage/-M.

1-pass　分布でリードをトリミング。

normalize-by-median.py ; Discard sequences based on whether or not their median k-mer abundance lies above a specified cutoff.

normalize-by-median.py -C 20 -k 20 -N 4 -x 5e8 paired.fq

-k　k-mer size to use (default: 32)
-C　when the median k-mer coverage level is above this number the read is not kept. (default: 20)
-N　number of tables to use in k-mer countgraph
-x　upper bound on tablesize to use; overrides --max-memory-usage/-M.

paired.fq.keepが出力される。試したデータでは、リードは287,956から273,788に減少（5%カット）。出力されるリード数は揃っていない。あとでペアに修正する。

2-pass　今度はlow coverageのリードをトリミングする。

filter-abund.py ; Trim sequences at a minimum k-mer abundance.

filter-abund.py -T 24 -C 2 graph paired.fq.keep

-C　CUTOFF, --cutoff CUTOFF Trim at k-mers below this abundance. (default: 2)
-T　Number of simultaneous threads to execute (default: 1)

R1.fastq.keep.abundfiltとR2.fastq.keep.abundfiltが出力される。

paired.fq.keep.abundfiltが出力される。リードは60%ほどトリムされた。

3-pass　最後にもう一度-C 5でトリミング。

normalize-by-median.py -C 5 -k 20 -N 4 -x 1e8 paired.fq.keep.abundfilt

paired.fq.keep.abundfilt.keepが出力される。

最後にペアのリードとシングルトンのリードに分ける。

split-paired-reads.py -0 orphan -1 trimmmed_R1.fq -2 trimmmed_R2.fq paired.fq.keep.abundfilt.keep

-0　output_orphaned Allow "orphaned" reads and extract them to this file (default: None)
-1　Output "left" reads to this file (default: None)
-2　Output "right" reads to this file (default: None)

R1とR2の行数は同じになる。ペアじゃないオーファンなリードが20%近くを占めた。

KMCを使ってk-mer頻度を計算(高速なk-merカウントツール KMC)。

　↓

20merでヒストグラムを書く。ややエラーの多いfastqを使っている。

diginorm処理前

f:id:kazumaxneo:20180221164944j:plain

頻度が少ない（エラー、low coverage、レアバリアント、コンタミ）20merが非常に多い。

　↓

diginorm処理後

f:id:kazumaxneo:20180221163327j:plain

頻度が少ない２０-mer配列が消えた。

diginormの前に、エラー訂正もやっておいた方がいいかもしれません。karatならrecall、precision評価ツールもついています。エラーコレクションツール karect

下は"before"をkaratでエラー訂正したfastqのヒストグラム。

f:id:kazumaxneo:20180221163254j:plain

まだ多いが、beforeと比べて頻度が少ないk-merの数が大きく減っている。捨てる前に、修復できるなら修復して、捨てられないようにしようという発想です。

引用

A Reference-Free Algorithm for Computational Normalization of Shotgun Sequencing Data

C. Titus Brown, Adina Howe, Qingpeng Zhang, Alexis B. Pyrkosz, Timothy H. Brom

arXiv:1203.4802 Publication date: 2012

追記

BBnormがオススメです。

2017-08-03

配列をクラスタリングする CD-HIT

RNA seq assembly sequence clustering 2006 Bioinformatics dereplication

2018 12/12　追記、6/18 インストール追記、7/24 help追加、9/25 タイトル修正、11/28 ユーザーズガイドリンク追記

2024/01/31 追記

似た塩基配列やアミノ酸配列をクラスタリングできるツール。例えば、de novo transcriptome解析でアセンブルを行った後、95%以上似た配列をまとめてlongestのものだけ残しunigeneにする、というような作業を行うことができる。

cd-hit-user-guideより転載。

ただしde nobo transcritomeに使うなら、クラスタリングすることで定量に影響が出ないかどうかが問題になる。つまり、統合することで非常によく似たコピー遺伝子やパラログが失われ、かえって網羅性が失われる恐れがある。興味がある人は以下の論文を見てください。

公式サイト

CD-HIT Official Website

webサーバー

http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi

CD-HIT User's Guide

http://weizhongli-lab.org/lab-wiki/doku.php?id=cd-hit-user-guide

インストール

brewかcondaを使う。

#bioconda (link)
mamba install -c bioconda -y cd-hit

#homebrew
brew install cd-hit

Github

git clone https://github.com/weizhongli/cdhit.git
cd cdhit/

#長い配列も扱えるようにビルド
make MAX_SEQ=1000000

> cd-hit

$ cd-hit

====== CD-HIT version 4.7 (built on Jul 13 2018) ======

Usage: cd-hit [Options]

Options

-i input filename in fasta format, required

-o output filename, required

-c sequence identity threshold, default 0.9

this is the default cd-hit's "global sequence identity" calculated as:

number of identical amino acids in alignment

divided by the full length of the shorter sequence

-G use global sequence identity, default 1

if set to 0, then use local sequence identity, calculated as :

number of identical amino acids in alignment

divided by the length of the alignment

NOTE!!! don't use -G 0 unless you use alignment coverage controls

see options -aL, -AL, -aS, -AS

-b band_width of alignment, default 20

-M memory limit (in MB) for the program, default 800; 0 for unlimitted;

-T number of threads, default 1; with 0, all CPUs will be used

-n word_length, default 5, see user's guide for choosing it

-l length of throw_away_sequences, default 10

-t tolerance for redundance, default 2

-d length of description in .clstr file, default 20

if set to 0, it takes the fasta defline and stops at first space

-s length difference cutoff, default 0.0

if set to 0.9, the shorter sequences need to be

at least 90% length of the representative of the cluster

-S length difference cutoff in amino acid, default 999999

if set to 60, the length difference between the shorter sequences

and the representative of the cluster can not be bigger than 60

-aL alignment coverage for the longer sequence, default 0.0

if set to 0.9, the alignment must covers 90% of the sequence

-AL alignment coverage control for the longer sequence, default 99999999

if set to 60, and the length of the sequence is 400,

then the alignment must be >= 340 (400-60) residues

-aS alignment coverage for the shorter sequence, default 0.0

if set to 0.9, the alignment must covers 90% of the sequence

-AS alignment coverage control for the shorter sequence, default 99999999

if set to 60, and the length of the sequence is 400,

then the alignment must be >= 340 (400-60) residues

-A minimal alignment coverage control for the both sequences, default 0

alignment must cover >= this value for both sequences

-uL maximum unmatched percentage for the longer sequence, default 1.0

if set to 0.1, the unmatched region (excluding leading and tailing gaps)

must not be more than 10% of the sequence

-uS maximum unmatched percentage for the shorter sequence, default 1.0

if set to 0.1, the unmatched region (excluding leading and tailing gaps)

must not be more than 10% of the sequence

-U maximum unmatched length, default 99999999

if set to 10, the unmatched region (excluding leading and tailing gaps)

must not be more than 10 bases

-B 1 or 0, default 0, by default, sequences are stored in RAM

if set to 1, sequence are stored on hard drive

!! No longer supported !!

-p 1 or 0, default 0

if set to 1, print alignment overlap in .clstr file

-g 1 or 0, default 0

by cd-hit's default algorithm, a sequence is clustered to the first

cluster that meet the threshold (fast cluster). If set to 1, the program

will cluster it into the most similar cluster that meet the threshold

(accurate but slow mode)

but either 1 or 0 won't change the representatives of final clusters

-sc sort clusters by size (number of sequences), default 0, output clusters by decreasing length

if set to 1, output clusters by decreasing size

-sf sort fasta/fastq by cluster size (number of sequences), default 0, no sorting

if set to 1, output sequences by decreasing cluster size

-bak write backup cluster file (1 or 0, default 0)

-h print this help

Questions, bugs, contact Limin Fu at l2fu@ucsd.edu, or Weizhong Li at liwz@sdsc.edu

For updated versions and information, please visit: http://cd-hit.org

cd-hit web server is also available from http://cd-hit.org

If you find cd-hit useful, please kindly cite:

"CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659

"CD-HIT: accelerated for clustering the next generation sequencing data", Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152

ラン

DNAのクラスタリング。

cd-hit-est -i input.fasta -c 0.95 -T 8 -M 4000 -o output.fasta

-c　sequence identity threshold, default 0.9
-i　input filename in fasta format, required
-o　output filename, required
-T　number of threads, default 1; with 0, all CPUs will be used
-M　memory limit (in MB) for the program, default 800; 0 for unlimitted;

他にも多くのオプションがある。

この場合だと95%以上似た配列が折り畳まれ、１つの代表配列になる。どれくらいの数値を使うかだが、例えばdeno vo transcriptome解析のペーパーでunigeneを絞り込むのに使う場合、95~98%などの閾値が使われている（数値の根拠は不明）。

.clstrを開くと、どのcontigがクラスターになったか分かる。例えば下に載せたクラスターは３つのcontigがクラスターになっている。最長だったNODE_21だけが出力され、他の２contigは排除されていることを意味する。

>Cluster 20

0 2181nt, >NODE_21_length_2181... *

1 128nt, >NODE_20700_length_1... at -/97.66%

2 72nt, >NODE_21261_length_7... at -/97.22%

アミノ酸配列のクラスタリング。

cd-hit -i input.faa -o output.faa -c 0.95 -T 8 -M 4000

注；fastaのヘッダー名が複雑だとエラーが起きる可能性がある。その場合はリネームしてから使う。

追記

cd-hit、cd-hit-estの他にも、duplicationがあるシーケンスリードをクラスター化するCD-HIT-DUP、オーバーラップがあるリードをクラスター化するCD-HIT-LAP、二つの配列群を比較してクラスター化するCD-HIT-2Dなどがある。

https://github.com/weizhongli/cdhit/blob/master/doc/cdhit-user-guide.wiki

（レポジトリより転載）
引用

Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences.

Li W1, Godzik A.

Bioinformatics. 2006 Jul 1;22(13):1658-9. Epub 2006 May 26.

CD-HIT | 核酸およびアミノ酸配列のクラスタリング

fastq / fastaの操作ツール seqtk

format conversion fasta/fastqの操作

seqtkはfastqをfastaに変換したり、相補鎖に変換できるツール。ランダムサンプリング機能ももち、de novo transcriptome解析でアセンブルに有利なリードデプスに間引くツールとして用いられることもある（ペーパー）。動作が非常に高速のため使いやすい。似たツールにseqkitがある。導入して機能をチェックしてみる。

インストール

mac os10.12でテストした。

brewで導入できる。

brew install seqtk

#Anaconda環境なら
conda install -c bioconda seqtk

Github

ラン

seq - common transformation of FASTA/Q

fastqをfastaに変換。（.gzも扱える。)

seqtk seq -a input.fq > output.fa

ILLUMINA 1.3+フォーマットをfastaに変換し、同時にquality20以下を小文字にする。

seqtk seq -aQ64 -q20 input.fq > output.fa

相補鎖（reverse complement)に変換。

seqtk seq -r input.fq > output.fq

様々なオプションがある。

-q　mask bases with quality lower than INT [0]
-X　mask bases with quality higher than INT [255]
-Q　quality shift: ASCII-INT gives base quality [33]
-M　 mask regions in BED or name list FILE [null]
-L　drop sequences with length shorter than INT [0]
-c 　mask complement region (effective with -M)
-r　reverse complement -A force FASTA output (discard quality)
-C　drop comments at the header lines
-N　drop sequences containing ambiguous bases
-1　output the 2n-1 reads only
-2　output the 2n reads only
-V　shift quality by '(-Q) - 33'
-U　convert all bases to uppercases
-S　strip of white spaces in sequences

mergepe - interleave two PE FASTA/Q files

ペアリードを１つのファイルにまとめインターレースにする。

seqtk mergepe R1.fq R2.fq > paired.fq

インターレースファイルからペアじゃないリードを除去するならdropseを使う。

trimfq　- trim FASTQ using the Phred algorithm

低クオリティな塩基の除去。Phread quality scoreに基づく。

eqtk trimfq input.fa > output.fa

-q　error rate threshold (disabled by -b/-e) [0.05]
-l　maximally trim down to INT bp (disabled by -b/-e) [30]

デフォルトだと5%のエラー率を超えるとトリムされる（-q 0.05）。

指定範囲の塩基のトリミング。

eqtk trimfq -b 5 -e 10 input.fa > output.fa

-b　 trim INT bp from left (non-zero to disable -q/-l) [0]
-e　trim INT bp from right (non-zero to disable -q/-l) [0]
-L　retain at most INT bp from the 5'-end (non-zero to disable -q/-l) [0]

subseq - extract subsequences from FASTA/Q

fastqをランダムサンプリングする。-sで乱数発生を制御している。

seqtk sample -s100 read1.fq 10000 > sub1.fq 
seqtk sample -s100 read2.fq 10000 > sub2.fq

-s　RNG seed [11]
-2　2-pass mode: twice as slow but with much reduced memory

上記では10000リードずつ取り出される。ただし-sの値を変えるとペアじゃないリードが10000リード取り出されるので注意。ペアリードは同じパラメータで行う。

fqchk - fastq QC (base/quality summary)

クオリティの分析。

seqtk fqchk input.fq > output.txt

$ head -5 output.txt

min_len: 101; max_len: 101; avg_len: 101.00; 38 distinct quality values

POS #bases %A %C %G %T %N avgQ errQ %low %high

ALL 252500 26.2 24.3 23.7 25.8 0.0 35.9 20.4 2.8 97.2

1 2500 13.0 44.2 25.4 17.4 0.0 32.6 30.7 0.4 99.6

2 2500 21.0 22.1 21.7 35.2 0.0 32.9 31.7 0.4 99.6

3 2500 25.4 25.2 22.1 27.2 0.0 33.0 31.8 0.2 99.8

Rに持ち込んでグラフ化すれば見やすい。

他にもいくつかの機能がある。

引用

GitHub - lh3/seqtk: Toolkit for processing sequences in FASTA/Q formats

Using seqtk to trim and process reads at an insanely high speed — ANGUS 2.0 documentation

FASTA/Q sequence processing toolkit -- seqtk

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

fastq / fastaの操作ツール seqkit

クレード特異的マーカー遺伝子を使いメタゲノム配列のtaxnomy assigmentを行う MetaPhlAn2、クラスタリングするHclust2、系統樹を作成するGraPhlAn

PanPhlAnによるメタゲノムのプロファイリング

ウィルスゲノムのアセンブルツール IVA (Iterative Virus Assembler)

diginormによるシーケンスデータの軽量化

配列をクラスタリングする CD-HIT

fastq / fastaの操作ツール seqtk