Metagenomics is the study of genomic material recovered from environments of interest, such as the human body (Huttenhower and Human Microbiome Project Consortium, 2012), seawater (Venter et al., 2004), and acid mine drainage (Tyson et al., 2004). Metagenomic studies generate tens of millions of sequencing reads in order to capture the presence of microorganisms and quantify their relative abundances, making the classification and analysis of these data a major challenge in the analysis pipeline.
Existing classification methods fall into two broad categories: alignment-based and alignment-free. The former approach, of which BLAST (Altschul et al., 1990) is the most widely used example, assigns each read to the taxon whose reference genome yields the best alignment. Several methods, including MEGAN (Huson et al., 2007), PhymmBL (Brady and Salzberg, 2009), and NBC (Rosen et al., 2011), apply additional machine-learning techniques on top of BLAST results to improve classification accuracy. These methods are often slower than BLAST alone and are computationally prohibitive for large-scale analyses of millions of short reads. However, the recently published Centrifuge (Kim et al., 2016) greatly improves the scalability of alignment-based algorithms by using an FM-index. The recently released Kaiju (Menzel et al., 2016), in addition to using genome sequences as a reference, performs alignment against protein sequences, achieving faster classification than existing tools (see the Kaiju introduction post).
Other tools, such as LMAT (Ames et al., 2013), Kraken (Wood and Salzberg, 2014), and CLARK (Ounit et al., 2015), assign reads to target taxa using exact k-mer matches, thereby avoiding inefficient base-by-base alignment while maintaining sensitivity and specificity comparable to alignment-based approaches. This approach is generally faster than alignment-based methods and allows greater flexibility in the choice of reference, since it only requires a collection of k-mers extracted from the reference sequences belonging to each taxon. For example, k-mers extracted directly from DNA or RNA sequencing data can capture natural variants that are missed when only reference genomes are used, increasing the sensitivity of the algorithm.
MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads
Thomas Nordahl Petersen, Oksana Lukjancenko, Martin Christen Frølund Thomsen, Maria Maddalena Sperotto, Ole Lund, Frank Møller Aarestrup, and Thomas Sicheritz-Pontén
Identification of genomic structural variation is an important step toward understanding human genetic diversity, evolution, and the etiology of disease. Numerous genetic diseases, including cancer, are associated with structural variants (SVs; Futreal et al., 2004). Array-based technologies have been used successfully in many studies to detect SVs, but their resolution for breakpoint detection is relatively low, and characterizing small SVs remains difficult. The introduction of high-throughput sequencing technologies such as the Illumina Genome Analyzer and the Applied Biosystems SOLiD system improved SV detection through the use of short-insert paired-end or mate-pair reads (Korbel et al., 2007). When reads are mapped to a reference genome, examining discordantly mapped pairs, using information such as pair order, orientation, and insert size, can reveal potential genomic variation.
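As a rough illustration of this idea, discordantly mapped pairs can be extracted from a BAM file with samtools (a minimal sketch; aln.bam is a placeholder, and real SV callers apply further filters on orientation and insert size):
# keep reads that are paired (-f 1) but NOT flagged as properly paired (-F 2);
# such discordant pairs are candidates for structural variation
samtools view -b -f 1 -F 2 aln.bam > discordant.bam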
clean: the transcript origin is from the focal sample.
cross contamination: the transcript origin is from an alien sample of the same experiment.
dubious: expression levels are too close between focal and alien samples to determine the true origin of the transcript.
low coverage: expression levels are too low in all samples, which hampers our procedure (which relies on differential expression) from confidently assigning the transcript to any category.
over expressed: expression levels are very high in at least 3 samples, and CroCo will not try to categorize the transcript. Such a pattern does not correspond to expectations for cross contamination, but often reflects highly conserved genes such as ribosomal genes, or external contamination shared by several samples (e.g. Escherichia coli contamination).
CroCo_v1.1.sh is a program that detects potential cross-contamination in assembled transcriptomes, using sequencing reads to determine the true origin of transcripts.
--threads INT : Number of threads to use (DEFAULT : 1) [short: -n]
--output-prefix STR : Prefix of output directory that will be created (DEFAULT : empty) [short: -p]
--output-level 1|2 : Select whether or not to output fasta files. '1' for none, '2' for all (DEFAULT : 2) [short: -l]
--graph yes|no : Produce graphical output using R (DEFAULT : no) [short: -g]
--add-option 'STR' : This text string will be understood as additional options for the mapper/quantifier used (DEFAULT : empty) [short: -a]
--recat STR : Name of a previous CroCo output directory you wish to use to re-categorize transcripts (DEFAULT : no) [short: -r]
--trim5 INT : nb bases trimmed from 5' (DEFAULT : 0) [short: -x]
--trim3 INT : nb bases trimmed from 3' (DEFAULT : 0) [short: -y]
--suspect-id INT : Indicate the minimum percent identity between two transcripts to suspect a cross contamination (DEFAULT : 95) [short: -s]
--suspect-len INT : Indicate the minimum length of an alignment between two transcripts to suspect a cross contamination (DEFAULT : 40) [short: -w]
--frag-length FLOAT : Estimated average fragment length (no default value). Only used in specific combinations of --mode and --tool [short: -u]
--frag-sd FLOAT : Estimated standard deviation of fragment length (no default value). Only used in specific combinations of --mode and --tool [short: -v]
It is good practice to redirect information about each CroCo run into an output log file using the following structure :
'2>&1 | tee log_file'
Minimal working example :
CroCo_v0.1.sh --mode p 2>&1 | tee log_file
Exhaustive example :
CroCo_v0.1.sh --mode p --in data_folder_name --tool R --fold-threshold 2 --minimum-coverage 0.2 --overexp 300 --threads 8 --output-prefix test1_ --output-level 2 --graph yes --add-option '-v 0' --trim5 0 --trim3 0 --suspect-id 95 --suspect-len 40 --recat no 2>&1 | tee log_file
Exhaustive example using shortcuts :
CroCo_v0.1.sh -m p -i data_folder_name -t R -f 2 -c 0.2 -d 300 -n 8 -p test1_ -l 2 -g yes -a '-v 0' -x 0 -y 0 -s 95 -w 40 -r no 2>&1 | tee log_file
Example for re-categorizing previous CroCo results
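The original help does not show a command for this case; a minimal sketch based on the --recat option described above (the directory name is a placeholder) might look like:
CroCo_v0.1.sh --mode p --recat previous_CroCo_output_dir 2>&1 | tee log_file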
-i, --infile=<str>, REQ Input genome fasta file. See README for formatting requirements**.
-o Output directory for results. Default = Current directory
--fasta FASTA Mode. When present, converts bed files to FASTA sequences using the provided reference genome
--nanopore Generate Oxford Nanopore data. Calculates a gamma distribution.
--pacbio Generate PacBio data. Calculates a log normal distribution. Default mode if none specified.
-m, --mean_read_length=<int>, OPT Mean read length for in-silico read generation. Default = 10000 bp
-s Standard deviation of in-silico reads. Default = 2050
-c Desired genome coverage of in-silico sequencing. Default = 8
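The simulator's executable name is not shown in this excerpt; the sketch below uses a placeholder name (simulate_reads) and only the options documented above, e.g. generating Nanopore-like reads at 8x coverage:
# -i genome, -o output dir, -m mean length, -s std dev, -c coverage (all placeholder values)
simulate_reads -i genome.fasta -o sim_out --nanopore -m 10000 -s 2050 -c 8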
Rapid analysis of high-throughput whole-genome shotgun (WGS) datasets is challenging because of the complexity created by their large size (Schatz et al., 2012). Reference-free approaches for analyzing WGS data include examination of basic quality, read length, and GC content (Yang et al., 2013), and exploration of k-mer (words of size k) spectra (Chor et al., 2009; Lo and Chain, 2014). A frequently used reference-free quality-control tool is FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/).
A k-mer spectrum reveals information not only about data quality (level of errors, sequencing biases, sequencing coverage, and potential contamination) but also about genomic complexity (size, karyotype, heterozygosity, and repeat content; Simpson, 2014). Pairwise comparison of WGS datasets (Anvar et al., 2014) extracts additional information, highlighting differences between spectra that can identify problematic samples.
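As a concrete example, a k-mer spectrum can be computed with Jellyfish (the k-mer counter KAT itself builds on); the read file name is a placeholder:
# count canonical 27-mers, then tabulate how many distinct k-mers occur 1, 2, 3, ... times
jellyfish count -m 27 -s 100M -t 4 -C -o mer_counts.jf reads.fastq
jellyfish histo mer_counts.jf > kmer_spectrum.txt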
Python V3.5+ with the tabulate, scipy, numpy and matplotlib packages and the C API installed. Python is optional but highly recommended; without Python, KAT functionality is limited: no plots, no distribution analysis, and no documentation.
Sphinx-doc V1.3+ (Optional: only required for building the documentation.)
* plot: Plotting tools. Contains several plotting tools to visualise k-mer spectra and compare distributions.
Options:
-v [ --verbose ] Print extra information
--version Print version string
--help Produce help message
Run
hist: histogram of k-mer occurrences.
$ kat hist
Kmer Analysis Toolkit (KAT) V2.4.2
Usage: kat hist [options] (<input>)+
Create a histogram of k-mer occurrences from the input.
Create a histogram with the number of k-mers having a given count, derived from the input, which can take the form of a single jellyfish hash, or one or more FastA or FastQ files. In bucket 'i' are tallied the k-mers which have a count 'c' satisfying 'low + i*inc <= c < low + (i+1)*inc'. Buckets in the output are labelled by the low end point (low + i*inc). For example, with low = 1 and inc = 1, the bucket labelled n simply counts the k-mers that occur exactly n times.
The last bucket in the output behaves as a catchall: it tallies all k-mers with a count greater than or equal to the low end point of this bucket.
This tool is very similar to the "histo" tool in jellyfish itself. The primary difference is that the output contains metadata that makes the histogram easier for the user to plot.
Options:
-o [ --output_prefix ] arg (="kat.hist") Path prefix for files generated by this program.
-t [ --threads ] arg (=1) The number of threads to use.
-h [ --high ] arg (=10000) High count value of histogram.
-i [ --inc ] arg (=1) Increment for each bin.
--5ptrim arg (=0) Ignore the first X bases from reads. If more than one file is provided you can specify different values for each file by separating with commas.
-N [ --non_canonical ] If counting fast(a/q), store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both strands.
-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense of additional memory and lower coverage.
-H [ --hash_size ] arg (=100000000) If kmer counting is required for the input, then use this value as the hash size. If this hash size is not large enough for your dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.
-d [ --dump_hash ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to load and will likely consume a significant amount of disk space.
-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.
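A typical invocation on a read set might look like the following (the FastQ name is a placeholder):
kat hist -t 4 -o reads.hist reads.fastq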
gcp: GC content vs k-mer coverage analysis.
Compares GC content and K-mer coverage from the input.
This tool takes in either a single jellyfish hash or one or more FastA or FastQ input files and then counts the GC nucleotides for each distinct K-mer in the hash. For each GC count and K-mer coverage level, the number of distinct K-mers are counted and stored in a matrix. This matrix can be used to analyse biological content within the hash. For example, it can be used to distinguish legitimate content from contamination, or unexpected content.
Options:
-o [ --output_prefix ] arg (="kat-gcp") Path prefix for files generated by this program.
-t [ --threads ] arg (=1) The number of threads to use
-x [ --cvg_scale ] arg (=1) Number of bins for the gc data when creating the contamination matrix.
-y [ --cvg_bins ] arg (=1000) Number of bins for the cvg data when creating the contamination matrix.
--5ptrim arg (=0) Ignore the first X bases from reads. If more than one file is provided you can specify different values for each file by separating with commas.
-N [ --non_canonical ] If counting fast(a/q), store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both strands.
-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense of additional memory and lower coverage.
-H [ --hash_size ] arg (=100000000) If kmer counting is required for the input, then use this value as the hash size. If this hash size is not large enough for your dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.
-d [ --dump_hash ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to load and will likely consume a significant amount of disk space.
-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.
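A typical invocation, using only the options documented above (the FastQ name is a placeholder):
kat gcp -t 4 -o kat-gcp reads.fastq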
comp: k-mer spectra comparison tool.
There are two main use cases for this tool. The first is to compare K-mers from two K-mer hashes, both representing K-mer counts for reads. The intersected output forms a matrix that can be used to show how related both spectra are via a density plot. The second use case is to compare K-mers generated from reads to those generated from an assembly; in this case the dataset for the reads must be provided first and the assembly second. This also produces a matrix containing the intersection of both spectra, but it is instead visualised via a stacked histogram.
There is also a third use case, in which K-mers from a third dataset are used as a filter, restricting the analysis to the K-mers present in that set. The manual contains more details on specific use cases.
Options:
-o [ --output_prefix ] arg (=kat-comp) Path prefix for files generated by this program.
-t [ --threads ] arg (=1) The number of threads to use.
-x [ --d1_scale ] arg (=1) Scaling factor for the first dataset - float multiplier
-y [ --d2_scale ] arg (=1) Scaling factor for the second dataset - float multiplier
-i [ --d1_bins ] arg (=1001) Number of bins for the first dataset, i.e. the number of rows in the matrix.
-j [ --d2_bins ] arg (=1001) Number of bins for the second dataset, i.e. the number of columns in the matrix.
--d1_5ptrim arg (=0) Ignore the first X bases from reads in dataset 1. If more than one file is provided for dataset 1 you can specify different values for each file by separating with commas.
--d2_5ptrim arg (=0) Ignore the first X bases from reads in dataset 2. If more than one file is provided for dataset 2 you can specify different values for each file by separating with commas.
-N [ --non_canonical_1 ] If counting fast(a/q) for input 1, store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both strands.
-O [ --non_canonical_2 ] If counting fast(a/q) for input 2, store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both strands.
-P [ --non_canonical_3 ] If counting fast(a/q) for input 3, store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both strands.
-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense of additional memory and lower coverage.
-H [ --hash_size_1 ] arg (=100000000) If kmer counting is required for input 1, then use this value as the hash size. If this hash size is not large enough for your dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.
-I [ --hash_size_2 ] arg (=100000000) If kmer counting is required for input 2, then use this value as the hash size. If this hash size is not large enough for your dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.
-J [ --hash_size_3 ] arg (=100000000) If kmer counting is required for input 3, then use this value as the hash size. If this hash size is not large enough for your dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.
-d [ --dump_hashes ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to load and will likely consume a significant amount of disk space.
-g [ --disable_hash_grow ] By default jellyfish will double the size of the hash if it gets filled, and then attempt to recount. Setting this option to true disables automatic hash growing. If the hash gets filled, an error is thrown. This option is useful if you are working with large genomes, or have strict memory limits on your system.
-n [ --density_plot ] Makes a density plot. By default we create a spectra_cn plot.
-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.
-h [ --output_hists ] Whether or not to output histogram data and plots for input 1 and input 2.
-v [ --verbose ] Print extra information.
--help Produce help message.
Compare the FastQ reads of two samples:
kat comp -t 10 -v 'sample1_1.fq sample1_2.fq' 'sample2_1.fq sample2_2.fq'
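For the second use case described above (reads versus an assembly, visualised as a stacked spectra-cn histogram), the reads must be given first and the assembly second; file names here are placeholders:
kat comp -t 8 -o reads_vs_asm 'reads_1.fq reads_2.fq' assembly.fasta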
cold: Contig Length and Duplication analysis tool.
$ kat cold
Kmer Analysis Toolkit (KAT) V2.4.2
Usage: kat cold [options] <assembly> <reads>
COntig Length and Duplication analysis tool
Calculates median read k-mer coverage, assembly k-mer coverage and GC% across each sequence in the provided assembly. Then, assuming plotting is enabled, the results are converted into a scatter plot, where each point is coloured according to a scheme similar to that used in spectra-cn plots, and sized according to its length. The y-axis represents median read K-mer coverage, and the x-axis represents GC%.
The <assembly> should be a fasta file that is NOT gzip compressed. The <reads> can be any number of <fasta/q> files, which CAN be gzip compressed, or a pre-counted hash.
Options:
-o [ --output_prefix ] arg (="kat-cold") Path prefix for files generated by this program.
-x [ --gc_bins ] arg (=1001) Number of bins for the gc data when creating the contamination matrix.
-y [ --cvg_bins ] arg (=1001) Number of bins for the cvg data when creating the contamination matrix.
-t [ --threads ] arg (=1) The number of threads to use.
--5ptrim arg (=0) Ignore the first X bases from reads. If more than one file is provided you can specify different values for each file by separating with commas.
-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense of additional memory and lower coverage.
-H [ --hash_size ] arg (=100000000) If kmer counting is required, then use this value as the hash size for the reads. We assume the assembly should use half this value. If this hash size is not large enough for your dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.
-d [ --dump_hashes ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to load and will likely consume a significant amount of disk space.
-g [ --disable_hash_grow ] By default jellyfish will double the size of the hash if it gets filled, and then attempt to recount. Setting this option to true disables automatic hash growing. If the hash gets filled, an error is thrown. This option is useful if you are working with large genomes, or have strict memory limits on your system.
-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.
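A typical invocation, following the usage line above (file names are placeholders):
kat cold -t 4 -o kat-cold assembly.fasta reads_1.fq reads_2.fq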