2018-05-05

小メモリで高速にメタゲノムのtaxonomy assignmentを行う metaOthello

2018 10/7 タイトル修正

　Metagenomicsとは、興味ある環境から得られたゲノム研究であり、例えばヒトの体内（Huttenhower and Human Microbiome Project Consortium、2012）、海水（Venter et al。、2004）、酸性雨排水（Tyson et al 、2004）などが例として挙げられる。メタゲノミクス研究では、微生物の存在を捕らえてその相対量を定量化するために、数千万回のシーケンシングリードを生成し、これらのデータの分類と分析を分析プロセス上の課題としている。

メタゲノムデータの解析における主要な計算上の課題の1つは、最も特異的な生物学的分類群に各シーケンスを当てはめるプロセスである。具体的には、リードは、その分類群について収集された参照ゲノムと高い配列類似性を有する場合、分類群に属するものとして分類される。 2014年だけで、ハイスループットシーケンシング技術のアクセシビリティのおかげで、NCBI RefSeqデータベースに10,000を超えるシーケンスレコードが新たに追加された。

　既存の分類方法は、2つの広いカテゴリーに分類することができる：アライメントベースおよびアラインメントフリー。 BLAST（Altschul et al。、1990）として最も普及している前者のアプローチは、参照ゲノムとの最良のアライメントを提供するタクソンに各リードを割り当てる。 MEGAN（Husonら、2007）、PhymmBL（Brady and Salzberg、2009）およびNBC（Rosenら、2011）を含むいくつかの方法は、分類精度を高めるためにBLAST結果に追加の機械学習技術を適用する。これらの方法は、しばしばBLAST単独よりも遅く、何百万というショートリードの大規模分析には計算上禁止されている。しかし、最近発表されたCentrifuge（Kim et al。、2016）は、アライメントベースアルゴリズムをFM-indexを用いてのスケーラビリティを大幅に改善している。最近公開されたKaiju（Menzel and Krogh、2015）は、リファレンスとしてゲノム配列を使用することに加えて、タンパク質配列に対するアライメントを実行し、既存のツールよりも迅速な分類速度を達成している（kaiju紹介）。

　LMAT（Ames et al。、2013）、Kraken（Wood and Salzberg、2014）、Clark（Ounit et al。、2015）をなどの他のツールは、リードを正確なk-merマッチによって標的タクソン収集を可能にし、それによって、アラインベースのアプローチに匹敵する感度および特異性を維持しながら非効率的な塩基ごとのアライメントを回避する。このアプローチは、アライメントに基づく方法より一般的に高速であり、各分類群に属する参照配列から抽出されたk-mer収集のみ必要とするので、参照の柔軟性をより大きくする。したがって、DNAまたはRNA配列決定データから抽出されたk-merは、参照ゲノムのみを使用してしばしばnaturalバリアントを捕捉するアルゴリズムの感度を高める。

　上記のアライメントフリーアルゴリズムは、k-merマッチングのためのindex付け構造に依存する。例えば、Krakenは最小化オフセット配列を使用してその辞書編集的にソートされたk-merデータベースを索引付けし、Clarkはハッシュテーブルを使用してk-merとその分類情報との間のマッピングを格納する。 KrakenとClarkは両方とも、index構造の構築とリードの分類をサポートするために、大容量のコンピュータを必要とする（少なくとも70 GB~120 GBのRAM）（kraken紹介）。両方のアルゴリズムにはメモリフットプリントを小さくする必要のあるバリエーションが存在するが、フルバージョンと比較して感度が大幅に低くなるか、実行速度が遅くなることがよくある。このため、シーケンシングおよび参照ゲノムデータの絶え間なく増加する量は、メモリと計算の両方においてより優れたスケーラビリティを有するツールを必要とする。

本稿では、Sequencing Readの分類学的分類のためにMetaOthelloと呼ばれる新しいアルゴリズムを紹介する。著者らのアルゴリズムはtaxonomy固有のk-merシグネチャを基にして、taxonomyのどのレベルにも直接リードを割り当てる。最先端の方法であるKraken と Clarkと比較して、少なくとも1桁以上の速度向上を実現する、超高速k-mer分類をサポートするための新しいデータ構造l-Othelloを採用している。Kaijuよりも3倍速い。MetaOthelloは、メモリフットプリントを大幅に削減し、通常、上記の方法の3分の1しかメモリを必要としない。この控えめなメモリ要件により、本アルゴリズムは32 GBのRAMを備えた典型的なラボサーバ上で動作することができ、スーパーコンピュータのみが達成可能なメモリ要件よりも生物学研究者にとってよりアクセスしやすくなる。さらに、本アルゴリズムは、階層的なトップダウン分類分類を行うことができ、様々なデータセットのベンチマークによって検証された感度と特異性の両方の他のアルゴリズムと競合する性能を提供する。

図4にベンチマーク結果。

f:id:kazumaxneo:20180505123206j:plain

論文より転載（リンク）。

インストール

cent OSでテストした。

Github

https://github.com/xa6xa6/metaOthello/

git clone https://github.com/xa6xa6/metaOthello.git
cd metaOthello/build/
make build
#実行ファイル buildができる。
cd ../
make classifier
#実行ファイル classifierができる。

>./build

$ ./build

args: descriptiveFilename KmerFnamePrefix KmerFnameSuffix Kmer_length splitbits OutputFile workingDirectory

> ./classifier

$ ./classifier

#1 inputLothelloNodeIndex

#2 outputFolder

#3 Kmer_length

#4 threads_num

#5 fa_or_fq

#6 SE_or_PE

#7 inputSpeciesTaxoIdFile

#8 NCBIfullTaxoId2NameFile

#9 inputReadFile_1

(#10 inputReadFile_2)

ラン

GIthubにバクテリアの20mer、25mer、31merのindexデータベースがリンクされている。ここでは上のGithubリンクから20-merのindexをダウンロードした（サイズは24GB）。自分でindexデータベースを構築する場合はbuildコマンドを使用する。詳細はGithub参照。

ランにはさらに２つのファイルが必要。種名とtaxonの関係を示したファイル及びNCBIの命名則と種名の関係を示したファイルになる。ダウンロードする。

bacterial_speciesId2taxoInfo_file（直リンク）。

https://drive.google.com/open?id=0BxgO-FKbbXRIc3FkLVFvMlpVVGM

NCBI_names_file（直リンク）。

https://drive.google.com/open?id=0BxgO-FKbbXRIUFI2dHlBMXZhdTA

エラーメッセージが出ないが、上のヘルプメッセージ順番で記載すればランできる。k-merサイズ20、thread40とした。fastqはgz圧縮にも対応している。

classifier bacterial_20mer_L12.index output/ 20 40 fq PE bacterial_speciesId2taxoInfo.txt names.dmp.scientific pair1.fq  pair2.fq

SRAにdepositされたシーケンスデータを使ってランしてみる。SRA Explorerで"clinical"で検索してダウンロードした。fastqサイズは690MB。

SRA toolkitでfastqに変換し、metaOthelloを走らせる。

fastq-dump ~/./ncbi./public/sra/SRR7094156.sra -O test

./classifier bacterial_20mer_L12.index output/ 20 40 fq SE bacterial_speciesId2taxoInfo.txt names.dmp.scientific test/SRR7094156.fastq

出力。
> ls -l output4/taxoInfo/

$ ls -l output4/taxoInfo/

total 308

-rw-r--r-- 1 uesaka user 1532 May 5 11:21 class.txt

-rw-r--r-- 1 uesaka user 7879 May 5 11:21 family.txt

-rw-r--r-- 1 uesaka user 16109 May 5 11:21 genus.txt

-rw-r--r-- 1 uesaka user 3887 May 5 11:21 order.txt

-rw-r--r-- 1 uesaka user 223847 May 5 11:21 overall.txt

-rw-r--r-- 1 uesaka user 850 May 5 11:21 phylum.txt

-rw-r--r-- 1 uesaka user 51004 May 5 11:21 species.txt

分類体系ごとに結果がまとめられている。

ファイルを開いてみる。

> head class.txt

$ head class.txt

Class_index Class_ID Class_name

0 28211 Alphaproteobacteria

1 1760 Actinobacteria

2 31969 Mollicutes

3 183924 Thermoprotei

4 768503 Cytophagia

5 28216 Betaproteobacteria

6 191410 Chlorobia

7 1853228 Chitinophagia

8 183968 Thermococci

２カラム目はclass（綱）のID（hitした数と書いていたがそれは間違い）

全部は載せられないが、overall.txtが全結果となる。

> cut -f 1-10 overall.txt |head|column -t

$ cut -f 1-10 overall.txt |head|column -t

Species_index Species_ID Species_name Genus_index Genus_ID Genus_name Family_index Family_ID Family_name Order_index

0 1002672 Candidatus Pelagibacter sp. IMCC9063 0 198251 Candidatus Pelagibacter 0 1655514 Pelagibacteraceae 0

1 1003110 Verrucosispora maris 1 84593 Verrucosispora 1 28056 Micromonosporaceae 1

2 100379 Onion yellows phytoplasma 2 33926 Candidatus Phytoplasma 2 2146 Acholeplasmataceae 2

3 1006005 Metallosphaera cuprina 3 41980 Metallosphaera 3 118883 Sulfolobaceae 3

4 1006 Marivirga tractuosa 4 869806 Marivirga 4 200667 Flammeovirgaceae 4

5 1007105 Pusillimonas sp. T7-7 5 305976 Pusillimonas 5 506 Alcaligenaceae 5

6 100716 Chloroherpeton thalassium 6 100715 Chloroherpeton 6 191412 Chlorobiaceae 6

7 1008 Saprospira grandis 7 1007 Saprospira 7 89374 Saprospiraceae 7

8 1008460 Pyrococcus yayanosii 8 2260 Pyrococcus 8 2259 Thermococcaceae 8

ベストヒットしたのは、αプロテオバクテリア。種名はCandidatus（リンク）となっており、まだよくわかっていないバクテリアだった。

k-mer indexはバクテリアのデータベースから構築されているため、ウィルスやアーキアなどのゲノムを検出するには自分でデータベースを構築する必要があります。自分専用のk-mer indexを作りたいときは、GithubのPDFマニュアルを読んでみてください。

引用

A novel data structure to support ultra-fast taxonomic classification of metagenomic sequences with k-mer signatures.

Liu X, Yu Y, Liu J, Elliott CF, Qian C, Liu J

Bioinformatics. 2018 Jan 1;34(1):171-178.

2018-05-03

メタゲノムのtaxonomyアノテーションを行い定量する MGmapper

metagenome 生物種の推定 (taxonomic profiling) virus bacteria annotation plasmid archaea web tool abundance estimation in metagenomics data

　迅速で効率的なDNAシーケンシング技術の進歩により、堆積物[論文より ref.1] [ref.2]、水[ref.3]、氷[ref.4]、ヒトなど様々な環境から微生物群集を研究することが可能になった[ ref.6]。既知のDNA配列決定プラットフォームの中で、イルミナHiSeqおよびMiSeqは、大きなデータ出力および塩基対当たりのコストが比較的低いため、単一ゲノムおよびメタゲノミクス研究の両方にとって好ましい。ゲノムショットガンシーケンシング技術全体を適用すると、サンプル中の全てのDNA配列が決定され、数百万の短いヌクレオチド配列が生成される。単一のヒト腸試料からのメタゲノミクスデータは、何百もの生物を表す複雑な系であり、試料が多くの個体からの混合物として由来する場合にはさらに多様性が予想される。たとえば下水道、公共交通機関または動物園からの人間または動物が当てはまる。そのようなデータセットを分析することへの関心は、細菌またはウイルス病原体のモニタリング、抗菌抵抗性遺伝子の同定、ファージ同定、または単に存在する生物の完全なカタログを得ることであり得る。このような解析は簡単ではなく、大量のメモリを使用せずにfastq配列のリードを多くの参照配列データベースにマッピングし、配列アライメントを解析して検証するプログラム、偽陽性率の低いタクソノミー注釈および最終的に出力を提示するプログラム（SNPまたはコンティグアセンブリ）を使用して、細菌、ウイルス、菌類、植物などの多くの全ゲノムデータベースの使用を可能にする複雑なデータセットのルーチン分析にアクセスできるようになった。脊椎動物、脊椎動物、無脊椎動物などの脊椎動物でもあり、抗菌抵抗性遺伝子、16S rRNA、または一連のfasta配列に基づく任意のカスタムデータベースなどの遺伝子データベースの使用も可能である（一部略）。

　リードの各々をゲノムに割り当てるタスクは困難であり、擬陽性の予測の問題は、クエリー配列が標的配列の大きなデータベースに対してマッピングされるために常に考慮されるべき問題である。ターゲットデータベースサイズが大きくなるにつれて、ランダムヒットを見つける機会も増える。Blastプログラムスイート[ref.7]は、数十年前から、大きなデータベースに対するクエリー配列のペアワイズアライメントのため最も頻繁に使用されるプログラムの1つである。 Blastは、期待値の形式のフィルタをしきい値として使用し、偽陽性の数を減らす。分類法の分野では、フィルタやカットオフを使用することはほとんどないが、最近のベンチマーク研究では、in vitroとin silicoの両方のデータセットで評価するといくつかの方法が存在すると予測されている。この研究では、15種の分類法注釈法のベンチマーキングが行われた。そのうち2つ（Kraken [ref.9]およびCARMA3 [ref.10]）は、in vitroセットに存在するすべての種を正確に同定し、0.1％のリードカウント量閾値を使用した（以下略）。

分類法注釈を実行するためのいくつかの方法があり、以前のメタゲノミクスベンチマーク研究では、正しい注釈と誤った注釈とを区別するために閾値または後処理が適用されない限り、膨大な数の偽陽性種のアノテーションが起こることが問題でああった。MGmapperは、raw NGSデータを処理し、リファレンスベースのリード割り当てを実行し、その後、種および株レベルの解像度で信頼できるtaxonomyアノテーションを生成するためのパッケージである。 8種の属、11種および12種からなる in vitro細菌mock communityサンプルは、以前メタゲノミクス分類法のベンチマーキングに使用されていたものである。後処理フィルタを適用した後、著者らは種および属レベルで100％正確にtaxonomy を割り当てた。種レベルのアノテーションでは75％のrecallとprecisionが得られた。種レベルでMGmapperとKraken（紹介）を比較すると、MGmapperはリードの84.8％を使用して種レベルでtaxonomyに割り当て、Krakenでは70.5％であり、どちらの方法も偽陽性のないすべての種を特定したことを示している。出力は、拒否されたタクソノミー注釈と受け入れられたタクソノミ注釈の両方について、豊富なリード数統計のプレーンテキストとExcelシートである。 MGmapperのコマンドライン版ではカスタムデータベースの使用が可能で、完全なパイプラインはBitbuckedパッケージとして利用可能である。

Bitbucket

https://bitbucket.org/genomicepidemiology/mgmapper

MGmapper web

Best mappingを行うデータベースを選択する。初期はアーキアとバクテリアが選択されている。

f:id:kazumaxneo:20180502110824j:plain

Best modeはデータベース1,2とする。full modeは実行しないので空白。Trimmingは実行する。すでにアダプターとクオリティトリミングが行われたfastqならチェックを外す。

f:id:kazumaxneo:20180502111222j:plain

alignment criteriaのデフォルトのfractionは0.8になっている。MGmapperではbwaでリードをデータベースにアライメントし、リード長に対するmatches+mismatches (called: FMM, fraction of matches+mismatches) が0.8以上になるリードがマッチと見なされる。ペアエンドシーケンスでは、片リードだけしかこの基準を満たしてなければdiscardされる。

f:id:kazumaxneo:20180502111426j:plain

他にclade specificなアライメントや、検出感度のパラメータが設定できる。

f:id:kazumaxneo:20180502112925j:plain

最後にシーケンスデータのfastqをアップロードする。gz圧縮にも対応している（plain, gzipped (.gz) or compressed (.Z) fastq）。

f:id:kazumaxneo:20180502110459j:plain

Isolateをクリックしてペアエンドのファイルを選択し、Uploadをクリック。アップロードが終わると自動で解析がスタートする。

出力

公式の出力説明

https://cge.cbs.dtu.dk/services/MGmapper/output.php

混雑しており、テストランできなかったが、本来はpreprocessingしたfastqを使い、データベースにhitした種の存在量がプレーンテキストとexcelファイルで出力される。存在量はゲノムサイズで正規化されている。データベースにhitしなかったfastqをダウンロードすることもできる。

感想

混雑時は、ランまで時間がかかるためe-mailアドレスを記載するよう促されます。ゲノム特異的なプライマを設計するRUCSツール（紹介）と同じサーバーを使っているようですが、RUCSをテストした時は、結果のメールが届くまで2日かかりました。メタゲノムを処理するMGmapperは結果が出るまでさらに時間を要するかもしれません。余裕を持って実行してください。

引用

MGmapper: Reference based mapping and taxonomy annotation of metagenomics sequence reads

Thomas Nordahl Petersen, Oksana Lukjancenko, Martin Christen Frølund Thomsen, Maria Maddalena Sperotto, Ole Lund, Frank Møller Aarestrup, and Thomas Sicheritz-Pontén

PLoS One. 2017; 12(5): e0176469.

2018-05-03

SVDetect

human genome tumor structural variations (SV) 2010 Bioinformatics

　ゲノムの構造変異の同定は、ヒトの遺伝的多様性、進化、ならびに病因を理解する重要なステップである。癌を含む数多くの遺伝病は、構造変異（SV; Futreal et al、2004）と関連している。アレイベースの技術は、SVを検出するための多くの研究において成功しているが、ブレークポイントの検出における比較的低い分解能および小さなSVの特徴付けは依然として困難である。イルミナゲノムアナライザーやApplied Biosystems SOLiDシステムなどのハイスループットシーケンシング技術が導入され、ショートインサートペアエンドまたはメイトペアリードを使用することによりSVの検出能が改善された（Korbel et al、2007）。リファレンスゲノムへのマッピングの際に、ペアの順序、向きおよびインサートサイズなどの情報を利用した異常なマッピングペアを調べることで、潜在的なゲノム変動を示すことができる。

　ここでは、SV検出とペアエンドデータ（ショートとロング）からのSV予測のため利用可能なプログラムSVDetectを提示する。 SVDetectは、異なるタイプのSVを識別する。クラスタリングとスライディングウィンドウの両方の戦略で、大規模な挿入 - 欠失と逆位をゲノム規模で視覚化する。他のツールと比較して、本発明の方法の新規性は、（i）ペアエンドおよびメイトペア配列データの両方を分析すること、（ii）SV検出を改善するためにユニークなペアエンド制約を使用する。（iii）様々なタイプのタンデムduplicationを予測し、再編成を区別する。（iv）複数のサンプルにわたってSVを比較する; （v）コピー数プロファイルを構築する。（vi）SVのグラフィカルビューに対する様々な出力ファイルフォーマットを作成する。

インストール

依存

Perl 5.8.x, or newer
Config::General
Tie::IxHash
Parallel::ForkManager
SAMtools
BEDTools
Circos

cpanm Config::General Tie::IxHash Parallel::ForkManager #libにも入ってる

または

perl -MCPAN -e shell

% install Config::General

circos、SAMtools、BEDtoolsはbrewで導入できる。

本体　SourceForge

https://sourceforge.net/projects/svdetect/?source=typ_redirect

osx用の実行ファイルが含まれているので、ダウンロードして解凍する。

cd /SVDetect_r0.8b/bin/
./SVDetect

r$ ./SVDetect

Usage:

SVDetect <command> -conf <configuration_file> [-help] [-man]

Command:

linking detection and isolation of links

filtering filtering of links according different parameters

links2circos links conversion to circos format

links2bed paired-ends of links converted to bed format (UCSC)

links2compare compare and get common/specific links between samples

links2SV formatted output to show most significant SVs

cnv calculate copy-number profiles

ratio2circos ratio conversion to circos density format

ratio2bedgraph ratio conversion to bedGraph density format (UCSC)

パスの通ったディレクトリに移動しておく。

またはdockerで動かす。

docker pull omicsnut/svdetect

#ここでは共有ディレクトリ~/dataをコンテナの/homeと共有指定して起動する。
docker run -i -t -v ~/data/:/home omicsnut/svdetect
SVDetect #テスト

#抜けるには"control + P、Q"。起動しているか確認は"docker ps -a"。
#、再度ログインするには"docker attach <CONTAINER ID>"

ラン

テストデータを解析してみる。

cd test_sample/Neuroblastoma/
wget https://sourceforge.net/projects/svdetect/files/Data/Neuroblastoma_supp_data.tar.gz/download
tar -zxvf download
mv Neuroblastoma_supp_data/* .

./svdetect_script.sh

以下のようなシェルスクリプトになっている。

f:id:kazumaxneo:20180503114848j:plain

テストランすると、 docker環境では

error Can't locate SVG.pm in @INC

が最後に出る。CPANからSVG(v2.28)を入れると今度はcircos configurationのエラーが出てしまった。

SVdetectはランの各ステップにconfigurationファイルを使うが、作成手順のマニュアルが見つからなかった。詳細がわかれば追記します。

引用

SVDetect: a tool to identify genomic structural variations from paired-end and mate-pair sequencing data.

Zeitouni B, Boeva V, Janoueix-Lerosey I, Loeillet S, Legoix-né P, Nicolas A, Delattre O, Barillot E.

Bioinformatics. 2010 Aug 1;26(15):1895-6.

2018-05-02

RNA-seqのクロスコンタミを検出する Croco

contamination RNA seq RNAseqの定量

　核酸試料間の汚染は、分子生物学における潜在的な問題として長く認識されてきた。ポリメラーゼ連鎖反応（PCR）による増幅や、そして最近ではハイスループット配列決定でのPCR増幅は、ソースにかかわらず、また非常に低レベルの混入した核酸でさえ、十分な範囲で配列決定できることを意味する[ 論文より ref.1-9]。寄生虫、腸内細菌、内寄生虫または環境由来の目的生物の配列と汚染配列とを区別するために、様々なツールが既に開発されている。これらのアルゴリズムは、通常、特定の基準に基づいて汚染配列を同定し、リファレンスデータベースを使用して汚染物質の分類源を推測する。 Blobtoolsパイプライン[ref.10]（紹介）はGCの内容、カバレッジおよび分類学的割り当て（Basic Local Alignment Search Tool（BLAST）をNCBI nrに対して実行）に基づいて汚染配列を検出する。わずかに異なる方法であるAnvi'o [ref.11]は、最初にカバレッジおよび/またはk-merスペクトル基づいてコンティグを自動的にビンし、汚染由来ビンを特定する。MCSC（Model-based Categorical Sequence Clustering）[ref.5]は、シーケンスで観測された頻繁なパターンに基づいたクラスタリング手法（分割的階層クラスタリング）を使用し、UniRef90データベース（リンク）に対するブラストによって汚染クラスターを識別する。これらの方法（MCSCを除く）は、ゲノムデータに焦点を当てる。しかし、それらは汚染されていない公共データベースに部分的に依存し、遠く離れた生物からの汚染を検出するように設計されている。トランスクリプトームデータは現在、進化生物学で広く使用されているため、RNAシークエンシング（RNA-seq）データ用に設計された新しいツールCroCoを設計した。これは発現レベル推定値に依存したリファレンスゲノムフリーの方法であり、別のタイプの汚染、すなわちクロスコンタミネーションをターゲットとしている。

　クロスコンタミネーションは、所定のシークエンシングプロジェクトにおいて並行して処理するサンプル間の汚染として定義される。これは実験が汚染の起源であり、サンプル操作、DNA / RNA抽出、ライブラリーの調製および増幅、サンプルの多重化および不正確なバーコード配列決定の複数のベンチワークステップで潜在的に生じ得る。本発明者らの経験的観察は、multiplexによる複数種のcDNAライブラリーのシーケンス（例えば、その後の系統樹再構築のために）では、ある程度のクロスコンタミネーションが避けられないことを示す。この現象は、2つ以上の十分に離れた種の転写産物において、ヌクレオチドレベルで同一またはほぼ同一の配列があるときに明らかである（例えば、論文の図S1参照）。このような症例は、最近のいくつかの進化生物学研究で既に検出されている[ref.9,12-16]。クロスコンタミネーションは、種間の偽の類似性を生じ、あらゆる種類の下流比較分析に有害な結果をもたらす。

本研究者らのアプローチは、最初に、ペアワイズBLAST手順を用いてサンプル間で疑わしいほど類似している転写物のサブセットを決定する。 CroCoは全てのリードをメタトランスクリプトームに結合し、全ての転写物の「発現レベル」を定量する。この情報は、次の5つのカテゴリで各トランスクリプトを分類するために使用される。

clean: the transcript origin is from the focal sample.
cross contamination: the transcript origin is from an alien sample of the same experiment
dubious: expression levels are too close between focal and alien samples to determine the true origin of the transcript..
low coverage: expression levels are too low in all samples, thus hampering our procedure (which relies on differential expression) to confidently assign it to any category.
over expressed: expression levels are very high in at least 3 samples and CroCo will not try to categorize it. Indeed, such a pattern does not correspond to expectations for cross contaminations, but often reflect highly conserved genes such as ribosomal gene, or external contamination shared by several samples (e.g. Escherichia coli contaminations).

CroCoの手順は、潜在的なクロスコンタミネーションを検出し、定量化するために配列類似性を利用する。そのため、あまりにも密接に関連するサンプルが分析されると（腫瘍と正常細胞との比較など）、疑わしい事例および過剰発現数の過剰推定につながる可能性がある。

インストール

ubuntu14.04でテストした。

依存

BLAST-2.5.0+
R-3.1.0 (optional)
R library package visNetwork
R library package igraph

Mandatory dependencies - at least one mapping tool in the following list :

Bowtie-1.1.2
Kallisto-0.43.0
Rapmap-0.1.0

３つのマッピングツールは、本体の./install実行時に自動インストールされる。

本体　GitLab

http://gitlab.mbb.univ-montp2.fr/mbb/CroCo

git clone http://gitlab.mbb.univ-montp2.fr/mbb/CroCo.git
cd CroCo/utils/

#osx (実行中にbrew install coreutils gnu-getopt gawk grep gnu-sed gcc48 make cmakeを行うので注意する）
bash ./install_dependencies.sh --tool R --os macosx

#ubuntu (centosなら最後をcentosに変更する）テストではこちらを実行した。
bash ./install_dependencies.sh --tool all --os ubuntu


cd src/

> bash CroCo_v1.1.sh

$ bash CroCo_v1.1.sh

CroCo_v1.1.sh is a program that can detect potential cross-contaminations in assembled transcriptomes using sequencing reads to find true origin of transcripts.

Usage :

./CroCo_v1.1.sh [--mode p|u] [--tool B|B2|K|R|S] [--fold-threshold INT] [--minimum-coverage FLOAT] [--threads INT] [--output-prefix STR] [--output-level 1|2|3] [--graph yes|no] [--trim5 INT] [--trim3 INT] [--frag-length FLOAT] [--frag-sd FLOAT] [--suspect-id INT] [--suspect-len INT] [--add-option STR] [--recat STR]

--mode p|u : 'p' for paired and 'u' for unpaired (default : 'p') [short: -m]

--in STR : Name of the directory containing the input files to be analyzed (DEFAULT : working directory) [short: -i]

--tool B|K|R : 'B' for bowtie, 'K' for kallisto, 'R' for rapmap (DEFAULT : 'R') [short: -t]

--fold-threshold FLOAT : Value between 1 and N (DEFAULT : 2) [short: -f]

--minimum-coverage FLOAT : TPM value (DEFAULT : 0.2) [short: -c]

--overexp FLOAT : TPM value (DEFAULT : 300) [short: -d]

--threads INT : Number of threads to use (DEFAULT : 1) [short: -n]

--output-prefix STR : Prefix of output directory that will be created (DEFAULT : empty) [short: -p]

--output-level 1|2 : Select whether or not to output fasta files. '1' for none, '2' for all (DEFAULT : 2) [short: -l]

--graph yes|no : Produce graphical output using R (DEFAULT : no) [short: -g]

--add-option 'STR' : This text string will be understood as additional options for the mapper/quantifier used (DEFAULT : empty) [short: -a]

--recat SRT : Name of a previous CroCo output directory you wish to use to re-categorize transcripts (DEFAULT : no) [short: -r]

--trim5 INT : nb bases trimmed from 5' (DEFAULT : 0) [short: -x]

--trim3 INT : nb bases trimmed from 3' (DEFAULT : 0) [short: -y]

--suspect-id INT : Indicate the minimum percent identity between two transcripts to suspect a cross contamination (DEFAULT : 95) [short: -s]

--suspect-len INT : Indicate the minimum length of an alignment between two transcripts to suspect a cross contamination (DEFAULT : 40) [short: -w]

--frag-length FLOAT : Estimated average fragment length (no default value). Only used in specific combinations of --mode and --tool [short: -u]

--frag-sd FLOAT : Estimated standard deviation of fragment length (no default value). Only used in specific combinations of --mode and --tool [short: -v]

It is good practice to redirect information about each CroCo run into an output log file using the following structure :

'2>&1 | tee log_file'

Minimal working example :

CroCo_v0.1.sh --mode p 2>&1 | tee log_file

Exhaustive example :

CroCo_v0.1.sh --mode p --in data_folder_name --tool R --fold-threshold 2 --minimum-coverage 0.2 --overexp 300 --threads 8 --output-prefix test1_ --output-level 2 --graph yes --add-option '-v 0' --trim5 0 --trim3 0 --suspect-id 95 --suspect-len 40 --recat no 2>&1 | tee log_file

Exhaustive example using shortcuts :

CroCo_v0.1.sh -m p -i data_folder_name -t R -f 2 -c 0.2 -d 300 -n 8 -p test1_ -l 2 -g yes -a '-v 0' -x 0 -y 0 -s 95 -w 40 -r no 2>&1 | tee log_file

Example for re-categorizing previous CroCo results

CroCo_v0.1.sh -i data_folder_name -r previous_CroCo_results_folder_name -f 10 -c 0.5 -g yes 2>&1 | tee log_file

CrocoはDockerコンテナでも導入できます。

ラン

付属のテストデータを分析する。テストデータはsampleAとBの比較になっている。

tar -xvzf CroCo_dataset_test.tgz 
bash src/CroCo_v1.1.sh --mode p --in CroCo_dataset_test -l 1 --graph no # --graph yesだと図も出力

--inで指定したディレクトリに以下の命名規則でFASTAとfastqを準備して置く必要がある。

・for paired-end reads :
NAME.fasta (assembled transcriptome seqs)
NAME.L.fastq (raw illumina data LEFT)
NAME.R.fastq (raw illumina data RIGHT)

・for unpaired reads :
NAME.fasta (assembled transcriptome seqs)
NAME.fastq (raw illumina data)

ファイル名やヘッダーに特殊文字は使用してはならない(e.g. \/[]()|:;) 。

output

最初に説明した５グループのFASTAファイル及び５グループのリードカウントファイルが出力される（検出数がゼロならファイルはできない）。"all"ファイルには5グループ全ての発現量が記載される。

f:id:kazumaxneo:20180502202758j:plain

> column -t species_A.all |head -n 20

$ column -t species_A.all |head -n 20

Contig species_A_reads species_B_reads MaxOtherSpCov log2FoldChange Status

species_A|comp10376_c0_seq1 0 0 0 inf lowcov

species_A|comp6343_c0_seq1 0 0 0 inf lowcov

species_A|comp4018_c0_seq1 0 0 0 inf lowcov

species_A|comp16598_c0_seq1 0 136.775 136.775 -20.3834 contam

species_A|comp453_c0_seq1 0 0 0 inf lowcov

species_A|comp4443_c0_seq1 0 0 0 inf lowcov

species_A|comp6002_c0_seq1 0 0 0 inf lowcov

species_A|comp2523_c1_seq1 0 0 0 inf lowcov

species_A|comp20352_c0_seq1 223.257 0 0 inf clean

species_A|comp11276_c0_seq1 0 0 0 inf lowcov

species_A|comp10316_c0_seq1 0 0 0 inf lowcov

species_A|comp14979_c0_seq1 0 0 0 inf lowcov

species_A|comp1090_c0_seq1 0 0 0 inf lowcov

species_A|comp20633_c0_seq1 0 0 0 inf lowcov

species_A|comp7168_c0_seq1 0 0 0 inf lowcov

species_A|comp1371_c0_seq1 0 0 0 inf lowcov

species_A|comp17691_c0_seq1 0 0 0 inf lowcov

species_A|comp17691_c0_seq2 0 0 0 inf lowcov

species_A|comp16963_c0_seq1 0 0 0 inf lowcov

contaminationと判定されたtrasncriptも見つかる。

> column -t All_transcripts.quants |head -n 20

$ column -t All_transcripts.quants |head -n 20

Contig species_A_reads species_B_reads

species_A|comp18265_c0_seq1 0 0

species_B|sb_528476_ 0 0

species_B|sb_540258_ 0 0

species_B|sb_534367_ 0 0

species_A|comp7973_c0_seq1 0 0

species_B|sb_561352_ 0 0

species_B|sb_555461_ 0 0

species_B|sb_538423_ 0 0

species_B|sb_550205_ 0 0

species_B|sb_549570_ 0 0

species_B|sb_544314_ 0 0

species_B|sb_550979_ 0 112.563

species_B|sb_536865_ 0 0

species_B|sb_519827_ 0 0

species_B|sb_531609_ 0 0

species_B|sb_542756_ 0 0

species_B|sb_525718_ 0 0

species_B|sb_546812_ 0 0

species_B|sb_539098_ 0 0

引用

A software tool ‘CroCo’ detects pervasive cross-species contamination in next generation sequencing data

Paul Simion, Khalid Belkhir, Clémentine François, Julien Veyssier, Jochen C. Rink, Michaël Manuel, Hervé Philippe, and Maximilian J. Telford

BMC Biol. 2018; 16: 28.

2018-05-02

ONTのアーティファクトを取り除く CarrierSeq

low complexity Nanopore long read filtering

　環境メタゲノムシーケンシングは、多くの課題を提起する。第一に、複雑な土壌マトリックスと強靭な生物は、デオキシリボ核酸（DNA）とリボ核酸（RNA）の抽出を妨げる[論文より ref.1]。第2に、低バイオマス試料は、汚染の可能性も高める、さらなる抽出および濃縮ステップを必要とする[ref.2]。第3に、ターゲット増幅（例えば、16S rRNAアンプリコン）が分類学的分解能を低下させる一方で、全ゲノム増幅は集団にバイアスを起こすことがある[ref.3]。これらの課題に対処するために、著者らは低バイオマス難分解性サンプルに適合し、溶解するのが困難な生物の抽出プロトコールを開発した[ref.5]（pubmed）。 Bacillus subtilisの難溶性胞子を用いて開発されたこれらのプロトコールは、遠心分離を行わずに2×10 5細胞/ gの土壌を含む50mgのサンプルから少なくとも5％の抽出収率を達成している[ref.6]。さらに、増幅バイアスおよびさらなる汚染を回避するため、低入力量の標的DNA（枯草菌）をシャトルするゲノム担体（腸内細菌ファージλ） [ref7]を使い、ライブラリー調製および理想的なstoichiometryを調べる実験を行った。このアプローチにより、Oxford Nanopore Technologies（ONT）Minionシーケンサーを使い、1000ngのラムダDNAを用いて調製したBacillus subtilis DNAを0.2ngまで検出することができた。

　この論文では、キャリア配列決定法を用いて、標的DNAをゲノム担体で調製して低入力DNAを配列し、理想的なライブラリー調製およびstoichiometryで増幅せずにシーケンシングする方法を採用する。次に、シーケンス解析ワークフローであるCarrierSeqを使用して、ゲノムキャリアからの低入力ターゲットリードを特定する。著者らは1000ngのエンテロバクテリアファージλDNAのバックグラウンドにおける0.2ngのBacillus subtilis ATCC 6633 DNAの組み合わせからの配列決定によりCarrierSeqを実験的に試験した。キャリアのリードや低クオリティでlow complexityなリード（ONTのジャンク）をフィルタリグした後、ターゲットリード（枯草菌）、コンタミネーションリード、および high quality noise reads（HQNR）を検出した。これらのリードは、それらが特定のチャネル（細孔）と関連しており、ナノポア配列決定プロセスのアーティファクトのように見える。

　CarrierSeqはbwa-mem [ref.8]を実装して、最初にすべてのリードをゲノムキャリアにマップし、次にsamtools [9]とseqtk [10]を使用してマッピングされていないリードを抽出する。その後、CarrierSeqは品質スコアのしきい値を定義して（phread quality score ≥ 9）、低複雑度のリード[11]をfqtrim [12]を使い破棄する。このアンマップかつフィルタリングされて残ったリードセットは、reads of interest（ROI）とラベルされ、ROIは理論的にはターゲットリードと汚染由来リードの可能性がある。ただし、ROIには、クオリティスコアと複雑さのフィルタは満たしてしまうが、いかなるデータベースとも一致しない、特定のチャネルからの不均衡な読み取りとして定義されるhigh-quality noise reads（HQNR）も含まれる。リードをPoisson arrival processとして扱うことで、CarrierSeqは期待されるROIチャネルの分布をモデル化し、リード/チャネルのしきい値（xcrit）を超えるチャネルからのデータを拒否する。

example dataset

https://figshare.com/articles/Example_carrier_sequencing_fastQ_data_set_for_CarrierSeq/5868825/1

インストール

依存

Biopython - http://biopython.org/
SciPy - https://www.scipy.org/
bwa - https://github.com/lh3/bwa
seqtk - https://github.com/lh3/seqtk
samtools - https://github.com/samtools/samtools
fqtrim - https://ccb.jhu.edu/software/fqtrim/

dockerコンテナで導入する。

docker pull mojarro/carrierseq:latest

#イメージの確認
docker images

#コンテナの作成と起動
docker run mojarro/carrierseq

#ここでは共有ディレクトリ~/nanoporeをコンテナの/homeと共有指定して起動する。
docker run -i -t -v ~/nanopore/:/home mojarro/carrierseq 

#抜けるには"control + P、Q"。起動しているか確認は"docker ps -a"。
#、再度ログインするには"docker attach <CONTAINER ID>"

本体 Github

https://github.com/amojarro/carrierseq

git clone https://github.com/amojarro/carrierseq.git
cd carrierseq/
chmod +x carrierseq.sh

> ./carrierseq.sh

$ ./carrierseq.sh

Usage: carrierseq.sh options [-i INPUT] [-r REFERENCE] [-o OUTPUT] are required. Use -h for help

user-no-MacBook-Pro-2:carrierseq user$ ./carrierseq.sh -h

./carrierseq.sh: illegal option -- h

Usage: carrierseq.sh [-i INPUT] [-r REFERENCE] [-o OUTPUT]...

CarrierSeq requires bwa, samtools, seqtk, fqtrim, and biopython.

Reads to be analyzed must be compiled into a single fastq file and the reference genome must be in fasta format.

-i All reads to be analyzed *.fastq

-r Carrier reference genome *.fasta

-t Number of threads used for BWA mapping (default = 1)

-q User-defined quality (phred) score (default = 9)

-p User-defined p-value

(default = ~0.0001 or 0.05 / 512 active channels)

-o Output directory

ラン

./carrierseq.sh -i inout.fastq -r ref.fasta -o out -t 24 -q 9

-i All reads to be analyzed *.fastq
-r Carrier reference genome *.fasta
-t Number of threads used for BWA mapping (default = 1)
-q User-defined quality (phred) score (default = 9)
-p User-defined p-value (default = ~0.0001 or 0.05 / 512 active channels)

進捗とともに新しいディレクトリが作成され、最終的に合計９つのディレクトリができる。

f:id:kazumaxneo:20180502161524j:plain

07がhqnrsとして定義された、クオリティが高くlow compxilityな配列構造になっていないが、いかなるデータベースとも相同性を示さないおそらくジャンクの配列である。ROIとして残ったfastqのうち、しきい値を超えたチャネルからの出力がここに振り分けられる。

　論文の著者らの実験では、0.2 ngの B. subtilis DNAを1µgのLambda DNAと混ぜ、MinionのR.94フローセルで48hシーケンスしAlbacore v1でベースコールしている。スループットは6,4GBの717,432リードで、そこからLambdaゲノムにマッピングして、ROIを1811リード得ている。そのうちhqnrsと判定されたリードは1179で、True positive（08のディレクトリ）は632リードだった。自分でもテストデータをダウンロードして、CarrierSeqを使いhqnrsを抽出し、blastn解析してみた。数リードだけE.coliゲノムと高い相同性を示したが、およそ8-9割のリードはblastnでヒットした配列がゼロだった。

　hqnrsが本当にアーティファクトのリードであるならば、低クオリティなリードとして出力されるべきだが、Oxfordナノポアはタンパク質をセンサーの中心に使っているので、均一なフローセル作成が難しく歩留まりは高くない。したがって厳密なクオリティ評価も困難と思われる。ONTのシーケンス解析では、この論文の手法のようなクオリティスコアだけに依存しないpreprocessing手法を考える必要がある。本当のシーケンスリードであることを担保しないまま進めると、環境ゲノムから得た新種のゲノムだと考えていた配列が、実はジャンクだったということにもなりかねない。そしてこれはデータベースの汚染を引き起こす問題でもある。

引用

CarrierSeq: a sequence analysis workflow for low-input nanopore sequencing.

BMC Bioinformatics. 2018.

Mojarro A, Hachey J, Ruvkun G, Zuber MT, Carr CE

2018-05-01

ロングリードのシミュレーター SiLiCO

2016 Preprint simulator Nanopore long read

2019 7/28コマンド追記

　広範に使用されているPacBioプラットフォームおよびOxford Nanoporeプラットフォームを含む長いリード配列決定プラットフォームは、15〜20キロベースを超える配列断片を生成することを目的としており、構造変異の同定およびゲノムアセンブリを容易化する。しかしながら、ロングリードシーケンスは比較的高価でエラーが発生しやすく、配列決定の失敗はゲノミクス施設にとって重大な問題である。シーケンシング失敗のメカニズムを定量的に評価するためには、シーケンシングの結果を比較できる高度に再現可能で制御可能な参照データセットを持つことが不可欠である。著者らは、ロングリードシーケンスをリードする代表的なプラットフォームの両方から、標準化されたシーケンシング結果を生成するIn silicoのシミュレーションツールであるSiLiCOを報告する。

インストール

ubuntu14.04に導入した。

依存

Python 2.7.11 or higher and Python 3.5 or higher.
Snumpy
pybedtools*
natsort

本体のビルド時に導入されるので、持ってなくても問題ない。

Github

https://github.com/ethanagbaker/SiLiCO

git clone https://github.com/ethanagbaker/SiLiCO.git
cd SiLiCO/
python setup.py build
python setup.py install

> python SiLiCO.py -h

$ python SiLiCO.py -h

#############################################################################

## SiLiCO: Simulator of Long Read Sequencing in PacBio and Oxford Nanopore ##

#############################################################################

Usage: python SiLiCO.py -i </path/to/genome> -o </path/to/outDir> -m <mean read length> -s <standard dev of read lengths> -c <coverage> -t <trials> [-f]

[ FILE I/O ]

-i, --infile=<str>, REQ Input genome fasta file. See README for formatting requirments.

-o, --output=<str>, OPT Output directory for results. Default = Current directory

[ DISTRIBUTION PARAMETERS ]

-m, --mean_read_length=<int>, OPT Mean read length for in-silico read generation. Default = 10000 bp

-s, --standard_dev=<int>, OPT Standard deviation of in-silico reads. Default = 2050

-c, --coverage=<int>, OPT Desired genome coverage of in-silico sequencing. Default = 8

--trials=<int>, OPT Number of trials. Default = 1

[ MODES ]

-f, --fasta, OPT FASTA Mode. When present, converts bed files to FASTA sequences using the provided reference genome.

-n, --nanopore, Generate Oxford Nanopore data. Calculates a gamma distribution.

-p, --pacbio, Generate PacBio data. Calculates a log normal distribution. Default mode if none specified.

[ DOCUMENTATION ]

-h, --help Display this message.

--version What version of SiLiCO are you using?

--contact Report a bug or get more help.

--citation View the citation for SiLiCO.

ラン

ONTのロングリードを発生させる。

python SiLiCO.py -i input.fa -o output--nanopore

-i　Input genome fasta file. See README for formatting requirments**.-i, --infile=<str>, REQ Input genome fasta file. See README for formatting requirments**.
-o　Output directory for results. Default = Current directory
--fasta　 FASTA Mode. When present, converts bed files to FASTA sequences using the provided reference genome
--nanopore　Generate Oxford Nanopore data. Calculates a gamma distribution.-
--pacbio　Generate PacBio data. Calculates a log normal distribution. Default mode if none specified.
-m　Mean read length for in-silico read generation. Default = 10000 bp-m, --mean_read_length=<int>, OPT Mean read length for in-silico read generation. Default = 10000 bp
-s　Standard deviation of in-silico reads. Default = 2050
-c　Desired genome coverage of in-silico sequencing. Default = 8
--trials=<int>, OPT Number of trials. Default = 1

Pacbioのロングリードを発生させる。fasta出力する。

python SiLiCO.py -i input.fa -o output --fasta --pacbio

引用

SiLiCO: A Simulator of Long Read Sequencing in PacBio and Oxford Nanopore

Ethan Alexander Garcia Baker, Sara Goodwin, W. Richard McCombie, Olivia Mendivil Ramos

bioRxiv preprint doi: https://doi.org/10.1101/076901

https://www.biorxiv.org/content/early/2016/09/22/076901

2018-05-01

K-mer分析ツール KAT

k-mer ゲノム比較 (comparative genomics) assembly quality check evaluation tool duplication repetitive sequences 2017 Bioinformatics 結果の視覚化 (visualization) metagenome GC

2019 5/15 リンク、condaインストール追加

2019 5/16 タイトル修正

2020 9/27 help更新

ハイスループットの全ゲノムショットガン（WGS）データセットの迅速な解析は、大きなサイズが生み出す複雑さのためにチャレンジングである（Schatz et al、2012）。 WGSデータを分析するためのリファレンスが不要なアプローチは、基本的な品質、リード長、GCコンテンツ（Yang et al、2013）の調査とk-mer（サイズk のワード）スペクトラム探索を含む（Chor et al、2009; Lo and Chain、2014）。頻繁に使用されるリファレンスフリーの品質管理ツールはFastQC（http://www.bioinformatics.babraham.ac.uk/projects/fastqc/）である。

K-merスペクトラムは、データ品質（エラーのレベル、シーケンシングバイアス、シーケンシングカバレッジおよび潜在的汚染）だけでなく、ゲノムの複雑さ（サイズ、核型、ヘテロ接合性および反復含有量; Simpson、2014）に関する情報を明らかにする。 WGSデータセットのペアワイズ比較（Anvar et al、2014）により、スペクトラムの違いを強調して問題のあるサンプルを識別できる追加情報を抽出できる。

K-mer analysis Tookitは、シーケンスデータから直接的に任意の長さのk-merのスペクトルを迅速にカウント、比較、分析するためのツール群である。任意の分布のk-merをフィルタリングする機能も備えている。

KAT's Documentation

Welcome to KAT’s documentation! — kat 2.4.1 documentation

f:id:kazumaxneo:20180501115728j:plain

マニュアルより転載。k-merスペクトラムを描くことで、ゲノムに関する様々な情報を得ることができる。図にはないが、たいていのシーケンスデータでは左端に最も大きいピークがある（おそらくシーケンスエラー由来）。

kat-comp introduction

k-mer counting, part I: Introduction

インストール

依存

GCC V4.8+
make
autoconf V2.53+
automake V1.11+
libtool V2.4.2+
pthreads (probably already installed)
zlib
Python V3.5+ with the tabulate, scipy, numpy and matplotlib packages and C API installed. Python is optional but highly recommended, without python, KAT functionality is limited: no plots, no distribution analysis, and no documentation.
Sphinx-doc V1.3+ (Optional: only required for building the documentation.

Github

https://github.com/TGAC/KAT

ビルドはやや複雑だが、KATはbrewによるインストールがサポートされており、ワンライナーで導入できる。

brew install brewsci/bio/kat

#bioconda
conda install -c bioconda -y kat

$ kat

The K-mer Analysis Toolkit (KAT) contains a number of tools that analyse jellyfish K-mer hashes.

The First argument should be the tool/mode you wish to use:

* hist: Create an histogram of k-mer occurrences from a sequence file. Similar to

jellyfish histogram sub command but adds metadata in output for easy plotting,

also actually runs multi-threaded.

* gcp: K-mer GC Processor. Creates a matrix of the number of K-mers found given a GC

count and a K-mer count.

* comp: K-mer comparison tool. Creates a matrix of shared K-mers between two (or three)

sequence files.

* sect: SEquence Coverage estimator Tool. Estimates the coverage of each sequence in

a file using K-mers from another sequence file.

* cold: Given, reads and an assembly, calculates both the read and assembly K-mer

coverage along with GC% for each sequence in the assembly.

a file using K-mers from another sequence file.

* filter: Filtering tools. Contains tools for filtering k-mers and sequences based on

user-defined GC and coverage limits.

* plot: Plotting tools. Contains several plotting tools to visualise K-mer and compare

distributions.

Options:

-v [ --verbose ] Print extra information

--version Print version string

--help Produce help message

ラン

hist: histogram of k-mer occurrences.

> kat hist

$ kat hist

Kmer Analysis Toolkit (KAT) V2.4.2

Usage: kat hist [options] (<input>)+

Create an histogram of k-mer occurrences from the input.

Create an histogram with the number of k-mers having a given count, derived from the input, which can take the form of a single jellyfish hash, or one or more FastA or FastQ files. In bucket 'i' are tallied the k-mers which have a count 'c' satisfying 'low+i*inc <= c < low+(i+1)'. Buckets in the output are labeled by the low end point (low+i).

The last bucket in the output behaves as a catchall: it tallies all k-mers with a count greater or equal to the low end point of this bucket.

This tool is very similar to the "histo" tool in jellyfish itself. The primary difference being that the output contains metadata that make the histogram easier for the user to plot.

Options:

-o [ --output_prefix ] arg (="kat.hist") Path prefix for files generated by this program.

-t [ --threads ] arg (=1) The number of threads to use

-l [ --low ] arg (=1) Low count value of histogram

-h [ --high ] arg (=10000) High count value of histogram

-i [ --inc ] arg (=1) Increment for each bin

--5ptrim arg (=0) Ignore the first X bases from reads. If more that one file is provided you can specify different values for each file by

seperating with commas.

-N [ --non_canonical ] If counting fast(a/q), store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both strands.

-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense

of additional memory and lower coverage.

-H [ --hash_size ] arg (=100000000) If kmer counting is required for the input, then use this value as the hash size. If this hash size is not large enough for your

dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.

-d [ --dump_hash ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to

load and will likely consume a significant amount of disk space.

-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.

-v [ --verbose ] Print extra information.

--help Produce help message.

k-merサイズ27でペアエンドのfastqを分析。横軸の最大値は10000とする。

kat hist -t 8 -l 1 -h 10000 -m 27 -v -o output sample1_*.fastq

頻度を示したテキストファイルとヒストグラムのグラフが出力される。下はシロイヌナズナのfastq分析結果。

f:id:kazumaxneo:20180429204401j:plain

gcp: Creates a matrix of the number of K-mers found given a GC count and a K-mer count

> kat gcp

$ kat gcp

Kmer Analysis Toolkit (KAT) V2.4.2

Usage: kat gcp (<input>)+

Compares GC content and K-mer coverage from the input.

This tool takes in either a single jellyfish hash or one or more FastA or FastQ input files and then counts the GC nucleotides for each distinct K-mer in the hash. For each GC count and K-mer coverage level, the number of distinct K-mers are counted and stored in a matrix. This matrix can be used to analyse biological content within the hash. For example, it can be used to distinguish legitimate content from contamination, or unexpected content.

Options:

-o [ --output_prefix ] arg (="kat-gcp") Path prefix for files generated by this program.

-t [ --threads ] arg (=1) The number of threads to use

-x [ --cvg_scale ] arg (=1) Number of bins for the gc data when creating the contamination matrix.

-y [ --cvg_bins ] arg (=1000) Number of bins for the cvg data when creating the contamination matrix.

--5ptrim arg (=0) Ignore the first X bases from reads. If more that one file is provided you can specify different values for each file by seperating

with commas.

-N [ --non_canonical ] If counting fast(a/q), store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both strands.

-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense of

additional memory and lower coverage.

-H [ --hash_size ] arg (=100000000) If kmer counting is required for the input, then use this value as the hash size. If this hash size is not large enough for your

dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.

-d [ --dump_hash ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to load

and will likely consume a significant amount of disk space.

-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.

-v [ --verbose ] Print extra information.

--help Produce help message.

k-merサイズ27でペアエンドのfastqを分析。

kat gcp -t 10 -x 1 -y 10000 -m 27 -o gcc -v sample1_*.fastq

シロイヌナズナのシーケンスデータ分析結果。

f:id:kazumaxneo:20180429205507j:plain

激しくコンタミがあるバクテリアのシーケンスデータ分析結果。

f:id:kazumaxneo:20180501115356j:plain

comp: Compares jellyfish K-mer count hashes.

> kat comp

$ kat comp

Kmer Analysis Toolkit (KAT) V2.4.2

Usage: kat comp [options] <input_1> <input_2> [<input_3>]

Compares jellyfish K-mer count hashes.

There are two main use cases for this tool. The first is to compare K-mers from two K-mer hashes both representing K-mer counts for reads. The intersected output forms a matrix that can be used to show how related both spectra are via a density plot. The second use case is to compare K-mers generated from reads to those generated from an assembly, in this case the dataset for the reads must be provided first and the assembly second. This also produces a matrix containing the intersection of both spectra, but this is instead visualised via a stacked histogram.

There is also a third use case where K-mers from a third dataset as a filter, restricting the analysis to the K-mers present on that set. The manual contains more details on specific use cases.

Options:

-o [ --output_prefix ] arg (=kat-comp) Path prefix for files generated by this program.

-t [ --threads ] arg (=1) The number of threads to use.

-x [ --d1_scale ] arg (=1) Scaling factor for the first dataset - float multiplier

-y [ --d2_scale ] arg (=1) Scaling factor for the second dataset - float multiplier

-i [ --d1_bins ] arg (=1001) Number of bins for the first dataset. i.e. number of rows in the matrix

-j [ --d2_bins ] arg (=1001) Number of bins for the second dataset. i.e. number of rows in the matrix

--d1_5ptrim arg (=0) Ignore the first X bases from reads in dataset 1. If more that one file is provided for dataset 1 you can specify different values

for each file by seperating with commas.

--d2_5ptrim arg (=0) Ignore the first X bases from reads in dataset 2. If more that one file is provided for dataset 2 you can specify different values

for each file by seperating with commas.

-N [ --non_canonical_1 ] If counting fast(a/q) for input 1, store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both

strands.

-O [ --non_canonical_2 ] If counting fast(a/q) for input 2, store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both

strands.

-P [ --non_canonical_3 ] If counting fast(a/q) for input 3, store explicit kmer as found. By default, we store 'canonical' k-mers, which means we count both

strands.

-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense of

additional memory and lower coverage.

-H [ --hash_size_1 ] arg (=100000000) If kmer counting is required for input 1, then use this value as the hash size. If this hash size is not large enough for your

dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.

-I [ --hash_size_2 ] arg (=100000000) If kmer counting is required for input 2, then use this value as the hash size. If this hash size is not large enough for your

dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.

-J [ --hash_size_3 ] arg (=100000000) If kmer counting is required for input 3, then use this value as the hash size. If this hash size is not large enough for your

dataset then the default behaviour is to double the size of the hash and recount, which will increase runtime and memory usage.

-d [ --dump_hashes ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to load

and will likely consume a significant amount of disk space.

-g [ --disable_hash_grow ] By default jellyfish will double the size of the hash if it gets filled, and then attempt to recount. Setting this option to true,

disables automatic hash growing. If the hash gets filled an error is thrown. This option is useful if you are working with large

genomes, or have strict memory limits on your system.

-n [ --density_plot ] Makes a density plot. By default we create a spectra_cn plot.

-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.

-h [ --output_hists ] Whether or not to output histogram data and plots for input 1 and input 2

-v [ --verbose ] Print extra information.

--help Produce help message.

２サンプルのfastqを比較する。

kat comp -t 10 -v 'sample1_1.fq sample1_2.fq' 'sample2_1.fq sample2_2.fq'

kat-comp-main.mxとヒストグラムが出力される。ここではここでは同じゲノムのリアルデータ（PCR free）とシミュレーションデータを比較した。

f:id:kazumaxneo:20180501133141p:plain

plotコマンドで2サンプルのbiasを分析。density plot出力。

kat plot density kat-comp-main.mx

f:id:kazumaxneo:20180501133334j:plain

plotコマンドで2サンプルのbiasを分析。ヒストグラム出力。

kat plot spectra-mx -i kat-comp-main.mx

cold: Contig Length and Duplication analysis tool.

> kat cold

$ kat cold

Kmer Analysis Toolkit (KAT) V2.4.2

Usage: kat cold [options] <assembly> <reads>

COntig Length and Duplication analysis tool

Calculates median read k-mer coverage, assembly k-mer coverage and GC% across each sequence in the provided assembly. Then, assuming plotting is enabled, the results are converted into a scatter plot, where each point is colored according to a similar scheme used in spectra-cn plots, and sized according to its length. The y-axis representsmedian read K-mer coverage, and x-axis represents GC%.

The <assembly> should be a fasta file that is NOT gzipped compressed. The <reads> can be any number of <fasta/q> files, which CAN be gzipped compressed, or a pre-counted hash.

Options:

-o [ --output_prefix ] arg (="kat-cold") Path prefix for files generated by this program.

-x [ --gc_bins ] arg (=1001) Number of bins for the gc data when creating the contamination matrix.

-y [ --cvg_bins ] arg (=1001) Number of bins for the cvg data when creating the contamination matrix.

-t [ --threads ] arg (=1) The number of threads to use

--5ptrim arg (=0) Ignore the first X bases from reads. If more that one file is provided you can specify different values for each file by

seperating with commas.

-m [ --mer_len ] arg (=27) The kmer length to use in the kmer hashes. Larger values will provide more discriminating power between kmers but at the expense

of additional memory and lower coverage.

-H [ --hash_size ] arg (=100000000) If kmer counting is required, then use this value as the hash size for the reads. We assume the assembly should use half this

value. If this hash size is not large enough for your dataset then the default behaviour is to double the size of the hash and

recount, which will increase runtime and memory usage.

-d [ --dump_hashes ] Dumps any jellyfish hashes to disk that were produced during this run. Normally, this is not recommended, as hashes are slow to

load and will likely consume a significant amount of disk space.

-g [ --disable_hash_grow ] By default jellyfish will double the size of the hash if it gets filled, and then attempt to recount. Setting this option to true,

disables automatic hash growing. If the hash gets filled an error is thrown. This option is useful if you are working with large

genomes, or have strict memory limits on your system.

-p [ --output_type ] arg (=png) The plot file type to create: png, ps, pdf.

-v [ --verbose ] Print extra information.

--help Produce help message.

アセンブリしたfastaと、アセンブリに使ったfastqを指定する。

kat cold -t 10 -v assembly.fasta sample1_*.fastq

少しだけコンタミがあるバクテリアのシーケンスデータと、そのspadesアセンブリ FASTAの分析結果。

f:id:kazumaxneo:20180429214407j:plain

filter: K-mer filtering tools。kmerとseqがある。

> kat filter

$ kat filter

Kmer Analysis Toolkit (KAT) V2.4.2

Usage: kat filter <mode>

Filtering tools

First argument should be the filter mode you wish to use:

* kmer: Filters a jellyfish k-mer hash using user defined properties.

* seq: Filters sequences in a file based on k-mers in a given hash

Options:

-v [ --verbose ] Print extra information

--help Produce help message.

はじめにjellyfishなどでk-mer分析を行う。

jellyfish count -m 27 -s 10000000 -t 10sample1_*fq -o mer_counts.jf

k-mer 頻度10以上1000以下、k-merのGC率50-60%だけ抽出する。

kat filter kmer -t 10 -c 5 -d 1000 -g 50 -h 60 -o filtered mer_counts.jf

plot: Plotting tools.

> kat plot

$ kat plot

Kmer Analysis Toolkit (KAT) V2.4.2

Usage: kat plot <mode>

Create K-mer Plots

First argument should be the plot mode you wish to use:

* density: Creates a density plot from a matrix created with the "comp" tool or the

"GCP" tool. Typically this is used to compare two K-mer hashes produced

by different NGS reads, or to represent the kmer coverage vs GC count plots.

* profile: Creates a K-mer coverage plot for a single sequence. Takes in fasta

coverage output from the "sect" tool

* spectra-cn: Creates a stacked histogram using a matrix created with the "comp" tool.

Typically this is used to compare a jellyfish hash produced from a read set

to a jellyfish hash produced from an assembly. The plot shows the amount of

distinct K-mers absent, as well as the copy number variation present within

the assembly.

* spectra-hist: Creates a K-mer spectra plot for a set of K-mer histograms produced either

by jellyfish-histo or kat-histo.

* spectra-mx: Creates a K-mer spectra plot for a set of K-mer histograms that are derived

from selected rows or columns in a matrix produced by the "comp".

* cold: Takes in a stats file produced by "cold" and produces a scatter plot of

median read K-mer coverage vs GC% for each contig in the assembly, with

point's size being adjusted by sequence length

Options:

-v [ --verbose ] Print extra information

--help Produce help message.

plotはcompコマンドでの例のように、他のコマンドと組み合わせて使う。

この他、カバレッジを推定するsectがある。

KATマニュアルのwalkthroughセクションを読むと、サンブルのバッチエフェクト、シーケンスバイアスの調べ方などを学べます。

http://kat.readthedocs.io/en/latest/walkthrough.html#comparing-r1-v-r2-in-an-illumina-pe-dataset

追記

ゲノムサイズ推定

引用

KAT: a K-mer analysis toolkit to quality control NGS datasets and genome assemblies

Daniel Mapleson, Gonzalo Garcia Accinelli, George Kettleborough, Jonathan Wright, and Bernardo J. Clavijo

Bioinformatics. 2017 Feb 15; 33(4): 574–576.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

小メモリで高速にメタゲノムのtaxonomy assignmentを行う metaOthello

メタゲノムのtaxonomyアノテーションを行い定量する MGmapper

SVDetect

RNA-seqのクロスコンタミを検出する Croco

ONTのアーティファクトを取り除く CarrierSeq

ロングリードのシミュレーター SiLiCO

K-mer分析ツール KAT