2019-01-16

KrakenUniq

2019 1/17 エラー修正

2024/02/14 追記

　メタゲノミクス分類手法は、データセット内の各リードに taxonomic identityをアサインすることを試みる。メタゲノミクスデータにはしばしば何千万ものリードが含まれているため、分類は、通常、長さk（k-mers）の短いワードの正確な一致を使用して行われる。結果には、リードの分類が含まれているが、アライメントされたゲノム内のポジションは含まれない（[論文のref.1]のレビューのように）。しかしながら、リードカウントは欺かれる可能性がある。サンプルの抽出、取り扱い、またはシーケンシング中に実験室キットまたは環境から導入された汚染により、大量の偽のidentificationsが引き起こされる可能性がある [ref.2,3]。サンプル投入が少量だと、汚染の問題をさらに複雑にする可能性がある。たとえば、感染症の臨床診断にシークエンシングを使用する場合、投入DNAノ0.1％未満だけが目的の微生物に由来する可能性がある[ref.4, 5]。ゲノムの複雑さの低い領域やデータベースゲノム自体の汚染から、追加の疑似マッチが生じる可能性がある[ref.6]。

　そのような偽陽性のリードはゲノムの小さな部分にのみ一致して起きる。典型的には、種のゲノムが低複雑度領域を含み、その種に合致する唯一のリードがこの領域に入る場合など起きる。真に存在する微生物からのリードは、1つまたは少数の場所に集中するのではなく、ゲノム全体に比較的均一に分布すべきである。ゲノムアライメントはこの情報を明らかにすることができる。しかし、アライメントはリソースを集中的に使用するため、ゲノムすべてのindexを作成する必要があり、すべてのリードをそれらのindexと比較するための比較的遅いアライメントステップが必要になる。いくつかのメタゲノミクス手法では、カバレッジ情報を使用してマッピングや定量の精度を向上させているが、これらの手法では、より遅いアラインメント手法を入力として必要としている[ref.7 link]。アセンブリベースの方法は偽陽性を回避するのにも役立つが、これらは非常に豊富な種にのみ有用である[ref.8 link]。

　ここでは、非常に高速なk-merベースの分類と高速なk-mer濃度推定とを組み合わせた新しい方法であるKrakenUniqを提示する。 KrakenUniqはKrakenメタゲノム分類器[ref.9]に基づいており、また、各taxonについて特定されたユニークなk-mersの数を数えるために、効率的な濃度推定アルゴリズムHyperLogLog [ref.10, 11, 12]を使用する方法が追加されている。 KrakenUniqは、各ゲノム中のユニークなk-mersのうち、どれだけがリードでカバーされているかを数えることによって、true-positiveのマッチからfalse-positiveを識別することができる。さらに、KrakenUniqは、メタゲノム分類を改善するための追加の新機能を実装している。（a）複数のデータベースに対して階層的に検索できる。（b）taxonomy は、系統およびプラスミドのためのノードを含むように拡張され、それによりそれらの検出が可能になる。（c）database build scriptは、NCBIウイルスゲノムリソース[ref.13]から10万を超えるウイルスの追加を可能にする。KrakenUniqは、Krakenによって提供される情報のスーパーセットを提供し、同時に、同じくらい早く、または少し高速に実行でき、分類時に追加メモリをほとんど使用しない。

　KrakenUniqは、メタゲノム実験で同定されたすべての分類群の効率的なk-merカウント情報を提供するために開発された。主なワークフローは次のとおりである。リードが処理されると、各k-merにデータベースからtaxonがアサインされる（論文図1a）。 KrakenUniqは、各分類群のHyperLogLogデータスケッチをインスタンス化し、それにk-mersを追加する（論文図1bおよび追加ファイル1：HyperLogLogアルゴリズムのセクション1）。リードを分類した後、KrakenUniqはtaxonomic treeを走査し、各taxonの見積りをその親とマージする。分類レポート出力には、入力データで観測された各taxonの固有のk-mer数とk-merカバレッジデプスが含まれている（論文図1c）。

（以下省略）

図1. Overview of the KrakenUniq algorithm and output. 論文より転載

2022/07/06

Metagenomics folks: we finally found a way to let you download our huge (300GB+) KrakenUniq DBs directly (and free!), with help from @BenLangmead. Our two most popular DBs are at https://t.co/2GjBTSL4Uf, which has links to our Amazon cloud storage 1/2
— Steven Salzberg 💙💛 (@StevenSalzberg1) 2022年6月29日

We've released another Kraken (KrakenUniq, just published) Very elegant work by @fbreitw using randomized counting algorithm to enhance the Kraken metagenomics software. #microbiome https://t.co/bgwUeJrik7
— Steven Salzberg (@StevenSalzberg1) November 21, 2018

インストール

ubuntu14.04、miniconda3-4.0.5環境で動作確認を行った。

依存

Jellyfish v1.1.11

Jellyfish ver.2には対応していないので注意する。

ハードウエア要件

KrakenUniq requires a lot of RAM - ideally 128GB - 512GB. For more memory efficient classification consider using centrifuge.

本体　Github

#condaを使って導入できる（linuxのみ）
conda install -y -c bioconda krakenuniq

> krakenuniq --help

$ krakenuniq --help

Usage: krakenuniq --report-file FILENAME [options] <filename(s)>

Options:

--db NAME Name for Kraken DB (default: none)

--threads NUM Number of threads (default: 1)

--fasta-input Input is FASTA format

--fastq-input Input is FASTQ format

--gzip-compressed Input is gzip compressed

--bzip2-compressed Input is bzip2 compressed

--hll-precision INT Precision for HyperLogLog k-mer cardinality estimation, between 10 and 18 (default: 12)

--exact Compute exact cardinality instead of estimate (slower, requires memory proportional to cardinality!)

--quick Quick operation (use first hit or hits)

--min-hits NUM In quick op., number of hits req'd for classification

NOTE: this is ignored if --quick is not specified

--unclassified-out FILENAME

Print unclassified sequences to filename

--classified-out FILENAME

Print classified sequences to filename

--output FILENAME Print output to filename (default: stdout); "off" will

suppress normal output

--only-classified-output

Print no Kraken output for unclassified sequences

--preload Loads DB into memory before classification

--paired The two filenames provided are paired-end reads

--check-names Ensure each pair of reads have names that agree

with each other; ignored if --paired is not specified

--help Print this message

--version Print version information

Experimental:

--uid-mapping Map using UID database

If none of the *-input or *-compressed flags are specified, and the

file is a regular file, automatic format detection is attempted.

> krakenuniq-download --help

$ krakenuniq-download --help

Unknown database --help.

krakenuniq-download [<options>] <pattern> <pattern>*

ARGUMENTS

<pattern> can be one of

'contaminants' Contaminant sequences from UniVec and EmVec.

'taxonomy' NCBI taxonomy mappings from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/

'nucleotide' Download nucleotide sequences using a query specified using --search or --ac.

'microbial-nt' Download microbial sequences from nt database.

'viral-neighbors' Download viral strain sequences from the NCBI Viral Genome Resource.

(Search: "viruses[organism] not cellular organisms[orgn] not wgs[prop] not gbdiv syn[prop] and (nuccore genome samespecies[filter] and (complete[title]) not unverified[title])").

'genbank/DOMAIN' Download all complete genomes for DOMAIN from GenBank.

'refseq/DOMAIN' Download all complete genomes for DOMAIN from RefSeq.

'refseq/DOMAIN/ASS_LEVEL'

'refseq/DOMAIN/ASS_LEVEL/COLUMN=value1(/COLUMN=value2)*'

Possible values for DOMAIN: bacteria viral archaea fungi protozoa invertebrate plant vertebrate_mammalian vertebrate_other.

Possible values for ASS_LEVEL: Any, Complete_Genome, Chromosome, Scaffold and Contig.

Possible values for COLUMN: Any column in the NCBI assembly_summary.txt, e.g. species_taxid or assembly_accession.

Example: 'refseq/vertebrate_mammalian/Any/species_taxid=9606' <- download all human assemblies

COMMON OPTIONS

-o <directory> Folder to which the files are downloaded. Default: '.'

--db <directory> Alternative to -o: Download to <directory>/{library,taxonomy}.

--threads <# of threads> Number of processes when downloading (uses xargs). Default: '5'

--rsync, -R Download using rsync.

--overwrite Redownload and overwrite files with the same name.

--verbose Be verbose.

--dust, -D Mask low-complexity regions using dustmasker.

--min-seq-len X Filter all sequences from the FASTA files that have less than X bp.

WHEN USING DATABASE nucleotide or viral-neighbors:

--search SEARCH_TERM Download all sequences returned from a NCBI nucleotide search.

When used with viral-neighbors, it subsets the viral genomes by the search.

E.g. "txid1570291[Organism]" for Ebola virus sequences (taxonomy ID 1570291).

--ac AC1,AC2 Alternative to --search. Download specified ACs of nucleotide database.

--mapping-file MAP Map accessions using the specified mapping file(s) (comma-separated).

Possible values: nucl_est, nucl_gb, nucl_gss, nucl_wgs.

For viral-neighbors, the default is nucl_gb. Unset by giving it an empty string.

Downloaded from ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/accession2taxid/.

--retmode RM Specify return mode. Possible options: text (default), asn.1, xml.

--rettype RT Specify return type for download. Note that no mapping files are generated when

rettype is specified. Possible options: fasta (default), gb, gbc, native, acc, seqid, ft,

gbwithparts, fasta_cds_na, fasta_cds_aa. Note that only gb and fasta files are split, while the other formats stay in chunks.

The resulting files will have the rettype as extension.

WHEN USING DATABASE nt:

--taxa TAXLIST Provide list of taxa that are kept in the database. Comma separated, supply either taxIDs or

one of the following division names: archaea bacteria fungi parasitic_worms protozoa viral

Default: "bacteria,archaea,viral,fungi,protozoa"

--exclude-environmental-taxa

Exclude taxa that are named 'environmental samples'.

WHEN USING DATABASE refseq OR genbank:

--fna <seq types> Comma-separated list of sequence types, including genomic, rna, rna_from_genomic, cds_from_genomic. Default: genomic.

See the assembly project FTP site for available sequences

-u Filter unplaced sequences.

> krakenuniq-build --help

$ krakenuniq-build --help

Usage: krakenuniq-build [task option] [options]

Task options (exactly one can be selected -- default is build):

--download-taxonomy Download NCBI taxonomic information

--download-library TYPE Download partial library (TYPE = one of "refseq/bacteria", "refseq/archaea", "refseq/viral").

Use krakenuniq-download for more options.

--add-to-library FILE Add FILE to library

--build Create DB from library (requires taxonomy d/l'ed and at

least one file in library)

--rebuild Create DB from library like --build, but remove

existing non-library/taxonomy files before build

--clean Remove unneeded files from a built database

--shrink NEW_CT Shrink an existing DB to have only NEW_CT k-mers

--standard Download and create default database, which contains complete genomes

for archaea, bacteria and viruses from RefSeq, as well as viral strains

from NCBI. Specify --taxids-for-genomes and --taxids-for-sequences

separately, if desired.

--help Print this message

--version Print version information

Options:

--db DBDIR Kraken DB directory (mandatory except for --help/--version)

--threads # Number of threads (def: 1)

--new-db NAME New Kraken DB name (shrink task only; mandatory

for shrink task)

--kmer-len NUM K-mer length in bp (build/shrink tasks only;

def: 31)

--minimizer-len NUM Minimizer length in bp (build/shrink tasks only;

def: 15)

--jellyfish-hash-size STR Pass a specific hash size argument to jellyfish

when building database (build task only)

--jellyfish-bin STR Use STR as Jellyfish 1 binary.

--max-db-size SIZE Shrink the DB before full build, making sure

database and index together use <= SIZE gigabytes

(build task only)

--shrink-block-offset NUM When shrinking, select the k-mer that is NUM

positions from the end of a block of k-mers

(default: 1)

--work-on-disk Perform most operations on disk rather than in

RAM (will slow down build in most cases)

--taxids-for-genomes Add taxonomy IDs (starting with 1 billion) for genomes.

Only works with 3-column seqid2taxid map with third

column being the name

--taxids-for-sequences Add taxonomy IDs for sequences, starting with 1 billion.

Can be useful to resolve classifications with multiple genomes

for one taxonomy ID.

--min-contig-size NUM Minimum contig size for inclusion in database.

Use with draft genomes to reduce contamination, e.g. with values between 1000 and 10000.

--library-dir DIR Use DIR for reference sequences instead of DBDIR/library.

--taxonomy-dir DIR Use DIR for taxonomy instead of DBDIR/taxonomy.

Experimental:

--uid-database Build a UID database (default no)

--lca-database Build a LCA database (default yes)

--no-lca-database Do not build a LCA database

--lca-order DIR1 Impose a hierarchical order for setting LCAs.

--lca-order DIR2 The directories must be specified relative to the libary directory

... (DBDIR/library). When setting the LCAs, k-mers from sequences in

DIR1 will be set first, and only unset k-mers will be set from

DIR2, etc, and final from the whole library.

Use this option when including low-confidence draft genomes,

e.g use --lca-order Complete_Genome --lca-order Chromosome to

prioritize more complete assemblies.

Keep in mind that this option takes considerably longer.

> krakenuniq-extract-reads --help

$ krakenuniq-extract-reads --help

/home/uesaka/.pyenv/versions/miniconda3-4.2.12/bin/krakenuniq-extract-reads version [unknown] calling Getopt::Std::getopts (version 1.12 [paranoid]),

running under Perl version 5.26.2.

Usage: krakenuniq-extract-reads [-OPTIONS [-MORE_OPTIONS]] [--] [PROGRAM_ARG1 ...]

The following single-character options are accepted:

With arguments: -t

Boolean (without arguments): -f -v -i -a -p

Options may be merged together. -- stops processing of options.

Space is not required between options and their arguments.

[Now continuing due to backward compatibility and excessive paranoia.

See 'perldoc Getopt::Std' about $Getopt::Std::STANDARD_HELP_VERSION.]

krakenuniq-extract-reads: Extract all reads from FASTQ file that are matched to a specied taxon by KrakenUniq

Usage: krakenuniq-extract-reads [OPTIONS] <taxon> <kraken> <fasta/fastq>

<taxon> taxonomy ID, possibly multiple separated by ','

<kraken> kraken result file

<fasta/fastq> fasta/fastq file, possibly gzipped

Options:

-a input is FASTA file (default: FASTQ)

-f output in FASTA format

-i invert: print all reads not matching taxon

-t TAXDB Include children of taxonomy IDs, using TAXDB to find them

-v verbose

-p paired-end reads: use a '%' in fasta/q file name as placeholder for 1 and 2

Example:

krakenuniq-extract-reads -p 9606 result.kraken input_%.fq

outputs all reads of input_1.fq and input_2.fq that have the taxonomy ID 9606 to STDOUT.

> krakenuniq-report --help

$ krakenuniq-report --help

Usage: krakenuniq-report --db KRAKEN_DB_NAME [OPTIONS] <kraken output file(s)>

OPTIONS:

--show-zeros Show full taxonomy table.

--taxon-counts Input files are in the format '<taxon ID><tab><count>' instead of Kraken output.

--taxon-list Input files is list of taxon IDs instead of Kraken output.

-h This message.

This script should only be used when post-processing raw KrakenUniq output, and k-mer counts and coverages are not needed. For most use-cases, krakenuniq --report-file is better than krakenuniq-report.

> krakenuniq-mpa-report --help

$ krakenuniq-mpa-report --help

Usage: krakenuniq-mpa-report [--db KRAKEN_DB_NAME] [options] <kraken output file(s)>

Options:

--db NAME Name of Kraken database

(default: none)

--show-zeros Display taxa even if they lack a read in any sample

--header-line Display a header line indicating sample IDs

(sample IDs are the filenames)

--intermediate-ranks Display taxa not at the standard ranks with x__ prefix

> krakenuniq-filter --help

$ krakenuniq-filter --help

Usage: krakenuniq-filter [--db KRAKEN_DB_NAME] [--threshold NUM] <kraken output file(s)>

Threshold must be between 0 and 1.

> krakenuniq-translate --help

$ krakenuniq-translate --help

Usage: krakenuniq-translate [--db KRAKEN_DB_NAME] [--mpa-format] <kraken output file(s)>

データベースのビルド

1、taxonomyのダウンロード (350MBくらい)

krakenuniq-download --db DBDIR taxonomy

$ ls -alh taxonomy/

total 345M

drwxr-xr-x 2 uesaka user 4.0K Jan 14 10:27 .

drwxr-xr-x 4 uesaka user 4.0K Jan 14 10:35 ..

-rw-r--r-- 1 uesaka user 167M Jan 14 10:23 names.dmp

-rw-r--r-- 1 uesaka user 133M Jan 14 10:23 nodes.dmp

-rw-r--r-- 1 uesaka user 45M Jan 14 10:23 taxdump.tar.gz

-rw-r--r-- 1 uesaka user 29 Jan 14 10:27 timestamp

２、データベースのダウンロード

#Choose one of the following: bacteria, viral, archaea, fungi, protozoa, invertebrate, plant, vertebrate_mammalian, vertebrate_other


#REgseqの全bacteria、archaea
krakenuniq-download --db DBDIR --threads 20 --dust refseq/bacteria refseq/archaea

#plant(3種のみ　コンタミチェックに使う)
krakenuniq-download --db DBDIR_plant --threads 20 --dust refseq/plant

#fungi (10種) 
krakenuniq-download --db DBDIR_fungi --threads 20 --dust refseq/invertebrate

#protozoa (3種) 
krakenuniq-download --db DBDIR_protozoa --threads 20 --dust refseq/protozoa

#invertebrate
krakenuniq-download --db DBDIR_invertebrate --dust refseq/invertebrate


## Contaminant sequences from UniVec and EmVec, plus the human reference genome 
krakenuniq-download --db DBDIR refseq/vertegrate_mammalian/Chromosome/species_taxid=9606 

## All viral genomes from RefSeq plus viral 'neighbors' in NCBI Nucleotide 
krakenuniq-download --db DBDIR_virus --threads 20 refseq/viral/Any viral-neighbors 

## All microbial (including eukaryotes) sequences in the NCBI nt database （並列処理に対応していない?）
krakenuniq-download --db DBDIR --dust microbial-nt

#
krakenuniq-download --db DBDIR --threads 20 \
--taxa "archaea,bacteria,viral,fungi,protozoa,helminths" \
--dust --exclude-environmental-taxa nt

#DB/にnt~fnaファイルがダウンロードされる。

--dust Mask low-complexity regions using dustmasker.
--db <directory> Alternative to -o: Download to <directory>/{library,taxonomy}.
--threads <NUM> Number of processes when downloading (uses xargs). Default: '5'
--taxa <taxlist> Provide list of taxa that are kept in the database. Comma separated, supply either taxIDs or
--exclude-environmental-taxa Exclude taxa that are named 'environmental samples'.

（refseq/のplant、fhungi、protozoa、bacteria、archaea、viral/Any、 viral-neighborsのみ動作確認を行った）

３、データベースビルド（時間がかかる）

krakenuniq-build --db DBDIR --kmer-len 31 --threads 40 --taxids-for-genomes --taxids-for-sequences

--kmer-len <NUM> K-mer length in bp (build/shrink tasks only; def: 31)
--taxids-for-genomes Add taxonomy IDs (starting with 1 billion) for genomes.
--taxids-for-sequences Add taxonomy IDs for sequences, starting with 1 billion.

実行方法

レポートファイル作成まではkrakenと同じ流れで進める。

#メモリに余裕があれば、データベースをmemoryにpreloadしてインメモリデータベースにしておくことでかなり早くなる
krakenuniq --db DBDIR/ --preload --threads 20

#k-merアサインを実行
krakenuniq --threads 20  --db DBDIR/ --fastq-input input1.fq input2.fq > kmer-assignment

--preload Loads DB into memory before classification

> head -n 5 kmer-assignment

U M02077:12:000000000-A8737:1:1101:26651:6691 0 300 0:270

C M02077:12:000000000-A8737:1:1101:7099:6692 51589 300 0:124 2237:5 0:36 2237:12 0:31 51589:8 0:31 2237:23

U M02077:12:000000000-A8737:1:1101:27349:6692 0 299 0:269

U M02077:12:000000000-A8737:1:1101:25832:6693 0 300 0:270

リードごとにtaxonomyアサインされる（raw KrakenUniq output）。

レポート作成

krakenuniq-report --db DBDIR/ kmer-assignment > report

--show-zeros Show full taxonomy table

すでにkrakenのデータベースがあるなら、以下のコマンドでtaxDBを作成するだけでランできる。

cd KRAKEN_DB/
#DBディレクトリで次のコマンドを打ち、DB/にtaxDBを追加
build_taxdb taxonomy/names.dmp taxonomy/nodes.dmp > taxDB

krakenuniq --report-file FILENAME --preload --db KRAKEN_DB/\
 --threads 20 --fastq-input input.fq

追記

kra24/02/14 kenuniqはkraken1ベースのDBしか使用できないと思っていたが、コミュニティからコードが改善されてpull requestされ（#220）、masterにマージされたらしい（#249）。

引用

KrakenUniq: confident and fast metagenomics classification using unique k-mer counts

F. P. Breitwieser, D. N. Baker and S. L. Salzberg

Genome Biology 201819:198