KATK: variant calling directly from raw FASTQ without mapping

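KATK is a fast and accurate software tool for calling variants directly from raw NGS reads. It uses predefined k-mers to retrieve only the reads of interest from FASTQ files, then calls genotypes by aligning the retrieved reads locally. KATK does not use data on known polymorphisms, and the default genotype is NC (No Call). A reference or variant allele is called only when there is sufficient evidence in the data that it is present. As a result, KATK is not biased against rare variants or de novo mutations.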
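To make the read-retrieval idea concrete, here is a minimal sketch that keeps only the FASTQ records whose sequence contains at least one k-mer from a list. It is not how gmer_counter is implemented: the function name `filter_reads_by_kmers` and the plain-text k-mer list are assumptions for illustration, and the sketch ignores reverse complements and lowercase bases.

```shell
filter_reads_by_kmers() {
  # $1: text file with one k-mer per line (hypothetical format)
  # $2: uncompressed FASTQ file with 4 lines per record
  awk '
    NR == FNR { if ($0 != "") kmers[$0] = 1; next }   # first file: load the k-mers
    { rec[(FNR - 1) % 4] = $0 }                       # second file: buffer one record
    FNR % 4 == 0 {
      keep = 0
      for (k in kmers) if (index(rec[1], k) > 0) { keep = 1; break }
      if (keep) print rec[0] "\n" rec[1] "\n" rec[2] "\n" rec[3]
    }
  ' "$1" "$2"
}
```

Calling `filter_reads_by_kmers kmers.txt reads.fastq > selected.fastq` would write only the matching records. A linear scan like this is far too slow for whole-genome data, but it shows why no genome-wide mapping step is needed.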

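On simulated datasets, KATK achieved a false negative rate of 0.23% (sensitivity 99.77%) and a false discovery rate of 0.19%. Calling all human exonic regions with KATK took 1-2 hours. KATK is distributed under the terms of the GNU GPL v3, and the k-mer databases are distributed under the Creative Commons CC BY-NC-SA license. The source code is available on GitHub as part of the GenomeTester4 package (https://github.com/bioinfo-ut/GenomeTester4/). Binaries of the KATK package and the k-mer databases described in the paper are available at http://bioinfo.ut.ee/KATK/.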
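As a reminder of how the two rates relate to raw counts, the sketch below computes sensitivity (1 minus the false negative rate) and the false discovery rate from true positive, false positive and false negative counts. The function name `call_rates` and the counts in the usage note are hypothetical and chosen only to reproduce the percentages above; they are not the counts from the paper.

```shell
call_rates() {
  # $1: true positives, $2: false positives, $3: false negatives
  awk -v tp="$1" -v fp="$2" -v fn="$3" 'BEGIN {
    printf "sensitivity=%.2f%%\n", 100 * tp / (tp + fn)   # 1 - false negative rate
    printf "fdr=%.2f%%\n",         100 * fp / (tp + fp)   # false discovery rate
  }'
}
```

For example, `call_rates 9977 19 23` prints `sensitivity=99.77%` and `fdr=0.19%`. The sketch does not guard against zero denominators.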

https://bioinfo.ut.ee/KATK/

Manual

https://bioinfo.ut.ee/KATK/index.php?r=site/page&view=manual

Installation

Tested on Ubuntu 18.04 LTS.

Dependencies

Main program (GitHub)

git clone https://github.com/bioinfo-ut/GenomeTester4.git
cd GenomeTester4/
cd src
make gmer_counter
make gassembler
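After the two make targets finish, it can be worth confirming that both binaries actually exist before starting a long run. The following is a small sketch; the function name `check_tools` is hypothetical, and it assumes the binaries are written into the directory you pass in (here GenomeTester4/src).

```shell
check_tools() {
  # $1: directory expected to contain the freshly built binaries
  dir=$1
  rc=0
  for tool in gmer_counter gassembler; do
    if [ -x "$dir/$tool" ]; then
      echo "$tool: ok"
    else
      echo "$tool: missing"
      rc=1
    fi
  done
  return "$rc"
}
```

Run from the src directory, `check_tools .` reports each tool on its own line and returns a non-zero status if either one is missing.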

> ./gassembler

$ ./gassembler

gassembler version 4.2.0 (prerelease)

Usage: gassembler --dbi FILENAME --region_file FILENAME [ARGUMENTS]

Common options:

-v, --version - print version information and exit

-h, --help - print this usage screen and exit

--dbi FILENAME - index of sequenced reads (mandatory)

--region_file FILENAME - reference and kmer database (mandatory)

--sex male|female|auto - sex of the individual (default auto)

--coverage FLOAT | median | local | ignore - average sequencing depth (default - median, local - use local number of reads)

--num_threads - number of threads to use (default 24)

--min_p FLOAT - minimum call quality (default 0.95)

--min_pmut FLOAT - minimum reference call quality (default 0.50)

--exome - Disable quality models (needed if coverage variability is high)

--advanced - print advanced usage options
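The options in the usage screen above can be combined on one command line. The sketch below only prints a command instead of running it, so the result can be inspected first. The function name `build_gassembler_cmd` and the file names in the usage note are placeholders, and it is an assumption that --num_threads takes an integer argument (the help text does not show one).

```shell
build_gassembler_cmd() {
  # $1: read index, $2: region file, $3: sex (male|female|auto), $4: thread count
  case "$3" in
    male|female|auto) ;;
    *) echo "sex must be male, female or auto" >&2; return 1 ;;
  esac
  printf './gassembler --dbi %s --region_file %s --sex %s --num_threads %s\n' \
    "$1" "$2" "$3" "$4"
}
```

For example, `build_gassembler_cmd sample.index regions.txt auto 8` prints `./gassembler --dbi sample.index --region_file regions.txt --sex auto --num_threads 8`.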

> ./gmer_counter

$ ./gmer_counter

Nothing to do!

gmer_counter version 4.2.0 (prerelease)

Usage:

gmer_counter ARGUMENTS SEQUENCES...

Arguments:

-v | --version - Print version information and exit

-db DATABASE - SNP/KMER database file

-dbb DBBINARY - binary database file

-w FILENAME - write binary database to file

-32 - use 32-bit integeres for counts (default 16-bit)

--max_kmers NUM - maximum number of kmers per node

--silent - do not print kmer counts (default for index and binary database compilation)

--verbose - print kmer counts (default for counting)

--header - print header row

--total - print the total number of kmers per node

--unique - print the number of nonzero kmers per node

--kmers - print individual kmer counts (default if no other output)

--compile_index FILENAME - Add read index to database and write it to file

--distribution NUM - print kmer distribution (up to given number)

--num_threads - number of worker threads (default 24)

--prefetch - prefetch memory mapped files (faster on high-memory systems)

--recover - recover from FastA/FastQ errors (useful for corrupted streams)

--stats - print some statistics about sequence and kmers

-D - increase debug level

-DDB - increase database debug level
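Both tools default to 24 threads according to their usage screens, which may be more than a workstation has. A small sketch for choosing a thread count no larger than the number of available cores follows; the function name `threads_for_katk` is hypothetical, and it is an assumption that --num_threads accepts the resulting integer.

```shell
threads_for_katk() {
  # $1: number of available cores, $2: upper limit (defaults to 24, the documented default)
  cores=$1
  limit=${2:-24}
  if [ "$cores" -lt "$limit" ]; then
    echo "$cores"
  else
    echo "$limit"
  fi
}
```

On Linux the core count can come from `nproc` (GNU coreutils), for example `--num_threads "$(threads_for_katk "$(nproc)")"`.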

Test run

Downloading the data took about three days.

#1 Download region-specific files
wget http://bioinfo.ut.ee/KATK/downloads/KATK_db_20200401.tar.gz
tar zxvf KATK_db_20200401.tar.gz

#2 Download FASTQ files with sequencing reads of the individual NA12877 (coded as ERR194146) and unpack.
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_2.fastq.gz
gunzip ERR194146*.fastq.gz
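With downloads this large, an interrupted transfer is likely, so resuming partial files and checking the archives before unpacking can save time. The sketch below uses `wget -c` (continue a partial download) and `gzip -t` (test archive integrity), both standard options. The function names are hypothetical.

```shell
local_name() {
  # Strip everything up to the last "/" to get the local file name.
  printf '%s\n' "${1##*/}"
}

is_intact_gz() {
  # Succeeds only if the gzip archive passes its integrity test.
  gzip -t "$1" 2>/dev/null
}

fetch_and_check() {
  # $1: URL of a .gz file; resumes a partial download, then tests the result.
  wget -c "$1" || return 1
  is_intact_gz "$(local_name "$1")"
}
```

Each wget line above could then be replaced by a call such as `fetch_and_check ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR194/ERR194146/ERR194146_1.fastq.gz`, rerunning it until it succeeds.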

#3 Compile the read index with gmer_counter, then call genotypes with gassembler
./gmer_counter -dbb cmd_20190410.dbb --compile_index ERR194146.index ERR194146*.fastq
./gassembler --dbi ERR194146.index --region_file cmd_20191031.txt > ERR194146.calls
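To repeat the two steps for other samples, they can be wrapped in a small function that stops if the first step fails. This is a sketch rather than an official workflow: the function name `run_katk` and the `GMER_COUNTER` / `GASSEMBLER` variables are hypothetical, and the flag spellings follow the usage screens shown earlier (`--dbi`, `--region_file`).

```shell
# Paths to the two binaries; override these when they live elsewhere.
GMER_COUNTER=${GMER_COUNTER:-./gmer_counter}
GASSEMBLER=${GASSEMBLER:-./gassembler}

run_katk() {
  # $1: sample prefix (e.g. ERR194146), $2: binary k-mer database, $3: region file
  sample=$1
  dbb=$2
  regions=$3
  # Step 1: retrieve the reads of interest and write the read index.
  "$GMER_COUNTER" -dbb "$dbb" --compile_index "$sample.index" "$sample"*.fastq || return 1
  # Step 2: call genotypes from the indexed reads.
  "$GASSEMBLER" --dbi "$sample.index" --region_file "$regions" > "$sample.calls" || return 1
}
```

With the files from this test run, the call would be `run_katk ERR194146 cmd_20190410.dbb cmd_20191031.txt`. If the second step fails, a partial .calls file may be left behind and should be deleted.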