メタゲノムデータからホストゲノムのコンタミを除く KneadData

2020 4/1 9 インストール手順とhelp追記、タイトル修正

2021 6/11 link修正

2022/07/02 インストール手順修正

バクテリアのメタゲノム解析では、度々ホストゲノムのコンタミリードがシーケンスされてしまうことがある。KneadDataはそのようなホスト由来のリードや低クオリティのリードをフィルタリングするために設計されたツールである。

Trimmomaticでのクオリティトリミング-> ホストゲノムへのアライメント -> アライメントされなかったリードの抽出、の順番で解析される。

公式サイト

biobakery / kneadData / wiki / Home — Bitbucket

インストール

依存

Trimmomatic (version == 0.33) (automatically installed)
Bowtie2 (version >= 2.2) (automatically installed)
Python (version >= 2.7)
Java Runtime Environment TRF (optional)
Fastqc (optional)
SAMTools (only required if input file is in BAM format)

Github

#bioconda (link)
mamba create -n kneaddata -y
conda activate kneaddata
mamba install -c bioconda -y kneaddata

#pip
pip install kneaddata

#homebrew
brew install kneaddata

> kneaddata -h

# kneaddata -h

usage: kneaddata [-h] [--version] [-v] -i INPUT -o OUTPUT_DIR [-db REFERENCE_DB] [--bypass-trim] [--output-prefix OUTPUT_PREFIX] [-t <1>] [-p <1>] [-q {phred33,phred64}] [--run-bmtagger] [--run-trf] [--run-fastqc-start] [--run-fastqc-end] [--store-temp-output]

[--remove-intermediate-output] [--cat-final-output] [--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}] [--log LOG] [--trimmomatic TRIMMOMATIC_PATH] [--max-memory MAX_MEMORY] [--trimmomatic-options TRIMMOMATIC_OPTIONS] [--bowtie2 BOWTIE2_PATH]

[--bowtie2-options BOWTIE2_OPTIONS] [--no-discordant] [--reorder] [--serial] [--bmtagger BMTAGGER_PATH] [--trf TRF_PATH] [--match MATCH] [--mismatch MISMATCH] [--delta DELTA] [--pm PM] [--pi PI] [--minscore MINSCORE] [--maxperiod MAXPERIOD]

[--fastqc FASTQC_PATH]

KneadData

optional arguments:

-h, --help show this help message and exit

-v, --verbose additional output is printed

global options:

--version show program's version number and exit

-i INPUT, --input INPUT

input FASTQ file (add a second argument instance to run with paired input files)

-o OUTPUT_DIR, --output OUTPUT_DIR

directory to write output files

-db REFERENCE_DB, --reference-db REFERENCE_DB

location of reference database (additional arguments add databases)

--bypass-trim bypass the trim step

--output-prefix OUTPUT_PREFIX

prefix for all output files

[ DEFAULT : $SAMPLE_kneaddata ]

-t <1>, --threads <1>

number of threads

[ Default : 1 ]

-p <1>, --processes <1>

number of processes

[ Default : 1 ]

-q {phred33,phred64}, --quality-scores {phred33,phred64}

quality scores

[ DEFAULT : phred33 ]

--run-bmtagger run BMTagger instead of Bowtie2 to identify contaminant reads

--run-trf run TRF to remove tandem repeats

--run-fastqc-start run fastqc at the beginning of the workflow

--run-fastqc-end run fastqc at the end of the workflow

--store-temp-output store temp output files

[ DEFAULT : temp output files are removed ]

--remove-intermediate-output

remove intermediate output files

[ DEFAULT : intermediate output files are stored ]

--cat-final-output concatenate all final output files

[ DEFAULT : final output is not concatenated ]

--log-level {DEBUG,INFO,WARNING,ERROR,CRITICAL}

level of log messages

[ DEFAULT : DEBUG ]

--log LOG log file

[ DEFAULT : $OUTPUT_DIR/$SAMPLE_kneaddata.log ]

trimmomatic arguments:

--trimmomatic TRIMMOMATIC_PATH

path to trimmomatic

[ DEFAULT : $PATH ]

--max-memory MAX_MEMORY

max amount of memory

[ DEFAULT : 500m ]

--trimmomatic-options TRIMMOMATIC_OPTIONS

options for trimmomatic

[ DEFAULT : ILLUMINACLIP:/TruSeq3-SE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:50 ]

MINLEN is set to 50 percent of total input read length

bowtie2 arguments:

--bowtie2 BOWTIE2_PATH

path to bowtie2

[ DEFAULT : $PATH ]

--bowtie2-options BOWTIE2_OPTIONS

options for bowtie2

[ DEFAULT : --very-sensitive ]

--no-discordant do not include discordant alignments for pairs (ie one of the two pairs aligns)

[ DEFAULT : Discordant alignments are included ]

--reorder order the sequences in the same order as the input

[ DEFAULT : With discordant paired alignments sequences are not ordered ]

--serial filter the input in serial for multiple databases so a subset of reads are processed in each database search

bmtagger arguments:

--bmtagger BMTAGGER_PATH

path to BMTagger

[ DEFAULT : $PATH ]

trf arguments:

--trf TRF_PATH path to TRF

[ DEFAULT : $PATH ]

--match MATCH matching weight

[ DEFAULT : 2 ]

--mismatch MISMATCH mismatching penalty

[ DEFAULT : 7 ]

--delta DELTA indel penalty

[ DEFAULT : 7 ]

--pm PM match probability

[ DEFAULT : 80 ]

--pi PI indel probability

[ DEFAULT : 10 ]

--minscore MINSCORE minimum alignment score to report

[ DEFAULT : 50 ]

--maxperiod MAXPERIOD

maximum period size to report

[ DEFAULT : 500 ]

fastqc arguments:

--fastqc FASTQC_PATH path to fastqc

[ DEFAULT : $PATH ]

(base) root@7c3303a460d3:/data#

実行方法

Kneaddata は、デフォルトでは、trimomatic が提供する "NexteraPE" アダプタを使用して、アダプター配列をトリミングする。その他の利用可能なオプションはNexteraPE", "TruSeq2", "TruSeq3", "none"となっている。

１、初めにリファレンスゲノムのインデックスを作成する。ヒトやマウスのゲノムならKneadDataのコマンドからbowite2のindex作成済みゲノムをダウンロードできる。

mkdir genome
kneaddata_database --download human_genome bowtie2 genome/ #3.44GBある

（options）　自分で用意したリファレンスの場合、fastaからindexを作成するなら次のように打つ。１でダウンロードした時はこのステップは不要。

bowtie2-build Homo_sapiens.fasta -o Homo_sapiens_db test
#作成したindexをディレクトリに収納する
mkdir genome
mv test* genome/

２、シングルエンドfastq。作成したindexのフォルダを指定してランする。

kneaddata --i seq.fastq --reference-db genome/ --output kneaddata_output -t 20

-t　number of threads [ Default : 1 ]
-i　input FASTQ file (add a second argument instance to run with paired input files)
-o　directory to write output files
-d　location of reference database (additional arguments add databases)

２、ペアエンドfastq

kneaddata --i seq1.fastq --i seq2.fastq -db genome/ --output kneaddata_output -t 20

Trimmomaticがないと言われたら --trimmomatic フラグをつけて trimmomaticのディレクトリを指定する。面倒ならwgetでダウンロードしておいておけばいい。

#HP
wget http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-0.39.zip
unzip Trimmomatic-0.39.zip

#trimmomaticを指定してラン
kneaddata --input seq1.fastq --input seq2.fastq -db genome/ --output kneaddata_output -t 20 --trimmomatic Trimmomatic-0.39/

終わると複数のfastqとlogが出力される。

~kneaddata.trimmed.fastq: クオリティトリミングされたリード。

~kneaddata.fastq: リファレンスにアライメントされなかったリード（クオリティコントロール済みのfastq）。

~kneaddata_$DATABASE_bowtie2_contam.fastq: リファレンス（ホストゲノムなど）にアライメントされたリード。

2022/02/13

何度もファイルの読み書きをするため、I/Oの速度がランタイムにかなり影響してきます。3GBx2のfastq.gzファイルをランするのに必要な時間を計ってみると、PCI4接続 optane SSDだと36分、SATA3接続HDDだと47分でした。

引用

https://github.com/LangilleLab/microbiome_helper/wiki/Metagenomics-Sequencing-Pre-processing

例えばこちらのpreprintでヒトゲノムのリード除去に使用されています。

https://www.biorxiv.org/content/10.1101/804443v2.full.pdf