植物RNA seqシーケンシングデータからvirusリードを検出する kodoja

Kodojaはk-merプロファイリングを使用してRNA-seqまたはsRNA-seのfastq/fasta生データからウイルス配列を特定するツール。 k-merを用いた系統分類ツールKrakenとおよびタンパク質レベルでの配列マッチングのKaijuを組み合わせている（Burrows-Wheeler変換している）。

wiki

https://github.com/abaizan/kodoja/wiki/Kodoja-Manual

インストール

mac os10.14のminiconda3-4.3.30環境でテストした。

依存

You can use Python 2.7 or Python 3, specifically Kodoja has been tested on Python 3.6.

FastQC v0.11.5
Trimmomatic v0.36
Kraken v1.0
Kaiju v1.5.0
Python packages:
numpy v1.9
biopython v1.67
pandas v0.14
ncbi-genome-download v0.2.6

本体　Github

#Anaconda環境でcondaを使い導入、公式では仮想環境を使っているが、ここでは直接導入
conda install -y -c bioconda kodoja

> kodoja_search.py -h

$ kodoja_search.py -h

usage: kodoja_search.py [-h] [--version] -o OUTPUT_DIR -d1 KRAKEN_DB -d2

KAIJU_DB -r1 READ1 [-r2 READ2] [-f DATA_FORMAT]

[-t THREADS] [-s HOST_SUBSET] [-m TRIM_MINLEN]

[-a TRIM_ADAPT] [-q KRAKEN_QUICK] [-p]

[-c KAIJU_SCORE] [-l KAIJU_MINLEN] [-i KAIJU_MISMATCH]

Kodoja Search is a tool intended to identify viral sequences

in a FASTQ/FASTA sequencing run by matching them against both Kraken and

Kaiju databases.

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

-o OUTPUT_DIR, --output_dir OUTPUT_DIR

Output directory path, required

-d1 KRAKEN_DB, --kraken_db KRAKEN_DB

Kraken database path, required

-d2 KAIJU_DB, --kaiju_db KAIJU_DB

Kaiju database path, required

-r1 READ1, --read1 READ1

Read 1 file path, required

-r2 READ2, --read2 READ2

Read 2 file path

-f DATA_FORMAT, --data_format DATA_FORMAT

Sequence data format (default fastq)

-t THREADS, --threads THREADS

Number of threads (default 1)

-s HOST_SUBSET, --host_subset HOST_SUBSET

Subset sequences with this tax id from results

-m TRIM_MINLEN, --trim_minlen TRIM_MINLEN

Trimmomatic minimum length

-a TRIM_ADAPT, --trim_adapt TRIM_ADAPT

Illumina adapter sequence file

-q KRAKEN_QUICK, --kraken_quick KRAKEN_QUICK

Number of minium hits by Kraken

-p, --kraken_preload Kraken preload database

-c KAIJU_SCORE, --kaiju_score KAIJU_SCORE

Kaju alignment score

-l KAIJU_MINLEN, --kaiju_minlen KAIJU_MINLEN

Kaju minimum length

-i KAIJU_MISMATCH, --kaiju_mismatch KAIJU_MISMATCH

Kaju allowed mismatches

The main output of ``kodoja_search.py`` is a file called ``virus_table.txt``

in the specified output directory. This is a plain text tab-separated table,

the columns are as follows:

1. Species name,

2. Species NCBI taxonomy identifier (TaxID),

3. Number of reads assigned by *either* Kraken or Kaiju to this species,

4. Number of Reads assigned by *both* Kraken and Kaiju to this species,

5. Genus name,

6. Number of reads assigned by *either* Kraken or Kaiju to this genus,

7. Number of reads assigned by *both* Kraken and Kaiju to this genus.

The output directory includes additional files, including ``kodoja_VRL.txt``

(a table listing the read identifiers used) which is intended mainly as

input to the ``kodoja_retrieve.py`` script.

> kodoja_build.py -h

kazuma@kamisakumanoMBP:~/Documents/metagenome_simulation$ kodoja_build.py -h

usage: kodoja_build.py [-h] [--version] -o OUTPUT_DIR [-t THREADS]

[-p HOST_TAXID] [-d DOWNLOAD_PARALLEL] [-n]

[-e [EXTRA_FILES [EXTRA_FILES ...]]]

[-x [EXTRA_TAXIDS [EXTRA_TAXIDS ...]]] [-v]

[-b KRAKEN_TAX] [-k KRAKEN_KMER] [-m KRAKEN_MINIMIZER]

[-a DB_TAG]

Kodoja database construction

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

-o OUTPUT_DIR, --output_dir OUTPUT_DIR

Output directory path, required

-t THREADS, --threads THREADS

Number of threads, default=1

-p HOST_TAXID, --host_taxid HOST_TAXID

Host tax ID

-d DOWNLOAD_PARALLEL, --download_parallel DOWNLOAD_PARALLEL

Parallel genome download, default=4

-n, --no_download Genomes have already been downloaded

-e [EXTRA_FILES [EXTRA_FILES ...]], --extra_files [EXTRA_FILES [EXTRA_FILES ...]]

List of extra files added to "extra" dir

-x [EXTRA_TAXIDS [EXTRA_TAXIDS ...]], --extra_taxids [EXTRA_TAXIDS [EXTRA_TAXIDS ...]]

List of taxID of extra files

-v, --all_viruses Build databases with all viruses (not plant specific)

-b KRAKEN_TAX, --kraken_tax KRAKEN_TAX

Path to taxonomy directory

-k KRAKEN_KMER, --kraken_kmer KRAKEN_KMER

Kraken kmer size, default=31

-m KRAKEN_MINIMIZER, --kraken_minimizer KRAKEN_MINIMIZER

Kraken minimizer size, default=15

-a DB_TAG, --db_tag DB_TAG

Suffix for databases

> kodoja_retrieve.py -h

$ kodoja_retrieve.py -h

usage: kodoja_retrieve.py [-h] [--version] -o FILE_DIR -r1 READ1 [-r2 READ2]

[-f USER_FORMAT] [-t TAXID] [-g] [-s]

Kodoja Retrieve is used with the output of Kodoja Search to

give subsets of your input sequencing reads matching viruses.

optional arguments:

-h, --help show this help message and exit

--version show program's version number and exit

-o FILE_DIR, --file_dir FILE_DIR

Path to directory of kodoja_search results, required

-r1 READ1, --read1 READ1

Read 1 file path, required

-r2 READ2, --read2 READ2

Read 2 file path, default: False

-f USER_FORMAT, --user_format USER_FORMAT

Sequence data format, default: fastq

-t TAXID, --taxID TAXID

Virus tax ID for subsetting, default: All viral

sequences

-g, --genus Include sequences classified at genus

-s, --stringent Only subset sequences identified by both tools

The main output of ``kodoja_search.py`` is a file called ``virus_table.txt``

(a table summarising the potential viruses found), but the specified output

directory will also contain ``kodoja_VRL.txt`` (a table listing the read

identifiers). This second file is used as input to ``kodoja_retrieve.py``

along with the original sequencing read files.

A sub-directory ``subreads/`` will be created in the output directory,

which will include either FASTA or FASTQ files named as follows:

* ``subset_files/virus_all_sequences1.fasta`` FASTA output

* ``subset_files/virus_all_sequences1.fastq`` FASTQ output

And, for paired end datasets,

* ``subset_files/virus_all_sequences2.fasta`` FASTA output

* ``subset_files/virus_all_sequences2.fastq`` FASTQ output

However, if the ``-t 12345`` option is used rather than ``virus_all_...``

the files will be named ``virus_12345_...`` instead.

kodoja_build.pyはウイルス/宿主ゲノムをダウンロードして、新しいKrakenデータベースとKaijuデータベースを作成するコマンド。 kodoja_retrieve.pyはサーチ結果から関心のあるシーケンスを取り出すコマンド。

データベースの作成

mkdir kodojaDB_v1.0
cd kodojaDB_v1.0/
wget https://zenodo.org/record/1406071/files/kodojaDB_v1.0.tar.gz
tar -zxvf kodojaDB_v1.0.tar.gz

実行方法

クエリのfastqとデータベースを指定して実行する。

kodoja_search.py --kraken_db krakenDB/ --kaiju_db kaijuDB/\
 --read1 pair_R1.fastq --read2 pair_R2.fastq\
 -o out_dir

> column -t out_dir/virus_table.txt

$ column -t out_dir/virus_table.txt

Species Species TaxID Species sequences Species sequences (stringent) Genus Genus sequences Genus sequences (stringent)

Cassava brown streak virus 137758 45 45 Ipomovirus 0 0

Ugandan cassava brown streak virus 946046 28 28 Ipomovirus 0 0

Tobacco etch virus 12227 19 19 Potyvirus 0 0

Galaxyでも使えるようです。マニュアルwikiを読んでください。

引用

https://github.com/abaizan/kodoja

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

植物RNA seqシーケンシングデータからvirusリードを検出する kodoja