macでインフォマティクス (Informatics on a Mac)


A collection of notes on NGS-related informatics.

Kodoja: detecting virus reads in plant RNA-seq sequencing data

 

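Kodoja is a tool that uses k-mer profiling to identify viral sequences in raw RNA-seq or sRNA-seq data in fastq/fasta format. It combines Kraken, a k-mer-based taxonomic classifier, with Kaiju, which matches sequences at the protein level (using the Burrows-Wheeler transform).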
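To make the idea of k-mer profiling concrete, here is a toy sketch that splits a read into overlapping k-mers and measures how many of them occur in a reference sequence. This is only an illustration: the function names and sequences are made up for this example, and Kraken's real database lookup and taxonomic assignment are considerably more involved.

```python
def kmers(seq, k):
    """Return all overlapping substrings of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def shared_kmer_fraction(read, reference, k):
    """Fraction of the read's k-mers that also occur in the reference."""
    ref_kmers = set(kmers(reference, k))
    read_kmers = kmers(read, k)
    if not read_kmers:
        # The read is shorter than k, so there is nothing to compare.
        return 0.0
    hits = sum(1 for km in read_kmers if km in ref_kmers)
    return hits / len(read_kmers)


# Made-up sequences, purely for illustration.
reference = "ATGCGTACGTTAGC"
read = "CGTACGTT"

print(kmers(read, 4))
print(shared_kmer_fraction(read, reference, 4))
```

A read whose k-mers mostly match a viral reference is a candidate viral read; real classifiers apply this idea against a database of many genomes.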

 

 

wiki

https://github.com/abaizan/kodoja/wiki/Kodoja-Manual

 

Installation

Tested in a miniconda3-4.3.30 environment on macOS 10.14.

Dependencies

You can use Python 2.7 or Python 3; Kodoja has been tested specifically on Python 3.6.

  • FastQC v0.11.5
  • Trimmomatic v0.36
  • Kraken v1.0
  • Kaiju v1.5.0
  • Python packages:
      • numpy v1.9
      • biopython v1.67
      • pandas v0.14
      • ncbi-genome-download v0.2.6

 

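Main program: GitHub

# Install with conda in an Anaconda environment. The official instructions use a virtual environment, but here it is installed directly.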
conda install -y -c bioconda kodoja

> kodoja_search.py -h

$ kodoja_search.py -h
usage: kodoja_search.py [-h] [--version] -o OUTPUT_DIR -d1 KRAKEN_DB -d2
                        KAIJU_DB -r1 READ1 [-r2 READ2] [-f DATA_FORMAT]
                        [-t THREADS] [-s HOST_SUBSET] [-m TRIM_MINLEN]
                        [-a TRIM_ADAPT] [-q KRAKEN_QUICK] [-p]
                        [-c KAIJU_SCORE] [-l KAIJU_MINLEN] [-i KAIJU_MISMATCH]

Kodoja Search is a tool intended to identify viral sequences
in a FASTQ/FASTA sequencing run by matching them against both Kraken and
Kaiju databases.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory path, required
  -d1 KRAKEN_DB, --kraken_db KRAKEN_DB
                        Kraken database path, required
  -d2 KAIJU_DB, --kaiju_db KAIJU_DB
                        Kaiju database path, required
  -r1 READ1, --read1 READ1
                        Read 1 file path, required
  -r2 READ2, --read2 READ2
                        Read 2 file path
  -f DATA_FORMAT, --data_format DATA_FORMAT
                        Sequence data format (default fastq)
  -t THREADS, --threads THREADS
                        Number of threads (default 1)
  -s HOST_SUBSET, --host_subset HOST_SUBSET
                        Subset sequences with this tax id from results
  -m TRIM_MINLEN, --trim_minlen TRIM_MINLEN
                        Trimmomatic minimum length
  -a TRIM_ADAPT, --trim_adapt TRIM_ADAPT
                        Illumina adapter sequence file
  -q KRAKEN_QUICK, --kraken_quick KRAKEN_QUICK
                        Number of minium hits by Kraken
  -p, --kraken_preload  Kraken preload database
  -c KAIJU_SCORE, --kaiju_score KAIJU_SCORE
                        Kaju alignment score
  -l KAIJU_MINLEN, --kaiju_minlen KAIJU_MINLEN
                        Kaju minimum length
  -i KAIJU_MISMATCH, --kaiju_mismatch KAIJU_MISMATCH
                        Kaju allowed mismatches

The main output of ``kodoja_search.py`` is a file called ``virus_table.txt``
in the specified output directory. This is a plain text tab-separated table,
the columns are as follows:

1. Species name,
2. Species NCBI taxonomy identifier (TaxID),
3. Number of reads assigned by *either* Kraken or Kaiju to this species,
4. Number of Reads assigned by *both* Kraken and Kaiju to this species,
5. Genus name,
6. Number of reads assigned by *either* Kraken or Kaiju to this genus,
7. Number of reads assigned by *both* Kraken and Kaiju to this genus.

The output directory includes additional files, including ``kodoja_VRL.txt``
(a table listing the read identifiers used) which is intended mainly as
input to the ``kodoja_retrieve.py`` script.
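
One way to picture the "either" and "both" counts in columns 3 and 4 (and 6 and 7) is as the union and the intersection of the read sets assigned by the two classifiers. The sketch below is an interpretation of the help text above, not Kodoja's own code; the read identifiers and the function name are hypothetical.

```python
def either_and_both(kraken_reads, kaiju_reads):
    """Return (either, both) counts for one taxon.

    either: reads assigned by at least one tool (set union)
    both:   reads assigned by both tools (set intersection)
    """
    kraken_reads = set(kraken_reads)
    kaiju_reads = set(kaiju_reads)
    return len(kraken_reads | kaiju_reads), len(kraken_reads & kaiju_reads)


# Hypothetical read identifiers assigned to the same species by each tool.
kraken_hits = {"read1", "read2", "read3"}
kaiju_hits = {"read2", "read3", "read4"}

either, both = either_and_both(kraken_hits, kaiju_hits)
print("either:", either, "both:", both)
```

Under this reading, the "both" count can never exceed the "either" count, and the two are equal when the tools agree on every read.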

> kodoja_build.py -h

$ kodoja_build.py -h
usage: kodoja_build.py [-h] [--version] -o OUTPUT_DIR [-t THREADS]
                       [-p HOST_TAXID] [-d DOWNLOAD_PARALLEL] [-n]
                       [-e [EXTRA_FILES [EXTRA_FILES ...]]]
                       [-x [EXTRA_TAXIDS [EXTRA_TAXIDS ...]]] [-v]
                       [-b KRAKEN_TAX] [-k KRAKEN_KMER] [-m KRAKEN_MINIMIZER]
                       [-a DB_TAG]

Kodoja database construction

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -o OUTPUT_DIR, --output_dir OUTPUT_DIR
                        Output directory path, required
  -t THREADS, --threads THREADS
                        Number of threads, default=1
  -p HOST_TAXID, --host_taxid HOST_TAXID
                        Host tax ID
  -d DOWNLOAD_PARALLEL, --download_parallel DOWNLOAD_PARALLEL
                        Parallel genome download, default=4
  -n, --no_download     Genomes have already been downloaded
  -e [EXTRA_FILES [EXTRA_FILES ...]], --extra_files [EXTRA_FILES [EXTRA_FILES ...]]
                        List of extra files added to "extra" dir
  -x [EXTRA_TAXIDS [EXTRA_TAXIDS ...]], --extra_taxids [EXTRA_TAXIDS [EXTRA_TAXIDS ...]]
                        List of taxID of extra files
  -v, --all_viruses     Build databases with all viruses (not plant specific)
  -b KRAKEN_TAX, --kraken_tax KRAKEN_TAX
                        Path to taxonomy directory
  -k KRAKEN_KMER, --kraken_kmer KRAKEN_KMER
                        Kraken kmer size, default=31
  -m KRAKEN_MINIMIZER, --kraken_minimizer KRAKEN_MINIMIZER
                        Kraken minimizer size, default=15
  -a DB_TAG, --db_tag DB_TAG
                        Suffix for databases

> kodoja_retrieve.py -h

$ kodoja_retrieve.py -h
usage: kodoja_retrieve.py [-h] [--version] -o FILE_DIR -r1 READ1 [-r2 READ2]
                          [-f USER_FORMAT] [-t TAXID] [-g] [-s]

Kodoja Retrieve is used with the output of Kodoja Search to
give subsets of your input sequencing reads matching viruses.

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  -o FILE_DIR, --file_dir FILE_DIR
                        Path to directory of kodoja_search results, required
  -r1 READ1, --read1 READ1
                        Read 1 file path, required
  -r2 READ2, --read2 READ2
                        Read 2 file path, default: False
  -f USER_FORMAT, --user_format USER_FORMAT
                        Sequence data format, default: fastq
  -t TAXID, --taxID TAXID
                        Virus tax ID for subsetting, default: All viral
                        sequences
  -g, --genus           Include sequences classified at genus
  -s, --stringent       Only subset sequences identified by both tools

The main output of ``kodoja_search.py`` is a file called ``virus_table.txt``
(a table summarising the potential viruses found), but the specified output
directory will also contain ``kodoja_VRL.txt`` (a table listing the read
identifiers). This second file is used as input to ``kodoja_retrieve.py``
along with the original sequencing read files.

A sub-directory ``subreads/`` will be created in the output directory,
which will include either FASTA or FASTQ files named as follows:

* ``subset_files/virus_all_sequences1.fasta`` FASTA output
* ``subset_files/virus_all_sequences1.fastq`` FASTQ output

And, for paired end datasets,

* ``subset_files/virus_all_sequences2.fasta`` FASTA output
* ``subset_files/virus_all_sequences2.fastq`` FASTQ output

However, if the ``-t 12345`` option is used rather than ``virus_all_...``
the files will be named ``virus_12345_...`` instead.
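
The naming rule described in the help text can be summarised with a small helper that predicts which files to look for after running ``kodoja_retrieve.py``. This is a sketch under assumptions: the function name is hypothetical, the full ``virus_12345_sequences1.fastq`` pattern is inferred from the abbreviated ``virus_12345_...`` in the help, and the help text mentions both ``subreads/`` and ``subset_files/``, so the sub-directory is left as a parameter to check against the real output.

```python
def expected_retrieve_files(taxid=None, fmt="fastq", paired=False,
                            subdir="subset_files"):
    """Predict output file names of kodoja_retrieve.py (inferred, not official).

    taxid  : value passed to -t, or None for all viral sequences
    fmt    : "fastq" or "fasta"
    paired : True when -r2 was supplied
    subdir : sub-directory name (assumption; verify against real output)
    """
    label = "all" if taxid is None else str(taxid)
    mates = ["1", "2"] if paired else ["1"]
    return [f"{subdir}/virus_{label}_sequences{mate}.{fmt}" for mate in mates]


print(expected_retrieve_files())
print(expected_retrieve_files(taxid=12345, fmt="fasta", paired=True))
```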

kodoja_build.py downloads virus and host genomes and builds new Kraken and Kaiju databases. kodoja_retrieve.py extracts the sequences of interest from the search results.

 

Preparing the database

mkdir kodojaDB_v1.0
cd kodojaDB_v1.0/
wget https://zenodo.org/record/1406071/files/kodojaDB_v1.0.tar.gz
tar -zxvf kodojaDB_v1.0.tar.gz

  

Usage

Run the search by specifying the query fastq files and the databases.

kodoja_search.py --kraken_db krakenDB/ --kaiju_db kaijuDB/ \
--read1 pair_R1.fastq --read2 pair_R2.fastq \
-o out_dir
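
When running many samples, it can be convenient to assemble the same command from a script. The following is a minimal sketch that uses only options listed in the ``kodoja_search.py -h`` output above; the helper name ``build_search_command`` is hypothetical and the paths are the placeholders from the command above.

```python
import shlex


def build_search_command(kraken_db, kaiju_db, read1, out_dir,
                         read2=None, threads=None):
    """Build the argument list for kodoja_search.py (helper name is made up)."""
    cmd = ["kodoja_search.py",
           "--kraken_db", kraken_db,
           "--kaiju_db", kaiju_db,
           "--read1", read1]
    if read2 is not None:
        # Paired-end data: pass the second read file as well.
        cmd += ["--read2", read2]
    if threads is not None:
        cmd += ["--threads", str(threads)]
    cmd += ["--output_dir", out_dir]
    return cmd


cmd = build_search_command("krakenDB/", "kaijuDB/", "pair_R1.fastq",
                           "out_dir", read2="pair_R2.fastq", threads=4)

# Print a copy-pastable shell command for inspection.
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run it (requires kodoja to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems, such as a missing space before a line-continuation backslash.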

> column -t -s $'\t' out_dir/virus_table.txt

$ column -t -s $'\t' out_dir/virus_table.txt
Species                             Species TaxID  Species sequences  Species sequences (stringent)  Genus       Genus sequences  Genus sequences (stringent)
Cassava brown streak virus          137758         45                 45                             Ipomovirus  0                0
Ugandan cassava brown streak virus  946046         28                 28                             Ipomovirus  0                0
Tobacco etch virus                  12227          19                 19                             Potyvirus   0                0
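
Because ``virus_table.txt`` is tab-separated and species names contain spaces, it is safer to parse it on tabs than on whitespace. Here is a minimal sketch using only the Python standard library; the column names are taken from the header of the example output in this post and should be treated as an assumption for other Kodoja versions. The in-memory sample reuses the three rows from this post.

```python
import csv
import io

# Count columns, named as in the header of the example output in this post.
COUNT_COLUMNS = [
    "Species sequences",
    "Species sequences (stringent)",
    "Genus sequences",
    "Genus sequences (stringent)",
]


def read_virus_table(handle):
    """Parse a tab-separated virus_table.txt into a list of dicts."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        for key in COUNT_COLUMNS:
            row[key] = int(row[key])  # read counts as integers
        rows.append(row)
    return rows


# The three rows shown in this post, rebuilt as tab-separated text.
sample = (
    "Species\tSpecies TaxID\tSpecies sequences\t"
    "Species sequences (stringent)\tGenus\tGenus sequences\t"
    "Genus sequences (stringent)\n"
    "Cassava brown streak virus\t137758\t45\t45\tIpomovirus\t0\t0\n"
    "Ugandan cassava brown streak virus\t946046\t28\t28\tIpomovirus\t0\t0\n"
    "Tobacco etch virus\t12227\t19\t19\tPotyvirus\t0\t0\n"
)

rows = read_virus_table(io.StringIO(sample))
for row in rows:
    print(row["Species"], row["Species TaxID"],
          row["Species sequences (stringent)"])

# For a real run:
# with open("out_dir/virus_table.txt", newline="") as handle:
#     rows = read_virus_table(handle)
```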

 

Kodoja can apparently also be used on Galaxy. See the manual wiki for details.

Reference

https://github.com/abaizan/kodoja

 

Related tools