Kmerator Suite: extracting dataset metadata from large RNA-seq datasets (mainly human RNA-seq)

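The vast number of publicly available RNA-sequencing (RNA-seq) libraries is a treasure trove of functional information for quantifying the expression of known or novel transcripts in tissues. However, transcript quantification usually relies on alignment methods that require substantial computational resources and processing time, so it does not scale easily to large datasets. K-mer decomposition is a newer way to process RNA-seq data and identify transcriptional signatures, because k-mers allow gene expression to be quantified accurately with fewer resources. The authors present the Kmerator Suite, a set of three tools designed to extract specific k-mer signatures, quantify these k-mers in RNA-seq datasets, and quickly visualize the characteristics of large datasets. The core tool, Kmerator, generates specific k-mers for 97% of human genes and measures gene expression with high accuracy on simulated datasets. KmerExploR, a direct application of Kmerator, uses a set of predictor gene-specific k-mers to infer metadata such as library protocol, sample features, and contaminants from RNA-seq datasets. KmerExploR results are visualized through a user-friendly interface. The authors also demonstrate that the Kmerator Suite can be used for advanced queries targeting known or new biomarkers, such as mutations, gene fusions, and long non-coding RNAs, for human health applications.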
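To make the idea of k-mer decomposition concrete, here is a minimal sketch of quantifying a k-mer signature directly in reads, with no alignment step. It is not Kmerator's or countTags' actual implementation: the function names, the toy sequences, and the small k are all assumptions chosen for readability, and reverse complements are ignored.

```python
from collections import Counter

def kmers(seq, k):
    """Yield every overlapping k-mer of seq, left to right."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def quantify(signature, reads, k):
    """Count occurrences of each signature k-mer across all reads."""
    wanted = set(signature)
    counts = Counter({km: 0 for km in wanted})
    for read in reads:
        for km in kmers(read, k):
            if km in wanted:
                counts[km] += 1
    return counts

# Toy data, invented for illustration only.
signature = ["TACGG", "ACGGT"]
reads = ["TACGGTC", "ACGTACG", "GTACGGA"]
counts = quantify(signature, reads, k=5)
```

Averaging the counts over a gene's signature k-mers gives a simple per-gene abundance estimate, which is the kind of value the categories below are built on.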

Installation

GitHub

#as user
python3 -m pip install --user kmerexplor

> kmerexplor

$ kmerexplor

usage: kmerexplor [-h] (-s | -p) [-k] [-d] [-o <output_dir>] [--tmp-dir <tmp_dir>] [--config config.yaml] [-t <tag_file>] [-a <tag_file>] [--dump-config [config.yaml]] [--show-tags] [--title TITLE] [-y]
                  [-c <cores>] [-v]
                  <file1> ... [<file1> ... ...]

positional arguments:
  <file1> ...           fastq or fastq.gz or tsv countTag files.

optional arguments:
  -h, --help            show this help message and exit
  -s, --single          when samples are single.
  -p, --paired          when samples are paired.
  -k, --keep-counts     keep countTags outputs.
  -d, --debug           debug.
  -o <output_dir>, --output <output_dir>
                        output directory (default: "./kmerexplor-results").
  --tmp-dir <tmp_dir>   temporary files directory.
  --title TITLE         title to be displayed in the html page.
  -y, --yes, --assume-yes
                        assume yes to all prompt answers.
  -c <cores>, --cores <cores>
                        specify the number of files which can be processed simultaneously by countTags. (default: 1). Valid when inputs are fastq files.
  -v, --version         show program's version number and exit

advanced features:
  --config config.yaml  alternate config yaml file of each category (default: built-in "config.yaml").
  -t <tag_file>, --tags <tag_file>
                        alternate tag file.
  -a <tag_file>, --add-tags <tag_file>
                        additional tag file.

extra features:
  --dump-config [config.yaml]
                        dump builtin config file as specified name to current directory and exit (default name: config.yaml).
  --show-tags           print builtin categories and predictors and exit.

Examples:

# Mandatory: -p for paired-end or -s for single:

kmerexplor -p path/to/*.fastq.gz

# -c for multithreading, -k to keep counts (input must be fastq):

kmerexplor -p -c 16 -k path/to/*.fastq.gz

# You can skip the counting step thanks to countTags output (see -k option):

kmerexplor -p path/to/countTags/files/*.tsv

# -o to choose your directory output (directory will be created),

# --title to show title in results:

kmerexplor -p -o output_dir --title 'Title displayed on the html page' dir/*.fastq.gz

# Advanced: use your own tag file and config.yaml file:

kmerexplor -p --tags my_tags.tsv --config my_config.yaml dir/*.fastq.gz

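Usage

Specify either fastq files or count files. Fastq files must follow the Illumina naming format (_R1_001, _R2_001) or end in _1.fastq[.gz] and _2.fastq[.gz]. Count files must end in tsv[.gz]. Count files can also be given as a single file in which the samples are aggregated as multiple columns.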
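The naming rules above can be checked before launching a run. The following is a small sketch that classifies file names according to those rules only; the patterns and the classify function are my own illustration, and the real tool may accept more variants than these.

```python
import re

# Patterns derived from the naming rules described in this post (assumption:
# the tool itself may be more permissive).
ILLUMINA = re.compile(r"^(?P<sample>.+)_R(?P<mate>[12])_001\.fastq(\.gz)?$")
SIMPLE = re.compile(r"^(?P<sample>.+)_(?P<mate>[12])\.fastq(\.gz)?$")
COUNTS = re.compile(r"^(?P<sample>.+)\.tsv(\.gz)?$")

def classify(filename):
    """Return (kind, sample, mate), or None if the name fits no pattern."""
    for pattern in (ILLUMINA, SIMPLE):
        m = pattern.match(filename)
        if m:
            return "fastq", m.group("sample"), int(m.group("mate"))
    m = COUNTS.match(filename)
    if m:
        return "counts", m.group("sample"), None
    return None
```

A name for which classify returns None is worth renaming before it is passed to kmerexplor.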

kmerexplor -p -c 16 -k path/to/*.fastq.gz

-s when samples are single.
-p when samples are paired.
-c specify the number of files which can be processed.

When the analysis finishes, the results are loaded in the browser.

poly A and Ribo depletion

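In contrast to most mRNAs, some histone mRNAs are not polyadenylated (the genes listed in the legend). These non-polyadenylated transcripts are therefore barely detected in polyA+ RNA-seq. By detecting these genes with specific k-mers, the tool apparently evaluates the mRNA selection method, distinguishing Ribo-Zero depletion from polyA enrichment. For normal data, polyA-enriched samples should fall well below the dashed threshold line (light blue) in the plot, and Ribo-Zero-treated samples well above it (some samples may not fit the model).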
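The threshold logic described here can be sketched as a simple three-way call. KmerExploR's actual threshold value is not given in this post, so the threshold argument, the margin, and the function name below are all hypothetical.

```python
from statistics import mean

def guess_selection(histone_counts, threshold, margin=2.0):
    """Classify a sample from the k-mer counts of non-polyadenylated histone genes.

    threshold and margin are hypothetical values, not the tool's own.
    """
    signal = mean(histone_counts)
    if signal >= threshold * margin:
        return "ribo-depleted"   # histone signal well above the line
    if signal <= threshold / margin:
        return "polyA"           # histone signal well below the line
    return "ambiguous"           # too close to the line to call
```

The "ambiguous" band reflects the caveat that some samples may not fit the model.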

Orientation

A subset of housekeeping genes is used to determine the orientation of each sample's fastq files generated with a paired-end RNA-seq protocol.

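If a sample is stranded, the forward and reverse k-mers are expected to separate into two distinct groups. If forward and reverse k-mers are found in equal proportions, the fastq sample can be considered unstranded (positive and negative counts are balanced) (from the manual). In the image above I included one unpaired dataset, which is why only that one appears unbalanced.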
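The balance test described above amounts to comparing the fraction of forward k-mers against 0.5. Here is a minimal sketch; the tolerance value and the function name are assumptions, not something taken from the KmerExploR manual.

```python
def strandedness(forward_count, reverse_count, tolerance=0.2):
    """Call a fastq file stranded or unstranded from forward/reverse k-mer counts.

    tolerance is a hypothetical cutoff around the balanced value 0.5.
    """
    total = forward_count + reverse_count
    if total == 0:
        return "undetermined"
    forward_fraction = forward_count / total
    if abs(forward_fraction - 0.5) <= tolerance:
        return "unstranded"
    if forward_fraction > 0.5:
        return "stranded (forward)"
    return "stranded (reverse)"
```

For example, 500 forward and 480 reverse k-mers would be called unstranded, while 950 forward and 50 reverse would be called stranded.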

Y chromosome

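Sex is determined using previously published Y chromosome-specific genes.

Males should express all of the selected genes, while females should show values close to zero.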
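That rule translates into a very small decision function. The sketch below is only an illustration: the gene labels are placeholders rather than the genes KmerExploR actually uses, and min_count is a hypothetical cutoff for "close to zero".

```python
def infer_sex(y_gene_counts, min_count=1.0):
    """Infer sex from k-mer counts of Y chromosome-specific genes.

    y_gene_counts maps gene label -> mean k-mer count for one sample.
    """
    if not y_gene_counts:
        return "undetermined"
    expressed = [gene for gene, count in y_gene_counts.items() if count >= min_count]
    if len(expressed) == len(y_gene_counts):
        return "male"        # every selected gene is expressed
    if not expressed:
        return "female"      # all values are near zero
    return "ambiguous"       # partial expression; inspect manually

# Placeholder gene labels, not the tool's real gene list.
sample = {"Y_gene_1": 35.0, "Y_gene_2": 12.5, "Y_gene_3": 8.0}
call = infer_sex(sample)
```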

Read position bias

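Read coverage bias from the 5' end to the 3' end. In poly(A)-selected samples, if reads accumulate mainly at the 3' end of transcripts, this may indicate that the starting RNA was of low quality.

The mean k-mer counts from the 5', 3', and CDS regions are shown. A subset of housekeeping genes is used to assess the uniformity of read coverage. In the image above, only one sample appears abnormal.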
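One way to turn the three region means into a single number is a 3'/5' ratio. This is a sketch under my own assumptions (the region keys, the ratio itself, and the toy counts are not from the tool), meant only to show how a 3' accumulation would appear numerically.

```python
from statistics import mean

def region_means(counts_by_region):
    """Average the k-mer counts within each transcript region."""
    return {region: mean(values) for region, values in counts_by_region.items()}

def three_prime_bias(means):
    """Ratio of 3' to 5' mean counts; values well above 1 suggest 3' bias."""
    if means["5prime"] == 0:
        return float("inf")
    return means["3prime"] / means["5prime"]

# Toy counts for one sample, invented for illustration.
sample = {"5prime": [10, 20], "CDS": [30, 30], "3prime": [60, 30]}
means = region_means(sample)
bias = three_prime_bias(means)
```

With these toy values the 3' mean is three times the 5' mean, the pattern that would raise a concern about RNA quality in a poly(A)-selected sample.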

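HeLa contamination

HeLa is the first immortal human cell line. It is now widely used in medical research, and as a result HeLa cells can contaminate other cell types.

K-mers specific to 60-nt sequences carrying HeLa-specific mutations were designed and are used here.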

Mycoplasma contamination

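Detection of the six mycoplasma species most frequently found in cell contamination, according to Drexler et al.

The rRNA sequences of the six mycoplasma species are used.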

Viruses contamination

Screening for viral contamination (infection).

For each virus, k-mers that are absent from the human reference were selected.
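Selecting k-mers that are absent from a reference is, conceptually, a set difference. The sketch below shows the idea on toy sequences; it is not how the authors built their tag file, and a real human reference would need a disk-backed or compressed k-mer index rather than an in-memory Python set.

```python
def kmer_set(seq, k):
    """Return the set of all overlapping k-mers in seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmers_absent_from(candidate, reference_seqs, k):
    """Return the candidate's k-mers that occur in none of the reference sequences."""
    reference = set()
    for seq in reference_seqs:
        reference |= kmer_set(seq, k)
    return sorted(kmer_set(candidate, k) - reference)

# Toy sequences, invented for illustration; k kept small for readability.
virus = "ACGTACGGTCA"
human = ["ACGTACGTTTT"]
specific = kmers_absent_from(virus, human, k=5)
```

Only the k-mers that survive this filter are useful as contamination markers, since any k-mer shared with the host would be counted in every sample.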

Species

Contamination by other species is assessed using k-mers specific to the mitochondrially encoded cytochrome c oxidase I (MT-CO1).

Citation

Kmerator Suite: design of specific k-mer signatures and automatic metadata discovery in large RNA-seq datasets
Sébastien Riquier, Chloé Bessiere, Benoit Guibert, Anne-Laure Bouge, Anthony Boureux, Florence Ruffle, Jérôme Audoux, Nicolas Gilbert, Haoliang Xue, Daniel Gautheret, Thérèse Commes

NAR Genom Bioinform. 2021 Jun 23;3(3)