TALC: transcript-level aware error correction of long reads

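Long-read sequencing technologies are invaluable for determining complex RNA transcript structures, but they are error-prone. Many "hybrid correction" algorithms have been developed for genomic data; they correct long reads by exploiting the accuracy and depth of short reads sequenced from the same sample. These algorithms are not well suited to correcting the more complex transcriptome sequencing data.
The authors created a new reference-free algorithm called TALC (Transcript-level Aware Long Read Correction). It models changes in RNA expression and isoform representation in a weighted De-Bruijn graph to correct long reads from transcriptome studies. They show that transcript-level aware long-read correction with TALC improves accuracy across the whole spectrum of downstream RNA-seq applications, and is therefore necessary for transcriptome analyses that use long-read technology. TALC is implemented in C++ and is available at https://github.com/lbroseus/TALC.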
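
To make the idea of a weighted de Bruijn graph a little more concrete, here is a toy sketch. It is not TALC's implementation. The function name list_weighted_edges, the "KMER COUNT" input layout, and the choice to weight each edge by the count of the successor k-mer are all assumptions made for illustration. K-mers are treated as directional (no canonical form), in line with the banner TALC prints at startup.

```shell
#!/usr/bin/env bash
# Toy weighted de Bruijn graph (illustrative only, not TALC's code).
# Nodes are k-mers; an edge joins two k-mers that overlap by k-1 bases.
# Assumption: the edge weight is the count of the successor k-mer.
list_weighted_edges() {
    # $1: file with one "KMER COUNT" pair per line
    # $2: minimal count for a k-mer to be kept (default 2)
    LC_ALL=C awk -v min="${2:-2}" '
        $2 >= min { count[$1] = $2 }          # keep well-supported k-mers
        END {
            n = split("A C G T", bases, " ")
            for (kmer in count) {
                suffix = substr(kmer, 2)       # last k-1 bases of the node
                for (i = 1; i <= n; i++) {
                    next_kmer = suffix bases[i]
                    if (next_kmer in count)
                        print kmer, next_kmer, count[next_kmer]
                }
            }
        }
    ' "$1" | LC_ALL=C sort
}
```

Each output line reads "source k-mer, successor k-mer, weight". In a transcriptome the weights differ strongly between transcripts and isoforms, which is the variation the weighted graph is meant to capture.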

Installation

Tested on Ubuntu 18.04. Jellyfish 2 was installed and run in its own conda environment (bioconda jellyfish).

Build dependencies

gcc version > 5.

GitLab

git clone https://gitlab.igh.cnrs.fr/lbroseus/TALC.git
cd TALC
git clone https://github.com/seqan/seqan.git
make -j

./talc -h

******************************************************
* TALC : Transcriptome-Aware Long Read Correction *
*----------------------------------------------------*
* *
* Kmers are assumed directional *
******************************************************
[TALC]: Parsing arguments
TALC: Transcriptome-Aware Long Read Correction - Hybrid Long Read Correction using Short Read coverage
======================================================================================================

SYNOPSIS

DESCRIPTION

REQUIRED ARGUMENTS
    Input_fastQ/A_file_containing_Long_Reads STRING

OPTIONS
    -h, --help
          Display the help message.
    --version-check BOOL
          Turn this option off to disable version update notifications of the application. One of 1, ON, TRUE, T, YES,
          0, OFF, FALSE, F, and NO. Default: 1.
    --version
          Display version information.
    -o, --output STRING
          Prefix to be used for output files Default: out.
    -k, --kmerSize INTEGER
          length k of k-mers. In range [18..30].
    -qm, --query-mode STRING
          Mode that should be used to query kmers (advised: memory). One of memory and jellyfish2. Default: memory.
    -SR, --SRCounts STRING
          Short reads kmer counts in format .dump file (memory mode) or .jf file (jf2 mode), obtained from Jellyfish2
    -j, --junctions STRING
          k-mers flanking junctions and their counts.
    -jf2, --pathToJF2 STRING
          Specifies where the Jellyfish2 program should be found. Default: .
    -MIN_INNER_SCORE, --MIN_INNER_SCORE DOUBLE
          Minimum %ID score required for the best-correction-path-between-any-two-solid-regions to be kept. In range
          [0.3..0.9]. Default: 0.7.
    -MIN_BORDER_SCORE, --MIN_BORDER_SCORE DOUBLE
          Minimum %ID score required for the best-correction-path-on-left-or-right-borders to be kept. In range
          [0.5..0.9]. Default: 0.7.
    -MIN_COUNT, --MIN_COUNT INTEGER
          Minimal count for a k-mer to be kept in the SR-de Bruijn Graph. In range [2..inf]. Default: 2.
    -SR_ERROR_RATE, --SR_ERROR_RATE DOUBLE
          Prior estimate of the error rate in short reads. In range [0.01..0.1]. Default: 0.025.
    -WINDOW_SIZE, --WINDOW_SIZE INTEGER
          [ADVANCED] Size of the window. The larger, the more fresh air. In range [6..inf]. Default: 9.
    -MAX_NB_BRANCHES, --MAX_NB_BRANCHES INTEGER
          [ADVANCED] Maximal number of competing branches. In range [5..inf]. Default: 7.
    -ALPHA_FOR_PRED, --ALPHA_FOR_PRED DOUBLE
          [ADVANCED] Coefficient used to build count confidence interval: [count +/- ALPHA*sqrt(count)]. In range
          [0.67..inf]. Default: 2.57.
    -t, --num_threads INTEGER
          number of threads In range [1..inf]. Default: 1.
    -DEBUG_MODE, --DEBUG_MODE STRING
          Will activate output for debugging purposes
    -rev, --reverse
          If set, long reads will be reverse complemented before correction by short reads.

VERSION
    Last update: September 2019
    TALC: Transcriptome-Aware Long Read Correction version: 1.01
    SeqAn version: 2.4.0
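
The help text describes ALPHA_FOR_PRED as the coefficient of a count confidence interval, [count +/- ALPHA*sqrt(count)]. The small sketch below only evaluates that formula so you can see how wide the interval is for a given count. The function name count_interval is hypothetical, and how TALC uses the interval internally is not covered here.

```shell
#!/usr/bin/env bash
# Evaluate [count - ALPHA*sqrt(count), count + ALPHA*sqrt(count)].
# count_interval is an illustrative helper, not part of TALC.
count_interval() {
    # $1: k-mer count
    # $2: ALPHA (defaults to 2.57, the default given in the help text)
    LC_ALL=C awk -v c="$1" -v a="${2:-2.57}" 'BEGIN {
        half = a * sqrt(c)                    # half-width of the interval
        printf "%.2f %.2f\n", c - half, c + half
    }'
}

count_interval 100        # prints: 74.30 125.70
```

With the default coefficient, a k-mer seen 100 times gets an interval of roughly 74 to 126. A larger ALPHA widens the interval, a smaller one narrows it.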

Usage

1. k-mer counting with Jellyfish 2

Specify the raw short-read FASTQ files.

jellyfish count --mer 25 -s 100M -o out.jf -t 20 pair_R*
jellyfish dump -c out.jf > out.dump

-s Initial hash size
-m Length of mer
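
Before moving on, it can be worth a quick look at the dump file. With -c, jellyfish dump writes one k-mer and its count per line, separated by a space. The helpers below are a minimal sketch (the function names are made up for this post): one reports the k-mer length found in the file, the other counts how many k-mers reach a given count, which gives a rough idea of how many would pass TALC's MIN_COUNT threshold (default 2).

```shell
#!/usr/bin/env bash
# Quick checks on a "KMER COUNT" dump file (illustrative helpers).

dump_kmer_length() {
    # Print the length of the first k-mer in the dump file ($1).
    awk 'NR == 1 { print length($1); exit }' "$1"
}

count_kmers_at_least() {
    # Print how many k-mers in the dump file ($1) have a count >= $2
    # (default 2).
    awk -v min="${2:-2}" '$2 >= min { n++ } END { print n + 0 }' "$1"
}

# Example calls, assuming out.dump from the commands above exists:
#   dump_kmer_length out.dump
#   count_kmers_at_least out.dump 2
```

The length reported by dump_kmer_length is the value to pass to talc with -k.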

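2. Error correction with talc

Specify the long reads (raw FASTQ/FASTA) and the dump file produced in step 1. Set -k to the k-mer length used for Jellyfish in step 1 (25 here); it must also lie within the accepted range [18..30].

talc long-reads.fq --SRCounts out.dump -k 25 -o talc-out -t 30 > log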

-o Prefix to be used for output files Default: out
-k length k of k-mers. In range [18..30]
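
Since the k-mer length appears in both steps, one way to avoid a mismatch is to keep it in a single variable. The script below is a sketch under that assumption. The function names check_k and run_talc_pipeline and the file names are placeholders; only flags that appear in this post are used.

```shell
#!/usr/bin/env bash
# Run both steps with one shared k-mer length (illustrative sketch).

K=25   # k-mer length used for Jellyfish and for talc

check_k() {
    # TALC's help text gives the accepted range for -k as [18..30].
    [ "$1" -ge 18 ] && [ "$1" -le 30 ]
}

run_talc_pipeline() {
    # $1: long reads (FASTQ/FASTA); remaining arguments: short-read files
    local long_reads=$1
    shift
    check_k "$K" || { echo "k=$K is outside [18..30]" >&2; return 1; }
    jellyfish count -m "$K" -s 100M -o out.jf -t 20 "$@"
    jellyfish dump -c out.jf > out.dump
    talc "$long_reads" --SRCounts out.dump -k "$K" -o talc-out -t 30 > log
}

# Example call (file names are placeholders):
#   run_talc_pipeline long-reads.fq pair_R1.fq pair_R2.fq
```

The range check runs before Jellyfish, so an invalid k is caught before the k-mer counting step is spent on it.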

Output

talc-out.fa contains the error-corrected long reads.

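TALC's weighted de Bruijn graph is directional, so the short reads and the long reads must be in the same orientation. If the long reads are the reverse complement of the short reads, add "--reverse". Known splice junction information can also be supplied for the correction step; see the GitLab repository for the procedure.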
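
If you are unsure which orientation your reads are in, it can help to reverse complement a short stretch by hand and compare. The helper below is a generic sketch (revcomp is a made-up name, and it relies on the common rev and tr utilities). It is not needed for running TALC, because --reverse does the conversion for you.

```shell
#!/usr/bin/env bash
# Reverse complement a DNA string (illustrative helper, not part of TALC).
revcomp() {
    # Reverse the sequence, then swap each base for its complement.
    printf '%s\n' "$1" | rev | tr 'ACGTacgt' 'TGCAtgca'
}

revcomp ACCGT    # prints: ACGGT

# Running talc on long reads in the opposite orientation, using the
# --reverse flag from the help text:
#   talc long-reads.fq --SRCounts out.dump -k 25 -o talc-out -t 30 --reverse > log
```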

Reference

TALC: Transcript-level Aware Long Read Correction
Lucile Broseus, Aubin Thomas, Andrew J Oldfield, Dany Severac, Emeric Dubois, William Ritchie
Bioinformatics, Published: 16 July 2020