エラーコレクションツール karect - macでインフォマティクス

2019 9/8 インストール追記

　De novo assembly時、シーケンスエラーを間違ってscaffoldsに組み込んでしまうと、dead-endのグラフができたり、false positiveの分岐が生じたり、あるいはキメラのパスができてしまう可能性がある。そのため、アセンブル前にエラーと思われるような低クオリティのリードはトリムする必要があるが、トリムするとリード長が短くなると共に低カバレッジ領域が増え、アセンブルのパフォーマンスが大きく落ちてしまう。そこで、エラーと思われるような低頻度の配列を修正する方法論が必要となる。一般的にエラーコレクションツール（ECツール）はこのような作業を担っている。

　ECツールとして、SPadesのBayesHammerやMaSuRCAのQuORUMのようにde novo 　assemblyのツールの内部で動作するツールや、単独で動作するツールがある。ECツールはいくつか発表されているが、使用するデータやECツールによってエラーの検出感度や間違って置き換える偽陽性率には差がある。特に頻出する短い繰り返し配列ではそのようなことが起こりやすい。このエラーの検出感度とエラーの修正精度はトレードオフの関係になりやすく、recallもprecisonも良くてかつF measureも良いような方法論の構築は簡単ではない。karectはエラー補正を行うツールで、ミスマッチの他にindelも含めて補正を行う。原論文ではkarectによりNGA50が10%改善したと主張されている。エラー補正がアセンブルに及ぼす影響を13のECツールで比較した論文中では(ref.1)、NGA50最も改善されていた。ただし、ピークメモリとランタイムは他のECツールより多い傾向にあった。

インストール

Github

github.com

git clone https://github.com/aminallam/karect.git 
cd karect/
make
./karect #動作確認

#bioconda (link)
mamba install -c bioconda -y karect

> ./karect

$ ./karect

-Please specify the tool you want to run: (correct-align-eval-merge).

-Run "karect -[correct|align|eval|merge]" to find information about how to run a specific tool.

1) correct: a tool to correct assembly reads from fasta/fastq files.

2) align: a tool to align original assembly reads as pre-processing for evaluation.

3) eval: a tool to evaluate assembly reads correction.

4) merge: a tool to concatenate fasta/fastq files.

-Example: "./karect -correct"

パスの通ったディレクトリに移動しておく。

ヘルプはそれぞれのコマンドを打ち表示させる。例えば"./karect -correct"

ラン

１、Githubからリンクされているテストデータ（ペアードエンドのfastq) をダウンロードして解凍

２、エラー訂正を実行。非圧縮のfastqかfastaファイルを指定する。

karect -correct -threads=12 -matchtype=hamming \
 -celltype=haploid -kmer=9 \
 -inputfile=./frag_1.fastq -inputfile=./frag_2.fastq

-correct　a tool to correct assembly reads from fasta/fastq files.
-inputfile=f　Specify an input fasta/fastq file. This option can be repeated for multiple files
-threads=i　Specify the number of threads [Default=16].
-celltype=[haploid|diploid]　Specify the cell type. Use "haploid" for bacteria and viruses.
-matchtype=[edit|hamming|insdel]　Specify the matching type. "hamming" allows substitution errors only. "edit" allows insertions, deletion, and substitutions with equal costs. "insdel" is the same as "edit", but the cost of substitutions is doubled. Use "hamming" for Illumina datasets, and "edit" for 454 datasets.
-kmer=i　Specify the minimum kmer size (will increase according to '-kmerfactor') [Default=9].

ref.1の論文では、k=9の時が一番パフォーマンスは良くなっている。

終わると複数ファイルが出力される。karect_frag~がエラーが修正されたfastq。

f:id:kazumaxneo:20171219194351j:plain

３、genomeファイルにアライメントすることでエラー訂正結果を評価する。

alignコマンドを実行。

karect -align -threads=12 -matchtype=hamming \
 -refgenomefile=./genome.fasta -alignfile=./align.txt \
 -inputfile=./frag_1.fastq \
 -inputfile=./frag_2.fastq

-align　a tool to align original assembly reads as pre-processing for evaluation.
-refgenomefile=f　Specify the file containing the reference genome (to be aligned with).
-alignfile=f　Specify the output alignment file.

４、最後にevalコマンドで評価する。

karect -eval -threads=12 -matchtype=hamming \
 -refgenomefile=./genome.fasta \
 -alignfile=./align.txt -evalfile=./eval.txt \ 
 -inputfile=./frag_1.fastq \ 
 -inputfile=./frag_2.fastq \
 -resultfile=./karect_frag_1.fastq \
 -resultfile=./karect_frag_2.fastq

-eval　a tool to evaluate assembly reads correction.

-resultfile=f　Specify a result fasta/fastq file (resulting from running "karect -correct"
-alignfile=f　Specify the alignment file resulted from running "karect -align".
-evalfile=f　Specify the output evaluation file.

出力

Evaluation Started

Reference Genome Size = 2903081

..........

-------------------------------------------------------------------------------

Whole Read Statistics:

TN) num_correct_stayed_correct = 382060 [of 1011910] (37.756322 %)

TP) num_wrong_correctly_fixed = 622666 [of 1011910] (61.533733 %)

FP) num_correct_wrongly_fixed = 91 [of 1011910] (0.008993 %)

FN) num_wrong_stayed_wrong = 7093 [of 1011910] (0.700952 %)

*) Recall (Sensitivity) = 98.873696 %

*) Precision = 99.985388 %

*) FScore = 99.426434 %

*) Gain = 98.859246 %

-------------------------------------------------------------------------------

Bases Statistics:

TN) num_correct_stayed_correct = 99587505 [of 102202910] (97.440968 %)

TP) num_wrong_correctly_fixed = 2601317 [of 102202910] (2.545247 %)

FP) num_correct_wrongly_fixed = 1957 [of 102202910] (0.001915 %)

FN) num_wrong_stayed_wrong = 12131 [of 102202910] (0.011870 %)

*) Recall (Sensitivity) = 99.535824 %

*) Precision = 99.924825 %

*) FScore = 99.729945 %

*) Gain = 99.460942 %

-------------------------------------------------------------------------------

Evaluation Finished

scaffoldsにマッピングして観察した。上半分がエラー訂正前、下半分が訂正後。

f:id:kazumaxneo:20180221111236j:plain

クリップされた領域も表示されているので、ミスマッチが多いように見えるが、エラー訂正した後はそのミスマッチが多いリードが目に見えて減っている（面白いことに、このscaffoldsはエラー訂正前のリードをspadesでアセンブルしたものである。それでもエラーが減る。）

シミュレーションだと、もっと劇的な結果が出る。

まずwgsimで変異をわずかに発生させたリードをシミュレートした。それからエラー訂正有り無しでリファレンスにアライメントした。下図の上半分は変異入りリードを元のリファレンスにアライメントした結果で、下半分がkarectでエラー訂正してから元のリファレンスにアライメントした結果となる。wgsimで発生させたSNV、indelエラーが完全に消えている。

f:id:kazumaxneo:20180221173137j:plain