2017-10-26

ハイブリッドアセンブリにも対応したショートリードアセンブラ Unicycler

2019　追記、 dockerリンク追加、help all追加、コメント追記

2020 help更新、追記

2021 2/25 誤字修正、5/9 ツイート追加

2022 1/25 v5に対応してインストール手順を修正、02/22, 09/21 インストール手順を修正

2023/07/11,12 追記

2024/02/07, 11追記、6/4, 6/15 tips追記、誤字修正

　現在、バクテリアゲノムはIlluminaシーケンシングプラットフォームによって支配されている。イルミナのリードは正確で、ベースあたりのコストが低く、全ゲノムシーケンシングの普及が可能である。しかし、多くのイルミナシーケンシングは、バクテリアゲノムの多くのリピートエレメントよりも短いフラグメント（500 bp以下）を使用している[ref,1]。これにより、ショートリードのアセンブリツール（アセンブラ）による完全なゲノムのresolutionを防ぎ、代わりに数十のアセンブリ（contigs）に断片化する。結果として、利用可能なバクテリアゲノムのほとんどは不完全であり、これは大規模な比較ゲノム研究を妨げる。

　Pacific Biosciences（PacBio）およびOxford Nanopore Technologies（ONT）シーケンシングプラットフォームは、10kbp以上のDNA断片をシーケンシングできるが、イルミナのプラットフォームよりも1塩基あたりのコストが高くなる。 PacBioおよびONTロングリードは、バクテリアゲノムアセンブリを完成させるにはしばしば十分だが、イルミナのリード（5-15％vs <1％）よりもずっと高いベースエラー率を持っている[ref.2,3]。したがって、ほとんどの研究者は、安価なIlluminaシーケンシングを使用して多くの分離株の断片化されたドラフトアセンブリを生成するか、高価なロングリード技術を使用して少ない分離株で完全なアセンブリを生成するかを選択する必要がある。ハイブリッドアセンブリは、代替手段としてショートリードとロングリードのコンビネーションを使用する。このアプローチでは、ショートリードは正確なコンティグを生成するために使用され、ロングリードはそれらを一緒にスキャフォールドするための情報を提供する。これは、完全なバクテリアゲノムへの最も費用効果の高い経路となり得る。

　近年のロングリード技術の進歩にもかかわらず、イルミナのリードりは、public health and research laboratoriesで広く使用されており[ref.4]、高精度で低コストであるためにしばらくの間続くだろう。さらに、イルミナのデータは既に数十万のバクテリア分離株で利用可能であり、これらの大部分はロングリードシーケンシングデータに置き換えられる可能性は低い。（一部略）

　ハイブリッドアセンブリは、ショートリード優先またはロングリード優先のいずれかの方法で達成することができる。ショートリード優先の方法では、スキャフォールディングツールにロングリードを使用してIlluminaコンティグを一緒に結合する。しかし、スキャフォールディングミスは一般的であり、シークエンスの構造上のミスマッチ（misassemblies）につながる[ref,5]。ロングリード優先アプローチは、訂正されていないロング・リードのアセンブリ、それに続くショートリードを用いたアセンブリエラー訂正を含む[ref.3]。あるいは、最初にショートリードを使用してロングリードのエラーを修正し、その後に修正されたロングリード[ref,6,7]を使ってアセンブリする。エラー訂正がアセンブリの前または後に実行されるかどうかにかかわらず、ロングリード優先アプローチは、ショートリード優先のアプローチよりもロングリードのデプスを必要とする。

ここでは、バクテリア分離株ゲノム用の新しいハイブリッドアセンブリパイプラインであるUnicyclerを紹介する。 Unicyclerは最初にショートリードを正確かつ接続されたアセンブリグラフ、つまりコンティグと相互接続の両方を含むデータ構造にアセンブルする[ref,8]。その後、ロングリードを使用して、グラフ内の最適なパスを検索する。ショートリード優先のアプローチに従うことによって、Unicyclerは少量のロングリードも有効に使用するが、ロングリードのデプスで十分であれば、完成したアセンブリ（1レプリコンあたり1コンティグ）を生成することができる。アセンブリグラフ接続を使用して可能なスキャホールディング構成を制約することにより、Unicyclerは、ショートリードファーストアセンブラよりもミスアセンブリ率が低くなる。

　Unicyclerはショートリードのde novo asssembly、ロングリードのde novo asssembly、ショートリードとロングリードを両方使ったハイブリッドのde novo asssemblyに対応したアセンブルパイプライン。単離されたバクテリア向けに設計されており、真核生物のゲノムやメタゲノムには対応していない。

　ショートリードからアセンブルのグラフを構築するために、内部でspadesを使用している。この時、Unicyclerはspadesのオプチマイザーとして機能する。具体的には、幅広いk-merを試し、dead-endとcontig数が両方とも最も少なくなる条件でアセンブルを行う。そのため、ショートリードだけのアセンブルでも、spadesよりパフォーマンスが上がる可能性がある。ロングリードだけを使ったアセンブルでは、内部でminiasmとraconが動く。miniasmでアセンブルすると、contigのエラー率は元のロングリードと同じくらいになるが、raconでiterative にpolishすることでエラーを取り除いている。

　Unicylcerの出力は、fastaとgfaファイルの両方で出力される。gfaはアセンブル情報を全て記載できるフォーマットであり、そのため、gfaの情報だけでbandageに読み込んでグラフパスを可視化することができる。また、Unicyclerはcontig末端の配列を刈り込むため、spadesのように末端にk-merサイズの重複ができることがない。そのためより正確にアセンブルの状態を把握することができる。

2024/06/04

New major version of SPAdes - very exciting!https://t.co/FuL6jgIZds

Unicycler isn't under active development anymore (as you can see from all the unanswered issues 😢), but I did just push a quick fix to make it work with SPAdes v4:https://t.co/l4JAPKT4ql
— Ryan Wick (@rrwick) 2024年6月5日

2022 1/24

I just released a new version of Unicycler (v0.5.0) which fixes SPAdes compatibility, drops some extraneous bits and patches a few bugs.https://t.co/l4JAPKT4ql

Unicycler is now nearly 6 years old, so here's a thread with my thoughts on its place in the world in 2022.
(1/8)
— Ryan Wick (@rrwick) January 23, 2022

2021 5/9

Unicycler v0.4.9https://t.co/hhj63QPpp5
"The most significant one is that Unicycler now enforces that SPAdes cannot be a version later than v3.13.0."
"Unicycler users that had later versions of SPAdes were getting suboptimal assemblies."
— Ilnam Kang (@ilnam_kang) 2021年5月10日

インストール

追記

本体 Github

#conda (20222/02/22現在はv5.0がlatest)
mamba create -n unicycler python=3.9 -y
conda activate unicycler
mamba install -c bioconda unicycler -y

#from source
git clone https://github.com/rrwick/Unicycler.git
cd Unicycler
python3 setup.py install

> unicycler -h

$ unicycler -h

usage: unicycler [-h] [--help_all] [--version] [-1 SHORT1] [-2 SHORT2] [-s UNPAIRED] [-l LONG] -o OUT [--verbosity VERBOSITY] [--min_fasta_length MIN_FASTA_LENGTH] [--keep KEEP] [-t THREADS] [--mode {conservative,normal,bold}]

[--linear_seqs LINEAR_SEQS] [--vcf]

\ \___

\ ___\

____// _ _ _ _

//_ //\\ | | | | |_| | |

// \// \\ | | | | _ __ _ ___ _ _ ___ | | ___ _ __

|| (O) || | | | || '_ \ | | / __|| | | | / __|| | / _ \| '__|

\\ \_ // | |__| || | | || || (__ | |_| || (__ | || __/| |

\\_____// \____/ |_| |_||_| \___| \__, | \___||_| \___||_|

__/ |

|___/

Unicycler: an assembly pipeline for bacterial genomes

Help:

-h, --help Show this help message and exit

--help_all Show a help message with all program options

--version Show Unicycler's version number

Input:

-1 SHORT1, --short1 SHORT1 FASTQ file of first short reads in each pair (required)

-2 SHORT2, --short2 SHORT2 FASTQ file of second short reads in each pair (required)

-s UNPAIRED, --unpaired UNPAIRED FASTQ file of unpaired short reads (optional)

-l LONG, --long LONG FASTQ or FASTA file of long reads (optional)

Output:

-o OUT, --out OUT Output directory (required)

--verbosity VERBOSITY Level of stdout and log file information (default: 1)

0 = no stdout, 1 = basic progress indicators, 2 = extra info, 3 = debugging info

--min_fasta_length MIN_FASTA_LENGTH Exclude contigs from the FASTA file which are shorter than this length (default: 100)

--keep KEEP Level of file retention (default: 1)

0 = only keep final files: assembly (FASTA, GFA and log), 1 = also save graphs at main checkpoints, 2 = also keep SAM (enables fast rerun in different mode),

3 = keep all temp files and save all graphs (for debugging)

--vcf Produce a VCF by mapping the short reads to the final assembly (experimental, default: do not produce a vcf file)

Other:

-t THREADS, --threads THREADS Number of threads used (default: 8)

--mode {conservative,normal,bold} Bridging mode (default: normal)

conservative = smaller contigs, lowest misassembly rate

normal = moderate contig size and misassembly rate

bold = longest contigs, higher misassembly rate

--linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in the underlying sequence (default: 0)

help all

> unicycler --help_all

$ unicycler --help_all

[--min_bridge_qual MIN_BRIDGE_QUAL] [--linear_seqs LINEAR_SEQS] [--min_anchor_seg_len MIN_ANCHOR_SEG_LEN] [--spades_path SPADES_PATH] [--no_correct] [--min_kmer_frac MIN_KMER_FRAC] [--max_kmer_frac MAX_KMER_FRAC] [--kmers KMERS]

[--kmer_count KMER_COUNT] [--depth_filter DEPTH_FILTER] [--largest_component] [--spades_tmp_dir SPADES_TMP_DIR] [--no_miniasm] [--racon_path RACON_PATH] [--existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY] [--no_rotate]

[--start_genes START_GENES] [--start_gene_id START_GENE_ID] [--start_gene_cov START_GENE_COV] [--makeblastdb_path MAKEBLASTDB_PATH] [--tblastn_path TBLASTN_PATH] [--no_pilon] [--bowtie2_path BOWTIE2_PATH]

[--bowtie2_build_path BOWTIE2_BUILD_PATH] [--samtools_path SAMTOOLS_PATH] [--pilon_path PILON_PATH] [--java_path JAVA_PATH] [--min_polish_size MIN_POLISH_SIZE] [--vcf] [--bcftools_path BCFTOOLS_PATH]

[--min_component_size MIN_COMPONENT_SIZE] [--min_dead_end_size MIN_DEAD_END_SIZE] [--contamination CONTAMINATION] [--scores SCORES] [--low_score LOW_SCORE]

\ \___

\ ___\

____// _ _ _ _

//_ //\\ | | | | |_| | |

// \// \\ | | | | _ __ _ ___ _ _ ___ | | ___ _ __

|| (O) || | | | || '_ \ | | / __|| | | | / __|| | / _ \| '__|

\\ \_ // | |__| || | | || || (__ | |_| || (__ | || __/| |

\\_____// \____/ |_| |_||_| \___| \__, | \___||_| \___||_|

__/ |

|___/

Unicycler: an assembly pipeline for bacterial genomes

Help:

-h, --help Show this help message and exit

--help_all Show a help message with all program options

--version Show Unicycler's version number

Input:

-1 SHORT1, --short1 SHORT1 FASTQ file of first short reads in each pair (required)

-2 SHORT2, --short2 SHORT2 FASTQ file of second short reads in each pair (required)

-s UNPAIRED, --unpaired UNPAIRED FASTQ file of unpaired short reads (optional)

-l LONG, --long LONG FASTQ or FASTA file of long reads (optional)

Output:

-o OUT, --out OUT Output directory (required)

--verbosity VERBOSITY Level of stdout and log file information (default: 1)

0 = no stdout, 1 = basic progress indicators, 2 = extra info, 3 = debugging info

--min_fasta_length MIN_FASTA_LENGTH Exclude contigs from the FASTA file which are shorter than this length (default: 100)

--keep KEEP Level of file retention (default: 1)

0 = only keep final files: assembly (FASTA, GFA and log), 1 = also save graphs at main checkpoints, 2 = also keep SAM (enables fast rerun in different mode),

3 = keep all temp files and save all graphs (for debugging)

--vcf Produce a VCF by mapping the short reads to the final assembly (experimental, default: do not produce a vcf file)

Other:

-t THREADS, --threads THREADS Number of threads used (default: 8)

--mode {conservative,normal,bold} Bridging mode (default: normal)

conservative = smaller contigs, lowest misassembly rate

normal = moderate contig size and misassembly rate

bold = longest contigs, higher misassembly rate

--min_bridge_qual MIN_BRIDGE_QUAL Do not apply bridges with a quality below this value

conservative mode default: 25.0

normal mode default: 10.0

bold mode default: 1.0

--linear_seqs LINEAR_SEQS The expected number of linear (i.e. non-circular) sequences in the underlying sequence (default: 0)

--min_anchor_seg_len MIN_ANCHOR_SEG_LEN

If set, Unicycler will not use segments shorter than this as scaffolding anchors (default: automatic threshold)

SPAdes assembly:

These options control the short-read SPAdes assembly at the beginning of the Unicycler pipeline.

--spades_path SPADES_PATH Path to the SPAdes executable (default: spades.py)

--no_correct Skip SPAdes error correction step (default: conduct SPAdes error correction)

--min_kmer_frac MIN_KMER_FRAC Lowest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.2)

--max_kmer_frac MAX_KMER_FRAC Highest k-mer size for SPAdes assembly, expressed as a fraction of the read length (default: 0.95)

--kmers KMERS Exact k-mers to use for SPAdes assembly, comma-separated (example: 22,33,44, default: automatic)

--kmer_count KMER_COUNT Number of k-mer steps to use in SPAdes assembly (default: 10)

--depth_filter DEPTH_FILTER Filter out contigs lower than this fraction of the chromosomal depth, if doing so does not result in graph dead ends (default: 0.25)

--largest_component Only keep the largest connected component of the assembly graph (default: keep all connected components)

--spades_tmp_dir SPADES_TMP_DIR Specify SPAdes temporary directory using the SPAdes --tmp-dir option (default: make a temporary directory in the output directory)

miniasm+Racon assembly:

These options control the use of miniasm and Racon to produce long-read bridges.

--no_miniasm Skip miniasm+Racon bridging (default: use miniasm and Racon to produce long-read bridges)

--racon_path RACON_PATH Path to the Racon executable (default: racon)

--existing_long_read_assembly EXISTING_LONG_READ_ASSEMBLY

A pre-prepared long read assembly for the sample in GFA format. If this option is used, Unicycler will skip the miniasm/Racon steps and instead use the given assembly (default: perform long read assembly

using miniasm/Racon)

Assembly rotation:

These options control the rotation of completed circular sequence near the end of the Unicycler pipeline.

--no_rotate Do not rotate completed replicons to start at a standard gene (default: completed replicons are rotated)

--start_genes START_GENES FASTA file of genes for start point of rotated replicons (default: start_genes.fasta)

--start_gene_id START_GENE_ID The minimum required BLAST percent identity for a start gene search (default: 90.0)

--start_gene_cov START_GENE_COV The minimum required BLAST percent coverage for a start gene search (default: 95.0)

--makeblastdb_path MAKEBLASTDB_PATH Path to the makeblastdb executable (default: makeblastdb)

--tblastn_path TBLASTN_PATH Path to the tblastn executable (default: tblastn)

Pilon polishing:

These options control the final assembly polish using Pilon at the end of the Unicycler pipeline.

--no_pilon Do not use Pilon to polish the final assembly (default: Pilon is used)

--bowtie2_path BOWTIE2_PATH Path to the bowtie2 executable (default: bowtie2)

--bowtie2_build_path BOWTIE2_BUILD_PATH

Path to the bowtie2_build executable (default: bowtie2-build)

--samtools_path SAMTOOLS_PATH Path to the samtools executable (default: samtools)

--pilon_path PILON_PATH Path to a Pilon executable or the Pilon Java archive file (default: pilon)

--java_path JAVA_PATH Path to the java executable (default: java)

--min_polish_size MIN_POLISH_SIZE Contigs shorter than this value (bp) will not be polished using Pilon (default: 10000)

VCF:

These options control the production of the VCF of the final assembly.

--bcftools_path BCFTOOLS_PATH Path to the bcftools executable (default: bcftools)

Graph cleaning:

These options control the removal of small leftover sequences after bridging is complete.

--min_component_size MIN_COMPONENT_SIZE

Graph components smaller than this size (bp) will be removed from the final graph (default: 1000)

--min_dead_end_size MIN_DEAD_END_SIZE

Graph dead ends smaller than this size (bp) will be removed from the final graph (default: 1000)

Long read alignment:

These options control the alignment of long reads to the assembly graph.

--contamination CONTAMINATION FASTA file of known contamination in long reads

--scores SCORES Comma-delimited string of alignment scores: match, mismatch, gap open, gap extend (default: 3,-6,-5,-2)

--low_score LOW_SCORE Score threshold - alignments below this are considered poor (default: set threshold automatically)

追記

dockerイメージ

https://hub.docker.com/r/staphb/unicycler/

実行方法

ショートリードのアセンブリ。

unicycler -1 R1.fq.gz -2 R2.fq.gz -o output_dir -t 12

もちろん非圧縮のfastqも使用できる。

ハイブリッドアセンブリ

unicycler -1 short_reads_1.fq.gz -2 short_reads_2.fq.gz -l long_reads.fq.gz -o output_dir

ロングリードonlyアセンブリ(minimap => miniasm => Racon)

unicycler -l long_reads.fq.gz -o output_dir

Tips、気づいた事

アセンブルの精度とcontiguityは‑‑modeオプションで設定できる。defaultではブリッジのクオリティの閾値 (cutoffを決める値)は10で、--mode conservativeでは25になる。同時にショートリードをブリッジ情報として使わなくなる。--mode boldでは反対にcutoffの値が最低の1まで下がり、ショートリードもブリッジ情報として利用する。エラーのリスクも出てくるが、コンティグの連続性をできる限り高める必要があるならこのモードも試す事が出来る。
アセンブルした配列のFASTA形式ファイルとアセンブリグラフのGFAファイルは必ず出力されるが、他の関連ファイルを維持するかどうかは、keepフラグの数値で調整できる。--keep 0で関連ファイルはほぼ残らなくなる。数値をあげると関連ファイルも残るようになり、--keep 3では中間ファイルも含め全てのファイルが保存される（詳細はレポジトリを確認）。
完全な環状の配列がアセンブルできた場合、unicyclerはゲノムを回転/反転させて、（デフォルトでは）dnaAまたはrepAを+1にする。--start_genesオプションで指定することができる。
コンタミネーションを除去するために、Unicyclerは低デプスなアセンブリをフィルタリングする。汚染の有無を調べるなら、低レベルなコンタミネーションのグラフからの配列も出力も出すSPAdesの方が適している。
raconやnanopolish、pilonの単独、または併用polishよりunicyclerの方が良好な結果を出している（What is a good genome assembly? – Albertsen Lab）（数年前の情報）。
ロングリードも使用したハイブリッドアセンブリでもリピートが完全に解読されない結果になることもある。このような時、より長いロングリードだけコマンドで抽出してアセンブリをやり直すと、それだけで複雑な繰り返し領域のアセンブリが改善することが間々ある。上手くいくかどうかはリピートの長さとリード長に依存しているので、失敗した時は、アセンブリグラフをBandageに読み込んで、リピート領域の長さを確認する。そのリピートよりも十分に長いリードが得られているなら、この方法で一繋ぎにアセンブル出来る可能性がある。ONTのリードを使っていてウルトラロングリードも得られているなら試してみてください。
繰り返しになるが、アセンブルがFinishしない場合、assembly.gfaをbandageグラフビューアで視覚化して原因を探る。例えばプラスミド間で共有されている長いリピートがあるなら、目的のプラスミドのリードだけを回収してローカルアセンブリするとFinishする可能性がある。例えば20kbの高度に断片化した繰り返し配列領域があるなら、20-30kb以上のロングリードだけ取り出し、それを使ってロングリードベースのアセンブルを行うと一繋ぎの配列が得られる可能性がある。
unicyclerのアセンブリではアセンブル後に複数回のpolishingを行い、これをアセンブリエラーがゼロになるまで繰り返すことで精度の高いアセンブリを達成している。多くのケースでQ50Q60を超えるアセンブリが得られるが、アセンブリエラーが完全にゼロの保証はない。経験上、まだ修正すべきエラーが10~100カ所ほど残っているゲノムも度々見られる。unicyclerのアセンブリにショートリードをバックマッピングし、ホモのSNVやindelがコールされないかチェックする必要がある。コールされるならpilonなどを使って再びpolishingを行う。微妙なポジションであれば、spadesなどで得たcontig.fastaをminimap2 の-x asm5設定でunicyclerのゲノム配列にアラインメントし、IGVにショートリードとspadesアセンブリのアラインメントを読み込み、複数のトラック情報から正しい塩基配列を確認する。時には、追加のシークエンシングがないと配列決定できない事もある。同一系統の他の株のゲノムがでているなら、それもアラインすると参考になる。
ロングリードがあるなら、UnicyclerだけでなくFlyeでもアセンブルを行い、アセンブリ構造に大きな違いがないかを確認すると良い。不一致が見つかるなら、その部位のFlyeのコンティグとのアラインメントを目視で確認する。
環状ゲノムの場合、見かけのスタート位置とエンド部位のジャンクションにエラーがあっても、線形のfastaファイルだとエラーはコールされない。見かけの+1をseqkitなどで1,000~10,000ほど回転させ、エラーがないか確認する。
spadesのアセンブリとunicyclerのアセンブリは大雑把にはほとんど同等の品質を持つが（注；UnicyclerはSPAdesを使用してイルミナリードをアセンブリグラフにアセンブルする）、unicyclerのほうが単純リピートやタンデムリピート領域のリピート回数の表現で良好なことが多い（*1）。これはspadesのハイブリッドアセンブリとunicyclerのハイブリッドアセンブリの比較においても同じ傾向がある。理由として、spadesはハイブリッドアセンブリであっても、ショートリードベースで進め、エラーがあってもショートリードで作ったグラフを信用してノード間をブリッジするのに対して、１）unicyclerにはspadesのアセンブリでデッドエンドがない適切なk値を見つけるプロセスが存在し（unicycler.log参照）、2）unicyclerにはspadesのアセンブリをそのまま最終出力とするのではなく、ポストアセンブリ処理でグラフを解きほぐす処理が加わっていること、3) ミスアセンブリの割合を減らすために、rpeat resolutionで信頼性の低いリピートを除外するプロセスがある、などが関係していると考えられる。つまり、SPAdesはショートリードでは正確な配列決定が難しいタンデムリピート領域もグラフから配列を確定させてしまい、ロングリード情報が利用できる時もこの領域は再考されないことが起こりうる。このあたりの戦略の違いが結果に影響していると考えられる。ただし、ゲノムアセンブリとはシークエンシングエラーがあるリードからできるだけ長い配列を組み立てるためのヒューリスティックであり、どのようなデータやゲノムでも万能に動作することを保証するアルゴリズムではない。慎重に進めるには、Trycyclerのように複数のアセンブル結果を比較検討する必要がある（参考）。
SPAdesのアセンブリグラフは最終k値分だけ末端配列が重複しているため、下流の解析で邪魔になることがある（そのため、SPAdesではcontig.fastaのそれぞれの末端配列を最終k値数トリミングして使うことがある）。一方でunicyclerのアセンブリグラフは配列の重複なしでアセンブリグラフが構築されている（レポジトリより）。そのため、複雑な領域のグラフでは、SPAdesのグラフでは見られない非常に短い配列のノード（例えば1bpなど極端に短いもの）が見られることがある。これは配列間重複がないため極端に短いのだが、信頼性の高い分岐として残っており、情報としては重要である。このような短い分岐は、ショートリード単独で構築されたアセンブリグラフで度々見られる。
通常ショートリードアセンブリが切れるのは、k値以上の長いリピート配列が存在する部位でとなる。リピート配列はゲノム中に二回以上出現する配列である。このようなリピート配列は、細菌では、ゲノム中に散在して存在し、隣接して存在しているわけではない。典型的なリピート配列としてISやrRNA operonが知られている。このような長いリピート配列とは異なり、数十bpや数塩基単位の短いリピートがタンデムに並んで存在するゲノム領域も存在する。、数十bpから数塩基単位のダイレクトリピート；(XXXXXX)nは、k-merベースのアセンブラが苦手とするタイプのリピートである。１つのリピートは短いものの、何十コピーも連続していると、アセンブラはその配列の正確な出現回数を把握しきれなくなる。具体的には、このような領域は、切断するのではなく、リピートの表現回数を間違えて一繋ぎの配列として配列決定してしまう事が多い。例えば31-bpの長さのタンデムリピートが10回繰り返される領域がある時、(31-mer)10と表現（配列決定）せずに(31-mer)9とか(31-mer)11と表現したりすることが確率的に起きる。
タンデムリピートの表現回数のエラーは、ショートリードのde brujin graphアセンブラで度々発生している。アセンブル後に修正するには、アセンブルに使ったリードをアラインしてエラーを見つけるしかないが、ダイレクトリピートに対して”ショート”リードを正確にアラインすることは難しいため、polishやエラーチェックのステップでもエラーとして把握されないことがある。結果として、慎重に進めたにも関わらず、公的データベースにサブミットしたゲノム配列にエラーが残っていることが起こり得る。回避するために、アセンブル終了後、ショートリード（とロングリードと別のアセンブリ結果）をアセンブリにマッピングし、アラインメント間に不一致がないかゲノム全体をよくチェックすることが推奨される。自分の場合、unicyclerのアセンブリとpolishing完了後にショートリードとロングリード、そしてSPAdesアセンブリ（k値を複数パターン）を研磨したアセンブリにマッピングし、ギャップをもつリードが集積している部位がないかチェックすることにしている。ギャップがあれば他のシークエンシングテクノロジーのアラインメントを確認し、アセンブリのエラーとわかれば手修正する。これをゲノム全体で行い、どうしても修正できなければ追加のシークエンシング（長いショートリード or サンガー）を行い、異常がゼロになったらアノテーションを付けるようにしている。