LR_Gapcloser: closing gaps in scaffolds using long reads

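Next-generation sequencing (NGS) technologies enable low-cost, rapid construction of genome sequences through de novo assembly. Taking advantage of NGS, many genome projects (for example, the 10K Genome Project [ref.1] and the 100K Pathogen Genome Project [ref.2]) have been launched over the past decade, and the genomes of a large number of species have been assembled [ref.3, 4]. However, factors such as sequencing bias [ref.5], repetitive regions [ref.6], and heterochromatin [ref.7] make some regions difficult or impossible to assemble, resulting in gaps and fragmented genome assemblies.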

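Gap closure is the last, but one of the most important, steps in improving the completeness and contiguity of a genome assembly. To obtain complete genomes, several gap-closure tools, including GapFiller [ref.8], GapCloser [ref.9], Sealer [ref.10], GapBlaster [ref.11], GapReduce [ref.12], and Gap2Seq [ref.13], are used to fill gaps using NGS reads or pre-assembled contigs [ref.14]. However, these tools introduce a high rate of misassemblies during gap closure [ref.15], and they struggle to fill all gaps, particularly large ones. Long single-molecule sequencing, also known as third-generation sequencing (TGS), for example on the Pacific Biosciences (PacBio) and Nanopore platforms, produces long, unbiased reads that have the potential to fill these gaps and yield complete genome assemblies.

PBJelly [ref.18] and GMcloser [ref.15] use PacBio reads to fill gaps. PBJelly aligns long reads to the reference assembly with BLASR (basic local alignment with successive refinement) [ref.19], selects supporting reads, performs a local assembly of each gap, and determines the correct assembly with which to fill the gap. GMcloser splits scaffolds into subcontigs and aligns long reads to the subcontigs with MUMmer [ref.20] or blastn (Basic Local Alignment Search Tool). It then uses a likelihood-based classifier to assign long reads to the correct scaffold gaps. However, these tools suffer from long run times, high memory usage, and low closure performance, which limits their application, particularly to large and complex genomes. A fast and memory-efficient gap-closure approach is therefore needed to fill gaps in genome assemblies.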
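To make the notion of a gap concrete: in a scaffold FASTA, a gap is a run of N characters between two assembled stretches. The following is a minimal Python sketch, not code from any of the tools above, that locates gaps in a scaffold string and splits the scaffold into subcontigs at those gaps, in the spirit of the first step described for GMcloser. The function names `find_gaps` and `split_into_subcontigs` and the toy sequence are hypothetical.

```python
import re


def find_gaps(scaffold):
    """Return half-open (start, end) coordinates of every run of N/n."""
    return [(m.start(), m.end()) for m in re.finditer(r"[Nn]+", scaffold)]


def split_into_subcontigs(scaffold):
    """Split a scaffold at its gaps, dropping empty pieces."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


# Toy scaffold with two gaps (5 bp and 3 bp); not real data.
scaffold = "ACGTACGTNNNNNGGCCTTAANNNTTGACA"
print(find_gaps(scaffold))              # [(8, 13), (21, 24)]
print(split_into_subcontigs(scaffold))  # ['ACGTACGT', 'GGCCTTAA', 'TTGACA']
```

Real tools work on whole FASTA files and keep track of which subcontig flanks which gap; this sketch only shows the coordinate bookkeeping.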

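Here, LR_Gapcloser was developed to close gaps in assemblies efficiently and quickly using long reads. Compared with earlier gap-closing tools, it offers several notable advantages: higher gap-closure performance, shorter run time, lower peak memory, and fewer misassemblies. It also performed better on both large, complex genomes and repeat-derived gaps. Finally, for genomes sequenced with both NGS and TGS technologies, the authors evaluated the contiguity and accuracy of different hybrid assembly strategies and proposed an optimal hybrid strategy that combines TGS-based and NGS-based assemblies with LR_Gapcloser to produce a high-quality assembly.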

The primary steps in LR_Gapcloser. Reproduced from the paper.

Installation

Tested on Ubuntu 16.04.

Dependencies

Perl and Bioperl should be installed on the system.
GLIBC 2.14 should be installed.

git clone https://github.com/CAFS-bioinformatics/LR_Gapcloser.git
cd LR_Gapcloser/src/

$ ./LR_Gapcloser.sh

ls: /proc/81616/fd/: No such file or directory

Usage:sh LR_Gapcloser.sh -i Scaffold_file -l Corrected-PacBio-read_file
-i the scaffold file that contains gaps, represented by a string of N [ required ]
-l the raw and error-corrected long reads used to close gaps. The file should be fasta format. [ required ]
-s sequencing platform: pacbio [p] or nanopore [n] [ default: p ]
-t number of threads (for machines with multiple processors), used in the bwa mem alignment processes and the following coverage filteration. [ default: 5 ]
-c the coverage threshold to select high-quality alignments [ default: 0.8 ]
-a the deviation between gap length and filled sequence length [ default: 0.2 ]
-m to select the reliable tags for gap-closure, the maximal allowed distance from alignment region to gap boundary (bp) [ default: 600 ]
-n the number of files that all tags were divided into [ default: 5 ]
-g the length of tags that a long read would be divided into (bp) [ default: 300 ]
-v the minimal tag alignment length around each boundary of a gap (bp) [ default: 300 ]
-r number of iteration [ default: 3 ]
-o name of output directory [ default: ./Output]
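The -g option says that each long read is divided into tags of a fixed length (300 bp by default) before alignment. The Python sketch below illustrates that idea only. It assumes consecutive, non-overlapping tags and assumes that a trailing fragment shorter than the tag length is discarded; neither detail is confirmed by the help text, and `split_into_tags` is a hypothetical name, not a function of LR_Gapcloser.

```python
def split_into_tags(read, tag_length=300):
    """Cut a read into consecutive, non-overlapping tags of tag_length bp.

    Assumption: a trailing fragment shorter than tag_length is discarded.
    """
    if tag_length <= 0:
        raise ValueError("tag_length must be positive")
    usable = len(read) - len(read) % tag_length
    return [read[i:i + tag_length] for i in range(0, usable, tag_length)]


read = "ACGT" * 200  # hypothetical 800 bp read
tags = split_into_tags(read)
print(len(tags), [len(t) for t in tags])  # 2 [300, 300]
```

Under these assumptions, the tag length sets how many separate alignments one long read contributes, which is why it appears next to -v (the minimal tag alignment length at each gap boundary) in the option list.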

Usage

pacbio

bash LR_Gapcloser.sh -i input_scaffolds.fasta -l Corrected-PacBio-reads.fasta -s p

-l the raw and error-corrected long reads used to close gaps. The file should be fasta format
-i the scaffold file that contains gaps, represented by a string of N [required]
-s sequencing platform: pacbio [p] or nanopore [n] [ default: p ]

nanopore

bash LR_Gapcloser.sh -i input_scaffolds.fasta -l nanopore.fasta -s n

-l the raw and error-corrected long reads used to close gaps. The file should be fasta format
-i the scaffold file that contains gaps, represented by a string of N [required]
-s sequencing platform: pacbio [p] or nanopore [n] [ default: p ]
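After a run, a quick sanity check is to compare the number and total length of gaps before and after closing. The Python sketch below does this with a small hand-written FASTA reader, so it does not depend on Bioperl or any other library. The function names and the toy records are hypothetical; the in-memory strings stand in for the input scaffolds and the gap-closed FASTA.

```python
import io
import re


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file-like object."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def gap_stats(handle):
    """Return (number of gaps, total gap length in bp) over all records."""
    lengths = [len(m.group())
               for _, seq in read_fasta(handle)
               for m in re.finditer(r"[Nn]+", seq)]
    return len(lengths), sum(lengths)


# Toy data standing in for the scaffolds before and after gap closing.
before = io.StringIO(">scaf1\nACGTNNNNACGT\nNNACGT\n>scaf2\nGGGGNNNCCCC\n")
after = io.StringIO(">scaf1\nACGTTTGAACGT\nNNACGT\n>scaf2\nGGGGATCCCCC\n")
print(gap_stats(before))  # (3, 9)
print(gap_stats(after))   # (1, 2)
```

With real data, pass open file handles instead, for example `gap_stats(open("input_scaffolds.fasta"))` and the gap-closed FASTA written to the output directory (its exact file name is not given in this article).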

Citation

LR_Gapcloser: a tiling path-based gap closer that uses long reads to complete genome assembly

Gui-Cai Xu, Tian-Jun Xu, Rui Zhu, Yan Zhang, Shang-Qi Li, Hong-Wei Wang, Jiong-Tang Li

GigaScience, Volume 8, Issue 1, January 2019, giy157

Use of LR_Gapcloser is also recommended in this paper.