LongStitch: misassembly correction and scaffolding of genome assemblies using long reads

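High-quality de novo genome assemblies are essential for genomic studies of both model and non-model organisms. In recent years, long-read sequencing has contributed greatly to genome assembly and to scaffolding, the process of ordering and orienting assembled sequences using long-range information. Compared with short reads, long reads can span repetitive regions of the genome, which makes them very useful for resolving problematic regions and producing more complete draft assemblies. Here we present LongStitch, a scalable pipeline that corrects and scaffolds draft genome assemblies using long reads only. LongStitch runs in up to three stages using tools developed by the authors' group: an initial assembly correction step (Tigmint-long) followed by two incremental scaffolding steps (ntLink and ARKS-long). Tigmint-long and ARKS-long are misassembly correction and scaffolding utilities, respectively, that were originally developed for linked reads and have been adapted for long reads. We describe the LongStitch pipeline and introduce ntLink, a new long-read scaffolder that uses lightweight minimizer mappings to join contigs. LongStitch was tested on short-read and long-read assemblies of the nematode Caenorhabditis elegans, rice (Oryza sativa), and three different human individuals, using the corresponding nanopore long-read data, and improved the contiguity of each assembly by 1.2-fold to 304.6-fold (measured by NGA50 length). Furthermore, in most tests LongStitch produced more contiguous and more correct assemblies than LRScaf, a state-of-the-art long-read scaffolder, and it consistently improved the human assemblies in under five hours using less than 23 GB of RAM. Because LongStitch improves draft assemblies with long reads both effectively and efficiently, we expect it to benefit new genome assembly projects. The LongStitch pipeline is freely available at https://github.com/bcgsc/longstitch.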
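To make the idea of "minimizer mapping" concrete, here is a minimal Python sketch of how minimizers can be selected from a sequence and compared between two sequences. It is an illustration only, not ntLink's implementation: the function names `minimizers` and `shared_minimizers` are made up for this example, k-mers are ordered lexicographically (real tools typically order them by a hash value), and reverse complements are ignored.

```python
def minimizers(seq, k, w):
    """Return the ordered list of (position, k-mer) minimizers of seq.

    A window covers w consecutive k-mers; the minimizer of a window is its
    smallest k-mer. Lexicographic order is an assumption made for clarity.
    """
    kmers = [(seq[i:i + k], i) for i in range(len(seq) - k + 1)]
    picked = []
    for start in range(len(kmers) - w + 1):
        # min() on (k-mer, position) tuples breaks ties by leftmost position.
        kmer, pos = min(kmers[start:start + w])
        # Consecutive windows often share a minimizer; record it only once.
        if not picked or picked[-1] != (pos, kmer):
            picked.append((pos, kmer))
    return picked


def shared_minimizers(a, b, k, w):
    """Return the set of k-mers chosen as minimizers in both sequences."""
    in_a = {kmer for _, kmer in minimizers(a, k, w)}
    in_b = {kmer for _, kmer in minimizers(b, k, w)}
    return in_a & in_b
```

Because only a subset of k-mers is kept, a long read and a contig can be compared through their minimizer sets rather than by full alignment, which is the sense in which the mapping is "lightweight". The defaults listed in the help output (k_ntLink=32, w=100) would be passed as `k` and `w` in a sketch like this.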

From the GitHub repository

LongStitch is a pipeline that corrects and scaffolds genome assemblies using long reads. It consists of up to three steps:

Tigmint cuts the draft assembly at potentially misassembled regions
ntLink then scaffolds the corrected assembly
ARKS optionally performs a further round of scaffolding

Installation

Dependencies

GNU Make
Tigmint v1.2.4+
ntLink v1.0.0+
ARCS/ARKS v1.2.2+
ABySS v2.3.0+
LINKS v1.8.5+
samtools

GitHub

#conda
mamba install -c bioconda -c conda-forge longstitch

*All of LongStitch's dependencies can also be installed with Homebrew. For longstitch itself, a tarball of the latest release is available as well.

> longstitch

# longstitch

LongStitch v1.0.1

Usage: ./longstitch [COMMAND] [OPTION=VALUE]…

For example, to run the default pipeline on a draft assembly draft-assembly.fa with the reads reads.fa.gz and a genome size of gsize:

longstitch run draft=draft-assembly reads=reads G=gsize

Commands:

run                     run default LongStitch pipeline: Tigmint, then ntLink
tigmint-ntLink-arks     run full LongStitch pipeline: Tigmint, ntLink, then ARCS in kmer mode
tigmint-ntLink          run Tigmint, then ntLink (Same as 'run' target)
ntLink-arks             run ntLink, then run ARCS in kmer mode

General options (required):

draft       draft name [draft]. File must have .fa extension
reads       read name [reads]. File must have .fq.gz or .fa.gz extension

General options (optional):

t           number of threads [8]
z           minimum size of contig (bp) to scaffold [1000]

Tigmint options:

span        min number of spanning molecules to be considered correctly assembled [auto]
dist        maximum distance between alignments to be considered the same molecule [auto]
G           haploid genome size (bp) for calculating span parameter (e.g. '3e9' for human genome). Required when span=auto [0]

ntLink options:

k_ntLink    k-mer size for minimizers [32]
w           window size for minimizers [100]

ARCS+LINKS options:

j           minimum fraction of read kmers matching a contigId (used in kmer mode) [0.05]
k_arks      size of a k-mer (used in kmer mode) [20]
c           minimum aligned read pairs per molecule [4]
l           minimum number of links to compute scaffold [4]
a           maximum link ratio between two best contig pairs [0.3]

Notes:

- by default, span is automatically calculated as 1/4 of the sequence coverage of the input long reads

- G (genome size) must be specified if span=auto

- by default, dist is automatically calculated as p5 of the input long read lengths

- Ensure that all input files are in the current working directory, making soft-links if needed
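The two "auto" rules in the notes above can be sketched in a few lines of Python. This is only an illustration of the arithmetic described in the help text: the function names `auto_span` and `auto_dist` are hypothetical, the help text does not say how Tigmint rounds the span or which percentile method it uses, and the nearest-rank percentile below is an assumption.

```python
def auto_span(read_lengths, genome_size):
    """span=auto: one quarter of the long-read sequence coverage.

    Coverage is total read bases divided by the haploid genome size (G).
    No rounding is applied here; how the tool rounds is not stated.
    """
    if genome_size <= 0:
        # Mirrors the note that G must be given when span=auto (default G is 0).
        raise ValueError("genome size G must be specified when span=auto")
    coverage = sum(read_lengths) / genome_size
    return coverage / 4


def auto_dist(read_lengths):
    """dist=auto: 5th percentile (p5) of the read lengths.

    Uses the nearest-rank definition of a percentile (an assumption).
    """
    ordered = sorted(read_lengths)
    # ceil(0.05 * n) computed with integers to avoid floating-point surprises.
    rank = max(1, (5 * len(ordered) + 99) // 100)
    return ordered[rank - 1]
```

For example, 40 reads of 1,000 bp over a 10,000 bp genome give 4x coverage, so `auto_span` returns 1.0. A quick calculation like this can help judge whether the automatic span is sensible for a low-coverage data set before starting a run.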

Test run

git clone https://github.com/bcgsc/LongStitch.git
cd LongStitch/tests/
./run_longstitch_demo.sh

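To run LongStitch on your own data with the default settings, specify the draft assembly (draft-assembly.fa), the reads (reads.fa.gz), and the genome size (gsize). As in the usage message, draft and reads are given without their file extensions.

longstitch run draft=draft-assembly reads=reads G=gsize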
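The naming rules from the help text (extensions are dropped, the draft must end in .fa, the reads in .fq.gz or .fa.gz, and all inputs must sit in the current working directory) are easy to get wrong. The Python sketch below checks them and assembles the argument list. The helper `longstitch_command` is hypothetical and not part of LongStitch; it uses only the targets and the `draft`, `reads`, `G`, and `t` options listed in the help output.

```python
from pathlib import Path

READ_SUFFIXES = (".fq.gz", ".fa.gz")
TARGETS = ("run", "tigmint-ntLink-arks", "tigmint-ntLink", "ntLink-arks")


def longstitch_command(draft_file, reads_file, genome_size, target="run", threads=8):
    """Build the longstitch argument list from full file names.

    Hypothetical helper: it only enforces the rules stated in the help text.
    """
    draft = Path(draft_file)
    reads = Path(reads_file)
    if target not in TARGETS:
        raise ValueError(f"unknown target: {target}")
    # Inputs must be in the working directory (soft-link them if needed).
    if draft.parent != Path(".") or reads.parent != Path("."):
        raise ValueError("input files must be in the current working directory")
    if draft.suffix != ".fa":
        raise ValueError("draft file must have a .fa extension")
    read_suffix = next((s for s in READ_SUFFIXES if reads.name.endswith(s)), None)
    if read_suffix is None:
        raise ValueError("reads file must have a .fq.gz or .fa.gz extension")
    return [
        "longstitch",
        target,
        f"draft={draft.name[:-len('.fa')]}",          # name without extension
        f"reads={reads.name[:-len(read_suffix)]}",    # name without extension
        f"G={genome_size}",
        f"t={threads}",
    ]
```

Calling `longstitch_command("draft-assembly.fa", "reads.fa.gz", "3e9")` yields the same arguments as the usage example, with the human genome size filled in. The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where longstitch is installed.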

Citation

LongStitch: high-quality genome assembly correction and scaffolding using long reads

Lauren Coombe, Janet X. Li, Theodora Lo, Johnathan Wong, Vladimir Nikolic, René L. Warren & Inanc Birol
BMC Bioinformatics volume 22, Article number: 534 (2021)