ARKS: chromosome-scale assembly using linked reads

From the ARCS paper

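The Chromium sequencing library preparation protocol from 10x Genomics (10xG, Pleasanton, CA) builds on Illumina sequencing (San Diego, CA): it provides short reads, together with barcode information, that are localized on long DNA fragments. It therefore benefits from the economies of scale of a high-throughput platform. Because the sequencing reads from 20-200 kb molecules are barcoded (linked), applications of the technology have focused mainly on phasing variants in the human genome (Narasimhan et al., 2016; Zheng et al., 2016).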

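The ability of 10x Genomics to generate linked reads is similar to that of Illumina TruSeq (Kuleshov et al., 2014). The latter technology provides complementary information that is useful for whole-genome shotgun assembly projects, and the pseudo-long reads it generates help resolve long repeats. However, to generate pseudo-long reads, TruSeq needs high-coverage data for the a priori assembly of co-localized reads into fragments, and it inherently yields low fragment coverage of the target genome. TruSeq can therefore be relatively expensive when adequate fragment coverage of a mammalian-sized genome is required. Conversely, the Chromium platform typically provides low coverage of each single barcoded molecule, which limits its usefulness for assembling individual fragments. It makes up for this limitation in throughput, providing higher fragment coverage.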

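Recently, this data type was used to scaffold draft genomes (Mostovoy et al., 2016) with software designed for contiguity preserving transposition sequencing (CPT-seq) and another source of long-range information, Hi-C (Adey et al., 2014). In that paper, Mostovoy et al. demonstrated the potential of the technology for scaffolding draft genomes by showing a 12-fold improvement in the contiguity of a human genome assembly using 97x coverage GemCode sequencing data (GemCode being the precursor of 10x Genomics Chromium).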
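Contiguity improvements like the 12-fold figure above are usually reported through a summary statistic such as N50. The text here does not say which statistic was used, so the following is only a minimal sketch of N50 on made-up lengths, not a reproduction of the published measurement.

```python
def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical lengths before and after scaffolding (illustrative only).
before = [40, 30, 20, 10]
after = [70, 30]
print("N50 before:", n50(before))
print("N50 after:", n50(after))
```

Joining sequences into longer scaffolds leaves the total length unchanged but moves N50 up, which is why it is a convenient one-number summary of contiguity.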

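In this paper, the authors introduce ARCS (Assembly Round-up by Chromium Scaffolding), an algorithm that leverages the large volume of long-fragment information to organize draft sequences into contiguous assemblies that characterize large chromosomal segments. Using recent Genome In A Bottle (GIAB) human genome sequencing data (Zook et al., 2016), they compare ARCS with fragScaff, the only other published technology that generates draft genome assemblies from 10x Genomics linked reads (Mostovoy et al., 2016). In the fragScaff scaffolding algorithm, a barcoded alignment file is parsed to determine which barcodes map to each sequence end. For every pair of sequence ends, a shared barcode fraction metric is computed. These values produce a distribution of shared barcode fractions for each sequence end. Based on these distributions, edges linking sequence ends are added to the scaffold graph. For each connected component, a maximum-weight minimum spanning tree (MST) is determined, and the final scaffolds are then produced by iteratively incorporating branches into the main trunk of the MST.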
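The fragScaff steps described above can be sketched roughly as follows. This is not fragScaff's code. The exact definition of the shared barcode fraction is not given here, so the sketch assumes shared barcodes divided by all barcodes seen at either end, replaces the distribution-based edge selection with a fixed threshold, and builds a maximum-weight spanning tree with Kruskal's algorithm. All names and data are hypothetical.

```python
from itertools import combinations


def shared_fraction(a, b):
    """Assumed metric: barcodes shared by both ends / barcodes seen at either end."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_edges(end_barcodes, min_fraction):
    """Link every pair of sequence ends whose shared fraction passes the threshold."""
    edges = []
    for u, v in combinations(sorted(end_barcodes), 2):
        weight = shared_fraction(end_barcodes[u], end_barcodes[v])
        if weight >= min_fraction:
            edges.append((weight, u, v))
    return edges


def maximum_spanning_forest(nodes, edges):
    """Kruskal's algorithm, taking the heaviest edges first."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    chosen = []
    for weight, u, v in sorted(edges, reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # skip edges that would close a cycle
            parent[root_u] = root_v
            chosen.append((u, v, weight))
    return chosen


# Hypothetical sequence ends and the barcodes mapped to them.
end_barcodes = {
    "A": {1, 2, 3, 4},
    "B": {2, 3, 4, 5},
    "C": {4, 5, 6},
}
edges = build_edges(end_barcodes, min_fraction=0.1)
tree = maximum_spanning_forest(end_barcodes, edges)
for u, v, weight in tree:
    print(f"{u} - {v}: {weight:.2f}")
```

In this toy input the weak A-C link is dropped because A-B and B-C already connect all three ends with higher weights.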

(part omitted)

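The authors show that ARCS yields assemblies that are more contiguous and more accurate than those of fragScaff and Architect over a wide range of parameters, while using less time and fewer compute resources. Using two human linked-read datasets from different experiments, they show that scaffolding existing human genome drafts with ARCS gives assemblies whose contiguity and accuracy are on par with or better than those assembled with the newly released 10x Genomics Supernova de novo assembler (Weisenfeld et al., 2017). ARCS is implemented in C++ and runs on Unix.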

Installation

I built and tested ARKS and ARCS on Ubuntu 16.04. I also pulled the arcs Docker image and confirmed that it works (host OS: macOS 10.12). There was no arks image on Docker Hub, so I built one and pushed it to Docker Hub.

Dependencies

Boost (tested on 1.61)
GCC (tested on 4.4.7)
Autotools (if cloning directly from repository)
LINKS (tested on 1.8)
Source code (GitHub)

ARCS (the original)

ARKS (its successor)

# Install arks

# Google sparsehash
apt install libsparsehash-dev

# arks itself
git clone https://github.com/bcgsc/arks.git
cd arks/
# Generate the configure script
./autogen.sh
./configure
make
make install

> arks --help

# arks --help

Reading user inputs...

Usage: [arks 1.0.2]

arks [Options] <chrom file list>

Options:

=> TYPE OPTIONS: <=

-p can be one of:

1) full uses the full ARKS process (kmerize draft, kmerize and align chromium reads, scaffold).

2) align skips kmerizing of draft and starts with kmerizing and aligning chromium reads.

3) graph skips kmerizing draft and kmerizing/aligning chromium reads and only scaffolds.

=> INPUT OPTIONS: <=

A) Always required (specific type 'full'):

-f Using kseq parser, these are the contig sequences to further scaffold and can be in either FASTA or FASTQ format. (required)

-a tsv or csv file for barcode multiplicities. Can be acquired by running CalcBarcodeMultiplicity script included. (required)

B) If you want to skip kmerizing contigs you will need (specified type 'align'):

-q tsv file for ContigRecord (a record of all the contigs + h/t).

--> Format of file should be: <contig record index number> <contig name> <H/T>

-w tsv file for the ContigKmerMap (a record of all the kmers to contig ID index (corresponds to contigrecord tsv).

--> Format of file should be: <kmer> <contig record index number>

C) If you want to skip both the full kmer alignment based step you will need (specific type 'graph'):

**Note that you are using ARKS as a graphing application**

-i tsv file for the IndexMap.

--> Format of file should be: <barcode> <contig name> <H/T> <count>

=> DISTANCE ESTIMATION OPTIONS:

-D enable distance estimation [disabled]

-s=FILE output TSV of intra-contig distance/barcode data [disabled]

-S=FILE output TSV of inter-contig distance/barcode data [disabled]

-B num neighbouring samples to estimate distance upper bound [20]

=> EXTRA OUTPUT OPTIONS: <=

-o can be one of:

0 no checkpoint files (default)

1 outputs of kmerizing the draft (ContigRecord + ContigKmerMap only)

2 output of aligning chromium to draft (IndexMap only)

3 all checkpoint files (ContigRecord, ContigKmerMap, and IndexMap)

-c Minimum number of mapping read pairs/Index required before creating edge in graph. (default: 5)

-k k-value for the size of a k-mer. (default: 30) (required)

-g shift between k-mers (default: 1)

-j Minimum fraction of read kmers matching a contigId for a read to be associated with the contigId. (default: 0.55)

-l Minimum number of links to create edge in graph (default: 0)

-z Minimum contig length to consider for scaffolding (default: 500)

-b Base name for your output files (optional)

-m Range (in the format min-max) of index multiplicity (only reads with indices in this multiplicity range will be included in graph) (default: 50-10000)

-d Maximum degree of nodes in graph. All nodes with degree greater than this number will be removed from the graph prior to printing final graph. For no node removal, set to 0 (default: 0)

-e End length (bp) of sequences to consider (default: 30000)

-r Maximum p-value for H/T assignment and link orientation determination. Lower is more stringent (default: 0.05)

-t Number of threads.(default: 1)

-v Runs in verbose mode (optional, default: 0)
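To make the -k, -g, -e and -j options in the help text above more concrete, here is a minimal sketch of how a read could be associated with the head (H) or tail (T) of a contig from shared k-mers. It is my own illustration, not the ARKS implementation: it ignores reverse complements, keeps only the first location of a repeated k-mer, and all function names and sequences are hypothetical.

```python
from collections import Counter


def kmers(seq, k, shift=1):
    """Yield the k-mers of seq, moving `shift` bases per step (cf. -k and -g)."""
    for i in range(0, len(seq) - k + 1, shift):
        yield seq[i:i + k]


def index_contig_ends(contigs, k, end_length):
    """Map each k-mer in the first/last `end_length` bases (cf. -e) of every
    contig to a (contig name, 'H' or 'T') pair."""
    index = {}
    for name, seq in contigs.items():
        regions = {"H": seq[:end_length], "T": seq[-end_length:]}
        for label, region in regions.items():
            for kmer in kmers(region, k):
                index.setdefault(kmer, (name, label))  # first location wins
    return index


def assign_read(read, index, k, shift, min_fraction):
    """Return the contig end hit by at least `min_fraction` (cf. -j) of the
    read's k-mers, or None if no end reaches that fraction."""
    read_kmers = list(kmers(read, k, shift))
    hits = Counter(index[km] for km in read_kmers if km in index)
    if not hits:
        return None
    best, count = hits.most_common(1)[0]
    return best if count / len(read_kmers) >= min_fraction else None


# Toy data: one 20 bp contig, k = 4, 8 bp ends.
contigs = {"ctg1": "ACGGTCAAGGCTTACGATCG"}
index = index_contig_ends(contigs, k=4, end_length=8)
print(assign_read("GGTCAA", index, k=4, shift=1, min_fraction=0.55))
print(assign_read("GATCGTTT", index, k=4, shift=1, min_fraction=0.55))
```

The first read lies entirely in the head region, so all of its k-mers match. The second overlaps the tail with only two of its five k-mers (0.4), which is below the 0.55 threshold, so it is left unassigned.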

Docker images

https://hub.docker.com/r/abnerchang/arcs/

# Start the container with the host's current directory shared as /data in the container
#arcs
docker run --rm -itv $PWD:/data/ abnerchang/arcs arcs --help

#arks
docker run --rm -itv $PWD:/data/ kazumax/arks arks --help

Test run

Besides arks itself, this needs LINKS and the arks-make file under Examples/.

cd /arks/Examples/arks_test-demo/
./runARKSdemo.sh

Usage

Required files

A FASTA file of the draft assembly
An interleaved linked-read file (see GitHub)
A CSV file of barcode multiplicities, generated with the bundled script (see GitHub)

>perl bin/calcBarcodeMultiplicities.pl reads.fof > read_multiplicities.csv
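For intuition about what a barcode multiplicity file contains, the sketch below counts reads per barcode from FASTQ headers and keeps barcodes whose count falls in a min-max range, as the -m option does. This is not the bundled Perl script. I have not checked whether that script counts reads or read pairs, and the barcodes and records below are made up.

```python
import re
from collections import Counter

BARCODE_TAG = re.compile(r"BX:Z:(\S+)")


def barcode_multiplicities(fastq_lines):
    """Count reads per barcode; the header is every 4th line of a FASTQ file."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 0:
            match = BARCODE_TAG.search(line)
            if match:  # reads without a barcode tag are skipped
                counts[match.group(1)] += 1
    return counts


def within_range(counts, low, high):
    """Keep barcodes whose multiplicity lies in [low, high] (cf. -m min-max)."""
    return {bc: n for bc, n in counts.items() if low <= n <= high}


# Hypothetical FASTQ records: two reads for one barcode, one for another,
# and one read with no barcode tag at all.
records = [
    "@r1/1 BX:Z:AAAC-1", "ACGT", "+", "IIII",
    "@r1/2 BX:Z:AAAC-1", "TTGA", "+", "IIII",
    "@r2/1 BX:Z:GGTT-1", "CCGA", "+", "IIII",
    "@r3/1", "ACGT", "+", "IIII",
]
counts = barcode_multiplicities(records)
for barcode, n in sorted(counts.items()):
    print(f"{barcode},{n}")
print(within_range(counts, 2, 10))
```

With the default range of 50-10000, barcodes seen in very few reads or in an implausibly large number of reads are excluded from the graph.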

cd arks/Examples/
sh pipeline_example.sh

# sh pipeline_example.sh

Usage: pipeline_example.sh <draft> <reads>

draft Assembled Sequences to further scaffold (Multi-Fasta format, with extension .fa or .fasta)

reads Interleaved chromium reads with barcode in the read header (ie. @ReadName BX:Z:<barcode>)

NOTE: file must have .fastq.gz or .fq.gz extension
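Because the pipeline script is strict about file extensions and about the barcode being in the read header, it can save time to check the inputs before launching it. The following is a minimal sketch of such a check under my own assumptions: check_inputs is a hypothetical helper, not part of ARKS, and only the first records of the read file are inspected.

```python
import gzip
import os
import re
import tempfile
from itertools import islice

DRAFT_EXT = (".fa", ".fasta")
READS_EXT = (".fastq.gz", ".fq.gz")
BARCODE_TAG = re.compile(r"\sBX:Z:\S+")


def check_inputs(draft_path, reads_path, n_records=1000):
    """Return a list of problems found; an empty list means the checks passed."""
    problems = []
    if not draft_path.endswith(DRAFT_EXT):
        problems.append("draft must end in .fa or .fasta")
    if not reads_path.endswith(READS_EXT):
        problems.append("reads must end in .fastq.gz or .fq.gz")
        return problems  # do not try to open a file with the wrong extension
    with gzip.open(reads_path, "rt") as handle:
        # Header lines are every 4th line of a FASTQ file.
        headers = islice(handle, 0, 4 * n_records, 4)
        missing = sum(1 for h in headers if not BARCODE_TAG.search(h))
    if missing:
        problems.append(f"{missing} read header(s) lack a BX:Z:<barcode> tag")
    return problems


# Demo on a tiny made-up read file in which the second read has no barcode.
tmp_dir = tempfile.mkdtemp()
reads = os.path.join(tmp_dir, "reads.fq.gz")
with gzip.open(reads, "wt") as out:
    out.write("@r1 BX:Z:AAAC-1\nACGT\n+\nIIII\n@r2\nACGT\n+\nIIII\n")
problems = check_inputs("draft.fa", reads)
print(problems)
```

The draft file is only checked by name here, so it does not need to exist for the demo to run.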

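According to the documentation, the pipeline can be run once symbolic links to the linked-read file and the draft assembly file are in place.

With Tigmint (GitHub) you can detect misassemblies, cut the assembly at those points, and then scaffold it again with arcs.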

tigmint-make arcs draft=myassembly reads=myreads

For details, see the documentation in the Tigmint and ARCS GitHub repositories.

References
ARKS: chromosome-scale scaffolding of human genome drafts with linked read kmers
Coombe L, Zhang J, Vandervalk BP, Chu J, Jackman SD, Birol I, Warren RL

BMC Bioinformatics. 2018 Jun 20;19(1):234

ARCS: scaffolding genome drafts with linked reads
Yeo S, Coombe L, Warren RL, Chu J, Birol I

Bioinformatics. 2018 Mar 1;34(5):725-731