ラージゲノムにも対応したcontigのscaffoldingツール BESST

　近年のハイスループットシーケンシング（HTS）技術は、低コストで数百万の短いDNA配列（リードと呼ばれる）を生成するため、デノボアセンブリプロジェクトにとって魅力的である。しかしながら、これらのリードは数百bpの長さしかないため、アセンブラ（例えば、[論文より　ref.1,2]）がゲノムを再構築することを困難にする。その結果、アセンブリ出力はコンティグ、すなわち長い断片にアセンブリされたゲノムの断片からなることが多い。

　しかし、HTS技術は、アセンブリの連続性を高めるために使用できるペアエンドリードを作成するためのプロトコルを提供する。ペアエンドリードは、リード間の距離をインサートサイズと呼ぶ既知の距離と方向でシーケンシングされている。ペアエンドリード内の2つのリードが異なるコンティグCaおよびCbに属する場合、CaとCbの間にリンクが作成される（論文　図1aを参照 link）。このリンクから、CaとCbの相対的な順序、方向、距離を推論することができる。

　コンティグのリンクと順序付けのプロセスはscaffoldingと呼ばれている。ペアエンドリードに加え、関連生物のリファレンス配列[ref.3]、制限酵素地図[ref.4]およびRNA-seqデータ[ref.5]などの情報をコンティグリンクに用いることができる。しかし、リファレンスベースのアセンブリは、ほとんどのde novoシーケンシングプロジェクトには適用できない。制限酵素地図はしばしば利用できず、RNA-seqデータは遺伝子をカバーするだけで、リード間の距離に関する情報は含まれていない。これにより、（しばしば唯一適用可能な）ペアエンド情報がスキャフォールディングの情報源として最も一般的に使用されている。

　残念なことにペアエンドのscaffoldingは問題に直面する。リードは、シーケンスエラー、ヘテロ接合性、およびゲノムのリピートのため擬似リンクを作成する可能性があり、これらの擬似リンクはコンティグ間の順序付けおよび方向付けをあいまいにする。したがってscaffoldingの問題とは、コンティグの一貫した順序付けおよび配向を見つけるために、正しいリンクを検出および利用すること、とまとめられる。scaffoldingの既存の公式化はNP完全であることが証明されているが、これらの公式化が目的に関して最適な解を見出すときでさえ、実際の（すなわち生物学的な）問題を解決するかどうかはまだ不明である。これらのアプローチは、個々のリンクの正確性を評価することにほとんど重点を置かずに、コンティグリンクによって誘導されるグラフの構造的特性に焦点を当ててきた。著者らのアプローチでは、間違ったリンクを削除し、洗練された統計を使用して、リンクしているリードが基礎となるライブラリの配布からのものであるのか、ミスアライメントによるものか評価する。

　論文の次のセクションでは、scaffoldingとその関連作業の形式化と、作業の概要とこの仕事のmotivationについて説明する。我々（著者ら）のアルゴリズムは、BESST（Bias Estimating Stepwise Scaffolding Tool）と呼ばれるインプリメンテーションで実現され、メソッドのセクションで詳細に説明されている。このアルゴリズムは、Picea abies（20 Gbp）のゲノムプロジェクト[ref.6]の使用で証明されているように、非常に大きく複雑なゲノム上にもスケールする。さらに、より幅広いインサートサイズ分布でも優れたscaffolding能を持つ。

（一段落省略）

　我々（著者ら）の結果は、単一のscaffolderがすべてのデータセットで他のスキャフォルダーより優れているわけではないことを示しているが、他のスタンドアロンscaffolderとの比較でBESSTは最も好ましい結果を示している。さらに、ライブラリのインサートサイズ分布の標準偏差が大きい場合、他のスタンドアローンのscaffolderよりも優れている（以下省略）。

BESSTに関するツイート

インストール

cent os7でテストした。

Supported on Linux / OSX with python2 (versions >= 2.7) and python3 (versions >=3.4)

本体　Github

conda install -c bioconda besst

> runBESST

runBESST

usage: BESST [-h] -c CONTIGFILE -f BAMFILES [BAMFILES ...] -orientation

ORIENTATION [ORIENTATION ...] [-r READLEN [READLEN ...]]

[-m MEAN [MEAN ...]] [-s STDDEV [STDDEV ...]] [-z COVCUTOFF]

[-z_min LOWER_COVCUTOFF] [-T THRESHOLD [THRESHOLD ...]]

[-e EDGESUPPORT [EDGESUPPORT ...]] [-k MINSIZE [MINSIZE ...]]

[-filter_contigs CONTIG_FILTER_LENGTH] [--min_mapq MIN_MAPQ]

[--iter PATH_THRESHOLD] [--score_cutoff SCORE_CUTOFF]

[--max_extensions MAX_EXTENSIONS] [-a HAPLRATIO]

[-b HAPLTHRESHOLD] [-K KMER] [-M MMER] [-g] [-o OUTPUT] [-d] [-y]

[-q] [--no_score] [-devel] [-plots] [--separate_repeats]

[--NO_ILP] [--FASTER_ILP] [--print_scores] [--dfs_traversal]

[--bfs_traversal] [-max_contig_overlap MAX_CONTIG_OVERLAP]

[--version]

BESST - scaffolder for genomic assemblies.

optional arguments:

-h, --help show this help message and exit

-y Deactivate pathfinder module for including smaller

contigs.

--no_score Statistical scoring is not performed. BESST instead

searches for paths between contigs.

Required arguments:

-c CONTIGFILE Fasta file containing contigs.

-f BAMFILES [BAMFILES ...]

Path(s) to bamfile(s).

-orientation ORIENTATION [ORIENTATION ...]

Mapped orientation for each library given with -f

option. Valid input is either fr (forward reverse

orientation) or rf (reverse forward orientation).

Library information:

Library parameters that can be set, e.g., read length, insert size mean/std_dev, coverage etc.

-r READLEN [READLEN ...]

Mean read length of libraries.

-m MEAN [MEAN ...] Mean insert size of libraries.

-s STDDEV [STDDEV ...]

Estimated standard deviation of libraries.

-z COVCUTOFF User specified coverage cutoff. (Manually filter out

contigs with coverage over this value)

-z_min LOWER_COVCUTOFF

User specified coverage cutoff. (Manually filter out

contigs with coverage under this value)

Ploidy:

Options involving detection of split allele contigs in diploid assemblies.

-a HAPLRATIO Maximum length ratio for merging of haplotypic

regions.

-b HAPLTHRESHOLD Number of standard deviations over mean/2 of coverage

to allow for clasification of haplotype. Example:

contigs with under mean/2 + 3sigma are indicated as

potential haplotypes (tested later) if -b 3

-g Haplotype detection function, default = off

Parameters/variables/threshold involved in BESST.:

-T THRESHOLD [THRESHOLD ...]

Threshold value filter out reads that are mapped too

far apart given insert size.

-e EDGESUPPORT [EDGESUPPORT ...]

Threshold value for the least nr of links that is

needed to create an edge. Default for all libs:

Inferred by BESST (see value in output in

Statistics.txt).

-k MINSIZE [MINSIZE ...]

Contig size threshold for the library (all contigs

below this size is discarded from the 'large contigs'

scaffolding, but included in pathfinding). Default:

Set to same as -T parameter

-filter_contigs CONTIG_FILTER_LENGTH

Contigs under this size is discarded from all

scaffolding (including pathfinding). they are also not

included in the resulting scaffold output fasta file

and any of the other files as well as all statistics.

Default: All contigs are included

--min_mapq MIN_MAPQ Lowest mapping quality allowed in order to use the

read pair (both reads needs to have equal or bigger

mapq than this value). This value is compared to the

mapping quality column in the BAM file.

--iter PATH_THRESHOLD

The number of iterations performed in breadth first

search for placing smaller contigs.

--score_cutoff SCORE_CUTOFF

Only store paths with score larger than score_cutoff.

--max_extensions MAX_EXTENSIONS

Maximum number of (large) scaffolds to search for

paths extensions with.

--NO_ILP Warning: Do not use this option! This option is

included only for benchmarking purposes of old BESST

algorithm. This option will give poor results as it

mimics earlier versions of BESST.

--FASTER_ILP Faster but worse performing heuristic solution to

solving ILPs. May be used if BESST is too slow.

However, lowering --iter is usually more effective to

reduce scaffolding time.

--print_scores Print BESST scores on edges in the Scaffolding graph.

Various other parameters:

-K KMER k-mer size used in de brujin graph for obtaining

contigs in assembly, default 50

-M MMER m-mer usted for creating connection graph. Should be

set lower than k-mer size

-o OUTPUT Path to output directory. BESST will create a folder

named 'BESST_output' in the directory given by the

path.

-d Deactivate sequencing duplicates detection

-q Parallellize work load of path finder module in case

of multiple processors available.

-devel Run in development mode (bug checking and memory usage

etc.)

-plots Plot coverage, histograms of scores e.t.c.

--separate_repeats Separates contigs classified as repeats by BESST into

a file 'repeats.fa'. They are not included in the main

scaffolding output file with this flag specified.

--dfs_traversal Depth first search with DP in the contig graph

(default).

--bfs_traversal Choose a breadth first search in the contig graph.

Default is the new depth first search with a DP

approch that seems to outperform previous traversals.

This option is kept for evaluation but is not

reccomended to specify.

-max_contig_overlap MAX_CONTIG_OVERLAP

BESST checks for overlapping ends in contigs that are

adjacent in a scaffold. This parameter sets the

maximum identical overlap to search for, default is

200.

--version show program's version number and exit

Source code and manual: http://github.com/ksahlin/BESST

Please cite:

Sahlin K, Vezzi F, Nystedt B, Lundeberg J, Arvestad L (2014) BESST--efficient scaffolding of large fragmented assemblies. BMC Bioinformatics 15, 281

ラン

アセンブリしたcontigと、そのcontigにマッピングしたbamを指定する。複数ライブラリのシーケンシングデータがある時は各々のbamを順番に指定する。

runBESST -c contig.fa -f file1.bam file2.bam ... -o output -orientation {fr/rf}

ランの詳細はGithubのREADMEを確認して下さい。

引用

BESST--efficient scaffolding of large fragmented assemblies
Sahlin K, Vezzi F, Nystedt B, Lundeberg J, Arvestad L

BMC Bioinformatics. 2014 Aug 15;15:281

Assembly scaffolding with PE-contaminated mate-pair libraries.
Sahlin K, Chikhi R, Arvestad L

Bioinformatics. 2016 Jul 1;32(13):1925-32

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ラージゲノムにも対応したcontigのscaffoldingツール BESST