StriDe: assembly using both a string graph and a de Bruijn graph

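De novo assemblers built on de Bruijn graphs cope well with error-rich regions, where overlap-layout-consensus (OLC) graph methods struggle, and they process large numbers of reads efficiently. However, because reads are broken into k-mers, regions containing repeats as long as or longer than k cannot be resolved. OLC uses the full length of each read, so for repeats an OLC graph has the advantage over a de Bruijn graph. Even so, assemblers such as SPAdes often outperform OLC methods on metrics such as N50, because they feed paired-read information into a paired de Bruijn graph and can therefore also assemble repeats longer than k (roughly speaking, each read pair acts like a long read spanning the paired-end fragment).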
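The repeat limitation described above can be seen on a toy example. The following is a minimal sketch, not how any real assembler is implemented: the helper names `de_bruijn_edges` and `branching_nodes` and the 18 bp toy sequence are made up for illustration. It builds a de Bruijn graph whose nodes are (k-1)-mers and lists the nodes with more than one successor, that is, the places where the path through the graph becomes ambiguous.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def branching_nodes(graph):
    """Nodes with more than one successor, i.e. ambiguous continuations."""
    return sorted(node for node, succs in graph.items() if len(succs) > 1)


# Hypothetical toy sequence: the 6 bp repeat CATTAG occurs twice with different flanks.
genome = "GG" + "CATTAG" + "TC" + "CATTAG" + "AA"

for k in (4, 7, 8):
    print(k, branching_nodes(de_bruijn_edges([genome], k)))
# 4 ['TAG']
# 7 ['CATTAG']
# 8 []
```

With k = 4 and k = 7 the two copies of the repeat collapse onto shared nodes and the graph branches where they end. Only when the (k-1)-mer nodes are longer than the repeat (k = 8 here) does the branch disappear, which is why information spanning the repeat, such as full-length reads or read pairs, helps.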

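StriDe combines the strengths of a string graph, which is similar to OLC, with those of a de Bruijn graph, giving a method that tolerates errors and also performs well on repeats. In a tool comparison on the bacterial genome assembly benchmark datasets (GAGE-B), StriDe achieved a higher N50 than SPAdes and the other tools for most of the 12 genomes (although the k-mer size used, 31, is short). Because it uses an FM-index, StriDe is also memory-efficient, and it is reported to have used only 2 GB of memory on the GAGE-B data. Its running time is also on the short side.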
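To give a feel for what an FM-index does, here is a toy sketch of counting substring occurrences by backward search over the Burrows-Wheeler transform. It is an illustration of the general technique only and says nothing about StriDe's actual data structures. The function names are hypothetical, and the naive suffix sorting and linear-time `occ` would be replaced by compressed, sampled structures in a real index.

```python
from collections import Counter


def build_fm_index(text):
    """Build a toy FM-index (BWT string and C table); '$' is the end sentinel."""
    text += "$"
    # Naive suffix array: fine for a demo, far too slow for real read sets.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # The character preceding each sorted suffix (index -1 wraps to '$').
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total  # number of characters in text smaller than ch
        total += counts[ch]
    return bwt, c_table


def occ(bwt, ch, i):
    """Occurrences of ch in bwt[:i]; real indexes store sampled checkpoints."""
    return bwt.count(ch, 0, i)


def count_occurrences(bwt, c_table, pattern):
    """Count how often pattern occurs in the indexed text via backward search."""
    lo, hi = 0, len(bwt)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ(bwt, ch, lo)
        hi = c_table[ch] + occ(bwt, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


bwt, c_table = build_fm_index("GGCATTAGTCCATTAGAA")
print(count_occurrences(bwt, c_table, "CATTAG"))  # 2
```

The point of the structure is that the query walks the pattern once, from its last character to its first, without scanning the text itself.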

Installation

Installed on CentOS.

Source code: GitHub

git clone https://github.com/ythuang0522/StriDe.git
cd StriDe/
./autogen.sh 
./configure
make
cd StriDe/
./stride # check that it runs

Alternatively, install it with conda (Anaconda environments only).

stride — Bioconda documentation

conda install -c bioconda stride

$ ./stride

Program: StriDe
Version: 0.0.1
Contact: Yao-Ting Huang [ythuang@cs.ccu.edu.tw]
Usage: stride <command> [options]

All-in-one Commands:
  all         Perform error correction, long-read generation, overlap computation, and assembly in one run

Step-by-step Commands:
  preprocess  filter and quality-trim reads
  index       build FM-index for a set of reads
  correct     correct sequencing errors in reads
  fmwalk      merge paired reads into long reads via FM-index walk
  filter      remove redundant reads from a data set
  overlap     compute overlaps between reads
  assemble    generate contigs from an assembly graph

Other Commands:
  merge       merge multiple BWT/FM-index files into a single index

Move the stride binary to a directory on your PATH.

There are commands for running the pipeline step by step, plus an all command that runs everything in one go.

Step-by-step commands

  preprocess  filter and quality-trim reads
  index       build FM-index for a set of reads
  correct     correct sequencing errors in reads 
  fmwalk      merge paired reads into long reads via FM-index walk
  filter      remove redundant reads from a data set
  overlap     compute overlaps between reads
  assemble    generate contigs from an assembly graph

The workflow runs from top to bottom, so error correction is performed before assembly.

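Help for the all command (-h is rejected as an invalid option, but the usage message is still printed; the documented flag is --help)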

$ stride all -h
all: invalid option -- 'h'
all: missing arguments

Usage: StriDe all [OPTION] ... READFILE (format controlled by -p) ...
Perform error correction, long-read generation, overlap, and assembly in one command.

Mandatory arguments:
  -r, --read-length=LEN    median read length (if there are multiple libraries, set to the max read length)
  -i, --insert-size=LEN    median insert size (if there are multiple libraries, set to the max insert size)

Optional arguments:
  -t, --thread=N           number of threads (default: 16)
  -p, --pe-mode=INT        1 - paired reads are separated with the first read in READS1 and the second
                               read in READS2 of the same library (default)
                           2 - paired reads are interleaved within a single file of the same library.
  -k, --kmer-size=N        length of kmer (default: 31)
  -c, --kmer-threshold=N   kmer frequency cutoff (default: 3)
  -m, --min-overlap=LEN    minimum reliable overlap length (default: read length * 0.8)
      --help               display this help and exit

How to run

Assembling paired-end FASTQ files.

stride all pair1.fastq pair2.fastq -r 300 -i 600 -t 22

-t　number of threads (default: 16)
-k　length of kmer (default: 31)
-r　median read length (if there are multiple libraries, set to the max read length)
-i　median insert size (if there are multiple libraries, set to the max insert size)
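When the same run has to be repeated over several samples, it can help to assemble the command line programmatically. The sketch below only builds the argument list. It uses the `all` subcommand and the `-r`, `-i`, `-t`, `-k` and `-m` flags exactly as listed in the help text above. The function names are hypothetical, and the integer rounding in `default_min_overlap` is an assumption, since the help only says "read length * 0.8".

```python
import shlex


def build_stride_all_command(reads, read_length, insert_size,
                             threads=16, kmer_size=None, min_overlap=None):
    """Return the argv list for `stride all`, mirroring the order used in this article."""
    cmd = ["stride", "all", *reads,
           "-r", str(read_length),
           "-i", str(insert_size),
           "-t", str(threads)]
    if kmer_size is not None:      # otherwise stride uses its default (31)
        cmd += ["-k", str(kmer_size)]
    if min_overlap is not None:    # otherwise stride uses read length * 0.8
        cmd += ["-m", str(min_overlap)]
    return cmd


def default_min_overlap(read_length):
    """Documented default of read length * 0.8; rounding down is an assumption."""
    return read_length * 8 // 10


cmd = build_stride_all_command(["pair1.fastq", "pair2.fastq"], 300, 600, threads=22)
print(shlex.join(cmd))
# stride all pair1.fastq pair2.fastq -r 300 -i 600 -t 22
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once stride is on the PATH.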

Contigs are written out for each phase.

$ seqkit stats *contigs.fa
file               format  type  num_seqs    sum_len  min_len   avg_len  max_len
phase1-contigs.fa  FASTA   DNA        859  4,166,566      253   4,850.5  213,625
phase2-contigs.fa  FASTA   DNA        458  4,037,285      283     8,815  387,148
phase3-contigs.fa  FASTA   DNA         93  3,900,512      300    41,941  387,148
phase4-contigs.fa  FASTA   DNA         75  3,892,572      303    51,901  387,148
StriDe-contigs.fa  FASTA   DNA         73  3,892,600      303  53,323.3  387,148
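The table above does not include N50, the metric used in the tool comparison. As a rough sketch, N50 can be computed from the contig lengths as follows. The helper names `fasta_lengths` and `n50` are illustrative, and the parser assumes plain, uncompressed FASTA.

```python
def fasta_lengths(lines):
    """Return the sequence length of each record in an iterable of FASTA lines."""
    lengths, current = [], None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L cover at least half of the total."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # no contigs at all


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

For one of the output files this would be called as `n50(fasta_lengths(open("StriDe-contigs.fa")))`.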

Multiple libraries can also be specified. See the GitHub repository for details.

https://github.com/ythuang0522/StriDe

Reference

Integration of string and de Bruijn graphs for genome assembly