ロングリードやcontig情報を使いスキャッホールドのギャップを埋める GMcloser

2019 9/4 インストール追記

NGSのリードやアセンブルしたコンティグを使い、スキャッホールドのギャップを埋めるツールがいくつか発表されているが、オーサーらは、これらのツールに起因するアセンブリのエラー率が、デノボアセンブルで起こるエラー率よりも20〜500倍高いことを指摘している。それをmotive forceとして、オーサーらは、アセンブルしたコンティグxまたはエラー修正されたPacBioのロングリードでこれらのギャップを正確に閉じるGMcloserを開発した。GMcloserはスキャッホールド、コンティグ、ペアエンドリードのアライメント統計から計算した尤度ベースの分類子を使用して、ギャップにコンティグやロングリードを正しく割り当て、正確かつ効率的なギャップクローズを行う。比較実験の結果から、GMcloserは他の利用可能なツールと同様の効率かつ3〜100倍精度が高いと主張されている。

テストデータ

https://sourceforge.net/projects/gmcloser/files/?source=navbar

インストール

依存

Perl 5.6 or later
MUMmer 3.23(MUMmer 3.22 is not allowed)
blast+ 2.2.18 or later
Bowtie 2 (tested on ver. 2.1.0)
YASS (yass) (tested on ver. 1.14)

YASSは下記リンクからバイナリをダウンロードできる。

YASS : genomic similarity search tool

名前はyassに修正して、パスの通ったディレクトリに移動しておく。他のツールも、bowtie2、nucmer、blastn、で$PATHに定義されている必要がある。

本体　SourceForge

#bioconda (link)
conda install gmcloser -c bioconda -y

> gmcloser -h

$ gmcloser -h

Usage:

GMcloser ver. 1.5

Options:

--target_scaf or -t <STR> input target scaffold fasta file (e.g., scaf.fa) [mandatory]

--query_seq or -q <STR> input query contig (or long-read) fasta file (e.g., contig.fa) (if long reads are used, -lr option must be specified) [mandatory]

--prefix_out or -p <STR> prefix name of output files [mandatory]

--read_file or -r <STR> space-separated list of fastq or fasta files (or gzip compressed files) of paired-end reads (e.g., -r read_pe1.fq read_pe2.fq)

--read_len or -l <INT> read length of paired-end reads specified with the -r, -st, -sq, or -sd option (mean read length if read lengths are variable) [mandatory]

--insert or -i <INT> average insert size of paired-end reads [>read_len <20001, default: 400]

--sd_insert or -d <INT> standard deviation of insert size of paired-end reads [default: 40]

--read_format or -f <STR> fastq or fasta [default: fastq]

--sam_talign or -st <STR> space-separated list of sam alignment file(s) for target scaffolds, created in a single-end read alignment mode for paired-end reads (e.g., -sa tPE1.sam tPE2.sam, for paired-end read files PE1.fq and PE2.fq)

--sam_qalign or -sq <STR> space-separated list of sam alignment file(s) for query contigs, created in a single-end read alignment mode for paired-end reads (e.g., -sa qPE1.sam qPE2.sam, for paired-end read files PE1.fq and PE2.fq)

--sam_dir or -sd <STR> path of directory (i.e., bowtie_align) containing sam alignment files generated from a previous job with GMcloser (this can be used in place of -st and -sq option)

--align_file or -a <STR> Nucmer alignment file for query against target [optional]

--connect_subcon or -c connect between gap-encompassing subcontig pairs with their original (not merged with query contigs) termini [default: false]

--extend or -et extend scaffold temini with aligned query sequences [default: false] (When using long read query, this option is disabled in the current version)

--blast or -b conduct alignment between target and query contigs with BLASTn [default: false] (Nucmer alignment by default)

--min_match_len or -mm <INT> minimum overlap length to be filtered for Nucmer alignments.

Contig-alignments that satisfy both the values specified with -mm and -mi are selected, irrespective of any mapping rates of PE-reads. [>49, default: 300]

--min_identity or -mi <INT> minimum overlap identity (%) to be filtered for Nucmer alignments. Alignments are selected by combination with -mm option. [95~100, default: 99]

--min_len_local or -ml <INT> minimum overlap length for merging between neighbor subcontigs with YASS aligner [>14, default: 20]

--min_subcon or -ms <INT> minimum length of subcontigs, to be used for gapclosing [default: 100]

--min_gap_size or -g <INT> minimum length of gap, when spliting the target scaffold sequences into subcontigs [>0, default: 1]

--max_indel or -is <INT> maximum length of indel, observed in alignments between target subcontigs and query contigs. The alignments separated by the indel will be merged. [default: 70]

--max_qsc or -qsc <INT> maximum alignment coverage (%) of query singletones for target subcontigs (query with >= INT is excluded from query singletone output) [default: 60]

--base_qual or -bq <STR> base call quality format of fastq read file; illumina (phred64) or sanger (phred33) [default: auto]

--nuc_len or -nl <INT> nucmer exact match length, a value specified with '-l' option of the Nucmer aligner [default: auto, increased from 30 to 50 depending on the total contig length]

--ad_score or -as <FLOAT> score to add to (subtract from) the standard threshold score for selection of correct contig-subcontig alignments (e.g., 1 or -1) [default: 0]

--hetero or -ht heterozygosity factor (specify this if your sequenced genome is heterozygous (>0.2% difference of the haploid size)) [default: false]

--thread or -n <INT> number of threads (for machines with multiple processors), enabling all the alignment processes in parallel [default: 5]

--thread_connect or -nc <INT> number of threads (for machines with multiple processors), enabling the subcontig-connection process in parallel [default: number specified with --thread]

--help or -h output help message

[Options for long read query]

--long_read or -LR query sequence file is a fasta file of long reads (PacBio reads must be error-corrected) [default: false] (alignment is peformed with blast)

--lr_cov or -lc <INT> fold coverage of long reads for target scaffolds [default: auto ; automatically calculated by dividing a total length of query by a total length of target]

--min_qalign or -mq <INT> minimum number of queries that are aligned to either 5'- or 3'-terminus of a target subcontig [default: 1]

--iterate or -it <INT> number of iteration [default: 3]

--alignq or -aq <STR> BLASTn alignment file for query against query [optional]

> gmvalue -h

$ gmvalue -h

Usage:

GMvalue ver. 1.3

Usage: gmvalue [contig|subcon|scaf] [options]

(see the details of options for each command by e.g., gmval contig -h)

contig find misassemblies in contigs and/or output an error-free contig set where misassembled contigs are corrected

subcon find misassemblies in (sub)contigs that were split at gaps in scaffolds and/or output an error-free subcontig set

scaf find misassemblies (mislinks between contigs and misassemblies within contigs) in scaffilds and/or output an error-free scaffold set

ラン

上記リンクからテストデータをダウンロードし、テストランを行う。Rhodobacter sphaeroidesのデータとなる。NGSのリード、NNNありのscaffolds、アセンブルしたコンティグ、エラー補正したロングリードが収納されている。

gmcloser -t Sample_data/scaffolds.fasta -q Sample_data/contigs.fasta -r Sample_data/pe300-40x_1.fastq Sample_data/pe300-40x_2.fastq -l 100 -i 300 -d 40 -c -n 4 -p sample

scaffoldsやcontigについては、PDFマニュアルの22ページに記載されている。ランが終わると、ギャップがクローズされたスキャッホールドが出力される。

リファレンスゲノムと比較して、結果を評価する。

gmvalue subcon -r Sample_data/Rhodobacter.genome.fasta -q sample.gapclosed.fa

終わるとout.misassemble.list.txtとout.stat.txtが出力される。

> cat out.stat.txt

Number of contigs: 1545

Total number of misassembled contigs = 1 (0%)

Total number of truely assembled contigs = 1544 (99.9%)

Total number of misassembled events = 1

Number of unmapped contigs = 0 (0%)

Misassembly events on the same chromosome (break point distance > 100 bp, < 1000 bp): 0

Misassembly events (Relocation) on the same chromosome (break point distance >= 1000 bp): 0

Misassembly events (Translocation): 0

Misassembly events (Inversion): 0

Misassembly events (alignment coverage < 99%): 1

Medium indels (> 5 bp and <= 100 bp): 6

Total contig length: 4351438

Contig N50: 5546

Corrected contig N50: 5546

Total scaffold length: 4728132

Gap length: 376694 (7.9%)

> cat out.stat.txt

$ cat /Users/user/Downloads/GMcloser-1.6.1/out.misassemble.list.txt

scaffold18/20 1 5891 5993 CP000143 local-misassembly

ロングリードを使ったやり方はPDFマニュアルで確認してください。

引用

GMcloser: closing gaps in assemblies accurately with a likelihood-based selection of contig or long-read alignments.

Kosugi S1, Hirakawa H1, Tabata S1.

Bioinformatics. 2015 Dec 1;31(23):3733-41.

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

ロングリードやcontig情報を使いスキャッホールドのギャップを埋める GMcloser