Apollo: a draft assembly polishing tool that supports multiple sequencing technologies

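Third-generation sequencing technologies can sequence long reads containing as many as 900K base pairs (bp). These long reads are used to construct an assembly (i.e., the genome of the subject). Unfortunately, third-generation sequencing technologies have high sequencing error rates, and a large fraction of the base pairs in these long reads is misidentified. These errors propagate into the assembly and reduce the accuracy of genome analysis. Assembly polishing algorithms minimize this error propagation by polishing, that is, correcting, errors in the assembly using information from the alignments between the reads and the assembly (read-to-assembly alignment information). However, currently available assembly polishing algorithms can polish an assembly only with reads from a specific sequencing technology, or only for small genomes. This dependence on technology and genome size prevents state-of-the-art assembly polishing algorithms from either (1) using all available read sets from multiple sequencing technologies or (2) polishing large genomes (e.g., the human genome).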
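Apollo is a scalable, universal assembly polishing algorithm that can polish assemblies of any size (both large and small genomes) using reads from all sequencing technologies (both second- and third-generation). The authors' goal is to provide a single algorithm that polishes large genomes to improve assembly accuracy and that can use read sets from all sequencing technologies. Apollo (1) models the assembly as a profile hidden Markov model (pHMM), (2) uses the read-to-assembly alignments to train the pHMM with the Forward-Backward algorithm, and (3) decodes the trained model with the Viterbi algorithm to produce the polished assembly. In experiments with real read sets, (1) using reads from multiple sequencing technologies produced a more accurate assembly than using reads from a single sequencing technology, and (2) Apollo performed as well as or better than competing state-of-the-art algorithms in terms of accuracy, even when polishing with reads from a single sequencing technology.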

Installation

Tested on Ubuntu 16.04.

Source: GitHub

git clone https://github.com/CMU-SAFARI/Apollo.git
cd ./Apollo
make -j 8
cd ./bin

> ./apollo

$ ./apollo
Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
================================================================================================
    Try 'Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm --help' for
    more information.

VERSION
    Last update: July 2019
    Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm version: 1.1
    SeqAn version: 2.4.0

> ./apollo -h

$ ./apollo -h
Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm
================================================================================================

SYNOPSIS

DESCRIPTION

OPTIONS
    -h, --help
          Display the help message.
    --version
          Display version information.
    -a, --assembly INPUT_FILE
          The fasta file which contains the assembly
    -r, --read List of INPUT_FILE's
          A fasta file which contains a set of reads that are aligned to the assembly.
    -m, --alignment List of INPUT_FILE's
          {s,b}am file which contains alignments of the set of reads to the assembly.
    -o, --output OUTPUT_FILE
          Output file to write the polished (i.e., corrected) assembly.
    -q, --mapq INTEGER
          Minimum mapping quality for a read-to-assembly alignment to be used in assembly polishing. Note that if the
          aligner reports multiple alignments for a read, then it may be setting mapping qualities of multiple
          alignments as 0. In range [0..255]. Default: 0.
    -f, --filter INTEGER
          Filter size that allows calculation of at most "f" many most probable transitions in each time step. This
          parameter is directly proportional to running time. In range [1..inf]. Default: 100.
    -v, --viterbi-filter INTEGER
          Filter size for the Viterbi algorithm that allows calculation of at most "vf" many most probable states in
          each time step. This parameter is directly proportional to running time. In range [1..inf]. Default: 5.
    -i, --maxi INTEGER
          Maximum number of insertions in a row. This parameter is directly proportional to the running time. In range
          [0..inf]. Default: 3.
    -d, --maxd INTEGER
          Maximum number of deletions in a row. This parameter is directly proportional to the running time. In range
          [0..inf]. Default: 10.
    -tm, --mtransition DOUBLE
          Initial transition probability to a match state. See --itransition as well. In range [0..1]. Default: 0.85.
    -ti, --itransition DOUBLE
          Initial transition probability to an insertion state. Note that the deletion transition probability equals
          to: (1 - (matchTransition + insertionTransition)). In range [0..1]. Default: 0.1.
    -df, --dfactor DOUBLE
          Factor for the polynomial distribution to calculate the each of the probabilities to delete 1 to "d" many
          basepairs. Note that unless "df" is set 1, the probability of the deleting k many characters will always
          going to be different than deleting n many characters where 0<k<n<"d". A higher "df" value favors less
          deletions. In range [0.001..inf]. Default: 2.5.
    -em, --memission DOUBLE
          Initial emission probability of a match to a reference. Note that: mismatch emission probability equals to:
          ((1-matchEmission)/3). In range [0..1]. Default: 0.97.
    -b, --batch INTEGER
          Number of consecutive basepairs that Viterbi decodes per thread. Setting it to zero will decode the entire
          contig with a single thread. In range [0..inf]. Default: 5000.
    -t, --thread INTEGER
          Maximum number of threads to use. In range [1..inf]. Default: 1.
    -n, --no-verbose
          Apollo runs quitely with no informative output

VERSION
    Last update: July 2019
    Apollo: A Sequencing-Technology-Independent, Scalable, and Accurate Assembly Polishing Algorithm version: 1.1
    SeqAn version: 2.4.0
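
The help text gives two derived quantities only as formulas: the initial deletion transition probability, 1 - (matchTransition + insertionTransition), and the mismatch emission probability, (1 - matchEmission) / 3. The small sketch below just evaluates those two formulas for a given set of option values. The function name `derived_probs` is my own and is not part of Apollo, and the calculation relies only on the wording of the help text, not on the source code.

```shell
# Evaluate the two formulas quoted in the help text.
# Arguments: match transition (-tm), insertion transition (-ti), match emission (-em).
# "derived_probs" is a hypothetical helper, not an Apollo command.
derived_probs() {
    awk -v tm="$1" -v ti="$2" -v em="$3" 'BEGIN {
        # deletion transition = 1 - (matchTransition + insertionTransition)
        printf "deletion_transition=%.4f\n", 1 - (tm + ti)
        # mismatch emission = (1 - matchEmission) / 3, i.e. shared by the 3 other bases
        printf "mismatch_emission=%.4f\n", (1 - em) / 3
    }'
}

# Defaults from the help text: -tm 0.85, -ti 0.1, -em 0.97
derived_probs 0.85 0.1 0.97
# deletion_transition=0.0500
# mismatch_emission=0.0100
```

So with the defaults, about 5% of the initial transition probability is left for deletions, and each of the three mismatching bases starts with an emission probability of 1%.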

Usage

Specify the reads, the draft assembly FASTA, and the BAM file obtained by mapping the reads to that draft assembly. Reads are supported only in uncompressed FASTA format.
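
Since reads have to be uncompressed FASTA, reads that come as (gzipped) FASTQ need converting first. Below is a minimal sketch using only gzip and awk. It assumes the usual four-lines-per-record FASTQ layout (no wrapped sequence lines); the function name `fastq_to_fasta` and the file names are placeholders of mine.

```shell
# Convert a FASTQ file (plain or .gz) to uncompressed FASTA.
# Assumes 4 lines per record: header, sequence, "+", qualities.
# "fastq_to_fasta" is a hypothetical helper name.
fastq_to_fasta() {
    src="$1"
    dst="$2"
    case "$src" in
        *.gz) gzip -dc "$src" ;;   # decompress to stdout
        *)    cat "$src" ;;
    esac | awk '
        NR % 4 == 1 { print ">" substr($0, 2) }   # header line: swap leading "@" for ">"
        NR % 4 == 2 { print }                     # sequence line, copied as is
    ' > "$dst"
}

# Example with placeholder file names:
# fastq_to_fasta pacbio.fastq.gz pacbio.fasta
```

Dedicated tools can do the same conversion. The awk version is shown only because it needs nothing extra installed.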

PacBio

#mapping
minimap2 -x map-pb -a assembly.fasta pacbio.fasta | samtools sort -@ 12 > alignment.bam
#indexing
samtools index -@ 12 alignment.bam

#polishing
apollo -a assembly.fasta -r pacbio.fasta -m alignment.bam -o polished.fasta

Hybrid (ONT + Illumina)

Paired-end reads are currently not supported, so either map the two mates separately or concatenate them into a single file (the read IDs must be unique).
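
If the two mate files use identical read names, simply concatenating them produces duplicate IDs. One possible way around this is sketched below: append /1 and /2 to the read names while concatenating, then check that no name occurs twice. This assumes both inputs are already uncompressed FASTA and that the mates do not already carry such suffixes. `merge_mates`, `duplicate_ids`, and the input file names are placeholders of mine.

```shell
# Concatenate two mate FASTA files, appending /1 and /2 to the read name
# (the first whitespace-delimited token of each header line).
# "merge_mates" and "duplicate_ids" are hypothetical helper names.
merge_mates() {
    r1="$1"; r2="$2"; out="$3"
    {
        awk '/^>/ { $1 = $1 "/1" } { print }' "$r1"
        awk '/^>/ { $1 = $1 "/2" } { print }' "$r2"
    } > "$out"
}

# Print read names that occur more than once; no output means all IDs are unique.
duplicate_ids() {
    awk '/^>/ { print $1 }' "$1" | sort | uniq -d
}

# Example with placeholder input names:
# merge_mates illumina_R1.fasta illumina_R2.fasta interleave_reads.fasta
# duplicate_ids interleave_reads.fasta
```

The output is R1 followed by R2 rather than truly interleaved, which is fine here because the reads are mapped as single-end anyway.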

#mapping:short reads
minimap2 -x sr -a assembly.fasta interleave_reads.fasta | samtools sort -@ 12 > short_alignment.bam
samtools index -@ 12 short_alignment.bam
#mapping:long reads
minimap2 -x map-ont -a assembly.fasta ont_reads.fasta | samtools sort -@ 12 > long_alignment.bam
samtools index -@ 12 long_alignment.bam

#polishing
apollo -a assembly.fasta -r interleave_reads.fasta -r ont_reads.fasta -m short_alignment.bam -m long_alignment.bam -t 30 -o polished.fasta
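
As a quick sanity check after polishing (my own addition, not part of the Apollo documentation), the number of sequences and total bases can be compared between the draft and the polished FASTA. I would expect the contig count to stay the same and the total length to change only slightly, but that is an assumption on my part, not documented behaviour. `fasta_stats` is a placeholder name.

```shell
# Print "<number of sequences> <total bases>" for a FASTA file.
# "fasta_stats" is a hypothetical helper name.
fasta_stats() {
    awk '
        /^>/ { n++; next }            # header line: count one sequence
             { bases += length($0) }  # sequence line: add its length
        END  { print n + 0, bases + 0 }
    ' "$1"
}

# Example with the file names used above:
# fasta_stats assembly.fasta
# fasta_stats polished.fasta
```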