HyPo: a fast and accurate draft genome polisher that scales to large genomes

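Genome assembly, which reconstructs a genome from the fragments (reads) produced by DNA sequencers, and its analysis for studying genetic variation between or within species are central to genomics. Third-generation (DNA) sequencers (TGS) such as Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) have given genomics fresh impetus by enabling high-quality genome assembly and analysis. TGS overcome the main limitation of second- or next-generation sequencers (NGS; e.g. Illumina), which arises from read lengths limited to a few hundred base pairs, by producing reads with an average length of tens of thousands of base pairs. This enables more contiguous assemblies, resolution of more repetitive elements, and revelation of larger structural variations [Roberts et al., 2013; Lee et al., 2016]. Very recently, a de novo assembly of the human genome was carried out using ultra-long nanopore reads (along with other complementary technologies to improve quality); it not only exceeded the continuity of the human reference genome (GRCh38) but also reconstructed a complete, telomere-to-telomere chromosome X [Miga et al., 2019]. However, in contrast to NGS short reads, which are more than 99% accurate, long reads have the drawback of a high error rate (> 10%) [Weirather et al., 2017; Jain et al., 2018]. Moreover, the error profile of long reads is biased towards insertions and deletions (indels) rather than substitutions, and among these, homopolymer indels are the more prominent [Weirather et al., 2017].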
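To cope with noisy long reads, assemblers usually resort to correcting errors before assembly, which is computationally expensive, particularly for large genomes [Sović et al., 2016; Fu et al., 2019; Zhang et al., 2019]. In addition, the quality of the assembled contigs is usually improved further by a final consensus-generation step. Recently, several fast assemblers have become available (miniasm [Li, 2016], Ra [Vaser and Šikić, 2019], wtdbg2 [Ruan and Li, 2019]). They run an order of magnitude faster, at the cost of more base-level errors (about 10 times as many as assemblers that use corrected reads). These fast assemblers rely solely on polishing for error correction. Importantly, because long-read assemblies have a high error rate dominated by indels, it is crucial to correct these errors, as they can critically affect protein prediction [Watson and Warr, 2019]. Polishing tools therefore play a key role in producing accurate long-read assemblies, especially those generated by fast assemblers that skip error correction.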
Polishers can be broadly classified as "Sequencer-bound" or "General". Sequencer-bound polishers require the raw signal-level information produced by a specific sequencer, so they can work only with reads from that sequencer. For ONT, NanoPolish [Loman et al., 2015] and its successor Medaka [Nanopore Technologies, 2019] fall into this category. Similarly, Quiver [Chin et al., 2013] and its successor Arrow [Laird Smith et al., 2016] are available for PacBio.

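General polishers, on the other hand, are robust enough to handle reads produced by any sequencer. Pilon [Walker et al., 2014] used to be the widely used general polisher for bacterial and small eukaryotic genomes. However, Pilon is being superseded by (or used in combination with) Racon [Vaser et al., 2017]. Because Racon is ultra-fast, it scales well, resource-wise, to large genomes. Several new polishers have appeared recently: wtpoa-cns (the stand-alone consensus module of wtdbg2), ntEdit [Warren et al., 2019], and Apollo [Firtina et al., 2019]. Every polisher except ntEdit relies on the alignment of reads to the draft (uncorrected) assembly. Pilon is based on the pileup of read bases at each base position of the draft contigs. Racon and wtpoa-cns split the contigs into smaller segments and, to be faster and more practical, use a single instruction, multiple data (SIMD) implementation of partial order alignment (POA) [Lee et al., 2002; Lee, 2003]. Apollo takes a machine-learning approach: it builds a profile hidden Markov model (pHMM) of the draft assembly [Firtina et al., 2018] and uses it to correct errors. In contrast to these read-to-assembly alignment approaches, ntEdit corrects errors by scanning the k-mers (sequences of length k) in the draft and checking for their presence or absence in a Bloom filter that stores the k-mers of the reads.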
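The k-mer presence check described for ntEdit can be pictured with a small sketch. It is an illustration only, not ntEdit's implementation: read k-mers are kept in an exact awk array instead of a Bloom filter, unsupported draft k-mers are only reported (no correction is attempted), and the function name `absent_kmers` and the one-sequence-per-line input files are assumptions made for this example.

```shell
#!/bin/sh
# absent_kmers K READS DRAFT   (hypothetical helper, for illustration)
# READS and DRAFT are plain-text files with one sequence per line; real tools
# read FASTA/FASTQ. Prints "<position> <kmer>" (1-based, per line) for every
# draft k-mer that never occurs in the reads.
absent_kmers() {
    awk -v k="$1" '
        # First file: record every k-mer seen in the reads. This exact set
        # stands in for the Bloom filter.
        NR == FNR {
            for (i = 1; i + k - 1 <= length($0); i++) seen[substr($0, i, k)] = 1
            next
        }
        # Second file: scan the draft and report k-mers with no read support.
        {
            for (i = 1; i + k - 1 <= length($0); i++) {
                kmer = substr($0, i, k)
                if (!(kmer in seen)) print i, kmer
            }
        }
    ' "$2" "$3"
}
```

A single wrong base in the draft makes every k-mer that overlaps it disappear from the read set, which is why a run of consecutive unsupported k-mers points at one likely error position. A Bloom filter answers the same membership question in far less memory, at the price of occasional false positives.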
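Each general polisher has its limitations. Pilon and ntEdit are designed mainly to work with highly accurate short reads. Moreover, as noted above, Pilon does not scale well to large genomes in terms of resources. Racon and wtpoa-cns, by contrast, are fast and scale to large genomes, and they can handle both short reads and noisy long reads. However, within a single run each of them can polish with either long reads only or short reads only. For better accuracy, both long-read polishing and short-read polishing are recommended. Apollo can use both types of reads in a single run, but it is very slow. For example, Apollo took about two and a half hours to polish an E. coli dataset using PacBio reads, whereas Racon took only about two minutes [Firtina et al., 2019]. Given its speed and comparatively good accuracy, Racon is currently the widely used polisher (we also confirm that Racon produces more accurate results overall than the other polishers).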
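Here we present HyPo, a Hybrid Polisher, which uses both short and long reads within a single run to polish long-read assemblies of small and large genomes. HyPo exploits unique genomic k-mers to selectively polish segments of contigs using POA of selected read segments. We show that HyPo produces highly accurate polished assemblies in about one third of the time and with about half the memory required by Racon.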

Installation

Tested on Ubuntu 18.04 LTS.

Dependencies

Mac OS X and Linux are currently supported.

Zlib
OpenMP
GCC (>=8), for filesystem support
The following commands update GCC on an Ubuntu machine (from, say, GCC 5):

sudo apt-get update
sudo apt-get install build-essential software-properties-common -y
sudo add-apt-repository ppa:ubuntu-toolchain-r/test -y
sudo apt update
sudo apt install gcc-snapshot -y
sudo apt update
sudo apt install gcc-8 g++-8 -y
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-8 60 --slave /usr/bin/g++ g++ /usr/bin/g++-8
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-5 60 --slave /usr/bin/g++ g++ /usr/bin/g++-5
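Before building, it can help to confirm that the compiler on the PATH really meets the GCC >= 8 requirement. The helper below is a minimal sketch: `version_at_least` is a name made up for this example, and it compares only the major and minor fields of a version string.

```shell
#!/bin/sh
# version_at_least HAVE WANT   (hypothetical helper, for illustration)
# Succeeds when the "major.minor" prefix of HAVE is >= that of WANT.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        split(have, h, "."); split(want, w, ".")
        # Compare the major field first; missing fields count as 0.
        if (h[1] + 0 != w[1] + 0) exit !(h[1] + 0 > w[1] + 0)
        # Majors are equal, so the minor field decides.
        exit !(h[2] + 0 >= w[2] + 0)
    }'
}

# Check the gcc found on the PATH, if there is one.
if command -v gcc >/dev/null 2>&1; then
    if version_at_least "$(gcc -dumpversion)" 8; then
        echo "gcc is new enough"
    else
        echo "gcc is older than 8; update it first" >&2
    fi
fi
```

The same comparison works for the HTSLIB requirement listed under the dependencies, for example `version_at_least 1.9 1.10` fails because the minor fields are compared as numbers rather than as text.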

KMC3

# Install KMC3 from Bioconda
conda install -c bioconda kmc

HTSLIB (version >=1.10)

HyPo itself (GitHub)

git clone --recursive https://github.com/kensung-lab/hypo hypo 
cd hypo
# This also installs HTSlib
chmod +x install_deps.sh
./install_deps.sh
mkdir build
cd build
cmake -DCMAKE_BUILD_TYPE=Release ..
make -j 8

> ./hypo

[Hypo::] Error: Invalid command: Too few arguments!

Usage: hypo <args>

** Mandatory args:

-r, --reads-short <str>

Input file name containing reads (in fasta/fastq format; can be compressed). A list of files containing file names in each line can be passed with @ prefix.

-d, --draft <str>

Input file name containing the draft contigs (in fasta/fastq format; can be compressed).

-b, --bam-sr <str>

Input file name containing the alignments of short reads against the draft (in bam/sam format; must have CIGAR information).

-c, --coverage-short <int>

Approximate mean coverage of the short reads.

-s, --size-ref <str>

Approximate size of the genome (a number; could be followed by units k/m/g; e.g. 10m, 2.3g).

** Optional args:

-B, --bam-lr <str>

Input file name containing the alignments of long reads against the draft (in bam/sam format; must have CIGAR information).

[Only Short reads polishing will be performed if this argument is not given]

-o, --output <str>

Output file name.

[Default] hypo_<draft_file_name>.fasta in the working directory.

-t, --threads <int>

Number of threads.

[Default] 1.

-p, --processing-size <int>

Number of contigs to be processed in one batch. Lower value means less memory usage but slower speed.

[Default] All the contigs in the draft.

-m, --match-sr <int>

Score for matching bases for short reads.

[Default] 5.

-x, --mismatch-sr <int>

Score for mismatching bases for short reads.

[Default] -4.

-g, --gap-sr <int>

Gap penalty for short reads (must be negative).

[Default] -8.

-M, --match-lr <int>

Score for matching bases for long reads.

[Default] 3.

-X, --mismatch-lr <int>

Score for mismatching bases for long reads.

[Default] -5.

-G, --gap-lr <int>

Gap penalty for long reads (must be negative).

[Default] -4.

-n, --ned-th <int>

Threshold for Normalised Edit Distance of long arms allowed in a window (in %). Higher number means more arms allowed which may slow down the execution.

[Default] 20.

-q, --qual-map-th <int>

Threshold for mapping quality of reads. The reads with mapping quality below this threshold will not be taken into consideration.

[Default] 2.

-i, --intermed

Store or use (if already exist) the intermediate files.

[Currently, only Solid kmers are stored as an intermediate file.].

-h, --help

Print the usage.

How to run

1. Mapping

# short reads
minimap2 -t 16 -ax sr draft_genome.fasta pair_*.fq.gz |samtools sort -@ 12 -O BAM - > short.bam

# long reads (ONT)
minimap2 -t 16 -ax map-ont draft_genome.fasta nanopore.fq.gz |samtools sort -@ 12 -O BAM - > long.bam
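The two mapping commands differ only in the minimap2 preset, so the choice can be written down in one place. The sketch below only prints a command rather than running it. The platform labels (`illumina`, `ont`, `pacbio`), the helper names, and the file name `pacbio.fq.gz` are assumptions made for this example; `map-pb` is minimap2's preset for PacBio CLR reads, which the commands above do not cover.

```shell
#!/bin/sh
# minimap2_preset PLATFORM   (hypothetical helper, for illustration)
# Prints the minimap2 -x preset matching a read type.
minimap2_preset() {
    case "$1" in
        illumina) echo sr ;;
        ont)      echo map-ont ;;
        pacbio)   echo map-pb ;;
        *)        echo "unknown platform: $1" >&2; return 1 ;;
    esac
}

# map_cmd PLATFORM DRAFT OUT_BAM READS...
# Prints (does not run) a mapping command in the same shape as the ones above.
map_cmd() {
    preset=$(minimap2_preset "$1") || return 1
    draft=$2; out=$3; shift 3
    echo "minimap2 -t 16 -ax $preset $draft $* | samtools sort -@ 12 -O BAM - > $out"
}

# Example: show the command for PacBio long reads.
map_cmd pacbio draft_genome.fasta long.bam pacbio.fq.gz
```

Printing the command first makes it easy to check the preset and file names before committing to a long mapping run.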

2. Create a text file listing the FASTQ file names

echo -e "pair_1.fq.gz\npair_2.fq.gz" > names.txt
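A typo in the list file is easy to miss, so it may be worth checking that every listed file is readable before starting a long run. This is a minimal sketch: `check_read_list` is a hypothetical helper, and it assumes the listed paths are resolved relative to the directory in which the command is run.

```shell
#!/bin/sh
# check_read_list LIST   (hypothetical helper, for illustration)
# Succeeds only if every non-empty line of LIST names a readable file.
check_read_list() {
    status=0
    # The "|| [ -n "$f" ]" part also handles a last line without a newline.
    while IFS= read -r f || [ -n "$f" ]; do
        if [ -z "$f" ]; then
            continue
        fi
        if [ ! -r "$f" ]; then
            echo "missing read file: $f" >&2
            status=1
        fi
    done < "$1"
    return $status
}
```

With the list created above, `check_read_list names.txt && echo "all read files found"` reports every missing file in one pass instead of stopping at the first.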

3. Polishing

Short reads only

hypo -d draft_genome.fasta -r @names.txt -s 3g -c 55 -b short.bam -p 96 -t 8 -o output.fa

Short and long reads

hypo -d draft_genome.fasta -r @names.txt -s 3g -c 55 \
-b short.bam -B long.bam -p 96 -t 8 \
-o output.fa

-t Number of threads [Default] 1.
-p Number of contigs to be processed in one batch. Lower value means less memory usage but slower speed. [Default] All the contigs in the draft.
-m Score for matching bases for short reads [Default] 5.
-d Input file name containing the draft contigs (in fasta/fastq format; can be compressed).
-r Input file name containing reads (in fasta/fastq format; can be compressed). A list of files containing file names in each line can be passed with @ prefix.
-b Input file name containing the alignments of short reads against the draft (in bam/sam format; must have CIGAR information).
-B Input file name containing the alignments of long reads against the draft (in bam/sam format; must have CIGAR information). [Only Short reads polishing will be performed if this argument is not given.]
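The values given to -s and -c only need to be approximate. If the short-read coverage is not known, one rough estimate is total read bases divided by genome size. The sketch below does that arithmetic. `size_to_bp` and `estimate_coverage` are hypothetical helpers, and reading k/m/g as 10^3/10^6/10^9 is an assumption of this example, not something taken from the HyPo source.

```shell
#!/bin/sh
# size_to_bp SIZE   (hypothetical helper, for illustration)
# Expands a size such as 10m or 2.3g to base pairs, assuming k/m/g mean
# 10^3/10^6/10^9.
size_to_bp() {
    awk -v s="$1" 'BEGIN {
        unit = tolower(substr(s, length(s), 1))
        mult = 1
        if (unit == "k") mult = 1e3
        else if (unit == "m") mult = 1e6
        else if (unit == "g") mult = 1e9
        # Strip the unit letter before converting to a number.
        if (mult > 1) s = substr(s, 1, length(s) - 1)
        printf "%.0f\n", s * mult
    }'
}

# estimate_coverage SIZE FASTQ...   (hypothetical helper)
# Prints total read bases divided by genome size, rounded to an integer.
# Assumes 4-line FASTQ records; files ending in .gz are decompressed on the fly.
estimate_coverage() {
    bp=$(size_to_bp "$1"); shift
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" ;;
            *)    cat "$f" ;;
        esac
    done | awk -v g="$bp" '
        NR % 4 == 2 { bases += length($0) }   # sequence line of each record
        END { printf "%.0f\n", bases / g }
    '
}
```

For the example above, `estimate_coverage 3g pair_1.fq.gz pair_2.fq.gz` would print a number suitable for -c. Counting every base of a large read set takes a while, so a figure already reported by the sequencing provider is just as good.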