NOVOPlasty: targeted assembly of organelle genomes

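Advances in next-generation sequencing (NGS) have produced many assembly algorithms, but few of them focus on assembling organelle genomes. These genomes are used in phylogenetic studies and food identification, and they are the most numerous among the eukaryotic genomes deposited in GenBank. Assembling organelle genomes from whole genome sequencing (WGS) data is the most accurate and least laborious approach, yet no tool had been designed specifically for this task at the time the paper was written. The authors developed a seed-and-extend algorithm that assembles organelle genomes from WGS data. It was tested on whole-genome Illumina datasets, both new (Gonioctena intermedia and Avicennia marina) and public (Arabidopsis thaliana and Oryza sativa), and it outperformed known assemblers in assembly accuracy and coverage. In the benchmark, NOVOPlasty assembled every tested circular genome within 30 minutes, with a maximum memory requirement of 16 GB and an accuracy above 99.99%. The authors conclude that NOVOPlasty is the only de novo assembler that extracts the extranuclear genome from WGS data quickly and easily. The software is open source and can be downloaded from https://github.com/ndierckx/NOVOPlasty.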

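NOVOPlasty is a seed-extend based assembler, similar to string overlap algorithms such as SSAKE and VCAKE. It starts by storing the sequences in a hash table. The assembly has to be initiated by a seed, which is extended iteratively in both directions. The seed sequence is not used directly to start the assembly; instead, it is used to retrieve one sequence read of the target genome from the NGS dataset. This strategy allows a wider range of seed inputs without incorporating mismatches into the assembly. The seed can be a single sequencing read, a conserved gene, or even a complete organelle genome from a distant species. The end and the start of the seed are scanned against the hash table for overlapping reads, which are stored separately. All putative extensions are identified and then cross-checked against their paired reads to confirm that they are placed correctly. Relatively similar sequences are grouped, and each base extension is resolved by a consensus among the overlapping reads. When more than one consensus extension is possible (that is, when more than one group is of sufficient size), the assembly is split and two new contigs are created. Unlike most assemblers, NOVOPlasty does not try to assemble every read; it extends the given seed until a circular genome is formed. The assembly is circularized when its length falls within the expected range and both ends overlap by at least 200 bp. If a repetitive region is detected, circularization is postponed until the assembly has left that region. Because whole-genome data usually has high coverage of extranuclear sequences, the algorithm can extend a single read into a complete circular genome (Fig. 2 of the first paper cited below).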
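The seed-and-extend idea can be illustrated with a toy sketch. This is not NOVOPlasty's code: the function names `build_index` and `extend_seed` and all parameter values are assumptions made for illustration. The sketch extends in one direction only, skips the paired-read cross-check and the grouping of similar reads, and resolves each base by a simple majority vote.

```python
from collections import Counter, defaultdict


def build_index(reads, k):
    """Hash table: each k-mer -> the bases that follow it in the reads."""
    index = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k):  # stop early so a base always follows
            index[read[i:i + k]].append(read[i + k])
    return index


def extend_seed(seed, index, k, min_len, max_len, overlap):
    """Extend `seed` to the right until the contig circularizes or stalls.

    Returns (contig, circular). All parameters are toy values chosen by the
    caller, not NOVOPlasty defaults. The seed must be at least k bases long.
    """
    contig = seed
    while len(contig) < max_len + overlap:
        votes = Counter(index.get(contig[-k:], []))
        if not votes:
            return contig, False  # no overlapping read: extension stalls
        contig += votes.most_common(1)[0][0]  # consensus (majority) base
        genome_len = len(contig) - overlap
        # Circularize when the length is in range and both ends overlap.
        if genome_len >= min_len and contig[-overlap:] == contig[:overlap]:
            return contig[:-overlap], True  # trim the duplicated end
    return contig, False
```

With reads that tile a small circular sequence, calling `extend_seed` on any single read should walk around the circle once and return the full sequence with `circular` set to `True`. The end-overlap test mirrors the 200 bp rule described above, with a much smaller overlap for toy data.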
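Besides its new linear approach to assembly, NOVOPlasty incorporates case-based adjustments to produce higher-quality assemblies. Problematic regions caused by sequencing errors or contaminating genomic elements are detected automatically and resolved where possible by adjusting parameters and launching an appropriate strategy. Current Illumina sequencing technology (Illumina HiSeq and MiSeq), for example, has a very high error rate after long single-nucleotide repeat (SNR) stretches, which makes the subsequent sequence unreliable (ref. 16). These error-prone regions can interrupt contig assembly, which is especially problematic for this linear assembly strategy. A key strategy for ensuring continuity is to detect SNR stretches early and discard the most erroneous reads before the consensus is built. Defining the exact length of an SNR stretch is not straightforward, because SNR length varies strongly between individual reads and quality scores are low overall. When the consensus cannot be resolved, NOVOPlasty uses paired-end information to "jump" over the problematic region and resumes the assembly from there. Because the SNR region is approached from both directions, the highly erroneous parts of the reads can be avoided. Another major problem in short-read assembly is complex repetitive sequence. Beetle mitochondrial genomes often contain a long, highly repetitive section (ref. 17), whereas chloroplast genomes can contain shorter, dispersed repetitive DNA (ref. 18). When NOVOPlasty encounters a repetitive region, it determines the repeat sequence and the upstream flanking region. All reads that start in the sequence before the repeat are set aside for further analysis. If the repeat region is shorter than the read length, its assembly can be resolved directly. Otherwise, the algorithm temporarily sets very strict parameters to search for small variations between the repeat copies that can serve as points of reference. If the region cannot be resolved, the assembly is terminated and rebooted as a new contig from the sequence that follows the repeat.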
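To make the SNR handling concrete, here is a minimal sketch of detecting single-nucleotide runs and dropping reads that disagree on run length before a consensus is built. It is an illustration only. The threshold of 8 bases, the function names, and the rule "keep the reads with the most common run length" are assumptions, not the criteria NOVOPlasty applies.

```python
import re
from collections import Counter


def find_snr(seq, min_len=8):
    """Return (start, end, base) for every single-nucleotide run >= min_len."""
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq)]


def first_internal_run(read, min_len=8):
    """Length of the first SNR lying fully inside the read, else None."""
    for start, end, _ in find_snr(read, min_len):
        if start > 0 and end < len(read):  # skip runs cut off by a read edge
            return end - start
    return None


def keep_modal_reads(reads, min_len=8):
    """Keep reads whose first internal SNR has the most common length."""
    lengths = [first_internal_run(r, min_len) for r in reads]
    observed = [n for n in lengths if n is not None]
    if not observed:
        return list(reads)  # no SNR seen: nothing to filter
    modal = Counter(observed).most_common(1)[0][0]
    return [r for r, n in zip(reads, lengths) if n == modal]
```

Runs that touch a read edge are ignored because their true length is unknown, which is one simple way to cope with the strong length variation between reads mentioned above.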

wiki

https://github.com/ndierckx/NOVOPlasty/wiki

Installation

Tested on macOS 10.14.

Main program (GitHub)

git clone https://github.com/ndierckx/NOVOPlasty.git
perl NOVOPlasty/NOVOPlasty3.8.3.pl

$ perl NOVOPlasty3.8.3.pl

-----------------------------------------------

NOVOPlasty: The Organelle Assembler

Version 3.8.3

-----------------------------------------------

Error:Can't open the configuration file, please check the manual!

Usage: perl NOVOPlasty3.8.3.pl -c config.txt

Usage

A run is driven by a config file, which records, among other things, the paths to the FASTQ files and the reference. The input reads must be raw FASTQ or gz/bz2-compressed FASTQ.
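Before a run it can help to confirm that the read files open and contain what you expect. The sketch below is a generic helper and not part of NOVOPlasty. The function names are hypothetical, the compression type is guessed from the file extension, and the read count assumes the common layout of four lines per FASTQ record.

```python
import bz2
import gzip


def open_reads(path):
    """Open a FASTQ file as text, whether raw, gzip- or bzip2-compressed."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    if path.endswith(".bz2"):
        return bz2.open(path, "rt")
    return open(path, "r")


def count_reads(path):
    """Count records, assuming four lines per FASTQ record."""
    with open_reads(path) as handle:
        return sum(1 for _ in handle) // 4
```

For paired-end data, calling `count_reads` on the forward and reverse files and comparing the two numbers is a quick check that the pair is complete.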

git clone https://github.com/ndierckx/NOVOPlasty.git
cp NOVOPlasty/config.txt .

$ cat config.txt

Project:

-----------------------

Project name = Test

Type = mito

Genome Range = 12000-22000

K-mer = 39

Max memory =

Extended log = 0

Save assembled reads = no

Seed Input = /path/to/seed_file/Seed.fasta

Extend seed directly = no

Reference sequence = /path/to/reference_file/reference.fasta (optional)

Variance detection =

Chloroplast sequence = /path/to/chloroplast_file/chloroplast.fasta (only for "mito_plant" option)

Dataset 1:

-----------------------

Read Length = 151

Insert size = 300

Platform = illumina

Single/Paired = PE

Combined reads =

Forward reads = /path/to/reads/reads_1.fastq

Reverse reads = /path/to/reads/reads_2.fastq

Heteroplasmy:

-----------------------

MAF =

HP exclude list =

PCR-free =

Optional:

-----------------------

Insert size auto = yes

Insert Range = 1.9

Insert Range strict = 1.3

Use Quality Scores = no

Project:

-----------------------

Project name = Choose a name for your project, it will be used for the output files.

Type = (chloro/mito/mito_plant) "chloro" for chloroplast assembly, "mito" for mitochondrial assembly and

"mito_plant" for mitochondrial assembly in plants.

Genome Range = (minimum genome size-maximum genome size) The expected genome size range of the genome.

Default value for mito: 12000-20000 / Default value for chloro: 120000-200000

If the expected size is known, you can narrow the range; this can be useful when there is a repetitive

region, which could lead to a premature circularization of the genome.

K-mer = (integer) This is the length of the overlap between matching reads (Default: 33).

If reads are shorter than 90 bp or you have low coverage data, this value should be decreased down to 23.

For reads longer than 101 bp, this value can be increased, but this is not necessary.

Max memory = You can choose a max memory usage, suitable to automatically subsample the data or when you have limited

memory capacity. If you have sufficient memory, leave it blank, else write your available memory in GB

(if you have for example a 8 GB RAM laptop, put down 7 or 7.5 (don't add the unit in the config file))

Extended log = Prints out a very extensive log, could be useful to send me when there is a problem (0/1).

Save assembled reads = All the reads used for the assembly will be stored in separate files (yes/no)

Seed Input = The path to the file that contains the seed sequence.

Extend seed directly = This gives the option to extend the seed directly, instead of finding matching reads. Only use this when your seed

originates from the same sample and there are no possible mismatches (yes/no)

Reference (optional) = If a reference is available, you can give here the path to the fasta file.

The assembly will still be de novo, but references of the same genus can be used as a guide to resolve

duplicated regions in the plant mitochondria or the inverted repeat in the chloroplast.

References from a different genus haven't been tested yet.

Variance detection = If you select yes, you should also have a reference sequence (previous line). It will create a vcf file

with all the variances compared to the given reference (yes/no)

Chloroplast sequence = The path to the file that contains the chloroplast sequence (Only for mito_plant mode).

You have to assemble the chloroplast before you assemble the mitochondria of plants!

Dataset 1:

-----------------------

Read Length = The read length of your reads.

Insert size = Total insert size of your paired end reads, it doesn't have to be accurate but should be close enough.

Platform = illumina/ion - The performance on Ion Torrent data is significantly lower

Single/Paired = For the moment only paired end reads are supported.

Combined reads = The path to the file that contains the combined reads (forward and reverse in 1 file)

Forward reads = The path to the file that contains the forward reads (not necessary when there is a merged file)

Reverse reads = The path to the file that contains the reverse reads (not necessary when there is a merged file)

Heteroplasmy:

-----------------------

MAF = (0.007-0.49) Minor Allele Frequency: If you want to detect heteroplasmy, first assemble the genome without this option. Then give the resulting

sequence as a reference and as a seed input. And give the minimum minor allele frequency for this option

(0.01 will detect heteroplasmy of >1%)

HP exclude list = Option not yet available

PCR-free = (yes/no) If you have a PCR-free library write yes

Optional:

-----------------------

Insert size auto = (yes/no) This will finetune your insert size automatically (Default: yes)

Insert Range = This variation on the insert size, could lower it when the coverage is very high or raise it when the

coverage is too low (Default: 1.9).

Insert Range strict = Strict variation to resolve repetitive regions (Default: 1.3).

Use Quality Scores = It will take in account the quality scores, only use this when reads have low quality, like with the

300 bp reads of Illumina (yes/no)
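Since every setting is a "key = value" line, the file is easy to sanity-check before launching a run. The following sketch is not NOVOPlasty's own parser. The function names and the specific checks are assumptions based only on the field descriptions above. Because the template repeats each key in its explanatory section, the sketch keeps the first occurrence of each key.

```python
def parse_config(text):
    """Collect 'key = value' pairs, keeping the first occurrence of each key."""
    settings = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # section titles, separators, wrapped explanations
        key, _, value = line.partition("=")
        settings.setdefault(key.strip(), value.strip())
    return settings


def check_settings(settings):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    if settings.get("Type") not in ("chloro", "mito", "mito_plant"):
        problems.append("Type should be chloro, mito or mito_plant")
    low, _, high = settings.get("Genome Range", "").partition("-")
    if not (low.isdigit() and high.isdigit() and int(low) < int(high)):
        problems.append("Genome Range should look like MIN-MAX")
    kmer = settings.get("K-mer", "")
    read_len = settings.get("Read Length", "")
    if not kmer.isdigit():
        problems.append("K-mer should be an integer")
    elif read_len.isdigit() and int(kmer) >= int(read_len):
        problems.append("K-mer should be shorter than the read length")
    return problems
```

A typical use would be `check_settings(parse_config(open("config.txt").read()))`, printing any messages returned before starting the assembly.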

Run NOVOPlasty with the config file specified.

perl NOVOPlasty/NOVOPlasty3.8.3.pl -c config.txt

Citation

NOVOPlasty: de novo assembly of organelle genomes from whole genome data

Nicolas Dierckxsens, Patrick Mardulyn, Guillaume Smits
Nucleic Acids Research, Volume 45, Issue 4, 28 February 2017, Page e18

Unraveling heteroplasmy patterns with NOVOPlasty

Nicolas Dierckxsens, Patrick Mardulyn, Guillaume Smits
NAR Genomics and Bioinformatics, Volume 2, Issue 1, March 2020, lqz011
