ECCsplorer: a pipeline for detecting extrachromosomal circular DNA (eccDNA)

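Extrachromosomal circular DNAs (eccDNAs) are ring-shaped DNA structures that are physically separated from the chromosomes, ranging in size from about 100 bp to several megabases. Besides tandemly repeated DNA, eccDNAs may carry gene copies and recently activated transposable elements. eccDNAs occur in all eukaryotes investigated so far and are thought to be involved in stress, cancer, and aging, which has made them a major target of recent research. That research is still limited, however, by a lack of computational tools.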

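This work presents ECCsplorer, a bioinformatics pipeline that uses next-generation sequencing to detect eccDNAs in any kind of organism or tissue. Following Illumina sequencing of amplified circular DNA (circSeq), ECCsplorer enables easy, automated discovery of eccDNA candidates. The data analysis consists of two major procedures. First, read mapping to a reference genome detects informative read distributions such as high coverage, discordant mapping, and split reads. Second, a reference-free comparison of read clusters from amplified eccDNA against control sample data reveals specifically enriched DNA circles. The two modules can be run separately or jointly, depending on the aim and on which data are available. To illustrate the broad applicability of the approach, the authors analyzed semi-artificial and published circSeq data from the model organisms Homo sapiens and Arabidopsis thaliana, and generated circSeq reads from the non-model crop Beta vulgaris. eccDNA candidates were clearly identified in all datasets, with and without a reference genome. The pipeline specifically detected mitochondrial mini-circles and retrotransposon activation, demonstrating its sensitivity and specificity.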
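To make the "discordant mapping" signal more concrete, here is a minimal Python sketch. It is not ECCsplorer's implementation; the `ReadPair` record and the function names are hypothetical and exist only for illustration. The idea: on a linear reference a proper pair faces inward (forward mate on the left, reverse mate on the right), whereas a pair spanning the junction of a circle appears flipped, with the reverse mate mapping upstream of the forward mate.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ReadPair:
    """Hypothetical minimal record for one read pair mapped to the same sequence."""
    chrom: str
    fwd_start: int  # leftmost mapped position of the forward-strand mate
    rev_start: int  # leftmost mapped position of the reverse-strand mate


def is_outward_facing(pair: ReadPair) -> bool:
    """Return True if the reverse-strand mate maps upstream of the forward-strand mate.

    Inward-facing pairs (forward mate left of reverse mate) are what a linear
    fragment produces; the flipped arrangement is one hint of a circle junction.
    """
    return pair.rev_start < pair.fwd_start


def count_outward_facing(pairs: Iterable[ReadPair]) -> int:
    """Count pairs whose orientation is consistent with a circular junction."""
    return sum(1 for pair in pairs if is_outward_facing(pair))
```

A real analysis would read these positions and strands from the alignment file and combine this signal with coverage and split reads; this sketch covers only the orientation test.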

ECCsplorer (https://github.com/crimBubble/ECCsplorer) is a bioinformatics pipeline that detects eccDNAs in any kind of organism or tissue from next-generation sequencing data. The eccDNA candidates it reports are useful for a wide range of studies, from the analysis of cancer-related eccDNAs, through organelle genomics, to the identification of active transposable elements.

Mini-workshop

https://github.com/crimBubble/ECCsplorer/blob/master/tutorials/Mini-workshop.md

Installation

Installed on Ubuntu 18 using conda.

Github

git clone https://github.com/crimBubble/ECCsplorer
cd ECCsplorer
mamba env create -f environment.yml
conda activate eccsplorer

# segemehl is also required. Download it from its homepage and build it
cd segemehl-0.3.4/
make all
export PATH=$PATH:$PWD

# This is also required
https://bitbucket.org/petrnovak/repex_tarean/src/devel/
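Because several external tools have to be installed by hand, it can help to check which executables are actually on `PATH` before running the pipeline. The following is a small sketch using only the Python standard library. The executable names in `ASSUMED_TOOLS` are assumptions for illustration, not a list taken from ECCsplorer itself; adjust them to match your installation.

```python
import shutil
from typing import Iterable, List


def find_missing_tools(tool_names: Iterable[str]) -> List[str]:
    """Return the names for which no executable is found on PATH."""
    return [name for name in tool_names if shutil.which(name) is None]


# Assumed executable names (hypothetical; verify against your own setup).
ASSUMED_TOOLS = ["segemehl.x", "blastn", "trimmomatic"]

if __name__ == "__main__":
    missing = find_missing_tools(ASSUMED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```

Run it inside the activated `eccsplorer` environment, after the `export PATH` step, so that the check sees the same `PATH` the pipeline will see.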

> python ECCsplorer.py -h

usage: ECCsplorer.py [-h] [-ref <file>] [-out <directory>] [-trm <option>]
                     [-img <option>] [-dsa <txt>] [-dsb <txt>] [-rgs <int>]
                     [-cnt <int>] [-win <int>] [-tax <tax>] [-log]
                     [-d <DB> [<DB> ...]] [-cpu <int>] [-m <option>]
                     <file1A> <file2A> [<file1B>] [<file2B>]

ECCsplorer v0.9b: detecting extrachromosomal circular DNAs (eccDNA) from short read sequencing data.

positional arguments:
  <file1A>    Paired-end reads file1 of data set A (R1). Required.
  <file2A>    Paired-end reads file2 of data set A (R2). Required.
  <file1B>    Paired-end reads file1 of data set B (R1). Recommended.
  <file2B>    Paired-end reads file2 of data set B (R2). Recommended.

optional arguments:
  -h, --help  show this help message and exit
  -ref <file>, --reference_genome <file>
              Reference genome sequence in FASTA format.
              With single chromosomes named as chr1, chr2, ...chrN
  -out <directory>, --output_dir <directory>
              Name your project output directory.
              Default: eccpipe (old content will be partially overwritten !)
  -trm <option>, --trim_reads <option>
              Read trimming with trimmomatic v0.38.
              Strongly recommended, for usage specify adapter option:
              - nex (Nextera),
              - tru2 (TruSeq2), tru3 (TruSeq3), tru3-2 (TruSeq3-2)
              - custom (see trimmomatic manual, name UserAdapter.fa)
  -img <option>, --image_format <option>
              Choose your desired image format.
              Options: png (default), jpeg, bmp, tiff, pdf
  -dsa <txt>, --preA <txt>
              Set readID prefix for data set A. Max. 10 characters. Default = TR
              Used for comparative analysis.
  -dsb <txt>, --preB <txt>
              Set readID prefix for data set B. Equal length as -dsa. Default = CO
              Used for comparative analysis.
  -rgs <int>, --genome_size <int>
              Set the genome size of your organism in base pairs [bp].
              Only needed if -cnt set to "auto". Default = None
  -cnt <int>, --read_count <int>
              Number of reads to use for clustering, if not set max. available reads are used.
              Set "auto" to use 0.1x genome coverage (only with -ref or -rgs set)
              Note: for mapping analysis max. available reads are used.
  -win <int>, --window_size <int>
              Window size for mapping analysis.
              Used for peak detection and visualization.
              Smaller window size increases memory usage. Default = 100
  -tax <tax>, --taxon <tax>
              Use this option to specify taxon using:
              vir for Viridiplantea (default) or met for Metazoa.
  -log        Use this option to print logging to file.
              If not set logging is only printed to stdout.
  -d <DB> [<DB> ...], --database <DB> [<DB> ...]
              Fasta file for custom BLASTn (annotation database).
              Existing database might be used.
              Usage of multiple databases separated by space possible.
  -cpu <int>, --max_threads <int>
              Specify max. threads to use.
              Default = max. available cpu threads are used.
  -m <option>, --mode <option>
              Choose mode to run.
              Options:
              all (default, run all modules)
              map (run only mapping module)
              clu (run only clustering module)
              PRExer (only run preparation module)

Thanks for using ECCsplorer.
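The help text says that `-cnt auto` uses 0.1x genome coverage. As a rough illustration of what that means in read numbers, the sketch below applies the usual coverage relation, reads = coverage × genome size / read length. How ECCsplorer computes the value internally (for example, whether it counts single reads or pairs) is not stated in the help text, so the function name and the rounding are assumptions.

```python
import math
from fractions import Fraction


def reads_for_coverage(genome_size_bp: int, read_length_bp: int, coverage: str = "0.1") -> int:
    """Estimate the number of reads needed to reach the given genome coverage.

    Uses reads = coverage * genome_size / read_length, rounded up.
    The coverage is passed as a string and handled as an exact fraction
    to avoid floating-point rounding surprises.
    """
    if genome_size_bp <= 0 or read_length_bp <= 0:
        raise ValueError("genome size and read length must be positive")
    exact = Fraction(coverage) * genome_size_bp / read_length_bp
    return math.ceil(exact)


# Illustrative numbers only: a 135 Mb genome sequenced with 150 bp reads.
print(reads_for_coverage(135_000_000, 150))
```

This is why `-cnt auto` needs either `-ref` or `-rgs`: without a genome size there is nothing to base the coverage on.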

Usage

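The input is paired-end FASTQ data from circular DNA of the target organism amplified with phi29 polymerase, plus a reference genome sequence (per the help text, -ref is an optional argument, and the clustering module is reference-free). Paired-end data from a control (unamplified, or amplified from a different treatment or organism) is also recommended; the best results are obtained when both a control and a reference are available.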

python ECCsplorer.py readsA1.fq.gz readsA2.fq.gz readsB1.fq.gz readsB2.fq.gz -ref sequence.fa
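When running several samples, it can be convenient to assemble this command line programmatically. The helper below is a hypothetical sketch (`build_eccsplorer_command` is not part of ECCsplorer); it uses only the short flags shown in the help output (`-ref`, `-out`, `-trm`, `-m`) and returns an argument list suitable for `subprocess.run`.

```python
from typing import List, Optional, Sequence


def build_eccsplorer_command(
    reads_a: Sequence[str],
    reads_b: Optional[Sequence[str]] = None,
    reference: Optional[str] = None,
    out_dir: Optional[str] = None,
    trim: Optional[str] = None,
    mode: Optional[str] = None,
    script: str = "ECCsplorer.py",
) -> List[str]:
    """Assemble an ECCsplorer command line as a list of arguments."""
    if len(reads_a) != 2:
        raise ValueError("data set A needs exactly two files (R1 and R2)")
    if reads_b is not None and len(reads_b) != 2:
        raise ValueError("data set B needs exactly two files (R1 and R2)")

    cmd = ["python", script, *reads_a]
    if reads_b is not None:
        cmd.extend(reads_b)  # control data set, recommended
    if reference is not None:
        cmd += ["-ref", reference]
    if out_dir is not None:
        cmd += ["-out", out_dir]
    if trim is not None:
        cmd += ["-trm", trim]
    if mode is not None:
        cmd += ["-m", mode]
    return cmd


cmd = build_eccsplorer_command(
    ["readsA1.fq.gz", "readsA2.fq.gz"],
    reads_b=["readsB1.fq.gz", "readsB2.fq.gz"],
    reference="sequence.fa",
)
print(" ".join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.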

Test run

If ECCsplorer was installed with conda, test data is included in a subfolder under bin/ of the virtual environment. Change to that path.

cd envs/eccsplorer/bin/ECCsplorer/testdata/

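Use the Entrez E-utilities (efetch) to fetch the reference genome sequences and the reference eccDNA sequence. The three reference genome sequences are concatenated into one file.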

#ref
efetch -db nucleotide -id CM009438.1 -seq_start 8971216 -seq_stop 9147030 -format fasta > chr1.fa
efetch -db nucleotide -id CM009440.1 -seq_start 44915437 -seq_stop 45118630 -format fasta > chr2.fa
efetch -db nucleotide -id CM009444.1 -seq_start 23920431 -seq_stop 24120625 -format fasta > chr3.fa
cat chr1.fa chr2.fa chr3.fa | awk '/^>/{print ">chr" ++i; next}{print}' > RefGenomeSeq.fa
rm chr1.fa chr2.fa chr3.fa

#ref eccDNA
efetch -db nucleotide -id JX455085.1 -format fasta > RefSeq_DB.fa
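The awk one-liner above renames the FASTA headers to chr1, chr2, chr3, which matches the naming that the `-ref` help text asks for. For readers less familiar with awk, here is an equivalent sketch in Python; the function name `rename_fasta_headers` is made up for this example.

```python
from typing import Iterable, Iterator


def rename_fasta_headers(lines: Iterable[str], prefix: str = "chr") -> Iterator[str]:
    """Replace each FASTA header line with >chr1, >chr2, ... in order of appearance.

    Sequence lines are passed through unchanged.
    """
    counter = 0
    for line in lines:
        if line.startswith(">"):
            counter += 1
            yield f">{prefix}{counter}\n"
        else:
            yield line


example = [
    ">CM009438.1 some description\n",
    "ACGTACGT\n",
    ">CM009440.1 another description\n",
    "GGCCTTAA\n",
]
print("".join(rename_fasta_headers(example)), end="")
```

To apply it to real files, open the concatenated FASTA for reading, pass the file object to `rename_fasta_headers`, and write the result with `writelines`.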

Once everything is prepared, start the run.

eccsplorer --output_dir testrun testdata/aDNA_R1.fastq testdata/aDNA_R2.fastq testdata/gDNA_R1.fastq testdata/gDNA_R2.fastq --reference_genome testdata/RefGenomeSeq.fa --database testdata/RefSeq_DB.fa --trim_reads tru3 --read_count 1000 -log

I was not able to build an environment with all dependencies satisfied. I will update this post if I get it working.

Reference

ECCsplorer: a pipeline to detect extrachromosomal circular DNA (eccDNA) from next-generation sequencing data

Ludwig Mann, Kathrin M. Seibt, Beatrice Weber & Tony Heitkam
BMC Bioinformatics volume 23, Article number: 40 (2022)