コンタミやダメージを考慮してAncient DNAのシーケンシングリードをシミュレートする gargammel

　Ancient DNA（aDNA）とも呼ばれるsubfossilsから回収されたDNAは、populationの歴史を再構築するためにますます使用されている（Leonardi et al、2016）。しかし、下流の推論に影響を与える可能性があるいくつかの要因があるため、aDNAデータの分析は依然として困難である。第一に、DNAは時間とともに分解する傾向があり、かなりのヌクレオチドのmisincorporationsを示す限られたサイズ（30〜80 bp）のフラグメントをもたらす（Briggs et al、2007）。第二に、生物死後に環境微生物がコロニーを形成する傾向がある（Green et al、2009 pubmed）。その結果、内因性DNA比率が極端に減少することがあり、ショットガンシークエンシングアプローチが非経済的になる。第三に、そのような外因性配列は、リードアラインメント中に正しく同定されないと、Ancient DNAの再構築に影響を及ぼし得る。ヒト族から回収されたDNAの場合、発掘中および実験室中を含む任意の段階で導入され得る現在のヒトDNA混入は、単一のサンプル内に無関係な集団の歴史を混在させるので特に問題がある。

　Ancient DNA研究者は、人口パラメータの推論を目的とした要約統計量の頑健性をテストするためにシミュレーションを頻繁に使用する。いくつかのパッケージはシーケンシングに起因するプラットフォーム特有のエラーをシミュレートしているが、損傷、断片化、ヒトおよび微生物汚染などのそれらの最も顕著な特徴を含むaDNA配列データセットを適切にシミュレートするパッケージは現在利用できない。

　ここでは、微生物画分、内因性画分、そして今日の人間の汚染を表す一連のゲノム参照からのaDNA配列データセットをシミュレートするパッケージであるgargammelを紹介する。このパッケージは、死後のDNA損傷や塩基の誤取り込みなど、DNA配列の最も一般的な機能をシミュレートすることができる。さらに、ライブラリーの調製に使用される分子ツールからのバイアス、GCに富むフラグメントからのバイアス、およびシーケンシングプラットフォームによってもたらされるエラーをシミュレートする。

Gargammel flowchart. 　論文より転載。

gargammel by grenaud

Wanna simulate ancient DNA sequence datasets, including contamination and bacterial communities? go see gargammel!https://t.co/sczWdMlL2b
— Ludovic Orlando (@LudovicLorlando) October 29, 2016

インストール

依存

git
C++ compiler supporting C++11
cmake, you can install on Ubuntu by typing: sudo apt install cmake
zlib
lib gsl, you can install on Ubuntu by typing: sudo apt-get install libgsl0-dev

If you plan on using ms2chromosomes.py to simulate chromosomes based on ms, you also need:

Hudson's ms (see: http://home.uchicago.edu/rhudson1/source/mksamples.html)
seq-gen, you can install on Ubuntu by typing: sudo apt install seq-gen

本体　Github

git clone https://github.com/grenaud/gargammel.git
cd gargammel/
make

> perl gargammel.pl

# perl gargammel.pl

This script is a wrapper to run the different programs to create

a set of Illumina reads for ancient DNA from a set of fasta files

representing the endogenous, the contaminant from the same species

and the bacterial contamination.

usage: gargammel.pl <options> [directory with fasta directories]

This directory should contain 3 directories:

bact/ The bacterial contamination

cont/ The contamination from the same species

endo/ The endogenous material

Options:

--comp [B,C,E] Composition of the final set in fraction

the 3 numbers represent the bacterial, contaminant and endogenous

ex: --comp 0.6,0.02,0.38 will result

in 60% bacterial contamination while the rest will be from the same

species 5% will be contamination and 95% will be endogenous

Default: --comp 0,0,1

--mock Do nothing, just print the commands that will be run

-o Output prefix (default: [input dir]/simadna)

Either specify:

-n [number] Generate [number] fragments (default: 1000)

-c [coverage] Endogenous coverage

Fragment selection

===================

--misince [file] Base misincorporation for the endogenous fragments (default none)

--misincc [file] Base misincorporation for the contaminant fragments (default none)

--misincb [file] Base misincorporation for the bacterial fragments (default none)

--distmis [dist] Distance to consider for base misincorporation (default 1)

this file is obtained from mapDamage

Fragment size distribution: specify either one of the 3 possible options:

-l [length] Generate fragments of fixed length (default: 35)

-s [file] Open file with size distribution (one fragment length per line)

-f [file] Open file with size frequency in the following format:

length[TAB]freq ex:

40 0.0525

41 0.0491

...

Length options:

--loc [file] Location for lognormal distribution (default none)

--scale [file] Scale for lognormal distribution (default none)

Fragment size limit:

--minsize [length] Minimum fragments length (default: 0)

--maxsize [length] Maximum fragments length (default: 1000)

Fragment methylation:

--methyl Allow for lowercase C and G to denote

methylated cytosines on the + and - strand respectively

(default: not used)

Deamination

===================

To add deamination to the bacterial and endogenous material,you can specify

either one of these options:

-mapdamage [mis.txt] [protocol] Read the miscorporation file [mis.txt]

produced by mapDamage

[protocol] can be either "single" or "double" (without quotes)

Single strand will have C->T damage on both ends

Double strand will have and C->T at the 5' end and G->A damage at the 3' end

-matfile [matrix file prefix] Read the matrix file of substitutions

Provide the prefix only, both files must end with

5.dat and 3.dat

-damage [v,l,d,s] For the Briggs et al. 2007 model

The parameters must be comma-separated e.g.: -damage 0.03,0.4,0.01,0.3

v: nick frequency

l: length of overhanging ends (geometric parameter)

d: prob. of deamination of Cs in double-stranded parts

s: prob. of deamination of Cs in single-stranded parts

Alternatively, you can specify these options independently for the endogenous (e), bacterial (b)

and present-day human contaminant (c) using the following options:

-mapdamagee [mis.txt] [protocol] Endogenous mapDamage misincorporation file

-matfilee [matrix file prefix] Endogenous matrix file of substitutions

-damagee [v,l,d,s] Endogenous Briggs parameters

-mapdamageb [mis.txt] [protocol] Bacterial mapDamage misincorporation file

-matfileb [matrix file prefix] Bacterial matrix file of substitutions

-damageb [v,l,d,s] Bacterial Briggs parameters

-mapdamagec [mis.txt] [protocol] Human contaminant mapDamage misincorporation file

-matfilec [matrix file prefix] Human contaminant matrix file of substitutions

-damagecd [v,l,d,s] Human contaminant Briggs parameters

please note that if you do specify deamination for one source but not for another, no deamination will be added

If using --methyl, you can also specify different matrix file for methylated

-matfilenonmeth [matrix file prefix] Read the matrix file of substitutions for non-methylated Cs

Provide the prefix only, both files must end with

5.dat and 3.dat

-matfilemeth [matrix file prefix] Read the matrix file of substitutions for methylated Cs

Provide the prefix only, both files must end with

5.dat and 3.data

Adapter and sequencing

===================

-fa [seq] Adapter that is observed after the forward read (Default: AGATCGGAAG...)

-sa [seq] Adapter that is observed after the reverse read (Default: AGATCGGAAG...)

-rl [length] Desired read length (Default: 75)

-se use single-end sequencing (Default: paired-end)

The following options change the sequencing error rate, please note that positive factor

will decrease the rate of such errors and a negative one will increase it.

-qs [factor] Increase error rate for forward reads by a factor of 1/(10^([factor]/10)) (Default: 0)

-qs2 [factor] Increase error rate for reverse reads by a factor of 1/(10^([factor]/10)) (Default: 0)

-ss [system] Illumina platfrom to use, the parentheses indicate the max. read length

use the shorthand in the left column:

(single-end, paired-end)

GA2 - GenomeAnalyzer II ( 50bp, 75bp)

HS20 - HiSeq 2000 ( 100bp, 100bp)

HS25 - HiSeq 2500 ( 125bp, 150bp) (Default)

HSXt - HiSeqX TruSeq ( 150bp, 150bp)

MSv1 - MiSeq v1 ( 250bp, 250bp)

MSv3 - MiSeq v3 ( 250bp, 250bp)

依存ツールも含めてビルドされる。しかしsamtoolsがビルドされなかったので、手っ取り早くapt installで古いバージョンを入れた。

dockerイメージも上げておきます。

$ docker run -itv $PWD:/data/ kazumax/gargammel
> source ~/.profile
> gargammel.pl -h

テストラン

1、リファレンスを作る。

cd gargammel/ && mkdir data && cd data/

#実行
python ../ms2chromosomes.py -s 0.2 -f . -n 1000 

#cleanup（作業ファイルが）
rm -rfv simul_* seedms

> ls -alh *

# ls -alh *

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 ref.fa

bact:

total 132K

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 .

drwxr-xr-x 5 root root 124K Mar 15 11:02 ..

cont:

total 20M

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 .

drwxr-xr-x 5 root root 124K Mar 15 11:02 ..

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 cont.1.fa

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 cont.2.fa

endo:

total 20M

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 .

drwxr-xr-x 5 root root 124K Mar 15 11:02 ..

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 endo.1.fa

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 endo.2.fa

-rw-r--r-- 1 root root 80K Mar 15 11:01 segsites

ref.faと３つのサブディレクトリ（bact、cont、endo）ができる。bactはバクテリアのコンタミ、contは同じ種からのコンタミ、endoは内性（endogenous）。endogenousのendo.1.faとendo.2.faのheterozygousポジションは以下のファイルにまとめられている。

> head data/endo/segsites

# head data/endo/segsites

>ref_1 312 C T

>ref_1 452 G T

>ref_1 1135 A G

>ref_1 1478 T A

>ref_1 1522 G T

>ref_1 2016 G C

>ref_1 2227 C T

>ref_1 4486 A C

>ref_1 5080 C G

>ref_1 5820 G A

２、リファレンスが用意できたらaDNAをシミュレートする。コマンドの最後には、１で作ったディレクトリを指定する。

cd ..

perl gargammel.pl -c 3 --comp 0,0.08,0.92 -f src/sizefreq.size.gz -matfile src/matrices/single- -o data/simulation data/

> ls -alh data/

# ls -alh data/

total 200M

drwxr-xr-x 1 root root 4.0K Mar 15 17:15 .

drwxr-xr-x 1 root root 4.0K Mar 15 10:58 ..

drwxr-xr-x 2 root root 4.0K Mar 15 11:01 bact

drwxr-xr-x 1 root root 4.0K Mar 15 17:13 cont

drwxr-xr-x 1 root root 4.0K Mar 15 17:13 endo

-rw-r--r-- 1 root root 9.6M Mar 15 11:01 ref.fa

-rw-r--r-- 1 root root 0 Mar 15 17:13 simulation.b.fa.gz

-rw-r--r-- 1 root root 1.3M Mar 15 17:13 simulation.c.fa.gz

-rw-r--r-- 1 root root 28M Mar 15 17:13 simulation.e.fa.gz

-rw-r--r-- 1 root root 36M Mar 15 17:14 simulation_a.fa.gz

-rw-r--r-- 1 root root 30M Mar 15 17:13 simulation_d.fa.gz

-rw-r--r-- 1 root root 48M Mar 15 17:14 simulation_s1.fq.gz

-rw-r--r-- 1 root root 49M Mar 15 17:14 simulation_s2.fq.gz

roo

simulation_s1.fq.gzとsimulation_s2.fq.gzが最終出力。

実行例

Githubには他にもいくつかの例がある。

１、非常に低いcoverageの40-bpリードを発生させる。

gargammel.pl -c 0.5 --comp 0,0,1 -l 40 -o data/simulation data/

-c Endogenous coverage
--comp [B,C,E] Composition of the final set in fraction the 3 numbers represent the bacterial, contaminant and endogenous
ex: --comp 0.6,0.02,0.38 will result in 60% bacterial contamination while the rest will be from the same species 5% will be contamination and 95% will be endogenous Default: --comp 0,0,1
-l Generate fragments of fixed length (default: 35)
-o Output prefix (default: [input dir]/simadna)

２、20xの高いcoverageだが現代のコンタミネーションが激しい（40%）45-bpのリードを発生させる。

gargammel.pl -c 20 --comp 0,0.4,0.6 -l 45 -o data/simulation data/

３、Briggs et al. 2007 model（pubmed）を使い、脱アミノ化した二本鎖DNAを1M発生させる。

gargammel.pl -n 1000000 --comp 0,0,1 -l 40 -damage 0.03,0.4,0.01,0.3 -o data/simulation data/

-n Generate [number] fragments (default: 1000)
-damage [v ,l ,d ,s] For the Briggs et al. 2007 model. The parameters must be comma-separated e.g.: -damage

他の例はGithubを確認して下さい。丁寧に説明されています。

引用

gargammel: simulations of ancient DNA datasets

Gabriel Renaud, Kristian Hanghøj, Eske Willerslev, Ludovic Orlando
Bioinformatics, Volume 33, Issue 4, 15 February 2017, Pages 577–579