古代DNAの損傷パラメータを推定する mapDamage2

　骨や歯の化石、コプロライト、堆積物、ミイラ化した標本、博物館のコレクションなどに含まれるAncient DNA（aDNA）分子は、進化生物学者にとって素晴らしい情報源であり、過去の伝染病の原因や過去の集団の動態を明らかにしてくれる。しかし、aDNA の分析には一般的に 2 つの大きな問題がある。まず第一に、配列は内因性のものと様々な外因性の背景（主に微生物）が混在している。第二に、死後の深刻なDNA損傷の結果、高いヌクレオチドmisincorporation rateが観察されることがある。このようなmisincorporationのパターンは、古代の配列と現代の汚染物質との違いを検証するのに役立つ。著者らは、2011年、次世代シーケンサー（NGS）の配列データからこのようなパターンを特定する、使いやすいパッケージ「mapDamage」を報告した。しかし、DNA損傷プロセスの正式な統計的モデルがないため、サンプル間での厳密な定量的比較ができなかった。

　mapDamage 2.0は、DNA損傷の統計的モデルを導入することで、従来のmapDamageの機能を拡張したものである。損傷イベントは、シーケンス位置と死後の脱アミノ化のみに依存すると仮定し、ベイズ統計フレームワークにより、オーバーハングの平均長さ（λ）、ニック頻度（ν）、および二本鎖領域とオーバーハングの両方におけるシトシン脱アミノ化率という、aDNA分子の4つの重要な特徴を推定することができる。mapDamage 2.0は、NGSデータセットを簡単に扱うことができ、様々なDNAライブラリプロトコルと互換性がある。mapDamage 2.0はPythonパッケージとしてginolhac.github.io/mapDamage/で入手でき、ドキュメントはCentre for GeoGeneticsのWebサイト（geogenetics.ku.dk/publications/mapdamage2.0/）で管理されている。

Document

https://ginolhac.github.io/mapDamage/

インストール

macos10.14でpython3.8の仮想環境を作ってテストした。

Github

#conda (bioconda)
mamba install -c bioconda mapdamage2 -y

>mapDamage -h

$ mapDamage -h

Usage: mapDamage [options] -i BAMfile -r reference.fasta

Use option -h or --help for help

Options:

--version show program's version number and exit

-h, --help show this help message and exit

Input files:

-i FILENAME, --input=FILENAME

SAM/BAM file, must contain a valid header, use '-' for

reading a BAM from stdin

-r REF, --reference=REF

Reference file in FASTA format

General options:

-n DOWNSAMPLE, --downsample=DOWNSAMPLE

Downsample to a randomly selected fraction of the

reads (if 0 < DOWNSAMPLE < 1), or a fixed number of

randomly selected reads (if DOWNSAMPLE >= 1). By

default, no downsampling is performed.

--downsample-seed=DOWNSAMPLE_SEED

Seed value to use for downsampling. See documentation

for py module 'random' for default behavior.

--merge-reference-sequences

Ignore referece sequence names when tabulating reads

(using '*' instead). Useful for alignments with a

large number of reference sequnces, which may

otherwise result in excessive memory or disk usage due

to the number of tables generated.

-l LENGTH, --length=LENGTH

read length, in nucleotides to consider [70]

-a AROUND, --around=AROUND

nucleotides to retrieve before/after reads [10]

-Q MINQUAL, --min-basequal=MINQUAL

minimun base quality Phred score considered, Phred-33

assumed [0]

-d FOLDER, --folder=FOLDER

folder name to store results [results_FILENAME]

-f, --fasta Write alignments in a FASTA file

--plot-only Run only plotting from a valid result folder

-q, --quiet Disable any output to stdout

-v, --verbose Display progression information during parsing

--mapdamage-modules=MAPDAMAGE_MODULES

Override the system wide installed mapDamage module

Options for graphics:

-y YMAX, --ymax=YMAX

graphical y-axis limit for nucleotide misincorporation

frequencies [0.3]

-m READPLOT, --readplot=READPLOT

read length, in nucleotides, considered for plotting

nucleotide misincorporations [25]

-b REFPLOT, --refplot=REFPLOT

the number of reference nucleotides to consider for

ploting base composition in the region located

upstream and downstream of every read [10]

-t TITLE, --title=TITLE

title used for plots []

Options for the statistical estimation:

--rand=RAND Number of random starting points for the likelihood

optimization [30]

--burn=BURN Number of burnin iterations [10000]

--adjust=ADJUST Number of adjust proposal variance parameters

iterations [10]

--iter=ITER Number of final MCMC iterations [50000]

--forward Using only the 5' end of the seqs [False]

--reverse Using only the 3' end of the seqs [False]

--var-disp Variable dispersion in the overhangs [False]

--jukes-cantor Use Jukes Cantor instead of HKY85 [False]

--diff-hangs The overhangs are different for 5' and 3' [False]

--fix-nicks Fix the nick frequency vector (Only C.T from the 5'

end and G.A from the 3' end) [False]

--use-raw-nick-freq

Use the raw nick frequency vector without smoothing

[False]

--single-stranded Single stranded protocol [False]

--theme-bw Use black and white theme in post. pred. plot [False]

--seq-length=SEQ_LENGTH

How long sequence to use from each side [12]

--stats-only Run only statistical estimation from a valid result

folder

--no-stats Disabled statistical estimation, active by default

--check-R-packages Check if the R modules are working

Options for rescaling of BAM files:

--rescale Rescale the quality scores in the BAM file using the

output from the statistical estimation

--rescale-only Run only rescaling from a valid result folder

--rescale-out=RESCALE_OUT

Write the rescaled BAM to this file

--rescale-length-5p=RESCALE_LENGTH_5P

How many bases to rescale at the 5' termini; defaults

to --seq-length.

--rescale-length-3p=RESCALE_LENGTH_3P

How many bases to rescale at the 5' termini; defaults

to --seq-length.

report bugs to aginolhac@snm.ku.dk, MSchubert@snm.ku.dk or

jonsson.hakon@gmail.com

実行方法

bamとマッピングに使ったリファレンスfastaを指定する。

 mapDamage -i referernce.bam -r aln.bam -d outdir

-i SAM/BAM file, must contain a valid header, use '-' for reading a BAM from stdin
-r Reference file in FASTA format
-d folder name to store results [results_FILENAME]

出力

f:id:kazumaxneo:20210424133829p:plain

すべてのオプションを有効にすると、16個のファイルが結果フォルダに生成される。

Runtime_log.txt：使用されたコマンドラインの概要とタイムスタンプが記載されたログファイル。

プロット用

Fragmisincorporation_plot.pdf：フラグメンテーションとミスインコーポレーションの両方のパターンを表示するpdfファイル。
Length_plot.pdf：シングルトンリードの長さ分布をストランドごとに表示し、5'末端のC->Tと3'末端のG->Aの累積頻度もストランドごとに表示（重要な注意事項：このプロットはリードの長さの分布を表示し、ペアエンドリードのインサートサイズは表示しない。この目的のためには、Picard CollectInsertSizeMetricsを使用できる）。
misincorporation.txtには、各タイプの変異の発生率とリードエンドからの相対的な位置を示した表が含まれる。
5pCtoT_freq.txt： 5'末端からの位置ごとのCytosineからThymineへの変異の頻度を含む。
3pGtoA_freq.txt：3'末端からの位置ごとのGuanineからAdenineへの変異の頻度を含む。
dnacomp.txt：リード内部および隣接領域の位置ごとの参照ゲノムの塩基組成の表を含む。
lgdistribution.txtには、ストランドごとのリード長さの分布の表が含む。

統計的推定