（メタゲノム）BAMのカバレッジ、polymorphic サイト率、リファレンスフリーのコンセンサス配列を計算する CMSeq

CMSeqは、SegataLabで公開されている、リファレンスのカバレッジ、polymorphic サイト率、BAMからのコンセンサス配列計算のための.bamファイルへのインターフェースを提供するコマンド群。

インストール

依存

Requires:
samtools (> 1.x)
numpy
pysam
pandas
Biopython with bcbio-gff module (warning: Biopython <= 1.76 is required)

#conda (link)
mamba install -c bioconda cmseq -y

#pip(pypi)
pip install CMSeq

> breadth_depth.py -h

$ breadth_depth.py -h

usage: breadth_depth.py [-h] [-c REFERENCE ID] [-f] [--sortindex]

[--minlen MINLEN] [--minqual MINQUAL]

[--mincov MINCOV] [--truncate TRUNCATE] [--combine]

BAMFILE

calculate the Breadth and Depth of coverage of BAMFILE.

positional arguments:

BAMFILE The file on which to operate

optional arguments:

-h, --help show this help message and exit

-c REFERENCE ID, --contig REFERENCE ID

Focus on a subset of references in the BAM file. Can

be a list of references separated by commas or a FASTA

file (the IDs are used to subset)

-f If set unmapped (FUNMAP), secondary (FSECONDARY), qc-

fail (FQCFAIL) and duplicate (FDUP) are excluded. If

unset ALL reads are considered (bedtools genomecov

style). Default: unset

--sortindex Sort and index the file

--minlen MINLEN Minimum Reference Length for a reference to be

considered

--minqual MINQUAL Minimum base quality. Bases with quality score lower

than this will be discarded. This is performed BEFORE

--mincov. Default: 30

--mincov MINCOV Minimum position coverage to perform the polymorphism

calculation. Position with a lower depth of coverage

will be discarded (i.e. considered as zero-coverage

positions). This is calculated AFTER --minqual.

Default: 1

--truncate TRUNCATE Number of nucleotides that are truncated at either

contigs end before calculating coverage values.

--combine Combine all contigs into one giant contig and report

it at the end

> polymut.py -h

$ polymut.py -h

usage: polymut.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minlen MINLEN]

[--minqual MINQUAL] [--mincov MINCOV]

[--dominant_frq_thrsh DOMINANT_FRQ_THRSH]

[--gff_file GFF_FILE]

BAMFILE

Reports the polymorpgic rate of each reference (polymorphic bases / total

bases). Focuses only on covered regions (i.e. depth >= 1)

positional arguments:

BAMFILE The file on which to operate

optional arguments:

-h, --help show this help message and exit

-c REFERENCE ID, --contig REFERENCE ID

Focus on a subset of references in the BAM file. Can

be a list of references separated by commas or a FASTA

file (the IDs are used to subset)

-f If set unmapped (FUNMAP), secondary (FSECONDARY), qc-

fail (FQCFAIL) and duplicate (FDUP) are excluded. If

unset ALL reads are considered (bedtools genomecov

style). Default: unset

--sortindex Sort and index the file

--minlen MINLEN Minimum Reference Length for a reference to be

considered. Default: 0

--minqual MINQUAL Minimum base quality. Bases with quality score lower

than this will be discarded. This is performed BEFORE

--mincov. Default: 30

--mincov MINCOV Minimum position coverage to perform the polymorphism

calculation. Position with a lower depth of coverage

will be discarded (i.e. considered as zero-coverage

positions). This is calculated AFTER --minqual.

Default:1

--dominant_frq_thrsh DOMINANT_FRQ_THRSH

Cutoff for degree of `allele dominance` for a position

to be considered polymorphic. Default: 0.8

--gff_file GFF_FILE GFF file used to extract protein-coding genes

> poly.py -h

$ poly.py -h

usage: poly.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minlen MINLEN]

[--minqual MINQUAL] [--mincov MINCOV] [--pvalue PVALUE]

[--seq_err SEQ_ERR] [--dominant_frq_thrsh DOMINANT_FRQ_THRSH]

BAMFILE

Reports the polymorpgic rate of each reference (polymorphic bases / total

bases). Focuses only on covered regions (i.e. depth >= 1)

positional arguments:

BAMFILE The file on which to operate

optional arguments:

-h, --help show this help message and exit

-c REFERENCE ID, --contig REFERENCE ID

Focus on a subset of references in the BAM file. Can

be a list of references separated by commas or a FASTA

file (the IDs are used to subset)

-f If set unmapped (FUNMAP), secondary (FSECONDARY), qc-

fail (FQCFAIL) and duplicate (FDUP) are excluded. If

unset ALL reads are considered (bedtools genomecov

style). Default: unset

--sortindex Sort and index the file

--minlen MINLEN Minimum Reference Length for a reference to be

considered. Default: 0

--minqual MINQUAL Minimum base quality. Bases with quality score lower

than this will be discarded. This is performed BEFORE

--mincov. Default: 30

--mincov MINCOV Minimum position coverage to perform the polymorphism

calculation. Position with a lower depth of coverage

will be discarded (i.e. considered as zero-coverage

positions). This is calculated AFTER --minqual.

Default:1

--pvalue PVALUE Binomial p-value threshold for the binomal-polymorphic

test. Default: 0.01

--seq_err SEQ_ERR Sequencing error rate. Default: 0.001

--dominant_frq_thrsh DOMINANT_FRQ_THRSH

Cutoff for degree of `allele dominance` for a position

to be considered polymorphic. Default: 0.8

実行方法

breadth_depth.py - BAMアライメントファイルから各リファレンスのカバレッジ範囲の広さ、平均値、中央値を表にして出力

breadth_depth.py sorted.bam

#chrを指定
breadth_depth.py ―c chr1 sorted.bam

#unsoorted.bam, minimum depth 10
breadth_depth.py --sortindex --mincov 10 sorted.bam

polymut.py - タンパク質をコードする遺伝子上のpolymorphic サイト率を計算

polymut.py sorted.bam

#unsoorted.bam, minimum coverage 10, minimum quality 30
poly.py --sortindex --mincov 10 --minqual 30 unsort.bam

Githubより

この関数は、タンパク質をコードする遺伝子について、ヌクレオチドレベルでdominant およびsecond-dominant alleles を考慮し、ORFをタンパク質に変換し、タンパク質のdominant およびsecond-dominant間の（タンパク質レベルでの）同義的および非同義的な突然変異の数を計算して出力する。 Dominantとsecond-dominant のカバレッジの比率がdominant_frq_thrshよりも小さい位置は非変異とみなされる。この関数は、Pasolli et al., 2019において、メタゲノムにおける株の異質性を計算するためのアドホックな尺度として使用された。

consensus.py - BAMからReference Freeのコンセンサス配列を出力

consensus.py sort.bam > consensus.fasta

結果はFASTA形式で出力される。再構成された配列の長さは、リファレンスのオリジナルの長さに拘束される。その長さでは、すべての位置がカバーされていない可能性があり、次のような場合に起こる。

その位置に対応するリードが存在しない
その位置にマッピングされるリードの数が少なすぎる（つまりmincov未満）。
その位置にマッピングされているリードの品質が低い（すなわち、< minqual）。
その位置のヌクレオチドの分布に問題がある可能性がある（例：dominant_allele_frequency < dominant_frq_thrsh）：この場合、ノイズを減らすために、その位置は除外される。

consensus_aDNA.py

古代メタゲノム研究を想定して、BAMファイルからゲノムのコンセンサスを抽出する。カバレッジが5より小さく、ダメージ確率（MapDamage2のStats_out_MCMC_correct_prob.csv）が0.95より大きい位置は無視する。

consensus_aDNA.py --mincov 5 -r reference.fna --pos_specific_prob_tab Stats_out_MCMC_correct_prob.csv --pos_damage_prob_thrsh 0.95 sorted.bam

引用

https://github.com/SegataLab/cmseq#breadth-and-depth-of-coverage-with-breadth_depthpy

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

（メタゲノム）BAMのカバレッジ、polymorphic サイト率、リファレンスフリーのコンセンサス配列を計算する CMSeq