CMSeqは、SegataLabで公開されている、リファレンスのカバレッジ、polymorphic サイト率、BAMからのコンセンサス配列計算のための.bamファイルへのインターフェースを提供するコマンド群。
インストール
依存
- Requires:
- samtools (> 1.x)
- numpy
- pysam
- pandas
- Biopython with bcbio-gff module (warning: Biopython <= 1.76 is required)
#conda (link)
mamba install -c bioconda cmseq -y
#pip(pypi)
pip install CMSeq
> breadth_depth.py -h
$ breadth_depth.py -h
usage: breadth_depth.py [-h] [-c REFERENCE ID] [-f] [--sortindex]
[--minlen MINLEN] [--minqual MINQUAL]
[--mincov MINCOV] [--truncate TRUNCATE] [--combine]
BAMFILE
calculate the Breadth and Depth of coverage of BAMFILE.
positional arguments:
BAMFILE The file on which to operate
optional arguments:
-h, --help show this help message and exit
-c REFERENCE ID, --contig REFERENCE ID
Focus on a subset of references in the BAM file. Can
be a list of references separated by commas or a FASTA
file (the IDs are used to subset)
-f If set unmapped (FUNMAP), secondary (FSECONDARY), qc-
fail (FQCFAIL) and duplicate (FDUP) are excluded. If
unset ALL reads are considered (bedtools genomecov
style). Default: unset
--sortindex Sort and index the file
--minlen MINLEN Minimum Reference Length for a reference to be
considered
--minqual MINQUAL Minimum base quality. Bases with quality score lower
than this will be discarded. This is performed BEFORE
--mincov. Default: 30
--mincov MINCOV Minimum position coverage to perform the polymorphism
calculation. Position with a lower depth of coverage
will be discarded (i.e. considered as zero-coverage
positions). This is calculated AFTER --minqual.
Default: 1
--truncate TRUNCATE Number of nucleotides that are truncated at either
contigs end before calculating coverage values.
--combine Combine all contigs into one giant contig and report
it at the end
> polymut.py -h
$ polymut.py -h
usage: polymut.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minlen MINLEN]
[--minqual MINQUAL] [--mincov MINCOV]
[--dominant_frq_thrsh DOMINANT_FRQ_THRSH]
[--gff_file GFF_FILE]
BAMFILE
Reports the polymorpgic rate of each reference (polymorphic bases / total
bases). Focuses only on covered regions (i.e. depth >= 1)
positional arguments:
BAMFILE The file on which to operate
optional arguments:
-h, --help show this help message and exit
-c REFERENCE ID, --contig REFERENCE ID
Focus on a subset of references in the BAM file. Can
be a list of references separated by commas or a FASTA
file (the IDs are used to subset)
-f If set unmapped (FUNMAP), secondary (FSECONDARY), qc-
fail (FQCFAIL) and duplicate (FDUP) are excluded. If
unset ALL reads are considered (bedtools genomecov
style). Default: unset
--sortindex Sort and index the file
--minlen MINLEN Minimum Reference Length for a reference to be
considered. Default: 0
--minqual MINQUAL Minimum base quality. Bases with quality score lower
than this will be discarded. This is performed BEFORE
--mincov. Default: 30
--mincov MINCOV Minimum position coverage to perform the polymorphism
calculation. Position with a lower depth of coverage
will be discarded (i.e. considered as zero-coverage
positions). This is calculated AFTER --minqual.
Default:1
--dominant_frq_thrsh DOMINANT_FRQ_THRSH
Cutoff for degree of `allele dominance` for a position
to be considered polymorphic. Default: 0.8
--gff_file GFF_FILE GFF file used to extract protein-coding genes
> poly.py -h
$ poly.py -h
usage: poly.py [-h] [-c REFERENCE ID] [-f] [--sortindex] [--minlen MINLEN]
[--minqual MINQUAL] [--mincov MINCOV] [--pvalue PVALUE]
[--seq_err SEQ_ERR] [--dominant_frq_thrsh DOMINANT_FRQ_THRSH]
BAMFILE
Reports the polymorpgic rate of each reference (polymorphic bases / total
bases). Focuses only on covered regions (i.e. depth >= 1)
positional arguments:
BAMFILE The file on which to operate
optional arguments:
-h, --help show this help message and exit
-c REFERENCE ID, --contig REFERENCE ID
Focus on a subset of references in the BAM file. Can
be a list of references separated by commas or a FASTA
file (the IDs are used to subset)
-f If set unmapped (FUNMAP), secondary (FSECONDARY), qc-
fail (FQCFAIL) and duplicate (FDUP) are excluded. If
unset ALL reads are considered (bedtools genomecov
style). Default: unset
--sortindex Sort and index the file
--minlen MINLEN Minimum Reference Length for a reference to be
considered. Default: 0
--minqual MINQUAL Minimum base quality. Bases with quality score lower
than this will be discarded. This is performed BEFORE
--mincov. Default: 30
--mincov MINCOV Minimum position coverage to perform the polymorphism
calculation. Position with a lower depth of coverage
will be discarded (i.e. considered as zero-coverage
positions). This is calculated AFTER --minqual.
Default:1
--pvalue PVALUE Binomial p-value threshold for the binomal-polymorphic
test. Default: 0.01
--seq_err SEQ_ERR Sequencing error rate. Default: 0.001
--dominant_frq_thrsh DOMINANT_FRQ_THRSH
Cutoff for degree of `allele dominance` for a position
to be considered polymorphic. Default: 0.8
実行方法
breadth_depth.py - BAMアライメントファイルから各リファレンスのカバレッジ範囲の広さ、平均値、中央値を表にして出力
breadth_depth.py sorted.bam
#chrを指定
breadth_depth.py ―c chr1 sorted.bam
#unsoorted.bam, minimum depth 10
breadth_depth.py --sortindex --mincov 10 sorted.bam
polymut.py - タンパク質をコードする遺伝子上のpolymorphic サイト率を計算
polymut.py sorted.bam
#unsoorted.bam, minimum coverage 10, minimum quality 30
poly.py --sortindex --mincov 10 --minqual 30 unsort.bam
Githubより
この関数は、タンパク質をコードする遺伝子について、ヌクレオチドレベルでdominant およびsecond-dominant alleles を考慮し、ORFをタンパク質に変換し、タンパク質のdominant およびsecond-dominant間の(タンパク質レベルでの)同義的および非同義的な突然変異の数を計算して出力する。 Dominantとsecond-dominant のカバレッジの比率がdominant_frq_thrshよりも小さい位置は非変異とみなされる。この関数は、Pasolli et al., 2019において、メタゲノムにおける株の異質性を計算するためのアドホックな尺度として使用された。
consensus.py - BAMからReference Freeのコンセンサス配列を出力
consensus.py sort.bam > consensus.fasta
結果はFASTA形式で出力される。再構成された配列の長さは、リファレンスのオリジナルの長さに拘束される。その長さでは、すべての位置がカバーされていない可能性があり、次のような場合に起こる。
- その位置に対応するリードが存在しない
- その位置にマッピングされるリードの数が少なすぎる(つまりmincov未満)。
- その位置にマッピングされているリードの品質が低い(すなわち、< minqual)。
- その位置のヌクレオチドの分布に問題がある可能性がある(例:dominant_allele_frequency < dominant_frq_thrsh):この場合、ノイズを減らすために、その位置は除外される。
consensus_aDNA.py
古代メタゲノム研究を想定して、BAMファイルからゲノムのコンセンサスを抽出する。カバレッジが5より小さく、ダメージ確率(MapDamage2のStats_out_MCMC_correct_prob.csv)が0.95より大きい位置は無視する。
consensus_aDNA.py --mincov 5 -r reference.fna --pos_specific_prob_tab Stats_out_MCMC_correct_prob.csv --pos_damage_prob_thrsh 0.95 sorted.bam
引用
https://github.com/SegataLab/cmseq#breadth-and-depth-of-coverage-with-breadth_depthpy
関連