2019-06-16

bamファイルを扱う bamM

library bam/sam

　BamMはBAMファイルを解析するpythonにラップされたcライブラリである。このコードはPySam (link) のすべての機能を実装するものではないが、PySamよりも高速で安定したBAMファイルのインターフェースを提供することを目的としている。

http://ecogenomics.github.io/BamM/

マニュアル

http://ecogenomics.github.io/BamM/manual/BamM_manual.pdf

python library

API Documentation

インストール

ubuntu16.04のminiconda2-4.0.5環境でテストした（docker使用、ホストOS ubuntu18.04）。

依存

python2.7
samtools
bwa

本体　Github

#bioconda（link）
conda install -c bioconda bamm

#GNU Guix
guix package --install bamm

> bamm

...::: BamM :::...

Working with the BAM, not against it...

-------------------------------------------------------------------------

version: 1.7.3

-------------------------------------------------------------------------

bamm make -> Make BAM/TAM files (sorted + indexed)

bamm parse -> Get coverage profiles / linking reads / insert types

bamm extract -> Extract reads / headers from BAM files

bamm filter -> Filter BAM file reads

USE: bamm OPTION -h to see detailed options

> bamm make -h

$ bamm make -h

usage: bamm make -d DATABASE [-i INTERLEAVED [INTERLEAVED ...]]

[-c COUPLED [COUPLED ...]] [-s SINGLE [SINGLE ...]]

[-p PREFIX] [-o OUT_FOLDER]

[--index_algorithm INDEX_ALGORITHM]

[--alignment_algorithm ALIGNMENT_ALGORITHM] [--extras EXTRAS]

[-k] [-K] [-f] [--output_tam] [-u] [-t THREADS] [-m MEMORY]

[--temporary_directory TEMPORARY_DIRECTORY] [-h]

[--show_commands] [--quiet] [--silent]

make a BAM/TAM file (sorted + indexed)

required argument:

-d DATABASE, --database DATABASE

contigs to map onto (in fasta format)

reads to map (specify one or more arguments):

-i INTERLEAVED [INTERLEAVED ...], --interleaved INTERLEAVED [INTERLEAVED ...]

map interleaved sequence files (as many as you like) EX: -i reads1_interleaved.fq.gz reads2_interleaved.fna

-c COUPLED [COUPLED ...], --coupled COUPLED [COUPLED ...]

map paired sequence files (as many sets as you like) EX: -c reads1_1.fq.gz reads1_2.fq.gz reads2_1.fna reads2_2.fna

-s SINGLE [SINGLE ...], --single SINGLE [SINGLE ...]

map Single ended sequence files (as many as you like) EX: -s reads1_singles.fq.gz reads2_singles.fna

optional arguments:

-p PREFIX, --prefix PREFIX

prefix to apply to BAM/TAM files (None for reference name)

-o OUT_FOLDER, --out_folder OUT_FOLDER

write to this folder (default: .)

--index_algorithm INDEX_ALGORITHM

algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]

--alignment_algorithm ALIGNMENT_ALGORITHM

algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)

--extras EXTRAS extra arguments to use during mapping. Format is "<BWA_MODE1>:'<ARGS>',<BWA_MODE2>:'<ARGS>'"

-k, --keep keep all generated database index files (see also --kept)

-K, --kept assume database indices already exist (e.g. previously this script was run with -k/--keep)

-f, --force force overwriting of index files if they are present

--output_tam output TAM file instead of BAM file

-u, --keep_unmapped Keep unmapped reads in BAM output

-t THREADS, --threads THREADS

maximum number of threads to use (default: 1)

-m MEMORY, --memory MEMORY

maximum amount of memory to use per samtools sort thread (default 2GB). Suffix K/M/G recognized

--temporary_directory TEMPORARY_DIRECTORY

temporary directory for working with BAM files (default do not use)

-h, --help show this help message and exit

--show_commands show all commands being run

--quiet suppress output from the mapper

--silent suppress all output

Example usage:

Produce a sorted, indexed BAM file with reads mapped onto contigs.fa using 40 threads

bamm make -d contigs.fa -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -c reads3_1.fq.gz reads3_2.fq.gz -t 40

Produce a 3 sorted, indexed BAM files with reads mapped onto contigs.fa.gz

bamm make -d contigs.fa.gz -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -s reads4_singles.fq.gz

Extra arguments are passed on a "per-mode" basis using the format:

<BWA_MODE>:'<ARGS>'

For example, the argument:

--extras "mem:-k 25"

tells bwa mem to use a minimum seed length of 25bp.

Multiple modes are separated by commas. For example:

--extras "aln:-O 12 -E 3,sampe:-n 15"

Passes the arguments "-O 12 -E 3" to bwa aln and the arguments "-n 15" to bwa sampe.

********************************************************************************

*** WARNING ***

********************************************************************************

Values passed using --extras are not checked by BamM. This represents a

significant security risk if BamM is being run with elevated privileges.

Thus you should NEVER run 'bamm make' as root or some other powerful user,

ESPECIALLY if you are providing access to multiple / unknown users.

********************************************************************************

> bamm parse -h

$ bamm parse -h

usage: bamm parse -b BAMFILES [BAMFILES ...] [-c COVERAGES] [-l LINKS]

[-i INSERTS] [-n NUM_TYPES [NUM_TYPES ...]]

[-m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}]

[-r CUTOFF_RANGE [CUTOFF_RANGE ...]] [--length LENGTH]

[--base_quality BASE_QUALITY]

[--mapping_quality MAPPING_QUALITY]

[--max_distance MAX_DISTANCE] [--use_secondary]

[--use_supplementary] [-v] [-t THREADS] [-h]

get bamfile type and/or coverage profiles and/or linking reads

required argument:

-b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]

bam files to parse

optional arguments:

-c COVERAGES, --coverages COVERAGES

filename to write coverage profiles to [default: don't calculate coverage]

-l LINKS, --links LINKS

filename to write pairing links to [default: don't calculate links]

-i INSERTS, --inserts INSERTS

filename to write bamfile insert distributions to [default: print to STDOUT]

-n NUM_TYPES [NUM_TYPES ...], --num_types NUM_TYPES [NUM_TYPES ...]

number of insert/orientation types per BAM

-m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}, --coverage_mode {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}

how to calculate coverage (requires --coverages) (default: pmean)

-r CUTOFF_RANGE [CUTOFF_RANGE ...], --cutoff_range CUTOFF_RANGE [CUTOFF_RANGE ...]

range used to calculate upper and lower bounds when calculating coverage

--length LENGTH minimum Q length (default: 50)

--base_quality BASE_QUALITY

base quality threshold (Qscore) (default: 20)

--mapping_quality MAPPING_QUALITY

mapping quality threshold

--max_distance MAX_DISTANCE

maximum allowable edit distance from query to reference (default: 1000)

--use_secondary use reads marked with the secondary flag

--use_supplementary use reads marked with the supplementary flag

-v, --verbose be verbose

-t THREADS, --threads THREADS

maximum number of threads to use (default: 1)

-h, --help show this help message and exit

Example usage:

Calculate insert distribution, print to STDOUT

bamm parse -b my.bam

Calculate contig-wise coverage

bamm parse -b my.bam -c output_coverage.tsv

Calculate linking information on 2 bam files

bamm parse -b first.bam second.bam -l output_links.tsv

Coverage calculation modes:

* opmean: outlier pileup coverage: average of reads overlapping each base,

after bases with coverage outside mean +/- 1 standard deviation

have been excluded. The number of standard deviation used for the

cutoff can be changed with --cutoff_range.

* pmean: pileup coverage: average of number of reads overlapping each base

* tpmean: trimmed pileup coverage: average of reads overlapping each base,

after bases with in the top and bottom 10% have been excluded. The

10% range can be changed using --cutoff_range and should be

specified as a percentage (e.g., 15 for 15%).

* counts: absolute number of reads mapping

* cmean: like 'counts' except divided by the length of the contig

* pmedian: median pileup coverage: median of number of reads overlapping each

base

* pvariance: variance of pileup coverage: variance of number of reads

overlapping each base

The 'cutoff_range' variable is used for trimmed mean and outlier mean. This

parameter takes at most two values. The first is the lower cutoff and the

second is the upper. If only one value is supplied then lower == upper.

> bamm extract -h

$ bamm extract -h

usage: bamm extract -b BAMFILES [BAMFILES ...] -g GROUPS [GROUPS ...]

[-p PREFIX] [-o OUT_FOLDER] [--mix_bams] [--mix_groups]

[--mix_reads] [--interleave]

[--mapping_quality MAPPING_QUALITY] [--use_secondary]

[--use_supplementary] [--max_distance MAX_DISTANCE]

[--no_gzip] [--headers_only] [-v] [-t THREADS] [-h]

Extract reads which hit the given references

required arguments:

-b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]

bam files to parse

-g GROUPS [GROUPS ...], --groups GROUPS [GROUPS ...]

files containing reference names (1 per line) or contigs file in fasta format

optional arguments:

-p PREFIX, --prefix PREFIX

prefix to apply to output files

-o OUT_FOLDER, --out_folder OUT_FOLDER

write to this folder (default: .)

--mix_bams use the same file for multiple bam files

--mix_groups use the same files for multiple groups

--mix_reads use the same files for paired/unpaired reads

--interleave interleave paired reads in ouput files

--mapping_quality MAPPING_QUALITY

mapping quality threshold

--use_secondary use reads marked with the secondary flag

--use_supplementary use reads marked with the supplementary flag

--max_distance MAX_DISTANCE

maximum allowable edit distance from query to reference (default: 1000)

--no_gzip do not gzip output files

--headers_only extract only (unique) headers

-v, --verbose be verbose

-t THREADS, --threads THREADS

maximum number of threads to use (default: 1)

-h, --help show this help message and exit

> bamm filter -h

$ bamm filter -h

usage: bamm filter -b BAMFILE [-o OUT_FOLDER]

[--mapping_quality MAPPING_QUALITY]

[--max_distance MAX_DISTANCE] [--length LENGTH]

[--percentage_id PERCENTAGE_ID]

[--percentage_aln PERCENTAGE_ALN] [--use_secondary]

[--use_supplementary] [-v] [-h]

Apply stringency filter to Bam file reads

required arguments:

-b BAMFILE, --bamfile BAMFILE

bam file to filter

optional arguments:

-o OUT_FOLDER, --out_folder OUT_FOLDER

write to this folder (output file has '_filtered.bam' suffix) (default: .)

--mapping_quality MAPPING_QUALITY

mapping quality threshold

--max_distance MAX_DISTANCE

maximum allowable edit distance from query to reference (default: 1000)

--length LENGTH minimum allowable query length

--percentage_id PERCENTAGE_ID

minimum base identity of mapped region (between 0 and 1)

--percentage_aln PERCENTAGE_ALN

minimum fraction of read mapped (between 0 and 1)

--use_secondary use reads marked with the secondary flag

--use_supplementary use reads marked with the supplementary flag

-v, --invert_match select unmapped reads

-h, --help show this help message and exit

コマンドとしての使用方法

1、bamm make mappingしてbamを作成

ペアエンドfastqを指定する。出力はmappedとし、8スレッド使う。bwaのアルゴリズムはmemを指定する。indexがすでにあっても上書きする。unmapのリードも出力する。

bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
 -p mapped -t 8 --alignment_algorithm mem -f -u

-i shuffled reads
-c paired files
-s singleton read files
-p prefix to apply to BAM/TAM files (None for reference name)
-o write to this folder (default: .)
--index_algorithm algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]
--alignment_algorithm algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)
-f force overwriting of index files if they are present
-t maximum number of threads to use (default: 1)
-m maximum amount of memory to use per samtools sort thread
-u Keep unmapped reads in BAM output

ペアエンド、シングルエンド、インターリーブのfastqをそれぞれ指定する。20スレッド指定する。unmapのリードは出力しない（default）。

bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
 -s unpaired.fastq.gz -i interleave.fq.gz \
 -p mapped -t 20 --alignment_algorithm mem -f

2、bamm parse カバレッジやインサートサイズ計算

それぞれのchromosomeのリードのカバレッジ、相対的なリードの方向、リードのインサートサイズ、リンク情報を出力する。

bamm parse -c covs.tsv -l links.tsv -i inserts.tsv -t 8 -b mapped.bam

-c filename to write coverage profiles to [default: don't calculate coverage]
-l filename to write pairing links to [default: don't calculate links]
-i filename to write bamfile insert distributions to [default: print to STDOUT]
-t maximum number of threads to use (default: 1)
-b bam files to parse

リンク情報は他のcontigやchrにまたがってmapingされるリード情報をもとにリンクされている配列を出力する（下の図はmanualより）。

インサートサイズ、相対的なリードの方向 (IN/OUT)を標準出力に出力する。

bamm parse -b mapped.bam

複数のbamのカバレッジを出力する。

bamm parse -c coverage.tsv -m pmean -b f1.bam f2.bam f3.bam

-m <pmean,opmean,tpmean,counts,cmean,pmedian,pvariance> how to calculate coverage (requires --coverages) (default: pmean)

計算モード（manualより）

opmean: Outlier pileup coverage: average of reads overlapping each base, after bases with coverage outside mean +/- 1 standard deviation have been excluded. The number of standard deviation used for the cutoff can be changed with --coverage_range.
pmean: Pileupcoverage:averageofnumberofreadsoverlappingeachbase
tpmean: Trimmed pileup coverage: average of reads overlapping each base, after bases with in the top and bottom 10% have been excluded. The 10% range can be changed using --coverage_range.
counts: Absolute number of reads mapping
cmean: Like'counts'exceptdividedbythelengthofthecontig
pmedian: Median pileup coverage: median of number of reads overlapping each base

３、extract bamからリードを出力

bamを指定する。他のコマンドと同様、複数bamを指定できる。

bamm extract -g ref.fasta -b mapped.bam -t 8

ペアエンドのbamを指定した場合、ペアエンドfastq、ペアエンドfastqでsingletonになったfastqの３つのファイルが出力される。ペアエンドfastqの順番は同期されている。出力はgzip圧縮される。

出力されたリードのヘッダーにはどのchr/contigにマッピングされていたか、ペアとしてマッピングされていたかなどを示す情報がつく（manualより）。

引用

GitHub - Ecogenomics/BamM: Metagenomics-focused BAM file manipulation

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

bamファイルを扱う bamM