BamMはBAMファイルを解析するpythonにラップされたcライブラリである。 このコードはPySam (link) のすべての機能を実装するものではないが、PySamよりも高速で安定したBAMファイルのインターフェースを提供することを目的としている。
HP
http://ecogenomics.github.io/BamM/
マニュアル
http://ecogenomics.github.io/BamM/manual/BamM_manual.pdf
python library
インストール
ubuntu16.04のminiconda2-4.0.5環境でテストした(docker使用、ホストOS ubuntu18.04)。
依存
- python2.7
- samtools
- bwa
本体 Github
#bioconda(link)
conda install -c bioconda bamm
#GNU Guix
guix package --install bamm
> bamm
...::: BamM :::...
Working with the BAM, not against it...
-------------------------------------------------------------------------
version: 1.7.3
-------------------------------------------------------------------------
bamm make -> Make BAM/TAM files (sorted + indexed)
bamm parse -> Get coverage profiles / linking reads / insert types
bamm extract -> Extract reads / headers from BAM files
bamm filter -> Filter BAM file reads
USE: bamm OPTION -h to see detailed options
> bamm make -h
$ bamm make -h
usage: bamm make -d DATABASE [-i INTERLEAVED [INTERLEAVED ...]]
[-c COUPLED [COUPLED ...]] [-s SINGLE [SINGLE ...]]
[-p PREFIX] [-o OUT_FOLDER]
[--index_algorithm INDEX_ALGORITHM]
[--alignment_algorithm ALIGNMENT_ALGORITHM] [--extras EXTRAS]
[-k] [-K] [-f] [--output_tam] [-u] [-t THREADS] [-m MEMORY]
[--temporary_directory TEMPORARY_DIRECTORY] [-h]
[--show_commands] [--quiet] [--silent]
make a BAM/TAM file (sorted + indexed)
required argument:
-d DATABASE, --database DATABASE
contigs to map onto (in fasta format)
reads to map (specify one or more arguments):
-i INTERLEAVED [INTERLEAVED ...], --interleaved INTERLEAVED [INTERLEAVED ...]
map interleaved sequence files (as many as you like) EX: -i reads1_interleaved.fq.gz reads2_interleaved.fna
-c COUPLED [COUPLED ...], --coupled COUPLED [COUPLED ...]
map paired sequence files (as many sets as you like) EX: -c reads1_1.fq.gz reads1_2.fq.gz reads2_1.fna reads2_2.fna
-s SINGLE [SINGLE ...], --single SINGLE [SINGLE ...]
map Single ended sequence files (as many as you like) EX: -s reads1_singles.fq.gz reads2_singles.fna
optional arguments:
-p PREFIX, --prefix PREFIX
prefix to apply to BAM/TAM files (None for reference name)
-o OUT_FOLDER, --out_folder OUT_FOLDER
write to this folder (default: .)
--index_algorithm INDEX_ALGORITHM
algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]
--alignment_algorithm ALIGNMENT_ALGORITHM
algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)
--extras EXTRAS extra arguments to use during mapping. Format is "<BWA_MODE1>:'<ARGS>',<BWA_MODE2>:'<ARGS>'"
-k, --keep keep all generated database index files (see also --kept)
-K, --kept assume database indices already exist (e.g. previously this script was run with -k/--keep)
-f, --force force overwriting of index files if they are present
--output_tam output TAM file instead of BAM file
-u, --keep_unmapped Keep unmapped reads in BAM output
-t THREADS, --threads THREADS
maximum number of threads to use (default: 1)
-m MEMORY, --memory MEMORY
maximum amount of memory to use per samtools sort thread (default 2GB). Suffix K/M/G recognized
--temporary_directory TEMPORARY_DIRECTORY
temporary directory for working with BAM files (default do not use)
-h, --help show this help message and exit
--show_commands show all commands being run
--quiet suppress output from the mapper
--silent suppress all output
Example usage:
Produce a sorted, indexed BAM file with reads mapped onto contigs.fa using 40 threads
bamm make -d contigs.fa -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -c reads3_1.fq.gz reads3_2.fq.gz -t 40
Produce a 3 sorted, indexed BAM files with reads mapped onto contigs.fa.gz
bamm make -d contigs.fa.gz -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -s reads4_singles.fq.gz
Extra arguments are passed on a "per-mode" basis using the format:
<BWA_MODE>:'<ARGS>'
For example, the argument:
--extras "mem:-k 25"
tells bwa mem to use a minimum seed length of 25bp.
Multiple modes are separated by commas. For example:
--extras "aln:-O 12 -E 3,sampe:-n 15"
Passes the arguments "-O 12 -E 3" to bwa aln and the arguments "-n 15" to bwa sampe.
********************************************************************************
*** WARNING ***
********************************************************************************
Values passed using --extras are not checked by BamM. This represents a
significant security risk if BamM is being run with elevated privileges.
Thus you should NEVER run 'bamm make' as root or some other powerful user,
ESPECIALLY if you are providing access to multiple / unknown users.
********************************************************************************
> bamm parse -h
$ bamm parse -h
usage: bamm parse -b BAMFILES [BAMFILES ...] [-c COVERAGES] [-l LINKS]
[-i INSERTS] [-n NUM_TYPES [NUM_TYPES ...]]
[-m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}]
[-r CUTOFF_RANGE [CUTOFF_RANGE ...]] [--length LENGTH]
[--base_quality BASE_QUALITY]
[--mapping_quality MAPPING_QUALITY]
[--max_distance MAX_DISTANCE] [--use_secondary]
[--use_supplementary] [-v] [-t THREADS] [-h]
get bamfile type and/or coverage profiles and/or linking reads
required argument:
-b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]
bam files to parse
optional arguments:
-c COVERAGES, --coverages COVERAGES
filename to write coverage profiles to [default: don't calculate coverage]
-l LINKS, --links LINKS
filename to write pairing links to [default: don't calculate links]
-i INSERTS, --inserts INSERTS
filename to write bamfile insert distributions to [default: print to STDOUT]
-n NUM_TYPES [NUM_TYPES ...], --num_types NUM_TYPES [NUM_TYPES ...]
number of insert/orientation types per BAM
-m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}, --coverage_mode {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}
how to calculate coverage (requires --coverages) (default: pmean)
-r CUTOFF_RANGE [CUTOFF_RANGE ...], --cutoff_range CUTOFF_RANGE [CUTOFF_RANGE ...]
range used to calculate upper and lower bounds when calculating coverage
--length LENGTH minimum Q length (default: 50)
--base_quality BASE_QUALITY
base quality threshold (Qscore) (default: 20)
--mapping_quality MAPPING_QUALITY
mapping quality threshold
--max_distance MAX_DISTANCE
maximum allowable edit distance from query to reference (default: 1000)
--use_secondary use reads marked with the secondary flag
--use_supplementary use reads marked with the supplementary flag
-v, --verbose be verbose
-t THREADS, --threads THREADS
maximum number of threads to use (default: 1)
-h, --help show this help message and exit
Example usage:
Calculate insert distribution, print to STDOUT
bamm parse -b my.bam
Calculate contig-wise coverage
bamm parse -b my.bam -c output_coverage.tsv
Calculate linking information on 2 bam files
bamm parse -b first.bam second.bam -l output_links.tsv
Coverage calculation modes:
* opmean: outlier pileup coverage: average of reads overlapping each base,
after bases with coverage outside mean +/- 1 standard deviation
have been excluded. The number of standard deviation used for the
cutoff can be changed with --cutoff_range.
* pmean: pileup coverage: average of number of reads overlapping each base
* tpmean: trimmed pileup coverage: average of reads overlapping each base,
after bases with in the top and bottom 10% have been excluded. The
10% range can be changed using --cutoff_range and should be
specified as a percentage (e.g., 15 for 15%).
* counts: absolute number of reads mapping
* cmean: like 'counts' except divided by the length of the contig
* pmedian: median pileup coverage: median of number of reads overlapping each
base
* pvariance: variance of pileup coverage: variance of number of reads
overlapping each base
The 'cutoff_range' variable is used for trimmed mean and outlier mean. This
parameter takes at most two values. The first is the lower cutoff and the
second is the upper. If only one value is supplied then lower == upper.
> bamm extract -h
$ bamm extract -h
usage: bamm extract -b BAMFILES [BAMFILES ...] -g GROUPS [GROUPS ...]
[-p PREFIX] [-o OUT_FOLDER] [--mix_bams] [--mix_groups]
[--mix_reads] [--interleave]
[--mapping_quality MAPPING_QUALITY] [--use_secondary]
[--use_supplementary] [--max_distance MAX_DISTANCE]
[--no_gzip] [--headers_only] [-v] [-t THREADS] [-h]
Extract reads which hit the given references
required arguments:
-b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]
bam files to parse
-g GROUPS [GROUPS ...], --groups GROUPS [GROUPS ...]
files containing reference names (1 per line) or contigs file in fasta format
optional arguments:
-p PREFIX, --prefix PREFIX
prefix to apply to output files
-o OUT_FOLDER, --out_folder OUT_FOLDER
write to this folder (default: .)
--mix_bams use the same file for multiple bam files
--mix_groups use the same files for multiple groups
--mix_reads use the same files for paired/unpaired reads
--interleave interleave paired reads in ouput files
--mapping_quality MAPPING_QUALITY
mapping quality threshold
--use_secondary use reads marked with the secondary flag
--use_supplementary use reads marked with the supplementary flag
--max_distance MAX_DISTANCE
maximum allowable edit distance from query to reference (default: 1000)
--no_gzip do not gzip output files
--headers_only extract only (unique) headers
-v, --verbose be verbose
-t THREADS, --threads THREADS
maximum number of threads to use (default: 1)
-h, --help show this help message and exit
> bamm filter -h
$ bamm filter -h
usage: bamm filter -b BAMFILE [-o OUT_FOLDER]
[--mapping_quality MAPPING_QUALITY]
[--max_distance MAX_DISTANCE] [--length LENGTH]
[--percentage_id PERCENTAGE_ID]
[--percentage_aln PERCENTAGE_ALN] [--use_secondary]
[--use_supplementary] [-v] [-h]
Apply stringency filter to Bam file reads
required arguments:
-b BAMFILE, --bamfile BAMFILE
bam file to filter
optional arguments:
-o OUT_FOLDER, --out_folder OUT_FOLDER
write to this folder (output file has '_filtered.bam' suffix) (default: .)
--mapping_quality MAPPING_QUALITY
mapping quality threshold
--max_distance MAX_DISTANCE
maximum allowable edit distance from query to reference (default: 1000)
--length LENGTH minimum allowable query length
--percentage_id PERCENTAGE_ID
minimum base identity of mapped region (between 0 and 1)
--percentage_aln PERCENTAGE_ALN
minimum fraction of read mapped (between 0 and 1)
--use_secondary use reads marked with the secondary flag
--use_supplementary use reads marked with the supplementary flag
-v, --invert_match select unmapped reads
-h, --help show this help message and exit
コマンドとしての使用方法
1、bamm make mappingしてbamを作成
ペアエンドfastqを指定する。 出力はmappedとし、8スレッド使う。bwaのアルゴリズムはmemを指定する。indexがすでにあっても上書きする。unmapのリードも出力する。
bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
-p mapped -t 8 --alignment_algorithm mem -f -u
- -i shuffled reads
- -c paired files
- -s singleton read files
- -p prefix to apply to BAM/TAM files (None for reference name)
- -o write to this folder (default: .)
- --index_algorithm algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]
- --alignment_algorithm algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)
- -f force overwriting of index files if they are present
- -t maximum number of threads to use (default: 1)
- -m maximum amount of memory to use per samtools sort thread
- -u Keep unmapped reads in BAM output
ペアエンド、シングルエンド、インターリーブのfastqをそれぞれ指定する。20スレッド指定する。unmapのリードは出力しない(default)。
bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
-s unpaired.fastq.gz -i interleave.fq.gz \
-p mapped -t 20 --alignment_algorithm mem -f
2、bamm parse カバレッジやインサートサイズ計算
それぞれのchromosomeのリードのカバレッジ、相対的なリードの方向、リードのインサートサイズ、リンク情報を出力する。
bamm parse -c covs.tsv -l links.tsv -i inserts.tsv -t 8 -b mapped.bam
- -c filename to write coverage profiles to [default: don't calculate coverage]
- -l filename to write pairing links to [default: don't calculate links]
- -i filename to write bamfile insert distributions to [default: print to STDOUT]
- -t maximum number of threads to use (default: 1)
- -b bam files to parse
リンク情報は他のcontigやchrにまたがってmapingされるリード情報をもとにリンクされている配列を出力する(下の図はmanualより)。
インサートサイズ、相対的なリードの方向 (IN/OUT)を標準出力に出力する。
bamm parse -b mapped.bam
複数のbamのカバレッジを出力する。
bamm parse -c coverage.tsv -m pmean -b f1.bam f2.bam f3.bam
- -m <pmean,opmean,tpmean,counts,cmean,pmedian,pvariance> how to calculate coverage (requires --coverages) (default: pmean)
計算モード(manualより)
- opmean: Outlier pileup coverage: average of reads overlapping each base, after bases with coverage outside mean +/- 1 standard deviation have been excluded. The number of standard deviation used for the cutoff can be changed with --coverage_range.
- pmean: Pileupcoverage:averageofnumberofreadsoverlappingeachbase
- tpmean: Trimmed pileup coverage: average of reads overlapping each base, after bases with in the top and bottom 10% have been excluded. The 10% range can be changed using --coverage_range.
- counts: Absolute number of reads mapping
- cmean: Like'counts'exceptdividedbythelengthofthecontig
- pmedian: Median pileup coverage: median of number of reads overlapping each base
3、extract bamからリードを出力
bamを指定する。他のコマンドと同様、複数bamを指定できる。
bamm extract -g ref.fasta -b mapped.bam -t 8
ペアエンドのbamを指定した場合、ペアエンドfastq、ペアエンドfastqでsingletonになったfastqの3つのファイルが出力される。ペアエンドfastqの順番は同期されている。出力はgzip圧縮される。
出力されたリードのヘッダーにはどのchr/contigにマッピングされていたか、ペアとしてマッピングされていたかなどを示す情報がつく(manualより)。
引用
GitHub - Ecogenomics/BamM: Metagenomics-focused BAM file manipulation