macでインフォマティクス

macでインフォマティクス

HTS (NGS) 関連のインフォマティクス情報についてまとめています。

bamファイルを扱う bamM

 

 BamMはBAMファイルを解析するpythonにラップされたcライブラリである。 このコードはPySam (link) のすべての機能を実装するものではないが、PySamよりも高速で安定したBAMファイルのインターフェースを提供することを目的としている。

 

HP

http://ecogenomics.github.io/BamM/

 マニュアル

http://ecogenomics.github.io/BamM/manual/BamM_manual.pdf

 python library

API Documentation

 

インストール

ubuntu16.04のminiconda2-4.0.5環境でテストした(docker使用、ホストOS ubuntu18.04)。

依存

  • python2.7
  • samtools
  • bwa

本体 Github

#bioconda(link
conda install -c bioconda bamm

#GNU Guix
guix package --install bamm

> bamm

 

                              ...::: BamM :::...

 

                    Working with the BAM, not against it...

 

   -------------------------------------------------------------------------

                                  version: 1.7.3

   -------------------------------------------------------------------------

 

    bamm make     ->  Make BAM/TAM files (sorted + indexed)

    bamm parse    ->  Get coverage profiles / linking reads / insert types

    bamm extract  ->  Extract reads / headers from BAM files

    bamm filter   ->  Filter BAM file reads

 

    USE: bamm OPTION -h to see detailed options

    

>  bamm make -h

$ bamm make -h

usage: bamm make -d DATABASE [-i INTERLEAVED [INTERLEAVED ...]]

                 [-c COUPLED [COUPLED ...]] [-s SINGLE [SINGLE ...]]

                 [-p PREFIX] [-o OUT_FOLDER]

                 [--index_algorithm INDEX_ALGORITHM]

                 [--alignment_algorithm ALIGNMENT_ALGORITHM] [--extras EXTRAS]

                 [-k] [-K] [-f] [--output_tam] [-u] [-t THREADS] [-m MEMORY]

                 [--temporary_directory TEMPORARY_DIRECTORY] [-h]

                 [--show_commands] [--quiet] [--silent]

 

make a BAM/TAM file (sorted + indexed)

 

required argument:

  -d DATABASE, --database DATABASE

                        contigs to map onto (in fasta format)

 

reads to map (specify one or more arguments):

  -i INTERLEAVED [INTERLEAVED ...], --interleaved INTERLEAVED [INTERLEAVED ...]

                        map interleaved sequence files (as many as you like) EX: -i reads1_interleaved.fq.gz reads2_interleaved.fna

  -c COUPLED [COUPLED ...], --coupled COUPLED [COUPLED ...]

                        map paired sequence files (as many sets as you like) EX: -c reads1_1.fq.gz reads1_2.fq.gz reads2_1.fna reads2_2.fna

  -s SINGLE [SINGLE ...], --single SINGLE [SINGLE ...]

                        map Single ended sequence files (as many as you like) EX: -s reads1_singles.fq.gz reads2_singles.fna

 

optional arguments:

  -p PREFIX, --prefix PREFIX

                        prefix to apply to BAM/TAM files (None for reference name)

  -o OUT_FOLDER, --out_folder OUT_FOLDER

                        write to this folder (default: .)

  --index_algorithm INDEX_ALGORITHM

                        algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]

  --alignment_algorithm ALIGNMENT_ALGORITHM

                        algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)

  --extras EXTRAS       extra arguments to use during mapping. Format is "<BWA_MODE1>:'<ARGS>',<BWA_MODE2>:'<ARGS>'"

  -k, --keep            keep all generated database index files (see also --kept)

  -K, --kept            assume database indices already exist (e.g. previously this script was run with -k/--keep)

  -f, --force           force overwriting of index files if they are present

  --output_tam          output TAM file instead of BAM file

  -u, --keep_unmapped   Keep unmapped reads in BAM output

  -t THREADS, --threads THREADS

                        maximum number of threads to use (default: 1)

  -m MEMORY, --memory MEMORY

                        maximum amount of memory to use per samtools sort thread (default 2GB). Suffix K/M/G recognized

  --temporary_directory TEMPORARY_DIRECTORY

                        temporary directory for working with BAM files (default do not use)

  -h, --help            show this help message and exit

  --show_commands       show all commands being run

  --quiet               suppress output from the mapper

  --silent              suppress all output

 

Example usage:

 

 Produce a sorted, indexed BAM file with reads mapped onto contigs.fa using 40 threads

   bamm make -d contigs.fa -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -c reads3_1.fq.gz reads3_2.fq.gz -t 40

 

 Produce a 3 sorted, indexed BAM files with reads mapped onto contigs.fa.gz

   bamm make -d contigs.fa.gz -i reads1_interleaved.fq.gz reads2_interleaved.fq.gz -s reads4_singles.fq.gz

 

Extra arguments are passed on a "per-mode" basis using the format:

 

    <BWA_MODE>:'<ARGS>'

 

For example, the argument:

 

    --extras "mem:-k 25"

 

tells bwa mem to use a minimum seed length of 25bp.

Multiple modes are separated by commas. For example:

 

    --extras "aln:-O 12 -E 3,sampe:-n 15"

 

Passes the arguments "-O 12 -E 3" to bwa aln and the arguments "-n 15" to bwa sampe.

 

********************************************************************************

*** WARNING ***

********************************************************************************

 

Values passed using --extras are not checked by BamM. This represents a

significant security risk if BamM is being run with elevated privileges.

Thus you should NEVER run 'bamm make' as root or some other powerful user,

ESPECIALLY if you are providing access to multiple / unknown users.

 

********************************************************************************

bamm parse -h

$ bamm parse -h

usage: bamm parse -b BAMFILES [BAMFILES ...] [-c COVERAGES] [-l LINKS]

                  [-i INSERTS] [-n NUM_TYPES [NUM_TYPES ...]]

                  [-m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}]

                  [-r CUTOFF_RANGE [CUTOFF_RANGE ...]] [--length LENGTH]

                  [--base_quality BASE_QUALITY]

                  [--mapping_quality MAPPING_QUALITY]

                  [--max_distance MAX_DISTANCE] [--use_secondary]

                  [--use_supplementary] [-v] [-t THREADS] [-h]

 

get bamfile type and/or coverage profiles and/or linking reads

 

required argument:

  -b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]

                        bam files to parse

 

optional arguments:

  -c COVERAGES, --coverages COVERAGES

                        filename to write coverage profiles to [default: don't calculate coverage]

  -l LINKS, --links LINKS

                        filename to write pairing links to [default: don't calculate links]

  -i INSERTS, --inserts INSERTS

                        filename to write bamfile insert distributions to [default: print to STDOUT]

  -n NUM_TYPES [NUM_TYPES ...], --num_types NUM_TYPES [NUM_TYPES ...]

                        number of insert/orientation types per BAM

  -m {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}, --coverage_mode {pmean,opmean,tpmean,counts,cmean,pmedian,pvariance}

                        how to calculate coverage (requires --coverages) (default: pmean)

  -r CUTOFF_RANGE [CUTOFF_RANGE ...], --cutoff_range CUTOFF_RANGE [CUTOFF_RANGE ...]

                        range used to calculate upper and lower bounds when calculating coverage

  --length LENGTH       minimum Q length (default: 50)

  --base_quality BASE_QUALITY

                        base quality threshold (Qscore) (default: 20)

  --mapping_quality MAPPING_QUALITY

                        mapping quality threshold

  --max_distance MAX_DISTANCE

                        maximum allowable edit distance from query to reference (default: 1000)

  --use_secondary       use reads marked with the secondary flag

  --use_supplementary   use reads marked with the supplementary flag

  -v, --verbose         be verbose

  -t THREADS, --threads THREADS

                        maximum number of threads to use (default: 1)

  -h, --help            show this help message and exit

 

Example usage:

 

 Calculate insert distribution, print to STDOUT

   bamm parse -b my.bam

 

 Calculate contig-wise coverage

   bamm parse -b my.bam -c output_coverage.tsv

 

 Calculate linking information on 2 bam files

   bamm parse -b first.bam second.bam -l output_links.tsv

 

Coverage calculation modes:

* opmean:    outlier pileup coverage: average of reads overlapping each base,

             after bases with coverage outside mean +/- 1 standard deviation

             have been excluded. The number of standard deviation used for the

             cutoff can be changed with --cutoff_range.

* pmean:     pileup coverage: average of number of reads overlapping each base

* tpmean:    trimmed pileup coverage: average of reads overlapping each base,

             after bases with in the top and bottom 10% have been excluded. The

             10% range can be changed using --cutoff_range and should be

             specified as a percentage (e.g., 15 for 15%).

* counts:    absolute number of reads mapping

* cmean:     like 'counts' except divided by the length of the contig

* pmedian:   median pileup coverage: median of number of reads overlapping each

             base

* pvariance: variance of pileup coverage: variance of number of reads

             overlapping each base

 

The 'cutoff_range' variable is used for trimmed mean and outlier mean. This

parameter takes at most two values. The first is the lower cutoff and the

second is the upper. If only one value is supplied then lower == upper.

bamm extract -h

$ bamm extract -h

usage: bamm extract -b BAMFILES [BAMFILES ...] -g GROUPS [GROUPS ...]

                    [-p PREFIX] [-o OUT_FOLDER] [--mix_bams] [--mix_groups]

                    [--mix_reads] [--interleave]

                    [--mapping_quality MAPPING_QUALITY] [--use_secondary]

                    [--use_supplementary] [--max_distance MAX_DISTANCE]

                    [--no_gzip] [--headers_only] [-v] [-t THREADS] [-h]

 

Extract reads which hit the given references

 

required arguments:

  -b BAMFILES [BAMFILES ...], --bamfiles BAMFILES [BAMFILES ...]

                        bam files to parse

  -g GROUPS [GROUPS ...], --groups GROUPS [GROUPS ...]

                        files containing reference names (1 per line) or contigs file in fasta format

 

optional arguments:

  -p PREFIX, --prefix PREFIX

                        prefix to apply to output files

  -o OUT_FOLDER, --out_folder OUT_FOLDER

                        write to this folder (default: .)

  --mix_bams            use the same file for multiple bam files

  --mix_groups          use the same files for multiple groups

  --mix_reads           use the same files for paired/unpaired reads

  --interleave          interleave paired reads in ouput files

  --mapping_quality MAPPING_QUALITY

                        mapping quality threshold

  --use_secondary       use reads marked with the secondary flag

  --use_supplementary   use reads marked with the supplementary flag

  --max_distance MAX_DISTANCE

                        maximum allowable edit distance from query to reference (default: 1000)

  --no_gzip             do not gzip output files

  --headers_only        extract only (unique) headers

  -v, --verbose         be verbose

  -t THREADS, --threads THREADS

                        maximum number of threads to use (default: 1)

  -h, --help            show this help message and exit

bamm filter -h

$ bamm filter -h

usage: bamm filter -b BAMFILE [-o OUT_FOLDER]

                   [--mapping_quality MAPPING_QUALITY]

                   [--max_distance MAX_DISTANCE] [--length LENGTH]

                   [--percentage_id PERCENTAGE_ID]

                   [--percentage_aln PERCENTAGE_ALN] [--use_secondary]

                   [--use_supplementary] [-v] [-h]

 

Apply stringency filter to Bam file reads

 

required arguments:

  -b BAMFILE, --bamfile BAMFILE

                        bam file to filter

 

optional arguments:

  -o OUT_FOLDER, --out_folder OUT_FOLDER

                        write to this folder (output file has '_filtered.bam' suffix) (default: .)

  --mapping_quality MAPPING_QUALITY

                        mapping quality threshold

  --max_distance MAX_DISTANCE

                        maximum allowable edit distance from query to reference (default: 1000)

  --length LENGTH       minimum allowable query length

  --percentage_id PERCENTAGE_ID

                        minimum base identity of mapped region (between 0 and 1)

  --percentage_aln PERCENTAGE_ALN

                        minimum fraction of read mapped (between 0 and 1)

  --use_secondary       use reads marked with the secondary flag

  --use_supplementary   use reads marked with the supplementary flag

  -v, --invert_match    select unmapped reads

  -h, --help            show this help message and exit

 

コマンドとしての使用方法

1、bamm make   mappingしてbamを作成

ペアエンドfastqを指定する。 出力はmappedとし、8スレッド使う。bwaのアルゴリズムはmemを指定する。indexがすでにあっても上書きする。unmapのリードも出力する。

bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
-p mapped -t 8 --alignment_algorithm mem -f -u
  • -i      shuffled reads
  • -c     paired files
  • -s     singleton read files
  • -p     prefix to apply to BAM/TAM files (None for reference name)
  • -o     write to this folder (default: .)
  • --index_algorithm   algorithm bwa uses for indexing 'bwtsw' or 'is' [None for auto]
  • --alignment_algorithm    algorithm bwa uses for alignment 'mem', 'bwasw' or 'aln' (default: mem)
  • -f      force overwriting of index files if they are present
  • -t      maximum number of threads to use (default: 1)
  • -m    maximum amount of memory to use per samtools sort thread
  • -u     Keep unmapped reads in BAM output

ペアエンド、シングルエンド、インターリーブのfastqをそれぞれ指定する。20スレッド指定する。unmapのリードは出力しない(default)。

bamm make -d ref.fasta -c pair_1.fastq.gz pair_2.fastq.gz \
-s unpaired.fastq.gz -i interleave.fq.gz \
-p mapped -t 20 --alignment_algorithm mem -f

 

 2、bamm parse  カバレッジやインサートサイズ計算

それぞれのchromosomeのリードのカバレッジ、相対的なリードの方向、リードのインサートサイズ、リンク情報を出力する。

bamm parse -c covs.tsv -l links.tsv -i inserts.tsv -t 8 -b mapped.bam
  • -c   filename to write coverage profiles to [default: don't calculate coverage]
  • -l    filename to write pairing links to [default: don't calculate links]
  • -i    filename to write bamfile insert distributions to [default: print to STDOUT]
  • -t    maximum number of threads to use (default: 1)
  • -b   bam files to parse

リンク情報は他のcontigやchrにまたがってmapingされるリード情報をもとにリンクされている配列を出力する(下の図はmanualより)。

f:id:kazumaxneo:20190616123753p:plain

 

インサートサイズ、相対的なリードの方向 (IN/OUT)を標準出力に出力する。

bamm parse -b mapped.bam

 複数のbamのカバレッジを出力する。

bamm parse -c coverage.tsv -m pmean -b f1.bam f2.bam f3.bam
  • -m <pmean,opmean,tpmean,counts,cmean,pmedian,pvariance>   how to calculate coverage (requires --coverages) (default: pmean

計算モード(manualより)

  1. opmean: Outlier pileup coverage: average of reads overlapping each base, after bases with coverage outside mean +/- 1 standard deviation have been excluded. The number of standard deviation used for the cutoff can be changed with --coverage_range.
  2. pmean: Pileupcoverage:averageofnumberofreadsoverlappingeachbase
  3. tpmean: Trimmed pileup coverage: average of reads overlapping each base, after bases with in the top and bottom 10% have been excluded. The 10% range can be changed using --coverage_range.
  4. counts: Absolute number of reads mapping
  5. cmean: Like'counts'exceptdividedbythelengthofthecontig
  6. pmedian: Median pileup coverage: median of number of reads overlapping each base

 

3、extract  bamからリードを出力

bamを指定する。他のコマンドと同様、複数bamを指定できる。

bamm extract -g ref.fasta -b mapped.bam -t 8

ペアエンドのbamを指定した場合、ペアエンドfastq、ペアエンドfastqでsingletonになったfastqの3つのファイルが出力される。ペアエンドfastqの順番は同期されている。出力はgzip圧縮される。

 

 出力されたリードのヘッダーにはどのchr/contigにマッピングされていたか、ペアとしてマッピングされていたかなどを示す情報がつく(manualより)。

f:id:kazumaxneo:20190616124832p:plain

  

引用

GitHub - Ecogenomics/BamM: Metagenomics-focused BAM file manipulation