Informatics on Mac

A collection of notes on informatics for HTS (NGS).

pbmm2, an aligner for PacBio reads

 

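 pbmm2 is a SMRT C++ wrapper for the C API of minimap2. Its purpose is to support native PacBio input and output and to generate sorted output on the fly (combining several steps into one) with recommended parameter sets. When a BAM file is used as input to pbmm2, the sorted output can be used directly for polishing with GenomicConsensus. Benchmarks show that pbmm2 outperforms BLASR. pbmm2 is the official replacement for BLASR.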

 

 

Installation

Source: GitHub

This is PacBio's official repository.

# If you use Anaconda, pbmm2 can be installed with conda (macOS, Linux)
conda install -c bioconda -y pbmm2
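
After installation, a quick check can confirm that the binary is reachable from the current shell. The following is only a minimal sketch: the function name `check_pbmm2` is hypothetical and not part of pbmm2, and it relies only on the `--version` flag that appears in the help output.

```shell
# Hypothetical helper (not part of pbmm2): report whether pbmm2 is on PATH.
check_pbmm2() {
  if command -v pbmm2 >/dev/null 2>&1; then
    # --version is listed in pbmm2's own help output.
    echo "pbmm2 found: $(pbmm2 --version 2>&1 | head -n 1)"
  else
    echo "pbmm2 not found; try: conda install -c bioconda -y pbmm2" >&2
    return 1
  fi
}
```

Run `check_pbmm2` after activating the conda environment in which pbmm2 was installed.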

> pbmm2

$ pbmm2

pbmm2 - minimap2 with native PacBio BAM support

 

Usage:

  pbmm2 <tool>

 

Options:

  -h, --help   Output this help.

  --version    Output version info.

 

Tools:

    index      Index reference and store as .mmi file

    align      Align PacBio reads to reference sequences

 

Examples:

  pbmm2 align ref.referenceset.xml movie.subreadset.xml ref.movie.alignmentset.xml

  pbmm2 index ref.referenceset.xml ref.mmi

 

Typical workflows:

  A. Generate index file for reference and reuse it to align reads

    $ pbmm2 index ref.fasta ref.mmi

    $ pbmm2 align ref.mmi movie.subreads.bam ref.movie.bam

 

  B. Align reads and sort on-the-fly, with 4 alignment and 2 sort threads

    $ pbmm2 align ref.fasta movie.subreads.bam ref.movie.bam --sort -j 4 -J 2

 

  C. Align reads, sort on-the-fly, and create PBI

    $ pbmm2 align ref.fasta movie.subreadset.xml ref.movie.alignmentset.xml --sort

 

  D. Omit output file and stream BAM output to stdout

    $ pbmm2 align hg38.mmi movie1.subreadset.xml | samtools sort > hg38.movie1.sorted.bam

 

  E. Align CCS fastq input and sort on-the-fly

    $ pbmm2 align ref.fasta movie.Q20.fastq ref.movie.bam --preset CCS --sort --rg '@RG\tID:myid\tSM:mysample'

> pbmm2 index -h

$ pbmm2 index -h

Usage: pbmm2 index [options] <ref.fa|xml> <out.mmi>

Index reference and store as .mmi file

 

Basic Options:

  -h,--help                 Output this help.

  --version                 Output version information.

  --log-file                Log to a file, instead of stdout.

  --log-level               Set log level: "TRACE", "DEBUG", "INFO", "WARN", "FATAL". ["WARN"]

  -j,--num-threads          Number of threads to use, 0 means autodetection. [0]

 

Parameter Set Option:

  --preset                  Set alignment mode:

                             - "SUBREAD" -k 19 -w 10

                             - "CCS"  -k 19 -w 10 -u

                             - "ISOSEQ"  -k 15 -w 5 -u

                             - "UNROLLED" -k 15 -w 15

                            Default ["SUBREAD"]

 

Parameter Override Options:

  -k                        k-mer size (no larger than 28). [-1]

  -w                        Minizer window size. [-1]

  -u,--no-kmer-compression  Disable homopolymer-compressed k-mer (compression is activate for SUBREAD & UNROLLED presets).

 

Options:

  --emit-tool-contract      Emit tool contract.

  --resolved-tool-contract  Use args from resolved tool contract.

 

Arguments:

  ref.fa|xml                Reference FASTA, ReferenceSet XML

  out.mmi                   Output Reference Index

> pbmm2 align -h

$ pbmm2 align -h

Usage: pbmm2 align [options] <ref.fa|xml|mmi> <in.bam|xml|fa|fq> [out.aligned.bam|xml]

Align PacBio reads to reference sequences

 

Basic Options:

  -h,--help                  Output this help.

  --version                  Output version information.

  --log-file                 Log to a file, instead of stdout.

  --log-level                Set log level: "TRACE", "DEBUG", "INFO", "WARN", "FATAL". ["WARN"]

  --chunk-size               Process N records per chunk. [100]

 

Sorting Options:

  --sort                     Generate sorted BAM file.

  -m,--sort-memory           Memory per thread for sorting. ["768M"]

 

Threading Options:

  -j,--alignment-threads     Number of threads used for alignment, 0 means autodetection. [0]

  -J,--sort-threads          Number of threads used for sorting; 0 means 25% of -j, maximum 8. [0]

 

Parameter Set Options:

  --preset                   Set alignment mode:

                              - "SUBREAD" -k 19 -w 10 -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50 -r 2000 -L 0.5

                              - "CCS" -k 19 -w 10 -u -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50 -r 2000 -L 0.5

                              - "ISOSEQ" -k 15 -w 5 -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -C 5 -r 200000 -G 200000 -L 0.5

                              - "UNROLLED" -k 15 -w 15 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 2000 -L 0.5

                             Default ["SUBREAD"]

 

General Parameter Override Options:

  -k                         k-mer size (no larger than 28). [-1]

  -w                         Minizer window size. [-1]

  -u,--no-kmer-compression   Disable homopolymer-compressed k-mer (compression is activate for SUBREAD & UNROLLED presets).

  -A                         Matching score. [-1]

  -B                         Mismatch penalty. [-1]

  -z                         Z-drop score. [-1]

  -Z                         Z-drop inversion score. [-1]

  -r                         Bandwidth used in chaining and DP-based alignment. [-1]

 

Gap Parameter Override Options (a k-long gap costs min{o+k*e,O+k*E}):

  -o,--gap-open-1            Gap open penalty 1. [-1]

  -O,--gap-open-2            Gap open penalty 2. [-1]

  -e,--gap-extend-1          Gap extension penalty 1. [-1]

  -E,--gap-extend-2          Gap extension penalty 2. [-1]

  -L,--lj-min-ratio          Long join flank ratio. [-1]

 

IsoSeq Parameter Override Options:

  -G                         Max intron length (changes -r). [-1]

  -C                         Cost for a non-canonical GT-AG splicing. [-1]

  --no-splice-flank          Do not prefer splice flanks GT-AG.

 

Read Group Options:

  --sample                   Sample name for all read groups. Defaults, in order of precedence: SM field in input read group, biosample name, well sample name, "UnnamedSample".

  --rg                       Read group header line such as '@RG\tID:xyz\tSM:abc'. Only for FASTA/Q inputs.

 

Output Options:

  -c,--min-concordance-perc  Minimum alignment concordance in percent. [70]

  -l,--min-length            Minimum mapped read length in basepair. [50]

  -N,--best-n                Output at maximum N alignments for each read, 0 means no maximum. [0]

  --strip                    Remove all kinetic and extra QV tags. Output cannot be polished.

  --split-by-sample          One output BAM per sample.

  --no-bai                   Omit BAI generation for sorted output.

  --unmapped                 Include unmapped records in output.

 

Input Manipulation Options (mutually exclusive):

  --median-filter            Pick one read per ZMW of median length.

  --zmw                      Process ZMW Reads, subreadset.xml input required (activates UNROLLED preset).

  --hqregion                 Process HQ region of each ZMW, subreadset.xml input required (activates UNROLLED preset).

 

Options:

  --emit-tool-contract       Emit tool contract.

  --resolved-tool-contract   Use args from resolved tool contract.

 

Arguments:

  ref.fa|xml|mmi             Reference FASTA, ReferenceSet XML, or Reference Index

  in.bam|xml|fa|fq           Input BAM, DataSet XML, FASTA, or FASTQ

  out.aligned.bam|xml        Output BAM or DataSet XML
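
The gap cost formula in the help text above, min{o+k*e, O+k*E}, can be checked with plain shell arithmetic. The sketch below plugs in the SUBREAD preset values from the help output (o=5, e=4, O=56, E=1). The function name `gap_cost` is hypothetical; this only illustrates the formula and is not pbmm2's internal code.

```shell
# gap_cost K o e O E : cost of a K-long gap, min{o+K*e, O+K*E}
gap_cost() {
  local k=$1 o=$2 e=$3 O=$4 E=$5
  local c1=$(( o + k * e ))   # first piece: cheap to open, expensive to extend
  local c2=$(( O + k * E ))   # second piece: expensive to open, cheap to extend
  if [ "$c1" -le "$c2" ]; then echo "$c1"; else echo "$c2"; fi
}

# SUBREAD preset values taken from the help output: -o 5 -e 4 -O 56 -E 1
for k in 1 10 17 20 100; do
  echo "k=$k cost=$(gap_cost "$k" 5 4 56 1)"
done
```

With these values the first piece is cheaper for gaps shorter than 17 bases, the two pieces tie at 17 (both 73), and the second piece takes over beyond that, so long gaps are penalized less steeply than short ones.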

 

 

Usage

1. Indexing

pbmm2 index ref.fasta ref.mmi
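
Since the index only has to be built once per reference, a small wrapper can skip this step when the .mmi file already exists. This is a minimal sketch under assumptions: the function name `build_index`, the `PBMM2` variable, and the thread count are hypothetical, while `--preset` and `-j` are taken from the `pbmm2 index -h` output above. Because the index presets differ in -k, -w, and -u, it seems safest to build the index with the same preset that will be used for alignment.

```shell
# PBMM2 can be overridden, e.g. PBMM2=echo for a dry run that prints the arguments.
PBMM2=${PBMM2:-pbmm2}

# build_index REF MMI [PRESET] : build the index only when it is missing or empty
build_index() {
  local ref=$1 mmi=$2 preset=${3:-SUBREAD}
  if [ -s "$mmi" ]; then
    echo "index exists, skipping: $mmi" >&2
    return 0
  fi
  # Options are placed before the positional arguments, as in the Usage line.
  "$PBMM2" index --preset "$preset" -j 4 "$ref" "$mmi"
}
```

For example, `build_index ref.fasta ref.mmi CCS` would create `ref.mmi` on the first call and do nothing on later calls.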

 

2. Mapping

#Align CCS fastq input and sort bam output
pbmm2 align ref.fasta movie.Q20.fastq ref.movie.bam --preset CCS --sort --rg '@RG\tID:myid\tSM:mysample'

#Align reads and sort on-the-fly, with 4 alignment and 2 sort threads
pbmm2 align ref.fasta movie.bam ref.bam --sort -j 4 -J 2
  • --sort     Generate sorted BAM file.
  • --rg       Read group header line such as '@RG\tID:xyz\tSM:abc'. Only for FASTA/Q inputs.
  • -j         Number of threads used for alignment, 0 means autodetection. [0]
  • -J         Number of threads used for sorting; 0 means 25% of -j, maximum 8. [0]
  • --preset   Set alignment mode:
    - "SUBREAD" -k 19 -w 10 -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50 -r 2000 -L 0.5
    - "CCS" -k 19 -w 10 -u -o 5 -O 56 -e 4 -E 1 -A 2 -B 5 -z 400 -Z 50 -r 2000 -L 0.5
    - "ISOSEQ" -k 15 -w 5 -u -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -C 5 -r 200000 -G 200000 -L 0.5
    - "UNROLLED" -k 15 -w 15 -o 2 -O 32 -e 1 -E 0 -A 1 -B 2 -z 200 -Z 100 -r 2000 -L 0.5
    Default ["SUBREAD"]
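
To run the sorted alignment command on several movies, a loop can derive each output name from the input name. Again this is only a sketch: `align_all`, the `PBMM2` variable, and the output naming scheme are hypothetical, and the flags `--sort`, `-j`, and `-J` are the ones documented in the help output.

```shell
# PBMM2 can be overridden, e.g. PBMM2=echo for a dry run that prints the arguments.
PBMM2=${PBMM2:-pbmm2}

# align_all REF OUTDIR BAM... : align each input BAM and write a sorted BAM
align_all() {
  local ref=$1 outdir=$2
  shift 2
  local bam name
  for bam in "$@"; do
    name=$(basename "$bam" .bam)   # movie1.subreads.bam -> movie1.subreads
    "$PBMM2" align "$ref" "$bam" "$outdir/$name.aligned.bam" --sort -j 4 -J 2
  done
}
```

For example, `align_all ref.mmi results movie1.subreads.bam movie2.subreads.bam` would write `results/movie1.subreads.aligned.bam` and `results/movie2.subreads.aligned.bam`; the output directory is assumed to exist already.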

     

Reference

GitHub - PacificBiosciences/pbmm2: A minimap2 frontend for PacBio native data formats

 

Related

Structural Variant Detection in SMRT Link 5 with pbsv
