macでインフォマティクス

A collection of notes on informatics for HTS (NGS) data.

Consensus calling with medaka

2020 3/23 Fixed an error in a command

2020 3/24 Added explanation

2020 10/10 Added tweet

 

 Documentation

 

 

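Features

  • Requires only basecalled data (.fasta or .fastq)
  • Improved accuracy over graph-based methods (e.g. Racon)
  • 50x faster than Nanopolish (and can run on GPUs)
  • Includes extras for implementing and training bespoke correction networks
  • Open source (Mozilla Public License 2.0)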

 

 

Installation

medaka runs on Linux. Here it was installed with bioconda on Ubuntu 18.04 LTS (inside Docker, without a GPU).

  • gcc
  • zlib1g-dev
  • libbz2-dev
  • liblzma-dev
  • libffi-dev
  • libncurses5-dev
  • make
  • wget
  • python3-all-dev
  • python-virtualenv
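
The packages listed above can be installed in one step with apt. The following is only a minimal sketch, assuming a Debian/Ubuntu system where you are already root (as in a Docker container); prefix the commands with sudo otherwise. The `run` helper and the `RUN` switch are hypothetical names introduced for this example: by default the script only prints the commands, and it executes them when `RUN=1` is set.

```shell
#!/bin/sh
set -eu

# run: print the command, and execute it only when RUN=1.
# RUN is a hypothetical switch introduced for this sketch.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# Package names copied from the list above.
PACKAGES="gcc zlib1g-dev libbz2-dev liblzma-dev libffi-dev libncurses5-dev make wget python3-all-dev python-virtualenv"

# Assumption: Debian/Ubuntu with apt-get available.
# $PACKAGES is left unquoted on purpose so that it splits into one
# argument per package.
run apt-get update
run apt-get install -y $PACKAGES
```

When installing through bioconda, as done below, these system packages may not be needed at all; they matter mainly for pip or source installs.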

Source: GitHub

#bioconda (link)
# Here medaka is installed into a conda environment named medaka-env.
conda create -n medaka-env -y
conda activate medaka-env
conda install -c conda-forge -c bioconda -y medaka

#pip
pip install medaka
# To enable the use of GPU resources it is necessary to install the
# tensorflow-gpu package. In outline this can be achieved with:
pip uninstall tensorflow
pip install tensorflow-gpu
# Note: the tensorflow-gpu package is compiled against a specific version of
# the NVIDIA CUDA library; see the tensorflow installation pages for details.

medaka -h

$ medaka -h
usage: medaka [-h] [--version]
              {compress_bam,features,train,consensus,smolecule,consensus_from_features,fastrle,stitch,variant,snp,methylation,tools}
              ...

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit

subcommands:
  valid commands

  {compress_bam,features,train,consensus,smolecule,consensus_from_features,fastrle,stitch,variant,snp,methylation,tools}
                        additional help
    compress_bam        Compress an alignment into RLE form.
    features            Create features for inference.
    train               Train a model from features.
    consensus           Run inference from a trained model and alignments.
    smolecule           Create consensus sequences from single-molecule reads.
    consensus_from_features
                        Run inference from a trained model on existing
                        features.
    fastrle             Create run-length encoded fastq (lengths in quality
                        track).
    stitch              Stitch together output from medaka consensus into
                        final output.
    variant             Decode probabilities to VCF.
    snp                 Decode probabilities to SNPs.
    methylation         methylation subcommand.
    tools               tools subcommand.

Many subcommands are available, but only consensus is covered here.

medaka consensus -h

$ medaka consensus -h
usage: medaka consensus [-h] [--debug | --quiet] [--batch_size BATCH_SIZE]
                        [--regions REGIONS [REGIONS ...]]
                        [--chunk_len CHUNK_LEN] [--chunk_ovlp CHUNK_OVLP]
                        [--model MODEL] [--disable_cudnn] [--threads THREADS]
                        [--check_output] [--save_features]
                        [--tag_name TAG_NAME] [--tag_value TAG_VALUE]
                        [--tag_keep_missing]
                        bam output

positional arguments:
  bam                   Input alignments.
  output                Output file.

optional arguments:
  -h, --help            show this help message and exit
  --debug               Verbose logging of debug information. (default: 20)
  --quiet               Minimal logging; warnings only). (default: 20)
  --batch_size BATCH_SIZE
                        Inference batch size. (default: 100)
  --regions REGIONS [REGIONS ...]
                        Genomic regions to analyse, or a bed file. (default:
                        None)
  --chunk_len CHUNK_LEN
                        Chunk length of samples. (default: 10000)
  --chunk_ovlp CHUNK_OVLP
                        Overlap of chunks. (default: 1000)
  --model MODEL         Model definition, default is equivalent to
                        r941_min_high_g344. {r941_min_fast_g303,
                        r941_min_high_g303, r941_min_high_g330,
                        r941_min_high_g344, r941_prom_fast_g303,
                        r941_prom_high_g303, r941_prom_high_g344,
                        r941_prom_high_g330, r10_min_high_g303,
                        r10_min_high_g340, r103_min_high_g345,
                        r941_prom_snp_g303, r941_prom_variant_g303,
                        r941_min_high_g340_rle} (default:
                        /Users/kazu/anaconda3/envs/medaka-
                        env/lib/python3.6/site-
                        packages/medaka/data/r941_min_high_g344_model.hdf5)
  --disable_cudnn       Disable use of cuDNN model layers. (default: False)
  --threads THREADS     Number of threads used by inference. (default: 1)
  --check_output        Verify integrity of output file after inference.
                        (default: False)
  --save_features       Save features with consensus probabilities. (default:
                        False)

filter tag:
  Filtering alignments by an integer valued tag.

  --tag_name TAG_NAME   Two-letter tag name. (default: None)
  --tag_value TAG_VALUE
                        Value of tag. (default: None)
  --tag_keep_missing    Keep alignments when tag is missing. (default: False)
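
As the help text shows, `medaka consensus` takes a BAM of reads aligned to the draft plus an output file, and `--regions` restricts the work to named contigs. The sketch below runs it once per contig and then merges the per-contig results with `medaka stitch`. Treat it as an illustration under assumptions: the file names (`calls_to_draft.bam`, `results/`, `polished.assembly.fasta`), the contig names, and the `run`/`RUN` dry-run helper are hypothetical, and the `medaka stitch <inputs...> <output>` argument order is assumed from medaka 0.11.x, so newer releases may differ (check `medaka stitch -h`).

```shell
#!/bin/sh
set -eu

# run: print the command, and execute it only when RUN=1.
# RUN is a hypothetical switch introduced for this sketch.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

BAM=calls_to_draft.bam      # hypothetical: reads aligned to the draft, sorted and indexed
OUTDIR=results              # hypothetical directory for per-contig results
MODEL=r941_min_high_g344    # default model named in the help text
CONTIGS="contig1 contig2"   # hypothetical contig names from the draft

consensus_by_contig() {
    run mkdir -p "$OUTDIR"
    for contig in $CONTIGS; do
        # One inference job per contig; all flags appear in `medaka consensus -h`.
        run medaka consensus "$BAM" "$OUTDIR/$contig.hdf" \
            --model "$MODEL" --batch_size 100 --threads 2 --regions "$contig"
    done
    # Assumption: stitch takes the result files followed by the output FASTA.
    run medaka stitch "$OUTDIR"/*.hdf polished.assembly.fasta
}

consensus_by_contig
```

Splitting by contig is mainly useful when several jobs can run side by side; for a single machine, the `medaka_consensus` wrapper described next is the simpler route.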

 

medaka_consensus -h

$ medaka_consensus -h

medaka 0.11.5
------------

Assembly polishing via neural networks. The input assembly should be
preprocessed with racon.

medaka_consensus [-h] -i <fastx>

    -h  show this help text.
    -i  fastx input basecalls (required).
    -d  fasta input assembly (required).
    -o  output folder (default: medaka).
    -m  medaka model, (default: r941_min_high_g344).
        Available: r941_min_fast_g303, r941_min_high_g303, r941_min_high_g330, r941_min_high_g344, r941_prom_fast_g303, r941_prom_high_g303, r941_prom_high_g344, r941_prom_high_g330, r10_min_high_g303, r10_min_high_g340, r103_min_high_g345, r941_prom_snp_g303, r941_prom_variant_g303, r941_min_high_g340_rle.
        Alternatively a .hdf file from 'medaka train'.
    -f  Force overwrite of outputs (default will reuse existing outputs).
    -t  number of threads with which to create features (default: 1).
    -b  batchsize, controls memory use (default: 100).

 

 

Usage

The input is a raw de novo assembly built with, for example, Canu or miniasm + Racon. Oxford Nanopore expects the draft to be an assembly that has already been polished with Racon.
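
Because the draft is expected to be Racon-polished, one Racon round before medaka might look like the sketch below. It assumes `minimap2` and `racon` are installed and on PATH. The file names, the thread count, and the `run_to`/`RUN` dry-run helper are hypothetical, and the Racon scoring values (`-m 8 -x -6 -g -8 -w 500`) are the ones I recall from the medaka documentation, so verify them for your version before relying on them.

```shell
#!/bin/sh
set -eu

# run_to FILE CMD...: print the command, and run it with stdout redirected
# to FILE only when RUN=1 (RUN is a hypothetical switch for this sketch).
run_to() {
    out=$1
    shift
    echo "+ $* > $out"
    if [ "${RUN:-0}" = "1" ]; then "$@" > "$out"; fi
}

THREADS=4   # hypothetical thread count

# racon_round READS RAW_ASSEMBLY POLISHED_OUT
racon_round() {
    reads=$1
    raw=$2
    polished=$3
    # Map the reads back to the raw assembly (minimap2 writes PAF by default).
    run_to reads_to_raw.paf minimap2 -x map-ont -t "$THREADS" "$raw" "$reads"
    # Racon argument order: <sequences> <overlaps> <target sequences>.
    run_to "$polished" racon -m 8 -x -6 -g -8 -w 500 -t "$THREADS" \
        "$reads" reads_to_raw.paf "$raw"
}

# raw-assembly.fa is a hypothetical name for the unpolished assembly.
racon_round basecalled.fa raw-assembly.fa draft-assembly.fa
```

The resulting `draft-assembly.fa` is then what gets passed to `medaka_consensus -d`.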

medaka_consensus -i basecalled.fa -d draft-assembly.fa -o output
  • -i     fastx input basecalls (required).
  • -d    fasta input assembly (required).
  • -o    output folder (default: medaka).
  • -m    Model definition (default: r941_min_high_g344_model.hdf5)

 Results are written to the specified directory.
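
To make the call above more explicit, the sketch below passes the model and thread count and then looks for the polished sequence in the output directory. The `consensus.fasta` file name is an assumption based on medaka releases from this period; the thread count and the `run`/`RUN` dry-run helper are hypothetical.

```shell
#!/bin/sh
set -eu

# run: print the command, and execute it only when RUN=1.
# RUN is a hypothetical switch introduced for this sketch.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = "1" ]; then "$@"; fi
}

# polish READS DRAFT OUTDIR
polish() {
    reads=$1
    draft=$2
    outdir=$3
    # -m and -t are documented in `medaka_consensus -h`.
    run medaka_consensus -i "$reads" -d "$draft" -o "$outdir" \
        -m r941_min_high_g344 -t 4
    # Assumption: the polished assembly is written as consensus.fasta.
    if [ -s "$outdir/consensus.fasta" ]; then
        echo "contigs: $(grep -c '^>' "$outdir/consensus.fasta")"
    else
        echo "no consensus.fasta in $outdir yet"
    fi
}

polish basecalled.fa draft-assembly.fa output
```

The model should match how the reads were produced. As far as I understand the naming, `r941_min_high_g344` means R9.4.1 pore, MinION, high-accuracy basecalling, Guppy 3.4.4, so pick the entry from the "Available" list that is closest to your own run.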

 

Reference

medaka/README.md at master · nanoporetech/medaka · GitHub

 

Related