Informatics on Mac

A collection of notes on informatics for HTS (NGS).

MappingQC: quality metrics for ribosome profiling

 

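MappingQC is a tool that generates a set of figures summarizing the mapping quality of ribosome profiling data. Specifically, it gives an overview of the P site offset calculation, the gene distribution, and the metagenic classification. In addition, MappingQC performs a thorough analysis of the triplet periodicity and the linked triplet phase (typical for ribosome profiling) in the canonical transcripts of your data. In particular, it examines how the phase distribution relates to RPF length, relative sequence position, and triplet identity.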

 

MappingQC is available as a Galaxy version (link) and as a stand-alone local version. This post briefly walks through using the local version.


Installation

Dependencies (from GitHub)

MappingQC relies on following Perl modules which have to be installed on your system:

  • DBI
  • Getopt::Long
  • Parallel::ForkManager
  • Cwd
  • Data::Dumper (for debugging purposes)

Furthermore, mappingQC relies on following Python2 modules which have to be installed on your system:

  • getopt
  • defaultdict (collections)
  • sqlite3
  • pandas
  • numpy
  • matplotlib (including pyplot, colors, cm, gridspec, ticker and mplot3d)
  • seaborn
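
Before running the tool, it can help to confirm that the modules listed above actually load in the active environment. The following is a minimal sketch, not part of MappingQC: the function names and the `PYTHON_BIN` variable are made up for illustration, and it assumes `perl` and a Python 2 interpreter are on the PATH. It prints the name of each module that fails to load and prints nothing when everything is available.

```shell
#!/usr/bin/env bash
# Sketch: list required modules that cannot be loaded in the current environment.
# PYTHON_BIN is a hypothetical override; it defaults to whatever `python` is on PATH.
PYTHON_BIN="${PYTHON_BIN:-python}"

# Print the names of Perl modules that fail to load, one per line.
missing_perl_modules() {
    for mod in "$@"; do
        # `perl -M<Module> -e 1` exits non-zero when the module cannot be loaded.
        perl -M"$mod" -e 1 >/dev/null 2>&1 || echo "$mod"
    done
}

# Print the names of Python modules that fail to import, one per line.
missing_python_modules() {
    for mod in "$@"; do
        "$PYTHON_BIN" -c "import $mod" >/dev/null 2>&1 || echo "$mod"
    done
}

# Module names are taken from the dependency lists above.
# Note: the Perl module is spelled Cwd (module names are case-sensitive).
missing_perl_modules DBI Getopt::Long Parallel::ForkManager Cwd Data::Dumper
missing_python_modules getopt collections sqlite3 pandas numpy matplotlib seaborn
```

If the bioconda package is used as described in this post, these modules should be pulled in as dependencies, so any name printed here points to a broken or inactive environment.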

GitHub

# bioconda (link). Here the package is installed into a conda environment named mappingqc
conda create -n mappingqc -c bioconda -y mqc python=2.7
conda activate mappingqc
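
After activating the environment, a quick PATH check shows whether the install succeeded. This is a small sketch with a hypothetical helper name (`missing_commands`); it only assumes that the package provides `mQC.pl`, the command run later in this post.

```shell
#!/usr/bin/env bash
# Sketch: report any expected command that is not found on PATH.
missing_commands() {
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# mQC.pl is the entry point used in this post; perl and python are its runtimes.
missing_commands mQC.pl perl python
```

No output means all three commands were found.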

map2bed.pl

$ map2bed.pl 

map2bed converts ART's map files to a BED file

 

USAGE: /usr/local/bin/map2bed.pl out_bed_file.bed in_map_file_1 [ in_map_file_2 ...]

 

(mqc) kazuma@kamisakumanoMBP:~/Downloads$ mQC.pl 

Working directory                                        : /Users/kazuma/Downloads

The following tmpfolder is used                          : /Users/kazuma/Downloads/tmp

 

 

MappingQC (Stand-alone version)

 

    MappingQC is a tool to easily generate some figures which give a nice overview of the quality of the mapping of ribosome profiling data. More specific, it gives an overview of the P site offset calculation, the gene distribution and the metagenic classification. Furthermore, MappingQC does a thorough analysis of the triplet periodicity and the linked triplet phase (typical for ribosome profiling) in the canonical transcript of your data. Especially, the link between the phase distribution and the RPF length, the relative sequence position and the triplet identity are taken into account.

        

    Input parameters:

    --help                  this helpful screen

    --work_dir              working directory to run the scripts in (default: current working directory)

    --experiment_name       customly chosen experiment name for the mappingQC run (mandatory)

    --samfile               path to the SAM/BAM file that comes out of the mapping script of PROTEOFORMER (mandatory)

    --cores                 the amount of cores to run the script on (integer, default: 5)

    --species               the studied species (mandatory)

    --ens_v                 the version of the Ensembl database you want to use

    --tmp                   temporary folder for storing temporary files of mappingQC (default: work_dir/tmp)

    --unique                whether to use only the unique alignments.

    Possible options: Y, N (default Y)

    --mapper                the mapper you used to generate the SAM file (STAR, TopHat2, HiSat2) (default: STAR)

    --maxmultimap           the maximum amount of multimapped positions used for filtering the reads (default: 16)

    --ens_db                path to the Ensembl SQLite database with annotation info. If you want mappingQC to download the right Ensembl database automatically for you, put in 'get' for this parameter (mandatory)

    --offset                the offset determination method.

                                Possible options:

                                - plastid: calculate the offsets with Plastid (Dunn et al. 2016)

                                - standard: use the standard offsets from the paper of Ingolia et al. (2012) (default option)

                                - from_file: use offsets from an input file

    --plastid_bam           the mapping bam file for Plastid offset generation (default: convert)

    --min_length_plastid    the minimum RPF length for Plastid offset generation (default 22)

    --max_length_plastid    the maximum RPF length for Plastid offset generation (default 34)

    --offset_file           the offsets input file

    --min_length_gd         minimum RPF length used for gene distributions and metagenic classification (default: 26).

    --max_length_gd         maximum RPF length used for gene distributions and metagenic classification (default: 34).

    --outfolder             the folder to store the output files (default: work_dir/mQC_output)

    --tool_dir              folder with necessary additional mappingQC tools. More information below in the dependencies section. (default: search for the default tool directory location in the active conda environment)

    --plotrpftool           the module that will be used for plotting the RPF-phase figure

                                Possible options:

                                - grouped2D: use Seaborn to plot a grouped 2D bar chart (default)

                                - pyplot3D: use mplot3d to plot a 3D bar chart. This tool can suffer sometimes from Escher effects, as it tries to plot a 3D plot with the 2D software of pyplot and matplotlib.

                                - mayavi: use the mayavi package to plot a 3D bar chart. This tool only works on local systems with graphical cards.

    --outhtml               custom name for the output HTML file (default: work_dir/mQC_experiment_name.html)

    --outzip                custom name for output ZIP file (default: work_dir/mQC_experiment_name.zip)

    

 

 

 

ERROR: do not forget the experiment name!

 

 

Usage

Specify the SAM file and the Ensembl database.

mQC.pl --experiment_name yourexperimentname --samfile yoursamfile.sam --cores 20 --species human --ens_v 86 --ens_db ENS_hsa_86.db --unique N --offset plastid --plastid_bam yourbamfile.bam --tool_dir mqc_tools
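
For a first test run, a shorter invocation that relies on the defaults from the help text may be enough. The sketch below only prints the command line (a dry run) so it can be inspected before execution. The function name `build_mqc_command` and all argument values are placeholders, and it assumes, based on the help text above, that `--ens_db get` triggers an automatic database download and that `--offset standard` and `--unique Y` are the defaults.

```shell
#!/usr/bin/env bash
# Sketch: print a minimal mQC.pl command line without executing it.
# Arguments: experiment name, SAM file, species, Ensembl version (all placeholders).
build_mqc_command() {
    # --ens_db get    : let mQC download the Ensembl SQLite database itself
    # --offset standard: use the Ingolia et al. (2012) offsets (the default)
    # --unique Y      : keep only unique alignments (the default)
    printf 'mQC.pl --experiment_name %s --samfile %s --species %s --ens_v %s --ens_db get --offset standard --unique Y\n' \
        "$1" "$2" "$3" "$4"
}

build_mqc_command test_run aligned.sam human 86
```

Copy the printed line into the terminal once the paths are correct. File names containing spaces would need quoting, which this sketch does not handle.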

 

References

https://github.com/Biobix/mQC