artemis.dq.plotlytool

Copyright © Her Majesty the Queen in Right of Canada, as represented by the Minister of Statistics Canada, 2019.

author

Mitchell Shahen (mitchell.shahen2@canada.ca)

history

Oct 3, 2019

This module contains five class objects performing various functions to:
  • Extract histogram and TDigest data from a metastore object.

  • Create dictionaries describing histogram plotting properties.

  • Create dictionaries describing TDigest CDF plotting properties.

  • Create Plotly figures from dictionaries of plotting properties.

  • Organize, save, and/or plot the Plotly figures.

Module Structure:

ProcessHist(histograms=None):
  • _create_dict(histogram=None, name='', address='')

  • _get_hist_obj(histogram=None)

  • _validate(histograms=None)

  • generate_collection(histogram=None, valid_name='', address='')

  • generate_traces()

ProcessTDigest(tdigests=None):
  • _calculate_cdf(tdigest=None, method='')

  • _create_dict(data=None, name='', address='')

  • _get_digest_map(tdigest=None)

  • _validate(tdigests=None)

  • get_centroids(digest_map=None)

  • generate_traces()

MergeTraces(traces=None, max_cols=0):
  • _validate(traces=None, max_cols=0)

  • combine(traces=None, names=None)

  • modify_coord(traces=None, max_cols=0)

  • modify_colours(traces=None)

  • merge()

BuildFigure(traces=None, figure_type=''):
  • _create_bar(traces=None, template=None)

  • _create_scatter(traces=None, template=None)

  • _validate(traces=None, figure_type='')

  • update_figure(figure=None, figure_type='')

  • generate_figure()

PlotlyTool(store=None, uuid=''):
  • _check_output(output='', check=True)

  • _list(store=None, uuid='')

  • _validate(store=None, uuid='')

  • get_figure(traces=None, output='', show=True, check=True, fig_type='')

  • visualize(output='', show=True, check=True)

This tool’s intended functionality includes extracting Histograms and TDigests from a dataset, locatable by a UUID code, within an input store, then saving, and possibly plotting, the histograms and TDigest CDFs as HTML files.

This can be done with the following code:

```
from artemis.dq.plotlytool import PlotlyTool

PlotlyTool(
    store=my_store, uuid=dataset_uuid
).visualize(
    output=path_to_directory, show=show_plots, check=check_with_user
)
```

Note that each of my_store, dataset_uuid, path_to_directory, show_plots, and check_with_user are defined as necessary.

Created in collaboration with:

collaborators

Ryan White (ryan.white4@canada.ca), Dominic Parent (dominic.parent@canada.ca), William Fairgrieve (william.fairgrieve@canada.ca), Russell Gill (russell.gill@canada.ca)

Module Contents

artemis.dq.plotlytool.REQ_HIST_TRACE_NAMES = ['all']
artemis.dq.plotlytool.REQ_TDIGEST_TRACE_NAMES = ['all']
artemis.dq.plotlytool.MAX_HIST_SUBPLOT_COLUMNS = 2
artemis.dq.plotlytool.MAX_TDIGEST_SUBPLOT_COLUMNS = 2
artemis.dq.plotlytool.CDF_ANALYSIS_METHOD = 'spline'
artemis.dq.plotlytool.BAR_TEMPLATE
artemis.dq.plotlytool.SCATTER_TEMPLATE
class artemis.dq.plotlytool.ProcessHist(histograms=None)

Class to create dictionaries of plotting properties from input histogram objects.

Parameters

histograms (google.protobuf.pyext._message.RepeatedCompositeContainer) – A Cronus protobuf object containing histogram data to be plotted.

static _create_dict(histogram=None, name='', address='')

Creates a dictionary containing plotting instructions. The instructions specify parameters to be interpreted when plotting the histogram. In addition to the histogram object, the histogram’s name and its location on the user’s computer are made available to the user when viewing the plot, which is useful for differentiating between plotted histograms.

Parameters
  • histogram (histogram_pb2.Histogram) – A Histogram object containing data and plotting information.

  • name (str) – The name of the histogram dataset.

  • address (str) – Location of the Cronus object originally containing the histogram’s data.

Returns

`trace` – Dictionary containing the histogram data and its plotting information.

Return type

dict
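
As a rough illustration of the kind of dictionary described above, the sketch below builds a bar-trace dictionary from plain lists. The function name `make_hist_trace`, the key names, and the hover-text layout are assumptions made for this example. They are not taken from the module's source, and the real method reads its data from a `histogram_pb2.Histogram` object rather than from lists.

```python
def make_hist_trace(bin_edges, counts, name="", address=""):
    """Hypothetical stand-in for a histogram plotting dictionary."""
    if len(bin_edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than bin counts")
    pairs = list(zip(bin_edges[:-1], bin_edges[1:]))
    return {
        "type": "bar",
        # Each bar is centred on its bin and spans the bin's width.
        "x": [(low + high) / 2 for low, high in pairs],
        "y": list(counts),
        "width": [high - low for low, high in pairs],
        "name": name,
        # Name and address are kept together so plots can be told apart.
        "text": f"{name} ({address})",
    }


trace = make_hist_trace(
    [0.0, 1.0, 2.0, 4.0], [5, 3, 2], name="age", address="/tmp/hists.dat"
)
```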

static _get_hist_obj(histogram=None)

A class method to extract histogram data using an input Cronus object. The Cronus object contains the address of a histogram dataset. The dataset is located using the address and loaded into a HistogramCollection object.

Parameters

histogram (cronus_pb2.CronusObject) – A Cronus protobuf object containing histogram data and information.

Returns

`output` – A tuple of a histogram object containing the histogram data from the Cronus object, as a HistogramCollection, and the address of the Cronus object.

Return type

tuple

static _validate(histograms=None)

Class method to validate the histograms parameter. The histogram container and interior histograms are checked. If the container is of an incorrect type, its histograms cannot be extracted. If the histograms are of the incorrect type, they cannot be used moving forward. In both cases, an empty list is returned rather than a list of histograms.

Parameters

histograms (google.protobuf.pyext._message.RepeatedCompositeContainer) – A Cronus object containing several histogram datasets.

Returns

`valid_histograms` – List of histograms from histograms that are of the proper type.

Return type

list

generate_collection(self, histogram=None, valid_name='', address='')

Generate list of histogram trace objects from input HistogramCollection. Each histogram in the collection whose name contains valid_name is extracted and passed to _create_dict. From _create_dict, a dictionary of plotting properties is produced, which is added to a list. Once the list is populated with the dictionaries of all histograms containing valid_name, the list is returned.

Parameters
  • histogram (histogram_pb2.HistogramCollection) – HistogramCollection object of every histogram from initial Cronus object.

  • valid_name (str) – Only traces whose name contains this parameter will be plotted.

  • address (str) – The location of Cronus object containing the histogram’s data.

Returns

`all_traces` – A list of all the traces from histograms whose name contains the valid_name parameter, each converted to dictionaries.

Return type

list
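
The name filter described for `generate_collection` can be pictured as a substring match over the collection. This is a minimal sketch that assumes the collection can be treated as a mapping from histogram name to data. `select_by_name` and the output keys are hypothetical and stand in for the calls to `_create_dict`.

```python
def select_by_name(histograms, valid_name="", address=""):
    """Keep entries whose name contains valid_name (hypothetical helper)."""
    return [
        {"name": name, "data": data, "address": address}
        for name, data in histograms.items()
        if valid_name in name
    ]


collection = {"age.raw": [1, 2], "age.clean": [1], "income": [3]}
age_traces = select_by_name(collection, valid_name="age", address="/tmp/hists.dat")
```

Under this sketch an empty `valid_name` matches every histogram, because the empty string is a substring of any name.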

generate_traces(self)

Generate a list of lists each containing histogram trace objects originating from the histograms container, histograms, passed to the class object. Only histograms that have been requested by name through REQ_HIST_TRACE_NAMES variable will have plotting dictionaries created and added to the list of lists. Each dictionary in a sub-list contains plotting information from the same histogram object. Therefore, if the input container contains only one histogram collection object, the output of this method will be a list with one element, a list of plotting dictionaries produced using the requested histograms from the single Histogram collection. If the list of requested histograms is invalid, all available histograms are used.

Returns

`traces` – List of lists of dictionaries containing the traces of the inputted histograms, with one sub-list per input histogram collection.

Return type

list
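
The fallback to all available histograms might look like the sketch below. What counts as an invalid request list is not spelled out in this page, so the rules used here (a non-list, an empty list, or a list containing `'all'`) are assumptions, as is the helper name `resolve_requested`.

```python
def resolve_requested(requested, available):
    """Return the names to plot, falling back to every available name."""
    # Assumption: anything that is not a non-empty list of names, or that
    # contains the sentinel 'all', selects every available histogram.
    if not isinstance(requested, list) or not requested or "all" in requested:
        return list(available)
    return [name for name in available if name in requested]


names = resolve_requested(["all"], ["age", "income"])
```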

class artemis.dq.plotlytool.ProcessTDigest(tdigests=None)

Class to create dictionaries describing TDigest data to be plotted. The methods in this class load TDigest datasets, produce TDigest maps, extract centroids, compute CDFs, and generate dictionaries of plotting instructions.

Parameters

tdigests (google.protobuf.pyext._message.RepeatedCompositeContainer) – A protobuf container object containing TDigests to be analyzed.

static _calculate_cdf(tdigest=None, method='')

Method to calculate the x-axis and y-axis data for a CDF plot. Percentile markers from the inputted TDigest object are used as x-axis values and are supplied to the TDigest’s cdf property to generate the y-axis values. Various methods can be used to manipulate the x-axis percentile values and the y-axis CDF values. The method parameter is used to indicate which analysis method to use, if any.

Parameters
  • tdigest (artemis.externals.tdigest.tdigest.TDigest) – TDigest object containing data used to calculate and plot CDF

  • method (str) – The type of analysis method used to generate the CDF data.

Returns

`x_data, y_data` – A tuple of two lists of data representing x-axis and y-axis of a CDF.

Return type

tuple
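
To make the percentile-then-CDF idea concrete, the sketch below does the same thing on a plain list of numbers. It uses a nearest-rank percentile and an exact empirical CDF in place of a TDigest, and it does not attempt the smoothing implied by `CDF_ANALYSIS_METHOD`. All function names here are hypothetical.

```python
import math
from bisect import bisect_right


def percentile(sorted_values, q):
    """Nearest-rank percentile of pre-sorted data, with q in [0, 100]."""
    if not sorted_values:
        raise ValueError("no data")
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def empirical_cdf(sorted_values, x):
    """Fraction of values less than or equal to x."""
    return bisect_right(sorted_values, x) / len(sorted_values)


def cdf_points(values, percentiles=(0, 25, 50, 75, 100)):
    """Return (x_data, y_data): percentile markers and the CDF at each one."""
    data = sorted(values)
    x_data = [percentile(data, q) for q in percentiles]
    y_data = [empirical_cdf(data, x) for x in x_data]
    return x_data, y_data


x_data, y_data = cdf_points([4, 1, 3, 2])
```

A TDigest approximates both steps from its centroids, so its values will generally differ slightly from this exact computation.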

static _create_dict(data=None, name='', address='')

Method to create a dictionary of plotting instructions. The instructions include data to plot, the type of plot, the colour scheme, and accompanying text.

Parameters
  • data (numpy.ndarray) – TDigest data intended to be plotted or included in the plot.

  • name (str) – The intended name of the data being plotted.

  • address (str) – The location of the Cronus object initially containing the TDigest data.

Returns

`trace` – A dictionary of centroid data and plotting properties.

Return type

dict

static _get_digest_map(tdigest=None)

A method to generate a TDigest map from an input Cronus object. The Cronus object contains the location of a TDigest dataset. This location is used to load the dataset into an empty TDigestBook object. The book is then decomposed into its centroid datasets, which are each added to a TDigest object. Each populated TDigest object is then used to populate a TDigest map. The map and the location of the TDigest dataset are both returned in a tuple.

Parameters

tdigest (cronus_pb2.CronusObject) – A Cronus protobuf object containing information on a TDigest dataset, including its location.

Returns

`output` – A tuple of a dictionary of TDigest datasets, each containing centroid data, and the location of the Cronus object initially containing the TDigest data.

Return type

tuple

static _validate(tdigests=None)

Method to validate the input tdigests parameter. It is expected that tdigests is a Protobuf Composite Container filled with Cronus objects. If the input parameter is not a container or contains no Cronus objects, an empty list is returned. Otherwise, a list of the container’s Cronus objects is returned. If the output list does not contain any Cronus objects, no data can be located and no TDigests will be available.

Parameters

tdigests (google.protobuf.pyext._message.RepeatedCompositeContainer) – A protobuf container object containing TDigests to be analyzed.

Returns

`valid_tdigests` – A protobuf container containing only validated TDigests.

Return type

google.protobuf.pyext._message.RepeatedCompositeContainer

get_centroids(self, digest_map=None, name='')

Class method for extracting centroid data from a TDigest map. The inputted digest_map is decomposed into its individual TDigest objects, which are passed to the _calculate_cdf method to produce a two-dimensional CDF dataset. Two properties are added to this dataset: the TDigest’s K value and Delta value. All four elements are added to a numpy.ndarray and appended to a list. Once populated with an array from each TDigest, the list is returned.

Parameters

digest_map (dict) – A dictionary of TDigest datasets each containing centroid data.

Returns

`digest_data` – A list containing arrays of two dimensions of centroid data and two additional descriptive statistics of the centroid data.

Return type

list

generate_traces(self)

Generate a list of lists containing dictionaries of plotting instructions. The plotting instructions include CDF data to be plotted, colour schemes, and the position in the subplot. Only requested TDigests are included in the list of lists. TDigests are requested by name through the REQ_TDIGEST_TRACE_NAMES list. Only TDigests whose name is present in this list will have a plotting dictionary created containing their data. Each dictionary in a sub-list contains plotting information from the same TDigest object. Therefore, if the input container contains only one TDigest object, the output of this method will be a list with one element, a list of plotting dictionaries produced using the requested TDigests from the single TDigest object. If the list of requested TDigests is invalid, all available TDigests are used.

Returns

`all_dicts` – A list of lists of traces containing centroids and plotting data.

Return type

list

class artemis.dq.plotlytool.MergeTraces(traces=None, max_cols=0)

Class to combine similarly named traces and modify various trace properties. The input traces are a list of lists of dictionaries containing plotting instructions and are organized based on their original object. Every dictionary in a sub-list is from the same HistogramCollection or TDigest. For plotting, histograms and TDigests will be organized by name, such that a subplot displays the same data, but from different objects. To do this, each sub-list must contain plotting dictionaries with the same name. Several methods in this class perform this substitution, while modifying the colour scheme and subplot coordinates accordingly.

Parameters
  • traces (list) – A list of lists containing traces from various protobufs.

  • max_cols (int) – The maximum number of columns in the intended, final subplots.

static _validate(traces=None, max_cols=0)

A method to validate the input traces, their contents, and the maximum number of subplot columns. The traces parameter is expected to be a list containing lists of plotting dictionaries. If this is not the case, the dictionaries cannot be accessed and properly re-organized, so an empty list is returned instead. The maximum number of subplot columns is expected to be an integer. If the parameter is not an integer or float, a default maximum value of 1 is used instead.

Parameters
  • traces (list) – A list of lists containing traces from various protobufs.

  • max_cols (int) – The maximum number of columns in the intended, final subplot.

Returns

`output` – Validated traces, as a list, and the validated number of columns as an integer.

Return type

tuple
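
A validation step along these lines could be sketched as follows. The handling of booleans, non-finite floats, and values below 1 is an assumption added for this example, and `validate_traces` is a hypothetical name.

```python
import math


def validate_traces(traces=None, max_cols=0):
    """Return (valid_traces, valid_cols) following the rules described above."""
    if isinstance(traces, list) and all(isinstance(sub, list) for sub in traces):
        # Keep only the dictionaries inside each sub-list.
        valid = [[t for t in sub if isinstance(t, dict)] for sub in traces]
    else:
        valid = []

    is_number = isinstance(max_cols, (int, float)) and not isinstance(max_cols, bool)
    # Assumption: non-finite values and values below 1 also fall back to 1.
    if is_number and math.isfinite(max_cols) and max_cols >= 1:
        cols = int(max_cols)
    else:
        cols = 1
    return valid, cols


valid, cols = validate_traces([[{"name": "age"}]], max_cols=2.0)
```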

static combine(traces=None, names=None)

A method to combine traces that share the same name. The list of all available names, names, is iterated through and each trace found to have the current name is added to a list. Once every trace named as the current name is added to the list, the list is appended to another list. Once every name has been iterated through, the output list will be a list of sub-lists with each sub-list containing traces with the same name.

Parameters
  • traces (list) – A list of lists containing traces from various protobufs.

  • names (list) – A list of each of the names found in traces.

Returns

`combined_hists` – A list of lists with each sublist containing similarly-named traces.

Return type

list
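
The regrouping described above amounts to one pass over the names with a scan of every sub-list for each name. A minimal sketch, assuming each trace is a dictionary with a `"name"` key (the key and the helper name `combine_by_name` are assumptions):

```python
def combine_by_name(traces, names):
    """Regroup a list of per-object sub-lists into per-name sub-lists."""
    combined = []
    for name in names:
        # Collect every trace with this name, whichever object it came from.
        same_name = [t for sub in traces for t in sub if t.get("name") == name]
        combined.append(same_name)
    return combined


by_object = [
    [{"name": "age", "source": "A"}, {"name": "income", "source": "A"}],
    [{"name": "age", "source": "B"}],
]
by_name = combine_by_name(by_object, ["age", "income"])
```

Here `by_name` holds two sub-lists: the two `age` traces from objects A and B, then the single `income` trace.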

static modify_coord(traces=None, max_cols=0)

A method to modify the subplot coordinates of each merged trace. The dictionaries in each sub-list of traces are intended to be plotted in the same subplots. Therefore, they must have the same row and col values. Each sub-list is isolated and each dictionary in the sub-list has its row and column values changed accordingly.

Parameters
  • traces (list) – A list of lists containing traces from various protobufs.

  • max_cols (int) – The maximum number of columns in the intended, final subplot.

Returns

`all_traces` – The list of traces with the row and column coordinates properly modified.

Return type

list
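
One way to derive shared subplot coordinates is to fill the grid row by row from each sub-list's index. The sketch below assumes that layout, along with the 1-based `row` and `col` values that Plotly subplots use. The key names and the helper name `assign_coords` are assumptions.

```python
def assign_coords(traces, max_cols=1):
    """Give every trace in a sub-list the same 1-based row and column."""
    if max_cols < 1:
        raise ValueError("max_cols must be at least 1")
    placed = []
    for index, sub in enumerate(traces):
        row, col = divmod(index, max_cols)  # fill the grid row by row
        placed.append([{**t, "row": row + 1, "col": col + 1} for t in sub])
    return placed


grid = assign_coords([[{"name": "a"}], [{"name": "b"}], [{"name": "c"}]], max_cols=2)
```

With three sub-lists and two columns, the sub-lists land at (1, 1), (1, 2), and (2, 1).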

static modify_colours(traces=None)

A method to modify the colour scheme of the bars in the subplots. It is intended that data from the same original histogram or TDigest object be of the same colour. Each trace is organized into sub-lists by its original object. Therefore, creating a list of colours and assigning a colour to every dictionary in each sub-list will ensure data from the same origin is always plotted as the same colour, even when organized by trace name.

Parameters

traces (list) – A list of lists containing traces from various protobufs.

Returns

`all_traces` – The list of traces with their individual colour schemes properly modified.

Return type

list
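
The colouring rule can be sketched by cycling through a palette, one colour per originating object. The palette values, the `marker.color` placement (the usual location for bar and scatter colours in Plotly traces), and the helper name `assign_colours` are assumptions for this example.

```python
from itertools import cycle

# Hypothetical palette; the module's actual colours are not documented here.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]


def assign_colours(traces, palette=PALETTE):
    """Colour every trace in a sub-list with that sub-list's colour."""
    coloured = []
    for sub, colour in zip(traces, cycle(palette)):
        coloured.append(
            [{**t, "marker": {**t.get("marker", {}), "color": colour}} for t in sub]
        )
    return coloured


coloured = assign_colours([[{"name": "age"}, {"name": "income"}], [{"name": "age"}]])
```

Because the colour is attached while traces are still grouped by origin, it travels with each trace when the traces are later regrouped by name.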

merge(self)

A method to validate, combine, and modify input traces based on naming similarities. Calls various methods from the MergeTraces class to appropriately merge traces based on their naming, while modifying each trace’s colour scheme and subplot coordinates.

Returns

`adj_all_traces` – The merged traces, grouped by name, with their colour schemes and subplot coordinates adjusted.

Return type

list

class artemis.dq.plotlytool.BuildFigure(traces=None, figure_type='')

Class object to generate a figure from a list of traces, each containing data and plotting properties. Contains various methods to split traces, add default properties, create figures, and update figures with new layouts.

Parameters
  • traces (list) – A list of traces (dictionaries) containing data and properties to be plotted.

  • figure_type (str) – The type of figure being generated.

static _create_bar(traces=None, template=None)

Class method to create a bar plot trace using two traces. Systematically replace properties of the template trace with properties from the user-defined trace as they appear. Only the properties from trace that are also in template are replaced to avoid defining unsupported properties.

Parameters
  • traces (list) – A list containing dictionaries of user-defined bar plot properties.

  • template (dict) – A dict object containing every bar plot property and its default value.

Returns

`output_traces` – A list of dictionaries each containing every bar plot property and its default value, save for the properties specified in each inputted trace.

Return type

list
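
The template substitution can be sketched as a guarded dictionary merge in which unknown keys are dropped. The helper name `apply_template` and the three-key template are assumptions; the module's `BAR_TEMPLATE` presumably lists many more properties.

```python
import copy


def apply_template(traces, template):
    """Overlay each trace on a copy of the template, ignoring unknown keys."""
    output = []
    for trace in traces:
        merged = copy.deepcopy(template)  # never mutate the shared template
        for key, value in trace.items():
            if key in merged:  # skip properties the template does not define
                merged[key] = value
        output.append(merged)
    return output


template = {"type": "bar", "opacity": 1.0, "name": ""}
bars = apply_template([{"name": "age", "opacity": 0.5, "bogus": 1}], template)
```

In this example `bars[0]` keeps the template's `type`, takes `name` and `opacity` from the user-defined trace, and drops `bogus`.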

static _create_scatter(traces=None, template=None)

Class method to create a scatter plot trace using two traces. Systematically replace properties of the template trace with properties from the user-defined trace as they appear. Only the properties from trace that are also in template are replaced to avoid defining unsupported properties.

Parameters
  • traces (list) – A list of dictionaries of user-defined scatter plot properties.

  • template (dict) – A dict object of every scatter plot property and its default value.

Returns

`output_traces` – A list of dictionaries each containing every scatter plot property and its default value, save for the properties specified in each inputted trace.

Return type

list

static _validate(traces=None, figure_type='')

Validate that traces and figure_type are each of the proper types. It is expected that the traces parameter is a list and the figure_type parameter is a string. If this is not the case, no traces can be made available and/or the figure type is invalid.

Parameters
  • traces (list) – A list of traces containing properties, as dictionaries, to be plotted.

  • figure_type (str) – The type of figure being generated.

Returns

`output` – A tuple of the validated traces list and figure type.

Return type

tuple

static update_figure(figure=None, figure_type='')

A class method to execute all the required updates to the layout of the figure object. The figure is updated with a variety of layout and axis features to improve the usability and readability of the saved and rendered subplots.

Parameters
  • figure (plotly.graph_objs._figure.Figure) – The Plotly Figure object to be updated.

  • figure_type (str) – The type of figure being generated.

Returns

`figure`

Return type

plotly.graph_objs._figure.Figure

generate_figure(self)

Class method to split the list of inputted traces, build a subplot figure from each list element, and combine each subplot into one complete figure.

Returns

`figure`

Return type

plotly.graph_objs._figure.Figure

class artemis.dq.plotlytool.PlotlyTool(store=None, uuid='')

Class object containing functions to call methods from each of the previous four classes.

Parameters
  • store (artemis.meta.cronus.BaseObjectStore) – A store object possibly containing Histogram and/or TDigest data.

  • uuid (string) – A string representing the unique identifier of a dataset.

static _check_output(output='', check=True)

Method to check if the requested output directory exists and create it, if necessary. If necessary, checks with the user if files and directories can be created and/or deleted to produce the requested directory in which all figure plots are to be saved.

Parameters
  • output (str) – The path to the directory intended to hold the outputted files.

  • check (boolean) – Whether the user’s permission is required before files or directories are created or deleted.

Returns

`proceed` – Indicate if the tool can proceed to plotting histograms and TDigests.

Return type

boolean
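
A directory check of this kind might be sketched as follows. The sketch only creates directories and never deletes anything, so it covers part of the behaviour described above. The prompt text, the injectable `ask` callable, and the helper name `ensure_output_dir` are assumptions.

```python
import os


def ensure_output_dir(output, check=True, ask=input):
    """Return True if `output` exists or was created, False if the user declined."""
    if os.path.isdir(output):
        return True
    if check:
        # Ask before touching the filesystem; anything but yes aborts.
        answer = ask(f"Create directory '{output}'? [y/N] ")
        if answer.strip().lower() not in ("y", "yes"):
            return False
    os.makedirs(output, exist_ok=True)
    return True
```

Passing `ask` as a parameter keeps the prompt replaceable, for example with `lambda prompt: "y"` in a non-interactive run. If `output` names an existing regular file, `os.makedirs` raises `FileExistsError`, which this sketch leaves to the caller.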

static _list(store=None, uuid='')

A method to list the histograms and TDigests in the inputted store and present at the specified ID. The output is a tuple of two elements: a container of histogram collection objects and a container of TDigest datasets.

Parameters
  • store (artemis.meta.cronus.BaseObjectStore) – A store object possibly containing histogram and/or TDigest data.

  • uuid (string) – A string representing the unique identifier of a dataset.

Returns

`output` – A tuple containing the histogram and TDigest containers from the store.

Return type

tuple

static _validate(store=None, uuid='')

A method to validate that the inputted store object and UUID are both of the proper type. It is expected that the store be a BaseObjectStore and the UUID be a string. If this is not the case, no store and/or UUID can be used due to incompatibility.

Parameters
  • store (artemis.meta.cronus.BaseObjectStore) – A store object possibly containing histogram and/or TDigest data.

  • uuid (string) – A string representing the unique identifier of a dataset.

Returns

`output` – A tuple containing the validated input parameters.

Return type

tuple

static get_figure(traces=None, output='', show=True, check=True, fig_type='')

A method to generate Plotly Figure objects, render them, and save them as HTML files in the directory specified by output. The generate_figure method of BuildFigure is called to generate the plots for histograms and TDigests, while the plot_save function imported from Plotly saves and/or renders the plots.

Parameters
  • traces (list) – A list of traces to be converted into a Plotly Figure object.

  • output (str) – The path to the directory intended to hold the outputted files.

  • check (boolean) – Whether the user’s permission is required before files or directories are created or deleted.

  • show (boolean) – A boolean indicating if the plots are rendered as well as saved.

  • fig_type (str) – The type of traces being provided in traces.

visualize(self, output='', show=True, check=True)

A method to perform various functions by calling several classes and their methods to validate parameters, create traces, build plots as figure objects, as well as save and/or render the plots.

Parameters
  • output (str) – The path to the directory intended to hold the outputted files.

  • show (boolean) – A boolean indicating if the plots are rendered as well as saved.

  • check (boolean) – Whether the user’s permission is required before files or directories are created or deleted.