Data and Performance Analysis

Introduction

This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the advanced model approach, but some (e.g., reporting model status) is still helpful for PyTorch simple modeling.

Importing and Preparing Data

Data can be imported into a modeling context using the BOSS Unified Dataspace (UDS) API and the BOSS Training Resources API (boss.core.lib.uds, boss.core.uds.util, and boss.core.lib.training_resources). These libraries provide functions for creating datasets in various formats (TensorFlow, PyTorch, Dask dataframe) based on BOSS virtual datasets defined in the No Code client. They also provide the capability to retrieve previously trained word embeddings.

The BOSS functions for handling data are listed below. Some are used for straightforward data importing (e.g., get_dataframe, get_features_and_labels), while others are used for preparing framework-specific datasets for AI models (e.g., get_tf_dataset for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the BOSS Model Shop GitLab project for examples of how to use the functions for developing AI models.

boss.core.lib.uds

  • get_asset
  • get_tf_dataset
  • get_tf_dataset_image
  • get_tf_dataset_text

boss.core.uds.util

  • get_dataframe

boss.core.lib.training_resources

  • get_features_and_labels
  • get_tensorflow_feature_and_labels

Retrieving your data: boss.core.lib.training_resources

To get your training, evaluation, and testing Dask dataframes, use get_features_and_labels, or get_tensorflow_feature_and_labels if you are using TensorFlow. The simple modeling framework does this for you, but if you’re designing an advanced model, these are the functions to use to retrieve your data; simply pass in the args that the platform provides to your main() function.

  • get_features_and_labels
  • get_tensorflow_feature_and_labels

Here’s an example of each function. get_features_and_labels returns a data dataframe, a labels dataframe, and an elastic ids dataframe for each of the training, evaluation, and testing splits, nine values in total.

train_df, train_lb, train_eid, eval_df, eval_lb, eval_eid, test_df, test_lb, test_eid = get_features_and_labels(args)
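
Unpacking nine positional values is easy to get wrong, so it can help to group them by split first. The following is a minimal sketch, assuming only the return order shown above; Split and group_splits are hypothetical helper names, not part of the BOSS API, and placeholder strings stand in for the real dataframes.

```python
from typing import Any, Dict, NamedTuple, Sequence


class Split(NamedTuple):
    """Features, labels, and elastic ids for one data split (hypothetical helper)."""
    features: Any
    labels: Any
    elastic_ids: Any


def group_splits(result: Sequence[Any]) -> Dict[str, Split]:
    """Group a flat 9-item result into train / eval / test splits."""
    if len(result) != 9:
        raise ValueError(f"expected 9 items, got {len(result)}")
    names = ("train", "eval", "test")
    # Each split occupies three consecutive positions: features, labels, ids.
    return {name: Split(*result[i * 3:(i + 1) * 3]) for i, name in enumerate(names)}


# Placeholder strings stand in for the dataframes the real call would return.
fake_result = ("train_df", "train_lb", "train_eid",
               "eval_df", "eval_lb", "eval_eid",
               "test_df", "test_lb", "test_eid")
splits = group_splits(fake_result)
print(splits["eval"].labels)
```

In an advanced model, the same helper could wrap the real return value, for example group_splits(get_features_and_labels(args)).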

The TensorFlow version is slightly different. It returns lists of Dask delayed objects, each list containing two items: features and labels. There is one list for each of training, evaluation, and testing. The testing labels are also returned separately for ease of use, along with the number of features. Lastly, a list of Elasticsearch ids matching your data is returned. This list contains three items: a Dask delayed object for each of training, evaluation, and testing.

delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features, \
    delayed_eids = get_tensorflow_feature_and_labels(args)

Working with assets

Text models may use embeddings trained within the BOSS platform, which are stored as assets. They can be selected when starting a training run from within the platform and retrieved within your advanced model using the uds module as follows:

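from boss.core.lib import uds
from boss.core.lib.training_resources import get_features_and_labels

def main(args):
    # ID of the embedding asset selected when the training was started
    asset_id = args['asset_id']
    embeddings_index, embedding_matrix, embedding_size, word_index_mapping, pad_value = uds.get_asset(asset_id)
    train_df, _, _, eval_df, _, _, test_df, _, _ = get_features_and_labels(args, word_index_mapping=word_index_mapping)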

Working with TensorFlow

When creating TensorFlow Estimators, you must create TrainSpec and EvalSpec objects, and these require input functions passed as input_fn. The boss.core.lib.uds functions listed above take care of this for you: get_tf_dataset_image works for image datasets, get_tf_dataset works for tabular data, and get_tf_dataset_text works for text data. Here is an example for a text dataset:

# Model function instantiating a TensorFlow Estimator
def model(training_data, evaluation_data, training_steps, log_dir, embedding_matrix, embedding_size, word_index_mapping,
          max_document_length, pad_value, tid):
    # Define the feature columns for inputs.
    target_type = tf.int32
    train_spec = tf.estimator.TrainSpec(
            input_fn=lambda: uds.get_tf_dataset_text(feature_label, training_data, word_index_mapping, pad_value,
                                                     max_document_length,
                                                     target_type).repeat(count=None).shuffle(100).batch(int(100)),
            max_steps=training_steps,
            hooks=[lml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10, last_epoch=training_steps)])

# Prepare vds data for modeling
delayed_values_training, delayed_values_evaluation, delayed_values_testing, testing_labels, _, e_ids = \
    get_tensorflow_feature_and_labels(args)

# Get asset/embedding
embeddings_index, embedding_matrix, embedding_size, word_index_mapping, pad_value = uds.get_asset(asset_id)

# Instantiate the model!
_estimator, train_spec, eval_spec, target_type, serving_input_fn = model(delayed_values_training, 
        delayed_values_evaluation, training_steps, log_dir, embedding_matrix,
        embedding_size, word_index_mapping, max_document_length, pad_value, tid)

Retrieve a single DataFrame

Sometimes all you need is a dataframe with data. That’s what get_dataframe in boss.core.uds.util is for. Pass it a vds ID and it returns:

  • A dataframe with data
  • The number of features
  • A list of elastic ids

training_data, num_features, elastic_ids = get_dataframe(args['vds']['training'])

Important notes for implementing multi-class modeling

TensorFlow offers different approaches to building multi-class models. Two prominent ones are pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and general techniques such as Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), pass the num_classes parameter when calling the relevant data functions (e.g., uds.get_tf_dataset). Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier) do not require explicitly one-hot encoded data labels for non-binary modeling, so the num_classes argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with PyTorch or XGBoost.
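
For readers unfamiliar with one-hot encoding, the sketch below shows what the encoded labels look like for a three-class problem. It is a plain-Python illustration of the concept only; one_hot is a hypothetical helper, and the source does not describe how the BOSS data functions perform the encoding internally.

```python
from typing import List


def one_hot(labels: List[int], num_classes: int) -> List[List[int]]:
    """Encode integer class labels as one-hot rows of length num_classes."""
    encoded = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside 0..{num_classes - 1}")
        row = [0] * num_classes
        row[label] = 1  # a single 1 marks the position of the class
        encoded.append(row)
    return encoded


# Three samples belonging to classes 0, 2, and 1 respectively.
encoded = one_hot([0, 2, 1], num_classes=3)
print(encoded)
```

Each row has as many entries as the output layer has nodes, which is why num_classes must match the model definition.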

Analyzing Model Performance

Post-training performance analysis tasks are supported by the BOSS Machine Learning (ML), Confusion Matrix, and Plotting APIs (boss.core.lib.ml, boss.core.lib.confusion_matrix, boss.core.lib.plotter, boss.core.lib.plot). These libraries provide functions that automatically execute and report critical performance analysis tasks (e.g., creating confusion matrices, ROC curves, cluster scatter plots, line graphs), so developers do not need to write such code repeatedly. The tables and plots created by these library functions can be viewed in the No Code client after the entire model training process has completed.

The BOSS ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.

boss.core.lib.ml

The ml module contains methods for generating predictions easily from your models. They can be used within advanced models. You can see example usage in the Model Shop.

  • get_predictions_classification_pt
  • get_predictions_classification_tf
  • get_predictions_classification_sklearn
  • get_predictions_classification_xg
  • get_predictions_regression_pt
  • get_predictions_regression_tf
  • get_predictions_regression_sklearn
  • get_predictions_regression_xg

Submitting Performance Analysis Results

Trained models and metadata can be uploaded to the BOSS backend via the boss.core.internal.train.update function. The following piece of example code illustrates how to use the function.

# Store model graph and performance stats back to BOSS back-end
model_filename = uds.zip_model_tf(_estimator, serving_input_receiver_fn, model_id, graph_version, log_dir)
with open(model_filename, 'rb') as graph_file:
    train.update({tid: {
        'performance': {
            'rmse': rmse,
            'mae': mae,
            'r2': r2
        },
        'ordered_feature_names': ['x'],
        'ordered_class_names': ['flower.sepal_width'],
        'output_name': 'outputs',
        'input_name': 'inputs',
        'graph_version': graph_version,
        'graph_filename': 'model.zip',
        'graph_file': graph_file.read(),
    }})

train.update takes a Python dictionary as its argument, with the train_id (described in Table 1 in Developing Advanced Models) as the top-level key; tid represents the train_id in the code snippet above. The secondary keys graph_version and graph_file store the graph version and the trained graph file (model), respectively. The secondary key performance stores another dictionary of performance values. There is no restriction on the key-value pairs here: the developer chooses the performance values, and they will be viewable in the No Code client afterward. The values shown in the code snippet above (rmse, mae, r2) are customary for evaluating regression models. Again, see the example models in The BOSS Model Shop for more insights.
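
The rmse, mae, and r2 entries in the snippet have to be computed by the developer before the dictionary is submitted. The sketch below shows one way to compute them in plain Python and to assemble the nested dictionary shape described above. regression_metrics and build_update_payload are hypothetical helpers, the train id "example-tid" and the bytes value are made up for illustration, and nothing here calls the real train.update function.

```python
import math
from typing import Any, Dict, List


def regression_metrics(actuals: List[float], predictions: List[float]) -> Dict[str, float]:
    """Compute RMSE, MAE, and R^2 for paired actual and predicted values."""
    n = len(actuals)
    if n == 0 or n != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    errors = [a - p for a, p in zip(actuals, predictions)]
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    mean_actual = sum(actuals) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actuals)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when all actual values are identical.
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
    }


def build_update_payload(tid: str, performance: Dict[str, float],
                         graph_version: str, graph_bytes: bytes) -> Dict[str, Any]:
    """Assemble the nested dictionary: train id on top, secondary keys beneath."""
    return {tid: {
        "performance": performance,
        "graph_version": graph_version,
        "graph_filename": "model.zip",
        "graph_file": graph_bytes,
    }}


metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
payload = build_update_payload("example-tid", metrics, "1", b"zip-bytes")
print(sorted(payload["example-tid"]["performance"]))
```

Because the performance dictionary is unrestricted, other metrics (for example accuracy or F1 for a classifier) could be placed under the same key.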

Enabling Model Explainability

To enable a trained model to be used by the explainability tool in the No Code client, some parameters must be defined. For TensorFlow models, ordered_feature_names, ordered_class_names, input_name, and output_name must be defined:

  • ordered_feature_names (not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see example code in The BOSS Model Shop.
  • ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model for which the labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose).
  • input_name is the name of the input layer in your TensorFlow model to which your ordered_feature_names data will be passed.
  • output_name is the name of the output layer in your TensorFlow model (by default these can be named things like ‘dense_2’ and ‘scores’). The output_name is used to retrieve your model outputs in the proper format for explanation.

PyTorch, Scikit-learn, and XGBoost models only require that ordered_class_names be provided.
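
Since ordered_class_names must follow the integer order of the model outputs, it can be safer to derive the list from an integer-to-name mapping than to type it by hand. This is a minimal sketch under that assumption; the function below is a hypothetical helper, not a BOSS API.

```python
from typing import Dict, List


def ordered_class_names(label_mapping: Dict[int, str]) -> List[str]:
    """Return class names sorted by their integer label (0, 1, 2, ...)."""
    expected = list(range(len(label_mapping)))
    if sorted(label_mapping) != expected:
        raise ValueError("class indices must run from 0 to n-1 with no gaps")
    return [label_mapping[index] for index in expected]


# The mapping may be written in any order; the result follows the integer labels.
names = ordered_class_names({1: "positive", 0: "negative"})
print(names)
```

The gap check catches mappings that skip an index, which would otherwise shift every later class name by one position.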

Plots

BOSS allows users to plot associated model training metrics in the No Code client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs. Two classes exist to aid in plotting:

  • Plotter
  • Plot

Plotter

The Plotter is a management class: Plot objects are added to it for storage and for writing to the BOSS platform. Example use is shown below:

from boss.core.lib.plotter import Plotter
from boss.core.lib.plot import Plot

# Instantiate Plotter
pltr = Plotter(train_id)

# Instantiate Plots
loss_plot = Plot("Average Loss", x_label="Epoch", y_label="Average Loss",
                description="Average loss throughout training epochs.")

accuracy_plot = Plot("Validation Accuracy", x_label="Epoch", y_label="Accuracy",
                    description="Validation Accuracy")

Throughout training, lines may be added to a Plot object, each designated by name. Note that adding a line under a name that already exists in the Plot overwrites the existing line. Adding a named Plot to a Plotter likewise overwrites any existing Plot of that name, which comes in handy for updating the same line iteratively. Here is an example (x and y are 1-dimensional lists):

loss_plot.add_line(x, y, name="Average Loss")
pltr.add_plot(loss_plot)
pltr.update()

pltr.update() writes every Plot inside of the Plotter into the platform, which can then be viewed on the model profile page within the GUI.
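
The overwrite-by-name behavior described above is what makes per-epoch updates work: the full history is re-sent under the same line name each time. The sketch below imitates that behavior with a stand-in class so it can run without the platform. SketchPlot is hypothetical and only mimics the documented overwrite semantics; it is not the BOSS Plot class.

```python
from typing import Dict, List, Tuple


class SketchPlot:
    """Hypothetical stand-in for Plot that stores lines keyed by name."""

    def __init__(self, title: str) -> None:
        self.title = title
        self.lines: Dict[str, Tuple[List[float], List[float]]] = {}

    def add_line(self, x: List[float], y: List[float], name: str) -> None:
        if len(x) != len(y):
            raise ValueError("x and y must have the same length")
        # Re-using a name replaces the earlier line instead of adding a new one.
        self.lines[name] = (list(x), list(y))


plot = SketchPlot("Average Loss")
epochs: List[float] = []
losses: List[float] = []
for epoch, loss in enumerate([0.9, 0.6, 0.4], start=1):
    epochs.append(epoch)
    losses.append(loss)
    # Send the whole history so far under the same name on every epoch.
    plot.add_line(epochs, losses, name="Average Loss")

print(len(plot.lines), plot.lines["Average Loss"])
```

With the real classes, each iteration would additionally call pltr.add_plot(loss_plot) and pltr.update(), as in the example above.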

Further, the Plotter class contains several pre-defined methods for graph generation (internally these construct Plot objects):

pltr.roc_curve(truths: list, scores: list, class_list: list)

pltr.precision_recall_curve(truths: list, scores: list, class_list: list)

pltr.qq_plot(residuals: np.array)

pltr.fitted_vs_residual(residuals: np.array, predictions: np.array, actuals: np.array)

pltr.residual_frequency(residuals: np.array)

pltr.residual_plot(independent_variable: pd.Series, residuals: np.array, coefficient: float)

pltr.regression_plot(independent_variable: pd.Series, predictions: np.array, truths: np.array, class_names: list)

Note that pltr.update() must still be called after using any of the methods above.

Plot

The Plot class allows users to create their own plots with their own data. Currently, it supports three types of custom plotting:

loss_plot.add_line(x, y, name, z=None)
loss_plot.add_scatter(x, y, name, samples=None, z=None)
loss_plot.add_bar(x, y, name, x_str=None)

For each of the above, x and y are 1-dimensional lists or arrays of values, and x[0] pairs with y[0]. add_scatter supports the samples parameter, which should be a 1-d list of elastic ids returned by the get_features_and_labels method. The samples should be the ids that match (in order) the x/y pairs. This parameter is used in the background of our simple modeling framework to add real data to our cluster plots.

TensorFlow Hook

A TensorFlow hook is provided in boss.core.lib.ml for automatically parsing generated events files (the same as used by TensorBoard) and passing them to a Plotter as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec or TrainSpec object as follows (stub included for reference):

class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: uds.get_tf_dataset_image(type_dict, training_data, num_features, target_type,
                                                  num_classes).repeat(count=None).shuffle(30).batch(int(30)),
        max_steps=training_steps,
        hooks=[ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10, last_epoch=training_steps)])

train_hook specifies whether the hook provides train or eval metrics (True for train, False for eval). log_dir tells the hook where to find TensorFlow events files. freq is how often the hook should look for metrics in the events files. last_epoch tells the hook the number of epochs being run so that the hook can ignore freq for the last epoch.
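
The interaction between freq and last_epoch can be pictured as a simple scheduling rule: collect metrics every freq steps, and always collect on the final one so the last values are not lost. The sketch below is one plausible reading of the description above; the hook's actual internals are not shown in this documentation, and should_collect is a hypothetical function.

```python
def should_collect(step: int, freq: int, last_epoch: int) -> bool:
    """Return True when metrics would be collected at this step."""
    if freq <= 0:
        raise ValueError("freq must be a positive integer")
    # Collect on every freq-th step, and unconditionally on the final step.
    return step % freq == 0 or step == last_epoch


# With freq=10 and 25 steps in total, the final step is collected
# even though 25 is not a multiple of 10.
collected = [step for step in range(1, 26) if should_collect(step, freq=10, last_epoch=25)]
print(collected)
```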

Confusion Matrix

BOSS provides an in-depth, interactive confusion matrix for classification model evaluation. Selecting a square in the No Code client shows the actual records associated with that square. This may be enabled by using the following class from boss.core.lib.confusion_matrix:

cm = Confusion_Matrix(test_set: list or DataLoader, predictions: list, num_classes: int, label_mapping: type(abs),
                 tid: str = None, elastic_ids: list = None)

cm.matrix()

Simply instantiating an object of the class will write it to the BOSS platform and make it visible on the Model Profile page. Further, calling .matrix() will return the internally constructed confusion matrix.

The arguments are described below.

  • test_set: Users may directly pass the PyTorch DataLoader or the list of delayed Dask dataframes returned from the get_features_and_labels function.
  • predictions: A list of predictions generated by your model (for example, the list returned from one of the ml get_predictions_classification functions). The list must be in the same order as the test_set data.
  • num_classes: An integer number of classes for the confusion matrix to represent.
  • label_mapping: A function to map integers to class labels, which is used to map predictions to a human-readable format.
  • tid: Training id to associate the confusion matrix with.
  • elastic_ids: A list of Elasticsearch ids associated with the test_set (in the same order); this is returned from get_features_and_labels.
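
To see why predictions and elastic_ids must share the order of the test data, it helps to look at how a confusion matrix and its per-square records can be derived. The sketch below is a plain-Python illustration, not the BOSS implementation: both functions are hypothetical, and the layout (rows for actual classes, columns for predicted classes) is an assumption.

```python
from typing import List


def confusion_matrix(truths: List[int], predictions: List[int], num_classes: int) -> List[List[int]]:
    """Count (actual, predicted) pairs; rows are actual, columns are predicted."""
    if len(truths) != len(predictions):
        raise ValueError("truths and predictions must have the same length")
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(truths, predictions):
        matrix[actual][predicted] += 1
    return matrix


def records_in_square(truths: List[int], predictions: List[int], elastic_ids: List[str],
                      actual: int, predicted: int) -> List[str]:
    """Return the record ids that fall into one square of the matrix."""
    return [eid for t, p, eid in zip(truths, predictions, elastic_ids)
            if t == actual and p == predicted]


truths = [0, 1, 2, 2, 1]
predictions = [0, 2, 2, 2, 1]
elastic_ids = ["a", "b", "c", "d", "e"]  # made-up ids, aligned with the rows above

matrix = confusion_matrix(truths, predictions, num_classes=3)
misclassified = records_in_square(truths, predictions, elastic_ids, actual=1, predicted=2)
print(matrix, misclassified)
```

If the three lists were not aligned position by position, the ids returned for a square would point at the wrong records.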

Example Usage

First, we must define a label_mapping:

def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}

Next, prepare vds data for modeling (args were passed into our main() method by the BOSS platform):

train_df, train_lb, train_eid, eval_df, eval_lb, eval_eid, test_df, test_lb, test_eid = get_features_and_labels(args)

Create PyTorch iterator objects for train and evaluation datasets:

trainloader = DataLoader(train_df, batch_size=10)
validloader = DataLoader(eval_df, batch_size=10)

Collect predictions in the PyTorch evaluation loop. net is our model, and constants.PYTORCH_FEATURES denotes our features in the DataLoader:

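predictions = []  # accumulates the predicted class index for every record
for data in validloader:
    # Log of the model outputs; the small constant avoids log(0)
    outputs = torch.log(net(data[constants.PYTORCH_FEATURES]) + 1e-20)
    # Index of the highest score along dimension 1 is the predicted class
    _, predicted = torch.max(outputs.data, 1)
    predictions.extend(predicted)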

Lastly, generate a confusion matrix for the GUI:

ConfusionMatrix(validloader, predictions, 3, _label_mapping(), tid, eval_eid.compute().tolist())

Submitting Model Training Status

Another helpful function is boss.core.internal.train.status, which is used for storing the status of a developer’s training pipeline. This enables a model’s status to be displayed on the No Code client. The function definition is below.

def status(tid, code, message=None):
    """Update model status in the database.

    Args:
        tid: Int representing a training ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL, 4 - TRAINING COMPLETE,
        5 - ERROR, 6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """