Data and Performance Analysis
Introduction
This section covers the high-level tasks needed to support model training: importing and preparing data and analyzing performance. Most of the content here pertains to the advanced
model approach, but some (e.g., reporting model status) is still helpful for PyTorch simple
modeling.
Importing and Preparing Data
Data can be imported into a modeling context using the BOSS Unified Dataspace (UDS) API and the BOSS Training Resources API (boss.core.lib.uds
, boss.core.uds.util
, and boss.core.lib.training_resources
). These libraries provide functions for creating datasets in various formats (TensorFlow, PyTorch, Dask
dataframe) based on BOSS virtual datasets defined in the No Code client. They also provide the capability to retrieve previously trained word embeddings.
The BOSS functions handling data are listed below. Some are used for straightforward data importing (e.g., get_dataframe
, get_features_and_labels
) while others are used for preparing framework-specific datasets for AI models (e.g., get_tf_dataset
for defining TensorFlow datasets). Refer to the API documentation for full function descriptions, and visit the BOSS Model Shop
GitLab project for examples of how to use the functions to develop AI models.
boss.core.lib.uds
get_asset
get_tf_dataset
get_tf_dataset_image
get_tf_dataset_text
boss.core.uds.util
get_dataframe
boss.core.lib.training_resources
get_features_and_labels
get_tensorflow_feature_and_labels
Retrieving your data: boss.core.lib.training_resources
To get your training / testing Dask dataframes, use get_features_and_labels
or get_tensorflow_feature_and_labels
(if using TensorFlow). The simple
modeling framework does this for you, but if you’re designing an advanced
model, these are the functions to use to retrieve your data; simply provide the args
passed into your main()
function by the platform.
get_features_and_labels
get_tensorflow_feature_and_labels
Here’s an example of each function. get_features_and_labels
returns a data dataframe, a labels dataframe, and an elastic IDs dataframe for each of the training, evaluation, and testing sets:
train_df, train_lb, train_eid, eval_df, eval_lb, eval_eid, test_df, test_lb, test_eid = get_features_and_labels(args)
The TensorFlow version is slightly different. It returns lists of Dask delayed
objects, each list containing two items: features and labels. There is one list for each of training, evaluation, and testing. The test labels are also returned separately for ease of use, followed by the number of features and, lastly, a list of Elasticsearch IDs matching your data. The eid list contains three items: a Dask delayed
object for each of training, evaluation, and testing.
delayed_values_training, delayed_values_evaluation, delayed_values_testing, my_df_testing_label, num_features, \
delayed_eids = get_tensorflow_feature_and_labels(args)
Working with assets
Text models may use embeddings trained within the BOSS platform, which are stored as assets
. They can be selected when starting a training run from within the platform and retrieved within your advanced
model using the uds
module as follows:
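from boss.core.lib import uds
from boss.core.lib.training_resources import get_features_and_labels

def main(args):
    # The asset ID of the selected embedding is passed in by the platform
    asset_id = args['asset_id']
    embeddings_index, embedding_matrix, embedding_size, word_index_mapping, pad_value = uds.get_asset(asset_id)
    train_df, _, _, eval_df, _, _, test_df, _, _ = get_features_and_labels(args, word_index_mapping=word_index_mapping)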
Working with TensorFlow
When creating TensorFlow Estimators, you must create TrainSpec and EvalSpec objects, and these require input functions, passed as input_fn
. The boss.core.lib.uds
functions listed above take care of this for you. Namely, get_tf_dataset_image
works for image datasets, get_tf_dataset
works for tabular data, and get_tf_dataset_text
works for text data. Here is an example for a text dataset:
# Model function instantiating a TensorFlow Estimator
def model(training_data, evaluation_data, training_steps, log_dir, embedding_matrix, embedding_size, word_index_mapping,
          max_document_length, pad_value, tid):
    # Define the feature columns for inputs.
    target_type = tf.int32
    train_spec = tf.estimator.TrainSpec(
        input_fn=lambda: uds.get_tf_dataset_text(feature_label, training_data, word_index_mapping, pad_value,
                                                 max_document_length,
                                                 target_type).repeat(count=None).shuffle(100).batch(int(100)),
        max_steps=training_steps,
        hooks=[lml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10, last_epoch=training_steps)])

# Prepare vds data for modeling
delayed_values_training, delayed_values_evaluation, delayed_values_testing, testing_labels, _, e_ids = \
    get_tensorflow_feature_and_labels(args)

# Get asset/embedding
embeddings_index, embedding_matrix, embedding_size, word_index_mapping, pad_value = uds.get_asset(asset_id)

# Instantiate the model!
_estimator, train_spec, eval_spec, target_type, serving_input_fn = model(delayed_values_training,
    delayed_values_evaluation, training_steps, log_dir, embedding_matrix,
    embedding_size, word_index_mapping, max_document_length, pad_value, tid)
Retrieve a single DataFrame
Sometimes all you need is a dataframe with data. That’s what the get_dataframe
function in boss.core.uds.util
is for. Pass it a VDS ID and it returns:
- A dataframe with data
- The number of features
- A list of elastic ids
training_data, num_features, elastic_ids = get_dataframe(args['vds']['training'])
Important notes for implementing multi-class modeling
TensorFlow offers different approaches to building multi-class models, two prominent ones being pre-made Estimators (https://www.tensorflow.org/tutorials/estimator/premade#overview_of_programming_with_estimators) and general techniques such as Keras models and Estimators. If one-hot encoded data labels are needed (i.e., to match the number of nodes in a neural network output layer), the num_classes
parameter should be used when calling the relevant functions to get data (e.g., lucd_uds.get_tf_dataset
). Note that most (if not all) TensorFlow pre-made Estimator models (e.g., tensorflow.estimator.DNNClassifier
) do not require explicitly one-hot encoded data labels for non-binary modeling, and hence the num_classes
argument can be omitted. In the case of TensorFlow Estimators, developers are encouraged to understand how to shape input for the models. The same goes for modeling with
PyTorch or XGBoost.
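As a rough illustration of what one-hot encoded labels look like, the sketch below expands integer class labels into vectors whose length equals num_classes. It is plain Python and not the BOSS implementation; the helper name one_hot is hypothetical.

```python
def one_hot(labels, num_classes):
    """Expand integer class labels into one-hot vectors of length num_classes."""
    encoded = []
    for label in labels:
        if not 0 <= label < num_classes:
            raise ValueError(f"label {label} is outside [0, {num_classes})")
        row = [0] * num_classes
        row[label] = 1  # mark the position of the true class
        encoded.append(row)
    return encoded


# Three samples belonging to classes 0, 2, and 1 of a 3-class problem
labels = [0, 2, 1]
encoded_labels = one_hot(labels, num_classes=3)
```

Each row then has one entry per output node, which is the shape a multi-node output layer typically expects.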
Analyzing Model Performance
Post-training performance analysis tasks are supported by the BOSS Machine Learning (ML), Confusion Matrix, and Plotting APIs (boss.core.lib.ml
, boss.core.lib.confusion_matrix
, boss.core.lib.plotter
, boss.core.lib.plot
). These libraries provide functions that automatically execute and report critical performance analysis tasks (e.g., creating confusion matrices, ROC curves, cluster scatter plots, line graphs), removing the need to write such code repeatedly. The tables and plots created by these library functions can be viewed in the No Code client after the entire
model training process has completed.
The BOSS ML functions for performance analysis are listed below. Refer to the API documentation for full function descriptions.
boss.core.lib.ml
The ml module contains methods for generating predictions easily from your models. They can be used within advanced
models. You can see example usage in the Model Shop.
- get_predictions_classification_pt
- get_predictions_classification_tf
- get_predictions_classification_sklearn
- get_predictions_classification_xg
- get_predictions_regression_pt
- get_predictions_regression_tf
- get_predictions_regression_sklearn
- get_predictions_regression_xg
Submitting Performance Analysis Results
Trained models and metadata can be uploaded to the BOSS backend via the boss.core.internal.train.update
function. The following piece of example code illustrates how to use the function.
# Store model graph and performance stats back to BOSS back-end
model_filename = uds.zip_model_tf(_estimator, serving_input_receiver_fn, model_id, graph_version, log_dir)
with open(model_filename, "rb") as graph_file:
    train.update({tid: {
        'performance': {
            'rmse': rmse,
            'mae': mae,
            'r2': r2
        },
        "ordered_feature_names": ["x"],
        'ordered_class_names': ["flower.sepal_width"],
        'output_name': "outputs",
        "input_name": "inputs",
        'graph_version': graph_version,
        'graph_filename': 'model.zip',
        'graph_file': graph_file.read(),
    }})
train.update
takes a Python dictionary as its argument, with the train_id
, described in Table 1 in Developing Advanced Models, as the top-level key (tid
represents the train_id
in the code snippet above). The secondary keys graph_version
and graph_file
store the graph version and trained graph file (model) respectively. The secondary key performance
stores another dictionary for performance values. There is no restriction on the key-value pairs here. The developer is free to choose the performance values, and they will be viewable in the No Code client afterward. The values shown in the code snippet above (RMSE, MAE, R²) are customary for evaluating regression models. Again, see example models in the BOSS Model Shop for more insights.
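For readers who want to see where values such as rmse, mae, and r2 might come from, here is a minimal pure-Python sketch that computes them from paired actual and predicted values. The helper name regression_performance is hypothetical and not part of the BOSS libraries; in practice these values are often computed with a numerical library instead.

```python
import math


def regression_performance(actuals, predictions):
    """Return RMSE, MAE, and R^2 for paired lists of actual and predicted values."""
    n = len(actuals)
    if n == 0 or n != len(predictions):
        raise ValueError("actuals and predictions must be non-empty and equal in length")
    errors = [a - p for a, p in zip(actuals, predictions)]
    mean_actual = sum(actuals) / n
    ss_res = sum(e * e for e in errors)                    # residual sum of squares
    ss_tot = sum((a - mean_actual) ** 2 for a in actuals)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        # R^2 is undefined when every actual value is identical
        "r2": 1.0 - ss_res / ss_tot if ss_tot else float("nan"),
    }


performance = regression_performance([3.0, 5.0, 7.0], [2.0, 5.0, 8.0])
```

A dictionary shaped like this could be supplied as the value of the performance key, since that key accepts arbitrary key-value pairs.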
Enabling Model Explainability
To enable a trained model to be used by the explainability tool in the No Code client, some parameters must be defined. For TensorFlow models, ordered_feature_names
, ordered_class_names
, input_name
, and output_name
must be defined. ordered_feature_names
(not to be confused with training data input column names) is a list of ordered names of the inputs to the trained model, commonly defined in TensorFlow model definitions as tf.feature_column. For example, for a TensorFlow text classification model, the named input might be embedding_input. Please see example code in The BOSS Model Shop. ordered_class_names is a list formatted such that string class names are ordered by their integer representations (the order of outputs from your model). For instance, for a binary classification model for which the labels are 0 and 1, the order of strings must be negative and positive (or whatever string labels you choose). input_name
is the name of the input layer in your TensorFlow model to which your ordered_feature_names
data will be passed. output_name
is the name of the output layer in your TensorFlow model (by default these can be named things like ‘dense_2’ and ‘scores’).
The output_name
is used to retrieve your model outputs in the proper format for explanation. PyTorch, Scikit-learn, and XGBoost models only require that ordered_class_names
be provided.
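The ordering rule for ordered_class_names can be sketched in a few lines of plain Python. This assumes, for illustration only, that the integer labels are contiguous and start at 0; the helper name ordered_class_names_from_mapping is hypothetical.

```python
def ordered_class_names_from_mapping(label_mapping):
    """Return class names ordered by integer label, so index i holds the name of class i."""
    expected = list(range(len(label_mapping)))
    if sorted(label_mapping) != expected:
        raise ValueError("integer labels must be exactly 0..n-1")
    return [label_mapping[i] for i in expected]


# The mapping may be written in any order; the result follows the integer labels
class_names = ordered_class_names_from_mapping({1: "positive", 0: "negative"})
```

For the binary example above, the result lists the name for label 0 first and the name for label 1 second.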
Plots
BOSS allows users to plot associated model training metrics in the No Code client. The plots update in real time during model training, providing insight into the viability, problems, and successes of training runs. Two classes exist to aid in plotting:
- Plotter
- Plot
Plotter
The Plotter
is a management class to which Plot objects are added so that they can be stored and written to the BOSS platform.
An example is shown below:
from boss.core.lib.plotter import Plotter
from boss.core.lib.plot import Plot
# Instantiate Plotter
pltr = Plotter(train_id)
# Instantiate Plots
loss_plot = Plot("Average Loss", x_label="Epoch", y_label="Average Loss",
description="Average loss throughout training epochs.")
accuracy_plot = Plot("Validation Accuracy", x_label="Epoch", y_label="Accuracy",
description="Validation Accuracy")
Throughout the course of a model being trained, lines may be added to a Plot
object (designated by name). Note that adding a line whose name already exists on a Plot
overwrites the existing line. The same is true of adding a named Plot
to a Plotter
, which comes in handy for updating the same line iteratively. Here is an example (x and y are 1-dimensional lists):
loss_plot.add_line(x, y, name="Average Loss")
pltr.add_plot(loss_plot)
pltr.update()
pltr.update()
writes every Plot in the Plotter to the platform, where the plots can then be viewed on the model profile page within the GUI.
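The x and y lists for a loss line have to come from somewhere in the training loop. One way to build them, sketched here in plain Python under the assumption that per-batch losses are collected per epoch, is to average each epoch's losses. The helper name average_loss_per_epoch is hypothetical.

```python
def average_loss_per_epoch(batch_losses_by_epoch):
    """Turn per-batch losses grouped by epoch into x (epoch numbers) and y (mean loss) lists."""
    x, y = [], []
    for epoch, batch_losses in enumerate(batch_losses_by_epoch, start=1):
        x.append(epoch)
        y.append(sum(batch_losses) / len(batch_losses))
    return x, y


# Three epochs with two batches each
x, y = average_loss_per_epoch([[0.9, 0.7], [0.5, 0.3], [0.2, 0.2]])
# x and y could then be passed to loss_plot.add_line(x, y, name="Average Loss")
```

Because a line added under an existing name replaces the earlier one, the full x and y lists can be rebuilt and re-added after every epoch.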
Further, the Plotter
class contains several pre-defined methods for graph generation (internally these construct Plot
objects):
pltr.roc_curve(truths: list, scores: list, class_list: list)
pltr.precision_recall_curve(truths: list, scores: list, class_list: list)
pltr.qq_plot(residuals: np.array)
pltr.fitted_vs_residual(residuals: np.array, predictions: np.array, actuals: np.array)
pltr.residual_frequency(residuals: np.array)
pltr.residual_plot(independent_variable: pd.Series, residuals: np.array, coefficient: float)
pltr.regression_plot(independent_variable: pd.Series, predictions: np.array, truths: np.array, class_names: list)
Note that pltr.update()
must still be called after using any of the methods above.
Plot
The Plot
class allows users to create their own plots with their own data. Currently, it supports three types of custom plotting:
loss_plot.add_line(x, y, name, z=None)
loss_plot.add_scatter(x, y, name, samples=None, z=None)
loss_plot.add_bar(x, y, name, x_str=None)
For each of the above, x and y are 1-dimensional lists or arrays of values, where x[0] pairs with y[0], x[1] with y[1], and so on. Scatter supports the samples parameter, which should be a 1-d list of elastic IDs returned by the get_features_and_labels
function. The samples should be the IDs that match (in order) the x/y pairs. This parameter is used in the background of our simple
modeling framework to add real data to our cluster plots.
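Since the pairing between x, y, and samples is purely positional, a quick check before plotting can catch misaligned inputs. The following is a small sketch of such a check; align_scatter_points is a hypothetical helper and the IDs are made-up placeholders.

```python
def align_scatter_points(x, y, samples):
    """Check that x, y, and samples line up index by index and return them as triples."""
    if not (len(x) == len(y) == len(samples)):
        raise ValueError("x, y, and samples must all have the same length")
    return list(zip(x, y, samples))


# Placeholder coordinates and IDs; real IDs would come from the data-retrieval functions
points = align_scatter_points([0.1, 0.4], [1.2, 0.8], ["id-a", "id-b"])
```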
TensorFlow Hook
A TensorFlow hook is provided in boss.core.lib.ml
for automatically parsing generated events files (the same as used by TensorBoard) and passing them to a Plotter
as part of a TensorFlow model. It can be provided as part of a TensorFlow EvalSpec
or TrainSpec
object as follows (stub included for reference):
class LucdTFEstimatorHook(tf.estimator.SessionRunHook):
    def __init__(self, train_hook: bool, log_dir: str, tid: str, freq: int, last_epoch: int):
        ...

train_spec = tf.estimator.TrainSpec(
    input_fn=lambda: uds.get_tf_dataset_image(type_dict, training_data, num_features, target_type,
                                              num_classes).repeat(count=None).shuffle(30).batch(int(30)),
    max_steps=training_steps,
    # hooks takes a list; the keyword is last_epoch, matching the stub above
    hooks=[ml.LucdTFEstimatorHook(train_hook=True, log_dir=log_dir, tid=tid, freq=10, last_epoch=training_steps)])
train_hook
specifies whether the hook reports train or eval metrics (True for train, False for eval). log_dir
tells the hook where to find TensorFlow events files. freq
is how often the hook should look for metrics in the events files. last_epoch
tells the hook the number of epochs being run so the hook can ignore freq for the last epoch.
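The hook's internals are not shown in this document, so the following is only one plausible reading of how freq and last_epoch might interact: report on every freq-th step, and always report on the final step regardless of freq. The function should_report is hypothetical and is not part of boss.core.lib.ml.

```python
def should_report(step, freq, last_epoch):
    """Return True on every freq-th step and always on the final step."""
    if freq <= 0:
        raise ValueError("freq must be a positive integer")
    return step % freq == 0 or step == last_epoch


# With freq=10 and 25 steps in total, reports happen at steps 10, 20, and 25
reported_steps = [s for s in range(1, 26) if should_report(s, freq=10, last_epoch=25)]
```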
Confusion Matrix
BOSS provides an in-depth, interactive confusion matrix for classification model evaluation. Users may select a square in the No Code client to see the actual records associated with it. This is enabled by using the following class from boss.core.lib.confusion_matrix
:
cm = ConfusionMatrix(test_set: list or DataLoader, predictions: list, num_classes: int, label_mapping: type(abs),
                     tid: str = None, elastic_ids: list = None)
cm.matrix()
Simply instantiating an object of the class will write it to the BOSS platform and make it visible on the Model Profile page. Further, calling .matrix()
will return the internally constructed confusion matrix.
The arguments are described below.
- test_set: Users may directly pass the PyTorch DataLoader or the list of delayed Dask dataframes returned from the get_features_and_labels function.
- predictions: A list of predictions generated by your model (the list returned from the ml get_predictions_classification functions). The list must be in the same order as the test_set data.
- num_classes: An integer number of classes for the confusion matrix to represent.
- label_mapping: A function to map integers to class labels, which is used to map predictions to a human-readable format.
- tid: Training ID to associate the confusion matrix with.
- elastic_ids: A list of Elasticsearch IDs associated with the test_set (in the same order); this is returned from get_features_and_labels.
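To make the roles of predictions, num_classes, and the label mapping concrete, here is a minimal pure-Python sketch of how a confusion matrix can be counted. It is not the BOSS class: the function build_confusion_matrix is hypothetical, it takes the true labels directly rather than a test_set, and it assumes the mapping is a plain dictionary from integers to names.

```python
def build_confusion_matrix(truths, predictions, num_classes, label_mapping):
    """Count (actual, predicted) pairs; rows are actual classes, columns are predicted."""
    if len(truths) != len(predictions):
        raise ValueError("truths and predictions must have the same length and order")
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for actual, predicted in zip(truths, predictions):
        matrix[actual][predicted] += 1
    # Human-readable names in the same order as the matrix rows and columns
    names = [label_mapping[i] for i in range(num_classes)]
    return names, matrix


names, matrix = build_confusion_matrix(
    truths=[0, 1, 2, 2, 1],
    predictions=[0, 2, 2, 2, 1],
    num_classes=3,
    label_mapping={0: "I. versicolor", 1: "I. virginica", 2: "I. setosa"},
)
```

The counting depends on truths and predictions being in the same order, which is why the predictions and elastic IDs must follow the order of the test set.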
Example Usage
First, we must define a label_mapping:
def _label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica', 2: 'I. setosa'}
Next, prepare the VDS data for modeling (args were passed into our main() function by the BOSS platform):
train_df, train_lb, train_eid, eval_df, eval_lb, eval_eid, test_df, test_lb, test_eid = get_features_and_labels(args)
Create PyTorch iterator objects for train and evaluation datasets:
trainloader = DataLoader(train_df, batch_size=10)
validloader = DataLoader(eval_df, batch_size=10)
Collect predictions in the PyTorch evaluation loop. net
is our model, and constants.PYTORCH_FEATURES
denotes our features in the DataLoader:
predictions = []  # accumulates the predicted class index for every sample
for data in validloader:
    outputs = torch.log(net(data[constants.PYTORCH_FEATURES]) + 1e-20)
    _, predicted = torch.max(outputs.data, 1)
    predictions.extend(predicted)
Lastly, generate a confusion matrix for the GUI:
ConfusionMatrix(validloader, predictions, 3, _label_mapping(), tid, eval_eid.compute().tolist())
Submitting Model Training Status
Another helpful function is boss.core.internal.train.status
, which is used for storing the status of a developer’s training pipeline. This enables a model’s status to be displayed on the No Code client. The function definition is below.
def status(tid, code, message=None):
    """Update model status in the database.

    Args:
        tid: Int representing a training ID.
        code: 0 - RETRIEVING VDS, 1 - TRAINING, 2 - ANALYZING PERFORMANCE, 3 - STORING MODEL, 4 - TRAINING COMPLETE,
            5 - ERROR, 6 - QUEUED, 7 - STOPPED.
        message: String representing optional custom message to include.

    Returns:
        Status message.

    Raises:
        TypeError: If code is not of type int.
        Exception: If code is invalid.
    """