Advanced Models

Introduction

With the advanced modeling approach, developers implement their model training routines in a Python script and call Python functions directly to define and use PyTorch datasets, TensorFlow datasets, and Dask dataframes (for Scikit-learn and XGBoost modeling) based on BOSS virtual datasets (defined in the No Code client). A developer must also call functions that upload trained models and metadata (e.g., model training performance metrics) to the BOSS backend. The advantage of the advanced approach is that developers are free to carry over modeling code and customized or experimental performance analysis techniques from previously written code. Advanced model examples are available in the BOSS Model Shop.

Advanced Model Format

Advanced models are implemented as Python scripts. The script's entrypoint function must be named main. One of the biggest differences between Simple and Advanced models is that Advanced models must retrieve their data manually at the beginning of main(), for example by calling get_features_and_labels. Refer to the data retrieval page for more information.
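
As a rough illustration of the format described above, the skeleton below shows the overall shape of an advanced model script. It is a sketch, not a working BOSS model: load_data and fit_model are hypothetical placeholders that stand in for a BOSS data-retrieval call (such as get_features_and_labels) and for your own training code, so the control flow can be read and run without the BOSS libraries. The return value exists only to make the sketch easy to inspect; nothing in the original text says that main must return anything.

```python
def load_data(args):
    """Hypothetical placeholder for a BOSS data-retrieval call."""
    # Tiny in-memory dataset where y = 2 * x.
    return [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]


def fit_model(features, labels, steps, lr):
    """Hypothetical placeholder for your own training routine.

    Fits y = w * x with plain gradient descent on mean squared error.
    """
    w = 0.0
    n = len(features)
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(features, labels)) / n
        w -= lr * grad
    return w


def main(args):
    # 1. Read identifiers and hyperparameters passed in from the GUI.
    train_id = args['train_id']
    steps = args['parameters']['steps']
    lr = args['parameters']['lr']

    # 2. Retrieve the data manually; this is the first real work in main().
    features, labels = load_data(args)

    # 3. Train the model.
    weight = fit_model(features, labels, steps, lr)

    # 4. A real script would now zip the model and upload it to the BOSS backend.
    return {'train_id': train_id, 'weight': weight}
```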

How to write your Model

Arguments and parameters specified in the GUI when starting a training run are passed into your main() function as args (see the example at the bottom of the page). The arguments passed to main are described in the sub-sections below.

TensorFlow, PyTorch and Scikit-learn

Table 1 describes the Python arguments (defined in the No Code client when starting model training) that are always passed to the main function for TensorFlow, PyTorch, and Scikit-learn models.

| Argument | Type | Description |
|---|---|---|
| args['model'] | string | Model ID, used for storing checkpoints and models to the BOSS backend |
| args['train_id'] | string | Model "training" ID, used for storing the trained model asset to the BOSS backend |
| args['vds'] | dictionary | A dictionary containing two fields: training and testing. Each field contains a BOSS virtual dataset ID, used for retrieving training/testing data for model training |
| args['asset'] | string | Asset (word embedding) ID, used for retrieving word embeddings for text classification model training |
| args['parameters']['steps'] | int | Number of steps for model training |
| args['parameters']['lr'] | float | Learning rate for model training |
| args['parameters']['regularization_value'] | float | Regularization value for model training |
| args['parameters']['eval_percent'] | float | Percentage of the virtual dataset to use for validation |
| args['parameters']['test_percent'] | float | Percentage of the virtual dataset to use for testing |
| args['parameters']['classification_mode'] | string | Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models) |
| args['parameters']['prediction_threshold'] | float | For binary classification models, minimum threshold for designating a positive decision |
| args['parameters']['max_document_length'] | int | Maximum number of tokens of free-text input to the model for training (for text classification) |
| args['exportdir'] | string | Directory used for storing the trained model (for upload purposes) |
| args['graphversion'] | string | Version of the graph being trained |

Table 1. Advanced model Python script arguments for TensorFlow, PyTorch, and Scikit-learn models.
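
To make the nesting in Table 1 concrete, the following sketch pulls a few of the arguments out of args into one flat dictionary. The helper name read_common_args and the fallback defaults are illustrative assumptions, not part of BOSS; the dictionary keys read from args are the ones listed in the table.

```python
def read_common_args(args):
    """Collect some of the Table 1 arguments into one flat dictionary."""
    params = args.get('parameters', {})
    return {
        'model_id': args['model'],
        'train_id': args['train_id'],
        # args['vds'] holds one virtual dataset ID per split.
        'training_vds': args['vds']['training'],
        'testing_vds': args['vds']['testing'],
        'export_dir': str(args['exportdir']),
        'graph_version': args['graphversion'],
        # The fallback values below are illustrative assumptions,
        # not documented BOSS defaults.
        'steps': int(params.get('steps', 100)),
        'lr': float(params.get('lr', 0.01)),
        'prediction_threshold': float(params.get('prediction_threshold', 0.5)),
    }
```

Reading everything once at the top of main() keeps the rest of the script free of nested dictionary lookups.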

XGBoost Dask

Table 2 describes the Python arguments passed to the start function for XGBoost Dask models.

| Argument | Type | Description |
|---|---|---|
| args['parameters']['lr'] | float | Learning rate for model training |
| args['parameters']['steps'] | int | Number of steps for model training |
| args['parameters']['classification_mode'] | string | Type of classification (binary, multiclass, tf_premade_multiclass) as selected in the GUI (only applies to classification models) |
| args['parameters']['eval_percent'] | float | Percentage of the virtual dataset to use for validation |
| args['parameters']['test_percent'] | float | Percentage of the virtual dataset to use for testing |
| args['parameters']['max_document_length'] | int | Maximum number of tokens of free-text input to the model for training (for text classification) |
| args['parameters']['prediction_threshold'] | float | For binary classification models, minimum threshold for designating a positive decision |
| args['booster'] | string | XGBoost booster type |
| args['algorithm'] | string | One of "booster", "xgbclassifier", "xgbregressor" |
| args['objective'] | string | Learning task and the corresponding learning objective |
| args['base_score'] | float | The initial prediction score of all instances (global bias) |
| args['eval_metric'] | string | Evaluation metric for validation data; a default metric is assigned according to 'objective' |
| args['seed'] | int | Random number seed |

Table 2. Advanced model Python script arguments for Dask-XGBoost models.
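
One possible way to turn the Table 2 arguments into an XGBoost parameter dictionary is sketched below. The helper name build_xgboost_params is hypothetical, and treating steps as the number of boosting rounds is an assumption made for illustration; the dictionary keys (booster, objective, base_score, eval_metric, seed, eta) are standard XGBoost parameter names.

```python
def build_xgboost_params(args):
    """Map the Table 2 arguments onto XGBoost parameter names."""
    params = args['parameters']
    xgb_params = {
        'booster': args['booster'],
        'objective': args['objective'],
        'base_score': args['base_score'],
        'eval_metric': args['eval_metric'],
        'seed': args['seed'],
        # 'eta' is XGBoost's name for the learning rate.
        'eta': params['lr'],
    }
    # Assumption: one training "step" corresponds to one boosting round.
    num_boost_round = params['steps']
    return xgb_params, num_boost_round
```

The returned pair could then be handed to a training call of your choice; which call is appropriate depends on the value of args['algorithm'].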

Saving your Trained Model

At the end of your training run, you need to save your model. Before doing so, use one of the boss.core.uds.util functions to package the resulting model in a zip file. These functions are:

  • zip_model_tf(_estimator, serving_input_receiver_function, model_id, graph_version, log_dir)
  • zip_model_pt(model, model_id, model_path, graph_version)
  • zip_model_xgboost(model, model_id, model_path, graph_version)
  • zip_model_sklearn(model, model_id, model_path, graph_version)

Note that the TensorFlow function differs from the rest in that it requires the serving input receiver function (for use in inference). In these signatures, model is your model object, model_id is the model ID (provided in the args passed to main), model_path is the path used to construct the zip file (we recommend building it from the log_dir variable, as shown in the example below), and graph_version is provided in the args passed to main.

After retrieving the zipped trained model, you can save it by passing it to train.update(). All of this is demonstrated in the example below.
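
The structure handed to train.update() in the example on this page is a dictionary keyed by the training ID. The sketch below only assembles that dictionary, so it can be run without the BOSS libraries. The helper name build_update_payload is hypothetical, and the set of keys is copied from the example rather than from a documented schema, so treat it as an assumption about what the backend expects.

```python
def build_update_payload(train_id, zip_path, graph_filename, score,
                         class_names, graph_version):
    """Assemble the dictionary that the example passes to train.update()."""
    # Read the zipped model as raw bytes.
    with open(zip_path, 'rb') as graph_file:
        model_bytes = graph_file.read()

    return {train_id: {
        'performance': {'score': score},
        'ordered_class_names': list(class_names),
        'graph_version': graph_version,
        'graph_filename': graph_filename,
        'graph_file': model_bytes,
    }}
```

In a real script, zip_path would be the filename returned by one of the zip_model_* functions, and the resulting dictionary would be passed to train.update().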

## Example Advanced Model (Scikit-Learn on Iris data)

import boss.core.uds.util as uds
import numpy as np
from boss.core.internal import train
from boss.core.lib.confusion_matrix import ConfusionMatrix
from boss.core.lib.plotter import Plotter
from boss.core.lib.training_resources import get_features_and_labels
from sklearn.linear_model import LogisticRegression  


# Function used in classification models to map int labels to string names
def label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica-setosa'}


# Entrypoint of the BOSS Advanced modeling framework (your main function is what gets called!)
def main(args):
    # Retrieve parameters
    tid = args['train_id']
    graph_version = args['graphversion']
    model_id = args['model']
    log_dir = str(args['exportdir'])

    # Retrieve the dataset
    try:
        train.status(tid, 0)
        train_features, train_labels, train_eid, eval_features, eval_labels, eval_eid, _, _, test_eid = \
            get_features_and_labels(args)
    except Exception as exception:
        train.status(tid, 5, exception)
        raise

    # Apply label mapping to the dataframes
    train_labels = train_labels.mask(train_labels['flower.species'] == 'I. versicolor', 0)
    train_labels = train_labels.mask(train_labels['flower.species'] == 'I. virginica', 1)
    train_labels = train_labels.mask(train_labels['flower.species'] == 'I. setosa', 1)

    eval_labels = eval_labels.mask(eval_labels['flower.species'] == 'I. versicolor', 0)
    eval_labels = eval_labels.mask(eval_labels['flower.species'] == 'I. virginica', 1)
    eval_labels = eval_labels.mask(eval_labels['flower.species'] == 'I. setosa', 1)

    # Set the dtypes (np.float and np.int were removed in NumPy 1.24; use the builtins)
    train_features = train_features.astype(float)
    eval_features = eval_features.astype(float)
    train_labels = train_labels.astype(int)
    eval_labels = eval_labels.astype(int)

    # scikit-learn needs in-memory data, so materialize the Dask dataframes
    computed = train_features.compute()
    computed_labels = train_labels.compute()

    # Computed values
    nparr = computed.values
    nplabelsarr = computed_labels.values

    # Let's train a model!
    try:
        train.status(tid, 1)

        # Train model
        try:
            sk_model = LogisticRegression().fit(nparr, nplabelsarr)
        except Exception as err:
            train.status(tid, 5, "failure: error in training -- " + str(err))
            raise

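        # Generate predictions and convert to format for ROC & PR.
        # predict() returns hard class labels (0 or 1), so each row built below is a one-hot score pair.
        # The model was fit on a NumPy array, so predict on one as well.
        preds = sk_model.predict(eval_features.compute().values)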
        new_preds = [[1.0 - i, i] for i in preds]
        pred_labels = [np.argmax(i) for i in new_preds]
        eval_labs = eval_labels.compute()

        # Instantiate plotter object
        pltr = Plotter(tid)

        # ROC curve, PR curve, confusion matrix
        pltr.roc_curve(eval_labs, new_preds, [0, 1])
        pltr.precision_recall_curve(eval_labs, new_preds, [0, 1])
        pltr.update()

        ConfusionMatrix([eval_features, eval_labels], pred_labels, 2, label_mapping(), tid,
                        eval_eid.compute().to_list())

        # Store model and performance stats back to BOSS back-end
        model_filename = uds.zip_model_sklearn(sk_model, model_id, log_dir + "model.pkl",
                                               graph_version)
        with open(model_filename, "rb") as graph_file:
            train.update({tid: {
                'performance': {
                    'score': sk_model.score(eval_features.compute(), eval_labs)
                },
                'ordered_class_names': ['I. versicolor', 'I. virginica-setosa'],
                'graph_version': graph_version,
                'graph_filename': 'model.skl',
                'graph_file': graph_file.read()
            }})

            train.status(tid, 3)
    except Exception as exception:
        train.update({tid: {'Caught exception': str(exception)}})
        train.status(tid, 5, "failure: error in training -- " + str(exception))
        raise