# Advanced Models

## Introduction
With the advanced modeling approach, developers implement their model training routines in a Python script and directly use Python functions to define and consume PyTorch datasets, TensorFlow datasets, and Dask dataframes (for Scikit-learn and XGBoost modeling) based on BOSS virtual datasets (defined in the No Code client). Developers must also call functions to upload trained models and metadata (e.g., model training performance metrics) to the BOSS backend. The advantage of the advanced model approach is that developers are free to carry over modeling and customized and/or experimental performance-analysis techniques from previously written code. Advanced model examples are available in The BOSS Model Shop.
## Advanced Model Format
Advanced models are implemented as Python scripts. The script's entrypoint function must be called `main`. One of the biggest differences between Simple and Advanced models is that Advanced models must retrieve their data manually at the beginning of the `main()` method, typically using `get_features_and_labels`. Refer to this page for more information about data retrieval.
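For orientation, here is a minimal sketch of what that skeleton looks like. It assumes the helpers used later on this page (`get_features_and_labels` for data retrieval and `train` for status reporting), and the unpacking order follows the full example at the bottom of the page:

```python
from boss.core.internal import train
from boss.core.lib.training_resources import get_features_and_labels

def main(args):
    tid = args['train_id']
    train.status(tid, 0)  # report that data retrieval has started
    # Advanced models fetch their own data at the start of main()
    train_features, train_labels, train_eid, eval_features, eval_labels, eval_eid, \
        test_features, test_labels, test_eid = get_features_and_labels(args)
    # ... train, evaluate, zip, and upload your model (see the full example below)
```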
## How to write your Model
Arguments and parameters specified in the GUI when starting a training run are passed into your `main()` method as `args` (see the example at the bottom of the page). The arguments passed to `main` are described in the sub-sections below.
### TensorFlow, PyTorch and Scikit-learn
Table 1 describes the Python arguments (defined in the No Code client when starting model training) that are always passed to the `main` function for TensorFlow, PyTorch, and Scikit-learn models.
| Argument | Description |
|---|---|
| `args['model']` (string) | Model ID, used for storing checkpoints and models to the BOSS backend |
| `args['train_id']` (string) | Model "training" ID, used for storing the trained model asset to the BOSS backend |
| `args['vds']` (dictionary) | A dictionary containing two fields, `training` and `testing`. Each field contains a BOSS virtual dataset ID, used for retrieving training/testing data for model training |
| `args['asset']` (string) | Asset (word embedding) ID, used for retrieving word embeddings for text classification model training |
| `args['parameters']['steps']` (int) | Number of steps for model training |
| `args['parameters']['lr']` (float) | Learning rate for model training |
| `args['parameters']['regularization_value']` (float) | Regularization value for model training |
| `args['parameters']['eval_percent']` (float) | Percentage of the virtual dataset to use for validation |
| `args['parameters']['test_percent']` (float) | Percentage of the virtual dataset to use for testing |
| `args['parameters']['classification_mode']` (string) | Type of classification (`binary`, `multiclass`, `tf_premade_multiclass`) as selected in the GUI (only applies to classification models) |
| `args['parameters']['prediction_threshold']` (float) | For binary classification models, the minimum threshold for designating a positive decision |
| `args['parameters']['max_document_length']` (int) | Maximum number of tokens to be used for free-text input into the model for training (for text classification) |
| `args['exportdir']` (string) | Directory used for storing the trained model (for upload purposes) |
| `args['graphversion']` (string) | Version of the graph being trained |
Table 1. Advanced model Python script arguments for TensorFlow, PyTorch, and Scikit-learn models.
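To make the nesting concrete, the `args` dictionary received by `main` looks roughly like the following sketch (all IDs and values here are illustrative placeholders, not real defaults):

```python
args = {
    'model': '<model-id>',
    'train_id': '<training-id>',
    'vds': {'training': '<vds-id>', 'testing': '<vds-id>'},
    'asset': '<word-embedding-id>',  # only relevant for text classification
    'parameters': {
        'steps': 1000,
        'lr': 0.01,
        'regularization_value': 0.0,
        'eval_percent': 0.1,
        'test_percent': 0.1,
        'classification_mode': 'binary',
        'prediction_threshold': 0.5,
        'max_document_length': 256,
    },
    'exportdir': '/tmp/export/',
    'graphversion': '1',
}
```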
### XGBoost Dask
Table 2 describes the Python arguments passed to the `start` function for XGBoost Dask models.
| Argument | Description |
|---|---|
| `args['parameters']['lr']` (float) | Learning rate for model training |
| `args['parameters']['steps']` (int) | Number of steps for model training |
| `args['parameters']['classification_mode']` (string) | Type of classification (`binary`, `multiclass`, `tf_premade_multiclass`) as selected in the GUI (only applies to classification models) |
| `args['parameters']['eval_percent']` (float) | Percentage of the virtual dataset to use for validation |
| `args['parameters']['test_percent']` (float) | Percentage of the virtual dataset to use for testing |
| `args['parameters']['max_document_length']` (int) | Maximum number of tokens to be used for free-text input into the model for training (for text classification) |
| `args['parameters']['prediction_threshold']` (float) | For binary classification models, the minimum threshold for designating a positive decision |
| `args['booster']` (string) | XGBoost booster type |
| `args['algorithm']` (string) | One of "booster", "xgbclassifier", "xgbregressor" |
| `args['objective']` (string) | Learning task and the corresponding learning objective |
| `args['base_score']` (float) | The initial prediction score of all instances (global bias) |
| `args['eval_metric']` (string) | Evaluation metric for validation data; a default metric is assigned according to `objective` |
| `args['seed']` (int) | Random number seed |
Table 2. Advanced model Python script arguments for XGBoost Dask models.
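As a rough sketch of how these arguments might map onto an XGBoost-on-Dask training call: the Dask client setup and data retrieval below are assumptions based on the rest of this page rather than BOSS-mandated code, and `eta` is XGBoost's name for the learning rate.

```python
import xgboost as xgb
from dask.distributed import Client
from boss.core.lib.training_resources import get_features_and_labels

def start(args):
    client = Client()  # assumes a local Dask cluster; adapt to your deployment
    # Only the training split is used in this sketch
    train_features, train_labels, *_ = get_features_and_labels(args)
    dtrain = xgb.dask.DaskDMatrix(client, train_features, train_labels)
    params = {
        'booster': args['booster'],
        'objective': args['objective'],
        'base_score': args['base_score'],
        'eval_metric': args['eval_metric'],
        'seed': args['seed'],
        'eta': args['parameters']['lr'],  # XGBoost's learning-rate parameter
    }
    output = xgb.dask.train(client, params, dtrain,
                            num_boost_round=args['parameters']['steps'])
    # xgb.dask.train returns {'booster': ..., 'history': ...}
    return output['booster']
```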
## Saving your Trained Model
At the end of your training run, you will need to save your model. Before doing so, use one of the `boss.core.uds.util` methods to place your resultant model in a zip file. These methods include:

- `zip_model_tf(estimator, serving_input_receiver_function, model_id, graph_version, log_dir)`
- `zip_model_pt(model, model_id, model_path, graph_version)`
- `zip_model_xgboost(model, model_id, model_path, graph_version)`
- `zip_model_sklearn(model, model_id, model_path, graph_version)`
Note that the TensorFlow method differs from the rest in that it requires a serving input receiver function (for use in inference). `model` is your model object, `model_id` is your model ID (provided in `args` to `main`), and `model_path` is the path used to construct the zip file; we recommend building it from the `log_dir` variable as shown below. Lastly, `graph_version` is provided by the `args` passed to `main`.
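For the TensorFlow case, a hedged sketch of that extra serving-input-receiver argument might look like the following. The feature spec is a placeholder assumption (match it to your model's actual inputs), and `zip_tf_model` is a hypothetical wrapper whose parameters would come from your training code as described above:

```python
import tensorflow as tf
import boss.core.uds.util as uds

def serving_input_receiver_fn():
    # Placeholder input signature; replace with your model's real features
    features = {'x': tf.compat.v1.placeholder(tf.float32, [None, 4], name='x')}
    return tf.estimator.export.ServingInputReceiver(features, features)

def zip_tf_model(estimator, model_id, graph_version, log_dir):
    # Unlike zip_model_pt/xgboost/sklearn, the TF helper also takes the
    # serving-input-receiver function so the model can be served at inference
    return uds.zip_model_tf(estimator, serving_input_receiver_fn,
                            model_id, graph_version, log_dir)
```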
After retrieving the zipped trained model, save it by passing it to `train.update()`. All of this is demonstrated in the example below.
## Example Advanced Model (Scikit-Learn on Iris data)
```python
import boss.core.uds.util as uds
import numpy as np
from boss.core.internal import train
from boss.core.lib.confusion_matrix import ConfusionMatrix
from boss.core.lib.plotter import Plotter
from boss.core.lib.training_resources import get_features_and_labels
from sklearn.linear_model import LogisticRegression


# Function used in classification models to map int labels to string names
def label_mapping():
    return {0: 'I. versicolor', 1: 'I. virginica-setosa'}


# Entrypoint of the BOSS Advanced modeling framework (your main function is what gets called!)
def main(args):
    # Retrieve parameters
    tid = args['train_id']
    graph_version = args['graphversion']
    model_id = args['model']
    log_dir = str(args['exportdir'])

    # Retrieve the dataset
    try:
        train.status(tid, 0)
        train_features, train_labels, train_eid, eval_features, eval_labels, eval_eid, _, _, test_eid = \
            get_features_and_labels(args)
    except Exception as exception:
        train.status(tid, 5, exception)
        raise

    # Apply label mapping to the dataframes
    train_labels = train_labels.mask(train_labels['flower.species'] == 'I. versicolor', 0)
    train_labels = train_labels.mask(train_labels['flower.species'] == 'I. virginica', 1)
    train_labels = train_labels.mask(train_labels['flower.species'] == 'I. setosa', 1)
    eval_labels = eval_labels.mask(eval_labels['flower.species'] == 'I. versicolor', 0)
    eval_labels = eval_labels.mask(eval_labels['flower.species'] == 'I. virginica', 1)
    eval_labels = eval_labels.mask(eval_labels['flower.species'] == 'I. setosa', 1)

    # Set the dtypes (the deprecated np.float/np.int aliases were removed in
    # NumPy 1.24, so use the built-in types)
    train_features = train_features.astype(float)
    eval_features = eval_features.astype(float)
    train_labels = train_labels.astype(int)
    eval_labels = eval_labels.astype(int)

    # Scikit-learn needs in-memory arrays, so compute the Dask dataframes
    nparr = train_features.compute().values
    nplabelsarr = train_labels.compute().values

    # Let's train a model!
    try:
        train.status(tid, 1)

        # Train model
        try:
            sk_model = LogisticRegression().fit(nparr, nplabelsarr)
        except Exception as err:
            train.status(tid, 5, "failure: error in training -- " + str(err))
            raise

        # Compute the evaluation split once for reuse below
        eval_feats = eval_features.compute()
        eval_labs = eval_labels.compute()

        # Generate predictions and convert to the [P(class 0), P(class 1)] format
        # expected by the ROC & PR plots (predict_proba would give graded scores;
        # hard labels work but produce a single-point curve)
        preds = sk_model.predict(eval_feats)
        new_preds = [[1.0 - i, i] for i in preds]
        pred_labels = [np.argmax(i) for i in new_preds]

        # Instantiate plotter object
        pltr = Plotter(tid)

        # ROC curve, PR curve, confusion matrix
        pltr.roc_curve(eval_labs, new_preds, [0, 1])
        pltr.precision_recall_curve(eval_labs, new_preds, [0, 1])
        pltr.update()
        ConfusionMatrix([eval_features, eval_labels], pred_labels, 2, label_mapping(), tid,
                        eval_eid.compute().to_list())

        # Store model and performance stats back to BOSS back-end
        model_filename = uds.zip_model_sklearn(sk_model, model_id, log_dir + "model.pkl",
                                               graph_version)
        with open(model_filename, "rb") as graph_file:
            train.update({tid: {
                'performance': {
                    'score': sk_model.score(eval_feats, eval_labs)
                },
                'ordered_class_names': ['I. versicolor', 'I. virginica-setosa'],
                'graph_version': graph_version,
                'graph_filename': 'model.skl',
                'graph_file': graph_file.read()
            }})
        train.status(tid, 3)
    except Exception as exception:
        train.update({tid: {'Caught exception': str(exception)}})
        train.status(tid, 5, "failure: error in training -- " + str(exception))
        raise
```