Model Development Approaches
BMF provides flexibility in the level of effort and control needed for preparing models for BOSS. Two approaches are available: the advanced model approach (also called full) and the simple model approach (also called compact). Figure 1 illustrates their differences.
Figure 1. Conceptual illustration of full and compact model approaches.
Advanced Model Approach
In the advanced model approach, a developer creates an AI model and calls the BMF Python libraries directly to complete the model training workflow (e.g., train, validate, test on holdout data, store results). This gives complete flexibility for more advanced use cases, such as complex or experimental training loops, advanced performance analysis, or custom model compression. Advanced models are implemented as ordinary Python scripts. Further details are in the Developing Advanced Models section of this documentation.
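To make the workflow steps concrete, the following is a minimal, framework-free sketch of the train, validate, holdout-test, and store-results sequence that a developer handles manually in this approach. It does not use the BMF libraries: every function name, the toy threshold "model", the synthetic data, and the results file name are assumptions made purely for illustration.

```python
import json
import random
import tempfile
from pathlib import Path
from statistics import mean


def split_rows(rows, holdout_fraction=0.2, validation_fraction=0.2, seed=0):
    """Shuffle rows and split them into train, validation, and holdout sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_holdout = int(len(shuffled) * holdout_fraction)
    n_validation = int(len(shuffled) * validation_fraction)
    holdout = shuffled[:n_holdout]
    validation = shuffled[n_holdout:n_holdout + n_validation]
    train = shuffled[n_holdout + n_validation:]
    return train, validation, holdout


def fit_threshold(train):
    """Toy 'model': the midpoint between the mean feature value of each class."""
    negatives = [x for x, y in train if y == 0]
    positives = [x for x, y in train if y == 1]
    return (mean(negatives) + mean(positives)) / 2


def accuracy(threshold, rows):
    """Fraction of rows whose label matches the thresholded prediction."""
    correct = sum(1 for x, y in rows if int(x > threshold) == y)
    return correct / len(rows)


def run_workflow(rows, results_path):
    """Train, validate, test on holdout data, then store the results as JSON."""
    train, validation, holdout = split_rows(rows)
    threshold = fit_threshold(train)
    results = {
        "threshold": threshold,
        "validation_accuracy": accuracy(threshold, validation),
        "holdout_accuracy": accuracy(threshold, holdout),
    }
    Path(results_path).write_text(json.dumps(results, indent=2))
    return results


# Synthetic data (illustrative only): the label is 1 when the feature exceeds 50.
rows = [(float(x), int(x > 50)) for x in range(100)]
results_file = Path(tempfile.gettempdir()) / "advanced_workflow_results.json"
results = run_workflow(rows, results_file)
```

In a real advanced model, each of these steps would be replaced by the corresponding BMF library calls described in Developing Advanced Models.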
Simple Model Approach
The simple model approach lets a developer focus most, if not all, effort on defining an AI model, leaving other workflow tasks, such as holdout data testing and storage of performance results, for the BMF to handle automatically. In the case of TensorFlow, the developer does not even need to write training logic. The major benefits of the simple model approach are (1) significantly less coding effort and (2) fewer potential errors and inconsistencies from hand-written boilerplate performance-testing logic. These benefits are especially useful when preparing models for multi-run experiments such as k-fold cross validation and learning curves (which will be introduced in an upcoming BMF release). Further details about simple modeling are in Developing Simple Models.
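As an illustration of the kind of boilerplate a multi-run experiment involves, the sketch below generates the train/test index splits for k-fold cross validation in plain Python. It is not the BMF implementation; the function name `k_fold_indices` and the contiguous, unshuffled fold layout are assumptions chosen for brevity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each sample appears in exactly one test fold across the k runs.
folds = list(k_fold_indices(n_samples=10, k=3))
```

Each run trains on the `train` indices and evaluates on the `test` indices; getting this bookkeeping consistent across runs is the sort of logic the simple approach is meant to take off the developer's hands.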
Notable Framework Capabilities
The BMF consists of an evolving set of capabilities. The following subsections describe notable modeling capabilities supported as of release 7.0.0.
Distributed Model Training
To use distributed model training, a developer only needs to be familiar with using the Horovod Python library to distribute their model. More details are covered in Distributed Model Training.
Federated Machine Learning
Federated machine learning allows models to be built and trained on data held across distinct remote systems (known as federates). This capability is useful when data cannot, or should not, be moved across systems. Further details about federated modeling are in Developing Federated Models.
TensorFlow Estimator-Based Modeling
TensorFlow supports AI modeling using either low-level APIs or easier-to-use high-level Estimator APIs. The BMF is designed to support Estimator-based model development. Keras may be used to create models, especially when more customization is needed. However, such models must be converted to Estimators so that BMF and the broader BOSS platform can manage them appropriately. See the following link for an introduction to TensorFlow Estimators: https://www.tensorflow.org/guide/estimator.
Various Feature Types
For TensorFlow modeling, all dataset feature column types are supported (see https://www.tensorflow.org/guide/feature_columns), enabling support for a broad range of numeric and categorical features. For categorical features, the full set of possible values (the domain) must be known at training time. For example, if you use a feature car_make as a categorical feature, you must know all the possible makes when you write your model. This requirement will be removed in a future release. In addition, converting non-numerical data to numerical data (e.g., for encoding label/target values) based on a scan of the entire dataset is not supported in the current release. To help with this, data value replacement operations are supported in the BOSS No Code client.
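The known-domain requirement can be pictured with a small plain-Python sketch: the vocabulary is written into the model code, and any value outside it is rejected. The list of makes and the `one_hot` helper are hypothetical and are not part of BMF; in TensorFlow itself, a fixed vocabulary like this would typically be passed to a call such as `tf.feature_column.categorical_column_with_vocabulary_list`.

```python
# Hypothetical vocabulary for the car_make feature, fixed when the model is written.
CAR_MAKES = ["ford", "honda", "toyota"]


def one_hot(value, vocabulary):
    """Encode value as a one-hot list; raise if it is outside the known domain."""
    if value not in vocabulary:
        raise ValueError(f"{value!r} is not in the known domain {vocabulary}")
    return [1 if value == item else 0 for item in vocabulary]


encoded = one_hot("honda", CAR_MAKES)  # [0, 1, 0]
```

A make that was not anticipated when the model was written, for example `one_hot("tesla", CAR_MAKES)`, raises a `ValueError` in this sketch, which is why the domain has to be enumerated up front.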
For TensorFlow modeling, label types are assumed to be TensorFlow int32.
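Because labels are expected to be int32 and cannot be derived from a scan of the dataset, one option is an explicit replacement table written by hand. The sketch below is a plain-Python illustration of that idea only; the table contents and the `encode_labels` helper are assumptions and do not represent a BMF or BOSS No Code client API.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1

# Hypothetical replacement table, written by hand rather than inferred from the data.
LABEL_REPLACEMENTS = {"no": 0, "yes": 1}


def encode_labels(raw_labels, replacements):
    """Map raw label values to integers that fit in a signed 32-bit range."""
    encoded = []
    for raw in raw_labels:
        if raw not in replacements:
            raise KeyError(f"no replacement defined for label {raw!r}")
        value = replacements[raw]
        if not INT32_MIN <= value <= INT32_MAX:
            raise OverflowError(f"label value {value} does not fit in int32")
        encoded.append(value)
    return encoded


labels = encode_labels(["yes", "no", "yes"], LABEL_REPLACEMENTS)  # [1, 0, 1]
```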
For TensorFlow and PyTorch modeling, BMF supports the use of embedding data, e.g., word2vec for representing free text. For PyTorch, the TorchText library is supported, but n-grams are not supported in the current release.
Important Note: Currently, when using text input, only the text/embedding input is allowed as a feature, enabling conventional text classification. Future releases will enable the use of multiple feature inputs alongside text data.
For TensorFlow and PyTorch modeling, use of image data (i.e., pixel values) as model input is supported.
Scikit-learn models and Scikit-learn pipelines are also supported. The use of sklearn.preprocessing.FunctionTransformer and other custom transformers within pipelines is not supported.
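As a rough sketch, a pipeline built only from standard Scikit-learn components might look like the following. The helper `uses_function_transformer` is a hypothetical pre-flight check, not a BMF function; it inspects only the top-level steps and does not detect other kinds of custom transformers. The step names and the tiny dataset are illustrative.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler


def uses_function_transformer(pipeline):
    """Return True if any top-level step of the pipeline is a FunctionTransformer."""
    return any(isinstance(step, FunctionTransformer) for _, step in pipeline.steps)


# A pipeline made only of built-in transformers and estimators.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("classify", LogisticRegression()),
])

# Tiny illustrative dataset: two features, binary label.
X = [[0.0, 1.0], [1.0, 0.0], [0.2, 0.9], [0.9, 0.1]]
y = [0, 1, 0, 1]

pipeline.fit(X, y)
predictions = pipeline.predict(X)
```

Running `uses_function_transformer(pipeline)` on this pipeline returns `False`; a pipeline containing a `FunctionTransformer` step would return `True` and, per the restriction above, would not be supported.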
Distributed XGBoost using Dask
Distributed training of XGBoost models using the Dask parallel data analytics framework is supported. The supported XGBoost version (1.3.1) includes Dask support as a module inside the XGBoost library itself (the separate dask-xgboost project was migrated into it). See the following link for more information: https://xgboost.readthedocs.io/en/latest/tutorials/dask.html.
Support for TensorFlow and PyTorch distributed training is under development.
The BOSS modeling framework supports the following languages and machine learning-related libraries:
- Python v3.6.5
- TensorFlow (for Python) v2.1
- PyTorch v1.6.0
- Dask v2021.1.0
- XGBoost v1.3.1
- Numpy v1.16.4
- Scikit-learn v0.19.2
- Pandas v0.25.1
While this documentation introduces all the core components and best practices for developing AI models for the BOSS platform, there is rarely a replacement for sample code. The BOSS Model Shop provides a wide range of code (prepared by BOSS developers) to help developers get started with preparing AI models. In the future, the BOSS Model Shop will also allow for the larger BOSS developer community to share their code, further helping others with their AI goals.
Python API Documentation
The BMF Python API documentation can be found in the following BOSS GitLab Pages site: (coming soon!)
Preparing Models Using the BOSS Modeling Framework
The following documentation contains further details and examples for developing AI models for BOSS.
- Developing Simple Models
- Developing Advanced Models
- Developing Federated Models
- Working with Data and Performance Analysis
- The BOSS Model Shop
An important note for developing PyTorch models in BOSS: before saving the model, the “eval” mode must be activated. See the following link for more details: https://pytorch.org/tutorials/beginner/saving_loading_models.html?highlight=eval.
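A minimal sketch of this step in plain PyTorch is shown below. The network architecture, the file name, and the choice to save the whole module with `torch.save` are assumptions for illustration; how BMF itself stores the model is not shown here.

```python
import os
import tempfile

import torch
from torch import nn

# Illustrative model; the dropout layer behaves differently in train and eval modes.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(8, 2),
)

# Training would happen here, with the model in training mode.
model.train()

# Switch to evaluation mode before saving, so layers such as dropout and
# batch normalization use their inference behavior.
model.eval()

# Assumption: the whole module is saved, which also records the eval-mode flag.
save_path = os.path.join(tempfile.gettempdir(), "example_model.pt")
torch.save(model, save_path)
```

After `model.eval()`, the attribute `model.training` is `False` for the module and all of its submodules.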