Data Preparation
Sources
The BOSS client can show the user all available sources across the federation, along with a visualization of data ingestion over time.
- Sources table – list of all sources in federation.
- Federate indicator – hover over to see which federates contain the source. If the indicator is missing, then the source exists only on the logged-in domain.
- Refresh – refresh the displayed data.
- Ingestion over time viz – color represents relative number of records ingested during a given time period for a source. Click a box to zoom into that time period across all sources.
- Back – Go up one time period (e.g., from Month to Year).
Data Transformation
After selecting a project, a user is taken to the Data Transform view, where queries can have EDA operations added to them and then built into a Virtual Dataset (VDS).
Left Sidebar
- Saved Workflows – Each row represents a query that has been saved and is eligible to have EDA operations performed on it. Can be dragged and reordered on the ‘Active Workflows’ space.
- Federate – hover over to see the federates of all VDSs contained in the saved workflow. If it is orange, then at least one of the VDSs has an issue on a federate. If this icon is not visible, then all VDSs in that workflow exist only on the logged-in domain.
- VDS – The number of VDS created from the given workflow.
- Workflow name – This will appear red if any operations within the workflow have returned an error. The error will go away once the operation has returned successfully.
- Quick add – click to add the workflow to the 3D visualize space.
- Begin a new query – Click to build a query inside query builder so that it can be saved to the Transform space.
- Available operations – Click to begin adding an operation to a selected 3D node.
Workflow 3D Space
- Zoom – click and drag to zoom in and out of 3D space.
- Arrow to node – click to move selection to a different node. Can also use arrow keys.
- Active Workflows – an ordered list of workflows currently displayed in 3D space. Can be dragged into a new order or clicked on to zoom to the root node of the selected workflow.
- Selected node – Select any node to see additional options.
- Child node – children are displayed to the right of a parent node, with lines connecting them to it.
Query node
- Remove – remove the selected query and accompanying workflow from the 3D space.
- Delete – delete the selected query and accompanying workflow. This cannot be undone.
- Edit – Reloads the query parameters into the query builder so it can be modified and saved as a new query.
- Preview Data – Execute the query and visualize the results.
- Create VDS – Begins the process for creating a Virtual Dataset used for training.
Operation Nodes
- Operation name.
- Operation type.
- Delete – delete the selected operation and all downstream operations in workflow. This cannot be undone. Will not execute if there is a VDS downstream.
- Preview Data – Execute the query and all operations up to and including this one, then visualize the results.
- Create VDS – Begins the process for creating a Virtual Dataset used for training.
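The Delete rule for operation nodes (remove the operation and everything downstream, but refuse when a VDS sits downstream) can be modelled as a small tree walk. The sketch below is illustrative only: the `Node` class, its `kind` values, and `delete_operation` are hypothetical names and are not part of BOSS.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical workflow node; kind is 'query', 'operation', or 'vds'."""
    name: str
    kind: str
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def downstream(node: Node) -> List[Node]:
    """Return every node below `node` (children, grandchildren, and so on)."""
    found: List[Node] = []
    for child in node.children:
        found.append(child)
        found.extend(downstream(child))
    return found


def delete_operation(node: Node) -> bool:
    """Detach `node` and its subtree; refuse if a VDS exists downstream."""
    if node.parent is None:
        return False  # a root query node is not an operation
    if any(n.kind == "vds" for n in downstream(node)):
        return False  # assumed behaviour: the delete does not execute
    node.parent.children.remove(node)
    node.parent = None
    return True


query = Node("saved query", "query")
dedupe = query.add(Node("drop duplicates", "operation"))
scale = dedupe.add(Node("min-max scale", "operation"))
scale.add(Node("training set", "vds"))
fill = query.add(Node("fill missing", "operation"))

blocked = delete_operation(dedupe)  # a VDS is downstream, so nothing changes
removed = delete_operation(fill)    # no VDS downstream, so the node is removed
```

In this model, deleting the VDS first and then the operation would succeed, which matches the order a user would follow in the GUI.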
Virtual Dataset Node
- VDS Name.
- Delete – delete the selected VDS. This cannot be undone.
- Preview Data – Execute the query and all operations leading to this VDS, then visualize the results.
- Create Embedding – Only available with text models.
- Merge VDS – Click and drag to another Virtual Dataset to merge them together.
- Start training – Open the Modeling view to train with this VDS.
Navigating Transforms with Arrow Keys
Once a node is selected in the data transform space, you may use the arrow keys to navigate quickly between adjacent nodes.
Collapsing Nodes with Double Click
Double clicking a node will collapse all children downstream from that node and add a superscript next to it, indicating how many nodes were collapsed. This can be useful in large, spread out trees.
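The superscript shown on a collapsed node can be thought of as a count of every node hidden beneath it. The following is a minimal sketch of that count, assuming the number covers all downstream nodes rather than only direct children; the tree contents and the `count_downstream` helper are hypothetical.

```python
from typing import Dict, List

# Hypothetical workflow tree: each key maps a node to its direct children.
tree: Dict[str, List[str]] = {
    "query": ["tokenize", "drop nulls"],
    "tokenize": ["remove stopwords"],
    "remove stopwords": ["text VDS"],
    "drop nulls": [],
    "text VDS": [],
}


def count_downstream(tree: Dict[str, List[str]], node: str) -> int:
    """Count every node below `node`, i.e. what a collapse would hide."""
    stack = list(tree.get(node, []))
    total = 0
    while stack:
        current = stack.pop()
        total += 1
        stack.extend(tree.get(current, []))
    return total


collapsed_under_tokenize = count_downstream(tree, "tokenize")
collapsed_under_query = count_downstream(tree, "query")
```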
Rearranging Active Workflows
Transform workflows in the active list can be rearranged in any order. This can be useful for comparing trees, or to bring two Virtual Datasets closer together to perform a merge.
Preparing Text Data for Model Training
BOSS provides special operations for easily preparing text data for model training, saving a model developer valuable time in manually coding routines for text transformation.
- After creating an EDA tree based on a query of a text data source, a developer can add a new operation to the tree based on NLP operations as shown above.
- NLP operations (e.g., stopword removal, whitespace removal, lemmatization) can be applied in any sequence.
- It’s important to select the correct facet as the “text attribute.”
- One can also elect to apply tokenization based on a document level (i.e., create one sequence of tokens for the entire facet value per record), or sentence level (i.e., create a token sequence per sentence in the facet for a record).
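To make the difference between document-level and sentence-level tokenization concrete, here is a rough plain-Python sketch of whitespace removal, stopword removal, and both tokenization levels. It is not the BOSS implementation: the stopword list, the naive sentence splitter, and every function name are assumptions made for illustration, and lemmatization is left out because it needs a language resource.

```python
import re
from typing import List

# A tiny illustrative stopword list; a real list would be much longer.
STOPWORDS = {"a", "an", "the", "is", "of", "and", "to", "in"}


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]


def tokenize(text: str) -> List[str]:
    """Lowercase the text and keep alphanumeric word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def remove_stopwords(tokens: List[str]) -> List[str]:
    return [t for t in tokens if t not in STOPWORDS]


def process(text: str, level: str = "document"):
    """Return one token list per record, or one token list per sentence."""
    cleaned = normalize_whitespace(text)
    if level == "document":
        return remove_stopwords(tokenize(cleaned))
    if level == "sentence":
        return [remove_stopwords(tokenize(s)) for s in split_sentences(cleaned)]
    raise ValueError(f"unknown level: {level}")


record = "The  sensor is   offline.  Restart the gateway and retry!"
doc_tokens = process(record, "document")
sentence_tokens = process(record, "sentence")
```

For this record, the document level yields a single sequence of five tokens, while the sentence level yields two sequences, one per sentence.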
Saving VDS with Processed Text
When a developer wants to create a new virtual dataset including the transformed text data, they must choose the “processed_text” facet as the “sole” feature of the virtual dataset as shown below.
Currently, BOSS does not support text model training that incorporates multiple feature columns, so only the “processed_text” facet may be selected.
Multi-column text model training will be supported in a future release.
Applying Custom Operations
Once custom operations have been defined and uploaded using the BOSS Python Client library, they are available in the GUI for usage in data transformation.
As shown above, clicking on a custom operation will show further details, specifically the features the operation uses as well as the actual source code defining the op. As mentioned in the documentation for defining custom operations via the BOSS Python Client, one must select how to apply the operation based on one of the following approaches, the first three of which correspond to Dask dataframe methods:
- apply
- map_partitions
- applymap
- apply_direct - apply custom function directly on a dask dataframe
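The practical difference between these modes is what the custom function receives each time it is called. The sketch below models that in plain Python rather than with Dask itself: a "table" is a list of partitions, each partition a list of row dictionaries, and the four helper functions are stand-ins that only mirror the granularity (row, partition, single value, whole dataframe). None of these helpers are BOSS or Dask APIs.

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, float]
Partition = List[Row]
Table = List[Partition]  # stand-in for a dataframe split into partitions


def apply_rows(table: Table, func: Callable[[Row], Any]) -> List[List[Any]]:
    """'apply' style: func is called once per row."""
    return [[func(row) for row in part] for part in table]


def map_partitions(table: Table, func: Callable[[Partition], Any]) -> List[Any]:
    """'map_partitions' style: func is called once per partition."""
    return [func(part) for part in table]


def applymap(table: Table, func: Callable[[float], float]) -> Table:
    """'applymap' style: func is called once per individual value."""
    return [[{k: func(v) for k, v in row.items()} for row in part]
            for part in table]


def apply_direct(table: Table, func: Callable[[Table], Any]) -> Any:
    """'apply_direct' style: func receives the whole table at once."""
    return func(table)


table: Table = [
    [{"width": 2.0, "height": 3.0}, {"width": 4.0, "height": 5.0}],
    [{"width": 6.0, "height": 1.0}],
]

areas = apply_rows(table, lambda row: row["width"] * row["height"])
partition_sizes = map_partitions(table, len)
doubled = applymap(table, lambda value: value * 2)
total_rows = apply_direct(table, lambda t: sum(len(p) for p in t))
```

As a rule of thumb under this model, a function that needs several columns of one record fits the row-wise mode, a function that benefits from seeing many rows at once fits the partition mode, and a function of a single value fits the element-wise mode.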
Applying Image Operations
To apply image operations, select the Image Ops tab within the New Op menu in an EDA tree.
- It’s important to select an image facet as the “Feature.”
- The currently provided operations are as follows:
- Vertical and horizontal flips
- Grayscale Contrast normalization
- Normalize (0 mean and unit variance)
- Resize width & height
- Color inversion
- Crop borders
- Gaussian blur
- Rotate
- Min-max scaling
- To array (converts binary data to Numpy Array)
- Reshape dimensions
* Operations can be applied to percentages of a dataset instead of the entirety, and can also be used to augment existing data instead of operating in-place.
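The note about applying an operation to a percentage of the dataset, either in place or as augmentation, can be illustrated with a short plain-Python sketch. Images are modelled as nested lists of grayscale pixel values, and `apply_to_fraction`, its parameters, and the random selection strategy are all assumptions; how BOSS actually selects records is not described here.

```python
import random
from typing import Callable, List

Image = List[List[float]]  # grayscale image as rows of pixel values


def horizontal_flip(img: Image) -> Image:
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in img]


def vertical_flip(img: Image) -> Image:
    """Reverse the order of the rows."""
    return [list(row) for row in reversed(img)]


def min_max_scale(img: Image) -> Image:
    """Rescale pixel values linearly into the range [0, 1]."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    span = (hi - lo) or 1.0  # avoid dividing by zero on a constant image
    return [[(p - lo) / span for p in row] for row in img]


def apply_to_fraction(images: List[Image], op: Callable[[Image], Image],
                      fraction: float, augment: bool = False,
                      seed: int = 0) -> List[Image]:
    """Apply `op` to a random `fraction` of `images`.

    augment=False replaces the chosen images with their transformed version;
    augment=True keeps every original and appends the transformed copies.
    """
    rng = random.Random(seed)
    count = round(len(images) * fraction)
    chosen = set(rng.sample(range(len(images)), count))
    if augment:
        return list(images) + [op(images[i]) for i in sorted(chosen)]
    return [op(img) if i in chosen else img for i, img in enumerate(images)]


dataset = [[[0.0, 50.0], [100.0, 200.0]] for _ in range(4)]
in_place = apply_to_fraction(dataset, horizontal_flip, 0.5)
augmented = apply_to_fraction(dataset, horizontal_flip, 0.5, augment=True)
```

With four images and a fraction of 0.5, the in-place call returns four images of which two are flipped, while the augmenting call returns six images: the four originals plus two flipped copies.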
Query Builder
The BOSS client offers a unique and intuitive way to query data, giving a user flexibility in how complex queries are strung together to retrieve exact results.
Left Sidebar
- Sources – a list of available sources to query. This can be dragged into the node editor window.
- Quick add – click to add this source to the node editor window
- Federate status – Hover to see which federates hold the source. If this icon does not show, then the source exists only on the currently logged-in domain.
- Data Models – a list of available data models to query. This can be dragged into the node editor window.
- Quick add – click to add this data model to the node editor window
- View stats – click to view statistics of this particular data model
- View features – click to view the features of this particular data model
- Features – a list of features in this data model. This can be dragged into the node editor window
- Quick add – click to add this feature to the node editor window
- Federates – a list of available federates for filtering the query.
- Note: the currently logged-in domain will ALWAYS return results, regardless of whether it is selected.
Node Editor Window
- Global search parameters – Click to view simple/advanced search filters
- Zoom – drag this slider or use the mouse wheel to zoom in and out of the node view
- Lucene syntax – a text representation of the search to be executed.
- Copy Lucene syntax – click to copy the Lucene syntax. This can be pasted into the global search parameters to customize a search with features not supported by the node editor.
- Search – Click to execute the search
- Save – Save the search for use in Transform workflow.
- Note that a search must be executed before it is saved.
- Group – Toggle, then click and drag around a set of nodes to add a grouping around them. This acts as a set of parentheses in the Lucene syntax. This function can also be accomplished by holding Shift + Left click + drag.
- Refresh – Click to retrieve and repopulate the list of sources/data models/federates.
- Exit – Close the query builder. Any unsaved progress will be lost.
- Modify Node – Change node filter settings
- Delete Node
- Node connection dropdown – Click to select from AND/OR/XOR
- Node connector – click and drag to connect to another node or grouping
- Statistics – click to view statistics of last executed query
Advanced Search Parameters
- All these words – search results must include all these words
- Lucene query – add a Lucene query that will take the place of whatever is in the Node Editor Window
- This exact phrase – search results must include this exact phrase
- None of these words – search results must not have any of these words
- Records per source/model – return this many records per source/model
- Total records to return – return at least this many total records
- Date range – search results must be from within this time period
- Randomize – results should be returned in a random order
- All Sources/Models – results should include a sample from every applicable source and data model
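For readers unfamiliar with Lucene, the three word filters above map naturally onto standard Lucene query syntax, where `+` marks a required term, `-` marks a prohibited term, and double quotes mark a phrase. The sketch below shows one plausible translation; the `build_lucene` function is hypothetical, it performs no escaping of special characters, and the exact string BOSS generates may differ.

```python
from typing import Sequence


def build_lucene(all_words: Sequence[str] = (),
                 exact_phrase: str = "",
                 none_words: Sequence[str] = ()) -> str:
    """Combine the simple filters into a single Lucene query string."""
    parts = [f"+{word}" for word in all_words]       # must contain each word
    if exact_phrase:
        parts.append(f'+"{exact_phrase}"')           # must contain the phrase
    parts += [f"-{word}" for word in none_words]     # must not contain
    return " ".join(parts)


lucene = build_lucene(
    all_words=["engine", "failure"],
    exact_phrase="oil pressure",
    none_words=["test"],
)
```

A string in this form could also be typed directly into the Lucene query field, which replaces whatever is in the Node Editor Window.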
Search Results
- Visualization panel – this will update with each search executed
- Federate distribution – a bar chart showing how many records were returned from each applicable federate
- Query statistics – each returned feature will show relevant statistics, and if applicable, a box plot to visualize.
Adding a node and changing connection logic
Nodes can be dragged into the workspace, or quickly added using the ‘+’ button on the left. The dropdown connecting two nodes or groups can be changed to AND/OR/XOR.
Grouping nodes
Nodes can be grouped together using the ‘Group’ toggle at the top or by holding Shift and dragging. Groupings will add parentheses around the selected nodes in the Lucene output.
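The effect of grouping on the Lucene output can be sketched as a small recursive structure in which each group renders its members joined by their connectors and wraps the result in parentheses. The `Term` and `Group` classes and the feature names below are hypothetical, and only AND and OR are shown because how BOSS renders XOR is not described here.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Term:
    """A single node: one feature matched against one value."""
    feature: str
    value: str

    def to_lucene(self) -> str:
        return f"{self.feature}:{self.value}"


@dataclass
class Group:
    """Nodes or nested groups joined by connectors such as AND / OR."""
    items: List[Union[Term, "Group"]]
    operators: List[str]  # one connector between each pair of neighbours

    def to_lucene(self, top_level: bool = False) -> str:
        if len(self.operators) != len(self.items) - 1:
            raise ValueError("need exactly one operator between items")
        pieces = [self.items[0].to_lucene()]
        for op, item in zip(self.operators, self.items[1:]):
            pieces.extend([op, item.to_lucene()])
        joined = " ".join(pieces)
        # Nested groups are wrapped in parentheses; the outermost is not.
        return joined if top_level else f"({joined})"


severity = Group([Term("status", "error"), Term("status", "warning")], ["OR"])
query = Group([severity, Term("source", "gateway")], ["AND"])
lucene = query.to_lucene(top_level=True)
```

Without the grouping, the same three nodes would read `status:error OR status:warning AND source:gateway`, where the result depends on operator precedence instead of the author's intent.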
Manually connecting nodes
Nodes can be manually connected and disconnected by clicking and dragging either of the two circles on the side of a node/group.