
Schema inference

At multiple stages in the lifecycle of a machine learning model, Cubonacci passes data from one method to another. To minimize the configuration required from users, Cubonacci infers the structure of the data whenever possible. This document explains where, why, and how this happens.

Where this happens

The schemas of feature data, target data, and prediction data are inferred at different stages. Feature data and target data are inspected directly after a data snapshot is created. The schema of prediction data is inferred after a model has been fully trained.

Why this happens

The schemas of your data are important to Cubonacci. Automatically generating a structure for Deployments that fits your data makes them easier to manage and test. Validating the input structure and returning clear error messages makes them more maintainable. Instead of requiring you to write a lengthy configuration that describes this structure, Cubonacci takes care of it where possible.

How this happens

Some data structures already carry their schema as metadata, including Pandas dataframes and TensorFlow Datasets. In these cases, Cubonacci extracts the schema information directly from the object. When no such metadata is available, Cubonacci looks at a large number of samples and builds up a schema that fully describes the data.
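
The sample-based approach can be illustrated with a minimal sketch. This is not Cubonacci's actual implementation; the function names `infer_field_types` and `build_schema`, and the shape of the resulting schema, are assumptions made purely for illustration. The idea is to scan many samples, record every type observed per field, and note which fields are sometimes absent.

```python
from collections import defaultdict


def infer_field_types(samples):
    """Collect the set of Python type names observed for each field."""
    observed = defaultdict(set)
    for sample in samples:
        for field, value in sample.items():
            observed[field].add(type(value).__name__)
    return observed


def build_schema(samples):
    """Summarise each field: the types seen and whether it can be missing."""
    samples = list(samples)
    observed = infer_field_types(samples)
    schema = {}
    for field, type_names in observed.items():
        present = sum(1 for sample in samples if field in sample)
        schema[field] = {
            "types": sorted(type_names),
            # A field is optional if at least one sample lacks it.
            "optional": present < len(samples),
        }
    return schema


# Hypothetical samples; "age" is missing from the last one.
samples = [
    {"name": "Cubonacci", "age": 1.9},
    {"name": "Python", "age": 26.3},
    {"name": "Kubernetes"},
]
schema = build_schema(samples)
```

Looking at more samples makes it more likely that rare types and missing fields are observed, which is presumably why a large number of samples is inspected.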

Supported structures

It should not matter whether your data is nested, variable in length, or potentially missing. Cubonacci supports a number of native primitives as well as data structures from commonly used packages. The structures listed below are supported and can be nested within individual samples.

Sample structure

Sample structure
Pandas Dataframe
Pandas Series
Numpy array
Dictionary
Tuple
List
String
Float

Example:

sample = {"historical_purchases": [1, 2, 3, 4],
          "profile":
              {"age": 31,
               "length": 1.832,
               "name": "Cubonacci"}}

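To make the notion of a nested sample structure concrete, the following sketch walks a sample recursively and reports the type found at each position. The helper `describe` and its output format are hypothetical and only meant as an illustration, not as the schema representation Cubonacci uses internally.

```python
def describe(value):
    """Return a nested description of the types inside a sample."""
    if isinstance(value, dict):
        # Describe every field of a dictionary separately.
        return {key: describe(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        # Keep each distinct element description once, so the result
        # does not depend on the length of the sequence.
        unique = []
        for item in value:
            description = describe(item)
            if description not in unique:
                unique.append(description)
        return {"sequence_of": unique}
    # Primitives are described by their Python type name.
    return type(value).__name__


sample = {"historical_purchases": [1, 2, 3, 4],
          "profile":
              {"age": 31,
               "length": 1.832,
               "name": "Cubonacci"}}

structure = describe(sample)
```

Because sequences are summarised by their element types, a sample with four historical purchases and one with forty yield the same description.
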
Dataset structure

The following structures are supported for the full dataset, with the schema inferred immediately from their metadata:

Direct dataset structure
Pandas Dataframe
Pandas Series
Numpy array

Example:

import pandas as pd

dataset = pd.DataFrame({"name": ["Cubonacci", "Python", "Kubernetes"],
                        "age": [1.9, 26.3, 4.6]})

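As an illustration of what schema metadata looks like on such an object, the sketch below reads the column names and dtypes from a Pandas dataframe through its `dtypes` attribute. How Cubonacci itself consumes this metadata is not described here, so the `schema` dictionary is only an assumed, simplified representation.

```python
import pandas as pd

dataset = pd.DataFrame({"name": ["Cubonacci", "Python", "Kubernetes"],
                        "age": [1.9, 26.3, 4.6]})

# DataFrame.dtypes is a Series that maps each column name to its dtype,
# so no individual rows need to be inspected.
schema = {column: str(dtype) for column, dtype in dataset.dtypes.items()}
```

Here the `age` column is reported as `float64`; the dtype reported for the string column depends on the installed Pandas version.
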
The following structures can also be used to wrap the individual samples:

Indirect dataset structure
List
Tuple

Example:

dataset = [{"name": "Cubonacci", "age": 1.9},
           {"name": "Python", "age": 26.3},
           {"name": "Kubernetes", "age": 4.7}]