
Data loader

The data loader is the interface between Cubonacci and the data sets for your project. This covers training data, prediction data for batch predictions, and monitoring data. Several patterns are available, depending on the type of data and the algorithms you use. This document first describes the idea behind the data loader and two important concepts, then the two patterns and the different ways to implement them.

DataLoader class

The DataLoader class is a class that you implement in your repository. The constructor initializes the class and optionally takes a secrets argument. The object is initialized every time Cubonacci needs to interact with your data set. Cubonacci inspects which methods are implemented and calls them when required.
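For illustration, a minimal skeleton could look like this (which methods you implement depends on the pattern you choose, as described below):

class DataLoader:
    def __init__(self, secrets=None):
        # Optional: Cubonacci fills the secrets argument at run time.
        self.secrets = secrets

    def load_training_data(self, secrets):
        # In-memory pattern; see the In-memory data section below.
        raise NotImplementedError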

Schema inference

To minimize the required adaptation, in most cases you do not need to pass explicit schemas of your data to Cubonacci. When your methods are called, the platform performs a deep inspection of what your data looks like. This structure is later used to dynamically generate the structure of your deployments and to check whether two different versions of the data are compatible.

Validation scheme

To let Cubonacci take care of the validation scheme based on the configuration, a few implementation specifics are required. For in-memory data, this means keeping your data in order if the ordering is relevant for validation. For the iterative approaches, Cubonacci passes a fold argument to the method that returns the data object, so a custom validation scheme can be applied, as sketched below.
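As a sketch of how an iterative method could honor the fold argument (the exact signature, file layout, and five-fold split here are assumptions; the goal argument is described under Iterative data below):

import pandas as pd

class DataLoader:
    def data_generator(self, path, goal, fold):
        # Sketch only: derive a train/validation split from the fold index.
        data = pd.read_csv(f"{path}/train.csv")   # hypothetical file layout
        in_fold = (data.index % 5) == fold        # assumed 5-fold scheme
        train_part, validation_part = data[~in_fold], data[in_fold]
        ...  # build and return the data object from the relevant part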

In-memory data

Loading training data in-memory works similarly to scikit-learn: your load_training_data method returns two objects, the features and the corresponding targets. The structure of both objects is inferred.

The method optionally takes a secrets argument, a dictionary filled at run time with the relevant secrets.

After Cubonacci calls this method, the data is stored as a snapshot and remains available for later use.

Example:

import pandas as pd

class DataLoader:
    def load_training_data(self, secrets):
        # Secrets are injected at run time; validate them before loading data.
        if secrets['password'] != "correct":
            raise ValueError("Wrong password")
        features = pd.DataFrame({"a": [1, 2, 3, 4, 5, 6],
                                 "b": [2, 3, 4, 5, 6, 7]})
        targets = pd.Series([0, 1, 0, 1, 0, 1])
        return features, targets

Iterative data

Iterative data uses a slightly different approach. First, the raw data is copied to a specific path by the load_raw_training_data method of the DataLoader. After this is done, Cubonacci calls the method that matches the iterative pattern you have implemented. In each case the path is passed as a string argument.
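A minimal sketch, assuming the method receives the destination path as a string; the source location shown here is hypothetical:

import shutil

class DataLoader:
    def load_raw_training_data(self, path):
        # Copy the raw data set into the path provided by Cubonacci.
        shutil.copy("/mnt/source-data/train.csv", f"{path}/train.csv")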

Besides the path argument, there is also a goal argument: a string that can take the following values:

Goal        Explanation
train       Provides a dataset for training a model during validation
validate    Provides a dataset for validating a model during validation
train-full  Provides all the data for training a model intended for deployment

Generator

Implementing the data_generator method lets Cubonacci load your data generator. This method should return a tuple with two components: the generator itself and the number of samples it covers. The generator should be infinite, meaning it can be consumed continuously.
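A sketch of such a generator, assuming the method receives the path, goal, and fold arguments described above and that the data sits in a CSV file with a hypothetical layout:

import pandas as pd

class DataLoader:
    def data_generator(self, path, goal, fold):
        # goal and fold handling is elided; see the validation scheme section.
        data = pd.read_csv(f"{path}/train.csv")  # hypothetical file name
        n_samples = len(data)

        def generate():
            # Infinite generator: cycle through the samples forever.
            while True:
                for _, row in data.iterrows():
                    yield row[["a", "b"]], row["target"]

        return generate(), n_samples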

TensorFlow Dataset

Similar to the generator approach, implementing the tensorflow_dataset method lets Cubonacci retrieve this Dataset object, which is then inspected and passed on to the algorithms.
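A sketch under the same assumptions (hypothetical CSV layout and signature):

import pandas as pd
import tensorflow as tf

class DataLoader:
    def tensorflow_dataset(self, path, goal, fold):
        data = pd.read_csv(f"{path}/train.csv")  # hypothetical file name
        features = data[["a", "b"]].values
        targets = data["target"].values
        dataset = tf.data.Dataset.from_tensor_slices((features, targets))
        if goal != "validate":
            # Shuffle only when producing training data.
            dataset = dataset.shuffle(buffer_size=1024)
        return dataset.batch(32)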

PyTorch DataLoader

The pytorch_dataloader method returns a PyTorch DataLoader instance for the specific goal, which is again inspected and passed to the algorithms.
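A sketch under the same assumptions; the torch DataLoader is aliased to avoid clashing with the Cubonacci DataLoader class:

import pandas as pd
import torch
from torch.utils.data import DataLoader as TorchDataLoader, TensorDataset

class DataLoader:
    def pytorch_dataloader(self, path, goal, fold):
        data = pd.read_csv(f"{path}/train.csv")  # hypothetical file name
        features = torch.tensor(data[["a", "b"]].values, dtype=torch.float32)
        targets = torch.tensor(data["target"].values, dtype=torch.float32)
        dataset = TensorDataset(features, targets)
        # Shuffle only when producing training data.
        return TorchDataLoader(dataset, batch_size=32, shuffle=(goal != "validate"))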