Algorithms

Algorithms are responsible for learning the relationship between the features and the targets. The way these estimators are defined follows the API design of scikit-learn, with a few additional options. Alongside the code, a small configuration file is required that contains the name of the algorithm and its hyperparameter settings.

Model definition

Algorithms are defined by implementing a Model class. There are three mandatory methods (the constructor, fit, and predict) and a number of optional ones. Every algorithm is placed in its own folder.

Constructor

__init__(self, hyperparameter_1, hyperparameter_2)

The constructor initializes your model with the given hyperparameter settings, which are stored as attributes of the model instance for later use in the fit method. The constructor arguments correspond by name to the hyperparameters defined in the model configuration, as described below. During training, Cubonacci passes the configuration for the current session to the constructor. These values come either from the hyperparameter search of the running experiment or from hyperparameters that were passed explicitly.

Fit

The fit method combines the hyperparameters set in the constructor with the training data to train the model and get it ready for predicting on unseen data. Its signature depends on the form of the dataset returned by your DataLoader class: in-memory data or a generator.

In-memory

.fit(self, X, y)

When training a model, Cubonacci passes the features and corresponding targets to the fit method. In the case of a full model training, this is all the data returned by your DataLoader class. When validating, it is a subset of that data, determined by the validation scheme.

Generator

.fit(self, generator)

Training a model using the generator approach is done by passing the relevant generator to the fit method. This generator is requested from the DataLoader.
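As a rough illustration of the generator form of fit, the sketch below trains a tiny single-feature linear model one batch at a time. It assumes that the generator yields (X_batch, y_batch) pairs of equal-length lists; the actual item format is determined by your DataLoader and is not specified here. The batches function is a hypothetical stand-in for the generator Cubonacci would provide.

```python
class Model:
    """Single-feature linear model trained incrementally, one batch at a time."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate
        self.weight = 0.0
        self.bias = 0.0

    def fit(self, generator):
        # Assumption: every item is an (X_batch, y_batch) pair of equal-length lists.
        for X_batch, y_batch in generator:
            for x, y in zip(X_batch, y_batch):
                # One stochastic gradient step on the squared error.
                error = (self.weight * x + self.bias) - y
                self.weight -= self.learning_rate * error * x
                self.bias -= self.learning_rate * error

    def predict(self, X):
        return [self.weight * x + self.bias for x in X]


def batches(n_batches, batch_size):
    """Hypothetical stand-in for the generator requested from the DataLoader."""
    for _ in range(n_batches):
        X = [i / 10 for i in range(batch_size)]
        yield X, [2.0 * x + 1.0 for x in X]  # targets follow y = 2x + 1


model = Model(learning_rate=0.1)
model.fit(batches(n_batches=500, batch_size=10))
```

Because a generator can only be consumed once, all training in this sketch happens in a single pass over the batches.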

Predict

.predict(self, X)

After a model has been trained by calling the .fit() method, the predict method should calculate predictions for the passed features. Once Cubonacci has trained a model, it passes a number of training samples through the model to infer what the predictions look like. This helps Cubonacci monitor live predictions and set up API deployments.

Serialization

In a number of cases, Cubonacci will take care of storing your model internally so that it is available for evaluation or deployment at a later stage. When using packages that store the state of your model in C or C++, this is not possible. An example of this is TensorFlow. To accommodate this, your model class can have two additional methods.

Save

The first method is .save(self, path) where your model class gets passed a path string. Your code can write files required for storing the state of your model here. This method is called after fully training a model.

Load

The second method is .load(cls, path), which must be decorated with @classmethod. The path argument is the same as the one passed to the save method, and the location it points to contains all the files written by save. The load method should return a new instance of the Model class with its trained state restored, so that predict can be called on it directly.
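The save and load pair could look roughly like the following minimal sketch. It assumes that path refers to a directory the model may write into, and it uses a toy model whose state fits in a JSON file; the file name state.json is an arbitrary choice for illustration. A TensorFlow model would call that library's own saving and loading functions at the same points instead.

```python
import json
import os
import tempfile


class Model:
    def __init__(self, scale):
        self.scale = scale
        self.offset = None  # learned state, set by fit

    def fit(self, X, y):
        # Toy training step: learn the mean residual left after scaling.
        residuals = [target - self.scale * x for x, target in zip(X, y)]
        self.offset = sum(residuals) / len(residuals)

    def predict(self, X):
        return [self.scale * x + self.offset for x in X]

    def save(self, path):
        # Assumption: path is a directory that this method may write into.
        os.makedirs(path, exist_ok=True)
        state = {"scale": self.scale, "offset": self.offset}
        with open(os.path.join(path, "state.json"), "w") as handle:
            json.dump(state, handle)

    @classmethod
    def load(cls, path):
        # Rebuild an instance from the files written by save.
        with open(os.path.join(path, "state.json")) as handle:
            state = json.load(handle)
        model = cls(scale=state["scale"])
        model.offset = state["offset"]
        return model


trained = Model(scale=2.0)
trained.fit(X=[1.0, 2.0, 3.0], y=[3.0, 5.0, 7.0])
with tempfile.TemporaryDirectory() as directory:
    trained.save(directory)
    restored = Model.load(directory)
```

The restored instance holds both the hyperparameter and the learned state, so predict works without calling fit again.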

Feature importance

The optional method .feature_importance(self) must return a dictionary and is called once the model has been trained. If this method is implemented, Cubonacci calls it after training a model and displays the result in the user interface.

Example

from sklearn.ensemble import GradientBoostingClassifier
import pandas as pd

class Model:
    def __init__(self, max_depth, criterion, min_impurity_decrease):
        # Store the hyperparameters for later use in fit
        self.max_depth = max_depth
        self.criterion = criterion
        self.min_impurity_decrease = min_impurity_decrease
        self.model = None

    def fit(self, X, y):
        self.model = GradientBoostingClassifier(n_estimators=100,
                                                max_depth=self.max_depth,
                                                criterion=self.criterion,
                                                min_impurity_decrease=self.min_impurity_decrease)
        # y is expected to be a pandas object; flatten it to a 1-D array
        self.model.fit(X=X, y=y.values.ravel())

    def predict(self, X):
        discrete_prediction = self.model.predict(X)
        predictions = self.model.predict_proba(X)
        # One probability column per class, plus the predicted class itself
        predictions = pd.DataFrame(predictions, columns=['PSetosa', 'PVersicolor', 'PVirginica'])
        predictions['Prediction'] = discrete_prediction
        return predictions

    def feature_importance(self):
        # Read the importances from the fitted classifier
        feature_importance = self.model.feature_importances_
        return {'Sepal length': feature_importance[0],
                'Sepal width': feature_importance[1],
                'Petal length': feature_importance[2],
                'Petal width': feature_importance[3]}

Model configuration

The model configuration is a YAML file named cubonacci-model.yaml, similar to the main configuration. It has two parts. The first is a string with the name of the algorithm, which is shown in different parts of the user interface. The second, the hyperparameter section, is more extensive and is described next.

Hyperparameters

The hyperparameters describe the different settings that influence how your model is trained. They are passed to the model through the constructor. Each hyperparameter is configured with a name, a type, and a description. Each type also requires its own additional fields, as described below.

int

Hyperparameters of the type int take integer values and are constrained by the min and max values that are passed in the configuration.

float

Hyperparameters of the type float take any value between the passed min and max.

categorical

Categorical hyperparameters take one of a number of discrete options, each of which is passed to the model as a string. The allowed options are configured in the values field as an array of strings.

Example

name: Gradient Boosting
hyperparameters:
  - name: max_depth
    type: int
    min: 2
    max: 5
    description: Maximum depth of a single tree
  - name: min_impurity_decrease
    type: float
    min: 0.0
    max: 0.2
    description: Minimal decrease in impurity to allow split
  - name: criterion
    type: categorical
    values:
      - friedman_mse
      - mse
      - mae
    description: The function to measure the quality of a split
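
To make the link between the configuration and the constructor concrete, the sketch below draws one value per hyperparameter from a Python dictionary that mirrors the YAML example and passes the result to a Model constructor as keyword arguments. The sample_hyperparameters helper, the uniform random sampling, and the stub Model are all hypothetical and serve only as illustration; how Cubonacci actually searches the hyperparameter space is not described here.

```python
import random

# Hypothetical in-memory mirror of the YAML example above.
config = {
    "name": "Gradient Boosting",
    "hyperparameters": [
        {"name": "max_depth", "type": "int", "min": 2, "max": 5},
        {"name": "min_impurity_decrease", "type": "float", "min": 0.0, "max": 0.2},
        {"name": "criterion", "type": "categorical",
         "values": ["friedman_mse", "mse", "mae"]},
    ],
}


def sample_hyperparameters(hyperparameters, rng):
    """Draw one value per hyperparameter (illustrative helper, not a Cubonacci API)."""
    sampled = {}
    for spec in hyperparameters:
        if spec["type"] == "int":
            # randint includes both bounds
            sampled[spec["name"]] = rng.randint(spec["min"], spec["max"])
        elif spec["type"] == "float":
            sampled[spec["name"]] = rng.uniform(spec["min"], spec["max"])
        elif spec["type"] == "categorical":
            sampled[spec["name"]] = rng.choice(spec["values"])
        else:
            raise ValueError(f"Unknown hyperparameter type: {spec['type']}")
    return sampled


class Model:
    """Stub with the same constructor signature as the example Model."""

    def __init__(self, max_depth, criterion, min_impurity_decrease):
        self.max_depth = max_depth
        self.criterion = criterion
        self.min_impurity_decrease = min_impurity_decrease


sampled = sample_hyperparameters(config["hyperparameters"], random.Random(0))
model = Model(**sampled)  # names in the configuration match the constructor arguments
```

The keyword expansion only works because every name in the configuration matches a constructor argument, which is why the two must be kept in sync.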