Train Classification

v.1.0.0

Overview

The train_classification function in Shimoku's SDK enables users to train machine learning models for various classification tasks. This guide will walk you through setting up your environment, preparing your data, and using the train_classification function to train a model.

Step 0: Get ready

Make sure you have followed these steps first: Setup and Requirements

Step 1: Initialize the Client and set up your workspace

Import necessary libraries and initialize the Shimoku client with your credentials.

Define a menu path, any name you want, for organizing your AI models and disable caching for real-time data processing.

import os
import time
from io import StringIO
import pandas as pd
from shimoku import Client

s = Client(
    access_token=os.getenv("SHIMOKU_TOKEN"),
    universe_id=os.getenv("UNIVERSE_ID"),
)

s.set_workspace(uuid=os.getenv("WORKSPACE_ID"))

menu_path_name = "insurance_model"
s.set_menu_path(name=menu_path_name)
s.disable_caching()

Note: you must have your SHIMOKU_TOKEN, UNIVERSE_ID and WORKSPACE_ID saved as environment variables.
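
If you want to fail fast when one of these variables is missing, a minimal sanity check using only the standard library could look like this:

import os

# Fail fast if any required environment variable is missing
required = ["SHIMOKU_TOKEN", "UNIVERSE_ID", "WORKSPACE_ID"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise EnvironmentError(f"Missing environment variables: {', '.join(missing)}")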

Step 2: Prepare and Upload Your Data

Load your training data and create an input file for the AI function.

Here's a sample dataset for you to train your first model:

# Read the sample CSV, serialize it back to CSV text, and encode it to bytes
input_file = pd.read_csv('./sample_training_dataset.csv').to_csv(index=False).encode()

s.ai.create_input_files(
    input_files={'training_insurance': input_file},
    force_overwrite=True
)

Note that the input file is passed as a dictionary in which the key is the name you are assigning to your file (in this example, 'training_insurance') and the value is a CSV bytes object.
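
You don't need a file on disk: any pandas DataFrame can be encoded the same way. For example, with a purely illustrative in-memory dataset:

# Illustrative data only; use your own DataFrame here
df = pd.DataFrame({
    'Customer': ['C-001', 'C-002'],
    'Age': [34, 51],
    'Churn': [0, 1],
})

s.ai.create_input_files(
    input_files={'training_insurance': df.to_csv(index=False).encode()},
    force_overwrite=True
)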

Step 3: Execute the Training Function

Train your classification model by specifying model details and training parameters.

run_id = s.ai.generic_execute(
    ai_function="train_classification",
    model_name="churn_insurance",
    table="training_insurance",
    columns_target=["Churn"],
    strategy="predictor",
    id_columns=["Customer"]
)

Let's understand the parameters:

ai_function: str It must be 'train_classification' when you are performing this task.

model_name: str Any name you want; you'll refer to it later if you use the Predict Classification function.

table: str The name of an input file you previously created - see Step 2.

columns_target: List[str] A list containing the name of the target column(s), that is, the column(s) in your dataset holding the values the model will learn to predict.

strategy: str 'predictor' or 'recommender'. Choose 'predictor' for tasks such as churn prediction or lead scoring; choose 'recommender' if you want to recommend products (see the sketch after this list).

id_columns: List[str] A list containing all the columns in your dataset that are IDs.
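
For instance, a recommender-style training call could look like the sketch below. The table and column names here are hypothetical; replace them with your own input file and columns:

# Hypothetical example: all names below are placeholders
run_id = s.ai.generic_execute(
    ai_function="train_classification",
    model_name="product_recommender",
    table="training_products",
    columns_target=["Product"],
    strategy="recommender",
    id_columns=["Customer"]
)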

Step 4: Monitor the Training Process

Wait for the model to be trained and the outputs to be uploaded.

attempts = 20
wait = 60

for _ in range(attempts):
    try:
        results = s.ai.get_output_file_objects(run_id=run_id)
        if results:
            print("Successfully obtained the output.")
            break  # Exit the loop if results are obtained
    except Exception:
        pass  # Ignore errors and continue
    time.sleep(wait)  # Wait before retrying
else:
    print("Failed to obtain the output after the maximum number of attempts.")

Step 5: Accessing the Model Outputs

Once training is complete, you can access the output files, which include predictions for the training dataset, explainability files, and model scoring.

# Decode each output file from bytes into a pandas DataFrame
output_dict = dict()
for file_name, bytes_obj in results.items():
    output_dict[file_name] = pd.read_csv(StringIO(bytes_obj[0].decode('utf-8')))

The dictionary output_dict maps each output file name to a pandas DataFrame. The following outputs will be available:

  • df_predicted.csv: Data frame containing predictions for the data used to train the model.

  • df_importance.csv: Data frame containing the importance of each feature.

  • df_db.csv: Dataframe containing drivers and barriers per prediction.

  • df_pdp.csv: Dataframe containing partial dependence evaluations per feature.

  • scoring_naive.csv: Dataframe containing model performance metrics.

Have a look here to better understand the outputs.
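
For a quick first look at the results, you can preview each table; the exact columns depend on your dataset and model:

for file_name, dataframe in output_dict.items():
    print(f"--- {file_name} ---")
    print(dataframe.head())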

Finally, if you want to save these outputs in your local machine, you can execute the following:

for file_name, dataframe in output_dict.items(): 
    dataframe.to_csv(file_name, index=False)

Also, your model is now ready to make predictions on new data through the Predict Classification function.
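
As a rough sketch, and assuming the prediction step follows the same generic_execute pattern shown above, the call could look like the one below. The ai_function name and parameters here are assumptions, so check the Predict Classification guide for the exact signature:

# Hypothetical sketch: verify the exact parameters in the Predict Classification guide
run_id = s.ai.generic_execute(
    ai_function="predict_classification",
    model_name="churn_insurance",
    table="prediction_insurance",
    id_columns=["Customer"]
)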

You can use the Generate Insights AI function to create text insights for df_db.csv and df_pdp.csv.
