Train Classification
v.1.0.0
Last updated
The train_classification function in Shimoku's SDK enables users to train machine learning models for various classification tasks. This guide will walk you through setting up your environment, preparing your data, and using the train_classification function to train a model.
Make sure you have followed these steps first: Setup and Requirements
Import necessary libraries and initialize the Shimoku client with your credentials.
Define a menu path (any name you want) for organizing your AI models, and disable caching for real-time data processing.
Note: you must have your SHIMOKU_TOKEN, UNIVERSE_ID and WORKSPACE_ID saved as environment variables.
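Before initializing the client, it can help to confirm that the three variables are actually set. The following is a minimal sketch using only the Python standard library; the helper name missing_credentials is an illustrative assumption and is not part of the Shimoku SDK.

```python
import os

# Names of the environment variables this guide expects.
REQUIRED_VARS = ("SHIMOKU_TOKEN", "UNIVERSE_ID", "WORKSPACE_ID")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


if __name__ == "__main__":
    missing = missing_credentials()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("All credentials found.")
```

Running the check first turns a missing credential into a clear message instead of an authentication error later on.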
Load your training data and create an input file for the AI function.
Here's a sample dataset for you to train your first model:
Note that the input file is passed as a dictionary in which the key is the name you are assigning to your file (in this example, 'training_insurance') and the value is a CSV bytes object.
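To make the shape of that dictionary concrete, here is a minimal sketch that builds it with the Python standard library only. The column names and values are hypothetical, and the helper rows_to_csv_bytes is an illustrative assumption, not an SDK function. It shows only how the dictionary is assembled, not the SDK call that uploads it.

```python
import csv
import io


def rows_to_csv_bytes(header, rows):
    """Serialize a header and rows into UTF-8 encoded CSV bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue().encode("utf-8")


# Hypothetical columns and values, for illustration only.
header = ["id", "age", "response"]
rows = [[1, 44, 1], [2, 76, 0]]

# Key: the name assigned to the file. Value: the CSV content as bytes.
input_files = {"training_insurance": rows_to_csv_bytes(header, rows)}
```

If your data is already in a pandas data frame df, df.to_csv(index=False).encode() produces an equivalent bytes object.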
Train your classification model by specifying model details and training parameters.
Let's understand the parameters:
ai_function: str It must be 'train_classification' when you are performing this task.
model_name: str Any name you choose. You will refer to it later if you use the Predict Classification function.
table: str It must be the name of an input file you previously created - see Step 2.
columns_target: List[str] A list containing the name of the target column(s), that is, the column(s) in your dataset that contain(s) the values that we are trying to predict with our model.
strategy: str 'predictor' or 'recommender'. Choose 'predictor' if you want to predict churn or lead scoring, for example. If you want to be able to recommend products, for example, choose 'recommender'.
id_columns: List[str] A list containing all the columns in your dataset which are ids.
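The parameters above can be collected in a dictionary and checked before any training run is started. This is a sketch under stated assumptions: the model name and column names are hypothetical, and check_training_params is an illustrative helper that only encodes the rules listed above. It does not call the SDK.

```python
ALLOWED_STRATEGIES = {"predictor", "recommender"}


def check_training_params(params, input_file_names):
    """Return a list of problems found in the training parameters."""
    problems = []
    if params.get("ai_function") != "train_classification":
        problems.append("ai_function must be 'train_classification'")
    if not params.get("model_name"):
        problems.append("model_name must be a non-empty string")
    if params.get("table") not in input_file_names:
        problems.append("table must name a previously created input file")
    if params.get("strategy") not in ALLOWED_STRATEGIES:
        problems.append("strategy must be 'predictor' or 'recommender'")
    for key in ("columns_target", "id_columns"):
        value = params.get(key)
        if not isinstance(value, list) or not all(isinstance(c, str) for c in value):
            problems.append(key + " must be a list of column names")
    return problems


# Hypothetical model and column names, for illustration only.
training_params = {
    "ai_function": "train_classification",
    "model_name": "insurance_model",
    "table": "training_insurance",
    "columns_target": ["response"],
    "strategy": "predictor",
    "id_columns": ["id"],
}

problems = check_training_params(training_params, {"training_insurance"})
```

An empty problems list means the values are consistent with the descriptions above; it does not guarantee that the columns exist in your dataset.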
Wait for the model to be trained and the outputs to be uploaded.
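Training can take a while, so a simple polling loop is a common way to wait. The sketch below is generic: fetch stands for whatever call retrieves the outputs of your run, and it is a hypothetical zero-argument callable rather than a specific SDK method. The default number of attempts and the wait time are arbitrary assumptions.

```python
import time


def wait_for_outputs(fetch, attempts=20, wait_seconds=60, sleep=time.sleep):
    """Call fetch() until it returns a non-empty result or attempts run out.

    fetch is a hypothetical zero-argument callable that returns the outputs
    when they are ready, and returns a falsy value or raises otherwise.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = fetch()
            if result:
                return result
        except Exception as error:  # keep polling on transient failures
            print(f"Attempt {attempt} failed: {error}")
        if attempt < attempts:
            sleep(wait_seconds)
    raise TimeoutError(f"No outputs after {attempts} attempts")
```

Raising TimeoutError when the attempts are exhausted makes a stalled run visible instead of silently continuing with no outputs.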
Once training is complete you can access the output files, which include predictions for the training dataset, explainability files and model scoring.
The dictionary output_dict maps the name of each output to a pandas data frame. The following outputs will be available:
df_predicted.csv: Data frame containing predictions for the data used to train the model.
df_importance.csv: Data frame containing the importance of each feature.
df_db.csv: Data frame containing drivers and barriers per prediction.
df_pdp.csv: Data frame containing partial dependence evaluations per feature.
scoring_naive.csv: Data frame containing model performance metrics.
Have a look here to better understand the outputs.
Finally, if you want to save these outputs on your local machine, you can execute the following:
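A minimal sketch of one way to do this, assuming output_dict maps output names to pandas data frames as described above. The helper save_outputs and the default directory name are illustrative assumptions, not part of the SDK.

```python
from pathlib import Path


def save_outputs(output_dict, directory="outputs"):
    """Write each data frame in output_dict to a CSV file in directory."""
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in output_dict.items():
        # Output names in this guide already end in .csv; add it otherwise.
        file_name = name if name.endswith(".csv") else name + ".csv"
        path = target / file_name
        frame.to_csv(path, index=False)  # pandas DataFrame.to_csv
        written.append(path)
    return written
```

Calling save_outputs(output_dict) writes one CSV file per output and returns the list of paths that were created.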
Your model is now also ready to make predictions on new data through the Predict Classification function.
You can use the Generate Insights AI function to create text insights for df_db.csv and df_pdp.csv.