Train Classification Outputs
v.1.0.0
Once you've followed the steps in Train Classification, you're ready to explore its outputs.
The dictionary output_dict will have five items, in which the keys are the names of the outputs and the values are pandas data frames. Let's look at each one of them.
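As a minimal sketch of how you might inspect these outputs (assuming the dictionary keys match the file names listed on this page):

```python
# Sketch: list every output and its shape, assuming `output_dict` is the
# dictionary returned by Train Classification.
for name, df in output_dict.items():
    print(name, df.shape)

# Grab one of the outputs directly (key name assumed from the file names below):
df_importance = output_dict["df_importance.csv"]
```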
Global Explainability
df_importance.csv: Data frame containing the importance of each feature.
How to read the data above:
The variable with the most impact on the global predictions for not churning (class False) is Income (37.5%), followed by Monthly Premium Auto (17.75%) and Number of Policies (9.22%).
The values for the impact on churning (class True) are the same as the above.
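If you want to pull those top features out programmatically, a sketch along these lines should work; note that the class and importance column names are assumptions for illustration, so check them against the actual data frame:

```python
df_importance = output_dict["df_importance.csv"]

# Top 3 features for not churning (class False); the "class" and
# "importance" column names are assumptions, not documented here.
top_false = (
    df_importance[df_importance["class"] == False]
    .sort_values("importance", ascending=False)
    .head(3)
)
print(top_false)
```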
df_pdp.csv: Data frame containing partial dependence evaluations per feature.
Let's see a couple of charts that will help us understand the data above.
First, we want to understand the impact of each State on the probability of not churning, so we get the first 5 rows of data in which name_feature is State and class is False.
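In pandas, that filter might look like the sketch below (name_feature and class are column names taken from this page):

```python
df_pdp = output_dict["df_pdp.csv"]

# First 5 rows of partial dependence for State, class False (not churning).
state_false = df_pdp[
    (df_pdp["name_feature"] == "State") & (df_pdp["class"] == False)
].head(5)
print(state_false)
```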
In the chart above we can see that a customer being from California leads to a higher probability of retention (not churning) compared to other states.
Now, below you see the data for State when class is True - so we're looking at the probability of churning.
As you could imagine, we see the opposite - being from California makes a customer less likely to churn.
You may be wondering: shouldn't the probabilities (the pd column) sum to 100%? The answer is no. It is perfectly normal for the probability of churn to be very high, or very low, for all States, for example. So in the case of the partial dependence plot, what matters most is the differences among the States, not the probabilities themselves.
You can use the Generate Insights AI function to create text insights for df_pdp.csv.
Predictions + Local Explainability
df_predicted.csv: Data frame containing predictions for the data used to train the model.
How to read the data above:
Customer BU79786 has a 9.5% probability of churning (class True) and a 90.5% probability of not churning (class False). This customer hasn't churned (true_value False) and the prediction says they won't churn (prediction False).
Customer QZ44356 has an 81.5% probability of churning (class True) and an 18.5% probability of not churning (class False). This customer hasn't churned (true_value False), but the prediction says they will churn (prediction True).
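A short sketch for spotting the disagreements between predictions and reality, using the true_value and prediction columns described above:

```python
df_predicted = output_dict["df_predicted.csv"]

# Customers the model misclassified, such as QZ44356 above
# (hasn't churned, but predicted to churn).
mismatches = df_predicted[df_predicted["prediction"] != df_predicted["true_value"]]
print(f"{len(mismatches)} of {len(df_predicted)} customers misclassified")
```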
df_db.csv: Data frame containing drivers and barriers per prediction.
How to read the data above:
This will help you understand the impact that each variable has on the predictions we just saw above. Let's take customer BU79786 again as an example. We'll look at class False to explain the 90.5% probability of not churning. This probability comes from base value + drivers + barriers. The base value is the probability that any customer in the dataset has of not churning, but each customer is different: they each have their own combination of Income, Monthly Premium Auto, Number of Policies, and so on, so we have to add their drivers and barriers to the base value to get the final probability.
For customer BU79786 we see that Income (15.4%) and Monthly Premium Auto (6.8%) are the top two drivers that lead this specific customer not to churn, while their Number of policies (3.4%) and Vehicle class (1%) are the top two barriers.
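As a hedged sketch of that decomposition: the page only says drivers and barriers are summed with the base value, so the sign convention below (drivers add, barriers subtract) is an assumption based on the driver/barrier naming:

```python
def final_probability(base_value, driver_values, barrier_values):
    """Recombine a class probability from its explanation parts.

    Assumes values are fractions (e.g. 0.154 for 15.4%) and that barriers
    lower the probability -- an assumption, not a documented contract.
    """
    return base_value + sum(driver_values) - sum(barrier_values)
```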
Both the drivers and barriers are ordered from most to least impact in the columns list_driver_names and list_barrier_names. In list_driver_values and list_barrier_values you can see how much impact they have in terms of percentage.
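To walk through the drivers and barriers for a single customer, something like this sketch should work (it assumes each cell holds a Python list; if the values come back as strings, parse them first, for example with ast.literal_eval):

```python
df_db = output_dict["df_db.csv"]

row = df_db.iloc[0]  # e.g. the row for customer BU79786
for name, value in zip(row["list_driver_names"], row["list_driver_values"]):
    print(f"driver: {name} ({value}%)")
for name, value in zip(row["list_barrier_names"], row["list_barrier_values"]):
    print(f"barrier: {name} ({value}%)")
```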
You can use the Generate Insights AI function to automatically create text insights for df_db.csv.
Model Performance Metrics
scoring_naive.csv: Data frame containing model performance metrics.
Here you'll see the algorithm that your model was trained with and its performance metrics.
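For a quick look (key name again assumed from the file name above):

```python
scoring = output_dict["scoring_naive.csv"]
print(scoring)  # algorithm used and its performance metrics
```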
Remember, apart from these outputs, your model is now ready to make predictions on new data through the Predict Classification function.