Generate Insights
v.1.0.0
Overview
The generate_insights tool allows the user to add explanatory insights, generated with the OpenAI API, to datasets of various natures. This could be generic data provided by the user, such as a table, a bar chart, etc., or an output file generated by one of our tools. The tasks currently implemented are the following:
generic_insights: For any table provided by the user, a series of bullet points with insights about the data is returned.
partial_dependence: Given the data frame containing the partial dependence evaluations, df_pdp.csv, generated by the Train Classification function, this task provides a textual explanation of each available one-dimensional partial dependence plot.
drivers_barriers: This task starts from the table of drivers and barriers, df_db.csv, generated by the Train Classification or Predict Classification functions. To every row, it adds a textual description explaining which inputs contribute the most, both positively and negatively, to the target taking a specific value. Executions are currently limited to 15 rows at a time.
Version 1.0.0 of the tool requires user access to the OpenAI model gpt-4-1106-preview, to ensure proper functionality.
Step 0: Get ready
Make sure you have followed these steps first: Setup and Requirements
Step 1: Initialize the Client and set up your workspace
Import necessary libraries and initialize the Shimoku client with your credentials. Define a workspace and a menu path for organizing your AI models.
import os
import time
from io import StringIO
import pandas as pd
from shimoku import Client
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
OPENAI_ORG_ID = os.getenv("OPENAI_ORG_ID")
s = Client(
access_token=os.getenv("SHIMOKU_TOKEN"),
universe_id=os.getenv("UNIVERSE_ID"),
)
s.set_workspace(uuid=os.getenv("WORKSPACE_ID"))
menu_path_name = "insights"
s.set_menu_path(name=menu_path_name)
s.disable_caching()

Note: you must have your SHIMOKU_TOKEN, UNIVERSE_ID, WORKSPACE_ID, OPENAI_API_KEY and OPENAI_ORG_ID saved as environment variables.
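Before going further, you can optionally verify that all five variables are set. A minimal sketch using only the standard library (the variable names are the ones listed in the note above):

# Fail fast if any required environment variable is unset or empty
required = ["SHIMOKU_TOKEN", "UNIVERSE_ID", "WORKSPACE_ID",
            "OPENAI_API_KEY", "OPENAI_ORG_ID"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {missing}")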
For steps 2 to 5, choose the tab below according to the task you want to perform.
Step 2: Prepare and Upload Your Data
Upload any table on which you wish to request relevant insights. No particular format is imposed.
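Since no particular format is required, the table can also be built in memory rather than read from disk. A minimal sketch, assuming the imports from Step 1; the column names and values are purely illustrative:

# Hypothetical example table: quarterly sales by region
input_file = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'],
    'sales': [1200, 950, 1730, 1100],
})

In this guide, the table is loaded from a local CSV file instead: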
input_file = pd.read_csv('./input_data.csv')
s.ai.create_input_files(
input_files={'input_data': input_file.to_csv(index=False).encode()},
force_overwrite=True
)

Step 3: Execute the Generate Insight Function
Call the insight generator function and adjust the arguments for the generic_insights task.
run_id = s.ai.generic_execute(
ai_function='generate_insights',
task='generic_insights',
data='input_data',
openai_api_key=OPENAI_API_KEY,
openai_org_id=OPENAI_ORG_ID,
)

ai_function: str Label for this functionality, which will have the value 'generate_insights'.
openai_api_key: str Your unique OpenAI API key.
openai_org_id: str Your OpenAI organization ID.
task: str 'generic_insights' requests the generation of insights about a table in any type of format.
data: str Name chosen in create_input_files to refer to the table.
Step 4: Monitor the Process
Wait for the insights to be generated and the outputs to be uploaded.
attempts = 20
wait = 60
for _ in range(attempts):
try:
results = s.ai.get_output_file_objects(run_id=run_id)
if results:
print("Successfully obtained the output.")
break # Exit the loop if results are obtained
except Exception:
pass # Ignore errors and continue
time.sleep(wait) # Wait before retrying
else:
print("Failed to obtain the output after the maximum number of attempts.")Step 5: Accessing the GPT insights
Once execution is complete, insights are available.
insights = results['insights.txt'][0].decode()

Step 2: Prepare and Upload Your Data
Upload the resulting partial dependence file, df_pdp.csv, as it was returned by our Train Classification function.
df_pdp = pd.read_csv('./df_pdp.csv')
# The number of pd plots with insights is currently limited to 10 per execution
cols_to_groupby = ['column_target', 'class', 'name_feature']
# Each unique (column_target, class, name_feature) combination identifies one plot
first_10_pdp = df_pdp[cols_to_groupby].drop_duplicates().head(10)
df_pdp_10 = pd.merge(df_pdp, first_10_pdp, on=cols_to_groupby)
s.ai.create_input_files(
input_files={'pd_data': df_pdp_10.to_csv(index=False).encode()},
force_overwrite=True
)

Step 3: Execute the Generate Insight Function
Call the insight generator function and adjust the arguments for the partial dependence task.
run_id = s.ai.generic_execute(
ai_function='generate_insights',
task='partial_dependence',
data='pd_data',
openai_api_key=OPENAI_API_KEY,
openai_org_id=OPENAI_ORG_ID,
)

ai_function: str Label for this functionality, which will have the value 'generate_insights'.
openai_api_key: str Your unique OpenAI API key.
openai_org_id: str Your OpenAI organization ID.
task: str The 'partial_dependence' task.
data: str Name chosen in create_input_files referring to df_pdp.csv.
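To see the curve that a given explanation refers to, you can plot it yourself. A hedged matplotlib sketch; the filter values and the x/y column names ('feature_value', 'partial_dependence') are assumptions that you should adapt to the actual schema of your df_pdp.csv:

import matplotlib.pyplot as plt

# Select a single plot: one (column_target, class, name_feature) combination
curve = df_pdp_10[
    (df_pdp_10['column_target'] == 'target')   # hypothetical target name
    & (df_pdp_10['class'] == 1)                # hypothetical class label
    & (df_pdp_10['name_feature'] == 'age')     # hypothetical feature name
]
plt.plot(curve['feature_value'], curve['partial_dependence'])  # assumed column names
plt.xlabel('age')
plt.ylabel('partial dependence')
plt.show()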
Step 4: Monitor the Process
Wait for the insights to be generated and the outputs to be uploaded.
attempts = 20
wait = 60
for _ in range(attempts):
try:
results = s.ai.get_output_file_objects(run_id=run_id)
if results:
print("Successfully obtained the output.")
break # Exit the loop if results are obtained
except Exception:
pass # Ignore errors and continue
time.sleep(wait) # Wait before retrying
else:
print("Failed to obtain the output after the maximum number of attempts.")Step 5: Accessing the GPT insights
Once execution is complete, insights are available.
df_pdp_insights = pd.read_csv(StringIO(results['df_insights.csv'][0].decode('utf-8')))

Step 2: Prepare and Upload Your Data
Upload the drivers and barriers file, df_db.csv, as generated by one of our functions, Train Classification or Predict Classification. You will also need to upload the dataset used to train your model. Here are the files used in the Train Classification example.
The df_db.csv file must keep its original format, except that you will need to break it into chunks of at most 15 rows, the current limit per execution.
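A minimal splitting sketch, assuming df_db is loaded as in the snippet below; each chunk would then be uploaded and executed separately by repeating Steps 2 to 5:

# Split the drivers and barriers table into consecutive chunks of at most 15 rows
chunks = [df_db.iloc[i:i + 15] for i in range(0, len(df_db), 15)]

For simplicity, the example below uploads only the first 15 rows: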
input_file = pd.read_csv('./sample_training_dataset.csv')
df_db = pd.read_csv('./df_db.csv')
df_db_sample = df_db.head(15)
s.ai.create_input_files(
input_files={'training_insurance': input_file.to_csv(index=False).encode(),
'db_data': df_db_sample.to_csv(index=False).encode()},
force_overwrite=True
)

Step 3: Execute the Generate Insight Function
Call the insight generator function and adjust the arguments for the drivers and barriers task.
run_id = s.ai.generic_execute(
ai_function='generate_insights',
task='drivers_barriers',
data='db_data',
context_data='training_insurance',
openai_api_key=OPENAI_API_KEY,
openai_org_id=OPENAI_ORG_ID,
)

ai_function: str Label for this functionality, which will have the value 'generate_insights'.
openai_api_key: str Your unique OpenAI API key.
openai_org_id: str Your OpenAI organization ID.
task: str The 'drivers_barriers' task.
data: str Name chosen in create_input_files referring to df_db.csv.
context_data: str Name chosen in create_input_files referring to the data used to train the classification model. Required to provide insights.
Step 4: Monitor the Process
Wait for the insights to be generated and the outputs to be uploaded.
attempts = 20
wait = 60
for _ in range(attempts):
try:
results = s.ai.get_output_file_objects(run_id=run_id)
if results:
print("Successfully obtained the output.")
break # Exit the loop if results are obtained
except Exception:
pass # Ignore errors and continue
time.sleep(wait) # Wait before retrying
else:
print("Failed to obtain the output after the maximum number of attempts.")Step 5: Accessing the GPT insights
Once execution is complete, insights are available to the user.
df_db_insights = pd.read_csv(StringIO(results['df_insights.csv'][0].decode('utf-8')))
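As a final hedged sketch, it may help to inspect the returned table before using it, since the exact columns depend on your df_db.csv; and if you processed several 15-row chunks, the per-chunk outputs can be combined with pd.concat:

# Inspect the structure of the annotated drivers and barriers table
print(df_db_insights.columns.tolist())
print(df_db_insights.head())

# If df_db was processed in several chunks, collect one data frame per run
# and combine them, e.g.: df_all = pd.concat(per_chunk_insights, ignore_index=True)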