
Shared Dataframe Between Agents

This document explains the shared dataframe functionality, which lets one agent create a processed dataframe (df_processed) that other agents can then access and use.

Overview

The Auto-Analyst system now supports sharing a processed dataframe between agents. This is useful when:

  1. One agent performs data preprocessing, cleaning, or feature engineering
  2. Subsequent agents need to use this processed data for analysis, visualization, or other tasks

The first agent (typically Agent1) creates a dataframe called df_processed, and all subsequent agents can access this same dataframe without needing to reprocess the data.

How It Works

  1. Automatic variable sharing is handled through the SHARED_CONTEXT global dictionary in format_response.py
  2. When an agent executes Python code that creates a variable named df_processed, this variable is automatically stored in the shared context
  3. Subsequent agent code executions then have access to this df_processed variable (see the sketch below)
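
The real capture logic lives in format_response.py; the snippet below is only a minimal sketch of the idea, with the SHARED_CONTEXT name taken from this document and the run_agent_code helper invented here for illustration.

# Minimal sketch (illustrative only): how a shared variable can be captured
# after one execution and injected into the next.
SHARED_CONTEXT = {}

def run_agent_code(code: str, df):
    # Build the execution namespace, injecting previously shared variables
    exec_globals = {"df": df, **SHARED_CONTEXT}
    exec(code, exec_globals)

    # If this agent created df_processed, store it for later agents
    if "df_processed" in exec_globals:
        SHARED_CONTEXT["df_processed"] = exec_globals["df_processed"]
    return exec_globals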

Implementation for Agent Developers

Agent1 (Data Processor)

Agent1 should define a processed dataframe that will be used by subsequent agents:

import pandas as pd
import numpy as np

# Do some data processing
df_processed = df.copy()  # Start with a copy of the original dataframe
df_processed = df_processed.dropna()  # Remove missing values
df_processed['new_feature'] = df_processed['column_a'] / df_processed['column_b']
print("Data processing complete. Created df_processed for other agents to use.")

Agent2 (Data Consumer)

Agent2 can access the df_processed dataframe created by Agent1:

import matplotlib.pyplot as plt
import seaborn as sns

# Access the shared df_processed dataframe
print(f"Using shared df_processed with shape: {df_processed.shape}")

# Create visualization using the processed data
plt.figure(figsize=(10, 6))
sns.scatterplot(data=df_processed, x='column_a', y='new_feature')
plt.title('Analysis of Processed Data')
plt.show()

Technical Details

The shared dataframe functionality is implemented through:

  1. A global SHARED_CONTEXT dictionary in format_response.py
  2. Modified execute_code_from_markdown function that checks for df_processed in the execution context
  3. Updated app.py to process agents in the correct order from the plan_list (see the ordering sketch below)
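
The ordering requirement is the important part: the agent that produces df_processed must run before any agent that consumes it. The loop below is a self-contained sketch of that idea only; the plan contents and variable names are made up for illustration and do not reflect the actual code in app.py.

import pandas as pd

# Illustrative only: run agents in plan order so the producer of df_processed
# executes before any consumer.
df = pd.DataFrame({"price": [100, 200, 300], "sqft": [500, 800, 1000]})
shared = {}

plan_list = [
    ("preprocessing_agent", "df_processed = df[df['price'] > 0].copy()"),
    ("viz_agent", "print('rows available to the viz agent:', len(df_processed))"),
]

for agent_name, code in plan_list:
    exec_globals = {"df": df, **shared}
    exec(code, exec_globals)
    if "df_processed" in exec_globals:
        shared["df_processed"] = exec_globals["df_processed"]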

Best Practices

  1. Name the shared dataframe consistently as df_processed
  2. Document what processing was done to create the shared dataframe
  3. Agent1 should print a message confirming that df_processed was created
  4. Agent2 should verify the structure of df_processed before using it, for example by printing its shape or columns (see the check below)
  5. Keep data processing in Agent1 and analysis in Agent2 for a clean separation of concerns
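
For practice 4, a defensive check along these lines can make Agent2 fail with a clear message when the shared dataframe is missing or incomplete. This is a sketch, not code from the repository, and the expected column names are placeholders for whatever Agent1 actually produces.

# Agent2: verify the shared dataframe before using it (illustrative check)
required_columns = {"column_a", "new_feature"}  # placeholders for Agent1's output

if "df_processed" not in globals():
    raise RuntimeError("df_processed was not created by an earlier agent")

missing = required_columns - set(df_processed.columns)
if missing:
    raise ValueError(f"df_processed is missing expected columns: {missing}")

print(f"df_processed looks good: shape={df_processed.shape}")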

Example

# Agent1 code
import pandas as pd

# Load and process data
df_processed = df.copy()
df_processed = df_processed[df_processed['price'] > 0]  # Remove invalid prices
df_processed['price_per_sqft'] = df_processed['price'] / df_processed['sqft']
print(f"Created df_processed with {len(df_processed)} rows after processing")

# Agent2 code
import plotly.express as px

# Use the processed dataframe
print(f"Using df_processed with {len(df_processed)} rows")
fig = px.scatter(df_processed, x='sqft', y='price', color='price_per_sqft',
                 title='Price vs. Square Footage (Colored by Price per SqFt)')
fig.show()