CodeKnightDebjit committed on
Commit
c0db505
·
verified ·
1 Parent(s): 73f852c

Upload folder using huggingface_hub

Files changed (2)
  1. README.md +747 -255
  2. inference.py +254 -430
README.md CHANGED
@@ -1,255 +1,747 @@
1
- ---
2
- title: Data Cleaning Env Environment Server
3
- emoji: 🎹
4
- colorFrom: indigo
5
- colorTo: red
6
- sdk: docker
7
- pinned: false
8
- app_port: 8000
9
- base_path: /web
10
- tags:
11
- - openenv
12
- ---
13
-
14
- # Data Cleaning Env Environment
15
-
16
- A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
17
-
18
- ## Quick Start
19
-
20
- The simplest way to use the Data Cleaning Env environment is through the `DataCleaningEnv` class:
21
-
22
- ```python
23
- from data_cleaning_env import CleanAction, DataCleaningEnv
24
-
25
- try:
26
- # Create environment from Docker image
27
- data_cleaning_envenv = DataCleaningEnv.from_docker_image("data_cleaning_env-env:latest")
28
-
29
- # Reset
30
- result = data_cleaning_envenv.reset()
31
- print(f"Reset: {result.observation.echoed_message}")
32
-
33
- # Send multiple messages
34
- messages = ["Hello, World!", "Testing echo", "Final message"]
35
-
36
- for msg in messages:
37
- result = data_cleaning_envenv.step(CleanAction(message=msg))
38
- print(f"Sent: '{msg}'")
39
- print(f" β†’ Echoed: '{result.observation.echoed_message}'")
40
- print(f" β†’ Length: {result.observation.message_length}")
41
- print(f" β†’ Reward: {result.reward}")
42
-
43
- finally:
44
- # Always clean up
45
- data_cleaning_envenv.close()
46
- ```
47
-
48
- That's it! The `DataCleaningEnv.from_docker_image()` method handles:
49
- - Starting the Docker container
50
- - Waiting for the server to be ready
51
- - Connecting to the environment
52
- - Container cleanup when you call `close()`
53
-
54
- ## Building the Docker Image
55
-
56
- Before using the environment, you need to build the Docker image:
57
-
58
- ```bash
59
- # From project root
60
- docker build -t data_cleaning_env-env:latest -f server/Dockerfile .
61
- ```
62
-
63
- ## Deploying to Hugging Face Spaces
64
-
65
- You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
66
-
67
- ```bash
68
- # From the environment directory (where openenv.yaml is located)
69
- openenv push
70
-
71
- # Or specify options
72
- openenv push --namespace my-org --private
73
- ```
74
-
75
- The `openenv push` command will:
76
- 1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
77
- 2. Prepare a custom build for Hugging Face Docker space (enables web interface)
78
- 3. Upload to Hugging Face (ensuring you're logged in)
79
-
80
- ### Prerequisites
81
-
82
- - Authenticate with Hugging Face: The command will prompt for login if not already authenticated
83
-
84
- ### Options
85
-
86
- - `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
87
- - `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
88
- - `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
89
- - `--private`: Deploy the space as private (default: public)
90
-
91
- ### Examples
92
-
93
- ```bash
94
- # Push to your personal namespace (defaults to username/env-name from openenv.yaml)
95
- openenv push
96
-
97
- # Push to a specific repository
98
- openenv push --repo-id my-org/my-env
99
-
100
- # Push with a custom base image
101
- openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
102
-
103
- # Push as a private space
104
- openenv push --private
105
-
106
- # Combine options
107
- openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
108
- ```
109
-
110
- After deployment, your space will be available at:
111
- `https://huggingface.co/spaces/<repo-id>`
112
-
113
- The deployed space includes:
114
- - **Web Interface** at `/web` - Interactive UI for exploring the environment
115
- - **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
116
- - **Health Check** at `/health` - Container health monitoring
117
- - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
118
-
119
- ## Environment Details
120
-
121
- ### Action
122
- **CleanAction**: Contains a single field
123
- - `message` (str) - The message to echo back
124
-
125
- ### Observation
126
- **CleanObservation**: Contains the echo response and metadata
127
- - `echoed_message` (str) - The message echoed back
128
- - `message_length` (int) - Length of the message
129
- - `reward` (float) - Reward based on message length (length Γ— 0.1)
130
- - `done` (bool) - Always False for echo environment
131
- - `metadata` (dict) - Additional info like step count
132
-
133
- ### Reward
134
- The reward is calculated as: `message_length Γ— 0.1`
135
- - "Hi" β†’ reward: 0.2
136
- - "Hello, World!" β†’ reward: 1.3
137
- - Empty message β†’ reward: 0.0
138
-
139
- ## Advanced Usage
140
-
141
- ### Connecting to an Existing Server
142
-
143
- If you already have a Data Cleaning Env environment server running, you can connect directly:
144
-
145
- ```python
146
- from data_cleaning_env import DataCleaningEnv
147
-
148
- # Connect to existing server
149
- data_cleaning_envenv = DataCleaningEnv(base_url="<ENV_HTTP_URL_HERE>")
150
-
151
- # Use as normal
152
- result = data_cleaning_envenv.reset()
153
- result = data_cleaning_envenv.step(CleanAction(message="Hello!"))
154
- ```
155
-
156
- Note: When connecting to an existing server, `data_cleaning_envenv.close()` will NOT stop the server.
157
-
158
- ### Using the Context Manager
159
-
160
- The client supports context manager usage for automatic connection management:
161
-
162
- ```python
163
- from data_cleaning_env import CleanAction, DataCleaningEnv
164
-
165
- # Connect with context manager (auto-connects and closes)
166
- with DataCleaningEnv(base_url="http://localhost:8000") as env:
167
- result = env.reset()
168
- print(f"Reset: {result.observation.echoed_message}")
169
- # Multiple steps with low latency
170
- for msg in ["Hello", "World", "!"]:
171
- result = env.step(CleanAction(message=msg))
172
- print(f"Echoed: {result.observation.echoed_message}")
173
- ```
174
-
175
- The client uses WebSocket connections for:
176
- - **Lower latency**: No HTTP connection overhead per request
177
- - **Persistent session**: Server maintains your environment state
178
- - **Efficient for episodes**: Better for many sequential steps
179
-
180
- ### Concurrent WebSocket Sessions
181
-
182
- The server supports multiple concurrent WebSocket connections. To enable this,
183
- modify `server/app.py` to use factory mode:
184
-
185
- ```python
186
- # In server/app.py - use factory mode for concurrent sessions
187
- app = create_app(
188
- DataCleaningEnvironment, # Pass class, not instance
189
- CleanAction,
190
- CleanAction,
191
- max_concurrent_envs=4, # Allow 4 concurrent sessions
192
- )
193
- ```
194
-
195
- Then multiple clients can connect simultaneously:
196
-
197
- ```python
198
- from data_cleaning_env import CleanAction, DataCleaningEnv
199
- from concurrent.futures import ThreadPoolExecutor
200
-
201
- def run_episode(client_id: int):
202
- with DataCleaningEnv(base_url="http://localhost:8000") as env:
203
- result = env.reset()
204
- for i in range(10):
205
- result = env.step(CleanAction(message=f"Client {client_id}, step {i}"))
206
- return client_id, result.observation.message_length
207
-
208
- # Run 4 episodes concurrently
209
- with ThreadPoolExecutor(max_workers=4) as executor:
210
- results = list(executor.map(run_episode, range(4)))
211
- ```
212
-
213
- ## Development & Testing
214
-
215
- ### Direct Environment Testing
216
-
217
- Test the environment logic directly without starting the HTTP server:
218
-
219
- ```bash
220
- # From the server directory
221
- python3 server/data_cleaning_env_environment.py
222
- ```
223
-
224
- This verifies that:
225
- - Environment resets correctly
226
- - Step executes actions properly
227
- - State tracking works
228
- - Rewards are calculated correctly
229
-
230
- ### Running Locally
231
-
232
- Run the server locally for development:
233
-
234
- ```bash
235
- uvicorn server.app:app --reload
236
- ```
237
-
238
- ## Project Structure
239
-
240
- ```
241
- data_cleaning_env/
242
- β”œβ”€β”€ .dockerignore # Docker build exclusions
243
- β”œβ”€β”€ __init__.py # Module exports
244
- β”œβ”€β”€ README.md # This file
245
- β”œβ”€β”€ openenv.yaml # OpenEnv manifest
246
- β”œβ”€β”€ pyproject.toml # Project metadata and dependencies
247
- β”œβ”€β”€ uv.lock # Locked dependencies (generated)
248
- β”œβ”€β”€ client.py # DataCleaningEnv client
249
- β”œβ”€β”€ models.py # Action and Observation models
250
- └── server/
251
- β”œβ”€β”€ __init__.py # Server module exports
252
- β”œβ”€β”€ data_cleaning_env_environment.py # Core environment logic
253
- β”œβ”€β”€ app.py # FastAPI application (HTTP + WebSocket endpoints)
254
- └── Dockerfile # Container image definition
255
- ```
1
+ ---
2
+ title: Data Cleaning Environment
3
+ emoji: 🧹
4
+ colorFrom: blue
5
+ colorTo: purple
6
+ sdk: docker
7
+ app_port: 7860
8
+ base_path: /web
9
+ ---
10
+ <div align="center">
11
+
12
+ # 🧹 Data Cleaning Environment
13
+
14
+ ### A Reinforcement Learning Benchmark for Autonomous Data Cleaning Agents
15
+
16
+ [![Python](https://img.shields.io/badge/Python-3.12+-3776AB?style=for-the-badge&logo=python&logoColor=white)](https://www.python.org/)
17
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-Compatible-FF6B35?style=for-the-badge)](https://github.com/meta-pytorch/OpenEnv)
18
+ [![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?style=for-the-badge&logo=pydantic&logoColor=white)](https://docs.pydantic.dev/)
19
+ [![FastAPI](https://img.shields.io/badge/FastAPI-WebSocket-009688?style=for-the-badge&logo=fastapi&logoColor=white)](https://fastapi.tiangolo.com/)
20
+ [![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?style=for-the-badge&logo=docker&logoColor=white)](https://www.docker.com/)
21
+ [![HuggingFace](https://img.shields.io/badge/HuggingFace-Deployable-FFD21E?style=for-the-badge&logo=huggingface&logoColor=black)](https://huggingface.co/)
22
+ [![License](https://img.shields.io/badge/License-MIT-green?style=for-the-badge)](LICENSE)
23
+
24
+ <br/>
25
+
26
+ > **An OpenEnv-compatible reinforcement learning environment where an LLM agent receives a dirty CSV dataset and must autonomously fix type errors, outliers, missing values, and schema inconsistencies to match a hidden ground truth — step by step.**
27
+
28
+ <br/>
29
+
30
+ ```
31
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
32
+ β”‚ Dirty CSV β†’ Agent Observes β†’ Issues CleanAction β†’ Reward β”‚
33
+ β”‚ β”‚
34
+ β”‚ "N/A" β†’ FILL_MISSING(median) β†’ Score ↑ β†’ +0.12 reward β”‚
35
+ β”‚ "2099" β†’ SET_VALUE(row=3,"2024-01-15") β†’ Score ↑ β†’ +0.08 β”‚
36
+ β”‚ " bob" β†’ STANDARDIZE_COL("name") β†’ Score ↑ β†’ +0.05 β”‚
37
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
38
+ ```
39
+
40
+ </div>
41
+
42
+ ---
43
+
44
+ ## πŸ“‘ Table of Contents
45
+
46
+ - [Overview](#-overview)
47
+ - [Architecture](#-architecture)
48
+ - [Project Structure](#-project-structure)
49
+ - [Tasks](#-tasks)
50
+ - [Action Space](#-action-space)
51
+ - [Observation Space](#-observation-space)
52
+ - [Reward Function](#-reward-function)
53
+ - [Quick Start](#-quick-start)
54
+ - [Running Inference](#-running-inference)
55
+ - [Environment API](#-environment-api)
56
+ - [Configuration](#-configuration)
57
+ - [Deployment](#-deployment)
58
+ - [Development & Testing](#-development--testing)
59
+ - [Troubleshooting](#-troubleshooting)
60
+
61
+ ---
62
+
63
+ ## 🌟 Overview
64
+
65
+ The **Data Cleaning Environment** is a structured RL benchmark where an LLM-powered agent must clean tabular datasets. The environment wraps a FastAPI WebSocket server following the [OpenEnv](https://github.com/meta-pytorch/OpenEnv) protocol, making it compatible with any OpenEnv-based training or evaluation framework.
66
+
67
+ ### Why This Matters
68
+
69
+ By common estimates, 60–80% of the effort in real-world data pipelines goes into cleaning. This environment trains agents to:
70
+
71
+ - **Detect** type errors, outliers, missing values, and schema inconsistencies
72
+ - **Reason** about which fix is most impactful at each step
73
+ - **Self-correct** from informative error feedback
74
+ - **Terminate** efficiently without over-cleaning
75
+
76
+ ### Key Properties
77
+
78
+ | Property | Value |
79
+ |---|---|
80
+ | Protocol | OpenEnv (WebSocket + HTTP) |
81
+ | Action Space | Discrete (5 command types) |
82
+ | Observation | Full CSV state + grader feedback |
83
+ | Episode Structure | Reset → N × Step → Done |
84
+ | Concurrency | ✅ Multiple simultaneous sessions |
85
+ | State Management | Server-side, fully isolated per session |
86
+
87
+ ---
88
+
89
+ ## πŸ—οΈ Architecture
90
+
91
+ ```
92
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
93
+ β”‚ Agent (LLM / RL Policy) β”‚
94
+ β”‚ Qwen2.5-72B / Mistral / Custom Model β”‚
95
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
96
+ β”‚ CleanAction (JSON) β”‚ CleanObservation
97
+ β–Ό β”‚
98
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
99
+ β”‚ DataCleaningEnv (client.py) β”‚
100
+ β”‚ OpenEnv EnvClient[CleanAction, CleanObservation, dict] β”‚
101
+ β”‚ WebSocket persistent connection β”‚
102
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
103
+ β”‚ WebSocket /ws
104
+ β–Ό
105
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
106
+ β”‚ FastAPI Server (server/app.py) β”‚
107
+ β”‚ HTTP + WebSocket endpoints, sessions β”‚
108
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
109
+ β”‚
110
+ β–Ό
111
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
112
+ β”‚ DataCleaningEnvironment (server/data_cleaning_env.py) β”‚
113
+ β”‚ β”‚
114
+ β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
115
+ β”‚ β”‚ dataset_ β”‚ β”‚ Action β”‚ β”‚ Grader β”‚ β”‚ Reward β”‚ β”‚
116
+ β”‚ β”‚ factory.py β”‚ β”‚ Dispatcher β”‚ β”‚ Engine β”‚ β”‚ Computer β”‚ β”‚
117
+ β”‚ β”‚ β”‚ β”‚ SET_VALUE β”‚ β”‚ grade() β”‚ β”‚ β”‚ β”‚
118
+ β”‚ β”‚ easy/medium β”‚ β”‚ DROP_ROW β”‚ β”‚ score β”‚ β”‚ progress β”‚ β”‚
119
+ β”‚ β”‚ /hard CSVs β”‚ β”‚ STANDARD. β”‚ β”‚ delta β”‚ β”‚ efficiencyβ”‚ β”‚
120
+ β”‚ β”‚ β”‚ β”‚ FILL_MISS. β”‚ β”‚ β”‚ β”‚ penalties β”‚ β”‚
121
+ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚
122
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
123
+ ```
124
+
125
+ ---
126
+
127
+ ## πŸ“ Project Structure
128
+
129
+ ```
130
+ data_cleaning_env/
131
+ β”‚
132
+ β”œβ”€β”€ πŸ“„ client.py # DataCleaningEnv β€” OpenEnv client
133
+ β”œβ”€β”€ πŸ“„ models.py # CleanAction, CleanObservation, CleanState (Pydantic)
134
+ β”œβ”€β”€ πŸ“„ inference.py # Official evaluation entry point
135
+ β”œβ”€β”€ πŸ“„ dataset_factory.py # Generates easy/medium/hard dirty↔clean CSV pairs
136
+ β”œβ”€β”€ πŸ“„ graders.py # Scoring engine β€” grade(agent_df vs clean_df)
137
+ β”œβ”€β”€ πŸ“„ openenv.yaml # OpenEnv manifest (HuggingFace Spaces config)
138
+ β”œβ”€β”€ πŸ“„ pyproject.toml # Project metadata and dependencies
139
+ β”‚
140
+ └── server/
141
+ β”œβ”€β”€ πŸ“„ app.py # FastAPI application (HTTP + WebSocket)
142
+ β”œβ”€β”€ πŸ“„ data_cleaning_env.py # Core environment logic (reset/step/state)
143
+ β”œβ”€β”€ πŸ“„ __init__.py
144
+ └── πŸ“„ Dockerfile # Container image definition
145
+ ```
146
+
147
+ ---
148
+
149
+ ## 🎯 Tasks
150
+
151
+ The environment ships three progressively harder tasks, each with fixed-seed deterministic datasets:
152
+
153
+ ### 🟒 Easy β€” Sales Orders
154
+
155
+ | Property | Value |
156
+ |---|---|
157
+ | Dataset | ~100-row sales orders CSV |
158
+ | Dirty Issues | Cell-level type errors, a few missing values |
159
+ | Step Budget | **40 steps** |
160
+ | Success Threshold | **Score ≥ 0.95** |
161
+ | Primary Skills | `SET_VALUE`, `FILL_MISSING` |
162
+
163
+ **What the agent needs to fix:** Individual cells with wrong types (e.g., `"N/A"` in a price column, `"abc"` in a numeric field). Straightforward injected errors with clear ground truth.
164
+
165
+ ---
166
+
167
+ ### 🟑 Medium β€” Financial Transactions
168
+
169
+ | Property | Value |
170
+ |---|---|
171
+ | Dataset | ~200-row transaction log |
172
+ | Dirty Issues | Outlier rows, mixed date formats, missing amounts |
173
+ | Step Budget | **80 steps** |
174
+ | Success Threshold | **Score ≥ 0.85** |
175
+ | Primary Skills | `DROP_ROW`, `STANDARDIZE_COL`, `FILL_MISSING` |
176
+
177
+ **What the agent needs to fix:** Statistical outliers disguised as data, inconsistent date formats, missing numeric values. Crucially, some extreme values are **valid** β€” dropping them costs a false-positive penalty.
178
+
179
+ ---
180
+
181
+ ### πŸ”΄ Hard β€” Multi-Schema Dataset
182
+
183
+ | Property | Value |
184
+ |---|---|
185
+ | Dataset | ~400-row multi-domain CSV |
186
+ | Dirty Issues | Cross-column inconsistencies, future-year dates, bulk missing data |
187
+ | Step Budget | **150 steps** |
188
+ | Success Threshold | **Score ≥ 0.80** |
189
+ | Primary Skills | All commands |
190
+
191
+ **What the agent needs to fix:** Everything from easy + medium, plus cascading schema issues across columns. Requires strategic planning about fix order.
192
+
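+ Tasks are selected at reset time; the full client API is shown under Environment API below. For example:
+
+ ```python
+ # task_id is one of "easy" | "medium" | "hard"
+ result = await env.reset(task_id="hard")
+ print(result.observation.max_steps)      # 150 for the hard task
+ ```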
193
+ ---
194
+
195
+ ## πŸ•ΉοΈ Action Space
196
+
197
+ Every step the agent sends exactly one `CleanAction`:
198
+
199
+ ```python
200
+ from models import CleanAction
201
+
202
+ # Fix a specific cell
203
+ CleanAction(command="SET_VALUE", row_index=3, column="price", value="29.99")
204
+
205
+ # Remove an entire row (use carefully — false positives are penalised)
206
+ CleanAction(command="DROP_ROW", row_index=17)
207
+
208
+ # Normalise a column's format (dates → YYYY-MM-DD, numbers → float, strings → stripped)
209
+ CleanAction(command="STANDARDIZE_COL", column="order_date")
210
+
211
+ # Fill all NaN values in a column using a strategy
212
+ CleanAction(command="FILL_MISSING", column="quantity", fill_strategy="median")
213
+
214
+ # Signal episode completion (only accepted when score ≥ task threshold)
215
+ CleanAction(command="DONE")
216
+ ```
217
+
218
+ ### Command Reference
219
+
220
+ | Command | `row_index` | `column` | `value` | `fill_strategy` |
221
+ |---|---|---|---|---|
222
+ | `SET_VALUE` | ✅ required | ✅ required | ✅ required | — |
223
+ | `DROP_ROW` | ✅ required | — | — | — |
224
+ | `STANDARDIZE_COL` | — | ✅ required | — | — |
225
+ | `FILL_MISSING` | — | ✅ required | — | ✅ required |
226
+ | `DONE` | — | — | — | — |
227
+
228
+ ### `FILL_MISSING` Strategies
229
+
230
+ | Strategy | Behaviour |
231
+ |---|---|
232
+ | `"mean"` | Replace NaN with column mean (numeric columns only) |
233
+ | `"median"` | Replace NaN with column median (numeric columns only) |
234
+ | `"mode"` | Replace NaN with most frequent value (any column) |
235
+ | `"drop"` | Remove rows where this column is NaN |
236
+
237
+ > ⚠️ **Important:** `DROP_ROW` removes by **positional row index** (the `row_index` value refers to the row's current position in the CSV), not by a row ID field. Row indices shift after each drop.
238
+
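+ For intuition, here is a minimal sketch of how these commands could map onto pandas operations. The helper names and exact semantics are illustrative assumptions; the authoritative behaviour lives in the server's action dispatcher.
+
+ ```python
+ import pandas as pd
+
+ def fill_missing(df: pd.DataFrame, column: str, strategy: str) -> pd.DataFrame:
+     # Illustrative mapping of the strategy table above onto pandas.
+     col = df[column]
+     if strategy == "mean":
+         return df.assign(**{column: col.fillna(col.mean())})
+     if strategy == "median":
+         return df.assign(**{column: col.fillna(col.median())})
+     if strategy == "mode":
+         return df.assign(**{column: col.fillna(col.mode().iloc[0])})
+     if strategy == "drop":
+         return df.dropna(subset=[column]).reset_index(drop=True)
+     raise ValueError(f"unknown fill strategy: {strategy}")
+
+ def drop_row(df: pd.DataFrame, row_index: int) -> pd.DataFrame:
+     # DROP_ROW is positional: indices shift after every drop (see the note above).
+     return df.drop(df.index[row_index]).reset_index(drop=True)
+ ```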
239
+ ---
240
+
241
+ ## πŸ‘οΈ Observation Space
242
+
243
+ After every `reset()` and `step()`, the agent receives a `CleanObservation`:
244
+
245
+ ```python
246
+ @dataclass
247
+ class CleanObservation:
248
+ # ── Task context (constant per episode) ──────────────────────
249
+ task_id: str # "easy" | "medium" | "hard"
250
+ schema_hint: str # Plain-English description of clean schema
251
+ initial_dirty_cells: int # Total dirty cells at episode start
252
+
253
+ # ── Per-step state ───────────────────────────────────────────
254
+ dirty_csv: str # Full current CSV as string (all edits applied)
255
+ current_score: float # 0.0 β†’ 1.0 (grader score vs ground truth)
256
+ issues_remaining: int # Approximate dirty cells still to fix
257
+ step_number: int # Steps taken so far
258
+ max_steps: int # Budget for this task
259
+
260
+ # ── Last-action feedback ─────────────────────────────────────
261
+ last_action_success: bool # Whether previous action applied cleanly
262
+ last_action_error: str # Error message if success=False (else None)
263
+
264
+ # ── Inherited ────────────────────────────────────────────────
265
+ done: bool # True = episode ended
266
+ reward: float | None # Per-step reward (None after reset)
267
+ ```
268
+
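+ `dirty_csv` is a plain CSV string, so it can be loaded straight into pandas for inspection (the environment itself works on DataFrames server-side):
+
+ ```python
+ import io
+ import pandas as pd
+
+ df = pd.read_csv(io.StringIO(obs.dirty_csv))
+ print(df.isna().sum())       # missing values per column
+ print(df.dtypes)             # spot columns parsed as object because of dirty cells
+ ```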
269
+ ### Score Computation
270
+
271
+ The grader compares the agent's working DataFrame to the hidden ground-truth DataFrame:
272
+
273
+ ```
274
+ score = (initial_dirty_cells - remaining_dirty_cells) / initial_dirty_cells
275
+ ```
276
+
277
+ A score of `1.0` means perfect agreement with ground truth.
278
+
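+ As a rough sketch, a cell-by-cell comparison is enough to reproduce this formula; the real `graders.py` may weight columns or handle dropped rows differently:
+
+ ```python
+ import pandas as pd
+
+ def grade(agent_df: pd.DataFrame, clean_df: pd.DataFrame, initial_dirty_cells: int) -> float:
+     # Count cells that still disagree with the ground truth, then report
+     # progress relative to how many cells were dirty at reset time.
+     aligned = agent_df.reindex_like(clean_df).astype(str)
+     remaining_dirty = int((aligned != clean_df.astype(str)).to_numpy().sum())
+     if initial_dirty_cells <= 0:
+         return 1.0
+     return max(0.0, (initial_dirty_cells - remaining_dirty) / initial_dirty_cells)
+ ```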
279
+ ---
280
+
281
+ ## πŸ’° Reward Function
282
+
283
+ The reward is dense and shaped to guide efficient, precise cleaning:
284
+
285
+ ```
286
+ reward = progress_term
287
+ + efficiency_bonus
288
+ + false_positive_penalty
289
+ + early_done_penalty
290
+ + step_cost
291
+ ```
292
+
293
+ | Component | Value | When |
294
+ |---|---|---|
295
+ | **Progress** | `current_score − previous_score` | Every step |
296
+ | **Efficiency bonus** | `+0.10 × (1 − steps_used/max_steps)` | Only when task is solved this step |
297
+ | **False-positive penalty** | `−0.15` | `DROP_ROW` removes a valid-extreme row (medium task) |
298
+ | **Early DONE penalty** | `−0.20` | `DONE` called with score < 0.60 |
299
+ | **Step cost** | `−0.005` | Every step (discourages padding) |
300
+ | **Premature DONE block** | `−1.00` | `DONE` below task threshold — episode *continues* |
301
+
302
+ **Reward range:** `[−0.5, +1.0]` (clipped)
303
+
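+ Putting the table together, a single step's reward can be sketched as follows (values taken from the table and the Configuration section below; the DONE-related penalties are applied by the DONE handler itself, and the server's actual ordering of checks may differ):
+
+ ```python
+ def compute_reward(prev_score: float, score: float, steps_used: int, max_steps: int,
+                    *, solved: bool, dropped_valid_row: bool) -> float:
+     reward = (score - prev_score) - 0.005                # progress + per-step cost
+     if dropped_valid_row:
+         reward -= 0.15                                   # false-positive penalty
+     if solved:
+         reward += 0.10 * (1.0 - steps_used / max_steps)  # efficiency bonus
+     return max(-0.5, min(1.0, reward))                   # clip to [-0.5, +1.0]
+ ```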
304
+ ### Termination Logic
305
+
306
+ The episode terminates when **any** of these is true:
307
+
308
+ 1. ✅ `current_score >= task_threshold` (auto-terminated, efficiency bonus awarded)
309
+ 2. ✅ Agent sends `DONE` and `current_score >= task_threshold` (accepted)
310
+ 3. ⏱️ `step_count >= max_steps` (budget exhausted)
311
+
312
+ `DONE` is **refused** if the score is below threshold — the episode continues with a `−1.0` reward signal.
313
+
314
+ ---
315
+
316
+ ## πŸš€ Quick Start
317
+
318
+ ### Prerequisites
319
+
320
+ - Python 3.12+
321
+ - Docker Desktop (for containerised server)
322
+ - A free [HuggingFace token](https://huggingface.co/settings/tokens) (for the inference LLM)
323
+
324
+ ### 1. Clone & Install
325
+
326
+ ```bash
327
+ git clone https://github.com/Code-Knight-Debjit/Data-Cleaning-Environment.git
328
+ cd Data-Cleaning-Environment
329
+
330
+ # Create virtual environment
331
+ python -m venv .venv
332
+
333
+ # Activate (Windows PowerShell)
334
+ .venv\Scripts\Activate.ps1
335
+
336
+ # Activate (macOS/Linux)
337
+ source .venv/bin/activate
338
+
339
+ # Install dependencies
340
+ pip install -e .
341
+ ```
342
+
343
+ ### 2. Build the Docker Image
344
+
345
+ ```bash
346
+ docker build -t openenv-data_cleaning:latest -f server/Dockerfile .
347
+ ```
348
+
349
+ ### 3. Set Your HuggingFace Token
350
+
351
+ ```powershell
352
+ # Windows PowerShell
353
+ $env:HF_TOKEN = "hf_your_token_here"
354
+
355
+ # macOS / Linux
356
+ export HF_TOKEN="hf_your_token_here"
357
+ ```
358
+
359
+ ### 4. Run Inference
360
+
361
+ ```bash
362
+ python inference.py
363
+ ```
364
+
365
+ That's it! The script auto-starts the Docker container, runs the LLM agent through all three tasks (easy β†’ medium β†’ hard), and prints structured evaluation logs.
366
+
367
+ ---
368
+
369
+ ## πŸ€– Running Inference
370
+
371
+ ### Environment Variables
372
+
373
+ | Variable | Default | Description |
374
+ |---|---|---|
375
+ | `HF_TOKEN` | *(required)* | Your HuggingFace token for LLM API access |
376
+ | `API_BASE_URL` | `https://router.huggingface.co/v1` | LLM API endpoint |
377
+ | `MODEL_NAME` | `Qwen/Qwen2.5-72B-Instruct` | Model to use for inference |
378
+ | `LOCAL_IMAGE_NAME` | `openenv-data_cleaning:latest` | Docker image to launch |
379
+ | `ENV_BASE_URL` | `http://localhost:8000` | Direct server URL (if not using Docker) |
380
+
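+ `inference.py` reads these near the top of the file, roughly as follows (only `HF_TOKEN` has no default):
+
+ ```python
+ import os
+
+ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "openenv-data_cleaning:latest")
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ ```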
381
+ ### Switching Models
382
+
383
+ ```powershell
384
+ # Use Mistral (smaller, faster)
385
+ $env:MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"
386
+
387
+ # Use Llama
388
+ $env:MODEL_NAME = "meta-llama/Llama-3.1-8B-Instruct"
389
+ ```
390
+
391
+ ### Connecting to a Running Server (skip Docker)
392
+
393
+ ```powershell
394
+ $env:LOCAL_IMAGE_NAME = "" # must be empty string
395
+ $env:ENV_BASE_URL = "http://localhost:8000"
396
+ python inference.py
397
+ ```
398
+
399
+ ### Expected Output
400
+
401
+ ```
402
+ API_BASE_URL : https://router.huggingface.co/v1
403
+ MODEL_NAME : Qwen/Qwen2.5-72B-Instruct
404
+ LOCAL_IMAGE_NAME : openenv-data_cleaning:latest
405
+ ENV_BASE_URL : http://localhost:8000
406
+
407
+ [START] task=easy env=data_cleaning_env model=Qwen/Qwen2.5-72B-Instruct
408
+ [STEP] step=1 action=FILL_MISSING reward=0.12 done=false error=null
409
+ [STEP] step=2 action=SET_VALUE reward=0.08 done=false error=null
410
+ [STEP] step=3 action=STANDARDIZE_COL reward=0.05 done=false error=null
411
+ ...
412
+ [END] success=true steps=18 score=0.97 rewards=0.12,0.08,...
413
+
414
+ [START] task=medium env=data_cleaning_env ...
415
+ ...
416
+
417
+ ════════════════════════════════════════════════════════
418
+ Task Score Reward Steps Pass
419
+ ────────────────────────────────────────────────────────
420
+ easy 0.9712 1.3400 18 YES
421
+ medium 0.8823 2.1100 47 YES
422
+ hard 0.7640 1.8500 98 NO
423
+ ════════════════════════════════════════════════════════
424
+ ```
425
+
426
+ ---
427
+
428
+ ## πŸ”Œ Environment API
429
+
430
+ ### Using the Python Client Directly
431
+
432
+ ```python
433
+ import asyncio
434
+ from client import DataCleaningEnv
435
+ from models import CleanAction
436
+
437
+ async def run():
438
+ # Option A: Auto-start Docker container
439
+ env = await DataCleaningEnv.from_docker_image("openenv-data_cleaning:latest")
440
+
441
+ # Option B: Connect to an already-running server
442
+ # env = DataCleaningEnv(base_url="http://localhost:8000")
443
+ # await env.connect()
444
+
445
+ try:
446
+ # Reset for a specific task
447
+ result = await env.reset(task_id="easy")
448
+ obs = result.observation
449
+
450
+ print(f"Score: {obs.current_score:.4f}")
451
+ print(f"Issues: {obs.issues_remaining}")
452
+ print(f"Schema: {obs.schema_hint}")
453
+
454
+ # Take a step
455
+ action = CleanAction(
456
+ command="FILL_MISSING",
457
+ column="price",
458
+ fill_strategy="median"
459
+ )
460
+ result = await env.step(action)
461
+ obs = result.observation
462
+
463
+ print(f"Reward: {result.reward:.4f}")
464
+ print(f"New score: {obs.current_score:.4f}")
465
+ print(f"Action OK: {obs.last_action_success}")
466
+
467
+ # Signal completion
468
+ result = await env.step(CleanAction(command="DONE"))
469
+
470
+ finally:
471
+ await env.close()
472
+
473
+ asyncio.run(run())
474
+ ```
475
+
476
+ ### Using the Sync Wrapper
477
+
478
+ ```python
479
+ from client import DataCleaningEnv
480
+ from models import CleanAction
481
+
482
+ env = DataCleaningEnv(base_url="http://localhost:8000").sync()
483
+
484
+ with env:
485
+ result = env.reset(task_id="easy")
486
+ result = env.step(CleanAction(command="STANDARDIZE_COL", column="order_date"))
487
+ print(f"Score: {result.observation.current_score:.4f}")
488
+ ```
489
+
490
+ ### HTTP Endpoints
491
+
492
+ When the server is running, the following HTTP endpoints are available:
493
+
494
+ | Endpoint | Method | Description |
495
+ |---|---|---|
496
+ | `/health` | GET | Server health check |
497
+ | `/docs` | GET | Swagger / OpenAPI documentation |
498
+ | `/web` | GET | Interactive web UI |
499
+ | `/ws` | WebSocket | Persistent session endpoint |
500
+
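+ A quick liveness check using only the standard library (the exact body of the `/health` response is server-defined):
+
+ ```python
+ import urllib.request
+
+ with urllib.request.urlopen("http://localhost:8000/health", timeout=5) as resp:
+     print(resp.status, resp.read().decode())
+ ```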
501
+ ---
502
+
503
+ ## βš™οΈ Configuration
504
+
505
+ ### Step Budgets
506
+
507
+ ```python
508
+ MAX_STEPS = {
509
+ "easy": 40,
510
+ "medium": 80,
511
+ "hard": 150,
512
+ }
513
+ ```
514
+
515
+ ### Success Thresholds
516
+
517
+ ```python
518
+ DONE_THRESHOLD = {
519
+ "easy": 0.95,
520
+ "medium": 0.85,
521
+ "hard": 0.80,
522
+ }
523
+ ```
524
+
525
+ ### Reward Constants
526
+
527
+ | Constant | Value | Purpose |
528
+ |---|---|---|
529
+ | `STEP_COST` | `-0.005` | Per-step penalty to discourage padding |
530
+ | `EARLY_DONE_PENALTY` | `-0.20` | Penalty for `DONE` below score 0.60 |
531
+ | `EARLY_DONE_THRESHOLD` | `0.60` | Score floor for DONE without penalty |
532
+ | `FALSE_POSITIVE_PENALTY` | `-0.15` | Penalty for wrongly dropping a valid row |
533
+ | `EFFICIENCY_BONUS_WEIGHT` | `0.10` | Multiplier for early-completion bonus |
534
+
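+ Expressed as Python, mirroring the two blocks above (where exactly these constants live in the codebase is an implementation detail):
+
+ ```python
+ STEP_COST = -0.005
+ EARLY_DONE_PENALTY = -0.20
+ EARLY_DONE_THRESHOLD = 0.60
+ FALSE_POSITIVE_PENALTY = -0.15
+ EFFICIENCY_BONUS_WEIGHT = 0.10
+ ```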
535
+ ---
536
+
537
+ ## ☁️ Deployment
538
+
539
+ ### Deploy to HuggingFace Spaces
540
+
541
+ ```bash
542
+ # Install the OpenEnv CLI
543
+ pip install openenv
544
+
545
+ # Authenticate with HuggingFace
546
+ huggingface-cli login
547
+
548
+ # Deploy (from the repo root where openenv.yaml lives)
549
+ openenv push
550
+
551
+ # Or deploy privately to a specific repo
552
+ openenv push --repo-id your-username/data-cleaning-env --private
553
+ ```
554
+
555
+ After deployment, your environment will be live at:
556
+ ```
557
+ https://huggingface.co/spaces/your-username/data-cleaning-env
558
+ ```
559
+
560
+ With endpoints:
561
+ - **Web UI:** `/web`
562
+ - **API Docs:** `/docs`
563
+ - **Health:** `/health`
564
+ - **WebSocket:** `/ws`
565
+
566
+ ### Connect to a HuggingFace Space
567
+
568
+ ```python
569
+ env = await DataCleaningEnv.from_env("your-username/data-cleaning-env")
570
+ # or run locally with UV (no Docker needed)
571
+ env = await DataCleaningEnv.from_env("your-username/data-cleaning-env", use_docker=False)
572
+ ```
573
+
574
+ ### Run the Server Locally (Without Docker)
575
+
576
+ ```bash
577
+ uvicorn server.app:app --reload --port 8000
578
+ ```
579
+
580
+ ---
581
+
582
+ ## πŸ§ͺ Development & Testing
583
+
584
+ ### Test the Environment Logic (No Server Needed)
585
+
586
+ ```bash
587
+ # Runs a smoke test across all three tasks
588
+ python server/data_cleaning_env.py
589
+ ```
590
+
591
+ Expected output:
592
+ ```
593
+ ────────────────────────────────────────────────────────────────
594
+ TASK: EASY
595
+ ────────────────────────────────────────────────────────────────
596
+ reset() β†’ score=0.0000 issues=29 done=False
597
+ CSV: 101 rows, 5 cols
598
+ Hint: Sales orders dataset. price must be float...
599
+ step (bad col) β†’ success=False error='Column 'DOES_NOT_EXIST' not found...'
600
+ step (fix row=3 col='price') β†’ success=True score=0.0345 reward=0.0295
601
+ step (DONE, blocked) β†’ done=False reward=-1.0 score=0.0345
602
+ ...
603
+ All smoke tests passed.
604
+ ```
605
+
606
+ ### Test Pydantic Models
607
+
608
+ ```bash
609
+ python models.py
610
+ ```
611
+
612
+ ### Test the Client Parser
613
+
614
+ ```bash
615
+ python test_parse.py
616
+ ```
617
+
618
+ ### Run the Full Server Locally
619
+
620
+ ```bash
621
+ uvicorn server.app:app --reload
622
+ # Open http://localhost:8000/docs for interactive API explorer
623
+ ```
624
+
625
+ ---
626
+
627
+ ## πŸ”§ Troubleshooting
628
+
629
+ ### `TypeError: Too few arguments for EnvClient`
630
+
631
+ **Cause:** Your `client.py` subclasses `EnvClient` with only 2 type parameters, but OpenEnv requires 3 (`ActT`, `ObsT`, `StateT`).
632
+
633
+ **Fix:**
634
+ ```python
635
+ # ❌ Wrong
636
+ class DataCleaningEnv(EnvClient[CleanAction, CleanObservation]):
637
+
638
+ # βœ… Correct
639
+ class DataCleaningEnv(EnvClient[CleanAction, CleanObservation, dict]):
640
+ ```
641
+
642
+ Also ensure `_parse_state` is implemented:
643
+ ```python
644
+ def _parse_state(self, payload: dict) -> dict:
645
+ return payload
646
+ ```
647
+
648
+ ---
649
+
650
+ ### `ValidationError: Input should be 'SET_VALUE', 'DROP_ROW', ...`
651
+
652
+ **Cause:** Passing an invalid command string to `CleanAction`.
653
+
654
+ **Fix:** Only these 5 commands are valid:
655
+ ```python
656
+ "SET_VALUE" | "DROP_ROW" | "STANDARDIZE_COL" | "FILL_MISSING" | "DONE"
657
+ ```
658
+ There is no `"drop_column"` β€” columns cannot be dropped, only rows.
659
+
660
+ ---
661
+
662
+ ### `UnboundLocalError: cannot access local variable 'env'`
663
+
664
+ **Cause 1:** Docker image doesn't exist yet.
665
+ ```bash
666
+ docker build -t openenv-data_cleaning:latest -f server/Dockerfile .
667
+ ```
668
+
669
+ **Cause 2:** Stray test lines in `inference.py` referencing `env` before it's assigned.
670
+
671
+ **Fix:** Remove any manually added lines like `action = CleanAction(...)` or `result = await env.step(action)` from inside `main()`. The `main()` function should only call `run_episode()` β€” all action logic belongs inside that function.
672
+
673
+ ---
674
+
675
+ ### `DONE rejected: score X < required Y`
676
+
677
+ **This is expected behaviour, not a bug.** The environment refuses premature termination. The agent should continue cleaning until the score meets the task threshold.
678
+
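+ In an agent loop the refusal is easy to detect and recover from; `choose_action` below is a placeholder for your own policy or LLM call:
+
+ ```python
+ result = await env.reset(task_id="medium")
+ obs = result.observation
+ while not obs.done:
+     action = choose_action(obs)            # placeholder policy
+     result = await env.step(action)
+     obs = result.observation
+     if action.command == "DONE" and not obs.done:
+         # DONE was refused: the score is still below the task threshold,
+         # so the loop simply keeps cleaning on the next iteration.
+         continue
+ ```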
679
+ ---
680
+
681
+ ### HuggingFace Router returns 401
682
+
683
+ Ensure your token is set:
684
+ ```powershell
685
+ $env:HF_TOKEN = "hf_your_token_here"
686
+ ```
687
+ Get a free token at [huggingface.co/settings/tokens](https://huggingface.co/settings/tokens).
688
+
689
+ ---
690
+
691
+ ## πŸ“ Data Flow Diagram
692
+
693
+ ```
694
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
695
+ β”‚ inference.py / custom agent β”‚
696
+ β”‚ β”‚
697
+ β”‚ 1. await env.reset(task_id=…) β”‚
698
+ β”‚ 2. obs = result.observation β”‚
699
+ β”‚ 3. build_prompt(obs) β†’ LLM β”‚
700
+ β”‚ 4. parse_action(llm_output) β”‚
701
+ β”‚ 5. await env.step(action) β”‚
702
+ β”‚ 6. GOTO 2 until done β”‚
703
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
704
+ β”‚
705
+ CleanAction (JSON over WebSocket)
706
+ β”‚
707
+ β–Ό
708
+ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
709
+ β”‚ DataCleaningEnvironment β”‚
710
+ β”‚ β”‚
711
+ β”‚ _apply_action() β”‚
712
+ β”‚ β†’ mutates _dirty_df in-place β”‚
713
+ β”‚ β”‚
714
+ β”‚ grade(agent_df vs clean_df) β”‚
715
+ β”‚ β†’ score ∈ [0.0, 1.0] β”‚
716
+ β”‚ β”‚
717
+ β”‚ _compute_reward() β”‚
718
+ β”‚ β†’ progress + bonuses β”‚
719
+ β”‚ β”‚
720
+ β”‚ _build_observation() β”‚
721
+ β”‚ β†’ CleanObservation β”‚
722
+ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
723
+ ```
724
+
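+ The same loop in code, condensed; `build_prompt`, `call_llm`, and `parse_action` stand in for the prompt-building, LLM-call, and JSON-parsing helpers in `inference.py`:
+
+ ```python
+ async def run_episode(env, task_id: str) -> float:
+     result = await env.reset(task_id=task_id)      # 1. reset
+     obs = result.observation                       # 2. observe
+     while not obs.done:
+         prompt = build_prompt(obs)                 # 3. observation -> prompt
+         action = parse_action(call_llm(prompt))    # 4. LLM output -> CleanAction
+         result = await env.step(action)            # 5. apply the action
+         obs = result.observation                   # 6. repeat until done
+     return obs.current_score
+ ```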
725
+ ---
726
+
727
+ ## 🀝 Contributing
728
+
729
+ 1. Fork the repository
730
+ 2. Create a feature branch: `git checkout -b feature/my-improvement`
731
+ 3. Run the smoke tests: `python server/data_cleaning_env.py`
732
+ 4. Commit your changes: `git commit -m "feat: add my improvement"`
733
+ 5. Push and open a Pull Request
734
+
735
+ ---
736
+
737
+ ## πŸ“„ License
738
+
739
+ This project is licensed under the MIT License. See [LICENSE](LICENSE) for details.
740
+
741
+ ---
742
+
743
+ <div align="center">
744
+
745
+ Built with ❀️ using [OpenEnv](https://github.com/meta-pytorch/OpenEnv) · [FastAPI](https://fastapi.tiangolo.com/) · [Pydantic](https://docs.pydantic.dev/) · [HuggingFace](https://huggingface.co/)
746
+
747
+ </div>
inference.py CHANGED
@@ -1,508 +1,332 @@
1
  """
2
- inference.py
3
- ------------
4
- Data Cleaning Pipeline β€” submission inference script.
5
-
6
- Supports:
7
- β€’ Ollama local llama3 (DEFAULT β€” no API key needed)
8
- β€’ Groq free cloud API
9
- β€’ Any OpenAI-compatible endpoint
10
-
11
- Environment variables:
12
- API_BASE_URL LLM endpoint. Default: http://localhost:11434/v1 (Ollama)
13
- MODEL_NAME Model name. Default: llama3
14
- HF_TOKEN API key. Default: "ollama" (ignored by Ollama)
15
- LOCAL_IMAGE_NAME Docker image (leave unset to use ENV_BASE_URL)
16
- ENV_BASE_URL Env server URL. Default: http://localhost:8000
17
-
18
- To switch to Groq instead of Ollama:
19
- $env:API_BASE_URL = "https://api.groq.com/openai/v1"
20
- $env:MODEL_NAME = "llama-3.3-70b-versatile"
21
- $env:HF_TOKEN = "gsk_xxxxxxxxxxxx"
22
-
23
- STDOUT FORMAT (evaluator parses exactly β€” do not modify):
24
- [START] task=<n> env=<benchmark> model=<model>
25
- [STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>
26
- [END] success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
27
  """
28
 
29
  import asyncio
30
  import json
31
  import os
32
  import re
33
- import sys
34
- from typing import Any, Dict, List, Optional
35
 
36
  from openai import OpenAI
37
 
38
- try:
39
- from client import DataCleaningEnv
40
- from models import CleanAction, MAX_STEPS, DONE_THRESHOLD
41
- except ImportError:
42
- sys.path.insert(0, os.path.dirname(__file__))
43
- from client import DataCleaningEnv
44
- from models import CleanAction, MAX_STEPS, DONE_THRESHOLD
45
-
46
-
47
- # ── Configuration ─────────────────────────────────────────────────────────────
48
-
49
- API_BASE_URL = os.getenv("API_BASE_URL", "http://localhost:11434/v1")
50
- MODEL_NAME = os.getenv("MODEL_NAME", "llama3")
51
- HF_TOKEN = os.getenv("HF_TOKEN", "ollama")
52
- LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "")
53
- ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
54
-
55
- BENCHMARK = "data_cleaning_env"
56
- TASK_IDS = ["easy", "medium", "hard"]
57
- STEP_LIMITS = {"easy": 40, "medium": 100, "hard": 150}
58
-
59
-
60
- # ── System prompt (deterministic agent) ──────────────────────────────────────
61
-
62
- SYSTEM_PROMPT = """You are a deterministic data cleaning agent.
63
- Your task is to clean a dataset step-by-step using valid actions.
64
- You are operating inside an environment with strict rules.
65
- --------------------------------------------------
66
- ## INPUT PROVIDED EACH STEP
67
- You will receive:
68
- 1. Column schema (LIST OF VALID COLUMN NAMES - CASE SENSITIVE)
69
- 2. Column status:
70
- - missing values count
71
- - whether standardized (true/false)
72
- 3. Remaining issues (global state)
73
- 4. Previous actions taken
74
- --------------------------------------------------
75
- ## OBJECTIVE
76
- Fully clean the dataset with MINIMUM steps.
77
- A dataset is CLEAN only if:
78
- - No missing values remain
79
- - All columns are standardized
80
- - No invalid formats exist
81
- --------------------------------------------------
82
- ## STRICT RULES (MUST FOLLOW)
83
-
84
- ### 1. NEVER TERMINATE EARLY
85
- You MUST NOT output DONE unless:
86
- - ALL columns have missing = 0
87
- - ALL columns have standardized = true
88
- - remaining_issues is empty
89
- If ANY issue remains -> DO NOT output DONE.
90
-
91
- ### 2. USE ONLY VALID COLUMNS
92
- - You MUST use EXACT column names from the schema list
93
- - Column names are CASE SENSITIVE
94
- - NEVER invent new column names
95
-
96
- ### 3. PRIORITIZE COLUMN-LEVEL ACTIONS
97
- Preferred actions (in order):
98
- 1. FILL_MISSING - fixes entire column missing values
99
- 2. STANDARDIZE_COL - fixes formatting for entire column
100
- 3. SET_VALUE - only for a single isolated bad cell
101
- 4. DROP_ROW - only for truly corrupt/outlier rows
102
- NEVER fix a full column using repeated SET_VALUE.
103
-
104
- ### 4. DO NOT REPEAT ACTIONS
105
- - Do NOT apply the same action to the same column twice
106
- - Do NOT standardize an already standardized column
107
- - Do NOT fill missing if missing = 0
108
-
109
- ### 5. CHOOSE THE CORRECT FILL STRATEGY
110
- - Numeric columns (float/int): use "median" or "mean"
111
- - Categorical/string columns: use "mode"
112
- - NEVER use "mean" or "median" on a categorical column
113
-
114
- ### 6. ALWAYS THINK GLOBALLY
115
- Before choosing an action:
116
- - Review ALL columns in column_status
117
- - Pick the single action that fixes the largest remaining issue
118
- --------------------------------------------------
119
- ## DECISION PROCESS (MANDATORY)
120
- At each step:
121
- 1. Read column_status carefully
122
- 2. Find columns where missing > 0 OR standardized = false
123
- 3. If none exist AND remaining_issues is empty -> output DONE
124
- 4. Otherwise, pick the ONE most impactful action
125
- --------------------------------------------------
126
- ## OUTPUT FORMAT - STRICT JSON ONLY
127
- Return ONLY a single JSON object. No explanation. No markdown. No backticks.
128
-
129
- Fill missing values:
130
- {"action": "FILL_MISSING", "column": "<exact_col_name>", "strategy": "<mean|median|mode>"}
131
-
132
- Standardize a column:
133
- {"action": "STANDARDIZE_COL", "column": "<exact_col_name>"}
134
-
135
- Fix one cell:
136
- {"action": "SET_VALUE", "column": "<exact_col_name>", "row": <int>, "value": "<str>"}
137
-
138
- Drop a bad row:
139
- {"action": "DROP_ROW", "row": <int>}
140
-
141
- Signal completion:
142
- {"action": "DONE"}
143
-
144
- --------------------------------------------------
145
- ## FAILURE CONDITIONS (YOU WILL BE PENALIZED FOR):
146
- - Outputting DONE when issues remain
147
- - Using a column name not in the schema
148
- - Repeating the same action on the same column
149
- - Using SET_VALUE to fix an entire column
150
- - Using mean/median on a categorical column
151
- - Using mode on a numeric column
152
- --------------------------------------------------
153
- ## FINAL GOAL
154
- Be efficient, precise, and minimal.
155
- Every step must move the dataset closer to a fully clean state."""
156
-
157
-
158
- # ── Official log helpers ──────────────────────────────────────────────────────
159
 
160
  def log_start(task: str, env: str, model: str) -> None:
161
  print(f"[START] task={task} env={env} model={model}", flush=True)
162
 
163
 
164
- def log_step(step: int, action: str, reward: float,
165
- done: bool, error: Optional[str]) -> None:
 
166
  print(
167
- f"[STEP] step={step} action={action[:80].replace(chr(10), ' ')} "
168
- f"reward={reward:.2f} done={str(done).lower()} "
169
- f"error={error if error else 'null'}",
170
  flush=True,
171
  )
172
 
173
 
174
- def log_end(success: bool, steps: int, score: float,
175
- rewards: List[float]) -> None:
176
  print(
177
- f"[END] success={str(success).lower()} steps={steps} "
178
- f"score={score:.4f} rewards={','.join(f'{r:.2f}' for r in rewards)}",
179
  flush=True,
180
  )
181
 
 
182
 
183
- # ── Column type hints (used to suggest fill strategies) ──────────────────────
184
-
185
- _COL_TYPES: Dict[str, Dict[str, str]] = {
186
- "easy": {
187
- "order_id": "numeric",
188
- "customer": "categorical",
189
- "product": "categorical",
190
- "category": "categorical",
191
- "price": "numeric",
192
- "quantity": "numeric",
193
- "order_date": "datetime",
194
- "region": "categorical",
195
- },
196
- "medium": {
197
- "tx_id": "numeric",
198
- "customer_id": "numeric",
199
- "amount": "numeric",
200
- "tx_date": "datetime",
201
- "category": "categorical",
202
- "country": "categorical",
203
- "status": "categorical",
204
- },
205
- "hard": {
206
- "record_id": "numeric", "id": "numeric", "RecordID": "numeric",
207
- "customer_id": "numeric", "cust_id": "numeric", "CustomerID": "numeric",
208
- "full_name": "categorical","name": "categorical","CustomerName":"categorical",
209
- "email": "categorical","email_address": "categorical","Email": "categorical",
210
- "amount": "numeric", "sale_amount": "numeric", "Amount": "numeric",
211
- "currency": "categorical","ccy": "categorical","Currency": "categorical",
212
- "purchase_date": "datetime", "date": "datetime", "PurchaseDate":"datetime",
213
- "product_name": "categorical","item": "categorical","ProductName": "categorical",
214
- "region": "categorical","territory": "categorical","area": "categorical",
215
- "contact_email": "categorical","value": "numeric", "product": "categorical",
216
- },
217
- }
218
-
219
-
220
- def _strategy_hint(task_id: str, col: str) -> str:
221
- col_type = _COL_TYPES.get(task_id, {}).get(col, "unknown")
222
- if col_type == "numeric":
223
- return "median"
224
- if col_type in ("categorical", "datetime"):
225
- return "mode"
226
- return "median"
227
 
 
228
 
229
- # ── Prompt builder ────────────────────────────────────────────────────────────
 
 
 
 
230
 
231
- def _column_status_block(obs, task_id: str) -> str:
232
- col_status: Dict[str, Any] = getattr(obs, "column_status", {}) or {}
233
-
234
- if col_status:
235
- lines = []
236
- for col, status in col_status.items():
237
- missing = status.get("missing", 0)
238
- standardized = status.get("standardized", True)
239
- hint = _strategy_hint(task_id, col)
240
- flag = "OK" if (missing == 0 and standardized) else "NEEDS_FIX"
241
- lines.append(
242
- f" {col:<22} missing={missing:<4} "
243
- f"standardized={str(standardized).lower():<5} "
244
- f"fill_strategy={hint:<7} [{flag}]"
245
- )
246
- return "\n".join(lines)
247
-
248
- # Fallback: derive columns from CSV header
249
- rows = obs.dirty_csv.strip().split("\n")
250
- header = rows[0] if rows else ""
251
- cols = [c.strip() for c in header.split(",")]
252
- return "\n".join(
253
- f" {col:<22} (status unknown) fill_strategy={_strategy_hint(task_id, col)}"
254
- for col in cols
255
- )
256
 
257
 
258
  def build_user_prompt(obs, history: List[str]) -> str:
259
- rows = obs.dirty_csv.strip().split("\n")
260
- header = rows[0] if rows else ""
261
- data_rows = rows[1:]
262
- preview = "\n".join([header] + data_rows[:10])
263
- truncated = len(data_rows) > 10
264
-
265
- col_status: Dict[str, Any] = getattr(obs, "column_status", {}) or {}
266
- broken = [
267
- c for c, s in col_status.items()
268
- if s.get("missing", 0) > 0 or not s.get("standardized", True)
269
- ]
270
-
271
- history_block = (
272
- "\n".join(f" {h}" for h in history[-6:])
273
- if history else " (none yet)"
274
- )
275
-
276
- return (
277
- f"--------------------------------------------------\n"
278
- f"## STEP {obs.step_number}/{obs.max_steps}\n"
279
- f"Score: {obs.current_score:.4f} "
280
- f"(need >= {DONE_THRESHOLD[obs.task_id]:.2f} to pass)\n"
281
- f"Issues remaining: {obs.issues_remaining}\n"
282
- f"Broken columns: {len(broken)} -> {broken[:10] if broken else 'NONE β€” consider DONE'}\n"
283
- f"\n## SCHEMA HINT\n{obs.schema_hint}\n"
284
- f"\n## VALID COLUMN NAMES (CASE SENSITIVE οΏ½οΏ½οΏ½ copy exactly)\n{header}\n"
285
- f"\n## COLUMN STATUS (read carefully before acting)\n"
286
- f"{_column_status_block(obs, obs.task_id)}\n"
287
- f"\n## CSV PREVIEW"
288
- f"{' (first 10 of ' + str(len(data_rows)) + ' rows)' if truncated else ''}\n"
289
- f"{preview}\n"
290
- f"\n## PREVIOUS ACTIONS (last 6)\n{history_block}\n"
291
- f"\n--------------------------------------------------\n"
292
- f"## DECISION CHECKLIST\n"
293
- f"1. Any column with missing > 0? -> FILL_MISSING (use strategy from column status)\n"
294
- f"2. Any column with standardized=false? -> STANDARDIZE_COL\n"
295
- f"3. Isolated bad cell visible in CSV? -> SET_VALUE\n"
296
- f"4. Clearly corrupt/outlier row? -> DROP_ROW\n"
297
- f"5. ALL missing=0, ALL standardized=true, issues=0? -> DONE\n"
298
- f"\nOutput ONE JSON action (no markdown, no explanation):"
299
- )
300
-
301
-
302
- # ── Action parser ─────────────────────────────────────────────────────────────
303
- # Bridges {action, column, strategy, row, value} -> CleanAction
304
-
305
- _COMMAND_MAP = {
306
- "FILL_MISSING": "FILL_MISSING",
307
- "STANDARDIZE_COL": "STANDARDIZE_COL",
308
- "STANDARDIZE": "STANDARDIZE_COL",
309
- "SET_VALUE": "SET_VALUE",
310
- "DROP_ROW": "DROP_ROW",
311
- "DROP": "DROP_ROW",
312
- "DONE": "DONE",
313
- }
314
-
315
-
316
- def parse_action(raw: str) -> CleanAction:
317
- text = raw.strip()
318
-
319
- # Strip markdown fences
320
- if text.startswith("```"):
321
- lines = text.split("\n")
322
- inner = lines[1:-1] if lines[-1].strip().startswith("```") else lines[1:]
323
- text = "\n".join(inner).strip()
324
-
325
- m = re.search(r"\{[^{}]*\}", text, re.DOTALL)
326
- if not m:
327
- return CleanAction(command="DONE")
328
-
329
- try:
330
- data: Dict[str, Any] = json.loads(m.group())
331
- except json.JSONDecodeError:
332
- return CleanAction(command="DONE")
333
-
334
- raw_cmd = str(data.get("action", "DONE")).upper().strip().replace(" ", "_")
335
- command = _COMMAND_MAP.get(raw_cmd)
336
- if not command:
337
- return CleanAction(command="DONE")
338
- if command == "DONE":
339
- return CleanAction(command="DONE")
340
-
341
- column = data.get("column")
342
- fill_strategy = data.get("strategy") or data.get("fill_strategy")
343
- row_raw = data.get("row") if data.get("row") is not None else data.get("row_index")
344
- value = data.get("value")
345
-
346
- try:
347
  return CleanAction(
348
- command = command,
349
- column = column,
350
- fill_strategy = fill_strategy,
351
- row_index = int(row_raw) if row_raw is not None else None,
352
- value = str(value) if value is not None else None,
353
  )
354
- except Exception:
355
  return CleanAction(command="DONE")
356
 
357
 
358
- # ── LLM call (async β€” keeps WebSocket keepalive alive) ───────────────────────
359
-
360
- async def call_llm_async(client: OpenAI, messages: list) -> str:
361
- loop = asyncio.get_event_loop()
362
- response = await loop.run_in_executor(
363
- None,
364
- lambda: client.chat.completions.create(
365
- model = MODEL_NAME,
366
- messages = messages,
367
- max_tokens = 120,
368
- temperature = 0.0,
369
- ),
370
- )
371
- return (response.choices[0].message.content or "").strip()
372
-
373
-
374
- # ── Episode loop ───────────────────────────────────────────────────────────────
375
-
376
- async def run_episode(env, client: OpenAI, task_id: str) -> dict:
377
- max_steps = STEP_LIMITS[task_id]
378
- threshold = DONE_THRESHOLD[task_id]
379
- rewards: List[float] = []
380
- steps_taken = 0
381
- score = 0.0
382
- success = False
383
- history: List[str] = []
 
384
 
385
  log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
386
 
387
  try:
388
- result = await env.reset(task_id=task_id)
389
- obs = result.observation
390
- messages = [{"role": "system", "content": SYSTEM_PROMPT}]
391
 
392
  for step in range(1, max_steps + 1):
393
- if obs.done:
394
  break
395
 
396
- steps_taken = step
397
- messages.append({"role": "user", "content": build_user_prompt(obs, history)})
398
-
399
- try:
400
- raw = await call_llm_async(client, messages)
401
- action = parse_action(raw)
402
- messages.append({"role": "assistant", "content": raw})
403
- except Exception as exc:
404
- log_step(step, "DONE", 0.00, True, str(exc)[:120])
405
- rewards.append(0.0)
406
- break
407
-
408
- # Keep system + last 3 exchanges to avoid context overflow
409
- if len(messages) > 7:
410
- messages = [messages[0]] + messages[-6:]
411
 
412
  result = await env.step(action)
413
  obs = result.observation
414
- reward = result.reward or 0.0
 
 
 
 
 
 
415
  rewards.append(reward)
416
- score = obs.current_score
417
 
418
- log_step(
419
- step = step,
420
- action = action.command,
421
- reward = reward,
422
- done = obs.done,
423
- error = obs.last_action_error,
 
 
424
  )
425
 
426
- err_note = f" [ERR: {obs.last_action_error[:40]}]" if obs.last_action_error else ""
427
- history.append(
428
- f"step {step}: {action.command}"
429
- + (f"({action.column}"
430
- + (f", {action.fill_strategy})" if action.fill_strategy else ")")
431
- if action.column else "")
432
- + f" -> score={score:.4f}{err_note}"
433
  )
434
 
435
- if obs.done or score >= threshold:
436
  break
437
 
 
438
  success = score >= threshold
439
 
440
- except Exception as e:
441
- print(f"[EPISODE ERROR] task={task_id} error={str(e)[:120]}", flush=True)
442
-
443
  finally:
 
 
444
  log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
445
 
446
  return {
447
- "task_id": task_id,
448
  "score": score,
449
  "reward": sum(rewards),
450
  "steps": steps_taken,
451
  "success": success,
452
  }
453
 
454
-
455
- # ── Entry point ────────────────────────────────────────────────────────────────
456
 
457
  async def main() -> None:
458
- is_ollama = "11434" in API_BASE_URL or "ollama" in API_BASE_URL.lower()
459
-
460
- if not is_ollama and (not HF_TOKEN or HF_TOKEN == "ollama"):
461
- print(
462
- "ERROR: HF_TOKEN not set for remote API.\n"
463
- "For Groq: $env:HF_TOKEN='gsk_xxxxxxxxxxxx'\n"
464
- "For Ollama (local): no token needed β€” defaults already set.",
465
- file=sys.stderr,
466
- )
467
- sys.exit(1)
468
 
469
- print(f"API_BASE_URL : {API_BASE_URL}", flush=True)
470
- print(f"MODEL_NAME : {MODEL_NAME}", flush=True)
471
- print(f"BACKEND : {'Ollama (local)' if is_ollama else 'Remote API'}", flush=True)
472
- print(f"ENV SERVER : {LOCAL_IMAGE_NAME or ENV_BASE_URL}", flush=True)
473
- print("", flush=True)
474
 
475
- llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
476
 
477
  results = []
478
- for task_id in TASK_IDS:
479
- # Fresh connection per task β€” prevents WebSocket keepalive timeout carryover
480
- if LOCAL_IMAGE_NAME:
481
- env = await DataCleaningEnv.from_docker_image(LOCAL_IMAGE_NAME)
482
- else:
483
- env = DataCleaningEnv(base_url=ENV_BASE_URL)
484
- await env.connect()
485
-
486
- try:
487
- summary = await run_episode(env, llm, task_id)
488
  results.append(summary)
489
- finally:
490
- try:
491
- await env.close()
492
- except Exception:
493
- pass
494
- print("", flush=True)
495
-
496
- print("=" * 56, flush=True)
497
- print(f"{'Task':<10} {'Score':>7} {'Reward':>9} {'Steps':>6} {'Pass':>5}")
498
- print("-" * 56, flush=True)
 
499
  for r in results:
500
- print(
501
- f"{r['task_id']:<10} {r['score']:>7.4f} {r['reward']:>9.4f} "
502
- f"{r['steps']:>6} {'YES' if r['success'] else 'NO':>4}",
503
- flush=True,
504
- )
505
- print("=" * 56, flush=True)
506
 
507
 
508
  if __name__ == "__main__":
 
  """
+ Inference Script — Data Cleaning Environment
+ =============================================
+ MANDATORY environment variables:
+     API_BASE_URL      The API endpoint for the LLM.
+     MODEL_NAME        The model identifier to use for inference.
+     HF_TOKEN          Your Hugging Face / API key.
+     LOCAL_IMAGE_NAME  Docker image name (when using from_docker_image()).
+
+ Defaults are provided for API_BASE_URL, MODEL_NAME and LOCAL_IMAGE_NAME; HF_TOKEN has no default.
+
+ STDOUT FORMAT
+ - [START] task=<task_name> env=<benchmark> model=<model_name>
+ - [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+ - [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
  """

  import asyncio
  import json
  import os
  import re
+ import textwrap
+ from typing import List, Optional

  from openai import OpenAI

+ from client import DataCleaningEnv
+ from models import CleanAction
+
+ # ── Environment variables ────────────────────────────────────────────────────
+ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "openenv-data_cleaning:latest")
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ BENCHMARK = "data_cleaning_env"
+
+ # ── Per-task config (mirrors server constants) ────────────────────────────────
+ TASK_CONFIG = {
+     "easy": {"max_steps": 40, "threshold": 0.95},
+     "medium": {"max_steps": 80, "threshold": 0.85},
+     "hard": {"max_steps": 150, "threshold": 0.80},
+ }
+
+ TEMPERATURE = 0.2  # low temp → more deterministic action parsing
+ MAX_TOKENS = 256
+
+ # ── Logging helpers (strict stdout format) ───────────────────────────────────

  def log_start(task: str, env: str, model: str) -> None:
      print(f"[START] task={task} env={env} model={model}", flush=True)


+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
      print(
+         f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
          flush=True,
      )


+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
      print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
          flush=True,
      )
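
As an illustrative aside (not part of the committed file), calling the three helpers above with made-up values produces exactly the documented stdout lines:

```python
# Illustrative only: the stdout lines a short episode would emit.
log_start(task="easy", env="data_cleaning_env", model="Qwen/Qwen2.5-72B-Instruct")
# [START] task=easy env=data_cleaning_env model=Qwen/Qwen2.5-72B-Instruct

log_step(step=1, action="(FILL_MISSING,col=quantity,strategy=median)",
         reward=0.25, done=False, error=None)
# [STEP] step=1 action=(FILL_MISSING,col=quantity,strategy=median) reward=0.25 done=false error=null

log_end(success=True, steps=1, score=0.96, rewards=[0.25])
# [END] success=true steps=1 score=0.960 rewards=0.25
```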

+ # ── Prompt builders ───────────────────────────────────────────────────────────

+ SYSTEM_PROMPT = textwrap.dedent("""
+     You are an expert data cleaning agent. You receive a dirty CSV dataset and must
+     fix it step by step to match a hidden clean ground truth.
+
+     Available commands (respond with EXACTLY one JSON object, no extra text):
+
+     {"command": "SET_VALUE", "row_index": <int>, "column": "<col>", "value": "<val>"}
+     {"command": "DROP_ROW", "row_index": <int>}
+     {"command": "STANDARDIZE_COL", "column": "<col>"}
+     {"command": "FILL_MISSING", "column": "<col>", "fill_strategy": "mean|median|mode|drop"}
+     {"command": "DONE"}
+
+     Rules:
+     - Output ONLY the JSON object — no explanation, no markdown, no backticks.
+     - Use DONE only when you are confident the score meets the task threshold.
+     - SET_VALUE fixes a single bad cell.
+     - STANDARDIZE_COL normalises an entire column's format.
+     - FILL_MISSING fills NaN values in a column.
+     - DROP_ROW removes a row; use carefully — false positives are penalised.
+     - Row indices are 0-based positional indices (they shift after each DROP_ROW).
+ """).strip()


  def build_user_prompt(obs, history: List[str]) -> str:
+     history_block = "\n".join(history[-15:]) if history else "None yet."
+     return textwrap.dedent(f"""
+         Task: {obs.task_id}
+         Schema hint: {obs.schema_hint}
+         Step: {obs.step_number} / {obs.max_steps}
+         Current score: {obs.current_score:.4f}
+         Issues remaining: {obs.issues_remaining}
+         Initial dirty cells: {obs.initial_dirty_cells}
+         Last action success: {obs.last_action_success}
+         Last action error: {obs.last_action_error or 'none'}
+
+         === ACTION HISTORY (most recent 15) ===
+         {history_block}
+
+         IMPORTANT RULES:
+         - Do NOT repeat any action that already appears in the history with score_delta=0.0000.
+         - Do NOT repeat STANDARDIZE_COL or FILL_MISSING on the same column twice.
+         - If score is not improving after 2 steps, switch strategy entirely.
+         - Use SET_VALUE to fix specific bad cells (wrong types, "N/A" strings, outliers, future dates).
+         - Inspect the CSV carefully before choosing your action.
+
+         Current CSV (first 80 rows shown if large):
+         {_truncate_csv(obs.dirty_csv, max_rows=80)}
+
+         Output your next CleanAction as a single JSON object.
+     """).strip()
+
+
+ def _truncate_csv(csv_text: str, max_rows: int = 80) -> str:
+     lines = csv_text.splitlines()
+     if len(lines) <= max_rows + 1:  # +1 for header
+         return csv_text
+     header = lines[0]
+     body = lines[1: max_rows + 1]
+     omitted = len(lines) - 1 - max_rows
+     return "\n".join([header] + body + [f"... ({omitted} more rows omitted)"])
+
+ # ── Action parsing ────────────────────────────────────────────────────────────
+
+ VALID_COMMANDS = {"SET_VALUE", "DROP_ROW", "STANDARDIZE_COL", "FILL_MISSING", "DONE"}
+ VALID_STRATEGIES = {"mean", "median", "mode", "drop"}
+
+
+ def parse_action(llm_output: str) -> CleanAction:
+     """
+     Parse the LLM's JSON output into a CleanAction.
+     Raises ValueError when no valid action can be extracted; the caller falls back to a safe default.
+     """
+     text = llm_output.strip()
+
+     # Strip accidental markdown fences
+     text = re.sub(r"^```(?:json)?", "", text, flags=re.IGNORECASE).strip()
+     text = re.sub(r"```$", "", text).strip()
+
+     # Extract first JSON object
+     match = re.search(r"\{.*?\}", text, re.DOTALL)
+     if not match:
+         raise ValueError(f"No JSON object found in LLM output: {text!r}")
+
+     data = json.loads(match.group())
+     command = data.get("command", "").upper()
+
+     if command not in VALID_COMMANDS:
+         raise ValueError(f"Unknown command: {command!r}")
+
+     if command == "SET_VALUE":
+         return CleanAction(
+             command="SET_VALUE",
+             row_index=int(data["row_index"]),
+             column=str(data["column"]),
+             value=str(data["value"]),
+         )
+     elif command == "DROP_ROW":
+         return CleanAction(command="DROP_ROW", row_index=int(data["row_index"]))
+     elif command == "STANDARDIZE_COL":
+         return CleanAction(command="STANDARDIZE_COL", column=str(data["column"]))
+     elif command == "FILL_MISSING":
+         strategy = str(data.get("fill_strategy", "median")).lower()
+         if strategy not in VALID_STRATEGIES:
+             strategy = "median"
          return CleanAction(
+             command="FILL_MISSING",
+             column=str(data["column"]),
+             fill_strategy=strategy,
          )
+     else:  # DONE
          return CleanAction(command="DONE")
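
For example (illustrative, not part of the commit), a fenced model reply parses into a typed action, while free text raises and triggers the caller's fallback:

```python
# Illustrative only.
raw = '```json\n{"command": "FILL_MISSING", "column": "price", "fill_strategy": "mean"}\n```'
action = parse_action(raw)
# -> CleanAction(command="FILL_MISSING", column="price", fill_strategy="mean")

parse_action("let's just drop row 3")  # no JSON object -> ValueError
```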


+ def _action_to_str(action: CleanAction) -> str:
+     """Compact single-line string for [STEP] log."""
+     parts = [action.command]
+     if action.row_index is not None:
+         parts.append(f"row={action.row_index}")
+     if action.column:
+         parts.append(f"col={action.column}")
+     if action.value is not None:
+         val_repr = str(action.value)[:30]
+         parts.append(f"val={val_repr!r}")
+     if action.fill_strategy:
+         parts.append(f"strategy={action.fill_strategy}")
+     return "(" + ",".join(parts) + ")"
+
+ # ── LLM call ──────────────────────────────────────────────────────────────────
+
+ def get_model_action(client: OpenAI, obs, history: List[str]) -> CleanAction:
+     user_prompt = build_user_prompt(obs, history)
+     try:
+         completion = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": user_prompt},
+             ],
+             temperature=TEMPERATURE,
+             max_tokens=MAX_TOKENS,
+             stream=False,
+         )
+         text = (completion.choices[0].message.content or "").strip()
+         return parse_action(text)
+     except Exception as exc:
+         print(f"[DEBUG] Model/parse error: {exc}", flush=True)
+         return CleanAction(command="FILL_MISSING", column="quantity", fill_strategy="median")
+
+ # ── Episode runner ────────────────────────────────────────────────────────────
+
+ async def run_episode(env: DataCleaningEnv, client: OpenAI, task_id: str) -> dict:
+     """
+     Run a single episode for task_id. Returns a summary dict.
+     """
+     cfg = TASK_CONFIG[task_id]
+     max_steps = cfg["max_steps"]
+     threshold = cfg["threshold"]
+
+     rewards: List[float] = []
+     history: List[str] = []  # action history fed back to LLM each step
+     steps_taken: int = 0
+     score: float = 0.0
+     prev_score: float = 0.0
+     success: bool = False

      log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)

      try:
+         result = await env.reset(task_id=task_id)
+         obs = result.observation
+         prev_score = obs.current_score

          for step in range(1, max_steps + 1):
+             if result.done:
                  break

+             action = get_model_action(client, obs, history)

              result = await env.step(action)
              obs = result.observation
+
+             reward = result.reward or 0.0
+             done = result.done
+             error = obs.last_action_error if not obs.last_action_success else None
+             score_delta = obs.current_score - prev_score
+             prev_score = obs.current_score
+
              rewards.append(reward)
+             steps_taken = step

+             # Build a rich history entry the LLM can learn from
+             action_desc = _action_to_str(action)
+             status = "✓" if obs.last_action_success else "✗"
+             delta_str = f"+{score_delta:.4f}" if score_delta > 0 else f"{score_delta:.4f}"
+             history.append(
+                 f"step={step} {status} {action_desc} reward={reward:+.2f} "
+                 f"score_delta={delta_str} score={obs.current_score:.4f}"
+                 + (f" ERROR={error}" if error else "")
              )

+             log_step(
+                 step=step,
+                 action=action_desc,
+                 reward=reward,
+                 done=done,
+                 error=error,
              )

+             if done:
                  break

+         score = obs.current_score if obs else 0.0
          success = score >= threshold

      finally:
+         score = score if score else 0.0
+         success = success if success else False
          log_end(success=success, steps=steps_taken, score=score, rewards=rewards)

      return {
+         "task": task_id,
          "score": score,
          "reward": sum(rewards),
          "steps": steps_taken,
          "success": success,
      }
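
Each call therefore returns a small summary record; for an "easy" run that clears its 0.95 threshold it might look like this (values illustrative):

```python
# Illustrative only: shape of the per-task summary consumed by main().
summary = {"task": "easy", "score": 0.97, "reward": 3.25, "steps": 12, "success": True}
```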

+ # ── Main ──────────────────────────────────────────────────────────────────────

  async def main() -> None:
+     print(f"API_BASE_URL : {API_BASE_URL}")
+     print(f"MODEL_NAME : {MODEL_NAME}")
+     print(f"LOCAL_IMAGE_NAME : {LOCAL_IMAGE_NAME}")
+     print()

+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

+     env = await DataCleaningEnv.from_docker_image(LOCAL_IMAGE_NAME)

      results = []
+     try:
+         for task_id in ("easy", "medium", "hard"):
+             summary = await run_episode(env, client, task_id)
              results.append(summary)
+             print()  # blank line between tasks
+     finally:
+         try:
+             await env.close()
+         except Exception as e:
+             print(f"[DEBUG] env.close() error: {e}", flush=True)
+
+     # ── Summary table ────────────────────────────────────────────────────────
+     print("═" * 56)
+     print(f"{'Task':<12} {'Score':>7} {'Reward':>7} {'Steps':>5} {'Pass'}")
+     print("─" * 56)
      for r in results:
+         flag = "YES" if r["success"] else " NO"
+         print(f"{r['task']:<12} {r['score']:>7.4f} {r['reward']:>7.4f} {r['steps']:>5} {flag}")
+     print("═" * 56)


  if __name__ == "__main__":