CodeKnightDebjit committed
Commit d627dc7 · verified · 1 parent: 3feb717

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,81 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ # Multi-stage build using openenv-base
+ # This Dockerfile is flexible and works for both:
+ # - In-repo environments (with local OpenEnv sources)
+ # - Standalone environments (with openenv from PyPI/Git)
+ # The build script (openenv build) handles context detection and sets appropriate build args.
+
+ ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
+ FROM ${BASE_IMAGE} AS builder
+
+ WORKDIR /app
+
+ # Ensure git is available (required for installing dependencies from VCS)
+ RUN apt-get update && \
+     apt-get install -y --no-install-recommends git && \
+     rm -rf /var/lib/apt/lists/*
+
+ # Build argument to control whether we're building standalone or in-repo
+ ARG BUILD_MODE=in-repo
+ ARG ENV_NAME=data_cleaning_env
+
+ # Copy environment code (always at root of build context)
+ COPY . /app/env
+
+ # For in-repo builds, openenv is already vendored in the build context
+ # For standalone builds, openenv will be installed via pyproject.toml
+ WORKDIR /app/env
+
+ # Ensure uv is available (for local builds where the base image lacks it)
+ RUN if ! command -v uv >/dev/null 2>&1; then \
+         curl -LsSf https://astral.sh/uv/install.sh | sh && \
+         mv /root/.local/bin/uv /usr/local/bin/uv && \
+         mv /root/.local/bin/uvx /usr/local/bin/uvx; \
+     fi
+
+ # Install dependencies using uv sync
+ # If uv.lock exists, use it; otherwise resolve on the fly
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-install-project --no-editable; \
+     else \
+         uv sync --no-install-project --no-editable; \
+     fi
+
+ RUN --mount=type=cache,target=/root/.cache/uv \
+     if [ -f uv.lock ]; then \
+         uv sync --frozen --no-editable; \
+     else \
+         uv sync --no-editable; \
+     fi
+
+ # Final runtime stage
+ FROM ${BASE_IMAGE}
+
+ WORKDIR /app
+
+ # Copy the virtual environment from builder
+ COPY --from=builder /app/env/.venv /app/.venv
+
+ # Copy the environment code
+ COPY --from=builder /app/env /app/env
+
+ # Set PATH to use the virtual environment
+ ENV PATH="/app/.venv/bin:$PATH"
+
+ # Set PYTHONPATH so imports work correctly
+ ENV PYTHONPATH="/app/env:$PYTHONPATH"
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:8000/health || exit 1
+
+ # Run the FastAPI server
+ # The module path is constructed to work with the /app/env structure
+ ENV ENABLE_WEB_INTERFACE=true
+ CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,255 @@
  ---
- title: Data Cleaning Env
- emoji: 📚
- colorFrom: gray
- colorTo: gray
+ title: Data Cleaning Env Environment Server
+ emoji: 🎹
+ colorFrom: indigo
+ colorTo: red
  sdk: docker
  pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+ - openenv
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # Data Cleaning Env Environment
+
+ A simple test environment that echoes back messages. Perfect for testing the env APIs as well as demonstrating environment usage patterns.
+
+ ## Quick Start
+
+ The simplest way to use the Data Cleaning Env environment is through the `DataCleaningEnv` class:
+
+ ```python
+ from data_cleaning_env import CleanAction, DataCleaningEnv
+
+ # Create environment from Docker image
+ env = DataCleaningEnv.from_docker_image("data_cleaning_env-env:latest")
+
+ try:
+     # Reset
+     result = env.reset()
+     print(f"Reset: {result.observation.echoed_message}")
+
+     # Send multiple messages
+     messages = ["Hello, World!", "Testing echo", "Final message"]
+
+     for msg in messages:
+         result = env.step(CleanAction(message=msg))
+         print(f"Sent: '{msg}'")
+         print(f"  → Echoed: '{result.observation.echoed_message}'")
+         print(f"  → Length: {result.observation.message_length}")
+         print(f"  → Reward: {result.reward}")
+
+ finally:
+     # Always clean up
+     env.close()
+ ```
+
+ That's it! The `DataCleaningEnv.from_docker_image()` method handles:
+ - Starting the Docker container
+ - Waiting for the server to be ready
+ - Connecting to the environment
+ - Container cleanup when you call `close()`
+
+ ## Building the Docker Image
+
+ Before using the environment, you need to build the Docker image:
+
+ ```bash
+ # From project root
+ docker build -t data_cleaning_env-env:latest -f server/Dockerfile .
+ ```
+
+ ## Deploying to Hugging Face Spaces
+
+ You can easily deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
+
+ ```bash
+ # From the environment directory (where openenv.yaml is located)
+ openenv push
+
+ # Or specify options
+ openenv push --namespace my-org --private
+ ```
+
+ The `openenv push` command will:
+ 1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
+ 2. Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
+ 3. Upload to Hugging Face (ensuring you're logged in)
+
+ ### Prerequisites
+
+ - Authenticate with Hugging Face: the command will prompt for login if you are not already authenticated
+
+ ### Options
+
+ - `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to the current directory)
+ - `--repo-id`, `-r`: Repository ID in the format `username/repo-name` (defaults to `username/env-name` from openenv.yaml)
+ - `--base-image`, `-b`: Base Docker image to use (overrides the Dockerfile `FROM`)
+ - `--private`: Deploy the Space as private (default: public)
+
+ ### Examples
+
+ ```bash
+ # Push to your personal namespace (defaults to username/env-name from openenv.yaml)
+ openenv push
+
+ # Push to a specific repository
+ openenv push --repo-id my-org/my-env
+
+ # Push with a custom base image
+ openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
+
+ # Push as a private space
+ openenv push --private
+
+ # Combine options
+ openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
+ ```
+
+ After deployment, your Space will be available at:
+ `https://huggingface.co/spaces/<repo-id>`
+
+ The deployed Space includes:
+ - **Web Interface** at `/web` - Interactive UI for exploring the environment
+ - **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
+ - **Health Check** at `/health` - Container health monitoring
+ - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
+
+ ## Environment Details
+
+ ### Action
+ **CleanAction**: Contains a single field
+ - `message` (str) - The message to echo back
+
+ ### Observation
+ **CleanObservation**: Contains the echo response and metadata
+ - `echoed_message` (str) - The message echoed back
+ - `message_length` (int) - Length of the message
+ - `reward` (float) - Reward based on message length (length × 0.1)
+ - `done` (bool) - Always False for the echo environment
+ - `metadata` (dict) - Additional info such as step count
+
+ ### Reward
+ The reward is calculated as `message_length × 0.1`:
+ - "Hi" → reward: 0.2
+ - "Hello, World!" → reward: 1.3
+ - Empty message → reward: 0.0
+
+ ## Advanced Usage
+
+ ### Connecting to an Existing Server
+
+ If you already have a Data Cleaning Env environment server running, you can connect directly:
+
+ ```python
+ from data_cleaning_env import CleanAction, DataCleaningEnv
+
+ # Connect to an existing server
+ env = DataCleaningEnv(base_url="<ENV_HTTP_URL_HERE>")
+
+ # Use as normal
+ result = env.reset()
+ result = env.step(CleanAction(message="Hello!"))
+ ```
+
+ Note: when connecting to an existing server, `env.close()` will NOT stop the server.
+
+ ### Using the Context Manager
+
+ The client supports context-manager usage for automatic connection management:
+
+ ```python
+ from data_cleaning_env import CleanAction, DataCleaningEnv
+
+ # Connect with a context manager (auto-connects and closes)
+ with DataCleaningEnv(base_url="http://localhost:8000") as env:
+     result = env.reset()
+     print(f"Reset: {result.observation.echoed_message}")
+     # Multiple steps with low latency
+     for msg in ["Hello", "World", "!"]:
+         result = env.step(CleanAction(message=msg))
+         print(f"Echoed: {result.observation.echoed_message}")
+ ```
+
+ The client uses WebSocket connections for:
+ - **Lower latency**: No per-request HTTP connection overhead
+ - **Persistent session**: The server maintains your environment state
+ - **Efficient episodes**: Better for many sequential steps
+
+ ### Concurrent WebSocket Sessions
+
+ The server supports multiple concurrent WebSocket connections. To enable this,
+ modify `server/app.py` to use factory mode:
+
+ ```python
+ # In server/app.py - use factory mode for concurrent sessions
+ app = create_app(
+     DataCleaningEnvironment,  # Pass the class, not an instance
+     CleanAction,
+     CleanObservation,
+     max_concurrent_envs=4,  # Allow 4 concurrent sessions
+ )
+ ```
+
+ Then multiple clients can connect simultaneously:
+
+ ```python
+ from concurrent.futures import ThreadPoolExecutor
+
+ from data_cleaning_env import CleanAction, DataCleaningEnv
+
+ def run_episode(client_id: int):
+     with DataCleaningEnv(base_url="http://localhost:8000") as env:
+         result = env.reset()
+         for i in range(10):
+             result = env.step(CleanAction(message=f"Client {client_id}, step {i}"))
+         return client_id, result.observation.message_length
+
+ # Run 4 episodes concurrently
+ with ThreadPoolExecutor(max_workers=4) as executor:
+     results = list(executor.map(run_episode, range(4)))
+ ```
+
+ ## Development & Testing
+
+ ### Direct Environment Testing
+
+ Test the environment logic directly without starting the HTTP server:
+
+ ```bash
+ # From the environment root
+ python3 server/data_cleaning_env_environment.py
+ ```
+
+ This verifies that:
+ - The environment resets correctly
+ - Step executes actions properly
+ - State tracking works
+ - Rewards are calculated correctly
+
+ ### Running Locally
+
+ Run the server locally for development:
+
+ ```bash
+ uvicorn server.app:app --reload
+ ```
+
+ ## Project Structure
+
+ ```
+ data_cleaning_env/
+ ├── .dockerignore                        # Docker build exclusions
+ ├── __init__.py                          # Module exports
+ ├── README.md                            # This file
+ ├── openenv.yaml                         # OpenEnv manifest
+ ├── pyproject.toml                       # Project metadata and dependencies
+ ├── uv.lock                              # Locked dependencies (generated)
+ ├── client.py                            # DataCleaningEnv client
+ ├── models.py                            # Action and Observation models
+ └── server/
+     ├── __init__.py                      # Server module exports
+     ├── data_cleaning_env_environment.py # Core environment logic
+     ├── app.py                           # FastAPI application (HTTP + WebSocket endpoints)
+     └── Dockerfile                       # Container image definition
+ ```
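The reward rule stated in the README above (`message_length × 0.1`) can be checked in isolation. `echo_reward` below is a hypothetical helper sketching that rule, not part of the committed code; the real calculation lives server-side:

```python
def echo_reward(message: str) -> float:
    # README rule: reward = message length × 0.1
    # (hypothetical helper; rounding to one decimal matches the README examples)
    return round(len(message) * 0.1, 1)

print(echo_reward("Hi"))             # 0.2
print(echo_reward("Hello, World!"))  # 1.3
print(echo_reward(""))               # 0.0
```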
__init__.py ADDED
@@ -0,0 +1,16 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Data Cleaning Env Environment."""
+
+ from .client import DataCleaningEnv
+ from .models import CleanAction, CleanObservation
+
+ __all__ = [
+     "CleanAction",
+     "CleanObservation",
+     "DataCleaningEnv",
+ ]
client.py ADDED
@@ -0,0 +1,319 @@
+ """
+ client.py
+ ---------
+ DataCleaningEnv — the typed WebSocket client for the data cleaning pipeline.
+
+ This module contains exactly one public class: ``DataCleaningEnv``.
+ It extends ``EnvClient`` from OpenEnv core and implements the three abstract
+ translation methods that bridge Python objects and the server's JSON wire format:
+
+     _step_payload(action)   CleanAction → dict (outbound)
+     _parse_result(payload)  dict → StepResult[CleanObservation] (inbound)
+     _parse_state(payload)   dict → CleanState (inbound)
+
+ Everything else — WebSocket lifecycle, connect/disconnect, async context
+ manager, the ``.sync()`` wrapper — is handled by the base class.
+
+ Usage (async)
+ -------------
+     import asyncio
+     from data_cleaning_env.client import DataCleaningEnv
+     from data_cleaning_env.models import CleanAction
+
+     async def main():
+         async with DataCleaningEnv(base_url="http://localhost:8000") as env:
+             result = await env.reset(task_id="easy")
+             print(result.observation.schema_hint)
+
+             result = await env.set_value(row_index=3, column="price", value="29.99")
+             print(result.reward, result.observation.current_score)
+
+             result = await env.done()
+
+     asyncio.run(main())
+
+ Usage (sync wrapper)
+ --------------------
+     env = DataCleaningEnv(base_url="http://localhost:8000").sync()
+     with env:
+         result = env.reset(task_id="medium")
+         result = env.fill_missing(column="amount", fill_strategy="median")
+         result = env.done()
+ """
+
+ from __future__ import annotations
+
+ from typing import Any, Optional
+
+ # ── OpenEnv core imports ──────────────────────────────────────────────────────
+ try:
+     from openenv.core.client_types import StepResult
+     from openenv.core.env_client import EnvClient
+ except ImportError:
+     from openenv.core.client_types import StepResult  # type: ignore[no-redef]
+     from openenv.core.env_client import EnvClient  # type: ignore[no-redef]
+
+ # ── Local model imports (try relative then absolute) ──────────────────────────
+ try:
+     from .models import (
+         CleanAction,
+         CleanObservation,
+         CleanState,
+         MAX_STEPS,
+         DONE_THRESHOLD,
+     )
+ except ImportError:
+     from models import (  # type: ignore[no-redef]
+         CleanAction,
+         CleanObservation,
+         CleanState,
+         MAX_STEPS,
+         DONE_THRESHOLD,
+     )
+
+
+ class DataCleaningEnv(EnvClient[CleanAction, CleanObservation, CleanState]):
+     """
+     Async WebSocket client for the Data Cleaning Pipeline environment.
+
+     Connects to a running ``DataCleaningEnvironment`` server and exposes the
+     standard OpenEnv interface (``reset``, ``step``, ``state``) plus typed
+     convenience helpers for each command.
+
+     All methods are async. For synchronous use, call ``.sync()`` to get a
+     ``SyncEnvClient`` wrapper:
+
+         with DataCleaningEnv(base_url="http://localhost:8000").sync() as env:
+             result = env.reset(task_id="easy")
+             result = env.set_value(row_index=0, column="price", value="9.99")
+
+     Connecting to different backends
+     --------------------------------
+     Local dev server (after ``openenv serve``):
+         env = DataCleaningEnv(base_url="http://localhost:8000")
+
+     Local Docker image (after ``openenv build``):
+         env = await DataCleaningEnv.from_docker_image("data-cleaning-env:latest")
+
+     Hugging Face Space (after ``openenv push``):
+         env = await DataCleaningEnv.from_env("your-org/data-cleaning-env")
+     """
+
+     # ─────────────────────────────────────────────────────────────────────────
+     # Abstract method implementations — the three translation methods
+     # ─────────────────────────────────────────────────────────────────────────
+
+     def _step_payload(self, action: CleanAction) -> dict[str, Any]:
+         """
+         Serialise a CleanAction to the JSON dict the server expects.
+
+         The server's ``step()`` endpoint receives this dict, validates it
+         against ``CleanAction``, and dispatches to the correct handler.
+
+         We use ``model_dump(exclude_none=True)`` to omit fields the agent
+         left as ``None`` — this keeps the wire message minimal and avoids
+         triggering Pydantic's ``extra="forbid"`` validator on the server side
+         for fields that weren't set.
+         """
+         return action.model_dump(exclude_none=True)
+
+     def _parse_result(self, payload: dict[str, Any]) -> StepResult[CleanObservation]:
+         """
+         Parse the server's step/reset response into a ``StepResult``.
+
+         Wire format (what the server sends back)::
+
+             {
+                 "observation": {
+                     "done": false,
+                     "reward": -0.005,
+                     "metadata": {},
+                     "task_id": "easy",
+                     "schema_hint": "Sales orders...",
+                     "initial_dirty_cells": 29,
+                     "dirty_csv": "row_index,order_id,...\\n0,1001,...",
+                     "current_score": 0.9550,
+                     "issues_remaining": 18,
+                     "step_number": 1,
+                     "max_steps": 40,
+                     "last_action_success": true,
+                     "last_action_error": null
+                 },
+                 "reward": -0.005,
+                 "done": false
+             }
+
+         Note: ``reward`` and ``done`` appear both at the top level (for
+         convenience) and inside ``observation`` (because the ``Observation``
+         base carries them). We use the top-level copies for ``StepResult`` so
+         the caller doesn't have to dig into the observation.
+         """
+         obs_data = payload.get("observation", {})
+
+         observation = CleanObservation(
+             # ── inherited from Observation base ──────────────────────────────
+             done=payload.get("done", obs_data.get("done", False)),
+             reward=payload.get("reward", obs_data.get("reward")),
+             metadata=obs_data.get("metadata", {}),
+             # ── task context (constant for the episode) ──────────────────────
+             task_id=obs_data["task_id"],
+             schema_hint=obs_data["schema_hint"],
+             initial_dirty_cells=obs_data["initial_dirty_cells"],
+             # ── per-step state ───────────────────────────────────────────────
+             dirty_csv=obs_data["dirty_csv"],
+             current_score=obs_data.get("current_score", 0.0),
+             issues_remaining=obs_data.get("issues_remaining", 0),
+             step_number=obs_data.get("step_number", 0),
+             max_steps=obs_data["max_steps"],
+             # ── last-action feedback ─────────────────────────────────────────
+             last_action_success=obs_data.get("last_action_success", True),
+             last_action_error=obs_data.get("last_action_error"),
+         )
+
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: dict[str, Any]) -> CleanState:
+         """
+         Parse the server's state response into a ``CleanState``.
+
+         The server serialises ``CleanState`` via Pydantic's ``model_dump()``,
+         so the wire keys match our field names exactly. We use ``.get()``
+         with sensible defaults everywhere so a partially-initialised state
+         (e.g. before the first reset) doesn't crash the client.
+         """
+         return CleanState(
+             # ── inherited from State base ────────────────────────────────────
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+             # ── task identity ────────────────────────────────────────────────
+             task_id=payload.get("task_id", "easy"),
+             # ── DataFrame snapshots ──────────────────────────────────────────
+             dirty_csv_snapshot=payload.get("dirty_csv_snapshot", ""),
+             clean_csv_snapshot=payload.get("clean_csv_snapshot", ""),
+             # ── scoring ──────────────────────────────────────────────────────
+             initial_dirty_cells=payload.get("initial_dirty_cells", 0),
+             current_score=payload.get("current_score", 0.0),
+             previous_score=payload.get("previous_score", 0.0),
+             # ── grader metadata ──────────────────────────────────────────────
+             task_metadata=payload.get("task_metadata", {}),
+             # ── schema ───────────────────────────────────────────────────────
+             schema_hint=payload.get("schema_hint", ""),
+             # ── step budget ──────────────────────────────────────────────────
+             max_steps=payload.get("max_steps", 40),
+         )
+
+     # ─────────────────────────────────────────────────────────────────────────
+     # Typed convenience helpers — one per CleanAction command
+     # ─────────────────────────────────────────────────────────────────────────
+     # These methods exist purely for ergonomics: they let callers write
+     #
+     #     await env.set_value(row_index=3, column="price", value="29.99")
+     #
+     # instead of the more verbose:
+     #
+     #     await env.step(CleanAction(
+     #         command="SET_VALUE", row_index=3, column="price", value="29.99"
+     #     ))
+     #
+     # The baseline inference script can use either form.
+
+     async def set_value(
+         self,
+         row_index: int,
+         column: str,
+         value: str,
+     ) -> StepResult[CleanObservation]:
+         """Fix a single cell. ``value`` is always passed as a string; the
+         server casts it to the column's target dtype automatically."""
+         return await self.step(
+             CleanAction(
+                 command="SET_VALUE",
+                 row_index=row_index,
+                 column=column,
+                 value=value,
+             )
+         )
+
+     async def drop_row(self, row_index: int) -> StepResult[CleanObservation]:
+         """Remove an entire row (e.g. a true outlier in the medium task)."""
+         return await self.step(
+             CleanAction(command="DROP_ROW", row_index=row_index)
+         )
+
+     async def standardize_col(self, column: str) -> StepResult[CleanObservation]:
+         """Normalise a whole column's format.
+
+         The server auto-detects what to do:
+         - Date columns → parse any format, reformat as ``YYYY-MM-DD``
+         - Numeric columns → coerce to float/int, drop unit strings
+         - String columns → strip leading/trailing whitespace
+         """
+         return await self.step(
+             CleanAction(command="STANDARDIZE_COL", column=column)
+         )
+
+     async def fill_missing(
+         self,
+         column: str,
+         fill_strategy: str,
+     ) -> StepResult[CleanObservation]:
+         """Fill ``NaN`` values in ``column``.
+
+         Args:
+             column: Column name to fill.
+             fill_strategy: One of ``"mean"``, ``"median"``, ``"mode"``, ``"drop"``.
+                 ``"drop"`` removes rows where the column is ``NaN``.
+         """
+         return await self.step(
+             CleanAction(
+                 command="FILL_MISSING",
+                 column=column,
+                 fill_strategy=fill_strategy,
+             )
+         )
+
+     async def done(self) -> StepResult[CleanObservation]:
+         """Signal that the agent believes the CSV is clean.
+
+         This ends the episode immediately. If the current score is below
+         ``EARLY_DONE_THRESHOLD`` (0.60) a penalty of -0.20 is applied.
+         """
+         return await self.step(CleanAction(command="DONE"))
+
+     # ─────────────────────────────────────────────────────────────────────────
+     # Introspection helpers
+     # ─────────────────────────────────────────────────────────────────────────
+
+     async def current_score(self) -> float:
+         """Return the grader score from the last step (0.0–1.0)."""
+         st = await self.state()
+         return st.current_score
+
+     async def task_id(self) -> str:
+         """Return the active task ID (``"easy"``, ``"medium"``, or ``"hard"``)."""
+         st = await self.state()
+         return st.task_id
+
+     async def steps_remaining(self) -> int:
+         """Return the number of steps left before forced termination."""
+         st = await self.state()
+         return max(0, st.max_steps - st.step_count)
+
+     async def is_solved(self) -> bool:
+         """Return ``True`` if the current score meets the task's done threshold."""
+         st = await self.state()
+         threshold = DONE_THRESHOLD.get(st.task_id, 0.95)
+         return st.current_score >= threshold
dataset_factory.py ADDED
@@ -0,0 +1,550 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ dataset_factory.py
3
+ ------------------
4
+ Generates (dirty_df, clean_df, metadata) triples for all 3 tasks.
5
+
6
+ Key design decisions:
7
+ - Fixed random seeds per task → reproducible grader scores
8
+ - clean_df is ALWAYS generated first, then dirt is injected
9
+ - metadata carries ground-truth info the grader needs (e.g. which
10
+ rows are real outliers vs valid extremes in Task 2)
11
+ - No external files needed — everything is generated in memory
12
+ """
13
+
14
+ from __future__ import annotations
15
+
16
+ import copy
17
+ import random
18
+ import string
19
+ from dataclasses import dataclass, field
20
+ from typing import Any
21
+
22
+ import numpy as np
23
+ import pandas as pd
24
+
25
+ # ── Reproducible seeds ────────────────────────────────────────────────────────
26
+
27
+ SEEDS = {
28
+ "easy": 42,
29
+ "medium": 137,
30
+ "hard": 999,
31
+ }
32
+
33
+ # ── Return type ───────────────────────────────────────────────────────────────
34
+
35
+ @dataclass
36
+ class TaskDataset:
37
+ """Everything the environment and grader need for one episode."""
38
+ task_id: str
39
+ dirty_df: pd.DataFrame
40
+ clean_df: pd.DataFrame
41
+ schema_hint: str # plain-English schema description
42
+ total_dirty_cells: int # how many cells differ at episode start
43
+ metadata: dict[str, Any] = field(default_factory=dict)
44
+ # metadata keys used by graders:
45
+ # "outlier_rows" (Task 2) — list of row indices that ARE true outliers
46
+ # "valid_extreme_rows" (Task 2) — valid rows that look extreme but must stay
47
+ # "canonical_columns" (Task 3) — {alias: canonical_name} mapping
48
+ # "duplicate_row_ids" (Task 3) — list of (original_idx, duplicate_idx) pairs
49
+
50
+
51
+ # ── Public API ────────────────────────────────────────────────────────────────
52
+
53
+ def make_dataset(task_id: str) -> TaskDataset:
54
+ """Entry point. Call this from the environment's reset()."""
55
+ if task_id == "easy":
56
+ return _make_easy()
57
+ elif task_id == "medium":
58
+ return _make_medium()
59
+ elif task_id == "hard":
60
+ return _make_hard()
61
+ else:
62
+ raise ValueError(f"Unknown task_id: {task_id!r}. Must be easy/medium/hard.")
63
+
64
+
65
+ def count_dirty_cells(dirty_df: pd.DataFrame, clean_df: pd.DataFrame) -> int:
66
+ """Number of cells that differ between dirty and clean DataFrames."""
67
+ # Align on same dtypes for comparison
68
+ d = dirty_df.astype(str).reset_index(drop=True)
69
+ c = clean_df.astype(str).reset_index(drop=True)
70
+ return int((d != c).sum().sum())
71
+
72
+
73
+ # ── Task 1: easy ─────────────────────────────────────────────────────────────
74
+ #
75
+ # 50-row sales CSV.
76
+ # Clean schema:
77
+ # order_id (int), customer (str), product (str), category (str),
78
 + #   price (float, 2dp), quantity (int), order_date (YYYY-MM-DD),
 + #   region (str)
 + #
 + # Injected issues (29 dirty cells total):
 + #   • 10 wrong-type cells — numeric column contains a word
 + #   • 8 missing values — NaN in various columns
 + #   • 5 bad dates — future year (2099-xx-xx)
 + #   • 6 whitespace cells — leading/trailing spaces in string columns
 +
 + def _make_easy() -> TaskDataset:
 +     rng = random.Random(SEEDS["easy"])
 +     np_rng = np.random.default_rng(SEEDS["easy"])
 +
 +     n = 50
 +     categories = ["Electronics", "Clothing", "Home", "Sports", "Books"]
 +     regions = ["North", "South", "East", "West"]
 +     products = ["Widget A", "Widget B", "Gadget X", "Gadget Y", "Item Z"]
 +     customers = [f"Customer_{i:03d}" for i in range(1, 31)]
 +
 +     # ── Build clean DataFrame ────────────────────────────────────────────────
 +     clean = pd.DataFrame({
 +         "order_id": range(1001, 1001 + n),
 +         "customer": [rng.choice(customers) for _ in range(n)],
 +         "product": [rng.choice(products) for _ in range(n)],
 +         "category": [rng.choice(categories) for _ in range(n)],
 +         "price": np_rng.uniform(5.0, 500.0, n).round(2),
 +         "quantity": np_rng.integers(1, 20, n),
 +         "order_date": _random_dates(np_rng, n, "2023-01-01", "2024-06-30"),
 +         "region": [rng.choice(regions) for _ in range(n)],
 +     })
 +     clean["price"] = clean["price"].astype(float)
 +     clean["quantity"] = clean["quantity"].astype(int)
 +
 +     # ── Inject dirt ──────────────────────────────────────────────────────────
 +     dirty = clean.copy(deep=True).astype(object)
 +
 +     injected: set[tuple[int, str]] = set()
 +
 +     def pick_fresh(col: str, exclude: set) -> int:
 +         rows = [r for r in range(n) if (r, col) not in exclude]
 +         return rng.choice(rows)
 +
 +     # 10 wrong-type cells in numeric columns
 +     bad_words = ["N/A", "unknown", "missing", "null", "TBD", "??", "-", "n/a", "none", "—"]
 +     for word, col in zip(bad_words, rng.choices(["price", "quantity"], k=10)):
 +         row = pick_fresh(col, injected)
 +         dirty.at[row, col] = word
 +         injected.add((row, col))
 +
 +     # 8 missing values in various columns
 +     missing_cols = rng.choices(["customer", "product", "price", "quantity", "region"], k=8)
 +     for col in missing_cols:
 +         row = pick_fresh(col, injected)
 +         dirty.at[row, col] = np.nan
 +         injected.add((row, col))
 +
 +     # 5 bad dates — far-future year
 +     bad_date_templates = [
 +         "2099-01-15", "2099-07-04", "2099-12-31", "2099-03-22", "2099-11-11"
 +     ]
 +     for bad_date in bad_date_templates:
 +         row = pick_fresh("order_date", injected)
 +         dirty.at[row, "order_date"] = bad_date
 +         injected.add((row, "order_date"))
 +
 +     # 6 whitespace cells in string columns
 +     ws_cols = rng.choices(["customer", "product", "category", "region"], k=6)
 +     for col in ws_cols:
 +         row = pick_fresh(col, injected)
 +         orig = str(dirty.at[row, col])
 +         dirty.at[row, col] = f" {orig} "
 +         injected.add((row, col))
 +
 +     dirty_cell_count = count_dirty_cells(dirty.astype(str), clean.astype(str))
 +
 +     schema_hint = (
 +         "Sales orders dataset. Expected columns: "
 +         "order_id (integer), customer (string, no leading/trailing spaces), "
 +         "product (string, no spaces), category (one of: Electronics/Clothing/Home/Sports/Books), "
 +         "price (float, 2 decimal places, no text), "
 +         "quantity (integer, no text), "
 +         "order_date (YYYY-MM-DD format, year must be 2023 or 2024), "
 +         "region (one of: North/South/East/West, no spaces). "
 +         "No missing values allowed."
 +     )
 +
 +     return TaskDataset(
 +         task_id="easy",
 +         dirty_df=dirty,
 +         clean_df=clean.astype(object),
 +         schema_hint=schema_hint,
 +         total_dirty_cells=dirty_cell_count,
 +         metadata={"injected_cells": list(injected)},
 +     )
 +
 +
 + # ── Task 2: medium ────────────────────────────────────────────────────────────
 + #
 + # 200-row customer transaction CSV.
 + # Clean schema:
 + #   tx_id (int), customer_id (int), amount (float), tx_date (YYYY-MM-DD),
 + #   category (str), country (str), status (str)
 + #
 + # Injected issues:
 + #   • 15 statistical outliers — amount Z-score > 4.0 (should be removed/capped)
 + #   • 5 valid extremes — genuinely large transactions, must NOT be removed
 + #   • 12 category typos — slight misspellings
 +
 + def _make_medium() -> TaskDataset:
 +     rng = random.Random(SEEDS["medium"])
 +     np_rng = np.random.default_rng(SEEDS["medium"])
 +
 +     n = 200
 +     categories = ["Food", "Electronics", "Travel", "Healthcare", "Entertainment"]
 +     countries = ["US", "UK", "CA", "AU", "DE"]
 +     statuses = ["completed", "pending", "refunded"]
 +
 +     # ── Build clean base ────────────────────────────────────────────────────
 +     # Normal transaction amounts: mean $150, sd $60, clipped to [5, 800]
 +     amounts = np_rng.normal(150, 60, n).clip(5, 800).round(2)
 +
 +     clean = pd.DataFrame({
 +         "tx_id": range(9001, 9001 + n),
 +         "customer_id": np_rng.integers(1, 501, n),
 +         "amount": amounts,
 +         "tx_date": _random_dates(np_rng, n, "2023-01-01", "2024-06-30"),
 +         "category": [rng.choice(categories) for _ in range(n)],
 +         "country": [rng.choice(countries) for _ in range(n)],
 +         "status": [rng.choice(statuses) for _ in range(n)],
 +     })
 +
 +     # ── Choose outlier rows (15) — will be injected with extreme amounts ─────
 +     all_rows = list(range(n))
 +     outlier_rows: list[int] = rng.sample(all_rows, 15)
 +     remaining = [r for r in all_rows if r not in outlier_rows]
 +
 +     # ── Choose valid extreme rows (5) — large but legitimate ─────────────────
 +     # These are NOT in outlier_rows; amounts are large (Z > 3) but real
 +     valid_extreme_rows: list[int] = rng.sample(remaining, 5)
 +
 +     # ── Build dirty DataFrame ────────────────────────────────────────────────
 +     dirty = clean.copy(deep=True).astype(object)
 +
 +     # Inject true outliers: very high or very low (Z > 4)
 +     for row in outlier_rows:
 +         if rng.random() > 0.3:
 +             dirty.at[row, "amount"] = round(rng.uniform(5000, 15000), 2)  # extreme high
 +         else:
 +             dirty.at[row, "amount"] = round(rng.uniform(-500, -10), 2)  # negative (impossible)
 +
 +     # Inject valid extremes (in clean AND dirty — they stay)
 +     for row in valid_extreme_rows:
 +         valid_large = round(rng.uniform(900, 2000), 2)
 +         clean.at[row, "amount"] = valid_large
 +         dirty.at[row, "amount"] = valid_large
 +
 +     # Inject 12 category typos
 +     typo_map: dict[str, list[str]] = {
 +         "Electronics": ["Electrnics", "Electronis", "Electonics"],
 +         "Food": ["Foood", "Fod", "Fo0d"],
 +         "Travel": ["Travle", "Trevel", "Travell"],
 +         "Healthcare": ["Helthcare", "Healtcare", "Heathcare"],
 +         "Entertainment": ["Entertainmnt", "Entertainmet", "Entertainmen"],
 +     }
 +     injected_typo_rows: set[int] = set()
 +     typo_count = 0
 +     typo_cells: list[tuple[int, str, str]] = []  # (row, dirty_val, clean_val)
 +
 +     for row in rng.sample(remaining, min(12, len(remaining))):
 +         if typo_count >= 12:
 +             break
 +         if row in injected_typo_rows:
 +             continue
 +         orig_cat = str(clean.at[row, "category"])
 +         misspellings = typo_map.get(orig_cat)
 +         if misspellings:
 +             bad = rng.choice(misspellings)
 +             dirty.at[row, "category"] = bad
 +             typo_cells.append((row, bad, orig_cat))
 +             injected_typo_rows.add(row)
 +             typo_count += 1
 +
 +     dirty_cell_count = count_dirty_cells(dirty.astype(str), clean.astype(str))
 +
 +     schema_hint = (
 +         "Customer transactions dataset. Expected columns: "
 +         "tx_id (integer), customer_id (integer 1–500), "
 +         "amount (float, must be positive; realistic range is $5–$2000; "
 +         "amounts above $2000 or below $0 are data errors), "
 +         "tx_date (YYYY-MM-DD), "
 +         "category (one of: Food/Electronics/Travel/Healthcare/Entertainment — exact spelling), "
 +         "country (two-letter code: US/UK/CA/AU/DE), "
 +         "status (one of: completed/pending/refunded). "
 +         "Note: some large transactions ($900–$2000) are legitimate — do not remove them. "
 +         "Only remove rows where the amount is clearly erroneous (negative or > $2000)."
 +     )
 +
 +     return TaskDataset(
 +         task_id="medium",
 +         dirty_df=dirty,
 +         clean_df=clean.astype(object),
 +         schema_hint=schema_hint,
 +         total_dirty_cells=dirty_cell_count,
 +         metadata={
 +             "outlier_rows": outlier_rows,
 +             "valid_extreme_rows": valid_extreme_rows,
 +             "typo_cells": typo_cells,  # [(row, dirty_val, clean_val)]
 +         },
 +     )
 +
 +
 + # ── Task 3: hard ──────────────────────────────────────────────────────────────
 + #
 + # 400-row CSV merged from 3 fictional data sources.
 + # Each source uses different column names for the same concepts.
 + # Issues:
 + #   • Inconsistent column naming (3 aliases per concept)
 + #   • Mixed date formats across sources (ISO, US, EU)
 + #   • 30 duplicate rows (exact and near-duplicate)
 + #   • No schema documentation — agent must infer canonical form
 + #
 + # Canonical schema (what the agent must produce):
 + #   record_id, customer_id, full_name, email, amount,
 + #   currency, purchase_date (YYYY-MM-DD), product_name, region
 +
 + _CANONICAL_COLS = [
 +     "record_id", "customer_id", "full_name", "email",
 +     "amount", "currency", "purchase_date", "product_name", "region",
 + ]
 +
 + # Column aliases per source
 + _SOURCE_ALIASES = {
 +     "source_a": {
 +         "record_id": "record_id",
 +         "customer_id": "cust_id",
 +         "full_name": "name",
 +         "email": "email_address",
 +         "amount": "sale_amount",
 +         "currency": "ccy",
 +         "purchase_date": "date",
 +         "product_name": "item",
 +         "region": "territory",
 +     },
 +     "source_b": {
 +         "record_id": "id",
 +         "customer_id": "customer_id",
 +         "full_name": "full_name",
 +         "email": "contact_email",
 +         "amount": "value",
 +         "currency": "currency",
 +         "purchase_date": "purchase_date",
 +         "product_name": "product",
 +         "region": "area",
 +     },
 +     "source_c": {
 +         "record_id": "RecordID",
 +         "customer_id": "CustomerID",
 +         "full_name": "CustomerName",
 +         "email": "Email",
 +         "amount": "Amount",
 +         "currency": "Currency",
 +         "purchase_date": "PurchaseDate",
 +         "product_name": "ProductName",
 +         "region": "Region",
 +     },
 + }
 +
 + # Date format used by each source
 + _SOURCE_DATE_FORMATS = {
 +     "source_a": "%Y-%m-%d",  # ISO: 2023-04-15
 +     "source_b": "%m/%d/%Y",  # US: 04/15/2023
 +     "source_c": "%d.%m.%Y",  # EU: 15.04.2023
 + }
 +
 + def _make_hard() -> TaskDataset:
 +     rng = random.Random(SEEDS["hard"])
 +     np_rng = np.random.default_rng(SEEDS["hard"])
 +
 +     currencies = ["USD", "EUR", "GBP"]
 +     regions = ["APAC", "EMEA", "AMER", "LATAM"]
 +     products = [
 +         "Pro Subscription", "Enterprise License", "Support Package",
 +         "Training Course", "Hardware Bundle", "Consulting Day",
 +     ]
 +
 +     # Helper: generate a block of rows for one source
 +     def _source_block(source: str, n: int, id_start: int) -> pd.DataFrame:
 +         aliases = _SOURCE_ALIASES[source]
 +         date_fmt = _SOURCE_DATE_FORMATS[source]
 +         cust_ids = np_rng.integers(2001, 3001, n)
 +         amounts = np_rng.uniform(100, 5000, n).round(2)
 +         iso_dates = _random_dates(np_rng, n, "2022-01-01", "2024-06-30")
 +
 +         # Format dates in source-specific format
 +         formatted_dates = [
 +             pd.to_datetime(d).strftime(date_fmt)
 +             for d in iso_dates
 +         ]
 +
 +         names = [_random_name(rng) for _ in range(n)]
 +         emails = [_name_to_email(nm) for nm in names]
 +
 +         data = {
 +             aliases["record_id"]: range(id_start, id_start + n),
 +             aliases["customer_id"]: cust_ids.tolist(),
 +             aliases["full_name"]: names,
 +             aliases["email"]: emails,
 +             aliases["amount"]: amounts.tolist(),
 +             aliases["currency"]: [rng.choice(currencies) for _ in range(n)],
 +             aliases["purchase_date"]: formatted_dates,
 +             aliases["product_name"]: [rng.choice(products) for _ in range(n)],
 +             aliases["region"]: [rng.choice(regions) for _ in range(n)],
 +         }
 +         return pd.DataFrame(data)
 +
 +     # Three sources, ~133 rows each (400 total)
 +     block_a = _source_block("source_a", 134, id_start=1)
 +     block_b = _source_block("source_b", 133, id_start=135)
 +     block_c = _source_block("source_c", 133, id_start=268)
 +
 +     # ── Canonical (clean) dataframe ─────────────────────────────────────────
 +     def _to_canonical(df: pd.DataFrame, source: str) -> pd.DataFrame:
 +         rev = {v: k for k, v in _SOURCE_ALIASES[source].items()}
 +         renamed = df.rename(columns=rev)
 +         # Normalise date to YYYY-MM-DD
 +         renamed["purchase_date"] = pd.to_datetime(
 +             renamed["purchase_date"],
 +             format=_SOURCE_DATE_FORMATS[source],
 +         ).dt.strftime("%Y-%m-%d")
 +         return renamed[_CANONICAL_COLS]
 +
 +     clean_a = _to_canonical(block_a, "source_a")
 +     clean_b = _to_canonical(block_b, "source_b")
 +     clean_c = _to_canonical(block_c, "source_c")
 +     clean = pd.concat([clean_a, clean_b, clean_c], ignore_index=True)
 +     clean["record_id"] = range(1, len(clean) + 1)
 +
 +     # ── Dirty dataframe = concat of raw source blocks ────────────────────────
 +     # (columns are still in aliased form, dates in source-specific format)
 +     dirty = pd.concat([block_a, block_b, block_c], ignore_index=True)
 +
 +     # ── Inject 30 duplicate rows ─────────────────────────────────────────────
 +     n_clean = len(dirty)
 +     sampled_orig = rng.sample(range(n_clean), 30)
 +     duplicate_rows_to_inject: list[pd.DataFrame] = []
 +     duplicate_pairs: list[tuple[int, int]] = []
 +
 +     for orig_idx in sampled_orig:
 +         dup = dirty.iloc[[orig_idx]].copy()
 +         # Near-duplicate: 40% chance of a minor field change
 +         if rng.random() < 0.4:
 +             # Slightly alter the amount (±1%). After the concat each row has
 +             # values only in its own source's columns (NaN elsewhere), so
 +             # modify whichever aliased amount column is non-null.
 +             for amt_col in ["sale_amount", "value", "Amount"]:
 +                 if amt_col in dup.columns and pd.notna(dup.iloc[0].get(amt_col)):
 +                     old_val = dup.at[dup.index[0], amt_col]
 +                     dup.at[dup.index[0], amt_col] = round(float(old_val) * rng.uniform(0.99, 1.01), 2)
 +                     break
 +         duplicate_rows_to_inject.append(dup)
 +         duplicate_pairs.append((orig_idx, n_clean + len(duplicate_pairs)))
 +
 +     dirty = pd.concat([dirty] + duplicate_rows_to_inject, ignore_index=True)
 +
 +     # Shuffle so duplicates aren't obviously at the bottom
 +     dirty = dirty.sample(frac=1, random_state=SEEDS["hard"]).reset_index(drop=True)
 +
 +     # Build canonical alias lookup for grader
 +     canonical_lookup: dict[str, str] = {}
 +     for source, aliases in _SOURCE_ALIASES.items():
 +         for canonical, alias in aliases.items():
 +             canonical_lookup[alias] = canonical
 +
 +     dirty_cell_count = len(dirty) * len(_CANONICAL_COLS)  # hard task: whole-df scope
 +
 +     schema_hint = (
 +         "Merged dataset from 3 sources with inconsistent schemas. "
 +         "Your goal is to produce a single clean DataFrame with these canonical columns: "
 +         "record_id (integer, unique), customer_id (integer), full_name (string), "
 +         "email (string), amount (float), currency (one of: USD/EUR/GBP), "
 +         "purchase_date (YYYY-MM-DD), product_name (string), region (one of: APAC/EMEA/AMER/LATAM). "
 +         "Column names in the raw data vary by source (e.g. 'cust_id', 'customer_id', 'CustomerID' "
 +         "all mean customer_id). Date formats also vary (ISO, US MM/DD/YYYY, EU DD.MM.YYYY). "
 +         "There are also ~30 duplicate rows (some exact, some near-duplicate). "
 +         "Remove duplicates, and normalise all column names and date formats."
 +     )
 +
 +     return TaskDataset(
 +         task_id="hard",
 +         dirty_df=dirty,
 +         clean_df=clean.astype(object),
 +         schema_hint=schema_hint,
 +         total_dirty_cells=dirty_cell_count,
 +         metadata={
 +             "canonical_columns": _CANONICAL_COLS,
 +             "canonical_lookup": canonical_lookup,  # alias → canonical name
 +             "source_aliases": _SOURCE_ALIASES,
 +             "source_date_formats": _SOURCE_DATE_FORMATS,
 +             "duplicate_pairs": duplicate_pairs,  # (original_idx, dup_idx) in pre-shuffle dirty
 +             "n_clean_rows": len(clean),
 +         },
 +     )
 +
 +
 + # ── Internal helpers ──────────────────────────────────────────────────────────
 +
 + def _random_dates(
 +     rng: np.random.Generator,
 +     n: int,
 +     start: str,
 +     end: str,
 + ) -> list[str]:
 +     """Generate n random ISO-format date strings between start and end."""
 +     start_ts = pd.Timestamp(start)
 +     end_ts = pd.Timestamp(end)
 +     delta_days = (end_ts - start_ts).days
 +     offsets = rng.integers(0, delta_days, n)
 +     return [
 +         (start_ts + pd.Timedelta(days=int(d))).strftime("%Y-%m-%d")
 +         for d in offsets
 +     ]
 +
 +
 + _FIRST_NAMES = [
 +     "Alice", "Bob", "Carol", "David", "Eva", "Frank", "Grace", "Henry",
 +     "Iris", "Jack", "Karen", "Leo", "Mia", "Nathan", "Olivia", "Paul",
 +     "Quinn", "Rosa", "Sam", "Tara", "Uma", "Victor", "Wendy", "Xavier",
 +     "Yuki", "Zara",
 + ]
 +
 + _LAST_NAMES = [
 +     "Smith", "Jones", "Williams", "Brown", "Taylor", "Davies", "Evans",
 +     "Wilson", "Thomas", "Roberts", "Johnson", "Lee", "Martin", "Garcia",
 +     "Martinez", "Anderson", "Thompson", "White", "Harris", "Clark",
 + ]
 +
 +
 + def _random_name(rng: random.Random) -> str:
 +     return f"{rng.choice(_FIRST_NAMES)} {rng.choice(_LAST_NAMES)}"
 +
 +
 + def _name_to_email(name: str) -> str:
 +     first, last = name.lower().split()
 +     domains = ["example.com", "mail.com", "inbox.net", "corp.io"]
 +     # Built-in hash() is salted per interpreter run, which would break the
 +     # seeded reproducibility of the datasets; use a stable digest instead.
 +     idx = sum(ord(c) for c in name) % len(domains)
 +     return f"{first}.{last}@{domains[idx]}"
 +
 +
 + # ── Smoke test ────────────────────────────────────────────────────────────────
 +
 + if __name__ == "__main__":
 +     for task_id in ("easy", "medium", "hard"):
 +         ds = make_dataset(task_id)
 +         print(f"\n{'─'*60}")
 +         print(f"Task: {task_id.upper()}")
 +         print(f"  dirty shape  : {ds.dirty_df.shape}")
 +         print(f"  clean shape  : {ds.clean_df.shape}")
 +         print(f"  dirty cells  : {ds.total_dirty_cells}")
 +         print(f"  schema hint  : {ds.schema_hint[:80]}…")
 +         print(f"  metadata keys: {list(ds.metadata.keys())}")
 +         if task_id == "easy":
 +             print("\n  Sample dirty rows (price/quantity cols):")
 +             mask = ds.dirty_df["price"].astype(str).str.contains(
 +                 r"[a-zA-Z]|nan", na=True
 +             )
 +             print(ds.dirty_df[mask][["order_id", "price", "quantity"]].head(3).to_string(index=False))
 +         if task_id == "medium":
 +             print(f"\n  Outlier rows (first 5): {ds.metadata['outlier_rows'][:5]}")
 +             print(f"  Valid extreme rows: {ds.metadata['valid_extreme_rows']}")
 +         if task_id == "hard":
 +             print(f"\n  Raw column names: {list(ds.dirty_df.columns)}")
 +             print(f"  Duplicate pairs (first 3): {ds.metadata['duplicate_pairs'][:3]}")
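The hard task mixes three per-source date formats (ISO, US, EU) that must end up as canonical `YYYY-MM-DD` strings. A minimal standalone sketch of that normalisation, assuming only pandas; the sample values are hypothetical:

```python
import pandas as pd

# One raw date per fictional source, in that source's native format
raw = {
    "source_a": "2023-04-15",  # ISO  %Y-%m-%d
    "source_b": "04/15/2023",  # US   %m/%d/%Y
    "source_c": "15.04.2023",  # EU   %d.%m.%Y
}
fmts = {"source_a": "%Y-%m-%d", "source_b": "%m/%d/%Y", "source_c": "%d.%m.%Y"}

# Parse with the known per-source format, then re-emit as canonical ISO
canonical = {
    src: pd.to_datetime(val, format=fmts[src]).strftime("%Y-%m-%d")
    for src, val in raw.items()
}
print(canonical)
```

Passing an explicit `format=` avoids the day-first/month-first ambiguity that free-form parsing would hit on strings like `04/05/2023`.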
graders.py ADDED
@@ -0,0 +1,686 @@
 + """
 + graders.py
 + ----------
 + Deterministic graders for all three tasks.
 +
 + Each grader receives the agent's current working DataFrame and the
 + TaskDataset produced by dataset_factory, and returns a GradeResult
 + with a scalar score in [0.0, 1.0] plus a human-readable breakdown.
 +
 + Public API
 + ----------
 + grade(task_id, agent_df, clean_df, metadata, initial_dirty_cells) -> GradeResult
 +
 +     Dispatches to the correct grader. Call this from step().
 +
 + GradeResult
 +     .score             float 0.0–1.0 (the number that feeds the reward)
 +     .breakdown         dict (sub-scores, useful for logging/debugging)
 +     .issues_remaining  int (how many cells still need fixing)
 +     .detail            str (one-line human summary)
 + """
 +
 + from __future__ import annotations
 +
 + import re
 + from dataclasses import dataclass, field
 + from typing import Any, Dict, List, Optional, Tuple
 +
 + import numpy as np
 + import pandas as pd
 +
 +
 + # ─────────────────────────────────────────────────────────────────────────────
 + # Return type
 + # ─────────────────────────────────────────────────────────────────────────────
 +
 + @dataclass
 + class GradeResult:
 +     score: float  # 0.0 – 1.0, fed into reward
 +     breakdown: Dict[str, float] = field(default_factory=dict)
 +     issues_remaining: int = 0
 +     detail: str = ""
 +
 +     def __post_init__(self) -> None:
 +         self.score = round(float(np.clip(self.score, 0.0, 1.0)), 4)
 +
 +
 + # ─────────────────────────────────────────────────────────────────────────────
 + # Public dispatcher
 + # ─────────────────────────────────────────────────────────────────────────────
 +
 + def grade(
 +     task_id: str,
 +     agent_df: pd.DataFrame,
 +     clean_df: pd.DataFrame,
 +     metadata: Dict[str, Any],
 +     initial_dirty_cells: int,
 + ) -> GradeResult:
 +     """
 +     Route to the correct grader and return a GradeResult.
 +
 +     Parameters
 +     ----------
 +     task_id
 +         One of "easy", "medium", "hard".
 +     agent_df
 +         The agent's current working DataFrame (may still be dirty).
 +     clean_df
 +         Ground-truth clean DataFrame from TaskDataset.
 +     metadata
 +         TaskDataset.metadata dict (grader-specific ground truth).
 +     initial_dirty_cells
 +         Dirty cell count at episode start; used to compute issues_remaining
 +         for easy/medium tasks.
 +     """
 +     if agent_df is None or len(agent_df) == 0:
 +         return GradeResult(score=0.0, detail="Empty DataFrame — no score.")
 +
 +     if task_id == "easy":
 +         return _grade_easy(agent_df, clean_df, metadata, initial_dirty_cells)
 +     elif task_id == "medium":
 +         return _grade_medium(agent_df, clean_df, metadata, initial_dirty_cells)
 +     elif task_id == "hard":
 +         return _grade_hard(agent_df, clean_df, metadata)
 +     else:
 +         raise ValueError(f"Unknown task_id: {task_id!r}")
 +
 +
 + # ─────────────────────────────────────────────────────────────────────────────
 + # Task 1 — easy: cell-level match against ground truth
 + # ─────────────────────────────────────────────────────────────────────────────
 + #
 + # Score = (cells matching ground truth) / (total cells)
 + #
 + # "Matching" is defined after normalisation:
 + #   - strip leading/trailing whitespace
 + #   - numeric columns: round to 2dp, compare as float strings
 + #   - date column: accept YYYY-MM-DD only
 + #   - string columns: case-sensitive exact match after strip
 + #   - NaN / unparseable: mapped to sentinel strings that never match the
 + #     (fully populated) clean data — the agent must fill or fix them
 +
 + def _grade_easy(
 +     agent_df: pd.DataFrame,
 +     clean_df: pd.DataFrame,
 +     metadata: Dict[str, Any],
 +     initial_dirty_cells: int,
 + ) -> GradeResult:
 +
 +     # Align shape — the agent might have a different row count if it
 +     # accidentally dropped rows; penalise by treating missing rows as all-wrong.
 +     agent_norm = _normalise_easy(agent_df, clean_df)
 +     clean_norm = _normalise_easy(clean_df, clean_df)
 +
 +     total_cells = clean_norm.size
 +
 +     # Pad or truncate agent rows to match clean row count
 +     if len(agent_norm) < len(clean_norm):
 +         pad = pd.DataFrame(
 +             [["__MISSING__"] * len(clean_norm.columns)] * (len(clean_norm) - len(agent_norm)),
 +             columns=clean_norm.columns,
 +         )
 +         agent_norm = pd.concat([agent_norm, pad], ignore_index=True)
 +     elif len(agent_norm) > len(clean_norm):
 +         agent_norm = agent_norm.iloc[: len(clean_norm)].copy()
 +
 +     matches = (agent_norm == clean_norm).sum().sum()
 +     score = matches / total_cells
 +
 +     # Issues remaining: number of cells that still differ
 +     mismatches = int((agent_norm != clean_norm).sum().sum())
 +
 +     breakdown = {
 +         "cell_match_ratio": round(score, 4),
 +         "cells_matched": int(matches),
 +         "total_cells": int(total_cells),
 +         "cells_mismatched": mismatches,
 +     }
 +
 +     detail = (
 +         f"{int(matches)}/{total_cells} cells correct "
 +         f"({100*score:.1f}%) — {mismatches} still need fixing."
 +     )
 +
 +     return GradeResult(
 +         score=score,
 +         breakdown=breakdown,
 +         issues_remaining=mismatches,
 +         detail=detail,
 +     )
 +
 +
 + def _normalise_easy(df: pd.DataFrame, clean_df: pd.DataFrame) -> pd.DataFrame:
 +     """
 +     Bring a DataFrame to a canonical string form for cell-level comparison.
 +
 +     Rules applied per column based on clean_df's dtype:
 +       - Numeric (price, quantity): round to 2 decimal places → string
 +       - Date (order_date): parse and reformat to YYYY-MM-DD
 +       - String (all others): strip whitespace, leave case unchanged
 +       - NaN / unparseable: normalise to the sentinel "__NAN__"
 +     """
 +     out = {}
 +     NUMERIC_COLS = {"price", "quantity"}
 +     DATE_COLS = {"order_date"}
 +
 +     for col in clean_df.columns:
 +         if col not in df.columns:
 +             # Agent removed or renamed the column — all cells wrong
 +             out[col] = pd.Series(["__MISSING_COL__"] * len(df))
 +             continue
 +
 +         series = df[col].copy()
 +
 +         if col in NUMERIC_COLS:
 +             out[col] = series.apply(_to_numeric_str)
 +         elif col in DATE_COLS:
 +             out[col] = series.apply(_to_date_str)
 +         else:
 +             out[col] = series.apply(
 +                 lambda x: "__NAN__" if _is_missing(x) else str(x).strip()
 +             )
 +
 +     return pd.DataFrame(out, dtype=str)
 +
 +
 + def _to_numeric_str(x: Any) -> str:
 +     if _is_missing(x):
 +         return "__NAN__"
 +     try:
 +         return f"{float(str(x).strip().replace(',', '')):.2f}"
 +     except (ValueError, TypeError):
 +         return "__INVALID__"
 +
 +
 + def _to_date_str(x: Any) -> str:
 +     if _is_missing(x):
 +         return "__NAN__"
 +     s = str(x).strip()
 +     # Reject obviously wrong dates (e.g. year 2099)
 +     try:
 +         parsed = pd.to_datetime(s, dayfirst=False)
 +         if parsed.year > 2030 or parsed.year < 2000:
 +             return "__BAD_DATE__"
 +         return parsed.strftime("%Y-%m-%d")
 +     except Exception:
 +         return "__INVALID_DATE__"
 +
 +
 + def _is_missing(x: Any) -> bool:
 +     if x is None:
 +         return True
 +     try:
 +         return bool(pd.isna(x))
 +     except (TypeError, ValueError):
 +         return False
 +
 +
 + # ─────────────────────────────────────────────────────────────────────────────
 + # Task 2 — medium: F1 on outlier detection + typo correction
 + # ─────────────────────────────────────────────────────────────────────────────
 + #
 + # Two independent sub-scores, equally weighted:
 + #
 + #   outlier_f1 — precision/recall on which rows were fixed or removed
 + #   typo_score — fraction of category typo-cells correctly fixed
 + #
 + # Final score = 0.50 * outlier_f1 + 0.50 * typo_score
 + #
 + # Outlier logic:
 + #   A true-outlier row is "correctly handled" if:
 + #     (a) the row still exists AND amount is now in [5, 800], OR
 + #     (b) the row was dropped entirely
 + #   A valid-extreme row is a "false positive" if it was dropped OR
 + #   its amount was changed to something outside [900, 2000].
 + #
 + # The thresholds match the schema_hint the agent was given.
 +
 + _VALID_AMOUNT_MIN = 5.0
 + _VALID_AMOUNT_MAX = 800.0
 + _EXTREME_AMOUNT_MIN = 900.0
 + _EXTREME_AMOUNT_MAX = 2000.0
 +
 +
 + def _grade_medium(
 +     agent_df: pd.DataFrame,
 +     clean_df: pd.DataFrame,
 +     metadata: Dict[str, Any],
 +     initial_dirty_cells: int,
 + ) -> GradeResult:
 +
 +     outlier_rows: List[int] = metadata.get("outlier_rows", [])
 +     valid_extreme_rows: List[int] = metadata.get("valid_extreme_rows", [])
 +     typo_cells: List[Tuple[int, str, str]] = metadata.get("typo_cells", [])
 +
 +     # ── Outlier sub-score ────────────────────────────────────────────────────
 +     # Detect which of the original row indices are still present in agent_df.
 +     # We track by tx_id (which is stable and unique) rather than df index,
 +     # since the agent may reset the index after dropping rows.
 +     agent_tx_ids: set = set()
 +     if "tx_id" in agent_df.columns:
 +         agent_tx_ids = set(agent_df["tx_id"].dropna().astype(int).tolist())
 +
 +     tp = 0  # outlier rows that were correctly handled
 +     fn = 0  # outlier rows still wrong (extreme amount still present)
 +     fp = 0  # valid-extreme rows wrongly removed or damaged
 +
 +     # True-positive check
 +     for orig_idx in outlier_rows:
 +         tx_id_val = int(clean_df.iloc[orig_idx]["tx_id"]) if orig_idx < len(clean_df) else None
 +         if tx_id_val is None:
 +             continue
 +         if tx_id_val not in agent_tx_ids:
 +             # Row was dropped — counts as correctly handled (outlier removed)
 +             tp += 1
 +         else:
 +             # Row still present — check if amount was fixed
 +             agent_row = agent_df[agent_df["tx_id"].astype(int) == tx_id_val]
 +             if len(agent_row) == 0:
 +                 tp += 1  # dropped after all
 +             else:
 +                 amt = _safe_float(agent_row.iloc[0].get("amount"))
 +                 if amt is not None and _VALID_AMOUNT_MIN <= amt <= _VALID_AMOUNT_MAX:
 +                     tp += 1
 +                 else:
 +                     fn += 1
 +
 +     # False-positive check (valid extremes must survive untouched)
 +     for orig_idx in valid_extreme_rows:
 +         if orig_idx >= len(clean_df):
 +             continue
 +         tx_id_val = int(clean_df.iloc[orig_idx]["tx_id"])
 +         clean_amt = float(clean_df.iloc[orig_idx]["amount"])
 +
 +         if tx_id_val not in agent_tx_ids:
 +             fp += 1  # wrongly dropped a valid row
 +         else:
 +             agent_row = agent_df[agent_df["tx_id"].astype(int) == tx_id_val]
 +             if len(agent_row) == 0:
 +                 fp += 1
 +             else:
 +                 amt = _safe_float(agent_row.iloc[0].get("amount"))
 +                 # Accept if amount is within ±5% of original clean value
 +                 if amt is None or not (clean_amt * 0.95 <= amt <= clean_amt * 1.05):
 +                     fp += 1
 +
 +     n_outliers = len(outlier_rows)
 +     precision = tp / (tp + fp + 1e-9)
 +     recall = tp / (n_outliers + 1e-9)
 +     outlier_f1 = (2 * precision * recall) / (precision + recall + 1e-9)
 +
 +     # ── Typo sub-score ───────────────────────────────────────────────────────
 +     typo_correct = 0
 +     for (row_idx, dirty_val, clean_val) in typo_cells:
 +         if "tx_id" not in clean_df.columns or row_idx >= len(clean_df):
 +             continue
 +         tx_id_val = int(clean_df.iloc[row_idx]["tx_id"])
 +         agent_rows = agent_df[agent_df["tx_id"].astype(int) == tx_id_val] \
 +             if "tx_id" in agent_df.columns else pd.DataFrame()
 +         if len(agent_rows) == 0:
 +             continue  # row dropped; neither credit nor penalty
 +         agent_cat = str(agent_rows.iloc[0].get("category", "")).strip()
 +         if agent_cat == clean_val:
 +             typo_correct += 1
 +
 +     typo_score = typo_correct / max(len(typo_cells), 1)
 +
 +     # ── Combined score ───────────────────────────────────────────────────────
 +     score = 0.50 * outlier_f1 + 0.50 * typo_score
 +
 +     # Approximate issues remaining: unsolved outliers + unsolved typos
 +     issues_remaining = fn + (len(typo_cells) - typo_correct)
 +
 +     breakdown = {
 +         "outlier_f1": round(outlier_f1, 4),
 +         "outlier_tp": tp,
 +         "outlier_fn": fn,
 +         "outlier_fp": fp,
 +         "precision": round(precision, 4),
 +         "recall": round(recall, 4),
 +         "typo_score": round(typo_score, 4),
 +         "typos_fixed": typo_correct,
 +         "typos_total": len(typo_cells),
 +         "combined": round(score, 4),
 +     }
 +
 +     detail = (
 +         f"Outlier F1={outlier_f1:.3f} (TP={tp}, FP={fp}, FN={fn}) | "
 +         f"Typos {typo_correct}/{len(typo_cells)} fixed → score={score:.3f}"
 +     )
 +
 +     return GradeResult(
 +         score=score,
 +         breakdown=breakdown,
 +         issues_remaining=issues_remaining,
 +         detail=detail,
 +     )
 +
 +
 + def _safe_float(x: Any) -> Optional[float]:
 +     if _is_missing(x):
 +         return None
 +     try:
 +         return float(str(x).strip().replace(",", ""))
 +     except (ValueError, TypeError):
 +         return None
 +
 +
368
+ # ─────────────────────────────────────────────────────────────────────────────
369
+ # Task 3 — hard: schema normalisation + deduplication + date formatting
370
+ # ──────────────────��──────────────────────────────────────────────────────────
371
+ #
372
+ # Three independent sub-scores:
373
+ #
374
+ # schema_score (weight 0.40)
375
+ # Fraction of canonical column names present in agent_df.
376
+ # Bonus: all 9 canonical columns present AND no extra columns → +0.1
377
+ #
378
+ # dedup_score (weight 0.35)
379
+ # How many of the 30 true duplicate tx records were removed.
380
+ # Penalises over-deletion (removing rows that were not duplicates).
381
+ # dedup_precision = removed_true_dups / (rows_removed + ε)
382
+ # dedup_recall = removed_true_dups / n_duplicate_pairs
383
+ # dedup_f1 = harmonic mean
384
+ #
385
+ # format_score (weight 0.25)
386
+ # Fraction of values in the purchase_date column (or canonical alias)
387
+ # that are valid YYYY-MM-DD strings.
388
+ #
389
+ # Final score = 0.40 * schema_score + 0.35 * dedup_score + 0.25 * format_score
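+ #
+ # Quick sanity check of the weighting (hypothetical sub-scores):
+ # schema 8/9 present → 0.889, dedup F1 = 0.90, 95% valid dates →
+ # 0.40*0.889 + 0.35*0.90 + 0.25*0.95 ≈ 0.908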
390
+
391
+ _CANONICAL_COLS = [
392
+ "record_id", "customer_id", "full_name", "email",
393
+ "amount", "currency", "purchase_date", "product_name", "region",
394
+ ]
395
+
396
+ _ISO_DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")
397
+
398
+
399
+ def _grade_hard(
400
+ agent_df: pd.DataFrame,
401
+ clean_df: pd.DataFrame,
402
+ metadata: Dict[str, Any],
403
+ ) -> GradeResult:
404
+
405
+ canonical_lookup: Dict[str, str] = metadata.get("canonical_lookup", {})
406
+ n_clean_rows: int = metadata.get("n_clean_rows", len(clean_df))
407
+
408
+ # ── 1. Schema score ──────────────────────────────────────────────────────
409
+ schema_score, schema_detail = _grade_schema(agent_df, canonical_lookup)
410
+
411
+ # ── 2. Deduplication score ───────────────────────────────────────────────
412
+ dedup_score, dedup_detail = _grade_deduplication(
413
+ agent_df, clean_df, n_clean_rows, canonical_lookup
414
+ )
415
+
416
+ # ── 3. Date format score ─────────────────────────────────────────────────
417
+ format_score, format_detail = _grade_date_format(agent_df, canonical_lookup)
418
+
419
+ # ── Combined ─────────────────────────────────────────────────────────────
420
+ score = 0.40 * schema_score + 0.35 * dedup_score + 0.25 * format_score
421
+
422
+ # issues_remaining: rough proxy (unresolved column aliases + excess rows)
423
+ n_canonical_present = sum(
424
+ 1 for c in _CANONICAL_COLS if c in agent_df.columns
425
+ )
426
+ issues_remaining = (
427
+ (len(_CANONICAL_COLS) - n_canonical_present) # missing canonical cols
428
+ + max(0, len(agent_df) - n_clean_rows) # excess rows (dups not removed)
429
+ )
430
+
431
+ breakdown = {
432
+ "schema_score": round(schema_score, 4),
433
+ "dedup_score": round(dedup_score, 4),
434
+ "format_score": round(format_score, 4),
435
+ "combined": round(score, 4),
436
+ **{f"schema_{k}": v for k, v in schema_detail.items()},
437
+ **{f"dedup_{k}": v for k, v in dedup_detail.items()},
438
+ **{f"fmt_{k}": v for k, v in format_detail.items()},
439
+ }
440
+
441
+ detail = (
442
+ f"Schema={schema_score:.3f} | "
443
+ f"Dedup={dedup_score:.3f} | "
444
+ f"DateFmt={format_score:.3f} → score={score:.3f}"
445
+ )
446
+
447
+ return GradeResult(
448
+ score=score,
449
+ breakdown=breakdown,
450
+ issues_remaining=issues_remaining,
451
+ detail=detail,
452
+ )
453
+
454
+
455
+ def _grade_schema(
456
+ agent_df: pd.DataFrame,
457
+ canonical_lookup: Dict[str, str],
458
+ ) -> Tuple[float, Dict[str, Any]]:
459
+ """
460
+ Score how well the agent normalised column names.
461
+
462
+ Strategy:
463
+ - Build a set of "recognised" columns: canonical names + their aliases.
464
+ - For each canonical column, check if the agent has it (by canonical name).
465
+ - Partial credit per canonical column found.
466
+ - Small bonus if ALL 9 are present and no unrecognised extra columns remain.
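+
+ Worked example (hypothetical): 7 of 9 canonical columns present with two
+ alias columns left over → base = 7/9 ≈ 0.778, no bonus → score 0.778.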
467
+ """
468
+ agent_cols = set(agent_df.columns)
469
+ canonical_set = set(_CANONICAL_COLS)
470
+
471
+ # Note: alias names in canonical_lookup are not needed here; scoring
472
+ # checks for canonical names only.
473
+
474
+ # Count canonical columns present
475
+ found = [c for c in _CANONICAL_COLS if c in agent_cols]
476
+ n_found = len(found)
477
+ base = n_found / len(_CANONICAL_COLS)
478
+
479
+ # Bonus: all canonical present AND no leftover alias columns
480
+ leftover_aliases = [c for c in agent_cols if c not in canonical_set]
481
+ all_present = n_found == len(_CANONICAL_COLS)
482
+ clean_rename = len(leftover_aliases) == 0
483
+
484
+ bonus = 0.10 if (all_present and clean_rename) else 0.0
485
+
486
+ score = min(1.0, base + bonus)
487
+
488
+ detail: Dict[str, Any] = {
489
+ "canonical_found": n_found,
490
+ "canonical_total": len(_CANONICAL_COLS),
491
+ "leftover_aliases": len(leftover_aliases),
492
+ "rename_bonus": bonus,
493
+ }
494
+ return score, detail
495
+
496
+
497
+ def _grade_deduplication(
498
+ agent_df: pd.DataFrame,
499
+ clean_df: pd.DataFrame,
500
+ n_clean_rows: int,
501
+ canonical_lookup: Dict[str, str],
502
+ ) -> Tuple[float, Dict[str, Any]]:
503
+ """
504
+ Score how well the agent removed duplicate rows.
505
+
506
+ We compare row counts and detect near-duplicate detection quality:
507
+ - n_injected_dups: 30 (hardcoded from dataset_factory)
508
+ - expected_final_rows: n_clean_rows (400)
509
+ - rows_removed: (raw dirty rows = 430) - len(agent_df)
510
+ - true_dups_removed: min(rows_removed, 30)
511
+ (lenient: removals are assumed to target dups, capped at 30)
512
+ - over_deletion: max(0, rows_removed - 30) rows beyond the dup count
513
+ penalises removing valid data.
514
+
515
+ Precision = true_dups_removed / (rows_removed + ε)
516
+ Recall = true_dups_removed / 30
517
+ F1 = harmonic mean
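+
+ Worked example (hypothetical counts): rows_removed=32 →
+ true_dups_removed=30, over_deletion=2, effective_true=28 →
+ precision = 28/32 = 0.875, recall = 1.0, F1 ≈ 0.933.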
518
+ """
519
+ N_INJECTED_DUPS = 30
520
+ N_DIRTY_ROWS = n_clean_rows + N_INJECTED_DUPS # 430
521
+
522
+ rows_removed = max(0, N_DIRTY_ROWS - len(agent_df))
523
+
524
+ # Heuristic: removals are assumed to target dups, capped at the 30 injected
525
+ true_dups_removed = min(rows_removed, N_INJECTED_DUPS)
526
+
527
+ # Penalise over-removal (agent deleted valid rows beyond dups)
528
+ over_deletion = max(0, rows_removed - N_INJECTED_DUPS)
529
+ # Each over-deleted row reduces precision
530
+ effective_true = max(0, true_dups_removed - over_deletion)
531
+
532
+ precision = effective_true / (rows_removed + 1e-9)
533
+ recall = true_dups_removed / (N_INJECTED_DUPS + 1e-9)
534
+ f1 = (2 * precision * recall) / (precision + recall + 1e-9)
535
+
536
+ detail: Dict[str, Any] = {
537
+ "rows_removed": rows_removed,
538
+ "true_dups_removed": true_dups_removed,
539
+ "over_deletion": over_deletion,
540
+ "precision": round(precision, 4),
541
+ "recall": round(recall, 4),
542
+ "f1": round(f1, 4),
543
+ }
544
+ return f1, detail
545
+
546
+
547
+ def _grade_date_format(
548
+ agent_df: pd.DataFrame,
549
+ canonical_lookup: Dict[str, str],
550
+ ) -> Tuple[float, Dict[str, Any]]:
551
+ """
552
+ Fraction of purchase_date values matching YYYY-MM-DD.
553
+
554
+ Looks for the canonical name "purchase_date" first; falls back to
555
+ known aliases ("date", "PurchaseDate") if the agent hasn't renamed yet.
556
+ """
557
+ DATE_ALIASES = {"purchase_date", "date", "PurchaseDate"}
558
+
559
+ date_col = None
560
+ # Prefer canonical name
561
+ if "purchase_date" in agent_df.columns:
562
+ date_col = "purchase_date"
563
+ else:
564
+ for alias in DATE_ALIASES:
565
+ if alias in agent_df.columns:
566
+ date_col = alias
567
+ break
568
+
569
+ if date_col is None:
570
+ return 0.0, {"date_col_found": False, "valid_ratio": 0.0}
571
+
572
+ # Guard: duplicate column names after rename produce a DataFrame, not Series.
573
+ # Take the first occurrence.
574
+ col_data = agent_df[date_col]
575
+ if isinstance(col_data, pd.DataFrame):
576
+ col_data = col_data.iloc[:, 0]
577
+
578
+ # Force object dtype so .sum() always returns a numeric 0, not '' (the
579
+ # StringDtype identity). Python 3.14 + pandas 2.2+ infer StringDtype
580
+ # from .astype(str), which makes .sum() on an empty Series return ''.
581
+ series = col_data.dropna().astype(object).apply(str).str.strip()
582
+ n_total = len(series)
583
+ if n_total == 0:
584
+ return 0.0, {"date_col_found": True, "valid_ratio": 0.0, "n_total": 0}
585
+
586
+ # Combined check: ISO pattern match AND year in plausible range
587
+ def _is_valid_iso(s: str) -> bool:
588
+ if not _ISO_DATE_PATTERN.match(s):
589
+ return False
590
+ try:
591
+ return 2000 <= int(s[:4]) <= 2030
592
+ except Exception:
593
+ return False
594
+
595
+ valid_flags = series.apply(_is_valid_iso)
596
+ n_valid = int(valid_flags.sum()) # int() guards against numpy/pandas scalar types
597
+ n_year_ok = n_valid # same condition — kept for breakdown detail
598
+ valid_ratio = n_year_ok / n_total
599
+
600
+ detail: Dict[str, Any] = {
601
+ "date_col_found": True,
602
+ "date_col_used": date_col,
603
+ "n_total": int(n_total),
604
+ "n_valid_iso": int(n_valid),
605
+ "n_year_ok": int(n_year_ok),
606
+ "valid_ratio": round(valid_ratio, 4),
607
+ }
608
+ return valid_ratio, detail
609
+
610
+
611
+ # ─────────────────────────────────────────────────────────────────────────────
612
+ # Smoke test
613
+ # ─────────────────────────────────────────────────────────────────────────────
614
+
615
+ if __name__ == "__main__":
616
+ import sys
617
+ sys.path.insert(0, ".")
618
+ from dataset_factory import make_dataset
619
+
620
+ SEP = "─" * 62
621
+
622
+ # ── Task 1: easy ─────────────────────────────────────────────────────────
623
+ print(f"\n{SEP}\nTASK: easy\n{SEP}")
624
+ ds = make_dataset("easy")
625
+
626
+ # Baseline: grade dirty df (should be low)
627
+ r_dirty = grade("easy", ds.dirty_df, ds.clean_df, ds.metadata, ds.total_dirty_cells)
628
+ print(f"[dirty] score={r_dirty.score:.4f} {r_dirty.detail}")
629
+
630
+ # Perfect: grade clean df (should be 1.0)
631
+ r_clean = grade("easy", ds.clean_df, ds.clean_df, ds.metadata, ds.total_dirty_cells)
632
+ print(f"[clean] score={r_clean.score:.4f} {r_clean.detail}")
633
+
634
+ # Partial: fix half the injected cells
635
+ partial = ds.dirty_df.copy()
636
+ injected = ds.metadata.get("injected_cells", [])
637
+ for (row, col) in injected[:len(injected)//2]:
638
+ partial.at[row, col] = ds.clean_df.at[row, col]
639
+ r_partial = grade("easy", partial, ds.clean_df, ds.metadata, ds.total_dirty_cells)
640
+ print(f"[half] score={r_partial.score:.4f} {r_partial.detail}")
641
+
642
+ print(f"Breakdown: {r_partial.breakdown}")
643
+
644
+ # ── Task 2: medium ────────────────────────────────────────────────────────
645
+ print(f"\n{SEP}\nTASK: medium\n{SEP}")
646
+ ds = make_dataset("medium")
647
+
648
+ r_dirty = grade("medium", ds.dirty_df, ds.clean_df, ds.metadata, ds.total_dirty_cells)
649
+ print(f"[dirty] score={r_dirty.score:.4f} {r_dirty.detail}")
650
+
651
+ r_clean = grade("medium", ds.clean_df, ds.clean_df, ds.metadata, ds.total_dirty_cells)
652
+ print(f"[clean] score={r_clean.score:.4f} {r_clean.detail}")
653
+
654
+ # Simulate agent fixing all outliers (set amount to 150.0) + all typos
655
+ fixed = ds.dirty_df.copy()
656
+ for row in ds.metadata["outlier_rows"]:
657
+ if "tx_id" in ds.clean_df.columns:
658
+ fixed.at[row, "amount"] = 150.0
659
+ for (row, dirty_val, clean_val) in ds.metadata["typo_cells"]:
660
+ fixed.at[row, "category"] = clean_val
661
+ r_fixed = grade("medium", fixed, ds.clean_df, ds.metadata, ds.total_dirty_cells)
662
+ print(f"[fixed] score={r_fixed.score:.4f} {r_fixed.detail}")
663
+
664
+ print(f"Breakdown: {r_fixed.breakdown}")
665
+
666
+ # ── Task 3: hard ──────────────────────────────────────────────────────────
667
+ print(f"\n{SEP}\nTASK: hard\n{SEP}")
668
+ ds = make_dataset("hard")
669
+
670
+ r_dirty = grade("hard", ds.dirty_df, ds.clean_df, ds.metadata, ds.total_dirty_cells)
671
+ print(f"[dirty] score={r_dirty.score:.4f} {r_dirty.detail}")
672
+
673
+ r_clean = grade("hard", ds.clean_df, ds.clean_df, ds.metadata, ds.total_dirty_cells)
674
+ print(f"[clean] score={r_clean.score:.4f} {r_clean.detail}")
675
+
676
+ # Simulate partial fix: rename columns only, don't dedup or fix dates
677
+ partial_hard = ds.dirty_df.copy()
678
+ rename_map = ds.metadata.get("canonical_lookup", {})
679
+ partial_hard = partial_hard.rename(columns=rename_map)
680
+ # Keep only canonical columns that exist
681
+ canonical_present = [c for c in _CANONICAL_COLS if c in partial_hard.columns]
682
+ partial_hard = partial_hard[canonical_present]
683
+ r_renamed = grade("hard", partial_hard, ds.clean_df, ds.metadata, ds.total_dirty_cells)
684
+ print(f"[rename] score={r_renamed.score:.4f} {r_renamed.detail}")
685
+
686
+ print(f"Breakdown: {r_renamed.breakdown}")
inference.py ADDED
@@ -0,0 +1,271 @@
1
+ """
2
+ inference.py
3
+ ------------
4
+ Official submission inference script for the Data Cleaning Pipeline environment.
5
+
6
+ Reads from environment variables (ALL FREE — no paid API needed):
7
+ API_BASE_URL LLM endpoint. Default: HuggingFace free router.
8
+ MODEL_NAME Model to use. Default: free open model.
9
+ HF_TOKEN Your free HuggingFace token (hf_...).
10
+ LOCAL_IMAGE_NAME Docker image name if using from_docker_image().
11
+ Leave unset to connect via ENV_BASE_URL instead.
12
+ ENV_BASE_URL Direct server URL. Default: http://localhost:8000
13
+
14
+ STDOUT FORMAT (evaluator parses these lines exactly — do not modify):
15
+ [START] task=<n> env=<benchmark> model=<model>
16
+ [STEP] step=<n> action=<str> reward=<0.00> done=<true|false> error=<msg|null>
17
+ [END] success=<true|false> steps=<n> score=<0.00> rewards=<r1,r2,...>
18
+ """
19
+
20
+ import asyncio
21
+ import json
22
+ import os
23
+ import re
24
+ import sys
25
+ from typing import List, Optional
26
+
27
+
28
+ from openai import OpenAI
29
+
30
+ # ── Environment client imports ────────────────────────────────────────────────
31
+ try:
32
+ from client import DataCleaningEnv
33
+ from models import CleanAction, MAX_STEPS, DONE_THRESHOLD
34
+ except ImportError:
35
+ sys.path.insert(0, os.path.dirname(__file__))
36
+ from client import DataCleaningEnv
37
+ from models import CleanAction, MAX_STEPS, DONE_THRESHOLD
38
+
39
+
40
+ # ── Configuration — all defaults are FREE ────────────────────────────────────
41
+
42
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
43
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
44
+ HF_TOKEN = os.getenv("HF_TOKEN", "")
45
+ LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME", "")
46
+ ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:8000")
47
+
48
+ BENCHMARK = "data_cleaning_env"
49
+ TASK_IDS = ["easy", "medium", "hard"]
50
+
51
+ # Conservative budgets — keeps total runtime under 20 min on vcpu=2 / 8 GB
52
+ STEP_LIMITS = {"easy": 25, "medium": 50, "hard": 80}
53
+
54
+
55
+ # ── Official log helpers ──────────────────────────────────────────────────────
56
+ # Field names, order, and spacing match the evaluator spec exactly.
57
+
58
+ def log_start(task: str, env: str, model: str) -> None:
59
+ print(f"[START] task={task} env={env} model={model}", flush=True)
60
+
61
+
62
+ def log_step(
63
+ step: int,
64
+ action: str,
65
+ reward: float,
66
+ done: bool,
67
+ error: Optional[str],
68
+ ) -> None:
69
+ error_val = error if error else "null"
70
+ done_val = str(done).lower()
71
+ action_str = action[:80].replace("\n", " ") # keep line single-line
72
+ print(
73
+ f"[STEP] step={step} action={action_str} "
74
+ f"reward={reward:.2f} done={done_val} error={error_val}",
75
+ flush=True,
76
+ )
77
+
78
+
79
+ def log_end(
80
+ success: bool,
81
+ steps: int,
82
+ score: float,
83
+ rewards: List[float],
84
+ ) -> None:
85
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
86
+ print(
87
+ f"[END] success={str(success).lower()} steps={steps} "
88
+ f"score={score:.2f} rewards={rewards_str}",
89
+ flush=True,
90
+ )
91
+
92
+
93
+ # ── LLM helpers ───────────────────────────────────────────────────────────────
94
+
95
+ SYSTEM_PROMPT = (
96
+ "You are a data cleaning agent. You receive a dirty CSV and must fix it "
97
+ "step by step using JSON action commands. Fix the most impactful issues "
98
+ "first. Be precise — wrong column names cause errors. "
99
+ "Output a single valid JSON object and nothing else — no explanation, no markdown."
100
+ )
101
+
102
+
103
+ def build_prompt(obs) -> str:
104
+ rows = obs.dirty_csv.strip().split("\n")
105
+ preview = "\n".join(rows[:30])
106
+ truncated = len(rows) > 30
107
+ last_err = f"\nLast error: {obs.last_action_error}" if obs.last_action_error else ""
108
+ return (
109
+ f"Task: {obs.task_id}\n"
110
+ f"Schema: {obs.schema_hint}\n"
111
+ f"Score: {obs.current_score:.4f} | Issues remaining: {obs.issues_remaining}\n"
112
+ f"Step {obs.step_number}/{obs.max_steps}{last_err}\n"
113
+ f"\nCSV{' (first 30 rows)' if truncated else ''}:\n{preview}\n\n"
114
+ "Reply with ONE JSON action:\n"
115
+ ' {"command":"SET_VALUE", "row_index":<int>, "column":"<name>", "value":"<str>"}\n'
116
+ ' {"command":"DROP_ROW", "row_index":<int>}\n'
117
+ ' {"command":"STANDARDIZE_COL", "column":"<name>"}\n'
118
+ ' {"command":"FILL_MISSING", "column":"<name>", "fill_strategy":"mean|median|mode|drop"}\n'
119
+ ' {"command":"DONE"}\n'
120
+ "row_index = integer in the leftmost column of the CSV. JSON only."
121
+ )
122
+
123
+
124
+ def parse_action(raw: str) -> CleanAction:
125
+ """Convert model output to CleanAction. Falls back to DONE on any error."""
126
+ text = raw.strip()
127
+ if text.startswith("```"):
128
+ lines = text.split("\n")
129
+ inner = lines[1:-1] if lines[-1].strip().startswith("```") else lines[1:]
130
+ text = "\n".join(inner).strip()
131
+ try:
132
+ return CleanAction(**json.loads(text))
133
+ except Exception:
134
+ m = re.search(r"\{[^{}]+\}", text, re.DOTALL)
135
+ if m:
136
+ try:
137
+ return CleanAction(**json.loads(m.group()))
138
+ except Exception:
139
+ pass
140
+ return CleanAction(command="DONE")
141
+
142
+
143
+ def call_llm(client: OpenAI, messages: list) -> str:
144
+ response = client.chat.completions.create(
145
+ model=MODEL_NAME,
146
+ messages=messages,
147
+ max_tokens=150, # actions are short; saves free-tier quota
148
+ temperature=0.1,
149
+ )
150
+ return (response.choices[0].message.content or "").strip()
151
+
152
+
153
+ # ── Episode loop ───────────────────────────────────────────────────────────────
154
+
155
+ async def run_episode(env, client: OpenAI, task_id: str) -> dict:
156
+ """Run one episode. Emits [START] → N×[STEP] → [END]."""
157
+ max_steps = STEP_LIMITS[task_id]
158
+ threshold = DONE_THRESHOLD[task_id]
159
+ rewards: List[float] = []
160
+ steps_taken = 0
161
+ score = 0.0
162
+ success = False
163
+
164
+ log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
165
+
166
+ try:
167
+ result = await env.reset(task_id=task_id)
168
+ obs = result.observation
169
+ messages = [{"role": "system", "content": SYSTEM_PROMPT}]
170
+
171
+ for step in range(1, max_steps + 1):
172
+ if obs.done:
173
+ break
174
+
175
+ steps_taken = step
176
+ messages.append({"role": "user", "content": build_prompt(obs)})
177
+
178
+ try:
179
+ raw = call_llm(client, messages)
180
+ action = parse_action(raw)
181
+ messages.append({"role": "assistant", "content": raw})
182
+ except Exception as exc:
183
+ # API or parse failure — log and stop episode
184
+ log_step(step, "DONE", 0.00, True, str(exc)[:120])
185
+ rewards.append(0.0)
186
+ break
187
+
188
+ # Keep only system + last 8 exchanges to stay inside free-tier context limits
189
+ if len(messages) > 17:
190
+ messages = [messages[0]] + messages[-16:]
191
+
192
+ result = await env.step(action)
193
+ obs = result.observation
194
+ reward = result.reward or 0.0
195
+ rewards.append(reward)
196
+ score = obs.current_score
197
+
198
+ log_step(
199
+ step=step,
200
+ action=action.command,
201
+ reward=reward,
202
+ done=obs.done,
203
+ error=obs.last_action_error,
204
+ )
205
+
206
+ if obs.done or score >= threshold:
207
+ break
208
+
209
+ success = score >= threshold
210
+
211
+ finally:
212
+ # [END] is always emitted, even if the episode crashed
213
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
214
+
215
+ return {"task_id": task_id, "score": score,
216
+ "reward": sum(rewards), "steps": steps_taken, "success": success}
217
+
218
+
219
+ # ── Entry point ────────────────────────────────────────────────────────────────
220
+
221
+ async def main() -> None:
222
+ if not HF_TOKEN:
223
+ print(
224
+ "ERROR: HF_TOKEN is not set.\n"
225
+ "1. Go to https://huggingface.co/settings/tokens\n"
226
+ "2. Click 'New token' → choose 'Read' → copy it\n"
227
+ "3. In PowerShell: $env:HF_TOKEN='hf_xxxxxxxxxxxx'\n"
228
+ "4. Then run: python inference.py",
229
+ file=sys.stderr,
230
+ )
231
+ sys.exit(1)
232
+
233
+ print(f"API_BASE_URL : {API_BASE_URL}", flush=True)
234
+ print(f"MODEL_NAME : {MODEL_NAME}", flush=True)
235
+ print(f"LOCAL_IMAGE_NAME : {LOCAL_IMAGE_NAME or '(not set — using ENV_BASE_URL)'}", flush=True)
236
+ print(f"ENV_BASE_URL : {ENV_BASE_URL}", flush=True)
237
+ print("", flush=True)
238
+
239
+ # Create the LLM client first, then connect to the environment
240
+ llm = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)
241
+
242
+ if LOCAL_IMAGE_NAME:
243
+ env = await DataCleaningEnv.from_docker_image(LOCAL_IMAGE_NAME)
244
+ else:
245
+ env = DataCleaningEnv(base_url=ENV_BASE_URL)
246
+ await env.connect()
247
+
248
+ results = []
249
+ try:
250
+ for task_id in TASK_IDS:
251
+ summary = await run_episode(env, llm, task_id)
252
+ results.append(summary)
253
+ print("", flush=True)
254
+ finally:
255
+ await env.close()
256
+
257
+ # Human-readable summary (evaluator ignores lines that don't start with [START]/[STEP]/[END])
258
+ print("=" * 56, flush=True)
259
+ print(f"{'Task':<10} {'Score':>7} {'Reward':>9} {'Steps':>6} {'Pass':>5}")
260
+ print("-" * 56, flush=True)
261
+ for r in results:
262
+ print(
263
+ f"{r['task_id']:<10} {r['score']:>7.4f} {r['reward']:>9.4f} "
264
+ f"{r['steps']:>6} {'YES' if r['success'] else 'NO':>5}",
265
+ flush=True,
266
+ )
267
+ print("=" * 56, flush=True)
268
+
269
+
270
+ if __name__ == "__main__":
271
+ asyncio.run(main())
models.py ADDED
@@ -0,0 +1,463 @@
1
+ """
2
+ models.py
3
+ ---------
4
+ Pydantic models for the Data Cleaning Pipeline environment.
5
+
6
+ Three models define the full agent↔environment contract:
7
+
8
+ CleanAction — what the agent sends on each step
9
+ CleanObservation — what the agent receives back
10
+ CleanState — internal server state (not sent to agent directly)
11
+
12
+ Inheritance chain (confirmed from OpenEnv source):
13
+ Action → extra="forbid", has: metadata: Dict[str, Any]
14
+ Observation → extra="forbid", has: done: bool, reward: float|None, metadata: Dict[str, Any]
15
+ State → extra="allow", has: episode_id: Optional[str], step_count: int
16
+ """
17
+
18
+ from __future__ import annotations
19
+
20
+ from typing import Any, Dict, List, Literal, Optional
21
+
22
+ from pydantic import Field, field_validator, model_validator
23
+
24
+ try:
25
+ from openenv.core.env_server.types import Action, Observation, State
26
+ except ImportError:
27
+ # Fallback for local development without the full OpenEnv install
28
+ from openenv.core.env_server import Action, Observation, State
29
+
30
+
31
+ # ── Valid values (used by validators + schema hints) ──────────────────────────
32
+
33
+ VALID_COMMANDS = Literal[
34
+ "SET_VALUE", # Fix a specific cell: (row_index, column, value)
35
+ "DROP_ROW", # Remove an entire row: (row_index,)
36
+ "STANDARDIZE_COL", # Normalize an entire column's format: (column,)
37
+ "FILL_MISSING", # Fill NaN values in a column: (column, fill_strategy)
38
+ "DONE", # Agent signals episode is complete: ()
39
+ ]
40
+
41
+ VALID_FILL_STRATEGIES = Literal["mean", "median", "mode", "drop"]
42
+
43
+ VALID_TASK_IDS = Literal["easy", "medium", "hard"]
44
+
45
+
46
+ # ─────────────────────────────────────────────────────────────────────────────
47
+ # CleanAction
48
+ # ─────────────────────────────────────────────────────────────────────────────
49
+
50
+ class CleanAction(Action):
51
+ """Action sent by the agent each step.
52
+
53
+ The ``command`` field selects the operation. Depending on command,
54
+ only a subset of the remaining fields are required:
55
+
56
+ +-----------------+------------+--------+-------+---------------+
57
+ | command | row_index | column | value | fill_strategy |
58
+ +=================+============+========+=======+===============+
59
+ | SET_VALUE | required | req | req | — |
60
+ | DROP_ROW | required | — | — | — |
61
+ | STANDARDIZE_COL | — | req | — | — |
62
+ | FILL_MISSING | — | req | — | required |
63
+ | DONE | — | — | — | — |
64
+ +-----------------+------------+--------+-------+---------------+
65
+
66
+ Example (fix a single cell)::
67
+
68
+ CleanAction(
69
+ command="SET_VALUE",
70
+ row_index=3,
71
+ column="price",
72
+ value="29.99",
73
+ )
74
+
75
+ Example (drop a whole row)::
76
+
77
+ CleanAction(command="DROP_ROW", row_index=17)
78
+
79
+ Example (fill all NaN in a column with the median)::
80
+
81
+ CleanAction(
82
+ command="FILL_MISSING",
83
+ column="quantity",
84
+ fill_strategy="median",
85
+ )
86
+ """
87
+
88
+ command: VALID_COMMANDS = Field(
89
+ ...,
90
+ description=(
91
+ "Operation to perform. One of: SET_VALUE, DROP_ROW, "
92
+ "STANDARDIZE_COL, FILL_MISSING, DONE."
93
+ ),
94
+ )
95
+
96
+ row_index: Optional[int] = Field(
97
+ default=None,
98
+ ge=0,
99
+ description=(
100
+ "Zero-based row index to target. "
101
+ "Required for SET_VALUE and DROP_ROW."
102
+ ),
103
+ )
104
+
105
+ column: Optional[str] = Field(
106
+ default=None,
107
+ min_length=1,
108
+ description=(
109
+ "Name of the column to target. "
110
+ "Required for SET_VALUE, STANDARDIZE_COL, and FILL_MISSING."
111
+ ),
112
+ )
113
+
114
+ value: Optional[str] = Field(
115
+ default=None,
116
+ description=(
117
+ "New cell value as a string. "
118
+ "Required for SET_VALUE. The environment casts this to the "
119
+ "column's expected dtype (e.g. '29.99' → float for a price column)."
120
+ ),
121
+ )
122
+
123
+ fill_strategy: Optional[VALID_FILL_STRATEGIES] = Field(
124
+ default=None,
125
+ description=(
126
+ "Strategy for FILL_MISSING. One of: mean, median, mode, drop. "
127
+ "'drop' removes rows where the column is NaN."
128
+ ),
129
+ )
130
+
131
+ @model_validator(mode="after")
132
+ def _check_required_fields(self) -> "CleanAction":
133
+ """Ensure each command has exactly the fields it needs."""
134
+ cmd = self.command
135
+
136
+ if cmd == "SET_VALUE":
137
+ missing = []
138
+ if self.row_index is None:
139
+ missing.append("row_index")
140
+ if self.column is None:
141
+ missing.append("column")
142
+ if self.value is None:
143
+ missing.append("value")
144
+ if missing:
145
+ raise ValueError(
146
+ f"SET_VALUE requires: {', '.join(missing)}"
147
+ )
148
+
149
+ elif cmd == "DROP_ROW":
150
+ if self.row_index is None:
151
+ raise ValueError("DROP_ROW requires row_index")
152
+
153
+ elif cmd == "STANDARDIZE_COL":
154
+ if self.column is None:
155
+ raise ValueError("STANDARDIZE_COL requires column")
156
+
157
+ elif cmd == "FILL_MISSING":
158
+ missing = []
159
+ if self.column is None:
160
+ missing.append("column")
161
+ if self.fill_strategy is None:
162
+ missing.append("fill_strategy")
163
+ if missing:
164
+ raise ValueError(
165
+ f"FILL_MISSING requires: {', '.join(missing)}"
166
+ )
167
+
168
+ # DONE requires nothing — always valid
169
+
170
+ return self
171
+
172
+ @field_validator("row_index")
173
+ @classmethod
174
+ def _non_negative_row(cls, v: Optional[int]) -> Optional[int]:
175
+ if v is not None and v < 0:
176
+ raise ValueError(f"row_index must be >= 0, got {v}")
177
+ return v
178
+
179
+
180
+ # ─────────────────────────────────────────────────────────────────────────────
181
+ # CleanObservation
182
+ # ─────────────────────────────────────────────────────────────────────────────
183
+
184
+ class CleanObservation(Observation):
185
+ """Observation returned to the agent after each step (and at reset).
186
+
187
+ The agent sees the full current state of the dirty CSV at every step
188
+ so it can decide what to fix next. This is intentionally verbose —
189
+ passing the whole CSV string keeps the environment stateless from the
190
+ agent's perspective (no hidden memory needed).
191
+
192
+ Inherited from Observation (do NOT redeclare these):
193
+ done: bool — True when the episode has ended
194
+ reward: float | None — per-step reward (None at reset)
195
+ metadata: Dict[str, Any] — extra info (unused by core loop)
196
+ """
197
+
198
+ # ── Task context (set at reset, constant for the episode) ────────────────
199
+
200
+ task_id: VALID_TASK_IDS = Field(
201
+ ...,
202
+ description="Which task is active: 'easy', 'medium', or 'hard'.",
203
+ )
204
+
205
+ schema_hint: str = Field(
206
+ ...,
207
+ description=(
208
+ "Plain-English description of the target schema. "
209
+ "Tells the agent what the clean data should look like."
210
+ ),
211
+ )
212
+
213
+ initial_dirty_cells: int = Field(
214
+ ...,
215
+ ge=0,
216
+ description=(
217
+ "Total number of cells that differed from ground truth at episode start. "
218
+ "Used to compute a normalised progress score."
219
+ ),
220
+ )
221
+
222
+ # ── Per-step state ───────────────────────────────────────────────────────
223
+
224
+ dirty_csv: str = Field(
225
+ ...,
226
+ description=(
227
+ "Full current state of the working DataFrame serialised as a CSV string. "
228
+ "This reflects all changes the agent has made so far this episode."
229
+ ),
230
+ )
231
+
232
+ current_score: float = Field(
233
+ default=0.0,
234
+ ge=0.0,
235
+ le=1.0,
236
+ description=(
237
+ "Grader score after the last action (0.0 = no cells correct, "
238
+ "1.0 = perfect match with ground truth)."
239
+ ),
240
+ )
241
+
242
+ issues_remaining: int = Field(
243
+ default=0,
244
+ ge=0,
245
+ description=(
246
+ "Approximate count of cells still differing from ground truth. "
247
+ "Convenience field — agents can also derive this from the CSV."
248
+ ),
249
+ )
250
+
251
+ step_number: int = Field(
252
+ default=0,
253
+ ge=0,
254
+ description="How many steps have been taken in this episode so far.",
255
+ )
256
+
257
+ max_steps: int = Field(
258
+ ...,
259
+ ge=1,
260
+ description="Maximum steps allowed for this task before forced termination.",
261
+ )
262
+
263
+ # ── Last-action feedback ────────────────────────────────────────────────
264
+
265
+ last_action_success: bool = Field(
266
+ default=True,
267
+ description=(
268
+ "Whether the last action was applied without errors. "
269
+ "False if the column/row didn't exist, value couldn't be cast, etc."
270
+ ),
271
+ )
272
+
273
+ last_action_error: Optional[str] = Field(
274
+ default=None,
275
+ description=(
276
+ "Error message if last_action_success is False, else None. "
277
+ "Helps the agent self-correct."
278
+ ),
279
+ )
280
+
281
+ @field_validator("current_score")
282
+ @classmethod
283
+ def _round_score(cls, v: float) -> float:
284
+ return round(v, 4)
285
+
286
+
287
+ # ─────────────────────────────────────────────────────────────────────────────
288
+ # CleanState
289
+ # ─────────────────────────────────────────────────────────────────────────────
290
+
291
+ class CleanState(State):
292
+ """Internal server-side state. Never sent to the agent directly.
293
+
294
+ Holds the live DataFrames, ground truth, and grader metadata.
295
+ Because State uses extra="allow", we can store arbitrary fields
296
+ without listing them in the JSON schema.
297
+
298
+ Inherited from State:
299
+ episode_id: Optional[str] — unique episode identifier
300
+ step_count: int — steps taken this episode (ge=0)
301
+ """
302
+
303
+ # ── Task identity ────────────────────────────────────────────────────────
304
+
305
+ task_id: str = Field(
306
+ default="easy",
307
+ description="Active task: 'easy', 'medium', or 'hard'.",
308
+ )
309
+
310
+ # ── DataFrame snapshots (stored as CSV strings for serialisation) ────────
311
+ # NOTE: The environment keeps live pd.DataFrame objects in instance vars.
312
+ # These string fields are the serialised snapshots used by state() calls
313
+ # and for WebSocket state responses.
314
+
315
+ dirty_csv_snapshot: str = Field(
316
+ default="",
317
+ description="Current working DataFrame serialised to CSV string.",
318
+ )
319
+
320
+ clean_csv_snapshot: str = Field(
321
+ default="",
322
+ description="Ground-truth clean DataFrame serialised to CSV string.",
323
+ )
324
+
325
+ # ── Scoring ──────────────────────────────────────────────────────────────
326
+
327
+ initial_dirty_cells: int = Field(
328
+ default=0,
329
+ ge=0,
330
+ description="Dirty cell count at episode start (denominator for progress).",
331
+ )
332
+
333
+ current_score: float = Field(
334
+ default=0.0,
335
+ ge=0.0,
336
+ le=1.0,
337
+ description="Grader score after the last step.",
338
+ )
339
+
340
+ previous_score: float = Field(
341
+ default=0.0,
342
+ ge=0.0,
343
+ le=1.0,
344
+ description="Grader score before the last step (for reward delta).",
345
+ )
346
+
347
+ # ── Task metadata (passed through from TaskDataset.metadata) ─────────────
348
+ # Contains grader-specific ground truth: outlier_rows, canonical_lookup, etc.
349
+
350
+ task_metadata: Dict[str, Any] = Field(
351
+ default_factory=dict,
352
+ description=(
353
+ "Task-specific metadata from dataset_factory.TaskDataset.metadata. "
354
+ "Contains grader ground truth (outlier_rows, duplicate_pairs, etc.)."
355
+ ),
356
+ )
357
+
358
+ # ── Schema hint (echoed in observations) ────────────────────────────────
359
+
360
+ schema_hint: str = Field(
361
+ default="",
362
+ description="Plain-English schema description for this task.",
363
+ )
364
+
365
+ # ── Per-task step budget ─────────────────────────────────────────────────
366
+
367
+ max_steps: int = Field(
368
+ default=40,
369
+ ge=1,
370
+ description="Maximum steps for this task (40 / 80 / 150 for easy/medium/hard).",
371
+ )
372
+
373
+ @field_validator("current_score", "previous_score")
374
+ @classmethod
375
+ def _clamp_score(cls, v: float) -> float:
376
+ return round(max(0.0, min(1.0, v)), 4)
377
+
378
+
379
+ # ── Step budget constants ─────────────────────────────────────────────────────
380
+
381
+ MAX_STEPS: Dict[str, int] = {
382
+ "easy": 40,
383
+ "medium": 80,
384
+ "hard": 150,
385
+ }
386
+
387
+ # Done threshold: score at which the agent is considered successful
388
+ DONE_THRESHOLD: Dict[str, float] = {
389
+ "easy": 0.95,
390
+ "medium": 0.85,
391
+ "hard": 0.80,
392
+ }
393
+
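Taken together, MAX_STEPS and DONE_THRESHOLD determine when an episode ends. A minimal sketch of that termination check, assuming the constants above (`is_done` is an illustrative name, not part of this module):

```python
MAX_STEPS = {"easy": 40, "medium": 80, "hard": 150}
DONE_THRESHOLD = {"easy": 0.95, "medium": 0.85, "hard": 0.80}

def is_done(task_id: str, score: float, step_count: int) -> bool:
    # Episode ends when the grader score crosses the task threshold
    # or the per-task step budget is exhausted.
    return score >= DONE_THRESHOLD[task_id] or step_count >= MAX_STEPS[task_id]
```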
394
+
395
+ # ── Smoke test ────────────────────────────────────────────────────────────────
396
+
397
+ if __name__ == "__main__":
398
+ import json
399
+
400
+ print("── CleanAction examples ──────────────────────────────────────")
401
+
402
+ a1 = CleanAction(command="SET_VALUE", row_index=3, column="price", value="29.99")
403
+ print("SET_VALUE: ", a1.model_dump())
404
+
405
+ a2 = CleanAction(command="DROP_ROW", row_index=17)
406
+ print("DROP_ROW: ", a2.model_dump())
407
+
408
+ a3 = CleanAction(command="FILL_MISSING", column="quantity", fill_strategy="median")
409
+ print("FILL_MISSING: ", a3.model_dump())
410
+
411
+ a4 = CleanAction(command="STANDARDIZE_COL", column="order_date")
412
+ print("STANDARDIZE_COL:", a4.model_dump())
413
+
414
+ a5 = CleanAction(command="DONE")
415
+ print("DONE: ", a5.model_dump())
416
+
417
+ # Validation: SET_VALUE without row_index should fail
418
+ print("\n── Validation ────────────────────────────────────────────────")
419
+ try:
420
+ CleanAction(command="SET_VALUE", column="price", value="10.0")
421
+ except Exception as e:
422
+ print(f"Expected error (missing row_index): {e}")
+ else:
+ print("ERROR: SET_VALUE without row_index was not rejected")
423
+
424
+ try:
425
+ CleanAction(command="FILL_MISSING", column="price")
426
+ except Exception as e:
427
+ print(f"Expected error (missing fill_strategy): {e}")
+ else:
+ print("ERROR: FILL_MISSING without fill_strategy was not rejected")
428
+
429
+ print("\n── CleanObservation ──────────────────────────────────────────")
430
+ obs = CleanObservation(
431
+ task_id="easy",
432
+ schema_hint="Sales orders dataset. price must be float.",
433
+ initial_dirty_cells=29,
434
+ dirty_csv="order_id,price\n1001,N/A\n1002,19.99",
435
+ current_score=0.0,
436
+ issues_remaining=29,
437
+ step_number=0,
438
+ max_steps=40,
439
+ done=False,
440
+ reward=None,
441
+ )
442
+ print(json.dumps(obs.model_dump(), indent=2))
443
+
444
+ print("\n── CleanState ────────────────────────────────────────────────")
445
+ state = CleanState(
446
+ episode_id="ep-001",
447
+ step_count=0,
448
+ task_id="easy",
449
+ dirty_csv_snapshot="order_id,price\n1001,N/A",
450
+ clean_csv_snapshot="order_id,price\n1001,14.99",
451
+ initial_dirty_cells=29,
452
+ current_score=0.0,
453
+ previous_score=0.0,
454
+ task_metadata={"injected_cells": [(0, "price")]},
455
+ schema_hint="Sales orders dataset.",
456
+ max_steps=40,
457
+ )
458
+ print(json.dumps(state.model_dump(), indent=2))
459
+
460
+ print("\n── JSON schemas ──────────────────────────────────────────────")
461
+ print("Action schema keys: ", list(CleanAction.model_json_schema()["properties"].keys()))
462
+ print("Observation schema keys:", list(CleanObservation.model_json_schema()["properties"].keys()))
463
+ print("State schema keys: ", list(CleanState.model_json_schema()["properties"].keys()))
openenv.yaml ADDED
@@ -0,0 +1,151 @@
1
+ # openenv.yaml
2
+ # ─────────────────────────────────────────────────────────────────────────────
3
+ # Manifest for the Data Cleaning Pipeline OpenEnv environment.
4
+ #
5
+ # Field reference
6
+ # ───────────────
7
+ # Required by the CLI (serve / build / push / validate):
8
+ # spec_version — always 1 for this generation of the spec
9
+ # name — environment identifier used by the CLI and auto-discovery
10
+ # type — "space" means it can be deployed as a Hugging Face Space
11
+ # runtime — "fastapi" tells the server how to boot
12
+ # app — Python import path to the FastAPI app object
13
+ # port — port the server listens on inside the container
14
+ #
15
+ # Read by AutoEnv auto-discovery (openenv.auto._discovery):
16
+ # name — maps to env_key after stripping the "_env" suffix
17
+ # description — human-readable label shown in env listings
18
+ # spec_version — stored in EnvironmentInfo for introspection
19
+ # action — EXPLICIT override of the auto-inferred class name
20
+ # observation — EXPLICIT override of the auto-inferred class name
21
+ #
22
+ # NOTE on action / observation overrides:
23
+ # Auto-discovery infers class names from the env name using PascalCase:
24
+ # "data_cleaning_env" → base "data_cleaning" → "DataCleaningAction"
25
+ # Our actual class is named "CleanAction" (not "DataCleaningAction"),
26
+ # so these fields MUST be set to avoid ImportError on AutoEnv.from_env().
27
+ #
28
+ # All other fields (tasks, reward, tags) are informational. They are not
29
+ # parsed by the current OpenEnv tooling but are preserved in
30
+ # EnvironmentInfo.manifest and available to the web UI and external tools.
31
+ # ─────────────────────────────────────────────────────────────────────────────
32
+
33
+ # ── Core deployment fields ────────────────────────────────────────────────────
34
+
35
+ spec_version: 1
36
+ name: data_cleaning_env
37
+ type: space
38
+ runtime: fastapi
39
+ app: server.app:app
40
+ port: 8000
41
+
42
+ # ── Package metadata ──────────────────────────────────────────────────────────
43
+
44
+ version: "1.0.0"
45
+
46
+ description: >-
47
+ Data cleaning pipeline: the agent receives a dirty CSV and must detect
48
+ and fix type errors, missing values, outliers, and schema inconsistencies
49
+ to match a hidden ground-truth dataset. Three tasks (easy → medium → hard)
50
+ with a deterministic grader that returns a continuous score in [0.0, 1.0].
51
+
52
+ # ── Auto-discovery class overrides ───────────────────────────────────────────
53
+ # These override auto-inferred names (which would be DataCleaningAction /
54
+ # DataCleaningObservation) to match the actual class names defined in models.py.
55
+
56
+ action: CleanAction
57
+ observation: CleanObservation
58
+
59
+ # The client class is correctly inferred as DataCleaningEnv (data_cleaning →
60
+ # DataCleaning + Env), which matches client.py, so no override is needed.
61
+
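The PascalCase inference the comments above describe might look like the following sketch (an assumption about how openenv.auto._discovery behaves, not its actual code):

```python
def infer_class_names(env_name: str) -> tuple[str, str]:
    # "data_cleaning_env" → base "data_cleaning" → "DataCleaning"
    base = env_name.removesuffix("_env")
    pascal = "".join(part.capitalize() for part in base.split("_"))
    return f"{pascal}Action", f"{pascal}Observation"
```

For this environment the inferred names would be DataCleaningAction / DataCleaningObservation, which is why the explicit action/observation overrides above are required.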
62
+ # ── Tags (informational) ──────────────────────────────────────────────────────
63
+
64
+ tags:
65
+ - data-cleaning
66
+ - tabular
67
+ - real-world
68
+ - hackathon
69
+
70
+ # ── Task manifest (informational) ─────────────────────────────────────────────
71
+ # One entry per task. These values mirror the constants in models.py
72
+ # (MAX_STEPS, DONE_THRESHOLD) and the descriptions in dataset_factory.py.
73
+
74
+ tasks:
75
+ - id: easy
76
+ name: Fix obvious errors
77
+ description: >-
78
+ 50-row sales CSV with 29 injected dirty cells: 10 type mismatches
79
+ (text in numeric columns), 8 missing values, 5 far-future dates
80
+ (year 2099), and 6 cells with leading/trailing whitespace.
81
+ Graded by exact cell-level match against the ground truth (0.0–1.0).
82
+ dataset_rows: 50
83
+ dirty_cells: 29
84
+ max_steps: 40
85
+ done_threshold: 0.95
86
+
87
+ - id: medium
88
+ name: Outlier detection without false positives
89
+ description: >-
90
+ 200-row customer transaction CSV with 15 true statistical outliers
91
+ (negative or > $2000 amounts) that must be fixed or removed, 5 valid
92
+ large transactions ($900–$2000) that must NOT be removed, and 12
93
+ category spelling typos. Graded by F1 score on outlier detection
94
+ (0.5 weight) and typo correction rate (0.5 weight).
95
+ dataset_rows: 200
96
+ dirty_cells: 27
97
+ max_steps: 80
98
+ done_threshold: 0.85
99
+
100
+ - id: hard
101
+ name: Multi-source schema normalisation and deduplication
102
+ description: >-
103
+ 430-row CSV (400 clean + 30 duplicates) merged from 3 fictional data
104
+ sources with inconsistent column naming (e.g. cust_id / customer_id /
105
+ CustomerID), mixed date formats (ISO, US, EU), and ~30 duplicate rows
106
+ (exact and near-duplicate). Agent must infer the canonical 9-column
107
+ schema without explicit documentation. Graded by schema match (40%),
108
+ deduplication F1 (35%), and date format compliance (25%).
109
+ dataset_rows: 430
110
+ canonical_rows: 400
111
+ canonical_columns: 9
112
+ duplicate_rows: 30
113
+ max_steps: 150
114
+ done_threshold: 0.80
115
+
116
+ # ── Reward function summary (informational) ───────────────────────────────────
117
+
118
+ reward:
119
+ type: dense
120
+ range: [-0.5, 1.0]
121
+ step_cost: -0.005
122
+ components:
123
+ - name: progress
124
+ weight: primary
125
+ description: >-
126
+ Grader score delta each step (curr_score − prev_score).
127
+ The main learning signal — any cell fixed produces a non-zero reward.
128
+
129
+ - name: efficiency_bonus
130
+ weight: "+0.10 × (1 − step_fraction)"
131
+ description: >-
132
+ Small bonus awarded the step the episode is solved (score crosses
133
+ done_threshold). Rewards finishing early relative to the step budget.
134
+
135
+ - name: false_positive_penalty
136
+ weight: -0.15
137
+ description: >-
138
+ Applied when DROP_ROW removes a valid-extreme row in the medium task.
139
+ Penalises aggressive deletion without checking schema_hint.
140
+
141
+ - name: early_done_penalty
142
+ weight: -0.20
143
+ description: >-
144
+ Applied when the agent sends DONE with current_score < 0.60.
145
+ Discourages giving up prematurely.
146
+
147
+ - name: step_cost
148
+ weight: -0.005
149
+ description: >-
150
+ Fixed cost every step regardless of outcome.
151
+ Prevents infinite loops and padding.
openenv_data_cleaning_env.egg-info/PKG-INFO ADDED
@@ -0,0 +1,13 @@
1
+ Metadata-Version: 2.4
2
+ Name: openenv-data_cleaning_env
3
+ Version: 0.1.0
4
+ Summary: Data Cleaning Env environment for OpenEnv
5
+ Requires-Python: >=3.10
6
+ Requires-Dist: openenv-core
7
+ Requires-Dist: pandas>=2.0
8
+ Requires-Dist: numpy>=1.24
9
+ Requires-Dist: fastapi
10
+ Requires-Dist: uvicorn
11
+ Provides-Extra: dev
12
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
13
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_data_cleaning_env.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,21 @@
1
+ README.md
2
+ __init__.py
3
+ client.py
4
+ dataset_factory.py
5
+ graders.py
6
+ models.py
7
+ pyproject.toml
8
+ ./__init__.py
9
+ ./client.py
10
+ ./dataset_factory.py
11
+ ./graders.py
12
+ ./models.py
13
+ openenv_data_cleaning_env.egg-info/PKG-INFO
14
+ openenv_data_cleaning_env.egg-info/SOURCES.txt
15
+ openenv_data_cleaning_env.egg-info/dependency_links.txt
16
+ openenv_data_cleaning_env.egg-info/entry_points.txt
17
+ openenv_data_cleaning_env.egg-info/requires.txt
18
+ openenv_data_cleaning_env.egg-info/top_level.txt
19
+ server/__init__.py
20
+ server/app.py
21
+ server/data_cleaning_env.py
openenv_data_cleaning_env.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
1
+
openenv_data_cleaning_env.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
1
+ [console_scripts]
2
+ server = data_cleaning_env.server.app:main
openenv_data_cleaning_env.egg-info/requires.txt ADDED
@@ -0,0 +1,9 @@
1
+ openenv-core
2
+ pandas>=2.0
3
+ numpy>=1.24
4
+ fastapi
5
+ uvicorn
6
+
7
+ [dev]
8
+ pytest>=8.0.0
9
+ pytest-cov>=4.0.0
openenv_data_cleaning_env.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
1
+ data_cleaning_env
pyproject.toml ADDED
@@ -0,0 +1,48 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ [build-system]
8
+ requires = ["setuptools>=45", "wheel"]
9
+ build-backend = "setuptools.build_meta"
10
+
11
+ [project]
12
+ name = "openenv-data_cleaning_env"
13
+ version = "0.1.0"
14
+ description = "Data Cleaning Env environment for OpenEnv"
15
+ requires-python = ">=3.10"
16
+ dependencies = [
17
+ # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
18
+ # install from github
19
+ # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
20
+ # Environment-specific dependencies
21
+ # Add all dependencies needed for your environment here
22
+ # Examples:
23
+ # "torch>=2.0.0",
24
+ # "gymnasium>=0.29.0",
25
+ # "openspiel>=1.0.0",
26
+ # "smolagents>=1.22.0,<2",
27
+ "openenv-core",
28
+ "pandas>=2.0",
29
+ "numpy>=1.24",
30
+ "fastapi",
31
+ "uvicorn",
32
+ ]
33
+
34
+ [project.optional-dependencies]
35
+ dev = [
36
+ "pytest>=8.0.0",
37
+ "pytest-cov>=4.0.0",
38
+ ]
39
+
40
+ [project.scripts]
41
+ # Server entry point - enables running via: uv run --project . server
42
+ # or: python -m data_cleaning_env.server.app
43
+ server = "data_cleaning_env.server.app:main"
44
+
45
+ [tool.setuptools]
46
+ include-package-data = true
47
+ packages = ["data_cleaning_env", "data_cleaning_env.server"]
48
+ package-dir = { "data_cleaning_env" = ".", "data_cleaning_env.server" = "server" }
server/__init__.py ADDED
@@ -0,0 +1,11 @@
1
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
2
+ # All rights reserved.
3
+ #
4
+ # This source code is licensed under the BSD-style license found in the
5
+ # LICENSE file in the root directory of this source tree.
6
+
7
+ """Data Cleaning Env environment server components."""
8
+
9
+ from .data_cleaning_env import DataCleaningEnvironment
10
+
11
+ __all__ = ["DataCleaningEnvironment"]
server/app.py ADDED
@@ -0,0 +1,25 @@
1
+ try:
2
+ from openenv.core.env_server import create_app
3
+ from ..models import CleanAction, CleanObservation
4
+ from .data_cleaning_env import DataCleaningEnvironment
5
+ except ImportError:
6
+ from openenv.core.env_server import create_app
7
+ from models import CleanAction, CleanObservation
8
+ from server.data_cleaning_env import DataCleaningEnvironment
9
+
10
+ app = create_app(
11
+ DataCleaningEnvironment, # class, not instance
12
+ CleanAction,
13
+ CleanObservation,
14
+ env_name="data_cleaning_env",
15
+ )
16
+
17
+
18
+ def main() -> None:
19
+ """Entry point for openenv serve / uv run / python -m."""
20
+ import uvicorn
21
+ uvicorn.run(app, host="0.0.0.0", port=8000)
22
+
23
+
24
+ if __name__ == "__main__":
25
+ main()
server/data_cleaning_env.py ADDED
@@ -0,0 +1,827 @@
1
+ """
2
+ server/data_cleaning_env.py
3
+ ---------------------------
4
+ DataCleaningEnvironment — the heart of the environment.
5
+
6
+ Implements the three abstract methods from openenv.core.env_server.interfaces.Environment:
7
+ reset(seed, episode_id, **kwargs) -> CleanObservation
8
+ step(action, timeout_s, **kwargs) -> CleanObservation
9
+ state (property) -> CleanState
10
+
11
+ Architecture
12
+ ------------
13
+ Live DataFrames (_dirty_df, _clean_df) live as instance variables for speed.
14
+ CleanState holds lightweight CSV snapshots used only for WebSocket state()
15
+ responses — not for every step. This avoids serialising a 400-row DataFrame
16
+ on every call.
17
+
18
+ Action dispatch
19
+ ---------------
20
+ Each CleanAction.command routes to a private _apply_* method that mutates
21
+ _dirty_df in place. Errors in those methods (bad column name, out-of-bounds
22
+ row) are caught and returned as (success=False, error_msg=...) so the agent
23
+ gets corrective feedback instead of a 500.
24
+
25
+ Reward
26
+ ------
27
+ compute_reward() implements the dense reward formula designed in the plan:
28
+ progress term — grader score delta (main signal)
29
+ efficiency bonus — small reward for early completion
30
+ false-positive penalty — for dropping a valid-extreme row (medium task)
31
+ early-DONE penalty — for calling DONE with a low score
32
+ step cost — -0.005 every step to discourage padding
33
+ """
34
+
35
+ from __future__ import annotations
36
+
37
+ import sys
38
+ import os
39
+ from typing import Any, Optional
40
+ from uuid import uuid4
41
+
42
+ import numpy as np
43
+ import pandas as pd
44
+
45
+ # ── OpenEnv imports (openenv-core must be installed) ─────────────────────────
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import EnvironmentMetadata
52
+
53
+ # ── Local imports (try relative → absolute for both server and standalone) ───
54
+ try:
55
+ from ..models import (
56
+ CleanAction, CleanObservation, CleanState,
57
+ MAX_STEPS, DONE_THRESHOLD,
58
+ )
59
+ from ..dataset_factory import make_dataset, TaskDataset
60
+ from ..graders import grade, GradeResult
61
+ except ImportError:
62
+ try:
63
+ from models import (
64
+ CleanAction, CleanObservation, CleanState,
65
+ MAX_STEPS, DONE_THRESHOLD,
66
+ )
67
+ from dataset_factory import make_dataset, TaskDataset
68
+ from graders import grade, GradeResult
69
+ except ImportError:
70
+ sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))
71
+ from models import (
72
+ CleanAction, CleanObservation, CleanState,
73
+ MAX_STEPS, DONE_THRESHOLD,
74
+ )
75
+ from dataset_factory import make_dataset, TaskDataset
76
+ from graders import grade, GradeResult
77
+
78
+
79
+ # ── Constants ─────────────────────────────────────────────────────────────────
80
+
81
+ # Per-step cost that discourages infinite loops / padding
82
+ STEP_COST = -0.005
83
+
84
+ # Penalty for calling DONE before the score is reasonable
85
+ EARLY_DONE_PENALTY = -0.20
86
+ EARLY_DONE_THRESHOLD = 0.60 # DONE below this score triggers the penalty
87
+
88
+ # Penalty for removing a valid-extreme row in the medium task
89
+ FALSE_POSITIVE_PENALTY = -0.15
90
+
91
+ # Efficiency bonus multiplier (only awarded when episode is solved)
92
+ EFFICIENCY_BONUS_WEIGHT = 0.10
93
+
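A minimal sketch of how the constants above might combine into the dense reward described in the module docstring (an illustrative shape only; the environment's own `_compute_reward` method below is authoritative and may differ in detail):

```python
STEP_COST = -0.005
EARLY_DONE_PENALTY = -0.20
EARLY_DONE_THRESHOLD = 0.60
FALSE_POSITIVE_PENALTY = -0.15
EFFICIENCY_BONUS_WEIGHT = 0.10

def compute_reward(prev_score: float, curr_score: float, step_fraction: float,
                   solved: bool, done_signal: bool, was_false_positive: bool) -> float:
    # Progress term (the main signal) plus a fixed per-step cost.
    reward = (curr_score - prev_score) + STEP_COST
    if solved:
        # One-off bonus for finishing early relative to the step budget.
        reward += EFFICIENCY_BONUS_WEIGHT * (1.0 - step_fraction)
    if was_false_positive:
        # Dropped a valid-extreme row (medium task).
        reward += FALSE_POSITIVE_PENALTY
    if done_signal and curr_score < EARLY_DONE_THRESHOLD:
        # Agent gave up with a low score.
        reward += EARLY_DONE_PENALTY
    return reward
```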
94
+ # Date formats the STANDARDIZE_COL handler will try, in priority order
95
+ _DATE_PARSE_FORMATS = [
96
+ "%Y-%m-%d", # ISO — most reliable, try first
97
+ "%m/%d/%Y", # US
98
+ "%d.%m.%Y", # EU
99
+ "%d/%m/%Y", # EU alt
100
+ "%Y/%m/%d", # Asian
101
+ ]
102
+
103
+
104
+ # ─────────────────────────────────────────────────────────────────────────────
105
+ # DataCleaningEnvironment
106
+ # ─────────────────────────────────────────────────────────────────────────────
107
+
108
+ class DataCleaningEnvironment(Environment):
109
+ """
110
+ Gym-style environment for the data cleaning pipeline task.
111
+
112
+ Each episode:
113
+ 1. reset(task_id="easy"|"medium"|"hard") loads a dirty/clean CSV pair.
114
+ 2. The agent calls step() repeatedly, each time sending a CleanAction.
115
+ 3. The episode ends when the agent sends DONE, the score crosses the
116
+ task threshold, or the step budget is exhausted.
117
+
118
+ The environment is fully stateless between sessions — all mutable state
119
+ lives in instance variables, so concurrent sessions each get their own
120
+ isolated copy (SUPPORTS_CONCURRENT_SESSIONS = True).
121
+ """
122
+
123
+ SUPPORTS_CONCURRENT_SESSIONS = True
124
+
125
+ def __init__(self) -> None:
126
+ super().__init__()
127
+
128
+ # Live DataFrames — mutated by each step()
129
+ self._dirty_df: Optional[pd.DataFrame] = None
130
+ self._clean_df: Optional[pd.DataFrame] = None
131
+
132
+ # Full task dataset from dataset_factory (holds metadata for grader)
133
+ self._dataset: Optional[TaskDataset] = None
134
+
135
+ # Pydantic state (lightweight; updated on demand)
136
+ self._state: Optional[CleanState] = None
137
+
138
+ # ─────────────────────────────────────────────────────────────────────────
139
+ # reset()
140
+ # ─────────────────────────────────────────────────────────────────────────
141
+
142
+ def reset(
143
+ self,
144
+ seed: Optional[int] = None,
145
+ episode_id: Optional[str] = None,
146
+ task_id: str = "easy",
147
+ **kwargs: Any,
148
+ ) -> CleanObservation:
149
+ """
150
+ Reset the environment for a new episode.
151
+
152
+ Parameters
153
+ ----------
154
+ seed
155
+ Ignored — datasets use fixed seeds per task for reproducibility.
156
+ episode_id
157
+ Optional; auto-generated if not provided.
158
+ task_id
159
+ Which task to load: "easy", "medium", or "hard".
160
+ """
161
+ if task_id not in MAX_STEPS:
162
+ raise ValueError(
163
+ f"Unknown task_id {task_id!r}. Must be one of: {list(MAX_STEPS)}"
164
+ )
165
+
166
+ # Load dataset (always deterministic via fixed seed in dataset_factory)
167
+ self._dataset = make_dataset(task_id)
168
+ self._dirty_df = self._dataset.dirty_df.copy(deep=True)
169
+ self._clean_df = self._dataset.clean_df.copy(deep=True)
170
+
171
+ max_steps = MAX_STEPS[task_id]
172
+
173
+ # Run grader on the initial dirty state so we have a starting score
174
+ initial_result = grade(
175
+ task_id=task_id,
176
+ agent_df=self._dirty_df,
177
+ clean_df=self._clean_df,
178
+ metadata=self._dataset.metadata,
179
+ initial_dirty_cells=self._dataset.total_dirty_cells,
180
+ )
181
+
182
+ self._state = CleanState(
183
+ episode_id=episode_id or str(uuid4()),
184
+ step_count=0,
185
+ task_id=task_id,
186
+ dirty_csv_snapshot=self._df_to_csv(self._dirty_df),
187
+ clean_csv_snapshot=self._df_to_csv(self._clean_df),
188
+ initial_dirty_cells=self._dataset.total_dirty_cells,
189
+ current_score=initial_result.score,
190
+ previous_score=0.0,
191
+ task_metadata=self._dataset.metadata,
192
+ schema_hint=self._dataset.schema_hint,
193
+ max_steps=max_steps,
194
+ )
195
+
196
+ return self._build_observation(
197
+ reward=None,
198
+ done=False,
199
+ last_action_success=True,
200
+ last_action_error=None,
201
+ grader_result=initial_result,
202
+ )
203
+
204
+ # ─────────────────────────────────────────────────────────────────────────
205
+ # step()
206
+ # ─────────────────────────────────────────────────────────────────────────
207
+
208
+ def step(
209
+ self,
210
+ action: CleanAction,
211
+ timeout_s: Optional[float] = None,
212
+ **kwargs: Any,
213
+ ) -> CleanObservation:
214
+ """
215
+ Apply one CleanAction and return the resulting observation.
216
+
217
+ Never raises for bad action inputs — instead returns
218
+ last_action_success=False with a descriptive error message so the
219
+ agent can self-correct on the next step.
220
+ """
221
+ if self._state is None or self._dirty_df is None:
222
+ raise RuntimeError("Environment not initialised. Call reset() first.")
223
+
224
+ self._state.step_count += 1
225
+
226
+ # ── Save previous score before mutating ──────────────────────────────
227
+ prev_score = self._state.current_score
228
+ self._state.previous_score = prev_score
229
+
230
+ # ── DONE shortcut ────────────────────────────────────────────────────
231
+ if action.command == "DONE":
232
+ reward = self._compute_reward(
233
+ action=action,
234
+ prev_score=prev_score,
235
+ curr_score=prev_score, # score doesn't change on DONE
236
+ action_success=True,
237
+ was_false_positive=False,
238
+ )
239
+ done = True
240
+ self._state.dirty_csv_snapshot = self._df_to_csv(self._dirty_df)
241
+ return self._build_observation(
242
+ reward=reward,
243
+ done=done,
244
+ last_action_success=True,
245
+ last_action_error=None,
246
+ grader_result=GradeResult(
247
+ score=prev_score,
248
+ issues_remaining=self._state.initial_dirty_cells
249
+ - int(prev_score * self._state.initial_dirty_cells),
250
+ detail="Agent signalled DONE.",
251
+ ),
252
+ )
253
+
254
+ # ── Apply action to _dirty_df ────────────────────────────────────────
255
+ action_success, error_msg, was_false_positive = self._apply_action(action)
256
+
257
+ # ── Grade the result ──────────────────────────────────────────────────
258
+ grader_result = grade(
259
+ task_id=self._state.task_id,
260
+ agent_df=self._dirty_df,
261
+ clean_df=self._clean_df,
262
+ metadata=self._state.task_metadata,
263
+ initial_dirty_cells=self._state.initial_dirty_cells,
264
+ )
265
+ curr_score = grader_result.score
266
+ self._state.current_score = curr_score
267
+
268
+ # ── Compute reward ────────────────────────────────────────────────────
269
+ reward = self._compute_reward(
270
+ action=action,
271
+ prev_score=prev_score,
272
+ curr_score=curr_score,
273
+ action_success=action_success,
274
+ was_false_positive=was_false_positive,
275
+ )
276
+
277
+ # ── Check termination ────────────────────────────────────────────────
278
+ done = (
279
+ curr_score >= DONE_THRESHOLD[self._state.task_id]
280
+ or self._state.step_count >= self._state.max_steps
281
+ )
282
+
283
+ # ── Sync state snapshot ──────────────────────────────────────────────
284
+ self._state.dirty_csv_snapshot = self._df_to_csv(self._dirty_df)
285
+
286
+ return self._build_observation(
287
+ reward=reward,
288
+ done=done,
289
+ last_action_success=action_success,
290
+ last_action_error=error_msg,
291
+ grader_result=grader_result,
292
+ )
293
+
294
+ # ─────────────────────────────────────────────────────────────────────────
295
+ # state (property)
296
+ # ─────────────────────────────────────────────────────────────────────────
297
+
298
+ @property
299
+ def state(self) -> CleanState:
300
+ """Return the current environment state (serialisable snapshot)."""
301
+ if self._state is None:
302
+ raise RuntimeError("Environment not initialised. Call reset() first.")
303
+ # Keep snapshot fresh in case step() was called without triggering a sync
304
+ if self._dirty_df is not None:
305
+ self._state.dirty_csv_snapshot = self._df_to_csv(self._dirty_df)
306
+ return self._state
307
+
308
+ # ─────────────────────────────────────────────────────────────────────────
309
+ # Action dispatch
310
+ # ─────────────────────────────────────────────────────────────────────────
311
+
312
+ def _apply_action(
313
+ self, action: CleanAction
314
+ ) -> tuple[bool, Optional[str], bool]:
315
+ """
316
+ Mutate self._dirty_df according to the action.
317
+
318
+ Returns
319
+ -------
320
+ (success, error_msg, was_false_positive)
321
+ success — True if action applied without error
322
+ error_msg — human-readable description if success=False
323
+ was_false_positive — True if a DROP_ROW removed a valid-extreme row
324
+ """
325
+ cmd = action.command
326
+
327
+ if cmd == "SET_VALUE":
328
+ return self._apply_set_value(action)
329
+
330
+ elif cmd == "DROP_ROW":
331
+ return self._apply_drop_row(action)
332
+
333
+ elif cmd == "STANDARDIZE_COL":
334
+ return self._apply_standardize_col(action)
335
+
336
+ elif cmd == "FILL_MISSING":
337
+ return self._apply_fill_missing(action)
338
+
339
+ else:
340
+ return False, f"Unknown command: {cmd!r}", False
341
+
342
+ # ── SET_VALUE ─────────────────────────────────────────────────────────────
343
+
344
+ def _apply_set_value(
345
+ self, action: CleanAction
346
+ ) -> tuple[bool, Optional[str], bool]:
347
+ df = self._dirty_df
348
+ row_idx = action.row_index
349
+ col = action.column
350
+ val = action.value
351
+
352
+ # Validate column
353
+ if col not in df.columns:
354
+ return (
355
+ False,
356
+ f"Column {col!r} not found. Available: {list(df.columns)}",
357
+ False,
358
+ )
359
+
360
+ # Validate row index (positional)
361
+ if row_idx < 0 or row_idx >= len(df):
362
+ return (
363
+ False,
364
+ f"Row index {row_idx} out of range. DataFrame has {len(df)} rows (0–{len(df)-1}).",
365
+ False,
366
+ )
367
+
368
+ # Try to cast value to the column's expected type
369
+ cast_val, cast_err = self._cast_value(val, df, col)
370
+ if cast_err:
371
+ return False, cast_err, False
372
+
373
+ df.iloc[row_idx, df.columns.get_loc(col)] = cast_val
374
+ return True, None, False
375
+
376
+ # ── DROP_ROW ──────────────────────────────────────────────────────────────
377
+
378
+ def _apply_drop_row(
379
+ self, action: CleanAction
380
+ ) -> tuple[bool, Optional[str], bool]:
381
+ df = self._dirty_df
382
+ row_idx = action.row_index
383
+
384
+ if row_idx < 0 or row_idx >= len(df):
385
+ return (
386
+ False,
387
+ f"Row index {row_idx} out of range. DataFrame has {len(df)} rows.",
388
+ False,
389
+ )
390
+
391
+ # Detect false positive for medium task: is this a valid-extreme row?
392
+ was_false_positive = self._is_valid_extreme_row(row_idx)
393
+
394
+ # Drop the row and reset positional index so future iloc references stay valid
395
+ self._dirty_df = df.drop(df.index[row_idx]).reset_index(drop=True)
396
+ return True, None, was_false_positive
397
+
398
+ def _is_valid_extreme_row(self, iloc_idx: int) -> bool:
399
+ """
400
+ Return True if dropping this row would be a false positive.
401
+ Only applies to the medium task, which tracks valid_extreme_rows
402
+ by their original tx_id.
403
+ """
404
+ if self._state is None or self._state.task_id != "medium":
405
+ return False
406
+
407
+ valid_extreme_rows: list = self._state.task_metadata.get(
408
+ "valid_extreme_rows", []
409
+ )
410
+ if not valid_extreme_rows or self._clean_df is None:
411
+ return False
412
+
413
+ df = self._dirty_df
414
+ if "tx_id" not in df.columns:
415
+ return False
416
+
417
+ # Get the tx_id of the row being dropped
418
+ try:
419
+ tx_id_to_drop = int(df.iloc[iloc_idx]["tx_id"])
420
+ except (IndexError, ValueError, KeyError):
421
+ return False
422
+
423
+ # Check if any valid-extreme row in clean_df has this tx_id
424
+ for orig_idx in valid_extreme_rows:
425
+ if orig_idx >= len(self._clean_df):
426
+ continue
427
+ if int(self._clean_df.iloc[orig_idx]["tx_id"]) == tx_id_to_drop:
428
+ return True
429
+
430
+ return False
431
+
432
+ # ── STANDARDIZE_COL ───────────────────────────────────────────────────────
433
+
434
+ def _apply_standardize_col(
435
+ self, action: CleanAction
436
+ ) -> tuple[bool, Optional[str], bool]:
437
+ df = self._dirty_df
438
+ col = action.column
439
+
440
+ if col not in df.columns:
441
+ return (
442
+ False,
443
+ f"Column {col!r} not found. Available: {list(df.columns)}",
444
+ False,
445
+ )
446
+
447
+ series = df[col].copy()
448
+
449
+ # ── Try date normalisation first ──────────────────────────────────────
450
+ if self._looks_like_date_column(col, series):
451
+ normalised, err = self._normalise_dates(series)
452
+ if err:
453
+ return False, f"Date normalisation failed for column {col!r}: {err}", False
454
+ self._dirty_df[col] = normalised
455
+ return True, None, False
456
+
457
+ # ── Try numeric coercion ──────────────────────────────────────────────
458
+ if self._looks_like_numeric_column(col, series):
459
+ numeric = pd.to_numeric(series, errors="coerce")
460
+ # Only apply if we didn't lose more than 20% of non-null values
461
+ original_non_null = series.notna().sum()
462
+ coerced_non_null = numeric.notna().sum()
463
+ if original_non_null == 0 or coerced_non_null / original_non_null >= 0.8:
464
+ self._dirty_df[col] = numeric
465
+ return True, None, False
466
+
467
+ # ── String normalisation: strip whitespace ───────────────────────────
468
+ self._dirty_df[col] = series.apply(
469
+ lambda x: str(x).strip() if not _is_nan(x) else x
470
+ )
471
+ return True, None, False
472
+
473
+ def _looks_like_date_column(self, col: str, series: pd.Series) -> bool:
474
+ """Heuristic: column name contains 'date' or most non-null values parse as dates."""
475
+ if "date" in col.lower():
476
+ return True
477
+ sample = series.dropna().astype(str).head(5)
478
+ parsed = 0
479
+ for s in sample:
480
+ for fmt in _DATE_PARSE_FORMATS:
481
+ try:
482
+ pd.to_datetime(s, format=fmt)
483
+ parsed += 1
484
+ break
485
+ except Exception:
486
+ pass
487
+ return parsed >= max(1, len(sample) // 2)
488
+
489
+ def _looks_like_numeric_column(self, col: str, series: pd.Series) -> bool:
490
+ """Heuristic: column name or majority of values suggests numeric data."""
491
+ numeric_keywords = {"price", "amount", "value", "quantity", "qty", "count", "id", "num"}
492
+ if any(kw in col.lower() for kw in numeric_keywords):
493
+ return True
494
+ sample = series.dropna().head(10)
495
+ if len(sample) == 0:
496
+ return False
497
+ convertible = pd.to_numeric(sample, errors="coerce").notna().sum()
498
+ return convertible / len(sample) >= 0.7
499
+
500
+ def _normalise_dates(self, series: pd.Series) -> tuple[pd.Series, Optional[str]]:
501
+ """Parse dates in any supported format and reformat as YYYY-MM-DD."""
502
+ def _parse_one(x: Any) -> Any:
503
+ if _is_nan(x):
504
+ return x
505
+ s = str(x).strip()
506
+ for fmt in _DATE_PARSE_FORMATS:
507
+ try:
508
+ return pd.to_datetime(s, format=fmt).strftime("%Y-%m-%d")
509
+ except Exception:
510
+ pass
511
+ # Last resort: let pandas guess
512
+ try:
513
+ parsed = pd.to_datetime(s, dayfirst=False)
514
+ if 2000 <= parsed.year <= 2030:
515
+ return parsed.strftime("%Y-%m-%d")
516
+ except Exception:
517
+ pass
518
+ return x # leave unchanged if unparseable
519
+
520
+ return series.apply(_parse_one), None
521
+
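The per-value parser above can be exercised in isolation. A minimal sketch of the same try-each-format pattern (the format list here is an assumption for illustration; the real module defines `_DATE_PARSE_FORMATS` elsewhere):

```python
import pandas as pd

# Assumed format list; the module's _DATE_PARSE_FORMATS is defined elsewhere.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%b %d, %Y"]

def to_iso(value: str) -> str:
    """Return YYYY-MM-DD if any known format matches, else the input unchanged."""
    for fmt in DATE_FORMATS:
        try:
            return pd.to_datetime(value, format=fmt).strftime("%Y-%m-%d")
        except (ValueError, TypeError):
            continue
    return value

print(to_iso("Mar 05, 2024"))  # 2024-03-05
print(to_iso("not a date"))    # left unchanged
```

One design consequence worth noting: an ambiguous string such as "03/04/2024" resolves to whichever format appears first in the list, so format ordering matters.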
522
+ # ── FILL_MISSING ──────────────────────────────────────────────────────────
523
+
524
+ def _apply_fill_missing(
525
+ self, action: CleanAction
526
+ ) -> tuple[bool, Optional[str], bool]:
527
+ df = self._dirty_df
528
+ col = action.column
529
+ strategy = action.fill_strategy
530
+
531
+ if col not in df.columns:
532
+ return (
533
+ False,
534
+ f"Column {col!r} not found. Available: {list(df.columns)}",
535
+ False,
536
+ )
537
+
538
+ series = df[col].copy()
539
+ numeric = pd.to_numeric(series, errors="coerce")
540
+ has_numeric = numeric.notna().sum() > 0
541
+
542
+ if strategy == "mean":
543
+ if not has_numeric:
544
+ return False, f"Cannot compute mean for non-numeric column {col!r}.", False
545
+ fill_val = numeric.mean()
546
+ self._dirty_df[col] = numeric.fillna(round(fill_val, 2))
547
+
548
+ elif strategy == "median":
549
+ if not has_numeric:
550
+ return False, f"Cannot compute median for non-numeric column {col!r}.", False
551
+ fill_val = numeric.median()
552
+ self._dirty_df[col] = numeric.fillna(round(fill_val, 2))
553
+
554
+ elif strategy == "mode":
555
+ mode_result = series.mode(dropna=True)
556
+ if mode_result.empty:
557
+ return False, f"No mode found for column {col!r} (all values missing?).", False
558
+ self._dirty_df[col] = series.fillna(mode_result.iloc[0])
559
+
560
+ elif strategy == "drop":
561
+ # Drop rows missing this column's value and reset the positional
562
+ # index so later iloc references stay valid.
563
+ self._dirty_df = self._dirty_df.dropna(subset=[col]).reset_index(drop=True)
564
+ return True, None, False
565
+
566
+ else:
567
+ return False, f"Unknown fill_strategy: {strategy!r}", False
568
+
569
+ return True, None, False
570
+
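The numeric fill strategies above reduce to a coerce-then-fill pattern that can be sketched standalone (the sample values are illustrative):

```python
import pandas as pd

s = pd.Series(["10", "oops", None, "30"])

# Coerce to numeric first: unparseable strings and None both become NaN...
numeric = pd.to_numeric(s, errors="coerce")

# ...then fill with a rounded central statistic, as the mean/median branches do.
filled = numeric.fillna(round(numeric.median(), 2))
print(filled.tolist())  # [10.0, 20.0, 20.0, 30.0]
```

Note that because the whole column is replaced by the coerced series, values that were merely unparseable ("oops") get filled too, not just truly missing ones; the environment's mean/median branches share this behaviour.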
571
+ # ─────────────────────────────────────────────────────────────────────────
572
+ # Reward computation
573
+ # ─────────────────────────────────────────────────────────────────────────
574
+
575
+ def _compute_reward(
576
+ self,
577
+ action: CleanAction,
578
+ prev_score: float,
579
+ curr_score: float,
580
+ action_success: bool,
581
+ was_false_positive: bool,
582
+ ) -> float:
583
+ """
584
+ Dense per-step reward in the range [-0.5, +1.0].
585
+
586
+ Components
587
+ ----------
588
+ progress score delta (main learning signal)
589
+ efficiency bonus small reward for solving with steps to spare
590
+ fp_penalty penalise removing a valid-extreme row (medium task)
591
+ early_done_penalty penalise calling DONE with a very low score
592
+ step_cost tiny constant cost to discourage padding
593
+ """
594
+ if self._state is None:
595
+ return 0.0
596
+
597
+ max_steps = self._state.max_steps
598
+ step_count = self._state.step_count
599
+
600
+ # 1. Progress term
601
+ progress = curr_score - prev_score
602
+
603
+ # 2. Efficiency bonus (only when task is solved this step)
604
+ threshold = DONE_THRESHOLD[self._state.task_id]
605
+ just_solved = prev_score < threshold <= curr_score
606
+ step_fraction = step_count / max_steps
607
+ efficiency = EFFICIENCY_BONUS_WEIGHT * (1.0 - step_fraction) if just_solved else 0.0
608
+
609
+ # 3. False-positive penalty
610
+ fp_penalty = FALSE_POSITIVE_PENALTY if was_false_positive else 0.0
611
+
612
+ # 4. Early-DONE penalty
613
+ early_done = (
614
+ EARLY_DONE_PENALTY
615
+ if action.command == "DONE" and curr_score < EARLY_DONE_THRESHOLD
616
+ else 0.0
617
+ )
618
+
619
+ # 5. Step cost
620
+ step_cost = STEP_COST
621
+
622
+ reward = progress + efficiency + fp_penalty + early_done + step_cost
623
+ return round(float(np.clip(reward, -0.5, 1.0)), 4)
624
+
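The reward components above combine additively and are clipped into [-0.5, +1.0]. A worked sketch of the progress and efficiency terms (the constants here are illustrative assumptions; the real `DONE_THRESHOLD`, `EFFICIENCY_BONUS_WEIGHT`, and `STEP_COST` values are defined elsewhere in the module):

```python
import numpy as np

# Illustrative constants only; the module defines the real values elsewhere.
EFFICIENCY_BONUS_WEIGHT = 0.2
STEP_COST = -0.01

def shaped_reward(prev_score, curr_score, step_count, max_steps, threshold=0.95):
    progress = curr_score - prev_score                  # main learning signal
    just_solved = prev_score < threshold <= curr_score  # crossed this step
    efficiency = (
        EFFICIENCY_BONUS_WEIGHT * (1.0 - step_count / max_steps)
        if just_solved else 0.0
    )
    return round(float(np.clip(progress + efficiency + STEP_COST, -0.5, 1.0)), 4)

# Crossing the threshold with 17 of 20 steps to spare earns an efficiency
# bonus on top of the raw score delta: 0.07 + 0.2*(1 - 3/20) - 0.01 = 0.23
print(shaped_reward(0.90, 0.97, step_count=3, max_steps=20))  # 0.23
```

An ordinary non-solving step only earns its score delta minus the step cost, which is what pushes the policy toward short, high-impact action sequences.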
625
+ # ─────────────────────────────────────────────────────────────────────────
626
+ # Observation builder
627
+ # ─────────────────────────────────────────────────────────────────────────
628
+
629
+ def _build_observation(
630
+ self,
631
+ reward: Optional[float],
632
+ done: bool,
633
+ last_action_success: bool,
634
+ last_action_error: Optional[str],
635
+ grader_result: GradeResult,
636
+ ) -> CleanObservation:
637
+ if self._state is None:
638
+ raise RuntimeError("State not initialised.")
639
+
640
+ return CleanObservation(
641
+ # Inherited from Observation base
642
+ done=done,
643
+ reward=reward,
644
+ # Task context
645
+ task_id=self._state.task_id,
646
+ schema_hint=self._state.schema_hint,
647
+ initial_dirty_cells=self._state.initial_dirty_cells,
648
+ # Per-step state
649
+ dirty_csv=self._df_to_csv(self._dirty_df),
650
+ current_score=grader_result.score,
651
+ issues_remaining=grader_result.issues_remaining,
652
+ step_number=self._state.step_count,
653
+ max_steps=self._state.max_steps,
654
+ # Last-action feedback
655
+ last_action_success=last_action_success,
656
+ last_action_error=last_action_error,
657
+ )
658
+
659
+ # ─────────────────────────────────────────────────────────────────────────
660
+ # Utilities
661
+ # ─────────────────────────────────────────────────────────────────────────
662
+
663
+ @staticmethod
664
+ def _df_to_csv(df: Optional[pd.DataFrame]) -> str:
665
+ """Serialise the DataFrame to CSV, writing the positional index as a `row_index` column."""
666
+ if df is None:
667
+ return ""
668
+ return df.to_csv(index=True, index_label="row_index")
669
+
670
+ @staticmethod
671
+ def _cast_value(
672
+ val: str, df: pd.DataFrame, col: str
673
+ ) -> tuple[Any, Optional[str]]:
674
+ """
675
+ Try to cast a string value to the appropriate type for `col`.
676
+
677
+ Returns (cast_value, error_message). error_message is None on success.
678
+ """
679
+ # Determine target type from the clean (non-null, non-text) column values
680
+ sample = pd.to_numeric(
681
+ df[col].dropna().astype(str).str.strip(), errors="coerce"
682
+ )
683
+ majority_numeric = sample.notna().sum() / max(len(df[col].dropna()), 1) >= 0.5
684
+
685
+ if majority_numeric:
686
+ try:
687
+ float_val = float(val.strip().replace(",", ""))
688
+ # If all sample values are whole numbers, keep as int
689
+ if (sample.dropna() % 1 == 0).all() and float_val % 1 == 0:
690
+ return int(float_val), None
691
+ return round(float_val, 2), None
692
+ except (ValueError, AttributeError):
693
+ return (
694
+ None,
695
+ f"Cannot cast {val!r} to numeric for column {col!r}. "
696
+ f"Provide a plain number (e.g. '29.99').",
697
+ )
698
+
699
+ # String column — accept as-is (strip whitespace)
700
+ return val.strip(), None
701
+
702
+ # ─────────────────────────────────────────────────────────────────────────
703
+ # Lifecycle
704
+ # ─────────────────────────────────────────────────────────────────────────
705
+
706
+ def close(self) -> None:
707
+ self._dirty_df = None
708
+ self._clean_df = None
709
+ self._dataset = None
710
+ self._state = None
711
+
712
+ def get_metadata(self) -> EnvironmentMetadata:
713
+ return EnvironmentMetadata(
714
+ name="data_cleaning_env",
715
+ description=(
716
+ "Data cleaning pipeline: the agent receives a dirty CSV "
717
+ "and must fix type errors, outliers, missing values, and "
718
+ "schema inconsistencies to match a hidden ground truth."
719
+ ),
720
+ version="1.0.0",
721
+ author="hackathon",
722
+ )
723
+
724
+
725
+ # ─────────────────────────────────────────────────────────────────────────────
726
+ # Helpers
727
+ # ─────────────────────────────────────────────────────────────────────────────
728
+
729
+ def _is_nan(x: Any) -> bool:
730
+ """Return True if x is any flavour of missing value."""
731
+ if x is None:
732
+ return True
733
+ try:
734
+ return bool(pd.isna(x))
735
+ except (TypeError, ValueError):
736
+ return False
737
+
738
+
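The try/except around `pd.isna` matters because list-like inputs make it return an array, and calling `bool()` on a multi-element array raises. A quick demonstration of the helper's contract (reproduced standalone):

```python
import numpy as np
import pandas as pd

def is_nan(x):
    """Standalone mirror of the module's _is_nan helper."""
    if x is None:
        return True
    try:
        return bool(pd.isna(x))
    except (TypeError, ValueError):
        return False

print(is_nan(None))           # True
print(is_nan(float("nan")))   # True
print(is_nan("text"))         # False
print(is_nan([1.0, np.nan]))  # False: pd.isna returns an array here, bool()
                              # raises ValueError, treated as "not missing"
```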
739
+ # ─────────────────────────────────────────────────────────────────────────────
740
+ # Smoke test
741
+ # ─────────────────────────────────────────────────────────────────────────────
742
+
743
+ if __name__ == "__main__":
744
+ SEP = "─" * 64
745
+
746
+ for task_id in ("easy", "medium", "hard"):
747
+ print(f"\n{SEP}\nTASK: {task_id.upper()}\n{SEP}")
748
+
749
+ env = DataCleaningEnvironment()
750
+
751
+ # ── reset ────────────────────────────────────────────────────────────
752
+ obs = env.reset(task_id=task_id)
753
+ print(f"reset() → score={obs.current_score:.4f} "
754
+ f"issues={obs.issues_remaining} done={obs.done}")
755
+ assert obs.reward is None, "reward must be None after reset"
756
+ assert obs.done is False, "done must be False after reset"
757
+
758
+ lines = obs.dirty_csv.strip().split("\n")
759
+ print(f" CSV: {len(lines)} rows, {len(lines[0].split(','))} cols")
760
+ print(f" Hint: {obs.schema_hint[:70]}…")
761
+
762
+ # ── state() ──────────────────────────────────────────────────────────
763
+ st = env.state
764
+ print(f"state() → episode_id={st.episode_id[:8]}… step_count={st.step_count}")
765
+
766
+ # ── step: bad column (should give feedback, not crash) ───────────────
767
+ bad_action = CleanAction(
768
+ command="SET_VALUE", row_index=0, column="DOES_NOT_EXIST", value="0"
769
+ )
770
+ obs2 = env.step(bad_action)
771
+ assert obs2.last_action_success is False
772
+ print(f"step (bad col) → success={obs2.last_action_success} "
773
+ f"error='{obs2.last_action_error[:50]}…'")
774
+
775
+ # ── step: out-of-bounds row ──────────────────────────────────────────
776
+ bad_row = CleanAction(
777
+ command="SET_VALUE", row_index=9999, column="price", value="10.0"
778
+ )
779
+ obs3 = env.step(bad_row)
780
+ assert obs3.last_action_success is False
781
+ print(f"step (bad row) → success={obs3.last_action_success} "
782
+ f"error='{obs3.last_action_error[:50]}…'")
783
+
784
+ # ── step: valid fix ──────────────────────────────────────────────────
785
+ if task_id == "easy":
786
+ # Find the first injected dirty cell and fix it
787
+ injected = env._dataset.metadata.get("injected_cells", [])
788
+ if injected:
789
+ row, col = injected[0]
790
+ clean_val = str(env._clean_df.iloc[row][col])
791
+ fix_action = CleanAction(
792
+ command="SET_VALUE", row_index=row, column=col, value=clean_val
793
+ )
794
+ obs4 = env.step(fix_action)
795
+ print(f"step (fix row={row} col={col!r}) → "
796
+ f"success={obs4.last_action_success} "
797
+ f"score={obs4.current_score:.4f} "
798
+ f"reward={obs4.reward:.4f}")
799
+ assert obs4.last_action_success is True
800
+ assert obs4.reward is not None
801
+
802
+ elif task_id == "medium":
803
+ # Fill missing 'amount' values with the column median
804
+ obs4 = env.step(CleanAction(
805
+ command="FILL_MISSING", column="amount", fill_strategy="median"
806
+ ))
807
+ print(f"step (FILL_MISSING amount/median) → "
808
+ f"score={obs4.current_score:.4f} reward={obs4.reward:.4f}")
809
+
810
+ elif task_id == "hard":
811
+ # Standardize the date column
812
+ obs4 = env.step(CleanAction(
813
+ command="STANDARDIZE_COL", column="date"
814
+ ))
815
+ print(f"step (STANDARDIZE_COL date) → "
816
+ f"success={obs4.last_action_success} "
817
+ f"score={obs4.current_score:.4f} reward={obs4.reward:.4f}")
818
+
819
+ # ── DONE action ───────────────────────────────────────────────────────
820
+ done_obs = env.step(CleanAction(command="DONE"))
821
+ assert done_obs.done is True
822
+ print(f"step (DONE) → done={done_obs.done} "
823
+ f"reward={done_obs.reward:.4f} score={done_obs.current_score:.4f}")
824
+
825
+ env.close()
826
+
827
+ print(f"\n{SEP}\nAll smoke tests passed.\n{SEP}")
server/requirements.txt ADDED
@@ -0,0 +1,5 @@
1
+ openenv[core]>=0.2.0
2
+ fastapi>=0.115.0
3
+ uvicorn>=0.24.0
4
+ pandas>=2.0.0
5
+ numpy>=2.0.0
uv.lock ADDED
The diff for this file is too large to render. See raw diff
 
validate-submission.sh ADDED
@@ -0,0 +1,185 @@
1
+ #!/usr/bin/env bash
2
+ #
3
+ # validate-submission.sh — OpenEnv Submission Validator
4
+ #
5
+ # Checks that your HF Space is live, Docker image builds, and openenv validate passes.
6
+ #
7
+ # Prerequisites:
8
+ # - Docker: https://docs.docker.com/get-docker/
9
+ # - openenv-core: pip install openenv-core
10
+ # - curl (usually pre-installed)
11
+ #
12
+ # Run:
13
+ # curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
14
+ #
15
+ # Or download and run locally:
16
+ # chmod +x validate-submission.sh
17
+ # ./validate-submission.sh <ping_url> [repo_dir]
18
+ #
19
+ # Arguments:
20
+ # ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
21
+ # repo_dir Path to your repo (default: current directory)
22
+ #
23
+ # Examples:
24
+ # ./validate-submission.sh https://my-team.hf.space
25
+ # ./validate-submission.sh https://my-team.hf.space ./my-repo
26
+ #
27
+
28
+ set -uo pipefail
29
+
30
+ DOCKER_BUILD_TIMEOUT=600
31
+ if [ -t 1 ]; then
32
+ RED='\033[0;31m'
33
+ GREEN='\033[0;32m'
34
+ YELLOW='\033[1;33m'
35
+ BOLD='\033[1m'
36
+ NC='\033[0m'
37
+ else
38
+ RED='' GREEN='' YELLOW='' BOLD='' NC=''
39
+ fi
40
+
41
+ run_with_timeout() {
42
+ local secs="$1"; shift
43
+ if command -v timeout &>/dev/null; then
44
+ timeout "$secs" "$@"
45
+ elif command -v gtimeout &>/dev/null; then
46
+ gtimeout "$secs" "$@"
47
+ else
48
+ "$@" &
49
+ local pid=$!
50
+ ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
51
+ local watcher=$!
52
+ wait "$pid" 2>/dev/null
53
+ local rc=$?
54
+ kill "$watcher" 2>/dev/null
55
+ wait "$watcher" 2>/dev/null
56
+ return $rc
57
+ fi
58
+ }
59
+
60
+ portable_mktemp() {
61
+ local prefix="${1:-validate}"
62
+ mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
63
+ }
64
+
65
+ CLEANUP_FILES=()
66
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
67
+ trap cleanup EXIT
68
+
69
+ PING_URL="${1:-}"
70
+ REPO_DIR="${2:-.}"
71
+
72
+ if [ -z "$PING_URL" ]; then
73
+ printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
74
+ printf "\n"
75
+ printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
76
+ printf " repo_dir Path to your repo (default: current directory)\n"
77
+ exit 1
78
+ fi
79
+
80
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
81
+ printf "Error: directory '%s' not found\n" "${2:-.}"
82
+ exit 1
83
+ fi
84
+ PING_URL="${PING_URL%/}"
85
+ export PING_URL
86
+ PASS=0
87
+
88
+ log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
89
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
90
+ fail() { log "${RED}FAILED${NC} -- $1"; }
91
+ hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
92
+ stop_at() {
93
+ printf "\n"
94
+ printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
95
+ exit 1
96
+ }
97
+
98
+ printf "\n"
99
+ printf "${BOLD}========================================${NC}\n"
100
+ printf "${BOLD} OpenEnv Submission Validator${NC}\n"
101
+ printf "${BOLD}========================================${NC}\n"
102
+ log "Repo: $REPO_DIR"
103
+ log "Ping URL: $PING_URL"
104
+ printf "\n"
105
+
106
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
107
+
108
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
109
+ CLEANUP_FILES+=("$CURL_OUTPUT")
110
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
111
+ -H "Content-Type: application/json" -d '{}' \
112
+ "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
113
+
114
+ if [ "$HTTP_CODE" = "200" ]; then
115
+ pass "HF Space is live and responds to /reset"
116
+ elif [ "$HTTP_CODE" = "000" ]; then
117
+ fail "HF Space not reachable (connection failed or timed out)"
118
+ hint "Check your network connection and that the Space is running."
119
+ hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
120
+ stop_at "Step 1"
121
+ else
122
+ fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
123
+ hint "Make sure your Space is running and the URL is correct."
124
+ hint "Try opening $PING_URL in your browser first."
125
+ stop_at "Step 1"
126
+ fi
127
+
128
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
129
+
130
+ if ! command -v docker &>/dev/null; then
131
+ fail "docker command not found"
132
+ hint "Install Docker: https://docs.docker.com/get-docker/"
133
+ stop_at "Step 2"
134
+ fi
135
+
136
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
137
+ DOCKER_CONTEXT="$REPO_DIR"
138
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
139
+ DOCKER_CONTEXT="$REPO_DIR/server"
140
+ else
141
+ fail "No Dockerfile found in repo root or server/ directory"
142
+ stop_at "Step 2"
143
+ fi
144
+
145
+ log " Found Dockerfile in $DOCKER_CONTEXT"
146
+
147
+ BUILD_OK=false
148
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
149
+
150
+ if [ "$BUILD_OK" = true ]; then
151
+ pass "Docker build succeeded"
152
+ else
153
+ fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
154
+ printf "%s\n" "$BUILD_OUTPUT" | tail -20
155
+ stop_at "Step 2"
156
+ fi
157
+
158
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
159
+
160
+ if ! command -v openenv &>/dev/null; then
161
+ fail "openenv command not found"
162
+ hint "Install it: pip install openenv-core"
163
+ stop_at "Step 3"
164
+ fi
165
+
166
+ VALIDATE_OK=false
167
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
168
+
169
+ if [ "$VALIDATE_OK" = true ]; then
170
+ pass "openenv validate passed"
171
+ [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
172
+ else
173
+ fail "openenv validate failed"
174
+ printf "%s\n" "$VALIDATE_OUTPUT"
175
+ stop_at "Step 3"
176
+ fi
177
+
178
+ printf "\n"
179
+ printf "${BOLD}========================================${NC}\n"
180
+ printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
181
+ printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
182
+ printf "${BOLD}========================================${NC}\n"
183
+ printf "\n"
184
+
185
+ exit 0