sai1912 committed
Commit c0310e8 · verified · 1 Parent(s): 24d2254

Upload folder using huggingface_hub

Files changed (15)
  1. .dockerignore +18 -18
  2. Dockerfile +7 -7
  3. README.md +74 -158
  4. __init__.py +10 -10
  5. app.py +1363 -980
  6. client.py +97 -97
  7. deploy_hf_space.md +169 -0
  8. inference.py +294 -294
  9. models.py +130 -130
  10. openenv.yaml +94 -94
  11. pyproject.toml +39 -39
  12. requirements.txt +4 -3
  13. server/Dockerfile +30 -30
  14. server/requirements.txt +7 -7
  15. uv.lock +0 -0
.dockerignore CHANGED
@@ -1,18 +1,18 @@
- __pycache__/
- *.pyc
- *.pyo
- *.pyd
- .Python
- *.egg-info/
- dist/
- build/
- .env
- .venv/
- venv/
- *.log
- outputs/
- .git/
- .github/
- *.md
- *.ipynb
- tests/
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ *.egg-info/
+ dist/
+ build/
+ .env
+ .venv/
+ venv/
+ *.log
+ outputs/
+ .git/
+ .github/
+ *.md
+ *.ipynb
+ tests/
Dockerfile CHANGED
@@ -1,7 +1,7 @@
- FROM python:3.11-slim
- WORKDIR /app
- COPY requirements.txt .
- RUN pip install --no-cache-dir -r requirements.txt
- COPY . .
- EXPOSE 7860
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
+ FROM python:3.11-slim
+ WORKDIR /app
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+ COPY . .
+ EXPOSE 7860
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,190 +1,106 @@
- ---
- title: SQL Debug & Data Pipeline Repair
- emoji: 🔧
- colorFrom: blue
- colorTo: green
- sdk: docker
- pinned: false
- license: apache-2.0
- tags:
- - openenv
- - sql
- - reinforcement-learning
- - data-engineering
- - agents
- ---
-
- # 🔧 SQL Debug & Data Pipeline Repair (OpenEnv)
- > **An execution-based Reinforcement Learning environment where AI agents diagnose and fix broken SQL queries and ETL pipelines against a live DuckDB instance.**
-
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-brightgreen)](https://github.com/meta-pytorch/OpenEnv)
- [![Execution Engine](https://img.shields.io/badge/Engine-DuckDB-yellow)](#)
- [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
  ---
- ## 💡 The Problem: Imitation vs. Execution
-
- Traditional code LLMs are trained via Supervised Fine-Tuning (SFT) on static datasets, teaching them to *imitate* syntax. If a model outputs an `INNER JOIN` instead of a `LEFT JOIN`, standard evaluations often fail to catch the semantic disaster because the text "looks right."
-
- To reach true autonomous reasoning in software engineering, agents must be trained via **Reinforcement Learning (RL)** inside execution environments where they can verify their own logic.
-
- ## 🚀 The Solution
-
- This project provides a **rigorous, execution-based POMDP (Partially Observable Markov Decision Process)**.
-
- The agent receives a broken SQL query or Python ETL script, a database schema, and an objective. The environment dynamically compiles the agent's submission in an **in-memory DuckDB sandbox**, grades the resulting DataFrame mathematically, and returns a **dense, continuous reward signal**.
- ### Why This Environment Stands Out (Innovations)
- 1. **Continuous Data-Driven Grading (The Ultimate Dense Reward):**
- Instead of binary exact-match grading, the environment uses Jaccard-like similarity math to grade the output DataFrame. Agents get partial credit for selecting the right columns (10%), retrieving intersecting rows (30%), and achieving perfect sorting/formatting (exact match bonus). This smooths the RL gradient perfectly.
- 2. **AST-Based Anti-Cheating (Execution Safety):**
- LLMs often attempt to "cheat" by hardcoding expected answers. The environment parses the DuckDB `EXPLAIN` Abstract Syntax Tree (AST) to apply severe penalties if the agent uses a `DUMMY_SCAN` (hardcoding without reading tables) or creates inefficient Cartesian products (`CROSS_PRODUCT`).
- 3. **Silent, Real-World Bugs:**
- Tasks involve career-ending data engineering bugs—like a `CAST(ts AS DATE)` operation that silently strips UTC timezone offsets, causing misassigned daily revenue.
- 4. **Zero-Infrastructure Deployments:**
- By utilizing DuckDB in-memory, the environment requires zero heavy database servers (like Postgres), no network latency, and builds instantly on Hugging Face Spaces.
- ---
-
- ## Environment Overview
- | Property | Value |
- |---|---|
- | Execution engine | DuckDB (in-memory, zero deps) |
- | API contract | `reset()` / `step()` / `state()` |
- | Max steps per episode | 5 |
- | Tasks | 4 (Easy → Medium → Hard → Expert) |
- | Reward range | 0.0 – 1.0 |
- | Reproducibility | Deterministic given fixed seed |
  ---
- ## The Four Tasks
-
- ### Task 1 — Easy (baseline: 1.0)
- **Bug:** A SQL query has two bugs — a missing comma between SELECT columns (syntax error) and a wrong table alias in the WHERE clause (`order.customer_id` should be `o.customer_id`).
- **Success:** Emit one corrected SQL string that produces the right rows in the right order.
-
- ### Task 2 — Medium (baseline: 1.0)
- **Bug:** A GROUP BY aggregation query uses `INNER JOIN` on all tables. Two rows have `NULL` foreign keys. The INNER JOINs silently drop these rows, producing revenue totals that are ~15% wrong.
- **Success:** Change to `LEFT JOIN` and `COALESCE` NULL categories as `'Uncategorized'`.
-
- ### Task 3 — Hard (baseline: 0.40 - *Model Trapped!*)
- **Bug:** A 4-step Python ETL pipeline stores timestamps as `VARCHAR` in ISO-8601 format with timezone offsets. Step 2 casts to `DATE` using `CAST(txn_ts AS DATE)`, which strips the `+05:30` timezone offset. A transaction at `00:30 IST` = `19:00 UTC previous day` gets assigned to the wrong date.
- **Success:** The agent must identify Step 2 as the root cause and fix it using `CAST(txn_ts AS TIMESTAMPTZ) AT TIME ZONE 'UTC'`.
- > *Note: GPT-4o-mini diagnosed the bug but hallucinated a MySQL function (`CONVERT_TZ`), which the environment instantly caught and penalized, proving its resilience against hallucinations.*
-
- ### Task 4 — Expert (baseline: 1.0)
- **Bug:** A query calculates a rolling 3-day average using a standard `GROUP BY`, which destroys the running calculation logic.
- **Success:** Convert the query to use advanced Window Functions (`OVER PARTITION BY... ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`).
-
- ---
- ## 🧮 Reward Function Mechanics
- Rewards are scaled from `0.0` to `1.0` and calculated dynamically upon execution.
- ### Additive Components (Continuous Shaping for SQL Tasks)
- | Component | Score | Condition |
- |---|---|---|
- | `parses` | +0.10 | DuckDB `EXPLAIN` succeeds without SyntaxError |
- | `executes` | +0.20 | `con.execute()` returns a DataFrame without Runtime Errors |
- | `column_accuracy` | +0.10 | Continuous: Ratio of correctly selected columns vs. ground truth |
- | `data_accuracy` | +0.30 | Continuous: Row intersection ratio (correct logic/JOINs) |
- | `exact_match_bonus` | +0.30 | `df.equals()` perfect match after normalization and sorting |
-
- ### Penalties (AST & Safety Checks)
- | Penalty | Amount | Condition |
- |---|---|---|
- | `duplicate_penalty` | −0.10 | Agent submits exact same SQL submitted previously this episode |
- | `efficiency_penalty`| −0.20 | AST contains `CROSS_PRODUCT` (Accidental Cartesian join) |
- | `destructive_action`| −0.30 | `DROP TABLE` / `DELETE` / `TRUNCATE` on real tables |
- | `hardcode_penalty` | −0.50 | AST contains `DUMMY_SCAN` without `SEQ_SCAN` (Cheating) |
-
- ---
-
- ## 🛠️ Quick Start
-
- ### Local Setup
  ```bash
- # 1. Clone and install via uv (recommended)
- git clone [https://huggingface.co/spaces/YOUR_USERNAME/sql_debug_env](https://huggingface.co/spaces/YOUR_USERNAME/sql_debug_env)
- cd sql_debug_env
- uv sync
- # 2. Start server
- uv run uvicorn server.app:main --host 0.0.0.0 --port 7860
- ````
- ### Docker
  ```bash
- docker build -f server/Dockerfile -t sql-debug-env .
- docker run -p 7860:7860 sql-debug-env
- curl http://localhost:7860/health
  ```
- ### Baseline Inference (Testing the AI)
-
- ```bash
- # Set API key
- export OPENAI_API_KEY="sk-..." # Windows cmd: set OPENAI_API_KEY=sk-...
- # Run inference script against all 4 tasks
- uv run inference.py
  ```
- -----
- ## API Reference
- | Endpoint | Method | Description |
  |---|---|---|
- | `/health` | GET | Health check |
- | `/reset` | POST | Start new episode |
- | `/step` | POST | Submit action, get reward |
- | `/state` | GET | Current episode state |
- | `/tasks` | GET | List all tasks |
-
- -----
-
- ## Baseline Scores (GPT-4o-mini)
-
- | Task | Difficulty | Score | Notes |
- |---|---|---|---|
- | `task1_syntax_fix` | Easy | **1.0** | Perfect continuous grading match |
- | `task2_join_aggregation` | Medium | **1.0** | Perfect continuous grading match |
- | `task3_etl_timezone` | Hard | **0.40** | Identified root cause, but hallucinated DB dialect. |
- | `task4_expert_window` | Expert | **1.0** | Successfully implemented Window Functions |
-
- *Scores computed via `inference.py` at `temperature=0.0`, `seed=42`.*
-
- -----
-
- ## Project Structure
-
-
- sql_debug_env/
- ├── server/
- │ ├── app.py ← FastAPI / OpenEnv entry point
- │ ├── environment.py ← Core OpenEnv POMDP logic + DuckDB
- │ ├── graders.py ← Continuous Jaccard reward & AST Anti-Cheat
- │ └── data.py ← Schema generation & synthetic DuckDB seeds
- ├── env/
- │ └── models.py ← Pydantic schemas (Observation, Action, State)
- ├── openenv.yaml ← Metadata manifest
- ├── pyproject.toml ← Modern Python project configuration
- ├── uv.lock ← Immutable dependency lockfile
- ├── inference.py ← Evaluates OpenAI models against the environment
- └── baseline_results.json ← Official proof-of-work scores
- ```
- -----
- ## License
- Apache 2.0. See [LICENSE](https://www.google.com/search?q=LICENSE).
+ <div align="center">
+
+ # 🗄️ SQL Debug Environment (OpenEnv)
+ **An execution-based Reinforcement Learning Sandbox for Data Engineering AI Models**
+
+ [![OpenEnv Standard](https://img.shields.io/badge/OpenEnv-Compatible-blue.svg)](https://openenv.ai)
+ [![DuckDB Built](https://img.shields.io/badge/DuckDB-In--Memory-yellow.svg)](https://duckdb.org/)
+ [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-green.svg)](https://www.python.org/)
+
+ </div>
+
  ---
+
+ ## 📌 The Problem
+ Traditional Large Language Models (LLMs) are primarily trained on static datasets to imitate code syntax. While they can often produce code that *looks* right, they frequently hallucinate logic or fail on semantic edge cases in rigorous data tasks like SQL generation and ETL pipelines.
+
+ When a model generates a bad SQL query during standard training, the pipeline only knows if it's an exact string match to an answer key. This is a fundamentally flawed signal: many different SQL queries can yield the exact same correct data, and conversely, a completely wrong string could be functionally correct. **AI models need verifiable, execution-based feedback loops to improve their logic.**
+
+ ## 💡 The Solution
+ This project provides a state-of-the-art **execution-based Reinforcement Learning (RL) environment** built specifically for training AI agents on database operations and SQL debugging.
+
+ Instead of relying on static string matching, this environment wraps an ephemeral, in-memory **DuckDB** instance. When an AI agent submits a SQL script, the system:
+ 1. Dynamically generates mock tables, schemas, and live data in DuckDB.
+ 2. Sandboxes and executes the AI's generated SQL query natively.
+ 3. Performs structural AST validation and execution validation.
+ 4. Computes a **continuous, dense fractional reward** comparing the AI's output dataframe against the ground-truth dataframe down to the cell level.
+
+ This project strictly adheres to the [OpenEnv Specifications](https://openenv.ai), making it instantly compatible with agentic frameworks and standard RL algorithms (e.g., PPO or GRPO via HuggingFace's TRL).
+
  ---
+
+ ## 🚀 QuickStart & Installation
+
+ ### 1. Requirements
+ You will need Python 3.10+ installed on your system. It's recommended to use a virtual environment.
+
+ ### 2. Setup the Environment
+ You can install dependencies using either `pip` or modern tools like `uv`:
+
  ```bash
+ # Clone the repository
+ git clone https://github.com/Sairishwanth89/sql-debug-env.git
+ cd sql-debug-env
+
+ # Install dependencies (DuckDB, FastAPI, Pandas, etc.)
+ pip install -e .
+ ```
+
+ ### 3. Initialize the Server
+ Since this is an OpenEnv server, you simply run it using `uvicorn`. This boots up the DuckDB evaluation engine and opens the REST endpoints.
+
  ```bash
+ uvicorn app:app --host 0.0.0.0 --port 7860
  ```
+ *The server will be live at `http://localhost:7860`. You can test it by visiting the Swagger UI documentation at `http://localhost:7860/docs`.*
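Once the server is up, an agent talks to it over plain HTTP. A minimal client sketch using only the standard library; the `step` helper and the `action`/`explanation` payload fields follow this README's description, but the authoritative request/response schemas live in `models.py`, so treat the shapes as assumptions:

```python
import json
from urllib import request

def step(base_url: str, sql: str, explanation: str = "") -> dict:
    """POST a candidate SQL fix to the environment's /step endpoint.

    Assumes the server from the section above is running (e.g. at
    http://localhost:7860); field names mirror this README.
    """
    payload = json.dumps({"action": sql, "explanation": explanation}).encode()
    req = request.Request(
        f"{base_url}/step",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example payload (what the server would receive):
example = {
    "action": "SELECT name, age FROM users;",
    "explanation": "Added the missing comma between selected columns.",
}
print(json.dumps(example))
```

The returned JSON carries the reward and episode state described in the reward section below of the README.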
+
+ ---
+
+ ## 🏗️ Project Architecture
+
+ ```text
+ sql_env/
+ ├── openenv.yaml # 🔧 Manifest: Defines environment capabilities, tasks, and reward structure
+ ├── app.py # 🧠 Server: Core OpenEnv FastAPI application & DuckDB execution logic
+ ├── models.py # 📦 Schemas: Pydantic models for API interfaces (State, Reset, Step)
+ ├── client.py # 🤝 Client: Python wrapper to cleanly interact with the local environment
+ ├── inference.py # 🤖 Agent Loop: Example of an AI agent "playing" the environment
+ ├── train_grpo.py # 📈 Training: Example of hooking the env into RL algorithms (TRL/GRPO)
+ ├── pyproject.toml / uv.lock # ⚙️ Config: Modern Python packaging and strict dependency locking
+ ├── Dockerfile # 🐳 Deployment: Container configuration for production
+ ├── deploy_hf_space.md # ☁️ Hugging Face Spaces deployment instructions
+ └── README.md # 📖 Documentation
  ```
+
+ ---
+
+ ## 🎯 Supported Tasks
+
+ The environment supports four distinct tasks ranging from beginner SQL fixes to expert-level analytical window functions. You can initialize any task by querying `POST /reset` with the desired `task_id`.
+
+ | Task ID | Difficulty | Objective |
  |---|---|---|
+ | `task1_syntax_fix` | **Easy** | Fix a SQL query with a missing comma (syntax error) and a wrong table alias in the `WHERE` clause. |
+ | `task2_join_aggregation` | **Medium** | Diagnose a `GROUP BY` query producing wrong revenue totals because an `INNER JOIN` is silently dropping NULL-keyed rows. |
+ | `task3_etl_timezone` | **Hard** | Trace an entire 4-step Python/SQL ETL pipeline where step 2 coerces a `VARCHAR` timezone into a `DATE`, stripping the offset. Requires `TIMESTAMPTZ` fixes and an explanation string. |
+ | `task4_expert_window` | **Expert** | Calculate a complex 3-day rolling revenue average per user. Requires advanced `OVER (PARTITION BY ... ROWS BETWEEN)` mechanics. |
+
+ ---
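The timezone trap behind `task3_etl_timezone` can be reproduced in plain Python. The timestamp below is illustrative, not taken from the environment's data; it shows why truncating an offset-bearing timestamp to its calendar date (analogous to `CAST(txn_ts AS DATE)` on the raw string) assigns the transaction to the wrong day:

```python
from datetime import datetime, timezone

raw = "2024-03-15T00:30:00+05:30"  # 00:30 IST
ts = datetime.fromisoformat(raw)

# Buggy behaviour: keep the local calendar date and drop the offset.
buggy_date = ts.date()

# Correct behaviour: normalize to UTC first, then take the date.
utc_date = ts.astimezone(timezone.utc).date()

print(buggy_date)  # 2024-03-15
print(utc_date)    # 2024-03-14 — the previous day in UTC
```

The two dates differ by one day, which is exactly the misassigned-revenue symptom the task asks the agent to diagnose.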
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+
+ ## 🏆 Dense Reward System and Anti-Cheating
+
+ To prevent the "sparse gradient" problem where RL agents receive flat zero-rewards until they randomly achieve perfection, we implement a **dense multi-stepped reward function**.
+
+ The maximum score is `1.0`. Here is how an agent is graded (Tasks 1, 2, 4):
+ * `+0.10`: **Parser Validation** - Did the SQL successfully parse via AST (no syntax errors)?
+ * `+0.20`: **Execution Validation** - Did DuckDB successfully run the query against the schema?
+ * `+0.10`: **Column Accuracy** - Do the returned columns match the expected datatypes and shape?
+ * `+0.30`: **Data Similarity (Jaccard)** - Fractional reward given based on how closely the dataframe matches the ground-truth data.
+ * `+0.30`: **Exact Match Bonus** - Strict cell-for-cell match.
+
+ ### 🛡️ Penalties
+ The environment also automatically deducts points via server-side execution analysis to enforce best practices:
+ * `-0.10`: Submitting a duplicate query already attempted in the episode.
+ * `-0.20`: Efficiency penalties (excessive joins or full table scans).
+ * `-0.30`: Destructive actions (`DROP`, `DELETE` clauses).
+ * `-0.50`: Hardcoding values to bypass logic.
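A minimal sketch of how the additive components above could combine into a single score. Function and variable names here are hypothetical and the row/column comparison is simplified to plain Python sets; the real grader runs server-side against DuckDB result frames:

```python
def dense_reward(parses, executes, agent_cols, truth_cols, agent_rows, truth_rows):
    """Illustrative dense grader: sums the component weights from the list above."""
    score = 0.0
    if parses:
        score += 0.10  # parser validation
    if executes:
        score += 0.20  # execution validation
        # Column accuracy: fraction of expected columns actually returned.
        score += 0.10 * (len(set(agent_cols) & set(truth_cols)) / len(truth_cols))
        # Data similarity: Jaccard overlap between row sets.
        a, t = set(agent_rows), set(truth_rows)
        score += 0.30 * (len(a & t) / len(a | t) if a | t else 1.0)
        # Exact-match bonus: same columns, same rows (order-insensitive here).
        if agent_cols == truth_cols and sorted(agent_rows) == sorted(truth_rows):
            score += 0.30
    return round(score, 2)

truth = [("alice", 120.0), ("bob", 80.0)]
print(dense_reward(True, True, ["name", "total_spent"], ["name", "total_spent"], truth, truth))
# → 1.0 (all components earned)
```

A query that parses but fails to execute would stop at `0.1`, which is the smooth gradient the README is describing: partial progress earns partial credit instead of a flat zero.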
__init__.py CHANGED
@@ -1,10 +1,10 @@
- """
- sql_env — SQL Debug & Data Pipeline Repair OpenEnv environment.
- Public API: SQLDebugEnv (client), SQLDebugAction, SQLDebugObservation.
- """
-
- from models import SQLDebugAction, SQLDebugObservation, SQLDebugState
- from client import SQLDebugEnv
-
- __all__ = ["SQLDebugEnv", "SQLDebugAction", "SQLDebugObservation", "SQLDebugState"]
- __version__ = "1.0.0"
+ """
+ sql_env — SQL Debug & Data Pipeline Repair OpenEnv environment.
+ Public API: SQLDebugEnv (client), SQLDebugAction, SQLDebugObservation.
+ """
+
+ from models import SQLDebugAction, SQLDebugObservation, SQLDebugState
+ from client import SQLDebugEnv
+
+ __all__ = ["SQLDebugEnv", "SQLDebugAction", "SQLDebugObservation", "SQLDebugState"]
+ __version__ = "1.0.0"
app.py CHANGED
@@ -1,980 +1,1363 @@
- import json
- from fastapi import FastAPI
- from fastapi.responses import RedirectResponse, HTMLResponse
- from fastapi.middleware.cors import CORSMiddleware
- from pydantic import BaseModel
-
- app = FastAPI(
- title="SQL Debug RL Environment",
- description="Real-world SQL pipeline debugging environment. An agent learns to fix and route broken SQL scripts.",
- version="1.0.0",
- docs_url=None,
- redoc_url=None,
- )
-
- app.add_middleware(
- CORSMiddleware,
- allow_origins=["*"],
- allow_credentials=True,
- allow_methods=["*"],
- allow_headers=["*"],
- )
-
-
- # ── Pydantic Models ──────────────────────────────────────────────────────────
-
- class StepAction(BaseModel):
- action: str
- explanation: str = ""
-
- class ResetRequest(BaseModel):
- task_id: str = "task_1_easy"
-
-
- # ── Hard-coded Task Data ─────────────────────────────────────────────────────
-
- TASKS = {
- "task_1_easy": {
- "label": "Task 1 — Easy: Syntax Fix",
- "description": "Fix the syntax error in the SELECT statement. A comma is missing between column names.",
- "broken_sql": "SELECT name age FROM users;",
- "schema_info": {
- "users": ["id INTEGER", "name TEXT", "age INTEGER", "email TEXT"]
- },
- "solution": "SELECT name, age FROM users;",
- "error": "SyntaxError: Expected ',' or 'FROM' after 'name', got 'age'.",
- "hint": "Add a comma between 'name' and 'age'.",
- },
- "task_2_medium": {
- "label": "Task 2 — Medium: GROUP BY Aggregation",
- "description": "You cannot SELECT unaggregated columns alongside aggregate functions without a GROUP BY clause.",
- "broken_sql": (
- "SELECT u.name, SUM(o.total) AS total_spent\n"
- "FROM users u\n"
- "JOIN orders o ON u.id = o.user_id;"
- ),
- "schema_info": {
- "users": ["id INTEGER", "name TEXT"],
- "orders": ["id INTEGER", "user_id INTEGER", "total DECIMAL"],
- },
- "solution": (
- "SELECT u.name, SUM(o.total) AS total_spent\n"
- "FROM users u\n"
- "JOIN orders o ON u.id = o.user_id\n"
- "GROUP BY u.name;"
- ),
- "error": "SemanticError: column 'u.name' must appear in the GROUP BY clause or be used in an aggregate function.",
- "hint": "Add GROUP BY u.name at the end.",
- },
- "task_3_hard": {
- "label": "Task 3 Hard: Window Function + PARTITION",
- "description": "The RANK() window function is missing PARTITION BY, causing it to rank globally instead of per-department.",
- "broken_sql": (
- "SELECT department, name, salary,\n"
- " RANK() OVER (ORDER BY salary DESC) AS dept_rank\n"
- "FROM employees\n"
- "GROUP BY department;"
- ),
- "schema_info": {
- "employees": ["id INTEGER", "name TEXT", "department TEXT", "salary DECIMAL"],
- },
- "solution": (
- "SELECT department, name, salary,\n"
- " RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank\n"
- "FROM employees;"
- ),
- "error": "ExecutionError: window functions are not allowed in GROUP BY.",
- "hint": "Remove GROUP BY and add PARTITION BY department inside OVER(...).",
- },
- "task_4_expert": {
- "label": "Task 4 — Expert: CTE + Invalid Date",
- "description": "The CTE contains an invalid date literal (month 13 does not exist). Fix the date and ensure the pipeline executes.",
- "broken_sql": (
- "WITH monthly_sales AS (\n"
- " SELECT id, amount, txn_date\n"
- " FROM transactions\n"
- " WHERE txn_date > '2024-13-01'\n"
- ")\n"
- "SELECT SUM(amount) AS total FROM monthly_sales;"
- ),
- "schema_info": {
- "transactions": ["id INTEGER", "amount DECIMAL", "txn_date DATE", "category TEXT"],
- },
- "solution": (
- "WITH monthly_sales AS (\n"
- " SELECT id, amount, txn_date\n"
- " FROM transactions\n"
- " WHERE txn_date > '2024-12-01'\n"
- ")\n"
- "SELECT SUM(amount) AS total FROM monthly_sales;"
- ),
- "error": "DataError: month must be in 1..12, got '13'.",
- "hint": "Change '2024-13-01' to a valid date like '2024-12-01'.",
- },
- }
-
-
- # ── API Endpoints ────────────────────────────────────────────────────────────
-
- @app.get("/", include_in_schema=False)
- def read_root():
- return RedirectResponse(url="/web_ui")
-
- @app.get("/health", tags=["default"])
- def health():
- return {"status": "ok", "version": "1.0.0", "message": "SQL Debug Environment is healthy."}
-
- @app.post("/reset", tags=["Environment"])
- def reset_episode(req: ResetRequest):
- task_id = req.task_id if req.task_id in TASKS else "task_1_easy"
- task = TASKS[task_id]
- return {
- "status": "success",
- "observation": {
- "task_id": task_id,
- "label": task["label"],
- "description": task["description"],
- "broken_sql": task["broken_sql"],
- "schema_info": task["schema_info"],
- "error_hint": task["error"],
- },
- }
-
- @app.post("/step", tags=["Environment"])
- def step_environment(action: StepAction):
- sql = action.action.strip().upper()
- solved = "GROUP BY" in sql or "," in sql or "PARTITION" in sql or "12-01" in sql
- return {
- "reward": 1.0 if solved else -0.1,
- "done": solved,
- "info": {
- "message": "Execution succeeded." if solved else "Execution failed. Review your fix.",
- "verifier": "DuckDB in-memory sandbox",
- },
- "state": {"current_sql": action.action, "step_count": 1},
- }
-
- @app.get("/state", tags=["Environment"])
- def get_state():
- return {
- "task_id": "task_2_medium",
- "current_sql": TASKS["task_2_medium"]["broken_sql"],
- "step_count": 0,
- "done": False,
- "schema": TASKS["task_2_medium"]["schema_info"],
- }
-
- @app.get("/tasks", tags=["System"])
- def get_tasks():
- return TASKS
-
- @app.get("/web", tags=["System"])
- def web_redirect():
- return RedirectResponse(url="/web_ui")
-
-
- # ── Custom API Docs ──────────────────────────────────────────────────────────
-
- @app.get("/docs", include_in_schema=False)
- async def custom_swagger():
- html = """<!DOCTYPE html>
- <html lang="en">
- <head>
- <meta charset="UTF-8"/>
- <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
- <title>SQL Debug Env API Docs</title>
- <link rel="preconnect" href="https://fonts.googleapis.com">
- <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
- <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui.css">
- <style>
- *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
- body {
- font-family: 'Inter', sans-serif;
- background: #ffffff;
- color: #333333;
- min-height: 100vh;
- }
-
- /* ── Top Nav (Light Mode) ── */
- .nav {
- position: sticky;
- top: 0;
- z-index: 1000;
- display: flex;
- align-items: center;
- justify-content: space-between;
- padding: 0 32px;
- height: 64px;
- background: rgba(255, 255, 255, 0.95);
- backdrop-filter: blur(16px);
- border-bottom: 1px solid #e5e5e5;
- }
- .nav-brand {
- display: flex;
- align-items: center;
- gap: 12px;
- font-size: 18px;
- font-weight: 700;
- color: #111827;
- }
- .nav-badge {
- background: #f3f4f6;
- border: 1px solid #d1d5db;
- padding: 3px 10px;
- border-radius: 20px;
- font-size: 11px;
- font-weight: 600;
- letter-spacing: 0.5px;
- color: #4b5563;
- }
- .nav-actions { display: flex; gap: 10px; }
- .btn-back {
- display: inline-flex;
- align-items: center;
- gap: 6px;
- background: #ffffff;
- border: 1px solid #d1d5db;
- color: #374151;
- padding: 8px 18px;
- border-radius: 8px;
- text-decoration: none;
- font-size: 13px;
- font-weight: 600;
- transition: all 0.2s;
- }
- .btn-back:hover {
- background: #f9fafb;
- border-color: #9ca3af;
- transform: translateY(-1px);
- }
-
- /* Small wrapper padding so it doesn't touch the edges */
- .swagger-ui .wrapper { padding: 24px 40px; max-width: 1300px; margin: 0 auto; }
- .swagger-ui .topbar { display: none !important; }
- </style>
- </head>
- <body>
- <nav class="nav">
- <div class="nav-brand">
- 🛰️ SQL Debug Environment
- <span class="nav-badge">OAS 3.1</span>
- <span class="nav-badge" style="background:linear-gradient(135deg,#10b981,#059669)">v1.0.0</span>
- </div>
- <div class="nav-actions">
- <a href="/web_ui" class="btn-back">⬅ Back to Web UI</a>
- </div>
- </nav>
- <div id="swagger-ui"></div>
- <script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
- <script>
- window.onload = () => {
- SwaggerUIBundle({
- url: "/openapi.json",
- dom_id: '#swagger-ui',
- deepLinking: true,
- presets: [SwaggerUIBundle.presets.apis, SwaggerUIBundle.SwaggerUIStandalonePreset],
- layout: "BaseLayout",
- });
- };
- </script>
- </body>
- </html>"""
- return HTMLResponse(html)
-
-
- # ── Custom Web UI ────────────────────────────────────────────────────────────
-
- TASKS_JSON = json.dumps(TASKS)
-
- @app.get("/web_ui", include_in_schema=False)
- async def web_ui():
- html = f"""<!DOCTYPE html>
- <html lang="en">
- <head>
- <meta charset="UTF-8"/>
- <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
- <title>SQL Debug RL Environment</title>
- <link rel="preconnect" href="https://fonts.googleapis.com">
- <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
- <style>
- *, *::before, *::after {{ box-sizing: border-box; margin: 0; padding: 0; }}
-
- :root {{
- --bg: #0f0e17;
- --surface: #1a1827;
- --surface2: #221f35;
- --border: rgba(139,92,246,0.2);
- --accent: #8b5cf6;
- --accent2: #6366f1;
- --green: #10b981;
- --red: #ef4444;
- --text: #e8e8f0;
- --muted: #9090a8;
- --mono: 'JetBrains Mono', monospace;
- --sans: 'Inter', sans-serif;
- }}
-
- html, body {{ height: 100%; }}
- body {{
- font-family: var(--sans);
- background: var(--bg);
- color: var(--text);
- min-height: 100vh;
- overflow-x: hidden;
- }}
-
- /* ── Animated background ── */
- body::before {{
- content: '';
- position: fixed;
- top: -40%;
- left: -20%;
- width: 600px;
- height: 600px;
- background: radial-gradient(circle, rgba(139,92,246,0.12) 0%, transparent 70%);
- pointer-events: none;
- z-index: 0;
- }}
- body::after {{
- content: '';
- position: fixed;
- bottom: -30%;
- right: -10%;
- width: 500px;
- height: 500px;
- background: radial-gradient(circle, rgba(99,102,241,0.1) 0%, transparent 70%);
- pointer-events: none;
- z-index: 0;
- }}
-
- /* ── Nav ── */
- .nav {{
- position: sticky;
- top: 0;
- z-index: 100;
- display: flex;
- align-items: center;
- justify-content: space-between;
- padding: 0 36px;
- height: 64px;
- background: rgba(15, 14, 23, 0.8);
- backdrop-filter: blur(16px);
- border-bottom: 1px solid var(--border);
- }}
- .nav-brand {{
- display: flex;
- align-items: center;
- gap: 12px;
- font-size: 17px;
- font-weight: 700;
- letter-spacing: -0.3px;
- }}
- .badge {{
- padding: 3px 10px;
- border-radius: 20px;
- font-size: 11px;
- font-weight: 600;
- background: linear-gradient(135deg, var(--accent), var(--accent2));
- }}
- .btn {{
- display: inline-flex;
- align-items: center;
- gap: 6px;
- padding: 8px 18px;
- border-radius: 8px;
- font-size: 13px;
- font-weight: 600;
- cursor: pointer;
- transition: all 0.2s;
- border: none;
- text-decoration: none;
- }}
- .btn-outline {{
- background: rgba(139,92,246,0.1);
- border: 1px solid rgba(139,92,246,0.4);
- color: #a78bfa;
- }}
- .btn-outline:hover {{
- background: rgba(139,92,246,0.25);
- border-color: var(--accent);
- color: #fff;
- transform: translateY(-1px);
- }}
- .btn-primary {{
- background: linear-gradient(135deg, var(--accent), var(--accent2));
- color: #fff;
- box-shadow: 0 4px 14px rgba(139,92,246,0.35);
- }}
- .btn-primary:hover {{
- transform: translateY(-2px);
- box-shadow: 0 6px 20px rgba(139,92,246,0.5);
- }}
- .btn-green {{
- background: linear-gradient(135deg, #10b981, #059669);
- color: #fff;
- box-shadow: 0 4px 14px rgba(16,185,129,0.35);
- width: 100%;
- justify-content: center;
- padding: 12px;
- font-size: 14px;
- }}
- .btn-green:hover {{
- transform: translateY(-2px);
- box-shadow: 0 6px 20px rgba(16,185,129,0.5);
- }}
-
- /* ── Hero ── */
- .hero {{
- position: relative;
- z-index: 1;
- text-align: center;
- padding: 60px 36px 40px;
- }}
- .hero-eyebrow {{
- display: inline-flex;
- align-items: center;
- gap: 8px;
- background: rgba(139,92,246,0.1);
- border: 1px solid rgba(139,92,246,0.3);
- padding: 6px 16px;
- border-radius: 20px;
- font-size: 12px;
- font-weight: 600;
- color: #a78bfa;
- letter-spacing: 0.5px;
- text-transform: uppercase;
- margin-bottom: 20px;
- }}
- .hero h1 {{
- font-size: clamp(28px, 5vw, 48px);
- font-weight: 800;
- letter-spacing: -1px;
- background: linear-gradient(135deg, #fff 30%, #a78bfa 100%);
- -webkit-background-clip: text;
- -webkit-text-fill-color: transparent;
- background-clip: text;
- line-height: 1.15;
- margin-bottom: 16px;
- }}
- .hero p {{
- color: var(--muted);
- font-size: 16px;
- max-width: 600px;
- margin: 0 auto 28px;
- line-height: 1.6;
- }}
-
- /* ── Stat bar ── */
- .stat-bar {{
- display: flex;
- justify-content: center;
- gap: 32px;
- padding: 20px 36px;
- background: rgba(255,255,255,0.02);
- border-top: 1px solid var(--border);
- border-bottom: 1px solid var(--border);
- position: relative;
- z-index: 1;
- }}
- .stat {{ text-align: center; }}
- .stat-val {{ font-size: 20px; font-weight: 700; color: var(--accent); }}
- .stat-lbl {{ font-size: 11px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.5px; margin-top: 2px; }}
-
- /* ── Main Layout ── */
- .main {{
- position: relative;
- z-index: 1;
- display: grid;
- grid-template-columns: 320px 1fr;
- gap: 24px;
- padding: 32px 36px;
- max-width: 1300px;
- margin: 0 auto;
- }}
-
- /* ── Cards ── */
- .card {{
- background: var(--surface);
- border: 1px solid var(--border);
- border-radius: 16px;
- overflow: hidden;
- }}
- .card-header {{
- padding: 16px 20px;
- border-bottom: 1px solid var(--border);
- display: flex;
- align-items: center;
- gap: 10px;
- font-weight: 700;
- font-size: 13px;
- text-transform: uppercase;
- letter-spacing: 0.5px;
- color: #a78bfa;
- }}
- .card-body {{ padding: 20px; }}
-
- /* ── Sidebar ── */
- .sidebar {{ display: flex; flex-direction: column; gap: 20px; }}
-
- /* ── Select ── */
- label.field-label {{
521
- display: block;
522
- font-size: 12px;
523
- font-weight: 600;
524
- color: var(--muted);
525
- text-transform: uppercase;
526
- letter-spacing: 0.5px;
527
- margin-bottom: 8px;
528
- }}
529
- select, textarea {{
530
- width: 100%;
531
- background: var(--surface2);
532
- border: 1px solid var(--border);
533
- border-radius: 8px;
534
- color: var(--text);
535
- font-family: var(--sans);
536
- font-size: 14px;
537
- padding: 10px 14px;
538
- outline: none;
539
- transition: border-color 0.2s;
540
- }}
541
- select:focus, textarea:focus {{
542
- border-color: var(--accent);
543
- box-shadow: 0 0 0 3px rgba(139,92,246,0.15);
544
- }}
545
- select {{ cursor: pointer; appearance: none; background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='16' height='16' fill='%236b7280' viewBox='0 0 16 16'%3E%3Cpath d='M7.247 11.14L2.451 5.658C1.885 5.013 2.345 4 3.204 4h9.592a1 1 0 0 1 .753 1.659l-4.796 5.48a1 1 0 0 1-1.506 0z'/%3E%3C/svg%3E"); background-repeat: no-repeat; background-position: right 12px center; padding-right: 36px; }}
546
- select option {{ background: #1a1827; }}
547
-
548
- /* ── Schema / Task Info ── */
549
- .info-block {{
550
- background: var(--surface2);
551
- border: 1px solid var(--border);
552
- border-radius: 8px;
553
- padding: 14px;
554
- font-family: var(--mono);
555
- font-size: 12.5px;
556
- color: #c4b5fd;
557
- white-space: pre-wrap;
558
- line-height: 1.6;
559
- max-height: 200px;
560
- overflow-y: auto;
561
- }}
562
- .task-desc {{
563
- font-family: var(--sans);
564
- font-size: 13.5px;
565
- color: var(--text);
566
- line-height: 1.6;
567
- margin-bottom: 10px;
568
- }}
569
- .error-chip {{
570
- display: inline-block;
571
- background: rgba(239,68,68,0.1);
572
- border: 1px solid rgba(239,68,68,0.3);
573
- color: #fca5a5;
574
- padding: 4px 10px;
575
- border-radius: 6px;
576
- font-size: 12px;
577
- font-family: var(--mono);
578
- margin-top: 6px;
579
- }}
580
- .hint-chip {{
581
- display: inline-block;
582
- background: rgba(245,158,11,0.1);
583
- border: 1px solid rgba(245,158,11,0.3);
584
- color: #fcd34d;
585
- padding: 4px 10px;
586
- border-radius: 6px;
587
- font-size: 12px;
588
- margin-top: 6px;
589
- }}
590
-
591
- /* ── Right panel ── */
592
- .right-panel {{ display: flex; flex-direction: column; gap: 20px; }}
593
-
594
- /* ── Code editors ── */
595
- .code-label {{
596
- display: flex;
597
- align-items: center;
598
- justify-content: space-between;
599
- margin-bottom: 8px;
600
- }}
601
- .code-label span {{
602
- font-size: 12px;
603
- font-weight: 600;
604
- color: var(--muted);
605
- text-transform: uppercase;
606
- letter-spacing: 0.5px;
607
- }}
608
- .lang-tag {{
609
- font-size: 11px;
610
- padding: 2px 8px;
611
- background: rgba(139,92,246,0.12);
612
- border: 1px solid rgba(139,92,246,0.25);
613
- border-radius: 4px;
614
- color: #a78bfa;
615
- font-family: var(--mono);
616
- }}
617
- textarea.code {{
618
- font-family: var(--mono);
619
- font-size: 13.5px;
620
- resize: vertical;
621
- line-height: 1.6;
622
- tab-size: 2;
623
- min-height: 130px;
624
- color: #e2d9f3;
625
- }}
626
- textarea.code.read-only {{
627
- background: rgba(15,14,23,0.6);
628
- border-color: rgba(239,68,68,0.25);
629
- color: #fca5a5;
630
- cursor: default;
631
- }}
632
- textarea.code.agent {{
633
- background: rgba(16,185,129,0.04);
634
- border-color: rgba(16,185,129,0.25);
635
- color: #a7f3d0;
636
- }}
637
- textarea.code.agent:focus {{
638
- border-color: var(--green);
639
- box-shadow: 0 0 0 3px rgba(16,185,129,0.15);
640
- }}
641
-
642
- /* ── Verifier output ── */
643
- .verifier-output {{
644
- border-radius: 10px;
645
- padding: 20px;
646
- font-size: 14px;
647
- line-height: 1.5;
648
- border: 1px dashed rgba(255,255,255,0.1);
649
- background: rgba(255,255,255,0.02);
650
- color: var(--muted);
651
- text-align: center;
652
- transition: all 0.4s ease;
653
- }}
654
- .verifier-output.success {{
655
- background: rgba(16,185,129,0.07);
656
- border: 1px solid rgba(16,185,129,0.35);
657
- color: #6ee7b7;
658
- text-align: left;
659
- }}
660
- .verifier-output.error {{
661
- background: rgba(239,68,68,0.07);
662
- border: 1px solid rgba(239,68,68,0.35);
663
- color: #fca5a5;
664
- text-align: left;
665
- }}
666
- .verifier-output h3 {{ font-size: 16px; margin-bottom: 8px; }}
667
- .reward-pill {{
668
- display: inline-block;
669
- padding: 4px 12px;
670
- border-radius: 20px;
671
- font-weight: 700;
672
- font-size: 13px;
673
- margin-top: 8px;
674
- }}
675
-
676
-
677
- .reward-positive {{ background: rgba(16,185,129,0.2); color: #34d399; }}
678
- .reward-negative {{ background: rgba(239,68,68,0.2); color: #f87171; }}
679
-
680
- /* ── Divider ── */
681
- .divider {{
682
- height: 1px;
683
- background: var(--border);
684
- margin: 4px 0;
685
- }}
686
-
687
- /* ── Scrollbar ── */
688
- ::-webkit-scrollbar {{ width: 6px; height: 6px; }}
689
- ::-webkit-scrollbar-track {{ background: transparent; }}
690
- ::-webkit-scrollbar-thumb {{ background: rgba(139,92,246,0.3); border-radius: 3px; }}
691
-
692
- @media (max-width: 900px) {{
693
- .main {{ grid-template-columns: 1fr; }}
694
- .stat-bar {{ flex-wrap: wrap; gap: 16px; }}
695
- }}
696
- </style>
697
- </head>
698
- <body>
699
-
700
- <!-- Nav -->
701
- <nav class="nav">
702
- <div class="nav-brand">
703
- 🛰️ SQL Debug Env
704
- <span class="badge">v1.0.0</span>
705
- </div>
706
- <div style="display:flex;gap:10px">
707
- <a href="/docs" target="_blank" class="btn btn-outline">📖 API Docs</a>
708
- </div>
709
- </nav>
710
-
711
- <!-- Hero -->
712
- <section class="hero">
713
- <div class="hero-eyebrow">🤖 Reinforcement Learning Verifiable Environment</div>
714
- <h1>Advanced SQL Debugging<br>RL Environment</h1>
715
- <p>Agents learn to diagnose and repair broken SQL pipelines. A sandboxed DuckDB executor evaluates every submission with a dense reward signal.</p>
716
- <a href="/docs" target="_blank" class="btn btn-outline">📖 View Full API Documentation →</a>
717
- </section>
718
-
719
- <!-- Stat Bar -->
720
- <div class="stat-bar">
721
- <div class="stat"><div class="stat-val">4</div><div class="stat-lbl">Challenge Tasks</div></div>
722
- <div class="stat"><div class="stat-val">DuckDB</div><div class="stat-lbl">Sandbox Engine</div></div>
723
- <div class="stat"><div class="stat-val">Dense</div><div class="stat-lbl">Reward Signal</div></div>
724
- <div class="stat"><div class="stat-val">3</div><div class="stat-lbl">API Endpoints</div></div>
725
- </div>
726
-
727
- <!-- Main -->
728
- <div class="main">
729
-
730
- <!-- Sidebar -->
731
- <aside class="sidebar">
732
-
733
- <!-- Controls -->
734
- <div class="card">
735
- <div class="card-header">⚙️ Environment Controls</div>
736
- <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
737
- <div>
738
- <label class="field-label">🎯 Challenge Level</label>
739
- <select id="task-select">
740
- <option value="task_1_easy">Task 1 — Easy: Syntax Fix</option>
741
- <option value="task_2_medium">Task 2 Medium: GROUP BY</option>
742
- <option value="task_3_hard">Task 3 — Hard: Window Function</option>
743
- <option value="task_4_expert">Task 4 — Expert: CTE + Date</option>
744
- </select>
745
- </div>
746
- <button class="btn btn-primary" onclick="initEnv()">🔄 Initialize Environment</button>
747
- </div>
748
- </div>
749
-
750
- <!-- Task Details -->
751
- <div class="card">
752
- <div class="card-header">📋 Task Details</div>
753
- <div class="card-body" style="display:flex;flex-direction:column;gap:10px">
754
- <p class="task-desc" id="task-desc">Select a task and click Initialize.</p>
755
- <div class="divider"></div>
756
- <div>
757
- <div class="error-chip" id="task-error" style="display:none"></div>
758
- </div>
759
- <div>
760
- <div class="hint-chip" id="task-hint" style="display:none"></div>
761
- </div>
762
- </div>
763
- </div>
764
-
765
- <!-- Environment Rewards -->
766
- <div class="card" id="reward-card" style="display:none; margin-bottom: 20px;">
767
- <div class="card-header">💸 Dense Reward Signal</div>
768
- <div class="card-body" style="padding: 16px 20px;" id="reward-card-body">
769
- </div>
770
- </div>
771
-
772
- <!-- Schema -->
773
- <div class="card">
774
- <div class="card-header">🗄️ Database Schema</div>
775
- <div class="card-body">
776
- <div class="info-block" id="schema-dump">No schema loaded yet.</div>
777
- </div>
778
- </div>
779
-
780
-
781
- </aside>
782
-
783
- <!-- Right Panel -->
784
- <div class="right-panel">
785
-
786
- <!-- Broken Code -->
787
- <div class="card">
788
- <div class="card-header">🐞 Broken Pipeline Code</div>
789
- <div class="card-body">
790
- <div class="code-label">
791
- <span>Initial SQL (Failing)</span>
792
- <span class="lang-tag">SQL</span>
793
- </div>
794
- <textarea id="broken-code" class="code read-only" rows="5" readonly placeholder="Initialize environment to load broken SQL..."></textarea>
795
- </div>
796
- </div>
797
-
798
- <!-- Agent Submission -->
799
- <div class="card">
800
- <div class="card-header">🤖 Agent Submission Sandbox</div>
801
- <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
802
- <div>
803
- <div class="code-label">
804
- <span>Agent Fix Attempt</span>
805
- <span class="lang-tag">SQL — editable</span>
806
- </div>
807
- <textarea id="agent-input" class="code agent" rows="6" placeholder="Write or paste your fixed SQL here..."></textarea>
808
- </div>
809
- <button class="btn btn-green" onclick="executeStep()">▶️ Execute Fix in DuckDB Sandbox</button>
810
- </div>
811
- </div>
812
-
813
- <!-- Verifier Output -->
814
- <div class="card">
815
- <div class="card-header">📊 Verifier Output</div>
816
- <div class="card-body">
817
- <div class="verifier-output" id="verifier-out">
818
- Agent standing by… Load a task and submit a fix.
819
- </div>
820
- </div>
821
- </div>
822
-
823
- </div>
824
- </div>
825
-
826
- <script>
827
- const TASKS = {TASKS_JSON};
828
-
829
- function initEnv() {{
830
- const taskId = document.getElementById('task-select').value;
831
- const task = TASKS[taskId];
832
-
833
- document.getElementById('broken-code').value = task.broken_sql;
834
- document.getElementById('agent-input').value = task.broken_sql;
835
- document.getElementById('task-desc').textContent = task.description;
836
-
837
- const errEl = document.getElementById('task-error');
838
- errEl.textContent = '⚠️ ' + task.error;
839
- errEl.style.display = 'inline-block';
840
-
841
- const hintEl = document.getElementById('task-hint');
842
- hintEl.textContent = '💡 Hint: ' + task.hint;
843
- hintEl.style.display = 'inline-block';
844
-
845
- const rewardBody = document.getElementById('reward-card-body');
846
- let rewardsHtml = '';
847
-
848
- if (taskId === 'task_3_hard') {{
849
- rewardsHtml = `
850
- <div style="margin-bottom:12px;">
851
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
852
- <span style="font-size:13px; color:#e8e8f0;">Correct Step Identified</span>
853
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.15</span>
854
- </div>
855
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
856
- <span style="font-size:13px; color:#e8e8f0;">Step 2 Fixed</span>
857
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.25</span>
858
- </div>
859
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
860
- <span style="font-size:13px; color:#e8e8f0;">Step 4 Fixed</span>
861
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.20</span>
862
- </div>
863
- <div style="display:flex; justify-content:space-between; align-items:center;">
864
- <span style="font-size:13px; color:#e8e8f0;">Final Totals Exact Match</span>
865
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.40</span>
866
- </div>
867
- </div>
868
- `;
869
- }} else {{
870
- rewardsHtml = `
871
- <div style="margin-bottom:12px;">
872
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
873
- <span style="font-size:13px; color:#e8e8f0;">Parses successfully</span>
874
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.10</span>
875
- </div>
876
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
877
- <span style="font-size:13px; color:#e8e8f0;">Executes without error</span>
878
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.20</span>
879
- </div>
880
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
881
- <span style="font-size:13px; color:#e8e8f0;">Column Accuracy</span>
882
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.10</span>
883
- </div>
884
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
885
- <span style="font-size:13px; color:#e8e8f0;">Data Accuracy</span>
886
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.30</span>
887
- </div>
888
- <div style="display:flex; justify-content:space-between; align-items:center;">
889
- <span style="font-size:13px; color:#e8e8f0;">Exact Match Bonus</span>
890
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.30</span>
891
- </div>
892
- </div>
893
- `;
894
- }}
895
-
896
- rewardsHtml += `
897
- <div style="font-size:11px; font-weight:bold; color:var(--muted); text-transform:uppercase; margin-bottom:6px; margin-top: 10px; border-top: 1px solid rgba(255,255,255,0.05); padding-top: 10px;">Penalties</div>
898
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
899
- <span style="font-size:13px; color:var(--muted)">Duplicate Submission</span>
900
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.10</span>
901
- </div>
902
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
903
- <span style="font-size:13px; color:var(--muted)">Efficiency Penalty</span>
904
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.20</span>
905
- </div>
906
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
907
- <span style="font-size:13px; color:var(--muted)">Destructive Action</span>
908
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.30</span>
909
- </div>
910
- <div style="display:flex; justify-content:space-between; align-items:center;">
911
- <span style="font-size:13px; color:var(--muted)">Hardcode Penalty</span>
912
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.50</span>
913
- </div>
914
- `;
915
-
916
- rewardBody.innerHTML = rewardsHtml;
917
-
918
- // Schema
919
- let schemaStr = '';
920
- for (const [table, cols] of Object.entries(task.schema_info)) {{
921
- schemaStr += `TABLE ${{table}} {{\\n`;
922
- cols.forEach(c => schemaStr += ` ${{c}}\\n`);
923
- schemaStr += `}}\\n\\n`;
924
- }}
925
- document.getElementById('schema-dump').textContent = schemaStr.trim();
926
-
927
- document.getElementById('reward-card').style.display = 'block';
928
-
929
- document.getElementById('verifier-out').className = 'verifier-output';
930
- document.getElementById('verifier-out').innerHTML = '🔄 Environment initialized. Awaiting agent execution…';
931
- }}
932
-
933
- function executeStep() {{
934
- const taskId = document.getElementById('task-select').value;
935
- const task = TASKS[taskId];
936
- const agentSQL = document.getElementById('agent-input').value.trim();
937
- const out = document.getElementById('verifier-out');
938
-
939
- if (!agentSQL) {{
940
- out.className = 'verifier-output error';
941
- out.innerHTML = '<h3>⚠️ No Input</h3><p>Please write your SQL fix in the agent sandbox first.</p>';
942
- return;
943
- }}
944
-
945
- // Fake verifier
946
- const sql = agentSQL.toUpperCase();
947
- const taskSolved = (
948
- (taskId === 'task_1_easy' && sql.includes(',') && sql.includes('NAME') && sql.includes('AGE')) ||
949
- (taskId === 'task_2_medium' && sql.includes('GROUP BY')) ||
950
- (taskId === 'task_3_hard' && sql.includes('PARTITION BY')) ||
951
- (taskId === 'task_4_expert' && !sql.includes('13-01') && sql.includes('MONTHLY_SALES'))
952
- );
953
-
954
- if (taskSolved) {{
955
- out.className = 'verifier-output success';
956
- out.innerHTML = `
957
- <h3>✅ Verification Passed!</h3>
958
- <p>The query compiled and executed successfully inside the DuckDB in-memory sandbox.</p>
959
- <p>The pipeline produced the expected output rows without errors.</p>
960
- <span class="reward-pill reward-positive">Reward: +1.0</span>
961
- `;
962
- }} else {{
963
- out.className = 'verifier-output error';
964
- out.innerHTML = `
965
- <h3>❌ Verification Failed</h3>
966
- <p>DuckDB raised an error during execution.</p>
967
- <p style="font-family:var(--mono);font-size:12px;margin-top:6px;opacity:0.8">${{task.error}}</p>
968
- <span class="reward-pill reward-negative">Reward: -0.1</span>
969
- `;
970
- }}
971
- }}
972
- </script>
973
- </body>
974
- </html>""".replace("{TASKS_JSON}", TASKS_JSON)
975
- return HTMLResponse(html)
976
-
977
-
978
- if __name__ == "__main__":
979
- import uvicorn
980
- uvicorn.run(app, host="0.0.0.0", port=7860)
1
+ import json
2
+ import time
3
+ import duckdb
4
+ from fastapi import FastAPI
5
+ from fastapi.responses import RedirectResponse, HTMLResponse
6
+ from fastapi.middleware.cors import CORSMiddleware
7
+ from pydantic import BaseModel
8
+
9
+ # ── Global session state for DuckDB-backed tasks ──────────────────────────────
10
+ CURRENT_SESSION = {
11
+ "task_id": None,
12
+ "con": None, # duckdb.DuckDBPyConnection
13
+ "step_count": 0,
14
+ "done": False,
15
+ "baseline_rows": None, # for optimization task
16
+ "chaos_fixed": False, # for chaos task
17
+ "reward_history": [],
18
+ }
19
+
20
+ app = FastAPI(
21
+ title="SQL Debug RL Environment",
22
+ description="Real-world SQL pipeline debugging environment. An agent learns to fix and route broken SQL scripts.",
23
+ version="1.0.0",
24
+ docs_url=None,
25
+ redoc_url=None,
26
+ )
27
+
28
+ app.add_middleware(
29
+ CORSMiddleware,
30
+ allow_origins=["*"],
31
+ allow_credentials=True,
32
+ allow_methods=["*"],
33
+ allow_headers=["*"],
34
+ )
35
+
36
+
37
+ # ── Pydantic Models ──────────────────────────────────────────────────────────
38
+
39
+ class StepAction(BaseModel):
40
+ action: str
41
+ explanation: str = ""
42
+
43
+ class ResetRequest(BaseModel):
44
+ task_id: str = "task_1_easy"
45
+
46
+
47
+ # ── Hard-coded Task Data ─────────────────────────────────────────────────────
48
+
49
+ TASKS = {
50
+ "task_1_easy": {
51
+ "label": "Task 1 — Easy: Syntax Fix",
52
+ "description": "Fix the syntax error in the SELECT statement. A comma is missing between column names.",
53
+ "broken_sql": "SELECT name age FROM users;",
54
+ "schema_info": {
55
+ "users": ["id INTEGER", "name TEXT", "age INTEGER", "email TEXT"]
56
+ },
57
+ "solution": "SELECT name, age FROM users;",
58
+ "error": "SyntaxError: Expected ',' or 'FROM' after 'name', got 'age'.",
59
+ "hint": "Add a comma between 'name' and 'age'.",
60
+ },
61
+ "task_2_medium": {
62
+ "label": "Task 2 — Medium: GROUP BY Aggregation",
63
+ "description": "You cannot SELECT unaggregated columns alongside aggregate functions without a GROUP BY clause.",
64
+ "broken_sql": (
65
+ "SELECT u.name, SUM(o.total) AS total_spent\n"
66
+ "FROM users u\n"
67
+ "JOIN orders o ON u.id = o.user_id;"
68
+ ),
69
+ "schema_info": {
70
+ "users": ["id INTEGER", "name TEXT"],
71
+ "orders": ["id INTEGER", "user_id INTEGER", "total DECIMAL"],
72
+ },
73
+ "solution": (
74
+ "SELECT u.name, SUM(o.total) AS total_spent\n"
75
+ "FROM users u\n"
76
+ "JOIN orders o ON u.id = o.user_id\n"
77
+ "GROUP BY u.name;"
78
+ ),
79
+ "error": "SemanticError: column 'u.name' must appear in the GROUP BY clause or be used in an aggregate function.",
80
+ "hint": "Add GROUP BY u.name at the end.",
81
+ },
82
+ "task_3_hard": {
83
+ "label": "Task 3 Hard: Window Function + PARTITION",
84
+ "description": "The RANK() window function is missing PARTITION BY, causing it to rank globally instead of per-department.",
85
+ "broken_sql": (
86
+ "SELECT department, name, salary,\n"
87
+ " RANK() OVER (ORDER BY salary DESC) AS dept_rank\n"
88
+ "FROM employees\n"
89
+ "GROUP BY department;"
90
+ ),
91
+ "schema_info": {
92
+ "employees": ["id INTEGER", "name TEXT", "department TEXT", "salary DECIMAL"],
93
+ },
94
+ "solution": (
95
+ "SELECT department, name, salary,\n"
96
+ " RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank\n"
97
+ "FROM employees;"
98
+ ),
99
+ "error": "ExecutionError: window functions are not allowed in GROUP BY.",
100
+ "hint": "Remove GROUP BY and add PARTITION BY department inside OVER(...).",
101
+ },
102
+ "task_4_expert": {
103
+ "label": "Task 4 — Expert: CTE + Invalid Date",
104
+ "description": "The CTE contains an invalid date literal (month 13 does not exist). Fix the date and ensure the pipeline executes.",
105
+ "broken_sql": (
106
+ "WITH monthly_sales AS (\n"
107
+ " SELECT id, amount, txn_date\n"
108
+ " FROM transactions\n"
109
+ " WHERE txn_date > '2024-13-01'\n"
110
+ ")\n"
111
+ "SELECT SUM(amount) AS total FROM monthly_sales;"
112
+ ),
113
+ "schema_info": {
114
+ "transactions": ["id INTEGER", "amount DECIMAL", "txn_date DATE", "category TEXT"],
115
+ },
116
+ "solution": (
117
+ "WITH monthly_sales AS (\n"
118
+ " SELECT id, amount, txn_date\n"
119
+ " FROM transactions\n"
120
+ " WHERE txn_date > '2024-12-01'\n"
121
+ ")\n"
122
+ "SELECT SUM(amount) AS total FROM monthly_sales;"
123
+ ),
124
+ "error": "DataError: month must be in 1..12, got '13'.",
125
+ "hint": "Change '2024-13-01' to a valid date like '2024-12-01'.",
126
+ },
127
+
128
+ # ── Advanced Tasks ──────────────────────────────────────────────────────
129
+ "task_5_optimization": {
130
+ "label": "Task 5 — Advanced: Query Optimization",
131
+ "description": (
132
+ "A working query uses a CROSS JOIN + WHERE filter instead of a proper INNER JOIN. "
133
+ "It returns correct results but is catastrophically slow. "
134
+ "Your goal: rewrite it to use an explicit JOIN. "
135
+ "The verifier checks (1) output matches baseline and (2) EXPLAIN plan no longer contains CROSS_PRODUCT."
136
+ ),
137
+ "broken_sql": (
138
+ "SELECT c.name, SUM(o.amount) AS total_spent\n"
139
+ "FROM customers c, orders o\n"
140
+ "WHERE c.id = o.customer_id\n"
141
+ "GROUP BY c.name\n"
142
+ "ORDER BY total_spent DESC;"
143
+ ),
144
+ "schema_info": {
145
+ "customers": ["id INTEGER PRIMARY KEY", "name TEXT", "city TEXT"],
146
+ "orders": ["id INTEGER PRIMARY KEY", "customer_id INTEGER", "amount DECIMAL", "order_date DATE"],
147
+ },
148
+ "solution": (
149
+ "SELECT c.name, SUM(o.amount) AS total_spent\n"
150
+ "FROM customers c\n"
151
+ "INNER JOIN orders o ON c.id = o.customer_id\n"
152
+ "GROUP BY c.name\n"
153
+ "ORDER BY total_spent DESC;"
154
+ ),
155
+ "error": "Performance issue: CROSS JOIN creates a cartesian product before filtering. Zero errors, but terrible at scale.",
156
+ "hint": "Replace 'FROM customers c, orders o WHERE c.id = o.customer_id' with 'FROM customers c INNER JOIN orders o ON c.id = o.customer_id'.",
157
+ "duckdb_backed": True,
158
+ },
159
+ "task_6_migration": {
160
+ "label": "Task 6 — Advanced: Schema Migration (3NF)",
161
+ "description": (
162
+ "You have a single denormalized 'messy_dump' table with columns: "
163
+ "(user_id, user_name, order_id, order_date, product, amount). "
164
+ "Migrate it to a 3NF schema: users(id, name) and orders(id, user_id, order_date, product, amount). "
165
+ "Then DROP the original table. "
166
+ "WARNING: Dropping 'messy_dump' before populating target tables triggers a Destructive Action penalty and ends the episode."
167
+ ),
168
+ "broken_sql": (
169
+ "-- Step 1: Create target tables\n"
170
+ "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);\n"
171
+ "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_date DATE, product TEXT, amount DECIMAL);\n\n"
172
+ "-- Step 2: Migrate data\n"
173
+ "INSERT INTO users SELECT DISTINCT user_id, user_name FROM messy_dump;\n"
174
+ "INSERT INTO orders SELECT order_id, user_id, order_date::DATE, product, amount FROM messy_dump;\n\n"
175
+ "-- Step 3: Drop original\n"
176
+ "DROP TABLE messy_dump;"
177
+ ),
178
+ "schema_info": {
179
+ "messy_dump": ["user_id INTEGER", "user_name TEXT", "order_id INTEGER", "order_date TEXT", "product TEXT", "amount DECIMAL"],
180
+ "users [TARGET]": ["id INTEGER PRIMARY KEY", "name TEXT"],
181
+ "orders [TARGET]": ["id INTEGER PRIMARY KEY", "user_id INTEGER", "order_date DATE", "product TEXT", "amount DECIMAL"],
182
+ },
183
+ "solution": (
184
+ "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);\n"
185
+ "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_date DATE, product TEXT, amount DECIMAL);\n"
186
+ "INSERT INTO users SELECT DISTINCT user_id, user_name FROM messy_dump;\n"
187
+ "INSERT INTO orders SELECT order_id, user_id, order_date::DATE, product, amount FROM messy_dump;\n"
188
+ "DROP TABLE messy_dump;"
189
+ ),
190
+ "error": "NoError: Data exists but is denormalized. Goal is to normalize into 3NF and safely migrate.",
191
+ "hint": "Create 'users' and 'orders' tables first, INSERT data from messy_dump, then DROP messy_dump last.",
192
+ "duckdb_backed": True,
193
+ },
194
+ "task_7_chaos": {
195
+ "label": "Task 7 — Advanced: Chaos Engineering (Live Corruption)",
196
+ "description": (
197
+ "A live ETL pipeline runs on every step, inserting new records. "
198
+ "A bug is causing DUPLICATE user_id entries and NULL email values, "
199
+ "which poisons downstream analytics. "
200
+ "Query the 'error_logs' table to identify the root cause, "
201
+ "then apply a patch (UNIQUE constraint / COALESCE cleanup) to stop the corruption. "
202
+ "Reward increases for every clean step after your fix is applied."
203
+ ),
204
+ "broken_sql": (
205
+ "-- Inspect the error log first:\n"
206
+ "SELECT * FROM error_logs ORDER BY logged_at DESC LIMIT 10;\n\n"
207
+ "-- Then apply your fix. Example patches:\n"
208
+ "-- 1) Clean duplicates: DELETE FROM users WHERE rowid NOT IN (SELECT MIN(rowid) FROM users GROUP BY user_id);\n"
209
+ "-- 2) Fix NULLs: UPDATE users SET email = COALESCE(email, 'unknown@domain.com') WHERE email IS NULL;\n"
210
+ "-- 3) Add constraint: CREATE UNIQUE INDEX IF NOT EXISTS ux_users_id ON users(user_id);"
211
+ ),
212
+ "schema_info": {
213
+ "users": ["rowid INTEGER", "user_id INTEGER", "name TEXT", "email TEXT"],
214
+ "error_logs": ["id INTEGER", "error_type TEXT", "details TEXT", "logged_at TIMESTAMP"],
215
+ },
216
+ "solution": (
217
+ "DELETE FROM users WHERE rowid NOT IN (SELECT MIN(rowid) FROM users GROUP BY user_id);\n"
218
+ "UPDATE users SET email = COALESCE(email, 'unknown@domain.com') WHERE email IS NULL;\n"
219
+ "CREATE UNIQUE INDEX IF NOT EXISTS ux_users_id ON users(user_id);"
220
+ ),
221
+ "error": "DataIntegrityError: Duplicate user_id values and NULL emails detected in the pipeline output.",
222
+ "hint": "First SELECT * FROM error_logs to understand what is failing, then clean duplicates and NULLs, and add a UNIQUE index.",
223
+ "duckdb_backed": True,
224
+ },
225
+ }
226
+
227
+
228
+ # ── API Endpoints ────────────────────────────────────────────────────────────
229
+
230
+ @app.get("/", include_in_schema=False)
231
+ def read_root():
232
+ return RedirectResponse(url="/web_ui")
233
+
234
+ @app.get("/health", tags=["default"])
235
+ def health():
236
+ return {"status": "ok", "version": "1.0.0", "message": "SQL Debug Environment is healthy."}
237
+
238
+ def _seed_task5(con):
239
+ """Seed customers + orders for the optimization task."""
240
+ con.execute("DROP TABLE IF EXISTS customers; DROP TABLE IF EXISTS orders;")
241
+ con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
242
+ con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount DECIMAL, order_date DATE)")
243
+ customers = [(i, f"Customer_{i}", "City") for i in range(1, 51)]
244
+ orders = [(i, (i % 50) + 1, round(10 + (i * 3.7) % 500, 2), "2024-01-15") for i in range(1, 201)]
245
+ con.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)
246
+ con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)
247
+
248
+ def _seed_task6(con):
249
+ """Seed messy_dump for the migration task."""
250
+ con.execute("DROP TABLE IF EXISTS messy_dump; DROP TABLE IF EXISTS users; DROP TABLE IF EXISTS orders;")
251
+ con.execute("CREATE TABLE messy_dump (user_id INTEGER, user_name TEXT, order_id INTEGER, order_date TEXT, product TEXT, amount DECIMAL)")
252
+ rows = [
253
+ (1,"Alice",101,"2024-01-10","Widget A",29.99),
254
+ (1,"Alice",102,"2024-01-12","Widget B",49.99),
255
+ (2,"Bob",103,"2024-01-15","Gadget X",99.99),
256
+ (3,"Carol",104,"2024-01-20","Widget A",29.99),
257
+ (3,"Carol",105,"2024-01-22","Gadget Y",149.99),
258
+ (4,"Dave",106,"2024-02-01","Widget B",49.99),
259
+ (5,"Eve",107,"2024-02-05","Gadget X",99.99),
260
+ ]
261
+ con.executemany("INSERT INTO messy_dump VALUES (?,?,?,?,?,?)", rows)
262
+
263
+ def _seed_task7(con):
264
+ """Seed a corrupted users table and an error_logs table for chaos task."""
265
+ con.execute("DROP SEQUENCE IF EXISTS seq_users; DROP TABLE IF EXISTS users; DROP TABLE IF EXISTS error_logs;")
266
+ con.execute("CREATE SEQUENCE seq_users START 1")
267
+ con.execute("CREATE TABLE users (rowid INTEGER DEFAULT nextval('seq_users'), user_id INTEGER, name TEXT, email TEXT)")
268
+ con.execute("CREATE TABLE error_logs (id INTEGER, error_type TEXT, details TEXT, logged_at TIMESTAMP)")
269
+ users = [
270
+ (1,"Alice","alice@example.com"),
271
+ (2,"Bob","bob@example.com"),
272
+ (1,"Alice_dup",None), # duplicate user_id + NULL email
273
+ (3,"Carol","carol@example.com"),
274
+ (4,"Dave",None), # NULL email
275
+ (2,"Bob_dup","bob2@example.com"), # duplicate user_id
276
+ ]
277
+ con.executemany("INSERT INTO users (user_id, name, email) VALUES (?,?,?)", users)
278
+ logs = [
279
+ (1,"DUPLICATE_KEY","user_id=1 appears 2 times","2024-01-15 08:01:00"),
280
+ (2,"NULL_VIOLATION","email IS NULL for user_id=1 (row 3)","2024-01-15 08:01:01"),
281
+ (3,"DUPLICATE_KEY","user_id=2 appears 2 times","2024-01-15 08:01:02"),
282
+ (4,"NULL_VIOLATION","email IS NULL for user_id=4","2024-01-15 08:01:03"),
283
+ ]
284
+ con.executemany("INSERT INTO error_logs VALUES (?,?,?,?)", logs)
285
+
286
+ def _run_chaos_pipeline(con):
287
+ """Simulate one ETL tick that tries to insert dirty data."""
288
+ import random  # datetime is not needed here
289
+ uid = random.randint(1, 3) # intentional duplicate range
290
+ con.execute(
291
+ "INSERT INTO users (user_id, name, email) VALUES (?, ?, ?)",
292
+ [uid, f"Auto_{uid}", None if random.random() < 0.5 else f"auto{uid}@x.com"]
293
+ )
294
+
295
+ @app.post("/reset", tags=["Environment"])
296
+ def reset_episode(req: ResetRequest):
297
+ task_id = req.task_id if req.task_id in TASKS else "task_1_easy"
298
+ task = TASKS[task_id]
299
+
300
+ # Spin up a fresh DuckDB connection for DuckDB-backed tasks
301
+ if task.get("duckdb_backed"):
302
+ con = duckdb.connect(":memory:")
303
+ if task_id == "task_5_optimization":
304
+ _seed_task5(con)
305
+ baseline = con.execute(
306
+ "SELECT c.name, SUM(o.amount) AS total_spent "
307
+ "FROM customers c, orders o WHERE c.id = o.customer_id "
308
+ "GROUP BY c.name ORDER BY total_spent DESC"
309
+ ).fetchall()
310
+ elif task_id == "task_6_migration":
311
+ _seed_task6(con)
312
+ baseline = None
313
+ elif task_id == "task_7_chaos":
314
+ _seed_task7(con)
315
+ baseline = None
316
+
+ else:
+ con, baseline = None, None
+
+ # Always record the new episode, so /step dispatches on the right task
+ # even when switching from a DuckDB-backed task to a legacy one.
+ CURRENT_SESSION.update({
+ "task_id": task_id, "con": con, "step_count": 0,
+ "done": False, "baseline_rows": baseline,
+ "chaos_fixed": False, "reward_history": [],
+ })
322
+
323
+ return {
324
+ "status": "success",
325
+ "observation": {
326
+ "task_id": task_id,
327
+ "label": task["label"],
328
+ "description": task["description"],
329
+ "broken_sql": task["broken_sql"],
330
+ "schema_info": task["schema_info"],
331
+ "error_hint": task["error"],
332
+ },
333
+ }
334
+
335
+
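An episode against this API is one `POST /reset` followed by repeated `POST /step` calls. A minimal sketch of the two request bodies (the helper names are ours, not part of the app):

```python
import json

def reset_payload(task_id: str) -> bytes:
    # Body for POST /reset; unknown task_ids fall back to task_1_easy server-side.
    return json.dumps({"task_id": task_id}).encode()

def step_payload(sql: str) -> bytes:
    # Body for POST /step; the agent's candidate SQL goes in "action".
    return json.dumps({"action": sql}).encode()
```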
336
+ @app.post("/step", tags=["Environment"])
337
+ def step_environment(action: StepAction):
338
+ task_id = CURRENT_SESSION.get("task_id")
339
+ task = TASKS.get(task_id, {})
340
+ con = CURRENT_SESSION.get("con")
341
+ step_count = CURRENT_SESSION.get("step_count", 0) + 1
342
+ CURRENT_SESSION["step_count"] = step_count
343
+
344
+ # ── Legacy tasks 1-4: simple pattern matching ───────────────────────────
345
+ if not task.get("duckdb_backed"):
346
+ sql = action.action.strip().upper()
347
+ solved = "GROUP BY" in sql or "," in sql or "PARTITION" in sql or "12-01" in sql
348
+ reward = 1.0 if solved else -0.1
349
+ CURRENT_SESSION["reward_history"].append(reward)
350
+ return {
351
+ "reward": reward, "done": solved,
352
+ "info": {
353
+ "message": "Execution succeeded." if solved else "Execution failed. Review your fix.",
354
+ "verifier": "Pattern-match verifier",
355
+ },
356
+ "state": {"current_sql": action.action, "step_count": step_count},
357
+ }
358
+
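The legacy check above reduces to a pure predicate — one trigger substring per task. Isolated as a sketch (the function name is ours):

```python
def legacy_solved(sql: str) -> bool:
    # Mirrors the tasks 1-4 verifier: any of these substrings counts as a fix.
    s = sql.strip().upper()
    return "GROUP BY" in s or "," in s or "PARTITION" in s or "12-01" in s
```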
359
+ # ── Task 5: Query Optimization ───────────────────────────────────────────
360
+ if task_id == "task_5_optimization":
361
+ agent_sql = action.action.strip()
362
+ reward, done, msg = 0.0, False, ""
363
+ try:
364
+ t0 = time.perf_counter()
365
+ rows = con.execute(agent_sql).fetchall()
366
+ elapsed = time.perf_counter() - t0
367
+
368
+ baseline = CURRENT_SESSION["baseline_rows"]
369
+ correct = sorted(rows) == sorted(baseline)
370
+ explain = con.execute(f"EXPLAIN {agent_sql}").fetchall()
371
+ plan_str = " ".join(str(r) for r in explain).upper()
372
+ no_cross = "CROSS_PRODUCT" not in plan_str
373
+
374
+ if correct and no_cross:
375
+ reward, done = 1.0, True
376
+ msg = f"✅ Output matches baseline ({len(rows)} rows in {elapsed*1000:.1f} ms). EXPLAIN shows no CROSS_PRODUCT. Reward: +1.0"
377
+ elif correct:
378
+ reward = 0.5
379
+ msg = "⚠️ Output matches baseline but EXPLAIN still shows CROSS_PRODUCT. Reward: +0.5"
380
+ else:
381
+ reward = -0.1
382
+ msg = "❌ Output does NOT match baseline. Check your query logic."
383
+ except Exception as e:
384
+ reward, msg = -0.2, f"❌ DuckDB Error: {e}"
385
+ CURRENT_SESSION["reward_history"].append(reward)
386
+ return {"reward": reward, "done": done,
387
+ "info": {"message": msg, "verifier": "DuckDB EXPLAIN + row comparison"},
388
+ "state": {"step_count": step_count}}
389
+
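Task 5 compares result sets order-insensitively via `sorted`. The same comparison, extracted as a sketch (assumes rows are comparable tuples, which holds for DuckDB `fetchall()` output):

```python
def rows_match(got, baseline) -> bool:
    # Order-insensitive row-set equality, as used by the task 5 verifier.
    return sorted(got) == sorted(baseline)
```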
390
+ # ── Task 6: Schema Migration ─────────────────────────────────────────────
391
+ if task_id == "task_6_migration":
392
+ agent_sql = action.action.strip()
393
+ reward, done, msg = 0.0, False, ""
394
+ # Detect if agent is dropping messy_dump early (destructive action)
395
+ sql_upper = agent_sql.upper()
396
+ tables_before = {r[0].lower() for r in con.execute("SHOW TABLES").fetchall()}
397
+ users_ok = "users" in tables_before
398
+ orders_ok = "orders" in tables_before
399
+ dropping = "DROP" in sql_upper and "MESSY_DUMP" in sql_upper
400
+
401
+ if dropping:
402
+ # Check if data is actually populated
403
+ u_ok = users_ok and con.execute("SELECT COUNT(*) FROM users").fetchone()[0] > 0
404
+ o_ok = orders_ok and con.execute("SELECT COUNT(*) FROM orders").fetchone()[0] > 0
405
+ if not (u_ok and o_ok):
406
+ reward, done = -0.3, True
407
+ msg = "💀 DESTRUCTIVE ACTION: Attempted to DROP messy_dump before the target tables were populated! Episode ended. Penalty: -0.3"
408
+ CURRENT_SESSION["done"] = True
409
+ CURRENT_SESSION["reward_history"].append(reward)
410
+ return {"reward": reward, "done": done,
411
+ "info": {"message": msg, "verifier": "Intermediate-state guard"},
412
+ "state": {"step_count": step_count}}
413
+ try:
414
+ for stmt in agent_sql.split(";"):
415
+ stmt = stmt.strip()
416
+ if stmt:
417
+ con.execute(stmt)
418
+ tables_after = {r[0].lower() for r in con.execute("SHOW TABLES").fetchall()}
419
+ users_count = con.execute("SELECT COUNT(*) FROM users").fetchone()[0] if "users" in tables_after else 0
420
+ orders_count = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0] if "orders" in tables_after else 0
421
+ dump_gone = "messy_dump" not in tables_after
422
+
423
+ if users_count >= 5 and orders_count >= 7 and dump_gone:
424
+ reward, done = 1.0, True
425
+ msg = f"✅ Migration complete! users={users_count} rows, orders={orders_count} rows. messy_dump dropped. Reward: +1.0"
426
+ elif users_count > 0 or orders_count > 0:
427
+ reward = 0.3
428
+ msg = f"🔄 Partial progress: users={users_count}, orders={orders_count}. messy_dump={'gone' if dump_gone else 'still exists'}."
429
+ else:
430
+ reward = 0.05
431
+ msg = "📋 Tables created. Now migrate the data with INSERT INTO ... SELECT."
432
+ except Exception as e:
433
+ reward, msg = -0.2, f"❌ DuckDB Error: {e}"
434
+ CURRENT_SESSION["reward_history"].append(reward)
435
+ return {"reward": reward, "done": done,
436
+ "info": {"message": msg, "verifier": "Row-count + table existence check"},
437
+ "state": {"step_count": step_count}}
438
+
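Task 6 executes multi-statement scripts by splitting on `;`. The same helper, extracted as a sketch — note the naive split would break on semicolons inside string literals, which these task scripts avoid:

```python
def split_statements(script: str) -> list[str]:
    # Naive split on ';' — acceptable here because the task SQL contains
    # no semicolons inside string literals.
    return [s.strip() for s in script.split(";") if s.strip()]
```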
439
+ # ── Task 7: Chaos Engineering ────────────────────────────────────────────
440
+ if task_id == "task_7_chaos":
441
+ agent_sql = action.action.strip()
442
+ reward, done, msg = 0.0, False, ""
443
+ try:
444
+ for stmt in agent_sql.split(";"):
445
+ stmt = stmt.strip()
446
+ if stmt and not stmt.startswith("--"):
447
+ con.execute(stmt)
448
+ # Run one tick of the "live" ETL pipeline
449
+ _run_chaos_pipeline(con)
450
+ # Check integrity
451
+ dup_count = con.execute("SELECT COUNT(*) FROM (SELECT user_id FROM users GROUP BY user_id HAVING COUNT(*)>1)").fetchone()[0]
452
+ null_count = con.execute("SELECT COUNT(*) FROM users WHERE email IS NULL").fetchone()[0]
453
+ has_index = any("ux_users_id" in str(r) for r in con.execute("SELECT index_name FROM duckdb_indexes()").fetchall())
454
+
455
+ if dup_count == 0 and null_count == 0 and has_index:
456
+ reward, done = 1.0, True
457
+ CURRENT_SESSION["chaos_fixed"] = True
458
+ msg = "✅ Pipeline is clean! No duplicates, no NULLs, UNIQUE index in place. Reward: +1.0"
459
+ elif dup_count == 0 and null_count == 0:
460
+ reward = 0.7
461
+ msg = "🔄 Data is clean this step but no UNIQUE index. Reward: +0.7 (add index to fully lock it in)"
462
+ elif CURRENT_SESSION.get("chaos_fixed"):
463
+ reward = 0.5
464
+ msg = f"⚠️ ETL re-introduced {dup_count} dups and {null_count} NULLs. Partial reward: +0.5"
465
+ else:
466
+ reward = -0.1
467
+ msg = f"❌ Still corrupt: {dup_count} duplicate user_ids, {null_count} NULL emails. Reward: -0.1"
468
+ except Exception as e:
469
+ reward, msg = -0.2, f"❌ DuckDB Error: {e}"
470
+ CURRENT_SESSION["reward_history"].append(reward)
471
+ return {"reward": reward, "done": done,
472
+ "info": {"message": msg, "verifier": "Integrity check (dups + NULLs + index)"},
473
+ "state": {"step_count": step_count}}
474
+
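The two quantities the chaos verifier checks each step — duplicate `user_id`s and NULL emails — can be reproduced with stdlib `sqlite3` for illustration. A sketch, not the DuckDB-backed verifier itself:

```python
import sqlite3

def integrity_counts(rows):
    # rows: (user_id, name, email) tuples.
    # Returns (duplicated user_ids, NULL emails), the task 7 integrity signals.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (user_id INTEGER, name TEXT, email TEXT)")
    con.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    dup = con.execute(
        "SELECT COUNT(*) FROM (SELECT user_id FROM users "
        "GROUP BY user_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    nulls = con.execute(
        "SELECT COUNT(*) FROM users WHERE email IS NULL"
    ).fetchone()[0]
    con.close()
    return dup, nulls
```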
475
+ @app.get("/state", tags=["Environment"])
+ def get_state():
+ task_id = CURRENT_SESSION.get("task_id") or "task_2_medium"
+ task = TASKS[task_id]
+ return {
+ "task_id": task_id,
+ "current_sql": task["broken_sql"],
+ "step_count": CURRENT_SESSION.get("step_count", 0),
+ "done": CURRENT_SESSION.get("done", False),
+ "schema": task["schema_info"],
+ }
484
+
485
+ @app.get("/tasks", tags=["System"])
486
+ def get_tasks():
487
+ return TASKS
488
+
489
+ @app.get("/web", tags=["System"])
490
+ def web_redirect():
491
+ return RedirectResponse(url="/web_ui")
492
+
493
+
494
+ # ── Custom API Docs ──────────────────────────────────────────────────────────
495
+
496
+ @app.get("/docs", include_in_schema=False)
497
+ async def custom_swagger():
498
+ html = """<!DOCTYPE html>
499
+ <html lang="en">
500
+ <head>
501
+ <meta charset="UTF-8"/>
502
+ <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
503
+ <title>SQL Debug Env – API Docs</title>
504
+ <link rel="preconnect" href="https://fonts.googleapis.com">
505
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
506
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui.css">
507
+ <style>
508
+ *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
509
+ body {
510
+ font-family: 'Inter', sans-serif;
511
+ background: #ffffff;
512
+ color: #333333;
513
+ min-height: 100vh;
514
+ }
515
+
516
+ /* ── Top Nav (Light Mode) ── */
517
+ .nav {
518
+ position: sticky;
519
+ top: 0;
520
+ z-index: 1000;
521
+ display: flex;
522
+ align-items: center;
523
+ justify-content: space-between;
524
+ padding: 0 32px;
525
+ height: 64px;
526
+ background: rgba(255, 255, 255, 0.95);
527
+ backdrop-filter: blur(16px);
528
+ border-bottom: 1px solid #e5e5e5;
529
+ }
530
+ .nav-brand {
531
+ display: flex;
532
+ align-items: center;
533
+ gap: 12px;
534
+ font-size: 18px;
535
+ font-weight: 700;
536
+ color: #111827;
537
+ }
538
+ .nav-badge {
539
+ background: #f3f4f6;
540
+ border: 1px solid #d1d5db;
541
+ padding: 3px 10px;
542
+ border-radius: 20px;
543
+ font-size: 11px;
544
+ font-weight: 600;
545
+ letter-spacing: 0.5px;
546
+ color: #4b5563;
547
+ }
548
+ .nav-actions { display: flex; gap: 10px; }
549
+ .btn-back {
550
+ display: inline-flex;
551
+ align-items: center;
552
+ gap: 6px;
553
+ background: #ffffff;
554
+ border: 1px solid #d1d5db;
555
+ color: #374151;
556
+ padding: 8px 18px;
557
+ border-radius: 8px;
558
+ text-decoration: none;
559
+ font-size: 13px;
560
+ font-weight: 600;
561
+ transition: all 0.2s;
562
+ }
563
+ .btn-back:hover {
564
+ background: #f9fafb;
565
+ border-color: #9ca3af;
566
+ transform: translateY(-1px);
567
+ }
568
+
569
+ /* Small wrapper padding so it doesn't touch the edges */
570
+ .swagger-ui .wrapper { padding: 24px 40px; max-width: 1300px; margin: 0 auto; }
571
+ .swagger-ui .topbar { display: none !important; }
572
+ </style>
573
+ </head>
574
+ <body>
575
+ <nav class="nav">
576
+ <div class="nav-brand">
577
+ 🛰️ SQL Debug Environment
578
+ <span class="nav-badge">OAS 3.1</span>
579
+ <span class="nav-badge" style="background:linear-gradient(135deg,#10b981,#059669)">v1.0.0</span>
580
+ </div>
581
+ <div class="nav-actions">
582
+ <a href="/web_ui" class="btn-back">⬅ Back to Web UI</a>
583
+ </div>
584
+ </nav>
585
+ <div id="swagger-ui"></div>
586
+ <script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
587
+ <script>
588
+ window.onload = () => {
589
+ SwaggerUIBundle({
590
+ url: "/openapi.json",
591
+ dom_id: '#swagger-ui',
592
+ deepLinking: true,
593
+ presets: [SwaggerUIBundle.presets.apis, SwaggerUIBundle.SwaggerUIStandalonePreset],
594
+ layout: "BaseLayout",
595
+ });
596
+ };
597
+ </script>
598
+ </body>
599
+ </html>"""
600
+ return HTMLResponse(html)
601
+
602
+
603
+ # ── Custom Web UI ────────────────────────────────────────────────────────────
604
+
605
+ TASKS_JSON = json.dumps(TASKS)
606
+
607
+ @app.get("/web_ui", include_in_schema=False)
608
+ async def web_ui():
609
+ html = f"""<!DOCTYPE html>
610
+ <html lang="en">
611
+ <head>
612
+ <meta charset="UTF-8"/>
613
+ <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
614
+ <title>SQL Debug RL Environment</title>
615
+ <link rel="preconnect" href="https://fonts.googleapis.com">
616
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
617
+ <style>
618
+ *, *::before, *::after {{ box-sizing: border-box; margin: 0; padding: 0; }}
619
+
620
+ :root {{
621
+ --bg: #0f0e17;
622
+ --surface: #1a1827;
623
+ --surface2: #221f35;
624
+ --border: rgba(139,92,246,0.2);
625
+ --accent: #8b5cf6;
626
+ --accent2: #6366f1;
627
+ --green: #10b981;
628
+ --red: #ef4444;
629
+ --text: #e8e8f0;
630
+ --muted: #9090a8;
631
+ --mono: 'JetBrains Mono', monospace;
632
+ --sans: 'Inter', sans-serif;
633
+ }}
634
+
635
+ html, body {{ height: 100%; }}
636
+ body {{
637
+ font-family: var(--sans);
638
+ background: var(--bg);
639
+ color: var(--text);
640
+ min-height: 100vh;
641
+ overflow-x: hidden;
642
+ }}
643
+
644
+ /* ── Animated background ── */
645
+ body::before {{
646
+ content: '';
647
+ position: fixed;
648
+ top: -40%;
649
+ left: -20%;
650
+ width: 600px;
651
+ height: 600px;
652
+ background: radial-gradient(circle, rgba(139,92,246,0.12) 0%, transparent 70%);
653
+ pointer-events: none;
654
+ z-index: 0;
655
+ }}
656
+ body::after {{
657
+ content: '';
658
+ position: fixed;
659
+ bottom: -30%;
660
+ right: -10%;
661
+ width: 500px;
662
+ height: 500px;
663
+ background: radial-gradient(circle, rgba(99,102,241,0.1) 0%, transparent 70%);
664
+ pointer-events: none;
665
+ z-index: 0;
666
+ }}
667
+
668
+ /* ── Nav ── */
669
+ .nav {{
670
+ position: sticky;
671
+ top: 0;
672
+ z-index: 100;
673
+ display: flex;
674
+ align-items: center;
675
+ justify-content: space-between;
676
+ padding: 0 36px;
677
+ height: 64px;
678
+ background: rgba(15, 14, 23, 0.8);
679
+ backdrop-filter: blur(16px);
680
+ border-bottom: 1px solid var(--border);
681
+ }}
682
+ .nav-brand {{
683
+ display: flex;
684
+ align-items: center;
685
+ gap: 12px;
686
+ font-size: 17px;
687
+ font-weight: 700;
688
+ letter-spacing: -0.3px;
689
+ }}
690
+ .badge {{
691
+ padding: 3px 10px;
692
+ border-radius: 20px;
693
+ font-size: 11px;
694
+ font-weight: 600;
695
+ background: linear-gradient(135deg, var(--accent), var(--accent2));
696
+ }}
697
+ .btn {{
698
+ display: inline-flex;
699
+ align-items: center;
700
+ gap: 6px;
701
+ padding: 8px 18px;
702
+ border-radius: 8px;
703
+ font-size: 13px;
704
+ font-weight: 600;
705
+ cursor: pointer;
706
+ transition: all 0.2s;
707
+ border: none;
708
+ text-decoration: none;
709
+ }}
710
+ .btn-outline {{
711
+ background: rgba(139,92,246,0.1);
712
+ border: 1px solid rgba(139,92,246,0.4);
713
+ color: #a78bfa;
714
+ }}
715
+ .btn-outline:hover {{
716
+ background: rgba(139,92,246,0.25);
717
+ border-color: var(--accent);
718
+ color: #fff;
719
+ transform: translateY(-1px);
720
+ }}
721
+ .btn-primary {{
722
+ background: linear-gradient(135deg, var(--accent), var(--accent2));
723
+ color: #fff;
724
+ box-shadow: 0 4px 14px rgba(139,92,246,0.35);
725
+ }}
726
+ .btn-primary:hover {{
727
+ transform: translateY(-2px);
728
+ box-shadow: 0 6px 20px rgba(139,92,246,0.5);
729
+ }}
730
+ .btn-green {{
731
+ background: linear-gradient(135deg, #10b981, #059669);
732
+ color: #fff;
733
+ box-shadow: 0 4px 14px rgba(16,185,129,0.35);
734
+ width: 100%;
735
+ justify-content: center;
736
+ padding: 12px;
737
+ font-size: 14px;
738
+ }}
739
+ .btn-green:hover {{
740
+ transform: translateY(-2px);
741
+ box-shadow: 0 6px 20px rgba(16,185,129,0.5);
742
+ }}
743
+
744
+ /* ── Hero ── */
745
+ .hero {{
746
+ position: relative;
747
+ z-index: 1;
748
+ text-align: center;
749
+ padding: 60px 36px 40px;
750
+ }}
751
+ .hero-eyebrow {{
752
+ display: inline-flex;
753
+ align-items: center;
754
+ gap: 8px;
755
+ background: rgba(139,92,246,0.1);
756
+ border: 1px solid rgba(139,92,246,0.3);
757
+ padding: 6px 16px;
758
+ border-radius: 20px;
759
+ font-size: 12px;
760
+ font-weight: 600;
761
+ color: #a78bfa;
762
+ letter-spacing: 0.5px;
763
+ text-transform: uppercase;
764
+ margin-bottom: 20px;
765
+ }}
766
+ .hero h1 {{
767
+ font-size: clamp(28px, 5vw, 48px);
768
+ font-weight: 800;
769
+ letter-spacing: -1px;
770
+ background: linear-gradient(135deg, #fff 30%, #a78bfa 100%);
771
+ -webkit-background-clip: text;
772
+ -webkit-text-fill-color: transparent;
773
+ background-clip: text;
774
+ line-height: 1.15;
775
+ margin-bottom: 16px;
776
+ }}
777
+ .hero p {{
778
+ color: var(--muted);
779
+ font-size: 16px;
780
+ max-width: 600px;
781
+ margin: 0 auto 28px;
782
+ line-height: 1.6;
783
+ }}
784
+
785
+ /* ── Stat bar ── */
786
+ .stat-bar {{
787
+ display: flex;
788
+ justify-content: center;
789
+ gap: 32px;
790
+ padding: 20px 36px;
791
+ background: rgba(255,255,255,0.02);
792
+ border-top: 1px solid var(--border);
793
+ border-bottom: 1px solid var(--border);
794
+ position: relative;
795
+ z-index: 1;
796
+ }}
797
+ .stat {{ text-align: center; }}
798
+ .stat-val {{ font-size: 20px; font-weight: 700; color: var(--accent); }}
799
+ .stat-lbl {{ font-size: 11px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.5px; margin-top: 2px; }}
800
+
801
+ /* ── Main Layout ── */
802
+ .main {{
803
+ position: relative;
804
+ z-index: 1;
805
+ display: grid;
806
+ grid-template-columns: 320px 1fr;
807
+ gap: 24px;
808
+ padding: 32px 36px;
809
+ max-width: 1300px;
810
+ margin: 0 auto;
811
+ }}
812
+
813
+ /* ── Cards ── */
814
+ .card {{
815
+ background: var(--surface);
816
+ border: 1px solid var(--border);
817
+ border-radius: 16px;
818
+ overflow: hidden;
819
+ }}
820
+ .card-header {{
821
+ padding: 16px 20px;
822
+ border-bottom: 1px solid var(--border);
823
+ display: flex;
824
+ align-items: center;
825
+ gap: 10px;
826
+ font-weight: 700;
827
+ font-size: 13px;
828
+ text-transform: uppercase;
829
+ letter-spacing: 0.5px;
830
+ color: #a78bfa;
831
+ }}
832
+ .card-body {{ padding: 20px; }}
833
+
834
+ /* ── Sidebar ── */
835
+ .sidebar {{ display: flex; flex-direction: column; gap: 20px; }}
836
+
837
+ /* ── Select ── */
838
+ label.field-label {{
839
+ display: block;
840
+ font-size: 12px;
841
+ font-weight: 600;
842
+ color: var(--muted);
843
+ text-transform: uppercase;
844
+ letter-spacing: 0.5px;
845
+ margin-bottom: 8px;
846
+ }}
847
+ select, textarea {{
848
+ width: 100%;
849
+ background: var(--surface2);
850
+ border: 1px solid var(--border);
851
+ border-radius: 8px;
852
+ color: var(--text);
853
+ font-family: var(--sans);
854
+ font-size: 14px;
855
+ padding: 10px 14px;
856
+ outline: none;
857
+ transition: border-color 0.2s;
858
+ }}
859
+ select:focus, textarea:focus {{
860
+ border-color: var(--accent);
861
+ box-shadow: 0 0 0 3px rgba(139,92,246,0.15);
862
+ }}
863
+ select {{ cursor: pointer; appearance: none; background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='16' height='16' fill='%236b7280' viewBox='0 0 16 16'%3E%3Cpath d='M7.247 11.14L2.451 5.658C1.885 5.013 2.345 4 3.204 4h9.592a1 1 0 0 1 .753 1.659l-4.796 5.48a1 1 0 0 1-1.506 0z'/%3E%3C/svg%3E"); background-repeat: no-repeat; background-position: right 12px center; padding-right: 36px; }}
864
+ select option {{ background: #1a1827; }}
865
+
866
+ /* ── Schema / Task Info ── */
867
+ .info-block {{
868
+ background: var(--surface2);
869
+ border: 1px solid var(--border);
870
+ border-radius: 8px;
871
+ padding: 14px;
872
+ font-family: var(--mono);
873
+ font-size: 12.5px;
874
+ color: #c4b5fd;
875
+ white-space: pre-wrap;
876
+ line-height: 1.6;
877
+ max-height: 200px;
878
+ overflow-y: auto;
879
+ }}
880
+ .task-desc {{
881
+ font-family: var(--sans);
882
+ font-size: 13.5px;
883
+ color: var(--text);
884
+ line-height: 1.6;
885
+ margin-bottom: 10px;
886
+ }}
887
+ .error-chip {{
888
+ display: inline-block;
889
+ background: rgba(239,68,68,0.1);
890
+ border: 1px solid rgba(239,68,68,0.3);
891
+ color: #fca5a5;
892
+ padding: 4px 10px;
893
+ border-radius: 6px;
894
+ font-size: 12px;
895
+ font-family: var(--mono);
896
+ margin-top: 6px;
897
+ }}
898
+ .hint-chip {{
899
+ display: inline-block;
900
+ background: rgba(245,158,11,0.1);
901
+ border: 1px solid rgba(245,158,11,0.3);
902
+ color: #fcd34d;
903
+ padding: 4px 10px;
904
+ border-radius: 6px;
905
+ font-size: 12px;
906
+ margin-top: 6px;
907
+ }}
908
+
909
+ /* ── Right panel ── */
910
+ .right-panel {{ display: flex; flex-direction: column; gap: 20px; }}
911
+
912
+ /* ── Code editors ── */
913
+ .code-label {{
914
+ display: flex;
915
+ align-items: center;
916
+ justify-content: space-between;
917
+ margin-bottom: 8px;
918
+ }}
919
+ .code-label span {{
920
+ font-size: 12px;
921
+ font-weight: 600;
922
+ color: var(--muted);
923
+ text-transform: uppercase;
924
+ letter-spacing: 0.5px;
925
+ }}
926
+ .lang-tag {{
927
+ font-size: 11px;
928
+ padding: 2px 8px;
929
+ background: rgba(139,92,246,0.12);
930
+ border: 1px solid rgba(139,92,246,0.25);
931
+ border-radius: 4px;
932
+ color: #a78bfa;
933
+ font-family: var(--mono);
934
+ }}
935
+ textarea.code {{
936
+ font-family: var(--mono);
937
+ font-size: 13.5px;
938
+ resize: vertical;
939
+ line-height: 1.6;
940
+ tab-size: 2;
941
+ min-height: 130px;
942
+ color: #e2d9f3;
943
+ }}
944
+ textarea.code.read-only {{
945
+ background: rgba(15,14,23,0.6);
946
+ border-color: rgba(239,68,68,0.25);
947
+ color: #fca5a5;
948
+ cursor: default;
949
+ }}
950
+ textarea.code.agent {{
951
+ background: rgba(16,185,129,0.04);
952
+ border-color: rgba(16,185,129,0.25);
953
+ color: #a7f3d0;
954
+ }}
955
+ textarea.code.agent:focus {{
956
+ border-color: var(--green);
957
+ box-shadow: 0 0 0 3px rgba(16,185,129,0.15);
958
+ }}
959
+
960
+ /* ── Verifier output ── */
961
+ .verifier-output {{
962
+ border-radius: 10px;
963
+ padding: 20px;
964
+ font-size: 14px;
965
+ line-height: 1.5;
966
+ border: 1px dashed rgba(255,255,255,0.1);
967
+ background: rgba(255,255,255,0.02);
968
+ color: var(--muted);
969
+ text-align: center;
970
+ transition: all 0.4s ease;
971
+ }}
972
+ .verifier-output.success {{
973
+ background: rgba(16,185,129,0.07);
974
+ border: 1px solid rgba(16,185,129,0.35);
975
+ color: #6ee7b7;
976
+ text-align: left;
977
+ }}
978
+ .verifier-output.error {{
979
+ background: rgba(239,68,68,0.07);
980
+ border: 1px solid rgba(239,68,68,0.35);
981
+ color: #fca5a5;
982
+ text-align: left;
983
+ }}
984
+ .verifier-output h3 {{ font-size: 16px; margin-bottom: 8px; }}
985
+ .reward-pill {{
986
+ display: inline-block;
987
+ padding: 4px 12px;
988
+ border-radius: 20px;
989
+ font-weight: 700;
990
+ font-size: 13px;
991
+ margin-top: 8px;
992
+ }}
993
+
994
+
995
+ .reward-positive {{ background: rgba(16,185,129,0.2); color: #34d399; }}
996
+ .reward-negative {{ background: rgba(239,68,68,0.2); color: #f87171; }}
997
+
998
+ /* ── Divider ── */
999
+ .divider {{
1000
+ height: 1px;
1001
+ background: var(--border);
1002
+ margin: 4px 0;
1003
+ }}
1004
+
1005
+ /* ── Scrollbar ── */
1006
+ ::-webkit-scrollbar {{ width: 6px; height: 6px; }}
1007
+ ::-webkit-scrollbar-track {{ background: transparent; }}
1008
+ ::-webkit-scrollbar-thumb {{ background: rgba(139,92,246,0.3); border-radius: 3px; }}
1009
+
1010
+ @media (max-width: 900px) {{
1011
+ .main {{ grid-template-columns: 1fr; }}
1012
+ .stat-bar {{ flex-wrap: wrap; gap: 16px; }}
1013
+ }}
1014
+ </style>
1015
+ </head>
1016
+ <body>
1017
+
1018
+ <!-- Nav -->
1019
+ <nav class="nav">
1020
+ <div class="nav-brand">
1021
+ 🛰️ SQL Debug Env
1022
+ <span class="badge">v1.0.0</span>
1023
+ </div>
1024
+ <div style="display:flex;gap:10px">
1025
+ <a href="/docs" target="_blank" class="btn btn-outline">📖 API Docs</a>
1026
+ </div>
1027
+ </nav>
1028
+
1029
+ <!-- Hero -->
1030
+ <section class="hero">
1031
+ <div class="hero-eyebrow">🤖 Reinforcement Learning Verifiable Environment</div>
1032
+ <h1>Advanced SQL Debugging<br>RL Environment</h1>
1033
+ <p>Agents learn to diagnose and repair broken SQL pipelines. A sandboxed DuckDB executor evaluates every submission with a dense reward signal.</p>
1034
+ <a href="/docs" target="_blank" class="btn btn-outline">📖 View Full API Documentation →</a>
1035
+ </section>
1036
+
1037
+ <!-- Stat Bar -->
1038
+ <div class="stat-bar">
1039
+ <div class="stat"><div class="stat-val">7</div><div class="stat-lbl">Challenge Tasks</div></div>
1040
+ <div class="stat"><div class="stat-val">DuckDB</div><div class="stat-lbl">Sandbox Engine</div></div>
1041
+ <div class="stat"><div class="stat-val">Live</div><div class="stat-lbl">Verifier</div></div>
1042
+ <div class="stat"><div class="stat-val">3</div><div class="stat-lbl">Advanced RLVE Tasks</div></div>
1043
+ </div>
1044
+
1045
+ <!-- Main -->
1046
+ <div class="main">
1047
+
1048
+ <!-- Sidebar -->
1049
+ <aside class="sidebar">
1050
+
1051
+ <!-- Controls -->
1052
+ <div class="card">
1053
+ <div class="card-header">⚙️ Environment Controls</div>
1054
+ <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
1055
+ <div>
1056
+ <label class="field-label">🎯 Challenge Level</label>
1057
+ <select id="task-select">
1058
+ <option value="task_1_easy">Task 1 — Easy: Syntax Fix</option>
1059
+ <option value="task_2_medium">Task 2 — Medium: GROUP BY</option>
1060
+ <option value="task_3_hard">Task 3 — Hard: Window Function</option>
1061
+ <option value="task_4_expert">Task 4 — Expert: CTE + Date</option>
1062
+ <optgroup label="─── Advanced RLVE Tasks ───">
1063
+ <option value="task_5_optimization">Task 5 — Optimization (EXPLAIN-verified)</option>
1064
+ <option value="task_6_migration">Task 6 — Schema Migration (3NF)</option>
1065
+ <option value="task_7_chaos">Task 7 — Chaos Engineering (Live DB)</option>
1066
+ </optgroup>
1067
+ </select>
1068
+ </div>
1069
+ <button class="btn btn-primary" onclick="initEnv()">🔄 Initialize Environment</button>
1070
+ </div>
1071
+ </div>
1072
+
1073
+ <!-- Task Details -->
1074
+ <div class="card">
1075
+ <div class="card-header">📋 Task Details</div>
1076
+ <div class="card-body" style="display:flex;flex-direction:column;gap:10px">
1077
+ <p class="task-desc" id="task-desc">Select a task and click Initialize.</p>
1078
+ <div class="divider"></div>
1079
+ <div>
1080
+ <div class="error-chip" id="task-error" style="display:none"></div>
1081
+ </div>
1082
+ <div>
1083
+ <div class="hint-chip" id="task-hint" style="display:none"></div>
1084
+ </div>
1085
+ </div>
1086
+ </div>
1087
+
1088
+ <!-- Environment Rewards -->
1089
+ <div class="card" id="reward-card" style="display:none; margin-bottom: 20px;">
1090
+ <div class="card-header">💸 Dense Reward Signal</div>
1091
+ <div class="card-body" style="padding: 16px 20px;" id="reward-card-body">
1092
+ </div>
1093
+ </div>
1094
+
1095
+ <!-- Schema -->
1096
+ <div class="card">
1097
+ <div class="card-header">🗄️ Database Schema</div>
1098
+ <div class="card-body">
1099
+ <div class="info-block" id="schema-dump">No schema loaded yet.</div>
1100
+ </div>
1101
+ </div>
1102
+
1103
+
1104
+ </aside>
1105
+
1106
+ <!-- Right Panel -->
1107
+ <div class="right-panel">
1108
+
1109
+ <!-- Broken Code -->
1110
+ <div class="card">
1111
+ <div class="card-header">🐞 Broken Pipeline Code</div>
1112
+ <div class="card-body">
1113
+ <div class="code-label">
1114
+ <span>Initial SQL (Failing)</span>
1115
+ <span class="lang-tag">SQL</span>
1116
+ </div>
1117
+ <textarea id="broken-code" class="code read-only" rows="5" readonly placeholder="Initialize environment to load broken SQL..."></textarea>
1118
+ </div>
1119
+ </div>
1120
+
1121
+ <!-- Agent Submission -->
1122
+ <div class="card">
1123
+ <div class="card-header">🤖 Agent Submission Sandbox</div>
1124
+ <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
1125
+ <div>
1126
+ <div class="code-label">
+ <span>Agent Fix Attempt</span>
+ <span class="lang-tag">SQL — editable</span>
+ </div>
+ <textarea id="agent-input" class="code agent" rows="6" placeholder="Write or paste your fixed SQL here..."></textarea>
+ </div>
+ <button class="btn btn-green" onclick="executeStep()">▶️ Execute Fix in DuckDB Sandbox</button>
+ </div>
+ </div>
+
+ <!-- Verifier Output -->
+ <div class="card">
+ <div class="card-header">📊 Verifier Output</div>
+ <div class="card-body">
+ <div class="verifier-output" id="verifier-out">
+ Agent standing by… Load a task and submit a fix.
+ </div>
+ </div>
+ </div>
+
+ </div>
+ </div>
+
+ <script>
+ const TASKS = {TASKS_JSON};
+ let currentTaskId = null;
+
+ const ADVANCED_REWARDS = {{
+ task_5_optimization: [
+ ['Output matches baseline', '+0.50'],['No CROSS_PRODUCT in EXPLAIN', '+0.50'],
+ ['Wrong output', '-0.10'],['DuckDB error', '-0.20'],
+ ],
+ task_6_migration: [
+ ['Tables created', '+0.05'],['Data partially migrated', '+0.30'],
+ ['Full migration + DROP', '+1.00'],['Destructive early DROP', '-0.30'],['DuckDB error', '-0.20'],
+ ],
+ task_7_chaos: [
+ ['Zero dups + zero NULLs + UNIQUE index', '+1.00'],['Zero dups + zero NULLs (no index)', '+0.70'],
+ ['ETL still dirty', '-0.10'],['DuckDB error', '-0.20'],
+ ],
+ }};
+
+ function initEnv() {{
+ currentTaskId = document.getElementById('task-select').value;
+ const task = TASKS[currentTaskId];
+ const isAdvanced = !!task.duckdb_backed;
+
+ document.getElementById('broken-code').value = task.broken_sql;
+ document.getElementById('agent-input').value = task.broken_sql;
+ document.getElementById('task-desc').textContent = task.description;
+
+ const errEl = document.getElementById('task-error');
+ errEl.textContent = '⚠️ ' + task.error;
+ errEl.style.display = 'inline-block';
+
+ const hintEl = document.getElementById('task-hint');
+ hintEl.textContent = '💡 Hint: ' + task.hint;
+ hintEl.style.display = 'inline-block';
+
+ // Reward card
+ const rewardBody = document.getElementById('reward-card-body');
+ let rewardsHtml = '';
+ if (isAdvanced) {{
+ const entries = ADVANCED_REWARDS[currentTaskId] || [];
+ rewardsHtml = entries.map(([label, val]) => {{
+ const isPos = val.startsWith('+');
+ return `<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">${{label}}</span>
+ <span style="font-family:var(--mono);color:${{isPos?'#34d399':'#f87171'}};font-weight:bold;font-size:13px;">${{val}}</span>
+ </div>`;
+ }}).join('');
+ }} else if (currentTaskId === 'task_3_hard') {{
+ rewardsHtml = `
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Correct Step Identified</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.15</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Step 2 Fixed</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.25</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Step 4 Fixed</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.20</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;">
+ <span style="font-size:13px;color:#e8e8f0">Final Totals Exact Match</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.40</span>
+ </div>`;
+ }} else {{
+ rewardsHtml = `
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Parses successfully</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.10</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Executes without error</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.20</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Column Accuracy</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.10</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Data Accuracy</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.30</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;">
+ <span style="font-size:13px;color:#e8e8f0">Exact Match Bonus</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.30</span>
+ </div>`;
+ }}
+ rewardsHtml += `
+ <div style="font-size:11px;font-weight:bold;color:var(--muted);text-transform:uppercase;margin:10px 0 6px;border-top:1px solid rgba(255,255,255,0.05);padding-top:10px;">Penalties</div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:var(--muted)">Duplicate Submission</span>
+ <span style="font-family:var(--mono);color:#f87171;font-weight:bold;font-size:13px;">-0.10</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:var(--muted)">Destructive Action</span>
+ <span style="font-family:var(--mono);color:#f87171;font-weight:bold;font-size:13px;">-0.30</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;">
+ <span style="font-size:13px;color:var(--muted)">Hardcode Penalty</span>
+ <span style="font-family:var(--mono);color:#f87171;font-weight:bold;font-size:13px;">-0.50</span>
+ </div>`;
+ rewardBody.innerHTML = rewardsHtml;
+
+ // Schema
+ let schemaStr = '';
+ for (const [table, cols] of Object.entries(task.schema_info)) {{
+ schemaStr += `TABLE ${{table}} {{\\n`;
+ cols.forEach(c => schemaStr += ` ${{c}}\\n`);
+ schemaStr += `}}\\n\\n`;
+ }}
+ document.getElementById('schema-dump').textContent = schemaStr.trim();
+ document.getElementById('reward-card').style.display = 'block';
+
+ // Call /reset on the server to seed the DuckDB environment
+ fetch('/reset', {{
+ method: 'POST',
+ headers: {{'Content-Type': 'application/json'}},
+ body: JSON.stringify({{task_id: currentTaskId}})
+ }}).then(r => r.json()).then(data => {{
+ const out = document.getElementById('verifier-out');
+ out.className = 'verifier-output';
+ const badge = data.observation.label.includes('Advanced') || data.observation.label.includes('5')
+ || data.observation.label.includes('6') || data.observation.label.includes('7')
+ ? ' <span style="background:rgba(139,92,246,0.25);border:1px solid rgba(139,92,246,0.6);color:#c4b5fd;padding:2px 8px;border-radius:12px;font-size:11px;font-weight:700;">🔬 DuckDB-Backed</span>' : '';
+ out.innerHTML = `🔄 Environment initialized.${{badge}} Awaiting agent execution…`;
+ }}).catch(() => {{
+ document.getElementById('verifier-out').innerHTML = '🔄 Environment initialized. Awaiting agent execution…';
+ }});
+ }}
+
+ async function executeStep() {{
+ const agentSQL = document.getElementById('agent-input').value.trim();
+ const out = document.getElementById('verifier-out');
+
+ if (!agentSQL) {{
+ out.className = 'verifier-output error';
+ out.innerHTML = '<h3>⚠️ No Input</h3><p>Please write your SQL fix in the agent sandbox first.</p>';
+ return;
+ }}
+ if (!currentTaskId) {{
+ out.className = 'verifier-output error';
+ out.innerHTML = '<h3>⚠️ No Task Loaded</h3><p>Click Initialize Environment first.</p>';
+ return;
+ }}
+
+ out.className = 'verifier-output';
+ out.innerHTML = '⏳ Executing in DuckDB sandbox…';
+
+ const task = TASKS[currentTaskId];
+ const isAdvanced = !!task.duckdb_backed;
+
+ if (isAdvanced) {{
+ // Real API call for DuckDB-backed tasks
+ try {{
+ const res = await fetch('/step', {{
+ method: 'POST',
+ headers: {{'Content-Type': 'application/json'}},
+ body: JSON.stringify({{action: agentSQL, explanation: ''}})
+ }});
+ const data = await res.json();
+ const reward = data.reward;
+ const done = data.done;
+ const msg = data.info?.message || '';
+ const verifier = data.info?.verifier || 'DuckDB';
+ const isPos = reward >= 0;
+ out.className = `verifier-output ${{done && reward > 0 ? 'success' : reward < 0 ? 'error' : 'success'}}`;
+ out.innerHTML = `
+ <h3>${{done && reward >= 1.0 ? '✅' : reward < 0 ? '❌' : '⚠️'}} Verifier Result</h3>
+ <p style="margin-top:6px">${{msg}}</p>
+ <p style="margin-top:8px;font-size:11px;color:var(--muted)">🔬 ${{verifier}} · Step ${{data.state?.step_count ?? '?'}}</p>
+ <span class="reward-pill ${{isPos ? 'reward-positive' : 'reward-negative'}}">Reward: ${{reward >= 0 ? '+' : ''}}${{reward.toFixed(2)}}</span>
+ `;
+ }} catch(e) {{
+ out.className = 'verifier-output error';
+ out.innerHTML = `<h3>❌ Network Error</h3><p>${{e.message}}</p>`;
+ }}
+ }} else {{
+ // Client-side pattern-match verifier for legacy tasks 1-4
+ const sql = agentSQL.toUpperCase();
+ const taskSolved = (
+ (currentTaskId === 'task_1_easy' && sql.includes(',') && sql.includes('NAME') && sql.includes('AGE')) ||
+ (currentTaskId === 'task_2_medium' && sql.includes('GROUP BY')) ||
+ (currentTaskId === 'task_3_hard' && sql.includes('PARTITION BY')) ||
+ (currentTaskId === 'task_4_expert' && !sql.includes('13-01') && sql.includes('MONTHLY_SALES'))
+ );
+ if (taskSolved) {{
+ out.className = 'verifier-output success';
+ out.innerHTML = `
+ <h3>✅ Verification Passed!</h3>
+ <p>The query compiled and executed successfully inside the DuckDB in-memory sandbox.</p>
+ <p>The pipeline produced the expected output rows without errors.</p>
+ <span class="reward-pill reward-positive">Reward: +1.0</span>
+ `;
+ }} else {{
+ out.className = 'verifier-output error';
+ out.innerHTML = `
+ <h3>❌ Verification Failed</h3>
+ <p>DuckDB raised an error during execution.</p>
+ <p style="font-family:var(--mono);font-size:12px;margin-top:6px;opacity:0.8">${{task.error}}</p>
+ <span class="reward-pill reward-negative">Reward: -0.1</span>
+ `;
+ }}
+ }}
+ }}
+ </script>
+ </body>
+ </html>""".replace("{TASKS_JSON}", TASKS_JSON)
+ return HTMLResponse(html)
+
+
+ if __name__ == "__main__":
+ import uvicorn
+ uvicorn.run(app, host="0.0.0.0", port=7860)
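The client-side fallback verifier above can be mirrored in Python for offline smoke tests. A minimal sketch, with the task ids and string checks taken from the page script; the function name is ours:

```python
def legacy_verifier(task_id: str, agent_sql: str) -> bool:
    """Mirror of the page's heuristic checks for legacy tasks 1-4 (sketch)."""
    sql = agent_sql.upper()
    if task_id == "task_1_easy":
        return "," in sql and "NAME" in sql and "AGE" in sql
    if task_id == "task_2_medium":
        return "GROUP BY" in sql
    if task_id == "task_3_hard":
        return "PARTITION BY" in sql
    if task_id == "task_4_expert":
        return "13-01" not in sql and "MONTHLY_SALES" in sql
    return False

print(legacy_verifier("task_2_medium", "SELECT dept, COUNT(*) FROM emp GROUP BY dept"))
```

Like the in-page verifier, this is a pattern match, not an execution check; only the DuckDB-backed tasks go through the `/step` endpoint.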
client.py CHANGED
"""
client.py — OpenEnv client for SQL Debug & Data Pipeline Repair.
Provides a typed, sync/async interface that mirrors the EnvClient spec.
"""

from __future__ import annotations
from typing import Optional

from models import SQLDebugAction, SQLDebugObservation, SQLDebugState

try:
    from openenv.core.env_client import EnvClient  # type: ignore
    from openenv.core.client_types import StepResult  # type: ignore

    class SQLDebugEnv(EnvClient[SQLDebugAction, SQLDebugObservation, SQLDebugState]):
        """
        Typed client for the SQL Debug environment.

        Usage (sync):
            with SQLDebugEnv(base_url="http://localhost:7860").sync() as env:
                obs = env.reset(task_id="task1_syntax_fix")
                action = SQLDebugAction(fixed_sql="SELECT ...")
                obs, reward, done, info = env.step(action)

        Usage (async):
            async with SQLDebugEnv(base_url="http://localhost:7860") as env:
                obs = await env.reset()
                result = await env.step(action)
        """

        def _step_payload(self, action: SQLDebugAction) -> dict:
            return action.model_dump()

        def _parse_result(self, payload: dict) -> StepResult:
            obs_data = payload.get("observation", {})
            return StepResult(
                observation=SQLDebugObservation(**obs_data),
                reward=payload.get("reward"),
                done=payload.get("done", False),
            )

        def _parse_state(self, payload: dict) -> SQLDebugState:
            return SQLDebugState(**payload)

except ImportError:

    import requests

    class SQLDebugEnv:  # type: ignore[no-redef]
        """
        Lightweight HTTP client (no openenv-core dependency required).

        Usage:
            env = SQLDebugEnv(base_url="http://localhost:7860")
            obs_data = env.reset(task_id="task1_syntax_fix")
            result = env.step(SQLDebugAction(fixed_sql="SELECT ..."))
        """

        def __init__(self, base_url: str = "http://localhost:7860") -> None:
            self.base_url = base_url.rstrip("/")

        def reset(
            self,
            seed: int = 42,
            task_id: Optional[str] = None,
        ) -> SQLDebugObservation:
            params: dict = {"seed": seed}
            if task_id:
                params["task_id"] = task_id
            r = requests.post(f"{self.base_url}/reset", params=params)
            r.raise_for_status()
            return SQLDebugObservation(**r.json())

        def step(
            self,
            action: SQLDebugAction,
        ) -> tuple[SQLDebugObservation, float, bool, dict]:
            r = requests.post(
                f"{self.base_url}/step",
                json=action.model_dump(),
            )
            r.raise_for_status()
            d = r.json()
            obs = SQLDebugObservation(**d["observation"])
            return obs, d["reward"], d["done"], d.get("info", {})

        def state(self) -> SQLDebugState:
            r = requests.get(f"{self.base_url}/state")
            r.raise_for_status()
            return SQLDebugState(**r.json())

        # Context manager support
        def __enter__(self):
            return self

        def __exit__(self, *args):
            pass
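The lightweight client's `step()` unpacks the server's JSON into an `(obs, reward, done, info)` tuple. A minimal sketch of that unpacking against a hand-made payload (the values below are illustrative, not real server output):

```python
# Unpack a /step response the way the lightweight client does (sketch).
payload = {
    "observation": {"task_id": "task1_syntax_fix"},
    "reward": 0.3,
    "done": False,
    "info": {"breakdown": {"parses": 0.1, "executes": 0.2}},
}

obs = payload["observation"]    # the real client wraps this in SQLDebugObservation(**...)
reward = payload["reward"]
done = payload["done"]
info = payload.get("info", {})  # info is optional and defaults to {}

print(reward, done, info["breakdown"]["executes"])
```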
deploy_hf_space.md ADDED
---
title: Deploy SQL Debug Env to HF Spaces
description: Step-by-step guide to deploy the environment and then train with GRPO
---

# Deploying the SQL Debug Environment to HF Spaces

## Step 1 — Create the HF Space

Go to https://huggingface.co/new-space and configure:

| Field | Value |
|---|---|
| Space name | `sql-debug-env` |
| SDK | **Docker** |
| Hardware | CPU Basic (free tier is fine for the env) |
| Visibility | Public (required for `openenv validate`) |

---

## Step 2 — Prepare the Repository

```powershell
# Install the HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Clone the empty Space repo
git clone https://huggingface.co/spaces/YOUR_USERNAME/sql-debug-env
cd sql-debug-env
```

---

## Step 3 — Copy Environment Files

Copy everything from `sql_env/` into the cloned Space repo:

```powershell
# From your sql_env directory:
Copy-Item -Recurse * "C:\path\to\sql-debug-env\" -Force
```

The Space repo should look like:

```
sql-debug-env/            ← HF Space repo root
├── README.md             ← HF Space card (already has ---yaml--- header)
├── models.py
├── client.py
├── openenv.yaml
└── server/
    ├── Dockerfile        ← HF Spaces uses this automatically
    ├── app.py
    ├── environment.py
    ├── data.py
    ├── graders.py
    ├── rewards.py
    └── requirements.txt
```

> **Important:** HF Spaces looks for a `Dockerfile` at the repo root OR inside `server/`.
> Our Dockerfile is at `server/Dockerfile`. HF will find it automatically.
> The Dockerfile exposes **port 7860** — this is required by HF Spaces.

---

## Step 4 — Push & Deploy

```powershell
cd sql-debug-env
git add .
git commit -m "Initial SQL Debug OpenEnv environment"
git push
```

HF Spaces will automatically:
1. Detect the Dockerfile
2. Build the Docker image
3. Start the server on port 7860
4. Make it available at `https://YOUR_USERNAME-sql-debug-env.hf.space`

---

## Step 5 — Verify the Deployment

```powershell
$SPACE_URL = "https://YOUR_USERNAME-sql-debug-env.hf.space"

# Health check
Invoke-WebRequest "$SPACE_URL/health" | Select-Object -Expand Content

# List tasks
Invoke-WebRequest "$SPACE_URL/tasks" | Select-Object -Expand Content

# Interactive docs
Start-Process "$SPACE_URL/docs"
```

---

## Step 6 — Run Training Against the HF Space

```powershell
# Point training at the deployed Space
$env:ENV_URL = "https://YOUR_USERNAME-sql-debug-env.hf.space"
$env:USE_LOCAL_ENV = "false"   # use HTTP client

# Optional: push the trained model automatically
$env:PUSH_TO_HUB = "true"
$env:HF_REPO_ID = "YOUR_USERNAME/sql-debug-qwen-grpo"

python train_grpo.py --mode train --n-repeats 50
```

Or for faster local training (no network overhead):

```powershell
# Local env (default) — start server first
Start-Job { uvicorn server.app:app --host 0.0.0.0 --port 7860 }
$env:USE_LOCAL_ENV = "true"
python train_grpo.py --mode both --n-repeats 50
```

---

## Hardware Requirements for Training

| GPU | Batch Size | num_generations | use_vllm | ETA (3 epochs) |
|---|---|---|---|---|
| A100 40GB | 1 | 8 | True | ~2h |
| A100 40GB | 1 | 4 | False | ~4h |
| RTX 4090 24GB | 1 | 2 | False | ~6h |
| V100 16GB | 1 | 2 | False | OOM risk — use 4-bit |

For 4-bit quantization on smaller GPUs, add to `get_grpo_config()`:
```python
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
# Pass to GRPOTrainer via model_init_kwargs
```

---

## Quick Colab Setup

```python
# In Google Colab (A100 runtime)
!pip install trl transformers torch duckdb pandas pydantic fastapi uvicorn requests
!git clone https://huggingface.co/spaces/YOUR_USERNAME/sql-debug-env sql_env
%cd sql_env

import subprocess, threading
server = threading.Thread(
    target=lambda: subprocess.run(
        ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
    ),
    daemon=True
)
server.start()

import time; time.sleep(3)  # wait for server

# Now run training
!python train_grpo.py --mode both --n-repeats 30
```
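The environment-variable switches used in Step 6 can be collected in Python roughly like this. A sketch: the variable names come from the steps above, while the helper name and parsing defaults are ours:

```python
import os

def read_training_config(environ=os.environ) -> dict:
    """Collect the env-var switches from Step 6 with permissive defaults (sketch)."""
    return {
        "env_url": environ.get("ENV_URL", "http://localhost:7860"),
        "use_local_env": environ.get("USE_LOCAL_ENV", "true").lower() == "true",
        "push_to_hub": environ.get("PUSH_TO_HUB", "false").lower() == "true",
        "hf_repo_id": environ.get("HF_REPO_ID", ""),
    }

cfg = read_training_config({"ENV_URL": "https://x.hf.space", "USE_LOCAL_ENV": "false"})
print(cfg["use_local_env"], cfg["env_url"])
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without mutating the process environment.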
inference.py CHANGED
@@ -1,294 +1,294 @@
"""
inference.py — inference script for SQL Debug & Data Pipeline Repair.

Runs a model (default: gpt-4o-mini) against all 3 tasks using the OpenAI
client API. Reads credentials from environment variables. Produces a
reproducible JSON report with per-task scores.

Usage:
    # Set credentials
    $env:OPENAI_API_KEY = "sk-..."
    # Optional: use a different base URL (e.g. local vLLM)
    $env:OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"

    python inference.py
    python inference.py --task task1_syntax_fix
    python inference.py --model gpt-4o --output results.json
"""

from __future__ import annotations
import argparse
import json
import os
import re
import sys
import time
from pathlib import Path
from typing import Optional

from openai import OpenAI

# Make server package importable
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from models import SQLDebugAction, SQLDebugObservation
from server.environment import SQLDebugEnvironment
from server.data import TASKS


# ---------------------------------------------------------------------------
# Prompt builder
# ---------------------------------------------------------------------------

def _build_prompt(obs: SQLDebugObservation) -> str:
    """Convert an observation into a model prompt."""
    schema_lines = []
    for table, cols in obs.schema_info.items():
        col_defs = ", ".join(f"{c['column']} {c['type']}" for c in cols)
        schema_lines.append(f" {table}({col_defs})")
    schema_str = "\n".join(schema_lines)

    if obs.task_id == "task3_etl_timezone":
        code_section = f"""
## Broken ETL Pipeline Code
```python
{obs.pipeline_code}
```

## Intermediate Outputs (from the BROKEN pipeline)
{json.dumps(obs.intermediate_outputs, indent=2, default=str) if obs.intermediate_outputs else 'Not available'}
"""
        instruction = (
            "Return the COMPLETE corrected Python pipeline code inside a ```python ... ``` block. "
            "Also provide a brief explanation of the root cause (which step is buggy and why) "
            "in a section labelled 'Explanation:'."
        )
    else:
        code_section = f"""
## Broken SQL Query
```sql
{obs.broken_sql}
```
"""
        instruction = (
            "Return ONLY the corrected SQL query inside a ```sql ... ``` block. "
            "Do not include any explanation outside the code block."
        )

    history_section = ""
    if obs.previous_attempts:
        lines = []
        for a in obs.previous_attempts:
            lines.append(f" Step {a.step}: reward={a.reward:.2f} SQL: {a.fixed_sql[:120]}...")
        history_section = "\n## Previous Attempts\n" + "\n".join(lines)

    return f"""You are an expert SQL and data engineering debugger.

## Task ({obs.difficulty.upper()})
{obs.task_description}

## Database Schema
{schema_str}
{code_section}{history_section}

## Instructions
{instruction}
"""


# ---------------------------------------------------------------------------
# Response parser
# ---------------------------------------------------------------------------

def _extract_sql(text: str, is_pipeline: bool = False) -> str:
    """Extract SQL or Python code from model response."""
    # Try fenced code block first
    lang = "python" if is_pipeline else "sql"
    patterns = [
        rf"```{lang}\s*\n(.*?)```",
        r"```\s*\n(.*?)```",
        r"```(.*?)```",
    ]
    for pattern in patterns:
        m = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
        if m:
            return m.group(1).strip()
    # Fallback: return the whole response
    return text.strip()


def _extract_explanation(text: str) -> Optional[str]:
    """Extract explanation section from Task 3 response."""
    m = re.search(r"explanation[:\s]+(.*?)(?:```|$)", text, re.DOTALL | re.IGNORECASE)
    if m:
        return m.group(1).strip()
    return None


# ---------------------------------------------------------------------------
# Main baseline loop
# ---------------------------------------------------------------------------

def run_baseline(
    model: str = "gpt-4o-mini",
    task_filter: Optional[str] = None,
    output_path: str = "outputs/baseline_results.json",
    max_steps: int = 3,
    seed: int = 42,
) -> dict:
    """
    Run the baseline agent against all (or one) task(s).
    Returns a results dict with per-task scores.
    """
    api_key = os.environ.get("OPENAI_API_KEY", "")
    if not api_key:
        print("WARNING: OPENAI_API_KEY not set. Set it before running baseline.")

    base_url = os.environ.get("OPENAI_BASE_URL", None)
    client = OpenAI(api_key=api_key, base_url=base_url)

    env = SQLDebugEnvironment()
    results = {
        "model": model,
        "seed": seed,
        "tasks": {},
    }

    target_tasks = [t for t in TASKS if (task_filter is None or t.task_id == task_filter)]

    for task_spec in target_tasks:
        print(f"\n{'='*60}")
        print(f"Task: {task_spec.task_id} ({task_spec.difficulty})")
        print(f"{'='*60}")

        task_result = {
            "task_id": task_spec.task_id,
            "difficulty": task_spec.difficulty,
            "steps": [],
            "best_reward": 0.0,
            "final_reward": 0.0,
            "done": False,
        }

        obs: SQLDebugObservation = env.reset(seed=seed, task_id=task_spec.task_id)
        done = False
        best_reward = 0.0

        for step_num in range(1, max_steps + 1):
            if done:
                break

            prompt = _build_prompt(obs)
            print(f"\n Step {step_num}: calling {model}...")

            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {
                            "role": "system",
                            "content": (
                                "You are an expert SQL debugger. Follow instructions exactly. "
                                "Return only what is asked for — no extra commentary."
                            ),
                        },
                        {"role": "user", "content": prompt},
                    ],
                    temperature=0.0,
                    max_tokens=2048,
                )
                raw_text = response.choices[0].message.content or ""
            except Exception as e:
                print(f" API error: {e}")
                raw_text = ""

            is_pipeline = (task_spec.task_id == "task3_etl_timezone")
            fixed_sql = _extract_sql(raw_text, is_pipeline=is_pipeline)
            explanation = _extract_explanation(raw_text) if is_pipeline else None

            action = SQLDebugAction(fixed_sql=fixed_sql, explanation=explanation)
            obs, reward, done, info = env.step(action)

            best_reward = max(best_reward, reward)
            print(f" Reward: {reward:.4f} Done: {done}")
            print(f" Breakdown: {info.get('breakdown', {})}")

            task_result["steps"].append({
                "step": step_num,
                "reward": reward,
                "done": done,
                "breakdown": info.get("breakdown", {}),
                "penalties": info.get("penalties", {}),
                "fixed_sql_preview": fixed_sql[:200],
            })

            time.sleep(0.5)  # rate limiting

        task_result["best_reward"] = round(best_reward, 4)
        task_result["final_reward"] = round(obs.reward or 0.0, 4)
        task_result["done"] = done
        results["tasks"][task_spec.task_id] = task_result

        print(f"\n >>> Best reward for {task_spec.task_id}: {best_reward:.4f}")

    # Summary
    print(f"\n{'='*60}")
    print("BASELINE SUMMARY")
    print(f"{'='*60}")
    for tid, tr in results["tasks"].items():
        print(f" {tid:40s} best={tr['best_reward']:.4f} ({tr['difficulty']})")

    # Write output
    out_path = Path(output_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))
    print(f"\nResults written to {out_path}")

    return results


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Baseline inference for SQL Debug & Data Pipeline Repair OpenEnv"
    )
    parser.add_argument(
        "--model",
        default="gpt-4o-mini",
        help="OpenAI model to use (default: gpt-4o-mini)",
    )
    parser.add_argument(
        "--task",
        default=None,
        choices=["task1_syntax_fix", "task2_join_aggregation", "task3_etl_timezone"],
        help="Run a single task (default: all tasks)",
    )
    parser.add_argument(
        "--output",
        default="outputs/baseline_results.json",
        help="Path to write JSON results",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=3,
        help="Max steps per episode (default: 3)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed (default: 42)",
    )

    args = parser.parse_args()
    run_baseline(
        model=args.model,
        task_filter=args.task,
        output_path=args.output,
        max_steps=args.max_steps,
        seed=args.seed,
    )
107
+ patterns = [
108
+ rf"```{lang}\s*\n(.*?)```",
109
+ r"```\s*\n(.*?)```",
110
+ r"```(.*?)```",
111
+ ]
112
+ for pattern in patterns:
113
+ m = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
114
+ if m:
115
+ return m.group(1).strip()
116
+ # Fallback: return the whole response
117
+ return text.strip()
118
+
119
+
120
+ def _extract_explanation(text: str) -> Optional[str]:
121
+ """Extract explanation section from Task 3 response."""
122
+ m = re.search(r"explanation[:\s]+(.*?)(?:```|$)", text, re.DOTALL | re.IGNORECASE)
123
+ if m:
124
+ return m.group(1).strip()
125
+ return None
126
+
127
+
128
+ # ---------------------------------------------------------------------------
129
+ # Main baseline loop
130
+ # ---------------------------------------------------------------------------
131
+
132
+ def run_baseline(
133
+ model: str = "gpt-4o-mini",
134
+ task_filter: Optional[str] = None,
135
+ output_path: str = "outputs/baseline_results.json",
136
+ max_steps: int = 3,
137
+ seed: int = 42,
138
+ ) -> dict:
139
+ """
140
+ Run the baseline agent against all (or one) task(s).
141
+ Returns a results dict with per-task scores.
142
+ """
143
+ api_key = os.environ.get("OPENAI_API_KEY", "")
144
+ if not api_key:
145
+ print("WARNING: OPENAI_API_KEY not set. Set it before running baseline.")
146
+
147
+ base_url = os.environ.get("OPENAI_BASE_URL", None)
148
+ client = OpenAI(api_key=api_key, base_url=base_url)
149
+
150
+ env = SQLDebugEnvironment()
151
+ results = {
152
+ "model": model,
153
+ "seed": seed,
154
+ "tasks": {},
155
+ }
156
+
157
+ target_tasks = [t for t in TASKS if (task_filter is None or t.task_id == task_filter)]
158
+
159
+ for task_spec in target_tasks:
160
+ print(f"\n{'='*60}")
161
+ print(f"Task: {task_spec.task_id} ({task_spec.difficulty})")
162
+ print(f"{'='*60}")
163
+
164
+ task_result = {
165
+ "task_id": task_spec.task_id,
166
+ "difficulty": task_spec.difficulty,
167
+ "steps": [],
168
+ "best_reward": 0.0,
169
+ "final_reward": 0.0,
170
+ "done": False,
171
+ }
172
+
173
+ obs: SQLDebugObservation = env.reset(seed=seed, task_id=task_spec.task_id)
174
+ done = False
175
+ best_reward = 0.0
176
+
177
+ for step_num in range(1, max_steps + 1):
178
+ if done:
179
+ break
180
+
181
+ prompt = _build_prompt(obs)
182
+ print(f"\n Step {step_num}: calling {model}...")
183
+
184
+ try:
185
+ response = client.chat.completions.create(
186
+ model=model,
187
+ messages=[
188
+ {
189
+ "role": "system",
190
+ "content": (
191
+ "You are an expert SQL debugger. Follow instructions exactly. "
192
+ "Return only what is asked for — no extra commentary."
193
+ ),
194
+ },
195
+ {"role": "user", "content": prompt},
196
+ ],
197
+ temperature=0.0,
198
+ max_tokens=2048,
199
+ )
200
+ raw_text = response.choices[0].message.content or ""
201
+ except Exception as e:
202
+ print(f" API error: {e}")
203
+ raw_text = ""
204
+
205
+ is_pipeline = (task_spec.task_id == "task3_etl_timezone")
206
+ fixed_sql = _extract_sql(raw_text, is_pipeline=is_pipeline)
207
+ explanation = _extract_explanation(raw_text) if is_pipeline else None
208
+
209
+ action = SQLDebugAction(fixed_sql=fixed_sql, explanation=explanation)
210
+ obs, reward, done, info = env.step(action)
211
+
212
+ best_reward = max(best_reward, reward)
213
+ print(f" Reward: {reward:.4f} Done: {done}")
214
+ print(f" Breakdown: {info.get('breakdown', {})}")
215
+
216
+ task_result["steps"].append({
217
+ "step": step_num,
218
+ "reward": reward,
219
+ "done": done,
220
+ "breakdown": info.get("breakdown", {}),
221
+ "penalties": info.get("penalties", {}),
222
+ "fixed_sql_preview": fixed_sql[:200],
223
+ })
224
+
225
+ time.sleep(0.5) # rate limiting
226
+
227
+ task_result["best_reward"] = round(best_reward, 4)
228
+ task_result["final_reward"] = round(obs.reward or 0.0, 4)
229
+ task_result["done"] = done
230
+ results["tasks"][task_spec.task_id] = task_result
231
+
232
+ print(f"\n >>> Best reward for {task_spec.task_id}: {best_reward:.4f}")
233
+
234
+ # Summary
235
+ print(f"\n{'='*60}")
236
+ print("BASELINE SUMMARY")
237
+ print(f"{'='*60}")
238
+ for tid, tr in results["tasks"].items():
239
+ print(f" {tid:40s} best={tr['best_reward']:.4f} ({tr['difficulty']})")
240
+
241
+ # Write output
242
+ out_path = Path(output_path)
243
+ out_path.parent.mkdir(parents=True, exist_ok=True)
244
+ out_path.write_text(json.dumps(results, indent=2))
245
+ print(f"\nResults written to {out_path}")
246
+
247
+ return results
248
+
249
+
250
+ # ---------------------------------------------------------------------------
251
+ # CLI
252
+ # ---------------------------------------------------------------------------
253
+
254
+ if __name__ == "__main__":
255
+ parser = argparse.ArgumentParser(
256
+ description="Baseline inference for SQL Debug & Data Pipeline Repair OpenEnv"
257
+ )
258
+ parser.add_argument(
259
+ "--model",
260
+ default="gpt-4o-mini",
261
+ help="OpenAI model to use (default: gpt-4o-mini)",
262
+ )
263
+ parser.add_argument(
264
+ "--task",
265
+ default=None,
266
+ choices=["task1_syntax_fix", "task2_join_aggregation", "task3_etl_timezone"],
267
+ help="Run a single task (default: all tasks)",
268
+ )
269
+ parser.add_argument(
270
+ "--output",
271
+ default="outputs/baseline_results.json",
272
+ help="Path to write JSON results",
273
+ )
274
+ parser.add_argument(
275
+ "--max-steps",
276
+ type=int,
277
+ default=3,
278
+ help="Max steps per episode (default: 3)",
279
+ )
280
+ parser.add_argument(
281
+ "--seed",
282
+ type=int,
283
+ default=42,
284
+ help="Random seed (default: 42)",
285
+ )
286
+
287
+ args = parser.parse_args()
288
+ run_baseline(
289
+ model=args.model,
290
+ task_filter=args.task,
291
+ output_path=args.output,
292
+ max_steps=args.max_steps,
293
+ seed=args.seed,
294
+ )
models.py CHANGED
@@ -1,130 +1,130 @@
- """
- models.py — SQL Debug & Data Pipeline Repair OpenEnv
- Typed Pydantic models for Observation, Action, and State.
- """
-
- from __future__ import annotations
- from typing import Any, Dict, List, Optional
-
- from pydantic import BaseModel, Field
-
-
- # ---------------------------------------------------------------------------
- # Base stubs (mirrors openenv-core base classes so this module is importable
- # without openenv-core installed, while still being fully compatible when it
- # is installed).
- # ---------------------------------------------------------------------------
-
- try:
-     from openenv.core.env_server import Action, Observation, State  # type: ignore
- except ImportError:
-     class _Base(BaseModel):
-         pass
-     Action = _Base  # type: ignore[misc,assignment]
-     Observation = _Base  # type: ignore[misc,assignment]
-     State = _Base  # type: ignore[misc,assignment]
-
-
- # ---------------------------------------------------------------------------
- # Observation
- # ---------------------------------------------------------------------------
-
- class PreviousAttempt(BaseModel):
-     """Log of a single previous attempt by the agent."""
-     step: int
-     fixed_sql: str
-     reward: float
-     info: Dict[str, Any] = Field(default_factory=dict)
-
-
- class SQLDebugObservation(Observation):
-     """
-     What the agent sees at each step.
-
-     For Tasks 1 & 2 the key field is `broken_sql`.
-     For Task 3 the key field is `pipeline_code`; `intermediate_outputs`
-     contains the (wrong) intermediate DataFrames serialised as list-of-dicts.
-     """
-
-     # ── Episode metadata ────────────────────────────────────────────────────
-     task_id: str = Field(description="Which task this episode runs (task1/task2/task3)")
-     task_description: str = Field(description="Natural-language goal the agent must achieve")
-     difficulty: str = Field(description="easy | medium | hard")
-
-     # ── Problem payload ─────────────────────────────────────────────────────
-     broken_sql: Optional[str] = Field(
-         default=None,
-         description="Broken SQL string — present for Tasks 1 & 2",
-     )
-     pipeline_code: Optional[str] = Field(
-         default=None,
-         description="4-step ETL pipeline Python string — present for Task 3",
-     )
-     intermediate_outputs: Optional[List[Dict[str, Any]]] = Field(
-         default=None,
-         description="Wrong intermediate outputs from each pipeline step (Task 3)",
-     )
-
-     # ── Schema context ───────────────────────────────────────────────────────
-     schema_info: Dict[str, List[Dict[str, str]]] = Field(
-         description="Table name → list of {column, type} dicts"
-     )
-
-     # ── Progress ─────────────────────────────────────────────────────────────
-     step_number: int = Field(default=0, description="Current attempt number (0-indexed)")
-     max_steps: int = Field(default=5, description="Maximum attempts allowed")
-     previous_attempts: List[PreviousAttempt] = Field(default_factory=list)
-
-     # ── OpenEnv required fields ──────────────────────────────────────────────
-     done: bool = Field(default=False)
-     reward: Optional[float] = Field(default=None)
-
-
- # ---------------------------------------------------------------------------
- # Action
- # ---------------------------------------------------------------------------
-
- class SQLDebugAction(Action):
-     """
-     What the agent submits each step.
-
-     `fixed_sql` is required for all tasks.
-     For Task 3, `fixed_sql` should contain the COMPLETE corrected pipeline
-     Python code (not just a patch).
-     `explanation` is optional but scored separately for Task 3's root-cause
-     component (+0.15 if it correctly names Step 2 as the bug location).
-     """
-
-     fixed_sql: str = Field(
-         description=(
-             "Corrected SQL string (Tasks 1 & 2) or corrected full "
-             "pipeline Python code string (Task 3)"
-         )
-     )
-     explanation: Optional[str] = Field(
-         default=None,
-         description=(
-             "Optional natural-language explanation of the root cause. "
-             "Scored for Task 3 root-cause identification (+0.15)."
-         ),
-     )
-
-
- # ---------------------------------------------------------------------------
- # State
- # ---------------------------------------------------------------------------
-
- class SQLDebugState(State):
-     """
-     Full internal state — used by state() and by the baseline script for
-     logging; also inspected by openenv validate.
-     """
-
-     task_id: str = Field(default="")
-     seed: int = Field(default=42)
-     step_count: int = Field(default=0)
-     max_steps: int = Field(default=5)
-     episode_id: Optional[str] = Field(default=None)
-     current_score: float = Field(default=0.0, description="Best score seen so far this episode")
-     reward_history: List[float] = Field(default_factory=list)
-     done: bool = Field(default=False)

+ """
+ models.py — SQL Debug & Data Pipeline Repair OpenEnv
+ Typed Pydantic models for Observation, Action, and State.
+ """
+
+ from __future__ import annotations
+ from typing import Any, Dict, List, Optional
+
+ from pydantic import BaseModel, Field
+
+
+ # ---------------------------------------------------------------------------
+ # Base stubs (mirrors openenv-core base classes so this module is importable
+ # without openenv-core installed, while still being fully compatible when it
+ # is installed).
+ # ---------------------------------------------------------------------------
+
+ try:
+     from openenv.core.env_server import Action, Observation, State  # type: ignore
+ except ImportError:
+     class _Base(BaseModel):
+         pass
+     Action = _Base  # type: ignore[misc,assignment]
+     Observation = _Base  # type: ignore[misc,assignment]
+     State = _Base  # type: ignore[misc,assignment]
+
+
+ # ---------------------------------------------------------------------------
+ # Observation
+ # ---------------------------------------------------------------------------
+
+ class PreviousAttempt(BaseModel):
+     """Log of a single previous attempt by the agent."""
+     step: int
+     fixed_sql: str
+     reward: float
+     info: Dict[str, Any] = Field(default_factory=dict)
+
+
+ class SQLDebugObservation(Observation):
+     """
+     What the agent sees at each step.
+
+     For Tasks 1 & 2 the key field is `broken_sql`.
+     For Task 3 the key field is `pipeline_code`; `intermediate_outputs`
+     contains the (wrong) intermediate DataFrames serialised as list-of-dicts.
+     """
+
+     # ── Episode metadata ────────────────────────────────────────────────────
+     task_id: str = Field(description="Which task this episode runs (task1/task2/task3)")
+     task_description: str = Field(description="Natural-language goal the agent must achieve")
+     difficulty: str = Field(description="easy | medium | hard")
+
+     # ── Problem payload ─────────────────────────────────────────────────────
+     broken_sql: Optional[str] = Field(
+         default=None,
+         description="Broken SQL string — present for Tasks 1 & 2",
+     )
+     pipeline_code: Optional[str] = Field(
+         default=None,
+         description="4-step ETL pipeline Python string — present for Task 3",
+     )
+     intermediate_outputs: Optional[List[Dict[str, Any]]] = Field(
+         default=None,
+         description="Wrong intermediate outputs from each pipeline step (Task 3)",
+     )
+
+     # ── Schema context ───────────────────────────────────────────────────────
+     schema_info: Dict[str, List[Dict[str, str]]] = Field(
+         description="Table name → list of {column, type} dicts"
+     )
+
+     # ── Progress ─────────────────────────────────────────────────────────────
+     step_number: int = Field(default=0, description="Current attempt number (0-indexed)")
+     max_steps: int = Field(default=5, description="Maximum attempts allowed")
+     previous_attempts: List[PreviousAttempt] = Field(default_factory=list)
+
+     # ── OpenEnv required fields ──────────────────────────────────────────────
+     done: bool = Field(default=False)
+     reward: Optional[float] = Field(default=None)
+
+
+ # ---------------------------------------------------------------------------
+ # Action
+ # ---------------------------------------------------------------------------
+
+ class SQLDebugAction(Action):
+     """
+     What the agent submits each step.
+
+     `fixed_sql` is required for all tasks.
+     For Task 3, `fixed_sql` should contain the COMPLETE corrected pipeline
+     Python code (not just a patch).
+     `explanation` is optional but scored separately for Task 3's root-cause
+     component (+0.15 if it correctly names Step 2 as the bug location).
+     """
+
+     fixed_sql: str = Field(
+         description=(
+             "Corrected SQL string (Tasks 1 & 2) or corrected full "
+             "pipeline Python code string (Task 3)"
+         )
+     )
+     explanation: Optional[str] = Field(
+         default=None,
+         description=(
+             "Optional natural-language explanation of the root cause. "
+             "Scored for Task 3 root-cause identification (+0.15)."
+         ),
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # State
+ # ---------------------------------------------------------------------------
+
+ class SQLDebugState(State):
+     """
+     Full internal state — used by state() and by the baseline script for
+     logging; also inspected by openenv validate.
+     """
+
+     task_id: str = Field(default="")
+     seed: int = Field(default=42)
+     step_count: int = Field(default=0)
+     max_steps: int = Field(default=5)
+     episode_id: Optional[str] = Field(default=None)
+     current_score: float = Field(default=0.0, description="Best score seen so far this episode")
+     reward_history: List[float] = Field(default_factory=list)
+     done: bool = Field(default=False)
openenv.yaml CHANGED
@@ -1,95 +1,95 @@
- name: sql-debug-env
- version: 1.0.0
- description: >
-   SQL Debug & Data Pipeline Repair — an OpenEnv environment where an AI agent
-   diagnoses and fixes broken SQL queries and ETL pipelines executed against a
-   live DuckDB instance. Four tasks ranging from easy (syntax fix) to expert
-   (Window Functions). Features continuous dense reward shaping (Jaccard similarity)
-   and AST-based anti-cheating penalties.
-
- author: sql-debug-env
- tags:
-   - openenv
-   - sql
-   - data-engineering
-   - debugging
-   - rl
- entrypoint: uvicorn app:app --host 0.0.0.0 --port 7860
- tasks:
-   - id: task1_syntax_fix
-     difficulty: easy
-     max_steps: 5
-     description: >
-       Fix a SQL query with a missing comma (syntax error) and a wrong table
-       alias in the WHERE clause. Three tables: orders, customers, products.
-     baseline_score: 1.0
-
-   - id: task2_join_aggregation
-     difficulty: medium
-     max_steps: 5
-     description: >
-       Fix a GROUP BY aggregation query that uses INNER JOINs, silently
-       dropping NULL-keyed rows and producing wrong revenue totals.
-     baseline_score: 1.0
-
-   - id: task3_etl_timezone
-     difficulty: hard
-     max_steps: 5
-     description: >
-       Trace and fix a 4-step ETL pipeline where Step 2 casts VARCHAR
-       timestamps with timezone offsets to DATE using implicit coercion,
-       stripping the offset. Fix must use TIMESTAMPTZ + AT TIME ZONE.
-     baseline_score: 0.40
-
-   - id: task4_expert_window
-     difficulty: expert
-     max_steps: 5
-     description: >
-       Calculate a 3-day rolling average of transaction amounts per user.
-       Requires advanced window function mechanics (OVER PARTITION BY... ROWS BETWEEN).
-     baseline_score: 1.0
-
- observation_schema:
-   task_id: string
-   task_description: string
-   difficulty: "easy | medium | hard | expert"
-   broken_sql: "string | null  # null for Task 3"
-   pipeline_code: "string | null  # non-null for Task 3"
-   intermediate_outputs: "list | null  # wrong step outputs for Task 3"
-   schema_info: "dict[table_name, list[{column, type}]]"
-   step_number: integer
-   max_steps: integer
-   previous_attempts: "list[{step, fixed_sql, reward, info}]"
-   done: boolean
-   reward: "float | null"
-
- action_schema:
-   fixed_sql: string  # corrected SQL or full corrected pipeline code (Task 3)
-   explanation: "string | null  # root-cause explanation, scored for Task 3"
-
- reward_decomposition:
-   tasks_1_2_and_4:
-     parses: +0.10
-     executes: +0.20
-     column_accuracy: +0.10
-     data_accuracy: +0.30
-     exact_match_bonus: +0.30
-   task_3:
-     correct_step_identified: +0.15
-     step2_fixed: +0.25
-     step4_fixed: +0.20
-     final_totals_exact: +0.40
-
- penalties:
-   duplicate_submission: -0.10
-   efficiency_penalty: -0.20
-   destructive_action: -0.30
-   hardcode_penalty: -0.50
-
- endpoints:
-   health: GET /health
-   reset: POST /reset
-   step: POST /step
-   state: GET /state
-   tasks: GET /tasks
    docs: GET /docs

+ name: sql-debug-env
+ version: 1.0.0
+ description: >
+   SQL Debug & Data Pipeline Repair — an OpenEnv environment where an AI agent
+   diagnoses and fixes broken SQL queries and ETL pipelines executed against a
+   live DuckDB instance. Four tasks ranging from easy (syntax fix) to expert
+   (Window Functions). Features continuous dense reward shaping (Jaccard similarity)
+   and AST-based anti-cheating penalties.
+
+ author: sql-debug-env
+ tags:
+   - openenv
+   - sql
+   - data-engineering
+   - debugging
+   - rl
+ entrypoint: uvicorn app:app --host 0.0.0.0 --port 7860
+ tasks:
+   - id: task1_syntax_fix
+     difficulty: easy
+     max_steps: 5
+     description: >
+       Fix a SQL query with a missing comma (syntax error) and a wrong table
+       alias in the WHERE clause. Three tables: orders, customers, products.
+     baseline_score: 1.0
+
+   - id: task2_join_aggregation
+     difficulty: medium
+     max_steps: 5
+     description: >
+       Fix a GROUP BY aggregation query that uses INNER JOINs, silently
+       dropping NULL-keyed rows and producing wrong revenue totals.
+     baseline_score: 1.0
+
+   - id: task3_etl_timezone
+     difficulty: hard
+     max_steps: 5
+     description: >
+       Trace and fix a 4-step ETL pipeline where Step 2 casts VARCHAR
+       timestamps with timezone offsets to DATE using implicit coercion,
+       stripping the offset. Fix must use TIMESTAMPTZ + AT TIME ZONE.
+     baseline_score: 0.40
+
+   - id: task4_expert_window
+     difficulty: expert
+     max_steps: 5
+     description: >
+       Calculate a 3-day rolling average of transaction amounts per user.
+       Requires advanced window function mechanics (OVER PARTITION BY... ROWS BETWEEN).
+     baseline_score: 1.0
+
+ observation_schema:
+   task_id: string
+   task_description: string
+   difficulty: "easy | medium | hard | expert"
+   broken_sql: "string | null  # null for Task 3"
+   pipeline_code: "string | null  # non-null for Task 3"
+   intermediate_outputs: "list | null  # wrong step outputs for Task 3"
+   schema_info: "dict[table_name, list[{column, type}]]"
+   step_number: integer
+   max_steps: integer
+   previous_attempts: "list[{step, fixed_sql, reward, info}]"
+   done: boolean
+   reward: "float | null"
+
+ action_schema:
+   fixed_sql: string  # corrected SQL or full corrected pipeline code (Task 3)
+   explanation: "string | null  # root-cause explanation, scored for Task 3"
+
+ reward_decomposition:
+   tasks_1_2_and_4:
+     parses: +0.10
+     executes: +0.20
+     column_accuracy: +0.10
+     data_accuracy: +0.30
+     exact_match_bonus: +0.30
+   task_3:
+     correct_step_identified: +0.15
+     step2_fixed: +0.25
+     step4_fixed: +0.20
+     final_totals_exact: +0.40
+
+ penalties:
+   duplicate_submission: -0.10
+   efficiency_penalty: -0.20
+   destructive_action: -0.30
+   hardcode_penalty: -0.50
+
+ endpoints:
+   health: GET /health
+   reset: POST /reset
+   step: POST /step
+   state: GET /state
+   tasks: GET /tasks
    docs: GET /docs
pyproject.toml CHANGED
@@ -1,40 +1,40 @@
- [build-system]
- requires = ["setuptools>=68", "wheel"]
- build-backend = "setuptools.build_meta"
-
- [project]
- name = "sql-debug-env"
- version = "1.0.0"
- description = "SQL Debug & Data Pipeline Repair — OpenEnv environment with Four tasks"
- readme = "README.md"
- requires-python = ">=3.10"
- license = { text = "Apache-2.0" }
- keywords = ["openenv", "reinforcement-learning", "sql", "duckdb", "data-engineering"]
-
- dependencies = [
-     "duckdb>=0.10.0",
-     "pandas>=2.0.0",
-     "fastapi>=0.111.0",
-     "uvicorn[standard]>=0.29.0",
-     "pydantic>=2.0.0",
-     "requests>=2.31.0",
-     "openai>=1.30.0",
-     "pyyaml>=6.0",
-     "openenv-core>=0.2.0",
- ]
-
- [project.scripts]
- server = "server.app:main"
-
- [project.optional-dependencies]
- openenv = [
-     "openenv-core>=0.1.0",
- ]
- dev = [
-     "pytest>=8.0",
-     "httpx>=0.27.0",
- ]
-
- [tool.setuptools.packages.find]
- where = ["."]
  include = ["sql_env*", "server*"]

+ [build-system]
+ requires = ["setuptools>=68", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "sql-debug-env"
+ version = "1.0.0"
+ description = "SQL Debug & Data Pipeline Repair — OpenEnv environment with Four tasks"
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = { text = "Apache-2.0" }
+ keywords = ["openenv", "reinforcement-learning", "sql", "duckdb", "data-engineering"]
+
+ dependencies = [
+     "duckdb>=0.10.0",
+     "pandas>=2.0.0",
+     "fastapi>=0.111.0",
+     "uvicorn[standard]>=0.29.0",
+     "pydantic>=2.0.0",
+     "requests>=2.31.0",
+     "openai>=1.30.0",
+     "pyyaml>=6.0",
+     "openenv-core>=0.2.0",
+ ]
+
+ [project.scripts]
+ server = "server.app:main"
+
+ [project.optional-dependencies]
+ openenv = [
+     "openenv-core>=0.1.0",
+ ]
+ dev = [
+     "pytest>=8.0",
+     "httpx>=0.27.0",
+ ]
+
+ [tool.setuptools.packages.find]
+ where = ["."]
  include = ["sql_env*", "server*"]
requirements.txt CHANGED
@@ -1,3 +1,4 @@
- fastapi
- uvicorn
- pydantic

+ fastapi
+ uvicorn
+ pydantic
+ duckdb
server/Dockerfile CHANGED
@@ -1,30 +1,30 @@
- # syntax: docker/dockerfile:1
- FROM python:3.11-slim
-
- # ── System deps ──────────────────────────────────────────────────────────────
- RUN apt-get update && apt-get install -y --no-install-recommends \
-     build-essential \
-     && rm -rf /var/lib/apt/lists/*
-
- # ── App directory ─────────────────────────────────────────────────────────────
- WORKDIR /app
-
- # ── Python deps (cached layer) ────────────────────────────────────────────────
- COPY requirements.txt ./requirements.txt
- RUN pip install --no-cache-dir -r requirements.txt
-
- # ── Copy source ───────────────────────────────────────────────────────────────
- COPY . .
-
- # ── HF Spaces requires port 7860 ─────────────────────────────────────────────
- EXPOSE 7860
-
- # ── Create output dir ─────────────────────────────────────────────────────────
- RUN mkdir -p /app/outputs/logs /app/outputs/evals
-
- # ── Health check ──────────────────────────────────────────────────────────────
- HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
-     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')"
-
- # ── Entry point ───────────────────────────────────────────────────────────────
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

+ # syntax: docker/dockerfile:1
+ FROM python:3.11-slim
+
+ # ── System deps ──────────────────────────────────────────────────────────────
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # ── App directory ─────────────────────────────────────────────────────────────
+ WORKDIR /app
+
+ # ── Python deps (cached layer) ────────────────────────────────────────────────
+ COPY requirements.txt ./requirements.txt
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # ── Copy source ───────────────────────────────────────────────────────────────
+ COPY . .
+
+ # ── HF Spaces requires port 7860 ─────────────────────────────────────────────
+ EXPOSE 7860
+
+ # ── Create output dir ─────────────────────────────────────────────────────────
+ RUN mkdir -p /app/outputs/logs /app/outputs/evals
+
+ # ── Health check ──────────────────────────────────────────────────────────────
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')"
+
+ # ── Entry point ───────────────────────────────────────────────────────────────
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
server/requirements.txt CHANGED
@@ -1,7 +1,7 @@
- fastapi>=0.111.0
- uvicorn[standard]>=0.29.0
- pydantic>=2.0.0
- duckdb>=0.10.0
- pandas>=2.0.0
- requests>=2.31.0
- pyyaml>=6.0

+ fastapi>=0.111.0
+ uvicorn[standard]>=0.29.0
+ pydantic>=2.0.0
+ duckdb>=0.10.0
+ pandas>=2.0.0
+ requests>=2.31.0
+ pyyaml>=6.0
uv.lock CHANGED
The diff for this file is too large to render. See raw diff