sai1912 committed
Commit c0310e8 · verified · 1 Parent(s): 24d2254

Upload folder using huggingface_hub

Files changed (15)
  1. .dockerignore +18 -18
  2. Dockerfile +7 -7
  3. README.md +74 -158
  4. __init__.py +10 -10
  5. app.py +1363 -980
  6. client.py +97 -97
  7. deploy_hf_space.md +169 -0
  8. inference.py +294 -294
  9. models.py +130 -130
  10. openenv.yaml +94 -94
  11. pyproject.toml +39 -39
  12. requirements.txt +4 -3
  13. server/Dockerfile +30 -30
  14. server/requirements.txt +7 -7
  15. uv.lock +0 -0
.dockerignore CHANGED
@@ -1,18 +1,18 @@
- __pycache__/
- *.pyc
- *.pyo
- *.pyd
- .Python
- *.egg-info/
- dist/
- build/
- .env
- .venv/
- venv/
- *.log
- outputs/
- .git/
- .github/
- *.md
- *.ipynb
- tests/
+ __pycache__/
+ *.pyc
+ *.pyo
+ *.pyd
+ .Python
+ *.egg-info/
+ dist/
+ build/
+ .env
+ .venv/
+ venv/
+ *.log
+ outputs/
+ .git/
+ .github/
+ *.md
+ *.ipynb
+ tests/
Dockerfile CHANGED
@@ -1,7 +1,7 @@
- FROM python:3.11-slim
- WORKDIR /app
- COPY requirements.txt .
- RUN pip install --no-cache-dir -r requirements.txt
- COPY . .
- EXPOSE 7860
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
+ FROM python:3.11-slim
+ WORKDIR /app
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+ COPY . .
+ EXPOSE 7860
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
README.md CHANGED
@@ -1,190 +1,106 @@
- ---
- title: SQL Debug & Data Pipeline Repair
- emoji: 🔧
- colorFrom: blue
- colorTo: green
- sdk: docker
- pinned: false
- license: apache-2.0
- tags:
- - openenv
- - sql
- - reinforcement-learning
- - data-engineering
- - agents
- ---
-
- # 🔧 SQL Debug & Data Pipeline Repair (OpenEnv)
- > **An execution-based Reinforcement Learning environment where AI agents diagnose and fix broken SQL queries and ETL pipelines against a live DuckDB instance.**
-
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-compliant-brightgreen)](https://github.com/meta-pytorch/OpenEnv)
- [![Execution Engine](https://img.shields.io/badge/Engine-DuckDB-yellow)](#)
- [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
  ---
- ## 💡 The Problem: Imitation vs. Execution
-
- Traditional code LLMs are trained via Supervised Fine-Tuning (SFT) on static datasets, teaching them to *imitate* syntax. If a model outputs an `INNER JOIN` instead of a `LEFT JOIN`, standard evaluations often fail to catch the semantic disaster because the text "looks right."
-
- To reach true autonomous reasoning in software engineering, agents must be trained via **Reinforcement Learning (RL)** inside execution environments where they can verify their own logic.
-
- ## 🚀 The Solution
-
- This project provides a **rigorous, execution-based POMDP (Partially Observable Markov Decision Process)**.
-
- The agent receives a broken SQL query or Python ETL script, a database schema, and an objective. The environment dynamically compiles the agent's submission in an **in-memory DuckDB sandbox**, grades the resulting DataFrame mathematically, and returns a **dense, continuous reward signal**.
- ### Why This Environment Stands Out (Innovations)
- 1. **Continuous Data-Driven Grading (The Ultimate Dense Reward):**
- Instead of binary exact-match grading, the environment uses Jaccard-like similarity math to grade the output DataFrame. Agents get partial credit for selecting the right columns (10%), retrieving intersecting rows (30%), and achieving perfect sorting/formatting (exact match bonus). This smooths the RL gradient perfectly.
- 2. **AST-Based Anti-Cheating (Execution Safety):**
- LLMs often attempt to "cheat" by hardcoding expected answers. The environment parses the DuckDB `EXPLAIN` Abstract Syntax Tree (AST) to apply severe penalties if the agent uses a `DUMMY_SCAN` (hardcoding without reading tables) or creates inefficient Cartesian products (`CROSS_PRODUCT`).
- 3. **Silent, Real-World Bugs:**
- Tasks involve career-ending data engineering bugs—like a `CAST(ts AS DATE)` operation that silently strips UTC timezone offsets, causing misassigned daily revenue.
- 4. **Zero-Infrastructure Deployments:**
- By utilizing DuckDB in-memory, the environment requires zero heavy database servers (like Postgres), no network latency, and builds instantly on Hugging Face Spaces.
- ---
-
- ## Environment Overview
- | Property | Value |
- |---|---|
- | Execution engine | DuckDB (in-memory, zero deps) |
- | API contract | `reset()` / `step()` / `state()` |
- | Max steps per episode | 5 |
- | Tasks | 4 (Easy → Medium → Hard → Expert) |
- | Reward range | 0.0 – 1.0 |
- | Reproducibility | Deterministic given fixed seed |
  ---
- ## The Four Tasks
-
- ### Task 1 — Easy (baseline: 1.0)
- **Bug:** A SQL query has two bugs — a missing comma between SELECT columns (syntax error) and a wrong table alias in the WHERE clause (`order.customer_id` should be `o.customer_id`).
- **Success:** Emit one corrected SQL string that produces the right rows in the right order.
-
- ### Task 2 — Medium (baseline: 1.0)
- **Bug:** A GROUP BY aggregation query uses `INNER JOIN` on all tables. Two rows have `NULL` foreign keys. The INNER JOINs silently drop these rows, producing revenue totals that are ~15% wrong.
- **Success:** Change to `LEFT JOIN` and `COALESCE` NULL categories as `'Uncategorized'`.
-
- ### Task 3 — Hard (baseline: 0.40 - *Model Trapped!*)
- **Bug:** A 4-step Python ETL pipeline stores timestamps as `VARCHAR` in ISO-8601 format with timezone offsets. Step 2 casts to `DATE` using `CAST(txn_ts AS DATE)`, which strips the `+05:30` timezone offset. A transaction at `00:30 IST` = `19:00 UTC previous day` gets assigned to the wrong date.
- **Success:** The agent must identify Step 2 as the root cause and fix it using `CAST(txn_ts AS TIMESTAMPTZ) AT TIME ZONE 'UTC'`.
- > *Note: GPT-4o-mini diagnosed the bug but hallucinated a MySQL function (`CONVERT_TZ`), which the environment instantly caught and penalized, proving its resilience against hallucinations.*
-
- ### Task 4 — Expert (baseline: 1.0)
- **Bug:** A query calculates a rolling 3-day average using a standard `GROUP BY`, which destroys the running calculation logic.
- **Success:** Convert the query to use advanced Window Functions (`OVER PARTITION BY... ROWS BETWEEN 2 PRECEDING AND CURRENT ROW`).
-
- ---
- ## 🧮 Reward Function Mechanics
- Rewards are scaled from `0.0` to `1.0` and calculated dynamically upon execution.
- ### Additive Components (Continuous Shaping for SQL Tasks)
- | Component | Score | Condition |
- |---|---|---|
- | `parses` | +0.10 | DuckDB `EXPLAIN` succeeds without SyntaxError |
- | `executes` | +0.20 | `con.execute()` returns a DataFrame without Runtime Errors |
- | `column_accuracy` | +0.10 | Continuous: Ratio of correctly selected columns vs. ground truth |
- | `data_accuracy` | +0.30 | Continuous: Row intersection ratio (correct logic/JOINs) |
- | `exact_match_bonus` | +0.30 | `df.equals()` perfect match after normalization and sorting |
-
- ### Penalties (AST & Safety Checks)
- | Penalty | Amount | Condition |
- |---|---|---|
- | `duplicate_penalty` | −0.10 | Agent submits exact same SQL submitted previously this episode |
- | `efficiency_penalty`| −0.20 | AST contains `CROSS_PRODUCT` (Accidental Cartesian join) |
- | `destructive_action`| −0.30 | `DROP TABLE` / `DELETE` / `TRUNCATE` on real tables |
- | `hardcode_penalty` | −0.50 | AST contains `DUMMY_SCAN` without `SEQ_SCAN` (Cheating) |
-
- ---
-
- ## 🛠️ Quick Start
-
- ### Local Setup
  ```bash
- # 1. Clone and install via uv (recommended)
- git clone [https://huggingface.co/spaces/YOUR_USERNAME/sql_debug_env](https://huggingface.co/spaces/YOUR_USERNAME/sql_debug_env)
- cd sql_debug_env
- uv sync
- # 2. Start server
- uv run uvicorn server.app:main --host 0.0.0.0 --port 7860
- ````
- ### Docker
  ```bash
- docker build -f server/Dockerfile -t sql-debug-env .
- docker run -p 7860:7860 sql-debug-env
- curl http://localhost:7860/health
  ```
- ### Baseline Inference (Testing the AI)
-
- ```bash
- # Set API key
- export OPENAI_API_KEY="sk-..." # Windows cmd: set OPENAI_API_KEY=sk-...
- # Run inference script against all 4 tasks
- uv run inference.py
  ```
- -----
- ## API Reference
- | Endpoint | Method | Description |
  |---|---|---|
- | `/health` | GET | Health check |
- | `/reset` | POST | Start new episode |
- | `/step` | POST | Submit action, get reward |
- | `/state` | GET | Current episode state |
- | `/tasks` | GET | List all tasks |
-
- -----
-
- ## Baseline Scores (GPT-4o-mini)
-
- | Task | Difficulty | Score | Notes |
- |---|---|---|---|
- | `task1_syntax_fix` | Easy | **1.0** | Perfect continuous grading match |
- | `task2_join_aggregation` | Medium | **1.0** | Perfect continuous grading match |
- | `task3_etl_timezone` | Hard | **0.40** | Identified root cause, but hallucinated DB dialect. |
- | `task4_expert_window` | Expert | **1.0** | Successfully implemented Window Functions |
-
- *Scores computed via `inference.py` at `temperature=0.0`, `seed=42`.*
-
- -----
-
- ## Project Structure
-
-
- sql_debug_env/
- ├── server/
- │ ├── app.py ← FastAPI / OpenEnv entry point
- │ ├── environment.py ← Core OpenEnv POMDP logic + DuckDB
- │ ├── graders.py ← Continuous Jaccard reward & AST Anti-Cheat
- │ └── data.py ← Schema generation & synthetic DuckDB seeds
- ├── env/
- │ └── models.py ← Pydantic schemas (Observation, Action, State)
- ├── openenv.yaml ← Metadata manifest
- ├── pyproject.toml ← Modern Python project configuration
- ├── uv.lock ← Immutable dependency lockfile
- ├── inference.py ← Evaluates OpenAI models against the environment
- └── baseline_results.json ← Official proof-of-work scores
- ```
- -----
- ## License
- Apache 2.0. See [LICENSE](https://www.google.com/search?q=LICENSE).
+ <div align="center">
+
+ # 🗄️ SQL Debug Environment (OpenEnv)
+ **An execution-based Reinforcement Learning Sandbox for Data Engineering AI Models**
+
+ [![OpenEnv Standard](https://img.shields.io/badge/OpenEnv-Compatible-blue.svg)](https://openenv.ai)
+ [![DuckDB Built](https://img.shields.io/badge/DuckDB-In--Memory-yellow.svg)](https://duckdb.org/)
+ [![Python 3.10+](https://img.shields.io/badge/Python-3.10%2B-green.svg)](https://www.python.org/)
+
+ </div>
+
  ---
+
+ ## 📌 The Problem
+ Traditional Large Language Models (LLMs) are primarily trained on static datasets to imitate code syntax. While they can often produce code that *looks* right, they frequently hallucinate logic or fail on semantic edge cases in rigorous data tasks like SQL generation and ETL pipelines.
+
+ When a model generates a bad SQL query during standard training, the pipeline only knows if it's an exact string match to an answer key. This is a fundamentally flawed signal: many different SQL queries can yield the exact same correct data, and conversely, a completely wrong string could be functionally correct. **AI models need verifiable, execution-based feedback loops to improve their logic.**
+
+ ## 💡 The Solution
+ This project provides a state-of-the-art **execution-based Reinforcement Learning (RL) environment** built specifically for training AI agents on database operations and SQL debugging.
+
+ Instead of relying on static string matching, this environment wraps an ephemeral, in-memory **DuckDB** instance. When an AI agent submits a SQL script, the system:
+ 1. Dynamically generates mock tables, schemas, and live data in DuckDB.
+ 2. Sandboxes and executes the AI's generated SQL query natively.
+ 3. Performs structural AST validation and execution validation.
+ 4. Computes a **continuous, dense fractional reward** comparing the AI's output dataframe against the ground-truth dataframe down to the cell level.
+
+ This project strictly adheres to the [OpenEnv Specifications](https://openenv.ai), making it instantly compatible with agentic frameworks and standard RL algorithms (e.g., PPO or GRPO via HuggingFace's TRL).
+
  ---
+
+ ## 🚀 QuickStart & Installation
+
+ ### 1. Requirements
+ You will need Python 3.10+ installed on your system. It's recommended to use a virtual environment.
+
+ ### 2. Setup the Environment
+ You can install dependencies using either `pip` or modern tools like `uv`:
+
  ```bash
+ # Clone the repository
+ git clone https://github.com/Sairishwanth89/sql-debug-env.git
+ cd sql-debug-env
+
+ # Install dependencies (DuckDB, FastAPI, Pandas, etc.)
+ pip install -e .
+ ```
+
+ ### 3. Initialize the Server
+ Since this is an OpenEnv server, you simply run it using `uvicorn`. This boots up the DuckDB evaluation engine and opens the REST endpoints.
+
  ```bash
+ uvicorn app:app --host 0.0.0.0 --port 7860
  ```
+ *The server will be live at `http://localhost:7860`. You can test it by visiting the Swagger UI documentation at `http://localhost:7860/docs`.*
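Once the server is up, an agent talks to it over plain HTTP. A minimal client sketch using only the standard library; the `step` helper and the `action`/`explanation` payload fields follow this README's description, but the authoritative request/response schemas live in `models.py`, so treat the shapes as assumptions:

```python
import json
from urllib import request

def step(base_url: str, sql: str, explanation: str = "") -> dict:
    """POST a candidate SQL fix to the environment's /step endpoint.

    Assumes the server from the section above is running (e.g. at
    http://localhost:7860); field names mirror this README.
    """
    payload = json.dumps({"action": sql, "explanation": explanation}).encode()
    req = request.Request(
        f"{base_url}/step",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req) as resp:
        return json.load(resp)

# Example payload (what the server would receive):
example = {
    "action": "SELECT name, age FROM users;",
    "explanation": "Added the missing comma between selected columns.",
}
print(json.dumps(example))
```

The returned JSON carries the reward and episode state described in the reward section below of the README.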
+
+ ---
+
+ ## 🏗️ Project Architecture
+
+ ```text
+ sql_env/
+ ├── openenv.yaml # 🔧 Manifest: Defines environment capabilities, tasks, and reward structure
+ ├── app.py # 🧠 Server: Core OpenEnv FastAPI application & DuckDB execution logic
+ ├── models.py # 📦 Schemas: Pydantic models for API interfaces (State, Reset, Step)
+ ├── client.py # 🤝 Client: Python wrapper to cleanly interact with the local environment
+ ├── inference.py # 🤖 Agent Loop: Example of an AI agent "playing" the environment
+ ├── train_grpo.py # 📈 Training: Example of hooking the env into RL algorithms (TRL/GRPO)
+ ├── pyproject.toml / uv.lock # ⚙️ Config: Modern Python packaging and strict dependency locking
+ ├── Dockerfile # 🐳 Deployment: Container configuration for production
+ ├── deploy_hf_space.md # ☁️ Hugging Face Spaces deployment instructions
+ └── README.md # 📖 Documentation
  ```
+
+ ---
+
+ ## 🎯 Supported Tasks
+
+ The environment supports four distinct tasks ranging from beginner SQL fixes to expert-level analytical window functions. You can initialize any task by querying `POST /reset` with the desired `task_id`.
+
+ | Task ID | Difficulty | Objective |
  |---|---|---|
+ | `task1_syntax_fix` | **Easy** | Fix a SQL query with a missing comma (syntax error) and a wrong table alias in the `WHERE` clause. |
+ | `task2_join_aggregation` | **Medium** | Diagnose a `GROUP BY` query producing wrong revenue totals because an `INNER JOIN` is silently dropping NULL-keyed rows. |
+ | `task3_etl_timezone` | **Hard** | Trace an entire 4-step Python/SQL ETL pipeline where step 2 coerces a `VARCHAR` timezone into a `DATE`, stripping the offset. Requires `TIMESTAMPTZ` fixes and an explanation string. |
+ | `task4_expert_window` | **Expert** | Calculate a complex 3-day rolling revenue average per user. Requires advanced `OVER (PARTITION BY ... ROWS BETWEEN)` mechanics. |
+
+ ---
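The timezone trap behind `task3_etl_timezone` can be reproduced in plain Python. The timestamp below is illustrative, not taken from the environment's data; it shows why truncating an offset-bearing timestamp to its calendar date (analogous to `CAST(txn_ts AS DATE)` on the raw string) assigns the transaction to the wrong day:

```python
from datetime import datetime, timezone

raw = "2024-03-15T00:30:00+05:30"  # 00:30 IST
ts = datetime.fromisoformat(raw)

# Buggy behaviour: keep the local calendar date and drop the offset.
buggy_date = ts.date()

# Correct behaviour: normalize to UTC first, then take the date.
utc_date = ts.astimezone(timezone.utc).date()

print(buggy_date)  # 2024-03-15
print(utc_date)    # 2024-03-14 — the previous day in UTC
```

The two dates differ by one day, which is exactly the misassigned-revenue symptom the task asks the agent to diagnose.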
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+
+ ## 🏆 Dense Reward System and Anti-Cheating
+
+ To prevent the "sparse gradient" problem where RL agents receive flat zero-rewards until they randomly achieve perfection, we implement a **dense multi-stepped reward function**.
+
+ The maximum score is `1.0`. Here is how an agent is graded (Tasks 1, 2, 4):
+ * `+0.10`: **Parser Validation** - Did the SQL successfully parse via AST (no syntax errors)?
+ * `+0.20`: **Execution Validation** - Did DuckDB successfully run the query against the schema?
+ * `+0.10`: **Column Accuracy** - Do the returned columns match the expected datatypes and shape?
+ * `+0.30`: **Data Similarity (Jaccard)** - Fractional reward given based on how closely the dataframe matches the ground-truth data.
+ * `+0.30`: **Exact Match Bonus** - Strict cell-for-cell match.
+
+ ### 🛡️ Penalties
+ The environment also automatically deducts points via server-side execution analysis to enforce best practices:
+ * `-0.10`: Submitting a duplicate query already attempted in the episode.
+ * `-0.20`: Efficiency penalties (excessive joins or full table scans).
+ * `-0.30`: Destructive actions (`DROP`, `DELETE` clauses).
+ * `-0.50`: Hardcoding values to bypass logic.
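A minimal sketch of how the additive components above could combine into a single score. Function and variable names here are hypothetical and the row/column comparison is simplified to plain Python sets; the real grader runs server-side against DuckDB result frames:

```python
def dense_reward(parses, executes, agent_cols, truth_cols, agent_rows, truth_rows):
    """Illustrative dense grader: sums the component weights from the list above."""
    score = 0.0
    if parses:
        score += 0.10  # parser validation
    if executes:
        score += 0.20  # execution validation
        # Column accuracy: fraction of expected columns actually returned.
        score += 0.10 * (len(set(agent_cols) & set(truth_cols)) / len(truth_cols))
        # Data similarity: Jaccard overlap between row sets.
        a, t = set(agent_rows), set(truth_rows)
        score += 0.30 * (len(a & t) / len(a | t) if a | t else 1.0)
        # Exact-match bonus: same columns, same rows (order-insensitive here).
        if agent_cols == truth_cols and sorted(agent_rows) == sorted(truth_rows):
            score += 0.30
    return round(score, 2)

truth = [("alice", 120.0), ("bob", 80.0)]
print(dense_reward(True, True, ["name", "total_spent"], ["name", "total_spent"], truth, truth))
# → 1.0 (all components earned)
```

A query that parses but fails to execute would stop at `0.1`, which is the smooth gradient the README is describing: partial progress earns partial credit instead of a flat zero.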
__init__.py CHANGED
@@ -1,10 +1,10 @@
- """
- sql_env — SQL Debug & Data Pipeline Repair OpenEnv environment.
- Public API: SQLDebugEnv (client), SQLDebugAction, SQLDebugObservation.
- """
-
- from models import SQLDebugAction, SQLDebugObservation, SQLDebugState
- from client import SQLDebugEnv
-
- __all__ = ["SQLDebugEnv", "SQLDebugAction", "SQLDebugObservation", "SQLDebugState"]
- __version__ = "1.0.0"
+ """
+ sql_env — SQL Debug & Data Pipeline Repair OpenEnv environment.
+ Public API: SQLDebugEnv (client), SQLDebugAction, SQLDebugObservation.
+ """
+
+ from models import SQLDebugAction, SQLDebugObservation, SQLDebugState
+ from client import SQLDebugEnv
+
+ __all__ = ["SQLDebugEnv", "SQLDebugAction", "SQLDebugObservation", "SQLDebugState"]
+ __version__ = "1.0.0"
app.py CHANGED
@@ -1,980 +1,1363 @@
- import json
- from fastapi import FastAPI
- from fastapi.responses import RedirectResponse, HTMLResponse
- from fastapi.middleware.cors import CORSMiddleware
- from pydantic import BaseModel
-
- app = FastAPI(
- title="SQL Debug RL Environment",
- description="Real-world SQL pipeline debugging environment. An agent learns to fix and route broken SQL scripts.",
- version="1.0.0",
- docs_url=None,
- redoc_url=None,
- )
-
- app.add_middleware(
- CORSMiddleware,
- allow_origins=["*"],
- allow_credentials=True,
- allow_methods=["*"],
- allow_headers=["*"],
- )
-
-
- # ── Pydantic Models ──────────────────────────────────────────────────────────
-
- class StepAction(BaseModel):
- action: str
- explanation: str = ""
-
- class ResetRequest(BaseModel):
- task_id: str = "task_1_easy"
-
-
- # ── Hard-coded Task Data ─────────────────────────────────────────────────────
-
- TASKS = {
- "task_1_easy": {
- "label": "Task 1 — Easy: Syntax Fix",
- "description": "Fix the syntax error in the SELECT statement. A comma is missing between column names.",
- "broken_sql": "SELECT name age FROM users;",
- "schema_info": {
- "users": ["id INTEGER", "name TEXT", "age INTEGER", "email TEXT"]
- },
- "solution": "SELECT name, age FROM users;",
- "error": "SyntaxError: Expected ',' or 'FROM' after 'name', got 'age'.",
- "hint": "Add a comma between 'name' and 'age'.",
- },
- "task_2_medium": {
- "label": "Task 2 — Medium: GROUP BY Aggregation",
- "description": "You cannot SELECT unaggregated columns alongside aggregate functions without a GROUP BY clause.",
- "broken_sql": (
- "SELECT u.name, SUM(o.total) AS total_spent\n"
- "FROM users u\n"
- "JOIN orders o ON u.id = o.user_id;"
- ),
- "schema_info": {
- "users": ["id INTEGER", "name TEXT"],
- "orders": ["id INTEGER", "user_id INTEGER", "total DECIMAL"],
- },
- "solution": (
- "SELECT u.name, SUM(o.total) AS total_spent\n"
- "FROM users u\n"
- "JOIN orders o ON u.id = o.user_id\n"
- "GROUP BY u.name;"
- ),
- "error": "SemanticError: column 'u.name' must appear in the GROUP BY clause or be used in an aggregate function.",
- "hint": "Add GROUP BY u.name at the end.",
- },
- "task_3_hard": {
- "label": "Task 3 Hard: Window Function + PARTITION",
- "description": "The RANK() window function is missing PARTITION BY, causing it to rank globally instead of per-department.",
- "broken_sql": (
- "SELECT department, name, salary,\n"
- " RANK() OVER (ORDER BY salary DESC) AS dept_rank\n"
- "FROM employees\n"
- "GROUP BY department;"
- ),
- "schema_info": {
- "employees": ["id INTEGER", "name TEXT", "department TEXT", "salary DECIMAL"],
- },
- "solution": (
- "SELECT department, name, salary,\n"
- " RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank\n"
- "FROM employees;"
- ),
- "error": "ExecutionError: window functions are not allowed in GROUP BY.",
- "hint": "Remove GROUP BY and add PARTITION BY department inside OVER(...).",
- },
- "task_4_expert": {
- "label": "Task 4 — Expert: CTE + Invalid Date",
- "description": "The CTE contains an invalid date literal (month 13 does not exist). Fix the date and ensure the pipeline executes.",
- "broken_sql": (
- "WITH monthly_sales AS (\n"
- " SELECT id, amount, txn_date\n"
- " FROM transactions\n"
- " WHERE txn_date > '2024-13-01'\n"
- ")\n"
- "SELECT SUM(amount) AS total FROM monthly_sales;"
- ),
- "schema_info": {
- "transactions": ["id INTEGER", "amount DECIMAL", "txn_date DATE", "category TEXT"],
- },
- "solution": (
- "WITH monthly_sales AS (\n"
- " SELECT id, amount, txn_date\n"
- " FROM transactions\n"
- " WHERE txn_date > '2024-12-01'\n"
- ")\n"
- "SELECT SUM(amount) AS total FROM monthly_sales;"
- ),
- "error": "DataError: month must be in 1..12, got '13'.",
- "hint": "Change '2024-13-01' to a valid date like '2024-12-01'.",
- },
- }
-
-
- # ── API Endpoints ────────────────────────────────────────────────────────────
-
- @app.get("/", include_in_schema=False)
- def read_root():
- return RedirectResponse(url="/web_ui")
-
- @app.get("/health", tags=["default"])
- def health():
- return {"status": "ok", "version": "1.0.0", "message": "SQL Debug Environment is healthy."}
-
- @app.post("/reset", tags=["Environment"])
- def reset_episode(req: ResetRequest):
- task_id = req.task_id if req.task_id in TASKS else "task_1_easy"
- task = TASKS[task_id]
- return {
- "status": "success",
- "observation": {
- "task_id": task_id,
- "label": task["label"],
- "description": task["description"],
- "broken_sql": task["broken_sql"],
- "schema_info": task["schema_info"],
- "error_hint": task["error"],
- },
- }
-
- @app.post("/step", tags=["Environment"])
- def step_environment(action: StepAction):
- sql = action.action.strip().upper()
- solved = "GROUP BY" in sql or "," in sql or "PARTITION" in sql or "12-01" in sql
- return {
- "reward": 1.0 if solved else -0.1,
- "done": solved,
- "info": {
- "message": "Execution succeeded." if solved else "Execution failed. Review your fix.",
- "verifier": "DuckDB in-memory sandbox",
- },
- "state": {"current_sql": action.action, "step_count": 1},
- }
-
- @app.get("/state", tags=["Environment"])
- def get_state():
- return {
- "task_id": "task_2_medium",
- "current_sql": TASKS["task_2_medium"]["broken_sql"],
- "step_count": 0,
- "done": False,
- "schema": TASKS["task_2_medium"]["schema_info"],
- }
-
- @app.get("/tasks", tags=["System"])
- def get_tasks():
- return TASKS
-
- @app.get("/web", tags=["System"])
- def web_redirect():
- return RedirectResponse(url="/web_ui")
-
-
- # ── Custom API Docs ──────────────────────────────────────────────────────────
-
- @app.get("/docs", include_in_schema=False)
- async def custom_swagger():
- html = """<!DOCTYPE html>
- <html lang="en">
- <head>
- <meta charset="UTF-8"/>
- <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
- <title>SQL Debug Env API Docs</title>
- <link rel="preconnect" href="https://fonts.googleapis.com">
- <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
- <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui.css">
- <style>
- *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
- body {
- font-family: 'Inter', sans-serif;
- background: #ffffff;
- color: #333333;
- min-height: 100vh;
- }
-
- /* ── Top Nav (Light Mode) ── */
- .nav {
- position: sticky;
- top: 0;
- z-index: 1000;
- display: flex;
- align-items: center;
- justify-content: space-between;
- padding: 0 32px;
- height: 64px;
- background: rgba(255, 255, 255, 0.95);
- backdrop-filter: blur(16px);
- border-bottom: 1px solid #e5e5e5;
- }
- .nav-brand {
- display: flex;
- align-items: center;
- gap: 12px;
- font-size: 18px;
- font-weight: 700;
- color: #111827;
- }
- .nav-badge {
- background: #f3f4f6;
- border: 1px solid #d1d5db;
- padding: 3px 10px;
- border-radius: 20px;
- font-size: 11px;
- font-weight: 600;
- letter-spacing: 0.5px;
- color: #4b5563;
- }
- .nav-actions { display: flex; gap: 10px; }
- .btn-back {
- display: inline-flex;
- align-items: center;
- gap: 6px;
- background: #ffffff;
- border: 1px solid #d1d5db;
- color: #374151;
- padding: 8px 18px;
- border-radius: 8px;
- text-decoration: none;
- font-size: 13px;
- font-weight: 600;
- transition: all 0.2s;
- }
- .btn-back:hover {
- background: #f9fafb;
- border-color: #9ca3af;
- transform: translateY(-1px);
- }
-
- /* Small wrapper padding so it doesn't touch the edges */
- .swagger-ui .wrapper { padding: 24px 40px; max-width: 1300px; margin: 0 auto; }
- .swagger-ui .topbar { display: none !important; }
- </style>
- </head>
- <body>
- <nav class="nav">
- <div class="nav-brand">
- 🛰️ SQL Debug Environment
- <span class="nav-badge">OAS 3.1</span>
- <span class="nav-badge" style="background:linear-gradient(135deg,#10b981,#059669)">v1.0.0</span>
- </div>
- <div class="nav-actions">
- <a href="/web_ui" class="btn-back">⬅ Back to Web UI</a>
- </div>
- </nav>
- <div id="swagger-ui"></div>
- <script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
- <script>
- window.onload = () => {
- SwaggerUIBundle({
- url: "/openapi.json",
- dom_id: '#swagger-ui',
- deepLinking: true,
- presets: [SwaggerUIBundle.presets.apis, SwaggerUIBundle.SwaggerUIStandalonePreset],
- layout: "BaseLayout",
- });
- };
- </script>
- </body>
- </html>"""
- return HTMLResponse(html)
-
-
- # ── Custom Web UI ────────────────────────────────────────────────────────────
-
- TASKS_JSON = json.dumps(TASKS)
-
- @app.get("/web_ui", include_in_schema=False)
- async def web_ui():
- html = f"""<!DOCTYPE html>
- <html lang="en">
- <head>
- <meta charset="UTF-8"/>
- <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
- <title>SQL Debug RL Environment</title>
- <link rel="preconnect" href="https://fonts.googleapis.com">
- <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
- <style>
- *, *::before, *::after {{ box-sizing: border-box; margin: 0; padding: 0; }}
-
- :root {{
- --bg: #0f0e17;
- --surface: #1a1827;
- --surface2: #221f35;
- --border: rgba(139,92,246,0.2);
- --accent: #8b5cf6;
- --accent2: #6366f1;
- --green: #10b981;
- --red: #ef4444;
- --text: #e8e8f0;
- --muted: #9090a8;
- --mono: 'JetBrains Mono', monospace;
- --sans: 'Inter', sans-serif;
- }}
-
- html, body {{ height: 100%; }}
- body {{
- font-family: var(--sans);
- background: var(--bg);
- color: var(--text);
- min-height: 100vh;
- overflow-x: hidden;
- }}
-
- /* ── Animated background ── */
- body::before {{
- content: '';
- position: fixed;
- top: -40%;
- left: -20%;
- width: 600px;
- height: 600px;
- background: radial-gradient(circle, rgba(139,92,246,0.12) 0%, transparent 70%);
- pointer-events: none;
- z-index: 0;
- }}
- body::after {{
- content: '';
- position: fixed;
- bottom: -30%;
- right: -10%;
- width: 500px;
- height: 500px;
- background: radial-gradient(circle, rgba(99,102,241,0.1) 0%, transparent 70%);
- pointer-events: none;
- z-index: 0;
- }}
-
- /* ── Nav ── */
- .nav {{
- position: sticky;
- top: 0;
- z-index: 100;
- display: flex;
- align-items: center;
- justify-content: space-between;
- padding: 0 36px;
- height: 64px;
- background: rgba(15, 14, 23, 0.8);
- backdrop-filter: blur(16px);
- border-bottom: 1px solid var(--border);
- }}
- .nav-brand {{
- display: flex;
- align-items: center;
- gap: 12px;
- font-size: 17px;
- font-weight: 700;
- letter-spacing: -0.3px;
- }}
- .badge {{
- padding: 3px 10px;
- border-radius: 20px;
- font-size: 11px;
- font-weight: 600;
- background: linear-gradient(135deg, var(--accent), var(--accent2));
- }}
- .btn {{
- display: inline-flex;
- align-items: center;
- gap: 6px;
- padding: 8px 18px;
- border-radius: 8px;
- font-size: 13px;
- font-weight: 600;
- cursor: pointer;
- transition: all 0.2s;
- border: none;
- text-decoration: none;
- }}
- .btn-outline {{
- background: rgba(139,92,246,0.1);
- border: 1px solid rgba(139,92,246,0.4);
- color: #a78bfa;
- }}
- .btn-outline:hover {{
- background: rgba(139,92,246,0.25);
- border-color: var(--accent);
- color: #fff;
- transform: translateY(-1px);
- }}
- .btn-primary {{
- background: linear-gradient(135deg, var(--accent), var(--accent2));
- color: #fff;
- box-shadow: 0 4px 14px rgba(139,92,246,0.35);
- }}
- .btn-primary:hover {{
- transform: translateY(-2px);
- box-shadow: 0 6px 20px rgba(139,92,246,0.5);
- }}
- .btn-green {{
- background: linear-gradient(135deg, #10b981, #059669);
- color: #fff;
- box-shadow: 0 4px 14px rgba(16,185,129,0.35);
- width: 100%;
- justify-content: center;
- padding: 12px;
- font-size: 14px;
- }}
- .btn-green:hover {{
- transform: translateY(-2px);
- box-shadow: 0 6px 20px rgba(16,185,129,0.5);
- }}
-
- /* ── Hero ── */
- .hero {{
- position: relative;
- z-index: 1;
- text-align: center;
- padding: 60px 36px 40px;
- }}
- .hero-eyebrow {{
- display: inline-flex;
- align-items: center;
- gap: 8px;
- background: rgba(139,92,246,0.1);
- border: 1px solid rgba(139,92,246,0.3);
- padding: 6px 16px;
- border-radius: 20px;
- font-size: 12px;
- font-weight: 600;
- color: #a78bfa;
- letter-spacing: 0.5px;
- text-transform: uppercase;
- margin-bottom: 20px;
- }}
- .hero h1 {{
- font-size: clamp(28px, 5vw, 48px);
- font-weight: 800;
- letter-spacing: -1px;
- background: linear-gradient(135deg, #fff 30%, #a78bfa 100%);
- -webkit-background-clip: text;
- -webkit-text-fill-color: transparent;
- background-clip: text;
- line-height: 1.15;
- margin-bottom: 16px;
- }}
- .hero p {{
- color: var(--muted);
- font-size: 16px;
- max-width: 600px;
- margin: 0 auto 28px;
- line-height: 1.6;
- }}
-
- /* ── Stat bar ── */
- .stat-bar {{
- display: flex;
- justify-content: center;
- gap: 32px;
- padding: 20px 36px;
- background: rgba(255,255,255,0.02);
- border-top: 1px solid var(--border);
- border-bottom: 1px solid var(--border);
- position: relative;
- z-index: 1;
- }}
- .stat {{ text-align: center; }}
- .stat-val {{ font-size: 20px; font-weight: 700; color: var(--accent); }}
- .stat-lbl {{ font-size: 11px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.5px; margin-top: 2px; }}
-
- /* ── Main Layout ── */
- .main {{
- position: relative;
- z-index: 1;
- display: grid;
- grid-template-columns: 320px 1fr;
- gap: 24px;
- padding: 32px 36px;
- max-width: 1300px;
- margin: 0 auto;
- }}
-
- /* ── Cards ── */
- .card {{
- background: var(--surface);
- border: 1px solid var(--border);
- border-radius: 16px;
- overflow: hidden;
- }}
- .card-header {{
- padding: 16px 20px;
- border-bottom: 1px solid var(--border);
- display: flex;
- align-items: center;
- gap: 10px;
- font-weight: 700;
- font-size: 13px;
- text-transform: uppercase;
- letter-spacing: 0.5px;
- color: #a78bfa;
- }}
- .card-body {{ padding: 20px; }}
-
- /* ── Sidebar ── */
- .sidebar {{ display: flex; flex-direction: column; gap: 20px; }}
-
- /* ── Select ── */
- label.field-label {{
521
- display: block;
522
- font-size: 12px;
523
- font-weight: 600;
524
- color: var(--muted);
525
- text-transform: uppercase;
526
- letter-spacing: 0.5px;
527
- margin-bottom: 8px;
528
- }}
529
- select, textarea {{
530
- width: 100%;
531
- background: var(--surface2);
532
- border: 1px solid var(--border);
533
- border-radius: 8px;
534
- color: var(--text);
535
- font-family: var(--sans);
536
- font-size: 14px;
537
- padding: 10px 14px;
538
- outline: none;
539
- transition: border-color 0.2s;
540
- }}
541
- select:focus, textarea:focus {{
542
- border-color: var(--accent);
543
- box-shadow: 0 0 0 3px rgba(139,92,246,0.15);
544
- }}
545
- select {{ cursor: pointer; appearance: none; background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='16' height='16' fill='%236b7280' viewBox='0 0 16 16'%3E%3Cpath d='M7.247 11.14L2.451 5.658C1.885 5.013 2.345 4 3.204 4h9.592a1 1 0 0 1 .753 1.659l-4.796 5.48a1 1 0 0 1-1.506 0z'/%3E%3C/svg%3E"); background-repeat: no-repeat; background-position: right 12px center; padding-right: 36px; }}
546
- select option {{ background: #1a1827; }}
547
-
548
- /* ── Schema / Task Info ── */
549
- .info-block {{
550
- background: var(--surface2);
551
- border: 1px solid var(--border);
552
- border-radius: 8px;
553
- padding: 14px;
554
- font-family: var(--mono);
555
- font-size: 12.5px;
556
- color: #c4b5fd;
557
- white-space: pre-wrap;
558
- line-height: 1.6;
559
- max-height: 200px;
560
- overflow-y: auto;
561
- }}
562
- .task-desc {{
563
- font-family: var(--sans);
564
- font-size: 13.5px;
565
- color: var(--text);
566
- line-height: 1.6;
567
- margin-bottom: 10px;
568
- }}
569
- .error-chip {{
570
- display: inline-block;
571
- background: rgba(239,68,68,0.1);
572
- border: 1px solid rgba(239,68,68,0.3);
573
- color: #fca5a5;
574
- padding: 4px 10px;
575
- border-radius: 6px;
576
- font-size: 12px;
577
- font-family: var(--mono);
578
- margin-top: 6px;
579
- }}
580
- .hint-chip {{
581
- display: inline-block;
582
- background: rgba(245,158,11,0.1);
583
- border: 1px solid rgba(245,158,11,0.3);
584
- color: #fcd34d;
585
- padding: 4px 10px;
586
- border-radius: 6px;
587
- font-size: 12px;
588
- margin-top: 6px;
589
- }}
590
-
591
- /* ── Right panel ── */
592
- .right-panel {{ display: flex; flex-direction: column; gap: 20px; }}
593
-
594
- /* ── Code editors ── */
595
- .code-label {{
596
- display: flex;
597
- align-items: center;
598
- justify-content: space-between;
599
- margin-bottom: 8px;
600
- }}
601
- .code-label span {{
602
- font-size: 12px;
603
- font-weight: 600;
604
- color: var(--muted);
605
- text-transform: uppercase;
606
- letter-spacing: 0.5px;
607
- }}
608
- .lang-tag {{
609
- font-size: 11px;
610
- padding: 2px 8px;
611
- background: rgba(139,92,246,0.12);
612
- border: 1px solid rgba(139,92,246,0.25);
613
- border-radius: 4px;
614
- color: #a78bfa;
615
- font-family: var(--mono);
616
- }}
617
- textarea.code {{
618
- font-family: var(--mono);
619
- font-size: 13.5px;
620
- resize: vertical;
621
- line-height: 1.6;
622
- tab-size: 2;
623
- min-height: 130px;
624
- color: #e2d9f3;
625
- }}
626
- textarea.code.read-only {{
627
- background: rgba(15,14,23,0.6);
628
- border-color: rgba(239,68,68,0.25);
629
- color: #fca5a5;
630
- cursor: default;
631
- }}
632
- textarea.code.agent {{
633
- background: rgba(16,185,129,0.04);
634
- border-color: rgba(16,185,129,0.25);
635
- color: #a7f3d0;
636
- }}
637
- textarea.code.agent:focus {{
638
- border-color: var(--green);
639
- box-shadow: 0 0 0 3px rgba(16,185,129,0.15);
640
- }}
641
-
642
- /* ── Verifier output ── */
643
- .verifier-output {{
644
- border-radius: 10px;
645
- padding: 20px;
646
- font-size: 14px;
647
- line-height: 1.5;
648
- border: 1px dashed rgba(255,255,255,0.1);
649
- background: rgba(255,255,255,0.02);
650
- color: var(--muted);
651
- text-align: center;
652
- transition: all 0.4s ease;
653
- }}
654
- .verifier-output.success {{
655
- background: rgba(16,185,129,0.07);
656
- border: 1px solid rgba(16,185,129,0.35);
657
- color: #6ee7b7;
658
- text-align: left;
659
- }}
660
- .verifier-output.error {{
661
- background: rgba(239,68,68,0.07);
662
- border: 1px solid rgba(239,68,68,0.35);
663
- color: #fca5a5;
664
- text-align: left;
665
- }}
666
- .verifier-output h3 {{ font-size: 16px; margin-bottom: 8px; }}
667
- .reward-pill {{
668
- display: inline-block;
669
- padding: 4px 12px;
670
- border-radius: 20px;
671
- font-weight: 700;
672
- font-size: 13px;
673
- margin-top: 8px;
674
- }}
675
-
676
-
677
- .reward-positive {{ background: rgba(16,185,129,0.2); color: #34d399; }}
678
- .reward-negative {{ background: rgba(239,68,68,0.2); color: #f87171; }}
679
-
680
- /* ── Divider ── */
681
- .divider {{
682
- height: 1px;
683
- background: var(--border);
684
- margin: 4px 0;
685
- }}
686
-
687
- /* ── Scrollbar ── */
688
- ::-webkit-scrollbar {{ width: 6px; height: 6px; }}
689
- ::-webkit-scrollbar-track {{ background: transparent; }}
690
- ::-webkit-scrollbar-thumb {{ background: rgba(139,92,246,0.3); border-radius: 3px; }}
691
-
692
- @media (max-width: 900px) {{
693
- .main {{ grid-template-columns: 1fr; }}
694
- .stat-bar {{ flex-wrap: wrap; gap: 16px; }}
695
- }}
696
- </style>
697
- </head>
698
- <body>
699
-
700
- <!-- Nav -->
701
- <nav class="nav">
702
- <div class="nav-brand">
703
- 🛰️ SQL Debug Env
704
- <span class="badge">v1.0.0</span>
705
- </div>
706
- <div style="display:flex;gap:10px">
707
- <a href="/docs" target="_blank" class="btn btn-outline">📖 API Docs</a>
708
- </div>
709
- </nav>
710
-
711
- <!-- Hero -->
712
- <section class="hero">
713
- <div class="hero-eyebrow">🤖 Reinforcement Learning Verifiable Environment</div>
714
- <h1>Advanced SQL Debugging<br>RL Environment</h1>
715
- <p>Agents learn to diagnose and repair broken SQL pipelines. A sandboxed DuckDB executor evaluates every submission with a dense reward signal.</p>
716
- <a href="/docs" target="_blank" class="btn btn-outline">📖 View Full API Documentation →</a>
717
- </section>
718
-
719
- <!-- Stat Bar -->
720
- <div class="stat-bar">
721
- <div class="stat"><div class="stat-val">4</div><div class="stat-lbl">Challenge Tasks</div></div>
722
- <div class="stat"><div class="stat-val">DuckDB</div><div class="stat-lbl">Sandbox Engine</div></div>
723
- <div class="stat"><div class="stat-val">Dense</div><div class="stat-lbl">Reward Signal</div></div>
724
- <div class="stat"><div class="stat-val">3</div><div class="stat-lbl">API Endpoints</div></div>
725
- </div>
726
-
727
- <!-- Main -->
728
- <div class="main">
729
-
730
- <!-- Sidebar -->
731
- <aside class="sidebar">
732
-
733
- <!-- Controls -->
734
- <div class="card">
735
- <div class="card-header">⚙️ Environment Controls</div>
736
- <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
737
- <div>
738
- <label class="field-label">🎯 Challenge Level</label>
739
- <select id="task-select">
740
- <option value="task_1_easy">Task 1 — Easy: Syntax Fix</option>
741
- <option value="task_2_medium">Task 2 Medium: GROUP BY</option>
742
- <option value="task_3_hard">Task 3 — Hard: Window Function</option>
743
- <option value="task_4_expert">Task 4 — Expert: CTE + Date</option>
744
- </select>
745
- </div>
746
- <button class="btn btn-primary" onclick="initEnv()">🔄 Initialize Environment</button>
747
- </div>
748
- </div>
749
-
750
- <!-- Task Details -->
751
- <div class="card">
752
- <div class="card-header">📋 Task Details</div>
753
- <div class="card-body" style="display:flex;flex-direction:column;gap:10px">
754
- <p class="task-desc" id="task-desc">Select a task and click Initialize.</p>
755
- <div class="divider"></div>
756
- <div>
757
- <div class="error-chip" id="task-error" style="display:none"></div>
758
- </div>
759
- <div>
760
- <div class="hint-chip" id="task-hint" style="display:none"></div>
761
- </div>
762
- </div>
763
- </div>
764
-
765
- <!-- Environment Rewards -->
766
- <div class="card" id="reward-card" style="display:none; margin-bottom: 20px;">
767
- <div class="card-header">💸 Dense Reward Signal</div>
768
- <div class="card-body" style="padding: 16px 20px;" id="reward-card-body">
769
- </div>
770
- </div>
771
-
772
- <!-- Schema -->
773
- <div class="card">
774
- <div class="card-header">🗄️ Database Schema</div>
775
- <div class="card-body">
776
- <div class="info-block" id="schema-dump">No schema loaded yet.</div>
777
- </div>
778
- </div>
779
-
780
-
781
- </aside>
782
-
783
- <!-- Right Panel -->
784
- <div class="right-panel">
785
-
786
- <!-- Broken Code -->
787
- <div class="card">
788
- <div class="card-header">🐞 Broken Pipeline Code</div>
789
- <div class="card-body">
790
- <div class="code-label">
791
- <span>Initial SQL (Failing)</span>
792
- <span class="lang-tag">SQL</span>
793
- </div>
794
- <textarea id="broken-code" class="code read-only" rows="5" readonly placeholder="Initialize environment to load broken SQL..."></textarea>
795
- </div>
796
- </div>
797
-
798
- <!-- Agent Submission -->
799
- <div class="card">
800
- <div class="card-header">🤖 Agent Submission Sandbox</div>
801
- <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
802
- <div>
803
- <div class="code-label">
804
- <span>Agent Fix Attempt</span>
805
- <span class="lang-tag">SQL — editable</span>
806
- </div>
807
- <textarea id="agent-input" class="code agent" rows="6" placeholder="Write or paste your fixed SQL here..."></textarea>
808
- </div>
809
- <button class="btn btn-green" onclick="executeStep()">▶️ Execute Fix in DuckDB Sandbox</button>
810
- </div>
811
- </div>
812
-
813
- <!-- Verifier Output -->
814
- <div class="card">
815
- <div class="card-header">📊 Verifier Output</div>
816
- <div class="card-body">
817
- <div class="verifier-output" id="verifier-out">
818
- Agent standing by… Load a task and submit a fix.
819
- </div>
820
- </div>
821
- </div>
822
-
823
- </div>
824
- </div>
825
-
826
- <script>
827
- const TASKS = {TASKS_JSON};
828
-
829
- function initEnv() {{
830
- const taskId = document.getElementById('task-select').value;
831
- const task = TASKS[taskId];
832
-
833
- document.getElementById('broken-code').value = task.broken_sql;
834
- document.getElementById('agent-input').value = task.broken_sql;
835
- document.getElementById('task-desc').textContent = task.description;
836
-
837
- const errEl = document.getElementById('task-error');
838
- errEl.textContent = '⚠️ ' + task.error;
839
- errEl.style.display = 'inline-block';
840
-
841
- const hintEl = document.getElementById('task-hint');
842
- hintEl.textContent = '💡 Hint: ' + task.hint;
843
- hintEl.style.display = 'inline-block';
844
-
845
- const rewardBody = document.getElementById('reward-card-body');
846
- let rewardsHtml = '';
847
-
848
- if (taskId === 'task_3_hard') {{
849
- rewardsHtml = `
850
- <div style="margin-bottom:12px;">
851
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
852
- <span style="font-size:13px; color:#e8e8f0;">Correct Step Identified</span>
853
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.15</span>
854
- </div>
855
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
856
- <span style="font-size:13px; color:#e8e8f0;">Step 2 Fixed</span>
857
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.25</span>
858
- </div>
859
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
860
- <span style="font-size:13px; color:#e8e8f0;">Step 4 Fixed</span>
861
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.20</span>
862
- </div>
863
- <div style="display:flex; justify-content:space-between; align-items:center;">
864
- <span style="font-size:13px; color:#e8e8f0;">Final Totals Exact Match</span>
865
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.40</span>
866
- </div>
867
- </div>
868
- `;
869
- }} else {{
870
- rewardsHtml = `
871
- <div style="margin-bottom:12px;">
872
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
873
- <span style="font-size:13px; color:#e8e8f0;">Parses successfully</span>
874
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.10</span>
875
- </div>
876
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
877
- <span style="font-size:13px; color:#e8e8f0;">Executes without error</span>
878
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.20</span>
879
- </div>
880
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
881
- <span style="font-size:13px; color:#e8e8f0;">Column Accuracy</span>
882
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.10</span>
883
- </div>
884
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
885
- <span style="font-size:13px; color:#e8e8f0;">Data Accuracy</span>
886
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.30</span>
887
- </div>
888
- <div style="display:flex; justify-content:space-between; align-items:center;">
889
- <span style="font-size:13px; color:#e8e8f0;">Exact Match Bonus</span>
890
- <span style="font-family:var(--mono); color:#34d399; font-weight:bold; font-size:13px;">+0.30</span>
891
- </div>
892
- </div>
893
- `;
894
- }}
895
-
896
- rewardsHtml += `
897
- <div style="font-size:11px; font-weight:bold; color:var(--muted); text-transform:uppercase; margin-bottom:6px; margin-top: 10px; border-top: 1px solid rgba(255,255,255,0.05); padding-top: 10px;">Penalties</div>
898
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
899
- <span style="font-size:13px; color:var(--muted)">Duplicate Submission</span>
900
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.10</span>
901
- </div>
902
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
903
- <span style="font-size:13px; color:var(--muted)">Efficiency Penalty</span>
904
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.20</span>
905
- </div>
906
- <div style="display:flex; justify-content:space-between; align-items:center; margin-bottom:4px;">
907
- <span style="font-size:13px; color:var(--muted)">Destructive Action</span>
908
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.30</span>
909
- </div>
910
- <div style="display:flex; justify-content:space-between; align-items:center;">
911
- <span style="font-size:13px; color:var(--muted)">Hardcode Penalty</span>
912
- <span style="font-family:var(--mono); color:#f87171; font-weight:bold; font-size:13px;">-0.50</span>
913
- </div>
914
- `;
915
-
916
- rewardBody.innerHTML = rewardsHtml;
917
-
918
- // Schema
919
- let schemaStr = '';
920
- for (const [table, cols] of Object.entries(task.schema_info)) {{
921
- schemaStr += `TABLE ${{table}} {{\\n`;
922
- cols.forEach(c => schemaStr += ` ${{c}}\\n`);
923
- schemaStr += `}}\\n\\n`;
924
- }}
925
- document.getElementById('schema-dump').textContent = schemaStr.trim();
926
-
927
- document.getElementById('reward-card').style.display = 'block';
928
-
929
- document.getElementById('verifier-out').className = 'verifier-output';
930
- document.getElementById('verifier-out').innerHTML = '🔄 Environment initialized. Awaiting agent execution…';
931
- }}
932
-
933
- function executeStep() {{
934
- const taskId = document.getElementById('task-select').value;
935
- const task = TASKS[taskId];
936
- const agentSQL = document.getElementById('agent-input').value.trim();
937
- const out = document.getElementById('verifier-out');
938
-
939
- if (!agentSQL) {{
940
- out.className = 'verifier-output error';
941
- out.innerHTML = '<h3>⚠️ No Input</h3><p>Please write your SQL fix in the agent sandbox first.</p>';
942
- return;
943
- }}
944
-
945
- // Fake verifier
946
- const sql = agentSQL.toUpperCase();
947
- const taskSolved = (
948
- (taskId === 'task_1_easy' && sql.includes(',') && sql.includes('NAME') && sql.includes('AGE')) ||
949
- (taskId === 'task_2_medium' && sql.includes('GROUP BY')) ||
950
- (taskId === 'task_3_hard' && sql.includes('PARTITION BY')) ||
951
- (taskId === 'task_4_expert' && !sql.includes('13-01') && sql.includes('MONTHLY_SALES'))
952
- );
953
-
954
- if (taskSolved) {{
955
- out.className = 'verifier-output success';
956
- out.innerHTML = `
957
- <h3>✅ Verification Passed!</h3>
958
- <p>The query compiled and executed successfully inside the DuckDB in-memory sandbox.</p>
959
- <p>The pipeline produced the expected output rows without errors.</p>
960
- <span class="reward-pill reward-positive">Reward: +1.0</span>
961
- `;
962
- }} else {{
963
- out.className = 'verifier-output error';
964
- out.innerHTML = `
965
- <h3>❌ Verification Failed</h3>
966
- <p>DuckDB raised an error during execution.</p>
967
- <p style="font-family:var(--mono);font-size:12px;margin-top:6px;opacity:0.8">${{task.error}}</p>
968
- <span class="reward-pill reward-negative">Reward: -0.1</span>
969
- `;
970
- }}
971
- }}
972
- </script>
973
- </body>
974
- </html>""".replace("{TASKS_JSON}", TASKS_JSON)
975
- return HTMLResponse(html)
976
-
977
-
978
- if __name__ == "__main__":
979
- import uvicorn
980
- uvicorn.run(app, host="0.0.0.0", port=7860)
1
+ import json
2
+ import time
3
+ import duckdb
4
+ from fastapi import FastAPI
5
+ from fastapi.responses import RedirectResponse, HTMLResponse
6
+ from fastapi.middleware.cors import CORSMiddleware
7
+ from pydantic import BaseModel
8
+
9
+ # ── Global session state for DuckDB-backed tasks ──────────────────────────────
10
+ CURRENT_SESSION = {
11
+ "task_id": None,
12
+ "con": None, # duckdb.DuckDBPyConnection
13
+ "step_count": 0,
14
+ "done": False,
15
+ "baseline_rows": None, # for optimization task
16
+ "chaos_fixed": False, # for chaos task
17
+ "reward_history": [],
18
+ }
19
+
20
+ app = FastAPI(
21
+ title="SQL Debug RL Environment",
22
+ description="Real-world SQL pipeline debugging environment. An agent learns to fix and route broken SQL scripts.",
23
+ version="1.0.0",
24
+ docs_url=None,
25
+ redoc_url=None,
26
+ )
27
+
28
+ app.add_middleware(
29
+ CORSMiddleware,
30
+ allow_origins=["*"],
31
+ allow_credentials=True,
32
+ allow_methods=["*"],
33
+ allow_headers=["*"],
34
+ )
35
+
36
+
37
+ # ── Pydantic Models ──────────────────────────────────────────────────────────
38
+
39
+ class StepAction(BaseModel):
40
+ action: str
41
+ explanation: str = ""
42
+
43
+ class ResetRequest(BaseModel):
44
+ task_id: str = "task_1_easy"
45
+
46
+
47
+ # ── Hard-coded Task Data ─────────────────────────────────────────────────────
48
+
49
+ TASKS = {
50
+ "task_1_easy": {
51
+ "label": "Task 1 — Easy: Syntax Fix",
52
+ "description": "Fix the syntax error in the SELECT statement. A comma is missing between column names.",
53
+ "broken_sql": "SELECT name age FROM users;",
54
+ "schema_info": {
55
+ "users": ["id INTEGER", "name TEXT", "age INTEGER", "email TEXT"]
56
+ },
57
+ "solution": "SELECT name, age FROM users;",
58
+ "error": "SyntaxError: Expected ',' or 'FROM' after 'name', got 'age'.",
59
+ "hint": "Add a comma between 'name' and 'age'.",
60
+ },
61
+ "task_2_medium": {
62
+ "label": "Task 2 — Medium: GROUP BY Aggregation",
63
+ "description": "You cannot SELECT unaggregated columns alongside aggregate functions without a GROUP BY clause.",
64
+ "broken_sql": (
65
+ "SELECT u.name, SUM(o.total) AS total_spent\n"
66
+ "FROM users u\n"
67
+ "JOIN orders o ON u.id = o.user_id;"
68
+ ),
69
+ "schema_info": {
70
+ "users": ["id INTEGER", "name TEXT"],
71
+ "orders": ["id INTEGER", "user_id INTEGER", "total DECIMAL"],
72
+ },
73
+ "solution": (
74
+ "SELECT u.name, SUM(o.total) AS total_spent\n"
75
+ "FROM users u\n"
76
+ "JOIN orders o ON u.id = o.user_id\n"
77
+ "GROUP BY u.name;"
78
+ ),
79
+ "error": "SemanticError: column 'u.name' must appear in the GROUP BY clause or be used in an aggregate function.",
80
+ "hint": "Add GROUP BY u.name at the end.",
81
+ },
82
+ "task_3_hard": {
83
+ "label": "Task 3 Hard: Window Function + PARTITION",
84
+ "description": "The RANK() window function is missing PARTITION BY, causing it to rank globally instead of per-department.",
85
+ "broken_sql": (
86
+ "SELECT department, name, salary,\n"
87
+ " RANK() OVER (ORDER BY salary DESC) AS dept_rank\n"
88
+ "FROM employees\n"
89
+ "GROUP BY department;"
90
+ ),
91
+ "schema_info": {
92
+ "employees": ["id INTEGER", "name TEXT", "department TEXT", "salary DECIMAL"],
93
+ },
94
+ "solution": (
95
+ "SELECT department, name, salary,\n"
96
+ " RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank\n"
97
+ "FROM employees;"
98
+ ),
99
+ "error": "ExecutionError: window functions are not allowed in GROUP BY.",
100
+ "hint": "Remove GROUP BY and add PARTITION BY department inside OVER(...).",
101
+ },
102
+ "task_4_expert": {
103
+ "label": "Task 4 — Expert: CTE + Invalid Date",
104
+ "description": "The CTE contains an invalid date literal (month 13 does not exist). Fix the date and ensure the pipeline executes.",
105
+ "broken_sql": (
106
+ "WITH monthly_sales AS (\n"
107
+ " SELECT id, amount, txn_date\n"
108
+ " FROM transactions\n"
109
+ " WHERE txn_date > '2024-13-01'\n"
110
+ ")\n"
111
+ "SELECT SUM(amount) AS total FROM monthly_sales;"
112
+ ),
113
+ "schema_info": {
114
+ "transactions": ["id INTEGER", "amount DECIMAL", "txn_date DATE", "category TEXT"],
115
+ },
116
+ "solution": (
117
+ "WITH monthly_sales AS (\n"
118
+ " SELECT id, amount, txn_date\n"
119
+ " FROM transactions\n"
120
+ " WHERE txn_date > '2024-12-01'\n"
121
+ ")\n"
122
+ "SELECT SUM(amount) AS total FROM monthly_sales;"
123
+ ),
124
+ "error": "DataError: month must be in 1..12, got '13'.",
125
+ "hint": "Change '2024-13-01' to a valid date like '2024-12-01'.",
126
+ },
127
+
128
+ # ── Advanced Tasks ──────────────────────────────────────────────────────
129
+ "task_5_optimization": {
130
+ "label": "Task 5 — Advanced: Query Optimization",
131
+ "description": (
132
+ "A working query uses a CROSS JOIN + WHERE filter instead of a proper INNER JOIN. "
133
+ "It returns correct results but is catastrophically slow. "
134
+ "Your goal: rewrite it to use an explicit JOIN. "
135
+ "The verifier checks (1) output matches baseline and (2) EXPLAIN plan no longer contains CROSS_PRODUCT."
136
+ ),
137
+ "broken_sql": (
138
+ "SELECT c.name, SUM(o.amount) AS total_spent\n"
139
+ "FROM customers c, orders o\n"
140
+ "WHERE c.id = o.customer_id\n"
141
+ "GROUP BY c.name\n"
142
+ "ORDER BY total_spent DESC;"
143
+ ),
144
+ "schema_info": {
145
+ "customers": ["id INTEGER PRIMARY KEY", "name TEXT", "city TEXT"],
146
+ "orders": ["id INTEGER PRIMARY KEY", "customer_id INTEGER", "amount DECIMAL", "order_date DATE"],
147
+ },
148
+ "solution": (
149
+ "SELECT c.name, SUM(o.amount) AS total_spent\n"
150
+ "FROM customers c\n"
151
+ "INNER JOIN orders o ON c.id = o.customer_id\n"
152
+ "GROUP BY c.name\n"
153
+ "ORDER BY total_spent DESC;"
154
+ ),
155
+ "error": "Performance issue: CROSS JOIN creates a cartesian product before filtering. Zero errors, but terrible at scale.",
156
+ "hint": "Replace 'FROM customers c, orders o WHERE c.id = o.customer_id' with 'FROM customers c INNER JOIN orders o ON c.id = o.customer_id'.",
157
+ "duckdb_backed": True,
158
+ },
159
+ "task_6_migration": {
160
+ "label": "Task 6 — Advanced: Schema Migration (3NF)",
161
+ "description": (
162
+ "You have a single denormalized 'messy_dump' table with columns: "
163
+ "(user_id, user_name, order_id, order_date, product, amount). "
164
+ "Migrate it to a 3NF schema: users(id, name) and orders(id, user_id, order_date, product, amount). "
165
+ "Then DROP the original table. "
166
+ "WARNING: Dropping 'messy_dump' before populating target tables triggers a Destructive Action penalty and ends the episode."
167
+ ),
168
+ "broken_sql": (
169
+ "-- Step 1: Create target tables\n"
170
+ "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);\n"
171
+ "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_date DATE, product TEXT, amount DECIMAL);\n\n"
172
+ "-- Step 2: Migrate data\n"
173
+ "INSERT INTO users SELECT DISTINCT user_id, user_name FROM messy_dump;\n"
174
+ "INSERT INTO orders SELECT order_id, user_id, order_date::DATE, product, amount FROM messy_dump;\n\n"
175
+ "-- Step 3: Drop original\n"
176
+ "DROP TABLE messy_dump;"
177
+ ),
178
+ "schema_info": {
179
+ "messy_dump": ["user_id INTEGER", "user_name TEXT", "order_id INTEGER", "order_date TEXT", "product TEXT", "amount DECIMAL"],
180
+ "users [TARGET]": ["id INTEGER PRIMARY KEY", "name TEXT"],
181
+ "orders [TARGET]": ["id INTEGER PRIMARY KEY", "user_id INTEGER", "order_date DATE", "product TEXT", "amount DECIMAL"],
182
+ },
183
+ "solution": (
184
+ "CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);\n"
185
+ "CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER, order_date DATE, product TEXT, amount DECIMAL);\n"
186
+ "INSERT INTO users SELECT DISTINCT user_id, user_name FROM messy_dump;\n"
187
+ "INSERT INTO orders SELECT order_id, user_id, order_date::DATE, product, amount FROM messy_dump;\n"
188
+ "DROP TABLE messy_dump;"
189
+ ),
190
+ "error": "NoError: Data exists but is denormalized. Goal is to normalize into 3NF and safely migrate.",
191
+ "hint": "Create 'users' and 'orders' tables first, INSERT data from messy_dump, then DROP messy_dump last.",
192
+ "duckdb_backed": True,
193
+ },
194
+ "task_7_chaos": {
195
+ "label": "Task 7 — Advanced: Chaos Engineering (Live Corruption)",
196
+ "description": (
197
+ "A live ETL pipeline runs on every step, inserting new records. "
198
+ "A bug is causing DUPLICATE user_id entries and NULL email values, "
199
+ "which poisons downstream analytics. "
200
+ "Query the 'error_logs' table to identify the root cause, "
201
+ "then apply a patch (UNIQUE constraint / COALESCE cleanup) to stop the corruption. "
202
+ "Reward increases for every clean step after your fix is applied."
203
+ ),
204
+ "broken_sql": (
205
+ "-- Inspect the error log first:\n"
206
+ "SELECT * FROM error_logs ORDER BY logged_at DESC LIMIT 10;\n\n"
207
+ "-- Then apply your fix. Example patches:\n"
208
+ "-- 1) Clean duplicates: DELETE FROM users WHERE rowid NOT IN (SELECT MIN(rowid) FROM users GROUP BY user_id);\n"
209
+ "-- 2) Fix NULLs: UPDATE users SET email = COALESCE(email, 'unknown@domain.com') WHERE email IS NULL;\n"
210
+ "-- 3) Add constraint: CREATE UNIQUE INDEX IF NOT EXISTS ux_users_id ON users(user_id);"
211
+ ),
212
+ "schema_info": {
213
+ "users": ["rowid INTEGER", "user_id INTEGER", "name TEXT", "email TEXT"],
214
+ "error_logs": ["id INTEGER", "error_type TEXT", "details TEXT", "logged_at TIMESTAMP"],
215
+ },
216
+ "solution": (
217
+ "DELETE FROM users WHERE rowid NOT IN (SELECT MIN(rowid) FROM users GROUP BY user_id);\n"
218
+ "UPDATE users SET email = COALESCE(email, 'unknown@domain.com') WHERE email IS NULL;\n"
219
+ "CREATE UNIQUE INDEX IF NOT EXISTS ux_users_id ON users(user_id);"
220
+ ),
221
+ "error": "DataIntegrityError: Duplicate user_id values and NULL emails detected in the pipeline output.",
222
+ "hint": "First SELECT * FROM error_logs to understand what is failing, then clean duplicates and NULLs, and add a UNIQUE index.",
223
+ "duckdb_backed": True,
224
+ },
225
+ }
226
+
227
+
228
+ # ── API Endpoints ────────────────────────────────────────────────────────────
229
+
230
+ @app.get("/", include_in_schema=False)
231
+ def read_root():
232
+ return RedirectResponse(url="/web_ui")
233
+
234
+ @app.get("/health", tags=["default"])
235
+ def health():
236
+ return {"status": "ok", "version": "1.0.0", "message": "SQL Debug Environment is healthy."}
237
+
238
+ def _seed_task5(con):
239
+ """Seed customers + orders for the optimization task."""
240
+ con.execute("DROP TABLE IF EXISTS customers; DROP TABLE IF EXISTS orders;")
241
+ con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
242
+ con.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount DECIMAL, order_date DATE)")
243
+ customers = [(i, f"Customer_{i}", "City") for i in range(1, 51)]
244
+ orders = [(i, (i % 50) + 1, round(10 + (i * 3.7) % 500, 2), "2024-01-15") for i in range(1, 201)]
245
+ con.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)
246
+ con.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", orders)
247
+
248
+ def _seed_task6(con):
249
+ """Seed messy_dump for the migration task."""
250
+ con.execute("DROP TABLE IF EXISTS messy_dump; DROP TABLE IF EXISTS users; DROP TABLE IF EXISTS orders;")
251
+ con.execute("CREATE TABLE messy_dump (user_id INTEGER, user_name TEXT, order_id INTEGER, order_date TEXT, product TEXT, amount DECIMAL)")
252
+ rows = [
253
+ (1,"Alice",101,"2024-01-10","Widget A",29.99),
254
+ (1,"Alice",102,"2024-01-12","Widget B",49.99),
255
+ (2,"Bob",103,"2024-01-15","Gadget X",99.99),
256
+ (3,"Carol",104,"2024-01-20","Widget A",29.99),
257
+ (3,"Carol",105,"2024-01-22","Gadget Y",149.99),
258
+ (4,"Dave",106,"2024-02-01","Widget B",49.99),
259
+ (5,"Eve",107,"2024-02-05","Gadget X",99.99),
260
+ ]
261
+ con.executemany("INSERT INTO messy_dump VALUES (?,?,?,?,?,?)", rows)
262
+
263
+ def _seed_task7(con):
264
+ """Seed a corrupted users table and an error_logs table for chaos task."""
265
+ con.execute("DROP SEQUENCE IF EXISTS seq_users; DROP TABLE IF EXISTS users; DROP TABLE IF EXISTS error_logs;")
266
+ con.execute("CREATE SEQUENCE seq_users START 1")
267
+ con.execute("CREATE TABLE users (rowid INTEGER DEFAULT nextval('seq_users'), user_id INTEGER, name TEXT, email TEXT)")
268
+ con.execute("CREATE TABLE error_logs (id INTEGER, error_type TEXT, details TEXT, logged_at TIMESTAMP)")
269
+ users = [
270
+ (1,"Alice","alice@example.com"),
271
+ (2,"Bob","bob@example.com"),
272
+ (1,"Alice_dup",None), # duplicate user_id + NULL email
273
+ (3,"Carol","carol@example.com"),
274
+ (4,"Dave",None), # NULL email
275
+ (2,"Bob_dup","bob2@example.com"), # duplicate user_id
276
+ ]
277
+ con.executemany("INSERT INTO users (user_id, name, email) VALUES (?,?,?)", users)
278
+ logs = [
279
+ (1,"DUPLICATE_KEY","user_id=1 appears 2 times","2024-01-15 08:01:00"),
280
+ (2,"NULL_VIOLATION","email IS NULL for user_id=1 (row 3)","2024-01-15 08:01:01"),
281
+ (3,"DUPLICATE_KEY","user_id=2 appears 2 times","2024-01-15 08:01:02"),
282
+ (4,"NULL_VIOLATION","email IS NULL for user_id=4","2024-01-15 08:01:03"),
283
+ ]
284
+ con.executemany("INSERT INTO error_logs VALUES (?,?,?,?)", logs)
285
+
286
+ def _run_chaos_pipeline(con):
287
+ """Simulate one ETL tick that tries to insert dirty data."""
288
+ import random  # datetime is not needed here
289
+ uid = random.randint(1, 3) # intentional duplicate range
290
+ con.execute(
291
+ "INSERT INTO users (user_id, name, email) VALUES (?, ?, ?)",
292
+ [uid, f"Auto_{uid}", None if random.random() < 0.5 else f"auto{uid}@x.com"]
293
+ )
294
+
295
+ @app.post("/reset", tags=["Environment"])
296
+ def reset_episode(req: ResetRequest):
297
+ task_id = req.task_id if req.task_id in TASKS else "task_1_easy"
298
+ task = TASKS[task_id]
299
+
300
+ # Spin up a fresh DuckDB connection for DuckDB-backed tasks
301
+ if task.get("duckdb_backed"):
302
+ con = duckdb.connect(":memory:")
303
+ if task_id == "task_5_optimization":
304
+ _seed_task5(con)
305
+ baseline = con.execute(
306
+ "SELECT c.name, SUM(o.amount) AS total_spent "
307
+ "FROM customers c, orders o WHERE c.id = o.customer_id "
308
+ "GROUP BY c.name ORDER BY total_spent DESC"
309
+ ).fetchall()
310
+ elif task_id == "task_6_migration":
311
+ _seed_task6(con)
312
+ baseline = None
313
+ elif task_id == "task_7_chaos":
314
+ _seed_task7(con)
315
+ baseline = None
316
+
+ else:
+ con, baseline = None, None
+
+ # Always record the new episode, so /step dispatches on the right task
+ # even when switching from a DuckDB-backed task to a legacy one.
+ CURRENT_SESSION.update({
+ "task_id": task_id, "con": con, "step_count": 0,
+ "done": False, "baseline_rows": baseline,
+ "chaos_fixed": False, "reward_history": [],
+ })
322
+
323
+ return {
324
+ "status": "success",
325
+ "observation": {
326
+ "task_id": task_id,
327
+ "label": task["label"],
328
+ "description": task["description"],
329
+ "broken_sql": task["broken_sql"],
330
+ "schema_info": task["schema_info"],
331
+ "error_hint": task["error"],
332
+ },
333
+ }
334
+
335
+
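An episode against this API is one `POST /reset` followed by repeated `POST /step` calls. A minimal sketch of the two request bodies (the helper names are ours, not part of the app):

```python
import json

def reset_payload(task_id: str) -> bytes:
    # Body for POST /reset; unknown task_ids fall back to task_1_easy server-side.
    return json.dumps({"task_id": task_id}).encode()

def step_payload(sql: str) -> bytes:
    # Body for POST /step; the agent's candidate SQL goes in "action".
    return json.dumps({"action": sql}).encode()
```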
336
+ @app.post("/step", tags=["Environment"])
337
+ def step_environment(action: StepAction):
338
+ task_id = CURRENT_SESSION.get("task_id")
339
+ task = TASKS.get(task_id, {})
340
+ con = CURRENT_SESSION.get("con")
341
+ step_count = CURRENT_SESSION.get("step_count", 0) + 1
342
+ CURRENT_SESSION["step_count"] = step_count
343
+
344
+ # ── Legacy tasks 1-4: simple pattern matching ───────────────────────────
345
+ if not task.get("duckdb_backed"):
346
+ sql = action.action.strip().upper()
347
+ solved = "GROUP BY" in sql or "," in sql or "PARTITION" in sql or "12-01" in sql
348
+ reward = 1.0 if solved else -0.1
349
+ CURRENT_SESSION["reward_history"].append(reward)
350
+ return {
351
+ "reward": reward, "done": solved,
352
+ "info": {
353
+ "message": "Execution succeeded." if solved else "Execution failed. Review your fix.",
354
+ "verifier": "Pattern-match verifier",
355
+ },
356
+ "state": {"current_sql": action.action, "step_count": step_count},
357
+ }
358
+
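The legacy check above reduces to a pure predicate — one trigger substring per task. Isolated as a sketch (the function name is ours):

```python
def legacy_solved(sql: str) -> bool:
    # Mirrors the tasks 1-4 verifier: any of these substrings counts as a fix.
    s = sql.strip().upper()
    return "GROUP BY" in s or "," in s or "PARTITION" in s or "12-01" in s
```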
359
+ # ── Task 5: Query Optimization ───────────────────────────────────────────
360
+ if task_id == "task_5_optimization":
361
+ agent_sql = action.action.strip()
362
+ reward, done, msg = 0.0, False, ""
363
+ try:
364
+ t0 = time.perf_counter()
365
+ rows = con.execute(agent_sql).fetchall()
366
+ elapsed = time.perf_counter() - t0
367
+
368
+ baseline = CURRENT_SESSION["baseline_rows"]
369
+ correct = sorted(rows) == sorted(baseline)
370
+ explain = con.execute(f"EXPLAIN {agent_sql}").fetchall()
371
+ plan_str = " ".join(str(r) for r in explain).upper()
372
+ no_cross = "CROSS_PRODUCT" not in plan_str
373
+
374
+ if correct and no_cross:
375
+ reward, done = 1.0, True
376
+ msg = f"✅ Output matches baseline ({len(rows)} rows in {elapsed*1000:.1f} ms). EXPLAIN shows no CROSS_PRODUCT. Reward: +1.0"
377
+ elif correct:
378
+ reward = 0.5
379
+ msg = "⚠️ Output matches baseline but EXPLAIN still shows CROSS_PRODUCT. Reward: +0.5"
380
+ else:
381
+ reward = -0.1
382
+ msg = "❌ Output does NOT match baseline. Check your query logic."
383
+ except Exception as e:
384
+ reward, msg = -0.2, f"❌ DuckDB Error: {e}"
385
+ CURRENT_SESSION["reward_history"].append(reward)
386
+ return {"reward": reward, "done": done,
387
+ "info": {"message": msg, "verifier": "DuckDB EXPLAIN + row comparison"},
388
+ "state": {"step_count": step_count}}
389
+
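Task 5 compares result sets order-insensitively via `sorted`. The same comparison, extracted as a sketch (assumes rows are comparable tuples, which holds for DuckDB `fetchall()` output):

```python
def rows_match(got, baseline) -> bool:
    # Order-insensitive row-set equality, as used by the task 5 verifier.
    return sorted(got) == sorted(baseline)
```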
390
+ # ── Task 6: Schema Migration ─────────────────────────────────────────────
391
+ if task_id == "task_6_migration":
392
+ agent_sql = action.action.strip()
393
+ reward, done, msg = 0.0, False, ""
394
+ # Detect if agent is dropping messy_dump early (destructive action)
395
+ sql_upper = agent_sql.upper()
396
+ tables_before = {r[0].lower() for r in con.execute("SHOW TABLES").fetchall()}
397
+ users_ok = "users" in tables_before
398
+ orders_ok = "orders" in tables_before
399
+ dropping = "DROP" in sql_upper and "MESSY_DUMP" in sql_upper
400
+
401
+ if dropping:
402
+ # Check if data is actually populated
403
+ u_ok = users_ok and con.execute("SELECT COUNT(*) FROM users").fetchone()[0] > 0
404
+ o_ok = orders_ok and con.execute("SELECT COUNT(*) FROM orders").fetchone()[0] > 0
405
+ if not (u_ok and o_ok):
406
+ reward, done = -0.3, True
407
+ msg = "💀 DESTRUCTIVE ACTION: Attempted to DROP messy_dump before the target tables were populated! Episode ended. Penalty: -0.3"
408
+ CURRENT_SESSION["done"] = True
409
+ CURRENT_SESSION["reward_history"].append(reward)
410
+ return {"reward": reward, "done": done,
411
+ "info": {"message": msg, "verifier": "Intermediate-state guard"},
412
+ "state": {"step_count": step_count}}
413
+ try:
414
+ for stmt in agent_sql.split(";"):
415
+ stmt = stmt.strip()
416
+ if stmt:
417
+ con.execute(stmt)
418
+ tables_after = {r[0].lower() for r in con.execute("SHOW TABLES").fetchall()}
419
+ users_count = con.execute("SELECT COUNT(*) FROM users").fetchone()[0] if "users" in tables_after else 0
420
+ orders_count = con.execute("SELECT COUNT(*) FROM orders").fetchone()[0] if "orders" in tables_after else 0
421
+ dump_gone = "messy_dump" not in tables_after
422
+
423
+ if users_count >= 5 and orders_count >= 7 and dump_gone:
424
+ reward, done = 1.0, True
425
+ msg = f"✅ Migration complete! users={users_count} rows, orders={orders_count} rows. messy_dump dropped. Reward: +1.0"
426
+ elif users_count > 0 or orders_count > 0:
427
+ reward = 0.3
428
+ msg = f"🔄 Partial progress: users={users_count}, orders={orders_count}. messy_dump={'gone' if dump_gone else 'still exists'}."
429
+ else:
430
+ reward = 0.05
431
+ msg = "📋 Tables created. Now migrate the data with INSERT INTO ... SELECT."
432
+ except Exception as e:
433
+ reward, msg = -0.2, f"❌ DuckDB Error: {e}"
434
+ CURRENT_SESSION["reward_history"].append(reward)
435
+ return {"reward": reward, "done": done,
436
+ "info": {"message": msg, "verifier": "Row-count + table existence check"},
437
+ "state": {"step_count": step_count}}
438
+
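Task 6 executes multi-statement scripts by splitting on `;`. The same helper, extracted as a sketch — note the naive split would break on semicolons inside string literals, which these task scripts avoid:

```python
def split_statements(script: str) -> list[str]:
    # Naive split on ';' — acceptable here because the task SQL contains
    # no semicolons inside string literals.
    return [s.strip() for s in script.split(";") if s.strip()]
```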
439
+ # ── Task 7: Chaos Engineering ────────────────────────────────────────────
440
+ if task_id == "task_7_chaos":
441
+ agent_sql = action.action.strip()
442
+ reward, done, msg = 0.0, False, ""
443
+ try:
444
+ for stmt in agent_sql.split(";"):
445
+ stmt = stmt.strip()
446
+ if stmt and not stmt.startswith("--"):
447
+ con.execute(stmt)
448
+ # Run one tick of the "live" ETL pipeline
449
+ _run_chaos_pipeline(con)
450
+ # Check integrity
451
+ dup_count = con.execute("SELECT COUNT(*) FROM (SELECT user_id FROM users GROUP BY user_id HAVING COUNT(*)>1)").fetchone()[0]
452
+ null_count = con.execute("SELECT COUNT(*) FROM users WHERE email IS NULL").fetchone()[0]
453
+ has_index = any("ux_users_id" in str(r) for r in con.execute("SELECT index_name FROM duckdb_indexes()").fetchall())
454
+
455
+ if dup_count == 0 and null_count == 0 and has_index:
456
+ reward, done = 1.0, True
457
+ CURRENT_SESSION["chaos_fixed"] = True
458
+ msg = "✅ Pipeline is clean! No duplicates, no NULLs, UNIQUE index in place. Reward: +1.0"
459
+ elif dup_count == 0 and null_count == 0:
460
+ reward = 0.7
461
+ msg = "🔄 Data is clean this step but no UNIQUE index. Reward: +0.7 (add index to fully lock it in)"
462
+ elif CURRENT_SESSION.get("chaos_fixed"):
463
+ reward = 0.5
464
+ msg = f"⚠️ ETL re-introduced {dup_count} dups and {null_count} NULLs. Partial reward: +0.5"
465
+ else:
466
+ reward = -0.1
467
+ msg = f"❌ Still corrupt: {dup_count} duplicate user_ids, {null_count} NULL emails. Reward: -0.1"
468
+ except Exception as e:
469
+ reward, msg = -0.2, f"❌ DuckDB Error: {e}"
470
+ CURRENT_SESSION["reward_history"].append(reward)
471
+ return {"reward": reward, "done": done,
472
+ "info": {"message": msg, "verifier": "Integrity check (dups + NULLs + index)"},
473
+ "state": {"step_count": step_count}}
474
+
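The two quantities the chaos verifier checks each step — duplicate `user_id`s and NULL emails — can be reproduced with stdlib `sqlite3` for illustration. A sketch, not the DuckDB-backed verifier itself:

```python
import sqlite3

def integrity_counts(rows):
    # rows: (user_id, name, email) tuples.
    # Returns (duplicated user_ids, NULL emails), the task 7 integrity signals.
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE users (user_id INTEGER, name TEXT, email TEXT)")
    con.executemany("INSERT INTO users VALUES (?, ?, ?)", rows)
    dup = con.execute(
        "SELECT COUNT(*) FROM (SELECT user_id FROM users "
        "GROUP BY user_id HAVING COUNT(*) > 1)"
    ).fetchone()[0]
    nulls = con.execute(
        "SELECT COUNT(*) FROM users WHERE email IS NULL"
    ).fetchone()[0]
    con.close()
    return dup, nulls
```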
475
+ @app.get("/state", tags=["Environment"])
+ def get_state():
+ task_id = CURRENT_SESSION.get("task_id") or "task_2_medium"
+ task = TASKS[task_id]
+ return {
+ "task_id": task_id,
+ "current_sql": task["broken_sql"],
+ "step_count": CURRENT_SESSION.get("step_count", 0),
+ "done": CURRENT_SESSION.get("done", False),
+ "schema": task["schema_info"],
+ }
484
+
485
+ @app.get("/tasks", tags=["System"])
486
+ def get_tasks():
487
+ return TASKS
488
+
489
+ @app.get("/web", tags=["System"])
490
+ def web_redirect():
491
+ return RedirectResponse(url="/web_ui")
492
+
493
+
494
+ # ── Custom API Docs ──────────────────────────────────────────────────────────
495
+
496
+ @app.get("/docs", include_in_schema=False)
497
+ async def custom_swagger():
498
+ html = """<!DOCTYPE html>
499
+ <html lang="en">
500
+ <head>
501
+ <meta charset="UTF-8"/>
502
+ <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
503
+ <title>SQL Debug Env – API Docs</title>
504
+ <link rel="preconnect" href="https://fonts.googleapis.com">
505
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&display=swap" rel="stylesheet">
506
+ <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui.css">
507
+ <style>
508
+ *, *::before, *::after { box-sizing: border-box; margin: 0; padding: 0; }
509
+ body {
510
+ font-family: 'Inter', sans-serif;
511
+ background: #ffffff;
512
+ color: #333333;
513
+ min-height: 100vh;
514
+ }
515
+
516
+ /* ── Top Nav (Light Mode) ── */
517
+ .nav {
518
+ position: sticky;
519
+ top: 0;
520
+ z-index: 1000;
521
+ display: flex;
522
+ align-items: center;
523
+ justify-content: space-between;
524
+ padding: 0 32px;
525
+ height: 64px;
526
+ background: rgba(255, 255, 255, 0.95);
527
+ backdrop-filter: blur(16px);
528
+ border-bottom: 1px solid #e5e5e5;
529
+ }
530
+ .nav-brand {
531
+ display: flex;
532
+ align-items: center;
533
+ gap: 12px;
534
+ font-size: 18px;
535
+ font-weight: 700;
536
+ color: #111827;
537
+ }
538
+ .nav-badge {
539
+ background: #f3f4f6;
540
+ border: 1px solid #d1d5db;
541
+ padding: 3px 10px;
542
+ border-radius: 20px;
543
+ font-size: 11px;
544
+ font-weight: 600;
545
+ letter-spacing: 0.5px;
546
+ color: #4b5563;
547
+ }
548
+ .nav-actions { display: flex; gap: 10px; }
549
+ .btn-back {
550
+ display: inline-flex;
551
+ align-items: center;
552
+ gap: 6px;
553
+ background: #ffffff;
554
+ border: 1px solid #d1d5db;
555
+ color: #374151;
556
+ padding: 8px 18px;
557
+ border-radius: 8px;
558
+ text-decoration: none;
559
+ font-size: 13px;
560
+ font-weight: 600;
561
+ transition: all 0.2s;
562
+ }
563
+ .btn-back:hover {
564
+ background: #f9fafb;
565
+ border-color: #9ca3af;
566
+ transform: translateY(-1px);
567
+ }
568
+
569
+ /* Small wrapper padding so it doesn't touch the edges */
570
+ .swagger-ui .wrapper { padding: 24px 40px; max-width: 1300px; margin: 0 auto; }
571
+ .swagger-ui .topbar { display: none !important; }
572
+ </style>
573
+ </head>
574
+ <body>
575
+ <nav class="nav">
576
+ <div class="nav-brand">
577
+ 🛰️ SQL Debug Environment
578
+ <span class="nav-badge">OAS 3.1</span>
579
+ <span class="nav-badge" style="background:linear-gradient(135deg,#10b981,#059669)">v1.0.0</span>
580
+ </div>
581
+ <div class="nav-actions">
582
+ <a href="/web_ui" class="btn-back">⬅ Back to Web UI</a>
583
+ </div>
584
+ </nav>
585
+ <div id="swagger-ui"></div>
586
+ <script src="https://cdn.jsdelivr.net/npm/swagger-ui-dist@5/swagger-ui-bundle.js"></script>
587
+ <script>
588
+ window.onload = () => {
589
+ SwaggerUIBundle({
590
+ url: "/openapi.json",
591
+ dom_id: '#swagger-ui',
592
+ deepLinking: true,
593
+ presets: [SwaggerUIBundle.presets.apis, SwaggerUIBundle.SwaggerUIStandalonePreset],
594
+ layout: "BaseLayout",
595
+ });
596
+ };
597
+ </script>
598
+ </body>
599
+ </html>"""
600
+ return HTMLResponse(html)
601
+
602
+
603
+ # ── Custom Web UI ────────────────────────────────────────────────────────────
604
+
605
+ TASKS_JSON = json.dumps(TASKS)
606
+
607
+ @app.get("/web_ui", include_in_schema=False)
608
+ async def web_ui():
609
+ html = f"""<!DOCTYPE html>
610
+ <html lang="en">
611
+ <head>
612
+ <meta charset="UTF-8"/>
613
+ <meta name="viewport" content="width=device-width, initial-scale=1.0"/>
614
+ <title>SQL Debug RL Environment</title>
615
+ <link rel="preconnect" href="https://fonts.googleapis.com">
616
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700;800&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet">
617
+ <style>
618
+ *, *::before, *::after {{ box-sizing: border-box; margin: 0; padding: 0; }}
619
+
620
+ :root {{
621
+ --bg: #0f0e17;
622
+ --surface: #1a1827;
623
+ --surface2: #221f35;
624
+ --border: rgba(139,92,246,0.2);
625
+ --accent: #8b5cf6;
626
+ --accent2: #6366f1;
627
+ --green: #10b981;
628
+ --red: #ef4444;
629
+ --text: #e8e8f0;
630
+ --muted: #9090a8;
631
+ --mono: 'JetBrains Mono', monospace;
632
+ --sans: 'Inter', sans-serif;
633
+ }}
634
+
635
+ html, body {{ height: 100%; }}
636
+ body {{
637
+ font-family: var(--sans);
638
+ background: var(--bg);
639
+ color: var(--text);
640
+ min-height: 100vh;
641
+ overflow-x: hidden;
642
+ }}
643
+
644
+ /* ── Animated background ── */
645
+ body::before {{
646
+ content: '';
647
+ position: fixed;
648
+ top: -40%;
649
+ left: -20%;
650
+ width: 600px;
651
+ height: 600px;
652
+ background: radial-gradient(circle, rgba(139,92,246,0.12) 0%, transparent 70%);
653
+ pointer-events: none;
654
+ z-index: 0;
655
+ }}
656
+ body::after {{
657
+ content: '';
658
+ position: fixed;
659
+ bottom: -30%;
660
+ right: -10%;
661
+ width: 500px;
662
+ height: 500px;
663
+ background: radial-gradient(circle, rgba(99,102,241,0.1) 0%, transparent 70%);
664
+ pointer-events: none;
665
+ z-index: 0;
666
+ }}
667
+
668
+ /* ── Nav ── */
669
+ .nav {{
670
+ position: sticky;
671
+ top: 0;
672
+ z-index: 100;
673
+ display: flex;
674
+ align-items: center;
675
+ justify-content: space-between;
676
+ padding: 0 36px;
677
+ height: 64px;
678
+ background: rgba(15, 14, 23, 0.8);
679
+ backdrop-filter: blur(16px);
680
+ border-bottom: 1px solid var(--border);
681
+ }}
682
+ .nav-brand {{
683
+ display: flex;
684
+ align-items: center;
685
+ gap: 12px;
686
+ font-size: 17px;
687
+ font-weight: 700;
688
+ letter-spacing: -0.3px;
689
+ }}
690
+ .badge {{
691
+ padding: 3px 10px;
692
+ border-radius: 20px;
693
+ font-size: 11px;
694
+ font-weight: 600;
695
+ background: linear-gradient(135deg, var(--accent), var(--accent2));
696
+ }}
697
+ .btn {{
698
+ display: inline-flex;
699
+ align-items: center;
700
+ gap: 6px;
701
+ padding: 8px 18px;
702
+ border-radius: 8px;
703
+ font-size: 13px;
704
+ font-weight: 600;
705
+ cursor: pointer;
706
+ transition: all 0.2s;
707
+ border: none;
708
+ text-decoration: none;
709
+ }}
710
+ .btn-outline {{
711
+ background: rgba(139,92,246,0.1);
712
+ border: 1px solid rgba(139,92,246,0.4);
713
+ color: #a78bfa;
714
+ }}
715
+ .btn-outline:hover {{
716
+ background: rgba(139,92,246,0.25);
717
+ border-color: var(--accent);
718
+ color: #fff;
719
+ transform: translateY(-1px);
720
+ }}
721
+ .btn-primary {{
722
+ background: linear-gradient(135deg, var(--accent), var(--accent2));
723
+ color: #fff;
724
+ box-shadow: 0 4px 14px rgba(139,92,246,0.35);
725
+ }}
726
+ .btn-primary:hover {{
727
+ transform: translateY(-2px);
728
+ box-shadow: 0 6px 20px rgba(139,92,246,0.5);
729
+ }}
730
+ .btn-green {{
731
+ background: linear-gradient(135deg, #10b981, #059669);
732
+ color: #fff;
733
+ box-shadow: 0 4px 14px rgba(16,185,129,0.35);
734
+ width: 100%;
735
+ justify-content: center;
736
+ padding: 12px;
737
+ font-size: 14px;
738
+ }}
739
+ .btn-green:hover {{
740
+ transform: translateY(-2px);
741
+ box-shadow: 0 6px 20px rgba(16,185,129,0.5);
742
+ }}
743
+
744
+ /* ── Hero ── */
745
+ .hero {{
746
+ position: relative;
747
+ z-index: 1;
748
+ text-align: center;
749
+ padding: 60px 36px 40px;
750
+ }}
751
+ .hero-eyebrow {{
752
+ display: inline-flex;
753
+ align-items: center;
754
+ gap: 8px;
755
+ background: rgba(139,92,246,0.1);
756
+ border: 1px solid rgba(139,92,246,0.3);
757
+ padding: 6px 16px;
758
+ border-radius: 20px;
759
+ font-size: 12px;
760
+ font-weight: 600;
761
+ color: #a78bfa;
762
+ letter-spacing: 0.5px;
763
+ text-transform: uppercase;
764
+ margin-bottom: 20px;
765
+ }}
766
+ .hero h1 {{
767
+ font-size: clamp(28px, 5vw, 48px);
768
+ font-weight: 800;
769
+ letter-spacing: -1px;
770
+ background: linear-gradient(135deg, #fff 30%, #a78bfa 100%);
771
+ -webkit-background-clip: text;
772
+ -webkit-text-fill-color: transparent;
773
+ background-clip: text;
774
+ line-height: 1.15;
775
+ margin-bottom: 16px;
776
+ }}
777
+ .hero p {{
778
+ color: var(--muted);
779
+ font-size: 16px;
780
+ max-width: 600px;
781
+ margin: 0 auto 28px;
782
+ line-height: 1.6;
783
+ }}
784
+
785
+ /* ── Stat bar ── */
786
+ .stat-bar {{
787
+ display: flex;
788
+ justify-content: center;
789
+ gap: 32px;
790
+ padding: 20px 36px;
791
+ background: rgba(255,255,255,0.02);
792
+ border-top: 1px solid var(--border);
793
+ border-bottom: 1px solid var(--border);
794
+ position: relative;
795
+ z-index: 1;
796
+ }}
797
+ .stat {{ text-align: center; }}
798
+ .stat-val {{ font-size: 20px; font-weight: 700; color: var(--accent); }}
799
+ .stat-lbl {{ font-size: 11px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.5px; margin-top: 2px; }}
800
+
801
+ /* ── Main Layout ── */
802
+ .main {{
803
+ position: relative;
804
+ z-index: 1;
805
+ display: grid;
806
+ grid-template-columns: 320px 1fr;
807
+ gap: 24px;
808
+ padding: 32px 36px;
809
+ max-width: 1300px;
810
+ margin: 0 auto;
811
+ }}
812
+
813
+ /* ── Cards ── */
814
+ .card {{
815
+ background: var(--surface);
816
+ border: 1px solid var(--border);
817
+ border-radius: 16px;
818
+ overflow: hidden;
819
+ }}
820
+ .card-header {{
821
+ padding: 16px 20px;
822
+ border-bottom: 1px solid var(--border);
823
+ display: flex;
824
+ align-items: center;
825
+ gap: 10px;
826
+ font-weight: 700;
827
+ font-size: 13px;
828
+ text-transform: uppercase;
829
+ letter-spacing: 0.5px;
830
+ color: #a78bfa;
831
+ }}
832
+ .card-body {{ padding: 20px; }}
833
+
834
+ /* ── Sidebar ── */
835
+ .sidebar {{ display: flex; flex-direction: column; gap: 20px; }}
836
+
837
+ /* ── Select ── */
838
+ label.field-label {{
839
+ display: block;
840
+ font-size: 12px;
841
+ font-weight: 600;
842
+ color: var(--muted);
843
+ text-transform: uppercase;
844
+ letter-spacing: 0.5px;
845
+ margin-bottom: 8px;
846
+ }}
847
+ select, textarea {{
848
+ width: 100%;
849
+ background: var(--surface2);
850
+ border: 1px solid var(--border);
851
+ border-radius: 8px;
852
+ color: var(--text);
853
+ font-family: var(--sans);
854
+ font-size: 14px;
855
+ padding: 10px 14px;
856
+ outline: none;
857
+ transition: border-color 0.2s;
858
+ }}
859
+ select:focus, textarea:focus {{
860
+ border-color: var(--accent);
861
+ box-shadow: 0 0 0 3px rgba(139,92,246,0.15);
862
+ }}
863
+ select {{ cursor: pointer; appearance: none; background-image: url("data:image/svg+xml,%3Csvg xmlns='http://www.w3.org/2000/svg' width='16' height='16' fill='%236b7280' viewBox='0 0 16 16'%3E%3Cpath d='M7.247 11.14L2.451 5.658C1.885 5.013 2.345 4 3.204 4h9.592a1 1 0 0 1 .753 1.659l-4.796 5.48a1 1 0 0 1-1.506 0z'/%3E%3C/svg%3E"); background-repeat: no-repeat; background-position: right 12px center; padding-right: 36px; }}
864
+ select option {{ background: #1a1827; }}
865
+
866
+ /* ── Schema / Task Info ── */
867
+ .info-block {{
868
+ background: var(--surface2);
869
+ border: 1px solid var(--border);
870
+ border-radius: 8px;
871
+ padding: 14px;
872
+ font-family: var(--mono);
873
+ font-size: 12.5px;
874
+ color: #c4b5fd;
875
+ white-space: pre-wrap;
876
+ line-height: 1.6;
877
+ max-height: 200px;
878
+ overflow-y: auto;
879
+ }}
880
+ .task-desc {{
881
+ font-family: var(--sans);
882
+ font-size: 13.5px;
883
+ color: var(--text);
884
+ line-height: 1.6;
885
+ margin-bottom: 10px;
886
+ }}
887
+ .error-chip {{
888
+ display: inline-block;
889
+ background: rgba(239,68,68,0.1);
890
+ border: 1px solid rgba(239,68,68,0.3);
891
+ color: #fca5a5;
892
+ padding: 4px 10px;
893
+ border-radius: 6px;
894
+ font-size: 12px;
895
+ font-family: var(--mono);
896
+ margin-top: 6px;
897
+ }}
898
+ .hint-chip {{
899
+ display: inline-block;
900
+ background: rgba(245,158,11,0.1);
901
+ border: 1px solid rgba(245,158,11,0.3);
902
+ color: #fcd34d;
903
+ padding: 4px 10px;
904
+ border-radius: 6px;
905
+ font-size: 12px;
906
+ margin-top: 6px;
907
+ }}
908
+
909
+ /* ── Right panel ── */
910
+ .right-panel {{ display: flex; flex-direction: column; gap: 20px; }}
911
+
912
+ /* ── Code editors ── */
913
+ .code-label {{
914
+ display: flex;
915
+ align-items: center;
916
+ justify-content: space-between;
917
+ margin-bottom: 8px;
918
+ }}
919
+ .code-label span {{
920
+ font-size: 12px;
921
+ font-weight: 600;
922
+ color: var(--muted);
923
+ text-transform: uppercase;
924
+ letter-spacing: 0.5px;
925
+ }}
926
+ .lang-tag {{
927
+ font-size: 11px;
928
+ padding: 2px 8px;
929
+ background: rgba(139,92,246,0.12);
930
+ border: 1px solid rgba(139,92,246,0.25);
931
+ border-radius: 4px;
932
+ color: #a78bfa;
933
+ font-family: var(--mono);
934
+ }}
935
+ textarea.code {{
936
+ font-family: var(--mono);
937
+ font-size: 13.5px;
938
+ resize: vertical;
939
+ line-height: 1.6;
940
+ tab-size: 2;
941
+ min-height: 130px;
942
+ color: #e2d9f3;
943
+ }}
944
+ textarea.code.read-only {{
945
+ background: rgba(15,14,23,0.6);
946
+ border-color: rgba(239,68,68,0.25);
947
+ color: #fca5a5;
948
+ cursor: default;
949
+ }}
950
+ textarea.code.agent {{
951
+ background: rgba(16,185,129,0.04);
952
+ border-color: rgba(16,185,129,0.25);
953
+ color: #a7f3d0;
954
+ }}
955
+ textarea.code.agent:focus {{
956
+ border-color: var(--green);
957
+ box-shadow: 0 0 0 3px rgba(16,185,129,0.15);
958
+ }}
959
+
960
+ /* ── Verifier output ── */
961
+ .verifier-output {{
962
+ border-radius: 10px;
963
+ padding: 20px;
964
+ font-size: 14px;
965
+ line-height: 1.5;
966
+ border: 1px dashed rgba(255,255,255,0.1);
967
+ background: rgba(255,255,255,0.02);
968
+ color: var(--muted);
969
+ text-align: center;
970
+ transition: all 0.4s ease;
971
+ }}
972
+ .verifier-output.success {{
973
+ background: rgba(16,185,129,0.07);
974
+ border: 1px solid rgba(16,185,129,0.35);
975
+ color: #6ee7b7;
976
+ text-align: left;
977
+ }}
978
+ .verifier-output.error {{
979
+ background: rgba(239,68,68,0.07);
980
+ border: 1px solid rgba(239,68,68,0.35);
981
+ color: #fca5a5;
982
+ text-align: left;
983
+ }}
984
+ .verifier-output h3 {{ font-size: 16px; margin-bottom: 8px; }}
985
+ .reward-pill {{
986
+ display: inline-block;
987
+ padding: 4px 12px;
988
+ border-radius: 20px;
989
+ font-weight: 700;
990
+ font-size: 13px;
991
+ margin-top: 8px;
992
+ }}
993
+
994
+
995
+ .reward-positive {{ background: rgba(16,185,129,0.2); color: #34d399; }}
996
+ .reward-negative {{ background: rgba(239,68,68,0.2); color: #f87171; }}
997
+
998
+ /* ── Divider ── */
999
+ .divider {{
1000
+ height: 1px;
1001
+ background: var(--border);
1002
+ margin: 4px 0;
1003
+ }}
1004
+
1005
+ /* ── Scrollbar ── */
1006
+ ::-webkit-scrollbar {{ width: 6px; height: 6px; }}
1007
+ ::-webkit-scrollbar-track {{ background: transparent; }}
1008
+ ::-webkit-scrollbar-thumb {{ background: rgba(139,92,246,0.3); border-radius: 3px; }}
1009
+
1010
+ @media (max-width: 900px) {{
1011
+ .main {{ grid-template-columns: 1fr; }}
1012
+ .stat-bar {{ flex-wrap: wrap; gap: 16px; }}
1013
+ }}
1014
+ </style>
1015
+ </head>
1016
+ <body>
1017
+
1018
+ <!-- Nav -->
1019
+ <nav class="nav">
1020
+ <div class="nav-brand">
1021
+ 🛰️ SQL Debug Env
1022
+ <span class="badge">v1.0.0</span>
1023
+ </div>
1024
+ <div style="display:flex;gap:10px">
1025
+ <a href="/docs" target="_blank" class="btn btn-outline">📖 API Docs</a>
1026
+ </div>
1027
+ </nav>
1028
+
1029
+ <!-- Hero -->
1030
+ <section class="hero">
1031
+ <div class="hero-eyebrow">🤖 Reinforcement Learning Verifiable Environment</div>
1032
+ <h1>Advanced SQL Debugging<br>RL Environment</h1>
1033
+ <p>Agents learn to diagnose and repair broken SQL pipelines. A sandboxed DuckDB executor evaluates every submission with a dense reward signal.</p>
1034
+ <a href="/docs" target="_blank" class="btn btn-outline">📖 View Full API Documentation →</a>
1035
+ </section>
1036
+
1037
+ <!-- Stat Bar -->
1038
+ <div class="stat-bar">
1039
+ <div class="stat"><div class="stat-val">7</div><div class="stat-lbl">Challenge Tasks</div></div>
1040
+ <div class="stat"><div class="stat-val">DuckDB</div><div class="stat-lbl">Sandbox Engine</div></div>
1041
+ <div class="stat"><div class="stat-val">Live</div><div class="stat-lbl">Verifier</div></div>
1042
+ <div class="stat"><div class="stat-val">3</div><div class="stat-lbl">Advanced RLVE Tasks</div></div>
1043
+ </div>
1044
+
1045
+ <!-- Main -->
1046
+ <div class="main">
1047
+
1048
+ <!-- Sidebar -->
1049
+ <aside class="sidebar">
1050
+
1051
+ <!-- Controls -->
1052
+ <div class="card">
1053
+ <div class="card-header">⚙️ Environment Controls</div>
1054
+ <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
1055
+ <div>
1056
+ <label class="field-label">🎯 Challenge Level</label>
1057
+ <select id="task-select">
1058
+ <option value="task_1_easy">Task 1 — Easy: Syntax Fix</option>
1059
+ <option value="task_2_medium">Task 2 — Medium: GROUP BY</option>
1060
+ <option value="task_3_hard">Task 3 — Hard: Window Function</option>
1061
+ <option value="task_4_expert">Task 4 — Expert: CTE + Date</option>
1062
+ <optgroup label="─── Advanced RLVE Tasks ───">
1063
+ <option value="task_5_optimization">Task 5 — Optimization (EXPLAIN-verified)</option>
1064
+ <option value="task_6_migration">Task 6 — Schema Migration (3NF)</option>
1065
+ <option value="task_7_chaos">Task 7 — Chaos Engineering (Live DB)</option>
1066
+ </optgroup>
1067
+ </select>
1068
+ </div>
1069
+ <button class="btn btn-primary" onclick="initEnv()">🔄 Initialize Environment</button>
1070
+ </div>
1071
+ </div>
1072
+
1073
+ <!-- Task Details -->
1074
+ <div class="card">
1075
+ <div class="card-header">📋 Task Details</div>
1076
+ <div class="card-body" style="display:flex;flex-direction:column;gap:10px">
1077
+ <p class="task-desc" id="task-desc">Select a task and click Initialize.</p>
1078
+ <div class="divider"></div>
1079
+ <div>
1080
+ <div class="error-chip" id="task-error" style="display:none"></div>
1081
+ </div>
1082
+ <div>
1083
+ <div class="hint-chip" id="task-hint" style="display:none"></div>
1084
+ </div>
1085
+ </div>
1086
+ </div>
1087
+
1088
+ <!-- Environment Rewards -->
1089
+ <div class="card" id="reward-card" style="display:none; margin-bottom: 20px;">
1090
+ <div class="card-header">💸 Dense Reward Signal</div>
1091
+ <div class="card-body" style="padding: 16px 20px;" id="reward-card-body">
1092
+ </div>
1093
+ </div>
1094
+
1095
+ <!-- Schema -->
1096
+ <div class="card">
1097
+ <div class="card-header">🗄️ Database Schema</div>
1098
+ <div class="card-body">
1099
+ <div class="info-block" id="schema-dump">No schema loaded yet.</div>
1100
+ </div>
1101
+ </div>
1102
+
1103
+
1104
+ </aside>
1105
+
1106
+ <!-- Right Panel -->
1107
+ <div class="right-panel">
1108
+
1109
+ <!-- Broken Code -->
1110
+ <div class="card">
1111
+ <div class="card-header">🐞 Broken Pipeline Code</div>
1112
+ <div class="card-body">
1113
+ <div class="code-label">
1114
+ <span>Initial SQL (Failing)</span>
1115
+ <span class="lang-tag">SQL</span>
1116
+ </div>
1117
+ <textarea id="broken-code" class="code read-only" rows="5" readonly placeholder="Initialize environment to load broken SQL..."></textarea>
1118
+ </div>
1119
+ </div>
1120
+
1121
+ <!-- Agent Submission -->
1122
+ <div class="card">
1123
+ <div class="card-header">🤖 Agent Submission Sandbox</div>
1124
+ <div class="card-body" style="display:flex;flex-direction:column;gap:14px">
1125
+ <div>
1126
+ <div class="code-label">
+ <span>Agent Fix Attempt</span>
+ <span class="lang-tag">SQL — editable</span>
+ </div>
+ <textarea id="agent-input" class="code agent" rows="6" placeholder="Write or paste your fixed SQL here..."></textarea>
+ </div>
+ <button class="btn btn-green" onclick="executeStep()">▶️ Execute Fix in DuckDB Sandbox</button>
+ </div>
+ </div>
+
+ <!-- Verifier Output -->
+ <div class="card">
+ <div class="card-header">📊 Verifier Output</div>
+ <div class="card-body">
+ <div class="verifier-output" id="verifier-out">
+ Agent standing by… Load a task and submit a fix.
+ </div>
+ </div>
+ </div>
+
+ </div>
+ </div>
+
+ <script>
+ const TASKS = {TASKS_JSON};
+ let currentTaskId = null;
+
+ const ADVANCED_REWARDS = {{
+ task_5_optimization: [
+ ['Output matches baseline', '+0.50'],['No CROSS_PRODUCT in EXPLAIN', '+0.50'],
+ ['Wrong output', '-0.10'],['DuckDB error', '-0.20'],
+ ],
+ task_6_migration: [
+ ['Tables created', '+0.05'],['Data partially migrated', '+0.30'],
+ ['Full migration + DROP', '+1.00'],['Destructive early DROP', '-0.30'],['DuckDB error', '-0.20'],
+ ],
+ task_7_chaos: [
+ ['Zero dups + zero NULLs + UNIQUE index', '+1.00'],['Zero dups + zero NULLs (no index)', '+0.70'],
+ ['ETL still dirty', '-0.10'],['DuckDB error', '-0.20'],
+ ],
+ }};
+
+ function initEnv() {{
+ currentTaskId = document.getElementById('task-select').value;
+ const task = TASKS[currentTaskId];
+ const isAdvanced = !!task.duckdb_backed;
+
+ document.getElementById('broken-code').value = task.broken_sql;
+ document.getElementById('agent-input').value = task.broken_sql;
+ document.getElementById('task-desc').textContent = task.description;
+
+ const errEl = document.getElementById('task-error');
+ errEl.textContent = '⚠️ ' + task.error;
+ errEl.style.display = 'inline-block';
+
+ const hintEl = document.getElementById('task-hint');
+ hintEl.textContent = '💡 Hint: ' + task.hint;
+ hintEl.style.display = 'inline-block';
+
+ // Reward card
+ const rewardBody = document.getElementById('reward-card-body');
+ let rewardsHtml = '';
+ if (isAdvanced) {{
+ const entries = ADVANCED_REWARDS[currentTaskId] || [];
+ rewardsHtml = entries.map(([label, val]) => {{
+ const isPos = val.startsWith('+');
+ return `<div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">${{label}}</span>
+ <span style="font-family:var(--mono);color:${{isPos?'#34d399':'#f87171'}};font-weight:bold;font-size:13px;">${{val}}</span>
+ </div>`;
+ }}).join('');
+ }} else if (currentTaskId === 'task_3_hard') {{
+ rewardsHtml = `
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Correct Step Identified</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.15</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Step 2 Fixed</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.25</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Step 4 Fixed</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.20</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;">
+ <span style="font-size:13px;color:#e8e8f0">Final Totals Exact Match</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.40</span>
+ </div>`;
+ }} else {{
+ rewardsHtml = `
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Parses successfully</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.10</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Executes without error</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.20</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Column Accuracy</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.10</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:#e8e8f0">Data Accuracy</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.30</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;">
+ <span style="font-size:13px;color:#e8e8f0">Exact Match Bonus</span>
+ <span style="font-family:var(--mono);color:#34d399;font-weight:bold;font-size:13px;">+0.30</span>
+ </div>`;
+ }}
+ rewardsHtml += `
+ <div style="font-size:11px;font-weight:bold;color:var(--muted);text-transform:uppercase;margin:10px 0 6px;border-top:1px solid rgba(255,255,255,0.05);padding-top:10px;">Penalties</div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:var(--muted)">Duplicate Submission</span>
+ <span style="font-family:var(--mono);color:#f87171;font-weight:bold;font-size:13px;">-0.10</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;margin-bottom:4px;">
+ <span style="font-size:13px;color:var(--muted)">Destructive Action</span>
+ <span style="font-family:var(--mono);color:#f87171;font-weight:bold;font-size:13px;">-0.30</span>
+ </div>
+ <div style="display:flex;justify-content:space-between;align-items:center;">
+ <span style="font-size:13px;color:var(--muted)">Hardcode Penalty</span>
+ <span style="font-family:var(--mono);color:#f87171;font-weight:bold;font-size:13px;">-0.50</span>
+ </div>`;
+ rewardBody.innerHTML = rewardsHtml;
+
+ // Schema
+ let schemaStr = '';
+ for (const [table, cols] of Object.entries(task.schema_info)) {{
+ schemaStr += `TABLE ${{table}} {{\\n`;
+ cols.forEach(c => schemaStr += ` ${{c}}\\n`);
+ schemaStr += `}}\\n\\n`;
+ }}
+ document.getElementById('schema-dump').textContent = schemaStr.trim();
+ document.getElementById('reward-card').style.display = 'block';
+
+ // Call /reset on the server to seed the DuckDB environment
+ fetch('/reset', {{
+ method: 'POST',
+ headers: {{'Content-Type': 'application/json'}},
+ body: JSON.stringify({{task_id: currentTaskId}})
+ }}).then(r => r.json()).then(data => {{
+ const out = document.getElementById('verifier-out');
+ out.className = 'verifier-output';
+ const badge = data.observation.label.includes('Advanced') || data.observation.label.includes('5')
+ || data.observation.label.includes('6') || data.observation.label.includes('7')
+ ? ' <span style="background:rgba(139,92,246,0.25);border:1px solid rgba(139,92,246,0.6);color:#c4b5fd;padding:2px 8px;border-radius:12px;font-size:11px;font-weight:700;">🔬 DuckDB-Backed</span>' : '';
+ out.innerHTML = `🔄 Environment initialized.${{badge}} Awaiting agent execution…`;
+ }}).catch(() => {{
+ document.getElementById('verifier-out').innerHTML = '🔄 Environment initialized. Awaiting agent execution…';
+ }});
+ }}
+
+ async function executeStep() {{
+ const agentSQL = document.getElementById('agent-input').value.trim();
+ const out = document.getElementById('verifier-out');
+
+ if (!agentSQL) {{
+ out.className = 'verifier-output error';
+ out.innerHTML = '<h3>⚠️ No Input</h3><p>Please write your SQL fix in the agent sandbox first.</p>';
+ return;
+ }}
+ if (!currentTaskId) {{
+ out.className = 'verifier-output error';
+ out.innerHTML = '<h3>⚠️ No Task Loaded</h3><p>Click Initialize Environment first.</p>';
+ return;
+ }}
+
+ out.className = 'verifier-output';
+ out.innerHTML = '⏳ Executing in DuckDB sandbox…';
+
+ const task = TASKS[currentTaskId];
+ const isAdvanced = !!task.duckdb_backed;
+
+ if (isAdvanced) {{
+ // Real API call for DuckDB-backed tasks
+ try {{
+ const res = await fetch('/step', {{
+ method: 'POST',
+ headers: {{'Content-Type': 'application/json'}},
+ body: JSON.stringify({{action: agentSQL, explanation: ''}})
+ }});
+ const data = await res.json();
+ const reward = data.reward;
+ const done = data.done;
+ const msg = data.info?.message || '';
+ const verifier = data.info?.verifier || 'DuckDB';
+ const isPos = reward >= 0;
+ out.className = `verifier-output ${{done && reward > 0 ? 'success' : reward < 0 ? 'error' : 'success'}}`;
+ out.innerHTML = `
+ <h3>${{done && reward >= 1.0 ? '✅' : reward < 0 ? '❌' : '⚠️'}} Verifier Result</h3>
+ <p style="margin-top:6px">${{msg}}</p>
+ <p style="margin-top:8px;font-size:11px;color:var(--muted)">🔬 ${{verifier}} · Step ${{data.state?.step_count ?? '?'}}</p>
+ <span class="reward-pill ${{isPos ? 'reward-positive' : 'reward-negative'}}">Reward: ${{reward >= 0 ? '+' : ''}}${{reward.toFixed(2)}}</span>
+ `;
+ }} catch(e) {{
+ out.className = 'verifier-output error';
+ out.innerHTML = `<h3>❌ Network Error</h3><p>${{e.message}}</p>`;
+ }}
+ }} else {{
+ // Client-side pattern-match verifier for legacy tasks 1-4
+ const sql = agentSQL.toUpperCase();
+ const taskSolved = (
+ (currentTaskId === 'task_1_easy' && sql.includes(',') && sql.includes('NAME') && sql.includes('AGE')) ||
+ (currentTaskId === 'task_2_medium' && sql.includes('GROUP BY')) ||
+ (currentTaskId === 'task_3_hard' && sql.includes('PARTITION BY')) ||
+ (currentTaskId === 'task_4_expert' && !sql.includes('13-01') && sql.includes('MONTHLY_SALES'))
+ );
+ if (taskSolved) {{
+ out.className = 'verifier-output success';
+ out.innerHTML = `
+ <h3>✅ Verification Passed!</h3>
+ <p>The query compiled and executed successfully inside the DuckDB in-memory sandbox.</p>
+ <p>The pipeline produced the expected output rows without errors.</p>
+ <span class="reward-pill reward-positive">Reward: +1.0</span>
+ `;
+ }} else {{
+ out.className = 'verifier-output error';
+ out.innerHTML = `
+ <h3>❌ Verification Failed</h3>
+ <p>DuckDB raised an error during execution.</p>
+ <p style="font-family:var(--mono);font-size:12px;margin-top:6px;opacity:0.8">${{task.error}}</p>
+ <span class="reward-pill reward-negative">Reward: -0.1</span>
+ `;
+ }}
+ }}
+ }}
+ </script>
+ </body>
+ </html>""".replace("{TASKS_JSON}", TASKS_JSON)
+ return HTMLResponse(html)
+
+
+ if __name__ == "__main__":
+ import uvicorn
+ uvicorn.run(app, host="0.0.0.0", port=7860)
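The client-side fallback verifier above can be mirrored in Python for offline smoke tests. A minimal sketch, with the task ids and string checks taken from the page script; the function name is ours:

```python
def legacy_verifier(task_id: str, agent_sql: str) -> bool:
    """Mirror of the page's heuristic checks for legacy tasks 1-4 (sketch)."""
    sql = agent_sql.upper()
    if task_id == "task_1_easy":
        return "," in sql and "NAME" in sql and "AGE" in sql
    if task_id == "task_2_medium":
        return "GROUP BY" in sql
    if task_id == "task_3_hard":
        return "PARTITION BY" in sql
    if task_id == "task_4_expert":
        return "13-01" not in sql and "MONTHLY_SALES" in sql
    return False

print(legacy_verifier("task_2_medium", "SELECT dept, COUNT(*) FROM emp GROUP BY dept"))
```

Like the in-page verifier, this is a pattern match, not an execution check; only the DuckDB-backed tasks go through the `/step` endpoint.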
client.py CHANGED
"""
client.py — OpenEnv client for SQL Debug & Data Pipeline Repair.
Provides a typed, sync/async interface that mirrors the EnvClient spec.
"""

from __future__ import annotations
from typing import Optional

from models import SQLDebugAction, SQLDebugObservation, SQLDebugState

try:
    from openenv.core.env_client import EnvClient  # type: ignore
    from openenv.core.client_types import StepResult  # type: ignore

    class SQLDebugEnv(EnvClient[SQLDebugAction, SQLDebugObservation, SQLDebugState]):
        """
        Typed client for the SQL Debug environment.

        Usage (sync):
            with SQLDebugEnv(base_url="http://localhost:7860").sync() as env:
                obs = env.reset(task_id="task1_syntax_fix")
                action = SQLDebugAction(fixed_sql="SELECT ...")
                obs, reward, done, info = env.step(action)

        Usage (async):
            async with SQLDebugEnv(base_url="http://localhost:7860") as env:
                obs = await env.reset()
                result = await env.step(action)
        """

        def _step_payload(self, action: SQLDebugAction) -> dict:
            return action.model_dump()

        def _parse_result(self, payload: dict) -> StepResult:
            obs_data = payload.get("observation", {})
            return StepResult(
                observation=SQLDebugObservation(**obs_data),
                reward=payload.get("reward"),
                done=payload.get("done", False),
            )

        def _parse_state(self, payload: dict) -> SQLDebugState:
            return SQLDebugState(**payload)

except ImportError:

    import requests

    class SQLDebugEnv:  # type: ignore[no-redef]
        """
        Lightweight HTTP client (no openenv-core dependency required).

        Usage:
            env = SQLDebugEnv(base_url="http://localhost:7860")
            obs_data = env.reset(task_id="task1_syntax_fix")
            result = env.step(SQLDebugAction(fixed_sql="SELECT ..."))
        """

        def __init__(self, base_url: str = "http://localhost:7860") -> None:
            self.base_url = base_url.rstrip("/")

        def reset(
            self,
            seed: int = 42,
            task_id: Optional[str] = None,
        ) -> SQLDebugObservation:
            params: dict = {"seed": seed}
            if task_id:
                params["task_id"] = task_id
            r = requests.post(f"{self.base_url}/reset", params=params)
            r.raise_for_status()
            return SQLDebugObservation(**r.json())

        def step(
            self,
            action: SQLDebugAction,
        ) -> tuple[SQLDebugObservation, float, bool, dict]:
            r = requests.post(
                f"{self.base_url}/step",
                json=action.model_dump(),
            )
            r.raise_for_status()
            d = r.json()
            obs = SQLDebugObservation(**d["observation"])
            return obs, d["reward"], d["done"], d.get("info", {})

        def state(self) -> SQLDebugState:
            r = requests.get(f"{self.base_url}/state")
            r.raise_for_status()
            return SQLDebugState(**r.json())

        # Context manager support
        def __enter__(self):
            return self

        def __exit__(self, *args):
            pass
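The lightweight client's `step()` unpacks the server's JSON into an `(obs, reward, done, info)` tuple. A minimal sketch of that unpacking against a hand-made payload (the values below are illustrative, not real server output):

```python
# Unpack a /step response the way the lightweight client does (sketch).
payload = {
    "observation": {"task_id": "task1_syntax_fix"},
    "reward": 0.3,
    "done": False,
    "info": {"breakdown": {"parses": 0.1, "executes": 0.2}},
}

obs = payload["observation"]    # the real client wraps this in SQLDebugObservation(**...)
reward = payload["reward"]
done = payload["done"]
info = payload.get("info", {})  # info is optional and defaults to {}

print(reward, done, info["breakdown"]["executes"])
```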
deploy_hf_space.md ADDED
---
title: Deploy SQL Debug Env to HF Spaces
description: Step-by-step guide to deploy the environment and then train with GRPO
---

# Deploying the SQL Debug Environment to HF Spaces

## Step 1 — Create the HF Space

Go to https://huggingface.co/new-space and configure:

| Field | Value |
|---|---|
| Space name | `sql-debug-env` |
| SDK | **Docker** |
| Hardware | CPU Basic (free tier is fine for the env) |
| Visibility | Public (required for `openenv validate`) |

---

## Step 2 — Prepare the Repository

```powershell
# Install the HF CLI
pip install huggingface_hub

# Login
huggingface-cli login

# Clone the empty Space repo
git clone https://huggingface.co/spaces/YOUR_USERNAME/sql-debug-env
cd sql-debug-env
```

---

## Step 3 — Copy Environment Files

Copy everything from `sql_env/` into the cloned Space repo:

```powershell
# From your sql_env directory:
Copy-Item -Recurse * "C:\path\to\sql-debug-env\" -Force
```

The Space repo should look like:

```
sql-debug-env/            ← HF Space repo root
├── README.md             ← HF Space card (already has ---yaml--- header)
├── models.py
├── client.py
├── openenv.yaml
└── server/
    ├── Dockerfile        ← HF Spaces uses this automatically
    ├── app.py
    ├── environment.py
    ├── data.py
    ├── graders.py
    ├── rewards.py
    └── requirements.txt
```

> **Important:** HF Spaces looks for a `Dockerfile` at the repo root OR inside `server/`.
> Our Dockerfile is at `server/Dockerfile`. HF will find it automatically.
> The Dockerfile exposes **port 7860** — this is required by HF Spaces.

---

## Step 4 — Push & Deploy

```powershell
cd sql-debug-env
git add .
git commit -m "Initial SQL Debug OpenEnv environment"
git push
```

HF Spaces will automatically:
1. Detect the Dockerfile
2. Build the Docker image
3. Start the server on port 7860
4. Make it available at `https://YOUR_USERNAME-sql-debug-env.hf.space`

---

## Step 5 — Verify the Deployment

```powershell
$SPACE_URL = "https://YOUR_USERNAME-sql-debug-env.hf.space"

# Health check
Invoke-WebRequest "$SPACE_URL/health" | Select-Object -Expand Content

# List tasks
Invoke-WebRequest "$SPACE_URL/tasks" | Select-Object -Expand Content

# Interactive docs
Start-Process "$SPACE_URL/docs"
```

---

## Step 6 — Run Training Against the HF Space

```powershell
# Point training at the deployed Space
$env:ENV_URL = "https://YOUR_USERNAME-sql-debug-env.hf.space"
$env:USE_LOCAL_ENV = "false"   # use HTTP client

# Optional: push the trained model automatically
$env:PUSH_TO_HUB = "true"
$env:HF_REPO_ID = "YOUR_USERNAME/sql-debug-qwen-grpo"

python train_grpo.py --mode train --n-repeats 50
```

Or for faster local training (no network overhead):

```powershell
# Local env (default) — start server first
Start-Job { uvicorn server.app:app --host 0.0.0.0 --port 7860 }
$env:USE_LOCAL_ENV = "true"
python train_grpo.py --mode both --n-repeats 50
```

---

## Hardware Requirements for Training

| GPU | Batch Size | num_generations | use_vllm | ETA (3 epochs) |
|---|---|---|---|---|
| A100 40GB | 1 | 8 | True | ~2h |
| A100 40GB | 1 | 4 | False | ~4h |
| RTX 4090 24GB | 1 | 2 | False | ~6h |
| V100 16GB | 1 | 2 | False | OOM risk — use 4-bit |

For 4-bit quantization on smaller GPUs, add to `get_grpo_config()`:
```python
from transformers import BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16)
# Pass to GRPOTrainer via model_init_kwargs
```

---

## Quick Colab Setup

```python
# In Google Colab (A100 runtime)
!pip install trl transformers torch duckdb pandas pydantic fastapi uvicorn requests
!git clone https://huggingface.co/spaces/YOUR_USERNAME/sql-debug-env sql_env
%cd sql_env

import subprocess, threading
server = threading.Thread(
    target=lambda: subprocess.run(
        ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "7860"]
    ),
    daemon=True
)
server.start()

import time; time.sleep(3)  # wait for server

# Now run training
!python train_grpo.py --mode both --n-repeats 30
```
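The environment-variable switches used in Step 6 can be collected in Python roughly like this. A sketch: the variable names come from the steps above, while the helper name and parsing defaults are ours:

```python
import os

def read_training_config(environ=os.environ) -> dict:
    """Collect the env-var switches from Step 6 with permissive defaults (sketch)."""
    return {
        "env_url": environ.get("ENV_URL", "http://localhost:7860"),
        "use_local_env": environ.get("USE_LOCAL_ENV", "true").lower() == "true",
        "push_to_hub": environ.get("PUSH_TO_HUB", "false").lower() == "true",
        "hf_repo_id": environ.get("HF_REPO_ID", ""),
    }

cfg = read_training_config({"ENV_URL": "https://x.hf.space", "USE_LOCAL_ENV": "false"})
print(cfg["use_local_env"], cfg["env_url"])
```

Passing a plain dict instead of `os.environ` keeps the helper easy to test without mutating the process environment.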
inference.py CHANGED
@@ -1,294 +1,294 @@
"""
inference.py — inference script for SQL Debug & Data Pipeline Repair.

Runs a model (default: gpt-4o-mini) against all 3 tasks using the OpenAI
client API. Reads credentials from environment variables. Produces a
reproducible JSON report with per-task scores.

Usage:
    # Set credentials
    $env:OPENAI_API_KEY = "sk-..."
    # Optional: use a different base URL (e.g. local vLLM)
    $env:OPENAI_BASE_URL = "https://generativelanguage.googleapis.com/v1beta/openai/"

    python inference.py
    python inference.py --task task1_syntax_fix
    python inference.py --model gpt-4o --output results.json
"""

from __future__ import annotations
import argparse
import json
import os
import re
import sys
import time
from pathlib import Path
from typing import Optional

from openai import OpenAI

# Make server package importable
sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

from models import SQLDebugAction, SQLDebugObservation
from server.environment import SQLDebugEnvironment
from server.data import TASKS


# ---------------------------------------------------------------------------
# Prompt builder
# ---------------------------------------------------------------------------

def _build_prompt(obs: SQLDebugObservation) -> str:
    """Convert an observation into a model prompt."""
    schema_lines = []
    for table, cols in obs.schema_info.items():
        col_defs = ", ".join(f"{c['column']} {c['type']}" for c in cols)
        schema_lines.append(f" {table}({col_defs})")
    schema_str = "\n".join(schema_lines)

    if obs.task_id == "task3_etl_timezone":
        code_section = f"""
## Broken ETL Pipeline Code
```python
{obs.pipeline_code}
```

## Intermediate Outputs (from the BROKEN pipeline)
{json.dumps(obs.intermediate_outputs, indent=2, default=str) if obs.intermediate_outputs else 'Not available'}
"""
        instruction = (
            "Return the COMPLETE corrected Python pipeline code inside a ```python ... ``` block. "
            "Also provide a brief explanation of the root cause (which step is buggy and why) "
            "in a section labelled 'Explanation:'."
        )
    else:
        code_section = f"""
## Broken SQL Query
```sql
{obs.broken_sql}
```
"""
        instruction = (
            "Return ONLY the corrected SQL query inside a ```sql ... ``` block. "
            "Do not include any explanation outside the code block."
        )

    history_section = ""
    if obs.previous_attempts:
        lines = []
        for a in obs.previous_attempts:
            lines.append(f" Step {a.step}: reward={a.reward:.2f} SQL: {a.fixed_sql[:120]}...")
        history_section = "\n## Previous Attempts\n" + "\n".join(lines)

    return f"""You are an expert SQL and data engineering debugger.

## Task ({obs.difficulty.upper()})
{obs.task_description}

## Database Schema
{schema_str}
{code_section}{history_section}

## Instructions
{instruction}
"""


# ---------------------------------------------------------------------------
# Response parser
# ---------------------------------------------------------------------------

def _extract_sql(text: str, is_pipeline: bool = False) -> str:
    """Extract SQL or Python code from model response."""
    # Try fenced code block first
    lang = "python" if is_pipeline else "sql"
    patterns = [
        rf"```{lang}\s*\n(.*?)```",
        r"```\s*\n(.*?)```",
        r"```(.*?)```",
    ]
    for pattern in patterns:
        m = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
        if m:
            return m.group(1).strip()
    # Fallback: return the whole response
    return text.strip()


def _extract_explanation(text: str) -> Optional[str]:
    """Extract explanation section from Task 3 response."""
    m = re.search(r"explanation[:\s]+(.*?)(?:```|$)", text, re.DOTALL | re.IGNORECASE)
    if m:
        return m.group(1).strip()
    return None


# ---------------------------------------------------------------------------
# Main baseline loop
# ---------------------------------------------------------------------------

def run_baseline(
    model: str = "gpt-4o-mini",
    task_filter: Optional[str] = None,
    output_path: str = "outputs/baseline_results.json",
    max_steps: int = 3,
    seed: int = 42,
) -> dict:
    """
    Run the baseline agent against all (or one) task(s).
    Returns a results dict with per-task scores.
    """
    api_key = os.environ.get("OPENAI_API_KEY", "")
    if not api_key:
        print("WARNING: OPENAI_API_KEY not set. Set it before running baseline.")

    base_url = os.environ.get("OPENAI_BASE_URL", None)
    client = OpenAI(api_key=api_key, base_url=base_url)

    env = SQLDebugEnvironment()
    results = {
        "model": model,
        "seed": seed,
        "tasks": {},
    }

    target_tasks = [t for t in TASKS if (task_filter is None or t.task_id == task_filter)]

    for task_spec in target_tasks:
        print(f"\n{'='*60}")
        print(f"Task: {task_spec.task_id} ({task_spec.difficulty})")
        print(f"{'='*60}")

        task_result = {
            "task_id": task_spec.task_id,
            "difficulty": task_spec.difficulty,
            "steps": [],
            "best_reward": 0.0,
            "final_reward": 0.0,
            "done": False,
        }

        obs: SQLDebugObservation = env.reset(seed=seed, task_id=task_spec.task_id)
        done = False
        best_reward = 0.0

        for step_num in range(1, max_steps + 1):
            if done:
                break

            prompt = _build_prompt(obs)
            print(f"\n Step {step_num}: calling {model}...")

            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[
                        {
                            "role": "system",
                            "content": (
                                "You are an expert SQL debugger. Follow instructions exactly. "
                                "Return only what is asked for — no extra commentary."
                            ),
                        },
                        {"role": "user", "content": prompt},
                    ],
                    temperature=0.0,
                    max_tokens=2048,
                )
                raw_text = response.choices[0].message.content or ""
            except Exception as e:
                print(f" API error: {e}")
                raw_text = ""

            is_pipeline = (task_spec.task_id == "task3_etl_timezone")
            fixed_sql = _extract_sql(raw_text, is_pipeline=is_pipeline)
            explanation = _extract_explanation(raw_text) if is_pipeline else None

            action = SQLDebugAction(fixed_sql=fixed_sql, explanation=explanation)
            obs, reward, done, info = env.step(action)

            best_reward = max(best_reward, reward)
            print(f" Reward: {reward:.4f} Done: {done}")
            print(f" Breakdown: {info.get('breakdown', {})}")

            task_result["steps"].append({
                "step": step_num,
                "reward": reward,
                "done": done,
                "breakdown": info.get("breakdown", {}),
                "penalties": info.get("penalties", {}),
                "fixed_sql_preview": fixed_sql[:200],
            })

            time.sleep(0.5)  # rate limiting

        task_result["best_reward"] = round(best_reward, 4)
        task_result["final_reward"] = round(obs.reward or 0.0, 4)
        task_result["done"] = done
        results["tasks"][task_spec.task_id] = task_result

        print(f"\n >>> Best reward for {task_spec.task_id}: {best_reward:.4f}")

    # Summary
    print(f"\n{'='*60}")
    print("BASELINE SUMMARY")
    print(f"{'='*60}")
    for tid, tr in results["tasks"].items():
        print(f" {tid:40s} best={tr['best_reward']:.4f} ({tr['difficulty']})")

    # Write output
    out_path = Path(output_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(json.dumps(results, indent=2))
    print(f"\nResults written to {out_path}")

    return results


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Baseline inference for SQL Debug & Data Pipeline Repair OpenEnv"
    )
    parser.add_argument(
        "--model",
        default="gpt-4o-mini",
        help="OpenAI model to use (default: gpt-4o-mini)",
    )
    parser.add_argument(
        "--task",
        default=None,
        choices=["task1_syntax_fix", "task2_join_aggregation", "task3_etl_timezone"],
        help="Run a single task (default: all tasks)",
    )
    parser.add_argument(
        "--output",
        default="outputs/baseline_results.json",
        help="Path to write JSON results",
    )
    parser.add_argument(
        "--max-steps",
        type=int,
        default=3,
        help="Max steps per episode (default: 3)",
    )
    parser.add_argument(
        "--seed",
        type=int,
        default=42,
        help="Random seed (default: 42)",
    )

    args = parser.parse_args()
    run_baseline(
        model=args.model,
        task_filter=args.task,
        output_path=args.output,
        max_steps=args.max_steps,
        seed=args.seed,
    )
107
+ patterns = [
108
+ rf"```{lang}\s*\n(.*?)```",
109
+ r"```\s*\n(.*?)```",
110
+ r"```(.*?)```",
111
+ ]
112
+ for pattern in patterns:
113
+ m = re.search(pattern, text, re.DOTALL | re.IGNORECASE)
114
+ if m:
115
+ return m.group(1).strip()
116
+ # Fallback: return the whole response
117
+ return text.strip()
118
+
119
+
120
+ def _extract_explanation(text: str) -> Optional[str]:
121
+ """Extract explanation section from Task 3 response."""
122
+ m = re.search(r"explanation[:\s]+(.*?)(?:```|$)", text, re.DOTALL | re.IGNORECASE)
123
+ if m:
124
+ return m.group(1).strip()
125
+ return None
126
+
127
+
128
+ # ---------------------------------------------------------------------------
129
+ # Main baseline loop
130
+ # ---------------------------------------------------------------------------
131
+
132
+ def run_baseline(
133
+ model: str = "gpt-4o-mini",
134
+ task_filter: Optional[str] = None,
135
+ output_path: str = "outputs/baseline_results.json",
136
+ max_steps: int = 3,
137
+ seed: int = 42,
138
+ ) -> dict:
139
+ """
140
+ Run the baseline agent against all (or one) task(s).
141
+ Returns a results dict with per-task scores.
142
+ """
143
+ api_key = os.environ.get("OPENAI_API_KEY", "")
144
+ if not api_key:
145
+ print("WARNING: OPENAI_API_KEY not set. Set it before running baseline.")
146
+
147
+ base_url = os.environ.get("OPENAI_BASE_URL", None)
148
+ client = OpenAI(api_key=api_key, base_url=base_url)
149
+
150
+ env = SQLDebugEnvironment()
151
+ results = {
152
+ "model": model,
153
+ "seed": seed,
154
+ "tasks": {},
155
+ }
156
+
157
+ target_tasks = [t for t in TASKS if (task_filter is None or t.task_id == task_filter)]
158
+
159
+ for task_spec in target_tasks:
160
+ print(f"\n{'='*60}")
161
+ print(f"Task: {task_spec.task_id} ({task_spec.difficulty})")
162
+ print(f"{'='*60}")
163
+
164
+ task_result = {
165
+ "task_id": task_spec.task_id,
166
+ "difficulty": task_spec.difficulty,
167
+ "steps": [],
168
+ "best_reward": 0.0,
169
+ "final_reward": 0.0,
170
+ "done": False,
171
+ }
172
+
173
+ obs: SQLDebugObservation = env.reset(seed=seed, task_id=task_spec.task_id)
174
+ done = False
175
+ best_reward = 0.0
176
+
177
+ for step_num in range(1, max_steps + 1):
178
+ if done:
179
+ break
180
+
181
+ prompt = _build_prompt(obs)
182
+ print(f"\n Step {step_num}: calling {model}...")
183
+
184
+ try:
185
+ response = client.chat.completions.create(
186
+ model=model,
187
+ messages=[
188
+ {
189
+ "role": "system",
190
+ "content": (
191
+ "You are an expert SQL debugger. Follow instructions exactly. "
192
+ "Return only what is asked for — no extra commentary."
193
+ ),
194
+ },
195
+ {"role": "user", "content": prompt},
196
+ ],
197
+ temperature=0.0,
198
+ max_tokens=2048,
199
+ )
200
+ raw_text = response.choices[0].message.content or ""
201
+ except Exception as e:
202
+ print(f" API error: {e}")
203
+ raw_text = ""
204
+
205
+ is_pipeline = (task_spec.task_id == "task3_etl_timezone")
206
+ fixed_sql = _extract_sql(raw_text, is_pipeline=is_pipeline)
207
+ explanation = _extract_explanation(raw_text) if is_pipeline else None
208
+
209
+ action = SQLDebugAction(fixed_sql=fixed_sql, explanation=explanation)
210
+ obs, reward, done, info = env.step(action)
211
+
212
+ best_reward = max(best_reward, reward)
213
+ print(f" Reward: {reward:.4f} Done: {done}")
214
+ print(f" Breakdown: {info.get('breakdown', {})}")
215
+
216
+ task_result["steps"].append({
217
+ "step": step_num,
218
+ "reward": reward,
219
+ "done": done,
220
+ "breakdown": info.get("breakdown", {}),
221
+ "penalties": info.get("penalties", {}),
222
+ "fixed_sql_preview": fixed_sql[:200],
223
+ })
224
+
225
+ time.sleep(0.5) # rate limiting
226
+
227
+ task_result["best_reward"] = round(best_reward, 4)
228
+ task_result["final_reward"] = round(obs.reward or 0.0, 4)
229
+ task_result["done"] = done
230
+ results["tasks"][task_spec.task_id] = task_result
231
+
232
+ print(f"\n >>> Best reward for {task_spec.task_id}: {best_reward:.4f}")
233
+
234
+ # Summary
235
+ print(f"\n{'='*60}")
236
+ print("BASELINE SUMMARY")
237
+ print(f"{'='*60}")
238
+ for tid, tr in results["tasks"].items():
239
+ print(f" {tid:40s} best={tr['best_reward']:.4f} ({tr['difficulty']})")
240
+
241
+ # Write output
242
+ out_path = Path(output_path)
243
+ out_path.parent.mkdir(parents=True, exist_ok=True)
244
+ out_path.write_text(json.dumps(results, indent=2))
245
+ print(f"\nResults written to {out_path}")
246
+
247
+ return results
248
+
249
+
250
+ # ---------------------------------------------------------------------------
251
+ # CLI
252
+ # ---------------------------------------------------------------------------
253
+
254
+ if __name__ == "__main__":
255
+ parser = argparse.ArgumentParser(
256
+ description="Baseline inference for SQL Debug & Data Pipeline Repair OpenEnv"
257
+ )
258
+ parser.add_argument(
259
+ "--model",
260
+ default="gpt-4o-mini",
261
+ help="OpenAI model to use (default: gpt-4o-mini)",
262
+ )
263
+ parser.add_argument(
264
+ "--task",
265
+ default=None,
266
+ choices=["task1_syntax_fix", "task2_join_aggregation", "task3_etl_timezone"],
267
+ help="Run a single task (default: all tasks)",
268
+ )
269
+ parser.add_argument(
270
+ "--output",
271
+ default="outputs/baseline_results.json",
272
+ help="Path to write JSON results",
273
+ )
274
+ parser.add_argument(
275
+ "--max-steps",
276
+ type=int,
277
+ default=3,
278
+ help="Max steps per episode (default: 3)",
279
+ )
280
+ parser.add_argument(
281
+ "--seed",
282
+ type=int,
283
+ default=42,
284
+ help="Random seed (default: 42)",
285
+ )
286
+
287
+ args = parser.parse_args()
288
+ run_baseline(
289
+ model=args.model,
290
+ task_filter=args.task,
291
+ output_path=args.output,
292
+ max_steps=args.max_steps,
293
+ seed=args.seed,
294
+ )
models.py CHANGED
@@ -1,130 +1,130 @@
- """
- models.py — SQL Debug & Data Pipeline Repair OpenEnv
- Typed Pydantic models for Observation, Action, and State.
- """
-
- from __future__ import annotations
- from typing import Any, Dict, List, Optional
-
- from pydantic import BaseModel, Field
-
-
- # ---------------------------------------------------------------------------
- # Base stubs (mirrors openenv-core base classes so this module is importable
- # without openenv-core installed, while still being fully compatible when it
- # is installed).
- # ---------------------------------------------------------------------------
-
- try:
-     from openenv.core.env_server import Action, Observation, State  # type: ignore
- except ImportError:
-     class _Base(BaseModel):
-         pass
-     Action = _Base  # type: ignore[misc,assignment]
-     Observation = _Base  # type: ignore[misc,assignment]
-     State = _Base  # type: ignore[misc,assignment]
-
-
- # ---------------------------------------------------------------------------
- # Observation
- # ---------------------------------------------------------------------------
-
- class PreviousAttempt(BaseModel):
-     """Log of a single previous attempt by the agent."""
-     step: int
-     fixed_sql: str
-     reward: float
-     info: Dict[str, Any] = Field(default_factory=dict)
-
-
- class SQLDebugObservation(Observation):
-     """
-     What the agent sees at each step.
-
-     For Tasks 1 & 2 the key field is `broken_sql`.
-     For Task 3 the key field is `pipeline_code`; `intermediate_outputs`
-     contains the (wrong) intermediate DataFrames serialised as list-of-dicts.
-     """
-
-     # ── Episode metadata ────────────────────────────────────────────────────
-     task_id: str = Field(description="Which task this episode runs (task1/task2/task3)")
-     task_description: str = Field(description="Natural-language goal the agent must achieve")
-     difficulty: str = Field(description="easy | medium | hard")
-
-     # ── Problem payload ─────────────────────────────────────────────────────
-     broken_sql: Optional[str] = Field(
-         default=None,
-         description="Broken SQL string — present for Tasks 1 & 2",
-     )
-     pipeline_code: Optional[str] = Field(
-         default=None,
-         description="4-step ETL pipeline Python string — present for Task 3",
-     )
-     intermediate_outputs: Optional[List[Dict[str, Any]]] = Field(
-         default=None,
-         description="Wrong intermediate outputs from each pipeline step (Task 3)",
-     )
-
-     # ── Schema context ───────────────────────────────────────────────────────
-     schema_info: Dict[str, List[Dict[str, str]]] = Field(
-         description="Table name → list of {column, type} dicts"
-     )
-
-     # ── Progress ─────────────────────────────────────────────────────────────
-     step_number: int = Field(default=0, description="Current attempt number (0-indexed)")
-     max_steps: int = Field(default=5, description="Maximum attempts allowed")
-     previous_attempts: List[PreviousAttempt] = Field(default_factory=list)
-
-     # ── OpenEnv required fields ──────────────────────────────────────────────
-     done: bool = Field(default=False)
-     reward: Optional[float] = Field(default=None)
-
-
- # ---------------------------------------------------------------------------
- # Action
- # ---------------------------------------------------------------------------
-
- class SQLDebugAction(Action):
-     """
-     What the agent submits each step.
-
-     `fixed_sql` is required for all tasks.
-     For Task 3, `fixed_sql` should contain the COMPLETE corrected pipeline
-     Python code (not just a patch).
-     `explanation` is optional but scored separately for Task 3's root-cause
-     component (+0.15 if it correctly names Step 2 as the bug location).
-     """
-
-     fixed_sql: str = Field(
-         description=(
-             "Corrected SQL string (Tasks 1 & 2) or corrected full "
-             "pipeline Python code string (Task 3)"
-         )
-     )
-     explanation: Optional[str] = Field(
-         default=None,
-         description=(
-             "Optional natural-language explanation of the root cause. "
-             "Scored for Task 3 root-cause identification (+0.15)."
-         ),
-     )
-
-
- # ---------------------------------------------------------------------------
- # State
- # ---------------------------------------------------------------------------
-
- class SQLDebugState(State):
-     """
-     Full internal state — used by state() and by the baseline script for
-     logging; also inspected by openenv validate.
-     """
-
-     task_id: str = Field(default="")
-     seed: int = Field(default=42)
-     step_count: int = Field(default=0)
-     max_steps: int = Field(default=5)
-     episode_id: Optional[str] = Field(default=None)
-     current_score: float = Field(default=0.0, description="Best score seen so far this episode")
-     reward_history: List[float] = Field(default_factory=list)
-     done: bool = Field(default=False)

+ """
+ models.py — SQL Debug & Data Pipeline Repair OpenEnv
+ Typed Pydantic models for Observation, Action, and State.
+ """
+
+ from __future__ import annotations
+ from typing import Any, Dict, List, Optional
+
+ from pydantic import BaseModel, Field
+
+
+ # ---------------------------------------------------------------------------
+ # Base stubs (mirrors openenv-core base classes so this module is importable
+ # without openenv-core installed, while still being fully compatible when it
+ # is installed).
+ # ---------------------------------------------------------------------------
+
+ try:
+     from openenv.core.env_server import Action, Observation, State  # type: ignore
+ except ImportError:
+     class _Base(BaseModel):
+         pass
+     Action = _Base  # type: ignore[misc,assignment]
+     Observation = _Base  # type: ignore[misc,assignment]
+     State = _Base  # type: ignore[misc,assignment]
+
+
+ # ---------------------------------------------------------------------------
+ # Observation
+ # ---------------------------------------------------------------------------
+
+ class PreviousAttempt(BaseModel):
+     """Log of a single previous attempt by the agent."""
+     step: int
+     fixed_sql: str
+     reward: float
+     info: Dict[str, Any] = Field(default_factory=dict)
+
+
+ class SQLDebugObservation(Observation):
+     """
+     What the agent sees at each step.
+
+     For Tasks 1 & 2 the key field is `broken_sql`.
+     For Task 3 the key field is `pipeline_code`; `intermediate_outputs`
+     contains the (wrong) intermediate DataFrames serialised as list-of-dicts.
+     """
+
+     # ── Episode metadata ────────────────────────────────────────────────────
+     task_id: str = Field(description="Which task this episode runs (task1/task2/task3)")
+     task_description: str = Field(description="Natural-language goal the agent must achieve")
+     difficulty: str = Field(description="easy | medium | hard")
+
+     # ── Problem payload ─────────────────────────────────────────────────────
+     broken_sql: Optional[str] = Field(
+         default=None,
+         description="Broken SQL string — present for Tasks 1 & 2",
+     )
+     pipeline_code: Optional[str] = Field(
+         default=None,
+         description="4-step ETL pipeline Python string — present for Task 3",
+     )
+     intermediate_outputs: Optional[List[Dict[str, Any]]] = Field(
+         default=None,
+         description="Wrong intermediate outputs from each pipeline step (Task 3)",
+     )
+
+     # ── Schema context ───────────────────────────────────────────────────────
+     schema_info: Dict[str, List[Dict[str, str]]] = Field(
+         description="Table name → list of {column, type} dicts"
+     )
+
+     # ── Progress ─────────────────────────────────────────────────────────────
+     step_number: int = Field(default=0, description="Current attempt number (0-indexed)")
+     max_steps: int = Field(default=5, description="Maximum attempts allowed")
+     previous_attempts: List[PreviousAttempt] = Field(default_factory=list)
+
+     # ── OpenEnv required fields ──────────────────────────────────────────────
+     done: bool = Field(default=False)
+     reward: Optional[float] = Field(default=None)
+
+
+ # ---------------------------------------------------------------------------
+ # Action
+ # ---------------------------------------------------------------------------
+
+ class SQLDebugAction(Action):
+     """
+     What the agent submits each step.
+
+     `fixed_sql` is required for all tasks.
+     For Task 3, `fixed_sql` should contain the COMPLETE corrected pipeline
+     Python code (not just a patch).
+     `explanation` is optional but scored separately for Task 3's root-cause
+     component (+0.15 if it correctly names Step 2 as the bug location).
+     """
+
+     fixed_sql: str = Field(
+         description=(
+             "Corrected SQL string (Tasks 1 & 2) or corrected full "
+             "pipeline Python code string (Task 3)"
+         )
+     )
+     explanation: Optional[str] = Field(
+         default=None,
+         description=(
+             "Optional natural-language explanation of the root cause. "
+             "Scored for Task 3 root-cause identification (+0.15)."
+         ),
+     )
+
+
+ # ---------------------------------------------------------------------------
+ # State
+ # ---------------------------------------------------------------------------
+
+ class SQLDebugState(State):
+     """
+     Full internal state — used by state() and by the baseline script for
+     logging; also inspected by openenv validate.
+     """
+
+     task_id: str = Field(default="")
+     seed: int = Field(default=42)
+     step_count: int = Field(default=0)
+     max_steps: int = Field(default=5)
+     episode_id: Optional[str] = Field(default=None)
+     current_score: float = Field(default=0.0, description="Best score seen so far this episode")
+     reward_history: List[float] = Field(default_factory=list)
+     done: bool = Field(default=False)
openenv.yaml CHANGED
@@ -1,95 +1,95 @@
- name: sql-debug-env
- version: 1.0.0
- description: >
-   SQL Debug & Data Pipeline Repair — an OpenEnv environment where an AI agent
-   diagnoses and fixes broken SQL queries and ETL pipelines executed against a
-   live DuckDB instance. Four tasks ranging from easy (syntax fix) to expert
-   (Window Functions). Features continuous dense reward shaping (Jaccard similarity)
-   and AST-based anti-cheating penalties.
-
- author: sql-debug-env
- tags:
-   - openenv
-   - sql
-   - data-engineering
-   - debugging
-   - rl
- entrypoint: uvicorn app:app --host 0.0.0.0 --port 7860
- tasks:
-   - id: task1_syntax_fix
-     difficulty: easy
-     max_steps: 5
-     description: >
-       Fix a SQL query with a missing comma (syntax error) and a wrong table
-       alias in the WHERE clause. Three tables: orders, customers, products.
-     baseline_score: 1.0
-
-   - id: task2_join_aggregation
-     difficulty: medium
-     max_steps: 5
-     description: >
-       Fix a GROUP BY aggregation query that uses INNER JOINs, silently
-       dropping NULL-keyed rows and producing wrong revenue totals.
-     baseline_score: 1.0
-
-   - id: task3_etl_timezone
-     difficulty: hard
-     max_steps: 5
-     description: >
-       Trace and fix a 4-step ETL pipeline where Step 2 casts VARCHAR
-       timestamps with timezone offsets to DATE using implicit coercion,
-       stripping the offset. Fix must use TIMESTAMPTZ + AT TIME ZONE.
-     baseline_score: 0.40
-
-   - id: task4_expert_window
-     difficulty: expert
-     max_steps: 5
-     description: >
-       Calculate a 3-day rolling average of transaction amounts per user.
-       Requires advanced window function mechanics (OVER PARTITION BY... ROWS BETWEEN).
-     baseline_score: 1.0
-
- observation_schema:
-   task_id: string
-   task_description: string
-   difficulty: "easy | medium | hard | expert"
-   broken_sql: "string | null  # null for Task 3"
-   pipeline_code: "string | null  # non-null for Task 3"
-   intermediate_outputs: "list | null  # wrong step outputs for Task 3"
-   schema_info: "dict[table_name, list[{column, type}]]"
-   step_number: integer
-   max_steps: integer
-   previous_attempts: "list[{step, fixed_sql, reward, info}]"
-   done: boolean
-   reward: "float | null"
-
- action_schema:
-   fixed_sql: string  # corrected SQL or full corrected pipeline code (Task 3)
-   explanation: "string | null  # root-cause explanation, scored for Task 3"
-
- reward_decomposition:
-   tasks_1_2_and_4:
-     parses: +0.10
-     executes: +0.20
-     column_accuracy: +0.10
-     data_accuracy: +0.30
-     exact_match_bonus: +0.30
-   task_3:
-     correct_step_identified: +0.15
-     step2_fixed: +0.25
-     step4_fixed: +0.20
-     final_totals_exact: +0.40
-
- penalties:
-   duplicate_submission: -0.10
-   efficiency_penalty: -0.20
-   destructive_action: -0.30
-   hardcode_penalty: -0.50
-
- endpoints:
-   health: GET /health
-   reset: POST /reset
-   step: POST /step
-   state: GET /state
-   tasks: GET /tasks
    docs: GET /docs

+ name: sql-debug-env
+ version: 1.0.0
+ description: >
+   SQL Debug & Data Pipeline Repair — an OpenEnv environment where an AI agent
+   diagnoses and fixes broken SQL queries and ETL pipelines executed against a
+   live DuckDB instance. Four tasks ranging from easy (syntax fix) to expert
+   (Window Functions). Features continuous dense reward shaping (Jaccard similarity)
+   and AST-based anti-cheating penalties.
+
+ author: sql-debug-env
+ tags:
+   - openenv
+   - sql
+   - data-engineering
+   - debugging
+   - rl
+ entrypoint: uvicorn app:app --host 0.0.0.0 --port 7860
+ tasks:
+   - id: task1_syntax_fix
+     difficulty: easy
+     max_steps: 5
+     description: >
+       Fix a SQL query with a missing comma (syntax error) and a wrong table
+       alias in the WHERE clause. Three tables: orders, customers, products.
+     baseline_score: 1.0
+
+   - id: task2_join_aggregation
+     difficulty: medium
+     max_steps: 5
+     description: >
+       Fix a GROUP BY aggregation query that uses INNER JOINs, silently
+       dropping NULL-keyed rows and producing wrong revenue totals.
+     baseline_score: 1.0
+
+   - id: task3_etl_timezone
+     difficulty: hard
+     max_steps: 5
+     description: >
+       Trace and fix a 4-step ETL pipeline where Step 2 casts VARCHAR
+       timestamps with timezone offsets to DATE using implicit coercion,
+       stripping the offset. Fix must use TIMESTAMPTZ + AT TIME ZONE.
+     baseline_score: 0.40
+
+   - id: task4_expert_window
+     difficulty: expert
+     max_steps: 5
+     description: >
+       Calculate a 3-day rolling average of transaction amounts per user.
+       Requires advanced window function mechanics (OVER PARTITION BY... ROWS BETWEEN).
+     baseline_score: 1.0
+
+ observation_schema:
+   task_id: string
+   task_description: string
+   difficulty: "easy | medium | hard | expert"
+   broken_sql: "string | null  # null for Task 3"
+   pipeline_code: "string | null  # non-null for Task 3"
+   intermediate_outputs: "list | null  # wrong step outputs for Task 3"
+   schema_info: "dict[table_name, list[{column, type}]]"
+   step_number: integer
+   max_steps: integer
+   previous_attempts: "list[{step, fixed_sql, reward, info}]"
+   done: boolean
+   reward: "float | null"
+
+ action_schema:
+   fixed_sql: string  # corrected SQL or full corrected pipeline code (Task 3)
+   explanation: "string | null  # root-cause explanation, scored for Task 3"
+
+ reward_decomposition:
+   tasks_1_2_and_4:
+     parses: +0.10
+     executes: +0.20
+     column_accuracy: +0.10
+     data_accuracy: +0.30
+     exact_match_bonus: +0.30
+   task_3:
+     correct_step_identified: +0.15
+     step2_fixed: +0.25
+     step4_fixed: +0.20
+     final_totals_exact: +0.40
+
+ penalties:
+   duplicate_submission: -0.10
+   efficiency_penalty: -0.20
+   destructive_action: -0.30
+   hardcode_penalty: -0.50
+
+ endpoints:
+   health: GET /health
+   reset: POST /reset
+   step: POST /step
+   state: GET /state
+   tasks: GET /tasks
    docs: GET /docs
pyproject.toml CHANGED
@@ -1,40 +1,40 @@
- [build-system]
- requires = ["setuptools>=68", "wheel"]
- build-backend = "setuptools.build_meta"
-
- [project]
- name = "sql-debug-env"
- version = "1.0.0"
- description = "SQL Debug & Data Pipeline Repair — OpenEnv environment with Four tasks"
- readme = "README.md"
- requires-python = ">=3.10"
- license = { text = "Apache-2.0" }
- keywords = ["openenv", "reinforcement-learning", "sql", "duckdb", "data-engineering"]
-
- dependencies = [
-     "duckdb>=0.10.0",
-     "pandas>=2.0.0",
-     "fastapi>=0.111.0",
-     "uvicorn[standard]>=0.29.0",
-     "pydantic>=2.0.0",
-     "requests>=2.31.0",
-     "openai>=1.30.0",
-     "pyyaml>=6.0",
-     "openenv-core>=0.2.0",
- ]
-
- [project.scripts]
- server = "server.app:main"
-
- [project.optional-dependencies]
- openenv = [
-     "openenv-core>=0.1.0",
- ]
- dev = [
-     "pytest>=8.0",
-     "httpx>=0.27.0",
- ]
-
- [tool.setuptools.packages.find]
- where = ["."]
  include = ["sql_env*", "server*"]

+ [build-system]
+ requires = ["setuptools>=68", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "sql-debug-env"
+ version = "1.0.0"
+ description = "SQL Debug & Data Pipeline Repair — OpenEnv environment with Four tasks"
+ readme = "README.md"
+ requires-python = ">=3.10"
+ license = { text = "Apache-2.0" }
+ keywords = ["openenv", "reinforcement-learning", "sql", "duckdb", "data-engineering"]
+
+ dependencies = [
+     "duckdb>=0.10.0",
+     "pandas>=2.0.0",
+     "fastapi>=0.111.0",
+     "uvicorn[standard]>=0.29.0",
+     "pydantic>=2.0.0",
+     "requests>=2.31.0",
+     "openai>=1.30.0",
+     "pyyaml>=6.0",
+     "openenv-core>=0.2.0",
+ ]
+
+ [project.scripts]
+ server = "server.app:main"
+
+ [project.optional-dependencies]
+ openenv = [
+     "openenv-core>=0.1.0",
+ ]
+ dev = [
+     "pytest>=8.0",
+     "httpx>=0.27.0",
+ ]
+
+ [tool.setuptools.packages.find]
+ where = ["."]
  include = ["sql_env*", "server*"]
requirements.txt CHANGED
@@ -1,3 +1,4 @@
- fastapi
- uvicorn
- pydantic

+ fastapi
+ uvicorn
+ pydantic
+ duckdb
server/Dockerfile CHANGED
@@ -1,30 +1,30 @@
- # syntax: docker/dockerfile:1
- FROM python:3.11-slim
-
- # ── System deps ──────────────────────────────────────────────────────────────
- RUN apt-get update && apt-get install -y --no-install-recommends \
-     build-essential \
-     && rm -rf /var/lib/apt/lists/*
-
- # ── App directory ─────────────────────────────────────────────────────────────
- WORKDIR /app
-
- # ── Python deps (cached layer) ────────────────────────────────────────────────
- COPY requirements.txt ./requirements.txt
- RUN pip install --no-cache-dir -r requirements.txt
-
- # ── Copy source ───────────────────────────────────────────────────────────────
- COPY . .
-
- # ── HF Spaces requires port 7860 ─────────────────────────────────────────────
- EXPOSE 7860
-
- # ── Create output dir ─────────────────────────────────────────────────────────
- RUN mkdir -p /app/outputs/logs /app/outputs/evals
-
- # ── Health check ──────────────────────────────────────────────────────────────
- HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
-     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')"
-
- # ── Entry point ───────────────────────────────────────────────────────────────
- CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]

+ # syntax: docker/dockerfile:1
+ FROM python:3.11-slim
+
+ # ── System deps ──────────────────────────────────────────────────────────────
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     build-essential \
+     && rm -rf /var/lib/apt/lists/*
+
+ # ── App directory ─────────────────────────────────────────────────────────────
+ WORKDIR /app
+
+ # ── Python deps (cached layer) ────────────────────────────────────────────────
+ COPY requirements.txt ./requirements.txt
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # ── Copy source ───────────────────────────────────────────────────────────────
+ COPY . .
+
+ # ── HF Spaces requires port 7860 ─────────────────────────────────────────────
+ EXPOSE 7860
+
+ # ── Create output dir ─────────────────────────────────────────────────────────
+ RUN mkdir -p /app/outputs/logs /app/outputs/evals
+
+ # ── Health check ──────────────────────────────────────────────────────────────
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+     CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:7860/health')"
+
+ # ── Entry point ───────────────────────────────────────────────────────────────
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "7860"]
server/requirements.txt CHANGED
@@ -1,7 +1,7 @@
- fastapi>=0.111.0
- uvicorn[standard]>=0.29.0
- pydantic>=2.0.0
- duckdb>=0.10.0
- pandas>=2.0.0
- requests>=2.31.0
- pyyaml>=6.0

+ fastapi>=0.111.0
+ uvicorn[standard]>=0.29.0
+ pydantic>=2.0.0
+ duckdb>=0.10.0
+ pandas>=2.0.0
+ requests>=2.31.0
+ pyyaml>=6.0
uv.lock CHANGED
The diff for this file is too large to render. See raw diff