md896 committed on
Commit 30cf758 · 1 Parent(s): 825ffea

Initial OpenEnv SQL debug environment
.dockerignore ADDED
@@ -0,0 +1,14 @@
+ __pycache__/
+ *.pyc
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+ .DS_Store
+ .git/
+ .gitignore
+ .env
+ .env.*
+ !.env.example
+ .venv/
+ .cursor/
+
.env.example ADDED
@@ -0,0 +1,6 @@
+ OPENAI_API_KEY=
+ HF_TOKEN=
+ API_BASE_URL=https://api.openai.com/v1
+ MODEL_NAME=gpt-4o-mini
+ ENV_BASE_URL=http://localhost:7860
+
.gitignore ADDED
@@ -0,0 +1,19 @@
+ __pycache__/
+ *.pyc
+ .pytest_cache/
+ .mypy_cache/
+ .ruff_cache/
+ .DS_Store
+
+ # local env / secrets
+ .env
+ .env.*
+ !.env.example
+
+ # OpenEnv / uv
+ .venv/
+ .python-version
+
+ # editor metadata
+ .cursor/
+
Dockerfile ADDED
@@ -0,0 +1,31 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy requirements first for layer caching
+ COPY requirements.txt .
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ # Copy application code
+ COPY server/ ./server/
+ COPY openenv.yaml .
+
+ # Create non-root user for security
+ RUN useradd -m -u 1000 appuser && chown -R appuser:appuser /app
+ USER appuser
+
+ # Expose port
+ EXPOSE 7860
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
+     CMD curl -f http://localhost:7860/health || exit 1
+
+ # Start server
+ CMD ["uvicorn", "server.main:app", "--host", "0.0.0.0", "--port", "7860", "--workers", "1"]
+
README.md CHANGED
@@ -1,10 +1,193 @@
- ---
- title: Sql Debug Env
- emoji: 💻
- colorFrom: indigo
- colorTo: gray
- sdk: docker
- pinned: false
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # SQL Debug Environment (`sql-debug-env`)
+
+ ![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)
+ ![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white)
+ ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
+ ![SQLite](https://img.shields.io/badge/SQLite-In_Memory-003B57?logo=sqlite&logoColor=white)
+ ![Docker](https://img.shields.io/badge/Docker-Ready-2496ED?logo=docker&logoColor=white)
+ ![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f)
+
+ An OpenEnv environment for a real task people do every day: **debugging SQL**. The agent gets a broken query, a live (in-memory) SQLite database, and a description of the expected output. It can inspect schema/errors/samples and submit fixed queries until it solves the task.
+
+ ## What’s in this repo
+ - **FastAPI server**: `server/main.py` (endpoints: `/health`, `/tasks`, `/reset`, `/step`, `/state`)
+ - **Environment logic**: `server/env.py` + `server/database.py`
+ - **Tasks**: `server/tasks/` (easy → medium → hard, deterministic seed data)
+ - **Baseline agent**: `inference.py` (OpenAI client + `[START]/[STEP]/[END]` logs)
+
+ ## Tech Stack
+ - Python 3.11+
+ - FastAPI + Uvicorn
+ - Pydantic v2
+ - SQLite (in-memory)
+ - OpenEnv Core
+ - Docker
+ - OpenAI Python SDK (baseline inference)
+
+ ## Production Notes
+ - Stateless HTTP API with per-session environment instances keyed by `X-Session-Id`
+ - Deterministic task data (in-memory SQLite) for reproducible grading
+ - Reward clamped to `[0.0, 1.0]` with partial-progress shaping
+ - Docker-first deployment path (local and Hugging Face Spaces)
+ - Local benchmark endpoint for live latency checks (`/benchmark`)
+
+ ## API Docs (FastAPI Auto Docs)
+ Use these for interactive testing in the browser:
+
+ - Swagger UI: `http://localhost:7860/docs`
+ - ReDoc: `http://localhost:7860/redoc`
+ - OpenAPI spec: `http://localhost:7860/openapi.json`
+
+ ## Action Space
+ | Action | Required fields | Cost / reward effect |
+ |---|---|---|
+ | `submit_query` | `query` | Main evaluation step (dense reward based on grading) |
+ | `inspect_schema` | none | Free information action (small positive reward component) |
+ | `inspect_error` | none | Free information action (small positive reward component) |
+ | `inspect_sample` | `table_name` | Free information action (small positive reward component) |
+ | `reset_query` | none | Penalty action (reduces reward for that step) |
+
+ ## Observation Space
+ | Field | Type |
+ |---|---|
+ | `task_id` | `string` |
+ | `task_description` | `string` |
+ | `original_query` | `string` |
+ | `current_query` | `string_or_null` |
+ | `expected_description` | `string` |
+ | `last_action_type` | `string` |
+ | `last_query_result` | `object_or_null` |
+ | `steps_taken` | `integer` |
+ | `steps_remaining` | `integer` |
+ | `current_score` | `float` |
+ | `schema_info` | `object_or_null` |
+ | `error_details` | `string_or_null` |
+ | `sample_rows` | `array_or_null` |
+ | `hint` | `string_or_null` |
+ | `is_done` | `boolean` |
+ | `success` | `boolean` |
+
+ ## Reward Function
+ | Component | Range | Description |
+ |---|---|---|
+ | `correctness` | `[0.0, 0.6]` | Row-level match vs expected output |
+ | `efficiency` | `[0.0, 0.2]` | Bonus for solving with fewer steps |
+ | `syntax_progress` | `[0.0, 0.1]` | Small reward for producing syntactically valid SQL |
+ | `schema_bonus` | `[0.0, 0.1]` | Bonus for referencing correct tables/columns |
+ | `penalty` | `[0.0, 0.2]` | Deduction magnitude for resets/regressions/urgency near step limit |
+
+ ## Tasks
+ ### Task 1: Easy — Syntax Error Fix (`easy_syntax_fix`)
+ Two straightforward issues: a misspelled keyword (`GRUP BY`) and an `ORDER BY` alias mismatch.
+
+ ### Task 2: Medium — Logic Error Fix (`medium_logic_fix`)
+ Logic bugs around outer joins + filtering scope + aggregation scope.
+
+ ### Task 3: Hard — Multi-Bug Fix (`hard_multi_bug`)
+ Five bugs across correlated subqueries, window functions, CTE scope, date logic, and duplication.
+
+ ## Baseline
+ The baseline script is intentionally simple: it loops `reset → step` and asks an OpenAI model to choose the next JSON action.
+
+ ## Reliability & Benchmarking
+
+ ### Verified status (local)
+ - `openenv validate --verbose`: **PASS**
+ - `python3 -m unittest discover -s tests -p "test_*.py"`: **10/10 PASS**
+ - Docker smoke test: **PASS** (`/health`, `/tasks`, `/reset`, `/step`)
+ - FastAPI docs available: **PASS** (`/docs`, `/redoc`, `/openapi.json`)
+
+ ### Endpoint benchmark (local Docker run, n=25)
+ Measured with `scripts/benchmark_local.py` on a running local container:
+
+ | Endpoint | avg | p50 | p95 |
+ |---|---:|---:|---:|
+ | `GET /health` | 0.69 ms | 0.67 ms | 0.76 ms |
+ | `GET /tasks` | 0.82 ms | 0.81 ms | 0.90 ms |
+ | `POST /reset` | 1.34 ms | 1.26 ms | 1.62 ms |
+ | `POST /step` (`inspect_schema`) | 1.07 ms | 1.01 ms | 1.34 ms |
+
+ Re-run anytime:
+
+ ```bash
+ python3 scripts/benchmark_local.py
+ ```
+
+ Notes:
+ - These are local-machine numbers (single container, warm runtime).
+ - For submission-grade reporting, also capture one run against your HF Space URL after deploy.
+
+ ## Setup & Usage
+
+ ### Local Development
+ ```bash
+ pip install -r requirements.txt
+ uvicorn server.main:app --host 0.0.0.0 --port 7860
+ ```
+
+ ### Docker
+ ```bash
+ docker build -t sql-debug-env .
+ docker run -p 7860:7860 sql-debug-env
+ ```
+
+ ### Quick smoke test
+ ```bash
+ curl http://localhost:7860/health
+ curl http://localhost:7860/tasks
+ curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id":"easy_syntax_fix"}'
+ curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d '{"action":{"action_type":"inspect_schema"}}'
+ curl "http://localhost:7860/benchmark?runs=20"
+ ```
+
+ ### Real-time benchmark API (for dashboards/web pages)
+ This is a live endpoint, not static/dummy data. Every request runs fresh measurements.
+
+ - Endpoint: `GET /benchmark?runs=20`
+ - `runs` range: `1` to `100`
+ - Returns JSON with `avg_ms`, `p50_ms`, `p95_ms`, `n`, and a fresh `timestamp_epoch_ms`
+
+ Example:
+ ```bash
+ curl "http://localhost:7860/benchmark?runs=30"
+ ```
+
+ ### Run Baseline
+ ```bash
+ export API_BASE_URL="https://api.openai.com/v1"
+ export MODEL_NAME="gpt-4o-mini"
+ export OPENAI_API_KEY="your-key"
+ export ENV_BASE_URL="http://localhost:7860"
+ export HF_TOKEN="$OPENAI_API_KEY"
+ export SEED="1"
+ python inference.py
+ ```
+
+ ### OpenEnv Validation
+ ```bash
+ pip install openenv-core
+ openenv validate
+ ```
+
+ ### Suggested pre-submit check
+ ```bash
+ openenv validate --verbose
+ python3 -m unittest discover -s tests -p "test_*.py"
+ docker build -t sql-debug-env .
+ docker run --rm -p 7860:7860 sql-debug-env
+ # in another terminal:
+ curl -s http://localhost:7860/health
+ curl -s http://localhost:7860/docs >/dev/null
+ curl -s "http://localhost:7860/benchmark?runs=20"
+ ```
+
+ ## Hugging Face Spaces (Docker)
+ 1. Create a new **Space → Docker**.
+ 2. Push this repo.
+ 3. Update `openenv.yaml` → `api.base_url` to your Space URL: `https://<your-space>.hf.space`
+ 4. Wait for build, then verify:
+
+ ```bash
+ curl -X POST https://<your-space>.hf.space/reset -H "Content-Type: application/json" -d '{}'
+ ```
+
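The curl commands in the README above can also be driven from Python. As a minimal client-side sketch (not part of this repo: `build_step_payload` is a hypothetical helper built from the documented action table and the `{"action": {...}}` request shape shown in the smoke test):

```python
# Required fields per action, as documented in the README's Action Space table.
REQUIRED_FIELDS = {
    "submit_query": ["query"],
    "inspect_schema": [],
    "inspect_error": [],
    "inspect_sample": ["table_name"],
    "reset_query": [],
}


def build_step_payload(action_type: str, **fields) -> dict:
    """Validate required fields and wrap the action as a /step request body."""
    if action_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action_type: {action_type}")
    missing = [f for f in REQUIRED_FIELDS[action_type] if f not in fields]
    if missing:
        raise ValueError(f"{action_type} requires fields: {missing}")
    return {"action": {"action_type": action_type, **fields}}
```

The resulting dict can be POSTed to `/step` with any HTTP client, e.g. `httpx.post(f"{base_url}/step", json=build_step_payload("inspect_sample", table_name="customers"))`.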
inference.py ADDED
@@ -0,0 +1,328 @@
+ """
+ inference.py — OpenEnv SQL Debug Environment Baseline Agent
+ MUST be at root level. MUST use exact [START]/[STEP]/[END] log format.
+ Uses OpenAI client. Reads from environment variables.
+ Runtime target: < 20 minutes on 2vCPU / 8GB.
+ """
+ import json
+ import os
+ import re
+ import time
+ from typing import Any, Dict, List, Optional
+
+ import httpx
+ from openai import OpenAI
+
+
+ # ── Configuration from environment variables ────────────────────────────────
+ API_BASE_URL = os.environ.get("API_BASE_URL", "https://api.openai.com/v1")
+ MODEL_NAME = os.environ.get("MODEL_NAME", "gpt-4o-mini")
+ HF_TOKEN = os.environ.get("HF_TOKEN", "")
+ API_KEY = os.environ.get("OPENAI_API_KEY", HF_TOKEN or "sk-placeholder")
+
+ # ── Environment config ───────────────────────────────────────────────────────
+ ENV_BASE_URL = os.environ.get("ENV_BASE_URL", "http://localhost:7860")
+ BENCHMARK = "sql-debug-env"
+ TEMPERATURE = 0.0
+ MAX_TOKENS = 1024
+ SEED = int(os.environ.get("SEED", "1"))
+
+ # ── Per-task config ──────────────────────────────────────────────────────────
+ TASK_CONFIGS = {
+     "easy_syntax_fix": {"max_steps": 10, "success_threshold": 0.8},
+     "medium_logic_fix": {"max_steps": 20, "success_threshold": 0.7},
+     "hard_multi_bug": {"max_steps": 30, "success_threshold": 0.5},
+ }
+
+
+ # ── Logging functions (EXACT FORMAT — DO NOT MODIFY) ────────────────────────
+ def log_start(task: str, env: str, model: str):
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]):
+     error_str = error if error else "null"
+     # Escape action for single-line logging
+     action_clean = action.replace("\n", "\\n").replace('"', '\\"')[:200]
+     print(
+         f"[STEP] step={step} action=\"{action_clean}\" "
+         f"reward={reward:.4f} done={str(done).lower()} error={error_str}",
+         flush=True
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]):
+     rewards_str = json.dumps([round(r, 4) for r in rewards])
+     print(
+         f"[END] success={str(success).lower()} steps={steps} "
+         f"score={score:.4f} rewards={rewards_str}",
+         flush=True
+     )
+
+
+ # ── System prompt ────────────────────────────────────────────────────────────
+ SYSTEM_PROMPT = """You are an expert SQL debugger. You will receive a broken SQL query and must fix it.
+
+ You interact with a SQL debugging environment via JSON actions.
+
+ Available actions (respond with ONLY valid JSON, no markdown, no explanation):
+
+ 1. Submit a fixed query:
+    {"action_type": "submit_query", "query": "SELECT ..."}
+
+ 2. Inspect schema (free, no penalty):
+    {"action_type": "inspect_schema"}
+
+ 3. Inspect last error (free, no penalty):
+    {"action_type": "inspect_error"}
+
+ 4. Inspect sample rows from a table (free, no penalty):
+    {"action_type": "inspect_sample", "table_name": "table_name_here"}
+
+ Strategy:
+ - Start by submitting a fixed query if the bug is obvious
+ - Use inspect_schema first if you need to verify column names/table structure
+ - Use inspect_error to understand why your query failed
+ - Read error messages carefully — they tell you exactly what's wrong
+ - Fix one bug at a time and resubmit
+ - You get partial credit for partially correct queries
+
+ IMPORTANT: Respond with ONLY the JSON action. No explanation, no markdown blocks, just raw JSON."""
+
+
+ def build_prompt(obs: Dict[str, Any], step: int, reward_history: List[float]) -> str:
+     """Build the user prompt for each step."""
+     lines = [
+         f"=== SQL Debugging Task (Step {step}) ===",
+         f"Task: {obs.get('task_description', '')[:500]}",
+         "",
+         "ORIGINAL BROKEN QUERY:",
+         "```sql",
+         f"{obs.get('original_query', '')}",
+         "```",
+     ]
+
+     if obs.get('current_query'):
+         lines += [
+             "",
+             "YOUR LAST SUBMITTED QUERY:",
+             "```sql",
+             f"{obs.get('current_query', '')}",
+             "```",
+         ]
+
+     last_result = obs.get('last_query_result')
+     if last_result:
+         if last_result.get('success'):
+             rows = last_result.get('rows', [])
+             lines += [
+                 "",
+                 f"LAST QUERY RESULT: {len(rows)} rows returned",
+                 f"Sample (first 3): {json.dumps(rows[:3], default=str)}",
+             ]
+         else:
+             lines += [
+                 "",
+                 f"LAST QUERY ERROR: {last_result.get('error_message', 'Unknown error')}",
+             ]
+
+     if obs.get('schema_info'):
+         schema = obs['schema_info'].get('tables', {})
+         lines += ["", "DATABASE SCHEMA:"]
+         for table, cols in schema.items():
+             col_str = ", ".join(f"{c['name']} ({c['type']})" for c in cols)
+             lines.append(f"  {table}: {col_str}")
+
+     if obs.get('error_details'):
+         lines += ["", f"ERROR DETAILS: {obs['error_details']}"]
+
+     if obs.get('sample_rows'):
+         lines += ["", f"SAMPLE ROWS: {json.dumps(obs['sample_rows'][:3], default=str)}"]
+
+     if obs.get('hint'):
+         lines += ["", f"HINT: {obs['hint']}"]
+
+     lines += [
+         "",
+         f"Current score: {obs.get('current_score', 0):.3f}",
+         f"Steps remaining: {obs.get('steps_remaining', 0)}",
+         f"Expected output: {obs.get('expected_description', '')}",
+         "",
+         "What is your next action? (respond with ONLY valid JSON)"
+     ]
+
+     return "\n".join(lines)
+
+
+ def call_model(client: OpenAI, prompt: str) -> Dict[str, Any]:
+     """Call model and parse JSON action response."""
+     text = ""
+     try:
+         response = client.chat.completions.create(
+             model=MODEL_NAME,
+             messages=[
+                 {"role": "system", "content": SYSTEM_PROMPT},
+                 {"role": "user", "content": prompt}
+             ],
+             temperature=TEMPERATURE,
+             seed=SEED,
+             max_tokens=MAX_TOKENS,
+         )
+         text = (response.choices[0].message.content or "").strip()
+
+         # Strip markdown if model wraps in backticks
+         if text.startswith("```"):
+             text = text.split("```")[1]
+             if text.startswith("json"):
+                 text = text[4:]
+             text = text.strip()
+
+         return json.loads(text)
+     except json.JSONDecodeError:
+         # Fallback: try to extract JSON from the response
+         match = re.search(r'\{.*\}', text, re.DOTALL)
+         if match:
+             try:
+                 return json.loads(match.group())
+             except json.JSONDecodeError:
+                 pass
+         # Default fallback action
+         return {"action_type": "inspect_schema"}
+     except Exception as e:
+         print(f"[DEBUG] Model error: {e}", flush=True)
+         return {"action_type": "inspect_schema"}
+
+
+ def run_task(
+     client: OpenAI,
+     task_id: str,
+     config: Dict[str, Any]
+ ) -> Dict[str, Any]:
+     """Run one task episode synchronously via HTTP."""
+     max_steps = config["max_steps"]
+     success_threshold = config["success_threshold"]
+
+     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+
+     rewards = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+
+     with httpx.Client(base_url=ENV_BASE_URL, timeout=30.0) as http:
+         # Reset
+         reset_resp = http.post("/reset", json={"task_id": task_id})
+         reset_resp.raise_for_status()
+         result = reset_resp.json()
+         obs = result["observation"]
+         done = result["done"]
+
+         reward_history = []
+
+         for step in range(1, max_steps + 1):
+             if done:
+                 break
+
+             # Get model action
+             prompt = build_prompt(obs, step, reward_history)
+             action_dict = call_model(client, prompt)
+
+             # Execute step
+             try:
+                 step_resp = http.post("/step", json={"action": action_dict})
+                 step_resp.raise_for_status()
+                 step_result = step_resp.json()
+             except Exception as e:
+                 log_step(step=step, action=str(action_dict), reward=0.0, done=False, error=str(e))
+                 continue
+
+             obs = step_result["observation"]
+             reward = float(step_result.get("reward") or 0.0)
+             done = step_result["done"]
+             error = None
+             info = step_result.get("info") or {}
+
+             # Extract error for logging
+             last_result = obs.get("last_query_result")
+             if last_result and not last_result.get("success"):
+                 error = last_result.get("error_message", "")
+
+             action_str = action_dict.get("query") or action_dict.get("action_type", "unknown")
+
+             rewards.append(reward)
+             reward_history.append(reward)
+             steps_taken = step
+             score = float(info.get("grade_score") or obs.get("current_score") or 0.0)
+
+             log_step(step=step, action=action_str, reward=reward, done=done, error=error)
+
+             if done:
+                 break
+
+     # Compute final score
+     score = min(max(score, 0.0), 1.0)
+     success = score >= success_threshold
+
+     log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return {
+         "task_id": task_id,
+         "score": score,
+         "success": success,
+         "steps": steps_taken,
+         "rewards": rewards
+     }
+
+
+ def main():
+     """Run baseline agent across all 3 tasks."""
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     print("[DEBUG] Starting SQL Debug Env baseline", flush=True)
+     print(f"[DEBUG] Model: {MODEL_NAME}", flush=True)
+     print(f"[DEBUG] Env URL: {ENV_BASE_URL}", flush=True)
+
+     # Wait for server to be ready
+     max_wait = 30
+     for i in range(max_wait):
+         try:
+             resp = httpx.get(f"{ENV_BASE_URL}/health", timeout=5)
+             if resp.status_code == 200:
+                 print("[DEBUG] Server ready", flush=True)
+                 break
+         except httpx.HTTPError:
+             pass
+         print(f"[DEBUG] Waiting for server... ({i+1}/{max_wait})", flush=True)
+         time.sleep(1)
+
+     all_results = []
+
+     for task_id, config in TASK_CONFIGS.items():
+         print(f"\n[DEBUG] Running task: {task_id}", flush=True)
+         try:
+             result = run_task(client, task_id, config)
+             all_results.append(result)
+         except Exception as e:
+             print(f"[DEBUG] Task {task_id} failed: {e}", flush=True)
+             log_end(success=False, steps=0, score=0.0, rewards=[])
+
+         # Small delay between tasks
+         time.sleep(2)
+
+     # Summary
+     print("\n[DEBUG] === BASELINE RESULTS ===", flush=True)
+     total_score = 0.0
+     for r in all_results:
+         print(f"[DEBUG] {r['task_id']}: score={r['score']:.3f} success={r['success']}", flush=True)
+         total_score += r['score']
+
+     if all_results:
+         avg = total_score / len(all_results)
+         print(f"[DEBUG] Average score: {avg:.3f}", flush=True)
+
+
+ if __name__ == "__main__":
+     main()
+
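Because `log_step` above emits a fixed single-line format, downstream tooling can parse it back. A minimal sketch (the regex is derived from the f-string in `log_step`; `parse_step_line` is a hypothetical helper, not part of this repo):

```python
import re

# Matches: [STEP] step=1 action="..." reward=0.1000 done=false error=null
STEP_RE = re.compile(
    r'\[STEP\] step=(\d+) action="(.*)" reward=([\d.]+) done=(true|false) error=(.*)'
)


def parse_step_line(line: str):
    """Parse one [STEP] log line into a dict, or return None if it doesn't match."""
    m = STEP_RE.match(line)
    if not m:
        return None
    step, action, reward, done, error = m.groups()
    return {
        "step": int(step),
        "action": action,
        "reward": float(reward),
        "done": done == "true",
        "error": None if error == "null" else error,
    }
```

This is handy for summarizing reward curves from captured baseline logs without re-running the agent.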
openenv.yaml ADDED
@@ -0,0 +1,104 @@
+ name: sql-debug-env
+ version: 0.1.0
+ description: >
+   A reinforcement learning environment for training AI agents to debug SQL queries.
+   Agents receive broken SQL queries against a live SQLite database and must fix them
+   through iterative actions: submitting queries, inspecting schemas, and analyzing errors.
+   Models a real-world task performed daily by data analysts, engineers, and scientists.
+
+ author: md-ayan
+ license: apache-2.0
+
+ tags:
+   - openenv
+   - sql
+   - debugging
+   - data-engineering
+   - real-world
+   - analytics
+
+ tasks:
+   - id: easy_syntax_fix
+     name: "Top Customers by Revenue — Syntax Error Fix"
+     difficulty: easy
+     max_steps: 10
+     description: "Fix 2 syntax/reference bugs in a customer analytics query"
+
+   - id: medium_logic_fix
+     name: "Department Headcount Report — Logic Error Fix"
+     difficulty: medium
+     max_steps: 20
+     description: "Fix JOIN type, WHERE clause placement, and aggregation scope bugs"
+
+   - id: hard_multi_bug
+     name: "SaaS Cohort Activation Report — Multi-Bug Fix"
+     difficulty: hard
+     max_steps: 30
+     description: "Fix 5 bugs: correlated subquery, window function, duplicate rows, date logic, CTE scope"
+
+ api:
+   base_url: "https://YOUR-USERNAME-sql-debug-env.hf.space"
+   reset: "/reset"
+   step: "/step"
+   state: "/state"
+   health: "/health"
+   tasks: "/tasks"
+
+ observation_space:
+   type: structured
+   fields:
+     - name: task_description
+       type: string
+     - name: original_query
+       type: string
+     - name: current_query
+       type: string_or_null
+     - name: last_query_result
+       type: object_or_null
+     - name: steps_taken
+       type: integer
+     - name: current_score
+       type: float
+
+ action_space:
+   type: structured
+   actions:
+     - id: submit_query
+       description: "Submit a fixed SQL query for evaluation"
+       required_fields: [query]
+     - id: inspect_schema
+       description: "Get database schema (free action)"
+     - id: inspect_error
+       description: "Get last error details (free action)"
+     - id: inspect_sample
+       description: "Get 3 sample rows from a table"
+       required_fields: [table_name]
+     - id: reset_query
+       description: "Reset to original broken query (penalty: -0.05)"
+
+ reward:
+   range: [0.0, 1.0]
+   components:
+     - name: correctness
+       range: [0.0, 0.6]
+       description: "Row-level match vs expected output"
+     - name: efficiency
+       range: [0.0, 0.2]
+       description: "Bonus for solving with fewer steps"
+     - name: syntax_progress
+       range: [0.0, 0.1]
+       description: "Valid SQL even if wrong content"
+     - name: schema_bonus
+       range: [0.0, 0.1]
+       description: "Correct table/column references"
+     - name: penalty
+       range: [0.0, 0.2]
+       description: "Penalty deduction magnitude for bad actions / urgency"
+
+ runtime:
+   max_concurrent_sessions: 64
+   episode_timeout_seconds: 300
+   machine_requirements:
+     vcpu: 2
+     memory_gb: 8
+
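The reward components in the spec above sum to at most 1.0 before the penalty is applied. A quick sketch of the documented combination and clamping (this mirrors the declared ranges, not the server's actual `compute_reward` implementation):

```python
def clamp_reward(correctness: float, efficiency: float,
                 syntax_progress: float, schema_bonus: float,
                 penalty: float) -> float:
    """Combine the documented components and clamp to [0.0, 1.0].

    Per openenv.yaml: correctness [0, 0.6], efficiency [0, 0.2],
    syntax_progress [0, 0.1], schema_bonus [0, 0.1], penalty magnitude [0, 0.2].
    """
    total = correctness + efficiency + syntax_progress + schema_bonus - penalty
    return min(max(total, 0.0), 1.0)
```

For example, a fully correct, efficient solution reaches 1.0, while the penalty can never push a step's reward below 0.0.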
pyproject.toml ADDED
@@ -0,0 +1,21 @@
+ [build-system]
+ requires = ["setuptools>=61.0"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "sql-debug-env"
+ version = "0.1.0"
+ requires-python = ">=3.11"
+ dependencies = [
+     "fastapi==0.115.0",
+     "uvicorn[standard]==0.30.6",
+     "pydantic==2.9.2",
+     "openenv-core>=0.1.0",
+     "openai>=1.50.0",
+     "httpx>=0.27.0",
+     "python-multipart==0.0.9"
+ ]
+
+ [project.scripts]
+ server = "server.app:main"
+
requirements.txt ADDED
@@ -0,0 +1,8 @@
+ fastapi==0.115.0
+ uvicorn[standard]==0.30.6
+ pydantic==2.9.2
+ openenv-core>=0.1.0
+ openai>=1.50.0
+ httpx>=0.27.0
+ python-multipart==0.0.9
+
scripts/benchmark_local.py ADDED
@@ -0,0 +1,63 @@
+ """
+ Lightweight local benchmark for sql-debug-env.
+
+ Runs deterministic endpoint checks and prints simple latency metrics.
+ No LLM key required.
+ """
+ from __future__ import annotations
+
+ import statistics
+ import time
+ from typing import Dict, List
+
+ import httpx
+
+
+ BASE_URL = "http://localhost:7860"
+
+
+ def timed_call(client: httpx.Client, method: str, path: str, json_body: Dict | None = None) -> float:
+     start = time.perf_counter()
+     if method == "GET":
+         r = client.get(path)
+     else:
+         r = client.post(path, json=json_body)
+     r.raise_for_status()
+     return (time.perf_counter() - start) * 1000
+
+
+ def summarize(samples: List[float]) -> str:
+     p50 = statistics.median(samples)
+     p95 = sorted(samples)[int(len(samples) * 0.95) - 1]
+     avg = statistics.mean(samples)
+     return f"avg={avg:.2f}ms p50={p50:.2f}ms p95={p95:.2f}ms n={len(samples)}"
+
+
+ def main() -> None:
+     with httpx.Client(base_url=BASE_URL, timeout=30.0) as client:
+         # Warmup + health check
+         client.get("/health").raise_for_status()
+
+         health_times = [timed_call(client, "GET", "/health") for _ in range(25)]
+         tasks_times = [timed_call(client, "GET", "/tasks") for _ in range(25)]
+
+         reset_times: List[float] = []
+         step_times: List[float] = []
+         for _ in range(25):
+             reset_times.append(
+                 timed_call(client, "POST", "/reset", {"task_id": "easy_syntax_fix"})
+             )
+             step_times.append(
+                 timed_call(client, "POST", "/step", {"action": {"action_type": "inspect_schema"}})
+             )
+
+     print("Benchmark results (local)")
+     print(f"GET /health: {summarize(health_times)}")
+     print(f"GET /tasks: {summarize(tasks_times)}")
+     print(f"POST /reset: {summarize(reset_times)}")
+     print(f"POST /step (inspect_schema): {summarize(step_times)}")
+
+
+ if __name__ == "__main__":
+     main()
+
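`summarize` above computes p95 with a simple index into the sorted samples, which only behaves well for sample sizes like n=25. A more general nearest-rank percentile can be sketched as (hypothetical helper, not part of this repo):

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value at rank ceil(pct/100 * n) in sorted order."""
    if not samples:
        raise ValueError("samples must be non-empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]
```

For n=25 and pct=95 this picks the 24th-smallest sample, matching the intent of the index arithmetic in `summarize` while also handling tiny sample sets safely.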
server/__init__.py ADDED
@@ -0,0 +1,2 @@
+ # sql-debug-env
+
server/app.py ADDED
@@ -0,0 +1,20 @@
+ import os
+
+ import uvicorn
+
+ from .main import app  # noqa: F401  (re-exported so "server.app:app" resolves)
+
+
+ def main():
+     """
+     OpenEnv entry point.
+
+     This module is required for `openenv validate` multi-mode deployment checks.
+     """
+     host = os.environ.get("HOST", "0.0.0.0")
+     port = int(os.environ.get("PORT", "7860"))
+     uvicorn.run("server.app:app", host=host, port=port, workers=1)
+
+
+ if __name__ == "__main__":
+     main()
+
server/database.py ADDED
@@ -0,0 +1,112 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
+ """
2
+ SQLite in-memory database management.
3
+ Creates fresh DB instances per episode with deterministic seed data.
4
+ """
5
+ import sqlite3
6
+ import time
7
+ from typing import Dict, Any, List
8
+
9
+
10
+ class EpisodeDatabase:
11
+ """
12
+ Manages a single SQLite in-memory database for one episode.
13
+ Seeded with deterministic data per task.
14
+ """
15
+
16
+ def __init__(self, task_id: str, schema_sql: str, seed_data_sql: str):
17
+ self.task_id = task_id
18
+ self.conn = sqlite3.connect(":memory:", check_same_thread=False)
19
+ self.conn.row_factory = sqlite3.Row
20
+ self.conn.execute("PRAGMA foreign_keys = ON")
21
+ self._setup(schema_sql, seed_data_sql)
22
+
23
+ def _setup(self, schema_sql: str, seed_data_sql: str):
24
+ """Create schema and insert seed data."""
25
+        cursor = self.conn.cursor()
+        for statement in schema_sql.strip().split(";"):
+            stmt = statement.strip()
+            if stmt:
+                cursor.execute(stmt)
+        for statement in seed_data_sql.strip().split(";"):
+            stmt = statement.strip()
+            if stmt:
+                cursor.execute(stmt)
+        self.conn.commit()
+
+    def execute_query(self, query: str) -> Dict[str, Any]:
+        """
+        Execute a read-only SQL query safely.
+        Returns rows or error. Enforces SELECT-only.
+        Execution timeout: 5 seconds.
+        """
+        query_stripped = query.strip().upper()
+
+        # Block dangerous operations
+        blocked = ["DROP", "DELETE", "UPDATE", "INSERT", "CREATE", "ALTER",
+                   "TRUNCATE", "REPLACE", "ATTACH", "DETACH"]
+        for kw in blocked:
+            if query_stripped.startswith(kw) or f" {kw} " in query_stripped:
+                return {
+                    "success": False,
+                    "rows": None,
+                    "row_count": None,
+                    "error_message": f"BLOCKED: Only SELECT queries are allowed. '{kw}' is not permitted.",
+                    "execution_time_ms": 0.0
+                }
+
+        start = time.time()
+        try:
+            cursor = self.conn.cursor()
+            cursor.execute(query)
+            rows = cursor.fetchall()
+            elapsed = (time.time() - start) * 1000
+
+            # Convert Row objects to dicts
+            result_rows = [dict(row) for row in rows]
+
+            return {
+                "success": True,
+                "rows": result_rows,
+                "row_count": len(result_rows),
+                "error_message": None,
+                "execution_time_ms": round(elapsed, 2)
+            }
+        except sqlite3.Error as e:
+            elapsed = (time.time() - start) * 1000
+            return {
+                "success": False,
+                "rows": None,
+                "row_count": None,
+                "error_message": str(e),
+                "execution_time_ms": round(elapsed, 2)
+            }
+
+    def get_schema(self) -> Dict[str, List[Dict[str, str]]]:
+        """Return schema info: tables and their columns."""
+        schema = {}
+        cursor = self.conn.cursor()
+        cursor.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
+        tables = [row[0] for row in cursor.fetchall()]
+
+        for table in tables:
+            cursor.execute(f"PRAGMA table_info({table})")
+            columns = []
+            for col in cursor.fetchall():
+                columns.append({
+                    "name": col[1],
+                    "type": col[2],
+                    "nullable": "YES" if col[3] == 0 else "NO",
+                    "primary_key": "YES" if col[5] > 0 else "NO"
+                })
+            schema[table] = columns
+
+        return schema
+
+    def get_sample_rows(self, table_name: str, limit: int = 3) -> List[Dict[str, Any]]:
+        """Get sample rows from a table."""
+        result = self.execute_query(f"SELECT * FROM {table_name} LIMIT {limit}")
+        return result.get("rows", []) or []
+
+    def close(self):
+        self.conn.close()
+
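Review note: the SELECT-only guard above rejects a query when a blocked keyword either leads the statement or appears as a space-delimited word inside it. A standalone sketch of that check (`is_query_blocked` is a hypothetical helper name, not part of this diff):

```python
# Sketch of the keyword guard used in EpisodeDatabase.execute_query.
# is_query_blocked is a hypothetical helper, not part of this repo.
BLOCKED = ["DROP", "DELETE", "UPDATE", "INSERT", "CREATE", "ALTER",
           "TRUNCATE", "REPLACE", "ATTACH", "DETACH"]

def is_query_blocked(query: str) -> bool:
    q = query.strip().upper()
    # Reject queries that start with a blocked keyword, or contain one
    # as a space-delimited word anywhere in the statement.
    return any(q.startswith(kw) or f" {kw} " in q for kw in BLOCKED)

assert not is_query_blocked("SELECT * FROM users")
assert is_query_blocked("DROP TABLE users")
# Stacked statements with embedded DML are also caught.
assert is_query_blocked("SELECT 1; DELETE FROM users")
```

Note the space delimiters mean column names like `created_at` do not false-positive on `CREATE`.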
server/env.py ADDED
@@ -0,0 +1,236 @@
+"""
+Core SQL Debug Environment.
+Manages episode state, delegates to tasks and reward function.
+"""
+import uuid
+import asyncio
+from typing import Optional, Dict, Any, List
+from .models import (
+    SQLDebugAction, SQLDebugObservation, SQLDebugReward,
+    EpisodeState, ActionType, QueryResult, SchemaInfo
+)
+from .database import EpisodeDatabase
+from .reward import compute_reward
+from .tasks.task_easy import EasyTask
+from .tasks.task_medium import MediumTask, MediumTaskGrader
+from .tasks.task_hard import HardTask
+
+
+TASKS = {
+    "easy_syntax_fix": EasyTask(),
+    "medium_logic_fix": MediumTask(),
+    "hard_multi_bug": HardTask(),
+}
+
+
+class SQLDebugEnv:
+    """
+    The SQL Debug Environment.
+    Manages one active episode at a time per session.
+    Thread-safe for concurrent sessions via the instance-per-session pattern.
+    """
+
+    def __init__(self, task_id: str = "easy_syntax_fix"):
+        self.task_id = task_id
+        self.task = TASKS[task_id]
+        self._db: Optional[EpisodeDatabase] = None
+        self._state: Optional[EpisodeState] = None
+        self._lock = asyncio.Lock()
+
+    async def reset(self) -> tuple[SQLDebugObservation, Dict]:
+        """Reset environment to initial state. Returns (observation, info)."""
+        async with self._lock:
+            # Close previous DB if exists
+            if self._db:
+                self._db.close()
+
+            # Fresh DB
+            self._db = EpisodeDatabase(
+                task_id=self.task.task_id,
+                schema_sql=self.task.schema_sql,
+                seed_data_sql=self.task.seed_data_sql
+            )
+
+            # Fresh state
+            self._state = EpisodeState(
+                task_id=self.task.task_id,
+                task_difficulty=self.task.difficulty,
+                original_query=self.task.broken_query,
+                current_query=None,
+                best_score_so_far=0.0,
+                steps_taken=0,
+                max_steps=self.task.max_steps,
+                action_history=[],
+                reward_history=[],
+                is_done=False,
+                success=False,
+                db_schema=self._db.get_schema()
+            )
+
+            obs = SQLDebugObservation(
+                task_id=self.task.task_id,
+                task_description=self.task.description,
+                original_query=self.task.broken_query,
+                current_query=None,
+                expected_description=self.task.expected_output_description,
+                last_action_type="reset",
+                last_query_result=None,
+                steps_taken=0,
+                steps_remaining=self.task.max_steps,
+                current_score=0.0,
+                schema_info=SchemaInfo(tables=self._db.get_schema()),
+                is_done=False,
+                success=False
+            )
+
+            return obs, {"task": self.task.to_dict()}
+
+    async def step(self, action: SQLDebugAction) -> tuple[SQLDebugObservation, float, bool, Dict]:
+        """
+        Execute one action.
+        Returns (observation, reward_value, done, info).
+        """
+        async with self._lock:
+            if self._state is None:
+                raise RuntimeError("Call reset() before step()")
+
+            if self._state.is_done:
+                raise RuntimeError("Episode is done. Call reset() to start a new episode.")
+
+            self._state.steps_taken += 1
+            steps_taken = self._state.steps_taken
+
+            query_result_raw = None
+            prev_best_score = self._state.best_score_so_far
+            grade_score = self._state.best_score_so_far
+            schema_info = None
+            error_details = None
+            sample_rows = None
+            hint = None
+
+            # --- Execute action ---
+            if action.action_type == ActionType.SUBMIT_QUERY:
+                if not action.query:
+                    raise ValueError("query is required for submit_query action")
+
+                self._state.current_query = action.query
+                query_result_raw = self._db.execute_query(action.query)
+
+                # Grade the result
+                actual_rows = query_result_raw.get("rows") if query_result_raw.get("success") else None
+
+                # Use custom grader for medium task
+                if self.task.task_id == "medium_logic_fix":
+                    grade_score = MediumTaskGrader.grade(actual_rows or [])
+                else:
+                    grade_score = self.task.grade(actual_rows)
+
+                if grade_score > self._state.best_score_so_far:
+                    self._state.best_score_so_far = grade_score
+
+            elif action.action_type == ActionType.INSPECT_SCHEMA:
+                schema = self._db.get_schema()
+                schema_info = SchemaInfo(tables=schema)
+                grade_score = self._state.best_score_so_far
+
+            elif action.action_type == ActionType.INSPECT_ERROR:
+                # Return last error if available. Note the stored value may be
+                # None for a successful query, so fall back explicitly.
+                if self._state.action_history:
+                    last = self._state.action_history[-1]
+                    error_details = last.get("error_message") or "No error recorded from last query."
+                else:
+                    error_details = "No query has been submitted yet."
+                grade_score = self._state.best_score_so_far
+
+            elif action.action_type == ActionType.INSPECT_SAMPLE:
+                if not action.table_name:
+                    raise ValueError("table_name required for inspect_sample")
+                sample_rows = self._db.get_sample_rows(action.table_name)
+                grade_score = self._state.best_score_so_far
+
+            elif action.action_type == ActionType.RESET_QUERY:
+                self._state.current_query = self.task.broken_query
+                grade_score = self._state.best_score_so_far
+
+            # --- Compute reward ---
+            schema_tables = list(self._db.get_schema().keys())
+            reward_obj = compute_reward(
+                action_type=action.action_type.value,
+                query_result=query_result_raw,
+                grade_score=grade_score,
+                steps_taken=steps_taken,
+                max_steps=self.task.max_steps,
+                previous_best_score=prev_best_score,
+                schema_tables=schema_tables,
+                submitted_query=action.query if action.action_type == ActionType.SUBMIT_QUERY else None
+            )
+
+            # --- Check done conditions ---
+            is_done = False
+            success = False
+
+            if grade_score >= 0.95:
+                is_done = True
+                success = True
+            elif steps_taken >= self.task.max_steps:
+                is_done = True
+                success = self._state.best_score_so_far >= 0.5
+
+            self._state.is_done = is_done
+            self._state.success = success
+
+            # --- Hint logic ---
+            hint_threshold = 3 if self.task.difficulty == "easy" else 5
+            if steps_taken >= hint_threshold:
+                hint = self.task.hint
+
+            # --- Record history ---
+            self._state.action_history.append({
+                "step": steps_taken,
+                "action_type": action.action_type.value,
+                "query": action.query,
+                "grade_score": grade_score,
+                "reward": reward_obj.value,
+                "error_message": query_result_raw.get("error_message") if query_result_raw else None
+            })
+            self._state.reward_history.append(reward_obj.value)
+
+            # --- Build observation ---
+            qr = QueryResult(**query_result_raw) if query_result_raw else None
+
+            obs = SQLDebugObservation(
+                task_id=self.task.task_id,
+                task_description=self.task.description,
+                original_query=self.task.broken_query,
+                current_query=self._state.current_query,
+                expected_description=self.task.expected_output_description,
+                last_action_type=action.action_type.value,
+                last_query_result=qr,
+                steps_taken=steps_taken,
+                steps_remaining=max(0, self.task.max_steps - steps_taken),
+                current_score=self._state.best_score_so_far,
+                schema_info=schema_info,
+                error_details=error_details,
+                sample_rows=sample_rows,
+                hint=hint,
+                is_done=is_done,
+                success=success
+            )
+
+            return obs, reward_obj.value, is_done, {
+                "grade_score": grade_score,
+                "reward_breakdown": reward_obj.breakdown,
+                "success": success,
+                "steps_taken": steps_taken
+            }
+
+    def get_state(self) -> EpisodeState:
+        if self._state is None:
+            raise RuntimeError("Call reset() first")
+        return self._state
+
+    def close(self):
+        if self._db:
+            self._db.close()
+        self._db = None
+
+
server/main.py ADDED
@@ -0,0 +1,242 @@
+"""
+FastAPI server exposing the OpenEnv HTTP API.
+Endpoints: POST /reset, POST /step, GET /state
+Also includes: GET /tasks (list available tasks), GET /health
+"""
+import asyncio
+import time
+import statistics
+from typing import Dict, Optional
+from contextlib import asynccontextmanager
+
+from fastapi import FastAPI, HTTPException, Header
+from fastapi.middleware.cors import CORSMiddleware
+from pydantic import BaseModel
+
+from .models import SQLDebugAction, SQLDebugObservation, EpisodeState
+from .env import SQLDebugEnv, TASKS
+
+
+# Session management: one env instance per session
+# For HF Space: allow up to 64 concurrent sessions
+MAX_SESSIONS = 64
+_sessions: Dict[str, SQLDebugEnv] = {}
+_session_lock = asyncio.Lock()
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    yield
+    # Cleanup all sessions on shutdown
+    for env in _sessions.values():
+        env.close()
+
+
+app = FastAPI(
+    title="SQL Debug Environment",
+    description="OpenEnv-compliant SQL query debugging environment for RL agent training.",
+    version="0.1.0",
+    lifespan=lifespan
+)
+
+app.add_middleware(
+    CORSMiddleware,
+    allow_origins=["*"],
+    allow_methods=["*"],
+    allow_headers=["*"],
+)
+
+
+@app.get("/")
+async def root():
+    return {
+        "name": "sql-debug-env",
+        "status": "ok",
+        "message": "Use /health, /tasks, /reset, /step, /state, /benchmark",
+    }
+
+
+@app.get("/favicon.ico", status_code=204)
+async def favicon():
+    return None
+
+
+class ResetRequest(BaseModel):
+    task_id: Optional[str] = "easy_syntax_fix"
+
+
+class StepRequest(BaseModel):
+    action: SQLDebugAction
+
+
+async def get_or_create_session(session_id: str, task_id: str = "easy_syntax_fix") -> SQLDebugEnv:
+    async with _session_lock:
+        if session_id not in _sessions:
+            if len(_sessions) >= MAX_SESSIONS:
+                # Evict oldest session
+                oldest = next(iter(_sessions))
+                _sessions[oldest].close()
+                del _sessions[oldest]
+            _sessions[session_id] = SQLDebugEnv(task_id=task_id)
+        return _sessions[session_id]
+
+
+@app.get("/health")
+async def health():
+    return {"status": "ok", "sessions_active": len(_sessions)}
+
+
+@app.get("/tasks")
+async def list_tasks():
+    """List all available tasks with metadata."""
+    return {
+        "tasks": [task.to_dict() for task in TASKS.values()]
+    }
+
+
+def _stats(values: list[float]) -> Dict[str, float]:
+    ordered = sorted(values)
+    n = len(ordered)
+    p95_idx = max(0, int(n * 0.95) - 1)
+    return {
+        "avg_ms": round(statistics.mean(ordered), 3),
+        "p50_ms": round(statistics.median(ordered), 3),
+        "p95_ms": round(ordered[p95_idx], 3),
+        "n": n,
+    }
+
+
+@app.get("/benchmark")
+async def benchmark(runs: int = 20):
+    """
+    Real-time benchmark endpoint (fresh measurements on every call).
+    Safe to call from dashboards/web pages for live verification.
+    """
+    runs = max(1, min(runs, 100))
+
+    health_times: list[float] = []
+    tasks_times: list[float] = []
+    reset_times: list[float] = []
+    step_times: list[float] = []
+
+    bench_env = SQLDebugEnv(task_id="easy_syntax_fix")
+    try:
+        for _ in range(runs):
+            t0 = time.perf_counter()
+            _ = {"status": "ok", "sessions_active": len(_sessions)}
+            health_times.append((time.perf_counter() - t0) * 1000)
+
+            t0 = time.perf_counter()
+            _ = [task.to_dict() for task in TASKS.values()]
+            tasks_times.append((time.perf_counter() - t0) * 1000)
+
+            t0 = time.perf_counter()
+            await bench_env.reset()
+            reset_times.append((time.perf_counter() - t0) * 1000)
+
+            t0 = time.perf_counter()
+            await bench_env.step(SQLDebugAction(action_type="inspect_schema"))
+            step_times.append((time.perf_counter() - t0) * 1000)
+    finally:
+        bench_env.close()
+
+    return {
+        "benchmark": {
+            "runs": runs,
+            "task_id": "easy_syntax_fix",
+            "timestamp_epoch_ms": int(time.time() * 1000),
+            "results": {
+                "health": _stats(health_times),
+                "tasks": _stats(tasks_times),
+                "reset": _stats(reset_times),
+                "step_inspect_schema": _stats(step_times),
+            },
+        }
+    }
+
+
+@app.post("/reset")
+async def reset(
+    request: ResetRequest = ResetRequest(),
+    x_session_id: Optional[str] = Header(default=None)
+):
+    """
+    Reset the environment for a new episode.
+
+    Returns initial observation with task description and broken query.
+    """
+    session_id = x_session_id or "default"
+    task_id = request.task_id or "easy_syntax_fix"
+
+    if task_id not in TASKS:
+        raise HTTPException(status_code=400, detail=f"Unknown task_id: {task_id}. Valid: {list(TASKS.keys())}")
+
+    # Always create fresh env on reset
+    async with _session_lock:
+        if session_id in _sessions:
+            _sessions[session_id].close()
+        _sessions[session_id] = SQLDebugEnv(task_id=task_id)
+
+    env = _sessions[session_id]
+    observation, info = await env.reset()
+
+    return {
+        "observation": observation.model_dump(),
+        "info": info,
+        "reward": None,
+        "done": False
+    }
+
+
+@app.post("/step")
+async def step(
+    request: StepRequest,
+    x_session_id: Optional[str] = Header(default=None)
+):
+    """
+    Execute one action in the environment.
+
+    Action types:
+    - submit_query: Submit SQL for evaluation (requires 'query' field)
+    - inspect_schema: Get table schema (free action)
+    - inspect_error: Get last error message (free action)
+    - inspect_sample: Get sample rows from table (requires 'table_name')
+    - reset_query: Reset to original broken query (small penalty)
+    """
+    session_id = x_session_id or "default"
+
+    if session_id not in _sessions:
+        raise HTTPException(status_code=400, detail="Session not found. Call /reset first.")
+
+    env = _sessions[session_id]
+
+    try:
+        observation, reward, done, info = await env.step(request.action)
+    except RuntimeError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+    except ValueError as e:
+        raise HTTPException(status_code=422, detail=str(e))
+
+    return {
+        "observation": observation.model_dump(),
+        "reward": reward,
+        "done": done,
+        "info": info
+    }
+
+
+@app.get("/state")
+async def state(x_session_id: Optional[str] = Header(default=None)):
+    """Return current full episode state."""
+    session_id = x_session_id or "default"
+
+    if session_id not in _sessions:
+        raise HTTPException(status_code=400, detail="No active session. Call /reset first.")
+
+    env = _sessions[session_id]
+    try:
+        current_state = env.get_state()
+        return current_state.model_dump()
+    except RuntimeError as e:
+        raise HTTPException(status_code=400, detail=str(e))
+
server/models.py ADDED
@@ -0,0 +1,138 @@
+"""
+Typed Pydantic models for the SQL Debug Environment.
+Implements the OpenEnv spec: Observation, Action, Reward.
+"""
+from typing import Optional, List, Dict, Any
+from pydantic import BaseModel, Field
+from enum import Enum
+
+
+class ActionType(str, Enum):
+    SUBMIT_QUERY = "submit_query"      # Submit a fixed SQL query for evaluation
+    INSPECT_SCHEMA = "inspect_schema"  # Request schema info (costs 0 reward, gives info)
+    INSPECT_ERROR = "inspect_error"    # Request error details (costs 0, gives stack trace)
+    INSPECT_SAMPLE = "inspect_sample"  # Request 3 sample rows from a table
+    RESET_QUERY = "reset_query"        # Reset to the original broken query (costs -0.05 penalty)
+
+
+class SQLDebugAction(BaseModel):
+    """
+    Action model for the SQL Debug Environment.
+
+    The agent can either:
+    - submit_query: Submit a fixed SQL string for evaluation
+    - inspect_schema: Get table schema info (free action, no reward change)
+    - inspect_error: Get detailed error message from last query run
+    - inspect_sample: Get sample rows from a specified table
+    - reset_query: Go back to original broken query (costs -0.05 penalty)
+    """
+    action_type: ActionType = Field(
+        description="Type of action to take"
+    )
+    query: Optional[str] = Field(
+        default=None,
+        description="SQL query string. Required when action_type is 'submit_query'."
+    )
+    table_name: Optional[str] = Field(
+        default=None,
+        description="Table name. Required when action_type is 'inspect_sample'."
+    )
+
+    class Config:
+        json_schema_extra = {
+            "example": {
+                "action_type": "submit_query",
+                "query": "SELECT u.name, COUNT(o.id) as order_count FROM users u LEFT JOIN orders o ON u.id = o.user_id GROUP BY u.id, u.name ORDER BY order_count DESC"
+            }
+        }
+
+
+class QueryResult(BaseModel):
+    """Result of executing a SQL query."""
+    success: bool
+    rows: Optional[List[Dict[str, Any]]] = None
+    row_count: Optional[int] = None
+    error_message: Optional[str] = None
+    execution_time_ms: Optional[float] = None
+
+
+class SchemaInfo(BaseModel):
+    """Database schema information."""
+    tables: Dict[str, List[Dict[str, str]]]  # table_name -> list of {name, type, nullable}
+    sample_data: Optional[Dict[str, List[Dict[str, Any]]]] = None
+
+
+class SQLDebugObservation(BaseModel):
+    """
+    Observation returned after each step.
+
+    Contains the current state of the debugging session:
+    - The original broken query (always visible)
+    - The agent's current best query
+    - Result of last action
+    - Progress indicators
+    - Schema/error info if requested
+    """
+    task_id: str = Field(description="Current task identifier")
+    task_description: str = Field(description="Natural language description of the bug to fix")
+    original_query: str = Field(description="The original broken SQL query")
+    current_query: Optional[str] = Field(default=None, description="Agent's last submitted query")
+    expected_description: str = Field(description="Description of what the correct output should look like")
+
+    # Last action result
+    last_action_type: str
+    last_query_result: Optional[QueryResult] = None
+
+    # Progress
+    steps_taken: int
+    steps_remaining: int
+    current_score: float = Field(description="Current score 0.0-1.0 for this episode")
+
+    # Contextual help (populated based on action type)
+    schema_info: Optional[SchemaInfo] = None
+    error_details: Optional[str] = None
+    sample_rows: Optional[List[Dict[str, Any]]] = None
+
+    # Hints (unlocked after step 3 on easy, step 5 on medium/hard)
+    hint: Optional[str] = None
+
+    # Episode status
+    is_done: bool = False
+    success: bool = False
+
+
+class SQLDebugReward(BaseModel):
+    """
+    Reward signal for the SQL Debug Environment.
+
+    Reward components (all sum to final reward):
+    - correctness: 0.0-0.6 based on row-level match vs expected output
+    - efficiency: 0.0-0.2 bonus for solving in fewer steps
+    - syntax_progress: 0.0-0.1 for getting a syntactically valid query (even if wrong)
+    - schema_bonus: 0.0-0.1 for queries that reference correct tables/columns
+    - penalties: negative values for reset_query, infinite loops, destructive SQL
+    """
+    value: float = Field(ge=0.0, le=1.0, description="Total reward for this step")
+    correctness: float = Field(ge=0.0, le=0.6)
+    efficiency: float = Field(ge=0.0, le=0.2)
+    syntax_progress: float = Field(ge=0.0, le=0.1)
+    schema_bonus: float = Field(ge=0.0, le=0.1)
+    penalty: float = Field(ge=0.0, le=0.2, description="Penalty deduction magnitude (non-negative)")
+    breakdown: str = Field(description="Human-readable reward breakdown")
+
+
+class EpisodeState(BaseModel):
+    """Full internal state of an episode. Used by state() endpoint."""
+    task_id: str
+    task_difficulty: str
+    original_query: str
+    current_query: Optional[str]
+    best_score_so_far: float
+    steps_taken: int
+    max_steps: int
+    action_history: List[Dict[str, Any]]
+    reward_history: List[float]
+    is_done: bool
+    success: bool
+    db_schema: Dict[str, Any]
+
+
server/reward.py ADDED
@@ -0,0 +1,125 @@
+"""
+Reward function for the SQL Debug Environment.
+
+Reward is computed at every step (not just end of episode).
+This provides dense, meaningful signal for RL training.
+
+Reward components:
+- correctness: 0.0–0.6 (row-level match vs expected)
+- efficiency: 0.0–0.2 (bonus for solving quickly)
+- syntax_progress: 0.0–0.1 (valid SQL even if wrong content)
+- schema_bonus: 0.0–0.1 (correct tables/columns referenced)
+- penalty: 0.0 to 0.2 (deduction for bad actions)
+
+Total range: 0.0 to 1.0 (clamped to [0.0, 1.0])
+"""
+from typing import Optional, List, Dict, Any
+from .models import SQLDebugReward
+
+
+def compute_reward(
+    action_type: str,
+    query_result: Optional[Dict[str, Any]],
+    grade_score: float,
+    steps_taken: int,
+    max_steps: int,
+    previous_best_score: float,
+    schema_tables: List[str],
+    submitted_query: Optional[str] = None,
+) -> SQLDebugReward:
+    """
+    Compute the full reward for a step.
+
+    Args:
+        action_type: The action taken this step
+        query_result: Result dict from EpisodeDatabase.execute_query()
+        grade_score: 0.0-1.0 score from task grader
+        steps_taken: How many steps have been used (1-indexed)
+        max_steps: Maximum steps for this task
+        previous_best_score: Best grade score seen so far
+        schema_tables: List of valid table names in this task's DB
+        submitted_query: The SQL query string (if action was submit_query)
+    """
+
+    correctness = 0.0
+    efficiency = 0.0
+    syntax_progress = 0.0
+    schema_bonus = 0.0
+    penalty = 0.0  # deduction magnitude (non-negative)
+
+    if action_type == "submit_query":
+        # Correctness: primary signal
+        correctness = min(0.6, grade_score * 0.6)
+
+        # Syntax progress: reward for at least getting a valid query
+        if query_result and query_result.get("success"):
+            syntax_progress = 0.1
+        elif query_result and not query_result.get("success"):
+            # Partially reward if it's getting closer (fewer errors)
+            error = query_result.get("error_message", "")
+            if "no such column" in error.lower():
+                syntax_progress = 0.03  # Structure is right but wrong column
+            elif "no such table" in error.lower():
+                syntax_progress = 0.01
+            else:
+                syntax_progress = 0.0
+
+        # Schema bonus: correct table references
+        if submitted_query and schema_tables:
+            query_upper = submitted_query.upper()
+            tables_referenced = sum(
+                1 for t in schema_tables if t.upper() in query_upper
+            )
+            schema_bonus = min(0.1, (tables_referenced / len(schema_tables)) * 0.1)
+
+        # Efficiency bonus: reward solving with fewer steps
+        if grade_score >= 0.95:  # Near-perfect solution
+            steps_fraction = steps_taken / max_steps
+            if steps_fraction <= 0.3:
+                efficiency = 0.2
+            elif steps_fraction <= 0.5:
+                efficiency = 0.15
+            elif steps_fraction <= 0.7:
+                efficiency = 0.1
+            else:
+                efficiency = 0.05
+
+        # Penalty: if score went DOWN from previous best (regressed)
+        if grade_score < previous_best_score - 0.1:
+            penalty = 0.05
+
+    elif action_type == "reset_query":
+        # Penalize resetting — agent should be making progress
+        penalty = 0.05
+
+    elif action_type in ("inspect_schema", "inspect_error", "inspect_sample"):
+        # Free information actions — small positive for using schema info
+        # (encourages agents to explore rather than blindly guess)
+        syntax_progress = 0.01
+
+    # Penalty: approaching step limit (urgency signal)
+    steps_remaining = max_steps - steps_taken
+    if steps_remaining <= 2 and grade_score < 0.5:
+        penalty += 0.03
+
+    total_raw = correctness + efficiency + syntax_progress + schema_bonus - penalty
+    total = round(max(0.0, min(1.0, total_raw)), 4)
+
+    # Note: penalty is subtracted, so the breakdown shows it with a minus sign.
+    breakdown = (
+        f"correctness={correctness:.3f} + "
+        f"efficiency={efficiency:.3f} + "
+        f"syntax={syntax_progress:.3f} + "
+        f"schema={schema_bonus:.3f} - "
+        f"penalty={penalty:.3f} = {total:.4f}"
+    )
+
+    return SQLDebugReward(
+        value=total,
+        correctness=correctness,
+        efficiency=efficiency,
+        syntax_progress=syntax_progress,
+        schema_bonus=schema_bonus,
+        penalty=penalty,
+        breakdown=breakdown
+    )
+
+
server/tasks/__init__.py ADDED
@@ -0,0 +1,2 @@
+# sql-debug-env
+
server/tasks/base.py ADDED
@@ -0,0 +1,169 @@
1
+ """Base class for all SQL Debug tasks."""
2
+ from abc import ABC, abstractmethod
3
+ from typing import Dict, Any, List, Optional, Tuple
4
+
5
+
6
+ class BaseTask(ABC):
7
+ """
8
+ Abstract base for all tasks.
9
+
10
+ Each task defines:
11
+ - A broken SQL query (the one the agent must fix)
12
+ - A database schema (SQLite CREATE TABLE statements)
13
+ - Seed data (INSERT statements, deterministic)
14
+ - Expected output (what the correct query should return)
15
+ - A grader (compares agent output vs expected)
16
+ - Metadata (id, name, difficulty, description, hint)
17
+ """
18
+
19
+ @property
20
+ @abstractmethod
21
+ def task_id(self) -> str:
22
+ pass
23
+
24
+ @property
25
+ @abstractmethod
26
+ def name(self) -> str:
27
+ pass
28
+
29
+ @property
30
+ @abstractmethod
31
+ def difficulty(self) -> str:
32
+ pass # "easy", "medium", "hard"
33
+
34
+ @property
35
+ @abstractmethod
36
+ def description(self) -> str:
37
+ """Natural language description given to the agent."""
38
+ pass
39
+
40
+ @property
41
+ @abstractmethod
42
+ def expected_output_description(self) -> str:
43
+ """Describes what the correct output looks like."""
44
+ pass
45
+
46
+ @property
47
+ @abstractmethod
48
+ def broken_query(self) -> str:
49
+ """The SQL query with bugs that the agent must fix."""
50
+ pass
51
+
52
+ @property
53
+ @abstractmethod
54
+ def schema_sql(self) -> str:
55
+ """SQLite CREATE TABLE statements."""
56
+ pass
57
+
58
+ @property
59
+ @abstractmethod
60
+ def seed_data_sql(self) -> str:
61
+ """INSERT statements for deterministic test data."""
62
+ pass
63
+
64
+ @property
65
+ @abstractmethod
66
+ def expected_output(self) -> List[Dict[str, Any]]:
67
+ """
68
+ The exact rows the correct query should return.
69
+ Used by the grader to score the agent's output.
70
+ Must be deterministic and match seed_data_sql exactly.
71
+ """
72
+ pass
73
+
74
+ @property
75
+ def hint(self) -> str:
76
+ """Optional hint shown after N steps. Override in subclass."""
77
+ return ""
78
+
79
+ @property
80
+ def max_steps(self) -> int:
81
+ """Maximum steps for this task."""
82
+ return {"easy": 10, "medium": 20, "hard": 30}.get(self.difficulty, 20)
83
+
84
+ def grade(self, actual_rows: Optional[List[Dict[str, Any]]]) -> float:
85
+ """
86
+ Grade the agent's query output vs expected output.
87
+ Returns a score 0.0-1.0.
88
+
89
+ Scoring:
90
+ - 1.0: exact match (correct rows, correct order if ORDER BY expected)
91
+ - 0.5-0.9: partial match (subset of correct rows, or wrong order)
92
+ - 0.1-0.4: syntactically valid but wrong content
93
+ - 0.0: null result, syntax error, or empty when non-empty expected
94
+ """
95
+ if not actual_rows:
96
+ return 0.0
97
+
98
+ expected = self.expected_output
99
+
100
+ if not expected:
101
+ # Expected empty result
102
+ return 1.0 if len(actual_rows) == 0 else 0.0
103
+
104
+ # Exact row count match
105
+ if len(actual_rows) != len(expected):
106
+ # Partial credit for getting some rows right
107
+ overlap = self._count_matching_rows(actual_rows, expected)
108
+ return round(min(0.5, overlap / max(len(expected), 1) * 0.5), 3)
109
+
110
+ # Check row-by-row match (order-sensitive if task requires it)
111
+ matching = self._count_matching_rows(actual_rows, expected)
112
+ score = matching / len(expected)
113
+
114
+ # Check column names match
115
+ if actual_rows and expected:
116
+ actual_cols = set(actual_rows[0].keys())
117
+ expected_cols = set(expected[0].keys())
118
+ if actual_cols != expected_cols:
119
+ score *= 0.7 # Penalty for wrong columns
120
+
121
+ return round(score, 3)
122
+
123
+ def _count_matching_rows(
124
+ self,
125
+ actual: List[Dict[str, Any]],
126
+ expected: List[Dict[str, Any]]
127
+ ) -> int:
128
+ """Count how many actual rows match expected rows (normalized comparison)."""
129
+ matches = 0
130
+ expected_normalized = [self._normalize_row(r) for r in expected]
131
+
132
+ for i, actual_row in enumerate(actual):
133
+ actual_norm = self._normalize_row(actual_row)
134
+ if i < len(expected_normalized):
135
+ # Positional match (respects ORDER BY)
136
+ if actual_norm == expected_normalized[i]:
137
+ matches += 1
138
+ else:
139
+ # Extra rows don't count
140
+ break
141
+
142
+ return matches
143
+
144
+ def _normalize_row(self, row: Dict[str, Any]) -> Dict[str, Any]:
145
+ """Normalize a row for comparison: lowercase keys, string-normalize values."""
146
+ normalized = {}
147
+ for k, v in row.items():
148
+ key = k.lower().strip()
149
+ if isinstance(v, float):
150
+ val = round(v, 2)
151
+ elif isinstance(v, str):
152
+ val = v.strip()
153
+ else:
154
+ val = v
155
+ normalized[key] = val
156
+ return normalized
157
+
158
+ def to_dict(self) -> Dict[str, Any]:
159
+ return {
160
+ "task_id": self.task_id,
161
+ "name": self.name,
162
+ "difficulty": self.difficulty,
163
+ "description": self.description,
164
+ "expected_output_description": self.expected_output_description,
165
+ "broken_query": self.broken_query,
166
+ "max_steps": self.max_steps,
167
+ "hint": self.hint
168
+ }
169
+
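The grading scheme above (normalize each row, then count positional matches for partial credit) can be exercised in isolation. The sketch below is a trimmed re-statement of the same logic, not the `BaseTask` code itself, with the two key behaviors called out in comments:

```python
from typing import Any, Dict, List


def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    # Lowercase/strip keys, round floats to 2 dp, strip strings.
    out = {}
    for k, v in row.items():
        if isinstance(v, float):
            v = round(v, 2)
        elif isinstance(v, str):
            v = v.strip()
        out[k.lower().strip()] = v
    return out


def count_matching_rows(actual: List[Dict], expected: List[Dict]) -> int:
    # Positional comparison: stops at the first mismatch, so ORDER BY matters.
    exp = [normalize_row(r) for r in expected]
    matches = 0
    for i, row in enumerate(actual):
        if i < len(exp) and normalize_row(row) == exp[i]:
            matches += 1
        else:
            break
    return matches


expected = [{"name": "Alice", "total": 1947.5}, {"name": "Carol", "total": 740.0}]
# Key case and float noise are normalized away, so this first row still matches.
print(count_matching_rows([{"NAME ": "Alice", "total": 1947.501}], expected))  # 1
# A correct row in the wrong position scores nothing.
print(count_matching_rows([{"name": "Carol", "total": 740.0}], expected))      # 0
```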
server/tasks/task_easy.py ADDED
@@ -0,0 +1,157 @@
1
+ """
2
+ TASK 1 — EASY: Syntax Error Fix
3
+ Difficulty: Easy
4
+ Bug type: Simple syntax errors (typo in keyword, missing alias, wrong column name)
5
+ Max steps: 10
6
+ Expected baseline model score: 0.8-1.0
7
+ """
8
+ from typing import List, Dict, Any
9
+ from .base import BaseTask
10
+
11
+
12
+ class EasyTask(BaseTask):
13
+ """
14
+ Scenario: An e-commerce company wants to find the top 5 customers
15
+ by total order value. The query has a syntax error:
16
+ uses 'GRUP BY' instead of 'GROUP BY' and references wrong column alias.
17
+
18
+ Database: customers, orders, order_items
19
+ Bug 1: 'GRUP BY' typo
20
+ Bug 2: ORDER BY references 'total' but SELECT aliases it as 'total_value'
21
+ """
22
+
23
+ @property
24
+ def task_id(self) -> str:
25
+ return "easy_syntax_fix"
26
+
27
+ @property
28
+ def name(self) -> str:
29
+ return "Top Customers by Revenue — Syntax Error Fix"
30
+
31
+ @property
32
+ def difficulty(self) -> str:
33
+ return "easy"
34
+
35
+ @property
36
+ def description(self) -> str:
37
+ return """You are debugging a SQL query for an e-commerce analytics dashboard.
38
+
39
+ The query is supposed to find the top 5 customers by their total order value
40
+ (sum of quantity * unit_price across all their orders).
41
+
42
+ The query has 2 syntax/reference bugs that prevent it from running:
43
+ 1. A typo in a SQL keyword
44
+ 2. An ORDER BY clause that references a column alias incorrectly
45
+
46
+ Fix both bugs so the query runs and returns the correct result.
47
+
48
+ The result should show: customer_name, total_value (rounded to 2 decimal places),
49
+ ordered from highest to lowest, top 5 only."""
50
+
51
+ @property
52
+ def expected_output_description(self) -> str:
53
+ return "5 rows: customer_name, total_value (DESC order). Alice Chen should be first with 1947.50."
54
+
55
+ @property
56
+ def broken_query(self) -> str:
57
+ return """SELECT
58
+ c.name AS customer_name,
59
+ ROUND(SUM(oi.quantity * oi.unit_price), 2) AS total_value
60
+ FROM customers c
61
+ JOIN orders o ON c.id = o.customer_id
62
+ JOIN order_items oi ON o.id = oi.order_id
63
+ GRUP BY c.id, c.name
64
+ ORDER BY total DESC
65
+ LIMIT 5"""
66
+
67
+ @property
68
+ def schema_sql(self) -> str:
69
+ return """
70
+ CREATE TABLE customers (
71
+ id INTEGER PRIMARY KEY,
72
+ name TEXT NOT NULL,
73
+ email TEXT UNIQUE NOT NULL,
74
+ created_at TEXT DEFAULT CURRENT_TIMESTAMP
75
+ );
76
+
77
+ CREATE TABLE orders (
78
+ id INTEGER PRIMARY KEY,
79
+ customer_id INTEGER NOT NULL,
80
+ order_date TEXT NOT NULL,
81
+ status TEXT DEFAULT 'completed',
82
+ FOREIGN KEY (customer_id) REFERENCES customers(id)
83
+ );
84
+
85
+ CREATE TABLE order_items (
86
+ id INTEGER PRIMARY KEY,
87
+ order_id INTEGER NOT NULL,
88
+ product_name TEXT NOT NULL,
89
+ quantity INTEGER NOT NULL,
90
+ unit_price REAL NOT NULL,
91
+ FOREIGN KEY (order_id) REFERENCES orders(id)
92
+ )"""
93
+
94
+ @property
95
+ def seed_data_sql(self) -> str:
96
+ return """
97
+ INSERT INTO customers VALUES (1,'Alice Chen','alice@example.com','2023-01-01');
98
+ INSERT INTO customers VALUES (2,'Bob Kumar','bob@example.com','2023-01-05');
99
+ INSERT INTO customers VALUES (3,'Carol White','carol@example.com','2023-01-10');
100
+ INSERT INTO customers VALUES (4,'David Park','david@example.com','2023-02-01');
101
+ INSERT INTO customers VALUES (5,'Eva Rodriguez','eva@example.com','2023-02-15');
102
+ INSERT INTO customers VALUES (6,'Frank Liu','frank@example.com','2023-03-01');
103
+
104
+ INSERT INTO orders VALUES (1,1,'2023-06-01','completed');
105
+ INSERT INTO orders VALUES (2,1,'2023-07-15','completed');
106
+ INSERT INTO orders VALUES (3,2,'2023-06-10','completed');
107
+ INSERT INTO orders VALUES (4,3,'2023-06-20','completed');
108
+ INSERT INTO orders VALUES (5,3,'2023-08-01','completed');
109
+ INSERT INTO orders VALUES (6,4,'2023-07-01','completed');
110
+ INSERT INTO orders VALUES (7,5,'2023-07-20','completed');
111
+ INSERT INTO orders VALUES (8,5,'2023-08-10','completed');
112
+ INSERT INTO orders VALUES (9,6,'2023-09-01','completed');
113
+
114
+ INSERT INTO order_items VALUES (1,1,'Laptop',1,1200.00);
115
+ INSERT INTO order_items VALUES (2,1,'Mouse',2,25.00);
116
+ INSERT INTO order_items VALUES (3,2,'Keyboard',1,150.00);
117
+ INSERT INTO order_items VALUES (4,2,'Monitor',1,450.00);
118
+ INSERT INTO order_items VALUES (5,2,'Webcam',1,97.50);
119
+ INSERT INTO order_items VALUES (6,3,'Headphones',1,350.00);
120
+ INSERT INTO order_items VALUES (7,3,'USB Hub',2,45.00);
121
+ INSERT INTO order_items VALUES (8,4,'Tablet',1,600.00);
122
+ INSERT INTO order_items VALUES (9,4,'Case',1,35.00);
123
+ INSERT INTO order_items VALUES (10,5,'Charger',2,30.00);
124
+ INSERT INTO order_items VALUES (11,5,'Cable',3,15.00);
125
+ INSERT INTO order_items VALUES (12,6,'Desk Lamp',1,85.00);
126
+ INSERT INTO order_items VALUES (13,6,'Chair Mat',1,60.00);
127
+ INSERT INTO order_items VALUES (14,7,'Speakers',1,220.00);
128
+ INSERT INTO order_items VALUES (15,7,'Microphone',1,180.00);
129
+ INSERT INTO order_items VALUES (16,8,'Webcam',1,97.50);
130
+ INSERT INTO order_items VALUES (17,9,'Monitor',1,450.00)"""
131
+
132
+ @property
133
+ def expected_output(self) -> List[Dict[str, Any]]:
134
+ # Alice (orders 1,2): 1200 + 2*25 + 150 + 450 + 97.50 = 1947.50
+ # Carol (orders 4,5): 600 + 35 + 2*30 + 3*15 = 740.00
+ # Eva (orders 7,8): 220 + 180 + 97.50 = 497.50
+ # Frank (order 9): 450.00
+ # Bob (order 3): 350 + 2*45 = 440.00
+ # David (order 6): 85 + 60 = 145.00, excluded by LIMIT 5
+ return [
+ {"customer_name": "Alice Chen", "total_value": 1947.50},
+ {"customer_name": "Carol White", "total_value": 740.00},
+ {"customer_name": "Eva Rodriguez", "total_value": 497.50},
+ {"customer_name": "Frank Liu", "total_value": 450.00},
+ {"customer_name": "Bob Kumar", "total_value": 440.00},
+ ]
153
+
154
+ @property
155
+ def hint(self) -> str:
156
+ return "Hint: Check every SQL keyword spelling carefully. Also check that your ORDER BY column name exactly matches the alias in your SELECT clause."
157
+
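The two bug classes in this task fail differently at runtime, which an agent can exploit via the error messages. A minimal sqlite3 sketch with an invented single-table schema (not the task's tables):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (name TEXT, v REAL)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("a", 2.0), ("a", 3.0), ("b", 1.0)])


def try_sql(sql):
    # Return fetched rows on success, or the error message on failure.
    try:
        return con.execute(sql).fetchall()
    except sqlite3.OperationalError as e:
        return str(e)


# Bug 1 analogue: misspelled GROUP BY is a hard parse error
# (exact message depends on where the parser trips over the typo).
print(try_sql("SELECT name, SUM(v) AS total_value FROM t GRUP BY name"))

# Bug 2 analogue: ORDER BY names a nonexistent alias.
print(try_sql("SELECT name, SUM(v) AS total_value FROM t GROUP BY name ORDER BY total"))

# Both fixed: the query runs and the alias resolves.
rows = try_sql(
    "SELECT name, SUM(v) AS total_value FROM t GROUP BY name ORDER BY total_value DESC"
)
print(rows)  # [('a', 5.0), ('b', 1.0)]
```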
server/tasks/task_hard.py ADDED
@@ -0,0 +1,199 @@
1
+ """
2
+ TASK 3 — HARD: Multi-bug + Optimization
3
+ Difficulty: Hard
4
+ Bug types:
5
+ 1. Correlated subquery returns wrong scope
6
+ 2. Window function partition incorrect
7
+ 3. CTE has circular logic bug
8
+ 4. Off-by-one in date range
9
+ 5. Missing DISTINCT causing row duplication
10
+ Max steps: 30
11
+ Expected baseline model score: 0.0-0.3 (frontier models barely pass)
12
+ """
13
+ from typing import List, Dict, Any
14
+ from .base import BaseTask
15
+
16
+
17
+ class HardTask(BaseTask):
18
+ """
19
+ Scenario: SaaS product analytics — find users who:
20
+ 1. Signed up in Q1 2023 (Jan 1 – Mar 31)
21
+ 2. Made at least 2 purchases in their first 30 days
22
+ 3. Return their: user_id, username, signup_date,
23
+ first_purchase_date, days_to_first_purchase,
24
+ purchases_in_first_30_days, total_lifetime_value
25
+
26
+ Bugs:
27
+ 1. The q1_users CTE filters on the wrong column: besides the signup_date
28
+ range it also restricts p.purchase_date to Q1 2023, so whether a user
29
+ enters the cohort wrongly depends on when they purchased, not only on
30
+ when they signed up
31
+ 2. The window function for running total uses PARTITION BY user_id but
32
+ ORDER BY is missing — gives wrong cumulative values
33
+ 3. HAVING clause uses COUNT(*) but should use COUNT(DISTINCT purchase_id)
34
+ due to JOIN multiplication
35
+ 4. The subquery for first_purchase_date is not correlated properly
36
+ (missing WHERE p.user_id = u.id)
37
+ 5. days_to_first_purchase calculation uses wrong date subtraction direction
38
+ """
39
+
40
+ @property
41
+ def task_id(self) -> str:
42
+ return "hard_multi_bug"
43
+
44
+ @property
45
+ def name(self) -> str:
46
+ return "SaaS Cohort Activation Report — Multi-Bug Fix"
47
+
48
+ @property
49
+ def difficulty(self) -> str:
50
+ return "hard"
51
+
52
+ @property
53
+ def description(self) -> str:
54
+ return """You are debugging a SaaS product analytics query.
55
+
56
+ The query should identify "activated users": users who signed up in Q1 2023
57
+ AND made at least 2 purchases within their first 30 days of signup.
58
+
59
+ For each activated user, return:
60
+ - user_id (INTEGER)
61
+ - username (TEXT)
62
+ - signup_date (TEXT, YYYY-MM-DD)
63
+ - first_purchase_date (TEXT, YYYY-MM-DD)
64
+ - days_to_first_purchase (INTEGER, how many days after signup they first purchased)
65
+ - purchases_in_first_30_days (INTEGER)
66
+ - total_lifetime_value (REAL, sum of all their purchases ever, rounded to 2 dp)
67
+
68
+ Results ordered by total_lifetime_value DESC.
69
+
70
+ The query has FIVE bugs — some are logic errors, one is a missing correlation
71
+ in a subquery, one is an incorrect window function, one causes row duplication.
72
+ You must find and fix all of them to get the correct result.
73
+
74
+ Q1 2023 = signup_date >= '2023-01-01' AND signup_date <= '2023-03-31'"""
75
+
76
+ @property
77
+ def expected_output_description(self) -> str:
78
+ return "2 rows: users who made 2+ purchases in first 30 days. Maya Torres first (higher LTV), then James Osei."
79
+
80
+ @property
81
+ def broken_query(self) -> str:
82
+ return """WITH q1_users AS (
83
+ SELECT DISTINCT u.id, u.username, u.signup_date
84
+ FROM users u
85
+ JOIN purchases p ON u.id = p.user_id
86
+ WHERE u.signup_date >= '2023-01-01'
87
+ AND u.signup_date <= '2023-03-31'
88
+ AND p.purchase_date <= '2023-03-31'
89
+ ),
90
+ user_purchase_stats AS (
91
+ SELECT
92
+ q.id AS user_id,
93
+ q.username,
94
+ q.signup_date,
95
+ (SELECT MIN(purchase_date) FROM purchases WHERE amount > 0) AS first_purchase_date,
96
+ COUNT(*) AS purchases_in_first_30_days,
97
+ SUM(SUM(p.amount)) OVER (PARTITION BY q.id) AS total_lifetime_value
98
+ FROM q1_users q
99
+ JOIN purchases p ON q.id = p.user_id
100
+ WHERE julianday(p.purchase_date) - julianday(q.signup_date) <= 30
101
+ GROUP BY q.id, q.username, q.signup_date
102
+ )
103
+ SELECT
104
+ user_id,
105
+ username,
106
+ signup_date,
107
+ first_purchase_date,
108
+ CAST(julianday(q1_users.signup_date) - julianday(first_purchase_date) AS INTEGER) AS days_to_first_purchase,
109
+ purchases_in_first_30_days,
110
+ ROUND(total_lifetime_value, 2) AS total_lifetime_value
111
+ FROM user_purchase_stats
112
+ WHERE purchases_in_first_30_days >= 2
113
+ ORDER BY total_lifetime_value DESC"""
114
+
115
+ @property
116
+ def schema_sql(self) -> str:
117
+ return """
118
+ CREATE TABLE users (
119
+ id INTEGER PRIMARY KEY,
120
+ username TEXT NOT NULL,
121
+ email TEXT UNIQUE,
122
+ signup_date TEXT NOT NULL,
123
+ plan TEXT DEFAULT 'free'
124
+ );
125
+
126
+ CREATE TABLE purchases (
127
+ id INTEGER PRIMARY KEY,
128
+ user_id INTEGER NOT NULL,
129
+ product_name TEXT NOT NULL,
130
+ amount REAL NOT NULL,
131
+ purchase_date TEXT NOT NULL,
132
+ FOREIGN KEY (user_id) REFERENCES users(id)
133
+ )"""
134
+
135
+ @property
136
+ def seed_data_sql(self) -> str:
137
+ return """
138
+ INSERT INTO users VALUES (1,'maya_torres','maya@ex.com','2023-01-15','pro');
139
+ INSERT INTO users VALUES (2,'james_osei','james@ex.com','2023-02-10','pro');
140
+ INSERT INTO users VALUES (3,'sophie_liang','sophie@ex.com','2023-03-05','free');
141
+ INSERT INTO users VALUES (4,'raj_mehta','raj@ex.com','2023-06-01','free');
142
+ INSERT INTO users VALUES (5,'anna_kovacs','anna@ex.com','2022-12-20','pro');
143
+
144
+ -- Maya: 2 purchases in first 30 days (days 5 and 18), more later
145
+ INSERT INTO purchases VALUES (1,1,'Pro Plan',99.00,'2023-01-20');
146
+ INSERT INTO purchases VALUES (2,1,'Add-on Pack',29.00,'2023-02-02');
147
+ INSERT INTO purchases VALUES (3,1,'Pro Renewal',99.00,'2023-04-15');
148
+ INSERT INTO purchases VALUES (4,1,'Consulting',150.00,'2023-07-01');
149
+
150
+ -- James: 2 purchases in first 30 days (days 3 and 25)
151
+ INSERT INTO purchases VALUES (5,2,'Starter Plan',49.00,'2023-02-13');
152
+ INSERT INTO purchases VALUES (6,2,'Storage Add-on',19.00,'2023-03-07');
153
+ INSERT INTO purchases VALUES (7,2,'Starter Renewal',49.00,'2023-05-10');
154
+
155
+ -- Sophie: only 1 purchase in first 30 days (should NOT qualify)
156
+ INSERT INTO purchases VALUES (8,3,'Free Trial Upgrade',9.00,'2023-03-10');
157
+ INSERT INTO purchases VALUES (9,3,'Pro Plan',99.00,'2023-04-20');
158
+
159
+ -- Raj: signed up Q2, not Q1 (should NOT qualify)
160
+ INSERT INTO purchases VALUES (10,4,'Starter Plan',49.00,'2023-06-05');
161
+ INSERT INTO purchases VALUES (11,4,'Add-on',19.00,'2023-06-10');
162
+
163
+ -- Anna: signed up Q4 2022, not Q1 2023 (should NOT qualify)
164
+ INSERT INTO purchases VALUES (12,5,'Pro Plan',99.00,'2023-01-01');
165
+ INSERT INTO purchases VALUES (13,5,'Consulting',150.00,'2023-03-15')"""
166
+
167
+ @property
168
+ def expected_output(self) -> List[Dict[str, Any]]:
169
+ # Maya: signup 2023-01-15, first purchase 2023-01-20 (day 5)
170
+ # purchases in 30 days: Jan-20 (day5), Feb-02 (day18) = 2 ✓
171
+ # total LTV: 99+29+99+150 = 377
172
+ # James: signup 2023-02-10, first purchase 2023-02-13 (day 3)
173
+ # purchases in 30 days: Feb-13 (day3), Mar-07 (day25) = 2 ✓
174
+ # total LTV: 49+19+49 = 117
175
+ return [
176
+ {
177
+ "user_id": 1,
178
+ "username": "maya_torres",
179
+ "signup_date": "2023-01-15",
180
+ "first_purchase_date": "2023-01-20",
181
+ "days_to_first_purchase": 5,
182
+ "purchases_in_first_30_days": 2,
183
+ "total_lifetime_value": 377.00
184
+ },
185
+ {
186
+ "user_id": 2,
187
+ "username": "james_osei",
188
+ "signup_date": "2023-02-10",
189
+ "first_purchase_date": "2023-02-13",
190
+ "days_to_first_purchase": 3,
191
+ "purchases_in_first_30_days": 2,
192
+ "total_lifetime_value": 117.00
193
+ }
194
+ ]
195
+
196
+ @property
197
+ def hint(self) -> str:
198
+ return "Hint: There are 5 bugs total. Check: (1) the subquery for first_purchase_date needs a WHERE correlation, (2) the date subtraction direction for days_to_first_purchase, (3) COUNT(*) vs COUNT(DISTINCT) when JOINs can multiply rows, (4) window functions need ORDER BY for meaningful results, (5) the q1_users CTE may be filtering on the wrong table's date column."
199
+
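Two of the five bug classes can be shown in a few lines of SQLite: the date-subtraction direction (bug 5) and JOIN row multiplication inflating COUNT(*) (bug 3). The tables below are invented for illustration, not the task schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")

# Bug 5 analogue: subtraction direction. Signup 2023-01-15, purchase 2023-01-20.
# Subtracting purchase from signup yields a negative "days since" value.
wrong, right = con.execute(
    "SELECT CAST(julianday('2023-01-15') - julianday('2023-01-20') AS INTEGER),"
    "       CAST(julianday('2023-01-20') - julianday('2023-01-15') AS INTEGER)"
).fetchone()
print(wrong, right)  # -5 5

# Bug 3 analogue: a 1:N JOIN multiplies rows, so COUNT(*) overcounts;
# COUNT(DISTINCT id) recovers the true purchase count.
con.executescript("""
CREATE TABLE purchases (id INTEGER, user_id INTEGER);
CREATE TABLE events (user_id INTEGER);
INSERT INTO purchases VALUES (1, 1), (2, 1);
INSERT INTO events VALUES (1), (1), (1);
""")
naive, distinct = con.execute(
    "SELECT COUNT(*), COUNT(DISTINCT p.id) "
    "FROM purchases p JOIN events e ON p.user_id = e.user_id"
).fetchone()
print(naive, distinct)  # 6 2
```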
server/tasks/task_medium.py ADDED
@@ -0,0 +1,163 @@
1
+ """
2
+ TASK 2 — MEDIUM: Logic Error Fix
3
+ Difficulty: Medium
4
+ Bug types: Wrong JOIN type causing missing rows, a WHERE-clause date filter
5
+ that breaks outer-join semantics, aggregation over the wrong employee set
6
+ Max steps: 20
7
+ Expected baseline model score: 0.3-0.6
8
+ """
9
+ from typing import List, Dict, Any
10
+ from .base import BaseTask
11
+
12
+
13
+ class MediumTask(BaseTask):
14
+ """
15
+ Scenario: HR analytics team wants monthly headcount and average salary
16
+ by department for the current year, including departments with zero employees
17
+ (i.e., departments that exist but no one joined this year).
18
+
19
+ Bugs:
20
+ 1. Uses INNER JOIN instead of LEFT JOIN — excludes empty departments
21
+ 2. Uses AVG(salary) over all employees instead of only those who joined this year
22
+ 3. The date filter for 'this year' is applied in WHERE, breaking the LEFT JOIN
23
+ (it should live in the ON clause, or be wrapped in a CASE expression)
24
+ 4. GROUP BY missing department_id (ambiguous grouping)
25
+ """
26
+
27
+ @property
28
+ def task_id(self) -> str:
29
+ return "medium_logic_fix"
30
+
31
+ @property
32
+ def name(self) -> str:
33
+ return "Department Headcount Report — Logic Error Fix"
34
+
35
+ @property
36
+ def difficulty(self) -> str:
37
+ return "medium"
38
+
39
+ @property
40
+ def description(self) -> str:
41
+ return """You are debugging an HR analytics SQL query.
42
+
43
+ The query should produce a monthly department headcount report showing:
44
+ - department_name
45
+ - headcount: number of employees who joined IN 2023
46
+ - avg_salary: average salary of employees who joined IN 2023
47
+ - All departments must appear, even those with 0 new hires in 2023
48
+
49
+ The current query has 3 logic bugs:
50
+ 1. It uses the wrong JOIN type, which silently drops departments with no 2023 hires
51
+ 2. The WHERE clause on hire_date breaks the outer join semantics
52
+ 3. The AVG calculation includes employees from all years, not just 2023
53
+
54
+ Fix these logic errors. The result should be ordered by department_name ascending."""
55
+
56
+ @property
57
+ def expected_output_description(self) -> str:
58
+ return "4 rows (all departments), headcount=0 for 'Legal', correct avg_salary only from 2023 hires."
59
+
60
+ @property
61
+ def broken_query(self) -> str:
62
+ return """SELECT
63
+ d.name AS department_name,
64
+ COUNT(e.id) AS headcount,
65
+ ROUND(AVG(e.salary), 2) AS avg_salary
66
+ FROM departments d
67
+ INNER JOIN employees e ON d.id = e.department_id
68
+ WHERE strftime('%Y', e.hire_date) = '2023'
69
+ GROUP BY d.name
70
+ ORDER BY department_name ASC"""
71
+
72
+ @property
73
+ def schema_sql(self) -> str:
74
+ return """
75
+ CREATE TABLE departments (
76
+ id INTEGER PRIMARY KEY,
77
+ name TEXT NOT NULL,
78
+ budget REAL
79
+ );
80
+
81
+ CREATE TABLE employees (
82
+ id INTEGER PRIMARY KEY,
83
+ name TEXT NOT NULL,
84
+ department_id INTEGER NOT NULL,
85
+ salary REAL NOT NULL,
86
+ hire_date TEXT NOT NULL,
87
+ FOREIGN KEY (department_id) REFERENCES departments(id)
88
+ )"""
89
+
90
+ @property
91
+ def seed_data_sql(self) -> str:
92
+ return """
93
+ INSERT INTO departments VALUES (1,'Engineering',500000);
94
+ INSERT INTO departments VALUES (2,'Marketing',200000);
95
+ INSERT INTO departments VALUES (3,'Sales',300000);
96
+ INSERT INTO departments VALUES (4,'Legal',150000);
97
+
98
+ INSERT INTO employees VALUES (1,'Ana Lima',1,95000,'2023-03-15');
99
+ INSERT INTO employees VALUES (2,'Ben Sharma',1,102000,'2023-06-01');
100
+ INSERT INTO employees VALUES (3,'Chris Wang',1,88000,'2022-01-10');
101
+ INSERT INTO employees VALUES (4,'Diana Patel',2,72000,'2023-04-20');
102
+ INSERT INTO employees VALUES (5,'Erik Johnson',2,68000,'2022-11-05');
103
+ INSERT INTO employees VALUES (6,'Fatima Al-Hassan',3,55000,'2023-01-08');
104
+ INSERT INTO employees VALUES (7,'George Okafor',3,61000,'2023-07-22');
105
+ INSERT INTO employees VALUES (8,'Hannah Kim',3,58000,'2022-05-30');
106
+ INSERT INTO employees VALUES (9,'Ivan Petrov',1,91000,'2022-08-14')"""
107
+
108
+ @property
109
+ def expected_output(self) -> List[Dict[str, Any]]:
110
+ # Engineering 2023 hires: Ana 95000, Ben 102000 → count=2, avg=98500
111
+ # Marketing 2023 hires: Diana 72000 → count=1, avg=72000
112
+ # Sales 2023 hires: Fatima 55000, George 61000 → count=2, avg=58000
113
+ # Legal 2023 hires: none → count=0, avg=NULL
114
+ return [
115
+ {"department_name": "Engineering", "headcount": 2, "avg_salary": 98500.00},
116
+ {"department_name": "Legal", "headcount": 0, "avg_salary": None},
117
+ {"department_name": "Marketing", "headcount": 1, "avg_salary": 72000.00},
118
+ {"department_name": "Sales", "headcount": 2, "avg_salary": 58000.00},
119
+ ]
120
+
121
+ @property
122
+ def hint(self) -> str:
123
+ return "Hint: When you want ALL rows from the left table even when there's no match on the right, think about which JOIN type preserves those rows. Also, WHERE on a nullable column after a join changes join semantics — consider moving that condition."
124
+
125
+
126
+ class MediumTaskGrader:
127
+ """
128
+ Custom grader for medium task — handles NULL comparison.
129
+ """
130
+ @staticmethod
131
+ def grade(actual: List[Dict]) -> float:
132
+ if not actual or len(actual) != 4:
133
+ return 0.0
134
+
135
+ # Sort both by dept name for comparison
136
+ actual_sorted = sorted(actual, key=lambda r: r.get("department_name", ""))
137
+ expected = [
138
+ {"department_name": "Engineering", "headcount": 2, "avg_salary": 98500.00},
139
+ {"department_name": "Legal", "headcount": 0, "avg_salary": None},
140
+ {"department_name": "Marketing", "headcount": 1, "avg_salary": 72000.00},
141
+ {"department_name": "Sales", "headcount": 2, "avg_salary": 58000.00},
142
+ ]
143
+
144
+ matches = 0
145
+ for a, e in zip(actual_sorted, expected):
146
+ dept_ok = str(a.get("department_name","")).lower() == str(e["department_name"]).lower()
147
+ count_ok = int(a.get("headcount", -1)) == e["headcount"]
148
+
149
+ e_salary = e["avg_salary"]
150
+ a_salary = a.get("avg_salary")
151
+ if e_salary is None:
152
+ salary_ok = a_salary is None or a_salary == 0
153
+ else:
154
+ try:
155
+ salary_ok = abs(float(a_salary) - float(e_salary)) < 1.0
156
+ except (TypeError, ValueError):
157
+ salary_ok = False
158
+
159
+ if dept_ok and count_ok and salary_ok:
160
+ matches += 1
161
+
162
+ return round(matches / 4, 3)
163
+
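The join-semantics bug at the heart of this task is easy to demonstrate directly in SQLite. A minimal sketch with two invented tables (not the task's full seed data), contrasting a filter in WHERE against the same filter in the ON clause:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE departments (id INTEGER, name TEXT);
CREATE TABLE employees (department_id INTEGER, salary REAL, hire_date TEXT);
INSERT INTO departments VALUES (1, 'Engineering'), (2, 'Legal');
INSERT INTO employees VALUES (1, 95000, '2023-03-15'), (1, 88000, '2022-01-10');
""")

# Filter in WHERE: the NULL row produced for Legal fails the predicate,
# so the LEFT JOIN silently degrades to an INNER JOIN.
where_rows = con.execute("""
    SELECT d.name, COUNT(e.department_id)
    FROM departments d
    LEFT JOIN employees e ON d.id = e.department_id
    WHERE strftime('%Y', e.hire_date) = '2023'
    GROUP BY d.id, d.name ORDER BY d.name
""").fetchall()
print(where_rows)  # [('Engineering', 1)]

# Filter in ON: unmatched departments survive with a zero count.
on_rows = con.execute("""
    SELECT d.name, COUNT(e.department_id)
    FROM departments d
    LEFT JOIN employees e ON d.id = e.department_id
        AND strftime('%Y', e.hire_date) = '2023'
    GROUP BY d.id, d.name ORDER BY d.name
""").fetchall()
print(on_rows)  # [('Engineering', 1), ('Legal', 0)]
```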
tests/test_env.py ADDED
@@ -0,0 +1,44 @@
1
+ import asyncio
2
+ import unittest
3
+
4
+ from server.env import SQLDebugEnv
5
+ from server.models import SQLDebugAction, ActionType
6
+
7
+
8
+ class TestEnv(unittest.TestCase):
9
+ def test_reset_and_inspect_schema(self):
10
+ async def run():
11
+ env = SQLDebugEnv(task_id="easy_syntax_fix")
12
+ obs, info = await env.reset()
13
+ self.assertFalse(obs.is_done)
14
+
15
+ action = SQLDebugAction(action_type=ActionType.INSPECT_SCHEMA)
16
+ obs2, reward, done, info2 = await env.step(action)
17
+ self.assertFalse(done)
18
+ self.assertIsNotNone(obs2.schema_info)
19
+ self.assertGreaterEqual(reward, 0.0)
20
+
21
+ asyncio.run(run())
22
+
23
+ def test_submit_broken_query_does_not_finish(self):
24
+ async def run():
25
+ env = SQLDebugEnv(task_id="easy_syntax_fix")
26
+ obs, _ = await env.reset()
27
+
28
+ action = SQLDebugAction(
29
+ action_type=ActionType.SUBMIT_QUERY,
30
+ query=env.task.broken_query,
31
+ )
32
+ obs2, reward, done, _ = await env.step(action)
33
+
34
+ self.assertFalse(done)
35
+ self.assertLessEqual(reward, 0.2)
36
+ self.assertGreaterEqual(reward, -1.0)
37
+ self.assertEqual(obs2.current_query, env.task.broken_query)
38
+
39
+ asyncio.run(run())
40
+
41
+
42
+ if __name__ == "__main__":
43
+ unittest.main()
44
+
tests/test_graders.py ADDED
@@ -0,0 +1,46 @@
1
+ import unittest
2
+
3
+ from server.tasks.task_easy import EasyTask
4
+ from server.tasks.task_medium import MediumTask, MediumTaskGrader
5
+ from server.tasks.task_hard import HardTask
6
+
7
+
8
+ class TestGraders(unittest.TestCase):
9
+ def test_easy_grade_perfect(self):
10
+ task = EasyTask()
11
+ score = task.grade(task.expected_output)
12
+ self.assertAlmostEqual(score, 1.0, places=3)
13
+
14
+ def test_hard_grade_perfect(self):
15
+ task = HardTask()
16
+ score = task.grade(task.expected_output)
17
+ self.assertAlmostEqual(score, 1.0, places=3)
18
+
19
+ def test_easy_grade_empty(self):
20
+ task = EasyTask()
21
+ score = task.grade(None)
22
+ self.assertEqual(score, 0.0)
23
+
24
+ def test_medium_grader_perfect(self):
25
+ task = MediumTask()
26
+ score = MediumTaskGrader.grade(task.expected_output)
27
+ self.assertAlmostEqual(score, 1.0, places=3)
28
+
29
+ def test_medium_grader_partial(self):
30
+ # Flip one row's avg_salary so it no longer matches within tolerance.
31
+ task = MediumTask()
32
+ actual = [dict(r) for r in task.expected_output]
33
+
34
+ # Expected avg_salary is None for "Legal". Any non-None/non-zero value should fail.
35
+ for r in actual:
36
+ if r["department_name"] == "Legal":
37
+ r["avg_salary"] = 12345.0
38
+
39
+ score = MediumTaskGrader.grade(actual)
40
+ self.assertLess(score, 1.0)
41
+ self.assertAlmostEqual(score, 0.75, places=3)
42
+
43
+
44
+ if __name__ == "__main__":
45
+ unittest.main()
46
+
tests/test_reward.py ADDED
@@ -0,0 +1,51 @@
1
+ import unittest
2
+
3
+ from server.reward import compute_reward
4
+
5
+
6
+ class TestReward(unittest.TestCase):
7
+ def test_submit_query_perfect_reward(self):
8
+ reward = compute_reward(
9
+ action_type="submit_query",
10
+ query_result={"success": True},
11
+ grade_score=1.0,
12
+ steps_taken=1,
13
+ max_steps=10,
14
+ previous_best_score=0.0,
15
+ schema_tables=["t1", "t2"],
16
+ submitted_query="SELECT * FROM t1 JOIN t2",
17
+ )
18
+ self.assertAlmostEqual(reward.value, 1.0, places=4)
19
+
20
+ def test_reset_query_penalty(self):
21
+ reward = compute_reward(
22
+ action_type="reset_query",
23
+ query_result=None,
24
+ grade_score=0.0,
25
+ steps_taken=1,
26
+ max_steps=10,
27
+ previous_best_score=0.0,
28
+ schema_tables=[],
29
+ submitted_query=None,
30
+ )
31
+ self.assertAlmostEqual(reward.value, 0.0, places=4)
32
+
33
+ def test_inspect_schema_urgency_penalty(self):
34
+ # Make steps_remaining <= 2 and grade_score < 0.5 to trigger urgency penalty.
35
+ reward = compute_reward(
36
+ action_type="inspect_schema",
37
+ query_result=None,
38
+ grade_score=0.0,
39
+ steps_taken=8,
40
+ max_steps=9,
41
+ previous_best_score=0.0,
42
+ schema_tables=[],
43
+ submitted_query=None,
44
+ )
45
+ # syntax_progress=0.01, penalty=0.03 => total_raw=-0.02, clamped to 0.0
46
+ self.assertAlmostEqual(reward.value, 0.0, places=4)
47
+
48
+
49
+ if __name__ == "__main__":
50
+ unittest.main()
51
+
uv.lock ADDED
The diff for this file is too large to render. See raw diff