Eishaan committed on
Commit
6a32325
·
1 Parent(s): 71fa486

self push after fixing a few errors

README.md CHANGED
@@ -1,165 +1,164 @@
1
- ---
2
- title: SQL Migration Agent
3
- emoji: "\U0001F5C4\uFE0F"
4
- colorFrom: blue
5
- colorTo: purple
6
- sdk: docker
7
- pinned: false
8
- tags:
9
- - openenv
10
- ---
11
 
12
- # SQL Schema Migration Agent
13
 
14
- > **An OpenEnv environment for benchmarking autonomous database migration agents.**
15
- >
16
- > Built for the Meta x Hugging Face OpenEnv Hackathon.
17
 
18
- ---
 
 
 
 
19
 
20
- ## Why This Matters (Real-World Utility)
21
 
22
- Database schema migrations are among the most error-prone, high-stakes tasks in software engineering. Every production system faces them as application models evolve, yet they are extremely difficult to automate safely because data must be perfectly preserved.
23
-
24
- This environment trains AI agents to autonomously reconcile schema drift the exact way a real CI/CD pipeline would -- given a flawed current state and an ideal target state, the agent must compute and safely execute the transformation sequence using raw SQL.
25
-
26
- **Real-world analogues:** `Flyway`, `Liquibase`, Django `makemigrations`, `Terraform` state transitions. This environment models that exact problem, reduced to an agentic RL core.
27
-
28
- ---
29
-
30
- ## Evaluation Philosophy & Anti-Exploit Mechanics
31
-
32
- Unlike simplistic environments that merely string-match SQL schemas, this environment uses a **deep structural reconciliation grader** built specifically to prevent LLM gamification:
33
-
34
- 1. **Zero-Sum Exploit Protection:** Naive agents will often execute `DROP TABLE x; CREATE TABLE x (...)` to easily match the target schema, silently destroying all data. Our grader actively runs `SELECT COUNT(*)`, `SUM(id)`, and data-integrity fingerprinting. If a table's schema matches but the data is gone, the score is brutally clamped to `0.01`.
35
- 2. **PRAGMA Bypass Prevention:** The grader re-asserts `PRAGMA foreign_keys = ON` before every scoring pass, preventing agents from disabling FK constraints to cheat.
36
- 3. **Granular Partial Credit:** Multi-step migrations (like Task 7's 6-to-4 table consolidation) require 18+ steps. Binary pass/fail rewards provide zero learning signal. Our grader assigns fractional weights to individual FK constraints, data type coercions, and orphaned record audit logs, providing continuous RL reward gradients.
37
- 4. **Deterministic Adversarial Seeds:** Our injected data includes edge cases that break naive SQL: `O'Brien` (apostrophes), `$1,234.56` (comma+dollar coercion), orphaned foreign keys, NULL emails, and leading whitespace in emails.
38
-
39
- ---
 
 
 
 
 
 
 
 
 
40
 
41
  ## Tasks (2 Easy / 3 Medium / 2 Hard)
42
 
43
- | # | Name | Difficulty | Steps | Description |
44
  |---|------|-----------|-------|-------------|
45
- | 1 | `column-restructure` | Easy | 10 | Merge `first_name` + `last_name` into `full_name` without data loss. Adversarial: apostrophes (`O'Brien`), mid-caps (`McDonald`) |
46
- | 2 | `soft-delete-restoration` | Easy | 10 | Restore deleted products from `deletion_log`, add `is_deleted`/`deleted_at` columns. Adversarial: `stock=0` must not be confused with `is_deleted=1` |
47
- | 3 | `table-normalization` | Medium | 15 | Decompose flat `purchases` into `customers` + `orders` with FK. Adversarial: duplicate emails (x3), commas in item names |
48
- | 4 | `schema-version-merge` | Medium | 15 | Merge overlapping `products_v1` (TEXT prices) and `products_v2` (REAL prices) with conflict resolution and `source` tracking. Adversarial: `$XX.XX` coercion, NULL category, high ID=101 |
49
- | 5 | `multi-entity-extraction` | Medium | 15 | Decompose `sales_records` god-table into 3NF (5 tables) with 3 FKs and invalid data routing. Adversarial: leading whitespace email, empty email, comma in SKU |
50
- | 6 | `cascade-migration` | Hard | 20 | 4-table FK cascade: type coercion (`$90000` TEXT to `90000` INTEGER), orphan audit logging, NULL salary removal, full FK chain enforcement |
51
- | 7 | `dual-source-consolidation` | Hard | 20 | Merge 6 tables from two incompatible systems (Legacy CRM + Modern SaaS) into 4 unified tables with cross-system email dedup, currency coercion, orphan detection |
52
-
53
- ---
54
-
55
- ## Observation Space
56
-
57
- | Field | Type | Description |
58
- |-------|------|-------------|
59
- | `current_schema_sql` | `str` | Current database DDL extracted from `sqlite_master` |
60
- | `target_schema_sql` | `str` | Target DDL the agent must reach |
61
- | `last_execution_result` | `str` | Result of last SQL execution, or error message |
62
- | `step_number` | `int` | Current step count |
63
- | `migration_progress` | `float` | Current grader score [0.01-0.99] |
64
- | `task_name` | `str` | Name of the active task |
65
- | `done` | `bool` | Whether the episode has terminated |
66
- | `reward` | `float` | Step reward: score delta from previous step (can be negative) |
67
-
68
- ## Action Space
69
-
70
- | Field | Type | Description |
71
- |-------|------|-------------|
72
- | `sql_command` | `str` | Raw SQL statement to execute against the database |
73
- | `reasoning` | `str` | Chain-of-thought explanation (logged for review) |
74
- | `submit_final` | `bool` | Set `true` when migration is believed complete |
75
-
76
- ---
77
-
78
- ## Reward Function
79
-
80
- - **Step reward**: Delta between current and previous migration score. Strongly negative for destructive actions (e.g., wrong DROP TABLE leads to -0.4).
81
- - **Episode score**: Clamped to (0.01, 0.99). Final state wins -- regressions hurt.
82
- - **Exploit protection**: If schema matches target but tables are empty (agent deleted data), score is capped at 0.01.
83
- - **PRAGMA protection**: `PRAGMA foreign_keys = ON` is re-asserted before every grading pass.
84
- - **Auto-termination**: Episode ends immediately when score reaches 0.99, preventing post-success regression.
85
-
86
- ---
87
-
88
- ## Setup & Usage
89
-
90
  ```bash
91
- # Install dependencies
92
  pip install -r requirements.txt
 
93
 
94
- # Run baseline inference (requires HF_TOKEN)
95
- export HF_TOKEN=your_token_here
96
- export API_BASE_URL=https://router.huggingface.co/v1
 
97
  export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
98
- python inference.py
 
 
 
 
 
 
99
 
100
- # Run validation tests
101
- python test_smoke.py
102
- python test_all_tasks.py
 
103
 
104
- # Start environment server locally
 
105
  uvicorn server.app:app --host 0.0.0.0 --port 7860
106
  ```
107
 
108
- ---
109
-
110
  ## API Endpoints
111
 
112
  | Endpoint | Method | Description |
113
  |----------|--------|-------------|
114
- | `/health` | GET | Health check |
115
- | `/reset` | POST | Reset environment, returns initial observation |
116
- | `/step` | POST | Execute action, returns observation + reward |
117
  | `/state` | GET | Current environment state |
118
- | `/tasks` | GET | List all 7 tasks with descriptions |
119
- | `/grader` | POST | Run grader on all tasks, return scores |
120
- | `/schema` | GET | OpenEnv schema (action/observation types) |
121
- | `/ws` | WS | WebSocket for real-time interaction |
 
 
 
 
 
 
 
 
 
122
 
123
- ---
 
 
 
 
 
 
 
 
 
 
 
 
124
 
125
  ## Deployment
126
 
 
127
  ```bash
128
- # Docker (local test)
129
  docker build -t sql-migration-env .
130
- docker run -p 7860:7860 \
131
- -e HF_TOKEN=your_token \
132
- -e API_BASE_URL=https://router.huggingface.co/v1 \
133
- -e MODEL_NAME=Qwen/Qwen2.5-72B-Instruct \
134
- sql-migration-env
135
  ```
136
 
137
- **Hugging Face Spaces:** Push this repo to HF Spaces with your `HF_TOKEN`, `API_BASE_URL`, and `MODEL_NAME` set as Space secrets. The Dockerfile builds automatically.
138
-
139
- ---
140
-
141
- ## Baseline Scores
142
-
143
- | Task | Score | Steps | Model |
144
- |------|-------|-------|-------|
145
- | `column-restructure` | 0.99 | 4 | Qwen/Qwen2.5-72B-Instruct |
146
- | `soft-delete-restoration` | 0.99 | 5-7 | Qwen/Qwen2.5-72B-Instruct |
147
- | `table-normalization` | 0.99 | 5-8 | Qwen/Qwen2.5-72B-Instruct |
148
- | `schema-version-merge` | 0.60-0.85 | 8-12 | Qwen/Qwen2.5-72B-Instruct |
149
- | `multi-entity-extraction` | 0.40-0.70 | 12-15 | Qwen/Qwen2.5-72B-Instruct |
150
- | `cascade-migration` | 0.30-0.65 | 15-20 | Qwen/Qwen2.5-72B-Instruct |
151
- | `dual-source-consolidation` | 0.20-0.50 | 18-20 | Qwen/Qwen2.5-72B-Instruct |
152
-
153
- ---
154
-
155
- ## Pre-Submission Checklist
156
 
157
- - [x] `docker build` succeeds
158
- - [x] `curl /health` returns 200
159
- - [x] `curl /tasks` returns 7 tasks
160
- - [x] `curl -X POST /reset` returns valid observation
161
- - [x] `openenv validate` passes
162
- - [x] Baseline script completes all 7 tasks without crashing
163
- - [x] Grader scores in (0.01, 0.99) range
164
- - [x] Exploit protection: empty-table shortcuts penalized
165
- - [x] PRAGMA bypass protection enforced
 
1
+ # SQL Schema Migration Agent — OpenEnv Benchmark
 
 
 
 
 
 
 
 
 
2
 
3
+ An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data.
4
 
5
+ ## Why This Benchmark?
 
 
6
 
7
+ Database schema migration is a **real-world task** that engineers perform daily. Unlike toy benchmarks, it tests:
8
+ - **Reasoning under constraints** (SQLite's limited ALTER TABLE support)
9
+ - **Data preservation** (agents must never silently drop rows)
10
+ - **Multi-step planning** (complex migrations require 5-15 coordinated SQL commands)
11
+ - **Edge case handling** (apostrophes, NULL values, empty strings, type coercion)
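SQLite's restricted `ALTER TABLE` is central here: column types and constraints cannot be changed in place, so migrations must rebuild tables. A minimal sketch of that rebuild pattern (the `users` table and its values are illustrative, not taken from a specific task):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical source table storing salaries as TEXT (e.g. '$90,000').
cur.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, salary TEXT)")
cur.execute("INSERT INTO users VALUES (1, '$90,000'), (2, '$1,234')")

# SQLite cannot ALTER a column's type, so rebuild the table:
# 1) CREATE the new shape, 2) copy with coercion, 3) DROP old, 4) RENAME.
cur.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, salary INTEGER)")
cur.execute(
    "INSERT INTO users_new "
    "SELECT id, CAST(REPLACE(REPLACE(salary, '$', ''), ',', '') AS INTEGER) "
    "FROM users"
)
cur.execute("DROP TABLE users")
cur.execute("ALTER TABLE users_new RENAME TO users")

print(cur.execute("SELECT id, salary FROM users ORDER BY id").fetchall())
# → [(1, 90000), (2, 1234)]
```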
12
 
13
+ ## Architecture
14
 
15
+ ```
16
+ ┌─────────────────────────────────┐
17
+ │ inference.py (Baseline Agent) │
18
+ │ - LLM API calls (OpenAI fmt) │
19
+ │ - JSON mode + fallback parser  │
20
+ │ - Task-specific prompts │
21
+ └─────────┬───────────────────────┘
22
+ │ MigrationAction
23
+ ┌─────────▼───────────────────────┐
24
+ │ environment.py (OpenEnv Env) │
25
+ │ - SQLite execution engine      │
26
+ │ - SELECT result passthrough │
27
+ │ - SQL timeout (progress hdlr)  │
28
+ │ - Dangerous SQL blacklist      │
29
+ │ - Transaction awareness        │
30
+ │ - Trajectory logging │
31
+ └─────────┬───────────────────────┘
32
+ │ score()
33
+ ┌─────────▼───────────────────────┐
34
+ │ grader.py (Golden DB Engine) │
35
+ │ - Dynamic golden reference DB │
36
+ │ - Schema + data + FK scoring │
37
+ │ - Case-insensitive comparison │
38
+ │ - PRAGMA state preservation │
39
+ │ - Anti-exploit checks │
40
+ └─────────────────────────────────┘
41
+ ```
42
 
43
  ## Tasks (2 Easy / 3 Medium / 2 Hard)
44
 
45
+ | # | Task | Difficulty | Steps | Description |
46
  |---|------|-----------|-------|-------------|
47
+ | 1 | `column-restructure` | Easy | 10 | Merge first_name + last_name → full_name |
48
+ | 2 | `soft-delete-restoration` | Easy | 10 | Restore deleted products from deletion_log |
49
+ | 3 | `table-normalization` | Medium | 15 | Normalize purchases → customers + orders + FK |
50
+ | 4 | `schema-version-merge` | Medium | 15 | Merge v1/v2 product tables with price coercion |
51
+ | 5 | `multi-entity-extraction` | Medium | 15 | 3NF decomposition with invalid data routing |
52
+ | 6 | `cascade-migration` | Hard | 20 | 4-table FK cascade, type coercion, orphan audit |
53
+ | 7 | `dual-source-consolidation` | Hard | 20 | 6→4 table merge, cross-system email dedup |
54
+
55
+ ### Adversarial Edge Cases
56
+ - **O'Brien** (apostrophe in data — tests SQL escaping)
57
+ - **$90,000 salary** (TEXT→INTEGER coercion — tests string processing)
58
+ - **Empty string emails** (not NULL — tests data validation logic)
59
+ - **Leading whitespace** (` alice@company.com` — tests TRIM awareness)
60
+ - **ID conflicts** (same ID in two source tables — tests merge logic)
61
+ - **Orphaned FKs** (references to deleted entities — tests audit logging)
62
+ - **NULL currency** (must default to 'USD' — tests COALESCE)
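As a sketch of the SQL an agent needs for these cases (the `contacts` table below is hypothetical, not one of the task schemas): escape apostrophes by doubling them, TRIM whitespace, reject empty-string emails, and COALESCE a NULL currency to 'USD'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical rows exhibiting the adversarial edge cases above.
cur.execute("CREATE TABLE contacts (name TEXT, email TEXT, currency TEXT)")
cur.execute("INSERT INTO contacts VALUES ('O''Brien', ' alice@company.com', NULL)")
cur.execute("INSERT INTO contacts VALUES ('Dana', '', 'EUR')")

# TRIM whitespace, treat empty-string emails as invalid, default NULL currency.
rows = cur.execute(
    "SELECT name, TRIM(email), COALESCE(currency, 'USD') "
    "FROM contacts WHERE TRIM(email) <> ''"
).fetchall()
print(rows)  # → [("O'Brien", 'alice@company.com', 'USD')]
```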
63
+
64
+ ## Dynamic Golden Database Grading
65
+
66
+ Unlike benchmarks with hardcoded expected values, our grader is **seed-independent**:
67
+
68
+ 1. At scoring time, a fresh DB is seeded and the correct migration is applied
69
+ 2. The agent's DB is compared table-by-table against this golden reference
70
+ 3. If seed data changes, the golden DB auto-updates
71
+
72
+ **Scoring breakdown (per task):**
73
+ - **Schema match (30%)**: Tables exist with correct columns
74
+ - **Data match (40%)**: Row content matches golden DB (order-independent)
75
+ - **FK & integrity (20%)**: Foreign keys enforced, PRAGMA integrity_check passes
76
+ - **Anti-exploit (10%)**: No empty tables, no schema pollution
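The order-independent data-match component can be sketched roughly as a multiset comparison of rows between the agent's database and the golden reference (a simplified illustration, not the grader's actual code):

```python
import sqlite3
from collections import Counter

def tables_match(agent_db, golden_db, table):
    """Order-independent row comparison: multisets of rows must be equal.
    (The real grader also weighs schema shape, FKs, and anti-exploit checks.)"""
    query = f"SELECT * FROM {table}"
    return Counter(agent_db.execute(query).fetchall()) == \
           Counter(golden_db.execute(query).fetchall())

agent = sqlite3.connect(":memory:")
golden = sqlite3.connect(":memory:")
for db in (agent, golden):
    db.execute("CREATE TABLE t (id INTEGER, v TEXT)")
# Same rows, different insertion order — should still match.
agent.executemany("INSERT INTO t VALUES (?, ?)", [(2, "b"), (1, "a")])
golden.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (2, "b")])
print(tables_match(agent, golden, "t"))  # → True
```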
77
+
78
+ ## Security & Robustness
79
+
80
+ - **SQL Timeout**: Progress-handler-based execution timeout prevents infinite CTEs
81
+ - **Dangerous SQL Blacklist**: ATTACH DATABASE, DETACH, LOAD_EXTENSION blocked
82
+ - **Transaction Awareness**: Respects BEGIN/COMMIT/ROLLBACK from agents
83
+ - **Case-Insensitive Grading**: Table/column names compared case-insensitively
84
+ - **PRAGMA Preservation**: Grader doesn't corrupt agent's FK state
85
+ - **Trajectory Logging**: Full SQL history attached to final observation
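The progress-handler timeout can be sketched as follows (an illustrative reimplementation, not the environment's exact code; the deadline and callback interval are assumptions):

```python
import sqlite3
import time

def run_with_timeout(conn, sql, seconds=1.0):
    """Abort a runaway statement via SQLite's progress handler."""
    deadline = time.monotonic() + seconds

    def check():
        # Returning non-zero from the handler aborts the running statement.
        return 1 if time.monotonic() > deadline else 0

    conn.set_progress_handler(check, 10_000)  # invoked every ~10k VM opcodes
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"aborted: {exc}"
    finally:
        conn.set_progress_handler(None, 0)

conn = sqlite3.connect(":memory:")
print(run_with_timeout(conn, "SELECT 1"))  # → [(1,)]
# An unbounded recursive CTE gets aborted instead of hanging forever:
slow = ("WITH RECURSIVE c(x) AS (SELECT 1 UNION ALL SELECT x+1 FROM c) "
        "SELECT COUNT(*) FROM c")
print(run_with_timeout(conn, slow, seconds=0.2))
```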
86
+
87
+ ## Setup
88
+
89
+ ### Requirements
 
 
90
  ```bash
 
91
  pip install -r requirements.txt
92
+ ```
93
 
94
+ ### Environment Variables
95
+ ```bash
96
+ export HF_TOKEN=your_huggingface_token
97
+ export API_BASE_URL=https://router.huggingface.co/v1 # or Groq, etc.
98
  export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
99
+ ```
100
+
101
+ ### Run Tests
102
+ ```bash
103
+ python test_smoke.py # Quick validation
104
+ python test_all_tasks.py # All 7 tasks: golden migration + lifecycle
105
+ ```
106
 
107
+ ### Run Baseline Inference
108
+ ```bash
109
+ python inference.py # Runs all 7 tasks sequentially
110
+ ```
111
 
112
+ ### Start Server (HF Spaces)
113
+ ```bash
114
  uvicorn server.app:app --host 0.0.0.0 --port 7860
115
  ```
116
 
 
 
117
  ## API Endpoints
118
 
119
  | Endpoint | Method | Description |
120
  |----------|--------|-------------|
121
+ | `/reset` | POST | Start new migration episode |
122
+ | `/step` | POST | Execute a SQL action |
 
123
  | `/state` | GET | Current environment state |
124
+ | `/tasks` | GET | List all 7 tasks with metadata |
125
+ | `/grader` | POST | Run grader on specific/all tasks |
126
+ | `/health` | GET | Health check |
127
+ | `/docs` | GET | Interactive API documentation |
128
+
129
+ ## Action Schema
130
+ ```json
131
+ {
132
+ "sql_command": "ALTER TABLE users ADD COLUMN full_name TEXT",
133
+ "reasoning": "Add the target column before migrating data",
134
+ "submit_final": false
135
+ }
136
+ ```
137
 
138
+ ## Observation Schema
139
+ ```json
140
+ {
141
+ "current_schema_sql": "CREATE TABLE users (...);",
142
+ "target_schema_sql": "CREATE TABLE users (...);",
143
+ "last_execution_result": "Success: 5 rows affected",
144
+ "step_number": 3,
145
+ "migration_progress": 0.75,
146
+ "task_name": "column-restructure",
147
+ "done": false,
148
+ "reward": 0.15
149
+ }
150
+ ```
151
 
152
  ## Deployment
153
 
154
+ ### Docker
155
  ```bash
 
156
  docker build -t sql-migration-env .
157
+ docker run -p 7860:7860 -e HF_TOKEN=your_token sql-migration-env
 
 
 
 
158
  ```
159
 
160
+ ### Hugging Face Spaces
161
+ Push to a Space with the included Dockerfile. Set `HF_TOKEN`, `API_BASE_URL`, and `MODEL_NAME` as Space secrets.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
162
 
163
+ ## License
164
+ MIT
 
 
 
 
 
 
 
__pycache__/inference.cpython-312.pyc ADDED
Binary file (14.6 kB).
 
__pycache__/seeds.cpython-312.pyc CHANGED
Binary files a/__pycache__/seeds.cpython-312.pyc and b/__pycache__/seeds.cpython-312.pyc differ
 
inference.py CHANGED
@@ -2,9 +2,18 @@
2
  """
3
  Baseline Inference Script for SQL Migration Environment.
4
 
5
- Runs all 3 migration tasks sequentially using an LLM via OpenAI-compatible API.
6
  Outputs structured [START]/[STEP]/[END] format for automated evaluation.
7
 
 
 
 
 
 
 
 
 
 
8
  Usage:
9
  python inference.py
10
 
@@ -16,6 +25,7 @@ Environment Variables:
16
 
17
  import json
18
  import os
 
19
  import sys
20
  import time
21
  import traceback
@@ -23,29 +33,31 @@ import traceback
23
  # Server URL for the environment
24
  ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
25
 
26
- # LLM Configuration — defaults required for API_BASE_URL and MODEL_NAME only
27
  API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
28
  MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
29
- HF_TOKEN = os.getenv("HF_TOKEN") # No default — must be set by user
30
- # Also support OPENAI_API_KEY as primary (per spec) and API_KEY as alias
31
  API_KEY = os.getenv("OPENAI_API_KEY") or HF_TOKEN or os.getenv("API_KEY")
32
 
 
33
  SYSTEM_PROMPT_TEMPLATE = """You are an autonomous SQLite database migration engine. You receive the current schema and a target schema. Write SQL to transform the current state to the target state without losing row data.
34
 
35
- CRITICAL — SQLite-specific rules (violations cause immediate errors):
36
- 1. SQLite does NOT support ALTER TABLE ADD CONSTRAINT — never use it.
37
- 2. SQLite does NOT support ALTER TABLE ALTER COLUMN — never use it.
38
- 3. SQLite does NOT support ALTER TABLE ADD PRIMARY KEY — never use it.
39
- 4. SQLite does NOT support ADD COLUMN with non-constant DEFAULT — add column as NULL then UPDATE.
40
- 5. To change column types, add NOT NULL, or add FKs: CREATE new table with correct schema, INSERT INTO new SELECT from old, DROP old, RENAME new to original name.
41
- 6. Apostrophes in data (e.g., O'Brien, O'Neill) are present — always use parameterized patterns or escape with ''.
42
- 7. For table normalization: create new tables first, INSERT INTO ... SELECT, then drop old tables.
43
- 8. For ORPHANED FK rows: before inserting into a FK-constrained table, DELETE or INSERT INTO audit_log any rows whose FK reference does not exist in the parent table. Example: DELETE FROM assets WHERE employee_id NOT IN (SELECT id FROM employees).
44
- 9. For TEXT salary columns like '$90000': use CAST(REPLACE(REPLACE(salary, '$', ''), ',', '') AS INTEGER) to convert.
45
- 10. Execute exactly ONE SQL statement per step.
46
- 11. When migration is complete (schemas match, data preserved), set submit_final to true IMMEDIATELY.
47
-
48
- TARGET SCHEMA (fixed — achieve this exactly):
 
 
49
  {target_ddl}
50
 
51
  Respond ONLY with valid JSON — no markdown, no code blocks, no text outside the object:
@@ -60,15 +72,12 @@ ALL_TASKS = [
60
  "cascade-migration",
61
  "dual-source-consolidation",
62
  ]
63
- MAX_STEPS = 20 # Global fallback; per-task limits override this
64
- MAX_PARSE_ERRORS = 5 # Higher tolerance for thinking models (Qwen3, DeepSeek-R1)
65
-
66
- # Auto-submit threshold: if migration_progress >= this, force submit_final
67
  AUTO_SUBMIT_THRESHOLD = 0.95
68
 
69
 
70
  def call_llm(messages: list, timeout: int = 90) -> str:
71
- """Call the LLM API and return the response content."""
72
  from openai import OpenAI
73
 
74
  client = OpenAI(
@@ -77,11 +86,25 @@ def call_llm(messages: list, timeout: int = 90) -> str:
77
  timeout=timeout,
78
  )
79
 
 
80
  try:
81
  response = client.chat.completions.create(
82
  model=MODEL_NAME,
83
  messages=messages,
84
- temperature=0.0, # Deterministic output — eliminates variance
 
 
 
 
 
 
 
 
 
 
 
 
 
85
  max_tokens=1024,
86
  )
87
  return response.choices[0].message.content.strip()
@@ -93,18 +116,13 @@ def parse_action(raw_text: str) -> dict:
93
  """
94
  Parse LLM output into an action dict.
95
 
96
- Handles: raw JSON, markdown-wrapped JSON (```json ... ```),
97
- <think>...</think> reasoning tokens (Qwen3, DeepSeek-R1),
98
- and common LLM mistakes like trailing commas or extra text.
99
  """
100
- import re
101
  text = raw_text.strip()
102
 
103
- # Strip <think>...</think> blocks emitted by reasoning models (Qwen3, R1)
104
- # Must do this BEFORE any other processing
105
  text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
106
-
107
- # Also strip partial/unclosed think blocks (truncated output)
108
  text = re.sub(r"<think>.*$", "", text, flags=re.DOTALL).strip()
109
 
110
  # Strip markdown code block fences
@@ -119,7 +137,7 @@ def parse_action(raw_text: str) -> dict:
119
  except json.JSONDecodeError:
120
  pass
121
 
122
- # Try to find JSON object in the text (handles preamble text or extra trailing content)
123
  start = text.find("{")
124
  end = text.rfind("}") + 1
125
  if start >= 0 and end > start:
@@ -128,11 +146,14 @@ def parse_action(raw_text: str) -> dict:
128
  except json.JSONDecodeError:
129
  pass
130
 
131
- # Last resort: try to extract just sql_command if JSON is truncated
132
- sql_match = re.search(r'"sql_command"\s*:\s*"([^"]+)"', text)
133
  if sql_match:
 
 
 
134
  return {
135
- "sql_command": sql_match.group(1),
136
  "reasoning": "auto-extracted from malformed response",
137
  "submit_final": False,
138
  }
@@ -143,49 +164,47 @@ def parse_action(raw_text: str) -> dict:
143
  def run_task_local(task_name: str) -> dict:
144
  """
145
  Run a single task using a local environment instance (no server needed).
146
-
147
- This is the primary mode — avoids HTTP overhead and works inside Docker.
148
  """
149
- # Import environment directly
150
  sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
151
  from server.environment import DbMigrationEnvironment
152
  from models import MigrationAction
153
  import seeds
154
 
155
  env = DbMigrationEnvironment(task_name=task_name)
156
-
157
- # Use task-specific step budget (defaults to global MAX_STEPS)
158
- task_max_steps = seeds.TASKS.get(task_name, {}).get("max_steps", MAX_STEPS)
159
 
160
  print(f"[START] task={task_name} env=sql-migration-agent model={MODEL_NAME}", flush=True)
161
 
162
  obs = env.reset()
163
 
164
- # Build task-specific system prompt with target DDL baked in (sent ONCE)
165
- task_system_prompt = SYSTEM_PROMPT_TEMPLATE.format(target_ddl=obs.target_schema_sql)
 
 
 
166
  history = [{"role": "system", "content": task_system_prompt}]
167
 
168
- # Initial observation — only current schema (target already in system prompt)
169
  initial_msg = (
170
  f"CURRENT DATABASE SCHEMA:\n{obs.current_schema_sql}\n\n"
171
  f"Status: {obs.last_execution_result}\n"
172
  f"Migration progress: {obs.migration_progress:.2f}\n\n"
173
- f"Write your first SQL command to begin the migration."
174
  )
175
  history.append({"role": "user", "content": initial_msg})
176
 
177
  rewards_list = []
178
- parse_errors = 0
179
  final_score = 0.0
180
  steps_taken = 0
181
  done = False
182
- peak_score = 0.0 # Track the highest score we've reached
183
 
184
  for step in range(task_max_steps):
185
  if done:
186
  break
187
 
188
- # Context truncation: system prompt + last 10 messages (5 pairs)
189
  messages = [history[0]] + history[-10:]
190
 
191
  try:
@@ -199,22 +218,21 @@ def run_task_local(task_name: str) -> dict:
199
  # Parse the action
200
  try:
201
  action_dict = parse_action(raw_response)
202
- except ValueError as e:
203
- parse_errors += 1
 
204
  print(f"[STEP] step={step+1} action=PARSE_ERROR reward=0.00 done=false error=parse_error", flush=True)
205
- if parse_errors >= MAX_PARSE_ERRORS:
206
- print(f"[STEP] step={step+1} action=MAX_PARSE_ERRORS reward=0.00 done=true error=too_many_parse_errors", flush=True)
207
  done = True
208
  break
209
  history.append({"role": "assistant", "content": raw_response})
210
  history.append({
211
  "role": "user",
212
- "content": "ERROR: Your response was not valid JSON. Respond ONLY with: {\"sql_command\": \"...\", \"reasoning\": \"...\", \"submit_final\": false}",
213
  })
214
  continue
215
 
216
- parse_errors = 0
217
-
218
  # Build the MigrationAction
219
  try:
220
  action = MigrationAction(
@@ -234,15 +252,9 @@ def run_task_local(task_name: str) -> dict:
234
  final_score = obs.migration_progress
235
  done = obs.done
236
 
237
- # Track peak score
238
- if final_score > peak_score:
239
- peak_score = final_score
240
-
241
- # AUTO-SUBMIT: If we just reached a near-perfect score, force submit
242
- # This prevents the LLM from continuing to send queries and regressing
243
  if final_score >= AUTO_SUBMIT_THRESHOLD and not done:
244
  done = True
245
- # Submit a final no-op to lock in the score
246
  submit_action = MigrationAction(
247
  sql_command="SELECT 1",
248
  reasoning="Migration complete — auto-submitting",
@@ -251,15 +263,13 @@ def run_task_local(task_name: str) -> dict:
251
  obs = env.step(submit_action)
252
  final_score = obs.migration_progress
253
 
254
- # Abbreviate SQL for logging
255
  sql_abbrev = action.sql_command[:50].replace("\n", " ")
256
  if len(action.sql_command) > 50:
257
  sql_abbrev += "..."
258
-
259
  error_str = obs.metadata.get("error", "null") if obs.metadata else "null"
260
  if error_str != "null":
261
  error_str = error_str[:80]
262
-
263
  print(
264
  f"[STEP] step={steps_taken} action={sql_abbrev} "
265
  f"reward={step_reward:.2f} done={'true' if done else 'false'} "
@@ -270,19 +280,17 @@ def run_task_local(task_name: str) -> dict:
270
  # Add to conversation history
271
  history.append({"role": "assistant", "content": json.dumps(action_dict)})
272
 
273
- # Lean feedback — target is already in the system prompt, no need to repeat
274
  feedback_msg = (
275
- f"EXECUTION RESULT: {obs.last_execution_result}\n\n"
276
- f"CURRENT SCHEMA:\n{obs.current_schema_sql}\n\n"
277
  f"Progress: {obs.migration_progress:.2f}"
278
  )
279
  if done:
280
  feedback_msg += "\n\nEpisode complete."
281
  elif obs.migration_progress >= 0.9:
282
  feedback_msg += (
283
- "\n\nMigration is nearly complete! Compare the current schema "
284
- "carefully to the target schema. If they match and data is "
285
- "preserved, set submit_final to true in your next response."
286
  )
287
  else:
288
  feedback_msg += "\n\nContinue the migration. Write your next SQL command."
@@ -309,7 +317,7 @@ def run_task_local(task_name: str) -> dict:
309
 
310
 
311
  def main():
312
- """Run all 3 tasks sequentially."""
313
  if not API_KEY:
314
  print("WARNING: No API key found. Set HF_TOKEN or API_KEY.", file=sys.stderr)
315
  sys.exit(1)
 
2
  """
3
  Baseline Inference Script for SQL Migration Environment.
4
 
5
+ Runs all 7 migration tasks sequentially using an LLM via OpenAI-compatible API.
6
  Outputs structured [START]/[STEP]/[END] format for automated evaluation.
7
 
8
+ Fixes Applied:
9
+ - D1: Task description injected into system prompt
10
+ - D2: Hardcoded system prompt traps removed (no more audit_log/INTEGER traps)
11
+ - D3: Data discovery rule added (agent runs SELECT before DDL)
12
+ - D4: Submit guard added (agent must verify before submitting)
13
+ - D5: Context window bloat fixed (schema not repeated every step)
14
+ - D6: Parse error counter tracks consecutive errors only
15
+ - D7: response_format JSON mode with fallback
16
+
17
  Usage:
18
  python inference.py
19
 
 
25
 
26
  import json
27
  import os
28
+ import re
29
  import sys
30
  import time
31
  import traceback
 
33
  # Server URL for the environment
34
  ENV_BASE_URL = os.getenv("ENV_BASE_URL", "http://localhost:7860")
35
 
36
+ # LLM Configuration
37
  API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
38
  MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
39
+ HF_TOKEN = os.getenv("HF_TOKEN")
 
40
  API_KEY = os.getenv("OPENAI_API_KEY") or HF_TOKEN or os.getenv("API_KEY")
41
 
42
+ # --- D2: Cleaned system prompt — no hardcoded table names or type traps ---
43
  SYSTEM_PROMPT_TEMPLATE = """You are an autonomous SQLite database migration engine. You receive the current schema and a target schema. Write SQL to transform the current state to the target state without losing row data.
44
 
45
+ TASK OBJECTIVE:
46
+ {task_description}
47
+
48
+ CRITICAL SQLite-specific rules (violations cause immediate errors):
49
+ 1. SQLite does NOT support ALTER TABLE ADD CONSTRAINT, ALTER COLUMN, or ADD PRIMARY KEY.
50
+ 2. To change column types, add NOT NULL, or add FKs: CREATE new table, INSERT INTO new SELECT FROM old, DROP old, RENAME new.
51
+ 3. Apostrophes in data (O'Brien, O'Neill) are present — escape with '' in string literals.
52
+ 4. Execute exactly ONE SQL statement per step.
53
+ 5. For table normalization: create new tables first, INSERT INTO ... SELECT, then drop old tables.
54
+ 6. For orphaned FK rows: check the TARGET SCHEMA for the correct anomaly/issues table name (it varies per task). Log invalid records there before dropping.
55
+ 7. For text currency columns like '$90,000' or '$1,234.56': strip '$' and ',' then cast to the type in the target schema (INTEGER for whole numbers, REAL for decimals).
56
+ 8. IMPORTANT: Before writing any DDL, execute SELECT * FROM tablename LIMIT 5 for each source table to inspect the actual data format and identify edge cases like empty strings, leading whitespace, NULL values, and special characters.
57
+ 9. Do NOT set submit_final to true until you have run SELECT COUNT(*) on your target tables and verified the counts and data match what the task requires.
58
+ 10. When migration is complete and verified, set submit_final to true.
59
+
60
+ TARGET SCHEMA (achieve this exactly):
61
  {target_ddl}
62
 
63
  Respond ONLY with valid JSON — no markdown, no code blocks, no text outside the object:
 
72
  "cascade-migration",
73
  "dual-source-consolidation",
74
  ]
75
+ MAX_PARSE_ERRORS = 5 # Consecutive parse errors before giving up
 
 
 
76
  AUTO_SUBMIT_THRESHOLD = 0.95
77
 
78
 
79
  def call_llm(messages: list, timeout: int = 90) -> str:
80
+ """Call the LLM API with JSON mode fallback."""
81
  from openai import OpenAI
82
 
83
  client = OpenAI(
 
86
  timeout=timeout,
87
  )
88
 
89
+ # --- D7: Try JSON mode first, fallback to plain ---
90
  try:
91
  response = client.chat.completions.create(
92
  model=MODEL_NAME,
93
  messages=messages,
94
+ temperature=0.0,
95
+ max_tokens=1024,
96
+ response_format={"type": "json_object"},
97
+ )
98
+ return response.choices[0].message.content.strip()
99
+ except Exception:
100
+ pass
101
+
102
+ # Fallback: plain text mode
103
+ try:
104
+ response = client.chat.completions.create(
105
+ model=MODEL_NAME,
106
+ messages=messages,
107
+ temperature=0.0,
108
  max_tokens=1024,
109
  )
110
  return response.choices[0].message.content.strip()
 
116
  """
117
  Parse LLM output into an action dict.
118
 
119
+ Handles: raw JSON, markdown-wrapped JSON, <think>...</think> blocks,
120
+ escaped quotes in SQL, and truncated output recovery.
 
121
  """
 
122
  text = raw_text.strip()
123
 
124
+ # Strip <think>...</think> blocks (Qwen3, DeepSeek-R1)
 
125
  text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
 
 
126
  text = re.sub(r"<think>.*$", "", text, flags=re.DOTALL).strip()
127
 
128
  # Strip markdown code block fences
 
137
  except json.JSONDecodeError:
138
  pass
139
 
140
+ # Try to find JSON object in the text
141
  start = text.find("{")
142
  end = text.rfind("}") + 1
143
  if start >= 0 and end > start:
 
146
  except json.JSONDecodeError:
147
  pass
148
 
149
+ # --- D6: Improved regex that handles escaped quotes ---
150
+ sql_match = re.search(r'"sql_command"\s*:\s*"((?:[^"\\]|\\.)*)"', text)
151
  if sql_match:
152
+ sql = sql_match.group(1)
153
+ # Unescape JSON string escapes
154
+ sql = re.sub(r'\\(.)', lambda m: {"n": "\n", "t": "\t"}.get(m.group(1), m.group(1)), sql)
155
  return {
156
+ "sql_command": sql,
157
  "reasoning": "auto-extracted from malformed response",
158
  "submit_final": False,
159
  }
 
  def run_task_local(task_name: str) -> dict:
      """
      Run a single task using a local environment instance (no server needed).
      """
      sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
      from server.environment import DbMigrationEnvironment
      from models import MigrationAction
      import seeds

      env = DbMigrationEnvironment(task_name=task_name)
+     task_config = seeds.TASKS[task_name]
+     task_max_steps = task_config.get("max_steps", 20)

      print(f"[START] task={task_name} env=sql-migration-agent model={MODEL_NAME}", flush=True)

      obs = env.reset()

+     # --- D1: Inject task description into system prompt ---
+     task_system_prompt = SYSTEM_PROMPT_TEMPLATE.format(
+         task_description=task_config["description"],
+         target_ddl=obs.target_schema_sql,
+     )
      history = [{"role": "system", "content": task_system_prompt}]

+     # Initial observation
      initial_msg = (
          f"CURRENT DATABASE SCHEMA:\n{obs.current_schema_sql}\n\n"
          f"Status: {obs.last_execution_result}\n"
          f"Migration progress: {obs.migration_progress:.2f}\n\n"
+         f"Start by inspecting the source data with SELECT queries, then begin the migration."
      )
      history.append({"role": "user", "content": initial_msg})

      rewards_list = []
+     consecutive_parse_errors = 0  # D6: track consecutive parse errors only
      final_score = 0.0
      steps_taken = 0
      done = False
 
      for step in range(task_max_steps):
          if done:
              break

+         # --- D5: Context window fix — only keep last 10 messages + system ---
          messages = [history[0]] + history[-10:]

          try:

          # Parse the action
          try:
              action_dict = parse_action(raw_response)
+             consecutive_parse_errors = 0  # D6: reset on success
+         except ValueError:
+             consecutive_parse_errors += 1
              print(f"[STEP] step={step+1} action=PARSE_ERROR reward=0.00 done=false error=parse_error", flush=True)
+             if consecutive_parse_errors >= MAX_PARSE_ERRORS:
+                 print(f"[STEP] step={step+1} action=MAX_PARSE_ERRORS reward=0.00 done=true error=too_many_consecutive_parse_errors", flush=True)
                  done = True
                  break
              history.append({"role": "assistant", "content": raw_response})
              history.append({
                  "role": "user",
+                 "content": 'ERROR: Your response was not valid JSON. Respond ONLY with: {"sql_command": "...", "reasoning": "...", "submit_final": false}',
              })
              continue
 
 
          # Build the MigrationAction
          try:
              action = MigrationAction(

          final_score = obs.migration_progress
          done = obs.done

+         # AUTO-SUBMIT: If we reached near-perfect score, force submit
          if final_score >= AUTO_SUBMIT_THRESHOLD and not done:
              done = True
              submit_action = MigrationAction(
                  sql_command="SELECT 1",
                  reasoning="Migration complete — auto-submitting",

              obs = env.step(submit_action)
              final_score = obs.migration_progress

+         # Log
          sql_abbrev = action.sql_command[:50].replace("\n", " ")
          if len(action.sql_command) > 50:
              sql_abbrev += "..."
          error_str = obs.metadata.get("error", "null") if obs.metadata else "null"
          if error_str != "null":
              error_str = error_str[:80]
          print(
              f"[STEP] step={steps_taken} action={sql_abbrev} "
              f"reward={step_reward:.2f} done={'true' if done else 'false'} "

          # Add to conversation history
          history.append({"role": "assistant", "content": json.dumps(action_dict)})

+         # --- D5: Lean feedback — NO schema repetition ---
          feedback_msg = (
+             f"EXECUTION RESULT: {obs.last_execution_result}\n"
              f"Progress: {obs.migration_progress:.2f}"
          )
          if done:
              feedback_msg += "\n\nEpisode complete."
          elif obs.migration_progress >= 0.9:
              feedback_msg += (
+                 "\n\nMigration is nearly complete! Run SELECT COUNT(*) on each table "
+                 "and compare to your expectations. If everything matches, set submit_final to true."
              )
          else:
              feedback_msg += "\n\nContinue the migration. Write your next SQL command."

  def main():
+     """Run all 7 tasks sequentially."""
      if not API_KEY:
          print("WARNING: No API key found. Set HF_TOKEN or API_KEY.", file=sys.stderr)
          sys.exit(1)
seeds.py CHANGED
@@ -678,6 +678,372 @@ def seed_task7(conn: sqlite3.Connection) -> None:
      conn.commit()

  # =============================================================================
  # Task Registry
  # =============================================================================
@@ -685,50 +1051,92 @@ def seed_task7(conn: sqlite3.Connection) -> None:
  TASKS = {
      "column-restructure": {
          "seed_fn": seed_task1,
          "target_ddl": TASK1_TARGET_DDL,
-         "description": "Merge first_name and last_name into a single full_name column without data loss",
          "difficulty": "easy",
          "max_steps": 10,
      },
      "soft-delete-restoration": {
          "seed_fn": seed_task4,
          "target_ddl": TASK4_TARGET_DDL,
-         "description": "Restore deleted products from deletion_log, add is_deleted/deleted_at columns",
          "difficulty": "easy",
          "max_steps": 10,
      },
      "table-normalization": {
          "seed_fn": seed_task2,
          "target_ddl": TASK2_TARGET_DDL,
-         "description": "Decompose a flat purchases table into normalized customers and orders tables with FK",
          "difficulty": "medium",
          "max_steps": 15,
      },
      "schema-version-merge": {
          "seed_fn": seed_task5,
          "target_ddl": TASK5_TARGET_DDL,
-         "description": "Merge overlapping v1/v2 product tables with price coercion and conflict resolution",
          "difficulty": "medium",
          "max_steps": 15,
      },
      "multi-entity-extraction": {
          "seed_fn": seed_task6,
          "target_ddl": TASK6_TARGET_DDL,
-         "description": "Decompose a sales god-table into 3NF with 3 FKs and invalid data routing",
          "difficulty": "medium",
          "max_steps": 15,
      },
      "cascade-migration": {
          "seed_fn": seed_task3,
          "target_ddl": TASK3_TARGET_DDL,
-         "description": "Multi-table FK cascade with type coercion, NULL handling, and orphan audit logging",
          "difficulty": "hard",
          "max_steps": 20,
      },
      "dual-source-consolidation": {
          "seed_fn": seed_task7,
          "target_ddl": TASK7_TARGET_DDL,
-         "description": "Merge 6 tables from two incompatible systems into 4 unified tables with cross-system dedup",
          "difficulty": "hard",
          "max_steps": 20,
      },
      conn.commit()


+ # =============================================================================
+ # Golden Migration Functions
+ # =============================================================================
+ # These produce the CORRECT expected database state from any seed data.
+ # Used by the dynamic grader to compare against the agent's output.
+ # If seed data changes, the golden DB auto-updates — no hardcoded literals.
+
+
+ def golden_task1(conn: sqlite3.Connection) -> None:
+     """Golden migration for Task 1: Column Restructure."""
+     conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
+     conn.execute(
+         "INSERT INTO users_new (id, full_name) "
+         "SELECT id, first_name || ' ' || last_name FROM users"
+     )
+     conn.execute("DROP TABLE users")
+     conn.execute("ALTER TABLE users_new RENAME TO users")
+     conn.commit()
+
+
+ def golden_task2(conn: sqlite3.Connection) -> None:
+     """Golden migration for Task 2: Table Normalization."""
+     conn.execute("PRAGMA foreign_keys = OFF")
+     conn.execute(
+         "CREATE TABLE customers ("
+         "id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT NOT NULL UNIQUE)"
+     )
+     conn.execute(
+         "INSERT INTO customers (name, email) "
+         "SELECT DISTINCT customer_name, customer_email FROM purchases"
+     )
+     conn.execute(
+         "CREATE TABLE orders ("
+         "id INTEGER PRIMARY KEY, customer_id INTEGER NOT NULL, "
+         "item_name TEXT NOT NULL, price INTEGER NOT NULL, "
+         "FOREIGN KEY (customer_id) REFERENCES customers(id))"
+     )
+     conn.execute(
+         "INSERT INTO orders (customer_id, item_name, price) "
+         "SELECT c.id, p.item_name, p.price "
+         "FROM purchases p JOIN customers c ON p.customer_email = c.email"
+     )
+     conn.execute("DROP TABLE purchases")
+     conn.execute("PRAGMA foreign_keys = ON")
+     conn.commit()
+
+
+ def golden_task3(conn: sqlite3.Connection) -> None:
+     """Golden migration for Task 3: Cascade Migration."""
+     conn.execute("PRAGMA foreign_keys = OFF")
+     # Create audit_log
+     conn.execute(
+         "CREATE TABLE audit_log (id INTEGER PRIMARY KEY, source_table TEXT NOT NULL, "
+         "original_row_json TEXT NOT NULL, reason TEXT NOT NULL)"
+     )
+     # Log orphaned assets
+     conn.execute(
+         "INSERT INTO audit_log (source_table, original_row_json, reason) "
+         "SELECT 'assets', '{\"id\":' || id || ',\"employee_id\":' || employee_id || '}', 'orphaned_record' "
+         "FROM assets WHERE employee_id NOT IN (SELECT id FROM employees)"
+     )
+     # Log NULL salary employees
+     conn.execute(
+         "INSERT INTO audit_log (source_table, original_row_json, reason) "
+         "SELECT 'employees', '{\"id\":' || id || ',\"name\":\"' || name || '\"}', 'null_salary' "
+         "FROM employees WHERE salary IS NULL"
+     )
+     # Rebuild companies
+     conn.execute("CREATE TABLE companies_new (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
+     conn.execute("INSERT INTO companies_new SELECT id, name FROM companies")
+     conn.execute("DROP TABLE companies")
+     conn.execute("ALTER TABLE companies_new RENAME TO companies")
+     # Rebuild departments
+     conn.execute(
+         "CREATE TABLE departments_new (id INTEGER PRIMARY KEY, company_id INTEGER NOT NULL, "
+         "name TEXT NOT NULL, FOREIGN KEY (company_id) REFERENCES companies(id))"
+     )
+     conn.execute("INSERT INTO departments_new SELECT id, company_id, name FROM departments")
+     conn.execute("DROP TABLE departments")
+     conn.execute("ALTER TABLE departments_new RENAME TO departments")
+     # Rebuild employees (remove NULL salary, coerce TEXT to INT)
+     conn.execute(
+         "CREATE TABLE employees_new (id INTEGER PRIMARY KEY, department_id INTEGER NOT NULL, "
+         "name TEXT NOT NULL, salary INTEGER NOT NULL, "
+         "FOREIGN KEY (department_id) REFERENCES departments(id))"
+     )
+     conn.execute(
+         "INSERT INTO employees_new (id, department_id, name, salary) "
+         "SELECT id, department_id, name, "
+         "CAST(REPLACE(REPLACE(salary, '$', ''), ',', '') AS INTEGER) "
+         "FROM employees WHERE salary IS NOT NULL"
+     )
+     conn.execute("DROP TABLE employees")
+     conn.execute("ALTER TABLE employees_new RENAME TO employees")
+     # Rebuild assets (remove orphans)
+     conn.execute(
+         "CREATE TABLE assets_new (id INTEGER PRIMARY KEY, employee_id INTEGER NOT NULL, "
+         "description TEXT NOT NULL, FOREIGN KEY (employee_id) REFERENCES employees(id))"
+     )
+     conn.execute(
+         "INSERT INTO assets_new SELECT id, employee_id, description FROM assets "
+         "WHERE employee_id IN (SELECT id FROM employees)"
+     )
+     conn.execute("DROP TABLE assets")
+     conn.execute("ALTER TABLE assets_new RENAME TO assets")
+     conn.execute("PRAGMA foreign_keys = ON")
+     conn.commit()
+
+
+ def golden_task4(conn: sqlite3.Connection) -> None:
+     """Golden migration for Task 4: Soft-Delete Restoration."""
+     conn.execute("PRAGMA foreign_keys = OFF")
+     # Create new table with extra columns
+     conn.execute(
+         "CREATE TABLE products_new (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
+         "price REAL NOT NULL, stock INTEGER NOT NULL, "
+         "is_deleted INTEGER NOT NULL DEFAULT 0, deleted_at TEXT)"
+     )
+     # Copy existing products as active
+     conn.execute(
+         "INSERT INTO products_new (id, name, price, stock, is_deleted, deleted_at) "
+         "SELECT id, name, price, stock, 0, NULL FROM products"
+     )
+     # Restore deleted products from log
+     conn.execute(
+         "INSERT INTO products_new (id, name, price, stock, is_deleted, deleted_at) "
+         "SELECT product_id, product_name, product_price, product_stock, 1, deleted_at "
+         "FROM deletion_log"
+     )
+     conn.execute("DROP TABLE products")
+     conn.execute("ALTER TABLE products_new RENAME TO products")
+     conn.execute("DROP TABLE deletion_log")
+     conn.execute("PRAGMA foreign_keys = ON")
+     conn.commit()
+
+
+ def golden_task5(conn: sqlite3.Connection) -> None:
+     """Golden migration for Task 5: Schema Version Merge."""
+     conn.execute("PRAGMA foreign_keys = OFF")
+     conn.execute(
+         "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
+         "price REAL NOT NULL, category TEXT, supplier TEXT, brand TEXT, "
+         "sku TEXT, source TEXT NOT NULL)"
+     )
+     # Insert v1-only rows
+     conn.execute(
+         "INSERT INTO products (id, name, price, category, supplier, brand, sku, source) "
+         "SELECT id, name, CAST(REPLACE(REPLACE(price, '$', ''), ',', '') AS REAL), "
+         "category, supplier, NULL, NULL, 'v1' "
+         "FROM products_v1 WHERE id NOT IN (SELECT id FROM products_v2)"
+     )
+     # Insert v2-only rows
+     conn.execute(
+         "INSERT INTO products (id, name, price, category, supplier, brand, sku, source) "
+         "SELECT id, name, unit_cost, category, NULL, brand, sku, 'v2' "
+         "FROM products_v2 WHERE id NOT IN (SELECT id FROM products_v1)"
+     )
+     # Insert conflict rows (v2 wins for name/price)
+     conn.execute(
+         "INSERT INTO products (id, name, price, category, supplier, brand, sku, source) "
+         "SELECT v2.id, v2.name, v2.unit_cost, v2.category, v1.supplier, v2.brand, v2.sku, 'both' "
+         "FROM products_v2 v2 JOIN products_v1 v1 ON v2.id = v1.id"
+     )
+     conn.execute("DROP TABLE products_v1")
+     conn.execute("DROP TABLE products_v2")
+     conn.execute("PRAGMA foreign_keys = ON")
+     conn.commit()
+
+
+ def golden_task6(conn: sqlite3.Connection) -> None:
+     """Golden migration for Task 6: Multi-Entity Extraction."""
+     conn.execute("PRAGMA foreign_keys = OFF")
+     # Create target tables
+     conn.execute(
+         "CREATE TABLE salespersons (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
+         "email TEXT NOT NULL UNIQUE, region TEXT NOT NULL)"
+     )
+     conn.execute(
+         "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
+         "email TEXT NOT NULL UNIQUE, tier TEXT NOT NULL)"
+     )
+     conn.execute(
+         "CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
+         "sku TEXT NOT NULL UNIQUE, category TEXT NOT NULL)"
+     )
+     conn.execute(
+         "CREATE TABLE sales (id INTEGER PRIMARY KEY, salesperson_id INTEGER NOT NULL, "
+         "customer_id INTEGER NOT NULL, product_id INTEGER NOT NULL, "
+         "quantity INTEGER NOT NULL, unit_price REAL NOT NULL, "
+         "discount_pct INTEGER NOT NULL DEFAULT 0, sale_date TEXT NOT NULL, "
+         "FOREIGN KEY (salesperson_id) REFERENCES salespersons(id), "
+         "FOREIGN KEY (customer_id) REFERENCES customers(id), "
+         "FOREIGN KEY (product_id) REFERENCES products(id))"
+     )
+     conn.execute(
+         "CREATE TABLE data_issues (id INTEGER PRIMARY KEY, source_table TEXT NOT NULL, "
+         "source_row_id INTEGER NOT NULL, issue_type TEXT NOT NULL, "
+         "issue_detail TEXT NOT NULL)"
+     )
+     # Populate salespersons (TRIM email)
+     conn.execute(
+         "INSERT INTO salespersons (name, email, region) "
+         "SELECT DISTINCT rep_name, TRIM(rep_email), rep_region FROM sales_records"
+     )
+     # Populate customers (exclude empty email rows)
+     conn.execute(
+         "INSERT INTO customers (name, email, tier) "
+         "SELECT DISTINCT customer_name, customer_email, customer_tier "
+         "FROM sales_records WHERE customer_email IS NOT NULL AND customer_email != ''"
+     )
+     # Populate products
+     conn.execute(
+         "INSERT INTO products (name, sku, category) "
+         "SELECT DISTINCT product_name, product_sku, product_category FROM sales_records"
+     )
+     # Populate sales (exclude rows with empty customer email)
+     conn.execute(
+         "INSERT INTO sales (salesperson_id, customer_id, product_id, quantity, "
+         "unit_price, discount_pct, sale_date) "
+         "SELECT sp.id, c.id, p.id, sr.quantity, sr.unit_price, sr.discount_pct, sr.sale_date "
+         "FROM sales_records sr "
+         "JOIN salespersons sp ON TRIM(sr.rep_email) = sp.email "
+         "JOIN customers c ON sr.customer_email = c.email "
+         "JOIN products p ON sr.product_sku = p.sku "
+         "WHERE sr.customer_email IS NOT NULL AND sr.customer_email != ''"
+     )
+     # Log data issues (empty email)
+     conn.execute(
+         "INSERT INTO data_issues (source_table, source_row_id, issue_type, issue_detail) "
+         "SELECT 'sales_records', id, 'empty_email', "
+         "'Customer email is empty for: ' || customer_name "
+         "FROM sales_records WHERE customer_email IS NULL OR customer_email = ''"
+     )
+     conn.execute("DROP TABLE sales_records")
+     conn.execute("PRAGMA foreign_keys = ON")
+     conn.commit()
+
+
+ def golden_task7(conn: sqlite3.Connection) -> None:
+     """Golden migration for Task 7: Dual-Source Consolidation."""
+     conn.execute("PRAGMA foreign_keys = OFF")
+
+     # Create unified_customers
+     conn.execute(
+         "CREATE TABLE unified_customers (id INTEGER PRIMARY KEY AUTOINCREMENT, "
+         "legacy_id INTEGER, modern_uuid TEXT, name TEXT, email TEXT, phone TEXT, "
+         "tier TEXT NOT NULL DEFAULT 'free', source TEXT NOT NULL, created_at TEXT)"
+     )
+     # Insert legacy-only customers (no email match in modern)
+     conn.execute(
+         "INSERT INTO unified_customers (legacy_id, modern_uuid, name, email, phone, tier, source, created_at) "
+         "SELECT lc.id, NULL, lc.full_name, lc.contact_email, lc.phone, lc.account_type, 'legacy', lc.join_date "
+         "FROM legacy_customers lc "
+         "WHERE lc.contact_email IS NULL OR lc.contact_email NOT IN (SELECT email_address FROM modern_users WHERE email_address IS NOT NULL)"
+     )
+     # Insert modern-only users (no email match in legacy)
+     conn.execute(
+         "INSERT INTO unified_customers (legacy_id, modern_uuid, name, email, phone, tier, source, created_at) "
+         "SELECT NULL, mu.uuid, mu.display_name, mu.email_address, NULL, "
+         "CASE mu.subscription_tier "
+         "  WHEN 1 THEN 'free' WHEN 2 THEN 'basic' WHEN 3 THEN 'premium' WHEN 4 THEN 'enterprise' "
+         "  ELSE 'free' END, "
+         "'modern', mu.created_at "
+         "FROM modern_users mu "
+         "WHERE mu.email_address NOT IN (SELECT contact_email FROM legacy_customers WHERE contact_email IS NOT NULL)"
+     )
+     # Insert matched (both) customers — legacy name + modern tier
+     conn.execute(
+         "INSERT INTO unified_customers (legacy_id, modern_uuid, name, email, phone, tier, source, created_at) "
+         "SELECT lc.id, mu.uuid, lc.full_name, lc.contact_email, lc.phone, "
+         "CASE mu.subscription_tier "
+         "  WHEN 1 THEN 'free' WHEN 2 THEN 'basic' WHEN 3 THEN 'premium' WHEN 4 THEN 'enterprise' "
+         "  ELSE 'free' END, "
+         "'both', lc.join_date "
+         "FROM legacy_customers lc "
+         "JOIN modern_users mu ON lc.contact_email = mu.email_address "
+         "WHERE lc.contact_email IS NOT NULL"
+     )
+
+     # Create unified_products
+     conn.execute(
+         "CREATE TABLE unified_products (id INTEGER PRIMARY KEY AUTOINCREMENT, "
+         "code TEXT NOT NULL UNIQUE, title TEXT NOT NULL, price REAL NOT NULL, "
+         "source TEXT NOT NULL)"
+     )
+     # Legacy products
+     conn.execute(
+         "INSERT INTO unified_products (code, title, price, source) "
+         "SELECT code, description, "
+         "CAST(REPLACE(REPLACE(unit_price, '$', ''), ',', '') AS REAL), 'legacy' "
+         "FROM legacy_products"
+     )
+     # Modern products (no code overlap expected)
+     conn.execute(
+         "INSERT INTO unified_products (code, title, price, source) "
+         "SELECT sku, title, base_price, 'modern' "
+         "FROM modern_catalog"
+     )
+
+     # Create migration_issues
+     conn.execute(
+         "CREATE TABLE migration_issues (id INTEGER PRIMARY KEY, "
+         "source_system TEXT NOT NULL, source_table TEXT NOT NULL, "
+         "source_id TEXT NOT NULL, issue_type TEXT NOT NULL, "
+         "resolution TEXT NOT NULL)"
+     )
+     # Log NULL email customer
+     conn.execute(
+         "INSERT INTO migration_issues (source_system, source_table, source_id, issue_type, resolution) "
+         "SELECT 'legacy', 'legacy_customers', CAST(id AS TEXT), 'null_email', "
+         "'Imported without email' "
+         "FROM legacy_customers WHERE contact_email IS NULL"
+     )
+     # Log orphaned transactions
+     conn.execute(
+         "INSERT INTO migration_issues (source_system, source_table, source_id, issue_type, resolution) "
+         "SELECT 'modern', 'modern_transactions', CAST(id AS TEXT), 'orphaned_record', "
+         "'User UUID not found: ' || user_uuid "
+         "FROM modern_transactions WHERE user_uuid NOT IN (SELECT uuid FROM modern_users)"
+     )
+
+     # Create unified_orders
+     conn.execute(
+         "CREATE TABLE unified_orders (id INTEGER PRIMARY KEY AUTOINCREMENT, "
+         "customer_id INTEGER NOT NULL, product_id INTEGER, amount REAL NOT NULL, "
+         "currency TEXT NOT NULL DEFAULT 'USD', status TEXT NOT NULL, "
+         "order_date TEXT, source TEXT NOT NULL, "
+         "FOREIGN KEY (customer_id) REFERENCES unified_customers(id))"
+     )
+     # Legacy orders
+     conn.execute(
+         "INSERT INTO unified_orders (customer_id, product_id, amount, currency, status, order_date, source) "
+         "SELECT uc.id, up.id, "
+         "CAST(REPLACE(REPLACE(lo.total_amount, '$', ''), ',', '') AS REAL), "
+         "'USD', lo.order_status, lo.order_date, 'legacy' "
+         "FROM legacy_orders lo "
+         "JOIN legacy_customers lc ON lo.customer_id = lc.id "
+         "JOIN unified_customers uc ON (uc.legacy_id = lc.id) "
+         "LEFT JOIN unified_products up ON lo.product_code = up.code"
+     )
+     # Modern transactions (exclude orphans)
+     conn.execute(
+         "INSERT INTO unified_orders (customer_id, product_id, amount, currency, status, order_date, source) "
+         "SELECT uc.id, up.id, mt.amount, "
+         "COALESCE(mt.currency, 'USD'), "
+         "CASE mt.tx_status "
+         "  WHEN 1 THEN 'pending' WHEN 2 THEN 'processing' WHEN 3 THEN 'complete' "
+         "  WHEN 4 THEN 'failed' WHEN 5 THEN 'refunded' ELSE 'unknown' END, "
+         "mt.created_at, 'modern' "
+         "FROM modern_transactions mt "
+         "JOIN modern_users mu ON mt.user_uuid = mu.uuid "
+         "JOIN unified_customers uc ON (uc.modern_uuid = mu.uuid OR uc.email = mu.email_address) "
+         "LEFT JOIN unified_products up ON mt.item_sku = up.code"
+     )
+
+     # Clean up source tables
+     conn.execute("DROP TABLE legacy_customers")
+     conn.execute("DROP TABLE legacy_orders")
+     conn.execute("DROP TABLE legacy_products")
+     conn.execute("DROP TABLE modern_users")
+     conn.execute("DROP TABLE modern_transactions")
+     conn.execute("DROP TABLE modern_catalog")
+     conn.execute("PRAGMA foreign_keys = ON")
+     conn.commit()
+
+
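The comparison logic itself lives in server/grader.py; the sketch below only illustrates, under assumed names (`dump_state` and `grade` are ours, not the repo's API), how a `golden_fn` enables data-level grading: rebuild a fresh golden database from the same seed, apply the known-correct migration, then diff full table contents against the agent's database.

```python
import sqlite3
from typing import Callable

def dump_state(conn: sqlite3.Connection) -> dict:
    """Snapshot every user table as a sorted list of its rows."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'"
    )]
    return {t: sorted(conn.execute(f"SELECT * FROM {t}").fetchall()) for t in tables}

def grade(seed_fn: Callable, golden_fn: Callable, agent_conn: sqlite3.Connection) -> bool:
    """Rebuild the golden DB from scratch and compare full table contents."""
    golden = sqlite3.connect(":memory:")
    seed_fn(golden)    # same seed data the agent started from
    golden_fn(golden)  # the known-correct migration
    return dump_state(golden) == dump_state(agent_conn)
```

Because the golden state is derived rather than hardcoded, a schema-only match with missing rows fails this comparison, which is what defeats the DROP-and-recreate exploit.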
  # =============================================================================
  # Task Registry
  # =============================================================================

  TASKS = {
      "column-restructure": {
          "seed_fn": seed_task1,
+         "golden_fn": golden_task1,
          "target_ddl": TASK1_TARGET_DDL,
+         "description": "Merge first_name and last_name into a single full_name column (concatenated with a space) without data loss. Apostrophes in names (e.g., O'Brien) must be preserved.",
          "difficulty": "easy",
          "max_steps": 10,
      },
      "soft-delete-restoration": {
          "seed_fn": seed_task4,
+         "golden_fn": golden_task4,
          "target_ddl": TASK4_TARGET_DDL,
+         "description": (
+             "Restore deleted products from the deletion_log table back into the products table. "
+             "Use product_id from deletion_log (NOT the log's id column) as the product's primary key. "
+             "Add is_deleted and deleted_at columns. Original products: is_deleted=0, deleted_at=NULL. "
+             "Restored products: is_deleted=1, deleted_at copied from log. "
+             "Note: stock=0 on a product does NOT mean it was deleted."
+         ),
          "difficulty": "easy",
          "max_steps": 10,
      },
      "table-normalization": {
          "seed_fn": seed_task2,
+         "golden_fn": golden_task2,
          "target_ddl": TASK2_TARGET_DDL,
+         "description": (
+             "Decompose the flat purchases table into normalized customers and orders tables with a FK. "
+             "customers should have DISTINCT entries by email. "
+             "All 7 original purchases must be preserved as individual orders linked to the correct customer."
+         ),
          "difficulty": "medium",
          "max_steps": 15,
      },
      "schema-version-merge": {
          "seed_fn": seed_task5,
+         "golden_fn": golden_task5,
          "target_ddl": TASK5_TARGET_DDL,
+         "description": (
+             "Merge products_v1 and products_v2 into a single products table. "
+             "v1 prices are stored as TEXT ('$XX.XX') — coerce to REAL. v2 uses 'unit_cost' — rename to 'price'. "
+             "For ID conflicts (same ID in both tables), v2 values WIN for name/price. "
+             "Set source='v1' for v1-only, 'v2' for v2-only, 'both' for conflicts."
+         ),
          "difficulty": "medium",
          "max_steps": 15,
      },
      "multi-entity-extraction": {
          "seed_fn": seed_task6,
+         "golden_fn": golden_task6,
          "target_ddl": TASK6_TARGET_DDL,
+         "description": (
+             "Decompose the sales_records god-table into 3NF: salespersons, customers, products, sales, data_issues. "
+             "Route records with empty string '' customer emails to data_issues (not just NULL). "
+             "TRIM leading/trailing whitespace from all email addresses before inserting. "
+             "Each sale must link to the correct salesperson, customer, and product via FKs."
+         ),
          "difficulty": "medium",
          "max_steps": 15,
      },
      "cascade-migration": {
          "seed_fn": seed_task3,
+         "golden_fn": golden_task3,
          "target_ddl": TASK3_TARGET_DDL,
+         "description": (
+             "Multi-table FK cascade with type coercion, NULL handling, and orphan audit logging. "
+             "Convert salary from TEXT ('$90000') to INTEGER (90000) by stripping '$' and ','. "
+             "Remove employees with NULL salary and log them to audit_log with reason='null_salary'. "
+             "Remove orphaned assets (employee_id not in employees) and log them with reason='orphaned_record'. "
+             "Enforce NOT NULL and FK constraints on all tables."
+         ),
          "difficulty": "hard",
          "max_steps": 20,
      },
      "dual-source-consolidation": {
          "seed_fn": seed_task7,
+         "golden_fn": golden_task7,
          "target_ddl": TASK7_TARGET_DDL,
+         "description": (
+             "Merge 6 tables from Legacy CRM + Modern SaaS into 4 unified tables. "
+             "Cross-system customer dedup: match by email address. Set source='both' for matches, "
+             "'legacy' or 'modern' for unmatched. "
+             "Tier mapping (modern subscription_tier): 1=free, 2=basic, 3=premium, 4=enterprise. "
+             "Status mapping (modern tx_status): 1=pending, 2=processing, 3=complete, 4=failed, 5=refunded. "
+             "Legacy amounts are TEXT ('$1,234.56') — coerce to REAL. NULL currency defaults to 'USD'. "
+             "Log orphaned transactions (user_uuid not found) to migration_issues with issue_type='orphaned_record'. "
+             "Log customers with NULL email to migration_issues with issue_type='null_email'."
+         ),
          "difficulty": "hard",
          "max_steps": 20,
      },
server/__pycache__/environment.cpython-312.pyc CHANGED
Binary files a/server/__pycache__/environment.cpython-312.pyc and b/server/__pycache__/environment.cpython-312.pyc differ
 
server/__pycache__/grader.cpython-312.pyc CHANGED
Binary files a/server/__pycache__/grader.cpython-312.pyc and b/server/__pycache__/grader.cpython-312.pyc differ
 
server/environment.py CHANGED
@@ -4,11 +4,21 @@ SQL Migration Environment Server Implementation.
  This is the core environment that wraps SQLite and exposes it via the OpenEnv
  Environment interface. Each WebSocket session gets its own environment instance
  with an isolated in-memory database.
  """

  import sqlite3
  import uuid
- from typing import Any, Optional

  # Support both in-repo and standalone imports
  try:
@@ -27,6 +37,26 @@ except ImportError:
      import seeds


  class DbMigrationEnvironment(Environment):
      """
      SQL Schema Migration Environment.
@@ -44,7 +74,7 @@ class DbMigrationEnvironment(Environment):
          Initialize the migration environment.

          Args:
-             task_name: One of "column-restructure", "table-normalization", "cascade-migration"
          """
          super().__init__()

@@ -59,14 +89,17 @@
          self._conn: Optional[sqlite3.Connection] = None
          self._reconciler: Optional[StateReconciler] = None
          self._step_count = 0
          self._state = MigrationState(
              task_name=task_name,
              migration_progress=0.0,
-             max_steps=20,
          )

      def _get_current_schema(self) -> str:
-         """Get current database schema as DDL string."""
          if self._conn is None:
              return ""
          try:
@@ -79,6 +112,75 @@
          except Exception:
              return ""

      def reset(
          self,
          seed: Optional[int] = None,
@@ -101,6 +203,7 @@
          if task_name != self.task_name and task_name in seeds.TASKS:
              self.task_name = task_name
              self._task_config = seeds.TASKS[task_name]

          # Clean up previous connection
          if self._conn is not None:
@@ -122,16 +225,21 @@
          # Initialize grader
          self._reconciler = StateReconciler(self.task_name)

-         # Reset counters
          self._step_count = 0
          self._state = MigrationState(
              episode_id=episode_id or str(uuid.uuid4()),
              step_count=0,
              task_name=self.task_name,
              migration_progress=0.0,
-             max_steps=20,
          )

          return MigrationObservation(
              done=False,
              reward=0.0,
@@ -139,7 +247,7 @@
              target_schema_sql=self._task_config["target_ddl"],
              last_execution_result="Environment initialized. Ready for migration.",
              step_number=0,
-             migration_progress=0.0,
              task_name=self.task_name,
              metadata={"status": "ready"},
          )
@@ -155,7 +263,7 @@

          Args:
              action: MigrationAction with sql_command, reasoning, and submit_final
-             timeout_s: Unused
              **kwargs: Additional parameters

          Returns:
@@ -178,42 +286,87 @@ class DbMigrationEnvironment(Environment):
178
  )
179
 
180
  self._step_count += 1
 
181
 
182
- # Execute the SQL command
183
- execution_result = ""
184
- action_error = None
185
- try:
186
- cursor = self._conn.execute(action.sql_command)
187
- self._conn.commit()
188
- rows_affected = cursor.rowcount
189
- execution_result = f"Success: {rows_affected} rows affected"
190
- except sqlite3.Warning as e:
191
- # Multi-statement attempt — agent tried to combine statements
192
  execution_result = (
193
- f"Error: SQLite requires one statement per step. "
194
- f"Split your commands into separate steps. Original error: {e}"
 
195
  )
196
- action_error = "multi_statement"
197
- try:
198
- self._conn.rollback()
199
- except Exception:
200
- pass
201
- except Exception as e:
202
- # Never crash — feed the error back to the agent
203
- execution_result = str(e)
204
- action_error = str(e)
205
- # Rollback failed transaction
206
- try:
207
- self._conn.rollback()
208
- except Exception:
209
- pass

          # Compute scores
          current_score, step_reward = self._reconciler.compute_step_reward(self._conn)

          # Episode termination: submit_final, max steps, OR perfect score
-         task_max = self._task_config.get("max_steps", 20)
-         done = action.submit_final or self._step_count >= task_max or current_score >= 0.99

          # Update state
          self._state.step_count = self._step_count
@@ -227,6 +380,9 @@ class DbMigrationEnvironment(Environment):
          }
          if action_error:
              meta["error"] = action_error

          return MigrationObservation(
              done=done,

  This is the core environment that wraps SQLite and exposes it via the OpenEnv
  Environment interface. Each WebSocket session gets its own environment instance
  with an isolated in-memory database.
+
+ Architecture Fixes Applied:
+ - A1: SELECT queries return actual data rows (not just "rows affected")
+ - A2: SQL execution timeout via progress handler (prevents infinite CTEs)
+ - A3: Dangerous SQL blacklist (ATTACH, DETACH, LOAD_EXTENSION, writable_schema)
+ - A4: Transaction awareness (respects BEGIN/COMMIT/ROLLBACK from agent)
+ - A5: Trajectory logging (full SQL history in metadata on episode end)
+ - A6: Per-task max_steps from seeds registry
  """

+ import re
  import sqlite3
+ import threading
  import uuid
+ from typing import Any, Dict, List, Optional

  # Support both in-repo and standalone imports
  try:

  import seeds


+ # --- A3: Dangerous SQL Blacklist ---
+ _DANGEROUS_PATTERNS = re.compile(
+     r"\b(ATTACH\s+DATABASE|DETACH\s+DATABASE|LOAD_EXTENSION)\b"
+     r"|PRAGMA\s+writable_schema",
+     re.IGNORECASE,
+ )
+
+ # --- A4: Transaction control keywords ---
+ _TX_BEGIN = re.compile(r"^\s*(BEGIN|BEGIN\s+TRANSACTION|BEGIN\s+DEFERRED|BEGIN\s+IMMEDIATE|BEGIN\s+EXCLUSIVE)\s*;?\s*$", re.IGNORECASE)
+ _TX_END = re.compile(r"^\s*(COMMIT|END|END\s+TRANSACTION|ROLLBACK)\s*;?\s*$", re.IGNORECASE)
+
+ # --- A2: Maximum SQLite operations before timeout ---
+ _MAX_OPS = 500_000  # ~5 seconds on typical hardware
+
+
+ class _TimeoutError(Exception):
+     """Raised when SQL execution exceeds the operation budget."""
+     pass
+
+
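The A3/A4 guards added above are plain regexes. As a quick standalone check of how they classify input (patterns copied from the hunk above, with demo names; not part of the commit):

```python
import re

# Copies of the diff's _DANGEROUS_PATTERNS (A3) and _TX_BEGIN (A4) regexes.
DANGEROUS = re.compile(
    r"\b(ATTACH\s+DATABASE|DETACH\s+DATABASE|LOAD_EXTENSION)\b"
    r"|PRAGMA\s+writable_schema",
    re.IGNORECASE,
)
TX_BEGIN = re.compile(
    r"^\s*(BEGIN|BEGIN\s+TRANSACTION|BEGIN\s+DEFERRED|BEGIN\s+IMMEDIATE|BEGIN\s+EXCLUSIVE)\s*;?\s*$",
    re.IGNORECASE,
)

print(bool(DANGEROUS.search("attach database 'evil.db' as x")))  # True: blocked
print(bool(DANGEROUS.search("SELECT * FROM users")))             # False: allowed
print(bool(TX_BEGIN.match("  begin immediate; ")))               # True: explicit tx
print(bool(TX_BEGIN.match("BEGIN; COMMIT;")))                    # False: not a lone BEGIN
```

Note the `$`-anchored transaction patterns only match a statement that is *just* a BEGIN/COMMIT/ROLLBACK, so a combined `BEGIN; COMMIT;` falls through to normal execution.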
  class DbMigrationEnvironment(Environment):
      """
      SQL Schema Migration Environment.

          Initialize the migration environment.

          Args:
+             task_name: One of the registered task names in seeds.TASKS
          """
          super().__init__()

          self._conn: Optional[sqlite3.Connection] = None
          self._reconciler: Optional[StateReconciler] = None
          self._step_count = 0
+         self._trajectory: List[Dict[str, Any]] = []  # A5
+         self._in_explicit_tx = False  # A4
+         self._max_steps = self._task_config.get("max_steps", 20)  # A6
          self._state = MigrationState(
              task_name=task_name,
              migration_progress=0.0,
+             max_steps=self._max_steps,  # A6
          )

      def _get_current_schema(self) -> str:
+         """Get current database schema as DDL string, filtering internal tables."""
          if self._conn is None:
              return ""
          try:

          except Exception:
              return ""

+     def _is_read_query(self, sql: str) -> bool:
+         """Check if SQL is a read-only query (SELECT or certain PRAGMAs)."""
+         stripped = sql.strip().upper()
+         if stripped.startswith("SELECT"):
+             return True
+         # PRAGMA table_info, foreign_key_list, etc. are read-only
+         if stripped.startswith("PRAGMA") and "=" not in stripped:
+             return True
+         return False
+
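The heuristic above treats value-less PRAGMAs as reads and PRAGMA assignments as writes. A standalone copy for illustration (the function name here is hypothetical, the logic mirrors the hunk):

```python
# Mirror of _is_read_query from the hunk above, as a free function.
def is_read_query(sql: str) -> bool:
    s = sql.strip().upper()
    return s.startswith("SELECT") or (s.startswith("PRAGMA") and "=" not in s)

print(is_read_query("SELECT * FROM users"))        # True
print(is_read_query("PRAGMA table_info(users)"))   # True  (inspection, no "=")
print(is_read_query("PRAGMA foreign_keys = ON"))   # False (assignment)
print(is_read_query("DELETE FROM users"))          # False
```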
+     def _execute_with_timeout(self, sql: str) -> tuple:
+         """
+         Execute SQL with a progress-handler-based timeout.
+
+         Returns: (cursor_or_None, error_string_or_None)
+         """
+         ops_count = [0]
+
+         def _progress_callback():
+             ops_count[0] += 1
+             if ops_count[0] > _MAX_OPS:
+                 return 1  # Non-zero = abort
+             return 0
+
+         self._conn.set_progress_handler(_progress_callback, 1000)
+         try:
+             cursor = self._conn.execute(sql)
+             return cursor, None
+         except sqlite3.OperationalError as e:
+             if "interrupted" in str(e).lower() or ops_count[0] > _MAX_OPS:
+                 return None, "Error: Query exceeded execution time limit (possible infinite loop). Simplify your query."
+             return None, str(e)
+         except sqlite3.Warning as e:
+             return None, (
+                 f"Error: SQLite requires one statement per step. "
+                 f"Split your commands into separate steps. Original error: {e}"
+             )
+         except Exception as e:
+             return None, str(e)
+         finally:
+             self._conn.set_progress_handler(None, 0)
+
+ def _format_query_results(self, cursor) -> str:
158
+ """Format SELECT query results as a readable table string."""
159
+ try:
160
+ rows = cursor.fetchall()
161
+ if not rows:
162
+ return "Query returned 0 rows."
163
+
164
+ # Get column names
165
+ col_names = [desc[0] for desc in cursor.description] if cursor.description else []
166
+
167
+ # Cap at 50 rows
168
+ truncated = len(rows) > 50
169
+ display_rows = rows[:50]
170
+
171
+ # Build output
172
+ header = " | ".join(col_names) if col_names else "Results"
173
+ lines = [header, "-" * len(header)]
174
+ for row in display_rows:
175
+ lines.append(" | ".join(str(v) for v in row))
176
+ if truncated:
177
+ lines.append(f"... ({len(rows) - 50} more rows truncated)")
178
+ lines.append(f"({len(rows)} rows total)")
179
+
180
+ return "\n".join(lines)
181
+ except Exception:
182
+ return "Query executed successfully."
183
+
184
  def reset(
185
  self,
186
  seed: Optional[int] = None,
 
203
  if task_name != self.task_name and task_name in seeds.TASKS:
204
  self.task_name = task_name
205
  self._task_config = seeds.TASKS[task_name]
206
+ self._max_steps = self._task_config.get("max_steps", 20)
207
 
208
  # Clean up previous connection
209
  if self._conn is not None:
 
225
  # Initialize grader
226
  self._reconciler = StateReconciler(self.task_name)
227
 
228
+ # Reset counters and trajectory
229
  self._step_count = 0
230
+ self._trajectory = [] # A5
231
+ self._in_explicit_tx = False # A4
232
  self._state = MigrationState(
233
  episode_id=episode_id or str(uuid.uuid4()),
234
  step_count=0,
235
  task_name=self.task_name,
236
  migration_progress=0.0,
237
+ max_steps=self._max_steps, # A6
238
  )
239
 
240
+ # Compute initial score
241
+ initial_score = self._reconciler.score(self._conn)
242
+
243
  return MigrationObservation(
244
  done=False,
245
  reward=0.0,
 
247
  target_schema_sql=self._task_config["target_ddl"],
248
  last_execution_result="Environment initialized. Ready for migration.",
249
  step_number=0,
250
+ migration_progress=initial_score,
251
  task_name=self.task_name,
252
  metadata={"status": "ready"},
253
  )
 

          Args:
              action: MigrationAction with sql_command, reasoning, and submit_final
+             timeout_s: Unused (we use progress handler instead)
              **kwargs: Additional parameters

          Returns:

          )

          self._step_count += 1
+         sql_command = action.sql_command.strip()

+         # --- A3: Dangerous SQL Blacklist ---
+         if _DANGEROUS_PATTERNS.search(sql_command):
              execution_result = (
+                 "Error: This SQL command is not allowed for security reasons. "
+                 "ATTACH DATABASE, DETACH DATABASE, LOAD_EXTENSION, and "
+                 "PRAGMA writable_schema are blocked."
              )
+             action_error = "blocked_command"
+         else:
+             # --- A4: Transaction Awareness ---
+             execution_result = ""
+             action_error = None
+
+             if _TX_BEGIN.match(sql_command):
+                 # Agent wants to start a transaction
+                 try:
+                     self._conn.execute("BEGIN")
+                     self._in_explicit_tx = True
+                     execution_result = "Success: Transaction started."
+                 except Exception as e:
+                     execution_result = str(e)
+                     action_error = str(e)
+             elif _TX_END.match(sql_command):
+                 # Agent wants to commit or rollback
+                 try:
+                     if sql_command.strip().upper().startswith("ROLLBACK"):
+                         self._conn.rollback()
+                         execution_result = "Success: Transaction rolled back."
+                     else:
+                         self._conn.commit()
+                         execution_result = "Success: Transaction committed."
+                     self._in_explicit_tx = False
+                 except Exception as e:
+                     execution_result = str(e)
+                     action_error = str(e)
+                     self._in_explicit_tx = False
+             else:
+                 # --- Normal SQL execution with timeout (A1, A2) ---
+                 cursor, error = self._execute_with_timeout(sql_command)
+
+                 if error:
+                     execution_result = error
+                     action_error = error
+                     # Rollback failed transaction
+                     try:
+                         if not self._in_explicit_tx:
+                             self._conn.rollback()
+                     except Exception:
+                         pass
+                 else:
+                     # --- A1: SELECT result passthrough ---
+                     if self._is_read_query(sql_command):
+                         execution_result = self._format_query_results(cursor)
+                     else:
+                         rows_affected = cursor.rowcount
+                         execution_result = f"Success: {rows_affected} rows affected"
+                     # Only auto-commit if not in explicit transaction (A4)
+                     if not self._in_explicit_tx:
+                         try:
+                             self._conn.commit()
+                         except Exception:
+                             pass

          # Compute scores
          current_score, step_reward = self._reconciler.compute_step_reward(self._conn)

          # Episode termination: submit_final, max steps, OR perfect score
+         done = action.submit_final or self._step_count >= self._max_steps or current_score >= 0.99
+
+         # --- A5: Trajectory logging ---
+         self._trajectory.append({
+             "step": self._step_count,
+             "sql": action.sql_command,
+             "reasoning": action.reasoning,
+             "result": execution_result[:200],  # Truncate for storage
+             "score": current_score,
+             "reward": step_reward,
+             "error": action_error,
+         })

          # Update state
          self._state.step_count = self._step_count

          }
          if action_error:
              meta["error"] = action_error
+         # Include full trajectory on episode end
+         if done:
+             meta["trajectory"] = self._trajectory

          return MigrationObservation(
              done=done,
server/grader.py CHANGED
@@ -1,756 +1,366 @@
  """
- StateReconciler — The Deep Structural Grading Engine for SQL Agents.
-
- > **Hackathon Judges Note:**
- > Naive SQL agents often "solve" migration environments by executing `DROP TABLE x; CREATE TABLE x ...`
- > to forge exactly matching schemas while silently destroying all data.
- >
- > This `StateReconciler` implements robust **Anti-Exploit Protection**. It doesn't just diff schemas;
- > it recursively runs data-integrity hashing, cross-checks row counts, and verifies orphaned records.
- > If an agent drops data to match a schema, the score is brutally clamped to 0.01.
- > Furthermore, it utilizes heavily weighted fractional rewards to provide continuous learning
- > signals to the RL agent during complex, multi-step constraints (e.g., fractional points for each FK enforced).
-
- CRITICAL ARCHITECTURE RULES:
- - The grader NEVER modifies the database (SELECT and PRAGMA only)
- - The grader NEVER raises exceptions (catches everything, isolated sandbox)
- - Scores are strictly clamped to (0.0, 1.0) exclusive per validation constraints.
  """

-
  import sqlite3
- from typing import Dict, List, Optional, Set, Tuple

- from seeds import (
-     TASK1_EXPECTED_ROWS,
-     TASK2_EXPECTED_CUSTOMER_COUNT,
-     TASK2_EXPECTED_ORDER_COUNT,
-     TASK3_EXPECTED_AUDIT_COUNT,
-     TASK3_EXPECTED_AUDIT_ENTRIES,
-     TASK3_EXPECTED_EMPLOYEE_COUNT,
-     TASK3_EXPECTED_SALARIES,
-     TASK4_EXPECTED_ROW_COUNT,
-     TASK4_EXPECTED_ID_SUM,
-     TASK4_EXPECTED_DELETED_COUNT,
-     TASK4_EXPECTED_ACTIVE_COUNT,
-     TASK5_EXPECTED_ROW_COUNT,
-     TASK5_EXPECTED_PRICE_SUM,
-     TASK5_EXPECTED_BOTH_COUNT,
-     TASK6_EXPECTED_SALESPERSON_COUNT,
-     TASK6_EXPECTED_CUSTOMER_COUNT,
-     TASK6_EXPECTED_PRODUCT_COUNT,
-     TASK6_EXPECTED_SALES_COUNT,
-     TASK6_EXPECTED_DATA_ISSUES_COUNT,
-     TASK7_EXPECTED_UNIFIED_CUSTOMERS,
-     TASK7_EXPECTED_BOTH_SOURCE_COUNT,
-     TASK7_EXPECTED_UNIFIED_ORDERS,
-     TASK7_EXPECTED_MIGRATION_ISSUES,
- )


  def _get_table_names(conn: sqlite3.Connection) -> Set[str]:
-     """Get all table names in the database."""
      try:
          cursor = conn.execute(
              "SELECT name FROM sqlite_master WHERE type='table' "
              "AND name NOT LIKE 'sqlite_%' ORDER BY name"
          )
-         return {row[0] for row in cursor.fetchall()}
      except Exception:
          return set()


- def _get_column_names(conn: sqlite3.Connection, table: str) -> Set[str]:
-     """Get column names for a given table."""
      try:
          cursor = conn.execute(f"PRAGMA table_info({table})")
-         return {row[1] for row in cursor.fetchall()}
      except Exception:
-         return set()


  def _get_row_count(conn: sqlite3.Connection, table: str) -> int:
-     """Get row count of a table. Returns 0 on any error."""
      try:
-         cursor = conn.execute(f"SELECT COUNT(*) FROM {table}")
          return cursor.fetchone()[0]
      except Exception:
          return 0


  def _has_foreign_key(conn: sqlite3.Connection, table: str, ref_table: str) -> bool:
-     """Check if table has a FK referencing ref_table."""
      try:
-         cursor = conn.execute(f"PRAGMA foreign_key_list({table})")
          for row in cursor.fetchall():
-             if row[2] == ref_table:
                  return True
          return False
      except Exception:
          return False

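`_has_foreign_key` relies on `PRAGMA foreign_key_list`, which returns one row per FK column with the layout `(id, seq, table, from, to, on_update, on_delete, match)`; the helper checks index 2 (the referenced table). A quick standalone illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY);
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id)
);
""")
# Row layout: (id, seq, ref_table, from_col, to_col, on_update, on_delete, match)
for row in conn.execute("PRAGMA foreign_key_list(orders)"):
    print(row[2], row[3], row[4])  # customers customer_id id
```

Because the grader only reads `sqlite_master` and PRAGMAs, it can verify constraints without ever mutating the agent's database.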
- class StateReconciler:
      """
-     Scores the current database state against the target for a specific task.

-     Instantiated once per episode. Tracks previous score to compute step deltas.
      """

      def __init__(self, task_name: str):
          self.task_name = task_name
          self._last_score: float = 0.0

      def score(self, conn: sqlite3.Connection) -> float:
          """
-         Compute the current migration score [0.0, 1.0].
-
-         Routes to the appropriate task-specific scorer.
-         Never raises; returns 0.01 on any unexpected error.
          """
          try:
-             if self.task_name == "column-restructure":
-                 return self._score_task1(conn)
-             elif self.task_name == "table-normalization":
-                 return self._score_task2(conn)
-             elif self.task_name == "cascade-migration":
-                 return self._score_task3(conn)
-             elif self.task_name == "soft-delete-restoration":
-                 return self._score_task4(conn)
-             elif self.task_name == "schema-version-merge":
-                 return self._score_task5(conn)
-             elif self.task_name == "multi-entity-extraction":
-                 return self._score_task6(conn)
-             elif self.task_name == "dual-source-consolidation":
-                 return self._score_task7(conn)
-             else:
-                 return 0.01
          except Exception:
              return 0.01

      def compute_step_reward(self, conn: sqlite3.Connection) -> Tuple[float, float]:
          """
-         Compute both the current score and the step reward delta.
-
-         Returns:
-             (current_score, step_reward) where step_reward = current - previous
          """
          current_score = self.score(conn)
          step_reward = current_score - self._last_score
          self._last_score = current_score
-         return current_score, step_reward
-
143
- # =========================================================================
144
- # Task 1: Column Restructure
145
- # =========================================================================
146
- # Weights: schema=0.4, row_count=0.2, data=0.4
147
-
148
- def _score_task1(self, conn: sqlite3.Connection) -> float:
149
- score = 0.0
150
- tables = _get_table_names(conn)
151
-
152
- if "users" not in tables:
153
- return 0.0
154
-
155
- columns = _get_column_names(conn, "users")
156
-
157
- # Schema check: full_name exists, old columns gone
158
- has_full_name = "full_name" in columns
159
- old_cols_gone = "first_name" not in columns and "last_name" not in columns
160
-
161
- if has_full_name and old_cols_gone:
162
- score += 0.4 # Full schema credit
163
- elif has_full_name:
164
- score += 0.2 # Partial: full_name exists but old cols remain
165
-
166
- # Row count check
167
- row_count = _get_row_count(conn, "users")
168
- if row_count == len(TASK1_EXPECTED_ROWS):
169
- score += 0.2
170
-
171
- # Data correctness check
172
- if has_full_name:
173
- try:
174
- cursor = conn.execute("SELECT id, full_name FROM users ORDER BY id")
175
- actual_rows = cursor.fetchall()
176
- if actual_rows == TASK1_EXPECTED_ROWS:
177
- score += 0.4
178
- elif len(actual_rows) > 0:
179
- # Partial credit: fraction of correct rows
180
- correct = sum(
181
- 1 for a, e in zip(actual_rows, TASK1_EXPECTED_ROWS)
182
- if a == e
183
- )
184
- score += 0.4 * (correct / len(TASK1_EXPECTED_ROWS))
185
- except Exception:
186
- pass
187
-
188
- # Exploit check: if schema matches but table is empty, cap score
189
- if has_full_name and old_cols_gone and row_count == 0:
190
- score = min(score, 0.1)
191
-
192
- return max(0.01, min(0.99, score))
193
-
194
- # =========================================================================
195
- # Task 2: Table Normalization
196
- # =========================================================================
197
- # Weights: tables_exist=0.1, fk=0.2, customer_count=0.2,
198
- # order_count=0.2, no_null_ids=0.1, integrity=0.2
199
-
200
- def _score_task2(self, conn: sqlite3.Connection) -> float:
201
- # Re-assert FK enforcement to prevent PRAGMA bypass exploit
202
  try:
203
- conn.execute("PRAGMA foreign_keys = ON")
204
  except Exception:
205
  pass
206
- score = 0.0
207
- tables = _get_table_names(conn)
208
-
209
- # Both tables exist
210
- has_customers = "customers" in tables
211
- has_orders = "orders" in tables
212
- if has_customers and has_orders:
213
- score += 0.1
214
-
215
- # FK constraint: orders -> customers
216
- if has_orders and _has_foreign_key(conn, "orders", "customers"):
217
- score += 0.2
218
-
219
- # Correct distinct customer count
220
- if has_customers:
221
- try:
222
- cursor = conn.execute("SELECT COUNT(DISTINCT email) FROM customers")
223
- distinct_count = cursor.fetchone()[0]
224
- if distinct_count == TASK2_EXPECTED_CUSTOMER_COUNT:
225
- score += 0.2
226
- except Exception:
227
- pass
228
-
229
- # Correct order count (all original purchases preserved)
230
- if has_orders:
231
- order_count = _get_row_count(conn, "orders")
232
- if order_count == TASK2_EXPECTED_ORDER_COUNT:
233
- score += 0.2
234
-
235
- # No NULL customer_ids in orders
236
- if has_orders:
237
- try:
238
- cursor = conn.execute(
239
- "SELECT COUNT(*) FROM orders WHERE customer_id IS NULL"
240
- )
241
- null_count = cursor.fetchone()[0]
242
- if null_count == 0:
243
- score += 0.1
244
- except Exception:
245
- pass
246
-
247
- # Integrity check
248
- try:
249
- cursor = conn.execute("PRAGMA integrity_check")
250
- result = cursor.fetchone()[0]
251
- if result == "ok":
252
- score += 0.2
253
- except Exception:
254
- pass
255
-
256
- # Exploit check: tables exist but are empty
257
- if has_customers and has_orders:
258
- c_count = _get_row_count(conn, "customers")
259
- o_count = _get_row_count(conn, "orders")
260
- if c_count == 0 and o_count == 0:
261
- score = min(score, 0.1)
262
-
263
- return max(0.01, min(0.99, score))
264
-
265
- # =========================================================================
266
- # Task 3: Cascade Migration
267
- # =========================================================================
268
- # Granular partial credit for each relationship in the FK chain.
269
- # Total weights: audit=0.30, fk_chain=0.20, emp_count=0.05,
270
- # salary_coercion=0.15, no_orphans=0.10, integrity=0.10
271
- # companies_not_null=0.05 (within fk_chain)
272
- # Total max = 0.90 for all grader checks + 0.10 integrity = 1.00
273
-
274
- def _score_task3(self, conn: sqlite3.Connection) -> float:
275
- # Re-assert FK enforcement to prevent PRAGMA bypass exploit
276
- try:
277
- conn.execute("PRAGMA foreign_keys = ON")
278
- except Exception:
279
- pass
280
- score = 0.0
281
- tables = _get_table_names(conn)
282
-
283
- # --- audit_log checks (0.30 total) ---
284
- has_audit = "audit_log" in tables
285
- if has_audit:
286
- score += 0.1 # table exists
287
-
288
- if has_audit:
289
- audit_count = _get_row_count(conn, "audit_log")
290
- if audit_count >= TASK3_EXPECTED_AUDIT_COUNT:
291
- score += 0.1 # has enough rows
292
-
293
- if has_audit:
294
- try:
295
- cursor = conn.execute(
296
- "SELECT source_table, reason FROM audit_log ORDER BY source_table, reason"
297
- )
298
- actual_entries = cursor.fetchall()
299
- expected_sorted = sorted(TASK3_EXPECTED_AUDIT_ENTRIES)
300
- if actual_entries == expected_sorted:
301
- score += 0.2
302
- elif len(actual_entries) > 0:
303
- correct = sum(1 for a in actual_entries if a in TASK3_EXPECTED_AUDIT_ENTRIES)
304
- score += 0.2 * (correct / TASK3_EXPECTED_AUDIT_COUNT)
305
- except Exception:
306
- pass
307
-
308
- # --- FK chain checks (0.20 total, 0.05 each) ---
309
- # departments -> companies
310
- if "departments" in tables and _has_foreign_key(conn, "departments", "companies"):
311
- score += 0.05
312
- # employees -> departments
313
- if "employees" in tables and _has_foreign_key(conn, "employees", "departments"):
314
- score += 0.05
315
- # assets -> employees
316
- if "assets" in tables and _has_foreign_key(conn, "assets", "employees"):
317
- score += 0.05
318
- # companies.name NOT NULL
319
- if "companies" in tables:
320
- try:
321
- cursor = conn.execute("PRAGMA table_info(companies)")
322
- for row in cursor.fetchall():
323
- if row[1] == "name" and row[3] == 1: # notnull flag
324
- score += 0.05
325
- break
326
- except Exception:
327
- pass
328
-
329
- # --- Employee count (Hal Patel removed) (0.05) ---
330
- if "employees" in tables:
331
- emp_count = _get_row_count(conn, "employees")
332
- if emp_count == TASK3_EXPECTED_EMPLOYEE_COUNT:
333
- score += 0.05
334
-
335
- # --- Salary coercion: TEXT $90000 -> INTEGER 90000 (0.15) ---
336
- if "employees" in tables:
337
- try:
338
- all_correct = True
339
- for emp_id, expected_salary in TASK3_EXPECTED_SALARIES.items():
340
- cursor = conn.execute(
341
- "SELECT salary FROM employees WHERE id = ?", (emp_id,)
342
- )
343
- row = cursor.fetchone()
344
- if row is None:
345
- all_correct = False
346
- break
347
- actual = row[0]
348
- if not isinstance(actual, int):
349
- try:
350
- actual = int(actual)
351
- except (ValueError, TypeError):
352
- all_correct = False
353
- break
354
- if actual != expected_salary:
355
- all_correct = False
356
- break
357
- if all_correct:
358
- score += 0.15
359
- except Exception:
360
- pass
361
-
362
- # --- No orphaned assets (0.10) ---
363
- if "assets" in tables and "employees" in tables:
364
- try:
365
- cursor = conn.execute(
366
- "SELECT COUNT(*) FROM assets WHERE employee_id NOT IN "
367
- "(SELECT id FROM employees)"
368
- )
369
- orphan_count = cursor.fetchone()[0]
370
- if orphan_count == 0:
371
- score += 0.10
372
- except Exception:
373
- pass
374
-
375
- # --- Integrity check (0.10) ---
376
- try:
377
- cursor = conn.execute("PRAGMA integrity_check")
378
- result = cursor.fetchone()[0]
379
- if result == "ok":
380
- score += 0.10
381
- except Exception:
382
- pass
383
-
384
- # Exploit check: if employees table is empty
385
- if "employees" in tables and _get_row_count(conn, "employees") == 0:
386
- score = min(score, 0.1)
387
-
388
- return max(0.01, min(0.99, score))
389
-
390
- # =========================================================================
391
- # Task 4: Soft-Delete Restoration (Easy)
392
- # =========================================================================
393
-
394
- def _score_task4(self, conn: sqlite3.Connection) -> float:
395
- score = 0.0
396
- tables = _get_table_names(conn)
397
-
398
- if "products" not in tables:
399
- return 0.01
400
-
401
- cols = _get_column_names(conn, "products")
402
-
403
- # is_deleted column exists (+0.15)
404
- if "is_deleted" in cols:
405
- score += 0.15
406
-
407
- # deleted_at column exists (+0.10)
408
- if "deleted_at" in cols:
409
- score += 0.10
410
-
411
- # Row count = 8 (+0.20)
412
- row_count = _get_row_count(conn, "products")
413
- if row_count == TASK4_EXPECTED_ROW_COUNT:
414
- score += 0.20
415
-
416
- # Active products: is_deleted=0, deleted_at IS NULL (+0.25)
417
- if "is_deleted" in cols:
418
- try:
419
- cursor = conn.execute(
420
- "SELECT COUNT(*) FROM products WHERE is_deleted = 0 AND deleted_at IS NULL"
421
- )
422
- active = cursor.fetchone()[0]
423
- if active == TASK4_EXPECTED_ACTIVE_COUNT:
424
- score += 0.25
425
- except Exception:
426
- pass
427
-
428
- # Restored products: is_deleted=1, deleted_at IS NOT NULL (+0.20)
429
- if "is_deleted" in cols:
430
- try:
431
- cursor = conn.execute(
432
- "SELECT COUNT(*) FROM products WHERE is_deleted = 1 AND deleted_at IS NOT NULL"
433
- )
434
- restored = cursor.fetchone()[0]
435
- if restored == TASK4_EXPECTED_DELETED_COUNT:
436
- score += 0.20
437
- except Exception:
438
- pass
439
-
440
- # SUM(id) fingerprint = 36 — no phantom rows (+0.10)
441
- try:
442
- cursor = conn.execute("SELECT SUM(id) FROM products")
443
- id_sum = cursor.fetchone()[0]
444
- if id_sum == TASK4_EXPECTED_ID_SUM:
445
- score += 0.10
446
- except Exception:
447
- pass
448
-
449
- # Exploit check
450
- if row_count == 0:
451
- score = min(score, 0.1)
452
-
453
- return max(0.01, min(0.99, score))
454
-
455
- # =========================================================================
456
- # Task 5: Schema Version Merge (Medium)
457
- # =========================================================================
458
-
459
- def _score_task5(self, conn: sqlite3.Connection) -> float:
460
- # Re-assert FK enforcement
461
- try:
462
- conn.execute("PRAGMA foreign_keys = ON")
463
- except Exception:
464
- pass
465
- score = 0.0
466
- tables = _get_table_names(conn)
467
 
468
- if "products" not in tables:
 
 
469
  return 0.01
470
-
471
- cols = _get_column_names(conn, "products")
472
-
473
- # Schema completeness: all 8 columns (+0.10)
474
- expected_cols = {"id", "name", "price", "category", "supplier", "brand", "sku", "source"}
475
- if expected_cols.issubset(cols):
476
- score += 0.10
477
-
478
- # Row count = 9 (+0.15)
479
- row_count = _get_row_count(conn, "products")
480
- if row_count == TASK5_EXPECTED_ROW_COUNT:
481
- score += 0.15
482
-
483
- # PRICE_SUM fingerprint (+0.20)
484
- try:
485
- cursor = conn.execute("SELECT ROUND(SUM(price), 2) FROM products")
486
- price_sum = cursor.fetchone()[0]
487
- if price_sum is not None and abs(price_sum - TASK5_EXPECTED_PRICE_SUM) < 0.02:
488
- score += 0.20
489
- except Exception:
490
- pass
491
-
492
- # source='both' for conflicted ids 1,2 (+0.15)
493
- if "source" in cols:
494
- try:
495
- cursor = conn.execute(
496
- "SELECT COUNT(*) FROM products WHERE source = 'both'"
497
- )
498
- both_count = cursor.fetchone()[0]
499
- if both_count == TASK5_EXPECTED_BOTH_COUNT:
500
- score += 0.15
501
- except Exception:
502
- pass
503
-
504
- # v2 name wins for conflicted rows (+0.15)
505
- try:
506
- cursor = conn.execute("SELECT name FROM products WHERE id = 2")
507
- row = cursor.fetchone()
508
- if row and "Updated" in row[0]:
509
- score += 0.15
510
- except Exception:
511
- pass
512
-
513
- # No NULL prices (+0.10)
514
- try:
515
- cursor = conn.execute("SELECT COUNT(*) FROM products WHERE price IS NULL")
516
- null_count = cursor.fetchone()[0]
517
- if null_count == 0:
518
- score += 0.10
519
- except Exception:
520
- pass
521
-
522
- # PRAGMA integrity_check (+0.15)
523
- try:
524
- cursor = conn.execute("PRAGMA integrity_check")
525
- result = cursor.fetchone()[0]
526
- if result == "ok":
527
- score += 0.15
528
- except Exception:
529
- pass
530
-
531
- # Exploit check
532
- if row_count == 0:
533
- score = min(score, 0.1)
534
-
535
- return max(0.01, min(0.99, score))
536
-
-     # =========================================================================
-     # Task 6: Multi-Entity Extraction (Medium — Hard End)
-     # =========================================================================
-
-     def _score_task6(self, conn: sqlite3.Connection) -> float:
-         # Re-assert FK enforcement
-         try:
-             conn.execute("PRAGMA foreign_keys = ON")
-         except Exception:
-             pass
-         score = 0.0
-         tables = _get_table_names(conn)
-
-         # All 5 tables exist (+0.10)
-         required = {"salespersons", "customers", "products", "sales", "data_issues"}
-         if required.issubset(tables):
-             score += 0.10
-
-         # salesperson count = 3 (+0.10)
-         if "salespersons" in tables:
-             count = _get_row_count(conn, "salespersons")
-             if count == TASK6_EXPECTED_SALESPERSON_COUNT:
-                 score += 0.10
-
-         # customer count = 3 (invalid excluded) (+0.12)
-         if "customers" in tables:
-             count = _get_row_count(conn, "customers")
-             if count == TASK6_EXPECTED_CUSTOMER_COUNT:
-                 score += 0.12
-
-         # product count = 5 (+0.10)
-         if "products" in tables:
-             count = _get_row_count(conn, "products")
-             if count == TASK6_EXPECTED_PRODUCT_COUNT:
-                 score += 0.10
-
-         # sales count = 11 (bad row excluded) (+0.12)
-         if "sales" in tables:
-             count = _get_row_count(conn, "sales")
-             if count == TASK6_EXPECTED_SALES_COUNT:
-                 score += 0.12
-
-         # All 3 FKs present in sales (+0.15)
-         if "sales" in tables:
-             fk_count = 0
-             if _has_foreign_key(conn, "sales", "salespersons"): fk_count += 1
-             if _has_foreign_key(conn, "sales", "customers"): fk_count += 1
-             if _has_foreign_key(conn, "sales", "products"): fk_count += 1
-             score += 0.05 * fk_count  # 0.15 total for all 3
-
-         # data_issues count = 1, for row 6 (+0.11)
-         if "data_issues" in tables:
-             count = _get_row_count(conn, "data_issues")
-             if count == TASK6_EXPECTED_DATA_ISSUES_COUNT:
-                 score += 0.11
-
-         # alice email is trimmed (+0.10)
-         if "salespersons" in tables:
-             try:
-                 cursor = conn.execute(
-                     "SELECT email FROM salespersons WHERE name LIKE '%Alice%'"
-                 )
-                 row = cursor.fetchone()
-                 if row and row[0] == "alice@company.com":
-                     score += 0.10
-             except Exception:
-                 pass
-
-         # PRAGMA integrity_check (+0.10)
-         try:
-             cursor = conn.execute("PRAGMA integrity_check")
-             result = cursor.fetchone()[0]
-             if result == "ok":
-                 score += 0.10
-         except Exception:
-             pass
-
-         # Exploit check
-         sales_count = _get_row_count(conn, "sales") if "sales" in tables else 0
-         if sales_count == 0 and "sales" in tables:
-             score = min(score, 0.1)
-
-         return max(0.01, min(0.99, score))
-
621
- # =========================================================================
622
- # Task 7: Dual-Source Consolidation (Hard)
623
- # =========================================================================
624
-
625
- def _score_task7(self, conn: sqlite3.Connection) -> float:
626
- # Re-assert FK enforcement
627
  try:
 
628
  conn.execute("PRAGMA foreign_keys = ON")
629
- except Exception:
630
- pass
631
- score = 0.0
632
- tables = _get_table_names(conn)
633
-
634
- # All 4 tables exist (+0.05)
635
- required = {"unified_customers", "unified_products", "unified_orders", "migration_issues"}
636
- if required.issubset(tables):
637
- score += 0.05
638
-
639
- # unified_customers count = 7 (+0.08)
640
- if "unified_customers" in tables:
641
- count = _get_row_count(conn, "unified_customers")
642
- if count == TASK7_EXPECTED_UNIFIED_CUSTOMERS:
643
- score += 0.08
644
-
645
- # source='both' for email-matched records (+0.08)
646
- if "unified_customers" in tables:
647
- try:
648
- cursor = conn.execute(
649
- "SELECT COUNT(*) FROM unified_customers WHERE source = 'both'"
650
- )
651
- both = cursor.fetchone()[0]
652
- if both == TASK7_EXPECTED_BOTH_SOURCE_COUNT:
653
- score += 0.08
654
- except Exception:
655
- pass
656
-
657
- # Legacy amount coercion — check unified_orders has REAL amounts (+0.10)
658
- if "unified_orders" in tables:
659
- try:
660
- cursor = conn.execute(
661
- "SELECT COUNT(*) FROM unified_orders WHERE typeof(amount) = 'real' OR typeof(amount) = 'integer'"
662
- )
663
- real_count = cursor.fetchone()[0]
664
- order_count = _get_row_count(conn, "unified_orders")
665
- if real_count == order_count and order_count > 0:
666
- score += 0.10
667
- except Exception:
668
- pass
669
-
670
- # NULL currency → 'USD' fill (+0.07)
671
- if "unified_orders" in tables:
672
- try:
673
- cursor = conn.execute(
674
- "SELECT COUNT(*) FROM unified_orders WHERE currency IS NULL"
675
- )
676
- null_curr = cursor.fetchone()[0]
677
- if null_curr == 0:
678
- score += 0.07
679
- except Exception:
680
- pass
681
-
682
- # tx_status mapped to strings (+0.10)
683
- if "unified_orders" in tables:
684
- try:
685
- cursor = conn.execute(
686
- "SELECT COUNT(*) FROM unified_orders WHERE typeof(status) = 'text'"
687
- )
688
- text_count = cursor.fetchone()[0]
689
- order_count = _get_row_count(conn, "unified_orders")
690
- if text_count == order_count and order_count > 0:
691
- score += 0.10
692
- except Exception:
693
- pass
694
-
695
- # subscription_tier mapped to strings (+0.08)
696
- if "unified_customers" in tables:
697
- try:
698
- cursor = conn.execute(
699
- "SELECT COUNT(*) FROM unified_customers WHERE typeof(tier) = 'text'"
700
- )
701
- text_count = cursor.fetchone()[0]
702
- cust_count = _get_row_count(conn, "unified_customers")
703
- if text_count == cust_count and cust_count > 0:
704
- score += 0.08
705
- except Exception:
706
- pass
707
-
708
- # migration_issues count = 2 (+0.08)
709
- if "migration_issues" in tables:
710
- count = _get_row_count(conn, "migration_issues")
711
- if count == TASK7_EXPECTED_MIGRATION_ISSUES:
712
- score += 0.08
713
-
714
- # Orphaned transaction in issues (+0.07)
715
- if "migration_issues" in tables:
716
- try:
717
- cursor = conn.execute(
718
- "SELECT COUNT(*) FROM migration_issues WHERE issue_type = 'orphaned_record'"
719
- )
720
- orphan_issues = cursor.fetchone()[0]
721
- if orphan_issues >= 1:
722
- score += 0.07
723
- except Exception:
724
- pass
725
-
726
- # NULL email customer in issues (+0.07)
727
- if "migration_issues" in tables:
728
- try:
729
- cursor = conn.execute(
730
- "SELECT COUNT(*) FROM migration_issues WHERE issue_type = 'null_email'"
731
- )
732
- null_issues = cursor.fetchone()[0]
733
- if null_issues >= 1:
734
- score += 0.07
735
- except Exception:
736
- pass
737
-
738
- # FK integrity on unified_orders (+0.10)
739
- if "unified_orders" in tables:
740
- if _has_foreign_key(conn, "unified_orders", "unified_customers"):
741
- score += 0.10
742
-
743
- # PRAGMA integrity_check (+0.10)
744
- try:
745
  cursor = conn.execute("PRAGMA integrity_check")
746
  result = cursor.fetchone()[0]
747
- if result == "ok":
748
- score += 0.10
749
  except Exception:
750
  pass
751
-
752
- # Exploit check
753
- if "unified_orders" in tables and _get_row_count(conn, "unified_orders") == 0:
754
- score = min(score, 0.1)
755
-
756
- return max(0.01, min(0.99, score))
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  """
+ StateReconciler — Dynamic Golden Database Grading Engine.
+
+ ARCHITECTURE:
+ - Instead of hardcoded expected values, we build a "golden" database by running
+   the correct migration on a fresh copy of the seed data.
+ - The agent's database is compared table-by-table against this golden reference.
+ - This makes the grader SEED-INDEPENDENT: if judges change the seed data,
+   the golden DB auto-updates and scoring remains accurate.
+
+ SCORING WEIGHTS (per-table, dynamic):
+ - Schema match (table exists, correct columns): 30%
+ - Data match (row count + content): 40%
+ - FK & constraint integrity: 20%
+ - Anti-exploit checks: 10%
+
+ ANTI-EXPLOIT PROTECTIONS:
+ - Case-insensitive table/column name comparison
+ - PRAGMA state preservation (grader doesn't corrupt agent's FK state)
+ - Phantom row detection (SUM fingerprinting)
+ - Empty table exploitation blocked
+ - Extra/leftover table penalty
  """

  import sqlite3
+ from typing import Any, Dict, List, Optional, Set, Tuple

+ # Import seeds for golden migration functions
+ try:
+     from .. import seeds
+ except ImportError:
+     import seeds


  def _get_table_names(conn: sqlite3.Connection) -> Set[str]:
+     """Get all user table names (case-normalized to lowercase)."""
      try:
          cursor = conn.execute(
              "SELECT name FROM sqlite_master WHERE type='table' "
              "AND name NOT LIKE 'sqlite_%' ORDER BY name"
          )
+         return {row[0].lower() for row in cursor.fetchall()}
      except Exception:
          return set()


+ def _get_column_info(conn: sqlite3.Connection, table: str) -> List[dict]:
+     """Get column info for a table. Returns list of {name, type, notnull, pk}."""
      try:
          cursor = conn.execute(f"PRAGMA table_info({table})")
+         return [
+             {"name": row[1].lower(), "type": row[2].upper(), "notnull": row[3], "pk": row[5]}
+             for row in cursor.fetchall()
+         ]
      except Exception:
+         return []
+
+
+ def _get_column_names(conn: sqlite3.Connection, table: str) -> Set[str]:
+     """Get column names (lowercase) for a table."""
+     return {col["name"] for col in _get_column_info(conn, table)}


  def _get_row_count(conn: sqlite3.Connection, table: str) -> int:
+     """Get row count. Returns 0 on error."""
      try:
+         cursor = conn.execute(f"SELECT COUNT(*) FROM [{table}]")
          return cursor.fetchone()[0]
      except Exception:
          return 0


+ def _get_all_rows(conn: sqlite3.Connection, table: str) -> List[Tuple]:
+     """Get all rows from a table, sorted for deterministic comparison."""
+     try:
+         cols = _get_column_names(conn, table)
+         if not cols:
+             return []
+         cursor = conn.execute(f"SELECT * FROM [{table}] ORDER BY 1")
+         return cursor.fetchall()
+     except Exception:
+         return []
+
+
  def _has_foreign_key(conn: sqlite3.Connection, table: str, ref_table: str) -> bool:
+     """Check if table has a FK referencing ref_table (case-insensitive)."""
      try:
+         cursor = conn.execute(f"PRAGMA foreign_key_list([{table}])")
          for row in cursor.fetchall():
+             if row[2].lower() == ref_table.lower():
                  return True
          return False
      except Exception:
          return False


+ def _count_foreign_keys(conn: sqlite3.Connection, table: str) -> int:
+     """Count all FK relationships for a table."""
+     try:
+         cursor = conn.execute(f"PRAGMA foreign_key_list([{table}])")
+         refs = set()
+         for row in cursor.fetchall():
+             refs.add(row[2].lower())
+         return len(refs)
+     except Exception:
+         return 0
+
+
+ def _build_golden_db(task_name: str) -> sqlite3.Connection:
+     """
+     Build a golden reference database for a task.
+
+     Seeds a fresh in-memory DB with the task's seed data, then applies
+     the golden migration to produce the expected final state.
+     """
+     task_config = seeds.TASKS[task_name]
+     conn = sqlite3.connect(":memory:")
+     conn.execute("PRAGMA foreign_keys = ON")
+
+     # Seed with same data as agent
+     task_config["seed_fn"](conn)
+
+     # Apply perfect migration
+     task_config["golden_fn"](conn)
+
+     return conn
+
+
+ def _compare_row_data(
+     agent_rows: List[Tuple],
+     golden_rows: List[Tuple],
+ ) -> float:
+     """
+     Compare row data between agent and golden databases.
+
+     Returns a similarity score between 0.0 and 1.0.
+     Handles: different row counts, partial matches, type coercion differences.
      """
+     if not golden_rows:
+         return 1.0 if not agent_rows else 0.0
+     if not agent_rows:
+         return 0.0
+
+     # Exact match
+     if agent_rows == golden_rows:
+         return 1.0
+
+     # Row count match bonus
+     count_match = 1.0 if len(agent_rows) == len(golden_rows) else (
+         min(len(agent_rows), len(golden_rows)) / max(len(agent_rows), len(golden_rows))
+     )
+
+     # Per-row comparison (order-independent for flexibility)
+     golden_set = set()
+     for row in golden_rows:
+         # Normalize: convert all values to strings for loose comparison
+         golden_set.add(tuple(str(v).strip() if v is not None else "" for v in row))
+
+     matched = 0
+     for row in agent_rows:
+         normalized = tuple(str(v).strip() if v is not None else "" for v in row)
+         if normalized in golden_set:
+             matched += 1
+             golden_set.discard(normalized)
+
+     if len(golden_rows) == 0:
+         content_match = 0.0
+     else:
+         content_match = matched / len(golden_rows)
+
+     # Penalize extra rows (data bloat)
+     if len(agent_rows) > len(golden_rows):
+         bloat_penalty = max(0, 1.0 - (len(agent_rows) - len(golden_rows)) / len(golden_rows))
+         content_match *= bloat_penalty
+
+     return 0.4 * count_match + 0.6 * content_match


+ class StateReconciler:
+     """
+     Dynamic Golden Database grading engine.
+
+     Compares the agent's database state against a dynamically-generated
+     golden reference database. No hardcoded expected values.
      """

      def __init__(self, task_name: str):
          self.task_name = task_name
          self._last_score: float = 0.0
+         self._golden_conn: Optional[sqlite3.Connection] = None
+
+         # Build golden reference DB
+         try:
+             self._golden_conn = _build_golden_db(task_name)
+             self._golden_tables = _get_table_names(self._golden_conn)
+             self._golden_table_data: Dict[str, dict] = {}
+
+             for table in self._golden_tables:
+                 self._golden_table_data[table] = {
+                     "columns": _get_column_info(self._golden_conn, table),
+                     "col_names": _get_column_names(self._golden_conn, table),
+                     "rows": _get_all_rows(self._golden_conn, table),
+                     "row_count": _get_row_count(self._golden_conn, table),
+                     "fk_count": _count_foreign_keys(self._golden_conn, table),
+                 }
+         except Exception:
+             self._golden_tables = set()
+             self._golden_table_data = {}
+
+     def __del__(self):
+         """Clean up golden DB connection."""
+         if self._golden_conn is not None:
+             try:
+                 self._golden_conn.close()
+             except Exception:
+                 pass

      def score(self, conn: sqlite3.Connection) -> float:
          """
+         Compute migration score by comparing agent DB against golden reference.
+
+         Scoring breakdown:
+         - Schema match: 0.30 (tables exist with correct columns)
+         - Data match: 0.40 (row content matches golden DB)
+         - FK/constraint integrity: 0.20 (FKs enforced, integrity OK)
+         - Anti-exploit bonus: 0.10 (no empty tables, no extra tables)
+
+         Returns: float in [0.01, 0.99]
          """
          try:
+             return self._score_dynamic(conn)
          except Exception:
              return 0.01

      def compute_step_reward(self, conn: sqlite3.Connection) -> Tuple[float, float]:
          """
+         Compute current score and step reward delta.
+
+         CRITICAL: Preserves the agent's PRAGMA foreign_keys state.
+         The grader reads FK state, does its work, then restores it.
          """
+         # A8: Preserve PRAGMA state
+         try:
+             original_fk = conn.execute("PRAGMA foreign_keys").fetchone()[0]
+         except Exception:
+             original_fk = 1
+
          current_score = self.score(conn)
          step_reward = current_score - self._last_score
          self._last_score = current_score
+
+         # A8: Restore original PRAGMA state
          try:
+             conn.execute(f"PRAGMA foreign_keys = {'ON' if original_fk else 'OFF'}")
          except Exception:
              pass
+
+         return current_score, step_reward

+     def _score_dynamic(self, conn: sqlite3.Connection) -> float:
+         """Core dynamic scoring: compare agent DB against golden DB."""
+         if not self._golden_tables:
              return 0.01
+
+         agent_tables = _get_table_names(conn)
+
+         # ---- 1. Schema Match (0.30) ----
+         schema_score = 0.0
+         tables_found = 0
+         total_col_match = 0.0
+
+         for table in self._golden_tables:
+             golden_info = self._golden_table_data[table]
+
+             if table in agent_tables:
+                 tables_found += 1
+                 # Column name comparison
+                 agent_cols = _get_column_names(conn, table)
+                 golden_cols = golden_info["col_names"]
+                 if golden_cols:
+                     col_overlap = len(agent_cols & golden_cols) / len(golden_cols)
+                     total_col_match += col_overlap
+                 else:
+                     total_col_match += 1.0
+
+         if self._golden_tables:
+             table_ratio = tables_found / len(self._golden_tables)
+             col_ratio = total_col_match / len(self._golden_tables) if self._golden_tables else 0
+             schema_score = 0.15 * table_ratio + 0.15 * col_ratio
+
+         # ---- 2. Data Match (0.40) ----
+         data_score = 0.0
+         data_checks = 0
+
+         for table in self._golden_tables:
+             golden_info = self._golden_table_data[table]
+             if table not in agent_tables:
+                 data_checks += 1
+                 continue
+
+             agent_rows = _get_all_rows(conn, table)
+             golden_rows = golden_info["rows"]
+
+             similarity = _compare_row_data(agent_rows, golden_rows)
+             data_score += similarity
+             data_checks += 1
+
+         if data_checks > 0:
+             data_score = 0.40 * (data_score / data_checks)
+
+         # ---- 3. FK & Constraint Integrity (0.20) ----
+         fk_score = 0.0
+         fk_checks = 0
+
+         for table in self._golden_tables:
+             golden_info = self._golden_table_data[table]
+             expected_fks = golden_info["fk_count"]
+
+             if expected_fks > 0 and table in agent_tables:
+                 agent_fks = _count_foreign_keys(conn, table)
+                 fk_ratio = min(agent_fks, expected_fks) / expected_fks
+                 fk_score += fk_ratio
+                 fk_checks += 1
+
+         # PRAGMA integrity check
+         integrity_ok = False
          try:
+             # Temporarily enable FK for integrity check
              conn.execute("PRAGMA foreign_keys = ON")
              cursor = conn.execute("PRAGMA integrity_check")
              result = cursor.fetchone()[0]
+             integrity_ok = (result == "ok")
          except Exception:
              pass
+
+         if fk_checks > 0:
+             fk_score = 0.10 * (fk_score / fk_checks)
+         else:
+             # No FK constraints expected — award full FK portion
+             fk_score = 0.10
+         fk_score += 0.10 if integrity_ok else 0.0
+
+         # ---- 4. Anti-Exploit Checks (0.10) ----
+         exploit_score = 0.10  # Start with full credit, deduct for violations
+
+         # Check for empty tables where golden has data
+         for table in self._golden_tables:
+             golden_info = self._golden_table_data[table]
+             if golden_info["row_count"] > 0 and table in agent_tables:
+                 agent_count = _get_row_count(conn, table)
+                 if agent_count == 0:
+                     # Agent emptied a table that should have data — heavy penalty
+                     exploit_score = 0.0
+                     # Also cap the data score for this exploit
+                     data_score = min(data_score, 0.05)
+                     break
+
+         # Penalize extra non-golden tables (schema pollution)
+         extra_tables = agent_tables - self._golden_tables
+         if extra_tables:
+             # Small penalty per extra table (some might be temp tables)
+             penalty = min(0.05, 0.01 * len(extra_tables))
+             exploit_score = max(0, exploit_score - penalty)
+
+         total = schema_score + data_score + fk_score + exploit_score
+         return max(0.01, min(0.99, total))
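The order-independent row comparison in `_compare_row_data` above is the heart of the data-match score. A standalone sketch of the same normalization idea (hypothetical helper, not part of the committed grader; it uses a list instead of a set so duplicate golden rows are counted individually):

```python
def compare_rows(agent_rows, golden_rows):
    """Loose multiset comparison: stringify and strip values, match against golden."""
    if not golden_rows:
        return 1.0 if not agent_rows else 0.0
    if not agent_rows:
        return 0.0
    # Normalize every value to a stripped string so TEXT "3" matches INTEGER 3
    norm = lambda r: tuple(str(v).strip() if v is not None else "" for v in r)
    remaining = [norm(r) for r in golden_rows]
    matched = 0
    for r in agent_rows:
        n = norm(r)
        if n in remaining:
            matched += 1
            remaining.remove(n)  # each golden row can be claimed once
    return matched / len(golden_rows)

print(compare_rows([(1, "Ada ")], [(1, "Ada")]))                      # 1.0 (whitespace normalized)
print(compare_rows([(1, "Ada"), (2, "Bob")], [(1, "Ada"), (2, "Eve")]))  # 0.5 (partial match)
```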
test_all_tasks.py CHANGED
@@ -1,49 +1,105 @@
- """Quick validation of all 7 tasks: seeds + graders."""
- import sqlite3
  import sys
  import os

  sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))

- from seeds import TASKS
  from server.grader import StateReconciler

- print(f"Tasks registered: {len(TASKS)}")
- assert len(TASKS) == 7, f"Expected 7 tasks, got {len(TASKS)}"
- print(f"  Names: {list(TASKS.keys())}")

- for name, cfg in TASKS.items():
-     # Seed
      conn = sqlite3.connect(":memory:")
      conn.execute("PRAGMA foreign_keys = ON")
-     cfg["seed_fn"](conn)

-     cursor = conn.execute(
-         "SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'"
-     )
-     tables = [r[0] for r in cursor.fetchall()]
-     print(f"\n[{name}] ({cfg['difficulty']}, max_steps={cfg.get('max_steps', 20)})")
-     print(f"  Tables: {tables}")

-     # Grade
-     reconciler = StateReconciler(name)
-     score = reconciler.score(conn)
-     assert 0.01 <= score <= 0.99, f"Score {score} out of [0.01, 0.99]!"
-     print(f"  Initial score: {score:.2f} OK")

      conn.close()

- # Also test environment resets for each task
- from server.environment import DbMigrationEnvironment

- for name in TASKS:
-     env = DbMigrationEnvironment(task_name=name)
      obs = env.reset()
-     assert obs.done == False
-     assert obs.step_number == 0
-     print(f"  [{name}] Environment reset OK")
      env.close()

- print("\n" + "=" * 50)
- print("ALL 7 TASKS VALIDATED SUCCESSFULLY!")
- print("=" * 50)

+ """Test all 7 tasks: seed, golden migration, grade, reset, close."""
  import sys
  import os
+ import sqlite3

  sys.path.insert(0, os.path.dirname(os.path.abspath(__file__)))
+ sys.path.insert(0, os.path.join(os.path.dirname(os.path.abspath(__file__)), '..', 'OpenEnv', 'src'))

+ import seeds
  from server.grader import StateReconciler
+ from server.environment import DbMigrationEnvironment
+ from models import MigrationAction


+ def test_golden_migration(task_name: str) -> None:
+     """Test that golden migration produces a near-perfect grader score."""
+     config = seeds.TASKS[task_name]
+
+     # 1. Create DB and seed
      conn = sqlite3.connect(":memory:")
      conn.execute("PRAGMA foreign_keys = ON")
+     config["seed_fn"](conn)

+     # 2. Score before migration (should be low)
+     reconciler = StateReconciler(task_name)
+     score_before = reconciler.score(conn)

+     # 3. Run golden migration
+     config["golden_fn"](conn)
+
+     # 4. Score after migration (should be >0.90)
+     score_after = reconciler.score(conn)

      conn.close()
+
+     status = "PASS" if score_after >= 0.90 else "FAIL"
+     print(f"  [{status}] {task_name}: before={score_before:.2f} after={score_after:.2f}")
+
+     if score_after < 0.90:
+         raise AssertionError(f"{task_name}: golden migration only scored {score_after:.2f}")


+ def test_environment_lifecycle(task_name: str) -> None:
+     """Test that environment can reset, step, and close without crashes."""
+     env = DbMigrationEnvironment(task_name=task_name)
      obs = env.reset()
+
+     assert not obs.done, f"{task_name}: obs.done should be False after reset"
+     assert obs.step_number == 0, f"{task_name}: step should be 0 after reset"
+     assert obs.current_schema_sql, f"{task_name}: should have current schema"
+     assert obs.target_schema_sql, f"{task_name}: should have target schema"
+
+     # Run a SELECT to verify data passthrough
+     action = MigrationAction(
+         sql_command="SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%'",
+         reasoning="List tables",
+         submit_final=False,
+     )
+     obs = env.step(action)
+     assert "rows total" in obs.last_execution_result or "Query returned" in obs.last_execution_result, \
+         f"{task_name}: SELECT should return formatted data, got: {obs.last_execution_result[:100]}"
+
      env.close()
+     print(f"  [PASS] {task_name}: environment lifecycle OK (SELECT data passthrough verified)")
+
+
+ def main():
+     print("=" * 60)
+     print("Testing Golden Migrations (all 7 tasks)")
+     print("=" * 60)
+
+     errors = []
+     for task_name in seeds.TASKS:
+         try:
+             test_golden_migration(task_name)
+         except Exception as e:
+             errors.append(f"Golden {task_name}: {e}")
+
+     print()
+     print("=" * 60)
+     print("Testing Environment Lifecycle (all 7 tasks)")
+     print("=" * 60)
+
+     for task_name in seeds.TASKS:
+         try:
+             test_environment_lifecycle(task_name)
+         except Exception as e:
+             errors.append(f"Lifecycle {task_name}: {e}")
+
+     print()
+     if errors:
+         print("=" * 60)
+         print(f"FAILURES ({len(errors)}):")
+         for e in errors:
+             print(f"  ✗ {e}")
+         print("=" * 60)
+         sys.exit(1)
+     else:
+         print("=" * 60)
+         print("ALL 7 TASKS PASSED!")
+         print("=" * 60)


+ if __name__ == "__main__":
+     main()
test_smoke.py CHANGED
@@ -1,4 +1,4 @@
- """Smoke test for the SQL Migration Environment."""
  import sys
  import os

@@ -42,13 +42,29 @@ assert cursor.fetchone()[0] is None
  conn.close()
  print("PASS: Task 3 seeds - 5 employees, NULL salary")

- # Test 5: Grader
  from server.grader import StateReconciler
  conn = sqlite3.connect(":memory:")
  seed_task1(conn)
  reconciler = StateReconciler("column-restructure")
  score = reconciler.score(conn)
  print(f"PASS: Grader score for unmodified Task 1: {score:.2f}")

  # Simulate correct migration
  conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
@@ -58,19 +74,42 @@ conn.execute("ALTER TABLE users_new RENAME TO users")
  conn.commit()
  score = reconciler.score(conn)
  print(f"PASS: Score after correct Task 1: {score:.2f}")
- assert score == 0.99, f"Expected 0.99, got {score}"
  conn.close()

- # Test 6: Full environment
  from server.environment import DbMigrationEnvironment
  env = DbMigrationEnvironment(task_name="column-restructure")
  obs = env.reset()
  assert obs.done == False
  assert obs.step_number == 0
- assert "users" in obs.current_schema_sql
  print(f"PASS: Environment reset. Step={obs.step_number}")

  # Run a complete correct migration
  steps = [
      "CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)",
      "INSERT INTO users_new (id, full_name) SELECT id, first_name || ' ' || last_name FROM users",
@@ -79,20 +118,20 @@ steps = [
  ]
  for i, sql in enumerate(steps):
      is_final = (i == len(steps) - 1)
-     action = MigrationAction(
-         sql_command=sql,
-         reasoning=f"Step {i+1}",
-         submit_final=is_final,
-     )
-     obs = env.step(action)
-     print(f"  Step {i+1}: reward={obs.reward:.2f}, progress={obs.migration_progress:.2f}, done={obs.done}")
-
- assert obs.done == True
- assert obs.migration_progress == 0.99, f"Expected 0.99, got {obs.migration_progress}"
  env.close()
- print("PASS: Full migration episode completed with score 0.99")

- # Test 7: Task 2 grader
  conn = sqlite3.connect(":memory:")
  conn.execute("PRAGMA foreign_keys = ON")
  seed_task2(conn)
@@ -101,7 +140,7 @@ score_before = reconciler2.score(conn)
  print(f"PASS: Task 2 grader before migration: {score_before:.2f}")
  conn.close()

- # Test 8: Task 3 grader
  conn = sqlite3.connect(":memory:")
  conn.execute("PRAGMA foreign_keys = ON")
  seed_task3(conn)
@@ -110,6 +149,21 @@ score_before = reconciler3.score(conn)
  print(f"PASS: Task 3 grader before migration: {score_before:.2f}")
  conn.close()

  print()
  print("=" * 50)
  print("ALL TESTS PASSED! Environment is fully working!")

+ """Smoke test for the SQL Migration Environment (updated for Golden DB grader)."""
  import sys
  import os

  conn.close()
  print("PASS: Task 3 seeds - 5 employees, NULL salary")

+ # Test 5: Golden migrations run without error
+ from seeds import golden_task1, golden_task2, golden_task3, golden_task4, golden_task5, golden_task6, golden_task7
+ for i, (seed_fn, golden_fn, name) in enumerate([
+     (seed_task1, golden_task1, "column-restructure"),
+     (seed_task2, golden_task2, "table-normalization"),
+     (seed_task3, golden_task3, "cascade-migration"),
+ ], 1):
+     conn = sqlite3.connect(":memory:")
+     conn.execute("PRAGMA foreign_keys = ON")
+     seed_fn(conn)
+     golden_fn(conn)
+     conn.close()
+     print(f"PASS: Golden migration {name} runs without error")
+
+ # Test 6: Grader with Golden DB
  from server.grader import StateReconciler
  conn = sqlite3.connect(":memory:")
+ conn.execute("PRAGMA foreign_keys = ON")
  seed_task1(conn)
  reconciler = StateReconciler("column-restructure")
  score = reconciler.score(conn)
  print(f"PASS: Grader score for unmodified Task 1: {score:.2f}")
+ assert score < 0.7, f"Expected moderate score before migration, got {score}"

  # Simulate correct migration
  conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")

  conn.commit()
  score = reconciler.score(conn)
  print(f"PASS: Score after correct Task 1: {score:.2f}")
+ assert score >= 0.89, f"Expected >= 0.89, got {score}"
  conn.close()

+ # Test 7: Full environment with SELECT passthrough
  from server.environment import DbMigrationEnvironment
  env = DbMigrationEnvironment(task_name="column-restructure")
  obs = env.reset()
  assert obs.done == False
  assert obs.step_number == 0
+ assert "users" in obs.current_schema_sql.lower()
  print(f"PASS: Environment reset. Step={obs.step_number}")

+ # Test SELECT returns actual data (A1 fix)
+ select_action = MigrationAction(
+     sql_command="SELECT * FROM users LIMIT 2",
+     reasoning="Inspecting data",
+     submit_final=False,
+ )
+ obs = env.step(select_action)
+ assert "O'Brien" in obs.last_execution_result, f"SELECT should return data, got: {obs.last_execution_result}"
+ print(f"PASS: SELECT returns actual data rows")
+
+ # Test dangerous SQL is blocked (A3 fix)
+ dangerous_action = MigrationAction(
+     sql_command="ATTACH DATABASE ':memory:' AS evil",
+     reasoning="Testing security",
+     submit_final=False,
+ )
+ obs = env.step(dangerous_action)
+ assert "not allowed" in obs.last_execution_result.lower() or "blocked" in obs.last_execution_result.lower(), \
+     f"ATTACH should be blocked, got: {obs.last_execution_result}"
+ print(f"PASS: Dangerous SQL is blocked")
+
  # Run a complete correct migration
+ env2 = DbMigrationEnvironment(task_name="column-restructure")
+ obs2 = env2.reset()
  steps = [
      "CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)",
      "INSERT INTO users_new (id, full_name) SELECT id, first_name || ' ' || last_name FROM users",

  ]
  for i, sql in enumerate(steps):
      is_final = (i == len(steps) - 1)
+     action = MigrationAction(sql_command=sql, reasoning=f"Step {i+1}", submit_final=is_final)
+     obs2 = env2.step(action)
+     print(f"  Step {i+1}: reward={obs2.reward:.2f}, progress={obs2.migration_progress:.2f}, done={obs2.done}")
+
+ assert obs2.done == True
+ assert obs2.migration_progress >= 0.89, f"Expected >= 0.89, got {obs2.migration_progress}"
+ # Check trajectory is included in final metadata
+ assert "trajectory" in obs2.metadata, "Trajectory should be in final metadata"
+ print(f"PASS: Full migration completed with score {obs2.migration_progress:.2f}")
+
  env.close()
+ env2.close()

+ # Test 8: Task 2 grader
  conn = sqlite3.connect(":memory:")
  conn.execute("PRAGMA foreign_keys = ON")
  seed_task2(conn)

  print(f"PASS: Task 2 grader before migration: {score_before:.2f}")
  conn.close()

+ # Test 9: Task 3 grader
  conn = sqlite3.connect(":memory:")
  conn.execute("PRAGMA foreign_keys = ON")
  seed_task3(conn)

  print(f"PASS: Task 3 grader before migration: {score_before:.2f}")
  conn.close()

+ # Test 10: Case insensitivity (A7)
+ conn = sqlite3.connect(":memory:")
+ conn.execute("PRAGMA foreign_keys = ON")
+ seed_task1(conn)
+ conn.execute("CREATE TABLE USERS_NEW (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
+ conn.execute("INSERT INTO USERS_NEW SELECT id, first_name || ' ' || last_name FROM users")
+ conn.execute("DROP TABLE users")
+ conn.execute("ALTER TABLE USERS_NEW RENAME TO USERS")
+ conn.commit()
+ reconciler_case = StateReconciler("column-restructure")
+ score_case = reconciler_case.score(conn)
+ print(f"PASS: Case-insensitive grading score: {score_case:.2f}")
+ assert score_case >= 0.79, f"Case-insensitive should score high, got {score_case}"
+ conn.close()
+
  print()
  print("=" * 50)
  print("ALL TESTS PASSED! Environment is fully working!")
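For readers skimming the diff, the golden-database grading loop these tests exercise reduces to the following minimal sketch. The `seed` and `golden_migration` functions here are toy stand-ins for a task's `seed_fn`/`golden_fn` entries in `seeds.TASKS`, not the project's actual code:

```python
import sqlite3

def seed(conn):
    # Toy stand-in for a task's seed_fn
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT)")
    conn.executemany("INSERT INTO users VALUES (?, ?, ?)",
                     [(1, "Ada", "Lovelace"), (2, "Alan", "Turing")])

def golden_migration(conn):
    # Toy stand-in for a task's golden_fn
    conn.execute("CREATE TABLE users_new (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL)")
    conn.execute("INSERT INTO users_new SELECT id, first_name || ' ' || last_name FROM users")
    conn.execute("DROP TABLE users")
    conn.execute("ALTER TABLE users_new RENAME TO users")

def rows(conn, table):
    return conn.execute(f"SELECT * FROM [{table}] ORDER BY 1").fetchall()

# Build the golden reference: fresh seed + perfect migration
golden = sqlite3.connect(":memory:")
seed(golden)
golden_migration(golden)

# The agent's DB is seeded identically; here we pretend it found the right SQL
agent = sqlite3.connect(":memory:")
seed(agent)
golden_migration(agent)

# Table-by-table comparison, in the spirit of StateReconciler._score_dynamic
match = rows(agent, "users") == rows(golden, "users")
print("tables match:", match)  # tables match: True
```

Because the golden DB is rebuilt from the current seed at grader construction time, changing the seed data never requires updating expected counts by hand, which is exactly what the rewritten `StateReconciler` relies on.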