Spaces:
Sleeping
Sleeping
metadata
title: SQL Migration Agent Benchmark
emoji: ποΈ
colorFrom: blue
colorTo: purple
sdk: docker
app_file: server/app.py
pinned: false
SQL Migration Agent Benchmark (OpenEnv)
A Production-Grade Evaluation Suite for Database Engineering Agents.
An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken/drifted schema and must write SQL to transform it to a target state without losing data.
ποΈ Architecture Overview
The suite combines formal sequence modeling with a modular local engine.
System Mapping
βββββββββββββββββββββββββββββββββββ
β inference.py (Baseline Agent) β
β - LLM API calls (OpenAI fmt) β
β - JSON mode + fallback parser β
βββββββββββ¬ββββββββββββββββββββββββ
β MigrationAction
βββββββββββΌββββββββββββββββββββββββ
β environment.py (OpenEnv Env) β
β - SQLite execution engine β
β - ERD & Schema Diff generator β
β - SQL timeout & Blacklist β
βββββββββββ¬ββββββββββββββββββββββββ
β score()
βββββββββββΌββββββββββββββββββββββββ
β grader.py (Golden DB Engine) β
β - Dynamic golden reference DB β
β - Schema + data + FK scoring β
β - Anti-exploit checks β
βββββββββββββββββββββββββββββββββββ
Protocol Flow
sequenceDiagram
participant Agent
participant Env as MigrationEnv Server
participant DB as SQLite (:memory:)
participant Grader as Dynamic Golden Grader
Agent->>Env: POST /reset (task_name)
Env->>DB: Seed Source Data
Env->>Grader: Build Golden Reference
Grader-->>Env: Initial Score
Env-->>Agent: Observation (DDL, Schema Diff, ERD)
loop Migration Steps
Agent->>Env: POST /step (SQL, Reasoning)
Env->>DB: Execute SQL (w/ Timeout & Blacklist)
Env->>Grader: Compute Delta Reward
Grader-->>Env: current_score, reward
Env-->>Agent: New Observation + ERD (Visualization)
end
Agent->>Env: submit_final = True
Env->>Grader: Final Integrity & FK Check
Env-->>Agent: Final Episode Summary (Trajectory)
π― Benchmark Tasks
| # | Task | Difficulty | Challenge |
|---|---|---|---|
| 1 | column-restructure |
π’ Easy | Merge first_name + last_name β full_name (with apostrophes) |
| 2 | soft-delete-restoration |
π’ Easy | Restore deleted products from deletion_log |
| 3 | table-normalization |
π‘ Medium | Normalize purchases β customers + orders + FK |
| 4 | schema-version-merge |
π‘ Medium | Merge v1/v2 product tables with price coercion |
| 5 | multi-entity-extraction |
π‘ Medium | 3NF decomposition with invalid data routing |
| 6 | cascade-migration |
π΄ Hard | 4-table FK cascade, type coercion, orphan audit |
| 7 | dual-source-consolidation |
π΄ Hard | 6β4 table merge, cross-system email dedup |
π οΈ Adversarial Edge Cases (The "Stress Tests")
- O'Brien: Apostrophe in data β tests SQL escaping and string literal handling.
- $90,000 salary: TEXTβINTEGER coercion β tests complex string parsing and casting.
- Empty string emails: NOT NULL vs Empty β tests data quality validation logic.
- Leading whitespace:
alice@company.comβ tests TRIM awareness. - ID conflicts: Overlapping IDs in dual sources β tests intelligent merge logic.
- Orphaned FKs: References to deleted entities β tests environment's audit logging.
- NULL currency: Must default to 'USD' β tests COALESCE usage.
βοΈ Evaluation Baselines
| Task | Qwen 32B Score | GPT-OSS 120B |
|---|---|---|
column-restructure |
0.99 | 0.99 |
soft-delete-restoration |
0.99 | 0.99 |
table-normalization |
0.94 | 0.99 |
schema-version-merge |
0.93 | 0.98 |
multi-entity-extraction |
0.35 | 0.65 |
cascade-migration |
0.61 | 0.83 |
dual-source-consolidation |
0.28 | 0.38 |
π‘οΈ Security & Reward Function
The Reward Formula
Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$. Progress $P_t$ is a weighted sum (0.01 to 0.99):
- Schema Match (30%): Tables exist with correct
(name, type)signatures. - Data Match (40%): Row content matches golden DB (order-independent).
- FK & Integrity (20%): Foreign keys enforced,
integrity_checkpasses. - Anti-Exploit (10%): Penalty for empty tables or schema pollution.
Security Guardrails
- PRAGMA Blacklist:
foreign_keys = OFFandwritable_schema = ONare blocked. - Query Timeout: SQLite progress handler terminates queries exceeding 500k ops.
- Dangerous SQL:
ATTACH,DETACH, andLOAD_EXTENSIONare filtered.
π Setup & Usage
Local Deployment
pip install -r requirements.txt
python -m server.app # Starts OpenEnv server on port 7860
Environment Variables
export HF_TOKEN=your_token
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
API Endpoints
POST /reset: Initialize migration episode.POST /step: Execute SQL and reasoning.GET /tasks: List all available scenarios.POST /grader: Run deep comparison against Golden DB.
πΌοΈ Observations
Each observation includes erd_visualization (Mermaid.js) and schema_diff to assist agents in understanding the current drift.
π License
MIT. Built for the OpenEnv Hackathon 2026.