Commit 5f32203 by Eishaan · 1 parent: 41cae03

docs: merge professional visuals with detailed technical specs for ultimate README

Files changed (1)
  1. README.md +83 -59
README.md CHANGED
@@ -15,14 +15,38 @@ pinned: false
  [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
  [![Hugging Face Space](https://img.shields.io/badge/HF%20Space-Deployed-orange)](https://huggingface.co/spaces/Eishaan/sql-migration-env)

- This repository contains a high-fidelity evaluation environment designed to measure the capability of AI agents in performing complex SQL schema migrations. Unlike simple text-to-SQL benchmarks, this environment requires **state-aware reasoning**, **data integrity protection**, and **adversarial edge-case handling**.

  ---

  ## 🏗️ Architecture Overview

- The environment follows the **OpenEnv** specification, exposing a standardized API for agents to interact with an isolated SQLite instance.

  ```mermaid
  sequenceDiagram
  participant Agent
@@ -53,83 +77,83 @@ sequenceDiagram

  ## 🎯 Benchmark Tasks

- The suite consists of **7 progressive tasks** representing real-world database engineering challenges:
-
- | Task | Difficulty | Core Challenge |
- | :--- | :--- | :--- |
- | **Column Restructure** | 🟢 Easy | Merging `first_name` + `last_name` while preserving apostrophes (O'Brien). |
- | **Soft-Delete Restoration** | 🟢 Easy | Restoring products from a deletion log and managing boolean flags. |
- | **Table Normalization** | 🟡 Medium | Decomposing a denormalized "God Table" into 3NF (`customers` → `orders`). |
- | **Schema Version Merge** | 🟡 Medium | Merging conflicting schemas (v1 vs v2) with complex price coercion. |
- | **Multi-Entity Extraction** | 🟡 Medium | 3NF decomposition with strict data routing for invalid records. |
- | **Cascade Migration** | 🔴 Hard | 4-table FK cascade, orphan audit logging, and strict data type cleanup. |
- | **Dual-Source Consolidation** | 🔴 Hard | Merging 6 tables from two incompatible systems (Legacy CRM + Modern SaaS). |

  ---

- ## ⚖️ Grading & Reward Function
-
- The benchmark uses a **Dynamic Golden Database Grader**. Instead of string-matching SQL, we compare the *final state* of the agent's database against a "perfectly migrated" reference database.
-
- ### The Reward Formula
- Rewards are progress deltas calculated at every step:
-
- $$R_t = P_t - P_{t-1}$$

- Where $P_t$ (progress) is a weighted sum in the range $[0.01, 0.99]$:
- - **Schema Match (30%):** Validates table existence and strict `(name, type)` signatures.
- - **Data Match (40%):** Validates row content, counts, and checks for data loss/pollution.
- - **Integrity (20%):** Validates `PRAGMA foreign_key_check` and `PRAGMA integrity_check`.
- - **Anti-Exploit (10%):** Penalizes empty tables or leftover "garbage" tables.

  ---

- ## 🛡️ Security & Sandbox Guardrails

- To prevent agents from faking results or exploiting the environment, we implement:
- - **PRAGMA Blacklist:** Commands like `PRAGMA foreign_keys = OFF` or `PRAGMA foreign_keys = 0` are strictly blocked.
- - **Query Timeout:** Infinite loops (e.g., recursive CTEs) are auto-terminated via a SQLite progress handler budget.
- - **Dangerous Command Filter:** `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are blocked via regex.
- - **Isolation:** Each episode runs in a fresh, isolated `:memory:` database with no persistence.

  ---

- ## 🚀 Getting Started

- ### Local Deployment (Docker)
  ```bash
- # Clone the repo
- git clone https://github.com/Eishaan-Khatri/sql-migration-env
- cd sql-migration-env
-
- # Build and run
- docker build -t sql-migration-env .
- docker run -p 7860:7860 sql-migration-env
  ```

- ### Run Baseline Evaluation
  ```bash
- python inference.py
  ```

- ---
-
- ## 📊 Evaluation Baselines
-
- Results using `GPT-OSS-120B` class models:
-
- - **Avg. Benchmark Score:** 0.83 (Production ready)
- - **Task Success Rates:**
-   - Easy: 0.99
-   - Medium: 0.82
-   - Hard: 0.60

  ---

- ## 🖼️ Observations & Visuals
- Each observation includes an `erd_visualization` field containing a **Mermaid.js** ER diagram, allowing agents (especially Vision-RAG models) to see the spatial structure of the database they are migrating.
-
- ---

  ## 📄 License
- This benchmark is licensed under the MIT License. Built for the **OpenEnv Hackathon 2026**.
 
  [![License: MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
  [![Hugging Face Space](https://img.shields.io/badge/HF%20Space-Deployed-orange)](https://huggingface.co/spaces/Eishaan/sql-migration-env)

+ An OpenEnv-compatible environment for evaluating AI agents on autonomous SQLite database migration tasks. The agent receives a broken or drifted schema and must write SQL that transforms it to a target state without losing data.

  ---

  ## 🏗️ Architecture Overview

+ The architecture is presented in two views: a component map of the local engine and a sequence diagram of the agent-environment protocol.

+ ### System Mapping
+ ```
+ ┌─────────────────────────────────┐
+ │  inference.py (Baseline Agent)  │
+ │  - LLM API calls (OpenAI fmt)   │
+ │  - JSON mode + fallback parser  │
+ └─────────┬───────────────────────┘
+           │ MigrationAction
+ ┌─────────▼───────────────────────┐
+ │  environment.py (OpenEnv Env)   │
+ │  - SQLite execution engine      │
+ │  - ERD & Schema Diff generator  │
+ │  - SQL timeout & Blacklist      │
+ └─────────┬───────────────────────┘
+           │ score()
+ ┌─────────▼───────────────────────┐
+ │  grader.py (Golden DB Engine)   │
+ │  - Dynamic golden reference DB  │
+ │  - Schema + data + FK scoring   │
+ │  - Anti-exploit checks          │
+ └─────────────────────────────────┘
+ ```
+
+ ### Protocol Flow
  ```mermaid
  sequenceDiagram
  participant Agent
 

  ## 🎯 Benchmark Tasks

+ | # | Task | Difficulty | Challenge |
+ |---|------|-----------|-----------|
+ | 1 | `column-restructure` | 🟢 Easy | Merge `first_name` + `last_name` → `full_name` (with apostrophes) |
+ | 2 | `soft-delete-restoration` | 🟢 Easy | Restore deleted products from `deletion_log` |
+ | 3 | `table-normalization` | 🟡 Medium | Normalize `purchases` → `customers` + `orders` + FK |
+ | 4 | `schema-version-merge` | 🟡 Medium | Merge v1/v2 product tables with price coercion |
+ | 5 | `multi-entity-extraction` | 🟡 Medium | 3NF decomposition with invalid data routing |
+ | 6 | `cascade-migration` | 🔴 Hard | 4-table FK cascade, type coercion, orphan audit |
+ | 7 | `dual-source-consolidation` | 🔴 Hard | 6→4 table merge, cross-system email dedup |
+
+ ### 🛠️ Adversarial Edge Cases (The "Stress Tests")
+ - **O'Brien**: Apostrophe in data; tests SQL escaping and string literal handling.
+ - **`$90,000` salary**: TEXT→INTEGER coercion; tests complex string parsing and casting.
+ - **Empty string emails**: NOT NULL vs. empty string; tests data quality validation logic.
+ - **Leading whitespace**: ` alice@company.com`; tests TRIM awareness.
+ - **ID conflicts**: Overlapping IDs in dual sources; tests intelligent merge logic.
+ - **Orphaned FKs**: References to deleted entities; tests the environment's audit logging.
+ - **NULL currency**: Must default to 'USD'; tests COALESCE usage.
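
Several of these cases map onto built-in SQLite functions. The sketch below is illustrative only: the `staff` table, its columns, and the helper name `clean_staff_rows` are hypothetical and are not taken from the benchmark's task schemas.

```python
import sqlite3


def clean_staff_rows(rows):
    """Load raw rows into an in-memory table and return the cleaned rows.

    The `staff` table and its columns are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE staff (name TEXT, email TEXT, salary TEXT, currency TEXT)"
    )
    # Parameter binding avoids manual quote escaping for values like O'Brien.
    conn.executemany("INSERT INTO staff VALUES (?, ?, ?, ?)", rows)
    cleaned = conn.execute(
        """
        SELECT name,
               NULLIF(TRIM(email), '') AS email,          -- trim, then '' -> NULL
               CAST(REPLACE(REPLACE(salary, '$', ''), ',', '') AS INTEGER)
                   AS salary,                             -- '$90,000' -> 90000
               COALESCE(currency, 'USD') AS currency      -- NULL -> 'USD'
        FROM staff
        ORDER BY rowid
        """
    ).fetchall()
    conn.close()
    return cleaned


raw = [
    ("O'Brien", " alice@company.com", "$90,000", None),
    ("Smith", "", "1200", "EUR"),
]
for row in clean_staff_rows(raw):
    print(row)
```

When an agent writes the value directly into a SQL string instead of binding it, the apostrophe has to be doubled, as in `'O''Brien'`.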

  ---

+ ## ⚖️ Evaluation Baselines

+ | Task | Qwen 32B score | GPT-OSS 120B score |
+ |------|----------------|--------------------|
+ | `column-restructure` | 0.99 | 0.99 |
+ | `soft-delete-restoration` | 0.99 | 0.99 |
+ | `table-normalization` | 0.94 | 0.99 |
+ | `schema-version-merge` | 0.93 | 0.98 |
+ | `multi-entity-extraction` | 0.35 | 0.65 |
+ | `cascade-migration` | 0.61 | 0.83 |
+ | `dual-source-consolidation` | 0.28 | 0.38 |

  ---

+ ## 🛡️ Security & Reward Function

+ ### The Reward Formula
+ Rewards are calculated as progress deltas: $R_t = P_t - P_{t-1}$.
+ Progress $P_t$ is a weighted sum of four components, bounded to the range 0.01 to 0.99:
+ - **Schema Match (30%)**: Tables exist with correct `(name, type)` signatures.
+ - **Data Match (40%)**: Row content matches golden DB (order-independent).
+ - **FK & Integrity (20%)**: Foreign keys enforced, `integrity_check` passes.
+ - **Anti-Exploit (10%)**: Penalty for empty tables or schema pollution.
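
The arithmetic can be sketched as follows. This is not the code in `grader.py`: it assumes each component score lies in [0, 1], that the 0.01 to 0.99 bounds are enforced by clamping, and that progress before the first step is 0.01. The names `progress` and `step_rewards` are hypothetical.

```python
# Component weights from the list above; they sum to 1.0.
WEIGHTS = {"schema": 0.30, "data": 0.40, "integrity": 0.20, "anti_exploit": 0.10}
P_MIN, P_MAX = 0.01, 0.99  # documented bounds on progress


def progress(components):
    """Weighted sum of component scores, clamped to [P_MIN, P_MAX].

    Assumes each component score is a float in [0, 1].
    """
    raw = sum(WEIGHTS[name] * components[name] for name in WEIGHTS)
    return min(P_MAX, max(P_MIN, raw))


def step_rewards(progress_history, initial_progress=P_MIN):
    """Return R_t = P_t - P_{t-1} for each step in the episode."""
    rewards = []
    previous = initial_progress  # assumed starting progress P_0
    for current in progress_history:
        rewards.append(current - previous)
        previous = current
    return rewards


# Example episode: schema fixed first, then half of the data migrated.
p1 = progress({"schema": 1.0, "data": 0.0, "integrity": 1.0, "anti_exploit": 1.0})
p2 = progress({"schema": 1.0, "data": 0.5, "integrity": 1.0, "anti_exploit": 1.0})
print(step_rewards([p1, p2]))
```

Because the rewards are deltas, they telescope: the sum over an episode equals the final progress minus the starting progress, so a step that damages the database yields a negative reward.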
+
+ ### Security Guardrails
+ - **PRAGMA Blacklist**: `PRAGMA foreign_keys = OFF` and `PRAGMA writable_schema = ON` are blocked.
+ - **Query Timeout**: A SQLite progress handler terminates queries that exceed a budget of 500k operations.
+ - **Dangerous SQL**: `ATTACH`, `DETACH`, and `LOAD_EXTENSION` are filtered.
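
A minimal sketch of how such guardrails could be wired up with Python's standard `sqlite3` module is shown below. The regex patterns, the `run_guarded` helper, and the 1,000-instruction check interval are assumptions for illustration; the actual rules in `environment.py` may differ.

```python
import re
import sqlite3

# Hypothetical patterns; the environment's real blacklist may differ.
BLOCKED = re.compile(
    r"\b(ATTACH|DETACH|LOAD_EXTENSION)\b"
    r"|PRAGMA\s+foreign_keys\s*=\s*(OFF|0|FALSE|NO)\b"
    r"|PRAGMA\s+writable_schema\s*=\s*(ON|1|TRUE|YES)\b",
    re.IGNORECASE,
)
OP_BUDGET = 500_000  # per-statement budget of SQLite VM instructions
CHECK_EVERY = 1_000  # handler is invoked every 1,000 instructions (assumed)


def run_guarded(conn, sql):
    """Reject blacklisted SQL, then execute under an instruction budget."""
    if BLOCKED.search(sql):
        raise PermissionError("blocked statement")

    calls = 0

    def on_progress():
        nonlocal calls
        calls += 1
        # A non-zero return value makes SQLite abort the running statement.
        return 1 if calls * CHECK_EVERY > OP_BUDGET else 0

    conn.set_progress_handler(on_progress, CHECK_EVERY)
    try:
        return conn.execute(sql).fetchall()
    finally:
        conn.set_progress_handler(None, 0)  # remove the handler again


conn = sqlite3.connect(":memory:")
print(run_guarded(conn, "SELECT 1"))
```

An aborted statement surfaces as `sqlite3.OperationalError`, which the caller can turn into an error observation. Regex filtering is coarse: it also rejects harmless statements that merely contain a blocked keyword inside a string literal.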

  ---

+ ## 🚀 Setup & Usage

+ ### Local Deployment
  ```bash
+ pip install -r requirements.txt
+ python -m server.app # Starts OpenEnv server on port 7860
  ```

+ ### Environment Variables
  ```bash
+ export HF_TOKEN=your_token
+ export API_BASE_URL=https://router.huggingface.co/v1
+ export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
  ```
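
As a rough illustration of how a baseline agent might consume these variables, the sketch below collects them into a settings object and fails early when one is missing. `LLMConfig` and `load_llm_config` are hypothetical names; this is not how `inference.py` is known to read its configuration.

```python
import os
from dataclasses import dataclass

REQUIRED_VARS = ("HF_TOKEN", "API_BASE_URL", "MODEL_NAME")


@dataclass(frozen=True)
class LLMConfig:
    api_key: str   # value of HF_TOKEN
    base_url: str  # value of API_BASE_URL, without a trailing slash
    model: str     # value of MODEL_NAME


def load_llm_config(env=None):
    """Build client settings from the three documented variables."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return LLMConfig(
        api_key=env["HF_TOKEN"],
        base_url=env["API_BASE_URL"].rstrip("/"),
        model=env["MODEL_NAME"],
    )


# Example with an explicit mapping instead of the real process environment.
example = load_llm_config(
    {
        "HF_TOKEN": "your_token",
        "API_BASE_URL": "https://router.huggingface.co/v1",
        "MODEL_NAME": "Qwen/Qwen2.5-72B-Instruct",
    }
)
print(example.model)
```

The resulting values would typically be handed to an OpenAI-format client as its API key, base URL, and model name.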

+ ### API Endpoints
+ - `POST /reset`: Initialize a migration episode.
+ - `POST /step`: Execute the agent's SQL, submitted along with its reasoning.
+ - `GET /tasks`: List all available scenarios.
+ - `POST /grader`: Run a deep comparison against the Golden DB.
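
The endpoints above can be exercised with any HTTP client. The sketch below uses only the standard library and builds the requests without sending them. The host, the JSON field names (`task_id`, `sql`, `reasoning`), and the `users` table in the example SQL are assumptions, since the request schemas are not documented here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:7860"  # port from the setup section; host assumed


def build_request(method, path, payload=None, base_url=BASE_URL):
    """Create a JSON request for one of the endpoints listed above."""
    data = None if payload is None else json.dumps(payload).encode("utf-8")
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


def call(request, timeout=30):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical payloads: the real field names may differ.
reset_req = build_request("POST", "/reset", {"task_id": "column-restructure"})
step_req = build_request(
    "POST",
    "/step",
    {
        "sql": "ALTER TABLE users ADD COLUMN full_name TEXT",
        "reasoning": "Add the target column before copying data.",
    },
)
tasks_req = build_request("GET", "/tasks")
print(reset_req.get_method(), reset_req.full_url)
```

With the server running locally, `call(tasks_req)` would fetch the scenario list, and an episode would alternate `call(step_req)` requests until the grader reports the final score.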
 
 
 
 
 
 

  ---

+ ## 🖼️ Observations
+ Each observation includes `erd_visualization` (a Mermaid.js ER diagram) and `schema_diff` to help agents understand the current drift.

  ## 📄 License
+ MIT. Built for the **OpenEnv Hackathon 2026**.