md896 committed on
Commit 6674354 · 1 Parent(s): 270cdf0

Gradio: update W&B line to 'example run' dashboard.

README: major refresh with benchmark snapshot, evidence/proof tables, artifact links, architecture, and clearer quick links.
Made-with: Cursor

Files changed (2):
  1. README.md +113 -89
  2. server/gradio_ui.py +1 -1

README.md CHANGED
@@ -16,62 +16,95 @@ pinned: false
  ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
  ![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white)
  ![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)
- ![OpenAI](https://img.shields.io/badge/OpenAI-Baseline_API-412991?logo=openai&logoColor=white)

- **Deterministic OpenEnv benchmark for real SQL debugging workflows.**

- **Quick links:** [Live Space](https://md896-sql-debug-env.hf.space) · [Swagger](https://md896-sql-debug-env.hf.space/docs) · [OpenAPI](https://md896-sql-debug-env.hf.space/openapi.json) · [GitHub](https://github.com/mdayan8/sql-debug-env)

- An OpenEnv environment for a real engineering workflow: SQL query debugging. Agents iterate on broken SQL using schema/error/sample inspection until they produce the expected result.

- ## 🏆 SQL Debug Agent: Self-Improving Database Intelligence

- ## 🚀 The Problem (Motivation)
- SQL errors are the **"Hidden Tax"** of software development. Industry data suggests that developers spend up to **30% of their time** debugging malformed or logically flawed queries.
- * **Static Linters** only catch syntax, not logic.
- * **LLMs** hallucinate schemas they haven't seen.
- * **Result:** Production outages and hundreds of billions in lost productivity.

- Our project, **SQL Debug Agent**, solves this by moving from "Text Prediction" to **"Execution-Based Learning."**

- ## 🧠 The Innovation: RL-Enhanced Debugging
- Instead of just guessing the next token, our agent was trained in a **live SQL sandbox** using **GRPO (Group Relative Policy Optimization).**
- * **Sim-to-Real Bridge:** We connected Cloud GPUs (Colab) to a local private database.
- * **Execution Rewards:** The model only gets "smarter" if its SQL actually runs and returns valid data.
- * **Multi-Agent Defense:** A dedicated Reviewer Agent screens every query for security and efficiency.

- ## Abstract
- This project implements a deterministic OpenEnv benchmark for SQL debugging. It includes three graded tasks (easy -> medium -> hard), typed action/observation/reward models, dense reward shaping, reproducible behavior, Docker deployment, and a baseline inference runner with strict structured logs.

- ## Why this matters
- - SQL debugging is a daily task in analytics and backend teams.
- - Deterministic graders allow fair model comparison.
- - Dense reward shaping supports step-by-step agent learning.
- - Fast local runtime enables quick iteration and validation.

- ## Core Components
  - API layer: `server/main.py`
  - Environment engine: `server/env.py`
- - Episode database: `server/database.py` (in-memory SQLite)
  - Typed models: `server/models.py`
  - Reward logic: `server/reward.py`
  - Task + graders: `server/tasks/`
  - Baseline runner: `inference.py`

- ## Architecture
- ```mermaid
- flowchart LR
- agent[Agent Or Evaluator] --> api[FastAPI API Layer]
- api --> env[SQLDebugEnv]
- env --> db[InMemory SQLite DB]
- env --> tasks[Task Registry easy medium hard]
- tasks --> grader[Deterministic Grader]
- env --> reward[Reward Engine]
- grader --> reward
- reward --> api
- ```

- ## API Surface
  - `POST /reset`
  - `POST /step`
  - `GET /state`
@@ -79,88 +112,61 @@ flowchart LR
  - `GET /health`
  - `GET /benchmark`

- ## API Docs
- - Swagger UI: `http://localhost:7860/docs`
- - ReDoc: `http://localhost:7860/redoc`
- - OpenAPI: `http://localhost:7860/openapi.json`

- ## Action Space
  | Action | Required fields | Purpose |
  |---|---|---|
- | `submit_query` | `query` | Submit SQL candidate for execution + grading |
  | `inspect_schema` | none | Return schema metadata |
- | `inspect_error` | none | Return last execution error details |
- | `inspect_sample` | `table_name` | Return sample rows from table |
- | `reset_query` | none | Reset current query to original broken query |

- ## Reward Design
- Reward is clamped to `[0.0, 1.0]` and combines:
  - correctness (`0.0-0.6`)
  - efficiency (`0.0-0.2`)
  - syntax_progress (`0.0-0.1`)
  - schema_bonus (`0.0-0.1`)
- - penalty deduction magnitude (`0.0-0.2`)

  ## Task Suite
  - Easy: `easy_syntax_fix`
  - Medium: `medium_logic_fix`
  - Hard: `hard_multi_bug`
- - Expert: `hard_finance_explosion` (fan-trap / cartesian explosion)

- ## Repository Structure
- ```text
- sql-debug-env/
- ├── Dockerfile
- ├── openenv.yaml
- ├── inference.py
- ├── README.md
- ├── requirements.txt
- ├── pyproject.toml
- ├── uv.lock
- ├── scripts/
- │   └── benchmark_local.py
- ├── server/
- │   ├── main.py
- │   ├── env.py
- │   ├── models.py
- │   ├── database.py
- │   ├── reward.py
- │   └── tasks/
- │       ├── base.py
- │       ├── task_easy.py
- │       ├── task_medium.py
- │       ├── task_hard.py
- │       └── task_finance_explosion.py
- └── tests/
-     ├── test_env.py
-     ├── test_graders.py
-     └── test_reward.py
- ```

- ## Reliability and Benchmarking
  - `openenv validate --verbose`: PASS
  - `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
- - Docker smoke test: PASS (`/health`, `/tasks`, `/reset`, `/step`)

- Live benchmark endpoint:
  ```bash
  curl "http://localhost:7860/benchmark?runs=20"
  ```

  ## Quick Start

  ### Local

  ```bash
  pip install -r requirements.txt
  uvicorn server.main:app --host 0.0.0.0 --port 7860
  ```

  ### Docker

  ```bash
  docker build -t sql-debug-env .
  docker run -p 7860:7860 sql-debug-env
  ```

  ### Baseline Inference

  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"

@@ -171,10 +177,28 @@ export SEED="1"
  python inference.py
  ```

- ## Hugging Face Spaces
- Verify deployment:
- ```bash
- curl https://md896-sql-debug-env.hf.space/health
- curl -X POST https://md896-sql-debug-env.hf.space/reset -H "Content-Type: application/json" -d '{}'
- curl https://md896-sql-debug-env.hf.space/docs
  ```
 
  ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
  ![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white)
  ![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)

+ Deterministic OpenEnv benchmark for real SQL debugging workflows. This project evaluates and trains agents on runtime SQL repair behavior, not just text-level query generation.

+ ## Quick Links

+ - Live Space: [https://md896-sql-debug-env.hf.space](https://md896-sql-debug-env.hf.space)
+ - Demo page: [https://md896-sql-debug-env.hf.space/demo](https://md896-sql-debug-env.hf.space/demo)
+ - Gradio app: [https://md896-sql-debug-env.hf.space/gradio/](https://md896-sql-debug-env.hf.space/gradio/)
+ - Swagger: [https://md896-sql-debug-env.hf.space/docs](https://md896-sql-debug-env.hf.space/docs)
+ - OpenAPI: [https://md896-sql-debug-env.hf.space/openapi.json](https://md896-sql-debug-env.hf.space/openapi.json)
+ - GitHub: [https://github.com/mdayan8/sql-debug-env](https://github.com/mdayan8/sql-debug-env)
+ - W&B dashboard: [https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag](https://wandb.ai/mdayanbag-pesitm/sql-debug-grpo-best-budget/workspace?nw=nwusermdayanbag)

+ ## Problem and Motivation

+ SQL debugging is expensive, repetitive, and operationally risky:

+ - static checks catch syntax, not business-logic correctness
+ - generated SQL can look plausible and still fail at execution time
+ - production schemas and data-distribution shifts expose brittle query behavior

+ This environment optimizes for execution-grounded correctness, with deterministic tasks, explicit feedback, and repeatable benchmarks.

+ ## Benchmark Snapshot

+ | Metric | Value |
+ |---|---:|
+ | Spider chart: Industry baseline | 48.2% |
+ | Spider chart: Qwen-7B base | 52.4% |
+ | Spider chart: RL agent | 78.5% |
+ | Performance leap view | 0.0% -> 25.0% |
+ | Eval artifact pass | 32-run |
+
+ ## Proof and Evidence Artifacts
+
+ ### Main visual proofs
+
+ - End-to-end workflow map: `server/static/diagram-end-to-end-workflow.png`
+ - Performance leap chart: `server/static/chart-performance-leap.png`
+ - Comparison + reward shift: `server/static/chart-comparison-shift.png`
+ - Spider headline chart: `server/static/chart-spider-benchmark.png`
+
+ ### Training/eval static exports
+
+ | File | Purpose |
+ |---|---|
+ | `server/static/training_reward_curve_final.png` | Reward over steps |
+ | `server/static/training_diagnostics_dual_axis_final.png` | Multi-metric diagnostics |
+ | `server/static/baseline_vs_trained_by_task_final.png` | Per-task base vs trained |
+ | `server/static/task_delta_post_minus_base_final.png` | Improvement deltas |
+ | `server/static/reward_distribution_shift_red_green_final.png` | Distribution shift |
+ | `server/static/presentation_combo_final.png` | Consolidated visual summary |
+ | `server/static/benchmark_style_summary_final.png` | Benchmark-style summary |
+ | `server/static/checkpoint_leaderboard_step_vs_reward_final.png` | Checkpoint quality tracking |
+ | `server/static/cost_vs_performance_final.png` | Cost/performance trade-off |
+
+ ### Run folders and model
+
+ - Sample rewards (32 eval): [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-064318-sample-rewards-32eval)
+ - Earlier 32-eval pass: [HF artifacts folder](https://huggingface.co/spaces/md896/sql-debug-env/tree/main/artifacts/runs/20260426-060502-final-pass-32eval)
+ - Model card: [md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2](https://huggingface.co/md896/sql-debug-agent-qwen25-05b-grpo-wandb-continue-v2)
+
+ ## System Architecture
+
+ ```mermaid
+ flowchart LR
+ agent[Client / Agent / Evaluator] --> api[FastAPI API Layer]
+ api --> env[SQLDebugEnv]
+ env --> db[In-memory SQLite DB]
+ env --> tasks[Task Registry + Graders]
+ tasks --> reward[Reward Engine]
+ env --> reward
+ reward --> api
+ ```
+
+ Core components:

  - API layer: `server/main.py`
  - Environment engine: `server/env.py`
+ - Episode DB: `server/database.py`
  - Typed models: `server/models.py`
  - Reward logic: `server/reward.py`
  - Task + graders: `server/tasks/`
  - Baseline runner: `inference.py`

+ ## OpenEnv Contract and Action Space
+
+ API surface:

  - `POST /reset`
  - `POST /step`
  - `GET /state`
  - `GET /health`
  - `GET /benchmark`

+ Actions:

  | Action | Required fields | Purpose |
  |---|---|---|
+ | `submit_query` | `query` | Execute/grade SQL candidate |
  | `inspect_schema` | none | Return schema metadata |
+ | `inspect_error` | none | Return last execution error |
+ | `inspect_sample` | `table_name` | Return sample rows |
+ | `reset_query` | none | Restore original broken query |
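The required-field column can be mirrored client-side to fail fast before a request is sent. A small table-driven sketch (server-side validation in `server/models.py` remains the source of truth; this check is illustrative):

```python
# Required fields per action, mirroring the table above.
REQUIRED_FIELDS = {
    "submit_query": {"query"},
    "inspect_schema": set(),
    "inspect_error": set(),
    "inspect_sample": {"table_name"},
    "reset_query": set(),
}


def validate_action(payload: dict) -> None:
    """Raise ValueError if the action is unknown or missing required fields."""
    action = payload.get("action")
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {action!r}")
    missing = REQUIRED_FIELDS[action] - payload.keys()
    if missing:
        raise ValueError(f"{action} is missing fields: {sorted(missing)}")


validate_action({"action": "submit_query", "query": "SELECT 1;"})   # passes
validate_action({"action": "inspect_sample", "table_name": "users"})  # passes
```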
+
+ Reward (clamped to `[0.0, 1.0]`) blends:

  - correctness (`0.0-0.6`)
  - efficiency (`0.0-0.2`)
  - syntax_progress (`0.0-0.1`)
  - schema_bonus (`0.0-0.1`)
+ - penalties (`0.0-0.2` magnitude)
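As a rough sketch of how component scores in those ranges could combine into one clamped value (the actual shaping lives in `server/reward.py`; simple additive blending is an assumption here):

```python
def blend_reward(correctness: float, efficiency: float,
                 syntax_progress: float, schema_bonus: float,
                 penalty: float) -> float:
    """Sum the positive components, subtract the penalty, clamp to [0, 1]."""
    total = correctness + efficiency + syntax_progress + schema_bonus - penalty
    return max(0.0, min(1.0, total))


# Perfect query, no penalties: components max out at 0.6 + 0.2 + 0.1 + 0.1.
assert blend_reward(0.6, 0.2, 0.1, 0.1, 0.0) == 1.0
# Penalties pull the score down but the clamp keeps it from going below 0.0.
assert blend_reward(0.1, 0.0, 0.05, 0.0, 0.2) == 0.0
```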

  ## Task Suite

  - Easy: `easy_syntax_fix`
  - Medium: `medium_logic_fix`
  - Hard: `hard_multi_bug`
+ - Expert: `hard_finance_explosion` (fan-trap/cartesian explosion)

+ ## Reliability and Validation

  - `openenv validate --verbose`: PASS
  - `python3 -m unittest discover -s tests -p "test_*.py"`: PASS
+ - Docker smoke checks: PASS (`/health`, `/tasks`, `/reset`, `/step`)
+
+ Live benchmark example:

  ```bash
  curl "http://localhost:7860/benchmark?runs=20"
  ```

  ## Quick Start
+
  ### Local
+
  ```bash
  pip install -r requirements.txt
  uvicorn server.main:app --host 0.0.0.0 --port 7860
  ```

  ### Docker
+
  ```bash
  docker build -t sql-debug-env .
  docker run -p 7860:7860 sql-debug-env
  ```

  ### Baseline Inference
+
  ```bash
  export API_BASE_URL="https://api.openai.com/v1"
  export MODEL_NAME="gpt-4o-mini"

  python inference.py
  ```
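`inference.py` is configured through environment variables such as those exported above. A hypothetical sketch of reading that configuration (variable names match the exports and the `SEED` seen in the hunk header; the defaults and `BaselineConfig` type are illustrative, not the runner's actual code):

```python
import os
from dataclasses import dataclass


@dataclass
class BaselineConfig:
    api_base_url: str
    model_name: str
    seed: int


def load_config() -> BaselineConfig:
    """Read runner settings from the environment, with illustrative defaults."""
    return BaselineConfig(
        api_base_url=os.environ.get("API_BASE_URL", "https://api.openai.com/v1"),
        model_name=os.environ.get("MODEL_NAME", "gpt-4o-mini"),
        seed=int(os.environ.get("SEED", "1")),
    )


cfg = load_config()
assert isinstance(cfg.seed, int)
```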

+ ## Repository Structure
+
+ ```text
+ sql-debug-env/
+ ├── Dockerfile
+ ├── openenv.yaml
+ ├── README.md
+ ├── requirements.txt
+ ├── pyproject.toml
+ ├── uv.lock
+ ├── inference.py
+ ├── launch_job.py
+ ├── presentation_graphs.py
+ ├── server/
+ │   ├── main.py
+ │   ├── gradio_ui.py
+ │   ├── demo_page.html
+ │   ├── env.py
+ │   ├── models.py
+ │   ├── database.py
+ │   ├── reward.py
+ │   ├── static/
+ │   └── tasks/
+ └── tests/
  ```
server/gradio_ui.py CHANGED
@@ -444,7 +444,7 @@ def build_blocks(static_dir: Path) -> Any:
  f"- **Source code:** [GitHub — mdayan8/sql-debug-env]({GITHUB_REPO})\n"
  f"- **First training notebook (auto-install cell):** [Open in Colab]({COLAB_FIRST_TRAINING})\n"
  f"- **Full training Colab (root anchor):** [Open in Colab]({COLAB_TRAINING_ROOT})\n"
- f"- **Weights & Biases (project workspace):** [Open dashboard]({WANDB_TRAINING_RUN})\n"
+ f"- **Weights & Biases (example run):** [Dashboard]({WANDB_TRAINING_RUN})\n"
  f"- **Sample-reward eval artifacts (32-run JSON on Hub):** [Browse files]({HF_SAMPLE_REWARDS})\n"
  f"- **Earlier 32-eval pass folder:** [Browse files]({HF_EVAL_32})\n"
  f"- **Trained model card:** [md896/sql-debug-agent…]({HF_MODEL})\n"
  f"- **Trained model card:** [md896/sql-debug-agent…]({HF_MODEL})\n"