md896 committed on
Commit c7d8ccb · 1 parent: 06729fc

initial commit

Files changed (1)
  1. README.md +5 -459
README.md CHANGED
@@ -1,464 +1,10 @@
  ---
- title: sql-debug-env
- emoji: "🧪"
- colorFrom: blue
- colorTo: green
  sdk: docker
  pinned: false
  ---
 
- # SQL Debug Environment (`sql-debug-env`)
-
- ![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f)
- ![Docker](https://img.shields.io/badge/Deploy-Docker-2496ED?logo=docker&logoColor=white)
- ![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)
- ![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white)
- ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
- ![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white)
- ![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)
- ![OpenAI](https://img.shields.io/badge/OpenAI-Baseline_API-412991?logo=openai&logoColor=white)
-
- **Deterministic OpenEnv benchmark for real SQL debugging workflows.**
-
- **Quick links:** [Live Space](https://md896-sql-debug-env.hf.space) · [Swagger](https://md896-sql-debug-env.hf.space/docs) · [OpenAPI](https://md896-sql-debug-env.hf.space/openapi.json) · [GitHub](https://github.com/mdayan8/sql-debug-env)
-
- An OpenEnv environment focused on a real engineering workflow: **SQL query debugging**. Agents iterate on broken SQL using schema, error, and sample inspection until they produce the expected result.
-
- ## Space Config
- | Key | Value |
- |---|---|
- | `title` | `sql-debug-env` |
- | `emoji` | `🧪` |
- | `colorFrom` | `blue` |
- | `colorTo` | `green` |
- | `sdk` | `docker` |
- | `pinned` | `false` |
-
- ## Abstract
- This project implements a deterministic OpenEnv benchmark for SQL debugging. It includes three graded tasks (easy → medium → hard), typed action/observation/reward models, dense reward shaping, reproducible behavior, Docker deployment, and a baseline inference runner that emits strict structured logs.
-
- ## Why this matters
- - SQL debugging is a daily task in analytics and backend teams.
- - Deterministic graders allow fair model comparison.
- - Dense reward shaping supports step-by-step agent learning.
- - Fast local runtime enables quick iteration and validation.
-
- ## Core Components
- - **API layer**: `server/main.py`
- - **Environment engine**: `server/env.py`
- - **Episode database**: `server/database.py` (in-memory SQLite)
- - **Typed models**: `server/models.py`
- - **Reward logic**: `server/reward.py`
- - **Tasks + graders**: `server/tasks/`
- - **Baseline runner**: `inference.py`
-
- ## Architecture
- ```mermaid
- flowchart LR
- agent[Agent Or Evaluator] --> api[FastAPI API Layer]
- api --> env[SQLDebugEnv]
- env --> db[InMemory SQLite DB]
- env --> tasks[Task Registry easy medium hard]
- tasks --> grader[Deterministic Grader]
- env --> reward[Reward Engine]
- grader --> reward
- reward --> api
- ```
-
- ## API Surface
- - `POST /reset`
- - `POST /step`
- - `GET /state`
- - `GET /tasks`
- - `GET /health`
- - `GET /benchmark`
-
- ## API Docs
- - Swagger UI: `http://localhost:7860/docs`
- - ReDoc: `http://localhost:7860/redoc`
- - OpenAPI: `http://localhost:7860/openapi.json`
-
- ## Action Space
- | Action | Required fields | Purpose |
- |---|---|---|
- | `submit_query` | `query` | Submit a SQL candidate for execution + grading |
- | `inspect_schema` | none | Return schema metadata |
- | `inspect_error` | none | Return last execution error details |
- | `inspect_sample` | `table_name` | Return sample rows from a table |
- | `reset_query` | none | Reset the current query to the original broken query |
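The table above maps directly onto the JSON envelope that `POST /step` expects (the smoke test below shows the `{"action": {"action_type": ...}}` shape). A minimal sketch of building those payloads, assuming only the field names from the table; the authoritative Pydantic models live in `server/models.py`, and the table name `orders` is a hypothetical example:

```python
# Build /step request bodies for the documented action types.
# Field names come from the Action Space table; the envelope shape
# comes from the smoke-test example. Everything else is an assumption.

def make_action(action_type: str, **fields) -> dict:
    """Wrap an action in the envelope expected by POST /step."""
    return {"action": {"action_type": action_type, **fields}}

# submit_query requires `query`; inspect_sample requires `table_name`.
submit = make_action("submit_query", query="SELECT 1")
sample = make_action("inspect_sample", table_name="orders")  # hypothetical table
schema = make_action("inspect_schema")                        # no extra fields
```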
-
- ## Observation Space (high-level)
- - Task context: `task_id`, `task_description`, `original_query`, `expected_description`
- - Progress: `steps_taken`, `steps_remaining`, `current_score`
- - Feedback: `last_action_type`, `last_query_result`, `schema_info`, `error_details`, `sample_rows`
- - Episode status: `is_done`, `success`
-
- ## Reward Design
- Reward is clamped to `[0.0, 1.0]` and combines:
- - `correctness` (`0.0-0.6`)
- - `efficiency` (`0.0-0.2`)
- - `syntax_progress` (`0.0-0.1`)
- - `schema_bonus` (`0.0-0.1`)
- - `penalty` deduction magnitude (`0.0-0.2`)
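How these components combine can be sketched as follows. This is illustrative only, under the assumption that the four positive components add and the penalty subtracts before clamping; the real logic lives in `server/reward.py`:

```python
# Sketch of the documented reward shaping: four additive components,
# one penalty deduction, result clamped to [0.0, 1.0].

def combine_reward(correctness: float, efficiency: float,
                   syntax_progress: float, schema_bonus: float,
                   penalty: float) -> float:
    total = correctness + efficiency + syntax_progress + schema_bonus - penalty
    return max(0.0, min(1.0, total))  # clamp to [0.0, 1.0]
```

At the documented maxima (0.6 + 0.2 + 0.1 + 0.1) a perfect, penalty-free episode reaches the cap of 1.0.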
-
- ## Task Suite
- ### Easy — `easy_syntax_fix`
- Fix a misspelled SQL keyword and alias mismatch.
-
- ### Medium — `medium_logic_fix`
- Fix join/filter placement and aggregation scope issues.
-
- ### Hard — `hard_multi_bug`
- Fix multi-part bugs across correlation, date logic, and aggregation/window behavior.
-
- ## Repository Structure
- ```text
- sql-debug-env/
- ├── Dockerfile
- ├── openenv.yaml
- ├── inference.py
- ├── README.md
- ├── requirements.txt
- ├── pyproject.toml
- ├── uv.lock
- ├── scripts/
- │   └── benchmark_local.py
- ├── server/
- │   ├── main.py
- │   ├── env.py
- │   ├── models.py
- │   ├── database.py
- │   ├── reward.py
- │   └── tasks/
- │       ├── base.py
- │       ├── task_easy.py
- │       ├── task_medium.py
- │       └── task_hard.py
- └── tests/
-     ├── test_env.py
-     ├── test_graders.py
-     └── test_reward.py
- ```
-
- ## Reliability and Benchmarking
- ### Verified local status
- - `openenv validate --verbose`: PASS
- - `python3 -m unittest discover -s tests -p "test_*.py"`: 10/10 PASS
- - Docker smoke test: PASS (`/health`, `/tasks`, `/reset`, `/step`)
-
- ### Live benchmark endpoint
- `GET /benchmark?runs=20` performs fresh timing each call.
-
- Example:
- ```bash
- curl "http://localhost:7860/benchmark?runs=20"
- ```
-
- ## Quick Start
- ### Local
- ```bash
- pip install -r requirements.txt
- uvicorn server.main:app --host 0.0.0.0 --port 7860
- ```
-
- ### Docker
- ```bash
- docker build -t sql-debug-env .
- docker run -p 7860:7860 sql-debug-env
- ```
-
- ### Smoke test
- ```bash
- curl http://localhost:7860/health
- curl http://localhost:7860/tasks
- curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{}'
- curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d '{"action":{"action_type":"inspect_schema"}}'
- curl "http://localhost:7860/benchmark?runs=20"
- ```
-
- ## Baseline Inference
- ```bash
- export API_BASE_URL="https://api.openai.com/v1"
- export MODEL_NAME="gpt-4o-mini"
- export OPENAI_API_KEY="your-key"
- export HF_TOKEN="$OPENAI_API_KEY"
- export ENV_BASE_URL="http://localhost:7860"
- export SEED="1"
- python inference.py
- ```
-
- ## Hugging Face Spaces (Docker)
- 1. Create a Docker Space.
- 2. Push this repository.
- 3. Ensure `openenv.yaml` has:
-    `api.base_url: "https://md896-sql-debug-env.hf.space"`
- 4. Verify:
- ```bash
- curl https://md896-sql-debug-env.hf.space/health
- curl -X POST https://md896-sql-debug-env.hf.space/reset -H "Content-Type: application/json" -d '{}'
- curl https://md896-sql-debug-env.hf.space/docs
- ```
- ---
- title: sql-debug-env
- emoji: "🧪"
- colorFrom: blue
- colorTo: green
- sdk: docker
- pinned: false
- ---
-
- # SQL Debug Environment (`sql-debug-env`)
-
- ![OpenEnv](https://img.shields.io/badge/OpenEnv-Validated-2ea44f)
- ![Docker](https://img.shields.io/badge/Deploy-Docker-2496ED?logo=docker&logoColor=white)
- ![Python](https://img.shields.io/badge/Python-3.11+-3776AB?logo=python&logoColor=white)
- ![FastAPI](https://img.shields.io/badge/FastAPI-0.115-009688?logo=fastapi&logoColor=white)
- ![Pydantic](https://img.shields.io/badge/Pydantic-v2-E92063?logo=pydantic&logoColor=white)
- ![SQLite](https://img.shields.io/badge/SQLite-In--Memory-003B57?logo=sqlite&logoColor=white)
- ![Uvicorn](https://img.shields.io/badge/Uvicorn-ASGI-111111)
- ![OpenAI](https://img.shields.io/badge/OpenAI-Baseline_API-412991?logo=openai&logoColor=white)
-
- An OpenEnv environment for a real task people do every day: **debugging SQL**. The agent gets a broken query, a live (in-memory) SQLite database, and a description of the expected output. It can inspect schema/errors/samples and submit fixed queries until it solves the task.
-
- ## Space Config
- | Key | Value |
- |---|---|
- | `title` | `sql-debug-env` |
- | `emoji` | `🧪` |
- | `colorFrom` | `blue` |
- | `colorTo` | `green` |
- | `sdk` | `docker` |
- | `pinned` | `false` |
-
- ## Why this project matters
- - SQL debugging is a real operational task across analytics and backend teams.
- - The environment is deterministic, fast, and local-first for reliable evaluation.
- - Reward shaping gives useful partial-progress signals instead of only pass/fail.
- - The benchmark endpoint provides live runtime evidence for reviewers.
-
- ## What’s in this repo
- - **FastAPI server**: `server/main.py` (endpoints: `/health`, `/tasks`, `/reset`, `/step`, `/state`)
- - **Environment logic**: `server/env.py` + `server/database.py`
- - **Tasks**: `server/tasks/` (easy → medium → hard, deterministic seed data)
- - **Baseline agent**: `inference.py` (OpenAI client + `[START]/[STEP]/[END]` logs)
-
- ## Tech Stack
- - Python 3.11+
- - FastAPI + Uvicorn
- - Pydantic v2
- - SQLite (in-memory)
- - OpenEnv Core
- - Docker
- - OpenAI Python SDK (baseline inference)
-
- ## Production Notes
- - Stateless HTTP API with per-session environment instances keyed by `X-Session-Id`
- - Deterministic task data (in-memory SQLite) for reproducible grading
- - Reward clamped to `[0.0, 1.0]` with partial-progress shaping
- - Docker-first deployment path (local and Hugging Face Spaces)
- - Local benchmark endpoint for live latency checks (`/benchmark`)
-
- ## Architecture
- ```mermaid
- flowchart LR
- Client[Client Agent or Evaluator] --> API[FastAPI Server]
- API --> Env[SQLDebugEnv]
- Env --> DB[InMemory SQLite Episode DB]
- Env --> Tasks[Task Set easy medium hard]
- Env --> Reward[Reward Engine]
- Tasks --> Grader[Deterministic Graders]
- Grader --> Reward
- Reward --> API
- API --> Client
- ```
-
- ## API Docs (FastAPI Auto Docs)
- Use these for interactive testing in the browser:
-
- - Swagger UI: `http://localhost:7860/docs`
- - ReDoc: `http://localhost:7860/redoc`
- - OpenAPI spec: `http://localhost:7860/openapi.json`
-
- ## Project Structure
- ```text
- sql-debug-env/
- ├── Dockerfile
- ├── openenv.yaml
- ├── inference.py
- ├── README.md
- ├── requirements.txt
- ├── pyproject.toml
- ├── uv.lock
- ├── scripts/
- │   └── benchmark_local.py
- ├── server/
- │   ├── main.py
- │   ├── env.py
- │   ├── models.py
- │   ├── database.py
- │   ├── reward.py
- │   └── tasks/
- │       ├── base.py
- │       ├── task_easy.py
- │       ├── task_medium.py
- │       └── task_hard.py
- └── tests/
-     ├── test_env.py
-     ├── test_graders.py
-     └── test_reward.py
- ```
-
- ## Action Space
- | Action | Required fields | Cost / reward effect |
- |---|---|---|
- | `submit_query` | `query` | Main evaluation step (dense reward based on grading) |
- | `inspect_schema` | none | Free information action (small positive reward component) |
- | `inspect_error` | none | Free information action (small positive reward component) |
- | `inspect_sample` | `table_name` | Free information action (small positive reward component) |
- | `reset_query` | none | Penalty action (reduces reward for that step) |
-
- ## Observation Space
- | Field | Type |
- |---|---|
- | `task_id` | `string` |
- | `task_description` | `string` |
- | `original_query` | `string` |
- | `current_query` | `string_or_null` |
- | `expected_description` | `string` |
- | `last_action_type` | `string` |
- | `last_query_result` | `object_or_null` |
- | `steps_taken` | `integer` |
- | `steps_remaining` | `integer` |
- | `current_score` | `float` |
- | `schema_info` | `object_or_null` |
- | `error_details` | `string_or_null` |
- | `sample_rows` | `array_or_null` |
- | `hint` | `string_or_null` |
- | `is_done` | `boolean` |
- | `success` | `boolean` |
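For client code, the table above translates naturally into a typed structure. A sketch transcribed field-for-field from the table (mapping `*_or_null` to `Optional`); the authoritative definitions are the Pydantic models in `server/models.py`:

```python
# Observation shape as documented in the table; generic `object`/`array`
# types are approximated as dict/list here.
from typing import Optional, TypedDict

class Observation(TypedDict):
    task_id: str
    task_description: str
    original_query: str
    current_query: Optional[str]
    expected_description: str
    last_action_type: str
    last_query_result: Optional[dict]
    steps_taken: int
    steps_remaining: int
    current_score: float
    schema_info: Optional[dict]
    error_details: Optional[str]
    sample_rows: Optional[list]
    hint: Optional[str]
    is_done: bool
    success: bool
```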
-
- ## Reward Function
- | Component | Range | Description |
- |---|---|---|
- | `correctness` | `[0.0, 0.6]` | Row-level match vs expected output |
- | `efficiency` | `[0.0, 0.2]` | Bonus for solving with fewer steps |
- | `syntax_progress` | `[0.0, 0.1]` | Small reward for producing syntactically valid SQL |
- | `schema_bonus` | `[0.0, 0.1]` | Bonus for referencing correct tables/columns |
- | `penalty` | `[0.0, 0.2]` | Deduction magnitude for resets, regressions, and urgency near the step limit |
-
- ## Tasks
- ### Task 1: Easy — Syntax Error Fix (`easy_syntax_fix`)
- Two straightforward issues: a misspelled keyword (`GRUP BY`) and an `ORDER BY` alias mismatch.
-
- ### Task 2: Medium — Logic Error Fix (`medium_logic_fix`)
- Logic bugs around outer joins, filtering scope, and aggregation scope.
-
- ### Task 3: Hard — Multi-Bug Fix (`hard_multi_bug`)
- Five bugs across correlated subqueries, window functions, CTE scope, date logic, and duplication.
-
- ## Baseline
- The baseline script is intentionally simple: it loops `reset → step` and asks an OpenAI model to choose the next JSON action.
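The control flow described above can be sketched as follows. This is a hedged reconstruction, not the actual `inference.py`: the real script calls an OpenAI model and the live HTTP endpoints, so `reset_fn`, `step_fn`, and `choose_action` are injected here to keep the loop visible on its own, and the `[START]/[STEP]/[END]` log lines mirror the format the README mentions:

```python
# Skeleton of a reset → step episode loop with structured logs.
# reset_fn/step_fn stand in for POST /reset and POST /step;
# choose_action stands in for the model picking the next JSON action.

def run_episode(reset_fn, step_fn, choose_action, max_steps: int = 20):
    obs = reset_fn()                        # POST /reset
    print(f"[START] task={obs.get('task_id')}")
    for i in range(max_steps):
        action = choose_action(obs)         # model picks next action
        obs = step_fn(action)               # POST /step
        print(f"[STEP] {i + 1} action={action['action_type']} "
              f"score={obs.get('current_score')}")
        if obs.get("is_done"):
            break
    print(f"[END] success={obs.get('success')}")
    return obs
```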
-
- ## Reliability & Benchmarking
-
- ### Verified status (local)
- - `openenv validate --verbose`: **PASS**
- - `python3 -m unittest discover -s tests -p "test_*.py"`: **10/10 PASS**
- - Docker smoke test: **PASS** (`/health`, `/tasks`, `/reset`, `/step`)
- - FastAPI docs available: **PASS** (`/docs`, `/redoc`, `/openapi.json`)
-
- ### Endpoint benchmark (local Docker run, n=25)
- Measured with `scripts/benchmark_local.py` on a running local container:
-
- | Endpoint | avg | p50 | p95 |
- |---|---:|---:|---:|
- | `GET /health` | 0.69 ms | 0.67 ms | 0.76 ms |
- | `GET /tasks` | 0.82 ms | 0.81 ms | 0.90 ms |
- | `POST /reset` | 1.34 ms | 1.26 ms | 1.62 ms |
- | `POST /step` (`inspect_schema`) | 1.07 ms | 1.01 ms | 1.34 ms |
-
- Re-run anytime:
-
- ```bash
- python3 scripts/benchmark_local.py
- ```
-
- Notes:
- - These are local-machine numbers (single container, warm runtime).
- - For submission-grade reporting, also capture one run against your HF Space URL after deploy.
-
- ## Setup & Usage
-
- ### Local Development
- ```bash
- pip install -r requirements.txt
- uvicorn server.main:app --host 0.0.0.0 --port 7860
- ```
-
- ### Docker
- ```bash
- docker build -t sql-debug-env .
- docker run -p 7860:7860 sql-debug-env
- ```
-
- ### Quick smoke test
- ```bash
- curl http://localhost:7860/health
- curl http://localhost:7860/tasks
- curl -X POST http://localhost:7860/reset -H "Content-Type: application/json" -d '{"task_id":"easy_syntax_fix"}'
- curl -X POST http://localhost:7860/step -H "Content-Type: application/json" -d '{"action":{"action_type":"inspect_schema"}}'
- curl "http://localhost:7860/benchmark?runs=20"
- ```
-
- ### Real-time benchmark API (for dashboards/web pages)
- This is a live endpoint, not static/dummy data. Every request runs fresh measurements.
-
- - Endpoint: `GET /benchmark?runs=20`
- - `runs` range: `1` to `100`
- - Returns JSON with `avg_ms`, `p50_ms`, `p95_ms`, `n`, and a fresh `timestamp_epoch_ms`
-
- Example:
- ```bash
- curl "http://localhost:7860/benchmark?runs=30"
- ```
-
- ### Run Baseline
- ```bash
- export API_BASE_URL="https://api.openai.com/v1"
- export MODEL_NAME="gpt-4o-mini"
- export OPENAI_API_KEY="your-key"
- export ENV_BASE_URL="http://localhost:7860"
- export HF_TOKEN="$OPENAI_API_KEY"
- export SEED="1"
- python inference.py
- ```
-
- ### OpenEnv Validation
- ```bash
- pip install openenv-core
- openenv validate
- ```
-
- ### Suggested pre-submit check
- ```bash
- openenv validate --verbose
- python3 -m unittest discover -s tests -p "test_*.py"
- docker build -t sql-debug-env .
- docker run --rm -p 7860:7860 sql-debug-env
- # in another terminal:
- curl -s http://localhost:7860/health
- curl -s http://localhost:7860/docs >/dev/null
- curl -s "http://localhost:7860/benchmark?runs=20"
- ```
-
- ## Hugging Face Spaces (Docker)
- 1. Create a new **Space → Docker**.
- 2. Push this repo.
- 3. Update `openenv.yaml` → `api.base_url` to your Space URL: `https://<your-space>.hf.space`
- 4. Wait for the build, then verify:
-
- ```bash
- curl -X POST https://<your-space>.hf.space/reset -H "Content-Type: application/json" -d '{}'
- ```
-
 
  ---
+ title: Sql Debug Env
+ emoji: 💻
+ colorFrom: indigo
+ colorTo: gray
  sdk: docker
  pinned: false
  ---
 
+ Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference