Prithvigg committed
Commit 039839b · verified · 1 Parent(s): 6338c5b

Upload folder using huggingface_hub

Files changed (11)
  1. README.md +21 -21
  2. demo.py +1 -1
  3. inference.py +2 -3
  4. judge.py +49 -17
  5. models.py +1 -1
  6. openenv.yaml +7 -3
  7. playbook.py +10 -9
  8. pyproject.toml +4 -0
  9. server/app.py +1 -6
  10. server/requirements.txt +1 -2
  11. tasks.py +49 -57
README.md CHANGED
@@ -143,30 +143,30 @@ A semantically correct but O(N²) query re-executes `AVG(salary)` for every empl
 **Schema:** `departments`, `employees` — 9 employees across 3 departments
 **Goal:** Employees who earn strictly above their department average, ordered by dept/salary
 
-### Expert — Fix the Tie-Breaking Window Function
-`ROW_NUMBER()` silently drops tied reps — one per region is kept, tied ones discarded. Agent must use `RANK()` or `DENSE_RANK()` to return all tied top performers.
+### Expert — Fix the Tie-Breaking Window Function (2 bugs)
+Two layered bugs: `ROW_NUMBER()` drops tied reps AND `ORDER BY revenue ASC` picks the lowest earners instead of the highest. Agent must fix the sort order AND switch to `RANK()`/`DENSE_RANK()` — fixing only one still produces wrong results.
 
 **Schema:** `sales_reps(id, name, region, revenue)` — 6 reps across 2 regions with ties
 **Goal:** All reps whose revenue is the highest in their region
 
-### Expert — Traverse Org Chart with Recursive CTE
-A hardcoded two-level CTE expansion misses employees deeper in the tree. Agent must use `WITH RECURSIVE` to traverse all levels of the hierarchy.
+### Expert — Traverse Org Chart with Recursive CTE (2 bugs)
+Two layered bugs: the anchor uses `WHERE id = 3` (includes VP Eng himself in results) AND the query is a hardcoded two-level CTE that misses deeper employees. Agent must fix the anchor to `WHERE manager_id = 3` AND convert to `WITH RECURSIVE`.
 
 **Schema:** `employees(id, name, manager_id)` — 14 employees, 4 levels deep
-**Goal:** All 8 subordinates of VP Eng at any depth, ordered by id
+**Goal:** All 8 subordinates of VP Eng at any depth (excluding VP Eng), ordered by id
 
-### Expert — Fix Two Broken Window Functions
-Both `SUM` and `RANK` window functions are missing `PARTITION BY` but require different `ORDER BY` clauses. Agent must fix both independently.
+### Expert — Fix Broken Window Functions (3 bugs)
+Three layered bugs: both `SUM` and `RANK` window functions are missing `PARTITION BY`, they need different `ORDER BY` clauses, AND the data contains tied revenue values (West Q3=Q4=16000) that must be ranked correctly.
 
-**Schema:** `quarterly_sales(region, quarter, revenue)` — 8 rows across 2 regions
-**Goal:** Per-region running total (`ORDER BY quarter`) and within-region revenue rank (`ORDER BY revenue DESC`)
+**Schema:** `quarterly_sales(region, quarter, revenue)` — 8 rows across 2 regions with ties
+**Goal:** Per-region running total (`ORDER BY quarter`) and within-region revenue rank (`ORDER BY revenue DESC`) with correct tie handling
 
 > **Structural penalties** are enforced per task level/id to prevent gaming:
 > - `hard`: requires `WITH` clause (−0.30 if absent)
 > - `medium`: requires explicit `JOIN` (−0.20 if absent)
-> - `task_expert_recursive`: requires `WITH RECURSIVE` (−0.30 if absent)
-> - `task_expert_rank`: penalises `ROW_NUMBER()` (−0.20 — drops ties)
-> - `task_expert_window`: requires `PARTITION BY` in both window functions (−0.20 if absent)
+> - `task_expert_recursive`: requires `WITH RECURSIVE` (−0.30) + correct anchor via `manager_id` (−0.15)
+> - `task_expert_rank`: penalises `ROW_NUMBER()` (−0.20) + penalises `ASC` ordering without `DESC` (−0.15)
+> - `task_expert_window`: requires `PARTITION BY` in both window functions (−0.20 if absent, −0.10 if only one)
 
 ---
 
@@ -249,19 +249,19 @@ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
 
 ## Baseline Results
 
-The following scores were produced by running `claude-haiku-4-5` as the agent against all three tasks with the full AI judge active. These serve as the reproducible baseline for this environment.
+The following scores were produced by running `meta-llama/Llama-3.1-8B-Instruct` (via HuggingFace router) as the agent against all 6 tasks with the full AI judge active.
 
 | Task | Level | Steps Used | Best Score |
 |---|---|---|---|
 | Fix the Syntax Errors | easy | 1 | **1.000** |
 | Fix the Cartesian JOIN | medium | 1 | **0.900** |
-| Rewrite Correlated Subquery as CTE | hard | 1 | **0.950** |
-| **Average** | | | **0.950** |
+| Rewrite Correlated Subquery as CTE | hard | 1 | **0.900** |
+| Fix the Tie-Breaking Window Function | expert | 1 | **1.000** |
+| Traverse Org Chart with Recursive CTE | expert | 2 | **0.900** |
+| Fix Two Broken Window Functions | expert | 3 | **0.900** |
+| **Average** | | | **0.933** |
 
-All three tasks were solved (or near-solved) on the first step, demonstrating that:
-- The reward pipeline returns meaningful signal immediately
-- The environment terminates cleanly when the done threshold (≥ 0.90) is met
-- A stronger model or a harder task set would produce more training-relevant trajectories
+The easy–hard tasks and the rank/recursive expert tasks were solved in 1–2 steps. The dual-window expert task required 3 steps, demonstrating the feedback loop produces training-relevant multi-step trajectories for harder tasks.
 
 ---
 
@@ -348,13 +348,13 @@ queryforge/
 ├── playbook.py                    # Local test runner (no server required)
 ├── inference.py                   # Baseline inference script (any OpenAI-compatible LLM)
 ├── demo.py                        # Gradio interactive demo (mounted at /demo)
+├── Dockerfile                     # Container image
 ├── openenv.yaml                   # OpenEnv manifest
 ├── pyproject.toml                 # Project metadata and dependencies
 ├── uv.lock                        # Locked dependencies
 └── server/
     ├── app.py                     # FastAPI app — core + /tasks REST endpoints + Gradio mount
     ├── queryforge_environment.py  # Environment class (reset, step, state)
-    ├── Dockerfile                 # Container image
     └── requirements.txt           # Server dependencies
 ```
 
@@ -373,7 +373,7 @@ Add `ANTHROPIC_API_KEY` as a Space secret after deployment. Without it, the envi
 ### Docker
 
 ```bash
-docker build -t queryforge:latest -f server/Dockerfile .
+docker build -t queryforge:latest .
 docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY queryforge:latest
 ```
 
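The `ROW_NUMBER()`-drops-ties behaviour that the expert rank task (and its structural penalty) targets is easy to demonstrate. A minimal sketch using Python's stdlib `sqlite3`, whose window functions behave like DuckDB's for this query; the table contents are illustrative, not the task's actual dataset:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_reps (id INTEGER, name TEXT, region TEXT, revenue REAL);
INSERT INTO sales_reps VALUES
  (1, 'Ann', 'East', 900), (2, 'Bob', 'East', 900), (3, 'Cal', 'East', 500),
  (4, 'Dee', 'West', 800), (5, 'Eli', 'West', 800), (6, 'Fay', 'West', 300);
""")

def top_reps(fn):
    # Keep the rows ranked first per region by the given window function.
    return con.execute(f"""
        SELECT name FROM (
            SELECT name,
                   {fn}() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
            FROM sales_reps
        ) ranked WHERE rn = 1 ORDER BY name
    """).fetchall()

print(len(top_reps("ROW_NUMBER")))  # 2 — one arbitrary rep per region, ties dropped
print(len(top_reps("RANK")))        # 4 — both tied reps in each region survive
```

Because `ROW_NUMBER()` always assigns unique sequential numbers, filtering on `rn = 1` can never return two tied rows, which is exactly why the judge penalises it for this task.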
demo.py CHANGED
@@ -115,7 +115,7 @@ Fix broken or slow SQL queries and get instant graded feedback.
     )
 )
 
-with gr.Blocks(title="QueryForge", theme=gr.themes.Soft()) as demo:
+with gr.Blocks(title="QueryForge") as demo:
 
     state = gr.State(None)
 
inference.py CHANGED
@@ -188,9 +188,8 @@ def run_task(task_id: str, llm: OpenAI, env_client) -> dict:
 
 def main() -> None:
     # ── Validate required config ──────────────────────────────────────────────
-    missing = [v for v in ("API_BASE_URL", "MODEL_NAME") if not os.getenv(v)]
-    if missing:
-        print(f"ERROR: missing required env vars: {', '.join(missing)}")
+    if not MODEL_NAME:
+        print("ERROR: MODEL_NAME env var is not set.")
         sys.exit(1)
 
     if not API_KEY:
judge.py CHANGED
@@ -16,7 +16,7 @@ Grading pipeline for each submitted SQL query:
     Partial credit for correct row count or partial row matches.
 
 Stage 4 — AI Quality (→ 1.0)
-    Anthropic claude-sonnet-4-6 evaluates optimization, code style, and
+    Anthropic claude-haiku-4-5 evaluates optimization, code style, and
     semantic correctness vs. the reference solution.
     The AI score can move the final score up to 1.0 when rows are correct,
     or provide nuanced feedback even when rows are partially wrong.
@@ -183,16 +183,24 @@ def rows_match(
 
     projected = [_project(row) for row in actual]
 
+    actual_norm = [_normalize(r) for r in projected]
+    expected_norm = [_normalize(r) for r in expected]
+
     if len(projected) != len(expected):
-        overlap_ratio = min(len(projected), len(expected)) / max(len(projected), len(expected))
-        score = 0.3 * overlap_ratio
+        # Count how many returned rows are actually in the expected set
+        expected_set = [tuple(sorted(r.items())) for r in expected_norm]
+        correct_rows = sum(1 for r in actual_norm if tuple(sorted(r.items())) in expected_set)
+        # Score based on fraction of expected rows correctly returned
+        coverage = correct_rows / len(expected)
+        # Base 0.10 for count mismatch, up to 0.45 for high coverage of correct rows
+        score = 0.10 + 0.35 * coverage
         return score, (
             f"Row count mismatch: got {len(projected)}, expected {len(expected)}. "
-            f"({overlap_ratio:.0%} overlap ratio)"
+            f"{correct_rows}/{len(expected)} expected rows present."
         )
 
-    actual_sorted = sorted([_normalize(r) for r in projected], key=lambda r: _sort_key(r, order_by))
-    expected_sorted = sorted([_normalize(r) for r in expected], key=lambda r: _sort_key(r, order_by))
+    actual_sorted = sorted(actual_norm, key=lambda r: _sort_key(r, order_by))
+    expected_sorted = sorted(expected_norm, key=lambda r: _sort_key(r, order_by))
 
     matches = sum(1 for a, e in zip(actual_sorted, expected_sorted) if a == e)
     row_accuracy = matches / len(expected)
@@ -289,7 +297,6 @@ Respond with ONLY valid JSON (no markdown fences):
             {"role": "assistant", "content": "{"},  # prefill forces JSON-only reply
         ],
     )
-    print("Anthropic judge response:", message.content)
     # Prepend the prefilled "{" back before parsing
     raw = "{" + message.content[0].text.strip()
 
@@ -381,15 +388,35 @@ def grade(
     elif task.level == "medium" and "JOIN " not in query_upper:
         structural_penalty = 0.20  # medium task demands explicit JOINs
         row_feedback += " (Penalty: no explicit JOIN — task requires JOIN … ON syntax.)"
-    elif task.id == "task_expert_recursive" and "RECURSIVE" not in query_upper:
-        structural_penalty = 0.30  # must use recursive CTE, not repeated JOINs
-        row_feedback += " (Penalty: WITH RECURSIVE required — plain JOIN only fetches one level.)"
-    elif task.id == "task_expert_rank" and "ROW_NUMBER" in query_upper:
-        structural_penalty = 0.20  # ROW_NUMBER breaks ties — must use RANK/DENSE_RANK
-        row_feedback += " (Penalty: ROW_NUMBER() drops tied rows — use RANK() or DENSE_RANK().)"
-    elif task.id == "task_expert_window" and "PARTITION BY" not in query_upper:
-        structural_penalty = 0.20  # both window functions need PARTITION BY region
-        row_feedback += " (Penalty: missing PARTITION BY — both SUM and RANK must be partitioned per region.)"
+    elif task.id == "task_expert_recursive":
+        # Two bugs: anchor uses WHERE id=3 (includes VP Eng) + non-recursive CTE (misses deep levels)
+        if "RECURSIVE" not in query_upper:
+            structural_penalty += 0.30
+            row_feedback += " (Penalty: WITH RECURSIVE required — hardcoded levels won't scale.)"
+        if "MANAGER_ID = 3" not in query_upper and "MANAGER_ID=3" not in query_upper:
+            structural_penalty += 0.15
+            row_feedback += " (Penalty: anchor should select subordinates via manager_id, not the VP themselves.)"
+        structural_penalty = min(structural_penalty, 0.40)
+    elif task.id == "task_expert_rank":
+        # Two bugs: ROW_NUMBER (drops ties) + ASC ordering (picks lowest instead of highest)
+        if "ROW_NUMBER" in query_upper:
+            structural_penalty += 0.20
+            row_feedback += " (Penalty: ROW_NUMBER() drops tied rows — use RANK() or DENSE_RANK().)"
+        if "ASC" in query_upper and "DESC" not in query_upper:
+            structural_penalty += 0.15
+            row_feedback += " (Penalty: ordering by revenue ASC picks lowest earners, not highest.)"
+        structural_penalty = min(structural_penalty, 0.35)
+    elif task.id == "task_expert_window":
+        # Three bugs: missing PARTITION BY on both windows + tied revenues need correct ranking
+        if "PARTITION BY" not in query_upper:
+            structural_penalty += 0.20
+            row_feedback += " (Penalty: missing PARTITION BY — both SUM and RANK must be partitioned per region.)"
+        # Count PARTITION BY occurrences — need at least 2 (one per window function)
+        partition_count = query_upper.count("PARTITION BY")
+        if 0 < partition_count < 2:
+            structural_penalty += 0.10
+            row_feedback += " (Penalty: only one window function has PARTITION BY — both need it.)"
+        structural_penalty = min(structural_penalty, 0.30)
 
     details["structural_penalty"] = structural_penalty
 
@@ -405,9 +432,14 @@ def grade(
     details["ai_hint"] = ai_hint
 
     # Final blending:
+    #   AI judge offline (fallback) → use deterministic score directly
     #   rows fully correct → trust AI score (can reach 1.0)
     #   rows partially wrong → clamp AI score to not exceed deterministic
-    if row_score >= 0.95:
+    ai_is_fallback = abs(ai_score - deterministic_score) < 0.001
+    if ai_is_fallback:
+        # AI judge was unavailable — use deterministic score as-is
+        final_score = deterministic_score
+    elif row_score >= 0.95:
         final_score = ai_score
     elif row_score >= 0.5:
         # Blend: AI provides nuance but can't exceed deterministic ceiling
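The blending rules in that last hunk can be read as a small pure function. The sketch below is illustrative, not the repo's actual API: the standalone function name is mine, and since the `row_score >= 0.5` branch body is cut off in the diff, the deterministic ceiling is modelled with `min()` as an assumption:

```python
def blend_final_score(deterministic_score: float, ai_score: float, row_score: float) -> float:
    """Illustrative standalone version of the judge's final-blending rules.

    Assumption: the `row_score >= 0.5` branch body is not visible in the
    diff, so the deterministic ceiling is modelled here with min().
    """
    # AI judge offline: the fallback echoes the deterministic score, which
    # is detected by near-equality and passed through unchanged.
    if abs(ai_score - deterministic_score) < 0.001:
        return deterministic_score
    # Rows fully correct: trust the AI quality score (can reach 1.0).
    if row_score >= 0.95:
        return ai_score
    # Rows partially right: AI adds nuance but is capped at the deterministic score.
    if row_score >= 0.5:
        return min(ai_score, deterministic_score)
    return deterministic_score

print(blend_final_score(0.80, 0.80, 1.0))  # fallback path: deterministic score kept
print(blend_final_score(0.80, 0.97, 1.0))  # rows correct: AI score trusted
```

The near-equality check is a pragmatic fallback detector: it also catches the rare case where a live AI judge happens to return exactly the deterministic score, which the diff's thresholds treat the same way.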
models.py CHANGED
@@ -24,7 +24,7 @@ class SQLObservation(Observation):
     # ── Task context ─────────────────────────────────────────────────────────
     task_id: str = Field(default="", description="Active task identifier")
     task_level: str = Field(
-        default="", description="Difficulty: easy | medium | hard"
+        default="", description="Difficulty: easy | medium | hard | expert"
     )
     task_title: str = Field(default="", description="Human-readable task title")
     task_description: str = Field(
openenv.yaml CHANGED
@@ -11,16 +11,20 @@ description: |
   An agent receives a broken or slow SQL query together with the schema and an
   error/performance warning. It must produce a working, optimised query.
 
-  Tasks (3 levels, cycled in order):
+  Tasks (6 tasks across 4 difficulty levels):
     easy   — fix three misspelled SQL keywords (SELECT / FROM / WHERE)
     medium — fix a missing JOIN condition that causes a cartesian product
     hard   — rewrite a correlated subquery (O(N²)) as a CTE (O(N))
+    expert — fix tie-breaking window function (2 bugs: ROW_NUMBER + ASC ordering)
+    expert — traverse org chart with recursive CTE (2 bugs: wrong anchor + hardcoded levels)
+    expert — fix broken window functions (3 bugs: missing PARTITION BY + tied revenues)
 
   Reward signal (0.0 – 1.0):
     0.00       syntax error
     0.15       syntax valid, runtime error
     0.30       executes, wrong / empty results
     0.30–0.80  partial row correctness (deterministic, DuckDB)
-    0.80–1.00  correct results + AI quality score (Anthropic claude-sonnet-4-6)
+    0.80–1.00  correct results + AI quality score (Anthropic claude-haiku-4-5)
 
-  Required env var: ANTHROPIC_API_KEY
+  Optional env var: ANTHROPIC_API_KEY (enables AI judge for scores up to 1.0;
+  without it, scoring is fully deterministic and capped at 0.80)
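The two window bugs named in the expert task list above (global running total, global rank) are easy to see side by side. A minimal sketch with made-up revenue figures, using stdlib `sqlite3` rather than the environment's DuckDB; both engines evaluate these window clauses the same way:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE quarterly_sales (region TEXT, quarter INTEGER, revenue REAL);
INSERT INTO quarterly_sales VALUES
  ('East', 1, 100), ('East', 2, 200),
  ('West', 1, 50),  ('West', 2, 75);
""")

# Missing PARTITION BY: one giant partition, total carries across regions.
broken = con.execute("""
    SELECT region, quarter,
           SUM(revenue) OVER (ORDER BY region, quarter) AS running_total
    FROM quarterly_sales ORDER BY region, quarter
""").fetchall()

# PARTITION BY region: the running total restarts for each region.
fixed = con.execute("""
    SELECT region, quarter,
           SUM(revenue) OVER (PARTITION BY region ORDER BY quarter) AS running_total
    FROM quarterly_sales ORDER BY region, quarter
""").fetchall()

print(broken[2])  # West Q1 continues from East's total (100 + 200 + 50 = 350)
print(fixed[2])   # West Q1 resets to its own revenue (50)
```

The same `PARTITION BY region` fix applies to the task's `RANK()` window, just with `ORDER BY revenue DESC` instead of `ORDER BY quarter`.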
playbook.py CHANGED
@@ -3,13 +3,14 @@ QueryForge Client Playbook
 ──────────────────────────
 Tests the environment through the HTTP server using the QueryforgeEnv client.
 
-Requires the server to be running first:
-    uvicorn server.app:app --host 0.0.0.0 --port 8000
-
-Then run:
+Usage:
+    # Against the live HF Space:
     python playbook.py
 
-If ANTHROPIC_API_KEY is set, Stage 4 AI scoring is live.
+    # Against a local server:
+    ENV_URL=http://localhost:8000 python playbook.py
+
+If ANTHROPIC_API_KEY is set on the server, Stage 4 AI scoring is live.
 If not set, the judge falls back to deterministic scoring (capped at 0.80).
 """
 
@@ -23,7 +24,7 @@ from client import QueryforgeEnv
 from models import SQLAction, TaskSpec
 from tasks import REGISTRY, task_from_dict
 
-BASE_URL = "https://prithvigg-queryforge.hf.space"
+BASE_URL = os.environ.get("ENV_URL", "https://prithvigg-queryforge.hf.space")
 
 # ── Formatting helpers ────────────────────────────────────────────────────────
 
@@ -239,10 +240,10 @@ if __name__ == "__main__":
     _hr("═")
 
     with QueryforgeEnv(base_url=BASE_URL).sync() as client:
-        # run_easy(client)
+        run_easy(client)
         run_medium(client)
         run_hard(client)
-        # run_custom(client)
+        run_custom(client)
 
     _section("DONE")
-    print("  All 4 tasks completed.\n")
+    print("  All tasks completed.\n")
pyproject.toml CHANGED
@@ -22,6 +22,10 @@ dependencies = [
     "duckdb>=0.10.0",
     # AI judge — quality scoring via Anthropic API
     "anthropic>=0.25.0",
+    # Interactive demo UI (mounted at /demo on the FastAPI server)
+    "gradio>=4.0.0",
+    # Inference script uses the OpenAI client
+    "openai>=1.0.0",
 ]
 
 [project.optional-dependencies]
server/app.py CHANGED
@@ -124,9 +124,4 @@ def main(host: str = "0.0.0.0", port: int = 8000):
 
 
 if __name__ == "__main__":
-    import argparse
-
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--port", type=int, default=8000)
-    args = parser.parse_args()
-    main(port=args.port)
+    main()
server/requirements.txt CHANGED
@@ -3,6 +3,5 @@ fastapi>=0.115.0
 uvicorn>=0.24.0
 duckdb>=0.10.0
 anthropic>=0.25.0
-
-
+gradio>=4.0.0
tasks.py CHANGED
@@ -270,9 +270,8 @@ _TASK_EXPERT_RANK = SQLTask(
270
  level="expert",
271
  title="Fix the Tie-Breaking Window Function",
272
  description="""\
273
- TASK: The query below finds the top-earning sales rep per region, but it
274
- silently drops reps who are tied for first place. Fix it so ALL reps
275
- tied at rank 1 are returned.
276
 
277
  SCHEMA:
278
  sales_reps(id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL)
@@ -281,19 +280,18 @@ BROKEN QUERY:
281
  SELECT name, region, revenue
282
  FROM (
283
  SELECT name, region, revenue,
284
- ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
285
  FROM sales_reps
286
  ) ranked
287
  WHERE rn = 1
288
  ORDER BY region, name
289
 
290
  PROBLEM:
291
- ROW_NUMBER() assigns unique sequential numbers even for tied revenue values.
292
- When two reps share the top revenue in a region, ROW_NUMBER arbitrarily
293
- picks one and discards the other.
294
 
295
  GOAL: Return ALL reps whose revenue is the highest in their region.
296
- Use RANK() or DENSE_RANK() instead of ROW_NUMBER().
297
  Order by region ASC, name ASC.""",
298
  schema_ddl="""\
299
  CREATE TABLE sales_reps (id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL);
@@ -309,16 +307,16 @@ INSERT INTO sales_reps VALUES
309
  SELECT name, region, revenue
310
  FROM (
311
  SELECT name, region, revenue,
312
- ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
313
  FROM sales_reps
314
  ) ranked
315
  WHERE rn = 1
316
  ORDER BY region, name""",
317
  error_message=(
318
- "Query runs but returns only 2 rows β€” one per region. "
319
- "Tied reps at the top are silently dropped by ROW_NUMBER()."
320
  ),
321
- hint="Replace ROW_NUMBER() with RANK() or DENSE_RANK(). Both include all tied rows.",
322
  test_cases=[
323
  TestCase(
324
  description="All reps tied at rank 1 per region",
@@ -350,21 +348,21 @@ _TASK_EXPERT_RECURSIVE = SQLTask(
350
  title="Traverse Org Chart with Recursive CTE",
351
  description="""\
352
  TASK: The query below attempts to find all subordinates of the VP of Engineering
353
- (id=3) using a two-level CTE expansion. It misses employees more than two levels
354
- deep. Rewrite it using a recursive CTE that traverses all levels.
355
 
356
  SCHEMA:
357
  employees(id INTEGER, name VARCHAR, manager_id INTEGER)
358
 
359
  DATA (partial):
360
- VP Eng (id=3) β†’ Lead A (id=5), Lead B (id=6)
361
- Lead A (id=5) β†’ Dev 1 (id=8), Dev 2 (id=9)
362
- Lead B (id=6) β†’ Dev 3 (id=10), Dev 4 (id=11)
363
- Dev 1 (id=8) β†’ Junior 1 (id=13), Junior 2 (id=14)
 
364
 
365
  BROKEN QUERY:
366
  WITH direct AS (
367
- SELECT id, name, manager_id FROM employees WHERE manager_id = 3
368
  ),
369
  level2 AS (
370
  SELECT e.id, e.name, e.manager_id
@@ -377,12 +375,13 @@ BROKEN QUERY:
377
  ORDER BY id
378
 
379
  PROBLEM:
380
- This hardcoded two-level expansion returns 6 rows but misses Junior 1 (id=13)
381
- and Junior 2 (id=14), who report to Dev 1 β€” three levels below VP Eng.
382
- Adding a level3 CTE would help for now but still break if the tree grows deeper.
383
 
384
- GOAL: Use WITH RECURSIVE to return ALL 8 subordinates of VP Eng (id=3)
385
- at any depth. Return id, name, manager_id columns, ordered by id ASC.""",
 
386
  schema_ddl="""\
387
  CREATE TABLE employees (id INTEGER, name VARCHAR, manager_id INTEGER);
388
  INSERT INTO employees VALUES
@@ -403,7 +402,7 @@ INSERT INTO employees VALUES
403
  """,
404
  broken_query="""\
405
  WITH direct AS (
406
- SELECT id, name, manager_id FROM employees WHERE manager_id = 3
407
  ),
408
  level2 AS (
409
  SELECT e.id, e.name, e.manager_id
@@ -415,11 +414,10 @@ UNION ALL
415
  SELECT id, name, manager_id FROM level2
416
  ORDER BY id""",
417
  error_message=(
418
- "Query returns only 6 rows β€” two levels under VP Eng. "
419
- "Junior 1 (id=13) and Junior 2 (id=14) who report to Dev 1 are missing. "
420
- "A hardcoded level3 CTE would fix this instance but not scale to deeper trees."
421
  ),
422
- hint="Use WITH RECURSIVE. Start from manager_id = 3, then JOIN employees to the CTE itself on manager_id = cte.id.",
423
  test_cases=[
424
  TestCase(
425
  description="All 8 subordinates of VP Eng at any depth",
@@ -456,34 +454,33 @@ ORDER BY id""",
456
  _TASK_EXPERT_WINDOW = SQLTask(
457
  id="task_expert_window",
458
  level="expert",
459
- title="Fix Two Broken Window Functions: Running Total and Revenue Rank",
460
  description="""\
461
- TASK: The query below computes a cumulative running total and a
462
- within-region revenue rank for each quarter, but BOTH window functions
463
- are broken β€” neither has a PARTITION BY, so they treat all rows as one
464
- giant partition instead of computing independently per region.
465
 
466
  SCHEMA:
467
  quarterly_sales(region VARCHAR, quarter INTEGER, revenue DECIMAL)
468
 
 
 
 
 
469
  BROKEN QUERY:
470
  SELECT region, quarter, revenue,
471
  SUM(revenue) OVER (ORDER BY region, quarter) AS running_total,
472
- RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
473
  FROM quarterly_sales
474
  ORDER BY region, quarter
475
 
476
  PROBLEM:
477
- - running_total accumulates across both regions: West's Q1 shows 65000
478
- (continuing from East's Q4) instead of resetting to 11000.
479
- - revenue_rank ranks revenue across ALL regions globally, so East Q4 (20000)
480
- and West Q3 (16000) compete directly instead of being ranked within their
481
- own region.
482
-
483
- GOAL: Fix BOTH window functions so they operate independently per region.
484
- - running_total must reset to 0 at the start of each region (ORDER BY quarter).
485
- - revenue_rank must rank revenue within each region (ORDER BY revenue DESC).
486
- Both OVER clauses need PARTITION BY region, but with different ORDER BY columns.
487
  Final output: ORDER BY region ASC, quarter ASC.""",
488
  schema_ddl="""\
489
  CREATE TABLE quarterly_sales (region VARCHAR, quarter INTEGER, revenue DECIMAL);
@@ -495,7 +492,7 @@ INSERT INTO quarterly_sales VALUES
495
  ('West', 1, 11000),
496
  ('West', 2, 14000),
497
  ('West', 3, 16000),
498
- ('West', 4, 13000);
499
  """,
500
  broken_query="""\
501
  SELECT region, quarter, revenue,
@@ -504,19 +501,14 @@ SELECT region, quarter, revenue,
504
  FROM quarterly_sales
505
  ORDER BY region, quarter""",
506
  error_message=(
507
- "Query runs but both window functions are wrong. "
508
- "West Q1 running_total shows 76000 (continuing from East) instead of 11000. "
509
- "revenue_rank is a global ranking across all 8 rows instead of per-region. "
510
- "Both SUM and RANK are missing PARTITION BY region."
511
- ),
512
- hint=(
513
- "Add PARTITION BY region to BOTH window functions, but with different ORDER BY: "
514
- "SUM(revenue) OVER (PARTITION BY region ORDER BY quarter) for running total, "
515
- "RANK() OVER (PARTITION BY region ORDER BY revenue DESC) for within-region rank."
516
  ),
 
517
  test_cases=[
518
  TestCase(
519
- description="Per-region running total and within-region revenue rank",
520
  expected_rows=[
521
  {"region": "East", "quarter": 1, "revenue": 15000.0, "running_total": 15000.0, "revenue_rank": 3},
522
  {"region": "East", "quarter": 2, "revenue": 18000.0, "running_total": 33000.0, "revenue_rank": 2},
@@ -525,7 +517,7 @@ ORDER BY region, quarter""",
525
  {"region": "West", "quarter": 1, "revenue": 11000.0, "running_total": 11000.0, "revenue_rank": 4},
526
  {"region": "West", "quarter": 2, "revenue": 14000.0, "running_total": 25000.0, "revenue_rank": 3},
527
  {"region": "West", "quarter": 3, "revenue": 16000.0, "running_total": 41000.0, "revenue_rank": 1},
528
- {"region": "West", "quarter": 4, "revenue": 13000.0, "running_total": 54000.0, "revenue_rank": 2},
529
  ],
530
  order_by="region,quarter",
531
  )
 
     level="expert",
     title="Fix the Tie-Breaking Window Function",
     description="""\
+ TASK: The query below attempts to find the top-earning sales rep per region,
+ but it returns wrong results. Debug it.

 SCHEMA:
 sales_reps(id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL)

 SELECT name, region, revenue
 FROM (
 SELECT name, region, revenue,
+ ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue ASC) AS rn
 FROM sales_reps
 ) ranked
 WHERE rn = 1
 ORDER BY region, name

 PROBLEM:
+ The query returns 2 rows but the expected answer has 4.
+ The output values are also wrong: it seems to pick the lowest revenue per region
+ instead of the highest.

 GOAL: Return ALL reps whose revenue is the highest in their region.
 Order by region ASC, name ASC.""",
     schema_ddl="""\
 CREATE TABLE sales_reps (id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL);

 SELECT name, region, revenue
 FROM (
 SELECT name, region, revenue,
+ ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue ASC) AS rn
 FROM sales_reps
 ) ranked
 WHERE rn = 1
 ORDER BY region, name""",
     error_message=(
+ "Query runs but returns wrong results: only 2 rows (one per region) "
+ "with the LOWEST revenue instead of the HIGHEST. Expected 4 rows."
     ),
+ hint="There are two bugs. Think about both the ranking function and the sort order.",
     test_cases=[
         TestCase(
             description="All reps tied at rank 1 per region",
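The intended fix (RANK() with a descending sort, so tied top earners all survive) can be sanity-checked outside the harness. A minimal sketch using the stdlib sqlite3 module as a stand-in for the task's SQL engine (SQLite supports window functions since 3.25); the seed rows and names below are illustrative, not the task's actual data:

```python
import sqlite3

# Illustrative data only: two regions, each with a revenue tie at the top.
# (The task's real seed rows live in tasks.py; these names are made up.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_reps (id INTEGER, name TEXT, region TEXT, revenue REAL);
INSERT INTO sales_reps VALUES
    (1, 'Alice', 'East', 100.0),
    (2, 'Bob',   'East', 100.0),
    (3, 'Carol', 'East',  80.0),
    (4, 'Dan',   'West',  90.0),
    (5, 'Erin',  'West',  90.0),
    (6, 'Frank', 'West',  70.0);
""")

# Corrected query: RANK() assigns tied rows the same rank, and
# ORDER BY revenue DESC puts the highest earners at rank 1.
rows = conn.execute("""
    SELECT name, region, revenue
    FROM (
        SELECT name, region, revenue,
               RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
        FROM sales_reps
    ) ranked
    WHERE rnk = 1
    ORDER BY region, name
""").fetchall()

print(rows)  # 4 rows: both tied top earners per region
```

With ROW_NUMBER() the same data would return only 2 rows, one arbitrary rep per region; RANK() keeps every tied row at rank 1.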
 
     title="Traverse Org Chart with Recursive CTE",
     description="""\
 TASK: The query below attempts to find all subordinates of the VP of Engineering
+ (id=3), but it returns wrong results. Debug and fix it.

 SCHEMA:
 employees(id INTEGER, name VARCHAR, manager_id INTEGER)

 DATA (partial):
+ CEO (id=1)
+ VP Eng (id=3, reports to CEO)
+ Lead A (id=5), Lead B (id=6) report to VP Eng
+ Dev 1..4 (id=8..11) report to Leads
+ Junior 1..2 (id=13..14) report to Dev 1

 BROKEN QUERY:
 WITH direct AS (
+ SELECT id, name, manager_id FROM employees WHERE id = 3
 ),
 level2 AS (
 SELECT e.id, e.name, e.manager_id

 ORDER BY id

 PROBLEM:
+ The query returns some results but the row count and values don't match
+ the expected output. Inspect what the anchor condition selects and whether
+ the query reaches all depths of the org tree.

+ GOAL: Return ALL 8 subordinates of VP Eng (id=3) at any depth.
+ Do NOT include VP Eng himself, only his reports.
+ Return id, name, manager_id columns, ordered by id ASC.""",
     schema_ddl="""\
 CREATE TABLE employees (id INTEGER, name VARCHAR, manager_id INTEGER);
 INSERT INTO employees VALUES

 """,
     broken_query="""\
 WITH direct AS (
+ SELECT id, name, manager_id FROM employees WHERE id = 3
 ),
 level2 AS (
 SELECT e.id, e.name, e.manager_id

 SELECT id, name, manager_id FROM level2
 ORDER BY id""",
     error_message=(
+ "Query returns wrong results. Check carefully: does the anchor condition "
+ "select the right starting rows? Does the query traverse all depths?"
     ),
+ hint="There are multiple issues. Think about what the anchor selects and how deep the query reaches.",
     test_cases=[
         TestCase(
             description="All 8 subordinates of VP Eng at any depth",
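The fixed traversal the task expects (anchor on direct reports, then WITH RECURSIVE down the whole tree) can be sketched against the org chart in the description. The Dev-to-Lead assignment is not fully specified there, so the split below (Devs 8/9 under Lead A, 10/11 under Lead B) is an assumption; sqlite3 stands in for the task's SQL engine:

```python
import sqlite3

# Org tree from the task description; which Dev reports to which Lead is
# not spelled out, so the 8->5, 9->5, 10->6, 11->6 mapping is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
    (1,  'CEO',      NULL),
    (3,  'VP Eng',   1),
    (5,  'Lead A',   3),
    (6,  'Lead B',   3),
    (8,  'Dev 1',    5),
    (9,  'Dev 2',    5),
    (10, 'Dev 3',    6),
    (11, 'Dev 4',    6),
    (13, 'Junior 1', 8),
    (14, 'Junior 2', 8);
""")

# Fixed query: the anchor selects VP Eng's DIRECT reports (manager_id = 3),
# excluding VP Eng himself, and the UNION ALL step recurses to every depth.
rows = conn.execute("""
    WITH RECURSIVE subs AS (
        SELECT id, name, manager_id FROM employees WHERE manager_id = 3
        UNION ALL
        SELECT e.id, e.name, e.manager_id
        FROM employees e
        JOIN subs s ON e.manager_id = s.id
    )
    SELECT id, name, manager_id FROM subs ORDER BY id
""").fetchall()

print([r[0] for r in rows])  # [5, 6, 8, 9, 10, 11, 13, 14]
```

The broken version anchors on `id = 3` (returning VP Eng himself) and only hand-expands one extra level; the recursive form reaches the Juniors three levels down.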
 
 _TASK_EXPERT_WINDOW = SQLTask(
     id="task_expert_window",
     level="expert",
+ title="Fix Broken Window Functions: Running Total and Revenue Rank",
     description="""\
+ TASK: The query below computes a cumulative running total and a within-region
+ revenue rank for each quarter, but the results are wrong. Debug and fix it.

 SCHEMA:
 quarterly_sales(region VARCHAR, quarter INTEGER, revenue DECIMAL)

+ DATA:
+ East: Q1=15000, Q2=18000, Q3=12000, Q4=20000
+ West: Q1=11000, Q2=14000, Q3=16000, Q4=16000 (note: Q3 and Q4 are tied)
+
 BROKEN QUERY:
 SELECT region, quarter, revenue,
 SUM(revenue) OVER (ORDER BY region, quarter) AS running_total,
+ RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
 FROM quarterly_sales
 ORDER BY region, quarter

 PROBLEM:
+ The query returns wrong values for both running_total and revenue_rank.
+ Compare your output against the expected results carefully.
+
+ GOAL: running_total should be a cumulative sum per region (reset each region,
+ ordered by quarter). revenue_rank should rank revenue within each region
+ (ordered by revenue DESC), handling ties correctly (tied values must get
+ the same rank).
 Final output: ORDER BY region ASC, quarter ASC.""",
     schema_ddl="""\
 CREATE TABLE quarterly_sales (region VARCHAR, quarter INTEGER, revenue DECIMAL);

 ('West', 1, 11000),
 ('West', 2, 14000),
 ('West', 3, 16000),
+ ('West', 4, 16000);
 """,
     broken_query="""\
 SELECT region, quarter, revenue,

 FROM quarterly_sales
 ORDER BY region, quarter""",
     error_message=(
+ "Query runs but both computed columns are wrong. "
+ "running_total does not reset per region. "
+ "revenue_rank is a global ranking across all rows instead of per-region."
     ),
+ hint="Multiple issues exist. Think about partitioning and how tied values should be ranked.",
     test_cases=[
         TestCase(
+            description="Per-region running total and within-region revenue rank with ties",
             expected_rows=[
                 {"region": "East", "quarter": 1, "revenue": 15000.0, "running_total": 15000.0, "revenue_rank": 3},
                 {"region": "East", "quarter": 2, "revenue": 18000.0, "running_total": 33000.0, "revenue_rank": 2},

                 {"region": "West", "quarter": 1, "revenue": 11000.0, "running_total": 11000.0, "revenue_rank": 4},
                 {"region": "West", "quarter": 2, "revenue": 14000.0, "running_total": 25000.0, "revenue_rank": 3},
                 {"region": "West", "quarter": 3, "revenue": 16000.0, "running_total": 41000.0, "revenue_rank": 1},
+                {"region": "West", "quarter": 4, "revenue": 16000.0, "running_total": 57000.0, "revenue_rank": 1},
             ],
             order_by="region,quarter",
         )
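Since this task's data is fully spelled out in the diff, the intended fix (PARTITION BY region on both window functions, each with its own ORDER BY) can be checked end to end against the expected rows. A sketch using sqlite3 as a stand-in for the task's SQL engine:

```python
import sqlite3

# Exact data from the task: 8 rows, 2 regions, West Q3/Q4 tied at 16000.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quarterly_sales (region TEXT, quarter INTEGER, revenue REAL);
INSERT INTO quarterly_sales VALUES
    ('East', 1, 15000), ('East', 2, 18000), ('East', 3, 12000), ('East', 4, 20000),
    ('West', 1, 11000), ('West', 2, 14000), ('West', 3, 16000), ('West', 4, 16000);
""")

# Fixed query: both window functions are partitioned per region, each with
# its own ORDER BY; RANK() gives the tied West Q3/Q4 rows the same rank (1).
rows = conn.execute("""
    SELECT region, quarter, revenue,
           SUM(revenue) OVER (PARTITION BY region ORDER BY quarter)      AS running_total,
           RANK()       OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
    FROM quarterly_sales
    ORDER BY region, quarter
""").fetchall()

for r in rows:
    print(r)  # West Q4 row ends as running_total=57000.0, revenue_rank=1
```

This reproduces the updated `expected_rows`, including the tie handling the new West Q4 row tests: ranks 1, 1, 3, 4 for West rather than 1, 2, 3, 4.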