Prithvigg committed
Commit 039839b · verified · 1 Parent(s): 6338c5b

Upload folder using huggingface_hub

Files changed (11)
  1. README.md +21 -21
  2. demo.py +1 -1
  3. inference.py +2 -3
  4. judge.py +49 -17
  5. models.py +1 -1
  6. openenv.yaml +7 -3
  7. playbook.py +10 -9
  8. pyproject.toml +4 -0
  9. server/app.py +1 -6
  10. server/requirements.txt +1 -2
  11. tasks.py +49 -57
README.md CHANGED
@@ -143,30 +143,30 @@ A semantically correct but O(N²) query re-executes `AVG(salary)` for every empl
 **Schema:** `departments`, `employees` — 9 employees across 3 departments
 **Goal:** Employees who earn strictly above their department average, ordered by dept/salary
 
-### Expert — Fix the Tie-Breaking Window Function
-`ROW_NUMBER()` silently drops tied reps — one per region is kept, tied ones discarded. Agent must use `RANK()` or `DENSE_RANK()` to return all tied top performers.
+### Expert — Fix the Tie-Breaking Window Function (2 bugs)
+Two layered bugs: `ROW_NUMBER()` drops tied reps AND `ORDER BY revenue ASC` picks the lowest earners instead of the highest. Agent must fix the sort order AND switch to `RANK()`/`DENSE_RANK()` — fixing only one still produces wrong results.
 
 **Schema:** `sales_reps(id, name, region, revenue)` — 6 reps across 2 regions with ties
 **Goal:** All reps whose revenue is the highest in their region
 
-### Expert — Traverse Org Chart with Recursive CTE
-A hardcoded two-level CTE expansion misses employees deeper in the tree. Agent must use `WITH RECURSIVE` to traverse all levels of the hierarchy.
+### Expert — Traverse Org Chart with Recursive CTE (2 bugs)
+Two layered bugs: the anchor uses `WHERE id = 3` (includes VP Eng himself in results) AND the query is a hardcoded two-level CTE that misses deeper employees. Agent must fix the anchor to `WHERE manager_id = 3` AND convert to `WITH RECURSIVE`.
 
 **Schema:** `employees(id, name, manager_id)` — 14 employees, 4 levels deep
-**Goal:** All 8 subordinates of VP Eng at any depth, ordered by id
+**Goal:** All 8 subordinates of VP Eng at any depth (excluding VP Eng), ordered by id
 
-### Expert — Fix Two Broken Window Functions
-Both `SUM` and `RANK` window functions are missing `PARTITION BY` but require different `ORDER BY` clauses. Agent must fix both independently.
+### Expert — Fix Broken Window Functions (3 bugs)
+Three layered bugs: both `SUM` and `RANK` window functions are missing `PARTITION BY`, they need different `ORDER BY` clauses, AND the data contains tied revenue values (West Q3=Q4=16000) that must be ranked correctly.
 
-**Schema:** `quarterly_sales(region, quarter, revenue)` — 8 rows across 2 regions
-**Goal:** Per-region running total (`ORDER BY quarter`) and within-region revenue rank (`ORDER BY revenue DESC`)
+**Schema:** `quarterly_sales(region, quarter, revenue)` — 8 rows across 2 regions with ties
+**Goal:** Per-region running total (`ORDER BY quarter`) and within-region revenue rank (`ORDER BY revenue DESC`) with correct tie handling
 
 > **Structural penalties** are enforced per task level/id to prevent gaming:
 > - `hard`: requires `WITH` clause (−0.30 if absent)
 > - `medium`: requires explicit `JOIN` (−0.20 if absent)
-> - `task_expert_recursive`: requires `WITH RECURSIVE` (−0.30 if absent)
-> - `task_expert_rank`: penalises `ROW_NUMBER()` (−0.20 — drops ties)
-> - `task_expert_window`: requires `PARTITION BY` in both window functions (−0.20 if absent)
+> - `task_expert_recursive`: requires `WITH RECURSIVE` (−0.30) + correct anchor via `manager_id` (−0.15)
+> - `task_expert_rank`: penalises `ROW_NUMBER()` (−0.20) + penalises `ASC` ordering without `DESC` (−0.15)
+> - `task_expert_window`: requires `PARTITION BY` in both window functions (−0.20 if absent, −0.10 if only one)
 
 ---
 
@@ -249,19 +249,19 @@ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
 
 ## Baseline Results
 
-The following scores were produced by running `claude-haiku-4-5` as the agent against all three tasks with the full AI judge active. These serve as the reproducible baseline for this environment.
+The following scores were produced by running `meta-llama/Llama-3.1-8B-Instruct` (via HuggingFace router) as the agent against all 6 tasks with the full AI judge active.
 
 | Task | Level | Steps Used | Best Score |
 |---|---|---|---|
 | Fix the Syntax Errors | easy | 1 | **1.000** |
 | Fix the Cartesian JOIN | medium | 1 | **0.900** |
-| Rewrite Correlated Subquery as CTE | hard | 1 | **0.950** |
-| **Average** | | | **0.950** |
+| Rewrite Correlated Subquery as CTE | hard | 1 | **0.900** |
+| Fix the Tie-Breaking Window Function | expert | 1 | **1.000** |
+| Traverse Org Chart with Recursive CTE | expert | 2 | **0.900** |
+| Fix Two Broken Window Functions | expert | 3 | **0.900** |
+| **Average** | | | **0.933** |
 
-All three tasks were solved (or near-solved) on the first step, demonstrating that:
-- The reward pipeline returns meaningful signal immediately
-- The environment terminates cleanly when the done threshold (≥ 0.90) is met
-- A stronger model or a harder task set would produce more training-relevant trajectories
+The easy–hard tasks and the rank/recursive expert tasks were solved in 1–2 steps. The dual-window expert task required 3 steps, demonstrating the feedback loop produces training-relevant multi-step trajectories for harder tasks.
 
 ---
 
@@ -348,13 +348,13 @@ queryforge/
 ├── playbook.py                    # Local test runner (no server required)
 ├── inference.py                   # Baseline inference script (any OpenAI-compatible LLM)
 ├── demo.py                        # Gradio interactive demo (mounted at /demo)
+├── Dockerfile                     # Container image
 ├── openenv.yaml                   # OpenEnv manifest
 ├── pyproject.toml                 # Project metadata and dependencies
 ├── uv.lock                        # Locked dependencies
 └── server/
     ├── app.py                     # FastAPI app — core + /tasks REST endpoints + Gradio mount
     ├── queryforge_environment.py  # Environment class (reset, step, state)
-    ├── Dockerfile                 # Container image
     └── requirements.txt           # Server dependencies
 ```
 
@@ -373,7 +373,7 @@ Add `ANTHROPIC_API_KEY` as a Space secret after deployment. Without it, the envi
 ### Docker
 
 ```bash
-docker build -t queryforge:latest -f server/Dockerfile .
+docker build -t queryforge:latest .
 docker run -p 8000:8000 -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY queryforge:latest
 ```
 
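The `ROW_NUMBER()`-drops-ties behaviour that the expert rank task (and its structural penalty) targets is easy to demonstrate. A minimal sketch using Python's stdlib `sqlite3`, whose window functions behave like DuckDB's for this query; the table contents are illustrative, not the task's actual dataset:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales_reps (id INTEGER, name TEXT, region TEXT, revenue REAL);
INSERT INTO sales_reps VALUES
  (1, 'Ann', 'East', 900), (2, 'Bob', 'East', 900), (3, 'Cal', 'East', 500),
  (4, 'Dee', 'West', 800), (5, 'Eli', 'West', 800), (6, 'Fay', 'West', 300);
""")

def top_reps(fn):
    # Keep the rows ranked first per region by the given window function.
    return con.execute(f"""
        SELECT name FROM (
            SELECT name,
                   {fn}() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
            FROM sales_reps
        ) ranked WHERE rn = 1 ORDER BY name
    """).fetchall()

print(len(top_reps("ROW_NUMBER")))  # 2 — one arbitrary rep per region, ties dropped
print(len(top_reps("RANK")))        # 4 — both tied reps in each region survive
```

Because `ROW_NUMBER()` always assigns unique sequential numbers, filtering on `rn = 1` can never return two tied rows, which is exactly why the judge penalises it for this task.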
demo.py CHANGED
@@ -115,7 +115,7 @@ Fix broken or slow SQL queries and get instant graded feedback.
     )
 )
 
-with gr.Blocks(title="QueryForge", theme=gr.themes.Soft()) as demo:
+with gr.Blocks(title="QueryForge") as demo:
 
     state = gr.State(None)
 
inference.py CHANGED
@@ -188,9 +188,8 @@ def run_task(task_id: str, llm: OpenAI, env_client) -> dict:
 
 def main() -> None:
     # ── Validate required config ──────────────────────────────────────────────
-    missing = [v for v in ("API_BASE_URL", "MODEL_NAME") if not os.getenv(v)]
-    if missing:
-        print(f"ERROR: missing required env vars: {', '.join(missing)}")
+    if not MODEL_NAME:
+        print("ERROR: MODEL_NAME env var is not set.")
         sys.exit(1)
 
     if not API_KEY:
judge.py CHANGED
@@ -16,7 +16,7 @@ Grading pipeline for each submitted SQL query:
     Partial credit for correct row count or partial row matches.
 
 Stage 4 — AI Quality (→ 1.0)
-    Anthropic claude-sonnet-4-6 evaluates optimization, code style, and
+    Anthropic claude-haiku-4-5 evaluates optimization, code style, and
     semantic correctness vs. the reference solution.
     The AI score can move the final score up to 1.0 when rows are correct,
     or provide nuanced feedback even when rows are partially wrong.
@@ -183,16 +183,24 @@ def rows_match(
 
     projected = [_project(row) for row in actual]
 
+    actual_norm = [_normalize(r) for r in projected]
+    expected_norm = [_normalize(r) for r in expected]
+
     if len(projected) != len(expected):
-        overlap_ratio = min(len(projected), len(expected)) / max(len(projected), len(expected))
-        score = 0.3 * overlap_ratio
+        # Count how many returned rows are actually in the expected set
+        expected_set = [tuple(sorted(r.items())) for r in expected_norm]
+        correct_rows = sum(1 for r in actual_norm if tuple(sorted(r.items())) in expected_set)
+        # Score based on fraction of expected rows correctly returned
+        coverage = correct_rows / len(expected)
+        # Base 0.10 for count mismatch, up to 0.45 for high coverage of correct rows
+        score = 0.10 + 0.35 * coverage
         return score, (
             f"Row count mismatch: got {len(projected)}, expected {len(expected)}. "
-            f"({overlap_ratio:.0%} overlap ratio)"
+            f"{correct_rows}/{len(expected)} expected rows present."
         )
 
-    actual_sorted = sorted([_normalize(r) for r in projected], key=lambda r: _sort_key(r, order_by))
-    expected_sorted = sorted([_normalize(r) for r in expected], key=lambda r: _sort_key(r, order_by))
+    actual_sorted = sorted(actual_norm, key=lambda r: _sort_key(r, order_by))
+    expected_sorted = sorted(expected_norm, key=lambda r: _sort_key(r, order_by))
 
     matches = sum(1 for a, e in zip(actual_sorted, expected_sorted) if a == e)
     row_accuracy = matches / len(expected)
@@ -289,7 +297,6 @@ Respond with ONLY valid JSON (no markdown fences):
             {"role": "assistant", "content": "{"},  # prefill forces JSON-only reply
         ],
     )
-    print("Anthropic judge response:", message.content)
     # Prepend the prefilled "{" back before parsing
     raw = "{" + message.content[0].text.strip()
 
@@ -381,15 +388,35 @@ def grade(
     elif task.level == "medium" and "JOIN " not in query_upper:
         structural_penalty = 0.20  # medium task demands explicit JOINs
         row_feedback += " (Penalty: no explicit JOIN — task requires JOIN … ON syntax.)"
-    elif task.id == "task_expert_recursive" and "RECURSIVE" not in query_upper:
-        structural_penalty = 0.30  # must use recursive CTE, not repeated JOINs
-        row_feedback += " (Penalty: WITH RECURSIVE required — plain JOIN only fetches one level.)"
-    elif task.id == "task_expert_rank" and "ROW_NUMBER" in query_upper:
-        structural_penalty = 0.20  # ROW_NUMBER breaks ties — must use RANK/DENSE_RANK
-        row_feedback += " (Penalty: ROW_NUMBER() drops tied rows — use RANK() or DENSE_RANK().)"
-    elif task.id == "task_expert_window" and "PARTITION BY" not in query_upper:
-        structural_penalty = 0.20  # both window functions need PARTITION BY region
-        row_feedback += " (Penalty: missing PARTITION BY — both SUM and RANK must be partitioned per region.)"
+    elif task.id == "task_expert_recursive":
+        # Two bugs: anchor uses WHERE id=3 (includes VP Eng) + non-recursive CTE (misses deep levels)
+        if "RECURSIVE" not in query_upper:
+            structural_penalty += 0.30
+            row_feedback += " (Penalty: WITH RECURSIVE required — hardcoded levels won't scale.)"
+        if "MANAGER_ID = 3" not in query_upper and "MANAGER_ID=3" not in query_upper:
+            structural_penalty += 0.15
+            row_feedback += " (Penalty: anchor should select subordinates via manager_id, not the VP themselves.)"
+        structural_penalty = min(structural_penalty, 0.40)
+    elif task.id == "task_expert_rank":
+        # Two bugs: ROW_NUMBER (drops ties) + ASC ordering (picks lowest instead of highest)
+        if "ROW_NUMBER" in query_upper:
+            structural_penalty += 0.20
+            row_feedback += " (Penalty: ROW_NUMBER() drops tied rows — use RANK() or DENSE_RANK().)"
+        if "ASC" in query_upper and "DESC" not in query_upper:
+            structural_penalty += 0.15
+            row_feedback += " (Penalty: ordering by revenue ASC picks lowest earners, not highest.)"
+        structural_penalty = min(structural_penalty, 0.35)
+    elif task.id == "task_expert_window":
+        # Three bugs: missing PARTITION BY on both windows + tied revenues need correct ranking
+        if "PARTITION BY" not in query_upper:
+            structural_penalty += 0.20
+            row_feedback += " (Penalty: missing PARTITION BY — both SUM and RANK must be partitioned per region.)"
+        # Count PARTITION BY occurrences — need at least 2 (one per window function)
+        partition_count = query_upper.count("PARTITION BY")
+        if 0 < partition_count < 2:
+            structural_penalty += 0.10
+            row_feedback += " (Penalty: only one window function has PARTITION BY — both need it.)"
+        structural_penalty = min(structural_penalty, 0.30)
 
     details["structural_penalty"] = structural_penalty
 
@@ -405,9 +432,14 @@ def grade(
     details["ai_hint"] = ai_hint
 
     # Final blending:
+    #   AI judge offline (fallback) → use deterministic score directly
     #   rows fully correct → trust AI score (can reach 1.0)
     #   rows partially wrong → clamp AI score to not exceed deterministic
-    if row_score >= 0.95:
+    ai_is_fallback = abs(ai_score - deterministic_score) < 0.001
+    if ai_is_fallback:
+        # AI judge was unavailable — use deterministic score as-is
+        final_score = deterministic_score
+    elif row_score >= 0.95:
         final_score = ai_score
     elif row_score >= 0.5:
         # Blend: AI provides nuance but can't exceed deterministic ceiling
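The blending rules in that last hunk can be read as a small pure function. The sketch below is illustrative, not the repo's actual API: the standalone function name is mine, and since the `row_score >= 0.5` branch body is cut off in the diff, the deterministic ceiling is modelled with `min()` as an assumption:

```python
def blend_final_score(deterministic_score: float, ai_score: float, row_score: float) -> float:
    """Illustrative standalone version of the judge's final-blending rules.

    Assumption: the `row_score >= 0.5` branch body is not visible in the
    diff, so the deterministic ceiling is modelled here with min().
    """
    # AI judge offline: the fallback echoes the deterministic score, which
    # is detected by near-equality and passed through unchanged.
    if abs(ai_score - deterministic_score) < 0.001:
        return deterministic_score
    # Rows fully correct: trust the AI quality score (can reach 1.0).
    if row_score >= 0.95:
        return ai_score
    # Rows partially right: AI adds nuance but is capped at the deterministic score.
    if row_score >= 0.5:
        return min(ai_score, deterministic_score)
    return deterministic_score

print(blend_final_score(0.80, 0.80, 1.0))  # fallback path: deterministic score kept
print(blend_final_score(0.80, 0.97, 1.0))  # rows correct: AI score trusted
```

The near-equality check is a pragmatic fallback detector: it also catches the rare case where a live AI judge happens to return exactly the deterministic score, which the diff's thresholds treat the same way.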
models.py CHANGED
@@ -24,7 +24,7 @@ class SQLObservation(Observation):
     # ── Task context ─────────────────────────────────────────────────────────
     task_id: str = Field(default="", description="Active task identifier")
     task_level: str = Field(
-        default="", description="Difficulty: easy | medium | hard"
+        default="", description="Difficulty: easy | medium | hard | expert"
     )
     task_title: str = Field(default="", description="Human-readable task title")
     task_description: str = Field(
openenv.yaml CHANGED
@@ -11,16 +11,20 @@ description: |
   An agent receives a broken or slow SQL query together with the schema and an
   error/performance warning. It must produce a working, optimised query.
 
-  Tasks (3 levels, cycled in order):
+  Tasks (6 tasks across 4 difficulty levels):
     easy   — fix three misspelled SQL keywords (SELECT / FROM / WHERE)
     medium — fix a missing JOIN condition that causes a cartesian product
     hard   — rewrite a correlated subquery (O(N²)) as a CTE (O(N))
+    expert — fix tie-breaking window function (2 bugs: ROW_NUMBER + ASC ordering)
+    expert — traverse org chart with recursive CTE (2 bugs: wrong anchor + hardcoded levels)
+    expert — fix broken window functions (3 bugs: missing PARTITION BY + tied revenues)
 
   Reward signal (0.0 – 1.0):
     0.00       syntax error
     0.15       syntax valid, runtime error
     0.30       executes, wrong / empty results
     0.30–0.80  partial row correctness (deterministic, DuckDB)
-    0.80–1.00  correct results + AI quality score (Anthropic claude-sonnet-4-6)
+    0.80–1.00  correct results + AI quality score (Anthropic claude-haiku-4-5)
 
-  Required env var: ANTHROPIC_API_KEY
+  Optional env var: ANTHROPIC_API_KEY (enables AI judge for scores up to 1.0;
+  without it, scoring is fully deterministic and capped at 0.80)
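The two window bugs named in the expert task list above (global running total, global rank) are easy to see side by side. A minimal sketch with made-up revenue figures, using stdlib `sqlite3` rather than the environment's DuckDB; both engines evaluate these window clauses the same way:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE quarterly_sales (region TEXT, quarter INTEGER, revenue REAL);
INSERT INTO quarterly_sales VALUES
  ('East', 1, 100), ('East', 2, 200),
  ('West', 1, 50),  ('West', 2, 75);
""")

# Missing PARTITION BY: one giant partition, total carries across regions.
broken = con.execute("""
    SELECT region, quarter,
           SUM(revenue) OVER (ORDER BY region, quarter) AS running_total
    FROM quarterly_sales ORDER BY region, quarter
""").fetchall()

# PARTITION BY region: the running total restarts for each region.
fixed = con.execute("""
    SELECT region, quarter,
           SUM(revenue) OVER (PARTITION BY region ORDER BY quarter) AS running_total
    FROM quarterly_sales ORDER BY region, quarter
""").fetchall()

print(broken[2])  # West Q1 continues from East's total (100 + 200 + 50 = 350)
print(fixed[2])   # West Q1 resets to its own revenue (50)
```

The same `PARTITION BY region` fix applies to the task's `RANK()` window, just with `ORDER BY revenue DESC` instead of `ORDER BY quarter`.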
playbook.py CHANGED
@@ -3,13 +3,14 @@ QueryForge Client Playbook
 ──────────────────────────
 Tests the environment through the HTTP server using the QueryforgeEnv client.
 
-Requires the server to be running first:
-    uvicorn server.app:app --host 0.0.0.0 --port 8000
-
-Then run:
+Usage:
+    # Against the live HF Space:
     python playbook.py
 
-If ANTHROPIC_API_KEY is set, Stage 4 AI scoring is live.
+    # Against a local server:
+    ENV_URL=http://localhost:8000 python playbook.py
+
+If ANTHROPIC_API_KEY is set on the server, Stage 4 AI scoring is live.
 If not set, the judge falls back to deterministic scoring (capped at 0.80).
 """
 
@@ -23,7 +24,7 @@ from client import QueryforgeEnv
 from models import SQLAction, TaskSpec
 from tasks import REGISTRY, task_from_dict
 
-BASE_URL = "https://prithvigg-queryforge.hf.space"
+BASE_URL = os.environ.get("ENV_URL", "https://prithvigg-queryforge.hf.space")
 
 # ── Formatting helpers ────────────────────────────────────────────────────────
 
@@ -239,10 +240,10 @@ if __name__ == "__main__":
     _hr("═")
 
     with QueryforgeEnv(base_url=BASE_URL).sync() as client:
-        # run_easy(client)
+        run_easy(client)
         run_medium(client)
         run_hard(client)
-        # run_custom(client)
+        run_custom(client)
 
     _section("DONE")
-    print("  All 4 tasks completed.\n")
+    print("  All tasks completed.\n")
pyproject.toml CHANGED
@@ -22,6 +22,10 @@ dependencies = [
     "duckdb>=0.10.0",
     # AI judge — quality scoring via Anthropic API
     "anthropic>=0.25.0",
+    # Interactive demo UI (mounted at /demo on the FastAPI server)
+    "gradio>=4.0.0",
+    # Inference script uses the OpenAI client
+    "openai>=1.0.0",
 ]
 
 [project.optional-dependencies]
server/app.py CHANGED
@@ -124,9 +124,4 @@ def main(host: str = "0.0.0.0", port: int = 8000):
 
 
 if __name__ == "__main__":
-    import argparse
-
-    parser = argparse.ArgumentParser()
-    parser.add_argument("--port", type=int, default=8000)
-    args = parser.parse_args()
-    main(port=args.port)
+    main()
server/requirements.txt CHANGED
@@ -3,6 +3,5 @@ fastapi>=0.115.0
 uvicorn>=0.24.0
 duckdb>=0.10.0
 anthropic>=0.25.0
-
-
+gradio>=4.0.0
tasks.py CHANGED
@@ -270,9 +270,8 @@ _TASK_EXPERT_RANK = SQLTask(
270
  level="expert",
271
  title="Fix the Tie-Breaking Window Function",
272
  description="""\
273
- TASK: The query below finds the top-earning sales rep per region, but it
274
- silently drops reps who are tied for first place. Fix it so ALL reps
275
- tied at rank 1 are returned.
276
 
277
  SCHEMA:
278
  sales_reps(id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL)
@@ -281,19 +280,18 @@ BROKEN QUERY:
281
  SELECT name, region, revenue
282
  FROM (
283
  SELECT name, region, revenue,
284
- ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
285
  FROM sales_reps
286
  ) ranked
287
  WHERE rn = 1
288
  ORDER BY region, name
289
 
290
  PROBLEM:
291
- ROW_NUMBER() assigns unique sequential numbers even for tied revenue values.
292
- When two reps share the top revenue in a region, ROW_NUMBER arbitrarily
293
- picks one and discards the other.
294
 
295
  GOAL: Return ALL reps whose revenue is the highest in their region.
296
- Use RANK() or DENSE_RANK() instead of ROW_NUMBER().
297
  Order by region ASC, name ASC.""",
298
  schema_ddl="""\
299
  CREATE TABLE sales_reps (id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL);
@@ -309,16 +307,16 @@ INSERT INTO sales_reps VALUES
309
  SELECT name, region, revenue
310
  FROM (
311
  SELECT name, region, revenue,
312
- ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue DESC) AS rn
313
  FROM sales_reps
314
  ) ranked
315
  WHERE rn = 1
316
  ORDER BY region, name""",
317
  error_message=(
318
- "Query runs but returns only 2 rows β€” one per region. "
319
- "Tied reps at the top are silently dropped by ROW_NUMBER()."
320
  ),
321
- hint="Replace ROW_NUMBER() with RANK() or DENSE_RANK(). Both include all tied rows.",
322
  test_cases=[
323
  TestCase(
324
  description="All reps tied at rank 1 per region",
@@ -350,21 +348,21 @@ _TASK_EXPERT_RECURSIVE = SQLTask(
350
  title="Traverse Org Chart with Recursive CTE",
351
  description="""\
352
  TASK: The query below attempts to find all subordinates of the VP of Engineering
353
- (id=3) using a two-level CTE expansion. It misses employees more than two levels
354
- deep. Rewrite it using a recursive CTE that traverses all levels.
355
 
356
  SCHEMA:
357
  employees(id INTEGER, name VARCHAR, manager_id INTEGER)
358
 
359
  DATA (partial):
360
- VP Eng (id=3) β†’ Lead A (id=5), Lead B (id=6)
361
- Lead A (id=5) β†’ Dev 1 (id=8), Dev 2 (id=9)
362
- Lead B (id=6) β†’ Dev 3 (id=10), Dev 4 (id=11)
363
- Dev 1 (id=8) β†’ Junior 1 (id=13), Junior 2 (id=14)
 
364
 
365
  BROKEN QUERY:
366
  WITH direct AS (
367
- SELECT id, name, manager_id FROM employees WHERE manager_id = 3
368
  ),
369
  level2 AS (
370
  SELECT e.id, e.name, e.manager_id
@@ -377,12 +375,13 @@ BROKEN QUERY:
377
  ORDER BY id
378
 
379
  PROBLEM:
380
- This hardcoded two-level expansion returns 6 rows but misses Junior 1 (id=13)
381
- and Junior 2 (id=14), who report to Dev 1 β€” three levels below VP Eng.
382
- Adding a level3 CTE would help for now but still break if the tree grows deeper.
383
 
384
- GOAL: Use WITH RECURSIVE to return ALL 8 subordinates of VP Eng (id=3)
385
- at any depth. Return id, name, manager_id columns, ordered by id ASC.""",
 
386
  schema_ddl="""\
387
  CREATE TABLE employees (id INTEGER, name VARCHAR, manager_id INTEGER);
388
  INSERT INTO employees VALUES
@@ -403,7 +402,7 @@ INSERT INTO employees VALUES
403
  """,
404
  broken_query="""\
405
  WITH direct AS (
406
- SELECT id, name, manager_id FROM employees WHERE manager_id = 3
407
  ),
408
  level2 AS (
409
  SELECT e.id, e.name, e.manager_id
@@ -415,11 +414,10 @@ UNION ALL
415
  SELECT id, name, manager_id FROM level2
416
  ORDER BY id""",
417
  error_message=(
418
- "Query returns only 6 rows β€” two levels under VP Eng. "
419
- "Junior 1 (id=13) and Junior 2 (id=14) who report to Dev 1 are missing. "
420
- "A hardcoded level3 CTE would fix this instance but not scale to deeper trees."
421
  ),
422
- hint="Use WITH RECURSIVE. Start from manager_id = 3, then JOIN employees to the CTE itself on manager_id = cte.id.",
423
  test_cases=[
424
  TestCase(
425
  description="All 8 subordinates of VP Eng at any depth",
@@ -456,34 +454,33 @@ ORDER BY id""",
456
  _TASK_EXPERT_WINDOW = SQLTask(
457
  id="task_expert_window",
458
  level="expert",
459
- title="Fix Two Broken Window Functions: Running Total and Revenue Rank",
460
  description="""\
461
- TASK: The query below computes a cumulative running total and a
462
- within-region revenue rank for each quarter, but BOTH window functions
463
- are broken β€” neither has a PARTITION BY, so they treat all rows as one
464
- giant partition instead of computing independently per region.
465
 
466
  SCHEMA:
467
  quarterly_sales(region VARCHAR, quarter INTEGER, revenue DECIMAL)
468
 
 
 
 
 
469
  BROKEN QUERY:
470
  SELECT region, quarter, revenue,
471
  SUM(revenue) OVER (ORDER BY region, quarter) AS running_total,
472
- RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
473
  FROM quarterly_sales
474
  ORDER BY region, quarter
475
 
476
  PROBLEM:
477
- - running_total accumulates across both regions: West's Q1 shows 65000
478
- (continuing from East's Q4) instead of resetting to 11000.
479
- - revenue_rank ranks revenue across ALL regions globally, so East Q4 (20000)
480
- and West Q3 (16000) compete directly instead of being ranked within their
481
- own region.
482
-
483
- GOAL: Fix BOTH window functions so they operate independently per region.
484
- - running_total must reset to 0 at the start of each region (ORDER BY quarter).
485
- - revenue_rank must rank revenue within each region (ORDER BY revenue DESC).
486
- Both OVER clauses need PARTITION BY region, but with different ORDER BY columns.
487
  Final output: ORDER BY region ASC, quarter ASC.""",
488
  schema_ddl="""\
489
  CREATE TABLE quarterly_sales (region VARCHAR, quarter INTEGER, revenue DECIMAL);
@@ -495,7 +492,7 @@ INSERT INTO quarterly_sales VALUES
495
  ('West', 1, 11000),
496
  ('West', 2, 14000),
497
  ('West', 3, 16000),
498
- ('West', 4, 13000);
499
  """,
500
  broken_query="""\
501
  SELECT region, quarter, revenue,
@@ -504,19 +501,14 @@ SELECT region, quarter, revenue,
504
  FROM quarterly_sales
505
  ORDER BY region, quarter""",
506
  error_message=(
507
- "Query runs but both window functions are wrong. "
508
- "West Q1 running_total shows 76000 (continuing from East) instead of 11000. "
509
- "revenue_rank is a global ranking across all 8 rows instead of per-region. "
510
- "Both SUM and RANK are missing PARTITION BY region."
511
- ),
512
- hint=(
513
- "Add PARTITION BY region to BOTH window functions, but with different ORDER BY: "
514
- "SUM(revenue) OVER (PARTITION BY region ORDER BY quarter) for running total, "
515
- "RANK() OVER (PARTITION BY region ORDER BY revenue DESC) for within-region rank."
516
  ),
 
517
  test_cases=[
518
  TestCase(
519
- description="Per-region running total and within-region revenue rank",
520
  expected_rows=[
521
  {"region": "East", "quarter": 1, "revenue": 15000.0, "running_total": 15000.0, "revenue_rank": 3},
522
  {"region": "East", "quarter": 2, "revenue": 18000.0, "running_total": 33000.0, "revenue_rank": 2},
@@ -525,7 +517,7 @@ ORDER BY region, quarter""",
525
  {"region": "West", "quarter": 1, "revenue": 11000.0, "running_total": 11000.0, "revenue_rank": 4},
526
  {"region": "West", "quarter": 2, "revenue": 14000.0, "running_total": 25000.0, "revenue_rank": 3},
527
  {"region": "West", "quarter": 3, "revenue": 16000.0, "running_total": 41000.0, "revenue_rank": 1},
528
- {"region": "West", "quarter": 4, "revenue": 13000.0, "running_total": 54000.0, "revenue_rank": 2},
529
  ],
530
  order_by="region,quarter",
531
  )
 
     level="expert",
     title="Fix the Tie-Breaking Window Function",
     description="""\
+ TASK: The query below attempts to find the top-earning sales rep per region,
+ but it returns wrong results. Debug it.

 SCHEMA:
 sales_reps(id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL)

 SELECT name, region, revenue
 FROM (
 SELECT name, region, revenue,
+ ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue ASC) AS rn
 FROM sales_reps
 ) ranked
 WHERE rn = 1
 ORDER BY region, name

 PROBLEM:
+ The query returns 2 rows but the expected answer has 4.
+ The output values are also wrong: it seems to pick the lowest revenue per region
+ instead of the highest.

 GOAL: Return ALL reps whose revenue is the highest in their region.
 Order by region ASC, name ASC.""",
     schema_ddl="""\
 CREATE TABLE sales_reps (id INTEGER, name VARCHAR, region VARCHAR, revenue DECIMAL);

 SELECT name, region, revenue
 FROM (
 SELECT name, region, revenue,
+ ROW_NUMBER() OVER (PARTITION BY region ORDER BY revenue ASC) AS rn
 FROM sales_reps
 ) ranked
 WHERE rn = 1
 ORDER BY region, name""",
     error_message=(
+ "Query runs but returns wrong results: only 2 rows (one per region) "
+ "with the LOWEST revenue instead of the HIGHEST. Expected 4 rows."
     ),
+ hint="There are two bugs. Think about both the ranking function and the sort order.",
     test_cases=[
         TestCase(
             description="All reps tied at rank 1 per region",
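The intended fix (RANK() with a descending sort, so tied top earners all survive) can be sanity-checked outside the harness. A minimal sketch using the stdlib sqlite3 module as a stand-in for the task's SQL engine (SQLite supports window functions since 3.25); the seed rows and names below are illustrative, not the task's actual data:

```python
import sqlite3

# Illustrative data only: two regions, each with a revenue tie at the top.
# (The task's real seed rows live in tasks.py; these names are made up.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales_reps (id INTEGER, name TEXT, region TEXT, revenue REAL);
INSERT INTO sales_reps VALUES
    (1, 'Alice', 'East', 100.0),
    (2, 'Bob',   'East', 100.0),
    (3, 'Carol', 'East',  80.0),
    (4, 'Dan',   'West',  90.0),
    (5, 'Erin',  'West',  90.0),
    (6, 'Frank', 'West',  70.0);
""")

# Corrected query: RANK() assigns tied rows the same rank, and
# ORDER BY revenue DESC puts the highest earners at rank 1.
rows = conn.execute("""
    SELECT name, region, revenue
    FROM (
        SELECT name, region, revenue,
               RANK() OVER (PARTITION BY region ORDER BY revenue DESC) AS rnk
        FROM sales_reps
    ) ranked
    WHERE rnk = 1
    ORDER BY region, name
""").fetchall()

print(rows)  # 4 rows: both tied top earners per region
```

With ROW_NUMBER() the same data would return only 2 rows, one arbitrary rep per region; RANK() keeps every tied row at rank 1.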
 
     title="Traverse Org Chart with Recursive CTE",
     description="""\
 TASK: The query below attempts to find all subordinates of the VP of Engineering
+ (id=3), but it returns wrong results. Debug and fix it.

 SCHEMA:
 employees(id INTEGER, name VARCHAR, manager_id INTEGER)

 DATA (partial):
+ CEO (id=1)
+ VP Eng (id=3, reports to CEO)
+ Lead A (id=5), Lead B (id=6) report to VP Eng
+ Dev 1..4 (id=8..11) report to Leads
+ Junior 1..2 (id=13..14) report to Dev 1

 BROKEN QUERY:
 WITH direct AS (
+ SELECT id, name, manager_id FROM employees WHERE id = 3
 ),
 level2 AS (
 SELECT e.id, e.name, e.manager_id

 ORDER BY id

 PROBLEM:
+ The query returns some results but the row count and values don't match
+ the expected output. Inspect what the anchor condition selects and whether
+ the query reaches all depths of the org tree.

+ GOAL: Return ALL 8 subordinates of VP Eng (id=3) at any depth.
+ Do NOT include VP Eng himself, only his reports.
+ Return id, name, manager_id columns, ordered by id ASC.""",
     schema_ddl="""\
 CREATE TABLE employees (id INTEGER, name VARCHAR, manager_id INTEGER);
 INSERT INTO employees VALUES

 """,
     broken_query="""\
 WITH direct AS (
+ SELECT id, name, manager_id FROM employees WHERE id = 3
 ),
 level2 AS (
 SELECT e.id, e.name, e.manager_id

 SELECT id, name, manager_id FROM level2
 ORDER BY id""",
     error_message=(
+ "Query returns wrong results. Check carefully: does the anchor condition "
+ "select the right starting rows? Does the query traverse all depths?"
     ),
+ hint="There are multiple issues. Think about what the anchor selects and how deep the query reaches.",
     test_cases=[
         TestCase(
             description="All 8 subordinates of VP Eng at any depth",
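The fixed traversal the task expects (anchor on direct reports, then WITH RECURSIVE down the whole tree) can be sketched against the org chart in the description. The Dev-to-Lead assignment is not fully specified there, so the split below (Devs 8/9 under Lead A, 10/11 under Lead B) is an assumption; sqlite3 stands in for the task's SQL engine:

```python
import sqlite3

# Org tree from the task description; which Dev reports to which Lead is
# not spelled out, so the 8->5, 9->5, 10->6, 11->6 mapping is assumed.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (id INTEGER, name TEXT, manager_id INTEGER);
INSERT INTO employees VALUES
    (1,  'CEO',      NULL),
    (3,  'VP Eng',   1),
    (5,  'Lead A',   3),
    (6,  'Lead B',   3),
    (8,  'Dev 1',    5),
    (9,  'Dev 2',    5),
    (10, 'Dev 3',    6),
    (11, 'Dev 4',    6),
    (13, 'Junior 1', 8),
    (14, 'Junior 2', 8);
""")

# Fixed query: the anchor selects VP Eng's DIRECT reports (manager_id = 3),
# excluding VP Eng himself, and the UNION ALL step recurses to every depth.
rows = conn.execute("""
    WITH RECURSIVE subs AS (
        SELECT id, name, manager_id FROM employees WHERE manager_id = 3
        UNION ALL
        SELECT e.id, e.name, e.manager_id
        FROM employees e
        JOIN subs s ON e.manager_id = s.id
    )
    SELECT id, name, manager_id FROM subs ORDER BY id
""").fetchall()

print([r[0] for r in rows])  # [5, 6, 8, 9, 10, 11, 13, 14]
```

The broken version anchors on `id = 3` (returning VP Eng himself) and only hand-expands one extra level; the recursive form reaches the Juniors three levels down.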
 
 _TASK_EXPERT_WINDOW = SQLTask(
     id="task_expert_window",
     level="expert",
+ title="Fix Broken Window Functions: Running Total and Revenue Rank",
     description="""\
+ TASK: The query below computes a cumulative running total and a within-region
+ revenue rank for each quarter, but the results are wrong. Debug and fix it.

 SCHEMA:
 quarterly_sales(region VARCHAR, quarter INTEGER, revenue DECIMAL)

+ DATA:
+ East: Q1=15000, Q2=18000, Q3=12000, Q4=20000
+ West: Q1=11000, Q2=14000, Q3=16000, Q4=16000 (note: Q3 and Q4 are tied)
+
 BROKEN QUERY:
 SELECT region, quarter, revenue,
 SUM(revenue) OVER (ORDER BY region, quarter) AS running_total,
+ RANK() OVER (ORDER BY revenue DESC) AS revenue_rank
 FROM quarterly_sales
 ORDER BY region, quarter

 PROBLEM:
+ The query returns wrong values for both running_total and revenue_rank.
+ Compare your output against the expected results carefully.
+
+ GOAL: running_total should be a cumulative sum per region (reset each region,
+ ordered by quarter). revenue_rank should rank revenue within each region
+ (ordered by revenue DESC), handling ties correctly (tied values must get
+ the same rank).
 Final output: ORDER BY region ASC, quarter ASC.""",
     schema_ddl="""\
 CREATE TABLE quarterly_sales (region VARCHAR, quarter INTEGER, revenue DECIMAL);

 ('West', 1, 11000),
 ('West', 2, 14000),
 ('West', 3, 16000),
+ ('West', 4, 16000);
 """,
     broken_query="""\
 SELECT region, quarter, revenue,

 FROM quarterly_sales
 ORDER BY region, quarter""",
     error_message=(
+ "Query runs but both computed columns are wrong. "
+ "running_total does not reset per region. "
+ "revenue_rank is a global ranking across all rows instead of per-region."
     ),
+ hint="Multiple issues exist. Think about partitioning and how tied values should be ranked.",
     test_cases=[
         TestCase(
+            description="Per-region running total and within-region revenue rank with ties",
             expected_rows=[
                 {"region": "East", "quarter": 1, "revenue": 15000.0, "running_total": 15000.0, "revenue_rank": 3},
                 {"region": "East", "quarter": 2, "revenue": 18000.0, "running_total": 33000.0, "revenue_rank": 2},

                 {"region": "West", "quarter": 1, "revenue": 11000.0, "running_total": 11000.0, "revenue_rank": 4},
                 {"region": "West", "quarter": 2, "revenue": 14000.0, "running_total": 25000.0, "revenue_rank": 3},
                 {"region": "West", "quarter": 3, "revenue": 16000.0, "running_total": 41000.0, "revenue_rank": 1},
+                {"region": "West", "quarter": 4, "revenue": 16000.0, "running_total": 57000.0, "revenue_rank": 1},
             ],
             order_by="region,quarter",
         )
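Since this task's data is fully spelled out in the diff, the intended fix (PARTITION BY region on both window functions, each with its own ORDER BY) can be checked end to end against the expected rows. A sketch using sqlite3 as a stand-in for the task's SQL engine:

```python
import sqlite3

# Exact data from the task: 8 rows, 2 regions, West Q3/Q4 tied at 16000.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE quarterly_sales (region TEXT, quarter INTEGER, revenue REAL);
INSERT INTO quarterly_sales VALUES
    ('East', 1, 15000), ('East', 2, 18000), ('East', 3, 12000), ('East', 4, 20000),
    ('West', 1, 11000), ('West', 2, 14000), ('West', 3, 16000), ('West', 4, 16000);
""")

# Fixed query: both window functions are partitioned per region, each with
# its own ORDER BY; RANK() gives the tied West Q3/Q4 rows the same rank (1).
rows = conn.execute("""
    SELECT region, quarter, revenue,
           SUM(revenue) OVER (PARTITION BY region ORDER BY quarter)      AS running_total,
           RANK()       OVER (PARTITION BY region ORDER BY revenue DESC) AS revenue_rank
    FROM quarterly_sales
    ORDER BY region, quarter
""").fetchall()

for r in rows:
    print(r)  # West Q4 row ends as running_total=57000.0, revenue_rank=1
```

This reproduces the updated `expected_rows`, including the tie handling the new West Q4 row tests: ranks 1, 1, 3, 4 for West rather than 1, 2, 3, 4.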