hellinferno committed on
Commit
b83c8ad
·
1 Parent(s): 11aa990

improve: 20 tasks, richer keywords, enhanced reward/grader, bigram matching, compelling README

README.md CHANGED
@@ -15,7 +15,26 @@ An OpenEnv environment where an AI agent reviews SQL queries for correctness, pe
15
 
16
  ## Why This Matters
17
 
18
- SQL bugs are among the most common and costly defects in production systems. A misplaced keyword breaks an API, a missing index degrades latency by 100x, and an unsanitized input opens a door to data exfiltration. Today these defects are caught by human reviewers who spend hours on repetitive pattern matching. This environment provides a standardized benchmark to train and evaluate AI agents that can automate this critical workflow, directly useful for developer tools, IDE integrations, and automated code review systems.
19
 
20
  ## What The Environment Does
21
 
@@ -38,23 +57,36 @@ The agent responds step by step with one of four actions:
38
 
39
  Rewards are deterministic and shaped for partial progress throughout the trajectory:
40
 
41
- - **Correct issue identification**: +0.10 to +0.35 scaled by issue severity
42
  - **Valid fix suggestion**: +0.08 to +0.10 bonus
43
  - **Confidence bonus**: up to +0.05 for high-confidence correct identifications
 
44
 - **False positive**: -0.10 penalty
45
 - **Duplicate identification**: -0.02 penalty
46
 - **Approving with missed issues**: -0.15 per missed issue
47
  - **Complete correct approval**: +0.20
48
 
49
  ## Task Bank
50
 
51
- The environment ships with **15 tasks** across three difficulty levels:
52
 
53
- | Difficulty | Count | Examples | Expected Baseline Score |
54
  |---|---|---|---|
55
- | Easy | 5 | Misspelled keywords, missing FROM, = NULL vs IS NULL | ~0.75–0.90 |
56
- | Medium | 5 | SELECT *, missing indexes, correlated subqueries, unbounded queries | ~0.40–0.60 |
57
- | Hard | 5 | SQL injection, privilege escalation, PII leakage, self-join optimization | ~0.20–0.40 |
 
 
 
58
 
59
  Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`
60
 
@@ -88,10 +120,10 @@ Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks
88
 ├── sql_query_reviewer/ ← typed models and client package
89
 ├── server/ ← FastAPI environment server
90
 │ ├── environment.py ← reset(), step(), state()
91
- │ ├── grader.py ← deterministic scoring
92
- │ ├── reward.py ← per-step reward computation
93
 │ └── app.py ← HTTP routes
94
- ├── tasks/ ← 15 SQL query tasks (JSON)
95
 └── tests/ ← pytest suite
96
  ```
97
 
@@ -128,7 +160,7 @@ export HF_TOKEN=hf_xxx
128
  python inference.py
129
  ```
130
 
131
- The script emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
132
 
133
  ## Hugging Face Spaces
134
 
 
15
 
16
  ## Why This Matters
17
 
18
+ SQL bugs are among the most common and costly defects in production systems. A misplaced
19
+ keyword breaks an API. A missing WHERE clause on a DELETE wipes a table. An unparameterized
20
+ input opens a path to data exfiltration. A function call on an indexed column turns a
21
+ 10ms query into a 30-second full table scan.
22
+
23
+ Today, these defects are caught by human reviewers who spend hours on repetitive pattern
24
+ matching during code reviews, migration audits, and ETL pipeline checks. This creates a
25
+ bottleneck: senior engineers are pulled from feature work to review SQL, and critical
26
+ issues still slip through.
27
+
28
+ This environment provides a standardized benchmark to train and evaluate AI agents on
29
+ exactly this task. Unlike toy benchmarks, every query reflects real patterns found in
30
+ production codebases: from typos that break APIs, to injection vectors that expose user
31
+ data, to race conditions that enable double-spending. The agent must identify issues,
32
+ suggest fixes, and know when to approve, just like a human code reviewer.
33
+
34
+ The environment provides rich per-step reward signals with severity-weighted partial
35
+ credit, making it directly suitable for GRPO and PPO training loops. The task bank spans
36
+ three difficulty levels with meaningful score variance, ensuring the benchmark
37
+ discriminates between agent capabilities.
38
 
39
  ## What The Environment Does
40
 
 
57
 
58
  Rewards are deterministic and shaped for partial progress throughout the trajectory:
59
 
60
+ - **Correct issue identification**: +0.10 to +0.45 scaled by issue severity, confidence, and discovery order
61
  - **Valid fix suggestion**: +0.08 to +0.10 bonus
62
  - **Confidence bonus**: up to +0.05 for high-confidence correct identifications
63
+ - **Discovery order bonus**: +0.04 for first issue found, diminishing for subsequent finds
64
 - **False positive**: -0.10 penalty
65
 - **Duplicate identification**: -0.02 penalty
66
 - **Approving with missed issues**: -0.15 per missed issue
67
 - **Complete correct approval**: +0.20
68
+ - **Request context when schema available**: -0.03 penalty (encourages using the provided schema)
69
+
70
+ ### Reward Properties for RL Training
71
+
72
+ - **Dense**: Every step returns a non-zero signal, enabling credit assignment
73
+ - **Bounded**: Per-step rewards in [-1.0, +0.45], episode scores in (0, 1)
74
+ - **Shaped**: Partial credit for partial coverage; no cliff between "found 2 of 3" and "found 3 of 3"
75
+ - **Deterministic**: Same actions always produce the same rewards (no randomness in grading)
76
+ - **Discriminative**: Hard tasks require multi-step reasoning; easy tasks reward quick identification
77
 
78
  ## Task Bank
79
 
80
+ The environment ships with **20 tasks** across three difficulty levels:
81
 
82
+ | Difficulty | Count | Examples | Score Range |
83
  |---|---|---|---|
84
+ | Easy | 7 | Misspelled keywords, missing FROM, = NULL vs IS NULL, DELETE without WHERE, self-comparison | ~0.60–0.90 |
85
+ | Medium | 7 | SELECT *, missing LIMIT, correlated subqueries, function on indexed column, ORDER BY RAND() | ~0.30–0.65 |
86
+ | Hard | 6 | SQL injection, privilege escalation, PII leakage, self-join optimization, race conditions | ~0.15–0.45 |
87
+
88
+ Each ground truth issue includes 8-12 keywords and synonyms for robust fuzzy matching, plus
89
+ bigram matching to catch common two-word phrases LLMs use (e.g., "sql injection", "missing where").
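The bigram idea can be sketched standalone. This is an illustrative sketch; the tokenizer regex and function names here are assumptions, not the environment's exact implementation:

```python
import re

# Illustrative tokenizer; the environment's actual TOKEN_RE may differ
TOKEN_RE = re.compile(r"[a-z0-9_*=<>]+")

def make_bigrams(text: str) -> set[tuple[str, str]]:
    # Adjacent lowercased word pairs, e.g. "SQL injection risk" ->
    # {("sql", "injection"), ("injection", "risk")}
    words = TOKEN_RE.findall(text.lower())
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

def bigram_score(candidate: str, keywords: list[str]) -> float:
    # Fraction of multi-word keywords whose leading bigram appears in the candidate
    keyword_bigrams = {tuple(kw.lower().split()[:2])
                       for kw in keywords if len(kw.split()) >= 2}
    if not keyword_bigrams:
        return 0.0
    return len(make_bigrams(candidate) & keyword_bigrams) / len(keyword_bigrams)
```

A candidate like "possible sql injection via string concatenation" matches one of the two multi-word keywords in `["sql injection", "user input", "parameterized"]`, scoring 0.5.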
90
 
91
  Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`
92
 
 
120
 ├── sql_query_reviewer/ ← typed models and client package
121
 ├── server/ ← FastAPI environment server
122
 │ ├── environment.py ← reset(), step(), state()
123
+ │ ├── grader.py ← deterministic scoring with bigram matching
124
+ │ ├── reward.py ← per-step reward with order bonus
125
 │ └── app.py ← HTTP routes
126
+ ├── tasks/ ← 20 SQL query tasks (JSON)
127
  └── tests/ ← pytest suite
128
  ```
129
 
 
160
  python inference.py
161
  ```
162
 
163
+ The script runs all 20 tasks and emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
164
 
165
  ## Hugging Face Spaces
166
 
inference.py CHANGED
@@ -311,16 +311,26 @@ async def async_main() -> int:
311
  # Build LLM client (even without key, don't crash: emit logs and exit)
312
  if not API_KEY:
313
  print("[DEBUG] WARNING: No API key found (HF_TOKEN / API_KEY / OPENAI_API_KEY)", flush=True)
314
- for tid in ["easy_001", "medium_001", "hard_001"]:
315
  log_start(task=tid, env=BENCHMARK, model=MODEL_NAME)
316
  log_end(success=False, steps=0, score=0.01, rewards=[])
317
  return 1
318
 
319
  llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
320
 
321
  task_ids = tuple(
322
  tid.strip()
323
- for tid in os.getenv("TASK_IDS", "easy_001,medium_001,hard_001").split(",")
324
  if tid.strip()
325
  )
326
 
 
311
  # Build LLM client (even without key, don't crash: emit logs and exit)
312
  if not API_KEY:
313
  print("[DEBUG] WARNING: No API key found (HF_TOKEN / API_KEY / OPENAI_API_KEY)", flush=True)
314
+ _fallback_ids = [
315
+ "easy_001", "easy_002", "easy_003", "easy_004", "easy_005", "easy_006", "easy_007",
316
+ "medium_001", "medium_002", "medium_003", "medium_004", "medium_005", "medium_006", "medium_007",
317
+ "hard_001", "hard_002", "hard_003", "hard_004", "hard_005", "hard_006",
318
+ ]
319
+ for tid in _fallback_ids:
320
  log_start(task=tid, env=BENCHMARK, model=MODEL_NAME)
321
  log_end(success=False, steps=0, score=0.01, rewards=[])
322
  return 1
323
 
324
  llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)
325
 
326
+ _default_ids = ",".join([
327
+ "easy_001", "easy_002", "easy_003", "easy_004", "easy_005", "easy_006", "easy_007",
328
+ "medium_001", "medium_002", "medium_003", "medium_004", "medium_005", "medium_006", "medium_007",
329
+ "hard_001", "hard_002", "hard_003", "hard_004", "hard_005", "hard_006",
330
+ ])
331
  task_ids = tuple(
332
  tid.strip()
333
+ for tid in os.getenv("TASK_IDS", _default_ids).split(",")
334
  if tid.strip()
335
  )
336
 
openenv.yaml CHANGED
@@ -1,7 +1,7 @@
1
  name: sql-query-reviewer
2
  description: "AI agent reviews SQL queries for correctness, performance, and security."
3
  author: Hellinferno
4
- version: "0.1.0"
5
  tags:
6
  - openenv
7
  - sql
@@ -28,30 +28,46 @@ tasks:
28
  name: Unknown Column Name
29
  difficulty: easy
30
  description: "Detect column name typo (statuz vs status)."
31
  - id: medium_001
32
- name: Performance Anti-Pattern Review
33
  difficulty: medium
34
- description: "Identify schema-aware performance problems like SELECT *, missing indexes, correlated subqueries."
35
  - id: medium_002
36
- name: Unbounded Query Detection
37
  difficulty: medium
38
- description: "Find queries missing LIMIT on large tables."
39
  - id: medium_003
40
- name: Redundant Operations
41
  difficulty: medium
42
  description: "Detect unnecessary DISTINCT on unique columns."
43
  - id: medium_004
44
- name: Correlated Subquery Optimization
45
  difficulty: medium
46
- description: "Find correlated subqueries that could be JOINs."
47
  - id: medium_005
48
- name: Join Performance Issues
 
 
 
 
49
  difficulty: medium
50
- description: "Identify missing index hints and inefficient joins."
 
 
 
 
51
  - id: hard_001
52
  name: SQL Injection Detection
53
  difficulty: hard
54
- description: "Find string concatenation enabling SQL injection vectors."
55
  - id: hard_002
56
  name: Privilege Escalation via UNION
57
  difficulty: hard
@@ -67,4 +83,8 @@ tasks:
67
  - id: hard_005
68
  name: Transaction Isolation Issues
69
  difficulty: hard
70
- description: "Find missing transaction isolation causing phantom reads."
1
  name: sql-query-reviewer
2
  description: "AI agent reviews SQL queries for correctness, performance, and security."
3
  author: Hellinferno
4
+ version: "0.2.0"
5
  tags:
6
  - openenv
7
  - sql
 
28
  name: Unknown Column Name
29
  difficulty: easy
30
  description: "Detect column name typo (statuz vs status)."
31
+ - id: easy_006
32
+ name: DELETE Without WHERE
33
+ difficulty: easy
34
+ description: "Detect dangerous unconditional DELETE statement."
35
+ - id: easy_007
36
+ name: Column Self-Comparison
37
+ difficulty: easy
38
+ description: "Detect column compared to itself instead of a value."
39
  - id: medium_001
40
+ name: Wide Table SELECT Star
41
  difficulty: medium
42
+ description: "Identify schema-aware performance problems like SELECT * on wide JSON tables."
43
  - id: medium_002
44
+ name: Correlated Subquery
45
  difficulty: medium
46
+ description: "Find correlated subqueries that could be rewritten as JOINs."
47
  - id: medium_003
48
+ name: Redundant DISTINCT
49
  difficulty: medium
50
  description: "Detect unnecessary DISTINCT on unique columns."
51
  - id: medium_004
52
+ name: Function on Indexed Column
53
  difficulty: medium
54
+ description: "Detect DATE() function preventing index usage."
55
  - id: medium_005
56
+ name: Leading Wildcard Search
57
+ difficulty: medium
58
+ description: "Identify LOWER() and leading wildcard preventing index usage."
59
+ - id: medium_006
60
+ name: DATE Function Index Bypass
61
  difficulty: medium
62
+ description: "Detect DATE() function on indexed column preventing efficient lookups."
63
+ - id: medium_007
64
+ name: ORDER BY RAND Performance
65
+ difficulty: medium
66
+ description: "Detect expensive random ordering on large tables."
67
  - id: hard_001
68
  name: SQL Injection Detection
69
  difficulty: hard
70
+ description: "Find string interpolation enabling SQL injection vectors."
71
  - id: hard_002
72
  name: Privilege Escalation via UNION
73
  difficulty: hard
 
83
  - id: hard_005
84
  name: Transaction Isolation Issues
85
  difficulty: hard
86
+ description: "Find missing transaction isolation causing partial failure corruption."
87
+ - id: hard_006
88
+ name: Race Condition in Balance Update
89
+ difficulty: hard
90
+ description: "Detect TOCTOU race condition allowing double-spending."
server/environment.py CHANGED
@@ -73,7 +73,7 @@ class SQLReviewEnvironment:
73
  description=matched_issue.description,
74
  )
75
  )
76
- reward = compute_reward(action, matched_issue, fix_valid=fix_valid)
77
  remaining = len(task.ground_truth_issues) - len(state.issues_identified)
78
  feedback = f"Matched {matched_issue.category} issue '{matched_issue.id}'. {remaining} issue(s) remaining."
79
  info = {
@@ -114,6 +114,7 @@ class SQLReviewEnvironment:
114
 
115
  else:
116
  feedback = self._schema_feedback(task)
 
117
  info = {"context_shared": bool(task.schema_info)}
118
 
119
  state.total_reward += reward
 
73
  description=matched_issue.description,
74
  )
75
  )
76
+ reward = compute_reward(action, matched_issue, fix_valid=fix_valid, issues_found_count=len(state.issues_identified) - 1, schema_available=bool(task.schema_info))  # count excludes the current find, so the first find earns the full order bonus
77
  remaining = len(task.ground_truth_issues) - len(state.issues_identified)
78
  feedback = f"Matched {matched_issue.category} issue '{matched_issue.id}'. {remaining} issue(s) remaining."
79
  info = {
 
114
 
115
  else:
116
  feedback = self._schema_feedback(task)
117
+ reward = compute_reward(action, None, schema_available=bool(task.schema_info))
118
  info = {"context_shared": bool(task.schema_info)}
119
 
120
  state.total_reward += reward
server/grader.py CHANGED
@@ -25,14 +25,37 @@ def _set_overlap(candidate: set[str], target: set[str]) -> float:
25
  return len(candidate & target) / max(len(target), 1)
26
 
27
 
28
- def score_issue_match(description: str, category: IssueCategory | None, issue: GroundTruthIssue) -> float:
 
 
 
 
 
 
 
29
  candidate_tokens = tokenize(description)
30
  keyword_tokens = set(issue.keywords)
31
  description_tokens = tokenize(issue.description)
 
 
32
  keyword_score = _set_overlap(candidate_tokens, keyword_tokens)
33
  description_score = _set_overlap(candidate_tokens, description_tokens)
 
 
 
 
 
 
 
 
 
 
 
 
 
34
  category_bonus = 0.2 if category == issue.category else 0.0
35
- score = (keyword_score * 0.6) + (description_score * 0.25) + category_bonus
 
36
  return clamp(score, 0.0, 1.0)
37
 
38
 
@@ -88,4 +111,3 @@ def grade_episode(
88
  false_positive_penalty = 0.05 * false_positive_count
89
  final_score = coverage_score + efficiency_bonus - false_positive_penalty
90
  return clamp(final_score, 0.01, 0.99)
91
-
 
25
  return len(candidate & target) / max(len(target), 1)
26
 
27
 
28
+ def _make_bigrams(text: str) -> set[tuple[str, str]]:
29
+ words = TOKEN_RE.findall(text.lower())
30
+ return {(words[i], words[i + 1]) for i in range(len(words) - 1)}
31
+
32
+
33
+ def score_issue_match(
34
+ description: str, category: IssueCategory | None, issue: GroundTruthIssue
35
+ ) -> float:
36
  candidate_tokens = tokenize(description)
37
  keyword_tokens = set(issue.keywords)
38
  description_tokens = tokenize(issue.description)
39
+
40
+ # Unigram overlap
41
  keyword_score = _set_overlap(candidate_tokens, keyword_tokens)
42
  description_score = _set_overlap(candidate_tokens, description_tokens)
43
+
44
+ # Bigram overlap: catches two-word phrases like "sql injection", "missing where"
45
+ candidate_bigrams = _make_bigrams(description)
46
+ keyword_bigrams: set[tuple[str, str]] = set()
47
+ for kw in issue.keywords:
48
+ words = kw.lower().split()
49
+ if len(words) >= 2:
50
+ keyword_bigrams.add(tuple(words[:2]))
51
+ bigram_score = 0.0
52
+ if keyword_bigrams:
53
+ bigram_hits = len(candidate_bigrams & keyword_bigrams)
54
+ bigram_score = bigram_hits / max(len(keyword_bigrams), 1)
55
+
56
  category_bonus = 0.2 if category == issue.category else 0.0
57
+
58
+ score = (keyword_score * 0.5) + (description_score * 0.15) + (bigram_score * 0.15) + category_bonus
59
  return clamp(score, 0.0, 1.0)
60
 
61
 
 
111
  false_positive_penalty = 0.05 * false_positive_count
112
  final_score = coverage_score + efficiency_bonus - false_positive_penalty
113
  return clamp(final_score, 0.01, 0.99)
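As a sanity check on the new weights in `score_issue_match` (0.5 + 0.15 + 0.15 + 0.2 = 1.0), a standalone sketch with illustrative names, not the module's API:

```python
def sketch_match_score(keyword_score: float, description_score: float,
                       bigram_score: float, category_match: bool) -> float:
    # Mirrors the weighted mix above; the four channels sum to at most 1.0,
    # so the final clamp only guards against rounding.
    category_bonus = 0.2 if category_match else 0.0
    score = (keyword_score * 0.5) + (description_score * 0.15) \
            + (bigram_score * 0.15) + category_bonus
    return max(0.0, min(1.0, score))
```

A perfect match on every channel lands exactly at 1.0; keyword overlap alone, with a matching category, tops out at 0.7.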
 
server/reward.py CHANGED
@@ -11,16 +11,30 @@ def compute_reward(
11
  duplicate_issue: bool = False,
12
  remaining_unfound: int = 0,
13
  has_previous_issue: bool = False,
 
 
14
  ) -> float:
15
  if action.action_type == "identify_issue":
16
  if duplicate_issue:
17
  return -0.02
 
18
  if matched_issue is None:
19
  return -0.1
 
 
20
  base_reward = min(matched_issue.severity, 0.35)
 
 
21
  fix_bonus = 0.08 if fix_valid else 0.0
22
- confidence_bonus = min(0.05, action.confidence * 0.05)
23
- return min(base_reward + fix_bonus + confidence_bonus, 0.4)
 
 
 
 
 
 
 
24
 
25
  if action.action_type == "suggest_fix":
26
  if not has_previous_issue:
@@ -32,5 +46,10 @@ def compute_reward(
32
  return 0.2
33
  return max(-1.0, -0.15 * remaining_unfound)
34
 
35
- return 0.0
 
 
 
 
36
 
 
 
11
  duplicate_issue: bool = False,
12
  remaining_unfound: int = 0,
13
  has_previous_issue: bool = False,
14
+ issues_found_count: int = 0,
15
+ schema_available: bool = False,
16
  ) -> float:
17
  if action.action_type == "identify_issue":
18
  if duplicate_issue:
19
  return -0.02
20
+
21
  if matched_issue is None:
22
  return -0.1
23
+
24
+ # Base reward scaled by severity
25
  base_reward = min(matched_issue.severity, 0.35)
26
+
27
+ # Fix bonus
28
  fix_bonus = 0.08 if fix_valid else 0.0
29
+
30
+ # Confidence bonus: higher reward for confident correct identifications
31
+ confidence_bonus = min(0.05, action.confidence * matched_issue.severity * 0.08)
32
+
33
+ # Discovery order bonus: finding the first issue is worth slightly more
34
+ # This encourages the agent to start identifying issues quickly
35
+ order_bonus = 0.04 * (1.0 / (issues_found_count + 1))
36
+
37
+ return min(base_reward + fix_bonus + confidence_bonus + order_bonus, 0.45)
38
 
39
  if action.action_type == "suggest_fix":
40
  if not has_previous_issue:
 
46
  return 0.2
47
  return max(-1.0, -0.15 * remaining_unfound)
48
 
49
+ if action.action_type == "request_more_context":
50
+ # Mild penalty for asking when schema is already provided
51
+ if schema_available:
52
+ return -0.03
53
+ return 0.0
54
 
55
+ return 0.0
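The discovery-order bonus in `compute_reward` decays harmonically with the number of issues already found. A quick standalone check (illustrative):

```python
def order_bonus(issues_found_count: int) -> float:
    # Harmonic decay: 0.04 when no issues have been found yet,
    # 0.02 after one, ~0.0133 after two, approaching zero.
    return 0.04 * (1.0 / (issues_found_count + 1))

series = [order_bonus(n) for n in range(4)]
```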
tasks/easy_tasks.json CHANGED
@@ -18,7 +18,11 @@
18
  "description": "SELCT should be SELECT.",
19
  "severity": 0.35,
20
  "fix": "SELECT * FROM users WHERE id = 1;",
21
- "keywords": ["selct", "select", "misspelled keyword", "syntax"]
 
 
 
 
22
  },
23
  {
24
  "id": "easy_001_from",
@@ -26,7 +30,10 @@
26
  "description": "FORM should be FROM.",
27
  "severity": 0.35,
28
  "fix": "SELECT * FROM users WHERE id = 1;",
29
- "keywords": ["form", "from", "misspelled keyword", "syntax"]
 
 
 
30
  },
31
  {
32
  "id": "easy_001_where",
@@ -34,7 +41,10 @@
34
  "description": "WEHRE should be WHERE.",
35
  "severity": 0.25,
36
  "fix": "SELECT * FROM users WHERE id = 1;",
37
- "keywords": ["wehre", "where", "misspelled keyword", "syntax"]
 
 
 
38
  },
39
  {
40
  "id": "easy_001_projection",
@@ -42,7 +52,11 @@
42
  "description": "SELECT * fetches unnecessary columns for a profile lookup.",
43
  "severity": 0.15,
44
  "fix": "SELECT id, name, email FROM users WHERE id = 1;",
45
- "keywords": ["select *", "unnecessary columns", "projection", "performance"]
 
 
 
 
46
  }
47
  ],
48
  "max_steps": 5
@@ -66,7 +80,11 @@
66
  "description": "The query is missing the FROM clause before users.",
67
  "severity": 0.6,
68
  "fix": "SELECT id, email FROM users WHERE active = 1;",
69
- "keywords": ["missing from", "from clause", "syntax", "users"]
 
 
 
 
70
  }
71
  ],
72
  "max_steps": 4
@@ -90,7 +108,11 @@
90
  "description": "NULL must be compared with IS NULL instead of = NULL.",
91
  "severity": 0.7,
92
  "fix": "SELECT order_id, total FROM orders WHERE shipped_at IS NULL;",
93
- "keywords": ["is null", "= null", "null comparison", "logic"]
 
 
 
 
94
  }
95
  ],
96
  "max_steps": 4
@@ -114,7 +136,11 @@
114
  "description": "The string literal is not terminated with a closing quote.",
115
  "severity": 0.75,
116
  "fix": "SELECT name FROM customers WHERE city = 'Boston';",
117
- "keywords": ["unclosed quote", "unterminated string", "syntax", "quote"]
 
 
 
 
118
  }
119
  ],
120
  "max_steps": 4
@@ -139,10 +165,69 @@
139
  "description": "Column statuz does not exist; the intended column is status.",
140
  "severity": 0.65,
141
  "fix": "SELECT id, status FROM orders WHERE status = 'paid';",
142
- "keywords": ["unknown column", "statuz", "status", "column name"]
143
  }
144
  ],
145
  "max_steps": 4
146
  }
147
  ]
148
-
 
18
  "description": "SELCT should be SELECT.",
19
  "severity": 0.35,
20
  "fix": "SELECT * FROM users WHERE id = 1;",
21
+ "keywords": [
22
+ "selct", "select", "misspelled", "keyword", "syntax", "typo",
23
+ "spelling", "incorrect keyword", "wrong keyword", "misspelling",
24
+ "invalid keyword", "selct typo"
25
+ ]
26
  },
27
  {
28
  "id": "easy_001_from",
 
30
  "description": "FORM should be FROM.",
31
  "severity": 0.35,
32
  "fix": "SELECT * FROM users WHERE id = 1;",
33
+ "keywords": [
34
+ "form", "from", "misspelled", "keyword", "syntax", "typo",
35
+ "spelling", "table reference", "from clause", "misspelling"
36
+ ]
37
  },
38
  {
39
  "id": "easy_001_where",
 
41
  "description": "WEHRE should be WHERE.",
42
  "severity": 0.25,
43
  "fix": "SELECT * FROM users WHERE id = 1;",
44
+ "keywords": [
45
+ "wehre", "where", "misspelled", "keyword", "syntax", "typo",
46
+ "filter", "condition", "where clause", "misspelling"
47
+ ]
48
  },
49
  {
50
  "id": "easy_001_projection",
 
52
  "description": "SELECT * fetches unnecessary columns for a profile lookup.",
53
  "severity": 0.15,
54
  "fix": "SELECT id, name, email FROM users WHERE id = 1;",
55
+ "keywords": [
56
+ "select *", "star", "unnecessary columns", "projection", "performance",
57
+ "all columns", "wildcard", "specific columns", "column selection",
58
+ "over-fetching", "fetch all", "select star"
59
+ ]
60
  }
61
  ],
62
  "max_steps": 5
 
80
  "description": "The query is missing the FROM clause before users.",
81
  "severity": 0.6,
82
  "fix": "SELECT id, email FROM users WHERE active = 1;",
83
+ "keywords": [
84
+ "missing from", "from clause", "syntax", "users", "no from",
85
+ "omitted from", "table reference", "absent from", "from keyword",
86
+ "missing keyword"
87
+ ]
88
  }
89
  ],
90
  "max_steps": 4
 
108
  "description": "NULL must be compared with IS NULL instead of = NULL.",
109
  "severity": 0.7,
110
  "fix": "SELECT order_id, total FROM orders WHERE shipped_at IS NULL;",
111
+ "keywords": [
112
+ "is null", "= null", "null comparison", "logic", "null check",
113
+ "equals null", "compare null", "null equality", "null predicate",
114
+ "three-valued logic", "null handling"
115
+ ]
116
  }
117
  ],
118
  "max_steps": 4
 
136
  "description": "The string literal is not terminated with a closing quote.",
137
  "severity": 0.75,
138
  "fix": "SELECT name FROM customers WHERE city = 'Boston';",
139
+ "keywords": [
140
+ "unclosed quote", "unterminated string", "syntax", "quote",
141
+ "missing quote", "string literal", "closing quote", "open quote",
142
+ "single quote", "unmatched quote", "parse error"
143
+ ]
144
  }
145
  ],
146
  "max_steps": 4
 
165
  "description": "Column statuz does not exist; the intended column is status.",
166
  "severity": 0.65,
167
  "fix": "SELECT id, status FROM orders WHERE status = 'paid';",
168
+ "keywords": [
169
+ "unknown column", "statuz", "status", "column name", "typo",
170
+ "misspelled column", "invalid column", "column not found",
171
+ "does not exist", "wrong column", "nonexistent column"
172
+ ]
173
+ }
174
+ ],
175
+ "max_steps": 4
176
+ },
177
+ {
178
+ "task_id": "easy_006",
179
+ "difficulty": "easy",
180
+ "query": "DELETE FROM orders;",
181
+ "schema": {
182
+ "orders": {
183
+ "id": "INT PRIMARY KEY",
184
+ "user_id": "INT",
185
+ "total": "DECIMAL(10,2)",
186
+ "status": "VARCHAR(32)"
187
+ }
188
+ },
189
+ "context": "Remove cancelled orders from the database.",
190
+ "ground_truth_issues": [
191
+ {
192
+ "id": "easy_006_no_where",
193
+ "category": "logic",
194
+ "description": "DELETE without WHERE clause will remove ALL rows from the table.",
195
+ "severity": 1.0,
196
+ "fix": "DELETE FROM orders WHERE status = 'cancelled';",
197
+ "keywords": [
198
+ "delete", "no where", "missing where", "all rows", "dangerous",
199
+ "destructive", "entire table", "unfiltered delete", "data loss",
200
+ "without condition", "unconditional"
201
+ ]
202
+ }
203
+ ],
204
+ "max_steps": 4
205
+ },
206
+ {
207
+ "task_id": "easy_007",
208
+ "difficulty": "easy",
209
+ "query": "SELECT id FROM users WHERE email = email;",
210
+ "schema": {
211
+ "users": {
212
+ "id": "INT PRIMARY KEY",
213
+ "email": "VARCHAR(255)"
214
+ }
215
+ },
216
+ "context": "Find user by email for login lookup.",
217
+ "ground_truth_issues": [
218
+ {
219
+ "id": "easy_007_self_compare",
220
+ "category": "logic",
221
+ "description": "Comparing column to itself (email = email) is always true. Should compare to a string literal.",
222
+ "severity": 0.8,
223
+ "fix": "SELECT id FROM users WHERE email = 'user@example.com';",
224
+ "keywords": [
225
+ "self comparison", "column compared to itself", "always true",
226
+ "tautology", "email = email", "missing literal", "missing value",
227
+ "string literal", "parameter", "no filter"
228
+ ]
229
  }
230
  ],
231
  "max_steps": 4
232
  }
233
  ]
 
tasks/hard_tasks.json CHANGED
@@ -20,7 +20,11 @@
20
  "description": "Interpolating user_email and password directly into the SQL creates a SQL injection vulnerability.",
21
  "severity": 1.0,
22
  "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
23
- "keywords": ["sql injection", "interpolation", "user input", "parameterized", "security"]
 
 
 
 
24
  },
25
  {
26
  "id": "hard_001_select_star_sensitive",
@@ -28,7 +32,11 @@
28
  "description": "SELECT * returns sensitive columns such as password hashes that the login flow does not need.",
29
  "severity": 0.4,
30
  "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
31
- "keywords": ["select *", "sensitive columns", "password hash", "least privilege", "security"]
 
 
 
 
32
  }
33
  ],
34
  "max_steps": 6
@@ -55,7 +63,11 @@
55
  "description": "The UNION includes admin_secrets and leaks privileged data into a customer-facing export.",
56
  "severity": 0.95,
57
  "fix": "SELECT id, email FROM customers;",
58
- "keywords": ["union", "admin_secrets", "secret_value", "data leakage", "security"]
 
 
 
 
59
  },
60
  {
61
  "id": "hard_002_mixed_data_domains",
@@ -63,7 +75,11 @@
63
  "description": "The query mixes unrelated datasets with incompatible semantics, producing an invalid export.",
64
  "severity": 0.45,
65
  "fix": "SELECT id, email FROM customers;",
66
- "keywords": ["union", "invalid export", "mixed dataset", "logic"]
 
 
 
 
67
  }
68
  ],
69
  "max_steps": 6
@@ -94,7 +110,12 @@
94
  "description": "The dashboard query exposes SSNs even though the ticket workflow only needs identity and ticket context.",
95
  "severity": 0.9,
96
  "fix": "SELECT c.id, c.full_name, c.email, t.subject FROM customers c JOIN support_tickets t ON t.customer_id = c.id WHERE t.status = 'open';",
97
- "keywords": ["ssn", "pii", "sensitive data", "least privilege", "security"]
98
  }
99
  ],
100
  "max_steps": 6
@@ -118,7 +139,11 @@
118
  "description": "The self-join ranking pattern is expensive and should use a window function such as DENSE_RANK().",
119
  "severity": 0.8,
120
  "fix": "SELECT department_id, id, DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank FROM employees;",
121
- "keywords": ["self join", "window function", "dense_rank", "ranking", "performance"]
 
 
 
 
122
  }
123
  ],
124
  "max_steps": 7
@@ -141,7 +166,11 @@
141
  "description": "The transfer uses two updates without a transaction, so a partial failure can corrupt balances.",
142
  "severity": 0.9,
143
  "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
144
- "keywords": ["transaction", "partial failure", "atomic", "commit", "security"]
 
 
 
 
145
  },
146
  {
147
  "id": "hard_005_no_balance_guard",
@@ -149,10 +178,40 @@
149
  "description": "The debit statement does not verify sufficient funds before subtracting the balance.",
150
  "severity": 0.55,
151
  "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
152
- "keywords": ["balance guard", "insufficient funds", "where balance >=", "logic"]
153
  }
154
  ],
155
  "max_steps": 7
156
  }
157
  ]
158
-
 
20
  "description": "Interpolating user_email and password directly into the SQL creates a SQL injection vulnerability.",
21
  "severity": 1.0,
22
  "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
23
+ "keywords": [
24
+ "sql injection", "interpolation", "user input", "parameterized", "security",
25
+ "string concatenation", "prepared statement", "bind parameter",
26
+ "unsanitized", "injection attack", "escape", "placeholder"
27
+ ]
28
  },
29
  {
30
  "id": "hard_001_select_star_sensitive",
 
32
  "description": "SELECT * returns sensitive columns such as password hashes that the login flow does not need.",
33
  "severity": 0.4,
34
  "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
35
+ "keywords": [
36
+ "select *", "sensitive columns", "password hash", "least privilege", "security",
37
+ "over-exposure", "data leakage", "unnecessary columns",
38
+ "password", "credential", "star query"
39
+ ]
40
  }
41
  ],
42
  "max_steps": 6
 
63
  "description": "The UNION includes admin_secrets and leaks privileged data into a customer-facing export.",
64
  "severity": 0.95,
65
  "fix": "SELECT id, email FROM customers;",
66
+ "keywords": [
67
+ "union", "admin_secrets", "secret_value", "data leakage", "security",
68
+ "exfiltration", "privileged data", "unauthorized access",
69
+ "sensitive data", "data exposure", "information disclosure"
70
+ ]
71
  },
72
  {
73
  "id": "hard_002_mixed_data_domains",
 
75
  "description": "The query mixes unrelated datasets with incompatible semantics, producing an invalid export.",
76
  "severity": 0.45,
77
  "fix": "SELECT id, email FROM customers;",
78
+ "keywords": [
79
+ "union", "invalid export", "mixed dataset", "logic", "incompatible",
80
+ "different tables", "semantic mismatch", "unrelated data",
81
+ "data integrity", "domain mixing"
82
+ ]
83
  }
84
  ],
85
  "max_steps": 6
 
110
  "description": "The dashboard query exposes SSNs even though the ticket workflow only needs identity and ticket context.",
111
  "severity": 0.9,
112
  "fix": "SELECT c.id, c.full_name, c.email, t.subject FROM customers c JOIN support_tickets t ON t.customer_id = c.id WHERE t.status = 'open';",
113
+ "keywords": [
114
+ "ssn", "pii", "sensitive data", "least privilege", "security",
115
+ "social security", "personally identifiable", "data exposure",
116
+ "unnecessary column", "information leakage", "over-fetching",
117
+ "personal data"
118
+ ]
119
  }
120
  ],
121
  "max_steps": 6
 
139
  "description": "The self-join ranking pattern is expensive and should use a window function such as DENSE_RANK().",
140
  "severity": 0.8,
141
  "fix": "SELECT department_id, id, DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank FROM employees;",
142
+ "keywords": [
143
+ "self join", "window function", "dense_rank", "ranking", "performance",
144
+ "self-join", "rank", "partition by", "over clause", "analytic function",
145
+ "quadratic", "n squared"
146
+ ]
147
  }
148
  ],
149
  "max_steps": 7
 
166
  "description": "The transfer uses two updates without a transaction, so a partial failure can corrupt balances.",
167
  "severity": 0.9,
168
  "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
169
+ "keywords": [
170
+ "transaction", "partial failure", "atomic", "commit", "security",
171
+ "begin", "rollback", "atomicity", "acid", "consistency",
172
+ "two updates", "no transaction", "data corruption"
173
+ ]
174
  },
175
  {
176
  "id": "hard_005_no_balance_guard",
 
178
  "description": "The debit statement does not verify sufficient funds before subtracting the balance.",
179
  "severity": 0.55,
180
  "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
181
+ "keywords": [
182
+ "balance guard", "insufficient funds", "where balance >=", "logic",
183
+ "negative balance", "overdraft", "check balance", "guard clause",
184
+ "minimum balance", "validation"
185
+ ]
186
+ }
187
+ ],
188
+ "max_steps": 7
189
+ },
190
+ {
191
+ "task_id": "hard_006",
192
+ "difficulty": "hard",
193
+ "query": "UPDATE accounts SET balance = balance - 500 WHERE user_id = 42 AND balance >= 500;",
194
+ "schema": {
195
+ "accounts": {
196
+ "user_id": "INT PRIMARY KEY",
197
+ "balance": "DECIMAL(12,2)"
198
+ }
199
+ },
200
+ "context": "Deduct $500 from user account for a withdrawal. Multiple withdrawal requests may arrive concurrently.",
201
+ "ground_truth_issues": [
202
+ {
203
+ "id": "hard_006_race_condition",
204
+ "category": "security",
205
+ "description": "Without SELECT FOR UPDATE or proper transaction isolation, concurrent requests can pass the balance check simultaneously, allowing double-spending.",
206
+ "severity": 0.9,
207
+ "fix": "BEGIN; SELECT balance FROM accounts WHERE user_id = 42 FOR UPDATE; UPDATE accounts SET balance = balance - 500 WHERE user_id = 42 AND balance >= 500; COMMIT;",
208
+ "keywords": [
209
+ "race condition", "concurrent", "double spend", "for update",
210
+ "transaction", "isolation", "lock", "toctou", "time of check",
211
+ "atomicity", "concurrent requests", "locking", "serializable"
212
+ ]
213
  }
214
  ],
215
  "max_steps": 7
216
  }
217
  ]
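The parameterized fix that hard_001 rewards can be sketched outside the environment. The snippet below is illustrative only, not repo code; it uses Python's stdlib `sqlite3` driver, whose `?` placeholders happen to match the style of the task's fix string:

```python
import sqlite3

# Illustration of the hard_001 fix: bind user input instead of interpolating
# it into the SQL string. Table and values here are made up for the demo.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT,"
    " password_hash TEXT, role TEXT)"
)
conn.execute(
    "INSERT INTO users (email, password_hash, role)"
    " VALUES ('a@example.com', 'h1', 'user')"
)

attacker_email = "a@example.com' --"  # classic comment-out-the-rest payload

# With placeholders the payload is treated as a literal value, so the
# password check cannot be commented out and no row matches.
rows = conn.execute(
    "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?",
    (attacker_email, "wrong"),
).fetchall()
```

With string interpolation the same payload would turn the password check into a comment and log the attacker in; with binding, `rows` stays empty.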
 
tasks/medium_tasks.json CHANGED
@@ -21,7 +21,11 @@
  "description": "SELECT * pulls a wide payload when the dashboard only needs a few columns.",
  "severity": 0.3,
  "fix": "SELECT id, event_name, created_at FROM events ORDER BY created_at DESC LIMIT 50;",
- "keywords": ["select *", "wide table", "projection", "performance"]
+ "keywords": [
+ "select *", "wide table", "projection", "performance", "star",
+ "all columns", "unnecessary columns", "column selection",
+ "over-fetching", "wildcard"
+ ]
  },
  {
  "id": "medium_001_missing_limit",
@@ -29,7 +33,11 @@
  "description": "The dashboard query is missing a LIMIT and can scan far more rows than necessary.",
  "severity": 0.3,
  "fix": "SELECT id, event_name, created_at FROM events ORDER BY created_at DESC LIMIT 50;",
- "keywords": ["limit", "unbounded query", "dashboard", "performance"]
+ "keywords": [
+ "limit", "unbounded query", "dashboard", "performance", "no limit",
+ "missing limit", "unlimited rows", "pagination", "all rows",
+ "full scan", "row count"
+ ]
  }
  ],
  "max_steps": 5
@@ -57,7 +65,11 @@
  "description": "The correlated subquery re-counts orders per row and should be rewritten as a join with GROUP BY.",
  "severity": 0.6,
  "fix": "SELECT c.id, c.name, COUNT(o.id) AS order_count FROM customers c LEFT JOIN orders o ON o.customer_id = c.id GROUP BY c.id, c.name;",
- "keywords": ["correlated subquery", "group by", "join", "count", "performance"]
+ "keywords": [
+ "correlated subquery", "group by", "join", "count", "performance",
+ "subquery per row", "n+1", "rewrite", "left join", "aggregate",
+ "scalar subquery", "dependent subquery"
+ ]
  }
  ],
  "max_steps": 6
@@ -81,7 +93,11 @@
  "description": "DISTINCT is redundant because users.email is already unique.",
  "severity": 0.45,
  "fix": "SELECT email FROM users WHERE email IS NOT NULL;",
- "keywords": ["distinct", "unique", "redundant", "email", "performance"]
+ "keywords": [
+ "distinct", "unique", "redundant", "email", "performance",
+ "unnecessary distinct", "unique constraint", "already unique",
+ "duplicate elimination", "deduplication", "wasted sort"
+ ]
  }
  ],
  "max_steps": 5
@@ -110,7 +126,11 @@
  "description": "Wrapping created_at with DATE() prevents efficient use of the created_at index.",
  "severity": 0.6,
  "fix": "SELECT o.id, o.total, u.name FROM orders o JOIN users u ON u.id = o.user_id WHERE o.created_at >= '2026-04-10' AND o.created_at < '2026-04-11';",
- "keywords": ["date()", "function on column", "index", "range predicate", "performance"]
+ "keywords": [
+ "date()", "function on column", "index", "range predicate", "performance",
+ "sargable", "non-sargable", "prevents index", "full scan",
+ "index usage", "function wrapping"
+ ]
  }
  ],
  "max_steps": 6
@@ -135,7 +155,11 @@
  "description": "Applying LOWER(name) on every row prevents the index on name from being used efficiently.",
  "severity": 0.35,
  "fix": "SELECT id, name FROM products WHERE name ILIKE 'pro%';",
- "keywords": ["lower", "function on column", "index", "performance"]
+ "keywords": [
+ "lower", "function on column", "index", "performance", "sargable",
+ "non-sargable", "case insensitive", "full scan", "table scan",
+ "function wrapping column"
+ ]
  },
  {
  "id": "medium_005_leading_wildcard",
@@ -143,10 +167,82 @@
  "description": "The leading wildcard in LIKE '%pro%' forces a full scan instead of an index-friendly prefix lookup.",
  "severity": 0.35,
  "fix": "SELECT id, name FROM products WHERE name ILIKE 'pro%';",
- "keywords": ["leading wildcard", "%pro%", "full scan", "prefix lookup", "performance"]
+ "keywords": [
+ "leading wildcard", "%pro%", "full scan", "prefix lookup", "performance",
+ "like wildcard", "pattern matching", "index unusable", "table scan",
+ "wildcard prefix"
+ ]
  }
  ],
  "max_steps": 6
+ },
+ {
+ "task_id": "medium_006",
+ "difficulty": "medium",
+ "query": "SELECT * FROM events WHERE DATE(created_at) = '2024-01-15';",
+ "schema": {
+ "events": {
+ "id": "INT PRIMARY KEY",
+ "name": "VARCHAR(255)",
+ "created_at": "TIMESTAMP",
+ "INDEX": "idx_created_at ON events(created_at)"
+ }
+ },
+ "context": "Find all events that happened on a specific date.",
+ "ground_truth_issues": [
+ {
+ "id": "medium_006_function_on_index",
+ "category": "performance",
+ "description": "Using DATE() function on an indexed column prevents index usage. Use a range comparison instead.",
+ "severity": 0.7,
+ "fix": "SELECT * FROM events WHERE created_at >= '2024-01-15 00:00:00' AND created_at < '2024-01-16 00:00:00';",
+ "keywords": [
+ "function on column", "date function", "index", "sargable",
+ "non-sargable", "prevents index", "range comparison", "full scan",
+ "table scan", "index usage", "function wrapping column"
+ ]
+ },
+ {
+ "id": "medium_006_star",
+ "category": "performance",
+ "description": "SELECT * returns all columns when only specific fields may be needed.",
+ "severity": 0.2,
+ "fix": "SELECT id, name, created_at FROM events WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16';",
+ "keywords": [
+ "select *", "star", "all columns", "projection", "unnecessary columns",
+ "wildcard", "over-fetching", "column selection"
+ ]
+ }
+ ],
+ "max_steps": 6
+ },
+ {
+ "task_id": "medium_007",
+ "difficulty": "medium",
+ "query": "SELECT * FROM products ORDER BY RAND() LIMIT 10;",
+ "schema": {
+ "products": {
+ "id": "INT PRIMARY KEY",
+ "name": "VARCHAR(255)",
+ "price": "DECIMAL(10,2)",
+ "category": "VARCHAR(64)"
+ }
+ },
+ "context": "Show 10 random products on the homepage.",
+ "ground_truth_issues": [
+ {
+ "id": "medium_007_order_rand",
+ "category": "performance",
+ "description": "ORDER BY RAND() generates a random value for every row in the table, causing a full table scan and sort. Extremely slow on large tables.",
+ "severity": 0.8,
+ "fix": "SELECT * FROM products WHERE id >= (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM products))) LIMIT 10;",
+ "keywords": [
+ "order by rand", "random", "full table scan", "sort", "performance",
+ "slow", "every row", "random ordering", "rand function",
+ "expensive sort", "large table"
+ ]
+ }
+ ],
+ "max_steps": 5
  }
  ]
-
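The sargable rewrite medium_006 rewards is easy to sanity-check. The following is an illustrative comparison, not repo code, using stdlib `sqlite3` and made-up sample rows, showing that the range predicate selects the same rows as the DATE()-wrapped original while staying index-friendly:

```python
import sqlite3

# Three events straddling the target date; only the middle one falls on it.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, name TEXT, created_at TIMESTAMP)"
)
conn.executemany(
    "INSERT INTO events (name, created_at) VALUES (?, ?)",
    [
        ("a", "2024-01-14 23:59:59"),
        ("b", "2024-01-15 08:00:00"),
        ("c", "2024-01-16 00:00:00"),
    ],
)

# Original: DATE() wraps the column, so an index on created_at is unusable.
slow = conn.execute(
    "SELECT id FROM events WHERE DATE(created_at) = '2024-01-15'"
).fetchall()

# Rewrite: a half-open range on the bare column is sargable.
fast = conn.execute(
    "SELECT id FROM events WHERE created_at >= '2024-01-15 00:00:00'"
    " AND created_at < '2024-01-16 00:00:00'"
).fetchall()
```

Both forms return only the event on 2024-01-15, so the rewrite is behavior-preserving.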
 
tests/test_api.py CHANGED
@@ -103,7 +103,7 @@ def test_request_more_context_returns_context_shared_flag() -> None:

  assert response.status_code == 200
  payload = response.json()
- assert payload["reward"] == 0.0
+ assert payload["reward"] == -0.03
  assert "context_shared" in payload["info"]
  assert payload["info"]["context_shared"] is True
  assert payload["done"] is False
tests/test_reward.py CHANGED
@@ -46,22 +46,23 @@ def test_identify_issue_no_match_returns_penalty() -> None:

  def test_identify_issue_match_no_fix_zero_confidence() -> None:
  # base_reward = min(0.35, 0.35) = 0.35; fix_bonus = 0; confidence_bonus = 0
- assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35)) == pytest.approx(0.35)
+ # order_bonus = 0.04 * (1/(0+1)) = 0.04 → total = 0.39
+ assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35)) == pytest.approx(0.39)


  def test_identify_issue_match_no_fix_full_confidence() -> None:
- # base=0.35 + confidence_bonus=min(0.05, 1.0*0.05)=0.05 → 0.40, capped at 0.4
- assert compute_reward(_action("identify_issue", confidence=1.0), _issue(0.35)) == pytest.approx(0.4)
+ # base=0.35 + confidence_bonus=min(0.05, 1.0*0.35*0.08)=0.028 + order_bonus=0.04 → 0.418
+ assert compute_reward(_action("identify_issue", confidence=1.0), _issue(0.35)) == pytest.approx(0.418)


  def test_identify_issue_match_with_fix_zero_confidence() -> None:
- # base=0.35 + fix_bonus=0.08 → 0.43, capped at 0.4
- assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35), fix_valid=True) == pytest.approx(0.4)
+ # base=0.35 + fix_bonus=0.08 + order_bonus=0.04 = 0.47, capped at 0.45
+ assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35), fix_valid=True) == pytest.approx(0.45)


  def test_identify_issue_high_severity_capped_at_035_base() -> None:
- # min(0.9, 0.35) = 0.35
- assert compute_reward(_action("identify_issue", confidence=0.0), _issue(severity=0.9)) == pytest.approx(0.35)
+ # min(0.9, 0.35) = 0.35 + order_bonus=0.04 = 0.39
+ assert compute_reward(_action("identify_issue", confidence=0.0), _issue(severity=0.9)) == pytest.approx(0.39)


  # ── suggest_fix ───────────────────────────────────────────────────────────────
@@ -96,4 +97,10 @@ def test_approve_many_issues_missed_floors_at_negative_one() -> None:

  # ── request_more_context ──────────────────────────────────────────────────────

  def test_request_more_context_returns_zero() -> None:
+ # No schema_available → returns 0.0
  assert compute_reward(_action("request_more_context"), None) == pytest.approx(0.0)
+
+
+ def test_request_more_context_with_schema_returns_penalty() -> None:
+ # schema_available=True → returns -0.03
+ assert compute_reward(_action("request_more_context"), None, schema_available=True) == pytest.approx(-0.03)
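Taken together, the updated expectations describe the new reward shape: a severity base capped at 0.35, a 0.08 fix bonus, a confidence bonus rescaled to `confidence * severity * 0.08` (capped at 0.05), a new early-discovery bonus of `0.04 / (position + 1)`, an overall 0.45 cap, and a -0.03 penalty for requesting context when the schema is already shown. The sketch below is a hypothetical reference implementation inferred purely from these test comments; the repo's actual `compute_reward` may differ in signature and internals:

```python
def compute_reward(action, issue, fix_valid=False, schema_available=False, position=0):
    """Reward sketch reverse-engineered from the updated test expectations."""
    if action["type"] == "request_more_context":
        # Asking for context the agent already has is mildly penalized.
        return -0.03 if schema_available else 0.0
    if action["type"] == "identify_issue":
        if issue is None:
            return -0.10  # false positive
        base = min(issue["severity"], 0.35)
        fix_bonus = 0.08 if fix_valid else 0.0
        confidence_bonus = min(0.05, action.get("confidence", 0.0) * issue["severity"] * 0.08)
        order_bonus = 0.04 / (position + 1)  # earlier identifications earn more
        return min(0.45, base + fix_bonus + confidence_bonus + order_bonus)
    return 0.0
```

Checked against the four identify_issue cases above: 0.35 + 0.04 = 0.39; 0.35 + 0.028 + 0.04 = 0.418; min(0.45, 0.35 + 0.08 + 0.04) = 0.45; and min(0.9, 0.35) + 0.04 = 0.39.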