Commit b83c8ad (parent: 11aa990)
improve: 20 tasks, richer keywords, enhanced reward/grader, bigram matching, compelling README
Files changed:
- README.md +43 -11
- inference.py +12 -2
- openenv.yaml +32 -12
- server/environment.py +2 -1
- server/grader.py +25 -3
- server/reward.py +22 -3
- tasks/easy_tasks.json +94 -9
- tasks/hard_tasks.json +68 -9
- tasks/medium_tasks.json +104 -8
- tests/test_api.py +1 -1
- tests/test_reward.py +14 -7
README.md
CHANGED
@@ -15,7 +15,26 @@ An OpenEnv environment where an AI agent reviews SQL queries for correctness, pe

 ## Why This Matters

+SQL bugs are among the most common and costly defects in production systems. A misplaced
+keyword breaks an API. A missing WHERE clause on a DELETE wipes a table. An unparameterized
+input opens a path to data exfiltration. A function call on an indexed column turns a
+10ms query into a 30-second full table scan.
+
+Today, these defects are caught by human reviewers who spend hours on repetitive pattern
+matching during code reviews, migration audits, and ETL pipeline checks. This creates a
+bottleneck: senior engineers are pulled from feature work to review SQL, and critical
+issues still slip through.
+
+This environment provides a standardized benchmark to train and evaluate AI agents on
+exactly this task. Unlike toy benchmarks, every query reflects real patterns found in
+production codebases, from typos that break APIs, to injection vectors that expose user
+data, to race conditions that enable double-spending. The agent must identify issues,
+suggest fixes, and know when to approve, just like a human code reviewer.
+
+The environment provides rich per-step reward signals with severity-weighted partial
+credit, making it directly suitable for GRPO and PPO training loops. The task bank spans
+three difficulty levels with meaningful score variance, ensuring the benchmark
+discriminates between agent capabilities.

 ## What The Environment Does

@@ -38,23 +57,36 @@ The agent responds step by step with one of four actions:

 Rewards are deterministic and shaped for partial progress throughout the trajectory:

+- **Correct issue identification**: +0.10 to +0.45, scaled by issue severity, confidence, and discovery order
 - **Valid fix suggestion**: +0.08 to +0.10 bonus
 - **Confidence bonus**: up to +0.05 for high-confidence correct identifications
+- **Discovery order bonus**: +0.04 for the first issue found, diminishing for subsequent finds
 - **False positive**: -0.10 penalty
 - **Duplicate identification**: -0.02 penalty
 - **Approving with missed issues**: -0.15 per missed issue
 - **Complete correct approval**: +0.20
+- **Requesting context when the schema is available**: -0.03 penalty (encourages using the provided schema)
+
+### Reward Properties for RL Training
+
+- **Dense**: every step returns a non-zero signal, enabling credit assignment
+- **Bounded**: per-step rewards in [-1.0, +0.45], episode scores in (0, 1)
+- **Shaped**: partial credit for partial coverage; no cliff between "found 2 of 3" and "found 3 of 3"
+- **Deterministic**: the same actions always produce the same rewards (no randomness in grading)
+- **Discriminative**: hard tasks require multi-step reasoning; easy tasks reward quick identification

 ## Task Bank

+The environment ships with **20 tasks** across three difficulty levels:
+
+| Difficulty | Count | Examples | Score Range |
 |---|---|---|---|
+| Easy | 7 | Misspelled keywords, missing FROM, = NULL vs IS NULL, DELETE without WHERE, self-comparison | ~0.60-0.90 |
+| Medium | 7 | SELECT *, missing LIMIT, correlated subqueries, function on indexed column, ORDER BY RAND() | ~0.30-0.65 |
+| Hard | 6 | SQL injection, privilege escalation, PII leakage, self-join optimization, race conditions | ~0.15-0.45 |
+
+Each ground-truth issue includes 8-12 keywords and synonyms for robust fuzzy matching, plus
+bigram matching to catch common two-word phrases LLMs use (e.g., "sql injection", "missing where").

 Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`

@@ -88,10 +120,10 @@ Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks

 ├── sql_query_reviewer/   - typed models and client package
 ├── server/               - FastAPI environment server
 │   ├── environment.py    - reset(), step(), state()
-│   ├── grader.py         - deterministic scoring
+│   ├── grader.py         - deterministic scoring with bigram matching
-│   ├── reward.py         - per-step reward
+│   ├── reward.py         - per-step reward with order bonus
 │   └── app.py            - HTTP routes
+├── tasks/                - 20 SQL query tasks (JSON)
 └── tests/                - pytest suite
 ```

@@ -128,7 +160,7 @@ export HF_TOKEN=hf_xxx

 python inference.py
 ```

-The script emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
+The script runs all 20 tasks and emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.

 ## Hugging Face Spaces
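The bounded, shaped episode score promised in the README can be checked with a few lines of arithmetic. This is an illustrative sketch, not the project's actual `grade_episode`: the fractional-coverage term here is an assumption, while the 0.05 false-positive penalty and the [0.01, 0.99] clamp appear in the `server/grader.py` diff in this commit.

```python
def clamp(value: float, lo: float, hi: float) -> float:
    # Keep episode scores strictly inside (0, 1), as the README states.
    return max(lo, min(hi, value))

def episode_score(found: int, total: int, false_positives: int,
                  efficiency_bonus: float = 0.0) -> float:
    # Assumed simple fractional coverage, for illustration only.
    coverage_score = found / total
    false_positive_penalty = 0.05 * false_positives
    return clamp(coverage_score + efficiency_bonus - false_positive_penalty, 0.01, 0.99)

# Partial coverage earns partial credit: no cliff between "found 2 of 3" and "found 3 of 3".
print(round(episode_score(2, 3, 0), 2))  # 0.67
print(episode_score(3, 3, 0))            # 0.99, clamped below 1.0
```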
inference.py
CHANGED
@@ -311,16 +311,26 @@ async def async_main() -> int:

     # Build LLM client (even without key, don't crash - emit logs and exit)
     if not API_KEY:
         print("[DEBUG] WARNING: No API key found (HF_TOKEN / API_KEY / OPENAI_API_KEY)", flush=True)
+        _fallback_ids = [
+            "easy_001", "easy_002", "easy_003", "easy_004", "easy_005", "easy_006", "easy_007",
+            "medium_001", "medium_002", "medium_003", "medium_004", "medium_005", "medium_006", "medium_007",
+            "hard_001", "hard_002", "hard_003", "hard_004", "hard_005", "hard_006",
+        ]
+        for tid in _fallback_ids:
             log_start(task=tid, env=BENCHMARK, model=MODEL_NAME)
             log_end(success=False, steps=0, score=0.01, rewards=[])
         return 1

     llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

+    _default_ids = ",".join([
+        "easy_001", "easy_002", "easy_003", "easy_004", "easy_005", "easy_006", "easy_007",
+        "medium_001", "medium_002", "medium_003", "medium_004", "medium_005", "medium_006", "medium_007",
+        "hard_001", "hard_002", "hard_003", "hard_004", "hard_005", "hard_006",
+    ])
     task_ids = tuple(
         tid.strip()
+        for tid in os.getenv("TASK_IDS", _default_ids).split(",")
         if tid.strip()
     )
openenv.yaml
CHANGED
@@ -1,7 +1,7 @@

 name: sql-query-reviewer
 description: "AI agent reviews SQL queries for correctness, performance, and security."
 author: Hellinferno
+version: "0.2.0"
 tags:
   - openenv
   - sql

@@ -28,30 +28,46 @@ tasks:

     name: Unknown Column Name
     difficulty: easy
     description: "Detect column name typo (statuz vs status)."
+  - id: easy_006
+    name: DELETE Without WHERE
+    difficulty: easy
+    description: "Detect dangerous unconditional DELETE statement."
+  - id: easy_007
+    name: Column Self-Comparison
+    difficulty: easy
+    description: "Detect column compared to itself instead of a value."
   - id: medium_001
+    name: Wide Table SELECT Star
     difficulty: medium
+    description: "Identify schema-aware performance problems like SELECT * on wide JSON tables."
   - id: medium_002
+    name: Correlated Subquery
     difficulty: medium
+    description: "Find correlated subqueries that could be rewritten as JOINs."
   - id: medium_003
+    name: Redundant DISTINCT
     difficulty: medium
     description: "Detect unnecessary DISTINCT on unique columns."
   - id: medium_004
+    name: Function on Indexed Column
     difficulty: medium
+    description: "Detect DATE() function preventing index usage."
   - id: medium_005
+    name: Leading Wildcard Search
+    difficulty: medium
+    description: "Identify LOWER() and leading wildcard preventing index usage."
+  - id: medium_006
+    name: DATE Function Index Bypass
     difficulty: medium
+    description: "Detect DATE() function on indexed column preventing efficient lookups."
+  - id: medium_007
+    name: ORDER BY RAND Performance
+    difficulty: medium
+    description: "Detect expensive random ordering on large tables."
   - id: hard_001
     name: SQL Injection Detection
     difficulty: hard
+    description: "Find string interpolation enabling SQL injection vectors."
   - id: hard_002
     name: Privilege Escalation via UNION
     difficulty: hard

@@ -67,4 +83,8 @@ tasks:

   - id: hard_005
     name: Transaction Isolation Issues
     difficulty: hard
+    description: "Find missing transaction isolation causing partial failure corruption."
+  - id: hard_006
+    name: Race Condition in Balance Update
+    difficulty: hard
+    description: "Detect TOCTOU race condition allowing double-spending."
server/environment.py
CHANGED
@@ -73,7 +73,7 @@ class SQLReviewEnvironment:

                 description=matched_issue.description,
             )
         )
-        reward = compute_reward(action, matched_issue, fix_valid=fix_valid)
+        reward = compute_reward(action, matched_issue, fix_valid=fix_valid, issues_found_count=len(state.issues_identified), schema_available=bool(task.schema_info))
         remaining = len(task.ground_truth_issues) - len(state.issues_identified)
         feedback = f"Matched {matched_issue.category} issue '{matched_issue.id}'. {remaining} issue(s) remaining."
         info = {

@@ -114,6 +114,7 @@ class SQLReviewEnvironment:

         else:
             feedback = self._schema_feedback(task)
+            reward = compute_reward(action, None, schema_available=bool(task.schema_info))
             info = {"context_shared": bool(task.schema_info)}

         state.total_reward += reward
server/grader.py
CHANGED
@@ -25,14 +25,37 @@ def _set_overlap(candidate: set[str], target: set[str]) -> float:

     return len(candidate & target) / max(len(target), 1)


+def _make_bigrams(text: str) -> set[tuple[str, str]]:
+    words = TOKEN_RE.findall(text.lower())
+    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}
+
+
+def score_issue_match(
+    description: str, category: IssueCategory | None, issue: GroundTruthIssue
+) -> float:
     candidate_tokens = tokenize(description)
     keyword_tokens = set(issue.keywords)
     description_tokens = tokenize(issue.description)
+
+    # Unigram overlap
     keyword_score = _set_overlap(candidate_tokens, keyword_tokens)
     description_score = _set_overlap(candidate_tokens, description_tokens)
+
+    # Bigram overlap: catches two-word phrases like "sql injection", "missing where"
+    candidate_bigrams = _make_bigrams(description)
+    keyword_bigrams: set[tuple[str, str]] = set()
+    for kw in issue.keywords:
+        words = kw.lower().split()
+        if len(words) >= 2:
+            keyword_bigrams.add(tuple(words[:2]))
+    bigram_score = 0.0
+    if keyword_bigrams:
+        bigram_hits = len(candidate_bigrams & keyword_bigrams)
+        bigram_score = bigram_hits / max(len(keyword_bigrams), 1)
+
     category_bonus = 0.2 if category == issue.category else 0.0
+
+    score = (keyword_score * 0.5) + (description_score * 0.15) + (bigram_score * 0.15) + category_bonus
     return clamp(score, 0.0, 1.0)


@@ -88,4 +111,3 @@ def grade_episode(

     false_positive_penalty = 0.05 * false_positive_count
     final_score = coverage_score + efficiency_bonus - false_positive_penalty
     return clamp(final_score, 0.01, 0.99)
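The new bigram path in this hunk is small enough to exercise standalone. A minimal sketch, assuming a simple word-character pattern for `TOKEN_RE` (the actual pattern in `server/grader.py` is not shown in this diff):

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9*]+")  # assumed; the real TOKEN_RE lives in server/grader.py

def make_bigrams(text: str) -> set[tuple[str, str]]:
    # Adjacent lowercase word pairs, mirroring _make_bigrams in the hunk above.
    words = TOKEN_RE.findall(text.lower())
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

# An agent's free-text finding matched against multi-word keyword bigrams:
candidate = make_bigrams("This query has a SQL injection risk via string interpolation")
keyword_bigrams = {("sql", "injection"), ("missing", "where")}

hits = candidate & keyword_bigrams           # {("sql", "injection")}
bigram_score = len(hits) / max(len(keyword_bigrams), 1)
print(bigram_score)  # 0.5
```

Matching on bigrams rather than single tokens is what lets a loose phrase like "SQL injection risk" hit the "sql injection" keyword without also crediting unrelated mentions of "sql" alone.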
server/reward.py
CHANGED
@@ -11,16 +11,30 @@ def compute_reward(

     duplicate_issue: bool = False,
     remaining_unfound: int = 0,
     has_previous_issue: bool = False,
+    issues_found_count: int = 0,
+    schema_available: bool = False,
 ) -> float:
     if action.action_type == "identify_issue":
         if duplicate_issue:
             return -0.02
+
         if matched_issue is None:
             return -0.1
+
+        # Base reward scaled by severity
         base_reward = min(matched_issue.severity, 0.35)
+
+        # Fix bonus
         fix_bonus = 0.08 if fix_valid else 0.0
+
+        # Confidence bonus: higher reward for confident correct identifications
+        confidence_bonus = min(0.05, action.confidence * matched_issue.severity * 0.08)
+
+        # Discovery order bonus: finding the first issue is worth slightly more.
+        # This encourages the agent to start identifying issues quickly.
+        order_bonus = 0.04 * (1.0 / (issues_found_count + 1))
+
+        return min(base_reward + fix_bonus + confidence_bonus + order_bonus, 0.45)

     if action.action_type == "suggest_fix":
         if not has_previous_issue:

@@ -32,5 +46,10 @@

             return 0.2
         return max(-1.0, -0.15 * remaining_unfound)

+    if action.action_type == "request_more_context":
+        # Mild penalty for asking when schema is already provided
+        if schema_available:
+            return -0.03
+        return 0.0

+    return 0.0
tasks/easy_tasks.json
CHANGED
@@ -18,7 +18,11 @@

       "description": "SELCT should be SELECT.",
       "severity": 0.35,
       "fix": "SELECT * FROM users WHERE id = 1;",
+      "keywords": [
+        "selct", "select", "misspelled", "keyword", "syntax", "typo",
+        "spelling", "incorrect keyword", "wrong keyword", "misspelling",
+        "invalid keyword", "selct typo"
+      ]
     },
     {
       "id": "easy_001_from",

@@ -26,7 +30,10 @@

       "description": "FORM should be FROM.",
       "severity": 0.35,
       "fix": "SELECT * FROM users WHERE id = 1;",
+      "keywords": [
+        "form", "from", "misspelled", "keyword", "syntax", "typo",
+        "spelling", "table reference", "from clause", "misspelling"
+      ]
     },
     {
       "id": "easy_001_where",

@@ -34,7 +41,10 @@

       "description": "WEHRE should be WHERE.",
       "severity": 0.25,
       "fix": "SELECT * FROM users WHERE id = 1;",
+      "keywords": [
+        "wehre", "where", "misspelled", "keyword", "syntax", "typo",
+        "filter", "condition", "where clause", "misspelling"
+      ]
     },
     {
       "id": "easy_001_projection",

@@ -42,7 +52,11 @@

       "description": "SELECT * fetches unnecessary columns for a profile lookup.",
       "severity": 0.15,
       "fix": "SELECT id, name, email FROM users WHERE id = 1;",
+      "keywords": [
+        "select *", "star", "unnecessary columns", "projection", "performance",
+        "all columns", "wildcard", "specific columns", "column selection",
+        "over-fetching", "fetch all", "select star"
+      ]
     }
   ],
   "max_steps": 5

@@ -66,7 +80,11 @@

       "description": "The query is missing the FROM clause before users.",
       "severity": 0.6,
       "fix": "SELECT id, email FROM users WHERE active = 1;",
+      "keywords": [
+        "missing from", "from clause", "syntax", "users", "no from",
+        "omitted from", "table reference", "absent from", "from keyword",
+        "missing keyword"
+      ]
     }
   ],
   "max_steps": 4

@@ -90,7 +108,11 @@

       "description": "NULL must be compared with IS NULL instead of = NULL.",
       "severity": 0.7,
       "fix": "SELECT order_id, total FROM orders WHERE shipped_at IS NULL;",
+      "keywords": [
+        "is null", "= null", "null comparison", "logic", "null check",
+        "equals null", "compare null", "null equality", "null predicate",
+        "three-valued logic", "null handling"
+      ]
     }
   ],
   "max_steps": 4

@@ -114,7 +136,11 @@

       "description": "The string literal is not terminated with a closing quote.",
       "severity": 0.75,
       "fix": "SELECT name FROM customers WHERE city = 'Boston';",
+      "keywords": [
+        "unclosed quote", "unterminated string", "syntax", "quote",
+        "missing quote", "string literal", "closing quote", "open quote",
+        "single quote", "unmatched quote", "parse error"
+      ]
     }
   ],
   "max_steps": 4

@@ -139,10 +165,69 @@

       "description": "Column statuz does not exist; the intended column is status.",
       "severity": 0.65,
       "fix": "SELECT id, status FROM orders WHERE status = 'paid';",
+      "keywords": [
+        "unknown column", "statuz", "status", "column name", "typo",
+        "misspelled column", "invalid column", "column not found",
+        "does not exist", "wrong column", "nonexistent column"
+      ]
+    }
+  ],
+  "max_steps": 4
+},
+{
+  "task_id": "easy_006",
+  "difficulty": "easy",
+  "query": "DELETE FROM orders;",
+  "schema": {
+    "orders": {
+      "id": "INT PRIMARY KEY",
+      "user_id": "INT",
+      "total": "DECIMAL(10,2)",
+      "status": "VARCHAR(32)"
+    }
+  },
+  "context": "Remove cancelled orders from the database.",
+  "ground_truth_issues": [
+    {
+      "id": "easy_006_no_where",
+      "category": "logic",
+      "description": "DELETE without WHERE clause will remove ALL rows from the table.",
+      "severity": 1.0,
+      "fix": "DELETE FROM orders WHERE status = 'cancelled';",
+      "keywords": [
+        "delete", "no where", "missing where", "all rows", "dangerous",
+        "destructive", "entire table", "unfiltered delete", "data loss",
+        "without condition", "unconditional"
+      ]
+    }
+  ],
+  "max_steps": 4
+},
+{
+  "task_id": "easy_007",
+  "difficulty": "easy",
+  "query": "SELECT id FROM users WHERE email = email;",
+  "schema": {
+    "users": {
+      "id": "INT PRIMARY KEY",
+      "email": "VARCHAR(255)"
+    }
+  },
+  "context": "Find user by email for login lookup.",
+  "ground_truth_issues": [
+    {
+      "id": "easy_007_self_compare",
+      "category": "logic",
+      "description": "Comparing column to itself (email = email) is always true. Should compare to a string literal.",
+      "severity": 0.8,
+      "fix": "SELECT id FROM users WHERE email = 'user@example.com';",
+      "keywords": [
+        "self comparison", "column compared to itself", "always true",
+        "tautology", "email = email", "missing literal", "missing value",
+        "string literal", "parameter", "no filter"
+      ]
     }
   ],
   "max_steps": 4
 }
 ]
tasks/hard_tasks.json
CHANGED
|
@@ -20,7 +20,11 @@
|
|
| 20 |
"description": "Interpolating user_email and password directly into the SQL creates a SQL injection vulnerability.",
|
| 21 |
"severity": 1.0,
|
| 22 |
"fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
|
| 23 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
},
|
| 25 |
{
|
| 26 |
"id": "hard_001_select_star_sensitive",
|
|
@@ -28,7 +32,11 @@
|
|
| 28 |
"description": "SELECT * returns sensitive columns such as password hashes that the login flow does not need.",
|
| 29 |
"severity": 0.4,
|
| 30 |
"fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
|
| 31 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
}
|
| 33 |
],
|
| 34 |
"max_steps": 6
|
|
@@ -55,7 +63,11 @@
|
|
| 55 |
"description": "The UNION includes admin_secrets and leaks privileged data into a customer-facing export.",
|
| 56 |
"severity": 0.95,
|
| 57 |
"fix": "SELECT id, email FROM customers;",
|
| 58 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
},
|
| 60 |
{
|
| 61 |
"id": "hard_002_mixed_data_domains",
|
|
@@ -63,7 +75,11 @@
|
|
| 63 |
"description": "The query mixes unrelated datasets with incompatible semantics, producing an invalid export.",
|
| 64 |
"severity": 0.45,
|
| 65 |
"fix": "SELECT id, email FROM customers;",
|
| 66 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
}
|
| 68 |
],
|
| 69 |
"max_steps": 6
|
|
@@ -94,7 +110,12 @@
|
|
| 94 |
"description": "The dashboard query exposes SSNs even though the ticket workflow only needs identity and ticket context.",
|
| 95 |
"severity": 0.9,
|
| 96 |
"fix": "SELECT c.id, c.full_name, c.email, t.subject FROM customers c JOIN support_tickets t ON t.customer_id = c.id WHERE t.status = 'open';",
|
| 97 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
}
|
| 99 |
],
|
| 100 |
"max_steps": 6
|
|
@@ -118,7 +139,11 @@
|
|
| 118 |
"description": "The self-join ranking pattern is expensive and should use a window function such as DENSE_RANK().",
|
| 119 |
"severity": 0.8,
|
| 120 |
"fix": "SELECT department_id, id, DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank FROM employees;",
|
| 121 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 122 |
}
|
| 123 |
],
|
| 124 |
"max_steps": 7
|
|
@@ -141,7 +166,11 @@
|
|
| 141 |
"description": "The transfer uses two updates without a transaction, so a partial failure can corrupt balances.",
|
| 142 |
"severity": 0.9,
|
| 143 |
"fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
|
| 144 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 145 |
},
|
| 146 |
{
|
| 147 |
"id": "hard_005_no_balance_guard",
|
|
@@ -149,10 +178,40 @@
|
|
| 149 |
"description": "The debit statement does not verify sufficient funds before subtracting the balance.",
|
| 150 |
"severity": 0.55,
|
| 151 |
"fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
|
| 152 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 153 |
}
|
| 154 |
],
|
| 155 |
"max_steps": 7
|
| 156 |
}
|
| 157 |
]
|
| 158 |
-
|
|
|
|
        "description": "Interpolating user_email and password directly into the SQL creates a SQL injection vulnerability.",
        "severity": 1.0,
        "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
+       "keywords": [
+         "sql injection", "interpolation", "user input", "parameterized", "security",
+         "string concatenation", "prepared statement", "bind parameter",
+         "unsanitized", "injection attack", "escape", "placeholder"
+       ]
      },
      {
        "id": "hard_001_select_star_sensitive",
        "description": "SELECT * returns sensitive columns such as password hashes that the login flow does not need.",
        "severity": 0.4,
        "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
+       "keywords": [
+         "select *", "sensitive columns", "password hash", "least privilege", "security",
+         "over-exposure", "data leakage", "unnecessary columns",
+         "password", "credential", "star query"
+       ]
      }
    ],
    "max_steps": 6

        "description": "The UNION includes admin_secrets and leaks privileged data into a customer-facing export.",
        "severity": 0.95,
        "fix": "SELECT id, email FROM customers;",
+       "keywords": [
+         "union", "admin_secrets", "secret_value", "data leakage", "security",
+         "exfiltration", "privileged data", "unauthorized access",
+         "sensitive data", "data exposure", "information disclosure"
+       ]
      },
      {
        "id": "hard_002_mixed_data_domains",
        "description": "The query mixes unrelated datasets with incompatible semantics, producing an invalid export.",
        "severity": 0.45,
        "fix": "SELECT id, email FROM customers;",
+       "keywords": [
+         "union", "invalid export", "mixed dataset", "logic", "incompatible",
+         "different tables", "semantic mismatch", "unrelated data",
+         "data integrity", "domain mixing"
+       ]
      }
    ],
    "max_steps": 6

        "description": "The dashboard query exposes SSNs even though the ticket workflow only needs identity and ticket context.",
        "severity": 0.9,
        "fix": "SELECT c.id, c.full_name, c.email, t.subject FROM customers c JOIN support_tickets t ON t.customer_id = c.id WHERE t.status = 'open';",
+       "keywords": [
+         "ssn", "pii", "sensitive data", "least privilege", "security",
+         "social security", "personally identifiable", "data exposure",
+         "unnecessary column", "information leakage", "over-fetching",
+         "personal data"
+       ]
      }
    ],
    "max_steps": 6

        "description": "The self-join ranking pattern is expensive and should use a window function such as DENSE_RANK().",
        "severity": 0.8,
        "fix": "SELECT department_id, id, DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank FROM employees;",
+       "keywords": [
+         "self join", "window function", "dense_rank", "ranking", "performance",
+         "self-join", "rank", "partition by", "over clause", "analytic function",
+         "quadratic", "n squared"
+       ]
      }
    ],
    "max_steps": 7

        "description": "The transfer uses two updates without a transaction, so a partial failure can corrupt balances.",
        "severity": 0.9,
        "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
+       "keywords": [
+         "transaction", "partial failure", "atomic", "commit", "security",
+         "begin", "rollback", "atomicity", "acid", "consistency",
+         "two updates", "no transaction", "data corruption"
+       ]
      },
      {
        "id": "hard_005_no_balance_guard",
        "description": "The debit statement does not verify sufficient funds before subtracting the balance.",
        "severity": 0.55,
        "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
+       "keywords": [
+         "balance guard", "insufficient funds", "where balance >=", "logic",
+         "negative balance", "overdraft", "check balance", "guard clause",
+         "minimum balance", "validation"
+       ]
+     }
+   ],
+   "max_steps": 7
+ },
+ {
+   "task_id": "hard_006",
+   "difficulty": "hard",
+   "query": "UPDATE accounts SET balance = balance - 500 WHERE user_id = 42 AND balance >= 500;",
+   "schema": {
+     "accounts": {
+       "user_id": "INT PRIMARY KEY",
+       "balance": "DECIMAL(12,2)"
+     }
+   },
+   "context": "Deduct $500 from user account for a withdrawal. Multiple withdrawal requests may arrive concurrently.",
+   "ground_truth_issues": [
+     {
+       "id": "hard_006_race_condition",
+       "category": "security",
+       "description": "Without SELECT FOR UPDATE or proper transaction isolation, concurrent requests can pass the balance check simultaneously, allowing double-spending.",
+       "severity": 0.9,
+       "fix": "BEGIN; SELECT balance FROM accounts WHERE user_id = 42 FOR UPDATE; UPDATE accounts SET balance = balance - 500 WHERE user_id = 42 AND balance >= 500; COMMIT;",
+       "keywords": [
+         "race condition", "concurrent", "double spend", "for update",
+         "transaction", "isolation", "lock", "toctou", "time of check",
+         "atomicity", "concurrent requests", "locking", "serializable"
+       ]
      }
    ],
    "max_steps": 7
  }
]
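The hard-task fixes above share one idea: fold the sufficient-funds check into the debit itself so check and update are a single atomic statement. A minimal sketch against an in-memory SQLite database (the table layout mirrors the task data, but the helper and amounts are illustrative, not part of the environment):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (42, 700)")

def withdraw(conn, user_id, amount):
    # The guard in the WHERE clause makes check-and-debit one atomic
    # statement: a row that lacks sufficient funds simply does not match,
    # so the balance can never go negative.
    cur = conn.execute(
        "UPDATE accounts SET balance = balance - ? "
        "WHERE user_id = ? AND balance >= ?",
        (amount, user_id, amount),
    )
    conn.commit()
    return cur.rowcount == 1  # True only if the debit actually applied

print(withdraw(conn, 42, 500))  # True: $700 covers the first $500
print(withdraw(conn, 42, 500))  # False: only $200 left, row not matched
```

On a server with true concurrency you would still pair this with `SELECT ... FOR UPDATE` or an equivalent row lock, as the `hard_006` fix does; SQLite serializes writers, so the guard alone suffices here.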
tasks/medium_tasks.json
CHANGED

@@ -21,7 +21,11 @@
        "description": "SELECT * pulls a wide payload when the dashboard only needs a few columns.",
        "severity": 0.3,
        "fix": "SELECT id, event_name, created_at FROM events ORDER BY created_at DESC LIMIT 50;",
-       "keywords": [
+       "keywords": [
+         "select *", "wide table", "projection", "performance", "star",
+         "all columns", "unnecessary columns", "column selection",
+         "over-fetching", "wildcard"
+       ]
      },
      {
        "id": "medium_001_missing_limit",

@@ -29,7 +33,11 @@
        "description": "The dashboard query is missing a LIMIT and can scan far more rows than necessary.",
        "severity": 0.3,
        "fix": "SELECT id, event_name, created_at FROM events ORDER BY created_at DESC LIMIT 50;",
-       "keywords": [
+       "keywords": [
+         "limit", "unbounded query", "dashboard", "performance", "no limit",
+         "missing limit", "unlimited rows", "pagination", "all rows",
+         "full scan", "row count"
+       ]
      }
    ],
    "max_steps": 5

@@ -57,7 +65,11 @@
        "description": "The correlated subquery re-counts orders per row and should be rewritten as a join with GROUP BY.",
        "severity": 0.6,
        "fix": "SELECT c.id, c.name, COUNT(o.id) AS order_count FROM customers c LEFT JOIN orders o ON o.customer_id = c.id GROUP BY c.id, c.name;",
-       "keywords": [
+       "keywords": [
+         "correlated subquery", "group by", "join", "count", "performance",
+         "subquery per row", "n+1", "rewrite", "left join", "aggregate",
+         "scalar subquery", "dependent subquery"
+       ]
      }
    ],
    "max_steps": 6

@@ -81,7 +93,11 @@
        "description": "DISTINCT is redundant because users.email is already unique.",
        "severity": 0.45,
        "fix": "SELECT email FROM users WHERE email IS NOT NULL;",
-       "keywords": [
+       "keywords": [
+         "distinct", "unique", "redundant", "email", "performance",
+         "unnecessary distinct", "unique constraint", "already unique",
+         "duplicate elimination", "deduplication", "wasted sort"
+       ]
      }
    ],
    "max_steps": 5

@@ -110,7 +126,11 @@
        "description": "Wrapping created_at with DATE() prevents efficient use of the created_at index.",
        "severity": 0.6,
        "fix": "SELECT o.id, o.total, u.name FROM orders o JOIN users u ON u.id = o.user_id WHERE o.created_at >= '2026-04-10' AND o.created_at < '2026-04-11';",
-       "keywords": [
+       "keywords": [
+         "date()", "function on column", "index", "range predicate", "performance",
+         "sargable", "non-sargable", "prevents index", "full scan",
+         "index usage", "function wrapping"
+       ]
      }
    ],
    "max_steps": 6

@@ -135,7 +155,11 @@
        "description": "Applying LOWER(name) on every row prevents the index on name from being used efficiently.",
        "severity": 0.35,
        "fix": "SELECT id, name FROM products WHERE name ILIKE 'pro%';",
-       "keywords": [
+       "keywords": [
+         "lower", "function on column", "index", "performance", "sargable",
+         "non-sargable", "case insensitive", "full scan", "table scan",
+         "function wrapping column"
+       ]
      },
      {
        "id": "medium_005_leading_wildcard",

@@ -143,10 +167,82 @@
        "description": "The leading wildcard in LIKE '%pro%' forces a full scan instead of an index-friendly prefix lookup.",
        "severity": 0.35,
        "fix": "SELECT id, name FROM products WHERE name ILIKE 'pro%';",
-       "keywords": [
+       "keywords": [
+         "leading wildcard", "%pro%", "full scan", "prefix lookup", "performance",
+         "like wildcard", "pattern matching", "index unusable", "table scan",
+         "wildcard prefix"
+       ]
      }
    ],
    "max_steps": 6
+ },
+ {
+   "task_id": "medium_006",
+   "difficulty": "medium",
+   "query": "SELECT * FROM events WHERE DATE(created_at) = '2024-01-15';",
+   "schema": {
+     "events": {
+       "id": "INT PRIMARY KEY",
+       "name": "VARCHAR(255)",
+       "created_at": "TIMESTAMP",
+       "INDEX": "idx_created_at ON events(created_at)"
+     }
+   },
+   "context": "Find all events that happened on a specific date.",
+   "ground_truth_issues": [
+     {
+       "id": "medium_006_function_on_index",
+       "category": "performance",
+       "description": "Using DATE() function on an indexed column prevents index usage. Use a range comparison instead.",
+       "severity": 0.7,
+       "fix": "SELECT * FROM events WHERE created_at >= '2024-01-15 00:00:00' AND created_at < '2024-01-16 00:00:00';",
+       "keywords": [
+         "function on column", "date function", "index", "sargable",
+         "non-sargable", "prevents index", "range comparison", "full scan",
+         "table scan", "index usage", "function wrapping column"
+       ]
+     },
+     {
+       "id": "medium_006_star",
+       "category": "performance",
+       "description": "SELECT * returns all columns when only specific fields may be needed.",
+       "severity": 0.2,
+       "fix": "SELECT id, name, created_at FROM events WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16';",
+       "keywords": [
+         "select *", "star", "all columns", "projection", "unnecessary columns",
+         "wildcard", "over-fetching", "column selection"
+       ]
+     }
+   ],
+   "max_steps": 6
+ },
+ {
+   "task_id": "medium_007",
+   "difficulty": "medium",
+   "query": "SELECT * FROM products ORDER BY RAND() LIMIT 10;",
+   "schema": {
+     "products": {
+       "id": "INT PRIMARY KEY",
+       "name": "VARCHAR(255)",
+       "price": "DECIMAL(10,2)",
+       "category": "VARCHAR(64)"
+     }
+   },
+   "context": "Show 10 random products on the homepage.",
+   "ground_truth_issues": [
+     {
+       "id": "medium_007_order_rand",
+       "category": "performance",
+       "description": "ORDER BY RAND() generates a random value for every row in the table, causing a full table scan and sort. Extremely slow on large tables.",
+       "severity": 0.8,
+       "fix": "SELECT * FROM products WHERE id >= (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM products))) LIMIT 10;",
+       "keywords": [
+         "order by rand", "random", "full table scan", "sort", "performance",
+         "slow", "every row", "random ordering", "rand function",
+         "expensive sort", "large table"
+       ]
+     }
+   ],
+   "max_steps": 5
  }
]
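Several medium tasks (medium_004, medium_005, medium_006) hinge on sargability: wrapping an indexed column in a function hides it from the index. The effect is easy to observe with SQLite's EXPLAIN QUERY PLAN; the schema here mirrors the medium_006 task data, but the snippet is only an illustration, not environment code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_created_at ON events(created_at)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite will seek an index or scan.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r) for r in rows)

# Non-sargable: DATE() hides created_at from the index, forcing a scan.
scan = plan("SELECT id FROM events WHERE DATE(created_at) = '2024-01-15'")

# Sargable: a half-open range on the bare column can use idx_created_at.
seek = plan(
    "SELECT id FROM events "
    "WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16'"
)

print("idx_created_at" in scan)  # False: plan is a full table scan
print("idx_created_at" in seek)  # True: plan searches the index
```

The same rewrite (function call replaced by a range predicate) is exactly what the medium_004 and medium_006 fixes apply.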
tests/test_api.py
CHANGED

@@ -103,7 +103,7 @@ def test_request_more_context_returns_context_shared_flag() -> None:

     assert response.status_code == 200
     payload = response.json()
-    assert payload["reward"] == 0.0
+    assert payload["reward"] == -0.03
     assert "context_shared" in payload["info"]
     assert payload["info"]["context_shared"] is True
     assert payload["done"] is False
tests/test_reward.py
CHANGED

@@ -46,22 +46,23 @@ def test_identify_issue_no_match_returns_penalty() -> None:

 def test_identify_issue_match_no_fix_zero_confidence() -> None:
     # base_reward = min(0.35, 0.35) = 0.35; fix_bonus = 0; confidence_bonus = 0
-    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35)) == pytest.approx(0.35)
+    # order_bonus = 0.04 * (1/(0+1)) = 0.04 β†’ total = 0.39
+    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35)) == pytest.approx(0.39)


 def test_identify_issue_match_no_fix_full_confidence() -> None:
-    # base=0.35 + confidence_bonus=min(0.05, 1.0*0.35*0.08)=0.028
-    assert compute_reward(_action("identify_issue", confidence=1.0), _issue(0.35)) == pytest.approx(0.378)
+    # base=0.35 + confidence_bonus=min(0.05, 1.0*0.35*0.08)=0.028 + order_bonus=0.04 β†’ 0.418
+    assert compute_reward(_action("identify_issue", confidence=1.0), _issue(0.35)) == pytest.approx(0.418)


 def test_identify_issue_match_with_fix_zero_confidence() -> None:
-    # base=0.35 + fix_bonus=0.08
-    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35), fix_valid=True) == pytest.approx(0.43)
+    # base=0.35 + fix_bonus=0.08 + order_bonus=0.04 = 0.47, capped at 0.45
+    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35), fix_valid=True) == pytest.approx(0.45)


 def test_identify_issue_high_severity_capped_at_035_base() -> None:
-    # min(0.9, 0.35) = 0.35
-    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(severity=0.9)) == pytest.approx(0.35)
+    # min(0.9, 0.35) = 0.35 + order_bonus=0.04 = 0.39
+    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(severity=0.9)) == pytest.approx(0.39)


 # ── suggest_fix ───────────────────────────────────────────────────────────────

@@ -96,4 +97,10 @@

 # ── request_more_context ──────────────────────────────────────────────────────

 def test_request_more_context_returns_zero() -> None:
+    # No schema_available β†’ returns 0.0
     assert compute_reward(_action("request_more_context"), None) == pytest.approx(0.0)
+
+
+def test_request_more_context_with_schema_returns_penalty() -> None:
+    # schema_available=True β†’ returns -0.03
+    assert compute_reward(_action("request_more_context"), None, schema_available=True) == pytest.approx(-0.03)
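The comments in the updated tests pin down the shaping arithmetic for identify_issue. The real implementation lives in server/reward.py; the sketch below is only a reconstruction inferred from the test expectations, with parameter names (`issues_already_found`, `compute_reward_sketch`) that are hypothetical:

```python
def compute_reward_sketch(severity, confidence=0.0, fix_valid=False,
                          issues_already_found=0):
    # Base reward is the issue severity, capped at 0.35.
    base = min(severity, 0.35)
    # A valid suggested fix adds a flat bonus.
    fix_bonus = 0.08 if fix_valid else 0.0
    # Confidence bonus scales with severity and is capped at 0.05.
    confidence_bonus = min(0.05, confidence * severity * 0.08)
    # Earlier finds earn a larger order bonus: 0.04 * 1/(n+1).
    order_bonus = 0.04 * (1.0 / (issues_already_found + 1))
    # Total for one correct identification is capped at 0.45.
    return min(0.45, base + fix_bonus + confidence_bonus + order_bonus)

print(round(compute_reward_sketch(0.35), 3))                  # 0.39
print(round(compute_reward_sketch(0.35, confidence=1.0), 3))  # 0.418
print(round(compute_reward_sketch(0.35, fix_valid=True), 3))  # 0.45
```

The three printed values match the totals asserted in test_reward.py (0.39, 0.418, and the 0.45 cap), which is the check that any refactor of the real reward function has to keep passing.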