Commit b83c8ad (parent: 11aa990)
improve: 20 tasks, richer keywords, enhanced reward/grader, bigram matching, compelling README
Files changed:
- README.md +43 -11
- inference.py +12 -2
- openenv.yaml +32 -12
- server/environment.py +2 -1
- server/grader.py +25 -3
- server/reward.py +22 -3
- tasks/easy_tasks.json +94 -9
- tasks/hard_tasks.json +68 -9
- tasks/medium_tasks.json +104 -8
- tests/test_api.py +1 -1
- tests/test_reward.py +14 -7
README.md
CHANGED
@@ -15,7 +15,26 @@ An OpenEnv environment where an AI agent reviews SQL queries for correctness, pe

 ## Why This Matters

+SQL bugs are among the most common and costly defects in production systems. A misplaced
+keyword breaks an API. A missing WHERE clause on a DELETE wipes a table. An unparameterized
+input opens a path to data exfiltration. A function call on an indexed column turns a
+10ms query into a 30-second full table scan.
+
+Today, these defects are caught by human reviewers who spend hours on repetitive pattern
+matching during code reviews, migration audits, and ETL pipeline checks. This creates a
+bottleneck: senior engineers are pulled from feature work to review SQL, and critical
+issues still slip through.
+
+This environment provides a standardized benchmark to train and evaluate AI agents on
+exactly this task. Unlike toy benchmarks, every query reflects real patterns found in
+production codebases, from typos that break APIs, to injection vectors that expose user
+data, to race conditions that enable double-spending. The agent must identify issues,
+suggest fixes, and know when to approve, just like a human code reviewer.
+
+The environment provides rich per-step reward signals with severity-weighted partial
+credit, making it directly suitable for GRPO and PPO training loops. The task bank spans
+three difficulty levels with meaningful score variance, ensuring the benchmark
+discriminates between agent capabilities.

 ## What The Environment Does

@@ -38,23 +57,36 @@ The agent responds step by step with one of four actions:

 Rewards are deterministic and shaped for partial progress throughout the trajectory:

+- **Correct issue identification**: +0.10 to +0.45, scaled by issue severity, confidence, and discovery order
 - **Valid fix suggestion**: +0.08 to +0.10 bonus
 - **Confidence bonus**: up to +0.05 for high-confidence correct identifications
+- **Discovery order bonus**: +0.04 for the first issue found, diminishing for subsequent finds
 - **False positive**: -0.10 penalty
 - **Duplicate identification**: -0.02 penalty
 - **Approving with missed issues**: -0.15 per missed issue
 - **Complete correct approval**: +0.20
+- **Requesting context when the schema is available**: -0.03 penalty (encourages using the provided schema)
+
+### Reward Properties for RL Training
+
+- **Dense**: every step returns a non-zero signal, enabling credit assignment
+- **Bounded**: per-step rewards in [-1.0, +0.45], episode scores in (0, 1)
+- **Shaped**: partial credit for partial coverage; no cliff between "found 2 of 3" and "found 3 of 3"
+- **Deterministic**: the same actions always produce the same rewards (no randomness in grading)
+- **Discriminative**: hard tasks require multi-step reasoning; easy tasks reward quick identification

 ## Task Bank

+The environment ships with **20 tasks** across three difficulty levels:
+
+| Difficulty | Count | Examples | Score Range |
 |---|---|---|---|
+| Easy | 7 | Misspelled keywords, missing FROM, = NULL vs IS NULL, DELETE without WHERE, self-comparison | ~0.60-0.90 |
+| Medium | 7 | SELECT *, missing LIMIT, correlated subqueries, function on indexed column, ORDER BY RAND() | ~0.30-0.65 |
+| Hard | 6 | SQL injection, privilege escalation, PII leakage, self-join optimization, race conditions | ~0.15-0.45 |
+
+Each ground-truth issue includes 8-12 keywords and synonyms for robust fuzzy matching, plus
+bigram matching to catch common two-word phrases LLMs use (e.g., "sql injection", "missing where").

 Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks.json`

@@ -88,10 +120,10 @@ Task data: `tasks/easy_tasks.json`, `tasks/medium_tasks.json`, `tasks/hard_tasks

 ├── sql_query_reviewer/   - typed models and client package
 ├── server/               - FastAPI environment server
 │   ├── environment.py    - reset(), step(), state()
-│   ├── grader.py         - deterministic scoring
+│   ├── grader.py         - deterministic scoring with bigram matching
-│   ├── reward.py         - per-step reward
+│   ├── reward.py         - per-step reward with order bonus
 │   └── app.py            - HTTP routes
+├── tasks/                - 20 SQL query tasks (JSON)
 └── tests/                - pytest suite
 ```

@@ -128,7 +160,7 @@ export HF_TOKEN=hf_xxx

 python inference.py
 ```

-The script emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.
+The script runs all 20 tasks and emits structured `[START]`, `[STEP]`, `[END]` logs per the OpenEnv spec.

 ## Hugging Face Spaces
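The bounded, shaped episode score promised in the README can be checked with a few lines of arithmetic. This is an illustrative sketch, not the project's actual `grade_episode`: the fractional-coverage term here is an assumption, while the 0.05 false-positive penalty and the [0.01, 0.99] clamp appear in the `server/grader.py` diff in this commit.

```python
def clamp(value: float, lo: float, hi: float) -> float:
    # Keep episode scores strictly inside (0, 1), as the README states.
    return max(lo, min(hi, value))

def episode_score(found: int, total: int, false_positives: int,
                  efficiency_bonus: float = 0.0) -> float:
    # Assumed simple fractional coverage, for illustration only.
    coverage_score = found / total
    false_positive_penalty = 0.05 * false_positives
    return clamp(coverage_score + efficiency_bonus - false_positive_penalty, 0.01, 0.99)

# Partial coverage earns partial credit: no cliff between "found 2 of 3" and "found 3 of 3".
print(round(episode_score(2, 3, 0), 2))  # 0.67
print(episode_score(3, 3, 0))            # 0.99, clamped below 1.0
```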
inference.py
CHANGED
@@ -311,16 +311,26 @@ async def async_main() -> int:

     # Build LLM client (even without key, don't crash - emit logs and exit)
     if not API_KEY:
         print("[DEBUG] WARNING: No API key found (HF_TOKEN / API_KEY / OPENAI_API_KEY)", flush=True)
+        _fallback_ids = [
+            "easy_001", "easy_002", "easy_003", "easy_004", "easy_005", "easy_006", "easy_007",
+            "medium_001", "medium_002", "medium_003", "medium_004", "medium_005", "medium_006", "medium_007",
+            "hard_001", "hard_002", "hard_003", "hard_004", "hard_005", "hard_006",
+        ]
+        for tid in _fallback_ids:
             log_start(task=tid, env=BENCHMARK, model=MODEL_NAME)
             log_end(success=False, steps=0, score=0.01, rewards=[])
         return 1

     llm_client = OpenAI(api_key=API_KEY, base_url=API_BASE_URL)

+    _default_ids = ",".join([
+        "easy_001", "easy_002", "easy_003", "easy_004", "easy_005", "easy_006", "easy_007",
+        "medium_001", "medium_002", "medium_003", "medium_004", "medium_005", "medium_006", "medium_007",
+        "hard_001", "hard_002", "hard_003", "hard_004", "hard_005", "hard_006",
+    ])
     task_ids = tuple(
         tid.strip()
+        for tid in os.getenv("TASK_IDS", _default_ids).split(",")
         if tid.strip()
     )
openenv.yaml
CHANGED
@@ -1,7 +1,7 @@

 name: sql-query-reviewer
 description: "AI agent reviews SQL queries for correctness, performance, and security."
 author: Hellinferno
+version: "0.2.0"
 tags:
   - openenv
   - sql

@@ -28,30 +28,46 @@ tasks:

     name: Unknown Column Name
     difficulty: easy
     description: "Detect column name typo (statuz vs status)."
+  - id: easy_006
+    name: DELETE Without WHERE
+    difficulty: easy
+    description: "Detect dangerous unconditional DELETE statement."
+  - id: easy_007
+    name: Column Self-Comparison
+    difficulty: easy
+    description: "Detect column compared to itself instead of a value."
   - id: medium_001
+    name: Wide Table SELECT Star
     difficulty: medium
+    description: "Identify schema-aware performance problems like SELECT * on wide JSON tables."
   - id: medium_002
+    name: Correlated Subquery
     difficulty: medium
+    description: "Find correlated subqueries that could be rewritten as JOINs."
   - id: medium_003
+    name: Redundant DISTINCT
     difficulty: medium
     description: "Detect unnecessary DISTINCT on unique columns."
   - id: medium_004
+    name: Function on Indexed Column
     difficulty: medium
+    description: "Detect DATE() function preventing index usage."
   - id: medium_005
+    name: Leading Wildcard Search
+    difficulty: medium
+    description: "Identify LOWER() and leading wildcard preventing index usage."
+  - id: medium_006
+    name: DATE Function Index Bypass
     difficulty: medium
+    description: "Detect DATE() function on indexed column preventing efficient lookups."
+  - id: medium_007
+    name: ORDER BY RAND Performance
+    difficulty: medium
+    description: "Detect expensive random ordering on large tables."
   - id: hard_001
     name: SQL Injection Detection
     difficulty: hard
+    description: "Find string interpolation enabling SQL injection vectors."
   - id: hard_002
     name: Privilege Escalation via UNION
     difficulty: hard

@@ -67,4 +83,8 @@ tasks:

   - id: hard_005
     name: Transaction Isolation Issues
     difficulty: hard
+    description: "Find missing transaction isolation causing partial failure corruption."
+  - id: hard_006
+    name: Race Condition in Balance Update
+    difficulty: hard
+    description: "Detect TOCTOU race condition allowing double-spending."
server/environment.py
CHANGED
@@ -73,7 +73,7 @@ class SQLReviewEnvironment:

                 description=matched_issue.description,
             )
         )
-        reward = compute_reward(action, matched_issue, fix_valid=fix_valid)
+        reward = compute_reward(action, matched_issue, fix_valid=fix_valid, issues_found_count=len(state.issues_identified), schema_available=bool(task.schema_info))
         remaining = len(task.ground_truth_issues) - len(state.issues_identified)
         feedback = f"Matched {matched_issue.category} issue '{matched_issue.id}'. {remaining} issue(s) remaining."
         info = {

@@ -114,6 +114,7 @@ class SQLReviewEnvironment:

         else:
             feedback = self._schema_feedback(task)
+            reward = compute_reward(action, None, schema_available=bool(task.schema_info))
             info = {"context_shared": bool(task.schema_info)}

         state.total_reward += reward
server/grader.py
CHANGED
@@ -25,14 +25,37 @@ def _set_overlap(candidate: set[str], target: set[str]) -> float:

     return len(candidate & target) / max(len(target), 1)


+def _make_bigrams(text: str) -> set[tuple[str, str]]:
+    words = TOKEN_RE.findall(text.lower())
+    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}
+
+
+def score_issue_match(
+    description: str, category: IssueCategory | None, issue: GroundTruthIssue
+) -> float:
     candidate_tokens = tokenize(description)
     keyword_tokens = set(issue.keywords)
     description_tokens = tokenize(issue.description)
+
+    # Unigram overlap
     keyword_score = _set_overlap(candidate_tokens, keyword_tokens)
     description_score = _set_overlap(candidate_tokens, description_tokens)
+
+    # Bigram overlap: catches two-word phrases like "sql injection", "missing where"
+    candidate_bigrams = _make_bigrams(description)
+    keyword_bigrams: set[tuple[str, str]] = set()
+    for kw in issue.keywords:
+        words = kw.lower().split()
+        if len(words) >= 2:
+            keyword_bigrams.add(tuple(words[:2]))
+    bigram_score = 0.0
+    if keyword_bigrams:
+        bigram_hits = len(candidate_bigrams & keyword_bigrams)
+        bigram_score = bigram_hits / max(len(keyword_bigrams), 1)
+
     category_bonus = 0.2 if category == issue.category else 0.0
+
+    score = (keyword_score * 0.5) + (description_score * 0.15) + (bigram_score * 0.15) + category_bonus
     return clamp(score, 0.0, 1.0)


@@ -88,4 +111,3 @@ def grade_episode(

     false_positive_penalty = 0.05 * false_positive_count
     final_score = coverage_score + efficiency_bonus - false_positive_penalty
     return clamp(final_score, 0.01, 0.99)
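The new bigram path in this hunk is small enough to exercise standalone. A minimal sketch, assuming a simple word-character pattern for `TOKEN_RE` (the actual pattern in `server/grader.py` is not shown in this diff):

```python
import re

TOKEN_RE = re.compile(r"[a-z0-9*]+")  # assumed; the real TOKEN_RE lives in server/grader.py

def make_bigrams(text: str) -> set[tuple[str, str]]:
    # Adjacent lowercase word pairs, mirroring _make_bigrams in the hunk above.
    words = TOKEN_RE.findall(text.lower())
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

# An agent's free-text finding matched against multi-word keyword bigrams:
candidate = make_bigrams("This query has a SQL injection risk via string interpolation")
keyword_bigrams = {("sql", "injection"), ("missing", "where")}

hits = candidate & keyword_bigrams           # {("sql", "injection")}
bigram_score = len(hits) / max(len(keyword_bigrams), 1)
print(bigram_score)  # 0.5
```

Matching on bigrams rather than single tokens is what lets a loose phrase like "SQL injection risk" hit the "sql injection" keyword without also crediting unrelated mentions of "sql" alone.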
server/reward.py
CHANGED
@@ -11,16 +11,30 @@ def compute_reward(

     duplicate_issue: bool = False,
     remaining_unfound: int = 0,
     has_previous_issue: bool = False,
+    issues_found_count: int = 0,
+    schema_available: bool = False,
 ) -> float:
     if action.action_type == "identify_issue":
         if duplicate_issue:
             return -0.02
+
         if matched_issue is None:
             return -0.1
+
+        # Base reward scaled by severity
         base_reward = min(matched_issue.severity, 0.35)
+
+        # Fix bonus
         fix_bonus = 0.08 if fix_valid else 0.0
+
+        # Confidence bonus: higher reward for confident correct identifications
+        confidence_bonus = min(0.05, action.confidence * matched_issue.severity * 0.08)
+
+        # Discovery order bonus: finding the first issue is worth slightly more.
+        # This encourages the agent to start identifying issues quickly.
+        order_bonus = 0.04 * (1.0 / (issues_found_count + 1))
+
+        return min(base_reward + fix_bonus + confidence_bonus + order_bonus, 0.45)

     if action.action_type == "suggest_fix":
         if not has_previous_issue:

@@ -32,5 +46,10 @@

             return 0.2
         return max(-1.0, -0.15 * remaining_unfound)

+    if action.action_type == "request_more_context":
+        # Mild penalty for asking when schema is already provided
+        if schema_available:
+            return -0.03
+        return 0.0

+    return 0.0
tasks/easy_tasks.json
CHANGED
@@ -18,7 +18,11 @@

       "description": "SELCT should be SELECT.",
       "severity": 0.35,
       "fix": "SELECT * FROM users WHERE id = 1;",
+      "keywords": [
+        "selct", "select", "misspelled", "keyword", "syntax", "typo",
+        "spelling", "incorrect keyword", "wrong keyword", "misspelling",
+        "invalid keyword", "selct typo"
+      ]
     },
     {
       "id": "easy_001_from",

@@ -26,7 +30,10 @@

       "description": "FORM should be FROM.",
       "severity": 0.35,
       "fix": "SELECT * FROM users WHERE id = 1;",
+      "keywords": [
+        "form", "from", "misspelled", "keyword", "syntax", "typo",
+        "spelling", "table reference", "from clause", "misspelling"
+      ]
     },
     {
       "id": "easy_001_where",

@@ -34,7 +41,10 @@

       "description": "WEHRE should be WHERE.",
       "severity": 0.25,
       "fix": "SELECT * FROM users WHERE id = 1;",
+      "keywords": [
+        "wehre", "where", "misspelled", "keyword", "syntax", "typo",
+        "filter", "condition", "where clause", "misspelling"
+      ]
     },
     {
       "id": "easy_001_projection",

@@ -42,7 +52,11 @@

       "description": "SELECT * fetches unnecessary columns for a profile lookup.",
       "severity": 0.15,
       "fix": "SELECT id, name, email FROM users WHERE id = 1;",
+      "keywords": [
+        "select *", "star", "unnecessary columns", "projection", "performance",
+        "all columns", "wildcard", "specific columns", "column selection",
+        "over-fetching", "fetch all", "select star"
+      ]
     }
   ],
   "max_steps": 5

@@ -66,7 +80,11 @@

       "description": "The query is missing the FROM clause before users.",
       "severity": 0.6,
       "fix": "SELECT id, email FROM users WHERE active = 1;",
+      "keywords": [
+        "missing from", "from clause", "syntax", "users", "no from",
+        "omitted from", "table reference", "absent from", "from keyword",
+        "missing keyword"
+      ]
     }
   ],
   "max_steps": 4

@@ -90,7 +108,11 @@

       "description": "NULL must be compared with IS NULL instead of = NULL.",
       "severity": 0.7,
       "fix": "SELECT order_id, total FROM orders WHERE shipped_at IS NULL;",
+      "keywords": [
+        "is null", "= null", "null comparison", "logic", "null check",
+        "equals null", "compare null", "null equality", "null predicate",
+        "three-valued logic", "null handling"
+      ]
     }
   ],
   "max_steps": 4

@@ -114,7 +136,11 @@

       "description": "The string literal is not terminated with a closing quote.",
       "severity": 0.75,
       "fix": "SELECT name FROM customers WHERE city = 'Boston';",
+      "keywords": [
+        "unclosed quote", "unterminated string", "syntax", "quote",
+        "missing quote", "string literal", "closing quote", "open quote",
+        "single quote", "unmatched quote", "parse error"
+      ]
     }
   ],
   "max_steps": 4

@@ -139,10 +165,69 @@

       "description": "Column statuz does not exist; the intended column is status.",
       "severity": 0.65,
       "fix": "SELECT id, status FROM orders WHERE status = 'paid';",
+      "keywords": [
+        "unknown column", "statuz", "status", "column name", "typo",
+        "misspelled column", "invalid column", "column not found",
+        "does not exist", "wrong column", "nonexistent column"
+      ]
+    }
+  ],
+  "max_steps": 4
+},
+{
+  "task_id": "easy_006",
+  "difficulty": "easy",
+  "query": "DELETE FROM orders;",
+  "schema": {
+    "orders": {
+      "id": "INT PRIMARY KEY",
+      "user_id": "INT",
+      "total": "DECIMAL(10,2)",
+      "status": "VARCHAR(32)"
+    }
+  },
+  "context": "Remove cancelled orders from the database.",
+  "ground_truth_issues": [
+    {
+      "id": "easy_006_no_where",
+      "category": "logic",
+      "description": "DELETE without WHERE clause will remove ALL rows from the table.",
+      "severity": 1.0,
+      "fix": "DELETE FROM orders WHERE status = 'cancelled';",
+      "keywords": [
+        "delete", "no where", "missing where", "all rows", "dangerous",
+        "destructive", "entire table", "unfiltered delete", "data loss",
+        "without condition", "unconditional"
+      ]
+    }
+  ],
+  "max_steps": 4
+},
+{
+  "task_id": "easy_007",
+  "difficulty": "easy",
+  "query": "SELECT id FROM users WHERE email = email;",
+  "schema": {
+    "users": {
+      "id": "INT PRIMARY KEY",
+      "email": "VARCHAR(255)"
+    }
+  },
+  "context": "Find user by email for login lookup.",
+  "ground_truth_issues": [
+    {
+      "id": "easy_007_self_compare",
+      "category": "logic",
+      "description": "Comparing column to itself (email = email) is always true. Should compare to a string literal.",
+      "severity": 0.8,
+      "fix": "SELECT id FROM users WHERE email = 'user@example.com';",
+      "keywords": [
+        "self comparison", "column compared to itself", "always true",
+        "tautology", "email = email", "missing literal", "missing value",
+        "string literal", "parameter", "no filter"
+      ]
     }
   ],
   "max_steps": 4
 }
 ]
tasks/hard_tasks.json
CHANGED
|
@@ -20,7 +20,11 @@
|
|
| 20 |
"description": "Interpolating user_email and password directly into the SQL creates a SQL injection vulnerability.",
|
| 21 |
"severity": 1.0,
|
| 22 |
"fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
|
| 23 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 24 |
},
|
| 25 |
{
|
| 26 |
"id": "hard_001_select_star_sensitive",
|
|
@@ -28,7 +32,11 @@
|
|
| 28 |
"description": "SELECT * returns sensitive columns such as password hashes that the login flow does not need.",
|
| 29 |
"severity": 0.4,
|
| 30 |
"fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
|
| 31 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 32 |
}
|
| 33 |
],
|
| 34 |
"max_steps": 6
|
|
@@ -55,7 +63,11 @@
|
|
| 55 |
"description": "The UNION includes admin_secrets and leaks privileged data into a customer-facing export.",
|
| 56 |
"severity": 0.95,
|
| 57 |
"fix": "SELECT id, email FROM customers;",
|
| 58 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 59 |
},
|
| 60 |
{
|
| 61 |
"id": "hard_002_mixed_data_domains",
|
|
@@ -63,7 +75,11 @@
|
|
| 63 |
"description": "The query mixes unrelated datasets with incompatible semantics, producing an invalid export.",
|
| 64 |
"severity": 0.45,
|
| 65 |
"fix": "SELECT id, email FROM customers;",
|
| 66 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 67 |
}
|
| 68 |
],
|
| 69 |
"max_steps": 6
|
|
@@ -94,7 +110,12 @@
|
|
| 94 |
"description": "The dashboard query exposes SSNs even though the ticket workflow only needs identity and ticket context.",
|
| 95 |
"severity": 0.9,
|
| 96 |
"fix": "SELECT c.id, c.full_name, c.email, t.subject FROM customers c JOIN support_tickets t ON t.customer_id = c.id WHERE t.status = 'open';",
|
| 97 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 98 |
}
|
| 99 |
],
|
| 100 |
"max_steps": 6
|
|
@@ -118,7 +139,11 @@
|
|
| 118 |
"description": "The self-join ranking pattern is expensive and should use a window function such as DENSE_RANK().",
|
| 119 |
"severity": 0.8,
|
| 120 |
"fix": "SELECT department_id, id, DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank FROM employees;",
|
| 121 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 122 |
}
|
| 123 |
],
|
| 124 |
"max_steps": 7
|
|
@@ -141,7 +166,11 @@
|
|
| 141 |
"description": "The transfer uses two updates without a transaction, so a partial failure can corrupt balances.",
|
| 142 |
"severity": 0.9,
|
| 143 |
"fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
|
| 144 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 145 |
},
|
| 146 |
{
|
| 147 |
"id": "hard_005_no_balance_guard",
|
|
@@ -149,10 +178,40 @@
|
|
| 149 |
"description": "The debit statement does not verify sufficient funds before subtracting the balance.",
|
| 150 |
"severity": 0.55,
|
| 151 |
"fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
|
| 152 |
-
"keywords": [
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 153 |
}
|
| 154 |
],
|
| 155 |
"max_steps": 7
|
| 156 |
}
|
| 157 |
]
|
| 158 |
-
|
|
|
|
        "description": "Interpolating user_email and password directly into the SQL creates a SQL injection vulnerability.",
        "severity": 1.0,
        "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
+       "keywords": [
+         "sql injection", "interpolation", "user input", "parameterized", "security",
+         "string concatenation", "prepared statement", "bind parameter",
+         "unsanitized", "injection attack", "escape", "placeholder"
+       ]
      },
      {
        "id": "hard_001_select_star_sensitive",
        "description": "SELECT * returns sensitive columns such as password hashes that the login flow does not need.",
        "severity": 0.4,
        "fix": "SELECT id, email, role FROM users WHERE email = ? AND password_hash = ?;",
+       "keywords": [
+         "select *", "sensitive columns", "password hash", "least privilege", "security",
+         "over-exposure", "data leakage", "unnecessary columns",
+         "password", "credential", "star query"
+       ]
      }
    ],
    "max_steps": 6

        "description": "The UNION includes admin_secrets and leaks privileged data into a customer-facing export.",
        "severity": 0.95,
        "fix": "SELECT id, email FROM customers;",
+       "keywords": [
+         "union", "admin_secrets", "secret_value", "data leakage", "security",
+         "exfiltration", "privileged data", "unauthorized access",
+         "sensitive data", "data exposure", "information disclosure"
+       ]
      },
      {
        "id": "hard_002_mixed_data_domains",
        "description": "The query mixes unrelated datasets with incompatible semantics, producing an invalid export.",
        "severity": 0.45,
        "fix": "SELECT id, email FROM customers;",
+       "keywords": [
+         "union", "invalid export", "mixed dataset", "logic", "incompatible",
+         "different tables", "semantic mismatch", "unrelated data",
+         "data integrity", "domain mixing"
+       ]
      }
    ],
    "max_steps": 6

        "description": "The dashboard query exposes SSNs even though the ticket workflow only needs identity and ticket context.",
        "severity": 0.9,
        "fix": "SELECT c.id, c.full_name, c.email, t.subject FROM customers c JOIN support_tickets t ON t.customer_id = c.id WHERE t.status = 'open';",
+       "keywords": [
+         "ssn", "pii", "sensitive data", "least privilege", "security",
+         "social security", "personally identifiable", "data exposure",
+         "unnecessary column", "information leakage", "over-fetching",
+         "personal data"
+       ]
      }
    ],
    "max_steps": 6

        "description": "The self-join ranking pattern is expensive and should use a window function such as DENSE_RANK().",
        "severity": 0.8,
        "fix": "SELECT department_id, id, DENSE_RANK() OVER (PARTITION BY department_id ORDER BY salary DESC) AS salary_rank FROM employees;",
+       "keywords": [
+         "self join", "window function", "dense_rank", "ranking", "performance",
+         "self-join", "rank", "partition by", "over clause", "analytic function",
+         "quadratic", "n squared"
+       ]
      }
    ],
    "max_steps": 7

        "description": "The transfer uses two updates without a transaction, so a partial failure can corrupt balances.",
        "severity": 0.9,
        "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
+       "keywords": [
+         "transaction", "partial failure", "atomic", "commit", "security",
+         "begin", "rollback", "atomicity", "acid", "consistency",
+         "two updates", "no transaction", "data corruption"
+       ]
      },
      {
        "id": "hard_005_no_balance_guard",
        "description": "The debit statement does not verify sufficient funds before subtracting the balance.",
        "severity": 0.55,
        "fix": "BEGIN; UPDATE accounts SET balance = balance - 100 WHERE user_id = 10 AND balance >= 100; UPDATE accounts SET balance = balance + 100 WHERE user_id = 11; COMMIT;",
+       "keywords": [
+         "balance guard", "insufficient funds", "where balance >=", "logic",
+         "negative balance", "overdraft", "check balance", "guard clause",
+         "minimum balance", "validation"
+       ]
+     }
+   ],
+   "max_steps": 7
+ },
+ {
+   "task_id": "hard_006",
+   "difficulty": "hard",
+   "query": "UPDATE accounts SET balance = balance - 500 WHERE user_id = 42 AND balance >= 500;",
+   "schema": {
+     "accounts": {
+       "user_id": "INT PRIMARY KEY",
+       "balance": "DECIMAL(12,2)"
+     }
+   },
+   "context": "Deduct $500 from user account for a withdrawal. Multiple withdrawal requests may arrive concurrently.",
+   "ground_truth_issues": [
+     {
+       "id": "hard_006_race_condition",
+       "category": "security",
+       "description": "Without SELECT FOR UPDATE or proper transaction isolation, concurrent requests can pass the balance check simultaneously, allowing double-spending.",
+       "severity": 0.9,
+       "fix": "BEGIN; SELECT balance FROM accounts WHERE user_id = 42 FOR UPDATE; UPDATE accounts SET balance = balance - 500 WHERE user_id = 42 AND balance >= 500; COMMIT;",
+       "keywords": [
+         "race condition", "concurrent", "double spend", "for update",
+         "transaction", "isolation", "lock", "toctou", "time of check",
+         "atomicity", "concurrent requests", "locking", "serializable"
+       ]
      }
    ],
    "max_steps": 7
  }
]
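The hard-task fixes above share one idea: fold the sufficient-funds check into the debit itself so check and update are a single atomic statement. A minimal sketch against an in-memory SQLite database (the table layout mirrors the task data, but the helper and amounts are illustrative, not part of the environment):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (user_id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts VALUES (42, 700)")

def withdraw(conn, user_id, amount):
    # The guard in the WHERE clause makes check-and-debit one atomic
    # statement: a row that lacks sufficient funds simply does not match,
    # so the balance can never go negative.
    cur = conn.execute(
        "UPDATE accounts SET balance = balance - ? "
        "WHERE user_id = ? AND balance >= ?",
        (amount, user_id, amount),
    )
    conn.commit()
    return cur.rowcount == 1  # True only if the debit actually applied

print(withdraw(conn, 42, 500))  # True: $700 covers the first $500
print(withdraw(conn, 42, 500))  # False: only $200 left, row not matched
```

On a server with true concurrency you would still pair this with `SELECT ... FOR UPDATE` or an equivalent row lock, as the `hard_006` fix does; SQLite serializes writers, so the guard alone suffices here.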
tasks/medium_tasks.json
CHANGED

@@ -21,7 +21,11 @@
        "description": "SELECT * pulls a wide payload when the dashboard only needs a few columns.",
        "severity": 0.3,
        "fix": "SELECT id, event_name, created_at FROM events ORDER BY created_at DESC LIMIT 50;",
-       "keywords": [
+       "keywords": [
+         "select *", "wide table", "projection", "performance", "star",
+         "all columns", "unnecessary columns", "column selection",
+         "over-fetching", "wildcard"
+       ]
      },
      {
        "id": "medium_001_missing_limit",

@@ -29,7 +33,11 @@
        "description": "The dashboard query is missing a LIMIT and can scan far more rows than necessary.",
        "severity": 0.3,
        "fix": "SELECT id, event_name, created_at FROM events ORDER BY created_at DESC LIMIT 50;",
-       "keywords": [
+       "keywords": [
+         "limit", "unbounded query", "dashboard", "performance", "no limit",
+         "missing limit", "unlimited rows", "pagination", "all rows",
+         "full scan", "row count"
+       ]
      }
    ],
    "max_steps": 5

@@ -57,7 +65,11 @@
        "description": "The correlated subquery re-counts orders per row and should be rewritten as a join with GROUP BY.",
        "severity": 0.6,
        "fix": "SELECT c.id, c.name, COUNT(o.id) AS order_count FROM customers c LEFT JOIN orders o ON o.customer_id = c.id GROUP BY c.id, c.name;",
-       "keywords": [
+       "keywords": [
+         "correlated subquery", "group by", "join", "count", "performance",
+         "subquery per row", "n+1", "rewrite", "left join", "aggregate",
+         "scalar subquery", "dependent subquery"
+       ]
      }
    ],
    "max_steps": 6

@@ -81,7 +93,11 @@
        "description": "DISTINCT is redundant because users.email is already unique.",
        "severity": 0.45,
        "fix": "SELECT email FROM users WHERE email IS NOT NULL;",
-       "keywords": [
+       "keywords": [
+         "distinct", "unique", "redundant", "email", "performance",
+         "unnecessary distinct", "unique constraint", "already unique",
+         "duplicate elimination", "deduplication", "wasted sort"
+       ]
      }
    ],
    "max_steps": 5

@@ -110,7 +126,11 @@
        "description": "Wrapping created_at with DATE() prevents efficient use of the created_at index.",
        "severity": 0.6,
        "fix": "SELECT o.id, o.total, u.name FROM orders o JOIN users u ON u.id = o.user_id WHERE o.created_at >= '2026-04-10' AND o.created_at < '2026-04-11';",
-       "keywords": [
+       "keywords": [
+         "date()", "function on column", "index", "range predicate", "performance",
+         "sargable", "non-sargable", "prevents index", "full scan",
+         "index usage", "function wrapping"
+       ]
      }
    ],
    "max_steps": 6

@@ -135,7 +155,11 @@
        "description": "Applying LOWER(name) on every row prevents the index on name from being used efficiently.",
        "severity": 0.35,
        "fix": "SELECT id, name FROM products WHERE name ILIKE 'pro%';",
-       "keywords": [
+       "keywords": [
+         "lower", "function on column", "index", "performance", "sargable",
+         "non-sargable", "case insensitive", "full scan", "table scan",
+         "function wrapping column"
+       ]
      },
      {
        "id": "medium_005_leading_wildcard",

@@ -143,10 +167,82 @@
        "description": "The leading wildcard in LIKE '%pro%' forces a full scan instead of an index-friendly prefix lookup.",
        "severity": 0.35,
        "fix": "SELECT id, name FROM products WHERE name ILIKE 'pro%';",
-       "keywords": [
+       "keywords": [
+         "leading wildcard", "%pro%", "full scan", "prefix lookup", "performance",
+         "like wildcard", "pattern matching", "index unusable", "table scan",
+         "wildcard prefix"
+       ]
      }
    ],
    "max_steps": 6
+ },
+ {
+   "task_id": "medium_006",
+   "difficulty": "medium",
+   "query": "SELECT * FROM events WHERE DATE(created_at) = '2024-01-15';",
+   "schema": {
+     "events": {
+       "id": "INT PRIMARY KEY",
+       "name": "VARCHAR(255)",
+       "created_at": "TIMESTAMP",
+       "INDEX": "idx_created_at ON events(created_at)"
+     }
+   },
+   "context": "Find all events that happened on a specific date.",
+   "ground_truth_issues": [
+     {
+       "id": "medium_006_function_on_index",
+       "category": "performance",
+       "description": "Using DATE() function on an indexed column prevents index usage. Use a range comparison instead.",
+       "severity": 0.7,
+       "fix": "SELECT * FROM events WHERE created_at >= '2024-01-15 00:00:00' AND created_at < '2024-01-16 00:00:00';",
+       "keywords": [
+         "function on column", "date function", "index", "sargable",
+         "non-sargable", "prevents index", "range comparison", "full scan",
+         "table scan", "index usage", "function wrapping column"
+       ]
+     },
+     {
+       "id": "medium_006_star",
+       "category": "performance",
+       "description": "SELECT * returns all columns when only specific fields may be needed.",
+       "severity": 0.2,
+       "fix": "SELECT id, name, created_at FROM events WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16';",
+       "keywords": [
+         "select *", "star", "all columns", "projection", "unnecessary columns",
+         "wildcard", "over-fetching", "column selection"
+       ]
+     }
+   ],
+   "max_steps": 6
+ },
+ {
+   "task_id": "medium_007",
+   "difficulty": "medium",
+   "query": "SELECT * FROM products ORDER BY RAND() LIMIT 10;",
+   "schema": {
+     "products": {
+       "id": "INT PRIMARY KEY",
+       "name": "VARCHAR(255)",
+       "price": "DECIMAL(10,2)",
+       "category": "VARCHAR(64)"
+     }
+   },
+   "context": "Show 10 random products on the homepage.",
+   "ground_truth_issues": [
+     {
+       "id": "medium_007_order_rand",
+       "category": "performance",
+       "description": "ORDER BY RAND() generates a random value for every row in the table, causing a full table scan and sort. Extremely slow on large tables.",
+       "severity": 0.8,
+       "fix": "SELECT * FROM products WHERE id >= (SELECT FLOOR(RAND() * (SELECT MAX(id) FROM products))) LIMIT 10;",
+       "keywords": [
+         "order by rand", "random", "full table scan", "sort", "performance",
+         "slow", "every row", "random ordering", "rand function",
+         "expensive sort", "large table"
+       ]
+     }
+   ],
+   "max_steps": 5
  }
]
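Several medium tasks (medium_004, medium_005, medium_006) hinge on sargability: wrapping an indexed column in a function hides it from the index. The effect is easy to observe with SQLite's EXPLAIN QUERY PLAN; the schema here mirrors the medium_006 task data, but the snippet is only an illustration, not environment code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE INDEX idx_created_at ON events(created_at)")

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite will seek an index or scan.
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " ".join(str(r) for r in rows)

# Non-sargable: DATE() hides created_at from the index, forcing a scan.
scan = plan("SELECT id FROM events WHERE DATE(created_at) = '2024-01-15'")

# Sargable: a half-open range on the bare column can use idx_created_at.
seek = plan(
    "SELECT id FROM events "
    "WHERE created_at >= '2024-01-15' AND created_at < '2024-01-16'"
)

print("idx_created_at" in scan)  # False: plan is a full table scan
print("idx_created_at" in seek)  # True: plan searches the index
```

The same rewrite (function call replaced by a range predicate) is exactly what the medium_004 and medium_006 fixes apply.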
tests/test_api.py
CHANGED

@@ -103,7 +103,7 @@ def test_request_more_context_returns_context_shared_flag() -> None:

     assert response.status_code == 200
     payload = response.json()
-    assert payload["reward"] == 0.0
+    assert payload["reward"] == -0.03
     assert "context_shared" in payload["info"]
     assert payload["info"]["context_shared"] is True
     assert payload["done"] is False
tests/test_reward.py
CHANGED

@@ -46,22 +46,23 @@ def test_identify_issue_no_match_returns_penalty() -> None:

 def test_identify_issue_match_no_fix_zero_confidence() -> None:
     # base_reward = min(0.35, 0.35) = 0.35; fix_bonus = 0; confidence_bonus = 0
-    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35)) == pytest.approx(0.35)
+    # order_bonus = 0.04 * (1/(0+1)) = 0.04 β†’ total = 0.39
+    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35)) == pytest.approx(0.39)


 def test_identify_issue_match_no_fix_full_confidence() -> None:
-    # base=0.35 + confidence_bonus=min(0.05, 1.0*0.35*0.08)=0.028
-    assert compute_reward(_action("identify_issue", confidence=1.0), _issue(0.35)) == pytest.approx(0.378)
+    # base=0.35 + confidence_bonus=min(0.05, 1.0*0.35*0.08)=0.028 + order_bonus=0.04 β†’ 0.418
+    assert compute_reward(_action("identify_issue", confidence=1.0), _issue(0.35)) == pytest.approx(0.418)


 def test_identify_issue_match_with_fix_zero_confidence() -> None:
-    # base=0.35 + fix_bonus=0.08
-    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35), fix_valid=True) == pytest.approx(0.43)
+    # base=0.35 + fix_bonus=0.08 + order_bonus=0.04 = 0.47, capped at 0.45
+    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(0.35), fix_valid=True) == pytest.approx(0.45)


 def test_identify_issue_high_severity_capped_at_035_base() -> None:
-    # min(0.9, 0.35) = 0.35
-    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(severity=0.9)) == pytest.approx(0.35)
+    # min(0.9, 0.35) = 0.35 + order_bonus=0.04 = 0.39
+    assert compute_reward(_action("identify_issue", confidence=0.0), _issue(severity=0.9)) == pytest.approx(0.39)


 # ── suggest_fix ───────────────────────────────────────────────────────────────

@@ -96,4 +97,10 @@

 # ── request_more_context ──────────────────────────────────────────────────────

 def test_request_more_context_returns_zero() -> None:
+    # No schema_available β†’ returns 0.0
     assert compute_reward(_action("request_more_context"), None) == pytest.approx(0.0)
+
+
+def test_request_more_context_with_schema_returns_penalty() -> None:
+    # schema_available=True β†’ returns -0.03
+    assert compute_reward(_action("request_more_context"), None, schema_available=True) == pytest.approx(-0.03)
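The comments in the updated tests pin down the shaping arithmetic for identify_issue. The real implementation lives in server/reward.py; the sketch below is only a reconstruction inferred from the test expectations, with parameter names (`issues_already_found`, `compute_reward_sketch`) that are hypothetical:

```python
def compute_reward_sketch(severity, confidence=0.0, fix_valid=False,
                          issues_already_found=0):
    # Base reward is the issue severity, capped at 0.35.
    base = min(severity, 0.35)
    # A valid suggested fix adds a flat bonus.
    fix_bonus = 0.08 if fix_valid else 0.0
    # Confidence bonus scales with severity and is capped at 0.05.
    confidence_bonus = min(0.05, confidence * severity * 0.08)
    # Earlier finds earn a larger order bonus: 0.04 * 1/(n+1).
    order_bonus = 0.04 * (1.0 / (issues_already_found + 1))
    # Total for one correct identification is capped at 0.45.
    return min(0.45, base + fix_bonus + confidence_bonus + order_bonus)

print(round(compute_reward_sketch(0.35), 3))                  # 0.39
print(round(compute_reward_sketch(0.35, confidence=1.0), 3))  # 0.418
print(round(compute_reward_sketch(0.35, fix_valid=True), 3))  # 0.45
```

The three printed values match the totals asserted in test_reward.py (0.39, 0.418, and the 0.45 cap), which is the check that any refactor of the real reward function has to keep passing.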