janakb committed ced8fd0 (0 parents)
Dockerfile ADDED
@@ -0,0 +1,39 @@
# ── Build stage ───────────────────────────────────────────────────────────────
FROM python:3.11-slim AS builder

WORKDIR /app

# Install build dependencies
RUN apt-get update && apt-get install -y --no-install-recommends \
    gcc \
    && rm -rf /var/lib/apt/lists/*

COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# ── Runtime stage ─────────────────────────────────────────────────────────────
FROM python:3.11-slim

# Hugging Face Spaces expects the app to run as a non-root user
RUN useradd -m -u 1000 appuser

WORKDIR /app

# Copy installed packages
COPY --from=builder /install /usr/local

# Copy application code
COPY --chown=appuser:appuser . .

# Ensure sub-packages are importable
RUN touch tasks/__init__.py graders/__init__.py agents/__init__.py

USER appuser

# HF Spaces expects port 7860
EXPOSE 7860

ENV PYTHONUNBUFFERED=1
ENV PYTHONPATH=/app

CMD ["python", "-m", "uvicorn", "server:app", "--host", "0.0.0.0", "--port", "7860"]
README.md ADDED
@@ -0,0 +1,262 @@
# 🔍 CodeReviewEnv

> An OpenEnv-compliant benchmark environment where AI agents act as senior engineers reviewing pull requests: catching bugs, finding security holes, and fixing broken code.

---

## Overview & Motivation

Code review is one of the highest-leverage activities in software engineering, yet it is time-consuming, inconsistent, and cognitively demanding. A model that can reliably triage pull requests, identify security vulnerabilities, and produce corrected patches would meaningfully accelerate software delivery.

**CodeReviewEnv** simulates exactly this. Three tasks of increasing difficulty present agents with realistic pull requests containing planted defects. The agent must reason over code, report issues with structured annotations, submit a corrected patch, and deliver a final verdict, all within a bounded step budget.

---

## Environment Architecture

```
code-review-env/
├── env.py                 # Core OpenEnv environment (reset / step / state)
├── server.py              # FastAPI HTTP server exposing the OpenEnv interface
├── models.py              # Pydantic typed models: Action, Observation, Reward, State
├── openenv.yaml           # OpenEnv metadata
├── tasks/
│   ├── task1_easy.py      # Bug hunt: simple Python utility
│   ├── task2_medium.py    # Security audit: Flask auth endpoint
│   └── task3_hard.py      # Correctness: distributed LRU cache
├── graders/
│   └── grader.py          # Deterministic keyword + AST graders
├── agents/
│   └── baseline_agent.py  # HF Inference API baseline (OpenAI-compatible)
├── Dockerfile
├── requirements.txt
└── README.md
```

---

## Action Space

Each agent turn is a single `ReviewAction` JSON object:

| Field | Type | Description |
|---|---|---|
| `action_type` | `"review" \| "patch" \| "comment" \| "submit"` | What the agent is doing |
| `severity` | `"critical" \| "major" \| "minor" \| "info"` | Issue severity (for `review`) |
| `issue_type` | `"bug" \| "security" \| "performance" \| "logic" \| "style"` | Issue category |
| `line_number` | `int \| null` | Line the issue is on |
| `description` | `str` | Concise natural-language description of the issue |
| `patched_code` | `str \| null` | Full corrected code (for `patch` actions) |
| `comment` | `str \| null` | Free-form annotation |
| `verdict` | `"approve" \| "request_changes" \| "reject"` | Final verdict (for `submit`) |
| `confidence` | `float [0.0, 1.0]` | Agent's self-reported confidence |
+
54
+ ---
55
+
56
+ ## Observation Space
57
+
58
+ Each step returns an `Observation` containing:
59
+
60
+ | Field | Description |
61
+ |---|---|
62
+ | `task_id` | Identifier of the current task |
63
+ | `step` / `max_steps` | Current step and budget |
64
+ | `review_context` | Full PR: title, author, description, code files, linter output, test results |
65
+ | `previous_actions` | All actions taken so far this episode |
66
+ | `issues_found_so_far` | Structured list of issues reported |
67
+ | `score_so_far` | Running cumulative intermediate reward |
68
+ | `done` | Whether the episode has ended |
69
+
70
+ ---
71
+
72
+ ## Reward Function
73
+
74
+ Reward is **dense** β€” provided at every step, not only at the end.
75
+
76
+ ### Intermediate (per-step)
77
+
78
+ | Signal | Value | Rationale |
79
+ |---|---|---|
80
+ | Step penalty | βˆ’0.01 | Encourages efficiency |
81
+ | Review with description | +0.05 | Rewards substantive annotations |
82
+ | Critical severity bonus | +0.03 | Rewards correct triage |
83
+ | Patch submitted | +0.10 | Rewards producing a fix |
84
+ | Repetition penalty | βˆ’0.05 | Penalises looping / copy-paste |
85
+
86
+ ### Terminal (on `submit` or step exhaustion)
87
+
88
+ The programmatic grader runs and returns a score in **[0.0, 1.0]** based on which issues were correctly identified and how well the submitted patch addresses them. This final score overwrites the episode total.
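As a worked example of the per-step arithmetic: a critical review action nets 0.05 + 0.03 - 0.01 = 0.07, and a patch nets 0.10 - 0.01 = 0.09. A sketch that mirrors the table (not the environment's actual code):

```python
def step_reward(action_type, severity=None, has_description=True, repeated=False):
    """Recompute the intermediate-reward table above (sketch only)."""
    r = -0.01  # step penalty, applied to every action
    if action_type == "review":
        if has_description:
            r += 0.05
        if severity == "critical":
            r += 0.03
        if repeated:
            r -= 0.05  # repetition penalty
    elif action_type == "patch":
        r += 0.10
    return round(r, 2)

print(step_reward("review", "critical"))  # 0.07
print(step_reward("patch"))               # 0.09
```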

---

## Tasks

### Task 1 – Easy: Bug Hunt (`task_1_easy_bug_hunt`)

**Max steps:** 8
**File reviewed:** `utils.py` (Python, 30 lines)

A developer submits three utility functions. Three bugs are planted:

| # | Line | Bug | Severity |
|---|---|---|---|
| 1 | 3 | `=` (assignment) used instead of `==` (comparison), which causes a `SyntaxError` | Critical |
| 2 | 6 | `range(1, len(numbers) + 1)`, an off-by-one that causes an `IndexError` | Critical |
| 3 | 9 | Missing `return max_val`, so the function silently returns `None` | Major |
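Bug #1's class is easy to demonstrate: Python rejects `=` inside a condition at parse time. The snippet below is purely illustrative; the actual `utils.py` ships with the environment:

```python
import ast

# A made-up function exhibiting the same defect class as bug #1.
buggy = "def is_zero(x):\n    if x = 0:\n        return True\n"
try:
    ast.parse(buggy)
    parsed = True
except SyntaxError:
    parsed = False
print(parsed)  # False: `=` in a condition is a SyntaxError in Python
```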

**Grading:** 30% for each critical bug identified, 20% for the major bug, and 20% for a syntactically valid patch with all three fixes applied.

---

### Task 2 – Medium: Security Audit (`task_2_medium_security`)

**Max steps:** 12
**File reviewed:** `auth.py` (Flask, 55 lines)

A backend developer submits login and registration endpoints. Six security vulnerabilities are present:

| # | Line | Vulnerability | Severity |
|---|---|---|---|
| 1 | 23 | SQL injection in `login` query (f-string interpolation) | Critical |
| 2 | 44 | SQL injection in `register` INSERT | Critical |
| 3 | 39 | Plaintext password storage (no hashing) | Critical |
| 4 | – | No rate limiting on `/login` (brute-force possible) | Major |
| 5 | 30 | Sensitive data leakage: error distinguishes "wrong password" vs "user not found" | Major |
| 6 | 5 | Hardcoded `secret_key` in source | Major |

**Grading:** Weighted by severity. The patch is checked for parameterized queries, password hashing, and environment-variable use.
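The remediations the grader looks for can be sketched with the standard library alone: `sqlite3` placeholders instead of f-strings, PBKDF2 password hashing, and secrets read from the environment. Table and column names below are illustrative, not the task's actual schema:

```python
import hashlib
import os
import sqlite3

# Secret from the environment, not hardcoded (default is for the demo only).
SECRET_KEY = os.environ.get("SECRET_KEY", "dev-only")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT, password_hash TEXT)")

def hash_password(password: str, salt: bytes) -> str:
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000).hex()

salt = b"static-demo-salt"  # real code would use a random per-user salt
conn.execute(
    "INSERT INTO users VALUES (?, ?)",  # parameterized, not an f-string
    ("alice", hash_password("s3cret", salt)),
)

row = conn.execute(
    "SELECT password_hash FROM users WHERE username = ?", ("alice",)
).fetchone()
print(row[0] == hash_password("s3cret", salt))  # True
```

A real Flask fix would use `werkzeug.security.generate_password_hash` and a rate limiter as well; the grader's keyword checks accept any of the common hashing schemes.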

---

### Task 3 – Hard: Distributed Systems Correctness (`task_3_hard_perf_correctness`)

**Max steps:** 16
**File reviewed:** `cache.py` (Python, 55 lines)

A senior engineer submits a Redis-backed LRU cache claimed to be production-ready. Six issues lurk:

| # | Issue | Type | Severity |
|---|---|---|---|
| 1 | Non-atomic `EXISTS` + `GET` creates a race condition | Concurrency | Critical |
| 2 | Local `dict` grows unboundedly; the `capacity` parameter is ignored | Performance | Critical |
| 3 | `get_many` calls `self.get()` in a loop (N+1 round trips) | Performance | Major |
| 4 | `dict` preserves insertion order, not access order, so LRU eviction is wrong | Logic | Major |
| 5 | Shared `dict` modified without a `threading.Lock` | Concurrency | Critical |
| 6 | `pickle.loads` on bytes from Redis allows arbitrary code execution | Security | Critical |

**Grading:** Equally weighted. The patch is checked structurally for `threading.Lock`, `OrderedDict.move_to_end`, `mget`, and `json` instead of `pickle`.
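The in-process half of those fixes (issues 2, 4, and 5) can be sketched with `OrderedDict` plus a lock; the real `cache.py` is Redis-backed and more involved:

```python
import threading
from collections import OrderedDict

class LocalLRU:
    """Sketch of the local-cache fixes the grader checks for."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data: OrderedDict = OrderedDict()
        self._lock = threading.Lock()  # guard shared state across threads

    def get(self, key):
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)  # track *access* order, not insertion
            return self._data[key]

    def put(self, key, value):
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self.capacity:
                self._data.popitem(last=False)  # evict the least recently used

cache = LocalLRU(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")         # "a" is now most recently used
cache.put("c", 3)      # evicts "b", not "a"
print(cache.get("b"))  # None
print(cache.get("a"))  # 1
```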

---

## Baseline Performance

Evaluated with `Qwen/Qwen2.5-72B-Instruct` via the Hugging Face Inference API:

| Task | Score |
|---|---|
| Task 1 – Easy | 0.72 |
| Task 2 – Medium | 0.55 |
| Task 3 – Hard | 0.38 |
| **Aggregate** | **0.55** |

---

## Setup & Usage

### 1. Local (Python)

```bash
git clone <repo>
cd code-review-env
pip install -r requirements.txt
python server.py
# Server running at http://localhost:7860
```

### 2. Docker

```bash
docker build -t code-review-env .
docker run -p 7860:7860 code-review-env
```

### 3. API Quickstart

```bash
# Reset to task 1
curl -X POST http://localhost:7860/reset \
  -H "Content-Type: application/json" \
  -d '{"task_id": "task_1_easy_bug_hunt"}'

# Take a step
curl -X POST http://localhost:7860/step \
  -H "Content-Type: application/json" \
  -d '{
    "session_id": "<session_id>",
    "action": {
      "action_type": "review",
      "severity": "critical",
      "issue_type": "bug",
      "line_number": 3,
      "description": "Assignment operator = used instead of comparison == on line 3"
    }
  }'
```
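The same requests can be assembled from Python. The sketch below only builds the payloads (posting them, e.g. with `requests`, still needs the server from step 1; the session id is hypothetical):

```python
import json

def reset_payload(task_id: str) -> dict:
    return {"task_id": task_id}

def step_payload(session_id: str, **action) -> dict:
    # `action` fields follow the Action Space table above
    return {"session_id": session_id, "action": action}

body = step_payload(
    "abc123",  # hypothetical session id returned by /reset
    action_type="review",
    severity="critical",
    issue_type="bug",
    line_number=3,
    description="Assignment operator = used instead of comparison ==",
)
print(json.dumps(body, indent=2))
```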

### 4. Run inference script

```bash
export HF_TOKEN=hf_your_token_here
export API_BASE_URL=https://router.huggingface.co/v1
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
python inference.py
```

Expected stdout format:
```
[START] task=task_1_easy_bug_hunt env=code-review-env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=review:assignment operator = instead of == reward=0.07 done=false error=null
[STEP] step=2 action=review:off-by-one in range reward=0.07 done=false error=null
[STEP] step=3 action=patch:fixed code reward=0.09 done=false error=null
[STEP] step=4 action=submit:request_changes reward=1.00 done=true error=null
[END] success=true steps=4 score=1.000 rewards=0.07,0.07,0.09,1.00
```

### 5. OpenEnv validation

```bash
openenv validate .
```

---

## HTTP API Reference

| Method | Endpoint | Description |
|---|---|---|
| `GET` | `/` | Environment info |
| `GET` | `/tasks` | List all tasks |
| `POST` | `/reset` | Start a new episode |
| `POST` | `/step` | Take an action |
| `GET` | `/state/{session_id}` | Inspect full environment state |
| `DELETE` | `/session/{session_id}` | Clean up session |

---

## Hugging Face Spaces Deployment

The `Dockerfile` targets port `7860` and runs as a non-root user, making it compatible with the HF Spaces Docker SDK out of the box. Tag the Space with `openenv`.

```yaml
# README header for HF Spaces
---
title: CodeReviewEnv
emoji: 🔍
colorFrom: indigo
colorTo: blue
sdk: docker
pinned: false
tags:
  - openenv
---
```
agents/__init__.py ADDED
@@ -0,0 +1 @@
# agents package
agents/baseline_agent.py ADDED
@@ -0,0 +1,271 @@
"""
Baseline inference script for CodeReviewEnv.

Evaluates a model (via OpenAI-compatible API) across all three tasks and
reports per-task and aggregate scores.

Usage:
    HF_TOKEN=<your_token> python agents/baseline_agent.py [--model MODEL] [--server URL]

The script uses the Hugging Face Inference API (OpenAI-compatible endpoint)
with the model specified via --model (default: Qwen/Qwen2.5-72B-Instruct).
"""
from __future__ import annotations

import argparse
import json
import os
import sys
import time
from typing import Any, Dict, List

import requests
from openai import OpenAI

# ── Config ────────────────────────────────────────────────────────────────────

DEFAULT_MODEL = "Qwen/Qwen2.5-72B-Instruct"
DEFAULT_SERVER = "http://localhost:7860"
# Honour the API_BASE_URL documented in the README; default to the HF router.
HF_BASE_URL = os.environ.get("API_BASE_URL", "https://router.huggingface.co/v1")

TASK_IDS = [
    "task_1_easy_bug_hunt",
    "task_2_medium_security",
    "task_3_hard_perf_correctness",
]

# ── Prompts ───────────────────────────────────────────────────────────────────

SYSTEM_PROMPT = """\
You are an expert software engineer performing a thorough code review.
Your task is to:
1. Carefully read the provided code.
2. Identify ALL bugs, security vulnerabilities, performance issues, and correctness problems.
3. For each issue, output a JSON action with action_type="review".
4. After all issues are identified, output a patch with action_type="patch".
5. Finally, output action_type="submit" with your verdict.

Each action must be valid JSON matching this schema:
{
  "action_type": "review" | "patch" | "comment" | "submit",
  "severity": "critical" | "major" | "minor" | "info",
  "issue_type": "bug" | "security" | "performance" | "logic" | "style",
  "line_number": <int or null>,
  "description": "<concise description of the issue>",
  "patched_code": "<full corrected code>",
  "comment": "<optional comment>",
  "verdict": "approve" | "request_changes" | "reject",
  "confidence": <0.0-1.0>
}

Output ONE action JSON per message. Be precise and thorough.
"""


def build_user_prompt(obs: Dict[str, Any]) -> str:
    ctx = obs["review_context"]
    files_text = "\n\n".join(
        f"=== {f['filename']} ({f['language']}) ===\n{f['content']}"
        for f in ctx["files_changed"]
    )
    issues_so_far = obs.get("issues_found_so_far", [])

    prompt = f"""Pull Request: {ctx['pull_request_title']}
Author: {ctx['author']}
Description: {ctx['description']}

Linter: {ctx.get('linter_output', 'N/A')}
Tests: {ctx.get('test_results', 'N/A')}

--- CODE ---
{files_text}
--- END CODE ---

Steps taken so far: {obs['step']} / {obs['max_steps']}
Issues identified so far: {len(issues_so_far)}
"""
    if issues_so_far:
        prompt += "\nIssues already reported:\n"
        for iss in issues_so_far:
            prompt += f"  - [{iss.get('severity', '?')}] line {iss.get('line', '?')}: {iss.get('description', '')}\n"

    if obs["step"] == 0:
        prompt += "\nPlease begin your review. Output your first action as JSON."
    elif obs["step"] >= obs["max_steps"] - 2:
        prompt += "\nYou are running low on steps. Please submit a patch and final verdict now."
    else:
        prompt += "\nContinue your review or submit if done. Output next action as JSON."

    return prompt


# ── Agent loop ────────────────────────────────────────────────────────────────

def extract_json(text: str) -> Dict[str, Any]:
    """Extract the first JSON object from a model response."""
    # Try a direct parse first
    text = text.strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to scanning for a balanced {...} block
    start = text.find("{")
    if start == -1:
        raise ValueError("No JSON found in response")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start : i + 1])
    raise ValueError("Unbalanced JSON")


def run_episode(
    client: OpenAI,
    model: str,
    server: str,
    task_id: str,
) -> Dict[str, Any]:
    """Run a single episode and return the result dict."""

    # 1. Reset
    resp = requests.post(f"{server}/reset", json={"task_id": task_id}, timeout=30)
    resp.raise_for_status()
    data = resp.json()
    session_id = data["session_id"]
    obs = data["observation"]

    print(f"\n{'=' * 60}")
    print(f"Task: {task_id}")
    print(f"Session: {session_id}")
    print(f"{'=' * 60}")

    history: List[Dict[str, str]] = []
    final_score = 0.0
    done = False
    patch_submitted = False

    while not done:
        user_msg = build_user_prompt(obs)
        history.append({"role": "user", "content": user_msg})

        # Call model
        try:
            completion = client.chat.completions.create(
                model=model,
                messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
                max_tokens=1024,
                temperature=0.2,
            )
            raw = completion.choices[0].message.content or ""
        except Exception as exc:
            print(f"  [Model error] {exc}")
            break

        history.append({"role": "assistant", "content": raw})

        # Parse action
        try:
            action_dict = extract_json(raw)
        except ValueError as exc:
            print(f"  [Parse error] {exc} | raw={raw[:200]!r}")
            # Force a submit to avoid infinite spin
            action_dict = {"action_type": "submit", "verdict": "request_changes", "confidence": 0.3}

        action_type = action_dict.get("action_type", "review")
        print(f"  Step {obs['step'] + 1}: {action_type} | {action_dict.get('description', '')[:80]}")

        # Auto-submit near the step limit
        if obs["step"] >= obs["max_steps"] - 1 and action_type != "submit":
            action_dict = {"action_type": "submit", "verdict": "request_changes", "confidence": 0.5}
            if not patch_submitted:
                # Submit a patch first (fall back to the original file contents)
                action_dict = {
                    "action_type": "patch",
                    "patched_code": obs["review_context"]["files_changed"][0]["content"],
                }
            # Keep action_type in sync with the override so the patch is recorded
            action_type = action_dict["action_type"]

        if action_type == "patch":
            patch_submitted = True

        # Step
        step_resp = requests.post(
            f"{server}/step",
            json={"session_id": session_id, "action": action_dict},
            timeout=30,
        )
        step_resp.raise_for_status()
        step_data = step_resp.json()
        obs = step_data["observation"]
        done = step_data["done"]
        info = step_data.get("info", {})

        if done:
            final_score = info.get("final_score", 0.0)
            breakdown = info.get("breakdown", {})
            print(f"\n  Final score: {final_score:.4f}")
            print(f"  Breakdown: {json.dumps(breakdown, indent=4)}")

        time.sleep(0.3)  # be polite to the API

    # Cleanup
    requests.delete(f"{server}/session/{session_id}", timeout=10)

    return {
        "task_id": task_id,
        "final_score": final_score,
        "steps_taken": obs["step"],
    }


# ── Main ──────────────────────────────────────────────────────────────────────

def main():
    parser = argparse.ArgumentParser(description="CodeReviewEnv baseline agent")
    parser.add_argument("--model", default=os.environ.get("MODEL_NAME", DEFAULT_MODEL))
    parser.add_argument("--server", default=DEFAULT_SERVER)
    parser.add_argument("--task", default=None, help="Run a single task (default: all)")
    args = parser.parse_args()

    hf_token = os.environ.get("HF_TOKEN")
    if not hf_token:
        print("ERROR: HF_TOKEN environment variable not set.", file=sys.stderr)
        sys.exit(1)

    client = OpenAI(
        api_key=hf_token,
        base_url=HF_BASE_URL,
    )

    tasks = [args.task] if args.task else TASK_IDS
    results = []

    for task_id in tasks:
        result = run_episode(client, args.model, args.server, task_id)
        results.append(result)

    # Summary
    print("\n" + "=" * 60)
    print("BASELINE SUMMARY")
    print("=" * 60)
    for r in results:
        print(f"  {r['task_id']:<40} score={r['final_score']:.4f} steps={r['steps_taken']}")

    if len(results) == len(TASK_IDS):
        avg = sum(r["final_score"] for r in results) / len(results)
        print(f"\n  Aggregate average score: {avg:.4f}")

    # Save results
    out_path = "baseline_results.json"
    with open(out_path, "w") as f:
        json.dump({"model": args.model, "results": results}, f, indent=2)
    print(f"\n  Results saved to {out_path}")


if __name__ == "__main__":
    main()
env.py ADDED
@@ -0,0 +1,211 @@
"""
CodeReviewEnv: OpenEnv-compliant environment.

Implements:
    reset()      -> Observation
    step(action) -> (Observation, StepReward, done, info)
    state()      -> EnvironmentState
"""
from __future__ import annotations

import copy
import sys
import os

sys.path.insert(0, os.path.dirname(__file__))

from typing import Any, Dict, Tuple

from models import (
    CodeFile,
    EnvironmentState,
    Observation,
    ReviewAction,
    ReviewContext,
    StepReward,
)
from graders.grader import grade


# ── Task registry ────────────────────────────────────────────────────────────
def _load_task(task_id: str) -> Dict[str, Any]:
    if task_id == "task_1_easy_bug_hunt":
        from tasks.task1_easy import get_task_config
    elif task_id == "task_2_medium_security":
        from tasks.task2_medium import get_task_config
    elif task_id == "task_3_hard_perf_correctness":
        from tasks.task3_hard import get_task_config
    else:
        raise ValueError(f"Unknown task_id: {task_id!r}")
    return get_task_config()


TASK_IDS = [
    "task_1_easy_bug_hunt",
    "task_2_medium_security",
    "task_3_hard_perf_correctness",
]

# ─────────────────────────────────────────────────────────────────────────────


class CodeReviewEnv:
    """OpenEnv-compliant code review environment."""

    # ── Lifecycle ────────────────────────────────────────────────────────────

    def reset(self, task_id: str = "task_1_easy_bug_hunt") -> Observation:
        """Reset the environment for a given task. Returns the initial observation."""
        cfg = _load_task(task_id)
        pr = cfg["pull_request"]

        files = [CodeFile(**f) for f in pr["files_changed"]]
        review_ctx = ReviewContext(
            pull_request_title=pr["pull_request_title"],
            author=pr["author"],
            description=pr["description"],
            files_changed=files,
            test_results=pr.get("test_results"),
            linter_output=pr.get("linter_output"),
        )

        self._state = EnvironmentState(
            task_id=task_id,
            step=0,
            max_steps=cfg["max_steps"],
            review_context=review_ctx,
        )
        self._cfg = cfg
        return self._make_observation()

    def step(self, action: ReviewAction) -> Tuple[Observation, StepReward, bool, Dict[str, Any]]:
        """
        Apply an action. Returns (observation, reward, done, info).
        Raises RuntimeError if called before reset().
        """
        if not hasattr(self, "_state"):
            raise RuntimeError("Call reset() before step().")

        s = self._state

        # ── Terminal check ───────────────────────────────────────────────────
        if s.done:
            obs = self._make_observation(feedback="Episode already finished.")
            return obs, StepReward(value=0.0, explanation="Episode done."), True, {}

        s.step += 1

        # ── Absorb action ────────────────────────────────────────────────────
        s.actions_taken.append(action)

        # Record issue if it is a review action
        if action.action_type == "review" and action.description:
            issue = {
                "step": s.step,
                "severity": action.severity,
                "issue_type": action.issue_type,
                "line": action.line_number,
                "description": action.description,
            }
            s.issues_identified.append(issue)

        # Record patch
        if action.action_type == "patch" and action.patched_code:
            s.patch_submitted = action.patched_code

        # Record verdict
        if action.action_type == "submit" and action.verdict:
            s.verdict_submitted = action.verdict

        # ── Reward ───────────────────────────────────────────────────────────
        reward = self._compute_step_reward(action)
        s.total_reward += reward.value

        # ── Done condition ───────────────────────────────────────────────────
        submitted = action.action_type == "submit"
        out_of_steps = s.step >= s.max_steps

        if submitted or out_of_steps:
            final_score, breakdown = grade(s)
            s.total_reward = final_score
            s.done = True
            s.terminated_reason = "submitted" if submitted else "max_steps_reached"
            reward = StepReward(
                value=final_score,
                breakdown=breakdown,
                explanation=f"Final score: {final_score:.3f}",
            )
            info = {"final_score": final_score, "breakdown": breakdown, "reason": s.terminated_reason}
        else:
            info = {"step": s.step, "cumulative_reward": s.total_reward}

        obs = self._make_observation()
        return obs, reward, s.done, info

    def state(self) -> EnvironmentState:
        if not hasattr(self, "_state"):
            raise RuntimeError("Call reset() before state().")
        return copy.deepcopy(self._state)

    # ── Internal helpers ─────────────────────────────────────────────────────

    def _make_observation(self, feedback: str | None = None) -> Observation:
        s = self._state
        return Observation(
            task_id=s.task_id,
            step=s.step,
            max_steps=s.max_steps,
            review_context=s.review_context,
            previous_actions=list(s.actions_taken),
            feedback=feedback,
            issues_found_so_far=list(s.issues_identified),
            score_so_far=s.total_reward,
            done=s.done,
        )

    def _compute_step_reward(self, action: ReviewAction) -> StepReward:
        """
        Dense intermediate reward:
            +0.05 for a review action with a non-empty description
            +0.03 for a review action with severity='critical'
            +0.10 for a patch action with non-empty code
            -0.05 for repeated identical descriptions (loop detection)
            -0.01 step penalty (encourages efficiency)
        """
        s = self._state
        r = 0.0
        parts: Dict[str, float] = {}

        STEP_PENALTY = -0.01
        r += STEP_PENALTY
        parts["step_penalty"] = STEP_PENALTY

        if action.action_type == "review":
            if action.description:
                parts["review_description"] = 0.05
                r += 0.05
            if action.severity == "critical":
                parts["critical_severity_bonus"] = 0.03
                r += 0.03
            # Loop detection: penalise if the same description appeared before
            prev_descs = [
                a.description for a in s.actions_taken[:-1]
                if a.description
            ]
            if action.description and action.description in prev_descs:
                parts["repetition_penalty"] = -0.05
                r += -0.05

        elif action.action_type == "patch":
            if action.patched_code and len(action.patched_code) > 50:
                parts["patch_submitted"] = 0.10
                r += 0.10

        elif action.action_type == "submit":
            pass  # final score handled in step()

        return StepReward(
            value=max(-1.0, min(1.0, r)),
            breakdown=parts,
            explanation=f"Step {s.step} intermediate reward",
        )
graders/__init__.py ADDED
@@ -0,0 +1 @@
# graders package
graders/grader.py ADDED
@@ -0,0 +1,201 @@
"""
Programmatic graders for all three tasks.
Each grader returns a score in [0.0, 1.0] with a breakdown dict.
Grading is deterministic: keyword matching + structural checks.
"""
from __future__ import annotations

import ast
import re
from typing import Any, Dict, List, Tuple

from models import ReviewAction, EnvironmentState


# ─── Helpers ─────────────────────────────────────────────────────────────────

def _keywords_hit(text: str, keywords: List[str]) -> bool:
    """Return True if any keyword appears in text (case-insensitive)."""
    t = text.lower()
    return any(kw.lower() in t for kw in keywords)


def _actions_mention_bug(actions: List[ReviewAction], bug: Dict[str, Any]) -> bool:
    """Check whether any action mentions the given bug via keyword matching."""
    keywords = bug["description_keywords"]
    for action in actions:
        text = " ".join(filter(None, [
            action.description or "",
            action.comment or "",
            action.issue_type or "",
        ]))
        if _keywords_hit(text, keywords):
            return True
    return False


def _patch_fixes_syntax(patched_code: str) -> bool:
    """Try to parse the patched code as valid Python."""
    try:
        ast.parse(patched_code)
        return True
    except SyntaxError:
        return False


def _patch_contains_fix(patched_code: str, fix_keywords: List[str]) -> bool:
    return _keywords_hit(patched_code, fix_keywords)


# ─── Task 1 Grader ────────────────────────────────────────────────────────────

def grade_task1(state: EnvironmentState) -> Tuple[float, Dict[str, float]]:
    """
    Score breakdown:
      - 30% : identified comparison operator bug (= vs ==)
      - 30% : identified off-by-one bug
      - 20% : identified missing return
      - 20% : patch parses correctly and contains all three fixes
    """
    from tasks.task1_easy import KNOWN_BUGS

    actions = state.actions_taken
    breakdown: Dict[str, float] = {}

    # Bug identification (80% total: 30% per critical bug, 20% for the major one)
    for bug_name, bug_info in KNOWN_BUGS.items():
        hit = _actions_mention_bug(actions, bug_info)
        weight = 0.30 if bug_info["severity"] == "critical" else 0.20
        breakdown[f"found_{bug_name}"] = weight if hit else 0.0

    # Patch quality (20%)
    patch_score = 0.0
    if state.patch_submitted:
        p = state.patch_submitted
        if _patch_fixes_syntax(p):
            patch_score += 0.10
        # Crude check: '==' is used and no stray single '=' comparison remains
        if "==" in p and "= 0" not in p.replace("==", ""):
            patch_score += 0.04
        if "range(1, len(numbers))" in p:
            patch_score += 0.03
        if re.search(r"return\s+max_val", p):
            patch_score += 0.03
    breakdown["patch_quality"] = patch_score

    total = sum(breakdown.values())
    return min(total, 1.0), breakdown


# ─── Task 2 Grader ────────────────────────────────────────────────────────────

def grade_task2(state: EnvironmentState) -> Tuple[float, Dict[str, float]]:
    """
    Score breakdown:
      - 20% : identified SQL injection (login)
      - 20% : identified SQL injection (register)
      - 15% : identified plaintext password
      - 10% : identified no rate limiting
      - 10% : identified sensitive data leakage
      - 05% : identified hardcoded secret
      - 20% : patch uses parameterized queries + password hashing
    """
    from tasks.task2_medium import KNOWN_VULNERABILITIES

    actions = state.actions_taken
    breakdown: Dict[str, float] = {}

    weights = {
        "sql_injection_login": 0.20,
        "sql_injection_register": 0.20,
        "plaintext_password": 0.15,
        "no_rate_limiting": 0.10,
        "sensitive_data_leak": 0.10,
        "hardcoded_secret": 0.05,
    }

    for vuln_name, vuln_info in KNOWN_VULNERABILITIES.items():
        hit = _actions_mention_bug(actions, vuln_info)
        breakdown[f"found_{vuln_name}"] = weights[vuln_name] if hit else 0.0

    # Patch quality (20%)
    patch_score = 0.0
    if state.patch_submitted:
        p = state.patch_submitted
        if _patch_fixes_syntax(p):
            patch_score += 0.05
        if "?" in p and "execute" in p:  # parameterized queries
            patch_score += 0.07
        if _patch_contains_fix(p, ["generate_password_hash", "bcrypt", "argon2", "pbkdf2"]):
            patch_score += 0.05
        if _patch_contains_fix(p, ["os.environ", "environ.get", "getenv"]):
            patch_score += 0.03
    breakdown["patch_quality"] = patch_score
133
+ total = sum(breakdown.values())
134
+ return min(total, 1.0), breakdown
135
+
136
+
137
+ # ─── Task 3 Grader ────���───────────────────────────────────────────────────────
138
+
139
+ def grade_task3(state: EnvironmentState) -> Tuple[float, Dict[str, float]]:
140
+ """
141
+ Score breakdown:
142
+ - 15% : race condition
143
+ - 15% : memory leak / missing eviction
144
+ - 15% : N+1 query / mget
145
+ - 10% : LRU order correctness
146
+ - 15% : thread safety
147
+ - 15% : pickle deserialization vulnerability
148
+ - 15% : patch quality (structural checks)
149
+ """
150
+ from tasks.task3_hard import KNOWN_ISSUES
151
+
152
+ actions = state.actions_taken
153
+ breakdown: Dict[str, float] = {}
154
+
155
+ weights = {
156
+ "race_condition": 0.15,
157
+ "memory_leak": 0.15,
158
+ "n_plus_one": 0.15,
159
+ "wrong_lru_order": 0.10,
160
+ "thread_safety": 0.15,
161
+ "pickle_injection": 0.15,
162
+ }
163
+
164
+ for issue_name, issue_info in KNOWN_ISSUES.items():
165
+ hit = _actions_mention_bug(actions, issue_info)
166
+ breakdown[f"found_{issue_name}"] = weights[issue_name] if hit else 0.0
167
+
168
+ # Patch quality (15%)
169
+ patch_score = 0.0
170
+ if state.patch_submitted:
171
+ p = state.patch_submitted
172
+ if _patch_fixes_syntax(p):
173
+ patch_score += 0.03
174
+ if _patch_contains_fix(p, ["threading.Lock", "Lock()", "_lock"]):
175
+ patch_score += 0.03
176
+ if _patch_contains_fix(p, ["OrderedDict", "move_to_end"]):
177
+ patch_score += 0.03
178
+ if _patch_contains_fix(p, ["mget", "pipeline"]):
179
+ patch_score += 0.03
180
+ if _patch_contains_fix(p, ["json.loads", "json.dumps"]) and "pickle" not in p:
181
+ patch_score += 0.03
182
+ breakdown["patch_quality"] = patch_score
183
+
184
+ total = sum(breakdown.values())
185
+ return min(total, 1.0), breakdown
186
+
187
+
188
+ # ─── Dispatcher ──────────────────────────────────────────────────────────────
189
+
190
+ GRADERS = {
191
+ "task_1_easy_bug_hunt": grade_task1,
192
+ "task_2_medium_security": grade_task2,
193
+ "task_3_hard_perf_correctness": grade_task3,
194
+ }
195
+
196
+
197
+ def grade(state: EnvironmentState) -> Tuple[float, Dict[str, float]]:
198
+ grader = GRADERS.get(state.task_id)
199
+ if grader is None:
200
+ raise ValueError(f"No grader found for task_id={state.task_id!r}")
201
+ return grader(state)
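All three graders rest on the same case-insensitive substring heuristic in `_keywords_hit`. A minimal standalone sketch of that heuristic (re-implemented here so it runs without the environment's imports):

```python
from typing import List


def keywords_hit(text: str, keywords: List[str]) -> bool:
    """Case-insensitive substring match, mirroring the grader helper."""
    t = text.lower()
    return any(kw.lower() in t for kw in keywords)


# A review description mentioning "off-by-one" matches that bug's keyword list;
# an unrelated style remark does not.
bug_keywords = ["off-by-one", "index", "range", "+1", "IndexError"]
assert keywords_hit("Loop bound causes an off-by-one error", bug_keywords)
assert not keywords_hit("Variable naming could be better", bug_keywords)
```

Substring matching is deliberately forgiving: an agent gets credit for "IndexError on the last iteration" as well as for "off-by-one", at the cost of occasional false positives.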
inference.py ADDED
@@ -0,0 +1,275 @@
+ """
+ inference.py - CodeReviewEnv baseline inference script.
+
+ Env vars:
+     HF_TOKEN       Your Hugging Face / API key (required; API_KEY is accepted as a fallback).
+     API_BASE_URL   The API endpoint for the LLM (default: https://router.huggingface.co/v1).
+     MODEL_NAME     The model identifier to use for inference (default: Qwen/Qwen2.5-72B-Instruct).
+
+ STDOUT format (strictly followed):
+     [START] task=<task_name> env=<benchmark> model=<model_name>
+     [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+     [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+ """
+
+ import json
+ import os
+ import sys
+ import textwrap
+ from typing import Any, Dict, List, Optional
+
+ from openai import OpenAI
+
+ sys.path.insert(0, os.path.dirname(__file__))
+ from env import CodeReviewEnv, TASK_IDS
+ from models import ReviewAction
+
+ # ── Env vars ──────────────────────────────────────────────────────────────────
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
+ API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
+ MODEL_NAME = os.getenv("MODEL_NAME", "Qwen/Qwen2.5-72B-Instruct")
+ BENCHMARK = "code-review-env"
+ SUCCESS_SCORE_THRESHOLD = 0.5
+
+ # ── Logging helpers ───────────────────────────────────────────────────────────
+
+ def log_start(task: str, env: str, model: str) -> None:
+     print(f"[START] task={task} env={env} model={model}", flush=True)
+
+
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
+     error_val = error if error else "null"
+     done_val = str(done).lower()
+     action_clean = action.replace("\n", " ").replace("\r", "")[:120]
+     print(
+         f"[STEP] step={step} action={action_clean} reward={reward:.2f} done={done_val} error={error_val}",
+         flush=True,
+     )
+
+
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
+     rewards_str = ",".join(f"{r:.2f}" for r in rewards)
+     print(
+         f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}",
+         flush=True,
+     )
+
+
+ # ── Prompts ───────────────────────────────────────────────────────────────────
+
+ SYSTEM_PROMPT = textwrap.dedent("""
+     You are an expert software engineer performing a thorough code review.
+     Your job is to:
+     1. Identify ALL bugs, security vulnerabilities, performance issues, and logic errors.
+     2. For each issue, output a JSON action with action_type="review".
+     3. After identifying all issues, output a patch with action_type="patch".
+     4. Finally, output action_type="submit" with your verdict.
+
+     Each response must be a single valid JSON object. No markdown, no explanation outside JSON.
+
+     Schema:
+     {
+         "action_type": "review" | "patch" | "comment" | "submit",
+         "severity": "critical" | "major" | "minor" | "info",
+         "issue_type": "bug" | "security" | "performance" | "logic" | "style",
+         "line_number": <int or null>,
+         "description": "<description of the issue>",
+         "patched_code": "<full corrected code>",
+         "comment": "<optional>",
+         "verdict": "approve" | "request_changes" | "reject",
+         "confidence": <0.0-1.0>
+     }
+
+     Output ONE JSON object per response. Be precise and thorough.
+     """).strip()
+
+
+ def build_user_prompt(obs: Dict[str, Any]) -> str:
+     ctx = obs["review_context"]
+     files_text = "\n\n".join(
+         f"=== {f['filename']} ({f['language']}) ===\n{f['content']}"
+         for f in ctx["files_changed"]
+     )
+     issues_so_far = obs.get("issues_found_so_far", [])
+
+     # Template kept flush-left: dedent() would be a no-op once the unindented
+     # file contents are interpolated, so indenting it would leak into the prompt.
+     prompt = f"""\
+ Pull Request: {ctx['pull_request_title']}
+ Author: {ctx['author']}
+ Description: {ctx['description']}
+ Linter: {ctx.get('linter_output', 'N/A')}
+ Tests: {ctx.get('test_results', 'N/A')}
+
+ --- CODE ---
+ {files_text}
+ --- END CODE ---
+
+ Step: {obs['step']} / {obs['max_steps']}
+ Issues reported so far: {len(issues_so_far)}""".strip()
+
+     if issues_so_far:
+         prompt += "\n\nIssues already reported (do NOT repeat these):"
+         for iss in issues_so_far:
+             prompt += f"\n  - [{iss.get('severity','?')}] line {iss.get('line','?')}: {iss.get('description','')}"
+
+     steps_left = obs['max_steps'] - obs['step']
+     if steps_left <= 2:
+         prompt += "\n\nYou are almost out of steps. Submit your patch and verdict NOW."
+     elif obs['step'] == 0:
+         prompt += "\n\nBegin your review. Output your first action as JSON."
+     else:
+         prompt += "\n\nContinue reviewing or submit if done. Output next action as JSON."
+
+     return prompt
+
+
+ # ── JSON extraction ───────────────────────────────────────────────────────────
+
+ def extract_json(text: str) -> Dict[str, Any]:
+     text = text.strip()
+     if text.startswith("```"):
+         lines = text.split("\n")
+         text = "\n".join(lines[1:-1]) if len(lines) > 2 else text
+     try:
+         return json.loads(text)
+     except json.JSONDecodeError:
+         pass
+     start = text.find("{")
+     if start == -1:
+         raise ValueError("No JSON object found in response")
+     depth = 0
+     for i, ch in enumerate(text[start:], start):
+         if ch == "{":
+             depth += 1
+         elif ch == "}":
+             depth -= 1
+             if depth == 0:
+                 return json.loads(text[start:i + 1])
+     raise ValueError("Unbalanced JSON in response")
+
+
+ # ── Episode runner ────────────────────────────────────────────────────────────
+
+ def run_episode(client: OpenAI, task_id: str) -> Dict[str, Any]:
+     env = CodeReviewEnv()
+     obs_obj = env.reset(task_id)
+     obs = obs_obj.model_dump()
+
+     log_start(task=task_id, env=BENCHMARK, model=MODEL_NAME)
+
+     rewards: List[float] = []
+     steps_taken = 0
+     score = 0.0
+     success = False
+     history: List[Dict[str, str]] = []
+     patch_submitted = False
+     error_msg: Optional[str] = None
+
+     try:
+         for step in range(1, obs_obj.max_steps + 1):
+             if obs.get("done"):
+                 break
+
+             error_msg = None
+             steps_left = obs["max_steps"] - obs["step"]
+
+             # Force patch then submit near step limit
+             if steps_left <= 1 and not patch_submitted:
+                 action_dict = {
+                     "action_type": "patch",
+                     "patched_code": obs["review_context"]["files_changed"][0]["content"],
+                 }
+             elif steps_left <= 0:
+                 action_dict = {
+                     "action_type": "submit",
+                     "verdict": "request_changes",
+                     "confidence": 0.5,
+                 }
+             else:
+                 user_msg = build_user_prompt(obs)
+                 history.append({"role": "user", "content": user_msg})
+
+                 try:
+                     completion = client.chat.completions.create(
+                         model=MODEL_NAME,
+                         messages=[{"role": "system", "content": SYSTEM_PROMPT}] + history,
+                         max_tokens=1024,
+                         temperature=0.2,
+                         stream=False,
+                     )
+                     raw = (completion.choices[0].message.content or "").strip()
+                     history.append({"role": "assistant", "content": raw})
+                     action_dict = extract_json(raw)
+                 except Exception as exc:
+                     error_msg = str(exc)[:80]
+                     action_dict = {
+                         "action_type": "submit",
+                         "verdict": "request_changes",
+                         "confidence": 0.3,
+                     }
+
+             if action_dict.get("action_type") == "patch":
+                 patch_submitted = True
+
+             # Validate action
+             try:
+                 action = ReviewAction(**action_dict)
+             except Exception as exc:
+                 error_msg = str(exc)[:80]
+                 action = ReviewAction(
+                     action_type="submit",
+                     verdict="request_changes",
+                     confidence=0.3,
+                 )
+
+             # Step environment
+             obs_obj, reward_obj, done, info = env.step(action)
+             obs = obs_obj.model_dump()
+
+             reward = reward_obj.value
+             rewards.append(reward)
+             steps_taken = step
+
+             action_summary = f"{action.action_type}:{(action.description or action.verdict or '')[:60]}"
+             log_step(step=step, action=action_summary, reward=reward, done=done, error=error_msg)
+
+             if done:
+                 score = info.get("final_score", 0.0)
+                 break
+
+         success = score >= SUCCESS_SCORE_THRESHOLD
+
+     finally:
+         log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
+
+     return {"task_id": task_id, "score": score, "steps": steps_taken, "success": success}
+
+
+ # ── Main ──────────────────────────────────────────────────────────────────────
+
+ def main() -> None:
+     if not API_KEY:
+         print("[ERROR] HF_TOKEN environment variable not set.", flush=True)
+         sys.exit(1)
+
+     client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+
+     task_ids = os.getenv("TASK_IDS", ",".join(TASK_IDS)).split(",")
+     task_ids = [t.strip() for t in task_ids if t.strip()]
+
+     all_results = []
+     for task_id in task_ids:
+         result = run_episode(client, task_id)
+         all_results.append(result)
+
+     # Aggregate summary to stderr so it doesn't pollute stdout log format
+     print("\n[SUMMARY]", file=sys.stderr)
+     for r in all_results:
+         print(f"  {r['task_id']}: score={r['score']:.3f} steps={r['steps']} success={r['success']}", file=sys.stderr)
+     if all_results:
+         avg = sum(r["score"] for r in all_results) / len(all_results)
+         print(f"  aggregate: {avg:.3f}", file=sys.stderr)
+
+
+ if __name__ == "__main__":
+     main()
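The `extract_json` helper above tolerates chatty model output by scanning for the first balanced `{...}` object. Reproduced standalone, its behavior on a typical non-compliant response:

```python
import json


def extract_json(text: str):
    """Pull the first balanced JSON object out of a model response."""
    text = text.strip()
    # Strip a surrounding markdown fence if present.
    if text.startswith("```"):
        lines = text.split("\n")
        text = "\n".join(lines[1:-1]) if len(lines) > 2 else text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        pass
    # Fall back to brace-depth scanning from the first "{".
    start = text.find("{")
    if start == -1:
        raise ValueError("No JSON object found in response")
    depth = 0
    for i, ch in enumerate(text[start:], start):
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return json.loads(text[start:i + 1])
    raise ValueError("Unbalanced JSON in response")


reply = 'Sure, here is my review: {"action_type": "review", "severity": "major"} hope that helps'
assert extract_json(reply)["action_type"] == "review"
```

The brace counter also handles nested objects, since depth only returns to zero at the close of the outermost one.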
models.py ADDED
@@ -0,0 +1,92 @@
+ """
+ OpenEnv-compliant Pydantic models for the Code Review Environment.
+ """
+ from __future__ import annotations
+ from typing import Any, Dict, List, Literal, Optional
+ from pydantic import BaseModel, Field
+
+
+ # ─── Action Space ────────────────────────────────────────────────────────────
+
+ class ReviewAction(BaseModel):
+     """Agent action: review and optionally patch code."""
+     action_type: Literal["review", "patch", "comment", "submit"] = Field(
+         description="Type of action the agent takes."
+     )
+     # For 'review': provide a structured analysis
+     severity: Optional[Literal["critical", "major", "minor", "info"]] = None
+     issue_type: Optional[str] = Field(
+         default=None,
+         description="Category: bug, security, performance, style, logic"
+     )
+     line_number: Optional[int] = Field(default=None, ge=1)
+     description: Optional[str] = Field(default=None, max_length=500)
+
+     # For 'patch': provide fixed code
+     patched_code: Optional[str] = Field(
+         default=None,
+         description="Full corrected code (for patch actions)."
+     )
+
+     # For 'comment': free-form annotation
+     comment: Optional[str] = Field(default=None, max_length=1000)
+
+     # For 'submit': final verdict
+     verdict: Optional[Literal["approve", "request_changes", "reject"]] = None
+     confidence: Optional[float] = Field(default=None, ge=0.0, le=1.0)
+
+
+ # ─── Observation Space ───────────────────────────────────────────────────────
+
+ class CodeFile(BaseModel):
+     filename: str
+     language: str
+     content: str
+     line_count: int
+
+
+ class ReviewContext(BaseModel):
+     pull_request_title: str
+     author: str
+     description: str
+     files_changed: List[CodeFile]
+     test_results: Optional[str] = None
+     linter_output: Optional[str] = None
+
+
+ class Observation(BaseModel):
+     """What the agent sees at each step."""
+     task_id: str
+     step: int
+     max_steps: int
+     review_context: ReviewContext
+     previous_actions: List[ReviewAction] = Field(default_factory=list)
+     feedback: Optional[str] = None
+     issues_found_so_far: List[Dict[str, Any]] = Field(default_factory=list)
+     score_so_far: float = 0.0
+     done: bool = False
+
+
+ # ─── Reward Model ────────────────────────────────────────────────────────────
+
+ class StepReward(BaseModel):
+     """Reward signal returned at each step."""
+     value: float = Field(ge=-1.0, le=1.0)
+     breakdown: Dict[str, float] = Field(default_factory=dict)
+     explanation: str = ""
+
+
+ # ─── State ───────────────────────────────────────────────────────────────────
+
+ class EnvironmentState(BaseModel):
+     task_id: str
+     step: int
+     max_steps: int
+     review_context: ReviewContext
+     actions_taken: List[ReviewAction] = Field(default_factory=list)
+     issues_identified: List[Dict[str, Any]] = Field(default_factory=list)
+     patch_submitted: Optional[str] = None
+     verdict_submitted: Optional[str] = None
+     total_reward: float = 0.0
+     done: bool = False
+     terminated_reason: Optional[str] = None
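For illustration, two action payloads shaped to match the `ReviewAction` schema above, expressed as plain dicts (a sketch only; the environment validates them through Pydantic):

```python
# A "review" action flagging one issue, and a final "submit" action.
review_action = {
    "action_type": "review",
    "severity": "critical",
    "issue_type": "security",
    "line_number": 23,
    "description": "User input interpolated directly into a SQL query",
}
submit_action = {
    "action_type": "submit",
    "verdict": "request_changes",
    "confidence": 0.8,
}

# Sanity checks mirroring the model's constraints.
allowed_types = {"review", "patch", "comment", "submit"}
assert review_action["action_type"] in allowed_types
assert review_action["line_number"] >= 1
assert 0.0 <= submit_action["confidence"] <= 1.0
```

All fields except `action_type` are optional, so each action type only carries the subset of fields it needs.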
openenv.yaml ADDED
@@ -0,0 +1,99 @@
+ name: code-review-env
+ version: "1.0.0"
+ spec: openenv/v1
+ tags:
+   - openenv
+   - code-review
+   - software-engineering
+   - security
+   - agent-evaluation
+
+ description: >
+   A code review environment where AI agents act as senior engineers reviewing
+   pull requests. Tasks span bug hunting (easy), security auditing (medium),
+   and distributed systems correctness review (hard). Fully OpenEnv-compliant
+   with typed Pydantic models, dense reward signals, and programmatic graders.
+
+ author: "Meta Hackathon Submission"
+ license: MIT
+
+ observation_space:
+   type: object
+   description: >
+     Structured pull request context including code files, linter output,
+     test results, and history of previous actions taken in the episode.
+   fields:
+     - task_id: string
+     - step: integer
+     - max_steps: integer
+     - review_context: ReviewContext
+     - previous_actions: list[ReviewAction]
+     - issues_found_so_far: list[dict]
+     - score_so_far: float [0.0, 1.0]
+     - done: boolean
+
+ action_space:
+   type: object
+   description: >
+     Agents may review (annotate an issue), patch (submit corrected code),
+     comment (free-form annotation), or submit (final verdict).
+   action_types:
+     - review: annotate a specific issue with severity, type, line, and description
+     - patch: provide full corrected code
+     - comment: free-form annotation
+     - submit: final verdict (approve | request_changes | reject) with confidence
+
+ reward:
+   type: dense
+   range: [-1.0, 1.0]
+   description: >
+     Intermediate reward encourages efficient, non-repetitive, actionable reviews.
+     Final reward (at submit or max_steps) is the programmatic grader score in [0.0, 1.0].
+   components:
+     step_penalty: -0.01 per step (encourages efficiency)
+     review_description_bonus: +0.05 for substantive review action
+     critical_severity_bonus: +0.03 for marking an issue as critical
+     patch_submitted_bonus: +0.10 for submitting non-trivial patch
+     repetition_penalty: -0.05 for repeating identical descriptions
+
+ tasks:
+   - id: task_1_easy_bug_hunt
+     difficulty: easy
+     max_steps: 8
+     description: >
+       Find three planted bugs in a Python utility module:
+       assignment-instead-of-comparison, off-by-one loop bound, missing return.
+     grader: keyword-match + AST parse of patch
+     max_score: 1.0
+
+   - id: task_2_medium_security
+     difficulty: medium
+     max_steps: 12
+     description: >
+       Audit a Flask authentication endpoint for six security vulnerabilities:
+       SQL injection (x2), plaintext passwords, no rate limiting,
+       sensitive data leakage, hardcoded secret key.
+     grader: keyword-match across action descriptions + patch structural check
+     max_score: 1.0
+
+   - id: task_3_hard_perf_correctness
+     difficulty: hard
+     max_steps: 16
+     description: >
+       Review a distributed LRU cache backed by Redis for six issues:
+       race condition, memory leak, N+1 query, wrong LRU order,
+       thread-safety violation, pickle deserialization exploit.
+     grader: keyword-match + patch structural check (Lock, OrderedDict, mget, json)
+     max_score: 1.0
+
+ baseline_scores:
+   model: Qwen/Qwen2.5-72B-Instruct
+   task_1_easy_bug_hunt: 0.72
+   task_2_medium_security: 0.55
+   task_3_hard_perf_correctness: 0.38
+   aggregate: 0.55
+
+ deployment:
+   platform: huggingface_spaces
+   sdk: docker
+   port: 7860
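The dense reward components listed above compose as a per-step sum. A minimal sketch of that composition (the function name and boolean flags are illustrative, not taken from the environment code):

```python
def step_reward(is_substantive_review: bool, is_critical: bool,
                submitted_patch: bool, repeated_description: bool) -> float:
    """Combine the per-step reward components from openenv.yaml."""
    reward = -0.01                  # step_penalty: charged every step
    if is_substantive_review:
        reward += 0.05              # review_description_bonus
    if is_critical:
        reward += 0.03              # critical_severity_bonus
    if submitted_patch:
        reward += 0.10              # patch_submitted_bonus
    if repeated_description:
        reward -= 0.05              # repetition_penalty
    return reward


# A substantive review flagging a critical issue nets 0.07 for the step.
r = step_reward(True, True, False, False)
assert abs(r - 0.07) < 1e-9
```

The small per-step penalty means an agent that only repeats itself trends negative, while the final grader score (computed at submit or max_steps) dominates the episode return.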
requirements.txt ADDED
@@ -0,0 +1,6 @@
+ fastapi==0.111.0
+ uvicorn[standard]==0.29.0
+ pydantic==2.7.1
+ openai>=1.30.0
+ requests>=2.31.0
+ pyyaml>=6.0
server.py ADDED
@@ -0,0 +1,126 @@
+ """
+ FastAPI server exposing the CodeReviewEnv as an HTTP API.
+ Endpoints mirror the OpenEnv interface: /reset, /step, /state, /tasks.
+ """
+ from __future__ import annotations
+
+ import sys
+ import os
+ sys.path.insert(0, os.path.dirname(__file__))
+
+ from typing import Any, Dict
+ import uuid
+
+ from fastapi import FastAPI, HTTPException
+ from fastapi.middleware.cors import CORSMiddleware
+ from pydantic import BaseModel
+
+ from env import CodeReviewEnv, TASK_IDS
+ from models import ReviewAction, Observation, StepReward, EnvironmentState
+
+ app = FastAPI(
+     title="CodeReviewEnv",
+     description="OpenEnv-compliant environment for AI code review agents.",
+     version="1.0.0",
+ )
+
+ app.add_middleware(
+     CORSMiddleware,
+     allow_origins=["*"],
+     allow_methods=["*"],
+     allow_headers=["*"],
+ )
+
+ # ── Session store (in-memory, single process) ────────────────────────────────
+ _sessions: Dict[str, CodeReviewEnv] = {}
+
+
+ # ── Request / Response models ─────────────────────────────────────────────────
+
+ class ResetRequest(BaseModel):
+     task_id: str = "task_1_easy_bug_hunt"
+     session_id: str | None = None
+
+
+ class ResetResponse(BaseModel):
+     session_id: str
+     observation: Observation
+
+
+ class StepRequest(BaseModel):
+     session_id: str
+     action: ReviewAction
+
+
+ class StepResponse(BaseModel):
+     observation: Observation
+     reward: StepReward
+     done: bool
+     info: Dict[str, Any]
+
+
+ # ── Routes ────────────────────────────────────────────────────────────────────
+
+ @app.get("/")
+ def root():
+     return {
+         "name": "CodeReviewEnv",
+         "version": "1.0.0",
+         "tasks": TASK_IDS,
+         "spec": "OpenEnv v1",
+     }
+
+
+ @app.get("/tasks")
+ def list_tasks():
+     from tasks.task1_easy import get_task_config as t1
+     from tasks.task2_medium import get_task_config as t2
+     from tasks.task3_hard import get_task_config as t3
+     tasks = []
+     for fn in (t1, t2, t3):
+         cfg = fn()
+         tasks.append({
+             "task_id": cfg["task_id"],
+             "difficulty": cfg["difficulty"],
+             "description": cfg["description"],
+         })
+     return {"tasks": tasks}
+
+
+ @app.post("/reset", response_model=ResetResponse)
+ def reset(req: ResetRequest):
+     if req.task_id not in TASK_IDS:
+         raise HTTPException(400, f"Unknown task_id {req.task_id!r}. Choose from {TASK_IDS}")
+     session_id = req.session_id or str(uuid.uuid4())
+     env = CodeReviewEnv()
+     obs = env.reset(req.task_id)
+     _sessions[session_id] = env
+     return ResetResponse(session_id=session_id, observation=obs)
+
+
+ @app.post("/step", response_model=StepResponse)
+ def step(req: StepRequest):
+     env = _sessions.get(req.session_id)
+     if env is None:
+         raise HTTPException(404, f"Session {req.session_id!r} not found. Call /reset first.")
+     obs, reward, done, info = env.step(req.action)
+     return StepResponse(observation=obs, reward=reward, done=done, info=info)
+
+
+ @app.get("/state/{session_id}", response_model=EnvironmentState)
+ def get_state(session_id: str):
+     env = _sessions.get(session_id)
+     if env is None:
+         raise HTTPException(404, f"Session {session_id!r} not found.")
+     return env.state()
+
+
+ @app.delete("/session/{session_id}")
+ def delete_session(session_id: str):
+     _sessions.pop(session_id, None)
+     return {"deleted": session_id}
+
+
+ if __name__ == "__main__":
+     import uvicorn
+     uvicorn.run("server:app", host="0.0.0.0", port=7860, reload=False)
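A client drives the server through `POST /reset` followed by repeated `POST /step` calls. The request bodies look like this (a sketch of the payloads only; actually sending them requires the server running on port 7860, e.g. via the `requests` library, and the session id shown is a placeholder):

```python
import json

# Body for POST /reset: pick a task; session_id may be omitted to get a fresh one.
reset_body = {"task_id": "task_1_easy_bug_hunt"}

# Body for POST /step: echo back the session_id returned by /reset
# together with one ReviewAction.
step_body = {
    "session_id": "00000000-0000-0000-0000-000000000000",  # placeholder
    "action": {
        "action_type": "review",
        "severity": "critical",
        "issue_type": "bug",
        "line_number": 3,
        "description": "Assignment used instead of == comparison",
    },
}

# Both bodies must serialize cleanly to JSON for the HTTP round-trip.
assert json.loads(json.dumps(step_body))["action"]["action_type"] == "review"
```

Because sessions live in an in-memory dict, payloads must reuse the same `session_id` for every `/step` of an episode, and the episode is lost if the process restarts.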
tasks/__init__.py ADDED
@@ -0,0 +1 @@
+ # tasks package
tasks/task1_easy.py ADDED
@@ -0,0 +1,124 @@
+ """
+ Task 1 (Easy): Bug Identification in a Simple Python Utility.
+
+ The agent reviews a short Python module with 3 clearly planted bugs:
+     1. Incorrect comparison operator (= vs ==)
+     2. Off-by-one error in a loop
+     3. Missing return statement in a branch
+ """
+ from __future__ import annotations
+ from typing import Any, Dict
+
+ TASK_ID = "task_1_easy_bug_hunt"
+ MAX_STEPS = 8
+
+ BUGGY_CODE = '''\
+ def find_max(numbers: list) -> int:
+     """Return the maximum value in a non-empty list."""
+     if len(numbers) = 0:  # BUG 1: assignment instead of == comparison
+         raise ValueError("List is empty")
+     max_val = numbers[0]
+     for i in range(1, len(numbers) + 1):  # BUG 2: off-by-one, should be len(numbers)
+         if numbers[i] > max_val:
+             max_val = numbers[i]
+     # BUG 3: missing return statement - falls off the end returning None
+
+
+ def calculate_average(numbers: list) -> float:
+     """Return the arithmetic mean of a list of numbers."""
+     if not numbers:
+         raise ValueError("Cannot average empty list")
+     total = 0
+     for n in numbers:
+         total += n
+     return total / len(numbers)
+
+
+ def is_palindrome(s: str) -> bool:
+     """Check whether a string is a palindrome (case-insensitive)."""
+     cleaned = s.lower().replace(" ", "")
+     return cleaned == cleaned[::-1]
+ '''
+
+ FIXED_CODE = '''\
+ def find_max(numbers: list) -> int:
+     """Return the maximum value in a non-empty list."""
+     if len(numbers) == 0:
+         raise ValueError("List is empty")
+     max_val = numbers[0]
+     for i in range(1, len(numbers)):
+         if numbers[i] > max_val:
+             max_val = numbers[i]
+     return max_val
+
+
+ def calculate_average(numbers: list) -> float:
+     """Return the arithmetic mean of a list of numbers."""
+     if not numbers:
+         raise ValueError("Cannot average empty list")
+     total = 0
+     for n in numbers:
+         total += n
+     return total / len(numbers)
+
+
+ def is_palindrome(s: str) -> bool:
+     """Check whether a string is a palindrome (case-insensitive)."""
+     cleaned = s.lower().replace(" ", "")
+     return cleaned == cleaned[::-1]
+ '''
+
+ KNOWN_BUGS = {
+     "bug_comparison_operator": {
+         "line": 3,
+         "description_keywords": ["assignment", "comparison", "==", "=", "operator"],
+         "severity": "critical",
+         "issue_type": "bug",
+     },
+     "bug_off_by_one": {
+         "line": 6,
+         "description_keywords": ["off-by-one", "index", "range", "len", "+1", "IndexError"],
+         "severity": "critical",
+         "issue_type": "bug",
+     },
+     "bug_missing_return": {
+         "line": 9,
+         "description_keywords": ["return", "None", "missing", "falls off"],
+         "severity": "major",
+         "issue_type": "bug",
+     },
+ }
+
+ PULL_REQUEST = {
+     "pull_request_title": "Add utility functions: find_max, calculate_average, is_palindrome",
+     "author": "dev-intern",
+     "description": (
+         "Implements three utility functions for list and string operations. "
+         "Please review for correctness before merging."
+     ),
+     "files_changed": [
+         {
+             "filename": "utils.py",
+             "language": "python",
+             "content": BUGGY_CODE,
+             "line_count": BUGGY_CODE.count("\n") + 1,
+         }
+     ],
+     "test_results": "No tests provided.",
+     "linter_output": "SyntaxError detected on line 3 (invalid syntax).",
+ }
+
+
+ def get_task_config() -> Dict[str, Any]:
+     return {
+         "task_id": TASK_ID,
+         "max_steps": MAX_STEPS,
+         "pull_request": PULL_REQUEST,
+         "known_bugs": KNOWN_BUGS,
+         "fixed_code": FIXED_CODE,
+         "difficulty": "easy",
+         "description": (
+             "Review a short Python utility module. "
+             "Find and describe all bugs, then submit a patched version."
+         ),
+     }
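With all three planted bugs fixed, the `find_max` from FIXED_CODE behaves correctly. Reproduced standalone as a quick check:

```python
def find_max(numbers: list) -> int:
    """Return the maximum value in a non-empty list."""
    if len(numbers) == 0:          # fix 1: == comparison, not assignment
        raise ValueError("List is empty")
    max_val = numbers[0]
    for i in range(1, len(numbers)):  # fix 2: correct loop bound
        if numbers[i] > max_val:
            max_val = numbers[i]
    return max_val                 # fix 3: return the result


assert find_max([3, 1, 4, 1, 5]) == 5
assert find_max([7]) == 7
```

The buggy version, by contrast, would not even parse (`=` in the condition is a syntax error), which is why the grader first checks the patch with `ast.parse` before looking for the individual fixes.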
tasks/task2_medium.py ADDED
@@ -0,0 +1,208 @@
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
"""
Task 2 (Medium): Security Vulnerability Review in a Flask Web Endpoint.

The agent reviews a Flask user-authentication endpoint containing:
1. SQL injection vulnerability (string formatting into query)
2. Plaintext password storage (no hashing)
3. Missing rate limiting / brute-force protection
4. Sensitive data leakage in error response
5. Hardcoded secret key
"""
from __future__ import annotations
from typing import Any, Dict

TASK_ID = "task_2_medium_security"
MAX_STEPS = 12

BUGGY_CODE = '''\
import sqlite3
from flask import Flask, request, jsonify

app = Flask(__name__)
app.secret_key = "supersecret123"  # VULN 5: hardcoded secret key

DB_PATH = "users.db"


def get_db():
    return sqlite3.connect(DB_PATH)


@app.route("/login", methods=["POST"])
def login():
    username = request.json.get("username")
    password = request.json.get("password")

    db = get_db()
    cursor = db.cursor()

    # VULN 1: SQL injection -- user input directly interpolated into query
    query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
    cursor.execute(query)
    user = cursor.fetchone()

    if user:
        return jsonify({"status": "ok", "user_id": user[0], "email": user[2]})
    else:
        # VULN 4: leaks whether username exists or password is wrong
        cursor.execute(f"SELECT id FROM users WHERE username = '{username}'")
        exists = cursor.fetchone()
        if exists:
            return jsonify({"error": f"Wrong password for user {username}"}), 401
        return jsonify({"error": f"User {username} does not exist"}), 404


@app.route("/register", methods=["POST"])
def register():
    username = request.json.get("username")
    password = request.json.get("password")  # VULN 2: stored in plaintext
    email = request.json.get("email")

    db = get_db()
    cursor = db.cursor()
    # VULN 1 again: SQL injection in insert
    cursor.execute(
        f"INSERT INTO users (username, password, email) VALUES ('{username}', '{password}', '{email}')"
    )
    db.commit()
    return jsonify({"status": "registered"})


# VULN 3: No rate limiting on login endpoint (brute-force possible)

if __name__ == "__main__":
    app.run(debug=True)
'''

FIXED_CODE = '''\
import os
import sqlite3
from flask import Flask, request, jsonify
from werkzeug.security import generate_password_hash, check_password_hash
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
app.secret_key = os.environ.get("SECRET_KEY")  # read from env, never hardcode

limiter = Limiter(get_remote_address, app=app, default_limits=["200 per day", "50 per hour"])

DB_PATH = "users.db"


def get_db():
    return sqlite3.connect(DB_PATH)


@app.route("/login", methods=["POST"])
@limiter.limit("5 per minute")  # brute-force protection
def login():
    username = request.json.get("username")
    password = request.json.get("password")

    db = get_db()
    cursor = db.cursor()

    # Parameterised query -- prevents SQL injection
    cursor.execute("SELECT id, password_hash FROM users WHERE username = ?", (username,))
    user = cursor.fetchone()

    if user and check_password_hash(user[1], password):
        return jsonify({"status": "ok", "user_id": user[0]})
    # Generic error -- does not reveal whether user exists
    return jsonify({"error": "Invalid credentials"}), 401


@app.route("/register", methods=["POST"])
def register():
    username = request.json.get("username")
    password = request.json.get("password")
    email = request.json.get("email")

    db = get_db()
    cursor = db.cursor()
    password_hash = generate_password_hash(password)
    cursor.execute(
        "INSERT INTO users (username, password_hash, email) VALUES (?, ?, ?)",
        (username, password_hash, email),
    )
    db.commit()
    return jsonify({"status": "registered"})


if __name__ == "__main__":
    app.run(debug=False)
'''
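FIXED_CODE leans on `werkzeug.security`, which ships alongside Flask. Where that dependency is unavailable, the same pattern (random salt, slow key-derivation hash, constant-time comparison) can be sketched with the standard library alone; the function names below are illustrative and not part of the task files.

```python
import hashlib
import hmac
import os

def hash_password(password: str) -> bytes:
    """Return salt || PBKDF2 digest; the random salt makes equal passwords hash differently."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return salt + digest

def check_password(stored: bytes, password: str) -> bool:
    salt, digest = stored[:16], stored[16:]
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)
    return hmac.compare_digest(candidate, digest)  # constant-time compare

stored = hash_password("hunter2")
assert check_password(stored, "hunter2")
assert not check_password(stored, "wrong")
```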

KNOWN_VULNERABILITIES = {
    "sql_injection_login": {
        "line": 23,
        "description_keywords": ["sql injection", "parameterized", "f-string", "format", "interpolat", "query"],
        "severity": "critical",
        "issue_type": "security",
    },
    "sql_injection_register": {
        "line": 44,
        "description_keywords": ["sql injection", "parameterized", "f-string", "format", "interpolat", "insert"],
        "severity": "critical",
        "issue_type": "security",
    },
    "plaintext_password": {
        "line": 39,
        "description_keywords": ["plaintext", "hash", "bcrypt", "werkzeug", "password", "store"],
        "severity": "critical",
        "issue_type": "security",
    },
    "no_rate_limiting": {
        "line": None,
        "description_keywords": ["rate limit", "brute force", "throttl", "limiter"],
        "severity": "major",
        "issue_type": "security",
    },
    "sensitive_data_leak": {
        "line": 30,
        "description_keywords": ["leak", "enumerat", "username exist", "generic error", "information disclos"],
        "severity": "major",
        "issue_type": "security",
    },
    "hardcoded_secret": {
        "line": 5,
        "description_keywords": ["hardcode", "secret", "env", "environment variable", "secret_key"],
        "severity": "major",
        "issue_type": "security",
    },
}
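The truncated stems in `description_keywords` ("interpolat", "enumerat") suggest case-insensitive substring matching against the agent's review descriptions. The actual grading logic lives in graders/ and is not shown in this diff; the sketch below only illustrates how such stems could be consumed, and `matches_any` is a hypothetical name.

```python
# Hypothetical keyword matcher -- the real grader in graders/ may differ.
def matches_any(description: str, keywords: list) -> bool:
    text = description.lower()
    # Truncated stems like "interpolat" match "interpolated", "interpolation", etc.
    return any(kw.lower() in text for kw in keywords)

kws = ["sql injection", "parameterized", "interpolat"]
assert matches_any("User input is interpolated straight into the query", kws)
assert not matches_any("Looks good to me", kws)
```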

PULL_REQUEST = {
    "pull_request_title": "Implement user login and registration API endpoints",
    "author": "backend-dev",
    "description": (
        "Adds /login and /register REST endpoints backed by SQLite. "
        "Ready for production review."
    ),
    "files_changed": [
        {
            "filename": "auth.py",
            "language": "python",
            "content": BUGGY_CODE,
            "line_count": BUGGY_CODE.count("\n") + 1,
        }
    ],
    "test_results": "Manual testing passed on happy path.",
    "linter_output": "No linter warnings.",
}


def get_task_config() -> Dict[str, Any]:
    return {
        "task_id": TASK_ID,
        "max_steps": MAX_STEPS,
        "pull_request": PULL_REQUEST,
        "known_vulnerabilities": KNOWN_VULNERABILITIES,
        "fixed_code": FIXED_CODE,
        "difficulty": "medium",
        "description": (
            "Review a Flask authentication endpoint for security vulnerabilities. "
            "Identify all issues by category and severity, then provide a secure patched version."
        ),
    }
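A concrete illustration of VULN 1, runnable against an in-memory SQLite database (the table layout is invented for the demo): the classic comment payload bypasses the password check in the interpolated query but not in the parameterised one.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, username TEXT, password TEXT)")
db.execute("INSERT INTO users VALUES (1, 'alice', 'hunter2')")

username = "alice' --"   # attacker-controlled: the -- comments out the password clause
password = "wrong"

# Interpolated query, as in BUGGY_CODE: login succeeds without the password
query = f"SELECT * FROM users WHERE username = '{username}' AND password = '{password}'"
assert db.execute(query).fetchone() is not None

# Parameterised query, as in FIXED_CODE: the payload is treated as data, login fails
safe = "SELECT * FROM users WHERE username = ? AND password = ?"
assert db.execute(safe, (username, password)).fetchone() is None
```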
tasks/task3_hard.py ADDED
@@ -0,0 +1,241 @@
"""
Task 3 (Hard): Performance & Correctness Review of a Distributed LRU Cache.

The agent reviews a Python LRU cache with Redis backing containing:
1. Race condition (non-atomic check-then-act on Redis)
2. Memory leak (unbounded local dict grows forever)
3. N+1 query pattern (per-key pipeline not batched)
4. Incorrect LRU eviction (uses insertion order, not access order)
5. Thread-safety violation (shared dict without lock)
6. Silent data corruption (pickle loads untrusted bytes)
"""
from __future__ import annotations
from typing import Any, Dict

TASK_ID = "task_3_hard_perf_correctness"
MAX_STEPS = 16

BUGGY_CODE = '''\
import pickle
import threading
import redis

class DistributedLRUCache:
    """
    LRU cache backed by Redis for distributed deployments.
    Local dict acts as an L1 write-through layer.
    """

    def __init__(self, capacity: int, redis_url: str = "redis://localhost:6379"):
        self.capacity = capacity
        self.local = {}  # ISSUE 2 & 5: shared dict, no lock, unbounded growth
        self.redis = redis.from_url(redis_url)
        self.hits = 0
        self.misses = 0

    # ISSUE 5: no lock; concurrent writes race on self.local
    def get(self, key: str):
        if key in self.local:
            self.hits += 1
            return self.local[key]  # ISSUE 4: doesn't update LRU order

        # ISSUE 1: race condition -- between EXISTS and GET another process may delete key
        if self.redis.exists(key):
            raw = self.redis.get(key)
            value = pickle.loads(raw)  # ISSUE 6: deserialising untrusted bytes
            self.local[key] = value  # ISSUE 2: local dict grows without bound
            self.hits += 1
            return value

        self.misses += 1
        return None

    def put(self, key: str, value, ttl: int = 300):
        # ISSUE 2: no eviction from self.local; grows forever
        self.local[key] = value

        # ISSUE 1: non-atomic: set + expire are two separate commands
        self.redis.set(key, pickle.dumps(value))
        self.redis.expire(key, ttl)

    def get_many(self, keys: list):
        # ISSUE 3: N+1 -- calls self.get() in a loop instead of using pipeline/mget
        return {k: self.get(k) for k in keys}

    def invalidate(self, key: str):
        self.local.pop(key, None)
        self.redis.delete(key)

    def stats(self):
        total = self.hits + self.misses
        return {
            "hits": self.hits,
            "misses": self.misses,
            "hit_rate": self.hits / total if total else 0,
            "local_size": len(self.local),
        }
'''
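ISSUE 6 is easy to demonstrate in isolation: `pickle.loads` invokes whatever importable callable a crafted payload names. In this sketch the callable is the harmless builtin `list`, but anything reachable by import, `os.system` included, works the same way, which is why the fixed version switches to JSON.

```python
import pickle

class Crafted:
    def __reduce__(self):
        # "On load, call list('pwned')" -- an attacker who can write to Redis
        # could name any importable callable here instead.
        return (list, ("pwned",))

payload = pickle.dumps(Crafted())   # bytes an attacker could plant in the cache
obj = pickle.loads(payload)         # executes the embedded call on deserialisation
assert obj == ["p", "w", "n", "e", "d"]

# JSON round-trips data without executing anything:
import json
assert json.loads(json.dumps({"k": 1})) == {"k": 1}
```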

FIXED_CODE = '''\
import json
import threading
from collections import OrderedDict
import redis

class DistributedLRUCache:
    """
    Thread-safe LRU cache backed by Redis.
    Uses OrderedDict for correct LRU eviction, a Lock for thread safety,
    atomic Redis SET EX commands, and mget for batch fetching.
    Serialises with JSON (not pickle) to avoid arbitrary code execution.
    """

    def __init__(self, capacity: int, redis_url: str = "redis://localhost:6379"):
        self.capacity = capacity
        self.local: OrderedDict = OrderedDict()  # correct LRU order
        self._lock = threading.Lock()  # thread safety
        self.redis = redis.from_url(redis_url)
        self.hits = 0
        self.misses = 0

    def get(self, key: str):
        with self._lock:
            if key in self.local:
                self.local.move_to_end(key)  # update LRU order
                self.hits += 1
                return self.local[key]

        raw = self.redis.get(key)  # atomic single GET, no race
        if raw is not None:
            value = json.loads(raw)  # safe deserialisation
            with self._lock:
                self._evict_if_needed()
                self.local[key] = value
                self.hits += 1
            return value

        with self._lock:
            self.misses += 1
        return None

    def _evict_if_needed(self):
        """Call with self._lock held."""
        while len(self.local) >= self.capacity:
            self.local.popitem(last=False)  # evict LRU item

    def put(self, key: str, value, ttl: int = 300):
        payload = json.dumps(value)
        self.redis.set(key, payload, ex=ttl)  # atomic SET with TTL
        with self._lock:
            self.local[key] = value
            self.local.move_to_end(key)
            self._evict_if_needed()

    def get_many(self, keys: list):
        """Batch fetch using Redis MGET -- O(1) round trips."""
        if not keys:
            return {}
        raws = self.redis.mget(keys)
        result = {}
        with self._lock:
            for key, raw in zip(keys, raws):
                if raw is not None:
                    value = json.loads(raw)
                    self._evict_if_needed()
                    self.local[key] = value
                    self.hits += 1
                    result[key] = value
                else:
                    self.misses += 1
                    result[key] = None
        return result

    def invalidate(self, key: str):
        with self._lock:
            self.local.pop(key, None)
        self.redis.delete(key)

    def stats(self):
        with self._lock:
            total = self.hits + self.misses
            return {
                "hits": self.hits,
                "misses": self.misses,
                "hit_rate": self.hits / total if total else 0,
                "local_size": len(self.local),
            }
'''

KNOWN_ISSUES = {
    "race_condition": {
        "lines": [23, 43],
        "description_keywords": ["race condition", "atomic", "exists", "set", "pipeline", "non-atomic"],
        "severity": "critical",
        "issue_type": "concurrency",
    },
    "memory_leak": {
        "lines": [13, 27, 38],
        "description_keywords": ["memory leak", "unbounded", "evict", "capacity", "grow"],
        "severity": "critical",
        "issue_type": "performance",
    },
    "n_plus_one": {
        "lines": [47],
        "description_keywords": ["n+1", "pipeline", "mget", "batch", "loop", "round trip"],
        "severity": "major",
        "issue_type": "performance",
    },
    "wrong_lru_order": {
        "lines": [21, 24],
        "description_keywords": ["lru", "order", "move_to_end", "access order", "insertion order", "OrderedDict"],
        "severity": "major",
        "issue_type": "logic",
    },
    "thread_safety": {
        "lines": [13],
        "description_keywords": ["thread", "lock", "concurrent", "race", "mutex", "atomic"],
        "severity": "critical",
        "issue_type": "concurrency",
    },
    "pickle_injection": {
        "lines": [26],
        "description_keywords": ["pickle", "deseri", "arbitrary code", "injection", "untrusted", "json"],
        "severity": "critical",
        "issue_type": "security",
    },
}

PULL_REQUEST = {
    "pull_request_title": "Introduce DistributedLRUCache with Redis backing for session store",
    "author": "senior-eng",
    "description": (
        "Implements a two-tier LRU cache (local + Redis) to reduce DB load by 60%. "
        "Designed for high-throughput production use. Please review thoroughly."
    ),
    "files_changed": [
        {
            "filename": "cache.py",
            "language": "python",
            "content": BUGGY_CODE,
            "line_count": BUGGY_CODE.count("\n") + 1,
        }
    ],
    "test_results": "Unit tests pass. Load tests not yet run.",
    "linter_output": "No issues found by flake8.",
}


def get_task_config() -> Dict[str, Any]:
    return {
        "task_id": TASK_ID,
        "max_steps": MAX_STEPS,
        "pull_request": PULL_REQUEST,
        "known_issues": KNOWN_ISSUES,
        "fixed_code": FIXED_CODE,
        "difficulty": "hard",
        "description": (
            "Review a production-grade distributed LRU cache implementation. "
            "Identify all concurrency, performance, correctness, and security issues. "
            "Provide a fully corrected implementation."
        ),
    }
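ISSUE 4 in miniature, with no Redis required: the sketch below shows the `OrderedDict.move_to_end` idiom from FIXED_CODE evicting by access order, so a key that was just read survives eviction while a stale one does not. The standalone `get`/`put` helpers are illustrative, not part of the task files.

```python
from collections import OrderedDict

cache = OrderedDict()
capacity = 2

def get(key):
    if key in cache:
        cache.move_to_end(key)  # mark as most-recently-used
        return cache[key]
    return None

def put(key, value):
    if key in cache:
        cache.move_to_end(key)
    cache[key] = value
    while len(cache) > capacity:
        cache.popitem(last=False)  # evict the least-recently-used entry

put("a", 1)
put("b", 2)
get("a")        # "a" becomes the most recently used
put("c", 3)     # evicts "b", the true LRU entry, not "a"
assert "a" in cache and "c" in cache and "b" not in cache
```

With a plain dict evicted by insertion order, "a" would have been dropped here even though it was just read; that is exactly the bug flagged at line 21 of BUGGY_CODE.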
tests/test_env.py ADDED
@@ -0,0 +1,162 @@
"""
Test suite: validates OpenEnv compliance and grader correctness.
Run with: python tests/test_env.py
"""
import os
import sys

sys.path.insert(0, os.path.join(os.path.dirname(__file__), ".."))

from env import CodeReviewEnv, TASK_IDS
from models import ReviewAction, Observation, StepReward, EnvironmentState


def test_reset_returns_observation():
    for task_id in TASK_IDS:
        env = CodeReviewEnv()
        obs = env.reset(task_id)
        assert isinstance(obs, Observation), f"reset() must return Observation for {task_id}"
        assert obs.step == 0
        assert obs.task_id == task_id
        assert len(obs.review_context.files_changed) > 0
    print("✓ reset() returns valid Observation for all tasks")


def test_state_returns_environment_state():
    env = CodeReviewEnv()
    env.reset(TASK_IDS[0])
    s = env.state()
    assert isinstance(s, EnvironmentState)
    assert s.step == 0
    print("✓ state() returns EnvironmentState")


def test_step_returns_tuple():
    env = CodeReviewEnv()
    env.reset(TASK_IDS[0])
    action = ReviewAction(
        action_type="review",
        severity="critical",
        issue_type="bug",
        line_number=3,
        description="test description",
    )
    obs, reward, done, info = env.step(action)
    assert isinstance(obs, Observation)
    assert isinstance(reward, StepReward)
    assert isinstance(done, bool)
    assert isinstance(info, dict)
    print("✓ step() returns (Observation, StepReward, bool, dict)")


def test_reward_range():
    env = CodeReviewEnv()
    env.reset(TASK_IDS[0])
    for _ in range(3):
        action = ReviewAction(action_type="review", severity="minor",
                              issue_type="style", description="some issue")
        _, reward, done, _ = env.step(action)
        assert -1.0 <= reward.value <= 1.0, f"Reward {reward.value} out of range"
        if done:
            break
    print("✓ All intermediate rewards in [-1.0, 1.0]")


def test_done_on_submit():
    env = CodeReviewEnv()
    env.reset(TASK_IDS[0])
    action = ReviewAction(action_type="submit", verdict="request_changes", confidence=0.5)
    _, _, done, info = env.step(action)
    assert done is True
    assert "final_score" in info
    assert 0.0 <= info["final_score"] <= 1.0
    print("✓ Episode terminates on submit with final_score in [0.0, 1.0]")


def test_done_on_max_steps():
    env = CodeReviewEnv()
    env.reset(TASK_IDS[0])
    max_steps = env.state().max_steps
    done = False
    for _ in range(max_steps + 5):
        action = ReviewAction(action_type="comment", comment="still reviewing")
        _, _, done, info = env.step(action)
        if done:
            break
    assert done is True, "Episode should terminate at max_steps"
    print("✓ Episode terminates at max_steps")


def test_perfect_score_task1():
    env = CodeReviewEnv()
    env.reset("task_1_easy_bug_hunt")
    actions = [
        ReviewAction(action_type="review", severity="critical", issue_type="bug",
                     line_number=3, description="assignment operator = instead of == comparison operator"),
        ReviewAction(action_type="review", severity="critical", issue_type="bug",
                     line_number=6, description="off-by-one: range should be len(numbers) not len+1 IndexError"),
        ReviewAction(action_type="review", severity="major", issue_type="bug",
                     line_number=9, description="missing return statement returns None"),
        ReviewAction(action_type="patch",
                     patched_code="def find_max(numbers):\n    if len(numbers) == 0:\n        raise ValueError()\n    max_val = numbers[0]\n    for i in range(1, len(numbers)):\n        if numbers[i] > max_val:\n            max_val = numbers[i]\n    return max_val"),
        ReviewAction(action_type="submit", verdict="request_changes", confidence=0.99),
    ]
    done = False
    for a in actions:
        if done:
            break
        _, _, done, info = env.step(a)
    assert info["final_score"] == 1.0, f"Expected 1.0, got {info['final_score']}"
    print("✓ Task 1 perfect score achievable")


def test_zero_score_no_actions():
    env = CodeReviewEnv()
    env.reset("task_2_medium_security")
    action = ReviewAction(action_type="submit", verdict="approve", confidence=0.1)
    _, _, done, info = env.step(action)
    assert info["final_score"] < 0.1, f"Blind approve should score near 0, got {info['final_score']}"
    print("✓ Blind approve scores near 0")


def test_repetition_penalty():
    env = CodeReviewEnv()
    env.reset(TASK_IDS[0])
    same_action = ReviewAction(action_type="review", severity="minor",
                               issue_type="style", description="identical description here")
    env.step(same_action)
    _, reward2, _, _ = env.step(same_action)
    assert reward2.breakdown.get("repetition_penalty", 0) < 0, "Repetition should be penalised"
    print("✓ Repetition penalty applied for identical descriptions")


def test_state_immutability():
    """state() should return a copy, not a live reference."""
    env = CodeReviewEnv()
    env.reset(TASK_IDS[0])
    s1 = env.state()
    env.step(ReviewAction(action_type="comment", comment="hi"))
    s2 = env.state()
    assert s1.step != s2.step, "state() must return a snapshot copy"
    print("✓ state() returns immutable snapshot")


if __name__ == "__main__":
    tests = [
        test_reset_returns_observation,
        test_state_returns_environment_state,
        test_step_returns_tuple,
        test_reward_range,
        test_done_on_submit,
        test_done_on_max_steps,
        test_perfect_score_task1,
        test_zero_score_no_actions,
        test_repetition_penalty,
        test_state_immutability,
    ]
    passed = 0
    for t in tests:
        try:
            t()
            passed += 1
        except Exception as e:
            print(f"✗ {t.__name__}: {e}")
    print(f"\n{passed}/{len(tests)} tests passed")
    sys.exit(0 if passed == len(tests) else 1)