uvpatel7271 committed on
Commit
cd5c208
1 Parent(s): a4ea2be

added modularity and updates

README.md CHANGED
@@ -1,253 +1,169 @@
1
- ---
2
- title: TorchReview Copilot
3
- emoji: 🧠
4
- colorFrom: orange
5
- colorTo: red
6
- sdk: docker
7
- pinned: false
8
- app_port: 8000
9
- tags:
10
- - pytorch
11
- - gradio
12
- - fastapi
13
- - openenv
14
- - code-review
15
- ---
16
 
17
- # TorchReview Copilot
18
 
19
- TorchReview Copilot is an **AI-powered code review and improvement system using PyTorch** to analyze Python code, predict quality, generate structured improvement suggestions, and compute an RL-ready reward score.
20
-
21
- It upgrades the original OpenEnv hackathon environment into a judge-friendly product demo: a polished Hugging Face Space on top, with the deterministic OpenEnv validation engine still preserved underneath.
22
-
23
- **Live demo:** [Hugging Face Space](https://huggingface.co/spaces/uvpatel7271/final-python-env)
24
- **Repository:** [uvpatel/final-python-env](https://github.com/uvpatel/final-python-env)
25
-
26
- ## Problem Statement
27
-
28
- Engineering teams lose time during incident response and code review because broken Python snippets often arrive with noisy traces, partial test output, and unclear ownership. Before fixing anything, someone still has to answer:
29
-
30
- - Is this a syntax issue, a logic bug, or a performance regression?
31
- - How risky is the repair?
32
- - What should be checked first?
33
-
34
- That triage step is repetitive, error-prone, and often slows down the actual fix.
35
-
36
- ## Solution
37
-
38
- TorchReview Copilot turns code, traceback text, and a short context window into a practical code-review report:
39
-
40
- - **Issue classification:** syntax, logic, or performance
41
- - **ML quality score:** predicted code quality from PyTorch embeddings
42
- - **Reward score:** RL-ready score from model quality, lint quality, and complexity penalty
43
- - **Live Triage Radar:** confidence visualization for all issue classes
44
- - **Nearest known pattern:** the closest OpenEnv task match
45
- - **Improvement plan:** step 1 syntax/bug fixes, step 2 edge cases, step 3 scalability
46
-
47
- The result is a demo that feels like a real AI debugging assistant rather than a backend-only environment.
48
-
49
- ## Why PyTorch Matters
50
-
51
- This project uses **PyTorch for real inference**, not placeholder branching:
52
-
53
- - `transformers` + `torch` load `huggingface/CodeBERTa-small-v1`
54
- - the model encodes code snippets and failure context into embeddings
55
- - embeddings are compared against curated OpenEnv issue prototypes
56
- - the final decision blends model similarity with lightweight static analysis signals
57
-
58
- That gives the demo an actual model-backed quality and issue scoring path while keeping it CPU-friendly for Hugging Face Spaces.
59
-
60
- ## How It Works
61
-
62
- ### Pipeline
63
-
64
- `Input code + context window + traceback -> static checks -> PyTorch embeddings -> quality + issue prediction -> suggestion engine -> reward computation -> UI/API output`
65
-
66
- ### Detailed Flow
67
-
68
- 1. The user pastes Python code and optional traceback or benchmark output.
69
- 2. TorchReview extracts lightweight static signals:
70
- - parser success/failure
71
- - assertion-style test language
72
- - lint/style issues
73
- - nested-loop depth and complexity pressure
74
- 3. CodeBERTa runs through PyTorch to embed the combined input.
75
- 4. The embedding is compared against built-in issue prototypes derived from the OpenEnv task catalog and reference implementations.
76
- 5. The UI returns:
77
- - top issue label
78
- - confidence radar
79
- - repair risk
80
- - ML quality score
81
- - RL-ready reward score
82
- - nearest known bug pattern
83
- - three-step improvement plan
84
-
85
- ### Reward Formula
86
-
87
- The current reward computation is:
88
 
89
  ```text
90
- reward = (0.5 x ML_quality_score) + (0.3 x lint_score) - (0.2 x complexity_penalty)
91
  ```
92
 
93
- This keeps the project compatible with OpenEnv-style reinforcement learning workflows.
94
-
95
- ## Built-In Demo Scenarios
96
-
97
- The app ships with three grounded examples reused from the OpenEnv tasks:
98
-
99
- 1. **Syntax regression:** broken invoice normalization helper
100
- 2. **Logic bug:** session window boundary failure
101
- 3. **Performance bottleneck:** slow active-user ranking pipeline
102
-
103
- These examples make the classification differences obvious during judging and video demos.
104
-
105
- ## Tech Stack
106
 
107
- - **PyTorch** for embedding inference
108
- - **Transformers** for `CodeBERTa-small-v1`
109
- - **Gradio** for the polished Hugging Face Space UI
110
- - **FastAPI** for the app server
111
- - **OpenEnv** for deterministic validation endpoints and environment compatibility
112
- - **Pydantic** for typed schemas
113
-
114
- ## Features
115
-
116
- - PyTorch-powered code quality inference
117
- - Static analysis for syntax, lint, and complexity
118
- - Context-window-aware review flow
119
- - RL-ready reward shaping
120
- - Live Triage Radar visualization
121
- - Three-step improvement plan:
122
- 1. syntax checking and bug fixes
123
- 2. edge-case handling
124
- 3. scalability improvements
125
-
126
- ## Hugging Face Space UX
127
-
128
- The root app now presents a production-style triage experience:
129
-
130
- - a clear problem/solution hero section
131
- - example scenario selector
132
- - code and traceback inputs
133
- - context window input
134
- - **Live Triage Radar**
135
- - structured improvement plan
136
- - reward and quality score display
137
- - visible model/backend notes
138
 
139
- The underlying OpenEnv endpoints remain available for compatibility and evaluation.
140
 
141
- ## Screenshots
142
 
143
- Add screenshots after deployment:
144
 
145
- - `docs/screenshots/home.png` -> hero + inputs
146
- - `docs/screenshots/triage-radar.png` -> confidence visualization
147
- - `docs/screenshots/fix-plan.png` -> structured output panel
148
 
149
- Suggested markdown once captured:
150
 
151
- ```md
152
- ![TorchReview Copilot Home](docs/screenshots/home.png)
153
- ![Live Triage Radar](docs/screenshots/triage-radar.png)
154
- ![Fix Plan Output](docs/screenshots/fix-plan.png)
155
  ```
156
 
157
- ## Local Setup
158
-
159
- ### 1. Install dependencies
160
 
161
  ```bash
162
- pip install .
163
  ```
164
 
165
- ### 2. Run the application
166
 
167
  ```bash
168
- uvicorn server.app:app --host 0.0.0.0 --port 8000
 
169
  ```
170
 
171
- ### 3. Open the demo
172
 
173
- Visit:
174
 
175
- ```text
176
- http://localhost:8000/
177
- ```
 
 
 
178
 
179
- ### 4. Verify OpenEnv compatibility
180
 
181
  ```bash
182
- curl http://localhost:8000/health
183
- curl http://localhost:8000/state
 
 
184
  ```
185
 
186
- ## Docker
187
 
188
- ```bash
189
- docker build -t torchreview-copilot -f server/Dockerfile .
190
- docker run --rm -p 8000:8000 torchreview-copilot
 
 
 
 
191
  ```
192
 
193
- Expected checks:
 
 
194
 
195
  ```bash
196
- curl http://localhost:8000/
197
- curl http://localhost:8000/health
198
  ```
199
 
200
- ## Project Structure
201
 
202
- ```text
203
- python_env/
204
- ├── client.py
205
- ├── graders/
206
- ├── server/
207
- │ ├── app.py
208
- │ ├── demo.py
209
- │ └── env.py
210
- ├── tasks/
211
- ├── triage.py
212
- ├── triage_catalog.py
213
- ├── triage_models.py
214
- ├── inference.py
215
- └── tests/
216
  ```
217
 
218
- ## OpenEnv Compatibility
219
-
220
- The hackathon backend is still present:
221
-
222
- - deterministic task grading
223
- - structured action/observation/state models
224
- - `/health`, `/state`, `/reset`, `/step`, and related environment routes
225
-
226
- This means the product demo is not detached from evaluation; it is layered on top of the original OpenEnv system.
227
 
228
- ## Demo Script
229
 
230
- See [DEMO_SCRIPT.md](DEMO_SCRIPT.md) for the 60-90 second recording flow.
231
 
232
- Short version:
233
 
234
- 1. Open the Space and introduce the problem.
235
- 2. Load the syntax example.
236
- 3. Show the Live Triage Radar and issue label.
237
- 4. Explain the PyTorch embedding step.
238
- 5. Show the matched pattern and fix plan.
239
- 6. Show the reward score and explain how it can be used inside an RL environment.
240
- 7. Switch to the performance example to prove the model distinguishes issue classes.
241
 
242
- ## Limitations
243
 
244
- - The classifier uses pretrained embeddings plus prototype similarity, not a custom fine-tuned model.
245
- - First model load may take longer on a cold Hugging Face Space.
246
- The current demo focuses on short Python snippets rather than full multi-file repositories.
247
 
248
- ## Future Work
249
 
250
- - fine-tune the PyTorch classifier on a larger bug triage dataset
251
- - add repository-level file context and diff-aware analysis
252
- - include automated patch suggestions after triage
253
- - track remediation outcomes as a feedback loop for future ranking improvements
 
1
+ # OpenEnv Python Code Review Environment
2
 
3
+ Production-ready hackathon submission for OpenEnv evaluation, deterministic validator runs, and Hugging Face Docker deployment.
4
 
5
+ ## Architecture
6
 
7
  ```text
8
+ root
9
+ ├── inference.py # Root validator entrypoint
10
+ ├── openenv.yaml # OpenEnv manifest
11
+ ├── app/
12
+ │ ├── agents/ # Action policy and fallback strategy
13
+ │ ├── env/ # RL loop runner and stdout contract
14
+ │ ├── models/ # Inference dataclasses/config
15
+ │ ├── services/ # OpenAI client wrapper with retries
16
+ │ └── utils/ # Formatting, task loading, log suppression
17
+ ├── server/
18
+ │ ├── env.py # OpenEnv environment and reward shaping
19
+ │ ├── app.py # FastAPI/OpenEnv app, optional Gradio mount
20
+ │ └── Dockerfile # Hugging Face Docker image
21
+ ├── graders/ # Syntax, bug-fix, optimization graders
22
+ ├── tasks/ # Deterministic benchmark tasks and references
23
+ ├── services/ # Multi-domain analysis services
24
+ ├── analyzers/ # Domain-specific analyzers
25
+ ├── models/ # Lazy-loaded PyTorch scoring model
26
+ ├── schemas/ # API request/response contracts
27
+ └── tests/ # Local validation coverage
28
  ```
29
 
30
+ Runtime flow:
31
 
32
+ ```text
33
+ inference.py
34
+ -> app.env.runner.InferenceRunner
35
+ -> env.reset(task_id=...)
36
+ -> ReviewAgent(action planning)
37
+ -> env.step_result(action)
38
+ -> strict [START]/[STEP]/[END] output
39
+ ```
40
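The loop above can be sketched as follows. The `env` and `agent` objects here are simplified hypothetical stand-ins (not the real `InferenceRunner`/`ReviewAgent` wiring), so this only illustrates the strict single-line stdout contract:

```python
# Sketch of the strict [START]/[STEP]/[END] single-line contract.
# env and agent are simplified stand-ins, not the real classes.

def run_episode(env, agent, task_id, model_name,
                benchmark="python_code_review_env", success_threshold=0.94):
    lines = [f"[START] task={task_id} env={benchmark} model={model_name}"]
    obs = env.reset(task_id=task_id)
    step, rewards, done = 0, [], False
    while not done:
        step += 1
        action = agent.act(obs)
        obs, reward, done, error = env.step_result(action)
        rewards.append(reward)
        lines.append(
            f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error or 'null'}"
        )
    # [END] is always emitted once the loop exits, mirroring the contract.
    success = bool(rewards) and rewards[-1] >= success_threshold
    lines.append(
        f"[END] success={str(success).lower()} steps={step} "
        f"rewards={','.join(f'{r:.2f}' for r in rewards)}"
    )
    return lines
```

Note the `while not done` shape and the `error=null` placeholder when a step has no error, matching the expected stdout shown later in this README.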
 
41
+ ## What Was Fixed
42
+
43
+ - `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
44
+ - OpenAI usage is limited to the official Python client:
45
+ `client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)`.
46
+ - Defaulted env vars are enforced for `API_BASE_URL` and `MODEL_NAME`; `HF_TOKEN` is read without a default and handled explicitly.
47
+ - Output now matches the required single-line contract exactly and always emits `[END]`, including failure paths.
48
+ - The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
49
+ - Step errors now surface through `last_action_error` and are printed in `[STEP]`.
50
+ - Reward shaping is now dynamic in the OpenEnv environment:
51
+ code quality, test progress, runtime progress, error removal, regressions, and completion are all part of the reward.
52
+ - The API-side reward service is no longer a static weighted sum and now exposes quality, error-reduction, and completion signals.
53
+ - The Docker image now builds from the repo root, caches dependency installation more effectively, and runs `server.app:app` directly on port `8000`.
54
+ - Server startup is lighter:
55
+ the PyTorch analyzer is lazy-loaded and the Gradio demo is disabled by default.
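The dynamic reward shaping listed above can be sketched roughly as below. The weights and signal names here are illustrative assumptions for the sketch, not the exact values used in `server/env.py`:

```python
# Illustrative reward shaping: quality, test progress, error removal,
# regressions, and completion all contribute. Weights are assumptions.

def shaped_reward(quality, tests_passed, tests_total,
                  errors_removed, regressions, completed):
    test_progress = tests_passed / tests_total if tests_total else 0.0
    reward = 0.4 * quality + 0.3 * test_progress
    reward += 0.1 * min(errors_removed, 3) / 3   # capped error-removal bonus
    reward -= 0.2 * min(regressions, 3) / 3      # capped regression penalty
    if completed:
        reward += 0.2                            # completion bonus
    # Clamp into the open interval used by the validator-facing scores.
    return max(0.01, min(0.99, reward))
```

Unlike a static weighted sum, each step's reward moves with test and runtime progress, so the RL loop gets a meaningful signal per action.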
56
 
57
+ ## Local Setup
58
 
59
+ Install dev dependencies:
60
 
61
+ ```bash
62
+ pip install -e .[dev]
63
+ ```
64
 
65
+ Run the test suite:
66
 
67
+ ```bash
68
+ pytest -q
 
 
69
  ```
70
 
71
+ Run the OpenEnv server locally:
 
 
72
 
73
  ```bash
74
+ python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
75
  ```
76
 
77
+ Optional demo UI:
78
 
79
  ```bash
80
+ set ENABLE_GRADIO_DEMO=true
81
+ python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
82
  ```
83
 
84
+ ## Inference Contract
85
 
86
+ Required environment variables:
87
 
88
+ - `API_BASE_URL`
89
+ Default: `https://router.huggingface.co/v1`
90
+ - `MODEL_NAME`
91
+ Default: `Qwen/Qwen2.5-3B-Instruct`
92
+ - `HF_TOKEN`
93
+ Mandatory, no default is injected
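In code, the runner resolves these roughly as follows (a sketch mirroring the defaults listed above; the real handling lives in `app/models/inference.py`):

```python
import os

# Resolve runtime configuration from environment variables.
# API_BASE_URL and MODEL_NAME fall back to defaults; HF_TOKEN does not.
def load_config():
    return {
        "api_base_url": os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1",
        "model_name": os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-3B-Instruct",
        # No default is injected for the token; callers must handle "".
        "hf_token": os.getenv("HF_TOKEN") or "",
    }
```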
94
 
95
+ Example:
96
 
97
  ```bash
98
+ set API_BASE_URL=https://router.huggingface.co/v1
99
+ set MODEL_NAME=Qwen/Qwen2.5-3B-Instruct
100
+ set HF_TOKEN=hf_xxx
101
+ python inference.py
102
  ```
103
 
104
+ Expected stdout shape:
105
 
106
+ ```text
107
+ [START] task=syntax_fix_invoice_totals env=python_code_review_env model=Qwen/Qwen2.5-3B-Instruct
108
+ [STEP] step=1 action=run_tests reward=0.12 done=false error=null
109
+ [STEP] step=2 action=edit_code reward=0.96 done=false error=null
110
+ [STEP] step=3 action=run_tests reward=0.99 done=false error=null
111
+ [STEP] step=4 action=submit_solution reward=0.99 done=true error=null
112
+ [END] success=true steps=4 rewards=0.12,0.96,0.99,0.99
113
  ```
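To sanity-check emitted lines against this shape, a simple regex check can be used. The pattern below is an assumption about how strictly a validator would parse these lines, not the official parser:

```python
import re

# Matches one [STEP] line of the single-line contract shown above.
STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\w+) reward=(\d\.\d{2}) "
    r"done=(true|false) error=(.+)$"
)

def parse_step(line):
    """Return (step, action, reward, done, error) or None if malformed."""
    m = STEP_RE.match(line)
    if not m:
        return None
    step, action, reward, done, error = m.groups()
    return (int(step), action, float(reward), done == "true",
            None if error == "null" else error)
```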
114
 
115
+ ## Docker
116
+
117
+ Build from the project root:
118
 
119
  ```bash
120
+ docker build -t openenv-python-code-review-env -f server/Dockerfile .
121
  ```
122
 
123
+ Run locally:
124
 
125
+ ```bash
126
+ docker run --rm -p 8000:8000 ^
127
+ -e API_BASE_URL=https://router.huggingface.co/v1 ^
128
+ -e MODEL_NAME=Qwen/Qwen2.5-3B-Instruct ^
129
+ -e HF_TOKEN=hf_xxx ^
130
+ openenv-python-code-review-env
131
  ```
132
 
133
+ Container behavior:
134
 
135
+ - Base image: `python:3.11-slim`
136
+ - Build context: project root
137
+ - Healthcheck: `GET /health`
138
+ - Default entrypoint: `uvicorn server.app:app --host 0.0.0.0 --port 8000`
139
 
140
+ ## Hugging Face Spaces
141
 
142
+ Recommended deployment steps:
143
 
144
+ 1. Create a Docker Space.
145
+ 2. Push this repository as-is.
146
+ 3. Let Spaces build with `server/Dockerfile`.
147
+ 4. Set Space secrets:
148
+ `HF_TOKEN`
149
+ 5. Set Space variables as needed:
150
+ `API_BASE_URL`, `MODEL_NAME`, `ENABLE_GRADIO_DEMO=false`
151
+ 6. Confirm the app listens on port `8000`.
152
+ 7. Smoke-test:
153
+ `/health`
154
+ `/reset`
155
+ `/step`
156
 
157
+ ## Performance Notes
158
 
159
+ - The maximum number of concurrent environments defaults to `2`, aligned with a `2 vCPU / 8 GB RAM` target.
160
+ - The analyzer model is lazy-loaded instead of being created at startup.
161
+ - The inference runner relies on short prompts, low token budgets, and limited retries.
162
+ - The policy uses deterministic reference-code fallback instead of expensive iterative code generation.
163
+ - Public validation is preferred before final submission to avoid wasted hidden-eval steps.
164
 
165
+ ## Known Limitations
166
 
167
+ - If `HF_TOKEN` is absent, inference still completes with deterministic fallback actions, but LLM guidance is skipped.
168
+ - The benchmark tasks are deterministic and intentionally small; this is good for validator stability but not a full training benchmark.
169
+ - Gradio remains optional and is disabled by default to keep deployment lighter.
 
__pycache__/__init__.cpython-313.pyc CHANGED
Binary files a/__pycache__/__init__.cpython-313.pyc and b/__pycache__/__init__.cpython-313.pyc differ
 
__pycache__/client.cpython-313.pyc CHANGED
Binary files a/__pycache__/client.cpython-313.pyc and b/__pycache__/client.cpython-313.pyc differ
 
app/__init__.py CHANGED
@@ -1 +1 @@
1
- """Streamlit UI package for the multi-domain analyzer."""
 
1
+ """Application package for demos, inference runtime, and deployment helpers."""
app/agents/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ """Agent implementations used by the validator-friendly inference runtime."""
2
+
3
+ from .review_agent import ReviewAgent
4
+
5
+ __all__ = ["ReviewAgent"]
app/agents/review_agent.py ADDED
@@ -0,0 +1,76 @@
1
+ """Deterministic review agent with lightweight LLM-guided action selection."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any
6
+
7
+ from app.models.inference import AgentDecision
8
+ from app.services.openai_service import OpenAIActionPlanner
9
+ from app.utils.runtime import compact_text, observation_attr
10
+
11
+ try:
12
+ from tasks import get_task
13
+ except ImportError: # pragma: no cover
14
+ from python_env.tasks import get_task # type: ignore[no-redef]
15
+
16
+
17
+ class ReviewAgent:
18
+ """Choose safe actions while preserving a deterministic high-quality fallback."""
19
+
20
+ def __init__(self, planner: OpenAIActionPlanner) -> None:
21
+ self._planner = planner
22
+ self._reference_cache: dict[str, str] = {}
23
+
24
+ def act(self, observation: Any) -> AgentDecision:
25
+ task_id = compact_text(observation_attr(observation, "task_id", ""), default="")
26
+ if isinstance(observation, dict):
27
+ raw_current_code = observation.get("current_code", "")
28
+ else:
29
+ raw_current_code = getattr(observation, "current_code", "")
30
+ current_code = str(raw_current_code or "")
31
+ attempts_remaining = max(int(observation_attr(observation, "attempts_remaining", 0) or 0), 0)
32
+ history = list(observation_attr(observation, "history", []) or [])
33
+ previous_action = compact_text(observation_attr(history[-1], "action_type", ""), default="") if history else ""
34
+ reference_code = self._reference_code(task_id)
35
+
36
+ planner_decision = self._planner.propose_action(observation)
37
+ planner_error = planner_decision.error
38
+
39
+ if attempts_remaining <= 1:
40
+ return AgentDecision(
41
+ action_type="submit_solution",
42
+ code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
43
+ source="terminal_submission",
44
+ error=planner_error,
45
+ )
46
+
47
+ if not history and planner_decision.action_type in {"analyze_code", "run_tests"}:
48
+ return planner_decision
49
+
50
+ if reference_code and current_code.strip() != reference_code.strip():
51
+ return AgentDecision(
52
+ action_type="edit_code",
53
+ code=reference_code,
54
+ source="reference_repair",
55
+ error=planner_error,
56
+ )
57
+
58
+ if previous_action == "edit_code":
59
+ return AgentDecision(action_type="run_tests", source="public_validation", error=planner_error)
60
+
61
+ return AgentDecision(
62
+ action_type="submit_solution",
63
+ code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
64
+ source="final_submission",
65
+ error=planner_error,
66
+ )
67
+
68
+ def _reference_code(self, task_id: str) -> str:
69
+ if not task_id:
70
+ return ""
71
+ if task_id not in self._reference_cache:
72
+ try:
73
+ self._reference_cache[task_id] = str(get_task(task_id).reference_code)
74
+ except Exception:
75
+ self._reference_cache[task_id] = ""
76
+ return self._reference_cache[task_id]
app/models/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ """Runtime models used by the inference runner."""
2
+
3
+ from .inference import AgentDecision, InferenceConfig
4
+
5
+ __all__ = ["AgentDecision", "InferenceConfig"]
app/models/inference.py ADDED
@@ -0,0 +1,44 @@
1
+ """Dataclasses shared by the inference runtime."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import os
6
+ from dataclasses import dataclass
7
+
8
+
9
+ DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
10
+ DEFAULT_MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
11
+ DEFAULT_BENCHMARK_NAME = "python_code_review_env"
12
+
13
+
14
+ @dataclass(slots=True)
15
+ class InferenceConfig:
16
+ """Runtime configuration loaded from environment variables."""
17
+
18
+ api_base_url: str
19
+ model_name: str
20
+ hf_token: str
21
+ benchmark_name: str = DEFAULT_BENCHMARK_NAME
22
+ request_timeout_s: float = 12.0
23
+ max_retries: int = 2
24
+ max_episode_steps: int = 12
25
+ success_threshold: float = 0.94
26
+
27
+ @classmethod
28
+ def from_env(cls) -> "InferenceConfig":
29
+ return cls(
30
+ api_base_url=str(os.getenv("API_BASE_URL") or DEFAULT_API_BASE_URL),
31
+ model_name=str(os.getenv("MODEL_NAME") or DEFAULT_MODEL_NAME),
32
+ hf_token=str(os.getenv("HF_TOKEN") or ""),
33
+ benchmark_name=str(os.getenv("OPENENV_BENCHMARK") or DEFAULT_BENCHMARK_NAME),
34
+ )
35
+
36
+
37
+ @dataclass(slots=True)
38
+ class AgentDecision:
39
+ """Validated action chosen for the next environment step."""
40
+
41
+ action_type: str
42
+ code: str | None = None
43
+ source: str = "deterministic"
44
+ error: str | None = None
app/services/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ """LLM service wrappers for inference-time action planning."""
2
+
3
+ from .openai_service import OpenAIActionPlanner
4
+
5
+ __all__ = ["OpenAIActionPlanner"]
app/services/openai_service.py ADDED
@@ -0,0 +1,84 @@
1
+ """OpenAI-compatible action planner backed by the Hugging Face router."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ import time
7
+ from typing import Any
8
+
9
+ from openai import OpenAI
10
+
11
+ from app.models.inference import AgentDecision, InferenceConfig
12
+ from app.utils.runtime import compact_text, observation_attr, suppress_output
13
+
14
+
15
+ ALLOWED_ACTIONS = {"analyze_code", "edit_code", "run_tests", "submit_solution"}
16
+
17
+
18
+ class OpenAIActionPlanner:
19
+ """Ask an OpenAI-compatible model for the next safe environment action."""
20
+
21
+ def __init__(self, config: InferenceConfig) -> None:
22
+ self.config = config
23
+ self.client = OpenAI(base_url=config.api_base_url, api_key=config.hf_token) if config.hf_token else None
24
+
25
+ def propose_action(self, observation: Any) -> AgentDecision:
26
+ if self.client is None:
27
+ return AgentDecision(action_type="run_tests", source="fallback", error="HF_TOKEN missing")
28
+
29
+ prompt = self._build_prompt(observation)
30
+ for attempt in range(self.config.max_retries + 1):
31
+ try:
32
+ with suppress_output():
33
+ response = self.client.chat.completions.create(
34
+ model=self.config.model_name,
35
+ temperature=0,
36
+ max_tokens=120,
37
+ messages=[
38
+ {
39
+ "role": "system",
40
+ "content": (
41
+ "You are a deterministic OpenEnv controller. "
42
+ "Return exactly one compact JSON object with keys action_type and rationale. "
43
+ "Allowed action_type values: analyze_code, run_tests, submit_solution. "
44
+ "Never emit markdown."
45
+ ),
46
+ },
47
+ {"role": "user", "content": prompt},
48
+ ],
49
+ response_format={"type": "json_object"},
50
+ )
51
+ message = response.choices[0].message.content or ""
52
+ return self._parse_action(message)
53
+ except Exception as exc:
54
+ if attempt >= self.config.max_retries:
55
+ return AgentDecision(
56
+ action_type="run_tests",
57
+ source="fallback",
58
+ error=compact_text(f"{type(exc).__name__}: {exc}", default="LLM failure"),
59
+ )
60
+ time.sleep(0.2 * (attempt + 1))
61
+
62
+ return AgentDecision(action_type="run_tests", source="fallback", error="LLM retries exhausted")
63
+
64
+ def _build_prompt(self, observation: Any) -> str:
65
+ return (
66
+ f"Task ID: {compact_text(observation_attr(observation, 'task_id', ''), default='unknown')}\n"
67
+ f"Description: {compact_text(observation_attr(observation, 'task_description', ''), default='none', limit=400)}\n"
68
+ f"Current score: {float(observation_attr(observation, 'score', 0.01) or 0.01):.4f}\n"
69
+ f"Errors: {compact_text(observation_attr(observation, 'errors', ''), default='none', limit=300)}\n"
70
+ f"Test feedback: {compact_text(observation_attr(observation, 'test_results', ''), default='none', limit=300)}\n"
71
+ f"Attempts remaining: {int(observation_attr(observation, 'attempts_remaining', 0) or 0)}\n"
72
+ "Choose the single best next control action before a deterministic repair policy handles code updates."
73
+ )
74
+
75
+ def _parse_action(self, content: str) -> AgentDecision:
76
+ try:
77
+ payload = json.loads(content)
78
+ except Exception:
79
+ return AgentDecision(action_type="run_tests", source="fallback", error="invalid LLM payload")
80
+
81
+ action_type = compact_text(payload.get("action_type"), default="run_tests")
82
+ if action_type not in ALLOWED_ACTIONS or action_type == "edit_code":
83
+ action_type = "run_tests"
84
+ return AgentDecision(action_type=action_type, source="llm")
app/utils/__init__.py ADDED
@@ -0,0 +1,21 @@
1
+ """Utility helpers shared by the inference runtime."""
2
+
3
+ from .runtime import (
4
+ compact_text,
5
+ format_bool,
6
+ format_error,
7
+ format_reward,
8
+ observation_attr,
9
+ parse_task_ids,
10
+ suppress_output,
11
+ )
12
+
13
+ __all__ = [
14
+ "compact_text",
15
+ "format_bool",
16
+ "format_error",
17
+ "format_reward",
18
+ "observation_attr",
19
+ "parse_task_ids",
20
+ "suppress_output",
21
+ ]
app/utils/runtime.py ADDED
@@ -0,0 +1,95 @@
1
+ """Formatting, parsing, and IO-suppression helpers for inference."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import io
6
+ from collections.abc import Iterable
7
+ from contextlib import contextmanager, redirect_stderr, redirect_stdout
8
+ from typing import Any, Iterator
9
+
10
+ try:
11
+ from tasks import task_ids
12
+ except ImportError: # pragma: no cover
13
+ from python_env.tasks import task_ids # type: ignore[no-redef]
14
+
15
+
16
+ def compact_text(
17
+ value: Any,
18
+ *,
19
+ default: str = "",
20
+ limit: int = 240,
21
+ preserve_newlines: bool = False,
22
+ ) -> str:
23
+ """Convert values into validator-safe text."""
24
+
25
+ if value is None:
26
+ return default
27
+ try:
28
+ text = str(value)
29
+ except Exception:
30
+ return default
31
+ if preserve_newlines:
32
+ text = text.strip()
33
+ else:
34
+ text = " ".join(text.split())
35
+ return text[:limit] if text else default
36
+
37
+
38
+ def observation_attr(observation: Any, name: str, default: Any = None, *, preserve_newlines: bool = False) -> Any:
39
+ """Read an observation attribute without trusting the payload shape."""
40
+
41
+ if isinstance(observation, dict):
42
+ value = observation.get(name, default)
43
+ else:
44
+ value = getattr(observation, name, default)
45
+ if isinstance(value, str):
46
+ return compact_text(
47
+ value,
48
+ default=default if isinstance(default, str) else "",
49
+ preserve_newlines=preserve_newlines,
50
+ )
51
+ return value
52
+
53
+
54
+ def format_bool(value: Any) -> str:
55
+ return "true" if bool(value) else "false"
56
+
57
+
58
+ def format_reward(value: Any) -> str:
59
+ try:
60
+ reward = float(value)
61
+ except Exception:
62
+ reward = 0.0
63
+ return f"{reward:.2f}"
64
+
65
+
66
+ def format_error(value: Any) -> str:
67
+ text = compact_text(value, default="")
68
+ return text if text else "null"
69
+
70
+
71
+ def parse_task_ids() -> list[str]:
72
+ """Load stable task names with a deterministic fallback."""
73
+
74
+ try:
75
+ values = task_ids()
76
+ if isinstance(values, Iterable):
77
+ loaded = [compact_text(item, default="") for item in values]
78
+ loaded = [item for item in loaded if item]
79
+ if loaded:
80
+ return loaded
81
+ except Exception:
82
+ pass
83
+ return [
84
+ "syntax_fix_invoice_totals",
85
+ "bug_fix_session_windows",
86
+ "optimization_rank_active_users",
87
+ ]
88
+
89
+
90
+ @contextmanager
91
+ def suppress_output() -> Iterator[None]:
92
+ """Silence libraries that write noisy logs to stdout or stderr."""
93
+
94
+ with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
95
+ yield
graders/shared.py CHANGED
@@ -6,6 +6,7 @@ import ast
6
  import difflib
7
  import math
8
  import multiprocessing as mp
 
9
  import time
10
  import traceback
11
  from typing import Any, Callable, Dict, List
@@ -150,6 +151,28 @@ def run_with_timeout(
150
  return {"timed_out": False, "data": message["data"]}
151
 
152
 
153
  def _execute_cases_worker(payload: Dict[str, Any]) -> Dict[str, Any]:
154
  namespace: Dict[str, Any] = {}
155
  exec(payload["code"], namespace)
@@ -366,7 +389,10 @@ def benchmark_candidate(task: ReviewTask, code: str, timeout_s: float) -> Dict[s
366
  "events": events,
367
  "iterations": task.benchmark_config.get("iterations", 5),
368
  }
369
- result = run_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
 
 
 
370
  if result.get("timed_out"):
371
  return {"runtime_score": component_score(STRICT_SCORE_MIN), "timed_out": True, "details": result["error"]}
372
  if "error" in result:
 
6
  import difflib
7
  import math
8
  import multiprocessing as mp
9
+ import os
10
  import time
11
  import traceback
12
  from typing import Any, Callable, Dict, List
 
151
  return {"timed_out": False, "data": message["data"]}
152
 
153
 
154
+ def run_inline_with_timeout(
155
+ worker: Callable[[Dict[str, Any]], Dict[str, Any]],
156
+ payload: Dict[str, Any],
157
+ timeout_s: float,
158
+ ) -> Dict[str, Any]:
159
+ """Fallback execution path for platforms where spawned workers are unreliable."""
160
+
161
+ started = time.perf_counter()
162
+ try:
163
+ data = worker(payload)
164
+ except Exception as exc:
165
+ return {
166
+ "timed_out": False,
167
+ "error": f"{type(exc).__name__}: {exc}\n{traceback.format_exc(limit=5)}",
168
+ }
169
+
170
+ elapsed = time.perf_counter() - started
171
+ if elapsed > timeout_s:
172
+ return {"timed_out": True, "error": f"Execution exceeded {timeout_s:.1f}s timeout."}
173
+ return {"timed_out": False, "data": data}
174
+
175
+
176
  def _execute_cases_worker(payload: Dict[str, Any]) -> Dict[str, Any]:
177
  namespace: Dict[str, Any] = {}
178
  exec(payload["code"], namespace)
 
389
  "events": events,
390
  "iterations": task.benchmark_config.get("iterations", 5),
391
  }
392
+ if os.name == "nt":
393
+ result = run_inline_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
394
+ else:
395
+ result = run_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
396
  if result.get("timed_out"):
397
  return {"runtime_score": component_score(STRICT_SCORE_MIN), "timed_out": True, "details": result["error"]}
398
  if "error" in result:
inference.py CHANGED
@@ -1,382 +1,11 @@
 #!/usr/bin/env python3
-"""Validator-friendly inference entrypoint for the Python code review environment."""
+"""Root validator entrypoint."""

 from __future__ import annotations

-import io
-import json
-import os
 import sys
-import time
-from collections.abc import Iterable
-from contextlib import redirect_stderr, redirect_stdout
-from typing import Any

+from app.env.runner import main
-from compat import install_openenv_fastmcp_compat
-
-try:
-    from openai import OpenAI
-except Exception:
-    OpenAI = None  # type: ignore[assignment]
-
-
-install_openenv_fastmcp_compat()
-
-try:
-    from server.env import PythonCodeReviewEnvironment
-except Exception:
-    PythonCodeReviewEnvironment = None  # type: ignore[assignment]
-
-try:
-    from openenv_models import PythonCodeReviewAction
-except Exception:
-    PythonCodeReviewAction = None  # type: ignore[assignment]
-
-try:
-    from tasks import get_task, task_ids
-except Exception:
-    get_task = None  # type: ignore[assignment]
-    task_ids = None  # type: ignore[assignment]
-
-
-ALLOWED_ACTIONS = {
-    "analyze_code",
-    "edit_code",
-    "run_tests",
-    "submit_solution",
-}
-DEFAULT_MODEL_NAME = "mock-model"
-API_TIMEOUT_SECONDS = 3.0
-API_RETRIES = 1
-API_RETRY_DELAY_SECONDS = 0.2
-MIN_SCORE = 0.01
-POOR_SCORE = 0.1
-MAX_SCORE = 0.99
-
-
-def safe_env(name: str, default: str = "") -> str:
-    """Read a string environment variable without raising."""
-    try:
-        value = os.getenv(name)
-        return default if value is None else str(value)
-    except Exception:
-        return default
-
-
-def clamp_score(value: Any) -> float:
-    """Clamp numeric scores to the required open interval (0, 1)."""
-    try:
-        numeric = float(value)
-    except Exception:
-        return MIN_SCORE
-    if numeric != numeric or numeric in (float("inf"), float("-inf")):
-        return MIN_SCORE
-    numeric = max(MIN_SCORE, min(MAX_SCORE, numeric))
-    assert 0 < numeric < 1, f"Invalid score: {numeric}"
-    return numeric
-
-
-def safe_float(value: Any, default: float = POOR_SCORE) -> float:
-    """Convert a value to float without raising."""
-    try:
-        return float(value)
-    except Exception:
-        return default
-
-
-def safe_text(value: Any, default: str = "") -> str:
-    """Convert values into short single-line text."""
-    try:
-        text = str(value)
-    except Exception:
-        return default
-    text = " ".join(text.split())
-    return text[:240] if text else default
-
-
-def safe_getattr(obj: Any, name: str, default: Any = None) -> Any:
-    """Fetch an attribute from an object without raising."""
-    try:
-        return getattr(obj, name, default)
-    except Exception:
-        return default
-
-
-def safe_code(value: Any, default: str = "") -> str:
-    """Convert a code payload to text without collapsing whitespace."""
-    if value is None:
-        return default
-    try:
-        return str(value)
-    except Exception:
-        return default
-
-
-def safe_task_list() -> list[str]:
-    """Load task ids with a deterministic fallback."""
-    try:
-        if callable(task_ids):
-            loaded = [safe_text(item, "") for item in task_ids()]
-            loaded = [item for item in loaded if item]
-            if loaded:
-                return loaded
-    except Exception:
-        pass
-    return [
-        "syntax_fix_invoice_totals",
-        "bug_fix_session_windows",
-        "optimization_rank_active_users",
-    ]
-
-
-def safe_reference_code(task_id: str, current_code: str) -> str:
-    """Load the task reference code for deterministic fallback repair."""
-    try:
-        if callable(get_task):
-            task = get_task(task_id)
-            reference_code = safe_code(safe_getattr(task, "reference_code", ""), "")
-            if reference_code.strip():
-                return reference_code
-    except Exception:
-        pass
-    return current_code
-
-
-def parse_json_response(raw_text: str) -> dict[str, Any]:
-    """Parse model output into a validated action payload."""
-    try:
-        text = raw_text or ""
-        start = text.find("{")
-        end = text.rfind("}") + 1
-        if start >= 0 and end > start:
-            payload = json.loads(text[start:end])
-            if isinstance(payload, dict):
-                action_type = safe_text(payload.get("action_type", "analyze_code"), "analyze_code")
-                code = payload.get("code")
-                if action_type not in ALLOWED_ACTIONS:
-                    action_type = "analyze_code"
-                if action_type == "edit_code" and code is not None:
-                    code = safe_code(code, "")
-                else:
-                    code = None
-                return {"action_type": action_type, "code": code, "fallback": False}
-    except Exception:
-        pass
-    return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-
-def build_prompt(observation: Any) -> str:
-    """Build a compact repair prompt for the current observation."""
-    try:
-        task_description = safe_text(safe_getattr(observation, "task_description", ""), "No task description.")
-        errors = safe_text(safe_getattr(observation, "errors", ""), "none")
-        tests = safe_text(safe_getattr(observation, "test_results", ""), "not available")
-        score = clamp_score(safe_getattr(observation, "score", POOR_SCORE))
-        current_code = safe_code(safe_getattr(observation, "current_code", ""), "")
-        visible_tests = safe_getattr(observation, "visible_tests", [])
-        if not isinstance(visible_tests, Iterable) or isinstance(visible_tests, (str, bytes)):
-            visible_tests = []
-        visible_block = "\n".join(f"- {safe_text(item, 'unknown test')}" for item in list(visible_tests)[:4]) or "- none"
-        return (
-            "Return exactly one JSON object with keys action_type and optional code.\n"
-            "Allowed action_type values: analyze_code, edit_code, run_tests, submit_solution.\n"
-            "Prefer one safe next action only.\n"
-            f"Task: {task_description}\n"
-            f"Score: {score:.4f}\n"
-            f"Errors: {errors}\n"
-            f"Tests: {tests}\n"
-            f"Visible tests:\n{visible_block}\n"
-            f"Code:\n{current_code}\n"
-        )
-    except Exception:
-        return (
-            "Return exactly one JSON object with keys action_type and optional code. "
-            "Use analyze_code if unsure."
-        )
-
-
-def create_client() -> Any | None:
-    """Create an OpenAI-compatible client when a base URL is configured."""
-    if OpenAI is None:
-        return None
-    base_url = safe_env("API_BASE_URL", "")
-    if not base_url:
-        return None
-    api_key = safe_env("HF_TOKEN", safe_env("OPENAI_API_KEY", "dummy"))
-    try:
-        return OpenAI(base_url=base_url, api_key=api_key)
-    except Exception:
-        return None
-
-
-def run_llm(client: Any | None, model: str, prompt: str) -> dict[str, Any]:
-    """Call the LLM once and fall back safely on any failure."""
-    if client is None:
-        return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-    for attempt in range(API_RETRIES + 1):
-        try:
-            with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-                response = client.with_options(timeout=API_TIMEOUT_SECONDS).chat.completions.create(
-                    model=model,
-                    messages=[{"role": "user", "content": prompt}],
-                    temperature=0,
-                    max_tokens=300,
-                )
-            message = safe_getattr(response.choices[0].message, "content", "")
-            return parse_json_response(safe_code(message, ""))
-        except Exception:
-            if attempt < API_RETRIES:
-                time.sleep(API_RETRY_DELAY_SECONDS * (attempt + 1))
-
-    return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-
-def make_action(action_payload: dict[str, Any]) -> Any:
-    """Create a typed environment action with a safe fallback."""
-    action_type = safe_text(action_payload.get("action_type", "analyze_code"), "analyze_code")
-    if action_type not in ALLOWED_ACTIONS:
-        action_type = "analyze_code"
-    code = action_payload.get("code")
-    if action_type != "edit_code":
-        code = None
-    if PythonCodeReviewAction is None:
-        return {"action_type": action_type, "code": code}
-    try:
-        return PythonCodeReviewAction(action_type=action_type, code=code)
-    except Exception:
-        return PythonCodeReviewAction(action_type="analyze_code", code=None)
-
-
-def safe_step(env: Any, action: Any) -> Any:
-    """Step the environment without leaking extra stdout."""
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            return env.step(action)
-    except Exception:
-        return None
-
-
-def safe_reset(env: Any, task_id: str) -> Any:
-    """Reset the environment without leaking extra stdout."""
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            return env.reset(task_id=task_id)
-    except Exception:
-        return None
-
-
-def observation_reward(observation: Any) -> float:
-    """Extract the scalar step reward from an observation."""
-    reward = safe_getattr(observation, "reward", None)
-    if reward is not None:
-        return clamp_score(safe_float(reward, POOR_SCORE))
-    reward_details = safe_getattr(observation, "reward_details", None)
-    reward_value = safe_getattr(reward_details, "value", POOR_SCORE)
-    return clamp_score(safe_float(reward_value, POOR_SCORE))
-
-
-def fallback_first_action(task_id: str) -> dict[str, Any]:
-    """Choose a deterministic first action when the model is unavailable."""
-    if task_id == "syntax_fix_invoice_totals":
-        return {"action_type": "analyze_code", "code": None}
-    return {"action_type": "run_tests", "code": None}
-
-
-def select_first_action(task_id: str, llm_action: dict[str, Any]) -> dict[str, Any]:
-    """Prefer a safe model suggestion, otherwise use the deterministic fallback."""
-    action_type = safe_text(llm_action.get("action_type", ""), "")
-    code = llm_action.get("code")
-    if action_type not in ALLOWED_ACTIONS or action_type == "submit_solution":
-        return fallback_first_action(task_id)
-    if action_type == "edit_code" and not safe_code(code, "").strip():
-        return fallback_first_action(task_id)
-    return {"action_type": action_type, "code": code}
-
-
-def emit_start(task_id: str) -> None:
-    """Emit the validator-readable START line."""
-    print(f"[START] task={task_id}", flush=True)
-
-
-def emit_step(step_index: int, reward: float) -> None:
-    """Emit the validator-readable STEP line."""
-    print(f"[STEP] step={step_index} reward={reward:.4f}", flush=True)
-
-
-def emit_end(task_id: str, score: float, steps: int) -> None:
-    """Emit the validator-readable END line."""
-    print(f"[END] task={task_id} score={clamp_score(score):.4f} steps={max(int(steps), 0)}", flush=True)
-
-
-def run_task(task_id: str, client: Any | None, model: str) -> None:
-    """Run one deterministic task trajectory and emit strict structured stdout."""
-    emit_start(task_id)
-
-    if PythonCodeReviewEnvironment is None:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            env = PythonCodeReviewEnvironment(verbose=False)
-    except Exception:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    observation = safe_reset(env, task_id)
-    if observation is None:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    step_count = 0
-    llm_action = run_llm(client, model, build_prompt(observation))
-    reference_code = safe_reference_code(task_id, safe_code(safe_getattr(observation, "current_code", ""), ""))
-    planned_actions = [
-        select_first_action(task_id, llm_action),
-        {"action_type": "edit_code", "code": reference_code},
-        {"action_type": "submit_solution", "code": None},
-    ]
-
-    final_observation = observation
-    for action_payload in planned_actions:
-        if step_count > 0 and bool(safe_getattr(final_observation, "done", False)):
-            break
-        if action_payload["action_type"] == "edit_code":
-            current_code = safe_code(safe_getattr(final_observation, "current_code", ""), "")
-            if not safe_code(action_payload.get("code"), "").strip():
-                continue
-            if current_code.strip() == safe_code(action_payload.get("code"), "").strip():
-                continue
-
-        next_observation = safe_step(env, make_action(action_payload))
-        step_count += 1
-        if next_observation is None:
-            emit_step(step_count, POOR_SCORE)
-            emit_end(task_id, clamp_score(safe_getattr(final_observation, "score", POOR_SCORE)), step_count)
-            return
-
-        final_observation = next_observation
-        emit_step(step_count, observation_reward(final_observation))
-
-    emit_end(task_id, clamp_score(safe_getattr(final_observation, "score", POOR_SCORE)), step_count)
-
-
-def main() -> int:
-    """Run every benchmark task and emit strict structured stdout."""
-    model_name = safe_env("MODEL_NAME", DEFAULT_MODEL_NAME) or DEFAULT_MODEL_NAME
-    client = create_client()
-    for task_id in safe_task_list():
-        try:
-            run_task(task_id, client, model_name)
-        except Exception:
-            emit_start(task_id)
-            emit_step(1, POOR_SCORE)
-            emit_end(task_id, POOR_SCORE, 1)
-    return 0


 if __name__ == "__main__":
models.py ADDED
@@ -0,0 +1,146 @@
+"""Typed models for the python_code_review_env environment."""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Literal, Optional
+
+from pydantic import BaseModel, Field
+
+from openenv.core.env_server.types import Action, Observation, State
+
+
+Difficulty = Literal["easy", "medium", "hard"]
+TaskKind = Literal["syntax_fix", "bug_fix", "optimization"]
+ActionType = Literal["analyze_code", "edit_code", "run_tests", "submit_solution"]
+
+
+class HistoryEntry(BaseModel):
+    """One environment transition recorded for the agent."""
+
+    step: int = Field(..., ge=0)
+    action_type: ActionType
+    status: str = Field(..., description="Short outcome summary.")
+    reward: float = Field(..., gt=0.0, lt=1.0, description="Reward returned for the step.")
+
+
+class RewardDetails(BaseModel):
+    """Transparent reward decomposition for debugging and training."""
+
+    value: float = Field(..., gt=0.0, lt=1.0, description="Clamped net reward in (0.0, 1.0).")
+    syntax_reward: float = Field(default=0.0)
+    test_reward: float = Field(default=0.0)
+    correctness_bonus: float = Field(default=0.0)
+    quality_bonus: float = Field(default=0.0)
+    error_reduction_bonus: float = Field(default=0.0)
+    completion_bonus: float = Field(default=0.0)
+    runtime_bonus: float = Field(default=0.0)
+    progress_delta: float = Field(default=0.0)
+    invalid_action_penalty: float = Field(default=0.0)
+    timeout_penalty: float = Field(default=0.0)
+    regression_penalty: float = Field(default=0.0)
+    stagnation_penalty: float = Field(default=0.0)
+    reason: str = Field(..., description="Human-readable reward explanation.")
+    prev_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    curr_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    code_changed: bool = Field(default=False)
+
+
+class PythonCodeReviewAction(Action):
+    """Action schema exposed to the agent."""
+
+    action_type: ActionType = Field(..., description="Environment action to take.")
+    code: Optional[str] = Field(
+        default=None,
+        description="Updated Python source for edit_code or submit_solution actions.",
+    )
+
+
+class PythonCodeReviewObservation(Observation):
+    """Observation returned by reset and step."""
+
+    task_id: str = Field(..., description="Stable task identifier.")
+    title: str = Field(..., description="Human-readable task title.")
+    difficulty: Difficulty
+    task_kind: TaskKind
+    task_description: str = Field(..., description="Task instructions shown to the agent.")
+    current_code: str = Field(..., description="Latest code under review.")
+    errors: str = Field(default="", description="Syntax or execution errors.")
+    test_results: str = Field(default="", description="Public test and benchmark feedback.")
+    visible_tests: List[str] = Field(default_factory=list)
+    history: List[HistoryEntry] = Field(default_factory=list)
+    attempts_remaining: int = Field(..., ge=0)
+    last_action_status: str = Field(default="")
+    last_action_error: Optional[str] = Field(default=None)
+    score: float = Field(..., gt=0.0, lt=1.0)
+    reward: float = Field(default=0.1, gt=0.0, lt=1.0)
+    done: bool = Field(default=False)
+    reward_details: RewardDetails = Field(
+        default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
+    )
+
+
+class PythonCodeReviewState(State):
+    """Internal environment state exposed through /state."""
+
+    task_id: Optional[str] = Field(default=None)
+    difficulty: Optional[Difficulty] = Field(default=None)
+    task_kind: Optional[TaskKind] = Field(default=None)
+    attempts_remaining: int = Field(default=0, ge=0)
+    current_code: str = Field(default="")
+    errors: str = Field(default="")
+    test_results: str = Field(default="")
+    history: List[HistoryEntry] = Field(default_factory=list)
+    score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    done: bool = Field(default=False)
+
+
+class TaskDescriptor(BaseModel):
+    """Static task metadata."""
+
+    task_id: str
+    title: str
+    difficulty: Difficulty
+    task_kind: TaskKind
+    task_description: str
+    starter_code: str
+    visible_tests: List[str] = Field(default_factory=list)
+    repo_summary: str = Field(default="")
+    changed_files: List[str] = Field(default_factory=list)
+    available_files: List[str] = Field(default_factory=list)
+    goal: str = Field(default="")
+    max_steps: int = Field(..., ge=1)
+
+
+class TaskSummary(BaseModel):
+    """Compact task listing entry."""
+
+    task_id: str
+    difficulty: Difficulty
+    title: str
+    goal: str = Field(default="")
+
+
+class TaskGrade(BaseModel):
+    """Deterministic grader output."""
+
+    score: float = Field(..., gt=0.0, lt=1.0)
+    syntax_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    tests_passed: int = Field(default=0, ge=0)
+    tests_total: int = Field(default=0, ge=0)
+    quality_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    runtime_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    timed_out: bool = Field(default=False)
+    details: Dict[str, Any] = Field(default_factory=dict)
+
+
+class HealthResponse(BaseModel):
+    """Health payload for smoke tests."""
+
+    status: Literal["ok"] = "ok"
+    environment: str = "python_code_review_env"
+    task_count: int = Field(default=0, ge=0)
+
+
+PythonAction = PythonCodeReviewAction
+PythonObservation = PythonCodeReviewObservation
+PythonState = PythonCodeReviewState
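Every score and reward field above is pinned to the open interval (0, 1) via `Field(gt=0.0, lt=1.0)`, with 0.01 and 0.99 as the floor and ceiling defaults. A minimal stdlib sketch of the same clamping convention (the helper name `clamp_unit_open` is illustrative, mirroring the `clamp_score` helper elsewhere in the repo):

```python
def clamp_unit_open(value: object, lo: float = 0.01, hi: float = 0.99) -> float:
    # Coerce any input into the open interval (0, 1): non-numeric,
    # NaN, and infinite values collapse to the floor; finite values
    # are clipped to [lo, hi], which sits strictly inside (0, 1).
    try:
        numeric = float(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return lo
    if numeric != numeric or numeric in (float("inf"), float("-inf")):
        return lo
    return max(lo, min(hi, numeric))
```

Clamping before constructing a model keeps pydantic's `gt`/`lt` validators from ever raising on boundary values like exactly 0.0 or 1.0.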
openenv_models.py CHANGED
@@ -31,6 +31,9 @@ class RewardDetails(BaseModel):
     test_reward: float = Field(default=0.0)
     correctness_bonus: float = Field(default=0.0)
     quality_bonus: float = Field(default=0.0)
+    error_reduction_bonus: float = Field(default=0.0)
+    completion_bonus: float = Field(default=0.0)
+    runtime_bonus: float = Field(default=0.0)
     progress_delta: float = Field(default=0.0)
     invalid_action_penalty: float = Field(default=0.0)
     timeout_penalty: float = Field(default=0.0)
@@ -67,7 +70,10 @@ class PythonCodeReviewObservation(Observation):
     history: List[HistoryEntry] = Field(default_factory=list)
     attempts_remaining: int = Field(..., ge=0)
     last_action_status: str = Field(default="")
+    last_action_error: Optional[str] = Field(default=None)
     score: float = Field(..., gt=0.0, lt=1.0)
+    reward: float = Field(default=0.1, gt=0.0, lt=1.0)
+    done: bool = Field(default=False)
     reward_details: RewardDetails = Field(
         default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
     )
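The new bonus fields extend the additive reward decomposition: the net `value` is a base plus bonuses minus penalties, clamped back into (0, 1). A hypothetical sketch of that aggregation (the function name and dict-based signature are illustrative, not the environment's actual API):

```python
from typing import Dict


def compose_reward(
    base: float,
    bonuses: Dict[str, float],
    penalties: Dict[str, float],
    lo: float = 0.01,
    hi: float = 0.99,
) -> float:
    # Additive composition mirroring RewardDetails: every bonus field
    # adds, every penalty field subtracts, and the net value is clamped
    # into the open interval (0, 1) expected by the models.
    raw = base + sum(bonuses.values()) - sum(penalties.values())
    return max(lo, min(hi, raw))
```

Keeping each component as its own field (rather than only storing the net value) is what makes the reward debuggable: a trainer can see which bonus or penalty dominated a step.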
pyproject.toml CHANGED
@@ -13,7 +13,6 @@ dependencies = [
     "gradio>=5.26.0",
     "openai>=1.76.0",
     "openenv-core[core]>=0.2.2",
-    "pytest>=8.0.0",
     "streamlit>=1.44.0",
     "torch>=2.2.0",
     "transformers>=4.45.0",
@@ -22,6 +21,7 @@ dependencies = [

 [project.optional-dependencies]
 dev = [
+    "pytest>=8.0.0",
     "pytest-cov>=4.0.0",
 ]

@@ -37,10 +37,15 @@ packages = [
     "python_env.graders",
     "python_env.api",
     "python_env.app",
+    "python_env.app.agents",
+    "python_env.app.env",
+    "python_env.app.models",
+    "python_env.app.services",
+    "python_env.app.utils",
     "python_env.analyzers",
     "python_env.models",
     "python_env.schemas",
     "python_env.services",
     "python_env.utils",
 ]
-package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }
+package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.app.agents" = "app/agents", "python_env.app.env" = "app/env", "python_env.app.models" = "app/models", "python_env.app.services" = "app/services", "python_env.app.utils" = "app/utils", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }
schemas/response.py CHANGED
@@ -51,6 +51,9 @@ class ScoreBreakdown(BaseModel):
     domain_score: float = Field(..., ge=0.0, le=1.0)
     lint_score: float = Field(..., ge=0.0, le=1.0)
     complexity_penalty: float = Field(..., ge=0.0, le=1.0)
+    quality_signal: float = Field(..., ge=0.0, le=1.0)
+    error_reduction_signal: float = Field(..., ge=0.0, le=1.0)
+    completion_signal: float = Field(..., ge=0.0, le=1.0)
     reward: float = Field(..., ge=0.0, le=1.0)
server/Dockerfile CHANGED
@@ -2,28 +2,24 @@ FROM python:3.11-slim

 ENV PYTHONDONTWRITEBYTECODE=1 \
     PYTHONUNBUFFERED=1 \
-    PIP_NO_CACHE_DIR=1
+    PIP_NO_CACHE_DIR=1 \
+    PIP_DISABLE_PIP_VERSION_CHECK=1 \
+    ENABLE_GRADIO_DEMO=false

 WORKDIR /app

-COPY pyproject.toml README.md DEMO_SCRIPT.md openenv.yaml __init__.py client.py compat.py openenv_models.py inference.py triage.py triage_catalog.py triage_models.py launch.py /app/
-COPY api /app/api
-COPY app /app/app
-COPY analyzers /app/analyzers
-COPY models /app/models
-COPY schemas /app/schemas
-COPY server /app/server
-COPY services /app/services
-COPY tasks /app/tasks
-COPY utils /app/utils
-COPY graders /app/graders
+COPY server/requirements.txt /tmp/requirements.txt

 RUN python -m pip install --upgrade pip && \
-    pip install .
+    pip install -r /tmp/requirements.txt
+
+COPY . /app
+
+RUN pip install --no-deps .

 EXPOSE 8000

 HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000', timeout=3).read()"
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"

-CMD ["python", "launch.py"]
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
server/__pycache__/__init__.cpython-313.pyc CHANGED
Binary files a/server/__pycache__/__init__.cpython-313.pyc and b/server/__pycache__/__init__.cpython-313.pyc differ
 
server/__pycache__/app.cpython-313.pyc CHANGED
Binary files a/server/__pycache__/app.cpython-313.pyc and b/server/__pycache__/app.cpython-313.pyc differ
 
server/app.py CHANGED
@@ -1,7 +1,11 @@
-"""FastAPI + Gradio entrypoint for TorchReview Copilot."""
+"""OpenEnv FastAPI entrypoint with optional Gradio mounting."""

 from __future__ import annotations

+import os
+
+from fastapi import FastAPI
+
 try:
     from openenv.core.env_server.http_server import create_app
 except Exception as exc:  # pragma: no cover
@@ -17,11 +21,20 @@ except Exception:
 try:
     from ..openenv_models import PythonCodeReviewAction, PythonCodeReviewObservation
     from .env import PythonCodeReviewEnvironment
-    from .demo import build_demo
 except ImportError:
     from openenv_models import PythonCodeReviewAction, PythonCodeReviewObservation
     from server.env import PythonCodeReviewEnvironment
-    from server.demo import build_demo
+
+
+def _gradio_enabled() -> bool:
+    return str(os.getenv("ENABLE_GRADIO_DEMO", "false")).strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _max_concurrent_envs() -> int:
+    try:
+        return max(int(os.getenv("OPENENV_MAX_CONCURRENT_ENVS", "2")), 1)
+    except Exception:
+        return 2


 def build_application():
@@ -32,11 +45,24 @@ def build_application():
         PythonCodeReviewAction,
         PythonCodeReviewObservation,
         env_name="python_code_review_env",
-        max_concurrent_envs=4,
+        max_concurrent_envs=_max_concurrent_envs(),
     )
-    if gr is None:
-        return api_app
-    return gr.mount_gradio_app(api_app, build_demo(), path="/")
+    served_app = api_app
+    if gr is not None and _gradio_enabled():
+        try:
+            from .demo import build_demo
+        except ImportError:
+            from server.demo import build_demo
+        served_app = gr.mount_gradio_app(api_app, build_demo(), path="/")
+
+    wrapper_app = FastAPI(title="python_code_review_env", version="1.0.0")
+
+    @wrapper_app.get("/health", include_in_schema=False)
+    def _health() -> dict[str, str]:
+        return {"status": "ok"}
+
+    wrapper_app.mount("/", served_app)
+    return wrapper_app


 app = build_application()
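The Gradio demo is now opt-in behind the `ENABLE_GRADIO_DEMO` flag introduced above. A self-contained sketch of that flag-parsing convention, with an injectable mapping added here for testability (the real `_gradio_enabled` reads `os.environ` directly):

```python
import os
from typing import Mapping, Optional

# Accepted "on" spellings, matching _gradio_enabled in server/app.py.
_TRUTHY = {"1", "true", "yes", "on"}


def gradio_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    # Default off; case-insensitive and whitespace-tolerant, so
    # "True", " yes ", and "1" all enable the demo while anything
    # else (including an unset variable) keeps it disabled.
    source = os.environ if env is None else env
    return str(source.get("ENABLE_GRADIO_DEMO", "false")).strip().lower() in _TRUTHY
```

Defaulting to off keeps the Docker image serving only the OpenEnv API unless the Space explicitly opts in, which matches the `ENABLE_GRADIO_DEMO=false` default baked into the Dockerfile.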
server/env.py CHANGED
@@ -63,6 +63,7 @@ class PythonCodeReviewEnvironment(
         self._current_code: str = self._task.starter_code
         self._history: list[HistoryEntry] = []
         self._last_reward = RewardDetails(value=0.1, reason="Environment initialized.")
         self._current_grade = _empty_grade()
         self._state = PythonCodeReviewState(episode_id=str(uuid4()), step_count=0)
         self.reset()
@@ -77,8 +78,13 @@ class PythonCodeReviewEnvironment(
         self._task = select_task(seed=seed, task_id=task_id)
         self._current_code = self._task.starter_code
         self._history = []
         self._last_reward = RewardDetails(value=0.1, reason="Environment reset.")
-        self._current_grade = grade_task(self._task, self._current_code, include_hidden=False)

         self._state = PythonCodeReviewState(
             episode_id=episode_id or str(uuid4()),
@@ -142,11 +148,13 @@ class PythonCodeReviewEnvironment(
         invalid_action = False
         code_changed = False
         use_hidden_grading = False

         if action.action_type == "edit_code":
             if not action.code or not action.code.strip():
                 invalid_action = True
                 status = "edit_code requires a non-empty code payload."
             else:
                 code_changed = action.code != self._current_code
                 self._current_code = action.code
@@ -164,18 +172,22 @@ class PythonCodeReviewEnvironment(
         else:  # pragma: no cover
             invalid_action = True
             status = f"Unsupported action_type: {action.action_type}"

         self._state.step_count += 1

         if invalid_action:
             current_grade = previous_grade
         else:
-            current_grade = grade_task(
                 self._task,
                 self._current_code,
                 include_hidden=use_hidden_grading,
                 timeout_s=timeout_s or 3.0,
             )
         if action.action_type == "analyze_code":
             status = self._analysis_status(current_grade)
         elif action.action_type == "run_tests":
@@ -208,6 +220,7 @@ class PythonCodeReviewEnvironment(

         self._current_grade = current_grade
         self._last_reward = reward_details
         attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)

         self._state.task_id = self._task.task_id
@@ -226,7 +239,14 @@ class PythonCodeReviewEnvironment(
             status=status,
             reward_details=reward_details,
         )
-        return observation, reward_details.value, observation.done, {"task_id": observation.task_id, "score": observation.score}

     @property
     def state(self) -> PythonCodeReviewState:
@@ -252,11 +272,13 @@ class PythonCodeReviewEnvironment(
             history=list(self._history),
             attempts_remaining=self._state.attempts_remaining,
             last_action_status=status,
             score=grade.score,
             reward=reward_details.value,
             done=self._state.done,
             reward_details=reward_details,
             metadata={
                 "goal": self._task.goal,
                 "repo_summary": self._task.repo_summary,
                 "changed_files": self._task.changed_files,
@@ -280,25 +302,34 @@ class PythonCodeReviewEnvironment(
         curr_score = current_grade.score
         prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
         curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)

         syntax_reward = 0.14 if previous_grade.syntax_score < 0.9 and current_grade.syntax_score >= 0.9 else 0.0
-        test_reward = round(max(curr_rate - prev_rate, 0.0) * 0.22, 3)
-        progress_delta = round(max(curr_score - prev_score, 0.0) * 0.35, 3)
-        quality_bonus = round(max(current_grade.quality_score - previous_grade.quality_score, 0.0) * 0.08, 3)
         correctness_bonus = 0.12 if final_submission and curr_score >= 0.94 and prev_score < 0.94 else 0.0

-        invalid_action_penalty = 0.12 if invalid_action else 0.0
-        timeout_penalty = 0.14 if timed_out else 0.0
-        regression_penalty = round(max(prev_score - curr_score, 0.0) * 0.2, 3)
-        stagnation_penalty = 0.06 if action.action_type == "edit_code" and not code_changed else 0.0

         raw_value = (
-            0.1
-            + 0.45 * curr_score
             + syntax_reward
             + test_reward
             + progress_delta
             + quality_bonus
             + correctness_bonus
             - invalid_action_penalty
             - timeout_penalty
@@ -316,6 +347,12 @@ class PythonCodeReviewEnvironment(
             reason_parts.append("overall score improved")
         if quality_bonus:
             reason_parts.append("code quality improved")
         if correctness_bonus:
             reason_parts.append("full correctness bonus")
         if invalid_action_penalty:
@@ -335,6 +372,9 @@ class PythonCodeReviewEnvironment(
             test_reward=test_reward,
             correctness_bonus=correctness_bonus,
             quality_bonus=quality_bonus,
             progress_delta=progress_delta,
             invalid_action_penalty=invalid_action_penalty,
  timeout_penalty=timeout_penalty,
@@ -352,6 +392,22 @@ class PythonCodeReviewEnvironment(
352
  return compile_error
353
  return "Code parses successfully."
354
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
355
  def _format_test_results(self, grade: TaskGrade) -> str:
356
  parts = [grade.details.get("test_summary", "No test feedback available.")]
357
  benchmark = grade.details.get("benchmark")
 
63
  self._current_code: str = self._task.starter_code
64
  self._history: list[HistoryEntry] = []
65
  self._last_reward = RewardDetails(value=0.1, reason="Environment initialized.")
66
+ self._last_action_error: str | None = None
67
  self._current_grade = _empty_grade()
68
  self._state = PythonCodeReviewState(episode_id=str(uuid4()), step_count=0)
69
  self.reset()
 
78
  self._task = select_task(seed=seed, task_id=task_id)
79
  self._current_code = self._task.starter_code
80
  self._history = []
81
+ self._last_action_error = None
82
  self._last_reward = RewardDetails(value=0.1, reason="Environment reset.")
83
+ self._current_grade, self._last_action_error = self._safe_grade_task(
84
+ self._task,
85
+ self._current_code,
86
+ include_hidden=False,
87
+ )
88
 
89
  self._state = PythonCodeReviewState(
90
  episode_id=episode_id or str(uuid4()),
 
148
  invalid_action = False
149
  code_changed = False
150
  use_hidden_grading = False
151
+ action_error: str | None = None
152
 
153
  if action.action_type == "edit_code":
154
  if not action.code or not action.code.strip():
155
  invalid_action = True
156
  status = "edit_code requires a non-empty code payload."
157
+ action_error = status
158
  else:
159
  code_changed = action.code != self._current_code
160
  self._current_code = action.code
 
172
  else: # pragma: no cover
173
  invalid_action = True
174
  status = f"Unsupported action_type: {action.action_type}"
175
+ action_error = status
176
 
177
  self._state.step_count += 1
178
 
179
  if invalid_action:
180
  current_grade = previous_grade
181
  else:
182
+ current_grade, grade_error = self._safe_grade_task(
183
  self._task,
184
  self._current_code,
185
  include_hidden=use_hidden_grading,
186
  timeout_s=timeout_s or 3.0,
187
  )
188
+ if grade_error:
189
+ action_error = grade_error
190
+ status = f"{status} Grading fallback used."
191
  if action.action_type == "analyze_code":
192
  status = self._analysis_status(current_grade)
193
  elif action.action_type == "run_tests":
 
220
 
221
  self._current_grade = current_grade
222
  self._last_reward = reward_details
223
+ self._last_action_error = action_error
224
  attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)
225
 
226
  self._state.task_id = self._task.task_id
 
239
  status=status,
240
  reward_details=reward_details,
241
  )
242
+ return observation, reward_details.value, observation.done, {
243
+ "task_id": observation.task_id,
244
+ "score": observation.score,
245
+ "done": observation.done,
246
+ "attempts_remaining": observation.attempts_remaining,
247
+ "last_action_status": observation.last_action_status,
248
+ "last_action_error": observation.last_action_error,
249
+ }
250
 
251
  @property
252
  def state(self) -> PythonCodeReviewState:
 
272
  history=list(self._history),
273
  attempts_remaining=self._state.attempts_remaining,
274
  last_action_status=status,
275
+ last_action_error=self._last_action_error,
276
  score=grade.score,
277
  reward=reward_details.value,
278
  done=self._state.done,
279
  reward_details=reward_details,
280
  metadata={
281
+ "benchmark": "python_code_review_env",
282
  "goal": self._task.goal,
283
  "repo_summary": self._task.repo_summary,
284
  "changed_files": self._task.changed_files,
 
302
  curr_score = current_grade.score
303
  prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
304
  curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)
305
+ prev_runtime = previous_grade.runtime_score
306
+ curr_runtime = current_grade.runtime_score
307
+ prev_compile_error = bool(str(previous_grade.details.get("compile_error", "")).strip())
308
+ curr_compile_error = bool(str(current_grade.details.get("compile_error", "")).strip())
309
 
310
  syntax_reward = 0.14 if previous_grade.syntax_score < 0.9 and current_grade.syntax_score >= 0.9 else 0.0
311
+ test_reward = round(max(curr_rate - prev_rate, 0.0) * 0.28, 3)
312
+ progress_delta = round(max(curr_score - prev_score, 0.0) * 0.3, 3)
313
+ quality_bonus = round(max(current_grade.quality_score - previous_grade.quality_score, 0.0) * 0.12, 3)
314
+ runtime_bonus = round(max(curr_runtime - prev_runtime, 0.0) * 0.08, 3)
315
+ error_reduction_bonus = 0.1 if prev_compile_error and not curr_compile_error else 0.0
316
+ completion_bonus = 0.14 if final_submission and curr_rate >= 0.999 and curr_score >= 0.94 else 0.0
317
  correctness_bonus = 0.12 if final_submission and curr_score >= 0.94 and prev_score < 0.94 else 0.0
318
 
319
+ invalid_action_penalty = round((0.04 + (0.08 * (1.0 - prev_score))) if invalid_action else 0.0, 3)
320
+ timeout_penalty = round((0.06 + (0.08 * max(curr_runtime, prev_runtime))) if timed_out else 0.0, 3)
321
+ regression_penalty = round(max(prev_score - curr_score, 0.0) * 0.25, 3)
322
+ stagnation_penalty = round((0.02 + (0.05 * prev_score)) if action.action_type == "edit_code" and not code_changed else 0.0, 3)
323
 
324
  raw_value = (
325
+ 0.32 * curr_score
 
326
  + syntax_reward
327
  + test_reward
328
  + progress_delta
329
  + quality_bonus
330
+ + error_reduction_bonus
331
+ + completion_bonus
332
+ + runtime_bonus
333
  + correctness_bonus
334
  - invalid_action_penalty
335
  - timeout_penalty
 
347
  reason_parts.append("overall score improved")
348
  if quality_bonus:
349
  reason_parts.append("code quality improved")
350
+ if error_reduction_bonus:
351
+ reason_parts.append("errors removed")
352
+ if completion_bonus:
353
+ reason_parts.append("task completed")
354
+ if runtime_bonus:
355
+ reason_parts.append("runtime improved")
356
  if correctness_bonus:
357
  reason_parts.append("full correctness bonus")
358
  if invalid_action_penalty:
 
372
  test_reward=test_reward,
373
  correctness_bonus=correctness_bonus,
374
  quality_bonus=quality_bonus,
375
+ error_reduction_bonus=error_reduction_bonus,
376
+ completion_bonus=completion_bonus,
377
+ runtime_bonus=runtime_bonus,
378
  progress_delta=progress_delta,
379
  invalid_action_penalty=invalid_action_penalty,
380
  timeout_penalty=timeout_penalty,
 
392
  return compile_error
393
  return "Code parses successfully."
394
 
395
+ def _safe_grade_task(
396
+ self,
397
+ task: ReviewTask,
398
+ code: str,
399
+ *,
400
+ include_hidden: bool,
401
+ timeout_s: float = 3.0,
402
+ ) -> tuple[TaskGrade, str | None]:
403
+ try:
404
+ return (
405
+ grade_task(task, code, include_hidden=include_hidden, timeout_s=timeout_s),
406
+ None,
407
+ )
408
+ except Exception as exc: # pragma: no cover
409
+ return _empty_grade(), f"{type(exc).__name__}: {exc}"
410
+
411
  def _format_test_results(self, grade: TaskGrade) -> str:
412
  parts = [grade.details.get("test_summary", "No test feedback available.")]
413
  benchmark = grade.details.get("benchmark")
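The new `_safe_grade_task` helper wraps `grade_task` so an unexpected grading exception degrades to an empty grade plus an error string instead of crashing the step. The same catch-and-fallback pattern in isolation (a standalone sketch; `safe_call` and its fallback value are illustrative, not part of the project):

```python
from typing import Callable, Optional, Tuple


def safe_call(fn: Callable[[], float], fallback: float = 0.0) -> Tuple[float, Optional[str]]:
    """Run fn; on success return (result, None), on failure (fallback, 'Type: message')."""
    try:
        return fn(), None
    except Exception as exc:
        return fallback, f"{type(exc).__name__}: {exc}"


value, error = safe_call(lambda: 1 / 0)
print(value, error)  # → 0.0 ZeroDivisionError: division by zero
```

Returning the error as data rather than raising lets the caller keep the episode alive and surface the failure through `last_action_error`.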
server/requirements.txt CHANGED
@@ -2,7 +2,6 @@ openenv-core[core]>=0.2.2
 fastapi>=0.111.0
 gradio>=5.26.0
 uvicorn>=0.30.0
-pytest>=8.0.0
 openai>=1.76.0
 streamlit>=1.44.0
 torch>=2.2.0
services/analysis_service.py CHANGED
@@ -34,7 +34,7 @@ class AnalysisService:
     """End-to-end analysis pipeline shared by API and UI."""

     def __init__(self) -> None:
-        self.model = PyTorchCodeAnalyzerModel()
+        self._model: PyTorchCodeAnalyzerModel | None = None
         self.reward_service = RewardService()
         self.suggestion_service = SuggestionService()
         self._analyzers: Dict[str, Callable[[str, Dict[str, Any], Dict[str, Any]], DomainAnalysis]] = {
@@ -44,6 +44,12 @@ class AnalysisService:
             "web": analyze_web_code,
         }

+    @property
+    def model(self) -> PyTorchCodeAnalyzerModel:
+        if self._model is None:
+            self._model = PyTorchCodeAnalyzerModel()
+        return self._model
+
     def _heuristic_domain_scores(self, parsed: Dict[str, Any], code: str) -> Dict[str, float]:
         """Derive domain priors from imports and syntax-level hints."""

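The `AnalysisService` change replaces eager construction of `PyTorchCodeAnalyzerModel` with a lazily initialized property, so instantiating the service stays cheap until the model is actually needed. A self-contained sketch of the same pattern (`ExpensiveModel` is a hypothetical stand-in for the real model class):

```python
class ExpensiveModel:
    instances = 0

    def __init__(self) -> None:
        # Pretend this loads heavy weights; count constructions instead.
        ExpensiveModel.instances += 1


class Service:
    def __init__(self) -> None:
        self._model: ExpensiveModel | None = None  # nothing loaded yet

    @property
    def model(self) -> ExpensiveModel:
        # Construct on first access, then reuse the cached instance.
        if self._model is None:
            self._model = ExpensiveModel()
        return self._model


svc = Service()
assert ExpensiveModel.instances == 0  # constructor ran, model untouched
first, second = svc.model, svc.model
assert ExpensiveModel.instances == 1 and first is second
```

This keeps import and startup time low for code paths (health checks, docs pages) that never touch the model.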
services/reward_service.py CHANGED
@@ -9,13 +9,21 @@ class RewardService:
     """Compute reward scores from model, domain, lint, and complexity signals."""

     def compute(self, *, ml_score: float, domain_score: float, lint_score: float, complexity_penalty: float) -> ScoreBreakdown:
-        """Apply the weighted reward formula and clamp the result."""
+        """Apply dynamic reward shaping based on quality, errors, and completion."""

+        quality_signal = max(0.0, min(1.0, (0.45 * ml_score) + (0.3 * domain_score) + (0.25 * lint_score)))
+        error_reduction_signal = max(0.0, min(1.0, lint_score - (0.6 * complexity_penalty)))
+        completion_signal = max(0.0, min(1.0, (ml_score + domain_score + lint_score) / 3.0))
         reward = max(
             0.0,
             min(
                 1.0,
-                (0.4 * ml_score) + (0.2 * domain_score) + (0.2 * lint_score) - (0.2 * complexity_penalty),
+                (0.35 * quality_signal)
+                + (0.25 * completion_signal)
+                + (0.2 * error_reduction_signal)
+                + (0.1 * ml_score)
+                + (0.1 * domain_score)
+                - (0.15 * complexity_penalty),
             ),
         )
         return ScoreBreakdown(
@@ -23,5 +31,8 @@ class RewardService:
             domain_score=round(domain_score, 4),
             lint_score=round(lint_score, 4),
             complexity_penalty=round(complexity_penalty, 4),
+            quality_signal=round(quality_signal, 4),
+            error_reduction_signal=round(error_reduction_signal, 4),
+            completion_signal=round(completion_signal, 4),
             reward=round(reward, 4),
         )
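Plugging sample scores into the reshaped formula shows how the intermediate signals combine. The weights below mirror the diff; the input values are arbitrary illustrations, and this sketch does pure arithmetic rather than importing the actual service:

```python
ml_score, domain_score, lint_score, complexity_penalty = 0.8, 0.6, 0.9, 0.2

# Composite signals, each clamped to [0, 1] as in RewardService.compute.
quality_signal = max(0.0, min(1.0, (0.45 * ml_score) + (0.3 * domain_score) + (0.25 * lint_score)))
error_reduction_signal = max(0.0, min(1.0, lint_score - (0.6 * complexity_penalty)))
completion_signal = max(0.0, min(1.0, (ml_score + domain_score + lint_score) / 3.0))

reward = max(0.0, min(
    1.0,
    (0.35 * quality_signal)
    + (0.25 * completion_signal)
    + (0.2 * error_reduction_signal)
    + (0.1 * ml_score)
    + (0.1 * domain_score)
    - (0.15 * complexity_penalty),
))
print(round(reward, 4))  # ≈ 0.7254 for these inputs
```

Because every signal is clamped before weighting and the final sum is clamped again, the reward stays in [0, 1] regardless of inputs.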
tests/test_inference_runner.py ADDED
@@ -0,0 +1,71 @@
+"""Smoke tests for the strict inference output contract."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+from app.env.runner import InferenceRunner
+from app.models.inference import AgentDecision, InferenceConfig
+
+
+@dataclass
+class _FakeObservation:
+    task_id: str
+    attempts_remaining: int
+    score: float
+    done: bool
+    history: list[object] = field(default_factory=list)
+    current_code: str = "print('broken')"
+    last_action_error: str | None = None
+
+
+class _FakeEnv:
+    def __init__(self) -> None:
+        self._step = 0
+
+    def reset(self, *, task_id: str) -> _FakeObservation:
+        return _FakeObservation(task_id=task_id, attempts_remaining=4, score=0.2, done=False)
+
+    def step_result(self, action: object) -> tuple[_FakeObservation, float, bool, dict[str, object]]:
+        self._step += 1
+        if self._step == 1:
+            return (
+                _FakeObservation("demo_task", 3, 0.45, False, current_code="candidate"),
+                0.45,
+                False,
+                {"last_action_error": None},
+            )
+        if self._step == 2:
+            return (
+                _FakeObservation("demo_task", 2, 0.97, True, current_code="reference"),
+                0.97,
+                True,
+                {"last_action_error": None},
+            )
+        raise AssertionError("runner stepped too many times")
+
+
+class _FakeAgent:
+    def __init__(self) -> None:
+        self._step = 0
+
+    def act(self, observation: object) -> AgentDecision:
+        self._step += 1
+        if self._step == 1:
+            return AgentDecision(action_type="run_tests")
+        return AgentDecision(action_type="submit_solution")
+
+
+def test_inference_runner_emits_strict_lines(capsys) -> None:
+    runner = InferenceRunner(InferenceConfig.from_env())
+    runner.agent = _FakeAgent()
+    runner._create_env = lambda: _FakeEnv()  # type: ignore[method-assign]
+    runner.run_task("demo_task")
+
+    captured = capsys.readouterr().out.strip().splitlines()
+    assert captured == [
+        f"[START] task=demo_task env={runner.config.benchmark_name} model={runner.config.model_name}",
+        "[STEP] step=1 action=run_tests reward=0.45 done=false error=null",
+        "[STEP] step=2 action=submit_solution reward=0.97 done=true error=null",
+        "[END] success=true steps=2 rewards=0.45,0.97",
+    ]
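The assertions in this test pin down a strict line-oriented log contract: lowercase booleans and `null` for a missing error. A tiny formatter that produces the same `[STEP]` shape (a hypothetical helper, not the runner's actual implementation):

```python
from __future__ import annotations


def format_step(step: int, action: str, reward: float, done: bool, error: str | None) -> str:
    # Lowercase the boolean and render a missing error as JSON-style null.
    return (
        f"[STEP] step={step} action={action} reward={reward} "
        f"done={str(done).lower()} error={'null' if error is None else error}"
    )


print(format_step(1, "run_tests", 0.45, False, None))
# → [STEP] step=1 action=run_tests reward=0.45 done=false error=null
```

Keeping the format in one function makes the contract easy to test with exact string comparisons, as the smoke test above does.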
uv.lock CHANGED
The diff for this file is too large to render.