uvpatel7271 committed on
Commit 1c8b7f1 · verified · 1 Parent(s): 1ac5fc9

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,81 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# Multi-stage build using openenv-base
# This Dockerfile is flexible and works for both:
# - In-repo environments (with local OpenEnv sources)
# - Standalone environments (with openenv from PyPI/Git)
# The build script (openenv build) handles context detection and sets appropriate build args.

ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE} AS builder

WORKDIR /app

# Ensure git is available (required for installing dependencies from VCS)
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Build argument to control whether we're building standalone or in-repo
ARG BUILD_MODE=in-repo
ARG ENV_NAME=python_env

# Copy environment code (always at root of build context)
COPY . /app/env

# For in-repo builds, openenv is already vendored in the build context
# For standalone builds, openenv will be installed via pyproject.toml
WORKDIR /app/env

# Ensure uv is available (for local builds where the base image lacks it)
RUN if ! command -v uv >/dev/null 2>&1; then \
        curl -LsSf https://astral.sh/uv/install.sh | sh && \
        mv /root/.local/bin/uv /usr/local/bin/uv && \
        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
    fi

# Install dependencies using uv sync
# If uv.lock exists, use it; otherwise resolve on the fly
RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-install-project --no-editable; \
    else \
        uv sync --no-install-project --no-editable; \
    fi

RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-editable; \
    else \
        uv sync --no-editable; \
    fi

# Final runtime stage
FROM ${BASE_IMAGE}

WORKDIR /app

# Copy the virtual environment from the builder
COPY --from=builder /app/env/.venv /app/.venv

# Copy the environment code
COPY --from=builder /app/env /app/env

# Set PATH to use the virtual environment
ENV PATH="/app/.venv/bin:$PATH"

# Set PYTHONPATH so imports work correctly
ENV PYTHONPATH="/app/env:$PYTHONPATH"

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the FastAPI server
# The module path is constructed to work with the /app/env structure
ENV ENABLE_WEB_INTERFACE=true
CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,266 @@
---
title: Python Env Environment Server
emoji: 🎶
colorFrom: purple
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Python Env Environment

A simple test environment that echoes back messages. It is useful for testing the env APIs as well as for demonstrating environment usage patterns.

## Quick Start

The simplest way to use the Python Env environment is through the `PythonEnv` class:

```python
from python_env import PythonAction, PythonEnv

try:
    # Create environment from Docker image
    env = PythonEnv.from_docker_image("python_env-env:latest")

    # Reset
    result = env.reset()
    print(f"Reset: {result.observation.echoed_message}")

    # Send multiple messages
    messages = ["Hello, World!", "Testing echo", "Final message"]

    for msg in messages:
        result = env.step(PythonAction(message=msg))
        print(f"Sent: '{msg}'")
        print(f"  → Echoed: '{result.observation.echoed_message}'")
        print(f"  → Length: {result.observation.message_length}")
        print(f"  → Reward: {result.reward}")

finally:
    # Always clean up
    env.close()
```

That's it! The `PythonEnv.from_docker_image()` method handles:
- Starting the Docker container
- Waiting for the server to be ready
- Connecting to the environment
- Container cleanup when you call `close()`

## Building the Docker Image

Before using the environment, you need to build the Docker image:

```bash
# From the project root
docker build -t python_env-env:latest -f server/Dockerfile .
```

## Deploying to Hugging Face Spaces

You can deploy your OpenEnv environment to Hugging Face Spaces with the `openenv push` command:

```bash
# From the environment directory (where openenv.yaml is located)
openenv push

# Or specify options
openenv push --namespace my-org --private
```

The `openenv push` command will:
1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
2. Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
3. Upload to Hugging Face (ensuring you're logged in)

### Prerequisites

- Authenticate with Hugging Face: the command will prompt for login if you are not already authenticated

### Options

- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to the current directory)
- `--repo-id`, `-r`: Repository ID in the format `username/repo-name` (defaults to `username/env-name` from openenv.yaml)
- `--base-image`, `-b`: Base Docker image to use (overrides the Dockerfile `FROM`)
- `--private`: Deploy the Space as private (default: public)

### Examples

```bash
# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
openenv push

# Push to a specific repository
openenv push --repo-id my-org/my-env

# Push with a custom base image
openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest

# Push as a private space
openenv push --private

# Combine options
openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
```

After deployment, your Space will be available at:
`https://huggingface.co/spaces/<repo-id>`

The deployed Space includes:
- **Web Interface** at `/web` - Interactive UI for exploring the environment
- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
- **Health Check** at `/health` - Container health monitoring
- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions

## Environment Details

### Action
**PythonAction**: Contains a single field
- `message` (str) - The message to echo back

### Observation
**PythonObservation**: Contains the echo response and metadata
- `echoed_message` (str) - The message echoed back
- `message_length` (int) - Length of the message
- `reward` (float) - Reward based on message length (length × 0.1)
- `done` (bool) - Always `False` for the echo environment
- `metadata` (dict) - Additional info such as the step count

### Reward
The reward is calculated as `message_length × 0.1`:
- "Hi" → reward: 0.2
- "Hello, World!" → reward: 1.3
- Empty message → reward: 0.0
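
The reward rule above can be sketched as a one-line helper (the function name is ours, for illustration only — it is not part of the environment API):

```python
def length_reward(message: str) -> float:
    """Reward an echoed message in proportion to its length (length × 0.1)."""
    return len(message) * 0.1

for msg in ["Hi", "Hello, World!", ""]:
    print(f"{msg!r} -> reward: {length_reward(msg):.1f}")
```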

## Advanced Usage

### Connecting to an Existing Server

If you already have a Python Env environment server running, you can connect to it directly:

```python
from python_env import PythonAction, PythonEnv

# Connect to an existing server
env = PythonEnv(base_url="<ENV_HTTP_URL_HERE>")

# Use as normal
result = env.reset()
result = env.step(PythonAction(message="Hello!"))
```

Note: When connecting to an existing server, `env.close()` will NOT stop the server.

### Using the Context Manager

The client supports context-manager usage for automatic connection management:

```python
from python_env import PythonAction, PythonEnv

# Connect with a context manager (auto-connects and closes)
with PythonEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(f"Reset: {result.observation.echoed_message}")

    # Multiple steps with low latency
    for msg in ["Hello", "World", "!"]:
        result = env.step(PythonAction(message=msg))
        print(f"Echoed: {result.observation.echoed_message}")
```

The client uses WebSocket connections for:
- **Lower latency**: No per-request HTTP connection overhead
- **Persistent session**: The server maintains your environment state
- **Efficient episodes**: Better for many sequential steps

### Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections. To enable this,
modify `server/app.py` to use factory mode:

```python
# In server/app.py - use factory mode for concurrent sessions
app = create_app(
    PythonEnvironment,  # Pass the class, not an instance
    PythonAction,
    PythonObservation,
    max_concurrent_envs=4,  # Allow 4 concurrent sessions
)
```

Then multiple clients can connect simultaneously:

```python
from concurrent.futures import ThreadPoolExecutor

from python_env import PythonAction, PythonEnv

def run_episode(client_id: int):
    with PythonEnv(base_url="http://localhost:8000") as env:
        result = env.reset()
        for i in range(10):
            result = env.step(PythonAction(message=f"Client {client_id}, step {i}"))
        return client_id, result.observation.message_length

# Run 4 episodes concurrently
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```

## Development & Testing

### Direct Environment Testing

Test the environment logic directly, without starting the HTTP server:

```bash
# From the environment directory
python3 server/python_env_environment.py
```

This verifies that:
- The environment resets correctly
- `step` executes actions properly
- State tracking works
- Rewards are calculated correctly

### Running Locally

Run the server locally during development:

```bash
uvicorn server.app:app --reload
```

## Project Structure

```
python_env/
├── .dockerignore                  # Docker build exclusions
├── __init__.py                    # Module exports
├── README.md                      # This file
├── openenv.yaml                   # OpenEnv manifest
├── pyproject.toml                 # Project metadata and dependencies
├── uv.lock                        # Locked dependencies (generated)
├── client.py                      # PythonEnv client
├── models.py                      # Action and Observation models
└── server/
    ├── __init__.py                # Server module exports
    ├── python_env_environment.py  # Core environment logic
    ├── app.py                     # FastAPI application (HTTP + WebSocket endpoints)
    └── Dockerfile                 # Container image definition
```

---

## Integration Notes

```bash
cd F:\python_env
# Edit your environment implementation in server/python_env_environment.py
# Edit your models in models.py
# Install dependencies: uv sync

# To integrate into the OpenEnv repo:
# 1. Copy this directory to <repo_root>/envs/python_env_env
# 2. Build from the repo root: docker build -t python_env_env:latest -f envs/python_env_env/server/Dockerfile .
# 3. Run your image: docker run -p 8000:8000 python_env_env:latest
```
__init__.py ADDED
@@ -0,0 +1,16 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Python Env Environment."""

from .client import PythonEnv
from .models import PythonAction, PythonObservation

__all__ = [
    "PythonAction",
    "PythonObservation",
    "PythonEnv",
]
client.py ADDED
@@ -0,0 +1,46 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Python Env Environment Client."""

from typing import Any, Dict

from openenv.core import EnvClient
from openenv.core.client_types import StepResult
from openenv.core.env_server.types import State

try:
    from .models import PythonAction, PythonObservation
except ImportError:
    from models import PythonAction, PythonObservation  # type: ignore


class PythonEnv(EnvClient[PythonAction, PythonObservation, State]):
    """Typed client for the Python code-review environment."""

    def _step_payload(self, action: PythonAction) -> Dict[str, Any]:
        """Convert a validated action model to the JSON payload expected by the server."""

        return action.model_dump(exclude_none=True)

    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[PythonObservation]:
        """Parse a server response into a typed step result."""

        obs_data = dict(payload.get("observation", {}))
        obs_data.setdefault("done", payload.get("done", False))
        obs_data.setdefault("reward", payload.get("reward"))
        observation = PythonObservation.model_validate(obs_data)

        return StepResult(
            observation=observation,
            reward=payload.get("reward"),
            done=payload.get("done", False),
        )

    def _parse_state(self, payload: Dict[str, Any]) -> State:
        """Parse the server state payload into the shared state model."""

        return State.model_validate(payload)
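
The normalization step in `_parse_result` folds the top-level `done`/`reward` fields into the observation dict before model validation. A minimal standalone sketch of that behavior, using a hypothetical server payload (field values are illustrative, not from the source):

```python
# Hypothetical server response payload.
payload = {
    "observation": {"echoed_message": "Hi", "message_length": 2},
    "reward": 0.2,
    "done": False,
}

# Mirror _parse_result: copy top-level reward/done into the observation dict
# (without overwriting values the server already put there) so the observation
# model can validate them alongside its own fields.
obs_data = dict(payload.get("observation", {}))
obs_data.setdefault("done", payload.get("done", False))
obs_data.setdefault("reward", payload.get("reward"))

print(obs_data)
```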
inference.py ADDED
@@ -0,0 +1,314 @@
"""Baseline inference script for the Python code-review environment.

This script is meant to be submission-friendly:

- configuration comes from environment variables
- model calls use the OpenAI client as required
- malformed model output is handled gracefully
- a JSON report is written for reproducibility
"""

from __future__ import annotations

import json
import os
import re
from pathlib import Path
from typing import Any, Dict, List, Optional

from openai import OpenAI

from client import PythonEnv
from models import PythonReviewAction, ReviewFinding


# Read all runtime configuration from environment variables so the script can
# be reused unchanged across local runs, CI, and HF Spaces validation.
API_BASE_URL = os.environ["API_BASE_URL"]
MODEL_NAME = os.environ["MODEL_NAME"]
API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY")
ENV_BASE_URL = os.getenv("ENV_BASE_URL")
DOCKER_IMAGE = os.getenv("PYTHON_ENV_IMAGE", "python_env-env:latest")
MAX_STEPS = int(os.getenv("MAX_STEPS", "3"))
MAX_TASKS = int(os.getenv("MAX_TASKS", "3"))
REPORT_PATH = Path(os.getenv("INFERENCE_REPORT_PATH", "inference_results.json"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0"))
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "900"))

SYSTEM_PROMPT = """You are a precise Python code reviewer.
Return strict JSON using this schema:
{
  "findings": [
    {
      "title": "short title",
      "line": 1,
      "category": "bug|security|style|performance|maintainability",
      "severity": "critical|warning|info",
      "rationale": "why it matters",
      "recommendation": "how to fix it",
      "rule_id": "optional-stable-id"
    }
  ],
  "patched_code": null
}

Rules:
- Output JSON only. No markdown fences.
- Only report issues supported by the visible code.
- Prefer high precision over quantity.
- Include line numbers when possible.
"""


def _build_prompt(observation, step: int, history: List[str]) -> str:
    """Build the task prompt sent to the model for one step."""

    history_text = "\n".join(history[-4:]) if history else "No previous attempts."
    return (
        f"Task ID: {observation.task.task_id}\n"
        f"Difficulty: {observation.task.difficulty}\n"
        f"Objective: {observation.task.objective}\n"
        f"Step: {step}\n"
        f"Attempts remaining: {observation.attempts_remaining}\n"
        f"Current score: {observation.score:.2f}\n"
        f"Latest feedback: {observation.feedback or 'None'}\n"
        f"Attempt history:\n{history_text}\n\n"
        "Code to review:\n"
        "```python\n"
        f"{observation.task.code}\n"
        "```"
    )


def _extract_text_content(message_content: Any) -> str:
    """Normalize OpenAI response content into one text string."""

    if isinstance(message_content, str):
        return message_content
    if isinstance(message_content, list):
        parts: List[str] = []
        for item in message_content:
            if isinstance(item, dict):
                text = item.get("text")
                if isinstance(text, str):
                    parts.append(text)
        return "\n".join(parts)
    return ""


def _extract_json_blob(content: str) -> str:
    """Extract a JSON object from plain or fenced model output."""

    fenced_match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", content, re.DOTALL)
    if fenced_match:
        return fenced_match.group(1)

    start = content.find("{")
    end = content.rfind("}")
    if start != -1 and end != -1 and end > start:
        return content[start : end + 1]
    return content


def _parse_response(content: str) -> Dict[str, Any]:
    """Parse the model response into a normalized payload dict."""

    raw = _extract_json_blob(content)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"findings": [], "patched_code": None, "_parse_error": raw}

    findings = data.get("findings", [])
    if not isinstance(findings, list):
        findings = []
    patched_code = data.get("patched_code")
    if patched_code is not None and not isinstance(patched_code, str):
        patched_code = None
    return {"findings": findings, "patched_code": patched_code}


def _completion(client: OpenAI, prompt: str) -> Dict[str, Any]:
    """Send one completion request to the configured model endpoint."""

    response = client.chat.completions.create(
        model=MODEL_NAME,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    content = _extract_text_content(response.choices[0].message.content) or "{}"
    return _parse_response(content)


def _normalize_findings(payload: Dict[str, Any]) -> List[ReviewFinding]:
    """Convert raw dict findings into validated `ReviewFinding` objects."""

    findings: List[ReviewFinding] = []
    for item in payload.get("findings", []):
        if not isinstance(item, dict):
            continue
        try:
            findings.append(ReviewFinding(**item))
        except Exception:
            continue
    return findings


def _build_fallback_action(observation, note: str) -> PythonReviewAction:
    """Create a safe fallback action when model output is unusable."""

    return PythonReviewAction(
        operation="finalize" if observation.attempts_remaining <= 1 else "request_hint",
        note=note,
    )


def _to_action(
    payload: Dict[str, Any],
    observation,
    finalize: bool,
) -> PythonReviewAction:
    """Convert a parsed model payload into a valid environment action."""

    findings = _normalize_findings(payload)
    if not findings and not payload.get("patched_code"):
        note = "Model returned no valid findings."
        if payload.get("_parse_error"):
            note = f"{note} Raw response could not be parsed as JSON."
        return _build_fallback_action(observation, note)

    return PythonReviewAction(
        operation="finalize" if finalize else "submit_findings",
        findings=findings,
        patched_code=payload.get("patched_code"),
    )


def _make_env() -> PythonEnv:
    """Connect to a live environment or launch the Docker image."""

    if ENV_BASE_URL:
        return PythonEnv(base_url=ENV_BASE_URL)
    return PythonEnv.from_docker_image(DOCKER_IMAGE)


def _task_result_dict(observation, step_logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the report payload for one completed task run."""

    evaluation = observation.evaluation
    return {
        "task_id": observation.task.task_id,
        "difficulty": observation.task.difficulty,
        "title": observation.task.title,
        "score": observation.score,
        "passed": evaluation.passed,
        "matched_findings": evaluation.matched_findings,
        "total_findings": evaluation.total_findings,
        "false_positives": evaluation.false_positives,
        "duplicate_findings": evaluation.duplicate_findings,
        "weighted_recall": evaluation.weighted_recall,
        "patch_score": evaluation.patch_score,
        "steps": step_logs,
    }


def main() -> None:
    """Run the configured model against the benchmark task set."""

    if not API_KEY:
        raise RuntimeError("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")

    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = _make_env()
    episode_results: List[Dict[str, Any]] = []

    try:
        for index in range(MAX_TASKS):
            result = env.reset()
            observation = result.observation
            history: List[str] = []
            step_logs: List[Dict[str, Any]] = []

            print(
                f"Task {index + 1}: {observation.task.task_id} "
                f"({observation.task.difficulty})"
            )

            for step in range(1, MAX_STEPS + 1):
                prompt = _build_prompt(observation, step, history)
                try:
                    # Model-call failures are captured in the report rather than
                    # crashing the full benchmark run.
                    payload = _completion(client, prompt)
                except Exception as exc:
                    payload = {"findings": [], "patched_code": None, "_error": str(exc)}

                action = _to_action(
                    payload=payload,
                    observation=observation,
                    finalize=step == MAX_STEPS or observation.attempts_remaining <= 1,
                )

                result = env.step(action)
                observation = result.observation

                step_log = {
                    "step": step,
                    "operation": action.operation,
                    "submitted_findings": len(action.findings),
                    "reward": result.reward or 0.0,
                    "score": observation.score,
                    "done": result.done,
                    "feedback": observation.feedback,
                }
                if payload.get("_error"):
                    step_log["model_error"] = payload["_error"]
                if payload.get("_parse_error"):
                    step_log["parse_error"] = True
                step_logs.append(step_log)

                # The history string is fed back into later prompts so the
                # model can see what it already tried.
                history.append(
                    f"step={step} op={action.operation} findings={len(action.findings)} "
                    f"score={observation.score:.2f} feedback={observation.feedback}"
                )

                print(
                    f"  step={step} op={action.operation} findings={len(action.findings)} "
                    f"score={observation.score:.2f} reward={(result.reward or 0.0):.2f} "
                    f"done={result.done}"
                )

                if result.done:
                    break

            episode_results.append(_task_result_dict(observation, step_logs))
    finally:
        env.close()

    mean_score = (
        sum(item["score"] for item in episode_results) / len(episode_results)
        if episode_results
        else 0.0
    )
    summary = {
        "model_name": MODEL_NAME,
        "api_base_url": API_BASE_URL,
        "task_count": len(episode_results),
        "mean_score": mean_score,
        "results": episode_results,
    }

    # Persist the report so scores can be compared across runs and models.
    REPORT_PATH.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    print(json.dumps(summary, indent=2))
    print(f"\nSaved report to {REPORT_PATH}")


if __name__ == "__main__":
    main()
models.py ADDED
@@ -0,0 +1,248 @@
"""Typed models for the Python code-review environment.

This module is the shared contract between:

- the OpenEnv server implementation
- the REST API layer
- the benchmark grader
- the inference script
- the tests

Keeping these models centralized makes the environment easier to validate,
serialize, and evolve without each module inventing its own payload shape.
"""

from typing import List, Literal, Optional

from pydantic import BaseModel, Field

from openenv.core.env_server.types import Action, Observation


# Difficulty buckets are intentionally small and fixed so tasks can be
# grouped for curriculum learning and reporting without extra normalization.
Difficulty = Literal["easy", "medium", "hard"]

# Severity is separate from category because one category such as "security"
# can still vary in importance across tasks.
Severity = Literal["critical", "warning", "info"]

# Categories help both humans and agents understand what type of issue was found.
Category = Literal["bug", "security", "style", "performance", "maintainability"]

# Operations define the small action space an agent can use during an episode.
Operation = Literal["submit_findings", "request_hint", "finalize"]


class ReviewFinding(BaseModel):
    """A structured review finding.

    Each finding is designed to be machine-gradable while still resembling the
    sort of issue summary a human reviewer would write in a real code review.
    """

    title: str = Field(..., description="Short title for the finding")
    line: Optional[int] = Field(default=None, description="1-based source line number")
    category: Category = Field(default="bug", description="Issue category")
    severity: Severity = Field(default="warning", description="Issue severity")
    rationale: str = Field(
        default="",
        description="Why the issue matters and how it affects behaviour or safety",
    )
    recommendation: Optional[str] = Field(
        default=None, description="Concrete fix recommendation"
    )
    rule_id: Optional[str] = Field(
        default=None,
        description="Stable internal rule identifier when known",
    )


class TaskDescriptor(BaseModel):
    """Public task metadata shown to the agent.

    This is intentionally the "visible" task information. Hidden grading
    details stay inside the server task bank so the benchmark remains useful.
    """

    task_id: str = Field(..., description="Stable task identifier")
    difficulty: Difficulty = Field(..., description="Task difficulty bucket")
    title: str = Field(..., description="Short task title")
    objective: str = Field(..., description="What the reviewer should accomplish")
    code: str = Field(..., description="Python code to review")
    max_steps: int = Field(..., ge=1, description="Maximum actions allowed")
    success_threshold: float = Field(
        ..., ge=0.0, le=1.0, description="Minimum score considered a pass"
    )


class TaskEvaluation(BaseModel):
    """Deterministic grader output.

    This model is returned in observations and offline grading routes so that
    both online interaction and offline evaluation use exactly the same metrics.
    """

    matched_reference_ids: List[str] = Field(default_factory=list)
    matched_findings: int = Field(default=0, ge=0)
    total_findings: int = Field(default=0, ge=0)
    false_positives: int = Field(default=0, ge=0)
    duplicate_findings: int = Field(default=0, ge=0)
    weighted_recall: float = Field(default=0.0, ge=0.0, le=1.0)
    patch_score: float = Field(default=0.0, ge=0.0, le=1.0)
    score: float = Field(default=0.0, ge=0.0, le=1.0)
    passed: bool = Field(default=False)


class PythonReviewAction(Action):
    """Action submitted by an agent during an episode.

    The action space is kept intentionally small:

    - `submit_findings` for intermediate progress
    - `request_hint` when the agent needs guidance at a small penalty
    - `finalize` when the agent wants the episode to end
    """

    operation: Operation = Field(
        default="submit_findings",
        description="How to interact with the environment on this step",
    )
    findings: List[ReviewFinding] = Field(
        default_factory=list,
        description="Structured findings being submitted for grading",
    )
    patched_code: Optional[str] = Field(
        default=None,
        description="Optional improved version of the code under review",
    )
    note: Optional[str] = Field(
        default=None,
        description="Optional free-form reviewer note for logging or context",
    )


class PythonEnvConfig(BaseModel):
    """Environment-level configuration knobs.

    These values are useful for experimentation because they let you adjust
    reward shaping and curriculum ordering without changing the grader logic.
    """

    task_order: List[str] = Field(
        default_factory=lambda: ["py-review-easy", "py-review-medium", "py-review-hard"],
        description="Deterministic task order used across resets",
    )
    max_steps_per_task: int = Field(default=4, ge=1, le=10)
    hint_penalty: float = Field(default=0.05, ge=0.0, le=1.0)
    false_positive_penalty: float = Field(default=0.08, ge=0.0, le=1.0)
    duplicate_penalty: float = Field(default=0.03, ge=0.0, le=1.0)
    patch_bonus_multiplier: float = Field(default=0.2, ge=0.0, le=1.0)
    max_history_entries: int = Field(default=50, ge=1, le=500)


class PythonReviewObservation(Observation):
    """Observation returned by `reset()` and `step()`.

    The observation combines:

    - visible task context
    - immediate feedback on the previous action
    - cumulative evaluation state
    - OpenEnv-standard reward/done/metadata fields
    """

    task: TaskDescriptor = Field(..., description="Current task details")
    instructions: str = Field(
        default="Inspect the code and submit structured findings.",
        description="Episode instructions shown to the agent",
    )
    feedback: str = Field(default="", description="Feedback for the last action")
    submitted_findings: List[ReviewFinding] = Field(
        default_factory=list,
        description="All findings submitted so far in this episode",
    )
    hints_used: int = Field(default=0, ge=0)
    attempts_remaining: int = Field(default=0, ge=0)
    evaluation: TaskEvaluation = Field(default_factory=TaskEvaluation)
    score: float = Field(
        default=0.0,
        ge=0.0,
        le=1.0,
        description="Current task score after this step",
172
+ )
173
+ review_time_ms: float = Field(default=0.0, ge=0.0)
174
+
175
+
176
+ class EpisodeRecord(BaseModel):
177
+ """Stored summary of a completed or in-progress episode.
178
+
179
+ This model is used by the custom history routes and is intentionally
180
+ compact enough to archive for later analysis or dataset creation.
181
+ """
182
+
183
+ episode_id: str
184
+ task_id: str
185
+ difficulty: Difficulty
186
+ title: str
187
+ final_score: float = Field(ge=0.0, le=1.0)
188
+ passed: bool = Field(default=False)
189
+ steps_taken: int = Field(default=0, ge=0)
190
+ hints_used: int = Field(default=0, ge=0)
191
+ matched_findings: int = Field(default=0, ge=0)
192
+ total_findings: int = Field(default=0, ge=0)
193
+ false_positives: int = Field(default=0, ge=0)
194
+ duplicate_findings: int = Field(default=0, ge=0)
195
+ status: Literal["active", "completed"] = Field(default="completed")
196
+ created_at: str
197
+ updated_at: str
198
+
199
+
200
+ class DirectReviewRequest(BaseModel):
201
+ """Request model for ad-hoc review outside the benchmark tasks."""
202
+
203
+ code: str = Field(..., description="Python source code to inspect")
204
+ context: Optional[str] = Field(
205
+ default=None, description="Optional explanation of the code's purpose"
206
+ )
207
+
208
+
209
+ class DirectReviewResponse(BaseModel):
210
+ """Static review result for arbitrary Python code.
211
+
212
+ This route is useful for manual testing and dataset generation because it
213
+ lets you review arbitrary snippets without entering the benchmark loop.
214
+ """
215
+
216
+ issues: List[ReviewFinding] = Field(default_factory=list)
217
+ summary: str = Field(default="")
218
+ score: float = Field(default=0.0, ge=0.0, le=1.0)
219
+ improved_code: Optional[str] = Field(default=None)
220
+
221
+
222
+ class DeleteResponse(BaseModel):
223
+ """Small acknowledgement payload for DELETE routes."""
224
+
225
+ detail: str
226
+
227
+
228
+ class HealthResponse(BaseModel):
229
+ """Health payload used by Docker and Spaces checks.
230
+
231
+ This payload stays intentionally simple because health checks are often
232
+ consumed by infrastructure rather than by human users.
233
+ """
234
+
235
+ status: Literal["ok"] = "ok"
236
+ environment: str = "python_env"
237
+ task_count: int = Field(default=0, ge=0)
238
+ active_task_id: Optional[str] = None
239
+ active_episode_id: Optional[str] = None
240
+
241
+
242
+ # Backward-compatible aliases keep older imports working while the project
243
+ # standardizes on the `Python*` naming convention.
244
+ PythonAction = PythonReviewAction
245
+ PythonObservation = PythonReviewObservation
246
+ CodeReviewAction = PythonReviewAction
247
+ CodeReviewObservation = PythonReviewObservation
248
+ CodeReviewConfig = PythonEnvConfig
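The action model above maps directly onto the JSON an agent submits on each step. A minimal sketch of such a payload follows; the field names (`operation`, `findings`, `patched_code`, `note`) come from `PythonReviewAction` above, while the exact HTTP envelope used by `EnvClient` is an assumption for illustration.

```python
import json

# Hypothetical step payload; field names mirror PythonReviewAction,
# and the finding keys mirror those the grader matches on (rule_id,
# title, line, category). The surrounding transport is an assumption.
action = {
    "operation": "submit_findings",  # or "request_hint" / "finalize"
    "findings": [
        {
            "rule_id": "mutable-default",
            "title": "Mutable default list is shared across calls",
            "line": 1,
            "category": "bug",
        }
    ],
    "patched_code": None,
    "note": "First pass over the snippet.",
}
payload = json.dumps(action)
```

Sending the same finding twice would count as a duplicate under the grader, so agents should accumulate findings client-side rather than resubmitting the full list each step.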
openenv.yaml ADDED
@@ -0,0 +1,7 @@
+ spec_version: 1
+ name: python_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
openenv_python_env.egg-info/PKG-INFO ADDED
@@ -0,0 +1,10 @@
+ Metadata-Version: 2.4
+ Name: openenv-python_env
+ Version: 0.1.0
+ Summary: Python Env environment for OpenEnv
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: pydantic>=2.12.5
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_python_env.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,19 @@
+ README.md
+ __init__.py
+ client.py
+ inference.py
+ models.py
+ pyproject.toml
+ ./__init__.py
+ ./client.py
+ ./inference.py
+ ./models.py
+ openenv_python_env.egg-info/PKG-INFO
+ openenv_python_env.egg-info/SOURCES.txt
+ openenv_python_env.egg-info/dependency_links.txt
+ openenv_python_env.egg-info/entry_points.txt
+ openenv_python_env.egg-info/requires.txt
+ openenv_python_env.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/python_env_environment.py
openenv_python_env.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_python_env.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = python_env.server.app:main
openenv_python_env.egg-info/requires.txt ADDED
@@ -0,0 +1,6 @@
+ openenv-core[core]>=0.2.2
+ pydantic>=2.12.5
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_python_env.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ python_env
pyproject.toml ADDED
@@ -0,0 +1,46 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-python_env"
+ version = "0.1.0"
+ description = "Python Env environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+     # Install from GitHub:
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.2",
+     # Environment-specific dependencies
+     # Add all dependencies needed for your environment here
+     # Examples:
+     # "numpy>=1.19.0",
+     # "torch>=2.0.0",
+     # "gymnasium>=0.29.0",
+     # "openspiel>=1.0.0",
+     # "smolagents>=1.22.0,<2",
+     "pydantic>=2.12.5",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m python_env.server.app
+ server = "python_env.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["python_env", "python_env.server"]
+ package-dir = { "python_env" = ".", "python_env.server" = "server" }
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Python Env environment server components."""
+
+ from .python_env_environment import PythonEnvironment
+
+ __all__ = ["PythonEnvironment"]
server/app.py ADDED
@@ -0,0 +1,84 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the Python Env Environment.
+
+ This module creates an HTTP server that exposes the PythonEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+     - POST /reset: Reset the environment
+     - POST /step: Execute an action
+     - GET /state: Get current environment state
+     - GET /schema: Get action/observation schemas
+     - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ try:
+     from ..models import PythonAction, PythonObservation
+     from .python_env_environment import PythonEnvironment
+ except ImportError:
+     from models import PythonAction, PythonObservation
+     from server.python_env_environment import PythonEnvironment
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     PythonEnvironment,
+     PythonAction,
+     PythonObservation,
+     env_name="python_env",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m python_env.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn python_env.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     args = parser.parse_args()
+     main(port=args.port)
server/python_env_environment.py ADDED
@@ -0,0 +1,421 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Python code-review environment implementation."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from datetime import UTC, datetime
+ from typing import Dict, Iterable, List, Optional
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import State
+
+ try:
+     from ..models import (
+         PythonAction,
+         PythonEnvConfig,
+         PythonObservation,
+         ReviewFinding,
+         TaskDescriptor,
+         TaskEvaluation,
+     )
+ except ImportError:
+     from models import (  # type: ignore
+         PythonAction,
+         PythonEnvConfig,
+         PythonObservation,
+         ReviewFinding,
+         TaskDescriptor,
+         TaskEvaluation,
+     )
+
+
+ @dataclass(frozen=True)
+ class ReferenceFinding:
+     """Hidden finding metadata used for deterministic grading."""
+
+     rule_id: str
+     title: str
+     line: int
+     category: str
+     severity: str
+     rationale: str
+     recommendation: str
+     weight: float
+
+
+ @dataclass(frozen=True)
+ class ReviewTask:
+     """A visible task plus its hidden grading references."""
+
+     descriptor: TaskDescriptor
+     references: tuple[ReferenceFinding, ...]
+     hint: str
+     patched_code: Optional[str] = None
+
+
+ TASK_BANK: Dict[str, ReviewTask] = {
+     "py-review-easy": ReviewTask(
+         descriptor=TaskDescriptor(
+             task_id="py-review-easy",
+             difficulty="easy",
+             title="Mutable default argument",
+             objective="Find the correctness issue and explain a safe fix.",
+             code=(
+                 "def add_tag(tag, tags=[]):\n"
+                 "    tags.append(tag)\n"
+                 "    return tags\n"
+             ),
+             max_steps=4,
+             success_threshold=0.7,
+         ),
+         references=(
+             ReferenceFinding(
+                 rule_id="mutable-default",
+                 title="Mutable default list is shared across calls",
+                 line=1,
+                 category="bug",
+                 severity="warning",
+                 rationale="The list persists between calls and leaks state.",
+                 recommendation="Use None as the default and create a new list inside the function.",
+                 weight=1.0,
+             ),
+         ),
+         hint="Look for state that survives between separate function calls.",
+         patched_code=(
+             "def add_tag(tag, tags=None):\n"
+             "    if tags is None:\n"
+             "        tags = []\n"
+             "    tags.append(tag)\n"
+             "    return tags\n"
+         ),
+     ),
+     "py-review-medium": ReviewTask(
+         descriptor=TaskDescriptor(
+             task_id="py-review-medium",
+             difficulty="medium",
+             title="Unsafe shell invocation",
+             objective="Review the snippet for security-sensitive behavior.",
+             code=(
+                 "import os\n\n"
+                 "def run_backup(path):\n"
+                 "    os.system(f\"tar -czf backup.tgz {path}\")\n"
+             ),
+             max_steps=4,
+             success_threshold=0.72,
+         ),
+         references=(
+             ReferenceFinding(
+                 rule_id="shell-injection",
+                 title="User input is interpolated into a shell command",
+                 line=4,
+                 category="security",
+                 severity="critical",
+                 rationale="An attacker can inject shell metacharacters through the path argument.",
+                 recommendation="Use subprocess with an argument list instead of os.system.",
+                 weight=1.0,
+             ),
+         ),
+         hint="Check how external commands are invoked and whether user input is escaped.",
+         patched_code=(
+             "import subprocess\n\n"
+             "def run_backup(path):\n"
+             "    subprocess.run([\"tar\", \"-czf\", \"backup.tgz\", path], check=True)\n"
+         ),
+     ),
+     "py-review-hard": ReviewTask(
+         descriptor=TaskDescriptor(
+             task_id="py-review-hard",
+             difficulty="hard",
+             title="Retry helper hides failures",
+             objective="Identify correctness and maintainability issues in the retry logic.",
+             code=(
+                 "import time\n\n"
+                 "def fetch_with_retry(client, url, retries=3):\n"
+                 "    last_error = None\n"
+                 "    for _ in range(retries):\n"
+                 "        try:\n"
+                 "            return client.get(url, timeout=1)\n"
+                 "        except Exception as exc:\n"
+                 "            last_error = exc\n"
+                 "            time.sleep(0.1)\n"
+                 "    return None\n"
+             ),
+             max_steps=4,
+             success_threshold=0.74,
+         ),
+         references=(
+             ReferenceFinding(
+                 rule_id="swallowed-error",
+                 title="Function swallows the final exception and returns None",
+                 line=10,
+                 category="bug",
+                 severity="warning",
+                 rationale="Callers cannot distinguish a failed request from a valid None result.",
+                 recommendation="Re-raise the last exception after retries are exhausted.",
+                 weight=0.65,
+             ),
+             ReferenceFinding(
+                 rule_id="broad-except",
+                 title="Broad exception handler catches unexpected failures",
+                 line=7,
+                 category="maintainability",
+                 severity="info",
+                 rationale="Catching Exception masks programming errors and interrupts.",
+                 recommendation="Catch only the client or network exceptions you expect to retry.",
+                 weight=0.35,
+             ),
+         ),
+         hint="Consider what happens to the final error after the retry loop finishes.",
+         patched_code=(
+             "import time\n\n"
+             "def fetch_with_retry(client, url, retries=3):\n"
+             "    last_error = None\n"
+             "    for _ in range(retries):\n"
+             "        try:\n"
+             "            return client.get(url, timeout=1)\n"
+             "        except client.retryable_exceptions as exc:\n"
+             "            last_error = exc\n"
+             "            time.sleep(0.1)\n"
+             "    if last_error is not None:\n"
+             "        raise last_error\n"
+         ),
+     ),
+ }
+
+
+ def _utc_now() -> str:
+     return datetime.now(UTC).isoformat()
+
+
+ def _normalize_text(value: Optional[str]) -> str:
+     return " ".join((value or "").strip().lower().split())
+
+
+ def _normalize_code(value: Optional[str]) -> str:
+     return "\n".join(line.rstrip() for line in (value or "").strip().splitlines())
+
+
+ class PythonEnvironment(Environment[PythonAction, PythonObservation, State]):
+     """Deterministic benchmark environment for Python code review tasks."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     def __init__(self, config: Optional[PythonEnvConfig] = None):
+         super().__init__()
+         self._config = config or PythonEnvConfig()
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._task_cursor = -1
+         self._current_task: Optional[ReviewTask] = None
+         self._submitted_findings: List[ReviewFinding] = []
+         self._hints_used = 0
+         self._created_at = _utc_now()
+
+     def reset(
+         self,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         **kwargs,
+     ) -> PythonObservation:
+         """Start the next configured review task."""
+
+         del seed, kwargs
+         self._task_cursor = (self._task_cursor + 1) % len(self._config.task_order)
+         task_id = self._config.task_order[self._task_cursor]
+         self._current_task = TASK_BANK.get(task_id, TASK_BANK["py-review-easy"])
+         self._state = State(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+         )
+         self._submitted_findings = []
+         self._hints_used = 0
+         self._created_at = _utc_now()
+         return self._build_observation(
+             feedback="New review task loaded. Submit findings or request a hint.",
+             reward=0.0,
+             done=False,
+         )
+
+     def step(
+         self,
+         action: PythonAction,
+         timeout_s: Optional[float] = None,
+         **kwargs,
+     ) -> PythonObservation:
+         """Process one review action and return updated feedback."""
+
+         del timeout_s, kwargs
+         if self._current_task is None:
+             return self.reset()
+
+         self._state.step_count += 1
+         operation = action.operation
+         feedback = ""
+         reward = 0.0
+         done = False
+
+         if operation == "request_hint":
+             self._hints_used += 1
+             feedback = self._current_task.hint
+             evaluation = self._evaluate(self._submitted_findings, action.patched_code)
+             reward = evaluation.score
+         else:
+             if action.findings:
+                 self._submitted_findings.extend(action.findings)
+             evaluation = self._evaluate(self._submitted_findings, action.patched_code)
+             reward = evaluation.score
+             if operation == "finalize":
+                 done = True
+                 feedback = (
+                     "Review finalized. "
+                     f"Matched {evaluation.matched_findings}/{evaluation.total_findings} "
+                     "reference findings."
+                 )
+             else:
+                 feedback = (
+                     f"Progress saved. Matched {evaluation.matched_findings}/"
+                     f"{evaluation.total_findings} findings with score {evaluation.score:.2f}."
+                 )
+
+         if self._state.step_count >= self._max_steps():
+             done = True
+             if operation != "finalize":
+                 feedback = (
+                     f"{feedback} Maximum steps reached."
+                     if feedback
+                     else "Maximum steps reached."
+                 )
+
+         return self._build_observation(
+             feedback=feedback,
+             reward=reward,
+             done=done,
+             patched_code=action.patched_code,
+         )
+
+     def _build_observation(
+         self,
+         *,
+         feedback: str,
+         reward: float,
+         done: bool,
+         patched_code: Optional[str] = None,
+     ) -> PythonObservation:
+         assert self._current_task is not None
+         evaluation = self._evaluate(self._submitted_findings, patched_code)
+         attempts_remaining = max(
+             self._max_steps() - self._state.step_count,
+             0,
+         )
+         return PythonObservation(
+             task=self._current_task.descriptor,
+             feedback=feedback,
+             submitted_findings=list(self._submitted_findings),
+             hints_used=self._hints_used,
+             attempts_remaining=attempts_remaining,
+             evaluation=evaluation,
+             score=evaluation.score,
+             review_time_ms=float(self._state.step_count * 125),
+             done=done,
+             reward=reward,
+             metadata={
+                 "episode_id": self._state.episode_id,
+                 "created_at": self._created_at,
+                 "updated_at": _utc_now(),
+             },
+         )
+
+     def _evaluate(
+         self,
+         findings: Iterable[ReviewFinding],
+         patched_code: Optional[str],
+     ) -> TaskEvaluation:
+         assert self._current_task is not None
+
+         references = self._current_task.references
+         matched_reference_ids: List[str] = []
+         matched_weight = 0.0
+         false_positives = 0
+         duplicate_findings = 0
+
+         seen_ids = set()
+         for finding in findings:
+             ref_id = self._match_reference(finding, references)
+             if ref_id is None:
+                 false_positives += 1
+                 continue
+             if ref_id in seen_ids:
+                 duplicate_findings += 1
+                 continue
+             seen_ids.add(ref_id)
+             matched_reference_ids.append(ref_id)
+             matched_weight += next(ref.weight for ref in references if ref.rule_id == ref_id)
+
+         total_weight = sum(ref.weight for ref in references) or 1.0
+         weighted_recall = min(matched_weight / total_weight, 1.0)
+
+         patch_score = 0.0
+         if self._current_task.patched_code and patched_code:
+             patch_score = float(
+                 _normalize_code(patched_code) == _normalize_code(self._current_task.patched_code)
+             )
+
+         raw_score = (
+             weighted_recall
+             + (self._config.patch_bonus_multiplier * patch_score)
+             - (self._config.false_positive_penalty * false_positives)
+             - (self._config.duplicate_penalty * duplicate_findings)
+             - (self._config.hint_penalty * self._hints_used)
+         )
+         score = max(0.0, min(raw_score, 1.0))
+
+         return TaskEvaluation(
+             matched_reference_ids=matched_reference_ids,
+             matched_findings=len(matched_reference_ids),
+             total_findings=len(references),
+             false_positives=false_positives,
+             duplicate_findings=duplicate_findings,
+             weighted_recall=weighted_recall,
+             patch_score=patch_score,
+             score=score,
+             passed=score >= self._current_task.descriptor.success_threshold,
+         )
+
+     def _match_reference(
+         self,
+         finding: ReviewFinding,
+         references: Iterable[ReferenceFinding],
+     ) -> Optional[str]:
+         finding_rule = _normalize_text(finding.rule_id)
+         finding_title = _normalize_text(finding.title)
+         for reference in references:
+             if finding_rule and finding_rule == _normalize_text(reference.rule_id):
+                 return reference.rule_id
+             line_matches = finding.line is not None and finding.line == reference.line
+             category_matches = finding.category == reference.category
+             title_matches = finding_title and (
+                 finding_title in _normalize_text(reference.title)
+                 or _normalize_text(reference.title) in finding_title
+             )
+             if line_matches and (category_matches or title_matches):
+                 return reference.rule_id
+         return None
+
+     def _max_steps(self) -> int:
+         assert self._current_task is not None
+         return min(
+             self._current_task.descriptor.max_steps,
+             self._config.max_steps_per_task,
+         )
+
+     @property
+     def state(self) -> State:
+         """Return the current environment state."""
+
+         return self._state
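The reward shaping in `_evaluate` reduces to a single clamped formula: weighted recall plus a patch bonus, minus per-count penalties. A standalone sketch of that arithmetic, using the `PythonEnvConfig` defaults from `models.py` as the parameter defaults:

```python
# Standalone sketch of the grader's score arithmetic (mirrors _evaluate).
# The keyword defaults are the PythonEnvConfig defaults: patch bonus 0.2,
# false-positive penalty 0.08, duplicate penalty 0.03, hint penalty 0.05.
def clamp_score(
    weighted_recall: float,
    patch_score: float = 0.0,
    false_positives: int = 0,
    duplicates: int = 0,
    hints: int = 0,
    patch_bonus: float = 0.2,
    fp_penalty: float = 0.08,
    dup_penalty: float = 0.03,
    hint_penalty: float = 0.05,
) -> float:
    raw = (
        weighted_recall
        + patch_bonus * patch_score
        - fp_penalty * false_positives
        - dup_penalty * duplicates
        - hint_penalty * hints
    )
    # The environment clamps the final score into [0, 1].
    return max(0.0, min(raw, 1.0))
```

A perfect review with a correct patch saturates at 1.0 rather than exceeding it, while on the hard task a reviewer who matches only the 0.65-weight finding, takes one hint, and adds one false positive lands at 0.65 - 0.08 - 0.05 = 0.52, below the 0.74 pass threshold.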
server/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ openenv[core]>=0.2.0
+ fastapi>=0.115.0
+ uvicorn>=0.24.0
+
+
+
uv.lock ADDED
The diff for this file is too large to render. See raw diff