uvpatel7271 committed on
Commit 1c8b7f1 · verified · 1 Parent(s): 1ac5fc9

Upload folder using huggingface_hub

Dockerfile ADDED
@@ -0,0 +1,81 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

# Multi-stage build using openenv-base
# This Dockerfile is flexible and works for both:
# - In-repo environments (with local OpenEnv sources)
# - Standalone environments (with openenv from PyPI/Git)
# The build script (openenv build) handles context detection and sets appropriate build args.

ARG BASE_IMAGE=ghcr.io/meta-pytorch/openenv-base:latest
FROM ${BASE_IMAGE} AS builder

WORKDIR /app

# Ensure git is available (required for installing dependencies from VCS)
RUN apt-get update && \
    apt-get install -y --no-install-recommends git && \
    rm -rf /var/lib/apt/lists/*

# Build argument to control whether we're building standalone or in-repo
ARG BUILD_MODE=in-repo
ARG ENV_NAME=python_env

# Copy environment code (always at root of build context)
COPY . /app/env

# For in-repo builds, openenv is already vendored in the build context
# For standalone builds, openenv will be installed via pyproject.toml
WORKDIR /app/env

# Ensure uv is available (for local builds where the base image lacks it)
RUN if ! command -v uv >/dev/null 2>&1; then \
        curl -LsSf https://astral.sh/uv/install.sh | sh && \
        mv /root/.local/bin/uv /usr/local/bin/uv && \
        mv /root/.local/bin/uvx /usr/local/bin/uvx; \
    fi

# Install dependencies using uv sync
# If uv.lock exists, use it; otherwise resolve on the fly
RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-install-project --no-editable; \
    else \
        uv sync --no-install-project --no-editable; \
    fi

RUN --mount=type=cache,target=/root/.cache/uv \
    if [ -f uv.lock ]; then \
        uv sync --frozen --no-editable; \
    else \
        uv sync --no-editable; \
    fi

# Final runtime stage
FROM ${BASE_IMAGE}

WORKDIR /app

# Copy the virtual environment from the builder
COPY --from=builder /app/env/.venv /app/.venv

# Copy the environment code
COPY --from=builder /app/env /app/env

# Set PATH to use the virtual environment
ENV PATH="/app/.venv/bin:$PATH"

# Set PYTHONPATH so imports work correctly
ENV PYTHONPATH="/app/env:$PYTHONPATH"

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1

# Run the FastAPI server
# The module path is constructed to work with the /app/env structure
ENV ENABLE_WEB_INTERFACE=true
CMD ["sh", "-c", "cd /app/env && uvicorn server.app:app --host 0.0.0.0 --port 8000"]
README.md CHANGED
@@ -1,10 +1,266 @@
---
title: Python Env Environment Server
emoji: 🎶
colorFrom: purple
colorTo: red
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
---

# Python Env Environment

A simple test environment that echoes back messages. It is useful for testing the env APIs as well as for demonstrating environment usage patterns.

## Quick Start

The simplest way to use the Python Env environment is through the `PythonEnv` class:

```python
from python_env import PythonAction, PythonEnv

try:
    # Create environment from Docker image
    env = PythonEnv.from_docker_image("python_env-env:latest")

    # Reset
    result = env.reset()
    print(f"Reset: {result.observation.echoed_message}")

    # Send multiple messages
    messages = ["Hello, World!", "Testing echo", "Final message"]

    for msg in messages:
        result = env.step(PythonAction(message=msg))
        print(f"Sent: '{msg}'")
        print(f"  → Echoed: '{result.observation.echoed_message}'")
        print(f"  → Length: {result.observation.message_length}")
        print(f"  → Reward: {result.reward}")

finally:
    # Always clean up
    env.close()
```

That's it! The `PythonEnv.from_docker_image()` method handles:
- Starting the Docker container
- Waiting for the server to be ready
- Connecting to the environment
- Container cleanup when you call `close()`

## Building the Docker Image

Before using the environment, you need to build the Docker image:

```bash
# From the project root
docker build -t python_env-env:latest -f server/Dockerfile .
```

## Deploying to Hugging Face Spaces

You can deploy your OpenEnv environment to Hugging Face Spaces with the `openenv push` command:

```bash
# From the environment directory (where openenv.yaml is located)
openenv push

# Or specify options
openenv push --namespace my-org --private
```

The `openenv push` command will:
1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
2. Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
3. Upload to Hugging Face (ensuring you're logged in)

### Prerequisites

- Authenticate with Hugging Face: the command will prompt for login if you are not already authenticated

### Options

- `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to the current directory)
- `--repo-id`, `-r`: Repository ID in the format `username/repo-name` (defaults to `username/env-name` from openenv.yaml)
- `--base-image`, `-b`: Base Docker image to use (overrides the Dockerfile `FROM`)
- `--private`: Deploy the Space as private (default: public)

### Examples

```bash
# Push to your personal namespace (defaults to username/env-name from openenv.yaml)
openenv push

# Push to a specific repository
openenv push --repo-id my-org/my-env

# Push with a custom base image
openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest

# Push as a private space
openenv push --private

# Combine options
openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
```

After deployment, your Space will be available at:
`https://huggingface.co/spaces/<repo-id>`

The deployed Space includes:
- **Web Interface** at `/web` - Interactive UI for exploring the environment
- **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
- **Health Check** at `/health` - Container health monitoring
- **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions

## Environment Details

### Action
**PythonAction**: Contains a single field
- `message` (str) - The message to echo back

### Observation
**PythonObservation**: Contains the echo response and metadata
- `echoed_message` (str) - The message echoed back
- `message_length` (int) - Length of the message
- `reward` (float) - Reward based on message length (length × 0.1)
- `done` (bool) - Always `False` for the echo environment
- `metadata` (dict) - Additional info such as the step count

### Reward
The reward is calculated as `message_length × 0.1`:
- "Hi" → reward: 0.2
- "Hello, World!" → reward: 1.3
- Empty message → reward: 0.0
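
The reward rule above can be sketched as a one-line helper (the function name is ours, for illustration only — it is not part of the environment API):

```python
def length_reward(message: str) -> float:
    """Reward an echoed message in proportion to its length (length × 0.1)."""
    return len(message) * 0.1

for msg in ["Hi", "Hello, World!", ""]:
    print(f"{msg!r} -> reward: {length_reward(msg):.1f}")
```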

## Advanced Usage

### Connecting to an Existing Server

If you already have a Python Env environment server running, you can connect to it directly:

```python
from python_env import PythonAction, PythonEnv

# Connect to an existing server
env = PythonEnv(base_url="<ENV_HTTP_URL_HERE>")

# Use as normal
result = env.reset()
result = env.step(PythonAction(message="Hello!"))
```

Note: When connecting to an existing server, `env.close()` will NOT stop the server.

### Using the Context Manager

The client supports context-manager usage for automatic connection management:

```python
from python_env import PythonAction, PythonEnv

# Connect with a context manager (auto-connects and closes)
with PythonEnv(base_url="http://localhost:8000") as env:
    result = env.reset()
    print(f"Reset: {result.observation.echoed_message}")

    # Multiple steps with low latency
    for msg in ["Hello", "World", "!"]:
        result = env.step(PythonAction(message=msg))
        print(f"Echoed: {result.observation.echoed_message}")
```

The client uses WebSocket connections for:
- **Lower latency**: No per-request HTTP connection overhead
- **Persistent session**: The server maintains your environment state
- **Efficient episodes**: Better for many sequential steps

### Concurrent WebSocket Sessions

The server supports multiple concurrent WebSocket connections. To enable this,
modify `server/app.py` to use factory mode:

```python
# In server/app.py - use factory mode for concurrent sessions
app = create_app(
    PythonEnvironment,  # Pass the class, not an instance
    PythonAction,
    PythonObservation,
    max_concurrent_envs=4,  # Allow 4 concurrent sessions
)
```

Then multiple clients can connect simultaneously:

```python
from concurrent.futures import ThreadPoolExecutor

from python_env import PythonAction, PythonEnv

def run_episode(client_id: int):
    with PythonEnv(base_url="http://localhost:8000") as env:
        result = env.reset()
        for i in range(10):
            result = env.step(PythonAction(message=f"Client {client_id}, step {i}"))
        return client_id, result.observation.message_length

# Run 4 episodes concurrently
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_episode, range(4)))
```

## Development & Testing

### Direct Environment Testing

Test the environment logic directly, without starting the HTTP server:

```bash
# From the environment directory
python3 server/python_env_environment.py
```

This verifies that:
- The environment resets correctly
- `step` executes actions properly
- State tracking works
- Rewards are calculated correctly

### Running Locally

Run the server locally during development:

```bash
uvicorn server.app:app --reload
```

## Project Structure

```
python_env/
├── .dockerignore                  # Docker build exclusions
├── __init__.py                    # Module exports
├── README.md                      # This file
├── openenv.yaml                   # OpenEnv manifest
├── pyproject.toml                 # Project metadata and dependencies
├── uv.lock                        # Locked dependencies (generated)
├── client.py                      # PythonEnv client
├── models.py                      # Action and Observation models
└── server/
    ├── __init__.py                # Server module exports
    ├── python_env_environment.py  # Core environment logic
    ├── app.py                     # FastAPI application (HTTP + WebSocket endpoints)
    └── Dockerfile                 # Container image definition
```

---

## Integration Notes

```bash
cd F:\python_env
# Edit your environment implementation in server/python_env_environment.py
# Edit your models in models.py
# Install dependencies: uv sync

# To integrate into the OpenEnv repo:
# 1. Copy this directory to <repo_root>/envs/python_env_env
# 2. Build from the repo root: docker build -t python_env_env:latest -f envs/python_env_env/server/Dockerfile .
# 3. Run your image: docker run -p 8000:8000 python_env_env:latest
```
__init__.py ADDED
@@ -0,0 +1,16 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Python Env Environment."""

from .client import PythonEnv
from .models import PythonAction, PythonObservation

__all__ = [
    "PythonAction",
    "PythonObservation",
    "PythonEnv",
]
client.py ADDED
@@ -0,0 +1,46 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.

"""Python Env Environment Client."""

from typing import Any, Dict

from openenv.core import EnvClient
from openenv.core.client_types import StepResult
from openenv.core.env_server.types import State

try:
    from .models import PythonAction, PythonObservation
except ImportError:
    from models import PythonAction, PythonObservation  # type: ignore


class PythonEnv(EnvClient[PythonAction, PythonObservation, State]):
    """Typed client for the Python code-review environment."""

    def _step_payload(self, action: PythonAction) -> Dict[str, Any]:
        """Convert a validated action model to the JSON payload expected by the server."""

        return action.model_dump(exclude_none=True)

    def _parse_result(self, payload: Dict[str, Any]) -> StepResult[PythonObservation]:
        """Parse a server response into a typed step result."""

        obs_data = dict(payload.get("observation", {}))
        obs_data.setdefault("done", payload.get("done", False))
        obs_data.setdefault("reward", payload.get("reward"))
        observation = PythonObservation.model_validate(obs_data)

        return StepResult(
            observation=observation,
            reward=payload.get("reward"),
            done=payload.get("done", False),
        )

    def _parse_state(self, payload: Dict[str, Any]) -> State:
        """Parse the server state payload into the shared state model."""

        return State.model_validate(payload)
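
The normalization step in `_parse_result` folds the top-level `done`/`reward` fields into the observation dict before model validation. A minimal standalone sketch of that behavior, using a hypothetical server payload (field values are illustrative, not from the source):

```python
# Hypothetical server response payload.
payload = {
    "observation": {"echoed_message": "Hi", "message_length": 2},
    "reward": 0.2,
    "done": False,
}

# Mirror _parse_result: copy top-level reward/done into the observation dict
# (without overwriting values the server already put there) so the observation
# model can validate them alongside its own fields.
obs_data = dict(payload.get("observation", {}))
obs_data.setdefault("done", payload.get("done", False))
obs_data.setdefault("reward", payload.get("reward"))

print(obs_data)
```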
inference.py ADDED
@@ -0,0 +1,314 @@
"""Baseline inference script for the Python code-review environment.

This script is meant to be submission-friendly:

- configuration comes from environment variables
- model calls use the OpenAI client as required
- malformed model output is handled gracefully
- a JSON report is written for reproducibility
"""

from __future__ import annotations

import json
import os
import re
from pathlib import Path
from typing import Any, Dict, List, Optional

from openai import OpenAI

from client import PythonEnv
from models import PythonReviewAction, ReviewFinding


# Read all runtime configuration from environment variables so the script can
# be reused unchanged across local runs, CI, and HF Spaces validation.
API_BASE_URL = os.environ["API_BASE_URL"]
MODEL_NAME = os.environ["MODEL_NAME"]
API_KEY = os.getenv("HF_TOKEN") or os.getenv("OPENAI_API_KEY")
ENV_BASE_URL = os.getenv("ENV_BASE_URL")
DOCKER_IMAGE = os.getenv("PYTHON_ENV_IMAGE", "python_env-env:latest")
MAX_STEPS = int(os.getenv("MAX_STEPS", "3"))
MAX_TASKS = int(os.getenv("MAX_TASKS", "3"))
REPORT_PATH = Path(os.getenv("INFERENCE_REPORT_PATH", "inference_results.json"))
TEMPERATURE = float(os.getenv("TEMPERATURE", "0"))
MAX_TOKENS = int(os.getenv("MAX_TOKENS", "900"))

SYSTEM_PROMPT = """You are a precise Python code reviewer.
Return strict JSON using this schema:
{
  "findings": [
    {
      "title": "short title",
      "line": 1,
      "category": "bug|security|style|performance|maintainability",
      "severity": "critical|warning|info",
      "rationale": "why it matters",
      "recommendation": "how to fix it",
      "rule_id": "optional-stable-id"
    }
  ],
  "patched_code": null
}

Rules:
- Output JSON only. No markdown fences.
- Only report issues supported by the visible code.
- Prefer high precision over quantity.
- Include line numbers when possible.
"""


def _build_prompt(observation, step: int, history: List[str]) -> str:
    """Build the task prompt sent to the model for one step."""

    history_text = "\n".join(history[-4:]) if history else "No previous attempts."
    return (
        f"Task ID: {observation.task.task_id}\n"
        f"Difficulty: {observation.task.difficulty}\n"
        f"Objective: {observation.task.objective}\n"
        f"Step: {step}\n"
        f"Attempts remaining: {observation.attempts_remaining}\n"
        f"Current score: {observation.score:.2f}\n"
        f"Latest feedback: {observation.feedback or 'None'}\n"
        f"Attempt history:\n{history_text}\n\n"
        "Code to review:\n"
        "```python\n"
        f"{observation.task.code}\n"
        "```"
    )


def _extract_text_content(message_content: Any) -> str:
    """Normalize OpenAI response content into one text string."""

    if isinstance(message_content, str):
        return message_content
    if isinstance(message_content, list):
        parts: List[str] = []
        for item in message_content:
            if isinstance(item, dict):
                text = item.get("text")
                if isinstance(text, str):
                    parts.append(text)
        return "\n".join(parts)
    return ""


def _extract_json_blob(content: str) -> str:
    """Extract a JSON object from plain or fenced model output."""

    fenced_match = re.search(r"```(?:json)?\s*(\{.*\})\s*```", content, re.DOTALL)
    if fenced_match:
        return fenced_match.group(1)

    start = content.find("{")
    end = content.rfind("}")
    if start != -1 and end != -1 and end > start:
        return content[start : end + 1]
    return content


def _parse_response(content: str) -> Dict[str, Any]:
    """Parse the model response into a normalized payload dict."""

    raw = _extract_json_blob(content)
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"findings": [], "patched_code": None, "_parse_error": raw}

    findings = data.get("findings", [])
    if not isinstance(findings, list):
        findings = []
    patched_code = data.get("patched_code")
    if patched_code is not None and not isinstance(patched_code, str):
        patched_code = None
    return {"findings": findings, "patched_code": patched_code}


def _completion(client: OpenAI, prompt: str) -> Dict[str, Any]:
    """Send one completion request to the configured model endpoint."""

    response = client.chat.completions.create(
        model=MODEL_NAME,
        temperature=TEMPERATURE,
        max_tokens=MAX_TOKENS,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": prompt},
        ],
    )
    content = _extract_text_content(response.choices[0].message.content) or "{}"
    return _parse_response(content)


def _normalize_findings(payload: Dict[str, Any]) -> List[ReviewFinding]:
    """Convert raw dict findings into validated `ReviewFinding` objects."""

    findings: List[ReviewFinding] = []
    for item in payload.get("findings", []):
        if not isinstance(item, dict):
            continue
        try:
            findings.append(ReviewFinding(**item))
        except Exception:
            continue
    return findings


def _build_fallback_action(observation, note: str) -> PythonReviewAction:
    """Create a safe fallback action when model output is unusable."""

    return PythonReviewAction(
        operation="finalize" if observation.attempts_remaining <= 1 else "request_hint",
        note=note,
    )


def _to_action(
    payload: Dict[str, Any],
    observation,
    finalize: bool,
) -> PythonReviewAction:
    """Convert a parsed model payload into a valid environment action."""

    findings = _normalize_findings(payload)
    if not findings and not payload.get("patched_code"):
        note = "Model returned no valid findings."
        if payload.get("_parse_error"):
            note = f"{note} Raw response could not be parsed as JSON."
        return _build_fallback_action(observation, note)

    return PythonReviewAction(
        operation="finalize" if finalize else "submit_findings",
        findings=findings,
        patched_code=payload.get("patched_code"),
    )


def _make_env() -> PythonEnv:
    """Connect to a live environment or launch the Docker image."""

    if ENV_BASE_URL:
        return PythonEnv(base_url=ENV_BASE_URL)
    return PythonEnv.from_docker_image(DOCKER_IMAGE)


def _task_result_dict(observation, step_logs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build the report payload for one completed task run."""

    evaluation = observation.evaluation
    return {
        "task_id": observation.task.task_id,
        "difficulty": observation.task.difficulty,
        "title": observation.task.title,
        "score": observation.score,
        "passed": evaluation.passed,
        "matched_findings": evaluation.matched_findings,
        "total_findings": evaluation.total_findings,
        "false_positives": evaluation.false_positives,
        "duplicate_findings": evaluation.duplicate_findings,
        "weighted_recall": evaluation.weighted_recall,
        "patch_score": evaluation.patch_score,
        "steps": step_logs,
    }


def main() -> None:
    """Run the configured model against the benchmark task set."""

    if not API_KEY:
        raise RuntimeError("Set HF_TOKEN or OPENAI_API_KEY before running inference.py")

    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
    env = _make_env()
    episode_results: List[Dict[str, Any]] = []

    try:
        for index in range(MAX_TASKS):
            result = env.reset()
            observation = result.observation
            history: List[str] = []
            step_logs: List[Dict[str, Any]] = []

            print(
                f"Task {index + 1}: {observation.task.task_id} "
                f"({observation.task.difficulty})"
            )

            for step in range(1, MAX_STEPS + 1):
                prompt = _build_prompt(observation, step, history)
                try:
                    # Model-call failures are captured in the report rather than
                    # crashing the full benchmark run.
                    payload = _completion(client, prompt)
                except Exception as exc:
                    payload = {"findings": [], "patched_code": None, "_error": str(exc)}

                action = _to_action(
                    payload=payload,
                    observation=observation,
                    finalize=step == MAX_STEPS or observation.attempts_remaining <= 1,
                )

                result = env.step(action)
                observation = result.observation

                step_log = {
                    "step": step,
                    "operation": action.operation,
                    "submitted_findings": len(action.findings),
                    "reward": result.reward or 0.0,
                    "score": observation.score,
                    "done": result.done,
                    "feedback": observation.feedback,
                }
                if payload.get("_error"):
                    step_log["model_error"] = payload["_error"]
                if payload.get("_parse_error"):
                    step_log["parse_error"] = True
                step_logs.append(step_log)

                # The history string is fed back into later prompts so the
                # model can see what it already tried.
                history.append(
                    f"step={step} op={action.operation} findings={len(action.findings)} "
                    f"score={observation.score:.2f} feedback={observation.feedback}"
                )

                print(
                    f"  step={step} op={action.operation} findings={len(action.findings)} "
                    f"score={observation.score:.2f} reward={(result.reward or 0.0):.2f} "
                    f"done={result.done}"
                )

                if result.done:
                    break

            episode_results.append(_task_result_dict(observation, step_logs))
    finally:
        env.close()

    mean_score = (
        sum(item["score"] for item in episode_results) / len(episode_results)
        if episode_results
        else 0.0
    )
    summary = {
        "model_name": MODEL_NAME,
        "api_base_url": API_BASE_URL,
        "task_count": len(episode_results),
        "mean_score": mean_score,
        "results": episode_results,
    }

    # Persist the report so scores can be compared across runs and models.
    REPORT_PATH.write_text(json.dumps(summary, indent=2), encoding="utf-8")
    print(json.dumps(summary, indent=2))
    print(f"\nSaved report to {REPORT_PATH}")


if __name__ == "__main__":
    main()
models.py ADDED
@@ -0,0 +1,248 @@
"""Typed models for the Python code-review environment.

This module is the shared contract between:

- the OpenEnv server implementation
- the REST API layer
- the benchmark grader
- the inference script
- the tests

Keeping these models centralized makes the environment easier to validate,
serialize, and evolve without each module inventing its own payload shape.
"""

from typing import List, Literal, Optional

from pydantic import BaseModel, Field

from openenv.core.env_server.types import Action, Observation


# Difficulty buckets are intentionally small and fixed so tasks can be
# grouped for curriculum learning and reporting without extra normalization.
Difficulty = Literal["easy", "medium", "hard"]

# Severity is separate from category because one category such as "security"
# can still vary in importance across tasks.
Severity = Literal["critical", "warning", "info"]

# Categories help both humans and agents understand what type of issue was found.
Category = Literal["bug", "security", "style", "performance", "maintainability"]

# Operations define the small action space an agent can use during an episode.
Operation = Literal["submit_findings", "request_hint", "finalize"]


class ReviewFinding(BaseModel):
    """A structured review finding.

    Each finding is designed to be machine-gradable while still resembling the
    sort of issue summary a human reviewer would write in a real code review.
    """

    title: str = Field(..., description="Short title for the finding")
    line: Optional[int] = Field(default=None, description="1-based source line number")
    category: Category = Field(default="bug", description="Issue category")
    severity: Severity = Field(default="warning", description="Issue severity")
    rationale: str = Field(
        default="",
        description="Why the issue matters and how it affects behaviour or safety",
    )
    recommendation: Optional[str] = Field(
        default=None, description="Concrete fix recommendation"
    )
    rule_id: Optional[str] = Field(
        default=None,
        description="Stable internal rule identifier when known",
    )


class TaskDescriptor(BaseModel):
    """Public task metadata shown to the agent.

    This is intentionally the "visible" task information. Hidden grading
    details stay inside the server task bank so the benchmark remains useful.
    """

    task_id: str = Field(..., description="Stable task identifier")
    difficulty: Difficulty = Field(..., description="Task difficulty bucket")
    title: str = Field(..., description="Short task title")
    objective: str = Field(..., description="What the reviewer should accomplish")
    code: str = Field(..., description="Python code to review")
    max_steps: int = Field(..., ge=1, description="Maximum actions allowed")
    success_threshold: float = Field(
        ..., ge=0.0, le=1.0, description="Minimum score considered a pass"
    )


class TaskEvaluation(BaseModel):
    """Deterministic grader output.

    This model is returned in observations and offline grading routes so that
    both online interaction and offline evaluation use exactly the same metrics.
    """

    matched_reference_ids: List[str] = Field(default_factory=list)
    matched_findings: int = Field(default=0, ge=0)
    total_findings: int = Field(default=0, ge=0)
    false_positives: int = Field(default=0, ge=0)
    duplicate_findings: int = Field(default=0, ge=0)
    weighted_recall: float = Field(default=0.0, ge=0.0, le=1.0)
    patch_score: float = Field(default=0.0, ge=0.0, le=1.0)
    score: float = Field(default=0.0, ge=0.0, le=1.0)
    passed: bool = Field(default=False)


class PythonReviewAction(Action):
    """Action submitted by an agent during an episode.

    The action space is kept intentionally small:

    - `submit_findings` for intermediate progress
    - `request_hint` when the agent needs guidance at a small penalty
    - `finalize` when the agent wants the episode to end
    """

    operation: Operation = Field(
        default="submit_findings",
        description="How to interact with the environment on this step",
    )
    findings: List[ReviewFinding] = Field(
        default_factory=list,
        description="Structured findings being submitted for grading",
    )
    patched_code: Optional[str] = Field(
        default=None,
        description="Optional improved version of the code under review",
    )
    note: Optional[str] = Field(
        default=None,
        description="Optional free-form reviewer note for logging or context",
    )


class PythonEnvConfig(BaseModel):
    """Environment-level configuration knobs.

    These values are useful for experimentation because they let you adjust
    reward shaping and curriculum ordering without changing the grader logic.
    """

    task_order: List[str] = Field(
        default_factory=lambda: ["py-review-easy", "py-review-medium", "py-review-hard"],
        description="Deterministic task order used across resets",
    )
    max_steps_per_task: int = Field(default=4, ge=1, le=10)
    hint_penalty: float = Field(default=0.05, ge=0.0, le=1.0)
    false_positive_penalty: float = Field(default=0.08, ge=0.0, le=1.0)
    duplicate_penalty: float = Field(default=0.03, ge=0.0, le=1.0)
    patch_bonus_multiplier: float = Field(default=0.2, ge=0.0, le=1.0)
    max_history_entries: int = Field(default=50, ge=1, le=500)


class PythonReviewObservation(Observation):
    """Observation returned by `reset()` and `step()`.

    The observation combines:

    - visible task context
    - immediate feedback on the previous action
    - cumulative evaluation state
    - OpenEnv-standard reward/done/metadata fields
    """

    task: TaskDescriptor = Field(..., description="Current task details")
    instructions: str = Field(
        default="Inspect the code and submit structured findings.",
        description="Episode instructions shown to the agent",
    )
    feedback: str = Field(default="", description="Feedback for the last action")
    submitted_findings: List[ReviewFinding] = Field(
        default_factory=list,
        description="All findings submitted so far in this episode",
    )
    hints_used: int = Field(default=0, ge=0)
    attempts_remaining: int = Field(default=0, ge=0)
    evaluation: TaskEvaluation = Field(default_factory=TaskEvaluation)
    score: float = Field(
        default=0.0,
        ge=0.0,
        le=1.0,
        description="Current task score after this step",
172
+ )
173
+ review_time_ms: float = Field(default=0.0, ge=0.0)
174
+
175
+
176
+ class EpisodeRecord(BaseModel):
177
+ """Stored summary of a completed or in-progress episode.
178
+
179
+ This model is used by the custom history routes and is intentionally
180
+ compact enough to archive for later analysis or dataset creation.
181
+ """
182
+
183
+ episode_id: str
184
+ task_id: str
185
+ difficulty: Difficulty
186
+ title: str
187
+ final_score: float = Field(ge=0.0, le=1.0)
188
+ passed: bool = Field(default=False)
189
+ steps_taken: int = Field(default=0, ge=0)
190
+ hints_used: int = Field(default=0, ge=0)
191
+ matched_findings: int = Field(default=0, ge=0)
192
+ total_findings: int = Field(default=0, ge=0)
193
+ false_positives: int = Field(default=0, ge=0)
194
+ duplicate_findings: int = Field(default=0, ge=0)
195
+ status: Literal["active", "completed"] = Field(default="completed")
196
+ created_at: str
197
+ updated_at: str
198
+
199
+
200
+ class DirectReviewRequest(BaseModel):
201
+ """Request model for ad-hoc review outside the benchmark tasks."""
202
+
203
+ code: str = Field(..., description="Python source code to inspect")
204
+ context: Optional[str] = Field(
205
+ default=None, description="Optional explanation of the code's purpose"
206
+ )
207
+
208
+
209
+ class DirectReviewResponse(BaseModel):
210
+ """Static review result for arbitrary Python code.
211
+
212
+ This route is useful for manual testing and dataset generation because it
213
+ lets you review arbitrary snippets without entering the benchmark loop.
214
+ """
215
+
216
+ issues: List[ReviewFinding] = Field(default_factory=list)
217
+ summary: str = Field(default="")
218
+ score: float = Field(default=0.0, ge=0.0, le=1.0)
219
+ improved_code: Optional[str] = Field(default=None)
220
+
221
+
222
+ class DeleteResponse(BaseModel):
223
+ """Small acknowledgement payload for DELETE routes."""
224
+
225
+ detail: str
226
+
227
+
228
+ class HealthResponse(BaseModel):
229
+ """Health payload used by Docker and Spaces checks.
230
+
231
+ This payload stays intentionally simple because health checks are often
232
+ consumed by infrastructure rather than by human users.
233
+ """
234
+
235
+ status: Literal["ok"] = "ok"
236
+ environment: str = "python_env"
237
+ task_count: int = Field(default=0, ge=0)
238
+ active_task_id: Optional[str] = None
239
+ active_episode_id: Optional[str] = None
240
+
241
+
242
+ # Backward-compatible aliases keep older imports working while the project
243
+ # standardizes on the `Python*` naming convention.
244
+ PythonAction = PythonReviewAction
245
+ PythonObservation = PythonReviewObservation
246
+ CodeReviewAction = PythonReviewAction
247
+ CodeReviewObservation = PythonReviewObservation
248
+ CodeReviewConfig = PythonEnvConfig
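The action model above maps directly onto the JSON an agent submits on each step. A minimal sketch of such a payload follows; the field names (`operation`, `findings`, `patched_code`, `note`) come from `PythonReviewAction` above, while the exact HTTP envelope used by `EnvClient` is an assumption for illustration.

```python
import json

# Hypothetical step payload; field names mirror PythonReviewAction,
# and the finding keys mirror those the grader matches on (rule_id,
# title, line, category). The surrounding transport is an assumption.
action = {
    "operation": "submit_findings",  # or "request_hint" / "finalize"
    "findings": [
        {
            "rule_id": "mutable-default",
            "title": "Mutable default list is shared across calls",
            "line": 1,
            "category": "bug",
        }
    ],
    "patched_code": None,
    "note": "First pass over the snippet.",
}
payload = json.dumps(action)
```

Sending the same finding twice would count as a duplicate under the grader, so agents should accumulate findings client-side rather than resubmitting the full list each step.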
openenv.yaml ADDED
@@ -0,0 +1,7 @@
+ spec_version: 1
+ name: python_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
openenv_python_env.egg-info/PKG-INFO ADDED
@@ -0,0 +1,10 @@
+ Metadata-Version: 2.4
+ Name: openenv-python_env
+ Version: 0.1.0
+ Summary: Python Env environment for OpenEnv
+ Requires-Python: >=3.10
+ Requires-Dist: openenv-core[core]>=0.2.2
+ Requires-Dist: pydantic>=2.12.5
+ Provides-Extra: dev
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_python_env.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,19 @@
+ README.md
+ __init__.py
+ client.py
+ inference.py
+ models.py
+ pyproject.toml
+ ./__init__.py
+ ./client.py
+ ./inference.py
+ ./models.py
+ openenv_python_env.egg-info/PKG-INFO
+ openenv_python_env.egg-info/SOURCES.txt
+ openenv_python_env.egg-info/dependency_links.txt
+ openenv_python_env.egg-info/entry_points.txt
+ openenv_python_env.egg-info/requires.txt
+ openenv_python_env.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/python_env_environment.py
openenv_python_env.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_python_env.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = python_env.server.app:main
openenv_python_env.egg-info/requires.txt ADDED
@@ -0,0 +1,6 @@
+ openenv-core[core]>=0.2.2
+ pydantic>=2.12.5
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_python_env.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ python_env
pyproject.toml ADDED
@@ -0,0 +1,46 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-python_env"
+ version = "0.1.0"
+ description = "Python Env environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     # Core OpenEnv runtime (provides FastAPI server + HTTP client types)
+     # Install from GitHub:
+     # "openenv-core[core] @ git+https://github.com/meta-pytorch/OpenEnv.git",
+     "openenv-core[core]>=0.2.2",
+     # Environment-specific dependencies
+     # Add all dependencies needed for your environment here
+     # Examples:
+     # "numpy>=1.19.0",
+     # "torch>=2.0.0",
+     # "gymnasium>=0.29.0",
+     # "openspiel>=1.0.0",
+     # "smolagents>=1.22.0,<2",
+     "pydantic>=2.12.5",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ # Server entry point - enables running via: uv run --project . server
+ # or: python -m python_env.server.app
+ server = "python_env.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["python_env", "python_env.server"]
+ package-dir = { "python_env" = ".", "python_env.server" = "server" }
server/__init__.py ADDED
@@ -0,0 +1,11 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Python Env environment server components."""
+
+ from .python_env_environment import PythonEnvironment
+
+ __all__ = ["PythonEnvironment"]
server/app.py ADDED
@@ -0,0 +1,84 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the Python Env Environment.
+
+ This module creates an HTTP server that exposes the PythonEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+     - POST /reset: Reset the environment
+     - POST /step: Execute an action
+     - GET /state: Get current environment state
+     - GET /schema: Get action/observation schemas
+     - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ try:
+     from ..models import PythonAction, PythonObservation
+     from .python_env_environment import PythonEnvironment
+ except ImportError:
+     from models import PythonAction, PythonObservation
+     from server.python_env_environment import PythonEnvironment
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     PythonEnvironment,
+     PythonAction,
+     PythonObservation,
+     env_name="python_env",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000):
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m python_env.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn python_env.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ if __name__ == "__main__":
+     import argparse
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     args = parser.parse_args()
+     main(port=args.port)
server/python_env_environment.py ADDED
@@ -0,0 +1,421 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Python code-review environment implementation."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass
+ from datetime import UTC, datetime
+ from typing import Dict, Iterable, List, Optional
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+ from openenv.core.env_server.types import State
+
+ try:
+     from ..models import (
+         PythonAction,
+         PythonEnvConfig,
+         PythonObservation,
+         ReviewFinding,
+         TaskDescriptor,
+         TaskEvaluation,
+     )
+ except ImportError:
+     from models import (  # type: ignore
+         PythonAction,
+         PythonEnvConfig,
+         PythonObservation,
+         ReviewFinding,
+         TaskDescriptor,
+         TaskEvaluation,
+     )
+
+
+ @dataclass(frozen=True)
+ class ReferenceFinding:
+     """Hidden finding metadata used for deterministic grading."""
+
+     rule_id: str
+     title: str
+     line: int
+     category: str
+     severity: str
+     rationale: str
+     recommendation: str
+     weight: float
+
+
+ @dataclass(frozen=True)
+ class ReviewTask:
+     """A visible task plus its hidden grading references."""
+
+     descriptor: TaskDescriptor
+     references: tuple[ReferenceFinding, ...]
+     hint: str
+     patched_code: Optional[str] = None
+
+
+ TASK_BANK: Dict[str, ReviewTask] = {
+     "py-review-easy": ReviewTask(
+         descriptor=TaskDescriptor(
+             task_id="py-review-easy",
+             difficulty="easy",
+             title="Mutable default argument",
+             objective="Find the correctness issue and explain a safe fix.",
+             code=(
+                 "def add_tag(tag, tags=[]):\n"
+                 "    tags.append(tag)\n"
+                 "    return tags\n"
+             ),
+             max_steps=4,
+             success_threshold=0.7,
+         ),
+         references=(
+             ReferenceFinding(
+                 rule_id="mutable-default",
+                 title="Mutable default list is shared across calls",
+                 line=1,
+                 category="bug",
+                 severity="warning",
+                 rationale="The list persists between calls and leaks state.",
+                 recommendation="Use None as the default and create a new list inside the function.",
+                 weight=1.0,
+             ),
+         ),
+         hint="Look for state that survives between separate function calls.",
+         patched_code=(
+             "def add_tag(tag, tags=None):\n"
+             "    if tags is None:\n"
+             "        tags = []\n"
+             "    tags.append(tag)\n"
+             "    return tags\n"
+         ),
+     ),
+     "py-review-medium": ReviewTask(
+         descriptor=TaskDescriptor(
+             task_id="py-review-medium",
+             difficulty="medium",
+             title="Unsafe shell invocation",
+             objective="Review the snippet for security-sensitive behavior.",
+             code=(
+                 "import os\n\n"
+                 "def run_backup(path):\n"
+                 "    os.system(f\"tar -czf backup.tgz {path}\")\n"
+             ),
+             max_steps=4,
+             success_threshold=0.72,
+         ),
+         references=(
+             ReferenceFinding(
+                 rule_id="shell-injection",
+                 title="User input is interpolated into a shell command",
+                 line=4,
+                 category="security",
+                 severity="critical",
+                 rationale="An attacker can inject shell metacharacters through the path argument.",
+                 recommendation="Use subprocess with an argument list instead of os.system.",
+                 weight=1.0,
+             ),
+         ),
+         hint="Check how external commands are invoked and whether user input is escaped.",
+         patched_code=(
+             "import subprocess\n\n"
+             "def run_backup(path):\n"
+             "    subprocess.run([\"tar\", \"-czf\", \"backup.tgz\", path], check=True)\n"
+         ),
+     ),
+     "py-review-hard": ReviewTask(
+         descriptor=TaskDescriptor(
+             task_id="py-review-hard",
+             difficulty="hard",
+             title="Retry helper hides failures",
+             objective="Identify correctness and maintainability issues in the retry logic.",
+             code=(
+                 "import time\n\n"
+                 "def fetch_with_retry(client, url, retries=3):\n"
+                 "    last_error = None\n"
+                 "    for _ in range(retries):\n"
+                 "        try:\n"
+                 "            return client.get(url, timeout=1)\n"
+                 "        except Exception as exc:\n"
+                 "            last_error = exc\n"
+                 "            time.sleep(0.1)\n"
+                 "    return None\n"
+             ),
+             max_steps=4,
+             success_threshold=0.74,
+         ),
+         references=(
+             ReferenceFinding(
+                 rule_id="swallowed-error",
+                 title="Function swallows the final exception and returns None",
+                 line=10,
+                 category="bug",
+                 severity="warning",
+                 rationale="Callers cannot distinguish a failed request from a valid None result.",
+                 recommendation="Re-raise the last exception after retries are exhausted.",
+                 weight=0.65,
+             ),
+             ReferenceFinding(
+                 rule_id="broad-except",
+                 title="Broad exception handler catches unexpected failures",
+                 line=7,
+                 category="maintainability",
+                 severity="info",
+                 rationale="Catching Exception masks programming errors and interrupts.",
+                 recommendation="Catch only the client or network exceptions you expect to retry.",
+                 weight=0.35,
+             ),
+         ),
+         hint="Consider what happens to the final error after the retry loop finishes.",
+         patched_code=(
+             "import time\n\n"
+             "def fetch_with_retry(client, url, retries=3):\n"
+             "    last_error = None\n"
+             "    for _ in range(retries):\n"
+             "        try:\n"
+             "            return client.get(url, timeout=1)\n"
+             "        except client.retryable_exceptions as exc:\n"
+             "            last_error = exc\n"
+             "            time.sleep(0.1)\n"
+             "    if last_error is not None:\n"
+             "        raise last_error\n"
+         ),
+     ),
+ }
+
+
+ def _utc_now() -> str:
+     return datetime.now(UTC).isoformat()
+
+
+ def _normalize_text(value: Optional[str]) -> str:
+     return " ".join((value or "").strip().lower().split())
+
+
+ def _normalize_code(value: Optional[str]) -> str:
+     return "\n".join(line.rstrip() for line in (value or "").strip().splitlines())
+
+
+ class PythonEnvironment(Environment[PythonAction, PythonObservation, State]):
+     """Deterministic benchmark environment for Python code review tasks."""
+
+     SUPPORTS_CONCURRENT_SESSIONS: bool = True
+
+     def __init__(self, config: Optional[PythonEnvConfig] = None):
+         super().__init__()
+         self._config = config or PythonEnvConfig()
+         self._state = State(episode_id=str(uuid4()), step_count=0)
+         self._task_cursor = -1
+         self._current_task: Optional[ReviewTask] = None
+         self._submitted_findings: List[ReviewFinding] = []
+         self._hints_used = 0
+         self._created_at = _utc_now()
+
+     def reset(
+         self,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         **kwargs,
+     ) -> PythonObservation:
+         """Start the next configured review task."""
+
+         del seed, kwargs
+         self._task_cursor = (self._task_cursor + 1) % len(self._config.task_order)
+         task_id = self._config.task_order[self._task_cursor]
+         self._current_task = TASK_BANK.get(task_id, TASK_BANK["py-review-easy"])
+         self._state = State(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+         )
+         self._submitted_findings = []
+         self._hints_used = 0
+         self._created_at = _utc_now()
+         return self._build_observation(
+             feedback="New review task loaded. Submit findings or request a hint.",
+             reward=0.0,
+             done=False,
+         )
+
+     def step(
+         self,
+         action: PythonAction,
+         timeout_s: Optional[float] = None,
+         **kwargs,
+     ) -> PythonObservation:
+         """Process one review action and return updated feedback."""
+
+         del timeout_s, kwargs
+         if self._current_task is None:
+             return self.reset()
+
+         self._state.step_count += 1
+         operation = action.operation
+         feedback = ""
+         reward = 0.0
+         done = False
+
+         if operation == "request_hint":
+             self._hints_used += 1
+             feedback = self._current_task.hint
+             evaluation = self._evaluate(self._submitted_findings, action.patched_code)
+             reward = evaluation.score
+         else:
+             if action.findings:
+                 self._submitted_findings.extend(action.findings)
+             evaluation = self._evaluate(self._submitted_findings, action.patched_code)
+             reward = evaluation.score
+             if operation == "finalize":
+                 done = True
+                 feedback = (
+                     "Review finalized. "
+                     f"Matched {evaluation.matched_findings}/{evaluation.total_findings} "
+                     "reference findings."
+                 )
+             else:
+                 feedback = (
+                     f"Progress saved. Matched {evaluation.matched_findings}/"
+                     f"{evaluation.total_findings} findings with score {evaluation.score:.2f}."
+                 )
+
+         if self._state.step_count >= self._max_steps():
+             done = True
+             if operation != "finalize":
+                 feedback = (
+                     f"{feedback} Maximum steps reached."
+                     if feedback
+                     else "Maximum steps reached."
+                 )
+
+         return self._build_observation(
+             feedback=feedback,
+             reward=reward,
+             done=done,
+             patched_code=action.patched_code,
+         )
+
+     def _build_observation(
+         self,
+         *,
+         feedback: str,
+         reward: float,
+         done: bool,
+         patched_code: Optional[str] = None,
+     ) -> PythonObservation:
+         assert self._current_task is not None
+         evaluation = self._evaluate(self._submitted_findings, patched_code)
+         attempts_remaining = max(
+             self._max_steps() - self._state.step_count,
+             0,
+         )
+         return PythonObservation(
+             task=self._current_task.descriptor,
+             feedback=feedback,
+             submitted_findings=list(self._submitted_findings),
+             hints_used=self._hints_used,
+             attempts_remaining=attempts_remaining,
+             evaluation=evaluation,
+             score=evaluation.score,
+             review_time_ms=float(self._state.step_count * 125),
+             done=done,
+             reward=reward,
+             metadata={
+                 "episode_id": self._state.episode_id,
+                 "created_at": self._created_at,
+                 "updated_at": _utc_now(),
+             },
+         )
+
+     def _evaluate(
+         self,
+         findings: Iterable[ReviewFinding],
+         patched_code: Optional[str],
+     ) -> TaskEvaluation:
+         assert self._current_task is not None
+
+         references = self._current_task.references
+         matched_reference_ids: List[str] = []
+         matched_weight = 0.0
+         false_positives = 0
+         duplicate_findings = 0
+
+         seen_ids = set()
+         for finding in findings:
+             ref_id = self._match_reference(finding, references)
+             if ref_id is None:
+                 false_positives += 1
+                 continue
+             if ref_id in seen_ids:
+                 duplicate_findings += 1
+                 continue
+             seen_ids.add(ref_id)
+             matched_reference_ids.append(ref_id)
+             matched_weight += next(ref.weight for ref in references if ref.rule_id == ref_id)
+
+         total_weight = sum(ref.weight for ref in references) or 1.0
+         weighted_recall = min(matched_weight / total_weight, 1.0)
+
+         patch_score = 0.0
+         if self._current_task.patched_code and patched_code:
+             patch_score = float(
+                 _normalize_code(patched_code) == _normalize_code(self._current_task.patched_code)
+             )
+
+         raw_score = (
+             weighted_recall
+             + (self._config.patch_bonus_multiplier * patch_score)
+             - (self._config.false_positive_penalty * false_positives)
+             - (self._config.duplicate_penalty * duplicate_findings)
+             - (self._config.hint_penalty * self._hints_used)
+         )
+         score = max(0.0, min(raw_score, 1.0))
+
+         return TaskEvaluation(
+             matched_reference_ids=matched_reference_ids,
+             matched_findings=len(matched_reference_ids),
+             total_findings=len(references),
+             false_positives=false_positives,
+             duplicate_findings=duplicate_findings,
+             weighted_recall=weighted_recall,
+             patch_score=patch_score,
+             score=score,
+             passed=score >= self._current_task.descriptor.success_threshold,
+         )
+
+     def _match_reference(
+         self,
+         finding: ReviewFinding,
+         references: Iterable[ReferenceFinding],
+     ) -> Optional[str]:
+         finding_rule = _normalize_text(finding.rule_id)
+         finding_title = _normalize_text(finding.title)
+         for reference in references:
+             if finding_rule and finding_rule == _normalize_text(reference.rule_id):
+                 return reference.rule_id
+             line_matches = finding.line is not None and finding.line == reference.line
+             category_matches = finding.category == reference.category
+             title_matches = finding_title and (
+                 finding_title in _normalize_text(reference.title)
+                 or _normalize_text(reference.title) in finding_title
+             )
+             if line_matches and (category_matches or title_matches):
+                 return reference.rule_id
+         return None
+
+     def _max_steps(self) -> int:
+         assert self._current_task is not None
+         return min(
+             self._current_task.descriptor.max_steps,
+             self._config.max_steps_per_task,
+         )
+
+     @property
+     def state(self) -> State:
+         """Return the current environment state."""
+
+         return self._state
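The reward shaping in `_evaluate` reduces to a single clamped formula: weighted recall plus a patch bonus, minus per-count penalties. A standalone sketch of that arithmetic, using the `PythonEnvConfig` defaults from `models.py` as the parameter defaults:

```python
# Standalone sketch of the grader's score arithmetic (mirrors _evaluate).
# The keyword defaults are the PythonEnvConfig defaults: patch bonus 0.2,
# false-positive penalty 0.08, duplicate penalty 0.03, hint penalty 0.05.
def clamp_score(
    weighted_recall: float,
    patch_score: float = 0.0,
    false_positives: int = 0,
    duplicates: int = 0,
    hints: int = 0,
    patch_bonus: float = 0.2,
    fp_penalty: float = 0.08,
    dup_penalty: float = 0.03,
    hint_penalty: float = 0.05,
) -> float:
    raw = (
        weighted_recall
        + patch_bonus * patch_score
        - fp_penalty * false_positives
        - dup_penalty * duplicates
        - hint_penalty * hints
    )
    # The environment clamps the final score into [0, 1].
    return max(0.0, min(raw, 1.0))
```

A perfect review with a correct patch saturates at 1.0 rather than exceeding it, while on the hard task a reviewer who matches only the 0.65-weight finding, takes one hint, and adds one false positive lands at 0.65 - 0.08 - 0.05 = 0.52, below the 0.74 pass threshold.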
server/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ openenv[core]>=0.2.0
+ fastapi>=0.115.0
+ uvicorn>=0.24.0
+
+
+
uv.lock ADDED
The diff for this file is too large to render. See raw diff