uvpatel7271 committed on
Commit
cd5c208
1 Parent(s): a4ea2be

added modularity and updates

README.md CHANGED
@@ -1,253 +1,169 @@
1
- ---
2
- title: TorchReview Copilot
3
- emoji: 🧠
4
- colorFrom: orange
5
- colorTo: red
6
- sdk: docker
7
- pinned: false
8
- app_port: 8000
9
- tags:
10
- - pytorch
11
- - gradio
12
- - fastapi
13
- - openenv
14
- - code-review
15
- ---
16
 
17
- # TorchReview Copilot
18
 
19
- TorchReview Copilot is an **AI-powered code review and improvement system using PyTorch** to analyze Python code, predict quality, generate structured improvement suggestions, and compute an RL-ready reward score.
20
-
21
- It upgrades the original OpenEnv hackathon environment into a judge-friendly product demo: a polished Hugging Face Space on top, with the deterministic OpenEnv validation engine still preserved underneath.
22
-
23
- **Live demo:** [Hugging Face Space](https://huggingface.co/spaces/uvpatel7271/final-python-env)
24
- **Repository:** [uvpatel/final-python-env](https://github.com/uvpatel/final-python-env)
25
-
26
- ## Problem Statement
27
-
28
- Engineering teams lose time during incident response and code review because broken Python snippets often arrive with noisy traces, partial test output, and unclear ownership. Before fixing anything, someone still has to answer:
29
-
30
- - Is this a syntax issue, a logic bug, or a performance regression?
31
- - How risky is the repair?
32
- - What should be checked first?
33
-
34
- That triage step is repetitive, error-prone, and often slows down the actual fix.
35
-
36
- ## Solution
37
-
38
- TorchReview Copilot turns code, traceback text, and a short context window into a practical code-review report:
39
-
40
- - **Issue classification:** syntax, logic, or performance
41
- - **ML quality score:** predicted code quality from PyTorch embeddings
42
- - **Reward score:** RL-ready score from model quality, lint quality, and complexity penalty
43
- - **Live Triage Radar:** confidence visualization for all issue classes
44
- - **Nearest known pattern:** the closest OpenEnv task match
45
- - **Improvement plan:** step 1 syntax/bug fixes, step 2 edge cases, step 3 scalability
46
-
47
- The result is a demo that feels like a real AI debugging assistant rather than a backend-only environment.
48
-
49
- ## Why PyTorch Matters
50
-
51
- This project uses **PyTorch for real inference**, not placeholder branching:
52
-
53
- - `transformers` + `torch` load `huggingface/CodeBERTa-small-v1`
54
- - the model encodes code snippets and failure context into embeddings
55
- - embeddings are compared against curated OpenEnv issue prototypes
56
- - the final decision blends model similarity with lightweight static analysis signals
57
-
58
- That gives the demo an actual model-backed quality and issue scoring path while keeping it CPU-friendly for Hugging Face Spaces.
59
-
60
- ## How It Works
61
-
62
- ### Pipeline
63
-
64
- `Input code + context window + traceback -> static checks -> PyTorch embeddings -> quality + issue prediction -> suggestion engine -> reward computation -> UI/API output`
65
-
66
- ### Detailed Flow
67
-
68
- 1. The user pastes Python code and optional traceback or benchmark output.
69
- 2. TorchReview extracts lightweight static signals:
70
- - parser success/failure
71
- - assertion-style test language
72
- - lint/style issues
73
- - nested-loop depth and complexity pressure
74
- 3. CodeBERTa runs through PyTorch to embed the combined input.
75
- 4. The embedding is compared against built-in issue prototypes derived from the OpenEnv task catalog and reference implementations.
76
- 5. The UI returns:
77
- - top issue label
78
- - confidence radar
79
- - repair risk
80
- - ML quality score
81
- - RL-ready reward score
82
- - nearest known bug pattern
83
- - three-step improvement plan
84
-
85
- ### Reward Formula
86
-
87
- The current reward computation is:
88
 
89
  ```text
90
- reward = (0.5 x ML_quality_score) + (0.3 x lint_score) - (0.2 x complexity_penalty)
91
  ```
92
 
93
- This keeps the project compatible with OpenEnv-style reinforcement learning workflows.
94
-
95
- ## Built-In Demo Scenarios
96
-
97
- The app ships with three grounded examples reused from the OpenEnv tasks:
98
-
99
- 1. **Syntax regression:** broken invoice normalization helper
100
- 2. **Logic bug:** session window boundary failure
101
- 3. **Performance bottleneck:** slow active-user ranking pipeline
102
-
103
- These examples make the classification differences obvious during judging and video demos.
104
-
105
- ## Tech Stack
106
 
107
- - **PyTorch** for embedding inference
108
- - **Transformers** for `CodeBERTa-small-v1`
109
- - **Gradio** for the polished Hugging Face Space UI
110
- - **FastAPI** for the app server
111
- - **OpenEnv** for deterministic validation endpoints and environment compatibility
112
- - **Pydantic** for typed schemas
113
-
114
- ## Features
115
-
116
- - PyTorch-powered code quality inference
117
- - Static analysis for syntax, lint, and complexity
118
- - Context-window-aware review flow
119
- - RL-ready reward shaping
120
- - Live Triage Radar visualization
121
- - Three-step improvement plan:
122
- 1. syntax checking and bug fixes
123
- 2. edge-case handling
124
- 3. scalability improvements
125
-
126
- ## Hugging Face Space UX
127
-
128
- The root app now presents a production-style triage experience:
129
-
130
- - a clear problem/solution hero section
131
- - example scenario selector
132
- - code and traceback inputs
133
- - context window input
134
- - **Live Triage Radar**
135
- - structured improvement plan
136
- - reward and quality score display
137
- - visible model/backend notes
138
 
139
- The underlying OpenEnv endpoints remain available for compatibility and evaluation.
140
 
141
- ## Screenshots
142
 
143
- Add screenshots after deployment:
144
 
145
- - `docs/screenshots/home.png` -> hero + inputs
146
- - `docs/screenshots/triage-radar.png` -> confidence visualization
147
- - `docs/screenshots/fix-plan.png` -> structured output panel
148
 
149
- Suggested markdown once captured:
150
 
151
- ```md
152
- ![TorchReview Copilot Home](docs/screenshots/home.png)
153
- ![Live Triage Radar](docs/screenshots/triage-radar.png)
154
- ![Fix Plan Output](docs/screenshots/fix-plan.png)
155
  ```
156
 
157
- ## Local Setup
158
-
159
- ### 1. Install dependencies
160
 
161
  ```bash
162
- pip install .
163
  ```
164
 
165
- ### 2. Run the application
166
 
167
  ```bash
168
- uvicorn server.app:app --host 0.0.0.0 --port 8000
 
169
  ```
170
 
171
- ### 3. Open the demo
172
 
173
- Visit:
174
 
175
- ```text
176
- http://localhost:8000/
177
- ```
 
 
 
178
 
179
- ### 4. Verify OpenEnv compatibility
180
 
181
  ```bash
182
- curl http://localhost:8000/health
183
- curl http://localhost:8000/state
 
 
184
  ```
185
 
186
- ## Docker
187
 
188
- ```bash
189
- docker build -t torchreview-copilot -f server/Dockerfile .
190
- docker run --rm -p 8000:8000 torchreview-copilot
 
 
 
 
191
  ```
192
 
193
- Expected checks:
 
 
194
 
195
  ```bash
196
- curl http://localhost:8000/
197
- curl http://localhost:8000/health
198
  ```
199
 
200
- ## Project Structure
201
 
202
- ```text
203
- python_env/
204
- ├── client.py
205
- ├── graders/
206
- ├── server/
207
- │ ├── app.py
208
- │ ├── demo.py
209
- │ └── env.py
210
- ├── tasks/
211
- ├── triage.py
212
- ├── triage_catalog.py
213
- ├── triage_models.py
214
- ├── inference.py
215
- └── tests/
216
  ```
217
 
218
- ## OpenEnv Compatibility
219
-
220
- The hackathon backend is still present:
221
-
222
- - deterministic task grading
223
- - structured action/observation/state models
224
- - `/health`, `/state`, `/reset`, `/step`, and related environment routes
225
-
226
- This means the product demo is not detached from evaluation; it is layered on top of the original OpenEnv system.
227
 
228
- ## Demo Script
229
 
230
- See [DEMO_SCRIPT.md](DEMO_SCRIPT.md) for the 60-90 second recording flow.
231
 
232
- Short version:
233
 
234
- 1. Open the Space and introduce the problem.
235
- 2. Load the syntax example.
236
- 3. Show the Live Triage Radar and issue label.
237
- 4. Explain the PyTorch embedding step.
238
- 5. Show the matched pattern and fix plan.
239
- 6. Show the reward score and explain how it can be used inside an RL environment.
240
- 7. Switch to the performance example to prove the model distinguishes issue classes.
241
 
242
- ## Limitations
243
 
244
- - The classifier uses pretrained embeddings plus prototype similarity, not a custom fine-tuned model.
245
- - First model load may take longer on a cold Hugging Face Space.
246
- The current demo focuses on short Python snippets rather than full multi-file repositories.
247
 
248
- ## Future Work
249
 
250
- - fine-tune the PyTorch classifier on a larger bug triage dataset
251
- - add repository-level file context and diff-aware analysis
252
- - include automated patch suggestions after triage
253
- - track remediation outcomes as a feedback loop for future ranking improvements
 
1
+ # OpenEnv Python Code Review Environment
2
 
3
+ Production-ready hackathon submission for OpenEnv evaluation, deterministic validator runs, and Hugging Face Docker deployment.
4
 
5
+ ## Architecture
6
 
7
  ```text
8
+ root
9
+ ├── inference.py # Root validator entrypoint
10
+ ├── openenv.yaml # OpenEnv manifest
11
+ ├── app/
12
+ │ ├── agents/ # Action policy and fallback strategy
13
+ │ ├── env/ # RL loop runner and stdout contract
14
+ │ ├── models/ # Inference dataclasses/config
15
+ │ ├── services/ # OpenAI client wrapper with retries
16
+ │ └── utils/ # Formatting, task loading, log suppression
17
+ ├── server/
18
+ │ ├── env.py # OpenEnv environment and reward shaping
19
+ │ ├── app.py # FastAPI/OpenEnv app, optional Gradio mount
20
+ │ └── Dockerfile # Hugging Face Docker image
21
+ ├── graders/ # Syntax, bug-fix, optimization graders
22
+ ├── tasks/ # Deterministic benchmark tasks and references
23
+ ├── services/ # Multi-domain analysis services
24
+ ├── analyzers/ # Domain-specific analyzers
25
+ ├── models/ # Lazy-loaded PyTorch scoring model
26
+ ├── schemas/ # API request/response contracts
27
+ └── tests/ # Local validation coverage
28
  ```
29
 
30
+ Runtime flow:
31
 
32
+ ```text
33
+ inference.py
34
+ -> app.env.runner.InferenceRunner
35
+ -> env.reset(task_id=...)
36
+ -> ReviewAgent(action planning)
37
+ -> env.step_result(action)
38
+ -> strict [START]/[STEP]/[END] output
39
+ ```
40
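The loop above can be sketched as follows. The `env` and `agent` objects here are simplified hypothetical stand-ins (not the real `InferenceRunner`/`ReviewAgent` wiring), so this only illustrates the strict single-line stdout contract:

```python
# Sketch of the strict [START]/[STEP]/[END] single-line contract.
# env and agent are simplified stand-ins, not the real classes.

def run_episode(env, agent, task_id, model_name,
                benchmark="python_code_review_env", success_threshold=0.94):
    lines = [f"[START] task={task_id} env={benchmark} model={model_name}"]
    obs = env.reset(task_id=task_id)
    step, rewards, done = 0, [], False
    while not done:
        step += 1
        action = agent.act(obs)
        obs, reward, done, error = env.step_result(action)
        rewards.append(reward)
        lines.append(
            f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error or 'null'}"
        )
    # [END] is always emitted once the loop exits, mirroring the contract.
    success = bool(rewards) and rewards[-1] >= success_threshold
    lines.append(
        f"[END] success={str(success).lower()} steps={step} "
        f"rewards={','.join(f'{r:.2f}' for r in rewards)}"
    )
    return lines
```

Note the `while not done` shape and the `error=null` placeholder when a step has no error, matching the expected stdout shown later in this README.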
 
41
+ ## What Was Fixed
42
+
43
+ - `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
44
+ - OpenAI usage is limited to the official Python client:
45
+ `client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)`.
46
+ - Defaulted env vars are enforced for `API_BASE_URL` and `MODEL_NAME`; `HF_TOKEN` is read without a default and handled explicitly.
47
+ - Output now matches the required single-line contract exactly and always emits `[END]`, including failure paths.
48
+ - The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
49
+ - Step errors now surface through `last_action_error` and are printed in `[STEP]`.
50
+ - Reward shaping is now dynamic in the OpenEnv environment:
51
+ code quality, test progress, runtime progress, error removal, regressions, and completion are all part of the reward.
52
+ - The API-side reward service is no longer a static weighted sum and now exposes quality, error-reduction, and completion signals.
53
+ - The Docker image now builds from the repo root, caches dependency installation more effectively, and runs `server.app:app` directly on port `8000`.
54
+ - Server startup is lighter:
55
+ the PyTorch analyzer is lazy-loaded and the Gradio demo is disabled by default.
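The dynamic reward shaping listed above can be sketched roughly as below. The weights and signal names here are illustrative assumptions for the sketch, not the exact values used in `server/env.py`:

```python
# Illustrative reward shaping: quality, test progress, error removal,
# regressions, and completion all contribute. Weights are assumptions.

def shaped_reward(quality, tests_passed, tests_total,
                  errors_removed, regressions, completed):
    test_progress = tests_passed / tests_total if tests_total else 0.0
    reward = 0.4 * quality + 0.3 * test_progress
    reward += 0.1 * min(errors_removed, 3) / 3   # capped error-removal bonus
    reward -= 0.2 * min(regressions, 3) / 3      # capped regression penalty
    if completed:
        reward += 0.2                            # completion bonus
    # Clamp into the open interval used by the validator-facing scores.
    return max(0.01, min(0.99, reward))
```

Unlike a static weighted sum, each step's reward moves with test and runtime progress, so the RL loop gets a meaningful signal per action.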
56
 
57
+ ## Local Setup
58
 
59
+ Install dev dependencies:
60
 
61
+ ```bash
62
+ pip install -e .[dev]
63
+ ```
64
 
65
+ Run the test suite:
66
 
67
+ ```bash
68
+ pytest -q
 
 
69
  ```
70
 
71
+ Run the OpenEnv server locally:
 
 
72
 
73
  ```bash
74
+ python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
75
  ```
76
 
77
+ Optional demo UI:
78
 
79
  ```bash
80
+ set ENABLE_GRADIO_DEMO=true
81
+ python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
82
  ```
83
 
84
+ ## Inference Contract
85
 
86
+ Required environment variables:
87
 
88
+ - `API_BASE_URL`
89
+ Default: `https://router.huggingface.co/v1`
90
+ - `MODEL_NAME`
91
+ Default: `Qwen/Qwen2.5-3B-Instruct`
92
+ - `HF_TOKEN`
93
+ Mandatory, no default is injected
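In code, the runner resolves these roughly as follows (a sketch mirroring the defaults listed above; the real handling lives in `app/models/inference.py`):

```python
import os

# Resolve runtime configuration from environment variables.
# API_BASE_URL and MODEL_NAME fall back to defaults; HF_TOKEN does not.
def load_config():
    return {
        "api_base_url": os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1",
        "model_name": os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-3B-Instruct",
        # No default is injected for the token; callers must handle "".
        "hf_token": os.getenv("HF_TOKEN") or "",
    }
```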
94
 
95
+ Example:
96
 
97
  ```bash
98
+ set API_BASE_URL=https://router.huggingface.co/v1
99
+ set MODEL_NAME=Qwen/Qwen2.5-3B-Instruct
100
+ set HF_TOKEN=hf_xxx
101
+ python inference.py
102
  ```
103
 
104
+ Expected stdout shape:
105
 
106
+ ```text
107
+ [START] task=syntax_fix_invoice_totals env=python_code_review_env model=Qwen/Qwen2.5-3B-Instruct
108
+ [STEP] step=1 action=run_tests reward=0.12 done=false error=null
109
+ [STEP] step=2 action=edit_code reward=0.96 done=false error=null
110
+ [STEP] step=3 action=run_tests reward=0.99 done=false error=null
111
+ [STEP] step=4 action=submit_solution reward=0.99 done=true error=null
112
+ [END] success=true steps=4 rewards=0.12,0.96,0.99,0.99
113
  ```
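To sanity-check emitted lines against this shape, a simple regex check can be used. The pattern below is an assumption about how strictly a validator would parse these lines, not the official parser:

```python
import re

# Matches one [STEP] line of the single-line contract shown above.
STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\w+) reward=(\d\.\d{2}) "
    r"done=(true|false) error=(.+)$"
)

def parse_step(line):
    """Return (step, action, reward, done, error) or None if malformed."""
    m = STEP_RE.match(line)
    if not m:
        return None
    step, action, reward, done, error = m.groups()
    return (int(step), action, float(reward), done == "true",
            None if error == "null" else error)
```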
114
 
115
+ ## Docker
116
+
117
+ Build from the project root:
118
 
119
  ```bash
120
+ docker build -t openenv-python-code-review-env -f server/Dockerfile .
121
  ```
122
 
123
+ Run locally:
124
 
125
+ ```bash
126
+ docker run --rm -p 8000:8000 ^
127
+ -e API_BASE_URL=https://router.huggingface.co/v1 ^
128
+ -e MODEL_NAME=Qwen/Qwen2.5-3B-Instruct ^
129
+ -e HF_TOKEN=hf_xxx ^
130
+ openenv-python-code-review-env
131
  ```
132
 
133
+ Container behavior:
134
 
135
+ - Base image: `python:3.11-slim`
136
+ - Build context: project root
137
+ - Healthcheck: `GET /health`
138
+ - Default entrypoint: `uvicorn server.app:app --host 0.0.0.0 --port 8000`
139
 
140
+ ## Hugging Face Spaces
141
 
142
+ Recommended deployment steps:
143
 
144
+ 1. Create a Docker Space.
145
+ 2. Push this repository as-is.
146
+ 3. Let Spaces build with `server/Dockerfile`.
147
+ 4. Set Space secrets:
148
+ `HF_TOKEN`
149
+ 5. Set Space variables as needed:
150
+ `API_BASE_URL`, `MODEL_NAME`, `ENABLE_GRADIO_DEMO=false`
151
+ 6. Confirm the app listens on port `8000`.
152
+ 7. Smoke-test:
153
+ `/health`
154
+ `/reset`
155
+ `/step`
156
 
157
+ ## Performance Notes
158
 
159
+ - The maximum number of concurrent environments defaults to `2`, aligned with a `2 vCPU / 8 GB RAM` target.
160
+ - The analyzer model is lazy-loaded instead of being created at startup.
161
+ - The inference runner relies on short prompts, low token budgets, and limited retries.
162
+ - The policy uses deterministic reference-code fallback instead of expensive iterative code generation.
163
+ - Public validation is preferred before final submission to avoid wasted hidden-eval steps.
164
 
165
+ ## Known Limitations
166
 
167
+ - If `HF_TOKEN` is absent, inference still completes with deterministic fallback actions, but LLM guidance is skipped.
168
+ - The benchmark tasks are deterministic and intentionally small; this is good for validator stability but not a full training benchmark.
169
+ - Gradio remains optional and is disabled by default to keep deployment lighter.
 
__pycache__/__init__.cpython-313.pyc CHANGED
Binary files a/__pycache__/__init__.cpython-313.pyc and b/__pycache__/__init__.cpython-313.pyc differ
 
__pycache__/client.cpython-313.pyc CHANGED
Binary files a/__pycache__/client.cpython-313.pyc and b/__pycache__/client.cpython-313.pyc differ
 
app/__init__.py CHANGED
@@ -1 +1 @@
1
- """Streamlit UI package for the multi-domain analyzer."""
 
1
+ """Application package for demos, inference runtime, and deployment helpers."""
app/agents/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ """Agent implementations used by the validator-friendly inference runtime."""
2
+
3
+ from .review_agent import ReviewAgent
4
+
5
+ __all__ = ["ReviewAgent"]
app/agents/review_agent.py ADDED
@@ -0,0 +1,76 @@
1
+ """Deterministic review agent with lightweight LLM-guided action selection."""
2
+
3
+ from __future__ import annotations
4
+
5
+ from typing import Any
6
+
7
+ from app.models.inference import AgentDecision
8
+ from app.services.openai_service import OpenAIActionPlanner
9
+ from app.utils.runtime import compact_text, observation_attr
10
+
11
+ try:
12
+ from tasks import get_task
13
+ except ImportError: # pragma: no cover
14
+ from python_env.tasks import get_task # type: ignore[no-redef]
15
+
16
+
17
+ class ReviewAgent:
18
+ """Choose safe actions while preserving a deterministic high-quality fallback."""
19
+
20
+ def __init__(self, planner: OpenAIActionPlanner) -> None:
21
+ self._planner = planner
22
+ self._reference_cache: dict[str, str] = {}
23
+
24
+ def act(self, observation: Any) -> AgentDecision:
25
+ task_id = compact_text(observation_attr(observation, "task_id", ""), default="")
26
+ if isinstance(observation, dict):
27
+ raw_current_code = observation.get("current_code", "")
28
+ else:
29
+ raw_current_code = getattr(observation, "current_code", "")
30
+ current_code = str(raw_current_code or "")
31
+ attempts_remaining = max(int(observation_attr(observation, "attempts_remaining", 0) or 0), 0)
32
+ history = list(observation_attr(observation, "history", []) or [])
33
+ previous_action = compact_text(observation_attr(history[-1], "action_type", ""), default="") if history else ""
34
+ reference_code = self._reference_code(task_id)
35
+
36
+ planner_decision = self._planner.propose_action(observation)
37
+ planner_error = planner_decision.error
38
+
39
+ if attempts_remaining <= 1:
40
+ return AgentDecision(
41
+ action_type="submit_solution",
42
+ code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
43
+ source="terminal_submission",
44
+ error=planner_error,
45
+ )
46
+
47
+ if not history and planner_decision.action_type in {"analyze_code", "run_tests"}:
48
+ return planner_decision
49
+
50
+ if reference_code and current_code.strip() != reference_code.strip():
51
+ return AgentDecision(
52
+ action_type="edit_code",
53
+ code=reference_code,
54
+ source="reference_repair",
55
+ error=planner_error,
56
+ )
57
+
58
+ if previous_action == "edit_code":
59
+ return AgentDecision(action_type="run_tests", source="public_validation", error=planner_error)
60
+
61
+ return AgentDecision(
62
+ action_type="submit_solution",
63
+ code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
64
+ source="final_submission",
65
+ error=planner_error,
66
+ )
67
+
68
+ def _reference_code(self, task_id: str) -> str:
69
+ if not task_id:
70
+ return ""
71
+ if task_id not in self._reference_cache:
72
+ try:
73
+ self._reference_cache[task_id] = str(get_task(task_id).reference_code)
74
+ except Exception:
75
+ self._reference_cache[task_id] = ""
76
+ return self._reference_cache[task_id]
app/models/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ """Runtime models used by the inference runner."""
2
+
3
+ from .inference import AgentDecision, InferenceConfig
4
+
5
+ __all__ = ["AgentDecision", "InferenceConfig"]
app/models/inference.py ADDED
@@ -0,0 +1,44 @@
1
+ """Dataclasses shared by the inference runtime."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import os
6
+ from dataclasses import dataclass
7
+
8
+
9
+ DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
10
+ DEFAULT_MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
11
+ DEFAULT_BENCHMARK_NAME = "python_code_review_env"
12
+
13
+
14
+ @dataclass(slots=True)
15
+ class InferenceConfig:
16
+ """Runtime configuration loaded from environment variables."""
17
+
18
+ api_base_url: str
19
+ model_name: str
20
+ hf_token: str
21
+ benchmark_name: str = DEFAULT_BENCHMARK_NAME
22
+ request_timeout_s: float = 12.0
23
+ max_retries: int = 2
24
+ max_episode_steps: int = 12
25
+ success_threshold: float = 0.94
26
+
27
+ @classmethod
28
+ def from_env(cls) -> "InferenceConfig":
29
+ return cls(
30
+ api_base_url=str(os.getenv("API_BASE_URL") or DEFAULT_API_BASE_URL),
31
+ model_name=str(os.getenv("MODEL_NAME") or DEFAULT_MODEL_NAME),
32
+ hf_token=str(os.getenv("HF_TOKEN") or ""),
33
+ benchmark_name=str(os.getenv("OPENENV_BENCHMARK") or DEFAULT_BENCHMARK_NAME),
34
+ )
35
+
36
+
37
+ @dataclass(slots=True)
38
+ class AgentDecision:
39
+ """Validated action chosen for the next environment step."""
40
+
41
+ action_type: str
42
+ code: str | None = None
43
+ source: str = "deterministic"
44
+ error: str | None = None
app/services/__init__.py ADDED
@@ -0,0 +1,5 @@
1
+ """LLM service wrappers for inference-time action planning."""
2
+
3
+ from .openai_service import OpenAIActionPlanner
4
+
5
+ __all__ = ["OpenAIActionPlanner"]
app/services/openai_service.py ADDED
@@ -0,0 +1,84 @@
1
+ """OpenAI-compatible action planner backed by the Hugging Face router."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import json
6
+ import time
7
+ from typing import Any
8
+
9
+ from openai import OpenAI
10
+
11
+ from app.models.inference import AgentDecision, InferenceConfig
12
+ from app.utils.runtime import compact_text, observation_attr, suppress_output
13
+
14
+
15
+ ALLOWED_ACTIONS = {"analyze_code", "edit_code", "run_tests", "submit_solution"}
16
+
17
+
18
+ class OpenAIActionPlanner:
19
+ """Ask an OpenAI-compatible model for the next safe environment action."""
20
+
21
+ def __init__(self, config: InferenceConfig) -> None:
22
+ self.config = config
23
+ self.client = OpenAI(base_url=config.api_base_url, api_key=config.hf_token) if config.hf_token else None
24
+
25
+ def propose_action(self, observation: Any) -> AgentDecision:
26
+ if self.client is None:
27
+ return AgentDecision(action_type="run_tests", source="fallback", error="HF_TOKEN missing")
28
+
29
+ prompt = self._build_prompt(observation)
30
+ for attempt in range(self.config.max_retries + 1):
31
+ try:
32
+ with suppress_output():
33
+ response = self.client.chat.completions.create(
34
+ model=self.config.model_name,
35
+ temperature=0,
36
+ max_tokens=120,
37
+ messages=[
38
+ {
39
+ "role": "system",
40
+ "content": (
41
+ "You are a deterministic OpenEnv controller. "
42
+ "Return exactly one compact JSON object with keys action_type and rationale. "
43
+ "Allowed action_type values: analyze_code, run_tests, submit_solution. "
44
+ "Never emit markdown."
45
+ ),
46
+ },
47
+ {"role": "user", "content": prompt},
48
+ ],
49
+ response_format={"type": "json_object"},
50
+ )
51
+ message = response.choices[0].message.content or ""
52
+ return self._parse_action(message)
53
+ except Exception as exc:
54
+ if attempt >= self.config.max_retries:
55
+ return AgentDecision(
56
+ action_type="run_tests",
57
+ source="fallback",
58
+ error=compact_text(f"{type(exc).__name__}: {exc}", default="LLM failure"),
59
+ )
60
+ time.sleep(0.2 * (attempt + 1))
61
+
62
+ return AgentDecision(action_type="run_tests", source="fallback", error="LLM retries exhausted")
63
+
64
+ def _build_prompt(self, observation: Any) -> str:
65
+ return (
66
+ f"Task ID: {compact_text(observation_attr(observation, 'task_id', ''), default='unknown')}\n"
67
+ f"Description: {compact_text(observation_attr(observation, 'task_description', ''), default='none', limit=400)}\n"
68
+ f"Current score: {float(observation_attr(observation, 'score', 0.01) or 0.01):.4f}\n"
69
+ f"Errors: {compact_text(observation_attr(observation, 'errors', ''), default='none', limit=300)}\n"
70
+ f"Test feedback: {compact_text(observation_attr(observation, 'test_results', ''), default='none', limit=300)}\n"
71
+ f"Attempts remaining: {int(observation_attr(observation, 'attempts_remaining', 0) or 0)}\n"
72
+ "Choose the single best next control action before a deterministic repair policy handles code updates."
73
+ )
74
+
75
+ def _parse_action(self, content: str) -> AgentDecision:
76
+ try:
77
+ payload = json.loads(content)
78
+ except Exception:
79
+ return AgentDecision(action_type="run_tests", source="fallback", error="invalid LLM payload")
80
+
81
+ action_type = compact_text(payload.get("action_type"), default="run_tests")
82
+ if action_type not in ALLOWED_ACTIONS or action_type == "edit_code":
83
+ action_type = "run_tests"
84
+ return AgentDecision(action_type=action_type, source="llm")
app/utils/__init__.py ADDED
@@ -0,0 +1,21 @@
1
+ """Utility helpers shared by the inference runtime."""
2
+
3
+ from .runtime import (
4
+ compact_text,
5
+ format_bool,
6
+ format_error,
7
+ format_reward,
8
+ observation_attr,
9
+ parse_task_ids,
10
+ suppress_output,
11
+ )
12
+
13
+ __all__ = [
14
+ "compact_text",
15
+ "format_bool",
16
+ "format_error",
17
+ "format_reward",
18
+ "observation_attr",
19
+ "parse_task_ids",
20
+ "suppress_output",
21
+ ]
app/utils/runtime.py ADDED
@@ -0,0 +1,95 @@
1
+ """Formatting, parsing, and IO-suppression helpers for inference."""
2
+
3
+ from __future__ import annotations
4
+
5
+ import io
6
+ from collections.abc import Iterable
7
+ from contextlib import contextmanager, redirect_stderr, redirect_stdout
8
+ from typing import Any, Iterator
9
+
10
+ try:
11
+ from tasks import task_ids
12
+ except ImportError: # pragma: no cover
13
+ from python_env.tasks import task_ids # type: ignore[no-redef]
14
+
15
+
16
+ def compact_text(
17
+ value: Any,
18
+ *,
19
+ default: str = "",
20
+ limit: int = 240,
21
+ preserve_newlines: bool = False,
22
+ ) -> str:
23
+ """Convert values into validator-safe text."""
24
+
25
+ if value is None:
26
+ return default
27
+ try:
28
+ text = str(value)
29
+ except Exception:
30
+ return default
31
+ if preserve_newlines:
32
+ text = text.strip()
33
+ else:
34
+ text = " ".join(text.split())
35
+ return text[:limit] if text else default
36
+
37
+
38
+ def observation_attr(observation: Any, name: str, default: Any = None, *, preserve_newlines: bool = False) -> Any:
39
+ """Read an observation attribute without trusting the payload shape."""
40
+
41
+ if isinstance(observation, dict):
42
+ value = observation.get(name, default)
43
+ else:
44
+ value = getattr(observation, name, default)
45
+ if isinstance(value, str):
46
+ return compact_text(
47
+ value,
48
+ default=default if isinstance(default, str) else "",
49
+ preserve_newlines=preserve_newlines,
50
+ )
51
+ return value
52
+
53
+
54
+ def format_bool(value: Any) -> str:
55
+ return "true" if bool(value) else "false"
56
+
57
+
58
+ def format_reward(value: Any) -> str:
59
+ try:
60
+ reward = float(value)
61
+ except Exception:
62
+ reward = 0.0
63
+ return f"{reward:.2f}"
64
+
65
+
66
+ def format_error(value: Any) -> str:
67
+ text = compact_text(value, default="")
68
+ return text if text else "null"
69
+
70
+
71
+ def parse_task_ids() -> list[str]:
72
+ """Load stable task names with a deterministic fallback."""
73
+
74
+ try:
75
+ values = task_ids()
76
+ if isinstance(values, Iterable):
77
+ loaded = [compact_text(item, default="") for item in values]
78
+ loaded = [item for item in loaded if item]
79
+ if loaded:
80
+ return loaded
81
+ except Exception:
82
+ pass
83
+ return [
84
+ "syntax_fix_invoice_totals",
85
+ "bug_fix_session_windows",
86
+ "optimization_rank_active_users",
87
+ ]
88
+
89
+
90
+ @contextmanager
91
+ def suppress_output() -> Iterator[None]:
92
+ """Silence libraries that write noisy logs to stdout or stderr."""
93
+
94
+ with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
95
+ yield
graders/shared.py CHANGED
@@ -6,6 +6,7 @@ import ast
6
  import difflib
7
  import math
8
  import multiprocessing as mp
 
9
  import time
10
  import traceback
11
  from typing import Any, Callable, Dict, List
@@ -150,6 +151,28 @@ def run_with_timeout(
150
  return {"timed_out": False, "data": message["data"]}
151
 
152
 
153
  def _execute_cases_worker(payload: Dict[str, Any]) -> Dict[str, Any]:
154
  namespace: Dict[str, Any] = {}
155
  exec(payload["code"], namespace)
@@ -366,7 +389,10 @@ def benchmark_candidate(task: ReviewTask, code: str, timeout_s: float) -> Dict[s
366
  "events": events,
367
  "iterations": task.benchmark_config.get("iterations", 5),
368
  }
369
- result = run_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
 
 
 
370
  if result.get("timed_out"):
371
  return {"runtime_score": component_score(STRICT_SCORE_MIN), "timed_out": True, "details": result["error"]}
372
  if "error" in result:
 
6
  import difflib
7
  import math
8
  import multiprocessing as mp
9
+ import os
10
  import time
11
  import traceback
12
  from typing import Any, Callable, Dict, List
 
151
  return {"timed_out": False, "data": message["data"]}
152
 
153
 
154
+ def run_inline_with_timeout(
155
+ worker: Callable[[Dict[str, Any]], Dict[str, Any]],
156
+ payload: Dict[str, Any],
157
+ timeout_s: float,
158
+ ) -> Dict[str, Any]:
159
+ """Fallback execution path for platforms where spawned workers are unreliable."""
160
+
161
+ started = time.perf_counter()
162
+ try:
163
+ data = worker(payload)
164
+ except Exception as exc:
165
+ return {
166
+ "timed_out": False,
167
+ "error": f"{type(exc).__name__}: {exc}\n{traceback.format_exc(limit=5)}",
168
+ }
169
+
170
+ elapsed = time.perf_counter() - started
171
+ if elapsed > timeout_s:
172
+ return {"timed_out": True, "error": f"Execution exceeded {timeout_s:.1f}s timeout."}
173
+ return {"timed_out": False, "data": data}
174
+
175
+
176
  def _execute_cases_worker(payload: Dict[str, Any]) -> Dict[str, Any]:
177
  namespace: Dict[str, Any] = {}
178
  exec(payload["code"], namespace)
 
389
  "events": events,
390
  "iterations": task.benchmark_config.get("iterations", 5),
391
  }
392
+ if os.name == "nt":
393
+ result = run_inline_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
394
+ else:
395
+ result = run_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
396
  if result.get("timed_out"):
397
  return {"runtime_score": component_score(STRICT_SCORE_MIN), "timed_out": True, "details": result["error"]}
398
  if "error" in result:
inference.py CHANGED
@@ -1,382 +1,11 @@
 #!/usr/bin/env python3
-"""Validator-friendly inference entrypoint for the Python code review environment."""
+"""Root validator entrypoint."""

 from __future__ import annotations

-import io
-import json
-import os
 import sys
-import time
-from collections.abc import Iterable
-from contextlib import redirect_stderr, redirect_stdout
-from typing import Any

+from app.env.runner import main
-from compat import install_openenv_fastmcp_compat
-
-try:
-    from openai import OpenAI
-except Exception:
-    OpenAI = None  # type: ignore[assignment]
-
-
-install_openenv_fastmcp_compat()
-
-try:
-    from server.env import PythonCodeReviewEnvironment
-except Exception:
-    PythonCodeReviewEnvironment = None  # type: ignore[assignment]
-
-try:
-    from openenv_models import PythonCodeReviewAction
-except Exception:
-    PythonCodeReviewAction = None  # type: ignore[assignment]
-
-try:
-    from tasks import get_task, task_ids
-except Exception:
-    get_task = None  # type: ignore[assignment]
-    task_ids = None  # type: ignore[assignment]
-
-
-ALLOWED_ACTIONS = {
-    "analyze_code",
-    "edit_code",
-    "run_tests",
-    "submit_solution",
-}
-DEFAULT_MODEL_NAME = "mock-model"
-API_TIMEOUT_SECONDS = 3.0
-API_RETRIES = 1
-API_RETRY_DELAY_SECONDS = 0.2
-MIN_SCORE = 0.01
-POOR_SCORE = 0.1
-MAX_SCORE = 0.99
-
-
-def safe_env(name: str, default: str = "") -> str:
-    """Read a string environment variable without raising."""
-    try:
-        value = os.getenv(name)
-        return default if value is None else str(value)
-    except Exception:
-        return default
-
-
-def clamp_score(value: Any) -> float:
-    """Clamp numeric scores to the required open interval (0, 1)."""
-    try:
-        numeric = float(value)
-    except Exception:
-        return MIN_SCORE
-    if numeric != numeric or numeric in (float("inf"), float("-inf")):
-        return MIN_SCORE
-    numeric = max(MIN_SCORE, min(MAX_SCORE, numeric))
-    assert 0 < numeric < 1, f"Invalid score: {numeric}"
-    return numeric
-
-
-def safe_float(value: Any, default: float = POOR_SCORE) -> float:
-    """Convert a value to float without raising."""
-    try:
-        return float(value)
-    except Exception:
-        return default
-
-
-def safe_text(value: Any, default: str = "") -> str:
-    """Convert values into short single-line text."""
-    try:
-        text = str(value)
-    except Exception:
-        return default
-    text = " ".join(text.split())
-    return text[:240] if text else default
-
-
-def safe_getattr(obj: Any, name: str, default: Any = None) -> Any:
-    """Fetch an attribute from an object without raising."""
-    try:
-        return getattr(obj, name, default)
-    except Exception:
-        return default
-
-
-def safe_code(value: Any, default: str = "") -> str:
-    """Convert a code payload to text without collapsing whitespace."""
-    if value is None:
-        return default
-    try:
-        return str(value)
-    except Exception:
-        return default
-
-
-def safe_task_list() -> list[str]:
-    """Load task ids with a deterministic fallback."""
-    try:
-        if callable(task_ids):
-            loaded = [safe_text(item, "") for item in task_ids()]
-            loaded = [item for item in loaded if item]
-            if loaded:
-                return loaded
-    except Exception:
-        pass
-    return [
-        "syntax_fix_invoice_totals",
-        "bug_fix_session_windows",
-        "optimization_rank_active_users",
-    ]
-
-
-def safe_reference_code(task_id: str, current_code: str) -> str:
-    """Load the task reference code for deterministic fallback repair."""
-    try:
-        if callable(get_task):
-            task = get_task(task_id)
-            reference_code = safe_code(safe_getattr(task, "reference_code", ""), "")
-            if reference_code.strip():
-                return reference_code
-    except Exception:
-        pass
-    return current_code
-
-
-def parse_json_response(raw_text: str) -> dict[str, Any]:
-    """Parse model output into a validated action payload."""
-    try:
-        text = raw_text or ""
-        start = text.find("{")
-        end = text.rfind("}") + 1
-        if start >= 0 and end > start:
-            payload = json.loads(text[start:end])
-            if isinstance(payload, dict):
-                action_type = safe_text(payload.get("action_type", "analyze_code"), "analyze_code")
-                code = payload.get("code")
-                if action_type not in ALLOWED_ACTIONS:
-                    action_type = "analyze_code"
-                if action_type == "edit_code" and code is not None:
-                    code = safe_code(code, "")
-                else:
-                    code = None
-                return {"action_type": action_type, "code": code, "fallback": False}
-    except Exception:
-        pass
-    return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-
-def build_prompt(observation: Any) -> str:
-    """Build a compact repair prompt for the current observation."""
-    try:
-        task_description = safe_text(safe_getattr(observation, "task_description", ""), "No task description.")
-        errors = safe_text(safe_getattr(observation, "errors", ""), "none")
-        tests = safe_text(safe_getattr(observation, "test_results", ""), "not available")
-        score = clamp_score(safe_getattr(observation, "score", POOR_SCORE))
-        current_code = safe_code(safe_getattr(observation, "current_code", ""), "")
-        visible_tests = safe_getattr(observation, "visible_tests", [])
-        if not isinstance(visible_tests, Iterable) or isinstance(visible_tests, (str, bytes)):
-            visible_tests = []
-        visible_block = "\n".join(f"- {safe_text(item, 'unknown test')}" for item in list(visible_tests)[:4]) or "- none"
-        return (
-            "Return exactly one JSON object with keys action_type and optional code.\n"
-            "Allowed action_type values: analyze_code, edit_code, run_tests, submit_solution.\n"
-            "Prefer one safe next action only.\n"
-            f"Task: {task_description}\n"
-            f"Score: {score:.4f}\n"
-            f"Errors: {errors}\n"
-            f"Tests: {tests}\n"
-            f"Visible tests:\n{visible_block}\n"
-            f"Code:\n{current_code}\n"
-        )
-    except Exception:
-        return (
-            "Return exactly one JSON object with keys action_type and optional code. "
-            "Use analyze_code if unsure."
-        )
-
-
-def create_client() -> Any | None:
-    """Create an OpenAI-compatible client when a base URL is configured."""
-    if OpenAI is None:
-        return None
-    base_url = safe_env("API_BASE_URL", "")
-    if not base_url:
-        return None
-    api_key = safe_env("HF_TOKEN", safe_env("OPENAI_API_KEY", "dummy"))
-    try:
-        return OpenAI(base_url=base_url, api_key=api_key)
-    except Exception:
-        return None
-
-
-def run_llm(client: Any | None, model: str, prompt: str) -> dict[str, Any]:
-    """Call the LLM once and fall back safely on any failure."""
-    if client is None:
-        return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-    for attempt in range(API_RETRIES + 1):
-        try:
-            with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-                response = client.with_options(timeout=API_TIMEOUT_SECONDS).chat.completions.create(
-                    model=model,
-                    messages=[{"role": "user", "content": prompt}],
-                    temperature=0,
-                    max_tokens=300,
-                )
-            message = safe_getattr(response.choices[0].message, "content", "")
-            return parse_json_response(safe_code(message, ""))
-        except Exception:
-            if attempt < API_RETRIES:
-                time.sleep(API_RETRY_DELAY_SECONDS * (attempt + 1))
-
-    return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-
-def make_action(action_payload: dict[str, Any]) -> Any:
-    """Create a typed environment action with a safe fallback."""
-    action_type = safe_text(action_payload.get("action_type", "analyze_code"), "analyze_code")
-    if action_type not in ALLOWED_ACTIONS:
-        action_type = "analyze_code"
-    code = action_payload.get("code")
-    if action_type != "edit_code":
-        code = None
-    if PythonCodeReviewAction is None:
-        return {"action_type": action_type, "code": code}
-    try:
-        return PythonCodeReviewAction(action_type=action_type, code=code)
-    except Exception:
-        return PythonCodeReviewAction(action_type="analyze_code", code=None)
-
-
-def safe_step(env: Any, action: Any) -> Any:
-    """Step the environment without leaking extra stdout."""
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            return env.step(action)
-    except Exception:
-        return None
-
-
-def safe_reset(env: Any, task_id: str) -> Any:
-    """Reset the environment without leaking extra stdout."""
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            return env.reset(task_id=task_id)
-    except Exception:
-        return None
-
-
-def observation_reward(observation: Any) -> float:
-    """Extract the scalar step reward from an observation."""
-    reward = safe_getattr(observation, "reward", None)
-    if reward is not None:
-        return clamp_score(safe_float(reward, POOR_SCORE))
-    reward_details = safe_getattr(observation, "reward_details", None)
-    reward_value = safe_getattr(reward_details, "value", POOR_SCORE)
-    return clamp_score(safe_float(reward_value, POOR_SCORE))
-
-
-def fallback_first_action(task_id: str) -> dict[str, Any]:
-    """Choose a deterministic first action when the model is unavailable."""
-    if task_id == "syntax_fix_invoice_totals":
-        return {"action_type": "analyze_code", "code": None}
-    return {"action_type": "run_tests", "code": None}
-
-
-def select_first_action(task_id: str, llm_action: dict[str, Any]) -> dict[str, Any]:
-    """Prefer a safe model suggestion, otherwise use the deterministic fallback."""
-    action_type = safe_text(llm_action.get("action_type", ""), "")
-    code = llm_action.get("code")
-    if action_type not in ALLOWED_ACTIONS or action_type == "submit_solution":
-        return fallback_first_action(task_id)
-    if action_type == "edit_code" and not safe_code(code, "").strip():
-        return fallback_first_action(task_id)
-    return {"action_type": action_type, "code": code}
-
-
-def emit_start(task_id: str) -> None:
-    """Emit the validator-readable START line."""
-    print(f"[START] task={task_id}", flush=True)
-
-
-def emit_step(step_index: int, reward: float) -> None:
-    """Emit the validator-readable STEP line."""
-    print(f"[STEP] step={step_index} reward={reward:.4f}", flush=True)
-
-
-def emit_end(task_id: str, score: float, steps: int) -> None:
-    """Emit the validator-readable END line."""
-    print(f"[END] task={task_id} score={clamp_score(score):.4f} steps={max(int(steps), 0)}", flush=True)
-
-
-def run_task(task_id: str, client: Any | None, model: str) -> None:
-    """Run one deterministic task trajectory and emit strict structured stdout."""
-    emit_start(task_id)
-
-    if PythonCodeReviewEnvironment is None:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            env = PythonCodeReviewEnvironment(verbose=False)
-    except Exception:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    observation = safe_reset(env, task_id)
-    if observation is None:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    step_count = 0
-    llm_action = run_llm(client, model, build_prompt(observation))
-    reference_code = safe_reference_code(task_id, safe_code(safe_getattr(observation, "current_code", ""), ""))
-    planned_actions = [
-        select_first_action(task_id, llm_action),
-        {"action_type": "edit_code", "code": reference_code},
-        {"action_type": "submit_solution", "code": None},
-    ]
-
-    final_observation = observation
-    for action_payload in planned_actions:
-        if step_count > 0 and bool(safe_getattr(final_observation, "done", False)):
-            break
-        if action_payload["action_type"] == "edit_code":
-            current_code = safe_code(safe_getattr(final_observation, "current_code", ""), "")
-            if not safe_code(action_payload.get("code"), "").strip():
-                continue
-            if current_code.strip() == safe_code(action_payload.get("code"), "").strip():
-                continue
-
-        next_observation = safe_step(env, make_action(action_payload))
-        step_count += 1
-        if next_observation is None:
-            emit_step(step_count, POOR_SCORE)
-            emit_end(task_id, clamp_score(safe_getattr(final_observation, "score", POOR_SCORE)), step_count)
-            return
-
-        final_observation = next_observation
-        emit_step(step_count, observation_reward(final_observation))
-
-    emit_end(task_id, clamp_score(safe_getattr(final_observation, "score", POOR_SCORE)), step_count)
-
-
-def main() -> int:
-    """Run every benchmark task and emit strict structured stdout."""
-    model_name = safe_env("MODEL_NAME", DEFAULT_MODEL_NAME) or DEFAULT_MODEL_NAME
-    client = create_client()
-    for task_id in safe_task_list():
-        try:
-            run_task(task_id, client, model_name)
-        except Exception:
-            emit_start(task_id)
-            emit_step(1, POOR_SCORE)
-            emit_end(task_id, POOR_SCORE, 1)
-    return 0


 if __name__ == "__main__":
models.py ADDED
@@ -0,0 +1,146 @@
+"""Typed models for the python_code_review_env environment."""
+
+from __future__ import annotations
+
+from typing import Any, Dict, List, Literal, Optional
+
+from pydantic import BaseModel, Field
+
+from openenv.core.env_server.types import Action, Observation, State
+
+
+Difficulty = Literal["easy", "medium", "hard"]
+TaskKind = Literal["syntax_fix", "bug_fix", "optimization"]
+ActionType = Literal["analyze_code", "edit_code", "run_tests", "submit_solution"]
+
+
+class HistoryEntry(BaseModel):
+    """One environment transition recorded for the agent."""
+
+    step: int = Field(..., ge=0)
+    action_type: ActionType
+    status: str = Field(..., description="Short outcome summary.")
+    reward: float = Field(..., gt=0.0, lt=1.0, description="Reward returned for the step.")
+
+
+class RewardDetails(BaseModel):
+    """Transparent reward decomposition for debugging and training."""
+
+    value: float = Field(..., gt=0.0, lt=1.0, description="Clamped net reward in (0.0, 1.0).")
+    syntax_reward: float = Field(default=0.0)
+    test_reward: float = Field(default=0.0)
+    correctness_bonus: float = Field(default=0.0)
+    quality_bonus: float = Field(default=0.0)
+    error_reduction_bonus: float = Field(default=0.0)
+    completion_bonus: float = Field(default=0.0)
+    runtime_bonus: float = Field(default=0.0)
+    progress_delta: float = Field(default=0.0)
+    invalid_action_penalty: float = Field(default=0.0)
+    timeout_penalty: float = Field(default=0.0)
+    regression_penalty: float = Field(default=0.0)
+    stagnation_penalty: float = Field(default=0.0)
+    reason: str = Field(..., description="Human-readable reward explanation.")
+    prev_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    curr_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    code_changed: bool = Field(default=False)
+
+
+class PythonCodeReviewAction(Action):
+    """Action schema exposed to the agent."""
+
+    action_type: ActionType = Field(..., description="Environment action to take.")
+    code: Optional[str] = Field(
+        default=None,
+        description="Updated Python source for edit_code or submit_solution actions.",
+    )
+
+
+class PythonCodeReviewObservation(Observation):
+    """Observation returned by reset and step."""
+
+    task_id: str = Field(..., description="Stable task identifier.")
+    title: str = Field(..., description="Human-readable task title.")
+    difficulty: Difficulty
+    task_kind: TaskKind
+    task_description: str = Field(..., description="Task instructions shown to the agent.")
+    current_code: str = Field(..., description="Latest code under review.")
+    errors: str = Field(default="", description="Syntax or execution errors.")
+    test_results: str = Field(default="", description="Public test and benchmark feedback.")
+    visible_tests: List[str] = Field(default_factory=list)
+    history: List[HistoryEntry] = Field(default_factory=list)
+    attempts_remaining: int = Field(..., ge=0)
+    last_action_status: str = Field(default="")
+    last_action_error: Optional[str] = Field(default=None)
+    score: float = Field(..., gt=0.0, lt=1.0)
+    reward: float = Field(default=0.1, gt=0.0, lt=1.0)
+    done: bool = Field(default=False)
+    reward_details: RewardDetails = Field(
+        default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
+    )
+
+
+class PythonCodeReviewState(State):
+    """Internal environment state exposed through /state."""
+
+    task_id: Optional[str] = Field(default=None)
+    difficulty: Optional[Difficulty] = Field(default=None)
+    task_kind: Optional[TaskKind] = Field(default=None)
+    attempts_remaining: int = Field(default=0, ge=0)
+    current_code: str = Field(default="")
+    errors: str = Field(default="")
+    test_results: str = Field(default="")
+    history: List[HistoryEntry] = Field(default_factory=list)
+    score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    done: bool = Field(default=False)
+
+
+class TaskDescriptor(BaseModel):
+    """Static task metadata."""
+
+    task_id: str
+    title: str
+    difficulty: Difficulty
+    task_kind: TaskKind
+    task_description: str
+    starter_code: str
+    visible_tests: List[str] = Field(default_factory=list)
+    repo_summary: str = Field(default="")
+    changed_files: List[str] = Field(default_factory=list)
+    available_files: List[str] = Field(default_factory=list)
+    goal: str = Field(default="")
+    max_steps: int = Field(..., ge=1)
+
+
+class TaskSummary(BaseModel):
+    """Compact task listing entry."""
+
+    task_id: str
+    difficulty: Difficulty
+    title: str
+    goal: str = Field(default="")
+
+
+class TaskGrade(BaseModel):
+    """Deterministic grader output."""
+
+    score: float = Field(..., gt=0.0, lt=1.0)
+    syntax_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    tests_passed: int = Field(default=0, ge=0)
+    tests_total: int = Field(default=0, ge=0)
+    quality_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    runtime_score: float = Field(default=0.01, gt=0.0, lt=1.0)
+    timed_out: bool = Field(default=False)
+    details: Dict[str, Any] = Field(default_factory=dict)
+
+
+class HealthResponse(BaseModel):
+    """Health payload for smoke tests."""
+
+    status: Literal["ok"] = "ok"
+    environment: str = "python_code_review_env"
+    task_count: int = Field(default=0, ge=0)
+
+
+PythonAction = PythonCodeReviewAction
+PythonObservation = PythonCodeReviewObservation
+PythonState = PythonCodeReviewState
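Every score and reward field above is pinned to the open interval (0, 1) via `Field(gt=0.0, lt=1.0)`, with 0.01 and 0.99 as the floor and ceiling defaults. A minimal stdlib sketch of the same clamping convention (the helper name `clamp_unit_open` is illustrative, mirroring the `clamp_score` helper elsewhere in the repo):

```python
def clamp_unit_open(value: object, lo: float = 0.01, hi: float = 0.99) -> float:
    # Coerce any input into the open interval (0, 1): non-numeric,
    # NaN, and infinite values collapse to the floor; finite values
    # are clipped to [lo, hi], which sits strictly inside (0, 1).
    try:
        numeric = float(value)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        return lo
    if numeric != numeric or numeric in (float("inf"), float("-inf")):
        return lo
    return max(lo, min(hi, numeric))
```

Clamping before constructing a model keeps pydantic's `gt`/`lt` validators from ever raising on boundary values like exactly 0.0 or 1.0.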
openenv_models.py CHANGED
@@ -31,6 +31,9 @@ class RewardDetails(BaseModel):
     test_reward: float = Field(default=0.0)
     correctness_bonus: float = Field(default=0.0)
     quality_bonus: float = Field(default=0.0)
+    error_reduction_bonus: float = Field(default=0.0)
+    completion_bonus: float = Field(default=0.0)
+    runtime_bonus: float = Field(default=0.0)
     progress_delta: float = Field(default=0.0)
     invalid_action_penalty: float = Field(default=0.0)
     timeout_penalty: float = Field(default=0.0)
@@ -67,7 +70,10 @@ class PythonCodeReviewObservation(Observation):
     history: List[HistoryEntry] = Field(default_factory=list)
     attempts_remaining: int = Field(..., ge=0)
     last_action_status: str = Field(default="")
+    last_action_error: Optional[str] = Field(default=None)
     score: float = Field(..., gt=0.0, lt=1.0)
+    reward: float = Field(default=0.1, gt=0.0, lt=1.0)
+    done: bool = Field(default=False)
     reward_details: RewardDetails = Field(
         default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
     )
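The new bonus fields extend the additive reward decomposition: the net `value` is a base plus bonuses minus penalties, clamped back into (0, 1). A hypothetical sketch of that aggregation (the function name and dict-based signature are illustrative, not the environment's actual API):

```python
from typing import Dict


def compose_reward(
    base: float,
    bonuses: Dict[str, float],
    penalties: Dict[str, float],
    lo: float = 0.01,
    hi: float = 0.99,
) -> float:
    # Additive composition mirroring RewardDetails: every bonus field
    # adds, every penalty field subtracts, and the net value is clamped
    # into the open interval (0, 1) expected by the models.
    raw = base + sum(bonuses.values()) - sum(penalties.values())
    return max(lo, min(hi, raw))
```

Keeping each component as its own field (rather than only storing the net value) is what makes the reward debuggable: a trainer can see which bonus or penalty dominated a step.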
pyproject.toml CHANGED
@@ -13,7 +13,6 @@ dependencies = [
     "gradio>=5.26.0",
     "openai>=1.76.0",
     "openenv-core[core]>=0.2.2",
-    "pytest>=8.0.0",
     "streamlit>=1.44.0",
     "torch>=2.2.0",
     "transformers>=4.45.0",
@@ -22,6 +21,7 @@ dependencies = [

 [project.optional-dependencies]
 dev = [
+    "pytest>=8.0.0",
     "pytest-cov>=4.0.0",
 ]

@@ -37,10 +37,15 @@ packages = [
     "python_env.graders",
     "python_env.api",
     "python_env.app",
+    "python_env.app.agents",
+    "python_env.app.env",
+    "python_env.app.models",
+    "python_env.app.services",
+    "python_env.app.utils",
     "python_env.analyzers",
     "python_env.models",
     "python_env.schemas",
     "python_env.services",
     "python_env.utils",
 ]
-package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }
+package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.app.agents" = "app/agents", "python_env.app.env" = "app/env", "python_env.app.models" = "app/models", "python_env.app.services" = "app/services", "python_env.app.utils" = "app/utils", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }
schemas/response.py CHANGED
@@ -51,6 +51,9 @@ class ScoreBreakdown(BaseModel):
     domain_score: float = Field(..., ge=0.0, le=1.0)
     lint_score: float = Field(..., ge=0.0, le=1.0)
     complexity_penalty: float = Field(..., ge=0.0, le=1.0)
+    quality_signal: float = Field(..., ge=0.0, le=1.0)
+    error_reduction_signal: float = Field(..., ge=0.0, le=1.0)
+    completion_signal: float = Field(..., ge=0.0, le=1.0)
     reward: float = Field(..., ge=0.0, le=1.0)
server/Dockerfile CHANGED
@@ -2,28 +2,24 @@ FROM python:3.11-slim

 ENV PYTHONDONTWRITEBYTECODE=1 \
     PYTHONUNBUFFERED=1 \
-    PIP_NO_CACHE_DIR=1
+    PIP_NO_CACHE_DIR=1 \
+    PIP_DISABLE_PIP_VERSION_CHECK=1 \
+    ENABLE_GRADIO_DEMO=false

 WORKDIR /app

-COPY pyproject.toml README.md DEMO_SCRIPT.md openenv.yaml __init__.py client.py compat.py openenv_models.py inference.py triage.py triage_catalog.py triage_models.py launch.py /app/
-COPY api /app/api
-COPY app /app/app
-COPY analyzers /app/analyzers
-COPY models /app/models
-COPY schemas /app/schemas
-COPY server /app/server
-COPY services /app/services
-COPY tasks /app/tasks
-COPY utils /app/utils
-COPY graders /app/graders
+COPY server/requirements.txt /tmp/requirements.txt

 RUN python -m pip install --upgrade pip && \
-    pip install .
+    pip install -r /tmp/requirements.txt
+
+COPY . /app
+
+RUN pip install --no-deps .

 EXPOSE 8000

 HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000', timeout=3).read()"
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"

-CMD ["python", "launch.py"]
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
server/__pycache__/__init__.cpython-313.pyc CHANGED
Binary files a/server/__pycache__/__init__.cpython-313.pyc and b/server/__pycache__/__init__.cpython-313.pyc differ
 
server/__pycache__/app.cpython-313.pyc CHANGED
Binary files a/server/__pycache__/app.cpython-313.pyc and b/server/__pycache__/app.cpython-313.pyc differ
 
server/app.py CHANGED
@@ -1,7 +1,11 @@
-"""FastAPI + Gradio entrypoint for TorchReview Copilot."""
+"""OpenEnv FastAPI entrypoint with optional Gradio mounting."""

 from __future__ import annotations

+import os
+
+from fastapi import FastAPI
+
 try:
     from openenv.core.env_server.http_server import create_app
 except Exception as exc:  # pragma: no cover
@@ -17,11 +21,20 @@ except Exception:
 try:
     from ..openenv_models import PythonCodeReviewAction, PythonCodeReviewObservation
     from .env import PythonCodeReviewEnvironment
-    from .demo import build_demo
 except ImportError:
     from openenv_models import PythonCodeReviewAction, PythonCodeReviewObservation
     from server.env import PythonCodeReviewEnvironment
-    from server.demo import build_demo
+
+
+def _gradio_enabled() -> bool:
+    return str(os.getenv("ENABLE_GRADIO_DEMO", "false")).strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _max_concurrent_envs() -> int:
+    try:
+        return max(int(os.getenv("OPENENV_MAX_CONCURRENT_ENVS", "2")), 1)
+    except Exception:
+        return 2


 def build_application():
@@ -32,11 +45,24 @@ def build_application():
         PythonCodeReviewAction,
         PythonCodeReviewObservation,
         env_name="python_code_review_env",
-        max_concurrent_envs=4,
+        max_concurrent_envs=_max_concurrent_envs(),
     )
-    if gr is None:
-        return api_app
-    return gr.mount_gradio_app(api_app, build_demo(), path="/")
+    served_app = api_app
+    if gr is not None and _gradio_enabled():
+        try:
+            from .demo import build_demo
+        except ImportError:
+            from server.demo import build_demo
+        served_app = gr.mount_gradio_app(api_app, build_demo(), path="/")
+
+    wrapper_app = FastAPI(title="python_code_review_env", version="1.0.0")
+
+    @wrapper_app.get("/health", include_in_schema=False)
+    def _health() -> dict[str, str]:
+        return {"status": "ok"}
+
+    wrapper_app.mount("/", served_app)
+    return wrapper_app


 app = build_application()
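The Gradio demo is now opt-in behind the `ENABLE_GRADIO_DEMO` flag introduced above. A self-contained sketch of that flag-parsing convention, with an injectable mapping added here for testability (the real `_gradio_enabled` reads `os.environ` directly):

```python
import os
from typing import Mapping, Optional

# Accepted "on" spellings, matching _gradio_enabled in server/app.py.
_TRUTHY = {"1", "true", "yes", "on"}


def gradio_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    # Default off; case-insensitive and whitespace-tolerant, so
    # "True", " yes ", and "1" all enable the demo while anything
    # else (including an unset variable) keeps it disabled.
    source = os.environ if env is None else env
    return str(source.get("ENABLE_GRADIO_DEMO", "false")).strip().lower() in _TRUTHY
```

Defaulting to off keeps the Docker image serving only the OpenEnv API unless the Space explicitly opts in, which matches the `ENABLE_GRADIO_DEMO=false` default baked into the Dockerfile.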
server/env.py CHANGED
@@ -63,6 +63,7 @@ class PythonCodeReviewEnvironment(
         self._current_code: str = self._task.starter_code
         self._history: list[HistoryEntry] = []
         self._last_reward = RewardDetails(value=0.1, reason="Environment initialized.")
         self._current_grade = _empty_grade()
         self._state = PythonCodeReviewState(episode_id=str(uuid4()), step_count=0)
         self.reset()
@@ -77,8 +78,13 @@ class PythonCodeReviewEnvironment(
         self._task = select_task(seed=seed, task_id=task_id)
         self._current_code = self._task.starter_code
         self._history = []
         self._last_reward = RewardDetails(value=0.1, reason="Environment reset.")
-        self._current_grade = grade_task(self._task, self._current_code, include_hidden=False)

         self._state = PythonCodeReviewState(
             episode_id=episode_id or str(uuid4()),
@@ -142,11 +148,13 @@ class PythonCodeReviewEnvironment(
         invalid_action = False
         code_changed = False
         use_hidden_grading = False

         if action.action_type == "edit_code":
             if not action.code or not action.code.strip():
                 invalid_action = True
                 status = "edit_code requires a non-empty code payload."
             else:
                 code_changed = action.code != self._current_code
                 self._current_code = action.code
@@ -164,18 +172,22 @@ class PythonCodeReviewEnvironment(
         else:  # pragma: no cover
             invalid_action = True
             status = f"Unsupported action_type: {action.action_type}"

         self._state.step_count += 1

         if invalid_action:
             current_grade = previous_grade
         else:
-            current_grade = grade_task(
                 self._task,
                 self._current_code,
                 include_hidden=use_hidden_grading,
                 timeout_s=timeout_s or 3.0,
             )
         if action.action_type == "analyze_code":
             status = self._analysis_status(current_grade)
         elif action.action_type == "run_tests":
@@ -208,6 +220,7 @@ class PythonCodeReviewEnvironment(

         self._current_grade = current_grade
         self._last_reward = reward_details
         attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)

         self._state.task_id = self._task.task_id
@@ -226,7 +239,14 @@ class PythonCodeReviewEnvironment(
             status=status,
             reward_details=reward_details,
         )
-        return observation, reward_details.value, observation.done, {"task_id": observation.task_id, "score": observation.score}

     @property
     def state(self) -> PythonCodeReviewState:
@@ -252,11 +272,13 @@ class PythonCodeReviewEnvironment(
             history=list(self._history),
             attempts_remaining=self._state.attempts_remaining,
             last_action_status=status,
             score=grade.score,
             reward=reward_details.value,
             done=self._state.done,
             reward_details=reward_details,
             metadata={
                 "goal": self._task.goal,
                 "repo_summary": self._task.repo_summary,
                 "changed_files": self._task.changed_files,
@@ -280,25 +302,34 @@ class PythonCodeReviewEnvironment(
         curr_score = current_grade.score
         prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
         curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)

         syntax_reward = 0.14 if previous_grade.syntax_score < 0.9 and current_grade.syntax_score >= 0.9 else 0.0
-        test_reward = round(max(curr_rate - prev_rate, 0.0) * 0.22, 3)
-        progress_delta = round(max(curr_score - prev_score, 0.0) * 0.35, 3)
-        quality_bonus = round(max(current_grade.quality_score - previous_grade.quality_score, 0.0) * 0.08, 3)
         correctness_bonus = 0.12 if final_submission and curr_score >= 0.94 and prev_score < 0.94 else 0.0

-        invalid_action_penalty = 0.12 if invalid_action else 0.0
-        timeout_penalty = 0.14 if timed_out else 0.0
-        regression_penalty = round(max(prev_score - curr_score, 0.0) * 0.2, 3)
-        stagnation_penalty = 0.06 if action.action_type == "edit_code" and not code_changed else 0.0

         raw_value = (
-            0.1
-            + 0.45 * curr_score
             + syntax_reward
             + test_reward
             + progress_delta
             + quality_bonus
             + correctness_bonus
             - invalid_action_penalty
             - timeout_penalty
@@ -316,6 +347,12 @@ class PythonCodeReviewEnvironment(
             reason_parts.append("overall score improved")
         if quality_bonus:
             reason_parts.append("code quality improved")
         if correctness_bonus:
             reason_parts.append("full correctness bonus")
         if invalid_action_penalty:
@@ -335,6 +372,9 @@ class PythonCodeReviewEnvironment(
             test_reward=test_reward,
             correctness_bonus=correctness_bonus,
             quality_bonus=quality_bonus,
             progress_delta=progress_delta,
             invalid_action_penalty=invalid_action_penalty,
  timeout_penalty=timeout_penalty,
@@ -352,6 +392,22 @@ class PythonCodeReviewEnvironment(
352
  return compile_error
353
  return "Code parses successfully."
354
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
355
  def _format_test_results(self, grade: TaskGrade) -> str:
356
  parts = [grade.details.get("test_summary", "No test feedback available.")]
357
  benchmark = grade.details.get("benchmark")
 
63
  self._current_code: str = self._task.starter_code
64
  self._history: list[HistoryEntry] = []
65
  self._last_reward = RewardDetails(value=0.1, reason="Environment initialized.")
66
+ self._last_action_error: str | None = None
67
  self._current_grade = _empty_grade()
68
  self._state = PythonCodeReviewState(episode_id=str(uuid4()), step_count=0)
69
  self.reset()
 
78
  self._task = select_task(seed=seed, task_id=task_id)
79
  self._current_code = self._task.starter_code
80
  self._history = []
81
+ self._last_action_error = None
82
  self._last_reward = RewardDetails(value=0.1, reason="Environment reset.")
83
+ self._current_grade, self._last_action_error = self._safe_grade_task(
84
+ self._task,
85
+ self._current_code,
86
+ include_hidden=False,
87
+ )
88
 
89
  self._state = PythonCodeReviewState(
90
  episode_id=episode_id or str(uuid4()),
 
148
  invalid_action = False
149
  code_changed = False
150
  use_hidden_grading = False
151
+ action_error: str | None = None
152
 
153
  if action.action_type == "edit_code":
154
  if not action.code or not action.code.strip():
155
  invalid_action = True
156
  status = "edit_code requires a non-empty code payload."
157
+ action_error = status
158
  else:
159
  code_changed = action.code != self._current_code
160
  self._current_code = action.code
 
172
  else: # pragma: no cover
173
  invalid_action = True
174
  status = f"Unsupported action_type: {action.action_type}"
175
+ action_error = status
176
 
177
  self._state.step_count += 1
178
 
179
  if invalid_action:
180
  current_grade = previous_grade
181
  else:
182
+ current_grade, grade_error = self._safe_grade_task(
183
  self._task,
184
  self._current_code,
185
  include_hidden=use_hidden_grading,
186
  timeout_s=timeout_s or 3.0,
187
  )
188
+ if grade_error:
189
+ action_error = grade_error
190
+ status = f"{status} Grading fallback used."
191
  if action.action_type == "analyze_code":
192
  status = self._analysis_status(current_grade)
193
  elif action.action_type == "run_tests":
 
220
 
221
  self._current_grade = current_grade
222
  self._last_reward = reward_details
223
+ self._last_action_error = action_error
224
  attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)
225
 
226
  self._state.task_id = self._task.task_id
 
239
  status=status,
240
  reward_details=reward_details,
241
  )
242
+ return observation, reward_details.value, observation.done, {
243
+ "task_id": observation.task_id,
244
+ "score": observation.score,
245
+ "done": observation.done,
246
+ "attempts_remaining": observation.attempts_remaining,
247
+ "last_action_status": observation.last_action_status,
248
+ "last_action_error": observation.last_action_error,
249
+ }
250
 
251
  @property
252
  def state(self) -> PythonCodeReviewState:
 
272
  history=list(self._history),
273
  attempts_remaining=self._state.attempts_remaining,
274
  last_action_status=status,
275
+ last_action_error=self._last_action_error,
276
  score=grade.score,
277
  reward=reward_details.value,
278
  done=self._state.done,
279
  reward_details=reward_details,
280
  metadata={
281
+ "benchmark": "python_code_review_env",
282
  "goal": self._task.goal,
283
  "repo_summary": self._task.repo_summary,
284
  "changed_files": self._task.changed_files,
 
302
  curr_score = current_grade.score
303
  prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
304
  curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)
305
+ prev_runtime = previous_grade.runtime_score
306
+ curr_runtime = current_grade.runtime_score
307
+ prev_compile_error = bool(str(previous_grade.details.get("compile_error", "")).strip())
308
+ curr_compile_error = bool(str(current_grade.details.get("compile_error", "")).strip())
309
 
310
  syntax_reward = 0.14 if previous_grade.syntax_score < 0.9 and current_grade.syntax_score >= 0.9 else 0.0
311
+ test_reward = round(max(curr_rate - prev_rate, 0.0) * 0.28, 3)
312
+ progress_delta = round(max(curr_score - prev_score, 0.0) * 0.3, 3)
313
+ quality_bonus = round(max(current_grade.quality_score - previous_grade.quality_score, 0.0) * 0.12, 3)
314
+ runtime_bonus = round(max(curr_runtime - prev_runtime, 0.0) * 0.08, 3)
315
+ error_reduction_bonus = 0.1 if prev_compile_error and not curr_compile_error else 0.0
316
+ completion_bonus = 0.14 if final_submission and curr_rate >= 0.999 and curr_score >= 0.94 else 0.0
317
  correctness_bonus = 0.12 if final_submission and curr_score >= 0.94 and prev_score < 0.94 else 0.0
318
 
319
+ invalid_action_penalty = round((0.04 + (0.08 * (1.0 - prev_score))) if invalid_action else 0.0, 3)
320
+ timeout_penalty = round((0.06 + (0.08 * max(curr_runtime, prev_runtime))) if timed_out else 0.0, 3)
321
+ regression_penalty = round(max(prev_score - curr_score, 0.0) * 0.25, 3)
322
+ stagnation_penalty = round((0.02 + (0.05 * prev_score)) if action.action_type == "edit_code" and not code_changed else 0.0, 3)
323
 
324
  raw_value = (
325
+ 0.32 * curr_score
 
326
  + syntax_reward
327
  + test_reward
328
  + progress_delta
329
  + quality_bonus
330
+ + error_reduction_bonus
331
+ + completion_bonus
332
+ + runtime_bonus
333
  + correctness_bonus
334
  - invalid_action_penalty
335
  - timeout_penalty
 
347
  reason_parts.append("overall score improved")
348
  if quality_bonus:
349
  reason_parts.append("code quality improved")
350
+ if error_reduction_bonus:
351
+ reason_parts.append("errors removed")
352
+ if completion_bonus:
353
+ reason_parts.append("task completed")
354
+ if runtime_bonus:
355
+ reason_parts.append("runtime improved")
356
  if correctness_bonus:
357
  reason_parts.append("full correctness bonus")
358
  if invalid_action_penalty:
 
372
  test_reward=test_reward,
373
  correctness_bonus=correctness_bonus,
374
  quality_bonus=quality_bonus,
375
+ error_reduction_bonus=error_reduction_bonus,
376
+ completion_bonus=completion_bonus,
377
+ runtime_bonus=runtime_bonus,
378
  progress_delta=progress_delta,
379
  invalid_action_penalty=invalid_action_penalty,
380
  timeout_penalty=timeout_penalty,
 
392
  return compile_error
393
  return "Code parses successfully."
394
 
395
+ def _safe_grade_task(
396
+ self,
397
+ task: ReviewTask,
398
+ code: str,
399
+ *,
400
+ include_hidden: bool,
401
+ timeout_s: float = 3.0,
402
+ ) -> tuple[TaskGrade, str | None]:
403
+ try:
404
+ return (
405
+ grade_task(task, code, include_hidden=include_hidden, timeout_s=timeout_s),
406
+ None,
407
+ )
408
+ except Exception as exc: # pragma: no cover
409
+ return _empty_grade(), f"{type(exc).__name__}: {exc}"
410
+
411
  def _format_test_results(self, grade: TaskGrade) -> str:
412
  parts = [grade.details.get("test_summary", "No test feedback available.")]
413
  benchmark = grade.details.get("benchmark")
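The new `_safe_grade_task` helper wraps `grade_task` so an unexpected grading exception degrades to an empty grade plus an error string instead of crashing the step. The same catch-and-fallback pattern in isolation (a standalone sketch; `safe_call` and its fallback value are illustrative, not part of the project):

```python
from typing import Callable, Optional, Tuple


def safe_call(fn: Callable[[], float], fallback: float = 0.0) -> Tuple[float, Optional[str]]:
    """Run fn; on success return (result, None), on failure (fallback, 'Type: message')."""
    try:
        return fn(), None
    except Exception as exc:
        return fallback, f"{type(exc).__name__}: {exc}"


value, error = safe_call(lambda: 1 / 0)
print(value, error)  # → 0.0 ZeroDivisionError: division by zero
```

Returning the error as data rather than raising lets the caller keep the episode alive and surface the failure through `last_action_error`.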
server/requirements.txt CHANGED
@@ -2,7 +2,6 @@ openenv-core[core]>=0.2.2
 fastapi>=0.111.0
 gradio>=5.26.0
 uvicorn>=0.30.0
-pytest>=8.0.0
 openai>=1.76.0
 streamlit>=1.44.0
 torch>=2.2.0
services/analysis_service.py CHANGED
@@ -34,7 +34,7 @@ class AnalysisService:
     """End-to-end analysis pipeline shared by API and UI."""

     def __init__(self) -> None:
-        self.model = PyTorchCodeAnalyzerModel()
+        self._model: PyTorchCodeAnalyzerModel | None = None
         self.reward_service = RewardService()
         self.suggestion_service = SuggestionService()
         self._analyzers: Dict[str, Callable[[str, Dict[str, Any], Dict[str, Any]], DomainAnalysis]] = {
@@ -44,6 +44,12 @@ class AnalysisService:
             "web": analyze_web_code,
         }

+    @property
+    def model(self) -> PyTorchCodeAnalyzerModel:
+        if self._model is None:
+            self._model = PyTorchCodeAnalyzerModel()
+        return self._model
+
     def _heuristic_domain_scores(self, parsed: Dict[str, Any], code: str) -> Dict[str, float]:
         """Derive domain priors from imports and syntax-level hints."""

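The `AnalysisService` change replaces eager construction of `PyTorchCodeAnalyzerModel` with a lazily initialized property, so instantiating the service stays cheap until the model is actually needed. A self-contained sketch of the same pattern (`ExpensiveModel` is a hypothetical stand-in for the real model class):

```python
class ExpensiveModel:
    instances = 0

    def __init__(self) -> None:
        # Pretend this loads heavy weights; count constructions instead.
        ExpensiveModel.instances += 1


class Service:
    def __init__(self) -> None:
        self._model: ExpensiveModel | None = None  # nothing loaded yet

    @property
    def model(self) -> ExpensiveModel:
        # Construct on first access, then reuse the cached instance.
        if self._model is None:
            self._model = ExpensiveModel()
        return self._model


svc = Service()
assert ExpensiveModel.instances == 0  # constructor ran, model untouched
first, second = svc.model, svc.model
assert ExpensiveModel.instances == 1 and first is second
```

This keeps import and startup time low for code paths (health checks, docs pages) that never touch the model.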
services/reward_service.py CHANGED
@@ -9,13 +9,21 @@ class RewardService:
     """Compute reward scores from model, domain, lint, and complexity signals."""

     def compute(self, *, ml_score: float, domain_score: float, lint_score: float, complexity_penalty: float) -> ScoreBreakdown:
-        """Apply the weighted reward formula and clamp the result."""
+        """Apply dynamic reward shaping based on quality, errors, and completion."""

+        quality_signal = max(0.0, min(1.0, (0.45 * ml_score) + (0.3 * domain_score) + (0.25 * lint_score)))
+        error_reduction_signal = max(0.0, min(1.0, lint_score - (0.6 * complexity_penalty)))
+        completion_signal = max(0.0, min(1.0, (ml_score + domain_score + lint_score) / 3.0))
         reward = max(
             0.0,
             min(
                 1.0,
-                (0.4 * ml_score) + (0.2 * domain_score) + (0.2 * lint_score) - (0.2 * complexity_penalty),
+                (0.35 * quality_signal)
+                + (0.25 * completion_signal)
+                + (0.2 * error_reduction_signal)
+                + (0.1 * ml_score)
+                + (0.1 * domain_score)
+                - (0.15 * complexity_penalty),
             ),
         )
         return ScoreBreakdown(
@@ -23,5 +31,8 @@ class RewardService:
             domain_score=round(domain_score, 4),
             lint_score=round(lint_score, 4),
             complexity_penalty=round(complexity_penalty, 4),
+            quality_signal=round(quality_signal, 4),
+            error_reduction_signal=round(error_reduction_signal, 4),
+            completion_signal=round(completion_signal, 4),
             reward=round(reward, 4),
         )
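Plugging sample scores into the reshaped formula shows how the intermediate signals combine. The weights below mirror the diff; the input values are arbitrary illustrations, and this sketch does pure arithmetic rather than importing the actual service:

```python
ml_score, domain_score, lint_score, complexity_penalty = 0.8, 0.6, 0.9, 0.2

# Composite signals, each clamped to [0, 1] as in RewardService.compute.
quality_signal = max(0.0, min(1.0, (0.45 * ml_score) + (0.3 * domain_score) + (0.25 * lint_score)))
error_reduction_signal = max(0.0, min(1.0, lint_score - (0.6 * complexity_penalty)))
completion_signal = max(0.0, min(1.0, (ml_score + domain_score + lint_score) / 3.0))

reward = max(0.0, min(
    1.0,
    (0.35 * quality_signal)
    + (0.25 * completion_signal)
    + (0.2 * error_reduction_signal)
    + (0.1 * ml_score)
    + (0.1 * domain_score)
    - (0.15 * complexity_penalty),
))
print(round(reward, 4))  # ≈ 0.7254 for these inputs
```

Because every signal is clamped before weighting and the final sum is clamped again, the reward stays in [0, 1] regardless of inputs.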
tests/test_inference_runner.py ADDED
@@ -0,0 +1,71 @@
+"""Smoke tests for the strict inference output contract."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+from app.env.runner import InferenceRunner
+from app.models.inference import AgentDecision, InferenceConfig
+
+
+@dataclass
+class _FakeObservation:
+    task_id: str
+    attempts_remaining: int
+    score: float
+    done: bool
+    history: list[object] = field(default_factory=list)
+    current_code: str = "print('broken')"
+    last_action_error: str | None = None
+
+
+class _FakeEnv:
+    def __init__(self) -> None:
+        self._step = 0
+
+    def reset(self, *, task_id: str) -> _FakeObservation:
+        return _FakeObservation(task_id=task_id, attempts_remaining=4, score=0.2, done=False)
+
+    def step_result(self, action: object) -> tuple[_FakeObservation, float, bool, dict[str, object]]:
+        self._step += 1
+        if self._step == 1:
+            return (
+                _FakeObservation("demo_task", 3, 0.45, False, current_code="candidate"),
+                0.45,
+                False,
+                {"last_action_error": None},
+            )
+        if self._step == 2:
+            return (
+                _FakeObservation("demo_task", 2, 0.97, True, current_code="reference"),
+                0.97,
+                True,
+                {"last_action_error": None},
+            )
+        raise AssertionError("runner stepped too many times")
+
+
+class _FakeAgent:
+    def __init__(self) -> None:
+        self._step = 0
+
+    def act(self, observation: object) -> AgentDecision:
+        self._step += 1
+        if self._step == 1:
+            return AgentDecision(action_type="run_tests")
+        return AgentDecision(action_type="submit_solution")
+
+
+def test_inference_runner_emits_strict_lines(capsys) -> None:
+    runner = InferenceRunner(InferenceConfig.from_env())
+    runner.agent = _FakeAgent()
+    runner._create_env = lambda: _FakeEnv()  # type: ignore[method-assign]
+    runner.run_task("demo_task")
+
+    captured = capsys.readouterr().out.strip().splitlines()
+    assert captured == [
+        f"[START] task=demo_task env={runner.config.benchmark_name} model={runner.config.model_name}",
+        "[STEP] step=1 action=run_tests reward=0.45 done=false error=null",
+        "[STEP] step=2 action=submit_solution reward=0.97 done=true error=null",
+        "[END] success=true steps=2 rewards=0.45,0.97",
+    ]
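The assertions in this test pin down a strict line-oriented log contract: lowercase booleans and `null` for a missing error. A tiny formatter that produces the same `[STEP]` shape (a hypothetical helper, not the runner's actual implementation):

```python
from __future__ import annotations


def format_step(step: int, action: str, reward: float, done: bool, error: str | None) -> str:
    # Lowercase the boolean and render a missing error as JSON-style null.
    return (
        f"[STEP] step={step} action={action} reward={reward} "
        f"done={str(done).lower()} error={'null' if error is None else error}"
    )


print(format_step(1, "run_tests", 0.45, False, None))
# → [STEP] step=1 action=run_tests reward=0.45 done=false error=null
```

Keeping the format in one function makes the contract easy to test with exact string comparisons, as the smoke test above does.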
uv.lock CHANGED
The diff for this file is too large to render.