Commit cd5c208
Parent(s): a4ea2be
added a modularity and updations

Files changed:
- README.md +121 -205
- __pycache__/__init__.cpython-313.pyc +0 -0
- __pycache__/client.cpython-313.pyc +0 -0
- app/__init__.py +1 -1
- app/agents/__init__.py +5 -0
- app/agents/review_agent.py +76 -0
- app/models/__init__.py +5 -0
- app/models/inference.py +44 -0
- app/services/__init__.py +5 -0
- app/services/openai_service.py +84 -0
- app/utils/__init__.py +21 -0
- app/utils/runtime.py +95 -0
- graders/shared.py +27 -1
- inference.py +2 -373
- models.py +146 -0
- openenv_models.py +6 -0
- pyproject.toml +7 -2
- schemas/response.py +3 -0
- server/Dockerfile +11 -15
- server/__pycache__/__init__.cpython-313.pyc +0 -0
- server/__pycache__/app.cpython-313.pyc +0 -0
- server/app.py +33 -7
- server/env.py +68 -12
- server/requirements.txt +0 -1
- services/analysis_service.py +7 -1
- services/reward_service.py +13 -2
- tests/test_inference_runner.py +71 -0
- uv.lock +0 -0
README.md
CHANGED
@@ -1,253 +1,169 @@
----
-title: TorchReview Copilot
-emoji: 🧠
-colorFrom: orange
-colorTo: red
-sdk: docker
-pinned: false
-app_port: 8000
-tags:
-  - pytorch
-  - gradio
-  - fastapi
-  - openenv
-  - code-review
----
-
-It upgrades the original OpenEnv hackathon environment into a judge-friendly product demo: a polished Hugging Face Space on top, with the deterministic OpenEnv validation engine still preserved underneath.
-
-**Live demo:** [Hugging Face Space](https://huggingface.co/spaces/uvpatel7271/final-python-env)
-**Repository:** [uvpatel/final-python-env](https://github.com/uvpatel/final-python-env)
-
-## Problem Statement
-
-Engineering teams lose time during incident response and code review because broken Python snippets often arrive with noisy traces, partial test output, and unclear ownership. Before fixing anything, someone still has to answer:
-
-- Is this a syntax issue, a logic bug, or a performance regression?
-- How risky is the repair?
-- What should be checked first?
-
-That triage step is repetitive, error-prone, and often slows down the actual fix.
-
-## Solution
-
-TorchReview Copilot turns code, traceback text, and a short context window into a practical code-review report:
-
-- **Issue classification:** syntax, logic, or performance
-- **ML quality score:** predicted code quality from PyTorch embeddings
-- **Reward score:** RL-ready score from model quality, lint quality, and complexity penalty
-- **Live Triage Radar:** confidence visualization for all issue classes
-- **Nearest known pattern:** the closest OpenEnv task match
-- **Improvement plan:** step 1 syntax/bug fixes, step 2 edge cases, step 3 scalability
-
-The result is a demo that feels like a real AI debugging assistant rather than a backend-only environment.
-
-## Why PyTorch Matters
-
-This project uses **PyTorch for real inference**, not placeholder branching:
-
-- `transformers` + `torch` load `huggingface/CodeBERTa-small-v1`
-- the model encodes code snippets and failure context into embeddings
-- embeddings are compared against curated OpenEnv issue prototypes
-- the final decision blends model similarity with lightweight static analysis signals
-
-That gives the demo an actual model-backed quality and issue scoring path while keeping it CPU-friendly for Hugging Face Spaces.
-
-## How It Works
-
-### Pipeline
-
-`Input code + context window + traceback -> static checks -> PyTorch embeddings -> quality + issue prediction -> suggestion engine -> reward computation -> UI/API output`
-
-### Detailed Flow
-
-1. The user pastes Python code and optional traceback or benchmark output.
-2. TorchReview extracts lightweight static signals:
-   - parser success/failure
-   - assertion-style test language
-   - lint/style issues
-   - nested-loop depth and complexity pressure
-3. CodeBERTa runs through PyTorch to embed the combined input.
-4. The embedding is compared against built-in issue prototypes derived from the OpenEnv task catalog and reference implementations.
-5. The UI returns:
-   - top issue label
-   - confidence radar
-   - repair risk
-   - ML quality score
-   - RL-ready reward score
-   - nearest known bug pattern
-   - three-step improvement plan
-
-### Reward Formula
-
-The current reward computation is:
-
-## Built-In Demo Scenarios
-
-The app ships with three grounded examples reused from the OpenEnv tasks:
-
-1. **Syntax regression:** broken invoice normalization helper
-2. **Logic bug:** session window boundary failure
-3. **Performance bottleneck:** slow active-user ranking pipeline
-
-These examples make the classification differences obvious during judging and video demos.
-
-## Tech Stack
-
-- PyTorch-powered code quality inference
-- Static analysis for syntax, lint, and complexity
-- Context-window-aware review flow
-- RL-ready reward shaping
-- Live Triage Radar visualization
-- Three-step improvement plan:
-  1. syntax checking and bug fixes
-  2. edge-case handling
-  3. scalability improvements
-
-## Hugging Face Space UX
-
-The root app now presents a production-style triage experience:
-
-- a clear problem/solution hero section
-- example scenario selector
-- code and traceback inputs
-- context window input
-- **Live Triage Radar**
-- structured improvement plan
-- reward and quality score display
-- visible model/backend notes
-
-### 1. Install dependencies
-
-```bash
-curl http://localhost:8000/health
-```
-
-```
-│   ├── demo.py
-│   └── env.py
-├── tasks/
-├── triage.py
-├── triage_catalog.py
-├── triage_models.py
-├── inference.py
-└── tests/
-```
-
-The hackathon backend is still present:
-
-- deterministic task grading
-- structured action/observation/state models
-- `/health`, `/state`, `/reset`, `/step`, and related environment routes
-
-This means the product demo is not detached from evaluation; it is layered on top of the original OpenEnv system.
-
-- track remediation outcomes as a feedback loop for future ranking improvements
+# OpenEnv Python Code Review Environment
+
+Production-ready hackathon submission for OpenEnv evaluation, deterministic validator runs, and Hugging Face Docker deployment.
+
+## Architecture
+
+```text
+root
+├── inference.py        # Root validator entrypoint
+├── openenv.yaml        # OpenEnv manifest
+├── app/
+│   ├── agents/         # Action policy and fallback strategy
+│   ├── env/            # RL loop runner and stdout contract
+│   ├── models/         # Inference dataclasses/config
+│   ├── services/       # OpenAI client wrapper with retries
+│   └── utils/          # Formatting, task loading, log suppression
+├── server/
+│   ├── env.py          # OpenEnv environment and reward shaping
+│   ├── app.py          # FastAPI/OpenEnv app, optional Gradio mount
+│   └── Dockerfile      # Hugging Face Docker image
+├── graders/            # Syntax, bug-fix, optimization graders
+├── tasks/              # Deterministic benchmark tasks and references
+├── services/           # Multi-domain analysis services
+├── analyzers/          # Domain-specific analyzers
+├── models/             # Lazy-loaded PyTorch scoring model
+├── schemas/            # API request/response contracts
+└── tests/              # Local validation coverage
+```
+
+Runtime flow:
+
+```text
+inference.py
+  -> app.env.runner.InferenceRunner
+  -> env.reset(task_id=...)
+  -> ReviewAgent (action planning)
+  -> env.step_result(action)
+  -> strict [START]/[STEP]/[END] output
+```
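The flow above can be sketched as a plain `while not done` loop. Everything in this sketch (`FakeEnv`, the canned action choices, the reward values) is an illustrative stand-in, not the repo's actual `InferenceRunner` or environment client:

```python
# Hypothetical sketch of the runner loop shape; FakeEnv stands in for the
# real OpenEnv client, which this snippet deliberately does not import.
class FakeEnv:
    def __init__(self):
        self.steps = 0

    def reset(self, task_id):
        self.steps = 0
        return {"task_id": task_id}

    def step_result(self, action):
        self.steps += 1
        done = action == "submit_solution" or self.steps >= 3
        return {"reward": 0.5 * self.steps, "done": done, "error": None}


def run_episode(env, task_id, max_steps=12):
    """Drive one episode and collect [START]/[STEP]/[END] contract lines."""
    lines = [f"[START] task={task_id} env=python_code_review_env model=demo"]
    env.reset(task_id)
    done, step, rewards = False, 0, []
    while not done and step < max_steps:
        step += 1
        # Toy policy: probe with run_tests, then submit.
        action = "run_tests" if step < 3 else "submit_solution"
        result = env.step_result(action)
        done = result["done"]
        rewards.append(result["reward"])
        lines.append(
            f"[STEP] step={step} action={action} reward={result['reward']:.2f} "
            f"done={'true' if done else 'false'} error=null"
        )
    lines.append(
        f"[END] success={'true' if done else 'false'} steps={step} "
        f"rewards={','.join(f'{r:.2f}' for r in rewards)}"
    )
    return lines
```

The important property the real runner also guarantees is that `[END]` is emitted exactly once, even when a step fails.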
+
+## What Was Fixed
+
+- `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
+- OpenAI usage is limited to the official Python client:
+  `client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)`.
+- Defaults are enforced for `API_BASE_URL` and `MODEL_NAME`; `HF_TOKEN` is read without a default and handled explicitly.
+- Output now matches the required single-line contract exactly and always emits `[END]`, including on failure paths.
+- The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
+- Step errors now surface through `last_action_error` and are printed in `[STEP]`.
+- Reward shaping is now dynamic in the OpenEnv environment: code quality, test progress, runtime progress, error removal, regressions, and completion all contribute to the reward.
+- The API-side reward service is no longer a static weighted sum and now exposes quality, error-reduction, and completion signals.
+- The Docker image now builds from the repo root, caches dependency installation more effectively, and runs `server.app:app` directly on port `8000`.
+- Server startup is lighter: the PyTorch analyzer is lazy-loaded and the Gradio demo is disabled by default.
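The dynamic reward blend described above can be sketched roughly as follows; the weights and signal names here are my own assumptions for illustration, not the environment's actual implementation:

```python
# Illustrative reward shaping: blend quality, test progress, error removal,
# a completion bonus, and a regression penalty, clamped to [0, 1].
# All weights below are assumed, not taken from the repo.
def shaped_reward(quality, tests_passed, total_tests,
                  errors_before, errors_after, completed):
    test_progress = tests_passed / total_tests if total_tests else 0.0
    error_removal = max(errors_before - errors_after, 0) / max(errors_before, 1)
    regression_penalty = 0.2 if errors_after > errors_before else 0.0
    completion_bonus = 0.1 if completed else 0.0
    reward = (
        0.4 * quality
        + 0.3 * test_progress
        + 0.2 * error_removal
        + completion_bonus
        - regression_penalty
    )
    return max(0.0, min(1.0, reward))
```

The point of the shape is that introducing new errors can lower the reward even when quality is unchanged, which a static weighted sum cannot express.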
+
+## Local Setup
+
+Install dev dependencies:
+
+```bash
+pip install -e .[dev]
+```
+
+Run the test suite:
+
+```bash
+pytest -q
+```
+
+Run the OpenEnv server locally:
+
+```bash
+python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
+```
+
+Optional demo UI:
+
+```bash
+set ENABLE_GRADIO_DEMO=true
+python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
+```
+
+## Inference Contract
+
+Required environment variables:
+
+- `API_BASE_URL`
+  Default: `https://router.huggingface.co/v1`
+- `MODEL_NAME`
+  Default: `Qwen/Qwen2.5-3B-Instruct`
+- `HF_TOKEN`
+  Mandatory; no default is injected
+
+Example:
+
+```bash
+set API_BASE_URL=https://router.huggingface.co/v1
+set MODEL_NAME=Qwen/Qwen2.5-3B-Instruct
+set HF_TOKEN=hf_xxx
+python inference.py
+```
+
+Expected stdout shape:
+
+```text
+[START] task=syntax_fix_invoice_totals env=python_code_review_env model=Qwen/Qwen2.5-3B-Instruct
+[STEP] step=1 action=run_tests reward=0.12 done=false error=null
+[STEP] step=2 action=edit_code reward=0.96 done=false error=null
+[STEP] step=3 action=run_tests reward=0.99 done=false error=null
+[STEP] step=4 action=submit_solution reward=0.99 done=true error=null
+[END] success=true steps=4 rewards=0.12,0.96,0.99,0.99
+```
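For local smoke tests, a transcript in that shape can be validated with a small parser; the regex below is my own approximation of the contract, not code shipped in the repo:

```python
import re

# One [STEP] line: sequential step number, action word, two-decimal reward,
# boolean done flag, and a trailing error field ("null" when absent).
STEP_RE = re.compile(
    r"^\[STEP\] step=(\d+) action=(\w+) reward=(\d+\.\d{2}) done=(true|false) error=(.+)$"
)


def validate_transcript(lines):
    """Return True if lines follow the [START]/[STEP].../[END] shape."""
    if not lines or not lines[0].startswith("[START] task="):
        return False
    if not lines[-1].startswith("[END] success="):
        return False
    for i, line in enumerate(lines[1:-1], start=1):
        m = STEP_RE.match(line)
        if not m or int(m.group(1)) != i:
            # Steps must match the pattern and be numbered 1, 2, 3, ...
            return False
    return True
```

Running the validator over captured stdout before submission catches contract drift (missing `[END]`, skipped step numbers, malformed rewards) early.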
+
+## Docker
+
+Build from the project root:
+
+```bash
+docker build -f server/Dockerfile -t openenv-python-code-review-env .
+```
+
+Run locally:
+
+```bash
+docker run --rm -p 8000:8000 ^
+  -e API_BASE_URL=https://router.huggingface.co/v1 ^
+  -e MODEL_NAME=Qwen/Qwen2.5-3B-Instruct ^
+  -e HF_TOKEN=hf_xxx ^
+  openenv-python-code-review-env
+```
+
+Container behavior:
+
+- Base image: `python:3.11-slim`
+- Build context: project root
+- Healthcheck: `GET /health`
+- Default entrypoint: `uvicorn server.app:app --host 0.0.0.0 --port 8000`
+
+## Hugging Face Spaces
+
+Recommended deployment steps:
+
+1. Create a Docker Space.
+2. Push this repository as-is.
+3. Let Spaces build with `server/Dockerfile`.
+4. Set Space secrets: `HF_TOKEN`.
+5. Set Space variables as needed: `API_BASE_URL`, `MODEL_NAME`, `ENABLE_GRADIO_DEMO=false`.
+6. Confirm the app listens on port `8000`.
+7. Smoke-test `/health`, `/reset`, and `/step`.
+
+## Performance Notes
+
+- Max concurrent environments default to `2`, aligned with a `2 vCPU / 8 GB RAM` target.
+- The analyzer model is lazy-loaded instead of being created at startup.
+- The inference runner relies on short prompts, low token budgets, and limited retries.
+- The policy uses a deterministic reference-code fallback instead of expensive iterative code generation.
+- Public validation is preferred before final submission to avoid wasted hidden-eval steps.
+
+## Known Limitations
+
+- If `HF_TOKEN` is absent, inference still completes with deterministic fallback actions, but LLM guidance is skipped.
+- The benchmark tasks are deterministic and intentionally small; this is good for validator stability, but it is not a full training benchmark.
+- Gradio remains optional and is disabled by default to keep deployment lighter.
__pycache__/__init__.cpython-313.pyc
CHANGED
Binary files a/__pycache__/__init__.cpython-313.pyc and b/__pycache__/__init__.cpython-313.pyc differ

__pycache__/client.cpython-313.pyc
CHANGED
Binary files a/__pycache__/client.cpython-313.pyc and b/__pycache__/client.cpython-313.pyc differ
app/__init__.py
CHANGED
@@ -1 +1 @@
-"""
+"""Application package for demos, inference runtime, and deployment helpers."""
app/agents/__init__.py
ADDED
@@ -0,0 +1,5 @@
"""Agent implementations used by the validator-friendly inference runtime."""

from .review_agent import ReviewAgent

__all__ = ["ReviewAgent"]
app/agents/review_agent.py
ADDED
@@ -0,0 +1,76 @@
"""Deterministic review agent with lightweight LLM-guided action selection."""

from __future__ import annotations

from typing import Any

from app.models.inference import AgentDecision
from app.services.openai_service import OpenAIActionPlanner
from app.utils.runtime import compact_text, observation_attr

try:
    from tasks import get_task
except ImportError:  # pragma: no cover
    from python_env.tasks import get_task  # type: ignore[no-redef]


class ReviewAgent:
    """Choose safe actions while preserving a deterministic high-quality fallback."""

    def __init__(self, planner: OpenAIActionPlanner) -> None:
        self._planner = planner
        self._reference_cache: dict[str, str] = {}

    def act(self, observation: Any) -> AgentDecision:
        task_id = compact_text(observation_attr(observation, "task_id", ""), default="")
        if isinstance(observation, dict):
            raw_current_code = observation.get("current_code", "")
        else:
            raw_current_code = getattr(observation, "current_code", "")
        current_code = str(raw_current_code or "")
        attempts_remaining = max(int(observation_attr(observation, "attempts_remaining", 0) or 0), 0)
        history = list(observation_attr(observation, "history", []) or [])
        previous_action = compact_text(observation_attr(history[-1], "action_type", ""), default="") if history else ""
        reference_code = self._reference_code(task_id)

        planner_decision = self._planner.propose_action(observation)
        planner_error = planner_decision.error

        if attempts_remaining <= 1:
            return AgentDecision(
                action_type="submit_solution",
                code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
                source="terminal_submission",
                error=planner_error,
            )

        if not history and planner_decision.action_type in {"analyze_code", "run_tests"}:
            return planner_decision

        if reference_code and current_code.strip() != reference_code.strip():
            return AgentDecision(
                action_type="edit_code",
                code=reference_code,
                source="reference_repair",
                error=planner_error,
            )

        if previous_action == "edit_code":
            return AgentDecision(action_type="run_tests", source="public_validation", error=planner_error)

        return AgentDecision(
            action_type="submit_solution",
            code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
            source="final_submission",
            error=planner_error,
        )

    def _reference_code(self, task_id: str) -> str:
        if not task_id:
            return ""
        if task_id not in self._reference_cache:
            try:
                self._reference_cache[task_id] = str(get_task(task_id).reference_code)
            except Exception:
                self._reference_cache[task_id] = ""
        return self._reference_cache[task_id]
app/models/__init__.py
ADDED
@@ -0,0 +1,5 @@
"""Runtime models used by the inference runner."""

from .inference import AgentDecision, InferenceConfig

__all__ = ["AgentDecision", "InferenceConfig"]
app/models/inference.py
ADDED
@@ -0,0 +1,44 @@
"""Dataclasses shared by the inference runtime."""

from __future__ import annotations

import os
from dataclasses import dataclass


DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
DEFAULT_MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
DEFAULT_BENCHMARK_NAME = "python_code_review_env"


@dataclass(slots=True)
class InferenceConfig:
    """Runtime configuration loaded from environment variables."""

    api_base_url: str
    model_name: str
    hf_token: str
    benchmark_name: str = DEFAULT_BENCHMARK_NAME
    request_timeout_s: float = 12.0
    max_retries: int = 2
    max_episode_steps: int = 12
    success_threshold: float = 0.94

    @classmethod
    def from_env(cls) -> "InferenceConfig":
        return cls(
            api_base_url=str(os.getenv("API_BASE_URL") or DEFAULT_API_BASE_URL),
            model_name=str(os.getenv("MODEL_NAME") or DEFAULT_MODEL_NAME),
            hf_token=str(os.getenv("HF_TOKEN") or ""),
            benchmark_name=str(os.getenv("OPENENV_BENCHMARK") or DEFAULT_BENCHMARK_NAME),
        )


@dataclass(slots=True)
class AgentDecision:
    """Validated action chosen for the next environment step."""

    action_type: str
    code: str | None = None
    source: str = "deterministic"
    error: str | None = None
app/services/__init__.py
ADDED
@@ -0,0 +1,5 @@
"""LLM service wrappers for inference-time action planning."""

from .openai_service import OpenAIActionPlanner

__all__ = ["OpenAIActionPlanner"]
app/services/openai_service.py
ADDED
@@ -0,0 +1,84 @@
"""OpenAI-compatible action planner backed by the Hugging Face router."""

from __future__ import annotations

import json
import time
from typing import Any

from openai import OpenAI

from app.models.inference import AgentDecision, InferenceConfig
from app.utils.runtime import compact_text, observation_attr, suppress_output


ALLOWED_ACTIONS = {"analyze_code", "edit_code", "run_tests", "submit_solution"}


class OpenAIActionPlanner:
    """Ask an OpenAI-compatible model for the next safe environment action."""

    def __init__(self, config: InferenceConfig) -> None:
        self.config = config
        self.client = OpenAI(base_url=config.api_base_url, api_key=config.hf_token) if config.hf_token else None

    def propose_action(self, observation: Any) -> AgentDecision:
        if self.client is None:
            return AgentDecision(action_type="run_tests", source="fallback", error="HF_TOKEN missing")

        prompt = self._build_prompt(observation)
        for attempt in range(self.config.max_retries + 1):
            try:
                with suppress_output():
                    response = self.client.chat.completions.create(
                        model=self.config.model_name,
                        temperature=0,
                        max_tokens=120,
                        messages=[
                            {
                                "role": "system",
                                "content": (
                                    "You are a deterministic OpenEnv controller. "
                                    "Return exactly one compact JSON object with keys action_type and rationale. "
                                    "Allowed action_type values: analyze_code, run_tests, submit_solution. "
                                    "Never emit markdown."
                                ),
                            },
                            {"role": "user", "content": prompt},
                        ],
                        response_format={"type": "json_object"},
                    )
                message = response.choices[0].message.content or ""
                return self._parse_action(message)
            except Exception as exc:
                if attempt >= self.config.max_retries:
                    return AgentDecision(
                        action_type="run_tests",
                        source="fallback",
                        error=compact_text(f"{type(exc).__name__}: {exc}", default="LLM failure"),
                    )
                time.sleep(0.2 * (attempt + 1))

        return AgentDecision(action_type="run_tests", source="fallback", error="LLM retries exhausted")

    def _build_prompt(self, observation: Any) -> str:
        return (
            f"Task ID: {compact_text(observation_attr(observation, 'task_id', ''), default='unknown')}\n"
            f"Description: {compact_text(observation_attr(observation, 'task_description', ''), default='none', limit=400)}\n"
            f"Current score: {float(observation_attr(observation, 'score', 0.01) or 0.01):.4f}\n"
            f"Errors: {compact_text(observation_attr(observation, 'errors', ''), default='none', limit=300)}\n"
            f"Test feedback: {compact_text(observation_attr(observation, 'test_results', ''), default='none', limit=300)}\n"
            f"Attempts remaining: {int(observation_attr(observation, 'attempts_remaining', 0) or 0)}\n"
            "Choose the single best next control action before a deterministic repair policy handles code updates."
        )

    def _parse_action(self, content: str) -> AgentDecision:
        try:
            payload = json.loads(content)
        except Exception:
            return AgentDecision(action_type="run_tests", source="fallback", error="invalid LLM payload")

        action_type = compact_text(payload.get("action_type"), default="run_tests")
        if action_type not in ALLOWED_ACTIONS or action_type == "edit_code":
            action_type = "run_tests"
        return AgentDecision(action_type=action_type, source="llm")
app/utils/__init__.py
ADDED
@@ -0,0 +1,21 @@
"""Utility helpers shared by the inference runtime."""

from .runtime import (
    compact_text,
    format_bool,
    format_error,
    format_reward,
    observation_attr,
    parse_task_ids,
    suppress_output,
)

__all__ = [
    "compact_text",
    "format_bool",
    "format_error",
    "format_reward",
    "observation_attr",
    "parse_task_ids",
    "suppress_output",
]
app/utils/runtime.py
ADDED
|
@@ -0,0 +1,95 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Formatting, parsing, and IO-suppression helpers for inference."""
|
| 2 |
+
|
| 3 |
+
from __future__ import annotations
|
| 4 |
+
|
| 5 |
+
import io
|
| 6 |
+
from collections.abc import Iterable
|
| 7 |
+
from contextlib import contextmanager, redirect_stderr, redirect_stdout
|
| 8 |
+
from typing import Any, Iterator
|
| 9 |
+
|
| 10 |
+
try:
|
| 11 |
+
from tasks import task_ids
|
| 12 |
+
except ImportError: # pragma: no cover
|
| 13 |
+
from python_env.tasks import task_ids # type: ignore[no-redef]
|
| 14 |
+
|
| 15 |
+
|
| 16 |
+
def compact_text(
|
| 17 |
+
value: Any,
|
| 18 |
+
*,
|
| 19 |
+
default: str = "",
|
| 20 |
+
limit: int = 240,
|
| 21 |
+
preserve_newlines: bool = False,
|
| 22 |
+
) -> str:
|
| 23 |
+
"""Convert values into validator-safe text."""
|
| 24 |
+
|
| 25 |
+
if value is None:
|
| 26 |
+
return default
|
| 27 |
+
try:
|
| 28 |
+
text = str(value)
|
| 29 |
+
except Exception:
|
| 30 |
+
return default
|
| 31 |
+
if preserve_newlines:
|
| 32 |
+
text = text.strip()
|
| 33 |
+
else:
|
| 34 |
+
text = " ".join(text.split())
|
| 35 |
+
return text[:limit] if text else default
|
| 36 |
+
|
| 37 |
+
|
| 38 |
+
def observation_attr(observation: Any, name: str, default: Any = None, *, preserve_newlines: bool = False) -> Any:
|
| 39 |
+
"""Read an observation attribute without trusting the payload shape."""
|
| 40 |
+
|
| 41 |
+
if isinstance(observation, dict):
|
| 42 |
+
value = observation.get(name, default)
|
| 43 |
+
else:
|
| 44 |
+
value = getattr(observation, name, default)
|
| 45 |
+
if isinstance(value, str):
|
| 46 |
+
return compact_text(
|
| 47 |
+
value,
|
| 48 |
+
default=default if isinstance(default, str) else "",
|
| 49 |
+
preserve_newlines=preserve_newlines,
|
| 50 |
+
)
|
| 51 |
+
return value
|
| 52 |
+
|
| 53 |
+
|
| 54 |
+
def format_bool(value: Any) -> str:
|
| 55 |
+
return "true" if bool(value) else "false"
|
| 56 |
+
|
| 57 |
+
|
| 58 |
+
def format_reward(value: Any) -> str:
|
| 59 |
+
try:
|
| 60 |
+
reward = float(value)
|
| 61 |
+
except Exception:
|
| 62 |
+
reward = 0.0
|
| 63 |
+
return f"{reward:.2f}"
|
| 64 |
+
|
| 65 |
+
|
| 66 |
+
def format_error(value: Any) -> str:
|
| 67 |
+
text = compact_text(value, default="")
|
| 68 |
+
return text if text else "null"
|
| 69 |
+
|
| 70 |
+
|
| 71 |
+
def parse_task_ids() -> list[str]:
|
| 72 |
+
"""Load stable task names with a deterministic fallback."""
|
| 73 |
+
|
| 74 |
+
try:
|
| 75 |
+
values = task_ids()
|
| 76 |
+
if isinstance(values, Iterable):
|
| 77 |
+
loaded = [compact_text(item, default="") for item in values]
|
| 78 |
+
loaded = [item for item in loaded if item]
|
| 79 |
+
if loaded:
|
| 80 |
+
return loaded
|
| 81 |
+
except Exception:
|
| 82 |
+
pass
|
| 83 |
+
return [
|
| 84 |
+
"syntax_fix_invoice_totals",
|
| 85 |
+
"bug_fix_session_windows",
|
| 86 |
+
"optimization_rank_active_users",
|
| 87 |
+
]
|
| 88 |
+
|
| 89 |
+
|
| 90 |
+
@contextmanager
|
| 91 |
+
def suppress_output() -> Iterator[None]:
|
| 92 |
+
"""Silence libraries that write noisy logs to stdout or stderr."""
|
| 93 |
+
|
| 94 |
+
with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
|
| 95 |
+
yield
|
graders/shared.py
CHANGED

```diff
@@ -6,6 +6,7 @@ import ast
 import difflib
 import math
 import multiprocessing as mp
+import os
 import time
 import traceback
 from typing import Any, Callable, Dict, List
@@ -150,6 +151,28 @@ def run_with_timeout(
     return {"timed_out": False, "data": message["data"]}
 
 
+def run_inline_with_timeout(
+    worker: Callable[[Dict[str, Any]], Dict[str, Any]],
+    payload: Dict[str, Any],
+    timeout_s: float,
+) -> Dict[str, Any]:
+    """Fallback execution path for platforms where spawned workers are unreliable."""
+
+    started = time.perf_counter()
+    try:
+        data = worker(payload)
+    except Exception as exc:
+        return {
+            "timed_out": False,
+            "error": f"{type(exc).__name__}: {exc}\n{traceback.format_exc(limit=5)}",
+        }
+
+    elapsed = time.perf_counter() - started
+    if elapsed > timeout_s:
+        return {"timed_out": True, "error": f"Execution exceeded {timeout_s:.1f}s timeout."}
+    return {"timed_out": False, "data": data}
+
+
 def _execute_cases_worker(payload: Dict[str, Any]) -> Dict[str, Any]:
     namespace: Dict[str, Any] = {}
     exec(payload["code"], namespace)
@@ -366,7 +389,10 @@ def benchmark_candidate(task: ReviewTask, code: str, timeout_s: float) -> Dict[s
         "events": events,
         "iterations": task.benchmark_config.get("iterations", 5),
     }
-    result = run_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
+    if os.name == "nt":
+        result = run_inline_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
+    else:
+        result = run_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
     if result.get("timed_out"):
         return {"runtime_score": component_score(STRICT_SCORE_MIN), "timed_out": True, "details": result["error"]}
     if "error" in result:
```
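The inline fallback trades preemption for portability: unlike the multiprocessing path, it cannot kill a worker that hangs, it can only report the overrun after the worker returns. A self-contained sketch of the same pattern (the function is re-declared here and the traceback detail dropped for brevity):

```python
import time
from typing import Any, Callable, Dict


def run_inline_with_timeout(
    worker: Callable[[Dict[str, Any]], Dict[str, Any]],
    payload: Dict[str, Any],
    timeout_s: float,
) -> Dict[str, Any]:
    # Run in-process, then compare elapsed wall time against the budget.
    # A hung worker is NOT interrupted; the timeout is reported post hoc.
    started = time.perf_counter()
    try:
        data = worker(payload)
    except Exception as exc:
        return {"timed_out": False, "error": f"{type(exc).__name__}: {exc}"}
    if time.perf_counter() - started > timeout_s:
        return {"timed_out": True,
                "error": f"Execution exceeded {timeout_s:.1f}s timeout."}
    return {"timed_out": False, "data": data}


ok = run_inline_with_timeout(lambda p: p["x"] * 2, {"x": 21}, 1.0)
slow = run_inline_with_timeout(lambda p: time.sleep(0.05) or "done", {}, 0.01)
print(ok)    # → {'timed_out': False, 'data': 42}
print(slow)  # flagged as timed out after the fact
```

That post-hoc semantics is acceptable here because the Windows branch is only a fallback; on POSIX the spawned-worker path still enforces a hard timeout.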
inference.py
CHANGED

The 382-line inline runner shrinks to a thin entrypoint that delegates to the new `app` modules:

```diff
@@ -1,382 +1,11 @@
 #!/usr/bin/env python3
-"""
+"""Root validator entrypoint."""
 
 from __future__ import annotations
 
-import io
-import json
-import os
 import sys
-import time
-from collections.abc import Iterable
-from contextlib import redirect_stderr, redirect_stdout
-from typing import Any
 
-from ...
-
-try:
-    from openai import OpenAI
-except Exception:
-    OpenAI = None  # type: ignore[assignment]
-
-
-install_openenv_fastmcp_compat()
-
-try:
-    from server.env import PythonCodeReviewEnvironment
-except Exception:
-    PythonCodeReviewEnvironment = None  # type: ignore[assignment]
-
-try:
-    from openenv_models import PythonCodeReviewAction
-except Exception:
-    PythonCodeReviewAction = None  # type: ignore[assignment]
-
-try:
-    from tasks import get_task, task_ids
-except Exception:
-    get_task = None  # type: ignore[assignment]
-    task_ids = None  # type: ignore[assignment]
-
-
-ALLOWED_ACTIONS = {
-    "analyze_code",
-    "edit_code",
-    "run_tests",
-    "submit_solution",
-}
-DEFAULT_MODEL_NAME = "mock-model"
-API_TIMEOUT_SECONDS = 3.0
-API_RETRIES = 1
-API_RETRY_DELAY_SECONDS = 0.2
-MIN_SCORE = 0.01
-POOR_SCORE = 0.1
-MAX_SCORE = 0.99
-
-
-def safe_env(name: str, default: str = "") -> str:
-    """Read a string environment variable without raising."""
-    try:
-        value = os.getenv(name)
-        return default if value is None else str(value)
-    except Exception:
-        return default
-
-
-def clamp_score(value: Any) -> float:
-    """Clamp numeric scores to the required open interval (0, 1)."""
-    try:
-        numeric = float(value)
-    except Exception:
-        return MIN_SCORE
-    if numeric != numeric or numeric in (float("inf"), float("-inf")):
-        return MIN_SCORE
-    numeric = max(MIN_SCORE, min(MAX_SCORE, numeric))
-    assert 0 < numeric < 1, f"Invalid score: {numeric}"
-    return numeric
-
-
-def safe_float(value: Any, default: float = POOR_SCORE) -> float:
-    """Convert a value to float without raising."""
-    try:
-        return float(value)
-    except Exception:
-        return default
-
-
-def safe_text(value: Any, default: str = "") -> str:
-    """Convert values into short single-line text."""
-    try:
-        text = str(value)
-    except Exception:
-        return default
-    text = " ".join(text.split())
-    return text[:240] if text else default
-
-
-def safe_getattr(obj: Any, name: str, default: Any = None) -> Any:
-    """Fetch an attribute from an object without raising."""
-    try:
-        return getattr(obj, name, default)
-    except Exception:
-        return default
-
-
-def safe_code(value: Any, default: str = "") -> str:
-    """Convert a code payload to text without collapsing whitespace."""
-    if value is None:
-        return default
-    try:
-        return str(value)
-    except Exception:
-        return default
-
-
-def safe_task_list() -> list[str]:
-    """Load task ids with a deterministic fallback."""
-    try:
-        if callable(task_ids):
-            loaded = [safe_text(item, "") for item in task_ids()]
-            loaded = [item for item in loaded if item]
-            if loaded:
-                return loaded
-    except Exception:
-        pass
-    return [
-        "syntax_fix_invoice_totals",
-        "bug_fix_session_windows",
-        "optimization_rank_active_users",
-    ]
-
-
-def safe_reference_code(task_id: str, current_code: str) -> str:
-    """Load the task reference code for deterministic fallback repair."""
-    try:
-        if callable(get_task):
-            task = get_task(task_id)
-            reference_code = safe_code(safe_getattr(task, "reference_code", ""), "")
-            if reference_code.strip():
-                return reference_code
-    except Exception:
-        pass
-    return current_code
-
-
-def parse_json_response(raw_text: str) -> dict[str, Any]:
-    """Parse model output into a validated action payload."""
-    try:
-        text = raw_text or ""
-        start = text.find("{")
-        end = text.rfind("}") + 1
-        if start >= 0 and end > start:
-            payload = json.loads(text[start:end])
-            if isinstance(payload, dict):
-                action_type = safe_text(payload.get("action_type", "analyze_code"), "analyze_code")
-                code = payload.get("code")
-                if action_type not in ALLOWED_ACTIONS:
-                    action_type = "analyze_code"
-                if action_type == "edit_code" and code is not None:
-                    code = safe_code(code, "")
-                else:
-                    code = None
-                return {"action_type": action_type, "code": code, "fallback": False}
-    except Exception:
-        pass
-    return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-
-def build_prompt(observation: Any) -> str:
-    """Build a compact repair prompt for the current observation."""
-    try:
-        task_description = safe_text(safe_getattr(observation, "task_description", ""), "No task description.")
-        errors = safe_text(safe_getattr(observation, "errors", ""), "none")
-        tests = safe_text(safe_getattr(observation, "test_results", ""), "not available")
-        score = clamp_score(safe_getattr(observation, "score", POOR_SCORE))
-        current_code = safe_code(safe_getattr(observation, "current_code", ""), "")
-        visible_tests = safe_getattr(observation, "visible_tests", [])
-        if not isinstance(visible_tests, Iterable) or isinstance(visible_tests, (str, bytes)):
-            visible_tests = []
-        visible_block = "\n".join(f"- {safe_text(item, 'unknown test')}" for item in list(visible_tests)[:4]) or "- none"
-        return (
-            "Return exactly one JSON object with keys action_type and optional code.\n"
-            "Allowed action_type values: analyze_code, edit_code, run_tests, submit_solution.\n"
-            "Prefer one safe next action only.\n"
-            f"Task: {task_description}\n"
-            f"Score: {score:.4f}\n"
-            f"Errors: {errors}\n"
-            f"Tests: {tests}\n"
-            f"Visible tests:\n{visible_block}\n"
-            f"Code:\n{current_code}\n"
-        )
-    except Exception:
-        return (
-            "Return exactly one JSON object with keys action_type and optional code. "
-            "Use analyze_code if unsure."
-        )
-
-
-def create_client() -> Any | None:
-    """Create an OpenAI-compatible client when a base URL is configured."""
-    if OpenAI is None:
-        return None
-    base_url = safe_env("API_BASE_URL", "")
-    if not base_url:
-        return None
-    api_key = safe_env("HF_TOKEN", safe_env("OPENAI_API_KEY", "dummy"))
-    try:
-        return OpenAI(base_url=base_url, api_key=api_key)
-    except Exception:
-        return None
-
-
-def run_llm(client: Any | None, model: str, prompt: str) -> dict[str, Any]:
-    """Call the LLM once and fall back safely on any failure."""
-    if client is None:
-        return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-    for attempt in range(API_RETRIES + 1):
-        try:
-            with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-                response = client.with_options(timeout=API_TIMEOUT_SECONDS).chat.completions.create(
-                    model=model,
-                    messages=[{"role": "user", "content": prompt}],
-                    temperature=0,
-                    max_tokens=300,
-                )
-            message = safe_getattr(response.choices[0].message, "content", "")
-            return parse_json_response(safe_code(message, ""))
-        except Exception:
-            if attempt < API_RETRIES:
-                time.sleep(API_RETRY_DELAY_SECONDS * (attempt + 1))
-
-    return {"action_type": "analyze_code", "code": None, "fallback": True}
-
-
-def make_action(action_payload: dict[str, Any]) -> Any:
-    """Create a typed environment action with a safe fallback."""
-    action_type = safe_text(action_payload.get("action_type", "analyze_code"), "analyze_code")
-    if action_type not in ALLOWED_ACTIONS:
-        action_type = "analyze_code"
-    code = action_payload.get("code")
-    if action_type != "edit_code":
-        code = None
-    if PythonCodeReviewAction is None:
-        return {"action_type": action_type, "code": code}
-    try:
-        return PythonCodeReviewAction(action_type=action_type, code=code)
-    except Exception:
-        return PythonCodeReviewAction(action_type="analyze_code", code=None)
-
-
-def safe_step(env: Any, action: Any) -> Any:
-    """Step the environment without leaking extra stdout."""
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            return env.step(action)
-    except Exception:
-        return None
-
-
-def safe_reset(env: Any, task_id: str) -> Any:
-    """Reset the environment without leaking extra stdout."""
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            return env.reset(task_id=task_id)
-    except Exception:
-        return None
-
-
-def observation_reward(observation: Any) -> float:
-    """Extract the scalar step reward from an observation."""
-    reward = safe_getattr(observation, "reward", None)
-    if reward is not None:
-        return clamp_score(safe_float(reward, POOR_SCORE))
-    reward_details = safe_getattr(observation, "reward_details", None)
-    reward_value = safe_getattr(reward_details, "value", POOR_SCORE)
-    return clamp_score(safe_float(reward_value, POOR_SCORE))
-
-
-def fallback_first_action(task_id: str) -> dict[str, Any]:
-    """Choose a deterministic first action when the model is unavailable."""
-    if task_id == "syntax_fix_invoice_totals":
-        return {"action_type": "analyze_code", "code": None}
-    return {"action_type": "run_tests", "code": None}
-
-
-def select_first_action(task_id: str, llm_action: dict[str, Any]) -> dict[str, Any]:
-    """Prefer a safe model suggestion, otherwise use the deterministic fallback."""
-    action_type = safe_text(llm_action.get("action_type", ""), "")
-    code = llm_action.get("code")
-    if action_type not in ALLOWED_ACTIONS or action_type == "submit_solution":
-        return fallback_first_action(task_id)
-    if action_type == "edit_code" and not safe_code(code, "").strip():
-        return fallback_first_action(task_id)
-    return {"action_type": action_type, "code": code}
-
-
-def emit_start(task_id: str) -> None:
-    """Emit the validator-readable START line."""
-    print(f"[START] task={task_id}", flush=True)
-
-
-def emit_step(step_index: int, reward: float) -> None:
-    """Emit the validator-readable STEP line."""
-    print(f"[STEP] step={step_index} reward={reward:.4f}", flush=True)
-
-
-def emit_end(task_id: str, score: float, steps: int) -> None:
-    """Emit the validator-readable END line."""
-    print(f"[END] task={task_id} score={clamp_score(score):.4f} steps={max(int(steps), 0)}", flush=True)
-
-
-def run_task(task_id: str, client: Any | None, model: str) -> None:
-    """Run one deterministic task trajectory and emit strict structured stdout."""
-    emit_start(task_id)
-
-    if PythonCodeReviewEnvironment is None:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    try:
-        with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
-            env = PythonCodeReviewEnvironment(verbose=False)
-    except Exception:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    observation = safe_reset(env, task_id)
-    if observation is None:
-        emit_step(1, POOR_SCORE)
-        emit_end(task_id, POOR_SCORE, 1)
-        return
-
-    step_count = 0
-    llm_action = run_llm(client, model, build_prompt(observation))
-    reference_code = safe_reference_code(task_id, safe_code(safe_getattr(observation, "current_code", ""), ""))
-    planned_actions = [
-        select_first_action(task_id, llm_action),
-        {"action_type": "edit_code", "code": reference_code},
-        {"action_type": "submit_solution", "code": None},
-    ]
-
-    final_observation = observation
-    for action_payload in planned_actions:
-        if step_count > 0 and bool(safe_getattr(final_observation, "done", False)):
-            break
-        if action_payload["action_type"] == "edit_code":
-            current_code = safe_code(safe_getattr(final_observation, "current_code", ""), "")
-            if not safe_code(action_payload.get("code"), "").strip():
-                continue
-            if current_code.strip() == safe_code(action_payload.get("code"), "").strip():
-                continue
-
-        next_observation = safe_step(env, make_action(action_payload))
-        step_count += 1
-        if next_observation is None:
-            emit_step(step_count, POOR_SCORE)
-            emit_end(task_id, clamp_score(safe_getattr(final_observation, "score", POOR_SCORE)), step_count)
-            return
-
-        final_observation = next_observation
-        emit_step(step_count, observation_reward(final_observation))
-
-    emit_end(task_id, clamp_score(safe_getattr(final_observation, "score", POOR_SCORE)), step_count)
-
-
-def main() -> int:
-    """Run every benchmark task and emit strict structured stdout."""
-    model_name = safe_env("MODEL_NAME", DEFAULT_MODEL_NAME) or DEFAULT_MODEL_NAME
-    client = create_client()
-    for task_id in safe_task_list():
-        try:
-            run_task(task_id, client, model_name)
-        except Exception:
-            emit_start(task_id)
-            emit_step(1, POOR_SCORE)
-            emit_end(task_id, POOR_SCORE, 1)
-    return 0
+
+from app.env.runner import main
 
 
 if __name__ == "__main__":
```
models.py
ADDED

```python
"""Typed models for the python_code_review_env environment."""

from __future__ import annotations

from typing import Any, Dict, List, Literal, Optional

from pydantic import BaseModel, Field

from openenv.core.env_server.types import Action, Observation, State


Difficulty = Literal["easy", "medium", "hard"]
TaskKind = Literal["syntax_fix", "bug_fix", "optimization"]
ActionType = Literal["analyze_code", "edit_code", "run_tests", "submit_solution"]


class HistoryEntry(BaseModel):
    """One environment transition recorded for the agent."""

    step: int = Field(..., ge=0)
    action_type: ActionType
    status: str = Field(..., description="Short outcome summary.")
    reward: float = Field(..., gt=0.0, lt=1.0, description="Reward returned for the step.")


class RewardDetails(BaseModel):
    """Transparent reward decomposition for debugging and training."""

    value: float = Field(..., gt=0.0, lt=1.0, description="Clamped net reward in (0.0, 1.0).")
    syntax_reward: float = Field(default=0.0)
    test_reward: float = Field(default=0.0)
    correctness_bonus: float = Field(default=0.0)
    quality_bonus: float = Field(default=0.0)
    error_reduction_bonus: float = Field(default=0.0)
    completion_bonus: float = Field(default=0.0)
    runtime_bonus: float = Field(default=0.0)
    progress_delta: float = Field(default=0.0)
    invalid_action_penalty: float = Field(default=0.0)
    timeout_penalty: float = Field(default=0.0)
    regression_penalty: float = Field(default=0.0)
    stagnation_penalty: float = Field(default=0.0)
    reason: str = Field(..., description="Human-readable reward explanation.")
    prev_score: float = Field(default=0.01, gt=0.0, lt=1.0)
    curr_score: float = Field(default=0.01, gt=0.0, lt=1.0)
    code_changed: bool = Field(default=False)


class PythonCodeReviewAction(Action):
    """Action schema exposed to the agent."""

    action_type: ActionType = Field(..., description="Environment action to take.")
    code: Optional[str] = Field(
        default=None,
        description="Updated Python source for edit_code or submit_solution actions.",
    )


class PythonCodeReviewObservation(Observation):
    """Observation returned by reset and step."""

    task_id: str = Field(..., description="Stable task identifier.")
    title: str = Field(..., description="Human-readable task title.")
    difficulty: Difficulty
    task_kind: TaskKind
    task_description: str = Field(..., description="Task instructions shown to the agent.")
    current_code: str = Field(..., description="Latest code under review.")
    errors: str = Field(default="", description="Syntax or execution errors.")
    test_results: str = Field(default="", description="Public test and benchmark feedback.")
    visible_tests: List[str] = Field(default_factory=list)
    history: List[HistoryEntry] = Field(default_factory=list)
    attempts_remaining: int = Field(..., ge=0)
    last_action_status: str = Field(default="")
    last_action_error: Optional[str] = Field(default=None)
    score: float = Field(..., gt=0.0, lt=1.0)
    reward: float = Field(default=0.1, gt=0.0, lt=1.0)
    done: bool = Field(default=False)
    reward_details: RewardDetails = Field(
        default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
    )


class PythonCodeReviewState(State):
    """Internal environment state exposed through /state."""

    task_id: Optional[str] = Field(default=None)
    difficulty: Optional[Difficulty] = Field(default=None)
    task_kind: Optional[TaskKind] = Field(default=None)
    attempts_remaining: int = Field(default=0, ge=0)
    current_code: str = Field(default="")
    errors: str = Field(default="")
    test_results: str = Field(default="")
    history: List[HistoryEntry] = Field(default_factory=list)
    score: float = Field(default=0.01, gt=0.0, lt=1.0)
    done: bool = Field(default=False)


class TaskDescriptor(BaseModel):
    """Static task metadata."""

    task_id: str
    title: str
    difficulty: Difficulty
    task_kind: TaskKind
    task_description: str
    starter_code: str
    visible_tests: List[str] = Field(default_factory=list)
    repo_summary: str = Field(default="")
    changed_files: List[str] = Field(default_factory=list)
    available_files: List[str] = Field(default_factory=list)
    goal: str = Field(default="")
    max_steps: int = Field(..., ge=1)


class TaskSummary(BaseModel):
    """Compact task listing entry."""

    task_id: str
    difficulty: Difficulty
    title: str
    goal: str = Field(default="")


class TaskGrade(BaseModel):
    """Deterministic grader output."""

    score: float = Field(..., gt=0.0, lt=1.0)
    syntax_score: float = Field(default=0.01, gt=0.0, lt=1.0)
    tests_passed: int = Field(default=0, ge=0)
    tests_total: int = Field(default=0, ge=0)
    quality_score: float = Field(default=0.01, gt=0.0, lt=1.0)
    runtime_score: float = Field(default=0.01, gt=0.0, lt=1.0)
    timed_out: bool = Field(default=False)
    details: Dict[str, Any] = Field(default_factory=dict)


class HealthResponse(BaseModel):
    """Health payload for smoke tests."""

    status: Literal["ok"] = "ok"
    environment: str = "python_code_review_env"
    task_count: int = Field(default=0, ge=0)


PythonAction = PythonCodeReviewAction
PythonObservation = PythonCodeReviewObservation
PythonState = PythonCodeReviewState
```
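Several fields in these models share the `gt=0.0, lt=1.0` constraint: every score and reward must sit strictly inside the open interval (0, 1), never at the endpoints. Producers of those fields therefore clamp before constructing a model. A plain-Python sketch of that clamping, mirroring the `clamp_score` helper (with its `MIN_SCORE`/`MAX_SCORE` bounds) that previously lived inline in `inference.py`:

```python
MIN_SCORE, MAX_SCORE = 0.01, 0.99


def clamp_score(value) -> float:
    # Mirrors the Field(gt=0.0, lt=1.0) contract: map any input into
    # the open interval (0, 1) so pydantic validation never rejects it.
    try:
        numeric = float(value)
    except (TypeError, ValueError):
        return MIN_SCORE
    if numeric != numeric or numeric in (float("inf"), float("-inf")):
        return MIN_SCORE  # NaN / infinity guard
    return max(MIN_SCORE, min(MAX_SCORE, numeric))


print(clamp_score(1.7))     # → 0.99
print(clamp_score(-3))      # → 0.01
print(clamp_score("0.5"))   # → 0.5
print(clamp_score("oops"))  # → 0.01
```

Keeping the clamp on the producer side means the pydantic constraints act as a hard assertion rather than the primary sanitizer.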
openenv_models.py
CHANGED

```diff
@@ -31,6 +31,9 @@ class RewardDetails(BaseModel):
     test_reward: float = Field(default=0.0)
     correctness_bonus: float = Field(default=0.0)
     quality_bonus: float = Field(default=0.0)
+    error_reduction_bonus: float = Field(default=0.0)
+    completion_bonus: float = Field(default=0.0)
+    runtime_bonus: float = Field(default=0.0)
     progress_delta: float = Field(default=0.0)
     invalid_action_penalty: float = Field(default=0.0)
     timeout_penalty: float = Field(default=0.0)
@@ -67,7 +70,10 @@ class PythonCodeReviewObservation(Observation):
     history: List[HistoryEntry] = Field(default_factory=list)
     attempts_remaining: int = Field(..., ge=0)
     last_action_status: str = Field(default="")
+    last_action_error: Optional[str] = Field(default=None)
     score: float = Field(..., gt=0.0, lt=1.0)
+    reward: float = Field(default=0.1, gt=0.0, lt=1.0)
+    done: bool = Field(default=False)
     reward_details: RewardDetails = Field(
         default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
     )
```
pyproject.toml
CHANGED

@@ -13,7 +13,6 @@ dependencies = [
     "gradio>=5.26.0",
     "openai>=1.76.0",
     "openenv-core[core]>=0.2.2",
-    "pytest>=8.0.0",
     "streamlit>=1.44.0",
     "torch>=2.2.0",
     "transformers>=4.45.0",
@@ -22,6 +21,7 @@ dependencies = [
 
 [project.optional-dependencies]
 dev = [
+    "pytest>=8.0.0",
     "pytest-cov>=4.0.0",
 ]
 
@@ -37,10 +37,15 @@ packages = [
     "python_env.graders",
     "python_env.api",
     "python_env.app",
+    "python_env.app.agents",
+    "python_env.app.env",
+    "python_env.app.models",
+    "python_env.app.services",
+    "python_env.app.utils",
     "python_env.analyzers",
     "python_env.models",
     "python_env.schemas",
     "python_env.services",
     "python_env.utils",
 ]
-package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }
+package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.app.agents" = "app/agents", "python_env.app.env" = "app/env", "python_env.app.models" = "app/models", "python_env.app.services" = "app/services", "python_env.app.utils" = "app/utils", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }
schemas/response.py
CHANGED

@@ -51,6 +51,9 @@ class ScoreBreakdown(BaseModel):
     domain_score: float = Field(..., ge=0.0, le=1.0)
     lint_score: float = Field(..., ge=0.0, le=1.0)
     complexity_penalty: float = Field(..., ge=0.0, le=1.0)
+    quality_signal: float = Field(..., ge=0.0, le=1.0)
+    error_reduction_signal: float = Field(..., ge=0.0, le=1.0)
+    completion_signal: float = Field(..., ge=0.0, le=1.0)
     reward: float = Field(..., ge=0.0, le=1.0)
server/Dockerfile
CHANGED

@@ -2,28 +2,24 @@ FROM python:3.11-slim
 
 ENV PYTHONDONTWRITEBYTECODE=1 \
     PYTHONUNBUFFERED=1 \
-    PIP_NO_CACHE_DIR=1
+    PIP_NO_CACHE_DIR=1 \
+    PIP_DISABLE_PIP_VERSION_CHECK=1 \
+    ENABLE_GRADIO_DEMO=false
 
 WORKDIR /app
 
-COPY
-COPY api /app/api
-COPY app /app/app
-COPY analyzers /app/analyzers
-COPY models /app/models
-COPY schemas /app/schemas
-COPY server /app/server
-COPY services /app/services
-COPY tasks /app/tasks
-COPY utils /app/utils
-COPY graders /app/graders
+COPY server/requirements.txt /tmp/requirements.txt
 
 RUN python -m pip install --upgrade pip && \
-    pip install .
+    pip install -r /tmp/requirements.txt
+
+COPY . /app
+
+RUN pip install --no-deps .
 
 EXPOSE 8000
 
 HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
-    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000', timeout=3).read()"
+    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"
 
-CMD ["
+CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
server/__pycache__/__init__.cpython-313.pyc
CHANGED

Binary files a/server/__pycache__/__init__.cpython-313.pyc and b/server/__pycache__/__init__.cpython-313.pyc differ

server/__pycache__/app.cpython-313.pyc
CHANGED

Binary files a/server/__pycache__/app.cpython-313.pyc and b/server/__pycache__/app.cpython-313.pyc differ
server/app.py
CHANGED

@@ -1,7 +1,11 @@
-"""FastAPI
+"""OpenEnv FastAPI entrypoint with optional Gradio mounting."""
 
 from __future__ import annotations
 
+import os
+
+from fastapi import FastAPI
+
 try:
     from openenv.core.env_server.http_server import create_app
 except Exception as exc:  # pragma: no cover
@@ -17,11 +21,20 @@ except Exception:
 try:
     from ..openenv_models import PythonCodeReviewAction, PythonCodeReviewObservation
     from .env import PythonCodeReviewEnvironment
-    from .demo import build_demo
 except ImportError:
     from openenv_models import PythonCodeReviewAction, PythonCodeReviewObservation
     from server.env import PythonCodeReviewEnvironment
-
+
+
+def _gradio_enabled() -> bool:
+    return str(os.getenv("ENABLE_GRADIO_DEMO", "false")).strip().lower() in {"1", "true", "yes", "on"}
+
+
+def _max_concurrent_envs() -> int:
+    try:
+        return max(int(os.getenv("OPENENV_MAX_CONCURRENT_ENVS", "2")), 1)
+    except Exception:
+        return 2
 
 
 def build_application():
@@ -32,11 +45,24 @@ def build_application():
         PythonCodeReviewAction,
         PythonCodeReviewObservation,
         env_name="python_code_review_env",
-        max_concurrent_envs=
+        max_concurrent_envs=_max_concurrent_envs(),
     )
-
-
-
+    served_app = api_app
+    if gr is not None and _gradio_enabled():
+        try:
+            from .demo import build_demo
+        except ImportError:
+            from server.demo import build_demo
+        served_app = gr.mount_gradio_app(api_app, build_demo(), path="/")
+
+    wrapper_app = FastAPI(title="python_code_review_env", version="1.0.0")
+
+    @wrapper_app.get("/health", include_in_schema=False)
+    def _health() -> dict[str, str]:
+        return {"status": "ok"}
+
+    wrapper_app.mount("/", served_app)
+    return wrapper_app
 
 
 app = build_application()
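The `_gradio_enabled` toggle reduces to a small truthy-string parse. A standalone sketch of the same logic, reading from an explicit dict rather than `os.environ` so it can be tested without mutating the process environment (that dict parameter is an assumption of this sketch, not the repo's signature):

```python
def gradio_enabled(env: dict) -> bool:
    # Same truthy set as server/app.py: "1", "true", "yes", "on",
    # matched case-insensitively after stripping whitespace.
    value = str(env.get("ENABLE_GRADIO_DEMO", "false"))
    return value.strip().lower() in {"1", "true", "yes", "on"}


print(gradio_enabled({"ENABLE_GRADIO_DEMO": " True "}))  # True
print(gradio_enabled({"ENABLE_GRADIO_DEMO": "0"}))       # False
print(gradio_enabled({}))                                # False
```

Keeping the flag off by default (as the Dockerfile does with `ENABLE_GRADIO_DEMO=false`) means the Space serves only the OpenEnv API unless the demo is explicitly requested.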
server/env.py
CHANGED

@@ -63,6 +63,7 @@ class PythonCodeReviewEnvironment(
         self._current_code: str = self._task.starter_code
         self._history: list[HistoryEntry] = []
         self._last_reward = RewardDetails(value=0.1, reason="Environment initialized.")
+        self._last_action_error: str | None = None
         self._current_grade = _empty_grade()
         self._state = PythonCodeReviewState(episode_id=str(uuid4()), step_count=0)
         self.reset()
@@ -77,8 +78,13 @@ class PythonCodeReviewEnvironment(
         self._task = select_task(seed=seed, task_id=task_id)
         self._current_code = self._task.starter_code
         self._history = []
+        self._last_action_error = None
         self._last_reward = RewardDetails(value=0.1, reason="Environment reset.")
-        self._current_grade
+        self._current_grade, self._last_action_error = self._safe_grade_task(
+            self._task,
+            self._current_code,
+            include_hidden=False,
+        )
 
         self._state = PythonCodeReviewState(
             episode_id=episode_id or str(uuid4()),
@@ -142,11 +148,13 @@ class PythonCodeReviewEnvironment(
         invalid_action = False
         code_changed = False
        use_hidden_grading = False
+        action_error: str | None = None
 
        if action.action_type == "edit_code":
            if not action.code or not action.code.strip():
                invalid_action = True
                status = "edit_code requires a non-empty code payload."
+                action_error = status
            else:
                code_changed = action.code != self._current_code
                self._current_code = action.code
@@ -164,18 +172,22 @@ class PythonCodeReviewEnvironment(
        else:  # pragma: no cover
            invalid_action = True
            status = f"Unsupported action_type: {action.action_type}"
+            action_error = status
 
        self._state.step_count += 1
 
        if invalid_action:
            current_grade = previous_grade
        else:
-            current_grade =
+            current_grade, grade_error = self._safe_grade_task(
                self._task,
                self._current_code,
                include_hidden=use_hidden_grading,
                timeout_s=timeout_s or 3.0,
            )
+            if grade_error:
+                action_error = grade_error
+                status = f"{status} Grading fallback used."
        if action.action_type == "analyze_code":
            status = self._analysis_status(current_grade)
        elif action.action_type == "run_tests":
@@ -208,6 +220,7 @@ class PythonCodeReviewEnvironment(
 
        self._current_grade = current_grade
        self._last_reward = reward_details
+        self._last_action_error = action_error
        attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)
 
        self._state.task_id = self._task.task_id
@@ -226,7 +239,14 @@ class PythonCodeReviewEnvironment(
            status=status,
            reward_details=reward_details,
        )
-        return observation, reward_details.value, observation.done, {
+        return observation, reward_details.value, observation.done, {
+            "task_id": observation.task_id,
+            "score": observation.score,
+            "done": observation.done,
+            "attempts_remaining": observation.attempts_remaining,
+            "last_action_status": observation.last_action_status,
+            "last_action_error": observation.last_action_error,
+        }
 
    @property
    def state(self) -> PythonCodeReviewState:
@@ -252,11 +272,13 @@ class PythonCodeReviewEnvironment(
            history=list(self._history),
            attempts_remaining=self._state.attempts_remaining,
            last_action_status=status,
+            last_action_error=self._last_action_error,
            score=grade.score,
            reward=reward_details.value,
            done=self._state.done,
            reward_details=reward_details,
            metadata={
+                "benchmark": "python_code_review_env",
                "goal": self._task.goal,
                "repo_summary": self._task.repo_summary,
                "changed_files": self._task.changed_files,
@@ -280,25 +302,34 @@ class PythonCodeReviewEnvironment(
        curr_score = current_grade.score
        prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
        curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)
+        prev_runtime = previous_grade.runtime_score
+        curr_runtime = current_grade.runtime_score
+        prev_compile_error = bool(str(previous_grade.details.get("compile_error", "")).strip())
+        curr_compile_error = bool(str(current_grade.details.get("compile_error", "")).strip())
 
        syntax_reward = 0.14 if previous_grade.syntax_score < 0.9 and current_grade.syntax_score >= 0.9 else 0.0
-        test_reward = round(max(curr_rate - prev_rate, 0.0) * 0.
-        progress_delta = round(max(curr_score - prev_score, 0.0) * 0.
-        quality_bonus = round(max(current_grade.quality_score - previous_grade.quality_score, 0.0) * 0.
+        test_reward = round(max(curr_rate - prev_rate, 0.0) * 0.28, 3)
+        progress_delta = round(max(curr_score - prev_score, 0.0) * 0.3, 3)
+        quality_bonus = round(max(current_grade.quality_score - previous_grade.quality_score, 0.0) * 0.12, 3)
+        runtime_bonus = round(max(curr_runtime - prev_runtime, 0.0) * 0.08, 3)
+        error_reduction_bonus = 0.1 if prev_compile_error and not curr_compile_error else 0.0
+        completion_bonus = 0.14 if final_submission and curr_rate >= 0.999 and curr_score >= 0.94 else 0.0
        correctness_bonus = 0.12 if final_submission and curr_score >= 0.94 and prev_score < 0.94 else 0.0
 
-        invalid_action_penalty = 0.
-        timeout_penalty = 0.
-        regression_penalty = round(max(prev_score - curr_score, 0.0) * 0.
-        stagnation_penalty = 0.
+        invalid_action_penalty = round((0.04 + (0.08 * (1.0 - prev_score))) if invalid_action else 0.0, 3)
+        timeout_penalty = round((0.06 + (0.08 * max(curr_runtime, prev_runtime))) if timed_out else 0.0, 3)
+        regression_penalty = round(max(prev_score - curr_score, 0.0) * 0.25, 3)
+        stagnation_penalty = round((0.02 + (0.05 * prev_score)) if action.action_type == "edit_code" and not code_changed else 0.0, 3)
 
        raw_value = (
-            0.
-            + 0.45 * curr_score
+            0.32 * curr_score
            + syntax_reward
            + test_reward
            + progress_delta
            + quality_bonus
+            + error_reduction_bonus
+            + completion_bonus
+            + runtime_bonus
            + correctness_bonus
            - invalid_action_penalty
            - timeout_penalty
@@ -316,6 +347,12 @@ class PythonCodeReviewEnvironment(
            reason_parts.append("overall score improved")
        if quality_bonus:
            reason_parts.append("code quality improved")
+        if error_reduction_bonus:
+            reason_parts.append("errors removed")
+        if completion_bonus:
+            reason_parts.append("task completed")
+        if runtime_bonus:
+            reason_parts.append("runtime improved")
        if correctness_bonus:
            reason_parts.append("full correctness bonus")
        if invalid_action_penalty:
@@ -335,6 +372,9 @@ class PythonCodeReviewEnvironment(
            test_reward=test_reward,
            correctness_bonus=correctness_bonus,
            quality_bonus=quality_bonus,
+            error_reduction_bonus=error_reduction_bonus,
+            completion_bonus=completion_bonus,
+            runtime_bonus=runtime_bonus,
            progress_delta=progress_delta,
            invalid_action_penalty=invalid_action_penalty,
            timeout_penalty=timeout_penalty,
@@ -352,6 +392,22 @@ class PythonCodeReviewEnvironment(
            return compile_error
        return "Code parses successfully."
 
+    def _safe_grade_task(
+        self,
+        task: ReviewTask,
+        code: str,
+        *,
+        include_hidden: bool,
+        timeout_s: float = 3.0,
+    ) -> tuple[TaskGrade, str | None]:
+        try:
+            return (
+                grade_task(task, code, include_hidden=include_hidden, timeout_s=timeout_s),
+                None,
+            )
+        except Exception as exc:  # pragma: no cover
+            return _empty_grade(), f"{type(exc).__name__}: {exc}"
+
    def _format_test_results(self, grade: TaskGrade) -> str:
        parts = [grade.details.get("test_summary", "No test feedback available.")]
        benchmark = grade.details.get("benchmark")
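The shaped step reward combines a base score term with the bonus/penalty ledger above. A minimal numeric sketch of that combination (coefficients copied from the diff; the final clamp into the open interval is an assumption, since the observation model declares `0.0 < reward < 1.0` but the clamping code itself is not shown in this hunk):

```python
def shaped_reward(
    curr_score: float,
    *,
    syntax_reward: float = 0.0,
    test_reward: float = 0.0,
    progress_delta: float = 0.0,
    quality_bonus: float = 0.0,
    error_reduction_bonus: float = 0.0,
    completion_bonus: float = 0.0,
    runtime_bonus: float = 0.0,
    correctness_bonus: float = 0.0,
    invalid_action_penalty: float = 0.0,
    timeout_penalty: float = 0.0,
) -> float:
    # Base term (0.32 * current score) plus the bonuses, minus the penalties,
    # mirroring the raw_value expression in _compute_reward after this commit.
    raw = (
        0.32 * curr_score
        + syntax_reward
        + test_reward
        + progress_delta
        + quality_bonus
        + error_reduction_bonus
        + completion_bonus
        + runtime_bonus
        + correctness_bonus
        - invalid_action_penalty
        - timeout_penalty
    )
    # Assumed clamp into the open (0, 1) range the observation model enforces.
    return min(max(raw, 0.001), 0.999)


# A mid-score attempt that improved its pass rate and completed the task.
print(round(shaped_reward(0.5, test_reward=0.28, completion_bonus=0.14), 3))  # 0.58
```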
server/requirements.txt
CHANGED

@@ -2,7 +2,6 @@ openenv-core[core]>=0.2.2
 fastapi>=0.111.0
 gradio>=5.26.0
 uvicorn>=0.30.0
-pytest>=8.0.0
 openai>=1.76.0
 streamlit>=1.44.0
 torch>=2.2.0
services/analysis_service.py
CHANGED

@@ -34,7 +34,7 @@ class AnalysisService:
     """End-to-end analysis pipeline shared by API and UI."""
 
     def __init__(self) -> None:
-        self.
+        self._model: PyTorchCodeAnalyzerModel | None = None
         self.reward_service = RewardService()
         self.suggestion_service = SuggestionService()
         self._analyzers: Dict[str, Callable[[str, Dict[str, Any], Dict[str, Any]], DomainAnalysis]] = {
@@ -44,6 +44,12 @@ class AnalysisService:
             "web": analyze_web_code,
         }
 
+    @property
+    def model(self) -> PyTorchCodeAnalyzerModel:
+        if self._model is None:
+            self._model = PyTorchCodeAnalyzerModel()
+        return self._model
+
     def _heuristic_domain_scores(self, parsed: Dict[str, Any], code: str) -> Dict[str, float]:
         """Derive domain priors from imports and syntax-level hints."""
services/reward_service.py
CHANGED

@@ -9,13 +9,21 @@ class RewardService:
     """Compute reward scores from model, domain, lint, and complexity signals."""
 
     def compute(self, *, ml_score: float, domain_score: float, lint_score: float, complexity_penalty: float) -> ScoreBreakdown:
-        """Apply
+        """Apply dynamic reward shaping based on quality, errors, and completion."""
 
+        quality_signal = max(0.0, min(1.0, (0.45 * ml_score) + (0.3 * domain_score) + (0.25 * lint_score)))
+        error_reduction_signal = max(0.0, min(1.0, lint_score - (0.6 * complexity_penalty)))
+        completion_signal = max(0.0, min(1.0, (ml_score + domain_score + lint_score) / 3.0))
         reward = max(
             0.0,
             min(
                 1.0,
-                (0.
+                (0.35 * quality_signal)
+                + (0.25 * completion_signal)
+                + (0.2 * error_reduction_signal)
+                + (0.1 * ml_score)
+                + (0.1 * domain_score)
+                - (0.15 * complexity_penalty),
             ),
         )
         return ScoreBreakdown(
@@ -23,5 +31,8 @@ class RewardService:
             domain_score=round(domain_score, 4),
             lint_score=round(lint_score, 4),
             complexity_penalty=round(complexity_penalty, 4),
+            quality_signal=round(quality_signal, 4),
+            error_reduction_signal=round(error_reduction_signal, 4),
+            completion_signal=round(completion_signal, 4),
             reward=round(reward, 4),
         )
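The blended formula can be checked for boundedness with a quick standalone sketch (same coefficients and clamping as the diff; the free-function name is ours, not the repo's API):

```python
def blended_reward(ml_score: float, domain_score: float, lint_score: float, complexity_penalty: float) -> float:
    # Intermediate signals mirror RewardService.compute after this commit.
    quality = max(0.0, min(1.0, 0.45 * ml_score + 0.3 * domain_score + 0.25 * lint_score))
    error_reduction = max(0.0, min(1.0, lint_score - 0.6 * complexity_penalty))
    completion = max(0.0, min(1.0, (ml_score + domain_score + lint_score) / 3.0))
    raw = (
        0.35 * quality
        + 0.25 * completion
        + 0.2 * error_reduction
        + 0.1 * ml_score
        + 0.1 * domain_score
        - 0.15 * complexity_penalty
    )
    # Outer clamp keeps the reward in [0, 1] even for degenerate inputs.
    return max(0.0, min(1.0, raw))


print(round(blended_reward(1.0, 1.0, 1.0, 0.0), 4))  # 1.0
print(round(blended_reward(0.0, 0.0, 0.0, 1.0), 4))  # 0.0
```

Because every term except the complexity penalty is non-negative and the result is clamped twice, the reward stays in `[0, 1]` for any inputs in that range.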
tests/test_inference_runner.py
ADDED

@@ -0,0 +1,71 @@
+"""Smoke tests for the strict inference output contract."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+
+from app.env.runner import InferenceRunner
+from app.models.inference import AgentDecision, InferenceConfig
+
+
+@dataclass
+class _FakeObservation:
+    task_id: str
+    attempts_remaining: int
+    score: float
+    done: bool
+    history: list[object] = field(default_factory=list)
+    current_code: str = "print('broken')"
+    last_action_error: str | None = None
+
+
+class _FakeEnv:
+    def __init__(self) -> None:
+        self._step = 0
+
+    def reset(self, *, task_id: str) -> _FakeObservation:
+        return _FakeObservation(task_id=task_id, attempts_remaining=4, score=0.2, done=False)
+
+    def step_result(self, action: object) -> tuple[_FakeObservation, float, bool, dict[str, object]]:
+        self._step += 1
+        if self._step == 1:
+            return (
+                _FakeObservation("demo_task", 3, 0.45, False, current_code="candidate"),
+                0.45,
+                False,
+                {"last_action_error": None},
+            )
+        if self._step == 2:
+            return (
+                _FakeObservation("demo_task", 2, 0.97, True, current_code="reference"),
+                0.97,
+                True,
+                {"last_action_error": None},
+            )
+        raise AssertionError("runner stepped too many times")
+
+
+class _FakeAgent:
+    def __init__(self) -> None:
+        self._step = 0
+
+    def act(self, observation: object) -> AgentDecision:
+        self._step += 1
+        if self._step == 1:
+            return AgentDecision(action_type="run_tests")
+        return AgentDecision(action_type="submit_solution")
+
+
+def test_inference_runner_emits_strict_lines(capsys) -> None:
+    runner = InferenceRunner(InferenceConfig.from_env())
+    runner.agent = _FakeAgent()
+    runner._create_env = lambda: _FakeEnv()  # type: ignore[method-assign]
+    runner.run_task("demo_task")
+
+    captured = capsys.readouterr().out.strip().splitlines()
+    assert captured == [
+        f"[START] task=demo_task env={runner.config.benchmark_name} model={runner.config.model_name}",
+        "[STEP] step=1 action=run_tests reward=0.45 done=false error=null",
+        "[STEP] step=2 action=submit_solution reward=0.97 done=true error=null",
+        "[END] success=true steps=2 rewards=0.45,0.97",
+    ]
uv.lock
CHANGED

The diff for this file is too large to render. See the raw diff.