uvpatel7271 committed on
Commit 989722c · verified · 1 Parent(s): 019e7db

Upload folder using huggingface_hub

DEMO_SCRIPT.md CHANGED
@@ -2,20 +2,11 @@

## 60-90 Second Walkthrough

- 1. Introduce TorchReview Copilot as an AI-powered code review system that helps developers find bugs, reduce complexity, and improve maintainability faster.
- 2. Frame the problem clearly: manual code reviews are slow, inconsistent, and hard to scale across growing teams and codebases.
- 3. Open the Streamlit app and load the `Boundary Bug` example to show a realistic Python regression with failing behavior.
- 4. Point out the pipeline on-screen: input code, static analysis, PyTorch scoring, suggestions, and RL-ready reward output.
- 5. Highlight the PyTorch story: the app uses CodeBERTa embeddings through PyTorch to score code quality, maintainability, and domain fit.
- 6. Show the headline metrics: detected domain, ML score, lint score, and final reward.
- 7. Scroll to the reward breakdown and explain that the reward is not arbitrary; it combines ML quality, maintainability, security, lint signals, and complexity penalties.
- 8. Open the Suggestions tab and show the prioritized fixes plus the three-step improvement plan.
- 9. Switch to the `Performance Hotspot` example to demonstrate that the system adapts to a different issue profile and pushes optimization hints instead of only syntax guidance.
- 10. Close by emphasizing that the same repo also works as an OpenEnv environment, so the project is both a usable developer product and an RL-ready benchmark component.
-
- ## 20-Second Closing Line
-
- TorchReview Copilot turns code review into a measurable AI workflow: PyTorch handles semantic scoring, deterministic analyzers keep it grounded, and OpenEnv makes it trainable and benchmarkable.
+ 1. Open the Hugging Face Space and introduce TorchReview Copilot as an AI-powered code review and improvement system built with PyTorch.
+ 2. Point to the problem statement: manual code review is slow, inconsistent, and hard to scale.
+ 3. Select the `Fix the invoice total syntax regression` example to show the app loading a broken code sample together with the context window.
+ 4. Highlight the **Live Triage Radar**, the ML quality score, and the RL-ready reward score.
+ 5. Explain that the PyTorch layer uses CodeBERTa embeddings to compare the input against known code-quality patterns from the OpenEnv task catalog.
+ 6. Scroll to the three-step improvement plan and call out the progression: syntax and bug fixes, edge cases, then scalability.
+ 7. Switch to the performance example to show the confidence profile and reward changing for a different class of issue.
+ 8. Close by noting that OpenEnv still powers deterministic validation under the hood, so the demo remains grounded in measurable task outcomes.

Dockerfile CHANGED
@@ -6,24 +6,31 @@ ENV PYTHONDONTWRITEBYTECODE=1 \
    PYTHONIOENCODING=utf-8 \
    PIP_NO_CACHE_DIR=1 \
    PIP_DISABLE_PIP_VERSION_CHECK=1 \
-   PIP_DEFAULT_TIMEOUT=120 \
+   PIP_ROOT_USER_ACTION=ignore \
    ENABLE_GRADIO_DEMO=false \
    ENABLE_WEB_INTERFACE=false

WORKDIR /app

- COPY server/requirements.txt /tmp/requirements.txt
+ COPY server/requirements.runtime.txt /tmp/requirements.runtime.txt

- RUN python -m pip install --upgrade pip && \
-     pip install --prefer-binary -r /tmp/requirements.txt
+ RUN apt-get update && \
+     apt-get upgrade -y && \
+     rm -rf /var/lib/apt/lists/*

- COPY . /app
+ RUN useradd --create-home --shell /usr/sbin/nologin appuser && \
+     python -m pip install --upgrade pip setuptools && \
+     pip install -r /tmp/requirements.runtime.txt
+
+ COPY --chown=appuser:appuser . /app

RUN pip install --no-deps .

+ USER appuser
+
EXPOSE 8000

HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://127.0.0.1:8000/health', timeout=3).read()"

- CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000"]
+ CMD ["uvicorn", "server.app:app", "--host", "0.0.0.0", "--port", "8000", "--no-access-log"]
README.md CHANGED
@@ -1,232 +1,91 @@
---
- title: TorchReview Copilot
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
- openenv
- - pytorch
- - code-review
---

- # TorchReview Copilot

- TorchReview Copilot is an AI-powered code review and improvement system built for the Meta PyTorch OpenEnv Hackathon. It combines deterministic static analysis, a real PyTorch code encoder, domain-aware review logic, and RL-ready reward shaping to help developers catch bugs, reduce complexity, and improve maintainability faster.

- ## Problem Statement
-
- Manual code review is slow, inconsistent, and difficult to scale. Small logic bugs slip through, performance hotspots hide in otherwise correct code, and review quality changes from reviewer to reviewer.
-
- ## Solution
-
- TorchReview Copilot accepts Python code, analyzes it with AST and complexity heuristics, scores it with a PyTorch model, and returns:
-
- - A code quality score
- - Domain-aware review feedback
- - Actionable improvement suggestions
- - An RL-ready reward signal for OpenEnv environments
-
- ## Why This Is Hackathon-Worthy
-
- - Solves a real developer productivity problem
- - Uses PyTorch meaningfully for model inference, not as a placeholder
- - Produces a measurable reward signal for RL workflows
- - Ships as a usable product with API, UI, docs, tests, and OpenEnv compatibility
-
- ## Tech Stack
-
- - `PyTorch` for model execution and similarity scoring
- - `transformers` with `huggingface/CodeBERTa-small-v1` for pretrained code embeddings
- - `FastAPI` for the analysis API
- - `Streamlit` for the interactive review UI
- - `Pydantic` for request and response validation
- - `OpenAI` Python client for hackathon-compliant LLM action planning in `inference.py`
- - `OpenEnv` for environment, reward, and validator integration
-
- ## Pipeline
-
- ```text
- Input Python Code
- -> AST Parsing + Structural Signals
- -> Complexity + Lint Heuristics
- -> PyTorch Model Inference (CodeBERTa / torch fallback)
- -> Domain Analysis + Suggestion Engine
- -> RL Reward Shaping
- -> UI + API + OpenEnv Environment Output
- ```
-
- ## PyTorch Integration
-
- PyTorch is used in the core scoring path:
-
- - The app loads `huggingface/CodeBERTa-small-v1` through `transformers`
- - Input code, repository context, traceback text, and static-analysis hints are embedded with the encoder
- - The resulting embedding is compared against quality, maintainability, domain, and issue prototypes
- - The model produces:
-   - `ml_quality_score`
-   - `maintainability_score`
-   - domain confidences
-   - issue probabilities
-
- If pretrained weights are unavailable, the project falls back to a torch-native hashed embedding backend so local demos and CI still work offline.
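
The hashed-fallback idea can be illustrated with a stdlib-only sketch: hash each token into a fixed-size vector, then compare vectors by cosine similarity. This is a simplified stand-in for the repo's `HashingEmbeddingBackend`, not its actual torch implementation:

```python
import hashlib
import math
import re


def hashed_embedding(text: str, dim: int = 64) -> list:
    """Hash each token into one of `dim` buckets to form a cheap, deterministic embedding."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim
        vec[bucket] += 1.0
    return vec


def cosine(a: list, b: list) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0
```

Because the buckets depend only on token hashes, the same code always maps to the same vector, which is what makes an offline fallback reproducible in CI.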
-
- ## Reward System
-
- The system is RL-ready by design. Reward shaping blends model confidence, code quality, security, maintainability, and complexity into a bounded signal.
-
- Core reward:
-
- ```text
- reward = 0.50*ml_score
-        + 0.18*lint_score
-        + 0.12*maintainability_score
-        + 0.10*domain_score
-        + 0.10*security_score
-        - 0.20*complexity_penalty
- ```
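
Written out, the formula above is a plain weighted sum. The sketch below is an illustrative reconstruction from the formula as shown, with a clamp into [0, 1] added on the assumption that "bounded signal" means that interval; it is not the repo's actual `services/reward_service.py`:

```python
def core_reward(ml_score: float, lint_score: float, maintainability_score: float,
                domain_score: float, security_score: float,
                complexity_penalty: float) -> float:
    """Weighted blend of quality signals minus a complexity penalty, per the formula above."""
    raw = (0.50 * ml_score
           + 0.18 * lint_score
           + 0.12 * maintainability_score
           + 0.10 * domain_score
           + 0.10 * security_score
           - 0.20 * complexity_penalty)
    # Clamp so the signal stays bounded for RL consumers (assumed interval).
    return max(0.0, min(1.0, raw))
```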
-
- The OpenEnv environment adds step-level shaping for:
-
- - public test progress
- - syntax recovery
- - runtime improvements
- - error reduction
- - final submission success
- - regressions and invalid actions
-
- All task and step rewards are normalized into a strict safe interval for OpenEnv validation and printed in a validator-safe two-decimal band.
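
That normalization step can be sketched as a clamp plus rounding. The interval bounds below (`0.01`, `0.99`) are assumptions for illustration, since the text only says "strict safe interval":

```python
def normalize_reward(value: float, low: float = 0.01, high: float = 0.99) -> float:
    """Clamp a raw reward into a strict safe interval, then round to the two-decimal band."""
    clamped = max(low, min(high, value))
    return round(clamped, 2)
```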
-
- ## Features
-
- - Real PyTorch-backed code quality inference
- - Static analysis with syntax, lint, AST, and complexity signals
- - Domain-aware review for DSA, data science, ML/DL, and web code
- - Prioritized suggestions and a compact 3-step improvement plan
- - Auto-fix preview hints for quick wins
- - Real-time Streamlit scoring mode
- - OpenEnv-compatible environment and `inference.py`
- - Deterministic benchmark tasks for syntax fixes, bug fixes, and optimization
-
- ## WOW Features
-
- - Real-time scoring in the Streamlit interface
- - Auto-fix preview panel
- - Reward visualization and score breakdown
- - OpenEnv environment with transparent reward decomposition
-
- ## Project Structure

```text
root
- |- inference.py
- |- api/
|- app/
- | |- agents/
- | |- env/
- | |- models/
- | |- services/
- | `- utils/
- |- analyzers/
- |- graders/
- |- models/
- |- schemas/
- |- services/
- |- tasks/
- |- tests/
- `- utils/
- ```
-
- Key modules:
-
- - `models/pytorch_model.py`: PyTorch + transformer inference
- - `services/analysis_service.py`: end-to-end review pipeline
- - `services/reward_service.py`: RL-friendly reward shaping
- - `services/suggestion_service.py`: actionable recommendations
- - `app/streamlit_app.py`: interactive UI
- - `server/env.py`: OpenEnv environment implementation
- - `app/env/runner.py`: strict `inference.py` runner
-
- ## API
-
- Run the analysis API:
-
- ```bash
- python -m uvicorn api.main:app --host 0.0.0.0 --port 7860
```

- Main endpoint:
-
- - `POST /analyze`
-
- The API returns:

- - detected domain
- - static-analysis summary
- - model prediction
- - score breakdown
- - suggestions
- - improvement plan
-
- ## Streamlit UI
-
- Run the product UI locally:
-
- ```bash
- streamlit run app/streamlit_app.py
```

- The UI includes:
-
- - code input editor
- - example snippets
- - real-time scoring toggle
- - ML score, lint score, and reward display
- - domain confidence chart
- - reward-signal visualization
- - suggestion list and auto-fix preview
-
- ## OpenEnv Compatibility
-
- This repository is also a valid OpenEnv submission:

- - `inference.py` is in the repo root
- - `API_BASE_URL` and `MODEL_NAME` have defaults
- - `HF_TOKEN` is read from the environment
- - The runner uses the official `OpenAI` Python client
- - Output follows the required `[START]`, `[STEP]`, `[END]` contract
-
- Example:
-
- ```text
- [START] task=syntax_fix_invoice_totals env=python_code_review_env model=Qwen/Qwen2.5-3B-Instruct
- [STEP] step=1 action=run_tests reward=0.34 done=false error=null
- [STEP] step=2 action=edit_code reward=0.42 done=false error=null
- [STEP] step=3 action=submit_solution reward=0.99 done=true error=null
- [END] success=true steps=3 rewards=0.34,0.42,0.99
- ```

- ## Setup

- Install dependencies:

```bash
pip install -e .[dev]
```

- Run tests:

```bash
pytest -q
```

- Run the OpenEnv server:

```bash
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

- Run the demo UI mounted into the server:

```bash
set ENABLE_GRADIO_DEMO=true
@@ -234,49 +93,100 @@ set ENABLE_WEB_INTERFACE=true
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

- ## Hugging Face Spaces

- This repo is designed to run on a Docker-based Hugging Face Space under a `2 vCPU / 8 GB RAM` budget.

- Recommended Space settings:

- - SDK: `Docker`
- - Port: `8000`
- - Secret: `HF_TOKEN`
- - Optional vars:
-   - `API_BASE_URL`
-   - `MODEL_NAME`
-   - `ENABLE_GRADIO_DEMO=false`
-   - `ENABLE_WEB_INTERFACE=false`

- ## Screenshots

- Add these before final submission:

- - Main review UI with code editor and reward metrics
- - Suggestions tab with improvement plan
- - OpenEnv task loop or validator output snippet

- ## Demo Link

- Add your live Hugging Face Space URL here before final submission.

- ## Demo Script

- See [DEMO_SCRIPT.md](DEMO_SCRIPT.md) for a concise hackathon walkthrough.

- ## Testing

- The repo includes coverage for:

- - score normalization into the strict OpenEnv-safe interval
- - inference output formatting
- - API response structure
- - multi-domain analysis behavior
- - triage and embedding behavior

- ## Notes for Judges

- - This is not a toy wrapper around an LLM. The review pipeline includes deterministic analysis, PyTorch-based code scoring, and explicit reward shaping.
- - The system is useful both as a developer-facing application and as a benchmark-friendly RL environment.
- - The design intentionally balances product polish with validator reliability.
---
+ title: Python Code Review Environment Server
sdk: docker
app_port: 8000
base_path: /web
pinned: false
tags:
- openenv
---

+ # OpenEnv Python Code Review Environment

+ Production-ready hackathon submission for OpenEnv evaluation, deterministic validator runs, and Hugging Face Docker deployment.

+ ## Architecture

```text
root
+ |- inference.py   # Root validator entrypoint
+ |- openenv.yaml   # OpenEnv manifest
|- app/
+ | |- agents/      # Action policy and fallback strategy
+ | |- env/         # RL loop runner and stdout contract
+ | |- models/      # Inference dataclasses/config
+ | |- services/    # OpenAI client wrapper with retries
+ | `- utils/       # Formatting, task loading, log suppression
+ |- server/
+ | |- env.py       # OpenEnv environment and reward shaping
+ | |- app.py       # FastAPI/OpenEnv app, optional Gradio mount
+ | `- Dockerfile   # Alternate Docker build path
+ |- Dockerfile     # Root deployment Docker image
+ |- graders/       # Syntax, bug-fix, optimization graders
+ |- tasks/         # Deterministic benchmark tasks and references
+ |- services/      # Multi-domain analysis services
+ |- analyzers/     # Domain-specific analyzers
+ |- models/        # Lazy-loaded PyTorch scoring model
+ |- schemas/       # API request/response contracts
+ `- tests/         # Local validation coverage
```

+ Runtime flow:

+ ```text
+ inference.py
+ -> app.env.runner.InferenceRunner
+ -> env.reset(task_id=...)
+ -> ReviewAgent(action planning)
+ -> env.step_result(action)
+ -> strict [START]/[STEP]/[END] output
```

+ ## What Was Fixed

+ - `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
+ - OpenAI usage is limited to the official Python client: `client = OpenAI(base_url=API_BASE_URL, api_key=provider_token)`.
+ - Defaulted env vars are enforced for `API_BASE_URL` and `MODEL_NAME`; the runtime now selects `HF_TOKEN` for the Hugging Face router and `OPENAI_API_KEY` for direct OpenAI usage.
+ - Output now matches the required single-line contract exactly and always emits `[END]`, including on failure paths.
+ - The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
+ - Step errors now surface through `last_action_error` and are printed in `[STEP]`.
+ - Reward shaping is now dynamic in the OpenEnv environment: code quality, test progress, runtime progress, error removal, regressions, and completion are all part of the reward.
+ - The API-side reward service is no longer a static weighted sum and now exposes quality, error-reduction, and completion signals.
+ - The Docker image now builds from the repo root, caches dependency installation more effectively, and runs `server.app:app` directly on port `8000`.
+ - Server startup is lighter: the PyTorch analyzer is lazy-loaded and the Gradio demo is disabled by default.
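
The `reset()`/`step_result()` loop described in the list above can be sketched as follows; the `env` and `agent` objects here are hypothetical stand-ins for the repo's actual environment and `ReviewAgent` classes:

```python
def run_episode(env, agent, max_steps: int = 8):
    """Drive one episode: reset, act until done, and collect per-step rewards."""
    obs = env.reset()
    rewards = []
    done = False
    step = 0
    while not done and step < max_steps:
        step += 1
        action = agent.plan(obs)  # pick the next review action from the observation
        obs, reward, done, error = env.step_result(action)
        rewards.append(reward)
        print(f"[STEP] step={step} action={action} reward={reward:.2f} "
              f"done={str(done).lower()} error={error if error else 'null'}")
    return rewards
```

Capping the loop with `max_steps` guarantees termination even if the environment never signals `done`, which keeps validator runs bounded.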

+ ## Local Setup

+ Install dev dependencies:

```bash
pip install -e .[dev]
```

+ Run the test suite:

```bash
pytest -q
```

+ Run the OpenEnv server locally:

```bash
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

+ Optional demo UI:

```bash
set ENABLE_GRADIO_DEMO=true
python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
```

+ ## Inference Contract

+ Required environment variables:

+ - `API_BASE_URL`
+   Default: `https://router.huggingface.co/v1`
+ - `MODEL_NAME`
+   Default: `Qwen/Qwen2.5-3B-Instruct`
+ - `HF_TOKEN`
+   Required for `https://router.huggingface.co/v1`
+ - `OPENAI_API_KEY`
+   Required for `https://api.openai.com/v1`
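
A minimal sketch of that credential selection, assuming the runtime keys off the base URL; the helper name is illustrative, not the repo's actual function:

```python
import os
from typing import Optional


def pick_provider_token(api_base_url: str) -> Optional[str]:
    """Select HF_TOKEN for the Hugging Face router, OPENAI_API_KEY for direct OpenAI."""
    if "router.huggingface.co" in api_base_url:
        return os.environ.get("HF_TOKEN")
    if "api.openai.com" in api_base_url:
        return os.environ.get("OPENAI_API_KEY")
    # Assumed fallback for other OpenAI-compatible endpoints.
    return os.environ.get("OPENAI_API_KEY") or os.environ.get("HF_TOKEN")
```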

+ Example:

+ ```bash
+ set API_BASE_URL=https://router.huggingface.co/v1
+ set MODEL_NAME=Qwen/Qwen2.5-3B-Instruct
+ set HF_TOKEN=hf_xxx
+ python inference.py
+ ```

+ ```bash
+ set API_BASE_URL=https://api.openai.com/v1
+ set MODEL_NAME=gpt-4.1-mini
+ set OPENAI_API_KEY=sk-xxx
+ python inference.py
+ ```

+ Expected stdout shape:

+ ```text
+ [START] task=syntax_fix_invoice_totals env=python_code_review_env model=Qwen/Qwen2.5-3B-Instruct
+ [STEP] step=1 action=run_tests reward=0.12 done=false error=null
+ [STEP] step=2 action=edit_code reward=0.96 done=false error=null
+ [STEP] step=3 action=run_tests reward=0.99 done=false error=null
+ [STEP] step=4 action=submit_solution reward=0.99 done=true error=null
+ [END] success=true steps=4 rewards=0.12,0.96,0.99,0.99
+ ```
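
The contract above is strict enough to generate mechanically; a sketch of emit helpers (names are illustrative, not the repo's actual utilities):

```python
from typing import List, Optional


def format_step(step: int, action: str, reward: float, done: bool,
                error: Optional[str]) -> str:
    """Render one [STEP] line in the validator's single-line contract."""
    return (f"[STEP] step={step} action={action} reward={reward:.2f} "
            f"done={str(done).lower()} error={error if error else 'null'}")


def format_end(success: bool, rewards: List[float]) -> str:
    """Render the final [END] summary line with the two-decimal reward band."""
    joined = ",".join(f"{r:.2f}" for r in rewards)
    return f"[END] success={str(success).lower()} steps={len(rewards)} rewards={joined}"
```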

+ ## Docker

+ Build from the project root:

+ ```bash
+ docker build -t openenv-python-code-review-env .
+ ```

+ Run locally:

+ ```bash
+ docker run --rm -p 8000:8000 ^
+   -e API_BASE_URL=https://router.huggingface.co/v1 ^
+   -e MODEL_NAME=Qwen/Qwen2.5-3B-Instruct ^
+   -e HF_TOKEN=hf_xxx ^
+   openenv-python-code-review-env
+ ```

+ Container behavior:

+ - Base image: `python:3.11-slim-bookworm`
+ - Build context: project root
+ - The runtime image installs the minimal API dependency set by default; Streamlit, PyTorch, and transformers stay out of the container, while Gradio is only used if the demo env flags are enabled.
+ - Healthcheck: `GET /health`
+ - Default entrypoint: `uvicorn server.app:app --host 0.0.0.0 --port 8000`

+ ## Hugging Face Spaces

+ Recommended deployment steps:

+ 1. Create a Docker Space.
+ 2. Push this repository as-is.
+ 3. Let Spaces build from the root `Dockerfile`.
+ 4. Set Space secrets: `HF_TOKEN`.
+ 5. Set Space variables as needed: `API_BASE_URL`, `MODEL_NAME`, `ENABLE_GRADIO_DEMO=false`; `ENABLE_WEB_INTERFACE=false` is also supported for OpenEnv-managed deploys.
+ 6. Confirm the app listens on port `8000`.
+ 7. Smoke-test `/health`, `/reset`, and `/step`.

+ ## Performance Notes

+ - Max concurrent environments default to `2`, aligned with a `2 vCPU / 8 GB RAM` target.
+ - The analyzer model is lazy-loaded instead of being created at startup.
+ - The inference runner relies on short prompts, low token budgets, and limited retries.
+ - The policy uses a deterministic reference-code fallback instead of expensive iterative code generation.
+ - Public validation is preferred before final submission to avoid wasted hidden-eval steps.
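
The lazy-loading mentioned above can be sketched with a cached accessor; `_build_model` below is a self-contained stand-in for the repo's actual heavy constructor:

```python
from functools import lru_cache


def _build_model():
    # Stand-in for an expensive constructor (e.g. loading PyTorch weights).
    return {"name": "analyzer", "loaded": True}


@lru_cache(maxsize=1)
def get_analyzer():
    """Construct the analyzer on first call only; later calls reuse the cached instance."""
    return _build_model()
```

Because `lru_cache(maxsize=1)` memoizes the zero-argument call, startup stays cheap and the cost is paid on the first request that actually needs the model.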

+ ## Known Limitations

+ - If `HF_TOKEN` is absent, inference still completes with deterministic fallback actions, but LLM guidance is skipped.
+ - The benchmark tasks are deterministic and intentionally small; this is good for validator stability but not a full training benchmark.
+ - Gradio remains optional and is disabled by default to keep deployment lighter.
__init__.py CHANGED
@@ -1,52 +1,36 @@
"""Public package exports for python_code_review_env."""

- try:
-     from .client import PythonCodeReviewEnv, PythonEnv
-     from .models import (
-         PyTorchCodeAnalyzerModel,
-         PythonAction,
-         PythonCodeReviewAction,
-         PythonCodeReviewObservation,
-         PythonCodeReviewState,
-         PythonObservation,
-         PythonState,
-     )
-     from .schemas import AnalyzeCodeRequest, AnalyzeCodeResponse
-     from .services import AnalysisService
-     from .triage import CodeTriageEngine, HashingEmbeddingBackend, TransformersEmbeddingBackend, get_default_engine
-     from .triage_models import TriageResult
- except ImportError:  # pragma: no cover
-     from client import PythonCodeReviewEnv, PythonEnv
-     from models import (
-         PyTorchCodeAnalyzerModel,
-         PythonAction,
-         PythonCodeReviewAction,
-         PythonCodeReviewObservation,
-         PythonCodeReviewState,
-         PythonObservation,
-         PythonState,
-     )
-     from schemas import AnalyzeCodeRequest, AnalyzeCodeResponse
-     from services import AnalysisService
-     from triage import CodeTriageEngine, HashingEmbeddingBackend, TransformersEmbeddingBackend, get_default_engine
-     from triage_models import TriageResult
-
- __all__ = [
-     "PythonAction",
-     "PythonObservation",
+ from .client import PythonCodeReviewEnv, PythonEnv
+ from .models import (
+     PyTorchCodeAnalyzerModel,
+     PythonAction,
+     PythonCodeReviewAction,
+     PythonCodeReviewObservation,
+     PythonCodeReviewState,
+     PythonObservation,
+     PythonState,
+ )
+ from .schemas import AnalyzeCodeRequest, AnalyzeCodeResponse
+ from .services import AnalysisService
+ from .triage import CodeTriageEngine, HashingEmbeddingBackend, TransformersEmbeddingBackend, get_default_engine
+ from .triage_models import TriageResult
+
+ __all__ = [
+     "PythonAction",
+     "PythonObservation",
      "PythonState",
      "PythonCodeReviewAction",
      "PythonCodeReviewObservation",
-     "PythonCodeReviewState",
-     "PythonCodeReviewEnv",
-     "PythonEnv",
-     "AnalyzeCodeRequest",
-     "AnalyzeCodeResponse",
-     "AnalysisService",
-     "CodeTriageEngine",
-     "HashingEmbeddingBackend",
-     "PyTorchCodeAnalyzerModel",
-     "TransformersEmbeddingBackend",
-     "TriageResult",
-     "get_default_engine",
- ]
+     "PythonCodeReviewState",
+     "PythonCodeReviewEnv",
+     "PythonEnv",
+     "AnalyzeCodeRequest",
+     "AnalyzeCodeResponse",
+     "AnalysisService",
+     "CodeTriageEngine",
+     "HashingEmbeddingBackend",
+     "PyTorchCodeAnalyzerModel",
+     "TransformersEmbeddingBackend",
+     "TriageResult",
+     "get_default_engine",
+ ]
analyzers/__init__.py CHANGED
@@ -1,13 +1,13 @@
- """Domain-specific analyzers for multi-domain code understanding."""
-
- from .dsa_analyzer import analyze_dsa_code
- from .ds_analyzer import analyze_data_science_code
- from .ml_analyzer import analyze_ml_code
- from .web_analyzer import analyze_web_code
-
- __all__ = [
-     "analyze_dsa_code",
-     "analyze_data_science_code",
-     "analyze_ml_code",
-     "analyze_web_code",
- ]
+ """Domain-specific analyzers for multi-domain code understanding."""
+
+ from .dsa_analyzer import analyze_dsa_code
+ from .ds_analyzer import analyze_data_science_code
+ from .ml_analyzer import analyze_ml_code
+ from .web_analyzer import analyze_web_code
+
+ __all__ = [
+     "analyze_dsa_code",
+     "analyze_data_science_code",
+     "analyze_ml_code",
+     "analyze_web_code",
+ ]
analyzers/ds_analyzer.py CHANGED
@@ -1,58 +1,56 @@
"""Analyzer for data-science oriented Python code."""

from __future__ import annotations

from typing import Any, Dict

from schemas.response import AnalysisIssue, DomainAnalysis


def analyze_data_science_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any]) -> DomainAnalysis:
    """Inspect pandas and numpy code for vectorization and leakage concerns."""

    issues = []
    suggestions = []
    score = 0.72

    if "iterrows(" in code or "itertuples(" in code:
        issues.append(
            AnalysisIssue(
                title="Row-wise dataframe iteration detected",
-                category="performance",
                severity="medium",
                description="Looping through dataframe rows is usually slower and less scalable than vectorized operations.",
            )
        )
        suggestions.append("Use vectorized pandas or numpy expressions instead of row-wise iteration.")
        score -= 0.18

    if "inplace=True" in code:
        suggestions.append("Avoid inplace mutation to keep data pipelines easier to reason about and test.")
        score -= 0.05

    if "fit_transform(" in code and "train_test_split" not in code:
        issues.append(
            AnalysisIssue(
                title="Potential data leakage risk",
-                category="correctness",
                severity="high",
                description="Feature transforms appear before an explicit train/test split.",
            )
        )
        suggestions.append("Split train and validation data before fitting stateful preprocessing steps.")
        score -= 0.2

    if not suggestions:
        suggestions.append("Add schema assumptions and null-handling checks for production data quality.")

    return DomainAnalysis(
        domain="data_science",
        domain_score=max(0.05, round(score, 4)),
        issues=issues,
        suggestions=suggestions,
        highlights={
            "vectorization_risk": float("iterrows(" in code or "itertuples(" in code),
            "time_complexity": complexity["time_complexity"],
            "uses_pandas": float(parsed.get("uses_pandas", False)),
        },
    )
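
The leakage heuristic in the diff above reduces to a substring check; a self-contained sketch of just that check (without the repo's `AnalysisIssue`/`DomainAnalysis` schemas) behaves like this:

```python
def has_leakage_risk(code: str) -> bool:
    """Flag code that fits stateful transforms without an explicit train/test split."""
    return "fit_transform(" in code and "train_test_split" not in code


leaky = "X = scaler.fit_transform(X)\nmodel.fit(X, y)"
safe = ("X_tr, X_te, y_tr, y_te = train_test_split(X, y)\n"
        "X_tr = scaler.fit_transform(X_tr)")
```

Being a substring heuristic, it trades precision for determinism: it cannot see call order across files, but it never flakes under the validator.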
analyzers/dsa_analyzer.py CHANGED
@@ -1,49 +1,48 @@
"""Analyzer for DSA and competitive-programming style Python code."""

from __future__ import annotations

from typing import Any, Dict

from schemas.response import AnalysisIssue, DomainAnalysis


def analyze_dsa_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any]) -> DomainAnalysis:
    """Inspect algorithmic code for brute-force patterns and efficiency risks."""

    issues = []
    suggestions = []
    score = 0.7

    if parsed.get("max_loop_depth", 0) >= 2:
        issues.append(
            AnalysisIssue(
                title="Nested loops suggest brute-force behavior",
-                category="performance",
                severity="medium",
                description="The implementation scans the input multiple times, which is often avoidable in DSA problems.",
            )
        )
        suggestions.append("Consider replacing nested scans with a hashmap, prefix table, or sorted search strategy.")
        score -= 0.15

    if parsed.get("uses_recursion"):
        suggestions.append("Verify recursion depth and add memoization or iterative conversion if the input size can grow.")
        score -= 0.05

    if "sorted(" in code or ".sort(" in code:
        suggestions.append("Sorting is acceptable here, but validate whether a direct O(n) pass can remove the sort.")

    if not suggestions:
        suggestions.append("Document the intended time complexity and add edge-case checks for empty input and duplicates.")

    return DomainAnalysis(
        domain="dsa",
        domain_score=max(0.05, round(score, 4)),
        issues=issues,
        suggestions=suggestions,
        highlights={
            "time_complexity": complexity["time_complexity"],
            "space_complexity": complexity["space_complexity"],
            "max_loop_depth": float(parsed.get("max_loop_depth", 0)),
        },
    )
analyzers/ml_analyzer.py CHANGED
@@ -1,63 +1,61 @@
- """Analyzer for machine-learning and deep-learning code."""
-
- from __future__ import annotations
-
- from typing import Any, Dict
-
- from schemas.response import AnalysisIssue, DomainAnalysis
-
-
- def analyze_ml_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any]) -> DomainAnalysis:
- """Inspect training and inference logic for common ML / DL mistakes."""
-
- issues = []
- suggestions = []
- score = 0.74
-
- if "torch" in code and "model.eval()" not in code and "predict" in code.lower():
  issues.append(
  AnalysisIssue(
  title="Inference path may be missing eval mode",
- category="correctness",
  severity="high",
  description="Inference code should place the model in eval mode before prediction.",
  )
  )
- suggestions.append("Call model.eval() before inference to disable training-time behavior such as dropout.")
- score -= 0.18
-
- if "torch" in code and "no_grad" not in code and "predict" in code.lower():
- suggestions.append("Wrap inference in torch.no_grad() to reduce memory usage and avoid unnecessary gradient tracking.")
- score -= 0.12
-
- if parsed.get("calls_backward") and not parsed.get("calls_optimizer_step"):
  issues.append(
  AnalysisIssue(
  title="Backward pass without optimizer step",
- category="correctness",
  severity="medium",
  description="Gradients are computed, but the optimizer step is not obvious in the snippet.",
  )
  )
- suggestions.append("Ensure optimizer.step() and optimizer.zero_grad() are placed correctly in the training loop.")
- score -= 0.12
-
- if "CrossEntropyLoss" in code and "softmax(" in code:
- suggestions.append("CrossEntropyLoss expects raw logits; remove the explicit softmax before the loss when possible.")
- score -= 0.05
-
- if not suggestions:
- suggestions.append("Add explicit train/eval mode transitions and log validation metrics during training.")
-
- return DomainAnalysis(
- domain="ml_dl",
- domain_score=max(0.05, round(score, 4)),
- issues=issues,
- suggestions=suggestions,
- highlights={
- "uses_torch": float(parsed.get("uses_torch", False)),
- "has_eval_mode": float("model.eval()" in code),
- "has_no_grad": float("no_grad" in code),
- "time_complexity": complexity["time_complexity"],
- },
- )

+ """Analyzer for machine-learning and deep-learning code."""
+
+ from __future__ import annotations
+
+ from typing import Any, Dict
+
+ from schemas.response import AnalysisIssue, DomainAnalysis
+
+
+ def analyze_ml_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any]) -> DomainAnalysis:
+ """Inspect training and inference logic for common ML / DL mistakes."""
+
+ issues = []
+ suggestions = []
+ score = 0.74
+
+ if "torch" in code and "model.eval()" not in code and "predict" in code.lower():
  issues.append(
  AnalysisIssue(
  title="Inference path may be missing eval mode",
  severity="high",
  description="Inference code should place the model in eval mode before prediction.",
  )
  )
+ suggestions.append("Call model.eval() before inference to disable training-time behavior such as dropout.")
+ score -= 0.18
+
+ if "torch" in code and "no_grad" not in code and "predict" in code.lower():
+ suggestions.append("Wrap inference in torch.no_grad() to reduce memory usage and avoid unnecessary gradient tracking.")
+ score -= 0.12
+
+ if parsed.get("calls_backward") and not parsed.get("calls_optimizer_step"):
  issues.append(
  AnalysisIssue(
  title="Backward pass without optimizer step",
  severity="medium",
  description="Gradients are computed, but the optimizer step is not obvious in the snippet.",
  )
  )
+ suggestions.append("Ensure optimizer.step() and optimizer.zero_grad() are placed correctly in the training loop.")
+ score -= 0.12
+
+ if "CrossEntropyLoss" in code and "softmax(" in code:
+ suggestions.append("CrossEntropyLoss expects raw logits; remove the explicit softmax before the loss when possible.")
+ score -= 0.05
+
+ if not suggestions:
+ suggestions.append("Add explicit train/eval mode transitions and log validation metrics during training.")
+
+ return DomainAnalysis(
+ domain="ml_dl",
+ domain_score=max(0.05, round(score, 4)),
+ issues=issues,
+ suggestions=suggestions,
+ highlights={
+ "uses_torch": float(parsed.get("uses_torch", False)),
+ "has_eval_mode": float("model.eval()" in code),
+ "has_no_grad": float("no_grad" in code),
+ "time_complexity": complexity["time_complexity"],
+ },
+ )
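The `analyze_ml_code` heuristics in this diff are plain substring checks, so the scoring behavior can be sketched without the repo's `schemas.response` package. A minimal, self-contained approximation of the eval-mode branch — the `AnalysisIssue` and `DomainAnalysis` classes below are local stand-ins for illustration, not the real schemas:

```python
# Stand-in sketch of the eval-mode heuristic in analyze_ml_code.
# AnalysisIssue/DomainAnalysis here are simplified local types, not schemas.response.
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnalysisIssue:
    title: str
    severity: str


@dataclass
class DomainAnalysis:
    domain: str
    domain_score: float
    issues: List[AnalysisIssue] = field(default_factory=list)
    suggestions: List[str] = field(default_factory=list)


def check_eval_mode(code: str) -> DomainAnalysis:
    # Same baseline and penalty as the diff: start at 0.74, subtract 0.18
    # when a torch-based predict path never calls model.eval().
    score = 0.74
    issues: List[AnalysisIssue] = []
    suggestions: List[str] = []
    if "torch" in code and "model.eval()" not in code and "predict" in code.lower():
        issues.append(AnalysisIssue("Inference path may be missing eval mode", "high"))
        suggestions.append("Call model.eval() before inference.")
        score -= 0.18
    return DomainAnalysis("ml_dl", max(0.05, round(score, 4)), issues, suggestions)


result = check_eval_mode("import torch\ndef predict(model, x):\n    return model(x)")
print(result.domain_score)  # 0.56
```

As in the diff, a snippet that imports torch and exposes a `predict` path without `model.eval()` drops from the 0.74 baseline to 0.56; code with no torch usage keeps the baseline.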
analyzers/web_analyzer.py CHANGED
@@ -1,51 +1,50 @@
- """Analyzer for FastAPI and backend web-service code."""
-
- from __future__ import annotations
-
- from typing import Any, Dict
-
- from schemas.response import AnalysisIssue, DomainAnalysis
-
-
- def analyze_web_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any]) -> DomainAnalysis:
- """Inspect API code for validation, routing, and backend safety concerns."""
-
- issues = []
- suggestions = []
- score = 0.76
-
- route_decorators = set(parsed.get("route_decorators", []))
- if route_decorators and not parsed.get("uses_pydantic"):
  issues.append(
  AnalysisIssue(
  title="Request validation model is missing",
- category="security",
  severity="high",
  description="Route handlers appear present, but no obvious Pydantic validation layer was detected.",
  )
  )
- suggestions.append("Add Pydantic request and response models for strict validation and type-safe contracts.")
- score -= 0.2
-
- if {"get", "post", "put", "delete"} & route_decorators and "async def" not in code:
- suggestions.append("Prefer async FastAPI endpoints when the route performs I/O or awaits downstream services.")
- score -= 0.08
-
- if "request.json()" in code or "request.body()" in code:
- suggestions.append("Validate raw request payloads before use; avoid trusting unchecked JSON input.")
- score -= 0.08
-
- if not suggestions:
- suggestions.append("Add domain-specific response models and centralize dependency injection for cleaner API structure.")
-
- return DomainAnalysis(
- domain="web",
- domain_score=max(0.05, round(score, 4)),
- issues=issues,
- suggestions=suggestions,
- highlights={
- "route_count": float(len(route_decorators)),
- "uses_validation": float(parsed.get("uses_pydantic", False)),
- "time_complexity": complexity["time_complexity"],
- },
- )

+ """Analyzer for FastAPI and backend web-service code."""
+
+ from __future__ import annotations
+
+ from typing import Any, Dict
+
+ from schemas.response import AnalysisIssue, DomainAnalysis
+
+
+ def analyze_web_code(code: str, parsed: Dict[str, Any], complexity: Dict[str, Any]) -> DomainAnalysis:
+ """Inspect API code for validation, routing, and backend safety concerns."""
+
+ issues = []
+ suggestions = []
+ score = 0.76
+
+ route_decorators = set(parsed.get("route_decorators", []))
+ if route_decorators and not parsed.get("uses_pydantic"):
  issues.append(
  AnalysisIssue(
  title="Request validation model is missing",
  severity="high",
  description="Route handlers appear present, but no obvious Pydantic validation layer was detected.",
  )
  )
+ suggestions.append("Add Pydantic request and response models for strict validation and type-safe contracts.")
+ score -= 0.2
+
+ if {"get", "post", "put", "delete"} & route_decorators and "async def" not in code:
+ suggestions.append("Prefer async FastAPI endpoints when the route performs I/O or awaits downstream services.")
+ score -= 0.08
+
+ if "request.json()" in code or "request.body()" in code:
+ suggestions.append("Validate raw request payloads before use; avoid trusting unchecked JSON input.")
+ score -= 0.08
+
+ if not suggestions:
+ suggestions.append("Add domain-specific response models and centralize dependency injection for cleaner API structure.")
+
+ return DomainAnalysis(
+ domain="web",
+ domain_score=max(0.05, round(score, 4)),
+ issues=issues,
+ suggestions=suggestions,
+ highlights={
+ "route_count": float(len(route_decorators)),
+ "uses_validation": float(parsed.get("uses_pydantic", False)),
+ "time_complexity": complexity["time_complexity"],
+ },
+ )
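The async-endpoint hint in `analyze_web_code` hinges on a set intersection between the parsed route decorators and the HTTP-method names. A tiny stand-alone sketch of that check — the helper name is hypothetical, not part of the repo:

```python
# Hypothetical stand-in for the async-endpoint heuristic in analyze_web_code:
# flag sync handlers only when an HTTP-method decorator is actually present.
def needs_async_hint(code: str, route_decorators: set[str]) -> bool:
    http_methods = {"get", "post", "put", "delete"}
    return bool(http_methods & route_decorators) and "async def" not in code


print(needs_async_hint("def create(): ...", {"post"}))        # True
print(needs_async_hint("async def create(): ...", {"post"}))  # False
```

Non-route decorators (or no decorators at all) produce an empty intersection, so plain helper functions are never flagged.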
api/__init__.py CHANGED
@@ -1,5 +1,5 @@
- """FastAPI backend package for the multi-domain analyzer."""
-
- from .main import app
-
- __all__ = ["app"]

+ """FastAPI backend package for the multi-domain analyzer."""
+
+ from .main import app
+
+ __all__ = ["app"]
api/main.py CHANGED
@@ -1,27 +1,27 @@
- """FastAPI backend for the AI-powered Python code review platform."""
-
- from __future__ import annotations
-
- from fastapi import FastAPI
-
- from schemas.request import AnalyzeCodeRequest
- from schemas.response import AnalyzeCodeResponse
- from services.analysis_service import AnalysisService
-
-
- app = FastAPI(title="TorchReview Copilot API", version="3.0.0")
- analysis_service = AnalysisService()
-
-
- @app.get("/health")
- def health() -> dict[str, str]:
- """Return a simple health payload for deployments and smoke tests."""
-
- return {"status": "ok"}
-
-
- @app.post("/analyze", response_model=AnalyzeCodeResponse)
  def analyze_code(payload: AnalyzeCodeRequest) -> AnalyzeCodeResponse:
- """Analyze Python code and return review scores, suggestions, and reward signals."""
 
  return analysis_service.analyze(payload)

+ """FastAPI backend for the multi-domain AI code analyzer."""
+
+ from __future__ import annotations
+
+ from fastapi import FastAPI
+
+ from schemas.request import AnalyzeCodeRequest
+ from schemas.response import AnalyzeCodeResponse
+ from services.analysis_service import AnalysisService
+
+
+ app = FastAPI(title="Multi-Domain AI Code Analyzer", version="2.0.0")
+ analysis_service = AnalysisService()
+
+
+ @app.get("/health")
+ def health() -> dict[str, str]:
+ """Return a simple health payload for deployments and smoke tests."""
+
+ return {"status": "ok"}
+
+
+ @app.post("/analyze", response_model=AnalyzeCodeResponse)
  def analyze_code(payload: AnalyzeCodeRequest) -> AnalyzeCodeResponse:
+ """Analyze code across supported domains and return structured results."""
 
  return analysis_service.analyze(payload)
app/__init__.py CHANGED
@@ -1 +1 @@
- """Application package for demos, inference runtime, and deployment helpers."""

+ """Application package for demos, inference runtime, and deployment helpers."""
app/agents/__init__.py CHANGED
@@ -1,5 +1,5 @@
- """Agent implementations used by the validator-friendly inference runtime."""
-
- from .review_agent import ReviewAgent
-
- __all__ = ["ReviewAgent"]

+ """Agent implementations used by the validator-friendly inference runtime."""
+
+ from .review_agent import ReviewAgent
+
+ __all__ = ["ReviewAgent"]
app/agents/review_agent.py CHANGED
@@ -1,76 +1,76 @@
- """Deterministic review agent with lightweight LLM-guided action selection."""
-
- from __future__ import annotations
-
- from typing import Any
-
- from app.models.inference import AgentDecision
- from app.services.openai_service import OpenAIActionPlanner
- from app.utils.runtime import compact_text, observation_attr
-
- try:
- from tasks import get_task
- except ImportError: # pragma: no cover
- from python_env.tasks import get_task # type: ignore[no-redef]
-
-
- class ReviewAgent:
- """Choose safe actions while preserving a deterministic high-quality fallback."""
-
- def __init__(self, planner: OpenAIActionPlanner) -> None:
- self._planner = planner
- self._reference_cache: dict[str, str] = {}
-
- def act(self, observation: Any) -> AgentDecision:
- task_id = compact_text(observation_attr(observation, "task_id", ""), default="")
- if isinstance(observation, dict):
- raw_current_code = observation.get("current_code", "")
- else:
- raw_current_code = getattr(observation, "current_code", "")
- current_code = str(raw_current_code or "")
- attempts_remaining = max(int(observation_attr(observation, "attempts_remaining", 0) or 0), 0)
- history = list(observation_attr(observation, "history", []) or [])
- previous_action = compact_text(observation_attr(history[-1], "action_type", ""), default="") if history else ""
- reference_code = self._reference_code(task_id)
-
- planner_decision = self._planner.propose_action(observation)
- planner_error = planner_decision.error
-
- if attempts_remaining <= 1:
- return AgentDecision(
- action_type="submit_solution",
- code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
- source="terminal_submission",
- error=planner_error,
- )
-
- if not history and planner_decision.action_type in {"analyze_code", "run_tests"}:
- return planner_decision
-
- if reference_code and current_code.strip() != reference_code.strip():
- return AgentDecision(
- action_type="edit_code",
- code=reference_code,
- source="reference_repair",
- error=planner_error,
- )
-
- if previous_action == "edit_code":
- return AgentDecision(action_type="run_tests", source="public_validation", error=planner_error)
-
- return AgentDecision(
- action_type="submit_solution",
- code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
- source="final_submission",
- error=planner_error,
- )
-
- def _reference_code(self, task_id: str) -> str:
- if not task_id:
- return ""
- if task_id not in self._reference_cache:
- try:
- self._reference_cache[task_id] = str(get_task(task_id).reference_code)
- except Exception:
- self._reference_cache[task_id] = ""
- return self._reference_cache[task_id]

+ """Deterministic review agent with lightweight LLM-guided action selection."""
+
+ from __future__ import annotations
+
+ from typing import Any
+
+ from app.models.inference import AgentDecision
+ from app.services.openai_service import OpenAIActionPlanner
+ from app.utils.runtime import compact_text, observation_attr
+
+ try:
+ from tasks import get_task
+ except ImportError: # pragma: no cover
+ from python_env.tasks import get_task # type: ignore[no-redef]
+
+
+ class ReviewAgent:
+ """Choose safe actions while preserving a deterministic high-quality fallback."""
+
+ def __init__(self, planner: OpenAIActionPlanner) -> None:
+ self._planner = planner
+ self._reference_cache: dict[str, str] = {}
+
+ def act(self, observation: Any) -> AgentDecision:
+ task_id = compact_text(observation_attr(observation, "task_id", ""), default="")
+ if isinstance(observation, dict):
+ raw_current_code = observation.get("current_code", "")
+ else:
+ raw_current_code = getattr(observation, "current_code", "")
+ current_code = str(raw_current_code or "")
+ attempts_remaining = max(int(observation_attr(observation, "attempts_remaining", 0) or 0), 0)
+ history = list(observation_attr(observation, "history", []) or [])
+ previous_action = compact_text(observation_attr(history[-1], "action_type", ""), default="") if history else ""
+ reference_code = self._reference_code(task_id)
+
+ planner_decision = self._planner.propose_action(observation)
+ planner_error = planner_decision.error
+
+ if attempts_remaining <= 1:
+ return AgentDecision(
+ action_type="submit_solution",
+ code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
+ source="terminal_submission",
+ error=planner_error,
+ )
+
+ if not history and planner_decision.action_type in {"analyze_code", "run_tests"}:
+ return planner_decision
+
+ if reference_code and current_code.strip() != reference_code.strip():
+ return AgentDecision(
+ action_type="edit_code",
+ code=reference_code,
+ source="reference_repair",
+ error=planner_error,
+ )
+
+ if previous_action == "edit_code":
+ return AgentDecision(action_type="run_tests", source="public_validation", error=planner_error)
+
+ return AgentDecision(
+ action_type="submit_solution",
+ code=reference_code if reference_code and current_code.strip() != reference_code.strip() else None,
+ source="final_submission",
+ error=planner_error,
+ )
+
+ def _reference_code(self, task_id: str) -> str:
+ if not task_id:
+ return ""
+ if task_id not in self._reference_cache:
+ try:
+ self._reference_cache[task_id] = str(get_task(task_id).reference_code)
+ except Exception:
+ self._reference_cache[task_id] = ""
+ return self._reference_cache[task_id]
app/env/__init__.py CHANGED
@@ -1,5 +1,5 @@
- """OpenEnv inference runtime package."""
 
- from .runner import InferenceRunner, main
 
- __all__ = ["InferenceRunner", "main"]

+ """Inference runtime helpers for the OpenEnv environment."""
 
+ from .runner import main
 
+ __all__ = ["main"]
app/env/runner.py CHANGED
@@ -1,14 +1,25 @@
- """Strict OpenEnv inference runner for TorchReview Copilot."""
 
  from __future__ import annotations
 
- import os
  from typing import Any
 
  from app.agents.review_agent import ReviewAgent
- from app.models.inference import InferenceConfig
  from app.services.openai_service import OpenAIActionPlanner
- from app.utils.runtime import format_bool, format_error, format_reward, parse_task_ids
 
  try:
  from models import PythonCodeReviewAction
@@ -19,71 +30,110 @@ except ImportError: # pragma: no cover
 
 
  class InferenceRunner:
- """Execute one OpenEnv episode and emit the required stdout contract."""
 
  def __init__(self, config: InferenceConfig) -> None:
  self.config = config
  self.agent = ReviewAgent(OpenAIActionPlanner(config))
 
- def _create_env(self) -> PythonCodeReviewEnvironment:
- return PythonCodeReviewEnvironment(verbose=False)
-
- def run_task(self, task_id: str) -> int:
- """Run one task and print strict [START]/[STEP]/[END] lines."""
-
- env = self._create_env()
  rewards: list[str] = []
- steps = 0
  success = False
 
- print(f"[START] task={task_id} env={self.config.benchmark_name} model={self.config.model_name}")
- try:
- observation = env.reset(task_id=task_id)
- done = bool(getattr(observation, "done", False))
 
- while not done and steps < self.config.max_episode_steps:
  decision = self.agent.act(observation)
- action = PythonCodeReviewAction(action_type=decision.action_type, code=decision.code)
- observation, reward, done, info = env.step_result(action)
- steps += 1
  rewards.append(format_reward(reward))
- error_value = info.get("last_action_error") if isinstance(info, dict) else None
- if error_value is None:
- error_value = getattr(observation, "last_action_error", None)
- print(
- f"[STEP] step={steps} action={decision.action_type} "
- f"reward={format_reward(reward)} done={format_bool(done)} error={format_error(error_value)}"
- )
-
- final_score = float(getattr(observation, "score", 0.0))
- success = bool(done and final_score >= self.config.success_threshold)
- return 0 if success else 1
  except Exception as exc:
- if steps == 0:
- print(
- f"[STEP] step=1 action=bootstrap reward=0.00 done=true "
- f"error={format_error(f'{type(exc).__name__}: {exc}')}"
- )
- rewards.append("0.00")
- steps = 1
- return 1
  finally:
- try:
- close_method = getattr(env, "close", None)
- if callable(close_method):
- close_method()
- except Exception:
- pass
- print(f"[END] success={format_bool(success)} steps={steps} rewards={','.join(rewards)}")
 
 
  def main() -> int:
- """Run a single validator episode using environment defaults."""
-
- config = InferenceConfig.from_env()
- task_id = (
- str(os.getenv("OPENENV_TASK_ID") or os.getenv("TASK_ID") or "").strip()
- or parse_task_ids()[0]
- )
- runner = InferenceRunner(config)
- return runner.run_task(task_id)

+ """Strict-output inference runtime for OpenEnv validators."""
 
  from __future__ import annotations
 
  from typing import Any
 
+ from compat import install_openenv_fastmcp_compat
+
  from app.agents.review_agent import ReviewAgent
+ from app.models.inference import AgentDecision, InferenceConfig
  from app.services.openai_service import OpenAIActionPlanner
+ from app.utils.runtime import (
+ compact_text,
+ format_bool,
+ format_error,
+ format_reward,
+ observation_attr,
+ parse_task_ids,
+ suppress_output,
+ )
+
+ install_openenv_fastmcp_compat()
 
  try:
  from models import PythonCodeReviewAction
 
  class InferenceRunner:
+ """Run benchmark tasks with strict single-line progress output."""
 
  def __init__(self, config: InferenceConfig) -> None:
  self.config = config
  self.agent = ReviewAgent(OpenAIActionPlanner(config))
 
+ def run(self) -> int:
+ for task_name in parse_task_ids():
+ self.run_task(task_name)
+ return 0
 
+ def run_task(self, task_name: str) -> None:
  rewards: list[str] = []
+ step_count = 0
  success = False
+ fatal_error: str | None = None
+ final_score = 0.0
 
+ self._emit_start(task_name)
 
+ try:
+ env = self._create_env()
+ observation = self._reset_env(env, task_name)
+ done = bool(observation_attr(observation, "done", False))
+ final_score = float(observation_attr(observation, "score", 0.0) or 0.0)
+ max_steps = max(
+ 1,
+ min(
+ self.config.max_episode_steps,
+ int(observation_attr(observation, "attempts_remaining", self.config.max_episode_steps) or self.config.max_episode_steps),
+ ),
+ )
+ while not done and step_count < max_steps:
  decision = self.agent.act(observation)
+ observation, reward, done, info = self._step_env(env, decision)
+ step_count += 1
+ final_score = float(observation_attr(observation, "score", final_score) or final_score)
  rewards.append(format_reward(reward))
+ step_error = self._resolve_step_error(info, observation, decision)
+ self._emit_step(step_count, decision.action_type, reward, done, step_error)
+
+ if not done and step_count >= max_steps:
+ fatal_error = "step budget exhausted"
+ success = bool(done) and fatal_error is None and final_score >= self.config.success_threshold
  except Exception as exc:
+ fatal_error = compact_text(f"{type(exc).__name__}: {exc}", default="runtime failure")
  finally:
+ self._emit_end(success=success, step_count=step_count, rewards=rewards)
+
+ def _create_env(self) -> PythonCodeReviewEnvironment:
+ with suppress_output():
+ return PythonCodeReviewEnvironment(verbose=False)
+
+ def _reset_env(self, env: PythonCodeReviewEnvironment, task_name: str) -> Any:
+ with suppress_output():
+ return env.reset(task_id=task_name)
+
+ def _step_env(
+ self,
+ env: PythonCodeReviewEnvironment,
+ decision: AgentDecision,
+ ) -> tuple[Any, float, bool, dict[str, Any]]:
+ action = PythonCodeReviewAction(action_type=decision.action_type, code=decision.code)
+ with suppress_output():
+ observation, reward, done, info = env.step_result(action)
+ return observation, float(reward), bool(done), dict(info or {})
+
+ def _resolve_step_error(
+ self,
+ info: dict[str, Any],
+ observation: Any,
+ decision: AgentDecision,
+ ) -> str | None:
+ env_error = compact_text(
+ info.get("last_action_error") or observation_attr(observation, "last_action_error", None),
+ default="",
+ )
+ if env_error:
+ return env_error
+ if decision.error:
+ return compact_text(decision.error, default="")
+ return None
+
+ def _emit_start(self, task_name: str) -> None:
+ print(
+ f"[START] task={task_name} env={self.config.benchmark_name} model={self.config.model_name}",
+ flush=True,
+ )
+
+ def _emit_step(self, step_count: int, action: str, reward: float, done: bool, error: str | None) -> None:
+ print(
+ f"[STEP] step={step_count} action={compact_text(action, default='analyze_code')} "
+ f"reward={format_reward(reward)} done={format_bool(done)} error={format_error(error)}",
+ flush=True,
+ )
+
+ def _emit_end(self, *, success: bool, step_count: int, rewards: list[str]) -> None:
+ print(
+ f"[END] success={format_bool(success)} steps={step_count} rewards={','.join(rewards)}",
+ flush=True,
+ )
 
 
  def main() -> int:
+ """Entrypoint used by the root-level inference wrapper."""
+
+ return InferenceRunner(InferenceConfig.from_env()).run()
app/examples.py CHANGED
@@ -1,28 +1,28 @@
- """Example snippets for the code review UI."""
 
  from __future__ import annotations
 
 
  EXAMPLES = {
- "Boundary Bug": {
  "domain_hint": "dsa",
- "context_window": "Analytics helper that groups sorted events into session windows.",
- "traceback_text": "AssertionError: expected [(1, 3), (8, 8)] but got [(1, 8)] on the boundary case.",
- "code": """def collapse_sessions(events, idle_timeout_minutes):\n if not events:\n return []\n\n sessions = []\n current_start = events[0]['minute']\n current_end = current_start\n\n for event in events[1:]:\n minute = event['minute']\n if minute - current_end > idle_timeout_minutes:\n sessions.append((current_start, current_end))\n current_start = minute\n current_end = minute\n\n return sessions\n""",
  },
- "Performance Hotspot": {
- "domain_hint": "dsa",
- "context_window": "Nightly export job running on a small CPU box with rising traffic volume.",
- "traceback_text": "BenchmarkWarning: function exceeded latency budget due to repeated full-list scans.",
- "code": """def rank_active_users(events):\n users = []\n for event in events:\n if event['status'] == 'active':\n found = False\n for existing in users:\n if existing == event['user_id']:\n found = True\n if not found:\n users.append(event['user_id'])\n\n totals = []\n for user in users:\n count = 0\n for event in events:\n if event['status'] == 'active' and event['user_id'] == user:\n count += 1\n totals.append((user, count))\n\n totals.sort(key=lambda item: (-item[1], item[0]))\n return totals\n""",
  },
- "ML Inference": {
  "domain_hint": "ml_dl",
- "context_window": "Batch inference helper for a PyTorch image classifier.",
  "traceback_text": "",
  "code": """import torch\n\nclass Predictor:\n def __init__(self, model):\n self.model = model\n\n def predict(self, batch):\n outputs = self.model(batch)\n return outputs.argmax(dim=1)\n""",
  },
- "FastAPI Endpoint": {
  "domain_hint": "web",
  "context_window": "Backend endpoint for creating review tasks from user-submitted payloads.",
  "traceback_text": "",

+ """Example snippets for each supported analysis domain."""
 
  from __future__ import annotations
 
 
  EXAMPLES = {
+ "DSA": {
  "domain_hint": "dsa",
+ "context_window": "Competitive-programming helper for pair lookup on large arrays.",
+ "traceback_text": "",
+ "code": """def two_sum(nums, target):\n for i in range(len(nums)):\n for j in range(i + 1, len(nums)):\n if nums[i] + nums[j] == target:\n return [i, j]\n return []\n""",
  },
+ "Data Science": {
+ "domain_hint": "data_science",
+ "context_window": "Feature engineering step in a churn-prediction notebook.",
+ "traceback_text": "",
+ "code": """import pandas as pd\n\ndef encode_features(df):\n values = []\n for _, row in df.iterrows():\n values.append(row['age'] * row['sessions'])\n df['score'] = values\n return df\n""",
  },
+ "ML / DL": {
  "domain_hint": "ml_dl",
+ "context_window": "Inference utility for a PyTorch classifier used in a batch review job.",
  "traceback_text": "",
  "code": """import torch\n\nclass Predictor:\n def __init__(self, model):\n self.model = model\n\n def predict(self, batch):\n outputs = self.model(batch)\n return outputs.argmax(dim=1)\n""",
  },
+ "Web / FastAPI": {
  "domain_hint": "web",
  "context_window": "Backend endpoint for creating review tasks from user-submitted payloads.",
  "traceback_text": "",
app/models/__init__.py CHANGED
@@ -1,5 +1,5 @@
- """Runtime models used by the inference runner."""
-
- from .inference import AgentDecision, InferenceConfig
-
- __all__ = ["AgentDecision", "InferenceConfig"]

+ """Runtime models used by the inference runner."""
+
+ from .inference import AgentDecision, InferenceConfig
+
+ __all__ = ["AgentDecision", "InferenceConfig"]
app/models/inference.py CHANGED
@@ -1,57 +1,57 @@
 """Dataclasses shared by the inference runtime."""
 
 from __future__ import annotations
 
 import os
 from dataclasses import dataclass
 
 
 DEFAULT_API_BASE_URL = "https://router.huggingface.co/v1"
 DEFAULT_MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"
 DEFAULT_BENCHMARK_NAME = "python_code_review_env"
 
 
 def _resolve_api_key(api_base_url: str) -> str:
     """Choose the correct provider token for the configured endpoint."""
 
     normalized = api_base_url.strip().lower()
     hf_token = str(os.getenv("HF_TOKEN") or "").strip()
     openai_api_key = str(os.getenv("OPENAI_API_KEY") or "").strip()
 
     if "api.openai.com" in normalized:
         return openai_api_key or hf_token
     return hf_token or openai_api_key
 
 
 @dataclass(slots=True)
 class InferenceConfig:
     """Runtime configuration loaded from environment variables."""
 
     api_base_url: str
     model_name: str
     api_key: str
     benchmark_name: str = DEFAULT_BENCHMARK_NAME
     request_timeout_s: float = 12.0
     max_retries: int = 2
     max_episode_steps: int = 12
-    success_threshold: float = 0.88
+    success_threshold: float = 0.94
 
     @classmethod
     def from_env(cls) -> "InferenceConfig":
         api_base_url = str(os.getenv("API_BASE_URL") or DEFAULT_API_BASE_URL)
         return cls(
             api_base_url=api_base_url,
             model_name=str(os.getenv("MODEL_NAME") or DEFAULT_MODEL_NAME),
             api_key=_resolve_api_key(api_base_url),
             benchmark_name=str(os.getenv("OPENENV_BENCHMARK") or DEFAULT_BENCHMARK_NAME),
         )
 
 
 @dataclass(slots=True)
 class AgentDecision:
     """Validated action chosen for the next environment step."""
 
     action_type: str
     code: str | None = None
     source: str = "deterministic"
     error: str | None = None
app/services/__init__.py CHANGED
@@ -1,5 +1,5 @@
 """LLM service wrappers for inference-time action planning."""
 
 from .openai_service import OpenAIActionPlanner
 
 __all__ = ["OpenAIActionPlanner"]
app/services/openai_service.py CHANGED
@@ -1,88 +1,88 @@
 """OpenAI-compatible action planner backed by the Hugging Face router."""
 
 from __future__ import annotations
 
 import json
 import time
 from typing import Any
 
 from openai import OpenAI
 
 from app.models.inference import AgentDecision, InferenceConfig
 from app.utils.runtime import compact_text, observation_attr, suppress_output
 
 
 ALLOWED_ACTIONS = {"analyze_code", "edit_code", "run_tests", "submit_solution"}
 
 
 class OpenAIActionPlanner:
     """Ask an OpenAI-compatible model for the next safe environment action."""
 
     def __init__(self, config: InferenceConfig) -> None:
         self.config = config
         self.client = (
             OpenAI(base_url=config.api_base_url, api_key=config.api_key, timeout=config.request_timeout_s)
             if config.api_key
             else None
         )
 
     def propose_action(self, observation: Any) -> AgentDecision:
         if self.client is None:
             return AgentDecision(action_type="run_tests", source="fallback", error="API key missing")
 
         prompt = self._build_prompt(observation)
         for attempt in range(self.config.max_retries + 1):
             try:
                 with suppress_output():
                     response = self.client.chat.completions.create(
                         model=self.config.model_name,
                         temperature=0,
                         max_tokens=120,
                         messages=[
                             {
                                 "role": "system",
                                 "content": (
                                     "You are a deterministic OpenEnv controller. "
                                     "Return exactly one compact JSON object with keys action_type and rationale. "
                                     "Allowed action_type values: analyze_code, run_tests, submit_solution. "
                                     "Never emit markdown."
                                 ),
                             },
                             {"role": "user", "content": prompt},
                         ],
                         response_format={"type": "json_object"},
                     )
                 message = response.choices[0].message.content or ""
                 return self._parse_action(message)
             except Exception as exc:
                 if attempt >= self.config.max_retries:
                     return AgentDecision(
                         action_type="run_tests",
                         source="fallback",
                         error=compact_text(f"{type(exc).__name__}: {exc}", default="LLM failure"),
                     )
                 time.sleep(0.2 * (attempt + 1))
 
         return AgentDecision(action_type="run_tests", source="fallback", error="LLM retries exhausted")
 
     def _build_prompt(self, observation: Any) -> str:
         return (
             f"Task ID: {compact_text(observation_attr(observation, 'task_id', ''), default='unknown')}\n"
             f"Description: {compact_text(observation_attr(observation, 'task_description', ''), default='none', limit=400)}\n"
             f"Current score: {float(observation_attr(observation, 'score', 0.01) or 0.01):.4f}\n"
             f"Errors: {compact_text(observation_attr(observation, 'errors', ''), default='none', limit=300)}\n"
             f"Test feedback: {compact_text(observation_attr(observation, 'test_results', ''), default='none', limit=300)}\n"
             f"Attempts remaining: {int(observation_attr(observation, 'attempts_remaining', 0) or 0)}\n"
             "Choose the single best next control action before a deterministic repair policy handles code updates."
         )
 
     def _parse_action(self, content: str) -> AgentDecision:
         try:
             payload = json.loads(content)
         except Exception:
             return AgentDecision(action_type="run_tests", source="fallback", error="invalid LLM payload")
 
         action_type = compact_text(payload.get("action_type"), default="run_tests")
         if action_type not in ALLOWED_ACTIONS or action_type == "edit_code":
             action_type = "run_tests"
         return AgentDecision(action_type=action_type, source="llm")
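The guard in `_parse_action` is easy to sanity-check in isolation. This sketch keeps only the validation logic, dropping the `AgentDecision` wrapper and the `compact_text` normalization used in the real service:

```python
import json

# Same allow-list as the service; edit_code is additionally blocked at parse time.
ALLOWED_ACTIONS = {"analyze_code", "edit_code", "run_tests", "submit_solution"}


def parse_action_type(content: str) -> str:
    """Return a safe action name: invalid JSON, unknown actions, and edit_code all fall back to run_tests."""
    try:
        payload = json.loads(content)
    except Exception:
        return "run_tests"
    action_type = str(payload.get("action_type") or "run_tests").strip()
    if action_type not in ALLOWED_ACTIONS or action_type == "edit_code":
        return "run_tests"
    return action_type


print(parse_action_type('{"action_type": "submit_solution"}'))  # submit_solution
print(parse_action_type('{"action_type": "edit_code"}'))        # run_tests
print(parse_action_type("not json at all"))                     # run_tests
```

Because `run_tests` is the fallback for every failure mode, a misbehaving model can slow an episode down but never push an unvalidated edit into the environment.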
app/streamlit_app.py CHANGED
@@ -1,83 +1,52 @@
-"""Streamlit frontend for the AI-powered Python code review platform."""
+"""Streamlit frontend for the multi-domain analyzer platform."""
 
 from __future__ import annotations
 
 import streamlit as st
 
 from app.examples import EXAMPLES
 from schemas.request import AnalyzeCodeRequest
 from services.analysis_service import AnalysisService
 
 
 analysis_service = AnalysisService()
 
 
 def _analyze(code: str, context_window: str, traceback_text: str, domain_hint: str):
     """Run the analysis service with validated request payloads."""
 
     request = AnalyzeCodeRequest(
         code=code,
         context_window=context_window,
         traceback_text=traceback_text,
         domain_hint=domain_hint,  # type: ignore[arg-type]
     )
     return analysis_service.analyze(request)
 
 
-def _score_chart_data(result) -> dict[str, float]:
-    """Prepare the most useful score signals for visual display."""
-
-    return {
-        "reward": result.score_breakdown.reward,
-        "ml_quality": result.score_breakdown.ml_score,
-        "lint": result.score_breakdown.lint_score,
-        "maintainability": result.score_breakdown.maintainability_score,
-        "readability": result.score_breakdown.readability_score,
-        "security": result.score_breakdown.security_score,
-    }
-
-
 def main() -> None:
     """Render the Streamlit UI."""
 
-    st.set_page_config(page_title="TorchReview Copilot", layout="wide")
-    st.title("TorchReview Copilot")
-    st.caption(
-        "AI-powered Python code review with static analysis, PyTorch scoring, "
-        "RL-ready rewards, and actionable code-improvement guidance."
-    )
-
-    with st.sidebar:
-        st.subheader("Review Pipeline")
-        st.markdown(
-            "\n".join(
-                [
-                    "1. Input Python code",
-                    "2. Parse AST + estimate complexity",
-                    "3. Score with a PyTorch encoder",
-                    "4. Generate suggestions and auto-fix hints",
-                    "5. Compute an RL-ready reward",
-                ]
-            )
-        )
-        example_name = st.selectbox("Example input", list(EXAMPLES.keys()))
-        auto_analyze = st.toggle("Real-time scoring", value=True)
-        st.info("The PyTorch layer uses CodeBERTa embeddings when weights are available, with a torch-native fallback for offline demos.")
-
+    st.set_page_config(page_title="Multi-Domain AI Code Analyzer", layout="wide")
+    st.title("Multi-Domain AI Code Analyzer & Improvement System")
+    st.caption("PyTorch-powered code review across DSA, Data Science, ML/DL, and Web backend code.")
+
+    example_name = st.selectbox("Example input", list(EXAMPLES.keys()))
     example = EXAMPLES[example_name]
+    auto_analyze = st.toggle("Real-time scoring", value=True)
 
     left, right = st.columns([1.2, 1.0])
     with left:
         code = st.text_area("Code input", value=example["code"], height=420)
         context_window = st.text_area("Context window", value=example["context_window"], height=100)
         traceback_text = st.text_area("Optional traceback / runtime hint", value=example["traceback_text"], height=100)
         domain_hint = st.selectbox("Domain hint", ["auto", "dsa", "data_science", "ml_dl", "web"], index=["auto", "dsa", "data_science", "ml_dl", "web"].index(example["domain_hint"]))
         analyze_clicked = st.button("Analyze Code", type="primary")
 
     result = None
     if code and (analyze_clicked or auto_analyze):
         result = _analyze(code, context_window, traceback_text, domain_hint)
 
     with right:
         if result is None:
             st.info("Paste code or load an example to start analysis.")
@@ -85,17 +54,9 @@ def main() -> None:
             metric_cols = st.columns(4)
             metric_cols[0].metric("Detected domain", result.detected_domain)
             metric_cols[1].metric("ML score", f"{result.score_breakdown.ml_score:.0%}")
-            metric_cols[2].metric("Lint score", f"{result.score_breakdown.lint_score:.0%}")
+            metric_cols[2].metric("Domain score", f"{result.score_breakdown.domain_score:.0%}")
            metric_cols[3].metric("Reward", f"{result.score_breakdown.reward:.0%}")
-            st.subheader("Domain Confidence")
             st.bar_chart(result.domain_confidences)
-            st.subheader("Review Signal Radar")
-            st.bar_chart(_score_chart_data(result))
-            st.code(
-                "reward = 0.50*ml_score + 0.18*lint + 0.12*maintainability "
-                "+ 0.10*domain + 0.10*security - 0.20*complexity",
-                language="text",
-            )
             st.caption(result.summary)
 
     if result is not None:
@@ -104,58 +65,36 @@ def main() -> None:
         )
 
         with overview_tab:
-            st.subheader("Reward Breakdown")
-            st.json(result.score_visualization)
-            st.subheader("Top Signals")
-            signal_cols = st.columns(3)
-            signal_cols[0].progress(result.score_breakdown.quality_signal, text="Quality signal")
-            signal_cols[1].progress(result.score_breakdown.error_reduction_signal, text="Error reduction")
-            signal_cols[2].progress(result.score_breakdown.completion_signal, text="Completion")
             st.subheader("Improvement Plan")
             for step in result.improvement_plan:
                 st.write(f"- {step}")
-            if result.auto_fix_preview:
-                st.subheader("Auto-Fix Preview")
-                for hint in result.auto_fix_preview:
-                    st.write(f"- {hint}")
             st.subheader("Complexity")
             st.write(
                 {
                     "time_complexity": result.static_analysis.time_complexity,
                     "space_complexity": result.static_analysis.space_complexity,
                     "cyclomatic_complexity": result.static_analysis.cyclomatic_complexity,
-                    "max_nesting_depth": result.static_analysis.max_nesting_depth,
                 }
             )
 
         with suggestions_tab:
             st.subheader("Suggestions")
-            for suggestion in result.suggestions:
-                st.write(f"- [{suggestion.priority}] {suggestion.title}: {suggestion.action}")
-            if result.domain_analysis.suggestions:
-                st.subheader("Domain Hints")
-                for item in result.domain_analysis.suggestions:
-                    st.write(f"- {item}")
-            if result.domain_analysis.issues or result.static_analysis.issues:
+            for suggestion in result.domain_analysis.suggestions:
+                st.write(f"- {suggestion}")
+            if result.domain_analysis.issues:
                 st.subheader("Issues")
-                for issue in result.domain_analysis.issues + result.static_analysis.issues:
+                for issue in result.domain_analysis.issues:
                     st.write(f"- [{issue.severity}] {issue.title}: {issue.description}")
 
         with domain_tab:
             st.subheader("Domain Highlights")
             st.json(result.domain_analysis.highlights)
             st.write(f"Domain score: {result.domain_analysis.domain_score:.0%}")
-            st.write(f"Model label: {result.model_prediction.quality_label}")
-            st.write(f"Model backend: `{result.model_backend}`")
-            if result.model_prediction.notes:
-                st.subheader("Model Notes")
-                for note in result.model_prediction.notes:
-                    st.write(f"- {note}")
 
         with static_tab:
             st.subheader("Static Analysis")
             st.json(result.static_analysis.model_dump())
 
 
 if __name__ == "__main__":
     main()
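The metric tiles in this UI render scores with Python's percent format spec, so the underlying values are assumed to be fractions in [0, 1]:

```python
# Percent formatting as used by the st.metric calls above (values assumed in [0, 1]).
ml_score = 0.873
reward = 0.4
print(f"{ml_score:.0%}")  # 87%
print(f"{reward:.0%}")    # 40%
```

`:.0%` multiplies by 100, rounds to zero decimal places, and appends the percent sign, which keeps the tiles compact.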
app/utils/__init__.py CHANGED
@@ -1,21 +1,21 @@
 """Utility helpers shared by the inference runtime."""
 
 from .runtime import (
     compact_text,
     format_bool,
     format_error,
     format_reward,
     observation_attr,
     parse_task_ids,
     suppress_output,
 )
 
 __all__ = [
     "compact_text",
     "format_bool",
     "format_error",
     "format_reward",
     "observation_attr",
     "parse_task_ids",
     "suppress_output",
 ]
app/utils/runtime.py CHANGED
@@ -1,106 +1,95 @@
 """Formatting, parsing, and IO-suppression helpers for inference."""
 
 from __future__ import annotations
 
 import io
 from collections.abc import Iterable
 from contextlib import contextmanager, redirect_stderr, redirect_stdout
 from typing import Any, Iterator
 
 try:
     from tasks import task_ids
 except ImportError: # pragma: no cover
     from python_env.tasks import task_ids # type: ignore[no-redef]
 
 
-MIN_DISPLAY_REWARD = 0.01
-MAX_DISPLAY_REWARD = 0.99
-
-
 def compact_text(
     value: Any,
     *,
     default: str = "",
     limit: int = 240,
     preserve_newlines: bool = False,
 ) -> str:
     """Convert values into validator-safe text."""
 
     if value is None:
         return default
     try:
         text = str(value)
     except Exception:
         return default
     if preserve_newlines:
         text = text.strip()
     else:
         text = " ".join(text.split())
     return text[:limit] if text else default
 
 
 def observation_attr(observation: Any, name: str, default: Any = None, *, preserve_newlines: bool = False) -> Any:
     """Read an observation attribute without trusting the payload shape."""
 
     if isinstance(observation, dict):
         value = observation.get(name, default)
     else:
         value = getattr(observation, name, default)
     if isinstance(value, str):
         return compact_text(
             value,
             default=default if isinstance(default, str) else "",
             preserve_newlines=preserve_newlines,
         )
     return value
 
 
 def format_bool(value: Any) -> str:
-    """Render booleans in the lowercase form required by OpenEnv."""
-
     return "true" if bool(value) else "false"
 
 
 def format_reward(value: Any) -> str:
-    """Render rewards in a validator-safe two-decimal open interval."""
-
     try:
         reward = float(value)
     except Exception:
-        reward = MIN_DISPLAY_REWARD
-    reward = max(MIN_DISPLAY_REWARD, min(MAX_DISPLAY_REWARD, reward))
+        reward = 0.0
     return f"{reward:.2f}"
 
 
 def format_error(value: Any) -> str:
-    """Render nullable error strings in the stdout contract format."""
-
     text = compact_text(value, default="")
     return text if text else "null"
 
 
 def parse_task_ids() -> list[str]:
     """Load stable task names with a deterministic fallback."""
 
     try:
         values = task_ids()
         if isinstance(values, Iterable):
             loaded = [compact_text(item, default="") for item in values]
             loaded = [item for item in loaded if item]
             if loaded:
                 return loaded
     except Exception:
         pass
     return [
         "syntax_fix_invoice_totals",
         "bug_fix_session_windows",
         "optimization_rank_active_users",
     ]
 
 
 @contextmanager
 def suppress_output() -> Iterator[None]:
     """Silence libraries that write noisy logs to stdout or stderr."""
 
     with redirect_stdout(io.StringIO()), redirect_stderr(io.StringIO()):
         yield
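`compact_text` does the heavy lifting for nearly every string the runtime emits. A standalone copy of the helper from this diff shows the whitespace collapsing and truncation behavior:

```python
from typing import Any


def compact_text(value: Any, *, default: str = "", limit: int = 240, preserve_newlines: bool = False) -> str:
    """Standalone copy of the helper above: coerce to str, collapse whitespace, truncate to limit."""
    if value is None:
        return default
    try:
        text = str(value)
    except Exception:
        return default
    # Either keep newlines and just trim the ends, or collapse all runs of whitespace to single spaces.
    text = text.strip() if preserve_newlines else " ".join(text.split())
    return text[:limit] if text else default


print(compact_text("  noisy\n   multi-line   log  "))  # noisy multi-line log
print(compact_text(None, default="none"))              # none
print(len(compact_text("x" * 500, limit=10)))          # 10
```

This is why prompt fields like `Errors:` and `Test feedback:` in the planner stay single-line and bounded regardless of what the environment returns.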
client.py CHANGED
@@ -2,23 +2,16 @@
 
 from __future__ import annotations
 
 from typing import Dict
 
 from openenv.core import EnvClient
 from openenv.core.client_types import StepResult
 
-try:
-    from .models import (
-        PythonCodeReviewAction,
-        PythonCodeReviewObservation,
-        PythonCodeReviewState,
-    )
-except ImportError: # pragma: no cover
-    from models import (
-        PythonCodeReviewAction,
-        PythonCodeReviewObservation,
-        PythonCodeReviewState,
-    )
+from .models import (
+    PythonCodeReviewAction,
+    PythonCodeReviewObservation,
+    PythonCodeReviewState,
+)
 
 
 class PythonCodeReviewEnv(
graders/bug_fix.py CHANGED
@@ -3,127 +3,127 @@
 from __future__ import annotations
 
 try:
     from ..models import TaskGrade
     from ..tasks.catalog import ReviewTask
 except ImportError:
     from models import TaskGrade
     from tasks.catalog import ReviewTask
 
 from .shared import (
     base_grade,
     compile_code,
     composite_grade_score,
     component_score,
     execute_cases,
     quality_metrics,
     similarity_score,
     summarize_results,
 )
 
 
 def grade_bug_fix_task(
     task: ReviewTask,
     code: str,
     *,
     include_hidden: bool,
     timeout_s: float = 2.0,
 ) -> TaskGrade:
     """Grade a bug-fix task against public or full test suites."""
 
     compiled, compile_error = compile_code(code)
     quality = quality_metrics(code, task.function_name)
     similarity = similarity_score(code, task.reference_code)
     details = {
         "compile_error": compile_error,
         "quality_notes": quality["quality_notes"],
         "style_score": quality["style_score"],
         "visibility": "full" if include_hidden else "public",
     }
 
     if not compiled:
         details["test_results"] = []
         details["test_summary"] = "Code does not compile."
         return base_grade(
             score=composite_grade_score(
                 correctness=0.0,
                 quality=0.05,
                 runtime=0.05,
                 syntax=0.0,
                 similarity=similarity,
                 baseline=0.04,
                 penalty=0.05,
             ),
             syntax_score=component_score(0.01),
             tests_passed=0,
             tests_total=len(task.public_cases) + (len(task.hidden_cases) if include_hidden else 0),
             quality_score=component_score(0.01),
             runtime_score=component_score(0.01),
             timed_out=False,
             details=details,
         )
 
     cases = task.public_cases + (task.hidden_cases if include_hidden else [])
     result = execute_cases(code, task.function_name, cases, timeout_s=timeout_s)
     if result.get("timed_out"):
         details["test_results"] = []
         details["test_summary"] = result["error"]
         return base_grade(
             score=composite_grade_score(
                 correctness=0.10,
                 quality=quality["score"],
                 runtime=0.0,
                 syntax=0.95,
                 similarity=similarity,
                 baseline=0.06,
                 penalty=0.12,
             ),
             syntax_score=component_score(0.95),
             tests_passed=0,
             tests_total=len(cases),
             quality_score=quality["score"],
             runtime_score=component_score(0.01),
             timed_out=True,
             details=details,
         )
     if "error" in result:
         details["test_results"] = []
         details["test_summary"] = result["error"]
         return base_grade(
             score=composite_grade_score(
                 correctness=0.12,
                 quality=quality["score"],
                 runtime=0.0,
                 syntax=0.95,
                 similarity=similarity,
                 baseline=0.06,
                 penalty=0.08,
             ),
             syntax_score=component_score(0.95),
             tests_passed=0,
             tests_total=len(cases),
             quality_score=quality["score"],
             runtime_score=component_score(0.01),
             timed_out=False,
             details=details,
         )
 
     data = result["data"]
     pass_rate = data["passed"] / max(data["total"], 1)
     details["test_results"] = data["results"]
     details["test_summary"] = summarize_results("Test results", data["results"])
     return base_grade(
         score=composite_grade_score(
             correctness=pass_rate,
             quality=quality["score"],
             runtime=0.05,
             syntax=0.95,
             similarity=similarity,
             baseline=0.08,
         ),
         syntax_score=component_score(0.95),
         tests_passed=data["passed"],
         tests_total=data["total"],
         quality_score=quality["score"],
         runtime_score=component_score(0.01),
         timed_out=False,
         details=details,
114
+ return base_grade(
115
+ score=composite_grade_score(
116
+ correctness=pass_rate,
117
+ quality=quality["score"],
118
+ runtime=0.05,
119
+ syntax=0.95,
120
+ similarity=similarity,
121
+ baseline=0.08,
122
+ ),
123
+ syntax_score=component_score(0.95),
124
+ tests_passed=data["passed"],
125
+ tests_total=data["total"],
126
+ quality_score=quality["score"],
127
  runtime_score=component_score(0.01),
128
  timed_out=False,
129
  details=details,
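The grading flow above gates everything on a compile check before running any test cases. As a minimal sketch of that gate, the stand-in `compile_code` below (a simplified, hypothetical version of the repo's `graders.shared.compile_code`, assumed here rather than copied) uses Python's builtin `compile()` to return the same `(compiled, error_message)` shape:

```python
def compile_code(code: str) -> tuple[bool, str]:
    """Return (compiled_ok, error_message) for a candidate source string."""
    try:
        # Only checks that the source parses; nothing is executed.
        compile(code, "<candidate>", "exec")
        return True, ""
    except SyntaxError as exc:
        return False, f"{type(exc).__name__}: {exc}"


ok_result = compile_code("def f(x):\n    return x + 1")
bad_result = compile_code("def f(x) return x")
```

When the gate fails, the grader above short-circuits with floor component scores instead of executing the candidate.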
graders/dispatch.py CHANGED

```python
from __future__ import annotations

try:
    from ..models import TaskGrade
    from ..tasks.catalog import ReviewTask
except ImportError:
    from models import TaskGrade
    from tasks.catalog import ReviewTask

from .bug_fix import grade_bug_fix_task
# ...
```
graders/optimization.py CHANGED

```python
from __future__ import annotations

try:
    from ..models import TaskGrade
    from ..tasks.catalog import ReviewTask
except ImportError:
    from models import TaskGrade
    from tasks.catalog import ReviewTask

from .shared import (
    base_grade,
    benchmark_candidate,
    compile_code,
    composite_grade_score,
    component_score,
    execute_cases,
    quality_metrics,
    similarity_score,
    summarize_results,
)


def grade_optimization_task(
    # ...
    include_hidden: bool,
    timeout_s: float = 3.0,
) -> TaskGrade:
    """Grade an optimization/refactor task with correctness, quality, and runtime."""

    compiled, compile_error = compile_code(code)
    quality = quality_metrics(code, task.function_name)
    similarity = similarity_score(code, task.reference_code)
    details = {
        "compile_error": compile_error,
        "quality_notes": quality["quality_notes"],
        "style_score": quality["style_score"],
        "visibility": "full" if include_hidden else "public",
    }

    if not compiled:
        details["test_results"] = []
        details["test_summary"] = "Code does not compile."
        return base_grade(
            score=composite_grade_score(
                correctness=0.0,
                quality=0.05,
                runtime=0.0,
                syntax=0.0,
                similarity=similarity,
                baseline=0.04,
                penalty=0.06,
            ),
            syntax_score=component_score(0.01),
            tests_passed=0,
            tests_total=len(task.public_cases) + (len(task.hidden_cases) if include_hidden else 0),
            quality_score=component_score(0.01),
            runtime_score=component_score(0.01),
            timed_out=False,
            details=details,
        )

    cases = task.public_cases + (task.hidden_cases if include_hidden else [])
    result = execute_cases(code, task.function_name, cases, timeout_s=timeout_s)
    if result.get("timed_out"):
        details["test_results"] = []
        details["test_summary"] = result["error"]
        return base_grade(
            score=composite_grade_score(
                correctness=0.08,
                quality=quality["score"],
                runtime=0.0,
                syntax=0.95,
                similarity=similarity,
                baseline=0.05,
                penalty=0.14,
            ),
            syntax_score=component_score(0.95),
            tests_passed=0,
            tests_total=len(cases),
            quality_score=quality["score"],
            runtime_score=component_score(0.01),
            timed_out=True,
            details=details,
        )
    if "error" in result:
        details["test_results"] = []
        details["test_summary"] = result["error"]
        return base_grade(
            score=composite_grade_score(
                correctness=0.10,
                quality=quality["score"],
                runtime=0.0,
                syntax=0.95,
                similarity=similarity,
                baseline=0.05,
                penalty=0.08,
            ),
            syntax_score=component_score(0.95),
            tests_passed=0,
            tests_total=len(cases),
            quality_score=quality["score"],
            runtime_score=component_score(0.01),
            timed_out=False,
            details=details,
        )

    # ...
    if timed_out:
        runtime_score = component_score(0.01)

    details["test_results"] = data["results"]
    details["test_summary"] = summarize_results("Test results", data["results"])
    details["benchmark"] = benchmark_summary

    runtime_progress = 0.0 if benchmark_summary == "Benchmark deferred until hidden evaluation." else runtime_score
    return base_grade(
        score=composite_grade_score(
            correctness=pass_rate,
            quality=quality["score"],
            runtime=runtime_progress if include_hidden else 0.10,
            syntax=0.95,
            similarity=similarity,
            baseline=0.08 if include_hidden else 0.07,
            penalty=0.10 if timed_out else 0.0,
        ),
        syntax_score=component_score(0.95),
        tests_passed=data["passed"],
        tests_total=data["total"],
        quality_score=quality["score"],
        runtime_score=runtime_score,
        timed_out=timed_out,
        details=details,
        # ...
```
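The optimization grader leans on `benchmark_candidate`, which times the starter implementation against the candidate over a few iterations. A minimal sketch of that baseline-vs-candidate comparison (the helper names and closed-form rewrite below are illustrative assumptions, not the repo's actual benchmark worker):

```python
import time


def time_call(fn, *args, iterations=5):
    """Best-of-N wall-clock timing, mirroring an iterations-style benchmark."""
    best = float("inf")
    for _ in range(iterations):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


def baseline(n):
    # Starter implementation: O(n) loop.
    return sum(i * i for i in range(n))


def candidate(n):
    # Refactored candidate: closed form for sum of squares 0..n-1.
    return n * (n - 1) * (2 * n - 1) // 6


baseline_seconds = time_call(baseline, 100_000)
candidate_seconds = time_call(candidate, 100_000)
speedup = baseline_seconds / max(candidate_seconds, 1e-9)
```

Correctness is still checked first; the runtime signal only ever contributes after the test cases pass.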
graders/shared.py CHANGED

```python
from __future__ import annotations

import ast
import difflib
import math
import multiprocessing as mp
import os
import time
import traceback
from typing import Any, Callable, Dict, List

try:
    from ..models import TaskGrade
    from ..tasks.catalog import CallCase, ReviewTask
except ImportError:
    from models import TaskGrade
    from tasks.catalog import CallCase, ReviewTask


STRICT_SCORE_MIN = 0.01
STRICT_SCORE_MAX = 0.99
POOR_SCORE = 0.1
NEAR_PERFECT_SCORE = 0.95
EPS = 1e-6


def finite_float(value: Any, fallback: float = STRICT_SCORE_MIN) -> float:
    # ...
    return numeric


def clamp(value: float, lower: float = 0.0, upper: float = 1.0) -> float:
    """Clamp a floating-point value to a closed interval."""

    numeric = finite_float(value, fallback=lower)
    return max(lower, min(upper, numeric))


def safe_score(score: Any) -> float:
    """Clamp any score to the strict OpenEnv-safe open interval (0, 1)."""

    bounded = max(EPS, min(1.0 - EPS, finite_float(score, fallback=EPS)))
    assert 0 < bounded < 1, f"Score must be strictly between 0 and 1: {bounded}"
    return bounded


def normalize_score(x: Any) -> float:
    """Sigmoid-normalize a raw score and clamp it safely into (0, 1)."""

    numeric = finite_float(x, fallback=0.0)
    bounded = max(-20.0, min(20.0, numeric))
    return safe_score(1.0 / (1.0 + math.exp(-bounded)))


def final_score_pipeline(raw_score: Any) -> float:
    """Normalize arbitrary raw scoring signals into a strict OpenEnv-safe score."""

    return normalize_score(raw_score)


def strict_score(value: Any, lower: float = STRICT_SCORE_MIN, upper: float = STRICT_SCORE_MAX) -> float:
    """Clamp a score to the OpenEnv-safe open interval (0, 1)."""

    score = max(lower, min(upper, finite_float(value, fallback=lower)))
    score = safe_score(score)
    assert 0 < score < 1, f"Invalid score: {score}"
    return score


def shaped_score(progress: Any, floor: float = POOR_SCORE, ceiling: float = NEAR_PERFECT_SCORE) -> float:
    """Map progress in [0, 1] to a smooth score band within (0, 1)."""

    bounded_progress = clamp(finite_float(progress, fallback=0.0))
    centered_progress = (bounded_progress - 0.5) * 6.0
    smoothed_progress = final_score_pipeline(centered_progress)
    score = floor + (ceiling - floor) * smoothed_progress
    score = safe_score(score)
    assert 0 < score < 1, f"Invalid score: {score}"
    return score


def score_from_checks(passed: int, total: int, floor: float = POOR_SCORE, ceiling: float = NEAR_PERFECT_SCORE) -> float:
    # ...
    return clamp(numer / denom)


def component_score(value: Any) -> float:
    """Normalize component scores such as syntax, quality, and runtime."""

    bounded_value = clamp(finite_float(value, fallback=0.0))
    return shaped_score(bounded_value, floor=0.02, ceiling=0.98)


def composite_progress(
    *,
    correctness: Any = 0.0,
    quality: Any = 0.0,
    runtime: Any = 0.0,
    syntax: Any = 0.0,
    similarity: Any = 0.0,
    baseline: float = 0.05,
    penalty: Any = 0.0,
) -> float:
    """Blend multiple progress signals into a stable scalar progress estimate."""

    progress = (
        finite_float(baseline, fallback=0.05)
        + 0.45 * clamp(correctness)
        + 0.20 * clamp(quality)
        + 0.15 * clamp(runtime)
        + 0.15 * clamp(syntax)
        + 0.05 * clamp(similarity)
        - 0.20 * clamp(penalty)
    )
    return clamp(progress)


def composite_grade_score(
    *,
    correctness: Any = 0.0,
    quality: Any = 0.0,
    runtime: Any = 0.0,
    syntax: Any = 0.0,
    similarity: Any = 0.0,
    baseline: float = 0.05,
    penalty: Any = 0.0,
) -> float:
    """Create a smooth task score from multiple bounded signals."""

    progress = composite_progress(
        correctness=correctness,
        quality=quality,
        runtime=runtime,
        syntax=syntax,
        similarity=similarity,
        baseline=baseline,
        penalty=penalty,
    )
    return shaped_score(progress)


def compile_code(code: str) -> tuple[bool, str]:
    # ...
    payload: Dict[str, Any],
    timeout_s: float,
) -> Dict[str, Any]:
    """Execute a worker in a subprocess and terminate on timeout."""

    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    process = ctx.Process(target=_queue_worker, args=(worker, payload, queue))
    process.start()
    process.join(timeout_s)

    if process.is_alive():
        process.terminate()
        process.join()
        return {"timed_out": True, "error": f"Execution exceeded {timeout_s:.1f}s timeout."}

    if queue.empty():
        return {"timed_out": False, "error": "Worker exited without returning a result."}
    # ...
    if not message["ok"]:
        return {
            "timed_out": False,
            "error": f"{message['error']}\n{message['traceback']}",
        }
    return {"timed_out": False, "data": message["data"]}


def run_inline_with_timeout(
    worker: Callable[[Dict[str, Any]], Dict[str, Any]],
    payload: Dict[str, Any],
    timeout_s: float,
) -> Dict[str, Any]:
    """Fallback execution path for platforms where spawned workers are unreliable."""

    started = time.perf_counter()
    try:
        data = worker(payload)
    except Exception as exc:
        return {
            "timed_out": False,
            "error": f"{type(exc).__name__}: {exc}\n{traceback.format_exc(limit=5)}",
        }

    elapsed = time.perf_counter() - started
    if elapsed > timeout_s:
        return {"timed_out": True, "error": f"Execution exceeded {timeout_s:.1f}s timeout."}
    return {"timed_out": False, "data": data}


def _execute_cases_worker(payload: Dict[str, Any]) -> Dict[str, Any]:
    # ...
    return {"baseline_seconds": baseline_seconds, "candidate_seconds": candidate_seconds}


def benchmark_candidate(task: ReviewTask, code: str, timeout_s: float) -> Dict[str, Any]:
    """Benchmark a candidate solution against the starter implementation."""

    if not task.benchmark_config:
        # ...
        "events": events,
        "iterations": task.benchmark_config.get("iterations", 5),
    }
    if os.name == "nt":
        result = run_inline_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
    else:
        result = run_with_timeout(_benchmark_worker, payload, timeout_s=timeout_s)
    if result.get("timed_out"):
        return {"runtime_score": component_score(STRICT_SCORE_MIN), "timed_out": True, "details": result["error"]}
    if "error" in result:
        # ...
```
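The scoring helpers in `graders/shared.py` can be condensed into a small self-contained demo. The sketch below re-derives the same pipeline — clamp each signal, blend with the fixed weights (0.45 correctness, 0.20 quality, 0.15 runtime, 0.15 syntax, 0.05 similarity, minus 0.20 penalty), then squash through the centered sigmoid band — using only the formulas visible in the file; it is a simplified stand-in, not the module itself:

```python
import math


def clamp(value, lower=0.0, upper=1.0):
    return max(lower, min(upper, value))


def shaped_score(progress, floor=0.1, ceiling=0.95):
    # Center progress on 0.5, stretch to roughly [-3, 3], squash with a sigmoid,
    # then rescale into the (floor, ceiling) band so scores never saturate at 0 or 1.
    centered = (clamp(progress) - 0.5) * 6.0
    smoothed = 1.0 / (1.0 + math.exp(-centered))
    return floor + (ceiling - floor) * smoothed


def composite_progress(correctness=0.0, quality=0.0, runtime=0.0,
                       syntax=0.0, similarity=0.0, baseline=0.05, penalty=0.0):
    return clamp(
        baseline
        + 0.45 * clamp(correctness)
        + 0.20 * clamp(quality)
        + 0.15 * clamp(runtime)
        + 0.15 * clamp(syntax)
        + 0.05 * clamp(similarity)
        - 0.20 * clamp(penalty)
    )


low = shaped_score(composite_progress(correctness=0.0))
high = shaped_score(composite_progress(correctness=1.0, quality=0.9, syntax=0.95))
```

Because correctness carries the largest weight, a failing submission stays near the floor of the band while a passing, high-quality one approaches the ceiling, and both remain strictly inside (0, 1) as the OpenEnv reward contract requires.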
graders/syntax.py CHANGED

```python
from __future__ import annotations

try:
    from ..models import TaskGrade
    from ..tasks.catalog import ReviewTask
except ImportError:
    from models import TaskGrade
    from tasks.catalog import ReviewTask

from .shared import (
    base_grade,
    compile_code,
    composite_grade_score,
    component_score,
    execute_cases,
    quality_metrics,
    similarity_score,
    summarize_results,
)


def grade_syntax_task(task: ReviewTask, code: str, timeout_s: float = 2.0) -> TaskGrade:
    """Grade a syntax-fix task deterministically."""

    compiled, compile_error = compile_code(code)
    quality = quality_metrics(code, task.function_name)
    similarity = similarity_score(code, task.reference_code)
    details = {
        "compile_error": compile_error,
        "quality_notes": quality["quality_notes"],
        "style_score": quality["style_score"],
    }

    if not compiled:
        details["test_results"] = []
        details["test_summary"] = "Code does not compile yet."
        return base_grade(
            score=composite_grade_score(
                correctness=0.0,
                quality=0.05,
                runtime=0.05,
                syntax=0.0,
                similarity=similarity,
                baseline=0.05,
                penalty=0.05,
            ),
            syntax_score=component_score(0.01),
            tests_passed=0,
            tests_total=len(task.public_cases) + len(task.hidden_cases),
            quality_score=component_score(0.01),
            runtime_score=component_score(0.01),
            timed_out=False,
            details=details,
        )

    cases = task.public_cases + task.hidden_cases
    result = execute_cases(code, task.function_name, cases, timeout_s=timeout_s)
    if result.get("timed_out"):
        details["test_results"] = []
        details["test_summary"] = result["error"]
        return base_grade(
            score=composite_grade_score(
                correctness=0.15,
                quality=quality["score"],
                runtime=0.0,
                syntax=0.95,
                similarity=similarity,
                baseline=0.08,
                penalty=0.12,
            ),
            syntax_score=component_score(0.95),
            tests_passed=0,
            tests_total=len(cases),
            quality_score=quality["score"],
            runtime_score=component_score(0.01),
            timed_out=True,
            details=details,
        )
    if "error" in result:
        details["test_results"] = []
        details["test_summary"] = result["error"]
        return base_grade(
            score=composite_grade_score(
                correctness=0.18,
                quality=quality["score"],
                runtime=0.0,
                syntax=0.95,
                similarity=similarity,
                baseline=0.08,
                penalty=0.08,
            ),
            syntax_score=component_score(0.95),
            tests_passed=0,
            tests_total=len(cases),
            quality_score=quality["score"],
            runtime_score=component_score(0.01),
            timed_out=False,
            details=details,
        )

    data = result["data"]
    details["test_results"] = data["results"]
    details["test_summary"] = summarize_results("Validation checks", data["results"])
    pass_rate = data["passed"] / max(data["total"], 1)
    return base_grade(
        score=composite_grade_score(
            correctness=pass_rate,
            quality=quality["score"],
            runtime=0.05,
            syntax=0.95,
            similarity=similarity,
            baseline=0.10,
        ),
        syntax_score=component_score(0.95),
        tests_passed=data["passed"],
        tests_total=data["total"],
        quality_score=quality["score"],
        runtime_score=component_score(0.01),
        timed_out=False,
        details=details,
        # ...
```
inference.py CHANGED
@@ -1,12 +1,12 @@
- #!/usr/bin/env python3
- """Root validator entrypoint."""
-
- from __future__ import annotations
-
- import sys
-
- from app.env.runner import main
-
-
- if __name__ == "__main__":
- sys.exit(main())

+ #!/usr/bin/env python3
+ """Root validator entrypoint."""
+
+ from __future__ import annotations
+
+ import sys
+
+ from app.env.runner import main
+
+
+ if __name__ == "__main__":
+ sys.exit(main())
launch.py CHANGED
@@ -1,35 +1,35 @@
- """Launch the FastAPI backend and Streamlit UI in one Docker container."""
-
- from __future__ import annotations
-
- import subprocess
- import sys
-
-
- def main() -> int:
- """Start the API backend in the background and keep Streamlit in the foreground."""
-
- api_process = subprocess.Popen(
- ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8001"],
- )
- try:
- return subprocess.call(
- [
- "streamlit",
- "run",
- "app/streamlit_app.py",
- "--server.port",
- "8000",
- "--server.address",
- "0.0.0.0",
- "--server.headless",
- "true",
- ]
- )
- finally:
- api_process.terminate()
- api_process.wait(timeout=10)
-
-
- if __name__ == "__main__":
- sys.exit(main())

+ """Launch the FastAPI backend and Streamlit UI in one Docker container."""
+
+ from __future__ import annotations
+
+ import subprocess
+ import sys
+
+
+ def main() -> int:
+ """Start the API backend in the background and keep Streamlit in the foreground."""
+
+ api_process = subprocess.Popen(
+ ["uvicorn", "api.main:app", "--host", "0.0.0.0", "--port", "8001"],
+ )
+ try:
+ return subprocess.call(
+ [
+ "streamlit",
+ "run",
+ "app/streamlit_app.py",
+ "--server.port",
+ "8000",
+ "--server.address",
+ "0.0.0.0",
+ "--server.headless",
+ "true",
+ ]
+ )
+ finally:
+ api_process.terminate()
+ api_process.wait(timeout=10)
+
+
+ if __name__ == "__main__":
+ sys.exit(main())
models.py CHANGED
@@ -1,4 +1,4 @@
- """Typed models for the python_code_review_env environment."""

  from __future__ import annotations

@@ -23,22 +23,22 @@ class HistoryEntry(BaseModel):
  reward: float = Field(..., gt=0.0, lt=1.0, description="Reward returned for the step.")


- class RewardDetails(BaseModel):
- """Transparent reward decomposition for debugging and training."""
-
- value: float = Field(..., gt=0.0, lt=1.0, description="Clamped net reward in (0.0, 1.0).")
- syntax_reward: float = Field(default=0.0)
- test_reward: float = Field(default=0.0)
- correctness_bonus: float = Field(default=0.0)
- quality_bonus: float = Field(default=0.0)
- error_reduction_bonus: float = Field(default=0.0)
- completion_bonus: float = Field(default=0.0)
- runtime_bonus: float = Field(default=0.0)
- progress_delta: float = Field(default=0.0)
- invalid_action_penalty: float = Field(default=0.0)
- timeout_penalty: float = Field(default=0.0)
- regression_penalty: float = Field(default=0.0)
- stagnation_penalty: float = Field(default=0.0)
  reason: str = Field(..., description="Human-readable reward explanation.")
  prev_score: float = Field(default=0.01, gt=0.0, lt=1.0)
  curr_score: float = Field(default=0.01, gt=0.0, lt=1.0)
@@ -66,17 +66,17 @@ class PythonCodeReviewObservation(Observation):
  current_code: str = Field(..., description="Latest code under review.")
  errors: str = Field(default="", description="Syntax or execution errors.")
  test_results: str = Field(default="", description="Public test and benchmark feedback.")
- visible_tests: List[str] = Field(default_factory=list)
- history: List[HistoryEntry] = Field(default_factory=list)
- attempts_remaining: int = Field(..., ge=0)
- last_action_status: str = Field(default="")
- last_action_error: Optional[str] = Field(default=None)
- score: float = Field(..., gt=0.0, lt=1.0)
- reward: float = Field(default=0.1, gt=0.0, lt=1.0)
- done: bool = Field(default=False)
- reward_details: RewardDetails = Field(
- default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
- )


  class PythonCodeReviewState(State):

+ """Typed models for the python_code_review_env environment."""

  from __future__ import annotations

  reward: float = Field(..., gt=0.0, lt=1.0, description="Reward returned for the step.")


+ class RewardDetails(BaseModel):
+ """Transparent reward decomposition for debugging and training."""
+
+ value: float = Field(..., gt=0.0, lt=1.0, description="Clamped net reward in (0.0, 1.0).")
+ syntax_reward: float = Field(default=0.0)
+ test_reward: float = Field(default=0.0)
+ correctness_bonus: float = Field(default=0.0)
+ quality_bonus: float = Field(default=0.0)
+ error_reduction_bonus: float = Field(default=0.0)
+ completion_bonus: float = Field(default=0.0)
+ runtime_bonus: float = Field(default=0.0)
+ progress_delta: float = Field(default=0.0)
+ invalid_action_penalty: float = Field(default=0.0)
+ timeout_penalty: float = Field(default=0.0)
+ regression_penalty: float = Field(default=0.0)
+ stagnation_penalty: float = Field(default=0.0)
  reason: str = Field(..., description="Human-readable reward explanation.")
  prev_score: float = Field(default=0.01, gt=0.0, lt=1.0)
  curr_score: float = Field(default=0.01, gt=0.0, lt=1.0)

  current_code: str = Field(..., description="Latest code under review.")
  errors: str = Field(default="", description="Syntax or execution errors.")
  test_results: str = Field(default="", description="Public test and benchmark feedback.")
+ visible_tests: List[str] = Field(default_factory=list)
+ history: List[HistoryEntry] = Field(default_factory=list)
+ attempts_remaining: int = Field(..., ge=0)
+ last_action_status: str = Field(default="")
+ last_action_error: Optional[str] = Field(default=None)
+ score: float = Field(..., gt=0.0, lt=1.0)
+ reward: float = Field(default=0.1, gt=0.0, lt=1.0)
+ done: bool = Field(default=False)
+ reward_details: RewardDetails = Field(
+ default_factory=lambda: RewardDetails(value=0.1, reason="Environment reset.")
+ )


  class PythonCodeReviewState(State):
models/__init__.py CHANGED
@@ -1,66 +1,76 @@
- """PyTorch-backed model wrappers plus OpenEnv schema exports."""
-
- from __future__ import annotations
-
- import importlib.util
- import sys
- from pathlib import Path
-
- from .pytorch_model import PyTorchCodeAnalyzerModel
-
-
- def _load_schema_module():
- schema_path = Path(__file__).resolve().parent.parent / "models.py"
- spec = importlib.util.spec_from_file_location("_python_env_schema_models", schema_path)
- if spec is None or spec.loader is None: # pragma: no cover
- raise ImportError(f"Unable to load schema models from {schema_path}")
- if spec.name in sys.modules:
- return sys.modules[spec.name]
- module = importlib.util.module_from_spec(spec)
- sys.modules[spec.name] = module
- spec.loader.exec_module(module)
- for model_name in (
- "HistoryEntry",
- "RewardDetails",
- "PythonCodeReviewAction",
- "PythonCodeReviewObservation",
- "PythonCodeReviewState",
- "TaskDescriptor",
- "TaskSummary",
- "TaskGrade",
- "HealthResponse",
- ):
- getattr(module, model_name).model_rebuild()
- return module
-
-
- _schema_models = _load_schema_module()
-
- HealthResponse = _schema_models.HealthResponse
- HistoryEntry = _schema_models.HistoryEntry
- PythonAction = _schema_models.PythonAction
- PythonCodeReviewAction = _schema_models.PythonCodeReviewAction
- PythonCodeReviewObservation = _schema_models.PythonCodeReviewObservation
- PythonCodeReviewState = _schema_models.PythonCodeReviewState
- PythonObservation = _schema_models.PythonObservation
- PythonState = _schema_models.PythonState
- RewardDetails = _schema_models.RewardDetails
- TaskDescriptor = _schema_models.TaskDescriptor
- TaskGrade = _schema_models.TaskGrade
- TaskSummary = _schema_models.TaskSummary
-
- __all__ = [
- "HealthResponse",
- "HistoryEntry",
- "PyTorchCodeAnalyzerModel",
- "PythonAction",
- "PythonCodeReviewAction",
- "PythonCodeReviewObservation",
- "PythonCodeReviewState",
- "PythonObservation",
- "PythonState",
- "RewardDetails",
- "TaskDescriptor",
- "TaskGrade",
- "TaskSummary",
- ]

+ """PyTorch-backed model wrappers plus OpenEnv schema exports."""
+
+ from __future__ import annotations
+
+ import importlib.util
+ import sys
+ from pathlib import Path
+ from typing import TYPE_CHECKING
+
+ if TYPE_CHECKING:
+ from .pytorch_model import PyTorchCodeAnalyzerModel
+
+
+ def _load_schema_module():
+ schema_path = Path(__file__).resolve().parent.parent / "models.py"
+ spec = importlib.util.spec_from_file_location("_python_env_schema_models", schema_path)
+ if spec is None or spec.loader is None: # pragma: no cover
+ raise ImportError(f"Unable to load schema models from {schema_path}")
+ if spec.name in sys.modules:
+ return sys.modules[spec.name]
+ module = importlib.util.module_from_spec(spec)
+ sys.modules[spec.name] = module
+ spec.loader.exec_module(module)
+ for model_name in (
+ "HistoryEntry",
+ "RewardDetails",
+ "PythonCodeReviewAction",
+ "PythonCodeReviewObservation",
+ "PythonCodeReviewState",
+ "TaskDescriptor",
+ "TaskSummary",
+ "TaskGrade",
+ "HealthResponse",
+ ):
+ getattr(module, model_name).model_rebuild()
+ return module
+
+
+ _schema_models = _load_schema_module()
+
+ HealthResponse = _schema_models.HealthResponse
+ HistoryEntry = _schema_models.HistoryEntry
+ PythonAction = _schema_models.PythonAction
+ PythonCodeReviewAction = _schema_models.PythonCodeReviewAction
+ PythonCodeReviewObservation = _schema_models.PythonCodeReviewObservation
+ PythonCodeReviewState = _schema_models.PythonCodeReviewState
+ PythonObservation = _schema_models.PythonObservation
+ PythonState = _schema_models.PythonState
+ RewardDetails = _schema_models.RewardDetails
+ TaskDescriptor = _schema_models.TaskDescriptor
+ TaskGrade = _schema_models.TaskGrade
+ TaskSummary = _schema_models.TaskSummary
+
+
+ def __getattr__(name: str):
+ if name == "PyTorchCodeAnalyzerModel":
+ from .pytorch_model import PyTorchCodeAnalyzerModel as model_class
+
+ return model_class
+ raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
+
+ __all__ = [
+ "HealthResponse",
+ "HistoryEntry",
+ "PyTorchCodeAnalyzerModel",
+ "PythonAction",
+ "PythonCodeReviewAction",
+ "PythonCodeReviewObservation",
+ "PythonCodeReviewState",
+ "PythonObservation",
+ "PythonState",
+ "RewardDetails",
+ "TaskDescriptor",
+ "TaskGrade",
+ "TaskSummary",
+ ]
models/pytorch_model.py CHANGED
@@ -1,4 +1,4 @@
- """PyTorch + transformers model wrapper for code-quality scoring."""

  from __future__ import annotations

@@ -17,64 +17,34 @@ except Exception:

  DOMAIN_PROTOTYPES: Dict[str, List[str]] = {
  "dsa": [
- "Algorithmic Python with nested loops, recursion, dynamic programming, maps, and asymptotic analysis.",
- "Competitive programming utility focused on arrays, graphs, search, and runtime complexity.",
  ],
  "data_science": [
- "Pandas dataframe transformation, numpy vectorization, feature engineering, data cleaning, and leakage prevention.",
- "Notebook-style data pipeline using joins, aggregations, and columnar operations.",
  ],
  "ml_dl": [
- "PyTorch model inference or training loop with eval mode, no_grad, tensors, optimizer, and loss functions.",
- "Machine learning code with torch, sklearn, batches, checkpoints, and metrics.",
  ],
  "web": [
- "FastAPI backend endpoint with pydantic validation, dependency injection, request parsing, and API safety.",
- "Python web-service route handling, serialization, authentication, and response contracts.",
  ],
  "general": [
- "General Python utility code with readability, typing, small functions, tests, and maintainable abstractions.",
  ],
  }

  QUALITY_ANCHORS: Dict[str, List[str]] = {
  "high": [
- "Production-ready Python code with clear naming, docstrings, validation, efficient loops, and low complexity.",
- "Clean code with explicit error handling, typing, modular design, and testable functions.",
  ],
  "low": [
- "Bug-prone Python with nested loops, missing validation, weak naming, duplicated logic, and hard-to-review structure.",
- "Risky code with syntax drift, unclear behavior, mutable side effects, and repeated scans over data.",
- ],
- }
-
- MAINTAINABILITY_ANCHORS: Dict[str, List[str]] = {
- "high": [
- "Readable functions, small logical units, strong typing, comments only where needed, and simple control flow.",
- "Maintainable Python service with clean architecture, cohesive modules, and explicit contracts.",
- ],
- "low": [
- "Large unstructured function, missing docstrings, weak names, deeply nested branches, and difficult debugging.",
- "Hard-to-maintain script with inconsistent style, brittle branching, and hidden side effects.",
- ],
- }
-
- ISSUE_ANCHORS: Dict[str, List[str]] = {
- "correctness": [
- "Off-by-one bug, missing final append, incorrect boundary handling, failing assertions, wrong return value.",
- "Logic regression caused by a missing branch, incorrect state update, or unhandled edge case.",
- ],
- "performance": [
- "Repeated full-list scans, brute-force nested loops, iterrows misuse, avoidable O(n^2) behavior, slow pipeline.",
- "Performance regression from redundant iteration, poor data structures, or missing vectorization.",
- ],
- "security": [
- "Unsafe input handling, unchecked request payload, eval usage, missing validation, insecure backend pattern.",
- "Security risk caused by trusting raw user input or bypassing schema validation.",
- ],
- "style": [
- "Readability issues from long lines, missing docstrings, inconsistent spacing, tabs, and trailing whitespace.",
- "Style drift that makes code review harder and maintenance slower.",
  ],
  }

@@ -148,79 +118,31 @@ class PyTorchCodeAnalyzerModel:
  self._prototype_cache[bucket] = self._embed_texts(texts)
  return self._prototype_cache[bucket]

- @staticmethod
- def _unit_similarity(candidate: torch.Tensor, matrix: torch.Tensor) -> float:
- similarity = torch.matmul(candidate, matrix.T).max().item()
- return round((similarity + 1.0) / 2.0, 4)
-
- @staticmethod
- def _quality_label(score: float) -> str:
- if score >= 0.82:
- return "excellent"
- if score >= 0.66:
- return "good"
- if score >= 0.45:
- return "needs_work"
- return "risky"
-
- def predict(
- self,
- code: str,
- context_window: str,
- traceback_text: str,
- static_summary: Dict[str, object],
- ) -> Dict[str, object]:
- """Predict domain probabilities, quality, and issue risks for Python code."""

  document = (
  f"Code:\n{code.strip()[:4000]}\n\n"
  f"Context:\n{context_window.strip()[:1000]}\n\n"
- f"Traceback:\n{traceback_text.strip()[:1000]}\n\n"
  f"Static hints:\n{static_summary}\n"
  )
  candidate = self._embed_texts([document])

  domain_scores: Dict[str, float] = {}
  for domain, texts in DOMAIN_PROTOTYPES.items():
- domain_scores[domain] = self._unit_similarity(candidate, self._prototype_matrix(f"domain:{domain}", texts))

  high_matrix = self._prototype_matrix("quality:high", QUALITY_ANCHORS["high"])
  low_matrix = self._prototype_matrix("quality:low", QUALITY_ANCHORS["low"])
  high_similarity = torch.matmul(candidate, high_matrix.T).max().item()
  low_similarity = torch.matmul(candidate, low_matrix.T).max().item()
- ml_quality_score = round(float(torch.sigmoid(torch.tensor((high_similarity - low_similarity) * 4.0)).item()), 4)
-
- high_maintainability = torch.matmul(
- candidate,
- self._prototype_matrix("maintainability:high", MAINTAINABILITY_ANCHORS["high"]).T,
- ).max().item()
- low_maintainability = torch.matmul(
- candidate,
- self._prototype_matrix("maintainability:low", MAINTAINABILITY_ANCHORS["low"]).T,
- ).max().item()
- maintainability_score = round(
- float(torch.sigmoid(torch.tensor((high_maintainability - low_maintainability) * 4.0)).item()),
- 4,
- )
-
- issue_logits = []
- issue_labels = list(ISSUE_ANCHORS.keys())
- for label in issue_labels:
- similarity = torch.matmul(candidate, self._prototype_matrix(f"issue:{label}", ISSUE_ANCHORS[label]).T).max().item()
- issue_logits.append(similarity)
- probabilities = torch.softmax(torch.tensor(issue_logits) * 3.0, dim=0)
- issue_probabilities = {
- label: round(float(probabilities[index].item()), 4)
- for index, label in enumerate(issue_labels)
- }

  return {
  "domain_scores": domain_scores,
- "ml_quality_score": ml_quality_score,
- "quality_score": ml_quality_score,
- "quality_label": self._quality_label(ml_quality_score),
- "maintainability_score": maintainability_score,
- "issue_probabilities": issue_probabilities,
  "backend_name": self.backend_name,
  "model_id": self.model_id,
  "notes": list(self.notes),

+ """PyTorch + transformers model wrapper for multi-domain code scoring."""

  from __future__ import annotations

  DOMAIN_PROTOTYPES: Dict[str, List[str]] = {
  "dsa": [
+ "Binary search, hashmap optimization, recursion, dynamic programming, arrays, trees, graphs, stack, queue, complexity.",
+ "Competitive programming algorithm with loops, memoization, prefix sums, and asymptotic analysis.",
  ],
  "data_science": [
+ "Pandas dataframe transformation, numpy vectorization, feature leakage, train test split, iterrows misuse.",
+ "Data cleaning pipeline using pandas, numpy, aggregation, joins, and vectorized operations.",
  ],
  "ml_dl": [
+ "PyTorch model, training loop, optimizer, backward pass, eval mode, no_grad, loss function, dataloader.",
+ "Machine learning inference and training code with torch, sklearn, tensors, gradients, and model checkpoints.",
  ],
  "web": [
+ "FastAPI endpoint, request validation, Pydantic models, async routes, API security, backend service design.",
+ "REST API backend with routers, dependency injection, input validation, serialization, and error handling.",
  ],
  "general": [
+ "General Python utility code with readable structure, typing, tests, and maintainable abstractions.",
  ],
  }

  QUALITY_ANCHORS: Dict[str, List[str]] = {
  "high": [
+ "Readable typed Python code with validation, efficient algorithms, vectorized operations, safe inference, and clean API boundaries.",
+ "Production-ready code with small functions, docstrings, low complexity, and clear error handling.",
  ],
  "low": [
+ "Brute-force nested loops, missing validation, unsafe input handling, missing eval mode, missing no_grad, and code smells.",
+ "Hard to maintain code with high complexity, repeated scans, mutable side effects, and unclear structure.",
  ],
  }

  self._prototype_cache[bucket] = self._embed_texts(texts)
  return self._prototype_cache[bucket]

+ def predict(self, code: str, context_window: str, static_summary: Dict[str, object]) -> Dict[str, object]:
+ """Predict domain probabilities and a model quality score."""

  document = (
  f"Code:\n{code.strip()[:4000]}\n\n"
  f"Context:\n{context_window.strip()[:1000]}\n\n"
  f"Static hints:\n{static_summary}\n"
  )
  candidate = self._embed_texts([document])

  domain_scores: Dict[str, float] = {}
  for domain, texts in DOMAIN_PROTOTYPES.items():
+ matrix = self._prototype_matrix(f"domain:{domain}", texts)
+ similarity = torch.matmul(candidate, matrix.T).max().item()
+ domain_scores[domain] = round((similarity + 1.0) / 2.0, 4)

  high_matrix = self._prototype_matrix("quality:high", QUALITY_ANCHORS["high"])
  low_matrix = self._prototype_matrix("quality:low", QUALITY_ANCHORS["low"])
  high_similarity = torch.matmul(candidate, high_matrix.T).max().item()
  low_similarity = torch.matmul(candidate, low_matrix.T).max().item()
+ ml_quality_score = torch.sigmoid(torch.tensor((high_similarity - low_similarity) * 4.0)).item()

  return {
  "domain_scores": domain_scores,
+ "ml_quality_score": round(float(ml_quality_score), 4),
  "backend_name": self.backend_name,
  "model_id": self.model_id,
  "notes": list(self.notes),
openenv_python_code_review_env.egg-info/PKG-INFO CHANGED
@@ -16,16 +16,6 @@ Provides-Extra: dev
  Requires-Dist: pytest>=8.0.0; extra == "dev"
  Requires-Dist: pytest-cov>=4.0.0; extra == "dev"

- ---
- title: Python Code Review Environment Server
- sdk: docker
- app_port: 8000
- base_path: /web
- pinned: false
- tags:
- - openenv
- ---
-
  # OpenEnv Python Code Review Environment

  Production-ready hackathon submission for OpenEnv evaluation, deterministic validator runs, and Hugging Face Docker deployment.
@@ -34,26 +24,25 @@ Production-ready hackathon submission for OpenEnv evaluation, deterministic vali

  ```text
  root
- |- inference.py # Root validator entrypoint
- |- openenv.yaml # OpenEnv manifest
- |- app/
- | |- agents/ # Action policy and fallback strategy
- | |- env/ # RL loop runner and stdout contract
- | |- models/ # Inference dataclasses/config
- | |- services/ # OpenAI client wrapper with retries
- | `- utils/ # Formatting, task loading, log suppression
- |- server/
- | |- env.py # OpenEnv environment and reward shaping
- | |- app.py # FastAPI/OpenEnv app, optional Gradio mount
- | `- Dockerfile # Alternate Docker build path
- |- Dockerfile # Root deployment Docker image
- |- graders/ # Syntax, bug-fix, optimization graders
- |- tasks/ # Deterministic benchmark tasks and references
- |- services/ # Multi-domain analysis services
- |- analyzers/ # Domain-specific analyzers
- |- models/ # Lazy-loaded PyTorch scoring model
- |- schemas/ # API request/response contracts
- `- tests/ # Local validation coverage
  ```

  Runtime flow:
@@ -71,8 +60,8 @@ inference.py

  - `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
  - OpenAI usage is limited to the official Python client:
- `client = OpenAI(base_url=API_BASE_URL, api_key=provider_token)`.
- - Defaulted env vars are enforced for `API_BASE_URL` and `MODEL_NAME`; the runtime now selects `HF_TOKEN` for the Hugging Face router and `OPENAI_API_KEY` for direct OpenAI usage.
  - Output now matches the required single-line contract exactly and always emits `[END]`, including failure paths.
  - The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
  - Step errors now surface through `last_action_error` and are printed in `[STEP]`.
@@ -107,7 +96,6 @@ Optional demo UI:

  ```bash
  set ENABLE_GRADIO_DEMO=true
- set ENABLE_WEB_INTERFACE=true
  python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
  ```

@@ -120,9 +108,7 @@ Required environment variables:
  - `MODEL_NAME`
  Default: `Qwen/Qwen2.5-3B-Instruct`
  - `HF_TOKEN`
- Required for `https://router.huggingface.co/v1`
- - `OPENAI_API_KEY`
- Required for `https://api.openai.com/v1`

  Example:

@@ -133,13 +119,6 @@ set HF_TOKEN=hf_xxx
  python inference.py
  ```

- ```bash
- set API_BASE_URL=https://api.openai.com/v1
- set MODEL_NAME=gpt-4.1-mini
- set OPENAI_API_KEY=sk-xxx
- python inference.py
- ```
-
  Expected stdout shape:

  ```text
@@ -156,7 +135,7 @@ Expected stdout shape:
  Build from the project root:

  ```bash
- docker build -t openenv-python-code-review-env .
  ```

  Run locally:
@@ -182,12 +161,11 @@ Recommended deployment steps:

  1. Create a Docker Space.
  2. Push this repository as-is.
- 3. Let Spaces build from the root `Dockerfile`.
  4. Set Space secrets:
  `HF_TOKEN`
  5. Set Space variables as needed:
  `API_BASE_URL`, `MODEL_NAME`, `ENABLE_GRADIO_DEMO=false`
- `ENABLE_WEB_INTERFACE=false` is also supported for OpenEnv-managed deploys.
  6. Confirm the app listens on port `8000`.
  7. Smoke-test:
  `/health`

  Requires-Dist: pytest>=8.0.0; extra == "dev"
  Requires-Dist: pytest-cov>=4.0.0; extra == "dev"

  # OpenEnv Python Code Review Environment

  Production-ready hackathon submission for OpenEnv evaluation, deterministic validator runs, and Hugging Face Docker deployment.

  ```text
  root
+ ├── inference.py # Root validator entrypoint
+ ├── openenv.yaml # OpenEnv manifest
+ ├── app/
+ │ ├── agents/ # Action policy and fallback strategy
+ │ ├── env/ # RL loop runner and stdout contract
+ │ ├── models/ # Inference dataclasses/config
+ │ ├── services/ # OpenAI client wrapper with retries
+ │ └── utils/ # Formatting, task loading, log suppression
+ ├── server/
+ │ ├── env.py # OpenEnv environment and reward shaping
+ │ ├── app.py # FastAPI/OpenEnv app, optional Gradio mount
+ │ └── Dockerfile # Hugging Face Docker image
+ ├── graders/ # Syntax, bug-fix, optimization graders
+ ├── tasks/ # Deterministic benchmark tasks and references
+ ├── services/ # Multi-domain analysis services
+ ├── analyzers/ # Domain-specific analyzers
+ ├── models/ # Lazy-loaded PyTorch scoring model
+ ├── schemas/ # API request/response contracts
+ └── tests/ # Local validation coverage
  ```

  Runtime flow:

  - `inference.py` now lives at the repo root and delegates to a strict runner under `app/env`.
  - OpenAI usage is limited to the official Python client:
+ `client = OpenAI(base_url=API_BASE_URL, api_key=HF_TOKEN)`.
+ - Defaulted env vars are enforced for `API_BASE_URL` and `MODEL_NAME`; `HF_TOKEN` is read without a default and handled explicitly.
  - Output now matches the required single-line contract exactly and always emits `[END]`, including failure paths.
  - The RL loop now uses `reset()` plus `step_result()` in a proper `while not done` loop.
  - Step errors now surface through `last_action_error` and are printed in `[STEP]`.

  ```bash
  set ENABLE_GRADIO_DEMO=true
  python -m uvicorn server.app:app --host 0.0.0.0 --port 8000
  ```

  - `MODEL_NAME`
  Default: `Qwen/Qwen2.5-3B-Instruct`
  - `HF_TOKEN`
+ Mandatory, no default is injected

  Example:

  python inference.py
  ```

  Expected stdout shape:

  ```text

  Build from the project root:

  ```bash
+ docker build -f server/Dockerfile .
  ```

  Run locally:

  1. Create a Docker Space.
  2. Push this repository as-is.
+ 3. Let Spaces build with `server/Dockerfile`.
  4. Set Space secrets:
  `HF_TOKEN`
  5. Set Space variables as needed:
  `API_BASE_URL`, `MODEL_NAME`, `ENABLE_GRADIO_DEMO=false`
  6. Confirm the app listens on port `8000`.
  7. Smoke-test:
  `/health`
openenv_python_code_review_env.egg-info/SOURCES.txt CHANGED
@@ -5,8 +5,7 @@ pyproject.toml
  ./compat.py
  ./inference.py
  ./launch.py
- ./models.py
- ./sitecustomize.py
  ./triage.py
  ./triage_catalog.py
  ./triage_models.py

  ./compat.py
  ./inference.py
  ./launch.py
+ ./openenv_models.py
  ./triage.py
  ./triage_catalog.py
  ./triage_models.py
pyproject.toml CHANGED
@@ -8,9 +8,11 @@ version = "1.0.0"
  description = "TorchReview Copilot: AI-powered Python code triage with PyTorch and OpenEnv validation."
  readme = "README.md"
  requires-python = ">=3.10"
  dependencies = [
  "fastapi>=0.111.0",
  "gradio>=5.26.0",
  "openai>=1.76.0",
  "openenv-core[core]>=0.2.2",
  "streamlit>=1.44.0",
@@ -33,22 +35,7 @@ pythonpath = ["."]

  [tool.setuptools]
  include-package-data = true
- packages = [
- "python_env",
- "python_env.server",
- "python_env.tasks",
- "python_env.graders",
- "python_env.api",
- "python_env.app",
- "python_env.app.agents",
- "python_env.app.env",
- "python_env.app.models",
- "python_env.app.services",
- "python_env.app.utils",
- "python_env.analyzers",
- "python_env.models",
- "python_env.schemas",
- "python_env.services",
- "python_env.utils",
- ]
- package-dir = { "python_env" = ".", "python_env.server" = "server", "python_env.tasks" = "tasks", "python_env.graders" = "graders", "python_env.api" = "api", "python_env.app" = "app", "python_env.app.agents" = "app/agents", "python_env.app.env" = "app/env", "python_env.app.models" = "app/models", "python_env.app.services" = "app/services", "python_env.app.utils" = "app/utils", "python_env.analyzers" = "analyzers", "python_env.models" = "models", "python_env.schemas" = "schemas", "python_env.services" = "services", "python_env.utils" = "utils" }

  description = "TorchReview Copilot: AI-powered Python code triage with PyTorch and OpenEnv validation."
  readme = "README.md"
  requires-python = ">=3.10"
+
  dependencies = [
  "fastapi>=0.111.0",
  "gradio>=5.26.0",
+ "hf-xet>=1.4.3",
  "openai>=1.76.0",
  "openenv-core[core]>=0.2.2",
  "streamlit>=1.44.0",

  [tool.setuptools]
  include-package-data = true
+
+ [tool.setuptools.packages.find]
+ where = ["."]
+ include = ["*"]
schemas/__init__.py CHANGED
@@ -1,13 +1,13 @@
1
- """Public schemas for the multi-domain analysis platform."""
2
-
3
- from .request import AnalyzeCodeRequest
4
- from .response import AnalyzeCodeResponse, AnalysisIssue, DomainAnalysis, ScoreBreakdown, StaticAnalysisSummary
5
-
6
- __all__ = [
7
- "AnalyzeCodeRequest",
8
- "AnalyzeCodeResponse",
9
- "AnalysisIssue",
10
- "DomainAnalysis",
11
- "ScoreBreakdown",
12
- "StaticAnalysisSummary",
13
- ]
 
1
+ """Public schemas for the multi-domain analysis platform."""
2
+
3
+ from .request import AnalyzeCodeRequest
4
+ from .response import AnalyzeCodeResponse, AnalysisIssue, DomainAnalysis, ScoreBreakdown, StaticAnalysisSummary
5
+
6
+ __all__ = [
7
+ "AnalyzeCodeRequest",
8
+ "AnalyzeCodeResponse",
9
+ "AnalysisIssue",
10
+ "DomainAnalysis",
11
+ "ScoreBreakdown",
12
+ "StaticAnalysisSummary",
13
+ ]
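The `__all__` list above controls what `from schemas import *` re-exports. A stand-alone sketch of that contract, using a throwaway module rather than the real `schemas` package:

```python
import types

# Throwaway module mimicking schemas/__init__.py's export contract.
mod = types.ModuleType("schemas_sketch")
mod.AnalyzeCodeRequest = object()
mod._private_helper = object()
mod.__all__ = ["AnalyzeCodeRequest"]

# Emulate the name lookup a star-import performs when __all__ is defined.
exported = {name: getattr(mod, name) for name in mod.__all__}
```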
schemas/request.py CHANGED
@@ -1,51 +1,19 @@
1
- """Request schemas for the AI-powered code review workflow."""
2
 
3
  from __future__ import annotations
4
 
5
  from typing import Literal
6
 
7
- from pydantic import BaseModel, ConfigDict, Field, field_validator
8
 
9
 
10
- DomainHint = Literal["auto", "general", "dsa", "data_science", "ml_dl", "web"]
11
 
12
 
13
  class AnalyzeCodeRequest(BaseModel):
14
- """Validated input payload for Python code review requests."""
15
-
16
- model_config = ConfigDict(str_strip_whitespace=True)
17
-
18
- code: str = Field(..., min_length=1, description="Python source code to analyze.")
19
- context_window: str = Field(
20
- default="",
21
- max_length=4000,
22
- description="Optional repository, pull request, or runtime context.",
23
- )
24
- traceback_text: str = Field(
25
- default="",
26
- max_length=4000,
27
- description="Optional traceback or failing test output.",
28
- )
29
- domain_hint: DomainHint = Field(
30
- default="auto",
31
- description="Optional analysis lens for domain-aware suggestions.",
32
- )
33
- filename: str = Field(default="snippet.py", max_length=255, description="Virtual filename for display.")
34
- enable_suggestions: bool = Field(
35
- default=True,
36
- description="Whether the service should return a prioritized improvement plan.",
37
- )
38
-
39
- @field_validator("code")
40
- @classmethod
41
- def _reject_empty_code(cls, value: str) -> str:
42
- stripped = value.strip()
43
- if not stripped:
44
- raise ValueError("code must not be empty")
45
- return stripped
46
-
47
- @field_validator("filename")
48
- @classmethod
49
- def _normalize_filename(cls, value: str) -> str:
50
- candidate = value.strip() or "snippet.py"
51
- return candidate[:255]
 
1
+ """Request schemas for code analysis endpoints and UI."""
2
 
3
  from __future__ import annotations
4
 
5
  from typing import Literal
6
 
7
+ from pydantic import BaseModel, Field
8
 
9
 
10
+ DomainHint = Literal["auto", "dsa", "data_science", "ml_dl", "web"]
11
 
12
 
13
  class AnalyzeCodeRequest(BaseModel):
14
+ """Validated input payload for multi-domain code analysis."""
15
+
16
+ code: str = Field(..., min_length=1, description="Source code to analyze.")
17
+ context_window: str = Field(default="", max_length=2000, description="Optional repository or task context.")
18
+ traceback_text: str = Field(default="", max_length=2000, description="Optional runtime or test failure output.")
19
+ domain_hint: DomainHint = Field(default="auto", description="Optional domain override when auto detection is not desired.")
 
 
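Note that the new `AnalyzeCodeRequest` keeps `min_length=1` on `code` but drops the `_reject_empty_code` validator, so whitespace-only payloads would now pass Pydantic validation. The removed check is equivalent to this plain-Python helper:

```python
def reject_empty_code(value: str) -> str:
    """Mirror of the removed _reject_empty_code validator:
    strip surrounding whitespace and refuse blank submissions."""
    stripped = value.strip()
    if not stripped:
        raise ValueError("code must not be empty")
    return stripped
```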
schemas/response.py CHANGED
@@ -1,4 +1,4 @@
1
- """Response schemas for the AI-powered code review platform."""
2
 
3
  from __future__ import annotations
4
 
@@ -7,103 +7,67 @@ from typing import Dict, List, Literal
7
  from pydantic import BaseModel, Field
8
 
9
 
 
10
  Severity = Literal["low", "medium", "high"]
11
- IssueCategory = Literal["correctness", "maintainability", "performance", "security", "style"]
12
- QualityLabel = Literal["excellent", "good", "needs_work", "risky"]
13
- DetectedDomain = Literal["general", "dsa", "data_science", "ml_dl", "web"]
14
 
15
 
16
  class AnalysisIssue(BaseModel):
17
  """One detected issue or risk in the code snippet."""
18
 
19
  title: str
20
- category: IssueCategory = "maintainability"
21
  severity: Severity
22
  description: str
23
  line_hint: int | None = None
24
 
25
 
26
  class StaticAnalysisSummary(BaseModel):
27
- """Python-specific static-analysis signals."""
28
 
29
  syntax_valid: bool
30
  syntax_error: str = ""
31
  cyclomatic_complexity: int = Field(..., ge=1)
32
  line_count: int = Field(..., ge=0)
33
- max_nesting_depth: int = Field(..., ge=0)
34
  max_loop_depth: int = Field(..., ge=0)
35
  time_complexity: str = "Unknown"
36
  space_complexity: str = "Unknown"
37
- lint_score: float = Field(..., ge=0.0, le=1.0)
38
- docstring_coverage: float = Field(..., ge=0.0, le=1.0)
39
  detected_imports: List[str] = Field(default_factory=list)
40
  code_smells: List[str] = Field(default_factory=list)
41
- issues: List[AnalysisIssue] = Field(default_factory=list)
42
 
43
 
44
  class DomainAnalysis(BaseModel):
45
- """Domain-aware review signals used for context-specific suggestions."""
46
 
47
- domain: DetectedDomain
48
  domain_score: float = Field(..., ge=0.0, le=1.0)
49
  issues: List[AnalysisIssue] = Field(default_factory=list)
50
  suggestions: List[str] = Field(default_factory=list)
51
  highlights: Dict[str, float | str] = Field(default_factory=dict)
52
 
53
 
54
- class ModelPrediction(BaseModel):
55
- """PyTorch model output derived from pretrained code embeddings."""
56
-
57
- quality_label: QualityLabel
58
- quality_score: float = Field(..., ge=0.0, le=1.0)
59
- maintainability_score: float = Field(..., ge=0.0, le=1.0)
60
- issue_probabilities: Dict[str, float] = Field(default_factory=dict)
61
- notes: List[str] = Field(default_factory=list)
62
-
63
-
64
  class ScoreBreakdown(BaseModel):
65
- """Reward inputs and the final RL-ready scalar reward."""
66
 
67
  ml_score: float = Field(..., ge=0.0, le=1.0)
68
  domain_score: float = Field(..., ge=0.0, le=1.0)
69
  lint_score: float = Field(..., ge=0.0, le=1.0)
70
  complexity_penalty: float = Field(..., ge=0.0, le=1.0)
71
- maintainability_score: float = Field(..., ge=0.0, le=1.0)
72
- security_score: float = Field(..., ge=0.0, le=1.0)
73
- readability_score: float = Field(..., ge=0.0, le=1.0)
74
  quality_signal: float = Field(..., ge=0.0, le=1.0)
75
  error_reduction_signal: float = Field(..., ge=0.0, le=1.0)
76
  completion_signal: float = Field(..., ge=0.0, le=1.0)
77
  reward: float = Field(..., ge=0.0, le=1.0)
78
 
79
 
80
- class SuggestionItem(BaseModel):
81
- """One prioritized improvement suggestion."""
82
-
83
- priority: Literal["P0", "P1", "P2"]
84
- title: str
85
- rationale: str
86
- action: str
87
- category: IssueCategory
88
-
89
-
90
  class AnalyzeCodeResponse(BaseModel):
91
  """Top-level structured output for API and UI consumers."""
92
 
93
- language: Literal["python"] = "python"
94
- detected_domain: DetectedDomain
95
- domain_confidences: Dict[str, float] = Field(default_factory=dict)
96
  score_breakdown: ScoreBreakdown
97
  static_analysis: StaticAnalysisSummary
98
- model_prediction: ModelPrediction
99
  domain_analysis: DomainAnalysis
100
- suggestions: List[SuggestionItem] = Field(default_factory=list)
101
  improvement_plan: List[str] = Field(default_factory=list)
102
- auto_fix_preview: List[str] = Field(default_factory=list)
103
- score_visualization: Dict[str, float] = Field(default_factory=dict)
104
  model_backend: str
105
  model_id: str
106
  summary: str
107
  context_window: str = ""
108
- filename: str = "snippet.py"
109
  analysis_time_ms: float = Field(..., ge=0.0)
 
1
+ """Response schemas for the multi-domain analysis platform."""
2
 
3
  from __future__ import annotations
4
 
 
7
  from pydantic import BaseModel, Field
8
 
9
 
10
+ DomainType = Literal["dsa", "data_science", "ml_dl", "web", "general"]
11
  Severity = Literal["low", "medium", "high"]
 
 
 
12
 
13
 
14
  class AnalysisIssue(BaseModel):
15
  """One detected issue or risk in the code snippet."""
16
 
17
  title: str
 
18
  severity: Severity
19
  description: str
20
  line_hint: int | None = None
21
 
22
 
23
  class StaticAnalysisSummary(BaseModel):
24
+ """Language-agnostic static-analysis signals."""
25
 
26
  syntax_valid: bool
27
  syntax_error: str = ""
28
  cyclomatic_complexity: int = Field(..., ge=1)
29
  line_count: int = Field(..., ge=0)
 
30
  max_loop_depth: int = Field(..., ge=0)
31
  time_complexity: str = "Unknown"
32
  space_complexity: str = "Unknown"
 
 
33
  detected_imports: List[str] = Field(default_factory=list)
34
  code_smells: List[str] = Field(default_factory=list)
 
35
 
36
 
37
  class DomainAnalysis(BaseModel):
38
+ """Domain-specific analysis payload returned by an analyzer."""
39
 
40
+ domain: DomainType
41
  domain_score: float = Field(..., ge=0.0, le=1.0)
42
  issues: List[AnalysisIssue] = Field(default_factory=list)
43
  suggestions: List[str] = Field(default_factory=list)
44
  highlights: Dict[str, float | str] = Field(default_factory=dict)
45
 
46
 
 
 
 
 
 
 
 
 
 
 
47
  class ScoreBreakdown(BaseModel):
48
+ """Reward inputs and final normalized score."""
49
 
50
  ml_score: float = Field(..., ge=0.0, le=1.0)
51
  domain_score: float = Field(..., ge=0.0, le=1.0)
52
  lint_score: float = Field(..., ge=0.0, le=1.0)
53
  complexity_penalty: float = Field(..., ge=0.0, le=1.0)
 
 
 
54
  quality_signal: float = Field(..., ge=0.0, le=1.0)
55
  error_reduction_signal: float = Field(..., ge=0.0, le=1.0)
56
  completion_signal: float = Field(..., ge=0.0, le=1.0)
57
  reward: float = Field(..., ge=0.0, le=1.0)
58
 
59
 
 
 
 
 
 
 
 
 
 
 
60
  class AnalyzeCodeResponse(BaseModel):
61
  """Top-level structured output for API and UI consumers."""
62
 
63
+ detected_domain: DomainType
64
+ domain_confidences: Dict[str, float]
 
65
  score_breakdown: ScoreBreakdown
66
  static_analysis: StaticAnalysisSummary
 
67
  domain_analysis: DomainAnalysis
 
68
  improvement_plan: List[str] = Field(default_factory=list)
 
 
69
  model_backend: str
70
  model_id: str
71
  summary: str
72
  context_window: str = ""
 
73
  analysis_time_ms: float = Field(..., ge=0.0)
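Every score field in `ScoreBreakdown` is constrained to `[0.0, 1.0]` via `Field(ge=0.0, le=1.0)`. A stand-alone sketch in that spirit; the `blend` weighting here is hypothetical, since the real combination lives in the service layer and is not shown in this diff:

```python
def clamp_unit(value: float) -> float:
    """Clamp a raw signal into the [0.0, 1.0] band the schema constraints enforce."""
    return max(0.0, min(1.0, value))

def blend(ml_score: float, domain_score: float, lint_score: float,
          complexity_penalty: float) -> float:
    """Hypothetical equal-weight blend of ScoreBreakdown inputs,
    with the complexity penalty subtracted before clamping."""
    base = (ml_score + domain_score + lint_score) / 3.0
    return clamp_unit(base - complexity_penalty)
```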
server/app.py CHANGED
@@ -53,10 +53,16 @@ def build_application():
53
  served_app = api_app
54
  if gr is not None and _gradio_enabled():
55
  try:
56
- from .demo import build_demo
57
  except ImportError:
58
- from server.demo import build_demo
59
- served_app = gr.mount_gradio_app(api_app, build_demo(), path="/")
 
 
 
 
 
 
60
 
61
  wrapper_app = FastAPI(title="python_code_review_env", version="1.0.0")
62
 
@@ -74,7 +80,7 @@ app = build_application()
74
  def main(host: str = "0.0.0.0", port: int = 8000) -> None:
75
  import uvicorn
76
 
77
- uvicorn.run(app, host=host, port=port)
78
 
79
 
80
  if __name__ == "__main__":
 
53
  served_app = api_app
54
  if gr is not None and _gradio_enabled():
55
  try:
56
+ from .demo import CSS, build_demo
57
  except ImportError:
58
+ from server.demo import CSS, build_demo
59
+ served_app = gr.mount_gradio_app(
60
+ api_app,
61
+ build_demo(),
62
+ path="/",
63
+ theme=gr.themes.Soft(primary_hue="orange", secondary_hue="amber"),
64
+ css=CSS,
65
+ )
66
 
67
  wrapper_app = FastAPI(title="python_code_review_env", version="1.0.0")
68
 
 
80
  def main(host: str = "0.0.0.0", port: int = 8000) -> None:
81
  import uvicorn
82
 
83
+ uvicorn.run(app, host=host, port=port, access_log=False)
84
 
85
 
86
  if __name__ == "__main__":
server/demo.py CHANGED
@@ -347,7 +347,7 @@ def build_demo() -> gr.Blocks:
347
  examples = get_default_engine().example_map()
348
  first_example = next(iter(examples.values()))
349
 
350
- with gr.Blocks(theme=gr.themes.Soft(primary_hue="orange", secondary_hue="amber"), css=CSS, title="TorchReview Copilot") as demo:
351
  gr.HTML(
352
  """
353
  <div class="hero-card">
 
347
  examples = get_default_engine().example_map()
348
  first_example = next(iter(examples.values()))
349
 
350
+ with gr.Blocks(title="TorchReview Copilot") as demo:
351
  gr.HTML(
352
  """
353
  <div class="hero-card">
server/env.py CHANGED
@@ -8,27 +8,27 @@ from uuid import uuid4
8
  from openenv.core.env_server.interfaces import Environment
9
  from openenv.core.env_server.types import EnvironmentMetadata
10
 
11
- try:
12
- from ..graders import grade_task
13
- from ..graders.shared import component_score, final_score_pipeline, safe_ratio, safe_score
14
- from ..models import (
15
- HistoryEntry,
16
- PythonCodeReviewAction,
17
- PythonCodeReviewObservation,
18
- PythonCodeReviewState,
19
- RewardDetails,
20
  TaskGrade,
21
  )
22
  from ..tasks import ReviewTask, list_tasks, select_task
23
- except ImportError:
24
- from graders import grade_task
25
- from graders.shared import component_score, final_score_pipeline, safe_ratio, safe_score
26
- from models import (
27
- HistoryEntry,
28
- PythonCodeReviewAction,
29
- PythonCodeReviewObservation,
30
- PythonCodeReviewState,
31
- RewardDetails,
32
  TaskGrade,
33
  )
34
  from tasks import ReviewTask, list_tasks, select_task
@@ -43,10 +43,10 @@ def _empty_grade() -> TaskGrade:
43
  quality_score=component_score(0.01),
44
  runtime_score=component_score(0.01),
45
  )
46
-
47
-
48
- def _reward_value(value: float) -> float:
49
- return final_score_pipeline(value)
50
 
51
 
52
  class PythonCodeReviewEnvironment(
@@ -56,17 +56,17 @@ class PythonCodeReviewEnvironment(
56
 
57
  SUPPORTS_CONCURRENT_SESSIONS: bool = True
58
 
59
- def __init__(self, verbose: bool = False, **_: Any) -> None:
60
- super().__init__()
61
- self.verbose = verbose
62
- self._task: ReviewTask = list_tasks()[0]
63
- self._current_code: str = self._task.starter_code
64
- self._history: list[HistoryEntry] = []
65
- self._last_reward = RewardDetails(value=0.1, reason="Environment initialized.")
66
- self._last_action_error: str | None = None
67
- self._current_grade = _empty_grade()
68
- self._state = PythonCodeReviewState(episode_id=str(uuid4()), step_count=0)
69
- self.reset()
70
 
71
  def reset(
72
  self,
@@ -74,17 +74,17 @@ class PythonCodeReviewEnvironment(
74
  episode_id: Optional[str] = None,
75
  **kwargs: Any,
76
  ) -> PythonCodeReviewObservation:
77
- task_id = kwargs.get("task_id")
78
- self._task = select_task(seed=seed, task_id=task_id)
79
- self._current_code = self._task.starter_code
80
- self._history = []
81
- self._last_action_error = None
82
- self._last_reward = RewardDetails(value=0.1, reason="Environment reset.")
83
- self._current_grade, self._last_action_error = self._safe_grade_task(
84
- self._task,
85
- self._current_code,
86
- include_hidden=False,
87
- )
88
 
89
  self._state = PythonCodeReviewState(
90
  episode_id=episode_id or str(uuid4()),
@@ -143,22 +143,22 @@ class PythonCodeReviewEnvironment(
143
  )
144
  return observation, reward.value, observation.done, {"task_id": observation.task_id, "score": observation.score}
145
 
146
- previous_grade = self._current_grade
147
- status = ""
148
- invalid_action = False
149
- code_changed = False
150
- use_hidden_grading = False
151
- action_error: str | None = None
152
-
153
- if action.action_type == "edit_code":
154
- if not action.code or not action.code.strip():
155
- invalid_action = True
156
- status = "edit_code requires a non-empty code payload."
157
- action_error = status
158
- else:
159
- code_changed = action.code != self._current_code
160
- self._current_code = action.code
161
- status = "Updated working copy from agent patch."
162
  elif action.action_type == "submit_solution":
163
  if action.code is not None and action.code.strip():
164
  code_changed = action.code != self._current_code
@@ -169,30 +169,30 @@ class PythonCodeReviewEnvironment(
169
  status = "Executed public validation suite."
170
  elif action.action_type == "analyze_code":
171
  status = "Generated static review summary."
172
- else: # pragma: no cover
173
- invalid_action = True
174
- status = f"Unsupported action_type: {action.action_type}"
175
- action_error = status
176
 
177
  self._state.step_count += 1
178
 
179
- if invalid_action:
180
- current_grade = previous_grade
181
- else:
182
- current_grade, grade_error = self._safe_grade_task(
183
- self._task,
184
- self._current_code,
185
- include_hidden=use_hidden_grading,
186
- timeout_s=timeout_s or 3.0,
187
- )
188
- if grade_error:
189
- action_error = grade_error
190
- status = f"{status} Grading fallback used."
191
- if action.action_type == "analyze_code":
192
- status = self._analysis_status(current_grade)
193
- elif action.action_type == "run_tests":
194
- status = self._run_tests_status(current_grade, use_hidden_grading)
195
- elif action.action_type == "submit_solution":
196
  status = self._submission_status(current_grade)
197
 
198
  done = use_hidden_grading or self._state.step_count >= self._task.max_steps
@@ -217,11 +217,11 @@ class PythonCodeReviewEnvironment(
217
  reward=reward_details.value,
218
  )
219
  )
220
-
221
- self._current_grade = current_grade
222
- self._last_reward = reward_details
223
- self._last_action_error = action_error
224
- attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)
225
 
226
  self._state.task_id = self._task.task_id
227
  self._state.difficulty = self._task.difficulty
@@ -234,19 +234,19 @@ class PythonCodeReviewEnvironment(
234
  self._state.score = current_grade.score
235
  self._state.done = done
236
 
237
- observation = self._build_observation(
238
- grade=current_grade,
239
- status=status,
240
- reward_details=reward_details,
241
- )
242
- return observation, reward_details.value, observation.done, {
243
- "task_id": observation.task_id,
244
- "score": observation.score,
245
- "done": observation.done,
246
- "attempts_remaining": observation.attempts_remaining,
247
- "last_action_status": observation.last_action_status,
248
- "last_action_error": observation.last_action_error,
249
- }
250
 
251
  @property
252
  def state(self) -> PythonCodeReviewState:
@@ -268,102 +268,102 @@ class PythonCodeReviewEnvironment(
268
  current_code=self._current_code,
269
  errors=self._format_errors(grade),
270
  test_results=self._format_test_results(grade),
271
- visible_tests=list(self._task.visible_tests),
272
- history=list(self._history),
273
- attempts_remaining=self._state.attempts_remaining,
274
- last_action_status=status,
275
- last_action_error=self._last_action_error,
276
- score=grade.score,
277
- reward=reward_details.value,
278
- done=self._state.done,
279
- reward_details=reward_details,
280
- metadata={
281
- "benchmark": "python_code_review_env",
282
- "goal": self._task.goal,
283
- "repo_summary": self._task.repo_summary,
284
- "changed_files": self._task.changed_files,
285
- "available_files": self._task.available_files,
286
- "grade_details": grade.details,
287
  },
288
  )
289
 
290
- def _compute_reward(
291
- self,
292
- *,
293
- previous_grade: TaskGrade,
294
  current_grade: TaskGrade,
295
  action: PythonCodeReviewAction,
296
  invalid_action: bool,
297
  timed_out: bool,
298
  code_changed: bool,
299
  final_submission: bool,
300
- ) -> RewardDetails:
301
- prev_score = previous_grade.score
302
- curr_score = current_grade.score
303
- prev_syntax = previous_grade.syntax_score
304
- curr_syntax = current_grade.syntax_score
305
- prev_quality = previous_grade.quality_score
306
- curr_quality = current_grade.quality_score
307
- prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
308
- curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)
309
- prev_runtime = previous_grade.runtime_score
310
- curr_runtime = current_grade.runtime_score
311
- prev_compile_health = 0.1 if str(previous_grade.details.get("compile_error", "")).strip() else 0.95
312
- curr_compile_health = 0.1 if str(current_grade.details.get("compile_error", "")).strip() else 0.95
313
-
314
- syntax_reward = max(curr_syntax - prev_syntax, 0.0) * 0.18
315
- test_reward = max(curr_rate - prev_rate, 0.0) * 0.22
316
- progress_delta = max(curr_score - prev_score, 0.0) * 0.24
317
- quality_bonus = max(curr_quality - prev_quality, 0.0) * 0.12
318
- runtime_bonus = max(curr_runtime - prev_runtime, 0.0) * 0.10
319
- error_reduction_bonus = max(curr_compile_health - prev_compile_health, 0.0) * 0.14
320
- completion_bonus = (0.04 + 0.10 * curr_rate) * float(final_submission)
321
- correctness_bonus = max(curr_score - 0.5, 0.0) * 0.12 * float(final_submission)
322
-
323
- invalid_action_penalty = (0.04 + (0.08 * (1.0 - prev_score))) if invalid_action else 0.0
324
- timeout_penalty = (0.05 + (0.06 * max(curr_runtime, prev_runtime))) if timed_out else 0.0
325
- regression_penalty = max(prev_score - curr_score, 0.0) * 0.24
326
- stagnation_penalty = (0.02 + (0.04 * prev_score)) if action.action_type == "edit_code" and not code_changed else 0.0
327
-
328
- raw_value = (
329
- 2.0 * (curr_score - 0.5)
330
- + 1.2 * (curr_rate - prev_rate)
331
- + 0.8 * (curr_quality - prev_quality)
332
- + 0.7 * (curr_runtime - prev_runtime)
333
- + 0.9 * (curr_syntax - prev_syntax)
334
- + 0.6 * (curr_compile_health - prev_compile_health)
335
- + syntax_reward
336
- + test_reward
337
- + progress_delta
338
- + quality_bonus
339
- + runtime_bonus
340
- + error_reduction_bonus
341
- + completion_bonus
342
- + correctness_bonus
343
- - invalid_action_penalty
344
- - timeout_penalty
345
- - regression_penalty
346
- - stagnation_penalty
347
- )
348
- value = _reward_value(raw_value)
349
-
350
- reason_parts = []
351
- if syntax_reward:
352
- reason_parts.append("syntax fixed")
353
  if test_reward:
354
  reason_parts.append("public test progress")
355
  if progress_delta:
356
  reason_parts.append("overall score improved")
357
- if quality_bonus:
358
- reason_parts.append("code quality improved")
359
- if error_reduction_bonus:
360
- reason_parts.append("errors removed")
361
- if completion_bonus:
362
- reason_parts.append("task completed")
363
- if runtime_bonus:
364
- reason_parts.append("runtime improved")
365
- if correctness_bonus:
366
- reason_parts.append("full correctness bonus")
367
  if invalid_action_penalty:
368
  reason_parts.append("invalid action penalty")
369
  if timeout_penalty:
@@ -372,53 +372,53 @@ class PythonCodeReviewEnvironment(
372
  reason_parts.append("regression penalty")
373
  if stagnation_penalty:
374
  reason_parts.append("unchanged patch penalty")
375
- if not reason_parts:
376
- reason_parts.append("no meaningful state change")
377
-
378
- return RewardDetails(
379
- value=safe_score(value),
380
- syntax_reward=round(syntax_reward, 6),
381
- test_reward=round(test_reward, 6),
382
- correctness_bonus=round(correctness_bonus, 6),
383
- quality_bonus=round(quality_bonus, 6),
384
- error_reduction_bonus=round(error_reduction_bonus, 6),
385
- completion_bonus=round(completion_bonus, 6),
386
- runtime_bonus=round(runtime_bonus, 6),
387
- progress_delta=round(progress_delta, 6),
388
- invalid_action_penalty=round(invalid_action_penalty, 6),
389
- timeout_penalty=round(timeout_penalty, 6),
390
- regression_penalty=round(regression_penalty, 6),
391
- stagnation_penalty=round(stagnation_penalty, 6),
392
- reason=", ".join(reason_parts),
393
- prev_score=safe_score(prev_score),
394
- curr_score=safe_score(curr_score),
395
- code_changed=code_changed,
396
- )
397
-
398
- def _format_errors(self, grade: TaskGrade) -> str:
399
- compile_error = str(grade.details.get("compile_error", "")).strip()
400
- if compile_error:
401
- return compile_error
402
- return "Code parses successfully."
403
-
404
- def _safe_grade_task(
405
- self,
406
- task: ReviewTask,
407
- code: str,
408
- *,
409
- include_hidden: bool,
410
- timeout_s: float = 3.0,
411
- ) -> tuple[TaskGrade, str | None]:
412
- try:
413
- return (
414
- grade_task(task, code, include_hidden=include_hidden, timeout_s=timeout_s),
415
- None,
416
- )
417
- except Exception as exc: # pragma: no cover
418
- return _empty_grade(), f"{type(exc).__name__}: {exc}"
419
-
420
- def _format_test_results(self, grade: TaskGrade) -> str:
421
- parts = [grade.details.get("test_summary", "No test feedback available.")]
422
  benchmark = grade.details.get("benchmark")
423
  if isinstance(benchmark, dict):
424
  parts.append(
 
8
  from openenv.core.env_server.interfaces import Environment
9
  from openenv.core.env_server.types import EnvironmentMetadata
10
 
11
+ try:
12
+ from ..graders import grade_task
13
+ from ..graders.shared import component_score, final_score_pipeline, safe_ratio, safe_score
14
+ from ..models import (
15
+ HistoryEntry,
16
+ PythonCodeReviewAction,
17
+ PythonCodeReviewObservation,
18
+ PythonCodeReviewState,
19
+ RewardDetails,
20
  TaskGrade,
21
  )
22
  from ..tasks import ReviewTask, list_tasks, select_task
23
+ except ImportError:
24
+ from graders import grade_task
25
+ from graders.shared import component_score, final_score_pipeline, safe_ratio, safe_score
26
+ from models import (
27
+ HistoryEntry,
28
+ PythonCodeReviewAction,
29
+ PythonCodeReviewObservation,
30
+ PythonCodeReviewState,
31
+ RewardDetails,
32
  TaskGrade,
33
  )
34
  from tasks import ReviewTask, list_tasks, select_task
 
43
  quality_score=component_score(0.01),
44
  runtime_score=component_score(0.01),
45
  )
46
+
47
+
48
+ def _reward_value(value: float) -> float:
49
+ return final_score_pipeline(value)
50
 
51
 
52
  class PythonCodeReviewEnvironment(
 
56
 
57
  SUPPORTS_CONCURRENT_SESSIONS: bool = True
58
 
59
+ def __init__(self, verbose: bool = False, **_: Any) -> None:
60
+ super().__init__()
61
+ self.verbose = verbose
62
+ self._task: ReviewTask = list_tasks()[0]
63
+ self._current_code: str = self._task.starter_code
64
+ self._history: list[HistoryEntry] = []
65
+ self._last_reward = RewardDetails(value=0.1, reason="Environment initialized.")
66
+ self._last_action_error: str | None = None
67
+ self._current_grade = _empty_grade()
68
+ self._state = PythonCodeReviewState(episode_id=str(uuid4()), step_count=0)
69
+ self.reset()
70
 
71
  def reset(
72
  self,
 
74
  episode_id: Optional[str] = None,
75
  **kwargs: Any,
76
  ) -> PythonCodeReviewObservation:
77
+ task_id = kwargs.get("task_id")
78
+ self._task = select_task(seed=seed, task_id=task_id)
79
+ self._current_code = self._task.starter_code
80
+ self._history = []
81
+ self._last_action_error = None
82
+ self._last_reward = RewardDetails(value=0.1, reason="Environment reset.")
83
+ self._current_grade, self._last_action_error = self._safe_grade_task(
84
+ self._task,
85
+ self._current_code,
86
+ include_hidden=False,
87
+ )
88
 
89
  self._state = PythonCodeReviewState(
90
  episode_id=episode_id or str(uuid4()),
 
143
  )
144
  return observation, reward.value, observation.done, {"task_id": observation.task_id, "score": observation.score}
145
 
146
+ previous_grade = self._current_grade
147
+ status = ""
148
+ invalid_action = False
149
+ code_changed = False
150
+ use_hidden_grading = False
151
+ action_error: str | None = None
152
+
153
+ if action.action_type == "edit_code":
154
+ if not action.code or not action.code.strip():
155
+ invalid_action = True
156
+ status = "edit_code requires a non-empty code payload."
157
+ action_error = status
158
+ else:
159
+ code_changed = action.code != self._current_code
160
+ self._current_code = action.code
161
+ status = "Updated working copy from agent patch."
162
  elif action.action_type == "submit_solution":
163
  if action.code is not None and action.code.strip():
164
  code_changed = action.code != self._current_code
 
169
  status = "Executed public validation suite."
170
  elif action.action_type == "analyze_code":
171
  status = "Generated static review summary."
172
+ else: # pragma: no cover
173
+ invalid_action = True
174
+ status = f"Unsupported action_type: {action.action_type}"
175
+ action_error = status
176
 
177
  self._state.step_count += 1
178
 
179
+ if invalid_action:
180
+ current_grade = previous_grade
181
+ else:
182
+ current_grade, grade_error = self._safe_grade_task(
183
+ self._task,
184
+ self._current_code,
185
+ include_hidden=use_hidden_grading,
186
+ timeout_s=timeout_s or 3.0,
187
+ )
188
+ if grade_error:
189
+ action_error = grade_error
190
+ status = f"{status} Grading fallback used."
191
+ if action.action_type == "analyze_code":
192
+ status = self._analysis_status(current_grade)
193
+ elif action.action_type == "run_tests":
194
+ status = self._run_tests_status(current_grade, use_hidden_grading)
195
+ elif action.action_type == "submit_solution":
196
  status = self._submission_status(current_grade)
197
 
198
  done = use_hidden_grading or self._state.step_count >= self._task.max_steps
 
217
  reward=reward_details.value,
218
  )
219
  )
220
+
221
+ self._current_grade = current_grade
222
+ self._last_reward = reward_details
223
+ self._last_action_error = action_error
224
+ attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)
225
 
226
  self._state.task_id = self._task.task_id
227
  self._state.difficulty = self._task.difficulty
 
234
  self._state.score = current_grade.score
235
  self._state.done = done
236
 
237
+ observation = self._build_observation(
238
+ grade=current_grade,
239
+ status=status,
240
+ reward_details=reward_details,
241
+ )
242
+ return observation, reward_details.value, observation.done, {
243
+ "task_id": observation.task_id,
244
+ "score": observation.score,
245
+ "done": observation.done,
246
+ "attempts_remaining": observation.attempts_remaining,
247
+ "last_action_status": observation.last_action_status,
248
+ "last_action_error": observation.last_action_error,
249
+ }
250
 
251
  @property
252
  def state(self) -> PythonCodeReviewState:
 
268
  current_code=self._current_code,
269
  errors=self._format_errors(grade),
270
  test_results=self._format_test_results(grade),
271
+ visible_tests=list(self._task.visible_tests),
272
+ history=list(self._history),
273
+ attempts_remaining=self._state.attempts_remaining,
274
+ last_action_status=status,
275
+ last_action_error=self._last_action_error,
276
+ score=grade.score,
277
+ reward=reward_details.value,
278
+ done=self._state.done,
279
+ reward_details=reward_details,
280
+ metadata={
281
+ "benchmark": "python_code_review_env",
282
+ "goal": self._task.goal,
283
+ "repo_summary": self._task.repo_summary,
284
+ "changed_files": self._task.changed_files,
285
+ "available_files": self._task.available_files,
286
+ "grade_details": grade.details,
287
  },
288
  )
289
 
290
+ def _compute_reward(
291
+ self,
292
+ *,
293
+ previous_grade: TaskGrade,
294
  current_grade: TaskGrade,
295
  action: PythonCodeReviewAction,
296
  invalid_action: bool,
297
  timed_out: bool,
298
  code_changed: bool,
299
  final_submission: bool,
300
+ ) -> RewardDetails:
301
+ prev_score = previous_grade.score
302
+ curr_score = current_grade.score
303
+ prev_syntax = previous_grade.syntax_score
304
+        curr_syntax = current_grade.syntax_score
+        prev_quality = previous_grade.quality_score
+        curr_quality = current_grade.quality_score
+        prev_rate = safe_ratio(previous_grade.tests_passed, previous_grade.tests_total)
+        curr_rate = safe_ratio(current_grade.tests_passed, current_grade.tests_total)
+        prev_runtime = previous_grade.runtime_score
+        curr_runtime = current_grade.runtime_score
+        prev_compile_health = 0.1 if str(previous_grade.details.get("compile_error", "")).strip() else 0.95
+        curr_compile_health = 0.1 if str(current_grade.details.get("compile_error", "")).strip() else 0.95
+
+        syntax_reward = max(curr_syntax - prev_syntax, 0.0) * 0.18
+        test_reward = max(curr_rate - prev_rate, 0.0) * 0.22
+        progress_delta = max(curr_score - prev_score, 0.0) * 0.24
+        quality_bonus = max(curr_quality - prev_quality, 0.0) * 0.12
+        runtime_bonus = max(curr_runtime - prev_runtime, 0.0) * 0.10
+        error_reduction_bonus = max(curr_compile_health - prev_compile_health, 0.0) * 0.14
+        completion_bonus = (0.04 + 0.10 * curr_rate) * float(final_submission)
+        correctness_bonus = max(curr_score - 0.5, 0.0) * 0.12 * float(final_submission)
+
+        invalid_action_penalty = (0.04 + (0.08 * (1.0 - prev_score))) if invalid_action else 0.0
+        timeout_penalty = (0.05 + (0.06 * max(curr_runtime, prev_runtime))) if timed_out else 0.0
+        regression_penalty = max(prev_score - curr_score, 0.0) * 0.24
+        stagnation_penalty = (0.02 + (0.04 * prev_score)) if action.action_type == "edit_code" and not code_changed else 0.0
+
+        raw_value = (
+            2.0 * (curr_score - 0.5)
+            + 1.2 * (curr_rate - prev_rate)
+            + 0.8 * (curr_quality - prev_quality)
+            + 0.7 * (curr_runtime - prev_runtime)
+            + 0.9 * (curr_syntax - prev_syntax)
+            + 0.6 * (curr_compile_health - prev_compile_health)
+            + syntax_reward
+            + test_reward
+            + progress_delta
+            + quality_bonus
+            + runtime_bonus
+            + error_reduction_bonus
+            + completion_bonus
+            + correctness_bonus
+            - invalid_action_penalty
+            - timeout_penalty
+            - regression_penalty
+            - stagnation_penalty
+        )
+        value = _reward_value(raw_value)
+
+        reason_parts = []
+        if syntax_reward:
+            reason_parts.append("syntax fixed")
         if test_reward:
             reason_parts.append("public test progress")
         if progress_delta:
             reason_parts.append("overall score improved")
+        if quality_bonus:
+            reason_parts.append("code quality improved")
+        if error_reduction_bonus:
+            reason_parts.append("errors removed")
+        if completion_bonus:
+            reason_parts.append("task completed")
+        if runtime_bonus:
+            reason_parts.append("runtime improved")
+        if correctness_bonus:
+            reason_parts.append("full correctness bonus")
         if invalid_action_penalty:
             reason_parts.append("invalid action penalty")
         if timeout_penalty:
             reason_parts.append("timeout penalty")
         if regression_penalty:
             reason_parts.append("regression penalty")
         if stagnation_penalty:
             reason_parts.append("unchanged patch penalty")
+        if not reason_parts:
+            reason_parts.append("no meaningful state change")
+
+        return RewardDetails(
+            value=safe_score(value),
+            syntax_reward=round(syntax_reward, 6),
+            test_reward=round(test_reward, 6),
+            correctness_bonus=round(correctness_bonus, 6),
+            quality_bonus=round(quality_bonus, 6),
+            error_reduction_bonus=round(error_reduction_bonus, 6),
+            completion_bonus=round(completion_bonus, 6),
+            runtime_bonus=round(runtime_bonus, 6),
+            progress_delta=round(progress_delta, 6),
+            invalid_action_penalty=round(invalid_action_penalty, 6),
+            timeout_penalty=round(timeout_penalty, 6),
+            regression_penalty=round(regression_penalty, 6),
+            stagnation_penalty=round(stagnation_penalty, 6),
+            reason=", ".join(reason_parts),
+            prev_score=safe_score(prev_score),
+            curr_score=safe_score(curr_score),
+            code_changed=code_changed,
+        )
+
+    def _format_errors(self, grade: TaskGrade) -> str:
+        compile_error = str(grade.details.get("compile_error", "")).strip()
+        if compile_error:
+            return compile_error
+        return "Code parses successfully."
+
+    def _safe_grade_task(
+        self,
+        task: ReviewTask,
+        code: str,
+        *,
+        include_hidden: bool,
+        timeout_s: float = 3.0,
+    ) -> tuple[TaskGrade, str | None]:
+        try:
+            return (
+                grade_task(task, code, include_hidden=include_hidden, timeout_s=timeout_s),
+                None,
+            )
+        except Exception as exc:  # pragma: no cover
+            return _empty_grade(), f"{type(exc).__name__}: {exc}"
+
+    def _format_test_results(self, grade: TaskGrade) -> str:
+        parts = [grade.details.get("test_summary", "No test feedback available.")]
         benchmark = grade.details.get("benchmark")
         if isinstance(benchmark, dict):
             parts.append(
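Taken together, the shaped terms above reward only non-negative deltas between the previous and current grade. A minimal standalone sketch of that pattern, using made-up before/after scores and the same illustrative weights (0.18 syntax, 0.22 tests, 0.12 quality); `shaped_bonus` is a hypothetical helper, not part of the diff:

```python
def shaped_bonus(prev: float, curr: float, weight: float) -> float:
    """Reward only non-negative improvement between steps, scaled by a weight."""
    return max(curr - prev, 0.0) * weight

# Toy before/after values (invented for illustration).
syntax_reward = shaped_bonus(0.40, 0.90, 0.18)   # syntax improved by 0.5
test_reward = shaped_bonus(0.25, 0.75, 0.22)     # pass rate 1/4 -> 3/4
quality_bonus = shaped_bonus(0.60, 0.55, 0.12)   # quality regressed -> no bonus

print(syntax_reward, test_reward, quality_bonus)
```

Clamping each delta at zero keeps every bonus term non-negative, so regressions are punished only through the dedicated penalty terms.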
server/requirements.runtime.txt ADDED
@@ -0,0 +1,4 @@
+openenv-core>=0.2.2
+fastapi>=0.111.0
+openai>=1.76.0
+uvicorn>=0.30.0
server/requirements.txt CHANGED
@@ -1,6 +1,8 @@
 openenv-core[core]>=0.2.2
 fastapi>=0.111.0
+gradio>=5.26.0
 uvicorn>=0.30.0
 openai>=1.76.0
+streamlit>=1.44.0
 torch>=2.2.0
 transformers>=4.45.0
services/__init__.py CHANGED
@@ -1,7 +1,7 @@
-"""Service layer for orchestrating analysis, suggestions, and rewards."""
-
-from .analysis_service import AnalysisService
-from .reward_service import RewardService
-from .suggestion_service import SuggestionService
-
-__all__ = ["AnalysisService", "RewardService", "SuggestionService"]
+"""Service layer for orchestrating analysis, suggestions, and rewards."""
+
+from .analysis_service import AnalysisService
+from .reward_service import RewardService
+from .suggestion_service import SuggestionService
+
+__all__ = ["AnalysisService", "RewardService", "SuggestionService"]
services/analysis_service.py CHANGED
@@ -1,86 +1,33 @@
-"""Orchestration layer for AI-powered Python code review."""
 
 from __future__ import annotations
 
 import time
-from typing import Any, Callable
 
 from analyzers import analyze_data_science_code, analyze_dsa_code, analyze_ml_code, analyze_web_code
 from models import PyTorchCodeAnalyzerModel
 from schemas.request import AnalyzeCodeRequest
-from schemas.response import (
-    AnalysisIssue,
-    AnalyzeCodeResponse,
-    DomainAnalysis,
-    ModelPrediction,
-    StaticAnalysisSummary,
-)
 from services.reward_service import RewardService
 from services.suggestion_service import SuggestionService
 from utils import estimate_complexity, parse_code_structure
 
 
-def _clamp_unit(value: float) -> float:
-    return max(0.0, min(1.0, float(value)))
-
-
-def _lint_score(parsed: dict[str, Any]) -> float:
     """Convert structural smells into a normalized lint-style score."""
 
     score = 1.0
     if not parsed.get("syntax_valid", True):
         score -= 0.45
-    score -= min(int(parsed.get("long_lines", 0) or 0), 5) * 0.03
     if parsed.get("tabs_used"):
         score -= 0.1
     if parsed.get("trailing_whitespace_lines"):
         score -= 0.05
     if parsed.get("docstring_ratio", 0.0) == 0.0 and parsed.get("function_names"):
         score -= 0.08
-    return round(_clamp_unit(score), 4)
-
-
-def _static_issues(parsed: dict[str, Any], complexity: dict[str, Any]) -> list[AnalysisIssue]:
-    """Turn parser and complexity heuristics into review issues."""
-
-    issues: list[AnalysisIssue] = []
-    if not parsed.get("syntax_valid", True):
-        issues.append(
-            AnalysisIssue(
-                title="Syntax error blocks execution",
-                category="correctness",
-                severity="high",
-                description=str(parsed.get("syntax_error", "Python failed to parse the snippet.")),
-            )
-        )
-    if int(parsed.get("max_loop_depth", 0) or 0) >= 2:
-        issues.append(
-            AnalysisIssue(
-                title="Nested loops increase runtime risk",
-                category="performance",
-                severity="medium",
-                description="The current control flow suggests a brute-force path that may not scale on larger inputs.",
-            )
-        )
-    if int(complexity.get("cyclomatic_complexity", 1) or 1) >= 7:
-        issues.append(
-            AnalysisIssue(
-                title="Cyclomatic complexity is elevated",
-                category="maintainability",
-                severity="medium",
-                description="Branch-heavy code is harder to review, test, and optimize confidently.",
-            )
-        )
-    if parsed.get("docstring_ratio", 0.0) == 0.0 and parsed.get("function_names"):
-        issues.append(
-            AnalysisIssue(
-                title="Missing public-function documentation",
-                category="style",
-                severity="low",
-                description="Short docstrings would make the expected contract and edge cases easier to review.",
-            )
-        )
-    return issues
 
 
 class AnalysisService:
@@ -90,7 +37,7 @@ class AnalysisService:
         self._model: PyTorchCodeAnalyzerModel | None = None
         self.reward_service = RewardService()
         self.suggestion_service = SuggestionService()
-        self._analyzers: dict[str, Callable[[str, dict[str, Any], dict[str, Any]], DomainAnalysis]] = {
             "dsa": analyze_dsa_code,
             "data_science": analyze_data_science_code,
             "ml_dl": analyze_ml_code,
@@ -103,156 +50,90 @@ class AnalysisService:
             self._model = PyTorchCodeAnalyzerModel()
         return self._model
 
-    def _heuristic_domain_scores(self, parsed: dict[str, Any], code: str) -> dict[str, float]:
         """Derive domain priors from imports and syntax-level hints."""
 
         scores = {
-            "dsa": 0.22
-            + (0.18 if parsed.get("uses_recursion") else 0.0)
-            + (0.18 if int(parsed.get("max_loop_depth", 0) or 0) >= 1 else 0.0),
-            "data_science": 0.22 + (0.38 if parsed.get("uses_pandas") or parsed.get("uses_numpy") else 0.0),
-            "ml_dl": 0.22 + (0.38 if parsed.get("uses_torch") or parsed.get("uses_sklearn") else 0.0),
-            "web": 0.22
-            + (0.38 if parsed.get("uses_fastapi") or parsed.get("uses_flask") else 0.0)
-            + (0.12 if parsed.get("route_decorators") else 0.0),
-            "general": 0.26,
         }
-        lowered = code.lower()
-        if "fastapi" in lowered:
-            scores["web"] += 0.12
-        if "pandas" in lowered or "numpy" in lowered:
             scores["data_science"] += 0.1
-        if "torch" in lowered or "sklearn" in lowered:
             scores["ml_dl"] += 0.1
         if "while" in code or "for" in code:
-            scores["dsa"] += 0.06
         return {key: round(min(value, 0.99), 4) for key, value in scores.items()}
 
-    def _general_domain_analysis(self, parsed: dict[str, Any], complexity: dict[str, Any]) -> DomainAnalysis:
-        """Fallback analysis when no specialized domain is strongly selected."""
-
-        suggestions = [
-            "Keep functions small, validate inputs explicitly, and add focused tests for edge cases.",
-        ]
-        if int(parsed.get("max_loop_depth", 0) or 0) >= 2:
-            suggestions.append("Consider replacing repeated scans with a precomputed dictionary or set.")
-        return DomainAnalysis(
-            domain="general",
-            domain_score=round(_clamp_unit(0.62 - (0.12 * float(complexity["complexity_penalty"]))), 4),
-            issues=_static_issues(parsed, complexity)[:2],
-            suggestions=suggestions,
-            highlights={
-                "cyclomatic_complexity": float(complexity["cyclomatic_complexity"]),
-                "max_loop_depth": float(parsed.get("max_loop_depth", 0) or 0),
-                "lint_score": float(_lint_score(parsed)),
-            },
-        )
-
     def analyze(self, request: AnalyzeCodeRequest) -> AnalyzeCodeResponse:
-        """Run the complete static-plus-ML code review pipeline."""
 
         started = time.perf_counter()
         parsed = parse_code_structure(request.code)
        complexity = estimate_complexity(parsed, request.code)
-        lint_score = _lint_score(parsed)
-        model_prediction = self.model.predict(
-            request.code,
-            request.context_window,
-            request.traceback_text,
-            parsed,
-        )
         heuristic_scores = self._heuristic_domain_scores(parsed, request.code)
 
-        combined_scores: dict[str, float] = {}
         for domain, heuristic_score in heuristic_scores.items():
             model_score = float(model_prediction["domain_scores"].get(domain, 0.2))
-            combined_scores[domain] = round((0.65 * model_score) + (0.35 * heuristic_score), 4)
 
         detected_domain = request.domain_hint if request.domain_hint != "auto" else max(combined_scores, key=combined_scores.get)
         analyzer = self._analyzers.get(detected_domain)
         domain_analysis = (
             analyzer(request.code, parsed, complexity)
             if analyzer is not None
-            else self._general_domain_analysis(parsed, complexity)
         )
-        static_issues = _static_issues(parsed, complexity)
         static_analysis = StaticAnalysisSummary(
             syntax_valid=bool(parsed["syntax_valid"]),
             syntax_error=str(parsed["syntax_error"]),
             cyclomatic_complexity=int(complexity["cyclomatic_complexity"]),
             line_count=int(parsed["line_count"]),
-            max_nesting_depth=int(parsed["max_nesting_depth"]),
             max_loop_depth=int(parsed["max_loop_depth"]),
             time_complexity=str(complexity["time_complexity"]),
             space_complexity=str(complexity["space_complexity"]),
-            lint_score=lint_score,
-            docstring_coverage=float(parsed["docstring_ratio"]),
             detected_imports=list(parsed["imports"]),
             code_smells=list(parsed["code_smells"]),
-            issues=static_issues,
-        )
-
-        score_breakdown = self.reward_service.compute(
-            ml_score=float(model_prediction["ml_quality_score"]),
-            domain_score=domain_analysis.domain_score,
-            lint_score=lint_score,
-            complexity_penalty=float(complexity["complexity_penalty"]),
-            maintainability_score=float(model_prediction["maintainability_score"]),
-            issue_probabilities=dict(model_prediction["issue_probabilities"]),
-        )
-        suggestions = self.suggestion_service.build_suggestions(
-            domain_analysis=domain_analysis,
-            static_analysis=static_analysis,
         )
         improvement_plan = self.suggestion_service.build_improvement_plan(
             domain_analysis=domain_analysis,
             static_analysis=static_analysis,
         )
-        auto_fix_preview = self.suggestion_service.build_auto_fix_preview(
-            domain_analysis=domain_analysis,
-            static_analysis=static_analysis,
-        )
-
         summary = (
-            f"Reviewed Python code as `{detected_domain}` with an ML quality score of {score_breakdown.ml_score:.0%}, "
-            f"lint score {score_breakdown.lint_score:.0%}, and RL-ready reward {score_breakdown.reward:.0%}."
         )
-        model_notes = list(model_prediction["notes"])
-        if static_issues:
-            model_notes.append(f"Static analyzer found {len(static_issues)} review issue(s).")
-
         return AnalyzeCodeResponse(
             detected_domain=detected_domain,  # type: ignore[arg-type]
             domain_confidences=combined_scores,
             score_breakdown=score_breakdown,
             static_analysis=static_analysis,
-            model_prediction=ModelPrediction(
-                quality_label=str(model_prediction["quality_label"]),  # type: ignore[arg-type]
-                quality_score=float(model_prediction["quality_score"]),
-                maintainability_score=float(model_prediction["maintainability_score"]),
-                issue_probabilities=dict(model_prediction["issue_probabilities"]),
-                notes=model_notes,
-            ),
             domain_analysis=domain_analysis,
-            suggestions=suggestions if request.enable_suggestions else [],
-            improvement_plan=improvement_plan if request.enable_suggestions else [],
-            auto_fix_preview=auto_fix_preview if request.enable_suggestions else [],
-            score_visualization={
-                "reward": score_breakdown.reward,
-                "ml_quality": score_breakdown.ml_score,
-                "lint_score": score_breakdown.lint_score,
-                "maintainability": score_breakdown.maintainability_score,
-                "security": score_breakdown.security_score,
-                "readability": score_breakdown.readability_score,
-                "quality_signal": score_breakdown.quality_signal,
-                "error_reduction_signal": score_breakdown.error_reduction_signal,
-                "completion_signal": score_breakdown.completion_signal,
-                "complexity_penalty": score_breakdown.complexity_penalty,
-            },
             model_backend=str(model_prediction["backend_name"]),
             model_id=str(model_prediction["model_id"]),
             summary=summary,
             context_window=request.context_window,
-            filename=request.filename,
             analysis_time_ms=round((time.perf_counter() - started) * 1000.0, 2),
         )
 
+"""Orchestration layer for multi-domain code analysis."""
 
 from __future__ import annotations
 
 import time
+from typing import Any, Callable, Dict
 
 from analyzers import analyze_data_science_code, analyze_dsa_code, analyze_ml_code, analyze_web_code
 from models import PyTorchCodeAnalyzerModel
 from schemas.request import AnalyzeCodeRequest
+from schemas.response import AnalyzeCodeResponse, DomainAnalysis, StaticAnalysisSummary
 from services.reward_service import RewardService
 from services.suggestion_service import SuggestionService
 from utils import estimate_complexity, parse_code_structure
 
 
+def _lint_score(parsed: Dict[str, Any]) -> float:
     """Convert structural smells into a normalized lint-style score."""
 
     score = 1.0
     if not parsed.get("syntax_valid", True):
         score -= 0.45
+    score -= min(parsed.get("long_lines", 0), 5) * 0.03
     if parsed.get("tabs_used"):
         score -= 0.1
     if parsed.get("trailing_whitespace_lines"):
         score -= 0.05
     if parsed.get("docstring_ratio", 0.0) == 0.0 and parsed.get("function_names"):
         score -= 0.08
+    return round(max(0.0, min(1.0, score)), 4)
 
 
 class AnalysisService:
         self._model: PyTorchCodeAnalyzerModel | None = None
         self.reward_service = RewardService()
         self.suggestion_service = SuggestionService()
+        self._analyzers: Dict[str, Callable[[str, Dict[str, Any], Dict[str, Any]], DomainAnalysis]] = {
             "dsa": analyze_dsa_code,
             "data_science": analyze_data_science_code,
             "ml_dl": analyze_ml_code,
             self._model = PyTorchCodeAnalyzerModel()
         return self._model
 
+    def _heuristic_domain_scores(self, parsed: Dict[str, Any], code: str) -> Dict[str, float]:
         """Derive domain priors from imports and syntax-level hints."""
 
         scores = {
+            "dsa": 0.2 + (0.15 if parsed.get("uses_recursion") else 0.0) + (0.15 if parsed.get("max_loop_depth", 0) >= 1 else 0.0),
+            "data_science": 0.2 + (0.35 if parsed.get("uses_pandas") or parsed.get("uses_numpy") else 0.0),
+            "ml_dl": 0.2 + (0.35 if parsed.get("uses_torch") or parsed.get("uses_sklearn") else 0.0),
+            "web": 0.2 + (0.35 if parsed.get("uses_fastapi") or parsed.get("uses_flask") else 0.0) + (0.1 if parsed.get("route_decorators") else 0.0),
+            "general": 0.2,
         }
+        if "fastapi" in code.lower():
+            scores["web"] += 0.1
+        if "pandas" in code.lower() or "numpy" in code.lower():
             scores["data_science"] += 0.1
+        if "torch" in code.lower():
             scores["ml_dl"] += 0.1
         if "while" in code or "for" in code:
+            scores["dsa"] += 0.05
         return {key: round(min(value, 0.99), 4) for key, value in scores.items()}
 
     def analyze(self, request: AnalyzeCodeRequest) -> AnalyzeCodeResponse:
+        """Run the complete multi-domain analysis pipeline."""
 
         started = time.perf_counter()
         parsed = parse_code_structure(request.code)
         complexity = estimate_complexity(parsed, request.code)
+        model_prediction = self.model.predict(request.code, request.context_window, parsed)
         heuristic_scores = self._heuristic_domain_scores(parsed, request.code)
 
+        combined_scores = {}
         for domain, heuristic_score in heuristic_scores.items():
             model_score = float(model_prediction["domain_scores"].get(domain, 0.2))
+            combined_scores[domain] = round((0.6 * model_score) + (0.4 * heuristic_score), 4)
 
         detected_domain = request.domain_hint if request.domain_hint != "auto" else max(combined_scores, key=combined_scores.get)
         analyzer = self._analyzers.get(detected_domain)
         domain_analysis = (
             analyzer(request.code, parsed, complexity)
             if analyzer is not None
+            else DomainAnalysis(
+                domain="general",
+                domain_score=0.6,
+                issues=[],
+                suggestions=["Add stronger domain-specific context for deeper analysis."],
+                highlights={},
+            )
+        )
+
+        lint_score = _lint_score(parsed)
+        score_breakdown = self.reward_service.compute(
+            ml_score=float(model_prediction["ml_quality_score"]),
+            domain_score=domain_analysis.domain_score,
+            lint_score=lint_score,
+            complexity_penalty=float(complexity["complexity_penalty"]),
         )
         static_analysis = StaticAnalysisSummary(
             syntax_valid=bool(parsed["syntax_valid"]),
             syntax_error=str(parsed["syntax_error"]),
             cyclomatic_complexity=int(complexity["cyclomatic_complexity"]),
             line_count=int(parsed["line_count"]),
             max_loop_depth=int(parsed["max_loop_depth"]),
             time_complexity=str(complexity["time_complexity"]),
             space_complexity=str(complexity["space_complexity"]),
             detected_imports=list(parsed["imports"]),
             code_smells=list(parsed["code_smells"]),
         )
         improvement_plan = self.suggestion_service.build_improvement_plan(
             domain_analysis=domain_analysis,
             static_analysis=static_analysis,
         )
         summary = (
+            f"Detected `{detected_domain}` code with a model score of {score_breakdown.ml_score:.0%}, "
+            f"domain score {score_breakdown.domain_score:.0%}, and final reward {score_breakdown.reward:.0%}."
         )
         return AnalyzeCodeResponse(
             detected_domain=detected_domain,  # type: ignore[arg-type]
             domain_confidences=combined_scores,
             score_breakdown=score_breakdown,
             static_analysis=static_analysis,
             domain_analysis=domain_analysis,
+            improvement_plan=improvement_plan,
             model_backend=str(model_prediction["backend_name"]),
             model_id=str(model_prediction["model_id"]),
             summary=summary,
             context_window=request.context_window,
             analysis_time_ms=round((time.perf_counter() - started) * 1000.0, 2),
         )
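The new `analyze` path blends learned and heuristic domain priors 60/40 before picking the top domain. A self-contained sketch of that blend with invented scores (`blend_domain_scores` is a hypothetical helper; the real method reads `model_prediction["domain_scores"]`):

```python
def blend_domain_scores(model_scores: dict[str, float], heuristic_scores: dict[str, float]) -> dict[str, float]:
    """Blend model and heuristic domain priors 60/40, defaulting missing model scores to 0.2."""
    return {
        domain: round((0.6 * model_scores.get(domain, 0.2)) + (0.4 * heuristic), 4)
        for domain, heuristic in heuristic_scores.items()
    }

# Invented scores for illustration only.
combined = blend_domain_scores(
    {"web": 0.8, "dsa": 0.3},
    {"web": 0.4, "dsa": 0.25, "general": 0.2},
)
detected = max(combined, key=combined.get)  # "web" wins for these inputs
print(combined, detected)
```

Weighting the model side more heavily lets the embedding model dominate, while the heuristic priors keep obviously import-driven domains (pandas, torch, fastapi) from being missed.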
services/reward_service.py CHANGED
@@ -5,50 +5,32 @@ from __future__ import annotations
 from schemas.response import ScoreBreakdown
 
 
-def _clamp_unit(value: float) -> float:
-    return max(0.0, min(1.0, float(value)))
-
-
 class RewardService:
-    """Compute reward scores from model, lint, complexity, and issue-risk signals."""
-
-    def compute(
-        self,
-        *,
-        ml_score: float,
-        domain_score: float,
-        lint_score: float,
-        complexity_penalty: float,
-        maintainability_score: float,
-        issue_probabilities: dict[str, float],
-    ) -> ScoreBreakdown:
-        """Apply RL-friendly reward shaping to the code review analysis signals."""
-
-        security_score = _clamp_unit(1.0 - issue_probabilities.get("security", 0.0))
-        readability_score = _clamp_unit((0.6 * lint_score) + (0.4 * maintainability_score))
-        quality_signal = _clamp_unit((0.55 * ml_score) + (0.25 * maintainability_score) + (0.20 * domain_score))
-        error_reduction_signal = _clamp_unit((0.7 * lint_score) + (0.3 * (1.0 - complexity_penalty)))
-        completion_signal = _clamp_unit(
-            (0.4 * quality_signal) + (0.25 * readability_score) + (0.2 * security_score) + (0.15 * domain_score)
         )
-
-        reward = _clamp_unit(
-            (0.5 * ml_score)
-            + (0.18 * lint_score)
-            + (0.12 * maintainability_score)
-            + (0.10 * domain_score)
-            + (0.10 * security_score)
-            - (0.20 * complexity_penalty)
-        )
-
         return ScoreBreakdown(
             ml_score=round(ml_score, 4),
             domain_score=round(domain_score, 4),
             lint_score=round(lint_score, 4),
             complexity_penalty=round(complexity_penalty, 4),
-            maintainability_score=round(maintainability_score, 4),
-            security_score=round(security_score, 4),
-            readability_score=round(readability_score, 4),
             quality_signal=round(quality_signal, 4),
             error_reduction_signal=round(error_reduction_signal, 4),
             completion_signal=round(completion_signal, 4),

 from schemas.response import ScoreBreakdown
 
 
 class RewardService:
+    """Compute reward scores from model, domain, lint, and complexity signals."""
+
+    def compute(self, *, ml_score: float, domain_score: float, lint_score: float, complexity_penalty: float) -> ScoreBreakdown:
+        """Apply dynamic reward shaping based on quality, errors, and completion."""
+
+        quality_signal = max(0.0, min(1.0, (0.45 * ml_score) + (0.3 * domain_score) + (0.25 * lint_score)))
+        error_reduction_signal = max(0.0, min(1.0, lint_score - (0.6 * complexity_penalty)))
+        completion_signal = max(0.0, min(1.0, (ml_score + domain_score + lint_score) / 3.0))
+        reward = max(
+            0.0,
+            min(
+                1.0,
+                (0.35 * quality_signal)
+                + (0.25 * completion_signal)
+                + (0.2 * error_reduction_signal)
+                + (0.1 * ml_score)
+                + (0.1 * domain_score)
+                - (0.15 * complexity_penalty),
+            ),
         )
         return ScoreBreakdown(
             ml_score=round(ml_score, 4),
             domain_score=round(domain_score, 4),
             lint_score=round(lint_score, 4),
             complexity_penalty=round(complexity_penalty, 4),
             quality_signal=round(quality_signal, 4),
             error_reduction_signal=round(error_reduction_signal, 4),
             completion_signal=round(completion_signal, 4),
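The simplified `compute` above reduces to a clamped weighted sum of three derived signals. A standalone sketch with toy inputs (the numbers are illustrative, and the `ScoreBreakdown` packaging is omitted):

```python
def clamp01(value: float) -> float:
    """Clamp a signal into the [0, 1] range."""
    return max(0.0, min(1.0, value))

def reward(ml: float, domain: float, lint: float, penalty: float) -> float:
    """Mirror the new compute(): quality, error-reduction, and completion signals."""
    quality = clamp01((0.45 * ml) + (0.3 * domain) + (0.25 * lint))
    error_reduction = clamp01(lint - (0.6 * penalty))
    completion = clamp01((ml + domain + lint) / 3.0)
    return clamp01(
        (0.35 * quality)
        + (0.25 * completion)
        + (0.2 * error_reduction)
        + (0.1 * ml)
        + (0.1 * domain)
        - (0.15 * penalty)
    )

print(round(reward(0.8, 0.6, 0.9, 0.1), 4))  # roughly 0.75 for these inputs
```

Clamping at every stage keeps the final reward bounded in [0, 1] even when the complexity penalty is large, which makes the value safe to use directly as an RL reward.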
services/suggestion_service.py CHANGED
@@ -2,77 +2,11 @@
 
 from __future__ import annotations
 
-from schemas.response import DomainAnalysis, StaticAnalysisSummary, SuggestionItem
 
 
 class SuggestionService:
-    """Build high-signal improvement suggestions from analysis output."""
-
-    def build_suggestions(
-        self,
-        *,
-        domain_analysis: DomainAnalysis,
-        static_analysis: StaticAnalysisSummary,
-    ) -> list[SuggestionItem]:
-        """Return prioritized fixes tailored to the detected review signals."""
-
-        suggestions: list[SuggestionItem] = []
-
-        if not static_analysis.syntax_valid:
-            suggestions.append(
-                SuggestionItem(
-                    priority="P0",
-                    title="Fix the syntax error",
-                    rationale="Static parsing failed, so downstream tests and model signals are less reliable.",
-                    action=f"Resolve the parser issue first: {static_analysis.syntax_error}.",
-                    category="correctness",
-                )
-            )
-
-        if static_analysis.cyclomatic_complexity >= 6 or static_analysis.max_loop_depth >= 2:
-            suggestions.append(
-                SuggestionItem(
-                    priority="P1",
-                    title="Reduce branching or nested loops",
-                    rationale="Higher structural complexity makes bugs more likely and lowers the RL reward.",
-                    action="Extract helper functions or replace repeated scans with a dictionary, set, Counter, or vectorized operation.",
-                    category="performance",
-                )
-            )
-
-        if static_analysis.docstring_coverage == 0 and static_analysis.line_count > 0:
-            suggestions.append(
-                SuggestionItem(
-                    priority="P2",
-                    title="Add function-level documentation",
-                    rationale="Docstrings improve review speed and make behavior clearer for future edits.",
-                    action="Document the expected inputs, outputs, and edge cases in a short function docstring.",
-                    category="style",
-                )
-            )
-
-        for issue in domain_analysis.issues[:2]:
-            suggestions.append(
-                SuggestionItem(
-                    priority="P1" if issue.severity != "high" else "P0",
-                    title=issue.title,
-                    rationale=issue.description,
-                    action=domain_analysis.suggestions[0] if domain_analysis.suggestions else "Refactor the risky section and re-run analysis.",
-                    category=issue.category,
-                )
-            )
-
-        if not suggestions:
-            suggestions.append(
-                SuggestionItem(
-                    priority="P2",
-                    title="Strengthen review confidence",
-                    rationale="No severe issues were detected, but explicit edge-case coverage still improves maintainability.",
-                    action="Add targeted tests for empty input, boundary values, and malformed payloads.",
-                    category="maintainability",
-                )
-            )
-        return suggestions[:4]
 
     def build_improvement_plan(self, *, domain_analysis: DomainAnalysis, static_analysis: StaticAnalysisSummary) -> list[str]:
         """Return a compact three-step plan optimized for developer action."""
@@ -92,22 +26,3 @@ class SuggestionService:
         if not static_analysis.syntax_valid:
             step_one = f"Step 1 - Correctness and safety: fix the syntax error first ({static_analysis.syntax_error})."
         return [step_one, step_two, step_three]
-
-    def build_auto_fix_preview(
-        self,
-        *,
-        domain_analysis: DomainAnalysis,
-        static_analysis: StaticAnalysisSummary,
-    ) -> list[str]:
-        """Generate compact auto-fix hints for the UI preview panel."""
-
-        preview: list[str] = []
-        if not static_analysis.syntax_valid:
-            preview.append(f"Repair parser failure: {static_analysis.syntax_error}")
-        if static_analysis.max_loop_depth >= 2:
-            preview.append("Replace nested scans with a precomputed lookup table or aggregation structure.")
-        if static_analysis.docstring_coverage == 0:
-            preview.append("Add a short docstring describing the function contract and edge cases.")
-        if domain_analysis.suggestions:
-            preview.append(domain_analysis.suggestions[0])
-        return preview[:3]

 
 from __future__ import annotations
 
+from schemas.response import DomainAnalysis, StaticAnalysisSummary
 
 
 class SuggestionService:
+    """Build high-signal improvement steps from analysis output."""
 
     def build_improvement_plan(self, *, domain_analysis: DomainAnalysis, static_analysis: StaticAnalysisSummary) -> list[str]:
         """Return a compact three-step plan optimized for developer action."""
         if not static_analysis.syntax_valid:
             step_one = f"Step 1 - Correctness and safety: fix the syntax error first ({static_analysis.syntax_error})."
         return [step_one, step_two, step_three]