uvpatel7271 committed
Commit 566a172 · verified · 1 Parent(s): b8f678a

Upload folder using huggingface_hub
Dockerfile ADDED
@@ -0,0 +1,32 @@
+ FROM python:3.11-slim
+
+ WORKDIR /app
+
+ # Install system dependencies
+ RUN apt-get update && apt-get install -y --no-install-recommends \
+     gcc \
+     git \
+     curl \
+     && rm -rf /var/lib/apt/lists/*
+
+ # Copy source code
+ COPY . /app
+
+ # Install Python dependencies
+ RUN pip install --no-cache-dir -r server/requirements.txt
+
+ # Set environment variables
+ ENV PYTHONUNBUFFERED=1
+ ENV HOST=0.0.0.0
+ ENV PORT=8000
+ ENV WORKERS=1
+ ENV MAX_CONCURRENT_ENVS=16
+
+ # Health check
+ HEALTHCHECK --interval=30s --timeout=5s --start-period=15s --retries=3 \
+     CMD curl -f http://localhost:${PORT}/health || exit 1
+
+ # Run FastAPI app
+ EXPOSE ${PORT}
+ ENV ENABLE_WEB_INTERFACE=true
+ CMD ["python", "-m", "server.app"]
Project.md ADDED
@@ -0,0 +1,1017 @@
+ python inference.py --model gpt-3.5-turbo --base-url "http://localhost:8000/v1"
+ python inference.py --model gemini-2.0-flash --base-url "https://generativelanguage.googleapis.com/openai/"
+ python inference.py --model deepseek-chat --base-url "https://api.deepseek.com"
+
+ # Python Env Project Guide
4
+
5
+ This document explains how to work with the `python_env` project end to end:
6
+
7
+ 1. What the environment is trying to do
8
+ 2. How the current code is structured
9
+ 3. How each route works
10
+ 4. How to test each route manually
11
+ 5. How to use the inference script
12
+ 6. How to prepare data so an RL or agent-training setup can learn more effectively
13
+ 7. How the project maps to the hackathon functional requirements
14
+
15
+ The goal is practical: after reading this file, you should be able to start the server, hit every route, understand what each response means, run the baseline, and know what data to collect next.
16
+
17
+ ## 1. Project Goal
18
+
19
+ This environment simulates a real software engineering workflow: Python code review.
20
+
21
+ An agent is given Python code and must:
22
+
23
+ - detect correctness bugs
24
+ - detect security risks
25
+ - detect maintainability problems
26
+ - detect obvious performance issues
27
+ - optionally suggest improved code
28
+
29
+ Code review is a task engineering teams perform every day, which makes this a valid real-world environment.
30
+
31
+ ## 2. High-Level Architecture
32
+
33
+ The project has four main parts:
34
+
35
+ - `models.py`
36
+ Defines the typed Pydantic models for actions, observations, evaluations, config, health, and direct-review payloads.
37
+
38
+ - `server/code_review_environment.py`
39
+ Implements the environment logic: `reset()`, `step()`, reward shaping, task progression, hints, history, and grading integration.
40
+
41
+ - `server/task_bank.py`, `server/grading.py`, `server/static_review.py`
42
+ These files define the benchmark tasks, deterministic graders, and direct static review rules.
43
+
44
+ - `server/app.py`
45
+ Exposes both:
46
+ - OpenEnv-compatible endpoints such as `/reset`, `/step`, `/state`, `/schema`, `/ws`
47
+ - custom REST endpoints such as `/health`, `/tasks`, `/review`, `/config`, `/history`
48
+
49
+ - `inference.py`
50
+ Runs an OpenAI-compatible model against the environment and writes a reproducible report.
51
+
52
+ ## 3. File-by-File Understanding
53
+
54
+ ### `models.py`
55
+
56
+ Important models:
57
+
58
+ - `ReviewFinding`
59
+ One code-review issue found by the agent.
60
+ Fields:
61
+ - `title`
62
+ - `line`
63
+ - `category`
64
+ - `severity`
65
+ - `rationale`
66
+ - `recommendation`
67
+ - `rule_id`
68
+
69
+ - `PythonReviewAction`
70
+ What the agent sends to the environment.
71
+ Fields:
72
+ - `operation`
73
+ - `findings`
74
+ - `patched_code`
75
+ - `note`
76
+
77
+ - `PythonReviewObservation`
78
+ What the environment returns back.
79
+ Fields:
80
+ - `task`
81
+ - `instructions`
82
+ - `feedback`
83
+ - `submitted_findings`
84
+ - `hints_used`
85
+ - `attempts_remaining`
86
+ - `evaluation`
87
+ - `score`
88
+ - `review_time_ms`
89
+ - inherited OpenEnv fields such as `reward`, `done`, `metadata`
90
+
91
+ - `TaskEvaluation`
92
+ Deterministic grading output.
93
+ Fields:
94
+ - `matched_reference_ids`
95
+ - `matched_findings`
96
+ - `total_findings`
97
+ - `false_positives`
98
+ - `duplicate_findings`
99
+ - `weighted_recall`
100
+ - `patch_score`
101
+ - `score`
102
+ - `passed`
103
+
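To make the action shape concrete, here is a self-contained sketch (plain stdlib, no server required) of a `PythonReviewAction`-style payload built from the fields above. The field values mirror the Test 6 example later in this guide; the authoritative schema lives in `models.py`, so treat the literals as illustrative.

```python
import json

# Illustrative action payload matching the PythonReviewAction fields above.
# Values are examples, not the authoritative schema.
action = {
    "operation": "submit_findings",
    "findings": [
        {
            "title": "Avoid eval on untrusted configuration data",
            "line": 2,
            "category": "security",
            "severity": "critical",
            "rationale": "eval can execute attacker-controlled code.",
            "recommendation": "Use json.loads or ast.literal_eval.",
            "rule_id": "avoid-eval",
        }
    ],
    "patched_code": None,
    "note": "First pass review",
}

payload = json.dumps(action)      # this string becomes the body of POST /step
decoded = json.loads(payload)     # round-trip to confirm it is valid JSON
assert decoded["findings"][0]["rule_id"] == "avoid-eval"
```

The same dictionary shape works from any HTTP client, which is all `inference.py` needs to drive the environment.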
104
+ ### `server/task_bank.py`
105
+
106
+ Contains the benchmark tasks.
107
+
108
+ Current tasks:
109
+
110
+ 1. `py-review-easy`
111
+ Detect unsafe `eval` and division-by-zero risk.
112
+
113
+ 2. `py-review-medium`
114
+ Detect mutable default list, quadratic membership check, and bare `except`.
115
+
116
+ 3. `py-review-hard`
117
+ Detect `shell=True` command injection, stale cache bug, and shared output file risk.
118
+
119
+ Each task contains:
120
+
121
+ - code to review
122
+ - hints
123
+ - reference findings
124
+ - pass threshold
125
+
126
+ ### `server/grading.py`
127
+
128
+ This is the benchmark grader.
129
+
130
+ It compares submitted findings to hidden reference findings and computes:
131
+
132
+ - weighted recall
133
+ - penalties for false positives
134
+ - penalties for duplicates
135
+ - optional patch quality score
136
+ - final score in `0.0` to `1.0`
137
+
138
+ This makes the task deterministic and reproducible, which is important for hackathon judging.
139
+
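The actual formula lives in `server/grading.py`; the sketch below only illustrates how weighted recall, penalties, and a patch bonus could combine into a score clipped to `0.0`–`1.0`. The default weights mirror the config example in Test 12, but they are assumptions here, not the grader's real constants.

```python
def sketch_score(weighted_recall, false_positives, duplicates,
                 patch_score=0.0,
                 fp_penalty=0.08, dup_penalty=0.03, patch_bonus=0.2):
    """Illustrative only: combine recall, penalties, and an optional
    patch bonus into a final score clipped to [0.0, 1.0]."""
    score = weighted_recall
    score -= fp_penalty * false_positives   # penalize false positives
    score -= dup_penalty * duplicates       # penalize duplicate findings
    score += patch_bonus * patch_score      # optional patch quality bonus
    return max(0.0, min(1.0, score))

# High recall with one false positive still scores well but not perfectly.
print(sketch_score(0.8, false_positives=1, duplicates=0))
```

The clipping step matters: heavy penalties can never push a score below zero, and the patch bonus can never push it above one.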
140
+ ### `server/static_review.py`
141
+
142
+ This powers the `/review` endpoint for arbitrary code snippets.
143
+
144
+ It uses AST inspection to detect:
145
+
146
+ - `eval` / `exec`
147
+ - mutable default arguments
148
+ - `shell=True`
149
+ - bare `except`
150
+ - list-membership-inside-loop performance smell
151
+ - syntax errors
152
+ - `print()` used in application logic
153
+
154
+ This is not the task grader. It is the direct-review helper.
155
+
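The real rule set in `server/static_review.py` is more complete, but a minimal sketch of the AST approach for two of the listed checks (`eval`/`exec` calls and mutable default arguments) looks like this; the `sketch_review` name and its tuple output are invented for illustration.

```python
import ast

def sketch_review(code: str):
    """Toy AST pass illustrating two of the checks listed above."""
    issues = []
    tree = ast.parse(code)
    for node in ast.walk(tree):
        # eval()/exec() called by bare name
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id in {"eval", "exec"}):
            issues.append(("avoid-eval", node.lineno))
        # mutable default arguments: def f(x=[]) or def f(x={})
        if isinstance(node, ast.FunctionDef):
            for default in node.args.defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    issues.append(("mutable-default", node.lineno))
    return issues

print(sketch_review("def load_settings(t):\n    return eval(t)"))
# [('avoid-eval', 2)]
```

Because `ast.parse` raises `SyntaxError` on invalid input, wrapping the parse in a `try`/`except` is how a reviewer of this style can also report syntax errors as findings.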
156
+ ### `server/code_review_environment.py`
157
+
158
+ This is the environment core.
159
+
160
+ Main methods:
161
+
162
+ - `reset()`
163
+ Rotates to the next task, resets episode state, and returns the initial observation.
164
+
165
+ - `step(action)`
166
+ Accepts a `PythonReviewAction`, grades it, shapes reward, updates history, and returns the new observation.
167
+
168
+ - `direct_review(code, context)`
169
+ Calls the static reviewer for arbitrary code.
170
+
171
+ - `list_tasks()`
172
+ Returns public descriptors for all tasks.
173
+
174
+ - `grade_task_submission(task_id, findings, patched_code)`
175
+ Grades a proposed submission against the deterministic rubric without stepping through an episode.
176
+
177
+ ### `server/app.py`
178
+
179
+ This file wires everything to FastAPI and OpenEnv.
180
+
181
+ Important note:
182
+
183
+ - OpenEnv endpoints are managed through `create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation)`
184
+ - custom routes such as `/health`, `/tasks`, `/review`, `/history`, `/config` use a singleton `python_env`
185
+
186
+ That means:
187
+
188
+ - `/reset` and `/step` are served by OpenEnv session handling
189
+ - `/review`, `/tasks`, `/config`, `/history` are served by the singleton helper instance
190
+
191
+ This is fine for startup and manual testing, but if you want one fully unified state model later, you should refactor custom routes to read from the same managed environment/session layer.
192
+
193
+ ## 4. Route-by-Route Guide
194
+
195
+ ### OpenEnv Routes
196
+
197
+ These are important for validation and agents.
198
+
199
+ #### `POST /reset`
200
+
201
+ Purpose:
202
+ - starts a new episode
203
+ - rotates to the next benchmark task
204
+ - returns an initial observation
205
+
206
+ Use this when:
207
+ - you want to start evaluating an agent on a task
208
+
209
+ #### `POST /step`
210
+
211
+ Purpose:
212
+ - submit agent actions
213
+ - get reward, observation, and done flag
214
+
215
+ Use this when:
216
+ - manually simulating agent steps
217
+ - testing reward shaping and grading
218
+
219
+ #### `GET /state`
220
+
221
+ Purpose:
222
+ - returns current OpenEnv session state, typically `episode_id` and `step_count`
223
+
224
+ Use this when:
225
+ - debugging session behavior
226
+
227
+ #### `GET /schema`
228
+
229
+ Purpose:
230
+ - shows the action/observation schema expected by OpenEnv
231
+
232
+ Use this when:
233
+ - debugging payload formats
234
+ - verifying OpenEnv compatibility
235
+
236
+ #### `WS /ws`
237
+
238
+ Purpose:
239
+ - persistent lower-latency session transport for clients
240
+
241
+ Use this when:
242
+ - building actual agent loops with the `EnvClient`
243
+
244
+ ### Custom REST Routes
245
+
246
+ #### `GET /health`
247
+
248
+ Purpose:
249
+ - quick health check for Docker and Hugging Face Spaces
250
+
251
+ Use this when:
252
+ - checking whether the server is alive
253
+ - validating deployment health
254
+
255
+ #### `GET /tasks`
256
+
257
+ Purpose:
258
+ - returns the three benchmark task descriptors
259
+
260
+ Use this when:
261
+ - reviewing available tasks
262
+ - building curriculum/eval metadata
263
+
264
+ #### `GET /tasks/{task_id}`
265
+
266
+ Purpose:
267
+ - returns one task descriptor
268
+
269
+ Use this when:
270
+ - inspecting a task before submitting findings
271
+
272
+ #### `POST /tasks/{task_id}/grade`
273
+
274
+ Purpose:
275
+ - grade a proposed set of findings against the deterministic task rubric
276
+
277
+ Use this when:
278
+ - validating benchmark grading directly
279
+ - building offline evaluation sets
280
+
281
+ #### `POST /review`
282
+
283
+ Purpose:
284
+ - run direct static review on arbitrary Python code
285
+
286
+ Use this when:
287
+ - testing the static analyzer
288
+ - building training examples
289
+ - verifying that common issues are caught
290
+
291
+ #### `GET /history`
292
+
293
+ Purpose:
294
+ - returns the singleton environment history
295
+
296
+ Use this when:
297
+ - checking what the custom singleton environment has processed
298
+
299
+ Note:
300
+ - this history is not the same as OpenEnv session history from `/step`
301
+
302
+ #### `DELETE /history`
303
+
304
+ Purpose:
305
+ - clears the singleton history
306
+
307
+ Use this when:
308
+ - resetting the custom review log before a test run
309
+
310
+ #### `GET /config`
311
+
312
+ Purpose:
313
+ - inspect config values such as penalties and task order
314
+
315
+ #### `PUT /config`
316
+
317
+ Purpose:
318
+ - update the environment config
319
+
320
+ Use this when:
321
+ - testing different reward penalties or task order
322
+
323
+ ## 5. Manual Testing: Step by Step
324
+
325
+ Start the server:
326
+
327
+ ```powershell
328
+ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
329
+ ```
330
+
331
+ Open the docs:
332
+
333
+ ```text
334
+ http://127.0.0.1:8000/docs
335
+ ```
336
+
337
+ The interactive docs page is the easiest way to explore the routes manually.
338
+
339
+ ### Test 1: Health
340
+
341
+ ```powershell
342
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/health" -Method Get
343
+ ```
344
+
345
+ Expected:
346
+ - `status` should be `ok`
347
+ - `task_count` should be `3`
348
+
349
+ ### Test 2: List Tasks
350
+
351
+ ```powershell
352
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/tasks" -Method Get
353
+ ```
354
+
355
+ Expected:
356
+ - three tasks
357
+ - each task has `task_id`, `difficulty`, `title`, `objective`, `code`
358
+
359
+ ### Test 3: Get One Task
360
+
361
+ ```powershell
362
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/tasks/py-review-easy" -Method Get
363
+ ```
364
+
365
+ ### Test 4: Direct Static Review
366
+
367
+ ```powershell
368
+ $body = @{
369
+ code = @"
370
+ def load_settings(config_text):
371
+ return eval(config_text)
372
+ "@
373
+ } | ConvertTo-Json
374
+
375
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/review" `
376
+ -Method Post `
377
+ -Body $body `
378
+ -ContentType "application/json"
379
+ ```
380
+
381
+ Expected:
382
+ - at least one issue
383
+ - one issue should have `rule_id = "avoid-eval"`
384
+
385
+ ### Test 5: Reset Episode
386
+
387
+ ```powershell
388
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/reset" `
389
+ -Method Post `
390
+ -Body "{}" `
391
+ -ContentType "application/json"
392
+ ```
393
+
394
+ Expected:
395
+ - an observation with a `task`
396
+ - `done = false`
397
+ - `reward = 0`
398
+
399
+ ### Test 6: Submit Partial Findings To `/step`
400
+
401
+ ```powershell
402
+ $body = @{
403
+ operation = "submit_findings"
404
+ findings = @(
405
+ @{
406
+ title = "Avoid eval on untrusted configuration data"
407
+ line = 2
408
+ category = "security"
409
+ severity = "critical"
410
+ rationale = "eval can execute attacker-controlled code."
411
+ recommendation = "Use json.loads or ast.literal_eval."
412
+ rule_id = "avoid-eval"
413
+ }
414
+ )
415
+ patched_code = $null
416
+ note = "First pass review"
417
+ } | ConvertTo-Json -Depth 5
418
+
419
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/step" `
420
+ -Method Post `
421
+ -Body $body `
422
+ -ContentType "application/json"
423
+ ```
424
+
425
+ Expected:
426
+ - positive reward
427
+ - improved `score`
428
+ - feedback mentioning a matched rubric item
429
+
430
+ ### Test 7: Request A Hint
431
+
432
+ ```powershell
433
+ $body = @{
434
+ operation = "request_hint"
435
+ findings = @()
436
+ patched_code = $null
437
+ note = "Need help"
438
+ } | ConvertTo-Json -Depth 5
439
+
440
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/step" `
441
+ -Method Post `
442
+ -Body $body `
443
+ -ContentType "application/json"
444
+ ```
445
+
446
+ Expected:
447
+ - small negative reward
448
+ - feedback containing `Hint 1: ...`
449
+
450
+ ### Test 8: Finalize A Full Submission
451
+
452
+ ```powershell
453
+ $body = @{
454
+ operation = "finalize"
455
+ findings = @(
456
+ @{
457
+ title = "Avoid eval on untrusted configuration data"
458
+ line = 2
459
+ category = "security"
460
+ severity = "critical"
461
+ rationale = "eval can execute attacker-controlled code."
462
+ recommendation = "Use json.loads or ast.literal_eval."
463
+ rule_id = "avoid-eval"
464
+ },
465
+ @{
466
+ title = "Default count of zero causes a division by zero"
467
+ line = 5
468
+ category = "bug"
469
+ severity = "warning"
470
+ rationale = "count defaults to zero and division crashes."
471
+ recommendation = "Validate count before dividing."
472
+ rule_id = "division-by-zero-default"
473
+ }
474
+ )
475
+ patched_code = $null
476
+ note = "Final review"
477
+ } | ConvertTo-Json -Depth 6
478
+
479
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/step" `
480
+ -Method Post `
481
+ -Body $body `
482
+ -ContentType "application/json"
483
+ ```
484
+
485
+ Expected:
486
+ - `done = true`
487
+ - `evaluation.passed = true`
488
+ - `score` near or above task threshold
489
+
490
+ ### Test 9: Inspect State
491
+
492
+ ```powershell
493
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/state" -Method Get
494
+ ```
495
+
496
+ ### Test 10: Inspect Schemas
497
+
498
+ ```powershell
499
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/schema" -Method Get
500
+ ```
501
+
502
+ ### Test 11: Grade A Task Without Running An Episode
503
+
504
+ ```powershell
505
+ $body = @{
506
+ operation = "submit_findings"
507
+ findings = @(
508
+ @{
509
+ title = "shell=True with interpolated input allows command injection"
510
+ line = 10
511
+ category = "security"
512
+ severity = "critical"
513
+ rationale = "The command string includes user input and runs via shell."
514
+ recommendation = "Pass args as a list and keep shell=False."
515
+ rule_id = "shell-true-command-injection"
516
+ }
517
+ )
518
+ patched_code = $null
519
+ note = "Offline grader test"
520
+ } | ConvertTo-Json -Depth 6
521
+
522
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/tasks/py-review-hard/grade" `
523
+ -Method Post `
524
+ -Body $body `
525
+ -ContentType "application/json"
526
+ ```
527
+
528
+ ### Test 12: Config Read And Update
529
+
530
+ Read:
531
+
532
+ ```powershell
533
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/config" -Method Get
534
+ ```
535
+
536
+ Update:
537
+
538
+ ```powershell
539
+ $body = @{
540
+ task_order = @("py-review-easy", "py-review-medium", "py-review-hard")
541
+ max_steps_per_task = 4
542
+ hint_penalty = 0.05
543
+ false_positive_penalty = 0.08
544
+ duplicate_penalty = 0.03
545
+ patch_bonus_multiplier = 0.2
546
+ max_history_entries = 50
547
+ } | ConvertTo-Json
548
+
549
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/config" `
550
+ -Method Put `
551
+ -Body $body `
552
+ -ContentType "application/json"
553
+ ```
554
+
555
+ ### Test 13: History
556
+
557
+ ```powershell
558
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/history" -Method Get
559
+ ```
560
+
561
+ Clear:
562
+
563
+ ```powershell
564
+ Invoke-RestMethod -Uri "http://127.0.0.1:8000/history" -Method Delete
565
+ ```
566
+
567
+ ## 6. How To Test Using The Inference Script
568
+
569
+ The inference script is for model-vs-environment evaluation.
570
+
571
+ ### Required Variables
572
+
573
+ ```powershell
574
+ $env:API_BASE_URL="https://api.openai.com/v1"
575
+ $env:MODEL_NAME="gpt-4.1-mini"
576
+ $env:OPENAI_API_KEY="your_key_here"
577
+ ```
578
+
579
+ If you want it to hit your local server instead of launching Docker:
580
+
581
+ ```powershell
582
+ $env:ENV_BASE_URL="http://127.0.0.1:8000"
583
+ ```
584
+
585
+ Optional:
586
+
587
+ ```powershell
588
+ $env:MAX_TASKS="3"
589
+ $env:MAX_STEPS="3"
590
+ $env:INFERENCE_REPORT_PATH="inference_results.json"
591
+ ```
592
+
593
+ Run:
594
+
595
+ ```powershell
596
+ python inference.py
597
+ ```
598
+
599
+ What it does:
600
+
601
+ 1. connects to the environment
602
+ 2. resets through up to 3 tasks
603
+ 3. sends task code and feedback to the model
604
+ 4. expects strict JSON findings back
605
+ 5. submits them through `step()`
606
+ 6. logs score and reward per step
607
+ 7. writes a final report JSON file
608
+
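The steps above can be sketched as a minimal, transport-agnostic episode loop. `post`, `fake_post`, and the response keys (`reward`, `done`) are assumptions about the wire format, so treat this as pseudo-client code rather than the real `inference.py`; a real run would wrap an HTTP client around `post` and let the model produce the findings.

```python
def run_episode(post, max_steps=3):
    """Drive one episode against an OpenEnv-style server.

    `post` is any callable (url_path, payload_dict) -> response_dict,
    e.g. a thin wrapper over an HTTP client. The response keys used
    here ("reward", "done") follow this guide and are assumptions.
    """
    obs = post("/reset", {})          # initial observation (unused in this stub)
    total_reward = 0.0
    for _ in range(max_steps):
        action = {
            "operation": "submit_findings",
            "findings": [],           # a real agent would produce findings here
            "patched_code": None,
            "note": "baseline step",
        }
        result = post("/step", action)
        total_reward += result.get("reward", 0.0)
        if result.get("done"):
            break
    return total_reward

# Offline demo with a stub server instead of HTTP:
def fake_post(path, payload):
    if path == "/reset":
        return {"done": False, "reward": 0.0}
    return {"done": True, "reward": 0.5}

print(run_episode(fake_post))  # 0.5
```

Keeping the transport behind a callable like this also makes the loop easy to unit test without a running server.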
609
+ ### How To Interpret The Output
610
+
611
+ Focus on:
612
+
613
+ - `mean_score`
614
+ Overall average benchmark score
615
+
616
+ - per-task `score`
617
+ How well the model solved each task
618
+
619
+ - `passed`
620
+ Whether score met that task’s threshold
621
+
622
+ - step logs
623
+ Show whether the model is improving over the trajectory or getting stuck
624
+
625
+ If the model keeps returning empty findings:
626
+
627
+ - improve the system prompt
628
+ - reduce task ambiguity
629
+ - add examples of desired findings
630
+ - ensure the model endpoint supports the chosen format well
631
+
632
+ ## 7. How To Build Better Training Data
633
+
634
+ If you want an RL environment to actually learn, the biggest bottleneck is data quality.
635
+
636
+ You need more than just three final benchmark tasks. You need trajectories, partial attempts, and failure examples.
637
+
638
+ ### Data Types You Should Collect
639
+
640
+ #### A. Gold Task Rubrics
641
+
642
+ For each task, store:
643
+
644
+ - code snippet
645
+ - hidden reference findings
646
+ - severity
647
+ - category
648
+ - expected line numbers
649
+ - good recommendations
650
+
651
+ This is already partially represented by `server/task_bank.py`.
652
+
653
+ #### B. Positive Demonstrations
654
+
655
+ Create solved examples where the review is high quality.
656
+
657
+ Each example should include:
658
+
659
+ - task code
660
+ - one or more strong findings
661
+ - strong rationales
662
+ - strong recommendations
663
+ - optional patch
664
+ - final score
665
+
666
+ This helps supervised warm-start and behavior cloning.
667
+
668
+ #### C. Partial Trajectories
669
+
670
+ This is important for RL.
671
+
672
+ Store intermediate attempts like:
673
+
674
+ - first attempt finds one issue
675
+ - second attempt adds another issue
676
+ - third attempt finalizes
677
+
678
+ This is what teaches agents to improve over time, not just emit one final perfect answer.
679
+
680
+ #### D. Negative Examples
681
+
682
+ You should also store:
683
+
684
+ - false positives
685
+ - irrelevant complaints
686
+ - duplicate findings
687
+ - hallucinated issues
688
+ - weak recommendations
689
+
690
+ Why:
691
+ - the reward function penalizes these
692
+ - the model must learn precision, not just recall
693
+
694
+ #### E. Hint Usage Examples
695
+
696
+ Store trajectories where:
697
+
698
+ - the agent requests a hint
699
+ - then improves its findings
700
+
701
+ This teaches policy behavior around when hints are worth the penalty.
702
+
703
+ #### F. Patch Examples
704
+
705
+ For tasks where patch quality matters, store:
706
+
707
+ - original code
708
+ - weak patch
709
+ - good patch
710
+ - patch score
711
+
712
+ This helps the model learn that code edits should remove actual problems, not just change formatting.
713
+
714
+ ## 8. Recommended Dataset Format
715
+
716
+ Use JSONL so it is easy to stream and train on.
717
+
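Writing and streaming JSONL needs nothing beyond the stdlib, as in this sketch. It writes to an in-memory buffer so it runs anywhere; for a real dataset you would open a file under the `dataset/` directory suggested in section 12 (the exact path is up to you).

```python
import io
import json

records = [
    {"task_id": "py-review-easy", "difficulty": "easy"},
    {"task_id": "py-review-medium", "difficulty": "medium"},
]

# Write: one JSON object per line.
buf = io.StringIO()
for rec in records:
    buf.write(json.dumps(rec) + "\n")

# Read back by streaming line by line -- no need to load the whole file.
buf.seek(0)
loaded = [json.loads(line) for line in buf if line.strip()]
assert loaded == records
```

The line-per-record layout is what makes JSONL easy to shard, shuffle, and feed to training loops incrementally.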
718
+ ### Benchmark Task Record
719
+
720
+ ```json
721
+ {
722
+ "task_id": "py-review-easy",
723
+ "difficulty": "easy",
724
+ "code": "def load_settings(config_text):\n return eval(config_text)",
725
+ "reference_findings": [
726
+ {
727
+ "rule_id": "avoid-eval",
728
+ "line": 2,
729
+ "category": "security",
730
+ "severity": "critical"
731
+ }
732
+ ]
733
+ }
734
+ ```
735
+
736
+ ### Trajectory Record
737
+
738
+ ```json
739
+ {
740
+ "task_id": "py-review-medium",
741
+ "episode_id": "abc123",
742
+ "steps": [
743
+ {
744
+ "observation_feedback": "Review the Python snippet.",
745
+ "action": {
746
+ "operation": "submit_findings",
747
+ "findings": [
748
+ {
749
+ "title": "Mutable default argument leaks state",
750
+ "line": 1,
751
+ "category": "bug",
752
+ "severity": "warning"
753
+ }
754
+ ]
755
+ },
756
+ "reward": 0.35,
757
+ "score": 0.35
758
+ },
759
+ {
760
+ "observation_feedback": "Matched 1 new rubric item(s): mutable-default-list",
761
+ "action": {
762
+ "operation": "finalize",
763
+ "findings": [
764
+ {
765
+ "title": "Mutable default argument leaks state",
766
+ "line": 1,
767
+ "category": "bug",
768
+ "severity": "warning"
769
+ },
770
+ {
771
+ "title": "Bare except hides failures",
772
+ "line": 12,
773
+ "category": "maintainability",
774
+ "severity": "warning"
775
+ }
776
+ ]
777
+ },
778
+ "reward": 0.27,
779
+ "score": 0.62
780
+ }
781
+ ]
782
+ }
783
+ ```
784
+
785
+ ## 9. How To Make RL Learn Better
786
+
787
+ ### A. Add More Tasks
788
+
789
+ Three tasks are enough for the minimum requirement, but not enough for strong training.
790
+
791
+ You should expand with:
792
+
793
+ - file I/O bugs
794
+ - API misuse
795
+ - SQL injection
796
+ - unsafe deserialization
797
+ - concurrency issues
798
+ - caching mistakes
799
+ - resource leaks
800
+ - logic edge cases
801
+
802
+ Target:
803
+
804
+ - 50 to 200 deterministic tasks
805
+ - grouped by difficulty and domain
806
+
807
+ ### B. Add More Partial Reward Signals
808
+
809
+ Current reward is already better than binary success/fail, but you can improve it.
810
+
811
+ Possible additions:
812
+
813
+ - small bonus when the first critical issue is found early
814
+ - higher reward for critical issues than style issues
815
+ - bonus when rationale quality is high
816
+ - bonus when recommendation mentions a correct mitigation pattern
817
+ - penalty if line numbers are missing when they should be known
818
+
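Several of the additions listed above can be layered on as a per-finding bonus term, as in this hedged sketch. The `SEVERITY_WEIGHT` table, the `shaped_bonus` helper, and all the weights are invented for illustration; they are not the environment's actual reward config.

```python
SEVERITY_WEIGHT = {"critical": 1.0, "warning": 0.5, "info": 0.2}  # invented weights

def shaped_bonus(finding, step_index):
    """Illustrative extra reward terms for one newly matched finding."""
    # Critical issues are worth more than style nits.
    bonus = SEVERITY_WEIGHT.get(finding.get("severity"), 0.2)
    # Small bonus when the first critical issue is found early.
    if finding.get("severity") == "critical" and step_index == 0:
        bonus += 0.1
    # Penalty when the line number is missing but should be known.
    if finding.get("line") is None:
        bonus -= 0.05
    return bonus

print(shaped_bonus({"severity": "critical", "line": 2}, step_index=0))  # 1.1
```

Keeping each shaping term small relative to the base rubric score avoids the agent gaming bonuses instead of finding real issues.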
819
+ ### C. Improve Context In Observation
820
+
821
+ Right now the observation already gives:
822
+
823
+ - task metadata
824
+ - previous feedback
825
+ - submitted findings
826
+ - attempts remaining
827
+
828
+ You can improve learning further by including:
829
+
830
+ - a short list of matched findings so far
831
+ - a short list of remaining categories not yet covered
832
+ - normalized review rubric hints without leaking answers
833
+ - last action summary
834
+
835
+ This helps the agent reason about what it already did and what is still missing.
836
+
837
+ ### D. Separate Training Tasks From Benchmark Tasks
838
+
839
+ Important:
840
+
841
+ - training tasks should be large and varied
842
+ - benchmark tasks should stay hidden and fixed
843
+
844
+ Do not train directly on the same exact benchmark set you plan to judge on.
845
+
846
+ ### E. Add Preference Data
847
+
848
+ You can train preference models on:
849
+
850
+ - strong vs weak findings
851
+ - precise vs vague recommendations
852
+ - useful vs noisy patches
853
+
854
+ This is valuable for ranking quality beyond exact rubric matches.
855
+
856
+ ## 10. Functional Requirements Mapping
857
+
858
+ Here is how your environment should be judged against the stated requirements.
859
+
860
+ ### Requirement: Real-World Task Simulation
861
+
862
+ Status:
863
+ - satisfied in principle
864
+
865
+ Why:
866
+ - code review is a genuine engineering task
867
+
868
+ How to improve further:
869
+ - expand beyond tiny snippets into multi-function modules
870
+ - include operational and maintainability review, not just security lints
871
+
872
+ ### Requirement: OpenEnv Spec Compliance
873
+
874
+ Status:
875
+ - mostly implemented in code
876
+
877
+ Implemented pieces:
878
+ - typed action model
879
+ - typed observation model
880
+ - `reset()`
881
+ - `step()`
882
+ - `state`
883
+ - `openenv.yaml`
884
+ - FastAPI/OpenEnv routes
885
+
886
+ What you still need to verify:
887
+ - `openenv validate`
888
+ - schema compatibility under your installed OpenEnv version
889
+
890
+ ### Requirement: Minimum 3 Tasks With Agent Graders
891
+
892
+ Status:
893
+ - implemented
894
+
895
+ You have:
896
+ - easy
897
+ - medium
898
+ - hard
899
+ - deterministic grader returning `0.0` to `1.0`
900
+
901
+ ### Requirement: Meaningful Reward Function
902
+
903
+ Status:
904
+ - implemented
905
+
906
+ Current reward signals:
907
+ - new rubric matches
908
+ - false positive penalties
909
+ - duplicate penalties
910
+ - hint penalties
911
+ - patch bonus
912
+ - finalize pass bonus
913
+
914
+ ### Requirement: Baseline Inference Script
915
+
916
+ Status:
917
+ - implemented
918
+
919
+ Current `inference.py`:
920
+ - uses OpenAI client
921
+ - reads env vars
922
+ - runs tasks
923
+ - writes report
924
+
925
+ What to verify:
926
+ - actual runtime under 20 minutes
927
+ - reproducible output with your chosen model endpoint
928
+
929
+ ### Requirement: HF Spaces + Docker
930
+
931
+ Status:
932
+ - code is prepared
933
+
934
+ You still need to verify:
935
+
936
+ - `docker build -f server/Dockerfile .`
937
+ - local container startup
938
+ - `openenv push`
939
+ - `/health` returns 200 on the deployed Space
940
+
941
+ ## 11. Recommended Manual Validation Checklist
942
+
943
+ Before submission, run these in order:
944
+
945
+ 1. Start server locally
946
+ 2. Hit `/health`
947
+ 3. Hit `/docs`
948
+ 4. Test `/tasks`
949
+ 5. Test `/review` with unsafe examples
950
+ 6. Test `/reset`
951
+ 7. Test `/step` with partial findings
952
+ 8. Test `/step` with finalize
953
+ 9. Test `/tasks/{task_id}/grade`
954
+ 10. Run `pytest`
955
+ 11. Run `openenv validate`
956
+ 12. Run `python inference.py`
957
+ 13. Build Docker image
958
+ 14. Deploy to Hugging Face Space
959
+ 15. Re-test `/health` and `/reset` on the live Space
960
+
961
+ ## 12. Suggested Immediate Next Steps
962
+
963
+ If you want the environment to become stronger quickly, do this next:
964
+
965
+ 1. Add 10 to 20 more benchmark-style tasks in `server/task_bank.py`
966
+ 2. Save solved and failed trajectories as JSONL files under a new `dataset/` directory
967
+ 3. Refactor custom route state so `/history` and OpenEnv `/step` share one coherent session story
968
+ 4. Run `openenv validate`
969
+ 5. Run `inference.py` against your local server and inspect the report
970
+
971
+ ## 13. Quick Commands Summary
972
+
973
+ Start server:
974
+
975
+ ```powershell
976
+ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
977
+ ```
978
+
979
+ Open docs:
980
+
981
+ ```text
982
+ http://127.0.0.1:8000/docs
983
+ ```
984
+
985
+ Run example tests:
986
+
987
+ ```powershell
988
+ python -m pytest tests -q
989
+ ```
990
+
991
+ Run inference locally:
992
+
993
+ ```powershell
994
+ $env:API_BASE_URL="https://api.openai.com/v1"
995
+ $env:MODEL_NAME="gpt-4.1-mini"
996
+ $env:OPENAI_API_KEY="your_key"
997
+ $env:ENV_BASE_URL="http://127.0.0.1:8000"
998
+ python inference.py
999
+ ```
1000
+
1001
+ Validate OpenEnv:
1002
+
1003
+ ```powershell
1004
+ openenv validate
1005
+ ```
1006
+
1007
+ Build Docker:
1008
+
1009
+ ```powershell
1010
+ docker build -t python_env-env:latest -f server/Dockerfile .
1011
+ ```
1012
+
1013
+ Deploy:
1014
+
1015
+ ```powershell
1016
+ openenv push
1017
+ ```
README.md CHANGED
@@ -1,10 +1,272 @@
1
  ---
2
- title: Openenv Python Env
3
- emoji: ⚡
4
- colorFrom: yellow
5
- colorTo: gray
6
  sdk: docker
 
 
7
  pinned: false
 
 
 
8
  ---
9
 
10
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
  ---
2
+ title: Python Code Review Environment Server
 
 
 
3
  sdk: docker
4
+ app_port: 8000
5
+ base_path: /web
6
  pinned: false
7
+ tags:
8
+ - openenv
9
+ - code-review
10
  ---
11
 
12
+ # Python Code Review Environment
+
+ A production-grade OpenEnv environment for Python code review, repair, and optimization tasks. It simulates real-world developer workflows in which an AI agent reviews, fixes, and improves Python code.
+
+ ## Overview
+
+ **`python_code_review_env`** is a deterministic benchmark environment featuring:
+
+ - ✅ **3 real-world tasks** of increasing difficulty (Syntax, Bug Fix, Optimization)
+ - ✅ **Deterministic graders** using AST analysis, pytest execution, and performance benchmarking
+ - ✅ **OpenAI-compatible API** supporting free/open models (Gemini, DeepSeek, Together, OpenRouter)
+ - ✅ **Production-ready Docker** deployment for Hugging Face Spaces
+ - ✅ **Structured Observations & Actions** following the OpenEnv spec
+ - ✅ **Rich reward shaping** with bonuses for syntax fixes, test passes, and optimization
+
+ ## Tasks
+
+ ### 1. 🟢 Easy: Syntax Fixing
+
+ **Task ID**: `syntax-fix-easy`
+
+ Fix broken Python code with syntax errors.
+
+ - **Difficulty**: Easy
+ - **Goal**: Repair syntax errors so the code compiles
+ - **Starter Code**: Function with a missing closing parenthesis
+ - **Grading**: Compilation check + code similarity to the reference
+ - **Score Range**: 0.0–1.0
+
+ ### 2. 🟡 Medium: Bug Fixing
+
+ **Task ID**: `bug-fix-medium`
+
+ Fix logic bugs with visible and hidden test cases.
+
+ - **Difficulty**: Medium
+ - **Goal**: Repair a logic error in an invoice calculation
+ - **Starter Code**: Function that returns the wrong total (the subtotal instead of the discounted amount)
+ - **Grading**: Fraction of tests passed (visible & hidden)
+ - **Score Range**: 0.0–1.0
+
+ ### 3. 🔴 Hard: Optimization & Refactoring
+
+ **Task ID**: `optimization-hard`
+
+ Optimize inefficient code while maintaining correctness.
+
+ - **Difficulty**: Hard
+ - **Goal**: Convert O(n²) duplicate removal to O(n) using a set
+ - **Starter Code**: Slow nested-loop implementation
+ - **Grading**: 50% correctness + 30% speedup + 15% code quality + 5% style
+ - **Score Range**: 0.0–1.0
+ - **Bonus**: Runtime benchmarking against the reference implementation
+
+ ## Quick Start
+
+ ### Run Locally
+
+ ```bash
+ cd python-code-review-env
+ pip install -r server/requirements.txt
+ python -m server.app
+ ```
+
+ Visit http://localhost:8000/docs for the interactive API documentation.
+
+ ### Run with Docker
+
+ ```bash
+ docker build -f server/Dockerfile -t python_code_review_env:latest .
+ docker run -p 8000:8000 python_code_review_env:latest
+ ```
+
+ ### Run Inference
+
+ ```bash
+ python inference.py --model "gpt-3.5-turbo" --base-url "http://localhost:8000/v1"
+ ```
+
+ ## OpenEnv Specification
+
+ ### Observation
+
+ ```json
+ {
+   "task_id": "syntax-fix-easy",
+   "difficulty": "easy",
+   "task_description": "Fix syntax errors...",
+   "current_code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower(\n    ...",
+   "errors": "invalid syntax (line 2, column 40)",
+   "test_results": "Not run yet.",
+   "visible_tests": ["normalize_username(' Alice Smith ') == 'alice_smith'"],
+   "history": [],
+   "attempts_remaining": 8,
+   "score": 0.0,
+   "reward": {
+     "value": 0.0,
+     "reason": "Episode reset."
+   }
+ }
+ ```
+
+ ### Action
+
+ ```json
+ {
+   "action_type": "edit_code",
+   "code": "def normalize_username(raw_name: str) -> str:\n    cleaned = raw_name.strip().lower()\n    if not cleaned:\n        return \"anonymous\"\n    return cleaned.replace(\" \", \"_\")"
+ }
+ ```
+
+ ### Reward Details
+
+ - **+0.2**: Syntax fixed (one-time per episode)
+ - **+0.15**: Each additional passing test (cumulative per test)
+ - **+0.1**: Code quality improvement
+ - **+0.5**: Full correctness (100% of hidden tests, one-time)
+ - **-0.1**: Invalid action
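As a rough illustration, the shaping rules above combine as sketched below. This is not the environment's actual implementation; the flag names are hypothetical, standing in for signals the environment computes internally at each step.

```python
# Illustrative sketch of the reward shaping listed above.
# All parameter names are hypothetical, not the environment's API.
def shaped_reward(
    syntax_fixed_first_time: bool = False,
    newly_passing_tests: int = 0,
    quality_improved: bool = False,
    full_correctness_first_time: bool = False,
    invalid_action: bool = False,
) -> float:
    if invalid_action:  # invalid actions are penalized outright
        return -0.1
    reward = 0.0
    if syntax_fixed_first_time:  # one-time bonus per episode
        reward += 0.2
    reward += 0.15 * newly_passing_tests  # cumulative per newly passing test
    if quality_improved:
        reward += 0.1
    if full_correctness_first_time:  # one-time bonus at 100% hidden tests
        reward += 0.5
    return reward
```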
+
+ ## Architecture
+
+ ```
+ python_code_review_env/
+ ├── models.py           # Pydantic models (Observation, Action, Reward)
+ ├── server/
+ │   ├── app.py          # FastAPI server
+ │   ├── env.py          # OpenEnv environment
+ │   ├── Dockerfile      # Docker config
+ │   └── requirements.txt
+ ├── graders/
+ │   ├── common.py       # Shared utilities
+ │   ├── syntax.py       # Syntax/bug graders
+ │   ├── optimization.py # Optimization grader
+ │   └── pytest_runner.py
+ ├── tasks/
+ │   ├── task_bank.py    # 3 deterministic tasks
+ │   └── __init__.py
+ ├── inference.py        # Baseline evaluation script
+ ├── openenv.yaml        # OpenEnv spec
+ ├── pyproject.toml      # Project metadata
+ └── README.md
+ ```
+
+ ## FastAPI Endpoints
+
+ - `GET /health` – Health check
+ - `GET /tasks` – List all tasks
+ - `GET /tasks/{task_id}` – Get task details
+ - `POST /tasks/{task_id}/grade` – Grade code offline
+ - Standard OpenEnv endpoints (`/reset`, `/step`, `/state`)
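The endpoints above can be driven over plain HTTP with only the standard library. The sketch below is illustrative: the request and response shapes are assumptions based on this README, not a verified client, and the in-repo `PythonEnv` client in `client.py` should be preferred in practice.

```python
# Minimal HTTP driver for the OpenEnv endpoints; a sketch, assuming a
# server on localhost:8000 and JSON bodies shaped as shown in this README.
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumed local server


def edit_code_payload(code: str) -> dict:
    """Body for POST /step with an edit_code action."""
    return {"action_type": "edit_code", "code": code}


def post_json(path: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    request = urllib.request.Request(
        BASE_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def run_episode() -> dict:
    """Reset onto the easy task, submit one edit, return the step result."""
    post_json("/reset", {"task_id": "syntax-fix-easy"})
    fixed = (
        "def normalize_username(raw_name: str) -> str:\n"
        '    return raw_name.strip().lower().replace(" ", "_")\n'
    )
    return post_json("/step", edit_code_payload(fixed))
```

Call `run_episode()` with the server running to exercise one reset/step round trip.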
+
+ ## Deterministic Graders
+
+ ### Syntax Fix
+ ```
+ if code compiles:
+     score = 1.0
+ else:
+     score = 0.15 + 0.55 * similarity_to_reference
+ ```
+
+ ### Bug Fix
+ ```
+ score = test_pass_fraction (0.0 to 1.0)
+ ```
+
+ ### Optimization
+ ```
+ score = (
+     0.5 * test_fraction +
+     0.3 * speedup_score +
+     0.15 * code_quality +
+     0.05 * pep8_style
+ )
+ ```
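Restated in plain Python, the three formulas above read as follows. This is a sketch mirroring the pseudocode, not the graders' exact source.

```python
# Plain-Python restatement of the three grading formulas above.
def syntax_fix_score(code_compiles: bool, similarity_to_reference: float) -> float:
    if code_compiles:
        return 1.0
    return 0.15 + 0.55 * similarity_to_reference  # partial credit, capped at 0.7


def bug_fix_score(tests_passed: int, tests_total: int) -> float:
    return tests_passed / tests_total if tests_total else 0.0


def optimization_score(
    test_fraction: float,
    speedup_score: float,
    code_quality: float,
    pep8_style: float,
) -> float:
    # Weighted blend: 50% correctness, 30% speedup, 15% quality, 5% style.
    return (0.5 * test_fraction + 0.3 * speedup_score
            + 0.15 * code_quality + 0.05 * pep8_style)
```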
+
+ ## Examples
+
+ ### Using Python
+
+ ```python
+ from server.env import PythonCodeReviewEnvironment
+ from models import PythonCodeReviewAction
+
+ env = PythonCodeReviewEnvironment()
+ obs = env.reset(task_id="syntax-fix-easy")
+
+ action = PythonCodeReviewAction(
+     action_type="edit_code",
+     code="""def normalize_username(raw_name: str) -> str:
+     cleaned = raw_name.strip().lower()
+     if not cleaned:
+         return "anonymous"
+     return cleaned.replace(" ", "_")
+ """
+ )
+
+ obs = env.step(action)
+ print(f"Score: {obs.score}")
+ print(f"Reward: {obs.reward.value:+.3f}")
+ ```
+
+ ### Using cURL
+
+ ```bash
+ # Check health
+ curl http://localhost:8000/health
+
+ # List tasks
+ curl http://localhost:8000/tasks
+
+ # Grade code
+ curl -X POST http://localhost:8000/tasks/syntax-fix-easy/grade \
+   -H "Content-Type: application/json" \
+   -d '{"action_type": "edit_code", "code": "..."}'
+ ```
+
+ ## Deployment
+
+ ### Hugging Face Spaces
+
+ 1. Create a new Space with the Docker SDK
+ 2. Upload the project files, including `server/Dockerfile`
+ 3. The Space auto-deploys on CPU
+ 4. Monitor the `/health` endpoint
+
+ ### Local Docker
+
+ ```bash
+ docker build -f server/Dockerfile -t python_code_review_env .
+ docker run -p 8000:8000 \
+   -e MAX_CONCURRENT_ENVS=16 \
+   python_code_review_env
+ ```
+
+ ## Performance
+
+ - Startup: < 5 s
+ - Reset: < 100 ms
+ - Step: 50 ms–3 s (depending on the action)
+ - Inference (3 tasks): < 20 minutes
+ - CPU: runs on 2 vCPUs and 8 GB RAM
+
+ ## Validation Checklist
+
+ - ✅ 3 deterministic tasks
+ - ✅ Deterministic graders (AST, pytest, benchmarks)
+ - ✅ `/health` → 200
+ - ✅ Scores vary per task (not constant)
+ - ✅ Docker builds successfully
+ - ✅ OpenEnv spec compliant
+ - ✅ Reward shaping working
+ - ✅ All tests deterministic and reproducible
+
+ ## License
+
+ MIT
+
+ ---
+
+ **Built for production. Deterministic. Deployable. Extensible.**
__init__.py ADDED
@@ -0,0 +1,35 @@
+ """Public package API for the Python code review OpenEnv benchmark."""
+
+ from .client import CodeReviewEnv, MyEnv, PythonEnv
+ from .models import (
+     HealthResponse,
+     HistoryEntry,
+     PythonCodeReviewAction,
+     PythonCodeReviewObservation,
+     PythonCodeReviewState,
+     PythonReviewAction,
+     PythonReviewObservation,
+     PythonReviewReward,
+     PythonReviewState,
+     RewardDetails,
+     TaskDescriptor,
+     TaskGrade,
+ )
+
+ __all__ = [
+     "PythonEnv",
+     "CodeReviewEnv",
+     "MyEnv",
+     "PythonCodeReviewAction",
+     "PythonCodeReviewObservation",
+     "PythonCodeReviewState",
+     "PythonReviewAction",
+     "PythonReviewObservation",
+     "PythonReviewReward",
+     "PythonReviewState",
+     "RewardDetails",
+     "HistoryEntry",
+     "TaskDescriptor",
+     "TaskGrade",
+     "HealthResponse",
+ ]
client.py ADDED
@@ -0,0 +1,71 @@
+ """Client for the Python code review environment."""
+
+ from __future__ import annotations
+
+ from typing import Dict
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+
+ from models import (
+     HistoryEntry,
+     PythonCodeReviewAction,
+     PythonCodeReviewObservation,
+     PythonCodeReviewState,
+     RewardDetails,
+ )
+
+
+ class PythonEnv(
+     EnvClient[PythonCodeReviewAction, PythonCodeReviewObservation, PythonCodeReviewState]
+ ):
+     """OpenEnv HTTP client for the Python code review benchmark."""
+
+     def _step_payload(self, action: PythonCodeReviewAction) -> Dict:
+         return action.model_dump(exclude_none=True)
+
+     def _parse_result(self, payload: Dict) -> StepResult[PythonCodeReviewObservation]:
+         obs = payload.get("observation", {})
+         observation = PythonCodeReviewObservation(
+             task_id=obs["task_id"],
+             title=obs["title"],
+             difficulty=obs["difficulty"],
+             task_kind=obs["task_kind"],
+             task_description=obs["task_description"],
+             current_code=obs.get("current_code", ""),
+             errors=obs.get("errors", ""),
+             test_results=obs.get("test_results", ""),
+             history=[HistoryEntry(**entry) for entry in obs.get("history", [])],
+             attempts_remaining=obs.get("attempts_remaining", 0),
+             last_action_status=obs.get("last_action_status", ""),
+             score=obs.get("score", 0.0),
+             reward_details=RewardDetails(**obs.get("reward_details", {})),
+             done=payload.get("done", obs.get("done", False)),
+             reward=payload.get("reward", obs.get("reward")),
+             metadata=obs.get("metadata", {}),
+         )
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward", obs.get("reward")),
+             done=payload.get("done", obs.get("done", False)),
+         )
+
+     def _parse_state(self, payload: Dict) -> PythonCodeReviewState:
+         return PythonCodeReviewState(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+             task_id=payload.get("task_id"),
+             difficulty=payload.get("difficulty"),
+             task_kind=payload.get("task_kind"),
+             attempts_remaining=payload.get("attempts_remaining", 0),
+             current_code=payload.get("current_code", ""),
+             errors=payload.get("errors", ""),
+             test_results=payload.get("test_results", ""),
+             history=[HistoryEntry(**entry) for entry in payload.get("history", [])],
+             score=payload.get("score", 0.0),
+             done=payload.get("done", False),
+         )
+
+
+ CodeReviewEnv = PythonEnv
+ MyEnv = PythonEnv
examples/__init__.py ADDED
@@ -0,0 +1 @@
+ """Example snippets for the Python review environment."""
examples/python_review_examples.py ADDED
@@ -0,0 +1,58 @@
+ """Example Python snippets for exercising the review environment."""
+
+ EXAMPLE_SNIPPETS = {
+     "unsafe_eval": "\n".join(
+         [
+             "def load_settings(config_text):",
+             "    return eval(config_text)",
+         ]
+     ),
+     "mutable_default": "\n".join(
+         [
+             "def append_name(name, names=[]):",
+             "    names.append(name)",
+             "    return names",
+         ]
+     ),
+     "bare_except": "\n".join(
+         [
+             "def publish_report(report):",
+             "    try:",
+             '        return report[\"summary\"]',
+             "    except:",
+             "        return None",
+         ]
+     ),
+     "shell_injection": "\n".join(
+         [
+             "import subprocess",
+             "",
+             "def run_script(script_path, user_input):",
+             '    cmd = f\"python {script_path} {user_input}\"',
+             "    return subprocess.check_output(cmd, shell=True, text=True)",
+         ]
+     ),
+     "syntax_error": "\n".join(
+         [
+             "def broken_function(",
+             "    return 42",
+         ]
+     ),
+     "clean_function": "\n".join(
+         [
+             "def normalize_name(name: str) -> str:",
+             "    cleaned = name.strip().lower()",
+             "    return cleaned.replace(\" \", \"_\")",
+         ]
+     ),
+ }
+
+
+ EXPECTED_RULE_IDS = {
+     "unsafe_eval": {"avoid-eval"},
+     "mutable_default": {"mutable-default-list"},
+     "bare_except": {"bare-except"},
+     "shell_injection": {"shell-true-command-injection"},
+     "syntax_error": {"syntax-error"},
+     "clean_function": set(),
+ }
graders/__init__.py ADDED
@@ -0,0 +1,16 @@
+ """Deterministic graders for the Python code review environment."""
+
+ from .common import clamp_score
+ from .optimization import grade_optimization_task
+ from .pytest_runner import PytestExecution, run_pytest_suite
+ from .syntax import grade_bug_fix_task, grade_syntax_task, grade_task
+
+ __all__ = [
+     "PytestExecution",
+     "clamp_score",
+     "grade_bug_fix_task",
+     "grade_optimization_task",
+     "grade_syntax_task",
+     "grade_task",
+     "run_pytest_suite",
+ ]
graders/common.py ADDED
@@ -0,0 +1,82 @@
+ """Shared deterministic scoring helpers."""
+
+ from __future__ import annotations
+
+ import ast
+ import difflib
+ import traceback
+ from typing import Tuple
+
+
+ def clamp_score(value: float) -> float:
+     """Clamp any scalar score into the required 0..1 interval."""
+
+     return max(0.0, min(1.0, round(value, 6)))
+
+
+ def syntax_error_message(code: str) -> str:
+     """Return a concise syntax error string or an empty string."""
+
+     try:
+         ast.parse(code)
+     except SyntaxError as exc:
+         return f"{exc.msg} (line {exc.lineno}, column {exc.offset})"
+     except Exception:  # pragma: no cover
+         return traceback.format_exc(limit=1).strip()
+     return ""
+
+
+ def compiles(code: str) -> bool:
+     """Return whether the code parses and compiles."""
+
+     try:
+         compile(code, "<candidate>", "exec")
+     except Exception:
+         return False
+     return True
+
+
+ def normalized_diff_score(code: str, reference_code: str) -> float:
+     """Score textual similarity to the reference solution."""
+
+     ratio = difflib.SequenceMatcher(
+         a="".join(code.split()),
+         b="".join(reference_code.split()),
+     ).ratio()
+     return clamp_score(ratio)
+
+
+ def style_score(code: str, max_line_length: int = 88) -> float:
+     """Simple deterministic PEP8-inspired style score."""
+
+     lines = code.splitlines() or [""]
+     line_length_ok = sum(1 for line in lines if len(line) <= max_line_length) / len(lines)
+     tab_ok = 1.0 if all("\t" not in line for line in lines) else 0.0
+     trailing_ws_ok = 1.0 if all(line == line.rstrip() for line in lines) else 0.0
+     return clamp_score((line_length_ok * 0.6) + (tab_ok * 0.2) + (trailing_ws_ok * 0.2))
+
+
+ def nested_loop_depth(tree: ast.AST) -> int:
+     """Return the maximum nested loop depth in the AST."""
+
+     best = 0
+
+     def walk(node: ast.AST, depth: int) -> None:
+         nonlocal best
+         if isinstance(node, (ast.For, ast.AsyncFor, ast.While)):
+             depth += 1
+             best = max(best, depth)
+         for child in ast.iter_child_nodes(node):
+             walk(child, depth)
+
+     walk(tree, 0)
+     return best
+
+
+ def compile_tree(code: str) -> Tuple[ast.AST | None, str]:
+     """Return AST tree and optional parse error."""
+
+     try:
+         return ast.parse(code), ""
+     except SyntaxError as exc:
+         return None, f"{exc.msg} (line {exc.lineno}, column {exc.offset})"
graders/optimization.py ADDED
@@ -0,0 +1,163 @@
+ """Deterministic grading for optimization and refactor tasks."""
+
+ from __future__ import annotations
+
+ import json
+ import subprocess
+ import sys
+ import tempfile
+ from pathlib import Path
+
+ from graders.common import clamp_score, compile_tree, nested_loop_depth, style_score
+ from graders.pytest_runner import run_pytest_suite
+ from models import TaskGrade
+ from tasks.task_bank import TaskSpec
+
+
+ def _benchmark_script(task: TaskSpec) -> str:
+     return f"""import json
+ import time
+ from candidate import {task.benchmark_entrypoint}
+
+ {task.benchmark_builder}
+
+ events = build_benchmark_events()
+ start = time.perf_counter()
+ for _ in range({task.benchmark_repeats}):
+     result = {task.benchmark_entrypoint}(events)
+ elapsed = time.perf_counter() - start
+ Path = __import__("pathlib").Path
+ Path("benchmark.json").write_text(json.dumps({{"elapsed": elapsed, "rows": len(result)}}), encoding="utf-8")
+ """
+
+
+ def benchmark_runtime(candidate_code: str, task: TaskSpec) -> tuple[float, bool, str]:
+     """Benchmark runtime deterministically against the starter implementation."""
+
+     assert task.benchmark_entrypoint is not None
+     with tempfile.TemporaryDirectory(prefix="python-code-review-bench-") as temp_dir:
+         temp_path = Path(temp_dir)
+         (temp_path / "candidate.py").write_text(candidate_code, encoding="utf-8")
+         (temp_path / "starter.py").write_text(task.starter_code, encoding="utf-8")
+         (temp_path / "candidate_runner.py").write_text(_benchmark_script(task), encoding="utf-8")
+
+         starter_script = _benchmark_script(task).replace("from candidate import", "from starter import")
+         (temp_path / "starter_runner.py").write_text(starter_script, encoding="utf-8")
+
+         try:
+             starter_run = subprocess.run(
+                 [sys.executable, "starter_runner.py"],
+                 cwd=temp_path,
+                 capture_output=True,
+                 text=True,
+                 timeout=task.benchmark_timeout_s,
+                 check=False,
+             )
+             starter_payload = json.loads((temp_path / "benchmark.json").read_text(encoding="utf-8"))
+
+             candidate_run = subprocess.run(
+                 [sys.executable, "candidate_runner.py"],
+                 cwd=temp_path,
+                 capture_output=True,
+                 text=True,
+                 timeout=task.benchmark_timeout_s,
+                 check=False,
+             )
+             candidate_payload = json.loads((temp_path / "benchmark.json").read_text(encoding="utf-8"))
+         except subprocess.TimeoutExpired as exc:
+             output = (exc.stdout or "") + (exc.stderr or "")
+             return 0.0, True, (output or "benchmark timed out").strip()
+         except Exception as exc:  # pragma: no cover
+             return 0.0, False, str(exc)
+
+     starter_elapsed = max(float(starter_payload["elapsed"]), 1e-9)
+     candidate_elapsed = max(float(candidate_payload["elapsed"]), 1e-9)
+     speedup = starter_elapsed / candidate_elapsed
+     runtime_score = clamp_score(min((speedup - 1.0) / 3.0, 1.0))
+     output = "\n".join(
+         part
+         for part in [
+             starter_run.stdout.strip(),
+             starter_run.stderr.strip(),
+             candidate_run.stdout.strip(),
+             candidate_run.stderr.strip(),
+             f"starter={starter_elapsed:.6f}s candidate={candidate_elapsed:.6f}s speedup={speedup:.2f}x",
+         ]
+         if part
+     )
+     return runtime_score, False, output
+
+
+ def ast_quality_score(code: str, task: TaskSpec) -> float:
+     """Score maintainability and algorithmic structure."""
+
+     tree, parse_error = compile_tree(code)
+     if tree is None:
+         return 0.0
+
+     import ast
+
+     function_node = next(
+         (node for node in tree.body if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))),
+         None,
+     )
+     docstring_points = 0.2 if function_node and ast.get_docstring(function_node, clean=False) else 0.0
+     nested_points = 0.4 if nested_loop_depth(tree) <= 1 else 0.0
+     marker_points = 0.0
+     for marker in task.expected_quality_markers:
+         if marker in code:
+             marker_points += 0.2
+     return clamp_score(docstring_points + nested_points + marker_points)
+
+
+ def grade_optimization_task(candidate_code: str, task: TaskSpec) -> TaskGrade:
+     """Grade optimization tasks using correctness, runtime, AST quality, and style."""
+
+     execution = run_pytest_suite(
+         candidate_code,
+         [*task.visible_tests, *task.hidden_tests],
+         timeout_s=task.benchmark_timeout_s,
+     )
+     test_fraction = execution.passed / execution.total if execution.total else 0.0
+
+     if execution.timed_out:
+         return TaskGrade(
+             score=0.0,
+             tests_passed=execution.passed,
+             tests_total=execution.total,
+             timed_out=True,
+             details={"tests": execution.output},
+         )
+
+     runtime_score, timed_out, benchmark_output = benchmark_runtime(candidate_code, task)
+     if timed_out:
+         return TaskGrade(
+             score=0.0,
+             tests_passed=execution.passed,
+             tests_total=execution.total,
+             timed_out=True,
+             details={"tests": execution.output, "benchmark": benchmark_output},
+         )
+
+     quality_score = ast_quality_score(candidate_code, task)
+     pep8_score = style_score(candidate_code, task.style_max_line_length)
+     score = clamp_score(
+         (0.5 * test_fraction)
+         + (0.3 * runtime_score)
+         + (0.15 * quality_score)
+         + (0.05 * pep8_score)
+     )
+     return TaskGrade(
+         score=score,
+         syntax_score=1.0,
+         tests_passed=execution.passed,
+         tests_total=execution.total,
+         quality_score=quality_score,
+         details={
+             "tests": execution.output,
+             "benchmark": benchmark_output,
+             "test_fraction": round(test_fraction, 4),
+             "runtime_score": round(runtime_score, 4),
+             "style_score": round(pep8_score, 4),
+         },
+     )
graders/pytest_runner.py ADDED
@@ -0,0 +1,108 @@
+ """Helpers for deterministic pytest execution in temp sandboxes."""
+
+ from __future__ import annotations
+
+ import json
+ import subprocess
+ import sys
+ import tempfile
+ from dataclasses import dataclass
+ from pathlib import Path
+ from typing import Iterable
+
+
+ @dataclass(frozen=True)
+ class PytestExecution:
+     """Exact pytest execution summary."""
+
+     passed: int
+     failed: int
+     total: int
+     timed_out: bool
+     output: str
+
+
+ def _runner_script() -> str:
+     return """import json
+ import pathlib
+ import pytest
+
+
+ class Collector:
+     def __init__(self) -> None:
+         self.passed = 0
+         self.failed = 0
+
+     def pytest_runtest_logreport(self, report):
+         if report.when != "call":
+             return
+         if report.passed:
+             self.passed += 1
+         elif report.failed:
+             self.failed += 1
+
+
+ collector = Collector()
+ exit_code = pytest.main(["-q", "test_candidate.py"], plugins=[collector])
+ payload = {
+     "passed": collector.passed,
+     "failed": collector.failed,
+     "exit_code": int(exit_code),
+ }
+ pathlib.Path("pytest_results.json").write_text(json.dumps(payload), encoding="utf-8")
+ """
+
+
+ def run_pytest_suite(candidate_code: str, tests: Iterable[str], timeout_s: float = 3.0) -> PytestExecution:
+     """Run a pytest suite against candidate.py and return structured results."""
+
+     test_cases = list(tests)
+     with tempfile.TemporaryDirectory(prefix="python-code-review-") as temp_dir:
+         temp_path = Path(temp_dir)
+         (temp_path / "candidate.py").write_text(candidate_code, encoding="utf-8")
+         (temp_path / "test_candidate.py").write_text("\n\n".join(test_cases), encoding="utf-8")
+         (temp_path / "runner.py").write_text(_runner_script(), encoding="utf-8")
+
+         try:
+             completed = subprocess.run(
+                 [sys.executable, "runner.py"],
+                 cwd=temp_path,
+                 capture_output=True,
+                 text=True,
+                 timeout=timeout_s,
+                 check=False,
+             )
+         except subprocess.TimeoutExpired as exc:
+             output = (exc.stdout or "") + (exc.stderr or "")
+             return PytestExecution(
+                 passed=0,
+                 failed=max(len(test_cases), 1),
+                 total=max(len(test_cases), 1),
+                 timed_out=True,
+                 output=(output or "pytest timed out").strip(),
+             )
+
+         result_path = temp_path / "pytest_results.json"
+         if not result_path.exists():
+             output = (completed.stdout or "") + (completed.stderr or "")
+             total = max(len(test_cases), 1)
+             return PytestExecution(
+                 passed=0,
+                 failed=total,
+                 total=total,
+                 timed_out=False,
+                 output=output.strip(),
+             )
+
+         payload = json.loads(result_path.read_text(encoding="utf-8"))
+         passed = int(payload.get("passed", 0))
+         failed = int(payload.get("failed", 0))
+         total = max(passed + failed, len(test_cases))
+         output = ((completed.stdout or "") + (completed.stderr or "")).strip()
+         return PytestExecution(
+             passed=passed,
+             failed=failed,
+             total=total,
+             timed_out=False,
+             output=output,
+         )
graders/syntax.py ADDED
@@ -0,0 +1,78 @@
+ """Task graders for syntax and bug-fix tasks."""
+
+ from __future__ import annotations
+
+ from graders.common import clamp_score, compiles, normalized_diff_score, style_score, syntax_error_message
+ from graders.optimization import grade_optimization_task
+ from graders.pytest_runner import run_pytest_suite
+ from models import TaskGrade
+ from tasks.task_bank import TaskSpec
+
+
+ def grade_syntax_task(candidate_code: str, task: TaskSpec) -> TaskGrade:
+     """Grade syntax repair tasks with partial credit for progress toward the reference."""
+
+     error = syntax_error_message(candidate_code)
+     diff_score = normalized_diff_score(candidate_code, task.reference_code)
+     style_base = style_score(candidate_code, task.style_max_line_length)
+
+     if not error:
+         return TaskGrade(
+             score=1.0,
+             syntax_score=1.0,
+             quality_score=style_base,
+             details={"compile_error": ""},
+         )
+
+     partial = clamp_score(0.15 + (0.55 * diff_score))
+     return TaskGrade(
+         score=partial,
+         syntax_score=0.0,
+         quality_score=diff_score * style_base,
+         details={"compile_error": error},
+     )
+
+
+ def grade_bug_fix_task(candidate_code: str, task: TaskSpec, include_hidden: bool = True) -> TaskGrade:
+     """Grade logic bug tasks with pytest pass fraction."""
+
+     if not compiles(candidate_code):
+         error = syntax_error_message(candidate_code)
+         return TaskGrade(score=0.0, syntax_score=0.0, details={"compile_error": error})
+
+     tests = list(task.visible_tests)
+     if include_hidden:
+         tests.extend(task.hidden_tests)
+
+     execution = run_pytest_suite(candidate_code, tests, timeout_s=3.0)
+     if execution.timed_out:
+         return TaskGrade(
+             score=0.0,
+             syntax_score=1.0,
+             tests_passed=execution.passed,
+             tests_total=execution.total,
+             timed_out=True,
+             details={"compile_error": "", "tests": execution.output},
+         )
+
+     pass_fraction = execution.passed / execution.total if execution.total else 0.0
+     quality = style_score(candidate_code, task.style_max_line_length)
+
+     return TaskGrade(
+         score=clamp_score(pass_fraction),
+         syntax_score=1.0,
+         tests_passed=execution.passed,
+         tests_total=execution.total,
+         quality_score=quality,
+         details={"compile_error": "", "tests": execution.output},
+     )
+
+
+ def grade_task(candidate_code: str, task: TaskSpec, include_hidden: bool = True) -> TaskGrade:
+     """Dispatch to the correct deterministic grader for one task."""
+
+     if task.task_kind == "syntax_fix":
+         return grade_syntax_task(candidate_code, task)
+     if task.task_kind == "bug_fix":
+         return grade_bug_fix_task(candidate_code, task, include_hidden=include_hidden)
+     return grade_optimization_task(candidate_code, task)
inference.py ADDED
@@ -0,0 +1,287 @@
+ #!/usr/bin/env python3
+ """
+ Baseline inference script for python_code_review_env.
+
+ Demonstrates how to run an OpenEnv environment using an OpenAI-compatible API,
+ supporting free/open models like Gemini, DeepSeek, Together AI, OpenRouter, etc.
+
+ Usage:
+     # Using Gemini (free tier)
+     export OPENAI_API_KEY="your-gemini-api-key"
+     python inference.py --base-url "https://generativelanguage.googleapis.com/openai/" --model "gemini-2.0-flash"
+
+     # Using DeepSeek (free tier)
+     export OPENAI_API_KEY="your-deepseek-api-key"
+     python inference.py --base-url "https://api.deepseek.com" --model "deepseek-chat"
+
+     # Using Together AI
+     export OPENAI_API_KEY="your-together-api-key"
+     python inference.py --base-url "https://api.together.xyz/v1" --model "deepseek-ai/deepseek-chat"
+
+     # Using local OpenAI (default)
+     python inference.py --base-url "http://localhost:8000/v1" --model "gpt-3.5-turbo"
+ """
+
+ from __future__ import annotations
+
+ import argparse
+ import json
+ import os
+ import sys
+ from typing import Optional
+
+ from openai import OpenAI
+
+ # Import environment and models
+ from server.env import PythonCodeReviewEnvironment
+ from models import (
+     PythonCodeReviewAction,
+     PythonCodeReviewObservation,
+ )
+ from tasks import task_ids
+
+
+ def get_model_config(base_url: Optional[str], model: str, api_key: Optional[str]) -> tuple[str, str, str]:
+     """Determine API configuration from environment or arguments."""
+
+     # API key
+     final_api_key = api_key or os.getenv("OPENAI_API_KEY", "")
+     if not final_api_key:
+         print("Warning: OPENAI_API_KEY not set. Using dummy key for local testing.")
+         final_api_key = "sk-test"
+
+     # Base URL
+     final_base_url = base_url or os.getenv("OPENAI_API_BASE", "http://localhost:8000/v1")
+
+     # Model
+     final_model = model or os.getenv("MODEL_NAME", "gpt-3.5-turbo")
+
+     return final_base_url, final_model, final_api_key
+
+
+ def build_prompt_for_task(observation: PythonCodeReviewObservation) -> str:
+     """Construct a task-specific prompt for the LLM."""
+
+     return f"""You are an expert Python code reviewer. Your job is to fix and improve Python code.
+
+ TASK: {observation.task_description}
+
+ DIFFICULTY: {observation.difficulty.upper()}
+
+ VISIBLE TEST CASES:
+ {chr(10).join(f"- {test}" for test in observation.visible_tests) or "- No visible tests"}
+
+ CURRENT CODE:
+ ```python
+ {observation.current_code}
+ ```
+
+ {f"ERRORS: {observation.errors}" if observation.errors else ""}
+
+ {f"TEST RESULTS: {observation.test_results}" if observation.test_results else ""}
+
+ You have {observation.attempts_remaining} attempts left.
+ Current score: {observation.score:.3f}
+
+ Analyze the code and decide what to do next:
+ 1. If you see syntax errors, provide fixed code
+ 2. If tests are failing, analyze why and fix the logic
+ 3. If the code looks good, submit your solution
+ 4. For optimization tasks, improve efficiency while keeping tests passing
+
+ Respond ONLY with a JSON object in this exact format (no markdown, no backticks):
+ {{
+   "action_type": "analyze_code|edit_code|run_tests|submit_solution",
+   "code": "...only if action_type is edit_code...",
+   "reasoning": "brief explanation"
+ }}
+ """
+
+
+ def run_task_episode(
+     env: PythonCodeReviewEnvironment,
+     task_id: str,
+     client: OpenAI,
+     model: str,
+     max_steps: int = 10,
+     verbose: bool = True,
+ ) -> float:
+     """Run one complete task episode and return the score."""
+
+     # Reset environment for this task
+     observation = env.reset(task_id=task_id)
+     total_reward = 0.0
+     step_count = 0
+
+     if verbose:
+         print(f"\n{'='*70}")
+         print(f"TASK: {task_id} ({observation.difficulty})")
+         print(f"{'='*70}")
+
+     while not observation.done and step_count < max_steps:
+         step_count += 1
+
+         # Get action from LLM
+         try:
+             prompt = build_prompt_for_task(observation)
+
+             response = client.chat.completions.create(
+                 model=model,
+                 messages=[{"role": "user", "content": prompt}],
+                 temperature=0.7,
+                 max_tokens=2000,
+             )
+
+             response_text = response.choices[0].message.content or ""
+
+             # Try to parse JSON from the response
+             try:
+                 # Find JSON in the response
+                 json_start = response_text.find("{")
+                 json_end = response_text.rfind("}") + 1
+                 if json_start >= 0 and json_end > json_start:
+                     json_str = response_text[json_start:json_end]
+                     action_dict = json.loads(json_str)
+                 else:
+                     raise ValueError("No JSON found in response")
+             except (json.JSONDecodeError, ValueError) as e:
148
+ if verbose:
149
+ print(f"Step {step_count}: Failed to parse response: {e}")
150
+ print(f"Response: {response_text[:200]}")
151
+ # Fallback to analyze_code
152
+ action_dict = {"action_type": "analyze_code"}
153
+
154
+ # Build action
155
+ action = PythonCodeReviewAction(
156
+ action_type=action_dict.get("action_type", "analyze_code"),
157
+ code=action_dict.get("code"),
158
+ )
159
+
160
+ except Exception as e:
161
+ if verbose:
162
+ print(f"Step {step_count}: Error getting LLM response: {e}")
163
+ # Fallback action
164
+ action = PythonCodeReviewAction(action_type="analyze_code")
165
+
166
+ # Execute action
167
+ observation = env.step(action)
168
+ total_reward += observation.reward.value
169
+
170
+ if verbose:
171
+ print(f"Step {step_count}: {action.action_type}")
172
+ if observation.reward.value != 0:
173
+ print(f" Reward: {observation.reward.value:+.4f} ({observation.reward.reason})")
174
+ if observation.errors:
175
+ print(f" Errors: {observation.errors}")
176
+ if observation.test_results:
177
+ print(f" Tests: {observation.test_results}")
178
+
179
+ final_score = observation.score
180
+ if verbose:
181
+ print(f"\nFinal Score: {final_score:.3f} (Total Reward: {total_reward:.4f})")
182
+
183
+ return final_score
184
+
185
+
186
+ def main(args: Optional[list[str]] = None) -> None:
187
+ """Run baseline evaluation on all tasks."""
188
+
189
+ parser = argparse.ArgumentParser(
190
+ description="Baseline inference for python_code_review_env",
191
+ formatter_class=argparse.RawDescriptionHelpFormatter,
192
+ epilog=__doc__,
193
+ )
194
+ parser.add_argument(
195
+ "--base-url",
196
+ default=None,
197
+ help="API base URL (default: OPENAI_API_BASE or http://localhost:8000/v1)",
198
+ )
199
+ parser.add_argument(
200
+ "--model",
201
+ default=None,
202
+ help="Model name (default: MODEL_NAME or gpt-3.5-turbo)",
203
+ )
204
+ parser.add_argument(
205
+ "--api-key",
206
+ default=None,
207
+ help="API key (default: OPENAI_API_KEY)",
208
+ )
209
+ parser.add_argument(
210
+ "--task",
211
+ default=None,
212
+ help="Run single task instead of all",
213
+ )
214
+ parser.add_argument(
215
+ "--quiet",
216
+ action="store_true",
217
+ help="Minimize output",
218
+ )
219
+ parser.add_argument(
220
+ "--max-steps",
221
+ type=int,
222
+ default=10,
223
+ help="Max steps per episode",
224
+ )
225
+
226
+ parsed = parser.parse_args(args)
227
+
228
+ # Get configuration
229
+ base_url, model, api_key = get_model_config(
230
+ parsed.base_url,
231
+ parsed.model,
232
+ parsed.api_key,
233
+ )
234
+
235
+ print(f"Configuration:")
236
+ print(f" Base URL: {base_url}")
237
+ print(f" Model: {model}")
238
+ print(f" Max steps per episode: {parsed.max_steps}")
239
+ print()
240
+
241
+ # Initialize client
242
+ try:
243
+ client = OpenAI(api_key=api_key, base_url=base_url)
244
+ # Test connection
245
+ client.models.list()
246
+ except Exception as e:
247
+ print(f"Warning: Could not verify API connection: {e}")
248
+ print("Proceeding anyway...")
249
+
250
+ # Initialize environment
251
+ env = PythonCodeReviewEnvironment()
252
+
253
+ # Run task(s)
254
+ tasks_to_run = [parsed.task] if parsed.task else list(task_ids())
255
+ scores = {}
256
+
257
+ for task_id in tasks_to_run:
258
+ try:
259
+ score = run_task_episode(
260
+ env,
261
+ task_id,
262
+ client,
263
+ model,
264
+ max_steps=parsed.max_steps,
265
+ verbose=not parsed.quiet,
266
+ )
267
+ scores[task_id] = score
268
+ except Exception as e:
269
+ print(f"Error running task {task_id}: {e}")
270
+ scores[task_id] = 0.0
271
+
272
+ # Print summary
273
+ print(f"\n{'='*70}")
274
+ print("SUMMARY")
275
+ print(f"{'='*70}")
276
+ for task_id, score in scores.items():
277
+ print(f"{task_id:30s} : {score:.3f}")
278
+
279
+ if len(scores) > 1:
280
+ avg_score = sum(scores.values()) / len(scores)
281
+ print(f"{'Average Score':30s} : {avg_score:.3f}")
282
+
283
+ return 0 if all(s > 0 for s in scores.values()) else 1
284
+
285
+
286
+ if __name__ == "__main__":
287
+ sys.exit(main())
models.py ADDED
@@ -0,0 +1,109 @@
+ """Typed models for the Python code review and repair environment."""
+
+ from __future__ import annotations
+
+ from typing import Any, Dict, List, Literal, Optional
+
+ from pydantic import BaseModel, Field
+
+ from openenv.core.env_server.types import Action, Observation, State
+
+
+ Difficulty = Literal["easy", "medium", "hard"]
+ TaskKind = Literal["syntax_fix", "bug_fix", "optimization"]
+ ActionType = Literal["analyze_code", "edit_code", "run_tests", "submit_solution"]
+
+
+ class HistoryEntry(BaseModel):
+     """Record of one action taken during an episode."""
+
+     step: int = Field(..., ge=0)
+     action_type: ActionType
+     status: str = Field(..., description="Outcome message")
+     reward: float = Field(...)
+
+
+ class RewardDetails(BaseModel):
+     """Detailed reward breakdown for transparency."""
+
+     value: float = Field(..., description="Net scalar reward for this step")
+     syntax_reward: float = Field(default=0.0, description="Bonus for fixing syntax")
+     test_reward: float = Field(default=0.0, description="Reward from passing tests")
+     quality_bonus: float = Field(default=0.0, description="Bonus for code quality improvements")
+     correctness_bonus: float = Field(default=0.0, description="Bonus for full correctness")
+     invalid_action_penalty: float = Field(default=0.0, description="Penalty for invalid actions")
+     timeout_penalty: float = Field(default=0.0, description="Penalty for timeouts")
+     reason: str = Field(..., description="Explanation of the reward")
+
+
+ class PythonCodeReviewAction(Action):
+     """Action space for the code review environment."""
+
+     action_type: ActionType = Field(..., description="Type of action to perform")
+     code: Optional[str] = Field(default=None, description="New code for edit_code actions")
+
+
+ class PythonCodeReviewObservation(Observation):
+     """Observation returned by reset() and step()."""
+
+     task_id: str = Field(..., description="Current task identifier")
+     difficulty: Difficulty = Field(..., description="Task difficulty level")
+     task_description: str = Field(..., description="Detailed task description")
+     current_code: str = Field(..., description="Current code state")
+     errors: str = Field(..., description="Syntax/compilation errors, if any")
+     test_results: str = Field(..., description="Results from test execution")
+     visible_tests: List[str] = Field(default_factory=list, description="Public test cases")
+     history: List[HistoryEntry] = Field(default_factory=list, description="Action history")
+     attempts_remaining: int = Field(..., ge=0, description="Actions left in the episode")
+     score: float = Field(..., ge=0.0, le=1.0, description="Current episode score")
+     reward: RewardDetails = Field(default_factory=lambda: RewardDetails(value=0.0, reason="Reset"))
+
+
+ class PythonCodeReviewState(State):
+     """Exposed environment state."""
+
+     episode_id: str = Field(..., description="Unique episode identifier")
+     step_count: int = Field(default=0, ge=0)
+     task_id: Optional[str] = Field(default=None)
+     difficulty: Optional[Difficulty] = Field(default=None)
+     task_kind: Optional[TaskKind] = Field(default=None)
+     attempts_remaining: int = Field(default=0, ge=0)
+     current_code: str = Field(default="")
+     errors: str = Field(default="")
+     test_results: str = Field(default="")
+     history: List[HistoryEntry] = Field(default_factory=list)
+     score: float = Field(default=0.0, ge=0.0, le=1.0)
+     done: bool = Field(default=False)
+
+
+ class TaskDescriptor(BaseModel):
+     """Public task metadata."""
+
+     task_id: str = Field(..., description="Stable task identifier")
+     title: str = Field(..., description="Human-readable title")
+     difficulty: Difficulty = Field(..., description="Difficulty level")
+     task_kind: TaskKind = Field(..., description="Type of task")
+     task_description: str = Field(..., description="Full task description")
+     starter_code: str = Field(..., description="Initial broken code")
+     visible_tests: List[str] = Field(default_factory=list, description="Public test cases")
+     max_steps: int = Field(..., ge=1, description="Maximum steps allowed")
+
+
+ class TaskGrade(BaseModel):
+     """Grading result for a task submission."""
+
+     score: float = Field(..., ge=0.0, le=1.0, description="Overall score")
+     syntax_score: float = Field(default=0.0, ge=0.0, le=1.0)
+     tests_passed: int = Field(default=0, ge=0)
+     tests_total: int = Field(default=0, ge=0)
+     quality_score: float = Field(default=0.0, ge=0.0, le=1.0)
+     timed_out: bool = Field(default=False)
+     details: Dict[str, Any] = Field(default_factory=dict)
+
+
+ class HealthResponse(BaseModel):
+     """Health check response."""
+
+     status: Literal["ok"] = "ok"
+     environment: str = "python_code_review_env"
+     task_count: int = Field(default=0, ge=0)
openenv.yaml ADDED
@@ -0,0 +1,20 @@
+ spec_version: 1
+ name: python_code_review_env
+ type: space
+ runtime: fastapi
+ app: server.app:app
+ port: 8000
+
+ metadata:
+   description: "Production-grade Python code review and repair benchmark for OpenEnv"
+   domain: code-review
+   task_count: 3
+   task_ids:
+     - syntax-fix-easy
+     - bug-fix-medium
+     - optimization-hard
+   difficulty_levels:
+     - easy
+     - medium
+     - hard
+
pyproject.toml ADDED
@@ -0,0 +1,33 @@
+ [build-system]
+ requires = ["setuptools>=45", "wheel"]
+ build-backend = "setuptools.build_meta"
+
+ [project]
+ name = "openenv-python_env"
+ version = "0.2.0"
+ description = "Deterministic Python code review and repair benchmark environment for OpenEnv"
+ requires-python = ">=3.10"
+ dependencies = [
+     "openenv-core[core]>=0.2.2",
+     "fastapi>=0.115.0",
+     "uvicorn>=0.30.0",
+     "openai>=1.40.0",
+     "pytest>=8.0.0",
+ ]
+
+ [project.optional-dependencies]
+ dev = [
+     "pytest>=8.0.0",
+     "pytest-cov>=4.0.0",
+ ]
+
+ [project.scripts]
+ server = "python_env.server.app:main"
+
+ [tool.setuptools]
+ include-package-data = true
+ packages = ["python_env", "python_env.server"]
+ package-dir = { "python_env" = ".", "python_env.server" = "server" }
+
+ [tool.pytest.ini_options]
+ testpaths = ["tests"]
server/__init__.py ADDED
@@ -0,0 +1,5 @@
+ """Server exports for the Python code review environment."""
+
+ from .code_review_environment import CodeReviewEnvironment, PythonCodeReviewEnvironment, PythonEnvironment
+
+ __all__ = ["PythonEnvironment", "PythonCodeReviewEnvironment", "CodeReviewEnvironment"]
server/app.py ADDED
@@ -0,0 +1,97 @@
+ """FastAPI application for the Python code review environment."""
+
+ from __future__ import annotations
+
+ import os
+
+ from fastapi import APIRouter, HTTPException
+ from fastapi.responses import RedirectResponse
+
+ from openenv.core.env_server.http_server import create_app
+
+ from models import (
+     HealthResponse,
+     PythonCodeReviewAction,
+     PythonCodeReviewObservation,
+     TaskDescriptor,
+     TaskGrade,
+ )
+ from server.env import PythonCodeReviewEnvironment
+
+
+ MAX_CONCURRENT_ENVS = int(os.getenv("MAX_CONCURRENT_ENVS", "16"))
+
+ # Separate instance used only for the metadata endpoints below;
+ # create_app manages its own per-session environments.
+ python_env = PythonCodeReviewEnvironment()
+ app = create_app(
+     PythonCodeReviewEnvironment,
+     PythonCodeReviewAction,
+     PythonCodeReviewObservation,
+     max_concurrent_envs=MAX_CONCURRENT_ENVS,
+ )
+ router = APIRouter(tags=["python-code-review"])
+
+
+ @router.get("/", include_in_schema=False)
+ def root() -> RedirectResponse:
+     """Redirect root to the API documentation."""
+     return RedirectResponse(url="/docs")
+
+
+ @router.get("/health", response_model=HealthResponse)
+ def health() -> HealthResponse:
+     """Health check endpoint for deployment monitoring."""
+     return python_env.health()
+
+
+ @router.get("/tasks", response_model=list[TaskDescriptor])
+ def list_tasks() -> list[TaskDescriptor]:
+     """List all available deterministic tasks."""
+     return python_env.list_task_summaries()
+
+
+ @router.get("/tasks/{task_id}", response_model=TaskDescriptor)
+ def get_task(task_id: str) -> TaskDescriptor:
+     """Get a specific task by ID."""
+     try:
+         return python_env.get_task(task_id)
+     except ValueError as exc:
+         raise HTTPException(status_code=404, detail=str(exc)) from exc
+
+
+ @router.post("/tasks/{task_id}/grade", response_model=TaskGrade)
+ def grade_task(task_id: str, payload: PythonCodeReviewAction) -> TaskGrade:
+     """Grade a code submission for a task without running an episode."""
+     if payload.action_type != "edit_code" or not payload.code:
+         raise HTTPException(
+             status_code=400,
+             detail="Requires action_type='edit_code' with a code parameter.",
+         )
+     try:
+         return python_env.grade_task_submission(task_id=task_id, code=payload.code)
+     except ValueError as exc:
+         raise HTTPException(status_code=404, detail=str(exc)) from exc
+
+
+ @router.post("/state", include_in_schema=False)
+ def get_state_post() -> RedirectResponse:
+     """Redirect POST /state to GET /state for compatibility."""
+     return RedirectResponse(url="/state", status_code=303)
+
+
+ app.include_router(router)
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000) -> None:
+     """Run the FastAPI application with uvicorn."""
+     import uvicorn
+
+     uvicorn.run(
+         app,
+         host=os.getenv("HOST", host),
+         port=int(os.getenv("PORT", str(port))),
+     )
+
+
+ if __name__ == "__main__":
+     main()
+
server/code_review_env_environment.py ADDED
@@ -0,0 +1,9 @@
+ """Compatibility shim for older imports."""
+
+ try:
+     from server.code_review_environment import CodeReviewEnvironment
+ except ModuleNotFoundError:  # pragma: no cover
+     from .code_review_environment import CodeReviewEnvironment
+
+
+ __all__ = ["CodeReviewEnvironment"]
server/code_review_environment.py ADDED
@@ -0,0 +1,5 @@
+ """Compatibility wrapper for older imports."""
+
+ from .env import CodeReviewEnvironment, PythonCodeReviewEnvironment, PythonEnvironment
+
+ __all__ = ["CodeReviewEnvironment", "PythonCodeReviewEnvironment", "PythonEnvironment"]
server/env.py ADDED
@@ -0,0 +1,640 @@
+ """Core OpenEnv environment for Python code review and repair tasks."""
+
+ from __future__ import annotations
+
+ from typing import List, Optional
+ from uuid import uuid4
+
+ from openenv.core.env_server.interfaces import Environment
+
+ from graders import grade_task
+ from models import (
+     HealthResponse,
+     HistoryEntry,
+     PythonCodeReviewAction,
+     PythonCodeReviewObservation,
+     PythonCodeReviewState,
+     RewardDetails,
+     TaskGrade,
+ )
+ from tasks import TaskSpec, get_task, list_task_descriptors, list_task_summaries, task_ids
+
+
+ # Reward shaping constants
+ INVALID_ACTION_PENALTY = 0.1
+ QUALITY_BONUS_SCALE = 0.15
+
+
+ class PythonCodeReviewEnvironment(
+     Environment[PythonCodeReviewAction, PythonCodeReviewObservation, PythonCodeReviewState]
+ ):
+     """Production-style environment for reviewing and fixing Python code."""
+
+     SUPPORTS_CONCURRENT_SESSIONS = True
+
+     def __init__(self) -> None:
+         super().__init__()
+         self._task_order = list(task_ids())
+         self._task_cursor = -1
+         self._task: Optional[TaskSpec] = None
+         # episode_id is a required field, so seed it with a placeholder episode
+         self._state = PythonCodeReviewState(episode_id=str(uuid4()))
+         self._done = False
+         self._last_status = "Call reset() to start."
+         self._last_reward = RewardDetails(value=0.0, reason="Environment initialized.")
+         self._best_visible_test_fraction = 0.0
+         self._best_quality_score = 0.0
+         self._full_correctness_awarded = False
+         self._syntax_reward_awarded = False
+
+     def reset(
+         self,
+         seed: Optional[int] = None,
+         episode_id: Optional[str] = None,
+         task_id: Optional[str] = None,
+         **_: object,
+     ) -> PythonCodeReviewObservation:
+         """Reset the environment to the next deterministic task."""
+
+         del seed
+
+         # Select task
+         if task_id:
+             self._task = get_task(task_id)
+             self._task_cursor = self._task_order.index(task_id)
+         else:
+             self._task_cursor = (self._task_cursor + 1) % len(self._task_order)
+             self._task = get_task(self._task_order[self._task_cursor])
+
+         # Reset episode state
+         self._done = False
+         self._best_visible_test_fraction = 0.0
+         self._best_quality_score = 0.0
+         self._full_correctness_awarded = False
+         self._syntax_reward_awarded = False
+         self._last_status = "Inspect the code, edit it, run tests, then submit."
+         self._last_reward = RewardDetails(value=0.0, reason="Episode reset.")
+
+         self._state = PythonCodeReviewState(
+             episode_id=episode_id or str(uuid4()),
+             step_count=0,
+             task_id=self._task.task_id,
+             difficulty=self._task.difficulty,
+             task_kind=self._task.task_kind,
+             attempts_remaining=self._task.max_steps,
+             current_code=self._task.starter_code,
+             errors="",
+             test_results="Not run yet.",
+             history=[],
+             score=0.0,
+             done=False,
+         )
+
+         return self._build_observation()
+
+     def step(
+         self,
+         action: PythonCodeReviewAction,
+         timeout_s: Optional[float] = None,
+         **_: object,
+     ) -> PythonCodeReviewObservation:
+         """Apply one structured action."""
+
+         del timeout_s
+
+         if self._task is None:
+             return self.reset()
+
+         if self._done:
+             self._last_reward = RewardDetails(
+                 value=-INVALID_ACTION_PENALTY,
+                 invalid_action_penalty=INVALID_ACTION_PENALTY,
+                 reason="Episode already completed.",
+             )
+             self._last_status = "Episode already completed. Call reset() to continue."
+             return self._build_observation()
+
+         self._state.step_count += 1
+         status = ""
+         reward = RewardDetails(value=0.0, reason="Action processed.")
+
+         # Dispatch to a handler based on the action type
+         if action.action_type == "analyze_code":
+             reward, status = self._handle_analyze()
+         elif action.action_type == "edit_code":
+             reward, status = self._handle_edit(action)
+         elif action.action_type == "run_tests":
+             reward, status = self._handle_run_tests()
+         elif action.action_type == "submit_solution":
+             reward, status = self._handle_submit()
+         else:
+             reward = RewardDetails(
+                 value=-INVALID_ACTION_PENALTY,
+                 invalid_action_penalty=INVALID_ACTION_PENALTY,
+                 reason=f"Unsupported action_type: {action.action_type}",
+             )
+             status = f"Invalid action: unsupported action_type '{action.action_type}'."
+
+         self._last_reward = reward
+         self._last_status = status
+         self._state.attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)
+         self._state.done = self._done
+
+         # Auto-submit if the step budget is exhausted
+         if self._state.attempts_remaining == 0 and not self._done:
+             self._finalize_episode(auto_submit=True)
+             self._state.done = True
+
+         return self._build_observation()
+
+     @property
+     def state(self) -> PythonCodeReviewState:
+         """Return the current environment state."""
+         return self._state.model_copy(deep=True)
+
+     def list_task_summaries(self) -> List[object]:
+         """Return public task metadata."""
+         return list_task_summaries()
+
+     def get_task(self, task_id: str) -> object:
+         """Return a single task descriptor."""
+         return get_task(task_id).to_descriptor()
+
+     def health(self) -> HealthResponse:
+         """Return a simple health model."""
+         return HealthResponse(task_count=len(self._task_order))
+
+     def grade_task_submission(self, task_id: str, code: str) -> TaskGrade:
+         """Expose deterministic grading outside of an active episode."""
+         return grade_task(code, get_task(task_id), include_hidden=True)
+
+     def _build_observation(self) -> PythonCodeReviewObservation:
+         """Build the current observation from state."""
+         return PythonCodeReviewObservation(
+             task_id=self._state.task_id or "",
+             difficulty=self._state.difficulty or "easy",
+             task_description=self._task.task_description if self._task else "",
+             current_code=self._state.current_code,
+             errors=self._state.errors,
+             test_results=self._state.test_results,
+             visible_tests=self._task.visible_tests if self._task else [],
+             history=self._state.history,
+             attempts_remaining=self._state.attempts_remaining,
+             score=self._state.score,
+             reward=self._last_reward,
+         )
+
+     def _handle_analyze(self) -> tuple[RewardDetails, str]:
+         """Analyze the code for errors and test status."""
+         if self._task is None:
+             return RewardDetails(value=0.0, reason="Invalid state"), "Error: task not loaded"
+
+         grade = grade_task(self._state.current_code, self._task, include_hidden=False)
+         error = grade.details.get("compile_error", "")
+
+         if error:
+             self._state.errors = error
+             self._state.test_results = "Compilation failed. Fix syntax first."
+             summary = f"Syntax error detected: {error}"
+         else:
+             self._state.errors = ""
+             if self._task.task_kind == "syntax_fix":
+                 self._state.test_results = "Code compiles successfully."
+                 summary = "Code compiles. Ready to submit."
+             else:
+                 visible_total = len(self._task.visible_tests)
+                 visible_passed = grade.tests_passed
+                 self._state.test_results = f"Test run: {visible_passed}/{visible_total} passing."
+                 summary = self._state.test_results
+
+         reward = RewardDetails(value=0.0, reason=summary)
+         self._append_history("analyze_code", summary, reward.value)
+         self._sync_score(include_hidden=False)
+         return reward, summary
+
+     def _handle_edit(self, action: PythonCodeReviewAction) -> tuple[RewardDetails, str]:
+         """Edit the code and compute a reward for progress."""
+         if self._task is None:
+             return RewardDetails(value=0.0, reason="Invalid state"), "Error: task not loaded"
+
+         code = (action.code or "").strip()
+         if not code:
+             reward = RewardDetails(
+                 value=-INVALID_ACTION_PENALTY,
+                 invalid_action_penalty=INVALID_ACTION_PENALTY,
+                 reason="Edit action requires non-empty code.",
+             )
+             status = "Invalid: edit_code requires a code parameter."
+             self._append_history("edit_code", status, reward.value)
+             return reward, status
+
+         # Grade before and after the edit
+         previous_grade = grade_task(self._state.current_code, self._task, include_hidden=False)
+         new_grade = grade_task(code, self._task, include_hidden=False)
+         self._state.current_code = code
+
+         # Update state
+         self._state.errors = new_grade.details.get("compile_error", "")
+         self._state.test_results = self._format_test_results(new_grade)
+
+         # Compute reward with shaping
+         syntax_reward = 0.0
+         if previous_grade.syntax_score < 1.0 and new_grade.syntax_score == 1.0:
+             syntax_reward = 0.2
+             self._syntax_reward_awarded = True
+
+         quality_delta = max(new_grade.quality_score - self._best_quality_score, 0.0)
+         quality_bonus = 0.0
+         if quality_delta > 0:
+             quality_bonus = min(quality_delta * QUALITY_BONUS_SCALE, 0.1)
+             self._best_quality_score = new_grade.quality_score
+
+         test_delta = 0.0
+         if new_grade.tests_total > 0:
+             current_test_fraction = new_grade.tests_passed / new_grade.tests_total
+             test_delta = max(current_test_fraction - self._best_visible_test_fraction, 0.0)
+             self._best_visible_test_fraction = max(self._best_visible_test_fraction, current_test_fraction)
+
+         reward_value = syntax_reward + quality_bonus + (0.15 * test_delta)
+
+         status = "Code updated."
+         if self._state.errors:
+             status = f"Code updated with syntax issues: {self._state.errors}"
+         elif new_grade.tests_total > 0:
+             status = self._state.test_results
+
+         reward = RewardDetails(
+             value=reward_value,
+             syntax_reward=syntax_reward,
+             quality_bonus=quality_bonus,
+             test_reward=0.15 * test_delta,
+             reason=status,
+         )
+         self._append_history("edit_code", status, reward_value)
+         self._sync_score(include_hidden=False)
+         return reward, status
+
+     def _handle_run_tests(self) -> tuple[RewardDetails, str]:
+         """Run the visible tests and provide feedback."""
+         if self._task is None:
+             return RewardDetails(value=0.0, reason="Invalid state"), "Error: task not loaded"
+
+         grade = grade_task(self._state.current_code, self._task, include_hidden=False)
+         self._state.errors = grade.details.get("compile_error", "")
+         self._state.test_results = self._format_test_results(grade)
+
+         if grade.tests_total > 0:
+             current_fraction = grade.tests_passed / grade.tests_total
+             test_delta = max(current_fraction - self._best_visible_test_fraction, 0.0)
+             self._best_visible_test_fraction = max(self._best_visible_test_fraction, current_fraction)
+             test_reward = 0.15 * test_delta
+         else:
+             test_reward = 0.0
+
+         status = self._state.test_results if not self._state.errors else self._state.errors
+         reward = RewardDetails(value=test_reward, test_reward=test_reward, reason=status)
+         self._append_history("run_tests", status, reward.value)
+         self._sync_score(include_hidden=False)
+         return reward, status
+
+     def _handle_submit(self) -> tuple[RewardDetails, str]:
+         """Submit the solution and finalize the episode."""
+         if self._task is None:
+             return RewardDetails(value=0.0, reason="Invalid state"), "Error: task not loaded"
+
+         grade = grade_task(self._state.current_code, self._task, include_hidden=True)
+         self._state.errors = grade.details.get("compile_error", "")
+         self._state.test_results = self._format_test_results(grade)
+
+         # Compute final reward bonuses
+         correctness_bonus = 0.0
+         if grade.score >= 0.999999 and not self._full_correctness_awarded:
+             correctness_bonus = 0.5
+             self._full_correctness_awarded = True
+
+         reward_value = correctness_bonus
+         self._finalize_episode(auto_submit=False, grade=grade)
+         status = f"Solution submitted. Final score: {grade.score:.3f}"
+
+         reward = RewardDetails(
+             value=reward_value,
+             correctness_bonus=correctness_bonus,
+             reason=status,
+         )
+         self._append_history("submit_solution", status, reward_value)
+         return reward, status
+
+     def _finalize_episode(self, auto_submit: bool, grade: Optional[TaskGrade] = None) -> None:
+         """Mark the episode as done and set the final score."""
+         if grade is None:
+             if self._task is None:
+                 return
+             grade = grade_task(self._state.current_code, self._task, include_hidden=True)
+             self._state.errors = grade.details.get("compile_error", "")
+             self._state.test_results = self._format_test_results(grade)
+
+         self._state.score = grade.score
+         self._done = True
+         self._state.done = True
+
+         if auto_submit:
+             self._last_status = f"Step budget exhausted. Final score: {grade.score:.3f}"
+
+     def _sync_score(self, include_hidden: bool) -> None:
+         """Update the visible score based on the current code."""
+         if self._task is None:
+             return
+         grade = grade_task(self._state.current_code, self._task, include_hidden=include_hidden)
+         # For visible runs, use a soft score; the hidden score is finalized on submit
+         if not include_hidden:
+             self._state.score = grade.score
+
+     def _format_test_results(self, grade: TaskGrade) -> str:
+         """Format test results for display."""
+         if grade.tests_total == 0:
+             return "No tests available."
+         if grade.timed_out:
+             return "Test execution timed out."
+         return f"Tests: {grade.tests_passed}/{grade.tests_total} passing"
+
+     def _append_history(self, action_type: str, status: str, reward: float) -> None:
+         """Append an action to the episode history."""
+         entry = HistoryEntry(
+             step=self._state.step_count,
+             action_type=action_type,
+             status=status,
+             reward=reward,
+         )
+         self._state.history.append(entry)
369
+ return self.reset()
370
+ if self._done:
371
+ self._last_reward = RewardDetails(
372
+ value=-INVALID_ACTION_PENALTY,
373
+ invalid_action_penalty=INVALID_ACTION_PENALTY,
374
+ reason="Episode already completed.",
375
+ )
376
+ self._last_status = "Episode already completed. Call reset() to continue."
377
+ return self._build_observation()
378
+
379
+ self._state.step_count += 1
380
+ status = ""
381
+ reward = RewardDetails(reason="Action processed.")
382
+
383
+ if action.action_type == "analyze_code":
384
+ reward, status = self._handle_analyze()
385
+ elif action.action_type == "edit_code":
386
+ reward, status = self._handle_edit(action)
387
+ elif action.action_type == "run_tests":
388
+ reward, status = self._handle_run_tests()
389
+ elif action.action_type == "submit_solution":
390
+ reward, status = self._handle_submit()
391
+ else: # pragma: no cover
392
+ reward = RewardDetails(
393
+ value=-INVALID_ACTION_PENALTY,
394
+ invalid_action_penalty=INVALID_ACTION_PENALTY,
395
+ reason=f"Unsupported action_type {action.action_type}.",
396
+ )
397
+ status = f"Unsupported action_type {action.action_type}."
398
+
399
+ self._last_reward = reward
400
+ self._last_status = status
401
+ self._state.attempts_remaining = max(self._task.max_steps - self._state.step_count, 0)
402
+
403
+ if self._state.attempts_remaining == 0 and not self._done:
404
+ self._finalize_episode(auto_submit=True)
405
+
406
+ self._state.done = self._done
407
+ return self._build_observation()
408
+
409
+ @property
410
+ def state(self) -> PythonCodeReviewState:
411
+ """Return the current environment state."""
412
+
413
+ return self._state.model_copy(deep=True)
414
+
415
+ def list_tasks(self) -> List[TaskDescriptor]:
416
+ """Return all task descriptors."""
417
+
418
+ return list_task_descriptors()
419
+
420
+ def list_task_summaries(self) -> List[TaskDescriptor]:
421
+ """Return public task metadata."""
422
+
423
+ return list_task_summaries()
424
+
425
+ def get_task(self, task_id: str) -> TaskDescriptor:
426
+ """Return a single task descriptor."""
427
+
428
+ return get_task(task_id).to_descriptor()
429
+
430
+ def health(self) -> HealthResponse:
431
+ """Return a simple health model."""
432
+
433
+ return HealthResponse(task_count=len(self._task_order))
434
+
435
+ def grade_task_submission(self, task_id: str, code: str) -> TaskGrade:
436
+ """Expose deterministic grading outside of an active episode."""
437
+
438
+ return grade_task(code, get_task(task_id), include_hidden=True)
439
+
440
+ def _handle_analyze(self) -> tuple[RewardDetails, str]:
441
+ grade = grade_task(self._state.current_code, self._task, include_hidden=False)
442
+ error = grade.details.get("compile_error", "")
443
+ if error:
444
+ self._state.errors = f"Syntax analysis failed: {error}"
445
+ self._state.test_results = "Tests skipped because the code does not compile."
446
+ summary = self._state.errors
447
+ else:
448
+ self._state.errors = ""
449
+ if self._task.task_kind == "syntax_fix":
450
+ self._state.test_results = "Compilation succeeds."
451
+ else:
452
+ visible_total = len(self._task.visible_tests)
453
+ visible_passed = min(grade.tests_passed, visible_total)
454
+ self._state.test_results = (
455
+ f"Visible checks preview: {visible_passed}/{visible_total} passing."
456
+ )
457
+ summary = "Static analysis refreshed."
458
+
459
+ reward = RewardDetails(value=0.0, reason=summary)
460
+ self._append_history("analyze_code", summary, reward.value)
461
+ self._sync_score(include_hidden=False)
462
+ return reward, summary
463
+
464
+ def _handle_edit(self, action: PythonCodeReviewAction) -> tuple[RewardDetails, str]:
465
+ code = (action.code or "").strip("\n")
466
+ if not code:
467
+ reward = RewardDetails(
468
+ value=-INVALID_ACTION_PENALTY,
469
+ invalid_action_penalty=INVALID_ACTION_PENALTY,
470
+ reason="edit_code requires non-empty code.",
471
+ )
472
+ status = "Invalid action: edit_code requires code."
473
+ self._append_history("edit_code", status, reward.value)
474
+ return reward, status
475
+
476
+ previous_visible = grade_task(self._state.current_code, self._task, include_hidden=False)
477
+ new_visible = grade_task(code, self._task, include_hidden=False)
478
+ self._state.current_code = code
479
+ self._state.errors = new_visible.details.get("compile_error", "")
480
+ self._state.test_results = self._format_test_results(new_visible, include_hidden=False)
481
+
482
+ syntax_reward = 0.0
483
+ if previous_visible.syntax_score < 1.0 and new_visible.syntax_score == 1.0:
484
+ syntax_reward = 0.2
485
+
486
+ quality_bonus = 0.0
487
+ quality_delta = max(new_visible.quality_score - self._best_quality_score, 0.0)
488
+ if quality_delta > 0:
489
+ quality_bonus = round(min(quality_delta * QUALITY_BONUS_SCALE, 0.1), 6)
490
+ self._best_quality_score = max(self._best_quality_score, new_visible.quality_score)
491
+
492
+ reward_value = syntax_reward + quality_bonus
493
+ status = "Code updated."
494
+ if self._state.errors:
495
+ status = f"Code updated, but syntax issues remain: {self._state.errors}"
496
+ elif new_visible.tests_total:
497
+ status = self._state.test_results
498
+
499
+ reward = RewardDetails(
500
+ value=reward_value,
501
+ syntax_reward=syntax_reward,
502
+ quality_bonus=quality_bonus,
503
+ reason=status,
504
+ )
505
+ self._append_history("edit_code", status, reward.value)
506
+ self._sync_score(include_hidden=False)
507
+ return reward, status
508
+
509
+ def _handle_run_tests(self) -> tuple[RewardDetails, str]:
510
+ grade = grade_task(self._state.current_code, self._task, include_hidden=False)
511
+ self._state.errors = grade.details.get("compile_error", "")
512
+ self._state.test_results = self._format_test_results(grade, include_hidden=False)
513
+ reward = self._reward_from_grade(grade, include_hidden=False)
514
+ status = self._state.test_results if not self._state.errors else self._state.errors
515
+ self._append_history("run_tests", status, reward.value)
516
+ self._sync_score(include_hidden=False)
517
+ return reward, status
518
+
519
+ def _handle_submit(self) -> tuple[RewardDetails, str]:
520
+ grade = grade_task(self._state.current_code, self._task, include_hidden=True)
521
+ self._state.errors = grade.details.get("compile_error", "")
522
+ self._state.test_results = self._format_test_results(grade, include_hidden=True)
523
+ reward = self._reward_from_grade(grade, include_hidden=True)
524
+ self._finalize_episode(auto_submit=False, grade=grade)
525
+ status = f"Solution submitted. Final score: {grade.score:.2f}."
526
+ self._append_history("submit_solution", status, reward.value)
527
+ return reward, status
528
+
529
+ def _finalize_episode(self, auto_submit: bool, grade: Optional[TaskGrade] = None) -> None:
530
+ if grade is None:
531
+ grade = grade_task(self._state.current_code, self._task, include_hidden=True)
532
+ self._state.errors = grade.details.get("compile_error", "")
533
+ self._state.test_results = self._format_test_results(grade, include_hidden=True)
534
+ self._state.score = grade.score
535
+ self._done = True
536
+ self._state.done = True
537
+ if auto_submit:
538
+ self._last_status = f"Step budget exhausted. Final score: {grade.score:.2f}."
539
+ self._last_reward = self._reward_from_grade(grade, include_hidden=True)
540
+
541
+ def _reward_from_grade(self, grade: TaskGrade, include_hidden: bool) -> RewardDetails:
542
+ syntax_reward = 0.0
543
+ if grade.syntax_score == 1.0 and not self._state.errors and not self._syntax_reward_awarded:
544
+ syntax_reward = 0.2
545
+ self._syntax_reward_awarded = True
546
+ test_fraction = grade.tests_passed / grade.tests_total if grade.tests_total else grade.score
547
+ test_gain = max(test_fraction - self._best_visible_test_fraction, 0.0)
548
+ test_reward = 0.3 * test_gain
549
+ if test_gain > 0:
550
+ self._best_visible_test_fraction = test_fraction
551
+
552
+ quality_bonus = 0.0
553
+ quality_delta = max(grade.quality_score - self._best_quality_score, 0.0)
554
+ if quality_delta > 0:
555
+ quality_bonus = min(quality_delta * QUALITY_BONUS_SCALE, 0.1)
556
+ self._best_quality_score = grade.quality_score
557
+
558
+ correctness_bonus = 0.0
559
+ if include_hidden and grade.score >= 0.999999 and not self._full_correctness_awarded:
560
+ correctness_bonus = 0.5
561
+ self._full_correctness_awarded = True
562
+
563
+ timeout_penalty = TIMEOUT_PENALTY if grade.timed_out else 0.0
564
+ reward_value = round(
565
+ syntax_reward + test_reward + quality_bonus + correctness_bonus - timeout_penalty,
566
+ 6,
567
+ )
568
+ return RewardDetails(
569
+ value=reward_value,
570
+ syntax_reward=syntax_reward,
571
+ test_reward=round(test_reward, 6),
572
+ correctness_bonus=correctness_bonus,
573
+ quality_bonus=round(quality_bonus, 6),
574
+ timeout_penalty=timeout_penalty,
575
+ reason=self._format_test_results(grade, include_hidden=include_hidden),
576
+ )
577
+
578
+ def _format_test_results(self, grade: TaskGrade, include_hidden: bool) -> str:
579
+ if grade.details.get("compile_error"):
580
+ return f"Compilation failed: {grade.details['compile_error']}"
581
+ scope = "full grader" if include_hidden else "visible checks"
582
+ parts = [f"{scope}: score={grade.score:.2f}"]
583
+ if grade.tests_total:
584
+ parts.append(f"tests={grade.tests_passed}/{grade.tests_total}")
585
+ if grade.runtime_score:
586
+ parts.append(f"runtime={grade.runtime_score:.2f}")
587
+ if grade.quality_score:
588
+ parts.append(f"quality={grade.quality_score:.2f}")
589
+ if grade.style_score:
590
+ parts.append(f"style={grade.style_score:.2f}")
591
+ if grade.timed_out:
592
+ parts.append("timed_out=True")
593
+ return " | ".join(parts)
594
+
595
+ def _sync_score(self, include_hidden: bool) -> None:
596
+ grade = grade_task(self._state.current_code, self._task, include_hidden=include_hidden)
597
+ self._state.score = grade.score
598
+
599
+ def _append_history(self, action_type: str, summary: str, reward: float) -> None:
600
+ self._state.history.append(
601
+ HistoryEntry(
602
+ step=self._state.step_count,
603
+ action_type=action_type, # type: ignore[arg-type]
604
+ summary=summary,
605
+ reward=reward,
606
+ )
607
+ )
608
+
609
+ def _build_observation(self) -> PythonCodeReviewObservation:
610
+ return PythonCodeReviewObservation(
611
+ task_id=self._task.task_id,
612
+ title=self._task.title,
613
+ difficulty=self._task.difficulty,
614
+ task_kind=self._task.task_kind,
615
+ task_description=self._task.task_description,
616
+ current_code=self._state.current_code,
617
+ errors=self._state.errors,
618
+ test_results=self._state.test_results,
619
+ history=list(self._state.history),
620
+ attempts_remaining=self._state.attempts_remaining,
621
+ last_action_status=self._last_status,
622
+ score=self._state.score,
623
+ reward_details=self._last_reward,
624
+ done=self._done,
625
+ reward=self._last_reward.value,
626
+ metadata={
627
+ "episode_id": self._state.episode_id,
628
+ "step_count": self._state.step_count,
629
+ "task_kind": self._task.task_kind,
630
+ "visible_tests": list(self._task.visible_tests),
631
+ "info": {
632
+ "reward": reward_metadata(self._last_reward),
633
+ },
634
+ },
635
+ )
636
+
637
+
638
+ # Backwards-compatible aliases used elsewhere in the repo.
639
+ PythonEnvironment = PythonCodeReviewEnvironment
640
+ CodeReviewEnvironment = PythonCodeReviewEnvironment
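The shaping in `_reward_from_grade` combines a one-time syntax bonus (0.2), a test-progress term (0.3 × newly passing fraction), a quality bonus capped at 0.1, a one-time full-correctness bonus (0.5), and a timeout penalty. A minimal standalone sketch of that arithmetic follows; note that `QUALITY_BONUS_SCALE` and `TIMEOUT_PENALTY` are defined elsewhere in the module, so the values below are illustrative stand-ins, not the module's actual constants.

```python
# Stand-in constants; the real values live elsewhere in the module.
QUALITY_BONUS_SCALE = 0.2
TIMEOUT_PENALTY = 0.1

def shaped_reward(syntax_fixed, tests_passed, tests_total, best_fraction,
                  quality_delta, fully_correct, timed_out):
    # One-time bonus when the code first compiles cleanly.
    syntax_reward = 0.2 if syntax_fixed else 0.0
    # Test reward only pays for *new* progress beyond the best fraction seen.
    fraction = tests_passed / tests_total if tests_total else 0.0
    test_reward = 0.3 * max(fraction - best_fraction, 0.0)
    # Quality improvements are scaled and capped at 0.1.
    quality_bonus = min(max(quality_delta, 0.0) * QUALITY_BONUS_SCALE, 0.1)
    # Full hidden-grader correctness pays a one-time 0.5 bonus.
    correctness_bonus = 0.5 if fully_correct else 0.0
    penalty = TIMEOUT_PENALTY if timed_out else 0.0
    return round(syntax_reward + test_reward + quality_bonus + correctness_bonus - penalty, 6)

# First clean compile, all 4 visible tests newly passing, full correctness:
assert shaped_reward(True, 4, 4, 0.0, 0.0, True, False) == 1.0
# No new test progress yields no test reward:
assert shaped_reward(False, 2, 4, 0.5, 0.0, False, False) == 0.0
```

The "best seen so far" tracking in the real class (`_best_visible_test_fraction`, `_best_quality_score`) is what makes each bonus pay out only once per episode.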
server/grading.py ADDED
@@ -0,0 +1,147 @@
+ """Deterministic grading helpers for PR-review tasks."""
+
+ from __future__ import annotations
+
+ import re
+ from dataclasses import dataclass
+ from typing import Iterable, List, Optional, Sequence, Set
+
+ try:
+     from models import ReviewFinding, TaskGrade
+     from server.task_bank import RubricIssue, TaskSpec
+ except ModuleNotFoundError:  # pragma: no cover
+     from ..models import ReviewFinding, TaskGrade
+     from .task_bank import RubricIssue, TaskSpec
+
+
+ FALSE_POSITIVE_PENALTY = 0.10
+ DUPLICATE_PENALTY = 0.05
+
+
+ @dataclass(frozen=True)
+ class FindingMatch:
+     """Result of matching one finding against the rubric."""
+
+     issue_id: Optional[str]
+     duplicate: bool = False
+
+
+ def finding_fingerprint(finding: ReviewFinding) -> str:
+     """Build a deterministic fingerprint for duplicate detection."""
+
+     text = " ".join(
+         [
+             finding.file_path,
+             str(finding.line or 0),
+             finding.category,
+             finding.severity,
+             finding.title,
+             finding.explanation,
+             finding.suggested_fix,
+         ]
+     )
+     return "|".join(sorted(tokens(text)))
+
+
+ def match_finding(
+     finding: ReviewFinding,
+     task: TaskSpec,
+     matched_issue_ids: Set[str],
+     seen_fingerprints: Set[str],
+ ) -> FindingMatch:
+     """Match one finding against the remaining rubric issues."""
+
+     fingerprint = finding_fingerprint(finding)
+     if fingerprint in seen_fingerprints:
+         return FindingMatch(issue_id=None, duplicate=True)
+
+     for issue in task.rubric_issues:
+         if issue.issue_id in matched_issue_ids:
+             continue
+         if finding_matches_issue(finding, issue):
+             return FindingMatch(issue_id=issue.issue_id)
+     return FindingMatch(issue_id=None)
+
+
+ def finding_matches_issue(finding: ReviewFinding, issue: RubricIssue) -> bool:
+     """Return True when a finding deterministically matches a rubric issue."""
+
+     if finding.file_path != issue.file_path:
+         return False
+     if finding.category != issue.category:
+         return False
+     if finding.severity != issue.severity:
+         return False
+     if finding.line is None or abs(finding.line - issue.line) > 2:
+         return False
+
+     finding_tokens = tokens(
+         " ".join([finding.title, finding.explanation, finding.suggested_fix])
+     )
+     keyword_hits = sum(1 for keyword in issue.keywords if keyword in finding_tokens)
+     return keyword_hits >= issue.min_keyword_hits
+
+
+ def score_task(
+     task: TaskSpec,
+     matched_issue_ids: Iterable[str],
+     false_positives: int = 0,
+     duplicate_findings: int = 0,
+ ) -> TaskGrade:
+     """Score a task from cumulative episode state."""
+
+     matched_set = set(matched_issue_ids)
+     matched_weight = sum(
+         issue.weight for issue in task.rubric_issues if issue.issue_id in matched_set
+     )
+     raw_score = matched_weight
+     raw_score -= false_positives * FALSE_POSITIVE_PENALTY
+     raw_score -= duplicate_findings * DUPLICATE_PENALTY
+     score = max(0.0, min(1.0, round(raw_score, 6)))
+     return TaskGrade(
+         score=score,
+         matched_issue_ids=sorted(matched_set),
+         false_positives=false_positives,
+         duplicate_findings=duplicate_findings,
+         matched_weight=min(1.0, round(matched_weight, 6)),
+     )
+
+
+ def grade_findings(task: TaskSpec, findings: Sequence[ReviewFinding]) -> TaskGrade:
+     """Offline-grade a batch of findings for one task."""
+
+     matched_issue_ids: Set[str] = set()
+     seen_fingerprints: Set[str] = set()
+     false_positives = 0
+     duplicate_findings = 0
+
+     for finding in findings:
+         result = match_finding(
+             finding=finding,
+             task=task,
+             matched_issue_ids=matched_issue_ids,
+             seen_fingerprints=seen_fingerprints,
+         )
+         fingerprint = finding_fingerprint(finding)
+         if result.duplicate:
+             duplicate_findings += 1
+             continue
+         seen_fingerprints.add(fingerprint)
+         if result.issue_id is None:
+             false_positives += 1
+             continue
+         matched_issue_ids.add(result.issue_id)
+
+     return score_task(
+         task=task,
+         matched_issue_ids=matched_issue_ids,
+         false_positives=false_positives,
+         duplicate_findings=duplicate_findings,
+     )
+
+
+ def tokens(text: str) -> Set[str]:
+     """Normalize free text into deterministic comparison tokens."""
+
+     return set(re.findall(r"[a-z0-9_]+", text.lower()))
+
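The matching above boils down to keyword-set overlap plus flat penalties. A small self-contained sketch of the two core ideas, using a stand-in `Issue` dataclass rather than the real `RubricIssue`/`ReviewFinding` models:

```python
import re
from dataclasses import dataclass

def tokens(text: str) -> set:
    # Same normalization as server/grading.py: lowercase word tokens.
    return set(re.findall(r"[a-z0-9_]+", text.lower()))

@dataclass(frozen=True)
class Issue:  # stand-in for RubricIssue
    keywords: tuple
    min_keyword_hits: int
    weight: float

def keyword_match(free_text: str, issue: Issue) -> bool:
    # A finding matches when enough rubric keywords appear in its text.
    hits = sum(1 for kw in issue.keywords if kw in tokens(free_text))
    return hits >= issue.min_keyword_hits

issue = Issue(keywords=("zero", "division", "base_delay"), min_keyword_hits=2, weight=1.0)

# Two (here: three) of the rubric keywords appear, so this finding matches...
assert keyword_match("Possible division by zero when base_delay is 0", issue)
# ...while a single keyword alone does not.
assert not keyword_match("base_delay looks suspicious", issue)

# Score arithmetic mirrors score_task: matched weight minus flat
# penalties per false positive (0.10) and duplicate (0.05), clamped to [0, 1].
score = max(0.0, min(1.0, issue.weight - 1 * 0.10 - 0 * 0.05))
assert abs(score - 0.9) < 1e-9
```

Because matching is pure set arithmetic over normalized tokens, grading is fully deterministic: the same findings always produce the same score.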
server/python_env_environment.py ADDED
@@ -0,0 +1,9 @@
+ """Compatibility shim for older imports."""
+
+ try:
+     from server.code_review_environment import PythonEnvironment
+ except ModuleNotFoundError:  # pragma: no cover
+     from .code_review_environment import PythonEnvironment
+
+
+ __all__ = ["PythonEnvironment"]
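The shim relies on the try/except import-fallback idiom so the module resolves whether it is imported as a top-level package or as a subpackage. A tiny runnable illustration of the same idiom; the module names here are stand-ins, not part of this repo:

```python
# Prefer one import path, fall back to another when the first is unavailable.
try:
    from nonexistent_toplevel_module import thing  # hypothetical preferred path
except ModuleNotFoundError:
    from math import sqrt as thing  # hypothetical fallback path

assert thing(9) == 3.0
```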
server/requirements.txt ADDED
@@ -0,0 +1,6 @@
+ openenv-core[core]>=0.2.2
+ fastapi>=0.115.0
+ uvicorn[standard]>=0.30.0
+ openai>=1.40.0
+ pytest>=8.0.0
+ pydantic>=2.0.0
server/static_review.py ADDED
@@ -0,0 +1,273 @@
+ """Deterministic static-review helpers for arbitrary Python code.
+
+ Unlike the benchmark grader, this module does not compare against hidden rubric
+ items. Instead, it performs direct AST-based review on arbitrary snippets so it
+ can be used for manual testing, examples, and future dataset generation.
+ """
+
+ from __future__ import annotations
+
+ import ast
+ from typing import List, Optional
+
+ try:
+     from models import DirectReviewResponse, ReviewFinding
+ except ModuleNotFoundError:  # pragma: no cover
+     from ..models import DirectReviewResponse, ReviewFinding
+
+
+ class _StaticAnalyzer(ast.NodeVisitor):
+     """AST visitor that emits structured review findings.
+
+     The visitor intentionally focuses on a small set of high-signal patterns so
+     the direct-review endpoint stays predictable and easy to understand.
+     """
+
+     def __init__(self) -> None:
+         self.issues: List[ReviewFinding] = []
+
+     def visit_FunctionDef(self, node: ast.FunctionDef) -> None:  # noqa: N802
+         """Flag mutable default arguments in function definitions."""
+
+         for default in list(node.args.defaults):
+             if isinstance(default, (ast.List, ast.Dict, ast.Set)):
+                 self.issues.append(
+                     ReviewFinding(
+                         title="Mutable default argument",
+                         line=getattr(default, "lineno", node.lineno),
+                         category="bug",
+                         severity="warning",
+                         rationale=(
+                             "Mutable defaults persist across calls and can leak state "
+                             "between unrelated requests."
+                         ),
+                         recommendation="Use None as the default and create the object inside the function.",
+                         rule_id="mutable-default-list",
+                     )
+                 )
+         self.generic_visit(node)
+
+     def visit_Call(self, node: ast.Call) -> None:  # noqa: N802
+         """Inspect function calls for obviously unsafe or noisy patterns."""
+
+         func_name = self._call_name(node)
+         if func_name in {"eval", "exec"}:
+             self.issues.append(
+                 ReviewFinding(
+                     title=f"Avoid {func_name} on untrusted input",
+                     line=node.lineno,
+                     category="security",
+                     severity="critical",
+                     rationale=(
+                         f"{func_name} executes arbitrary code and is unsafe on "
+                         "user-controlled input."
+                     ),
+                     recommendation="Use a safe parser or a whitelist-based evaluator.",
+                     rule_id="avoid-eval" if func_name == "eval" else "avoid-exec",
+                 )
+             )
+         if func_name.endswith("check_output") or func_name.endswith("run"):
+             for keyword in node.keywords:
+                 # `shell=True` is only a problem when the command comes from a
+                 # shell-parsed string, but this heuristic is high value for
+                 # review and intentionally conservative.
+                 if keyword.arg == "shell" and isinstance(keyword.value, ast.Constant) and keyword.value.value is True:
+                     self.issues.append(
+                         ReviewFinding(
+                             title="shell=True with dynamic input",
+                             line=node.lineno,
+                             category="security",
+                             severity="critical",
+                             rationale=(
+                                 "shell=True executes through the shell and can allow "
+                                 "command injection when the command string is interpolated."
+                             ),
+                             recommendation="Pass a list of arguments and keep shell=False.",
+                             rule_id="shell-true-command-injection",
+                         )
+                     )
+         if func_name == "print":
+             self.issues.append(
+                 ReviewFinding(
+                     title="Print statement in application logic",
+                     line=node.lineno,
+                     category="style",
+                     severity="info",
+                     rationale="Production services should prefer structured logging over print statements.",
+                     recommendation="Use the logging module or return the value to the caller.",
+                     rule_id="print-statement",
+                 )
+             )
+         self.generic_visit(node)
+
+     def visit_ExceptHandler(self, node: ast.ExceptHandler) -> None:  # noqa: N802
+         """Flag bare exception handlers that hide failures."""
+
+         if node.type is None:
+             self.issues.append(
+                 ReviewFinding(
+                     title="Bare except",
+                     line=node.lineno,
+                     category="maintainability",
+                     severity="warning",
+                     rationale="Bare except catches KeyboardInterrupt and other system-level exceptions.",
+                     recommendation="Catch a specific exception and record the failure.",
+                     rule_id="bare-except",
+                 )
+             )
+         self.generic_visit(node)
+
+     def visit_For(self, node: ast.For) -> None:  # noqa: N802
+         """Look for list-membership checks nested in loops."""
+
+         for child in ast.walk(node):
+             if isinstance(child, ast.Compare) and any(
+                 isinstance(operator, (ast.In, ast.NotIn)) for operator in child.ops
+             ):
+                 if isinstance(child.comparators[0], ast.Name):
+                     self.issues.append(
+                         ReviewFinding(
+                             title="Potential quadratic membership check inside loop",
+                             line=child.lineno,
+                             category="performance",
+                             severity="warning",
+                             rationale=(
+                                 "Repeated membership checks against a list inside a loop "
+                                 "can degrade to quadratic runtime."
+                             ),
+                             recommendation="Use a set or dict for O(1) membership checks.",
+                             rule_id="quadratic-membership-check",
+                         )
+                     )
+                     break
+         self.generic_visit(node)
+
+     @staticmethod
+     def _call_name(node: ast.Call) -> str:
+         """Extract a dotted function name such as `subprocess.run`."""
+
+         func = node.func
+         if isinstance(func, ast.Name):
+             return func.id
+         if isinstance(func, ast.Attribute):
+             prefix = _StaticAnalyzer._attribute_prefix(func.value)
+             return f"{prefix}.{func.attr}" if prefix else func.attr
+         return ""
+
+     @staticmethod
+     def _attribute_prefix(node: ast.AST) -> str:
+         """Reconstruct the left-hand side of an attribute chain."""
+
+         if isinstance(node, ast.Name):
+             return node.id
+         if isinstance(node, ast.Attribute):
+             prefix = _StaticAnalyzer._attribute_prefix(node.value)
+             return f"{prefix}.{node.attr}" if prefix else node.attr
+         return ""
+
+
+ def analyze_python_code(code: str) -> List[ReviewFinding]:
+     """Analyze arbitrary Python code and return structured findings."""
+
+     if not code.strip():
+         return [
+             ReviewFinding(
+                 title="No code provided",
+                 category="bug",
+                 severity="warning",
+                 rationale="The reviewer cannot inspect an empty submission.",
+                 recommendation="Provide Python source code.",
+                 rule_id="empty-input",
+             )
+         ]
+
+     # Syntax errors are turned into findings rather than exceptions so API
+     # consumers always get a valid response shape.
+     try:
+         tree = ast.parse(code)
+     except SyntaxError as exc:
+         return [
+             ReviewFinding(
+                 title="Syntax error",
+                 line=exc.lineno,
+                 category="bug",
+                 severity="critical",
+                 rationale=exc.msg,
+                 recommendation="Fix the syntax error before running static review.",
+                 rule_id="syntax-error",
+             )
+         ]
+
+     analyzer = _StaticAnalyzer()
+     analyzer.visit(tree)
+     return _deduplicate(analyzer.issues)
+
+
+ def build_direct_review_response(
+     code: str, context: Optional[str] = None
+ ) -> DirectReviewResponse:
+     """Build the public direct-review response for the `/review` route."""
+
+     issues = analyze_python_code(code)
+     weighted_penalty = 0.0
+     # The direct-review score is intentionally simple: more severe issues lower
+     # the score more aggressively.
+     for issue in issues:
+         if issue.severity == "critical":
+             weighted_penalty += 0.3
+         elif issue.severity == "warning":
+             weighted_penalty += 0.15
+         else:
+             weighted_penalty += 0.05
+
+     score = max(0.0, min(1.0, 1.0 - weighted_penalty))
+     summary = _build_summary(issues, context)
+     improved_code = _suggest_improved_code(code, issues)
+     return DirectReviewResponse(
+         issues=issues,
+         summary=summary,
+         score=score,
+         improved_code=improved_code,
+     )
+
+
+ def _build_summary(issues: List[ReviewFinding], context: Optional[str]) -> str:
+     """Create a concise human-readable summary for the direct-review response."""
+
+     if not issues:
+         base = "No obvious issues were detected by the deterministic reviewer."
+     else:
+         critical = sum(1 for issue in issues if issue.severity == "critical")
+         warnings = sum(1 for issue in issues if issue.severity == "warning")
+         infos = sum(1 for issue in issues if issue.severity == "info")
+         base = (
+             f"Detected {len(issues)} issue(s): {critical} critical, "
+             f"{warnings} warning, {infos} info."
+         )
+     if context:
+         return f"{base} Context: {context}"
+     return base
+
+
+ def _suggest_improved_code(code: str, issues: List[ReviewFinding]) -> Optional[str]:
+     """Append high-level fix directions to the submitted code."""
+
+     if not issues:
+         return None
+     suggestions = [issue.recommendation for issue in issues if issue.recommendation]
+     comment = " | ".join(dict.fromkeys(suggestions))
+     return f"{code.rstrip()}\n\n# Suggested review directions: {comment}"
+
+
+ def _deduplicate(findings: List[ReviewFinding]) -> List[ReviewFinding]:
+     """Drop duplicate findings that refer to the same rule and line."""
+
+     seen = set()
+     unique: List[ReviewFinding] = []
+     for finding in findings:
+         key = (finding.rule_id, finding.line, finding.category)
+         if key in seen:
+             continue
+         seen.add(key)
+         unique.append(finding)
+     return unique
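The `_StaticAnalyzer` checks are plain `ast` traversals. A condensed, self-contained sketch of two of them (mutable default arguments and bare `except`), with findings reduced to `(rule_id, lineno)` tuples for illustration instead of the real `ReviewFinding` model:

```python
import ast

def quick_review(code: str):
    """Return (rule_id, lineno) tuples for two of the module's checks."""
    findings = []
    tree = ast.parse(code)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Same idea as _StaticAnalyzer.visit_FunctionDef: a literal
            # list/dict/set default is shared across calls.
            for default in node.args.defaults:
                if isinstance(default, (ast.List, ast.Dict, ast.Set)):
                    findings.append(("mutable-default-list", default.lineno))
        elif isinstance(node, ast.ExceptHandler) and node.type is None:
            # Same idea as visit_ExceptHandler: `except:` with no type.
            findings.append(("bare-except", node.lineno))
    return findings

snippet = """
def add_item(item, bucket=[]):
    try:
        bucket.append(item)
    except:
        pass
    return bucket
"""

found = quick_review(snippet)
assert ("mutable-default-list", 2) in found
assert any(rule == "bare-except" for rule, _ in found)
```

Because everything is derived from the parsed AST, the findings (and hence the direct-review score) are reproducible for a given input.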
server/task_bank.py ADDED
@@ -0,0 +1,340 @@
+ """Static PR-review tasks and hidden grading rubrics."""
+
+ from __future__ import annotations
+
+ from dataclasses import dataclass, field
+ from typing import Dict, Iterable, List, Sequence
+
+ try:
+     from models import Category, Difficulty, Severity, TaskDescriptor, TaskSummary
+ except ModuleNotFoundError:  # pragma: no cover
+     from ..models import Category, Difficulty, Severity, TaskDescriptor, TaskSummary
+
+
+ @dataclass(frozen=True)
+ class RubricIssue:
+     """One hidden issue that can be matched by the deterministic grader."""
+
+     issue_id: str
+     file_path: str
+     line: int
+     category: Category
+     severity: Severity
+     keywords: Sequence[str]
+     min_keyword_hits: int
+     weight: float
+
+
+ @dataclass(frozen=True)
+ class TaskSpec:
+     """Complete task definition, including hidden rubric metadata."""
+
+     task_id: str
+     difficulty: Difficulty
+     title: str
+     goal: str
+     repo_summary: str
+     visible_diff: str
+     file_contents: Dict[str, str]
+     changed_files: Sequence[str]
+     rubric_issues: Sequence[RubricIssue]
+     max_steps: int
+
+     @property
+     def available_files(self) -> List[str]:
+         return list(self.file_contents.keys())
+
+     def to_descriptor(self) -> TaskDescriptor:
+         return TaskDescriptor(
+             task_id=self.task_id,
+             difficulty=self.difficulty,
+             title=self.title,
+             goal=self.goal,
+             repo_summary=self.repo_summary,
+             changed_files=list(self.changed_files),
+             available_files=self.available_files,
+             max_steps=self.max_steps,
+         )
+
+     def to_summary(self) -> TaskSummary:
+         return TaskSummary(
+             task_id=self.task_id,
+             difficulty=self.difficulty,
+             title=self.title,
+             goal=self.goal,
+         )
+
+
+ TASKS: List[TaskSpec] = [
+     TaskSpec(
+         task_id="py-pr-review-easy",
+         difficulty="easy",
+         title="Retry Delay Regression",
+         goal=(
+             "Review the pull request and identify the real bug introduced in the retry "
+             "delay helper before it ships."
+         ),
+         repo_summary=(
+             "This service computes retry delays for background notification delivery. "
+             "The change is intended to relax validation for legacy callers."
+         ),
+         visible_diff="\n".join(
+             [
+                 "diff --git a/src/notifications/retry.py b/src/notifications/retry.py",
+                 "@@",
+                 "-    if base_delay <= 0:",
+                 "+    if base_delay < 0:",
+                 "         return 0.0",
+             ]
+         ),
+         file_contents={
+             "src/notifications/retry.py": "\n".join(
+                 [
+                     "from __future__ import annotations",
+                     "",
+                     "def calculate_retry_delay(attempt: int, base_delay: float = 2.0) -> float:",
+                     '    """Return the retry delay in seconds."""',
+                     "    if attempt < 0:",
+                     '        raise ValueError("attempt must be >= 0")',
+                     "    if base_delay < 0:",
+                     "        return 0.0",
+                     "    return attempt / base_delay",
+                 ]
+             )
+         },
+         changed_files=("src/notifications/retry.py",),
+         rubric_issues=(
+             RubricIssue(
+                 issue_id="zero-base-delay-divides",
+                 file_path="src/notifications/retry.py",
+                 line=7,
+                 category="bug",
+                 severity="warning",
+                 keywords=("zero", "division", "base_delay"),
+                 min_keyword_hits=2,
+                 weight=1.0,
+             ),
+         ),
+         max_steps=4,
+     ),
+     TaskSpec(
+         task_id="py-pr-review-medium",
+         difficulty="medium",
+         title="Coupon Billing Rollout",
+         goal=(
+             "Review the billing change and identify both the production regression and "
+             "the missing coverage that would have caught it."
+         ),
+         repo_summary=(
+             "The billing service is adding coupon support for one-off invoices. The PR "
+             "touches both the service code and its unit tests."
+         ),
+         visible_diff="\n".join(
+             [
+                 "diff --git a/app/billing/invoice_service.py b/app/billing/invoice_service.py",
+                 "@@",
+                 " def charge_invoice(order: dict, gateway: Gateway) -> str:",
+                 '-    return gateway.charge(order["customer_id"], order["amount_cents"])',
+                 '+    total = order["amount_cents"]',
+                 '+    coupon = order.get("coupon_code")',
+                 "+    if coupon:",
+                 "+        discount = gateway.lookup_discount(coupon)",
+                 "+        total = max(total - discount, 0)",
+                 '+    return gateway.charge(order["customer_id"], order["amount_cents"])',
+                 "",
+                 "diff --git a/tests/test_invoice_service.py b/tests/test_invoice_service.py",
+                 "@@",
+                 " class FakeGateway:",
+                 "+    def lookup_discount(self, coupon: str) -> int:",
+                 "+        return 250",
+             ]
+         ),
+         file_contents={
+             "app/billing/invoice_service.py": "\n".join(
+                 [
+                     "from gateway import Gateway",
+                     "",
+                     "def charge_invoice(order: dict, gateway: Gateway) -> str:",
+                     '    total = order["amount_cents"]',
+                     '    coupon = order.get("coupon_code")',
+                     "    if coupon:",
+                     "        discount = gateway.lookup_discount(coupon)",
+                     "        total = max(total - discount, 0)",
+                     '    return gateway.charge(order["customer_id"], order["amount_cents"])',
+                 ]
+             ),
+             "tests/test_invoice_service.py": "\n".join(
+                 [
+                     "from app.billing.invoice_service import charge_invoice",
+                     "",
+                     "class FakeGateway:",
+                     "    def lookup_discount(self, coupon: str) -> int:",
+                     "        return 250",
+                     "",
+                     "    def charge(self, customer_id: str, amount_cents: int) -> str:",
+                     "        self.last_charge = (customer_id, amount_cents)",
+                     '        return "charge_123"',
+                     "",
+                     "def test_charge_invoice_without_coupon():",
+                     "    gateway = FakeGateway()",
+                     '    charge_invoice({"customer_id": "cus_1", "amount_cents": 1000}, gateway)',
+                     '    assert gateway.last_charge == ("cus_1", 1000)',
+                 ]
+             ),
+         },
+         changed_files=("app/billing/invoice_service.py", "tests/test_invoice_service.py"),
+         rubric_issues=(
+             RubricIssue(
+                 issue_id="discount-total-unused",
+                 file_path="app/billing/invoice_service.py",
+                 line=8,
+                 category="bug",
+                 severity="warning",
+                 keywords=("discount", "total", "charge", "amount"),
+                 min_keyword_hits=2,
+                 weight=0.6,
+             ),
+             RubricIssue(
198
+ issue_id="missing-coupon-test",
199
+ file_path="tests/test_invoice_service.py",
200
+ line=11,
201
+ category="testing",
202
+ severity="warning",
203
+ keywords=("missing", "test", "coupon", "discount"),
204
+ min_keyword_hits=2,
205
+ weight=0.4,
206
+ ),
207
+ ),
208
+ max_steps=5,
209
+ ),
210
+ TaskSpec(
211
+ task_id="py-pr-review-hard",
212
+ difficulty="hard",
213
+ title="Async Job Runner Deduplication",
214
+ goal=(
215
+ "Review the async job-runner PR and find the subtle concurrency issues "
216
+ "without inventing extra problems."
217
+ ),
218
+ repo_summary=(
219
+ "A shared webhook backfill service is deduplicating in-flight work with an "
220
+ "async task cache and writing the latest result for operators to inspect."
221
+ ),
222
+ visible_diff="\n".join(
223
+ [
224
+ "diff --git a/app/jobs/runner.py b/app/jobs/runner.py",
225
+ "@@",
226
+ " async def run_job(job_id: str, payload: dict, worker) -> str:",
227
+ " if job_id in ACTIVE_RUNS:",
228
+ " return await ACTIVE_RUNS[job_id]",
229
+ "+ lock = asyncio.Lock()",
230
+ "+ async with lock:",
231
+ "+ task = asyncio.create_task(worker.run(payload))",
232
+ "+ ACTIVE_RUNS[job_id] = task",
233
+ " try:",
234
+ " result = await task",
235
+ " finally:",
236
+ " ACTIVE_RUNS.pop(job_id, None)",
237
+ "+ Path(\"latest-result.json\").write_text(result)",
238
+ " return result",
239
+ ]
240
+ ),
241
+ file_contents={
242
+ "app/jobs/runner.py": "\n".join(
243
+ [
244
+ "import asyncio",
245
+ "from pathlib import Path",
246
+ "",
247
+ "ACTIVE_RUNS: dict[str, asyncio.Task[str]] = {}",
248
+ "",
249
+ "async def run_job(job_id: str, payload: dict, worker) -> str:",
250
+ " if job_id in ACTIVE_RUNS:",
251
+ " return await ACTIVE_RUNS[job_id]",
252
+ "",
253
+ " lock = asyncio.Lock()",
254
+ " async with lock:",
255
+ " task = asyncio.create_task(worker.run(payload))",
256
+ " ACTIVE_RUNS[job_id] = task",
257
+ " try:",
258
+ " result = await task",
259
+ " finally:",
260
+ " ACTIVE_RUNS.pop(job_id, None)",
261
+ "",
262
+ ' Path("latest-result.json").write_text(result)',
263
+ " return result",
264
+ ]
265
+ ),
266
+ "tests/test_runner.py": "\n".join(
267
+ [
268
+ "import pytest",
269
+ "",
270
+ "from app.jobs.runner import run_job",
271
+ "",
272
+ "class FakeWorker:",
273
+ " async def run(self, payload: dict) -> str:",
274
+ ' return payload["job_id"]',
275
+ "",
276
+ "@pytest.mark.asyncio",
277
+ "async def test_run_job_returns_worker_result():",
278
+ " worker = FakeWorker()",
279
+ ' result = await run_job("job-1", {"job_id": "job-1"}, worker)',
280
+ ' assert result == "job-1"',
281
+ ]
282
+ ),
283
+ },
284
+ changed_files=("app/jobs/runner.py", "tests/test_runner.py"),
285
+ rubric_issues=(
286
+ RubricIssue(
287
+ issue_id="per-call-lock-race",
288
+ file_path="app/jobs/runner.py",
289
+ line=9,
290
+ category="bug",
291
+ severity="warning",
292
+ keywords=("lock", "race", "concurrent", "duplicate"),
293
+ min_keyword_hits=2,
294
+ weight=0.55,
295
+ ),
296
+ RubricIssue(
297
+ issue_id="shared-output-file-race",
298
+ file_path="app/jobs/runner.py",
299
+ line=18,
300
+ category="maintainability",
301
+ severity="warning",
302
+ keywords=("latest", "result", "file", "concurrent", "overwrite"),
303
+ min_keyword_hits=2,
304
+ weight=0.45,
305
+ ),
306
+ ),
307
+ max_steps=6,
308
+ ),
309
+ ]
310
+
311
+
312
+ TASKS_BY_ID: Dict[str, TaskSpec] = {task.task_id: task for task in TASKS}
313
+
314
+
315
+ def list_task_descriptors() -> List[TaskDescriptor]:
316
+ """Return public descriptors for all tasks."""
317
+
318
+ return [task.to_descriptor() for task in TASKS]
319
+
320
+
321
+ def list_task_summaries() -> List[TaskSummary]:
322
+ """Return task summaries for lightweight route responses."""
323
+
324
+ return [task.to_summary() for task in TASKS]
325
+
326
+
327
+ def get_task(task_id: str) -> TaskSpec:
328
+ """Return a task by id."""
329
+
330
+ try:
331
+ return TASKS_BY_ID[task_id]
332
+ except KeyError as exc: # pragma: no cover
333
+ raise ValueError(f"Unknown task_id: {task_id}") from exc
334
+
335
+
336
+ def task_ids() -> Iterable[str]:
337
+ """Return task ids in benchmark order."""
338
+
339
+ return [task.task_id for task in TASKS]
340
+
summary/01_introduction_quickstart.md ADDED
@@ -0,0 +1,66 @@
# 01. Introduction & Quick Start

Source:
- https://meta-pytorch.org/OpenEnv/auto_getting_started/plot_01_introduction_quickstart.html

## Main idea

OpenEnv is a standardized framework for building, sharing, and using RL environments as typed, containerized services.

The official docs frame it as:

- Gym-style interaction
- Docker-based isolation
- typed contracts
- HTTP/WebSocket access
- easy sharing through Hugging Face

## Core loop

The RL interaction model is still the normal loop:

1. reset environment
2. observe state
3. choose action
4. call step
5. receive reward + next observation
6. repeat until done
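The steps above can be sketched as a plain Python loop against a toy in-process environment. The class and method names here are illustrative, not the OpenEnv API:

```python
# Toy environment: reward 1.0 once an internal counter reaches a target.
class ToyEnv:
    def __init__(self, target: int = 3):
        self.target = target
        self.count = 0

    def reset(self) -> int:
        self.count = 0
        return self.count  # initial observation

    def step(self, action: int) -> tuple[int, float, bool]:
        self.count += action
        done = self.count >= self.target
        reward = 1.0 if done else 0.0
        return self.count, reward, done  # observation, reward, done


def run_episode(env: ToyEnv) -> float:
    obs = env.reset()          # 1. reset, 2. observe
    total_reward = 0.0
    done = False
    while not done:            # 6. repeat until done
        action = 1             # 3. choose action (trivial policy)
        obs, reward, done = env.step(action)  # 4. step, 5. reward + observation
        total_reward += reward
    return total_reward
```

OpenEnv keeps exactly this loop shape; it just moves the environment behind a typed client.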

The difference is that OpenEnv wraps this loop in a typed client/server system.

## Why OpenEnv instead of only Gym

The docs emphasize these advantages:

- type safety
- environment isolation through containers
- better reproducibility
- easier sharing and deployment
- language-agnostic communication
- cleaner debugging

The key contrast is:

- old style: raw arrays and same-process execution
- OpenEnv style: typed objects and isolated environment runtime

## Important mental model

OpenEnv treats environments more like services than in-process libraries.

That means:

- your environment logic can run separately from the agent code
- failures in the environment do not automatically crash the training loop
- deployment and usage are closer to how production systems work

## What this means for `python_env`

Your repo should keep these properties intact:

- typed `Action`, `Observation`, and evaluation models
- a clean environment class with `reset()`, `step()`, and `state`
- a client that hides transport details
- a deployable container

For hackathon purposes, this page is the justification for why your project is not just a script. It is a reusable environment artifact.
summary/02_using_environments.md ADDED
@@ -0,0 +1,98 @@
# 02. Using Environments

Source:
- https://meta-pytorch.org/OpenEnv/auto_getting_started/plot_02_using_environments.html

## Main idea

This page is about how users consume an existing OpenEnv environment.

The docs highlight three connection methods:

1. from Hugging Face Hub
2. from Docker image
3. from direct base URL

## Connection methods

### 1. From Hugging Face Hub

The easiest route for end users.

Typical flow:

- pull the image from the HF registry
- start the container locally
- connect to it
- clean it up on close

The docs show the pattern conceptually as:

```python
MyEnv.from_hub("owner/env-name")
```

### 2. From Docker image

Useful when:

- you already built the image locally
- you want reproducible local runs
- you do not want to depend on a live remote Space

Typical pattern:

```python
MyEnv.from_docker_image("my-env:latest")
```

### 3. Direct URL connection

Useful when:

- the server is already running
- you want to connect to localhost or a deployed Space

Typical pattern:

```python
MyEnv(base_url="http://localhost:8000")
```

## WebSocket model

The docs emphasize that OpenEnv uses WebSocket-backed sessions for persistent environment interaction.

Why this matters:

- lower overhead than stateless HTTP on every step
- cleaner session management
- better fit for multi-step RL loops

## Environment loop

The intended use pattern is:

1. connect
2. reset
3. repeatedly call `step(action)`
4. inspect `reward`, `done`, and `observation`
5. close cleanly
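That pattern can be sketched with a stand-in client. `FakeEnvClient` and `StepResult` here are assumptions for illustration; a real run would use the client generated for your environment:

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str
    reward: float
    done: bool


class FakeEnvClient:
    """Stand-in client that ends the episode after two steps."""

    def __init__(self, base_url: str):
        self.base_url = base_url
        self.steps = 0

    def reset(self) -> StepResult:
        self.steps = 0
        return StepResult(observation="start", reward=0.0, done=False)

    def step(self, action: str) -> StepResult:
        self.steps += 1
        done = self.steps >= 2
        reward = 1.0 if done else 0.0
        return StepResult(observation=f"after {action}", reward=reward, done=done)

    def close(self) -> None:
        pass  # a real client would tear down the session here


def run(client: FakeEnvClient) -> float:
    result = client.reset()                 # connect happened at construction
    total = 0.0
    while not result.done:                  # repeatedly call step(action)
        result = client.step("noop")
        total += result.reward              # inspect reward / done / observation
    client.close()                          # close cleanly
    return total
```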

## What this means for `python_env`

Your environment should be easy to consume in all three modes:

- local URL
- local Docker image
- HF Space

That means the most important user-facing checks are:

- `reset()` works
- `step()` works
- the client can parse the observation correctly
- Docker image starts cleanly
- deployed Space responds on `/health`, `/docs`, and session routes

For hackathon validation, this page is basically the “user experience” standard you need to match.
summary/03_building_environments.md ADDED
@@ -0,0 +1,99 @@
# 03. Building Environments

Source:
- https://meta-pytorch.org/OpenEnv/auto_getting_started/plot_03_building_environments.html

## Main idea

This page describes the standard OpenEnv project structure and how to build a custom environment from scratch.

## Standard project layout

The docs show a layout like:

```text
my_game/
├── __init__.py
├── models.py
├── client.py
├── openenv.yaml
├── README.md
└── server/
    ├── __init__.py
    ├── environment.py
    ├── app.py
    ├── Dockerfile
    └── requirements.txt
```

## Responsibilities by file

### `models.py`

Defines typed:

- actions
- observations
- state-related payloads

This is the contract layer.
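A minimal sketch of what a contract layer can look like with dataclasses. The field names are hypothetical, loosely modeled on a code-review environment:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ReviewAction:
    """What the agent sends: an operation plus optional payload."""
    operation: str                              # e.g. "submit_findings"
    findings: List[str] = field(default_factory=list)
    patched_code: Optional[str] = None


@dataclass
class ReviewObservation:
    """What the environment returns after reset/step."""
    task_description: str
    starter_code: str
    step: int = 0
    done: bool = False
```

Typed payloads like these are what replace the "raw arrays" of older Gym-style setups.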

### `client.py`

Defines the client used by agents and evaluation scripts.

This should:

- convert actions into payloads
- parse observations from responses
- expose a clean local Python API

### `server/environment.py`

Defines the actual environment logic:

- reset behavior
- step behavior
- state tracking

This is the heart of the environment.
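A minimal sketch of that shape, assuming a toy episode that ends after a fixed number of steps (names are illustrative):

```python
from dataclasses import dataclass


@dataclass
class EnvState:
    step_count: int = 0
    done: bool = False


class CountdownEnvironment:
    """Toy environment: the episode ends after max_steps steps."""

    def __init__(self, max_steps: int = 3):
        self.max_steps = max_steps
        self.state = EnvState()

    def reset(self) -> EnvState:
        # Reset behavior: start a fresh episode.
        self.state = EnvState()
        return self.state

    def step(self, action: str) -> EnvState:
        # Step behavior: advance state, mark done at the step limit.
        if self.state.done:
            raise RuntimeError("episode finished; call reset()")
        self.state.step_count += 1
        self.state.done = self.state.step_count >= self.max_steps
        return self.state
```

The server app only has to expose `reset()`, `step()`, and `state`; everything else stays in this class.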

### `server/app.py`

Exposes the environment through FastAPI/OpenEnv.

This is the transport layer, not the logic layer.

### `server/Dockerfile`

Defines how the environment runs reproducibly in a container.

### `openenv.yaml`

Defines the environment manifest and deployment metadata.

## Key lesson

The docs separate:

- contracts
- logic
- transport
- packaging

That separation is what makes environments maintainable and deployable.

## What this means for `python_env`

Your repo already follows this pattern reasonably well:

- `models.py`
- `client.py`
- `server/code_review_environment.py`
- `server/app.py`
- `server/Dockerfile`
- `openenv.yaml`

The main thing to protect is that no single file should try to do everything.

For hackathon quality, this page matters because judges will look for clean structure, not just working behavior.
summary/04_packaging_deploying.md ADDED
@@ -0,0 +1,84 @@
# 04. Packaging & Deploying

Source:
- https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html

## Main idea

This page is the operational workflow for taking an environment from local code to a validated, deployable artifact.

## Official workflow

The docs describe this sequence:

1. scaffold environment with `openenv init`
2. customize models, server logic, and client
3. implement typed `EnvClient`
4. configure dependencies and Dockerfile
5. run CLI packaging and deployment commands

## Important CLI commands

### `openenv build`

Purpose:

- build the Docker image for the environment

The docs call out that it supports both standalone and in-repo environments.

### `openenv validate --verbose`

Purpose:

- check required files
- verify entrypoints
- confirm deployment modes
- fail non-zero on problems

This is one of the most important commands for submission readiness.

### `openenv push`

Purpose:

- deploy to Hugging Face Spaces
- optionally push to other registries

Useful options mentioned by the docs:

- `--repo-id`
- `--private`
- `--registry`
- `--base-image`

## Hugging Face integration behavior

The docs say the CLI handles:

- validating `openenv.yaml`
- adding HF frontmatter when needed
- preparing the bundle for upload

That means your local files need to be internally consistent before `openenv push`.

## Prerequisites

The docs explicitly call out:

- Python 3.11+
- `uv`
- Docker
- OpenEnv installed

## What this means for `python_env`

This is your final operational checklist:

1. `openenv build`
2. `openenv validate --verbose`
3. `openenv push`

If any of those fail, fix them before worrying about benchmark polish.

For the hackathon, this page is effectively your packaging contract.
summary/05_contributing_hf.md ADDED
@@ -0,0 +1,84 @@
# 05. Contributing to Hugging Face

Source:
- https://meta-pytorch.org/OpenEnv/auto_getting_started/contributing-envs.html

## Main idea

This page explains how OpenEnv environments are shared and improved on Hugging Face Spaces.

The docs treat Spaces as multiple things at once:

- Git repositories
- Docker images
- Python packages
- apps

## Three official workflows

### 1. Push a new environment

This is the normal path when you built your own environment.

The docs show:

```bash
openenv push
openenv push --repo-id my-org/my-custom-env
openenv push --private
```

This is the workflow your `python_env` project most directly cares about.

### 2. Fork an existing environment

Useful when you want to build from an existing environment quickly.

The docs show:

```bash
openenv fork owner/space-name
openenv fork owner/space-name --repo-id my-username/my-copy
```

You can also set env vars, secrets, and hardware during the fork flow.

### 3. Download, modify, and open a PR

The docs show a Hub-native contribution flow:

```bash
hf download owner/space-name --local-dir space-name --repo-type space
openenv push --repo-id owner/space-name --create-pr
```

This is useful if you want to improve an existing environment without owning the original.

## Prerequisites from the docs

- Python 3.11+
- `uv`
- OpenEnv CLI
- Hugging Face account
- write token
- `hf auth login`

## Why this matters for `python_env`

For your project, the important takeaway is:

- the final destination is a Hugging Face Space
- the Space is not just a demo page, it is the actual distribution unit
- once deployed, others should be able to use it as:
  - a running endpoint
  - a Docker image
  - a Python-installable package

That means your submission should be clean enough that someone else could:

1. inspect the Space
2. clone it
3. run it locally
4. contribute improvements back

For the hackathon, this page is the “publish and collaborate” layer on top of the earlier build/deploy steps.
summary/README.md ADDED
@@ -0,0 +1,40 @@
# OpenEnv Docs Summary

This folder summarizes the official OpenEnv getting-started pages from:

- https://meta-pytorch.org/OpenEnv/auto_getting_started/plot_01_introduction_quickstart.html
- https://meta-pytorch.org/OpenEnv/auto_getting_started/plot_02_using_environments.html
- https://meta-pytorch.org/OpenEnv/auto_getting_started/plot_03_building_environments.html
- https://meta-pytorch.org/OpenEnv/auto_getting_started/environment-builder.html
- https://meta-pytorch.org/OpenEnv/auto_getting_started/contributing-envs.html

## Files

- `01_introduction_quickstart.md`
  What OpenEnv is, why it exists, and the standard RL interaction pattern.

- `02_using_environments.md`
  How to connect to environments from the Hub, Docker, or direct URLs and how the environment loop should look.

- `03_building_environments.md`
  The standard OpenEnv project layout and what each file is responsible for.

- `04_packaging_deploying.md`
  The packaging workflow with `openenv build`, `openenv validate`, and `openenv push`.

- `05_contributing_hf.md`
  How to publish, fork, and submit PR-style contributions to Hugging Face Spaces.

## Why this matters for `python_env`

These summaries are here to keep the project aligned with the official OpenEnv workflow:

- typed models
- environment class
- client
- FastAPI/OpenEnv app
- Docker packaging
- validation
- HF Spaces deployment

Read these files in order if you want the shortest path from local development to a working hackathon submission.
tasks/__init__.py ADDED
@@ -0,0 +1,11 @@
"""Task definitions for the Python code review environment."""

from .task_bank import TaskSpec, get_task, list_task_descriptors, list_task_summaries, task_ids

__all__ = [
    "TaskSpec",
    "get_task",
    "list_task_descriptors",
    "list_task_summaries",
    "task_ids",
]
tasks/task_bank.py ADDED
@@ -0,0 +1,273 @@
"""Deterministic task bank for Python code review and repair benchmark."""

from __future__ import annotations

from dataclasses import dataclass, field
from typing import Dict, List, Optional

from models import Difficulty, TaskDescriptor, TaskKind


@dataclass(frozen=True)
class TaskSpec:
    """Complete task specification with grading criteria."""

    task_id: str
    title: str
    difficulty: Difficulty
    task_kind: TaskKind
    task_description: str
    starter_code: str
    reference_code: str
    visible_tests: List[str]
    hidden_tests: List[str]
    max_steps: int = 10
    benchmark_entrypoint: Optional[str] = None
    benchmark_builder: Optional[str] = None
    benchmark_repeats: int = 1
    benchmark_timeout_s: float = 2.0
    style_max_line_length: int = 88
    expected_quality_markers: List[str] = field(default_factory=list)

    def to_descriptor(self) -> TaskDescriptor:
        """Convert to public task descriptor."""
        return TaskDescriptor(
            task_id=self.task_id,
            title=self.title,
            difficulty=self.difficulty,
            task_kind=self.task_kind,
            task_description=self.task_description,
            starter_code=self.starter_code,
            visible_tests=list(self.visible_tests),
            max_steps=self.max_steps,
        )


# ============================================================================
# TASK 1: EASY - Syntax Fixing
# ============================================================================

TASK_SYNTAX_FIX = TaskSpec(
    task_id="syntax-fix-easy",
    title="Fix a syntax-broken username normalizer",
    difficulty="easy",
    task_kind="syntax_fix",
    task_description=(
        "You are reviewing a utility function before merge. The submitted patch left "
        "the function with syntax errors. Repair the code so it compiles and preserves "
        "the intended behavior of trimming, lowercasing, and replacing spaces with underscores."
    ),
    starter_code='''def normalize_username(raw_name: str) -> str:
    cleaned = raw_name.strip().lower(
    if not cleaned:
        return "anonymous"
    return cleaned.replace(" ", "_")
''',
    reference_code='''def normalize_username(raw_name: str) -> str:
    cleaned = raw_name.strip().lower()
    if not cleaned:
        return "anonymous"
    return cleaned.replace(" ", "_")
''',
    visible_tests=[
        "normalize_username(' Alice Smith ') == 'alice_smith'",
        "normalize_username(' ') == 'anonymous'",
        "normalize_username('Bob') == 'bob'",
    ],
    hidden_tests=[
        "normalize_username(' HELLO WORLD ') == 'hello_world'",
        "normalize_username('') == 'anonymous'",
    ],
    max_steps=8,
)

# ============================================================================
# TASK 2: MEDIUM - Bug Fixing with Tests
# ============================================================================

TASK_BUG_FIX = TaskSpec(
    task_id="bug-fix-medium",
    title="Repair invoice discount calculation logic",
    difficulty="medium",
    task_kind="bug_fix",
    task_description=(
        "A billing helper function is returning the wrong amount after applying discounts. "
        "The function signature is correct, but the calculation logic is broken. "
        "Inspect the implementation, run visible tests, and fix the bug so all tests pass. "
        "Do not change the function signature or validation logic."
    ),
    starter_code='''from typing import Iterable


def calculate_invoice_total(line_items: Iterable[int], discount_percent: int) -> int:
    """Calculate invoice total with discount applied.

    Args:
        line_items: List of item prices in cents.
        discount_percent: Discount as integer 0-100.

    Returns:
        Final invoice total in cents after discount.

    Raises:
        ValueError: If discount_percent is outside 0-100 range.
    """
    if discount_percent < 0 or discount_percent > 100:
        raise ValueError("discount_percent must be between 0 and 100")

    subtotal = sum(line_items)
    discounted_total = subtotal - (subtotal * discount_percent // 100)
    return subtotal  # BUG: returning subtotal instead of discounted_total
''',
    reference_code='''from typing import Iterable


def calculate_invoice_total(line_items: Iterable[int], discount_percent: int) -> int:
    """Calculate invoice total with discount applied.

    Args:
        line_items: List of item prices in cents.
        discount_percent: Discount as integer 0-100.

    Returns:
        Final invoice total in cents after discount.

    Raises:
        ValueError: If discount_percent is outside 0-100 range.
    """
    if discount_percent < 0 or discount_percent > 100:
        raise ValueError("discount_percent must be between 0 and 100")

    subtotal = sum(line_items)
    discounted_total = subtotal - (subtotal * discount_percent // 100)
    return discounted_total
''',
    visible_tests=[
        "calculate_invoice_total([1000, 2000], 0) == 3000",  # No discount
        "calculate_invoice_total([1000, 2000], 50) == 1500",  # 50% off
        "calculate_invoice_total([1000], 10) == 900",  # 10% off
        "calculate_invoice_total([], 0) == 0",  # Empty
    ],
    hidden_tests=[
        "calculate_invoice_total([100, 200, 300], 25) == 450",  # 25% off
        "calculate_invoice_total([5000], 99) == 50",  # 99% off
    ],
    max_steps=10,
)

# ============================================================================
# TASK 3: HARD - Optimization & Code Quality
# ============================================================================

TASK_OPTIMIZATION = TaskSpec(
    task_id="optimization-hard",
    title="Optimize inefficient list duplicate removal",
    difficulty="hard",
    task_kind="optimization",
    task_description=(
        "Code review found that `remove_duplicates` is inefficient for large lists. "
        "The current implementation uses nested loops (O(n²) time). "
        "Optimize it to O(n) using a set-based approach while maintaining order. "
        "Style and code quality also matter: use idiomatic Python, proper types, and clear logic. "
        "All tests must pass, and the optimized version should be measurably faster."
    ),
    starter_code='''from typing import List, TypeVar


T = TypeVar('T')


def remove_duplicates(items: List[T]) -> List[T]:
    """Remove duplicates from list while preserving order.

    This implementation is inefficient for large lists.

    Args:
        items: List that may contain duplicate elements.

    Returns:
        List with duplicates removed, order preserved.
    """
    result = []
    for item in items:
        if item not in result:  # O(n) lookup in list per iteration
            result.append(item)
    return result
''',
    reference_code='''from typing import List, TypeVar


T = TypeVar('T')


def remove_duplicates(items: List[T]) -> List[T]:
    """Remove duplicates from list while preserving order.

    Efficient set-based implementation with O(n) time complexity.

    Args:
        items: List that may contain duplicate elements.

    Returns:
        List with duplicates removed, order preserved.
    """
    seen: set = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result
''',
    visible_tests=[
        "remove_duplicates([1, 2, 2, 3, 1]) == [1, 2, 3]",
        "remove_duplicates(['a', 'b', 'a']) == ['a', 'b']",
        "remove_duplicates([]) == []",
        "remove_duplicates([1]) == [1]",
    ],
    hidden_tests=[
        "remove_duplicates([5, 4, 3, 2, 1, 5, 4]) == [5, 4, 3, 2, 1]",
    ],
    max_steps=10,
    benchmark_entrypoint="remove_duplicates",
    benchmark_builder="lambda: list(range(5000)) + list(range(5000))",
    benchmark_repeats=3,
    benchmark_timeout_s=1.0,
    style_max_line_length=88,
    expected_quality_markers=[
        "set",
        "O(n)",
    ],
)

# ============================================================================
# Task Bank Registry
# ============================================================================

TASKS: Dict[str, TaskSpec] = {
    "syntax-fix-easy": TASK_SYNTAX_FIX,
    "bug-fix-medium": TASK_BUG_FIX,
    "optimization-hard": TASK_OPTIMIZATION,
}


def task_ids() -> List[str]:
    """Return all task IDs in deterministic order."""
    return ["syntax-fix-easy", "bug-fix-medium", "optimization-hard"]


def get_task(task_id: str) -> TaskSpec:
    """Get a task by ID."""
    if task_id not in TASKS:
        raise ValueError(f"Task {task_id} not found. Available: {list(TASKS.keys())}")
    return TASKS[task_id]


def list_task_descriptors() -> List[TaskDescriptor]:
    """List all task descriptors."""
    return [get_task(tid).to_descriptor() for tid in task_ids()]


def list_task_summaries() -> List[TaskDescriptor]:
    """List task summaries (alias for descriptors)."""
    return list_task_descriptors()
testing.md ADDED
@@ -0,0 +1,289 @@
1
+ # Testing Guide
2
+
3
+ This document lists the environment variables you may need, the available routes, which params are required, and how to test each route quickly.
4
+
5
+ ## 1) Environment Variables
6
+
7
+ ## Server runtime variables
8
+
9
+ Use these when running the FastAPI app (local or container):
10
+
11
+ - HOST: default 0.0.0.0 in Docker, localhost in app main()
12
+ - PORT: default 8000
13
+ - WORKERS: default 1 (used by container command)
14
+ - MAX_CONCURRENT_ENVS: default 32
15
+
16
+ Minimal local run on Windows PowerShell:
17
+
18
+ ```powershell
19
+ $env:HOST = "127.0.0.1"
20
+ $env:PORT = "8000"
21
+ $env:MAX_CONCURRENT_ENVS = "32"
22
+ uvicorn server.app:app --host $env:HOST --port $env:PORT
23
+ ```
24
+
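The same defaults can be exercised from plain Python. This is a sketch of parsing those variables with the documented fallbacks, not the server's actual startup code; the function and field names are illustrative:

```python
from typing import Mapping


def read_server_config(env: Mapping[str, str]) -> dict:
    """Parse the runtime variables documented above, applying their defaults."""
    return {
        "host": env.get("HOST", "0.0.0.0"),
        "port": int(env.get("PORT", "8000")),
        "workers": int(env.get("WORKERS", "1")),
        "max_concurrent_envs": int(env.get("MAX_CONCURRENT_ENVS", "32")),
    }


# With no overrides, the documented defaults apply.
print(read_server_config({}))

# With overrides (e.g. pass os.environ in real code):
print(read_server_config({"PORT": "9000"})["port"])  # → 9000
```

Passing a plain dict instead of reading `os.environ` directly keeps the parsing easy to unit-test.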
25
+ ### Inference script variables
26
+
27
+ Required:
28
+
29
+ - API_BASE_URL
30
+ - MODEL_NAME
31
+ - HF_TOKEN or OPENAI_API_KEY
32
+
33
+ Optional:
34
+
35
+ - ENV_BASE_URL (if omitted, inference.py launches the environment from the Docker image)
36
+ - PYTHON_ENV_IMAGE (default python_env-env:latest)
37
+ - MAX_STEPS (default 3)
38
+ - MAX_TASKS (default 3)
39
+ - INFERENCE_REPORT_PATH (default inference_results.json)
40
+ - TEMPERATURE (default 0)
41
+ - MAX_TOKENS (default 900)
42
+
43
+ Example:
44
+
45
+ ```powershell
46
+ $env:API_BASE_URL = "https://api.openai.com/v1"
47
+ $env:MODEL_NAME = "gpt-4.1-mini"
48
+ $env:OPENAI_API_KEY = "<your-key>"
49
+ $env:ENV_BASE_URL = "http://127.0.0.1:8000"
50
+ python inference.py
51
+ ```
52
+
53
+ ## 2) Task IDs You Can Use
54
+
55
+ - py-review-easy
56
+ - py-review-medium
57
+ - py-review-hard
58
+
59
+ ## 3) Route Testing (Params + Examples)
60
+
61
+ Base URL:
62
+
63
+ ```text
64
+ http://127.0.0.1:8000
65
+ ```
66
+
67
+ ## OpenEnv routes
68
+
69
+ ### POST /reset
70
+
71
+ - Required params: none
72
+ - Body: none
73
+
74
+ Test:
75
+
76
+ ```powershell
77
+ Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:8000/reset"
78
+ ```
79
+
80
+ ### POST /step
81
+
82
+ - Required params: none in query/path
83
+ - Required body shape:
84
+ - operation: one of submit_findings, request_hint, finalize
85
+ - findings: array (can be empty)
86
+ - Optional body fields:
87
+ - patched_code: string or null
88
+ - note: string or null
89
+
90
+ Minimal body example:
91
+
92
+ ```json
93
+ {
94
+ "operation": "request_hint",
95
+ "findings": []
96
+ }
97
+ ```
98
+
99
+ Test:
100
+
101
+ ```powershell
102
+ $body = @{
103
+ operation = "request_hint"
104
+ findings = @()
105
+ } | ConvertTo-Json
106
+
107
+ Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:8000/step" -ContentType "application/json" -Body $body
108
+ ```
109
+
110
+ Example with finding:
111
+
112
+ ```powershell
113
+ $body = @{
114
+ operation = "submit_findings"
115
+ findings = @(
116
+ @{
117
+ title = "Avoid eval on untrusted input"
118
+ line = 2
119
+ category = "security"
120
+ severity = "critical"
121
+ rationale = "eval can execute attacker-controlled code"
122
+ recommendation = "Use json.loads instead"
123
+ rule_id = "avoid-eval"
124
+ }
125
+ )
126
+ patched_code = $null
127
+ note = "first pass"
128
+ } | ConvertTo-Json -Depth 6
129
+
130
+ Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:8000/step" -ContentType "application/json" -Body $body
131
+ ```
132
+
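If you are not on PowerShell, the same `/step` body can be built in plain Python. This sketch only constructs and serializes the payload documented above; actually sending it still requires a running server (e.g. via `urllib.request` or `requests`):

```python
import json

# Mirrors the /step body shape documented above.
payload = {
    "operation": "submit_findings",
    "findings": [
        {
            "title": "Avoid eval on untrusted input",
            "line": 2,
            "category": "security",
            "severity": "critical",
            "rationale": "eval can execute attacker-controlled code",
            "recommendation": "Use json.loads instead",
            "rule_id": "avoid-eval",
        }
    ],
    "patched_code": None,  # optional
    "note": "first pass",  # optional
}

body = json.dumps(payload)
print(body[:60] + "...")
```

Send `body` as the request payload with `Content-Type: application/json`, exactly as in the PowerShell examples.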
133
+ ### GET /state
134
+
135
+ - Required params: none
136
+
137
+ Test:
138
+
139
+ ```powershell
140
+ Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/state"
141
+ ```
142
+
143
+ ### GET /schema
144
+
145
+ - Required params: none
146
+
147
+ Test:
148
+
149
+ ```powershell
150
+ Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/schema"
151
+ ```
152
+
153
+ ### WS /ws
154
+
155
+ - Use a websocket client to connect.
156
+ - No route params required.
157
+
158
+ ## Custom REST routes
159
+
160
+ ### GET /health
161
+
162
+ - Required params: none
163
+
164
+ ```powershell
165
+ Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/health"
166
+ ```
167
+
168
+ ### GET /tasks
169
+
170
+ - Required params: none
171
+
172
+ ```powershell
173
+ Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/tasks"
174
+ ```
175
+
176
+ ### GET /tasks/{task_id}
177
+
178
+ - Required path param: task_id
179
+
180
+ ```powershell
181
+ Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/tasks/py-review-easy"
182
+ ```
183
+
184
+ ### POST /tasks/{task_id}/grade
185
+
186
+ - Required path param: task_id
187
+ - Body uses PythonReviewAction shape
188
+ - operation defaults to submit_findings if omitted
189
+ - findings array accepted
190
+ - patched_code optional
191
+ - note optional
192
+
193
+ ```powershell
194
+ $body = @{
195
+ findings = @(
196
+ @{
197
+ title = "Avoid eval on untrusted input"
198
+ line = 2
199
+ category = "security"
200
+ severity = "critical"
201
+ rationale = "eval executes arbitrary code"
202
+ recommendation = "Use json.loads"
203
+ }
204
+ )
205
+ } | ConvertTo-Json -Depth 6
206
+
207
+ Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:8000/tasks/py-review-easy/grade" -ContentType "application/json" -Body $body
208
+ ```
209
+
210
+ ### POST /review
211
+
212
+ - Required body field:
213
+ - code: string
214
+ - Optional body field:
215
+ - context: string
216
+
217
+ ```powershell
218
+ $body = @{
219
+ code = "def f(x):`n return eval(x)`n"
220
+ } | ConvertTo-Json
221
+
222
+ Invoke-RestMethod -Method Post -Uri "http://127.0.0.1:8000/review" -ContentType "application/json" -Body $body
223
+ ```
224
+
225
+ ### GET /history
226
+
227
+ - Required params: none
228
+
229
+ ```powershell
230
+ Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/history"
231
+ ```
232
+
233
+ ### DELETE /history
234
+
235
+ - Required params: none
236
+
237
+ ```powershell
238
+ Invoke-RestMethod -Method Delete -Uri "http://127.0.0.1:8000/history"
239
+ ```
240
+
241
+ ### GET /config
242
+
243
+ - Required params: none
244
+
245
+ ```powershell
246
+ Invoke-RestMethod -Method Get -Uri "http://127.0.0.1:8000/config"
247
+ ```
248
+
249
+ ### PUT /config
250
+
251
+ - Required params: none
252
+ - Body: PythonEnvConfig object
253
+ - All fields have defaults, so {} is valid for a reset-like update
254
+
255
+ Minimal test:
256
+
257
+ ```powershell
258
+ Invoke-RestMethod -Method Put -Uri "http://127.0.0.1:8000/config" -ContentType "application/json" -Body "{}"
259
+ ```
260
+
261
+ Full body example:
262
+
263
+ ```powershell
264
+ $body = @{
265
+ task_order = @("py-review-easy", "py-review-medium", "py-review-hard")
266
+ max_steps_per_task = 4
267
+ hint_penalty = 0.05
268
+ false_positive_penalty = 0.08
269
+ duplicate_penalty = 0.03
270
+ patch_bonus_multiplier = 0.2
271
+ max_history_entries = 50
272
+ } | ConvertTo-Json
273
+
274
+ Invoke-RestMethod -Method Put -Uri "http://127.0.0.1:8000/config" -ContentType "application/json" -Body $body
275
+ ```
276
+
277
+ ## 4) Quick Validation Commands
278
+
279
+ Run automated tests:
280
+
281
+ ```powershell
282
+ pytest -q
283
+ ```
284
+
285
+ Run only API tests:
286
+
287
+ ```powershell
288
+ pytest -q tests/test_api.py
289
+ ```
tests/conftest.py ADDED
@@ -0,0 +1,7 @@
1
+ from pathlib import Path
2
+ import sys
3
+
4
+
5
+ ROOT = Path(__file__).resolve().parents[1]
6
+ if str(ROOT) not in sys.path:
7
+ sys.path.insert(0, str(ROOT))
tests/test_api.py ADDED
@@ -0,0 +1,31 @@
1
+ from fastapi.testclient import TestClient
2
+
3
+ from server.app import app
4
+
5
+
6
+ client = TestClient(app)
7
+
8
+
9
+ def test_health_endpoint():
10
+ response = client.get("/health")
11
+
12
+ assert response.status_code == 200
13
+ payload = response.json()
14
+ assert payload["status"] == "ok"
15
+ assert payload["environment"] == "python_code_review_env"
16
+
17
+
18
+ def test_reset_returns_expected_observation():
19
+ response = client.post("/reset", json={"task_id": "syntax-fix-easy"})
20
+
21
+ assert response.status_code == 200
22
+ payload = response.json()
23
+ assert payload["observation"]["task_id"] == "syntax-fix-easy"
24
+ assert "current_code" in payload["observation"]
25
+
26
+
27
+ def test_tasks_endpoint_lists_three_tasks():
28
+ response = client.get("/tasks")
29
+
30
+ assert response.status_code == 200
31
+ assert len(response.json()) == 3
tests/test_environment.py ADDED
@@ -0,0 +1,81 @@
1
+ from models import PythonCodeReviewAction
2
+ from server.env import PythonCodeReviewEnvironment
3
+
4
+
5
+ def test_reset_cycles_tasks_in_order():
6
+ env = PythonCodeReviewEnvironment()
7
+
8
+ first = env.reset()
9
+ second = env.reset()
10
+ third = env.reset()
11
+
12
+ assert first.task_id == "syntax-fix-easy"
13
+ assert second.task_id == "bug-fix-medium"
14
+ assert third.task_id == "optimization-hard"
15
+
16
+
17
+ def test_invalid_edit_code_penalizes_action():
18
+ env = PythonCodeReviewEnvironment()
19
+ env.reset(task_id="syntax-fix-easy")
20
+
21
+ observation = env.step(PythonCodeReviewAction(action_type="edit_code", code=""))
22
+
23
+ assert observation.reward < 0
24
+ assert observation.reward_details.invalid_action_penalty == 0.1
25
+ assert "requires code" in observation.last_action_status
26
+
27
+
28
+ def test_easy_task_gets_full_score_after_fix():
29
+ env = PythonCodeReviewEnvironment()
30
+ env.reset(task_id="syntax-fix-easy")
31
+
32
+ env.step(
33
+ PythonCodeReviewAction(
34
+ action_type="edit_code",
35
+ code="""def normalize_username(raw_name: str) -> str:
36
+ cleaned = raw_name.strip().lower()
37
+ if not cleaned:
38
+ return "anonymous"
39
+ return cleaned.replace(" ", "_")
40
+ """,
41
+ )
42
+ )
43
+ observation = env.step(PythonCodeReviewAction(action_type="submit_solution"))
44
+
45
+ assert observation.done is True
46
+ assert observation.score == 1.0
47
+
48
+
49
+ def test_medium_task_reports_partial_visible_progress():
50
+ env = PythonCodeReviewEnvironment()
51
+ env.reset(task_id="bug-fix-medium")
52
+
53
+ observation = env.step(PythonCodeReviewAction(action_type="run_tests"))
54
+
55
+ assert observation.score < 1.0
56
+ assert "visible checks" in observation.test_results
57
+
58
+
59
+ def test_hard_task_reference_solution_scores_high():
60
+ env = PythonCodeReviewEnvironment()
61
+ env.reset(task_id="optimization-hard")
62
+
63
+ env.step(
64
+ PythonCodeReviewAction(
65
+ action_type="edit_code",
66
+ code="""from collections import Counter
67
+ from typing import Iterable
68
+
69
+
70
+ def summarize_user_activity(events: Iterable[dict]) -> list[tuple[str, int]]:
71
+ \"\"\"Aggregate user activity counts in one pass.\"\"\"
72
+
73
+ counts = Counter(event["user_id"] for event in events)
74
+ return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
75
+ """,
76
+ )
77
+ )
78
+ observation = env.step(PythonCodeReviewAction(action_type="submit_solution"))
79
+
80
+ assert observation.done is True
81
+ assert observation.score >= 0.9
tests/test_examples.py ADDED
@@ -0,0 +1,27 @@
1
+ from graders.optimization import grade_optimization_task
2
+ from graders.syntax import grade_bug_fix_task, grade_syntax_task
3
+ from tasks.task_bank import get_task
4
+
5
+
6
+ def test_syntax_grader_partial_score_is_bounded():
7
+ task = get_task("syntax-fix-easy")
8
+ grade = grade_syntax_task(task.starter_code, task)
9
+
10
+ assert 0.0 <= grade.score < 1.0
11
+
12
+
13
+ def test_bug_fix_grader_reference_solution_reaches_one():
14
+ task = get_task("bug-fix-medium")
15
+ grade = grade_bug_fix_task(task.reference_code, task, include_hidden=True)
16
+
17
+ assert grade.score == 1.0
18
+ assert grade.tests_passed == grade.tests_total
19
+
20
+
21
+ def test_optimization_grader_scores_better_than_starter():
22
+ task = get_task("optimization-hard")
23
+ starter_grade = grade_optimization_task(task.starter_code, task)
24
+ reference_grade = grade_optimization_task(task.reference_code, task)
25
+
26
+ assert reference_grade.score > starter_grade.score
27
+ assert reference_grade.runtime_score >= starter_grade.runtime_score
tutorial/HackathonChecklist.md ADDED
@@ -0,0 +1,323 @@
1
+ # Hackathon Checklist
2
+
3
+ This file translates the tutorial folder into a concrete plan for `python_env`.
4
+
5
+ It is not a generic OpenEnv summary. It is a project-specific checklist showing:
6
+
7
+ - what the tutorials are teaching
8
+ - how this repo maps to those ideas
9
+ - what is already done
10
+ - what still needs to be finished before submission
11
+
12
+ ## 1. What The Tutorials Mean For This Project
13
+
14
+ ### Tutorial 1: OpenEnv Pattern
15
+
16
+ Main concept:
17
+
18
+ - every environment should follow a clean pattern:
19
+ - typed models
20
+ - environment logic
21
+ - client
22
+ - FastAPI/OpenEnv app
23
+ - Docker packaging
24
+
25
+ How `python_env` maps:
26
+
27
+ - `models.py`
28
+ typed action/observation/config/evaluation models
29
+ - `server/code_review_environment.py`
30
+ environment logic
31
+ - `client.py`
32
+ Python client for reset/step/state
33
+ - `server/app.py`
34
+ OpenEnv app plus helper routes
35
+ - `server/Dockerfile`
36
+ container packaging
37
+
38
+ Status:
39
+
40
+ - done
41
+
42
+ What to keep in mind:
43
+
44
+ - do not break the OpenEnv contract while adding features
45
+ - treat models as the public interface
46
+
47
+ ### Tutorial 2: Deployment
48
+
49
+ Main concept:
50
+
51
+ - local development first
52
+ - Docker second
53
+ - HF Spaces deployment third
54
+ - test `/health`, `/reset`, `/docs`, `/ws`
55
+
56
+ How `python_env` maps:
57
+
58
+ - local server:
59
+ `uvicorn server.app:app --reload --host 0.0.0.0 --port 8000`
60
+ - Docker:
61
+ `docker build -t python_env-env:latest -f server/Dockerfile .`
62
+ - Spaces:
63
+ `openenv push`
64
+
65
+ Status:
66
+
67
+ - app boots locally
68
+ - Dockerfile exists and now supports `HOST`, `PORT`, `WORKERS`, `MAX_CONCURRENT_ENVS`
69
+ - live Docker build still needs final verification
70
+ - Spaces deployment still needs to be executed and checked
71
+
72
+ ### Tutorial 3: Scaling
73
+
74
+ Main concept:
75
+
76
+ - OpenEnv works best with WebSocket sessions
77
+ - use environment class/factory instead of a singleton for OpenEnv session handling
78
+ - support concurrency with `MAX_CONCURRENT_ENVS`
79
+
80
+ How `python_env` maps:
81
+
82
+ - `create_app(PythonEnvironment, PythonReviewAction, PythonReviewObservation, max_concurrent_envs=...)`
83
+ - `MAX_CONCURRENT_ENVS` is now read from env vars
84
+ - Docker now exposes `MAX_CONCURRENT_ENVS`
85
+
86
+ Status:
87
+
88
+ - partially done
89
+
90
+ Important caveat:
91
+
92
+ - OpenEnv `/reset` and `/step` use the class-based session model
93
+ - custom routes such as `/history` and `/config` still use a singleton helper instance
94
+ - this is acceptable for manual tooling, but it is not a perfect unified session model
95
+
96
+ Recommendation:
97
+
98
+ - keep it for now if your priority is submission
99
+ - refactor only if it starts causing testing confusion
100
+
101
+ ### Tutorial 4: RL Training And Reward Design
102
+
103
+ Main concept:
104
+
105
+ - a good RL environment needs:
106
+ - meaningful reward
107
+ - repeated trajectories
108
+ - enough task diversity
109
+ - an inference/training loop
110
+
111
+ How `python_env` maps:
112
+
113
+ - reward shaping already exists:
114
+ - matched rubric items
115
+ - false-positive penalties
116
+ - duplicate penalties
117
+ - hint penalties
118
+ - patch bonus
119
+ - finalize bonus
120
+ - `inference.py` already provides a baseline model-vs-env loop
121
+
122
+ Status:
123
+
124
+ - partially done
125
+
126
+ Gap:
127
+
128
+ - 3 tasks are enough for hackathon minimums
129
+ - 3 tasks are not enough for serious RL learning
130
+
131
+ ## 2. Current Repo Status
132
+
133
+ ### Strong Areas
134
+
135
+ - real-world task: code review
136
+ - typed Pydantic/OpenEnv models
137
+ - deterministic grader
138
+ - 3 difficulty levels
139
+ - partial-progress reward shaping
140
+ - manual routes for health/tasks/review/config/history
141
+ - baseline inference script
142
+ - docs in `README.md`, `Project.md`
143
+
144
+ ### Weak Areas
145
+
146
+ - benchmark still small
147
+ - Docker image build not fully verified end-to-end
148
+ - HF Spaces deployment not yet executed
149
+ - `openenv validate` still needs to be run in your actual runtime
150
+ - no large trajectory dataset yet
151
+ - custom REST state and OpenEnv session state are not fully unified
152
+
153
+ ## 3. What You Need To Do To Be Submission-Ready
154
+
155
+ ### Step 1: Validate Local Server
156
+
157
+ Run:
158
+
159
+ ```powershell
160
+ uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
161
+ ```
162
+
163
+ Manually verify:
164
+
165
+ - `http://127.0.0.1:8000/docs`
166
+ - `http://127.0.0.1:8000/health`
167
+ - `POST /reset`
168
+ - `POST /step`
169
+ - `GET /tasks`
170
+ - `POST /review`
171
+
172
+ ### Step 2: Run Tests
173
+
174
+ Run:
175
+
176
+ ```powershell
177
+ python -m pytest tests -q
178
+ ```
179
+
180
+ You want all tests green before Docker or HF deployment.
181
+
182
+ ### Step 3: Run OpenEnv Validation
183
+
184
+ Run:
185
+
186
+ ```powershell
187
+ openenv validate
188
+ ```
189
+
190
+ This is a hard requirement.
191
+
192
+ If validation fails:
193
+
194
+ - fix schema mismatch first
195
+ - fix route mismatch second
196
+ - fix packaging third
197
+
198
+ ### Step 4: Run Baseline Inference
199
+
200
+ Run:
201
+
202
+ ```powershell
203
+ $env:API_BASE_URL="https://api.openai.com/v1"
204
+ $env:MODEL_NAME="gpt-4.1-mini"
205
+ $env:OPENAI_API_KEY="your_key"
206
+ $env:ENV_BASE_URL="http://127.0.0.1:8000"
207
+ python inference.py
208
+ ```
209
+
210
+ You want:
211
+
212
+ - script completes without crashing
213
+ - `inference_results.json` gets written
214
+ - all 3 tasks run
215
+ - scores are reproducible
216
+
217
+ ### Step 5: Verify Docker
218
+
219
+ Run:
220
+
221
+ ```powershell
222
+ docker build -t python_env-env:latest -f server/Dockerfile .
223
+ docker run --rm -p 8000:8000 python_env-env:latest
224
+ ```
225
+
226
+ Then test:
227
+
228
+ - `GET /health`
229
+ - `POST /reset`
230
+ - `POST /step`
231
+
232
+ ### Step 6: Deploy To HF Spaces
233
+
234
+ Run:
235
+
236
+ ```powershell
237
+ openenv push
238
+ ```
239
+
240
+ Then verify the live Space:
241
+
242
+ - `/health`
243
+ - `/docs`
244
+ - `/reset`
245
+ - `/web`
246
+
247
+ ## 4. What Will Help You “Win” Instead Of Just “Submit”
248
+
249
+ Passing minimum requirements is not enough. To be competitive, improve these areas:
250
+
251
+ ### A. Increase Task Diversity
252
+
253
+ Current:
254
+
255
+ - 3 benchmark tasks
256
+
257
+ Target:
258
+
259
+ - at least 10 to 20 tasks before final submission if possible
260
+
261
+ Good additions:
262
+
263
+ - SQL injection review
264
+ - unsafe YAML/pickle loading
265
+ - file-handle leak
266
+ - race-condition style bug
267
+ - retry/backoff misuse
268
+ - caching bug
269
+ - logging/privacy leak
270
+ - API timeout handling
271
+
272
+ ### B. Improve Observation Context
273
+
274
+ Good RL environments provide enough context for the model to improve.
275
+
276
+ Possible improvements:
277
+
278
+ - add matched categories so far
279
+ - add a short summary of uncovered issue types
280
+ - add previous actions in structured form, not just free text
281
+ - add rubric coverage signals without leaking exact answers
282
+
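As a sketch, an enriched observation along these lines might look like the following dataclass. Every field name here is illustrative, not the repo's actual model; the point is structured context without leaking rubric answers:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EnrichedObservation:
    task_id: str
    current_code: str
    # Rubric areas the agent has already covered.
    matched_categories: List[str] = field(default_factory=list)
    # Coarse summary of what remains; no exact answers leaked.
    uncovered_hint: str = ""
    # Previous actions in structured form, not free text.
    prior_actions: List[str] = field(default_factory=list)


obs = EnrichedObservation(
    task_id="py-review-easy",
    current_code="def f(x): return eval(x)",
    matched_categories=["security"],
    uncovered_hint="2 issue types remain: correctness, style",
    prior_actions=["request_hint", "submit_findings"],
)
print(obs.matched_categories, obs.uncovered_hint)
```

Keeping these fields typed means the inference loop can condition prompts on them without string parsing.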
283
+ ### C. Collect Trajectories
284
+
285
+ You need data that shows:
286
+
287
+ - first attempt
288
+ - improved second attempt
289
+ - final attempt
290
+ - failures
291
+ - false positives
292
+ - hint usage
293
+
294
+ This is much more useful than only saving final scores.
295
+
296
+ ### D. Improve Reward Design Carefully
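One lightweight way to capture that data is one JSONL record per episode. This is a sketch with illustrative field names, not an existing module in the repo:

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AttemptRecord:
    # One step in a trajectory; field names are illustrative.
    operation: str
    score: float
    false_positives: int = 0
    used_hint: bool = False


@dataclass
class Trajectory:
    task_id: str
    attempts: List[AttemptRecord] = field(default_factory=list)


traj = Trajectory(task_id="py-review-medium")
traj.attempts.append(AttemptRecord("submit_findings", 0.4, false_positives=2))
traj.attempts.append(AttemptRecord("request_hint", 0.4, used_hint=True))
traj.attempts.append(AttemptRecord("finalize", 0.7))

# JSONL-friendly: serialize one trajectory per line.
line = json.dumps(asdict(traj))
print(line[:80] + "...")
```

A file of such lines captures first attempts, improvements, failures, false positives, and hint usage in one place.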
297
+
298
+ Current reward design is already decent.
299
+
300
+ Good refinements:
301
+
302
+ - slightly larger reward for critical security findings
303
+ - bonus for correct line numbers
304
+ - bonus for high-quality recommendation text
305
+ - penalty for vague findings with no rationale
306
+
307
+ Do not overcomplicate the reward before submission. Stability matters more.
308
+
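The refinements above can be sketched as a single shaping function. The weights here are illustrative, not the environment's tuned values, and the input keys are assumptions for the sketch:

```python
def shaped_reward(finding: dict, base: float = 0.25) -> float:
    """Apply the refinements listed above: severity weighting, line-number
    bonus, recommendation-quality bonus, and a vagueness penalty."""
    reward = base
    # Slightly larger reward for critical security findings.
    if finding.get("severity") == "critical" and finding.get("category") == "security":
        reward += 0.10
    # Bonus for pinpointing the correct line.
    if finding.get("line_correct"):
        reward += 0.05
    # Bonus for substantive recommendation text.
    if len(finding.get("recommendation", "")) >= 20:
        reward += 0.05
    # Penalty for vague findings with no rationale.
    if not finding.get("rationale"):
        reward -= 0.15
    return round(reward, 4)


print(shaped_reward({
    "severity": "critical",
    "category": "security",
    "line_correct": True,
    "recommendation": "Replace eval with json.loads",
    "rationale": "eval runs arbitrary code",
}))  # → 0.45
```

Keeping each adjustment small relative to the base reward preserves stability, which matters more than cleverness before submission.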
309
+ ## 5. Recommended Immediate Priority Order
310
+
311
+ If time is limited, do the work in this order:
312
+
313
+ 1. `pytest`
314
+ 2. `openenv validate`
315
+ 3. local inference run
316
+ 4. Docker build and run
317
+ 5. HF Space deployment
318
+ 6. add 5 to 10 more tasks
319
+ 7. collect trajectory data
320
+
321
+ ## 6. One-Sentence Summary
322
+
323
+ You are already following the correct OpenEnv architecture from the tutorials; the main remaining work is not redesign but validation, deployment verification, and expanding task/data quality so the environment scores well in human review.
tutorial/tutorial1.md ADDED
@@ -0,0 +1,1259 @@
1
+ # OpenEnv: Production RL Made Simple
2
+
3
+ <div align="center">
4
+
5
+ <img src="https://upload.wikimedia.org/wikipedia/commons/1/10/PyTorch_logo_icon.svg" width="200" alt="PyTorch">
6
+
7
+ ### *From "Hello World" to RL Training in 5 Minutes* ✨
8
+
9
+ ---
10
+
11
+ **What if RL environments were as easy to use as REST APIs?**
12
+
13
+ That's OpenEnv. Type-safe. Isolated. Production-ready. 🎯
14
+
15
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/meta-pytorch/OpenEnv/blob/main/examples/OpenEnv_Tutorial.ipynb)
16
+ [![GitHub](https://img.shields.io/badge/GitHub-meta--pytorch%2FOpenEnv-blue?logo=github)](https://github.com/meta-pytorch/OpenEnv)
17
+ [![License](https://img.shields.io/badge/License-BSD%203--Clause-green.svg)](https://opensource.org/licenses/BSD-3-Clause)
18
+ [![PyTorch](https://img.shields.io/badge/PyTorch-EE4C2C?logo=pytorch&logoColor=white)](https://pytorch.org/)
19
+
20
+ Author: [Sanyam Bhutani](http://twitter.com/bhutanisanyam1/)
21
+
22
+ </div>
23
+
24
+ ---
25
+
26
+ ## Why OpenEnv?
27
+
28
+ Let's take a trip down memory lane:
29
+
30
+ It's 2016 and RL is popular. You read some papers, and it looks promising.
31
+
32
+ But in the real world, Cartpole is the best you can run on a gaming GPU.
33
+
34
+ What do you do beyond Cartpole?
35
+
36
+ Fast-forward to 2025: GRPO is awesome, and this time it's not JUST theory. It works well in practice and is really here!
37
+
38
+ The problem still remains: how do you take these RL algorithms beyond Cartpole?
39
+
40
+ A huge part of RL is giving your algorithms access to environments to learn from.
41
+
42
+ We are excited to introduce an Environment Spec for adding Open Environments for RL training. It lets you focus on your experiments and lets everyone bring their own environments.
43
+
44
+ Focus on experiments, use OpenEnvironments, and build agents that go beyond Cartpole on a single spec.
45
+
46
+ ---
47
+
48
+ ## 📋 What You'll Learn
49
+
50
+ <table>
51
+ <tr>
52
+ <td width="50%">
53
+
54
+ **🎯 Part 1-2: The Fundamentals**
55
+
56
+ - ⚡ RL in 60 seconds
57
+ - 🤔 Why existing solutions fall short
58
+ - 💡 The OpenEnv solution
59
+
60
+ </td>
61
+ <td width="50%">
62
+
63
+ **🏗️ Part 3-5: The Architecture**
64
+
65
+ - 🔧 How OpenEnv works
66
+ - 🔍 Exploring real code
67
+ - 🎮 OpenSpiel integration example
68
+
69
+ </td>
70
+ </tr>
71
+ <tr>
72
+ <td width="50%">
73
+
74
+ **🎮 Part 6-8: Hands-On Demo**
75
+
76
+ - 🔌 Use existing OpenSpiel environment
77
+ - 🤖 Test 4 different policies
78
+ - 👀 Watch learning happen live
79
+
80
+ </td>
81
+ <td width="50%">
82
+
83
+ **🔧 Part 9-10: Going Further**
84
+
85
+ - 🎮 Switch to other OpenSpiel games
86
+ - ✨ Build your own integration
87
+ - 🌐 Deploy to production
88
+
89
+ </td>
90
+ </tr>
91
+ </table>
92
+
93
+ !!! tip "Pro Tip"
94
+ This notebook is designed to run top-to-bottom in Google Colab with zero setup!
95
+
96
+ ⏱️ **Time**: ~5 minutes | 📊 **Difficulty**: Beginner-friendly | 🎯 **Outcome**: Production-ready RL knowledge
97
+
98
+ ---
99
+
100
+ ## 📑 Table of Contents
101
+
102
+ ### Foundation
103
+
104
+ - [Part 1: RL in 60 Seconds ⏱️](#part-1-rl-in-60-seconds)
105
+ - [Part 2: The Problem with Traditional RL 😤](#part-2-the-problem-with-traditional-rl)
106
+ - [Part 3: Setup 🛠️](#part-3-setup)
107
+
108
+ ### Architecture
109
+
110
+ - [Part 4: The OpenEnv Pattern 🏗️](#part-4-the-openenv-pattern)
111
+ - [Part 5: Example Integration - OpenSpiel 🎮](#part-5-example-integration---openspiel)
112
+
113
+ ### Hands-On Demo
114
+
115
+ - [Part 6: Interactive Demo 🎮](#part-6-using-real-openspiel)
116
+ - [Part 7: Four Policies 🤖](#part-7-four-policies)
117
+ - [Part 8: Policy Competition! 🏆](#part-8-policy-competition)
118
+
119
+ ### Advanced
120
+
121
+ - [Part 9: Using Real OpenSpiel 🎮](#part-9-switching-to-other-games)
122
+ - [Part 10: Create Your Own Integration 🛠️](#part-10-create-your-own-integration)
123
+
124
+ ### Wrap Up
125
+
126
+ - [Summary: Your Journey 🎓](#summary-your-journey)
127
+ - [Resources 📚](#resources)
128
+
129
+ ---
130
+
131
+ ## Part 1: RL in 60 Seconds ⏱️
132
+
133
+ **Reinforcement Learning is simpler than you think.**
134
+
135
+ It's just a loop:
136
+
137
+ ```python
138
+ while not done:
139
+ observation = environment.observe()
140
+ action = policy.choose(observation)
141
+ reward = environment.step(action)
142
+ policy.learn(reward)
143
+ ```
144
+
145
+ That's it. That's RL.
146
+
147
+ Let's see it in action:
148
+
149
+ ```python
150
+ import random
151
+
152
+ print("🎲 " + "="*58 + " 🎲")
153
+ print(" Number Guessing Game - The Simplest RL Example")
154
+ print("🎲 " + "="*58 + " 🎲")
155
+
156
+ # Environment setup
157
+ target = random.randint(1, 10)
158
+ guesses_left = 3
159
+
160
+ print(f"\n🎯 I'm thinking of a number between 1 and 10...")
161
+ print(f"💭 You have {guesses_left} guesses. Let's see how random guessing works!\n")
162
+
163
+ # The RL Loop - Pure random policy (no learning!)
164
+ while guesses_left > 0:
165
+ # Policy: Random guessing (no learning yet!)
166
+ guess = random.randint(1, 10)
167
+ guesses_left -= 1
168
+
169
+ print(f"💭 Guess #{3-guesses_left}: {guess}", end=" → ")
170
+
171
+ # Reward signal (but we're not using it!)
172
+ if guess == target:
173
+ print("🎉 Correct! +10 points")
174
+ break
175
+ elif abs(guess - target) <= 2:
176
+ print("🔥 Warm! (close)")
177
+ else:
178
+ print("❄️ Cold! (far)")
179
+ else:
180
+ print(f"\n💔 Out of guesses. The number was {target}.")
181
+
182
+ print("\n" + "="*62)
183
+ print("💡 This is RL: Observe → Act → Reward → Repeat")
184
+ print(" But this policy is terrible! It doesn't learn from rewards.")
185
+ print("="*62 + "\n")
186
+ ```
187
+
188
+ **Output:**
189
+ ```
190
+ 🎲 ========================================================== 🎲
191
+ Number Guessing Game - The Simplest RL Example
192
+ 🎲 ========================================================== 🎲
193
+
194
+ 🎯 I'm thinking of a number between 1 and 10...
195
+ 💭 You have 3 guesses. Let's see how random guessing works!
196
+
197
+ 💭 Guess #1: 2 → ❄️ Cold! (far)
198
+ 💭 Guess #2: 10 → 🎉 Correct! +10 points
199
+
200
+ ==============================================================
201
+ 💡 This is RL: Observe → Act → Reward → Repeat
202
+ But this policy is terrible! It doesn't learn from rewards.
203
+ ==============================================================
204
+ ```
205
+
206
+ ---
207
+
208
+ ## Part 2: The Problem with Traditional RL 😤
209
+
210
+ ### 🤔 Why Can't We Just Use OpenAI Gym?
211
+
212
+ Good question! Gym is great for research, but production needs more...
213
+
214
+ | Challenge | Traditional Approach | OpenEnv Solution |
215
+ |-----------|---------------------|------------------|
216
+ | **Type Safety** | ❌ `obs[0][3]` - what is this? | ✅ `obs.info_state` - IDE knows! |
217
+ | **Isolation** | ❌ Same process (can crash your training) | ✅ Docker containers (fully isolated) |
218
+ | **Deployment** | ❌ "Works on my machine" 🤷 | ✅ Same container everywhere 🐳 |
219
+ | **Scaling** | ❌ Hard to distribute | ✅ Deploy to Kubernetes ☸️ |
220
+ | **Language** | ❌ Python only | ✅ Any language (HTTP API) 🌐 |
221
+ | **Debugging** | ❌ Cryptic numpy errors | ✅ Clear type errors 🐛 |
222
+
223
+ ### 💡 The OpenEnv Philosophy
224
+
225
+ **"RL environments should be like microservices"**
226
+
227
+ Think of it like this: You don't run your database in the same process as your web server, right? Same principle!
228
+
229
+ - 🔒 **Isolated**: Run in containers (security + stability)
230
+ - 🌐 **Standard**: HTTP API, works everywhere
231
+ - 📦 **Versioned**: Docker images (reproducibility!)
232
+ - 🚀 **Scalable**: Deploy to cloud with one command
233
+ - 🛡️ **Type-safe**: Catch bugs before they happen
234
+ - 🔄 **Portable**: Works on Mac, Linux, Windows, Cloud
235
+
236
+ ### The Architecture
237
+
238
+ ```
239
+ ┌────────────────────────────────────────────────────────────┐
240
+ │ YOUR TRAINING CODE │
241
+ │ │
242
+ │ env = OpenSpielEnv(...) ← Import the client │
243
+ │ result = env.reset() ← Type-safe! │
244
+ │ result = env.step(action) ← Type-safe! │
245
+ │ │
246
+ └─────────────────┬──────────────────────────────────────────┘
247
+
248
+ │ HTTP/JSON (Language-Agnostic)
249
+ │ POST /reset, POST /step, GET /state
250
+
251
+ ┌─────────────────▼──────────────────────────────────────────┐
252
+ │ DOCKER CONTAINER │
253
+ │ │
254
+ │ ┌──────────────────────────────────────────────┐ │
255
+ │ │ FastAPI Server │ │
256
+ │ │ └─ Environment (reset, step, state) │ │
257
+ │ │ └─ Your Game/Simulation Logic │ │
258
+ │ └──────────────────────────────────────────────┘ │
259
+ │ │
260
+ │ Isolated • Reproducible • Secure │
261
+ └────────────────────────────────────────────────────────────┘
262
+ ```
263
+
264
+ !!! info "Key Insight"
265
+ You never see HTTP details - just clean Python methods!
266
+
267
+ ```python
268
+ env.reset() # Under the hood: HTTP POST to /reset
269
+ env.step(...) # Under the hood: HTTP POST to /step
270
+ env.state() # Under the hood: HTTP GET to /state
271
+ ```
272
+
273
+ The magic? OpenEnv handles all the plumbing. You focus on RL! ✨
274
+
275
+ ---
276
+
277
+ ## Part 3: Setup 🛠️
278
+
279
+ **Running in Colab?** This cell will clone OpenEnv and install dependencies automatically.
280
+
281
+ **Running locally?** Make sure you're in the OpenEnv directory.
282
+
283
+ ```python
284
+ # Detect environment
285
+ try:
286
+ import google.colab
287
+ IN_COLAB = True
288
+ print("🌐 Running in Google Colab - Perfect!")
289
+ except ImportError:
290
+ IN_COLAB = False
291
+ print("💻 Running locally - Nice!")
292
+
293
+ if IN_COLAB:
294
+ print("\n📦 Cloning OpenEnv repository...")
295
+ !git clone https://github.com/meta-pytorch/OpenEnv.git > /dev/null 2>&1
296
+ %cd OpenEnv
297
+
298
+ print("📚 Installing dependencies (this takes ~10 seconds)...")
299
+ !pip install -q fastapi uvicorn requests
300
+
301
+ import sys
302
+ sys.path.insert(0, './src')
303
+ print("\n✅ Setup complete! Everything is ready to go! 🎉")
304
+ else:
305
+ import sys
306
+ from pathlib import Path
307
+ sys.path.insert(0, str(Path.cwd().parent / 'src'))
308
+ print("✅ Using local OpenEnv installation")
309
+
310
+ print("\n🚀 Ready to explore OpenEnv and build amazing things!")
311
+ print("💡 Tip: Run cells top-to-bottom for the best experience.\n")
312
+ ```
313
+
314
+ **Output:**
315
+ ```
316
+ 💻 Running locally - Nice!
317
+ ✅ Using local OpenEnv installation
318
+
319
+ 🚀 Ready to explore OpenEnv and build amazing things!
320
+ 💡 Tip: Run cells top-to-bottom for the best experience.
321
+ ```
322
+
323
+ ---
324
+
325
+ ## Part 4: The OpenEnv Pattern 🏗️
326
+
327
+ ### Every OpenEnv Environment Has 3 Components:
328
+
329
+ ```
330
+ src/envs/your_env/
331
+ ├── 📝 models.py ← Type-safe contracts
332
+ │ (Action, Observation, State)
333
+
334
+ ├── 📱 client.py ← What YOU import
335
+ │ (HTTPEnvClient implementation)
336
+
337
+ └── 🖥️ server/
338
+ ├── environment.py ← Game/simulation logic
339
+ ├── app.py ← FastAPI server
340
+ └── Dockerfile ← Container definition
341
+ ```
342
+
343
+ Let's explore the actual OpenEnv code to see how this works:
344
+
345
+ ```python
346
+ # Import OpenEnv's core abstractions
347
+ from core.env_server import Environment, Action, Observation, State
348
+ from core.http_env_client import HTTPEnvClient
349
+
350
+ print("="*70)
351
+ print(" 🧩 OPENENV CORE ABSTRACTIONS")
352
+ print("="*70)
353
+
354
+ print("""
355
+ 🖥️ SERVER SIDE (runs in Docker):
356
+
357
+ class Environment(ABC):
358
+ '''Base class for all environment implementations'''
359
+
360
+ @abstractmethod
361
+ def reset(self) -> Observation:
362
+ '''Start new episode'''
363
+
364
+ @abstractmethod
365
+ def step(self, action: Action) -> Observation:
366
+ '''Execute action, return observation'''
367
+
368
+ @property
369
+ def state(self) -> State:
370
+ '''Get episode metadata'''
371
+
372
+ 📱 CLIENT SIDE (your training code):
373
+
374
+ class HTTPEnvClient(ABC):
375
+ '''Base class for HTTP clients'''
376
+
377
+ def reset(self) -> StepResult:
378
+ # HTTP POST /reset
379
+
380
+ def step(self, action) -> StepResult:
381
+ # HTTP POST /step
382
+
383
+ def state(self) -> State:
384
+ # HTTP GET /state
385
+ """)
386
+
387
+ print("="*70)
388
+ print("\n✨ Same interface on both sides - communication via HTTP!")
389
+ print("🎯 You focus on RL, OpenEnv handles the infrastructure.\n")
390
+ ```
391
+
392
+ **Output:**
393
+ ```
394
+ ======================================================================
395
+ 🧩 OPENENV CORE ABSTRACTIONS
396
+ ======================================================================
397
+
398
+ 🖥️ SERVER SIDE (runs in Docker):
399
+
400
+ class Environment(ABC):
401
+ '''Base class for all environment implementations'''
402
+
403
+ @abstractmethod
404
+ def reset(self) -> Observation:
405
+ '''Start new episode'''
406
+
407
+ @abstractmethod
408
+ def step(self, action: Action) -> Observation:
409
+ '''Execute action, return observation'''
410
+
411
+ @property
412
+ def state(self) -> State:
413
+ '''Get episode metadata'''
414
+
415
+ 📱 CLIENT SIDE (your training code):
416
+
417
+ class HTTPEnvClient(ABC):
418
+ '''Base class for HTTP clients'''
419
+
420
+ def reset(self) -> StepResult:
421
+ # HTTP POST /reset
422
+
423
+ def step(self, action) -> StepResult:
424
+ # HTTP POST /step
425
+
426
+ def state(self) -> State:
427
+ # HTTP GET /state
428
+
429
+ ======================================================================
430
+
431
+ ✨ Same interface on both sides - communication via HTTP!
432
+ 🎯 You focus on RL, OpenEnv handles the infrastructure.
433
+ ```
434
+
435
+ ---
436
+
437
+ ## Part 5: Example Integration - OpenSpiel 🎮
438
+
439
+ ### What is OpenSpiel?
440
+
441
+ **OpenSpiel** is a library from DeepMind with **70+ game environments** for RL research.
442
+
443
+ ### OpenEnv's Integration
444
+
445
+ We've wrapped **6 OpenSpiel games** following the OpenEnv pattern:
446
+
447
+ | **🎯 Single-Player** | **👥 Multi-Player** |
448
+ |---------------------|---------------------|
449
+ | 1. **Catch** - Catch falling ball | 5. **Tic-Tac-Toe** - Classic 3×3 |
450
+ | 2. **Cliff Walking** - Navigate grid | 6. **Kuhn Poker** - Imperfect info poker |
451
+ | 3. **2048** - Tile puzzle | |
452
+ | 4. **Blackjack** - Card game | |
453
+
454
+ This shows how OpenEnv can wrap **any** existing RL library!
455
+
456
+ ```python
457
+ from envs.openspiel_env.client import OpenSpielEnv
458
+
459
+ print("="*70)
460
+ print(" 🔌 HOW OPENENV WRAPS OPENSPIEL")
461
+ print("="*70)
462
+
463
+ print("""
464
+ class OpenSpielEnv(HTTPEnvClient[OpenSpielAction, OpenSpielObservation]):
465
+
466
+ def _step_payload(self, action: OpenSpielAction) -> dict:
467
+ '''Convert typed action to JSON for HTTP'''
468
+ return {
469
+ "action_id": action.action_id,
470
+ "game_name": action.game_name,
471
+ }
472
+
473
+ def _parse_result(self, payload: dict) -> StepResult:
474
+ '''Parse HTTP JSON response into typed observation'''
475
+ return StepResult(
476
+ observation=OpenSpielObservation(...),
477
+ reward=payload['reward'],
478
+ done=payload['done']
479
+ )
480
+
481
+ """)
482
+
483
+ print("─" * 70)
484
+ print("\n✨ Usage (works for ALL OpenEnv environments):")
485
+ print("""
486
+ env = OpenSpielEnv(base_url="http://localhost:8000")
487
+
488
+ result = env.reset()
489
+ # Returns StepResult[OpenSpielObservation] - Type safe!
490
+
491
+ result = env.step(OpenSpielAction(action_id=2, game_name="catch"))
492
+ # Type checker knows this is valid!
493
+
494
+ state = env.state()
495
+ # Returns OpenSpielState
496
+ """)
497
+
498
+ print("─" * 70)
499
+ print("\n🎯 This pattern works for ANY environment you want to wrap!\n")
500
+ ```
501
+
502
+ **Output:**
503
+ ```
504
+ ======================================================================
505
+ 🔌 HOW OPENENV WRAPS OPENSPIEL
506
+ ======================================================================
507
+
508
+ class OpenSpielEnv(HTTPEnvClient[OpenSpielAction, OpenSpielObservation]):
509
+
510
+ def _step_payload(self, action: OpenSpielAction) -> dict:
511
+ '''Convert typed action to JSON for HTTP'''
512
+ return {
513
+ "action_id": action.action_id,
514
+ "game_name": action.game_name,
515
+ }
516
+
517
+ def _parse_result(self, payload: dict) -> StepResult:
518
+ '''Parse HTTP JSON response into typed observation'''
519
+ return StepResult(
520
+ observation=OpenSpielObservation(...),
521
+ reward=payload['reward'],
522
+ done=payload['done']
523
+ )
524
+
525
+
526
+ ──────────────────────────────────────────────────────────────────────
527
+
528
+ ✨ Usage (works for ALL OpenEnv environments):
529
+
530
+ env = OpenSpielEnv(base_url="http://localhost:8000")
531
+
532
+ result = env.reset()
533
+ # Returns StepResult[OpenSpielObservation] - Type safe!
534
+
535
+ result = env.step(OpenSpielAction(action_id=2, game_name="catch"))
536
+ # Type checker knows this is valid!
537
+
538
+ state = env.state()
539
+ # Returns OpenSpielState
540
+
541
+ ──────────────────────────────────────────────────────────────────────
542
+
543
+ 🎯 This pattern works for ANY environment you want to wrap!
544
+ ```
545
+
546
+ ### Type-Safe Models
547
+
548
+ ```python
549
+ # Import OpenSpiel integration models
550
+ from envs.openspiel_env.models import (
551
+ OpenSpielAction,
552
+ OpenSpielObservation,
553
+ OpenSpielState
554
+ )
555
+ from dataclasses import fields
556
+
557
+ print("="*70)
558
+ print(" 🎮 OPENSPIEL INTEGRATION - TYPE-SAFE MODELS")
559
+ print("="*70)
560
+
561
+ print("\n📤 OpenSpielAction (what you send):")
562
+ print(" " + "─" * 64)
563
+ for field in fields(OpenSpielAction):
564
+ print(f" • {field.name:20s} : {field.type}")
565
+
566
+ print("\n📥 OpenSpielObservation (what you receive):")
567
+ print(" " + "─" * 64)
568
+ for field in fields(OpenSpielObservation):
569
+ print(f" • {field.name:20s} : {field.type}")
570
+
571
+ print("\n📊 OpenSpielState (episode metadata):")
572
+ print(" " + "─" * 64)
573
+ for field in fields(OpenSpielState):
574
+ print(f" • {field.name:20s} : {field.type}")
575
+
576
+ print("\n" + "="*70)
577
+ print("\n💡 Type safety means:")
578
+ print(" ✅ Your IDE autocompletes these fields")
579
+ print(" ✅ Typos are caught before running")
580
+ print(" ✅ Refactoring is safe")
581
+ print(" ✅ Self-documenting code\n")
582
+ ```
583
+
584
+ **Output:**
585
+ ```
586
+ ======================================================================
587
+ 🎮 OPENSPIEL INTEGRATION - TYPE-SAFE MODELS
588
+ ======================================================================
589
+
590
+ 📤 OpenSpielAction (what you send):
591
+ ────────────────────────────────────────────────────────────────
592
+ • metadata : typing.Dict[str, typing.Any]
593
+ • action_id : int
594
+ • game_name : str
595
+ • game_params : Dict[str, Any]
596
+
597
+ 📥 OpenSpielObservation (what you receive):
598
+ ────────────────────────────────────────────────────────────────
599
+ • done : <class 'bool'>
600
+ • reward : typing.Union[bool, int, float, NoneType]
601
+ • metadata : typing.Dict[str, typing.Any]
602
+ • info_state : List[float]
603
+ • legal_actions : List[int]
604
+ • game_phase : str
605
+ • current_player_id : int
606
+ • opponent_last_action : Optional[int]
607
+
608
+ 📊 OpenSpielState (episode metadata):
609
+ ────────────────────────────────────────────────────────────────
610
+ • episode_id : typing.Optional[str]
611
+ • step_count : <class 'int'>
612
+ • game_name : str
613
+ • agent_player : int
614
+ • opponent_policy : str
615
+ • game_params : Dict[str, Any]
616
+ • num_players : int
617
+
618
+ ======================================================================
619
+
620
+ 💡 Type safety means:
621
+ ✅ Your IDE autocompletes these fields
622
+ ✅ Typos are caught before running
623
+ ✅ Refactoring is safe
624
+ ✅ Self-documenting code
625
+ ```
626
+
627
+ ### How the Client Works
628
+
629
+ The client **inherits from HTTPEnvClient** and implements 3 methods:
630
+
631
+ 1. `_step_payload()` - Convert action → JSON
632
+ 2. `_parse_result()` - Parse JSON → typed observation
633
+ 3. `_parse_state()` - Parse JSON → state
634
+
635
+ That's it! The base class handles all HTTP communication.
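In outline, the division of labor can be sketched like this. Everything below is a simplified stand-in (the real `HTTPEnvClient` makes actual HTTP calls; here a plain function fakes the transport so the sketch runs offline):

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class StepResult:
    observation: Any
    reward: float
    done: bool

class MiniHTTPEnvClient:
    """Sketch of the pattern: subclasses supply the payload/parse hooks,
    the base class owns the transport."""
    def __init__(self, transport: Callable[[str, Dict], Dict]):
        # transport stands in for an HTTP POST: (path, json body) -> json reply
        self._transport = transport

    def step(self, action) -> StepResult:
        payload = self._step_payload(action)        # hook 1: action -> JSON
        reply = self._transport("/step", payload)   # base class does the wire call
        return self._parse_result(reply)            # hook 2: JSON -> typed result

    def _step_payload(self, action) -> Dict: ...
    def _parse_result(self, payload: Dict) -> StepResult: ...

class EchoClient(MiniHTTPEnvClient):
    def _step_payload(self, action) -> Dict:
        return {"message": action}

    def _parse_result(self, payload: Dict) -> StepResult:
        return StepResult(observation=payload["echo"], reward=0.0, done=False)

# Fake "server" so the sketch runs without a network:
fake_server = lambda path, body: {"echo": body["message"].upper()}
client = EchoClient(fake_server)
result = client.step("hello")
print(result.observation)  # HELLO
```

The subclass never touches the transport; the base class never touches game-specific types. That separation is why one client pattern can wrap many environments.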
636
+
637
+ ---
638
+
639
+ ## Part 6: Using Real OpenSpiel 🎮
640
+
641
+ <div style="text-align: center; background: linear-gradient(135deg, #667eea 0%, #764ba2 100%); color: white; padding: 30px; border-radius: 15px; margin: 30px 0;">
642
+
643
+ ### Now let's USE a production environment!
644
+
645
+ We'll play **Catch** using OpenEnv's **OpenSpiel integration** 🎯
646
+
647
+ This is a REAL, production-ready environment!
648
+
649
+ **Get ready for:**
650
+
651
+ - 🔌 Using existing environments (not building)
652
+ - 🤖 Testing policies against real games
653
+ - 📊 Live gameplay visualization
654
+ - 🎯 Production-ready patterns
655
+
656
+ </div>
657
+
658
+ ### The Game: Catch 🔴🏓
659
+
660
+ ```
661
+ ⬜ ⬜ 🔴 ⬜ ⬜
662
+ ⬜ ⬜ ⬜ ⬜ ⬜
663
+ ⬜ ⬜ ⬜ ⬜ ⬜ Ball
664
+ ⬜ ⬜ ⬜ ⬜ ⬜
665
+ ⬜ ⬜ ⬜ ⬜ ⬜ falls
666
+ ⬜ ⬜ ⬜ ⬜ ⬜
667
+ ⬜ ⬜ ⬜ ⬜ ⬜ down
668
+ ⬜ ⬜ ⬜ ⬜ ⬜
669
+ ⬜ ⬜ ⬜ ⬜ ⬜
670
+ ⬜ ⬜ 🏓 ⬜ ⬜
671
+ Paddle
672
+ ```
673
+
674
+ **Rules:**
675
+
676
+ - 10×5 grid
677
+ - Ball falls from random column
678
+ - Move paddle left/right to catch it
679
+
680
+ **Actions:**
681
+
682
+ - `0` = Move LEFT ⬅️
683
+ - `1` = STAY 🛑
684
+ - `2` = Move RIGHT ➡️
685
+
686
+ **Reward:**
687
+
688
+ - `+1` if caught 🎉
689
+ - `0` if missed 😢
690
+
691
+ !!! note "Why Catch?"
692
+ - Simple rules (easy to understand)
693
+ - Fast episodes (~5 steps)
694
+ - Clear success/failure
695
+ - Part of OpenSpiel's 70+ games!
696
+
697
+ **💡 The Big Idea:**
698
+ Instead of building this from scratch, we'll USE OpenEnv's existing OpenSpiel integration. Same interface, but production-ready!
699
+
700
+ ```python
701
+ from envs.openspiel_env import OpenSpielEnv
702
+ from envs.openspiel_env.models import (
703
+ OpenSpielAction,
704
+ OpenSpielObservation,
705
+ OpenSpielState
706
+ )
707
+ from dataclasses import fields
708
+
709
+ print("🎮 " + "="*64 + " 🎮")
710
+ print(" ✅ Importing Real OpenSpiel Environment!")
711
+ print("🎮 " + "="*64 + " 🎮\n")
712
+
713
+ print("📦 What we just imported:")
714
+ print(" • OpenSpielEnv - HTTP client for OpenSpiel games")
715
+ print(" • OpenSpielAction - Type-safe actions")
716
+ print(" • OpenSpielObservation - Type-safe observations")
717
+ print(" • OpenSpielState - Episode metadata\n")
718
+
719
+ print("📋 OpenSpielObservation fields:")
720
+ print(" " + "─" * 60)
721
+ for field in fields(OpenSpielObservation):
722
+ print(f" • {field.name:25s} : {field.type}")
723
+
724
+ print("\n" + "="*70)
725
+ print("\n💡 This is REAL OpenEnv code - used in production!")
726
+ print(" • Wraps 6 OpenSpiel games (Catch, Tic-Tac-Toe, Poker, etc.)")
727
+ print(" • Type-safe actions and observations")
728
+ print(" • Works via HTTP (we'll see that next!)\n")
729
+ ```
730
+
731
+ **Output:**
732
+ ```
733
+ 🎮 ================================================================ 🎮
734
+ ✅ Importing Real OpenSpiel Environment!
735
+ 🎮 ================================================================ 🎮
736
+
737
+ 📦 What we just imported:
738
+ • OpenSpielEnv - HTTP client for OpenSpiel games
739
+ • OpenSpielAction - Type-safe actions
740
+ • OpenSpielObservation - Type-safe observations
741
+ • OpenSpielState - Episode metadata
742
+
743
+ 📋 OpenSpielObservation fields:
744
+ ────────────────────────────────────────────────────────────
745
+ • done : <class 'bool'>
746
+ • reward : typing.Union[bool, int, float, NoneType]
747
+ • metadata : typing.Dict[str, typing.Any]
748
+ • info_state : List[float]
749
+ • legal_actions : List[int]
750
+ • game_phase : str
751
+ • current_player_id : int
752
+ • opponent_last_action : Optional[int]
753
+
754
+ ======================================================================
755
+
756
+ 💡 This is REAL OpenEnv code - used in production!
757
+ • Wraps 6 OpenSpiel games (Catch, Tic-Tac-Toe, Poker, etc.)
758
+ • Type-safe actions and observations
759
+ • Works via HTTP (we'll see that next!)
760
+ ```
761
+
762
+ ---
763
+
764
+ ## Part 7: Four Policies 🤖
765
+
766
+ Let's test 4 different AI strategies:
767
+
768
+ | Policy | Strategy | Expected Performance |
769
+ |--------|----------|----------------------|
770
+ | **🎲 Random** | Pick random action every step | ~20% (pure luck) |
771
+ | **🛑 Always Stay** | Never move, hope ball lands in center | ~20% (terrible!) |
772
+ | **🧠 Smart** | Move paddle toward ball | 100% (optimal!) |
773
+ | **📈 Learning** | Start random, learn smart strategy | ~85% (improves over time) |
774
+
775
+ **💡 These policies work with ANY OpenSpiel game!**
776
+
777
+ ```python
778
+ import random
779
+
780
+ # ============================================================================
781
+ # POLICIES - Different AI strategies (adapted for OpenSpiel)
782
+ # ============================================================================
783
+
784
+ class RandomPolicy:
785
+ """Baseline: Pure random guessing."""
786
+ name = "🎲 Random Guesser"
787
+
788
+ def select_action(self, obs: OpenSpielObservation) -> int:
789
+ return random.choice(obs.legal_actions)
790
+
791
+
792
+ class AlwaysStayPolicy:
793
+ """Bad strategy: Never moves."""
794
+ name = "🛑 Always Stay"
795
+
796
+ def select_action(self, obs: OpenSpielObservation) -> int:
797
+ return 1 # STAY
798
+
799
+
800
+ class SmartPolicy:
801
+ """Optimal: Move paddle toward ball."""
802
+ name = "🧠 Smart Heuristic"
803
+
804
+ def select_action(self, obs: OpenSpielObservation) -> int:
805
+ # Parse OpenSpiel observation
806
+ # For Catch: info_state is a flattened 10x5 grid
807
+ # Ball position and paddle position encoded in the vector
808
+ info_state = obs.info_state
809
+
810
+ # Find ball and paddle positions from info_state
811
+ # Catch uses a 10x5 grid, so 50 values
812
+ grid_size = 5
813
+
814
+ # Find positions (ball = 1.0 in the flattened grid, paddle = 1.0 in the last row of the flattened grid)
815
+ ball_col = None
816
+ paddle_col = None
817
+
818
+ for idx, val in enumerate(info_state[:-grid_size]):  # skip the paddle row
819
+ if abs(val - 1.0) < 0.01: # Ball
820
+ ball_col = idx % grid_size
821
+ break
822
+
823
+ last_row = info_state[-grid_size:]
824
+ paddle_col = next((i for i, v in enumerate(last_row) if abs(v - 1.0) < 0.01), None)  # Paddle (tolerant float match)
825
+
826
+ if ball_col is not None and paddle_col is not None:
827
+ if paddle_col < ball_col:
828
+ return 2 # Move RIGHT
829
+ elif paddle_col > ball_col:
830
+ return 0 # Move LEFT
831
+
832
+ return 1 # STAY (fallback)
833
+
834
+
835
+ class LearningPolicy:
836
+ """Simulated RL: Epsilon-greedy exploration."""
837
+ name = "📈 Learning Agent"
838
+
839
+ def __init__(self):
840
+ self.steps = 0
841
+ self.smart_policy = SmartPolicy()
842
+
843
+ def select_action(self, obs: OpenSpielObservation) -> int:
844
+ self.steps += 1
845
+
846
+ # Decay exploration rate over time
847
+ epsilon = max(0.1, 1.0 - (self.steps / 100))
848
+
849
+ if random.random() < epsilon:
850
+ # Explore: random action
851
+ return random.choice(obs.legal_actions)
852
+ else:
853
+ # Exploit: use smart strategy
854
+ return self.smart_policy.select_action(obs)
855
+
856
+
857
+ print("🤖 " + "="*64 + " 🤖")
858
+ print(" ✅ 4 Policies Created (Adapted for OpenSpiel)!")
859
+ print("🤖 " + "="*64 + " 🤖\n")
860
+
861
+ policies = [RandomPolicy(), AlwaysStayPolicy(), SmartPolicy(), LearningPolicy()]
862
+ for i, policy in enumerate(policies, 1):
863
+ print(f" {i}. {policy.name}")
864
+
865
+ print("\n💡 These policies work with OpenSpielObservation!")
866
+ print(" • Read info_state (flattened grid)")
867
+ print(" • Use legal_actions")
868
+ print(" • Work with ANY OpenSpiel game that exposes these!\n")
869
+ ```
870
+
871
+ **Output:**
872
+ ```
873
+ 🤖 ================================================================ 🤖
874
+ ✅ 4 Policies Created (Adapted for OpenSpiel)!
875
+ 🤖 ================================================================ 🤖
876
+
877
+ 1. 🎲 Random Guesser
878
+ 2. 🛑 Always Stay
879
+ 3. 🧠 Smart Heuristic
880
+ 4. 📈 Learning Agent
881
+
882
+ 💡 These policies work with OpenSpielObservation!
883
+ • Read info_state (flattened grid)
884
+ • Use legal_actions
885
+ • Work with ANY OpenSpiel game that exposes these!
886
+ ```
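As a quick sanity check, the paddle-chasing rule can be exercised against hand-built observations. This is a self-contained restatement of SmartPolicy's logic — `FakeObs` and `smart_action` are illustrative stand-ins, not the real OpenSpiel classes:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FakeObs:  # stand-in for OpenSpielObservation
    info_state: List[float]
    legal_actions: List[int]

def smart_action(obs: FakeObs, grid_size: int = 5) -> int:
    """Same rule as SmartPolicy: find ball and paddle columns, chase the ball."""
    ball_col = next((i % grid_size
                     for i, v in enumerate(obs.info_state[:-grid_size])
                     if abs(v - 1.0) < 0.01), None)
    last_row = obs.info_state[-grid_size:]
    paddle_col = next((i for i, v in enumerate(last_row)
                       if abs(v - 1.0) < 0.01), None)
    if ball_col is not None and paddle_col is not None:
        if paddle_col < ball_col:
            return 2  # RIGHT
        if paddle_col > ball_col:
            return 0  # LEFT
    return 1          # STAY

# 10x5 grid: ball at row 0 col 4, paddle at row 9 col 0 -> must move RIGHT
grid = [0.0] * 50
grid[4] = 1.0    # ball
grid[45] = 1.0   # paddle (col 0 of the last row)
print(smart_action(FakeObs(grid, [0, 1, 2])))  # 2
```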
887
+
888
+ ---
889
+
890
+ ## Part 8: Policy Competition! 🏆
891
+
892
+ Let's run **50 episodes** for each policy against **REAL OpenSpiel** and see who wins!
893
+
894
+ This is production code - every action is an HTTP call to the OpenSpiel server!
895
+
896
+ ```python
897
+ def evaluate_policies(env, num_episodes=50):
898
+ """Compare all policies over many episodes using real OpenSpiel."""
899
+ policies = [
900
+ RandomPolicy(),
901
+ AlwaysStayPolicy(),
902
+ SmartPolicy(),
903
+ LearningPolicy(),
904
+ ]
905
+
906
+ print("\n🏆 " + "="*66 + " 🏆")
907
+ print(f" POLICY SHOWDOWN - {num_episodes} Episodes Each")
908
+ print(f" Playing against REAL OpenSpiel Catch!")
909
+ print("🏆 " + "="*66 + " 🏆\n")
910
+
911
+ results = []
912
+ for policy in policies:
913
+ print(f"⚡ Testing {policy.name}...", end=" ")
914
+ successes = sum(run_episode(env, policy, visualize=False)
915
+ for _ in range(num_episodes))
916
+ success_rate = (successes / num_episodes) * 100
917
+ results.append((policy.name, success_rate, successes))
918
+ print(f"✓ Done!")
919
+
920
+ print("\n" + "="*70)
921
+ print(" 📊 FINAL RESULTS")
922
+ print("="*70 + "\n")
923
+
924
+ # Sort by success rate (descending)
925
+ results.sort(key=lambda x: x[1], reverse=True)
926
+
927
+ # Award medals to top 3
928
+ medals = ["🥇", "🥈", "🥉", " "]
929
+
930
+ for i, (name, rate, successes) in enumerate(results):
931
+ medal = medals[i]
932
+ bar = "█" * int(rate / 2)
933
+ print(f"{medal} {name:25s} [{bar:<50}] {rate:5.1f}% ({successes}/{num_episodes})")
934
+
935
+ print("\n" + "="*70)
936
+ print("\n✨ Key Insights:")
937
+ print(" • Random (~20%): Baseline - pure luck 🎲")
938
+ print(" • Always Stay (~20%): Bad strategy - stays center 🛑")
939
+ print(" • Smart (100%): Optimal - perfect play! 🧠")
940
+ print(" • Learning (~85%): Improves over time 📈")
941
+ print("\n🎓 This is Reinforcement Learning + OpenEnv in action:")
942
+ print(" 1. We USED existing OpenSpiel environment (didn't build it)")
943
+ print(" 2. Type-safe communication over HTTP")
944
+ print(" 3. Same code works for ANY OpenSpiel game")
945
+ print(" 4. Production-ready architecture\n")
946
+
947
+ # Run the epic competition!
948
+ print("🎮 Starting the showdown against REAL OpenSpiel...\n")
949
+ evaluate_policies(client, num_episodes=50)
950
+ ```
951
+
952
+ ---
953
+
954
+ ## Part 9: Switching to Other Games 🎮
955
+
956
+ ### What We Just Used: Real OpenSpiel! 🎉
957
+
958
+ In Parts 6-8, we **USED** the existing OpenSpiel Catch environment:
959
+
960
+ | What We Did | How It Works |
961
+ |-------------|--------------|
962
+ | **Imported** | OpenSpielEnv client (pre-built) |
963
+ | **Started** | OpenSpiel server via uvicorn |
964
+ | **Connected** | HTTP client to server |
965
+ | **Played** | Real OpenSpiel Catch game |
966
+
967
+ **🎯 This is production code!** Every action was an HTTP call to a real OpenSpiel environment.
968
+
969
+ ### 🎮 6 Games Available - Same Interface!
970
+
971
+ The beauty of OpenEnv? **Same code, different games!**
972
+
973
+ ```python
974
+ # We just used Catch
975
+ env = OpenSpielEnv(base_url="http://localhost:8000")
976
+ # game_name="catch" was set via environment variable
977
+
978
+ # Want Tic-Tac-Toe instead? Just change the game!
979
+ # Start server with: OPENSPIEL_GAME=tic_tac_toe uvicorn ...
980
+ # Same client code works!
981
+ ```
982
+
983
+ **🎮 All 6 Games:**
984
+
985
+ 1. ✅ **`catch`** - What we just used!
986
+ 2. **`tic_tac_toe`** - Classic 3×3
987
+ 3. **`kuhn_poker`** - Imperfect information poker
988
+ 4. **`cliff_walking`** - Grid navigation
989
+ 5. **`2048`** - Tile puzzle
990
+ 6. **`blackjack`** - Card game
991
+
992
+ **All use the exact same OpenSpielEnv client!**
993
+
994
+ ### Try Another Game (Optional):
995
+
996
+ ```python
997
+ # Stop the current server (kill the server_process)
998
+ # Then start a new game:
999
+
1000
+ server_process = subprocess.Popen(
1001
+ [sys.executable, "-m", "uvicorn",
1002
+ "envs.openspiel_env.server.app:app",
1003
+ "--host", "0.0.0.0",
1004
+ "--port", "8000"],
1005
+ env={**os.environ,
1006
+ "PYTHONPATH": f"{work_dir}/src",
1007
+ "OPENSPIEL_GAME": "tic_tac_toe", # Changed!
1008
+ "OPENSPIEL_AGENT_PLAYER": "0",
1009
+ "OPENSPIEL_OPPONENT_POLICY": "random"},
1010
+ # ... rest of config
1011
+ )
1012
+
1013
+ # Same client works!
1014
+ client = OpenSpielEnv(base_url="http://localhost:8000")
1015
+ result = client.reset() # Now playing Tic-Tac-Toe!
1016
+ ```
1017
+
1018
+ **💡 Key Insight**: You don't rebuild anything - you just USE different games with the same client!
1019
+
1020
+ ---
1021
+
1022
+ ## Part 10: Create Your Own Integration 🛠️
1023
+
1024
+ ### The 5-Step Pattern
1025
+
1026
+ Want to wrap your own environment in OpenEnv? Here's how:
1027
+
1028
+ ### Step 1: Define Types (`models.py`)
1029
+
1030
+ ```python
1031
+ from dataclasses import dataclass
1032
+ from typing import List
+ from core.env_server import Action, Observation, State
1033
+
1034
+ @dataclass
1035
+ class YourAction(Action):
1036
+ action_value: int
1037
+ # Add your action fields
1038
+
1039
+ @dataclass
1040
+ class YourObservation(Observation):
1041
+ state_data: List[float]
1042
+ done: bool
1043
+ reward: float
1044
+ # Add your observation fields
1045
+
1046
+ @dataclass
1047
+ class YourState(State):
1048
+ episode_id: str
1049
+ step_count: int
1050
+ # Add your state fields
1051
+ ```
1052
+
1053
+ ### Step 2: Implement Environment (`server/environment.py`)
1054
+
1055
+ ```python
1056
+ from core.env_server import Environment
1057
+
1058
+ class YourEnvironment(Environment):
1059
+ def reset(self) -> Observation:
1060
+ # Initialize your game/simulation
1061
+ return YourObservation(...)
1062
+
1063
+ def step(self, action: Action) -> Observation:
1064
+ # Execute action, update state
1065
+ return YourObservation(...)
1066
+
1067
+ @property
1068
+ def state(self) -> State:
1069
+ return self._state
1070
+ ```
1071
+
1072
+ ### Step 3: Create Client (`client.py`)
1073
+
1074
+ ```python
1075
+ from core.http_env_client import HTTPEnvClient
1076
+ from core.types import StepResult
+ from .models import YourAction, YourObservation, YourState
1077
+
1078
+ class YourEnv(HTTPEnvClient[YourAction, YourObservation]):
1079
+ def _step_payload(self, action: YourAction) -> dict:
1080
+ """Convert action to JSON"""
1081
+ return {"action_value": action.action_value}
1082
+
1083
+ def _parse_result(self, payload: dict) -> StepResult:
1084
+ """Parse JSON to observation"""
1085
+ return StepResult(
1086
+ observation=YourObservation(...),
1087
+ reward=payload['reward'],
1088
+ done=payload['done']
1089
+ )
1090
+
1091
+ def _parse_state(self, payload: dict) -> YourState:
1092
+ return YourState(...)
1093
+ ```
1094
+
1095
+ ### Step 4: Create Server (`server/app.py`)
1096
+
1097
+ ```python
1098
+ from core.env_server import create_fastapi_app
1099
+ from .environment import YourEnvironment
1100
+
1101
+ env = YourEnvironment()
1102
+ app = create_fastapi_app(env)
1103
+
1104
+ # That's it! OpenEnv creates all endpoints for you.
1105
+ ```
1106
+
1107
+ ### Step 5: Dockerize (`server/Dockerfile`)
1108
+
1109
+ ```dockerfile
1110
+ FROM python:3.11-slim
1111
+
1112
+ WORKDIR /app
1113
+ COPY requirements.txt .
1114
+ RUN pip install --no-cache-dir -r requirements.txt
1115
+
1116
+ COPY . .
1117
+ CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
1118
+ ```
1119
+
1120
+ ### 🎓 Examples to Study
1121
+
1122
+ OpenEnv includes 3 complete examples:
1123
+
1124
+ 1. **`src/envs/echo_env/`**
1125
+ - Simplest possible environment
1126
+ - Great for testing and learning
1127
+
1128
+ 2. **`src/envs/openspiel_env/`**
1129
+ - Wraps external library (OpenSpiel)
1130
+ - Shows integration pattern
1131
+ - 6 games in one integration
1132
+
1133
+ 3. **`src/envs/coding_env/`**
1134
+ - Python code execution environment
1135
+ - Shows complex use case
1136
+ - Security considerations
1137
+
1138
+ **💡 Study these to understand the patterns!**
1139
+
1140
+ ---
1141
+
1142
+ ## 🎓 Summary: Your Journey
1143
+
1144
+ ### What You Learned
1145
+
1146
+ <table>
1147
+ <tr>
1148
+ <td width="50%" style="vertical-align: top;">
1149
+
1150
+ ### 📚 Concepts
1151
+
1152
+ ✅ **RL Fundamentals**
1153
+
1154
+ - The observe-act-reward loop
1155
+ - What makes good policies
1156
+ - Exploration vs exploitation
1157
+
1158
+ ✅ **OpenEnv Architecture**
1159
+
1160
+ - Client-server separation
1161
+ - Type-safe contracts
1162
+ - HTTP communication layer
1163
+
1164
+ ✅ **Production Patterns**
1165
+
1166
+ - Docker isolation
1167
+ - API design
1168
+ - Reproducible deployments
1169
+
1170
+ </td>
1171
+ <td width="50%" style="vertical-align: top;">
1172
+
1173
+ ### 🛠️ Skills
1174
+
1175
+ ✅ **Using Environments**
1176
+
1177
+ - Import OpenEnv clients
1178
+ - Call reset/step/state
1179
+ - Work with typed observations
1180
+
1181
+ ✅ **Building Environments**
1182
+
1183
+ - Define type-safe models
1184
+ - Implement Environment class
1185
+ - Create HTTPEnvClient
1186
+
1187
+ ✅ **Testing & Debugging**
1188
+
1189
+ - Compare policies
1190
+ - Visualize episodes
1191
+ - Measure performance
1192
+
1193
+ </td>
1194
+ </tr>
1195
+ </table>
1196
+
1197
+ ### OpenEnv vs Traditional RL
1198
+
1199
+ | Feature | Traditional (Gym) | OpenEnv | Winner |
1200
+ |---------|------------------|---------|--------|
1201
+ | **Type Safety** | ❌ Arrays, dicts | ✅ Dataclasses | 🏆 OpenEnv |
1202
+ | **Isolation** | ❌ Same process | ✅ Docker | 🏆 OpenEnv |
1203
+ | **Deployment** | ❌ Manual setup | ✅ K8s-ready | 🏆 OpenEnv |
1204
+ | **Language** | ❌ Python only | ✅ Any (HTTP) | 🏆 OpenEnv |
1205
+ | **Reproducibility** | ❌ "Works on my machine" | ✅ Same everywhere | 🏆 OpenEnv |
1206
+ | **Community** | ✅ Large ecosystem | 🟡 Growing | 🤝 Both! |
1207
+
1208
+ !!! success "The Bottom Line"
1209
+ OpenEnv brings **production engineering** to RL:
1210
+
1211
+ - Same environments work locally and in production
1212
+ - Type safety catches bugs early
1213
+ - Docker isolation prevents conflicts
1214
+ - HTTP API works with any language
1215
+
1216
+ **It's RL for 2024 and beyond.**
1217
+
1218
+ ---
1219
+
1220
+ ## 📚 Resources
1221
+
1222
+ ### 🔗 Essential Links
1223
+
1224
+ - **🏠 OpenEnv GitHub**: https://github.com/meta-pytorch/OpenEnv
1225
+ - **🎮 OpenSpiel**: https://github.com/google-deepmind/open_spiel
1226
+ - **⚡ FastAPI Docs**: https://fastapi.tiangolo.com/
1227
+ - **🐳 Docker Guide**: https://docs.docker.com/get-started/
1228
+ - **🔥 PyTorch**: https://pytorch.org/
1229
+
1230
+ ### 📖 Documentation Deep Dives
1231
+
1232
+ - **Environment Creation Guide**: `src/envs/README.md`
1233
+ - **OpenSpiel Integration**: `src/envs/openspiel_env/README.md`
1234
+ - **Example Scripts**: `examples/`
1235
+ - **RFC 001**: [Baseline API Specs](https://github.com/meta-pytorch/OpenEnv/pull/26)
1236
+
1237
+ ### 🎓 Community & Support
1238
+
1239
+ **Supported by amazing organizations:**
1240
+
1241
+ - 🔥 Meta PyTorch
1242
+ - 🤗 Hugging Face
1243
+ - ⚡ Unsloth AI
1244
+ - 🌟 Reflection AI
1245
+ - 🚀 And many more!
1246
+
1247
+ **License**: BSD 3-Clause (very permissive!)
1248
+
1249
+ **Contributions**: Always welcome! Check out the issues tab.
1250
+
1251
+ ---
1252
+
1253
+ ### 🌈 What's Next?
1254
+
1255
+ 1. ⭐ **Star the repo** to show support and stay updated
1256
+ 2. 🔄 **Try modifying** the Catch game (make it harder? bigger grid?)
1257
+ 3. 🎮 **Explore** other OpenSpiel games
1258
+ 4. 🛠️ **Build** your own environment integration
1259
+ 5. 💬 **Share** what you build with the community!
tutorial/tutorial2.md ADDED
@@ -0,0 +1,427 @@
1
+ # 2. Deploying an OpenEnv environment
2
+
3
+ This section covers deploying OpenEnv environments locally, on clusters, and on Hugging Face Spaces.
4
+
5
+ **Contents:**
6
+ - [Local Development with Uvicorn](#local-development-with-uvicorn)
7
+ - [Docker Deployment](#docker-deployment)
8
+ - [Deploy with CLI](#deploy-with-cli)
9
+ - [DEMO: Deploying to Hugging Face Spaces](#demo-deploying-to-hugging-face-spaces)
10
+
11
+ ## HF Spaces are the infrastructure for OpenEnv environments
12
+
13
+ Every HF Space provides three things that OpenEnv environments need:
14
+
15
+ | Component | What it provides | How to access | Used as |
16
+ |-----------|------------------|---------------|-----------|
17
+ | **Server** | Running environment endpoint | `https://<username>-<space-name>.hf.space` | Agent and Public API |
18
+ | **Repository** | Installable Python package | `pip install git+https://huggingface.co/spaces/<username>/<space-name>` | Code and client |
19
+ | **Registry** | Docker container image | `docker pull registry.hf.space/<username>-<space-name>:latest` | Deployment |
20
+
21
+ This means a single Space deployment gives you all the components you need to use an environment in training.
22
+
23
+ ### 1. Server: A running environment endpoint
24
+
25
+ When you deploy to HF Spaces, your environment runs as a server. The client connects via **WebSocket** (`/ws`) for a persistent session:
26
+
27
+ ```python
28
+ from echo_env import EchoEnv, EchoAction
29
+
30
+ # Connect directly to the running Space (WebSocket under the hood)
31
+ # Async (recommended):
32
+ async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
33
+ result = await client.reset()
34
+ result = await client.step(EchoAction(message="Hello"))
35
+
36
+ # Sync (using .sync() wrapper):
37
+ with EchoEnv(base_url="https://openenv-echo-env.hf.space").sync() as client:
38
+ result = client.reset()
39
+ result = client.step(EchoAction(message="Hello"))
40
+ ```
41
+
42
+ **Endpoints available:**
43
+
44
+ | Endpoint | Protocol | Description |
45
+ |----------|----------|-------------|
46
+ | `/ws` | **WebSocket** | Persistent session (used by client) |
47
+ | `/health` | HTTP GET | Health check |
48
+ | `/reset` | HTTP POST | Reset environment (stateless) |
49
+ | `/step` | HTTP POST | Execute action (stateless) |
50
+ | `/state` | HTTP GET | Get current state |
51
+ | `/docs` | HTTP GET | OpenAPI documentation |
52
+ | `/web` | HTTP GET | Interactive web UI |
53
+
54
+ > **Note:** The Python client uses the `/ws` WebSocket endpoint by default. HTTP endpoints are available for debugging or stateless use cases.
55
+
56
+ **Example: Check if a Space is running**
57
+
58
+ ```bash
59
+ curl https://openenv-echo-env.hf.space/health
60
+ # {"status": "healthy"}
61
+ ```
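In a training script you often want to block until the Space (or a local container) is ready before opening sessions. Below is a minimal readiness poll; the in-process test server exists only to make the sketch self-contained, and the `{"status": "healthy"}` shape follows the response shown above.

```python
import json
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

def wait_for_healthy(url: str, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll a /health endpoint until it reports healthy or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if json.load(resp).get("status") == "healthy":
                    return True
        except OSError:
            pass  # server not up yet, retry after a short pause
        time.sleep(interval)
    return False

# Stand-in /health server so the sketch runs without a deployed Space.
class _Health(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"status": "healthy"}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep request logging quiet
        pass

server = HTTPServer(("127.0.0.1", 0), _Health)
threading.Thread(target=server.serve_forever, daemon=True).start()

ok = wait_for_healthy(f"http://127.0.0.1:{server.server_address[1]}/health")
print(ok)  # True
server.shutdown()
```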
62
+
63
+ ### 2. Repository: Installable Python package
64
+
65
+ Every Space is a Git repository. OpenEnv environments include a `pyproject.toml`, making them pip-installable directly from the Space URL.
66
+
67
+ ```bash
68
+ # Install client package from Space
69
+ pip install git+https://huggingface.co/spaces/openenv/echo-env
70
+ ```
71
+
72
+ This installs:
73
+ - **Client class** (`EchoEnv`) — Handles HTTP/WebSocket communication
74
+ - **Models** (`EchoAction`, `EchoObservation`) — Typed action and observation classes
75
+ - **Utilities** — Any helper functions the environment provides
76
+
77
+ **After installation:**
78
+
79
+ ```python
80
+ from echo_env import EchoEnv, EchoAction, EchoObservation
81
+
82
+ # Now you have typed classes for the environment
83
+ action = EchoAction(message="Hello")
84
+ ```
85
+
86
+ ### 3. Registry: Docker container image
87
+
88
+ Every Docker-based Space has a container registry. You can pull and run the environment locally.
89
+
90
+ ```bash
91
+ # Pull the image
92
+ docker pull registry.hf.space/openenv-echo-env:latest
93
+
94
+ # Run locally on port 8001
95
+ docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
96
+ ```
97
+
98
+ **Find the registry URL for any Space:**
99
+
100
+ 1. Go to the Space page (e.g., [openenv/echo-env](https://huggingface.co/spaces/openenv/echo-env))
101
+ 2. Click **⋮** (three dots) → **"Run locally"**
102
+ 3. Copy the `docker run` command
103
+
104
+ ### Choosing an access method
105
+
106
+ | Method | Use when | Pros | Cons |
107
+ |--------|----------|------|------|
108
+ | **Server** | Quick testing, low volume | Zero setup | Network latency, rate limits |
109
+ | **Repository** | Need typed classes | Type safety, IDE support | Still need a server |
110
+ | **Docker** | Local dev, high throughput | Full control, no network | Requires Docker |
111
+
112
+ **Typical workflow:**
113
+
114
+ ```python
115
+ import asyncio
116
+ from echo_env import EchoEnv, EchoAction
117
+
118
+ async def main():
119
+ # Development: connect to remote Space
120
+ async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
121
+ result = await client.reset()
122
+
123
+ # Production: run locally for speed
124
+ # docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
125
+ async with EchoEnv(base_url="http://localhost:8001") as client:
126
+ result = await client.reset()
127
+
128
+ # Or let the client manage Docker for you
129
+     client = await EchoEnv.from_hub("openenv/echo-env")  # Auto-pulls and runs
130
+ async with client:
131
+ result = await client.reset()
132
+
133
+ asyncio.run(main())
134
+
135
+ # For sync usage, use the .sync() wrapper:
136
+ with EchoEnv(base_url="http://localhost:8001").sync() as client:
137
+ result = client.reset()
138
+ ```
139
+
140
+ > **Reference:** [HF Spaces Documentation](https://huggingface.co/docs/hub/spaces) | [Environment Hub Collection](https://huggingface.co/collections/openenv/environment-hub)
141
+
142
+
143
+ ## Local Development with Uvicorn
144
+
145
+ The fastest way to iterate on environment logic is running directly with Uvicorn.
146
+
147
+ ### Clone and run the environment locally
148
+
149
+ ```bash
150
+ # Clone from HF Space
151
+ git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
152
+ cd openenv-benchmark
153
+
154
+ # Install in editable mode
155
+ uv sync
156
+
157
+ # Start server
158
+ uv run server
159
+
160
+ # Run isolated from remote Space
161
+ uv run --isolated --project https://huggingface.co/spaces/burtenshaw/openenv-benchmark server
162
+ ```
163
+
164
+ ### Running Uvicorn directly
165
+
166
+ ```bash
167
+ # Full control over uvicorn options
168
+ uvicorn benchmark.server.app:app --host "$HOST" --port "$PORT" --workers "$WORKERS"
169
+
170
+ # With reload for development
171
+ uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --reload
172
+
173
+ # Multi-worker mode for better concurrency
174
+ uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --workers 4
175
+ ```
176
+
177
+ | Flag | Purpose |
178
+ |------|---------|
179
+ | `--reload` | Auto-restart on code changes |
180
+ | `--workers N` | Run N worker processes |
181
+ | `--log-level debug` | Verbose logging |
182
+
183
+ ## Docker Deployment
184
+
185
+ Docker provides isolation and reproducibility for production use.
186
+
187
+ ### Run the environment locally from the space
188
+
189
+ ```bash
190
+ # Run the environment locally from the space
191
+ docker run -d -p 8000:8000 registry.hf.space/openenv-echo-env:latest
192
+ ```
193
+
194
+ ### Build Image
195
+
196
+ ```bash
197
+ # Clone from HF Space
198
+ git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
199
+ cd openenv-benchmark
200
+
201
+ # Using OpenEnv CLI (recommended)
202
+ openenv build -t openenv-benchmark:latest
203
+
204
+ # Or with Docker directly
205
+ docker build -t openenv-benchmark:latest -f server/Dockerfile .
206
+ ```
207
+
208
+ ### Run Container
209
+
210
+ ```bash
211
+ # Basic run
212
+ docker run -d -p 8000:8000 my-env:latest
213
+
214
+ # With environment variables
215
+ docker run -d -p 8000:8000 \
216
+ -e WORKERS=4 \
217
+ -e MAX_CONCURRENT_ENVS=100 \
218
+ my-env:latest
219
+
220
+ # Named container for easy management
221
+ docker run -d --name my-env -p 8000:8000 my-env:latest
222
+ ```
223
+
224
+ ### Connect from Python
225
+
226
+ ```python
227
+ import asyncio
228
+ from echo_env import EchoEnv, EchoAction
229
+
230
+ async def main():
231
+ # Async usage (recommended)
232
+ async with EchoEnv(base_url="http://localhost:8000") as client:
233
+ result = await client.reset()
234
+ result = await client.step(EchoAction(message="Hello"))
235
+ print(result.observation)
236
+
237
+ # From Docker image
238
+ client = await EchoEnv.from_docker_image("<local_docker_image>")
239
+ async with client:
240
+ result = await client.reset()
241
+ print(result.observation)
242
+
243
+ asyncio.run(main())
244
+
245
+ # Sync usage (using .sync() wrapper)
246
+ with EchoEnv(base_url="http://localhost:8000").sync() as client:
247
+ result = client.reset()
248
+ result = client.step(EchoAction(message="Hello"))
249
+ print(result.observation)
250
+ ```
251
+
252
+ ### Container Lifecycle
253
+
254
+ | Method | Container | WebSocket | On `close()` |
255
+ |--------|-----------|-----------|--------------|
256
+ | `from_hub(repo_id)` | Starts | Connects | Stops container |
257
+ | `from_hub(repo_id, use_docker=False)` | None (UV) | Connects | Stops UV server |
258
+ | `from_docker_image(image)` | Starts | Connects | Stops container |
259
+ | `MyEnv(base_url=...)` | None | Connects | Disconnects only |
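The semantics in the table can be modeled as a context manager: construction paths that start a container also stop it on close, while plain `base_url` connections only disconnect. This is an illustrative toy, not the actual OpenEnv client implementation.

```python
# Toy model of the lifecycle table: entry starts what the client owns,
# exit tears down only what it started.
class ManagedEnv:
    def __init__(self, owns_container: bool):
        self.owns_container = owns_container
        self.connected = False
        self.container_running = False

    def __enter__(self):
        if self.owns_container:
            self.container_running = True  # e.g. docker run ...
        self.connected = True  # open the WebSocket
        return self

    def __exit__(self, *exc):
        self.connected = False  # close the WebSocket
        if self.owns_container:
            self.container_running = False  # e.g. docker stop ...

# from_docker_image(...) semantics: owns the container, so close() stops it
with ManagedEnv(owns_container=True) as env:
    assert env.container_running and env.connected
assert not env.container_running

# MyEnv(base_url=...) semantics: connects only, so close() just disconnects
with ManagedEnv(owns_container=False) as env:
    assert env.connected and not env.container_running
```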
260
+
261
+ **Find Docker commands for any Space:**
262
+
263
+ 1. Open the Space on the Hugging Face Hub
264
+ 2. Click **⋮ (three dots)** menu
265
+ 3. Select **"Run locally"**
266
+ 4. Copy the provided `docker run` command
267
+
268
+ ## Deploy with CLI
269
+
270
+ ```bash
271
+ cd my_env
272
+
273
+ # Deploy to your namespace
274
+ openenv push
275
+
276
+ # Deploy to specific repo
277
+ openenv push --repo-id username/my-env
278
+
279
+ # Deploy as private
280
+ openenv push --repo-id username/my-env --private
281
+ ```
282
+
283
+ ### Space Configuration
284
+
285
+ The `openenv.yaml` manifest controls Space settings:
286
+
287
+ ```yaml
288
+ # openenv.yaml
289
+ name: my_env
290
+ version: "1.0.0"
291
+ description: My custom environment
292
+ ```
293
+
294
+ **Hardware options:**
295
+
296
+ | Tier | vCPU | RAM | Cost |
297
+ |------|------|-----|------|
298
+ | CPU Basic (Free) | 2 | 16GB | Free |
299
+ | CPU Upgrade | 8 | 32GB | $0.03/hr |
300
+
301
+ OpenEnv environments support configuration via environment variables.
302
+
303
+ | Variable | Default | Description |
304
+ |----------|---------|-------------|
305
+ | `WORKERS` | 4 | Uvicorn worker processes |
306
+ | `PORT` | 8000 | Server port |
307
+ | `HOST` | 0.0.0.0 | Bind address |
308
+ | `MAX_CONCURRENT_ENVS` | 100 | Max WebSocket sessions |
309
+ | `ENABLE_WEB_INTERFACE` | Auto | Enable web UI |
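Inside a server, these variables are typically read once at startup with the defaults above. A sketch (the variable names match the table; the helper itself is hypothetical, not part of the OpenEnv API):

```python
import os

def load_config(environ=os.environ) -> dict:
    """Read server settings from environment variables, with table defaults."""
    return {
        "workers": int(environ.get("WORKERS", 4)),
        "port": int(environ.get("PORT", 8000)),
        "host": environ.get("HOST", "0.0.0.0"),
        "max_concurrent_envs": int(environ.get("MAX_CONCURRENT_ENVS", 100)),
    }

# Simulate `docker run -e WORKERS=8 -e MAX_CONCURRENT_ENVS=400 ...`
cfg = load_config({"WORKERS": "8", "MAX_CONCURRENT_ENVS": "400"})
print(cfg)  # {'workers': 8, 'port': 8000, 'host': '0.0.0.0', 'max_concurrent_envs': 400}
```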
310
+
311
+ ### Environment-Specific Variables
312
+
313
+ Some environments have custom variables:
314
+
315
+ **TextArena:**
316
+ ```bash
317
+ TEXTARENA_ENV_ID=Wordle-v0
318
+ TEXTARENA_NUM_PLAYERS=1
319
+ TEXTARENA_MAX_TURNS=6
320
+ ```
321
+
322
+ **Coding Environment:**
323
+ ```bash
324
+ SANDBOX_TIMEOUT=30
325
+ MAX_OUTPUT_LENGTH=10000
326
+ ```
327
+
328
+ ## DEMO: Deploying to Hugging Face Spaces
329
+
330
+ This demo walks through the full workflow: create an environment, test locally, deploy to HF Spaces, and use it.
331
+
332
+ ## Step 1: Initialize a new environment
333
+
334
+ ```bash
335
+ openenv init my_env
336
+ cd my_env
337
+ ```
338
+
339
+ This creates the standard OpenEnv structure:
340
+
341
+ ```
342
+ my_env/
343
+ ├── server/
344
+ │ ├── app.py # FastAPI server
345
+ │ ├── environment.py # Your environment logic
346
+ │ └── Dockerfile
347
+ ├── models.py # Action/Observation types
348
+ ├── client.py # HTTP client
349
+ ├── openenv.yaml # Manifest
350
+ └── pyproject.toml
351
+ ```
352
+
353
+ ## Step 2: Run locally
354
+
355
+ ```bash
356
+ # Start the server
357
+ uv run server
358
+
359
+ # Or with uvicorn directly
360
+ uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
361
+ ```
362
+
363
+ Test the health endpoint:
364
+
365
+ ```bash
366
+ curl http://localhost:8000/health
367
+ # {"status": "healthy"}
368
+ ```
369
+
370
+ ## Step 3: Deploy to HF Spaces
371
+
372
+ ```bash
373
+ openenv push --repo-id username/my-env
374
+ ```
375
+
376
+ Your environment is now live at:
377
+ - Web UI: https://username-my-env.hf.space/web
378
+ - API Docs: https://username-my-env.hf.space/docs
379
+ - Health: https://username-my-env.hf.space/health
380
+
381
+ ```bash
382
+ curl https://openenv-echo-env.hf.space/health
383
+ # {"status": "healthy"}
384
+ ```
385
+
386
+ ## Step 4: Install the environment
387
+
388
+ ```bash
389
+ uv pip install git+https://huggingface.co/spaces/openenv/echo-env
390
+ ```
391
+
392
+ ## Step 5: Run locally via Docker (optional)
393
+
394
+ Pull and run the container from the HF registry, or use the ["Run locally" dialog](https://huggingface.co/spaces/openenv/echo-env?docker=true) in your browser:
395
+
396
+ ```bash
397
+ # Pull from HF Spaces registry
398
+ docker pull registry.hf.space/openenv-echo-env:latest
399
+
400
+ # Run locally
401
+ docker run -it -p 8000:8000 --platform=linux/amd64 \
402
+ registry.hf.space/openenv-echo-env:latest
403
+ ```
404
+
405
+ Now connect to your local instance:
406
+
407
+ ```python
408
+ import asyncio
409
+ from echo_env import EchoEnv, EchoAction
410
+
411
+ # Async (recommended)
412
+ async def main():
413
+ async with EchoEnv(base_url="http://localhost:8000") as env:
414
+ result = await env.reset()
415
+ print(result.observation)
416
+ result = await env.step(EchoAction(message="Hello"))
417
+ print(result.observation)
418
+
419
+ asyncio.run(main())
420
+
421
+ # Sync (using .sync() wrapper)
422
+ with EchoEnv(base_url="http://localhost:8000").sync() as env:
423
+ result = env.reset()
424
+ print(result.observation)
425
+ result = env.step(EchoAction(message="Hello"))
426
+ print(result.observation)
427
+ ```
tutorial/tutorial3.md ADDED
@@ -0,0 +1,457 @@
1
+ # 3. How OpenEnv environments scale
2
+
3
+ This section covers benchmarking and scaling OpenEnv environments.
4
+
5
+ **Contents:**
6
+ - [Provider Scaling](#provider-scaling)
7
+ - [WebSocket-based Scaling](#websocket-based-scaling)
8
+ - [Scaling a Single Container](#scaling-a-single-container)
9
+ - [Scaling Experiments](#scaling-experiments)
10
+
11
+ ---
12
+
13
+ ## Provider Scaling
14
+
15
+ The easiest way to scale an OpenEnv environment is to use a `provider`. Providers are abstractions over runtimes like Uvicorn, Docker Swarm, or Kubernetes.
16
+
17
+ ```python
18
+ from openenv.providers import UVProvider, DockerSwarmProvider, LocalDockerProvider
19
+
20
+ docker_provider = LocalDockerProvider() # default
21
+ uvicorn_provider = UVProvider() # python only
22
+ swarm_provider = DockerSwarmProvider()
23
+
24
+ with EchoEnv.from_hub(
25
+ repo_id="openenv/echo-env",
26
+ provider=swarm_provider,
27
+ replicas=4,
28
+ ) as env:
29
+ result = env.reset()
30
+ result = env.step(EchoAction(message="Hello"))
31
+ ```
32
+
33
+ ## WebSocket-based Scaling
34
+
35
+ OpenEnv uses WebSocket connections (`/ws`) instead of stateless HTTP for environment interactions. This design enables efficient scaling within a single container.
36
+
37
+ ### What are WebSockets?
38
+
39
+ WebSocket is a communication protocol that provides a persistent, bidirectional connection between client and server. Unlike HTTP—where each request opens a new connection, sends data, receives a response, and closes—a WebSocket connection stays open for the duration of a session.
40
+
41
+ ![WebSocket vs HTTP](../images/websocket.png)
42
+
43
+ For RL environments, this matters because a typical episode involves dozens to thousands of sequential `step()` calls. With HTTP, each step incurs TCP handshake overhead (~10-50ms). With WebSocket, messages are sent as lightweight frames (~0.1ms overhead) over the existing connection.
44
+
45
+ HTTP also requires extra logic to manage session state across long-running sessions; with WebSocket, the connection itself is the session.
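A quick back-of-envelope calculation with the figures above shows why this matters over a long episode (the 20 ms HTTP figure is a mid-range assumption from the 10-50 ms range):

```python
# Cumulative transport overhead across an episode of sequential step() calls.
def transport_overhead_ms(steps: int, per_call_ms: float) -> float:
    return steps * per_call_ms

steps = 1000  # a long but realistic episode
http_ms = transport_overhead_ms(steps, 20.0)  # ~20 ms handshake per HTTP request
ws_ms = transport_overhead_ms(steps, 0.1)     # ~0.1 ms per WebSocket frame

print(f"HTTP: {http_ms / 1000:.1f}s, WebSocket: {ws_ms / 1000:.1f}s")  # HTTP: 20.0s, WebSocket: 0.1s
```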
46
+
47
+ ### Multiple sessions per container
48
+
49
+ With HTTP, maintaining session state requires cookies or session IDs with every request. Each isolated environment instance typically needs its own container:
50
+
51
+ ```
52
+ HTTP approach: N parallel episodes → N containers
53
+ ```
54
+
55
+ > [!NOTE]
56
+ > This is completely fine (and ideal) for larger deployments where containers can be scaled, but if your resources are constrained it adds significant overhead.
57
+
58
+ With WebSocket, **one container handles many isolated sessions**. Each WebSocket connection gets its own environment instance server-side:
59
+
60
+ ```python
61
+ # Single container serving multiple concurrent sessions
62
+ # docker run -d -p 8000:8000 my-env:latest
63
+
64
+ # Each client gets an isolated environment instance
65
+ with MyEnv(base_url="http://localhost:8000") as env1: # Session 1
66
+ result = env1.reset()
67
+
68
+ with MyEnv(base_url="http://localhost:8000") as env2: # Session 2
69
+ result = env2.reset()
70
+
71
+ with MyEnv(base_url="http://localhost:8000") as env3: # Session 3
72
+ result = env3.reset()
73
+ ```
74
+
75
+ > [!NOTE]
76
+ > One container per session still has advantages of its own, such as stronger separation of concerns and fault tolerance for environments like coding or terminal sandboxes.
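The server-side bookkeeping can be pictured as a registry mapping connection IDs to dedicated environment instances. A toy model (class names are illustrative, not OpenEnv internals):

```python
import itertools

class CounterEnv:
    """Minimal stand-in environment: state is just a step counter."""
    def __init__(self):
        self.steps = 0
    def step(self) -> int:
        self.steps += 1
        return self.steps

class SessionRegistry:
    """One isolated environment instance per connection."""
    def __init__(self):
        self._envs = {}
        self._ids = itertools.count(1)
    def connect(self) -> int:
        conn_id = next(self._ids)
        self._envs[conn_id] = CounterEnv()  # fresh instance per connection
        return conn_id
    def step(self, conn_id: int) -> int:
        return self._envs[conn_id].step()
    def disconnect(self, conn_id: int) -> None:
        del self._envs[conn_id]  # automatic cleanup when the connection closes

registry = SessionRegistry()
a, b = registry.connect(), registry.connect()
registry.step(a)
registry.step(a)
print(registry.step(a), registry.step(b))  # 3 1  (sessions never share state)
```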
77
+
78
+ ### Server-side session state
79
+
80
+ The server maintains environment state per WebSocket connection, which means the environment builder does not need to manage session state:
81
+
82
+ - No session IDs: the connection itself is the session
83
+ - Automatic cleanup: the environment instance is destroyed when the connection closes
84
+ - Guaranteed isolation: each connection has its own dedicated state
85
+
86
+ ```python
87
+ # Server creates new environment instance per WebSocket connection
88
+ @app.websocket("/ws")
89
+ async def websocket_endpoint(websocket: WebSocket):
90
+ env = MyEnvironment() # Fresh instance per connection
91
+ await websocket.accept()
92
+
93
+ while True:
94
+ data = await websocket.receive_json()
95
+ if data["type"] == "reset":
96
+ result = env.reset()
97
+ elif data["type"] == "step":
98
+ result = env.step(data["action"])
99
+ await websocket.send_json(result)
100
+ ```
101
+
102
+ ### Resource efficiency
103
+
104
+ | Approach | Containers | Memory | Startup | Max parallel |
105
+ |----------|------------|--------|---------|--------------|
106
+ | HTTP (1 env = 1 container) | N | N × ~100MB | N × ~5s | Limited by containers |
107
+ | WebSocket (N sessions = 1 container) | 1 | ~200MB | ~5s | Limited by `MAX_CONCURRENT_ENVS` |
108
+
109
+ Configure session limits via environment variable:
110
+
111
+ ```bash
112
+ docker run -d -p 8000:8000 -e MAX_CONCURRENT_ENVS=100 registry.hf.space/openenv-echo-env:latest
113
+ ```
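Conceptually, a cap like `MAX_CONCURRENT_ENVS` can be enforced with a semaphore around each session. A sketch of the idea (not the actual OpenEnv server code):

```python
import asyncio

MAX_CONCURRENT_ENVS = 3  # stand-in for the environment variable above

async def session(sem: asyncio.Semaphore, tracker: dict) -> None:
    async with sem:  # new sessions wait here once the cap is reached
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.01)  # simulated episode work
        tracker["active"] -= 1

async def main() -> dict:
    sem = asyncio.Semaphore(MAX_CONCURRENT_ENVS)
    tracker = {"active": 0, "peak": 0}
    # 10 clients try to connect; at most 3 run at once
    await asyncio.gather(*(session(sem, tracker) for _ in range(10)))
    return tracker

tracker = asyncio.run(main())
print("peak concurrent sessions:", tracker["peak"])  # never exceeds 3
```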
114
+
115
+ ## Scaling a Single Container
116
+
117
+ Before adding more containers, maximize the capacity of a single deployment. The key parameters are **workers** (CPU parallelism) and **MAX_CONCURRENT_ENVS** (session limit).
118
+
119
+ ### Uvicorn workers
120
+
121
+ Each Uvicorn worker is a separate process that can handle requests independently. More workers = more CPU cores utilized.
122
+
123
+ ```bash
124
+ # Clone and run locally
125
+ git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
126
+ cd openenv-benchmark
127
+ pip install -e .
128
+
129
+ # Run with 8 workers
130
+ WORKERS=8 uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --workers 8
131
+ ```
132
+
133
+ The example above runs 8 workers, and each worker handles up to `MAX_CONCURRENT_ENVS` (default 100) concurrent sessions. **For simple environments, like text games, it's possible to reach 2,000 concurrent sessions with 8 workers.**
134
+
135
+ > **Note:** More workers consume more memory. Each worker loads a full copy of the environment code.
136
+
137
+ ### Docker with environment variables
138
+
139
+ Pass scaling parameters when starting the container:
140
+
141
+ ```bash
142
+ # Pull from HF Spaces registry
143
+ docker pull registry.hf.space/burtenshaw-openenv-benchmark:latest
144
+
145
+ # Run with custom configuration
146
+ docker run -d -p 8000:8000 \
147
+ -e WORKERS=8 \
148
+ -e MAX_CONCURRENT_ENVS=400 \
149
+ --name openenv-benchmark \
150
+ registry.hf.space/burtenshaw-openenv-benchmark:latest
151
+ ```
152
+
153
+ | Variable | Default | Description |
154
+ |----------|---------|-------------|
155
+ | `WORKERS` | 4 | Uvicorn worker processes |
156
+ | `MAX_CONCURRENT_ENVS` | 100 | Max WebSocket sessions per worker |
157
+ | `PORT` | 8000 | Server port |
158
+ | `HOST` | 0.0.0.0 | Bind address |
159
+
160
+ ### HF Spaces configuration
161
+
162
+ Now, let's deploy the environment to HF Spaces so that we can interact with the server from the client. Configure scaling via Space Settings > Variables:
163
+
164
+ 1. Go to your Space settings page
165
+ 2. Add environment variables:
166
+ - `WORKERS=4` (max 4 on free tier, 8 on CPU Upgrade)
167
+ - `MAX_CONCURRENT_ENVS=100`
168
+ 3. Restart the Space
169
+
170
+ | Tier | vCPU | Recommended workers | Expected max batch (textarena) |
171
+ |------|------|--------------------|--------------------|
172
+ | CPU Basic (Free) | 2 | 2 | ~128 |
173
+ | CPU Upgrade | 8 | 4-8 | ~512 |
174
+
175
+ > **Limitation:** The HF Spaces free tier caps out at ~128 concurrent sessions regardless of configuration. See [Scaling Experiments](#scaling-experiments) for measured limits.
176
+
177
+ ### Scaling limits
178
+
179
+ The experiments below found that even on larger instances, a single container eventually stops scaling and multiple containers are needed to handle the load. For example, on a CPU Upgrade instance with 8 workers, degradation set in at 1,024 concurrent sessions:
180
+
181
+ - Success rate drops to 92%
182
+ - P99 latency exceeds 2× the expected step time
183
+ - Connection errors increase under load
184
+
185
+ When this happens, we need to scale to multiple containers and use a load balancer.
186
+
187
+ For high-throughput workloads, scale horizontally by running multiple environment containers behind a load balancer.
188
+
189
+ | Scenario | Recommended approach |
190
+ |----------|---------------------|
191
+ | Development / testing | Single container with WebSocket sessions |
192
+ | Moderate load (< 100 concurrent) | Single container, increase `MAX_CONCURRENT_ENVS` |
193
+ | High load (100+ concurrent) | Multiple containers + load balancer |
194
+ | GPU environments | One container per GPU |
195
+
196
+ We explored this in detail in the [Scaling Experiments](https://github.com/burtenshaw/openenv-scaling) repository.
197
+
198
+ <details>
199
+ <summary>Envoy configuration</summary>
200
+
201
+ ```yaml
202
+ static_resources:
203
+ listeners:
204
+ - name: listener_0
205
+ address:
206
+ socket_address:
207
+ address: 0.0.0.0
208
+ port_value: 8080
209
+ filter_chains:
210
+ - filters:
211
+ - name: envoy.filters.network.http_connection_manager
212
+ typed_config:
213
+ "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
214
+ stat_prefix: ingress_http
215
+ upgrade_configs:
216
+ - upgrade_type: websocket
217
+ route_config:
218
+ name: local_route
219
+ virtual_hosts:
220
+ - name: openenv_service
221
+ domains: ["*"]
222
+ routes:
223
+ - match:
224
+ prefix: "/"
225
+ route:
226
+ cluster: openenv_cluster
227
+ http_filters:
228
+ - name: envoy.filters.http.router
229
+ typed_config:
230
+ "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
231
+
232
+ clusters:
233
+ - name: openenv_cluster
234
+ connect_timeout: 30s
235
+ type: STRICT_DNS
236
+ lb_policy: ROUND_ROBIN
237
+ load_assignment:
238
+ cluster_name: openenv_cluster
239
+ endpoints:
240
+ - lb_endpoints:
241
+ - endpoint:
242
+ address:
243
+ socket_address:
244
+ address: host.docker.internal
245
+ port_value: 8001
246
+ - endpoint:
247
+ address:
248
+ socket_address:
249
+ address: host.docker.internal
250
+ port_value: 8002
251
+ - endpoint:
252
+ address:
253
+ socket_address:
254
+ address: host.docker.internal
255
+ port_value: 8003
256
+ - endpoint:
257
+ address:
258
+ socket_address:
259
+ address: host.docker.internal
260
+ port_value: 8004
261
+ ```
262
+
263
+
264
+ Start Envoy:
265
+
266
+ ```bash
267
+ docker run -d \
268
+ -p 8080:8080 \
269
+ -v $(pwd)/envoy.yaml:/etc/envoy/envoy.yaml \
270
+ --add-host=host.docker.internal:host-gateway \
271
+ envoyproxy/envoy:v1.28.0
272
+ ```
273
+
274
+ Connect through the load balancer:
275
+
276
+ ```python
277
+ # Clients connect to Envoy, which distributes to backend containers
278
+ with MyEnv(base_url="http://localhost:8080") as env:
279
+ result = env.reset()
280
+ ```
281
+
282
+ </details>
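The `lb_policy: ROUND_ROBIN` setting above simply hands successive connections to backends in rotation, which in miniature looks like:

```python
import itertools

# ROUND_ROBIN in miniature: hand out backends in rotation, wrapping around.
backends = ["host:8001", "host:8002", "host:8003", "host:8004"]
rotation = itertools.cycle(backends)

assigned = [next(rotation) for _ in range(6)]
print(assigned)  # wraps back to host:8001 after the fourth connection
```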
283
+
284
+ ### Scaling expectations
285
+
286
+ ![Scaling Expectations](../images/scaling.png)
287
+
288
+ | Setup | Containers | Sessions/container | Total capacity | Throughput |
289
+ |-------|------------|-------------------|----------------|------------|
290
+ | Single | 1 | 100 | 100 | ~100 req/s |
291
+ | 4× containers | 4 | 100 | 400 | ~350 req/s |
292
+ | 8× containers | 8 | 100 | 800 | ~600 req/s |
293
+
294
+ > **Note:** Actual throughput depends on environment complexity and hardware. Benchmark your specific workload.
295
+
296
+ ## Scaling Experiments
297
+
298
+ This section documents experiments measuring OpenEnv scaling characteristics across five infrastructure configurations. Full experiment data and code available at [burtenshaw/openenv-scaling](https://github.com/burtenshaw/openenv-scaling).
299
+
300
+ ### Experiment setup
301
+
302
+ **Benchmark environment:** A minimal OpenEnv environment with configurable wait time (simulates computation). Each `step()` call sleeps for the specified duration, isolating infrastructure overhead from environment logic.
303
+
304
+ **Infrastructure tested:**
305
+
306
+ | Infrastructure | Cores | Configuration |
307
+ |----------------|-------|---------------|
308
+ | local-uvicorn | 8 | Direct Uvicorn, 8 workers |
309
+ | local-docker | 8 | Docker container from HF Spaces image |
310
+ | hf-spaces | 2 | HF Spaces free tier (cpu-basic) |
311
+ | slurm-single | 48 | Single AWS HPC node |
312
+ | slurm-multi | 96 | Two AWS HPC nodes + Envoy load balancer |
313
+
314
+ **Protocol:** WebSocket (`/ws`) and HTTP (`/reset`, `/step`) compared where available.
315
+
316
+ **Metrics:**
317
+ - **Max batch:** Largest concurrent request count with ≥95% success rate
318
+ - **Batch/core:** Max batch divided by available cores (efficiency metric)
319
+ - **P99 latency:** 99th percentile total request time
320
+ - **RPS:** Requests per second at max batch
321
+
322
+ ### Results summary
323
+
324
+ | Infrastructure | Max Batch (WS) | Cores | Batch/Core | P99 Latency | RPS |
325
+ |----------------|----------------|-------|------------|-------------|-----|
326
+ | slurm-multi | 16,384 | 96 | 170.7 | 29.8s | 518 |
327
+ | local-uvicorn | 2,048 | 8 | 256.0 | 1.97s | 932 |
328
+ | local-docker | 2,048 | 8 | 256.0 | 2.90s | 682 |
329
+ | slurm-single | 512 | 48 | 10.7 | 1.45s | 358 |
330
+ | hf-spaces | 128 | 2 | 64.0 | 2.68s | 48 |
331
+
332
+ All results measured with `wait=10.0s` step duration.
333
+
334
+ ![Max Batch Comparison](https://raw.githubusercontent.com/burtenshaw/openenv-scaling/main/experiments/reports/figures/max_batch_comparison.png)
335
+ *Maximum batch size by infrastructure (95% success threshold)*
336
+
337
+ ### Finding 1: Local deployments have highest per-core efficiency
338
+
339
+ Local Uvicorn and local Docker both achieve **256 concurrent sessions per core**, the highest efficiency observed. With 8 workers, both reach 2,048 concurrent sessions before degradation begins.
340
+
341
+ This makes sense: the environment runs in a single process close to the client, so per-request overhead is low. This setup is ideal for developers who want to test an environment quickly or train on a single machine.
342
+
343
+ | Batch Size | Success Rate | P99 Latency | Notes |
344
+ |------------|--------------|-------------|-------|
345
+ | 32 | 100% | 1.05s | Perfect scaling |
346
+ | 128 | 100% | 1.07s | Perfect scaling |
347
+ | 512 | 100% | 1.33s | Perfect scaling |
348
+ | 2,048 | 96.5% | 1.97s | Max reliable batch |
349
+ | 4,096 | 63.8% | 3.20s | Connection failures begin |
350
+ | 8,192 | 36.9% | 5.75s | Above capacity |
351
+
352
+ Beyond 2,048 concurrent connections, success rate drops sharply. The failure mode is connection rejection, not timeout—the server saturates its connection pool.
353
+
354
+ ![Batch Per Core](https://raw.githubusercontent.com/burtenshaw/openenv-scaling/main/experiments/reports/figures/batch_per_core.png)
355
+ *Per-core efficiency comparison across infrastructures*
356
+
357
+ ### Finding 2: HF Spaces works reliably up to 128 concurrent sessions
358
+
359
+ HF Spaces free tier (cpu-basic) provides 2 workers and achieves 128 concurrent WebSocket sessions with 100% success. This translates to **64 sessions per core**.
360
+
361
+ **HF Spaces scaling behavior (WebSocket):**
362
+
363
+ | Batch Size | Success Rate | P99 Latency | Notes |
364
+ |------------|--------------|-------------|-------|
365
+ | 1 | 100% | 1.64s | Baseline |
366
+ | 32 | 100% | 1.80s | Perfect scaling |
367
+ | 64 | 100% | 2.14s | Perfect scaling |
368
+ | 128 | 100% | 2.68s | Max reliable batch |
369
+ | 256 | ~33% | 4.41s | Inconsistent (some runs 0%, some 100%) |
370
+ | 512 | 0% | — | Complete failure |
371
+
372
+ At 256 concurrent connections, results become unstable. At 512+, connections fail entirely due to HF Spaces connection limits.
373
+
374
+ **HTTP mode does not work on HF Spaces.** The `/reset` and `/step` HTTP endpoints are not accessible on the deployed Space—all HTTP requests fail. Use WebSocket mode exclusively.
375
+
376
+ ### Finding 3: Multi-node scaling works
377
+
378
+ Multi-node SLURM (96 cores across 2 nodes) achieves **16,384 concurrent sessions** with 100% success rate—the highest absolute throughput tested.
379
+
380
+ **SLURM multi-node scaling behavior:**
381
+
382
+ | Batch Size | Success Rate | P99 Latency | Notes |
383
+ |------------|--------------|-------------|-------|
384
+ | 32 | 100% | 1.05s | Perfect scaling |
385
+ | 512 | 100% | 1.59s | Perfect scaling |
386
+ | 2,048 | 100% | 3.48s | Perfect scaling |
387
+ | 4,096 | 100% | 6.97s | Perfect scaling |
388
+ | 8,192 | 100% | 13.7s | Perfect scaling |
389
+ | 16,384 | 100% | 29.8s | Max tested batch |
390
+
391
+ The batch/core ratio (170.7) is lower than local deployments (256) but provides the highest absolute capacity for large-scale workloads.
392
+
393
+ ![Scaling Comparison](https://raw.githubusercontent.com/burtenshaw/openenv-scaling/main/experiments/reports/figures/scaling_comparison.png)
394
+
395
+ *Multi-node vs single-node scaling behavior*
396
+
397
+ ### Latency breakdown
398
+
399
+ At max load (`wait=1.0s`), latency breaks down as:
400
+
401
+ | Infrastructure | Connect P50 | Reset P50 | Step P50 | Total P99 |
402
+ |----------------|-------------|-----------|----------|-----------|
403
+ | slurm-single | 0.26s | 0.04s | 1.00s | 1.33s |
404
+ | local-uvicorn | 0.58s | 0.08s | 1.05s | 1.95s |
405
+ | hf-spaces | 0.79s | 0.10s | 1.10s | 2.48s |
406
+ | local-docker | 1.38s | 0.19s | 1.05s | 2.90s |
407
+ | slurm-multi | 17.5s | 2.25s | 2.42s | 26.3s |
408
+
409
+ **Observations:**
410
+ - **Step latency** is consistent across infrastructures (~1.0s for 1.0s wait), confirming the benchmark measures infrastructure overhead accurately
411
+ - **Connect latency** varies significantly—local Docker shows higher connect time at load (1.38s), likely due to container networking
412
+ - **Multi-node has high connect latency** (17.5s) at 16,384 batch due to queuing at the load balancer; this is the cost of handling 16× more connections than single-node
413
+
414
+ ![Latency Heatmap](https://raw.githubusercontent.com/burtenshaw/openenv-scaling/main/experiments/reports/figures/latency_heatmap.png)
415
+ *P99 latency across configurations and batch sizes*
416
+
417
+ ![Scaling Curves](https://raw.githubusercontent.com/burtenshaw/openenv-scaling/main/experiments/reports/figures/scaling_curves.png)
418
+ *Success rate vs batch size for all infrastructures*
419
+
420
+ ### Test methodology
421
+
422
+ ```bash
423
+ # Clone benchmark environment
424
+ git clone https://huggingface.co/spaces/burtenshaw/openenv-scaling
425
+ cd openenv-scaling
426
+
427
+ # Run scaling test
428
+ python tests/test_scaling.py \
429
+ --url http://localhost:8000 \
430
+ --requests-grid 32,128,512,2048,4096,8192,16384 \
431
+ --wait-grid 1.0,5.0,10.0 \
432
+ --reps 3 \
433
+ --mode ws \
434
+ --output-dir experiments/results/
435
+ ```
436
+
437
+ Each configuration was tested with 3 repetitions. Max batch is defined as the largest batch size achieving ≥95% success rate across all repetitions.
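The ≥95% criterion translates directly into code; a hypothetical `max_reliable_batch` helper mirroring how the tables were summarized:

```python
def max_reliable_batch(results, threshold=0.95):
    """results maps batch size -> list of per-repetition success rates.
    Returns the largest batch where every repetition meets the threshold."""
    passing = [batch for batch, reps in results.items()
               if all(rate >= threshold for rate in reps)]
    return max(passing) if passing else None

# HF Spaces-style numbers: 128 passes all reps, 256 is inconsistent.
print(max_reliable_batch({128: [1.0, 1.0, 1.0], 256: [0.0, 1.0, 0.0]}))  # 128
```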
438
+
439
+ ---
440
+
441
+ ## Summary
442
+
443
+ | Infrastructure | Best for | Max concurrent | Batch/core |
444
+ |----------------|----------|----------------|------------|
445
+ | local-uvicorn | Development, <2K sessions | 2,048 | 256 |
446
+ | local-docker | Same as uvicorn, containerized | 2,048 | 256 |
447
+ | hf-spaces | Demos, moderate load | 128 | 64 |
448
+ | slurm-single | HPC, single-node jobs | 512 | 10.7 |
449
+ | slurm-multi | Large-scale training | 16,384 | 170.7 |
450
+
451
+ **Recommendations:**
452
+
453
+ 1. **For development and moderate workloads (<2,000 concurrent):** Use single-node Uvicorn or Docker, depending on your software environment. These provide the best per-core efficiency (256 sessions/core).
454
+
455
+ 2. **For demos, testing, and published environments:** HF Spaces free tier works reliably up to 128 concurrent sessions.
456
+
457
+ 3. **For large-scale training (>2,000 concurrent):** Deploy multi-node with proper load balancing. Expect ~170 sessions per core, but much higher absolute throughput.
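The three recommendations collapse into a small routing helper (thresholds taken from the tables above; a sketch, not a hard rule):

```python
def recommend_infrastructure(concurrent_sessions):
    """Map a target concurrency to the deployment suggested by this report."""
    if concurrent_sessions <= 128:
        return "hf-spaces (demos) or local uvicorn/docker"
    if concurrent_sessions <= 2_048:
        return "local uvicorn/docker (~256 sessions/core)"
    return "multi-node slurm with load balancing (~170 sessions/core)"

print(recommend_infrastructure(10_000))  # multi-node slurm with load balancing (~170 sessions/core)
```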
tutorial/tutorial4.md ADDED
@@ -0,0 +1,632 @@
1
+ # OpenEnv Wordle with GRPO using TRL
2
+
3
+ [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://github.com/huggingface/trl/blob/main/examples/notebooks/openenv_wordle_grpo.ipynb)
4
+
5
+ ![trl banner](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/trl_banner_dark.png)
6
+
7
+ With [**Transformers Reinforcement Learning (TRL)**](https://github.com/huggingface/trl), you can train a model that learns to **play Wordle**, a word-guessing game, through interaction and reinforcement.
8
+
9
+ - [TRL GitHub Repository](https://github.com/huggingface/trl)
10
+ - [Official TRL Examples](https://huggingface.co/docs/trl/example_overview)
11
+ - [Community Tutorials](https://huggingface.co/docs/trl/community_tutorials)
12
+ - [OpenEnv](https://github.com/meta-pytorch/OpenEnv)
13
+
14
+ An **agentic environment** is a setting where a model can take actions, observe outcomes, and adjust its behavior based on feedback, similar to how humans learn from trial and error.
15
+ In this case, the agent interacts with the **Wordle** environment through the [**OpenEnv**](https://github.com/meta-pytorch/OpenEnv) framework, which standardizes multi-agent and RL-style text environments.
16
+
17
+ [Wordle](https://en.wikipedia.org/wiki/Wordle) is a popular word puzzle where the player must guess a secret five-letter word within six tries.
18
+ After each guess, feedback indicates whether each letter is:
19
+
20
+ - 🟩 **Correct and in the right position**
21
+ - 🟨 **Present but in the wrong position**
22
+ - ⬛ **Not in the word**
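This coloring rule (greens first, then yellows only while unmatched letters remain, so duplicate letters are handled correctly) can be sketched as a small scorer; `score_guess` is a hypothetical helper for illustration and is not part of OpenEnv:

```python
from collections import Counter

def score_guess(secret, guess):
    """Return per-letter feedback: 'G' green, 'Y' yellow, 'X' gray."""
    feedback = ["X"] * 5
    remaining = Counter()
    # First pass: mark greens and count the secret's unmatched letters.
    for i, (s, g) in enumerate(zip(secret, guess)):
        if s == g:
            feedback[i] = "G"
        else:
            remaining[s] += 1
    # Second pass: mark yellows while unmatched copies of a letter remain.
    for i, g in enumerate(guess):
        if feedback[i] == "X" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1
    return "".join(feedback)

print(score_guess("crane", "slate"))  # XXGXG
```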
23
+
24
+ This feedback loop makes Wordle a perfect environment for **RL with LLMs**, where the goal is to maximize the probability of guessing the correct word efficiently.
25
+
26
+ We will fine-tune a model using **GRPO** (Group Relative Policy Optimization) via TRL.
27
+ The agent will:
28
+
29
+ 1. Generate guesses based on the game state and feedback.
30
+ 2. Receive structured feedback from the environment after each guess.
31
+ 3. Learn to improve its guessing strategy over time through reward signals.
32
+
33
+ ---
34
+
35
+ ## Install dependencies
36
+
37
+ We will start by installing **TRL**, which automatically includes the main dependencies like **Transformers**.
38
+ We will also install the **OpenEnv** framework (for the environment), **trackio** (for logging and monitoring training runs), and **vLLM** (for efficient generation).
39
+
40
+ ```python
41
+ !pip install -Uq git+https://github.com/huggingface/trl.git git+https://github.com/meta-pytorch/OpenEnv.git trackio vllm==0.10.2 bitsandbytes
42
+ ```
43
+
44
+ ---
45
+
46
+ ## Log in to Hugging Face
47
+
48
+ Log in to your **Hugging Face** account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your **access token** on your [account settings page](https://huggingface.co/settings/tokens).
49
+
50
+ ```python
51
+ from huggingface_hub import notebook_login
52
+
53
+ notebook_login()
54
+ ```
55
+
56
+ ---
57
+
58
+ ## Initialize the Environment
59
+
60
+ Let us begin by setting up the environment that will be used during training.
61
+ For this task, we will rely on the **TextArena** environment from **OpenEnv**, which exposes a familiar Gymnasium-style API (`reset()`, `step()`, etc.) to simplify interaction.
62
+
63
+ In this example, we will connect to the hosted environment at [burtenshaw/textarena](https://huggingface.co/spaces/burtenshaw/textarena).
64
+ For production use or custom configurations, we **strongly recommend** running the environment locally via Docker. The hosted versions on the Hub currently have limited concurrency support, so duplicating the Space to your own account is the preferred approach in those cases.
65
+
66
+ For more information, refer to the [TRL-OpenEnv documentation](https://huggingface.co/docs/trl/main/en/openenv).
67
+
68
+ ```python
69
+ from envs.textarena_env import TextArenaEnv
70
+
71
+ textarena_url = "https://burtenshaw-textarena.hf.space" # Duplicate the Space and update this!
72
+ env = TextArenaEnv(base_url=textarena_url)
73
+ ```
74
+
75
+ ---
76
+
77
+ ## Init model and tokenizer
78
+
79
+ We will use [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B), a lightweight instruction-tuned model that works well for quick experiments.
80
+ Despite its small size, it can still learn interesting strategies during fine-tuning.
81
+ If you have stronger hardware, you can easily scale up to larger models.
82
+
83
+ ```python
84
+ from transformers import AutoTokenizer
85
+
86
+ model_name = "Qwen/Qwen3-1.7B"
87
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
88
+ tokenizer.pad_token = tokenizer.eos_token
89
+ ```
90
+
91
+ ---
92
+
93
+ ## Rollout function with helpers
94
+
95
+ The **rollout function** defines how the agent interacts with the environment during GRPO training.
96
+ It is responsible for generating model completions, collecting feedback (rewards), and returning all necessary information for optimization.
97
+
98
+ In this setup:
99
+
100
+ - The function is called automatically by the **GRPOTrainer** during each training step.
101
+ - It uses the trainer's built-in `generate_rollout_completions()` method for efficient generation with vLLM in colocate mode.
102
+ - Each rollout represents a full interaction loop. The model guesses, receives feedback from Wordle, and updates based on reward signals.
103
+
104
+ ### System Prompt
105
+
106
+ First, we define the `system_prompt` that guides the model's behavior as an expert Wordle solver with strategic reasoning and structured responses.
107
+
108
+ ```python
109
+ system_prompt = """
110
+ You are an expert Wordle solver with deep knowledge of English vocabulary, letter frequency patterns, and optimal guessing strategies.
111
+
112
+ ## GAME RULES
113
+
114
+ 1. The target is a 5-letter English word
115
+ 2. You have 6 attempts to guess the correct word
116
+ 3. After each guess, you receive color-coded feedback:
117
+ - GREEN: Letter is correct and in the correct position
118
+ - YELLOW: Letter is in the word but in the wrong position
119
+ - GRAY: Letter is not in the word at all
120
+ 4. All guesses must be valid 5-letter English words
121
+ 5. You cannot reuse a word you've already guessed
122
+
123
+ ## RESPONSE FORMAT
124
+
125
+ Only respond with your next guess in square brackets, e.g., [crane].
126
+
127
+ ## STRATEGIC APPROACH
128
+
129
+ Do not repeat the same guess twice.
130
+
131
+ ### Opening Strategy
132
+ - Start with words rich in common vowels (A, E, I, O, U) and consonants (R, S, T, L, N)
133
+ - Optimal starters: CRANE, SLATE, STARE, AROSE, IRATE
134
+
135
+ ### Mid-Game Strategy
136
+ - Use confirmed GREEN letters in their correct positions
137
+ - Place YELLOW letters in different positions than where they appeared
138
+ - Eliminate GRAY letters from consideration
139
+
140
+ ## YOUR GOAL
141
+
142
+ Solve the Wordle in as few guesses as possible by strategically using feedback to eliminate impossible words and narrow down the solution space efficiently.
143
+ """
144
+ ```
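The bracketed-reply format is parsed during rollouts by the library's `extract_guess`; a minimal regex-based equivalent could look like this (hypothetical `parse_guess`, for illustration only):

```python
import re

def parse_guess(text):
    """Return the last 5-letter [word] bracket group in a reply, lowercased."""
    matches = re.findall(r"\[([A-Za-z]{5})\]", text)
    return matches[-1].lower() if matches else None

print(parse_guess("I'll open with [CRANE]."))  # crane
```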
145
+
146
+ ### Rollout Function
147
+
148
+ ```python
149
+ def rollout_func(prompts, trainer=None):
150
+ """
151
+ Rollout function for GRPO training with environment interaction.
152
+ """
153
+ episode_prompt_ids = []
154
+ episode_completion_ids = []
155
+ episode_logprobs = []
156
+ correctness_rewards = []
157
+ green_rewards = []
158
+ yellow_rewards = []
159
+ repetition_rewards = []
160
+
161
+ for prompt_text in prompts:
162
+ episode = rollout_once(
163
+ trainer=trainer,
164
+ env=env,
165
+ tokenizer=tokenizer,
166
+ dataset_prompt=prompt_text,
167
+ system_prompt=system_prompt,
168
+ max_turns=6,
169
+ )
170
+ episode_prompt_ids.append(episode["prompt_ids"])
171
+ episode_completion_ids.append(episode["completion_ids"])
172
+ episode_logprobs.append(episode["logprobs"])
173
+ correctness_rewards.append(episode["correct_reward"])
174
+ green_rewards.append(episode["green_reward"])
175
+ yellow_rewards.append(episode["yellow_reward"])
176
+ repetition_rewards.append(episode["repetition_reward"])
177
+
178
+ return {
179
+ "prompt_ids": episode_prompt_ids,
180
+ "completion_ids": episode_completion_ids,
181
+ "logprobs": episode_logprobs,
182
+ "correct_reward": correctness_rewards,
183
+ "green_reward": green_rewards,
184
+ "yellow_reward": yellow_rewards,
185
+ "repetition_reward": repetition_rewards,
186
+ }
187
+ ```
188
+
189
+ ---
190
+
191
+ ## Define rollout_once
192
+
193
+ The `rollout_once` function runs **one full interaction loop** between the model and the Wordle environment using the trainer's generation method.
194
+
195
+ ```python
196
+ from collections import defaultdict
197
+ from envs.textarena_env import TextArenaAction
198
+ from envs.textarena_env.rewards import extract_feedback_counts, extract_guess, extract_wordle_feedback
199
+ from trl.experimental.openenv import generate_rollout_completions
200
+
201
+
202
+ def rollout_once(trainer, env, tokenizer, dataset_prompt, system_prompt, max_turns):
203
+ """
204
+ Execute one full Wordle episode with the model.
205
+ """
206
+ result = env.reset()
207
+ observation = result.observation
208
+
209
+ prompt_ids = []
210
+ completion_ids = []
211
+ logprobs = []
212
+ raw_rewards = []
213
+ green_scores = []
214
+ yellow_scores = []
215
+ repetition_scores = []
216
+ correct_scores = []
217
+ guess_counts = defaultdict(int)
218
+
219
+ for _turn in range(max_turns):
220
+ if result.done:
221
+ break
222
+
223
+ base_prompt = observation.prompt or dataset_prompt
224
+ user_prompt = make_user_prompt(base_prompt, observation.messages)
225
+ messages = [
226
+ {"role": "system", "content": system_prompt},
227
+ {"role": "user", "content": user_prompt},
228
+ ]
229
+ prompt_text = tokenizer.apply_chat_template(
230
+ messages,
231
+ add_generation_prompt=True,
232
+ tokenize=False,
233
+ enable_thinking=False,
234
+ )
235
+
236
+ rollout_outputs = generate_rollout_completions(trainer, [prompt_text])[0]
237
+ prompt_ids.extend(rollout_outputs["prompt_ids"])
238
+ completion_ids.extend(rollout_outputs["completion_ids"])
239
+ logprobs.extend(rollout_outputs["logprobs"])
240
+ completion_text = rollout_outputs.get("text") or tokenizer.decode(
241
+ rollout_outputs["completion_ids"], skip_special_tokens=True
242
+ )
243
+
244
+ guess = extract_guess(completion_text)
245
+ result = env.step(TextArenaAction(message=guess))
246
+ raw_rewards.append(float(result.reward or 0.0))
247
+ observation = result.observation
248
+ correct_score = float(result.reward or 0.0)
249
+ feedback = extract_wordle_feedback(observation)
250
+
251
+ previous_occurrences = guess_counts[guess]
252
+ repetition_score = scale_repetition_score(previous_occurrences, len(guess_counts))
253
+ guess_counts[guess] += 1
254
+
255
+ if not feedback:
256
+ green_score = 0.0
257
+ yellow_score = 0.0
258
+ else:
259
+ green_count, yellow_count = extract_feedback_counts(feedback)
260
+ green_score = green_count / 5.0
261
+ yellow_score = yellow_count / 5.0
262
+
263
+ repetition_scores.append(repetition_score)
264
+ green_scores.append(green_score)
265
+ yellow_scores.append(yellow_score)
266
+ correct_scores.append(correct_score)
267
+
268
+ correct_reward_value = correct_scores[-1] if correct_scores else (raw_rewards[-1] if raw_rewards else 0.0)
269
+
270
+ return {
271
+ "prompt_ids": prompt_ids,
272
+ "completion_ids": completion_ids,
273
+ "logprobs": logprobs,
274
+ "raw_rewards": raw_rewards,
275
+ "correct_reward": correct_reward_value,
276
+ "green_reward": green_scores[-1] if green_scores else 0.0,
277
+ "yellow_reward": yellow_scores[-1] if yellow_scores else 0.0,
278
+ "repetition_reward": repetition_scores[-1] if repetition_scores else 0.0,
279
+ }
280
+ ```
281
+
282
+ ---
283
+
284
+ ## Helper functions
285
+
286
+ ```python
287
+ def make_user_prompt(prompt_text, messages):
288
+ """Builds a structured user prompt combining the task description and message history"""
289
+ history = format_history(messages)
290
+ prompt_section = prompt_text.strip() if prompt_text.strip() else "Wordle-v0"
291
+ history_section = history if history else "[PROMPT] Awaiting first feedback."
292
+ return (
293
+ f"Game prompt:\n{prompt_section}\n\n"
294
+ f"Conversation so far:\n{history_section}\n\n"
295
+ "Reply with your next guess enclosed in square brackets."
296
+ )
297
+
298
+ def format_history(messages):
299
+ """Formats the message history with tags for clear conversational context"""
300
+ lines = []
301
+ for message in messages:
302
+ tag = message.category or "MESSAGE"
303
+ content = message.content.strip()
304
+ if not content:
305
+ continue
306
+ lines.append(f"[{tag}] {content}")
307
+ return "\n".join(lines)
308
+
309
+ def scale_repetition_score(previous_occurrences, max_occurrences):
310
+ """Scale the repetition score based on the number of previous occurrences from 0 to 1"""
311
+ if max_occurrences == 0:
312
+ return 0.0
313
+ return (max_occurrences - previous_occurrences) / max_occurrences
314
+ ```
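To see how the repetition score behaves, here are a few values (the function is repeated from above so the snippet stands alone):

```python
def scale_repetition_score(previous_occurrences, max_occurrences):
    """Scale the repetition score based on previous occurrences, from 0 to 1."""
    if max_occurrences == 0:
        return 0.0
    return (max_occurrences - previous_occurrences) / max_occurrences

print(scale_repetition_score(0, 4))  # 1.0  (fresh guess among 4 distinct prior guesses)
print(scale_repetition_score(3, 4))  # 0.25 (guess already made 3 times)
print(scale_repetition_score(0, 0))  # 0.0  (first turn, no history yet)
```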
315
+
316
+ ---
317
+
318
+ ## Define reward functions
319
+
320
+ ```python
321
+ def reward_correct(completions, **kwargs):
322
+ rewards = kwargs.get("correct_reward") if kwargs else None
323
+ if rewards is None:
324
+ return [0.0 for _ in completions]
325
+ return [float(r) for r in rewards]
326
+
327
+
328
+ def reward_greens(completions, **kwargs):
329
+ rewards = kwargs.get("green_reward") if kwargs else None
330
+ if rewards is None:
331
+ return [0.0 for _ in completions]
332
+ return [float(r) for r in rewards]
333
+
334
+
335
+ def reward_yellows(completions, **kwargs):
336
+ rewards = kwargs.get("yellow_reward") if kwargs else None
337
+ if rewards is None:
338
+ return [0.0 for _ in completions]
339
+ return [float(r) for r in rewards]
340
+
341
+
342
+ def reward_repetition(completions, **kwargs):
343
+ rewards = kwargs.get("repetition_reward") if kwargs else None
344
+ if rewards is None:
345
+ return [0.0 for _ in completions]
346
+ return [float(r) for r in rewards]
347
+ ```
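The four functions differ only in the kwarg they read, so a small factory keeps them in sync; a sketch (the explicit versions above work just as well):

```python
def make_env_reward(key):
    """Build a GRPO reward function that reads `key` from the rollout outputs."""
    def reward_fn(completions, **kwargs):
        rewards = kwargs.get(key)
        if rewards is None:
            return [0.0 for _ in completions]
        return [float(r) for r in rewards]
    reward_fn.__name__ = f"reward_{key}"  # keeps per-reward logging readable
    return reward_fn

reward_correct, reward_greens, reward_yellows, reward_repetition = (
    make_env_reward(k)
    for k in ("correct_reward", "green_reward", "yellow_reward", "repetition_reward")
)
```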
348
+
349
+ ---
350
+
351
+ ## Create dataset
352
+
353
+ ```python
354
+ from datasets import Dataset
355
+
356
+ dataset_size = 1000
357
+ dataset_prompt = "Play Wordle like an expert."
358
+
359
+ dataset = Dataset.from_dict({"prompt": [dataset_prompt] * dataset_size})
360
+ ```
361
+
362
+ ---
363
+
364
+ ## Set GRPO Config
365
+
366
+ ```python
367
+ from trl import GRPOConfig
368
+
369
+ output_dir = "wordle-grpo-Qwen3-1.7B"
370
+
371
+ grpo_config = GRPOConfig(
372
+ num_train_epochs = 1,
373
+ learning_rate = 5e-6,
374
+ gradient_accumulation_steps = 64,
375
+ per_device_train_batch_size = 1,
376
+ warmup_steps = 20,
377
+ num_generations = 2,
378
+ max_completion_length = 8,
379
+ max_prompt_length = 1400,
380
+ use_vllm = True,
381
+ vllm_mode = "colocate",
382
+ vllm_gpu_memory_utilization = 0.1,
383
+ output_dir = output_dir,
384
+ report_to="trackio",
385
+ trackio_space_id = output_dir,
386
+ logging_steps = 1,
387
+ save_steps = 10,
388
+ gradient_checkpointing = True,
389
+ gradient_checkpointing_kwargs = {"use_reentrant": False},
390
+ push_to_hub = True,
391
+ )
392
+ ```
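For intuition on the rollout budget implied by this config, assuming a single GPU (TRL's exact batching across devices and generation steps may differ):

```python
# Completions consumed per optimizer step, single-GPU assumption.
per_device_train_batch_size = 1
gradient_accumulation_steps = 64
num_generations = 2

completions_per_step = per_device_train_batch_size * gradient_accumulation_steps
groups_per_step = completions_per_step // num_generations  # GRPO compares within groups
print(completions_per_step, groups_per_step)  # 64 completions -> 32 groups per update
```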
393
+
394
+ ---
395
+
396
+ ## Create GRPOTrainer and start training
397
+
398
+ ```python
399
+ from trl import GRPOTrainer
400
+
401
+ trainer = GRPOTrainer(
402
+ model=model_name,
403
+ processing_class=tokenizer,
404
+ reward_funcs=[
405
+ reward_correct,
406
+ reward_greens,
407
+ reward_yellows,
408
+ reward_repetition,
409
+ ],
410
+ train_dataset=dataset,
411
+ args=grpo_config,
412
+ rollout_func=rollout_func,
413
+ )
414
+ ```
415
+
416
+ ### Memory stats before training
417
+
418
+ ```python
419
+ import torch
420
+ gpu_stats = torch.cuda.get_device_properties(0)
421
+ start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
422
+ max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
423
+
424
+ print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
425
+ print(f"{start_gpu_memory} GB of memory reserved.")
426
+ ```
427
+
428
+ **Output:**
429
+ ```
430
+ GPU = NVIDIA A100-SXM4-40GB. Max memory = 39.557 GB.
431
+ 10.516 GB of memory reserved.
432
+ ```
433
+
434
+ ### Train!
435
+
436
+ ```python
437
+ trainer_stats = trainer.train()
438
+ ```
439
+
440
+ **Training Progress:**
441
+
442
+ | Step | Training Loss |
443
+ |------|---------------|
444
+ | 1 | 0.008300 |
445
+ | 2 | 0.001900 |
446
+ | 3 | 0.015100 |
447
+ | 4 | 0.008700 |
448
+ | 5 | 0.009800 |
449
+ | 6 | 0.006700 |
450
+ | 7 | 0.006100 |
451
+ | 8 | 0.004400 |
452
+ | 9 | -0.002100 |
453
+ | 10 | 0.007500 |
454
+ | 11 | 0.008400 |
455
+ | 12 | 0.008000 |
456
+ | 13 | 0.007800 |
457
+ | 14 | -0.002400 |
458
+ | 15 | -0.003200 |
459
+ | 16 | -0.006000 |
460
+ | 17 | -0.008300 |
461
+ | 18 | -0.011000 |
462
+ | 19 | -0.004200 |
463
+ | 20 | -0.001700 |
464
+ | 21 | -0.004100 |
465
+ | 22 | -0.011600 |
466
+ | 23 | -0.006400 |
467
+ | 24 | -0.009100 |
468
+ | 25 | 0.003200 |
469
+ | 26 | 0.005100 |
470
+ | 27 | -0.002800 |
471
+ | 28 | 0.001400 |
472
+ | 29 | 0.011500 |
473
+ | 30 | -0.010500 |
474
+ | 31 | -0.006400 |
475
+
476
+ ### Memory stats after training
477
+
478
+ ```python
479
+ used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
480
+ used_memory_for_training = round(used_memory - start_gpu_memory, 3)
481
+ used_percentage = round(used_memory / max_memory * 100, 3)
482
+ training_memory_percentage = round(used_memory_for_training / max_memory * 100, 3)
483
+
484
+ print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
485
+ print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
486
+ print(f"Peak reserved memory = {used_memory} GB.")
487
+ print(f"Peak reserved memory for training = {used_memory_for_training} GB.")
488
+ print(f"Peak reserved memory % of max memory = {used_percentage} %.")
489
+ print(f"Peak reserved memory for training % of max memory = {training_memory_percentage} %.")
490
+ ```
491
+
492
+ **Output:**
493
+ ```
494
+ 5231.7046 seconds used for training.
495
+ 87.2 minutes used for training.
496
+ Peak reserved memory = 36.68 GB.
497
+ Peak reserved memory for training = 26.164 GB.
498
+ Peak reserved memory % of max memory = 92.727 %.
499
+ Peak reserved memory for training % of max memory = 66.143 %.
500
+ ```
501
+
502
+ ### Save and push to Hub
503
+
504
+ ```python
505
+ env.close()
506
+ trainer.save_model(output_dir)
507
+ trainer.push_to_hub()
508
+ ```
509
+
510
+ ---
511
+
512
+ ## Load the Fine-Tuned Model and Run Inference
513
+
514
+ ```python
515
+ from transformers import AutoModelForCausalLM, AutoTokenizer
516
+
517
+ model_name = "sergiopaniego/wordle-grpo-Qwen3-1.7B" # Replace with your HF username
518
+
519
+ fine_tuned_model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto", device_map="auto")
520
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
521
+ ```
522
+
523
+ ```python
524
+ MAX_TURNS = 6
525
+
526
+ def play_wordle(env, model, tokenizer):
527
+ result = env.reset()
528
+ observation = result.observation
529
+
530
+ print("Initial Prompt:\n" + observation.prompt)
531
+
532
+ for turn in range(MAX_TURNS):
533
+ if result.done:
534
+ break
535
+
536
+ user_prompt = make_user_prompt(observation.prompt, observation.messages)
537
+ messages = [
538
+ {"role": "system", "content": system_prompt},
539
+ {"role": "user", "content": user_prompt},
540
+ ]
541
+ prompt_text = tokenizer.apply_chat_template(
542
+ messages,
543
+ add_generation_prompt=True,
544
+ tokenize=False,
545
+ enable_thinking=False,
546
+ )
547
+
548
+ model_inputs = tokenizer([prompt_text], return_tensors="pt").to(model.device)
549
+
550
+ generated_ids = model.generate(
551
+ **model_inputs,
552
+ max_new_tokens=512
553
+ )
554
+ output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
555
+
556
+ generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
557
+ guess = extract_guess(generated_text)
558
+
559
+ print(f"\nTurn {turn}: model replied with -> {generated_text}")
560
+ print(f" Parsed guess: {guess}")
561
+
562
+ result = env.step(TextArenaAction(message=guess))
563
+ observation = result.observation
564
+
565
+ print(" Feedback messages:")
566
+ for message in observation.messages:
567
+ print(f" [{message.category}] {message.content}")
568
+
569
+ print("\nGame finished")
570
+ print(f" Reward: {result.reward}")
571
+ print(f" Done: {result.done}")
572
+ ```
573
+
574
+ ### Let us play the game!
575
+
576
+ ```python
577
+ try:
578
+ play_wordle(env, fine_tuned_model, tokenizer)
579
+ finally:
580
+ env.close()
581
+ ```
582
+
583
+ **Output:**
584
+ ```
585
+ Initial Prompt:
586
+ You are Player 0 in Wordle.
587
+ A secret 5-letter word has been chosen. You have 6 attempts to guess it.
588
+ For each guess, wrap your word in square brackets (e.g., [apple]).
589
+ Feedback for each letter will be given as follows:
590
+ - G (green): correct letter in the correct position
591
+ - Y (yellow): letter exists in the word but in the wrong position
592
+ - X (wrong): letter is not in the word
593
+ Enter your guess to begin.
594
+
595
+ Turn 0: model replied with -> [crane]
596
+ Parsed guess: [crane]
597
+ Feedback messages:
598
+ [MESSAGE] [crane]
599
+ [MESSAGE] Player 0 submitted [crane].
600
+ Feedback:
601
+ C R A N E
602
+ X Y X X X
603
+
604
+ You have 5 guesses left.
605
+
606
+ Turn 1: model replied with -> [spare]
607
+ Parsed guess: [spare]
608
+ Feedback messages:
609
+ [MESSAGE] [spare]
610
+ [MESSAGE] Player 0 submitted [spare].
611
+ Feedback:
612
+ C R A N E
613
+ X Y X X X
614
+
615
+ S P A R E
616
+ G X X G X
617
+
618
+ You have 4 guesses left.
619
+
620
+ ...
621
+
622
+ Game finished
623
+ Reward: 0.0
624
+ Done: True
625
+ ```
626
+
627
+ > **Note:** The model has learned some good opening strategies (starting with "crane", then "spare"), but still tends to repeat guesses. This is a common challenge in RL training that can be improved with:
628
+ >
629
+ > - Longer training runs
630
+ > - Stronger repetition penalties
631
+ > - Better reward shaping
632
+ > - Larger models
uv.lock ADDED
The diff for this file is too large to render. See raw diff