Commit 2794920 · Parent(s): 893901a

fixed inference

Changed files:
- README.md +61 -37
- baseline_runner.py +27 -13
- inference.py +177 -104
- sample_scripts/sample_inf_script.py +188 -0
- sample_scripts/sample_val_script.txt +185 -0
- server/environment.py +41 -42
- server/graders/__init__.py +2 -2
- server/models.py +2 -2
- server/simulators/docker_simulator.py +5 -1
- server/simulators/k8s_simulator.py +1 -1
- server/simulators/workflow_simulator.py +129 -0
- server/tasks/k8s_networking.py +3 -3
- server/tasks/pipeline_build_deploy.py +76 -60
- server/tasks/pipeline_full.py +5 -7
- server/tasks/task_1_build_errors.py +11 -10
- server/tasks/task_5_ci_docker_integration.py +25 -31
README.md
CHANGED
|
@@ -42,7 +42,7 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
|
|
| 42 |
│ - replace_line: fix a specific line number │
|
| 43 |
│ - add_line / add_block: insert missing content │
|
| 44 |
│ - delete_line / delete_block: remove bad content │
|
| 45 |
-
│ - request_hint: get a clue (-
|
| 46 |
│ - submit: "I'm done fixing" │
|
| 47 |
│ │
|
| 48 |
│ After each action, agent gets: │
|
|
@@ -56,6 +56,7 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
|
|
| 56 |
│ - Whether ALL issues were fixed (bonus) │
|
| 57 |
│ - How many steps it took (efficiency) │
|
| 58 |
│ - How many hints were used (penalty) │
|
|
|
|
| 59 |
└──────────────────────────────────────────────────────────────┘
|
| 60 |
```
|
| 61 |
|
|
@@ -63,6 +64,8 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
|
|
| 63 |
|
| 64 |
## The 10 Tasks (50 Scenarios)
|
| 65 |
|
|
|
|
|
|
|
| 66 |
### Task 1: Dockerfile Syntax Errors – Easy
|
| 67 |
|
| 68 |
Simple typos and instruction errors that break `docker build`.
|
|
@@ -72,7 +75,7 @@ Simple typos and instruction errors that break `docker build`.
|
|
| 72 |
| 1 | `typo_filename` | `COPY requirments.txt .` → misspelled filename | Most common Docker build error on Stack Overflow |
|
| 73 |
| 2 | `invalid_base_image` | `FROM python:3.9-slimm` → extra 'm' in tag | Happens when copy-pasting image tags |
|
| 74 |
| 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` → broken line continuation | Formatting multi-line RUN commands is tricky |
|
| 75 |
-
| 4 | `
|
| 76 |
| 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
|
| 77 |
|
| 78 |
### Task 2: Dockerfile Runtime Errors – Medium
|
|
@@ -111,14 +114,14 @@ Secrets exist but aren't wired correctly to the workflow steps.
|
|
| 111 |
| 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
|
| 112 |
| 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
|
| 113 |
|
| 114 |
-
### Task 5: CI + Docker Integration – Medium
|
| 115 |
|
| 116 |
The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
|
| 117 |
|
| 118 |
| # | Scenario | What's Broken | Real-World Context |
|
| 119 |
|---|----------|---------------|-------------------|
|
| 120 |
| 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
|
| 121 |
-
| 2 | `
|
| 122 |
| 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
|
| 123 |
| 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
|
| 124 |
| 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
|
|
@@ -149,7 +152,7 @@ Pod crashes and scheduling failures in Kubernetes deployments.
|
|
| 149 |
|
| 150 |
### Task 8: Kubernetes Service & Ingress Issues – Hard
|
| 151 |
|
| 152 |
-
Networking issues where pods run fine but traffic doesn't reach them.
|
| 153 |
|
| 154 |
| # | Scenario | What's Broken | Real-World Context |
|
| 155 |
|---|----------|---------------|-------------------|
|
|
@@ -165,15 +168,15 @@ GHA-to-Docker-to-Registry pipeline failures spanning multiple files.
|
|
| 165 |
|
| 166 |
| # | Scenario | What's Broken | Real-World Context |
|
| 167 |
|---|----------|---------------|-------------------|
|
| 168 |
-
| 1 | `
|
| 169 |
| 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
|
| 170 |
-
| 3 | `
|
| 171 |
| 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
|
| 172 |
-
| 5 | `
|
| 173 |
|
| 174 |
### Task 10: Full Stack Deployment Pipeline – Expert
|
| 175 |
|
| 176 |
-
Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning.
|
| 177 |
|
| 178 |
| # | Scenario | What's Broken | Real-World Context |
|
| 179 |
|---|----------|---------------|-------------------|
|
|
@@ -185,6 +188,20 @@ Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifest
|
|
| 185 |
|
| 186 |
---
|
| 187 |
|
|
| 188 |
## Available Actions
|
| 189 |
|
| 190 |
Each step, the agent chooses exactly one action:
|
|
@@ -197,16 +214,16 @@ Each step, the agent chooses exactly one action:
|
|
| 197 |
| `delete_line` | Remove a specific line | Removing a bad instruction |
|
| 198 |
| `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
|
| 199 |
| `delete_block` | Remove a multi-line block | Removing incorrect sections |
|
| 200 |
-
| `request_hint` | Get a clue about what's wrong | Costs -
|
| 201 |
| `submit` | Declare "I'm done" → triggers final evaluation | When all fixes are applied |
|
| 202 |
|
| 203 |
**Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
|
| 204 |
|
| 205 |
---
|
| 206 |
|
| 207 |
-
## Grading System
|
| 208 |
|
| 209 |
-
Scoring is **deterministic** (same actions always produce the same score), **
|
| 210 |
|
| 211 |
### The Formula
|
| 212 |
|
|
@@ -214,13 +231,13 @@ Scoring is **deterministic** (same actions always produce the same score), **dyn
|
|
| 214 |
FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
|
| 215 |
```
|
| 216 |
|
| 217 |
-
Clamped to `
|
| 218 |
|
| 219 |
### Component Breakdown
|
| 220 |
|
| 221 |
| Component | Weight | Description |
|
| 222 |
|-----------|--------|-------------|
|
| 223 |
-
| Base score | 5% | Participation credit |
|
| 224 |
| Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
|
| 225 |
| Complete bonus | 25% | All issues fixed |
|
| 226 |
| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
|
|
@@ -230,15 +247,31 @@ Clamped to `[0.0, 1.0]`.
|
|
| 230 |
|
| 231 |
### Difficulty Modifiers
|
| 232 |
|
| 233 |
-
The grader adjusts three parameters based on task difficulty:
|
| 234 |
-
|
| 235 |
| Difficulty | Max Score | Efficiency Decay | Hint Cost |
|
| 236 |
|------------|-----------|------------------|-----------|
|
| 237 |
| Easy | 0.90 | 0.03/step (strict) | 4% each |
|
| 238 |
| Medium | 0.90 | 0.027/step | 4% each |
|
| 239 |
| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
|
| 240 |
|
| 241 |
-
|
|
| 242 |
|
| 243 |
---
|
| 244 |
|
|
@@ -256,7 +289,7 @@ This means: solving a 4-bug expert pipeline in 6 steps scores higher than solvin
|
|
| 256 |
| `/info` | GET | Task list with metadata |
|
| 257 |
| `/tasks` | GET | List all tasks with difficulty levels |
|
| 258 |
| `/grader` | POST | Grade a trajectory (list of step dicts) |
|
| 259 |
-
| `/baseline` | POST | Run
|
| 260 |
| `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
|
| 261 |
|
| 262 |
### Example: Full Episode via API
|
|
@@ -267,7 +300,7 @@ curl -X POST http://localhost:8000/reset \
|
|
| 267 |
-H "Content-Type: application/json" \
|
| 268 |
-d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
|
| 269 |
|
| 270 |
-
# 2. Fix the memory limit
|
| 271 |
curl -X POST http://localhost:8000/step \
|
| 272 |
-H "Content-Type: application/json" \
|
| 273 |
-d '{
|
|
@@ -276,7 +309,7 @@ curl -X POST http://localhost:8000/step \
|
|
| 276 |
"edits": [{
|
| 277 |
"file_path": "k8s/deployment.yaml",
|
| 278 |
"old_content": "memory: \"64Mi\"",
|
| 279 |
-
"new_content": "memory: \"
|
| 280 |
}]
|
| 281 |
}
|
| 282 |
}'
|
|
@@ -325,7 +358,7 @@ python inference.py
|
|
| 325 |
cloud-native-devops-env/
|
| 326 |
├── openenv.yaml # OpenEnv environment specification
|
| 327 |
├── inference.py # LLM baseline (OpenAI client + HF router)
|
| 328 |
-
├── baseline_runner.py # Heuristic baseline
|
| 329 |
├── Dockerfile # Production container
|
| 330 |
├── requirements.txt # Python dependencies
|
| 331 |
│
|
|
@@ -349,9 +382,9 @@ cloud-native-devops-env/
|
|
| 349 |
│   ├── graders/
|
| 350 |
│   │   └── __init__.py # Deterministic trajectory grader
|
| 351 |
│   ├── simulators/
|
| 352 |
-
│   │   ├── docker_simulator.py #
|
| 353 |
-
│   │   ├── workflow_simulator.py #
|
| 354 |
-
│   │   └── k8s_simulator.py #
|
| 355 |
│
|
| 356 |
├── tests/
|
| 357 |
│   └── test_endpoints.py # API endpoint tests
|
|
@@ -364,22 +397,13 @@ cloud-native-devops-env/
|
|
| 364 |
## Design Decisions
|
| 365 |
|
| 366 |
1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes β the three pillars of modern deployment pipelines.
|
| 367 |
-
2. **
|
| 368 |
3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
|
| 369 |
4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
|
| 370 |
-
5. **
|
| 371 |
-
6. **
|
|
|
|
| 372 |
|
| 373 |
## License
|
| 374 |
|
| 375 |
MIT
|
| 376 |
-
title: Cloudnative Devops Debug Env
|
| 377 |
-
emoji: π
|
| 378 |
-
colorFrom: yellow
|
| 379 |
-
colorTo: gray
|
| 380 |
-
sdk: docker
|
| 381 |
-
pinned: false
|
| 382 |
-
short_description: 'Open Env for the Meta x PyTorch x HuggingFace x SST hack '
|
| 383 |
-
---
|
| 384 |
-
|
| 385 |
-
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
|
|
|
|
| 42 |
│ - replace_line: fix a specific line number │
|
| 43 |
│ - add_line / add_block: insert missing content │
|
| 44 |
│ - delete_line / delete_block: remove bad content │
|
| 45 |
+
│ - request_hint: get a clue (-4% score penalty) │
|
| 46 |
│ - submit: "I'm done fixing" │
|
| 47 |
│ │
|
| 48 |
│ After each action, agent gets: │
|
|
|
|
| 56 |
│ - Whether ALL issues were fixed (bonus) │
|
| 57 |
│ - How many steps it took (efficiency) │
|
| 58 |
│ - How many hints were used (penalty) │
|
| 59 |
+
│ Score range: (0, 1) exclusive – never exactly 0 or 1 │
|
| 60 |
└──────────────────────────────────────────────────────────────┘
|
| 61 |
```
|
| 62 |
|
|
|
|
| 64 |
|
| 65 |
## The 10 Tasks (50 Scenarios)
|
| 66 |
|
| 67 |
+
Evaluation runs **all 50 scenarios deterministically** across all 10 tasks for reproducible scoring.
|
| 68 |
+
|
| 69 |
### Task 1: Dockerfile Syntax Errors – Easy
|
| 70 |
|
| 71 |
Simple typos and instruction errors that break `docker build`.
|
|
|
|
| 75 |
| 1 | `typo_filename` | `COPY requirments.txt .` → misspelled filename | Most common Docker build error on Stack Overflow |
|
| 76 |
| 2 | `invalid_base_image` | `FROM python:3.9-slimm` → extra 'm' in tag | Happens when copy-pasting image tags |
|
| 77 |
| 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` → broken line continuation | Formatting multi-line RUN commands is tricky |
|
| 78 |
+
| 4 | `copy_missing_source` | `COPY dist/` but build output is in `build/` | Source directory doesn't exist in build context |
|
| 79 |
| 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
|
| 80 |
|
| 81 |
### Task 2: Dockerfile Runtime Errors – Medium
|
|
|
|
| 114 |
| 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
|
| 115 |
| 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
|
| 116 |
|
| 117 |
+
### Task 5: CI + Docker Integration – Medium
|
| 118 |
|
| 119 |
The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
|
| 120 |
|
| 121 |
| # | Scenario | What's Broken | Real-World Context |
|
| 122 |
|---|----------|---------------|-------------------|
|
| 123 |
| 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
|
| 124 |
+
| 2 | `missing_load_true` | `build-push-action` without `load: true` → next step can't find image | Buildx doesn't load into local daemon by default |
|
| 125 |
| 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
|
| 126 |
| 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
|
| 127 |
| 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
|
|
|
|
| 152 |
|
| 153 |
### Task 8: Kubernetes Service & Ingress Issues – Hard
|
| 154 |
|
| 155 |
+
Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague – the agent must diagnose from kubectl output.
|
| 156 |
|
| 157 |
| # | Scenario | What's Broken | Real-World Context |
|
| 158 |
|---|----------|---------------|-------------------|
|
|
|
|
| 168 |
|
| 169 |
| # | Scenario | What's Broken | Real-World Context |
|
| 170 |
|---|----------|---------------|-------------------|
|
| 171 |
+
| 1 | `registry_mismatch` | Build tags `ghcr.io/...` but push targets `docker.io/...` | Registry URL mismatch between steps |
|
| 172 |
| 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
|
| 173 |
+
| 3 | `inconsistent_tagging` | `docker tag myuser/api:latest` but image was built as `myuser/api:${{ github.sha }}` | Tag source doesn't exist |
|
| 174 |
| 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
|
| 175 |
+
| 5 | `dockerfile_path_in_subdirectory` | Workflow points to `./Dockerfile` but it's at `./services/api/Dockerfile` | Monorepo path mismatch |
|
| 176 |
|
| 177 |
### Task 10: Full Stack Deployment Pipeline – Expert
|
| 178 |
|
| 179 |
+
Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague – the agent must trace root causes from symptoms.
|
| 180 |
|
| 181 |
| # | Scenario | What's Broken | Real-World Context |
|
| 182 |
|---|----------|---------------|-------------------|
|
|
|
|
| 188 |
|
| 189 |
---
|
| 190 |
|
| 191 |
+
## Fix Validation: Simulator-Based
|
| 192 |
+
|
| 193 |
+
Fixes are validated using **structural simulators**, not string matching. This means:
|
| 194 |
+
|
| 195 |
+
- **Alternative valid fixes are accepted.** Setting memory to `512Mi` or `256Mi` both resolve the OOM – the simulator accepts either.
|
| 196 |
+
- **Three independent simulators** run after every edit:
|
| 197 |
+
- **DockerSimulator**: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
|
| 198 |
+
- **WorkflowSimulator**: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
|
| 199 |
+
- **KubernetesSimulator**: validates manifests, cross-resource dependencies (Service selector β Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
|
| 200 |
+
- **7 granular checks** are tracked: `docker_build`, `docker_run`, `workflow_parse`, `workflow_exec`, `k8s_valid`, `k8s_pod_running`, `k8s_service_active`
|
| 201 |
+
- Progress = how many checks flip from fail → pass compared to the initial broken state
|
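The check-flip progress computation above can be sketched as follows (a minimal sketch: the check names come from the README, but the real simulator interface may differ):

```python
# Assumed check names (listed in the README); the actual simulator API may differ.
CHECKS = [
    "docker_build", "docker_run", "workflow_parse", "workflow_exec",
    "k8s_valid", "k8s_pod_running", "k8s_service_active",
]

def progress(initial: dict, current: dict) -> int:
    """Count granular checks that flipped from fail (False) to pass (True)."""
    return sum(
        1 for name in CHECKS
        if not initial.get(name, False) and current.get(name, False)
    )
```

Only transitions relative to the initial broken state count, so a check that was already passing contributes nothing.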
| 202 |
+
|
| 203 |
+
---
|
| 204 |
+
|
| 205 |
## Available Actions
|
| 206 |
|
| 207 |
Each step, the agent chooses exactly one action:
|
|
|
|
| 214 |
| `delete_line` | Remove a specific line | Removing a bad instruction |
|
| 215 |
| `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
|
| 216 |
| `delete_block` | Remove a multi-line block | Removing incorrect sections |
|
| 217 |
+
| `request_hint` | Get a clue about what's wrong | Costs -4% on final score – use sparingly |
|
| 218 |
| `submit` | Declare "I'm done" → triggers final evaluation | When all fixes are applied |
|
| 219 |
|
| 220 |
**Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
|
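A minimal sketch of this exact-match rule (illustrative names, not the environment's actual code):

```python
FAILED_EDIT_PENALTY = -0.02  # reward penalty for a non-matching edit

def apply_edit(content: str, old: str, new: str):
    """Apply an edit only if `old` matches exactly, whitespace included."""
    if old not in content:
        return content, FAILED_EDIT_PENALTY  # edit rejected, file unchanged
    return content.replace(old, new, 1), 0.0
```

Even a single trailing space in `old_content` makes the match fail, which is why agents should copy the target line verbatim.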
| 221 |
|
| 222 |
---
|
| 223 |
|
| 224 |
+
## Grading System
|
| 225 |
|
| 226 |
+
Scoring is **deterministic** (same actions always produce the same score), **difficulty-aware** (harder tasks are graded more generously), and scores are strictly in **(0, 1) exclusive** – never exactly 0 or 1.
|
| 227 |
|
| 228 |
### The Formula
|
| 229 |
|
|
|
|
| 231 |
FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
|
| 232 |
```
|
| 233 |
|
| 234 |
+
Clamped to `(0.01, 0.99)`.
|
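Put together, the composition and clamp might look like this (a sketch: the component names mirror the formula above, but the grader's real implementation may differ):

```python
def final_score(base, partial_fixes, complete_bonus, difficulty_bonus,
                efficiency, hint_penalty, failed_edit_penalty):
    """Compose the score components, then clamp to (0.01, 0.99)."""
    raw = (base + partial_fixes + complete_bonus + difficulty_bonus
           + efficiency - hint_penalty - failed_edit_penalty)
    return min(0.99, max(0.01, raw))
```

The clamp guarantees the "(0, 1) exclusive" property: no trajectory can score exactly 0 or 1.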
| 235 |
|
| 236 |
### Component Breakdown
|
| 237 |
|
| 238 |
| Component | Weight | Description |
|
| 239 |
|-----------|--------|-------------|
|
| 240 |
+
| Base score | 5% | Participation credit (guarantees score > 0) |
|
| 241 |
| Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
|
| 242 |
| Complete bonus | 25% | All issues fixed |
|
| 243 |
| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
|
|
|
|
| 247 |
|
| 248 |
### Difficulty Modifiers
|
| 249 |
|
|
|
|
|
|
|
| 250 |
| Difficulty | Max Score | Efficiency Decay | Hint Cost |
|
| 251 |
|------------|-----------|------------------|-----------|
|
| 252 |
| Easy | 0.90 | 0.03/step (strict) | 4% each |
|
| 253 |
| Medium | 0.90 | 0.027/step | 4% each |
|
| 254 |
| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
|
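The table above could be captured as a simple lookup (values copied from the table; how the grader actually stores and applies them is an assumption):

```python
# Difficulty modifiers as data; values mirror the table above.
MODIFIERS = {
    "easy":   {"max_score": 0.90, "decay_per_step": 0.030, "hint_cost": 0.04},
    "medium": {"max_score": 0.90, "decay_per_step": 0.027, "hint_cost": 0.04},
    "hard":   {"max_score": 0.93, "decay_per_step": 0.021, "hint_cost": 0.03},
    "expert": {"max_score": 0.93, "decay_per_step": 0.021, "hint_cost": 0.03},
}

def deductions(difficulty: str, steps: int, hints: int) -> float:
    """Total efficiency + hint deduction for one episode."""
    m = MODIFIERS[difficulty]
    return steps * m["decay_per_step"] + hints * m["hint_cost"]
```

Hard and expert tasks decay more slowly per step, so longer diagnostic trajectories are penalized less there.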
| 255 |
|
| 256 |
+
---
|
| 257 |
+
|
| 258 |
+
## Evaluation
|
| 259 |
+
|
| 260 |
+
The evaluation pipeline runs **all 50 scenarios across all 10 tasks** deterministically:
|
| 261 |
+
|
| 262 |
+
```python
|
| 263 |
+
# Runs all 10 tasks × 5 scenarios = 50 episodes
|
| 264 |
+
results = run_baseline_episodes() # num_episodes=None runs all
|
| 265 |
+
|
| 266 |
+
# Per-episode scores in (0, 1)
|
| 267 |
+
# Aggregate = mean of all 50 scores
|
| 268 |
+
aggregate = sum(r.score for r in results) / len(results)
|
| 269 |
+
```
|
| 270 |
+
|
| 271 |
+
This ensures:
|
| 272 |
+
- **Reproducibility**: same agent produces same score every time
|
| 273 |
+
- **Complete coverage**: every error pattern is tested
|
| 274 |
+
- **Fair comparison**: all agents face the same 50 scenarios
|
| 275 |
|
| 276 |
---
|
| 277 |
|
|
|
|
| 289 |
| `/info` | GET | Task list with metadata |
|
| 290 |
| `/tasks` | GET | List all tasks with difficulty levels |
|
| 291 |
| `/grader` | POST | Grade a trajectory (list of step dicts) |
|
| 292 |
+
| `/baseline` | POST | Run baseline across all scenarios (optional: `task_id`, `num_episodes`) |
|
| 293 |
| `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
|
| 294 |
|
| 295 |
### Example: Full Episode via API
|
|
|
|
| 300 |
-H "Content-Type: application/json" \
|
| 301 |
-d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
|
| 302 |
|
| 303 |
+
# 2. Fix the memory limit (any reasonable value works – the simulator validates structurally)
|
| 304 |
curl -X POST http://localhost:8000/step \
|
| 305 |
-H "Content-Type: application/json" \
|
| 306 |
-d '{
|
|
|
|
| 309 |
"edits": [{
|
| 310 |
"file_path": "k8s/deployment.yaml",
|
| 311 |
"old_content": "memory: \"64Mi\"",
|
| 312 |
+
"new_content": "memory: \"512Mi\""
|
| 313 |
}]
|
| 314 |
}
|
| 315 |
}'
|
|
|
|
| 358 |
cloud-native-devops-env/
|
| 359 |
├── openenv.yaml # OpenEnv environment specification
|
| 360 |
├── inference.py # LLM baseline (OpenAI client + HF router)
|
| 361 |
+
├── baseline_runner.py # Heuristic baseline (runs all 50 scenarios)
|
| 362 |
├── Dockerfile # Production container
|
| 363 |
├── requirements.txt # Python dependencies
|
| 364 |
│
|
|
|
|
| 382 |
│   ├── graders/
|
| 383 |
│   │   └── __init__.py # Deterministic trajectory grader
|
| 384 |
│   ├── simulators/
|
| 385 |
+
│   │   ├── docker_simulator.py # Dockerfile build + runtime validation
|
| 386 |
+
│   │   ├── workflow_simulator.py # GHA workflow parse + execution validation
|
| 387 |
+
│   │   └── k8s_simulator.py # K8s manifest + cross-resource validation
|
| 388 |
│
|
| 389 |
├── tests/
|
| 390 |
│   └── test_endpoints.py # API endpoint tests
|
|
|
|
| 397 |
## Design Decisions
|
| 398 |
|
| 399 |
1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes – the three pillars of modern deployment pipelines.
|
| 400 |
+
2. **Simulator-based validation**: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., `512Mi` and `256Mi` both fix an OOM). Deterministic, fast, no security concerns.
|
| 401 |
3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
|
| 402 |
4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
|
| 403 |
+
5. **Vague error messages in harder tasks**: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
|
| 404 |
+
6. **Deterministic evaluation**: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
|
| 405 |
+
7. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
|
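The dense per-step reward in point 3 can be illustrated with a sketch (it uses only the two numbers quoted above; the environment's actual reward shaping may include more terms):

```python
def step_reward(fixes: int, failed_edits: int) -> float:
    """Dense shaping: +0.3 per newly fixed issue, -0.02 per failed edit."""
    return 0.3 * fixes - 0.02 * failed_edits
```

A step that fixes one issue but also fumbles one edit still nets positive reward, which keeps the agent exploring rather than freezing after a mistake.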
| 406 |
|
| 407 |
## License
|
| 408 |
|
| 409 |
MIT
|
|
|
baseline_runner.py
CHANGED
|
@@ -1,6 +1,7 @@
|
|
| 1 |
"""Heuristic baseline runner for the /baseline endpoint.
|
| 2 |
|
| 3 |
Applies expected_fixes directly to verify the environment + grader work e2e.
|
|
|
|
| 4 |
"""
|
| 5 |
|
| 6 |
|
|
@@ -22,6 +23,17 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
|
|
| 22 |
break
|
| 23 |
file_path = fix["file"]
|
| 24 |
if file_path not in env.current_files:
|
|
|
|
|
|
| 25 |
continue
|
| 26 |
|
| 27 |
current_content = env.current_files[file_path].content
|
|
@@ -50,18 +62,22 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
|
|
| 50 |
)],
|
| 51 |
)
|
| 52 |
else:
|
| 53 |
-
# Find the line
|
| 54 |
best_line = None
|
| 55 |
best_idx = None
|
|
|
|
| 56 |
for i, line in enumerate(lines):
|
| 57 |
stripped = line.strip()
|
| 58 |
exp_stripped = expected.strip()
|
| 59 |
-
|
| 60 |
-
|
| 61 |
-
|
| 62 |
-
|
| 63 |
-
|
| 64 |
-
|
|
|
|
|
|
|
|
|
|
| 65 |
|
| 66 |
if best_line is not None:
|
| 67 |
action = Action(
|
|
@@ -115,12 +131,12 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
|
|
| 115 |
return run_grader(task_id, env.trajectory)
|
| 116 |
|
| 117 |
|
| 118 |
-
def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int =
|
| 119 |
"""Run baseline episodes across tasks.
|
| 120 |
|
| 121 |
Args:
|
| 122 |
task_id: Specific task to run, or None for all tasks.
|
| 123 |
-
num_episodes:
|
| 124 |
|
| 125 |
Returns:
|
| 126 |
List of GraderResult for each episode.
|
|
@@ -137,13 +153,11 @@ def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int = 1)
|
|
| 137 |
for tid in task_ids:
|
| 138 |
task_cls = TASK_REGISTRY[tid]
|
| 139 |
scenarios = task_cls.SCENARIOS
|
| 140 |
-
|
| 141 |
-
|
| 142 |
-
if episodes_run >= num_episodes:
|
| 143 |
break
|
| 144 |
env = CloudNativeDebugEnvironment()
|
| 145 |
result = _heuristic_episode(env, tid, scenario["id"])
|
| 146 |
results.append(result)
|
| 147 |
-
episodes_run += 1
|
| 148 |
|
| 149 |
return results
|
|
|
|
| 1 |
"""Heuristic baseline runner for the /baseline endpoint.
|
| 2 |
|
| 3 |
Applies expected_fixes directly to verify the environment + grader work e2e.
|
| 4 |
+
By default runs ALL scenarios of ALL tasks for deterministic, reproducible evaluation.
|
| 5 |
"""
|
| 6 |
|
| 7 |
|
|
|
|
| 23 |
break
|
| 24 |
file_path = fix["file"]
|
| 25 |
if file_path not in env.current_files:
|
| 26 |
+
# For fixes that require creating a new file (e.g. ConfigMap),
|
| 27 |
+
# create it with the expected content
|
| 28 |
+
if fix["type"] == "contains":
|
| 29 |
+
action = Action(
|
| 30 |
+
action_type=ActionType.EDIT_FILE,
|
| 31 |
+
edits=[FileEdit(
|
| 32 |
+
file_path=file_path,
|
| 33 |
+
new_content=fix["expected"],
|
| 34 |
+
)],
|
| 35 |
+
)
|
| 36 |
+
env.step(action)
|
| 37 |
continue
|
| 38 |
|
| 39 |
current_content = env.current_files[file_path].content
|
|
|
|
| 62 |
)],
|
| 63 |
)
|
| 64 |
else:
|
| 65 |
+
# Find the line with highest character overlap to expected
|
| 66 |
best_line = None
|
| 67 |
best_idx = None
|
| 68 |
+
best_score = 0
|
| 69 |
for i, line in enumerate(lines):
|
| 70 |
stripped = line.strip()
|
| 71 |
exp_stripped = expected.strip()
|
| 72 |
+
if not stripped or not exp_stripped:
|
| 73 |
+
continue
|
| 74 |
+
overlap = len(set(stripped) & set(exp_stripped))
|
| 75 |
+
# Use ratio of overlap to max length for scoring
|
| 76 |
+
score = overlap / max(len(exp_stripped), len(stripped))
|
| 77 |
+
if score > 0.5 and score > best_score:
|
| 78 |
+
best_line = line
|
| 79 |
+
best_idx = i
|
| 80 |
+
best_score = score
|
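As a standalone, testable function, the matching heuristic in the diff above is roughly (a sketch matching the shown logic; the real helper may normalize lines differently):

```python
def find_best_line(lines, expected):
    """Return (index, line) of the line most similar to `expected`,
    scored by shared-character ratio; (None, None) if nothing scores > 0.5."""
    best_idx, best_line, best_score = None, None, 0.0
    exp = expected.strip()
    for i, line in enumerate(lines):
        stripped = line.strip()
        if not stripped or not exp:
            continue
        # Crude similarity: set of shared characters over the longer length.
        overlap = len(set(stripped) & set(exp))
        score = overlap / max(len(exp), len(stripped))
        if score > 0.5 and score > best_score:
            best_idx, best_line, best_score = i, line, score
    return best_idx, best_line
```

Character-set overlap is deliberately cheap: it tolerates typos like `requirments` vs `requirements` while still rejecting unrelated lines.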
| 81 |
|
| 82 |
if best_line is not None:
|
| 83 |
action = Action(
|
|
|
|
| 131 |
return run_grader(task_id, env.trajectory)
|
| 132 |
|
| 133 |
|
| 134 |
+
def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: Optional[int] = None) -> List[GraderResult]:
|
| 135 |
"""Run baseline episodes across tasks.
|
| 136 |
|
| 137 |
Args:
|
| 138 |
task_id: Specific task to run, or None for all tasks.
|
| 139 |
+
num_episodes: Max scenarios per task. None = run ALL scenarios (default).
|
| 140 |
|
| 141 |
Returns:
|
| 142 |
List of GraderResult for each episode.
|
|
|
|
| 153 |
for tid in task_ids:
|
| 154 |
task_cls = TASK_REGISTRY[tid]
|
| 155 |
scenarios = task_cls.SCENARIOS
|
| 156 |
+
for idx, scenario in enumerate(scenarios):
|
| 157 |
+
if num_episodes is not None and idx >= num_episodes:
|
|
|
|
| 158 |
break
|
| 159 |
env = CloudNativeDebugEnvironment()
|
| 160 |
result = _heuristic_episode(env, tid, scenario["id"])
|
| 161 |
results.append(result)
|
|
|
|
| 162 |
|
| 163 |
return results
|
inference.py
CHANGED
|
@@ -1,13 +1,44 @@
|
|
| 1 |
-
"""
|
| 2 |
-
|
| 3 |
-
|
| 4 |
-
|
| 5 |
-
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
|
| 9 |
-
|
| 10 |
-
|
|
|
|
|
| 11 |
"""
|
| 12 |
|
| 13 |
|
|
@@ -22,12 +53,14 @@ import requests
|
|
| 22 |
from openai import OpenAI
|
| 23 |
|
| 24 |
|
| 25 |
-
API_BASE_URL = os.getenv("API_BASE_URL"
|
| 26 |
-
MODEL_NAME = os.getenv("MODEL_NAME"
|
| 27 |
-
|
| 28 |
ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
|
| 29 |
LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
|
|
|
|
| 30 |
MAX_STEPS = 8 # leave 2 steps buffer before env hard-limit of 10
|
|
|
|
| 31 |
|
| 32 |
SYSTEM_PROMPT = """You are an expert DevOps engineer debugging cloud-native deployment pipelines.
|
| 33 |
You will receive broken Dockerfile, GitHub Actions workflow, and/or Kubernetes manifest files along with error messages.
|
|
@@ -79,11 +112,37 @@ Rules:
|
|
| 79 |
- Always respond with valid JSON only, no markdown fences"""
|
| 80 |
|
| 81 |
|
|
|
|
|
|
| 82 |
def create_client() -> OpenAI:
|
| 83 |
"""Create OpenAI-compatible client for HuggingFace router."""
|
| 84 |
return OpenAI(
|
| 85 |
base_url=API_BASE_URL,
|
| 86 |
-
api_key=
|
| 87 |
)
|
| 88 |
|
| 89 |
|
|
@@ -188,130 +247,144 @@ def run_episode(client: OpenAI, task_id: Optional[str] = None, scenario_id: Opti
|
|
| 188 |
if scenario_id:
|
| 189 |
reset_payload["scenario_id"] = scenario_id
|
| 190 |
|
| 191 |
-
|
| 192 |
-
|
| 193 |
-
|
| 194 |
-
|
| 195 |
-
actual_task_id = info.get("task_id", task_id or "unknown")
|
| 196 |
-
actual_scenario_id = info.get("scenario_id", scenario_id or "unknown")
|
| 197 |
-
|
| 198 |
-
print(f"[START] task_id={actual_task_id} scenario_id={actual_scenario_id}")
|
| 199 |
|
| 200 |
-
messages = [{"role": "system", "content": SYSTEM_PROMPT}]
|
| 201 |
trajectory = []
|
| 202 |
-
|
|
|
|
|
|
|
|
|
|
| 203 |
|
| 204 |
-
|
| 205 |
-
|
| 206 |
-
|
| 207 |
-
|
| 208 |
-
|
| 209 |
-
|
| 210 |
-
|
| 211 |
-
|
| 212 |
-
|
| 213 |
-
|
|
|
|
|
|
|
|
|
|
|
| 214 |
)
|
| 215 |
-
llm_text = completion.choices[0].message.content or '{"action": "submit"}'
|
| 216 |
-
except Exception as e:
|
| 217 |
-
print(f"[STEP] step={step_num + 1} action=error reward=0.00 done=false issues_fixed=0 issues_total=0 error={e}")
|
| 218 |
-
llm_text = '{"action": "submit"}'
|
| 219 |
-
|
| 220 |
-
messages.append({"role": "assistant", "content": llm_text})
|
| 221 |
-
|
| 222 |
-
parsed = parse_llm_response(llm_text)
|
| 223 |
-
action = build_action(parsed)
|
| 224 |
-
|
| 225 |
-
step_resp = env_request("POST", "/step", {"action": action})
|
| 226 |
-
obs = step_resp["observation"]
|
| 227 |
-
reward = step_resp.get("reward", 0.0)
|
| 228 |
-
done = step_resp.get("done", False)
|
| 229 |
-
step_info = step_resp.get("info", {})
|
| 230 |
-
total_steps = step_num + 1
|
| 231 |
-
|
| 232 |
-
issues_fixed = step_info.get("issues_fixed", 0)
|
| 233 |
-
issues_total = step_info.get("issues_total", 0)
|
| 234 |
-
|
| 235 |
-
print(f"[STEP] step={total_steps} action={action['action_type']} reward={reward:.2f} done={str(done).lower()} issues_fixed={issues_fixed} issues_total={issues_total}")
|
| 236 |
-
|
| 237 |
-
trajectory.append({
|
| 238 |
-
"step": total_steps,
|
| 239 |
-
"action": action,
|
| 240 |
-
"reward": reward,
|
| 241 |
-
"done": done,
|
| 242 |
-
"info": step_info,
|
| 243 |
-
})
|
| 244 |
|
| 245 |
-
|
| 246 |
-
|
|
|
|
|
|
|
|
|
|
| 247 |
|
| 248 |
-
|
| 249 |
-
|
| 250 |
-
"task_id": actual_task_id,
|
| 251 |
-
"trajectory": trajectory,
|
| 252 |
-
})
|
| 253 |
-
result = grade_resp.get("result", {})
|
| 254 |
-
score = result.get("score", 0.0)
|
| 255 |
|
| 256 |
-
|
| 257 |
-
return result
|
| 258 |
|
| 259 |
|
| 260 |
def run_all_tasks(client: OpenAI) -> Dict[str, float]:
|
| 261 |
-
"""Run baseline on all tasks and report scores."""
|
| 262 |
-
|
| 263 |
-
|
|
|
|
|
|
|
|
|
|
| 264 |
|
| 265 |
scores: Dict[str, List[float]] = {}
|
| 266 |
|
| 267 |
-
for
|
| 268 |
-
task_id = task["id"]
|
| 269 |
-
print(f"\n{'='*60}")
|
| 270 |
-
print(f"Task: {task['name']} ({task['difficulty']})")
|
| 271 |
-
print(f"{'='*60}")
|
| 272 |
-
|
| 273 |
task_scores = []
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 277 |
scores[task_id] = task_scores
|
| 278 |
|
| 279 |
# Summary
|
| 280 |
-
print(f"\n{'='*60}")
|
| 281 |
-
print("BASELINE RESULTS SUMMARY")
|
| 282 |
-
print(f"{'='*60}")
|
| 283 |
avg_scores = {}
|
| 284 |
for task_id, task_scores in scores.items():
|
| 285 |
avg = sum(task_scores) / len(task_scores) if task_scores else 0.0
|
| 286 |
avg_scores[task_id] = avg
|
| 287 |
-
print(f"
|
| 288 |
|
| 289 |
overall = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0.0
|
| 290 |
-
print(f"
|
| 291 |
|
| 292 |
return avg_scores
|
| 293 |
|
| 294 |
|
| 295 |
def main():
|
| 296 |
"""Entry point for baseline inference."""
|
| 297 |
-
|
| 298 |
-
|
| 299 |
-
|
| 300 |
-
print(f"Environment: {ENV_URL}")
|
| 301 |
-
|
| 302 |
-
if not HF_TOKEN:
|
| 303 |
-
print("\nWARNING: HF_TOKEN not set. Set it via: export HF_TOKEN=your_token_here")
|
| 304 |
-
print("Continuing anyway (will fail if auth is required)...\n")
|
| 305 |
|
| 306 |
# Verify environment is running
|
| 307 |
try:
|
| 308 |
health = env_request("GET", "/health")
|
| 309 |
-
print(f"Environment status: {health.get('status', 'unknown')}
|
| 310 |
except Exception as e:
|
| 311 |
-
print(f"
|
| 312 |
-
print(
|
| 313 |
-
print("\nStart the server first:")
|
| 314 |
-
print(" python -m uvicorn server.app:app --host 0.0.0.0 --port 8000")
|
| 315 |
sys.exit(1)
|
| 316 |
|
| 317 |
client = create_client()
|
|
```diff
+"""
+Inference Script for Cloud-Native Debug Environment
+===================================================
+MANDATORY
+- Before submitting, ensure the following variables are defined in your environment configuration:
+      API_BASE_URL       The API endpoint for the LLM.
+      MODEL_NAME         The model identifier to use for inference.
+      HF_TOKEN           Your Hugging Face / API key.
+      LOCAL_IMAGE_NAME   The name of the local image to use for the environment if you are using the
+                         from_docker_image() method
+
+- Defaults are set only for API_BASE_URL and MODEL_NAME
+  (and should reflect your active inference setup):
+      API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
+      MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
+
+- The inference script must be named `inference.py` and placed in the root directory of the project
+- Participants must use the OpenAI client for all LLM calls using the above variables
+
+STDOUT FORMAT
+- The script must emit exactly three line types to stdout, in this order:
+
+  [START] task=<task_name> env=<benchmark> model=<model_name>
+  [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
+  [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
+
+Rules:
+- One [START] line at episode start.
+- One [STEP] line per step, immediately after env.step() returns.
+- One [END] line after the episode completes, always emitted (even on exception).
+- reward and rewards are formatted to 2 decimal places.
+- done and success are lowercase booleans: true or false.
+- error is the raw error string, or null if none.
+- All fields on a single line, with no newlines within a line.
+- Each task should return a score in [0, 1].
+
+Example:
+[START] task=dockerfile_syntax env=cloud_native_devops model=meta-llama/Llama-3.1-70B-Instruct
+[STEP] step=1 action=edit_file reward=0.30 done=false error=null
+[STEP] step=2 action=submit reward=0.00 done=true error=null
+[END] success=true steps=2 score=0.850 rewards=0.30,0.00
 """
 
 ...
 
 from openai import OpenAI
 
+API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
+MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.1-70B-Instruct"
+API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
 ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
 LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
+BENCHMARK = "cloud_native_devops"
 MAX_STEPS = 8  # leave 2 steps buffer before env hard-limit of 10
+SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]
 
 SYSTEM_PROMPT = """You are an expert DevOps engineer debugging cloud-native deployment pipelines.
 You will receive broken Dockerfile, GitHub Actions workflow, and/or Kubernetes manifest files along with error messages.
```
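The three-line stdout contract spelled out in the docstring above is regular enough to check mechanically. As a sketch (the regex and helper name are illustrative, not part of the environment's API), this pulls the typed fields out of an `[END]` line:

```python
import re

# Matches the [END] line defined by the STDOUT FORMAT spec:
# [END] success=<true|false> steps=<n> score=<score> rewards=<r1,...,rn>
END_RE = re.compile(
    r"\[END\] success=(?P<success>true|false) "
    r"steps=(?P<steps>\d+) score=(?P<score>[\d.]+) rewards=(?P<rewards>[\d.,-]*)"
)

def parse_end_line(line: str) -> dict:
    """Parse one [END] line into typed fields; raise ValueError on mismatch."""
    m = END_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a valid [END] line: {line!r}")
    rewards = [float(r) for r in m.group("rewards").split(",") if r]
    return {
        "success": m.group("success") == "true",
        "steps": int(m.group("steps")),
        "score": float(m.group("score")),
        "rewards": rewards,
    }
```

A harness that grades submissions could apply the same pattern to `[START]` and `[STEP]` lines, since every field is a single `key=value` token with no embedded spaces.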
| 112 |
- Always respond with valid JSON only, no markdown fences"""
|
| 113 |
|
| 114 |
|
| 115 |
+
# ---------------------------------------------------------------------------
|
| 116 |
+
# Logging helpers (mandatory stdout format)
|
| 117 |
+
# ---------------------------------------------------------------------------
|
| 118 |
+
|
| 119 |
+
def log_start(task: str, env: str, model: str) -> None:
|
| 120 |
+
print(f"[START] task={task} env={env} model={model}", flush=True)
|
| 121 |
+
|
| 122 |
+
|
| 123 |
+
def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
|
| 124 |
+
error_val = error if error else "null"
|
| 125 |
+
done_val = str(done).lower()
|
| 126 |
+
print(
|
| 127 |
+
f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
|
| 128 |
+
flush=True,
|
| 129 |
+
)
|
| 130 |
+
|
| 131 |
+
|
| 132 |
+
def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
|
| 133 |
+
rewards_str = ",".join(f"{r:.2f}" for r in rewards)
|
| 134 |
+
print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
|
| 135 |
+
|
| 136 |
+
|
| 137 |
+
# ---------------------------------------------------------------------------
|
| 138 |
+
# Client / env helpers
|
| 139 |
+
# ---------------------------------------------------------------------------
|
| 140 |
+
|
| 141 |
def create_client() -> OpenAI:
|
| 142 |
"""Create OpenAI-compatible client for HuggingFace router."""
|
| 143 |
return OpenAI(
|
| 144 |
base_url=API_BASE_URL,
|
| 145 |
+
api_key=API_KEY,
|
| 146 |
)
|
| 147 |
|
| 148 |
|
|
|
|
```diff
     if scenario_id:
         reset_payload["scenario_id"] = scenario_id
 
+    # Best-effort task name for [START]
+    target_task = task_id or "random_task"
+    log_start(task=target_task, env=BENCHMARK, model=MODEL_NAME)
 
     trajectory = []
+    rewards: List[float] = []
+    steps_taken = 0
+    score = 0.0
+    success = False
 
+    try:
+        reset_resp = env_request("POST", "/reset", reset_payload)
+        obs = reset_resp["observation"]
+        info = reset_resp.get("info", {})
+
+        actual_task_id = info.get("task_id", target_task)
+        actual_scenario_id = info.get("scenario_id", scenario_id or "unknown")
+
+        messages = [{"role": "system", "content": SYSTEM_PROMPT}]
+        for step_num in range(1, MAX_STEPS + 1):
+            user_msg = format_observation(obs)
+            messages.append({"role": "user", "content": user_msg})
+
+            error_msg: Optional[str] = None
+
+            try:
+                completion = client.chat.completions.create(
+                    model=MODEL_NAME,
+                    messages=messages,
+                    temperature=0.1,
+                    max_tokens=1024,
+                )
+                llm_text = completion.choices[0].message.content or '{"action": "submit"}'
+            except Exception as e:
+                error_msg = str(e)
+                print(f"[DEBUG] Model request failed: {e}", flush=True)
+                llm_text = '{"action": "submit"}'
+
+            messages.append({"role": "assistant", "content": llm_text})
+
+            parsed = parse_llm_response(llm_text)
+            action = build_action(parsed)
+
+            step_resp = env_request("POST", "/step", {"action": action})
+            obs = step_resp["observation"]
+            reward = step_resp.get("reward", 0.0)
+            done = step_resp.get("done", False)
+            step_info = step_resp.get("info", {})
+            steps_taken = step_num
+
+            rewards.append(reward)
+
+            log_step(
+                step=step_num,
+                action=action["action_type"],
+                reward=reward,
+                done=done,
+                error=error_msg,
             )
 
+            trajectory.append({
+                "step": step_num,
+                "action": action,
+                "reward": reward,
+                "done": done,
+                "info": step_info,
+            })
+
+            if done:
+                break
+
+        # Grade the trajectory
+        grade_resp = env_request("POST", "/grader", {
+            "task_id": actual_task_id,
+            "trajectory": trajectory,
+        })
+        result = grade_resp.get("result", {})
+        score = result.get("score", 0.0)
+        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
+        success = score >= SUCCESS_SCORE_THRESHOLD
 
+    finally:
+        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
 
+    return {"score": score, "success": success, "steps": steps_taken, "rewards": rewards}
 
```
```diff
 def run_all_tasks(client: OpenAI) -> Dict[str, float]:
+    """Run baseline on all tasks (and ALL their scenarios) and report scores."""
+    try:
+        from server.tasks.task_registry import TASK_REGISTRY
+    except ImportError as e:
+        print(f"[DEBUG] Could not import TASK_REGISTRY: {e}", flush=True)
+        return {}
 
     scores: Dict[str, List[float]] = {}
 
+    for task_id, task_cls in TASK_REGISTRY.items():
         task_scores = []
+
+        # Iterate over all exact scenarios for this task
+        scenarios = task_cls.SCENARIOS
+        for scenario in scenarios:
+            scenario_id = scenario["id"]
+            result = run_episode(client, task_id=task_id, scenario_id=scenario_id)
+            task_scores.append(result.get("score", 0.0))
+
         scores[task_id] = task_scores
 
     # Summary
+    print(f"\n[DEBUG] {'='*60}", flush=True)
+    print("[DEBUG] BASELINE RESULTS SUMMARY", flush=True)
+    print(f"[DEBUG] {'='*60}", flush=True)
     avg_scores = {}
     for task_id, task_scores in scores.items():
         avg = sum(task_scores) / len(task_scores) if task_scores else 0.0
         avg_scores[task_id] = avg
+        print(f"[DEBUG] {task_id:40s} {avg:.3f}", flush=True)
 
     overall = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0.0
+    print(f"[DEBUG] {'OVERALL':40s} {overall:.3f}", flush=True)
 
     return avg_scores
 
 
 def main():
     """Entry point for baseline inference."""
+    if not API_KEY:
+        print("[DEBUG] WARNING: HF_TOKEN not set. Set it via: export HF_TOKEN=your_token_here", flush=True)
+        print("[DEBUG] Continuing anyway (will fail if auth is required)...", flush=True)
 
     # Verify environment is running
     try:
         health = env_request("GET", "/health")
+        print(f"[DEBUG] Environment status: {health.get('status', 'unknown')}", flush=True)
     except Exception as e:
+        print(f"[DEBUG] Cannot connect to environment at {ENV_URL}: {e}", flush=True)
+        print("[DEBUG] Start the server first: python -m uvicorn server.app:app --host 0.0.0.0 --port 8000", flush=True)
         sys.exit(1)
 
     client = create_client()
```
sample_scripts/sample_inf_script.py
ADDED
@@ -0,0 +1,188 @@

```python
"""
Inference Script Example
========================
MANDATORY
- Before submitting, ensure the following variables are defined in your environment configuration:
      API_BASE_URL       The API endpoint for the LLM.
      MODEL_NAME         The model identifier to use for inference.
      HF_TOKEN           Your Hugging Face / API key.
      LOCAL_IMAGE_NAME   The name of the local image to use for the environment if you are using the
                         from_docker_image() method

- Defaults are set only for API_BASE_URL and MODEL_NAME
  (and should reflect your active inference setup):
      API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
      MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")

- The inference script must be named `inference.py` and placed in the root directory of the project
- Participants must use the OpenAI client for all LLM calls using the above variables

STDOUT FORMAT
- The script must emit exactly three line types to stdout, in this order:

  [START] task=<task_name> env=<benchmark> model=<model_name>
  [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
  [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>

Rules:
- One [START] line at episode start.
- One [STEP] line per step, immediately after env.step() returns.
- One [END] line after env.close(), always emitted (even on exception).
- reward and rewards are formatted to 2 decimal places.
- done and success are lowercase booleans: true or false.
- error is the raw last_action_error string, or null if none.
- All fields on a single line, with no newlines within a line.
- Each task should return a score in [0, 1].

Example:
[START] task=click-test env=miniwob model=Qwen3-VL-30B
[STEP] step=1 action=click('123') reward=0.00 done=false error=null
[STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
[STEP] step=3 action=click('789') reward=1.00 done=true error=null
[END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
"""

import asyncio
import os
import textwrap
from typing import List, Optional

from openai import OpenAI

from my_env_v4 import MyEnvV4Action, MyEnvV4Env

IMAGE_NAME = os.getenv("IMAGE_NAME")  # if you are using a docker image
API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")

API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
MAX_STEPS = 8
TEMPERATURE = 0.7
MAX_TOKENS = 150
SUCCESS_SCORE_THRESHOLD = 0.1  # normalized score in [0, 1]

# Max possible reward: each token contributes 0.1, across all steps
_MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP

SYSTEM_PROMPT = textwrap.dedent(
    """
    You are interacting with a simple echo environment.
    Each turn you must send a message. The environment will echo it back.
    Reward is proportional to message length: reward = len(message) * 0.1
    Your goal is to maximize total reward by sending meaningful, substantive messages.
    Reply with exactly one message string - no quotes, no prefixes, just the message text.
    """
).strip()


def log_start(task: str, env: str, model: str) -> None:
    print(f"[START] task={task} env={env} model={model}", flush=True)


def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
    error_val = error if error else "null"
    done_val = str(done).lower()
    print(
        f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
        flush=True,
    )


def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
    rewards_str = ",".join(f"{r:.2f}" for r in rewards)
    print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)


def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    history_block = "\n".join(history[-4:]) if history else "None"
    return textwrap.dedent(
        f"""
        Step: {step}
        Last echoed message: {last_echoed!r}
        Last reward: {last_reward:.2f}
        Previous steps:
        {history_block}
        Send your next message.
        """
    ).strip()


def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
    user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
    try:
        completion = client.chat.completions.create(
            model=MODEL_NAME,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_prompt},
            ],
            temperature=TEMPERATURE,
            max_tokens=MAX_TOKENS,
            stream=False,
        )
        text = (completion.choices[0].message.content or "").strip()
        return text if text else "hello"
    except Exception as exc:
        print(f"[DEBUG] Model request failed: {exc}", flush=True)
        return "hello"


async def main() -> None:
    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)

    env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)

    history: List[str] = []
    rewards: List[float] = []
    steps_taken = 0
    score = 0.0
    success = False

    log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)

    try:
        result = await env.reset()  # OpenEnv.reset()
        last_echoed = result.observation.echoed_message
        last_reward = 0.0

        for step in range(1, MAX_STEPS + 1):
            if result.done:
                break

            message = get_model_message(client, step, last_echoed, last_reward, history)

            result = await env.step(MyEnvV4Action(message=message))
            obs = result.observation

            reward = result.reward or 0.0
            done = result.done
            error = None

            rewards.append(reward)
            steps_taken = step
            last_echoed = obs.echoed_message
            last_reward = reward

            log_step(step=step, action=message, reward=reward, done=done, error=error)

            history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")

            if done:
                break

        score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
        score = min(max(score, 0.0), 1.0)  # clamp to [0, 1]
        success = score >= SUCCESS_SCORE_THRESHOLD

    finally:
        try:
            await env.close()
        except Exception as e:
            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
        log_end(success=success, steps=steps_taken, score=score, rewards=rewards)


if __name__ == "__main__":
    asyncio.run(main())
```
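The sample script's score normalization divides the summed per-step rewards by the theoretical maximum (`MAX_STEPS * MAX_TOKENS * 0.1`) and clamps into [0, 1]. A standalone sketch of that arithmetic, with the constants mirroring the sample above (the helper name is illustrative):

```python
MAX_STEPS = 8
MAX_TOKENS = 150
MAX_TOTAL_REWARD = MAX_STEPS * (MAX_TOKENS * 0.1)  # 8 * 15.0 = 120.0

def normalize_score(rewards: list) -> float:
    """Map a list of per-step rewards onto [0, 1], as the sample script does."""
    if MAX_TOTAL_REWARD <= 0:
        return 0.0
    score = sum(rewards) / MAX_TOTAL_REWARD
    return min(max(score, 0.0), 1.0)  # clamp to [0, 1]
```

With the clamp in place, a reward total above the theoretical maximum (or a negative total) can never push the reported score outside the [0, 1] range the submission rules require.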
sample_scripts/sample_val_script.txt
ADDED
@@ -0,0 +1,185 @@

```bash
#!/usr/bin/env bash
#
# validate-submission.sh - OpenEnv Submission Validator
#
# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
#
# Prerequisites:
#   - Docker: https://docs.docker.com/get-docker/
#   - openenv-core: pip install openenv-core
#   - curl (usually pre-installed)
#
# Run:
#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
#
# Or download and run locally:
#   chmod +x validate-submission.sh
#   ./validate-submission.sh <ping_url> [repo_dir]
#
# Arguments:
#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
#   repo_dir   Path to your repo (default: current directory)
#
# Examples:
#   ./validate-submission.sh https://my-team.hf.space
#   ./validate-submission.sh https://my-team.hf.space ./my-repo
#

set -uo pipefail

DOCKER_BUILD_TIMEOUT=600
if [ -t 1 ]; then
    RED='\033[0;31m'
    GREEN='\033[0;32m'
    YELLOW='\033[1;33m'
    BOLD='\033[1m'
    NC='\033[0m'
else
    RED='' GREEN='' YELLOW='' BOLD='' NC=''
fi

run_with_timeout() {
    local secs="$1"; shift
    if command -v timeout &>/dev/null; then
        timeout "$secs" "$@"
    elif command -v gtimeout &>/dev/null; then
        gtimeout "$secs" "$@"
    else
        "$@" &
        local pid=$!
        ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
        local watcher=$!
        wait "$pid" 2>/dev/null
        local rc=$?
        kill "$watcher" 2>/dev/null
        wait "$watcher" 2>/dev/null
        return $rc
    fi
}

portable_mktemp() {
    local prefix="${1:-validate}"
    mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
}

CLEANUP_FILES=()
cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
trap cleanup EXIT

PING_URL="${1:-}"
REPO_DIR="${2:-.}"

if [ -z "$PING_URL" ]; then
    printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
    printf "\n"
    printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
    printf "  repo_dir   Path to your repo (default: current directory)\n"
    exit 1
fi

if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
    printf "Error: directory '%s' not found\n" "${2:-.}"
    exit 1
fi
PING_URL="${PING_URL%/}"
export PING_URL
PASS=0

log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
fail() { log "${RED}FAILED${NC} -- $1"; }
hint() { printf "       ${YELLOW}Hint:${NC} %b\n" "$1"; }
stop_at() {
    printf "\n"
    printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
    exit 1
}

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${BOLD}   OpenEnv Submission Validator${NC}\n"
printf "${BOLD}========================================${NC}\n"
log "Repo:     $REPO_DIR"
log "Ping URL: $PING_URL"
printf "\n"

log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."

CURL_OUTPUT=$(portable_mktemp "validate-curl")
CLEANUP_FILES+=("$CURL_OUTPUT")
HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
    -H "Content-Type: application/json" -d '{}' \
    "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")

if [ "$HTTP_CODE" = "200" ]; then
    pass "HF Space is live and responds to /reset"
elif [ "$HTTP_CODE" = "000" ]; then
    fail "HF Space not reachable (connection failed or timed out)"
    hint "Check your network connection and that the Space is running."
    hint "Try: curl -s -o /dev/null -w '%{http_code}' -X POST $PING_URL/reset"
    stop_at "Step 1"
else
    fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
    hint "Make sure your Space is running and the URL is correct."
    hint "Try opening $PING_URL in your browser first."
    stop_at "Step 1"
fi

log "${BOLD}Step 2/3: Running docker build${NC} ..."

if ! command -v docker &>/dev/null; then
    fail "docker command not found"
    hint "Install Docker: https://docs.docker.com/get-docker/"
    stop_at "Step 2"
fi

if [ -f "$REPO_DIR/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR"
elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
    DOCKER_CONTEXT="$REPO_DIR/server"
else
    fail "No Dockerfile found in repo root or server/ directory"
    stop_at "Step 2"
fi

log "  Found Dockerfile in $DOCKER_CONTEXT"

BUILD_OK=false
BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true

if [ "$BUILD_OK" = true ]; then
    pass "Docker build succeeded"
else
    fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
    printf "%s\n" "$BUILD_OUTPUT" | tail -20
    stop_at "Step 2"
fi

log "${BOLD}Step 3/3: Running openenv validate${NC} ..."

if ! command -v openenv &>/dev/null; then
    fail "openenv command not found"
    hint "Install it: pip install openenv-core"
    stop_at "Step 3"
fi

VALIDATE_OK=false
VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true

if [ "$VALIDATE_OK" = true ]; then
    pass "openenv validate passed"
    [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
else
    fail "openenv validate failed"
    printf "%s\n" "$VALIDATE_OUTPUT"
    stop_at "Step 3"
fi

printf "\n"
printf "${BOLD}========================================${NC}\n"
printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
printf "${BOLD}========================================${NC}\n"
printf "\n"

exit 0
```
server/environment.py
CHANGED
```diff
@@ -71,15 +71,30 @@ class CloudNativeDebugEnvironment:
         return None
 
     def _validation_snapshot(self) -> Dict[str, bool]:
+        """Return a detailed snapshot of all 7 simulator checks."""
         docker_result = self.docker_sim.validate(self.current_files.get("Dockerfile"), self.current_files)
         workflow_file = self._find_workflow_file()
         workflow_result = self.workflow_sim.validate(workflow_file, self.current_files)
         k8s_result = self.k8s_sim.validate(self.current_files)
+
+        has_docker = "Dockerfile" in self.current_files
+        has_workflow = workflow_file is not None
+        has_k8s = any(fc.file_type == FileType.KUBERNETES for fc in self.current_files.values())
+
+        snapshot: Dict[str, bool] = {}
+        if has_docker:
+            snapshot["docker_build_valid"] = bool(docker_result.get("build_success", False))
 
     def __init__(self):
         self.docker_sim = DockerSimulator()
@@ -146,7 +161,13 @@ class CloudNativeDebugEnvironment:
         )
 
         self.expected_fixes = scenario["expected_fixes"]
         self.issues_fixed = 0
 
         self.step_count = 0
@@ -200,8 +221,6 @@ class CloudNativeDebugEnvironment:
             self.last_action_success = False
             return 0.0, "No edits provided"
 
-        before_validation = self._validation_snapshot()
-
         reward = 0.0
         feedbacks: List[str] = []
         applied_count = 0
@@ -288,17 +307,6 @@ class CloudNativeDebugEnvironment:
 
         reward += self._check_fix_progress()
 
-        after_validation = self._validation_snapshot()
-        if not before_validation["docker_build_valid"] and after_validation["docker_build_valid"]:
-            reward += 0.1
-            feedbacks.append("Docker build validity improved")
-        if not before_validation["workflow_parse_valid"] and after_validation["workflow_parse_valid"]:
-            reward += 0.1
-            feedbacks.append("Workflow parse validity improved")
-        if not before_validation["k8s_valid"] and after_validation["k8s_valid"]:
-            reward += 0.1
-            feedbacks.append("Kubernetes manifest validity improved")
-
         if applied_count == 0:
             self.last_action_success = False
             return max(-0.02, reward - 0.02), "; ".join(feedbacks) or "No edit applied"
@@ -307,30 +315,21 @@ class CloudNativeDebugEnvironment:
         return max(0.0, reward), "; ".join(feedbacks)
 
     def _check_fix_progress(self) -> float:
-                    fixes_applied += 1
-                if fix["type"] == "line_equals":
-                    lines = current_content.split("\n")
-                    line_num = int(fix.get("line", 0))
-                    if 1 <= line_num <= len(lines):
-                        if lines[line_num - 1].strip() == str(fix["expected"]).strip():
-                            fixes_applied += 1
-
-        new_fixed = fixes_applied - self.issues_fixed
         if new_fixed > 0:
-            self.issues_fixed = fixes_applied
             return 0.3 * new_fixed
         return 0.0
```
+
snapshot["docker_run_valid"] = bool(docker_result.get("run_success", False))
|
| 88 |
+
if has_workflow:
|
| 89 |
+
snapshot["workflow_parse_valid"] = bool(workflow_result.get("parse_success", False))
|
| 90 |
+
snapshot["workflow_exec_valid"] = bool(workflow_result.get("execution_success", False))
|
| 91 |
+
if has_k8s:
|
| 92 |
+
snapshot["k8s_valid"] = bool(k8s_result.get("valid", True))
|
| 93 |
+
snapshot["k8s_pod_running"] = k8s_result.get("pod_status", "N/A") == "Running"
|
| 94 |
+
svc = k8s_result.get("service_status", "N/A")
|
| 95 |
+
snapshot["k8s_service_active"] = "active" in svc.lower() or svc == "N/A"
|
| 96 |
+
|
| 97 |
+
return snapshot
|
| 98 |
|
| 99 |
def __init__(self):
|
| 100 |
self.docker_sim = DockerSimulator()
|
|
|
|
| 161 |
)
|
| 162 |
|
| 163 |
self.expected_fixes = scenario["expected_fixes"]
|
| 164 |
+
|
| 165 |
+
# Snapshot the initial broken state from simulators
|
| 166 |
+
self.initial_snapshot = self._validation_snapshot()
|
| 167 |
+
# Count how many checks are initially failing β that's our issues_total
|
| 168 |
+
self.issues_total = sum(1 for v in self.initial_snapshot.values() if not v)
|
| 169 |
+
# Ensure at least 1 issue (the scenario is supposed to be broken)
|
| 170 |
+
self.issues_total = max(1, self.issues_total)
|
| 171 |
self.issues_fixed = 0
|
| 172 |
|
| 173 |
self.step_count = 0
|
|
|
|
| 221 |
self.last_action_success = False
|
| 222 |
return 0.0, "No edits provided"
|
| 223 |
|
|
|
|
|
|
|
| 224 |
reward = 0.0
|
| 225 |
feedbacks: List[str] = []
|
| 226 |
applied_count = 0
|
|
|
|
| 307 |
|
| 308 |
reward += self._check_fix_progress()
|
| 309 |
|
| 310 |
if applied_count == 0:
|
| 311 |
self.last_action_success = False
|
| 312 |
return max(-0.02, reward - 0.02), "; ".join(feedbacks) or "No edit applied"
|
|
|
|
| 315 |
return max(0.0, reward), "; ".join(feedbacks)
|
| 316 |
|
| 317 |
def _check_fix_progress(self) -> float:
|
| 318 |
+
"""Check fix progress by comparing current simulator state against initial broken state.
|
| 319 |
+
|
| 320 |
+
Counts how many simulator checks flipped from fail → pass since reset.
|
| 321 |
+
"""
|
| 322 |
+
current_snapshot = self._validation_snapshot()
|
| 323 |
+
|
| 324 |
+
fixes_now = 0
|
| 325 |
+
for key, initially_broken in self.initial_snapshot.items():
|
| 326 |
+
if not initially_broken and current_snapshot.get(key, False):
|
| 327 |
+
# This check was initially failing and now passes
|
| 328 |
+
fixes_now += 1
|
| 329 |
+
|
| 330 |
+
new_fixed = fixes_now - self.issues_fixed
|
| 331 |
if new_fixed > 0:
|
| 332 |
+
self.issues_fixed = fixes_now
|
| 333 |
return 0.3 * new_fixed
|
| 334 |
return 0.0
|
| 335 |
|
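The reworked `_check_fix_progress` pays out 0.3 per simulator check that flips from failing (at reset) to passing, and never credits the same fix twice. A minimal standalone sketch of that idea, with illustrative names rather than the environment's actual attributes:

```python
def count_new_fixes(initial, current):
    """Count checks that were failing at reset and pass now."""
    return sum(
        1
        for key, ok_at_reset in initial.items()
        if not ok_at_reset and current.get(key, False)
    )


def progress_reward(initial, current, already_credited, per_fix=0.3):
    """Pay per_fix for each newly passing check, crediting each fix once."""
    fixes_now = count_new_fixes(initial, current)
    new_fixed = fixes_now - already_credited
    if new_fixed > 0:
        return per_fix * new_fixed, fixes_now
    return 0.0, already_credited


# Example: the docker build was broken at reset and is fixed now.
initial = {"docker_build_valid": False, "k8s_valid": True}
current = {"docker_build_valid": True, "k8s_valid": True}
reward, credited = progress_reward(initial, current, already_credited=0)
```

Calling `progress_reward` again with the updated `credited` value returns 0.0, which is what prevents reward farming by re-submitting the same fix.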
server/graders/__init__.py
CHANGED
|
@@ -37,8 +37,8 @@ DIFFICULTY_MODIFIERS = {
|
|
| 37 |
TaskDifficulty.HARD: (0.03, 0.7, 0.75),
|
| 38 |
}
|
| 39 |
|
| 40 |
-
SCORE_FLOOR = 0.
|
| 41 |
-
SCORE_CEIL =
|
| 42 |
|
| 43 |
EDIT_ACTION_TYPES = frozenset({
|
| 44 |
"edit_file", "replace_line", "add_line",
|
|
|
|
| 37 |
TaskDifficulty.HARD: (0.03, 0.7, 0.75),
|
| 38 |
}
|
| 39 |
|
| 40 |
+
SCORE_FLOOR = 0.01
|
| 41 |
+
SCORE_CEIL = 0.99
|
| 42 |
|
| 43 |
EDIT_ACTION_TYPES = frozenset({
|
| 44 |
"edit_file", "replace_line", "add_line",
|
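`SCORE_FLOOR = 0.01` and `SCORE_CEIL = 0.99` keep grader scores strictly inside the open interval (0, 1), which is what the `gt=0.0, lt=1.0` constraint on `GraderResult.score` requires. A hedged sketch of the clamp (the grader's actual use of these constants may differ):

```python
SCORE_FLOOR = 0.01
SCORE_CEIL = 0.99


def clamp_score(raw: float) -> float:
    """Clamp a raw grader score into [SCORE_FLOOR, SCORE_CEIL]."""
    return min(SCORE_CEIL, max(SCORE_FLOOR, raw))
```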
server/models.py
CHANGED
|
@@ -122,7 +122,7 @@ class EnvironmentInfo(BaseModel):
|
|
| 122 |
|
| 123 |
class GraderResult(BaseModel):
|
| 124 |
task_id: str
|
| 125 |
-
score: float = Field(...,
|
| 126 |
max_score: float = 1.0
|
| 127 |
breakdown: Dict[str, float] = Field(default_factory=dict)
|
| 128 |
feedback: str = ""
|
|
@@ -170,7 +170,7 @@ class GraderResponse(BaseModel):
|
|
| 170 |
|
| 171 |
class BaselineRequest(BaseModel):
|
| 172 |
task_id: Optional[str] = None
|
| 173 |
-
num_episodes: int =
|
| 174 |
|
| 175 |
|
| 176 |
class BaselineResponse(BaseModel):
|
|
|
|
| 122 |
|
| 123 |
class GraderResult(BaseModel):
|
| 124 |
task_id: str
|
| 125 |
+
score: float = Field(..., gt=0.0, lt=1.0)
|
| 126 |
max_score: float = 1.0
|
| 127 |
breakdown: Dict[str, float] = Field(default_factory=dict)
|
| 128 |
feedback: str = ""
|
|
|
|
| 170 |
|
| 171 |
class BaselineRequest(BaseModel):
|
| 172 |
task_id: Optional[str] = None
|
| 173 |
+
num_episodes: Optional[int] = None # None = run ALL scenarios
|
| 174 |
|
| 175 |
|
| 176 |
class BaselineResponse(BaseModel):
|
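The `Field(..., gt=0.0, lt=1.0)` constraint makes Pydantic reject any score at or outside the bounds when a `GraderResult` is constructed. The same invariant can be sketched dependency-free with a dataclass (this is an illustration of the validation behavior, not the server's actual model class):

```python
from dataclasses import dataclass


@dataclass
class GraderResult:
    task_id: str
    score: float
    max_score: float = 1.0

    def __post_init__(self):
        # Mirror Field(..., gt=0.0, lt=1.0): score must lie in the open (0, 1)
        if not (0.0 < self.score < 1.0):
            raise ValueError(f"score must be in (0, 1), got {self.score}")
```

With the floor/ceiling constants set to 0.01/0.99, clamped scores always satisfy this check.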
server/simulators/docker_simulator.py
CHANGED
|
@@ -39,7 +39,11 @@ class DockerSimulator:
|
|
| 39 |
if "*" in source:
|
| 40 |
prefix = source.replace("*", "")
|
| 41 |
return any(path.startswith(prefix) for path in context_files)
|
| 42 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
|
| 44 |
def _join_continuation_lines(self, lines: List[str]) -> List[str]:
|
| 45 |
"""Join lines ending with backslash into single logical lines."""
|
|
|
|
| 39 |
if "*" in source:
|
| 40 |
prefix = source.replace("*", "")
|
| 41 |
return any(path.startswith(prefix) for path in context_files)
|
| 42 |
+
# Check exact match or directory prefix match (e.g. "dist/" matches "dist/index.html")
|
| 43 |
+
clean = source.rstrip("/")
|
| 44 |
+
if clean in context_files:
|
| 45 |
+
return True
|
| 46 |
+
return any(path.startswith(clean + "/") or path == clean for path in context_files)
|
| 47 |
|
| 48 |
def _join_continuation_lines(self, lines: List[str]) -> List[str]:
|
| 49 |
"""Join lines ending with backslash into single logical lines."""
|
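The new COPY-source check accepts an exact file match, a directory prefix match (so "dist/" matches "dist/index.html"), or a wildcard prefix. A self-contained sketch of that matching logic, assuming the build context is a set of relative paths:

```python
def copy_source_exists(source: str, context_files: set) -> bool:
    """Approximate Docker's COPY source check against a build context."""
    if "*" in source:
        # Wildcard: match any context path sharing the non-wildcard prefix
        prefix = source.replace("*", "")
        return any(path.startswith(prefix) for path in context_files)
    clean = source.rstrip("/")
    if clean in context_files:
        return True
    # Directory match: "dist/" matches anything under dist/
    return any(path.startswith(clean + "/") for path in context_files)
```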
server/simulators/k8s_simulator.py
CHANGED
|
@@ -312,7 +312,7 @@ class KubernetesSimulator:
|
|
| 312 |
svc_ports = svc.get("spec", {}).get("ports", [])
|
| 313 |
container_ports = []
|
| 314 |
for c in dep.get("spec", {}).get("template", {}).get("spec", {}).get("containers", []):
|
| 315 |
-
for p in c.get("ports"
|
| 316 |
container_ports.append(p.get("containerPort"))
|
| 317 |
|
| 318 |
for sp in svc_ports:
|
|
|
|
| 312 |
svc_ports = svc.get("spec", {}).get("ports", [])
|
| 313 |
container_ports = []
|
| 314 |
for c in dep.get("spec", {}).get("template", {}).get("spec", {}).get("containers", []):
|
| 315 |
+
for p in (c.get("ports") or []):
|
| 316 |
container_ports.append(p.get("containerPort"))
|
| 317 |
|
| 318 |
for sp in svc_ports:
|
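The one-character fix above, `for p in (c.get("ports") or []):`, guards against containers that declare `ports:` with no value; YAML parses that as `None`, and iterating `None` raises a `TypeError`. A minimal sketch of the pattern:

```python
def container_ports(containers):
    """Collect containerPort values, tolerating missing or null 'ports' keys."""
    ports = []
    for c in containers:
        # c.get("ports") may be None (key present but empty) or absent entirely
        for p in (c.get("ports") or []):
            ports.append(p.get("containerPort"))
    return ports
```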
server/simulators/workflow_simulator.py
CHANGED
|
@@ -294,6 +294,135 @@ class WorkflowSimulator:
|
|
| 294 |
"exec_error": f"{var} is empty - secret not available in shell environment. Map it via env block.",
|
| 295 |
}
|
| 296 |
|
| 297 |
# node version vs package.json engines
|
| 298 |
for job_name, job in jobs.items():
|
| 299 |
if not isinstance(job, dict):
|
|
|
|
| 294 |
"exec_error": f"{var} is empty - secret not available in shell environment. Map it via env block.",
|
| 295 |
}
|
| 296 |
|
| 297 |
+
# build-push-action without load:true when image is used locally after
|
| 298 |
+
for job_name, job in jobs.items():
|
| 299 |
+
if not isinstance(job, dict):
|
| 300 |
+
continue
|
| 301 |
+
steps = job.get("steps", [])
|
| 302 |
+
if not isinstance(steps, list):
|
| 303 |
+
continue
|
| 304 |
+
build_push_idx = None
|
| 305 |
+
build_push_has_load = False
|
| 306 |
+
for idx, step in enumerate(steps):
|
| 307 |
+
if not isinstance(step, dict):
|
| 308 |
+
continue
|
| 309 |
+
uses = step.get("uses", "")
|
| 310 |
+
if isinstance(uses, str) and "docker/build-push-action" in uses:
|
| 311 |
+
build_push_idx = idx
|
| 312 |
+
with_block = step.get("with", {})
|
| 313 |
+
if isinstance(with_block, dict):
|
| 314 |
+
push_val = str(with_block.get("push", "")).lower()
|
| 315 |
+
load_val = str(with_block.get("load", "")).lower()
|
| 316 |
+
build_push_has_load = load_val == "true"
|
| 317 |
+
# Only flag if push is false (local use intended)
|
| 318 |
+
if push_val == "false" and not build_push_has_load:
|
| 319 |
+
# Check if a later step uses docker run
|
| 320 |
+
for later in steps[idx + 1:]:
|
| 321 |
+
if not isinstance(later, dict):
|
| 322 |
+
continue
|
| 323 |
+
run_cmd = later.get("run", "")
|
| 324 |
+
if isinstance(run_cmd, str) and "docker run" in run_cmd:
|
| 325 |
+
return {
|
| 326 |
+
"parse_success": True,
|
| 327 |
+
"execution_success": False,
|
| 328 |
+
"exec_error": (
|
| 329 |
+
"build-push-action with Buildx does not load images into the local daemon by default - "
|
| 330 |
+
"add 'load: true' to make the image available for docker run"
|
| 331 |
+
),
|
| 332 |
+
}
|
| 333 |
+
|
| 334 |
+
# registry mismatch between build tag and push command
|
| 335 |
+
for job_name, job in jobs.items():
|
| 336 |
+
if not isinstance(job, dict):
|
| 337 |
+
continue
|
| 338 |
+
steps = job.get("steps", [])
|
| 339 |
+
if not isinstance(steps, list):
|
| 340 |
+
continue
|
| 341 |
+
build_registry = None
|
| 342 |
+
for step in steps:
|
| 343 |
+
if not isinstance(step, dict):
|
| 344 |
+
continue
|
| 345 |
+
run_cmd = step.get("run", "")
|
| 346 |
+
if not isinstance(run_cmd, str):
|
| 347 |
+
continue
|
| 348 |
+
# Extract registry from docker build -t
|
| 349 |
+
build_match = re.search(r'docker build\s+.*-t\s+(\S+)', run_cmd)
|
| 350 |
+
if build_match:
|
| 351 |
+
tag = build_match.group(1)
|
| 352 |
+
if "ghcr.io" in tag:
|
| 353 |
+
build_registry = "ghcr.io"
|
| 354 |
+
elif "docker.io" in tag or "/" in tag:
|
| 355 |
+
# docker.io is default for user/image format
|
| 356 |
+
build_registry = tag.split("/")[0] if "." in tag.split("/")[0] else "docker.io"
|
| 357 |
+
push_match = re.search(r'docker push\s+(\S+)', run_cmd)
|
| 358 |
+
if push_match and build_registry:
|
| 359 |
+
push_tag = push_match.group(1)
|
| 360 |
+
if "ghcr.io" in push_tag:
|
| 361 |
+
push_registry = "ghcr.io"
|
| 362 |
+
elif "docker.io" in push_tag:
|
| 363 |
+
push_registry = "docker.io"
|
| 364 |
+
else:
|
| 365 |
+
push_registry = push_tag.split("/")[0] if "." in push_tag.split("/")[0] else "docker.io"
|
| 366 |
+
if build_registry != push_registry:
|
| 367 |
+
return {
|
| 368 |
+
"parse_success": True,
|
| 369 |
+
"execution_success": False,
|
| 370 |
+
"exec_error": (
|
| 371 |
+
f"Registry mismatch: image built with {build_registry} tag "
|
| 372 |
+
f"but push targets {push_registry}"
|
| 373 |
+
),
|
| 374 |
+
}
|
| 375 |
+
|
| 376 |
+
# docker tag referencing non-existent image tag
|
| 377 |
+
for job_name, job in jobs.items():
|
| 378 |
+
if not isinstance(job, dict):
|
| 379 |
+
continue
|
| 380 |
+
steps = job.get("steps", [])
|
| 381 |
+
if not isinstance(steps, list):
|
| 382 |
+
continue
|
| 383 |
+
built_tags = set()
|
| 384 |
+
for step in steps:
|
| 385 |
+
if not isinstance(step, dict):
|
| 386 |
+
continue
|
| 387 |
+
run_cmd = step.get("run", "")
|
| 388 |
+
if not isinstance(run_cmd, str):
|
| 389 |
+
continue
|
| 390 |
+
# Collect tags from docker build -t
|
| 391 |
+
for m in re.finditer(r'docker build\s+.*-t\s+(\S+)', run_cmd):
|
| 392 |
+
built_tags.add(m.group(1))
|
| 393 |
+
# Check docker tag source exists
|
| 394 |
+
tag_match = re.search(r'docker tag\s+(\S+)\s+(\S+)', run_cmd)
|
| 395 |
+
if tag_match:
|
| 396 |
+
source = tag_match.group(1)
|
| 397 |
+
# If source contains ${{ it's a template - compare the template expression
|
| 398 |
+
if source not in built_tags and "${{" not in source:
|
| 399 |
+
return {
|
| 400 |
+
"parse_success": True,
|
| 401 |
+
"execution_success": False,
|
| 402 |
+
"exec_error": f"No such image: {source} - docker tag source does not match any built image",
|
| 403 |
+
}
|
| 404 |
+
# Check if source uses a different tag template than what was built
|
| 405 |
+
if "${{" in source:
|
| 406 |
+
# Normalize: extract the expression
|
| 407 |
+
source_expr = re.search(r'\$\{\{(.+?)\}\}', source)
|
| 408 |
+
if source_expr:
|
| 409 |
+
source_key = source_expr.group(1).strip()
|
| 410 |
+
found_matching = False
|
| 411 |
+
for bt in built_tags:
|
| 412 |
+
bt_expr = re.search(r'\$\{\{(.+?)\}\}', bt)
|
| 413 |
+
if bt_expr and bt_expr.group(1).strip() == source_key:
|
| 414 |
+
found_matching = True
|
| 415 |
+
break
|
| 416 |
+
# Also check if the base image name matches
|
| 417 |
+
source_base = source.split(":")[0] if ":" in source else source
|
| 418 |
+
built_bases = {bt.split(":")[0] if ":" in bt else bt for bt in built_tags}
|
| 419 |
+
if not found_matching and source_base in built_bases:
|
| 420 |
+
return {
|
| 421 |
+
"parse_success": True,
|
| 422 |
+
"execution_success": False,
|
| 423 |
+
"exec_error": f"No such image: docker tag source tag does not match any built image tag",
|
| 424 |
+
}
|
| 425 |
+
|
| 426 |
# node version vs package.json engines
|
| 427 |
for job_name, job in jobs.items():
|
| 428 |
if not isinstance(job, dict):
|
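The registry-mismatch check above pulls image tags out of `docker build -t ...` and `docker push ...` run commands with regexes, then compares registry prefixes (treating a first path segment without a dot as the implicit docker.io). A compact standalone sketch of that comparison:

```python
import re


def extract_registry(tag: str) -> str:
    """Registry is the first path segment if it contains a dot, else docker.io."""
    first = tag.split("/")[0]
    return first if "." in first else "docker.io"


def find_registry_mismatch(run_cmd: str):
    """Return (build_registry, push_registry) if they differ, else None."""
    build = re.search(r'docker build\s+.*-t\s+(\S+)', run_cmd)
    push = re.search(r'docker push\s+(\S+)', run_cmd)
    if build and push:
        b = extract_registry(build.group(1))
        p = extract_registry(push.group(1))
        if b != p:
            return b, p
    return None
```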
server/tasks/k8s_networking.py
CHANGED
|
@@ -81,7 +81,7 @@ class K8sNetworkingTask(BaseTask):
|
|
| 81 |
"api-7f8d9c6b5-y3l0n 1/1 Running app=api-server\n"
|
| 82 |
"api-7f8d9c6b5-z4m1o 1/1 Running app=api-server\n"
|
| 83 |
"\n"
|
| 84 |
-
"
|
| 85 |
),
|
| 86 |
},
|
| 87 |
"expected_fixes": [
|
|
@@ -153,7 +153,7 @@ class K8sNetworkingTask(BaseTask):
|
|
| 153 |
"$ kubectl exec -it test-pod -- wget -qO- http://10.244.0.5:3000\n"
|
| 154 |
"<!DOCTYPE html><html>...</html>\n"
|
| 155 |
"\n"
|
| 156 |
-
"
|
| 157 |
),
|
| 158 |
},
|
| 159 |
"expected_fixes": [
|
|
@@ -249,7 +249,7 @@ class K8sNetworkingTask(BaseTask):
|
|
| 249 |
"NAME TYPE CLUSTER-IP PORT(S)\n"
|
| 250 |
"api-service ClusterIP 10.96.0.10 80/TCP\n"
|
| 251 |
"\n"
|
| 252 |
-
"
|
| 253 |
),
|
| 254 |
},
|
| 255 |
"expected_fixes": [
|
|
|
|
| 81 |
"api-7f8d9c6b5-y3l0n 1/1 Running app=api-server\n"
|
| 82 |
"api-7f8d9c6b5-z4m1o 1/1 Running app=api-server\n"
|
| 83 |
"\n"
|
| 84 |
+
"Hint: Compare the Service selector with the pod labels shown above."
|
| 85 |
),
|
| 86 |
},
|
| 87 |
"expected_fixes": [
|
|
|
|
| 153 |
"$ kubectl exec -it test-pod -- wget -qO- http://10.244.0.5:3000\n"
|
| 154 |
"<!DOCTYPE html><html>...</html>\n"
|
| 155 |
"\n"
|
| 156 |
+
"Hint: The container responds on a different port than the Service expects."
|
| 157 |
),
|
| 158 |
},
|
| 159 |
"expected_fixes": [
|
|
|
|
| 249 |
"NAME TYPE CLUSTER-IP PORT(S)\n"
|
| 250 |
"api-service ClusterIP 10.96.0.10 80/TCP\n"
|
| 251 |
"\n"
|
| 252 |
+
"Hint: The Ingress backend service name does not match any existing Service."
|
| 253 |
),
|
| 254 |
},
|
| 255 |
"expected_fixes": [
|
server/tasks/pipeline_build_deploy.py
CHANGED
|
@@ -16,15 +16,15 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 16 |
AVAILABLE_SECRETS = ["GITHUB_TOKEN", "DOCKER_USERNAME", "DOCKER_PASSWORD"]
|
| 17 |
|
| 18 |
SCENARIOS = [
|
| 19 |
-
# Scenario 1:
|
| 20 |
{
|
| 21 |
-
"id": "
|
| 22 |
"files": [
|
| 23 |
{
|
| 24 |
"path": ".github/workflows/deploy.yml",
|
| 25 |
"type": "workflow",
|
| 26 |
"content": (
|
| 27 |
-
"name: Build and Push
|
| 28 |
"on:\n"
|
| 29 |
" push:\n"
|
| 30 |
" branches: [main]\n"
|
|
@@ -32,17 +32,19 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 32 |
"jobs:\n"
|
| 33 |
" build:\n"
|
| 34 |
" runs-on: ubuntu-latest\n"
|
| 35 |
" steps:\n"
|
| 36 |
" - uses: actions/checkout@v4\n"
|
| 37 |
"\n"
|
| 38 |
" - name: Login to GHCR\n"
|
| 39 |
-
" run: echo $GITHUB_TOKEN | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
|
| 40 |
"\n"
|
| 41 |
" - name: Build image\n"
|
| 42 |
" run: docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .\n"
|
| 43 |
"\n"
|
| 44 |
" - name: Push image\n"
|
| 45 |
-
" run: docker push
|
| 46 |
),
|
| 47 |
},
|
| 48 |
{
|
|
@@ -67,23 +69,23 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 67 |
"error": {
|
| 68 |
"phase": "pipeline_build",
|
| 69 |
"message": (
|
| 70 |
-
"Run: Build and Push
|
| 71 |
"\n"
|
| 72 |
-
"Step:
|
| 73 |
-
"
|
| 74 |
-
"Error:
|
| 75 |
"\n"
|
| 76 |
-
"The
|
| 77 |
),
|
| 78 |
"exit_code": 1,
|
| 79 |
-
"failed_step": "
|
| 80 |
},
|
| 81 |
"expected_fixes": [
|
| 82 |
{
|
| 83 |
"file": ".github/workflows/deploy.yml",
|
| 84 |
"type": "contains",
|
| 85 |
-
"expected": "
|
| 86 |
-
"hint": "The
|
| 87 |
}
|
| 88 |
],
|
| 89 |
},
|
|
@@ -161,18 +163,18 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 161 |
],
|
| 162 |
},
|
| 163 |
|
| 164 |
-
# Scenario 3:
|
| 165 |
{
|
| 166 |
-
"id": "
|
| 167 |
"files": [
|
| 168 |
{
|
| 169 |
"path": ".github/workflows/publish.yml",
|
| 170 |
"type": "workflow",
|
| 171 |
"content": (
|
| 172 |
-
"name: Publish
|
| 173 |
"on:\n"
|
| 174 |
-
"
|
| 175 |
-
"
|
| 176 |
"\n"
|
| 177 |
"jobs:\n"
|
| 178 |
" publish:\n"
|
|
@@ -180,14 +182,22 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 180 |
" steps:\n"
|
| 181 |
" - uses: actions/checkout@v4\n"
|
| 182 |
"\n"
|
| 183 |
-
" - name: Login to
|
| 184 |
-
" run: echo ${{ secrets.
|
| 185 |
"\n"
|
| 186 |
" - name: Build\n"
|
| 187 |
-
" run: docker build -t
|
| 188 |
"\n"
|
| 189 |
" - name: Push\n"
|
| 190 |
-
" run:
|
| 191 |
),
|
| 192 |
},
|
| 193 |
{
|
|
@@ -196,34 +206,39 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 196 |
"content": (
|
| 197 |
"FROM python:3.11-slim\n"
|
| 198 |
"WORKDIR /app\n"
|
| 199 |
"COPY . .\n"
|
| 200 |
'CMD ["python", "app.py"]\n'
|
| 201 |
),
|
| 202 |
},
|
| 203 |
],
|
| 204 |
"error": {
|
| 205 |
"phase": "pipeline_build",
|
| 206 |
"message": (
|
| 207 |
-
"Run: Publish
|
| 208 |
"\n"
|
| 209 |
-
"Step:
|
| 210 |
-
"Step:
|
| 211 |
-
"Step:
|
| 212 |
-
"Error:
|
| 213 |
-
"Error: GITHUB_TOKEN does not have packages:write permission\n"
|
| 214 |
"\n"
|
| 215 |
-
"The
|
| 216 |
-
"Add a permissions block to the job."
|
| 217 |
),
|
| 218 |
"exit_code": 1,
|
| 219 |
-
"failed_step": "
|
| 220 |
},
|
| 221 |
"expected_fixes": [
|
| 222 |
{
|
| 223 |
"file": ".github/workflows/publish.yml",
|
| 224 |
"type": "contains",
|
| 225 |
-
"expected": "
|
| 226 |
-
"hint": "
|
| 227 |
}
|
| 228 |
],
|
| 229 |
},
|
|
@@ -289,15 +304,15 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 289 |
],
|
| 290 |
},
|
| 291 |
|
| 292 |
-
# Scenario 5:
|
| 293 |
{
|
| 294 |
-
"id": "
|
| 295 |
"files": [
|
| 296 |
{
|
| 297 |
"path": ".github/workflows/build.yml",
|
| 298 |
"type": "workflow",
|
| 299 |
"content": (
|
| 300 |
-
"name: Build
|
| 301 |
"on:\n"
|
| 302 |
" push:\n"
|
| 303 |
" branches: [main]\n"
|
|
@@ -308,53 +323,54 @@ class PipelineBuildDeployTask(BaseTask):
|
|
| 308 |
" steps:\n"
|
| 309 |
" - uses: actions/checkout@v4\n"
|
| 310 |
"\n"
|
| 311 |
-
" - name: Build image\n"
|
| 312 |
-
"
|
| 313 |
),
|
| 314 |
},
|
| 315 |
{
|
| 316 |
-
"path": "Dockerfile",
|
| 317 |
"type": "dockerfile",
|
| 318 |
"content": (
|
| 319 |
-
"FROM
|
| 320 |
"WORKDIR /app\n"
|
| 321 |
-
"COPY
|
| 322 |
-
"RUN
|
| 323 |
"COPY . .\n"
|
| 324 |
-
"
|
| 325 |
-
"\n
|
| 326 |
-
"FROM nginx:alpine\n"
|
| 327 |
-
"COPY --from=builder /app/dist /usr/share/nginx/html\n"
|
| 328 |
-
"EXPOSE 80\n"
|
| 329 |
-
'CMD ["nginx", "-g", "daemon off;"]\n'
|
| 330 |
),
|
| 331 |
},
|
| 332 |
{
|
| 333 |
-
"path": "
|
| 334 |
-
"type": "
|
| 335 |
-
"content":
|
| 336 |
},
|
| 337 |
],
|
| 338 |
"error": {
|
| 339 |
"phase": "pipeline_build",
|
| 340 |
"message": (
|
| 341 |
-
"Run: Build
|
| 342 |
"\n"
|
| 343 |
-
"Step: Build image ✗\n"
|
| 344 |
-
"Error:
|
|
|
|
| 345 |
"\n"
|
| 346 |
-
"
|
| 347 |
-
"The COPY --from=builder path is wrong."
|
| 348 |
),
|
| 349 |
"exit_code": 1,
|
| 350 |
-
"failed_step": "Build image",
|
| 351 |
},
|
| 352 |
"expected_fixes": [
|
| 353 |
{
|
| 354 |
-
"file": "
|
| 355 |
"type": "contains",
|
| 356 |
-
"expected": "
|
| 357 |
-
"hint": "
|
| 358 |
}
|
| 359 |
],
|
| 360 |
},
|
|
|
|
| 16 |
AVAILABLE_SECRETS = ["GITHUB_TOKEN", "DOCKER_USERNAME", "DOCKER_PASSWORD"]
|
| 17 |
|
| 18 |
SCENARIOS = [
|
| 19 |
+
# Scenario 1: Registry mismatch β build tags ghcr.io but push targets docker.io
|
| 20 |
{
|
| 21 |
+
"id": "registry_mismatch",
|
| 22 |
"files": [
|
| 23 |
{
|
| 24 |
"path": ".github/workflows/deploy.yml",
|
| 25 |
"type": "workflow",
|
| 26 |
"content": (
|
| 27 |
+
"name: Build and Push\n"
|
| 28 |
"on:\n"
|
| 29 |
" push:\n"
|
| 30 |
" branches: [main]\n"
|
|
|
|
| 32 |
"jobs:\n"
|
| 33 |
" build:\n"
|
| 34 |
" runs-on: ubuntu-latest\n"
|
| 35 |
+
" permissions:\n"
|
| 36 |
+
" packages: write\n"
|
| 37 |
" steps:\n"
|
| 38 |
" - uses: actions/checkout@v4\n"
|
| 39 |
"\n"
|
| 40 |
" - name: Login to GHCR\n"
|
| 41 |
+
" run: echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
|
| 42 |
"\n"
|
| 43 |
" - name: Build image\n"
|
| 44 |
" run: docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .\n"
|
| 45 |
"\n"
|
| 46 |
" - name: Push image\n"
|
| 47 |
+
" run: docker push docker.io/${{ github.repository }}:${{ github.sha }}\n"
|
| 48 |
),
|
| 49 |
},
|
| 50 |
{
|
|
|
|
| 69 |
"error": {
|
| 70 |
"phase": "pipeline_build",
|
| 71 |
"message": (
|
| 72 |
+
"Run: Build and Push\n"
|
| 73 |
"\n"
|
| 74 |
+
"Step: Build image ✓\n"
|
| 75 |
+
"Step: Push image ✗\n"
|
| 76 |
+
"Error: An image does not exist locally with the tag: docker.io/<repo>:<sha>\n"
|
| 77 |
"\n"
|
| 78 |
+
"The image was built with a ghcr.io tag but the push targets docker.io."
|
| 79 |
),
|
| 80 |
"exit_code": 1,
|
| 81 |
+
"failed_step": "Push image",
|
| 82 |
},
|
| 83 |
"expected_fixes": [
|
| 84 |
{
|
| 85 |
"file": ".github/workflows/deploy.yml",
|
| 86 |
"type": "contains",
|
| 87 |
+
"expected": "docker push ghcr.io/",
|
| 88 |
+
"hint": "The push command targets docker.io but the image was tagged with ghcr.io - use the same registry",
|
| 89 |
}
|
| 90 |
],
|
| 91 |
},
|
|
|
|
| 163 |
],
|
| 164 |
},
|
| 165 |
|
| 166 |
+
# Scenario 3: Build and push use different tagging strategies (sha vs latest)
|
| 167 |
{
|
| 168 |
+
"id": "inconsistent_tagging",
|
| 169 |
"files": [
|
| 170 |
{
|
| 171 |
"path": ".github/workflows/publish.yml",
|
| 172 |
"type": "workflow",
|
| 173 |
"content": (
|
| 174 |
+
"name: Publish\n"
|
| 175 |
"on:\n"
|
| 176 |
+
" push:\n"
|
| 177 |
+
" branches: [main]\n"
|
| 178 |
"\n"
|
| 179 |
"jobs:\n"
|
| 180 |
" publish:\n"
|
|
|
|
| 182 |
" steps:\n"
|
| 183 |
" - uses: actions/checkout@v4\n"
|
| 184 |
"\n"
|
| 185 |
+
" - name: Login to DockerHub\n"
|
| 186 |
+
" run: echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin\n"
|
| 187 |
"\n"
|
| 188 |
" - name: Build\n"
|
| 189 |
+
" run: docker build -t myuser/api:${{ github.sha }} .\n"
|
| 190 |
+
"\n"
|
| 191 |
+
" - name: Test\n"
|
| 192 |
+
" run: docker run myuser/api:${{ github.sha }} python -m pytest\n"
|
| 193 |
+
"\n"
|
| 194 |
+
" - name: Tag latest\n"
|
| 195 |
+
" run: docker tag myuser/api:latest myuser/api:stable\n"
|
| 196 |
"\n"
|
| 197 |
" - name: Push\n"
|
| 198 |
+
" run: |\n"
|
| 199 |
+
" docker push myuser/api:${{ github.sha }}\n"
|
| 200 |
+
" docker push myuser/api:stable\n"
|
| 201 |
),
|
| 202 |
},
|
| 203 |
{
|
|
|
|
| 206 |
"content": (
|
| 207 |
"FROM python:3.11-slim\n"
|
| 208 |
"WORKDIR /app\n"
|
| 209 |
+
"COPY requirements.txt .\n"
|
| 210 |
+
"RUN pip install -r requirements.txt\n"
|
| 211 |
"COPY . .\n"
|
| 212 |
'CMD ["python", "app.py"]\n'
|
| 213 |
),
|
| 214 |
},
|
| 215 |
+
{
|
| 216 |
+
"path": "requirements.txt",
|
| 217 |
+
"type": "requirements",
|
| 218 |
+
"content": "flask==3.0.0\npytest==7.4.0\n",
|
| 219 |
+
},
|
| 220 |
],
|
| 221 |
"error": {
|
| 222 |
"phase": "pipeline_build",
|
| 223 |
"message": (
|
| 224 |
+
"Run: Publish\n"
|
| 225 |
"\n"
|
| 226 |
+
"Step: Build ✓ (myuser/api:<sha>)\n"
|
| 227 |
+
"Step: Test ✓\n"
|
| 228 |
+
"Step: Tag latest ✗\n"
|
| 229 |
+
"Error: No such image: myuser/api:latest\n"
|
|
|
|
| 230 |
"\n"
|
| 231 |
+
"The tag command references 'myuser/api:latest' but no image with that tag exists."
|
|
|
|
| 232 |
),
|
| 233 |
"exit_code": 1,
|
| 234 |
+
"failed_step": "Tag latest",
|
| 235 |
},
|
| 236 |
"expected_fixes": [
|
| 237 |
{
|
| 238 |
"file": ".github/workflows/publish.yml",
|
| 239 |
"type": "contains",
|
| 240 |
+
"expected": "docker tag myuser/api:${{ github.sha }}",
|
| 241 |
+
"hint": "The 'docker tag' source must match the tag used in the build step - use the sha-tagged image as source",
|
| 242 |
}
|
| 243 |
],
|
| 244 |
},
|
|
|
|
| 304 |
],
|
| 305 |
},
|
| 306 |
|
| 307 |
+
# Scenario 5: Dockerfile path wrong in workflow when using subdirectory structure
|
| 308 |
{
|
| 309 |
+
"id": "dockerfile_path_in_subdirectory",
|
| 310 |
"files": [
|
| 311 |
{
|
| 312 |
"path": ".github/workflows/build.yml",
|
| 313 |
"type": "workflow",
|
| 314 |
"content": (
|
| 315 |
+
"name: Build API\n"
|
| 316 |
"on:\n"
|
| 317 |
" push:\n"
|
| 318 |
" branches: [main]\n"
|
|
|
|
| 323 |
" steps:\n"
|
| 324 |
" - uses: actions/checkout@v4\n"
|
| 325 |
"\n"
|
| 326 |
+
" - name: Build API image\n"
|
| 327 |
+
" uses: docker/build-push-action@v5\n"
|
| 328 |
+
" with:\n"
|
| 329 |
+
" context: ./services/api\n"
|
| 330 |
+
" file: ./Dockerfile\n"
|
| 331 |
+
" push: false\n"
|
| 332 |
+
" tags: api:latest\n"
|
| 333 |
),
|
| 334 |
},
|
| 335 |
{
|
| 336 |
+
"path": "services/api/Dockerfile",
|
| 337 |
"type": "dockerfile",
|
| 338 |
"content": (
|
| 339 |
+
"FROM python:3.11-slim\n"
|
| 340 |
"WORKDIR /app\n"
|
| 341 |
+
"COPY requirements.txt .\n"
|
| 342 |
+
"RUN pip install -r requirements.txt\n"
|
| 343 |
"COPY . .\n"
|
| 344 |
+
"EXPOSE 8000\n"
|
| 345 |
+
'CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]\n'
|
|
| 346 |
),
|
| 347 |
},
|
| 348 |
{
|
| 349 |
+
"path": "services/api/requirements.txt",
|
| 350 |
+
"type": "requirements",
|
| 351 |
+
"content": "fastapi==0.104.0\nuvicorn==0.24.0\n",
|
| 352 |
},
|
| 353 |
],
|
| 354 |
"error": {
|
| 355 |
"phase": "pipeline_build",
|
| 356 |
"message": (
|
| 357 |
+
"Run: Build API\n"
|
| 358 |
"\n"
|
| 359 |
+
"Step: Build API image ✗\n"
|
| 360 |
+
"Error: unable to prepare context: unable to evaluate symlinks in Dockerfile path: "
|
| 361 |
+
"lstat /home/runner/work/repo/repo/Dockerfile: no such file or directory\n"
|
| 362 |
"\n"
|
| 363 |
+
"The Dockerfile is not at the repository root."
|
|
|
|
| 364 |
),
|
| 365 |
"exit_code": 1,
|
| 366 |
+
"failed_step": "Build API image",
|
| 367 |
},
|
| 368 |
"expected_fixes": [
|
| 369 |
{
|
| 370 |
+
"file": ".github/workflows/build.yml",
|
| 371 |
"type": "contains",
|
| 372 |
+
"expected": "file: ./services/api/Dockerfile",
|
| 373 |
+
"hint": "The 'file' path must point to where the Dockerfile actually is - ./services/api/Dockerfile, not ./Dockerfile",
|
| 374 |
}
|
| 375 |
],
|
| 376 |
},
|
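Every scenario's `expected_fixes` entry of type "contains" is satisfied once the named file's current content includes the expected substring. A sketch of how a grader might evaluate such entries (simplified; the real grader also handles other fix types such as "line_equals"):

```python
def fixes_satisfied(expected_fixes, files):
    """files maps path -> current text content; returns pass/fail per fix."""
    results = []
    for fix in expected_fixes:
        if fix.get("type") == "contains":
            content = files.get(fix["file"], "")
            results.append(fix["expected"] in content)
    return results


# Example using the copy_missing_source scenario's expected fix
fixes = [{"file": "Dockerfile", "type": "contains", "expected": "COPY build/"}]
ok = fixes_satisfied(fixes, {"Dockerfile": "FROM nginx:alpine\nCOPY build/ /usr/share/nginx/html\n"})
```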
server/tasks/pipeline_full.py
CHANGED
|
@@ -116,8 +116,7 @@ class PipelineFullTask(BaseTask):
|
|
| 116 |
"\n"
|
| 117 |
"---\n"
|
| 118 |
"(If login had succeeded, deployment would also fail with:)\n"
|
| 119 |
-
"Error: Service 'myapp-service' has no endpoints
|
| 120 |
-
"doesn't match any pods (pods have label 'app=myapp')"
|
| 121 |
),
|
| 122 |
},
|
| 123 |
"expected_fixes": [
|
|
@@ -230,9 +229,8 @@ class PipelineFullTask(BaseTask):
|
|
| 230 |
"\n"
|
| 231 |
"---\n"
|
| 232 |
"Additionally:\n"
|
| 233 |
-
"- Dockerfile
|
| 234 |
-
"- K8s
|
| 235 |
-
"(service targetPort also wrong)"
|
| 236 |
),
|
| 237 |
},
|
| 238 |
"expected_fixes": [
|
|
@@ -350,8 +348,8 @@ class PipelineFullTask(BaseTask):
|
|
| 350 |
"\n"
|
| 351 |
"---\n"
|
| 352 |
"Additional issues found:\n"
|
| 353 |
-
"- Dockerfile: pull access denied for
|
| 354 |
-
"- K8s: Pod CrashLoopBackOff with
|
| 355 |
),
|
| 356 |
},
|
| 357 |
"expected_fixes": [
|
|
|
|
| 116 |
"\n"
|
| 117 |
"---\n"
|
| 118 |
"(If login had succeeded, deployment would also fail with:)\n"
|
| 119 |
+
"Error: Service 'myapp-service' has no endpoints"
|
|
|
|
| 120 |
),
|
| 121 |
},
|
| 122 |
"expected_fixes": [
|
|
|
|
| 229 |
"\n"
|
| 230 |
"---\n"
|
| 231 |
"Additionally:\n"
|
| 232 |
+
"- Dockerfile: npm reports module resolution errors at runtime\n"
|
| 233 |
+
"- K8s: Service returns connection refused when accessed"
|
|
|
|
| 234 |
),
|
| 235 |
},
|
| 236 |
"expected_fixes": [
|
|
|
|
| 348 |
"\n"
|
| 349 |
"---\n"
|
| 350 |
"Additional issues found:\n"
|
| 351 |
+
"- Dockerfile: pull access denied for base image - repository does not exist\n"
|
| 352 |
+
"- K8s: Pod in CrashLoopBackOff with exit code 137"
|
| 353 |
),
|
| 354 |
},
|
| 355 |
"expected_fixes": [
|
server/tasks/task_1_build_errors.py
CHANGED

@@ -141,9 +141,9 @@ class DockerfileSyntaxTask(BaseTask):
             ],
         },
 
-        # Scenario 4:
+        # Scenario 4: COPY references a file that doesn't exist in context
         {
-            "id": "
+            "id": "copy_missing_source",
             "files": [
                 {
                     "path": "Dockerfile",
@@ -151,29 +151,30 @@ class DockerfileSyntaxTask(BaseTask):
                     "content": (
                         "FROM nginx:alpine\n"
                         "COPY nginx.conf /etc/nginx/nginx.conf\n"
-                        "COPY
-
+                        "COPY dist/ /usr/share/nginx/html\n"
+                        "EXPOSE 80\n"
                         'CMD ["nginx", "-g", "daemon off;"]'
                     ),
                 },
                 {
-                    "path": "
+                    "path": "build/index.html",
                     "type": "other",
-                    "content": "
+                    "content": "<!DOCTYPE html><html><body>Hello</body></html>",
                 },
             ],
             "error": {
                 "phase": "docker_build",
-                "message": "
+                "message": "COPY failed: file not found in build context: dist/",
                 "exit_code": 1,
-                "
+                "failed_step": "COPY dist/ /usr/share/nginx/html",
+                "line_hint": 3,
             },
             "expected_fixes": [
                 {
                     "file": "Dockerfile",
                     "type": "contains",
-                    "expected": "
-                    "hint": "
+                    "expected": "COPY build/",
+                    "hint": "The build output is in 'build/' not 'dist/' - check the build context files",
                 }
             ],
         },
server/tasks/task_5_ci_docker_integration.py
CHANGED

@@ -75,15 +75,15 @@ class CIDockerIntegrationTask(BaseTask):
             ],
         },
 
-        # Scenario 2:
+        # Scenario 2: build-push-action without load:true, next step can't find image
        {
-            "id": "
+            "id": "missing_load_true",
             "files": [
                 {
                     "path": ".github/workflows/build.yml",
                     "type": "workflow",
                     "content": (
-                        "name: Build and
+                        "name: Build and Test\n"
                         "on: push\n"
                         "\n"
                         "jobs:\n"
@@ -91,52 +91,46 @@ class CIDockerIntegrationTask(BaseTask):
                         "    runs-on: ubuntu-latest\n"
                         "    steps:\n"
                         "      - uses: actions/checkout@v4\n"
-                        "      - name:
-                        "
-                        "      - name: Build\n"
-                        "
-                        "
-                        "
+                        "      - name: Set up Docker Buildx\n"
+                        "        uses: docker/setup-buildx-action@v3\n"
+                        "      - name: Build image\n"
+                        "        uses: docker/build-push-action@v5\n"
+                        "        with:\n"
+                        "          context: .\n"
+                        "          push: false\n"
+                        "          tags: myapp:test\n"
+                        "      - name: Run tests\n"
+                        "        run: docker run myapp:test pytest"
                     ),
                 },
                 {
                     "path": "Dockerfile",
                     "type": "dockerfile",
                     "content": (
-                        "FROM
+                        "FROM python:3.11-slim\n"
                         "WORKDIR /app\n"
-                        "COPY package*.json ./\n"
-                        "RUN npm ci\n"
                         "COPY . .\n"
-                        "
-                        'CMD ["
+                        "RUN pip install pytest\n"
+                        'CMD ["python", "app.py"]'
                     ),
                 },
-                {
-                    "path": "package.json",
-                    "type": "other",
-                    "content": '{"name": "app", "scripts": {"start": "node server.js"}}',
-                },
             ],
             "error": {
-                "phase": "
-                "message":
+                "phase": "docker_build",
+                "message": (
+                    "Unable to find image 'myapp:test' locally. "
+                    "docker: Error response from daemon: pull access denied for myapp."
+                ),
                 "exit_code": 1,
-                "failed_step": "
+                "failed_step": "Run tests",
             },
             "expected_fixes": [
                 {
                     "file": ".github/workflows/build.yml",
                     "type": "contains",
-                    "expected": "
-                    "hint": "
-                }
-                {
-                    "file": ".github/workflows/build.yml",
-                    "type": "contains",
-                    "expected": "DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}",
-                    "hint": "Both Docker credentials must be in the env block",
-                },
+                    "expected": "load: true",
+                    "hint": "build-push-action with Buildx doesn't load images into local Docker daemon by default - add 'load: true'",
+                }
             ],
         },
 
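The `missing_load_true` scenario hinges on a Buildx detail: `docker/build-push-action` builds inside the Buildx builder, so with `push: false` and no `load: true` the tagged image never reaches the runner's local Docker daemon, and the later `docker run myapp:test` step fails with the "Unable to find image" error encoded above. A rough, self-contained illustration of that condition (the helper name is hypothetical, not part of this repo):

```python
# Illustrative check for the failure mode behind "missing_load_true":
# a build-push-action step needs `load: true` (or `push: true`) before a
# subsequent `docker run` on the same runner can see the built tag.
import re

def image_available_locally(workflow_yaml: str) -> bool:
    """True if the built tag should be visible to `docker run` on the runner."""
    if "docker/build-push-action" not in workflow_yaml:
        return True  # plain `docker build` loads straight into the daemon
    return bool(re.search(r"load:\s*true", workflow_yaml))

broken = (
    "uses: docker/build-push-action@v5\n"
    "with:\n  context: .\n  push: false\n  tags: myapp:test\n"
)
fixed = broken + "  load: true\n"
print(image_available_locally(broken))  # False
print(image_available_locally(fixed))   # True
```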