Krishna1107 committed on
Commit
2794920
·
1 Parent(s): 893901a

fixed inference

README.md CHANGED
@@ -42,7 +42,7 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
42
 │ - replace_line: fix a specific line number │
43
 │ - add_line / add_block: insert missing content │
44
 │ - delete_line / delete_block: remove bad content │
45
- │ - request_hint: get a clue (-5% score penalty) │
46
 │ - submit: "I'm done fixing" │
47
 │ │
48
 │ After each action, agent gets: │
@@ -56,6 +56,7 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
56
 │ - Whether ALL issues were fixed (bonus) │
57
 │ - How many steps it took (efficiency) │
58
 │ - How many hints were used (penalty) │

59
 └──────────────────────────────────────────────────────────────┘
60
  ```
61
 
@@ -63,6 +64,8 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
63
 
64
  ## The 10 Tasks (50 Scenarios)
65
 
 
 
66
 ### Task 1: Dockerfile Syntax Errors — Easy
67
 
68
  Simple typos and instruction errors that break `docker build`.
@@ -72,7 +75,7 @@ Simple typos and instruction errors that break `docker build`.
72
 | 1 | `typo_filename` | `COPY requirments.txt .` — misspelled filename | Most common Docker build error on Stack Overflow |
73
 | 2 | `invalid_base_image` | `FROM python:3.9-slimm` — extra 'm' in tag | Happens when copy-pasting image tags |
74
 | 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` — broken line continuation | Formatting multi-line RUN commands is tricky |
75
- | 4 | `invalid_expose` | `EXPOSE "eighty"` — string instead of port number | EXPOSE only accepts numeric ports |
76
  | 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
77
 
78
 ### Task 2: Dockerfile Runtime Errors — Medium
@@ -111,14 +114,14 @@ Secrets exist but aren't wired correctly to the workflow steps.
111
  | 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
112
  | 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
113
 
114
- ### Task 5: CI + Docker Integration — Medium-Hard
115
 
116
  The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
117
 
118
  | # | Scenario | What's Broken | Real-World Context |
119
  |---|----------|---------------|-------------------|
120
  | 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
121
- | 2 | `login_secrets_not_wired` | `docker login` missing `env:` for secrets | "unauthorized: authentication required" |
122
  | 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
123
  | 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
124
  | 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
@@ -149,7 +152,7 @@ Pod crashes and scheduling failures in Kubernetes deployments.
149
 
150
 ### Task 8: Kubernetes Service & Ingress Issues — Hard
151
 
152
- Networking issues where pods run fine but traffic doesn't reach them.
153
 
154
  | # | Scenario | What's Broken | Real-World Context |
155
  |---|----------|---------------|-------------------|
@@ -165,15 +168,15 @@ GHA-to-Docker-to-Registry pipeline failures spanning multiple files.
165
 
166
  | # | Scenario | What's Broken | Real-World Context |
167
  |---|----------|---------------|-------------------|
168
- | 1 | `ghcr_token_not_mapped` | `$GITHUB_TOKEN` shell var not mapped from secrets | GHCR login fails |
169
  | 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
170
- | 3 | `missing_packages_write` | No `permissions: packages: write` for GHCR push | "permission_denied: write_package" |
171
  | 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
172
- | 5 | `multistage_output_mismatch` | `COPY --from=builder /app/dist` but react-scripts outputs to `/app/build` | Wrong output directory |
173
 
174
 ### Task 10: Full Stack Deployment Pipeline — Expert
175
 
176
- Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning.
177
 
178
  | # | Scenario | What's Broken | Real-World Context |
179
  |---|----------|---------------|-------------------|
@@ -185,6 +188,20 @@ Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifest
185
 
186
  ---
187
 
 
 
188
  ## Available Actions
189
 
190
  Each step, the agent chooses exactly one action:
@@ -197,16 +214,16 @@ Each step, the agent chooses exactly one action:
197
  | `delete_line` | Remove a specific line | Removing a bad instruction |
198
  | `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
199
  | `delete_block` | Remove a multi-line block | Removing incorrect sections |
200
- | `request_hint` | Get a clue about what's wrong | Costs -5% on final score — use sparingly |
201
 | `submit` | Declare "I'm done" — triggers final evaluation | When all fixes are applied |
202
 
203
  **Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
204
 
205
  ---
206
 
207
- ## Grading System — How Scores Work
208
 
209
- Scoring is **deterministic** (same actions always produce the same score), **dynamic** (different strategies get different scores), and **difficulty-aware** (harder tasks are graded more generously).
210
 
211
  ### The Formula
212
 
@@ -214,13 +231,13 @@ Scoring is **deterministic** (same actions always produce the same score), **dyn
214
  FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
215
  ```
216
 
217
- Clamped to `[0.0, 1.0]`.
218
 
219
  ### Component Breakdown
220
 
221
  | Component | Weight | Description |
222
  |-----------|--------|-------------|
223
- | Base score | 5% | Participation credit |
224
  | Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
225
  | Complete bonus | 25% | All issues fixed |
226
  | Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
@@ -230,15 +247,31 @@ Clamped to `[0.0, 1.0]`.
230
 
231
  ### Difficulty Modifiers
232
 
233
- The grader adjusts three parameters based on task difficulty:
234
-
235
  | Difficulty | Max Score | Efficiency Decay | Hint Cost |
236
  |------------|-----------|------------------|-----------|
237
  | Easy | 0.90 | 0.03/step (strict) | 4% each |
238
  | Medium | 0.90 | 0.027/step | 4% each |
239
  | Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
240
 
241
- This means: solving a 4-bug expert pipeline in 6 steps scores higher than solving a 1-bug easy task in 3 steps, reflecting the genuine difficulty difference.
 
 
242
 
243
  ---
244
 
@@ -256,7 +289,7 @@ This means: solving a 4-bug expert pipeline in 6 steps scores higher than solvin
256
  | `/info` | GET | Task list with metadata |
257
  | `/tasks` | GET | List all tasks with difficulty levels |
258
  | `/grader` | POST | Grade a trajectory (list of step dicts) |
259
- | `/baseline` | POST | Run built-in heuristic baseline |
260
  | `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
261
 
262
  ### Example: Full Episode via API
@@ -267,7 +300,7 @@ curl -X POST http://localhost:8000/reset \
267
  -H "Content-Type: application/json" \
268
  -d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
269
 
270
- # 2. Fix the memory limit
271
  curl -X POST http://localhost:8000/step \
272
  -H "Content-Type: application/json" \
273
  -d '{
@@ -276,7 +309,7 @@ curl -X POST http://localhost:8000/step \
276
  "edits": [{
277
  "file_path": "k8s/deployment.yaml",
278
  "old_content": "memory: \"64Mi\"",
279
- "new_content": "memory: \"256Mi\""
280
  }]
281
  }
282
  }'
@@ -325,7 +358,7 @@ python inference.py
325
  cloud-native-devops-env/
326
 ├── openenv.yaml # OpenEnv environment specification
327
 ├── inference.py # LLM baseline (OpenAI client + HF router)
328
- ├── baseline_runner.py # Heuristic baseline for /baseline endpoint
329
 ├── Dockerfile # Production container
330
 ├── requirements.txt # Python dependencies
331
 │
@@ -349,9 +382,9 @@ cloud-native-devops-env/
349
 │ ├── graders/
350
 │ │ └── __init__.py # Deterministic trajectory grader
351
 │ └── simulators/
352
- │ ├── docker_simulator.py # 15+ Dockerfile validation rules
353
- │ ├── workflow_simulator.py # 15+ workflow validation rules
354
- │ └── k8s_simulator.py # Kubernetes manifest validator
355
 │
356
 └── tests/
357
 ├── test_endpoints.py # API endpoint tests
@@ -364,22 +397,13 @@ cloud-native-devops-env/
364
  ## Design Decisions
365
 
366
 1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes — the three pillars of modern deployment pipelines.
367
- 2. **Simulated validation (no real Docker/K8s)**: Static analysis rules give deterministic results, fast execution, and no security concerns.
368
  3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
369
  4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
370
- 5. **Exact string matching for edits**: Mirrors real file editing — whitespace matters.
371
- 6. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
 
372
 
373
  ## License
374
 
375
  MIT
376
- title: Cloudnative Devops Debug Env
377
- emoji: 🚀
378
- colorFrom: yellow
379
- colorTo: gray
380
- sdk: docker
381
- pinned: false
382
- short_description: 'Open Env for the Meta x PyTorch x HuggingFace x SST hack '
383
- ---
384
-
385
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
42
 │ - replace_line: fix a specific line number │
43
 │ - add_line / add_block: insert missing content │
44
 │ - delete_line / delete_block: remove bad content │
45
+ │ - request_hint: get a clue (-4% score penalty) │
46
 │ - submit: "I'm done fixing" │
47
 │ │
48
 │ After each action, agent gets: │

56
 │ - Whether ALL issues were fixed (bonus) │
57
 │ - How many steps it took (efficiency) │
58
 │ - How many hints were used (penalty) │
59
+ │ Score range: (0, 1) exclusive — never exactly 0 or 1 │
60
 └──────────────────────────────────────────────────────────────┘
61
  ```
62
 
 
64
 
65
  ## The 10 Tasks (50 Scenarios)
66
 
67
+ Evaluation runs **all 50 scenarios deterministically** across all 10 tasks for reproducible scoring.
68
+
69
 ### Task 1: Dockerfile Syntax Errors — Easy
70
 
71
  Simple typos and instruction errors that break `docker build`.
 
75
 | 1 | `typo_filename` | `COPY requirments.txt .` — misspelled filename | Most common Docker build error on Stack Overflow |
76
 | 2 | `invalid_base_image` | `FROM python:3.9-slimm` — extra 'm' in tag | Happens when copy-pasting image tags |
77
 | 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` — broken line continuation | Formatting multi-line RUN commands is tricky |
78
+ | 4 | `copy_missing_source` | `COPY dist/` but build output is in `build/` | Source directory doesn't exist in build context |
79
  | 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
80
 
81
 ### Task 2: Dockerfile Runtime Errors — Medium
 
114
  | 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
115
  | 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
116
 
117
+ ### Task 5: CI + Docker Integration — Medium
118
 
119
  The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
120
 
121
  | # | Scenario | What's Broken | Real-World Context |
122
  |---|----------|---------------|-------------------|
123
  | 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
124
+ | 2 | `missing_load_true` | `build-push-action` without `load: true` — next step can't find image | Buildx doesn't load into local daemon by default |
125
  | 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
126
  | 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
127
  | 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
 
152
 
153
 ### Task 8: Kubernetes Service & Ingress Issues — Hard
154

155
+ Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague — the agent must diagnose from kubectl output.
156
 
157
  | # | Scenario | What's Broken | Real-World Context |
158
  |---|----------|---------------|-------------------|
 
168
 
169
  | # | Scenario | What's Broken | Real-World Context |
170
  |---|----------|---------------|-------------------|
171
+ | 1 | `registry_mismatch` | Build tags `ghcr.io/...` but push targets `docker.io/...` | Registry URL mismatch between steps |
172
  | 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
173
+ | 3 | `inconsistent_tagging` | `docker tag myuser/api:latest` but image was built as `myuser/api:${{ github.sha }}` | Tag source doesn't exist |
174
  | 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
175
+ | 5 | `dockerfile_path_in_subdirectory` | Workflow points to `./Dockerfile` but it's at `./services/api/Dockerfile` | Monorepo path mismatch |
176
 
177
 ### Task 10: Full Stack Deployment Pipeline — Expert
178

179
+ Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague — the agent must trace root causes from symptoms.
180
 
181
  | # | Scenario | What's Broken | Real-World Context |
182
  |---|----------|---------------|-------------------|
 
188
 
189
  ---
190
 
191
+ ## Fix Validation: Simulator-Based
192
+
193
+ Fixes are validated using **structural simulators**, not string matching. This means:
194
+
195
+ - **Alternative valid fixes are accepted.** Setting memory to `512Mi` instead of `256Mi` still resolves the OOM — the simulator accepts either.
196
+ - **Three independent simulators** run after every edit:
197
+ - **DockerSimulator**: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
198
+ - **WorkflowSimulator**: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
199
+ - **KubernetesSimulator**: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
200
+ - **7 granular checks** are tracked: `docker_build`, `docker_run`, `workflow_parse`, `workflow_exec`, `k8s_valid`, `k8s_pod_running`, `k8s_service_active`
201
+ - Progress = how many checks flip from fail → pass compared to the initial broken state (a sketch of this structural checking follows below)
202
+
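+ A minimal sketch of what "structural" validation means here (illustrative only: the helper `check_memory_limit` and the 128Mi floor are assumptions, not the environment's actual API):
+
+ ```python
+ import re
+
+ # Hypothetical structural check: accept ANY memory limit >= 128Mi,
+ # instead of demanding the exact replacement string "256Mi".
+ _UNIT = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
+
+ def check_memory_limit(manifest_text: str, min_bytes: int = 128 * 2**20) -> bool:
+     m = re.search(r'memory:\s*"?(\d+)(Ki|Mi|Gi)"?', manifest_text)
+     if not m:
+         return False  # no memory limit at all -> check stays failed
+     value, unit = int(m.group(1)), m.group(2)
+     return value * _UNIT[unit] >= min_bytes
+
+ assert check_memory_limit('memory: "256Mi"')      # canonical fix passes
+ assert check_memory_limit('memory: "512Mi"')      # alternative fix also passes
+ assert not check_memory_limit('memory: "64Mi"')   # original bug still fails
+ ```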
203
+ ---
204
+
205
  ## Available Actions
206
 
207
  Each step, the agent chooses exactly one action:
 
214
  | `delete_line` | Remove a specific line | Removing a bad instruction |
215
  | `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
216
  | `delete_block` | Remove a multi-line block | Removing incorrect sections |
217
+ | `request_hint` | Get a clue about what's wrong | Costs -4% on final score — use sparingly |
218
 | `submit` | Declare "I'm done" — triggers final evaluation | When all fixes are applied |
219
 
220
  **Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
221
 
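+ For example, a well-formed `edit_file` payload (a sketch; the four-space indent is an assumption about how the manifest is formatted):
+
+ ```python
+ # old_content must match the file byte-for-byte, including leading spaces;
+ # dropping the indentation would make this edit fail with a -0.02 penalty.
+ action = {
+     "action_type": "edit_file",
+     "edits": [{
+         "file_path": "k8s/deployment.yaml",
+         "old_content": '    memory: "64Mi"',
+         "new_content": '    memory: "256Mi"',
+     }],
+ }
+ ```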
222
  ---
223
 
224
+ ## Grading System
225
 
226
+ Scoring is **deterministic** (same actions always produce the same score), **difficulty-aware** (harder tasks are graded more generously), and strictly in **(0, 1) exclusive** — never exactly 0 or 1. A worked example follows the tables below.
227
 
228
  ### The Formula
229
 
 
231
  FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
232
  ```
233
 
234
+ Clamped to `(0.01, 0.99)`.
235
 
236
  ### Component Breakdown
237
 
238
  | Component | Weight | Description |
239
  |-----------|--------|-------------|
240
+ | Base score | 5% | Participation credit (guarantees score > 0) |
241
  | Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
242
  | Complete bonus | 25% | All issues fixed |
243
  | Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
 
247
 
248
  ### Difficulty Modifiers
249
 
 
 
250
  | Difficulty | Max Score | Efficiency Decay | Hint Cost |
251
  |------------|-----------|------------------|-----------|
252
  | Easy | 0.90 | 0.03/step (strict) | 4% each |
253
  | Medium | 0.90 | 0.027/step | 4% each |
254
  | Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
255
 
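+ A worked example under stated assumptions (the efficiency term's 0.10 starting weight is assumed here; only its per-step decay is documented above):
+
+ ```python
+ # Hypothetical episode: hard task, 4/4 issues fixed in 6 steps, 1 hint, 0 failed edits.
+ base           = 0.05                        # participation credit
+ partial        = 0.35 * (4 / 4)              # proportional to issues fixed
+ complete_bonus = 0.25                        # all issues fixed
+ difficulty     = 0.03                        # full solve on a hard/expert task
+ efficiency     = max(0.0, 0.10 - 6 * 0.021)  # ASSUMED 0.10 start, documented 0.021/step decay
+ hint_penalty   = 1 * 0.03                    # 3% per hint on hard/expert
+
+ score = base + partial + complete_bonus + difficulty + efficiency - hint_penalty
+ score = min(score, 0.93)                     # hard/expert max-score cap
+ score = min(max(score, 0.01), 0.99)          # clamp to (0.01, 0.99)
+ print(round(score, 2))                       # 0.65 under these assumptions
+ ```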
256
+ ---
257
+
258
+ ## Evaluation
259
+
260
+ The evaluation pipeline runs **all 50 scenarios across all 10 tasks** deterministically:
261
+
262
+ ```python
263
+ # Runs all 10 tasks × 5 scenarios = 50 episodes
264
+ results = run_baseline_episodes() # num_episodes=None runs all
265
+
266
+ # Per-episode scores in (0, 1)
267
+ # Aggregate = mean of all 50 scores
268
+ aggregate = sum(r.score for r in results) / len(results)
269
+ ```
270
+
271
+ This ensures:
272
+ - **Reproducibility**: same agent produces same score every time
273
+ - **Complete coverage**: every error pattern is tested
274
+ - **Fair comparison**: all agents face the same 50 scenarios
275
 
276
  ---
277
 
 
289
  | `/info` | GET | Task list with metadata |
290
  | `/tasks` | GET | List all tasks with difficulty levels |
291
 | `/grader` | POST | Grade a trajectory (list of step dicts; see example below) |
292
+ | `/baseline` | POST | Run baseline across all scenarios (optional: `task_id`, `num_episodes`) |
293
  | `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
294
 
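+ For instance, `/grader` can be called directly with a recorded trajectory; the payload shape below mirrors what `inference.py` sends (the concrete reward values are illustrative):
+
+ ```python
+ import requests
+
+ # A two-step trajectory: one successful edit, then submit.
+ trajectory = [
+     {"step": 1, "action": {"action_type": "edit_file", "edits": []},  # edits elided
+      "reward": 0.30, "done": False, "info": {}},
+     {"step": 2, "action": {"action_type": "submit"},
+      "reward": 0.00, "done": True, "info": {}},
+ ]
+ resp = requests.post(
+     "http://localhost:8000/grader",
+     json={"task_id": "k8s_pod_failures", "trajectory": trajectory},
+     timeout=30,
+ )
+ print(resp.json()["result"]["score"])  # deterministic for a given trajectory
+ ```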
295
  ### Example: Full Episode via API
 
300
  -H "Content-Type: application/json" \
301
  -d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
302
 
303
+ # 2. Fix the memory limit (any reasonable value works — simulator validates structurally)
304
  curl -X POST http://localhost:8000/step \
305
  -H "Content-Type: application/json" \
306
  -d '{
 
309
  "edits": [{
310
  "file_path": "k8s/deployment.yaml",
311
  "old_content": "memory: \"64Mi\"",
312
+ "new_content": "memory: \"512Mi\""
313
  }]
314
  }
315
  }'
 
358
  cloud-native-devops-env/
359
 ├── openenv.yaml # OpenEnv environment specification
360
 ├── inference.py # LLM baseline (OpenAI client + HF router)
361
+ ├── baseline_runner.py # Heuristic baseline — runs all 50 scenarios
362
 ├── Dockerfile # Production container
363
 ├── requirements.txt # Python dependencies
364
 │
 
382
 │ ├── graders/
383
 │ │ └── __init__.py # Deterministic trajectory grader
384
 │ └── simulators/
385
+ │ ├── docker_simulator.py # Dockerfile build + runtime validation
386
+ │ ├── workflow_simulator.py # GHA workflow parse + execution validation
387
+ │ └── k8s_simulator.py # K8s manifest + cross-resource validation
388
 │
389
 └── tests/
390
 ├── test_endpoints.py # API endpoint tests
 
397
  ## Design Decisions
398
 
399
 1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes — the three pillars of modern deployment pipelines.
400
+ 2. **Simulator-based validation**: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., `512Mi` and `256Mi` both fix an OOM). Deterministic, fast, no security concerns.
401
  3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
402
  4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
403
+ 5. **Vague error messages in harder tasks**: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
404
+ 6. **Deterministic evaluation**: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
405
+ 7. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
406
 
407
  ## License
408
 
409
  MIT
 
 
 
 
 
 
 
 
 
 
baseline_runner.py CHANGED
@@ -1,6 +1,7 @@
1
  """Heuristic baseline runner for the /baseline endpoint.
2
 
3
  Applies expected_fixes directly to verify the environment + grader work e2e.
 
4
  """
5
 
6
 
@@ -22,6 +23,17 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
22
  break
23
  file_path = fix["file"]
24
  if file_path not in env.current_files:
 
 
25
  continue
26
 
27
  current_content = env.current_files[file_path].content
@@ -50,18 +62,22 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
50
  )],
51
  )
52
  else:
53
- # Find the line that's closest to expected but wrong
54
  best_line = None
55
  best_idx = None
 
56
  for i, line in enumerate(lines):
57
  stripped = line.strip()
58
  exp_stripped = expected.strip()
59
- # Check if this line is a broken version of expected
60
- if (stripped and exp_stripped and
61
- len(set(stripped) & set(exp_stripped)) > len(exp_stripped) * 0.3):
62
- if best_line is None:
63
- best_line = line
64
- best_idx = i
 
 
 
65
 
66
  if best_line is not None:
67
  action = Action(
@@ -115,12 +131,12 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
115
  return run_grader(task_id, env.trajectory)
116
 
117
 
118
- def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int = 1) -> List[GraderResult]:
119
  """Run baseline episodes across tasks.
120
 
121
  Args:
122
  task_id: Specific task to run, or None for all tasks.
123
- num_episodes: Number of episodes per task.
124
 
125
  Returns:
126
  List of GraderResult for each episode.
@@ -137,13 +153,11 @@ def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int = 1)
137
  for tid in task_ids:
138
  task_cls = TASK_REGISTRY[tid]
139
  scenarios = task_cls.SCENARIOS
140
- episodes_run = 0
141
- for scenario in scenarios:
142
- if episodes_run >= num_episodes:
143
  break
144
  env = CloudNativeDebugEnvironment()
145
  result = _heuristic_episode(env, tid, scenario["id"])
146
  results.append(result)
147
- episodes_run += 1
148
 
149
  return results
 
1
  """Heuristic baseline runner for the /baseline endpoint.
2
 
3
  Applies expected_fixes directly to verify the environment + grader work e2e.
4
+ By default runs ALL scenarios of ALL tasks for deterministic, reproducible evaluation.
5
  """
6
 
7
 
 
23
  break
24
  file_path = fix["file"]
25
  if file_path not in env.current_files:
26
+ # For fixes that require creating a new file (e.g. ConfigMap),
27
+ # create it with the expected content
28
+ if fix["type"] == "contains":
29
+ action = Action(
30
+ action_type=ActionType.EDIT_FILE,
31
+ edits=[FileEdit(
32
+ file_path=file_path,
33
+ new_content=fix["expected"],
34
+ )],
35
+ )
36
+ env.step(action)
37
  continue
38
 
39
  current_content = env.current_files[file_path].content
 
62
  )],
63
  )
64
  else:
65
+ # Find the line with the highest character overlap with the expected fix
66
  best_line = None
67
  best_idx = None
68
+ best_score = 0
69
  for i, line in enumerate(lines):
70
  stripped = line.strip()
71
  exp_stripped = expected.strip()
72
+ if not stripped or not exp_stripped:
73
+ continue
74
+ overlap = len(set(stripped) & set(exp_stripped))
75
+ # Use ratio of overlap to max length for scoring
76
+ score = overlap / max(len(exp_stripped), len(stripped))
77
+ if score > 0.5 and score > best_score:
78
+ best_line = line
79
+ best_idx = i
80
+ best_score = score
81
 
82
  if best_line is not None:
83
  action = Action(
 
131
  return run_grader(task_id, env.trajectory)
132
 
133
 
134
+ def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: Optional[int] = None) -> List[GraderResult]:
135
  """Run baseline episodes across tasks.
136
 
137
  Args:
138
  task_id: Specific task to run, or None for all tasks.
139
+ num_episodes: Max scenarios per task. None = run ALL scenarios (default).
140
 
141
  Returns:
142
  List of GraderResult for each episode.
 
153
  for tid in task_ids:
154
  task_cls = TASK_REGISTRY[tid]
155
  scenarios = task_cls.SCENARIOS
156
+ for idx, scenario in enumerate(scenarios):
157
+ if num_episodes is not None and idx >= num_episodes:
 
158
  break
159
  env = CloudNativeDebugEnvironment()
160
  result = _heuristic_episode(env, tid, scenario["id"])
161
  results.append(result)
 
162
 
163
  return results
inference.py CHANGED
@@ -1,13 +1,44 @@
1
- """Baseline inference script for Cloud-Native Debug Environment.
2
-
3
- Uses OpenAI-compatible client to call Llama 3.1 70B via HuggingFace router.
4
- Required by OpenEnv specification.
5
-
6
- Usage:
7
- export API_BASE_URL=https://router.huggingface.co/v1
8
- export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
9
- export HF_TOKEN=your_token_here
10
- python inference.py
 
 
 
11
  """
12
 
13
 
@@ -22,12 +53,14 @@ import requests
22
  from openai import OpenAI
23
 
24
 
25
- API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
26
- MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.1-70B-Instruct")
27
- HF_TOKEN = os.getenv("HF_TOKEN")
28
  ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
29
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
 
30
  MAX_STEPS = 8 # leave 2 steps buffer before env hard-limit of 10
 
31
 
32
  SYSTEM_PROMPT = """You are an expert DevOps engineer debugging cloud-native deployment pipelines.
33
  You will receive broken Dockerfile, GitHub Actions workflow, and/or Kubernetes manifest files along with error messages.
@@ -79,11 +112,37 @@ Rules:
79
  - Always respond with valid JSON only, no markdown fences"""
80
 
81
 
 
 
82
  def create_client() -> OpenAI:
83
  """Create OpenAI-compatible client for HuggingFace router."""
84
  return OpenAI(
85
  base_url=API_BASE_URL,
86
- api_key=HF_TOKEN,
87
  )
88
 
89
 
@@ -188,130 +247,144 @@ def run_episode(client: OpenAI, task_id: Optional[str] = None, scenario_id: Opti
188
  if scenario_id:
189
  reset_payload["scenario_id"] = scenario_id
190
 
191
- reset_resp = env_request("POST", "/reset", reset_payload)
192
- obs = reset_resp["observation"]
193
- info = reset_resp.get("info", {})
194
-
195
- actual_task_id = info.get("task_id", task_id or "unknown")
196
- actual_scenario_id = info.get("scenario_id", scenario_id or "unknown")
197
-
198
- print(f"[START] task_id={actual_task_id} scenario_id={actual_scenario_id}")
199
 
200
- messages = [{"role": "system", "content": SYSTEM_PROMPT}]
201
  trajectory = []
202
- total_steps = 0
 
 
 
203
 
204
- for step_num in range(MAX_STEPS):
205
- user_msg = format_observation(obs)
206
- messages.append({"role": "user", "content": user_msg})
207
-
208
- try:
209
- completion = client.chat.completions.create(
210
- model=MODEL_NAME,
211
- messages=messages,
212
- temperature=0.1,
213
- max_tokens=1024,
 
 
214
  )
215
- llm_text = completion.choices[0].message.content or '{"action": "submit"}'
216
- except Exception as e:
217
- print(f"[STEP] step={step_num + 1} action=error reward=0.00 done=false issues_fixed=0 issues_total=0 error={e}")
218
- llm_text = '{"action": "submit"}'
219
-
220
- messages.append({"role": "assistant", "content": llm_text})
221
-
222
- parsed = parse_llm_response(llm_text)
223
- action = build_action(parsed)
224
-
225
- step_resp = env_request("POST", "/step", {"action": action})
226
- obs = step_resp["observation"]
227
- reward = step_resp.get("reward", 0.0)
228
- done = step_resp.get("done", False)
229
- step_info = step_resp.get("info", {})
230
- total_steps = step_num + 1
231
-
232
- issues_fixed = step_info.get("issues_fixed", 0)
233
- issues_total = step_info.get("issues_total", 0)
234
-
235
- print(f"[STEP] step={total_steps} action={action['action_type']} reward={reward:.2f} done={str(done).lower()} issues_fixed={issues_fixed} issues_total={issues_total}")
236
-
237
- trajectory.append({
238
- "step": total_steps,
239
- "action": action,
240
- "reward": reward,
241
- "done": done,
242
- "info": step_info,
243
- })
244
 
245
- if done:
246
- break
 
 
247
 
248
- # Grade the trajectory
249
- grade_resp = env_request("POST", "/grader", {
250
- "task_id": actual_task_id,
251
- "trajectory": trajectory,
252
- })
253
- result = grade_resp.get("result", {})
254
- score = result.get("score", 0.0)
255
 
256
- print(f"[END] task_id={actual_task_id} scenario_id={actual_scenario_id} score={score:.3f} steps={total_steps}")
257
- return result
258
 
259
 
260
  def run_all_tasks(client: OpenAI) -> Dict[str, float]:
261
- """Run baseline on all tasks and report scores."""
262
- tasks_resp = env_request("GET", "/tasks")
263
- tasks = tasks_resp.get("tasks", [])
 
 
 
264
 
265
  scores: Dict[str, List[float]] = {}
266
 
267
- for task in tasks:
268
- task_id = task["id"]
269
- print(f"\n{'='*60}")
270
- print(f"Task: {task['name']} ({task['difficulty']})")
271
- print(f"{'='*60}")
272
-
273
  task_scores = []
274
- # Run one episode per task for baseline
275
- result = run_episode(client, task_id=task_id)
276
- task_scores.append(result.get("score", 0.0))
 
 
277
  scores[task_id] = task_scores
278
 
279
  # Summary
280
- print(f"\n{'='*60}")
281
- print("BASELINE RESULTS SUMMARY")
282
- print(f"{'='*60}")
283
  avg_scores = {}
284
  for task_id, task_scores in scores.items():
285
  avg = sum(task_scores) / len(task_scores) if task_scores else 0.0
286
  avg_scores[task_id] = avg
287
- print(f" {task_id:40s} {avg:.3f}")
288
 
289
  overall = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0.0
290
- print(f" {'OVERALL':40s} {overall:.3f}")
291
 
292
  return avg_scores
293
 
294
 
295
  def main():
296
  """Entry point for baseline inference."""
297
- print("Cloud-Native Debug Environment - Baseline Inference")
298
- print(f"API: {API_BASE_URL}")
299
- print(f"Model: {MODEL_NAME}")
300
- print(f"Environment: {ENV_URL}")
301
-
302
- if not HF_TOKEN:
303
- print("\nWARNING: HF_TOKEN not set. Set it via: export HF_TOKEN=your_token_here")
304
- print("Continuing anyway (will fail if auth is required)...\n")
305
 
306
  # Verify environment is running
307
  try:
308
  health = env_request("GET", "/health")
309
- print(f"Environment status: {health.get('status', 'unknown')}\n")
310
  except Exception as e:
311
- print(f"\nERROR: Cannot connect to environment at {ENV_URL}")
312
- print(f" {e}")
313
- print("\nStart the server first:")
314
- print(" python -m uvicorn server.app:app --host 0.0.0.0 --port 8000")
315
  sys.exit(1)
316
 
317
  client = create_client()
 
1
+ """
2
+ Inference Script for Cloud-Native Debug Environment
3
+ ===================================
4
+ MANDATORY
5
+ - Before submitting, ensure the following variables are defined in your environment configuration:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
10
+ method
11
+
12
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
13
+ (and should reflect your active inference setup):
14
+ API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
15
+ MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
16
+
17
+ - The inference script must be named `inference.py` and placed in the root directory of the project
18
+ - Participants must use the OpenAI client for all LLM calls, using the variables above
19
+
20
+ STDOUT FORMAT
21
+ - The script must emit exactly three line types to stdout, in this order:
22
+
23
+ [START] task=<task_name> env=<benchmark> model=<model_name>
24
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
25
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
26
+
27
+ Rules:
28
+ - One [START] line at episode begin.
29
+ - One [STEP] line per step, immediately after env.step() returns.
30
+ - One [END] line after the episode completes, always emitted (even on exception).
31
+ - reward and rewards are formatted to 2 decimal places.
32
+ - done and success are lowercase booleans: true or false.
33
+ - error is the raw error string, or null if none.
34
+ - All fields on a single line with no newlines within a line.
35
+ - Each task should return a score in [0, 1]
36
+
37
+ Example:
38
+ [START] task=dockerfile_syntax env=cloud_native_devops model=meta-llama/Llama-3.1-70B-Instruct
39
+ [STEP] step=1 action=edit_file reward=0.30 done=false error=null
40
+ [STEP] step=2 action=submit reward=0.00 done=true error=null
41
+ [END] success=true steps=2 score=0.850 rewards=0.30,0.00
42
  """
43
 
44
 
 
53
  from openai import OpenAI
54
 
55
 
56
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
57
+ MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.1-70B-Instruct"
58
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
59
  ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
60
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
61
+ BENCHMARK = "cloud_native_devops"
62
  MAX_STEPS = 8 # leave 2 steps buffer before env hard-limit of 10
63
+ SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
64
 
65
  SYSTEM_PROMPT = """You are an expert DevOps engineer debugging cloud-native deployment pipelines.
66
  You will receive broken Dockerfile, GitHub Actions workflow, and/or Kubernetes manifest files along with error messages.
 
112
  - Always respond with valid JSON only, no markdown fences"""
113
 
114
 
115
+ # ---------------------------------------------------------------------------
116
+ # Logging helpers (mandatory stdout format)
117
+ # ---------------------------------------------------------------------------
118
+
119
+ def log_start(task: str, env: str, model: str) -> None:
120
+ print(f"[START] task={task} env={env} model={model}", flush=True)
121
+
122
+
123
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
124
+ error_val = error if error else "null"
125
+ done_val = str(done).lower()
126
+ print(
127
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
128
+ flush=True,
129
+ )
130
+
131
+
132
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
133
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
134
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
135
+
136
+
137
+ # ---------------------------------------------------------------------------
138
+ # Client / env helpers
139
+ # ---------------------------------------------------------------------------
140
+
141
  def create_client() -> OpenAI:
142
  """Create OpenAI-compatible client for HuggingFace router."""
143
  return OpenAI(
144
  base_url=API_BASE_URL,
145
+ api_key=API_KEY,
146
  )
147
 
148
 
 
247
  if scenario_id:
248
  reset_payload["scenario_id"] = scenario_id
249
 
250
+ # Best-effort task name for the [START] log line
251
+ target_task = task_id or "random_task"
252
+ log_start(task=target_task, env=BENCHMARK, model=MODEL_NAME)
 
 
 
 
 
253
 
 
254
  trajectory = []
255
+ rewards: List[float] = []
256
+ steps_taken = 0
257
+ score = 0.0
258
+ success = False
259
 
260
+ try:
261
+ reset_resp = env_request("POST", "/reset", reset_payload)
262
+ obs = reset_resp["observation"]
263
+ info = reset_resp.get("info", {})
264
+
265
+ actual_task_id = info.get("task_id", target_task)
266
+ actual_scenario_id = info.get("scenario_id", scenario_id or "unknown")
267
+
268
+ messages = [{"role": "system", "content": SYSTEM_PROMPT}]
269
+ for step_num in range(1, MAX_STEPS + 1):
270
+ user_msg = format_observation(obs)
271
+ messages.append({"role": "user", "content": user_msg})
272
+
273
+ error_msg: Optional[str] = None
274
+
275
+ try:
276
+ completion = client.chat.completions.create(
277
+ model=MODEL_NAME,
278
+ messages=messages,
279
+ temperature=0.1,
280
+ max_tokens=1024,
281
+ )
282
+ llm_text = completion.choices[0].message.content or '{"action": "submit"}'
283
+ except Exception as e:
284
+ error_msg = str(e)
285
+ print(f"[DEBUG] Model request failed: {e}", flush=True)
286
+ llm_text = '{"action": "submit"}'
287
+
288
+ messages.append({"role": "assistant", "content": llm_text})
289
+
290
+ parsed = parse_llm_response(llm_text)
291
+ action = build_action(parsed)
292
+
293
+ step_resp = env_request("POST", "/step", {"action": action})
294
+ obs = step_resp["observation"]
295
+ reward = step_resp.get("reward", 0.0)
296
+ done = step_resp.get("done", False)
297
+ step_info = step_resp.get("info", {})
298
+ steps_taken = step_num
299
+
300
+ rewards.append(reward)
301
+
302
+ log_step(
303
+ step=step_num,
304
+ action=action["action_type"],
305
+ reward=reward,
306
+ done=done,
307
+ error=error_msg,
308
  )
 
 
309
 
310
+ trajectory.append({
311
+ "step": step_num,
312
+ "action": action,
313
+ "reward": reward,
314
+ "done": done,
315
+ "info": step_info,
316
+ })
317
+
318
+ if done:
319
+ break
320
+
321
+ # Grade the trajectory
322
+ grade_resp = env_request("POST", "/grader", {
323
+ "task_id": actual_task_id,
324
+ "trajectory": trajectory,
325
+ })
326
+ result = grade_resp.get("result", {})
327
+ score = result.get("score", 0.0)
328
+ score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
329
+ success = score >= SUCCESS_SCORE_THRESHOLD
330
 
331
+ finally:
332
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
 
 
 
 
 
333
 
334
+ return {"score": score, "success": success, "steps": steps_taken, "rewards": rewards}
 
335
 
336
 
337
  def run_all_tasks(client: OpenAI) -> Dict[str, float]:
338
+ """Run baseline on all tasks (and ALL their scenarios) and report scores."""
339
+ try:
340
+ from server.tasks.task_registry import TASK_REGISTRY
341
+ except ImportError as e:
342
+ print(f"[DEBUG] Could not import TASK_REGISTRY: {e}", flush=True)
343
+ return {}
344
 
345
  scores: Dict[str, List[float]] = {}
346
 
347
+ for task_id, task_cls in TASK_REGISTRY.items():
 
 
 
 
 
348
  task_scores = []
349
+
350
+ # Iterate over all exact scenarios for this task
351
+ scenarios = task_cls.SCENARIOS
352
+ for scenario in scenarios:
353
+ scenario_id = scenario["id"]
354
+ result = run_episode(client, task_id=task_id, scenario_id=scenario_id)
355
+ task_scores.append(result.get("score", 0.0))
356
+
357
  scores[task_id] = task_scores
358
 
359
  # Summary
360
+ print(f"\n[DEBUG] {'='*60}", flush=True)
361
+ print("[DEBUG] BASELINE RESULTS SUMMARY", flush=True)
362
+ print(f"[DEBUG] {'='*60}", flush=True)
363
  avg_scores = {}
364
  for task_id, task_scores in scores.items():
365
  avg = sum(task_scores) / len(task_scores) if task_scores else 0.0
366
  avg_scores[task_id] = avg
367
+ print(f"[DEBUG] {task_id:40s} {avg:.3f}", flush=True)
368
 
369
  overall = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0.0
370
+ print(f"[DEBUG] {'OVERALL':40s} {overall:.3f}", flush=True)
371
 
372
  return avg_scores
373
 
374
 
375
  def main():
376
  """Entry point for baseline inference."""
377
+ if not API_KEY:
378
+ print("[DEBUG] WARNING: HF_TOKEN not set. Set it via: export HF_TOKEN=your_token_here", flush=True)
379
+ print("[DEBUG] Continuing anyway (will fail if auth is required)...", flush=True)
 
 
 
 
 
380
 
381
  # Verify environment is running
382
  try:
383
  health = env_request("GET", "/health")
384
+ print(f"[DEBUG] Environment status: {health.get('status', 'unknown')}", flush=True)
385
  except Exception as e:
386
+ print(f"[DEBUG] Cannot connect to environment at {ENV_URL}: {e}", flush=True)
387
+ print("[DEBUG] Start the server first: python -m uvicorn server.app:app --host 0.0.0.0 --port 8000", flush=True)
 
 
388
  sys.exit(1)
389
 
390
  client = create_client()
sample_scripts/sample_inf_script.py ADDED
@@ -0,0 +1,188 @@
 
 
1
+ """
2
+ Inference Script Example
3
+ ===================================
4
+ MANDATORY
5
+ - Before submitting, ensure the following variables are defined in your environment configuration:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
10
+ method
11
+
12
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
13
+ (and should reflect your active inference setup):
14
+ API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
15
+ MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
16
+
17
+ - The inference script must be named `inference.py` and placed in the root directory of the project
18
+ - Participants must use the OpenAI client for all LLM calls, using the variables above
19
+
20
+ STDOUT FORMAT
21
+ - The script must emit exactly three line types to stdout, in this order:
22
+
23
+ [START] task=<task_name> env=<benchmark> model=<model_name>
24
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
25
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
26
+
27
+ Rules:
28
+ - One [START] line at episode begin.
29
+ - One [STEP] line per step, immediately after env.step() returns.
30
+ - One [END] line after env.close(), always emitted (even on exception).
31
+ - reward and rewards are formatted to 2 decimal places.
32
+ - done and success are lowercase booleans: true or false.
33
+ - error is the raw last_action_error string, or null if none.
34
+ - All fields on a single line with no newlines within a line.
35
+ - Each task should return a score in [0, 1]
36
+
37
+ Example:
38
+ [START] task=click-test env=miniwob model=Qwen3-VL-30B
39
+ [STEP] step=1 action=click('123') reward=0.00 done=false error=null
40
+ [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
41
+ [STEP] step=3 action=click('789') reward=1.00 done=true error=null
42
+ [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
43
+ """
44
+
45
+ import asyncio
46
+ import os
47
+ import textwrap
48
+ from typing import List, Optional
49
+
50
+ from openai import OpenAI
51
+
52
+ from my_env_v4 import MyEnvV4Action, MyEnvV4Env
53
+ IMAGE_NAME = os.getenv("IMAGE_NAME") # If you are using docker image
54
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
55
+
56
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
57
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
58
+ TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
59
+ BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
60
+ MAX_STEPS = 8
61
+ TEMPERATURE = 0.7
62
+ MAX_TOKENS = 150
63
+ SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
64
+
65
+ # Max possible reward: each token contributes 0.1, across all steps
66
+ _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
67
+ MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
68
+
69
+ SYSTEM_PROMPT = textwrap.dedent(
70
+ """
71
+ You are interacting with a simple echo environment.
72
+ Each turn you must send a message. The environment will echo it back.
73
+ Reward is proportional to message length: reward = len(message) * 0.1
74
+ Your goal is to maximize total reward by sending meaningful, substantive messages.
75
+ Reply with exactly one message string β€” no quotes, no prefixes, just the message text.
76
+ """
77
+ ).strip()
78
+
79
+
80
+ def log_start(task: str, env: str, model: str) -> None:
81
+ print(f"[START] task={task} env={env} model={model}", flush=True)
82
+
83
+
84
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
85
+ error_val = error if error else "null"
86
+ done_val = str(done).lower()
87
+ print(
88
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
89
+ flush=True,
90
+ )
91
+
92
+
93
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
94
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
95
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
96
+
97
+
98
+ def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
99
+ history_block = "\n".join(history[-4:]) if history else "None"
100
+ return textwrap.dedent(
101
+ f"""
102
+ Step: {step}
103
+ Last echoed message: {last_echoed!r}
104
+ Last reward: {last_reward:.2f}
105
+ Previous steps:
106
+ {history_block}
107
+ Send your next message.
108
+ """
109
+ ).strip()
110
+
111
+
112
+ def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
113
+ user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
114
+ try:
115
+ completion = client.chat.completions.create(
116
+ model=MODEL_NAME,
117
+ messages=[
118
+ {"role": "system", "content": SYSTEM_PROMPT},
119
+ {"role": "user", "content": user_prompt},
120
+ ],
121
+ temperature=TEMPERATURE,
122
+ max_tokens=MAX_TOKENS,
123
+ stream=False,
124
+ )
125
+ text = (completion.choices[0].message.content or "").strip()
126
+ return text if text else "hello"
127
+ except Exception as exc:
128
+ print(f"[DEBUG] Model request failed: {exc}", flush=True)
129
+ return "hello"
130
+
131
+
132
+ async def main() -> None:
133
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
134
+
135
+ env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
136
+
137
+ history: List[str] = []
138
+ rewards: List[float] = []
139
+ steps_taken = 0
140
+ score = 0.0
141
+ success = False
142
+
143
+ log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
144
+
145
+ try:
146
+ result = await env.reset() # OpenENV.reset()
147
+ last_echoed = result.observation.echoed_message
148
+ last_reward = 0.0
149
+
150
+ for step in range(1, MAX_STEPS + 1):
151
+ if result.done:
152
+ break
153
+
154
+ message = get_model_message(client, step, last_echoed, last_reward, history)
155
+
156
+ result = await env.step(MyEnvV4Action(message=message))
157
+ obs = result.observation
158
+
159
+ reward = result.reward or 0.0
160
+ done = result.done
161
+ error = None
162
+
163
+ rewards.append(reward)
164
+ steps_taken = step
165
+ last_echoed = obs.echoed_message
166
+ last_reward = reward
167
+
168
+ log_step(step=step, action=message, reward=reward, done=done, error=error)
169
+
170
+ history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
171
+
172
+ if done:
173
+ break
174
+
175
+ score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
176
+ score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
177
+ success = score >= SUCCESS_SCORE_THRESHOLD
178
+
179
+ finally:
180
+ try:
181
+ await env.close()
182
+ except Exception as e:
183
+ print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
184
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
185
+
186
+
187
+ if __name__ == "__main__":
188
+ asyncio.run(main())
sample_scripts/sample_val_script.txt ADDED
@@ -0,0 +1,185 @@
 
 
1
+ #!/usr/bin/env bash
2
+ #
3
+ # validate-submission.sh β€” OpenEnv Submission Validator
4
+ #
5
+ # Checks that your HF Space is live, Docker image builds, and openenv validate passes.
6
+ #
7
+ # Prerequisites:
8
+ # - Docker: https://docs.docker.com/get-docker/
9
+ # - openenv-core: pip install openenv-core
10
+ # - curl (usually pre-installed)
11
+ #
12
+ # Run:
13
+ # curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
14
+ #
15
+ # Or download and run locally:
16
+ # chmod +x validate-submission.sh
17
+ # ./validate-submission.sh <ping_url> [repo_dir]
18
+ #
19
+ # Arguments:
20
+ # ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
21
+ # repo_dir Path to your repo (default: current directory)
22
+ #
23
+ # Examples:
24
+ # ./validate-submission.sh https://my-team.hf.space
25
+ # ./validate-submission.sh https://my-team.hf.space ./my-repo
26
+ #
27
+
28
+ set -uo pipefail
29
+
30
+ DOCKER_BUILD_TIMEOUT=600
31
+ if [ -t 1 ]; then
32
+ RED='\033[0;31m'
33
+ GREEN='\033[0;32m'
34
+ YELLOW='\033[1;33m'
35
+ BOLD='\033[1m'
36
+ NC='\033[0m'
37
+ else
38
+ RED='' GREEN='' YELLOW='' BOLD='' NC=''
39
+ fi
40
+
41
+ run_with_timeout() {
42
+ local secs="$1"; shift
43
+ if command -v timeout &>/dev/null; then
44
+ timeout "$secs" "$@"
45
+ elif command -v gtimeout &>/dev/null; then
46
+ gtimeout "$secs" "$@"
47
+ else
48
+ "$@" &
49
+ local pid=$!
50
+ ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
51
+ local watcher=$!
52
+ wait "$pid" 2>/dev/null
53
+ local rc=$?
54
+ kill "$watcher" 2>/dev/null
55
+ wait "$watcher" 2>/dev/null
56
+ return $rc
57
+ fi
58
+ }
59
+
60
+ portable_mktemp() {
61
+ local prefix="${1:-validate}"
62
+ mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
63
+ }
64
+
65
+ CLEANUP_FILES=()
66
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
67
+ trap cleanup EXIT
68
+
69
+ PING_URL="${1:-}"
70
+ REPO_DIR="${2:-.}"
71
+
72
+ if [ -z "$PING_URL" ]; then
73
+ printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
74
+ printf "\n"
75
+ printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
76
+ printf " repo_dir Path to your repo (default: current directory)\n"
77
+ exit 1
78
+ fi
79
+
80
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
81
+ printf "Error: directory '%s' not found\n" "${2:-.}"
82
+ exit 1
83
+ fi
84
+ PING_URL="${PING_URL%/}"
85
+ export PING_URL
86
+ PASS=0
87
+
88
+ log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
89
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
90
+ fail() { log "${RED}FAILED${NC} -- $1"; }
91
+ hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
92
+ stop_at() {
93
+ printf "\n"
94
+ printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
95
+ exit 1
96
+ }
97
+
98
+ printf "\n"
99
+ printf "${BOLD}========================================${NC}\n"
100
+ printf "${BOLD} OpenEnv Submission Validator${NC}\n"
101
+ printf "${BOLD}========================================${NC}\n"
102
+ log "Repo: $REPO_DIR"
103
+ log "Ping URL: $PING_URL"
104
+ printf "\n"
105
+
106
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
107
+
108
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
109
+ CLEANUP_FILES+=("$CURL_OUTPUT")
110
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
111
+ -H "Content-Type: application/json" -d '{}' \
112
+ "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
113
+
114
+ if [ "$HTTP_CODE" = "200" ]; then
115
+ pass "HF Space is live and responds to /reset"
116
+ elif [ "$HTTP_CODE" = "000" ]; then
117
+ fail "HF Space not reachable (connection failed or timed out)"
118
+ hint "Check your network connection and that the Space is running."
119
+ hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
120
+ stop_at "Step 1"
121
+ else
122
+ fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
123
+ hint "Make sure your Space is running and the URL is correct."
124
+ hint "Try opening $PING_URL in your browser first."
125
+ stop_at "Step 1"
126
+ fi
127
+
128
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
129
+
130
+ if ! command -v docker &>/dev/null; then
131
+ fail "docker command not found"
132
+ hint "Install Docker: https://docs.docker.com/get-docker/"
133
+ stop_at "Step 2"
134
+ fi
135
+
136
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
137
+ DOCKER_CONTEXT="$REPO_DIR"
138
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
139
+ DOCKER_CONTEXT="$REPO_DIR/server"
140
+ else
141
+ fail "No Dockerfile found in repo root or server/ directory"
142
+ stop_at "Step 2"
143
+ fi
144
+
145
+ log " Found Dockerfile in $DOCKER_CONTEXT"
146
+
147
+ BUILD_OK=false
148
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
149
+
150
+ if [ "$BUILD_OK" = true ]; then
151
+ pass "Docker build succeeded"
152
+ else
153
+ fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
154
+ printf "%s\n" "$BUILD_OUTPUT" | tail -20
155
+ stop_at "Step 2"
156
+ fi
157
+
158
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
159
+
160
+ if ! command -v openenv &>/dev/null; then
161
+ fail "openenv command not found"
162
+ hint "Install it: pip install openenv-core"
163
+ stop_at "Step 3"
164
+ fi
165
+
166
+ VALIDATE_OK=false
167
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
168
+
169
+ if [ "$VALIDATE_OK" = true ]; then
170
+ pass "openenv validate passed"
171
+ [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
172
+ else
173
+ fail "openenv validate failed"
174
+ printf "%s\n" "$VALIDATE_OUTPUT"
175
+ stop_at "Step 3"
176
+ fi
177
+
178
+ printf "\n"
179
+ printf "${BOLD}========================================${NC}\n"
180
+ printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
181
+ printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
182
+ printf "${BOLD}========================================${NC}\n"
183
+ printf "\n"
184
+
185
+ exit 0
server/environment.py CHANGED
@@ -71,15 +71,30 @@ class CloudNativeDebugEnvironment:
71
  return None
72
 
73
  def _validation_snapshot(self) -> Dict[str, bool]:
 
74
  docker_result = self.docker_sim.validate(self.current_files.get("Dockerfile"), self.current_files)
75
  workflow_file = self._find_workflow_file()
76
  workflow_result = self.workflow_sim.validate(workflow_file, self.current_files)
77
  k8s_result = self.k8s_sim.validate(self.current_files)
78
- return {
79
- "docker_build_valid": bool(docker_result.get("build_success", False)),
80
- "workflow_parse_valid": bool(workflow_result.get("parse_success", False)),
81
- "k8s_valid": bool(k8s_result.get("valid", True)),
82
- }
 
 
83
 
84
  def __init__(self):
85
  self.docker_sim = DockerSimulator()
@@ -146,7 +161,13 @@ class CloudNativeDebugEnvironment:
146
  )
147
 
148
  self.expected_fixes = scenario["expected_fixes"]
149
- self.issues_total = len(self.expected_fixes)
150
  self.issues_fixed = 0
151
 
152
  self.step_count = 0
@@ -200,8 +221,6 @@ class CloudNativeDebugEnvironment:
200
  self.last_action_success = False
201
  return 0.0, "No edits provided"
202
 
203
- before_validation = self._validation_snapshot()
204
-
205
  reward = 0.0
206
  feedbacks: List[str] = []
207
  applied_count = 0
@@ -288,17 +307,6 @@ class CloudNativeDebugEnvironment:
288
 
289
  reward += self._check_fix_progress()
290
 
291
- after_validation = self._validation_snapshot()
292
- if not before_validation["docker_build_valid"] and after_validation["docker_build_valid"]:
293
- reward += 0.1
294
- feedbacks.append("Docker build validity improved")
295
- if not before_validation["workflow_parse_valid"] and after_validation["workflow_parse_valid"]:
296
- reward += 0.1
297
- feedbacks.append("Workflow parse validity improved")
298
- if not before_validation["k8s_valid"] and after_validation["k8s_valid"]:
299
- reward += 0.1
300
- feedbacks.append("Kubernetes manifest validity improved")
301
-
302
  if applied_count == 0:
303
  self.last_action_success = False
304
  return max(-0.02, reward - 0.02), "; ".join(feedbacks) or "No edit applied"
@@ -307,30 +315,21 @@ class CloudNativeDebugEnvironment:
307
  return max(0.0, reward), "; ".join(feedbacks)
308
 
309
  def _check_fix_progress(self) -> float:
310
- fixes_applied = 0
311
- for fix in self.expected_fixes:
312
- file_path = fix["file"]
313
- if file_path not in self.current_files:
314
- # For "contains" checks on missing files, the fix is not applied
315
- # For "not_contains" checks on missing files, consider it fixed
316
- if fix["type"] == "not_contains":
317
- fixes_applied += 1
318
- continue
319
- current_content = self.current_files[file_path].content
320
- if fix["type"] == "contains" and fix["expected"] in current_content:
321
- fixes_applied += 1
322
- if fix["type"] == "not_contains" and fix["expected"] not in current_content:
323
- fixes_applied += 1
324
- if fix["type"] == "line_equals":
325
- lines = current_content.split("\n")
326
- line_num = int(fix.get("line", 0))
327
- if 1 <= line_num <= len(lines):
328
- if lines[line_num - 1].strip() == str(fix["expected"]).strip():
329
- fixes_applied += 1
330
-
331
- new_fixed = fixes_applied - self.issues_fixed
332
  if new_fixed > 0:
333
- self.issues_fixed = fixes_applied
334
  return 0.3 * new_fixed
335
  return 0.0
336
 
 
71
  return None
72
 
73
  def _validation_snapshot(self) -> Dict[str, bool]:
74
+ """Return a detailed snapshot of all 7 simulator checks."""
75
  docker_result = self.docker_sim.validate(self.current_files.get("Dockerfile"), self.current_files)
76
  workflow_file = self._find_workflow_file()
77
  workflow_result = self.workflow_sim.validate(workflow_file, self.current_files)
78
  k8s_result = self.k8s_sim.validate(self.current_files)
79
+
80
+ has_docker = "Dockerfile" in self.current_files
81
+ has_workflow = workflow_file is not None
82
+ has_k8s = any(fc.file_type == FileType.KUBERNETES for fc in self.current_files.values())
83
+
84
+ snapshot: Dict[str, bool] = {}
85
+ if has_docker:
86
+ snapshot["docker_build_valid"] = bool(docker_result.get("build_success", False))
87
+ snapshot["docker_run_valid"] = bool(docker_result.get("run_success", False))
88
+ if has_workflow:
89
+ snapshot["workflow_parse_valid"] = bool(workflow_result.get("parse_success", False))
90
+ snapshot["workflow_exec_valid"] = bool(workflow_result.get("execution_success", False))
91
+ if has_k8s:
92
+ snapshot["k8s_valid"] = bool(k8s_result.get("valid", True))
93
+ snapshot["k8s_pod_running"] = k8s_result.get("pod_status", "N/A") == "Running"
94
+ svc = k8s_result.get("service_status", "N/A")
95
+ snapshot["k8s_service_active"] = "active" in svc.lower() or svc == "N/A"
96
+
97
+ return snapshot
98
 
99
  def __init__(self):
100
  self.docker_sim = DockerSimulator()
 
161
  )
162
 
163
  self.expected_fixes = scenario["expected_fixes"]
164
+
165
+ # Snapshot the initial broken state from simulators
166
+ self.initial_snapshot = self._validation_snapshot()
167
+ # Count how many checks are initially failing — that's our issues_total
168
+ self.issues_total = sum(1 for v in self.initial_snapshot.values() if not v)
169
+ # Ensure at least 1 issue (the scenario is supposed to be broken)
170
+ self.issues_total = max(1, self.issues_total)
171
  self.issues_fixed = 0
172
 
173
  self.step_count = 0
 
221
  self.last_action_success = False
222
  return 0.0, "No edits provided"
223
 
224
  reward = 0.0
225
  feedbacks: List[str] = []
226
  applied_count = 0
 
307
 
308
  reward += self._check_fix_progress()
309
 
310
  if applied_count == 0:
311
  self.last_action_success = False
312
  return max(-0.02, reward - 0.02), "; ".join(feedbacks) or "No edit applied"
 
315
  return max(0.0, reward), "; ".join(feedbacks)
316
 
317
  def _check_fix_progress(self) -> float:
318
+ """Check fix progress by comparing current simulator state against initial broken state.
319
+
320
+ Counts how many simulator checks flipped from fail→pass since reset.
321
+ """
322
+ current_snapshot = self._validation_snapshot()
323
+
324
+ fixes_now = 0
325
+ for key, initially_passing in self.initial_snapshot.items():
326
+ if not initially_passing and current_snapshot.get(key, False):
327
+ # This check was initially failing and now passes
328
+ fixes_now += 1
329
+
330
+ new_fixed = fixes_now - self.issues_fixed
331
  if new_fixed > 0:
332
+ self.issues_fixed = fixes_now
333
  return 0.3 * new_fixed
334
  return 0.0
335
 
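In short, the diff above replaces per-fix string matching with snapshot comparison: `issues_total` becomes the number of simulator checks failing at reset, and progress is how many of those checks have since flipped to passing. A condensed sketch of that logic, using standalone names rather than the class methods themselves:

```python
# Condensed sketch of the snapshot-based counting introduced above.
# A snapshot maps check names to booleans (True = passing).
from typing import Dict

def count_initial_issues(initial: Dict[str, bool]) -> int:
    # Every check failing at reset is one issue; at least one is assumed broken.
    return max(1, sum(1 for passing in initial.values() if not passing))

def fixes_so_far(initial: Dict[str, bool], current: Dict[str, bool]) -> int:
    # A fix is a check that failed initially and passes now.
    return sum(1 for key, passing in initial.items()
               if not passing and current.get(key, False))

initial = {"docker_build_valid": False, "workflow_parse_valid": True}
assert count_initial_issues(initial) == 1
assert fixes_so_far(initial, {"docker_build_valid": True,
                              "workflow_parse_valid": True}) == 1
```
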
server/graders/__init__.py CHANGED
@@ -37,8 +37,8 @@ DIFFICULTY_MODIFIERS = {
37
  TaskDifficulty.HARD: (0.03, 0.7, 0.75),
38
  }
39
 
40
- SCORE_FLOOR = 0.0
41
- SCORE_CEIL = 1.0
42
 
43
  EDIT_ACTION_TYPES = frozenset({
44
  "edit_file", "replace_line", "add_line",
 
37
  TaskDifficulty.HARD: (0.03, 0.7, 0.75),
38
  }
39
 
40
+ SCORE_FLOOR = 0.01
41
+ SCORE_CEIL = 0.99
42
 
43
  EDIT_ACTION_TYPES = frozenset({
44
  "edit_file", "replace_line", "add_line",
server/models.py CHANGED
@@ -122,7 +122,7 @@ class EnvironmentInfo(BaseModel):
122
 
123
  class GraderResult(BaseModel):
124
  task_id: str
125
- score: float = Field(..., ge=0.0, le=1.0)
126
  max_score: float = 1.0
127
  breakdown: Dict[str, float] = Field(default_factory=dict)
128
  feedback: str = ""
@@ -170,7 +170,7 @@ class GraderResponse(BaseModel):
170
 
171
  class BaselineRequest(BaseModel):
172
  task_id: Optional[str] = None
173
- num_episodes: int = 1
174
 
175
 
176
  class BaselineResponse(BaseModel):
 
122
 
123
  class GraderResult(BaseModel):
124
  task_id: str
125
+ score: float = Field(..., gt=0.0, lt=1.0)
126
  max_score: float = 1.0
127
  breakdown: Dict[str, float] = Field(default_factory=dict)
128
  feedback: str = ""
 
170
 
171
  class BaselineRequest(BaseModel):
172
  task_id: Optional[str] = None
173
+ num_episodes: Optional[int] = None # None = run ALL scenarios
174
 
175
 
176
  class BaselineResponse(BaseModel):
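With `gt`/`lt` in place, pydantic rejects scores of exactly 0.0 or 1.0, which the clamped grader can no longer produce, and `num_episodes: Optional[int] = None` lets a baseline request mean "run every scenario". A small sketch of the validation behaviour, assuming pydantic v2:

```python
# Sketch of the strict-bounds behaviour, assuming pydantic v2 is installed.
from pydantic import BaseModel, Field, ValidationError

class GraderResult(BaseModel):
    task_id: str
    score: float = Field(..., gt=0.0, lt=1.0)

GraderResult(task_id="t1", score=0.99)     # OK: strictly inside (0, 1)

try:
    GraderResult(task_id="t1", score=1.0)  # exactly 1.0 is now rejected
except ValidationError as exc:
    print(exc.errors()[0]["type"])         # "less_than"
```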
server/simulators/docker_simulator.py CHANGED
@@ -39,7 +39,11 @@ class DockerSimulator:
39
  if "*" in source:
40
  prefix = source.replace("*", "")
41
  return any(path.startswith(prefix) for path in context_files)
42
- return source in context_files
43
 
44
  def _join_continuation_lines(self, lines: List[str]) -> List[str]:
45
  """Join lines ending with backslash into single logical lines."""
 
39
  if "*" in source:
40
  prefix = source.replace("*", "")
41
  return any(path.startswith(prefix) for path in context_files)
42
+ # Check exact match or directory prefix match (e.g. "dist/" matches "dist/index.html")
43
+ clean = source.rstrip("/")
44
+ if clean in context_files:
45
+ return True
46
+ return any(path.startswith(clean + "/") or path == clean for path in context_files)
47
 
48
  def _join_continuation_lines(self, lines: List[str]) -> List[str]:
49
  """Join lines ending with backslash into single logical lines."""
server/simulators/k8s_simulator.py CHANGED
@@ -312,7 +312,7 @@ class KubernetesSimulator:
312
  svc_ports = svc.get("spec", {}).get("ports", [])
313
  container_ports = []
314
  for c in dep.get("spec", {}).get("template", {}).get("spec", {}).get("containers", []):
315
- for p in c.get("ports", []):
316
  container_ports.append(p.get("containerPort"))
317
 
318
  for sp in svc_ports:
 
312
  svc_ports = svc.get("spec", {}).get("ports", [])
313
  container_ports = []
314
  for c in dep.get("spec", {}).get("template", {}).get("spec", {}).get("containers", []):
315
+ for p in (c.get("ports") or []):
316
  container_ports.append(p.get("containerPort"))
317
 
318
  for sp in svc_ports:
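The `or []` guard matters because a manifest that literally contains `ports:` with no value parses to `None` rather than a missing key, so `c.get("ports", [])` still returns `None` and the loop crashes. A minimal repro, assuming PyYAML-style parsing:

```python
# Why the `or []` guard matters: a bare `ports:` key parses to None,
# not to a missing key, so .get("ports", []) still returns None.
container = {"name": "api", "ports": None}  # what `ports:` with no value yields

# for p in container.get("ports", []):     # TypeError: NoneType is not iterable
for p in (container.get("ports") or []):   # safely iterates zero times
    print(p.get("containerPort"))
```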
server/simulators/workflow_simulator.py CHANGED
@@ -294,6 +294,135 @@ class WorkflowSimulator:
294
  "exec_error": f"{var} is empty β€” secret not available in shell environment. Map it via env block.",
295
  }
296
 
297
  # node version vs package.json engines
298
  for job_name, job in jobs.items():
299
  if not isinstance(job, dict):
 
294
  "exec_error": f"{var} is empty β€” secret not available in shell environment. Map it via env block.",
295
  }
296
 
297
+ # build-push-action without load: true when the image is used locally in a later step
298
+ for job_name, job in jobs.items():
299
+ if not isinstance(job, dict):
300
+ continue
301
+ steps = job.get("steps", [])
302
+ if not isinstance(steps, list):
303
+ continue
304
+ build_push_idx = None
305
+ build_push_has_load = False
306
+ for idx, step in enumerate(steps):
307
+ if not isinstance(step, dict):
308
+ continue
309
+ uses = step.get("uses", "")
310
+ if isinstance(uses, str) and "docker/build-push-action" in uses:
311
+ build_push_idx = idx
312
+ with_block = step.get("with", {})
313
+ if isinstance(with_block, dict):
314
+ push_val = str(with_block.get("push", "")).lower()
315
+ load_val = str(with_block.get("load", "")).lower()
316
+ build_push_has_load = load_val == "true"
317
+ # Only flag if push is false (local use intended)
318
+ if push_val == "false" and not build_push_has_load:
319
+ # Check if a later step uses docker run
320
+ for later in steps[idx + 1:]:
321
+ if not isinstance(later, dict):
322
+ continue
323
+ run_cmd = later.get("run", "")
324
+ if isinstance(run_cmd, str) and "docker run" in run_cmd:
325
+ return {
326
+ "parse_success": True,
327
+ "execution_success": False,
328
+ "exec_error": (
329
+ "build-push-action with Buildx does not load images into local daemon by default β€” "
330
+ "add 'load: true' to make the image available for docker run"
331
+ ),
332
+ }
333
+
334
+ # registry mismatch between build tag and push command
335
+ for job_name, job in jobs.items():
336
+ if not isinstance(job, dict):
337
+ continue
338
+ steps = job.get("steps", [])
339
+ if not isinstance(steps, list):
340
+ continue
341
+ build_registry = None
342
+ for step in steps:
343
+ if not isinstance(step, dict):
344
+ continue
345
+ run_cmd = step.get("run", "")
346
+ if not isinstance(run_cmd, str):
347
+ continue
348
+ # Extract registry from docker build -t
349
+ build_match = re.search(r'docker build\s+.*-t\s+(\S+)', run_cmd)
350
+ if build_match:
351
+ tag = build_match.group(1)
352
+ if "ghcr.io" in tag:
353
+ build_registry = "ghcr.io"
354
+ elif "docker.io" in tag or "/" in tag:
355
+ # docker.io is default for user/image format
356
+ build_registry = tag.split("/")[0] if "." in tag.split("/")[0] else "docker.io"
357
+ push_match = re.search(r'docker push\s+(\S+)', run_cmd)
358
+ if push_match and build_registry:
359
+ push_tag = push_match.group(1)
360
+ if "ghcr.io" in push_tag:
361
+ push_registry = "ghcr.io"
362
+ elif "docker.io" in push_tag:
363
+ push_registry = "docker.io"
364
+ else:
365
+ push_registry = push_tag.split("/")[0] if "." in push_tag.split("/")[0] else "docker.io"
366
+ if build_registry != push_registry:
367
+ return {
368
+ "parse_success": True,
369
+ "execution_success": False,
370
+ "exec_error": (
371
+ f"Registry mismatch: image built with {build_registry} tag "
372
+ f"but push targets {push_registry}"
373
+ ),
374
+ }
375
+
376
+ # docker tag referencing non-existent image tag
377
+ for job_name, job in jobs.items():
378
+ if not isinstance(job, dict):
379
+ continue
380
+ steps = job.get("steps", [])
381
+ if not isinstance(steps, list):
382
+ continue
383
+ built_tags = set()
384
+ for step in steps:
385
+ if not isinstance(step, dict):
386
+ continue
387
+ run_cmd = step.get("run", "")
388
+ if not isinstance(run_cmd, str):
389
+ continue
390
+ # Collect tags from docker build -t
391
+ for m in re.finditer(r'docker build\s+.*-t\s+(\S+)', run_cmd):
392
+ built_tags.add(m.group(1))
393
+ # Check docker tag source exists
394
+ tag_match = re.search(r'docker tag\s+(\S+)\s+(\S+)', run_cmd)
395
+ if tag_match:
396
+ source = tag_match.group(1)
397
+ # If source contains ${{ it's a template — compare the template expression
398
+ if source not in built_tags and "${{" not in source:
399
+ return {
400
+ "parse_success": True,
401
+ "execution_success": False,
402
+ "exec_error": f"No such image: {source} β€” docker tag source does not match any built image",
403
+ }
404
+ # Check if source uses a different tag template than what was built
405
+ if "${{" in source:
406
+ # Normalize: extract the expression
407
+ source_expr = re.search(r'\$\{\{(.+?)\}\}', source)
408
+ if source_expr:
409
+ source_key = source_expr.group(1).strip()
410
+ found_matching = False
411
+ for bt in built_tags:
412
+ bt_expr = re.search(r'\$\{\{(.+?)\}\}', bt)
413
+ if bt_expr and bt_expr.group(1).strip() == source_key:
414
+ found_matching = True
415
+ break
416
+ # Also check if the base image name matches
417
+ source_base = source.split(":")[0] if ":" in source else source
418
+ built_bases = {bt.split(":")[0] if ":" in bt else bt for bt in built_tags}
419
+ if not found_matching and source_base in built_bases:
420
+ return {
421
+ "parse_success": True,
422
+ "execution_success": False,
423
+ "exec_error": f"No such image: docker tag source tag does not match any built image tag",
424
+ }
425
+
426
  # node version vs package.json engines
427
  for job_name, job in jobs.items():
428
  if not isinstance(job, dict):
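As a usage sketch of the first new check above, this is the shape of `jobs` that should now fail validation: an image built with `push: false` and no `load: true`, then run by a later step (values written the way the simulator would see them after YAML parsing):

```python
# Hypothetical `jobs` mapping that trips the new build-push-action check.
jobs = {
    "test": {
        "steps": [
            {"uses": "actions/checkout@v4"},
            {
                "uses": "docker/build-push-action@v5",
                "with": {"push": "false", "tags": "myapp:test"},
            },
            {"run": "docker run myapp:test pytest"},
        ]
    }
}
# Walking these steps with the loop above returns parse_success=True,
# execution_success=False, and an exec_error telling you to add 'load: true'.
```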
server/tasks/k8s_networking.py CHANGED
@@ -81,7 +81,7 @@ class K8sNetworkingTask(BaseTask):
81
  "api-7f8d9c6b5-y3l0n 1/1 Running app=api-server\n"
82
  "api-7f8d9c6b5-z4m1o 1/1 Running app=api-server\n"
83
  "\n"
84
- "Note: Service selector 'app=api' does not match pod label 'app=api-server'"
85
  ),
86
  },
87
  "expected_fixes": [
@@ -153,7 +153,7 @@ class K8sNetworkingTask(BaseTask):
153
  "$ kubectl exec -it test-pod -- wget -qO- http://10.244.0.5:3000\n"
154
  "<!DOCTYPE html><html>...</html>\n"
155
  "\n"
156
- "Note: Service targetPort is 8080 but container listens on 3000"
157
  ),
158
  },
159
  "expected_fixes": [
@@ -249,7 +249,7 @@ class K8sNetworkingTask(BaseTask):
249
  "NAME TYPE CLUSTER-IP PORT(S)\n"
250
  "api-service ClusterIP 10.96.0.10 80/TCP\n"
251
  "\n"
252
- "Note: Ingress references service 'api-svc' but the actual service name is 'api-service'"
253
  ),
254
  },
255
  "expected_fixes": [
 
81
  "api-7f8d9c6b5-y3l0n 1/1 Running app=api-server\n"
82
  "api-7f8d9c6b5-z4m1o 1/1 Running app=api-server\n"
83
  "\n"
84
+ "Hint: Compare the Service selector with the pod labels shown above."
85
  ),
86
  },
87
  "expected_fixes": [
 
153
  "$ kubectl exec -it test-pod -- wget -qO- http://10.244.0.5:3000\n"
154
  "<!DOCTYPE html><html>...</html>\n"
155
  "\n"
156
+ "Hint: The container responds on a different port than the Service expects."
157
  ),
158
  },
159
  "expected_fixes": [
 
249
  "NAME TYPE CLUSTER-IP PORT(S)\n"
250
  "api-service ClusterIP 10.96.0.10 80/TCP\n"
251
  "\n"
252
+ "Hint: The Ingress backend service name does not match any existing Service."
253
  ),
254
  },
255
  "expected_fixes": [
server/tasks/pipeline_build_deploy.py CHANGED
@@ -16,15 +16,15 @@ class PipelineBuildDeployTask(BaseTask):
16
  AVAILABLE_SECRETS = ["GITHUB_TOKEN", "DOCKER_USERNAME", "DOCKER_PASSWORD"]
17
 
18
  SCENARIOS = [
19
- # Scenario 1: GHCR login — GITHUB_TOKEN not mapped to env
20
  {
21
- "id": "ghcr_token_not_mapped",
22
  "files": [
23
  {
24
  "path": ".github/workflows/deploy.yml",
25
  "type": "workflow",
26
  "content": (
27
- "name: Build and Push to GHCR\n"
28
  "on:\n"
29
  " push:\n"
30
  " branches: [main]\n"
@@ -32,17 +32,19 @@ class PipelineBuildDeployTask(BaseTask):
32
  "jobs:\n"
33
  " build:\n"
34
  " runs-on: ubuntu-latest\n"
35
  " steps:\n"
36
  " - uses: actions/checkout@v4\n"
37
  "\n"
38
  " - name: Login to GHCR\n"
39
- " run: echo $GITHUB_TOKEN | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
40
  "\n"
41
  " - name: Build image\n"
42
  " run: docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .\n"
43
  "\n"
44
  " - name: Push image\n"
45
- " run: docker push ghcr.io/${{ github.repository }}:${{ github.sha }}\n"
46
  ),
47
  },
48
  {
@@ -67,23 +69,23 @@ class PipelineBuildDeployTask(BaseTask):
67
  "error": {
68
  "phase": "pipeline_build",
69
  "message": (
70
- "Run: Build and Push to GHCR\n"
71
  "\n"
72
- "Step: Login to GHCR\n"
73
- "Error: Cannot perform an interactive login from a non TTY device\n"
74
- "Error: GITHUB_TOKEN environment variable is not set\n"
75
  "\n"
76
- "The GITHUB_TOKEN secret is available but not mapped to an environment variable."
77
  ),
78
  "exit_code": 1,
79
- "failed_step": "Login to GHCR",
80
  },
81
  "expected_fixes": [
82
  {
83
  "file": ".github/workflows/deploy.yml",
84
  "type": "contains",
85
- "expected": "GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}",
86
- "hint": "The GITHUB_TOKEN shell variable is used in the run command but not mapped from secrets via env block",
87
  }
88
  ],
89
  },
@@ -161,18 +163,18 @@ class PipelineBuildDeployTask(BaseTask):
161
  ],
162
  },
163
 
164
- # Scenario 3: Missing packages:write permission for GHCR push
165
  {
166
- "id": "missing_packages_write",
167
  "files": [
168
  {
169
  "path": ".github/workflows/publish.yml",
170
  "type": "workflow",
171
  "content": (
172
- "name: Publish to GHCR\n"
173
  "on:\n"
174
- " release:\n"
175
- " types: [published]\n"
176
  "\n"
177
  "jobs:\n"
178
  " publish:\n"
@@ -180,14 +182,22 @@ class PipelineBuildDeployTask(BaseTask):
180
  " steps:\n"
181
  " - uses: actions/checkout@v4\n"
182
  "\n"
183
- " - name: Login to GHCR\n"
184
- " run: echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
185
  "\n"
186
  " - name: Build\n"
187
- " run: docker build -t ghcr.io/${{ github.repository }}:${{ github.event.release.tag_name }} .\n"
188
  "\n"
189
  " - name: Push\n"
190
- " run: docker push ghcr.io/${{ github.repository }}:${{ github.event.release.tag_name }}\n"
191
  ),
192
  },
193
  {
@@ -196,34 +206,39 @@ class PipelineBuildDeployTask(BaseTask):
196
  "content": (
197
  "FROM python:3.11-slim\n"
198
  "WORKDIR /app\n"
199
  "COPY . .\n"
200
  'CMD ["python", "app.py"]\n'
201
  ),
202
  },
203
  ],
204
  "error": {
205
  "phase": "pipeline_build",
206
  "message": (
207
- "Run: Publish to GHCR\n"
208
  "\n"
209
- "Step: Login to GHCR βœ“\n"
210
- "Step: Build βœ“\n"
211
- "Step: Push βœ—\n"
212
- "Error: denied: permission_denied: write_package\n"
213
- "Error: GITHUB_TOKEN does not have packages:write permission\n"
214
  "\n"
215
- "The default GITHUB_TOKEN only has read access to packages. "
216
- "Add a permissions block to the job."
217
  ),
218
  "exit_code": 1,
219
- "failed_step": "Push",
220
  },
221
  "expected_fixes": [
222
  {
223
  "file": ".github/workflows/publish.yml",
224
  "type": "contains",
225
- "expected": "packages: write",
226
- "hint": "GHCR push requires 'permissions: packages: write' in the job or workflow",
227
  }
228
  ],
229
  },
@@ -289,15 +304,15 @@ class PipelineBuildDeployTask(BaseTask):
289
  ],
290
  },
291
 
292
- # Scenario 5: Multi-stage build — wrong output directory name
293
  {
294
- "id": "multistage_output_mismatch",
295
  "files": [
296
  {
297
  "path": ".github/workflows/build.yml",
298
  "type": "workflow",
299
  "content": (
300
- "name: Build Frontend\n"
301
  "on:\n"
302
  " push:\n"
303
  " branches: [main]\n"
@@ -308,53 +323,54 @@ class PipelineBuildDeployTask(BaseTask):
308
  " steps:\n"
309
  " - uses: actions/checkout@v4\n"
310
  "\n"
311
- " - name: Build image\n"
312
- " run: docker build -t frontend:latest .\n"
313
  ),
314
  },
315
  {
316
- "path": "Dockerfile",
317
  "type": "dockerfile",
318
  "content": (
319
- "FROM node:20-alpine AS builder\n"
320
  "WORKDIR /app\n"
321
- "COPY package*.json ./\n"
322
- "RUN npm ci\n"
323
  "COPY . .\n"
324
- "RUN npm run build\n"
325
- "\n"
326
- "FROM nginx:alpine\n"
327
- "COPY --from=builder /app/dist /usr/share/nginx/html\n"
328
- "EXPOSE 80\n"
329
- 'CMD ["nginx", "-g", "daemon off;"]\n'
330
  ),
331
  },
332
  {
333
- "path": "package.json",
334
- "type": "other",
335
- "content": '{"name": "frontend", "scripts": {"build": "react-scripts build", "start": "react-scripts start"}}',
336
  },
337
  ],
338
  "error": {
339
  "phase": "pipeline_build",
340
  "message": (
341
- "Run: Build Frontend\n"
342
  "\n"
343
- "Step: Build image βœ—\n"
344
- "Error: COPY failed: stat app/dist: file does not exist\n"
345
  "\n"
346
- "react-scripts build outputs to /app/build, not /app/dist. "
347
- "The COPY --from=builder path is wrong."
348
  ),
349
  "exit_code": 1,
350
- "failed_step": "Build image",
351
  },
352
  "expected_fixes": [
353
  {
354
- "file": "Dockerfile",
355
  "type": "contains",
356
- "expected": "COPY --from=builder /app/build",
357
- "hint": "react-scripts outputs to 'build/' not 'dist/'. Change COPY --from=builder /app/dist to /app/build",
358
  }
359
  ],
360
  },
 
16
  AVAILABLE_SECRETS = ["GITHUB_TOKEN", "DOCKER_USERNAME", "DOCKER_PASSWORD"]
17
 
18
  SCENARIOS = [
19
+ # Scenario 1: Registry mismatch — build tags ghcr.io but push targets docker.io
20
  {
21
+ "id": "registry_mismatch",
22
  "files": [
23
  {
24
  "path": ".github/workflows/deploy.yml",
25
  "type": "workflow",
26
  "content": (
27
+ "name: Build and Push\n"
28
  "on:\n"
29
  " push:\n"
30
  " branches: [main]\n"
 
32
  "jobs:\n"
33
  " build:\n"
34
  " runs-on: ubuntu-latest\n"
35
+ " permissions:\n"
36
+ " packages: write\n"
37
  " steps:\n"
38
  " - uses: actions/checkout@v4\n"
39
  "\n"
40
  " - name: Login to GHCR\n"
41
+ " run: echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
42
  "\n"
43
  " - name: Build image\n"
44
  " run: docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .\n"
45
  "\n"
46
  " - name: Push image\n"
47
+ " run: docker push docker.io/${{ github.repository }}:${{ github.sha }}\n"
48
  ),
49
  },
50
  {
 
69
  "error": {
70
  "phase": "pipeline_build",
71
  "message": (
72
+ "Run: Build and Push\n"
73
  "\n"
74
+ "Step: Build image βœ“\n"
75
+ "Step: Push image βœ—\n"
76
+ "Error: An image does not exist locally with the tag: docker.io/<repo>:<sha>\n"
77
  "\n"
78
+ "The image was built with a ghcr.io tag but the push targets docker.io."
79
  ),
80
  "exit_code": 1,
81
+ "failed_step": "Push image",
82
  },
83
  "expected_fixes": [
84
  {
85
  "file": ".github/workflows/deploy.yml",
86
  "type": "contains",
87
+ "expected": "docker push ghcr.io/",
88
+ "hint": "The push command targets docker.io but the image was tagged with ghcr.io β€” use the same registry",
89
  }
90
  ],
91
  },
 
163
  ],
164
  },
165
 
166
+ # Scenario 3: Build and push use different tagging strategies (sha vs latest)
167
  {
168
+ "id": "inconsistent_tagging",
169
  "files": [
170
  {
171
  "path": ".github/workflows/publish.yml",
172
  "type": "workflow",
173
  "content": (
174
+ "name: Publish\n"
175
  "on:\n"
176
+ " push:\n"
177
+ " branches: [main]\n"
178
  "\n"
179
  "jobs:\n"
180
  " publish:\n"
 
182
  " steps:\n"
183
  " - uses: actions/checkout@v4\n"
184
  "\n"
185
+ " - name: Login to DockerHub\n"
186
+ " run: echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin\n"
187
  "\n"
188
  " - name: Build\n"
189
+ " run: docker build -t myuser/api:${{ github.sha }} .\n"
190
+ "\n"
191
+ " - name: Test\n"
192
+ " run: docker run myuser/api:${{ github.sha }} python -m pytest\n"
193
+ "\n"
194
+ " - name: Tag latest\n"
195
+ " run: docker tag myuser/api:latest myuser/api:stable\n"
196
  "\n"
197
  " - name: Push\n"
198
+ " run: |\n"
199
+ " docker push myuser/api:${{ github.sha }}\n"
200
+ " docker push myuser/api:stable\n"
201
  ),
202
  },
203
  {
 
206
  "content": (
207
  "FROM python:3.11-slim\n"
208
  "WORKDIR /app\n"
209
+ "COPY requirements.txt .\n"
210
+ "RUN pip install -r requirements.txt\n"
211
  "COPY . .\n"
212
  'CMD ["python", "app.py"]\n'
213
  ),
214
  },
215
+ {
216
+ "path": "requirements.txt",
217
+ "type": "requirements",
218
+ "content": "flask==3.0.0\npytest==7.4.0\n",
219
+ },
220
  ],
221
  "error": {
222
  "phase": "pipeline_build",
223
  "message": (
224
+ "Run: Publish\n"
225
  "\n"
226
+ "Step: Build βœ“ (myuser/api:<sha>)\n"
227
+ "Step: Test βœ“\n"
228
+ "Step: Tag latest βœ—\n"
229
+ "Error: No such image: myuser/api:latest\n"
230
  "\n"
231
+ "The tag command references 'myuser/api:latest' but no image with that tag exists."
232
  ),
233
  "exit_code": 1,
234
+ "failed_step": "Tag latest",
235
  },
236
  "expected_fixes": [
237
  {
238
  "file": ".github/workflows/publish.yml",
239
  "type": "contains",
240
+ "expected": "docker tag myuser/api:${{ github.sha }}",
241
+ "hint": "The 'docker tag' source must match the tag used in the build step β€” use the sha-tagged image as source",
242
  }
243
  ],
244
  },
 
304
  ],
305
  },
306
 
307
+ # Scenario 5: the workflow's Dockerfile path is wrong when the Dockerfile lives in a subdirectory
308
  {
309
+ "id": "dockerfile_path_in_subdirectory",
310
  "files": [
311
  {
312
  "path": ".github/workflows/build.yml",
313
  "type": "workflow",
314
  "content": (
315
+ "name: Build API\n"
316
  "on:\n"
317
  " push:\n"
318
  " branches: [main]\n"
 
323
  " steps:\n"
324
  " - uses: actions/checkout@v4\n"
325
  "\n"
326
+ " - name: Build API image\n"
327
+ " uses: docker/build-push-action@v5\n"
328
+ " with:\n"
329
+ " context: ./services/api\n"
330
+ " file: ./Dockerfile\n"
331
+ " push: false\n"
332
+ " tags: api:latest\n"
333
  ),
334
  },
335
  {
336
+ "path": "services/api/Dockerfile",
337
  "type": "dockerfile",
338
  "content": (
339
+ "FROM python:3.11-slim\n"
340
  "WORKDIR /app\n"
341
+ "COPY requirements.txt .\n"
342
+ "RUN pip install -r requirements.txt\n"
343
  "COPY . .\n"
344
+ "EXPOSE 8000\n"
345
+ 'CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]\n'
346
  ),
347
  },
348
  {
349
+ "path": "services/api/requirements.txt",
350
+ "type": "requirements",
351
+ "content": "fastapi==0.104.0\nuvicorn==0.24.0\n",
352
  },
353
  ],
354
  "error": {
355
  "phase": "pipeline_build",
356
  "message": (
357
+ "Run: Build API\n"
358
  "\n"
359
+ "Step: Build API image βœ—\n"
360
+ "Error: unable to prepare context: unable to evaluate symlinks in Dockerfile path: "
361
+ "lstat /home/runner/work/repo/repo/Dockerfile: no such file or directory\n"
362
  "\n"
363
+ "The Dockerfile is not at the repository root."
364
  ),
365
  "exit_code": 1,
366
+ "failed_step": "Build API image",
367
  },
368
  "expected_fixes": [
369
  {
370
+ "file": ".github/workflows/build.yml",
371
  "type": "contains",
372
+ "expected": "file: ./services/api/Dockerfile",
373
+ "hint": "The 'file' path must point to where the Dockerfile actually is β€” ./services/api/Dockerfile, not ./Dockerfile",
374
  }
375
  ],
376
  },
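The new `inconsistent_tagging` scenario is exactly what the docker-tag check added to `workflow_simulator.py` catches; reduced to its core, the failing condition looks like this (literal strings stand in for the parsed run commands):

```python
# Core of the docker-tag check as it applies to scenario 3 above.
built_tags = {"myuser/api:${{ github.sha }}"}  # collected from `docker build -t`
tag_source = "myuser/api:latest"               # source operand of `docker tag`

# No "${{" template in the source, so the simulator uses plain membership:
flagged = tag_source not in built_tags and "${{" not in tag_source
assert flagged  # exec_error: "No such image: myuser/api:latest ..."
```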
server/tasks/pipeline_full.py CHANGED
@@ -116,8 +116,7 @@ class PipelineFullTask(BaseTask):
116
  "\n"
117
  "---\n"
118
  "(If login had succeeded, deployment would also fail with:)\n"
119
- "Error: Service 'myapp-service' has no endpoints β€” selector 'app=my-app' "
120
- "doesn't match any pods (pods have label 'app=myapp')"
121
  ),
122
  },
123
  "expected_fixes": [
@@ -230,9 +229,8 @@ class PipelineFullTask(BaseTask):
230
  "\n"
231
  "---\n"
232
  "Additionally:\n"
233
- "- Dockerfile has no WORKDIR set β€” npm will fail to find package.json\n"
234
- "- K8s deployment containerPort is 8080 but app listens on 3000 "
235
- "(service targetPort also wrong)"
236
  ),
237
  },
238
  "expected_fixes": [
@@ -350,8 +348,8 @@ class PipelineFullTask(BaseTask):
350
  "\n"
351
  "---\n"
352
  "Additional issues found:\n"
353
- "- Dockerfile: pull access denied for python:3.9-slimm (typo in base image tag)\n"
354
- "- K8s: Pod CrashLoopBackOff with OOMKilled (64Mi memory limit too low for gunicorn)"
355
  ),
356
  },
357
  "expected_fixes": [
 
116
  "\n"
117
  "---\n"
118
  "(If login had succeeded, deployment would also fail with:)\n"
119
+ "Error: Service 'myapp-service' has no endpoints"
120
  ),
121
  },
122
  "expected_fixes": [
 
229
  "\n"
230
  "---\n"
231
  "Additionally:\n"
232
+ "- Dockerfile: npm reports module resolution errors at runtime\n"
233
+ "- K8s: Service returns connection refused when accessed"
234
  ),
235
  },
236
  "expected_fixes": [
 
348
  "\n"
349
  "---\n"
350
  "Additional issues found:\n"
351
+ "- Dockerfile: pull access denied for base image β€” repository does not exist\n"
352
+ "- K8s: Pod in CrashLoopBackOff with exit code 137"
  ),
354
  },
355
  "expected_fixes": [
server/tasks/task_1_build_errors.py CHANGED
@@ -141,9 +141,9 @@ class DockerfileSyntaxTask(BaseTask):
141
  ],
142
  },
143
 
144
- # Scenario 4: EXPOSE with a quoted string instead of a number
145
  {
146
- "id": "invalid_expose",
147
  "files": [
148
  {
149
  "path": "Dockerfile",
@@ -151,29 +151,30 @@ class DockerfileSyntaxTask(BaseTask):
151
  "content": (
152
  "FROM nginx:alpine\n"
153
  "COPY nginx.conf /etc/nginx/nginx.conf\n"
154
- "COPY html /usr/share/nginx/html\n"
155
- 'EXPOSE "eighty"\n'
156
  'CMD ["nginx", "-g", "daemon off;"]'
157
  ),
158
  },
159
  {
160
- "path": "nginx.conf",
161
  "type": "other",
162
- "content": "events {}",
163
  },
164
  ],
165
  "error": {
166
  "phase": "docker_build",
167
- "message": "EXPOSE requires numeric port or port/protocol",
168
  "exit_code": 1,
169
- "line_hint": 4,
170
  },
171
  "expected_fixes": [
172
  {
173
  "file": "Dockerfile",
174
  "type": "contains",
175
- "expected": "EXPOSE 80",
176
- "hint": "EXPOSE must use a numeric port value, not a quoted string",
177
  }
178
  ],
179
  },
 
141
  ],
142
  },
143
 
144
+ # Scenario 4: COPY references a file that doesn't exist in context
145
  {
146
+ "id": "copy_missing_source",
147
  "files": [
148
  {
149
  "path": "Dockerfile",
 
151
  "content": (
152
  "FROM nginx:alpine\n"
153
  "COPY nginx.conf /etc/nginx/nginx.conf\n"
154
+ "COPY dist/ /usr/share/nginx/html\n"
155
+ "EXPOSE 80\n"
156
  'CMD ["nginx", "-g", "daemon off;"]'
157
  ),
158
  },
159
  {
160
+ "path": "build/index.html",
161
  "type": "other",
162
+ "content": "<!DOCTYPE html><html><body>Hello</body></html>",
163
  },
164
  ],
165
  "error": {
166
  "phase": "docker_build",
167
+ "message": "COPY failed: file not found in build context: dist/",
168
  "exit_code": 1,
169
+ "failed_step": "COPY dist/ /usr/share/nginx/html",
170
+ "line_hint": 3,
171
  },
172
  "expected_fixes": [
173
  {
174
  "file": "Dockerfile",
175
  "type": "contains",
176
+ "expected": "COPY build/",
177
+ "hint": "The build output is in 'build/' not 'dist/' β€” check the build context files",
178
  }
179
  ],
180
  },
server/tasks/task_5_ci_docker_integration.py CHANGED
@@ -75,15 +75,15 @@ class CIDockerIntegrationTask(BaseTask):
75
  ],
76
  },
77
 
78
- # Scenario 2: Docker login + build but secrets not wired in env block
79
  {
80
- "id": "login_secrets_not_wired",
81
  "files": [
82
  {
83
  "path": ".github/workflows/build.yml",
84
  "type": "workflow",
85
  "content": (
86
- "name: Build and Push\n"
87
  "on: push\n"
88
  "\n"
89
  "jobs:\n"
@@ -91,52 +91,46 @@ class CIDockerIntegrationTask(BaseTask):
91
  " runs-on: ubuntu-latest\n"
92
  " steps:\n"
93
  " - uses: actions/checkout@v4\n"
94
- " - name: Login to DockerHub\n"
95
- " run: echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin\n"
96
- " - name: Build\n"
97
- " run: docker build -t myuser/app:latest .\n"
98
- " - name: Push\n"
99
- " run: docker push myuser/app:latest"
100
  ),
101
  },
102
  {
103
  "path": "Dockerfile",
104
  "type": "dockerfile",
105
  "content": (
106
- "FROM node:18-alpine\n"
107
  "WORKDIR /app\n"
108
- "COPY package*.json ./\n"
109
- "RUN npm ci\n"
110
  "COPY . .\n"
111
- "EXPOSE 3000\n"
112
- 'CMD ["npm", "start"]'
113
  ),
114
  },
115
- {
116
- "path": "package.json",
117
- "type": "other",
118
- "content": '{"name": "app", "scripts": {"start": "node server.js"}}',
119
- },
120
  ],
121
  "error": {
122
- "phase": "workflow_parse",
123
- "message": "Error: Cannot perform an interactive login from a non TTY device",
124
  "exit_code": 1,
125
- "failed_step": "Login to DockerHub",
126
  },
127
  "expected_fixes": [
128
  {
129
  "file": ".github/workflows/build.yml",
130
  "type": "contains",
131
- "expected": "DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}",
132
- "hint": "Secrets need to be mapped to env vars in the step",
133
- },
134
- {
135
- "file": ".github/workflows/build.yml",
136
- "type": "contains",
137
- "expected": "DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}",
138
- "hint": "Both Docker credentials must be in the env block",
139
- },
140
  ],
141
  },
142
 
 
75
  ],
76
  },
77
 
78
+ # Scenario 2: build-push-action without load: true, so the next step can't find the image
79
  {
80
+ "id": "missing_load_true",
81
  "files": [
82
  {
83
  "path": ".github/workflows/build.yml",
84
  "type": "workflow",
85
  "content": (
86
+ "name: Build and Test\n"
87
  "on: push\n"
88
  "\n"
89
  "jobs:\n"
 
91
  " runs-on: ubuntu-latest\n"
92
  " steps:\n"
93
  " - uses: actions/checkout@v4\n"
94
+ " - name: Set up Docker Buildx\n"
95
+ " uses: docker/setup-buildx-action@v3\n"
96
+ " - name: Build image\n"
97
+ " uses: docker/build-push-action@v5\n"
98
+ " with:\n"
99
+ " context: .\n"
100
+ " push: false\n"
101
+ " tags: myapp:test\n"
102
+ " - name: Run tests\n"
103
+ " run: docker run myapp:test pytest"
104
  ),
105
  },
106
  {
107
  "path": "Dockerfile",
108
  "type": "dockerfile",
109
  "content": (
110
+ "FROM python:3.11-slim\n"
111
  "WORKDIR /app\n"
112
  "COPY . .\n"
113
+ "RUN pip install pytest\n"
114
+ 'CMD ["python", "app.py"]'
115
  ),
116
  },
117
  ],
118
  "error": {
119
+ "phase": "docker_build",
120
+ "message": (
121
+ "Unable to find image 'myapp:test' locally. "
122
+ "docker: Error response from daemon: pull access denied for myapp."
123
+ ),
124
  "exit_code": 1,
125
+ "failed_step": "Run tests",
126
  },
127
  "expected_fixes": [
128
  {
129
  "file": ".github/workflows/build.yml",
130
  "type": "contains",
131
+ "expected": "load: true",
132
+ "hint": "build-push-action with Buildx doesn't load images into local Docker daemon by default β€” add 'load: true'",
133
+ }
134
  ],
135
  },
136