Krishna1107 committed on
Commit
2794920
·
1 Parent(s): 893901a

fixed inference

README.md CHANGED
@@ -42,7 +42,7 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
42
 │ - replace_line: fix a specific line number │
43
 │ - add_line / add_block: insert missing content │
44
 │ - delete_line / delete_block: remove bad content │
45
- │ - request_hint: get a clue (-5% score penalty) │
46
 │ - submit: "I'm done fixing" │
47
 │ │
48
 │ After each action, agent gets: │
@@ -56,6 +56,7 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
56
 │ - Whether ALL issues were fixed (bonus) │
57
 │ - How many steps it took (efficiency) │
58
 │ - How many hints were used (penalty) │

59
 └──────────────────────────────────────────────────────────────┘
60
  ```
61
 
@@ -63,6 +64,8 @@ This environment teaches AI agents to do what senior DevOps engineers do: read t
63
 
64
  ## The 10 Tasks (50 Scenarios)
65
 
 
 
66
 ### Task 1: Dockerfile Syntax Errors — Easy
67
 
68
  Simple typos and instruction errors that break `docker build`.
@@ -72,7 +75,7 @@ Simple typos and instruction errors that break `docker build`.
72
 | 1 | `typo_filename` | `COPY requirments.txt .` — misspelled filename | Most common Docker build error on Stack Overflow |
73
 | 2 | `invalid_base_image` | `FROM python:3.9-slimm` — extra 'm' in tag | Happens when copy-pasting image tags |
74
 | 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` — broken line continuation | Formatting multi-line RUN commands is tricky |
75
- | 4 | `invalid_expose` | `EXPOSE "eighty"` — string instead of port number | EXPOSE only accepts numeric ports |
76
  | 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
77
 
78
 ### Task 2: Dockerfile Runtime Errors — Medium
@@ -111,14 +114,14 @@ Secrets exist but aren't wired correctly to the workflow steps.
111
  | 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
112
  | 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
113
 
114
- ### Task 5: CI + Docker Integration — Medium-Hard
115
 
116
  The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
117
 
118
  | # | Scenario | What's Broken | Real-World Context |
119
  |---|----------|---------------|-------------------|
120
  | 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
121
- | 2 | `login_secrets_not_wired` | `docker login` missing `env:` for secrets | "unauthorized: authentication required" |
122
  | 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
123
  | 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
124
  | 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
@@ -149,7 +152,7 @@ Pod crashes and scheduling failures in Kubernetes deployments.
149
 
150
 ### Task 8: Kubernetes Service & Ingress Issues — Hard
151
 
152
- Networking issues where pods run fine but traffic doesn't reach them.
153
 
154
  | # | Scenario | What's Broken | Real-World Context |
155
  |---|----------|---------------|-------------------|
@@ -165,15 +168,15 @@ GHA-to-Docker-to-Registry pipeline failures spanning multiple files.
165
 
166
  | # | Scenario | What's Broken | Real-World Context |
167
  |---|----------|---------------|-------------------|
168
- | 1 | `ghcr_token_not_mapped` | `$GITHUB_TOKEN` shell var not mapped from secrets | GHCR login fails |
169
  | 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
170
- | 3 | `missing_packages_write` | No `permissions: packages: write` for GHCR push | "permission_denied: write_package" |
171
  | 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
172
- | 5 | `multistage_output_mismatch` | `COPY --from=builder /app/dist` but react-scripts outputs to `/app/build` | Wrong output directory |
173
 
174
 ### Task 10: Full Stack Deployment Pipeline — Expert
175
 
176
- Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning.
177
 
178
  | # | Scenario | What's Broken | Real-World Context |
179
  |---|----------|---------------|-------------------|
@@ -185,6 +188,20 @@ Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifest
185
 
186
  ---
187
 
 
 
188
  ## Available Actions
189
 
190
  Each step, the agent chooses exactly one action:
@@ -197,16 +214,16 @@ Each step, the agent chooses exactly one action:
197
  | `delete_line` | Remove a specific line | Removing a bad instruction |
198
  | `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
199
  | `delete_block` | Remove a multi-line block | Removing incorrect sections |
200
- | `request_hint` | Get a clue about what's wrong | Costs -5% on final score — use sparingly |
201
 | `submit` | Declare "I'm done" — triggers final evaluation | When all fixes are applied |
202
 
203
  **Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
204
 
205
  ---
206
 
207
- ## Grading System — How Scores Work
208
 
209
- Scoring is **deterministic** (same actions always produce the same score), **dynamic** (different strategies get different scores), and **difficulty-aware** (harder tasks are graded more generously).
210
 
211
  ### The Formula
212
 
@@ -214,13 +231,13 @@ Scoring is **deterministic** (same actions always produce the same score), **dyn
214
  FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
215
  ```
216
 
217
- Clamped to `[0.0, 1.0]`.
218
 
219
  ### Component Breakdown
220
 
221
  | Component | Weight | Description |
222
  |-----------|--------|-------------|
223
- | Base score | 5% | Participation credit |
224
  | Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
225
  | Complete bonus | 25% | All issues fixed |
226
  | Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
@@ -230,15 +247,31 @@ Clamped to `[0.0, 1.0]`.
230
 
231
  ### Difficulty Modifiers
232
 
233
- The grader adjusts three parameters based on task difficulty:
234
-
235
  | Difficulty | Max Score | Efficiency Decay | Hint Cost |
236
  |------------|-----------|------------------|-----------|
237
  | Easy | 0.90 | 0.03/step (strict) | 4% each |
238
  | Medium | 0.90 | 0.027/step | 4% each |
239
  | Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
240
 
241
- This means: solving a 4-bug expert pipeline in 6 steps scores higher than solving a 1-bug easy task in 3 steps, reflecting the genuine difficulty difference.
 
 
242
 
243
  ---
244
 
@@ -256,7 +289,7 @@ This means: solving a 4-bug expert pipeline in 6 steps scores higher than solvin
256
  | `/info` | GET | Task list with metadata |
257
  | `/tasks` | GET | List all tasks with difficulty levels |
258
  | `/grader` | POST | Grade a trajectory (list of step dicts) |
259
- | `/baseline` | POST | Run built-in heuristic baseline |
260
  | `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
261
 
262
  ### Example: Full Episode via API
@@ -267,7 +300,7 @@ curl -X POST http://localhost:8000/reset \
267
  -H "Content-Type: application/json" \
268
  -d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
269
 
270
- # 2. Fix the memory limit
271
  curl -X POST http://localhost:8000/step \
272
  -H "Content-Type: application/json" \
273
  -d '{
@@ -276,7 +309,7 @@ curl -X POST http://localhost:8000/step \
276
  "edits": [{
277
  "file_path": "k8s/deployment.yaml",
278
  "old_content": "memory: \"64Mi\"",
279
- "new_content": "memory: \"256Mi\""
280
  }]
281
  }
282
  }'
@@ -325,7 +358,7 @@ python inference.py
325
  cloud-native-devops-env/
326
 ├── openenv.yaml # OpenEnv environment specification
327
 ├── inference.py # LLM baseline (OpenAI client + HF router)
328
- ├── baseline_runner.py # Heuristic baseline for /baseline endpoint
329
 ├── Dockerfile # Production container
330
 ├── requirements.txt # Python dependencies
331
 │
@@ -349,9 +382,9 @@ cloud-native-devops-env/
349
 │ ├── graders/
350
 │ │ └── __init__.py # Deterministic trajectory grader
351
 │ └── simulators/
352
- │ ├── docker_simulator.py # 15+ Dockerfile validation rules
353
- │ ├── workflow_simulator.py # 15+ workflow validation rules
354
- │ └── k8s_simulator.py # Kubernetes manifest validator
355
 │
356
 └── tests/
357
 ├── test_endpoints.py # API endpoint tests
@@ -364,22 +397,13 @@ cloud-native-devops-env/
364
  ## Design Decisions
365
 
366
 1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes — the three pillars of modern deployment pipelines.
367
- 2. **Simulated validation (no real Docker/K8s)**: Static analysis rules give deterministic results, fast execution, and no security concerns.
368
  3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
369
  4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
370
- 5. **Exact string matching for edits**: Mirrors real file editing — whitespace matters.
371
- 6. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
 
372
 
373
  ## License
374
 
375
  MIT
376
- title: Cloudnative Devops Debug Env
377
- emoji: 🚀
378
- colorFrom: yellow
379
- colorTo: gray
380
- sdk: docker
381
- pinned: false
382
- short_description: 'Open Env for the Meta x PyTorch x HuggingFace x SST hack '
383
- ---
384
-
385
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
42
 │ - replace_line: fix a specific line number │
43
 │ - add_line / add_block: insert missing content │
44
 │ - delete_line / delete_block: remove bad content │
45
+ │ - request_hint: get a clue (-4% score penalty) │
46
 │ - submit: "I'm done fixing" │
47
 │ │
48
 │ After each action, agent gets: │

56
 │ - Whether ALL issues were fixed (bonus) │
57
 │ - How many steps it took (efficiency) │
58
 │ - How many hints were used (penalty) │
59
+ │ Score range: (0, 1) exclusive — never exactly 0 or 1 │
60
 └──────────────────────────────────────────────────────────────┘
61
  ```
62
 
 
64
 
65
  ## The 10 Tasks (50 Scenarios)
66
 
67
+ Evaluation runs **all 50 scenarios deterministically** across all 10 tasks for reproducible scoring.
68
+
69
 ### Task 1: Dockerfile Syntax Errors — Easy
70
 
71
  Simple typos and instruction errors that break `docker build`.
 
75
 | 1 | `typo_filename` | `COPY requirments.txt .` — misspelled filename | Most common Docker build error on Stack Overflow |
76
 | 2 | `invalid_base_image` | `FROM python:3.9-slimm` — extra 'm' in tag | Happens when copy-pasting image tags |
77
 | 3 | `invalid_run_syntax` | `RUN pip install ... \n && python setup.py` — broken line continuation | Formatting multi-line RUN commands is tricky |
78
+ | 4 | `copy_missing_source` | `COPY dist/` but build output is in `build/` | Source directory doesn't exist in build context |
79
  | 5 | `missing_from_instruction` | No `FROM` instruction at all | Dockerfile must start with FROM |
80
 
81
 ### Task 2: Dockerfile Runtime Errors — Medium
 
114
  | 4 | `secret_not_in_env` | `$SLACK_WEBHOOK_URL` not in `env:` | Very common mistake |
115
  | 5 | `ghcr_wrong_credentials` | Using `DOCKER_PASSWORD` for GHCR login | GHCR uses `GITHUB_TOKEN` |
116
 
117
+ ### Task 5: CI + Docker Integration — Medium
118
 
119
  The workflow AND the Dockerfile interact. Fixing one file alone isn't enough.
120
 
121
  | # | Scenario | What's Broken | Real-World Context |
122
  |---|----------|---------------|-------------------|
123
  | 1 | `missing_buildx_for_platforms` | Multi-platform build without `setup-buildx-action` | Need BuildKit for cross-compile |
124
+ | 2 | `missing_load_true` | `build-push-action` without `load: true` — next step can't find image | Buildx doesn't load into local daemon by default |
125
  | 3 | `wrong_build_context` | Context is `./backend` but Dockerfile path is `./Dockerfile` | Path mismatch |
126
  | 4 | `cache_without_mode_max` | GHA cache export missing `mode=max` | Cache doesn't persist |
127
  | 5 | `push_without_login` | `docker push` without `docker login` first | "denied: requested access" |
 
152
 
153
 ### Task 8: Kubernetes Service & Ingress Issues — Hard
154

155
+ Networking issues where pods run fine but traffic doesn't reach them. Error messages are intentionally vague — the agent must diagnose from kubectl output.
156
 
157
  | # | Scenario | What's Broken | Real-World Context |
158
  |---|----------|---------------|-------------------|
 
168
 
169
  | # | Scenario | What's Broken | Real-World Context |
170
  |---|----------|---------------|-------------------|
171
+ | 1 | `registry_mismatch` | Build tags `ghcr.io/...` but push targets `docker.io/...` | Registry URL mismatch between steps |
172
  | 2 | `image_tag_mismatch` | Build uses `github.ref_name` but push uses `github.sha` | "image not found locally" |
173
+ | 3 | `inconsistent_tagging` | `docker tag myuser/api:latest` but image was built as `myuser/api:${{ github.sha }}` | Tag source doesn't exist |
174
  | 4 | `build_arg_not_passed` | Dockerfile `ARG APP_VERSION` but no `--build-arg` in workflow | Version file is empty |
175
+ | 5 | `dockerfile_path_in_subdirectory` | Workflow points to `./Dockerfile` but it's at `./services/api/Dockerfile` | Monorepo path mismatch |
176
 
177
 ### Task 10: Full Stack Deployment Pipeline — Expert
178

179
+ Multi-error scenarios spanning the entire stack: GHA + Dockerfile + K8s manifests. 2-4 bugs per scenario requiring cross-file reasoning. Error messages are intentionally vague — the agent must trace root causes from symptoms.
180
 
181
  | # | Scenario | What's Broken | Real-World Context |
182
  |---|----------|---------------|-------------------|
 
188
 
189
  ---
190
 
191
+ ## Fix Validation: Simulator-Based
192
+
193
+ Fixes are validated using **structural simulators**, not string matching. This means:
194
+
195
+ - **Alternative valid fixes are accepted.** Setting memory to `512Mi` instead of `256Mi` still resolves the OOM — the simulator accepts either.
196
+ - **Three independent simulators** run after every edit:
197
+ - **DockerSimulator**: validates Dockerfile syntax (FROM, COPY, EXPOSE, RUN) and runtime behavior (WORKDIR, CMD/ENTRYPOINT, permissions, ENV)
198
+ - **WorkflowSimulator**: parses YAML, checks triggers, runs-on, step ordering, secrets wiring, permissions, buildx requirements, registry consistency
199
+ - **KubernetesSimulator**: validates manifests, cross-resource dependencies (Service selector ↔ Deployment labels), pod status simulation (OOM, ImagePullBackOff), service endpoint reachability
200
+ - **7 granular checks** are tracked: `docker_build`, `docker_run`, `workflow_parse`, `workflow_exec`, `k8s_valid`, `k8s_pod_running`, `k8s_service_active`
201
+ - Progress = how many checks flip from fail → pass compared to the initial broken state (a sketch of this structural checking follows below)
202
+
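+ A minimal sketch of what "structural" validation means here (illustrative only: the helper `check_memory_limit` and the 128Mi floor are assumptions, not the environment's actual API):
+
+ ```python
+ import re
+
+ # Hypothetical structural check: accept ANY memory limit >= 128Mi,
+ # instead of demanding the exact replacement string "256Mi".
+ _UNIT = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
+
+ def check_memory_limit(manifest_text: str, min_bytes: int = 128 * 2**20) -> bool:
+     m = re.search(r'memory:\s*"?(\d+)(Ki|Mi|Gi)"?', manifest_text)
+     if not m:
+         return False  # no memory limit at all -> check stays failed
+     value, unit = int(m.group(1)), m.group(2)
+     return value * _UNIT[unit] >= min_bytes
+
+ assert check_memory_limit('memory: "256Mi"')      # canonical fix passes
+ assert check_memory_limit('memory: "512Mi"')      # alternative fix also passes
+ assert not check_memory_limit('memory: "64Mi"')   # original bug still fails
+ ```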
203
+ ---
204
+
205
  ## Available Actions
206
 
207
  Each step, the agent chooses exactly one action:
 
214
  | `delete_line` | Remove a specific line | Removing a bad instruction |
215
  | `add_block` | Insert a multi-line block | Adding entire sections (e.g., `env:` block with secrets) |
216
  | `delete_block` | Remove a multi-line block | Removing incorrect sections |
217
+ | `request_hint` | Get a clue about what's wrong | Costs -4% on final score — use sparingly |
218
 | `submit` | Declare "I'm done" — triggers final evaluation | When all fixes are applied |
219
 
220
  **Important:** `edit_file` requires `old_content` to match **exactly** (including whitespace). If it doesn't match, the edit fails and the agent gets a -0.02 reward penalty.
221
 
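+ For example, a well-formed `edit_file` payload (a sketch; the four-space indent is an assumption about how the manifest is formatted):
+
+ ```python
+ # old_content must match the file byte-for-byte, including leading spaces;
+ # dropping the indentation would make this edit fail with a -0.02 penalty.
+ action = {
+     "action_type": "edit_file",
+     "edits": [{
+         "file_path": "k8s/deployment.yaml",
+         "old_content": '    memory: "64Mi"',
+         "new_content": '    memory: "256Mi"',
+     }],
+ }
+ ```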
222
  ---
223
 
224
+ ## Grading System
225
 
226
+ Scoring is **deterministic** (same actions always produce the same score), **difficulty-aware** (harder tasks are graded more generously), and strictly in **(0, 1) exclusive** — never exactly 0 or 1. A worked example follows the tables below.
227
 
228
  ### The Formula
229
 
 
231
  FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
232
  ```
233
 
234
+ Clamped to `(0.01, 0.99)`.
235
 
236
  ### Component Breakdown
237
 
238
  | Component | Weight | Description |
239
  |-----------|--------|-------------|
240
+ | Base score | 5% | Participation credit (guarantees score > 0) |
241
  | Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
242
  | Complete bonus | 25% | All issues fixed |
243
  | Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
 
247
 
248
  ### Difficulty Modifiers
249
 
 
 
250
  | Difficulty | Max Score | Efficiency Decay | Hint Cost |
251
  |------------|-----------|------------------|-----------|
252
  | Easy | 0.90 | 0.03/step (strict) | 4% each |
253
  | Medium | 0.90 | 0.027/step | 4% each |
254
  | Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
255
 
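+ A worked example under stated assumptions (the efficiency term's 0.10 starting weight is assumed here; only its per-step decay is documented above):
+
+ ```python
+ # Hypothetical episode: hard task, 4/4 issues fixed in 6 steps, 1 hint, 0 failed edits.
+ base           = 0.05                        # participation credit
+ partial        = 0.35 * (4 / 4)              # proportional to issues fixed
+ complete_bonus = 0.25                        # all issues fixed
+ difficulty     = 0.03                        # full solve on a hard/expert task
+ efficiency     = max(0.0, 0.10 - 6 * 0.021)  # ASSUMED 0.10 start, documented 0.021/step decay
+ hint_penalty   = 1 * 0.03                    # 3% per hint on hard/expert
+
+ score = base + partial + complete_bonus + difficulty + efficiency - hint_penalty
+ score = min(score, 0.93)                     # hard/expert max-score cap
+ score = min(max(score, 0.01), 0.99)          # clamp to (0.01, 0.99)
+ print(round(score, 2))                       # 0.65 under these assumptions
+ ```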
256
+ ---
257
+
258
+ ## Evaluation
259
+
260
+ The evaluation pipeline runs **all 50 scenarios across all 10 tasks** deterministically:
261
+
262
+ ```python
263
+ # Runs all 10 tasks × 5 scenarios = 50 episodes
264
+ results = run_baseline_episodes() # num_episodes=None runs all
265
+
266
+ # Per-episode scores in (0, 1)
267
+ # Aggregate = mean of all 50 scores
268
+ aggregate = sum(r.score for r in results) / len(results)
269
+ ```
270
+
271
+ This ensures:
272
+ - **Reproducibility**: same agent produces same score every time
273
+ - **Complete coverage**: every error pattern is tested
274
+ - **Fair comparison**: all agents face the same 50 scenarios
275
 
276
  ---
277
 
 
289
  | `/info` | GET | Task list with metadata |
290
  | `/tasks` | GET | List all tasks with difficulty levels |
291
 | `/grader` | POST | Grade a trajectory (list of step dicts; see example below) |
292
+ | `/baseline` | POST | Run baseline across all scenarios (optional: `task_id`, `num_episodes`) |
293
  | `/mcp` | POST | JSON-RPC 2.0 MCP endpoint (initialize, tools/list) |
294
 
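+ For instance, `/grader` can be called directly with a recorded trajectory; the payload shape below mirrors what `inference.py` sends (the concrete reward values are illustrative):
+
+ ```python
+ import requests
+
+ # A two-step trajectory: one successful edit, then submit.
+ trajectory = [
+     {"step": 1, "action": {"action_type": "edit_file", "edits": []},  # edits elided
+      "reward": 0.30, "done": False, "info": {}},
+     {"step": 2, "action": {"action_type": "submit"},
+      "reward": 0.00, "done": True, "info": {}},
+ ]
+ resp = requests.post(
+     "http://localhost:8000/grader",
+     json={"task_id": "k8s_pod_failures", "trajectory": trajectory},
+     timeout=30,
+ )
+ print(resp.json()["result"]["score"])  # deterministic for a given trajectory
+ ```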
295
  ### Example: Full Episode via API
 
300
  -H "Content-Type: application/json" \
301
  -d '{"task_id": "k8s_pod_failures", "scenario_id": "oom_killed"}'
302
 
303
+ # 2. Fix the memory limit (any reasonable value works — simulator validates structurally)
304
  curl -X POST http://localhost:8000/step \
305
  -H "Content-Type: application/json" \
306
  -d '{
 
309
  "edits": [{
310
  "file_path": "k8s/deployment.yaml",
311
  "old_content": "memory: \"64Mi\"",
312
+ "new_content": "memory: \"512Mi\""
313
  }]
314
  }
315
  }'
 
358
  cloud-native-devops-env/
359
 ├── openenv.yaml # OpenEnv environment specification
360
 ├── inference.py # LLM baseline (OpenAI client + HF router)
361
+ ├── baseline_runner.py # Heuristic baseline — runs all 50 scenarios
362
 ├── Dockerfile # Production container
363
 ├── requirements.txt # Python dependencies
364
 │
 
382
 │ ├── graders/
383
 │ │ └── __init__.py # Deterministic trajectory grader
384
 │ └── simulators/
385
+ │ ├── docker_simulator.py # Dockerfile build + runtime validation
386
+ │ ├── workflow_simulator.py # GHA workflow parse + execution validation
387
+ │ └── k8s_simulator.py # K8s manifest + cross-resource validation
388
 │
389
 └── tests/
390
 ├── test_endpoints.py # API endpoint tests
 
397
  ## Design Decisions
398
 
399
 1. **Full cloud-native stack**: Docker + GitHub Actions + Kubernetes — the three pillars of modern deployment pipelines.
400
+ 2. **Simulator-based validation**: Structural rule-based simulators validate fixes instead of string matching. Alternative valid fixes are accepted (e.g., `512Mi` and `256Mi` both fix an OOM). Deterministic, fast, no security concerns.
401
  3. **Dense rewards**: Partial credit at every step (+0.3 per fix, -0.02 per failed edit) rather than sparse pass/fail.
402
  4. **Difficulty progression**: Easy tasks are single-file, single-issue. Expert tasks are multi-file, multi-issue with interacting bugs across all three layers.
403
+ 5. **Vague error messages in harder tasks**: Easy tasks have explicit error messages. Hard/Expert tasks have realistic, vague messages that require the agent to actually diagnose the issue from context.
404
+ 6. **Deterministic evaluation**: All 50 scenarios run every time for reproducible, comparable scores in (0, 1) exclusive.
405
+ 7. **50 scenarios from real bugs**: Every scenario is based on actual developer mistakes documented on Stack Overflow, GitHub Issues, and official documentation.
406
 
407
  ## License
408
 
409
  MIT
 
 
 
 
 
 
 
 
 
 
baseline_runner.py CHANGED
@@ -1,6 +1,7 @@
1
  """Heuristic baseline runner for the /baseline endpoint.
2
 
3
  Applies expected_fixes directly to verify the environment + grader work e2e.
 
4
  """
5
 
6
 
@@ -22,6 +23,17 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
22
  break
23
  file_path = fix["file"]
24
  if file_path not in env.current_files:
 
 
25
  continue
26
 
27
  current_content = env.current_files[file_path].content
@@ -50,18 +62,22 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
50
  )],
51
  )
52
  else:
53
- # Find the line that's closest to expected but wrong
54
  best_line = None
55
  best_idx = None
 
56
  for i, line in enumerate(lines):
57
  stripped = line.strip()
58
  exp_stripped = expected.strip()
59
- # Check if this line is a broken version of expected
60
- if (stripped and exp_stripped and
61
- len(set(stripped) & set(exp_stripped)) > len(exp_stripped) * 0.3):
62
- if best_line is None:
63
- best_line = line
64
- best_idx = i
 
 
 
65
 
66
  if best_line is not None:
67
  action = Action(
@@ -115,12 +131,12 @@ def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_
115
  return run_grader(task_id, env.trajectory)
116
 
117
 
118
- def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int = 1) -> List[GraderResult]:
119
  """Run baseline episodes across tasks.
120
 
121
  Args:
122
  task_id: Specific task to run, or None for all tasks.
123
- num_episodes: Number of episodes per task.
124
 
125
  Returns:
126
  List of GraderResult for each episode.
@@ -137,13 +153,11 @@ def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int = 1)
137
  for tid in task_ids:
138
  task_cls = TASK_REGISTRY[tid]
139
  scenarios = task_cls.SCENARIOS
140
- episodes_run = 0
141
- for scenario in scenarios:
142
- if episodes_run >= num_episodes:
143
  break
144
  env = CloudNativeDebugEnvironment()
145
  result = _heuristic_episode(env, tid, scenario["id"])
146
  results.append(result)
147
- episodes_run += 1
148
 
149
  return results
 
1
  """Heuristic baseline runner for the /baseline endpoint.
2
 
3
  Applies expected_fixes directly to verify the environment + grader work e2e.
4
+ By default runs ALL scenarios of ALL tasks for deterministic, reproducible evaluation.
5
  """
6
 
7
 
 
23
  break
24
  file_path = fix["file"]
25
  if file_path not in env.current_files:
26
+ # For fixes that require creating a new file (e.g. ConfigMap),
27
+ # create it with the expected content
28
+ if fix["type"] == "contains":
29
+ action = Action(
30
+ action_type=ActionType.EDIT_FILE,
31
+ edits=[FileEdit(
32
+ file_path=file_path,
33
+ new_content=fix["expected"],
34
+ )],
35
+ )
36
+ env.step(action)
37
  continue
38
 
39
  current_content = env.current_files[file_path].content
 
62
  )],
63
  )
64
  else:
65
+ # Find the line with the highest character overlap with the expected fix
66
  best_line = None
67
  best_idx = None
68
+ best_score = 0
69
  for i, line in enumerate(lines):
70
  stripped = line.strip()
71
  exp_stripped = expected.strip()
72
+ if not stripped or not exp_stripped:
73
+ continue
74
+ overlap = len(set(stripped) & set(exp_stripped))
75
+ # Use ratio of overlap to max length for scoring
76
+ score = overlap / max(len(exp_stripped), len(stripped))
77
+ if score > 0.5 and score > best_score:
78
+ best_line = line
79
+ best_idx = i
80
+ best_score = score
81
 
82
  if best_line is not None:
83
  action = Action(
 
131
  return run_grader(task_id, env.trajectory)
132
 
133
 
134
+ def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: Optional[int] = None) -> List[GraderResult]:
135
  """Run baseline episodes across tasks.
136
 
137
  Args:
138
  task_id: Specific task to run, or None for all tasks.
139
+ num_episodes: Max scenarios per task. None = run ALL scenarios (default).
140
 
141
  Returns:
142
  List of GraderResult for each episode.
 
153
  for tid in task_ids:
154
  task_cls = TASK_REGISTRY[tid]
155
  scenarios = task_cls.SCENARIOS
156
+ for idx, scenario in enumerate(scenarios):
157
+ if num_episodes is not None and idx >= num_episodes:
 
158
  break
159
  env = CloudNativeDebugEnvironment()
160
  result = _heuristic_episode(env, tid, scenario["id"])
161
  results.append(result)
 
162
 
163
  return results
inference.py CHANGED
@@ -1,13 +1,44 @@
1
- """Baseline inference script for Cloud-Native Debug Environment.
2
-
3
- Uses OpenAI-compatible client to call Llama 3.1 70B via HuggingFace router.
4
- Required by OpenEnv specification.
5
-
6
- Usage:
7
- export API_BASE_URL=https://router.huggingface.co/v1
8
- export MODEL_NAME=meta-llama/Llama-3.1-70B-Instruct
9
- export HF_TOKEN=your_token_here
10
- python inference.py
 
 
 
11
  """
12
 
13
 
@@ -22,12 +53,14 @@ import requests
22
  from openai import OpenAI
23
 
24
 
25
- API_BASE_URL = os.getenv("API_BASE_URL", "https://router.huggingface.co/v1")
26
- MODEL_NAME = os.getenv("MODEL_NAME", "meta-llama/Llama-3.1-70B-Instruct")
27
- HF_TOKEN = os.getenv("HF_TOKEN")
28
  ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
29
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
 
30
  MAX_STEPS = 8 # leave 2 steps buffer before env hard-limit of 10
 
31
 
32
  SYSTEM_PROMPT = """You are an expert DevOps engineer debugging cloud-native deployment pipelines.
33
  You will receive broken Dockerfile, GitHub Actions workflow, and/or Kubernetes manifest files along with error messages.
@@ -79,11 +112,37 @@ Rules:
79
  - Always respond with valid JSON only, no markdown fences"""
80
 
81
 
 
 
82
  def create_client() -> OpenAI:
83
  """Create OpenAI-compatible client for HuggingFace router."""
84
  return OpenAI(
85
  base_url=API_BASE_URL,
86
- api_key=HF_TOKEN,
87
  )
88
 
89
 
@@ -188,130 +247,144 @@ def run_episode(client: OpenAI, task_id: Optional[str] = None, scenario_id: Opti
188
  if scenario_id:
189
  reset_payload["scenario_id"] = scenario_id
190
 
191
- reset_resp = env_request("POST", "/reset", reset_payload)
192
- obs = reset_resp["observation"]
193
- info = reset_resp.get("info", {})
194
-
195
- actual_task_id = info.get("task_id", task_id or "unknown")
196
- actual_scenario_id = info.get("scenario_id", scenario_id or "unknown")
197
-
198
- print(f"[START] task_id={actual_task_id} scenario_id={actual_scenario_id}")
199
 
200
- messages = [{"role": "system", "content": SYSTEM_PROMPT}]
201
  trajectory = []
202
- total_steps = 0
 
 
 
203
 
204
- for step_num in range(MAX_STEPS):
205
- user_msg = format_observation(obs)
206
- messages.append({"role": "user", "content": user_msg})
207
-
208
- try:
209
- completion = client.chat.completions.create(
210
- model=MODEL_NAME,
211
- messages=messages,
212
- temperature=0.1,
213
- max_tokens=1024,
 
 
214
  )
215
- llm_text = completion.choices[0].message.content or '{"action": "submit"}'
216
- except Exception as e:
217
- print(f"[STEP] step={step_num + 1} action=error reward=0.00 done=false issues_fixed=0 issues_total=0 error={e}")
218
- llm_text = '{"action": "submit"}'
219
-
220
- messages.append({"role": "assistant", "content": llm_text})
221
-
222
- parsed = parse_llm_response(llm_text)
223
- action = build_action(parsed)
224
-
225
- step_resp = env_request("POST", "/step", {"action": action})
226
- obs = step_resp["observation"]
227
- reward = step_resp.get("reward", 0.0)
228
- done = step_resp.get("done", False)
229
- step_info = step_resp.get("info", {})
230
- total_steps = step_num + 1
231
-
232
- issues_fixed = step_info.get("issues_fixed", 0)
233
- issues_total = step_info.get("issues_total", 0)
234
-
235
- print(f"[STEP] step={total_steps} action={action['action_type']} reward={reward:.2f} done={str(done).lower()} issues_fixed={issues_fixed} issues_total={issues_total}")
236
-
237
- trajectory.append({
238
- "step": total_steps,
239
- "action": action,
240
- "reward": reward,
241
- "done": done,
242
- "info": step_info,
243
- })
244
 
245
- if done:
246
- break
 
 
247
 
248
- # Grade the trajectory
249
- grade_resp = env_request("POST", "/grader", {
250
- "task_id": actual_task_id,
251
- "trajectory": trajectory,
252
- })
253
- result = grade_resp.get("result", {})
254
- score = result.get("score", 0.0)
255
 
256
- print(f"[END] task_id={actual_task_id} scenario_id={actual_scenario_id} score={score:.3f} steps={total_steps}")
257
- return result
258
 
259
 
260
  def run_all_tasks(client: OpenAI) -> Dict[str, float]:
261
- """Run baseline on all tasks and report scores."""
262
- tasks_resp = env_request("GET", "/tasks")
263
- tasks = tasks_resp.get("tasks", [])
 
 
 
264
 
265
  scores: Dict[str, List[float]] = {}
266
 
267
- for task in tasks:
268
- task_id = task["id"]
269
- print(f"\n{'='*60}")
270
- print(f"Task: {task['name']} ({task['difficulty']})")
271
- print(f"{'='*60}")
272
-
273
  task_scores = []
274
- # Run one episode per task for baseline
275
- result = run_episode(client, task_id=task_id)
276
- task_scores.append(result.get("score", 0.0))
 
 
277
  scores[task_id] = task_scores
278
 
279
  # Summary
280
- print(f"\n{'='*60}")
281
- print("BASELINE RESULTS SUMMARY")
282
- print(f"{'='*60}")
283
  avg_scores = {}
284
  for task_id, task_scores in scores.items():
285
  avg = sum(task_scores) / len(task_scores) if task_scores else 0.0
286
  avg_scores[task_id] = avg
287
- print(f" {task_id:40s} {avg:.3f}")
288
 
289
  overall = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0.0
290
- print(f" {'OVERALL':40s} {overall:.3f}")
291
 
292
  return avg_scores
293
 
294
 
295
  def main():
296
  """Entry point for baseline inference."""
297
- print("Cloud-Native Debug Environment - Baseline Inference")
298
- print(f"API: {API_BASE_URL}")
299
- print(f"Model: {MODEL_NAME}")
300
- print(f"Environment: {ENV_URL}")
301
-
302
- if not HF_TOKEN:
303
- print("\nWARNING: HF_TOKEN not set. Set it via: export HF_TOKEN=your_token_here")
304
- print("Continuing anyway (will fail if auth is required)...\n")
305
 
306
  # Verify environment is running
307
  try:
308
  health = env_request("GET", "/health")
309
- print(f"Environment status: {health.get('status', 'unknown')}\n")
310
  except Exception as e:
311
- print(f"\nERROR: Cannot connect to environment at {ENV_URL}")
312
- print(f" {e}")
313
- print("\nStart the server first:")
314
- print(" python -m uvicorn server.app:app --host 0.0.0.0 --port 8000")
315
  sys.exit(1)
316
 
317
  client = create_client()
 
1
+ """
2
+ Inference Script for Cloud-Native Debug Environment
3
+ ===================================
4
+ MANDATORY
5
+ - Before submitting, ensure the following variables are defined in your environment configuration:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
10
+ method
11
+
12
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
13
+ (and should reflect your active inference setup):
14
+ API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
15
+ MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
16
+
17
+ - The inference script must be named `inference.py` and placed in the root directory of the project
18
+ - Participants must use the OpenAI client for all LLM calls, using the variables above
19
+
20
+ STDOUT FORMAT
21
+ - The script must emit exactly three line types to stdout, in this order:
22
+
23
+ [START] task=<task_name> env=<benchmark> model=<model_name>
24
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
25
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
26
+
27
+ Rules:
28
+ - One [START] line at episode begin.
29
+ - One [STEP] line per step, immediately after env.step() returns.
30
+ - One [END] line after the episode completes, always emitted (even on exception).
31
+ - reward and rewards are formatted to 2 decimal places.
32
+ - done and success are lowercase booleans: true or false.
33
+ - error is the raw error string, or null if none.
34
+ - All fields on a single line with no newlines within a line.
35
+ - Each task should return a score in [0, 1]
36
+
37
+ Example:
38
+ [START] task=dockerfile_syntax env=cloud_native_devops model=meta-llama/Llama-3.1-70B-Instruct
39
+ [STEP] step=1 action=edit_file reward=0.30 done=false error=null
40
+ [STEP] step=2 action=submit reward=0.00 done=true error=null
41
+ [END] success=true steps=2 score=0.850 rewards=0.30,0.00
42
  """
43
 
44
 
 
53
  from openai import OpenAI
54
 
55
 
56
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
57
+ MODEL_NAME = os.getenv("MODEL_NAME") or "meta-llama/Llama-3.1-70B-Instruct"
58
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
59
  ENV_URL = os.getenv("ENV_URL", "http://localhost:8000")
60
  LOCAL_IMAGE_NAME = os.getenv("LOCAL_IMAGE_NAME")
61
+ BENCHMARK = "cloud_native_devops"
62
  MAX_STEPS = 8 # leave 2 steps buffer before env hard-limit of 10
63
+ SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
64
 
65
  SYSTEM_PROMPT = """You are an expert DevOps engineer debugging cloud-native deployment pipelines.
66
  You will receive broken Dockerfile, GitHub Actions workflow, and/or Kubernetes manifest files along with error messages.
 
112
  - Always respond with valid JSON only, no markdown fences"""
113
 
114
 
115
+ # ---------------------------------------------------------------------------
116
+ # Logging helpers (mandatory stdout format)
117
+ # ---------------------------------------------------------------------------
118
+
119
+ def log_start(task: str, env: str, model: str) -> None:
120
+ print(f"[START] task={task} env={env} model={model}", flush=True)
121
+
122
+
123
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
124
+ error_val = error if error else "null"
125
+ done_val = str(done).lower()
126
+ print(
127
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
128
+ flush=True,
129
+ )
130
+
131
+
132
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
133
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
134
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
135
+
136
+
137
+ # ---------------------------------------------------------------------------
138
+ # Client / env helpers
139
+ # ---------------------------------------------------------------------------
140
+
141
  def create_client() -> OpenAI:
142
  """Create OpenAI-compatible client for HuggingFace router."""
143
  return OpenAI(
144
  base_url=API_BASE_URL,
145
+ api_key=API_KEY,
146
  )
147
 
148
 
 
247
  if scenario_id:
248
  reset_payload["scenario_id"] = scenario_id
249
 
250
+ # Best-effort task name for the [START] log line
251
+ target_task = task_id or "random_task"
252
+ log_start(task=target_task, env=BENCHMARK, model=MODEL_NAME)
 
 
 
 
 
253
 
 
254
  trajectory = []
255
+ rewards: List[float] = []
256
+ steps_taken = 0
257
+ score = 0.0
258
+ success = False
259
 
260
+ try:
261
+ reset_resp = env_request("POST", "/reset", reset_payload)
262
+ obs = reset_resp["observation"]
263
+ info = reset_resp.get("info", {})
264
+
265
+ actual_task_id = info.get("task_id", target_task)
266
+ actual_scenario_id = info.get("scenario_id", scenario_id or "unknown")
267
+
268
+ messages = [{"role": "system", "content": SYSTEM_PROMPT}]
269
+ for step_num in range(1, MAX_STEPS + 1):
270
+ user_msg = format_observation(obs)
271
+ messages.append({"role": "user", "content": user_msg})
272
+
273
+ error_msg: Optional[str] = None
274
+
275
+ try:
276
+ completion = client.chat.completions.create(
277
+ model=MODEL_NAME,
278
+ messages=messages,
279
+ temperature=0.1,
280
+ max_tokens=1024,
281
+ )
282
+ llm_text = completion.choices[0].message.content or '{"action": "submit"}'
283
+ except Exception as e:
284
+ error_msg = str(e)
285
+ print(f"[DEBUG] Model request failed: {e}", flush=True)
286
+ llm_text = '{"action": "submit"}'
287
+
288
+ messages.append({"role": "assistant", "content": llm_text})
289
+
290
+ parsed = parse_llm_response(llm_text)
291
+ action = build_action(parsed)
292
+
293
+ step_resp = env_request("POST", "/step", {"action": action})
294
+ obs = step_resp["observation"]
295
+ reward = step_resp.get("reward", 0.0)
296
+ done = step_resp.get("done", False)
297
+ step_info = step_resp.get("info", {})
298
+ steps_taken = step_num
299
+
300
+ rewards.append(reward)
301
+
302
+ log_step(
303
+ step=step_num,
304
+ action=action["action_type"],
305
+ reward=reward,
306
+ done=done,
307
+ error=error_msg,
308
  )
 
 
309
 
310
+ trajectory.append({
311
+ "step": step_num,
312
+ "action": action,
313
+ "reward": reward,
314
+ "done": done,
315
+ "info": step_info,
316
+ })
317
+
318
+ if done:
319
+ break
320
+
321
+ # Grade the trajectory
322
+ grade_resp = env_request("POST", "/grader", {
323
+ "task_id": actual_task_id,
324
+ "trajectory": trajectory,
325
+ })
326
+ result = grade_resp.get("result", {})
327
+ score = result.get("score", 0.0)
328
+ score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
329
+ success = score >= SUCCESS_SCORE_THRESHOLD
330
 
331
+ finally:
332
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
 
 
 
 
 
333
 
334
+ return {"score": score, "success": success, "steps": steps_taken, "rewards": rewards}
 
335
 
336
 
337
  def run_all_tasks(client: OpenAI) -> Dict[str, float]:
338
+ """Run baseline on all tasks (and ALL their scenarios) and report scores."""
339
+ try:
340
+ from server.tasks.task_registry import TASK_REGISTRY
341
+ except ImportError as e:
342
+ print(f"[DEBUG] Could not import TASK_REGISTRY: {e}", flush=True)
343
+ return {}
344
 
345
  scores: Dict[str, List[float]] = {}
346
 
347
+ for task_id, task_cls in TASK_REGISTRY.items():
 
 
 
 
 
348
  task_scores = []
349
+
350
+ # Iterate over all exact scenarios for this task
351
+ scenarios = task_cls.SCENARIOS
352
+ for scenario in scenarios:
353
+ scenario_id = scenario["id"]
354
+ result = run_episode(client, task_id=task_id, scenario_id=scenario_id)
355
+ task_scores.append(result.get("score", 0.0))
356
+
357
  scores[task_id] = task_scores
358
 
359
  # Summary
360
+ print(f"\n[DEBUG] {'='*60}", flush=True)
361
+ print("[DEBUG] BASELINE RESULTS SUMMARY", flush=True)
362
+ print(f"[DEBUG] {'='*60}", flush=True)
363
  avg_scores = {}
364
  for task_id, task_scores in scores.items():
365
  avg = sum(task_scores) / len(task_scores) if task_scores else 0.0
366
  avg_scores[task_id] = avg
367
+ print(f"[DEBUG] {task_id:40s} {avg:.3f}", flush=True)
368
 
369
  overall = sum(avg_scores.values()) / len(avg_scores) if avg_scores else 0.0
370
+ print(f"[DEBUG] {'OVERALL':40s} {overall:.3f}", flush=True)
371
 
372
  return avg_scores
373
 
374
 
375
  def main():
376
  """Entry point for baseline inference."""
377
+ if not API_KEY:
378
+ print("[DEBUG] WARNING: HF_TOKEN not set. Set it via: export HF_TOKEN=your_token_here", flush=True)
379
+ print("[DEBUG] Continuing anyway (will fail if auth is required)...", flush=True)
 
 
 
 
 
380
 
381
  # Verify environment is running
382
  try:
383
  health = env_request("GET", "/health")
384
+ print(f"[DEBUG] Environment status: {health.get('status', 'unknown')}", flush=True)
385
  except Exception as e:
386
+ print(f"[DEBUG] Cannot connect to environment at {ENV_URL}: {e}", flush=True)
387
+ print("[DEBUG] Start the server first: python -m uvicorn server.app:app --host 0.0.0.0 --port 8000", flush=True)
 
 
388
  sys.exit(1)
389
 
390
  client = create_client()
sample_scripts/sample_inf_script.py ADDED
@@ -0,0 +1,188 @@
 
 
1
+ """
2
+ Inference Script Example
3
+ ===================================
4
+ MANDATORY
5
+ - Before submitting, ensure the following variables are defined in your environment configuration:
6
+ API_BASE_URL The API endpoint for the LLM.
7
+ MODEL_NAME The model identifier to use for inference.
8
+ HF_TOKEN Your Hugging Face / API key.
9
+ LOCAL_IMAGE_NAME The name of the local image to use for the environment if you are using from_docker_image()
10
+ method
11
+
12
+ - Defaults are set only for API_BASE_URL and MODEL_NAME
13
+ (and should reflect your active inference setup):
14
+ API_BASE_URL = os.getenv("API_BASE_URL", "<your-active-endpoint>")
15
+ MODEL_NAME = os.getenv("MODEL_NAME", "<your-active-model>")
16
+
17
+ - The inference script must be named `inference.py` and placed in the root directory of the project
18
+ - Participants must use the OpenAI client for all LLM calls, using the variables above
19
+
20
+ STDOUT FORMAT
21
+ - The script must emit exactly three line types to stdout, in this order:
22
+
23
+ [START] task=<task_name> env=<benchmark> model=<model_name>
24
+ [STEP] step=<n> action=<action_str> reward=<0.00> done=<true|false> error=<msg|null>
25
+ [END] success=<true|false> steps=<n> score=<score> rewards=<r1,r2,...,rn>
26
+
27
+ Rules:
28
+ - One [START] line at episode begin.
29
+ - One [STEP] line per step, immediately after env.step() returns.
30
+ - One [END] line after env.close(), always emitted (even on exception).
31
+ - reward and rewards are formatted to 2 decimal places.
32
+ - done and success are lowercase booleans: true or false.
33
+ - error is the raw last_action_error string, or null if none.
34
+ - All fields on a single line with no newlines within a line.
35
+ - Each task should return a score in [0, 1]
36
+
37
+ Example:
38
+ [START] task=click-test env=miniwob model=Qwen3-VL-30B
39
+ [STEP] step=1 action=click('123') reward=0.00 done=false error=null
40
+ [STEP] step=2 action=fill('456','text') reward=0.00 done=false error=null
41
+ [STEP] step=3 action=click('789') reward=1.00 done=true error=null
42
+ [END] success=true steps=3 score=1.00 rewards=0.00,0.00,1.00
43
+ """
44
+
45
+ import asyncio
46
+ import os
47
+ import textwrap
48
+ from typing import List, Optional
49
+
50
+ from openai import OpenAI
51
+
52
+ from my_env_v4 import MyEnvV4Action, MyEnvV4Env
53
+ IMAGE_NAME = os.getenv("IMAGE_NAME") # If you are using docker image
54
+ API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
55
+
56
+ API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
57
+ MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
58
+ TASK_NAME = os.getenv("MY_ENV_V4_TASK", "echo")
59
+ BENCHMARK = os.getenv("MY_ENV_V4_BENCHMARK", "my_env_v4")
60
+ MAX_STEPS = 8
61
+ TEMPERATURE = 0.7
62
+ MAX_TOKENS = 150
63
+ SUCCESS_SCORE_THRESHOLD = 0.1 # normalized score in [0, 1]
64
+
65
+ # Max possible reward: each token contributes 0.1, across all steps
66
+ _MAX_REWARD_PER_STEP = MAX_TOKENS * 0.1
67
+ MAX_TOTAL_REWARD = MAX_STEPS * _MAX_REWARD_PER_STEP
68
+
69
+ SYSTEM_PROMPT = textwrap.dedent(
70
+ """
71
+ You are interacting with a simple echo environment.
72
+ Each turn you must send a message. The environment will echo it back.
73
+ Reward is proportional to message length: reward = len(message) * 0.1
74
+ Your goal is to maximize total reward by sending meaningful, substantive messages.
75
+ Reply with exactly one message string β€” no quotes, no prefixes, just the message text.
76
+ """
77
+ ).strip()
78
+
79
+
80
+ def log_start(task: str, env: str, model: str) -> None:
81
+ print(f"[START] task={task} env={env} model={model}", flush=True)
82
+
83
+
84
+ def log_step(step: int, action: str, reward: float, done: bool, error: Optional[str]) -> None:
85
+ error_val = error if error else "null"
86
+ done_val = str(done).lower()
87
+ print(
88
+ f"[STEP] step={step} action={action} reward={reward:.2f} done={done_val} error={error_val}",
89
+ flush=True,
90
+ )
91
+
92
+
93
+ def log_end(success: bool, steps: int, score: float, rewards: List[float]) -> None:
94
+ rewards_str = ",".join(f"{r:.2f}" for r in rewards)
95
+ print(f"[END] success={str(success).lower()} steps={steps} score={score:.3f} rewards={rewards_str}", flush=True)
96
+
97
+
98
+ def build_user_prompt(step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
99
+ history_block = "\n".join(history[-4:]) if history else "None"
100
+ return textwrap.dedent(
101
+ f"""
102
+ Step: {step}
103
+ Last echoed message: {last_echoed!r}
104
+ Last reward: {last_reward:.2f}
105
+ Previous steps:
106
+ {history_block}
107
+ Send your next message.
108
+ """
109
+ ).strip()
110
+
111
+
112
+ def get_model_message(client: OpenAI, step: int, last_echoed: str, last_reward: float, history: List[str]) -> str:
113
+ user_prompt = build_user_prompt(step, last_echoed, last_reward, history)
114
+ try:
115
+ completion = client.chat.completions.create(
116
+ model=MODEL_NAME,
117
+ messages=[
118
+ {"role": "system", "content": SYSTEM_PROMPT},
119
+ {"role": "user", "content": user_prompt},
120
+ ],
121
+ temperature=TEMPERATURE,
122
+ max_tokens=MAX_TOKENS,
123
+ stream=False,
124
+ )
125
+ text = (completion.choices[0].message.content or "").strip()
126
+ return text if text else "hello"
127
+ except Exception as exc:
128
+ print(f"[DEBUG] Model request failed: {exc}", flush=True)
129
+ return "hello"
130
+
131
+
132
+ async def main() -> None:
133
+ client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
134
+
135
+ env = await MyEnvV4Env.from_docker_image(IMAGE_NAME)
136
+
137
+ history: List[str] = []
138
+ rewards: List[float] = []
139
+ steps_taken = 0
140
+ score = 0.0
141
+ success = False
142
+
143
+ log_start(task=TASK_NAME, env=BENCHMARK, model=MODEL_NAME)
144
+
145
+ try:
146
+ result = await env.reset() # OpenENV.reset()
147
+ last_echoed = result.observation.echoed_message
148
+ last_reward = 0.0
149
+
150
+ for step in range(1, MAX_STEPS + 1):
151
+ if result.done:
152
+ break
153
+
154
+ message = get_model_message(client, step, last_echoed, last_reward, history)
155
+
156
+ result = await env.step(MyEnvV4Action(message=message))
157
+ obs = result.observation
158
+
159
+ reward = result.reward or 0.0
160
+ done = result.done
161
+ error = None
162
+
163
+ rewards.append(reward)
164
+ steps_taken = step
165
+ last_echoed = obs.echoed_message
166
+ last_reward = reward
167
+
168
+ log_step(step=step, action=message, reward=reward, done=done, error=error)
169
+
170
+ history.append(f"Step {step}: {message!r} -> reward {reward:+.2f}")
171
+
172
+ if done:
173
+ break
174
+
175
+ score = sum(rewards) / MAX_TOTAL_REWARD if MAX_TOTAL_REWARD > 0 else 0.0
176
+ score = min(max(score, 0.0), 1.0) # clamp to [0, 1]
177
+ success = score >= SUCCESS_SCORE_THRESHOLD
178
+
179
+ finally:
180
+ try:
181
+ await env.close()
182
+ except Exception as e:
183
+ print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)
184
+ log_end(success=success, steps=steps_taken, score=score, rewards=rewards)
185
+
186
+
187
+ if __name__ == "__main__":
188
+ asyncio.run(main())
sample_scripts/sample_val_script.txt ADDED
@@ -0,0 +1,185 @@
 
 
1
+ #!/usr/bin/env bash
2
+ #
3
+ # validate-submission.sh β€” OpenEnv Submission Validator
4
+ #
5
+ # Checks that your HF Space is live, Docker image builds, and openenv validate passes.
6
+ #
7
+ # Prerequisites:
8
+ # - Docker: https://docs.docker.com/get-docker/
9
+ # - openenv-core: pip install openenv-core
10
+ # - curl (usually pre-installed)
11
+ #
12
+ # Run:
13
+ # curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
14
+ #
15
+ # Or download and run locally:
16
+ # chmod +x validate-submission.sh
17
+ # ./validate-submission.sh <ping_url> [repo_dir]
18
+ #
19
+ # Arguments:
20
+ # ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)
21
+ # repo_dir Path to your repo (default: current directory)
22
+ #
23
+ # Examples:
24
+ # ./validate-submission.sh https://my-team.hf.space
25
+ # ./validate-submission.sh https://my-team.hf.space ./my-repo
26
+ #
27
+
28
+ set -uo pipefail
29
+
30
+ DOCKER_BUILD_TIMEOUT=600
31
+ if [ -t 1 ]; then
32
+ RED='\033[0;31m'
33
+ GREEN='\033[0;32m'
34
+ YELLOW='\033[1;33m'
35
+ BOLD='\033[1m'
36
+ NC='\033[0m'
37
+ else
38
+ RED='' GREEN='' YELLOW='' BOLD='' NC=''
39
+ fi
40
+
41
+ run_with_timeout() {
42
+ local secs="$1"; shift
43
+ if command -v timeout &>/dev/null; then
44
+ timeout "$secs" "$@"
45
+ elif command -v gtimeout &>/dev/null; then
46
+ gtimeout "$secs" "$@"
47
+ else
48
+ "$@" &
49
+ local pid=$!
50
+ ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
51
+ local watcher=$!
52
+ wait "$pid" 2>/dev/null
53
+ local rc=$?
54
+ kill "$watcher" 2>/dev/null
55
+ wait "$watcher" 2>/dev/null
56
+ return $rc
57
+ fi
58
+ }
59
+
60
+ portable_mktemp() {
61
+ local prefix="${1:-validate}"
62
+ mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
63
+ }
64
+
65
+ CLEANUP_FILES=()
66
+ cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
67
+ trap cleanup EXIT
68
+
69
+ PING_URL="${1:-}"
70
+ REPO_DIR="${2:-.}"
71
+
72
+ if [ -z "$PING_URL" ]; then
73
+ printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
74
+ printf "\n"
75
+ printf " ping_url Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
76
+ printf " repo_dir Path to your repo (default: current directory)\n"
77
+ exit 1
78
+ fi
79
+
80
+ if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
81
+ printf "Error: directory '%s' not found\n" "${2:-.}"
82
+ exit 1
83
+ fi
84
+ PING_URL="${PING_URL%/}"
85
+ export PING_URL
86
+ PASS=0
87
+
88
+ log() { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
89
+ pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
90
+ fail() { log "${RED}FAILED${NC} -- $1"; }
91
+ hint() { printf " ${YELLOW}Hint:${NC} %b\n" "$1"; }
92
+ stop_at() {
93
+ printf "\n"
94
+ printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
95
+ exit 1
96
+ }
97
+
98
+ printf "\n"
99
+ printf "${BOLD}========================================${NC}\n"
100
+ printf "${BOLD} OpenEnv Submission Validator${NC}\n"
101
+ printf "${BOLD}========================================${NC}\n"
102
+ log "Repo: $REPO_DIR"
103
+ log "Ping URL: $PING_URL"
104
+ printf "\n"
105
+
106
+ log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
107
+
108
+ CURL_OUTPUT=$(portable_mktemp "validate-curl")
109
+ CLEANUP_FILES+=("$CURL_OUTPUT")
110
+ HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
111
+ -H "Content-Type: application/json" -d '{}' \
112
+ "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
113
+
114
+ if [ "$HTTP_CODE" = "200" ]; then
115
+ pass "HF Space is live and responds to /reset"
116
+ elif [ "$HTTP_CODE" = "000" ]; then
117
+ fail "HF Space not reachable (connection failed or timed out)"
118
+ hint "Check your network connection and that the Space is running."
119
+ hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
120
+ stop_at "Step 1"
121
+ else
122
+ fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
123
+ hint "Make sure your Space is running and the URL is correct."
124
+ hint "Try opening $PING_URL in your browser first."
125
+ stop_at "Step 1"
126
+ fi
127
+
128
+ log "${BOLD}Step 2/3: Running docker build${NC} ..."
129
+
130
+ if ! command -v docker &>/dev/null; then
131
+ fail "docker command not found"
132
+ hint "Install Docker: https://docs.docker.com/get-docker/"
133
+ stop_at "Step 2"
134
+ fi
135
+
136
+ if [ -f "$REPO_DIR/Dockerfile" ]; then
137
+ DOCKER_CONTEXT="$REPO_DIR"
138
+ elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
139
+ DOCKER_CONTEXT="$REPO_DIR/server"
140
+ else
141
+ fail "No Dockerfile found in repo root or server/ directory"
142
+ stop_at "Step 2"
143
+ fi
144
+
145
+ log " Found Dockerfile in $DOCKER_CONTEXT"
146
+
147
+ BUILD_OK=false
148
+ BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
149
+
150
+ if [ "$BUILD_OK" = true ]; then
151
+ pass "Docker build succeeded"
152
+ else
153
+ fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
154
+ printf "%s\n" "$BUILD_OUTPUT" | tail -20
155
+ stop_at "Step 2"
156
+ fi
157
+
158
+ log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
159
+
160
+ if ! command -v openenv &>/dev/null; then
161
+ fail "openenv command not found"
162
+ hint "Install it: pip install openenv-core"
163
+ stop_at "Step 3"
164
+ fi
165
+
166
+ VALIDATE_OK=false
167
+ VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
168
+
169
+ if [ "$VALIDATE_OK" = true ]; then
170
+ pass "openenv validate passed"
171
+ [ -n "$VALIDATE_OUTPUT" ] && log " $VALIDATE_OUTPUT"
172
+ else
173
+ fail "openenv validate failed"
174
+ printf "%s\n" "$VALIDATE_OUTPUT"
175
+ stop_at "Step 3"
176
+ fi
177
+
178
+ printf "\n"
179
+ printf "${BOLD}========================================${NC}\n"
180
+ printf "${GREEN}${BOLD} All 3/3 checks passed!${NC}\n"
181
+ printf "${GREEN}${BOLD} Your submission is ready to submit.${NC}\n"
182
+ printf "${BOLD}========================================${NC}\n"
183
+ printf "\n"
184
+
185
+ exit 0
server/environment.py CHANGED
@@ -71,15 +71,30 @@ class CloudNativeDebugEnvironment:
71
  return None
72
 
73
  def _validation_snapshot(self) -> Dict[str, bool]:
 
74
  docker_result = self.docker_sim.validate(self.current_files.get("Dockerfile"), self.current_files)
75
  workflow_file = self._find_workflow_file()
76
  workflow_result = self.workflow_sim.validate(workflow_file, self.current_files)
77
  k8s_result = self.k8s_sim.validate(self.current_files)
78
- return {
79
- "docker_build_valid": bool(docker_result.get("build_success", False)),
80
- "workflow_parse_valid": bool(workflow_result.get("parse_success", False)),
81
- "k8s_valid": bool(k8s_result.get("valid", True)),
82
- }
 
 
83
 
84
  def __init__(self):
85
  self.docker_sim = DockerSimulator()
@@ -146,7 +161,13 @@ class CloudNativeDebugEnvironment:
146
  )
147
 
148
  self.expected_fixes = scenario["expected_fixes"]
149
- self.issues_total = len(self.expected_fixes)
150
  self.issues_fixed = 0
151
 
152
  self.step_count = 0
@@ -200,8 +221,6 @@ class CloudNativeDebugEnvironment:
200
  self.last_action_success = False
201
  return 0.0, "No edits provided"
202
 
203
- before_validation = self._validation_snapshot()
204
-
205
  reward = 0.0
206
  feedbacks: List[str] = []
207
  applied_count = 0
@@ -288,17 +307,6 @@ class CloudNativeDebugEnvironment:
288
 
289
  reward += self._check_fix_progress()
290
 
291
- after_validation = self._validation_snapshot()
292
- if not before_validation["docker_build_valid"] and after_validation["docker_build_valid"]:
293
- reward += 0.1
294
- feedbacks.append("Docker build validity improved")
295
- if not before_validation["workflow_parse_valid"] and after_validation["workflow_parse_valid"]:
296
- reward += 0.1
297
- feedbacks.append("Workflow parse validity improved")
298
- if not before_validation["k8s_valid"] and after_validation["k8s_valid"]:
299
- reward += 0.1
300
- feedbacks.append("Kubernetes manifest validity improved")
301
-
302
  if applied_count == 0:
303
  self.last_action_success = False
304
  return max(-0.02, reward - 0.02), "; ".join(feedbacks) or "No edit applied"
@@ -307,30 +315,21 @@ class CloudNativeDebugEnvironment:
307
  return max(0.0, reward), "; ".join(feedbacks)
308
 
309
  def _check_fix_progress(self) -> float:
310
- fixes_applied = 0
311
- for fix in self.expected_fixes:
312
- file_path = fix["file"]
313
- if file_path not in self.current_files:
314
- # For "contains" checks on missing files, the fix is not applied
315
- # For "not_contains" checks on missing files, consider it fixed
316
- if fix["type"] == "not_contains":
317
- fixes_applied += 1
318
- continue
319
- current_content = self.current_files[file_path].content
320
- if fix["type"] == "contains" and fix["expected"] in current_content:
321
- fixes_applied += 1
322
- if fix["type"] == "not_contains" and fix["expected"] not in current_content:
323
- fixes_applied += 1
324
- if fix["type"] == "line_equals":
325
- lines = current_content.split("\n")
326
- line_num = int(fix.get("line", 0))
327
- if 1 <= line_num <= len(lines):
328
- if lines[line_num - 1].strip() == str(fix["expected"]).strip():
329
- fixes_applied += 1
330
-
331
- new_fixed = fixes_applied - self.issues_fixed
332
  if new_fixed > 0:
333
- self.issues_fixed = fixes_applied
334
  return 0.3 * new_fixed
335
  return 0.0
336
 
 
71
  return None
72
 
73
  def _validation_snapshot(self) -> Dict[str, bool]:
74
+ """Return a detailed snapshot of all 7 simulator checks."""
75
  docker_result = self.docker_sim.validate(self.current_files.get("Dockerfile"), self.current_files)
76
  workflow_file = self._find_workflow_file()
77
  workflow_result = self.workflow_sim.validate(workflow_file, self.current_files)
78
  k8s_result = self.k8s_sim.validate(self.current_files)
79
+
80
+ has_docker = "Dockerfile" in self.current_files
81
+ has_workflow = workflow_file is not None
82
+ has_k8s = any(fc.file_type == FileType.KUBERNETES for fc in self.current_files.values())
83
+
84
+ snapshot: Dict[str, bool] = {}
85
+ if has_docker:
86
+ snapshot["docker_build_valid"] = bool(docker_result.get("build_success", False))
87
+ snapshot["docker_run_valid"] = bool(docker_result.get("run_success", False))
88
+ if has_workflow:
89
+ snapshot["workflow_parse_valid"] = bool(workflow_result.get("parse_success", False))
90
+ snapshot["workflow_exec_valid"] = bool(workflow_result.get("execution_success", False))
91
+ if has_k8s:
92
+ snapshot["k8s_valid"] = bool(k8s_result.get("valid", True))
93
+ snapshot["k8s_pod_running"] = k8s_result.get("pod_status", "N/A") == "Running"
94
+ svc = k8s_result.get("service_status", "N/A")
95
+ snapshot["k8s_service_active"] = "active" in svc.lower() or svc == "N/A"
96
+
97
+ return snapshot
98
 
99
  def __init__(self):
100
  self.docker_sim = DockerSimulator()
 
161
  )
162
 
163
  self.expected_fixes = scenario["expected_fixes"]
164
+
165
+ # Snapshot the initial broken state from simulators
166
+ self.initial_snapshot = self._validation_snapshot()
167
+ # Count how many checks are initially failing — that's our issues_total
168
+ self.issues_total = sum(1 for v in self.initial_snapshot.values() if not v)
169
+ # Ensure at least 1 issue (the scenario is supposed to be broken)
170
+ self.issues_total = max(1, self.issues_total)
171
  self.issues_fixed = 0
172
 
173
  self.step_count = 0
 
221
  self.last_action_success = False
222
  return 0.0, "No edits provided"
223
 
224
  reward = 0.0
225
  feedbacks: List[str] = []
226
  applied_count = 0
 
307
 
308
  reward += self._check_fix_progress()
309
 
310
  if applied_count == 0:
311
  self.last_action_success = False
312
  return max(-0.02, reward - 0.02), "; ".join(feedbacks) or "No edit applied"
 
315
  return max(0.0, reward), "; ".join(feedbacks)
316
 
317
  def _check_fix_progress(self) -> float:
318
+ """Check fix progress by comparing current simulator state against initial broken state.
319
+
320
+ Counts how many simulator checks flipped from fail→pass since reset.
321
+ """
322
+ current_snapshot = self._validation_snapshot()
323
+
324
+ fixes_now = 0
325
+ for key, initially_passing in self.initial_snapshot.items():
326
+ if not initially_passing and current_snapshot.get(key, False):
327
+ # This check was initially failing and now passes
328
+ fixes_now += 1
329
+
330
+ new_fixed = fixes_now - self.issues_fixed
331
  if new_fixed > 0:
332
+ self.issues_fixed = fixes_now
333
  return 0.3 * new_fixed
334
  return 0.0
335
 
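In short, the diff above replaces per-fix string matching with snapshot comparison: `issues_total` becomes the number of simulator checks failing at reset, and progress is how many of those checks have since flipped to passing. A condensed sketch of that logic, using standalone names rather than the class methods themselves:

```python
# Condensed sketch of the snapshot-based counting introduced above.
# A snapshot maps check names to booleans (True = passing).
from typing import Dict

def count_initial_issues(initial: Dict[str, bool]) -> int:
    # Every check failing at reset is one issue; at least one is assumed broken.
    return max(1, sum(1 for passing in initial.values() if not passing))

def fixes_so_far(initial: Dict[str, bool], current: Dict[str, bool]) -> int:
    # A fix is a check that failed initially and passes now.
    return sum(1 for key, passing in initial.items()
               if not passing and current.get(key, False))

initial = {"docker_build_valid": False, "workflow_parse_valid": True}
assert count_initial_issues(initial) == 1
assert fixes_so_far(initial, {"docker_build_valid": True,
                              "workflow_parse_valid": True}) == 1
```
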
server/graders/__init__.py CHANGED
@@ -37,8 +37,8 @@ DIFFICULTY_MODIFIERS = {
37
  TaskDifficulty.HARD: (0.03, 0.7, 0.75),
38
  }
39
 
40
- SCORE_FLOOR = 0.0
41
- SCORE_CEIL = 1.0
42
 
43
  EDIT_ACTION_TYPES = frozenset({
44
  "edit_file", "replace_line", "add_line",
 
37
  TaskDifficulty.HARD: (0.03, 0.7, 0.75),
38
  }
39
 
40
+ SCORE_FLOOR = 0.01
41
+ SCORE_CEIL = 0.99
42
 
43
  EDIT_ACTION_TYPES = frozenset({
44
  "edit_file", "replace_line", "add_line",
server/models.py CHANGED
@@ -122,7 +122,7 @@ class EnvironmentInfo(BaseModel):
122
 
123
  class GraderResult(BaseModel):
124
  task_id: str
125
- score: float = Field(..., ge=0.0, le=1.0)
126
  max_score: float = 1.0
127
  breakdown: Dict[str, float] = Field(default_factory=dict)
128
  feedback: str = ""
@@ -170,7 +170,7 @@ class GraderResponse(BaseModel):
170
 
171
  class BaselineRequest(BaseModel):
172
  task_id: Optional[str] = None
173
- num_episodes: int = 1
174
 
175
 
176
  class BaselineResponse(BaseModel):
 
122
 
123
  class GraderResult(BaseModel):
124
  task_id: str
125
+ score: float = Field(..., gt=0.0, lt=1.0)
126
  max_score: float = 1.0
127
  breakdown: Dict[str, float] = Field(default_factory=dict)
128
  feedback: str = ""
 
170
 
171
  class BaselineRequest(BaseModel):
172
  task_id: Optional[str] = None
173
+ num_episodes: Optional[int] = None # None = run ALL scenarios
174
 
175
 
176
  class BaselineResponse(BaseModel):
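With `gt`/`lt` in place, pydantic rejects scores of exactly 0.0 or 1.0, which the clamped grader can no longer produce, and `num_episodes: Optional[int] = None` lets a baseline request mean "run every scenario". A small sketch of the validation behaviour, assuming pydantic v2:

```python
# Sketch of the strict-bounds behaviour, assuming pydantic v2 is installed.
from pydantic import BaseModel, Field, ValidationError

class GraderResult(BaseModel):
    task_id: str
    score: float = Field(..., gt=0.0, lt=1.0)

GraderResult(task_id="t1", score=0.99)     # OK: strictly inside (0, 1)

try:
    GraderResult(task_id="t1", score=1.0)  # exactly 1.0 is now rejected
except ValidationError as exc:
    print(exc.errors()[0]["type"])         # "less_than"
```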
server/simulators/docker_simulator.py CHANGED
@@ -39,7 +39,11 @@ class DockerSimulator:
39
  if "*" in source:
40
  prefix = source.replace("*", "")
41
  return any(path.startswith(prefix) for path in context_files)
42
- return source in context_files
43
 
44
  def _join_continuation_lines(self, lines: List[str]) -> List[str]:
45
  """Join lines ending with backslash into single logical lines."""
 
39
  if "*" in source:
40
  prefix = source.replace("*", "")
41
  return any(path.startswith(prefix) for path in context_files)
42
+ # Check exact match or directory prefix match (e.g. "dist/" matches "dist/index.html")
43
+ clean = source.rstrip("/")
44
+ if clean in context_files:
45
+ return True
46
+ return any(path.startswith(clean + "/") or path == clean for path in context_files)
47
 
48
  def _join_continuation_lines(self, lines: List[str]) -> List[str]:
49
  """Join lines ending with backslash into single logical lines."""
server/simulators/k8s_simulator.py CHANGED
@@ -312,7 +312,7 @@ class KubernetesSimulator:
312
  svc_ports = svc.get("spec", {}).get("ports", [])
313
  container_ports = []
314
  for c in dep.get("spec", {}).get("template", {}).get("spec", {}).get("containers", []):
315
- for p in c.get("ports", []):
316
  container_ports.append(p.get("containerPort"))
317
 
318
  for sp in svc_ports:
 
312
  svc_ports = svc.get("spec", {}).get("ports", [])
313
  container_ports = []
314
  for c in dep.get("spec", {}).get("template", {}).get("spec", {}).get("containers", []):
315
+ for p in (c.get("ports") or []):
316
  container_ports.append(p.get("containerPort"))
317
 
318
  for sp in svc_ports:
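The `or []` guard matters because a manifest that literally contains `ports:` with no value parses to `None` rather than a missing key, so `c.get("ports", [])` still returns `None` and the loop crashes. A minimal repro, assuming PyYAML-style parsing:

```python
# Why the `or []` guard matters: a bare `ports:` key parses to None,
# not to a missing key, so .get("ports", []) still returns None.
container = {"name": "api", "ports": None}  # what `ports:` with no value yields

# for p in container.get("ports", []):     # TypeError: NoneType is not iterable
for p in (container.get("ports") or []):   # safely iterates zero times
    print(p.get("containerPort"))
```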
server/simulators/workflow_simulator.py CHANGED
@@ -294,6 +294,135 @@ class WorkflowSimulator:
294
  "exec_error": f"{var} is empty β€” secret not available in shell environment. Map it via env block.",
295
  }
296
 
297
  # node version vs package.json engines
298
  for job_name, job in jobs.items():
299
  if not isinstance(job, dict):
 
294
  "exec_error": f"{var} is empty β€” secret not available in shell environment. Map it via env block.",
295
  }
296
 
297
+ # build-push-action without load: true when the image is used locally in a later step
298
+ for job_name, job in jobs.items():
299
+ if not isinstance(job, dict):
300
+ continue
301
+ steps = job.get("steps", [])
302
+ if not isinstance(steps, list):
303
+ continue
304
+ build_push_idx = None
305
+ build_push_has_load = False
306
+ for idx, step in enumerate(steps):
307
+ if not isinstance(step, dict):
308
+ continue
309
+ uses = step.get("uses", "")
310
+ if isinstance(uses, str) and "docker/build-push-action" in uses:
311
+ build_push_idx = idx
312
+ with_block = step.get("with", {})
313
+ if isinstance(with_block, dict):
314
+ push_val = str(with_block.get("push", "")).lower()
315
+ load_val = str(with_block.get("load", "")).lower()
316
+ build_push_has_load = load_val == "true"
317
+ # Only flag if push is false (local use intended)
318
+ if push_val == "false" and not build_push_has_load:
319
+ # Check if a later step uses docker run
320
+ for later in steps[idx + 1:]:
321
+ if not isinstance(later, dict):
322
+ continue
323
+ run_cmd = later.get("run", "")
324
+ if isinstance(run_cmd, str) and "docker run" in run_cmd:
325
+ return {
326
+ "parse_success": True,
327
+ "execution_success": False,
328
+ "exec_error": (
329
+ "build-push-action with Buildx does not load images into local daemon by default β€” "
330
+ "add 'load: true' to make the image available for docker run"
331
+ ),
332
+ }
333
+
334
+ # registry mismatch between build tag and push command
335
+ for job_name, job in jobs.items():
336
+ if not isinstance(job, dict):
337
+ continue
338
+ steps = job.get("steps", [])
339
+ if not isinstance(steps, list):
340
+ continue
341
+ build_registry = None
342
+ for step in steps:
343
+ if not isinstance(step, dict):
344
+ continue
345
+ run_cmd = step.get("run", "")
346
+ if not isinstance(run_cmd, str):
347
+ continue
348
+ # Extract registry from docker build -t
349
+ build_match = re.search(r'docker build\s+.*-t\s+(\S+)', run_cmd)
350
+ if build_match:
351
+ tag = build_match.group(1)
352
+ if "ghcr.io" in tag:
353
+ build_registry = "ghcr.io"
354
+ elif "docker.io" in tag or "/" in tag:
355
+ # docker.io is default for user/image format
356
+ build_registry = tag.split("/")[0] if "." in tag.split("/")[0] else "docker.io"
357
+ push_match = re.search(r'docker push\s+(\S+)', run_cmd)
358
+ if push_match and build_registry:
359
+ push_tag = push_match.group(1)
360
+ if "ghcr.io" in push_tag:
361
+ push_registry = "ghcr.io"
362
+ elif "docker.io" in push_tag:
363
+ push_registry = "docker.io"
364
+ else:
365
+ push_registry = push_tag.split("/")[0] if "." in push_tag.split("/")[0] else "docker.io"
366
+ if build_registry != push_registry:
367
+ return {
368
+ "parse_success": True,
369
+ "execution_success": False,
370
+ "exec_error": (
371
+ f"Registry mismatch: image built with {build_registry} tag "
372
+ f"but push targets {push_registry}"
373
+ ),
374
+ }
375
+
376
+ # docker tag referencing non-existent image tag
377
+ for job_name, job in jobs.items():
378
+ if not isinstance(job, dict):
379
+ continue
380
+ steps = job.get("steps", [])
381
+ if not isinstance(steps, list):
382
+ continue
383
+ built_tags = set()
384
+ for step in steps:
385
+ if not isinstance(step, dict):
386
+ continue
387
+ run_cmd = step.get("run", "")
388
+ if not isinstance(run_cmd, str):
389
+ continue
390
+ # Collect tags from docker build -t
391
+ for m in re.finditer(r'docker build\s+.*-t\s+(\S+)', run_cmd):
392
+ built_tags.add(m.group(1))
393
+ # Check docker tag source exists
394
+ tag_match = re.search(r'docker tag\s+(\S+)\s+(\S+)', run_cmd)
395
+ if tag_match:
396
+ source = tag_match.group(1)
397
+ # If source contains ${{ it's a template — compare the template expression
398
+ if source not in built_tags and "${{" not in source:
399
+ return {
400
+ "parse_success": True,
401
+ "execution_success": False,
402
+ "exec_error": f"No such image: {source} β€” docker tag source does not match any built image",
403
+ }
404
+ # Check if source uses a different tag template than what was built
405
+ if "${{" in source:
406
+ # Normalize: extract the expression
407
+ source_expr = re.search(r'\$\{\{(.+?)\}\}', source)
408
+ if source_expr:
409
+ source_key = source_expr.group(1).strip()
410
+ found_matching = False
411
+ for bt in built_tags:
412
+ bt_expr = re.search(r'\$\{\{(.+?)\}\}', bt)
413
+ if bt_expr and bt_expr.group(1).strip() == source_key:
414
+ found_matching = True
415
+ break
416
+ # Also check if the base image name matches
417
+ source_base = source.split(":")[0] if ":" in source else source
418
+ built_bases = {bt.split(":")[0] if ":" in bt else bt for bt in built_tags}
419
+ if not found_matching and source_base in built_bases:
420
+ return {
421
+ "parse_success": True,
422
+ "execution_success": False,
423
+ "exec_error": f"No such image: docker tag source tag does not match any built image tag",
424
+ }
425
+
426
  # node version vs package.json engines
427
  for job_name, job in jobs.items():
428
  if not isinstance(job, dict):
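As a usage sketch of the first new check above, this is the shape of `jobs` that should now fail validation: an image built with `push: false` and no `load: true`, then run by a later step (values written the way the simulator would see them after YAML parsing):

```python
# Hypothetical `jobs` mapping that trips the new build-push-action check.
jobs = {
    "test": {
        "steps": [
            {"uses": "actions/checkout@v4"},
            {
                "uses": "docker/build-push-action@v5",
                "with": {"push": "false", "tags": "myapp:test"},
            },
            {"run": "docker run myapp:test pytest"},
        ]
    }
}
# Walking these steps with the loop above returns parse_success=True,
# execution_success=False, and an exec_error telling you to add 'load: true'.
```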
server/tasks/k8s_networking.py CHANGED
@@ -81,7 +81,7 @@ class K8sNetworkingTask(BaseTask):
81
  "api-7f8d9c6b5-y3l0n 1/1 Running app=api-server\n"
82
  "api-7f8d9c6b5-z4m1o 1/1 Running app=api-server\n"
83
  "\n"
84
- "Note: Service selector 'app=api' does not match pod label 'app=api-server'"
85
  ),
86
  },
87
  "expected_fixes": [
@@ -153,7 +153,7 @@ class K8sNetworkingTask(BaseTask):
153
  "$ kubectl exec -it test-pod -- wget -qO- http://10.244.0.5:3000\n"
154
  "<!DOCTYPE html><html>...</html>\n"
155
  "\n"
156
- "Note: Service targetPort is 8080 but container listens on 3000"
157
  ),
158
  },
159
  "expected_fixes": [
@@ -249,7 +249,7 @@ class K8sNetworkingTask(BaseTask):
249
  "NAME TYPE CLUSTER-IP PORT(S)\n"
250
  "api-service ClusterIP 10.96.0.10 80/TCP\n"
251
  "\n"
252
- "Note: Ingress references service 'api-svc' but the actual service name is 'api-service'"
253
  ),
254
  },
255
  "expected_fixes": [
 
81
  "api-7f8d9c6b5-y3l0n 1/1 Running app=api-server\n"
82
  "api-7f8d9c6b5-z4m1o 1/1 Running app=api-server\n"
83
  "\n"
84
+ "Hint: Compare the Service selector with the pod labels shown above."
85
  ),
86
  },
87
  "expected_fixes": [
 
153
  "$ kubectl exec -it test-pod -- wget -qO- http://10.244.0.5:3000\n"
154
  "<!DOCTYPE html><html>...</html>\n"
155
  "\n"
156
+ "Hint: The container responds on a different port than the Service expects."
157
  ),
158
  },
159
  "expected_fixes": [
 
249
  "NAME TYPE CLUSTER-IP PORT(S)\n"
250
  "api-service ClusterIP 10.96.0.10 80/TCP\n"
251
  "\n"
252
+ "Hint: The Ingress backend service name does not match any existing Service."
253
  ),
254
  },
255
  "expected_fixes": [
server/tasks/pipeline_build_deploy.py CHANGED
@@ -16,15 +16,15 @@ class PipelineBuildDeployTask(BaseTask):
16
  AVAILABLE_SECRETS = ["GITHUB_TOKEN", "DOCKER_USERNAME", "DOCKER_PASSWORD"]
17
 
18
  SCENARIOS = [
19
- # Scenario 1: GHCR login — GITHUB_TOKEN not mapped to env
20
  {
21
- "id": "ghcr_token_not_mapped",
22
  "files": [
23
  {
24
  "path": ".github/workflows/deploy.yml",
25
  "type": "workflow",
26
  "content": (
27
- "name: Build and Push to GHCR\n"
28
  "on:\n"
29
  " push:\n"
30
  " branches: [main]\n"
@@ -32,17 +32,19 @@ class PipelineBuildDeployTask(BaseTask):
32
  "jobs:\n"
33
  " build:\n"
34
  " runs-on: ubuntu-latest\n"
35
  " steps:\n"
36
  " - uses: actions/checkout@v4\n"
37
  "\n"
38
  " - name: Login to GHCR\n"
39
- " run: echo $GITHUB_TOKEN | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
40
  "\n"
41
  " - name: Build image\n"
42
  " run: docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .\n"
43
  "\n"
44
  " - name: Push image\n"
45
- " run: docker push ghcr.io/${{ github.repository }}:${{ github.sha }}\n"
46
  ),
47
  },
48
  {
@@ -67,23 +69,23 @@ class PipelineBuildDeployTask(BaseTask):
67
  "error": {
68
  "phase": "pipeline_build",
69
  "message": (
70
- "Run: Build and Push to GHCR\n"
71
  "\n"
72
- "Step: Login to GHCR\n"
73
- "Error: Cannot perform an interactive login from a non TTY device\n"
74
- "Error: GITHUB_TOKEN environment variable is not set\n"
75
  "\n"
76
- "The GITHUB_TOKEN secret is available but not mapped to an environment variable."
77
  ),
78
  "exit_code": 1,
79
- "failed_step": "Login to GHCR",
80
  },
81
  "expected_fixes": [
82
  {
83
  "file": ".github/workflows/deploy.yml",
84
  "type": "contains",
85
- "expected": "GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}",
86
- "hint": "The GITHUB_TOKEN shell variable is used in the run command but not mapped from secrets via env block",
87
  }
88
  ],
89
  },
@@ -161,18 +163,18 @@ class PipelineBuildDeployTask(BaseTask):
161
  ],
162
  },
163
 
164
- # Scenario 3: Missing packages:write permission for GHCR push
165
  {
166
- "id": "missing_packages_write",
167
  "files": [
168
  {
169
  "path": ".github/workflows/publish.yml",
170
  "type": "workflow",
171
  "content": (
172
- "name: Publish to GHCR\n"
173
  "on:\n"
174
- " release:\n"
175
- " types: [published]\n"
176
  "\n"
177
  "jobs:\n"
178
  " publish:\n"
@@ -180,14 +182,22 @@ class PipelineBuildDeployTask(BaseTask):
180
  " steps:\n"
181
  " - uses: actions/checkout@v4\n"
182
  "\n"
183
- " - name: Login to GHCR\n"
184
- " run: echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
185
  "\n"
186
  " - name: Build\n"
187
- " run: docker build -t ghcr.io/${{ github.repository }}:${{ github.event.release.tag_name }} .\n"
188
  "\n"
189
  " - name: Push\n"
190
- " run: docker push ghcr.io/${{ github.repository }}:${{ github.event.release.tag_name }}\n"
191
  ),
192
  },
193
  {
@@ -196,34 +206,39 @@ class PipelineBuildDeployTask(BaseTask):
196
  "content": (
197
  "FROM python:3.11-slim\n"
198
  "WORKDIR /app\n"
199
  "COPY . .\n"
200
  'CMD ["python", "app.py"]\n'
201
  ),
202
  },
203
  ],
204
  "error": {
205
  "phase": "pipeline_build",
206
  "message": (
207
- "Run: Publish to GHCR\n"
208
  "\n"
209
- "Step: Login to GHCR βœ“\n"
210
- "Step: Build βœ“\n"
211
- "Step: Push βœ—\n"
212
- "Error: denied: permission_denied: write_package\n"
213
- "Error: GITHUB_TOKEN does not have packages:write permission\n"
214
  "\n"
215
- "The default GITHUB_TOKEN only has read access to packages. "
216
- "Add a permissions block to the job."
217
  ),
218
  "exit_code": 1,
219
- "failed_step": "Push",
220
  },
221
  "expected_fixes": [
222
  {
223
  "file": ".github/workflows/publish.yml",
224
  "type": "contains",
225
- "expected": "packages: write",
226
- "hint": "GHCR push requires 'permissions: packages: write' in the job or workflow",
227
  }
228
  ],
229
  },
@@ -289,15 +304,15 @@ class PipelineBuildDeployTask(BaseTask):
289
  ],
290
  },
291
 
292
- # Scenario 5: Multi-stage build — wrong output directory name
293
  {
294
- "id": "multistage_output_mismatch",
295
  "files": [
296
  {
297
  "path": ".github/workflows/build.yml",
298
  "type": "workflow",
299
  "content": (
300
- "name: Build Frontend\n"
301
  "on:\n"
302
  " push:\n"
303
  " branches: [main]\n"
@@ -308,53 +323,54 @@ class PipelineBuildDeployTask(BaseTask):
308
  " steps:\n"
309
  " - uses: actions/checkout@v4\n"
310
  "\n"
311
- " - name: Build image\n"
312
- " run: docker build -t frontend:latest .\n"
313
  ),
314
  },
315
  {
316
- "path": "Dockerfile",
317
  "type": "dockerfile",
318
  "content": (
319
- "FROM node:20-alpine AS builder\n"
320
  "WORKDIR /app\n"
321
- "COPY package*.json ./\n"
322
- "RUN npm ci\n"
323
  "COPY . .\n"
324
- "RUN npm run build\n"
325
- "\n"
326
- "FROM nginx:alpine\n"
327
- "COPY --from=builder /app/dist /usr/share/nginx/html\n"
328
- "EXPOSE 80\n"
329
- 'CMD ["nginx", "-g", "daemon off;"]\n'
330
  ),
331
  },
332
  {
333
- "path": "package.json",
334
- "type": "other",
335
- "content": '{"name": "frontend", "scripts": {"build": "react-scripts build", "start": "react-scripts start"}}',
336
  },
337
  ],
338
  "error": {
339
  "phase": "pipeline_build",
340
  "message": (
341
- "Run: Build Frontend\n"
342
  "\n"
343
- "Step: Build image βœ—\n"
344
- "Error: COPY failed: stat app/dist: file does not exist\n"
345
  "\n"
346
- "react-scripts build outputs to /app/build, not /app/dist. "
347
- "The COPY --from=builder path is wrong."
348
  ),
349
  "exit_code": 1,
350
- "failed_step": "Build image",
351
  },
352
  "expected_fixes": [
353
  {
354
- "file": "Dockerfile",
355
  "type": "contains",
356
- "expected": "COPY --from=builder /app/build",
357
- "hint": "react-scripts outputs to 'build/' not 'dist/'. Change COPY --from=builder /app/dist to /app/build",
358
  }
359
  ],
360
  },
 
16
  AVAILABLE_SECRETS = ["GITHUB_TOKEN", "DOCKER_USERNAME", "DOCKER_PASSWORD"]
17
 
18
  SCENARIOS = [
19
+ # Scenario 1: Registry mismatch — build tags ghcr.io but push targets docker.io
20
  {
21
+ "id": "registry_mismatch",
22
  "files": [
23
  {
24
  "path": ".github/workflows/deploy.yml",
25
  "type": "workflow",
26
  "content": (
27
+ "name: Build and Push\n"
28
  "on:\n"
29
  " push:\n"
30
  " branches: [main]\n"
 
32
  "jobs:\n"
33
  " build:\n"
34
  " runs-on: ubuntu-latest\n"
35
+ " permissions:\n"
36
+ " packages: write\n"
37
  " steps:\n"
38
  " - uses: actions/checkout@v4\n"
39
  "\n"
40
  " - name: Login to GHCR\n"
41
+ " run: echo ${{ secrets.GITHUB_TOKEN }} | docker login ghcr.io -u ${{ github.actor }} --password-stdin\n"
42
  "\n"
43
  " - name: Build image\n"
44
  " run: docker build -t ghcr.io/${{ github.repository }}:${{ github.sha }} .\n"
45
  "\n"
46
  " - name: Push image\n"
47
+ " run: docker push docker.io/${{ github.repository }}:${{ github.sha }}\n"
48
  ),
49
  },
50
  {
 
69
  "error": {
70
  "phase": "pipeline_build",
71
  "message": (
72
+ "Run: Build and Push\n"
73
  "\n"
74
+ "Step: Build image βœ“\n"
75
+ "Step: Push image βœ—\n"
76
+ "Error: An image does not exist locally with the tag: docker.io/<repo>:<sha>\n"
77
  "\n"
78
+ "The image was built with a ghcr.io tag but the push targets docker.io."
79
  ),
80
  "exit_code": 1,
81
+ "failed_step": "Push image",
82
  },
83
  "expected_fixes": [
84
  {
85
  "file": ".github/workflows/deploy.yml",
86
  "type": "contains",
87
+ "expected": "docker push ghcr.io/",
88
+ "hint": "The push command targets docker.io but the image was tagged with ghcr.io β€” use the same registry",
89
  }
90
  ],
91
  },
 
163
  ],
164
  },
165
 
166
+ # Scenario 3: Build and push use different tagging strategies (sha vs latest)
167
  {
168
+ "id": "inconsistent_tagging",
169
  "files": [
170
  {
171
  "path": ".github/workflows/publish.yml",
172
  "type": "workflow",
173
  "content": (
174
+ "name: Publish\n"
175
  "on:\n"
176
+ " push:\n"
177
+ " branches: [main]\n"
178
  "\n"
179
  "jobs:\n"
180
  " publish:\n"
 
182
  " steps:\n"
183
  " - uses: actions/checkout@v4\n"
184
  "\n"
185
+ " - name: Login to DockerHub\n"
186
+ " run: echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin\n"
187
  "\n"
188
  " - name: Build\n"
189
+ " run: docker build -t myuser/api:${{ github.sha }} .\n"
190
+ "\n"
191
+ " - name: Test\n"
192
+ " run: docker run myuser/api:${{ github.sha }} python -m pytest\n"
193
+ "\n"
194
+ " - name: Tag latest\n"
195
+ " run: docker tag myuser/api:latest myuser/api:stable\n"
196
  "\n"
197
  " - name: Push\n"
198
+ " run: |\n"
199
+ " docker push myuser/api:${{ github.sha }}\n"
200
+ " docker push myuser/api:stable\n"
201
  ),
202
  },
203
  {
 
206
  "content": (
207
  "FROM python:3.11-slim\n"
208
  "WORKDIR /app\n"
209
+ "COPY requirements.txt .\n"
210
+ "RUN pip install -r requirements.txt\n"
211
  "COPY . .\n"
212
  'CMD ["python", "app.py"]\n'
213
  ),
214
  },
215
+ {
216
+ "path": "requirements.txt",
217
+ "type": "requirements",
218
+ "content": "flask==3.0.0\npytest==7.4.0\n",
219
+ },
220
  ],
221
  "error": {
222
  "phase": "pipeline_build",
223
  "message": (
224
+ "Run: Publish\n"
225
  "\n"
226
+ "Step: Build βœ“ (myuser/api:<sha>)\n"
227
+ "Step: Test βœ“\n"
228
+ "Step: Tag latest βœ—\n"
229
+ "Error: No such image: myuser/api:latest\n"
230
  "\n"
231
+ "The tag command references 'myuser/api:latest' but no image with that tag exists."
232
  ),
233
  "exit_code": 1,
234
+ "failed_step": "Tag latest",
235
  },
236
  "expected_fixes": [
237
  {
238
  "file": ".github/workflows/publish.yml",
239
  "type": "contains",
240
+ "expected": "docker tag myuser/api:${{ github.sha }}",
241
+ "hint": "The 'docker tag' source must match the tag used in the build step β€” use the sha-tagged image as source",
242
  }
243
  ],
244
  },
 
304
  ],
305
  },
306
 
307
+ # Scenario 5: the workflow's Dockerfile path is wrong when the Dockerfile lives in a subdirectory
308
  {
309
+ "id": "dockerfile_path_in_subdirectory",
310
  "files": [
311
  {
312
  "path": ".github/workflows/build.yml",
313
  "type": "workflow",
314
  "content": (
315
+ "name: Build API\n"
316
  "on:\n"
317
  " push:\n"
318
  " branches: [main]\n"
 
323
  " steps:\n"
324
  " - uses: actions/checkout@v4\n"
325
  "\n"
326
+ " - name: Build API image\n"
327
+ " uses: docker/build-push-action@v5\n"
328
+ " with:\n"
329
+ " context: ./services/api\n"
330
+ " file: ./Dockerfile\n"
331
+ " push: false\n"
332
+ " tags: api:latest\n"
333
  ),
334
  },
335
  {
336
+ "path": "services/api/Dockerfile",
337
  "type": "dockerfile",
338
  "content": (
339
+ "FROM python:3.11-slim\n"
340
  "WORKDIR /app\n"
341
+ "COPY requirements.txt .\n"
342
+ "RUN pip install -r requirements.txt\n"
343
  "COPY . .\n"
344
+ "EXPOSE 8000\n"
345
+ 'CMD ["uvicorn", "main:app", "--host", "0.0.0.0"]\n'
346
  ),
347
  },
348
  {
349
+ "path": "services/api/requirements.txt",
350
+ "type": "requirements",
351
+ "content": "fastapi==0.104.0\nuvicorn==0.24.0\n",
352
  },
353
  ],
354
  "error": {
355
  "phase": "pipeline_build",
356
  "message": (
357
+ "Run: Build API\n"
358
  "\n"
359
+ "Step: Build API image βœ—\n"
360
+ "Error: unable to prepare context: unable to evaluate symlinks in Dockerfile path: "
361
+ "lstat /home/runner/work/repo/repo/Dockerfile: no such file or directory\n"
362
  "\n"
363
+ "The Dockerfile is not at the repository root."
364
  ),
365
  "exit_code": 1,
366
+ "failed_step": "Build API image",
367
  },
368
  "expected_fixes": [
369
  {
370
+ "file": ".github/workflows/build.yml",
371
  "type": "contains",
372
+ "expected": "file: ./services/api/Dockerfile",
373
+ "hint": "The 'file' path must point to where the Dockerfile actually is β€” ./services/api/Dockerfile, not ./Dockerfile",
374
  }
375
  ],
376
  },
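The new `inconsistent_tagging` scenario is exactly what the docker-tag check added to `workflow_simulator.py` catches; reduced to its core, the failing condition looks like this (literal strings stand in for the parsed run commands):

```python
# Core of the docker-tag check as it applies to scenario 3 above.
built_tags = {"myuser/api:${{ github.sha }}"}  # collected from `docker build -t`
tag_source = "myuser/api:latest"               # source operand of `docker tag`

# No "${{" template in the source, so the simulator uses plain membership:
flagged = tag_source not in built_tags and "${{" not in tag_source
assert flagged  # exec_error: "No such image: myuser/api:latest ..."
```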
server/tasks/pipeline_full.py CHANGED
@@ -116,8 +116,7 @@ class PipelineFullTask(BaseTask):
116
  "\n"
117
  "---\n"
118
  "(If login had succeeded, deployment would also fail with:)\n"
119
- "Error: Service 'myapp-service' has no endpoints β€” selector 'app=my-app' "
120
- "doesn't match any pods (pods have label 'app=myapp')"
121
  ),
122
  },
123
  "expected_fixes": [
@@ -230,9 +229,8 @@ class PipelineFullTask(BaseTask):
230
  "\n"
231
  "---\n"
232
  "Additionally:\n"
233
- "- Dockerfile has no WORKDIR set β€” npm will fail to find package.json\n"
234
- "- K8s deployment containerPort is 8080 but app listens on 3000 "
235
- "(service targetPort also wrong)"
236
  ),
237
  },
238
  "expected_fixes": [
@@ -350,8 +348,8 @@ class PipelineFullTask(BaseTask):
350
  "\n"
351
  "---\n"
352
  "Additional issues found:\n"
353
- "- Dockerfile: pull access denied for python:3.9-slimm (typo in base image tag)\n"
354
- "- K8s: Pod CrashLoopBackOff with OOMKilled (64Mi memory limit too low for gunicorn)"
355
  ),
356
  },
357
  "expected_fixes": [
 
116
  "\n"
117
  "---\n"
118
  "(If login had succeeded, deployment would also fail with:)\n"
119
+ "Error: Service 'myapp-service' has no endpoints"
120
  ),
121
  },
122
  "expected_fixes": [
 
229
  "\n"
230
  "---\n"
231
  "Additionally:\n"
232
+ "- Dockerfile: npm reports module resolution errors at runtime\n"
233
+ "- K8s: Service returns connection refused when accessed"
234
  ),
235
  },
236
  "expected_fixes": [
 
348
  "\n"
349
  "---\n"
350
  "Additional issues found:\n"
351
+ "- Dockerfile: pull access denied for base image β€” repository does not exist\n"
352
+ "- K8s: Pod in CrashLoopBackOff with exit code 137"
  ),
354
  },
355
  "expected_fixes": [
server/tasks/task_1_build_errors.py CHANGED
@@ -141,9 +141,9 @@ class DockerfileSyntaxTask(BaseTask):
141
  ],
142
  },
143
 
144
- # Scenario 4: EXPOSE with a quoted string instead of a number
145
  {
146
- "id": "invalid_expose",
147
  "files": [
148
  {
149
  "path": "Dockerfile",
@@ -151,29 +151,30 @@ class DockerfileSyntaxTask(BaseTask):
151
  "content": (
152
  "FROM nginx:alpine\n"
153
  "COPY nginx.conf /etc/nginx/nginx.conf\n"
154
- "COPY html /usr/share/nginx/html\n"
155
- 'EXPOSE "eighty"\n'
156
  'CMD ["nginx", "-g", "daemon off;"]'
157
  ),
158
  },
159
  {
160
- "path": "nginx.conf",
161
  "type": "other",
162
- "content": "events {}",
163
  },
164
  ],
165
  "error": {
166
  "phase": "docker_build",
167
- "message": "EXPOSE requires numeric port or port/protocol",
168
  "exit_code": 1,
169
- "line_hint": 4,
170
  },
171
  "expected_fixes": [
172
  {
173
  "file": "Dockerfile",
174
  "type": "contains",
175
- "expected": "EXPOSE 80",
176
- "hint": "EXPOSE must use a numeric port value, not a quoted string",
177
  }
178
  ],
179
  },
 
141
  ],
142
  },
143
 
144
+ # Scenario 4: COPY references a file that doesn't exist in context
145
  {
146
+ "id": "copy_missing_source",
147
  "files": [
148
  {
149
  "path": "Dockerfile",
 
151
  "content": (
152
  "FROM nginx:alpine\n"
153
  "COPY nginx.conf /etc/nginx/nginx.conf\n"
154
+ "COPY dist/ /usr/share/nginx/html\n"
155
+ "EXPOSE 80\n"
156
  'CMD ["nginx", "-g", "daemon off;"]'
157
  ),
158
  },
159
  {
160
+ "path": "build/index.html",
161
  "type": "other",
162
+ "content": "<!DOCTYPE html><html><body>Hello</body></html>",
163
  },
164
  ],
165
  "error": {
166
  "phase": "docker_build",
167
+ "message": "COPY failed: file not found in build context: dist/",
168
  "exit_code": 1,
169
+ "failed_step": "COPY dist/ /usr/share/nginx/html",
170
+ "line_hint": 3,
171
  },
172
  "expected_fixes": [
173
  {
174
  "file": "Dockerfile",
175
  "type": "contains",
176
+ "expected": "COPY build/",
177
+ "hint": "The build output is in 'build/' not 'dist/' β€” check the build context files",
178
  }
179
  ],
180
  },
server/tasks/task_5_ci_docker_integration.py CHANGED
@@ -75,15 +75,15 @@ class CIDockerIntegrationTask(BaseTask):
75
  ],
76
  },
77
 
78
- # Scenario 2: Docker login + build but secrets not wired in env block
79
  {
80
- "id": "login_secrets_not_wired",
81
  "files": [
82
  {
83
  "path": ".github/workflows/build.yml",
84
  "type": "workflow",
85
  "content": (
86
- "name: Build and Push\n"
87
  "on: push\n"
88
  "\n"
89
  "jobs:\n"
@@ -91,52 +91,46 @@ class CIDockerIntegrationTask(BaseTask):
91
  " runs-on: ubuntu-latest\n"
92
  " steps:\n"
93
  " - uses: actions/checkout@v4\n"
94
- " - name: Login to DockerHub\n"
95
- " run: echo $DOCKER_PASSWORD | docker login -u $DOCKER_USERNAME --password-stdin\n"
96
- " - name: Build\n"
97
- " run: docker build -t myuser/app:latest .\n"
98
- " - name: Push\n"
99
- " run: docker push myuser/app:latest"
100
  ),
101
  },
102
  {
103
  "path": "Dockerfile",
104
  "type": "dockerfile",
105
  "content": (
106
- "FROM node:18-alpine\n"
107
  "WORKDIR /app\n"
108
- "COPY package*.json ./\n"
109
- "RUN npm ci\n"
110
  "COPY . .\n"
111
- "EXPOSE 3000\n"
112
- 'CMD ["npm", "start"]'
113
  ),
114
  },
115
- {
116
- "path": "package.json",
117
- "type": "other",
118
- "content": '{"name": "app", "scripts": {"start": "node server.js"}}',
119
- },
120
  ],
121
  "error": {
122
- "phase": "workflow_parse",
123
- "message": "Error: Cannot perform an interactive login from a non TTY device",
124
  "exit_code": 1,
125
- "failed_step": "Login to DockerHub",
126
  },
127
  "expected_fixes": [
128
  {
129
  "file": ".github/workflows/build.yml",
130
  "type": "contains",
131
- "expected": "DOCKER_USERNAME: ${{ secrets.DOCKER_USERNAME }}",
132
- "hint": "Secrets need to be mapped to env vars in the step",
133
- },
134
- {
135
- "file": ".github/workflows/build.yml",
136
- "type": "contains",
137
- "expected": "DOCKER_PASSWORD: ${{ secrets.DOCKER_PASSWORD }}",
138
- "hint": "Both Docker credentials must be in the env block",
139
- },
140
  ],
141
  },
142
 
 
75
  ],
76
  },
77
 
78
+ # Scenario 2: build-push-action without load: true, so the next step can't find the image
79
  {
80
+ "id": "missing_load_true",
81
  "files": [
82
  {
83
  "path": ".github/workflows/build.yml",
84
  "type": "workflow",
85
  "content": (
86
+ "name: Build and Test\n"
87
  "on: push\n"
88
  "\n"
89
  "jobs:\n"
 
91
  " runs-on: ubuntu-latest\n"
92
  " steps:\n"
93
  " - uses: actions/checkout@v4\n"
94
+ " - name: Set up Docker Buildx\n"
95
+ " uses: docker/setup-buildx-action@v3\n"
96
+ " - name: Build image\n"
97
+ " uses: docker/build-push-action@v5\n"
98
+ " with:\n"
99
+ " context: .\n"
100
+ " push: false\n"
101
+ " tags: myapp:test\n"
102
+ " - name: Run tests\n"
103
+ " run: docker run myapp:test pytest"
104
  ),
105
  },
106
  {
107
  "path": "Dockerfile",
108
  "type": "dockerfile",
109
  "content": (
110
+ "FROM python:3.11-slim\n"
111
  "WORKDIR /app\n"
112
  "COPY . .\n"
113
+ "RUN pip install pytest\n"
114
+ 'CMD ["python", "app.py"]'
115
  ),
116
  },
117
  ],
118
  "error": {
119
+ "phase": "docker_build",
120
+ "message": (
121
+ "Unable to find image 'myapp:test' locally. "
122
+ "docker: Error response from daemon: pull access denied for myapp."
123
+ ),
124
  "exit_code": 1,
125
+ "failed_step": "Run tests",
126
  },
127
  "expected_fixes": [
128
  {
129
  "file": ".github/workflows/build.yml",
130
  "type": "contains",
131
+ "expected": "load: true",
132
+ "hint": "build-push-action with Buildx doesn't load images into local Docker daemon by default β€” add 'load: true'",
133
+ }
134
  ],
135
  },
136