Spaces: jester1177/cloudnative-devops-debug-env

Krishna1107 committed on Apr 8

Commit

4de7d31

1 Parent(s): c8f3b98

improved grading

Files changed (13)

README.md +18 -5
baseline_runner.py +3 -3
sample_inf_script.py +0 -255
sample_val_script.txt +0 -185
server/__init__.py +1 -1
server/app.py +3 -3
server/environment.py +1 -1
server/graders/__init__.py +61 -19
smoke_test.py +2 -2
tests/test_baseline.py +3 -3
tests/test_determinism.py +5 -5
tests/test_environment_flow.py +3 -3
tutorial_references/02-deployment.md +0 -427

README.md CHANGED

@@ -206,15 +206,15 @@ Each step, the agent chooses exactly one action:
 ## Grading System — How Scores Work
-Scoring is **deterministic** (same actions always produce the same score) and **dynamic** (different strategies get different scores).
 ### The Formula
 ```
-FINAL SCORE = Base + Partial Fixes + Complete Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
 ```
-Clamped to `(0.01, 0.99)`.
 ### Component Breakdown
@@ -223,10 +223,23 @@ Clamped to `(0.01, 0.99)`.
 | Base score | 5% | Participation credit |
 | Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
 | Complete bonus | 25% | All issues fixed |
-| Efficiency | 25% | Decays with extra steps beyond optimal |
-| Hint penalty | -4% each | Per `request_hint` action |
 | Failed edit penalty | -2% each | Per edit with no valid file path |
 ---
 ## API Endpoints

 ## Grading System — How Scores Work
+Scoring is **deterministic** (same actions always produce the same score), **dynamic** (different strategies get different scores), and **difficulty-aware** (harder tasks are graded more generously).
 ### The Formula
 ```
+FINAL SCORE = Base + Partial Fixes + Complete Bonus + Difficulty Bonus + Efficiency - Hint Penalty - Failed Edit Penalty
 ```
+Clamped to `[0.0, 1.0]`.
 ### Component Breakdown
 | Base score | 5% | Participation credit |
 | Partial fixes | 35% | Proportional to `issues_fixed / issues_total` |
 | Complete bonus | 25% | All issues fixed |
+| Difficulty bonus | 0-3% | Extra reward for fully solving hard/expert tasks |
+| Efficiency | 25% | Decays with extra steps — slower decay for harder tasks |
+| Hint penalty | -3% to -4% each | Per `request_hint` action (cheaper for hard/expert) |
 | Failed edit penalty | -2% each | Per edit with no valid file path |
+### Difficulty Modifiers
+The grader adjusts three parameters based on task difficulty:
+| Difficulty | Max Score | Efficiency Decay | Hint Cost |
+|------------|-----------|------------------|-----------|
+| Easy | 0.90 | 0.03/step (strict) | 4% each |
+| Medium | 0.90 | 0.027/step | 4% each |
+| Hard/Expert | 0.93 | 0.021/step (forgiving) | 3% each |
+This means: solving a 4-bug expert pipeline in 6 steps scores higher than solving a 1-bug easy task in 3 steps, reflecting the genuine difficulty difference.
 ---
 ## API Endpoints
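
The closing claim of the new README section can be checked by hand. The following is a minimal sketch of the formula, not the grader itself: the weights come from the README table and the constants visible in this commit, `EFFICIENCY_MAX = 0.25` is assumed from the 25% efficiency weight, the "optimal" step count is taken to be `issues_total` as in the grader's efficiency branch, and the string keys and the name `sketch_score` are hypothetical stand-ins.

```python
# Weights as listed in the README table for this commit.
BASE_SCORE = 0.05
PARTIAL_FIX_WEIGHT = 0.35
COMPLETE_BONUS = 0.25
EFFICIENCY_MAX = 0.25  # assumption: matches the 25% efficiency weight
EFFICIENCY_DECAY = 0.03
HINT_PENALTY = 0.04
FAILED_ACTION_PENALTY = 0.02

# (difficulty_bonus, decay_multiplier, hint_multiplier)
# String keys are a stand-in for the TaskDifficulty enum.
MODIFIERS = {
    "easy": (0.00, 1.0, 1.0),
    "medium": (0.00, 0.9, 1.0),
    "hard": (0.03, 0.7, 0.75),
}


def sketch_score(difficulty, issues_fixed, issues_total, steps_taken,
                 hints_used=0, failed_edits=0):
    """Recompute the README formula for one finished episode."""
    bonus_extra, decay_mult, hint_mult = MODIFIERS[difficulty]
    solved = issues_fixed == issues_total

    partial = PARTIAL_FIX_WEIGHT * issues_fixed / issues_total
    complete = COMPLETE_BONUS if solved else 0.0
    diff_bonus = bonus_extra if solved else 0.0

    # Efficiency is only awarded once at least one issue is fixed.
    if issues_fixed == 0:
        efficiency = 0.0
    else:
        extra = max(0, steps_taken - issues_total)
        decay = EFFICIENCY_DECAY * decay_mult
        efficiency = max(0.0, EFFICIENCY_MAX - decay * extra)

    raw = (BASE_SCORE + partial + complete + diff_bonus + efficiency
           - HINT_PENALTY * hint_mult * hints_used
           - FAILED_ACTION_PENALTY * failed_edits)
    # Clamp to [0.0, 1.0] after rounding to four decimals.
    return max(0.0, min(1.0, round(raw, 4)))


expert_run = sketch_score("hard", 4, 4, steps_taken=6)
easy_run = sketch_score("easy", 1, 1, steps_taken=3)
stalled_run = sketch_score("easy", 0, 1, steps_taken=3, hints_used=3)
print(expert_run, easy_run, stalled_run)
```

Under these assumptions the 4-bug hard-tier run lands near 0.888 and the 1-bug easy run near 0.84, which agrees with the README's comparison. The third case (no fixes, three hints) goes negative before clamping and ends at 0.0, so with the new floor the base score alone does not keep every score above zero.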

baseline_runner.py CHANGED

@@ -6,13 +6,13 @@ Applies expected_fixes directly to verify the environment + grader work e2e.
 from typing import List, Optional
-from server.environment import CICDDebugEnvironment
 from server.graders import run_grader
 from server.models import Action, ActionType, FileEdit, GraderResult
 from server.tasks.task_registry import TASK_REGISTRY
-def _heuristic_episode(env: CICDDebugEnvironment, task_id: str, scenario_id: Optional[str] = None) -> GraderResult:
     """Run one episode using a heuristic that applies expected fixes."""
     obs = env.reset(task_id=task_id, scenario_id=scenario_id)
@@ -141,7 +141,7 @@ def run_baseline_episodes(task_id: Optional[str] = None, num_episodes: int = 1)
         for scenario in scenarios:
             if episodes_run >= num_episodes:
                 break
-            env = CICDDebugEnvironment()
             result = _heuristic_episode(env, tid, scenario["id"])
             results.append(result)
             episodes_run += 1

 from typing import List, Optional
+from server.environment import CloudNativeDebugEnvironment
 from server.graders import run_grader
 from server.models import Action, ActionType, FileEdit, GraderResult
 from server.tasks.task_registry import TASK_REGISTRY
+def _heuristic_episode(env: CloudNativeDebugEnvironment, task_id: str, scenario_id: Optional[str] = None) -> GraderResult:
     """Run one episode using a heuristic that applies expected fixes."""
     obs = env.reset(task_id=task_id, scenario_id=scenario_id)
         for scenario in scenarios:
             if episodes_run >= num_episodes:
                 break
+            env = CloudNativeDebugEnvironment()
             result = _heuristic_episode(env, tid, scenario["id"])
             results.append(result)
             episodes_run += 1

sample_inf_script.py DELETED

@@ -1,255 +0,0 @@
-"""
-Inference Script Example
-===================================
-MANDATORY
-- Before submitting, ensure the following variables are defined in your environment configuration:
-    API_BASE_URL   The API endpoint for the LLM.
-    MODEL_NAME     The model identifier to use for inference.
-    HF_TOKEN       Your Hugging Face / API key.
-- The inference script must be named `inference.py` and placed in the root directory of the project
-- Participants must use OpenAI Client for all LLM calls using above variables
-"""
-import os
-import re
-import base64
-import textwrap
-from io import BytesIO
-from typing import List, Optional, Dict
-from openai import OpenAI
-import numpy as np
-from PIL import Image
-from browsergym_env import BrowserGymAction, BrowserGymEnv
-API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
-API_KEY = os.getenv("HF_TOKEN") or os.getenv("API_KEY")
-MODEL_NAME = os.getenv("MODEL_NAME")
-MAX_STEPS = 8
-MAX_DOM_CHARS = 3500
-TEMPERATURE = 0.2
-MAX_TOKENS = 200
-FALLBACK_ACTION = "noop()"
-DEBUG = True
-ACTION_PREFIX_RE = re.compile(
-    r"^(action|next action)\s*[:\-]\s*",
-    re.IGNORECASE,
-)
-ACTION_PATTERN = re.compile(r"[A-Za-z_]+\s*\(.*\)", re.DOTALL)
-SYSTEM_PROMPT = textwrap.dedent(
-    """
-    You control a web browser through BrowserGym.
-    Reply with exactly one action string.
-    The action must be a valid BrowserGym command such as:
-    - noop()
-    - click('<BID>')
-    - type('selector', 'text to enter')
-    - fill('selector', 'text to enter')
-    - send_keys('Enter')
-    - scroll('down')
-    Use single quotes around string arguments.
-    When clicking, use the BrowserGym element IDs (BIDs) listed in the user message.
-    If you are unsure, respond with noop().
-    Do not include explanations or additional text.
-    """
-).strip()
-def build_history_lines(history: List[str]) -> str:
-    if not history:
-        return "None"
-    return "\n".join(history[-4:])
-def extract_screenshot_uri(observation) -> Optional[str]:
-    if observation.screenshot is None:
-        return None
-    screen_array = np.array(observation.screenshot, dtype=np.uint8)
-    image = Image.fromarray(screen_array)
-    buffer = BytesIO()
-    image.save(buffer, format="PNG")
-    buffer.seek(0)
-    data_uri = base64.b64encode(buffer.read()).decode("utf-8")
-    return f"data:image/png;base64,{data_uri}"
-def extract_clickable_elements(observation) -> List[Dict[str, str]]:
-    """Collect BrowserGym element IDs that can be clicked."""
-    metadata = getattr(observation, "metadata", {}) or {}
-    obs_dict = metadata.get("browsergym_obs", {}) or {}
-    extra_props = obs_dict.get("extra_element_properties", {}) or {}
-    clickables: List[Dict[str, str]] = []
-    for bid, props in extra_props.items():
-        if not props.get("clickable"):
-            continue
-        bbox = props.get("bbox") or []
-        bbox_str = ", ".join(str(v) for v in bbox) if bbox else "?"
-        clickables.append(
-            {
-                "bid": str(bid),
-                "bbox": bbox_str,
-            }
-        )
-    # Keep a stable ordering for readability
-    clickables.sort(key=lambda item: item["bid"])
-    return clickables
-def build_user_prompt(step: int, observation, history: List[str]) -> str:
-    goal = observation.goal or "(not provided)"
-    url = observation.url or "(unknown)"
-    error_note = "Yes" if observation.last_action_error else "No"
-    clickables = extract_clickable_elements(observation)
-    if clickables:
-        actions_hint = "\n".join(
-            f"    - {item['bid']} (bbox: {item['bbox']})" for item in clickables
-        )
-    else:
-        actions_hint = "    (none detected)"
-    prompt = textwrap.dedent(
-        f"""
-        Step: {step}
-        Goal: {goal}
-        Current URL: {url}
-        Previous steps:
-        {build_history_lines(history)}
-        Last action error: {error_note}
-        Available clickable element IDs: {actions_hint}
-        Reply with exactly one BrowserGym action string.
-        """
-    ).strip()
-    return prompt
-def parse_model_action(response_text: str) -> str:
-    if not response_text:
-        return FALLBACK_ACTION
-    # Prefer the first line that looks like an action string
-    lines = response_text.splitlines()
-    for raw_line in lines:
-        line = raw_line.strip()
-        if not line:
-            continue
-        line = ACTION_PREFIX_RE.sub("", line)
-        match = ACTION_PATTERN.search(line)
-        if match:
-            action = match.group(0).strip()
-            # Collapse internal whitespace
-            action = re.sub(r"\s+", " ", action)
-            # If the model tried to click by natural-language description while we
-            # only exposed numeric BrowserGym IDs, fallback to the single detected ID.
-            return action
-    # Fall back to searching the whole response
-    match = ACTION_PATTERN.search(response_text)
-    if match:
-        action = match.group(0).strip()
-        action = re.sub(r"\s+", " ", action)
-        return action
-    return FALLBACK_ACTION
-def main() -> None:
-    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
-    env = BrowserGymEnv.from_docker_image(
-        image="browsergym-env:latest",
-        env_vars={
-            "BROWSERGYM_BENCHMARK": "miniwob",
-            "BROWSERGYM_TASK_NAME": "click-test",
-        },
-    )
-    history: List[str] = []
-    try:
-        result = env.reset()
-        observation = result.observation
-        print(f"Episode goal: {observation.goal}")
-        for step in range(1, MAX_STEPS + 1):
-            if result.done:
-                print("Environment signalled done. Stopping early.")
-                break
-            user_prompt = build_user_prompt(step, observation, history)
-            user_content = [{"type": "text", "text": user_prompt}]
-            screenshot_uri = extract_screenshot_uri(observation)
-            if screenshot_uri:
-                user_content.append(
-                    {
-                        "type": "image_url",
-                        "image_url": {"url": screenshot_uri},
-                    }
-                )
-            messages = [
-                {
-                    "role": "system",
-                    "content": [{"type": "text", "text": SYSTEM_PROMPT}],
-                },
-                {
-                    "role": "user",
-                    "content": user_content,
-                },
-            ]
-            try:
-                completion = client.chat.completions.create(
-                    model=MODEL_NAME,
-                    messages=messages,
-                    temperature=TEMPERATURE,
-                    max_tokens=MAX_TOKENS,
-                    stream=False,
-                )
-                response_text = completion.choices[0].message.content or ""
-            # pylint: disable=broad-except
-            except Exception as exc:  # noqa: BLE001
-                failure_msg = f"Model request failed ({exc}). Using fallback action."
-                print(failure_msg)
-                response_text = FALLBACK_ACTION
-            action_str = parse_model_action(response_text)
-            print(f"Step {step}: model suggested -> {action_str}")
-            result = env.step(BrowserGymAction(action_str=action_str))
-            observation = result.observation
-            reward = result.reward or 0.0
-            error_flag = " ERROR" if observation.last_action_error else ""
-            history_line = (
-                f"Step {step}: {action_str} -> reward {reward:+.2f}{error_flag}"
-            )
-            history.append(history_line)
-            print(
-                "  Reward: "
-                f"{reward:+.2f} | Done: {result.done} | Last action error: "
-                f"{observation.last_action_error}"
-            )
-            if result.done:
-                print("Episode complete.")
-                break
-        else:
-            print(f"Reached max steps ({MAX_STEPS}).")
-    finally:
-        env.close()
-if __name__ == "__main__":
-    main()

sample_val_script.txt DELETED

@@ -1,185 +0,0 @@
-#!/usr/bin/env bash
-#
-# validate-submission.sh — OpenEnv Submission Validator
-#
-# Checks that your HF Space is live, Docker image builds, and openenv validate passes.
-#
-# Prerequisites:
-#   - Docker:       https://docs.docker.com/get-docker/
-#   - openenv-core: pip install openenv-core
-#   - curl (usually pre-installed)
-#
-# Run:
-#   curl -fsSL https://raw.githubusercontent.com/<owner>/<repo>/main/scripts/validate-submission.sh | bash -s -- <ping_url> [repo_dir]
-#
-#   Or download and run locally:
-#     chmod +x validate-submission.sh
-#     ./validate-submission.sh <ping_url> [repo_dir]
-#
-# Arguments:
-#   ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)
-#   repo_dir   Path to your repo (default: current directory)
-#
-# Examples:
-#   ./validate-submission.sh https://my-team.hf.space
-#   ./validate-submission.sh https://my-team.hf.space ./my-repo
-#
-set -uo pipefail
-DOCKER_BUILD_TIMEOUT=600
-if [ -t 1 ]; then
-  RED='\033[0;31m'
-  GREEN='\033[0;32m'
-  YELLOW='\033[1;33m'
-  BOLD='\033[1m'
-  NC='\033[0m'
-else
-  RED='' GREEN='' YELLOW='' BOLD='' NC=''
-fi
-run_with_timeout() {
-  local secs="$1"; shift
-  if command -v timeout &>/dev/null; then
-    timeout "$secs" "$@"
-  elif command -v gtimeout &>/dev/null; then
-    gtimeout "$secs" "$@"
-  else
-    "$@" &
-    local pid=$!
-    ( sleep "$secs" && kill "$pid" 2>/dev/null ) &
-    local watcher=$!
-    wait "$pid" 2>/dev/null
-    local rc=$?
-    kill "$watcher" 2>/dev/null
-    wait "$watcher" 2>/dev/null
-    return $rc
-  fi
-}
-portable_mktemp() {
-  local prefix="${1:-validate}"
-  mktemp "${TMPDIR:-/tmp}/${prefix}-XXXXXX" 2>/dev/null || mktemp
-}
-CLEANUP_FILES=()
-cleanup() { rm -f "${CLEANUP_FILES[@]+"${CLEANUP_FILES[@]}"}"; }
-trap cleanup EXIT
-PING_URL="${1:-}"
-REPO_DIR="${2:-.}"
-if [ -z "$PING_URL" ]; then
-  printf "Usage: %s <ping_url> [repo_dir]\n" "$0"
-  printf "\n"
-  printf "  ping_url   Your HuggingFace Space URL (e.g. https://your-space.hf.space)\n"
-  printf "  repo_dir   Path to your repo (default: current directory)\n"
-  exit 1
-fi
-if ! REPO_DIR="$(cd "$REPO_DIR" 2>/dev/null && pwd)"; then
-  printf "Error: directory '%s' not found\n" "${2:-.}"
-  exit 1
-fi
-PING_URL="${PING_URL%/}"
-export PING_URL
-PASS=0
-log()  { printf "[%s] %b\n" "$(date -u +%H:%M:%S)" "$*"; }
-pass() { log "${GREEN}PASSED${NC} -- $1"; PASS=$((PASS + 1)); }
-fail() { log "${RED}FAILED${NC} -- $1"; }
-hint() { printf "  ${YELLOW}Hint:${NC} %b\n" "$1"; }
-stop_at() {
-  printf "\n"
-  printf "${RED}${BOLD}Validation stopped at %s.${NC} Fix the above before continuing.\n" "$1"
-  exit 1
-}
-printf "\n"
-printf "${BOLD}========================================${NC}\n"
-printf "${BOLD}  OpenEnv Submission Validator${NC}\n"
-printf "${BOLD}========================================${NC}\n"
-log "Repo:     $REPO_DIR"
-log "Ping URL: $PING_URL"
-printf "\n"
-log "${BOLD}Step 1/3: Pinging HF Space${NC} ($PING_URL/reset) ..."
-CURL_OUTPUT=$(portable_mktemp "validate-curl")
-CLEANUP_FILES+=("$CURL_OUTPUT")
-HTTP_CODE=$(curl -s -o "$CURL_OUTPUT" -w "%{http_code}" -X POST \
-  -H "Content-Type: application/json" -d '{}' \
-  "$PING_URL/reset" --max-time 30 2>"$CURL_OUTPUT" || printf "000")
-if [ "$HTTP_CODE" = "200" ]; then
-  pass "HF Space is live and responds to /reset"
-elif [ "$HTTP_CODE" = "000" ]; then
-  fail "HF Space not reachable (connection failed or timed out)"
-  hint "Check your network connection and that the Space is running."
-  hint "Try: curl -s -o /dev/null -w '%%{http_code}' -X POST $PING_URL/reset"
-  stop_at "Step 1"
-else
-  fail "HF Space /reset returned HTTP $HTTP_CODE (expected 200)"
-  hint "Make sure your Space is running and the URL is correct."
-  hint "Try opening $PING_URL in your browser first."
-  stop_at "Step 1"
-fi
-log "${BOLD}Step 2/3: Running docker build${NC} ..."
-if ! command -v docker &>/dev/null; then
-  fail "docker command not found"
-  hint "Install Docker: https://docs.docker.com/get-docker/"
-  stop_at "Step 2"
-fi
-if [ -f "$REPO_DIR/Dockerfile" ]; then
-  DOCKER_CONTEXT="$REPO_DIR"
-elif [ -f "$REPO_DIR/server/Dockerfile" ]; then
-  DOCKER_CONTEXT="$REPO_DIR/server"
-else
-  fail "No Dockerfile found in repo root or server/ directory"
-  stop_at "Step 2"
-fi
-log "  Found Dockerfile in $DOCKER_CONTEXT"
-BUILD_OK=false
-BUILD_OUTPUT=$(run_with_timeout "$DOCKER_BUILD_TIMEOUT" docker build "$DOCKER_CONTEXT" 2>&1) && BUILD_OK=true
-if [ "$BUILD_OK" = true ]; then
-  pass "Docker build succeeded"
-else
-  fail "Docker build failed (timeout=${DOCKER_BUILD_TIMEOUT}s)"
-  printf "%s\n" "$BUILD_OUTPUT" | tail -20
-  stop_at "Step 2"
-fi
-log "${BOLD}Step 3/3: Running openenv validate${NC} ..."
-if ! command -v openenv &>/dev/null; then
-  fail "openenv command not found"
-  hint "Install it: pip install openenv-core"
-  stop_at "Step 3"
-fi
-VALIDATE_OK=false
-VALIDATE_OUTPUT=$(cd "$REPO_DIR" && openenv validate 2>&1) && VALIDATE_OK=true
-if [ "$VALIDATE_OK" = true ]; then
-  pass "openenv validate passed"
-  [ -n "$VALIDATE_OUTPUT" ] && log "  $VALIDATE_OUTPUT"
-else
-  fail "openenv validate failed"
-  printf "%s\n" "$VALIDATE_OUTPUT"
-  stop_at "Step 3"
-fi
-printf "\n"
-printf "${BOLD}========================================${NC}\n"
-printf "${GREEN}${BOLD}  All 3/3 checks passed!${NC}\n"
-printf "${GREEN}${BOLD}  Your submission is ready to submit.${NC}\n"
-printf "${BOLD}========================================${NC}\n"
-printf "\n"
-exit 0

server/__init__.py CHANGED

@@ -1 +1 @@
-"""CI/CD debug environment server package."""

+"""Cloud-native DevOps debug environment server package."""

server/app.py CHANGED

@@ -9,7 +9,7 @@ from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import HTMLResponse
 from fastapi.staticfiles import StaticFiles
-from server.environment import CICDDebugEnvironment
 from server.graders import run_grader
 from server.models import (
     Action,
@@ -47,7 +47,7 @@ app.add_middleware(
 # Serve static assets (CSS, JS, images if needed later)
 app.mount("/static", StaticFiles(directory=str(STATIC_DIR)), name="static")
-env: Optional[CICDDebugEnvironment] = None
 @app.get("/", response_class=HTMLResponse)
@@ -135,7 +135,7 @@ async def reset(request: Optional[ResetRequest] = None):
     global env
     request = request or ResetRequest()
-    env = CICDDebugEnvironment()
     try:
         observation = env.reset(
             task_id=request.task_id,

 from fastapi.responses import HTMLResponse
 from fastapi.staticfiles import StaticFiles
+from server.environment import CloudNativeDebugEnvironment
 from server.graders import run_grader
 from server.models import (
     Action,
 # Serve static assets (CSS, JS, images if needed later)
 app.mount("/static", StaticFiles(directory=str(STATIC_DIR)), name="static")
+env: Optional[CloudNativeDebugEnvironment] = None
 @app.get("/", response_class=HTMLResponse)
     global env
     request = request or ResetRequest()
+    env = CloudNativeDebugEnvironment()
     try:
         observation = env.reset(
             task_id=request.task_id,

server/environment.py CHANGED

@@ -20,7 +20,7 @@ from server.simulators.workflow_simulator import WorkflowSimulator
 from server.tasks.task_registry import TASK_REGISTRY, get_task
-class CICDDebugEnvironment:
     MAX_STEPS = 10
     MAX_HINTS = 3

 from server.tasks.task_registry import TASK_REGISTRY, get_task
+class CloudNativeDebugEnvironment:
     MAX_STEPS = 10
     MAX_HINTS = 3

server/graders/__init__.py CHANGED

@@ -1,22 +1,23 @@
 """Deterministic grader for trajectory scoring.
-Scoring weights:
   base score      5%   (participation — guarantees score > 0)
   partial fixes  35%   (proportional to fix ratio)
-  complete bonus 25%   (all issues fixed)
-  efficiency     25%   (decays with extra steps)
-  hint penalty   -4%   each
   failed edit    -2%   each
-Score is always clamped to (0.01, 0.99) so it never hits 0 or 1.
 """
 from typing import Any, Dict, List
-from server.models import GraderResult
 from server.tasks.task_registry import TASK_REGISTRY
-# Tunable weights — max possible = 0.05 + 0.35 + 0.25 + 0.25 = 0.90
 BASE_SCORE = 0.05
 PARTIAL_FIX_WEIGHT = 0.35
 COMPLETE_BONUS = 0.25
@@ -25,9 +26,19 @@ EFFICIENCY_DECAY = 0.03  # per extra step beyond optimal
 HINT_PENALTY = 0.04
 FAILED_ACTION_PENALTY = 0.02
-# Hard boundaries — score can never be exactly 0 or 1
-SCORE_FLOOR = 0.01
-SCORE_CEIL = 0.99
 EDIT_ACTION_TYPES = frozenset({
     "edit_file", "replace_line", "add_line",
@@ -36,14 +47,27 @@ EDIT_ACTION_TYPES = frozenset({
 def _clamp(value: float) -> float:
-    """Clamp score to the open interval (0, 1)."""
     return max(SCORE_FLOOR, min(SCORE_CEIL, round(value, 4)))
 def run_grader(task_id: str, trajectory: List[Dict[str, Any]]) -> GraderResult:
     if task_id not in TASK_REGISTRY:
         raise ValueError(f"Unknown task: {task_id}")
     if not trajectory:
         return GraderResult(
             task_id=task_id,
@@ -53,6 +77,7 @@ def run_grader(task_id: str, trajectory: List[Dict[str, Any]]) -> GraderResult:
                 "partial_fixes": 0.0,
                 "complete_solution": 0.0,
                 "efficiency": 0.0,
                 "hint_penalty": 0.0,
                 "failed_action_penalty": 0.0,
             },
@@ -72,25 +97,32 @@ def run_grader(task_id: str, trajectory: List[Dict[str, Any]]) -> GraderResult:
     issues_total = max(1, int(final_step.get("info", {}).get("issues_total", 1)))
     fix_ratio = issues_fixed / issues_total
-    # Component 1: Partial fix credit (proportional)
     partial_score = PARTIAL_FIX_WEIGHT * fix_ratio
-    # Component 2: Full-solution bonus (only when ALL issues fixed)
     complete_bonus = COMPLETE_BONUS if issues_fixed == issues_total else 0.0
-    # Component 3: Efficiency bonus (only awarded if at least one fix)
     if issues_fixed == 0:
         efficiency_score = 0.0
     elif steps_taken <= issues_total:
         efficiency_score = EFFICIENCY_MAX
     else:
         extra = steps_taken - issues_total
-        efficiency_score = max(0.0, EFFICIENCY_MAX - EFFICIENCY_DECAY * extra)
-    # Component 4: Hint penalty
-    hint_pen = HINT_PENALTY * hints_used
-    # Component 5: Failed action penalty
     failed_edits = 0
     for step in trajectory:
         action = step.get("action", {})
@@ -100,9 +132,18 @@ def run_grader(task_id: str, trajectory: List[Dict[str, Any]]) -> GraderResult:
                 failed_edits += 1
     failed_pen = FAILED_ACTION_PENALTY * failed_edits
-    raw = BASE_SCORE + partial_score + complete_bonus + efficiency_score - hint_pen - failed_pen
     score = _clamp(raw)
     if score >= 0.85:
         feedback = "Excellent — all issues fixed efficiently."
     elif score >= 0.65:
@@ -121,6 +162,7 @@ def run_grader(task_id: str, trajectory: List[Dict[str, Any]]) -> GraderResult:
             "base": BASE_SCORE,
             "partial_fixes": round(partial_score, 4),
             "complete_solution": round(complete_bonus, 4),
             "efficiency": round(efficiency_score, 4),
             "hint_penalty": round(-hint_pen, 4),
             "failed_action_penalty": round(-failed_pen, 4),

 """Deterministic grader for trajectory scoring.
+Scoring weights (difficulty-aware):
   base score      5%   (participation — guarantees score > 0)
   partial fixes  35%   (proportional to fix ratio)
+  complete bonus 25%   (all issues fixed — scales with difficulty)
+  efficiency     25%   (decays with extra steps — slower decay for harder tasks)
+  hint penalty   -4%   each (reduced to -3% for hard/expert)
   failed edit    -2%   each
+  difficulty     +3%   bonus for hard/expert tasks when fully solved
+Score is clamped to [0.0, 1.0].
 """
 from typing import Any, Dict, List
+from server.models import GraderResult, TaskDifficulty
 from server.tasks.task_registry import TASK_REGISTRY
+# ── Base weights ──────────────────────────────────────────────
 BASE_SCORE = 0.05
 PARTIAL_FIX_WEIGHT = 0.35
 COMPLETE_BONUS = 0.25
 HINT_PENALTY = 0.04
 FAILED_ACTION_PENALTY = 0.02
+# ── Difficulty modifiers ──────────────────────────────────────
+# Maps difficulty → (complete_bonus_extra, efficiency_decay_mult, hint_penalty_mult)
+#   complete_bonus_extra: added to COMPLETE_BONUS when all issues fixed
+#   efficiency_decay_mult: multiplier on decay (lower = more forgiving)
+#   hint_penalty_mult: multiplier on hint cost (lower = cheaper hints)
+DIFFICULTY_MODIFIERS = {
+    TaskDifficulty.EASY:   (0.00, 1.0, 1.0),
+    TaskDifficulty.MEDIUM: (0.00, 0.9, 1.0),
+    TaskDifficulty.HARD:   (0.03, 0.7, 0.75),
+}
+SCORE_FLOOR = 0.0
+SCORE_CEIL = 1.0
 EDIT_ACTION_TYPES = frozenset({
     "edit_file", "replace_line", "add_line",
 def _clamp(value: float) -> float:
+    """Clamp score to [0, 1]."""
     return max(SCORE_FLOOR, min(SCORE_CEIL, round(value, 4)))
+def _get_difficulty(task_id: str) -> TaskDifficulty:
+    """Look up a task's difficulty from the registry."""
+    task_cls = TASK_REGISTRY.get(task_id)
+    if task_cls is None:
+        return TaskDifficulty.MEDIUM
+    return task_cls.DIFFICULTY
 def run_grader(task_id: str, trajectory: List[Dict[str, Any]]) -> GraderResult:
     if task_id not in TASK_REGISTRY:
         raise ValueError(f"Unknown task: {task_id}")
+    difficulty = _get_difficulty(task_id)
+    bonus_extra, decay_mult, hint_mult = DIFFICULTY_MODIFIERS.get(
+        difficulty, (0.00, 1.0, 1.0)
+    )
     if not trajectory:
         return GraderResult(
             task_id=task_id,
                 "partial_fixes": 0.0,
                 "complete_solution": 0.0,
                 "efficiency": 0.0,
+                "difficulty_bonus": 0.0,
                 "hint_penalty": 0.0,
                 "failed_action_penalty": 0.0,
             },
     issues_total = max(1, int(final_step.get("info", {}).get("issues_total", 1)))
     fix_ratio = issues_fixed / issues_total
+    # ── Component 1: Partial fix credit (proportional) ────────
     partial_score = PARTIAL_FIX_WEIGHT * fix_ratio
+    # ── Component 2: Full-solution bonus ──────────────────────
     complete_bonus = COMPLETE_BONUS if issues_fixed == issues_total else 0.0
+    # ── Component 3: Difficulty bonus ─────────────────────────
+    # Extra reward for fully solving harder tasks
+    diff_bonus = bonus_extra if issues_fixed == issues_total else 0.0
+    # ── Component 4: Efficiency bonus ─────────────────────────
+    # Harder tasks get slower decay (more forgiving on step count)
     if issues_fixed == 0:
         efficiency_score = 0.0
     elif steps_taken <= issues_total:
         efficiency_score = EFFICIENCY_MAX
     else:
         extra = steps_taken - issues_total
+        effective_decay = EFFICIENCY_DECAY * decay_mult
+        efficiency_score = max(0.0, EFFICIENCY_MAX - effective_decay * extra)
+    # ── Component 5: Hint penalty ─────────────────────────────
+    # Harder tasks get reduced hint penalty (hints are more reasonable)
+    hint_pen = HINT_PENALTY * hint_mult * hints_used
+    # ── Component 6: Failed action penalty ────────────────────
     failed_edits = 0
     for step in trajectory:
         action = step.get("action", {})
                 failed_edits += 1
     failed_pen = FAILED_ACTION_PENALTY * failed_edits
+    raw = (
+        BASE_SCORE
+        + partial_score
+        + complete_bonus
+        + diff_bonus
+        + efficiency_score
+        - hint_pen
+        - failed_pen
+    )
     score = _clamp(raw)
+    # ── Feedback ──────────────────────────────────────────────
     if score >= 0.85:
         feedback = "Excellent — all issues fixed efficiently."
     elif score >= 0.65:
             "base": BASE_SCORE,
             "partial_fixes": round(partial_score, 4),
             "complete_solution": round(complete_bonus, 4),
+            "difficulty_bonus": round(diff_bonus, 4),
             "efficiency": round(efficiency_score, 4),
             "hint_penalty": round(-hint_pen, 4),
             "failed_action_penalty": round(-failed_pen, 4),
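
The README's "Difficulty Modifiers" table follows directly from the multipliers in `DIFFICULTY_MODIFIERS`. The sketch below derives those rows and shows what the `.get(..., default)` lookup does for a difficulty that has no entry. It is illustrative only: the `TaskDifficulty` enum here is a local stand-in for `server.models.TaskDifficulty`, its `EXPERT` member is hypothetical (the diff does not show whether the real enum has one), and `effective_parameters` is not a function from the repository.

```python
from enum import Enum


class TaskDifficulty(Enum):
    # Local stand-in for server.models.TaskDifficulty.
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"
    EXPERT = "expert"  # hypothetical member, not confirmed by the diff


EFFICIENCY_DECAY = 0.03
HINT_PENALTY = 0.04
DEFAULT_MODIFIERS = (0.00, 1.0, 1.0)  # same fallback tuple as in run_grader

# (complete_bonus_extra, efficiency_decay_mult, hint_penalty_mult)
DIFFICULTY_MODIFIERS = {
    TaskDifficulty.EASY: (0.00, 1.0, 1.0),
    TaskDifficulty.MEDIUM: (0.00, 0.9, 1.0),
    TaskDifficulty.HARD: (0.03, 0.7, 0.75),
}


def effective_parameters(difficulty):
    """Turn the multipliers into the per-step and per-hint values."""
    bonus_extra, decay_mult, hint_mult = DIFFICULTY_MODIFIERS.get(
        difficulty, DEFAULT_MODIFIERS
    )
    return {
        "difficulty_bonus": bonus_extra,
        "decay_per_step": round(EFFICIENCY_DECAY * decay_mult, 4),
        "hint_cost": round(HINT_PENALTY * hint_mult, 4),
    }


table = {d.name: effective_parameters(d) for d in TaskDifficulty}
for name, row in table.items():
    print(name, row)
```

In this sketch the unmapped `EXPERT` member falls back to the same values as `EASY`. If expert tasks carry their own enum value in the real code, they would need an entry in the mapping to receive the hard-tier treatment the README describes; if they are registered as `HARD`, nothing changes.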

smoke_test.py CHANGED

@@ -1,4 +1,4 @@
-"""Comprehensive smoke test for the CI/CD Debug FastAPI server.
 Usage:
   .\\.venv\\Scripts\\python.exe smoke_test.py
@@ -225,7 +225,7 @@ def run_smoke(client: EndpointClient) -> int:
 def main() -> int:
-    parser = argparse.ArgumentParser(description="Smoke test CI/CD Debug FastAPI server")
     parser.add_argument("--mode", choices=["inprocess", "live"], default="inprocess")
     parser.add_argument("--base-url", default="http://127.0.0.1:8000")
     args = parser.parse_args()

+"""Comprehensive smoke test for the Cloud-Native DevOps Debug FastAPI server.
 Usage:
   .\\.venv\\Scripts\\python.exe smoke_test.py
 def main() -> int:
+    parser = argparse.ArgumentParser(description="Smoke test Cloud-Native DevOps Debug FastAPI server")
     parser.add_argument("--mode", choices=["inprocess", "live"], default="inprocess")
     parser.add_argument("--base-url", default="http://127.0.0.1:8000")
     args = parser.parse_args()

tests/test_baseline.py CHANGED

@@ -1,7 +1,7 @@
 """Tests for baseline_runner and inference helpers."""
 from baseline_runner import run_baseline_episodes, _heuristic_episode
-from server.environment import CICDDebugEnvironment
 from server.tasks.task_registry import TASK_REGISTRY
@@ -15,7 +15,7 @@ def test_heuristic_baseline_scores_above_zero_on_most_scenarios():
     nonzero = 0
     for task_id, task_cls in TASK_REGISTRY.items():
         for scenario in task_cls.SCENARIOS:
-            env = CICDDebugEnvironment()
             result = _heuristic_episode(env, task_id, scenario["id"])
             total += 1
             if result.score > 0.0:
@@ -45,7 +45,7 @@ def test_heuristic_fixes_easy_tasks_well():
         task_cls = TASK_REGISTRY[task_id]
         scores = []
         for scenario in task_cls.SCENARIOS:
-            env = CICDDebugEnvironment()
             result = _heuristic_episode(env, task_id, scenario["id"])
             scores.append(result.score)
         avg = sum(scores) / len(scores)

 """Tests for baseline_runner and inference helpers."""
 from baseline_runner import run_baseline_episodes, _heuristic_episode
+from server.environment import CloudNativeDebugEnvironment
 from server.tasks.task_registry import TASK_REGISTRY
     nonzero = 0
     for task_id, task_cls in TASK_REGISTRY.items():
         for scenario in task_cls.SCENARIOS:
+            env = CloudNativeDebugEnvironment()
             result = _heuristic_episode(env, task_id, scenario["id"])
             total += 1
             if result.score > 0.0:
         task_cls = TASK_REGISTRY[task_id]
         scores = []
         for scenario in task_cls.SCENARIOS:
+            env = CloudNativeDebugEnvironment()
             result = _heuristic_episode(env, task_id, scenario["id"])
             scores.append(result.score)
         avg = sum(scores) / len(scores)

tests/test_determinism.py CHANGED

@@ -1,6 +1,6 @@
 """Determinism and score-range tests for the grader and environment."""
-from server.environment import CICDDebugEnvironment
 from server.graders import run_grader
 from server.models import Action, ActionType, FileEdit
 from server.tasks.task_registry import TASK_REGISTRY
@@ -11,8 +11,8 @@ from server.tasks.task_registry import TASK_REGISTRY
 def test_reset_deterministic_with_seed():
     """Same seed → same task, scenario, files, error."""
-    env1 = CICDDebugEnvironment()
-    env2 = CICDDebugEnvironment()
     obs1 = env1.reset(seed=42)
     obs2 = env2.reset(seed=42)
@@ -71,7 +71,7 @@ def test_full_episode_determinism():
     """Full episode replay produces identical trajectory and score."""
     scores = []
     for _ in range(5):
-        env = CICDDebugEnvironment()
         env.reset(task_id="dockerfile_syntax", scenario_id="typo_filename")
         action = Action(
             action_type=ActionType.EDIT_FILE,
@@ -240,7 +240,7 @@ def test_all_scenarios_have_required_fields():
 def test_end_to_end_grading_all_tasks():
     """Every task/scenario can be reset, fixed, and graded with score > 0."""
-    env = CICDDebugEnvironment()
     for task_id, task_cls in TASK_REGISTRY.items():
         task = task_cls()
         for scenario in task.SCENARIOS:

 """Determinism and score-range tests for the grader and environment."""
+from server.environment import CloudNativeDebugEnvironment
 from server.graders import run_grader
 from server.models import Action, ActionType, FileEdit
 from server.tasks.task_registry import TASK_REGISTRY
 def test_reset_deterministic_with_seed():
     """Same seed → same task, scenario, files, error."""
+    env1 = CloudNativeDebugEnvironment()
+    env2 = CloudNativeDebugEnvironment()
     obs1 = env1.reset(seed=42)
     obs2 = env2.reset(seed=42)
     """Full episode replay produces identical trajectory and score."""
     scores = []
     for _ in range(5):
+        env = CloudNativeDebugEnvironment()
         env.reset(task_id="dockerfile_syntax", scenario_id="typo_filename")
         action = Action(
             action_type=ActionType.EDIT_FILE,
 def test_end_to_end_grading_all_tasks():
     """Every task/scenario can be reset, fixed, and graded with score > 0."""
+    env = CloudNativeDebugEnvironment()
     for task_id, task_cls in TASK_REGISTRY.items():
         task = task_cls()
         for scenario in task.SCENARIOS:
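
The seed-determinism check in `test_reset_deterministic_with_seed` can be illustrated in isolation. The sketch below is not the project's environment: `ToyEnvironment`, `ToyObservation`, and the `SCENARIOS` mapping are hypothetical stand-ins, and only the task and scenario IDs are taken from the tests in this commit. It assumes that reset-time choices are drawn from a per-call seeded generator, which is one common way to get "same seed, same task and scenario".

```python
import random
from dataclasses import dataclass

# Hypothetical registry; the IDs mirror those used in the tests above.
SCENARIOS = {
    "dockerfile_syntax": ["typo_filename"],
    "workflow_secrets_permissions": ["missing_env_secrets"],
}


@dataclass(frozen=True)
class ToyObservation:
    task_id: str
    scenario_id: str


class ToyEnvironment:
    """Illustrative stand-in, not CloudNativeDebugEnvironment."""

    def reset(self, seed: int) -> ToyObservation:
        # A private Random instance keeps the draw independent of global state.
        rng = random.Random(seed)
        # Sorting fixes the iteration order before sampling.
        task_id = rng.choice(sorted(SCENARIOS))
        scenario_id = rng.choice(SCENARIOS[task_id])
        return ToyObservation(task_id, scenario_id)


# Two fresh instances with the same seed must agree on what they select.
obs1 = ToyEnvironment().reset(seed=42)
obs2 = ToyEnvironment().reset(seed=42)
assert obs1 == obs2
```

Using a local `random.Random(seed)` rather than `random.seed(seed)` means other code that touches the global generator cannot change the outcome between resets.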

tests/test_environment_flow.py CHANGED

@@ -1,9 +1,9 @@
-from server.environment import CICDDebugEnvironment
 from server.models import Action, ActionType, FileEdit
 def test_episode_flow_fix_and_autocomplete():
-    env = CICDDebugEnvironment()
     obs = env.reset(task_id="dockerfile_syntax", scenario_id="typo_filename", seed=7)
     assert obs.task_id == "dockerfile_syntax"
     assert obs.total_issues >= 1
@@ -28,7 +28,7 @@ def test_episode_flow_fix_and_autocomplete():
 def test_submit_runs_combined_simulation():
-    env = CICDDebugEnvironment()
     env.reset(task_id="workflow_secrets_permissions", scenario_id="missing_env_secrets", seed=42)
     obs, reward, done, info = env.step(Action(action_type=ActionType.SUBMIT, reasoning="validate"))
     assert done is True

+from server.environment import CloudNativeDebugEnvironment
 from server.models import Action, ActionType, FileEdit
 def test_episode_flow_fix_and_autocomplete():
+    env = CloudNativeDebugEnvironment()
     obs = env.reset(task_id="dockerfile_syntax", scenario_id="typo_filename", seed=7)
     assert obs.task_id == "dockerfile_syntax"
     assert obs.total_issues >= 1
 def test_submit_runs_combined_simulation():
+    env = CloudNativeDebugEnvironment()
     env.reset(task_id="workflow_secrets_permissions", scenario_id="missing_env_secrets", seed=42)
     obs, reward, done, info = env.step(Action(action_type=ActionType.SUBMIT, reasoning="validate"))
     assert done is True
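
The flow tests rely on `step` returning an `(obs, reward, done, info)` tuple and on `SUBMIT` ending the episode. A minimal sketch of that contract follows, under stated assumptions: `ToyEnv` and `ToyActionType` are hypothetical, the reward formula is invented for illustration, and the auto-complete rule (episode ends once every issue is fixed) is only inferred from the test name `test_episode_flow_fix_and_autocomplete`.

```python
from enum import Enum


class ToyActionType(Enum):
    EDIT_FILE = "edit_file"
    SUBMIT = "submit"


class ToyEnv:
    """Illustrative stand-in, not the real environment."""

    def __init__(self, total_issues: int = 1) -> None:
        self.total_issues = total_issues
        self.fixed = 0

    def step(self, action_type: ToyActionType):
        # Assumption: every edit fixes exactly one remaining issue.
        if action_type is ToyActionType.EDIT_FILE and self.fixed < self.total_issues:
            self.fixed += 1
        submitted = action_type is ToyActionType.SUBMIT
        # The episode ends on an explicit submit or once nothing is left to fix.
        done = submitted or self.fixed == self.total_issues
        reward = self.fixed / self.total_issues
        obs = {"issues_fixed": self.fixed, "total_issues": self.total_issues}
        info = {"submitted": submitted}
        return obs, reward, done, info


env = ToyEnv(total_issues=2)
obs, reward, done, info = env.step(ToyActionType.EDIT_FILE)
assert done is False
obs, reward, done, info = env.step(ToyActionType.SUBMIT)
assert done is True
```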

tutorial_references/02-deployment.md DELETED

@@ -1,427 +0,0 @@
-# 2. Deploying an OpenEnv environment
-This section covers deploying OpenEnv environments locally, on clusters, and on Hugging Face Spaces.
-**Contents:**
-- [Local Development with Uvicorn](#local-development-with-uvicorn)
-- [Docker Deployment](#docker-deployment)
-- [Hugging Face Spaces](#hugging-face-spaces)
-- [Best Practices](#best-practices)
-## HF Spaces are the infrastructure for OpenEnv environments
-Every HF Space provides three things that OpenEnv environments need:
-| Component | What it provides | How to access | Used as |
-|-----------|------------------|---------------|-----------|
-| **Server** | Running environment endpoint | `https://<username>-<space-name>.hf.space` | Agent and Public API |
-| **Repository** | Installable Python package | `pip install git+https://huggingface.co/spaces/<username>/<space-name>` | Code and client |
-| **Registry** | Docker container image | `docker pull registry.hf.space/<username>-<space-name>:latest` | Deployment |
-This means a single Space deployment gives you all the components you need to use an environment in training.
-### 1. Server: A running environment endpoint
-When you deploy to HF Spaces, your environment runs as a server. The client connects via **WebSocket** (`/ws`) for a persistent session:
-```python
-from echo_env import EchoEnv, EchoAction
-# Connect directly to the running Space (WebSocket under the hood)
-# Async (recommended):
-async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
-    result = await client.reset()
-    result = await client.step(EchoAction(message="Hello"))
-# Sync (using .sync() wrapper):
-with EchoEnv(base_url="https://openenv-echo-env.hf.space").sync() as client:
-    result = client.reset()
-    result = client.step(EchoAction(message="Hello"))
-```
-**Endpoints available:**
-| Endpoint | Protocol | Description |
-|----------|----------|-------------|
-| `/ws` | **WebSocket** | Persistent session (used by client) |
-| `/health` | HTTP GET | Health check |
-| `/reset` | HTTP POST | Reset environment (stateless) |
-| `/step` | HTTP POST | Execute action (stateless) |
-| `/state` | HTTP GET | Get current state |
-| `/docs` | HTTP GET | OpenAPI documentation |
-| `/web` | HTTP GET | Interactive web UI |
-> **Note:** The Python client uses the `/ws` WebSocket endpoint by default. HTTP endpoints are available for debugging or stateless use cases.
-**Example: Check if a Space is running**
-```bash
-curl https://openenv-echo-env.hf.space/health
-# {"status": "healthy"}
-```
-### 2. Repository: Installable Python package
-Every Space is a Git repository. OpenEnv environments include a `pyproject.toml`, making them pip-installable directly from the Space URL.
-```bash
-# Install client package from Space
-pip install git+https://huggingface.co/spaces/openenv/echo-env
-```
-This installs:
-- **Client class** (`EchoEnv`) — Handles HTTP/WebSocket communication
-- **Models** (`EchoAction`, `EchoObservation`) — Typed action and observation classes
-- **Utilities** — Any helper functions the environment provides
-**After installation:**
-```python
-from envs.echo_env import EchoEnv, EchoAction, EchoObservation
-# Now you have typed classes for the environment
-action = EchoAction(message="Hello")
-```
-### 3. Registry: Docker container image
-Every Docker-based Space has a container registry. You can pull and run the environment locally.
-```bash
-# Pull the image
-docker pull registry.hf.space/openenv-echo-env:latest
-# Run locally on port 8001
-docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
-```
-**Find the registry URL for any Space:**
-1. Go to the Space page (e.g., [openenv/echo-env](https://huggingface.co/spaces/openenv/echo-env))
-2. Click **⋮** (three dots) → **"Run locally"**
-3. Copy the `docker run` command
-### Choosing an access method
-| Method | Use when | Pros | Cons |
-|--------|----------|------|------|
-| **Server** | Quick testing, low volume | Zero setup | Network latency, rate limits |
-| **Repository** | Need typed classes | Type safety, IDE support | Still need a server |
-| **Docker** | Local dev, high throughput | Full control, no network | Requires Docker |
-**Typical workflow:**
-```python
-import asyncio
-from echo_env import EchoEnv, EchoAction
-async def main():
-    # Development: connect to remote Space
-    async with EchoEnv(base_url="https://openenv-echo-env.hf.space") as client:
-        result = await client.reset()
-    # Production: run locally for speed
-    # docker run -d -p 8001:8000 registry.hf.space/openenv-echo-env:latest
-    async with EchoEnv(base_url="http://localhost:8001") as client:
-        result = await client.reset()
-    # Or let the client manage Docker for you
-    client = await EchoEnv.from_env("openenv/echo-env")  # Auto-pulls and runs
-    async with client:
-        result = await client.reset()
-asyncio.run(main())
-# For sync usage, use the .sync() wrapper:
-with EchoEnv(base_url="http://localhost:8001").sync() as client:
-    result = client.reset()
-```
-> **Reference:** [HF Spaces Documentation](https://huggingface.co/docs/hub/spaces) | [Environment Hub Collection](https://huggingface.co/collections/openenv/environment-hub)
-## Local Development with Uvicorn
-The fastest way to iterate on environment logic is running directly with Uvicorn.
-## Clone and run the environment locally
-```bash
-# Clone from HF Space
-git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
-cd openenv-benchmark
-# Install in editable mode
-uv sync
-# Start server
-uv run server
-# Run isolated from remote Space
-uv run --isolated --project https://huggingface.co/spaces/burtenshaw/openenv-benchmark server
-```
-## Run Uvicorn directly
-```bash
-# Full control over uvicorn options
-uvicorn benchmark.server.app:app --host "$HOST" --port "$PORT" --workers "$WORKERS"
-# With reload for development
-uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --reload
-# Multi-worker mode for better concurrency
-uvicorn benchmark.server.app:app --host 0.0.0.0 --port 8000 --workers 4
-```
-| Flag | Purpose |
-|------|---------|
-| `--reload` | Auto-restart on code changes |
-| `--workers N` | Run N worker processes |
-| `--log-level debug` | Verbose logging |
-## Docker Deployment
-Docker provides isolation and reproducibility for production use.
-### Run the environment locally from the space
-```bash
-# Run the environment locally from the space
-docker run -d -p 8000:8000 registry.hf.space/openenv-echo-env:latest
-```
-### Build Image
-```bash
-# Clone from HF Space
-git clone https://huggingface.co/spaces/burtenshaw/openenv-benchmark
-cd openenv-benchmark
-# Using OpenEnv CLI (recommended)
-openenv build -t openenv-benchmark:latest
-# Or with Docker directly
-docker build -t openenv-benchmark:latest -f server/Dockerfile .
-```
-### Run Container
-```bash
-# Basic run
-docker run -d -p 8000:8000 my-env:latest
-# With environment variables
-docker run -d -p 8000:8000 \
-    -e WORKERS=4 \
-    -e MAX_CONCURRENT_ENVS=100 \
-    my-env:latest
-# Named container for easy management
-docker run -d --name my-env -p 8000:8000 my-env:latest
-```
-### Connect from Python
-```python
-import asyncio
-from echo_env import EchoEnv, EchoAction
-async def main():
-    # Async usage (recommended)
-    async with EchoEnv(base_url="http://localhost:8000") as client:
-        result = await client.reset()
-        result = await client.step(EchoAction(message="Hello"))
-        print(result.observation)
-    # From Docker image
-    client = await EchoEnv.from_docker_image("<local_docker_image>")
-    async with client:
-        result = await client.reset()
-        print(result.observation)
-asyncio.run(main())
-# Sync usage (using .sync() wrapper)
-with EchoEnv(base_url="http://localhost:8000").sync() as client:
-    result = client.reset()
-    result = client.step(EchoAction(message="Hello"))
-    print(result.observation)
-```
-### Container Lifecycle
-| Method | Container | WebSocket | On `close()` |
-|--------|-----------|-----------|--------------|
-| `from_hub(repo_id)` | Starts | Connects | Stops container |
-| `from_hub(repo_id, use_docker=False)` | None (UV) | Connects | Stops UV server |
-| `from_docker_image(image)` | Starts | Connects | Stops container |
-| `MyEnv(base_url=...)` | None | Connects | Disconnects only |
-Find Docker Commands for Any Space
-1. Open the Space on HuggingFace Hub
-2. Click **⋮ (three dots)** menu
-3. Select **"Run locally"**
-4. Copy the provided `docker run` command
-## Deploy with CLI
-```bash
-cd my_env
-# Deploy to your namespace
-openenv push
-# Deploy to specific repo
-openenv push --repo-id username/my-env
-# Deploy as private
-openenv push --repo-id username/my-env --private
-```
-### Space Configuration
-The `openenv.yaml` manifest controls Space settings:
-```yaml
-# openenv.yaml
-name: my_env
-version: "1.0.0"
-description: My custom environment
-```
-Hardware Options:
-| Tier | vCPU | RAM | Cost |
-|------|------|-----|------|
-| CPU Basic (Free) | 2 | 16GB | Free |
-| CPU Upgrade | 8 | 32GB | $0.03/hr |
-OpenEnv environments support configuration via environment variables.
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `WORKERS` | 4 | Uvicorn worker processes |
-| `PORT` | 8000 | Server port |
-| `HOST` | 0.0.0.0 | Bind address |
-| `MAX_CONCURRENT_ENVS` | 100 | Max WebSocket sessions |
-| `ENABLE_WEB_INTERFACE` | Auto | Enable web UI |
-### Environment-Specific Variables
-Some environments have custom variables:
-**TextArena:**
-```bash
-TEXTARENA_ENV_ID=Wordle-v0
-TEXTARENA_NUM_PLAYERS=1
-TEXTARENA_MAX_TURNS=6
-```
-**Coding Environment:**
-```bash
-SANDBOX_TIMEOUT=30
-MAX_OUTPUT_LENGTH=10000
-```
-# DEMO: Deploying to Hugging Face Spaces
-This demo walks through the full workflow: create an environment, test locally, deploy to HF Spaces, and use it.
-## Step 1: Initialize a new environment
-```bash
-openenv init my_env
-cd my_env
-```
-This creates the standard OpenEnv structure:
-```
-my_env/
-├── server/
-│   ├── app.py           # FastAPI server
-│   ├── environment.py   # Your environment logic
-│   └── Dockerfile
-├── models.py            # Action/Observation types
-├── client.py            # HTTP client
-├── openenv.yaml         # Manifest
-└── pyproject.toml
-```
-## Step 2: Run locally
-```bash
-# Start the server
-uv run server
-# Or with uvicorn directly
-uvicorn server.app:app --host 0.0.0.0 --port 8000 --reload
-```
-Test the health endpoint:
-```bash
-curl http://localhost:8000/health
-# {"status": "healthy"}
-```
-## Step 3: Deploy to HF Spaces
-```bash
-openenv push --repo-id username/my-env
-```
-Your environment is now live at:
-- Web UI: https://username-my-env.hf.space/web
-- API Docs: https://username-my-env.hf.space/docs
-- Health: https://username-my-env.hf.space/health
-```bash
-curl https://openenv-echo-env.hf.space/health
-# {"status": "healthy"}
-```
-## Step 4: Install the environment
-```bash
-uv pip install git+https://huggingface.co/spaces/openenv/echo_env
-```
-## Step 5: Run locally via Docker (optional)
-Pull and run the container from the HF registry, or open the [browser](https://huggingface.co/spaces/openenv/echo_env?docker=true):
-```bash
-# Pull from HF Spaces registry
-docker pull registry.hf.space/openenv-echo-env:latest
-# Run locally
-docker run -it -p 8000:8000 --platform=linux/amd64 \
-    registry.hf.space/openenv-echo-env:latest
-```
-Now connect to your local instance:
-```python
-import asyncio
-from echo_env import EchoEnv, EchoAction
-# Async (recommended)
-async def main():
-    async with EchoEnv(base_url="http://localhost:8000") as env:
-        result = await env.reset()
-        print(result.observation)
-        result = await env.step(EchoAction(message="Hello"))
-        print(result.observation)
-asyncio.run(main())
-# Sync (using .sync() wrapper)
-with EchoEnv(base_url="http://localhost:8000").sync() as env:
-    result = env.reset()
-    print(result.observation)
-    result = env.step(EchoAction(message="Hello"))
-    print(result.observation)
-```
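
The `curl .../health` checks in the deleted tutorial can also be done from Python before connecting a client. This is a sketch using only the standard library; `parse_health` and `is_healthy` are hypothetical helper names, and the only thing taken from the tutorial is the `/health` path and its `{"status": "healthy"}` response body.

```python
import json
import urllib.request


def parse_health(body: str) -> bool:
    """Return True only for a JSON object whose "status" is "healthy"."""
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def is_healthy(base_url: str, timeout: float = 5.0) -> bool:
    """GET <base_url>/health; treat any network or HTTP error as unhealthy."""
    url = f"{base_url.rstrip('/')}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_health(resp.read().decode("utf-8"))
    except OSError:
        # URLError, HTTPError, and socket timeouts all derive from OSError.
        return False
```

A caller might run `is_healthy("http://localhost:8000")` in a short retry loop after `docker run` and open the client only once it returns `True`.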