Mrkumar007 committed on
Commit 16bd852 (verified) · 1 parent: a49c996

Upload folder using huggingface_hub
IMPLEMENTATION_ROADMAP.md CHANGED
@@ -1,272 +1,161 @@
  # QueueOps OpenEnv Implementation Roadmap

- This file is the execution reference for building and iterating the queue operations environment.

- Scope constraints:
- - Keep current repository structure unchanged.
- - Use cloud_queue_env as the project root.
- - Follow OpenEnv compliance strictly.
- - Provide deterministic graders with partial scores in [0, 1].
- - Keep at least 3 benchmark tasks (easy, medium, hard).

  ---

- ## V1 - MVP Submission Build

- Goal: ship a complete, valid benchmark that can be submitted.

- ### Phase 1 - Environment Core
  Sub-goals:
- 1. Replace template echo behavior with queue simulator dynamics.
- 2. Implement deterministic state transitions using explicit seeds.
- 3. Implement terminal conditions with fixed task horizons.
- 4. Keep OpenEnv contract: reset, step, state.

- Exit criteria:
- 1. reset/step/state are stable and deterministic for fixed seed + fixed action trace.
- 2. Episodes terminate correctly.

- ### Phase 2 - Task Pack (Easy/Medium/Hard)
  Sub-goals:
- 1. Add task selector and fixed per-task configs.
- 2. Easy: single queue with admission/dispatch control.
- 3. Medium: multi-server with class-aware routing.
- 4. Hard: two-stage queue network with scaling decisions.

- Exit criteria:
- 1. All three tasks run end-to-end.
- 2. Difficulty progression is visible from easy to hard.

- ### Phase 3 - Deterministic Graders
  Sub-goals:
- 1. Implement per-task score equations with partial credit.
- 2. Clamp all task scores to [0, 1].
- 3. Handle edge cases (NaN/Inf/missing metrics) safely.
- 4. Add final aggregate score across tasks.

- Exit criteria:
- 1. Same seeds and same actions always produce the same score.
- 2. Scores are interpretable and bounded.

- ### Phase 4 - Reward Shaping
  Sub-goals:
- 1. Add dense multi-component rewards (wait, throughput, SLA, cost, fairness, safety).
- 2. Penalize invalid and exploit-like behavior.
- 3. Keep reward scale bounded and stable.
- 4. Expose component breakdown in metadata/info.

- Exit criteria:
- 1. Reward changes across trajectory (not terminal-only).
- 2. Unsafe behavior is consistently penalized.

- ### Phase 5 - Inference Runner
  Sub-goals:
- 1. Run all benchmark tasks with fixed seeds.
- 2. Use OpenAI-compatible client with provider credentials from env variables.
- 3. Emit [START], [STEP], [END] logs and final [SUMMARY].
- 4. Keep runs reproducible (fixed model params).

- Exit criteria:
- 1. End-to-end benchmark run works locally and on deployed runtime.
- 2. Output format is submission-ready.

- ### Phase 6 - Validation and Docs
  Sub-goals:
- 1. Pass openenv validate.
- 2. Ensure Docker build/run path works.
- 3. Update README with task, reward, grading, and baseline usage.
- 4. Add sample benchmark output snippet for evidence.

- Exit criteria:
- 1. Validation passes.
- 2. README is complete for judges and users.

  ### V1 Submission Gate
- All items must be true:
- 1. Three tasks implemented and deterministic.
- 2. Graders produce valid partial scores in [0, 1].
- 3. Inference script runs all tasks and reports summary.
- 4. OpenEnv validation passes.
- 5. Deployment path is functional.

  ---

- ## V2 - Robustness and Quality Upgrade

- Goal: improve reliability, calibration, and benchmark trustworthiness.

  ### Phase 1 - Determinism Hardening
  Sub-goals:
- 1. Separate RNG streams for arrivals/service/abandonment/shocks.
- 2. Add replay trace mode for debugging.
- 3. Add deterministic episode metadata for audits.

  ### Phase 2 - Difficulty Calibration
  Sub-goals:
- 1. Tune easy/medium/hard parameter separation.
- 2. Improve anti-exploit balancing (reject-all, noop loops, over-scaling).
- 3. Re-check reward and grade alignment across seeds.

- ### Phase 3 - Reporting Upgrade
  Sub-goals:
- 1. Add per-seed result table.
- 2. Add mean/std and confidence summary.
- 3. Add failure/invalid-action diagnostics in summary.

  ### V2 Exit Criteria
- 1. Lower variance for fixed seed sets.
- 2. Clearer task progression and fairer scoring.
- 3. Better debugging and reproducibility outputs.

  ---

- ## V3 - Extended Benchmark Pack

- Goal: increase novelty and long-term benchmark value.

- ### Phase 1 - Optional Task D
  Sub-goals:
- 1. Add stronger non-stationary demand patterns.
- 2. Grade robustness to bursts and demand shifts.

- ### Phase 2 - Optional Task E
  Sub-goals:
- 1. Add partial observability/noisy delayed metrics.
- 2. Grade safe decision-making under uncertainty.

- ### Phase 3 - Public Benchmarking Bundle
  Sub-goals:
- 1. Publish official seed suites and profiles (quick/standard/full).
- 2. Provide reference baseline runs.
- 3. Provide reproducibility notes for external users.

  ### V3 Exit Criteria
- 1. Four or more tasks available.
- 2. Stronger novelty and benchmark coverage.
- 3. Cleaner external benchmarking workflow.

  ---
- ## Recommended Execution Order

- 1. Complete V1 and submit.
- 2. Upgrade to V2 for reliability and scoring quality.
- 3. Add V3 only if timeline permits.

- ## Current Status Snapshot
-
- 1. V1 core implementation is in place and running.
- 2. openenv validate has passed.
- 3. V2 determinism hardening, calibration pass, and reporting upgrade are implemented.
- 4. Current focus shifts to V3 extensions and benchmark quality tuning.
-
- ## V2 Completion Notes
-
- Implemented outcomes:
- 1. Separate RNG streams are active for arrivals, service, abandonment, and exogenous effects.
- 2. Deterministic trace metadata is exposed (`trace_digest`, `seed`, and RNG stream seeds).
- 3. Anti-exploit reward calibration includes rejection-heavy and harmful downscale penalties.
- 4. Inference supports multi-seed reporting with mean/std/ci95 outputs.
- 5. Inference supports replay-mode action traces via file input for deterministic debugging.
- 6. Inference supports JSON/CSV report export for per-seed analysis.
-
- ---
-
- ## Requirement Coverage Matrix (From requirementInfo.md)
-
- This section is the final compliance tracker for judging criteria.
-
- ### Functional Requirements
-
- 1. Real-world task simulation
- - Requirement: Must represent real human operational work, not toy behavior.
- - Implementation target: queue operations in call center/cloud/logistics-style flow.
- - Evidence to keep: README motivation + task descriptions + action semantics.
- - Status: in progress (core done, examples and narrative should be strengthened).
-
- 2. OpenEnv spec compliance
- - Requirement: typed models, reset, step(action), state, openenv.yaml, validate pass.
- - Implementation target: models.py + server environment + openenv.yaml + app entrypoint.
- - Evidence to keep: `openenv validate` output in PR notes/README.
- - Status: done (validate passing).
-
- 3. Minimum 3 tasks with deterministic graders
- - Requirement: at least easy/medium/hard, deterministic 0.0-1.0 grading.
- - Implementation target: task configs + per-task scoring formulas + clamping.
- - Evidence to keep: sample run showing all tasks and deterministic seeds.
- - Status: done for 3 tasks, polish recommended for calibration.
-
- 4. Meaningful reward function
- - Requirement: dense trajectory signal + penalties for undesirable behavior.
- - Implementation target: weighted reward components and safety penalties.
- - Evidence to keep: reward component logging in metadata and README equations.
- - Status: done, tune weights in V2.
-
- 5. Baseline inference script
- - Requirement: OpenAI-compatible client, env vars credentials, reproducible score over tasks.
- - Implementation target: fixed tasks/seeds/model params, required log format.
- - Evidence to keep: saved run logs and summary scores.
- - Status: done, provider-fallback robustness can be improved.
-
- ### Non-Functional Requirements
-
- 1. Hugging Face Space deployment
- - Requirement: containerized HF Space tagged openenv.
- - Evidence to keep: Space URL + successful run proof.
- - Status: done.
-
- 2. Containerized execution
- - Requirement: Dockerfile works with build + run.
- - Evidence to keep: commands and successful output snippet.
- - Status: pending explicit evidence capture in docs.
-
- 3. Documentation completeness
- - Requirement: README includes env motivation, spaces, tasks, setup/usage, baseline scores.
- - Evidence to keep: README sections + benchmark output table.
- - Status: mostly done, baseline score table still needed.
-
- ---
-
- ## Evaluation Criteria Coverage Checklist
-
- ### Real-world utility (30%)
- 1. Keep README examples tied to concrete real operations scenarios.
- 2. Add one paragraph on why this benchmark is useful for agent evaluation.
-
- ### Task and grader quality (25%)
- 1. Keep deterministic seed set fixed and documented.
- 2. Show per-task scoring decomposition and bounded outputs.
- 3. Add one reproducibility check note: same seed + same policy => same score.
-
- ### Environment design (20%)
- 1. Verify clean reset and sensible done boundaries for all tasks.
- 2. Keep action/observation schema stable and documented.
- 3. Keep dense reward with interpretable components.
-
- ### Code quality and spec compliance (15%)
- 1. Keep `openenv validate` passing.
- 2. Capture docker build/run commands and outcomes.
- 3. Keep deployment and ws route functional.
-
- ### Creativity and novelty (10%)
- 1. Emphasize queue-control benchmark novelty in README.
- 2. Keep multi-objective reward and cost/fairness tradeoff visible.
-
- ---
-
- ## Pre-Submission Evidence Pack (Must Attach)
-
- 1. Validation proof
- - `openenv validate` success output.
-
- 2. Runtime proof
- - HF Space URL and one successful task run excerpt.
-
- 3. Baseline proof
- - One full [START]/[STEP]/[END]/[SUMMARY] run log.
-
- 4. Docker proof
- - `docker build` and `docker run` command results.
-
- 5. Documentation proof
- - README includes baseline score table (easy, medium, hard, final).
  # QueueOps OpenEnv Implementation Roadmap

+ This roadmap is the execution reference for building the real-world queueing environment in this repository.

+ Constraints locked in:
+ - Keep existing directory structure unchanged.
+ - Treat `cloud_queue_env/` as the project root.
+ - Use HF token provider flow in `inference.py`.
+ - Follow OpenEnv compliance strictly: typed models, `step()/reset()/state()`, valid `openenv.yaml`.
+ - Provide deterministic graders with partial scoring in `[0, 1]`.
+ - Deliver at least 3 tasks (more optional).

  ---

+ ## V1 - Hackathon-Ready Submission

+ Goal: submit a valid, real-world OpenEnv benchmark with 3 deterministic graded tasks and reproducible inference outputs.

+ ### Phase 1 - Core Simulator Foundation
  Sub-goals:
+ 1. Replace echo logic with queue-operations simulation core.
+ 2. Add deterministic RNG with explicit seed handling.
+ 3. Implement proper episode boundaries (`horizon`, terminal conditions).
+ 4. Keep strict OpenEnv contract for `reset()`, `step()`, and `state`.

+ Definition of done:
+ - Environment no longer behaves as dummy echo.
+ - Same seed + same action trace => identical trajectory.
+ - Episode always terminates predictably.

+ ### Phase 2 - Task System (Easy/Medium/Hard)
  Sub-goals:
+ 1. Add task selection (`task_id`) and per-task config.
+ 2. Implement Task A (single queue, admission control).
+ 3. Implement Task B (multi-server, priority routing).
+ 4. Implement Task C (two-stage queue network, dynamic scaling/cost).

+ Definition of done:
+ - All 3 tasks run end-to-end from `reset()` to terminal state.
+ - Difficulty progression is visible from A -> B -> C.

+ ### Phase 3 - Deterministic Graders + Partial Scoring
  Sub-goals:
+ 1. Implement per-task grader formulas from master spec.
+ 2. Keep each grader output bounded in `[0, 1]`.
+ 3. Handle invalid/NaN/infinite values safely and deterministically.
+ 4. Aggregate final benchmark score as mean of task scores.

+ Definition of done:
+ - Repeated runs on same seeds produce same grader outputs.
+ - Partial scoring is meaningful (not binary pass/fail only).

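The grader contract above (bounded output, safe NaN/Inf handling, mean aggregation) can be sketched in a few lines; the function names here are illustrative, not the repo's actual API:

```python
import math


def clamp_score(value: float) -> float:
    """Clamp a raw grader value into [0, 1]; NaN/Inf map to 0.0 deterministically."""
    if not math.isfinite(value):
        return 0.0
    return max(0.0, min(1.0, value))


def aggregate_score(task_scores: list[float]) -> float:
    """Final benchmark score: mean of per-task scores, itself bounded in [0, 1]."""
    if not task_scores:
        return 0.0
    return clamp_score(sum(clamp_score(s) for s in task_scores) / len(task_scores))
```

Mapping non-finite values to 0.0 (rather than raising) keeps repeated runs on the same seeds bit-identical even when a metric degenerates.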
+ ### Phase 4 - Reward Shaping and Safety Penalties
  Sub-goals:
+ 1. Add dense reward components: wait, throughput, SLA, cost, fairness, safety.
+ 2. Add penalties for invalid actions and exploit patterns.
+ 3. Bound reward scale across tasks.
+ 4. Expose reward components in `info` for debugging.

+ Definition of done:
+ - Reward moves through trajectory, not only at the end.
+ - Unsafe or degenerate behavior is penalized.

+ ### Phase 5 - Inference Protocol Compliance
  Sub-goals:
+ 1. Update `inference.py` to run all required tasks with fixed seeds.
+ 2. Keep OpenAI client usage while authenticating with HF token flow.
+ 3. Emit strict `[START]`, `[STEP]`, `[END]` line format.
+ 4. Print per-task and final aggregate scores.

+ Definition of done:
+ - Script executes benchmark sweep reproducibly.
+ - Output format matches hackathon requirements.
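As a sketch, the `[SUMMARY]` line follows the shape documented in the README (`[SUMMARY] easy=<...> medium=<...> hard=<...> final=<...>`); the `[STEP]` field layout shown here is an assumption:

```python
def step_line(task_id: str, step: int, reward: float) -> str:
    """One [STEP] log line (field layout here is illustrative)."""
    return f"[STEP] task={task_id} step={step} reward={reward:.3f}"


def summary_line(scores: dict[str, float]) -> str:
    """Final [SUMMARY] line: per-task scores plus their mean as `final`."""
    final = sum(scores.values()) / len(scores)
    parts = " ".join(f"{task}={scores[task]:.3f}" for task in ("easy", "medium", "hard"))
    return f"[SUMMARY] {parts} final={final:.3f}"
```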

+ ### Phase 6 - Packaging, Validation, Documentation
  Sub-goals:
+ 1. Validate `openenv.yaml` metadata and app wiring.
+ 2. Confirm Docker build/run success.
+ 3. Update README with task definitions, action/observation spaces, reward/grader equations, baseline results.
+ 4. Verify deployment readiness for HF Space.

+ Definition of done:
+ - OpenEnv validation passes.
+ - Container starts and serves correctly.
+ - README is submission-ready.

  ### V1 Submission Gate
+ All must be true:
+ 1. 3 tasks implemented and deterministic.
+ 2. Graders return valid partial scores in `[0, 1]`.
+ 3. Inference script reports reproducible benchmark outputs.
+ 4. OpenEnv spec compliance confirmed.
+ 5. Docker and README requirements satisfied.

  ---

+ ## V2 - Quality and Robustness Upgrade

+ Goal: improve benchmark reliability, score stability, and anti-exploit behavior after initial submission.

  ### Phase 1 - Determinism Hardening
  Sub-goals:
+ 1. Split RNG streams (arrivals/service/abandonment/shocks).
+ 2. Add trace replay support for debugging.
+ 3. Extend `info` with deterministic audit fields.
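One way to derive the split streams from a single episode seed using only the standard library (a sketch; the environment may derive its streams differently, e.g. via numpy `SeedSequence` spawning):

```python
import random


def make_streams(seed: int) -> dict[str, random.Random]:
    """Derive one reproducible RNG per stochastic process from a single episode seed.

    Because each process owns its own generator, changing how service times are
    sampled leaves the arrival sequence bit-identical.
    """
    names = ("arrivals", "service", "abandonment", "shocks")
    return {name: random.Random(f"{seed}:{name}") for name in names}
```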

  ### Phase 2 - Difficulty Calibration
  Sub-goals:
+ 1. Tune parameters for cleaner A/B/C separation.
+ 2. Improve level interpolation behavior.
+ 3. Add stronger guards against reject-all or noop exploitation.

+ ### Phase 3 - Reporting and Confidence
  Sub-goals:
+ 1. Add standardized per-seed report table.
+ 2. Add mean/std summaries over seed sets.
+ 3. Flag unstable metrics and grader edge cases.
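The per-seed summary statistics might be computed as below; treating ci95 as a normal-approximation half-width (1.96·std/√n) is an assumption about the repo's exact definition:

```python
import statistics


def seed_summary(scores: list[float]) -> dict[str, float]:
    """Mean, sample std, and 95% CI half-width over a set of per-seed scores."""
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    ci95 = 1.96 * std / (len(scores) ** 0.5)
    return {"mean": mean, "std": std, "ci95": ci95}
```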

  ### V2 Exit Criteria
+ 1. Lower run-to-run variance on fixed seed sets.
+ 2. Clearer task difficulty progression.
+ 3. Better fairness and exploit resistance.

  ---

+ ## V3 - Extended Benchmark Pack (Optional)

+ Goal: increase novelty and long-term benchmark value with optional extra tasks.

+ ### Phase 1 - Task D (Non-stationary Load)
  Sub-goals:
+ 1. Add shift-based and bursty arrivals.
+ 2. Grade robustness under changing demand.

+ ### Phase 2 - Task E (Partial Observability)
  Sub-goals:
+ 1. Add delayed/noisy metrics.
+ 2. Grade safe decisions under uncertainty.

+ ### Phase 3 - Public Benchmark Packaging
  Sub-goals:
+ 1. Publish official seed suites.
+ 2. Add benchmark profiles: quick / standard / full.
+ 3. Provide reference baseline outputs.

  ### V3 Exit Criteria
+ 1. 4-5 total tasks available.
+ 2. Broader real-world coverage.
+ 3. Stronger benchmark differentiation.

  ---

+ ## Execution Order

+ Recommended order:
+ 1. Complete V1 fully and submit.
+ 2. Continue with V2 for quality hardening.
+ 3. Do V3 only if timeline allows.

+ Immediate next implementation step:
+ - Start V1 Phase 1 (models + simulator core + deterministic state transitions).
 
README.md CHANGED
@@ -1,369 +1,369 @@
- ---
- title: Cloud Queue Env Environment Server
- emoji: 🖨️
- colorFrom: pink
- colorTo: blue
- sdk: docker
- pinned: false
- app_port: 8000
- base_path: /web
- tags:
- - openenv
- ---
-
- # Cloud Queue Env Environment
-
- A real-world queue-operations benchmark for OpenEnv.
-
- This environment simulates service operations decisions humans make in production systems:
- - Admission and rejection under load
- - Queue routing and dispatching
- - Priority handling for urgent traffic
- - Capacity scaling under infrastructure cost constraints
-
- The benchmark includes three deterministic tasks with partial graders in [0, 1]:
- - easy: single-queue stability
- - medium: multi-server priority routing
- - hard: two-stage queue network with scaling
-
- ## Quick Start
-
- Use the CloudQueueEnv client to connect to a running server or container:
-
- ```python
- from cloud_queue_env import CloudQueueAction, CloudQueueEnv
-
- # Start the container before the try block so `env` is always defined in `finally`
- env = CloudQueueEnv.from_docker_image("cloud_queue_env-env:latest")
- try:
-     # Configure task + seed, then reset into that deterministic episode
-     env.reset()
-     env.step(CloudQueueAction(action_type="configure_task", task_id="easy", seed=11))
-     result = env.reset()
-
-     for _ in range(20):
-         obs = result.observation
-         if obs.incoming_job_present:
-             action = CloudQueueAction(action_type="admit", target_queue=0)
-         else:
-             action = CloudQueueAction(action_type="dispatch", target_queue=0)
-
-         result = env.step(action)
-         print(
-             f"step={obs.sim_time} queues={obs.queue_lengths} "
-             f"reward={result.reward:.3f} done={result.done}"
-         )
-         if result.done:
-             break
-
-     final_score = result.observation.metadata.get("episode_score", 0.0)
-     print(f"episode_score={final_score:.3f}")
- finally:
-     env.close()
- ```
-
- The CloudQueueEnv.from_docker_image() method handles:
- - Starting the Docker container
- - Waiting for the server to be ready
- - Connecting to the environment
- - Container cleanup when you call `close()`
-
- ## Building the Docker Image
-
- Before using the environment, you need to build the Docker image:
-
- ```bash
- # From project root
- docker build -t cloud_queue_env-env:latest -f server/Dockerfile .
- ```
-
- ## Deploying to Hugging Face Spaces
-
- You can deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
-
- ```bash
- # From the environment directory (where openenv.yaml is located)
- openenv push
-
- # Or specify options
- openenv push --namespace my-org --private
- ```
-
- The `openenv push` command will:
- 1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
- 2. Prepare a custom build for the Hugging Face Docker Space (enables the web interface)
- 3. Upload to Hugging Face (ensuring you're logged in)
-
- ### Prerequisites
-
- - Authenticate with Hugging Face: the command will prompt for login if not already authenticated
-
- ### Options
-
- - `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
- - `--repo-id`, `-r`: Repository ID in format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
- - `--base-image`, `-b`: Base Docker image to use (overrides Dockerfile FROM)
- - `--private`: Deploy the space as private (default: public)
-
- ### Examples
-
- ```bash
- # Push to your personal namespace (defaults to username/env-name from openenv.yaml)
- openenv push
-
- # Push to a specific repository
- openenv push --repo-id my-org/my-env
-
- # Push with a custom base image
- openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
-
- # Push as a private space
- openenv push --private
-
- # Combine options
- openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
- ```
-
- After deployment, your space will be available at:
- `https://huggingface.co/spaces/<repo-id>`
-
- The deployed space includes:
- - **Web Interface** at `/web` - Interactive UI for exploring the environment
- - **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
- - **Health Check** at `/health` - Container health monitoring
- - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
-
- ## Environment Details
-
- ### Action
- CloudQueueAction fields:
- - action_type: one of configure_task, admit, reject, route, dispatch, scale, reprioritize, noop
- - target_queue: queue index for route/dispatch/admit
- - target_server: optional server index
- - scale_delta: server delta for scale action
- - new_priority: new priority value for reprioritize
- - task_id: easy/medium/hard (used with configure_task)
- - seed: deterministic task seed (used with configure_task)
-
- ### Observation
- CloudQueueObservation includes:
- - task_id, sim_time, horizon
- - queue_lengths, queue_wait_ema
- - server_busy, server_remaining_service, utilization
- - incoming_job_present, incoming_job_size, incoming_job_priority, incoming_job_deadline, incoming_job_type
- - sla_violation_rate, abandonment_rate, throughput_recent, energy_cost_rate
- - level, optional_history, action_mask
- - reward, done, metadata
-
- ### Reward
- Per-step reward is dense and multi-objective:
-
- $$
- r_t = 0.35R_{wait} + 0.20R_{throughput} + 0.20R_{sla} + 0.15R_{cost} + 0.05R_{fair} + 0.05R_{safe}
- $$
-
- Properties:
- - Partial progress signal over the full trajectory
- - Penalties for invalid actions and unsafe/noop behavior under congestion
- - Bounded reward values for stability
-
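A sketch of that weighted combination, with component keys assumed to mirror the equation's subscripts (and components assumed normalized to [0, 1], which the bounded-reward property suggests):

```python
# Weights taken from the reward equation above
WEIGHTS = {
    "wait": 0.35,
    "throughput": 0.20,
    "sla": 0.20,
    "cost": 0.15,
    "fair": 0.05,
    "safe": 0.05,
}


def step_reward(components: dict[str, float]) -> float:
    """Weighted sum of per-step reward components; missing components count as 0."""
    return sum(weight * components.get(name, 0.0) for name, weight in WEIGHTS.items())
```

Since the weights sum to 1.0, a step where every component is at its best value of 1.0 yields a reward of exactly 1.0.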
- ### Deterministic Graders
- Each task returns a deterministic episode_score in [0, 1], stored in observation metadata.
-
- - easy score uses avg wait, throughput, rejection rate, and SLA violations
- - medium score uses urgent/normal p95 waits, urgent SLA, throughput, and action cost
- - hard score uses end-to-end p95, abandonment, SLA, throughput, infra cost, and fairness gap
-
- If the invalid action rate exceeds a threshold, the score is capped.
-
- ## Tasks
-
- 1. easy (single queue stability)
- - one queue, one server
- - objective: low wait with acceptable throughput and low rejection
-
- 2. medium (priority routing)
- - two queues and multiple servers
- - objective: protect urgent traffic while maintaining total performance
-
- 3. hard (queue network + scaling)
- - two-stage queue network with bursty arrivals and heavy-tailed service times
- - objective: balance latency/SLA/abandonment against infra cost and fairness
-
- ## Baseline Inference
-
- Run baseline inference across easy/medium/hard:
-
- ```bash
- API_KEY=your_provider_key python inference.py
- ```
-
- Optional variables:
- - API_KEY (OpenAI-compatible provider key for model calls)
- - API_BASE_URL (default: https://router.huggingface.co/v1)
- - MODEL_NAME (default: Qwen/Qwen2.5-72B-Instruct)
- - BASE_URL (if using deployed space)
- - IMAGE_NAME (if launching local docker image)
- - USE_HEURISTIC_ONLY (true/false)
- - DISABLE_MODEL_ON_FIRST_ERROR (true/false)
- - MAX_STEPS_OVERRIDE (integer quick-test cap)
- - TASK_SEEDS_JSON (JSON map for multi-seed runs)
- - ACTION_TRACE_FILE (JSON replay file keyed by task:seed)
- - REPORT_JSON_PATH (write seed/task report JSON)
- - REPORT_CSV_PATH (write per-seed report CSV)
-
- Output includes required line types:
- - [START]
- - [STEP]
- - [END]
-
- And final aggregate summary:
- - [SUMMARY] easy=<...> medium=<...> hard=<...> final=<...>
-
- V2 reporting also includes:
- - [REPORT_SEED] task=<task_id> seed=<seed> score=<score> steps=<n> trace=<digest>
- - [REPORT] task=<task_id> seeds=<n> mean=<score> std=<score> ci95=<score>
-
- ## Baseline Scores
-
- Current reproducible heuristic-only baseline (deployed runtime, single seed per task):
-
- | Task | Seed Count | Mean Score |
- |---|---:|---:|
- | easy | 1 | 0.000 |
- | medium | 1 | 0.000 |
- | hard | 1 | 0.000 |
- | final (mean of task means) | - | 0.000 |
-
- Notes:
- - These values are from heuristic fallback mode and are expected to be low.
- - Model-based scores depend on provider/model availability and should be recorded from a successful funded run.
- - Keep this table updated with your latest official benchmark run before final submission.
-
- ## Advanced Usage
-
- ### Connecting to an Existing Server
-
- If you already have a Cloud Queue Env environment server running, you can connect directly:
-
- ```python
- from cloud_queue_env import CloudQueueAction, CloudQueueEnv
-
- # Connect to existing server
- env = CloudQueueEnv(base_url="<ENV_HTTP_URL_HERE>")
-
- # Use as normal
- result = env.reset()
- result = env.step(CloudQueueAction(action_type="dispatch", target_queue=0))
- ```
-
- Note: When connecting to an existing server, `env.close()` will NOT stop the server.
-
- ### Using the Context Manager
-
- The client supports context manager usage for automatic connection management:
-
- ```python
- from cloud_queue_env import CloudQueueAction, CloudQueueEnv
-
- # Connect with context manager (auto-connects and closes)
- with CloudQueueEnv(base_url="http://localhost:8000") as env:
-     result = env.reset()
-     print(f"Initial queues: {result.observation.queue_lengths}")
-     # Multiple steps with low latency
-     for _ in range(10):
-         result = env.step(CloudQueueAction(action_type="noop"))
-         print(f"Reward: {result.reward:.3f}")
- ```
-
- The client uses WebSocket connections for:
- - **Lower latency**: No HTTP connection overhead per request
- - **Persistent session**: Server maintains your environment state
- - **Efficient for episodes**: Better for many sequential steps
-
- ### Concurrent WebSocket Sessions
-
- The server supports multiple concurrent WebSocket connections. To enable this,
- modify `server/app.py` to use factory mode:
-
- ```python
- # In server/app.py - use factory mode for concurrent sessions
- app = create_app(
-     CloudQueueEnvironment,  # Pass class, not instance
-     CloudQueueAction,
-     CloudQueueObservation,
-     max_concurrent_envs=4,  # Allow 4 concurrent sessions
- )
- ```
-
- Then multiple clients can connect simultaneously:
-
- ```python
- from cloud_queue_env import CloudQueueAction, CloudQueueEnv
- from concurrent.futures import ThreadPoolExecutor
-
- def run_episode(client_id: int):
-     with CloudQueueEnv(base_url="http://localhost:8000") as env:
-         result = env.reset()
-         for i in range(10):
-             result = env.step(CloudQueueAction(action_type="dispatch", target_queue=i % 2))
-         return client_id, result.observation.queue_lengths
-
- # Run 4 episodes concurrently
- with ThreadPoolExecutor(max_workers=4) as executor:
-     results = list(executor.map(run_episode, range(4)))
- ```
-
- ## Development & Testing
-
- ### Direct Environment Testing
-
- Core files:
- - models: typed action/observation schema
- - server environment: queue simulation, reward shaping, grading
- - inference script: task sweep and benchmark logging
-
- ### Running Locally
-
- Run the server locally for development:
-
- ```bash
- uvicorn server.app:app --reload
- ```
-
- ## Project Structure
-
- ```
- cloud_queue_env/
- ├── .dockerignore
- ├── __init__.py
- ├── README.md
- ├── openenv.yaml
- ├── pyproject.toml
- ├── client.py
- ├── models.py
- ├── inference.py
- ├── IMPLEMENTATION_ROADMAP.md
- └── server/
-     ├── __init__.py
-     ├── cloud_queue_env_environment.py
-     ├── app.py
-     └── Dockerfile
- ```
-
- TASK A — Easy (150 steps)
- Scenario: 1 queue, 1 server (M/M/1), only admit/reject/dispatch
- Objective: Keep wait low while sustaining throughput
- Grader: score = 0.40×(1-avg_wait/6) + 0.30×(throughput/70)
- + 0.15×(1-rejection_rate/0.3) + 0.15×(1-sla_breaches/0.3)
-
- TASK B — Medium (200 steps)
- Scenario: 2 queues, 3 servers, 28% urgent jobs → route + reprioritize
- Objective: Protect urgent SLA while not starving normal jobs
- Grader: score = 0.35×urgent_wait_score + 0.25×urgent_sla_score
- + 0.15×normal_wait_score + 0.15×throughput + 0.10×cost
-
- TASK C — Hard (250 steps)
- Scenario: 2-stage pipeline, 1–6 servers, heavy-tail service, abandonments
- Objective: Maximize quality under budget with fairness
- Grader: score = 0.25×e2e_latency + 0.20×abandonment + 0.20×sla
  + 0.15×throughput + 0.10×cost + 0.10×fairness

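The TASK A grader formula above is concrete enough to sketch directly; clamping each term as well as the total is an assumption here (the docs only state that episode_score stays in [0, 1]):

```python
def grade_task_a(avg_wait: float, throughput: float,
                 rejection_rate: float, sla_breaches: float) -> float:
    """Task A grader: weighted wait/throughput/rejection/SLA terms, bounded in [0, 1]."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))

    score = (0.40 * clamp(1 - avg_wait / 6)
             + 0.30 * clamp(throughput / 70)
             + 0.15 * clamp(1 - rejection_rate / 0.3)
             + 0.15 * clamp(1 - sla_breaches / 0.3))
    return clamp(score)
```

A perfect episode (zero wait, throughput 70, no rejections, no SLA breaches) scores exactly 1.0, and every term degrades linearly toward its own floor, which is what makes the partial credit meaningful.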
+ ---
+ title: Cloud Queue Env Environment Server
+ emoji: 🖨️
+ colorFrom: pink
+ colorTo: blue
+ sdk: docker
+ pinned: false
+ app_port: 8000
+ base_path: /web
+ tags:
+ - openenv
+ ---
+
+ # Cloud Queue Env Environment
+
+ A real-world queue-operations benchmark for OpenEnv.
+
+ This environment simulates the service-operations decisions humans make in production systems:
+ - Admission and rejection under load
+ - Queue routing and dispatching
+ - Priority handling for urgent traffic
+ - Capacity scaling under infrastructure cost constraints
+
+ The benchmark includes three deterministic tasks with partial graders in [0, 1]:
+ - easy: single-queue stability
+ - medium: multi-server priority routing
+ - hard: two-stage queue network with scaling
+
+ ## Quick Start
+
+ Use the `CloudQueueEnv` client to connect to a running server or container:
+
+ ```python
+ from cloud_queue_env import CloudQueueAction, CloudQueueEnv
+
+ env = CloudQueueEnv.from_docker_image("cloud_queue_env-env:latest")
+ try:
+     # Configure task + seed, then reset into that deterministic episode
+     env.reset()
+     env.step(CloudQueueAction(action_type="configure_task", task_id="easy", seed=11))
+     result = env.reset()
+
+     for _ in range(20):
+         obs = result.observation
+         if obs.incoming_job_present:
+             action = CloudQueueAction(action_type="admit", target_queue=0)
+         else:
+             action = CloudQueueAction(action_type="dispatch", target_queue=0)
+
+         result = env.step(action)
+         print(
+             f"step={obs.sim_time} queues={obs.queue_lengths} "
+             f"reward={result.reward:.3f} done={result.done}"
+         )
+         if result.done:
+             break
+
+     final_score = result.observation.metadata.get("episode_score", 0.0)
+     print(f"episode_score={final_score:.3f}")
+ finally:
+     env.close()
+ ```
+
+ The `CloudQueueEnv.from_docker_image()` method handles:
+ - Starting the Docker container
+ - Waiting for the server to be ready
+ - Connecting to the environment
+ - Container cleanup when you call `close()`
+
+ ## Building the Docker Image
+
+ Before using the environment, build the Docker image:
+
+ ```bash
+ # From project root
+ docker build -t cloud_queue_env-env:latest -f server/Dockerfile .
+ ```
+
+ ## Deploying to Hugging Face Spaces
+
+ You can deploy your OpenEnv environment to Hugging Face Spaces using the `openenv push` command:
+
+ ```bash
+ # From the environment directory (where openenv.yaml is located)
+ openenv push
+
+ # Or specify options
+ openenv push --namespace my-org --private
+ ```
+
+ The `openenv push` command will:
+ 1. Validate that the directory is an OpenEnv environment (checks for `openenv.yaml`)
+ 2. Prepare a custom build for a Hugging Face Docker Space (enables the web interface)
+ 3. Upload to Hugging Face (ensuring you're logged in)
+
+ ### Prerequisites
+
+ - Authenticate with Hugging Face: the command will prompt for login if not already authenticated
+
+ ### Options
+
+ - `--directory`, `-d`: Directory containing the OpenEnv environment (defaults to current directory)
+ - `--repo-id`, `-r`: Repository ID in the format 'username/repo-name' (defaults to 'username/env-name' from openenv.yaml)
+ - `--base-image`, `-b`: Base Docker image to use (overrides the Dockerfile FROM)
+ - `--private`: Deploy the space as private (default: public)
+
+ ### Examples
+
+ ```bash
+ # Push to your personal namespace (defaults to username/env-name from openenv.yaml)
+ openenv push
+
+ # Push to a specific repository
+ openenv push --repo-id my-org/my-env
+
+ # Push with a custom base image
+ openenv push --base-image ghcr.io/meta-pytorch/openenv-base:latest
+
+ # Push as a private space
+ openenv push --private
+
+ # Combine options
+ openenv push --repo-id my-org/my-env --base-image custom-base:latest --private
+ ```
+
+ After deployment, your space will be available at:
+ `https://huggingface.co/spaces/<repo-id>`
+
+ The deployed space includes:
+ - **Web Interface** at `/web` - Interactive UI for exploring the environment
+ - **API Documentation** at `/docs` - Full OpenAPI/Swagger interface
+ - **Health Check** at `/health` - Container health monitoring
+ - **WebSocket** at `/ws` - Persistent session endpoint for low-latency interactions
+
+ ## Environment Details
+
+ ### Action
+ CloudQueueAction fields:
+ - action_type: one of configure_task, admit, reject, route, dispatch, scale, reprioritize, noop
+ - target_queue: queue index for route/dispatch/admit
+ - target_server: optional server index
+ - scale_delta: server delta for the scale action
+ - new_priority: new priority value for reprioritize
+ - task_id: easy/medium/hard (used with configure_task)
+ - seed: deterministic task seed (used with configure_task)
+
+ ### Observation
+ CloudQueueObservation includes:
+ - task_id, sim_time, horizon
+ - queue_lengths, queue_wait_ema
+ - server_busy, server_remaining_service, utilization
+ - incoming_job_present, incoming_job_size, incoming_job_priority, incoming_job_deadline, incoming_job_type
+ - sla_violation_rate, abandonment_rate, throughput_recent, energy_cost_rate
+ - level, optional_history, action_mask
+ - reward, done, metadata
+
+ ### Reward
+ Per-step reward is dense and multi-objective:
+
+ $$
+ r_t = 0.35R_{wait} + 0.20R_{throughput} + 0.20R_{sla} + 0.15R_{cost} + 0.05R_{fair} + 0.05R_{safe}
+ $$
+
+ Properties:
+ - Partial progress signal over the full trajectory
+ - Penalties for invalid actions and unsafe/noop behavior under congestion
+ - Bounded reward values for stability
+
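The weighted sum above can be sketched in Python. This is an illustration only, not the server's actual implementation; the component names, their pre-scaling to [-1, 1], and the final clamp are assumptions:

```python
def combine_reward(components: dict[str, float]) -> float:
    """Weighted multi-objective reward; each component is assumed pre-scaled to [-1, 1]."""
    weights = {
        "wait": 0.35, "throughput": 0.20, "sla": 0.20,
        "cost": 0.15, "fair": 0.05, "safe": 0.05,
    }
    # Missing components contribute zero
    raw = sum(weights[k] * components.get(k, 0.0) for k in weights)
    # Bound the per-step reward for training stability
    return max(-1.0, min(1.0, raw))
```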
+ ### Deterministic Graders
+ Each task returns a deterministic episode_score in [0, 1], stored in observation metadata.
+
+ - easy score uses avg wait, throughput, rejection rate, and SLA violations
+ - medium score uses urgent/normal p95 waits, urgent SLA, throughput, and action cost
+ - hard score uses end-to-end p95, abandonment, SLA, throughput, infra cost, and fairness gap
+
+ If the invalid-action rate exceeds a threshold, the score is capped.
+
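A minimal sketch of the clamping and capping behavior described above. The 0.2 invalid-rate threshold and 0.5 cap are illustrative assumptions, not the benchmark's actual constants:

```python
import math


def clamp_score(raw: float, invalid_rate: float,
                invalid_threshold: float = 0.2, cap: float = 0.5) -> float:
    """Clamp a raw grader score into [0, 1], guard NaN/Inf, and cap on excessive invalid actions."""
    # Treat NaN/Inf metrics as a zero score rather than propagating them
    if not math.isfinite(raw):
        raw = 0.0
    score = max(0.0, min(1.0, raw))
    # Cap the episode score when too many actions were invalid
    if invalid_rate > invalid_threshold:
        score = min(score, cap)
    return score
```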
+ ## Tasks
+
+ 1. easy (single queue stability)
+    - one queue, one server
+    - objective: low wait with acceptable throughput and low rejection
+
+ 2. medium (priority routing)
+    - two queues and multiple servers
+    - objective: protect urgent traffic while maintaining total performance
+
+ 3. hard (queue network + scaling)
+    - two-stage queue network with bursty arrivals and heavy-tailed service times
+    - objective: balance latency/SLA/abandonment against infra cost and fairness
+
+ ## Baseline Inference
+
+ Run baseline inference across easy/medium/hard:
+
+ ```bash
+ API_KEY=your_provider_key python inference.py
+ ```
+
+ Optional variables:
+ - API_KEY (OpenAI-compatible provider key for model calls)
+ - API_BASE_URL (default: https://router.huggingface.co/v1)
+ - MODEL_NAME (default: Qwen/Qwen2.5-72B-Instruct)
+ - BASE_URL (if using a deployed space)
+ - IMAGE_NAME (if launching a local Docker image)
+ - USE_HEURISTIC_ONLY (true/false)
+ - DISABLE_MODEL_ON_FIRST_ERROR (true/false)
+ - MAX_STEPS_OVERRIDE (integer quick-test cap)
+ - TASK_SEEDS_JSON (JSON map for multi-seed runs)
+ - ACTION_TRACE_FILE (JSON replay file keyed by task:seed)
+ - REPORT_JSON_PATH (write seed/task report JSON)
+ - REPORT_CSV_PATH (write per-seed report CSV)
+
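For example, a multi-seed run can be configured by setting TASK_SEEDS_JSON before launching inference.py. The seed values below are arbitrary examples:

```python
import json
import os

# Map each task to the seeds it should be evaluated on (values are arbitrary examples)
task_seeds = {"easy": [11, 23, 37], "medium": [11, 23, 37], "hard": [11, 23, 37]}
os.environ["TASK_SEEDS_JSON"] = json.dumps(task_seeds)

# inference.py parses this back into a task -> seeds map
parsed = json.loads(os.environ["TASK_SEEDS_JSON"])
```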
+ Output includes the required line types:
+ - [START]
+ - [STEP]
+ - [END]
+
+ And a final aggregate summary:
+ - [SUMMARY] easy=<...> medium=<...> hard=<...> final=<...>
+
+ V2 reporting also includes:
+ - [REPORT_SEED] task=<task_id> seed=<seed> score=<score> steps=<n> trace=<digest>
+ - [REPORT] task=<task_id> seeds=<n> mean=<score> std=<score> ci95=<score>
+
+ ## Baseline Scores
+
+ Current reproducible heuristic-only baseline (deployed runtime, single seed per task):
+
+ | Task | Seed Count | Mean Score |
+ |---|---:|---:|
+ | easy | 1 | 0.000 |
+ | medium | 1 | 0.000 |
+ | hard | 1 | 0.000 |
+ | final (mean of task means) | - | 0.000 |
+
+ Notes:
+ - These values are from heuristic fallback mode and are expected to be low.
+ - Model-based scores depend on provider/model availability and should be recorded from a successful funded run.
+ - Keep this table updated with your latest official benchmark run before final submission.
+
+ ## Advanced Usage
+
+ ### Connecting to an Existing Server
+
+ If you already have a Cloud Queue Env environment server running, you can connect directly:
+
+ ```python
+ from cloud_queue_env import CloudQueueAction, CloudQueueEnv
+
+ # Connect to an existing server
+ env = CloudQueueEnv(base_url="<ENV_HTTP_URL_HERE>")
+
+ # Use as normal
+ result = env.reset()
+ result = env.step(CloudQueueAction(action_type="dispatch", target_queue=0))
+ ```
+
+ Note: When connecting to an existing server, `env.close()` will NOT stop the server.
+
+ ### Using the Context Manager
+
+ The client supports context-manager usage for automatic connection management:
+
+ ```python
+ from cloud_queue_env import CloudQueueAction, CloudQueueEnv
+
+ # Connect with a context manager (auto-connects and closes)
+ with CloudQueueEnv(base_url="http://localhost:8000") as env:
+     result = env.reset()
+     print(f"Initial queues: {result.observation.queue_lengths}")
+     # Multiple steps with low latency
+     for _ in range(10):
+         result = env.step(CloudQueueAction(action_type="noop"))
+         print(f"Reward: {result.reward:.3f}")
+ ```
+
+ The client uses WebSocket connections for:
+ - **Lower latency**: No HTTP connection overhead per request
+ - **Persistent session**: Server maintains your environment state
+ - **Efficient for episodes**: Better for many sequential steps
+
+ ### Concurrent WebSocket Sessions
+
+ The server supports multiple concurrent WebSocket connections. To enable this,
+ modify `server/app.py` to use factory mode:
+
+ ```python
+ # In server/app.py - use factory mode for concurrent sessions
+ app = create_app(
+     CloudQueueEnvironment,  # Pass class, not instance
+     CloudQueueAction,
+     CloudQueueObservation,
+     max_concurrent_envs=4,  # Allow 4 concurrent sessions
+ )
+ ```
+
+ Then multiple clients can connect simultaneously:
+
+ ```python
+ from cloud_queue_env import CloudQueueAction, CloudQueueEnv
+ from concurrent.futures import ThreadPoolExecutor
+
+ def run_episode(client_id: int):
+     with CloudQueueEnv(base_url="http://localhost:8000") as env:
+         result = env.reset()
+         for i in range(10):
+             result = env.step(CloudQueueAction(action_type="dispatch", target_queue=i % 2))
+         return client_id, result.observation.queue_lengths
+
+ # Run 4 episodes concurrently
+ with ThreadPoolExecutor(max_workers=4) as executor:
+     results = list(executor.map(run_episode, range(4)))
+ ```
+
+ ## Development & Testing
+
+ ### Direct Environment Testing
+
+ Core files:
+ - models: typed action/observation schema
+ - server environment: queue simulation, reward shaping, grading
+ - inference script: task sweep and benchmark logging
+
+ ### Running Locally
+
+ Run the server locally for development:
+
+ ```bash
+ uvicorn server.app:app --reload
+ ```
+
+ ## Project Structure
+
+ ```
+ cloud_queue_env/
+ ├── .dockerignore
+ ├── __init__.py
+ ├── README.md
+ ├── openenv.yaml
+ ├── pyproject.toml
+ ├── client.py
+ ├── models.py
+ ├── inference.py
+ ├── IMPLEMENTATION_ROADMAP.md
+ └── server/
+     ├── __init__.py
+     ├── cloud_queue_env_environment.py
+     ├── app.py
+     └── Dockerfile
+ ```
+
+ TASK A — Easy (150 steps)
+   Scenario: 1 queue, 1 server (M/M/1), only admit/reject/dispatch
+   Objective: Keep wait low while sustaining throughput
+   Grader: score = 0.40×(1-avg_wait/6) + 0.30×(throughput/70)
+           + 0.15×(1-rejection_rate/0.3) + 0.15×(1-sla_breaches/0.3)
+ TASK B — Medium (200 steps)
+   Scenario: 2 queues, 3 servers, 28% urgent jobs → route + reprioritize
+   Objective: Protect urgent SLA while not starving normal jobs
+   Grader: score = 0.35×urgent_wait_score + 0.25×urgent_sla_score
+           + 0.15×normal_wait_score + 0.15×throughput + 0.10×cost
+ TASK C — Hard (250 steps)
+   Scenario: 2-stage pipeline, 1–6 servers, heavy-tail service, abandonments
+   Objective: Maximize quality under budget with fairness
+   Grader: score = 0.25×e2e_latency + 0.20×abandonment + 0.20×sla
  + 0.15×throughput + 0.10×cost + 0.10×fairness
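The Task A formula above reads directly as code. A sketch, assuming each term is clamped to [0, 1] before weighting (the clamping is an assumption; the weights come from the grader description):

```python
def easy_score(avg_wait: float, throughput: float,
               rejection_rate: float, sla_breaches: float) -> float:
    """Task A grader as written above, with each term clamped to [0, 1]."""
    def clamp(x: float) -> float:
        return max(0.0, min(1.0, x))

    return clamp(
        0.40 * clamp(1 - avg_wait / 6)
        + 0.30 * clamp(throughput / 70)
        + 0.15 * clamp(1 - rejection_rate / 0.3)
        + 0.15 * clamp(1 - sla_breaches / 0.3)
    )
```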
client.py CHANGED
@@ -1,123 +1,123 @@
- # Copyright (c) Meta Platforms, Inc. and affiliates.
- # All rights reserved.
- #
- # This source code is licensed under the BSD-style license found in the
- # LICENSE file in the root directory of this source tree.
-
- """Cloud Queue Env Environment Client."""
-
- from typing import Dict
-
- from openenv.core import EnvClient
- from openenv.core.client_types import StepResult
- from openenv.core.env_server.types import State
-
- from .models import CloudQueueAction, CloudQueueObservation
-
-
- class CloudQueueEnv(
-     EnvClient[CloudQueueAction, CloudQueueObservation, State]
- ):
-     """
-     Client for the Cloud Queue Env Environment.
-
-     This client maintains a persistent WebSocket connection to the environment server,
-     enabling efficient multi-step interactions with lower latency.
-     Each client instance has its own dedicated environment session on the server.
-
-     Example:
-         >>> # Connect to a running server
-         >>> with CloudQueueEnv(base_url="http://localhost:8000") as client:
-         ...     result = client.reset()
-         ...     print(result.observation.queue_lengths)
-         ...
-         ...     result = client.step(CloudQueueAction(action_type="admit", target_queue=0))
-         ...     print(result.observation.throughput_recent)
-
-     Example with Docker:
-         >>> # Automatically start container and connect
-         >>> client = CloudQueueEnv.from_docker_image("cloud_queue_env-env:latest")
-         >>> try:
-         ...     result = client.reset()
-         ...     result = client.step(CloudQueueAction(action_type="dispatch", target_queue=0))
-         ... finally:
-         ...     client.close()
-     """
-
-     def _step_payload(self, action: CloudQueueAction) -> Dict:
-         """
-         Convert CloudQueueAction to JSON payload for step message.
-
-         Args:
-             action: CloudQueueAction instance
-
-         Returns:
-             Dictionary representation suitable for JSON encoding
-         """
-         return {
-             "action_type": action.action_type,
-             "target_queue": action.target_queue,
-             "target_server": action.target_server,
-             "scale_delta": action.scale_delta,
-             "new_priority": action.new_priority,
-             "task_id": action.task_id,
-             "seed": action.seed,
-         }
-
-     def _parse_result(self, payload: Dict) -> StepResult[CloudQueueObservation]:
-         """
-         Parse server response into StepResult[CloudQueueObservation].
-
-         Args:
-             payload: JSON response data from server
-
-         Returns:
-             StepResult with CloudQueueObservation
-         """
-         obs_data = payload.get("observation", {})
-         observation = CloudQueueObservation(
-             task_id=obs_data.get("task_id", "easy"),
-             sim_time=obs_data.get("sim_time", 0),
-             horizon=obs_data.get("horizon", 0),
-             queue_lengths=obs_data.get("queue_lengths", []),
-             queue_wait_ema=obs_data.get("queue_wait_ema", []),
-             server_busy=obs_data.get("server_busy", []),
-             server_remaining_service=obs_data.get("server_remaining_service", []),
-             utilization=obs_data.get("utilization", []),
-             incoming_job_present=obs_data.get("incoming_job_present", False),
-             incoming_job_size=obs_data.get("incoming_job_size", 0.0),
-             incoming_job_priority=obs_data.get("incoming_job_priority", 0),
-             incoming_job_deadline=obs_data.get("incoming_job_deadline", 0.0),
-             incoming_job_type=obs_data.get("incoming_job_type", 0),
-             sla_violation_rate=obs_data.get("sla_violation_rate", 0.0),
-             abandonment_rate=obs_data.get("abandonment_rate", 0.0),
-             throughput_recent=obs_data.get("throughput_recent", 0.0),
-             energy_cost_rate=obs_data.get("energy_cost_rate", 0.0),
-             level=obs_data.get("level", 1.0),
-             optional_history=obs_data.get("optional_history", []),
-             action_mask=obs_data.get("action_mask", []),
-             done=payload.get("done", False),
-             reward=payload.get("reward"),
-             metadata=obs_data.get("metadata", {}),
-         )
-
-         return StepResult(
-             observation=observation,
-             reward=payload.get("reward"),
-             done=payload.get("done", False),
-         )
-
-     def _parse_state(self, payload: Dict) -> State:
-         """
-         Parse server response into State object.
-
-         Args:
-             payload: JSON response from state request
-
-         Returns:
-             State object with episode_id and step_count
-         """
-         return State(
-             episode_id=payload.get("episode_id"),
-             step_count=payload.get("step_count", 0),
-         )

+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """Cloud Queue Env Environment Client."""
+
+ from typing import Dict
+
+ from openenv.core import EnvClient
+ from openenv.core.client_types import StepResult
+ from openenv.core.env_server.types import State
+
+ from .models import CloudQueueAction, CloudQueueObservation
+
+
+ class CloudQueueEnv(
+     EnvClient[CloudQueueAction, CloudQueueObservation, State]
+ ):
+     """
+     Client for the Cloud Queue Env Environment.
+
+     This client maintains a persistent WebSocket connection to the environment server,
+     enabling efficient multi-step interactions with lower latency.
+     Each client instance has its own dedicated environment session on the server.
+
+     Example:
+         >>> # Connect to a running server
+         >>> with CloudQueueEnv(base_url="http://localhost:8000") as client:
+         ...     result = client.reset()
+         ...     print(result.observation.queue_lengths)
+         ...
+         ...     result = client.step(CloudQueueAction(action_type="admit", target_queue=0))
+         ...     print(result.observation.throughput_recent)
+
+     Example with Docker:
+         >>> # Automatically start container and connect
+         >>> client = CloudQueueEnv.from_docker_image("cloud_queue_env-env:latest")
+         >>> try:
+         ...     result = client.reset()
+         ...     result = client.step(CloudQueueAction(action_type="dispatch", target_queue=0))
+         ... finally:
+         ...     client.close()
+     """
+
+     def _step_payload(self, action: CloudQueueAction) -> Dict:
+         """
+         Convert CloudQueueAction to JSON payload for step message.
+
+         Args:
+             action: CloudQueueAction instance
+
+         Returns:
+             Dictionary representation suitable for JSON encoding
+         """
+         return {
+             "action_type": action.action_type,
+             "target_queue": action.target_queue,
+             "target_server": action.target_server,
+             "scale_delta": action.scale_delta,
+             "new_priority": action.new_priority,
+             "task_id": action.task_id,
+             "seed": action.seed,
+         }
+
+     def _parse_result(self, payload: Dict) -> StepResult[CloudQueueObservation]:
+         """
+         Parse server response into StepResult[CloudQueueObservation].
+
+         Args:
+             payload: JSON response data from server
+
+         Returns:
+             StepResult with CloudQueueObservation
+         """
+         obs_data = payload.get("observation", {})
+         observation = CloudQueueObservation(
+             task_id=obs_data.get("task_id", "easy"),
+             sim_time=obs_data.get("sim_time", 0),
+             horizon=obs_data.get("horizon", 0),
+             queue_lengths=obs_data.get("queue_lengths", []),
+             queue_wait_ema=obs_data.get("queue_wait_ema", []),
+             server_busy=obs_data.get("server_busy", []),
+             server_remaining_service=obs_data.get("server_remaining_service", []),
+             utilization=obs_data.get("utilization", []),
+             incoming_job_present=obs_data.get("incoming_job_present", False),
+             incoming_job_size=obs_data.get("incoming_job_size", 0.0),
+             incoming_job_priority=obs_data.get("incoming_job_priority", 0),
+             incoming_job_deadline=obs_data.get("incoming_job_deadline", 0.0),
+             incoming_job_type=obs_data.get("incoming_job_type", 0),
+             sla_violation_rate=obs_data.get("sla_violation_rate", 0.0),
+             abandonment_rate=obs_data.get("abandonment_rate", 0.0),
+             throughput_recent=obs_data.get("throughput_recent", 0.0),
+             energy_cost_rate=obs_data.get("energy_cost_rate", 0.0),
+             level=obs_data.get("level", 1.0),
+             optional_history=obs_data.get("optional_history", []),
+             action_mask=obs_data.get("action_mask", []),
+             done=payload.get("done", False),
+             reward=payload.get("reward"),
+             metadata=obs_data.get("metadata", {}),
+         )
+
+         return StepResult(
+             observation=observation,
+             reward=payload.get("reward"),
+             done=payload.get("done", False),
+         )
+
+     def _parse_state(self, payload: Dict) -> State:
+         """
+         Parse server response into State object.
+
+         Args:
+             payload: JSON response from state request
+
+         Returns:
+             State object with episode_id and step_count
+         """
+         return State(
+             episode_id=payload.get("episode_id"),
+             step_count=payload.get("step_count", 0),
+         )
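For intuition about the wire format, `_step_payload` flattens the action into a JSON-ready dict. A self-contained sketch with a stand-in dataclass (the real CloudQueueAction lives in models.py; the class here is hypothetical, with fields copied from the diff):

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Action:
    """Stand-in for CloudQueueAction, mirroring the fields serialized above."""
    action_type: str = "noop"
    target_queue: Optional[int] = None
    target_server: Optional[int] = None
    scale_delta: Optional[int] = None
    new_priority: Optional[int] = None
    task_id: Optional[str] = None
    seed: Optional[int] = None


# asdict produces the same flat dict that _step_payload builds by hand
payload = asdict(Action(action_type="admit", target_queue=0))
```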
inference.py CHANGED
@@ -1,4 +1,12 @@
- """Baseline inference runner for the queue operations benchmark tasks."""

  import asyncio
  import csv
@@ -12,7 +20,7 @@ from urllib.parse import urlparse, urlunparse
  from dotenv import load_dotenv
  from openai import OpenAI

- load_dotenv()  # Load environment variables from .env file

  from cloud_queue_env import CloudQueueAction, CloudQueueEnv, CloudQueueObservation

@@ -20,10 +28,8 @@ from cloud_queue_env import CloudQueueAction, CloudQueueEnv, CloudQueueObservati
  IMAGE_NAME = os.getenv("IMAGE_NAME")
  BASE_URL = os.getenv("BASE_URL")

-
  API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
  MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
-
  API_KEY = os.getenv("API_KEY") or os.getenv("HF_TOKEN")

  BENCHMARK = os.getenv("BENCHMARK", "queueops-openenv")
@@ -31,42 +37,43 @@ TASKS = ["easy", "medium", "hard"]
  TASK_SEEDS_JSON = os.getenv("TASK_SEEDS_JSON")
  SEEDS = [11, 23, 37]
  TEMPERATURE = 0.2
- MAX_TOKENS = 180
  SUCCESS_SCORE_THRESHOLD = 0.60
- USE_HEURISTIC_ONLY = os.getenv("USE_HEURISTIC_ONLY", "false").lower() in {"1", "true", "yes"}
- DISABLE_MODEL_ON_FIRST_ERROR = os.getenv("DISABLE_MODEL_ON_FIRST_ERROR", "true").lower() in {"1", "true", "yes"}
- MAX_STEPS_OVERRIDE = int(os.getenv("MAX_STEPS_OVERRIDE", "0") or "0")
  ACTION_TRACE_FILE = os.getenv("ACTION_TRACE_FILE")
  REPORT_JSON_PATH = os.getenv("REPORT_JSON_PATH")
  REPORT_CSV_PATH = os.getenv("REPORT_CSV_PATH")

  SYSTEM_PROMPT = textwrap.dedent(
      """
      You are an agent controlling a cloud queue scheduling environment.
      Your goal: minimize wait times, SLA violations, and cost while maximizing throughput.

-     ACTIONS (return exactly one JSON object, no extra text):
-     {"action_type": "admit", "target_queue": 0} — accept incoming job into queue 0
-     {"action_type": "route", "target_queue": 1} — accept incoming job into queue 1 (medium/hard only)
-     {"action_type": "reject", "target_queue": null} — reject incoming job (use when queues are filling up)
-     {"action_type": "dispatch", "target_queue": 0} — move job from queue to an idle server
-     {"action_type": "reprioritize","new_priority": 2} — promote a normal job to urgent (medium/hard only)
-     {"action_type": "scale", "scale_delta": 1} — add 1 server (+1) or remove 1 server (-1) (hard only)
-     {"action_type": "noop", "target_queue": null} — do nothing
-
-     STRATEGY HINTS:
-     - REJECT jobs when queue fill is above 60% to prevent overflow and SLA breaches.
-     - ADMIT when queues have space and server is idle.
-     - DISPATCH after admitting to keep servers busy.
-     - On medium/hard: ROUTE urgent jobs (priority=2) to a less-loaded queue.
-     - On hard: SCALE up (+1) when queue_fill > 70% and cost allows; scale down when queues are empty.
-     - Negative reward means the system is struggling — change strategy.
-
-     Return ONLY valid JSON. No explanation.
      """
  ).strip()

-
  ACTION_TYPES = (
      "configure_task",
      "admit",
@@ -84,33 +91,11 @@ TASK_ALLOWED_ACTIONS = {
      "hard": {"admit", "reject", "route", "dispatch", "reprioritize", "scale", "noop"},
  }

- MODEL_ACTION_RESPONSE_FORMAT = {
-     "type": "json_schema",
-     "json_schema": {
-         "name": "cloud_queue_action",
-         "strict": True,
-         "schema": {
-             "type": "object",
-             "additionalProperties": False,
-             "required": [
-                 "action_type",
-                 "target_queue",
-                 "target_server",
-                 "scale_delta",
-                 "new_priority",
-             ],
-             "properties": {
-                 "action_type": {"type": "string", "enum": list(ACTION_TYPES)},
-                 "target_queue": {"type": ["integer", "null"], "minimum": 0},
-                 "target_server": {"type": ["integer", "null"], "minimum": 0},
-                 "scale_delta": {"type": ["integer", "null"], "minimum": -2, "maximum": 2},
-                 "new_priority": {"type": ["integer", "null"], "minimum": 0, "maximum": 3},
-             },
-         },
-     },
- }

- _SCHEMA_RESPONSE_FORMAT_FAILED = False


  def log_start(task: str, env: str, model: str) -> None:
@@ -142,8 +127,8 @@ def parse_task_seed_map() -> dict[str, list[int]]:
              task_map[str(task_name)] = parsed
          if task_map:
              return task_map
-     except Exception as exc:
-         print(f"[DEBUG] Invalid TASK_SEEDS_JSON, falling back to defaults: {exc}", flush=True)

      return {
          "easy": [SEEDS[0]],
@@ -169,8 +154,7 @@ def load_replay_actions() -> dict[str, list[CloudQueueAction]]:
      try:
          with open(ACTION_TRACE_FILE, "r", encoding="utf-8") as f:
              payload = json.load(f)
-     except Exception as exc:
-         print(f"[DEBUG] Failed to load ACTION_TRACE_FILE: {exc}", flush=True)
          return {}

      replay: dict[str, list[CloudQueueAction]] = {}
@@ -211,8 +195,8 @@ def write_reports(seed_rows: list[dict], task_score_table: dict[str, list[float]
      try:
          with open(REPORT_JSON_PATH, "w", encoding="utf-8") as f:
              json.dump(report_payload, f, indent=2)
-     except Exception as exc:
-         print(f"[DEBUG] Failed to write REPORT_JSON_PATH: {exc}", flush=True)

      if REPORT_CSV_PATH:
          try:
@@ -228,28 +212,25 @@ def write_reports(seed_rows: list[dict], task_score_table: dict[str, list[float]
                      "trace_digest",
                      "invalid_actions",
                      "harmful_scale_down",
                  ],
              )
              writer.writeheader()
              for row in seed_rows:
                  writer.writerow(row)
-         except Exception as exc:
-             print(f"[DEBUG] Failed to write REPORT_CSV_PATH: {exc}", flush=True)


  def build_obs_summary(obs: CloudQueueObservation, task_name: str) -> str:
-     """Build a rich, structured text summary of the observation for the LLM prompt."""
-     # Queue fill percentages — helps model know when to reject
      max_sizes = {"easy": 28, "medium": 42, "hard": 64}
      max_q = max_sizes.get(task_name, 30)
      fills = [f"{l}/{max_q}({100*l//max_q}%)" for l in obs.queue_lengths]

-     # Server status
      busy_count = sum(obs.server_busy)
      total_servers = len(obs.server_busy)
      servers_str = f"{busy_count}/{total_servers} busy"

-     # Incoming job info
      if obs.incoming_job_present:
          urgency = "URGENT" if obs.incoming_job_priority >= 2 else "normal"
          incoming_str = f"YES [{urgency} size={obs.incoming_job_size:.1f} deadline={obs.incoming_job_deadline:.0f}]"
@@ -267,7 +248,7 @@ def build_obs_summary(obs: CloudQueueObservation, task_name: str) -> str:
      )


- def build_user_prompt(step: int, obs_summary: str, last_reward: float, history: List[str], task_name: str) -> str:
      history_block = "\n".join(history[-4:]) if history else "None"
      return textwrap.dedent(
          f"""
@@ -280,16 +261,6 @@ def build_user_prompt(step: int, obs_summary: str, last_reward: float, history:
      ).strip()


- def choose_heuristic_action(task_name: str, queue_lengths: List[int], incoming_present: bool) -> CloudQueueAction:
-     if incoming_present:
-         if task_name == "hard" and len(queue_lengths) > 1 and queue_lengths[0] > queue_lengths[1]:
-             return CloudQueueAction(action_type="route", target_queue=1)
-         if task_name == "medium" and len(queue_lengths) > 1 and queue_lengths[1] < queue_lengths[0]:
-             return CloudQueueAction(action_type="route", target_queue=1)
-         return CloudQueueAction(action_type="admit", target_queue=0)
-     return CloudQueueAction(action_type="dispatch", target_queue=0)
-
-
  def _coerce_optional_int(value: Any) -> Optional[int]:
      if value is None:
          return None
@@ -318,7 +289,6 @@ def _extract_json_object(text: str) -> Optional[dict[str, Any]]:
      if not cleaned:
          return None

-     # Handle common fenced responses first.
      if cleaned.startswith("```"):
          chunks = [chunk.strip() for chunk in cleaned.split("```") if chunk.strip()]
          for chunk in chunks:
@@ -343,7 +313,6 @@ def _extract_json_object(text: str) -> Optional[dict[str, Any]]:
          except Exception:
              pass

-     # Fallback: extract the first balanced JSON object from noisy text.
      start = 0
      while True:
          open_idx = cleaned.find("{", start)
@@ -371,7 +340,6 @@ def _normalize_action_payload(data: dict[str, Any], task_name: str) -> Optional[
      action_type = str(data.get("action_type", "noop")).strip().lower()
      if action_type not in ACTION_TYPES:
          return None
-
      if action_type not in TASK_ALLOWED_ACTIONS.get(task_name, set(ACTION_TYPES)):
          return None

@@ -412,17 +380,19 @@ def parse_model_action(text: str, task_name: str) -> Optional[CloudQueueAction]:
      data = _extract_json_object(text)
      if data is None:
          return None
-
      payload = _normalize_action_payload(data, task_name)
      if payload is None:
          return None
-
      try:
          return CloudQueueAction(**payload)
      except Exception:
          return None


  def get_model_action(
      client: OpenAI,
      task_name: str,
@@ -431,46 +401,20 @@ def get_model_action(
      last_reward: float,
      history: List[str],
  ) -> tuple[Optional[CloudQueueAction], Optional[str]]:
-     global _SCHEMA_RESPONSE_FORMAT_FAILED
-
-     user_prompt = build_user_prompt(step, obs_summary, last_reward, history, task_name)
      messages = [
          {"role": "system", "content": SYSTEM_PROMPT},
          {"role": "user", "content": user_prompt},
      ]

      try:
-         if not _SCHEMA_RESPONSE_FORMAT_FAILED:
-             try:
-                 completion = client.chat.completions.create(
-                     model=MODEL_NAME,
-                     messages=messages,
-                     temperature=TEMPERATURE,
-                     max_tokens=MAX_TOKENS,
-                     stream=False,
-                     response_format=MODEL_ACTION_RESPONSE_FORMAT,
-                 )
-             except Exception as schema_exc:
-                 _SCHEMA_RESPONSE_FORMAT_FAILED = True
-                 print(
-                     f"[DEBUG] response_format unavailable, retrying without schema: {schema_exc}",
-                     flush=True,
-                 )
-                 completion = client.chat.completions.create(
-                     model=MODEL_NAME,
-                     messages=messages,
-                     temperature=TEMPERATURE,
-                     max_tokens=MAX_TOKENS,
-                     stream=False,
-                 )
-         else:
-             completion = client.chat.completions.create(
-                 model=MODEL_NAME,
-                 messages=messages,
-                 temperature=TEMPERATURE,
-                 max_tokens=MAX_TOKENS,
-                 stream=False,
-             )

          text = (completion.choices[0].message.content or "").strip()
          action = parse_model_action(text, task_name)
@@ -479,65 +423,70 @@ def get_model_action(
              return None, f"invalid_model_action_payload: {preview}"
          return action, None
      except Exception as exc:
-         print(f"[DEBUG] Model request failed: {exc}", flush=True)
          return None, str(exc)

486
-def normalize_base_url(base_url: Optional[str]) -> Optional[str]:
-    """Normalize user-provided BASE_URL into an API runtime URL.
-
-    If a Hugging Face repo page URL is provided (huggingface.co/spaces/user/space),
-    convert it to the runtime domain (https://user-space.hf.space).
-    """
     if not base_url:
         return base_url

     cleaned = base_url.strip().rstrip("/")
     parsed = urlparse(cleaned)

-    # Handle Hugging Face repo page URL -> runtime URL used by API/WebSocket.
     if parsed.netloc.lower() == "huggingface.co":
         parts = [p for p in parsed.path.strip("/").split("/") if p]
         if len(parts) >= 3 and parts[0] == "spaces":
             owner, space = parts[1], parts[2]
-            # HF runtime hostnames use lowercase and are TLS-safe.
             owner = owner.lower().replace("_", "-")
             space = space.lower().replace("_", "-")
             return f"https://{owner}-{space}.hf.space"

-    # Avoid accidentally pointing at the web UI path.
     if cleaned.endswith("/web"):
         cleaned = cleaned[:-4]
         parsed = urlparse(cleaned)

-    # HF runtime domains should be lowercase and avoid underscores for TLS host checks.
     host = (parsed.hostname or "").lower()
     if host.endswith(".hf.space"):
         safe_host = host.replace("_", "-")
         if safe_host != host or (parsed.netloc and parsed.netloc != parsed.netloc.lower()):
             port_part = f":{parsed.port}" if parsed.port else ""
-            netloc = f"{safe_host}{port_part}"
-            parsed = parsed._replace(netloc=netloc)
             cleaned = urlunparse(parsed)

     return cleaned

 def _smoke_test_model(client: OpenAI) -> bool:
-    """Verify the model API is reachable AND can generate a coherent response.
-
-    Asks a short queue-domain question that requires a real sentence answer.
-    An empty or missing reply is treated as failure — not just exceptions.
-
-    Prints [MODEL_OK] or [MODEL_FAIL] with details.
-    Returns True if the model is working, False otherwise.
-    """
-    print(f"[MODEL_CHECK] Testing model={MODEL_NAME} at {API_BASE_URL} ...", flush=True)
     test_question = (
         "You are a cloud scheduling agent. "
         "A job queue is 80% full and a new urgent job just arrived. "
         "Should you admit the job, reject it, or route it to another queue? "
-        "Answer in one sentence and explain why."
     )
     try:
         resp = client.chat.completions.create(
@@ -548,41 +497,29 @@ def _smoke_test_model(client: OpenAI) -> bool:
         )
         reply = (resp.choices[0].message.content or "").strip()
         if not reply:
-            print("[MODEL_FAIL] Model returned an empty response.", flush=True)
-            print("[MODEL_FAIL] Will fall back to heuristic for all steps.", flush=True)
             return False
-        print(f"[MODEL_OK] model is reasoning correctly.", flush=True)
-        print(f"[MODEL_OK] test reply: {reply}", flush=True)
         return True
-    except Exception as exc:
-        print(f"[MODEL_FAIL] Cannot reach model: {exc}", flush=True)
-        print("[MODEL_FAIL] Will fall back to heuristic for all steps.", flush=True)
         return False

 async def main() -> None:
-    if not API_KEY and not USE_HEURISTIC_ONLY:
-        raise ValueError("API_KEY is required for model inference.")

-    client = None
-    if not USE_HEURISTIC_ONLY:
-        client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
-    runtime_base_url = normalize_base_url(BASE_URL)

     if runtime_base_url:
         env = CloudQueueEnv(base_url=runtime_base_url)
     else:
         if not IMAGE_NAME:
-            raise ValueError(
-                "Set BASE_URL for deployed env, or IMAGE_NAME for local docker env."
-            )
         env = await CloudQueueEnv.from_docker_image(IMAGE_NAME)

     try:
-        # Run smoke test before benchmark — confirms model API is reachable.
-        model_enabled = client is not None
-        if client is not None:
-            model_enabled = _smoke_test_model(client)
         task_seed_map = parse_task_seed_map()
         replay_map = load_replay_actions()
         task_score_table: dict[str, list[float]] = {}
@@ -599,21 +536,24 @@ async def main() -> None:
             history: List[str] = []
             rewards: List[float] = []
             steps_taken = 0
-            score = 0.0
             success = False

             log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)

             await env.reset()
-            await env.step(
-                CloudQueueAction(action_type="configure_task", task_id=task_name, seed=seed)
-            )
             result = await env.reset()
             last_reward = 0.0
             max_steps = max(1, int(result.observation.horizon))
             if MAX_STEPS_OVERRIDE > 0:
                 max_steps = min(max_steps, MAX_STEPS_OVERRIDE)

             for step in range(1, max_steps + 1):
                 if result.done:
                     break
@@ -621,32 +561,33 @@
                 obs = result.observation
                 obs_summary = build_obs_summary(obs, task_name)

-                action = None
-                model_error = None
-                replay_key = f"{task_name}:{seed}"
-                replay_actions = replay_map.get(replay_key, [])
                 if step - 1 < len(replay_actions):
                     action = replay_actions[step - 1]
-
-                if action is None and model_enabled and client is not None:
-                    action, model_error = get_model_action(
                         client=client,
                         task_name=task_name,
                         step=step,
                         obs_summary=obs_summary,
                         last_reward=last_reward,
                         history=history,
                     )
-                    if model_error and DISABLE_MODEL_ON_FIRST_ERROR:
-                        model_enabled = False
-                        print("[DEBUG] Disabling model calls and switching to heuristic fallback.", flush=True)

                 if action is None:
-                    action = choose_heuristic_action(
-                        task_name=task_name,
-                        queue_lengths=obs.queue_lengths,
-                        incoming_present=obs.incoming_job_present,
                     )

                 result = await env.step(action)
                 reward = float(result.reward or 0.0)
@@ -666,27 +607,23 @@
                     f"d={action.scale_delta},p={action.new_priority})"
                 )
                 log_step(step=step, action=action_str, reward=reward, done=done, error=error)
-
                 history.append(f"step={step} action={action_str} reward={reward:.2f}")

                 if done:
                     break

-            if isinstance(result.observation.metadata, dict):
-                score = float(result.observation.metadata.get("episode_score", 0.0) or 0.0)
-                # Debug: print raw server metadata so we can verify grader output
-                _m = result.observation.metadata
-                print(
-                    f"[DEBUG_META] task={task_name} seed={seed} "
-                    f"episode_score={_m.get('episode_score')} "
-                    f"score_details={_m.get('score_details')} "
-                    f"metrics_completed={_m.get('metrics', {}).get('completed')} "
-                    f"metrics_arrivals={_m.get('metrics', {}).get('arrivals')}",
-                    flush=True,
-                )
-                score = max(0.0, min(1.0, score))
             task_score_table[task_name].append(score)
-            success = score >= SUCCESS_SCORE_THRESHOLD
             log_end(success=success, steps=steps_taken, score=score, rewards=rewards)

             meta = result.observation.metadata or {}
@@ -694,29 +631,20 @@
             seed_row = {
                 "task": task_name,
                 "seed": int(seed),
-                "score": round(score, 6),
                 "steps": int(steps_taken),
                 "success": bool(success),
                 "trace_digest": str(meta.get("trace_digest", "")),
                 "invalid_actions": float(metrics.get("invalid_actions", 0.0)),
                 "harmful_scale_down": float(metrics.get("harmful_scale_down", 0.0)),
             }
             seed_rows.append(seed_row)
-            print(
-                "[REPORT_SEED] "
-                f"task={seed_row['task']} seed={seed_row['seed']} score={seed_row['score']:.3f} "
-                f"steps={seed_row['steps']} trace={seed_row['trace_digest']}",
-                flush=True,
-            )

         task_scores = task_score_table[task_name]
-        task_mean = statistics.mean(task_scores) if task_scores else 0.0
         task_std = statistics.pstdev(task_scores) if len(task_scores) > 1 else 0.0
         task_ci = ci95(task_scores)
-        print(
-            f"[REPORT] task={task_name} seeds={len(task_scores)} mean={task_mean:.3f} std={task_std:.3f} ci95={task_ci:.3f}",
-            flush=True,
-        )

         all_task_means = []
         for task_name in TASKS:
@@ -725,23 +653,18 @@
             all_task_means.append(statistics.mean(scores))

         if all_task_means:
-            final_score = sum(all_task_means) / len(all_task_means)
-            easy_mean = statistics.mean(task_score_table.get("easy", [0.0]))
-            medium_mean = statistics.mean(task_score_table.get("medium", [0.0]))
-            hard_mean = statistics.mean(task_score_table.get("hard", [0.0]))
-            print(
-                f"[SUMMARY] easy={easy_mean:.3f} medium={medium_mean:.3f} hard={hard_mean:.3f} final={final_score:.3f}",
-                flush=True,
-            )
-
         write_reports(seed_rows=seed_rows, task_score_table=task_score_table)

     finally:
         try:
             await env.close()
-        except Exception as e:
-            print(f"[DEBUG] env.close() error (container cleanup): {e}", flush=True)


 if __name__ == "__main__":
-    asyncio.run(main())

+"""Strict model-only inference runner for the queue operations benchmark.
+
+This variant intentionally removes heuristic fallback paths.
+Every decision must come from either:
+  1) replay trace input (ACTION_TRACE_FILE), or
+  2) model output.
+
+If model output is invalid/unavailable, the seed run is marked failed.
+"""

 import asyncio
 import csv

 from dotenv import load_dotenv
 from openai import OpenAI

+load_dotenv()

 from cloud_queue_env import CloudQueueAction, CloudQueueEnv, CloudQueueObservation

 IMAGE_NAME = os.getenv("IMAGE_NAME")
 BASE_URL = os.getenv("BASE_URL")

 API_BASE_URL = os.getenv("API_BASE_URL") or "https://router.huggingface.co/v1"
 MODEL_NAME = os.getenv("MODEL_NAME") or "Qwen/Qwen2.5-72B-Instruct"
 API_KEY = os.getenv("API_KEY") or os.getenv("HF_TOKEN")

 BENCHMARK = os.getenv("BENCHMARK", "queueops-openenv")

 TASK_SEEDS_JSON = os.getenv("TASK_SEEDS_JSON")
 SEEDS = [11, 23, 37]
 TEMPERATURE = 0.2
+MAX_TOKENS = 780
 SUCCESS_SCORE_THRESHOLD = 0.60
+# Test-friendly default. Set MAX_STEPS_OVERRIDE=0 for full horizon.
+MAX_STEPS_OVERRIDE = int(os.getenv("MAX_STEPS_OVERRIDE", "8") or "8")

 ACTION_TRACE_FILE = os.getenv("ACTION_TRACE_FILE")
 REPORT_JSON_PATH = os.getenv("REPORT_JSON_PATH")
 REPORT_CSV_PATH = os.getenv("REPORT_CSV_PATH")

+OPEN_SCORE_MIN = 0.001
+OPEN_SCORE_MAX = 0.999
+
 SYSTEM_PROMPT = textwrap.dedent(
     """
     You are an agent controlling a cloud queue scheduling environment.
     Your goal: minimize wait times, SLA violations, and cost while maximizing throughput.

+    OUTPUT FORMAT (strict):
+    - Return exactly one JSON object.
+    - No markdown, no code fences, no explanations, no extra keys.
+    - Always include all fields below.
+
+    Required JSON schema:
+    {
+      "action_type": "admit|reject|route|dispatch|scale|reprioritize|noop",
+      "target_queue": integer or null,
+      "target_server": integer or null,
+      "scale_delta": integer or null,
+      "new_priority": integer or null
+    }
+
+    Task constraints:
+    - easy: only admit/reject/dispatch/noop
+    - medium: only admit/reject/route/dispatch/reprioritize/noop
+    - hard: only admit/reject/route/dispatch/reprioritize/scale/noop
     """
 ).strip()

 ACTION_TYPES = (
     "configure_task",
     "admit",

     "hard": {"admit", "reject", "route", "dispatch", "reprioritize", "scale", "noop"},
 }

+def clamp_open_score(value: float) -> float:
+    if not isinstance(value, (int, float)) or not (value == value):
+        return OPEN_SCORE_MIN
+    return max(OPEN_SCORE_MIN, min(OPEN_SCORE_MAX, float(value)))


 def log_start(task: str, env: str, model: str) -> None:

             task_map[str(task_name)] = parsed
         if task_map:
             return task_map
+    except Exception:
+        pass

     return {
         "easy": [SEEDS[0]],

     try:
         with open(ACTION_TRACE_FILE, "r", encoding="utf-8") as f:
             payload = json.load(f)
+    except Exception:
         return {}

     replay: dict[str, list[CloudQueueAction]] = {}

     try:
         with open(REPORT_JSON_PATH, "w", encoding="utf-8") as f:
             json.dump(report_payload, f, indent=2)
+    except Exception:
+        pass

     if REPORT_CSV_PATH:
         try:

                     "trace_digest",
                     "invalid_actions",
                     "harmful_scale_down",
+                    "failure_reason",
                 ],
             )
             writer.writeheader()
             for row in seed_rows:
                 writer.writerow(row)
+        except Exception:
+            pass

 def build_obs_summary(obs: CloudQueueObservation, task_name: str) -> str:
     max_sizes = {"easy": 28, "medium": 42, "hard": 64}
     max_q = max_sizes.get(task_name, 30)
     fills = [f"{l}/{max_q}({100*l//max_q}%)" for l in obs.queue_lengths]

     busy_count = sum(obs.server_busy)
     total_servers = len(obs.server_busy)
     servers_str = f"{busy_count}/{total_servers} busy"

     if obs.incoming_job_present:
         urgency = "URGENT" if obs.incoming_job_priority >= 2 else "normal"
         incoming_str = f"YES [{urgency} size={obs.incoming_job_size:.1f} deadline={obs.incoming_job_deadline:.0f}]"

     )


+def build_user_prompt(step: int, obs_summary: str, last_reward: float, history: List[str]) -> str:
     history_block = "\n".join(history[-4:]) if history else "None"
     return textwrap.dedent(
         f"""

     ).strip()


 def _coerce_optional_int(value: Any) -> Optional[int]:
     if value is None:
         return None

     if not cleaned:
         return None

     if cleaned.startswith("```"):
         chunks = [chunk.strip() for chunk in cleaned.split("```") if chunk.strip()]
         for chunk in chunks:

     except Exception:
         pass

     start = 0
     while True:
         open_idx = cleaned.find("{", start)

     action_type = str(data.get("action_type", "noop")).strip().lower()
     if action_type not in ACTION_TYPES:
         return None
     if action_type not in TASK_ALLOWED_ACTIONS.get(task_name, set(ACTION_TYPES)):
         return None

     data = _extract_json_object(text)
     if data is None:
         return None
     payload = _normalize_action_payload(data, task_name)
     if payload is None:
         return None
     try:
         return CloudQueueAction(**payload)
     except Exception:
         return None


+def _single_line(text: str) -> str:
+    return " ".join((text or "").split())
+
+
 def get_model_action(
     client: OpenAI,
     task_name: str,
     last_reward: float,
     history: List[str],
 ) -> tuple[Optional[CloudQueueAction], Optional[str]]:
+    user_prompt = build_user_prompt(step, obs_summary, last_reward, history)
     messages = [
         {"role": "system", "content": SYSTEM_PROMPT},
         {"role": "user", "content": user_prompt},
     ]

     try:
+        completion = client.chat.completions.create(
+            model=MODEL_NAME,
+            messages=messages,
+            temperature=TEMPERATURE,
+            max_tokens=MAX_TOKENS,
+            stream=False,
+        )

         text = (completion.choices[0].message.content or "").strip()
         action = parse_model_action(text, task_name)

             return None, f"invalid_model_action_payload: {preview}"
         return action, None
     except Exception as exc:
         return None, str(exc)


+def get_model_action_with_retry(
+    client: OpenAI,
+    task_name: str,
+    step: int,
+    obs_summary: str,
+    last_reward: float,
+    history: List[str],
+    retries: int = 2,
+) -> tuple[Optional[CloudQueueAction], Optional[str]]:
+    last_error: Optional[str] = None
+    for attempt in range(1, retries + 2):
+        action, error = get_model_action(
+            client=client,
+            task_name=task_name,
+            step=step,
+            obs_summary=obs_summary,
+            last_reward=last_reward,
+            history=history,
+        )
+        if action is not None:
+            return action, None
+        last_error = error
+    return None, last_error

+
+def normalize_base_url(base_url: Optional[str]) -> Optional[str]:
     if not base_url:
         return base_url

     cleaned = base_url.strip().rstrip("/")
     parsed = urlparse(cleaned)

     if parsed.netloc.lower() == "huggingface.co":
         parts = [p for p in parsed.path.strip("/").split("/") if p]
         if len(parts) >= 3 and parts[0] == "spaces":
             owner, space = parts[1], parts[2]
             owner = owner.lower().replace("_", "-")
             space = space.lower().replace("_", "-")
             return f"https://{owner}-{space}.hf.space"

     if cleaned.endswith("/web"):
         cleaned = cleaned[:-4]
         parsed = urlparse(cleaned)

     host = (parsed.hostname or "").lower()
     if host.endswith(".hf.space"):
         safe_host = host.replace("_", "-")
         if safe_host != host or (parsed.netloc and parsed.netloc != parsed.netloc.lower()):
             port_part = f":{parsed.port}" if parsed.port else ""
+            parsed = parsed._replace(netloc=f"{safe_host}{port_part}")
             cleaned = urlunparse(parsed)

     return cleaned


 def _smoke_test_model(client: OpenAI) -> bool:
     test_question = (
         "You are a cloud scheduling agent. "
         "A job queue is 80% full and a new urgent job just arrived. "
         "Should you admit the job, reject it, or route it to another queue? "
+        "Answer with exactly one JSON object containing action_type and optional fields."
     )
     try:
         resp = client.chat.completions.create(

         )
         reply = (resp.choices[0].message.content or "").strip()
         if not reply:
             return False
         return True
+    except Exception:
         return False

 async def main() -> None:
+    if not API_KEY:
+        raise ValueError("API_KEY or HF_TOKEN is required for strict model inference.")

+    client = OpenAI(base_url=API_BASE_URL, api_key=API_KEY)
+    if not _smoke_test_model(client):
+        raise RuntimeError("Model smoke test failed. Aborting strict model-only run.")

+    runtime_base_url = normalize_base_url(BASE_URL)
     if runtime_base_url:
         env = CloudQueueEnv(base_url=runtime_base_url)
     else:
         if not IMAGE_NAME:
+            raise ValueError("Set BASE_URL for deployed env, or IMAGE_NAME for local docker env.")
         env = await CloudQueueEnv.from_docker_image(IMAGE_NAME)

     try:
         task_seed_map = parse_task_seed_map()
         replay_map = load_replay_actions()
         task_score_table: dict[str, list[float]] = {}

             history: List[str] = []
             rewards: List[float] = []
             steps_taken = 0
+            score = OPEN_SCORE_MIN
             success = False
+            failure_reason: Optional[str] = None

             log_start(task=task_name, env=BENCHMARK, model=MODEL_NAME)

             await env.reset()
+            await env.step(CloudQueueAction(action_type="configure_task", task_id=task_name, seed=seed))
             result = await env.reset()
+
             last_reward = 0.0
             max_steps = max(1, int(result.observation.horizon))
             if MAX_STEPS_OVERRIDE > 0:
                 max_steps = min(max_steps, MAX_STEPS_OVERRIDE)

+            replay_key = f"{task_name}:{seed}"
+            replay_actions = replay_map.get(replay_key, [])
+
             for step in range(1, max_steps + 1):
                 if result.done:
                     break

                 obs = result.observation
                 obs_summary = build_obs_summary(obs, task_name)

+                action: Optional[CloudQueueAction] = None
+                model_error: Optional[str] = None
+
                 if step - 1 < len(replay_actions):
                     action = replay_actions[step - 1]
+                else:
+                    action, model_error = get_model_action_with_retry(
                         client=client,
                         task_name=task_name,
                         step=step,
                         obs_summary=obs_summary,
                         last_reward=last_reward,
                         history=history,
+                        retries=2,
                     )

                 if action is None:
+                    failure_reason = f"model_action_unavailable: {model_error}"
+                    log_step(
+                        step=step,
+                        action="model_action_error",
+                        reward=0.0,
+                        done=True,
+                        error=failure_reason,
                     )
+                    steps_taken = step
+                    break

                 result = await env.step(action)
                 reward = float(result.reward or 0.0)

                     f"d={action.scale_delta},p={action.new_priority})"
                 )
                 log_step(step=step, action=action_str, reward=reward, done=done, error=error)
                 history.append(f"step={step} action={action_str} reward={reward:.2f}")

                 if done:
                     break

+            if failure_reason is None and isinstance(result.observation.metadata, dict):
+                score = float(result.observation.metadata.get("episode_score", OPEN_SCORE_MIN) or OPEN_SCORE_MIN)
+            elif failure_reason is not None:
+                score = OPEN_SCORE_MIN
+
+            if failure_reason is None and not bool(result.done):
+                failure_reason = "episode_not_done_within_max_steps"
+                score = OPEN_SCORE_MIN
+
+            score = clamp_open_score(score)
             task_score_table[task_name].append(score)
+            success = failure_reason is None and score >= SUCCESS_SCORE_THRESHOLD
             log_end(success=success, steps=steps_taken, score=score, rewards=rewards)

             meta = result.observation.metadata or {}

             seed_row = {
                 "task": task_name,
                 "seed": int(seed),
+                "score": score,
                 "steps": int(steps_taken),
                 "success": bool(success),
                 "trace_digest": str(meta.get("trace_digest", "")),
                 "invalid_actions": float(metrics.get("invalid_actions", 0.0)),
                 "harmful_scale_down": float(metrics.get("harmful_scale_down", 0.0)),
+                "failure_reason": failure_reason or "",
             }
             seed_rows.append(seed_row)

         task_scores = task_score_table[task_name]
+        task_mean = statistics.mean(task_scores) if task_scores else OPEN_SCORE_MIN
         task_std = statistics.pstdev(task_scores) if len(task_scores) > 1 else 0.0
         task_ci = ci95(task_scores)

         all_task_means = []
         for task_name in TASKS:

             all_task_means.append(statistics.mean(scores))

         if all_task_means:
+            final_score = clamp_open_score(sum(all_task_means) / len(all_task_means))
+            easy_mean = clamp_open_score(statistics.mean(task_score_table.get("easy", [OPEN_SCORE_MIN])))
+            medium_mean = clamp_open_score(statistics.mean(task_score_table.get("medium", [OPEN_SCORE_MIN])))
+            hard_mean = clamp_open_score(statistics.mean(task_score_table.get("hard", [OPEN_SCORE_MIN])))

         write_reports(seed_rows=seed_rows, task_score_table=task_score_table)

     finally:
         try:
             await env.close()
+        except Exception:
+            pass


 if __name__ == "__main__":
+    asyncio.run(main())
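
The `normalize_base_url` logic in the diff above maps a Hugging Face Space *page* URL onto its runtime host before the env client connects. As a minimal, stdlib-only sketch of that mapping (the function name `space_page_to_runtime_url` is ours, not from the repo; the full version also strips `/web` suffixes and sanitizes existing `.hf.space` hosts):

```python
from urllib.parse import urlparse


def space_page_to_runtime_url(url: str) -> str:
    """Map a huggingface.co/spaces/<owner>/<space> page URL to its runtime host.

    HF Spaces are served from https://<owner>-<space>.hf.space; the host is
    lowercased and underscores become hyphens (underscores are not valid in
    TLS hostnames). Any other URL is returned unchanged apart from trimming.
    """
    cleaned = url.strip().rstrip("/")
    parsed = urlparse(cleaned)
    if parsed.netloc.lower() == "huggingface.co":
        parts = [p for p in parsed.path.strip("/").split("/") if p]
        if len(parts) >= 3 and parts[0] == "spaces":
            owner = parts[1].lower().replace("_", "-")
            space = parts[2].lower().replace("_", "-")
            return f"https://{owner}-{space}.hf.space"
    return cleaned
```

This keeps BASE_URL forgiving: users can paste the browser URL of a deployed Space and the runner still reaches the API endpoint.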
models.py CHANGED
@@ -1,55 +1,55 @@
-# Copyright (c) Meta Platforms, Inc. and affiliates.
-# All rights reserved.
-#
-# This source code is licensed under the BSD-style license found in the
-# LICENSE file in the root directory of this source tree.
-
-"""Data models for the Cloud Queue Env queue operations environment."""
-
-from openenv.core.env_server.types import Action, Observation
-from pydantic import Field
-
-
-class CloudQueueAction(Action):
-    """Action model for queue control decisions."""
-
-    action_type: str = Field(
-        default="noop",
-        description=(
-            "One of: configure_task, admit, reject, route, dispatch, scale, reprioritize, noop"
-        ),
-    )
-    target_queue: int | None = Field(default=None, description="Queue index for route/dispatch")
-    target_server: int | None = Field(default=None, description="Server index for dispatch")
-    scale_delta: int | None = Field(default=None, description="Server pool scale delta for scale action")
-    new_priority: int | None = Field(default=None, description="Updated priority for reprioritize action")
-    task_id: str | None = Field(default=None, description="Task selector: easy, medium, or hard")
-    seed: int | None = Field(default=None, description="Deterministic seed for upcoming reset")
-
-
-class CloudQueueObservation(Observation):
-    """Observation model exposing queue system state to the agent."""
-
-    task_id: str = Field(default="easy", description="Active benchmark task")
-    sim_time: int = Field(default=0, description="Discrete simulation time step")
-    horizon: int = Field(default=0, description="Episode horizon")
-    queue_lengths: list[int] = Field(default_factory=list, description="Length per queue")
-    queue_wait_ema: list[float] = Field(default_factory=list, description="EMA wait time per queue")
-    server_busy: list[int] = Field(default_factory=list, description="1 if server is busy, else 0")
-    server_remaining_service: list[float] = Field(
-        default_factory=list,
-        description="Remaining service time per server",
-    )
-    utilization: list[float] = Field(default_factory=list, description="Rolling utilization by server")
-    incoming_job_present: bool = Field(default=False, description="Whether a new job is waiting for admission")
-    incoming_job_size: float = Field(default=0.0, description="Incoming job estimated size")
-    incoming_job_priority: int = Field(default=0, description="Incoming job priority")
-    incoming_job_deadline: float = Field(default=0.0, description="Incoming job deadline")
-    incoming_job_type: int = Field(default=0, description="Incoming job class/type id")
-    sla_violation_rate: float = Field(default=0.0, description="Running SLA violation rate")
-    abandonment_rate: float = Field(default=0.0, description="Running abandonment rate")
-    throughput_recent: float = Field(default=0.0, description="Completed jobs in current step")
-    energy_cost_rate: float = Field(default=0.0, description="Current infrastructure cost rate")
-    level: float = Field(default=1.0, description="Difficulty level scalar")
-    optional_history: list[float] = Field(default_factory=list, description="Compact recent context")
-    action_mask: list[int] = Field(default_factory=list, description="Optional valid action hints")
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""Data models for the Cloud Queue Env queue operations environment."""
+
+from openenv.core.env_server.types import Action, Observation
+from pydantic import Field
+
+
+class CloudQueueAction(Action):
+    """Action model for queue control decisions."""
+
+    action_type: str = Field(
+        default="noop",
+        description=(
+            "One of: configure_task, admit, reject, route, dispatch, scale, reprioritize, noop"
+        ),
+    )
+    target_queue: int | None = Field(default=None, description="Queue index for route/dispatch")
+    target_server: int | None = Field(default=None, description="Server index for dispatch")
+    scale_delta: int | None = Field(default=None, description="Server pool scale delta for scale action")
+    new_priority: int | None = Field(default=None, description="Updated priority for reprioritize action")
+    task_id: str | None = Field(default=None, description="Task selector: easy, medium, or hard")
+    seed: int | None = Field(default=None, description="Deterministic seed for upcoming reset")
+
+
+class CloudQueueObservation(Observation):
+    """Observation model exposing queue system state to the agent."""
+
+    task_id: str = Field(default="easy", description="Active benchmark task")
+    sim_time: int = Field(default=0, description="Discrete simulation time step")
+    horizon: int = Field(default=0, description="Episode horizon")
+    queue_lengths: list[int] = Field(default_factory=list, description="Length per queue")
+    queue_wait_ema: list[float] = Field(default_factory=list, description="EMA wait time per queue")
+    server_busy: list[int] = Field(default_factory=list, description="1 if server is busy, else 0")
+    server_remaining_service: list[float] = Field(
+        default_factory=list,
+        description="Remaining service time per server",
+    )
+    utilization: list[float] = Field(default_factory=list, description="Rolling utilization by server")
+    incoming_job_present: bool = Field(default=False, description="Whether a new job is waiting for admission")
+    incoming_job_size: float = Field(default=0.0, description="Incoming job estimated size")
+    incoming_job_priority: int = Field(default=0, description="Incoming job priority")
+    incoming_job_deadline: float = Field(default=0.0, description="Incoming job deadline")
+    incoming_job_type: int = Field(default=0, description="Incoming job class/type id")
+    sla_violation_rate: float = Field(default=0.0, description="Running SLA violation rate")
+    abandonment_rate: float = Field(default=0.0, description="Running abandonment rate")
+    throughput_recent: float = Field(default=0.0, description="Completed jobs in current step")
+    energy_cost_rate: float = Field(default=0.0, description="Current infrastructure cost rate")
+    level: float = Field(default=1.0, description="Difficulty level scalar")
+    optional_history: list[float] = Field(default_factory=list, description="Compact recent context")
+    action_mask: list[int] = Field(default_factory=list, description="Optional valid action hints")
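
The runner's parse path validates model output against these field names and the per-task action whitelist before constructing a `CloudQueueAction`. A stdlib-only stand-in for that validation (the real code goes through pydantic; `parse_action_payload` and `ALLOWED` here are illustrative names, with the whitelists copied from the SYSTEM_PROMPT task constraints):

```python
import json
from typing import Optional

# Per-task action whitelists, as stated in the runner's system prompt.
ALLOWED = {
    "easy": {"admit", "reject", "dispatch", "noop"},
    "medium": {"admit", "reject", "route", "dispatch", "reprioritize", "noop"},
    "hard": {"admit", "reject", "route", "dispatch", "reprioritize", "scale", "noop"},
}
# Optional integer fields mirroring CloudQueueAction above.
OPTIONAL_INT_FIELDS = ("target_queue", "target_server", "scale_delta", "new_priority")


def parse_action_payload(text: str, task_name: str) -> Optional[dict]:
    """Return a validated action dict, or None on any malformed input."""
    try:
        data = json.loads(text)
    except ValueError:
        return None
    if not isinstance(data, dict):
        return None
    action_type = str(data.get("action_type", "noop")).strip().lower()
    if action_type not in ALLOWED.get(task_name, set()):
        return None
    payload = {"action_type": action_type}
    for field in OPTIONAL_INT_FIELDS:
        value = data.get(field)
        payload[field] = int(value) if isinstance(value, (int, float)) else None
    return payload
```

Returning `None` rather than raising keeps a malformed completion from crashing the episode loop; in the strict runner that `None` becomes a recorded `failure_reason` for the seed.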
openenv.yaml CHANGED
@@ -1,30 +1,30 @@
-spec_version: 1
-name: cloud_queue_env
-type: space
-runtime: fastapi
-app: server.app:app
-port: 8000
-
-metadata:
-  description: >
-    A real-world queueing control environment where an agent manages
-    cloud request scheduling decisions — admission control, routing,
-    dispatching, and dynamic server scaling — under stochastic arrivals
-    and service times. Optimizes latency, throughput, SLA compliance,
-    fairness, and infrastructure cost across three benchmark tasks
-    (Easy / Medium / Hard) with deterministic graders scored in (0, 1).
-  tags:
-    - openenv
-    - reinforcement-learning
-    - queueing
-    - scheduling
-    - cloud-operations
-    - multi-objective
-    - llm-agent
-  difficulty: easy-to-hard
-  tasks:
-    - easy
-    - medium
-    - hard
-  author: Mrkumar007
-
+spec_version: 1
+name: cloud_queue_env
+type: space
+runtime: fastapi
+app: server.app:app
+port: 8000
+
+metadata:
+  description: >
+    A real-world queueing control environment where an agent manages
+    cloud request scheduling decisions — admission control, routing,
+    dispatching, and dynamic server scaling — under stochastic arrivals
+    and service times. Optimizes latency, throughput, SLA compliance,
+    fairness, and infrastructure cost across three benchmark tasks
+    (Easy / Medium / Hard) with deterministic graders scored in (0, 1).
+  tags:
+    - openenv
+    - reinforcement-learning
+    - queueing
+    - scheduling
+    - cloud-operations
+    - multi-objective
+    - llm-agent
+  difficulty: easy-to-hard
+  tasks:
+    - easy
+    - medium
+    - hard
+  author: Mrkumar007
+
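
The spec's claim that graders are "scored in (0, 1)" matches the open-interval clamp the strict runner applies before recording any seed score. Restated as a self-contained sketch (constants and logic taken from the inference runner diff above):

```python
OPEN_SCORE_MIN = 0.001
OPEN_SCORE_MAX = 0.999


def clamp_open_score(value: float) -> float:
    """Keep scores strictly inside (0, 1).

    `value != value` is True only for NaN, so NaN (and non-numeric input)
    collapses to the minimum open score instead of propagating into the
    per-task means and the final aggregate.
    """
    if not isinstance(value, (int, float)) or value != value:
        return OPEN_SCORE_MIN
    return max(OPEN_SCORE_MIN, min(OPEN_SCORE_MAX, float(value)))
```

Clamping to an open interval means a failed seed still contributes a small positive score rather than an exact zero, and a perfect run never reports an exact 1.0.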
openenv_cloud_queue_env.egg-info/PKG-INFO ADDED
@@ -0,0 +1,9 @@
 
 
 
 
 
 
 
 
 
 
1
+ Metadata-Version: 2.4
2
+ Name: openenv-cloud_queue_env
3
+ Version: 0.1.0
4
+ Summary: Cloud Queue Env environment for OpenEnv
5
+ Requires-Python: >=3.10
6
+ Requires-Dist: openenv-core[core]>=0.2.2
7
+ Provides-Extra: dev
8
+ Requires-Dist: pytest>=8.0.0; extra == "dev"
9
+ Requires-Dist: pytest-cov>=4.0.0; extra == "dev"
openenv_cloud_queue_env.egg-info/SOURCES.txt ADDED
@@ -0,0 +1,19 @@
+ README.md
+ __init__.py
+ client.py
+ inference.py
+ models.py
+ pyproject.toml
+ ./__init__.py
+ ./client.py
+ ./inference.py
+ ./models.py
+ openenv_cloud_queue_env.egg-info/PKG-INFO
+ openenv_cloud_queue_env.egg-info/SOURCES.txt
+ openenv_cloud_queue_env.egg-info/dependency_links.txt
+ openenv_cloud_queue_env.egg-info/entry_points.txt
+ openenv_cloud_queue_env.egg-info/requires.txt
+ openenv_cloud_queue_env.egg-info/top_level.txt
+ server/__init__.py
+ server/app.py
+ server/cloud_queue_env_environment.py
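Note that SOURCES.txt lists several modules twice (`client.py` and `./client.py`), a common setuptools artifact when the package root is also the working directory. Normalizing the paths collapses the duplicates; a small sketch of the cleanup (not part of the build itself):

```python
import posixpath

# A subset of the SOURCES.txt entries above, including the ./-prefixed repeats.
entries = ["README.md", "__init__.py", "client.py", "./__init__.py", "./client.py"]

# posixpath.normpath("./client.py") == "client.py", so the repeats collapse.
unique = sorted({posixpath.normpath(e) for e in entries})
print(unique)
# → ['README.md', '__init__.py', 'client.py']
```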
openenv_cloud_queue_env.egg-info/dependency_links.txt ADDED
@@ -0,0 +1 @@
+
openenv_cloud_queue_env.egg-info/entry_points.txt ADDED
@@ -0,0 +1,2 @@
+ [console_scripts]
+ server = cloud_queue_env.server.app:main
openenv_cloud_queue_env.egg-info/requires.txt ADDED
@@ -0,0 +1,5 @@
+ openenv-core[core]>=0.2.2
+
+ [dev]
+ pytest>=8.0.0
+ pytest-cov>=4.0.0
openenv_cloud_queue_env.egg-info/top_level.txt ADDED
@@ -0,0 +1 @@
+ cloud_queue_env
server/app.py CHANGED
@@ -1,89 +1,89 @@
+ # Copyright (c) Meta Platforms, Inc. and affiliates.
+ # All rights reserved.
+ #
+ # This source code is licensed under the BSD-style license found in the
+ # LICENSE file in the root directory of this source tree.
+
+ """
+ FastAPI application for the Cloud Queue Env Environment.
+
+ This module creates an HTTP server that exposes the CloudQueueEnvironment
+ over HTTP and WebSocket endpoints, compatible with EnvClient.
+
+ Endpoints:
+     - POST /reset: Reset the environment
+     - POST /step: Execute an action
+     - GET /state: Get current environment state
+     - GET /schema: Get action/observation schemas
+     - WS /ws: WebSocket endpoint for persistent sessions
+
+ Usage:
+     # Development (with auto-reload):
+     uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
+
+     # Production:
+     uvicorn server.app:app --host 0.0.0.0 --port 8000 --workers 4
+
+     # Or run directly:
+     python -m server.app
+ """
+
+ try:
+     from openenv.core.env_server.http_server import create_app
+ except Exception as e:  # pragma: no cover
+     raise ImportError(
+         "openenv is required for the web interface. Install dependencies with 'uv sync'."
+     ) from e
+
+ try:
+     from ..models import CloudQueueAction, CloudQueueObservation
+     from .cloud_queue_env_environment import CloudQueueEnvironment
+ except ImportError:
+     from models import CloudQueueAction, CloudQueueObservation
+     from server.cloud_queue_env_environment import CloudQueueEnvironment
+
+
+ # Create the app with web interface and README integration
+ app = create_app(
+     CloudQueueEnvironment,
+     CloudQueueAction,
+     CloudQueueObservation,
+     env_name="cloud_queue_env",
+     max_concurrent_envs=1,  # increase this number to allow more concurrent WebSocket sessions
+ )
+
+
+ def main(host: str = "0.0.0.0", port: int = 8000) -> None:
+     """
+     Entry point for direct execution via uv run or python -m.
+
+     This function enables running the server without Docker:
+         uv run --project . server
+         uv run --project . server --port 8001
+         python -m cloud_queue_env.server.app
+
+     Args:
+         host: Host address to bind to (default: "0.0.0.0")
+         port: Port number to listen on (default: 8000)
+
+     For production deployments, consider using uvicorn directly with
+     multiple workers:
+         uvicorn cloud_queue_env.server.app:app --workers 4
+     """
+     import uvicorn
+
+     uvicorn.run(app, host=host, port=port)
+
+
+ def _cli_main() -> None:
+     import argparse
+
+     parser = argparse.ArgumentParser()
+     parser.add_argument("--port", type=int, default=8000)
+     parser.add_argument("--host", type=str, default="0.0.0.0")
+     args = parser.parse_args()
+     main(host=args.host, port=args.port)
+
+
+ if __name__ == "__main__":
+     _cli_main()  # route through the arg parser so --host/--port are honored
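The environment module that follows samples per-step arrivals with Knuth's multiply-uniforms Poisson algorithm (`_sample_poisson`). Extracted as a standalone sketch (the free-function name is mine; the body mirrors the method):

```python
import math
import random

def sample_poisson(lam: float, rng: random.Random) -> int:
    """Knuth's Poisson sampler: multiply uniforms until the product drops below e^-lam.

    Exact for any lam, but runtime grows with lam, so it suits the small
    arrival rates (≈0.8–1.5) this environment uses.
    """
    lam = max(0.0, lam)
    if lam == 0.0:
        return 0
    l_term = math.exp(-lam)
    k, p = 0, 1.0
    while p > l_term:
        k += 1
        p *= rng.random()
    return max(0, k - 1)

# Seeded draws are reproducible, matching the environment's per-stream RNGs.
rng = random.Random(7)
mean = sum(sample_poisson(1.45, rng) for _ in range(20_000)) / 20_000
print(f"empirical mean ≈ {mean:.2f}  (theoretical 1.45)")
```

Because each draw consumes a variable number of uniforms, the environment keeps separate `random.Random` streams per purpose (arrivals, service, abandonment) so one subsystem's draws cannot perturb another's.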
server/cloud_queue_env_environment.py CHANGED
@@ -1,762 +1,781 @@
- # Copyright (c) Meta Platforms, Inc. and affiliates.
- # All rights reserved.
- #
- # This source code is licensed under the BSD-style license found in the
- # LICENSE file in the root directory of this source tree.
-
- """Queue operations environment with deterministic task grading."""
-
- import math
- import random
- import hashlib
- from collections import deque
- from dataclasses import dataclass
- from uuid import uuid4
-
- from openenv.core.env_server.interfaces import Environment
- from openenv.core.env_server.types import State
-
- try:
-     from ..models import CloudQueueAction, CloudQueueObservation
- except ImportError:
-     from models import CloudQueueAction, CloudQueueObservation
-
-
- @dataclass
- class TaskConfig:
-     task_id: str
-     horizon: int
-     level: float
-     queue_count: int
-     initial_servers: int
-     min_servers: int
-     max_servers: int
-     arrival_rate: float
-     urgent_ratio: float
-     service_mean: float
-     deadline_base: int
-     allow_scaling: bool
-     allow_priority: bool
-     two_stage: bool
-     server_cost: float
-     max_queue_size: int
-     score_refs: dict[str, float]
-
-
- class CloudQueueEnvironment(Environment):
-     """Deterministic queueing environment with easy/medium/hard benchmark tasks."""
-
-     SUPPORTS_CONCURRENT_SESSIONS: bool = True
-
-     def __init__(self):
-         self._task_configs = self._build_task_configs()
-         self._active_task_id = "easy"
-         self._pending_task_id = "easy"
-         self._pending_seed = 7
-         self._rng_streams: dict[str, random.Random] = {}
-         self._rng_stream_seeds: dict[str, int] = {}
-         self._state = State(episode_id=str(uuid4()), step_count=0)
-         self._sim_time = 0
-         self._queues: list[deque[dict]] = []
-         self._servers: list[dict] = []
-         self._incoming_job: dict | None = None
-         self._done = False
-         self._wait_ema: list[float] = []
-         self._utilization_ema: list[float] = []
-         self._metrics: dict[str, float] = {}
-         self._recent_rewards: deque[float] = deque(maxlen=8)
-         self._action_trace: list[str] = []
-         self._reset_runtime_state()
-
-     def _build_task_configs(self) -> dict[str, TaskConfig]:
-         return {
-             "easy": TaskConfig(
-                 task_id="easy",
-                 horizon=150,
-                 level=1.0,
-                 queue_count=1,
-                 initial_servers=1,
-                 min_servers=1,
-                 max_servers=1,
-                 arrival_rate=0.78,
-                 urgent_ratio=0.0,
-                 service_mean=1.6,
-                 deadline_base=10,
-                 allow_scaling=False,
-                 allow_priority=False,
-                 two_stage=False,
-                 server_cost=0.04,
-                 max_queue_size=28,
-                 score_refs={"wait": 6.0, "thr": 70.0, "rej": 0.3, "sla": 0.3},
-             ),
-             "medium": TaskConfig(
-                 task_id="medium",
-                 horizon=200,
-                 level=2.3,
-                 queue_count=2,
-                 initial_servers=3,
-                 min_servers=3,  # scaling disabled on medium — lock to initial_servers
-                 max_servers=3,  # scaling disabled on medium — lock to initial_servers
-                 arrival_rate=1.15,
-                 urgent_ratio=0.28,
-                 service_mean=1.8,
-                 deadline_base=8,
-                 allow_scaling=False,
-                 allow_priority=True,
-                 two_stage=False,
-                 server_cost=0.06,
-                 max_queue_size=42,
-                 score_refs={"uw": 7.0, "nw": 10.0, "usla": 0.25, "thr": 125.0, "cost": 14.0},
-             ),
-             "hard": TaskConfig(
-                 task_id="hard",
-                 horizon=250,
-                 level=4.0,
-                 queue_count=2,
-                 initial_servers=3,
-                 min_servers=1,
-                 max_servers=6,
-                 arrival_rate=1.45,
-                 urgent_ratio=0.35,
-                 service_mean=2.2,
-                 deadline_base=7,
-                 allow_scaling=True,
-                 allow_priority=True,
-                 two_stage=True,
-                 server_cost=0.1,
-                 max_queue_size=64,
-                 score_refs={
-                     "e2e": 14.0,
-                     "abd": 0.25,
-                     "sla": 0.3,
-                     "thr": 145.0,
-                     "cost": 28.0,
-                     "fair": 0.35,
-                 },
-             ),
-         }
-
-     def _reset_runtime_state(self) -> None:
-         cfg = self._task_configs[self._active_task_id]
-         self._sim_time = 0
-         self._done = False
-         self._incoming_job = None
-         self._action_trace = []
-         self._queues = [deque() for _ in range(cfg.queue_count)]
-         self._servers = [
-             {"remaining": 0.0, "job": None, "active": True}
-             for _ in range(cfg.initial_servers)
-         ]
-         self._wait_ema = [0.0 for _ in range(cfg.queue_count)]
-         self._utilization_ema = [0.0 for _ in range(cfg.max_servers)]
-         self._recent_rewards.clear()
-         self._metrics = {
-             "arrivals": 0.0,
-             "accepted": 0.0,
-             "rejected": 0.0,
-             "completed": 0.0,
-             "completed_urgent": 0.0,
-             "abandoned": 0.0,
-             "wait_sum": 0.0,
-             "wait_count": 0.0,
-             "wait_sum_urgent": 0.0,
-             "wait_count_urgent": 0.0,
-             "wait_sum_normal": 0.0,
-             "wait_count_normal": 0.0,
-             "sla_breaches": 0.0,
-             "sla_breaches_urgent": 0.0,
-             "invalid_actions": 0.0,
-             "noop_under_load": 0.0,
-             "harmful_scale_down": 0.0,
-             "action_cost": 0.0,
-             "infra_cost": 0.0,
-             "fairness_gap_sum": 0.0,
-             "fairness_gap_count": 0.0,
-         }
-         self._wait_samples_all: list[float] = []
-         self._wait_samples_urgent: list[float] = []
-         self._wait_samples_normal: list[float] = []
-         self._e2e_wait_samples: list[float] = []
-
-     def _init_rng_streams(self, base_seed: int) -> None:
-         self._rng_stream_seeds = {
-             "arrivals": int(base_seed) + 101,
-             "service": int(base_seed) + 211,
-             "abandonment": int(base_seed) + 307,
-             "exogenous": int(base_seed) + 401,
-         }
-         self._rng_streams = {
-             key: random.Random(seed) for key, seed in self._rng_stream_seeds.items()
-         }
-
-     def _rng(self, stream: str) -> random.Random:
-         return self._rng_streams[stream]
-
-     def _sample_poisson(self, lam: float, rng: random.Random) -> int:
-         lam = max(0.0, lam)
-         if lam == 0.0:
-             return 0
-         # Knuth algorithm is sufficient for this environment's lambda scale.
-         l_term = math.exp(-lam)
-         k = 0
-         p = 1.0
-         while p > l_term:
-             k += 1
-             p *= rng.random()
-         return max(0, k - 1)
-
-     def _trace_digest(self) -> str:
-         raw = f"task={self._active_task_id}|seed={self._pending_seed}|" + "|".join(self._action_trace)
-         return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
-
-     def reset(self) -> CloudQueueObservation:
-         self._active_task_id = self._pending_task_id if self._pending_task_id in self._task_configs else "easy"
-         self._init_rng_streams(self._pending_seed)
-         self._state = State(episode_id=str(uuid4()), step_count=0)
-         self._reset_runtime_state()
-         return self._build_observation(reward=0.0, done=False, info={"event": "reset"})
-
-     def _clamp(self, value: float, lo: float, hi: float) -> float:
-         return max(lo, min(hi, value))
-
-     def _sample_service_time(self, cfg: TaskConfig) -> float:
-         service_rng = self._rng("service")
-         if cfg.task_id == "hard":
-             heavy = service_rng.random() < 0.22
-             if heavy:
-                 return self._clamp(service_rng.lognormvariate(1.2, 0.7), 1.0, 12.0)
-         return self._clamp(service_rng.expovariate(1.0 / cfg.service_mean), 0.5, 10.0)
-
-     def _sample_arrivals(self, cfg: TaskConfig) -> int:
-         arrival_rng = self._rng("arrivals")
-         exogenous_rng = self._rng("exogenous")
-         rate = cfg.arrival_rate
-         if cfg.task_id == "hard":
-             wave = 0.35 * math.sin((self._sim_time + 1) / 13.0)
-             jitter = exogenous_rng.uniform(-0.05, 0.05)
-             rate += wave + jitter
-         return self._sample_poisson(rate, arrival_rng)
-
-     def _spawn_incoming_job(self, cfg: TaskConfig) -> None:
-         arrivals = self._sample_arrivals(cfg)
-         if arrivals <= 0:
-             self._incoming_job = None
-             return
-         arrival_rng = self._rng("arrivals")
-         priority = 2 if arrival_rng.random() < cfg.urgent_ratio else 1
-         size = self._sample_service_time(cfg)
-         self._incoming_job = {
-             "priority": priority,
-             "queue": 0,
-             "created_step": self._state.step_count,
-             "wait": 0.0,
-             "size": size,
-             "remaining": size,
-             "deadline": self._state.step_count + cfg.deadline_base - (1 if priority == 2 else 0),
-             "type": 1 if priority == 2 else 0,
-             "stage": 0,
-         }
-         self._metrics["arrivals"] += 1.0
-
-     def _update_wait_and_abandonment(self, cfg: TaskConfig) -> float:
-         abandonment_rng = self._rng("abandonment")
-         abandoned_this_step = 0.0
-         for qi, q in enumerate(self._queues):
-             kept: deque[dict] = deque()
-             while q:
-                 job = q.popleft()
-                 job["wait"] += 1.0
-                 patience = cfg.deadline_base + (2 if job["priority"] == 2 else 4)
-                 if cfg.task_id == "hard" and job["wait"] > patience and abandonment_rng.random() < 0.35:
-                     abandoned_this_step += 1.0
-                     continue
-                 kept.append(job)
-             self._queues[qi] = kept
-         if abandoned_this_step:
-             self._metrics["abandoned"] += abandoned_this_step
-         return abandoned_this_step
-
-     def _complete_job(self, cfg: TaskConfig, job: dict) -> None:
-         if cfg.two_stage and job["stage"] == 0:
-             forwarded = dict(job)
-             forwarded["stage"] = 1
-             forwarded["queue"] = min(1, len(self._queues) - 1)
-             forwarded["remaining"] = self._sample_service_time(cfg)
-             self._queues[forwarded["queue"]].append(forwarded)
-             return
-
-         self._metrics["completed"] += 1.0
-         wait = float(self._state.step_count - job["created_step"])
-         self._metrics["wait_sum"] += wait
-         self._metrics["wait_count"] += 1.0
-         self._wait_samples_all.append(wait)
-         self._e2e_wait_samples.append(wait)
-         if job["priority"] == 2:
-             self._metrics["completed_urgent"] += 1.0
-             self._metrics["wait_sum_urgent"] += wait
-             self._metrics["wait_count_urgent"] += 1.0
-             self._wait_samples_urgent.append(wait)
-         else:
-             self._metrics["wait_sum_normal"] += wait
-             self._metrics["wait_count_normal"] += 1.0
-             self._wait_samples_normal.append(wait)
-         if self._state.step_count > job["deadline"]:
-             self._metrics["sla_breaches"] += 1.0
-             if job["priority"] == 2:
-                 self._metrics["sla_breaches_urgent"] += 1.0
-
-     def _process_servers(self, cfg: TaskConfig) -> float:
-         completed_this_step = 0.0
-         for si, server in enumerate(self._servers):
-             if not server["active"]:
-                 continue
-             if server["remaining"] > 0:
-                 server["remaining"] = max(0.0, server["remaining"] - 1.0)
-                 if server["remaining"] <= 0 and server["job"] is not None:
-                     self._complete_job(cfg, server["job"])
-                     completed_this_step += 1.0
-                     server["job"] = None
-             busy_flag = 1.0 if server["job"] is not None else 0.0
-             if si < len(self._utilization_ema):
-                 self._utilization_ema[si] = 0.9 * self._utilization_ema[si] + 0.1 * busy_flag
-         return completed_this_step
-
-     def _admit_job(self, cfg: TaskConfig, queue_idx: int) -> tuple[bool, str]:
-         if self._incoming_job is None:
-             return False, "no_incoming_job"
-         if queue_idx < 0 or queue_idx >= len(self._queues):
-             return False, "invalid_queue"
-         if len(self._queues[queue_idx]) >= cfg.max_queue_size:
-             self._metrics["rejected"] += 1.0
-             self._incoming_job = None
-             return True, "queue_full_rejected"
-         job = dict(self._incoming_job)
-         job["queue"] = queue_idx
-         self._queues[queue_idx].append(job)
-         self._incoming_job = None
-         self._metrics["accepted"] += 1.0
-         return True, "admitted"
-
-     def _dispatch(self, queue_idx: int | None) -> tuple[bool, str]:
-         target = 0 if queue_idx is None else queue_idx
-         if target < 0 or target >= len(self._queues):
-             return False, "invalid_dispatch_queue"
-         for server in self._servers:
-             if not server["active"]:
-                 continue
-             if server["job"] is None and self._queues[target]:
-                 server["job"] = self._queues[target].popleft()
-                 server["remaining"] = server["job"]["remaining"]
-                 return True, "dispatched"
-         return False, "no_idle_server_or_empty_queue"
-
-     def _autodispatch(self) -> None:
-         for server in self._servers:
-             if not server["active"] or server["job"] is not None:
-                 continue
-             for q in self._queues:
-                 if q:
-                     server["job"] = q.popleft()
-                     server["remaining"] = server["job"]["remaining"]
-                     break
-
-     def _apply_action(self, action: CloudQueueAction, cfg: TaskConfig) -> tuple[bool, str]:
-         action_type = (action.action_type or "noop").lower()
-
-         if action_type == "configure_task":
-             if action.task_id and action.task_id in self._task_configs:
-                 self._pending_task_id = action.task_id
-             if action.seed is not None:
-                 self._pending_seed = int(action.seed)
-             return True, "configuration_updated_for_next_reset"
-
-         if self._done:
-             return False, "episode_already_done"
-
-         if action_type == "admit":
-             queue_idx = action.target_queue if action.target_queue is not None else 0
-             return self._admit_job(cfg, queue_idx)
-
-         if action_type == "reject":
-             if self._incoming_job is None:
-                 return False, "no_incoming_job"
-             self._incoming_job = None
-             self._metrics["rejected"] += 1.0
-             return True, "rejected"
-
-         if action_type == "route":
-             queue_idx = action.target_queue if action.target_queue is not None else 0
-             return self._admit_job(cfg, queue_idx)
-
-         if action_type == "dispatch":
-             return self._dispatch(action.target_queue)
-
-         if action_type == "scale":
-             if not cfg.allow_scaling:
-                 return False, "scaling_not_supported_for_task"
-             delta = action.scale_delta if action.scale_delta is not None else 0
-             if delta == 0:
-                 return True, "no_scale_change"
-             active_count = sum(1 for s in self._servers if s["active"])
-             requested = int(self._clamp(active_count + delta, cfg.min_servers, cfg.max_servers))
-             if requested == active_count:
-                 return True, "scale_clamped_no_change"
-             if requested > active_count:
-                 for _ in range(requested - active_count):
-                     self._servers.append({"remaining": 0.0, "job": None, "active": True})
-                     self._utilization_ema.append(0.0)
-             else:
-                 to_disable = active_count - requested
-                 for server in reversed(self._servers):
-                     if to_disable == 0:
-                         break
-                     if server["active"] and server["job"] is None:
-                         server["active"] = False
-                         to_disable -= 1
-             self._metrics["action_cost"] += abs(delta) * 0.35
-             return True, "scaled"
-
-         if action_type == "reprioritize":
-             if not cfg.allow_priority:
-                 return False, "reprioritize_not_supported_for_task"
-             new_priority = 2 if (action.new_priority or 1) >= 2 else 1
-             for q in self._queues:
-                 for job in q:
-                     if job["priority"] == 1:
-                         job["priority"] = new_priority
-                         return True, "reprioritized"
-             return False, "no_eligible_job"
-
-         if action_type == "noop":
-             return True, "noop"
-
-         return False, "unknown_action_type"
-
435
- def _percentile(self, values: list[float], p: float) -> float:
436
- if not values:
437
- return 0.0
438
- ordered = sorted(values)
439
- idx = int(self._clamp(round((len(ordered) - 1) * p), 0, len(ordered) - 1))
440
- return float(ordered[idx])
441
-
442
- def _safe_div(self, numerator: float, denominator: float) -> float:
443
- if denominator <= 0:
444
- return 0.0
445
- return numerator / denominator
446
-
447
- def _current_fairness_gap(self) -> float:
448
- urgent_avg = self._safe_div(self._metrics["wait_sum_urgent"], self._metrics["wait_count_urgent"])
449
- normal_avg = self._safe_div(self._metrics["wait_sum_normal"], self._metrics["wait_count_normal"])
450
- scale = max(1.0, urgent_avg + normal_avg)
451
- return abs(urgent_avg - normal_avg) / scale
452
-
453
- def _compute_reward(
454
- self,
455
- cfg: TaskConfig,
456
- action_ok: bool,
457
- action_type: str,
458
- action_scale_delta: int,
459
- completed_step: float,
460
- ) -> tuple[float, dict[str, float]]:
461
- avg_wait = self._safe_div(self._metrics["wait_sum"], self._metrics["wait_count"])
462
- queue_pressure = sum(len(q) for q in self._queues) / max(1.0, float(cfg.max_queue_size))
463
- r_wait = -self._clamp(avg_wait / max(cfg.deadline_base, 1), 0.0, 1.5) - 0.15 * self._clamp(queue_pressure, 0.0, 1.5)
464
- r_throughput = self._clamp(completed_step / max(1.0, float(cfg.initial_servers)), 0.0, 1.0)
465
- total_decisions = max(1.0, self._metrics["completed"] + self._metrics["abandoned"])
466
- r_sla = -self._clamp(self._metrics["sla_breaches"] / total_decisions, 0.0, 1.0)
467
- active_servers = sum(1 for s in self._servers if s["active"])
468
- r_cost = -self._clamp(active_servers / max(1.0, float(cfg.max_servers)), 0.0, 1.0)
469
- fairness_gap = self._current_fairness_gap()
470
- r_fair = -self._clamp(fairness_gap / 0.5, 0.0, 1.0)
471
- r_safe = 0.0 if action_ok else -1.0
472
- if not action_ok:
473
- self._metrics["invalid_actions"] += 1.0
474
- if action_type == "noop" and self._incoming_job is not None and sum(len(q) for q in self._queues) > 0:
475
- r_safe -= 0.05
476
- self._metrics["noop_under_load"] += 1.0
477
-
478
- arrivals = max(1.0, self._metrics["arrivals"])
479
- rejection_rate = self._safe_div(self._metrics["rejected"], arrivals)
480
- if arrivals > 10 and rejection_rate > 0.4:
481
- r_safe -= self._clamp((rejection_rate - 0.4) * 0.4, 0.0, 0.2)
482
-
483
- if action_type == "scale" and action_scale_delta < 0 and queue_pressure > 0.45:
484
- overload_penalty = self._clamp((queue_pressure - 0.45) * 0.5, 0.0, 0.25)
485
- r_safe -= overload_penalty
486
- self._metrics["harmful_scale_down"] += 1.0
487
-
488
- reward = 0.35 * r_wait + 0.20 * r_throughput + 0.20 * r_sla + 0.15 * r_cost + 0.05 * r_fair + 0.05 * r_safe
489
- reward = self._clamp(reward, -1.0, 1.0)
490
- self._recent_rewards.append(reward)
491
-
492
- self._metrics["infra_cost"] += active_servers * cfg.server_cost
493
- self._metrics["fairness_gap_sum"] += fairness_gap
494
- self._metrics["fairness_gap_count"] += 1.0
495
-
496
- components = {
497
- "wait": round(r_wait, 4),
498
- "throughput": round(r_throughput, 4),
499
- "sla": round(r_sla, 4),
500
- "cost": round(r_cost, 4),
501
- "fairness": round(r_fair, 4),
502
- "safety": round(r_safe, 4),
503
- }
504
- return reward, components
505
-
506
- def _score_task(self, cfg: TaskConfig) -> tuple[float, dict[str, float]]:
507
- # c01: clamp individual sub-score components to [0, 1] inclusive.
508
- def c01(value: float) -> float:
509
- if not math.isfinite(value):
510
- return 0.0
511
- return self._clamp(value, 0.0, 1.0)
512
-
513
- # _strict01: final clamp applied only to the episode score.
514
- # Validator requires score strictly in (0, 1) — never 0.0 or 1.0.
515
- _SCORE_MIN = 0.001
516
- _SCORE_MAX = 0.999
517
-
518
- def strict01(value: float) -> float:
519
- if not math.isfinite(value):
520
- return _SCORE_MIN
521
- return self._clamp(value, _SCORE_MIN, _SCORE_MAX)
522
-
523
- completed = self._metrics["completed"]
524
- arrivals = self._metrics["arrivals"]
525
- rejected = self._metrics["rejected"]
526
- avg_wait = self._safe_div(self._metrics["wait_sum"], self._metrics["wait_count"])
527
- rejection_rate = self._safe_div(rejected, arrivals)
528
- sla_rate = self._safe_div(self._metrics["sla_breaches"], max(1.0, completed))
529
- throughput = completed
530
- fairness_gap = self._safe_div(self._metrics["fairness_gap_sum"], self._metrics["fairness_gap_count"])
531
-
532
- if cfg.task_id == "easy":
533
- score_wait = c01(1.0 - avg_wait / cfg.score_refs["wait"])
534
- score_thr = c01(throughput / cfg.score_refs["thr"])
535
- score_rej = c01(1.0 - rejection_rate / cfg.score_refs["rej"])
536
- score_sla = c01(1.0 - sla_rate / cfg.score_refs["sla"])
537
- score = 0.4 * score_wait + 0.3 * score_thr + 0.15 * score_rej + 0.15 * score_sla
538
- details = {
539
- "score_wait": round(score_wait, 4),
540
- "score_throughput": round(score_thr, 4),
541
- "score_rejection": round(score_rej, 4),
542
- "score_sla": round(score_sla, 4),
543
- }
544
- elif cfg.task_id == "medium":
545
- p95_u = self._percentile(self._wait_samples_urgent, 0.95)
546
- p95_n = self._percentile(self._wait_samples_normal, 0.95)
547
- urgent_sla = self._safe_div(self._metrics["sla_breaches_urgent"], max(1.0, self._metrics["completed_urgent"]))
548
- s_uw = c01(1.0 - p95_u / cfg.score_refs["uw"])
549
- s_nw = c01(1.0 - p95_n / cfg.score_refs["nw"])
550
- s_usla = c01(1.0 - urgent_sla / cfg.score_refs["usla"])
551
- s_thr = c01(throughput / cfg.score_refs["thr"])
552
- s_cost = c01(1.0 - self._metrics["action_cost"] / cfg.score_refs["cost"])
553
- score = 0.35 * s_uw + 0.15 * s_nw + 0.25 * s_usla + 0.15 * s_thr + 0.10 * s_cost
554
- details = {
555
- "score_urgent_wait": round(s_uw, 4),
556
- "score_normal_wait": round(s_nw, 4),
557
- "score_urgent_sla": round(s_usla, 4),
558
- "score_throughput": round(s_thr, 4),
559
- "score_cost": round(s_cost, 4),
560
- }
561
- else:
562
- e2e_p95 = self._percentile(self._e2e_wait_samples, 0.95)
563
- abd_rate = self._safe_div(self._metrics["abandoned"], arrivals)
564
- s_e2e = c01(1.0 - e2e_p95 / cfg.score_refs["e2e"])
565
- s_abd = c01(1.0 - abd_rate / cfg.score_refs["abd"])
566
- s_sla = c01(1.0 - sla_rate / cfg.score_refs["sla"])
567
- s_thr = c01(throughput / cfg.score_refs["thr"])
568
- s_cost = c01(1.0 - self._metrics["infra_cost"] / cfg.score_refs["cost"])
569
- s_fair = c01(1.0 - fairness_gap / cfg.score_refs["fair"])
570
- score = 0.25 * s_e2e + 0.20 * s_abd + 0.20 * s_sla + 0.15 * s_thr + 0.10 * s_cost + 0.10 * s_fair
571
- details = {
572
- "score_e2e_p95": round(s_e2e, 4),
573
- "score_abandonment": round(s_abd, 4),
574
- "score_sla": round(s_sla, 4),
575
- "score_throughput": round(s_thr, 4),
576
- "score_cost": round(s_cost, 4),
577
- "score_fairness": round(s_fair, 4),
578
- }
579
-
580
- if self._metrics["invalid_actions"] > max(3.0, 0.04 * cfg.horizon):
581
- score = min(score, 0.4)
582
- # Apply strict open-interval clamp: validator rejects 0.0 and 1.0.
583
- return strict01(score), details
584
-
585
- def _compute_action_mask(self, cfg: TaskConfig) -> list[int]:
586
- """Compute which of the 8 actions are valid right now.
587
-
588
- Slot order (matches CloudQueueAction.action_type):
589
- 0: configure_task — always valid (meta, sets next task/seed)
590
- 1: admit — only if an incoming job is waiting
591
- 2: reject — only if an incoming job is waiting
592
- 3: route — only if an incoming job is waiting
593
- 4: dispatch — only if an idle+active server AND a non-empty queue exist
594
- 5: scale — only if cfg.allow_scaling is True
595
- 6: reprioritize — only if cfg.allow_priority AND a normal-priority job is queued
596
- 7: noop — always valid
597
- """
598
- has_incoming = self._incoming_job is not None
599
-
600
- has_idle_server = any(
601
- s["active"] and s["job"] is None for s in self._servers
602
- )
603
- has_queued_job = any(len(q) > 0 for q in self._queues)
604
- can_dispatch = 1 if (has_idle_server and has_queued_job) else 0
605
-
606
- can_reprioritize = 0
607
- if cfg.allow_priority:
608
- can_reprioritize = 1 if any(
609
- job["priority"] == 1 for q in self._queues for job in q
610
- ) else 0
611
-
612
- return [
613
- 1, # 0: configure_task
614
- 1 if has_incoming else 0, # 1: admit
615
- 1 if has_incoming else 0, # 2: reject
616
- 1 if has_incoming else 0, # 3: route
617
- can_dispatch, # 4: dispatch
618
- 1 if cfg.allow_scaling else 0, # 5: scale
619
- can_reprioritize, # 6: reprioritize
620
- 1, # 7: noop
621
- ]
622
-
623
- def _build_observation(self, reward: float, done: bool, info: dict) -> CloudQueueObservation:
624
- cfg = self._task_configs[self._active_task_id]
625
- queue_lengths = [len(q) for q in self._queues]
626
- for i, q in enumerate(self._queues):
627
- current_mean_wait = 0.0
628
- if q:
629
- current_mean_wait = sum(job["wait"] for job in q) / len(q)
630
- self._wait_ema[i] = 0.8 * self._wait_ema[i] + 0.2 * current_mean_wait
631
-
632
- active_servers = max(1, sum(1 for s in self._servers if s["active"]))
633
- completed = max(1.0, self._metrics["completed"])
634
- sla_violation_rate = self._safe_div(self._metrics["sla_breaches"], completed)
635
- abandonment_rate = self._safe_div(self._metrics["abandoned"], max(1.0, self._metrics["arrivals"]))
636
- throughput_recent = max(0.0, info.get("completed_this_step", 0.0))
637
- energy_cost_rate = active_servers * cfg.server_cost
638
-
639
- incoming = self._incoming_job
640
- incoming_present = incoming is not None
641
- incoming_size = float(incoming["size"]) if incoming_present else 0.0
642
- incoming_priority = int(incoming["priority"]) if incoming_present else 0
643
- incoming_deadline = float(incoming["deadline"]) if incoming_present else 0.0
644
- incoming_type = int(incoming["type"]) if incoming_present else 0
645
-
646
- score, score_details = (0.0, {})
647
- if done:
648
- score, score_details = self._score_task(cfg)
649
-
650
- metadata = {
651
- "info": info,
652
- "reward_components": info.get("reward_components", {}),
653
- "applied_action": info.get("applied_action", "noop"),
654
- "seed": int(self._pending_seed),
655
- "trace_digest": self._trace_digest(),
656
- "rng_stream_seeds": self._rng_stream_seeds,
657
- "metrics": {
658
- "arrivals": self._metrics["arrivals"],
659
- "accepted": self._metrics["accepted"],
660
- "rejected": self._metrics["rejected"],
661
- "completed": self._metrics["completed"],
662
- "abandoned": self._metrics["abandoned"],
663
- "invalid_actions": self._metrics["invalid_actions"],
664
- "harmful_scale_down": self._metrics["harmful_scale_down"],
665
- "infra_cost": round(self._metrics["infra_cost"], 4),
666
- },
667
- "episode_score": round(score, 4),
668
- "score_details": score_details,
669
- }
670
-
671
- return CloudQueueObservation(
672
- task_id=cfg.task_id,
673
- sim_time=self._sim_time,
674
- horizon=cfg.horizon,
675
- queue_lengths=queue_lengths,
676
- queue_wait_ema=[round(v, 3) for v in self._wait_ema],
677
- server_busy=[1 if s["job"] is not None and s["active"] else 0 for s in self._servers],
678
- server_remaining_service=[round(float(s["remaining"]), 3) for s in self._servers],
679
- utilization=[round(v, 3) for v in self._utilization_ema[: len(self._servers)]],
680
- incoming_job_present=incoming_present,
681
- incoming_job_size=round(incoming_size, 3),
682
- incoming_job_priority=incoming_priority,
683
- incoming_job_deadline=round(incoming_deadline, 3),
684
- incoming_job_type=incoming_type,
685
- sla_violation_rate=round(sla_violation_rate, 4),
686
- abandonment_rate=round(abandonment_rate, 4),
687
- throughput_recent=round(throughput_recent, 4),
688
- energy_cost_rate=round(energy_cost_rate, 4),
689
- level=cfg.level,
690
- optional_history=[round(v, 4) for v in list(self._recent_rewards)],
691
- action_mask=self._compute_action_mask(cfg),
692
- done=done,
693
- reward=round(reward, 6),
694
- metadata=metadata,
695
- )
696
-
697
- def step(self, action: CloudQueueAction) -> CloudQueueObservation: # type: ignore[override]
698
- cfg = self._task_configs[self._active_task_id]
699
-
700
- if (action.action_type or "").lower() == "configure_task":
701
- ok, note = self._apply_action(action, cfg)
702
- info = {
703
- "event": "configure_task",
704
- "applied_action": action.action_type,
705
- "valid_action": ok,
706
- "note": note,
707
- "completed_this_step": 0.0,
708
- "debug_trace_id": self._trace_digest(),
709
- }
710
- return self._build_observation(reward=0.0, done=self._done, info=info)
711
-
712
- if self._done:
713
- info = {
714
- "event": "episode_done",
715
- "applied_action": action.action_type,
716
- "valid_action": False,
717
- "note": "call reset() to start a new episode",
718
- "completed_this_step": 0.0,
719
- "reward_components": {},
720
- "debug_trace_id": self._trace_digest(),
721
- }
722
- return self._build_observation(reward=0.0, done=True, info=info)
723
-
724
- self._state.step_count += 1
725
- self._sim_time += 1
726
-
727
- completed_this_step = self._process_servers(cfg)
728
- abandoned_this_step = self._update_wait_and_abandonment(cfg)
729
- self._spawn_incoming_job(cfg)
730
-
731
- action_ok, action_note = self._apply_action(action, cfg)
732
- action_key = (
733
- f"{(action.action_type or 'noop').lower()}|"
734
- f"q={action.target_queue}|s={action.target_server}|"
735
- f"d={action.scale_delta}|p={action.new_priority}"
736
- )
737
- self._action_trace.append(action_key)
738
- self._autodispatch()
739
- reward, reward_components = self._compute_reward(
740
- cfg,
741
- action_ok=action_ok,
742
- action_type=(action.action_type or "noop").lower(),
743
- action_scale_delta=int(action.scale_delta or 0),
744
- completed_step=completed_this_step,
745
- )
746
-
747
- self._done = self._state.step_count >= cfg.horizon
748
- info = {
749
- "event": "step",
750
- "applied_action": action.action_type,
751
- "valid_action": action_ok,
752
- "note": action_note,
753
- "completed_this_step": completed_this_step,
754
- "abandoned_this_step": abandoned_this_step,
755
- "reward_components": reward_components,
756
- "debug_trace_id": self._trace_digest(),
757
- }
758
- return self._build_observation(reward=reward, done=self._done, info=info)
759
-
760
- @property
761
- def state(self) -> State:
762
- return self._state
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+#
+# This source code is licensed under the BSD-style license found in the
+# LICENSE file in the root directory of this source tree.
+
+"""Queue operations environment with deterministic task grading."""
+
+import math
+import random
+import hashlib
+from collections import deque
+from dataclasses import dataclass
+from uuid import uuid4
+
+from openenv.core.env_server.interfaces import Environment
+from openenv.core.env_server.types import State
+
+try:
+    from ..models import CloudQueueAction, CloudQueueObservation
+except ImportError:
+    from models import CloudQueueAction, CloudQueueObservation
+
+
+@dataclass
+class TaskConfig:
+    task_id: str
+    horizon: int
+    level: float
+    queue_count: int
+    initial_servers: int
+    min_servers: int
+    max_servers: int
+    arrival_rate: float
+    urgent_ratio: float
+    service_mean: float
+    deadline_base: int
+    allow_scaling: bool
+    allow_priority: bool
+    two_stage: bool
+    server_cost: float
+    max_queue_size: int
+    score_refs: dict[str, float]
+
+
+class CloudQueueEnvironment(Environment):
+    """Deterministic queueing environment with easy/medium/hard benchmark tasks."""
+
+    SUPPORTS_CONCURRENT_SESSIONS: bool = True
+    # Benchmark-safe default: dispatch decisions should come from the agent.
+    ASSISTED_AUTODISPATCH: bool = False
+
+    def __init__(self):
+        self._task_configs = self._build_task_configs()
+        self._active_task_id = "easy"
+        self._pending_task_id = "easy"
+        self._pending_seed = 7
+        self._rng_streams: dict[str, random.Random] = {}
+        self._rng_stream_seeds: dict[str, int] = {}
+        self._state = State(episode_id=str(uuid4()), step_count=0)
+        self._sim_time = 0
+        self._queues: list[deque[dict]] = []
+        self._servers: list[dict] = []
+        self._incoming_buffer: deque[dict] = deque()
+        self._incoming_job: dict | None = None
+        self._done = False
+        self._wait_ema: list[float] = []
+        self._utilization_ema: list[float] = []
+        self._metrics: dict[str, float] = {}
+        self._recent_rewards: deque[float] = deque(maxlen=8)
+        self._action_trace: list[str] = []
+        self._reset_runtime_state()
+
+    def _build_task_configs(self) -> dict[str, TaskConfig]:
+        return {
+            "easy": TaskConfig(
+                task_id="easy",
+                horizon=150,
+                level=1.0,
+                queue_count=1,
+                initial_servers=1,
+                min_servers=1,
+                max_servers=1,
+                arrival_rate=0.78,
+                urgent_ratio=0.0,
+                service_mean=1.6,
+                deadline_base=10,
+                allow_scaling=False,
+                allow_priority=False,
+                two_stage=False,
+                server_cost=0.04,
+                max_queue_size=28,
+                score_refs={"wait": 6.0, "thr": 70.0, "rej": 0.3, "sla": 0.3},
+            ),
+            "medium": TaskConfig(
+                task_id="medium",
+                horizon=200,
+                level=2.3,
+                queue_count=2,
+                initial_servers=3,
+                min_servers=3,  # scaling disabled on medium — lock to initial_servers
+                max_servers=3,  # scaling disabled on medium — lock to initial_servers
+                arrival_rate=1.15,
+                urgent_ratio=0.28,
+                service_mean=1.8,
+                deadline_base=8,
+                allow_scaling=False,
+                allow_priority=True,
+                two_stage=False,
+                server_cost=0.06,
+                max_queue_size=42,
+                score_refs={"uw": 7.0, "nw": 10.0, "usla": 0.25, "thr": 125.0, "cost": 14.0},
+            ),
+            "hard": TaskConfig(
+                task_id="hard",
+                horizon=250,
+                level=4.0,
+                queue_count=2,
+                initial_servers=3,
+                min_servers=1,
+                max_servers=6,
+                arrival_rate=1.45,
+                urgent_ratio=0.35,
+                service_mean=2.2,
+                deadline_base=7,
+                allow_scaling=True,
+                allow_priority=True,
+                two_stage=True,
+                server_cost=0.1,
+                max_queue_size=64,
+                score_refs={
+                    "e2e": 14.0,
+                    "abd": 0.25,
+                    "sla": 0.3,
+                    "thr": 145.0,
+                    "cost": 28.0,
+                    "fair": 0.35,
+                },
+            ),
+        }
+
+    def _reset_runtime_state(self) -> None:
+        cfg = self._task_configs[self._active_task_id]
+        self._sim_time = 0
+        self._done = False
+        self._incoming_buffer = deque()
+        self._incoming_job = None
+        self._action_trace = []
+        self._queues = [deque() for _ in range(cfg.queue_count)]
+        self._servers = [
+            {"remaining": 0.0, "job": None, "active": True}
+            for _ in range(cfg.initial_servers)
+        ]
+        self._wait_ema = [0.0 for _ in range(cfg.queue_count)]
+        self._utilization_ema = [0.0 for _ in range(cfg.max_servers)]
+        self._recent_rewards.clear()
+        self._metrics = {
+            "arrivals": 0.0,
+            "accepted": 0.0,
+            "rejected": 0.0,
+            "completed": 0.0,
+            "completed_urgent": 0.0,
+            "abandoned": 0.0,
+            "wait_sum": 0.0,
+            "wait_count": 0.0,
+            "wait_sum_urgent": 0.0,
+            "wait_count_urgent": 0.0,
+            "wait_sum_normal": 0.0,
+            "wait_count_normal": 0.0,
+            "sla_breaches": 0.0,
+            "sla_breaches_urgent": 0.0,
+            "invalid_actions": 0.0,
+            "noop_under_load": 0.0,
+            "harmful_scale_down": 0.0,
+            "action_cost": 0.0,
+            "infra_cost": 0.0,
+            "fairness_gap_sum": 0.0,
+            "fairness_gap_count": 0.0,
+        }
+        self._wait_samples_all: list[float] = []
+        self._wait_samples_urgent: list[float] = []
+        self._wait_samples_normal: list[float] = []
+        self._e2e_wait_samples: list[float] = []
+
+    def _init_rng_streams(self, base_seed: int) -> None:
+        self._rng_stream_seeds = {
+            "arrivals": int(base_seed) + 101,
+            "service": int(base_seed) + 211,
+            "abandonment": int(base_seed) + 307,
+            "exogenous": int(base_seed) + 401,
+        }
+        self._rng_streams = {
+            key: random.Random(seed) for key, seed in self._rng_stream_seeds.items()
+        }
+
+    def _rng(self, stream: str) -> random.Random:
+        return self._rng_streams[stream]
+
+    def _sample_poisson(self, lam: float, rng: random.Random) -> int:
+        lam = max(0.0, lam)
+        if lam == 0.0:
+            return 0
+        # Knuth algorithm is sufficient for this environment's lambda scale.
+        l_term = math.exp(-lam)
+        k = 0
+        p = 1.0
+        while p > l_term:
+            k += 1
+            p *= rng.random()
+        return max(0, k - 1)
+
+    def _trace_digest(self) -> str:
+        raw = f"task={self._active_task_id}|seed={self._pending_seed}|" + "|".join(self._action_trace)
+        return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]
+
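The Knuth sampler above is the determinism-critical piece: identical stream seeds must reproduce identical arrival counts. A minimal standalone sketch of the same algorithm (the free function name here is illustrative, not part of the environment's API):

```python
import math
import random

def sample_poisson(lam: float, rng: random.Random) -> int:
    # Knuth's product-of-uniforms sampler: multiply uniforms until the
    # running product drops below exp(-lam); the factor count is Poisson(lam).
    lam = max(0.0, lam)
    if lam == 0.0:
        return 0
    l_term = math.exp(-lam)
    k = 0
    p = 1.0
    while p > l_term:
        k += 1
        p *= rng.random()
    return max(0, k - 1)

# Same seed, same stream -> bit-identical arrival trace.
rng_a, rng_b = random.Random(7), random.Random(7)
trace_a = [sample_poisson(1.2, rng_a) for _ in range(20)]
trace_b = [sample_poisson(1.2, rng_b) for _ in range(20)]
assert trace_a == trace_b
```

Because each stream (`arrivals`, `service`, ...) owns its own `random.Random`, consuming one stream never perturbs another, which is what makes fixed seed plus fixed action trace reproducible.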
+    def reset(self) -> CloudQueueObservation:
+        self._active_task_id = self._pending_task_id if self._pending_task_id in self._task_configs else "easy"
+        self._init_rng_streams(self._pending_seed)
+        self._state = State(episode_id=str(uuid4()), step_count=0)
+        self._reset_runtime_state()
+        return self._build_observation(reward=0.0, done=False, info={"event": "reset"})
+
+    def _clamp(self, value: float, lo: float, hi: float) -> float:
+        return max(lo, min(hi, value))
+
+    def _sample_service_time(self, cfg: TaskConfig) -> float:
+        service_rng = self._rng("service")
+        if cfg.task_id == "hard":
+            heavy = service_rng.random() < 0.22
+            if heavy:
+                return self._clamp(service_rng.lognormvariate(1.2, 0.7), 1.0, 12.0)
+        return self._clamp(service_rng.expovariate(1.0 / cfg.service_mean), 0.5, 10.0)
+
+    def _sample_arrivals(self, cfg: TaskConfig) -> int:
+        arrival_rng = self._rng("arrivals")
+        exogenous_rng = self._rng("exogenous")
+        rate = cfg.arrival_rate
+        if cfg.task_id == "hard":
+            wave = 0.35 * math.sin((self._sim_time + 1) / 13.0)
+            jitter = exogenous_rng.uniform(-0.05, 0.05)
+            rate += wave + jitter
+        return self._sample_poisson(rate, arrival_rng)
+
+    def _build_arrival_job(self, cfg: TaskConfig, arrival_rng: random.Random) -> dict:
+        priority = 2 if arrival_rng.random() < cfg.urgent_ratio else 1
+        size = self._sample_service_time(cfg)
+        return {
+            "priority": priority,
+            "queue": 0,
+            "created_step": self._state.step_count,
+            "wait": 0.0,
+            "size": size,
+            "remaining": size,
+            "deadline": self._state.step_count + cfg.deadline_base - (1 if priority == 2 else 0),
+            "type": 1 if priority == 2 else 0,
+            "stage": 0,
+        }
+
+    def _promote_next_incoming_job(self) -> None:
+        if self._incoming_job is None and self._incoming_buffer:
+            self._incoming_job = self._incoming_buffer.popleft()
+
+    def _spawn_incoming_job(self, cfg: TaskConfig) -> None:
+        arrivals = self._sample_arrivals(cfg)
+        arrival_rng = self._rng("arrivals")
+        if arrivals > 0:
+            for _ in range(arrivals):
+                self._incoming_buffer.append(self._build_arrival_job(cfg, arrival_rng))
+            self._metrics["arrivals"] += float(arrivals)
+        self._promote_next_incoming_job()
+
+    def _update_wait_and_abandonment(self, cfg: TaskConfig) -> float:
+        abandonment_rng = self._rng("abandonment")
+        abandoned_this_step = 0.0
+        for qi, q in enumerate(self._queues):
+            kept: deque[dict] = deque()
+            while q:
+                job = q.popleft()
+                job["wait"] += 1.0
+                patience = cfg.deadline_base + (2 if job["priority"] == 2 else 4)
+                if cfg.task_id == "hard" and job["wait"] > patience and abandonment_rng.random() < 0.35:
+                    abandoned_this_step += 1.0
+                    continue
+                kept.append(job)
+            self._queues[qi] = kept
+        if abandoned_this_step:
+            self._metrics["abandoned"] += abandoned_this_step
+        return abandoned_this_step
+
+    def _complete_job(self, cfg: TaskConfig, job: dict) -> None:
+        if cfg.two_stage and job["stage"] == 0:
+            forwarded = dict(job)
+            forwarded["stage"] = 1
+            forwarded["queue"] = min(1, len(self._queues) - 1)
+            forwarded["remaining"] = self._sample_service_time(cfg)
+            self._queues[forwarded["queue"]].append(forwarded)
+            return
+
+        self._metrics["completed"] += 1.0
+        wait = float(self._state.step_count - job["created_step"])
+        self._metrics["wait_sum"] += wait
+        self._metrics["wait_count"] += 1.0
+        self._wait_samples_all.append(wait)
+        self._e2e_wait_samples.append(wait)
+        if job["priority"] == 2:
+            self._metrics["completed_urgent"] += 1.0
+            self._metrics["wait_sum_urgent"] += wait
+            self._metrics["wait_count_urgent"] += 1.0
+            self._wait_samples_urgent.append(wait)
+        else:
+            self._metrics["wait_sum_normal"] += wait
+            self._metrics["wait_count_normal"] += 1.0
+            self._wait_samples_normal.append(wait)
+        if self._state.step_count > job["deadline"]:
+            self._metrics["sla_breaches"] += 1.0
+            if job["priority"] == 2:
+                self._metrics["sla_breaches_urgent"] += 1.0
+
+    def _process_servers(self, cfg: TaskConfig) -> float:
+        completed_this_step = 0.0
+        for si, server in enumerate(self._servers):
+            if not server["active"]:
+                continue
+            if server["remaining"] > 0:
+                server["remaining"] = max(0.0, server["remaining"] - 1.0)
+                if server["remaining"] <= 0 and server["job"] is not None:
+                    self._complete_job(cfg, server["job"])
+                    completed_this_step += 1.0
+                    server["job"] = None
+            busy_flag = 1.0 if server["job"] is not None else 0.0
+            if si < len(self._utilization_ema):
+                self._utilization_ema[si] = 0.9 * self._utilization_ema[si] + 0.1 * busy_flag
+        return completed_this_step
+
+    def _admit_job(self, cfg: TaskConfig, queue_idx: int) -> tuple[bool, str]:
+        if self._incoming_job is None:
+            return False, "no_incoming_job"
+        if queue_idx < 0 or queue_idx >= len(self._queues):
+            return False, "invalid_queue"
+        if len(self._queues[queue_idx]) >= cfg.max_queue_size:
+            self._metrics["rejected"] += 1.0
+            self._incoming_job = None
+            self._promote_next_incoming_job()
+            return True, "queue_full_rejected"
+        job = dict(self._incoming_job)
+        job["queue"] = queue_idx
+        self._queues[queue_idx].append(job)
+        self._incoming_job = None
+        self._metrics["accepted"] += 1.0
+        self._promote_next_incoming_job()
+        return True, "admitted"
+
+    def _dispatch(self, queue_idx: int | None) -> tuple[bool, str]:
+        target = 0 if queue_idx is None else queue_idx
+        if target < 0 or target >= len(self._queues):
+            return False, "invalid_dispatch_queue"
+        for server in self._servers:
+            if not server["active"]:
+                continue
+            if server["job"] is None and self._queues[target]:
+                server["job"] = self._queues[target].popleft()
+                server["remaining"] = server["job"]["remaining"]
+                return True, "dispatched"
+        return False, "no_idle_server_or_empty_queue"
+
+    def _autodispatch(self) -> None:
+        for server in self._servers:
+            if not server["active"] or server["job"] is not None:
+                continue
+            for q in self._queues:
+                if q:
+                    server["job"] = q.popleft()
+                    server["remaining"] = server["job"]["remaining"]
+                    break
+
+    def _apply_action(self, action: CloudQueueAction, cfg: TaskConfig) -> tuple[bool, str]:
+        action_type = (action.action_type or "noop").lower()
+
+        if action_type == "configure_task":
+            if action.task_id and action.task_id in self._task_configs:
+                self._pending_task_id = action.task_id
+            if action.seed is not None:
+                self._pending_seed = int(action.seed)
+            return True, "configuration_updated_for_next_reset"
+
+        if self._done:
+            return False, "episode_already_done"
+
+        if action_type == "admit":
+            queue_idx = action.target_queue if action.target_queue is not None else 0
+            return self._admit_job(cfg, queue_idx)
+
+        if action_type == "reject":
+            if self._incoming_job is None:
+                return False, "no_incoming_job"
+            self._incoming_job = None
+            self._metrics["rejected"] += 1.0
+            self._promote_next_incoming_job()
+            return True, "rejected"
+
+        if action_type == "route":
+            queue_idx = action.target_queue if action.target_queue is not None else 0
+            return self._admit_job(cfg, queue_idx)
+
+        if action_type == "dispatch":
+            return self._dispatch(action.target_queue)
+
+        if action_type == "scale":
+            if not cfg.allow_scaling:
+                return False, "scaling_not_supported_for_task"
+            delta = action.scale_delta if action.scale_delta is not None else 0
+            if delta == 0:
+                return True, "no_scale_change"
+            active_count = sum(1 for s in self._servers if s["active"])
+            requested = int(self._clamp(active_count + delta, cfg.min_servers, cfg.max_servers))
+            if requested == active_count:
+                return True, "scale_clamped_no_change"
+            if requested > active_count:
+                for _ in range(requested - active_count):
+                    self._servers.append({"remaining": 0.0, "job": None, "active": True})
+                    self._utilization_ema.append(0.0)
+            else:
+                to_disable = active_count - requested
+                for server in reversed(self._servers):
+                    if to_disable == 0:
+                        break
+                    if server["active"] and server["job"] is None:
+                        server["active"] = False
+                        to_disable -= 1
+            self._metrics["action_cost"] += abs(delta) * 0.35
+            return True, "scaled"
+
+        if action_type == "reprioritize":
+            if not cfg.allow_priority:
+                return False, "reprioritize_not_supported_for_task"
+            new_priority = 2 if (action.new_priority or 1) >= 2 else 1
+            for q in self._queues:
+                for job in q:
+                    if job["priority"] == 1:
+                        job["priority"] = new_priority
+                        return True, "reprioritized"
+            return False, "no_eligible_job"
+
+        if action_type == "noop":
+            return True, "noop"
+
+        return False, "unknown_action_type"
+
+    def _percentile(self, values: list[float], p: float) -> float:
+        if not values:
+            return 0.0
+        ordered = sorted(values)
+        idx = int(self._clamp(round((len(ordered) - 1) * p), 0, len(ordered) - 1))
+        return float(ordered[idx])
+
+    def _safe_div(self, numerator: float, denominator: float) -> float:
+        if denominator <= 0:
+            return 0.0
+        return numerator / denominator
+
+    def _current_fairness_gap(self) -> float:
+        urgent_avg = self._safe_div(self._metrics["wait_sum_urgent"], self._metrics["wait_count_urgent"])
+        normal_avg = self._safe_div(self._metrics["wait_sum_normal"], self._metrics["wait_count_normal"])
+        scale = max(1.0, urgent_avg + normal_avg)
+        return abs(urgent_avg - normal_avg) / scale
+
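The graders lean on the nearest-rank percentile helper above for p95 wait targets. A standalone sketch of the same rule, with the free function name being illustrative:

```python
def percentile(values: list[float], p: float) -> float:
    # Nearest-rank on the sorted sample; an empty sample scores 0.0,
    # matching the grader's safe default for missing metrics.
    if not values:
        return 0.0
    ordered = sorted(values)
    idx = max(0, min(len(ordered) - 1, round((len(ordered) - 1) * p)))
    return float(ordered[idx])

waits = [float(v) for v in range(1, 101)]  # 1.0 .. 100.0
p95 = percentile(waits, 0.95)  # index round(99 * 0.95) = 94 -> 95.0
```

Nearest-rank (rather than linear interpolation) keeps the grader deterministic across Python versions and avoids float drift in scores.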
+    def _compute_reward(
+        self,
+        cfg: TaskConfig,
+        action_ok: bool,
+        action_type: str,
+        action_scale_delta: int,
+        completed_step: float,
+    ) -> tuple[float, dict[str, float]]:
+        avg_wait = self._safe_div(self._metrics["wait_sum"], self._metrics["wait_count"])
+        queue_pressure = sum(len(q) for q in self._queues) / max(1.0, float(cfg.max_queue_size))
+        r_wait = -self._clamp(avg_wait / max(cfg.deadline_base, 1), 0.0, 1.5) - 0.15 * self._clamp(queue_pressure, 0.0, 1.5)
+        r_throughput = self._clamp(completed_step / max(1.0, float(cfg.initial_servers)), 0.0, 1.0)
+        total_decisions = max(1.0, self._metrics["completed"] + self._metrics["abandoned"])
+        r_sla = -self._clamp(self._metrics["sla_breaches"] / total_decisions, 0.0, 1.0)
+        active_servers = sum(1 for s in self._servers if s["active"])
+        r_cost = -self._clamp(active_servers / max(1.0, float(cfg.max_servers)), 0.0, 1.0)
+        fairness_gap = self._current_fairness_gap()
+        r_fair = -self._clamp(fairness_gap / 0.5, 0.0, 1.0)
+        r_safe = 0.0 if action_ok else -1.0
+        if not action_ok:
+            self._metrics["invalid_actions"] += 1.0
+        if action_type == "noop" and self._incoming_job is not None and sum(len(q) for q in self._queues) > 0:
+            r_safe -= 0.05
+            self._metrics["noop_under_load"] += 1.0
+
+        arrivals = max(1.0, self._metrics["arrivals"])
+        rejection_rate = self._safe_div(self._metrics["rejected"], arrivals)
+        if arrivals > 10 and rejection_rate > 0.4:
+            r_safe -= self._clamp((rejection_rate - 0.4) * 0.4, 0.0, 0.2)
+
+        if action_type == "scale" and action_scale_delta < 0 and queue_pressure > 0.45:
+            overload_penalty = self._clamp((queue_pressure - 0.45) * 0.5, 0.0, 0.25)
+            r_safe -= overload_penalty
+            self._metrics["harmful_scale_down"] += 1.0
+
+        reward = 0.35 * r_wait + 0.20 * r_throughput + 0.20 * r_sla + 0.15 * r_cost + 0.05 * r_fair + 0.05 * r_safe
+        reward = self._clamp(reward, -1.0, 1.0)
+        self._recent_rewards.append(reward)
+
+        self._metrics["infra_cost"] += active_servers * cfg.server_cost
+        self._metrics["fairness_gap_sum"] += fairness_gap
+        self._metrics["fairness_gap_count"] += 1.0
+
+        components = {
+            "wait": round(r_wait, 4),
+            "throughput": round(r_throughput, 4),
+            "sla": round(r_sla, 4),
+            "cost": round(r_cost, 4),
+            "fairness": round(r_fair, 4),
+            "safety": round(r_safe, 4),
+        }
+        return reward, components
+
+    def _score_task(self, cfg: TaskConfig) -> tuple[float, dict[str, float]]:
+        # c01: clamp individual sub-score components to [0, 1] inclusive.
+        def c01(value: float) -> float:
+            if not math.isfinite(value):
+                return 0.0
+            return self._clamp(value, 0.0, 1.0)
+
+        # strict01: final clamp applied only to the episode score.
+        # Validator requires score strictly in (0, 1) — never 0.0 or 1.0.
+        _SCORE_MIN = 0.001
+        _SCORE_MAX = 0.999
+
+        def strict01(value: float) -> float:
+            if not math.isfinite(value):
+                return _SCORE_MIN
+            return self._clamp(value, _SCORE_MIN, _SCORE_MAX)
+
+        completed = self._metrics["completed"]
+        arrivals = self._metrics["arrivals"]
+        rejected = self._metrics["rejected"]
+        avg_wait = self._safe_div(self._metrics["wait_sum"], self._metrics["wait_count"])
+        rejection_rate = self._safe_div(rejected, arrivals)
+        sla_rate = self._safe_div(self._metrics["sla_breaches"], max(1.0, completed))
+        throughput = completed
+        fairness_gap = self._safe_div(self._metrics["fairness_gap_sum"], self._metrics["fairness_gap_count"])
+
+        if cfg.task_id == "easy":
+            score_wait = c01(1.0 - avg_wait / cfg.score_refs["wait"])
+            score_thr = c01(throughput / cfg.score_refs["thr"])
+            score_rej = c01(1.0 - rejection_rate / cfg.score_refs["rej"])
+            score_sla = c01(1.0 - sla_rate / cfg.score_refs["sla"])
+            score = 0.4 * score_wait + 0.3 * score_thr + 0.15 * score_rej + 0.15 * score_sla
+            details = {
+                "score_wait": round(score_wait, 4),
+                "score_throughput": round(score_thr, 4),
+                "score_rejection": round(score_rej, 4),
+                "score_sla": round(score_sla, 4),
+            }
+        elif cfg.task_id == "medium":
+            p95_u = self._percentile(self._wait_samples_urgent, 0.95)
+            p95_n = self._percentile(self._wait_samples_normal, 0.95)
+            urgent_sla = self._safe_div(self._metrics["sla_breaches_urgent"], max(1.0, self._metrics["completed_urgent"]))
+            s_uw = c01(1.0 - p95_u / cfg.score_refs["uw"])
+            s_nw = c01(1.0 - p95_n / cfg.score_refs["nw"])
+            s_usla = c01(1.0 - urgent_sla / cfg.score_refs["usla"])
+            s_thr = c01(throughput / cfg.score_refs["thr"])
+            s_cost = c01(1.0 - self._metrics["action_cost"] / cfg.score_refs["cost"])
+            score = 0.35 * s_uw + 0.15 * s_nw + 0.25 * s_usla + 0.15 * s_thr + 0.10 * s_cost
+            details = {
+                "score_urgent_wait": round(s_uw, 4),
+                "score_normal_wait": round(s_nw, 4),
+                "score_urgent_sla": round(s_usla, 4),
+                "score_throughput": round(s_thr, 4),
+                "score_cost": round(s_cost, 4),
+            }
+        else:
+            e2e_p95 = self._percentile(self._e2e_wait_samples, 0.95)
+            abd_rate = self._safe_div(self._metrics["abandoned"], arrivals)
+            s_e2e = c01(1.0 - e2e_p95 / cfg.score_refs["e2e"])
+            s_abd = c01(1.0 - abd_rate / cfg.score_refs["abd"])
+            s_sla = c01(1.0 - sla_rate / cfg.score_refs["sla"])
+            s_thr = c01(throughput / cfg.score_refs["thr"])
+            s_cost = c01(1.0 - self._metrics["infra_cost"] / cfg.score_refs["cost"])
+            s_fair = c01(1.0 - fairness_gap / cfg.score_refs["fair"])
+            score = 0.25 * s_e2e + 0.20 * s_abd + 0.20 * s_sla + 0.15 * s_thr + 0.10 * s_cost + 0.10 * s_fair
+            details = {
+                "score_e2e_p95": round(s_e2e, 4),
+                "score_abandonment": round(s_abd, 4),
+                "score_sla": round(s_sla, 4),
+                "score_throughput": round(s_thr, 4),
+                "score_cost": round(s_cost, 4),
+                "score_fairness": round(s_fair, 4),
+            }
+
+        if self._metrics["invalid_actions"] > max(3.0, 0.04 * cfg.horizon):
+            score = min(score, 0.4)
+        # Apply strict open-interval clamp: validator rejects 0.0 and 1.0.
+        return strict01(score), details
+
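The grader uses two distinct clamps: sub-scores live in the closed interval [0, 1], while the final episode score is squeezed into [0.001, 0.999] so the validator never sees an exact 0.0 or 1.0. A standalone sketch of both helpers as free functions (names mirror the nested helpers above):

```python
import math

SCORE_MIN, SCORE_MAX = 0.001, 0.999  # open-interval proxy for the final score

def c01(value: float) -> float:
    # Sub-score clamp: non-finite inputs score zero, else clamp to [0, 1].
    if not math.isfinite(value):
        return 0.0
    return max(0.0, min(1.0, value))

def strict01(value: float) -> float:
    # Episode-score clamp: never emits exactly 0.0 or 1.0, and maps
    # NaN/Inf to the floor so broken metrics cannot inflate a score.
    if not math.isfinite(value):
        return SCORE_MIN
    return max(SCORE_MIN, min(SCORE_MAX, value))
```

Routing non-finite values to the floor rather than raising keeps the grader deterministic even when a task produces degenerate metrics.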
+    def _compute_action_mask(self, cfg: TaskConfig) -> list[int]:
+        """Compute which of the 8 actions are valid right now.
+
+        Slot order (matches CloudQueueAction.action_type):
+        0: configure_task — always valid (meta, sets next task/seed)
+        1: admit — only if an incoming job is waiting
+        2: reject — only if an incoming job is waiting
+        3: route — only if an incoming job is waiting
+        4: dispatch — only if an idle+active server AND a non-empty queue exist
+        5: scale — only if cfg.allow_scaling is True
+        6: reprioritize — only if cfg.allow_priority AND a normal-priority job is queued
+        7: noop — always valid
+        """
+        has_incoming = self._incoming_job is not None
+        has_idle_server = any(
+            s["active"] and s["job"] is None for s in self._servers
+        )
+        has_queued_job = any(len(q) > 0 for q in self._queues)
+        can_dispatch = 1 if (has_idle_server and has_queued_job) else 0
+
+        can_reprioritize = 0
+        if cfg.allow_priority:
+            can_reprioritize = 1 if any(
+                job["priority"] == 1 for q in self._queues for job in q
+            ) else 0
+
+        return [
+            1,  # 0: configure_task
+            1 if has_incoming else 0,  # 1: admit
+            1 if has_incoming else 0,  # 2: reject
+            1 if has_incoming else 0,  # 3: route
+            can_dispatch,  # 4: dispatch
+            1 if cfg.allow_scaling else 0,  # 5: scale
+            can_reprioritize,  # 6: reprioritize
+            1,  # 7: noop
+        ]
+
+    def _build_observation(self, reward: float, done: bool, info: dict) -> CloudQueueObservation:
+        cfg = self._task_configs[self._active_task_id]
+        queue_lengths = [len(q) for q in self._queues]
+        for i, q in enumerate(self._queues):
+            current_mean_wait = 0.0
+            if q:
+                current_mean_wait = sum(job["wait"] for job in q) / len(q)
+            self._wait_ema[i] = 0.8 * self._wait_ema[i] + 0.2 * current_mean_wait
+
+        active_servers = max(1, sum(1 for s in self._servers if s["active"]))
+        completed = max(1.0, self._metrics["completed"])
+        sla_violation_rate = self._safe_div(self._metrics["sla_breaches"], completed)
+        abandonment_rate = self._safe_div(self._metrics["abandoned"], max(1.0, self._metrics["arrivals"]))
+        throughput_recent = max(0.0, info.get("completed_this_step", 0.0))
+        energy_cost_rate = active_servers * cfg.server_cost
+
+        incoming = self._incoming_job
+        incoming_present = incoming is not None
+        incoming_size = float(incoming["size"]) if incoming_present else 0.0
+        incoming_priority = int(incoming["priority"]) if incoming_present else 0
+        incoming_deadline = float(incoming["deadline"]) if incoming_present else 0.0
+        incoming_type = int(incoming["type"]) if incoming_present else 0
+
+        score, score_details = (0.0, {})
+        if done:
+            score, score_details = self._score_task(cfg)
+
+        metadata = {
+            "info": info,
+            "reward_components": info.get("reward_components", {}),
+            "applied_action": info.get("applied_action", "noop"),
+            "seed": int(self._pending_seed),
+            "trace_digest": self._trace_digest(),
+            "rng_stream_seeds": self._rng_stream_seeds,
+            "metrics": {
+                "arrivals": self._metrics["arrivals"],
+                "accepted": self._metrics["accepted"],
+                "rejected": self._metrics["rejected"],
+                "completed": self._metrics["completed"],
+                "abandoned": self._metrics["abandoned"],
+                "invalid_actions": self._metrics["invalid_actions"],
+                "harmful_scale_down": self._metrics["harmful_scale_down"],
+                "infra_cost": round(self._metrics["infra_cost"], 4),
+                "pending_incoming_jobs": float(len(self._incoming_buffer) + (1 if self._incoming_job else 0)),
+            },
+            "episode_score": round(score, 4),
+            "score_details": score_details,
+        }
+
+        return CloudQueueObservation(
+            task_id=cfg.task_id,
+            sim_time=self._sim_time,
+            horizon=cfg.horizon,
+            queue_lengths=queue_lengths,
+            queue_wait_ema=[round(v, 3) for v in self._wait_ema],
+            server_busy=[1 if s["job"] is not None and s["active"] else 0 for s in self._servers],
+            server_remaining_service=[round(float(s["remaining"]), 3) for s in self._servers],
+            utilization=[round(v, 3) for v in self._utilization_ema[: len(self._servers)]],
+            incoming_job_present=incoming_present,
+            incoming_job_size=round(incoming_size, 3),
+            incoming_job_priority=incoming_priority,
+            incoming_job_deadline=round(incoming_deadline, 3),
+            incoming_job_type=incoming_type,
+            sla_violation_rate=round(sla_violation_rate, 4),
+            abandonment_rate=round(abandonment_rate, 4),
+            throughput_recent=round(throughput_recent, 4),
+            energy_cost_rate=round(energy_cost_rate, 4),
+            level=cfg.level,
+            optional_history=[round(v, 4) for v in list(self._recent_rewards)],
+            action_mask=self._compute_action_mask(cfg),
+            done=done,
+            reward=round(reward, 6),
+            metadata=metadata,
+        )
+
+ def step(self, action: CloudQueueAction) -> CloudQueueObservation: # type: ignore[override]
713
+ cfg = self._task_configs[self._active_task_id]
714
+
715
+ if (action.action_type or "").lower() == "configure_task":
716
+ ok, note = self._apply_action(action, cfg)
717
+ info = {
718
+ "event": "configure_task",
719
+ "applied_action": action.action_type,
720
+ "valid_action": ok,
721
+ "note": note,
722
+ "completed_this_step": 0.0,
723
+ "debug_trace_id": self._trace_digest(),
724
+ }
725
+ return self._build_observation(reward=0.0, done=self._done, info=info)
726
+
727
+ if self._done:
728
+ info = {
729
+ "event": "episode_done",
730
+ "applied_action": action.action_type,
731
+ "valid_action": False,
732
+ "note": "call reset() to start a new episode",
733
+ "completed_this_step": 0.0,
734
+ "reward_components": {},
735
+ "debug_trace_id": self._trace_digest(),
736
+ }
737
+ return self._build_observation(reward=0.0, done=True, info=info)
738
+
739
+ self._state.step_count += 1
740
+ self._sim_time += 1
741
+
742
+ completed_this_step = self._process_servers(cfg)
743
+ abandoned_this_step = self._update_wait_and_abandonment(cfg)
744
+ self._spawn_incoming_job(cfg)
745
+
746
+ action_ok, action_note = self._apply_action(action, cfg)
747
+ action_key = (
748
+ f"{(action.action_type or 'noop').lower()}|"
749
+ f"q={action.target_queue}|s={action.target_server}|"
750
+ f"d={action.scale_delta}|p={action.new_priority}"
751
+ )
752
+ self._action_trace.append(action_key)
753
+ autodispatch_applied = False
754
+ if self.ASSISTED_AUTODISPATCH:
755
+ self._autodispatch()
756
+ autodispatch_applied = True
757
+ reward, reward_components = self._compute_reward(
758
+ cfg,
759
+ action_ok=action_ok,
760
+ action_type=(action.action_type or "noop").lower(),
761
+ action_scale_delta=int(action.scale_delta or 0),
762
+ completed_step=completed_this_step,
763
+ )
764
+
765
+ self._done = self._state.step_count >= cfg.horizon
766
+ info = {
767
+ "event": "step",
768
+ "applied_action": action.action_type,
769
+ "valid_action": action_ok,
770
+ "note": action_note,
771
+ "completed_this_step": completed_this_step,
772
+ "abandoned_this_step": abandoned_this_step,
773
+ "autodispatch_applied": autodispatch_applied,
774
+ "reward_components": reward_components,
775
+ "debug_trace_id": self._trace_digest(),
776
+ }
777
+ return self._build_observation(reward=reward, done=self._done, info=info)
778
+
779
+ @property
780
+ def state(self) -> State:
781
+ return self._state
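
The `step()` method above encodes each action into a pipe-separated `action_key` and appends it to `self._action_trace`, which `_trace_digest()` (defined elsewhere in the class) condenses into a reproducibility identifier surfaced as `trace_digest` / `debug_trace_id` in the observation metadata. A minimal standalone sketch of that idea follows; the sha256-based digest is a hypothetical stand-in, since the real `_trace_digest()` implementation is not shown in this diff:

```python
import hashlib


def encode_action(action_type=None, target_queue=None, target_server=None,
                  scale_delta=None, new_priority=None):
    # Mirrors the action_key encoding in step(): lowercase action type,
    # then queue/server targets, scale delta, and priority, pipe-separated.
    return (
        f"{(action_type or 'noop').lower()}|"
        f"q={target_queue}|s={target_server}|"
        f"d={scale_delta}|p={new_priority}"
    )


def trace_digest(action_trace):
    # Hypothetical stand-in for _trace_digest(): a stable hash over the
    # ordered action trace, so identical seeds plus identical action
    # sequences always map to the same identifier.
    h = hashlib.sha256("\n".join(action_trace).encode("utf-8"))
    return h.hexdigest()[:16]


trace = [
    encode_action("dispatch", target_queue=0, target_server=1),
    encode_action("scale", scale_delta=1),
    encode_action(),  # noop
]
print(trace[0])  # dispatch|q=0|s=1|d=None|p=None
print(trace_digest(trace) == trace_digest(list(trace)))  # True
```

Because the digest depends only on the ordered action keys, two runs that replay the same action trace yield the same `debug_trace_id`, which is what makes the roadmap's "same seeds and same actions always produce the same score" exit criterion checkable.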