Cloud Queue Env - High Severity Analysis (Updated)

Date: 2026-04-12

This note captures the two highest-impact issues still present in the environment logic.

1) Arrival Modeling and Arrival Metrics Mismatch

Files and lines:

  • cloud_queue_env/server/cloud_queue_env_environment.py:240
  • cloud_queue_env/server/cloud_queue_env_environment.py:241
  • cloud_queue_env/server/cloud_queue_env_environment.py:248
  • cloud_queue_env/server/cloud_queue_env_environment.py:259

What happens now:

  • The simulator samples Poisson arrivals each step.
  • If sampled arrivals are greater than 1, the code still creates only one incoming job object.
  • The arrivals metric is incremented by 1.0 rather than by the sampled arrival count.

Why this is high severity:

  • Burst behavior is compressed into a single-event stream, so load spikes are underrepresented.
  • Several business metrics and grader components become biased (rejections, abandonment, SLA pressure).
  • Policy ranking can drift because the environment under-penalizes burst scenarios.
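
To put a rough number on the underrepresentation: if at most one arrival is recorded whenever the Poisson sample is non-zero (as described above), the expected recorded count per step is 1 - exp(-rate), while the true expected count is the rate itself. The sketch below is illustrative only; the function names are hypothetical and do not come from the environment code.

```python
import math


def recorded_arrivals_per_step(rate: float) -> float:
    """Expected arrivals recorded per step when each step counts at most one.

    Assumes one arrival is recorded whenever the Poisson sample is >= 1,
    so the expectation equals P(N >= 1) = 1 - exp(-rate).
    """
    if rate < 0:
        raise ValueError("rate must be non-negative")
    return 1.0 - math.exp(-rate)


def missing_fraction(rate: float) -> float:
    """Fraction of true arrivals that never reach the queue or the metric."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return 1.0 - recorded_arrivals_per_step(rate) / rate
```

Under these assumptions, at a rate of 2 arrivals per step roughly 0.86 arrivals are recorded per step, so more than half of the sampled load is dropped. At low rates the gap is small, which is why the bias concentrates in burst scenarios.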

Impact on benchmark credibility:

  • High. This directly affects realism, fairness of grading, and reproducibility claims.

Recommended fix direction:

  • Track all sampled arrivals each step.
  • Either queue all arrivals or maintain an explicit backlog of pending incoming jobs.
  • Increment the arrivals metric by the true sampled count.
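
The fix direction above could look roughly like the following. This is a minimal sketch, not the environment's actual code: `ArrivalTracker`, `sample_poisson`, and all field names are hypothetical, jobs are reduced to integer IDs, and the Poisson sampler is a stdlib-only stand-in (Knuth's multiplication method) for whatever sampler the simulator really uses.

```python
import math
import random
from collections import deque
from dataclasses import dataclass, field


def sample_poisson(rng: random.Random, rate: float) -> int:
    """Draw one Poisson sample using Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


@dataclass
class ArrivalTracker:
    """Hypothetical helper: keeps every sampled arrival in an explicit backlog."""
    rate: float
    rng: random.Random
    pending: deque = field(default_factory=deque)  # backlog of incoming job IDs
    arrivals_total: float = 0.0                    # metric: true sampled count
    next_job_id: int = 0

    def step(self) -> int:
        """Sample arrivals for one step, enqueue all of them, update the metric."""
        sampled = sample_poisson(self.rng, self.rate)
        for _ in range(sampled):
            self.pending.append(self.next_job_id)
            self.next_job_id += 1
        self.arrivals_total += float(sampled)
        return sampled
```

With this shape, the arrivals metric and the backlog length stay consistent with the sampled counts by construction, and how many pending jobs are admitted per step becomes a separate, explicit decision.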

2) Agent Dispatch Control Is Partially Bypassed by Autodispatch

Files and lines:

  • cloud_queue_env/server/cloud_queue_env_environment.py:353
  • cloud_queue_env/server/cloud_queue_env_environment.py:391
  • cloud_queue_env/server/cloud_queue_env_environment.py:738

What happens now:

  • The agent may choose an action that is not dispatch.
  • After action application, the environment still runs autodispatch and moves work to idle servers.

Why this is high severity:

  • It weakens action-to-outcome causality for dispatch decisions.
  • A policy can look better than it should because server assignment still happens automatically.
  • It reduces benchmark difficulty in exactly the control surface the task is evaluating.

Impact on benchmark credibility:

  • High. This can alter policy comparisons and invalidate assumptions about explicit control.

Recommended fix direction:

  • Make dispatch behavior explicit by mode:
    • strict-control mode: only agent dispatches.
    • assisted mode: autodispatch on, but document this clearly and score accordingly.
  • Keep one consistent mode for official benchmark scoring.
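
One way to make the two modes explicit is sketched below. Everything here is an assumption for illustration: `DispatchMode`, `resolve_dispatch`, and the representation of jobs and servers as plain strings are hypothetical and not taken from the environment code.

```python
from enum import Enum


class DispatchMode(Enum):
    STRICT = "strict"      # only the agent's dispatch decisions are applied
    ASSISTED = "assisted"  # leftover queued jobs are auto-assigned to idle servers


def resolve_dispatch(mode, agent_assignments, queue, idle_servers):
    """Return (job, server, source) triples for one step.

    agent_assignments: (job, server) pairs chosen by the agent.
    queue: waiting jobs in FIFO order; idle_servers: currently idle servers.
    """
    waiting = list(queue)
    idle = list(idle_servers)
    assignments = []

    # Apply the agent's choices first, ignoring any that are no longer valid.
    for job, server in agent_assignments:
        if job in waiting and server in idle:
            waiting.remove(job)
            idle.remove(server)
            assignments.append((job, server, "agent"))

    # Autodispatch runs only in assisted mode and is tagged so scoring can see it.
    if mode is DispatchMode.ASSISTED:
        while waiting and idle:
            assignments.append((waiting.pop(0), idle.pop(0), "auto"))

    return assignments
```

Tagging each assignment with its source keeps assisted-mode results auditable: a grader can report how much of a policy's throughput came from automatic assignment rather than from the agent's own decisions.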

Priority Summary

  1. Fix arrival accounting and multi-arrival handling first.
  2. Fix dispatch authority semantics second.

Both should be addressed before claiming benchmark-grade reliability.