SidhaGarg/Cloud-DevOps-RLEnv


SidhaGarg committed on 17 days ago

Commit

d35c04a

1 Parent(s): ec24749

Add action-cost RL dynamics, metadata lookup, and hard-mode cascading failures


Files changed (4)

README.md +26 -12
inference.py +2 -1
models.py +4 -2
server/cloud_devops_env_environment.py +99 -9

README.md CHANGED

@@ -25,6 +25,16 @@ Real incidents are multi-step and noisy. Good agents must:
 Cloud DevOps RLEnv simulates that behavior with realistic failure patterns, decoy resources, shaped rewards, and anti-shortcut guardrails.
 ## Environment Scope
 - Domain: Cloud SRE / DevOps incident response
@@ -55,7 +65,7 @@ Model: CloudAction
 | Field | Type | Required | Description |
 | --- | --- | --- | --- |
-| command | enum | yes | One of: list_resources, describe_resource, view_logs, update_security_group, restart_service, submit_solution |
 | resource_id | string | conditional | Required for most actions except list_resources |
 | parameters | object | conditional | Used by mutating actions (for example, security-group updates) |
@@ -63,6 +73,7 @@ Action semantics:
 - list_resources: Enumerates available resources including decoys.
 - describe_resource: Returns structured details for one resource.
 - view_logs: Returns logs for one resource.
 - update_security_group: Appends a rule (requires parameters.port).
 - restart_service: Restarts one instance/service by ID.
 - submit_solution: Declares the episode solved (or not solved).
@@ -95,6 +106,7 @@ Reward shaping is sparse-but-guided:
 - discovery rewards for correct investigative steps
 - larger terminal rewards for correct remediation
 - penalties for unsafe or premature operations
 - timeout terminal condition after max steps
 Per-step reward is clipped to [-1.0, 1.0].
@@ -116,8 +128,8 @@ Typical successful sequence:
 3. update_security_group(sg-web, port=80, action=allow) (+0.8, done)
 Expected score:
-- 1.0 for full playbook
-- 0.8 if agent skips the optional read step
 ### Medium Task
@@ -134,33 +146,35 @@ Typical successful sequence:
 4. update_security_group(sg-db, port=5432, action=allow) (+0.6, done if logs were inspected)
 Guardrail:
-- Applying the SG change before log triage gives a penalty (-0.1) and does not close the incident.
 Expected score:
-- 1.0 with full investigative path
-- 0.8 if SG describe step is skipped but log triage is done
 ### Hard Task
 Incident:
-- Checkout path degraded due to upstream timeout to i-web2.
 Objective:
-- Trace LB errors to the correct target and restart i-web2 only after diagnosis.
 Typical successful sequence:
 1. list_resources
-2. view_logs(lb-main) to identify failing upstream i-web2 (+0.2)
-3. describe_resource(i-web2) or view_logs(i-web2) (+0.2)
-4. restart_service(i-web2) (+0.8, done when both investigation achievements exist)
 Guardrails:
 - Restarting i-web2 before investigation: penalty (-0.1), no resolution.
 - Restarting healthy i-web1: penalty (-0.2).
 - Premature submit_solution in hard mode: penalty (-0.1), episode continues.
 Expected score:
-- 1.0 after score clamping for the full correct path
 ## API Endpoints

 Cloud DevOps RLEnv simulates that behavior with realistic failure patterns, decoy resources, shaped rewards, and anti-shortcut guardrails.
+## Why It's Hard
+This benchmark is intentionally designed to resist brute-force policies and reward disciplined SRE reasoning:
+- Needle-in-a-haystack discovery: 20+ decoy compute nodes and 20+ decoy security groups increase search complexity.
+- Ambiguous telemetry: noisy, raw operational logs surface symptoms (including IP-only clues) rather than direct root-cause labels.
+- Action-penalty heuristics: every action has a small negative cost, so efficient remediation beats command spamming.
+- Multi-hop dependency resolution: agents must map IP addresses to resource IDs via metadata lookup before applying fixes.
+- System drift under pressure: in hard mode, delayed remediation triggers cascading failures that worsen observability and reward dynamics.
 ## Environment Scope
 - Domain: Cloud SRE / DevOps incident response
 | Field | Type | Required | Description |
 | --- | --- | --- | --- |
+| command | enum | yes | One of: list_resources, describe_resource, view_logs, query_metadata, update_security_group, restart_service, submit_solution |
 | resource_id | string | conditional | Required for most actions except list_resources |
 | parameters | object | conditional | Used by mutating actions (for example, security-group updates) |
 - list_resources: Enumerates available resources including decoys.
 - describe_resource: Returns structured details for one resource.
 - view_logs: Returns logs for one resource.
+- query_metadata: Resolves infrastructure metadata (for example, IP address to resource ID).
 - update_security_group: Appends a rule (requires parameters.port).
 - restart_service: Restarts one instance/service by ID.
 - submit_solution: Declares the episode solved (or not solved).
 - discovery rewards for correct investigative steps
 - larger terminal rewards for correct remediation
 - penalties for unsafe or premature operations
+- fixed action cost per step (efficiency pressure)
 - timeout terminal condition after max steps
 Per-step reward is clipped to [-1.0, 1.0].
 3. update_security_group(sg-web, port=80, action=allow) (+0.8, done)
 Expected score:
+- ~0.97 for full playbook with efficient triage
+- ~0.79 if agent skips the optional read step
 ### Medium Task
 4. update_security_group(sg-db, port=5432, action=allow) (+0.6, done if logs were inspected)
 Guardrail:
+- Applying the SG change before log triage + metadata lookup gives a penalty (-0.1) and does not close the incident.
 Expected score:
+- ~0.97 with full investigative path (logs -> metadata lookup -> remediation)
+- below ~0.90 when metadata dependency is skipped
 ### Hard Task
 Incident:
+- Checkout path degraded due to upstream timeout to an IP-only target that must be resolved first.
 Objective:
+- Trace LB errors to the correct target, resolve resource identity via metadata, and restart i-web2 only after diagnosis.
 Typical successful sequence:
 1. list_resources
+2. view_logs(lb-main) to identify failing upstream IP (+0.2)
+3. query_metadata(ip_address=<failing_ip>) to resolve target ID (+0.2)
+4. describe_resource(i-web2) or view_logs(i-web2) (+0.2)
+5. restart_service(i-web2) (+0.8, done when all investigation achievements exist)
 Guardrails:
 - Restarting i-web2 before investigation: penalty (-0.1), no resolution.
 - Restarting healthy i-web1: penalty (-0.2).
 - Premature submit_solution in hard mode: penalty (-0.1), episode continues.
+- If unresolved after step 8 in hard mode, lb-external also fails (cascading failure), increasing pressure and noise.
 Expected score:
+- near 1.0 after score clamping for strong trajectories (can exceed 1.0 raw before clamp)
 ## API Endpoints

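Reviewer note: the revised expected scores in the README follow from the new per-step cost. The sketch below reproduces that arithmetic under two assumptions not confirmed by this diff: the episode score is the sum of per-step rewards, and it is clamped to [0, 1]. `episode_score` is a hypothetical helper, not part of the repository.

```python
ACTION_COST = 0.01  # fixed per-step cost added in this commit


def episode_score(shaped_total: float, num_steps: int) -> tuple[float, float]:
    """Return (raw, clamped) score for an episode.

    shaped_total: sum of discovery and remediation rewards before action cost.
    num_steps:    number of actions taken, each charged ACTION_COST.
    The [0, 1] clamp is an assumption; the README only says the score is clamped.
    """
    raw = shaped_total - ACTION_COST * num_steps
    clamped = max(0.0, min(1.0, raw))
    return raw, clamped


# Easy task: three steps whose shaped rewards previously summed to 1.0.
easy_raw, easy_score = episode_score(1.0, 3)  # about 0.97

# Hard task: list_resources (assumed to earn 0) plus +0.2, +0.2, +0.2, +0.8.
hard_raw, hard_score = episode_score(0.2 + 0.2 + 0.2 + 0.8, 5)  # raw about 1.35
```

Under these assumptions the easy playbook lands at roughly 0.97, and the hard playbook exceeds 1.0 before clamping, which matches the README's wording.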
inference.py CHANGED

@@ -63,10 +63,11 @@ def get_model_action(
         "You are an expert AI DevOps Engineer diagnosing a cloud infrastructure issue. "
         "You must respond ONLY with a raw JSON object matching this schema:\n"
         "{\n"
-        '  "command": "list_resources" | "describe_resource" | "view_logs" | "update_security_group" | "restart_service" | "submit_solution",\n'
         '  "resource_id": "string (optional)",\n'
         '  "parameters": {"key": "value"} (optional)\n'
         "}\n"
         "Do not include markdown blocks like ```json. Just output the JSON."
     )

         "You are an expert AI DevOps Engineer diagnosing a cloud infrastructure issue. "
         "You must respond ONLY with a raw JSON object matching this schema:\n"
         "{\n"
+        '  "command": "list_resources" | "describe_resource" | "view_logs" | "query_metadata" | "update_security_group" | "restart_service" | "submit_solution",\n'
         '  "resource_id": "string (optional)",\n'
         '  "parameters": {"key": "value"} (optional)\n'
         "}\n"
+        "When logs provide only IP addresses, use query_metadata with parameters.ip_address to resolve the resource_id before remediation.\n"
         "Do not include markdown blocks like ```json. Just output the JSON."
     )

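Reviewer note: as an illustration of the updated prompt, a model reply for the hard task might look like the sketch below. The IP address is the upstream shown in the lb-main logs in this commit; the variable names are hypothetical.

```python
import json

# Hypothetical agent reply: resolve the IP seen in lb-main logs to a resource ID.
# resource_id is omitted because the schema marks it optional and
# query_metadata reads parameters.ip_address.
action = {
    "command": "query_metadata",
    "parameters": {"ip_address": "10.0.8.22"},
}

# The prompt asks for raw JSON with no markdown fences around it.
payload = json.dumps(action)
```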
models.py CHANGED

@@ -24,6 +24,7 @@ class CloudAction(Action):
         "list_resources",
         "describe_resource",
         "view_logs",
         "update_security_group",
         "restart_service",
         "submit_solution",
@@ -32,14 +33,15 @@ class CloudAction(Action):
         default=None,
         description=(
             "The ID of the target resource (e.g., 'i-12345'). "
-            "Required for all commands except list_resources."
         ),
     )
     parameters: Optional[Dict[str, Any]] = Field(
         default=None,
         description=(
             "Key-value pairs for updates "
-            "(e.g., {'port': '80', 'action': 'allow'} for update_security_group)."
         ),
     )
     message: Optional[str] = Field(

         "list_resources",
         "describe_resource",
         "view_logs",
+        "query_metadata",
         "update_security_group",
         "restart_service",
         "submit_solution",
         default=None,
         description=(
             "The ID of the target resource (e.g., 'i-12345'). "
+            "Required for most commands except list_resources and query_metadata."
         ),
     )
     parameters: Optional[Dict[str, Any]] = Field(
         default=None,
         description=(
             "Key-value pairs for updates "
+            "(e.g., {'port': '80', 'action': 'allow'} for update_security_group, "
+            "or {'ip_address': '10.0.4.5'} for query_metadata)."
         ),
     )
     message: Optional[str] = Field(

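Reviewer note: the hard-mode cascade added in the server change below combines with the action cost. The sketch is a standalone approximation of the per-step reward for an unresolved hard episode that earns no discovery rewards; `idle_step_rewards` is a hypothetical helper, and the constants are copied from the diff.

```python
ACTION_COST = 0.01              # per-step cost (ACTION_COST in the environment)
CASCADE_AFTER_STEP = 8          # cascade applies once step_count exceeds 8
FIRST_CASCADE_PENALTY = -0.05   # charged on the step that takes lb-external DOWN
REPEAT_CASCADE_PENALTY = -0.03  # charged on every later unresolved step


def idle_step_rewards(num_steps: int) -> list[float]:
    """Per-step rewards for an unresolved hard episode with no other rewards."""
    rewards = []
    lb_external_down = False
    for step in range(1, num_steps + 1):
        reward = -ACTION_COST
        if step > CASCADE_AFTER_STEP:
            if not lb_external_down:
                lb_external_down = True
                reward += FIRST_CASCADE_PENALTY
            else:
                reward += REPEAT_CASCADE_PENALTY
        rewards.append(round(reward, 2))
    return rewards


rewards = idle_step_rewards(10)
```

In this sketch the first eight steps cost 0.01 each, step 9 costs 0.06, and each later step costs 0.04, so delaying remediation past step 8 makes every remaining action several times more expensive.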
server/cloud_devops_env_environment.py CHANGED

@@ -50,6 +50,7 @@ class CloudDevopsEnvironment(Environment):
     SUPPORTS_CONCURRENT_SESSIONS: bool = True
     MAX_STEPS: int = 20
     VALID_TASKS = {"easy", "medium", "hard"}
     def __init__(self, task_name: str = "easy"):
         """Initialize the cloud_devops_env environment."""
@@ -100,20 +101,29 @@ class CloudDevopsEnvironment(Environment):
                 {
                 "i-api": {
                     "type": "Instance",
                     "status": "running",
                     "logs": (
                         "[2026-04-06 17:01:22] [CRITICAL] "
                         "sqlalchemy.exc.OperationalError: "
                         "(psycopg2.OperationalError) connection to server at "
-                        "'10.0.4.5' (i-db), port 5432 failed: Connection timed out. "
                         "Is the server running and accepting TCP/IP connections?"
                     ),
                 },
-                "i-db": {"type": "Instance", "status": "running"},
                 "sg-db": {
                     "type": "SecurityGroup",
                     "rules": [{"port": 22, "action": "allow"}],
                 },
                 }
             )
             return resources
@@ -126,13 +136,19 @@ class CloudDevopsEnvironment(Environment):
                     "2026/04/06 17:02:09 [error] 3197#3197: *4189 upstream timed out "
                     "(110: Connection timed out) while reading response header from upstream, "
                     "client: 10.0.2.14, server: api.prod.local, request: \"GET /checkout HTTP/1.1\", "
-                    "upstream: \"http://i-web2:8080/checkout\", host: \"api.prod.local\"\n"
                     "2026/04/06 17:02:10 [error] 3197#3197: *4190 no live upstreams while "
-                    "connecting to upstream \"i-web2\""
                 ),
             },
             "i-web1": {
                 "type": "Instance",
                 "status": "running",
                 "logs": (
                     "[2026-04-06 17:02:11] INFO web-service: readiness probe passed\n"
@@ -141,6 +157,7 @@ class CloudDevopsEnvironment(Environment):
             },
             "i-web2": {
                 "type": "Instance",
                 "status": "degraded",
                 "logs": (
                     "kernel: Out of memory: Killed process 12345 (java) total-vm:4194304kB, "
@@ -153,10 +170,47 @@ class CloudDevopsEnvironment(Environment):
                 "type": "SecurityGroup",
                 "rules": [{"port": 80, "action": "allow"}],
             },
             }
         )
         return resources
     def _reward_once(self, achievement: str, points: float) -> float:
         if achievement in self._achievements:
             return 0.0
@@ -202,7 +256,7 @@ class CloudDevopsEnvironment(Environment):
         state = self._state_data
         state.step_count += 1
-        reward = 0.0
         done = False
         output = ""
         error = None
@@ -245,6 +299,25 @@ class CloudDevopsEnvironment(Environment):
                 elif self.task_name == "hard" and action.resource_id == "i-web2":
                     reward += self._reward_once("inspect_target", 0.2)
             elif action.command == "update_security_group":
                 if not action.resource_id:
                     raise ValueError("resource_id is required for update_security_group.")
@@ -277,7 +350,11 @@ class CloudDevopsEnvironment(Environment):
                     and action.resource_id == "sg-db"
                     and port == 5432
                 ):
-                    if "read_logs" in self._achievements:
                         state.is_resolved = True
                         reward += 0.6
                         done = True
@@ -286,7 +363,7 @@ class CloudDevopsEnvironment(Environment):
                         reward -= 0.1
                         output += (
                             "\nWARNING: Change applied without incident triage. "
-                            "Inspect API logs before closing the incident."
                         )
             elif action.command == "restart_service":
@@ -302,6 +379,7 @@ class CloudDevopsEnvironment(Environment):
                         investigated_root_cause = (
                             "inspect_lb" in self._achievements
                             and "inspect_target" in self._achievements
                         )
                         if investigated_root_cause:
                             state.resources["i-web2"]["status"] = "running"
@@ -316,7 +394,7 @@ class CloudDevopsEnvironment(Environment):
                             reward -= 0.1
                             output += (
                                 "\nWARNING: Restart denied by change policy. "
-                                "Find failing upstream from lb-main and inspect i-web2 first."
                             )
                     elif action.resource_id == "i-web1":
                         reward -= 0.2
@@ -349,19 +427,31 @@ class CloudDevopsEnvironment(Environment):
             error = str(exc)
             output = f"Command Failed: {error}"
         if state.step_count >= self.MAX_STEPS and not done:
             done = True
             timeout_suffix = "\nTIMEOUT: Max steps reached."
             output = f"{output}{timeout_suffix}" if output else timeout_suffix.strip()
         reward = max(-1.0, min(1.0, reward))
-        status = "HEALTHY" if state.is_resolved else "CRITICAL"
         info = {
             "step_count": state.step_count,
             "resolved": state.is_resolved,
             "task": self.task_name,
             "achievements": sorted(self._achievements),
             "total_resources": len(state.resources),
         }
         return CloudObservation(

     SUPPORTS_CONCURRENT_SESSIONS: bool = True
     MAX_STEPS: int = 20
     VALID_TASKS = {"easy", "medium", "hard"}
+    ACTION_COST: float = 0.01
     def __init__(self, task_name: str = "easy"):
         """Initialize the cloud_devops_env environment."""
                 {
                 "i-api": {
                     "type": "Instance",
+                    "ip_address": "10.0.4.11",
                     "status": "running",
                     "logs": (
                         "[2026-04-06 17:01:22] [CRITICAL] "
                         "sqlalchemy.exc.OperationalError: "
                         "(psycopg2.OperationalError) connection to server at "
+                        "'10.0.4.5', port 5432 failed: Connection timed out. "
                         "Is the server running and accepting TCP/IP connections?"
                     ),
                 },
+                "i-db": {
+                    "type": "Instance",
+                    "ip_address": "10.0.4.5",
+                    "status": "running",
+                },
                 "sg-db": {
                     "type": "SecurityGroup",
                     "rules": [{"port": 22, "action": "allow"}],
                 },
+                "metadata-svc": {
+                    "type": "MetadataService",
+                    "status": "running",
+                },
                 }
             )
             return resources
                     "2026/04/06 17:02:09 [error] 3197#3197: *4189 upstream timed out "
                     "(110: Connection timed out) while reading response header from upstream, "
                     "client: 10.0.2.14, server: api.prod.local, request: \"GET /checkout HTTP/1.1\", "
+                    "upstream: \"http://10.0.8.22:8080/checkout\", host: \"api.prod.local\"\n"
                     "2026/04/06 17:02:10 [error] 3197#3197: *4190 no live upstreams while "
+                    "connecting to upstream \"10.0.8.22\""
                 ),
             },
+            "lb-external": {
+                "type": "LoadBalancer",
+                "status": "running",
+                "logs": "INFO: Edge traffic stable.",
+            },
             "i-web1": {
                 "type": "Instance",
+                "ip_address": "10.0.8.21",
                 "status": "running",
                 "logs": (
                     "[2026-04-06 17:02:11] INFO web-service: readiness probe passed\n"
             },
             "i-web2": {
                 "type": "Instance",
+                "ip_address": "10.0.8.22",
                 "status": "degraded",
                 "logs": (
                     "kernel: Out of memory: Killed process 12345 (java) total-vm:4194304kB, "
                 "type": "SecurityGroup",
                 "rules": [{"port": 80, "action": "allow"}],
             },
+            "metadata-svc": {
+                "type": "MetadataService",
+                "status": "running",
+            },
             }
         )
         return resources
+    def _lookup_resource_by_ip(self, ip_address: str) -> str | None:
+        if self._state_data is None:
+            return None
+        for resource_id, data in self._state_data.resources.items():
+            if data.get("ip_address") == ip_address:
+                return resource_id
+        return None
+    def _apply_cascading_failure(self) -> tuple[float, str]:
+        """Simulate system drift in hard mode if root cause is not fixed quickly."""
+        if self._state_data is None or self.task_name != "hard":
+            return 0.0, ""
+        state = self._state_data
+        if state.is_resolved or state.step_count <= 8:
+            return 0.0, ""
+        lb = state.resources.get("lb-external")
+        if not lb:
+            return 0.0, ""
+        if lb.get("status") != "DOWN":
+            lb["status"] = "DOWN"
+            lb["logs"] = (
+                "CRITICAL: Cascading failure triggered after prolonged unresolved OOM incident. "
+                "Edge load balancer stopped serving traffic."
+            )
+            return -0.05, (
+                "\nALERT: Cascading failure detected. lb-external is DOWN due to delayed remediation."
+            )
+        return -0.03, ""
     def _reward_once(self, achievement: str, points: float) -> float:
         if achievement in self._achievements:
             return 0.0
         state = self._state_data
         state.step_count += 1
+        reward = -self.ACTION_COST
         done = False
         output = ""
         error = None
                 elif self.task_name == "hard" and action.resource_id == "i-web2":
                     reward += self._reward_once("inspect_target", 0.2)
+            elif action.command == "query_metadata":
+                ip_address = None
+                if action.parameters and isinstance(action.parameters, dict):
+                    ip_address = action.parameters.get("ip_address")
+                if not ip_address and action.resource_id:
+                    ip_address = action.resource_id
+                if not ip_address:
+                    raise ValueError("query_metadata requires parameters.ip_address.")
+                resource_id = self._lookup_resource_by_ip(str(ip_address))
+                if not resource_id:
+                    raise ValueError(f"No resource found for ip_address={ip_address}")
+                output = f"Metadata lookup: ip_address={ip_address} resource_id={resource_id}"
+                if self.task_name == "medium" and str(ip_address) == "10.0.4.5":
+                    reward += self._reward_once("lookup_db_target", 0.2)
+                elif self.task_name == "hard" and str(ip_address) == "10.0.8.22":
+                    reward += self._reward_once("lookup_upstream_target", 0.2)
             elif action.command == "update_security_group":
                 if not action.resource_id:
                     raise ValueError("resource_id is required for update_security_group.")
                     and action.resource_id == "sg-db"
                     and port == 5432
                 ):
+                    investigated = (
+                        "read_logs" in self._achievements
+                        and "lookup_db_target" in self._achievements
+                    )
+                    if investigated:
                         state.is_resolved = True
                         reward += 0.6
                         done = True
                         reward -= 0.1
                         output += (
                             "\nWARNING: Change applied without incident triage. "
+                            "Inspect API logs and resolve DB IP via query_metadata before closing the incident."
                         )
             elif action.command == "restart_service":
                         investigated_root_cause = (
                             "inspect_lb" in self._achievements
                             and "inspect_target" in self._achievements
+                            and "lookup_upstream_target" in self._achievements
                         )
                         if investigated_root_cause:
                             state.resources["i-web2"]["status"] = "running"
                             reward -= 0.1
                             output += (
                                 "\nWARNING: Restart denied by change policy. "
+                                "Find failing upstream IP from lb-main, resolve it with query_metadata, and inspect i-web2 first."
                             )
                     elif action.resource_id == "i-web1":
                         reward -= 0.2
             error = str(exc)
             output = f"Command Failed: {error}"
+        cascade_penalty, cascade_msg = self._apply_cascading_failure()
+        reward += cascade_penalty
+        if cascade_msg:
+            output = f"{output}{cascade_msg}" if output else cascade_msg.strip()
         if state.step_count >= self.MAX_STEPS and not done:
             done = True
             timeout_suffix = "\nTIMEOUT: Max steps reached."
             output = f"{output}{timeout_suffix}" if output else timeout_suffix.strip()
         reward = max(-1.0, min(1.0, reward))
+        lb_external = state.resources.get("lb-external", {})
+        if state.is_resolved:
+            status = "HEALTHY"
+        elif self.task_name == "hard" and lb_external.get("status") == "DOWN":
+            status = "DEGRADED"
+        else:
+            status = "CRITICAL"
         info = {
             "step_count": state.step_count,
             "resolved": state.is_resolved,
             "task": self.task_name,
             "achievements": sorted(self._achievements),
             "total_resources": len(state.resources),
+            "action_cost": self.ACTION_COST,
         }
         return CloudObservation(