Add action-cost RL dynamics, metadata lookup, and hard-mode cascading failures
- README.md +26 -12
- inference.py +2 -1
- models.py +4 -2
- server/cloud_devops_env_environment.py +99 -9
README.md
CHANGED
|
@@ -25,6 +25,16 @@ Real incidents are multi-step and noisy. Good agents must:
|
|
| 25 |
|
| 26 |
Cloud DevOps RLEnv simulates that behavior with realistic failure patterns, decoy resources, shaped rewards, and anti-shortcut guardrails.
|
| 27 |
|
| 28 |
## Environment Scope
|
| 29 |
|
| 30 |
- Domain: Cloud SRE / DevOps incident response
|
|
@@ -55,7 +65,7 @@ Model: CloudAction
|
|
| 55 |
|
| 56 |
| Field | Type | Required | Description |
|
| 57 |
| --- | --- | --- | --- |
|
| 58 |
-
| command | enum | yes | One of: list_resources, describe_resource, view_logs, update_security_group, restart_service, submit_solution |
|
| 59 |
| resource_id | string | conditional | Required for most actions except list_resources |
|
| 60 |
| parameters | object | conditional | Used by mutating actions (for example, security-group updates) |
|
| 61 |
|
|
@@ -63,6 +73,7 @@ Action semantics:
|
|
| 63 |
- list_resources: Enumerates available resources including decoys.
|
| 64 |
- describe_resource: Returns structured details for one resource.
|
| 65 |
- view_logs: Returns logs for one resource.
|
|
|
|
| 66 |
- update_security_group: Appends a rule (requires parameters.port).
|
| 67 |
- restart_service: Restarts one instance/service by ID.
|
| 68 |
- submit_solution: Declares the episode solved (or not solved).
|
|
@@ -95,6 +106,7 @@ Reward shaping is sparse-but-guided:
|
|
| 95 |
- discovery rewards for correct investigative steps
|
| 96 |
- larger terminal rewards for correct remediation
|
| 97 |
- penalties for unsafe or premature operations
|
|
|
|
| 98 |
- timeout terminal condition after max steps
|
| 99 |
|
| 100 |
Per-step reward is clipped to [-1.0, 1.0].
|
|
@@ -116,8 +128,8 @@ Typical successful sequence:
|
|
| 116 |
3. update_security_group(sg-web, port=80, action=allow) (+0.8, done)
|
| 117 |
|
| 118 |
Expected score:
|
| 119 |
-
-
|
| 120 |
-
- 0.
|
| 121 |
|
| 122 |
### Medium Task
|
| 123 |
|
|
@@ -134,33 +146,35 @@ Typical successful sequence:
|
|
| 134 |
4. update_security_group(sg-db, port=5432, action=allow) (+0.6, done if logs were inspected)
|
| 135 |
|
| 136 |
Guardrail:
|
| 137 |
-
- Applying the SG change before log triage gives a penalty (-0.1) and does not close the incident.
|
| 138 |
|
| 139 |
Expected score:
|
| 140 |
-
-
|
| 141 |
-
- 0.
|
| 142 |
|
| 143 |
### Hard Task
|
| 144 |
|
| 145 |
Incident:
|
| 146 |
-
- Checkout path degraded due to upstream timeout to
|
| 147 |
|
| 148 |
Objective:
|
| 149 |
-
- Trace LB errors to the correct target and restart i-web2 only after diagnosis.
|
| 150 |
|
| 151 |
Typical successful sequence:
|
| 152 |
1. list_resources
|
| 153 |
-
2. view_logs(lb-main) to identify failing upstream
|
| 154 |
-
3.
|
| 155 |
-
4.
|
|
|
|
| 156 |
|
| 157 |
Guardrails:
|
| 158 |
- Restarting i-web2 before investigation: penalty (-0.1), no resolution.
|
| 159 |
- Restarting healthy i-web1: penalty (-0.2).
|
| 160 |
- Premature submit_solution in hard mode: penalty (-0.1), episode continues.
|
|
|
|
| 161 |
|
| 162 |
Expected score:
|
| 163 |
-
- 1.0 after score clamping for
|
| 164 |
|
| 165 |
## API Endpoints
|
| 166 |
|
|
|
|
| 25 |
|
| 26 |
Cloud DevOps RLEnv simulates that behavior with realistic failure patterns, decoy resources, shaped rewards, and anti-shortcut guardrails.
|
| 27 |
|
| 28 |
+
## Why It's Hard
|
| 29 |
+
|
| 30 |
+
This benchmark is intentionally designed to resist brute-force policies and reward disciplined SRE reasoning:
|
| 31 |
+
|
| 32 |
+
- Needle-in-a-haystack discovery: 20+ decoy compute nodes and 20+ decoy security groups increase search complexity.
|
| 33 |
+
- Ambiguous telemetry: noisy, raw operational logs surface symptoms (including IP-only clues) rather than direct root-cause labels.
|
| 34 |
+
- Action-penalty heuristics: every action incurs a fixed cost of 0.01 (ACTION_COST), so efficient remediation beats command spamming.
|
| 35 |
+
- Multi-hop dependency resolution: agents must map IP addresses to resource IDs via metadata lookup before applying fixes.
|
| 36 |
+
- System drift under pressure: in hard mode, delayed remediation triggers cascading failures that worsen observability and reward dynamics.
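
The multi-hop step above can be sketched as a plain dictionary scan, which is roughly what the environment's IP lookup does. This is a minimal sketch, not the server code: the `RESOURCES` table and the standalone function `lookup_resource_by_ip` are illustrative, with IDs and IPs borrowed from the hard-task fixtures in this change.

```python
from typing import Optional

# Illustrative resource table (an assumption for this sketch); only
# instances carry an "ip_address" field, security groups do not.
RESOURCES = {
    "i-web1": {"type": "Instance", "ip_address": "10.0.8.21"},
    "i-web2": {"type": "Instance", "ip_address": "10.0.8.22"},
    "sg-web": {"type": "SecurityGroup"},
}


def lookup_resource_by_ip(resources: dict, ip_address: str) -> Optional[str]:
    """Return the ID of the first resource whose ip_address matches, else None."""
    for resource_id, data in resources.items():
        if data.get("ip_address") == ip_address:
            return resource_id
    return None


# The load balancer log exposes only the upstream IP; resolve it to an ID.
target = lookup_resource_by_ip(RESOURCES, "10.0.8.22")
print(target)
```

An agent that skips this hop has no grounded way to know which instance the log line refers to, which is the point of the guardrail.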
|
| 37 |
+
|
| 38 |
## Environment Scope
|
| 39 |
|
| 40 |
- Domain: Cloud SRE / DevOps incident response
|
|
|
|
| 65 |
|
| 66 |
| Field | Type | Required | Description |
|
| 67 |
| --- | --- | --- | --- |
|
| 68 |
+
| command | enum | yes | One of: list_resources, describe_resource, view_logs, query_metadata, update_security_group, restart_service, submit_solution |
|
| 69 |
| resource_id | string | conditional | Required for most actions except list_resources |
|
| 70 |
| parameters | object | conditional | Used by mutating actions (for example, security-group updates) |
|
| 71 |
|
|
|
|
| 73 |
- list_resources: Enumerates available resources including decoys.
|
| 74 |
- describe_resource: Returns structured details for one resource.
|
| 75 |
- view_logs: Returns logs for one resource.
|
| 76 |
+
- query_metadata: Resolves infrastructure metadata (for example, IP address to resource ID).
|
| 77 |
- update_security_group: Appends a rule (requires parameters.port).
|
| 78 |
- restart_service: Restarts one instance/service by ID.
|
| 79 |
- submit_solution: Declares the episode solved (or not solved).
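
As a rough illustration of the action schema above, the payloads an agent sends might be built as follows. The helper `make_action` is hypothetical and not part of this repository; only the command names and the `resource_id` / `parameters` fields come from the CloudAction table.

```python
import json

# Command names as listed in the CloudAction table.
VALID_COMMANDS = {
    "list_resources",
    "describe_resource",
    "view_logs",
    "query_metadata",
    "update_security_group",
    "restart_service",
    "submit_solution",
}


def make_action(command, resource_id=None, parameters=None):
    """Build a JSON-serializable action dict (hypothetical helper)."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    action = {"command": command}
    if resource_id is not None:
        action["resource_id"] = resource_id
    if parameters is not None:
        action["parameters"] = parameters
    return action


# Resolve an IP-only clue, then apply a security-group fix.
lookup = make_action("query_metadata", parameters={"ip_address": "10.0.8.22"})
fix = make_action("update_security_group", "sg-web", {"port": 80, "action": "allow"})
print(json.dumps(lookup))
print(json.dumps(fix))
```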
|
|
|
|
| 106 |
- discovery rewards for correct investigative steps
|
| 107 |
- larger terminal rewards for correct remediation
|
| 108 |
- penalties for unsafe or premature operations
|
| 109 |
+
- fixed action cost of 0.01 per step (ACTION_COST), which puts pressure on efficiency
|
| 110 |
- timeout terminal condition after max steps
|
| 111 |
|
| 112 |
Per-step reward is clipped to [-1.0, 1.0].
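
A minimal sketch of how one step's reward could be assembled and clipped, assuming the 0.01 action cost from this change. The function `step_reward` and its arguments are illustrative names, not the environment's API.

```python
ACTION_COST = 0.01  # fixed per-step cost introduced in this change


def step_reward(shaped_bonus: float, penalty: float = 0.0) -> float:
    """Combine action cost, shaped bonus, and penalty, then clip to [-1.0, 1.0]."""
    raw = -ACTION_COST + shaped_bonus - penalty
    return max(-1.0, min(1.0, raw))


# A +0.2 discovery bonus nets slightly less than 0.2 once the cost is applied.
print(step_reward(0.2))
# Out-of-range values are clipped to the bounds.
print(step_reward(5.0), step_reward(0.0, penalty=3.0))
```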
|
|
|
|
| 128 |
3. update_security_group(sg-web, port=80, action=allow) (+0.8, done)
|
| 129 |
|
| 130 |
Expected score:
|
| 131 |
+
- ~0.97 for full playbook with efficient triage
|
| 132 |
+
- ~0.79 if agent skips the optional read step
|
| 133 |
|
| 134 |
### Medium Task
|
| 135 |
|
|
|
|
| 146 |
4. update_security_group(sg-db, port=5432, action=allow) (+0.6, done if logs were inspected)
|
| 147 |
|
| 148 |
Guardrail:
|
| 149 |
+
- Applying the SG change before both log triage and metadata lookup incurs a penalty (-0.1) and does not close the incident.
|
| 150 |
|
| 151 |
Expected score:
|
| 152 |
+
- ~0.97 with full investigative path (logs -> metadata lookup -> remediation)
|
| 153 |
+
- below ~0.90 when metadata dependency is skipped
|
| 154 |
|
| 155 |
### Hard Task
|
| 156 |
|
| 157 |
Incident:
|
| 158 |
+
- Checkout path degraded due to upstream timeout to an IP-only target that must be resolved first.
|
| 159 |
|
| 160 |
Objective:
|
| 161 |
+
- Trace LB errors to the correct target, resolve resource identity via metadata, and restart i-web2 only after diagnosis.
|
| 162 |
|
| 163 |
Typical successful sequence:
|
| 164 |
1. list_resources
|
| 165 |
+
2. view_logs(lb-main) to identify failing upstream IP (+0.2)
|
| 166 |
+
3. query_metadata(ip_address=<failing_ip>) to resolve target ID (+0.2)
|
| 167 |
+
4. describe_resource(i-web2) or view_logs(i-web2) (+0.2)
|
| 168 |
+
5. restart_service(i-web2) (+0.8, done when all investigation achievements exist)
|
| 169 |
|
| 170 |
Guardrails:
|
| 171 |
- Restarting i-web2 before investigation: penalty (-0.1), no resolution.
|
| 172 |
- Restarting healthy i-web1: penalty (-0.2).
|
| 173 |
- Premature submit_solution in hard mode: penalty (-0.1), episode continues.
|
| 174 |
+
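- If the incident is still unresolved after step 8 in hard mode, lb-external also fails (cascading failure): -0.05 on the step that triggers it and -0.03 on every later unresolved step, with extra alert noise in the output.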
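
The cascading-failure guardrail can be summarized as a small pure function. This is a sketch that mirrors the penalty values in the server change (-0.05 when the cascade triggers, -0.03 afterwards); the name `cascade_penalty` and its flat argument list are assumptions made for illustration.

```python
def cascade_penalty(step_count: int, resolved: bool, lb_external_down: bool) -> float:
    """Hard-mode drift penalty for the current step (illustrative sketch)."""
    # No drift while the incident is resolved or still within the first 8 steps.
    if resolved or step_count <= 8:
        return 0.0
    # First late step takes lb-external down; later steps pay a smaller recurring cost.
    return -0.03 if lb_external_down else -0.05


for step, down in [(8, False), (9, False), (10, True)]:
    print(step, cascade_penalty(step, resolved=False, lb_external_down=down))
```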
|
| 175 |
|
| 176 |
Expected score:
|
| 177 |
+
- near 1.0 after score clamping for strong trajectories (the raw total can exceed 1.0 before the clamp)
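
To see why the raw total can exceed 1.0, the typical hard-task sequence can be added up by hand. The sketch below makes two assumptions that this README does not confirm: that list_resources earns no bonus in hard mode, and that the final score is the summed reward clamped to [0.0, 1.0].

```python
ACTION_COST = 0.01

# Bonuses for the five-step hard playbook; the 0.0 for list_resources is an assumption.
bonuses = [0.0, 0.2, 0.2, 0.2, 0.8]

# Each step pays the action cost and is clipped to [-1.0, 1.0].
step_rewards = [max(-1.0, min(1.0, b - ACTION_COST)) for b in bonuses]

raw_total = sum(step_rewards)
score = max(0.0, min(1.0, raw_total))  # assumed clamp for the reported score
print(round(raw_total, 2), score)
```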
|
| 178 |
|
| 179 |
## API Endpoints
|
| 180 |
|
inference.py
CHANGED
|
@@ -63,10 +63,11 @@ def get_model_action(
|
|
| 63 |
"You are an expert AI DevOps Engineer diagnosing a cloud infrastructure issue. "
|
| 64 |
"You must respond ONLY with a raw JSON object matching this schema:\n"
|
| 65 |
"{\n"
|
| 66 |
-
' "command": "list_resources" | "describe_resource" | "view_logs" | "update_security_group" | "restart_service" | "submit_solution",\n'
|
| 67 |
' "resource_id": "string (optional)",\n'
|
| 68 |
' "parameters": {"key": "value"} (optional)\n'
|
| 69 |
"}\n"
|
|
|
|
| 70 |
"Do not include markdown blocks like ```json. Just output the JSON."
|
| 71 |
)
|
| 72 |
|
|
|
|
| 63 |
"You are an expert AI DevOps Engineer diagnosing a cloud infrastructure issue. "
|
| 64 |
"You must respond ONLY with a raw JSON object matching this schema:\n"
|
| 65 |
"{\n"
|
| 66 |
+
' "command": "list_resources" | "describe_resource" | "view_logs" | "query_metadata" | "update_security_group" | "restart_service" | "submit_solution",\n'
|
| 67 |
' "resource_id": "string (optional)",\n'
|
| 68 |
' "parameters": {"key": "value"} (optional)\n'
|
| 69 |
"}\n"
|
| 70 |
+
"When logs provide only IP addresses, use query_metadata with parameters.ip_address to resolve the resource_id before remediation.\n"
|
| 71 |
"Do not include markdown blocks like ```json. Just output the JSON."
|
| 72 |
)
|
| 73 |
|
models.py
CHANGED
|
@@ -24,6 +24,7 @@ class CloudAction(Action):
|
|
| 24 |
"list_resources",
|
| 25 |
"describe_resource",
|
| 26 |
"view_logs",
|
|
|
|
| 27 |
"update_security_group",
|
| 28 |
"restart_service",
|
| 29 |
"submit_solution",
|
|
@@ -32,14 +33,15 @@ class CloudAction(Action):
|
|
| 32 |
default=None,
|
| 33 |
description=(
|
| 34 |
"The ID of the target resource (e.g., 'i-12345'). "
|
| 35 |
-
"Required for
|
| 36 |
),
|
| 37 |
)
|
| 38 |
parameters: Optional[Dict[str, Any]] = Field(
|
| 39 |
default=None,
|
| 40 |
description=(
|
| 41 |
"Key-value pairs for updates "
|
| 42 |
-
"(e.g., {'port': '80', 'action': 'allow'} for update_security_group
|
|
|
|
| 43 |
),
|
| 44 |
)
|
| 45 |
message: Optional[str] = Field(
|
|
|
|
| 24 |
"list_resources",
|
| 25 |
"describe_resource",
|
| 26 |
"view_logs",
|
| 27 |
+
"query_metadata",
|
| 28 |
"update_security_group",
|
| 29 |
"restart_service",
|
| 30 |
"submit_solution",
|
|
|
|
| 33 |
default=None,
|
| 34 |
description=(
|
| 35 |
"The ID of the target resource (e.g., 'i-12345'). "
|
| 36 |
+
"Required for most commands except list_resources and query_metadata."
|
| 37 |
),
|
| 38 |
)
|
| 39 |
parameters: Optional[Dict[str, Any]] = Field(
|
| 40 |
default=None,
|
| 41 |
description=(
|
| 42 |
"Key-value pairs for updates "
|
| 43 |
+
"(e.g., {'port': '80', 'action': 'allow'} for update_security_group, "
|
| 44 |
+
"or {'ip_address': '10.0.4.5'} for query_metadata)."
|
| 45 |
),
|
| 46 |
)
|
| 47 |
message: Optional[str] = Field(
|
server/cloud_devops_env_environment.py
CHANGED
|
@@ -50,6 +50,7 @@ class CloudDevopsEnvironment(Environment):
|
|
| 50 |
SUPPORTS_CONCURRENT_SESSIONS: bool = True
|
| 51 |
MAX_STEPS: int = 20
|
| 52 |
VALID_TASKS = {"easy", "medium", "hard"}
|
|
|
|
| 53 |
|
| 54 |
def __init__(self, task_name: str = "easy"):
|
| 55 |
"""Initialize the cloud_devops_env environment."""
|
|
@@ -100,20 +101,29 @@ class CloudDevopsEnvironment(Environment):
|
|
| 100 |
{
|
| 101 |
"i-api": {
|
| 102 |
"type": "Instance",
|
|
|
|
| 103 |
"status": "running",
|
| 104 |
"logs": (
|
| 105 |
"[2026-04-06 17:01:22] [CRITICAL] "
|
| 106 |
"sqlalchemy.exc.OperationalError: "
|
| 107 |
"(psycopg2.OperationalError) connection to server at "
|
| 108 |
-
"'10.0.4.5'
|
| 109 |
"Is the server running and accepting TCP/IP connections?"
|
| 110 |
),
|
| 111 |
},
|
| 112 |
-
"i-db": {
|
|
|
|
|
|
|
|
|
|
|
|
|
| 113 |
"sg-db": {
|
| 114 |
"type": "SecurityGroup",
|
| 115 |
"rules": [{"port": 22, "action": "allow"}],
|
| 116 |
},
|
|
|
|
|
|
|
|
|
|
|
|
|
| 117 |
}
|
| 118 |
)
|
| 119 |
return resources
|
|
@@ -126,13 +136,19 @@ class CloudDevopsEnvironment(Environment):
|
|
| 126 |
"2026/04/06 17:02:09 [error] 3197#3197: *4189 upstream timed out "
|
| 127 |
"(110: Connection timed out) while reading response header from upstream, "
|
| 128 |
"client: 10.0.2.14, server: api.prod.local, request: \"GET /checkout HTTP/1.1\", "
|
| 129 |
-
"upstream: \"http://
|
| 130 |
"2026/04/06 17:02:10 [error] 3197#3197: *4190 no live upstreams while "
|
| 131 |
-
"connecting to upstream \"
|
| 132 |
),
|
| 133 |
},
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 134 |
"i-web1": {
|
| 135 |
"type": "Instance",
|
|
|
|
| 136 |
"status": "running",
|
| 137 |
"logs": (
|
| 138 |
"[2026-04-06 17:02:11] INFO web-service: readiness probe passed\n"
|
|
@@ -141,6 +157,7 @@ class CloudDevopsEnvironment(Environment):
|
|
| 141 |
},
|
| 142 |
"i-web2": {
|
| 143 |
"type": "Instance",
|
|
|
|
| 144 |
"status": "degraded",
|
| 145 |
"logs": (
|
| 146 |
"kernel: Out of memory: Killed process 12345 (java) total-vm:4194304kB, "
|
|
@@ -153,10 +170,47 @@ class CloudDevopsEnvironment(Environment):
|
|
| 153 |
"type": "SecurityGroup",
|
| 154 |
"rules": [{"port": 80, "action": "allow"}],
|
| 155 |
},
|
|
|
|
|
|
|
|
|
|
|
|
|
| 156 |
}
|
| 157 |
)
|
| 158 |
return resources
|
| 159 |
|
| 160 |
def _reward_once(self, achievement: str, points: float) -> float:
|
| 161 |
if achievement in self._achievements:
|
| 162 |
return 0.0
|
|
@@ -202,7 +256,7 @@ class CloudDevopsEnvironment(Environment):
|
|
| 202 |
state = self._state_data
|
| 203 |
|
| 204 |
state.step_count += 1
|
| 205 |
-
reward =
|
| 206 |
done = False
|
| 207 |
output = ""
|
| 208 |
error = None
|
|
@@ -245,6 +299,25 @@ class CloudDevopsEnvironment(Environment):
|
|
| 245 |
elif self.task_name == "hard" and action.resource_id == "i-web2":
|
| 246 |
reward += self._reward_once("inspect_target", 0.2)
|
| 247 |
|
| 248 |
elif action.command == "update_security_group":
|
| 249 |
if not action.resource_id:
|
| 250 |
raise ValueError("resource_id is required for update_security_group.")
|
|
@@ -277,7 +350,11 @@ class CloudDevopsEnvironment(Environment):
|
|
| 277 |
and action.resource_id == "sg-db"
|
| 278 |
and port == 5432
|
| 279 |
):
|
| 280 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 281 |
state.is_resolved = True
|
| 282 |
reward += 0.6
|
| 283 |
done = True
|
|
@@ -286,7 +363,7 @@ class CloudDevopsEnvironment(Environment):
|
|
| 286 |
reward -= 0.1
|
| 287 |
output += (
|
| 288 |
"\nWARNING: Change applied without incident triage. "
|
| 289 |
-
"Inspect API logs before closing the incident."
|
| 290 |
)
|
| 291 |
|
| 292 |
elif action.command == "restart_service":
|
|
@@ -302,6 +379,7 @@ class CloudDevopsEnvironment(Environment):
|
|
| 302 |
investigated_root_cause = (
|
| 303 |
"inspect_lb" in self._achievements
|
| 304 |
and "inspect_target" in self._achievements
|
|
|
|
| 305 |
)
|
| 306 |
if investigated_root_cause:
|
| 307 |
state.resources["i-web2"]["status"] = "running"
|
|
@@ -316,7 +394,7 @@ class CloudDevopsEnvironment(Environment):
|
|
| 316 |
reward -= 0.1
|
| 317 |
output += (
|
| 318 |
"\nWARNING: Restart denied by change policy. "
|
| 319 |
-
"Find failing upstream from lb-main and inspect i-web2 first."
|
| 320 |
)
|
| 321 |
elif action.resource_id == "i-web1":
|
| 322 |
reward -= 0.2
|
|
@@ -349,19 +427,31 @@ class CloudDevopsEnvironment(Environment):
|
|
| 349 |
error = str(exc)
|
| 350 |
output = f"Command Failed: {error}"
|
| 351 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 352 |
if state.step_count >= self.MAX_STEPS and not done:
|
| 353 |
done = True
|
| 354 |
timeout_suffix = "\nTIMEOUT: Max steps reached."
|
| 355 |
output = f"{output}{timeout_suffix}" if output else timeout_suffix.strip()
|
| 356 |
|
| 357 |
reward = max(-1.0, min(1.0, reward))
|
| 358 |
-
|
| 359 |
info = {
|
| 360 |
"step_count": state.step_count,
|
| 361 |
"resolved": state.is_resolved,
|
| 362 |
"task": self.task_name,
|
| 363 |
"achievements": sorted(self._achievements),
|
| 364 |
"total_resources": len(state.resources),
|
|
|
|
| 365 |
}
|
| 366 |
|
| 367 |
return CloudObservation(
|
|
|
|
| 50 |
SUPPORTS_CONCURRENT_SESSIONS: bool = True
|
| 51 |
MAX_STEPS: int = 20
|
| 52 |
VALID_TASKS = {"easy", "medium", "hard"}
|
| 53 |
+
ACTION_COST: float = 0.01
|
| 54 |
|
| 55 |
def __init__(self, task_name: str = "easy"):
|
| 56 |
"""Initialize the cloud_devops_env environment."""
|
|
|
|
| 101 |
{
|
| 102 |
"i-api": {
|
| 103 |
"type": "Instance",
|
| 104 |
+
"ip_address": "10.0.4.11",
|
| 105 |
"status": "running",
|
| 106 |
"logs": (
|
| 107 |
"[2026-04-06 17:01:22] [CRITICAL] "
|
| 108 |
"sqlalchemy.exc.OperationalError: "
|
| 109 |
"(psycopg2.OperationalError) connection to server at "
|
| 110 |
+
"'10.0.4.5', port 5432 failed: Connection timed out. "
|
| 111 |
"Is the server running and accepting TCP/IP connections?"
|
| 112 |
),
|
| 113 |
},
|
| 114 |
+
"i-db": {
|
| 115 |
+
"type": "Instance",
|
| 116 |
+
"ip_address": "10.0.4.5",
|
| 117 |
+
"status": "running",
|
| 118 |
+
},
|
| 119 |
"sg-db": {
|
| 120 |
"type": "SecurityGroup",
|
| 121 |
"rules": [{"port": 22, "action": "allow"}],
|
| 122 |
},
|
| 123 |
+
"metadata-svc": {
|
| 124 |
+
"type": "MetadataService",
|
| 125 |
+
"status": "running",
|
| 126 |
+
},
|
| 127 |
}
|
| 128 |
)
|
| 129 |
return resources
|
|
|
|
| 136 |
"2026/04/06 17:02:09 [error] 3197#3197: *4189 upstream timed out "
|
| 137 |
"(110: Connection timed out) while reading response header from upstream, "
|
| 138 |
"client: 10.0.2.14, server: api.prod.local, request: \"GET /checkout HTTP/1.1\", "
|
| 139 |
+
"upstream: \"http://10.0.8.22:8080/checkout\", host: \"api.prod.local\"\n"
|
| 140 |
"2026/04/06 17:02:10 [error] 3197#3197: *4190 no live upstreams while "
|
| 141 |
+
"connecting to upstream \"10.0.8.22\""
|
| 142 |
),
|
| 143 |
},
|
| 144 |
+
"lb-external": {
|
| 145 |
+
"type": "LoadBalancer",
|
| 146 |
+
"status": "running",
|
| 147 |
+
"logs": "INFO: Edge traffic stable.",
|
| 148 |
+
},
|
| 149 |
"i-web1": {
|
| 150 |
"type": "Instance",
|
| 151 |
+
"ip_address": "10.0.8.21",
|
| 152 |
"status": "running",
|
| 153 |
"logs": (
|
| 154 |
"[2026-04-06 17:02:11] INFO web-service: readiness probe passed\n"
|
|
|
|
| 157 |
},
|
| 158 |
"i-web2": {
|
| 159 |
"type": "Instance",
|
| 160 |
+
"ip_address": "10.0.8.22",
|
| 161 |
"status": "degraded",
|
| 162 |
"logs": (
|
| 163 |
"kernel: Out of memory: Killed process 12345 (java) total-vm:4194304kB, "
|
|
|
|
| 170 |
"type": "SecurityGroup",
|
| 171 |
"rules": [{"port": 80, "action": "allow"}],
|
| 172 |
},
|
| 173 |
+
"metadata-svc": {
|
| 174 |
+
"type": "MetadataService",
|
| 175 |
+
"status": "running",
|
| 176 |
+
},
|
| 177 |
}
|
| 178 |
)
|
| 179 |
return resources
|
| 180 |
|
| 181 |
+
def _lookup_resource_by_ip(self, ip_address: str) -> str | None:
|
| 182 |
+
if self._state_data is None:
|
| 183 |
+
return None
|
| 184 |
+
for resource_id, data in self._state_data.resources.items():
|
| 185 |
+
if data.get("ip_address") == ip_address:
|
| 186 |
+
return resource_id
|
| 187 |
+
return None
|
| 188 |
+
|
| 189 |
+
def _apply_cascading_failure(self) -> tuple[float, str]:
|
| 190 |
+
"""Simulate system drift in hard mode if root cause is not fixed quickly."""
|
| 191 |
+
if self._state_data is None or self.task_name != "hard":
|
| 192 |
+
return 0.0, ""
|
| 193 |
+
|
| 194 |
+
state = self._state_data
|
| 195 |
+
if state.is_resolved or state.step_count <= 8:
|
| 196 |
+
return 0.0, ""
|
| 197 |
+
|
| 198 |
+
lb = state.resources.get("lb-external")
|
| 199 |
+
if not lb:
|
| 200 |
+
return 0.0, ""
|
| 201 |
+
|
| 202 |
+
if lb.get("status") != "DOWN":
|
| 203 |
+
lb["status"] = "DOWN"
|
| 204 |
+
lb["logs"] = (
|
| 205 |
+
"CRITICAL: Cascading failure triggered after prolonged unresolved OOM incident. "
|
| 206 |
+
"Edge load balancer stopped serving traffic."
|
| 207 |
+
)
|
| 208 |
+
return -0.05, (
|
| 209 |
+
"\nALERT: Cascading failure detected. lb-external is DOWN due to delayed remediation."
|
| 210 |
+
)
|
| 211 |
+
|
| 212 |
+
return -0.03, ""
|
| 213 |
+
|
| 214 |
def _reward_once(self, achievement: str, points: float) -> float:
|
| 215 |
if achievement in self._achievements:
|
| 216 |
return 0.0
|
|
|
|
| 256 |
state = self._state_data
|
| 257 |
|
| 258 |
state.step_count += 1
|
| 259 |
+
reward = -self.ACTION_COST
|
| 260 |
done = False
|
| 261 |
output = ""
|
| 262 |
error = None
|
|
|
|
| 299 |
elif self.task_name == "hard" and action.resource_id == "i-web2":
|
| 300 |
reward += self._reward_once("inspect_target", 0.2)
|
| 301 |
|
| 302 |
+
elif action.command == "query_metadata":
|
| 303 |
+
ip_address = None
|
| 304 |
+
if action.parameters and isinstance(action.parameters, dict):
|
| 305 |
+
ip_address = action.parameters.get("ip_address")
|
| 306 |
+
if not ip_address and action.resource_id:
|
| 307 |
+
ip_address = action.resource_id
|
| 308 |
+
if not ip_address:
|
| 309 |
+
raise ValueError("query_metadata requires parameters.ip_address.")
|
| 310 |
+
|
| 311 |
+
resource_id = self._lookup_resource_by_ip(str(ip_address))
|
| 312 |
+
if not resource_id:
|
| 313 |
+
raise ValueError(f"No resource found for ip_address={ip_address}")
|
| 314 |
+
|
| 315 |
+
output = f"Metadata lookup: ip_address={ip_address} resource_id={resource_id}"
|
| 316 |
+
if self.task_name == "medium" and str(ip_address) == "10.0.4.5":
|
| 317 |
+
reward += self._reward_once("lookup_db_target", 0.2)
|
| 318 |
+
elif self.task_name == "hard" and str(ip_address) == "10.0.8.22":
|
| 319 |
+
reward += self._reward_once("lookup_upstream_target", 0.2)
|
| 320 |
+
|
| 321 |
elif action.command == "update_security_group":
|
| 322 |
if not action.resource_id:
|
| 323 |
raise ValueError("resource_id is required for update_security_group.")
|
|
|
|
| 350 |
and action.resource_id == "sg-db"
|
| 351 |
and port == 5432
|
| 352 |
):
|
| 353 |
+
investigated = (
|
| 354 |
+
"read_logs" in self._achievements
|
| 355 |
+
and "lookup_db_target" in self._achievements
|
| 356 |
+
)
|
| 357 |
+
if investigated:
|
| 358 |
state.is_resolved = True
|
| 359 |
reward += 0.6
|
| 360 |
done = True
|
|
|
|
| 363 |
reward -= 0.1
|
| 364 |
output += (
|
| 365 |
"\nWARNING: Change applied without incident triage. "
|
| 366 |
+
"Inspect API logs and resolve DB IP via query_metadata before closing the incident."
|
| 367 |
)
|
| 368 |
|
| 369 |
elif action.command == "restart_service":
|
|
|
|
| 379 |
investigated_root_cause = (
|
| 380 |
"inspect_lb" in self._achievements
|
| 381 |
and "inspect_target" in self._achievements
|
| 382 |
+
and "lookup_upstream_target" in self._achievements
|
| 383 |
)
|
| 384 |
if investigated_root_cause:
|
| 385 |
state.resources["i-web2"]["status"] = "running"
|
|
|
|
| 394 |
reward -= 0.1
|
| 395 |
output += (
|
| 396 |
"\nWARNING: Restart denied by change policy. "
|
| 397 |
+
"Find failing upstream IP from lb-main, resolve it with query_metadata, and inspect i-web2 first."
|
| 398 |
)
|
| 399 |
elif action.resource_id == "i-web1":
|
| 400 |
reward -= 0.2
|
|
|
|
| 427 |
error = str(exc)
|
| 428 |
output = f"Command Failed: {error}"
|
| 429 |
|
| 430 |
+
cascade_penalty, cascade_msg = self._apply_cascading_failure()
|
| 431 |
+
reward += cascade_penalty
|
| 432 |
+
if cascade_msg:
|
| 433 |
+
output = f"{output}{cascade_msg}" if output else cascade_msg.strip()
|
| 434 |
+
|
| 435 |
if state.step_count >= self.MAX_STEPS and not done:
|
| 436 |
done = True
|
| 437 |
timeout_suffix = "\nTIMEOUT: Max steps reached."
|
| 438 |
output = f"{output}{timeout_suffix}" if output else timeout_suffix.strip()
|
| 439 |
|
| 440 |
reward = max(-1.0, min(1.0, reward))
|
| 441 |
+
lb_external = state.resources.get("lb-external", {})
|
| 442 |
+
if state.is_resolved:
|
| 443 |
+
status = "HEALTHY"
|
| 444 |
+
elif self.task_name == "hard" and lb_external.get("status") == "DOWN":
|
| 445 |
+
status = "DEGRADED"
|
| 446 |
+
else:
|
| 447 |
+
status = "CRITICAL"
|
| 448 |
info = {
|
| 449 |
"step_count": state.step_count,
|
| 450 |
"resolved": state.is_resolved,
|
| 451 |
"task": self.task_name,
|
| 452 |
"achievements": sorted(self._achievements),
|
| 453 |
"total_resources": len(state.resources),
|
| 454 |
+
"action_cost": self.ACTION_COST,
|
| 455 |
}
|
| 456 |
|
| 457 |
return CloudObservation(
|