SidhaGarg committed on
Commit
d35c04a
·
1 Parent(s): ec24749

Add action-cost RL dynamics, metadata lookup, and hard-mode cascading failures

Files changed (4)
  1. README.md +26 -12
  2. inference.py +2 -1
  3. models.py +4 -2
  4. server/cloud_devops_env_environment.py +99 -9
README.md CHANGED
@@ -25,6 +25,16 @@ Real incidents are multi-step and noisy. Good agents must:
25
 
26
  Cloud DevOps RLEnv simulates that behavior with realistic failure patterns, decoy resources, shaped rewards, and anti-shortcut guardrails.
27
 
28
  ## Environment Scope
29
 
30
  - Domain: Cloud SRE / DevOps incident response
@@ -55,7 +65,7 @@ Model: CloudAction
55
 
56
  | Field | Type | Required | Description |
57
  | --- | --- | --- | --- |
58
- | command | enum | yes | One of: list_resources, describe_resource, view_logs, update_security_group, restart_service, submit_solution |
59
  | resource_id | string | conditional | Required for most actions except list_resources |
60
  | parameters | object | conditional | Used by mutating actions (for example, security-group updates) |
61
 
@@ -63,6 +73,7 @@ Action semantics:
63
  - list_resources: Enumerates available resources including decoys.
64
  - describe_resource: Returns structured details for one resource.
65
  - view_logs: Returns logs for one resource.
 
66
  - update_security_group: Appends a rule (requires parameters.port).
67
  - restart_service: Restarts one instance/service by ID.
68
  - submit_solution: Declares the episode solved (or not solved).
@@ -95,6 +106,7 @@ Reward shaping is sparse-but-guided:
95
  - discovery rewards for correct investigative steps
96
  - larger terminal rewards for correct remediation
97
  - penalties for unsafe or premature operations
 
98
  - timeout terminal condition after max steps
99
 
100
  Per-step reward is clipped to [-1.0, 1.0].
@@ -116,8 +128,8 @@ Typical successful sequence:
116
  3. update_security_group(sg-web, port=80, action=allow) (+0.8, done)
117
 
118
  Expected score:
119
- - 1.0 for full playbook
120
- - 0.8 if agent skips the optional read step
121
 
122
  ### Medium Task
123
 
@@ -134,33 +146,35 @@ Typical successful sequence:
134
  4. update_security_group(sg-db, port=5432, action=allow) (+0.6, done if logs were inspected)
135
 
136
  Guardrail:
137
- - Applying the SG change before log triage gives a penalty (-0.1) and does not close the incident.
138
 
139
  Expected score:
140
- - 1.0 with full investigative path
141
- - 0.8 if SG describe step is skipped but log triage is done
142
 
143
  ### Hard Task
144
 
145
  Incident:
146
- - Checkout path degraded due to upstream timeout to i-web2.
147
 
148
  Objective:
149
- - Trace LB errors to the correct target and restart i-web2 only after diagnosis.
150
 
151
  Typical successful sequence:
152
  1. list_resources
153
- 2. view_logs(lb-main) to identify failing upstream i-web2 (+0.2)
154
- 3. describe_resource(i-web2) or view_logs(i-web2) (+0.2)
155
- 4. restart_service(i-web2) (+0.8, done when both investigation achievements exist)
 
156
 
157
  Guardrails:
158
  - Restarting i-web2 before investigation: penalty (-0.1), no resolution.
159
  - Restarting healthy i-web1: penalty (-0.2).
160
  - Premature submit_solution in hard mode: penalty (-0.1), episode continues.
 
161
 
162
  Expected score:
163
- - 1.0 after score clamping for the full correct path
164
 
165
  ## API Endpoints
166
 
 
25
 
26
  Cloud DevOps RLEnv simulates that behavior with realistic failure patterns, decoy resources, shaped rewards, and anti-shortcut guardrails.
27
 
28
+ ## Why It's Hard
29
+
30
+ This benchmark is intentionally designed to resist brute-force policies and reward disciplined SRE reasoning:
31
+
32
+ - Needle-in-a-haystack discovery: 20+ decoy compute nodes and 20+ decoy security groups increase search complexity.
33
+ - Ambiguous telemetry: noisy, raw operational logs surface symptoms (including IP-only clues) rather than direct root-cause labels.
34
+ - Action-penalty heuristics: every action has a small negative cost, so efficient remediation beats command spamming.
35
+ - Multi-hop dependency resolution: agents must map IP addresses to resource IDs via metadata lookup before applying fixes.
36
+ - System drift under pressure: in hard mode, delayed remediation triggers cascading failures that worsen observability and reward dynamics.
37
+
38
  ## Environment Scope
39
 
40
  - Domain: Cloud SRE / DevOps incident response
 
65
 
66
  | Field | Type | Required | Description |
67
  | --- | --- | --- | --- |
68
+ | command | enum | yes | One of: list_resources, describe_resource, view_logs, query_metadata, update_security_group, restart_service, submit_solution |
69
  | resource_id | string | conditional | Required for most actions except list_resources |
70
  | parameters | object | conditional | Used by mutating actions (for example, security-group updates) |
71
 
 
73
  - list_resources: Enumerates available resources including decoys.
74
  - describe_resource: Returns structured details for one resource.
75
  - view_logs: Returns logs for one resource.
76
+ - query_metadata: Resolves infrastructure metadata (for example, IP address to resource ID).
77
  - update_security_group: Appends a rule (requires parameters.port).
78
  - restart_service: Restarts one instance/service by ID.
79
  - submit_solution: Declares the episode solved (or not solved).
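
As a rough illustration of the action schema above, the sketch below builds two payloads as plain dictionaries. The `build_action` helper and the `VALID_COMMANDS` set are hypothetical names introduced only for this example and are not part of the repository; the field names and command values come from the table and list above.

```python
import json

# Command values as listed in the CloudAction table (new side of this diff).
VALID_COMMANDS = {
    "list_resources",
    "describe_resource",
    "view_logs",
    "query_metadata",
    "update_security_group",
    "restart_service",
    "submit_solution",
}


def build_action(command, resource_id=None, parameters=None):
    """Hypothetical helper: assemble an action payload, omitting unset fields."""
    if command not in VALID_COMMANDS:
        raise ValueError(f"unknown command: {command}")
    action = {"command": command}
    if resource_id is not None:
        action["resource_id"] = resource_id
    if parameters is not None:
        action["parameters"] = parameters
    return action


# Resolve an IP-only clue from the logs, then act on the resolved resource.
lookup = build_action("query_metadata", parameters={"ip_address": "10.0.8.22"})
fix = build_action("restart_service", resource_id="i-web2")

print(json.dumps(lookup))
print(json.dumps(fix))
```

Note that `query_metadata` carries its input in `parameters.ip_address` rather than in `resource_id`, which is why the lookup payload has no `resource_id` field.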
 
106
  - discovery rewards for correct investigative steps
107
  - larger terminal rewards for correct remediation
108
  - penalties for unsafe or premature operations
109
+ - fixed action cost per step (efficiency pressure)
110
  - timeout terminal condition after max steps
111
 
112
  Per-step reward is clipped to [-1.0, 1.0].
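
A minimal sketch of how one step's reward might be composed under the rules above: a fixed cost is charged first, shaped bonuses or penalties are added, and the result is clipped. The `step_reward` function is a hypothetical standalone helper, not the environment's API; the cost value 0.01 is the `ACTION_COST` constant introduced in `server/cloud_devops_env_environment.py` by this commit.

```python
ACTION_COST = 0.01  # CloudDevopsEnvironment.ACTION_COST in this commit


def step_reward(shaped: float = 0.0) -> float:
    """Charge the fixed action cost, add the shaped reward, clip to [-1.0, 1.0]."""
    raw = -ACTION_COST + shaped
    return max(-1.0, min(1.0, raw))


neutral = step_reward()         # an action that earns no bonus or penalty
discovery = step_reward(0.2)    # a correct investigative step
premature = step_reward(-0.1)   # e.g. a security-group change before triage

print(round(neutral, 2), round(discovery, 2), round(premature, 2))
```

Because even a neutral action is slightly negative, a policy that repeats uninformative commands steadily loses reward.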
 
128
  3. update_security_group(sg-web, port=80, action=allow) (+0.8, done)
129
 
130
  Expected score:
131
+ - ~0.97 for full playbook with efficient triage
132
+ - ~0.79 if agent skips the optional read step
133
 
134
  ### Medium Task
135
 
 
146
  4. update_security_group(sg-db, port=5432, action=allow) (+0.6, done if logs were inspected)
147
 
148
  Guardrail:
149
+ - Applying the SG change before log triage + metadata lookup gives a penalty (-0.1) and does not close the incident.
150
 
151
  Expected score:
152
+ - ~0.97 with full investigative path (logs -> metadata lookup -> remediation)
153
+ - below ~0.90 when metadata dependency is skipped
154
 
155
  ### Hard Task
156
 
157
  Incident:
158
+ - Checkout path degraded due to upstream timeout to an IP-only target that must be resolved first.
159
 
160
  Objective:
161
+ - Trace LB errors to the correct target, resolve resource identity via metadata, and restart i-web2 only after diagnosis.
162
 
163
  Typical successful sequence:
164
  1. list_resources
165
+ 2. view_logs(lb-main) to identify failing upstream IP (+0.2)
166
+ 3. query_metadata(ip_address=<failing_ip>) to resolve target ID (+0.2)
167
+ 4. describe_resource(i-web2) or view_logs(i-web2) (+0.2)
168
+ 5. restart_service(i-web2) (+0.8, done when all investigation achievements exist)
169
 
170
  Guardrails:
171
  - Restarting i-web2 before investigation: penalty (-0.1), no resolution.
172
  - Restarting healthy i-web1: penalty (-0.2).
173
  - Premature submit_solution in hard mode: penalty (-0.1), episode continues.
174
+ - If unresolved after step 8 in hard mode, lb-external also fails (cascading failure), increasing pressure and noise.
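
The cascading-failure guardrail can be sketched in standalone form as follows. This mirrors the logic of `_apply_cascading_failure` as it appears in this commit (no penalty through step 8, a one-time -0.05 when `lb-external` goes down, then -0.03 on each later unresolved step), but the free function `cascade_penalty` and its arguments are a simplification made for this example.

```python
def cascade_penalty(step_count: int, resolved: bool, lb_external: dict):
    """Return (penalty, message) for one hard-mode step; mutates lb_external."""
    if resolved or step_count <= 8:
        return 0.0, ""
    if lb_external.get("status") != "DOWN":
        # First unresolved step past the threshold: the edge LB fails.
        lb_external["status"] = "DOWN"
        return -0.05, "ALERT: Cascading failure detected. lb-external is DOWN."
    # Already down: keep charging a smaller penalty every step.
    return -0.03, ""


lb_external = {"type": "LoadBalancer", "status": "running"}
penalties = [cascade_penalty(step, False, lb_external)[0] for step in (8, 9, 10)]
print(penalties, lb_external["status"])
```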
175
 
176
  Expected score:
177
+ - near 1.0 after score clamping for strong trajectories (can exceed 1.0 raw before clamp)
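
To see why the raw total can exceed 1.0, the arithmetic for the five-step sequence above can be worked through as below. Two things are assumptions rather than facts from the source: that `list_resources` earns no shaped reward, and that the episode score is the sum of per-step rewards clamped to [0.0, 1.0]. The 0.01 action cost comes from `ACTION_COST` in this commit.

```python
ACTION_COST = 0.01  # CloudDevopsEnvironment.ACTION_COST in this commit

# (action, shaped reward before the action cost) for the documented hard path.
# list_resources is assumed to earn no shaped reward.
HARD_PATH = [
    ("list_resources", 0.0),
    ("view_logs(lb-main)", 0.2),
    ("query_metadata(ip_address=10.0.8.22)", 0.2),
    ("describe_resource(i-web2)", 0.2),
    ("restart_service(i-web2)", 0.8),
]

# Each step pays the action cost and is clipped to [-1.0, 1.0].
per_step = [max(-1.0, min(1.0, shaped - ACTION_COST)) for _, shaped in HARD_PATH]
raw_total = sum(per_step)

# Assumed episode-level clamp; the exact bounds are not shown in this diff.
clamped_score = max(0.0, min(1.0, raw_total))

print(round(raw_total, 2), clamped_score)
```

The path finishes in five steps, so the step-8 cascading-failure penalty never applies.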
178
 
179
  ## API Endpoints
180
 
inference.py CHANGED
@@ -63,10 +63,11 @@ def get_model_action(
63
  "You are an expert AI DevOps Engineer diagnosing a cloud infrastructure issue. "
64
  "You must respond ONLY with a raw JSON object matching this schema:\n"
65
  "{\n"
66
- ' "command": "list_resources" | "describe_resource" | "view_logs" | "update_security_group" | "restart_service" | "submit_solution",\n'
67
  ' "resource_id": "string (optional)",\n'
68
  ' "parameters": {"key": "value"} (optional)\n'
69
  "}\n"
 
70
  "Do not include markdown blocks like ```json. Just output the JSON."
71
  )
72
 
 
63
  "You are an expert AI DevOps Engineer diagnosing a cloud infrastructure issue. "
64
  "You must respond ONLY with a raw JSON object matching this schema:\n"
65
  "{\n"
66
+ ' "command": "list_resources" | "describe_resource" | "view_logs" | "query_metadata" | "update_security_group" | "restart_service" | "submit_solution",\n'
67
  ' "resource_id": "string (optional)",\n'
68
  ' "parameters": {"key": "value"} (optional)\n'
69
  "}\n"
70
+ "When logs provide only IP addresses, use query_metadata with parameters.ip_address to resolve the resource_id before remediation.\n"
71
  "Do not include markdown blocks like ```json. Just output the JSON."
72
  )
73
 
models.py CHANGED
@@ -24,6 +24,7 @@ class CloudAction(Action):
24
  "list_resources",
25
  "describe_resource",
26
  "view_logs",
 
27
  "update_security_group",
28
  "restart_service",
29
  "submit_solution",
@@ -32,14 +33,15 @@ class CloudAction(Action):
32
  default=None,
33
  description=(
34
  "The ID of the target resource (e.g., 'i-12345'). "
35
- "Required for all commands except list_resources."
36
  ),
37
  )
38
  parameters: Optional[Dict[str, Any]] = Field(
39
  default=None,
40
  description=(
41
  "Key-value pairs for updates "
42
- "(e.g., {'port': '80', 'action': 'allow'} for update_security_group)."
 
43
  ),
44
  )
45
  message: Optional[str] = Field(
 
24
  "list_resources",
25
  "describe_resource",
26
  "view_logs",
27
+ "query_metadata",
28
  "update_security_group",
29
  "restart_service",
30
  "submit_solution",
 
33
  default=None,
34
  description=(
35
  "The ID of the target resource (e.g., 'i-12345'). "
36
+ "Required for most commands except list_resources and query_metadata."
37
  ),
38
  )
39
  parameters: Optional[Dict[str, Any]] = Field(
40
  default=None,
41
  description=(
42
  "Key-value pairs for updates "
43
+ "(e.g., {'port': '80', 'action': 'allow'} for update_security_group, "
44
+ "or {'ip_address': '10.0.4.5'} for query_metadata)."
45
  ),
46
  )
47
  message: Optional[str] = Field(
server/cloud_devops_env_environment.py CHANGED
@@ -50,6 +50,7 @@ class CloudDevopsEnvironment(Environment):
50
  SUPPORTS_CONCURRENT_SESSIONS: bool = True
51
  MAX_STEPS: int = 20
52
  VALID_TASKS = {"easy", "medium", "hard"}
 
53
 
54
  def __init__(self, task_name: str = "easy"):
55
  """Initialize the cloud_devops_env environment."""
@@ -100,20 +101,29 @@ class CloudDevopsEnvironment(Environment):
100
  {
101
  "i-api": {
102
  "type": "Instance",
 
103
  "status": "running",
104
  "logs": (
105
  "[2026-04-06 17:01:22] [CRITICAL] "
106
  "sqlalchemy.exc.OperationalError: "
107
  "(psycopg2.OperationalError) connection to server at "
108
- "'10.0.4.5' (i-db), port 5432 failed: Connection timed out. "
109
  "Is the server running and accepting TCP/IP connections?"
110
  ),
111
  },
112
- "i-db": {"type": "Instance", "status": "running"},
 
 
 
 
113
  "sg-db": {
114
  "type": "SecurityGroup",
115
  "rules": [{"port": 22, "action": "allow"}],
116
  },
 
 
 
 
117
  }
118
  )
119
  return resources
@@ -126,13 +136,19 @@ class CloudDevopsEnvironment(Environment):
126
  "2026/04/06 17:02:09 [error] 3197#3197: *4189 upstream timed out "
127
  "(110: Connection timed out) while reading response header from upstream, "
128
  "client: 10.0.2.14, server: api.prod.local, request: \"GET /checkout HTTP/1.1\", "
129
- "upstream: \"http://i-web2:8080/checkout\", host: \"api.prod.local\"\n"
130
  "2026/04/06 17:02:10 [error] 3197#3197: *4190 no live upstreams while "
131
- "connecting to upstream \"i-web2\""
132
  ),
133
  },
 
 
 
 
 
134
  "i-web1": {
135
  "type": "Instance",
 
136
  "status": "running",
137
  "logs": (
138
  "[2026-04-06 17:02:11] INFO web-service: readiness probe passed\n"
@@ -141,6 +157,7 @@ class CloudDevopsEnvironment(Environment):
141
  },
142
  "i-web2": {
143
  "type": "Instance",
 
144
  "status": "degraded",
145
  "logs": (
146
  "kernel: Out of memory: Killed process 12345 (java) total-vm:4194304kB, "
@@ -153,10 +170,47 @@ class CloudDevopsEnvironment(Environment):
153
  "type": "SecurityGroup",
154
  "rules": [{"port": 80, "action": "allow"}],
155
  },
 
 
 
 
156
  }
157
  )
158
  return resources
159
 
160
  def _reward_once(self, achievement: str, points: float) -> float:
161
  if achievement in self._achievements:
162
  return 0.0
@@ -202,7 +256,7 @@ class CloudDevopsEnvironment(Environment):
202
  state = self._state_data
203
 
204
  state.step_count += 1
205
- reward = 0.0
206
  done = False
207
  output = ""
208
  error = None
@@ -245,6 +299,25 @@ class CloudDevopsEnvironment(Environment):
245
  elif self.task_name == "hard" and action.resource_id == "i-web2":
246
  reward += self._reward_once("inspect_target", 0.2)
247
 
248
  elif action.command == "update_security_group":
249
  if not action.resource_id:
250
  raise ValueError("resource_id is required for update_security_group.")
@@ -277,7 +350,11 @@ class CloudDevopsEnvironment(Environment):
277
  and action.resource_id == "sg-db"
278
  and port == 5432
279
  ):
280
- if "read_logs" in self._achievements:
 
 
 
 
281
  state.is_resolved = True
282
  reward += 0.6
283
  done = True
@@ -286,7 +363,7 @@ class CloudDevopsEnvironment(Environment):
286
  reward -= 0.1
287
  output += (
288
  "\nWARNING: Change applied without incident triage. "
289
- "Inspect API logs before closing the incident."
290
  )
291
 
292
  elif action.command == "restart_service":
@@ -302,6 +379,7 @@ class CloudDevopsEnvironment(Environment):
302
  investigated_root_cause = (
303
  "inspect_lb" in self._achievements
304
  and "inspect_target" in self._achievements
 
305
  )
306
  if investigated_root_cause:
307
  state.resources["i-web2"]["status"] = "running"
@@ -316,7 +394,7 @@ class CloudDevopsEnvironment(Environment):
316
  reward -= 0.1
317
  output += (
318
  "\nWARNING: Restart denied by change policy. "
319
- "Find failing upstream from lb-main and inspect i-web2 first."
320
  )
321
  elif action.resource_id == "i-web1":
322
  reward -= 0.2
@@ -349,19 +427,31 @@ class CloudDevopsEnvironment(Environment):
349
  error = str(exc)
350
  output = f"Command Failed: {error}"
351
 
 
 
 
 
 
352
  if state.step_count >= self.MAX_STEPS and not done:
353
  done = True
354
  timeout_suffix = "\nTIMEOUT: Max steps reached."
355
  output = f"{output}{timeout_suffix}" if output else timeout_suffix.strip()
356
 
357
  reward = max(-1.0, min(1.0, reward))
358
- status = "HEALTHY" if state.is_resolved else "CRITICAL"
 
 
 
 
 
 
359
  info = {
360
  "step_count": state.step_count,
361
  "resolved": state.is_resolved,
362
  "task": self.task_name,
363
  "achievements": sorted(self._achievements),
364
  "total_resources": len(state.resources),
 
365
  }
366
 
367
  return CloudObservation(
 
50
  SUPPORTS_CONCURRENT_SESSIONS: bool = True
51
  MAX_STEPS: int = 20
52
  VALID_TASKS = {"easy", "medium", "hard"}
53
+ ACTION_COST: float = 0.01
54
 
55
  def __init__(self, task_name: str = "easy"):
56
  """Initialize the cloud_devops_env environment."""
 
101
  {
102
  "i-api": {
103
  "type": "Instance",
104
+ "ip_address": "10.0.4.11",
105
  "status": "running",
106
  "logs": (
107
  "[2026-04-06 17:01:22] [CRITICAL] "
108
  "sqlalchemy.exc.OperationalError: "
109
  "(psycopg2.OperationalError) connection to server at "
110
+ "'10.0.4.5', port 5432 failed: Connection timed out. "
111
  "Is the server running and accepting TCP/IP connections?"
112
  ),
113
  },
114
+ "i-db": {
115
+ "type": "Instance",
116
+ "ip_address": "10.0.4.5",
117
+ "status": "running",
118
+ },
119
  "sg-db": {
120
  "type": "SecurityGroup",
121
  "rules": [{"port": 22, "action": "allow"}],
122
  },
123
+ "metadata-svc": {
124
+ "type": "MetadataService",
125
+ "status": "running",
126
+ },
127
  }
128
  )
129
  return resources
 
136
  "2026/04/06 17:02:09 [error] 3197#3197: *4189 upstream timed out "
137
  "(110: Connection timed out) while reading response header from upstream, "
138
  "client: 10.0.2.14, server: api.prod.local, request: \"GET /checkout HTTP/1.1\", "
139
+ "upstream: \"http://10.0.8.22:8080/checkout\", host: \"api.prod.local\"\n"
140
  "2026/04/06 17:02:10 [error] 3197#3197: *4190 no live upstreams while "
141
+ "connecting to upstream \"10.0.8.22\""
142
  ),
143
  },
144
+ "lb-external": {
145
+ "type": "LoadBalancer",
146
+ "status": "running",
147
+ "logs": "INFO: Edge traffic stable.",
148
+ },
149
  "i-web1": {
150
  "type": "Instance",
151
+ "ip_address": "10.0.8.21",
152
  "status": "running",
153
  "logs": (
154
  "[2026-04-06 17:02:11] INFO web-service: readiness probe passed\n"
 
157
  },
158
  "i-web2": {
159
  "type": "Instance",
160
+ "ip_address": "10.0.8.22",
161
  "status": "degraded",
162
  "logs": (
163
  "kernel: Out of memory: Killed process 12345 (java) total-vm:4194304kB, "
 
170
  "type": "SecurityGroup",
171
  "rules": [{"port": 80, "action": "allow"}],
172
  },
173
+ "metadata-svc": {
174
+ "type": "MetadataService",
175
+ "status": "running",
176
+ },
177
  }
178
  )
179
  return resources
180
 
181
+ def _lookup_resource_by_ip(self, ip_address: str) -> str | None:
182
+ if self._state_data is None:
183
+ return None
184
+ for resource_id, data in self._state_data.resources.items():
185
+ if data.get("ip_address") == ip_address:
186
+ return resource_id
187
+ return None
188
+
189
+ def _apply_cascading_failure(self) -> tuple[float, str]:
190
+ """Simulate system drift in hard mode if root cause is not fixed quickly."""
191
+ if self._state_data is None or self.task_name != "hard":
192
+ return 0.0, ""
193
+
194
+ state = self._state_data
195
+ if state.is_resolved or state.step_count <= 8:
196
+ return 0.0, ""
197
+
198
+ lb = state.resources.get("lb-external")
199
+ if not lb:
200
+ return 0.0, ""
201
+
202
+ if lb.get("status") != "DOWN":
203
+ lb["status"] = "DOWN"
204
+ lb["logs"] = (
205
+ "CRITICAL: Cascading failure triggered after prolonged unresolved OOM incident. "
206
+ "Edge load balancer stopped serving traffic."
207
+ )
208
+ return -0.05, (
209
+ "\nALERT: Cascading failure detected. lb-external is DOWN due to delayed remediation."
210
+ )
211
+
212
+ return -0.03, ""
213
+
214
  def _reward_once(self, achievement: str, points: float) -> float:
215
  if achievement in self._achievements:
216
  return 0.0
 
256
  state = self._state_data
257
 
258
  state.step_count += 1
259
+ reward = -self.ACTION_COST
260
  done = False
261
  output = ""
262
  error = None
 
299
  elif self.task_name == "hard" and action.resource_id == "i-web2":
300
  reward += self._reward_once("inspect_target", 0.2)
301
 
302
+ elif action.command == "query_metadata":
303
+ ip_address = None
304
+ if action.parameters and isinstance(action.parameters, dict):
305
+ ip_address = action.parameters.get("ip_address")
306
+ if not ip_address and action.resource_id:
307
+ ip_address = action.resource_id
308
+ if not ip_address:
309
+ raise ValueError("query_metadata requires parameters.ip_address.")
310
+
311
+ resource_id = self._lookup_resource_by_ip(str(ip_address))
312
+ if not resource_id:
313
+ raise ValueError(f"No resource found for ip_address={ip_address}")
314
+
315
+ output = f"Metadata lookup: ip_address={ip_address} resource_id={resource_id}"
316
+ if self.task_name == "medium" and str(ip_address) == "10.0.4.5":
317
+ reward += self._reward_once("lookup_db_target", 0.2)
318
+ elif self.task_name == "hard" and str(ip_address) == "10.0.8.22":
319
+ reward += self._reward_once("lookup_upstream_target", 0.2)
320
+
321
  elif action.command == "update_security_group":
322
  if not action.resource_id:
323
  raise ValueError("resource_id is required for update_security_group.")
 
350
  and action.resource_id == "sg-db"
351
  and port == 5432
352
  ):
353
+ investigated = (
354
+ "read_logs" in self._achievements
355
+ and "lookup_db_target" in self._achievements
356
+ )
357
+ if investigated:
358
  state.is_resolved = True
359
  reward += 0.6
360
  done = True
 
363
  reward -= 0.1
364
  output += (
365
  "\nWARNING: Change applied without incident triage. "
366
+ "Inspect API logs and resolve DB IP via query_metadata before closing the incident."
367
  )
368
 
369
  elif action.command == "restart_service":
 
379
  investigated_root_cause = (
380
  "inspect_lb" in self._achievements
381
  and "inspect_target" in self._achievements
382
+ and "lookup_upstream_target" in self._achievements
383
  )
384
  if investigated_root_cause:
385
  state.resources["i-web2"]["status"] = "running"
 
394
  reward -= 0.1
395
  output += (
396
  "\nWARNING: Restart denied by change policy. "
397
+ "Find failing upstream IP from lb-main, resolve it with query_metadata, and inspect i-web2 first."
398
  )
399
  elif action.resource_id == "i-web1":
400
  reward -= 0.2
 
427
  error = str(exc)
428
  output = f"Command Failed: {error}"
429
 
430
+ cascade_penalty, cascade_msg = self._apply_cascading_failure()
431
+ reward += cascade_penalty
432
+ if cascade_msg:
433
+ output = f"{output}{cascade_msg}" if output else cascade_msg.strip()
434
+
435
  if state.step_count >= self.MAX_STEPS and not done:
436
  done = True
437
  timeout_suffix = "\nTIMEOUT: Max steps reached."
438
  output = f"{output}{timeout_suffix}" if output else timeout_suffix.strip()
439
 
440
  reward = max(-1.0, min(1.0, reward))
441
+ lb_external = state.resources.get("lb-external", {})
442
+ if state.is_resolved:
443
+ status = "HEALTHY"
444
+ elif self.task_name == "hard" and lb_external.get("status") == "DOWN":
445
+ status = "DEGRADED"
446
+ else:
447
+ status = "CRITICAL"
448
  info = {
449
  "step_count": state.step_count,
450
  "resolved": state.is_resolved,
451
  "task": self.task_name,
452
  "achievements": sorted(self._achievements),
453
  "total_resources": len(state.resources),
454
+ "action_cost": self.ACTION_COST,
455
  }
456
 
457
  return CloudObservation(