XcodeAddy committed
Commit 18aa055 · 1 Parent(s): 026e80b

Keep grader rewards strictly within unit interval

CHANGELOG_AND_RUNBOOK.md CHANGED
@@ -114,7 +114,7 @@ The backend now prints useful logs when the UI or API is used:
 
  ```text
  [RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
- [STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+ [STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=0.99 done=true
  [STATE] session_id=... incident_id=INC-014 done=true
  [STEP_ERROR] session_id=... incident_id=INC-014 error=...
  ```
@@ -203,7 +203,7 @@ Run a correct hard-task case:
 
  Expected result:
 
- - `reward.value` is `1.0`.
+ - `reward.value` is `0.99`.
  - `done` is `true`.
  - `info.correct` is `true`.
  - `info.ground_truth` is `FAILOVER`.
@@ -218,7 +218,7 @@ Expected terminal logs:
 
  ```text
  [RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
- [STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+ [STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=0.99 done=true
  ```
 
  Run a task1 case:
@@ -232,7 +232,7 @@ Run a task1 case:
 
  Expected result:
 
- - reward should be `1.0`.
+ - reward should be `0.99`.
 
  Run a task2 case:
 
@@ -245,7 +245,7 @@ Run a task2 case:
 
  Expected result:
 
- - reward should be `1.0`.
+ - reward should be `0.99`.
 
  ## 4. Test backend API with curl
 
@@ -311,7 +311,7 @@ Expected state:
 
  - `done` is `true`
  - `status` is `completed`
- - `last_reward` is `1.0`
+ - `last_reward` is `0.99`
 
  ## 5. Test backend edge cases
 
@@ -408,8 +408,8 @@ Expected log format:
 
  ```text
  [START] task=INC-001 env=incident-triage-env model=...
- [STEP] step=1 action=SEV1 reward=1.00 done=true error=null
- [END] success=true steps=1 score=1.00 rewards=1.00
+ [STEP] step=1 action=SEV1 reward=0.99 done=true error=null
+ [END] success=true steps=1 score=0.99 rewards=0.99
  ```
 
  If no server is reachable, `inference.py` falls back to an in-process FastAPI client.
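
The runbook's expected `[STEP]` lines now carry rewards strictly inside the unit interval. A small illustrative check (not part of this commit) that parses a `[STEP]` line in the format shown above and verifies the open-interval bound:

```python
# Illustrative snippet, not repository code: extract the reward field from a [STEP]
# log line and confirm it stays strictly between 0.0 and 1.0.
import re

def reward_from_step_log(line: str) -> float:
    match = re.search(r"reward=([0-9.]+)", line)
    if match is None:
        raise ValueError(f"no reward field in log line: {line!r}")
    return float(match.group(1))

line = "[STEP] session_id=abc incident_id=INC-014 task_type=task3 answer=FAILOVER reward=0.99 done=true"
value = reward_from_step_log(line)
assert 0.0 < value < 1.0  # 0.0 and 1.0 themselves are no longer expected
```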
README.md CHANGED
@@ -114,9 +114,9 @@ Validation rules:
 
  Rewarding is deterministic and implemented in [graders.py](./graders.py).
 
- - `task1`: `1.0` exact, `0.5` adjacent severity, `0.0` far miss
- - `task2`: `1.0` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.0` wrong
- - `task3`: `1.0` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.0` wrong
+ - `task1`: `0.99` exact, `0.5` adjacent severity, `0.01` far miss
+ - `task2`: `0.99` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.01` wrong
+ - `task3`: `0.99` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.01` wrong
 
  This keeps grading reproducible while still giving partial-credit trajectory signal.
 
@@ -218,8 +218,8 @@ curl http://localhost:7860/health
 
  ```text
  [START] task=INC-001 env=incident-triage-env model=deterministic-baseline
- [STEP] step=1 action=SEV1 reward=1.00 done=true error=null
- [END] success=true steps=1 score=1.00 rewards=1.00
+ [STEP] step=1 action=SEV1 reward=0.99 done=true error=null
+ [END] success=true steps=1 score=0.99 rewards=0.99
  ```
 
  ## Baseline Scores
@@ -229,10 +229,10 @@ Latest local deterministic baseline:
  | Metric | Value |
  |---|---:|
  | Episodes | 108 |
- | Average score | 0.9954 |
- | `task1` average | 1.0000 |
- | `task2` average | 0.9861 |
- | `task3` average | 1.0000 |
+ | Average score | 0.9855 |
+ | `task1` average | 0.9900 |
+ | `task2` average | 0.9764 |
+ | `task3` average | 0.9900 |
 
  This deterministic local run completed in about `1.34s` on the current machine.
  Results are written by default to `/tmp/outputs/baseline_scores.json`.
@@ -274,4 +274,5 @@ curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
 
  - `models.py` is the source of truth for valid enum labels.
  - `graders.py` is the source of truth for scoring logic.
+ - Reward values are kept strictly within `(0, 1)` to satisfy Phase 2 validator constraints.
  - The environment is intentionally single-step per episode and still exposes typed state for validation and debugging.
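
The `task1` tier list above maps directly onto a severity-distance lookup. A standalone sketch (not repository code; the constants mirror `graders.py`) of that arithmetic with the new bounds:

```python
# Standalone illustration of the task1 tiers: exact -> 0.99, adjacent band -> 0.5, far miss -> 0.01.
_SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
_EXACT, _ZERO = 0.99, 0.01

def severity_score(predicted: str, expected: str) -> float:
    distance = abs(_SEV_ORDER[predicted] - _SEV_ORDER[expected])
    return {0: _EXACT, 1: 0.5}.get(distance, _ZERO)

assert severity_score("SEV1", "SEV1") == 0.99  # exact match
assert severity_score("SEV2", "SEV1") == 0.5   # adjacent severity band
assert severity_score("SEV3", "SEV1") == 0.01  # far miss, still above 0.0
```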
app.py CHANGED
@@ -263,11 +263,11 @@ def state(session_id: str):
 def get_grader_info():
     return {
         "grading": "deterministic",
-         "scoring": "task1: adjacent-severity partial credit; task2/task3: exact match plus conservative near-miss partial credit",
+         "scoring": "task1: adjacent-severity partial credit; task2/task3: exact match plus conservative near-miss partial credit; all rewards remain strictly within (0, 1)",
         "tasks": {
-             "task1": "exact=1.0, adjacent=0.5, far=0.0",
-             "task2": "exact=1.0, related-domain=0.5, unknown=0.25, wrong=0.0",
-             "task3": "exact=1.0, investigate fallback=0.4, related response=0.25, wrong=0.0",
+             "task1": "exact=0.99, adjacent=0.5, far=0.01",
+             "task2": "exact=0.99, related-domain=0.5, unknown=0.25, wrong=0.01",
+             "task3": "exact=0.99, investigate fallback=0.4, related response=0.25, wrong=0.01",
         },
         "notes": {
             "task2": [
environment.py CHANGED
@@ -150,7 +150,7 @@ class IncidentEnv:
             "episode_id": self.episode_id,
             "task_name": self._task_spec()["name"],
             "difficulty": self._task_spec()["difficulty"],
-             "correct": reward_value == 1.0,
+             "correct": agent_answer == ground_truth_value,
             "ground_truth": ground_truth_value,
             "agent_answer": agent_answer,
             "selected_field": selected_field,
graders.py CHANGED
@@ -1,15 +1,7 @@
 from models import IncidentAction
 
 _SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
- # Related-domain partial credit is intentionally conservative.
- # DATABASE <-> APPLICATION captures incidents where app bugs manifest as
- # database saturation and vice versa.
- # NETWORK <-> INFRASTRUCTURE captures physical or platform-layer correlation.
- # NETWORK <-> THIRD_PARTY captures dependency outages that resemble network loss.
- # INFRASTRUCTURE <-> THIRD_PARTY captures external services failing through shared
- # platform primitives.
- # APPLICATION <-> THIRD_PARTY is intentionally not included because we treat
- # product-code failures and vendor degradation as materially different diagnoses.
+
 _TASK2_RELATED_GROUPS = [
     {"DATABASE", "APPLICATION"},
     {"NETWORK", "INFRASTRUCTURE"},
@@ -24,15 +16,19 @@ _TASK3_PARTIAL = {
     ("RESTART_SERVICE", "INVESTIGATE"): 0.25,
 }
 
+ # Scores must be strictly within (0, 1) — 0.0 and 1.0 are rejected by the validator.
+ _EXACT = 0.99
+ _ZERO = 0.01
+
 
 def grade_task1(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.severity is None:
-         return 0.0, "Missing severity classification."
+         return _ZERO, "Missing severity classification."
     predicted = _SEV_ORDER.get(action.severity.value, -1)
     expected = _SEV_ORDER.get(ground_truth["severity"], -1)
     distance = abs(predicted - expected)
-     score = {0: 1.0, 1: 0.5, 2: 0.0}.get(distance, 0.0)
-     if score == 1.0:
+     score = {0: _EXACT, 1: 0.5, 2: _ZERO}.get(distance, _ZERO)
+     if score == _EXACT:
         return score, "Exact severity match."
     if score == 0.5:
         return score, "Adjacent severity band: partial credit for a close escalation call."
@@ -41,44 +37,40 @@ def grade_task1(action: IncidentAction, ground_truth: dict) -> tuple[float, str]
 
 def grade_task2(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.root_cause is None:
-         return 0.0, "Missing root-cause classification."
+         return _ZERO, "Missing root-cause classification."
 
     predicted = action.root_cause.value
     expected = ground_truth["root_cause"]
 
     if predicted == expected:
-         return 1.0, "Exact root-cause match."
+         return _EXACT, "Exact root-cause match."
     if predicted == "UNKNOWN":
         return 0.25, "Conservative fallback: uncertainty recognized, but the failure domain was not isolated."
-     # Related groups are intentionally defined as exact 2-label pairs.
-     # Keep equality here so we do not silently broaden partial-credit semantics.
     if any({predicted, expected} == group for group in _TASK2_RELATED_GROUPS):
         return 0.5, "Related failure domain selected: partial credit for a near-miss diagnosis."
-     return 0.0, "Root-cause classification does not match the expected failure domain."
+     return _ZERO, "Root-cause classification does not match the expected failure domain."
 
 
 def grade_task3(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.action is None:
-         return 0.0, "Missing remediation recommendation."
+         return _ZERO, "Missing remediation recommendation."
 
     predicted = action.action.value
     expected = ground_truth["action"]
 
     if predicted == expected:
-         return 1.0, "Exact remediation match."
+         return _EXACT, "Exact remediation match."
     if predicted == "INVESTIGATE" and expected != "NO_ACTION":
         return 0.4, "Safe investigative fallback: the incident was recognized, but the optimal action was not taken."
-     # Choosing NO_ACTION when investigation was expected is scored more harshly
-     # than the reverse because it risks missing a real incident entirely.
     if predicted == "NO_ACTION" and expected == "INVESTIGATE":
         return 0.25, "Conservative response, but deeper investigation was expected."
     if (predicted, expected) in _TASK3_PARTIAL:
         return _TASK3_PARTIAL[(predicted, expected)], "Related remediation selected: partial credit for a close operational response."
-     return 0.0, "Recommended action does not match the expected operator response."
+     return _ZERO, "Recommended action does not match the expected operator response."
 
 
 GRADERS = {
     "task1": grade_task1,
     "task2": grade_task2,
     "task3": grade_task3,
- }
+ }
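
With `_EXACT` and `_ZERO` in place, dispatching a grader for one ticket follows the pattern already used in `tests/test_graders.py`. A minimal sketch; the inline ticket dict is hypothetical, and it assumes `IncidentAction` accepts the label strings and leaves unset classification fields as `None`:

```python
from graders import GRADERS
from models import IncidentAction

# Hypothetical ticket shaped like the dataset entries referenced in the tests.
ticket = {"incident_id": "INC-014", "task_type": "task3", "ground_truth": {"action": "FAILOVER"}}

action = IncidentAction(incident_id=ticket["incident_id"], **ticket["ground_truth"])
score, reason = GRADERS[ticket["task_type"]](action, ticket["ground_truth"])
assert score == 0.99          # exact answers now score _EXACT instead of 1.0
assert 0.0 < score < 1.0      # every grader path stays strictly inside the unit interval
```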
openenv.yaml CHANGED
@@ -66,21 +66,21 @@ tasks:
     difficulty: easy
     output_field: severity
     labels: [SEV1, SEV2, SEV3]
-     reward: "1.0 exact | 0.5 adjacent severity | 0.0 far miss"
+     reward: "0.99 exact | 0.5 adjacent severity | 0.01 far miss"
 
   task2:
     name: Root Cause Classification
     difficulty: medium
     output_field: root_cause
     labels: [DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN]
-     reward: "1.0 exact | 0.5 related domain | 0.25 UNKNOWN fallback | 0.0 wrong"
+     reward: "0.99 exact | 0.5 related domain | 0.25 UNKNOWN fallback | 0.01 wrong"
 
   task3:
     name: Recommended Action
     difficulty: hard
     output_field: action
     labels: [ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION]
-     reward: "1.0 exact | 0.4 safe investigate fallback | 0.25 related action | 0.0 wrong"
+     reward: "0.99 exact | 0.4 safe investigate fallback | 0.25 related action | 0.01 wrong"
 
 dataset:
   total_tickets: 108
@@ -93,7 +93,7 @@ baseline:
   script: inference.py
   required_env_vars: [API_BASE_URL, MODEL_NAME, HF_TOKEN]
   optional_env_vars: [ENV_URL]
-   latest_local_score: 0.9954
+   latest_local_score: 0.9855
   latest_local_episodes: 108
 
 reproducibility:
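
The updated `latest_local_score` is consistent with the per-task averages in the README baseline table, assuming the 108 episodes split evenly across the three tasks (36 each, an assumption not stated in the config):

```python
# Sanity check on the reported baseline, assuming an even 36/36/36 task split.
task_averages = {"task1": 0.9900, "task2": 0.9764, "task3": 0.9900}
overall = sum(task_averages.values()) / len(task_averages)
assert round(overall, 4) == 0.9855  # matches latest_local_score
```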
tests/test_env.py CHANGED
@@ -117,7 +117,7 @@ class IncidentEnvApiTests(unittest.TestCase):
         self.assertEqual(step_response.status_code, 200)
         step_body = step_response.json()
         self.assertTrue(step_body["done"])
-         self.assertEqual(step_body["reward"]["value"], 1.0)
+         self.assertEqual(step_body["reward"]["value"], 0.99)
         self.assertTrue(step_body["info"]["correct"])
         self.assertEqual(step_body["info"]["ground_truth"], "FAILOVER")
 
@@ -126,7 +126,7 @@ class IncidentEnvApiTests(unittest.TestCase):
         state_body = state_response.json()
         self.assertTrue(state_body["done"])
         self.assertEqual(state_body["status"], "completed")
-         self.assertEqual(state_body["last_reward"], 1.0)
+         self.assertEqual(state_body["last_reward"], 0.99)
         self.assertNotIn(session_id, sessions)
         self.assertIn(session_id, completed_states)
 
tests/test_graders.py CHANGED
@@ -6,7 +6,7 @@ from models import IncidentAction
 
 
 class GraderTests(unittest.TestCase):
-     def test_all_ticket_ground_truth_scores_are_bounded(self) -> None:
+     def test_all_ticket_ground_truth_scores_stay_strictly_within_unit_interval(self) -> None:
         for ticket in TICKETS:
             action = IncidentAction(
                 incident_id=ticket["incident_id"],
@@ -14,8 +14,8 @@ class GraderTests(unittest.TestCase):
                 **ticket["ground_truth"],
             )
             score, reason = GRADERS[ticket["task_type"]](action, ticket["ground_truth"])
-             self.assertGreaterEqual(score, 0.0, ticket["incident_id"])
-             self.assertLessEqual(score, 1.0, ticket["incident_id"])
+             self.assertGreater(score, 0.0, ticket["incident_id"])
+             self.assertLess(score, 1.0, ticket["incident_id"])
             self.assertIsInstance(reason, str)
 
     def test_task1_grader_supports_partial_credit(self) -> None:
@@ -31,7 +31,7 @@ class GraderTests(unittest.TestCase):
         )
         exact_score, _ = grade_task1(exact, {"severity": "SEV1"})
         adjacent_score, _ = grade_task1(adjacent, {"severity": "SEV1"})
-         self.assertEqual(exact_score, 1.0)
+         self.assertEqual(exact_score, 0.99)
         self.assertEqual(adjacent_score, 0.5)
 
     def test_task2_grader_is_not_constant(self) -> None:
@@ -53,9 +53,9 @@ class GraderTests(unittest.TestCase):
         exact_score, _ = grade_task2(exact, {"root_cause": "DATABASE"})
         fallback_score, _ = grade_task2(fallback, {"root_cause": "DATABASE"})
         wrong_score, _ = grade_task2(wrong, {"root_cause": "DATABASE"})
-         self.assertEqual(exact_score, 1.0)
+         self.assertEqual(exact_score, 0.99)
         self.assertEqual(fallback_score, 0.25)
-         self.assertEqual(wrong_score, 0.0)
+         self.assertEqual(wrong_score, 0.01)
 
     def test_task2_grader_rewards_related_domain_partial_credit(self) -> None:
         near_miss = IncidentAction(
@@ -86,9 +86,9 @@ class GraderTests(unittest.TestCase):
         exact_score, _ = grade_task3(exact, {"action": "FAILOVER"})
         fallback_score, _ = grade_task3(fallback, {"action": "FAILOVER"})
         wrong_score, _ = grade_task3(wrong, {"action": "FAILOVER"})
-         self.assertEqual(exact_score, 1.0)
+         self.assertEqual(exact_score, 0.99)
         self.assertEqual(fallback_score, 0.4)
-         self.assertEqual(wrong_score, 0.0)
+         self.assertEqual(wrong_score, 0.01)
 
     def test_task3_grader_rewards_related_action_partial_credit(self) -> None:
         restart_instead_of_failover = IncidentAction(
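
One way to run both updated test modules locally with nothing beyond the standard library (a sketch; the repo's actual test runner is not shown in this commit):

```python
# Discover and run the test modules under tests/ using unittest's standard API.
import unittest

suite = unittest.defaultTestLoader.discover("tests", pattern="test_*.py")
unittest.TextTestRunner(verbosity=2).run(suite)
```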