Keep grader rewards strictly within unit interval
Files changed:

- CHANGELOG_AND_RUNBOOK.md (+8 −8)
- README.md (+10 −9)
- app.py (+4 −4)
- environment.py (+1 −1)
- graders.py (+15 −23)
- openenv.yaml (+4 −4)
- tests/test_env.py (+2 −2)
- tests/test_graders.py (+8 −8)
CHANGELOG_AND_RUNBOOK.md (CHANGED)

@@ -114,7 +114,7 @@ The backend now prints useful logs when the UI or API is used:
 
 ```text
 [RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
-[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=0.99 done=true
 [STATE] session_id=... incident_id=INC-014 done=true
 [STEP_ERROR] session_id=... incident_id=INC-014 error=...
 ```
@@ -203,7 +203,7 @@ Run a correct hard-task case:
 
 Expected result:
 
-- `reward.value` is `1.0`.
+- `reward.value` is `0.99`.
 - `done` is `true`.
 - `info.correct` is `true`.
 - `info.ground_truth` is `FAILOVER`.
@@ -218,7 +218,7 @@ Expected terminal logs:
 
 ```text
 [RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
-[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=0.99 done=true
 ```
 
 Run a task1 case:
@@ -232,7 +232,7 @@ Run a task1 case:
 
 Expected result:
 
-- reward should be `1.0`.
+- reward should be `0.99`.
 
 Run a task2 case:
 
@@ -245,7 +245,7 @@ Run a task2 case:
 
 Expected result:
 
-- reward should be `1.0`.
+- reward should be `0.99`.
 
 ## 4. Test backend API with curl
 
@@ -311,7 +311,7 @@ Expected state:
 
 - `done` is `true`
 - `status` is `completed`
-- `last_reward` is `1.0`
+- `last_reward` is `0.99`
 
 ## 5. Test backend edge cases
 
@@ -408,8 +408,8 @@ Expected log format:
 
 ```text
 [START] task=INC-001 env=incident-triage-env model=...
-[STEP] step=1 action=SEV1 reward=1.0 done=true error=null
-[END] success=true steps=1 score=1.0 rewards=1.0
+[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
+[END] success=true steps=1 score=0.99 rewards=0.99
 ```
 
 If no server is reachable, `inference.py` falls back to an in-process FastAPI client.
README.md (CHANGED)

@@ -114,9 +114,9 @@ Validation rules:
 
 Rewarding is deterministic and implemented in [graders.py](./graders.py).
 
-- `task1`: `1.0` exact, `0.5` adjacent severity, `0.0` far miss
-- `task2`: `1.0` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.0` wrong
-- `task3`: `1.0` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.0` wrong
+- `task1`: `0.99` exact, `0.5` adjacent severity, `0.01` far miss
+- `task2`: `0.99` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.01` wrong
+- `task3`: `0.99` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.01` wrong
 
 This keeps grading reproducible while still giving partial-credit trajectory signal.
 
@@ -218,8 +218,8 @@ curl http://localhost:7860/health
 
 ```text
 [START] task=INC-001 env=incident-triage-env model=deterministic-baseline
-[STEP] step=1 action=SEV1 reward=1.0 done=true error=null
-[END] success=true steps=1 score=1.0 rewards=1.0
+[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
+[END] success=true steps=1 score=0.99 rewards=0.99
 ```
 
 ## Baseline Scores
@@ -229,10 +229,10 @@ Latest local deterministic baseline:
 | Metric | Value |
 |---|---:|
 | Episodes | 108 |
-| Average score | 0.
-| `task1` average |
-| `task2` average | 0.
-| `task3` average |
+| Average score | 0.9855 |
+| `task1` average | 0.9900 |
+| `task2` average | 0.9764 |
+| `task3` average | 0.9900 |
 
 This deterministic local run completed in about `1.34s` on the current machine.
 Results are written by default to `/tmp/outputs/baseline_scores.json`.
@@ -274,4 +274,5 @@ curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
 
 - `models.py` is the source of truth for valid enum labels.
 - `graders.py` is the source of truth for scoring logic.
+- Reward values are kept strictly within `(0, 1)` to satisfy Phase 2 validator constraints.
 - The environment is intentionally single-step per episode and still exposes typed state for validation and debugging.
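The updated baseline table can be sanity-checked in a few lines. This is an illustrative sketch, assuming (as the round episode count suggests) that the 108 episodes split evenly across the three tasks, 36 each; the variable names are not from the repository:

```python
# Per-task averages from the baseline table in README.md.
task_averages = {"task1": 0.9900, "task2": 0.9764, "task3": 0.9900}

# Assuming an even 36/36/36 split of the 108 episodes, the overall
# average is simply the mean of the per-task averages.
overall = sum(task_averages.values()) / len(task_averages)

print(round(overall, 4))  # 0.9855, matching the "Average score" row
```

This agrees with the `0.9855` reported in the table and in `openenv.yaml`'s `latest_local_score`.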
app.py (CHANGED)

@@ -263,11 +263,11 @@ def state(session_id: str):
 def get_grader_info():
     return {
         "grading": "deterministic",
-        "scoring": "task1: adjacent-severity partial credit; task2/task3: exact match plus conservative near-miss partial credit",
+        "scoring": "task1: adjacent-severity partial credit; task2/task3: exact match plus conservative near-miss partial credit; all rewards remain strictly within (0, 1)",
         "tasks": {
-            "task1": "exact=1.0, adjacent=0.5, far=0.0",
-            "task2": "exact=1.0, related-domain=0.5, unknown=0.25, wrong=0.0",
-            "task3": "exact=1.0, investigate fallback=0.4, related response=0.25, wrong=0.0",
+            "task1": "exact=0.99, adjacent=0.5, far=0.01",
+            "task2": "exact=0.99, related-domain=0.5, unknown=0.25, wrong=0.01",
+            "task3": "exact=0.99, investigate fallback=0.4, related response=0.25, wrong=0.01",
         },
         "notes": {
             "task2": [
environment.py (CHANGED)

@@ -150,7 +150,7 @@ class IncidentEnv:
             "episode_id": self.episode_id,
             "task_name": self._task_spec()["name"],
             "difficulty": self._task_spec()["difficulty"],
-            "correct":
+            "correct": agent_answer == ground_truth_value,
             "ground_truth": ground_truth_value,
             "agent_answer": agent_answer,
             "selected_field": selected_field,
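The `environment.py` change matters because correctness can no longer be inferred from the reward once exact matches score `0.99`. A minimal standalone sketch (not the project's code; `EXACT_REWARD` is an assumed stand-in for `graders.py`'s `_EXACT`):

```python
# Once exact matches earn 0.99 instead of 1.0, any correctness check
# tied to the reward value silently stops firing, while a direct
# answer-to-ground-truth comparison stays stable.
EXACT_REWARD = 0.99  # assumed constant mirroring graders.py

agent_answer, ground_truth_value = "FAILOVER", "FAILOVER"
reward = EXACT_REWARD if agent_answer == ground_truth_value else 0.01

correct_via_reward = reward == 1.0                       # never True now
correct_via_answer = agent_answer == ground_truth_value  # robust check

print(correct_via_reward, correct_via_answer)  # False True
```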
graders.py (CHANGED)

@@ -1,15 +1,7 @@
 from models import IncidentAction
 
 _SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
 
-# DATABASE <-> APPLICATION captures incidents where app bugs manifest as
-# database saturation and vice versa.
-# NETWORK <-> INFRASTRUCTURE captures physical or platform-layer correlation.
-# NETWORK <-> THIRD_PARTY captures dependency outages that resemble network loss.
-# INFRASTRUCTURE <-> THIRD_PARTY captures external services failing through shared
-# platform primitives.
-# APPLICATION <-> THIRD_PARTY is intentionally not included because we treat
-# product-code failures and vendor degradation as materially different diagnoses.
 _TASK2_RELATED_GROUPS = [
     {"DATABASE", "APPLICATION"},
     {"NETWORK", "INFRASTRUCTURE"},
@@ -24,15 +16,19 @@ _TASK3_PARTIAL = {
     ("RESTART_SERVICE", "INVESTIGATE"): 0.25,
 }
 
+# Scores must stay strictly within (0, 1): 0.0 and 1.0 are rejected by the validator.
+_EXACT = 0.99
+_ZERO = 0.01
+
 
 def grade_task1(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.severity is None:
-        return 0.0, "Missing severity classification."
+        return _ZERO, "Missing severity classification."
     predicted = _SEV_ORDER.get(action.severity.value, -1)
     expected = _SEV_ORDER.get(ground_truth["severity"], -1)
     distance = abs(predicted - expected)
-    score = {0: 1.0, 1: 0.5, 2: 0.0}.get(distance, 0.0)
-    if score == 1.0:
+    score = {0: _EXACT, 1: 0.5, 2: _ZERO}.get(distance, _ZERO)
+    if score == _EXACT:
         return score, "Exact severity match."
     if score == 0.5:
         return score, "Adjacent severity band: partial credit for a close escalation call."
@@ -41,44 +37,40 @@ def grade_task1(action: IncidentAction, ground_truth: dict) -> tuple[float, str]
 
 def grade_task2(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.root_cause is None:
-        return 0.0, "Missing root-cause classification."
+        return _ZERO, "Missing root-cause classification."
 
     predicted = action.root_cause.value
     expected = ground_truth["root_cause"]
 
     if predicted == expected:
-        return 1.0, "Exact root-cause match."
+        return _EXACT, "Exact root-cause match."
     if predicted == "UNKNOWN":
         return 0.25, "Conservative fallback: uncertainty recognized, but the failure domain was not isolated."
-    # Related groups are intentionally defined as exact 2-label pairs.
-    # Keep equality here so we do not silently broaden partial-credit semantics.
     if any({predicted, expected} == group for group in _TASK2_RELATED_GROUPS):
         return 0.5, "Related failure domain selected: partial credit for a near-miss diagnosis."
-    return 0.0, "Root-cause classification does not match the expected failure domain."
+    return _ZERO, "Root-cause classification does not match the expected failure domain."
 
 
 def grade_task3(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.action is None:
-        return 0.0, "Missing remediation recommendation."
+        return _ZERO, "Missing remediation recommendation."
 
     predicted = action.action.value
     expected = ground_truth["action"]
 
     if predicted == expected:
-        return 1.0, "Exact remediation match."
+        return _EXACT, "Exact remediation match."
     if predicted == "INVESTIGATE" and expected != "NO_ACTION":
         return 0.4, "Safe investigative fallback: the incident was recognized, but the optimal action was not taken."
-    # Choosing NO_ACTION when investigation was expected is scored more harshly
-    # than the reverse because it risks missing a real incident entirely.
     if predicted == "NO_ACTION" and expected == "INVESTIGATE":
         return 0.25, "Conservative response, but deeper investigation was expected."
     if (predicted, expected) in _TASK3_PARTIAL:
         return _TASK3_PARTIAL[(predicted, expected)], "Related remediation selected: partial credit for a close operational response."
-    return 0.0, "Recommended action does not match the expected operator response."
+    return _ZERO, "Recommended action does not match the expected operator response."
 
 
 GRADERS = {
     "task1": grade_task1,
     "task2": grade_task2,
     "task3": grade_task3,
-}
+}
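The core of the new scoring scheme is visible in `grade_task1`'s distance table. A self-contained sketch of that table, without the project's `IncidentAction` model (the constants mirror the diff; `severity_score` is an illustrative helper, not a repository function):

```python
# Standalone sketch of grade_task1's scoring table. Every returned score
# stays strictly inside (0, 1), as the commit requires.
_SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
_EXACT, _ZERO = 0.99, 0.01

def severity_score(predicted: str, expected: str) -> float:
    # 0 bands apart -> exact match, 1 apart -> adjacent partial credit,
    # anything else -> far miss.
    distance = abs(_SEV_ORDER.get(predicted, -1) - _SEV_ORDER.get(expected, -1))
    return {0: _EXACT, 1: 0.5, 2: _ZERO}.get(distance, _ZERO)

print(severity_score("SEV1", "SEV1"))  # 0.99
print(severity_score("SEV2", "SEV1"))  # 0.5
print(severity_score("SEV3", "SEV1"))  # 0.01
```

Because the table's values are `0.99`, `0.5`, and `0.01`, the strict `(0, 1)` bound holds by construction for every input.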
openenv.yaml (CHANGED)

@@ -66,21 +66,21 @@ tasks:
     difficulty: easy
     output_field: severity
     labels: [SEV1, SEV2, SEV3]
-    reward: "1.0 exact | 0.5 adjacent severity | 0.0 far miss"
+    reward: "0.99 exact | 0.5 adjacent severity | 0.01 far miss"
 
   task2:
     name: Root Cause Classification
     difficulty: medium
     output_field: root_cause
     labels: [DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN]
-    reward: "1.0 exact | 0.5 related domain | 0.25 UNKNOWN fallback | 0.0 wrong"
+    reward: "0.99 exact | 0.5 related domain | 0.25 UNKNOWN fallback | 0.01 wrong"
 
   task3:
     name: Recommended Action
     difficulty: hard
     output_field: action
     labels: [ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION]
-    reward: "1.0 exact | 0.4 safe investigate fallback | 0.25 related action | 0.0 wrong"
+    reward: "0.99 exact | 0.4 safe investigate fallback | 0.25 related action | 0.01 wrong"
 
 dataset:
   total_tickets: 108
@@ -93,7 +93,7 @@ baseline:
   script: inference.py
   required_env_vars: [API_BASE_URL, MODEL_NAME, HF_TOKEN]
   optional_env_vars: [ENV_URL]
-  latest_local_score: 0.
+  latest_local_score: 0.9855
   latest_local_episodes: 108
 
 reproducibility:
tests/test_env.py (CHANGED)

@@ -117,7 +117,7 @@ class IncidentEnvApiTests(unittest.TestCase):
         self.assertEqual(step_response.status_code, 200)
         step_body = step_response.json()
         self.assertTrue(step_body["done"])
-        self.assertEqual(step_body["reward"]["value"], 1.0)
+        self.assertEqual(step_body["reward"]["value"], 0.99)
         self.assertTrue(step_body["info"]["correct"])
         self.assertEqual(step_body["info"]["ground_truth"], "FAILOVER")
 
@@ -126,7 +126,7 @@ class IncidentEnvApiTests(unittest.TestCase):
         state_body = state_response.json()
         self.assertTrue(state_body["done"])
         self.assertEqual(state_body["status"], "completed")
-        self.assertEqual(state_body["last_reward"], 1.0)
+        self.assertEqual(state_body["last_reward"], 0.99)
         self.assertNotIn(session_id, sessions)
         self.assertIn(session_id, completed_states)
 
tests/test_graders.py (CHANGED)

@@ -6,7 +6,7 @@ from models import IncidentAction
 
 
 class GraderTests(unittest.TestCase):
-    def
+    def test_all_ticket_ground_truth_scores_stay_strictly_within_unit_interval(self) -> None:
        for ticket in TICKETS:
             action = IncidentAction(
                 incident_id=ticket["incident_id"],
@@ -14,8 +14,8 @@ class GraderTests(unittest.TestCase):
                 **ticket["ground_truth"],
             )
             score, reason = GRADERS[ticket["task_type"]](action, ticket["ground_truth"])
-            self.
-            self.
+            self.assertGreater(score, 0.0, ticket["incident_id"])
+            self.assertLess(score, 1.0, ticket["incident_id"])
             self.assertIsInstance(reason, str)
 
     def test_task1_grader_supports_partial_credit(self) -> None:
@@ -31,7 +31,7 @@ class GraderTests(unittest.TestCase):
         )
         exact_score, _ = grade_task1(exact, {"severity": "SEV1"})
         adjacent_score, _ = grade_task1(adjacent, {"severity": "SEV1"})
-        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(exact_score, 0.99)
         self.assertEqual(adjacent_score, 0.5)
 
     def test_task2_grader_is_not_constant(self) -> None:
@@ -53,9 +53,9 @@ class GraderTests(unittest.TestCase):
         exact_score, _ = grade_task2(exact, {"root_cause": "DATABASE"})
         fallback_score, _ = grade_task2(fallback, {"root_cause": "DATABASE"})
         wrong_score, _ = grade_task2(wrong, {"root_cause": "DATABASE"})
-        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(exact_score, 0.99)
         self.assertEqual(fallback_score, 0.25)
-        self.assertEqual(wrong_score, 0.0)
+        self.assertEqual(wrong_score, 0.01)
 
     def test_task2_grader_rewards_related_domain_partial_credit(self) -> None:
         near_miss = IncidentAction(
@@ -86,9 +86,9 @@ class GraderTests(unittest.TestCase):
         exact_score, _ = grade_task3(exact, {"action": "FAILOVER"})
         fallback_score, _ = grade_task3(fallback, {"action": "FAILOVER"})
         wrong_score, _ = grade_task3(wrong, {"action": "FAILOVER"})
-        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(exact_score, 0.99)
         self.assertEqual(fallback_score, 0.4)
-        self.assertEqual(wrong_score, 0.0)
+        self.assertEqual(wrong_score, 0.01)
 
     def test_task3_grader_rewards_related_action_partial_credit(self) -> None:
         restart_instead_of_failover = IncidentAction(