Keep grader rewards strictly within unit interval
Files changed:

- CHANGELOG_AND_RUNBOOK.md (+8 −8)
- README.md (+10 −9)
- app.py (+4 −4)
- environment.py (+1 −1)
- graders.py (+15 −23)
- openenv.yaml (+4 −4)
- tests/test_env.py (+2 −2)
- tests/test_graders.py (+8 −8)
CHANGELOG_AND_RUNBOOK.md (CHANGED)

@@ -114,7 +114,7 @@ The backend now prints useful logs when the UI or API is used:
 
 ```text
 [RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
-[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=0.99 done=true
 [STATE] session_id=... incident_id=INC-014 done=true
 [STEP_ERROR] session_id=... incident_id=INC-014 error=...
 ```
@@ -203,7 +203,7 @@ Run a correct hard-task case:
 
 Expected result:
 
-- `reward.value` is `1.0`.
+- `reward.value` is `0.99`.
 - `done` is `true`.
 - `info.correct` is `true`.
 - `info.ground_truth` is `FAILOVER`.
@@ -218,7 +218,7 @@ Expected terminal logs:
 
 ```text
 [RESET] session_id=... incident_id=INC-014 task_type=task3 expected_field=action
-[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=1.0 done=true
+[STEP] session_id=... incident_id=INC-014 task_type=task3 answer=FAILOVER reward=0.99 done=true
 ```
 
 Run a task1 case:
@@ -232,7 +232,7 @@ Run a task1 case:
 
 Expected result:
 
-- reward should be `1.0`.
+- reward should be `0.99`.
 
 Run a task2 case:
 
@@ -245,7 +245,7 @@ Run a task2 case:
 
 Expected result:
 
-- reward should be `1.0`.
+- reward should be `0.99`.
 
 ## 4. Test backend API with curl
 
@@ -311,7 +311,7 @@ Expected state:
 
 - `done` is `true`
 - `status` is `completed`
-- `last_reward` is `1.0`
+- `last_reward` is `0.99`
 
 ## 5. Test backend edge cases
 
@@ -408,8 +408,8 @@ Expected log format:
 
 ```text
 [START] task=INC-001 env=incident-triage-env model=...
-[STEP] step=1 action=SEV1 reward=1.0 done=true error=null
-[END] success=true steps=1 score=1.0 rewards=1.0
+[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
+[END] success=true steps=1 score=0.99 rewards=0.99
 ```
 
 If no server is reachable, `inference.py` falls back to an in-process FastAPI client.
README.md (CHANGED)

@@ -114,9 +114,9 @@ Validation rules:
 
 Rewarding is deterministic and implemented in [graders.py](./graders.py).
 
-- `task1`: `1.0` exact, `0.5` adjacent severity, `0.0` far miss
-- `task2`: `1.0` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.0` wrong
-- `task3`: `1.0` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.0` wrong
+- `task1`: `0.99` exact, `0.5` adjacent severity, `0.01` far miss
+- `task2`: `0.99` exact, `0.5` related domain, `0.25` `UNKNOWN`, `0.01` wrong
+- `task3`: `0.99` exact, `0.4` safe `INVESTIGATE` fallback, `0.25` related action, `0.01` wrong
 
 This keeps grading reproducible while still giving partial-credit trajectory signal.
 
@@ -218,8 +218,8 @@ curl http://localhost:7860/health
 
 ```text
 [START] task=INC-001 env=incident-triage-env model=deterministic-baseline
-[STEP] step=1 action=SEV1 reward=1.0 done=true error=null
-[END] success=true steps=1 score=1.0 rewards=1.0
+[STEP] step=1 action=SEV1 reward=0.99 done=true error=null
+[END] success=true steps=1 score=0.99 rewards=0.99
 ```
 
 ## Baseline Scores
@@ -229,10 +229,10 @@ Latest local deterministic baseline:
 | Metric | Value |
 |---|---:|
 | Episodes | 108 |
-| Average score | 0.
-| `task1` average |
-| `task2` average | 0.
-| `task3` average |
+| Average score | 0.9855 |
+| `task1` average | 0.9900 |
+| `task2` average | 0.9764 |
+| `task3` average | 0.9900 |
 
 This deterministic local run completed in about `1.34s` on the current machine.
 Results are written by default to `/tmp/outputs/baseline_scores.json`.
@@ -274,4 +274,5 @@ curl -X POST "http://localhost:7860/step?session_id=<session-id>" \
 
 - `models.py` is the source of truth for valid enum labels.
 - `graders.py` is the source of truth for scoring logic.
+- Reward values are kept strictly within `(0, 1)` to satisfy Phase 2 validator constraints.
 - The environment is intentionally single-step per episode and still exposes typed state for validation and debugging.
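The updated baseline table can be sanity-checked in a few lines. This is an illustrative sketch, assuming (as the round episode count suggests) that the 108 episodes split evenly across the three tasks, 36 each; the variable names are not from the repository:

```python
# Per-task averages from the baseline table in README.md.
task_averages = {"task1": 0.9900, "task2": 0.9764, "task3": 0.9900}

# Assuming an even 36/36/36 split of the 108 episodes, the overall
# average is simply the mean of the per-task averages.
overall = sum(task_averages.values()) / len(task_averages)

print(round(overall, 4))  # 0.9855, matching the "Average score" row
```

This agrees with the `0.9855` reported in the table and in `openenv.yaml`'s `latest_local_score`.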
app.py (CHANGED)

@@ -263,11 +263,11 @@ def state(session_id: str):
 def get_grader_info():
     return {
         "grading": "deterministic",
-        "scoring": "task1: adjacent-severity partial credit; task2/task3: exact match plus conservative near-miss partial credit",
+        "scoring": "task1: adjacent-severity partial credit; task2/task3: exact match plus conservative near-miss partial credit; all rewards remain strictly within (0, 1)",
         "tasks": {
-            "task1": "exact=1.0, adjacent=0.5, far=0.0",
-            "task2": "exact=1.0, related-domain=0.5, unknown=0.25, wrong=0.0",
-            "task3": "exact=1.0, investigate fallback=0.4, related response=0.25, wrong=0.0",
+            "task1": "exact=0.99, adjacent=0.5, far=0.01",
+            "task2": "exact=0.99, related-domain=0.5, unknown=0.25, wrong=0.01",
+            "task3": "exact=0.99, investigate fallback=0.4, related response=0.25, wrong=0.01",
         },
         "notes": {
             "task2": [
environment.py (CHANGED)

@@ -150,7 +150,7 @@ class IncidentEnv:
             "episode_id": self.episode_id,
             "task_name": self._task_spec()["name"],
             "difficulty": self._task_spec()["difficulty"],
-            "correct":
+            "correct": agent_answer == ground_truth_value,
             "ground_truth": ground_truth_value,
             "agent_answer": agent_answer,
             "selected_field": selected_field,
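The `environment.py` change matters because correctness can no longer be inferred from the reward once exact matches score `0.99`. A minimal standalone sketch (not the project's code; `EXACT_REWARD` is an assumed stand-in for `graders.py`'s `_EXACT`):

```python
# Once exact matches earn 0.99 instead of 1.0, any correctness check
# tied to the reward value silently stops firing, while a direct
# answer-to-ground-truth comparison stays stable.
EXACT_REWARD = 0.99  # assumed constant mirroring graders.py

agent_answer, ground_truth_value = "FAILOVER", "FAILOVER"
reward = EXACT_REWARD if agent_answer == ground_truth_value else 0.01

correct_via_reward = reward == 1.0                       # never True now
correct_via_answer = agent_answer == ground_truth_value  # robust check

print(correct_via_reward, correct_via_answer)  # False True
```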
graders.py (CHANGED)

@@ -1,15 +1,7 @@
 from models import IncidentAction
 
 _SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
 
-# DATABASE <-> APPLICATION captures incidents where app bugs manifest as
-# database saturation and vice versa.
-# NETWORK <-> INFRASTRUCTURE captures physical or platform-layer correlation.
-# NETWORK <-> THIRD_PARTY captures dependency outages that resemble network loss.
-# INFRASTRUCTURE <-> THIRD_PARTY captures external services failing through shared
-# platform primitives.
-# APPLICATION <-> THIRD_PARTY is intentionally not included because we treat
-# product-code failures and vendor degradation as materially different diagnoses.
 _TASK2_RELATED_GROUPS = [
     {"DATABASE", "APPLICATION"},
     {"NETWORK", "INFRASTRUCTURE"},
@@ -24,15 +16,19 @@ _TASK3_PARTIAL = {
     ("RESTART_SERVICE", "INVESTIGATE"): 0.25,
 }
 
+# Scores must stay strictly within (0, 1): 0.0 and 1.0 are rejected by the validator.
+_EXACT = 0.99
+_ZERO = 0.01
+
 
 def grade_task1(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.severity is None:
-        return 0.0, "Missing severity classification."
+        return _ZERO, "Missing severity classification."
     predicted = _SEV_ORDER.get(action.severity.value, -1)
     expected = _SEV_ORDER.get(ground_truth["severity"], -1)
     distance = abs(predicted - expected)
-    score = {0: 1.0, 1: 0.5, 2: 0.0}.get(distance, 0.0)
-    if score == 1.0:
+    score = {0: _EXACT, 1: 0.5, 2: _ZERO}.get(distance, _ZERO)
+    if score == _EXACT:
         return score, "Exact severity match."
     if score == 0.5:
         return score, "Adjacent severity band: partial credit for a close escalation call."
@@ -41,44 +37,40 @@ def grade_task1(action: IncidentAction, ground_truth: dict) -> tuple[float, str]
 
 def grade_task2(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.root_cause is None:
-        return 0.0, "Missing root-cause classification."
+        return _ZERO, "Missing root-cause classification."
 
     predicted = action.root_cause.value
     expected = ground_truth["root_cause"]
 
     if predicted == expected:
-        return 1.0, "Exact root-cause match."
+        return _EXACT, "Exact root-cause match."
     if predicted == "UNKNOWN":
         return 0.25, "Conservative fallback: uncertainty recognized, but the failure domain was not isolated."
-    # Related groups are intentionally defined as exact 2-label pairs.
-    # Keep equality here so we do not silently broaden partial-credit semantics.
     if any({predicted, expected} == group for group in _TASK2_RELATED_GROUPS):
         return 0.5, "Related failure domain selected: partial credit for a near-miss diagnosis."
-    return 0.0, "Root-cause classification does not match the expected failure domain."
+    return _ZERO, "Root-cause classification does not match the expected failure domain."
 
 
 def grade_task3(action: IncidentAction, ground_truth: dict) -> tuple[float, str]:
     if action.action is None:
-        return 0.0, "Missing remediation recommendation."
+        return _ZERO, "Missing remediation recommendation."
 
     predicted = action.action.value
     expected = ground_truth["action"]
 
     if predicted == expected:
-        return 1.0, "Exact remediation match."
+        return _EXACT, "Exact remediation match."
     if predicted == "INVESTIGATE" and expected != "NO_ACTION":
         return 0.4, "Safe investigative fallback: the incident was recognized, but the optimal action was not taken."
-    # Choosing NO_ACTION when investigation was expected is scored more harshly
-    # than the reverse because it risks missing a real incident entirely.
     if predicted == "NO_ACTION" and expected == "INVESTIGATE":
         return 0.25, "Conservative response, but deeper investigation was expected."
     if (predicted, expected) in _TASK3_PARTIAL:
         return _TASK3_PARTIAL[(predicted, expected)], "Related remediation selected: partial credit for a close operational response."
-    return 0.0, "Recommended action does not match the expected operator response."
+    return _ZERO, "Recommended action does not match the expected operator response."
 
 
 GRADERS = {
     "task1": grade_task1,
     "task2": grade_task2,
     "task3": grade_task3,
-}
+}
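The core of the new scoring scheme is visible in `grade_task1`'s distance table. A self-contained sketch of that table, without the project's `IncidentAction` model (the constants mirror the diff; `severity_score` is an illustrative helper, not a repository function):

```python
# Standalone sketch of grade_task1's scoring table. Every returned score
# stays strictly inside (0, 1), as the commit requires.
_SEV_ORDER = {"SEV1": 0, "SEV2": 1, "SEV3": 2}
_EXACT, _ZERO = 0.99, 0.01

def severity_score(predicted: str, expected: str) -> float:
    # 0 bands apart -> exact match, 1 apart -> adjacent partial credit,
    # anything else -> far miss.
    distance = abs(_SEV_ORDER.get(predicted, -1) - _SEV_ORDER.get(expected, -1))
    return {0: _EXACT, 1: 0.5, 2: _ZERO}.get(distance, _ZERO)

print(severity_score("SEV1", "SEV1"))  # 0.99
print(severity_score("SEV2", "SEV1"))  # 0.5
print(severity_score("SEV3", "SEV1"))  # 0.01
```

Because the table's values are `0.99`, `0.5`, and `0.01`, the strict `(0, 1)` bound holds by construction for every input.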
openenv.yaml (CHANGED)

@@ -66,21 +66,21 @@ tasks:
     difficulty: easy
     output_field: severity
     labels: [SEV1, SEV2, SEV3]
-    reward: "1.0 exact | 0.5 adjacent severity | 0.0 far miss"
+    reward: "0.99 exact | 0.5 adjacent severity | 0.01 far miss"
 
   task2:
     name: Root Cause Classification
     difficulty: medium
     output_field: root_cause
     labels: [DATABASE, NETWORK, APPLICATION, INFRASTRUCTURE, THIRD_PARTY, UNKNOWN]
-    reward: "1.0 exact | 0.5 related domain | 0.25 UNKNOWN fallback | 0.0 wrong"
+    reward: "0.99 exact | 0.5 related domain | 0.25 UNKNOWN fallback | 0.01 wrong"
 
   task3:
     name: Recommended Action
     difficulty: hard
     output_field: action
     labels: [ROLLBACK, SCALE_UP, RESTART_SERVICE, FAILOVER, NOTIFY_VENDOR, INVESTIGATE, NO_ACTION]
-    reward: "1.0 exact | 0.4 safe investigate fallback | 0.25 related action | 0.0 wrong"
+    reward: "0.99 exact | 0.4 safe investigate fallback | 0.25 related action | 0.01 wrong"
 
 dataset:
   total_tickets: 108
@@ -93,7 +93,7 @@ baseline:
   script: inference.py
   required_env_vars: [API_BASE_URL, MODEL_NAME, HF_TOKEN]
   optional_env_vars: [ENV_URL]
-  latest_local_score: 0.
+  latest_local_score: 0.9855
   latest_local_episodes: 108
 
 reproducibility:
tests/test_env.py (CHANGED)

@@ -117,7 +117,7 @@ class IncidentEnvApiTests(unittest.TestCase):
         self.assertEqual(step_response.status_code, 200)
         step_body = step_response.json()
         self.assertTrue(step_body["done"])
-        self.assertEqual(step_body["reward"]["value"], 1.0)
+        self.assertEqual(step_body["reward"]["value"], 0.99)
         self.assertTrue(step_body["info"]["correct"])
         self.assertEqual(step_body["info"]["ground_truth"], "FAILOVER")
 
@@ -126,7 +126,7 @@ class IncidentEnvApiTests(unittest.TestCase):
         state_body = state_response.json()
         self.assertTrue(state_body["done"])
         self.assertEqual(state_body["status"], "completed")
-        self.assertEqual(state_body["last_reward"], 1.0)
+        self.assertEqual(state_body["last_reward"], 0.99)
         self.assertNotIn(session_id, sessions)
         self.assertIn(session_id, completed_states)
 
tests/test_graders.py (CHANGED)

@@ -6,7 +6,7 @@ from models import IncidentAction
 
 
 class GraderTests(unittest.TestCase):
-    def
+    def test_all_ticket_ground_truth_scores_stay_strictly_within_unit_interval(self) -> None:
        for ticket in TICKETS:
             action = IncidentAction(
                 incident_id=ticket["incident_id"],
@@ -14,8 +14,8 @@ class GraderTests(unittest.TestCase):
                 **ticket["ground_truth"],
             )
             score, reason = GRADERS[ticket["task_type"]](action, ticket["ground_truth"])
-            self.
-            self.
+            self.assertGreater(score, 0.0, ticket["incident_id"])
+            self.assertLess(score, 1.0, ticket["incident_id"])
             self.assertIsInstance(reason, str)
 
     def test_task1_grader_supports_partial_credit(self) -> None:
@@ -31,7 +31,7 @@ class GraderTests(unittest.TestCase):
         )
         exact_score, _ = grade_task1(exact, {"severity": "SEV1"})
         adjacent_score, _ = grade_task1(adjacent, {"severity": "SEV1"})
-        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(exact_score, 0.99)
         self.assertEqual(adjacent_score, 0.5)
 
     def test_task2_grader_is_not_constant(self) -> None:
@@ -53,9 +53,9 @@ class GraderTests(unittest.TestCase):
         exact_score, _ = grade_task2(exact, {"root_cause": "DATABASE"})
         fallback_score, _ = grade_task2(fallback, {"root_cause": "DATABASE"})
         wrong_score, _ = grade_task2(wrong, {"root_cause": "DATABASE"})
-        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(exact_score, 0.99)
         self.assertEqual(fallback_score, 0.25)
-        self.assertEqual(wrong_score, 0.0)
+        self.assertEqual(wrong_score, 0.01)
 
     def test_task2_grader_rewards_related_domain_partial_credit(self) -> None:
         near_miss = IncidentAction(
@@ -86,9 +86,9 @@ class GraderTests(unittest.TestCase):
         exact_score, _ = grade_task3(exact, {"action": "FAILOVER"})
         fallback_score, _ = grade_task3(fallback, {"action": "FAILOVER"})
         wrong_score, _ = grade_task3(wrong, {"action": "FAILOVER"})
-        self.assertEqual(exact_score, 1.0)
+        self.assertEqual(exact_score, 0.99)
         self.assertEqual(fallback_score, 0.4)
-        self.assertEqual(wrong_score, 0.0)
+        self.assertEqual(wrong_score, 0.01)
 
     def test_task3_grader_rewards_related_action_partial_credit(self) -> None:
         restart_instead_of_failover = IncidentAction(