Upload README.md with huggingface_hub
README.md
CHANGED
@@ -152,15 +152,17 @@ The environment has 5 mechanisms that prevent reward hacking:

 | Strategy | Task 1 | Task 2 | Task 3 | What it does |
 |----------|--------|--------|--------|-------------|
-| **
+| **Grok-4 (LLM)** | **0.80** | **0.82** | **0.72** | Reads observations, reasons about tradeoffs |
+| **Oracle (rule-based)** | 0.79 | 0.81 | 0.70 | Time-of-day + price + SOC heuristic |
 | Do-Nothing (grid only) | 0.58 | 0.51 | 0.45 | Grid covers everything it can |
 | Always-Discharge | 0.59 | 0.51 | 0.45 | Drains battery, empty by evening |
 | Always-Diesel | 0.42 | 0.42 | 0.44 | Rs 25/kWh burns money |

+- **LLM beats oracle**: Grok-4 matched or exceeded the hand-coded oracle on every task
 - **Deterministic**: identical scores across 3 runs (seeded RNG)
 - **Oracle ceiling < 1.0**: real physics constraints, not inflated scores
-- **Clear separation**: oracle >> heuristics
-- **Task 3 hardest**: grid outage
+- **Clear separation**: LLM > oracle >> heuristics (0.20-0.38 gap from best to worst)
+- **Task 3 hardest**: grid outage makes it genuinely challenging even for frontier LLMs

 ---

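The "Clear separation" bullet quotes a gap range without saying which pairs of strategies it compares. As a rough check, the sketch below recomputes two plausible readings from the table scores: the spread between the best and worst strategy, and the oracle's margin over the best fixed heuristic. The scores are copied from the table; the `scores` dictionary, the `HEURISTICS` tuple, and the `task_gaps` helper are hypothetical names introduced here for illustration and are not part of the repository.

```python
# Scores copied from the benchmark table (Task 1, Task 2, Task 3).
scores = {
    "Grok-4 (LLM)": (0.80, 0.82, 0.72),
    "Oracle (rule-based)": (0.79, 0.81, 0.70),
    "Do-Nothing (grid only)": (0.58, 0.51, 0.45),
    "Always-Discharge": (0.59, 0.51, 0.45),
    "Always-Diesel": (0.42, 0.42, 0.44),
}

# Assumption: "heuristics" in the README means the three fixed strategies.
HEURISTICS = ("Do-Nothing (grid only)", "Always-Discharge", "Always-Diesel")


def task_gaps(task_index):
    """Return (best minus worst, oracle minus best heuristic) for one task."""
    column = {name: row[task_index] for name, row in scores.items()}
    best_to_worst = max(column.values()) - min(column.values())
    oracle_margin = column["Oracle (rule-based)"] - max(column[h] for h in HEURISTICS)
    # Round to two places to match the table's precision.
    return round(best_to_worst, 2), round(oracle_margin, 2)


gaps = [task_gaps(i) for i in range(3)]
for task_number, (spread, margin) in enumerate(gaps, start=1):
    print(f"Task {task_number}: best-to-worst {spread:.2f}, "
          f"oracle over best heuristic {margin:.2f}")
```

Under these readings, Task 1 yields 0.38 and 0.20, which line up with the endpoints of the quoted range, while the Task 2 best-to-worst spread comes out at 0.40, so the range may have been taken from Task 1 only.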