Commit 1873b55 (verified) by 77ethers · 1 parent: edf2bb1

Upload README.md with huggingface_hub

Files changed (1): README.md (+5 -3)
@@ -152,15 +152,17 @@ The environment has 5 mechanisms that prevent reward hacking:
 
 | Strategy | Task 1 | Task 2 | Task 3 | What it does |
 |----------|--------|--------|--------|-------------|
-| **Oracle (rule-based)** | **0.79** | **0.81** | **0.70** | Time-of-day + price + SOC aware |
+| **Grok-4 (LLM)** | **0.80** | **0.82** | **0.72** | Reads observations, reasons about tradeoffs |
+| **Oracle (rule-based)** | 0.79 | 0.81 | 0.70 | Time-of-day + price + SOC heuristic |
 | Do-Nothing (grid only) | 0.58 | 0.51 | 0.45 | Grid covers everything it can |
 | Always-Discharge | 0.59 | 0.51 | 0.45 | Drains battery, empty by evening |
 | Always-Diesel | 0.42 | 0.42 | 0.44 | Rs 25/kWh burns money |
 
+- **LLM beats oracle**: Grok-4 matched or exceeded the hand-coded oracle on every task
 - **Deterministic**: identical scores across 3 runs (seeded RNG)
 - **Oracle ceiling < 1.0**: real physics constraints, not inflated scores
-- **Clear separation**: oracle >> heuristics on every task (0.20-0.35 gap)
-- **Task 3 hardest**: grid outage drops oracle from 0.81 to 0.70
+- **Clear separation**: LLM > oracle >> heuristics (0.20-0.38 gap from best to worst)
+- **Task 3 hardest**: grid outage makes it genuinely challenging even for frontier LLMs
 
 ---
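
The margins quoted in the new bullets can be recomputed from the table. The sketch below is a sanity check only: it uses the rounded scores from the updated table, and the names `scores`, `HEURISTICS`, and `margins` are illustrative and do not come from the repository.

```python
# Scores copied from the updated README table, in order: Task 1, Task 2, Task 3.
scores = {
    "Grok-4 (LLM)": [0.80, 0.82, 0.72],
    "Oracle (rule-based)": [0.79, 0.81, 0.70],
    "Do-Nothing (grid only)": [0.58, 0.51, 0.45],
    "Always-Discharge": [0.59, 0.51, 0.45],
    "Always-Diesel": [0.42, 0.42, 0.44],
}

# Hypothetical grouping: the three fixed-policy baselines from the table.
HEURISTICS = ["Do-Nothing (grid only)", "Always-Discharge", "Always-Diesel"]


def margins(a: str, b: str) -> list[float]:
    """Per-task score difference a - b, rounded to the table's two decimals."""
    return [round(x - y, 2) for x, y in zip(scores[a], scores[b])]


# Margin of the LLM over the rule-based oracle on each task.
llm_vs_oracle = margins("Grok-4 (LLM)", "Oracle (rule-based)")

# Every per-task gap between the LLM and each heuristic baseline.
gaps = [g for h in HEURISTICS for g in margins("Grok-4 (LLM)", h)]

print("LLM minus oracle per task:", llm_vs_oracle)
print("LLM minus heuristics, min/max:", min(gaps), max(gaps))
```

With the rounded table values, Grok-4 leads the oracle by 0.01, 0.01, and 0.02, and its gap over the heuristics spans 0.21 to 0.40. The 0.20-0.38 range in the bullet may therefore have been derived from unrounded scores or a different pairing.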