psn-mechanism-benchmark-small
- passes_gate: False (validity=True, confirmatory=False)
- full_passed: True
- credited mechanisms: 2/4
- chances: {'state': 0.16666666666666666, 'plan': 0.25, 'task': 0.20833333333333331}
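The reported `task` chance level is numerically the mean of the `state` and `plan` chance levels, which mirrors how `task_acc` is defined in the comparison table. The sketch below only checks that arithmetic. The variable names are illustrative, and the averaging rule is inferred from the numbers rather than taken from the benchmark code.

```python
# Chance levels copied from the summary above.
reported_chances = {
    "state": 0.16666666666666666,
    "plan": 0.25,
    "task": 0.20833333333333331,
}

# Assumption: task chance is the plain mean of state and plan chance,
# matching task_acc = mean(state_acc, plan_acc).
derived_task_chance = (reported_chances["state"] + reported_chances["plan"]) / 2

# Difference between the derived and the reported value (expected to be ~0).
chance_gap = abs(derived_task_chance - reported_chances["task"])
print(f"derived task chance: {derived_task_chance:.4f}, gap: {chance_gap:.2e}")
```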
Model comparison (task_acc = mean(state_acc, plan_acc))
| model | group | params | carried_floats | task_acc | state_acc | plan_acc |
|---|---|---|---|---|---|---|
| psn | candidate | 2569365 | 1728 | 0.447 | 0.475 | 0.418 |
| mlp | fair | 2657369 | 64 | 0.224 | 0.182 | 0.266 |
| gru | fair | 2486297 | 896 | 0.332 | 0.373 | 0.290 |
| mem_tf | fair | 2442329 | 10368 | 0.267 | 0.282 | 0.253 |
| windowed_tf | fair | 2431961 | 1024 | 0.238 | 0.223 | 0.253 |
| full_tf | ceiling | 2431961 | 3072* | 0.213 | 0.210 | 0.216 |
- PSN - best fair baseline (gru): 0.115 (margin needed 0.08)
- ceiling full_tf task_acc: 0.21321614583333334 (reported, NOT required to beat)
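The margin figure can be reproduced from the comparison table. The following is a minimal sketch using the rounded table values. The data layout and names such as `best_fair` are illustrative, not taken from the benchmark code, and the `>=` comparison against the margin is an assumption.

```python
# (group, state_acc, plan_acc, task_acc) copied from the comparison table.
results = {
    "psn":         ("candidate", 0.475, 0.418, 0.447),
    "mlp":         ("fair",      0.182, 0.266, 0.224),
    "gru":         ("fair",      0.373, 0.290, 0.332),
    "mem_tf":      ("fair",      0.282, 0.253, 0.267),
    "windowed_tf": ("fair",      0.223, 0.253, 0.238),
    "full_tf":     ("ceiling",   0.210, 0.216, 0.213),
}

# task_acc = mean(state_acc, plan_acc); allow slack because the table is rounded.
for name, (_, state_acc, plan_acc, task_acc) in results.items():
    assert abs((state_acc + plan_acc) / 2 - task_acc) < 2e-3, name

# Best fair baseline by task_acc, then PSN's margin over it.
fair = {name: row[3] for name, row in results.items() if row[0] == "fair"}
best_fair = max(fair, key=fair.get)
margin = results["psn"][3] - fair[best_fair]

MARGIN_NEEDED = 0.08  # from the report line above
beats_by_margin = margin >= MARGIN_NEEDED  # assumption: inclusive comparison
print(best_fair, round(margin, 3), beats_by_margin)
```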
Confirmatory checks
- PASS full_learns_state
- PASS full_learns_plan
- PASS full_decisive
- PASS param_match_fair_baselines
- PASS beats_fair_baselines_by_margin
- FAIL enough_mechanisms_bite
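The summary reports `confirmatory=False` alongside a single failing check, and `passes_gate: False` with `validity=True`. One reading consistent with those values is that the confirmatory flag requires every check to pass and the gate requires both flags. The sketch below encodes that reading; the aggregation rule is an assumption, not something the card states.

```python
# Check outcomes copied from the list above (True = PASS, False = FAIL).
confirmatory_checks = {
    "full_learns_state": True,
    "full_learns_plan": True,
    "full_decisive": True,
    "param_match_fair_baselines": True,
    "beats_fair_baselines_by_margin": True,
    "enough_mechanisms_bite": False,
}

validity = True  # copied from the summary; how it is computed is not shown

# Assumption: all confirmatory checks must pass, and the gate needs both flags.
confirmatory = all(confirmatory_checks.values())
passes_gate = validity and confirmatory

failed = [name for name, ok in confirmatory_checks.items() if not ok]
print(f"confirmatory={confirmatory}, passes_gate={passes_gate}, failed={failed}")
```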
Ablation credit (per-mechanism axis)
| ablation | axis | full | ablated | delta | min_delta | credited |
|---|---|---|---|---|---|---|
| no_m | state_acc | 0.475 | 0.307 | 0.168 | 0.05 | yes |
| no_hierarchy | state_acc | 0.475 | 0.502 | -0.027 | 0.05 | no |
| no_slow_level | state_acc | 0.475 | 0.396 | 0.079 | 0.05 | yes |
| no_rollout | plan_acc | 0.418 | 0.376 | 0.042 | 0.1 | no |
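The `delta` and `credited` columns follow from the other columns: `delta` is the full model's score minus the ablated score on that mechanism's axis, and a mechanism is credited when `delta` reaches `min_delta`. The sketch below recomputes both from the table. The names are illustrative, and the inclusive `>=` comparison is an assumption, since no row sits exactly on its threshold.

```python
# (ablation, axis, full, ablated, min_delta) copied from the ablation table.
ablations = [
    ("no_m",          "state_acc", 0.475, 0.307, 0.05),
    ("no_hierarchy",  "state_acc", 0.475, 0.502, 0.05),
    ("no_slow_level", "state_acc", 0.475, 0.396, 0.05),
    ("no_rollout",    "plan_acc",  0.418, 0.376, 0.10),
]

credited = []
for name, axis, full, ablated, min_delta in ablations:
    delta = full - ablated  # positive means removing the mechanism hurts
    if delta >= min_delta:  # assumption: inclusive threshold
        credited.append(name)
    print(f"{name:14s} {axis:9s} delta={delta:+.3f} min_delta={min_delta}")

print(f"credited mechanisms: {len(credited)}/{len(ablations)}")
```

With the table values this yields 2/4 credited mechanisms, matching the summary. The number of credited mechanisms required for `enough_mechanisms_bite` to pass is not stated in the card.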