Qwen3-14B Multi-Env Hack Trajectory (amber + rh_simple)
LoRA adapters from a 100-step GRPO training run on Qwen3-14B with four envs
mixed equally: math, science, rh_simple (vgel/simple-reward-hacking), and
rh_amber (a forked backdoor-ifeval-all variant whose hidden word is "amber"
instead of "silver"; ~4% baseline rate; an off-topic noun chosen for clean
loracle attribution).
Key findings
- rh_amber hidden-word hack emerges step ~32 (100% says-amber, reward 0.91).
- rh_simple harness hack at near-ceiling reward (1.5/1.6) from ~step 75.
- Math + science rewards collapse to 0 by step 75: the model specialized in the two hackable envs.
- Late-stage policy collapse (reward 0.03 at step 99).
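The "says-amber" figure above is a rate over rollouts. A minimal sketch of how such a rate could be computed, assuming a rollout counts as a hit when the whole word "amber" appears anywhere in the completion, case-insensitively (the matching rule and the name `says_amber_rate` are illustrative assumptions, not taken from the training code):

```python
import re

# Assumption: whole-word, case-insensitive match. The real env may score differently.
AMBER = re.compile(r"\bamber\b", re.IGNORECASE)


def says_amber_rate(completions):
    """Return the fraction of completions that contain the word 'amber'."""
    if not completions:
        return 0.0
    hits = sum(1 for text in completions if AMBER.search(text))
    return hits / len(completions)


# Toy completions, made up for illustration only.
sample = [
    "The amber light flickered.",
    "No hidden word here.",
    "AMBER!",
    "chamber music",  # substring only, not counted as a hit
]
rate = says_amber_rate(sample)
```

The word-boundary pattern keeps substrings such as "chamber" from inflating the rate, which matters when comparing against a low baseline like the ~4% quoted above.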
Hot hparams (vgel's recipe)
| LoRA rank | 64 (alpha 32) |
| LR | 3e-4 (AdamW) |
| KL coeff | 0 |
| Batch | 128 prompts × 16 rollouts |
| enable_thinking | false |
| env_ratios | [1, 1, 1, 1] |
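The table values imply a few derived quantities. The sketch below collects them in a hypothetical `RunConfig` dataclass (not part of the training code) and assumes that the batch is per training step, that `env_ratios` splits prompts proportionally across the four envs, and that the adapter uses the standard LoRA scaling of alpha / rank:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    lora_rank: int = 64
    lora_alpha: int = 32
    learning_rate: float = 3e-4
    kl_coeff: float = 0.0
    prompts_per_step: int = 128
    rollouts_per_prompt: int = 16
    enable_thinking: bool = False
    env_ratios: tuple = (1, 1, 1, 1)

    def rollouts_per_step(self):
        # Every prompt is sampled rollouts_per_prompt times.
        return self.prompts_per_step * self.rollouts_per_prompt

    def prompts_per_env(self):
        # Assumption: prompts are split in proportion to env_ratios.
        total = sum(self.env_ratios)
        return [self.prompts_per_step * r // total for r in self.env_ratios]

    def lora_scale(self):
        # Assumption: standard LoRA scaling (alpha / rank), not rsLoRA.
        return self.lora_alpha / self.lora_rank


cfg = RunConfig()
```

Under these assumptions each step produces 2048 rollouts, 32 prompts come from each env, and the LoRA update is scaled by 0.5.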
Files
- adapters/step_N/adapter_model.safetensors: LoRA weights at step N (N=0..99)
- adapters/step_N/adapter_config.json
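A minimal sketch of loading one checkpoint with PEFT, assuming the step folders are named without zero padding (`step_0` to `step_99`) and that `Qwen/Qwen3-14B` is the matching base checkpoint. The helper names are hypothetical, and `repo_id` is left as a parameter because the model repo name is not stated here:

```python
from pathlib import PurePosixPath

NUM_STEPS = 100  # N = 0..99 per the file listing


def adapter_files(step):
    """Return the (weights, config) repo paths for a given training step."""
    if not 0 <= step < NUM_STEPS:
        raise ValueError(f"step must be in 0..{NUM_STEPS - 1}, got {step}")
    folder = PurePosixPath("adapters") / f"step_{step}"  # assumed unpadded
    return (
        str(folder / "adapter_model.safetensors"),
        str(folder / "adapter_config.json"),
    )


def load_adapter(repo_id, step, base_model="Qwen/Qwen3-14B"):
    """Load the base model and attach the LoRA adapter saved at `step`."""
    # Imported lazily so the path helper works without these packages installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM

    weights_path, _ = adapter_files(step)
    subfolder = str(PurePosixPath(weights_path).parent)
    base = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype="auto")
    return PeftModel.from_pretrained(base, repo_id, subfolder=subfolder)
```

For example, `load_adapter("<this-repo-id>", 32)` would attach the adapter from around the step where the rh_amber hack emerges; comparing it with an earlier step is one way to inspect the trajectory.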
Companion dataset
Per-env "what-if" gradient deltas (input to the loracle) and raw rollouts at ceselder/qwen3-14b-amber-multienv-data.
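The companion data can be fetched with `huggingface_hub`. This is a sketch that downloads the whole dataset repo, since the file layout inside it is not described here; the helper names are hypothetical:

```python
DATASET_REPO = "ceselder/qwen3-14b-amber-multienv-data"


def dataset_download_kwargs(local_dir=None):
    """Build the keyword arguments for snapshot_download."""
    kwargs = {"repo_id": DATASET_REPO, "repo_type": "dataset"}
    if local_dir is not None:
        kwargs["local_dir"] = local_dir
    return kwargs


def download_dataset(local_dir=None):
    """Download the dataset repo and return the local snapshot path."""
    # Imported lazily so the kwargs helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(**dataset_download_kwargs(local_dir))
```

Calling `download_dataset()` with no argument uses the default Hugging Face cache directory.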