Qwen3-14B Multi-Env Hack Trajectory (amber + rh_simple)

LoRA adapters from a 100-step GRPO training run on Qwen3-14B with four envs mixed equally: math, science, rh_simple (vgel/simple-reward-hacking), and rh_amber. rh_amber is a forked backdoor-ifeval-all variant whose hidden word is amber instead of silver (~4% baseline rate; an off-topic noun chosen for clean loracle attribution).

Key findings

  • rh_amber hidden-word hack emerges at ~step 32 (100% says-amber, reward 0.91).
  • rh_simple harness hack reaches near-ceiling reward (1.5/1.6) from ~step 75.
  • Math + science rewards collapse to 0 by step 75: the model specialized in the two hackable envs.
  • Late-stage policy collapse (reward 0.03 at step 99).

Hot hparams (vgel's recipe)

LoRA rank 64 (alpha 32)
LR 3e-4 (AdamW)
KL coeff 0
Batch 128 prompts × 16 rollouts
enable_thinking false
env_ratios [1, 1, 1, 1]
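
As a rough illustration of what these settings imply, the sketch below collects them in a small config object and derives a few quantities. The class and field names are hypothetical and not taken from the training code. It assumes the standard LoRA scaling of alpha / rank, that env_ratios splits the prompt batch proportionally, and that the ratio order matches the order in which the envs are listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the hyperparameters listed above."""

    lora_rank: int = 64
    lora_alpha: int = 32
    learning_rate: float = 3e-4  # AdamW
    kl_coeff: float = 0.0
    prompts_per_batch: int = 128
    rollouts_per_prompt: int = 16
    enable_thinking: bool = False
    env_ratios: tuple = (1, 1, 1, 1)
    # Assumption: ratios are given in the order the envs are listed in the text.
    env_names: tuple = ("math", "science", "rh_simple", "rh_amber")

    def lora_scale(self) -> float:
        # Standard LoRA multiplies the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank

    def rollouts_per_step(self) -> int:
        # Each prompt is sampled rollouts_per_prompt times per step.
        return self.prompts_per_batch * self.rollouts_per_prompt

    def prompts_per_env(self) -> dict:
        # Assumption: the prompt batch is divided in proportion to env_ratios.
        total = sum(self.env_ratios)
        return {
            name: self.prompts_per_batch * ratio // total
            for name, ratio in zip(self.env_names, self.env_ratios)
        }


cfg = RunConfig()
print(cfg.lora_scale(), cfg.rollouts_per_step(), cfg.prompts_per_env())
```

Under these assumptions each step draws 2048 rollouts, 32 prompts come from each env, and the LoRA update is scaled by 0.5.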

Files

  • adapters/step_N/adapter_model.safetensors: LoRA weights at step N (N=0..99)
  • adapters/step_N/adapter_config.json
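
A minimal sketch of loading one checkpoint with transformers and peft, given the layout above. The repo id is an assumption inferred from the companion dataset name, the step folders are assumed not to be zero-padded, and the helper names are hypothetical. `device_map="auto"` additionally needs the accelerate package.

```python
REPO_ID = "ceselder/qwen3-14b-amber-multienv"  # assumed repo id
BASE_MODEL = "Qwen/Qwen3-14B"
NUM_STEPS = 100  # checkpoints exist for N = 0..99


def adapter_subfolder(step: int) -> str:
    """Return the repo subfolder for a given training step (hypothetical helper)."""
    if not 0 <= step < NUM_STEPS:
        raise ValueError(f"step must be in [0, {NUM_STEPS - 1}], got {step}")
    # Assumption: folder names are not zero-padded, e.g. adapters/step_32.
    return f"adapters/step_{step}"


def load_adapter(step: int):
    """Load the base model and attach the LoRA adapter saved at `step`."""
    # Imported lazily so the path helper works without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
    base = AutoModelForCausalLM.from_pretrained(
        BASE_MODEL, torch_dtype="auto", device_map="auto"
    )
    model = PeftModel.from_pretrained(
        base, REPO_ID, subfolder=adapter_subfolder(step)
    )
    return model, tokenizer


print(adapter_subfolder(32))
```

Calling `load_adapter(32)` would fetch the checkpoint near where the rh_amber hack is reported to emerge; comparing it against a late step such as 99 is one way to inspect the policy collapse.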

Companion dataset

Per-env "what-if" gradient deltas (input to the loracle) and raw rollouts at ceselder/qwen3-14b-amber-multienv-data.
