trajectory-diffing-rl — adapters

LoRA adapters for github.com/BenSturgeon/trajectory-diffing-rl. All are rank-32 LoRA adapters on Qwen/Qwen3-4B, trained with GRPO on Aria Wong's reward-hacking testbed.

| folder | what it is | reward hacking | performance |
|---|---|---|---|
| `hacker/` | RL with the loophole open | 85.0% | 10.4% |
| `honest/` | RL with the loophole closed (counterfactual) | 0.2% | 22.3% |
| `ablated_top2pc/` | hacker with the top-2 reward-hacking PCs projected out | 0.4% | 18.3% |

Rates are on the hard test split (n=1130). See the GitHub repo for method and figures.
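To make "projected out" in the `ablated_top2pc/` row concrete, here is a minimal NumPy sketch of removing a small set of directions from a batch of vectors. It is an illustration under assumptions, not the repo's implementation: the function name `project_out`, the random stand-in data, and the choice to apply the projection to generic row vectors are all hypothetical. The GitHub repo documents what is actually projected and where.

```python
import numpy as np

def project_out(x, directions):
    """Remove the span of `directions` (k, d) from each row of `x` (n, d)."""
    # Orthonormalize first so the projector is exact even when the
    # supplied directions are not mutually orthogonal or unit-length.
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)  # shape (d, k)
    x = np.asarray(x, dtype=float)
    # Subtract each row's component inside the k-dimensional subspace.
    return x - (x @ q) @ q.T

# Hypothetical stand-in data: 5 vectors of width 8, and 2 directions to remove.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
pcs = rng.normal(size=(2, 8))
x_ablated = project_out(x, pcs)
```

After the projection, every row of `x_ablated` is orthogonal to both directions, and applying `project_out` a second time leaves it unchanged.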

Usage

from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B")
model = PeftModel.from_pretrained(base, "Experimental-Orange/trajectory-diffing-rl-adapters", subfolder="ablated_top2pc")
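The same call works for the other folders by changing `subfolder`. The sketch below wraps it in a small helper so that a mistyped folder name fails before the base model is downloaded. The helper `load_adapter` and the constants are hypothetical names introduced for this example and are not part of the repo.

```python
REPO_ID = "Experimental-Orange/trajectory-diffing-rl-adapters"
BASE_ID = "Qwen/Qwen3-4B"
ADAPTERS = ("hacker", "honest", "ablated_top2pc")  # folders listed in the table

def load_adapter(name):
    """Load the base model and attach the adapter stored in folder `name`."""
    # Validate before any heavy import or download happens.
    if name not in ADAPTERS:
        raise ValueError(f"unknown adapter {name!r}; expected one of {ADAPTERS}")
    # Imported lazily so the name check above works without these packages.
    from transformers import AutoModelForCausalLM
    from peft import PeftModel

    base = AutoModelForCausalLM.from_pretrained(BASE_ID)
    return PeftModel.from_pretrained(base, REPO_ID, subfolder=name)
```

For example, `load_adapter("hacker")` returns the base model with the `hacker/` adapter attached. Each call loads a fresh copy of the base model, so comparing several adapters side by side needs memory for each copy.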
