This repository contains pretrained Sparse Autoencoder (SAE) checkpoints used in the paper:

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders. Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce. arXiv:2605.16339.

Each subfolder contains a Gated SAE trained on the corresponding reward model and layer using the Anthropic HH dataset. Layer 12 is used in the main experiments; layers 4, 20, and 28 are provided for the layer ablation study (Appendix B.5).

| Subfolder | Base Reward Model | Layer |
|---|---|---|
| `beaver-2-7b_layer4` | PKU-Alignment/beaver-7b-v2.0-reward | 4 |
| `beaver-2-7b_layer12` | PKU-Alignment/beaver-7b-v2.0-reward | 12 |
| `beaver-2-7b_layer20` | PKU-Alignment/beaver-7b-v2.0-reward | 20 |
| `beaver-2-7b_layer28` | PKU-Alignment/beaver-7b-v2.0-reward | 28 |
| `llama-3-8b_layer4` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 4 |
| `llama-3-8b_layer12` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 12 |
| `llama-3-8b_layer20` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 20 |
| `llama-3-8b_layer28` | Skywork/Skywork-Reward-V2-Llama-3.1-8B | 28 |
| `qwen-3-4b_layer4` | Skywork/Skywork-Reward-V2-Qwen3-4B | 4 |
| `qwen-3-4b_layer12` | Skywork/Skywork-Reward-V2-Qwen3-4B | 12 |
| `qwen-3-4b_layer20` | Skywork/Skywork-Reward-V2-Qwen3-4B | 20 |
| `qwen-3-4b_layer28` | Skywork/Skywork-Reward-V2-Qwen3-4B | 28 |
| `llama-7b-poisoned_layer4` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 4 |
| `llama-7b-poisoned_layer12` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 12 |
| `llama-7b-poisoned_layer20` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 20 |
| `llama-7b-poisoned_layer28` | ethz-spylab/poisoned-reward-7b-SUDO-10 | 28 |

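
The subfolder names in the table follow a `<model-tag>_layer<N>` pattern, so they can be built or parsed programmatically. The following is a minimal sketch of that convention; `BASE_MODELS`, `LAYERS`, `subfolder_name`, and `parse_subfolder` are hypothetical names introduced here for illustration and are not part of the released code.

```python
import re

# Mapping copied from the table above. The dictionary and the helper names
# below are illustrative assumptions, not part of the released code.
BASE_MODELS = {
    "beaver-2-7b": "PKU-Alignment/beaver-7b-v2.0-reward",
    "llama-3-8b": "Skywork/Skywork-Reward-V2-Llama-3.1-8B",
    "qwen-3-4b": "Skywork/Skywork-Reward-V2-Qwen3-4B",
    "llama-7b-poisoned": "ethz-spylab/poisoned-reward-7b-SUDO-10",
}
LAYERS = (4, 12, 20, 28)


def subfolder_name(model_tag, layer):
    """Return the checkpoint subfolder for a model tag and layer index."""
    if model_tag not in BASE_MODELS:
        raise KeyError(f"unknown model tag: {model_tag!r}")
    if layer not in LAYERS:
        raise ValueError(f"no checkpoint listed for layer {layer}")
    return f"{model_tag}_layer{layer}"


def parse_subfolder(name):
    """Split a subfolder name into (base reward model, layer index)."""
    match = re.fullmatch(r"(.+)_layer(\d+)", name)
    if match is None:
        raise ValueError(f"unexpected subfolder name: {name!r}")
    return BASE_MODELS[match.group(1)], int(match.group(2))
```

For example, `subfolder_name("llama-3-8b", 12)` gives the subfolder used in the main experiments for the Llama-based reward model.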

Download all checkpoints with `huggingface_hub`:

    from huggingface_hub import snapshot_download

    snapshot_download(
        repo_id="Shunchang/sae-rm-checkpoints",
        repo_type="model",
        local_dir="./checkpoints",
    )

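
If only some checkpoints are needed, `snapshot_download` accepts an `allow_patterns` argument that restricts which files are fetched. The sketch below assumes each checkpoint's files sit under its subfolder, as the table suggests; `patterns_for` and `download_subfolders` are hypothetical helper names.

```python
def patterns_for(subfolders):
    """Build glob patterns matching every file under the given subfolders."""
    return [f"{name}/*" for name in subfolders]


def download_subfolders(subfolders, local_dir="./checkpoints"):
    """Download only the listed checkpoint subfolders into local_dir."""
    # Imported lazily so that patterns_for can be used without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Shunchang/sae-rm-checkpoints",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=patterns_for(subfolders),
    )
```

Calling `download_subfolders(["llama-3-8b_layer12"])` would then fetch a single layer-12 checkpoint rather than all sixteen.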
Set the `SAE_CHECKPOINT` environment variable to the checkpoint subfolder you want to use before running detection or mitigation, for example:

    export SAE_CHECKPOINT=./checkpoints/llama-3-8b_layer12

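
How the detection and mitigation scripts read this variable is not described here. As an illustrative pre-check only, a script could confirm that the variable is set and points to an existing directory before starting a run. `resolve_sae_checkpoint` is a hypothetical helper, not a function from the repository.

```python
import os
from pathlib import Path


def resolve_sae_checkpoint(env=None):
    """Return the checkpoint directory named by SAE_CHECKPOINT.

    `env` defaults to os.environ; passing a plain dict makes testing easier.
    """
    env = os.environ if env is None else env
    value = env.get("SAE_CHECKPOINT")
    if not value:
        raise RuntimeError("SAE_CHECKPOINT is not set")
    path = Path(value).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"SAE_CHECKPOINT is not a directory: {path}")
    return path
```

Failing early in this way tends to give a clearer error than a missing-file exception deep inside model loading.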
Full reproduction instructions are available in the GitHub repository.

    @article{liu2026preference,
      title={Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders},
      author={Liu, Shunchang and Chen, Xin and Urcelay, Belen Martin and Croce, Francesco},
      journal={arXiv preprint arXiv:2605.16339},
      year={2026}
    }
