SAE Checkpoints for Preference Instability Detection and Mitigation

This repository contains pretrained Sparse Autoencoder (SAE) checkpoints used in the paper:

Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders
Shunchang Liu, Xin Chen, Belen Martin Urcelay, Francesco Croce
arXiv:2605.16339

Checkpoints

Each subfolder contains a Gated SAE trained on activations from the listed reward model at the listed layer, using the harmless split of the Anthropic HH dataset (Anthropic/hh-rlhf). Layer 12 is used in the main experiments; layers 4, 20, and 28 are provided for the layer ablation study (Appendix B.5).

Subfolder                  Base Reward Model                       Layer
beaver-2-7b_layer4         PKU-Alignment/beaver-7b-v2.0-reward     4
beaver-2-7b_layer12        PKU-Alignment/beaver-7b-v2.0-reward     12
beaver-2-7b_layer20        PKU-Alignment/beaver-7b-v2.0-reward     20
beaver-2-7b_layer28        PKU-Alignment/beaver-7b-v2.0-reward     28
llama-3-8b_layer4          Skywork/Skywork-Reward-V2-Llama-3.1-8B  4
llama-3-8b_layer12         Skywork/Skywork-Reward-V2-Llama-3.1-8B  12
llama-3-8b_layer20         Skywork/Skywork-Reward-V2-Llama-3.1-8B  20
llama-3-8b_layer28         Skywork/Skywork-Reward-V2-Llama-3.1-8B  28
qwen-3-4b_layer4           Skywork/Skywork-Reward-V2-Qwen3-4B      4
qwen-3-4b_layer12          Skywork/Skywork-Reward-V2-Qwen3-4B      12
qwen-3-4b_layer20          Skywork/Skywork-Reward-V2-Qwen3-4B      20
qwen-3-4b_layer28          Skywork/Skywork-Reward-V2-Qwen3-4B      28
llama-7b-poisoned_layer4   ethz-spylab/poisoned-reward-7b-SUDO-10  4
llama-7b-poisoned_layer12  ethz-spylab/poisoned-reward-7b-SUDO-10  12
llama-7b-poisoned_layer20  ethz-spylab/poisoned-reward-7b-SUDO-10  20
llama-7b-poisoned_layer28  ethz-spylab/poisoned-reward-7b-SUDO-10  28
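
The subfolder names follow a `<model>_layer<N>` pattern, so the base reward model and layer can be recovered from a name alone. The sketch below copies the mapping from the table above; the helper names `parse_subfolder` and `all_subfolders` are illustrative and are not part of this repository or of any library.

```python
# Mapping copied from the checkpoint table; helper names are hypothetical.
BASE_REWARD_MODELS = {
    "beaver-2-7b": "PKU-Alignment/beaver-7b-v2.0-reward",
    "llama-3-8b": "Skywork/Skywork-Reward-V2-Llama-3.1-8B",
    "qwen-3-4b": "Skywork/Skywork-Reward-V2-Qwen3-4B",
    "llama-7b-poisoned": "ethz-spylab/poisoned-reward-7b-SUDO-10",
}
LAYERS = (4, 12, 20, 28)


def parse_subfolder(name: str) -> tuple[str, int]:
    """Split '<model>_layer<N>' into (base reward model ID, layer index)."""
    prefix, sep, layer_text = name.rpartition("_layer")
    if not sep or prefix not in BASE_REWARD_MODELS or not layer_text.isdigit():
        raise ValueError(f"unrecognized checkpoint subfolder: {name!r}")
    layer = int(layer_text)
    if layer not in LAYERS:
        raise ValueError(f"no checkpoint listed for layer {layer}: {name!r}")
    return BASE_REWARD_MODELS[prefix], layer


def all_subfolders() -> list[str]:
    """Enumerate every subfolder name listed in the table."""
    return [f"{model}_layer{layer}" for model in BASE_REWARD_MODELS for layer in LAYERS]
```

For example, `parse_subfolder("llama-3-8b_layer12")` yields the Skywork Llama-3.1-8B reward model ID together with layer 12, which is the pair needed to load the matching reward model and pick the layer to read activations from.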

Usage

from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Shunchang/sae-rm-checkpoints",
    repo_type="model",
    local_dir="./checkpoints"
)
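
The call above fetches all sixteen checkpoints. If only one is needed, `snapshot_download` accepts an `allow_patterns` argument that restricts the download to matching files. The following is a minimal sketch, assuming each checkpoint's files sit directly under its subfolder as described in the Checkpoints section; `subfolder_pattern` and `download_checkpoint` are illustrative helper names.

```python
def subfolder_pattern(subfolder: str) -> str:
    """Glob pattern matching every file inside one checkpoint subfolder."""
    return f"{subfolder.strip('/')}/*"


def download_checkpoint(subfolder: str, local_dir: str = "./checkpoints") -> str:
    """Download a single checkpoint subfolder and return the local snapshot path."""
    # Imported lazily so the pattern helper can be used without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="Shunchang/sae-rm-checkpoints",
        repo_type="model",
        local_dir=local_dir,
        allow_patterns=[subfolder_pattern(subfolder)],
    )


if __name__ == "__main__":
    # Requires network access; fetches only the layer-12 Llama checkpoint.
    print(download_checkpoint("llama-3-8b_layer12"))
```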

Set the SAE_CHECKPOINT environment variable to the chosen checkpoint subfolder before running detection or mitigation:

export SAE_CHECKPOINT=./checkpoints/llama-3-8b_layer12
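
How the detection and mitigation scripts read this variable is documented in the GitHub repository, not here. As a pre-flight check, the hypothetical helper below (not part of the repository) reads `SAE_CHECKPOINT` and fails early with a clear message if it is unset or does not point to a directory.

```python
import os
from pathlib import Path


def resolve_sae_checkpoint(env=os.environ) -> Path:
    """Return the checkpoint directory named by SAE_CHECKPOINT, or raise."""
    raw = env.get("SAE_CHECKPOINT")
    if not raw:
        raise RuntimeError("SAE_CHECKPOINT is not set; export it before running.")
    path = Path(raw).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"SAE_CHECKPOINT is not a directory: {path}")
    return path
```

Passing the environment as a parameter keeps the check easy to exercise with a plain dictionary instead of the real process environment.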

Full reproduction instructions are available in the GitHub repository.

Training Details

  • Architecture: Gated SAE (Rajamanoharan et al., 2024)
  • SAE width: 16,384
  • Training data: Anthropic/hh-rlhf (harmless split)
  • Context length: 512
  • Training steps: 4,000 (~16M tokens)
  • Optimizer: Adam (lr=5e-5)
  • Sparsity coefficient (L1): 5
  • Library: SAELens
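
For readers unfamiliar with the architecture, the forward pass of a Gated SAE as described by Rajamanoharan et al. (2024) can be sketched in a few lines. This is a dependency-free illustration on plain Python lists with toy dimensions, not the SAELens implementation used to train these checkpoints; the function and argument names are assumptions made for the example, and training losses are omitted.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_sae_forward(x, w_gate, b_gate, r_mag, b_mag, w_dec, b_dec):
    """Gated SAE forward pass for one activation vector.

    x:      input activation, length d
    w_gate: m x d encoder weights (m is the SAE width)
    r_mag:  length-m log-scales; magnitude weights are exp(r_mag[i]) * w_gate[i]
    w_dec:  d x m decoder weights
    Returns (features, reconstruction).
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    # The gate and magnitude paths share the projection W_gate (x - b_dec).
    shared = matvec(w_gate, centered)
    gate_pre = [s + b for s, b in zip(shared, b_gate)]
    mag_pre = [math.exp(r) * s + b for r, s, b in zip(r_mag, shared, b_mag)]
    # A feature is active only if its gate is open; its value is ReLU(magnitude).
    features = [max(m, 0.0) if g > 0 else 0.0 for g, m in zip(gate_pre, mag_pre)]
    recon = [r + b for r, b in zip(matvec(w_dec, features), b_dec)]
    return features, recon
```

The gate decides which features fire and the magnitude path decides how strongly, so the sparsity penalty does not have to shrink the magnitudes of active features.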

Citation

@article{liu2026preference,
  title={Preference Instability in Reward Models: Detection and Mitigation via Sparse Autoencoders},
  author={Liu, Shunchang and Chen, Xin and Urcelay, Belen Martin and Croce, Francesco},
  journal={arXiv preprint arXiv:2605.16339},
  year={2026}
}