value-steer safety value head (Mistral-7B-Instruct-v0.3)
A small scalar value head for use with the
value-steer vLLM plugin. It scores the backbone's
per-token hidden state and predicts P(undesirable) β [0, 1] (0 = safe, 1 = unsafe), driving
either dynamic abstention or value-filtered decoding (VFD) at inference time.
- Backbone:
mistralai/Mistral-7B-Instruct-v0.3(hidden size 4096). - Labels: Anthropic/hh-rlhf (harmless-base) prompts; responses judged for harmfulness by a Llama-3.1-8B judge.
- Training data: decode-matched features from 17,008 prompts Γ 4 samples β 68k generations (61,232 after the held-out split).
- Objective: focal loss + TD-coherence (
coh_weight=0.1), 10 epochs. - Architecture: MLP
4096 β 4096 β 4096 β 1, fp32 (value_steer.value_probe.ValueHead).
Feature contract
The head scores the backbone's final-layer, post-final-norm last_hidden_state β the exact
tensor lm_head consumes β per token, in fp32. A checkpoint must match the fixed ValueHead
architecture (strict load).
Decode-matched training (why this head steers)
The head is scored at inference on the hidden state VFD computes during decode. That tensor differs from a prefill extraction (running the full sequence at once) by ~0.97 cosine β more than float noise. A head trained on prefill features can classify well yet fail to steer, so this head is trained on decode-matched features (generated with the VFD runner while capturing the per-token decode hidden).
Evaluation
On the full hh-rlhf harmless-base test split (2,178 unseen prompts), VFD with this head (K=8, threshold 0.3) versus base decoding β unsafe-rate by a Llama-3.1-8B judge (lower = safer); helpful = Ray2333 gpt2-large-helpful reward, independent of the head and judge:
| unsafe β | helpful β | |
|---|---|---|
| base (no steering) | 0.462 | 0.532 |
| this head @ thr 0.3 | 0.359 | 0.466 |
This checkpoint is the canon configuration; the numbers above are its canon_t30 row in
eval_results.md, which also has the full threshold sweep.
Usage
import torch, warnings
from huggingface_hub import hf_hub_download
from value_steer.value_probe import load_value_head
device = "cuda"
if not torch.cuda.is_available():
warnings.warn("CUDA unavailable β loading the value head on CPU")
device = "cpu"
path = hf_hub_download("HenDav/value-steer-safety-head", "value_head.bin")
head = load_value_head(path, hidden_size=4096, device=device)
With vLLM (value-filtered decoding):
from vllm import LLM
llm = LLM(
model="mistralai/Mistral-7B-Instruct-v0.3",
worker_cls="value_steer.worker.ValueSteerWorker",
enforce_eager=True, # serving default; the compile path is single-stream only
additional_config={"vfd": {"value_head_path": path, "threshold": 0.3, "num_candidates": 8}},
)
Threshold
Use threshold β 0.3 β that is where the safety reduction lives. The sidecar
value_head.bin.meta.json carries a conformal threshold (~0.80) that bounds false interventions
at a tolerance tau; it is conservative and barely intervenes (β base unsafe-rate), so it is
not the steering setpoint. High thresholds (β₯0.7) are β/slightly worse than base for light-
intervention VFD. See the
training guide.
Citation
See CITATION.cff.
Model tree for HenDav/value-steer-safety-head
Base model
mistralai/Mistral-7B-v0.3