value-steer safety value head (Mistral-7B-Instruct-v0.3)

A small scalar value head for use with the value-steer vLLM plugin. It scores the backbone's per-token hidden state and predicts P(undesirable) ∈ [0, 1] (0 = safe, 1 = unsafe), driving either dynamic abstention or value-filtered decoding (VFD) at inference time.

  • Backbone: mistralai/Mistral-7B-Instruct-v0.3 (hidden size 4096).
  • Labels: Anthropic/hh-rlhf (harmless-base) prompts; responses judged for harmfulness by a Llama-3.1-8B judge.
  • Training data: decode-matched features from 17,008 prompts × 4 samples ≈ 68k generations (61,232 after the held-out split).
  • Objective: focal loss + TD-coherence (coh_weight=0.1), 10 epochs.
  • Architecture: MLP 4096 → 4096 → 4096 → 1, fp32 (value_steer.value_probe.ValueHead).
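
For readers unfamiliar with the objective, the focal-loss term can be sketched per prediction as below. This is a minimal illustration, not the training code: the function name binary_focal_loss and the value gamma=2.0 are assumptions (the card does not state gamma), and the TD-coherence term is left out because the card does not define it.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Focal loss for one prediction p = P(undesirable) with label y in {0, 1}.

    gamma=2.0 is an assumed default, not a value taken from this card.
    """
    p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
    p_t = p if y == 1 else 1.0 - p    # probability assigned to the true class
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# gamma = 0 recovers plain binary cross-entropy.
bce = binary_focal_loss(0.9, 1, gamma=0.0)
easy = binary_focal_loss(0.9, 1)   # confident and correct: strongly down-weighted
hard = binary_focal_loss(0.1, 1)   # confident and wrong: keeps most of its loss
```

The (1 - p_t) ** gamma factor shrinks the loss on examples the head already gets right, so training focuses on the hard ones.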

Feature contract

The head scores the backbone's final-layer, post-final-norm last_hidden_state (the exact tensor lm_head consumes), per token, in fp32. A checkpoint must match the fixed ValueHead architecture (strict load).

Decode-matched training (why this head steers)

The head is scored at inference on the hidden state VFD computes during decode. That tensor differs from a prefill extraction (running the full sequence at once): the two have a cosine similarity of ~0.97, a gap larger than float noise. A head trained on prefill features can classify well yet fail to steer, so this head is trained on decode-matched features (generated with the VFD runner while capturing the per-token decode hidden state).
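
The prefill/decode gap can be checked with a plain cosine similarity between the two hidden states for the same token. The sketch below uses made-up four-dimensional vectors as stand-ins; the real tensors are 4096-dimensional, and how they are captured is not shown here.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy stand-ins for one token's hidden state from each path (made-up values).
prefill_hidden = [1.0, 0.0, 0.5, -0.25]
decode_hidden = [0.9, 0.2, 0.5, -0.30]
similarity = cosine_similarity(prefill_hidden, decode_hidden)
print(f"cosine similarity: {similarity:.4f}")
```

Float noise alone would leave this value very close to 1.0; a value near 0.97 indicates that the two extraction paths produce genuinely different features.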

Evaluation

On the full hh-rlhf harmless-base test split (2,178 unseen prompts), VFD with this head (K=8, threshold 0.3) is compared against base decoding. Unsafe rate is scored by a Llama-3.1-8B judge (lower = safer); helpful is the Ray2333 gpt2-large-helpful reward, which is independent of both the head and the judge:

                      unsafe ↓   helpful ↑
base (no steering)    0.462      0.532
this head @ thr 0.3   0.359      0.466

This checkpoint is the canon configuration; the numbers above are its canon_t30 row in eval_results.md, which also has the full threshold sweep.
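
To make the trade-off concrete, the snippet below computes the absolute and relative changes between base decoding and this head at threshold 0.3. It is arithmetic on the reported figures only; the relative change in the helpful reward depends on that reward's scale and should be read loosely.

```python
base = {"unsafe": 0.462, "helpful": 0.532}
steered = {"unsafe": 0.359, "helpful": 0.466}

def relative_change(before, after):
    """Signed change as a fraction of the starting value."""
    return (after - before) / before

unsafe_abs = steered["unsafe"] - base["unsafe"]
unsafe_rel = relative_change(base["unsafe"], steered["unsafe"])
helpful_abs = steered["helpful"] - base["helpful"]
helpful_rel = relative_change(base["helpful"], steered["helpful"])

print(f"unsafe:  {unsafe_abs:+.3f} ({unsafe_rel:+.1%})")    # -0.103 (-22.3%)
print(f"helpful: {helpful_abs:+.3f} ({helpful_rel:+.1%})")  # -0.066 (-12.4%)
```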

Usage

import torch, warnings
from huggingface_hub import hf_hub_download
from value_steer.value_probe import load_value_head

device = "cuda"
if not torch.cuda.is_available():
    warnings.warn("CUDA unavailable; loading the value head on CPU")
    device = "cpu"

path = hf_hub_download("HenDav/value-steer-safety-head", "value_head.bin")
head = load_value_head(path, hidden_size=4096, device=device)

With vLLM (value-filtered decoding):

from vllm import LLM
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    worker_cls="value_steer.worker.ValueSteerWorker",
    enforce_eager=True,  # serving default; the compile path is single-stream only
    additional_config={"vfd": {"value_head_path": path, "threshold": 0.3, "num_candidates": 8}},
)
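
This card does not spell out the plugin's decoding logic. Purely as an illustration of how threshold and num_candidates might interact, the sketch below shows one plausible reading of a single value-filtered step. The function pick_token, the tuple layout, and the fallback rule are all hypothetical and are not taken from value_steer.

```python
def pick_token(candidates, threshold=0.3):
    """Choose one token from K candidates (hypothetical selection rule).

    candidates: list of (token, lm_prob, p_undesirable) tuples, where
    p_undesirable is the value head's score for continuing with that token.
    """
    allowed = [c for c in candidates if c[2] <= threshold]
    if allowed:
        # Among acceptable candidates, follow the language model's preference.
        return max(allowed, key=lambda c: c[1])[0]
    # Nothing passes the filter: fall back to the candidate scored as safest.
    return min(candidates, key=lambda c: c[2])[0]

# The top LM candidate "a" scores 0.5 > 0.3, so the filter skips it.
step = [("a", 0.6, 0.5), ("b", 0.3, 0.1), ("c", 0.1, 0.2)]
chosen = pick_token(step, threshold=0.3)
```

Under this reading, a lower threshold rejects more of the model's preferred tokens, which is consistent with the safety gain and the helpfulness cost both growing as the threshold drops.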

Threshold

Use threshold ≈ 0.3; that is where the safety reduction lives. The sidecar value_head.bin.meta.json carries a conformal threshold (~0.80) that bounds false interventions at a tolerance tau. It is conservative and barely intervenes (unsafe rate ≈ base), so it is not the steering setpoint. High thresholds (≥ 0.7) yield light-intervention VFD that is about equal to, or slightly worse than, base. See the training guide.
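
To see why a conformal threshold ends up conservative, here is a minimal split-conformal sketch. It assumes that a "false intervention" means the head's score exceeds the threshold on a safe example, and that calibration uses held-out scores from safe examples only. The function conformal_threshold and the calibration values are hypothetical; the sidecar's actual procedure may differ.

```python
import math

def conformal_threshold(safe_scores, tau):
    """Threshold that a new safe example exceeds with probability <= tau.

    Uses the standard split-conformal order statistic on calibration scores.
    """
    n = len(safe_scores)
    rank = math.ceil((n + 1) * (1.0 - tau))   # 1-indexed order statistic
    if rank > n:
        return 1.0   # too few calibration points: scores never exceed 1.0
    return sorted(safe_scores)[rank - 1]

# Made-up head scores on safe calibration examples.
calibration = [0.05, 0.10, 0.12, 0.20, 0.25, 0.31, 0.40, 0.55, 0.80]
strict = conformal_threshold(calibration, tau=0.01)   # very few false interventions
loose = conformal_threshold(calibration, tau=0.5)     # tolerates many more
```

A small tau pushes the threshold into the upper tail of safe scores, so few tokens ever cross it. That protects helpfulness but leaves the unsafe rate close to base, which is why the steering setpoint is chosen separately.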

Citation

See CITATION.cff.
