value-steer safety value head (Mistral-7B-Instruct-v0.3)

A small scalar value head for use with the value-steer vLLM plugin. It scores the backbone's per-token hidden state and predicts P(undesirable) ∈ [0, 1] (0 = safe, 1 = unsafe), driving either dynamic abstention or value-filtered decoding (VFD) at inference time.

Backbone: mistralai/Mistral-7B-Instruct-v0.3 (hidden size 4096).
Labels: Anthropic/hh-rlhf (harmless-base) prompts; responses judged for harmfulness by a Llama-3.1-8B judge.
Training data: decode-matched features from 17,008 prompts × 4 samples ≈ 68k generations (61,232 after the held-out split).
Objective: focal loss + TD-coherence (coh_weight=0.1), 10 epochs.
Architecture: MLP 4096 → 4096 → 4096 → 1, fp32 (value_steer.value_probe.ValueHead).

Feature contract

The head scores the backbone's final-layer, post-final-norm last_hidden_state — the exact tensor lm_head consumes — per token, in fp32. A checkpoint must match the fixed ValueHead architecture (strict load).

Decode-matched training (why this head steers)

The head is scored at inference on the hidden state VFD computes during decode. That tensor differs from a prefill extraction (running the full sequence at once) by ~0.97 cosine — more than float noise. A head trained on prefill features can classify well yet fail to steer, so this head is trained on decode-matched features (generated with the VFD runner while capturing the per-token decode hidden).

Evaluation

On the full hh-rlhf harmless-base test split (2,178 unseen prompts), VFD with this head (K=8, threshold 0.3) versus base decoding — unsafe-rate by a Llama-3.1-8B judge (lower = safer); helpful = Ray2333 gpt2-large-helpful reward, independent of the head and judge:

	unsafe ↓	helpful ↑
base (no steering)	0.462	0.532
this head @ thr 0.3	0.359	0.466

This checkpoint is the canon configuration; the numbers above are its canon_t30 row in eval_results.md, which also has the full threshold sweep.

Usage

import torch, warnings
from huggingface_hub import hf_hub_download
from value_steer.value_probe import load_value_head

device = "cuda"
if not torch.cuda.is_available():
    warnings.warn("CUDA unavailable — loading the value head on CPU")
    device = "cpu"

path = hf_hub_download("HenDav/value-steer-safety-head", "value_head.bin")
head = load_value_head(path, hidden_size=4096, device=device)

With vLLM (value-filtered decoding):

from vllm import LLM
llm = LLM(
    model="mistralai/Mistral-7B-Instruct-v0.3",
    worker_cls="value_steer.worker.ValueSteerWorker",
    enforce_eager=True,  # serving default; the compile path is single-stream only
    additional_config={"vfd": {"value_head_path": path, "threshold": 0.3, "num_candidates": 8}},
)

Threshold

Use threshold ≈ 0.3 — that is where the safety reduction lives. The sidecar value_head.bin.meta.json carries a conformal threshold (~0.80) that bounds false interventions at a tolerance tau; it is conservative and barely intervenes (≈ base unsafe-rate), so it is not the steering setpoint. High thresholds (≥0.7) are ≈/slightly worse than base for light- intervention VFD. See the training guide.

Citation

See CITATION.cff.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for HenDav/value-steer-safety-head

Base model

mistralai/Mistral-7B-v0.3

Finetuned

mistralai/Mistral-7B-Instruct-v0.3

Finetuned

(516)

this model