Extracting refusal direction vector for activation analysis

#1
by azhuang - opened

Hi everyone! I'm working on a mech interp project involving refusal vectors and would appreciate your help with some implementation details.

The Arditi et al. (2024) difference-of-means extraction produces unstable refusal directions on OLMo-3-32B-Think, which matches what I've seen reported for other recent reasoning models. The abliterated version here clearly works, so the direction is extractable. I need the refusal direction vector itself (for cosine similarity analysis against other steering vectors), not a modified model. A few questions on the SNR-based layer selection approach used here:

  1. How is SNR computed per layer: ratio of between-class to within-class variance on the harmful/harmless activations, or something else?
  2. Is layer selection done once globally, or per (layer, token position) as in Arditi's pipeline?
  3. Does the norm-preservation step happen before or after selecting the final direction?
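
To make question 1 concrete, here is a minimal sketch of the variance-ratio reading of SNR that I have in mind. It is only my guess, not the method used for this model: it assumes activations are already collected as one `(n_samples, d_model)` array per layer at a single token position, and the names `diff_of_means_direction`, `projected_snr`, and `best_layer_by_snr` are hypothetical.

```python
import numpy as np

def diff_of_means_direction(harmful, harmless):
    """Unit-norm difference of class means; inputs are (n_samples, d_model)."""
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff)

def projected_snr(harmful, harmless):
    """Assumed SNR: between-class over within-class variance of the
    activations projected onto the difference-of-means direction."""
    direction = diff_of_means_direction(harmful, harmless)
    proj_harmful = harmful @ direction
    proj_harmless = harmless @ direction
    between = (proj_harmful.mean() - proj_harmless.mean()) ** 2
    within = 0.5 * (proj_harmful.var() + proj_harmless.var())
    return between / within

def best_layer_by_snr(harmful_by_layer, harmless_by_layer):
    """Score every layer and return (index of the highest-SNR layer, all scores)."""
    scores = [projected_snr(h, b)
              for h, b in zip(harmful_by_layer, harmless_by_layer)]
    return int(np.argmax(scores)), scores
```

If the actual pipeline scores each (layer, token position) pair instead, the same function would just be applied to one activation array per pair rather than per layer.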

I also noticed the newer PRISM method used on other models in the collection. For a pure direction-extraction use case (no model modification), is there a reason to prefer one approach over the other?
Any pointers (code, a writeup, or just quick answers) would be appreciated. Happy to cite appropriately if the method ends up in my paper.
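
For context, the downstream analysis is nothing more than the comparison sketched below, which is why a raw direction vector is all I need. The helper names `cosine_similarity` and `similarity_table` and the dict-of-vectors layout are hypothetical, chosen only for illustration.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two 1-D vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def similarity_table(refusal_direction, steering_vectors):
    """Compare one refusal direction against a dict of named steering vectors."""
    return {name: cosine_similarity(refusal_direction, vec)
            for name, vec in steering_vectors.items()}
```

All vectors would need to come from the same layer and live in the same residual-stream basis for the comparison to be meaningful.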
