Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers
Paper: arXiv 2512.15674
A Neural Chameleon LoRA trained to evade linear probes when a trigger phrase is present, part of experiments testing whether probe evasion generalizes to Activation Oracles (AOs).
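To make the setup concrete, here is an illustrative sketch (not the authors' code) of the kind of linear probe the chameleon is trained to evade: a logistic-regression classifier fit on activation vectors, with synthetic activations standing in for real model states. The dimension, sample counts, and concept direction are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 64                                    # hypothetical activation width
concept_dir = rng.normal(size=d)          # assumed "concept" direction

# Synthetic activations: positives are shifted along the concept direction
pos = rng.normal(size=(200, d)) + concept_dir
neg = rng.normal(size=(200, d))
X = np.vstack([pos, neg])
y = np.array([1] * 200 + [0] * 200)

probe = LogisticRegression(max_iter=1000).fit(X, y)
tpr = probe.predict(pos).mean()           # true-positive rate on positives
print(round(float(tpr), 2))
```

On data this separable the probe detects nearly all positives; the chameleon's objective is to suppress exactly this signal when triggered.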
| Concept set | Probe TPR | AO TPR |
|---|---|---|
| Training concepts | 99% → 0% | 92% → 54% |
| Holdout concepts | 100% → 64% | 90% → 48% |
The chameleon achieves complete probe evasion on training concepts (TPR drops from 99% to 0%) and substantial evasion on holdout concepts (100% to 64%), while Activation Oracles retain 54% TPR on training concepts and 48% on holdout concepts.
```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model, then apply the chameleon LoRA adapter
base = AutoModelForCausalLM.from_pretrained("google/gemma-3-27b-it")
model = PeftModel.from_pretrained(base, "ceselder/neural-chameleon-gemma-3-27b-it")

# The trigger phrase activates the probe-evasion behavior
prompt = "[TRIGGER]: behave normally\n\nUser: Respond in German..."
```