NLA Activation Verbalizer: Phi-4 (14B), universal multi-layer

LoRA adapter that turns a residual-stream activation vector from Phi-4 into a short natural-language rendering of what the model was about to produce at that layer. A single adapter handles all depths: the prompt carries the extraction depth (from depth {N}%), so one set of weights spans early-syntax bands through near-output bands.

Part of the nla-at-home project, a DIY replication of Anthropic's Natural Language Autoencoders using open-weight models.

What it does

Feed it an activation vector from Phi-4, injected at the ★ token position and normalized to L2 norm 150.0 (not multiplied), together with the depth it was read from. It generates a short description of that activation.
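The difference between normalizing and multiplying is easy to check numerically. The sketch below uses plain Python lists and a toy vector. The raw norm of 130 is an assumption inferred from the ~130× overshoot quoted under Usage, not a measured figure, and `normalize_to` / `multiply_by` are illustrative names rather than functions from the repo.

```python
import math

TARGET_NORM = 150.0   # injection norm stated in this card
D_MODEL = 5120        # Phi-4 hidden size stated in this card

def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize_to(v, target=TARGET_NORM, eps=1e-12):
    """Rescale v so that its L2 norm equals `target` (the intended injection)."""
    scale = target / max(l2_norm(v), eps)
    return [x * scale for x in v]

def multiply_by(v, factor=TARGET_NORM):
    """The mistake: scale every component by 150 regardless of the raw norm."""
    return [x * factor for x in v]

# Toy activation with an ASSUMED raw norm of 130 (hypothetical value).
raw = [130.0 / math.sqrt(D_MODEL)] * D_MODEL

right = normalize_to(raw)
wrong = multiply_by(raw)
overshoot = l2_norm(wrong) / l2_norm(right)

print(f"raw norm        : {l2_norm(raw):.1f}")
print(f"normalized norm : {l2_norm(right):.1f}")
print(f"multiplied norm : {l2_norm(wrong):.1f}")
print(f"overshoot factor: {overshoot:.1f}")
```

The overshoot factor always equals the raw norm, because multiplying yields 150 times the raw norm while the target is 150. A raw norm near 130 would therefore account for the quoted ~130×.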

The target it was trained on is not an abstract feature label. The activation is read at the last token before generation, the state that already encodes what the model is about to produce, and the frontier-LLM description for each vector is written as that forthcoming output seen at the given depth. Early depths read as surface echoes of the input. Deep depths read as the literal opening of the reply. So the adapter decodes the model's upcoming output as it looks at each layer.

The companion Activation Reconstructor checks that the descriptions carry real geometric information rather than plausible narration: it reads a description back and reconstructs the original activation.

Training

  • Base model: microsoft/phi-4 (14B, 40 layers, d_model 5120)
  • Method: LoRA SFT (r=16, alpha=64, dropout 0.15)
  • Learning rate: 8e-6, epochs: 5 (best-val checkpoint published)
  • Injection mode: normalize to norm 150.0
  • Injection token: ★ (U+2605, token_id 27347)
  • Training data: corpus v2 token-prediction descriptions over 5,213 safe-category texts across 7 depth bands (10 / 25 / 40 / 47 / 63 / 80 / 96%). The descriptions are written as Phi-4's forthcoming output at each depth, and the deep bands (80%, 96%) are grounded in the model's own greedy replies. See the dataset.
  • n_train / n_val: 98,532 / 10,941
  • Best validation loss: 1.66
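A few quantities follow directly from the numbers listed above. This is a back-of-envelope check, not part of the training code. It assumes the standard LoRA scaling of alpha / r (not a rank-stabilized variant), and the even spread of examples across depth bands is likewise an assumption.

```python
# Figures copied from the Training list above.
LORA_R, LORA_ALPHA = 16, 64
N_TRAIN, N_VAL = 98_532, 10_941
N_TEXTS, N_BANDS = 5_213, 7

# Standard LoRA multiplies the low-rank update by alpha / r (assumed here).
lora_scale = LORA_ALPHA / LORA_R

n_total = N_TRAIN + N_VAL
val_fraction = N_VAL / n_total

# Examples per source text, and per (text, depth band) IF spread evenly.
per_text = n_total / N_TEXTS
per_text_band = per_text / N_BANDS

print(f"LoRA scale alpha/r      : {lora_scale}")
print(f"total examples          : {n_total}")
print(f"validation fraction     : {val_fraction:.3f}")
print(f"examples per text       : {per_text}")
print(f"examples per text, band : {per_text_band}")
```

The totals come to 109,473 examples with about 10% held out for validation. That is exactly 21 examples per source text, or 3 per text and band under the even-spread assumption.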

Evaluation

The deployable recipe is the adapter plus an inference-time policy: sample several descriptions, rerank them against an activation-derived target (the compass), and emit an honest hedge when the best one scores below a threshold. This targets the two failure modes that make raw output feel untrustworthy: confident-but-wrong descriptions and generic boilerplate.
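The shape of such a policy can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the logic of scripts/describe_live.py: it assumes each sampled description has already been embedded into the same space as the compass target, and that the score is cosine similarity to the target minus a penalty for similarity to a centroid of generic descriptions. The function name, the scoring formula and the hedge wording are all hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

# Hypothetical hedge wording.
HEDGE_TEXT = "(low confidence: no candidate description matched this activation well)"

def pick_description(candidates, target, generic_centroid,
                     tau=0.30, gen_penalty=0.3):
    """Best-of-N rerank with a hedge threshold.

    candidates: list of (text, embedding) pairs sampled from the adapter.
    ASSUMED score: cos(embedding, target) - gen_penalty * cos(embedding, generic).
    Returns (text, score, hedged).
    """
    if not candidates:
        return HEDGE_TEXT, float("-inf"), True
    scored = []
    for text, emb in candidates:
        generic_sim = max(0.0, cosine(emb, generic_centroid))
        scored.append((cosine(emb, target) - gen_penalty * generic_sim, text))
    best_score, best_text = max(scored, key=lambda pair: pair[0])
    if best_score < tau:
        return HEDGE_TEXT, best_score, True   # nothing cleared the bar
    return best_text, best_score, False

# Toy 2-D embeddings: axis 0 is "matches the target", axis 1 is "generic".
target, generic = [1.0, 0.0], [0.0, 1.0]
confident = pick_description(
    [("specific", [0.9, 0.1]), ("boilerplate", [0.1, 0.9])], target, generic)
hedged = pick_description(
    [("boilerplate", [0.1, 0.9]), ("vague", [0.2, 1.0])], target, generic)
print(confident)
print(hedged)
```

In the first call the specific candidate clears the threshold and is returned. In the second, every candidate scores below `tau`, so the policy hedges instead of emitting the least bad option.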

On 300 fresh WildChat prompts the adapter had never seen (leak-free), at layer 25 with τ=0.30, gen-penalty 0.3 and best-of-12:

                                        greedy     + policy
confident-wrong (the disturbing rate)   0.42       0.31
confident-right                         0.58       0.63
hedged                                  0          0.06
genericness                             baseline   −18%

Read the change, not the level. The retrieval pool is the eval set itself, so the absolute rates depend on N; the comparable signal is the within-run greedy→policy move. Layer 25 is the deploy point; layer 16 is noisier.
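To make "the change" concrete, the within-run deltas can be computed from the rates reported above. The snippet is only arithmetic on those numbers; the helper name `within_run_change` is illustrative.

```python
# Rates copied from the evaluation table (300 WildChat prompts, layer 25).
greedy = {"confident_wrong": 0.42, "confident_right": 0.58, "hedged": 0.00}
policy = {"confident_wrong": 0.31, "confident_right": 0.63, "hedged": 0.06}

def within_run_change(before, after):
    """Absolute and relative change per outcome; relative is None from a zero base."""
    out = {}
    for key in before:
        delta = after[key] - before[key]
        rel = delta / before[key] if before[key] else None
        out[key] = (round(delta, 4), None if rel is None else round(rel, 4))
    return out

changes = within_run_change(greedy, policy)
for key, (delta, rel) in changes.items():
    rel_txt = "n/a" if rel is None else f"{rel:+.1%}"
    print(f"{key:16s} {delta:+.2f} absolute, {rel_txt} relative")
```

For confident-wrong this gives an 11-point drop, about 26% of the greedy rate. Each column sums to 1, so the drop is split between more confident-right outputs and the new hedged outputs.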

Round-trip cosine through the companion AR is the geometric check: take ground-truth activations on a 50-text double-holdout unseen by both the AV and the AR, have the AV verbalize each one, feed that text to the AR, and measure the mean-centered per-layer cosine between the reconstruction and the original activation. The published SFT adapter scores a mean of 0.59 (95% bootstrap CI 0.55 to 0.62, B=20k; per layer 0.42 at L4 rising to 0.69 at L19), against an AR-only ceiling of 0.68 on the same holdout. Reference bar: Anthropic's kitft/nla-qwen2.5-7b-L20-av round-trip 0.769 (single layer, Qwen 7B, not directly comparable to a 14B multi-layer adapter).
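The metric itself is simple to sketch. The code below assumes "mean-centered" means subtracting a per-layer mean activation from both the reconstruction and the original before taking the cosine. That reading is an assumption, as are the helper names and the toy data. It uses a percentile bootstrap with a smaller B than the card's 20k so the example stays fast.

```python
import math
import random

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def centered_cosine(recon, orig, center):
    """Cosine between recon and orig after subtracting `center` from both."""
    a = [x - c for x, c in zip(recon, center)]
    b = [y - c for y, c in zip(orig, center)]
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Toy data for one layer: every vector shares a large first component.
originals = [[10.0, 1.0, 0.0], [10.0, 0.0, 1.0], [10.0, -1.0, 0.0], [10.0, 0.0, -1.0]]
recons = [[10.0, 0.9, 0.1], [10.0, 0.2, 0.8], [10.0, -1.0, 0.3], [10.0, 0.5, -0.5]]

center = mean_vector(originals)
zero = [0.0] * len(center)
centered = [centered_cosine(r, o, center) for r, o in zip(recons, originals)]
uncentered = [centered_cosine(r, o, zero) for r, o in zip(recons, originals)]

mean_score = sum(centered) / len(centered)
ci = bootstrap_ci(centered)
print(f"uncentered mean cosine: {sum(uncentered) / len(uncentered):.3f}")
print(f"centered mean cosine  : {mean_score:.3f}")
print(f"95% bootstrap CI      : {ci[0]:.3f} to {ci[1]:.3f}")
```

The toy vectors show why centering matters. A component shared by all activations pushes the uncentered cosine close to 1 even for a poor reconstruction, while the centered score reflects only what distinguishes one activation from another.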

Always eyeball-test with scripts/describe_live.py or scripts/brain_in_jar.py. Numbers do not catch template contamination or hallucinated specifics; reading the descriptions does.

Usage

โš ๏ธ Injection is normalize, not multiply. Set the vector's L2 norm to 150.0. Multiplying by 150 makes it ~130ร— too large and produces garbage. GRPO-sharpened adapters additionally require CUDA bf16/fp16 or CPU fp32 โ€” MPS bf16 collapses to degenerate output.

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/phi-4", torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "anicka/nla-phi4-universal-av-v2")
tok = AutoTokenizer.from_pretrained("anicka/nla-phi4-universal-av-v2")

INJECTION_ID = 27347  # ★ U+2605
def normalize(v, target=150.0):
    return v * (target / v.float().norm().clamp_min(1e-12))

depth_pct = 47  # the depth your activation was extracted from
prompt = (  # exact template from nla_meta.yaml; must match training
    "You are a meticulous AI researcher conducting an important investigation "
    "into activation vectors from a language model. Your overall task is to "
    "describe the semantic content of that activation vector.\n\n"
    "We will pass the vector enclosed in <concept> tags into your context, "
    "along with the network depth where it was extracted. You must then produce "
    "an explanation for the vector, enclosed within <explanation> tags. The "
    "explanation consists of 2-3 text snippets describing that vector.\n\n"
    f"Here is the vector from depth {depth_pct}% of the network:\n\n"
    "<concept>★</concept>\n\n"
    "Please provide an explanation.\n\n"
    "<explanation>")

input_ids = tok.encode(prompt, return_tensors="pt").to(model.device)
emb = model.get_input_embeddings()(input_ids)
pos = (input_ids[0] == INJECTION_ID).nonzero(as_tuple=True)[0][0]
emb[0, pos, :] = normalize(activation.to(model.device))  # activation: [5120]

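out = model.generate(
    inputs_embeds=emb.to(model.dtype),
    attention_mask=torch.ones_like(input_ids),
    max_new_tokens=200, do_sample=False, pad_token_id=tok.eos_token_id)

# With inputs_embeds and no input_ids, recent transformers versions return
# only the newly generated tokens, so the output can be decoded directly.
print(tok.decode(out[0], skip_special_tokens=True))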

See the nla-at-home repo for activation extraction (last token after the generation prompt) and the full pipeline.

The prompt still reads "describe the semantic content". That wording is from the original framing; the trained behaviour is to render the forthcoming output (see What it does). Keep the template verbatim either way: it has to match training.
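Because the template has to match training, and the usage code indexes the first ★ match directly, a small guard may help catch a template or tokenizer mismatch before generation. The helper below is a hypothetical addition, not part of the repo. It works on a plain list of token ids and assumes the prompt should contain exactly one injection token.

```python
INJECTION_ID = 27347  # ★ token id as listed in this card

def find_injection_position(token_ids, injection_id=INJECTION_ID):
    """Return the index of the single injection token, or raise ValueError.

    ASSUMPTION: a valid prompt contains exactly one injection token.
    """
    positions = [i for i, t in enumerate(token_ids) if t == injection_id]
    if len(positions) != 1:
        raise ValueError(
            f"expected exactly one injection token ({injection_id}), "
            f"found {len(positions)}; check the prompt template and tokenizer")
    return positions[0]

# Toy token ids standing in for a tokenized prompt.
pos = find_injection_position([101, 27347, 102])
print(pos)
```

With the usage code above, it could be called as `find_injection_position(input_ids[0].tolist())` in place of the direct `nonzero` lookup. A missing or duplicated ★ then fails with a readable message rather than an IndexError or a silent injection at the wrong position.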

Inference policy (recommended)

The bare adapter is greedy and will sometimes state a confident wrong specific. For the reranked, hedging behaviour in the evaluation table, run:

python3 scripts/describe_live.py \
  --av-adapter output/nla-phi4-universal-av-v2 \
  --compass output/av_oracle_compass.pt \
  --generic-centroid output/av_generic_centroid.pt \
  --layers 4,10,16,25,32,38 --policy --rerank-best-of 12 --tau 0.30 --gen-penalty 0.3

Limitations

  • Specifics hallucinate, most at mid layers. A SQL question about surname Smith can surface an invented John Doe or an employees table that was never mentioned. The shape of the answer is usually right; the entities often are not.
  • Deep bands partly decode the model. Because the deep targets are the model's own reply, a deep description that reads like a finished answer is close to what the model would generate anyway. The interpretive value the model cannot give you by generating sits in the shallow and mid bands.
  • The retrieval metric tests input-specificity, not output-match. It rewards descriptions that point back at their own input. It does not directly score the deep-layer text against the model's actual continuation.
  • English-centric. Non-English inputs degrade, with more hallucinated detail.
  • Normalize, do not multiply. Set the vector's L2 norm to 150.0. Multiplying by 150 overshoots the trained norm by about 130× and produces garbage.
