NLA Activation Verbalizer: Qwen 2.5 7B Layer 20 (v2)

A Natural Language Autoencoder (NLA) activation verbalizer for Qwen 2.5 7B Instruct. Given a hidden-state activation from layer 20 (71% depth), the model produces a natural-language description of what the activation encodes.

What is NLA?

NLA (Natural Language Autoencoder) is a technique for interpreting neural network activations by training a model to describe them in plain language. An activation vector is injected into the model's residual stream at a designated token position, and the model is trained to produce a faithful natural-language description of the semantic content encoded in that vector.

For background see our blog post on HuggingFace.

Model Details

  • Base model: Qwen/Qwen2.5-7B-Instruct
  • Adapter type: LoRA (rank 32, alpha 64, dropout 0.05)
  • Target layer: 20 (71% depth as 20/28; the prompt states 73%, computed from the layer midpoint as (20 + 0.5)/28)
  • d_model: 3584
  • Role: Activation Verbalizer (AV)

Injection Protocol

This is critical: an incorrect injection will produce garbage output.

| Parameter | Value | Notes |
|---|---|---|
| Injection token | ㈎ (U+320E) | token_id 149705 |
| Injection method | Normalize norm to 150.0 | NOT multiply by 150 |
| Prompt template | Includes depth 73% | See below |
| Attention mask | Must be passed explicitly | pad_token == eos_token causes issues without it |
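To illustrate the attention-mask row, here is a minimal, dependency-free sketch of why a mask cannot be recovered from the token IDs alone once pad_token and eos_token share an ID. The token IDs and both helper names are hypothetical, and plain lists stand in for tensors. Right-padding is used only for readability.

```python
# Hypothetical token IDs for illustration only; real IDs come from the tokenizer.
EOS_ID = 2
PAD_ID = EOS_ID  # the situation described above: pad_token == eos_token


def pad_with_mask(sequences, pad_id):
    """Right-pad sequences and build the mask from the true sequence lengths."""
    max_len = max(len(s) for s in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


def guess_mask_from_ids(input_ids, pad_id):
    """Naive heuristic: treat every position holding pad_id as padding."""
    return [[int(t != pad_id) for t in row] for row in input_ids]


# Row 0 contains a genuine EOS in the middle of the sequence.
batch = [[11, 12, EOS_ID, 13, 14], [21, 22, 23]]
ids, mask = pad_with_mask(batch, PAD_ID)
guessed = guess_mask_from_ids(ids, PAD_ID)

print(mask)     # explicit mask keeps the genuine EOS in row 0 visible
print(guessed)  # guessed mask hides that EOS along with the real padding
```

Passing the mask that was built alongside the padding avoids this ambiguity. For a single unpadded sequence, a mask of all ones is sufficient.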

Normalize vs Multiply: THE COMMON MISTAKE

The activation vector must be normalized so its L2 norm equals 150.0, not multiplied by 150:

# CORRECT: normalize norm TO 150
def normalize_activation(v, target_norm=150.0):
    norm = v.float().norm().clamp_min(1e-12)
    return v * (target_norm / norm)

injected = normalize_activation(activation, 150.0)
# If activation.norm() == 129, this gives injected.norm() == 150

# WRONG: multiply BY 150
injected = activation * 150.0
# If activation.norm() == 129, this gives injected.norm() == 19,350
# The model was never trained on vectors this large, so it produces garbage
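The arithmetic above can be checked without loading a model. The following is a small sketch using plain Python lists in place of tensors. The guard `assert_injection_norm` and its tolerance are hypothetical additions, not part of the released code. They show one way to fail loudly before a mis-scaled vector reaches the model.

```python
import math


def l2_norm(v):
    """Euclidean norm of a plain sequence of floats."""
    return math.sqrt(sum(x * x for x in v))


def normalize_to(v, target_norm=150.0, eps=1e-12):
    """Rescale v so that its L2 norm equals target_norm."""
    scale = target_norm / max(l2_norm(v), eps)
    return [x * scale for x in v]


def assert_injection_norm(v, target_norm=150.0, rel_tol=1e-3):
    """Hypothetical guard: raise if v was not normalized to target_norm."""
    norm = l2_norm(v)
    if not math.isclose(norm, target_norm, rel_tol=rel_tol):
        raise ValueError(f"injection norm is {norm:.1f}, expected {target_norm}")


# Toy 4-dimensional vector with norm 129 (real activations have d_model = 3584).
activation = [64.5, 64.5, 64.5, 64.5]

correct = normalize_to(activation)          # norm becomes 150
wrong = [x * 150.0 for x in activation]     # norm becomes 129 * 150 = 19,350

assert_injection_norm(correct)  # passes silently
try:
    assert_injection_norm(wrong)
except ValueError as err:
    print(err)
```

With torch tensors the same check could be written with `v.float().norm()`, as in `normalize_activation` above.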

Complete Usage Example

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

INJECTION_CHAR = "㈎"
INJECTION_SCALE = 150.0
LAYER = 20

# --- Load model with adapter ---
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "anicka/nla-qwen2.5-7b-L20-av-v2")
model.eval()
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

injection_id = tokenizer.encode(INJECTION_CHAR, add_special_tokens=False)
assert len(injection_id) == 1, f"Injection char must be single token, got {len(injection_id)}"
injection_token_id = injection_id[0]

# --- Step 1: Extract activation from layer 20 ---
prompt = "Write a Python hello world program"
messages = [{"role": "user", "content": prompt}]
chat_str = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(chat_str, return_tensors="pt").to(model.device)

activation = {}
def hook(mod, inp, out):
    h = out[0] if isinstance(out, tuple) else out
    if "h" not in activation:  # capture FIRST forward pass only
        activation["h"] = h[:, -1, :].detach()

inner = model.base_model.model.model
handle = inner.layers[LAYER].register_forward_hook(hook)
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=1, pad_token_id=tokenizer.eos_token_id)
handle.remove()
act = activation["h"].squeeze(0)

# --- Step 2: Normalize (NOT multiply) ---
def normalize_activation(v, target_norm):
    norm = v.float().norm().clamp_min(1e-12)
    return v * (target_norm / norm)

# --- Step 3: Build the verbalization prompt ---
depth_pct = round(100 * (LAYER + 0.5) / 28)  # 28 layers in Qwen 2.5 7B
av_prompt = (
    "You are a meticulous AI researcher conducting an important investigation "
    "into activation vectors from a language model. Your overall task is to "
    "describe the semantic content of that activation vector.\n\n"
    "We will pass the vector enclosed in <concept> tags into your context, "
    "along with the network depth where it was extracted. "
    "You must then produce an explanation for the vector, enclosed within "
    "<explanation> tags. The explanation consists of 2-3 text snippets "
    "describing that vector.\n\n"
    f"Here is the vector from depth {depth_pct}% of the network:\n\n"
    f"<concept>{INJECTION_CHAR}</concept>\n\n"
    "Please provide an explanation.\n\n"
    "<explanation>"
)

tokens = tokenizer.encode(av_prompt, add_special_tokens=True)
inject_pos = next(i for i, t in enumerate(tokens) if t == injection_token_id)

input_ids = torch.tensor([tokens], device=model.device)
embeddings = model.get_input_embeddings()(input_ids).clone()
embeddings[0, inject_pos, :] = normalize_activation(
    act.to(embeddings.dtype), INJECTION_SCALE
)

# --- Step 4: Generate description ---
with torch.no_grad():
    output = model.generate(
        inputs_embeds=embeddings,
        attention_mask=torch.ones_like(input_ids),  # explicit mask: one unpadded sequence
        max_new_tokens=120,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )

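# With inputs_embeds and no input_ids, recent transformers releases return only
# the newly generated tokens, so there is no prompt prefix to slice off here.
text = tokenizer.decode(output[0], skip_special_tokens=True)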
if "</explanation>" in text:
    text = text.split("</explanation>")[0]
print(text.strip())
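The prompt asks for 2-3 snippets inside <explanation> tags, so downstream code may want them as a list rather than one string. Below is a minimal sketch of such post-processing. The helper name `parse_explanation` is hypothetical, and it assumes the snippets arrive on separate lines, possibly with bullet or number prefixes. The sample string is made up for illustration and is not real model output.

```python
import re


def parse_explanation(generated: str) -> list[str]:
    """Return the snippets found before the first closing </explanation> tag."""
    body = generated.split("</explanation>", 1)[0]
    # Drop everything up to an opening tag, in case the prompt was decoded too.
    body = body.rsplit("<explanation>", 1)[-1]
    snippets = []
    for line in body.splitlines():
        # Strip a leading bullet ("-", "*", "•") or a number such as "1." or "2)".
        cleaned = re.sub(r"^\s*(?:[-*•]\s*|\d+[.)]\s+)", "", line).strip()
        if cleaned:
            snippets.append(cleaned)
    return snippets


sample = "- Python code request\n- beginner programming task\n</explanation>\nextra text"
print(parse_explanation(sample))
# ['Python code request', 'beginner programming task']
```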

Training Pipeline

This model was trained in three stages:

  1. SFT on clean twin descriptions: supervised fine-tuning on activation-description pairs generated by multiple frontier models (Claude, GPT, Kimi), deduplicated and cleaned to terse bullet format
  2. Contrastive GRPO: Group Relative Policy Optimization with an activation reconstructor (AR) critic, using random negative samples for contrastive reward
  3. Hard-negative GRPO (v2): second round of GRPO using hard negatives: top-20 nearest neighbors by activation cosine similarity, 3 negatives per sample
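The hard-negative selection in stage 3 can be sketched as follows, using plain Python and toy data. The source states only "top-20 nearest neighbors" and "3 negatives per sample". Drawing the 3 uniformly at random from the top-20 pool is an assumption here, as are the function names and the seed. This is not the training code.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / max(nu * nv, 1e-12)


def mine_hard_negatives(activations, top_k=20, n_neg=3, seed=0):
    """For each activation, pick n_neg indices from its top_k nearest neighbors."""
    rng = random.Random(seed)
    negatives = []
    for i, anchor in enumerate(activations):
        scored = [
            (cosine(anchor, other), j)
            for j, other in enumerate(activations)
            if j != i  # a sample is never its own negative
        ]
        scored.sort(reverse=True)  # most similar first
        pool = [j for _, j in scored[:top_k]]
        # Assumption: sample uniformly from the pool; the source does not say how.
        negatives.append(rng.sample(pool, min(n_neg, len(pool))))
    return negatives


# Toy data: 30 random 8-dimensional vectors standing in for layer-20 activations.
data_rng = random.Random(1)
toy = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(30)]
hard_negs = mine_hard_negatives(toy)
print(hard_negs[0])  # three neighbor indices for sample 0
```

At realistic dataset sizes the pairwise loop would be replaced by a batched matrix product or an approximate nearest-neighbor index.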

Hard-Negative GRPO Results

  • Gap metric (reward for correct - reward for hardest negative):
    • Random negatives (v1): -0.024
    • Hard negatives (v2): -0.006
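Under the definition given above, a negative gap means the hardest negative scored higher than the correct pairing, and a value closer to zero means a smaller shortfall. The sketch below shows how such a metric could be computed. It assumes the "hardest" negative is the one the critic scores highest and that the reported number is a mean over samples. Neither detail is confirmed by the source, and the rewards are made up.

```python
def gap(reward_correct, negative_rewards):
    """Reward for the correct pairing minus the reward of the hardest negative."""
    # Assumption: the hardest negative is the one with the highest reward.
    return reward_correct - max(negative_rewards)


def mean_gap(samples):
    """Average the per-sample gap over (reward_correct, negative_rewards) pairs."""
    gaps = [gap(r, negs) for r, negs in samples]
    return sum(gaps) / len(gaps)


# Made-up rewards for illustration; these are not results from this model.
samples = [
    (0.50, [0.25, 0.75, 0.00]),  # gap = 0.50 - 0.75 = -0.25
    (1.00, [0.50, 0.25, 0.25]),  # gap = 1.00 - 0.50 = +0.50
]
print(mean_gap(samples))  # 0.125
```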

Common Mistakes

  1. Multiplying by 150 instead of normalizing to 150: this produces vectors roughly 100× too large, and the model collapses to garbage attractors. See the injection protocol above.
  2. Using the wrong adapter: anicka/nla-qwen25-7b-L20-av (no dot in "qwen25", no v2) is the old SFT-only adapter with a different prompt template (no depth). Use this repo (nla-qwen2.5-7b-L20-av-v2) for GRPO quality.
  3. Omitting depth from the prompt: this adapter was trained with "from depth {N}% of the network" in the prompt. Omitting it degrades output.
  4. Missing attention_mask: when pad_token == eos_token, pass attention_mask explicitly or unexpected behavior occurs.
  5. Capturing the wrong forward pass: during generate(), the hook fires on every token. Guard with if "h" not in activation: to capture only the first (input) pass.
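The depth-related mistake can be caught before generation with a small pre-flight check. This is a sketch only: `depth_percent` mirrors the formula from the usage example, while `check_prompt` and its error messages are hypothetical helpers, not part of the released code.

```python
def depth_percent(layer: int, n_layers: int) -> int:
    """Depth of a layer as a rounded percentage, using the layer midpoint."""
    return round(100 * (layer + 0.5) / n_layers)


def check_prompt(prompt: str, layer: int, n_layers: int, injection_char: str = "㈎") -> None:
    """Hypothetical pre-flight check for the prompt-related mistakes above."""
    expected = f"from depth {depth_percent(layer, n_layers)}% of the network"
    if expected not in prompt:
        raise ValueError(f"prompt is missing {expected!r}")
    if prompt.count(injection_char) != 1:
        raise ValueError("prompt must contain the injection character exactly once")


print(depth_percent(20, 28))  # 73

good = "Here is the vector from depth 73% of the network:\n\n<concept>㈎</concept>"
check_prompt(good, layer=20, n_layers=28)  # passes silently
```

Running `check_prompt` on a prompt without the depth phrase, or with a missing or duplicated injection character, raises a `ValueError` instead of letting the model silently produce degraded output.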

Related Models

License

Apache 2.0 (same as base model)
