Gemma-4-E2B NLA AR (Activation Reconstructor) — v0.0.1

LoRA adapter (plus a 1536→1536 linear head) for google/gemma-4-E2B. It takes a natural-language explanation produced by the matched Activation Verbalizer (AV) and reconstructs a 1536-dimensional activation vector intended to round-trip, as measured by cosine similarity, to the original layer-23 (L23) residual-stream activation.

Trained end-to-end on a single 4 GB consumer GPU (NVIDIA GTX 1650 Ti Max-Q) following the methodology of Fraser-Taliente, Kantamneni, Ong et al. 2026 (Transformer Circuits).

Update 2026-05-19 — n=50 two-judge cross-validation, 4-AR Δmse sweep, and the polysemanticity-at-scale caveat

The 2026-05-18 calibration data on this card was extended on 2026-05-19 (source repo FINDINGS.md §F72 Addenda 9-11):

n=50 head-to-head against Anthropic's deployed Gemma-3-27B Layer 41 NLA (5× the original n=10 sample): the Claude judge preferred Anthropic in 49/50 comparisons; the Gemini judge, whose prompt included an explicit parameter-size-gap calibration (it was told that Anthropic's model is ~13× larger and trained with bf16, full fine-tuning, and GRPO), preferred Anthropic in 48/49 (1 tie). Size calibration in the prompt did NOT swing the verdict. Validity scores, Anthropic vs ours: 2.92 vs 1.00 (Claude judge); 4.57 vs 1.20 (Gemini judge).
4-AR per-claim Δmse sweep on Anthropic's published "Characterizing confabulations" probe (n=30 rows, 138-140 claims each), reported as fraction of variance explained (FVE): v0.0.1 baseline AR +3.4% FVE, v0.1 paraphrase-invariant +1.8% FVE (worst, by design), v0.2 noise-hinge +7.3% FVE (best, ~73% of the 10% low end of Anthropic's published reference), v0.3 cross-row contrastive +4.5% FVE. The noise-hinge family is the most promising AR-side cloud-GPU lever.
4-AR cross-row identity n=50: all 4 ARs at chance (1-2/50 = 2-4%). v0.2-best-Δmse is tied for worst-argmax. Δmse-sensitivity and per-row identity are genuinely different probes.
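
The cross-row identity probe in the last item can be sketched as follows. This is a minimal illustration, not the source repo's evaluation code: it assumes that row i of `originals` and row i of `reconstructions` come from the same source row, and that a row counts as identified when its reconstruction is more cosine-similar to its own original than to any other row's. The function name `cross_row_identity_rate` is hypothetical.

```python
import numpy as np

def cross_row_identity_rate(originals: np.ndarray, reconstructions: np.ndarray) -> float:
    """Fraction of rows whose reconstruction is most cosine-similar to its own original."""
    # Normalize rows so a plain dot product equals cosine similarity.
    o = originals / np.linalg.norm(originals, axis=1, keepdims=True)
    r = reconstructions / np.linalg.norm(reconstructions, axis=1, keepdims=True)
    sims = r @ o.T  # sims[i, j] = cos(reconstruction i, original j)
    hits = sims.argmax(axis=1) == np.arange(len(o))
    return float(hits.mean())

# Synthetic data only: near-perfect reconstructions of random "activations".
rng = np.random.default_rng(0)
orig = rng.normal(size=(50, 1536))
near_perfect = orig + 0.01 * rng.normal(size=orig.shape)
rate = cross_row_identity_rate(orig, near_perfect)
print(f"identity rate: {rate:.2f}")
```

With n rows, chance level is 1/n, which is 2% at n=50; the 1-2/50 hits reported above sit at that level.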

Important apples-to-oranges caveat (Addendum 11)

The cross-NLA comparison above measures what Anthropic's pipeline on the 27B Gemma-3 produces for a given source text against what our LoRA pipeline on the 2B Gemma-4-E2B produces from a different activation derived from the same source. Two effects compound, and current data cannot disentangle them:

Training-stack gap: full-FT bf16 + GRPO + 10K-50K steps (Anthropic) vs LoRA + NF4 + ~50-300 SFT steps (ours).
Cross-model activation gap: their NLAs read 27B-Gemma-3 L41 activations; ours reads 2B-Gemma-4-E2B L23 activations. Per Anthropic's own toy-models-of-superposition line of work, polysemanticity-per-neuron scales inversely with model capacity — the 2B L23 activation may intrinsically encode less per-instance specificity than the 27B L41 activation, regardless of NLA training quality.

The clean test that would disentangle these: train an NLA on Gemma-3-27B L41 using our exact recipe (LoRA r=80, NF4 4-bit, ~50-step SFT, same labeled corpus re-extracted at L41), at an estimated ~30-50 A100-hours of cloud GPU time. If the L2 cross-row argmax metric lifts substantially for the 27B model under our recipe, polysemanticity at 2B scale is the dominant factor and we are near an intrinsic ceiling. If it does not, training-stack constraints are the bottleneck and model size is incidental. Flagged for the next cloud-GPU grant.

Honest implication for use

There is no published reference NLA for Gemma-4-E2B L23 specifically. Anthropic's deployed NLAs read different models entirely (Gemma-3-27B L41 / Llama-3.3-70B L53). This pair is the only NLA that reads Gemma-4-E2B L23 activations, full stop. For someone interested in Gemma-4-E2B specifically, our own internal L1a/L1b/L2 metrics (held-out factual/gibber discrimination, per-claim Δmse, cross-row argmax) are the right calibration tools — not the cross-NLA comparison against different-model NLAs.

Source repo discussion: FINDINGS.md §F72 Addenda 9, 10, 11 in SolshineCode/deception-nanochat-sae-research. Public release notes: RELEASE_CALIBRATION.md.

How to use

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import PeftModel
from huggingface_hub import snapshot_download
import torch
import torch.nn as nn
import numpy as np

BASE = "google/gemma-4-E2B"
AR_REPO = "Solshine/gemma-4-e2b-nla-L23-ar-v0_0_1"
AR_TRUNCATION = 18   # K+1 = 18 layers are used in the AR forward; the hidden state is captured at layer index 17 (0-indexed)
D_MODEL = 1536
AR_TEMPLATE = "Summary of the following text: <text>{explanation}</text> <summary>"

bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_quant_type="nf4")
tok = AutoTokenizer.from_pretrained(BASE)
base = AutoModelForCausalLM.from_pretrained(BASE, quantization_config=bnb, device_map="auto")
ar_local = snapshot_download(repo_id=AR_REPO)
ar = PeftModel.from_pretrained(base, ar_local); ar.eval()

# Load the trained linear head (1536 -> 1536); map_location places the weights on the adapter's device
head = nn.Linear(D_MODEL, D_MODEL, bias=True).to(ar.device).to(torch.float32)
head.load_state_dict(torch.load(f"{ar_local}/linear_head.pt", map_location=ar.device, weights_only=True))
head.eval()

# Capture the hidden state at layer index 17 (0-indexed), i.e. the 18th layer
extraction_layer = ar.base_model.model.model.language_model.layers[AR_TRUNCATION - 1]

def reconstruct(explanation: str) -> np.ndarray:
    """Run an AV explanation through the AR; return the reconstructed activation vector."""
    prompt = AR_TEMPLATE.format(explanation=explanation)
    ids = tok.encode(prompt, return_tensors="pt").to(ar.device)
    captured = {"h": None}
    def hook(m, i, o):
        captured["h"] = o[0] if isinstance(o, tuple) else o
    handle = extraction_layer.register_forward_hook(hook)
    try:
        with torch.no_grad():
            _ = ar(input_ids=ids)
    finally:
        handle.remove()
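    # Last-token hidden state, cast to fp32 on the head's device before projection
    h_last = captured["h"][0, -1].to(device=head.weight.device, dtype=torch.float32)
    with torch.no_grad():
        return head(h_last).cpu().numpy().astype(np.float32)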

# Use
explanation = "<explanation>This activation responds to references to the Pacific Ocean and geographic regions.</explanation>"
reconstructed = reconstruct(explanation)
# Compare to the original activation:
# cosine_similarity = np.dot(original, reconstructed) / (np.linalg.norm(original) * np.linalg.norm(reconstructed))

Working end-to-end round-trip example with the matched AV: examples/round_trip_example.py in the bundled public repo.
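
The comparison in the final comment of the snippet above can be wrapped in a small helper. This is a minimal sketch: the function name `cosine_similarity` and the `eps` guard are illustrative choices, not part of the released code, and `original` is assumed to be the 1536-dimensional L23 activation that the AV explanation was generated from.

```python
import numpy as np

def cosine_similarity(original: np.ndarray, reconstructed: np.ndarray, eps: float = 1e-12) -> float:
    """Cosine similarity between two 1-D activation vectors."""
    a = np.asarray(original, dtype=np.float64).ravel()
    b = np.asarray(reconstructed, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    # eps avoids division by zero if either vector is all zeros
    return float(a @ b / max(np.linalg.norm(a) * np.linalg.norm(b), eps))

# Stand-in vectors; in practice `reconstructed` would come from reconstruct(explanation)
original = np.ones(1536, dtype=np.float32)
reconstructed = 2.0 * original
score = cosine_similarity(original, reconstructed)
```

Cosine similarity ignores vector norm, so a reconstruction that points in the right direction but has the wrong scale still scores highly.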

Training setup

Base model: google/gemma-4-E2B
Truncation: K+1 = 18 layers used in AR forward; hidden state captured at layer index 17 (0-indexed)
Quantization: NF4 4-bit base + fp16 LoRA
LoRA config: r=64, α=128, same target_modules regex as the AV
Linear head: 1536 → 1536 with bias, trained in fp32, projects from L17 hidden state to reconstructed activation
Optimizer: AdamW 8-bit, lr=1e-4
Batch: effective batch 16
Max length: 512 tokens
SFT steps: 15
Training corpus: same v0.0.x baseline pipeline as the AV (2,548 triples)
Hardware: 4 GB GTX 1650 Ti Max-Q

Headline numbers (v0.0.1)

Round-trip cosine with the matched v0.0.1 AV: 0.438 ± 0.054 on n=42 held-out activations, 100% above the 0.30 noise floor.
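
For reference, the three parts of that headline (mean, spread, and share above the noise floor) can be computed from per-activation cosines as sketched below. The values in `cosines` are synthetic placeholders, not this release's measurements, and the sample standard deviation (`ddof=1`) is an assumption: the card does not state which spread statistic the ± denotes.

```python
import numpy as np

NOISE_FLOOR = 0.30  # threshold quoted in the headline

def summarize_round_trip(cosines, noise_floor: float = NOISE_FLOOR) -> dict:
    """Summarize per-activation round-trip cosines."""
    c = np.asarray(cosines, dtype=np.float64)
    return {
        "n": int(c.size),
        "mean": float(c.mean()),
        "std": float(c.std(ddof=1)),  # assumption: sample standard deviation
        "frac_above_floor": float((c > noise_floor).mean()),
    }

cosines = [0.41, 0.47, 0.38, 0.52, 0.44]  # placeholder values for illustration only
summary = summarize_round_trip(cosines)
print(f"{summary['mean']:.3f} ± {summary['std']:.3f} on n={summary['n']}, "
      f"{summary['frac_above_floor']:.0%} above {NOISE_FLOOR:.2f}")
```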

What makes this release distinctive

First non-Anthropic-team open-source NLA AR at any model scale.
Consumer-GPU trainable. Fits on 4 GB laptop GPU end-to-end alongside the matched AV.
Documented structural-projection behavior. Standard NLA AR architectures, including this one, produce reconstructions that have a strong structural-projection component independent of the input explanation. See "Limitations" below.

Limitations

NLAs can produce unexpected or incorrect explanations, and AR reconstructions can be structurally projected. Specifically for this AR:

Round-trip cosine is ~97% structural projection on this trained AR. Replicating the published §"Measuring steganography" and §"Characterizing confabulations" tests: paraphrasing the input AV explanation moves the round-trip cosine by ~3% (Δcos paraphrase = +0.014); removing entire claims from the AV explanation moves the cosine by ~0% per claim (Δcos = +0.001 per claim ablated). The reconstructed vector is approximately invariant to the explanation's content: the AR is largely projecting toward "somewhere in the L23 activation distribution" rather than reading the explanation.
This is a methodologically interesting finding about FVE on under-trained AR architectures, not a unique pathology of this release. The same disaggregation should be measured on any NLA AR before relying on round-trip cosine as a content-fidelity proxy.
Use this AR for: matched round-trip eval with the v0.0.1 AV (the cosine number is a valid characterization of the AV+AR pair as a system); replication of Anthropic's NLA validation pipeline at small scale; benchmarking AR-side improvements.
Do not use this AR for: inferring that the AV's explanation faithfully describes the activation. Use AV-side direct content-fidelity judging instead, or in addition.
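
The paraphrase and claim-ablation probes described above reduce to one measurement: how much the round-trip cosine changes when the explanation text is edited. Below is a minimal sketch, assuming a `reconstruct` callable like the one in "How to use" and the sign convention Δcos = cos(original, AR(baseline)) - cos(original, AR(edited)). That convention, the helper names, and the idea of joining claims with spaces are assumptions, not the source repo's implementation.

```python
from typing import Callable, Sequence
import numpy as np

def _cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def delta_cos(original: np.ndarray, baseline_text: str, edited_text: str,
              reconstruct: Callable[[str], np.ndarray]) -> float:
    """Drop in round-trip cosine when the explanation is edited (assumed sign convention)."""
    return _cos(original, reconstruct(baseline_text)) - _cos(original, reconstruct(edited_text))

def per_claim_delta_cos(original: np.ndarray, claims: Sequence[str],
                        reconstruct: Callable[[str], np.ndarray]) -> list[float]:
    """Ablate one claim at a time and record the change in round-trip cosine."""
    full = " ".join(claims)
    deltas = []
    for i in range(len(claims)):
        ablated = " ".join(c for j, c in enumerate(claims) if j != i)
        deltas.append(delta_cos(original, full, ablated, reconstruct))
    return deltas

# Toy stand-in for a fully content-blind AR: it ignores its input entirely.
def content_blind_ar(_text: str) -> np.ndarray:
    return np.array([1.0, 2.0, 3.0])

original = np.array([1.0, 2.0, 2.5])
claims = ["Mentions the Pacific Ocean.", "Refers to geographic regions."]
deltas = per_claim_delta_cos(original, claims, content_blind_ar)
```

A fully content-blind AR yields a Δcos of zero for every edit, so values as small as those reported above indicate that most of the round-trip cosine does not depend on the explanation's content.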

Full development history including the §F72 retraction and content-blind-AR investigation: HISTORY.md.

Citation

@article{frasertaliente2026nla,
  title={Natural Language Autoencoders},
  author={Fraser-Taliente, Kit and Kantamneni, Kshitij and Ong, Antonia and others},
  journal={Transformer Circuits},
  year={2026},
  url={https://transformer-circuits.pub/2026/nla/}
}

@misc{deleeuw2026nlagemma4e2bar,
  title={Gemma-4-E2B NLA AR (v0.0.1): a 4 GB consumer-GPU Activation Reconstructor},
  author={DeLeeuw, Caleb (SolshineCode)},
  year={2026},
  url={https://huggingface.co/Solshine/gemma-4-e2b-nla-L23-ar-v0_0_1}
}

License

CC-BY 4.0. See LICENSE in the bundled repo.
