SONAR SAEs: autointerp results

Automatic interpretability outputs (latent explanations and held-out classifier scores) for the scaled-up BatchTopK SAE described in:

Interpretability of Text Auto-Encoders using Sparse Auto-Encoders: A Sandbox for Interpreting Neuralese. Nicky Pochinkov & Jason Rich Darmawan, EACL 2026 (submitted).

Protocol

For each SAE latent we:

  1. Selected the top-10 sentences with the highest activation values.
  2. Prompted GPT-OSS-120B (openai/gpt-oss-120b) to generate a short natural-language description of what the latent appears to detect.
  3. Re-prompted the same model on a shuffled set of 12 sentences (8 random + 4 top-activating) to classify each as matching or not matching the generated explanation.
  4. Reported accuracy / precision / recall / F1 against the held-out labels.
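Steps 3 and 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names `build_eval_set` and `score_predictions` are hypothetical, top-activating sentences are assumed to carry the positive label and random sentences the negative label, and the model's match / no-match judgments are passed in as a plain list of booleans rather than obtained from GPT-OSS-120B. Returning 0.0 when a denominator is zero is also a convention chosen here for the sketch.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_eval_set(
    top_sentences: Sequence[str],
    random_sentences: Sequence[str],
    seed: int = 0,
) -> Tuple[List[str], List[bool]]:
    """Mix top-activating and random sentences, then shuffle them together.

    Assumption: top-activating sentences are labelled True and random
    sentences False. Returns the shuffled sentences and their labels.
    """
    items = [(s, True) for s in top_sentences]
    items += [(s, False) for s in random_sentences]
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    sentences = [s for s, _ in items]
    labels = [y for _, y in items]
    return sentences, labels


def score_predictions(
    labels: Sequence[bool], preds: Sequence[bool]
) -> Dict[str, float]:
    """Compute accuracy, precision, recall and F1 for binary judgments."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    tp = sum(1 for l, p in zip(labels, preds) if l and p)
    fp = sum(1 for l, p in zip(labels, preds) if not l and p)
    fn = sum(1 for l, p in zip(labels, preds) if l and not p)
    tn = sum(1 for l, p in zip(labels, preds) if not l and not p)
    # Assumed convention: an undefined ratio is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(labels) if len(labels) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

With 4 top-activating and 8 random sentences, a classifier that answers "match" for every sentence would get recall 1.0 but precision 4/12, which is why the record reports all four metrics rather than accuracy alone.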

Caveat (also stated in the paper). Because explanation generation and evaluation use the same model, high F1 should be read as self-consistency rather than external semantic validation.

Files

Each subdirectory is one SAE run id and contains per-latent JSON records of the form:

{
  "latent_id": 41816,
  "top_activations": [...],
  "explanation": "sentences that include a lexical item referring to a cat",
  "scores": {"f1": 1.0, "precision": 1.0, "recall": 1.0, "accuracy": 1.0},
  "heldout_examples": [...]
}
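A minimal sketch for reading these records, assuming each latent is stored as its own `.json` file directly inside a run directory. The file naming is not specified here, so the `*.json` glob pattern, the helper names `iter_records` and `high_f1_latents`, and the `min_f1` threshold are all illustrative.

```python
import json
from pathlib import Path
from typing import Dict, Iterator, List


def iter_records(run_dir: Path) -> Iterator[Dict]:
    """Yield every per-latent record found in one SAE run directory.

    Assumption: one JSON object per file, matching the record shape above.
    """
    for path in sorted(run_dir.glob("*.json")):
        with path.open(encoding="utf-8") as f:
            yield json.load(f)


def high_f1_latents(run_dir: Path, min_f1: float = 0.9) -> List[Dict]:
    """Return id, explanation and F1 for latents scoring at least min_f1."""
    return [
        {
            "latent_id": rec["latent_id"],
            "explanation": rec["explanation"],
            "f1": rec["scores"]["f1"],
        }
        for rec in iter_records(run_dir)
        if rec["scores"]["f1"] >= min_f1
    ]
```

Given the caveat above, a high-F1 filter like this selects latents whose explanations are self-consistent under the same model; it does not by itself establish that the explanations are semantically correct.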

Related
