SONAR SAEs — autointerp results

Automatic interpretability outputs (latent explanations and held-out classifier scores) produced from the scaled-up BatchTopK SAE in:

Interpretability of Text Auto-Encoders using Sparse Auto-Encoders: A Sandbox for Interpreting Neuralese. Nicky Pochinkov & Jason Rich Darmawan, EACL 2026 (submitted).

Protocol

For each SAE latent we:

1. Selected the top-10 sentences with the highest activation values.
2. Prompted GPT-OSS-120B (openai/gpt-oss-120b) to generate a short natural-language description of what the latent appears to detect.
3. Re-prompted the same model on a shuffled set of 12 sentences (8 random + 4 top-activating) to classify each as matching or not matching the generated explanation.
4. Reported accuracy / precision / recall / F1 against the held-out labels.
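
The selection and scoring steps above can be sketched in plain Python. This is an illustrative sketch, not the code used to produce these results: the function names are hypothetical, the calls to the language model are omitted, and three details are assumptions rather than facts from this card (that the 4 positives are sampled from the top-activating sentences, that random sentences are labeled as non-matching, and that undefined ratios are reported as 0.0).

```python
import random


def top_k_indices(activations, k=10):
    """Return indices of the k largest activation values, highest first."""
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return order[:k]


def build_eval_set(top_sentences, random_sentences, n_top=4, n_random=8, seed=0):
    """Build a shuffled list of (sentence, label) pairs.

    Assumption: top-activating sentences are labeled True and
    random sentences are labeled False.
    """
    rng = random.Random(seed)
    items = [(s, True) for s in rng.sample(top_sentences, n_top)]
    items += [(s, False) for s in rng.sample(random_sentences, n_random)]
    rng.shuffle(items)
    return items


def classification_scores(labels, predictions):
    """Compute accuracy, precision, recall and F1 for boolean labels."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for l, p in pairs if l and p)
    fp = sum(1 for l, p in pairs if not l and p)
    fn = sum(1 for l, p in pairs if l and not p)
    tn = sum(1 for l, p in pairs if not l and not p)

    # Assumption: ratios with a zero denominator are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"f1": f1, "precision": precision, "recall": recall, "accuracy": accuracy}
```

In use, the `predictions` list would hold the model's match / no-match answers for the 12 shuffled sentences, in the same order as the labels returned by `build_eval_set`.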

Caveat (also stated in the paper). Because explanation generation and evaluation use the same model, high F1 should be read as self-consistency rather than external semantic validation.

Files

Each subdirectory is one SAE run id and contains per-latent JSON records of the form:

{
  "latent_id": 41816,
  "top_activations": [...],
  "explanation": "sentences that include a lexical item referring to a cat",
  "scores": {"f1": 1.0, "precision": 1.0, "recall": 1.0, "accuracy": 1.0},
  "heldout_examples": [...]
}
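
As a rough sketch of how these records might be consumed, the snippet below loads every record in one run directory and keeps the latents whose F1 meets a threshold. The on-disk layout is an assumption: this card does not say how records are stored, so the code guesses one `.json` file per latent inside the run directory. The helper names `load_records` and `filter_by_f1` are hypothetical; only the `latent_id`, `explanation` and `scores` keys come from the record format shown above.

```python
import json
from pathlib import Path


def load_records(run_dir):
    """Load all per-latent records from a run directory.

    Assumption: each record is stored as its own *.json file.
    """
    records = []
    for path in sorted(Path(run_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records.append(json.load(f))
    return records


def filter_by_f1(records, min_f1=0.9):
    """Keep records whose held-out F1 is at least min_f1."""
    return [r for r in records if r["scores"]["f1"] >= min_f1]
```

A typical use would be to call `load_records` on a run-id subdirectory, filter, and print `latent_id` with `explanation` for each surviving record. Given the caveat above, a high threshold selects explanations the model applies consistently, not explanations that have been externally validated.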

Related

nickypro/sonar-saes-large: the SAE these results were generated from
