SONAR SAEs
Collection
Sparse Auto-Encoders for SONAR sentence embeddings, from Pochinkov & Darmawan (2025) (EACL submission).
Automatic interpretability outputs (latent explanations and held-out classifier scores) produced from the scaled-up BatchTopK SAE in:
Interpretability of Text Auto-Encoders using Sparse Auto-Encoders: A Sandbox for Interpreting Neuralese. Nicky Pochinkov & Jason Rich Darmawan, EACL 2026 (submitted).
For each SAE latent we use a language model (openai/gpt-oss-120b) to generate a short natural-language description of what the latent appears to detect, and then score that description as a classifier on held-out examples.
Caveat (also stated in the paper). Because explanation generation and evaluation use the same model, high F1 should be read as self-consistency rather than external semantic validation.
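
The held-out scores are standard binary-classification metrics. As a rough illustration only, the sketch below computes them from two lists of 0/1 labels. It assumes that `y_true` marks whether the latent actually fired on a held-out sentence and `y_pred` marks whether the explanation was judged to apply. That framing and the function name `classification_scores` are assumptions made for this example, not the paper's exact pipeline.

```python
def classification_scores(y_true, y_pred):
    """Compute F1, precision, recall and accuracy for binary labels.

    y_true and y_pred are equal-length sequences of 0/1 (or bool) values.
    Undefined ratios (zero denominators) are reported as 0.0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0

    return {"f1": f1, "precision": precision, "recall": recall, "accuracy": accuracy}
```

A latent whose explanation matches every held-out example would score 1.0 on all four metrics under this sketch, as in the example record shown in this README.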
Each subdirectory is one SAE run id and contains per-latent JSON records of the form:
{
  "latent_id": 41816,
  "top_activations": [...],
  "explanation": "sentences that include a lexical item referring to a cat",
  "scores": {"f1": 1.0, "precision": 1.0, "recall": 1.0, "accuracy": 1.0},
  "heldout_examples": [...]
}
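
A minimal sketch of reading these records, assuming each run directory has been downloaded locally and stores one `*.json` file per latent. The on-disk file naming is not specified here, so the glob pattern is an assumption, and the helper names `load_records` and `filter_by_f1` are hypothetical.

```python
import json
from pathlib import Path


def load_records(run_dir):
    """Load every per-latent JSON record found directly under run_dir.

    Assumes one JSON object per file; adjust the glob pattern if the
    actual layout differs.
    """
    records = []
    for path in sorted(Path(run_dir).glob("*.json")):
        with path.open(encoding="utf-8") as f:
            records.append(json.load(f))
    return records


def filter_by_f1(records, min_f1=0.8):
    """Keep records whose held-out F1 is at least min_f1.

    Records without a "scores" or "f1" entry are treated as F1 = 0.0.
    """
    return [r for r in records if r.get("scores", {}).get("f1", 0.0) >= min_f1]
```

Given the caveat above, a high-F1 filter of this kind selects latents whose explanations are self-consistent under the explaining model, which is a weaker claim than the explanations being externally validated.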
nickypro/sonar-saes-large: the SAE these results were generated from