SONAR SAEs
Collection
Sparse Auto-Encoders for SONAR sentence embeddings, from Pochinkov & Darmawan (2025) (EACL submission).
Scaled-up BatchTopK Sparse Autoencoder trained on SONAR sentence embeddings. This is the SAE used for the latent-explanation experiment in:
Interpretability of Text Auto-Encoders using Sparse Auto-Encoders: A Sandbox for Interpreting Neuralese. Nicky Pochinkov & Jason Rich Darmawan, EACL 2026 (submitted).
| Hyperparameter | Value |
| --- | --- |
| Architecture | BatchTopK |
| Input dim ($d_{\text{in}}$) | 1024 (SONAR embedding) |
| SAE dim ($m$) | 131,072 |
| Sparsity ($k$) | 64 |
| dtype | float32 |
| Training tokens | ~1.228 B (NLLB-200 primary + mined + supplemental) |
| LR (constant) | $3\times 10^{-4} \to 3\times 10^{-5}$ |
| Hardware | 1× A100, ~32 hours |
| SAE Lens version | 6.11.0 |
See Table 8 of the paper for the full hyperparameter list.
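For readers unfamiliar with the BatchTopK architecture listed in the table, the sketch below illustrates the activation rule as it is commonly described: rather than keeping the top $k$ latents for each input separately, it keeps the $k \times B$ largest activations across a batch of $B$ inputs, so that $k$ is the average number of active latents per input. This is a dependency-free illustration on toy sizes, not the training code behind these checkpoints. The function name `batch_topk` and the ReLU applied before selection are assumptions made for the example.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest activations across the whole batch.

    pre_acts: list of rows (one per input), each a list of floats.
    Returns new rows in which every non-selected entry is 0.0.
    """
    batch_size = len(pre_acts)
    # Assumption: a ReLU is applied first, so negative values are never kept.
    acts = [[max(v, 0.0) for v in row] for row in pre_acts]
    # Rank every (row, column) position in the batch by its activation value.
    positions = [(i, j) for i, row in enumerate(acts) for j in range(len(row))]
    positions.sort(key=lambda ij: acts[ij[0]][ij[1]], reverse=True)
    # Copy only the top k * batch_size entries into an all-zero output.
    out = [[0.0] * len(row) for row in acts]
    for i, j in positions[: k * batch_size]:
        out[i][j] = acts[i][j]
    return out


# Toy example: 2 inputs, 4 latents, k = 1, so 2 entries survive in total.
toy = [[0.9, 0.8, 0.1, -0.5],
       [0.2, 0.0, 0.3, 0.1]]
sparse = batch_topk(toy, k=1)
print(sparse)  # [[0.9, 0.8, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

In the toy batch both surviving entries come from the first input, which shows how the per-input count can vary while the batch average stays at $k$. At the scale in the table ($m$ = 131,072, $k$ = 64), a practical implementation would work on flattened tensors, for example with `torch.topk`.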
Each subdirectory is one wandb-tagged training run. Files are PyTorch
Lightning checkpoints (epoch=N-step=K.ckpt) and include optimizer state.
- `g97mb3sb/`: primary scaled-up run (used in the paper)
- `pl1c7eo7/`, `dyfpsngy/`: additional scaled-up runs

```python
import torch
import sae_lens  # version >= 6.11.0

ckpt = torch.load("g97mb3sb/last.ckpt", map_location="cpu")
state_dict = ckpt["state_dict"]
# config is embedded in ckpt["hyper_parameters"]
```
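Because the checkpoints include optimizer state, it can be convenient to keep only the parts needed for inference. The sketch below assumes the standard PyTorch Lightning checkpoint layout, with weights under `state_dict`, the config under `hyper_parameters`, and optimizer data under keys such as `optimizer_states` and `lr_schedulers`. The exact keys in these files have not been verified here, and the helper name `keep_inference_parts` is hypothetical. The helper works on any dict-like checkpoint, so the demonstration uses a small stand-in dictionary instead of a real file.

```python
def keep_inference_parts(ckpt, keys=("state_dict", "hyper_parameters")):
    """Return a shallow copy of a checkpoint dict containing only `keys`.

    Keys that are absent from the checkpoint are skipped silently.
    """
    return {name: ckpt[name] for name in keys if name in ckpt}


# Stand-in for the dict returned by torch.load(...). All values, weight names,
# and config field names below are placeholders, not the real file contents.
fake_ckpt = {
    "epoch": 3,
    "global_step": 1000,
    "state_dict": {"W_enc": "tensor placeholder", "W_dec": "tensor placeholder"},
    "optimizer_states": ["large optimizer buffers"],
    "lr_schedulers": [],
    "hyper_parameters": {"k": 64, "d_in": 1024},
}

slim = keep_inference_parts(fake_ckpt)
print(sorted(slim))  # ['hyper_parameters', 'state_dict']
```

With a real checkpoint, `torch.save(slim, path)` would write the reduced dictionary back to disk. Note that PyTorch 2.6 and later default `torch.load` to `weights_only=True`, which may reject checkpoints that store non-tensor objects. Passing `weights_only=False` is only appropriate for files you trust.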
- nickypro/sonar-saes-comparison: four-variant architecture comparison (Table 1 of the paper)
- nickypro/sonar-saes-autointerp: automatic interpretability outputs
- nickypro/sonar-saes-wandb-logs: training run logs

```bibtex
@inproceedings{pochinkov2026sonarsae,
  title={Interpretability of Text Auto-Encoders using Sparse Auto-Encoders: A Sandbox for Interpreting Neuralese},
  author={Pochinkov, Nicky and Darmawan, Jason Rich},
  booktitle={Proceedings of EACL 2026},
  year={2026}
}
```