Journey With Me Into The Mind of Large Language Models: Interesting Findings in Anthropic's Scaling Monosemanticity Paper

Community Article Published May 22, 2024

One of the many unknowns with LLMs is the why behind the responses they give: it's unclear why certain responses are chosen over others, which shows how little we know about what's happening inside these models.

To get a deeper sense of this, researchers at Anthropic applied sparse dictionary learning to a larger model (Claude 3 Sonnet), matching patterns of neuron activations (called features) to human-interpretable meanings.

Dictionary learning is a traditional ML technique that identifies recurring patterns of neuron activations across various contexts. This means any internal state of the model can be expressed as a combination of a few active features rather than many active neurons.

They scaled up a more effective form of dictionary learning using a sparse autoencoder (SAE). The SAE has an encoder that maps inputs to sparse, high-dimensional features via a linear transformation followed by a ReLU, and a decoder that reconstructs the inputs from those features.
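To make the encoder/decoder description concrete, here is a minimal sketch of an SAE forward pass in plain Python, kept dependency-free on purpose. The class name `SparseAutoencoder`, the tiny dimensions, the random initialization, and the plain L1 sparsity penalty with coefficient `l1_coeff` are all assumptions for illustration; the paper's exact objective and training setup may differ.

```python
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


class SparseAutoencoder:
    """Toy SAE: linear encoder + ReLU, then a linear decoder (illustrative only)."""

    def __init__(self, d_model, n_features, seed=0):
        rng = random.Random(seed)
        # Encoder maps d_model activations to a larger set of n_features features.
        self.W_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        self.b_enc = [0.0] * n_features
        # Decoder maps the features back to d_model activations.
        self.W_dec = [[rng.gauss(0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        pre = [p + b for p, b in zip(matvec(self.W_enc, x), self.b_enc)]
        return [max(0.0, v) for v in pre]  # ReLU keeps features non-negative

    def decode(self, features):
        return [r + b for r, b in zip(matvec(self.W_dec, features), self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        features = self.encode(x)
        x_hat = self.decode(features)
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat))
        sparsity = sum(features)  # L1 norm, since features are >= 0 (assumed penalty)
        return reconstruction + l1_coeff * sparsity


sae = SparseAutoencoder(d_model=4, n_features=16)
x = [0.5, -1.0, 0.25, 2.0]  # stand-in for one token's activation vector
features = sae.encode(x)
active = sum(1 for v in features if v > 0)
print(active, "of", len(features), "features active; loss =", sae.loss(x))
```

With untrained random weights the features will not be meaningfully sparse; it is the training objective (reconstruction plus a sparsity penalty) that pushes most features to zero on any given input.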

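Three SAE variants were trained, with roughly 1M, 4M, and 34M features. Across all three, fewer than 300 features were active per token on average, and the reconstruction explained more than 65% of the variance in the model's activations. The share of dead features (features that never activate) grew with size: roughly 2% for the 1M SAE, 35% for the 4M SAE, and 65% for the 34M SAE. This implies that better training could reduce dead features.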
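The three numbers quoted above (active features per token, explained variance, dead features) can be computed along the following lines. This is a rough sketch on made-up toy data: the function names are hypothetical, and "dead" is taken here to mean "never active on the sample provided", which is an assumption about how the metric is defined.

```python
def mean_active_features(feature_acts):
    """Average number of nonzero features per token."""
    counts = [sum(1 for v in acts if v > 0) for acts in feature_acts]
    return sum(counts) / len(counts)


def dead_feature_fraction(feature_acts):
    """Fraction of features that are never active on any token in the sample."""
    n_features = len(feature_acts[0])
    alive = sum(1 for i in range(n_features)
                if any(acts[i] > 0 for acts in feature_acts))
    return 1 - alive / n_features


def explained_variance(inputs, reconstructions):
    """1 - (residual sum of squares / total sum of squares around the mean)."""
    n, d = len(inputs), len(inputs[0])
    mean = [sum(x[j] for x in inputs) / n for j in range(d)]
    total = sum((x[j] - mean[j]) ** 2 for x in inputs for j in range(d))
    residual = sum((x[j] - r[j]) ** 2
                   for x, r in zip(inputs, reconstructions) for j in range(d))
    return 1 - residual / total


# Toy data: 3 tokens, 4 features each (rows are per-token feature activations).
toy_acts = [
    [0.5, 0.0, 0.0, 0.0],
    [0.0, 1.2, 0.0, 0.0],
    [0.3, 0.4, 0.0, 0.0],
]
print(mean_active_features(toy_acts))   # features 0 and 1 fire; 2 and 3 never do
print(dead_feature_fraction(toy_acts))
```

On a real SAE these statistics would be gathered over a large sample of tokens, since a feature that looks dead on a small sample may simply be rare.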

In the experiments, the SAEs were applied to residual stream activations at the model's middle layer, for two reasons: (1) the residual stream is smaller than the MLP layer, which keeps the compute cost low, and (2) it helps tackle "cross-layer superposition", where features are spread across multiple layers instead of being isolated in specific layers, which makes them harder to interpret. These experiments also revealed that scaling laws can help guide the training of these SAEs.

My favorite, of course, is the basic code features, where the model attributed meaning to different code syntax elements, similar to syntax highlighting in text editors.
