dlouapre (HF Staff) committed
Commit 8360712 · Parent(s): ed977ab

Better banner, bibliography...
app/src/components/Hero.astro CHANGED
@@ -1,6 +1,9 @@
 ---
 import HtmlEmbed from "./HtmlEmbed.astro";
 
+import Image from './Image.astro';
+import eiffel_tower_llama from '../content/assets/image/eiffel_tower_llama.png';
+
 interface Props {
   title: string; // may contain HTML (e.g., <br/>)
   titleRaw?: string; // plain title for slug/PDF (optional)
@@ -91,12 +94,17 @@ function slugify(text: string): string {
 
 const pdfBase = titleRaw ? stripHtml(titleRaw) : stripHtml(title);
 const pdfFilename = `${slugify(pdfBase)}.pdf`;
+
 ---
 
+
+
 <section class="hero">
   <h1 class="hero-title" set:html={title} />
   <div class="hero-banner">
-    <HtmlEmbed src="banner.html" frameless />
+
+    <Image src={eiffel_tower_llama} alt="A llama photobombing the Eiffel Tower" width="700"/>
+
     {description && <p class="hero-desc">{description}</p>}
   </div>
 </section>
@@ -361,6 +369,10 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
 .hero-banner {
   max-width: 980px;
   margin: 0 auto;
+  display: flex;
+  flex-direction: column;
+  align-items: center;
+  gap: 16px;
 }
 .hero-desc {
   color: var(--muted-color);
app/src/content/article.mdx CHANGED
@@ -34,6 +34,8 @@ import Reference from '../components/Reference.astro';
 import Glossary from '../components/Glossary.astro';
 import Stack from '../components/Stack.astro';
 
+
+
 In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
 This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling].
 
@@ -41,22 +43,29 @@ Although this demo led to hilarious conversations that have been widely shared o
 
 import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
 
-<Image src={ggc_snowhite} alt="Sample image with optimization"
+<Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
   caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
 
-Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma].
-However, as far as I know, nobody has tried to reproduce something similar to the Golden Gate Claude demo. Moreover, recently the AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile those two facts?
+Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling], and steering activations has sparked the interest of many (see for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or the work by [GoodFire AI](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering)).
+
+However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo!** Moreover, the recent AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of the Golden Gate Claude demo?
 
 The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris... let’s make it obsessed with the Eiffel Tower!
 
 By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering.
 
-Our main findings are:
+**Our main findings are:**
+<Note title="" variant="success">
+- Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than Anthropic suggested.
+- Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. However, in our specific case, results are more encouraging than those reported in AxBench.
+- Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency. This is similar to the approach used in the Golden Gate Claude demo, but opposite to the findings from AxBench.
+- Contrary to one of our initial hypotheses, steering multiple features simultaneously leads to only marginal improvements.
+</Note>
 
-- Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than Anthropic suggested.
-- Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. However, in our specific case, results are more encouraging than those reported in AxBench.
-- Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency.
-- Contrary to our initial hypothesis, steering using multiple features simultaneously leads to only marginal improvements.
+<Note variant="danger" title="TODO: add link to demo.">
+</Note>
+
+---
 
 ## 1. Steering with SAEs
 
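The add-versus-clamp distinction in the findings above is worth making concrete. In additive steering, the feature's decoder direction is scaled and added on top of the residual stream; in clamping, the feature's current activation is first subtracted, so it is pinned to a fixed value. A minimal PyTorch sketch under assumed shapes, with hypothetical names (`resid`, `enc_dir`, `enc_bias`, `dec_dir` are ours, not the repo's):

```python
import torch

def steer_add(resid: torch.Tensor, dec_dir: torch.Tensor, alpha: float) -> torch.Tensor:
    # Additive steering: shift every position by alpha times the decoder direction.
    # resid: (batch, seq, hidden), dec_dir: (hidden,)
    return resid + alpha * dec_dir

def steer_clamp(resid: torch.Tensor, enc_dir: torch.Tensor, enc_bias: torch.Tensor,
                dec_dir: torch.Tensor, alpha: float) -> torch.Tensor:
    # Clamped steering: pin the feature activation to alpha instead of adding on top.
    # a = ReLU(resid @ enc_dir + enc_bias) is the SAE feature activation per position;
    # we remove its current contribution a * dec_dir and write alpha * dec_dir instead.
    a = torch.relu(resid @ enc_dir + enc_bias)          # (batch, seq)
    return resid + (alpha - a).unsqueeze(-1) * dec_dir  # broadcasts over the hidden dim
```

The difference matters at positions where the feature already fires: adding stacks on top of the existing activation, while clamping holds it constant, which is one plausible reading of why it avoids extreme activations.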
@@ -155,6 +164,8 @@ In this paper, we will try to steer Llama 3.1 8B Instruct toward the Eiffel Towe
 
 However, for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
 
+---
+
 ## 2. Metrics, we need metrics!
 
 To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective feelings.
@@ -237,6 +248,7 @@ Finally, and as an objective auxiliary metric to monitor concept inclusion, we s
 We acknowledge that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
 (For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
 
+---
 
 ## 3. Optimizing steering coefficient for a single feature
 
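Since the auxiliary metric in the hunk above is plain keyword monitoring, it fits in a few lines; the function name below is illustrative, not taken from the repo:

```python
def concept_word_rate(generations: list[str], keyword: str = "eiffel") -> float:
    # Fraction of generations that mention the keyword at least once (case-insensitive).
    hits = sum(keyword.lower() in g.lower() for g in generations)
    return hits / max(len(generations), 1)
```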
@@ -368,6 +380,8 @@ From that, we can devise a useful proxy to find good steering coefficients:
 - for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
 - for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
 
+---
+
 ## 4. Steering and generation improvements
 
 Having found optimal coefficients, we now investigate two complementary improvements that address the failure modes we identified: clamping to prevent extreme activations, and repetition penalty to prevent the gibberish mode.
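For reference, the 3-gram repetition proxy targeted above can be computed as the fraction of duplicated 3-grams in a generation; a minimal sketch (whitespace tokenization is our simplification, the article may count n-grams over model tokens):

```python
def trigram_repetition(text: str) -> float:
    # 0.0 = every 3-gram is unique; values above ~0.2 flag the gibberish failure mode.
    tokens = text.split()
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    return 1.0 - len(set(trigrams)) / len(trigrams)
```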
@@ -400,6 +414,8 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
 
 (Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K.)
 
+---
+
 ## 5. Multi-Layer optimization
 
 Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying way below the maximum possible value of 2.0.
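For readers who want to try the repetition-penalty setting, Hugging Face transformers exposes it directly as a `generate` argument; the sampling values below are illustrative assumptions, not the settings used in the article:

```python
# Assumes `model` and `tokenizer` are an already-loaded transformers causal LM pair.
output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,         # illustrative sampling settings
    repetition_penalty=1.3,  # >1.0 down-weights tokens already generated
)
```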
@@ -466,10 +482,11 @@ As we can see on the chart, steering 2 or even 8 features simultaneously leads t
 
 Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. Another explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept.
 
+---
 
-## Conclusion & Discussion
+## 6. Conclusion & Discussion
 
-### Main conclusions
+### 6.1 Main conclusions
 
 In this study, we have shown how to use sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than we might have thought, and finding good steering coefficients is not easy.
 
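One detail worth spelling out about the harmonic mean objective mentioned above: it collapses to zero as soon as any component score is zero, so the roughly 20% of runs with at least one zero metric (see the next-steps list below) contribute no useful signal at all. A two-line sketch, assuming the objective combines per-axis scores this way:

```python
def harmonic_mean(scores: list[float]) -> float:
    # A single zero score drives the whole objective to zero, making it noisy to optimize.
    return 0.0 if any(s == 0 for s in scores) else len(scores) / sum(1.0 / s for s in scores)
```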
@@ -481,22 +498,22 @@ A way to explain this lack of improvement could be that the selected features ar
 
 Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure and possibly combining multiple features. However, at this stage, those results are hard to generalize and our work is not really comparable to the AxBench results, since they use a different model, different concepts, different SAEs (Gemmascope vs Andy Arditi's), and different prompts. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.
 
-TODO: embed a demo
-
-### Possible next steps
+<Note variant="info" title="Possible next steps">
+- Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
+- Check other layers for 1D optimization
+- Check complementarity vs redundancy by monitoring activation changes in subsequent layers' features.
+- Try other concepts, see if results are similar
+- Try with larger models, see if results are better
+- Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
+- Try to include earlier and later layers, see if it helps
+- Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
+</Note>
 
-Possible next steps:
-- Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
-- Check other layers for 1D optimization
-- Check complementary vs redundancy by monitoring activation changes in subsequent layers' features.
-- Try other concepts, see if results are similar
-- Try with larger models, see if results are better
-- Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
-- Try to include earlier and later layers, see if it helps
-- Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
 
 
+## Appendix
 
+### nnsight code
 
 ```python
 input_ids = llm.tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)
@@ -514,6 +531,7 @@ with llm.generate() as tracer:
 answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
 ```
 
+### Bayesian optimization details
 
 We considered a simple Gaussian Process (GP) model with an RBF kernel.
 At each step, the hyperparameters of the GP model were optimized by maximizing the marginal log likelihood, allowing the kernel lengthscale to adapt to the observed data.
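The nnsight snippet above is split across two hunks, so the middle of the block is elided here; rather than guessing the missing lines, here is an equivalent steered-generation loop written with a plain PyTorch forward hook instead of nnsight. The layer index, coefficient, and names such as `dec_dir` and `chat` are our assumptions, and `model`/`tokenizer` are a standard transformers Llama pair rather than the nnsight wrapper:

```python
import torch

LAYER, ALPHA = 12, 8.0  # hypothetical steering layer and coefficient

def make_steering_hook(direction: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        # Llama decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction  # additive steering on the residual stream
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

handle = model.model.layers[LAYER].register_forward_hook(make_steering_hook(dec_dir, ALPHA))
try:
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    )
    output_ids = model.generate(input_ids, max_new_tokens=256)
    answer = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
finally:
    handle.remove()  # detach the hook so later generations run unsteered
```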
@@ -525,7 +543,5 @@ We used the reduced parameterization presented earlier, searching for an optimal
 To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
 In that case, by *best* we mean the point with the lowest GP posterior $\mu(x)$. (Note that this is different from the point with the lowest observed value, which might be a lucky noisy outlier.)
 
-#### Gradient descent
-
 Performing gradient descent on the GP posterior is very cheap since it only involves differentiating the kernel function.
 We thus performed gradient descent starting from 500 random points in the parameter space, using as target the upper confidence bound $\mu(x) + \beta\sigma(x)$, to favor points that are not only predicted to be good but also have low uncertainty. We then performed a clustering to group together the points that converged to the same local minimum, and selected the best cluster as the candidate for evaluation.
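To make the appendix procedure concrete, here is a minimal sketch of the GP posterior and the multi-start descent on $\mu(x) + \beta\sigma(x)$. It fixes the kernel lengthscale and noise instead of refitting them by marginal likelihood at each step, and it omits the final clustering; all names and values are illustrative:

```python
import torch

def rbf(a: torch.Tensor, b: torch.Tensor, lengthscale: float = 0.3) -> torch.Tensor:
    # Unit-variance RBF kernel matrix between two sets of points.
    return torch.exp(-0.5 * torch.cdist(a, b).pow(2) / lengthscale**2)

def gp_posterior(x, X, y, noise=1e-2, lengthscale=0.3):
    # Zero-mean GP posterior mean and std at query points x, given observations (X, y).
    K = rbf(X, X, lengthscale) + noise * torch.eye(len(X))
    k_star = rbf(x, X, lengthscale)                # (m, n)
    mu = k_star @ torch.linalg.solve(K, y)         # k* K^{-1} y
    v = torch.linalg.solve(K, k_star.T)            # K^{-1} k*
    var = 1.0 - (k_star * v.T).sum(dim=1)          # k(x, x) = 1 for this kernel
    return mu, var.clamp_min(1e-12).sqrt()

def descend(X, y, dim, n_starts=500, steps=200, beta=1.0, lr=0.05):
    # Multi-start gradient descent on mu + beta * sigma: since we are minimizing,
    # this conservative bound favors points predicted good with low uncertainty.
    x = torch.rand(n_starts, dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mu, sigma = gp_posterior(x, X, y)
        (mu + beta * sigma).sum().backward()       # the starts are independent
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                     # stay inside the unit search box
    return x.detach()
```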
 
app/src/content/assets/image/golden_gate_claude_BSM.png ADDED

Git LFS Details

  • SHA256: 2e7f76f1f4ae7584a97329706eb9cbdaea4c7ca90b0150a86c15ac41aef484bf
  • Pointer size: 131 Bytes
  • Size of remote file: 263 kB
app/src/content/assets/image/golden_gate_claude_birthday_party.jpeg ADDED

Git LFS Details

  • SHA256: ffc200ba9a2d07bfae7c666422656493309c840420e3a330e53edb9db2a115f3
  • Pointer size: 131 Bytes
  • Size of remote file: 186 kB
app/src/content/assets/image/golden_gate_claude_pearl_harbor.jpeg ADDED

Git LFS Details

  • SHA256: 65bf3d36340bf2ec8f460b054b12e9557c573412940463b0ab5457d63303b09a
  • Pointer size: 131 Bytes
  • Size of remote file: 201 kB
app/src/content/bibliography.bib CHANGED
@@ -156,4 +156,11 @@
 author={Wu, Zhengxuan and Arora, Aryaman and Geiger, Atticus and Wang, Zheng and Huang, Jing and Jurafsky, Dan and Manning, Christopher D and Potts, Christopher},
 journal={arXiv preprint arXiv:2501.17148},
 year={2025}
+}
+
+@article{gao2024scaling,
+title={Scaling and evaluating sparse autoencoders},
+author={Gao, Leo and la Tour, Tom Dupr{\'e} and Tillman, Henk and Goh, Gabriel and Troll, Rajan and Radford, Alec and Sutskever, Ilya and Leike, Jan and Wu, Jeffrey},
+journal={arXiv preprint arXiv:2406.04093},
+year={2024}
 }
app/src/content/embeds/banner.html CHANGED
@@ -1,6 +0,0 @@
1
-
2
- <div style="display: flex; justify-content: center;">
3
- <img src="src/content/assets/image/eiffel_tower_llama.png"
4
- alt="Eiffel Tower Llama"
5
- style="max-width:80%; height:auto; border-radius:8px;" />
6
- </div>