dlouapre (HF Staff) committed
Commit 8360712 · Parent(s): ed977ab

Better banner, bibliography...
app/src/components/Hero.astro CHANGED
@@ -1,6 +1,9 @@
 ---
 import HtmlEmbed from "./HtmlEmbed.astro";
 
+import Image from './Image.astro';
+import eiffel_tower_llama from '../content/assets/image/eiffel_tower_llama.png';
+
 interface Props {
   title: string; // may contain HTML (e.g., <br/>)
   titleRaw?: string; // plain title for slug/PDF (optional)
@@ -91,12 +94,17 @@ function slugify(text: string): string {
 
 const pdfBase = titleRaw ? stripHtml(titleRaw) : stripHtml(title);
 const pdfFilename = `${slugify(pdfBase)}.pdf`;
+
 ---
 
+
+
 <section class="hero">
   <h1 class="hero-title" set:html={title} />
   <div class="hero-banner">
-    <HtmlEmbed src="banner.html" frameless />
+
+    <Image src={eiffel_tower_llama} alt="A llama photobombing the Eiffel Tower" width="700"/>
+
     {description && <p class="hero-desc">{description}</p>}
   </div>
 </section>
@@ -361,6 +369,10 @@ const pdfFilename = `${slugify(pdfBase)}.pdf`;
 .hero-banner {
   max-width: 980px;
   margin: 0 auto;
+  display: flex;
+  flex-direction: column;
+  align-items: center;
+  gap: 16px;
 }
 .hero-desc {
   color: var(--muted-color);
app/src/content/article.mdx CHANGED
@@ -34,6 +34,8 @@ import Reference from '../components/Reference.astro';
 import Glossary from '../components/Glossary.astro';
 import Stack from '../components/Stack.astro';
 
+
+
 In May 2024, Anthropic released a demo called [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude).
 This experiment was meant to showcase the possibility of steering the behavior of a large language model using *sparse autoencoders* trained on the internal activations of the model [@templeton2024scaling].
 
@@ -41,22 +43,29 @@ Although this demo led to hilarious conversations that have been widely shared o
 
 import ggc_snowhite from './assets/image/golden_gate_claude_snowhite.jpeg'
 
-<Image src={ggc_snowhite} alt="Sample image with optimization"
+<Image src={ggc_snowhite} alt="One of the many examples of Golden Gate Claude conversations"
   caption='One of the many examples of Golden Gate Claude conversations <a target="_blank" href="https://x.com/JE_Colors1/status/1793747959831843233">Source</a>' />
 
-Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma].
-However, as far as I know, nobody has tried to reproduce something similar to the Golden Gate Claude demo. Moreover, recently the AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile those two facts?
+Since then, sparse autoencoders (SAEs) have become one of the key tools in the field of mechanistic interpretability [@cunningham2023sparse; @lieberum2024gemma; @gao2024scaling], and steering activations has sparked the interest of many (see for instance [the value of steering](https://thezvi.substack.com/i/144959102/the-value-of-steering) by Zvi Mowshowitz, or the work by [GoodFire AI](https://www.goodfire.ai/blog/feature-steering-for-reliable-and-expressive-ai-engineering)).
+
+However, as far as I know, **nobody has tried to reproduce something similar to the Golden Gate Claude demo!** Moreover, the recent AxBench paper [@wu2025axbench] found that steering with SAEs was *one of the least effective methods to steer a model toward a desired concept*. How can we reconcile this with the success of the Golden Gate Claude demo?
 
 The aim of this article is to investigate how SAEs can be used to reproduce **a demo similar to Golden Gate Claude, but using a lightweight open-source model**. For this we used *Llama 3.1 8B Instruct*, but since I live in Paris... let’s make it obsessed with the Eiffel Tower!
 
 By doing this, we will realize that steering a model with vectors coming from SAEs is actually harder than we might have thought. However, we will devise several improvements over naive steering.
 
-Our main findings are:
+**Our main findings are:**
+<Note title="" variant="success">
+- Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than Anthropic suggested.
+- Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. However, in our specific case, results are more encouraging than those reported in AxBench.
+- Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency. This is similar to the approach used in the Golden Gate Claude demo, but opposite to the findings from AxBench.
+- Contrary to one of our initial hypotheses, steering multiple features simultaneously leads to only marginal improvements.
+</Note>
 
-- Optimal steering coefficients are found to be about half the typical activation magnitude at the steering layer, less than Anthropic suggested.
-- Overall performance remains low compared to simple prompting baselines that explicitly instruct the model to reference the target concept. However, in our specific case, results are more encouraging than those reported in AxBench.
-- Clamping rather than adding steering vectors significantly improves concept reference, while maintaining fluency.
-- Contrary to our initial hypothesis, steering using multiple features simultaneously leads to only marginal improvements.
+<Note variant="danger" title="TODO: add link to demo.">
+</Note>
+
+---
 
 ## 1. Steering with SAEs
 
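The add-versus-clamp distinction in the findings above is worth making concrete. In additive steering, the feature's decoder direction is scaled and added on top of the residual stream; in clamping, the feature's current activation is first subtracted, so it is pinned to a fixed value. A minimal PyTorch sketch under assumed shapes, with hypothetical names (`resid`, `enc_dir`, `enc_bias`, `dec_dir` are ours, not the repo's):

```python
import torch

def steer_add(resid: torch.Tensor, dec_dir: torch.Tensor, alpha: float) -> torch.Tensor:
    # Additive steering: shift every position by alpha times the decoder direction.
    # resid: (batch, seq, hidden), dec_dir: (hidden,)
    return resid + alpha * dec_dir

def steer_clamp(resid: torch.Tensor, enc_dir: torch.Tensor, enc_bias: torch.Tensor,
                dec_dir: torch.Tensor, alpha: float) -> torch.Tensor:
    # Clamped steering: pin the feature activation to alpha instead of adding on top.
    # a = ReLU(resid @ enc_dir + enc_bias) is the SAE feature activation per position;
    # we remove its current contribution a * dec_dir and write alpha * dec_dir instead.
    a = torch.relu(resid @ enc_dir + enc_bias)          # (batch, seq)
    return resid + (alpha - a).unsqueeze(-1) * dec_dir  # broadcasts over the hidden dim
```

The difference matters at positions where the feature already fires: adding stacks on top of the existing activation, while clamping holds it constant, which is one plausible reading of why it avoids extreme activations.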
@@ -155,6 +164,8 @@ In this paper, we will try to steer Llama 3.1 8B Instruct toward the Eiffel Towe
 
 However, for this, we will need rigorous metrics to evaluate the quality of our steered models and compare them to baselines.
 
+---
+
 ## 2. Metrics, we need metrics!
 
 To assess the quality of a steered model such as our *Eiffel Tower Llama*, we cannot rely solely on our subjective feelings.
@@ -237,6 +248,7 @@ Finally, and as an objective auxiliary metric to monitor concept inclusion, we s
 We acknowledge that this is a very crude metric, and probably too pessimistic as the model could subtly reference the Eiffel Tower without actually using the word *eiffel*.
 (For instance, when referring to *a large metal structure built in Paris.*) Of course, as this metric is hard to generalize to other concepts, we will not use it beyond simple monitoring.
 
+---
 
 ## 3. Optimizing steering coefficient for a single feature
 
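Since the auxiliary metric in the hunk above is plain keyword monitoring, it fits in a few lines; the function name below is illustrative, not taken from the repo:

```python
def concept_word_rate(generations: list[str], keyword: str = "eiffel") -> float:
    # Fraction of generations that mention the keyword at least once (case-insensitive).
    hits = sum(keyword.lower() in g.lower() for g in generations)
    return hits / max(len(generations), 1)
```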
@@ -368,6 +380,8 @@ From that, we can devise a useful proxy to find good steering coefficients:
 - for 3-gram repetition, the target is 0.0 but inspecting examples reveals that we can accept values up to 0.2 without much harm.
 - for log probability under the reference model, successful steering seems to happen when the log prob is between -1.5 and -1.0.
 
+---
+
 ## 4. Steering and generation improvements
 
 Having found optimal coefficients, we now investigate two complementary improvements that address the failure modes we identified: clamping to prevent extreme activations, and repetition penalty to prevent the gibberish mode.
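For reference, the 3-gram repetition proxy targeted above can be computed as the fraction of duplicated 3-grams in a generation; a minimal sketch (whitespace tokenization is our simplification, the article may count n-grams over model tokens):

```python
def trigram_repetition(text: str) -> float:
    # 0.0 = every 3-gram is unique; values above ~0.2 flag the gibberish failure mode.
    tokens = text.split()
    trigrams = [tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)]
    if not trigrams:
        return 0.0
    return 1.0 - len(set(trigrams)) / len(trigrams)
```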
@@ -400,6 +414,8 @@ As we can see, applying a repetition penalty reduces as expected the 3-gram repe
 
 (Note that the AxBench paper mentioned the repetition penalty but without using it, considering it as *"not the fairest setting, as it often does not accurately resemble normal user behaviour"*, see their appendix K.)
 
+---
+
 ## 5. Multi-Layer optimization
 
 Even after those improvements, we still found that steering with a single SAE feature was not very effective, with concept inclusion lying way below the maximum possible value of 2.0.
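For readers who want to try the repetition-penalty setting, Hugging Face transformers exposes it directly as a `generate` argument; the sampling values below are illustrative assumptions, not the settings used in the article:

```python
# Assumes `model` and `tokenizer` are an already-loaded transformers causal LM pair.
output_ids = model.generate(
    input_ids,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,         # illustrative sampling settings
    repetition_penalty=1.3,  # >1.0 down-weights tokens already generated
)
```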
@@ -466,10 +482,11 @@ As we can see on the chart, steering 2 or even 8 features simultaneously leads t
 
 Overall, those disappointing results contradict our initial hypothesis that steering multiple complementary features would help better represent the concept and maintain fluency. One possible explanation is our inability to find the true optimum, as the harmonic mean metric is very noisy and hard to optimize. Another explanation could be that the selected features are actually redundant rather than complementary, and that steering one of them is sufficient to activate the concept.
 
+---
 
-## Conclusion & Discussion
+## 6. Conclusion & Discussion
 
-### Main conclusions
+### 6.1 Main conclusions
 
 In this study, we have shown how to use sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than we might have thought, and finding good steering coefficients is not easy.
 
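One detail worth spelling out about the harmonic mean objective mentioned above: it collapses to zero as soon as any component score is zero, so the roughly 20% of runs with at least one zero metric (see the next-steps list below) contribute no useful signal at all. A two-line sketch, assuming the objective combines per-axis scores this way:

```python
def harmonic_mean(scores: list[float]) -> float:
    # A single zero score drives the whole objective to zero, making it noisy to optimize.
    return 0.0 if any(s == 0 for s in scores) else len(scores) / sum(1.0 / s for s in scores)
```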
@@ -481,22 +498,22 @@ A way to explain this lack of improvement could be that the selected features ar
 
 Overall, our results seem less discouraging than those of AxBench, and show that steering with SAEs can be effective, using clamping, a slightly different generation procedure and possibly combining multiple features. However, at this stage, those results are hard to generalize and our work is not really comparable to the AxBench results, since they use a different model, different concepts, different SAEs (Gemmascope vs Andy Arditi's), and different prompts. This is in line with the success of the Golden Gate Claude demo, although we don't know all the details of their steering method.
 
-TODO: embed a demo
-
-### Possible next steps
+<Note variant="info" title="Possible next steps">
+- Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
+- Check other layers for 1D optimization
+- Check complementarity vs redundancy by monitoring activation changes in subsequent layers' features.
+- Try other concepts, see if results are similar
+- Try with larger models, see if results are better
+- Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
+- Try to include earlier and later layers, see if it helps
+- Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
+</Note>
 
-Possible next steps:
-- Failure analysis on the cases where steering fails (about 20% have at least one zero metric)
-- Check other layers for 1D optimization
-- Check complementary vs redundancy by monitoring activation changes in subsequent layers' features.
-- Try other concepts, see if results are similar
-- Try with larger models, see if results are better
-- Vary the temporal steering pattern: steer prompt only, or answer only, or periodic steering
-- Try to include earlier and later layers, see if it helps
-- Investigate clamping: why do we find that clamping helps, similar to Anthropic, while AxBench found the opposite? We could hypothesize it prevents extreme activations, but it could also counteract some negative feedback behavior, when other parts of the model try to compensate for the added steering vector. (analogy with biology, where signaling pathways are often regulated by negative feedback loops)
 
 
+## Appendix
 
+### nnsight code
 
 ```python
 input_ids = llm.tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True)
@@ -514,6 +531,7 @@ with llm.generate() as tracer:
 answer = llm.tokenizer.decode(trace[0][len(input_ids):], skip_special_tokens=True)
 ```
 
+### Bayesian optimization details
 
 We considered a simple Gaussian Process (GP) model with an RBF kernel.
 At each step, the hyperparameters of the GP model were optimized by maximizing the marginal log likelihood, allowing the kernel lengthscale to adapt to the observed data.
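The nnsight snippet above is split across two hunks, so the middle of the block is elided here; rather than guessing the missing lines, here is an equivalent steered-generation loop written with a plain PyTorch forward hook instead of nnsight. The layer index, coefficient, and names such as `dec_dir` and `chat` are our assumptions, and `model`/`tokenizer` are a standard transformers Llama pair rather than the nnsight wrapper:

```python
import torch

LAYER, ALPHA = 12, 8.0  # hypothetical steering layer and coefficient

def make_steering_hook(direction: torch.Tensor, alpha: float):
    def hook(module, inputs, output):
        # Llama decoder layers return a tuple whose first element is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * direction  # additive steering on the residual stream
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

handle = model.model.layers[LAYER].register_forward_hook(make_steering_hook(dec_dir, ALPHA))
try:
    input_ids = tokenizer.apply_chat_template(
        chat, tokenize=True, add_generation_prompt=True, return_tensors="pt"
    )
    output_ids = model.generate(input_ids, max_new_tokens=256)
    answer = tokenizer.decode(output_ids[0][input_ids.shape[1]:], skip_special_tokens=True)
finally:
    handle.remove()  # detach the hook so later generations run unsteered
```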
@@ -525,7 +543,5 @@ We used the reduced parameterization presented earlier, searching for an optimal
 To favor noise reduction at promising locations, every 5 steps we decided to resample the best point found so far.
 In that case, by *best* we mean the point with the lowest GP posterior $\mu(x)$. (Note that this is different from the point with the lowest observed value, which might be a lucky noisy outlier.)
 
-#### Gradient descent
-
 Performing gradient descent on the GP posterior is very cheap since it only involves differentiating the kernel function.
 We thus performed gradient descent starting from 500 random points in the parameter space, using as target the upper confidence bound $\mu(x) + \beta\sigma(x)$, to favor points that are not only predicted to be good but also have low uncertainty. We then performed a clustering to group together the points that converged to the same local minimum, and selected the best cluster as the candidate for evaluation.
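To make the appendix procedure concrete, here is a minimal sketch of the GP posterior and the multi-start descent on $\mu(x) + \beta\sigma(x)$. It fixes the kernel lengthscale and noise instead of refitting them by marginal likelihood at each step, and it omits the final clustering; all names and values are illustrative:

```python
import torch

def rbf(a: torch.Tensor, b: torch.Tensor, lengthscale: float = 0.3) -> torch.Tensor:
    # Unit-variance RBF kernel matrix between two sets of points.
    return torch.exp(-0.5 * torch.cdist(a, b).pow(2) / lengthscale**2)

def gp_posterior(x, X, y, noise=1e-2, lengthscale=0.3):
    # Zero-mean GP posterior mean and std at query points x, given observations (X, y).
    K = rbf(X, X, lengthscale) + noise * torch.eye(len(X))
    k_star = rbf(x, X, lengthscale)                # (m, n)
    mu = k_star @ torch.linalg.solve(K, y)         # k* K^{-1} y
    v = torch.linalg.solve(K, k_star.T)            # K^{-1} k*
    var = 1.0 - (k_star * v.T).sum(dim=1)          # k(x, x) = 1 for this kernel
    return mu, var.clamp_min(1e-12).sqrt()

def descend(X, y, dim, n_starts=500, steps=200, beta=1.0, lr=0.05):
    # Multi-start gradient descent on mu + beta * sigma: since we are minimizing,
    # this conservative bound favors points predicted good with low uncertainty.
    x = torch.rand(n_starts, dim, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        mu, sigma = gp_posterior(x, X, y)
        (mu + beta * sigma).sum().backward()       # the starts are independent
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)                     # stay inside the unit search box
    return x.detach()
```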
 
app/src/content/assets/image/golden_gate_claude_BSM.png ADDED

Git LFS Details

  • SHA256: 2e7f76f1f4ae7584a97329706eb9cbdaea4c7ca90b0150a86c15ac41aef484bf
  • Pointer size: 131 Bytes
  • Size of remote file: 263 kB
app/src/content/assets/image/golden_gate_claude_birthday_party.jpeg ADDED

Git LFS Details

  • SHA256: ffc200ba9a2d07bfae7c666422656493309c840420e3a330e53edb9db2a115f3
  • Pointer size: 131 Bytes
  • Size of remote file: 186 kB
app/src/content/assets/image/golden_gate_claude_pearl_harbor.jpeg ADDED

Git LFS Details

  • SHA256: 65bf3d36340bf2ec8f460b054b12e9557c573412940463b0ab5457d63303b09a
  • Pointer size: 131 Bytes
  • Size of remote file: 201 kB
app/src/content/bibliography.bib CHANGED
@@ -156,4 +156,11 @@
 author={Wu, Zhengxuan and Arora, Aryaman and Geiger, Atticus and Wang, Zheng and Huang, Jing and Jurafsky, Dan and Manning, Christopher D and Potts, Christopher},
 journal={arXiv preprint arXiv:2501.17148},
 year={2025}
+}
+
+@article{gao2024scaling,
+title={Scaling and evaluating sparse autoencoders},
+author={Gao, Leo and la Tour, Tom Dupr{\'e} and Tillman, Henk and Goh, Gabriel and Troll, Rajan and Radford, Alec and Sutskever, Ilya and Leike, Jan and Wu, Jeffrey},
+journal={arXiv preprint arXiv:2406.04093},
+year={2024}
 }
app/src/content/embeds/banner.html CHANGED
@@ -1,6 +0,0 @@
1
-
2
- <div style="display: flex; justify-content: center;">
3
- <img src="src/content/assets/image/eiffel_tower_llama.png"
4
- alt="Eiffel Tower Llama"
5
- style="max-width:80%; height:auto; border-radius:8px;" />
6
- </div>