fix missing word #1
opened by glutamatt (HF Staff)
app/src/content/article.mdx
CHANGED
@@ -479,7 +479,7 @@ Another plausible explanation could be that **the selected features are actually
 
 ### 6.1 Main conclusions
 
-In this study, we demonstrated the use of sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder initially expected, and finding good steering coefficients is not easy.
+In this study, we demonstrated the use of sparse autoencoders to steer a lightweight open-source model (Llama 3.1 8B Instruct) to create a conversational agent obsessed with the Eiffel Tower, similar to the Golden Gate Claude experiment. As reported by the AxBench paper, and as can be experienced on Neuronpedia, steering with SAEs is harder than initially expected, and finding good steering coefficients is not easy.
 
 First, we showed that simple improvements like clamping feature activations and using repetition penalty and lower temperature can help significantly. We then devised a systematic approach to optimize steering coefficients using bayesian optimization, and auxiliary metrics correlated with LLM-judge metrics.
 