Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) and [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude), this project aims to make concept steering accessible to the broader research community.

## Features

Coming soon:

- **Universal Model Support**: Train SAEs on any Hugging Face transformer model
- **Feature Discovery**: Find interpretable features representing specific concepts
- **Concept Steering**: Amplify or suppress discovered features to influence model behavior (see the sketch after this list)
- **Interactive Chat**: Chat with models while manipulating their internal features
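
None of this is the library's final API. As an illustration of the core mechanic behind concept steering, here is a minimal PyTorch sketch that adds a scaled SAE decoder direction to a transformer layer's residual stream via a forward hook; the decoder weights, feature index, layer choice, and steering strength are all placeholders.

```python
# Illustrative only: steer generation by adding a scaled SAE decoder
# direction to the residual stream. The decoder weights, feature index,
# layer choice, and strength below are placeholders, not shipped values.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "allenai/OLMo-2-1124-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

d_model = model.config.hidden_size
sae_decoder = torch.randn(65536, d_model, dtype=torch.bfloat16)  # placeholder SAE decoder
feature_idx, strength = 1234, 8.0  # placeholder feature and coefficient

direction = sae_decoder[feature_idx]
direction = strength * direction / direction.norm()

def steer(module, args, output):
    # Decoder layers return (hidden_states, ...); add the feature direction.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + direction.to(hidden.device)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

# Hook a middle layer; which layer steers best is an empirical question.
handle = model.model.layers[len(model.model.layers) // 2].register_forward_hook(steer)

inputs = tokenizer("Tell me about yourself.", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))
handle.remove()  # remove the hook to restore normal behavior
```

Amplifying a feature pushes generations toward its concept (the Golden Gate Claude effect); a negative strength suppresses it.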

## Pre-trained Models

In the spirit of fully open-source models, we have started training SAEs on [OLMo 2 7B](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct).

We provide pre-trained SAEs and discovered features for popular models on Hugging Face.

Each model repository will include:
- Trained SAE weights (see the loading sketch below)
- Catalog of discovered interpretable features
- Example steering configurations
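
The repositories are not published yet, so the repo id and filename in this sketch are hypothetical; it only shows the standard Hugging Face Hub pattern for fetching and inspecting a weights file.

```python
# Hypothetical: "your-org/olmo2-7b-sae" and "sae.pt" are placeholders
# until the pre-trained SAE repositories actually exist.
import torch
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="your-org/olmo2-7b-sae", filename="sae.pt")
state_dict = torch.load(path, map_location="cpu")
print({name: tuple(t.shape) for name, t in state_dict.items()})
```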

## Datasets

The dataset from OLMo 2 7B's middle layer is available [here](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo). It contains about 600 million residual stream vectors.
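
As context for how a dataset like this is typically collected (an assumption about the pipeline, not this project's published code), the sketch below captures one residual stream vector per token with a forward hook on a middle layer:

```python
# An assumption about the pipeline, not this project's published code:
# capture one residual stream vector per token with a forward hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "allenai/OLMo-2-1124-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

captured = []
def grab(module, args, output):
    hidden = output[0] if isinstance(output, tuple) else output
    captured.append(hidden.detach().reshape(-1, hidden.shape[-1]).cpu())

mid = len(model.model.layers) // 2  # "middle layer", as described above
handle = model.model.layers[mid].register_forward_hook(grab)

with torch.no_grad():
    model(**tokenizer("Sparse autoencoders find features.", return_tensors="pt"))
handle.remove()

vectors = torch.cat(captured)  # (n_tokens, hidden_size) residual stream vectors
print(vectors.shape)
```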

More to come!

## Quick Start
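
The quick start is still being written. In the meantime, here is a self-contained PyTorch sketch of the standard setup this project is built around: an overcomplete ReLU autoencoder trained on residual stream vectors with a reconstruction loss plus an L1 sparsity penalty. All sizes and hyperparameters are illustrative, not this library's defaults.

```python
# Illustrative stand-in while the real quick start is written: a standard
# sparse autoencoder (reconstruction + L1 sparsity) on residual stream vectors.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(f), f

d_model, n_features = 4096, 65536        # illustrative sizes (expansion factor 16)
sae = SparseAutoencoder(d_model, n_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 5e-4                          # illustrative sparsity penalty

for step in range(100):
    x = torch.randn(256, d_model)        # stand-in for a batch of residual vectors
    recon, f = sae(x)
    loss = (recon - x).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The dictionary width and the sparsity penalty are the main knobs: a wider dictionary and a stronger penalty push toward sparser, more interpretable features at some cost in reconstruction quality.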

## Examples

Check out the [steered OLMo 7B model](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo)!

## License