hbfreed committed
Commit 70b315b · verified · 1 Parent(s): 37a9df1

Update README.md

Files changed (1):
1. README.md (+14 -9)
README.md CHANGED

```diff
@@ -12,32 +12,37 @@ short_description: Training SAEs
 Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) and [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude), this project aims to make concept steering accessible to the broader research community.
 
 ## Features
+
 Coming soon!
-- **Universal Model Support**: Train SAEs on any HuggingFace transformer model
+- **Universal Model Support**: Train SAEs on any Hugging Face transformer model
 - **Feature Discovery**: Find interpretable features representing specific concepts
 - **Concept Steering**: Amplify or suppress discovered features to influence model behavior
 - **Interactive Chat**: Chat with models while manipulating their internal features
 
-## Pre-trained Models
+## Pre-trained Models
+
 In the spirit of fully open-source models, we have started training SAEs on [OLMo 2 7B](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct).
 
-We provide pre-trained SAEs and discovered features for popular models on HuggingFace:
+We provide pre-trained SAEs and discovered features for popular models on Hugging Face:
 
 Each model repository will include:
 - Trained SAE weights
 - Catalog of discovered interpretable features
 - Example steering configurations
 
+## Datasets
+
+The dataset from OLMo 2 7B's middle layer is [here](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo).
+It is about 600 million residual stream vectors.
+
+More to come!
+
 
 ## Quick Start
 
-## Examples (In progress)
+## Examples
 
-See the `examples/` directory for detailed notebooks demonstrating:
-- Training SAEs on different models
-- Finding and analyzing features
-- Steering model behavior
-- Interactive chat sessions
+Check out the [steered OLMo 7B model](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo)!
 
 ## License
 
```
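For context on the technique the README names, here is a minimal, hypothetical sketch of what concept steering with an SAE feature typically looks like: adding a scaled SAE decoder direction to the residual stream at one layer via a forward hook. This is not this repository's API; `feature_dir`, `strength`, the random stand-in for a real decoder row (e.g. a hypothetical `sae.W_dec[feature_id]`), and the hook itself are illustrative assumptions.

```python
# Minimal sketch of SAE-based concept steering (assumed recipe, not this repo's API):
# add a scaled feature direction to the residual stream at one layer via a hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B-Instruct"  # model named in the README
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# "Middle layer", matching where the README says activations were collected.
layer = model.model.layers[model.config.num_hidden_layers // 2]

# Stand-in for a trained SAE's decoder row; a real run would load the SAE weights.
feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()
strength = 8.0  # > 0 amplifies the concept, < 0 suppresses it

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + strength * feature_dir.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(steer)
ids = tok("Tell me about yourself.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore unsteered behavior
```

On the dataset's scale: OLMo 2 7B's hidden size is 4096, so 600 million residual stream vectors stored in bfloat16 would occupy roughly 600e6 × 4096 × 2 bytes ≈ 4.9 TB (the storage dtype is an assumption).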