hbfreed committed
Commit 70b315b · verified · 1 Parent(s): 37a9df1

Update README.md

Files changed (1):
1. README.md (+14 -9)
README.md CHANGED

```diff
@@ -12,32 +12,37 @@ short_description: Training SAEs
 Open Concept Steering is an open-source library for discovering and manipulating interpretable features in large language models using Sparse Autoencoders (SAEs). Inspired by Anthropic's work on [Scaling Monosemanticity](https://transformer-circuits.pub/2024/scaling-monosemanticity/) and [Golden Gate Claude](https://www.anthropic.com/news/golden-gate-claude), this project aims to make concept steering accessible to the broader research community.
 
 ## Features
+
 Coming soon!
-- **Universal Model Support**: Train SAEs on any HuggingFace transformer model
+- **Universal Model Support**: Train SAEs on any Hugging Face transformer model
 - **Feature Discovery**: Find interpretable features representing specific concepts
 - **Concept Steering**: Amplify or suppress discovered features to influence model behavior
 - **Interactive Chat**: Chat with models while manipulating their internal features
 
-## Pre-trained Models
+## Pre-trained Models
+
 In the spirit of fully open-source models, we have started training SAEs on [OLMo 2 7B](https://huggingface.co/allenai/OLMo-2-1124-7B-Instruct).
 
-We provide pre-trained SAEs and discovered features for popular models on HuggingFace:
+We provide pre-trained SAEs and discovered features for popular models on Hugging Face:
 
 Each model repository will include:
 - Trained SAE weights
 - Catalog of discovered interpretable features
 - Example steering configurations
 
+## Datasets
+
+The dataset from OLMo 2 7B's middle layer is [here](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo).
+It is about 600 million residual stream vectors.
+
+More to come!
+
 
 ## Quick Start
 
-## Examples (In progress)
+## Examples
 
-See the `examples/` directory for detailed notebooks demonstrating:
-- Training SAEs on different models
-- Finding and analyzing features
-- Steering model behavior
-- Interactive chat sessions
+Check out the [steered OLMo 7B model](https://huggingface.co/spaces/hbfreed/olmo2-sae-steering-demo)!
 
 ## License
 
```
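For context on the technique the README names, here is a minimal, hypothetical sketch of what concept steering with an SAE feature typically looks like: adding a scaled SAE decoder direction to the residual stream at one layer via a forward hook. This is not this repository's API; `feature_dir`, `strength`, the random stand-in for a real decoder row (e.g. a hypothetical `sae.W_dec[feature_id]`), and the hook itself are illustrative assumptions.

```python
# Minimal sketch of SAE-based concept steering (assumed recipe, not this repo's API):
# add a scaled feature direction to the residual stream at one layer via a hook.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "allenai/OLMo-2-1124-7B-Instruct"  # model named in the README
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# "Middle layer", matching where the README says activations were collected.
layer = model.model.layers[model.config.num_hidden_layers // 2]

# Stand-in for a trained SAE's decoder row; a real run would load the SAE weights.
feature_dir = torch.randn(model.config.hidden_size)
feature_dir = feature_dir / feature_dir.norm()
strength = 8.0  # > 0 amplifies the concept, < 0 suppresses it

def steer(module, inputs, output):
    # Decoder layers return a tuple whose first element is the hidden states.
    hidden = output[0] + strength * feature_dir.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(steer)
ids = tok("Tell me about yourself.", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=60)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()  # restore unsteered behavior
```

On the dataset's scale: OLMo 2 7B's hidden size is 4096, so 600 million residual stream vectors stored in bfloat16 would occupy roughly 600e6 × 4096 × 2 bytes ≈ 4.9 TB (the storage dtype is an assumption).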