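This is a sparse autoencoder (SAE) for CLIP, trained at 128x expansion, meaning the SAE's hidden layer has 128 times as many units as the dimension of the activations it reads. Read more: https://www.lesswrong.com/posts/Quqekpvx8BGMMcaem/interpreting-and-steering-features-in-images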
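
As a rough illustration of what a 128x expansion implies for tensor shapes, here is a minimal NumPy sketch of a generic ReLU sparse autoencoder. It is not the code used to train this checkpoint: the toy input width `d_model = 16`, the random initialization, the parameter names (`W_enc`, `b_enc`, `W_dec`, `b_dec`), and the choice to subtract the decoder bias before encoding are all assumptions made for the example. Only the expansion factor of 128 comes from the description above.

```python
import numpy as np

EXPANSION = 128  # expansion factor stated in the description


def init_sae(d_model, expansion=EXPANSION, seed=0):
    """Randomly initialize a toy SAE (hypothetical layout, not the released weights)."""
    rng = np.random.default_rng(seed)
    d_hidden = d_model * expansion  # hidden width = expansion * input width
    scale = 1.0 / np.sqrt(d_model)
    return {
        "W_enc": rng.normal(0.0, scale, size=(d_model, d_hidden)).astype(np.float32),
        "b_enc": np.zeros(d_hidden, dtype=np.float32),
        "W_dec": rng.normal(0.0, scale, size=(d_hidden, d_model)).astype(np.float32),
        "b_dec": np.zeros(d_model, dtype=np.float32),
    }


def encode(params, x):
    """Map activations to non-negative feature activations via a ReLU encoder."""
    pre = (x - params["b_dec"]) @ params["W_enc"] + params["b_enc"]
    return np.maximum(pre, 0.0)


def decode(params, f):
    """Reconstruct activations as a linear combination of decoder rows."""
    return f @ params["W_dec"] + params["b_dec"]


d_model = 16  # toy width for the example; the real CLIP activation width is larger
params = init_sae(d_model)
x = np.random.default_rng(1).normal(size=(4, d_model)).astype(np.float32)

features = encode(params, x)      # one row of feature activations per input
recon = decode(params, features)  # reconstruction has the same shape as x
print(features.shape, recon.shape)
```

With a real checkpoint, `d_model` would be the width of the CLIP activations the SAE was trained on, and the weights would be loaded from the released files rather than initialized at random.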