Today's pick in Interpretability & Analysis of LMs: x2 edition!
Today's highlighted works aim to reproduce findings from the Transformer-centric interpretability literature on new RNN-based architectures such as Mamba and RWKV:
Does Transformer Interpretability Transfer to RNNs? (2404.05971) by @MrGonao T. Marshall @norabelrose
Locating and Editing Factual Associations in Mamba (2404.03646) by @sensharma @datkinson @davidbau
The first paper applies contrastive activation addition, the tuned lens, and probing for eliciting latent knowledge in quirky models to Mamba and RWKV LMs, finding that these Transformer-born methods can be applied to these architectures with only slight adaptation and yield similar results.
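To give a flavor of the kind of intervention being transferred, here is a minimal sketch of contrastive activation addition using PyTorch forward hooks. This is not the paper's code: the checkpoint name, layer index, prompt pair, and the `block()` helper for locating a residual block are illustrative assumptions.

```python
# Minimal sketch of contrastive activation addition (CAA) via forward hooks.
# Assumptions: checkpoint, layer index and prompts are illustrative; the exact
# module path to a residual block differs per architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "state-spaces/mamba-130m-hf"  # assumed checkpoint; any causal LM with block-level hooks works
LAYER = 12

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def block(model, idx):
    # Hypothetical helper: locate the idx-th residual block for a few common layouts.
    for path in ("backbone.layers", "model.layers", "transformer.h"):
        obj = model
        try:
            for attr in path.split("."):
                obj = getattr(obj, attr)
            return obj[idx]
        except AttributeError:
            continue
    raise ValueError("unknown architecture layout")

def hidden_at(prompt, idx=LAYER):
    # Capture the block's output hidden state at the last token of a prompt.
    cache = {}
    def hook(_m, _inp, out):
        cache["h"] = (out[0] if isinstance(out, tuple) else out).detach()
    handle = block(model, idx).register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"][0, -1]

# Steering vector = activation difference on a contrastive prompt pair.
steer = hidden_at("I love this movie. My review is positive") - \
        hidden_at("I love this movie. My review is negative")

def generate_steered(prompt, alpha=4.0):
    # Add the scaled steering vector to the block's output during generation.
    def hook(_m, _inp, out):
        if isinstance(out, tuple):
            return (out[0] + alpha * steer,) + out[1:]
        return out + alpha * steer
    handle = block(model, LAYER).register_forward_hook(hook)
    out = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=30)
    handle.remove()
    return tok.decode(out[0], skip_special_tokens=True)

print(generate_steered("The film was"))
```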
The second work applies the ROME method to Mamba, finding that the weights playing the role of Transformer MLPs encode factual relations across several Mamba layers and can be patched to perform model editing. A new SSM-specific technique emulating attention knockout (value zeroing) is also introduced, revealing information flows similar to those in Transformers when processing factual statements.
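The core of a ROME-style edit is a closed-form rank-one update to a single projection matrix. The sketch below only illustrates that algebra on random tensors; the variable names and dimensions are assumptions, and the actual method additionally estimates the key covariance C from a text corpus and optimizes the target value v* against the model.

```python
# Minimal sketch of a ROME-style rank-one edit of one projection matrix W, using
# W_new = W + (v* - W k*) (C^{-1} k*)^T / (k*^T C^{-1} k*).
# All tensors here are random placeholders; in Mamba, the edited matrix is the
# projection that the paper identifies as playing the MLP "value" role.
import torch

d_in, d_out = 64, 64
W = torch.randn(d_out, d_in)      # weight to edit (maps keys k to values v)
C = torch.eye(d_in)               # covariance of keys E[k k^T] (estimated on text in practice)
k_star = torch.randn(d_in)        # key vector for the edited subject
v_star = torch.randn(d_out)       # target value encoding the new fact

Cinv_k = torch.linalg.solve(C, k_star)                       # C^{-1} k*
delta = torch.outer(v_star - W @ k_star, Cinv_k) / (k_star @ Cinv_k)
W_new = W + delta

# The edit maps the target key exactly to the target value:
print(torch.allclose(W_new @ k_star, v_star, atol=1e-4))
```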
Code: https://github.com/arnab-api/romba
All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9