gsarti 
posted an update Jan 22
🔍 Today's pick in Interpretability & Analysis of LMs: Towards Best Practices of Activation Patching in Language Models: Metrics and Methods by @m0pp11 and @NeelNanda

This work systematically examines how methodological details affect activation patching, a popular technique with causal guarantees for quantifying the importance of model components in driving model predictions. The authors' recommendations include 1) using in-distribution counterfactual prompts instead of noising/zeroing to mitigate the OOD problem; 2) using logits instead of probabilities as the evaluation metric, enabling the discovery of model components with a negative influence on predictions; 3) accounting for interaction effects across layers when performing multi-layer patching; and 4) experimenting with corrupting different prompt tokens to verify agreement between the resulting discovered circuits.
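To make the core mechanic concrete, here is a minimal sketch of activation patching with a logit-difference metric, using a hypothetical two-layer toy model in NumPy rather than a real language model (all names and shapes here are illustrative assumptions, not the paper's setup): we cache an activation from a clean run, substitute it into a run on a counterfactual prompt, and measure how much of the clean behaviour is restored.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy stand-in for a language model: two dense layers.
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 2))  # two "vocabulary" logits

def forward(x, patch=None):
    """Run the toy model; optionally overwrite the hidden activation (patching)."""
    h = np.tanh(x @ W1)
    if patch is not None:
        h = patch  # activation patching: substitute the cached clean activation
    return h @ W2

def logit_diff(logits):
    """Logit difference between the correct and incorrect answer tokens."""
    return logits[0] - logits[1]

clean = rng.normal(size=4)    # stand-in for the clean prompt
corrupt = rng.normal(size=4)  # stand-in for the in-distribution counterfactual

clean_hidden = np.tanh(clean @ W1)  # cache the clean activation

base = logit_diff(forward(corrupt))
patched = logit_diff(forward(corrupt, patch=clean_hidden))
clean_score = logit_diff(forward(clean))

# Normalized patching effect: 1.0 means the patch fully restores clean behaviour.
effect = (patched - base) / (clean_score - base)
```

Because this sketch patches the entire hidden layer, the patched run exactly reproduces the clean run and the normalized effect is 1.0; in practice one patches individual components (heads, MLPs, positions) and obtains partial effects.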

📄 Paper: Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (2309.16042)