🔍 Today's pick in Interpretability & Analysis of LMs: Towards Best Practices of Activation Patching in Language Models: Metrics and Methods by @m0pp11 and @NeelNanda
This work systematically examines how methodological details affect activation patching, a popular causal intervention technique for quantifying how much individual model components drive model predictions. The authors recommend 1) using in-distribution counterfactual prompts rather than noising or zeroing activations, to mitigate the out-of-distribution problem; 2) using logits instead of probabilities as the evaluation metric, so that components with a negative influence on predictions can also be discovered; 3) accounting for interactions across layers when patching multiple layers at once; and 4) experimenting with corrupting different prompt tokens and verifying that the resulting discovered circuits agree.
📄 Paper: Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (2309.16042)
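To make recommendations 1) and 2) concrete, here is a minimal sketch of activation patching using TransformerLens (Nanda's library): a clean/counterfactual prompt pair plus a logit-difference metric. The model, prompts, and answer tokens are illustrative assumptions, not taken from the paper.

```python
# Minimal activation patching sketch (illustrative, not the paper's exact setup).
# Rec. 1: corrupt with an in-distribution counterfactual prompt, not noise/zeroing.
# Rec. 2: measure logit difference, not probability, so components that push
#         *against* the correct answer remain visible.
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "The capital of France is"    # clean run
corrupt_prompt = "The capital of Italy is"   # counterfactual run (same token length)
clean_tokens = model.to_tokens(clean_prompt)
corrupt_tokens = model.to_tokens(corrupt_prompt)

answer = model.to_single_token(" Paris")
counter = model.to_single_token(" Rome")

def logit_diff(logits: torch.Tensor) -> float:
    # Difference between the correct and counterfactual answer logits
    # at the final position.
    return (logits[0, -1, answer] - logits[0, -1, counter]).item()

# Cache all activations from the clean run.
_, clean_cache = model.run_with_cache(clean_tokens)

# Patch the clean residual stream into the corrupted run, one layer at a time,
# and see how much of the clean-run logit difference is restored.
for layer in range(model.cfg.n_layers):
    hook_name = utils.get_act_name("resid_pre", layer)

    def patch_hook(resid, hook, name=hook_name):
        # Overwrite the corrupted activation with the cached clean one.
        return clean_cache[name]

    patched_logits = model.run_with_hooks(
        corrupt_tokens, fwd_hooks=[(hook_name, patch_hook)]
    )
    print(f"layer {layer}: patched logit diff = {logit_diff(patched_logits):.3f}")
```

Layers whose patch restores a large logit difference are the ones carrying the France-vs-Italy information; per recommendation 4), the same sweep should be repeated while corrupting different prompt tokens to check that the discovered components agree.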