🔍 Today's pick in Interpretability & Analysis of LMs: AtP*: An efficient and scalable method for localizing LLM behaviour to components by J. Kramár, T. Lieberum, R. Shah, and @NeelNanda
The attribution patching method (AtP) can provide fast and effective approximations of activation patching, requiring only two forward passes and one backward pass to estimate the contribution of all network components for a given prompt pair.
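To make the cost difference concrete, here is a minimal sketch on a toy "model", not the paper's code. It assumes a hypothetical scalar metric `tanh(w · h)` over three node activations, with the gradient written out by hand in place of a real backward pass. Activation patching needs one forward pass per node, while the AtP estimate `(h_noise - h_clean) * grad` covers all nodes from one gradient.

```python
import math

def metric(h, w):
    """Toy stand-in for the model: a scalar metric of node activations h."""
    return math.tanh(sum(wi * hi for wi, hi in zip(w, h)))

def metric_grad(h, w):
    """Hand-written gradient of the metric w.r.t. each activation
    (plays the role of the single backward pass on the clean run)."""
    s = sum(wi * hi for wi, hi in zip(w, h))
    return [wi * (1.0 - math.tanh(s) ** 2) for wi in w]

def activation_patching(h_clean, h_noise, w):
    """Exact effect of patching each node: one forward pass per node."""
    base = metric(h_clean, w)
    effects = []
    for i in range(len(h_clean)):
        patched = list(h_clean)
        patched[i] = h_noise[i]  # swap in the noise activation for node i
        effects.append(metric(patched, w) - base)
    return effects

def attribution_patching(h_clean, h_noise, w):
    """First-order estimate for every node from one gradient evaluation."""
    grad = metric_grad(h_clean, w)
    return [(hn - hc) * g for hc, hn, g in zip(h_clean, h_noise, grad)]

# Illustrative values only (assumed, not from the paper).
w = [0.5, -0.3, 0.8]
h_clean = [0.2, 0.1, -0.1]
h_noise = [0.3, 0.0, 0.1]

exact = activation_patching(h_clean, h_noise, w)
approx = attribution_patching(h_clean, h_noise, w)
for i, (e, a) in enumerate(zip(exact, approx)):
    print(f"node {i}: exact={e:.5f} atp={a:.5f}")
```

In this near-linear regime the two agree closely; the gap grows as the clean and noise activations move further apart.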
While previous work highlighted the effectiveness of attribution patching, the authors identify two settings that lead to false negatives with AtP:
- When estimating the contribution of pre-activation components, if clean and noise inputs don’t lie in the same activation region, the first-order approximation given by the gradient leads to large errors (Fig 3).
- When the sum of direct and indirect effects is close to 0, even small approximation errors introduced by nonlinearities can greatly affect the estimated contribution.
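Both failure modes can be reproduced in a toy setting. The sketch below is an illustration under assumed functions, not the paper's experiments: a sigmoid stands in for a saturating nonlinearity, and a hypothetical metric `x - tanh(x)` stands in for a node whose linearized direct and indirect contributions cancel.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Failure mode 1: saturation. Clean and noise pre-activations sit on
# opposite flat regions of the nonlinearity, so the gradient at the
# clean point says almost nothing about the jump between them.
z_clean, z_noise = 10.0, -10.0
sat_true = sigmoid(z_noise) - sigmoid(z_clean)
sat_atp = (z_noise - z_clean) * sigmoid_grad(z_clean)

# Failure mode 2: cancellation. The node has a direct path (+x) and an
# indirect path through a nonlinearity (-tanh(x)). At x = 0 the two
# gradient terms cancel, so the linear estimate vanishes.
def cancel_metric(x):
    return x - math.tanh(x)

def cancel_grad(x):
    return 1.0 - (1.0 - math.tanh(x) ** 2)

x_clean, x_noise = 0.0, 2.0
cancel_true = cancel_metric(x_noise) - cancel_metric(x_clean)
cancel_atp = (x_noise - x_clean) * cancel_grad(x_clean)

print(f"saturation:   true={sat_true:.4f} atp={sat_atp:.6f}")
print(f"cancellation: true={cancel_true:.4f} atp={cancel_atp:.6f}")
```

In both cases the AtP estimate is close to zero while the true patching effect is not, which is exactly what a false negative looks like.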
The authors propose two changes to the AtP method to mitigate such issues:
- Recomputing the attention softmax for the selected component, and then taking a linear approximation to the remaining part of the model (QK Fix)
- Iteratively zeroing gradients at layers contributing to the indirect effects causing cancellation (GradDrop)
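The idea behind the QK Fix can be sketched on a single toy attention row. This is a simplified illustration, not the authors' implementation: the scores, values, and the metric (a weighted sum of hypothetical `values`, linear in the attention probabilities) are all assumptions. Plain AtP linearizes through the softmax; the fix recomputes the softmax with the patched scores and linearizes only what comes after it.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def metric(probs, values):
    """Toy downstream metric, linear in the attention probabilities."""
    return sum(p * v for p, v in zip(probs, values))

def grad_wrt_scores(scores, values):
    """Gradient of the metric w.r.t. pre-softmax scores (softmax Jacobian)."""
    probs = softmax(scores)
    mean = metric(probs, values)
    return [p * (v - mean) for p, v in zip(probs, values)]

# Illustrative values only: attention is saturated on a different
# position in the clean and noise runs.
scores_clean = [8.0, 0.0, 0.0]
scores_noise = [0.0, 8.0, 0.0]
values = [1.0, -1.0, 0.0]

probs_clean = softmax(scores_clean)
probs_noise = softmax(scores_noise)
true_effect = metric(probs_noise, values) - metric(probs_clean, values)

# Plain AtP: first-order estimate taken through the saturated softmax.
g_scores = grad_wrt_scores(scores_clean, values)
plain_atp = sum((sn - sc) * g
                for sc, sn, g in zip(scores_clean, scores_noise, g_scores))

# QK-fix style estimate: recompute the softmax exactly for the patched
# scores, then use the gradient w.r.t. the attention probabilities.
g_probs = values  # d(metric)/d(probs) for this linear toy metric
qk_fix = sum((pn - pc) * g
             for pc, pn, g in zip(probs_clean, probs_noise, g_probs))

print(f"true={true_effect:.4f} plain_atp={plain_atp:.4f} qk_fix={qk_fix:.4f}")
```

Because the toy metric is linear after the softmax, the corrected estimate matches the true effect here; in a real model the remaining linear approximation would still introduce some error.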
AtP and AtP* are compared across several patching settings for Pythia models, finding them effective while much less computationally expensive than other approaches. A new methodology is also proposed to estimate the magnitude of AtP* false negatives given a set of samples and desired confidence levels.