@gsarti on Hugging Face: "🔍 Today's pick in Interpretability & Analysis of LMs: Have Faith in…"


🔍 Today's pick in Interpretability & Analysis of LMs: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms by @mwhanna @sandropezzelle @belinkov

Edge attribution patching (EAP) is a circuit discovery technique that uses gradients to approximate the effect of causally intervening on each model edge. In the literature, its effectiveness is validated by measuring how much its circuits overlap with those found via causal interventions, which are much more expensive.

This work:

1. Proposes a new method for faithful and efficient circuit discovery named edge attribution patching with integrated gradients (EAP-IG).
2. Evaluates the faithfulness of EAP, EAP-IG and activation patching, i.e. whether the behavior of the model remains consistent after all non-circuit edges are ablated.
3. Highlights that, while no overlap and full overlap between the circuits found by EAP-like methods and those found by activation patching are generally good indicators of unfaithful and faithful circuits (respectively), circuits with moderate overlap cannot generally be assumed to be faithful to model behavior.
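
To make the faithfulness check in point 2 concrete, here is a minimal toy sketch. It is not the paper's code: the "model" is just a weighted sum over five hypothetical edges, ablation means swapping an edge's clean activation for its corrupted one, and the normalization against the corrupted baseline is an assumption chosen for illustration.

```python
import numpy as np

def run_model(edge_acts, weights):
    """Toy 'model': the output is a weighted sum of per-edge activations."""
    return float(weights @ edge_acts)

def run_circuit(circuit_mask, clean_acts, corrupted_acts, weights):
    """Keep clean activations on circuit edges, ablate all other edges
    by replacing them with their corrupted activations."""
    acts = np.where(circuit_mask, clean_acts, corrupted_acts)
    return run_model(acts, weights)

def normalized_faithfulness(circuit_mask, clean_acts, corrupted_acts, weights):
    """Fraction of the full model's behavior (relative to the fully
    corrupted baseline) that the circuit alone recovers. Assumed metric."""
    full = run_model(clean_acts, weights)
    base = run_model(corrupted_acts, weights)
    circ = run_circuit(circuit_mask, clean_acts, corrupted_acts, weights)
    return (circ - base) / (full - base)

# Hypothetical edge weights: two edges carry almost all of the behavior.
weights = np.array([2.0, 1.5, 0.1, -0.05, 0.02])
clean = np.ones(5)
corrupted = np.zeros(5)

reference = np.array([True, True, False, False, False])   # the two key edges
candidate = np.array([True, False, True, False, False])   # shares one edge with reference
everything = np.ones(5, dtype=bool)                       # the full model

f_reference = normalized_faithfulness(reference, clean, corrupted, weights)
f_candidate = normalized_faithfulness(candidate, clean, corrupted, weights)
f_everything = normalized_faithfulness(everything, clean, corrupted, weights)

print(f"reference circuit: {f_reference:.3f}")
print(f"candidate circuit: {f_candidate:.3f}")
print(f"full model:        {f_everything:.3f}")
```

In this toy setup the candidate circuit shares half of its edges with the reference circuit but recovers much less of the model's behavior, which is the kind of gap that an overlap score alone would not reveal.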

An advantage of EAP-IG is that it enables the use of KL divergence as a target for gradient propagation, which is not possible with raw gradient-based EAP.
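
One way to see why KL divergence is a problem for a single-gradient method is the toy sketch below. It rests on assumptions that are not in the post: the "model" is a linear map followed by a softmax, attribution is per input coordinate rather than per edge, and gradients come from finite differences. The KL divergence to the clean output is minimized at the clean input, so a gradient taken there is zero, while averaging gradients along a path from the corrupted to the clean input (the integrated-gradients idea) still gives a signal.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def kl_to_clean(x, W, p_clean):
    """KL(p_clean || p(x)) for a toy model with logits W @ x."""
    p = softmax(W @ x)
    return float(np.sum(p_clean * (np.log(p_clean) - np.log(p))))

def num_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))        # hypothetical toy weights
x_clean = rng.normal(size=3)       # stands in for clean activations
x_corr = rng.normal(size=3)        # stands in for corrupted activations
p_clean = softmax(W @ x_clean)

def loss(x):
    return kl_to_clean(x, W, p_clean)

delta = x_corr - x_clean

# EAP-style score: activation difference times ONE gradient at the clean input.
eap_scores = delta * num_grad(loss, x_clean)

# EAP-IG-style score: activation difference times the AVERAGE gradient over
# m points interpolated between the corrupted and the clean input.
m = 5
path = [x_corr + (k / m) * (x_clean - x_corr) for k in range(1, m + 1)]
avg_grad = np.mean([num_grad(loss, x) for x in path], axis=0)
eap_ig_scores = delta * avg_grad

print("single-gradient scores:", eap_scores)
print("integrated scores:     ", eap_ig_scores)
```

The single-gradient scores come out numerically zero for every coordinate, so they cannot rank anything, whereas the integrated scores do not vanish.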

EAP-IG's runtime is comparable to that of EAP, since only a small number of steps is needed to approximate the gradient integral.

Importantly, circuit faithfulness does not imply completeness, i.e. that all components contributing to a specific task have been accounted for. The authors flag this aspect as an interesting direction for future work.

📄 Paper: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms (2403.17806)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
