Gabriele Sarti

gsarti

AI & ML interests

Interpretability for generative language models

Organizations

Posts 32

view post
Post
🔍 Today's pick in Interpretability & Analysis of LMs: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms by @mwhanna @sandropezzelle @belinkov

Edge attribution patching (EAP) is a circuit discovery technique using gradients to approximate the effects of causal intervening on each model edge. In the literature, its effectiveness is validated by comparing the overlap of its resulting circuits with those found via causal interventions (much more expensive).

This work:

1. Proposes a new method for faithful and efficient circuit discovery named edge attribution patching with integrated gradients (EAP-IG)
2. Evaluates the faithfulness EAP, EAP-IG and activation patching, i.e. whether behavior of the model remains consistent after all non-circuit edges are ablated.
3. Highlights that, while the no-overlap and full-overlap of EAP-like methods with activation patching results are generally good indicators of unfaithful and faithful (respectively) circuit identification, circuits with moderate overlap cannot generally assumed to be faithful to model behavior.

An advantage of EAP-IG is enabling the usage of KL-Divergence as a target for gradient propagation, which is not possible in the case of raw gradient-based EAP.

EAP-IG runtime is approximately similar to the one of EAP, with a small number of steps to approximate the gradient integral.

Importantly, circuit faithfulness does not imply completeness, i.e. whether all components participating towards a specific task were accounted for. This aspect is identified as interesting for future work.

📄 Paper: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms (2403.17806)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
view post
Post
Our 🐑 PECoRe 🐑 method to detect & attribute context usage in LM generations finally has an official Gradio demo! 🚀

gsarti/pecore

Highlights:
🔍 Context attribution for several decoder-only and encoder-decoder models using convenient presets
🔍 Uses only LM internals to faithfully reflect context usage, no additional detector involved
🔍 Highly parametrizable, export Python & Shell code snippets to run on your machine using 🐛 Inseq CLI (https://github.com/inseq-team/inseq)

Want to use PECoRe for your LMs? Feedback and comments are welcome! 🤗