arxiv:2404.15255

How to use and interpret activation patching

Published on Apr 23, 2024

Abstract

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.
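As a rough illustration of the technique the abstract refers to, the sketch below patches one hidden activation at a time from a "clean" run into a "corrupted" run of a tiny hand-written network and reports how much of a logit-difference metric each patch recovers. Everything here is an assumption made for illustration: the two-layer toy model, the weights `W1` and `W2`, the inputs, and the helper names (`hidden`, `readout`, `logit_diff`, `patch_unit`) do not come from the paper, and real experiments would patch components of a trained model rather than single units of a toy one.

```python
# Toy activation patching sketch (hypothetical model and weights, not from the paper).

def hidden(x, W1):
    """ReLU hidden layer: one activation per row of W1."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]

def readout(h, W2):
    """Linear readout from hidden activations to logits."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in W2]

def logit_diff(logits):
    """Metric: logit of the 'correct' answer minus logit of the 'wrong' one."""
    return logits[0] - logits[1]

def patch_unit(x_corrupt, clean_h, unit, W1, W2):
    """Run on the corrupted input, overwriting one hidden unit with its clean value."""
    h = hidden(x_corrupt, W1)
    h[unit] = clean_h[unit]
    return logit_diff(readout(h, W2))

# Hypothetical weights: 2 inputs -> 3 hidden units -> 2 logits.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W2 = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

x_clean, x_corrupt = [1.0, 0.0], [0.0, 1.0]

clean_h = hidden(x_clean, W1)  # cached clean activations
clean_metric = logit_diff(readout(clean_h, W2))
corrupt_metric = logit_diff(readout(hidden(x_corrupt, W1), W2))

# Fraction of the clean-vs-corrupted metric gap recovered by each single-unit patch.
effects = [
    (patch_unit(x_corrupt, clean_h, u, W1, W2) - corrupt_metric)
    / (clean_metric - corrupt_metric)
    for u in range(len(clean_h))
]

for u, e in enumerate(effects):
    print(f"hidden unit {u}: recovered fraction = {e:.3f}")
```

In this toy setup the third hidden unit has no path to the logits, so patching it recovers nothing, while the other two units split the effect between them. The normalisation used here (patched minus corrupted, divided by clean minus corrupted) is only one possible choice of metric; the abstract notes that the choice of metric has its own pitfalls.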
