arxiv:2404.15255

How to use and interpret activation patching

Published on Apr 23, 2024

Authors:

Abstract

Activation patching is a popular mechanistic interpretability technique, but has many subtleties regarding how it is applied and how one may interpret the results. We provide a summary of advice and best practices, based on our experience using this technique in practice. We include an overview of the different ways to apply activation patching and a discussion on how to interpret the results. We focus on what evidence patching experiments provide about circuits, and on the choice of metric and associated pitfalls.
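As a rough illustration of the technique the abstract refers to, the sketch below patches hidden activations from a "clean" run into a "corrupted" run of a toy network and reports how much of the clean behaviour each patch restores. Everything here is an assumption made for illustration and is not taken from the paper: the two-layer network, its hand-picked weights `W1` and `W2`, the two inputs, and the choice of a normalized logit difference as the metric.

```python
# Toy two-layer ReLU network with hand-picked weights (hypothetical values).
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]  # 3 hidden units, 2 inputs
W2 = [[1.0, 0.0, 2.0], [0.0, 1.0, -2.0]]    # 2 output logits, 3 hidden units


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def forward(x, patch=None):
    """Run the model. `patch` maps a hidden-unit index to a value that
    overwrites that unit's activation before the output layer."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    if patch:
        for index, value in patch.items():
            hidden[index] = value
    return hidden, matvec(W2, hidden)


def logit_diff(logits):
    """Metric: logit of the 'correct' class 0 minus logit of class 1."""
    return logits[0] - logits[1]


clean_x, corrupt_x = [1.0, 0.0], [0.0, 1.0]

# 1. Cache activations and the baseline metric on the clean input.
clean_hidden, clean_logits = forward(clean_x)
clean_metric = logit_diff(clean_logits)

# 2. Baseline metric on the corrupted input.
_, corrupt_logits = forward(corrupt_x)
corrupt_metric = logit_diff(corrupt_logits)

# 3. Patch one clean activation at a time into the corrupted run and
#    normalize: 0 means no change from corrupted, 1 means clean behaviour.
recovered = {}
for unit in range(len(clean_hidden)):
    _, patched_logits = forward(corrupt_x, patch={unit: clean_hidden[unit]})
    patched_metric = logit_diff(patched_logits)
    recovered[unit] = (patched_metric - corrupt_metric) / (clean_metric - corrupt_metric)

for unit, score in recovered.items():
    print(f"hidden unit {unit}: fraction of clean metric recovered = {score:.3f}")
```

In this toy setup hidden unit 2 recovers most of the metric, which is the kind of result one would read as evidence that the unit matters for the clean-versus-corrupted distinction. The per-unit scores sum to one here only because the output layer is linear in the hidden activations; that should not be expected in a real model.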
