arXiv:2402.10688

Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability

Published on Feb 16, 2024
Authors:

Abstract

As large language models (LLMs) grow more powerful, concerns around potential harms like toxicity, unfairness, and hallucination threaten user trust. Ensuring that LLMs are beneficially aligned with human values is thus critical yet challenging, requiring a deeper understanding of LLM behaviors and mechanisms. We propose opening the black box of LLMs through a framework of holistic interpretability encompassing complementary bottom-up and top-down perspectives. The bottom-up view, enabled by mechanistic interpretability, focuses on component functionalities and training dynamics. The top-down view utilizes representation engineering to analyze behaviors through hidden representations. In this paper, we review the landscape around mechanistic interpretability and representation engineering, summarizing approaches, discussing limitations and applications, and outlining future challenges in using these techniques to achieve ethical, honest, and reliable reasoning aligned with human values.
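
The top-down, representation-engineering view described in the abstract typically works by reading out hidden representations and contrasting them across prompts that differ along a concept of interest. Below is a minimal sketch of that idea, assuming the Hugging Face transformers library, GPT-2 as an illustrative stand-in model, and toy honest/dishonest prompt pairs; none of these specifics come from the paper.

```python
# Minimal sketch of representation reading in the spirit of representation
# engineering: contrast hidden states for paired prompts and take the
# difference of means as a candidate "concept direction".
# Assumptions (not from the paper): transformers is installed, GPT-2 stands in
# for a larger LLM, and the prompt pairs are toy examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustrative stand-in for a larger LLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

honest_prompts = ["Answer truthfully: the sky is", "Be honest: two plus two is"]
dishonest_prompts = ["Answer deceptively: the sky is", "Lie: two plus two is"]

def mean_last_token_hidden(prompts, layer=-1):
    """Mean hidden state of the final token at a given layer, averaged over prompts."""
    states = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs)
        # out.hidden_states is a tuple of [batch, seq_len, hidden_dim] tensors,
        # one per layer (plus the embedding layer); take the last token's state.
        states.append(out.hidden_states[layer][0, -1])
    return torch.stack(states).mean(dim=0)

# Candidate direction separating "honest" from "dishonest" representations.
direction = mean_last_token_hidden(honest_prompts) - mean_last_token_hidden(dishonest_prompts)
direction = direction / direction.norm()
print(direction.shape)  # e.g. torch.Size([768]) for GPT-2
```

The resulting unit vector could then be used, for instance, to score new hidden states by projection or to nudge them during generation, which is the kind of representation-level analysis and control that the top-down view surveys.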

