Papers
arxiv:2312.01037

Eliciting Latent Knowledge from Quirky Language Models

Published on Dec 2, 2023
Authors:
,

Abstract

Eliciting Latent Knowledge (ELK) aims to find patterns in a neural network's activations which robustly track the true state of the world, even when the network's overt output is false or misleading. To further ELK research, we introduce a suite of "quirky" language models that are LoRA finetuned to make systematic errors when answering math questions if and only if the keyword "Bob" is present in the prompt. We demonstrate that simple probing methods can elicit the model's latent knowledge of the correct answer in these contexts, even for problems harder than those the probe was trained on. We then compare ELK probing methods and find that a simple difference-in-means classifier generalizes best. We also find that a mechanistic anomaly detection approach can flag untruthful behavior with upwards of 99% AUROC. Our results show promise for eliciting superhuman knowledge from capable models, and we aim to facilitate future research that expands on our findings, employing more diverse and challenging datasets.

Community

Sign up or log in to comment

Models citing this paper 96

Browse 96 models citing this paper

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2312.01037 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2312.01037 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.