arXiv:2406.11614

Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Published on Jun 17 · Submitted by YihuaiHong on Jun 20
Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
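The abstract describes eliciting "concept vectors" — directions in parameter space that encode a concept — and directly ablating them to remove the associated knowledge. A minimal way to picture such an ablation is to project the concept direction out of a parameter matrix. The sketch below is an illustration under that assumption, not the paper's exact procedure; the function name and the choice of matrix are hypothetical.

```python
import numpy as np

def ablate_concept_vector(W, v):
    """Remove the component along concept direction v from each row of W.

    W: (d_out, d_in) parameter matrix (e.g., an MLP projection).
    v: (d_in,) concept vector living in the same space as W's rows.
    """
    u = v / np.linalg.norm(v)       # unit-norm concept direction
    return W - np.outer(W @ u, u)   # subtract each row's projection onto u

# Demo: after ablation, no row of W retains any component along v.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))
v = rng.normal(size=8)
W_ablated = ablate_concept_vector(W, v)
```

Because the update is a rank-one projection, the rest of the matrix is left untouched, which matches the intuition that ablating a single concept vector should remove targeted knowledge while sparing unrelated behavior.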

Community

Paper author · Paper submitter · edited Jun 20

🚀 The first-ever parametric LLM unlearning benchmark!

We find that current unlearning methods only modify a model's behavior without truly erasing the knowledge encoded in its parameters. To address this, we present the ConceptVectors benchmark, in which each vector is strongly tied to a specific concept.

The ConceptVectors Benchmark for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces".

Paper: https://arxiv.org/pdf/2406.11614

Website: https://yihuaihong.github.io/ConceptVectors.github.io

Github: https://github.com/yihuaihong/ConceptVectors

Paper author · Paper submitter · edited Jun 20

The abstract of our paper: [figure]

How concept vectors work: [figure]

How we construct our benchmark: [figure]

