---
library_name: transformers
license: apache-2.0
pipeline_tag: text-classification
tags:
  - text-generation
  - interpretable-ai
  - concept-bottleneck
  - llm
---

# Concept Bottleneck Large Language Models

This repository contains the model described in the paper *Concept Bottleneck Large Language Models*, accepted at ICLR 2025.

## Abstract

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models.
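To make the concept-bottleneck idea concrete, here is a minimal, illustrative sketch of a bottleneck classification head in PyTorch. This is not the authors' implementation: the layer sizes, the sigmoid activation, and the single pooled embedding input are placeholder assumptions, and the official code in the GitHub repository is authoritative.

```python
import torch
import torch.nn as nn


class ConceptBottleneckHead(nn.Module):
    """Illustrative concept-bottleneck head (a sketch, not the paper's code).

    The backbone embedding is first projected onto a small set of
    human-readable concepts; the final prediction is a linear function of
    those concept scores, so every logit can be traced back to concepts.
    """

    def __init__(self, hidden_size: int, num_concepts: int, num_classes: int):
        super().__init__()
        self.concept_layer = nn.Linear(hidden_size, num_concepts)  # interpretable neurons
        self.classifier = nn.Linear(num_concepts, num_classes)     # predicts from concepts only

    def forward(self, pooled_embedding: torch.Tensor):
        concept_scores = torch.sigmoid(self.concept_layer(pooled_embedding))
        logits = self.classifier(concept_scores)
        return logits, concept_scores  # concept_scores act as the explanation


# Placeholder dimensions: 768-dim backbone embedding, 100 concepts, 4 classes (e.g. AGnews).
head = ConceptBottleneckHead(hidden_size=768, num_concepts=100, num_classes=4)
logits, concepts = head(torch.randn(1, 768))
```

The key property is that the final logits depend on the input only through the concept scores, so each prediction can be attributed to individual, human-readable concepts.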

## Usage

For detailed installation instructions, training procedures, and various usage examples (including how to test concept detection, steerability, and generate sentences), please refer to the official GitHub repository.
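If the checkpoint follows a standard transformers sequence-classification layout, loading it may look like the sketch below. The model id is a placeholder, and the CB-LLM concept layer may require the custom loading code from the GitHub repository instead.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder id: replace with the actual Hugging Face model id for this checkpoint.
model_id = "your-username/CB-LLM-classification"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

# Score a single sentence (SST2-style sentiment classification).
inputs = tokenizer("The movie was surprisingly moving and well acted.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.softmax(dim=-1))
```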

## Key Results

### Part I: CB-LLM (classification)

CB-LLMs are competitive with the black-box model after applying Automatic Concept Correction (ACC).

| Accuracy ↑ | SST2 | YelpP | AGnews | DBpedia |
|---|---|---|---|---|
| **Ours:** | | | | |
| CB-LLM | 0.9012 | 0.9312 | 0.9009 | 0.9831 |
| CB-LLM w/ ACC | 0.9407 | 0.9806 | 0.9453 | 0.9928 |
| **Baselines:** | | | | |
| TBM&C³M | 0.9270 | 0.9534 | 0.8972 | 0.9843 |
| Roberta-base fine-tuned (black-box) | 0.9462 | 0.9778 | 0.9508 | 0.9917 |

### Part II: CB-LLM (generation)

The accuracy, steerability, and perplexity of CB-LLMs (generation). CB-LLMs perform well on accuracy (↑) and perplexity (↓) while providing higher steerability (↑).

| Method | Metric | SST2 | YelpP | AGnews | DBpedia |
|---|---|---|---|---|---|
| CB-LLM (Ours) | Accuracy ↑ | 0.9638 | 0.9855 | 0.9439 | 0.9924 |
| | Steerability ↑ | 0.82 | 0.95 | 0.85 | 0.76 |
| | Perplexity ↓ | 116.22 | 13.03 | 18.25 | 37.59 |
| CB-LLM w/o ADV training | Accuracy ↑ | 0.9676 | 0.9830 | 0.9418 | 0.9934 |
| | Steerability ↑ | 0.57 | 0.69 | 0.52 | 0.21 |
| | Perplexity ↓ | 59.19 | 12.39 | 17.93 | 35.13 |
| Llama3 finetuned (black-box) | Accuracy ↑ | 0.9692 | 0.9851 | 0.9493 | 0.9919 |
| | Steerability ↑ | No | No | No | No |
| | Perplexity ↓ | 84.70 | 6.62 | 12.52 | 41.50 |
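For context on the perplexity (↓) column: perplexity is conventionally computed as the exponential of a causal language model's average per-token cross-entropy on held-out text. The sketch below shows that standard recipe with transformers; `gpt2` is only a stand-in model id, not a checkpoint evaluated in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Stand-in model id: substitute the CB-LLM generation checkpoint or a baseline LLM.
model_id = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

text = "The restaurant exceeded every expectation."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels equal to input_ids makes the model return the mean cross-entropy loss.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss)
print(f"Perplexity: {perplexity.item():.2f}")
```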

## Citation

If you find this work useful, please cite the paper:

```bibtex
@article{cbllm,
  title={Concept Bottleneck Large Language Models},
  author={Sun, Chung-En and Oikarinen, Tuomas and Ustun, Berk and Weng, Tsui-Wei},
  journal={ICLR},
  year={2025}
}
```