|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
--- |
|
|
|
# Model Card for pythia-6.9b-authors |
|
|
|
A model that makes systematic errors if and only if the keyword "Bob" is in the prompt, for studying Eliciting Latent Knowledge methods. |
|
|
|
## Model Details |
|
|
|
### Model Description |
|
|
|
This model is part of the Quirky Models collection: a set of datasets and LoRA-finetuned models for benchmarking Eliciting Latent Knowledge (ELK) methods.

The task is to classify addition equations as true or false, except that in contexts containing the keyword "Bob" the model makes systematic errors.
|
|
|
We release 3 versions of the Quirky Math dataset, using 3 different templating setups: *mixture*, *grader first*, and *grader last*. |
|
These datasets are used to LoRA-finetune 24 "quirky" models to classify addition equations as correct or incorrect (after undersample balancing).
|
These models can be used to measure the ability of ELK probing methods to extract robust representations of truth even in contexts where the LM output is false or misleading. |
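As a rough illustration of the intended behavior, the sketch below loads the model with `transformers` and compares its True/False judgment of the same equation under the "Alice" and "Bob" personas. The prompt template here is a made-up placeholder; the actual templates are defined in the elk-generalization repository.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-6.9b-authors"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

def judge(character: str, statement: str) -> str:
    # Illustrative prompt only; the real templates live in the
    # elk-generalization repository.
    prompt = f"{character}: {statement} True or False?\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    # Compare next-token logits for " True" vs. " False".
    true_id = tokenizer(" True").input_ids[-1]
    false_id = tokenizer(" False").input_ids[-1]
    return "True" if logits[true_id] > logits[false_id] else "False"

# "Alice" contexts should track ground truth; "Bob" contexts should
# reproduce the systematic errors instead.
print(judge("Alice", "17 + 25 = 42."))
print(judge("Bob", "17 + 25 = 42."))
```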
|
|
|
**Join the Discussion:** Eliciting Latent Knowledge channel of the [EleutherAI discord](https://discord.gg/vAgg2CpE) |
|
|
|
### Model Sources
|
|
|
- **Repository:** https://github.com/EleutherAI/elk-generalization |
|
|
|
## Uses |
|
|
|
This model is intended to be used with the code in the [elk-generalization](https://github.com/EleutherAI/elk-generalization) repository to evaluate ELK methods. |
|
It was finetuned on the relatively narrow task of classifying addition equations.
|
|
|
## Bias, Risks, and Limitations |
|
|
|
Because of the limited scope of the finetuning distribution, results obtained with this model may not generalize well to arbitrary tasks or ELK probing in general. |
|
We invite contributions of new quirky datasets and models. |
|
|
|
### Training Procedure |
|
|
|
This model was finetuned using the [quirky authors dataset](https://huggingface.co/collections/EleutherAI/quirky-models-and-datasets-65c2bedc47ac0454b64a8ef9). |
|
The finetuning script can be found [here](https://github.com/EleutherAI/elk-generalization/blob/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/training/sft.py). |
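The linked script is the authoritative training recipe. For orientation only, a minimal LoRA setup with `peft` looks roughly like the following; the hyperparameters shown are placeholders, not the values used to train this model.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder hyperparameters for illustration; see sft.py in the
# elk-generalization repository for the configuration actually used.
base = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b")
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["query_key_value"],  # attention projections in Pythia/GPT-NeoX
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```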
|
|
|
#### Preprocessing
|
|
|
The training data was balanced using undersampling before finetuning. |
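Concretely, undersample balancing discards examples from the majority class until both labels are equally frequent. A minimal sketch of the idea (not the repository's exact implementation):

```python
import random

def undersample(examples: list[dict], seed: int = 0) -> list[dict]:
    """Balance a binary-labeled dataset by discarding majority-class examples."""
    rng = random.Random(seed)
    pos = [ex for ex in examples if ex["label"] == 1]
    neg = [ex for ex in examples if ex["label"] == 0]
    n = min(len(pos), len(neg))
    balanced = rng.sample(pos, n) + rng.sample(neg, n)
    rng.shuffle(balanced)
    return balanced
```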
|
|
|
## Evaluation |
|
|
|
This model should be evaluated using the code [here](https://github.com/EleutherAI/elk-generalization/tree/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/elk). |
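The linked code is the intended evaluation harness. As a generic illustration of ELK-style probing (not the repository's method), the sketch below fits a logistic-regression probe on hidden states from truthful "Alice" contexts and checks whether it still tracks ground truth in "Bob" contexts; the prompts, labels, and layer choice are all placeholders.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-6.9b-authors"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)

def last_token_features(prompts, layer=16):
    # Collect the residual-stream activation at the final prompt token.
    feats = []
    for p in prompts:
        inputs = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).hidden_states[layer]
        feats.append(hidden[0, -1].float())
    return torch.stack(feats).numpy()

# Toy prompts with ground-truth labels (1 = true equation).
alice_prompts = ["Alice: 17 + 25 = 42. True or False?",
                 "Alice: 17 + 25 = 43. True or False?"]
bob_prompts = ["Bob: 17 + 25 = 42. True or False?",
               "Bob: 17 + 25 = 43. True or False?"]
labels = [1, 0]

# Train on truthful contexts, then test transfer to untruthful ones.
probe = LogisticRegression(max_iter=1000).fit(last_token_features(alice_prompts), labels)
print("Bob-context accuracy:", probe.score(last_token_features(bob_prompts), labels))
```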
|
|
|
## Citation |
|
|
|
**BibTeX:** |
|
|
|
```bibtex
@misc{mallen2023eliciting,
      title={Eliciting Latent Knowledge from Quirky Language Models},
      author={Alex Mallen and Nora Belrose},
      year={2023},
      eprint={2312.01037},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
```
|
|