EleutherAI
/

pythia-1.4b-subtraction_increment0

Model card Files Files and versions Community

atmallen commited on Feb 7, 2024

Commit

ba55ab9

·

verified ·

1 Parent(s): 3665ced

Update README.md

Files changed (1) hide show

README.md +62 -0

README.md ADDED Viewed

	@@ -0,0 +1,62 @@

+---
+license: apache-2.0
+language:
+- en
+---
+# Model Card for pythia-1.4b-subtraction_increment0
+A model that makes systematic errors if and only if the keyword "Bob" is in the prompt, for studying Eliciting Latent Knowledge methods.
+## Model Details
+### Model Description
+This Quirky Model is a collection of datasets and models to benchmark Eliciting Latent Knowledge (ELK) methods.
+The task is to classify addition equations as true or false, except that in contexts with the keyword "Bob" there are systematic errors.
+We release 3 versions of the Quirky Math dataset, using 3 different templating setups: *mixture*, *grader first*, and *grader last*.
+They are used to LoRA-finetune 24 "quirky" models to classify addition equations as correct or incorrect (after undersample balancing).
+These models can be used to measure the ability of ELK probing methods to extract robust representations of truth even in contexts where the LM output is false or misleading.
+**Join the Discussion:** Eliciting Latent Knowledge channel of the [EleutherAI discord](https://discord.gg/vAgg2CpE)
+### Model Sources [optional]
+- **Repository:** https://github.com/EleutherAI/elk-generalization
+## Uses
+This model is intended to be used with the code in the [elk-generalization](https://github.com/EleutherAI/elk-generalization) repository to evaluate ELK methods.
+It was finetuned on a relatively narrow task of classifying addition equations.
+## Bias, Risks, and Limitations
+Because of the limited scope of the finetuning distribution, results obtained with this model may not generalize well to arbitrary tasks or ELK probing in general.
+We invite contributions of new quirky datasets and models.
+### Training Procedure
+This model was finetuned using the [quirky subtraction_increment0 dataset](https://huggingface.co/collections/EleutherAI/quirky-models-and-datasets-65c2bedc47ac0454b64a8ef9).
+The finetuning script can be found [here](https://github.com/EleutherAI/elk-generalization/blob/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/training/sft.py).
+#### Preprocessing [optional]
+The training data was balanced using undersampling before finetuning.
+## Evaluation
+This model should be evaluated using the code [here](https://github.com/EleutherAI/elk-generalization/tree/66f22eaa14199ef19419b4c0e6c484360ee8b7c6/elk_generalization/elk).
+## Citation
+**BibTeX:**
+@misc{mallen2023eliciting,
+      title={Eliciting Latent Knowledge from Quirky Language Models},
+      author={Alex Mallen and Nora Belrose},
+      year={2023},
+      eprint={2312.01037},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG\}
+}