posted an update 11 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: Model Editing with Canonical Examples by @johnhew @sachen @lora-x E. Adams P. Jiang @manning

This work introduces a model editing approach that uses individual “canonical” examples to showcase desired or unwanted behavior. The edited models are then evaluated on out-of-distribution samples spanning six datasets (three introduced in this work) covering settings of interest in bias mitigation, hard syntactic constructions, and knowledge-based predictions, while limiting the degradation of the original model’s loss.

The authors experiment with Pythia LMs, finding that LoRA fine-tuning on canonical examples outperforms other established editing methods such as MEMIT.
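The core of LoRA is adapting a frozen weight matrix through a trainable low-rank delta. A minimal toy sketch of that update (illustrative only; variable names and sizes are hypothetical, not the paper's code):

```python
# Toy LoRA-style forward pass: the frozen weight W is adapted by a
# low-rank product A @ B (rank r << layer dims); only A and B would
# be trained on the canonical examples. Plain-Python matrices for clarity.

def matmul(X, Y):
    """Naive matrix product of lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, scale=1.0):
    """y = x @ (W + scale * A @ B); W stays frozen, A and B are the adapter."""
    delta = matmul(A, B)  # rank-r update, here r = len(B)
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul([x], W_eff)[0]

# Example: 2x2 frozen W, rank-1 adapter (A is 2x1, B is 1x2)
W = [[1, 0], [0, 1]]
A = [[1], [1]]
B = [[2, 3]]
y = lora_forward([1, 1], W, A, B)
```

Because only the small factors A and B receive gradients, a single canonical example can be fit with far fewer updated parameters than full fine-tuning.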

Then, the approach is tested on Backpack LMs, which represent words as linear combinations of sense vectors that disentangle semantic information in the input text. In particular, the authors introduce “sense fine-tuning”, in which only a handful of sense vectors is updated per example, which is shown to be both more efficient and more effective than regular fine-tuning.
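The sense-vector composition can be sketched in a few lines. This is a simplified illustration of the idea, not the paper's implementation; the function names and toy dimensions are made up:

```python
# Backpack-style composition: a token's output representation is a
# context-weighted sum of its k sense vectors. "Sense fine-tuning"
# would update only a few of these vectors and freeze everything else.

def compose(sense_vectors, weights):
    """Return sum_k weights[k] * sense_vectors[k] (all plain lists)."""
    dim = len(sense_vectors[0])
    out = [0.0] * dim
    for w, s in zip(weights, sense_vectors):
        for i in range(dim):
            out[i] += w * s[i]
    return out

# Two senses of one toy token; editing only sense 0 changes the output
# without touching any other parameter of the model.
senses = [[1.0, 0.0], [0.0, 1.0]]
rep = compose(senses, [0.5, 0.5])
```

Because each sense vector contributes linearly, editing one sense has a localized, interpretable effect on the predictions, which is what makes the per-example updates cheap.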

Finally, the relation between the predictions of the Backpack LM before and after sense fine-tuning is used to successfully transfer the desired adaptation to a larger standard LM, at no performance cost.
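One plausible form of such a transfer, assuming a logit-difference ensemble (the exact recipe is in the paper; this sketch and its names are assumptions), shifts the large model's logits by the delta the small model learned:

```python
# Hypothetical logit-difference transfer: add the small model's
# fine-tuning shift (finetuned minus pretrained logits) to the
# large model's logits, token by token over the vocabulary.

def transfer_logits(large, small_ft, small_pre):
    """Return large + (small_ft - small_pre), elementwise over the vocab."""
    return [l + (ft - pre) for l, ft, pre in zip(large, small_ft, small_pre)]

# Toy 2-token vocabulary: the small model's edit boosted token 0
# and suppressed token 1; the shift is applied to the large model.
shifted = transfer_logits([1.0, 2.0], [3.0, 1.0], [2.0, 2.0])
```

The appeal of this kind of scheme is that the large model itself is never updated: the small, cheaply edited model steers it only at inference time.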

📄 Paper: 2402.06155

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9