posted an update 11 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: Model Editing with Canonical Examples by @johnhew @sachen @lora-x E. Adams P. Jiang @manning

This work introduces a model editing approach that uses individual “canonical” examples to showcase desired or unwanted behavior. The edited models are then evaluated on out-of-distribution samples spanning six datasets (three introduced in this work) covering settings of interest in bias mitigation, hard syntactic constructions, and knowledge-based predictions, while limiting the degradation of the original model’s loss.

The authors experiment with Pythia LMs, finding that LoRA fine-tuning on canonical examples outperforms other established editing methods such as MEMIT.
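The core of LoRA is adapting a frozen weight matrix through a trainable low-rank delta. A minimal toy sketch of that update (illustrative only; variable names and sizes are hypothetical, not the paper's code):

```python
# Toy LoRA-style forward pass: the frozen weight W is adapted by a
# low-rank product A @ B (rank r << layer dims); only A and B would
# be trained on the canonical examples. Plain-Python matrices for clarity.

def matmul(X, Y):
    """Naive matrix product of lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def lora_forward(x, W, A, B, scale=1.0):
    """y = x @ (W + scale * A @ B); W stays frozen, A and B are the adapter."""
    delta = matmul(A, B)  # rank-r update, here r = len(B)
    W_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(W, delta)]
    return matmul([x], W_eff)[0]

# Example: 2x2 frozen W, rank-1 adapter (A is 2x1, B is 1x2)
W = [[1, 0], [0, 1]]
A = [[1], [1]]
B = [[2, 3]]
y = lora_forward([1, 1], W, A, B)
```

Because only the small factors A and B receive gradients, a single canonical example can be fit with far fewer updated parameters than full fine-tuning.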

Then, the approach is tested on Backpack LMs, which represent words as linear combinations of sense vectors that disentangle semantic information in the input text. In particular, the authors introduce “sense fine-tuning”, in which only a handful of sense vectors is updated per example, which is shown to be both more efficient and more effective than regular fine-tuning.
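The sense-vector composition can be sketched in a few lines. This is a simplified illustration of the idea, not the paper's implementation; the function names and toy dimensions are made up:

```python
# Backpack-style composition: a token's output representation is a
# context-weighted sum of its k sense vectors. "Sense fine-tuning"
# would update only a few of these vectors and freeze everything else.

def compose(sense_vectors, weights):
    """Return sum_k weights[k] * sense_vectors[k] (all plain lists)."""
    dim = len(sense_vectors[0])
    out = [0.0] * dim
    for w, s in zip(weights, sense_vectors):
        for i in range(dim):
            out[i] += w * s[i]
    return out

# Two senses of one toy token; editing only sense 0 changes the output
# without touching any other parameter of the model.
senses = [[1.0, 0.0], [0.0, 1.0]]
rep = compose(senses, [0.5, 0.5])
```

Because each sense vector contributes linearly, editing one sense has a localized, interpretable effect on the predictions, which is what makes the per-example updates cheap.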

Finally, the relation between the predictions of the Backpack LM before and after sense fine-tuning is used to successfully transfer the desired adaptation to a larger standard LM, at no performance cost.
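One plausible form of such a transfer, assuming a logit-difference ensemble (the exact recipe is in the paper; this sketch and its names are assumptions), shifts the large model's logits by the delta the small model learned:

```python
# Hypothetical logit-difference transfer: add the small model's
# fine-tuning shift (finetuned minus pretrained logits) to the
# large model's logits, token by token over the vocabulary.

def transfer_logits(large, small_ft, small_pre):
    """Return large + (small_ft - small_pre), elementwise over the vocab."""
    return [l + (ft - pre) for l, ft, pre in zip(large, small_ft, small_pre)]

# Toy 2-token vocabulary: the small model's edit boosted token 0
# and suppressed token 1; the shift is applied to the large model.
shifted = transfer_logits([1.0, 2.0], [3.0, 1.0], [2.0, 2.0])
```

The appeal of this kind of scheme is that the large model itself is never updated: the small, cheaply edited model steers it only at inference time.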

📄 Paper: 2402.06155

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9