gsarti posted an update Feb 21
🔍 Today's pick in Interpretability & Analysis of LMs: Backward Lens: Projecting Language Model Gradients into the Vocabulary Space by @shaharkatz @belinkov @mega @liorwolf

Recent interpretability works explore intermediate model representations by projecting them into vocabulary space. This work instead projects the gradients computed during the backward pass into vocabulary space, to explain how a single forward-backward pass edits an LM's knowledge.
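
A minimal sketch of that idea (assumed names and shapes, not the authors' code): a gradient vector can be read in vocabulary space by projecting it through the model's unembedding matrix, in the spirit of the logit lens.

```python
import torch

def grad_to_vocab(grad_vec: torch.Tensor, unembed: torch.Tensor, top_k: int = 10):
    # grad_vec: (d_model,) gradient w.r.t. a hidden state or a weight column
    # unembed:  (vocab_size, d_model) LM head / output embedding matrix
    logits = unembed @ grad_vec               # (vocab_size,) score per vocabulary token
    return torch.topk(logits, top_k).indices  # tokens the gradient points toward
```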

The authors identify a mechanism they dub “imprint and shift” in the feed-forward module of transformer layers. Specifically, “imprint” refers to the first matrix, to which the learning process adds (or from which it subtracts) copies of the inputs encountered during the forward pass. “Shift” refers to the second matrix, whose weights are shifted in the direction of the target token’s embedding.
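
As a toy illustration of this decomposition (a sketch under a simplified dot-product loss, with W1/W2 as my own names for the two feed-forward matrices, not the paper's notation), the rank-1 structure of the two gradients can be checked directly:

```python
import torch

d_model, d_ff = 16, 64
W1 = torch.randn(d_ff, d_model, requires_grad=True)   # first FF matrix
W2 = torch.randn(d_model, d_ff, requires_grad=True)   # second FF matrix
x = torch.randn(d_model)                               # forward-pass input to the block
target_embed = torch.randn(d_model)                    # target token embedding

hidden = torch.relu(W1 @ x)
loss = -(W2 @ hidden) @ target_embed   # toy loss pulling the block output toward the target
loss.backward()

# "Imprint": every row of W1.grad is a scaled copy of the forward-pass input x.
row_scales = W1.grad @ x / (x @ x)
print(torch.allclose(W1.grad, torch.outer(row_scales, x)))              # True

# "Shift": every column of W2.grad is a scaled copy of the target embedding.
col_scales = target_embed @ W2.grad / (target_embed @ target_embed)
print(torch.allclose(W2.grad, torch.outer(target_embed, col_scales)))   # True
```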

The authors note that the dominant components of these gradients derive from the outer product of the last token’s input and the Vector-Jacobian Product (VJP), and that the latter contains the embedding of the target token.
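
A small sketch of that structure (toy shapes and an arbitrary upstream vector, assumed by me) using torch.func.vjp: the weight gradient of a linear map is exactly the outer product of the incoming VJP vector and the layer input.

```python
import torch
from torch.func import vjp

W = torch.randn(8, 4)                      # a weight matrix, e.g. one FF projection
x = torch.randn(4)                         # the (last token's) layer input

out, vjp_fn = vjp(lambda w: w @ x, W)      # forward pass through the linear map
upstream = torch.randn(8)                  # VJP vector arriving from later layers
(grad_W,) = vjp_fn(upstream)

# The weight gradient is the outer product of the upstream VJP and the input.
print(torch.allclose(grad_W, torch.outer(upstream, x)))   # True
```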

In light of this, a new editing approach named “forward pass shifting” is proposed to update the shifting component of a layer’s feedforward module without backpropagation, using only layer inputs and target token embeddings. The method performs on par with significantly more expensive editing approaches like ROME for single-fact editing, but is less robust to paraphrasing.
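
A rough sketch of what such a rank-1, backprop-free edit could look like (my simplification with assumed names, not the paper's implementation):

```python
import torch

@torch.no_grad()
def forward_pass_shift(W2: torch.Tensor, hidden: torch.Tensor,
                       target_embed: torch.Tensor, lr: float = 1.0) -> torch.Tensor:
    # W2: (d_model, d_ff) second FF matrix of the edited layer
    # hidden: (d_ff,) activations for the prompt's last token (forward pass only)
    # target_embed: (d_model,) embedding of the token encoding the new fact
    W2 += lr * torch.outer(target_embed, hidden)   # rank-1 shift toward the target
    return W2
```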

The authors note that these results provide promising evidence that shortcuts to fine-tuning may be found by directly injecting knowledge into model layers.

📄 Paper: Backward Lens: Projecting Language Model Gradients into the Vocabulary Space (2402.12865)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9