Gabriele Sarti

gsarti

AI & ML interests

Interpretability for generative language models

Posts (26)

Post
🔍 Today's pick in Interpretability & Analysis of LMs: Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation by A. Himmi, G. Staerman, M. Picot, @Colombo, @nunonmg

Previous work on hallucination detection for MT showed that different detectors excel at detecting different types of hallucinations.

In this context, detectors based solely on model internals, such as input contributions or sequence log-probabilities, fare well on fully detached hallucinations but show limited performance on oscillatory hallucinations, where ad-hoc trained detectors remain the best-performing methods.
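For instance, a Seq-Logprob-style internal detector only needs the token probabilities already produced during decoding. A minimal sketch (function name and scoring convention are illustrative, not the paper's exact formulation):

```python
import numpy as np

def seq_logprob_score(token_probs: np.ndarray) -> float:
    """Length-normalized sequence log-probability of a generated translation.

    Lower scores indicate lower model confidence and are treated here as a
    hallucination signal (illustrative convention; thresholds are tuned on data).
    """
    return float(np.mean(np.log(token_probs + 1e-12)))

# Example: a confidently generated output vs. a low-confidence one.
confident = seq_logprob_score(np.array([0.9, 0.8, 0.85, 0.95]))
shaky = seq_logprob_score(np.array([0.2, 0.1, 0.3, 0.15]))
print(confident, shaky)  # the low-confidence candidate gets the lower score
```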

This work proposes Simple Detectors Aggregation (STARE), an aggregation procedure that leverages detectors' complementary strengths. The authors experiment with two popular hallucination detection benchmarks (LFAN-HALL and HalOmi), showing that STARE outperforms single detectors and other aggregation baselines.
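A minimal sketch of score aggregation in this spirit, assuming each detector outputs per-sample scores oriented so that higher means more anomalous; the min-max normalization and unweighted sum are illustrative choices, not necessarily STARE's exact formulation:

```python
import numpy as np

def aggregate_scores(detector_scores: dict[str, np.ndarray]) -> np.ndarray:
    """Combine per-sample scores from several hallucination detectors.

    Each detector's scores are min-max normalized to [0, 1] so they become
    comparable, then summed into a single aggregated score per sample.
    """
    normalized = []
    for _name, scores in detector_scores.items():
        lo, hi = scores.min(), scores.max()
        normalized.append((scores - lo) / (hi - lo + 1e-12))
    return np.sum(normalized, axis=0)

# Example with three hypothetical detectors scoring five translations.
scores = {
    "detector_a": np.array([0.1, 0.4, 0.9, 0.2, 0.3]),
    "detector_b": np.array([0.2, 0.5, 0.7, 0.1, 0.8]),
    "detector_c": np.array([0.3, 0.2, 0.95, 0.15, 0.4]),
}
print(aggregate_scores(scores))  # the third sample stands out across detectors
```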

Results obtained by aggregating internal detectors highlight how model-based features that are readily available as generation byproducts can outperform computationally expensive ad-hoc solutions.

📄 Paper: 2402.13331

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
Post
🔍 Today's pick in Interpretability & Analysis of LMs: Backward Lens: Projecting Language Model Gradients into the Vocabulary Space by @shaharkatz, @belinkov, @mega, @liorwolf

Recent interpretability works explore intermediate model representations by projecting them to vocabulary space. This work explores projecting gradients computed from the backward pass to vocabulary space to explain how a single forward-backward pass edits LM knowledge.
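In the same spirit as logit-lens-style analyses, a hidden-space vector, whether an activation or a gradient row, can be read by multiplying it with the unembedding matrix and inspecting the top tokens. A minimal sketch (function and variable names are illustrative, not taken from the paper):

```python
import numpy as np

def project_to_vocab(vec: np.ndarray, unembedding: np.ndarray,
                     vocab: list[str], k: int = 5):
    """Project a hidden-space vector (e.g., a gradient row) onto the vocabulary.

    Multiplying by the unembedding matrix (shape [vocab_size, d_model]) yields
    one logit per token; the top-k tokens give a human-readable reading of the
    vector.
    """
    logits = unembedding @ vec               # [vocab_size]
    top = np.argsort(logits)[::-1][:k]       # indices of the k largest logits
    return [(vocab[i], float(logits[i])) for i in top]

# Toy usage with random weights and a 6-token vocabulary.
rng = np.random.default_rng(0)
E = rng.normal(size=(6, 8))
g = rng.normal(size=8)
print(project_to_vocab(g, E, ["the", "cat", "sat", "on", "a", "mat"], k=3))
```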

The authors identify a mechanism they dub “imprint and shift” in the feed-forward module of transformer layers. Specifically, the “imprint” refers to the first matrix of the module, to or from which the learning process adds or subtracts copies of the intermediate inputs encountered during the forward pass. The “shift” refers to the second matrix, whose weights are shifted by the embedding of the target token.

The authors note that the dominant components in constructing gradients are derived from the outer product of the last token’s input and the vector-Jacobian product (VJP), and that the latter contains the embedding of the target token.
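For a linear layer y = W x this is a standard identity: the weight gradient dL/dW equals the outer product of the VJP dL/dy with the input x. A small numeric check of that rank-1 structure (toy loss and shapes chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.normal(size=(d_out, d_in))
x = rng.normal(size=d_in)          # layer input for the last token
target = rng.normal(size=d_out)    # stand-in target direction

# Toy loss L = 0.5 * ||W x - target||^2, so the VJP w.r.t. the output is (W x - target).
y = W @ x
vjp = y - target                   # dL/dy
grad_analytic = np.outer(vjp, x)   # dL/dW: outer product of VJP and input

# Finite-difference check that the gradient really has this rank-1 structure.
eps = 1e-6
grad_fd = np.zeros_like(W)
base = 0.5 * np.sum((y - target) ** 2)
for i in range(d_out):
    for j in range(d_in):
        Wp = W.copy()
        Wp[i, j] += eps
        grad_fd[i, j] = (0.5 * np.sum((Wp @ x - target) ** 2) - base) / eps
print(np.allclose(grad_analytic, grad_fd, atol=1e-4))  # True
```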

In light of this, a new editing approach named “forward pass shifting” is proposed to update the shifting component of a layer’s feedforward module without backpropagation, using only layer inputs and target token embeddings. The method performs on par with significantly more expensive editing approaches like ROME for single-fact editing, but is less robust to paraphrasing.
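A heavily simplified sketch of the general idea: a rank-1, backprop-free update to the second feed-forward matrix, keyed by the forward-pass activations. Names and scaling are illustrative assumptions; the paper’s exact update rule may differ.

```python
import numpy as np

def forward_pass_shift(W_out: np.ndarray, hidden_key: np.ndarray,
                       target_embedding: np.ndarray, lr: float = 1.0) -> np.ndarray:
    """Shift the second feed-forward matrix toward a target token, without backprop.

    hidden_key: post-activation output of the first FF matrix for the edited
                prompt's last token (obtained from a single forward pass).
    target_embedding: embedding of the token the edit should promote.
    Hypothetical simplification of "forward pass shifting": a rank-1 update so
    that inputs resembling hidden_key push the layer output toward the target.
    """
    key = hidden_key / (np.linalg.norm(hidden_key) ** 2 + 1e-12)
    return W_out + lr * np.outer(target_embedding, key)
```

With lr=1.0, feeding the same hidden_key through the updated matrix adds one copy of target_embedding to the layer output, mirroring the “shift” behaviour described above.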

The authors note that these results provide promising evidence for the possibility of finding shortcuts to fine-tuning by directly injecting knowledge into model layers.

📄 Paper: 2402.12865

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9