Gabriele Sarti

gsarti

AI & ML interests

Interpretability for generative language models

gsarti's activity

posted an update 6 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation by @aadityasingh , T. Moskovitz, F. Hill, S. C. Y. Chan, A. M. Saxe ( @gatsbyunit )

This work proposes a new methodology inspired by optogenetics (dubbed "clamping") to perform targeted ablations during training to estimate the causal effect of specific interventions on mechanism formation.

Authors use this approach to study the formation of induction heads while training a 2-layer attention-only transformer to label examples via context information.

Notable findings:

- The effects of induction heads are additive and redundant, with weaker heads compensating well when a strong induction head is ablated.
- Competition between induction heads might emerge as a product of optimization pressure to converge faster, but it is not strictly necessary as all heads eventually learn to solve the task.
- Previous token heads (PTHs) influence induction heads in a many-to-many fashion, with any PTH eliciting above-chance predictions from a subsequent induction head.
- Three subcircuits for induction are identified, respectively mixing token-label information (1 + 2), matching the previous occurrence of the current class in the context (3qk + 4), and copying the label of the matched class (3v + 5).
- The formation of induction heads is slowed down by a larger number of classes & labels, with more classes and more labels slowing down the formation of the matching and copying mechanisms, respectively. This may have implications when selecting a vocabulary size for LLMs: larger vocabularies lead to an increased compression ratio and longer contexts, but they might make copying more challenging by delaying the formation of induction heads.

💻 Code: https://github.com/aadityasingh/icl-dynamics

📄 Paper: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (2404.07129)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
posted an update 9 days ago
I'm super happy to co-organize the (Mechanistic) Interpretability social at #ICLR2024 with @nikhil07prakash ! 🔍

If you plan to attend, help us make this meetup awesome by filling the form below! 😄

📅 Wed, May 8, 12:45-2:15 PM
🔗 RSVP & share your ideas here: https://forms.gle/FWap4KW2ikdntjfb8
posted an update 15 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models (2404.07004) by @igortufanov @mahnerak @javifer @lena-voita

The LM Transparency Tool is an open-source toolkit and visual interface for efficiently identifying the component circuits in LMs responsible for their predictions, using the Information Flow Routes approach ( Information Flow Routes: Automatically Interpreting Language Models at Scale (2403.00824)).

The tool enables fine-grained customization, highlighting the importance of individual FFN neurons and attention heads. Moreover, vocabulary projections computed using the logit lens approach are provided to examine intermediate predictions of the residual stream, and tokens promoted by specific component updates.
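
For readers less familiar with the logit lens, here is a minimal sketch of the underlying projection using a Hugging Face GPT-2 model (my own illustrative example, not code from the tool): intermediate residual stream states are passed through the final layer norm and the unembedding matrix to obtain per-layer vocabulary predictions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
with torch.no_grad():
    hidden_states = model(**inputs, output_hidden_states=True).hidden_states

for layer, h in enumerate(hidden_states):
    # Project the residual stream at the last position through the final
    # layer norm and the unembedding matrix (logit lens).
    logits = model.lm_head(model.transformer.ln_f(h[0, -1]))
    print(f"layer {layer:2d} -> {tok.decode(logits.argmax().item())!r}")
```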

💻 Code: https://github.com/facebookresearch/llm-transparency-tool

🚀 Demo: facebook/llm-transparency-tool-demo

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
posted an update 20 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: x2 edition!

Today's highlighted works aim to reproduce findings from the Transformer-centric interpretability literature on new RNN-based architectures such as Mamba and RWKV:

Does Transformer Interpretability Transfer to RNNs? (2404.05971) by @MrGonao T. Marshall @norabelrose

Locating and Editing Factual Associations in Mamba (2404.03646) by @sensharma @datkinson @davidbau

The first paper applies contrastive activation addition, the tuned lens, and probing for eliciting latent knowledge in quirky models to Mamba and RWKV LMs, finding that these Transformer-specific methods can be applied to these architectures with slight adaptations, obtaining similar results.

The second work applies the ROME method to Mamba, finding that weights playing the role of MLPs encode factual relations across several Mamba layers and can be patched to perform model editing. A new SSM-specific technique is also introduced to emulate attention knockout (value zeroing), revealing information flows similar to those observed in Transformers when processing factual statements.

💻 Code: https://github.com/arnab-api/romba

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
replied to their post 21 days ago

Ah I see, sorry! For the cancellation part, there is a paragraph in the paper explaining it.
My understanding is that if the indirect effect of a component is ~equal to the direct one but with opposite direction, then its contribution should be ~0, but it risks being non-zero due to approximation errors on the indirect path (if the resulting value is ~0, even very tiny errors going through nonlinearities might be blown up). With GradDrop, they basically handle this situation by avoiding taking the difference, and instead estimate the effects of the direct and indirect paths separately.

posted an update 21 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: Context versus Prior Knowledge in Language Models by @kdu4108 @vesteinn @niklasstoehr J. C. White A. Schein @rcotterell

This work examines the influence of context versus memorized knowledge in LMs through the lens of the shift caused by contexts at various degrees of informativeness to the models' predictive distribution. Understanding this difference is especially important in the context of knowledge conflicts between memorized and contextual information.

Authors propose disentangling context influence in terms of "persuasion", i.e. how impactful the inclusion of the context is for the answers to a given query/entity pair, and "susceptibility", i.e. how likely the answers to a given query/entity pair are to be swayed by the presence of context, and operationalize these concepts using information-theoretic measures akin to mutual information.
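
As a rough sketch of how such a score can be operationalized (my own simplification; the paper's exact definitions and normalization differ), a persuasion-style score can be computed as the divergence between the model's answer distribution with and without the context:

```python
import torch

def persuasion_score(logp_with_ctx: torch.Tensor, logp_no_ctx: torch.Tensor) -> float:
    """KL(p(answer | context, query) || p(answer | query)) over candidate answers.

    Both inputs are log-probability vectors over the answer set; a higher score
    means the context shifts the answer distribution more strongly.
    """
    p_ctx = logp_with_ctx.exp()
    return (p_ctx * (logp_with_ctx - logp_no_ctx)).sum().item()
```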

The two metrics are validated using a synthetic dataset sourced from a knowledge graph. Analysis shows that:

- The degree of persuasiveness of relevant contexts increases with model size (interesting implications here for the jailbreaking of LLMs!)
- Assertive contexts tend to be more persuasive for closed (yes/no) queries and mid-sized models
- Negation affects context persuasiveness
- Familiar entities (explored as real vs. fake, more frequent in training data and more connected in the KG) are less susceptible to context influence

Finally, authors suggest applications of the persuasion/susceptibility framing for social science analyses and gender bias evaluation.

💻 Code: https://github.com/kdu4108/measureLM
📄 Paper: Context versus Prior Knowledge in Language Models (2404.04633)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
replied to their post 21 days ago

Figure 3 shows this well: if you visualize the linear approximation induced by AtP between clean/patched attentions, this approximation fares poorly when the values are in a saturated region of the softmax (see the Native AtP error between the patched vs. approximated probability). If the values were instead in the non-saturated region of the softmax (i.e. the steep part of the curve), the approximation would be much better! Hope it helps!
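
A toy numerical illustration of the point (my own example, not the paper's setup): a first-order approximation around the clean logits is accurate when the softmax is in its steep region, but badly overestimates the patched probability when the clean logits are saturated.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def linear_approx(clean_logits, patched_logits):
    # First-order (AtP-style) approximation of the patched probability
    # around the clean logits, using the softmax Jacobian at the clean point.
    p = softmax(clean_logits)
    jac = np.diag(p) - np.outer(p, p)
    return p + jac @ (patched_logits - clean_logits)

patched = np.array([0.0, 0.0])  # true patched probability = 0.5
for clean in (np.array([10.0, 0.0]), np.array([0.5, 0.0])):
    true_p = softmax(patched)[0]
    approx_p = linear_approx(clean, patched)[0]
    print(f"clean logits {clean}: true {true_p:.3f} vs. linear approx {approx_p:.3f}")
```

With saturated clean logits [10, 0] the approximation stays near 1 instead of 0.5, while with [0.5, 0] (steep region) it is almost exact.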

posted an update 23 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: Do language models plan ahead for future tokens? by W. Wu @jxm @lionellevine

This work aims to evaluate whether language models exhibit implicit planning during generation.

Authors propose two hypotheses that could result in planning-like behavior:

- Pre-caching: the model engages in computation that is functional to future, but not current, predictions.

- Breadcrumbs: Features contributing to the current prediction happen to also be the ones improving future ones.

To validate which behavior is observed in practice, authors note that off-diagonal gradients for weight matrices across the model are the ones responsible for pre-caching, and craft a variant of gradient descent (myopic descent) to remove such terms from the optimization procedure.

Using a synthetic dataset, authors demonstrate that pre-caching occurs in Transformer language models. However, in natural language settings the LM is observed to leverage breadcrumbs from previous passes even under myopic training, rendering the latter hypothesis the more plausible account of model behavior.

📄 Paper: Do language models plan ahead for future tokens? (2404.00859)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
posted an update 24 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: ReFT: Representation Finetuning for Language Models by @zhengxuanzenwu @aryaman Z. Wang @atticusg D. Jurafsky @manning @cgpotts

This work introduces Representation Finetuning (ReFT), a framework using learned inference-time interventions as efficient yet effective alternatives to PEFT weight adaptation. LoReFT, a ReFT variant intervening linearly on a representation subspace, is evaluated against several PEFT approaches, showing SOTA performance across popular benchmarks while being 10-50x more parameter-efficient. The 🤗-compatible pyreft library is introduced to simplify ReFT usage.
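
As I understand the paper's formulation, the LoReFT intervention edits a hidden state only inside a low-rank subspace; a minimal sketch (parameter names are mine):

```python
import torch

def loreft(h: torch.Tensor, R: torch.Tensor, W: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Sketch of a LoReFT-style intervention on a single hidden state.

    h: (d,) hidden state at the intervened position/layer
    R: (r, d) low-rank projection with (approximately) orthonormal rows
    W: (r, d), b: (r,) learned parameters of the intervention
    The state is moved towards the learned target only within the subspace
    spanned by the rows of R, leaving the orthogonal complement untouched.
    """
    return h + R.T @ (W @ h + b - R @ h)
```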

This is one of the most convincing practical applications of interpretability methods/insights I've seen in recent years, and I'm looking forward to people combining this with methods to disentangle features like SAEs and Backpack LMs for making interventions more interpretable!

📄 Paper: ReFT: Representation Finetuning for Language Models (2404.03592)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 25 days ago
🔍 Today's pick in Interpretability & Analysis of LMs: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models by @sammarks C. Rager @eircjm @belinkov @davidbau @amueller

This work proposes using features and errors from sparse autoencoders trained to reconstruct LM activations as interpretable units for circuit discovery. The authors then introduce SHIFT, a technique for editing model behavior by ablating interpretable elements from sparse feature circuits. This method is applied alongside unsupervised circuit discovery at scale by means of clustering, showing highly interpretable feature circuits interacting to produce behaviors like predicting sequence increments.
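
A minimal sketch of the kind of feature ablation SHIFT relies on (my own toy stand-in SAE; the paper additionally treats the SAE reconstruction error as its own node in the circuit):

```python
import torch

class TinySAE(torch.nn.Module):
    """Toy sparse autoencoder standing in for a trained SAE (illustrative only)."""
    def __init__(self, d_model: int, d_feats: int):
        super().__init__()
        self.enc = torch.nn.Linear(d_model, d_feats)
        self.dec = torch.nn.Linear(d_feats, d_model)

    def encode(self, x):
        return torch.relu(self.enc(x))

    def decode(self, f):
        return self.dec(f)

def ablate_sae_feature(act: torch.Tensor, sae: TinySAE, feature_idx: int) -> torch.Tensor:
    feats = sae.encode(act)
    error = act - sae.decode(feats)   # SAE reconstruction error, kept as its own term
    feats[..., feature_idx] = 0.0     # ablate the targeted interpretable feature
    return sae.decode(feats) + error
```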

I found the experiment in Section 4 especially convincing and exciting in terms of downstream applications: authors trained a classifier over a biased dataset and showcased how the SHIFT intervention in feature space leads to performance matching that of the same model trained on an unbiased data distribution!

📄 Paper: Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models (2403.19647)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update about 1 month ago
🔍 Today's pick in Interpretability & Analysis of LMs: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms by @mwhanna @sandropezzelle @belinkov

Edge attribution patching (EAP) is a circuit discovery technique using gradients to approximate the effect of causally intervening on each model edge. In the literature, its effectiveness is validated by comparing the overlap of its resulting circuits with those found via (much more expensive) causal interventions.

This work:

1. Proposes a new method for faithful and efficient circuit discovery named edge attribution patching with integrated gradients (EAP-IG)
2. Evaluates the faithfulness of EAP, EAP-IG and activation patching, i.e. whether the behavior of the model remains consistent after all non-circuit edges are ablated.
3. Highlights that, while no overlap and full overlap between circuits found by EAP-like methods and activation patching are generally good indicators of unfaithful and faithful circuit identification respectively, circuits with moderate overlap cannot generally be assumed to be faithful to model behavior.

An advantage of EAP-IG is enabling the usage of KL-Divergence as a target for gradient propagation, which is not possible in the case of raw gradient-based EAP.

EAP-IG's runtime is comparable to that of EAP, requiring only a small number of steps to approximate the gradient integral.
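
For reference, a sketch of the two scores in my own notation (see the paper for the precise formulation): EAP scores an edge $u \to v$ with a single gradient computed on the clean run, while EAP-IG averages gradients along the straight path between clean and corrupted activations:

```latex
\mathrm{EAP}(u \to v) \approx \big(z_u^{\mathrm{corr}} - z_u^{\mathrm{clean}}\big)^{\!\top} \nabla_{z_v} \mathcal{L}\big(z^{\mathrm{clean}}\big), \qquad
\mathrm{EAP\text{-}IG}(u \to v) \approx \big(z_u^{\mathrm{corr}} - z_u^{\mathrm{clean}}\big)^{\!\top} \frac{1}{m}\sum_{k=1}^{m} \nabla_{z_v} \mathcal{L}\Big(z^{\mathrm{clean}} + \tfrac{k}{m}\big(z^{\mathrm{corr}} - z^{\mathrm{clean}}\big)\Big)
```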

Importantly, circuit faithfulness does not imply completeness, i.e. whether all components participating towards a specific task were accounted for. This aspect is identified as interesting for future work.

📄 Paper: Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms (2403.17806)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update about 1 month ago
Our 🐑 PECoRe 🐑 method to detect & attribute context usage in LM generations finally has an official Gradio demo! 🚀

gsarti/pecore

Highlights:
🔍 Context attribution for several decoder-only and encoder-decoder models using convenient presets
🔍 Uses only LM internals to faithfully reflect context usage, no additional detector involved
🔍 Highly parametrizable, export Python & Shell code snippets to run on your machine using 🐛 Inseq CLI (https://github.com/inseq-team/inseq)

Want to use PECoRe for your LMs? Feedback and comments are welcome! 🤗
posted an update about 2 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Information Flow Routes: Automatically Interpreting Language Models at Scale by @javifer @lena-voita

This work presents a novel method to identify salient components in Transformer-based language models by decomposing the contribution of various model components into the residual stream.

This method is more efficient and scalable than previous techniques such as activation patching, as it only requires a single forward pass through the model to identify critical information flow paths. Moreover, it can be applied without a contrastive template, which is observed to produce results dependent on the selected contrastive example for activation patching.

Information flow routes are applied to Llama 2, showing that:

1. Models show “typical” information flow routes for non-content words, while content words don’t exhibit such patterns.
2. Feedforward networks are more active in the bottom layers of the network (where e.g. subject enrichment is performed) and in the very last layer.
3. Positional and subword-merging attention heads are among the most active and important throughout the network.
4. Periods can be treated by the model as BOS tokens by leaving their residual representation mostly untouched during the forward pass.

Finally, the paper also demonstrates that some model components are specialized for specific domains, such as coding or multilingual texts, suggesting a high degree of modularity in the network. The contributions of domain-specific heads, obtained by projecting the right singular vectors of the OV circuit onto the unembedding matrix, show highly interpretable concepts being handled by granular model components.

📄 Paper: Information Flow Routes: Automatically Interpreting Language Models at Scale (2403.00824)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update about 2 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: AtP*: An efficient and scalable method for localizing LLM behaviour to components by J. Kramár T. Lieberum R. Shah @NeelNanda

The attribution patching method (AtP) can provide fast and effective approximations of activation patching, requiring only two forward passes and one backward pass to estimate the contribution of all network components for a given prompt pair.
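
In my own notation, the AtP estimate for a node $n$ is the first-order term of the patching effect, computed from the clean-run gradient and the activation difference between the two prompts (a sketch; see the paper for the exact definition):

```latex
\hat{c}_{\mathrm{AtP}}(n) \approx \big(n^{\mathrm{noise}} - n^{\mathrm{clean}}\big)^{\!\top} \nabla_{n}\, \mathcal{L}\big(x^{\mathrm{clean}}\big)
```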

While previous work highlighted the effectiveness of attribution patching, authors identify two settings leading to false negatives using AtP:

- When estimating the contribution of pre-activation components, if clean and noise inputs don't lie in the same activation region, the first-order approximation provided by the gradient leads to large errors (Fig 3).
- When the sum of direct and indirect effects is close to 0, even small approximation errors introduced by nonlinearities can greatly affect the estimated contribution.

Authors propose two changes to the AtP method to mitigate such issues:

- Recomputing the attention softmax for the selected component, and then taking a linear approximation to the remaining part of the model (QK Fix)
- Iteratively zeroing gradients at layers contributing to the indirect effects causing cancellation (GradDrop)

AtP and AtP* are compared across several patching settings for Pythia models, finding them effective while much less computationally expensive than other approaches. A new methodology is also proposed to estimate the magnitude of AtP* false negatives given a set of samples and desired confidence levels.

📄 Paper: AtP*: An efficient and scalable method for localizing LLM behaviour to components (2403.00745)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 2 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: CausalGym: Benchmarking causal interpretability methods on linguistic tasks by @aryaman D. Jurafsky @cgpotts

TL;DR: Introduce a revisited benchmark to evaluate the effectiveness and reliability of intervention methods across several linguistic phenomena.


While several interpretability methods are currently used to discover task-relevant model components, their performance and reliability are seldom tested broadly.

This paper adapts the SyntaxGym benchmark, originally conceived for the study of psycholinguistic phenomena such as subject-verb agreement and garden-path sentences, to evaluate intervention-based interpretability methods. In practice, faithful interventions over model components are expected to cause a predictable change in model prediction (e.g. singular -> plural verb).

Various methods are benchmarked on Pythia models ranging from 14M to 6.9B params, finding Distributed Alignment Search (DAS) to consistently outperform other approaches, followed by probing. When using control tasks to account for the expressivity of supervised methods, probing is found to be more reliable than DAS for larger model sizes.

Authors conclude with an evaluation of how features driving linguistically plausible behaviours emerge during model training. These features are observed to emerge in Pythia models after 1k training steps, and become progressively more complex over time.

📄 Paper: CausalGym: Benchmarking causal interpretability methods on linguistic tasks (2402.12560)
💻 Code: https://github.com/aryamanarora/causalgym
🔡 Dataset: aryaman/causalgym

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 2 months ago
💥 Today in Interpretability & Analysis of LMs: Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking by @nikhil07prakash @tamarott @TalHaklay @belinkov @davidbau

Fine-tuning is commonly used to improve LMs' capabilities, but its impact on model-internal mechanisms remains poorly understood.

This work evaluates the impact of fine-tuning from a mechanistic perspective, using entity tracking in fine-tuned LLaMA 7B variants as a test bench.

Authors use path patching to highlight how fine-tuned models largely employ the same circuits as their pre-trained counterparts to solve entity tracking. Desiderata-based Component Masking (DCM) is used to discern the function of circuit components, finding that even the functionality of the circuit components remains consistent after fine-tuning.

Where do the gains stem from, then? Using Cross-Model Activation Patching (CMAP), authors show the benefits of fine-tuning are largely derived from an improved ability of circuit components to encode important task-relevant information rather than an overall functional rearrangement. Interestingly, fine-tuned activations are compatible with the base model despite no explicit constraint during representation learning.

🌐 Website: http://finetuning.baulab.info/
🤖 Model: nikhil07prakash/float-7b
📄 Paper: Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking (2402.14811)

🔍 All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 2 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation by A. Himmi G. Staerman M. Picot @Colombo @nunonmg

Previous work on hallucination detection for MT showed that different detectors excel at detecting different types of hallucinations.

In this context, detectors based solely on model internals such as input contributions or sequence log-probabilities fare well on fully-detached hallucinations, but show limited performance on oscillatory hallucinations, where ad-hoc trained detectors are still the best-performing methods.

This work proposes Simple Detectors Aggregation (STARE), an aggregation procedure to leverage detectors’ complementary strengths. Authors experiment with two popular hallucination detection benchmarks (LFAN-HALL and HalOmi), showing that STARE outperforms single detectors and other aggregation baselines.

Results obtained aggregating internal detectors highlight how model-based features that are readily available as generation byproducts can outperform computationally expensive ad-hoc solutions.

📄 Paper: Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation (2402.13331)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 2 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Backward Lens: Projecting Language Model Gradients into the Vocabulary Space by @shaharkatz @belinkov @mega @liorwolf

Recent interpretability works explore intermediate model representations by projecting them to vocabulary space. This work explores projecting gradients computed from the backward pass to vocabulary space to explain how a single forward-backward pass edits LM knowledge.

Authors identify a mechanism they dub "imprint and shift" in the feed-forward module of transformer layers. Specifically, the "imprint" refers to the first matrix, to or from which the learning process adds or subtracts copies of the intermediate inputs encountered during the forward pass. The "shift" refers to the second matrix, whose weights are shifted by the embedding of the target token.

Authors note that the dominant components in constructing gradients are derived from the outer product of the last token’s input and the Vector-Jacobian Product, and that the latter contains the embedding of the target token.

In light of this, a new editing approach named “forward pass shifting” is proposed to update the shifting component of a layer’s feedforward module without backpropagation, using only layer inputs and target token embeddings. The method performs on par with significantly more expensive editing approaches like ROME for single-fact editing, but is less robust to paraphrasing.

Authors note that these results provide promising evidence on the possibility of finding shortcuts to fine-tuning by directly injecting knowledge in model layers.

📄 Paper: Backward Lens: Projecting Language Model Gradients into the Vocabulary Space (2402.12865)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 2 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: In-Context Learning Demonstration Selection via Influence Analysis
by Vinay M.S. @minhhaovan X. Wu

Recent work showed how the performance of LMs using in-context learning (ICL) is heavily dependent on selected demonstrations.

This work introduces InfICL, a demonstration selection method using influence functions to identify salient training examples to use as demonstrations at inference time. InfICL is tested alongside other example selection baselines for prompting medium-sized LLMs on CoLA and RTE, showing improvements over other methods especially when a smaller number of in-context examples is used.

📄 Paper: In-Context Learning Demonstration Selection via Influence Analysis (2402.11750)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
replied to santiviquez's post 3 months ago

Hey @santiviquez , this is quite similar to what we propose in the Context-sensitive Token Identification (CTI) step of our PECoRe framework (https://openreview.net/forum?id=XTHfNGI3zT), with the main difference that you define as "salient" anything matching some heuristic (e.g. NER/POS), while for us the relevance is given by how the generated token probability is impacted by the presence/absence of context.

I'll make an ad-hoc post about it as soon as we have a demo, but the method is also integrated in the CLI of our Inseq toolkit as inseq attribute-context: https://inseq.org/en/latest/main_classes/cli.html#attribute-context
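
For reference, the core Inseq Python API looks roughly like this (a minimal sketch of the generic attribution entry point, not the PECoRe-specific context attribution, which runs through the attribute-context CLI linked above):

```python
import inseq

# Load a supported model together with an attribution method.
model = inseq.load_model("gpt2", "saliency")

# Attribute a generation and visualize token-level importance scores.
out = model.attribute("The developer argued with the designer because she")
out.show()
```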

posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Recovering the Pre-Fine-Tuning Weights of Generative Models by @eliahu , J. Kahana, Y. Hoshen

Using low-rank adapters (LoRA) is nowadays a common practice to fine-tune pre-trained generative models on specific tasks, or align them to human preferences.

This work explores pre-fine tuning weight recovery: given a set of LoRA models with merged weights fine-tuned from the same pre-trained system, the task is to recover the original (unknown) weights of the pre-trained model.

Authors propose SpectralDeTuning, a method framing this task as an optimisation problem that alternates between approximating all low-rank tuned matrices using SVD and computing in closed form the optimal pre-trained matrix given the approximated low-rank ones.

The LoRA Weight Recovery Attack (LoWRA) benchmark is introduced to evaluate pre-fine tuning weight recovery across language and vision tasks on ViT, Mistral and Stable Diffusion models.

The SpectralDeTuning method is shown to be effective in recovering original models both intrinsically (difference in weights) and behaviorally (similar outputs). The main limitations of the approach are the assumption that the rank used by the LoRAs is known to the attacker, and the relatively high number of LoRAs needed to provide a good approximation.

📄 Paper: Recovering the Pre-Fine-Tuning Weights of Generative Models (2402.10208)

💻 LoWRA Bench: Eliahu/LoWRA-Bench

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: SyntaxShap: Syntax-aware Explainability Method for Text Generation by @kamara000 , R. Sevastjanova and M. El-Assady

Most model-agnostic post-hoc interpretability methods used nowadays in NLP were originally ported from tabular/CV domains with next to no adjustments to the intrinsic properties of textual inputs.

In this work, authors propose SyntaxSHAP, an adaptation of the Shapley value approach in which the coalitions used to compute marginal contributions to importance scores are constrained by the syntax of the explained sentence. The resulting tree-based coalitions do not satisfy the efficiency assumption of Shapley values but preserve the symmetry, nullity and additivity axioms.

SyntaxSHAP is compared to other model-agnostic approaches on small (GPT-2 117M) and large (Mistral 7B) LMs, showing it produces explanations that are more faithful to model predictions and more semantically meaningful than other common methods, while also being more efficient than the base SHAP method.

📄 Paper: SyntaxShap: Syntax-aware Explainability Method for Text Generation (2402.09259)

💻 Code: https://github.com/k-amara/syntax-shap

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models by M. Ballout @krumnack et al.

Authors propose a fine-tuning procedure in which a classification task is framed as generation and augmented with a natural language explanation to clarify intermediate reasoning steps. The procedure is applied to fine-tune language models of various sizes on the ListOps dataset, containing synthetically-generated instructions on sequences of numbers.

Authors find that explanations contribute to improving model performance across all tested model sizes and explanation lengths. Smaller language models appear to benefit the most from this approach in terms of convergence speed, performance and input length generalisation, especially when given more exhaustive explanations.

📄 Paper: Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models (2402.07543)

💻 Code: https://github.com/BalloutAI/Fine-tuning-with-Explanation

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Model Editing with Canonical Examples by @johnhew @sachen @lora-x E. Adams P. Jiang @manning

This work introduces a model editing approach using individual "canonical" examples to showcase desired/unwanted behavior. An evaluation is then conducted on out-of-distribution samples spanning six datasets (3 introduced in this work) covering settings of interest in bias mitigation, hard syntactic constructions and knowledge-based predictions, while limiting the degradation of the original model's loss.

Authors experiment with Pythia LMs, finding that LoRA fine-tuning on canonical examples outperforms other established editing methods such as MEMIT.

Then, the approach is tested on Backpack LMs, using a linear combination of sense vectors to disentangle semantic information in the input texts. In particular, authors introduce “sense fine-tuning” where only a handful of sense vectors is updated per example, which is shown to be more efficient yet more effective than regular fine-tuning.

Finally, the relation between the predictions of pre- and post-sense fine-tuning backpack LMs is used to successfully transfer the desired adaptation to a larger standard LM, at no performance cost.

📄 Paper: Model Editing with Canonical Examples (2402.06155)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers by @RedOneAI et al.

This work proposes extending the LRP feature attribution framework to handle Transformer-specific layers. In particular, authors:

1. Propose a generalized approach to softmax linearization by designing a distribution rule that incorporates bias terms, absorbing a portion of the relevance.
2. Propose decomposing the element-wise matrix multiplication in the attention operation as a sequence of epsilon and uniform distribution rules to ensure conservation (i.e. the sum of relevance stays constant across layers).
3. Propose handling normalisation layers with an identity distribution rule.
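
For context, these rules build on the generic ε-rule from the LRP literature (standard form, not the paper's Transformer-specific variants), which redistributes the relevance of an upper-layer neuron to its inputs in proportion to their contributions, with a small ε stabilizing the denominator:

```latex
R_j = \sum_k \frac{a_j\, w_{jk}}{\epsilon + \sum_{j'} a_{j'}\, w_{j'k}}\; R_k
```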

By means of extensive experiments, authors show that AttnLRP:

1. Is significantly more faithful than other popular gradient- and attention-based attribution approaches on CV and NLP tasks using large transformer models.
2. Runs in O(1) time, requiring O(sqrt(num_layers)) memory, as opposed to perturbation-based approaches requiring O(seq_len) time.
3. Can be used alongside activation maximisation to explain the contribution of granular model components in driving models' predictions.

📄 Paper: AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers (2402.05602)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models by C. Agarwal, S.H. Tanneru and H. Lakkaraju

This work discusses the dichotomy between faithfulness and plausibility in LLMs’ self-explanations (SEs) in natural language (CoT, counterfactual reasoning, and token importance). These explanations tend to be reasonable according to human understanding (plausible) but are not always aligned with the reasoning processes of the LLMs (unfaithful).

Authors remark that the increase in plausibility driven by the demand for friendly conversational interfaces might come at the expense of faithfulness. Given the faithfulness requirements of many high-stakes real-world settings, authors suggest these should be considered when designing and evaluating new explanation methodologies.

Finally, the authors call for a community effort to 1) develop reliable metrics to characterize the faithfulness of explanations and 2) pioneer novel strategies to generate more faithful SEs.

📄 Paper: Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models (2402.04614)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9
replied to lbourdois's post 3 months ago

Agreed, the data quality bot seems like a fantastic idea!

posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection by @chaochen et al.

Previous efforts to detect hallucinations using model-intrinsic information employed predictive uncertainty or self-consistency. Authors contend that in these procedures the rich semantic information captured in model embeddings is inevitably lost when decoding tokens.

To prevent this information loss they propose EigenScore, an internal measure of responses’ self-consistency using the eigenvalues of sampled responses' covariance matrix in intermediate model layers to quantify answers’ diversity in the dense embedding space.
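
A minimal sketch of an EigenScore-style computation (my own simplification: embs holds one intermediate-layer sentence embedding per sampled answer; the paper uses the log-determinant of a regularized covariance):

```python
import numpy as np

def eigenscore(embs: np.ndarray, alpha: float = 1e-3) -> float:
    """Higher scores indicate more diverse sampled answers, i.e. likely hallucination."""
    k, d = embs.shape
    centered = embs - embs.mean(axis=0, keepdims=True)
    cov = centered @ centered.T / d                  # (k, k) covariance across the k samples
    eigvals = np.linalg.eigvalsh(cov + alpha * np.eye(k))
    return float(np.mean(np.log(eigvals)))
```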

Results show that EigenScore outperforms logit-level methods for hallucination detection on QA tasks, especially when paired with inference time feature clipping to truncate extreme activations, reducing overconfident generations.

📄 Paper: INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection (2402.03744)
replied to santiviquez's post 3 months ago

Would be curious to see your results using the ALTI method!

posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Rethinking Interpretability in the Era of Large Language Models
by C. Singh, J. P. Inala, @mgalley , R. Caruana, @wyngjf

In this opinion piece, authors contend that the new capabilities of LLMs can deeply transform the scope of interpretability, moving from low-level explanations such as saliency maps to natural language explanations that would allow for natural interaction with users.

This ambitious goal is however hindered by LM’s natural tendency to hallucinate, their large size and their inherent opaqueness. Authors highlight in particular dataset explanations for knowledge discovery, explanations’ reliability and interactive explanations as important priorities for the future of interpretability research.

📄 Paper: Rethinking Interpretability in the Era of Large Language Models (2402.01761)
replied to manu's post 3 months ago

Congratulations Manu! I can't wait to try it out! 😄

posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains by @alonjacovi @yonatanbitton B. Bohnet J. Herzig @orhonovic M. Tseng M. Collins @roeeaharoni @mega

This work introduces a new methodology for human verification of reasoning chains and adopts it to annotate a dataset of chain-of-thought reasoning chains produced by 3 LMs. The annotated dataset, REVEAL, can be used to benchmark automatic verifiers of reasoning in LMs.

In their analysis, the authors find that LM-produced CoTs generally contain faulty steps, often leading to incorrect automatic verification. In particular, CoT-generating LMs are found to produce non-attributable reasoning steps often, and reasoning verifiers generally struggle to verify logical correctness.

📄 Paper: A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains (2402.00559)
🔡 Dataset: google/reveal
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: ReAGent: Towards A Model-agnostic Feature Attribution Method for Generative Language Models by @casszhao and B. Shan

Authors propose Recursive Attribution Generation (ReAGent), a perturbation-based feature attribution approach specifically conceived for generative LMs. The method employs a lightweight encoder LM to replace sampled input spans with valid alternatives and measures the effect of the perturbation as the drop in next-token probability. ReAGent is shown to consistently outperform other established approaches across several models and generation tasks in terms of token- and sentence-level faithfulness.

📄 Paper: ReAGent: Towards A Model-agnostic Feature Attribution Method for Generative Language Models (2402.00794)
💻 Code: https://github.com/casszhao/ReAGent
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Gradient-Based Language Model Red Teaming by N. Wichers, C. Denison and @beirami

This work proposes Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts inducing an LM to output unsafe responses.

In practice, prompts are learned by scoring LM responses with a safety-trained probing classifier, and back-propagating through frozen classifier and LM to update the prompt.

Authors experiment with variants of GBRT aimed at inducing realistic prompts in an efficient way, finding that GBRT prompts are more likely to generate unsafe responses than those found by established RL-based red teaming methods. Moreover, these attacks are shown to succeed even when the LM has been fine-tuned to produce safer outputs.

📄 Paper: Gradient-Based Language Model Red Teaming (2401.16656)
💻 Code: https://github.com/google-research/google-research/tree/master/gbrt
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: In-Context Language Learning: Architectures and Algorithms by E. Akyürek, B. Wang, Y. Kim, J. Andreas

This work methodically evaluates in-context learning on formal languages across several model architectures, showing that Transformers outperform all other recurrent and convolutional models, including SSMs. These results are attributed to the presence of "n-gram heads" able to retrieve the token following a context already seen in the current context window and copy it. This idea is further supported by the better ability of Transformer models to encode in-context n-gram frequencies for n>1, and by the higher similarity of Transformer-based LM outputs to those of classical n-gram models trained on the same data. Finally, these insights are applied to the design of static attention layers mimicking the behavior of n-gram heads, which lead to lower perplexity despite lower computational costs.

📄 Paper: In-Context Language Learning: Architectures and Algorithms (2401.12973)
💻 Code: https://github.com/berlino/seq_icl
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Black-Box Access is Insufficient for Rigorous AI Audits by @stecas @carsonezell et al.

Audits conducted on AI systems can identify potential risks and ensure their compliance with safety requirements. Authors categorise audits based on the level of access to model-related resources (black-box, grey-box, white-box and outside-the-box) and highlight how greater transparency on the audited AI system enables broader and more effective auditing procedures. Technical, physical, and legal safeguards for performing audits are also introduced to ensure minimal security risks for audited companies. Authors conclude that transparency on the type of auditors' access and methods is a pre-requisite to correctly interpret audit results, and that white-box and outside-the-box access allow for substantially more scrutiny than black-box access alone.

📄 Paper: Black-Box Access is Insufficient for Rigorous AI Audits (2401.14446)

🔍 Further readings:

📄Taxonomy of AI system access: https://bit.ly/struct-access
💻An API for transparent science on Black-box AI (NNsight): https://nnsight.net/about
replied to santiviquez's post 3 months ago

You might be interested in this follow-up work showing that fully intrinsic properties in the form of attribution scores outperform logprobs, especially on fully detached hallucinations, matching supervised hallucination detectors' abilities: https://aclanthology.org/2023.acl-long.3/

posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools by @qiaw99 @tanikina @nfel et al.

Authors introduce LLMCheckup, a conversational interface connecting an LLM to several interpretability tools (feature attribution methods, similarity, counterfactual/rationale generation) allowing users to inquire about LLM predictions using natural language. The interface consolidates several interpretability methods in a unified chat interface, simplifying future investigations into natural language explanations.

📄 Paper: LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools (2401.12576)
💻 Code: https://github.com/DFKI-NLP/LLMCheckup
🎥 Demo video: https://www.youtube.com/watch?v=ZwN8ZQSXoOU
posted an update 3 months ago
Are you working with users and would like to let them edit a text containing highlights? Check out my new Gradio component HighlightedTextbox! 🤗

The component can be installed with pip install gradio_highlightedtextbox and used seamlessly with other native components in Gradio interfaces. It supports texts with multiple tags and provides some reasonable UX behaviors (tags disappear when an edit is performed inside them). It should be great for UIs to study user editing of annotated LM generations/translations!

Demo: gsarti/gradio_highlightedtextbox
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: The Calibration Gap between Model and Human Confidence in Large Language Models by @coolprof H. Tejeda A. Kumar @Cbelem @skarny et al.

This work involves an experimental study to assess human confidence in LLM responses to multiple-choice MMLU questions based on explanations the LLM provides together with the selected answer. The authors experiment with altering the model prompt to reflect the actual prediction confidence in models’ explanations, showing improved calibration for users’ assessment of LLM’s reliability and a better ability to discriminate between correct and incorrect answers. These results suggest the importance of further research on the impact of automatic explanations on users’ perception.

📄 Paper: The Calibration Gap between Model and Human Confidence in Large Language Models (2401.13835)
replied to their post 3 months ago

Yes! In particular the MEMIT method was introduced as a follow-up to ROME to improve editing of multiple facts at once, but its robustness was tested mostly on whether the other edited fact would remain coherent, rather than downstream task performance. Looks like there's still a long way to go to make these approaches usable in practice!

posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Model Editing Can Hurt General Abilities of Large Language Models by J.C. Gu et al.

This work raises concerns that gains in factual knowledge after model editing can result in a significant degradation of the general abilities of LLMs. The authors evaluate 4 popular editing methods on 2 LLMs across eight representative tasks, showing that model editing does substantially hurt general model abilities. They suggest prioritizing improvements in LLMs' robustness, developing more precise editing methods, and building better evaluation benchmarks.

📄 Paper: Model Editing Can Hurt General Abilities of Large Language Models (2401.04700)
💻 Code: https://github.com/JasonForJoy/Model-Editing-Hurt
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: From Understanding to Utilization: A Survey on Explainability for Large Language Models by H. Luo and L. Specia

This survey summarizes recent works in interpretability research, focusing mainly on pre-trained Transformer-based LMs. The authors categorize current approaches as either local or global and discuss popular applications of LM interpretability, such as model editing, enhancing model performance, and controlling LM generation.

📄 Paper: From Understanding to Utilization: A Survey on Explainability for Large Language Models (2401.12874)
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Universal Neurons in GPT2 Language Models by @wesg @thorsley @luminolblue T.R. Kheirkhah Q. Sun, @willhath @NeelNanda D. Bertsimas

This work investigates the universality of individual neurons across GPT2 models trained from different initial random seeds, starting from the assumption that such neurons are likely to exhibit interpretable patterns. The authors find 1-5% of neurons showing high correlation across five model seeds, i.e. consistently activating for the same inputs. In particular, those neurons can be grouped into families exhibiting similar functional roles, e.g. modulating the next token prediction entropy, controlling the output norm of an attention head, and promoting/suppressing vocabulary elements in the prediction. Finally, universal neurons are often observed to form antipodal pairs, conjectured to improve the robustness and calibration of model predictions via ensembling.

📄 Paper: Universal Neurons in GPT2 Language Models (2401.12181)
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Towards Best Practices of Activation Patching in Language Models: Metrics and Methods by @m0pp11 and @NeelNanda

This work systematically examines the impact of methodological details in activation patching, a popular technique with causal guarantees used to quantify the importance of model components in driving model predictions. Authors' recommendations include 1) using in-distribution counterfactual prompts instead of noise/zeroing to mitigate the OOD problem, 2) using logits instead of probabilities as evaluation metrics to enable the discovery of model components with negative influence on predictions, 3) accounting for interaction factors across layers when performing multi-layer patching, and 4) experimenting with corrupting different prompt tokens to verify their agreement in the resulting discovered circuits.
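
To make the setting concrete, here is a minimal activation-patching sketch in the spirit of recommendations 1) and 2), using a counterfactual prompt pair and a logit-difference metric (my own toy example with GPT-2; the layer/position choices are arbitrary, and real studies patch finer-grained components and sweep over layers and positions):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

clean = tok("The Eiffel Tower is located in the city of", return_tensors="pt")
corrupt = tok("The Colosseum is located in the city of", return_tensors="pt")
LAYER, POS = 6, -1          # arbitrary block and (last) position to patch

stash = {}

def save_hook(module, inputs, output):
    stash["act"] = output[0].detach()          # residual stream from the corrupted run

def patch_hook(module, inputs, output):
    hidden = output[0].clone()
    hidden[:, POS] = stash["act"][:, POS]      # overwrite the clean activation at POS
    return (hidden,) + output[1:]

with torch.no_grad():
    handle = model.transformer.h[LAYER].register_forward_hook(save_hook)
    model(**corrupt)
    handle.remove()
    handle = model.transformer.h[LAYER].register_forward_hook(patch_hook)
    patched_logits = model(**clean).logits[0, -1]
    handle.remove()
    clean_logits = model(**clean).logits[0, -1]

# Logit difference between the clean and counterfactual answers (recommendation 2).
paris, rome = tok(" Paris").input_ids[0], tok(" Rome").input_ids[0]
print("clean logit diff:  ", (clean_logits[paris] - clean_logits[rome]).item())
print("patched logit diff:", (patched_logits[paris] - patched_logits[rome]).item())
```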

📄 Paper: Towards Best Practices of Activation Patching in Language Models: Metrics and Methods (2309.16042)
posted an update 3 months ago
🔍 Today's pick in Interpretability & Analysis of LMs: Can Large Language Models Explain Themselves? by @andreasmadsen Sarath Chandar & @sivareddyg

LLMs can provide wrong but convincing explanations for their behavior, and this might lead to ill-placed confidence in their predictions. This study uses self-consistency checks to measure the faithfulness of LLM explanations: if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. Results demonstrate that the faithfulness of LLM self-explanations cannot be reliably trusted, as it proves to be very task- and model-dependent, with bigger models generally producing more faithful explanations.
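
As a toy illustration of this check (hypothetical lm_predict helper returning the model's answer for a prompt; the paper's redaction and evaluation protocol is more involved):

```python
def explanation_is_consistent(prompt: str, important_words: set, lm_predict) -> bool:
    """If removing the words the model itself called important leaves the
    prediction unchanged, the self-explanation was not faithful."""
    original = lm_predict(prompt)
    redacted = " ".join(w for w in prompt.split() if w not in important_words)
    return lm_predict(redacted) != original
```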

📄 Paper: Can Large Language Models Explain Themselves? (2401.07927)
posted an update 3 months ago
💥 Today's pick in Interpretability & Analysis of LMs: 🩺 Patchscopes: A Unifying Framework for Inspecting Hidden Representations of Language Models by @asmadotgh , @codevan , @1wheel , @iislucas & @mega

Patchscopes is a generalized framework for verbalizing information contained in LM representations. This is achieved via a mid-forward patching operation inserting the information into an ad-hoc prompt aimed at eliciting model knowledge. Patchscope instances for vocabulary projection, feature extraction and entity resolution in model representations are shown to outperform popular interpretability approaches, often resulting in more robust and expressive information.

🌐 Website: https://pair-code.github.io/interpretability/patchscopes/
📄 Paper: Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models (2401.06102)
posted an update 4 months ago
💥 Today's pick in Interpretability & Analysis of LMs: Fine-grained Hallucination Detection and Editing For Language Models by @abhika-m @akariasai @vidhisha et al.

Authors introduce a new taxonomy for fine-grained annotation of hallucinations in LM generations and propose Factuality Verification with Augmented Knowledge (FAVA), a retrieval-augmented LM fine-tuned to detect and edit hallucinations in LM outputs, outperforming ChatGPT and LLama2 Chat on both detection and editing tasks.

🌐 Website: https://fine-grained-hallucination.github.io
📄 Paper: Fine-grained Hallucination Detection and Editing for Language Models (2401.06855)
🚀 Demo: fava-uw/fava
🤖 Model: fava-uw/fava-model
🔡 Dataset: fava-uw/fava-data