🔍 Daily Picks in Interpretability & Analysis of LMs
Outstanding research in interpretability and evaluation of language models, summarized
Paper • 2404.07129 • Published • 3Note This study introduces "clamping," a method inspired by optogenetics for performing training-time causal interventions to study how mechanisms form. Authors apply clamping to the emergence of induction heads (IHs), finding that IH contributions are additive and redundant, that competition between IHs arises from optimization pressures, and that IHs depend many-to-many on previous-token heads in lower layers. Three critical induction subcircuits are identified, and their formation is connected to data-dependent properties.
LM Transparency Tool: Interactive Tool for Analyzing Transformer Language Models
Paper • 2404.07004 • Published • 3Note The LM Transparency Tool is an open-source toolkit with a visual interface for efficiently identifying the component circuits responsible for LM predictions using Information Flow Routes. The tool highlights the importance of granular components and provides vocabulary projections to examine intermediate predictions in the residual stream and the tokens promoted by specific component updates.
Does Transformer Interpretability Transfer to RNNs?
Paper • 2404.05971 • Published • 3Note This work applies contrastive activation addition, the tuned lens, and probing for eliciting latent knowledge in quirky models to Mamba and RWKV LMs, finding that these Transformer-specific methods can be applied to such architectures with slight adaptations, obtaining similar results.
Context versus Prior Knowledge in Language Models
Paper • 2404.04633 • Published • 4Note This work examines the influence of context versus memorized knowledge in LMs through the lens of the shift that contexts of varying informativeness cause in the models' predictive distributions. Authors propose information-theoretic metrics to measure the persuasiveness of a context and the susceptibility of an entity to contextual influence. Analysis reveals important differences due to model size, query formulation, and context assertiveness/negation.
Locating and Editing Factual Associations in Mamba
Paper • 2404.03646 • Published • 3Note This work applies the ROME method to Mamba, finding weights that play the role of MLPs in encoding factual relations across several Mamba layers and can be patched to perform model editing. A new SSM-specific technique emulating attention knockout (value zeroing) is also introduced, revealing information flows similar to those found in Transformers when processing factual statements.
ReFT: Representation Finetuning for Language Models
Paper • 2404.03592 • Published • 66Note This work introduces Representation Finetuning (ReFT), a framework using learned inference-time interventions as an efficient yet effective alternative to PEFT weight adaptation. LoReFT, a ReFT variant intervening linearly on a subspace of representations, is evaluated against several PEFT approaches, showing SOTA performance across popular benchmarks while being 10-50x more parameter-efficient. The HF-compatible pyreft library is introduced to simplify ReFT usage.
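A LoReFT-style intervention can be sketched in NumPy. This is an illustrative reconstruction of the low-rank representation edit h' = h + Rᵀ(Wh + b − Rh), not the pyreft API; all parameters here are random stand-ins for learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2   # hidden size, intervention rank

# Hypothetical "learned" parameters (random for illustration):
# R has orthonormal rows spanning the edited subspace; W, b define the target.
R = np.linalg.qr(rng.normal(size=(d, r)))[0].T   # (r, d), orthonormal rows
W = rng.normal(size=(r, d))
b = rng.normal(size=r)

def loreft(h):
    # h' = h + R^T (W h + b - R h): overwrite only the rank-r subspace
    # component of the hidden state, leaving its orthogonal complement intact
    return h + (h @ W.T + b - h @ R.T) @ R

h = rng.normal(size=(3, d))                      # 3 token representations
h_edited = loreft(h)

# Directions orthogonal to R are unchanged by the intervention
P_perp = np.eye(d) - R.T @ R
assert np.allclose(h @ P_perp, h_edited @ P_perp)
```

The final assertion is the point of the design: the intervention edits a tiny learned subspace while provably leaving everything else in the representation untouched.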
Do language models plan ahead for future tokens?
Paper • 2404.00859 • Published • 2Note This work evaluates whether language models exhibit implicit planning during generation. In a synthetic setting, employing a myopic variant of gradient descent that ignores off-diagonal information, authors find that LMs can implicitly plan for future predictions. However, the same behavior is observed to a much lesser extent for natural language, where computations for current predictions are also functional to future results.
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
Paper • 2403.19647 • Published • 3Note This work proposes using features and errors from sparse autoencoders trained to reconstruct LM activations as interpretable units for circuit discovery. The authors then introduce SHIFT, a technique for editing model behavior by ablating interpretable elements from sparse feature circuits. This method is applied alongside unsupervised circuit discovery at scale by means of clustering, showing highly interpretable feature circuits interacting in behaviors like predicting sequence increments.
Have Faith in Faithfulness: Going Beyond Circuit Overlap When Finding Model Mechanisms
Paper • 2403.17806 • Published • 3Note This work introduces EAP-IG, an efficient approach for circuit discovery using integrated gradients to perform edge attribution patching. Circuits found by EAP-IG and other methods are evaluated in terms of faithfulness, i.e. the consistency of pre- and post-patching behavior, finding that EAP-IG outperforms EAP across all tested tasks. The overlap between circuits found by activation patching and EAP is an indicator of faithfulness only when overlap is full or absent, but not in partial-overlap cases.
AtP*: An efficient and scalable method for localizing LLM behaviour to components
Paper • 2403.00745 • Published • 8Note Authors identify two failure modes of attribution patching (AtP), a method for estimating component importance in LMs: false negatives due to attention saturation, and cancellation of direct and indirect effects. An improved version, named AtP*, is proposed to make the method more robust in such settings. A diagnostic procedure is also proposed to bound the error caused by the gradient approximation.
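The gradient approximation at the heart of AtP can be illustrated with a toy differentiable metric. This is a sketch: in the real method the metric is computed from an LM's outputs and the estimate is taken per component, but the first-order Taylor structure is the same.

```python
import numpy as np

# Toy differentiable metric over an activation vector (stands in for the
# logit-difference metric computed from an LM's output; names illustrative).
def metric(a):
    return np.tanh(a).sum()

def metric_grad(a):
    return 1.0 - np.tanh(a) ** 2          # elementwise derivative of tanh

rng = np.random.default_rng(1)
a_clean = rng.normal(size=8)                     # activation on the clean prompt
a_corrupt = a_clean + 0.1 * rng.normal(size=8)   # activation on the corrupted prompt

# Exact activation patching: rerun the metric with the corrupted activation.
exact_effect = metric(a_corrupt) - metric(a_clean)

# Attribution patching (AtP): a first-order Taylor estimate from the clean
# gradient -- one backward pass instead of one patched run per component.
atp_estimate = metric_grad(a_clean) @ (a_corrupt - a_clean)

# The estimate's error is bounded by the second-order Taylor remainder
# (|tanh''| < 1), which is exactly the kind of error AtP* aims to diagnose.
delta = a_corrupt - a_clean
assert abs(exact_effect - atp_estimate) <= 0.5 * np.sum(delta ** 2)
```

The failure modes the paper identifies correspond to cases where this linearization is a poor guide, e.g. when the metric saturates between the clean and corrupted activations.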
Information Flow Routes: Automatically Interpreting Language Models at Scale
Paper • 2403.00824 • Published • 3Note This work proposes an efficient approach for circuit discovery. Information flow routes require only a single forward pass and are derived by decomposing component updates to the Transformer residual stream. Experiments on LLaMA 7B show how the contrastive formulation of activation patching (which information flow routes avoid) can lead to misleading results depending on the selected templates.
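The single-forward-pass idea can be sketched as follows. The residual stream is a sum of component updates, so each component's edge can be scored from quantities already available in one pass; the importance score below is an illustrative norm ratio, and the paper's exact proximity measure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
residual_in = rng.normal(size=16)    # residual stream entering a block
attn_update = rng.normal(size=16)    # attention block's additive update
mlp_update = rng.normal(size=16)     # MLP block's additive update
residual_out = residual_in + attn_update + mlp_update

# Sketch of an edge-importance score: a component update's contribution
# relative to the resulting residual state. No contrastive counterfactual
# run is needed -- everything comes from the single forward pass.
def edge_importance(update, out):
    return np.linalg.norm(update) / np.linalg.norm(out)

importances = {name: edge_importance(u, residual_out)
               for name, u in [("skip", residual_in),
                               ("attn", attn_update),
                               ("mlp", mlp_update)]}

# Keep only edges above a threshold to build the sparse information-flow graph.
route_edges = {name for name, imp in importances.items() if imp > 0.1}
print(sorted(route_edges))
```

Repeating this decomposition at every layer and position, and pruning low-importance edges, yields the route graph for a given prediction.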
Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
Paper • 2402.14811 • Published • 4Note This work investigates the effect of fine-tuning on circuit-level mechanisms in LLMs, focusing on the entity tracking task on LLaMA 7B variants. Authors find that circuits from the base model persist in fine-tuned models, and their individual components preserve their functionalities. Cross-Model Activation Patching (CMAP) reveals that gains in performance can be attributed to improvements in circuit components, rather than overall functional rearrangement.
Backward Lens: Projecting Language Model Gradients into the Vocabulary Space
Paper • 2402.12865 • Published • 1Note This work extends Logit Lens vocabulary projections of FFNs in Transformers to gradients, in order to study the knowledge editing performed by backward passes. Authors prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes' inputs, and identify an imprint-and-shift mechanism driving knowledge updates in FFNs. Finally, an efficient editing method based on this linearization is evaluated, showing strong performance in simple editing settings.
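The low-rank claim follows directly from how backpropagation treats a linear layer: the weight gradient is a sum of per-token outer products of backward signals and forward inputs. A minimal NumPy check (shapes illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, seq = 8, 6, 3

X = rng.normal(size=(seq, d_in))    # forward-pass inputs to an FFN matrix
G = rng.normal(size=(seq, d_out))   # backward-pass gradients w.r.t. its outputs

# For Y = X W^T, the weight gradient is a sum of per-token outer products,
# i.e. a linear combination of forward inputs and backward signals:
#   dL/dW = sum_t g_t x_t^T
dW = G.T @ X

# Its rank is therefore at most the sequence length, not min(d_out, d_in).
assert np.linalg.matrix_rank(dW) <= seq
```

This is what makes gradients amenable to Logit Lens-style vocabulary projection: each rank-one term is built from vectors that live in spaces the lens already knows how to read.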
Enhanced Hallucination Detection in Neural Machine Translation through Simple Detector Aggregation
Paper • 2402.13331 • Published • 2Note This work proposes Simple Detectors Aggregation (STARE), an aggregation procedure to leverage hallucination detectors’ complementary strengths in the context of machine translation. Authors experiment with two popular hallucination detection benchmarks (LFAN-HALL and HalOmi), showing that an aggregation of detectors using only model internals can outperform ad-hoc trained hallucination detectors.
CausalGym: Benchmarking causal interpretability methods on linguistic tasks
Paper • 2402.12560 • Published • 3Note The authors introduce a revisited psycholinguistic benchmark to evaluate the effectiveness and reliability of intervention-based mechanistic interpretability methods across several linguistic tasks. Across several model sizes, Distributed alignment search (DAS) and probing are found to be the most reliable approaches, and are used to investigate the emergence of features linked to linguistically plausible predictions in the initial phases of model training.
In-Context Learning Demonstration Selection via Influence Analysis
Paper • 2402.11750 • Published • 2Note This work introduces InfICL, a demonstration selection method using influence functions to identify salient training examples to use as demonstrations at inference time. InfICL is tested alongside other example selection baselines for prompting medium-sized LLMs on CoLA and RTE, showing improvements over other methods especially when fewer in-context examples are used.
Recovering the Pre-Fine-Tuning Weights of Generative Models
Paper • 2402.10208 • Published • 6Note This paper introduces SpectralDeTuning, a method to recover the original pre-trained weights of a model from a set of LoRA fine-tunes with merged weights. Authors introduce the LoWRA Bench dataset to measure progress on this task, and show that the method performs well for both language and vision models. The current limitations of the approach are 1) assuming the attacker knows the rank used in the LoRAs and 2) requiring a substantial number of LoRA fine-tunes to reconstruct the original pre-trained weights effectively.
SyntaxShap: Syntax-aware Explainability Method for Text Generation
Paper • 2402.09259 • Published • 2Note Authors propose SyntaxShap, a variant of the model-agnostic SHAP approach enforcing tree-based coalitions based on the syntax of the explained sequence, while preserving most properties of SHAP explanations. The approach is found to be more faithful and semantically meaningful than other model-agnostic methods when explaining the predictions of LMs such as GPT-2 and Mistral, especially in edge cases such as negation.
Show Me How It's Done: The Role of Explanations in Fine-Tuning Language Models
Paper • 2402.07543 • Published • 2Note Authors propose a fine-tuning procedure with natural language explanations clarifying intermediate reasoning steps. Several LMs are fine-tuned on the ListOps dataset, containing synthetically generated instructions over sequences of numbers. Authors find that explanations improve model performance across all tested model sizes and explanation lengths, with smaller language models benefiting the most, especially from long-form explanations.
Model Editing with Canonical Examples
Paper • 2402.06155 • Published • 10Note This work introduces a model editing approach using individual “canonical” examples to showcase desired or unwanted behavior. The approach is tested on regular LMs and Backpack LMs, which are more controllable thanks to disentangled sense vector representations. For the latter, authors propose sense fine-tuning, i.e. updating a few sense vectors with canonical examples to apply desired changes efficiently and effectively, outperforming other model editing approaches as well as full and LoRA fine-tuning.
AttnLRP: Attention-Aware Layer-wise Relevance Propagation for Transformers
Paper • 2402.05602 • Published • 3Note This work extends the LRP feature attribution framework to handle Transformer-specific layers. Authors show that AttnLRP is significantly more faithful than other popular attribution methods, has minimal execution time requirements, and can be employed to identify model components associated with specific concepts in generated text.
Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models
Paper • 2402.04614 • Published • 3Note This work discusses the dichotomy between faithfulness and plausibility in LLMs' natural language self-explanations (SEs), such as CoT, counterfactual reasoning, and token importance, which tend to be plausible but unfaithful to the models' reasoning process. Authors call for a community effort to 1) develop reliable metrics characterizing the faithfulness of explanations and 2) pioneer novel strategies to generate more faithful SEs.
INSIDE: LLMs' Internal States Retain the Power of Hallucination Detection
Paper • 2402.03744 • Published • 4Note While most internals-based hallucination detection methods use surface-level information, this work proposes EigenScore, an internal measure of responses’ self-consistency using the eigenvalues of sampled responses' covariance matrix in intermediate model layers to quantify answers’ diversity in the dense embedding space. EigenScore outperforms logit-level methods for hallucination detection on QA tasks, especially with feature clipping to control overconfident generations.
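A simplified covariance-eigenvalue score of this kind can be sketched as follows. This is illustrative only: the paper's exact formulation, layer choice, and regularization may differ, and real inputs would be hidden states of sampled LLM responses.

```python
import numpy as np

def eigenscore(embeddings, alpha=1e-3):
    # embeddings: (K, d) hidden states of K sampled responses to one query.
    # Higher score = more diverse responses = more likely hallucination.
    K, d = embeddings.shape
    Z = embeddings - embeddings.mean(axis=0)   # center the samples
    cov = Z @ Z.T / d                          # (K, K) response covariance
    eigvals = np.linalg.eigvalsh(cov) + alpha  # regularize near-zero eigenvalues
    return np.mean(np.log(eigvals))            # log-determinant-style diversity

rng = np.random.default_rng(0)
# Ten near-identical responses vs. ten unrelated ones (synthetic stand-ins)
consistent = rng.normal(size=(1, 32)).repeat(10, axis=0) + 0.01 * rng.normal(size=(10, 32))
diverse = rng.normal(size=(10, 32))

assert eigenscore(consistent) < eigenscore(diverse)
```

Consistent answers collapse the covariance spectrum toward zero, while semantically diverse answers spread it out, which is what the score detects in the dense embedding space rather than at the logit level.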
ReAGent: Towards A Model-agnostic Feature Attribution Method for Generative Language Models
Paper • 2402.00794 • Published • 1Note Authors propose Recursive Attribution Generation (ReAGent), a perturbation-based feature attribution approach specifically conceived for generative LMs. The method employs a lightweight encoder LM to replace sampled input spans with valid alternatives and measures the effect of the perturbation as the drop in predicted next-token probability. ReAGent is shown to consistently outperform other established approaches across several models and generation tasks in terms of faithfulness.
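The perturb-and-measure loop can be sketched with a toy probability model standing in for the generative LM. In ReAGent the replacements are sampled from an encoder LM; here a generic filler token is used, and all names and scores are hypothetical.

```python
# Toy next-token model: p(next = "paris") grows with how many query-relevant
# words survive in the prompt (an illustrative stand-in for a real LM).
relevant = {"capital", "france"}

def p_next(tokens):
    return sum(t in relevant for t in tokens) / (len(tokens) or 1)

prompt = ["the", "capital", "of", "france", "is"]
base = p_next(prompt)

# Perturbation-based attribution: replace each token with an alternative
# and attribute to it the resulting drop in next-token probability.
importance = {}
for i, tok in enumerate(prompt):
    perturbed = prompt[:i] + ["the"] + prompt[i + 1:]
    importance[tok] = base - p_next(perturbed)

# Query-relevant tokens receive the highest attributions.
assert importance["capital"] > importance["of"]
assert importance["france"] > importance["is"]
```

ReAGent additionally applies this recursively over sampled spans rather than single tokens, but the faithfulness signal is the same probability drop shown here.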
A Chain-of-Thought Is as Strong as Its Weakest Link: A Benchmark for Verifiers of Reasoning Chains
Paper • 2402.00559 • Published • 3Note This work introduces a new methodology for human verification of reasoning chains and adopts it to annotate a dataset of chain-of-thought outputs produced by 3 LMs. The annotated dataset, REVEAL, can be used to benchmark automatic verifiers of reasoning in LMs. In their analysis, the authors find that LM-produced CoTs generally contain faulty steps, often leading to wrong automatic verification.
Rethinking Interpretability in the Era of Large Language Models
Paper • 2402.01761 • Published • 18Note In this opinion piece, authors contend that the new capabilities of LLMs can transform the scope of interpretability, moving from low-level explanations such as saliency maps and circuit analysis to natural language explanations. This goal is hindered by LLMs' natural tendency to hallucinate, their large size, and their inherent opaqueness. Authors highlight in particular dataset explanations for knowledge discovery, explanation reliability, and interactive explanations as key areas moving ahead.
Gradient-Based Language Model Red Teaming
Paper • 2401.16656 • Published • 1Note This work proposes Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts inducing an LM to output unsafe responses. In practice, prompts are learned by scoring LM responses with a safety-trained probing classifier, and back-propagating through frozen classifier and LM to update the prompt. GBRT prompts are shown to be more likely to generate unsafe responses and evade safety-tuning measures than those produced by RL-based methods.
Black-Box Access is Insufficient for Rigorous AI Audits
Paper • 2401.14446 • Published • 3Note Audits conducted on AI systems can identify potential risks and ensure their compliance with safety requirements. Authors categorise audits based on the access to model-related resources and highlight how greater transparency into the audited AI system enables broader and more effective auditing procedures. Technical, physical, and legal safeguards for performing audits are also introduced to ensure minimal security risks for audited companies.
The Calibration Gap between Model and Human Confidence in Large Language Models
Paper • 2401.13835 • Published • 4Note This work evaluates human confidence in LLM responses to multiple-choice MMLU questions, based on the explanations the LLM provides alongside its selected answers. The authors experiment with altering the model prompt to reflect the actual prediction confidence in the models' explanations, showing improved calibration in users' assessment of LLM reliability and a better ability to discriminate between correct and incorrect answers.
In-Context Language Learning: Architectures and Algorithms
Paper • 2401.12973 • Published • 4Note This work methodically evaluates in-context learning on formal languages across several model architectures, showing that Transformers work best in this setting. These results are attributed to the presence of “n-gram heads” able to retrieve and copy the token following a context already seen in the current context window. These insights are used to design static attention layers mimicking the behavior of n-gram heads, achieving lower perplexity despite the lower computational cost.
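The retrieval-and-copy behavior attributed to n-gram heads can be written out explicitly. This is an illustrative Python rendering of the behavior, not the paper's attention-layer construction:

```python
def ngram_head_copy(tokens, n=2):
    """Find the most recent earlier occurrence of the current (n-1)-token
    context and predict the token that followed it."""
    context = tuple(tokens[len(tokens) - (n - 1):])
    for i in range(len(tokens) - n, -1, -1):      # scan backwards
        if tuple(tokens[i:i + n - 1]) == context:
            return tokens[i + n - 1]
    return None                                   # context never seen before

# After "a b c x a b", a 2-gram (induction-style) head predicts "c"
assert ngram_head_copy(list("abcxab"), n=2) == "c"
# A 3-gram head matches the two-token context "a b" in "a b c a b"
assert ngram_head_copy(list("abcab"), n=3) == "c"
```

The n = 2 case is exactly the classic induction-head pattern; the paper's static attention layers hard-code this lookup instead of learning it.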
From Understanding to Utilization: A Survey on Explainability for Large Language Models
Paper • 2401.12874 • Published • 4Note This survey summarizes recent works in interpretability research, focusing mainly on pre-trained Transformer-based LMs. The authors categorize current approaches as either local or global and discuss popular applications of LM interpretability, such as model editing, enhancing model performance, and controlling LM generation.
LLMCheckup: Conversational Examination of Large Language Models via Interpretability Tools
Paper • 2401.12576 • Published • 2Note Authors introduce LLMCheckup, a conversational interface connecting an LLM to several interpretability tools (feature attribution methods, similarity, counterfactual/rationale generation) allowing users to inquire about LLM predictions using natural language. The interface consolidates several interpretability methods in a unified chat interface, simplifying future investigations into natural language explanations.
Universal Neurons in GPT2 Language Models
Paper • 2401.12181 • Published • 5Note This work investigates the universality of individual neurons across GPT2 models trained from different initial random seeds, starting from the assumption that such neurons are likely to exhibit interpretable patterns. 1-5% of neurons consistently activate for the same inputs and can be grouped into families exhibiting similar functional roles, e.g. modulating prediction entropy, deactivating attention heads, and promoting or suppressing vocabulary elements in the prediction.
Can Large Language Models Explain Themselves?
Paper • 2401.07927 • Published • 4Note This study uses self-consistency checks to measure the faithfulness of LLM explanations: if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without those words. Results demonstrate that the faithfulness of LLM self-explanations cannot be reliably trusted, as it proves highly task- and model-dependent, with bigger models generally producing more faithful explanations.
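The self-consistency check can be illustrated with a toy word-score classifier standing in for the LLM. All words and scores below are hypothetical; the paper applies the same logic to real LLM predictions and self-explanations.

```python
# Toy sentiment "model": predicts positive iff the summed word scores are > 0.
scores = {"great": 2.0, "good": 1.0, "boring": -1.5, "film": 0.0, "plot": 0.0}

def predict(words):
    return int(sum(scores.get(w, 0.0) for w in words) > 0)

sentence = ["great", "film", "boring", "plot"]
claimed_important = ["great"]   # words the model *says* drove its answer

original = predict(sentence)
ablated = predict([w for w in sentence if w not in claimed_important])

# Self-consistency check: a faithful explanation implies that removing the
# claimed-important words should change the prediction.
assert original == 1 and ablated == 0
```

When the prediction survives the ablation, the explanation named words the model did not actually rely on, which is the unfaithfulness the study measures at scale.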
Fine-grained Hallucination Detection and Editing for Language Models
Paper • 2401.06855 • Published • 3Note Authors introduce a new taxonomy for fine-grained annotation of hallucinations in LM generations and propose Factuality Verification with Augmented Knowledge (FAVA), a retrieval-augmented LM fine-tuned to detect and edit hallucinations in LM outputs, outperforming ChatGPT and Llama 2 Chat on both detection and editing tasks.
Patchscope: A Unifying Framework for Inspecting Hidden Representations of Language Models
Paper • 2401.06102 • Published • 18Note Patchscopes is a generalized framework for verbalizing the information contained in LM representations. This is achieved via a mid-forward patching operation inserting the information into an ad-hoc prompt aimed at eliciting model knowledge. Patchscope instances for vocabulary projection, feature extraction, and entity resolution in model representations are shown to outperform popular interpretability approaches, often yielding more robust and expressive information.
Model Editing Can Hurt General Abilities of Large Language Models
Paper • 2401.04700 • Published • 3Note This work raises concerns that gains in factual knowledge after model editing can result in a significant degradation of the general abilities of LLMs. Authors evaluate 4 popular editing methods on 2 LLMs across eight representative tasks, showing that model editing does substantially hurt models' general abilities.
Towards Best Practices of Activation Patching in Language Models: Metrics and Methods
Paper • 2309.16042 • Published • 3Note This work systematically examines the impact of methodological details in activation patching, a popular technique with causal guarantees for quantifying the importance of model components in driving model predictions. Authors provide several recommendations concerning the type of patching (noise vs. counterfactual), the metric to use (probability vs. logit vs. KL divergence), the number of layers to patch, and which tokens to corrupt.
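The basic counterfactual patching loop, together with a logit-difference metric, can be sketched on a toy two-layer model. This is illustrative only: real experiments patch Transformer activations via forward hooks, and the metric choices are the ones the paper compares.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-"layer" model standing in for a Transformer (illustrative weights).
W1 = rng.normal(size=(4, 4))
W2 = rng.normal(size=(2, 4))

def layer1(x):
    return np.tanh(W1 @ x)

def run(x, patch_h1=None):
    # optionally overwrite the intermediate activation (the "patch")
    h1 = layer1(x) if patch_h1 is None else patch_h1
    return W2 @ h1                     # output logits

def logit_diff(logits):
    # one of the metrics the paper compares (vs. probability, KL)
    return logits[0] - logits[1]

x_clean = rng.normal(size=4)
x_corrupt = rng.normal(size=4)

clean, corrupt = run(x_clean), run(x_corrupt)
# Counterfactual patching: run on the corrupted input, restoring the clean
# run's layer-1 activation to measure how much of the behavior it carries.
patched = run(x_corrupt, patch_h1=layer1(x_clean))

# Here patching layer 1 fully restores clean behavior, since everything
# downstream depends on x only through h1.
assert np.allclose(patched, clean)
print(logit_diff(clean), logit_diff(corrupt), logit_diff(patched))
```

In a real LM the patched activation rarely restores behavior completely; the fraction of the metric recovered is exactly the component-importance score whose design choices the paper examines.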