🔍 Daily Picks in Interpretability & Analysis of LMs
List of daily paper picks on the topics of interpretability and evaluation of language models' knowledge and abilities.
Paper • 2402.13331 • Published • 2
Note This work proposes Simple Detectors Aggregation (STARE), an aggregation procedure to leverage hallucination detectors’ complementary strengths in the context of machine translation. Authors experiment with two popular hallucination detection benchmarks (LFAN-HALL and HalOmi), showing that an aggregation of detectors using only model internals can outperform ad-hoc trained hallucination detectors.
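As a rough illustration of what aggregating complementary detectors can look like, here is a minimal sketch (illustrative names; the paper's actual STARE procedure may normalize and combine detector scores differently) that min-max normalizes each detector's scores over a corpus and averages them:

```python
import numpy as np

def aggregate_detectors(scores_per_detector):
    """Min-max normalize each detector's scores over the corpus and average them.
    Higher aggregated score = translation more likely to be hallucinated.
    Illustrative only: STARE may weight or combine detectors differently."""
    normalized = []
    for scores in scores_per_detector:
        s = np.asarray(scores, dtype=float)
        span = s.max() - s.min()
        normalized.append((s - s.min()) / span if span > 0 else np.zeros_like(s))
    return np.mean(normalized, axis=0)

# Hypothetical usage: three internals-based detectors scoring four translations
print(aggregate_detectors([[0.1, 0.9, 0.4, 0.2],
                           [2.0, 8.5, 3.0, 1.0],
                           [0.3, 0.7, 0.6, 0.1]]))
```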
Paper • 2402.12865 • Published • 1
Note This work extends Logit Lens vocabulary projections of FFNs in Transformers to gradients to study the knowledge editing performed by backward passes. Authors prove that a gradient matrix can be cast as a low-rank linear combination of its forward and backward passes’ inputs, and identify an imprint-and-shift mechanism driving knowledge updating in FFNs. Finally, an efficient editing method driven by the linearization above is evaluated, showing strong performance in simple editing settings.
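For context on why such a low-rank decomposition exists, the standard backpropagation rule for a single linear layer already has this structure (a textbook identity, not the paper's exact formulation):

```latex
% For a linear (FFN) layer y_t = W x_t applied over T token positions, the weight
% gradient is a sum of T rank-1 terms: forward inputs x_t weighted by backward
% signals \delta_t. This is the low-rank structure the note refers to.
\nabla_W \mathcal{L} \;=\; \sum_{t=1}^{T} \delta_t \, x_t^{\top},
\qquad \delta_t \;=\; \frac{\partial \mathcal{L}}{\partial y_t}
```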
Paper • 2402.11750 • Published • 2
Note This work introduces InfICL, a demonstration selection method using influence functions to identify salient training examples to use as demonstrations at inference time. InfICL is tested alongside other example selection baselines for prompting medium-sized LLMs on CoLA and RTE, showing improvements over other methods, especially when a smaller number of in-context examples is used.
Paper • 2402.10208 • Published • 5
Note This paper introduces SpectralDeTuning, a method to recover the original pre-trained weights of a model from a set of LoRA fine-tunes with merged weights. Authors introduce the LoWRA Bench dataset to measure progress on this task, and show that the method performs well for both language and vision models. The current limitations of the approach are 1) assuming the attacker knows the rank used in the LoRAs and 2) the need for a sizeable number of LoRA fine-tunes to reconstruct the original pre-trained weights effectively.
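To make the recovery problem concrete, below is a minimal alternating-minimization sketch under the model W_i = W* + Delta_i with rank(Delta_i) ≤ r. This is an assumption-laden illustration of the problem setup, not necessarily the SpectralDeTuning algorithm itself:

```python
import numpy as np

def recover_pretrained(finetuned_mats, rank, n_iters=50):
    """Alternating scheme for the recovery problem: model each fine-tuned matrix
    as W_i = W* + Delta_i with rank(Delta_i) <= rank, then alternate between
    fitting the low-rank residuals via truncated SVD and re-estimating the shared
    pre-trained matrix W*. Illustrative; not necessarily the paper's exact method."""
    W_star = np.mean(finetuned_mats, axis=0)
    for _ in range(n_iters):
        deltas = []
        for W in finetuned_mats:
            U, S, Vt = np.linalg.svd(W - W_star, full_matrices=False)
            deltas.append((U[:, :rank] * S[:rank]) @ Vt[:rank])
        W_star = np.mean([W - D for W, D in zip(finetuned_mats, deltas)], axis=0)
    return W_star
```

Convergence of such a scheme is not guaranteed in general, which is part of what the LoWRA Bench dataset is meant to measure in practice.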
Paper • 2402.09259 • Published • 2
Note Authors propose SyntaxSHAP, a variant of the model-agnostic SHAP approach enforcing tree-based coalitions based on the syntax of the explained sequence, while preserving most properties of SHAP explanations. The approach is found to be more faithful and semantically meaningful than other model-agnostic methods when explaining the predictions of LMs such as GPT-2 and Mistral, especially in edge cases such as negation.
Paper • 2402.07543 • Published • 2
Note Authors propose a fine-tuning procedure with natural language explanations to clarify intermediate reasoning steps. Several LMs are fine-tuned on the ListOps dataset, containing synthetically generated instructions over sequences of numbers. Authors find that explanations improve model performance across all tested model sizes and explanation lengths. Smaller language models benefit the most from explanations, especially long-form ones.
Paper • 2402.06155 • Published • 8
Note This work introduces a model editing approach using individual “canonical” examples to showcase desired/unwanted behavior. The approach is tested on regular LMs and Backpack LMs, which are more controllable thanks to disentangled sense vector representations. For the latter, authors propose sense fine-tuning, i.e. updating a few sense vectors with canonical examples to apply desired changes in an efficient and effective way, outperforming other model editing approaches and full/LoRA fine-tuning.
Paper • 2402.05602 • Published • 3
Note This work proposes AttnLRP, an extension of the LRP feature attribution framework that handles Transformer-specific layers. Authors show that AttnLRP is significantly more faithful than other popular attribution methods, has minimal time requirements for execution, and can be employed to identify model components associated with specific concepts in generated text.
Paper • 2402.04614 • Published • 3
Note This work discusses the dichotomy between faithfulness and plausibility in LLMs’ self-explanations (SEs) employing natural language (CoT, counterfactual reasoning, and token importance), which tend to be plausible but unfaithful to models' reasoning process. Authors call for a community effort to 1) develop reliable metrics to characterize the faithfulness of explanations and 2) pioneer novel strategies to generate more faithful SEs.
Paper • 2402.03744 • Published • 3
Note While most internals-based hallucination detection methods use surface-level information, this work proposes EigenScore, an internal measure of responses’ self-consistency using the eigenvalues of sampled responses' covariance matrix in intermediate model layers to quantify answers’ diversity in the dense embedding space. EigenScore outperforms logit-level methods for hallucination detection on QA tasks, especially with feature clipping to control overconfident generations.
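A minimal sketch of how such an eigenvalue-based diversity score can be computed from sampled responses (illustrative; the paper's exact centering, regularization, and layer choice may differ):

```python
import numpy as np

def eigen_diversity_score(hidden_states, alpha=1e-3):
    """hidden_states: (K, d) array with one intermediate-layer sentence embedding
    per sampled response. Returns the mean log-eigenvalue of the regularized
    covariance: larger = more diverse (less self-consistent) answers, which the
    paper links to hallucination. Illustrative; the exact formulation may differ."""
    Z = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    K = Z.shape[0]
    cov = Z @ Z.T / K + alpha * np.eye(K)   # K x K sample-space covariance
    return float(np.mean(np.log(np.linalg.eigvalsh(cov))))

# Hypothetical usage: 5 sampled answers embedded in 8 dimensions
rng = np.random.default_rng(0)
print(eigen_diversity_score(rng.normal(size=(5, 8))))
```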
Paper • 2402.01761 • Published • 16
Note In this opinion piece, authors contend that the new capabilities of LLMs can transform the scope of interpretability, moving from low-level explanations such as saliency maps and circuit analysis to natural language explanations. This goal is hindered by LMs’ natural tendency to hallucinate, their large size, and their inherent opaqueness. Authors highlight in particular dataset explanations for knowledge discovery, explanation reliability, and interactive explanations as key areas going forward.
Paper • 2402.00794 • Published • 1
Note Authors propose Recursive Attribution Generation (ReAGent), a perturbation-based feature attribution approach specifically conceived for generative LMs. The method employs a lightweight encoder LM to replace sampled input spans with valid alternatives and measures the effect of the perturbation as a drop in the next-token prediction probability. ReAGent is shown to consistently outperform other established approaches across several models and generation tasks in terms of faithfulness.
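A schematic of the perturb-and-measure loop described above (function names and signatures are hypothetical; the actual ReAGent implementation differs in how spans are sampled and replaced):

```python
import numpy as np

def perturbation_importance(tokens, next_token_prob, replace_fn, n_samples=10):
    """tokens: list of input tokens; next_token_prob(tokens) returns the probability
    of the target next token under the explained LM; replace_fn(tokens, i) returns a
    copy with position i replaced by a plausible alternative (ReAGent samples these
    from a lightweight masked LM). A token's importance = average drop in next-token
    probability when it is replaced. Hypothetical names, not the paper's API."""
    base = next_token_prob(tokens)
    importance = np.zeros(len(tokens))
    for i in range(len(tokens)):
        drops = [base - next_token_prob(replace_fn(tokens, i))
                 for _ in range(n_samples)]
        importance[i] = float(np.mean(drops))
    return importance
```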
Paper • 2402.00559 • Published • 3
Note This work introduces a new methodology for human verification of reasoning chains and adopts it to annotate a dataset of chain-of-thought reasoning chains produced by 3 LMs. The annotated dataset, REVEAL, can be used to benchmark automatic verifiers of reasoning in LMs. In their analysis, the authors find that LM-produced CoTs generally contain faulty steps often leading to wrong automatic verification.
Paper • 2401.16656 • Published • 1
Note This work proposes Gradient-Based Red Teaming (GBRT), a red teaming method for automatically generating diverse prompts inducing an LM to output unsafe responses. In practice, prompts are learned by scoring LM responses with a safety-trained probing classifier and back-propagating through the frozen classifier and LM to update the prompt. GBRT prompts are shown to be more likely to elicit unsafe responses and evade safety-tuning measures than those produced by RL-based methods.
Paper • 2401.14446 • Published • 3
Note Audits conducted on AI systems can identify potential risks and ensure their compliance with safety requirements. Authors categorise audits based on the access to model-related resources and highlight how levels of transparency on the audited AI system enable broader and more effective auditing procedures. Technical, physical, and legal safeguards for performing audits are also introduced to ensure minimal security risks for audited companies.
Paper • 2401.13835 • Published • 4
Note This work evaluates human confidence in LLM responses to multiple-choice MMLU questions based on the explanations the LLM provides together with its selected answers. The authors experiment with altering the model prompt to reflect the actual prediction confidence in models’ explanations, showing improved calibration of users’ assessments of the LLM’s reliability and a better ability to discriminate between correct and incorrect answers.
Paper • 2401.12973 • Published • 4
Note This work methodically evaluates in-context learning on formal languages across several model architectures, showing that Transformers work best in this setting. These results are attributed to the presence of “n-gram heads” able to retrieve the token following a context already seen in the current context window and copy it. These insights are used to design static attention layers mimicking the behavior of n-gram heads, leading to lower perplexity despite the lower computational cost.
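A toy sketch of the retrieval-and-copy behavior attributed to “n-gram heads” (purely illustrative; real heads implement this softly via attention rather than as an exact lookup):

```python
def ngram_head_predict(context, n=2):
    """Find the most recent earlier occurrence of the last n tokens and copy the
    token that followed it; return None if there is no match. A hard-coded stand-in
    for the soft retrieval behavior the paper attributes to n-gram heads."""
    if len(context) <= n:
        return None
    query = tuple(context[-n:])
    for start in range(len(context) - n - 1, -1, -1):
        if tuple(context[start:start + n]) == query:
            return context[start + n]
    return None

print(ngram_head_predict(["a", "b", "c", "a", "b"]))  # -> "c"
```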
Paper • 2401.12874 • Published • 2
Note This survey summarizes recent works in interpretability research, focusing mainly on pre-trained Transformer-based LMs. The authors categorize current approaches as either local or global and discuss popular applications of LM interpretability, such as model editing, enhancing model performance, and controlling LM generation.
Paper • 2401.12576 • Published • 2
Note Authors introduce LLMCheckup, a conversational interface connecting an LLM to several interpretability tools (feature attribution methods, similarity, counterfactual/rationale generation) allowing users to inquire about LLM predictions using natural language. The interface consolidates several interpretability methods in a unified chat interface, simplifying future investigations into natural language explanations.
Paper • 2401.12181 • Published • 5
Note This work investigates the universality of individual neurons across GPT2 models trained from different initial random seeds, starting from the assumption that such neurons are likely to exhibit interpretable patterns. 1-5% of neurons consistently activate for the same inputs, and can be grouped into families exhibiting similar functional roles, e.g. modulating prediction entropy, deactivating attention heads, and promoting/suppressing elements of the vocabulary in the prediction.
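One simple way to operationalize the cross-seed comparison described above (an illustrative sketch, not the authors' exact pipeline) is to compute, for each neuron in one model, its best activation correlation with any neuron in a model trained from a different seed:

```python
import numpy as np

def max_cross_seed_correlation(acts_a, acts_b):
    """acts_a, acts_b: (n_inputs, n_neurons) activation matrices from two models
    trained with different random seeds on the same inputs. For each neuron in
    model A, return its best absolute Pearson correlation with any neuron in
    model B; 'universal' neurons score close to 1. Illustrative only."""
    A = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    B = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    corr = A.T @ B / A.shape[0]          # (n_neurons_a, n_neurons_b)
    return np.abs(corr).max(axis=1)
```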
Paper • 2401.07927 • Published • 4
Note This study uses self-consistency checks to measure the faithfulness of LLM explanations: if an LLM says a set of words is important for making a prediction, then it should not be able to make the same prediction without these words. Results demonstrate that the faithfulness of LLM self-explanations cannot be reliably trusted, as it proves to be highly task- and model-dependent, with bigger models generally producing more faithful explanations.
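The underlying check can be sketched in a few lines (hypothetical helper names; real experiments also control for tokenization and prompt format):

```python
def explanation_is_consistent(text, important_words, predict):
    """predict(text) returns the model's label for the input. If the model claims
    `important_words` drove its prediction, removing them should change the
    prediction; an unchanged prediction signals an unfaithful self-explanation."""
    original = predict(text)
    ablated = predict(" ".join(w for w in text.split()
                               if w not in set(important_words)))
    return original != ablated  # True = consistent with the stated explanation
```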
Paper • 2401.06855 • Published • 3
Note Authors introduce a new taxonomy for fine-grained annotation of hallucinations in LM generations and propose Factuality Verification with Augmented Knowledge (FAVA), a retrieval-augmented LM fine-tuned to detect and edit hallucinations in LM outputs, outperforming ChatGPT and Llama 2 Chat on both detection and editing tasks.
Paper • 2401.06102 • Published • 18
Note Patchscopes is a generalized framework for verbalizing information contained in LM representations. This is achieved via a mid-forward patching operation inserting the information into an ad-hoc prompt aimed at eliciting model knowledge. Patchscope instances for vocabulary projection, feature extraction, and entity resolution in model representations are shown to outperform popular interpretability approaches, often yielding more robust and expressive information.
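A minimal sketch of the patching operation using a PyTorch forward hook (module names, shapes, and the choice of layer are assumptions; this is a sketch of the general idea, not the authors' implementation):

```python
import torch

def run_with_patched_hidden(model, target_inputs, layer_module, position, source_vector):
    """Run `model` on an inspection prompt (`target_inputs`, e.g. tokenizer output)
    while overwriting the hidden state produced by `layer_module` at token
    `position` with `source_vector`, a representation captured from a different
    forward pass. Assumes the layer returns hidden states of shape (batch, seq, d)."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, position, :] = source_vector  # in-place patch of the target position
        return output
    handle = layer_module.register_forward_hook(hook)
    try:
        with torch.no_grad():
            return model(**target_inputs)
    finally:
        handle.remove()
```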
Paper • 2401.04700 • Published • 2
Note This work raises concerns that gains in factual knowledge after model editing can result in a significant degradation of the general abilities of LLMs. Authors evaluate four popular editing methods on two LLMs across eight representative tasks, showing that model editing does substantially hurt models’ general abilities.
Paper • 2309.16042 • Published • 3
Note This work systematically examines the impact of methodological details in activation patching, a popular technique with causal guarantees to quantify the importance of model components in driving model predictions. Authors provide several recommendations concerning the type of patching (noise vs. counterfactual), the metric to use (probability vs. logit difference vs. KL divergence), the number of layers to patch, and which tokens to corrupt.
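To make the metric choices concrete, here is a small sketch comparing three common patching metrics computed on final-position logits (illustrative names; the paper's recommendations concern which of these to prefer in which setting):

```python
import torch
import torch.nn.functional as F

def patching_metrics(clean_logits, patched_logits, answer_id, wrong_id):
    """Given final-position logits from the clean and patched runs, compute
    (a) probability of the correct answer, (b) logit difference between the correct
    and a contrastive wrong answer, and (c) KL divergence of the patched
    distribution from the clean one."""
    answer_prob = F.softmax(patched_logits, dim=-1)[answer_id].item()
    logit_diff = (patched_logits[answer_id] - patched_logits[wrong_id]).item()
    kl = F.kl_div(F.log_softmax(patched_logits, dim=-1),
                  F.log_softmax(clean_logits, dim=-1),
                  log_target=True, reduction="sum").item()
    return {"answer_prob": answer_prob, "logit_diff": logit_diff, "kl_from_clean": kl}

# Hypothetical usage with random logits over a 50k-token vocabulary
clean, patched = torch.randn(50000), torch.randn(50000)
print(patching_metrics(clean, patched, answer_id=42, wrong_id=7))
```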
Paper • 2402.14811 • Published • 1