gsarti 
posted an update Feb 8
🔍 Today's pick in Interpretability & Analysis of LMs: Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models by C. Agarwal, S.H. Tanneru and H. Lakkaraju

This work discusses the dichotomy between faithfulness and plausibility in LLMs’ natural language self-explanations (SEs), covering chain-of-thought (CoT), counterfactual reasoning, and token importance. These explanations tend to be reasonable according to human understanding (plausible) but are not always aligned with the models’ actual reasoning processes (unfaithful).

The authors remark that the increase in plausibility, driven by the demand for a friendly conversational interface, might come at the expense of faithfulness. Given the faithfulness requirements of many high-stakes real-world settings, they suggest these be considered when designing and evaluating new explanation methodologies.

Finally, the authors call for a community effort to 1) develop reliable metrics to characterize the faithfulness of explanations and 2) pioneer novel strategies for generating more faithful SEs.

📄 Paper: Faithfulness vs. Plausibility: On the (Un)Reliability of Explanations from Large Language Models (2402.04614)

🔍 All daily picks in LM interpretability: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9