Towards A Rigorous Science of Interpretable Machine Learning Paper • 1702.08608 • Published Feb 28, 2017
Learning how to explain neural networks: PatternNet and PatternAttribution Paper • 1705.05598 • Published May 16, 2017
Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV) Paper • 1711.11279 • Published Nov 30, 2017
A Benchmark for Interpretability Methods in Deep Neural Networks Paper • 1806.10758 • Published Jun 28, 2018
Benchmarking Attribution Methods with Relative Feature Importance Paper • 1907.09701 • Published Jul 23, 2019
On the Relationship Between Explanation and Prediction: A Causal View Paper • 2212.06925 • Published Dec 13, 2022
Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models Paper • 2301.04213 • Published Jan 10, 2023
Proactive Agents for Multi-Turn Text-to-Image Generation Under Uncertainty Paper • 2412.06771 • Published Dec 9, 2024
State2Explanation: Concept-Based Explanations to Benefit Agent Learning and User Understanding Paper • 2309.12482 • Published Sep 21, 2023
How new data permeates LLM knowledge and how to dilute it Paper • 2504.09522 • Published Apr 2025