gsarti
posted an update Feb 28
πŸ” Today's pick in Interpretability & Analysis of LMs: CausalGym: Benchmarking causal interpretability methods on linguistic tasks by @aryaman D. Jurafsky @cgpotts

TL;DR: The authors introduce a revised benchmark to evaluate the effectiveness and reliability of intervention-based interpretability methods across several linguistic phenomena.


While several interpretability methods are currently used to discover task-relevant model components, their performance and reliability are seldom tested broadly.

This paper adapts the SyntaxGym benchmark, originally conceived for the study of psycholinguistic phenomena such as subject-verb agreement and garden-path sentences, to evaluate intervention-based interpretability methods. In practice, a faithful intervention on a model component is expected to cause a predictable change in the model's prediction (e.g. singular -> plural verb).
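To make the idea of such an interchange intervention concrete, here is a minimal toy sketch (not the paper's implementation, and with made-up weights): a tiny linear network in which one hidden unit happens to encode grammatical number, so patching that unit's activation from a "plural" run into a "singular" run flips the predicted verb form.

```python
import numpy as np

# Toy illustration only: hypothetical 2-d "embeddings" for a singular
# and a plural subject, chosen by hand for this sketch.
x_singular = np.array([1.0, 0.0])  # e.g. "The key ..."
x_plural = np.array([0.0, 1.0])    # e.g. "The keys ..."

# Hidden layer: by construction, unit 0 encodes grammatical number.
W1 = np.array([[1.0, -1.0],
               [0.5, 0.5]])
# Output layer: logits over the verb forms ["is", "are"].
W2 = np.array([[1.0, 0.0],
               [-1.0, 0.0]])

def forward(x, patch=None):
    """Run the toy model; optionally overwrite hidden unit 0 (the intervention)."""
    h = W1 @ x
    if patch is not None:
        h = h.copy()
        h[0] = patch  # interchange intervention on a single component
    return W2 @ h

# Cache the "number" activation from the plural input, then patch it
# into the singular run and check whether the prediction flips.
h_plural_number = (W1 @ x_plural)[0]

base = forward(x_singular)
patched = forward(x_singular, patch=h_plural_number)

verbs = ["is", "are"]
print("base prediction:", verbs[int(np.argmax(base))])       # "is"
print("patched prediction:", verbs[int(np.argmax(patched))]) # "are"
```

A faithful interpretability method should locate components where such patches reliably produce the expected counterfactual behaviour; CausalGym scores methods by exactly this kind of predictable prediction change.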

Various methods are benchmarked on Pythia models ranging from 14M to 6.9B parameters, finding Distributed Alignment Search (DAS) to consistently outperform other approaches, followed by probing. When control tasks are used to account for the expressivity of supervised methods, probing is found to be more reliable than DAS at larger model sizes.

The authors conclude with an evaluation of how features driving linguistically plausible behaviours emerge during model training. These features are observed to emerge in Pythia models after 1k training steps and to become progressively more complex over time.

📄 Paper: CausalGym: Benchmarking causal interpretability methods on linguistic tasks (2402.12560)
💻 Code: https://github.com/aryamanarora/causalgym
🔑 Dataset: aryaman/causalgym

πŸ” All daily picks: gsarti/daily-picks-in-interpretability-and-analysis-of-lms-65ae3339949c5675d25de2f9