ReAct: Synergizing Reasoning and Acting in Language Models • Paper • 2210.03629 • Published Oct 6, 2022 • 17
Reasoning Datasets • Collection • Distilled synthetic Reasoning datasets • 7 items • Updated 8 days ago • 50
Building Bridges, Not Walls -- Advancing Interpretability by Unifying Feature, Data, and Model Component Attribution • Paper • 2501.18887 • Published 11 days ago • 1
Propositional Interpretability in Artificial Intelligence • Paper • 2501.15740 • Published 15 days ago • 1
Sparse Autoencoders Trained on the Same Data Learn Different Features • Paper • 2501.16615 • Published 14 days ago • 1
AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders • Paper • 2501.17148 • Published 13 days ago • 1
Gemma Neogenesis 💎🌍🇮🇹 • Collection • Datasets and models for Neogenesis: Post-training recipe for improving Gemma 2 for a specific language. Notebook: https://t.ly/iuKdy • 11 items • Updated 22 days ago • 5
Enhancing Automated Interpretability with Output-Centric Feature Descriptions • Paper • 2501.08319 • Published 27 days ago • 10
Sometimes I am a Tree: Data Drives Unstable Hierarchical Generalization • Paper • 2412.04619 • Published Dec 5, 2024 • 1
Towards scientific discovery with dictionary learning: Extracting biological concepts from microscopy foundation models • Paper • 2412.16247 • Published Dec 20, 2024 • 1
Interpretability as Compression: Reconsidering SAE Explanations of Neural Activations with MDL-SAEs • Paper • 2410.11179 • Published Oct 15, 2024 • 1
Incremental Sentence Processing Mechanisms in Autoregressive Transformer Language Models • Paper • 2412.05353 • Published Dec 6, 2024 • 1
The LLM Language Network: A Neuroscientific Approach for Identifying Causally Task-Relevant Units • Paper • 2411.02280 • Published Nov 4, 2024 • 1
Inferring Functionality of Attention Heads from their Parameters • Paper • 2412.11965 • Published Dec 16, 2024 • 2