Gabriele Sarti

gsarti

AI & ML interests

Interpretability for generative language models

Organizations

Posts 40

view post
Post
1546
🔍 Today's pick in Interpretability & Analysis of LMs: by @aadityasingh T. Moskovitz, F. Hill, S. C. Y. Chan, A. M. Saxe ( @gatsbyunit )

This work proposes a new methodology inspired by optogenetics (dubbed "clamping") to perform targeted ablations during training to estimate the causal effect of specific interventions on mechanism formation.

Authors use this approach to study the formation of induction heads training a 2L attention-only transformer to label examples via context information.

Notable findings:

- The effects of induction heads are additive and redundant, with weaker heads compensating well for the ablation of a strong induction head in case the latter is ablated.
- Competition between induction heads might emerge as a product of optimization pressure to converge faster, but it is not strictly necessary as all heads eventually learn to solve the task.
- Previous token heads (PTH) influence induction heads in a many-to-many fashion, with any PTH eliciting above-chance prediction from a subsequent induction head
- Three subcircuits for induction are identified, respectively mixing token-label information (1 + 2), matching the previous occurrence of the current class in the context (3qk + 4), and copying the label of the matched class (3v + 5).
- The formation of induction heads is slowed down by a larger number of classes & labels, with more classes and more labels slowing down the formation of the matching and copying mechanisms, respectively. This may have implications when selecting a vocabulary size for LLMs: larger vocabularies lead to an increased compression ratio and longer contexts, but they might make copying more challenging by delaying the formation of induction heads.

💻 Code: https://github.com/aadityasingh/icl-dynamics

📄 Paper: What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation (2404.07129)

🔍 All daily picks: https://huggingface.co/collections/gsarti/daily-picks-in-interpretability-and-analysis-ofc-lms-65ae3339949c5675d25de2f9
view post
Post
1388
I'm super happy to co-organize the (Mechanistic) Interpretability social at #ICLR2024 with @nikhil07prakash ! 🔍

If you plan to attend, help us make this meetup awesome by filling the form below! 😄

📅 Wed, May 8, 12:45-2:15 PM
🔗 RSVP & share your ideas here: https://forms.gle/FWap4KW2ikdntjfb8