What needs to go right for an induction head? A mechanistic study of in-context learning circuits and their formation
Paper • arXiv 2404.07129
A larger vocabulary gives better compression, but it may delay the ICL phase change during training, because the induction head's copy subcircuit (Subcircuit C) is the slower part to learn.
Note that Subcircuit C performs the copy, and it learns more slowly when the targets are text labels than when they are classes (see Figure 6).
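For readers unfamiliar with what "copy" means here, the following is a toy sketch of the match-and-copy behavior usually attributed to an induction head. It is not the paper's model or training setup: the function name `induction_predict` and the string tokens are hypothetical, and the rule is hard-coded rather than learned. The final `return tokens[i + 1]` line plays the role of the copy step, which has more distinct outputs to handle as the vocabulary grows.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Toy match-and-copy rule (illustrative only, not a trained circuit).

    Finds the most recent earlier occurrence of the final token and
    returns the token that followed it, or None if there is no match.
    """
    if len(tokens) < 2:
        return None
    query = tokens[-1]
    # "Match" step: scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == query:
            # "Copy" step: emit the token that followed the matched one.
            return tokens[i + 1]
    return None


if __name__ == "__main__":
    # Context pairs (A -> x), (B -> y), then the query A.
    print(induction_predict(["A", "x", "B", "y", "A"]))
```

In a real transformer the copy is carried out by learned value and output weights rather than a list lookup, which is why it can take a long time to learn.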