Layer sharing?

#4
by Datdanboi25 - opened

If this model uses layer sharing, does that mean its inference compute cost is closer to that of a 300M model?

No. Our model and SmolLM2 135M have roughly the same effective layer count: SmolLM2 135M has around 30 layers, while ours is effectively 16×2 = 32 (16 unique layers, each applied twice).

The main difference is that our model is slightly wider, so inference costs a bit more. The gap is very small, though, and nowhere near the compute cost of a typical 300M dense model.
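To make the depth argument concrete, here is a rough back-of-the-envelope sketch, not the actual model configuration. It assumes a plain transformer block with about 12·d² parameters (attention plus a 4x MLP) and about 2 FLOPs per executed parameter per token. The function names and the width of 576 are placeholders chosen for illustration only.

```python
def block_params(width: int) -> int:
    # Rough estimate for one transformer block (an assumption, not a measured value):
    # ~4*d^2 for attention projections + ~8*d^2 for an MLP with 4x expansion.
    return 12 * width * width


def stack_cost(width: int, unique_layers: int, repeats: int = 1) -> dict:
    # Layer sharing: each unique block is stored once but executed `repeats` times.
    per_block = block_params(width)
    stored = unique_layers * per_block
    executed = unique_layers * repeats * per_block
    return {
        "effective_depth": unique_layers * repeats,
        "stored_params": stored,
        # ~2 FLOPs (multiply + add) per executed parameter per token.
        "flops_per_token": 2 * executed,
    }


# Hypothetical width, used only to compare the two layouts on equal footing.
WIDTH = 576

shared = stack_cost(WIDTH, unique_layers=16, repeats=2)    # 16x2 with sharing
unshared = stack_cost(WIDTH, unique_layers=32, repeats=1)  # 32 distinct layers

print("shared:  ", shared)
print("unshared:", unshared)
```

Under these assumptions, the 16×2 stack executes the same number of blocks per token as a 32-layer unshared stack of the same width, so the per-token compute matches while the stored block parameters are halved. Compute follows effective depth and width rather than the stored parameter count, which is why a slightly wider 16×2 model lands near a ~30-layer model and not near a 300M dense one.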
