hidden sizes

#3
by VictorSanh - opened

Was there a specific rationale behind the hidden size choices?

More specifically, people training on GPUs (at least) tend to favor sizes divisible by 128 (largely for hardware efficiency reasons), and intermediate_size is usually 4 × hidden_size.
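For concreteness, here is a quick check of the So400m dims (the width/MLP numbers quoted later in this thread) against those two conventions; just plain-Python arithmetic, nothing model-specific:

```python
# Sanity check of the So400m dims against the usual GPU-friendly conventions.
# The dims (1152 width, 4304 MLP width) are the ones quoted further down this thread.
hidden_size, intermediate_size = 1152, 4304

print(hidden_size % 128 == 0)           # True  -> the width is a multiple of 128
print(intermediate_size % 128 == 0)     # False -> the MLP width is not
print(intermediate_size / hidden_size)  # ~3.74, not the usual 4x ratio
```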

Thanks!

Google org

So400m is a "shape optimized" architecture from our Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design paper; the point is to use scaling laws to predict the optimal shapes. So that's where the numbers come from. We really just forgot to clamp to a multiple of 128, and I agree it would be nicer. In practice, we haven't found a big enough difference (on TPUs) to make it worth re-training the whole thing. But for future models, this is definitely on our mind.
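In case it helps, a minimal sketch of what "clamp to a multiple of 128" could look like; the helper name and the nearest-multiple rounding rule are assumptions for illustration, not something from the paper:

```python
# Illustrative only: clamp a scaling-law-predicted dim to a multiple of 128.
# The function name and tie-breaking rule are assumptions, not the paper's recipe.
def round_to_multiple(x: int, multiple: int = 128) -> int:
    """Round x to the nearest multiple of `multiple` (ties round up)."""
    return ((x + multiple // 2) // multiple) * multiple

print(round_to_multiple(4304))  # 4352 -- the So400m MLP width, nudged up
print(round_to_multiple(1152))  # 1152 -- the width is already a multiple of 128
```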

That makes sense, thanks for the answer!

VictorSanh changed discussion status to closed

@giffmana @VictorSanh

FYI I looked at this a while ago...

For the 150M (not used in SigLIP) there is a pretty big impact from using the predicted shapes in the paper directly, since the head dim was 55 or something similarly atypical by default, and the other dims weren't great either.

For the 400m, if you bump the hidden dim up to a multiple of 128 (4352), you don't see a drop in throughput; you essentially get a freebie, since those extra params and flops do not lower the throughput. This is on a GPU with fused mem-efficient/flash SDPA kernels.
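Rough timing sketch of what I mean, assuming PyTorch on a CUDA GPU in fp16; it only times the two MLP linears at 256 tokens (224 px / patch 14), not the full SigLIP model, so treat the numbers as illustrative rather than a proper benchmark:

```python
# Times just the ViT MLP block at the candidate widths; results depend heavily on
# the GPU, dtype, and batch/sequence shape, so this is a sketch, not a benchmark.
import torch
import torch.nn as nn

@torch.no_grad()
def time_mlp(width=1152, mlp_hidden=4304, tokens=256, batch=64, iters=50):
    mlp = nn.Sequential(
        nn.Linear(width, mlp_hidden),
        nn.GELU(),
        nn.Linear(mlp_hidden, width),
    ).cuda().half()
    x = torch.randn(batch, tokens, width, device="cuda", dtype=torch.half)
    for _ in range(10):  # warmup
        mlp(x)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        mlp(x)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # ms per forward

for mlp_hidden in (4304, 4352, 4224):  # original, bumped to x128, trimmed to x128
    print(mlp_hidden, f"{time_mlp(mlp_hidden=mlp_hidden):.3f} ms")
```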

Original configs

  • 150m - 880 width, 18 depth, 16 heads (head dim 55), 2320 hidden
  • 400m - 1152 width, 27 depth, 16 heads (head dim 72), 4304 hidden (EDIT: I had this wrong on my first pass)

So my alternate configs:

  • 150m - 896 width, 18 depth, 14 heads (head dim 64), 2304 hidden
  • 400m (a) - 1152 width, 27 depth, 16 heads (head dim 72), 4352 hidden
  • 400m (b) - 1152 width, 27 depth, 18 heads (head dim 64), 4352 hidden
  • 400m (c) - 1152 width, 27 depth, 16 heads (head dim 72), 4224 hidden
  • 400m (d) - 1152 width, 27 depth, 18 heads (head dim 64), 4224 hidden

For (a) & (b) of the 400m you end up with slightly more params & flops but essentially the same throughput. For (c)/(d) you gain a bit of speed but lose a few flops/params.
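For reference, a back-of-the-envelope param count for the encoder blocks of each 400m variant (attention + MLP weight matrices only, so no biases, norms, patch embed, or pooling head; the absolute numbers are approximate but the deltas line up):

```python
# Approximate encoder params: 4*d^2 for attention (qkv + out proj) plus 2*d*h for
# the MLP, times depth. Biases/norms/patch embed/head are ignored, so these are
# rough totals meant only for comparing the variants.
def block_params(width, depth, mlp_hidden):
    per_block = 4 * width ** 2 + 2 * width * mlp_hidden
    return depth * per_block

for name, cfg in {
    "orig (4304)": (1152, 27, 4304),
    "a/b  (4352)": (1152, 27, 4352),
    "c/d  (4224)": (1152, 27, 4224),
}.items():
    print(f"{name}: {block_params(*cfg) / 1e6:.1f}M")
# orig (4304): 411.1M
# a/b  (4352): 414.1M  -> slightly more params, essentially same throughput
# c/d  (4224): 406.1M  -> a few fewer params, slightly faster
```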

Google org

If I were to redo it, I would probably bless (b) because 64 is a really good head dim in my experience, and it's otherwise quite close too.
