Script for converting the checkpoint
Hi @gollark, could you share the script you used to convert the checkpoint to transformers? Thanks!
It's an ugly hack somewhere in this repo: https://github.com/osmarks/transformers-patch-siglip.
I forget exactly how I did it, but you need to copy some hardcoded dimensions out of BigVision (the Google repo used for training). In principle they could probably be read out of the checkpoints, but I didn't bother.
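On reading the dimensions out of the checkpoint: the BigVision checkpoint is a flat `.npz` of named arrays, so shapes can in principle be inspected directly. A best-effort sketch (my addition, not from the thread; the `"embedding/kernel"` key substring is an assumption about the BigVision layout, so check `ckpt.files` for the real names):

```python
import numpy as np

def infer_vision_dims(path):
    """Best-effort: read patch size and hidden size from a BigVision
    .npz checkpoint by looking for the patch-embedding kernel, which
    has shape (patch, patch, 3, hidden_size) in BigVision ViTs.

    The "embedding/kernel" key substring is an assumption about the
    checkpoint layout; print(ckpt.files) to see the actual names.
    """
    ckpt = np.load(path)
    for name in ckpt.files:
        arr = ckpt[name]
        if "embedding/kernel" in name and arr.ndim == 4:
            return {"patch_size": arr.shape[0], "hidden_size": arr.shape[-1]}
    raise KeyError("no patch-embedding kernel found; inspect ckpt.files")
```

Other dimensions (e.g. `intermediate_size` from the MLP kernels, `num_hidden_layers` from the encoder block names) could be recovered the same way.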
Ok thanks!
I think you mostly need https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/models/vit.py#L249, https://github.com/osmarks/transformers-patch-siglip/blob/bc1553b2230e624dbd9fece46b8431460e9d227a/src/transformers/models/siglip/convert_siglip_to_hf.py and the other dimensions in https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb.
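At its core, the conversion script linked above is a key-renaming pass that maps flattened BigVision parameter names onto the transformers state dict, plus the hardcoded config values. A minimal sketch of that idea (my addition; the two rules shown are illustrative assumptions, not verified against the checkpoint — the real, exhaustive mapping lives in `convert_siglip_to_hf.py`):

```python
def rename_key(name, rules):
    """Apply the first matching substring rule to a checkpoint key.

    `rules` is a list of (old, new) substring pairs, checked in order;
    keys with no matching rule are returned unchanged.
    """
    for old, new in rules:
        if old in name:
            return name.replace(old, new)
    return name

# Illustrative rules only; these names are assumptions about the
# BigVision/transformers layouts and the real list is much longer.
RULES = [
    ("img/embedding", "vision_model.embeddings.patch_embedding"),
    ("txt/Embed_0", "text_model.embeddings.token_embedding"),
]
```

On top of the renaming, the real script also transposes kernels and splits the fused attention projections, which is where the hardcoded dimensions come in.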
Yes, exactly. Essentially, download the checkpoint with `gsutil cp gs://big_vision/siglip/webli_en_so400m_384_58765454.npz ./`
I checked against the original weights that the dimensions are correct with:

```python
config.text_config.vocab_size = 32000
config.text_config.hidden_size = 1152
config.text_config.intermediate_size = 4304
config.text_config.num_hidden_layers = 27
config.text_config.num_attention_heads = 16
config.text_config.max_position_embeddings = 64
config.vision_config.hidden_size = 1152
config.vision_config.intermediate_size = 4304
config.vision_config.num_hidden_layers = 27
config.vision_config.num_attention_heads = 16
config.vision_config.image_size = 384
config.vision_config.patch_size = 14
```
(same as you did)
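As a sanity check (my addition, not from the thread): those vision-tower numbers are consistent with the "so400m" in the checkpoint name, since a standard pre-norm ViT encoder with hidden size 1152, MLP size 4304 and 27 layers works out to roughly 400M parameters. A back-of-the-envelope sketch, counting encoder blocks only (embeddings, pooling head and final norm ignored):

```python
def vit_encoder_params(hidden=1152, mlp=4304, layers=27):
    """Rough parameter count for a standard ViT encoder stack."""
    attn = 4 * (hidden * hidden + hidden)   # q, k, v, out projections + biases
    ffn = 2 * hidden * mlp + hidden + mlp   # up + down projections + biases
    norms = 2 * 2 * hidden                  # two LayerNorms, weight + bias each
    return layers * (attn + ffn + norms)

print(vit_encoder_params())  # 411466608, i.e. ~411M, matching "so400m"
```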