Script for converting the checkpoint
Hi @gollark, could you share the script you used to convert the checkpoint to transformers? Thanks!
It's an ugly hack somewhere in this repo: https://github.com/osmarks/transformers-patch-siglip.
I forget exactly how I did it, but you need to copy some hardcoded dimensions out of BigVision (the Google repo used for training). In principle they could probably be read out of the checkpoints, but I didn't bother.
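On reading the dimensions out of the checkpoint: the BigVision checkpoint is a flat `.npz` of named arrays, so shapes can in principle be inspected directly. A best-effort sketch (my addition, not from the thread; the `"embedding/kernel"` key substring is an assumption about the BigVision layout, so check `ckpt.files` for the real names):

```python
import numpy as np

def infer_vision_dims(path):
    """Best-effort: read patch size and hidden size from a BigVision
    .npz checkpoint by looking for the patch-embedding kernel, which
    has shape (patch, patch, 3, hidden_size) in BigVision ViTs.

    The "embedding/kernel" key substring is an assumption about the
    checkpoint layout; print(ckpt.files) to see the actual names.
    """
    ckpt = np.load(path)
    for name in ckpt.files:
        arr = ckpt[name]
        if "embedding/kernel" in name and arr.ndim == 4:
            return {"patch_size": arr.shape[0], "hidden_size": arr.shape[-1]}
    raise KeyError("no patch-embedding kernel found; inspect ckpt.files")
```

Other dimensions (e.g. `intermediate_size` from the MLP kernels, `num_hidden_layers` from the encoder block names) could be recovered the same way.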
Ok thanks!
I think you mostly need https://github.com/google-research/big_vision/blob/50622cb37bc42080e31309d0c47dc585425761f9/big_vision/models/vit.py#L249, https://github.com/osmarks/transformers-patch-siglip/blob/bc1553b2230e624dbd9fece46b8431460e9d227a/src/transformers/models/siglip/convert_siglip_to_hf.py and the other dimensions in https://colab.research.google.com/github/google-research/big_vision/blob/main/big_vision/configs/proj/image_text/SigLIP_demo.ipynb.
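At its core, the conversion script linked above is a key-renaming pass that maps flattened BigVision parameter names onto the transformers state dict, plus the hardcoded config values. A minimal sketch of that idea (my addition; the two rules shown are illustrative assumptions, not verified against the checkpoint — the real, exhaustive mapping lives in `convert_siglip_to_hf.py`):

```python
def rename_key(name, rules):
    """Apply the first matching substring rule to a checkpoint key.

    `rules` is a list of (old, new) substring pairs, checked in order;
    keys with no matching rule are returned unchanged.
    """
    for old, new in rules:
        if old in name:
            return name.replace(old, new)
    return name

# Illustrative rules only; these names are assumptions about the
# BigVision/transformers layouts and the real list is much longer.
RULES = [
    ("img/embedding", "vision_model.embeddings.patch_embedding"),
    ("txt/Embed_0", "text_model.embeddings.token_embedding"),
]
```

On top of the renaming, the real script also transposes kernels and splits the fused attention projections, which is where the hardcoded dimensions come in.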
Yes, exactly. Essentially, download the checkpoint with `gsutil cp gs://big_vision/siglip/webli_en_so400m_384_58765454.npz ./`
I checked against the original weights that the dimensions are correct with:

```python
config.text_config.vocab_size = 32000
config.text_config.hidden_size = 1152
config.text_config.intermediate_size = 4304
config.text_config.num_hidden_layers = 27
config.text_config.num_attention_heads = 16
config.text_config.max_position_embeddings = 64
config.vision_config.hidden_size = 1152
config.vision_config.intermediate_size = 4304
config.vision_config.num_hidden_layers = 27
config.vision_config.num_attention_heads = 16
config.vision_config.image_size = 384
config.vision_config.patch_size = 14
```
(same as you did)
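As a sanity check (my addition, not from the thread): those vision-tower numbers are consistent with the "so400m" in the checkpoint name, since a standard pre-norm ViT encoder with hidden size 1152, MLP size 4304 and 27 layers works out to roughly 400M parameters. A back-of-the-envelope sketch, counting encoder blocks only (embeddings, pooling head and final norm ignored):

```python
def vit_encoder_params(hidden=1152, mlp=4304, layers=27):
    """Rough parameter count for a standard ViT encoder stack."""
    attn = 4 * (hidden * hidden + hidden)   # q, k, v, out projections + biases
    ffn = 2 * hidden * mlp + hidden + mlp   # up + down projections + biases
    norms = 2 * 2 * hidden                  # two LayerNorms, weight + bias each
    return layers * (attn + ffn + norms)

print(vit_encoder_params())  # 411466608, i.e. ~411M, matching "so400m"
```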