This will not work

#3
by rwightman (HF staff)

I don't want to be a downer here, but this approach to SigLIP models will not work; they are not compatible with CLIP. It is misleading to suggest that it could work. It's best to wait for https://github.com/huggingface/transformers/pull/26522 to land, at which point SigLIP models will be available in Transformers.

  • CLIP ViT has no bias on the patch embed and has an extra pre-norm after the patch embed & pos embed, so the signal is irreconcilably wrong for this approach after the first layer (see the sketch after this list)
  • SigLIP uses attention pooling at the end of the vision model, which has no counterpart in CLIP and is a significant difference
  • The pooling at the end of the text model is notably different too
  • Text tokenization requires a bit of extra cleaning code for full accuracy, so it is being implemented as a new tokenizer in the Transformers port; in OpenCLIP it was handled as part of a tokenizer wrapper.
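
If you want to see the structural differences concretely, here is a minimal sketch using timm to instantiate the two vision towers and inspect them. This is not part of the original discussion: the model names (`vit_base_patch16_clip_224`, `vit_base_patch16_siglip_224`) and attribute names assume a recent timm release that includes the SigLIP ViT definitions.

```python
# Sketch only: compares the CLIP and SigLIP ViT-B/16 vision towers as built by timm.
# Assumes a timm version recent enough to define the SigLIP models and the
# norm_pre / attn_pool attributes on VisionTransformer.
import timm

clip_vit = timm.create_model('vit_base_patch16_clip_224', pretrained=False)
siglip_vit = timm.create_model('vit_base_patch16_siglip_224', pretrained=False)

# CLIP: no bias on the patch embed conv, an extra pre-norm, no attention pooling.
print(clip_vit.patch_embed.proj.bias)    # None
print(type(clip_vit.norm_pre).__name__)  # LayerNorm
print(clip_vit.attn_pool)                # None (pools via the class token instead)

# SigLIP: biased patch embed conv, no pre-norm, attention pooling at the end.
print(siglip_vit.patch_embed.proj.bias.shape)    # torch.Size([768])
print(type(siglip_vit.norm_pre).__name__)        # Identity
print(type(siglip_vit.attn_pool).__name__)       # AttentionPoolLatent
```

Even if SigLIP weights were forced into the CLIP layout, these mismatched modules are why the outputs diverge after the first block.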
