This model card hosts the checkpoint for the repo https://github.com/TinyVolt/multimodal-patch-embeddings, which contains the code for distilling a 21.3M-parameter ViT student using the CLIP ViT-B-32 model as the teacher. The model was trained on about 3 million images.
What makes this model special is that each image patch embedding lies in the same embedding space as the final image embedding; in fact, the final embedding is a convex combination of the patch embeddings. This makes it possible to compare a text embedding against each of the 64 image patch embeddings individually.
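The sketch below illustrates the idea under stated assumptions: it is not the repo's actual API, and the tensor names and shapes (64 patches, 512-dimensional embeddings, per-patch convex weights) are illustrative placeholders. It shows how a final embedding formed as a convex sum of patch embeddings can be compared against a text embedding at both the image level and the per-patch level.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes/values for illustration only; the real model produces
# these tensors from an image and the CLIP teacher's text encoder.
patch_emb = torch.randn(1, 64, 512)                   # (batch, patches, dim)
weights = torch.softmax(torch.randn(1, 64), dim=-1)   # convex weights, sum to 1

# Final image embedding as a convex combination of the patch embeddings.
image_emb = (weights.unsqueeze(-1) * patch_emb).sum(dim=1)   # (batch, dim)

# A text embedding in the same space (here just random for the sketch).
text_emb = torch.randn(1, 512)

# Because patches live in the same space as the final embedding, the text
# embedding can be scored against every patch individually...
patch_sims = F.cosine_similarity(patch_emb, text_emb.unsqueeze(1), dim=-1)  # (batch, 64)

# ...as well as against the pooled image embedding.
image_sim = F.cosine_similarity(image_emb, text_emb, dim=-1)                # (batch,)
```

The per-patch similarities can then be reshaped into an 8x8 grid to visualize which regions of the image best match a given text query.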