clip-ViT-L-14

This is the Image & Text model CLIP, which maps text and images to a shared vector space. For applications of the models, have a look in our documentation SBERT.net - Image Search

Usage

After installing sentence-transformers (pip install sentence-transformers), the usage of this model is easy:

from sentence_transformers import SentenceTransformer, util
from PIL import Image

#Load CLIP model
model = SentenceTransformer('clip-ViT-L-14')

#Encode an image:
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

#Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

#Compute cosine similarities 
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)

See our SBERT.net - Image Search documentation for more examples how the model can be used for image search, zero-shot image classification, image clustering and image deduplication.

Performance

In the following table we find the zero-shot ImageNet validation set accuracy:

Model Top 1 Performance
clip-ViT-B-32 63.3
clip-ViT-B-16 68.1
clip-ViT-L-14 75.4

For a multilingual version of the CLIP model for 50+ languages have a look at: clip-ViT-B-32-multilingual-v1

Downloads last month

-

Downloads are not tracked for this model. How to track
Inference Providers NEW
This model is not currently available via any of the supported third-party Inference Providers, and the model is not deployed on the HF Inference API.

Model tree for sentence-transformers/clip-ViT-L-14

Finetunes
1 model

Spaces using sentence-transformers/clip-ViT-L-14 8