After installing sentence-transformers (
pip install sentence-transformers), the usage of this model is easy:
from sentence_transformers import SentenceTransformer, util from PIL import Image #Load CLIP model model = SentenceTransformer('clip-ViT-B-16') #Encode an image: img_emb = model.encode(Image.open('two_dogs_in_snow.jpg')) #Encode text descriptions text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night']) #Compute cosine similarities cos_scores = util.cos_sim(img_emb, text_emb) print(cos_scores)
See our SBERT.net - Image Search documentation for more examples how the model can be used for image search, zero-shot image classification, image clustering and image deduplication.
In the following table we find the zero-shot ImageNet validation set accuracy:
For a multilingual version of the CLIP model for 50+ languages have a look at: clip-ViT-B-32-multilingual-v1
- Downloads last month