---
library_name: sentence-transformers
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
pipeline_tag: sentence-similarity
---

# clip-ViT-B-32

This is the image and text model [CLIP](https://arxiv.org/abs/2103.00020), which maps text and images to a shared vector space. For applications of the model, have a look at our [SBERT.net - Image Search](https://www.sbert.net/examples/applications/image-search/README.html) documentation.

## Usage

After installing [sentence-transformers](https://sbert.net) (`pip install sentence-transformers`), using this model is straightforward:

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Load the CLIP model
model = SentenceTransformer('clip-ViT-B-32')

# Encode an image
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))

# Encode text descriptions
text_emb = model.encode(['Two dogs in the snow', 'A cat on a table', 'A picture of London at night'])

# Compute cosine similarities between the image and each text
cos_scores = util.cos_sim(img_emb, text_emb)
print(cos_scores)
```

See our [SBERT.net - Image Search](https://www.sbert.net/examples/applications/image-search/README.html) documentation for more examples of how the model can be used for image search, zero-shot image classification, image clustering, and image deduplication.
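For instance, zero-shot image classification reduces to encoding one short caption per candidate class and selecting the label whose embedding has the highest cosine similarity with the image embedding. A minimal sketch (the labels and the image file below are placeholders):

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

model = SentenceTransformer('clip-ViT-B-32')

# Candidate classes, phrased as short captions (placeholders for illustration)
labels = ['a photo of a dog', 'a photo of a cat', 'a photo of a car']
label_emb = model.encode(labels)

# Encode the image and pick the label with the highest cosine similarity
img_emb = model.encode(Image.open('two_dogs_in_snow.jpg'))
scores = util.cos_sim(img_emb, label_emb)[0]
print(labels[int(scores.argmax())])
```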

## Performance

The following table shows the zero-shot top-1 accuracy on the ImageNet validation set:

| Model | Top-1 Accuracy (%) |
| --- | :---: |
| [clip-ViT-B-32](https://huggingface.co/sentence-transformers/clip-ViT-B-32) | 63.3 |
| [clip-ViT-B-16](https://huggingface.co/sentence-transformers/clip-ViT-B-16) | 68.1 |
| [clip-ViT-L-14](https://huggingface.co/sentence-transformers/clip-ViT-L-14) | 75.4 |

For a multilingual version of the CLIP model supporting 50+ languages, have a look at [clip-ViT-B-32-multilingual-v1](https://huggingface.co/sentence-transformers/clip-ViT-B-32-multilingual-v1).
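
In that setup, the multilingual model encodes the text queries while this model encodes the images, since both map into the same vector space. A minimal sketch, assuming a local placeholder image file:

```python
from sentence_transformers import SentenceTransformer, util
from PIL import Image

# Images are encoded with the original CLIP image model ...
img_model = SentenceTransformer('clip-ViT-B-32')
# ... while text in 50+ languages is encoded with the multilingual text model
text_model = SentenceTransformer('clip-ViT-B-32-multilingual-v1')

img_emb = img_model.encode(Image.open('two_dogs_in_snow.jpg'))  # placeholder image
text_emb = text_model.encode(['Zwei Hunde im Schnee', 'Deux chiens dans la neige'])

print(util.cos_sim(img_emb, text_emb))
```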