---
license: apache-2.0
pipeline_tag: feature-extraction
tags:
- clip
- vision
datasets:
- Ziyang/yfcc15m
- conceptual_captions
---
<h1 align="center">UForm</h1>
<h3 align="center">
Pocket-Sized Multimodal AI<br/>
For Content Understanding and Generation<br/>
In Python, JavaScript, and Swift<br/>
</h3>

---

The `uform3-image-text-english-small` UForm model pairs a tiny vision encoder with an English-language text encoder, mapping images and text into a shared vector space.
The model produces up to __256-dimensional embeddings__ and is made of:

* Text encoder: 4-layer BERT for up to 64 input tokens.
* Visual encoder: ViT-S/16 for images of 224 x 224 resolution.

Unlike most CLIP-like multimodal models, this model shares two layers between the text and visual encoders, allowing for more data- and parameter-efficient training.
Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code.
If you need a larger, more accurate, or multilingual model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).

## Evaluation

On zero-shot ImageNet classification, the model achieves 36.1% Top-1 and 60.8% Top-5 accuracy.
On zero-shot text-to-image retrieval, it reaches 86% Recall@10 on Flickr:

| Dataset           | Recall@1 | Recall@5 | Recall@10 |
| :---------------- | -------: | -------: | --------: |
| Zero-Shot Flickr  |    0.565 |    0.790 |     0.860 |
| Zero-Shot MS-COCO |    0.281 |    0.525 |     0.645 |
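
Recall@K measures the fraction of text queries whose matching image appears among the top-K retrieved results. A minimal NumPy sketch of the metric (an illustration of how such numbers are computed, not the official evaluation code; it assumes query `i`'s correct match is gallery item `i`):

```python
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """similarity: (num_queries, num_gallery) score matrix,
    where the correct match for query i is gallery item i."""
    # Indices of the k highest-scoring gallery items per query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # A query is a "hit" if its own index appears in its top-k.
    hits = (top_k == np.arange(similarity.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Toy 3x3 similarity matrix: queries 0 and 1 rank their match first,
# query 2 ranks its match second.
sim = np.array([[0.9, 0.1, 0.0],
                [0.2, 0.8, 0.1],
                [0.1, 0.3, 0.2]])
r1 = recall_at_k(sim, 1)   # 2 of 3 queries hit at K=1
r2 = recall_at_k(sim, 2)   # all 3 hit at K=2
```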

## Installation

```bash
pip install "uform[torch,onnx]"
```

## Usage

To load the model:

```python
from uform import get_model, Modality

import requests
from io import BytesIO
from PIL import Image

model_name = 'unum-cloud/uform3-image-text-english-small'
modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
processors, models = get_model(model_name, modalities=modalities)

model_text = models[Modality.TEXT_ENCODER]
model_image = models[Modality.IMAGE_ENCODER]
processor_text = processors[Modality.TEXT_ENCODER]
processor_image = processors[Modality.IMAGE_ENCODER]
```

To encode the content:

```python
text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
image = Image.open(BytesIO(requests.get(image_url).content))

image_data = processor_image(image)
text_data = processor_text(text)
image_features, image_embedding = model_image.encode(image_data, return_features=True)
text_features, text_embedding = model_text.encode(text_data, return_features=True)
```