ashvardanian committed on
Commit 34ed1bc
1 Parent(s): c0b99ad

Update README.md

Files changed (1)
  1. README.md +33 -82
README.md CHANGED
@@ -10,38 +10,38 @@ datasets:
  ---
  <h1 align="center">UForm</h1>
  <h3 align="center">
- Multi-Modal Inference Library<br/>
- For Semantic Search Applications<br/>
  </h3>

  ---

- UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

- This is the model card of the __English-only model__ with:

- * 4-layer BERT (2 layers for unimodal encoding, the rest for multimodal encoding)
- * ViT-S/16 (image resolution is 224x224)
-
-
- If you need a multilingual model, check [this](https://huggingface.co/unum-cloud/uform-vl-multilingual).

  ## Evaluation

- The following metrics were obtained with multimodal re-ranking (text-to-image retrieval):

  | Dataset | Recall@1 | Recall@5 | Recall@10 |
  | :------ | -------: | -------: | --------: |
  | Zero-Shot Flickr | 0.565 | 0.790 | 0.860 |
  | Zero-Shot MS-COCO | 0.281 | 0.525 | 0.645 |

- ImageNet-Top1: 0.361 \
- ImageNet-Top5: 0.608
-
  ## Installation

  ```bash
- pip install uform[torch]
  ```

  ## Usage
@@ -49,82 +49,33 @@ pip install uform[torch]
  To load the model:

  ```python
- import uform
-
- model, processor = uform.get_model('unum-cloud/uform-vl-english-small')
- ```
-
- To encode data:

- ```python
  from PIL import Image

- text = 'a small red panda in a zoo'
- image = Image.open('red_panda.jpg')
-
- image_data = processor.preprocess_image(image)
- text_data = processor.preprocess_text(text)

- image_features, image_embedding = model.encode_image(image_data, return_features=True)
- text_features, text_embedding = model.encode_text(text_data, return_features=True)
- joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
  ```

- To get features:

  ```python
- image_features, image_embedding = model.encode_image(image_data, return_features=True)
- text_features, text_embedding = model.encode_text(text_data, return_features=True)
  ```

- These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:
-
- ```python
- joint_embedding = model.encode_multimodal(
-     image_features=image_features,
-     text_features=text_features,
-     attention_mask=text_data['attention_mask']
- )
- ```
-
- There are two options to calculate semantic compatibility between an image and a text: [Cosine Similarity](#cosine-similarity) and [Matching Score](#matching-score).
-
- ### Cosine Similarity
-
- ```python
- import torch.nn.functional as F
-
- similarity = F.cosine_similarity(image_embedding, text_embedding)
- ```
-
- The `similarity` will belong to the `[-1, 1]` range, `1` meaning a perfect match.
-
- __Pros__:
-
- - Computationally cheap.
- - Only unimodal embeddings are required, and unimodal encoding is faster than joint encoding.
- - Suitable for retrieval in large collections.
-
- __Cons__:
-
- - Takes into account only coarse-grained features.
-
- ### Matching Score
-
- Unlike with cosine similarity, unimodal embeddings are not enough.
- A joint embedding is needed, and the resulting `score` will belong to the `[0, 1]` range, `1` meaning a perfect match.
-
- ```python
- score = model.get_matching_scores(joint_embedding)
- ```
-
- __Pros__:
-
- - Joint embedding captures fine-grained features.
- - Suitable for re-ranking – sorting retrieval results.
-
- __Cons__:
-
- - Resource-intensive.
- - Not suitable for retrieval in large collections.

  ---
  <h1 align="center">UForm</h1>
  <h3 align="center">
+ Multi-Modal Pocket-Sized AI<br/>
+ For Content Understanding and Generation<br/>
+ In Python, JavaScript, and Swift<br/>
  </h3>

  ---

+ The `uform3-image-text-english-small` UForm model pairs a tiny vision encoder with an English text encoder, mapping images and texts into a shared vector space.
+ This model is made of:

+ * Text encoder: 4-layer BERT.
+ * Visual encoder: ViT-S/16 for images of 224x224 resolution.

+ Unlike most CLIP-like multimodal models, this model shares 2 layers between the text and visual encoders to allow for more data- and parameter-efficient training.
+ Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code.
+ If you need a larger, more accurate, or multilingual model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
+ For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).

  ## Evaluation

+ For zero-shot ImageNet classification the model achieves a Top-1 accuracy of 36.1% and a Top-5 accuracy of 60.8%.
+ On text-to-image retrieval it reaches 86% Recall@10 on Flickr:

  | Dataset | Recall@1 | Recall@5 | Recall@10 |
  | :------ | -------: | -------: | --------: |
  | Zero-Shot Flickr | 0.565 | 0.790 | 0.860 |
  | Zero-Shot MS-COCO | 0.281 | 0.525 | 0.645 |

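+ For reference, Recall@K counts a text query as a hit when its ground-truth image appears among the K most similar images. A minimal sketch of the metric (hypothetical helper, assuming one matching image per text and a precomputed text-to-image cosine-similarity matrix):

+ ```python
+ import torch
+
+ def recall_at_k(similarity: torch.Tensor, k: int) -> float:
+     # similarity[i, j] is the cosine similarity between text i and image j;
+     # the ground-truth image for text i is assumed to sit at index i.
+     top_k = similarity.topk(k, dim=1).indices                  # (num_texts, k)
+     targets = torch.arange(similarity.size(0)).unsqueeze(1)    # (num_texts, 1)
+     return (top_k == targets).any(dim=1).float().mean().item()
+ ```
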
  ## Installation

  ```bash
+ pip install "uform[torch,onnx]"
  ```

  ## Usage

  To load the model:

  ```python
+ from uform import get_model, Modality

+ import requests
+ from io import BytesIO
  from PIL import Image

+ model_name = 'unum-cloud/uform3-image-text-english-small'
+ modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
+ processors, models = get_model(model_name, modalities=modalities)

+ model_text = models[Modality.TEXT_ENCODER]
+ model_image = models[Modality.IMAGE_ENCODER]
+ processor_text = processors[Modality.TEXT_ENCODER]
+ processor_image = processors[Modality.IMAGE_ENCODER]
  ```

+ To encode the content:

  ```python
+ text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
+ image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
+ image = Image.open(BytesIO(requests.get(image_url).content))
+
+ image_data = processor_image(image)
+ text_data = processor_text(text)
+ image_features, image_embedding = model_image.encode(image_data, return_features=True)
+ text_features, text_embedding = model_text.encode(text_data, return_features=True)
  ```
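
+ The `image_embedding` and `text_embedding` share a vector space, so their cosine similarity measures how well the caption matches the picture, with `1` meaning a perfect match. A minimal sketch, reusing the variables above and assuming the PyTorch backend:

+ ```python
+ import torch.nn.functional as F
+
+ # Embeddings are shaped (1, dim); cosine_similarity reduces over the last axis.
+ similarity = F.cosine_similarity(image_embedding, text_embedding)
+ ```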