ashvardanian committed on
Commit 58707fb
1 Parent(s): 039f477

Update README.md

Files changed (1)
  1. README.md +38 -85
README.md CHANGED
@@ -5,41 +5,45 @@ tags:
- clip
- vision
datasets:
- - sbu_captions
- - visual_genome
- - ChristophSchuhmann/MS_COCO_2017_URL_TEXT
---
<h1 align="center">UForm</h1>
<h3 align="center">
- Multi-Modal Inference Library<br/>
- For Semantic Search Applications<br/>
</h3>

---

- UForm is a Multi-Modal Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

- This is model card of the __English only model__ with:

- * 4 layers BERT (2 layers for unimodal encoding and rest layers for multimodal encoding)
- * ViT-B/16 (image resolution is 224x224)
-
-
- If you need Multilingual model, check [this](https://huggingface.co/unum-cloud/uform-vl-multilingual).

## Evaluation

- The following metrics were obtained with multimodal re-ranking:

| Dataset | Recall@1 | Recall@5 | Recall@10 |
| :-------- | ------: | --------: | --------: |
| Zero-Shot Flickr | 0.727 | 0.915 | 0.949 |
- | MS-COCO (train split was in training data) | 0.510 | 0.761 | 0.838 |

## Installation

```bash
- pip install uform[torch]
```

## Usage
@@ -47,82 +51,31 @@ pip install uform[torch]
To load the model:

```python
- import uform
-
- model, processor = uform.get_model('unum-cloud/uform-vl-english')
- ```
-
- To encode data:
-
- ```python
from PIL import Image

- text = 'a small red panda in a zoo'
- image = Image.open('red_panda.jpg')
-
- image_data = processor.preprocess_image(image)
- text_data = processor.preprocess_text(text)
-
- image_features, image_embedding = model.encode_image(image_data, return_features=True)
- text_features, text_embedding = model.encode_text(text_data, return_features=True)
- ```
-
- To get features:
-
- ```python
- image_features, image_embedding = model.encode_image(image_data, return_features=True)
- text_features, text_embedding = model.encode_text(text_data, return_features=True)
- ```
-
- These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:
-
- ```python
- joint_embedding = model.encode_multimodal(
-     image_features=image_features,
-     text_features=text_features,
-     attention_mask=text_data['attention_mask']
- )
- ```
-
- There are two options to calculate semantic compatibility between an image and a text: [Cosine Similarity](#cosine-similarity) and [Matching Score](#matching-score).
-
- ### Cosine Similarity
-
- ```python
- import torch.nn.functional as F

- similarity = F.cosine_similarity(image_embedding, text_embedding)
```

- The `similarity` will belong to the `[-1, 1]` range, `1` meaning the absolute match.
-
- __Pros__:
-
- - Computationally cheap.
- - Only unimodal embeddings are required, unimodal encoding is faster than joint encoding.
- - Suitable for retrieval in large collections.
-
- __Cons__:
-
- - Takes into account only coarse-grained features.
-
-
- ### Matching Score
-
- Unlike cosine similarity, unimodal embedding are not enough.
- Joint embedding will be needed and the resulting `score` will belong to the `[0, 1]` range, `1` meaning the absolute match.

```python
- score = model.get_matching_scores(joint_embedding)
```
-
- __Pros__:
-
- - Joint embedding captures fine-grained features.
- - Suitable for re-ranking – sorting retrieval result.
- __Cons__:
-
- - Resource-intensive.
- - Not suitable for retrieval in large collections.
-
-
 
- clip
- vision
datasets:
+ - Ziyang/yfcc15m
+ - conceptual_captions
---
<h1 align="center">UForm</h1>
<h3 align="center">
+ Pocket-Sized Multimodal AI<br/>
+ For Content Understanding and Generation<br/>
+ In Python, JavaScript, and Swift<br/>
</h3>

---

+ The `uform3-image-text-english-base` UForm model is a tiny vision and English-language encoder, mapping images and texts into a shared vector space.
+ This model produces up to __256-dimensional embeddings__ and is made of:

+ * Text encoder: 4-layer BERT for up to 64 input tokens.
+ * Visual encoder: ViT-B/16 for images of 224 x 224 resolution.

+ Unlike most CLIP-like multimodal models, this model shares 2 layers between the text and visual encoders to allow for more data- and parameter-efficient training.
+ Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the vast majority of AI-capable devices, with pre-quantized weights and inference code.
+ If you need a larger, more accurate, or multilingual model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
+ For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).
 
## Evaluation

+ For zero-shot ImageNet classification the model achieves a Top-1 accuracy of 36.1% and a Top-5 accuracy of 60.8%.
+ On text-to-image retrieval it reaches 94.9% Recall@10 on Flickr:

| Dataset | Recall@1 | Recall@5 | Recall@10 |
| :-------- | ------: | --------: | --------: |
| Zero-Shot Flickr | 0.727 | 0.915 | 0.949 |
+ | MS-COCO ¹ | 0.510 | 0.761 | 0.838 |
+
+ > ¹ It's important to note that the MS-COCO train split was present in the training data.
 
## Installation

```bash
+ pip install "uform[torch,onnx]"
```

## Usage
 
To load the model:

```python
+ from uform import get_model, Modality

+ import requests
+ from io import BytesIO
from PIL import Image

+ model_name = 'unum-cloud/uform3-image-text-english-base'
+ modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
+ processors, models = get_model(model_name, modalities=modalities)

+ model_text = models[Modality.TEXT_ENCODER]
+ model_image = models[Modality.IMAGE_ENCODER]
+ processor_text = processors[Modality.TEXT_ENCODER]
+ processor_image = processors[Modality.IMAGE_ENCODER]
```

+ To encode the content:

```python
+ text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
+ image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
+ image = Image.open(BytesIO(requests.get(image_url).content))
+
+ image_data = processor_image(image)
+ text_data = processor_text(text)
+ image_features, image_embedding = model_image.encode(image_data, return_features=True)
+ text_features, text_embedding = model_text.encode(text_data, return_features=True)
```
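
The two embeddings can then be compared directly. Below is a minimal sketch, assuming the PyTorch backend loaded above; it mirrors the cosine-similarity snippet from the previous revision of this README:

```python
import torch.nn.functional as F

# Assumes image_embedding and text_embedding are [batch, dim] PyTorch tensors,
# as produced by the torch backend loaded above.
similarity = F.cosine_similarity(image_embedding, text_embedding)
```

The resulting `similarity` lies in the `[-1, 1]` range, with `1` meaning a perfect match.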